01-21-2023 07:31 AM
Hello.
"ciscoasa(config)# failover interface ip folink 172.27.48.1 255.255.255.0 standby 172.27.48.2"
The above is the Cisco literature configuration of the failover link for one of two ASA 5525s in a high-availability setup.
What exactly is this link used for? What traverses the link?
Thank you.
Solved! Go to Solution.
01-21-2023 11:27 AM
@jmaxwellUSAF the correct procedure to perform a failover is to execute the "no failover active" command on the active unit, or to run the "failover active" command on the standby unit.
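For illustration, a minimal switchover sketch (the hostnames/prompts below are hypothetical, and this assumes the secondary unit currently holds the active role):
! On the current active unit:
asa/sec/act# no failover active
! ...or, equivalently, on the current standby unit:
asa/pri/stby# failover active
! Verify the roles afterwards:
asa/pri/act# show failover state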
What was the issue when there was no connectivity? It sounds like there is some other underlying issue.
01-21-2023 07:49 AM
@jmaxwellUSAF on the ASA you've got the failover link and a stateful failover link. These can be dedicated physical interfaces or combined/shared.
The failover link replicates the configuration and sends hellos to determine the health of the other unit. The stateful failover link synchronises TCP/UDP connections, ARP/NAT tables, IKE/IPsec connections, RAVPN sessions, etc.
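As a rough sketch, a LAN-based failover configuration with dedicated links might look like this on the primary unit (the failover-link addressing reuses your example; the stateful-link name, interface and subnet are purely illustrative):
ciscoasa(config)# failover lan unit primary
ciscoasa(config)# failover lan interface folink GigabitEthernet0/7
ciscoasa(config)# failover interface ip folink 172.27.48.1 255.255.255.0 standby 172.27.48.2
ciscoasa(config)# failover link statelink GigabitEthernet0/6
ciscoasa(config)# failover interface ip statelink 172.27.49.1 255.255.255.0 standby 172.27.49.2
ciscoasa(config)# failover
To combine the two functions on one interface, you would instead point "failover link" at the same interface name used for "failover lan interface".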
01-21-2023 07:59 AM
Hi Rob.
Could you please explain what logical events occur during a failover, and how the aforementioned link is involved?
I am responsible for handling a current incident in which our medium-sized enterprise's most critical HA node has experienced a failover. I have already opened a TAC case, but TAC is slow to update.
Could you please suggest a troubleshooting strategy and some commands that will aid in my root cause analysis?
Thank you.
01-21-2023 08:08 AM
@jmaxwellUSAF the two peers in the failover pair continually send heartbeats over the failover link, the configuration is replicated from the active unit to the standby (whenever there are changes), and the connections/tables etc. are continually synced. In the event of a failure of the primary, the secondary detects that the peer has gone down, takes over the active role, assumes the IP and MAC addresses of the failed unit, and passes traffic.
To troubleshoot failover, use the command "show failover" to determine whether the devices are in sync and "show failover history" to review any historical events. Post the output of these commands if you require further assistance.
Each interface (outside, inside, failover) on the ASAs needs L2 connectivity, so check the switch configuration to confirm they are in the same VLAN and can communicate with each other.
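For example (the switch platform and port names below are assumptions, not taken from your environment):
FW# show failover
FW# show failover history
! On the connected switch, assuming a Catalyst running IOS:
Switch# show vlan brief
Switch# show interfaces GigabitEthernet1/0/7 switchport
Switch# show interfaces status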
01-21-2023 08:24 AM
1. The data below is from the Cisco CLI Analyzer system diagnostics.
The first finding, "Interface check", is obviously relevant.
The second finding, "Significant percentage of packets have been dropped", may or may not be relevant.
2. Below the CLI Analyzer data is the output of "show failover history".
Could you please assist? Thank you.
---
Failover event detected in the last week
A failover event was detected in the last week. The reason reported for the failover was:
Interface check
Reference the ASA Config Guide for more details about failover health monitoring. (!! This link is broken. !!)
Last Failover at: 15:48:55 EST Jan 19 2023
---
Significant percentage of packets have been dropped while punting from DP to CP
While trying to handle packets passing between the Data Path processes to the Control Point process, allocation or enqueue failures caused some packets to be dropped. The allocation failures happen when the device does not have enough memory to allocate the incoming packet. Enqueue failures happen when the queue limit has been reached and cannot grow further. Please reach out to Cisco TAC for more assistance in diagnosing this issue.
The following punt events have experienced drops:
syslog - 3729084284 drops (31.41697527696345%)
Recommendation: Consider tuning the amount of syslogs being generated in the 'logging' configuration. Use 'show logging queue' and 'show logging' to view current statistics for logs being generated.
==================================================
FW/pri/stby# show failover history
...(!! irrelevant output omitted !!)...
13:29:14 EST Jan 12 2023
Active Failed Interface check
This host:1
single_vf: LUMEN2of2
Other host:0
13:29:24 EST Jan 12 2023
Failed Standby Ready Interface check
This host:0
Other host:0
13:29:33 EST Jan 12 2023
Standby Ready Just Active Other unit wants me Active
13:29:33 EST Jan 12 2023
Just Active Active Drain Other unit wants me Active
13:29:33 EST Jan 12 2023
Active Drain Active Applying Config Other unit wants me Active
13:29:33 EST Jan 12 2023
Active Applying Config Active Config Applied Other unit wants me Active
13:29:33 EST Jan 12 2023
Active Config Applied Active Other unit wants me Active
15:48:55 EST Jan 19 2023
Active Failed Interface check
This host:1
single_vf: Lumin
Other host:0
15:49:15 EST Jan 19 2023
Failed Standby Ready Interface check
This host:0
Other host:0
16:44:17 EST Jan 19 2023
Standby Ready Failed Interface check
This host:1
single_vf: dmz
Other host:0
16:44:20 EST Jan 19 2023
Failed Standby Ready Interface check
This host:0
Other host:0
01-21-2023 08:42 AM
@jmaxwellUSAF so do you have an L1/L2 issue with one or more interfaces? Provide the output of "show failover".
There also seems to be an excessive number of syslogs being generated; provide the information as per the recommendations in your last post.
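As a sketch of what to collect and, if needed, tune (the queue size and message ID below are examples only, not recommendations for your environment):
FW# show logging queue
FW# show logging
! If the logging queue is overflowing, the queue depth and logging levels can be adjusted, e.g.:
FW(config)# logging queue 8192
FW(config)# logging trap notifications
FW(config)# no logging message 302013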
01-21-2023 08:49 AM
FW/pri/stby# sh failover
Failover On
Failover unit Primary
Failover LAN Interface: (!! omitted !!)Failover_Link GigabitEthernet0/7 (up)
Reconnect timeout 0:00:00
Unit Poll frequency 1 seconds, holdtime 15 seconds
Interface Poll frequency 5 seconds, holdtime 25 seconds
Interface Policy 1
Monitored Interfaces 4 of 466 maximum
MAC Address Move Notification Interval not set
failover replication http
Version: Ours 9.14(3), Mate 9.14(3)
Serial Number: Ours (!! omitted !!), Mate (!! omitted !!)
Last Failover at: 15:48:55 EST Jan 19 2023
This host: Primary - Standby Ready
Active time: 613115 (sec)
slot 0: ASA5525 hw/sw rev (1.0/9.14(3)) status (Up Sys)
Interface Outside (!! omitted !!): Normal (Monitored)
Interface Inside (!! omitted !!): Normal (Monitored)
Interface dmz (!! omitted !!): Normal (Monitored)
Interface management (!! omitted !!): Link Down (Not-Monitored)
Interface Lumin5 (!! omitted !!): Normal (Monitored)
Other host: Secondary - Active
Active time: 158129 (sec)
slot 0: ASA5525 hw/sw rev (1.0/9.14(3)) status (Up Sys)
Interface Outside (!! omitted !!): Normal (Monitored)
Interface Inside (!! omitted !!): Normal (Monitored)
Interface dmz (!! omitted !!): Normal (Monitored)
Interface management (!! omitted !!): Normal (Not-Monitored)
Interface Lumin5 (!! omitted !!): Normal (Monitored)
Stateful Failover Logical Update Statistics
Link : (!! omitted !!)Stateful_Link GigabitEthernet0/6 (up)
Stateful Obj xmit xerr rcv rerr
General 2715560429 15526 23292267 552
sys cmd 3113962 103 3113961 45
up time 0 0 0 0
RPC services 0 0 0 0
TCP conn 1637888998 0 12539570 471
UDP conn 589643709 0 4166096 36
ARP tbl 418666897 15423 3370500 0
Xlate_Timeout 0 0 0 0
IPv6 ND tbl 0 0 0 0
VPN IKEv1 SA 6238 0 19 0
VPN IKEv1 P2 39641 0 181 0
VPN IKEv2 SA 65772968 0 97500 0
VPN IKEv2 P2 117271 0 815 0
VPN CTCP upd 0 0 0 0
VPN SDI upd 0 0 0 0
VPN DHCP upd 0 0 0 0
SIP Session 9158 0 0 0
SIP Tx 6217 0 0 0
SIP Pinhole 547 0 0 0
Route Session 176977 0 1650 0
Router ID 0 0 0 0
User-Identity 117846 0 1975 0
CTS SGTNAME 0 0 0 0
CTS PAC 0 0 0 0
TrustSec-SXP 0 0 0 0
IPv6 Route 0 0 0 0
STS Table 0 0 0 0
Umbrella Device-ID 0 0 0 0
Logical Update Queue Information
Cur Max Total
Recv Q: 0 33 49769711
Xmit Q: 0 333 3134372037
01-21-2023 08:55 AM
It is unknown but possible that a junior tech physically unplugged an Ethernet cable from one of these interfaces at the start of this event.
1. If someone physically unplugged an Ethernet cable here, would it make sense that this human error caused this event?
2. Does the data suggest which interface was mistakenly unplugged?
Thank you.
01-21-2023 09:17 AM
@jmaxwellUSAF yes, unplugging a monitored interface on the active unit could cause a failover - which in your case could be Outside, Inside, dmz or Lumin5.
FYI, I note that the management interface on Pri/Stby is "Link Down" but is "Normal" on the Secondary, though this interface is not monitored, so this would not cause a failover.
From the output provided I don't see any mention of an interface being unplugged.
Can you provide the output of "show failover state" - that would show the last failure reason.
01-21-2023 09:50 AM
FW/pri/stby# sh failover state
State Last Failure Reason Date/Time
This host - Primary
Standby Ready Ifc Failure 16:44:17 EST Jan 19 2023
Other host - Secondary
Active Ifc Failure 15:48:20 EST Jan 19 2023
Lumin5: No Link
====Configuration State===
Sync Done
Sync Done - STANDBY
====Communication State===
Mac set
!-------------------------------------!
The above data makes me conclude that the cause of this event was a human disconnecting an Ethernet cable from interface Lumin5. Do you agree?
OK then, there still exists a related symptom: AnyConnect clients are no longer experiencing the usual 1-2 second join time to the AnyConnect VPN, but are instead experiencing a join time of about 1 minute.
Do you suspect that, if I make the original primary ASA the active unit again, this symptom will resolve?
If so, what is the correct procedure to do this in this live, production-critical ASA 5525 HA setup?
01-21-2023 09:58 AM - edited 01-21-2023 09:59 AM
@jmaxwellUSAF yes, it looks like that interface may have been unintentionally disconnected briefly.
On the current standby unit, run the command "failover active" to make that the new active device. You can then see if that resolves your AnyConnect connection issue.
If there is another issue when the AnyConnect users connect, I'd check the L1/L2 connections of the current active unit (after you've failed over) and check the connected switch for interface errors, etc.
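For example, on the connected switch (assuming a Catalyst running IOS; the port name is illustrative):
Switch# show interfaces GigabitEthernet1/0/10
Switch# show interfaces counters errors
Switch# show logging | include LINK|DUPLEX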
01-21-2023 10:18 AM - edited 01-21-2023 10:20 AM
Now I am unable to connect at all to the AnyConnect VPN. This is a new symptom.
During the previous symptom of slow, 1-minute joining, the client would, as normal, first prompt me for my credentials. Now it gives me no prompts. The client information button displays "1:06:14 PM Unable to contact VPN1.MYCOMPANY.com".
When I execute "show vpn-sessiondb anyconnect" on the newly active primary ASA, I see many active users.
What is the cause of the symptom "1:06:14 PM Unable to contact VPN1.MYCOMPANY.com"? (I expect it is related to the execution of "failover active", so new connections cannot join until the current AnyConnect users are kicked off and the circuit is reset through the new active ASA.)
How do I remediate this symptom?
01-21-2023 10:23 AM
@jmaxwellUSAF you perform the reverse: run "failover active" on the other unit to fail back.
Did you perform the L1/L2 checks?
Before you fail back, run some troubleshooting tests - can the new active unit route to the internet? Run "ping 8.8.8.8"
Can you ping the FQDN from your laptop that you are using to connect to the VPN?
Turn on webvpn debugs to see if there are any connection attempts logged.
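A sketch of those checks (the interface name and debug level are assumptions):
FW# ping 8.8.8.8
FW# ping Outside 8.8.8.8
FW# debug webvpn 255
FW# debug webvpn anyconnect 255
! Remember to disable debugging afterwards:
FW# undebug all
From your laptop, "ping VPN1.MYCOMPANY.com" and "nslookup VPN1.MYCOMPANY.com" will confirm name resolution and reachability of the headend.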
01-21-2023 10:32 AM
Previously I executed "failover active" on the ASA that was originally configured as standby but was in the active state.
Currently I cannot ping 8.8.8.8 from either ASA. I cannot ping the AnyConnect FQDN from my laptop.
Could you please assist?
01-21-2023 10:40 AM
@jmaxwellUSAF this is basic connectivity troubleshooting -
Is the outside interface up?
Run a traceroute from the ASA to 8.8.8.8; does it go via the correct path?
Run "show route"; do you still have a default route?
Can you ping the next hop IP address?
Check the L1/L2 connectivity for the outside interface of both ASAs (check the connected switch); is the ASA interface in the correct VLAN?
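A command sequence for those checks (the next-hop address shown is a documentation example; replace it with your own):
FW# show interface ip brief
FW# show route
FW# traceroute 8.8.8.8
FW# ping 203.0.113.1
! On the connected switch, confirm the ASA outside port is up and in the correct VLAN:
Switch# show interfaces status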