Cisco ASA 5512 (OS 9.1(1)) HA failover issues

carlo.taddei1 · 01-25-2016

I have 2 Cisco ASA 5512 units (Security Plus license) in an Active/Standby HA failover pair. I've tested both stateful and stateless failover, and neither works.

Both units are connected (over a switching path) to a VRRP-enabled gateway, set up by the hosting provider at the site where the units are installed.

Only one unit can hold the active role in a stable way at any given time. As soon as I manually trigger failover from the currently active unit to the standby unit, the standby unit takes the ACTIVE role but cannot route traffic, and it falls back to the standby role automatically within minutes. This happens even though L3 connectivity should be available, according to the hosting center.

This happens with stateful Active/Standby failover as well as with stateless Active/Standby failover.

What is also interesting is that the 2 units are configured with 2 public IPs, namely X.X.X.1 and X.X.X.2.

At time t0 (before manual failover is triggered): both public IPs are reachable and the PRIMARY unit is ACTIVE.

At time t1: manual failover is triggered. X.X.X.1 (now assigned to the SECONDARY unit, which is ACTIVE) is no longer reachable; X.X.X.2 (now assigned to the PRIMARY unit) is 100% reachable.

At time t2: the cluster detects that it cannot route traffic via the SECONDARY unit and falls back to the former scenario (PRIMARY unit is ACTIVE and SECONDARY unit is STANDBY). X.X.X.2 is no longer reachable; X.X.X.1 is reachable again.

At time t2 + approx. 30 minutes: X.X.X.2 becomes reachable again.
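
To put actual numbers on the timeline above, periodic probe results for the two addresses can be reduced to the moments where each address changes state. The following is a minimal Python sketch under stated assumptions, not a tested tool: the function names, the sample format, and the timestamps in SAMPLES are hypothetical (the timestamps are made up to mirror t0, t1, t2 and t2 + 30 minutes), and the probe itself (for example a ping loop run from outside the hosting center) is left out.

```python
from typing import Dict, Iterable, List, Tuple

# (seconds since start, IP address, reachable?) -- format assumed for this sketch
Sample = Tuple[float, str, bool]


def reachability_transitions(samples: Iterable[Sample]) -> List[Sample]:
    """Keep only the samples where an address is first seen or changes state."""
    last_state: Dict[str, bool] = {}
    transitions: List[Sample] = []
    for ts, ip, reachable in samples:
        if last_state.get(ip) != reachable:
            transitions.append((ts, ip, reachable))
            last_state[ip] = reachable
    return transitions


def outage_durations(samples: Iterable[Sample]) -> List[Tuple[str, float]]:
    """Return (ip, seconds) for every completed down -> up interval."""
    down_since: Dict[str, float] = {}
    outages: List[Tuple[str, float]] = []
    for ts, ip, reachable in samples:
        if not reachable:
            # remember only the first "down" sample of an outage
            down_since.setdefault(ip, ts)
        elif ip in down_since:
            outages.append((ip, ts - down_since.pop(ip)))
    return outages


# Hypothetical probe log shaped like the t0 / t1 / t2 / t2 + 30 min timeline.
SAMPLES: List[Sample] = [
    (0, "X.X.X.1", True), (0, "X.X.X.2", True),       # t0: both reachable
    (60, "X.X.X.1", False), (60, "X.X.X.2", True),    # t1: manual failover
    (180, "X.X.X.1", True), (180, "X.X.X.2", False),  # t2: automatic fallback
    (1980, "X.X.X.2", True),                          # t2 + 30 minutes
]

if __name__ == "__main__":
    for ts, ip, up in reachability_transitions(SAMPLES):
        print(f"t={ts:>6}s  {ip}  {'reachable' if up else 'UNREACHABLE'}")
    for ip, seconds in outage_durations(SAMPLES):
        print(f"{ip} was unreachable for {seconds / 60:.0f} min")
```

Timestamped transitions of this kind are easier to compare against the "Last Failover at" time in the show failover output and against the hosting center's own logs than a description from memory.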

The outside interfaces of the 2 units are connected to the same VRRP-enabled segment. According to the hosting site, there are no particular L2 settings in place on the upstream L2 ports.

It looks like the cluster is able to route traffic in a stable way (that is, with the unit staying permanently in "Active" status) only when the active unit is connected to one particular port of the 2 VRRP-enabled ports. The behavior described above happens no matter which ASA is connected to THAT port, and always follows the same pattern.

The hosting center has checked the VRRP settings (as well as the L2 / switching path) and wasn't able to find any significant issue that might explain this behavior.

In the case of Stateful failover, I've also observed several "Route Session" errors in the LU updates exchanged between the 2 ASAs:

on the PRIMARY ASA:

ASA-fw# show failover
Failover On
Failover unit Primary
Failover LAN Interface: failover GigabitEthernet0/5 (up)
Unit Poll frequency 3 seconds, holdtime 10 seconds
Interface Poll frequency 3 seconds, holdtime 15 seconds
Interface Policy 1
Monitored Interfaces 3 of 114 maximum
failover replication http
Version: Ours 9.1(1), Mate 9.1(1)
Last Failover at: 09:14:07 UTC Jan 25 2016
This host: Primary - Active
Active time: 415235 (sec)
slot 0: ASA5512 hw/sw rev (1.0/9.1(1)) status (Up Sys)
Interface outside (X.X.X.195): Normal (Monitored)
Interface inside (192.168.1.1): Normal (Monitored)
Interface management (192.168.99.1): Normal (Monitored)
Other host: Secondary - Standby Ready
Active time: 3090024 (sec)
slot 0: ASA5512 hw/sw rev (1.0/9.1(1)) status (Up Sys)
Interface outside (X.X.X.200): Normal (Monitored)
Interface inside (192.168.1.3): Normal (Monitored)
Interface management (192.168.99.2): Normal (Monitored)

Stateful Failover Logical Update Statistics
Link : failover GigabitEthernet0/5 (up)
Stateful Obj xmit xerr rcv rerr
General 1518905 0 56441 9
sys cmd 54791 0 54790 0
up time 0 0 0 0
RPC services 0 0 0 0
TCP conn 1126177 0 156 0
UDP conn 140986 0 889 0
ARP tbl 190094 0 606 0
Xlate_Timeout 0 0 0 0
IPv6 ND tbl 0 0 0 0
VPN IKEv1 SA 4 0 0 0
VPN IKEv1 P2 8 0 0 0
VPN IKEv2 SA 0 0 0 0
VPN IKEv2 P2 0 0 0 0
VPN CTCP upd 0 0 0 0
VPN SDI upd 0 0 0 0
VPN DHCP upd 0 0 0 0
SIP Session 0 0 0 0
Route Session 6842 0 0 9
User-Identity 3 0 0 0
CTS SGTNAME 0 0 0 0
CTS PAC 0 0 0 0
TrustSec-SXP 0 0 0 0
IPv6 Route 0 0 0 0

Logical Update Queue Information
Cur Max Total
Recv Q: 0 12 135522
Xmit Q: 0 31 1717200

ASA-fw#

on the SECONDARY ASA:

ASA-fw# show failover
Failover On
Failover unit Secondary
Failover LAN Interface: failover GigabitEthernet0/5 (up)
Unit Poll frequency 3 seconds, holdtime 10 seconds
Interface Poll frequency 3 seconds, holdtime 15 seconds
Interface Policy 1
Monitored Interfaces 3 of 114 maximum
failover replication http
Version: Ours 9.1(1), Mate 9.1(1)
Last Failover at: 09:14:07 UTC Jan 25 2016
This host: Secondary - Standby Ready
Active time: 3090024 (sec)
slot 0: ASA5512 hw/sw rev (1.0/9.1(1)) status (Up Sys)
Interface outside (X.X.X.200): Normal (Monitored)
Interface inside (192.168.1.3): Normal (Monitored)
Interface management (192.168.99.2): Normal (Monitored)
Other host: Primary - Active
Active time: 415193 (sec)
slot 0: ASA5512 hw/sw rev (1.0/9.1(1)) status (Up Sys)
Interface outside (X.X.X.195): Normal (Monitored)
Interface inside (192.168.1.1): Normal (Monitored)
Interface management (192.168.99.1): Normal (Monitored)

Stateful Failover Logical Update Statistics
Link : failover GigabitEthernet0/5 (up)
Stateful Obj xmit xerr rcv rerr
General 408964 0 1527527 6852
sys cmd 69155 0 69153 0
up time 0 0 0 0
RPC services 0 0 0 0
TCP conn 258657 0 1127204 0
UDP conn 34851 0 140988 0
ARP tbl 44206 0 190171 0
Xlate_Timeout 0 0 0 0
IPv6 ND tbl 0 0 0 0
VPN IKEv1 SA 30 0 4 0
VPN IKEv1 P2 8 0 4 0
VPN IKEv2 SA 0 0 0 0
VPN IKEv2 P2 0 0 0 0
VPN CTCP upd 0 0 0 0
VPN SDI upd 0 0 0 0
VPN DHCP upd 0 0 0 0
SIP Session 0 0 0 0
Route Session 2010 0 0 6852
User-Identity 47 0 3 0
CTS SGTNAME 0 0 0 0
CTS PAC 0 0 0 0
TrustSec-SXP 0 0 0 0
IPv6 Route 0 0 0 0

Logical Update Queue Information
Cur Max Total
Recv Q: 0 20 1811827
Xmit Q: 0 30 466771

ASA-fw#
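
The counters above can also be checked mechanically. The sketch below is a minimal Python example, with names such as LuRow and parse_lu_stats chosen purely for illustration: it parses the "Stateful Failover Logical Update Statistics" table from pasted show failover text and lists the rows with a non-zero xerr or rerr. It assumes the column layout shown above (an object name, which may contain spaces, followed by four integers) and has not been checked against other ASA releases.

```python
from typing import List, NamedTuple


class LuRow(NamedTuple):
    name: str
    xmit: int
    xerr: int
    rcv: int
    rerr: int


def parse_lu_stats(text: str) -> List[LuRow]:
    """Extract 'name xmit xerr rcv rerr' rows; every other line is ignored."""
    rows: List[LuRow] = []
    for line in text.splitlines():
        # Split from the right: names like "Route Session" contain spaces.
        parts = line.rsplit(None, 4)
        if len(parts) == 5 and all(p.isdigit() for p in parts[1:]):
            rows.append(LuRow(parts[0].strip(), *map(int, parts[1:])))
    return rows


def rows_with_errors(rows: List[LuRow]) -> List[LuRow]:
    """Rows whose transmit or receive error counter is non-zero."""
    return [r for r in rows if r.xerr or r.rerr]


# Excerpt copied from the SECONDARY unit's output above.
SECONDARY_EXCERPT = """\
Stateful Obj xmit xerr rcv rerr
General 408964 0 1527527 6852
sys cmd 69155 0 69153 0
TCP conn 258657 0 1127204 0
UDP conn 34851 0 140988 0
ARP tbl 44206 0 190171 0
Route Session 2010 0 0 6852
User-Identity 47 0 3 0
Recv Q: 0 20 1811827
"""

if __name__ == "__main__":
    for row in rows_with_errors(parse_lu_stats(SECONDARY_EXCERPT)):
        print(f"{row.name}: xerr={row.xerr} rerr={row.rerr}")
```

Applied to the two outputs above, only the General and Route Session rows carry errors, and on each unit the General rerr count equals the Route Session rerr count (9 on the primary, 6852 on the secondary), so all receive errors appear to come from Route Session updates.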

I was wondering whether there are any debug commands that might help me troubleshoot this issue and possibly identify a misconfiguration on the hosting center side.

Also, I'd like to ask for details on the "Route Session" entry and on the impact of a high xerr/rerr count in this field.

Lastly, could this be a software bug? I read on Cisco's website about several bugs affecting ASA clustering in the 9.1(1) release:

http://www.cisco.com/c/en/us/td/docs/security/asa/asa91/release/notes/asarn91.html

yasirirfan · 01-25-2016

Hi Carlo

Make sure your firewall is not hitting bug CSCug88962; I had issues with version 9.1 myself. The most stable version is 9.2(4), so you may want to consider upgrading the firewall to that more stable release. For your platform, Cisco recommends the 9.2(4) SMP software version.

Cheers

Yasir