Cisco ASA on Firepower 2120 loses TCP sessions during failover

Network Diver
Level 1

Hi,

Since 2019 we have been running ASA on Firepower 2120. We started with ASA 9.10.x, moved through 9.12.x and 9.14.x, and yesterday upgraded to 9.16.3.23.

We have four Firepower 2120 devices, forming two HA pairs that run in multi-context active/active failover mode with two or more contexts each.

Before I do a software update, I fail over the ASA contexts running on node 2 to node 1:

 

failover active group 2

 

Then I wait until the failover has completed before performing the update on the freed-up node using the web chassis manager.
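
For completeness, the rough sequence and the sanity checks I do look something like this (just a sketch; output trimmed, context names are from our setup):

(in the system execution space)
failover active group 2
show failover state                    <<- both groups should now show Active on this unit
show failover | include Group|State
changeto context ctx2
show conn count                        <<- rough check that connections were carried over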

In about 25% of the cases the firewall loses TCP sessions to servers hosted in attached VLANs ... but weirdly to servers behind the ASA context that was *not* failed over and remained running on node 1. Sometimes the hosts are not reachable via SSH/ping while in failover, sometimes it just kills NFS sessions through the firewall, resulting in stale NFS mounts.

It seems to happen only when the firewall has been running for several months and we want to do a software update. When we try to reproduce the behaviour a few days later for debugging with Cisco TAC, the issue is not reproducible. It has sometimes happened on firewall HA pair 1, sometimes on HA pair 2. Both are set up identically and connected to the same uplink switches.

We have had this over the years with ASA versions from 9.10.x to 9.14.x and opened a Cisco TAC case every time, but none of them ever found the root cause or provided a fix.

Are we the only ones using ASA on Firepower hardware and facing this issue? I have known Cisco firewalls since they were called PIX, and they were stable until Cisco ported ASA to the Firepower hardware. Now that Cisco is dropping the ASA-X hardware and switching to Firepower only, I have concerns ...

Regards,
Bernd

 

13 Replies

I cannot say that I have seen this issue. Something to check next time it happens is the connection table (show conn) to make sure that the IPs or subnets in question are being sent out the correct interface. Though this is not the same as your issue, I have seen the ASA start routing traffic out the wrong interface after a reboot, and I needed to do a clear conn to resolve it.
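
For example, something along these lines (the host and subnet are just placeholders):

show conn address 10.0.46.10              <<- existing connections for the affected host
show route | include 10.0.46              <<- confirm which interface the route points at
clear conn address 10.0.46.10             <<- last resort: let the connections rebuild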

--
Please remember to select a correct answer and rate helpful posts

After the failover of group 2, the stateful xmit/rcv statistics looked okay. I didn't spend much time on connection table analysis because I was busy finishing the update and bringing everything else back to life.

# show failover
Failover On
Failover unit Primary
Failover LAN Interface: folink Port-channel11 (up)
Reconnect timeout 0:00:00
Unit Poll frequency 1 seconds, holdtime 15 seconds
Interface Poll frequency 5 seconds, holdtime 25 seconds
Interface Policy 1
Monitored Interfaces 20 of 1292 maximum
MAC Address Move Notification Interval not set
Version: Ours 9.14(2)15, Mate 9.14(2)15
Serial Number: Ours JAD2249001U, Mate JAD22250MHC
Group 1 last failover at: 20:38:08 CEST Jul 14 2021
Group 2 last failover at: 20:01:56 CET Jan 18 2023

  This host:    Primary
  Group 1       State:          Active
                Active time:    47777354 (sec)
  Group 2       State:          Active
                Active time:    23 (sec)

		slot 0: FPR-2120 hw/sw rev (1.3/9.14(2)15) status (Up Sys)
		  ctx1 Interface xyz (10.0.x.1): Normal (Monitored)
		  ctx2 Interface abc (10.0.y.1): Normal (Monitored)
		  [...]

  Other host:   Secondary
  Group 1       State:          Standby Ready
                Active time:    0 (sec)
  Group 2       State:          Standby Ready
                Active time:    3438678 (sec)

		slot 0: FPR-2120 hw/sw rev (1.2/9.14(2)15) status (Up Sys)
		  ctx1 Interface xyz (10.0.x.2): Normal (Monitored)
		  ctx2 Interface abc (10.0.y.2): Normal (Monitored)
		  [...]

Stateful Failover Logical Update Statistics
	Link : folink Port-channel11 (up)
	Stateful Obj 	xmit       xerr       rcv        rerr
	General		31970394284 0          591474553  1452
	sys cmd  	6370070    0          6369888    0
	up time  	0          0          0          0
	RPC services  	0          0          0          0
	TCP conn 	17961653620 0          274284881  0
	UDP conn 	13961233499 0          266997375  5
	ARP tbl  	40861466   0          43510811   0
	Xlate_Timeout  	0          0          0          0
	IPv6 ND tbl  	0          0          0          0
	VPN IKEv1 SA 	0          0          0          0
	VPN IKEv1 P2 	0          0          0          0
	VPN IKEv2 SA 	0          0          0          0
	VPN IKEv2 P2 	0          0          0          0
	VPN CTCP upd 	0          0          0          0
	VPN SDI upd 	0          0          0          0
	VPN DHCP upd 	0          0          0          0
	SIP Session 	90         0          20489      0
	SIP Tx 	53         0          10647      0
	SIP Pinhole 	0          0          4749       1447
	Route Session 	275483     0          275710     0
	Router ID 	0          0          0          0
	User-Identity 	3          0          3          0
	CTS SGTNAME 	0          0          0          0
	CTS PAC 	0          0          0          0
	TrustSec-SXP 	0          0          0          0
	IPv6 Route 	0          0          0          0
	STS Table 	0          0          0          0
	Umbrella Device-ID 	0          0          0          0

	Logical Update Queue Information
	 	 	Cur 	Max 	Total
	Recv Q: 	0 	123 	617319545
	Xmit Q: 	0 	318 	32048036240
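
Next time it happens I plan to grab the per-context connection table as well, before anything gets cleared, roughly like this (the host address is a placeholder):

changeto context ctx1
show conn count
show conn address 10.0.x.20 detail      <<- detail shows the connection flags and idle/uptime
show local-host 10.0.x.20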

 

What IP do you use as the gateway on the servers/hosts on the inside?

The primary IP of course, not the standby address:

interface Port-channel3.194
 description OpenShift 4 Cluster Zone
 nameif openshift4
 security-level 30
 ip address 10.0.46.1 255.255.255.0 standby 10.0.46.2
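
When the issue hits, a quick check from an affected server could be whether the gateway IP still resolves and answers at all (Linux assumed; gateway address from the example above):

ip neigh show 10.0.46.1        # current ARP entry for the gateway
ping -c 3 10.0.46.1            # is the gateway reachable from the server at all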

context admin
  allocate-interface Management1/1
  storage-url private disk0:/private-storage disk0
  storage-url shared disk0:/shared-storage shared
  config-url disk0:/config/admin.cfg
  join-failover-group 1

context ctx1
  description Application zones (prod)
  allocate-interface Port-channel1.1140
  allocate-interface Port-channel3.149
  allocate-interface Port-channel3.153-Port-channel3.154
  allocate-interface Port-channel3.156-Port-channel3.157
  allocate-interface Port-channel3.163
  allocate-interface Port-channel3.174
  allocate-interface Port-channel3.176-Port-channel3.179
  allocate-interface Port-channel3.185-Port-channel3.194
  storage-url private disk0:/private-storage disk0
  storage-url shared disk0:/shared-storage shared
  config-url disk0:/config/ctx1.cfg
  join-failover-group 1

context ctx2
  description Application zones (non-prod)
  allocate-interface Port-channel1.1140
  allocate-interface Port-channel3.115
  allocate-interface Port-channel3.120
  allocate-interface Port-channel3.158
  allocate-interface Port-channel3.900
  storage-url private disk0:/private-storage disk0
  storage-url shared disk0:/shared-storage shared
  config-url disk0:/config/ctx2.cfg
  join-failover-group 2

There is a switch connecting the two firewalls. On it, run
show mac address-table   <<- to see whether the switch points at the correct active FW or not.
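
Something like this, I assume (IOS-style switch commands; the MAC shown is a placeholder, take the real one from the ASA output):

show interface Port-channel3.194 | include MAC     <<- on the active ASA context, get the interface MAC
show mac address-table address 00a2.1234.5678      <<- on the uplink switch, see which port learned that MAC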

The failover link is a 2 x 1 Gb/s directly attached port-channel with 1 m Cat7 cables. If it were a failover link network issue, it would happen every time a failover is performed, not just once after an uptime of several months.

interface Port-channel11
 description LAN/STATE Failover Interface

failover
failover lan unit primary
failover lan interface folink Port-channel11
failover link folink Port-channel11
failover interface ip folink 172.16.0.1 255.255.255.248 standby 172.16.0.2
failover group 1
  preempt
failover group 2
  secondary
  preempt
no failover wait-disable
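
Next time I will also watch the failover link itself during the switchover, to rule out drops there, roughly:

show failover interface                          <<- failover interface addresses and status
show failover statistics                         <<- tx/rx counters on the failover link
show interface Port-channel11 | include errors   <<- error counters on the port-channel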

(screenshot attachment: Screen Shot 2023-01-20 at 06.18.45.png)

Please share the output of "show failover".

failover replication http

I think I found the issue here: HTTP is based on TCP, and to make both firewalls exchange the TCP session info for HTTP traffic we need to add the above command on the ASA. I see the same for FTD:

 https://www.cisco.com/c/en/us/support/docs/security/firepower-management-center/212699-configure-ftd-high-availability-on-firep.html
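
In multiple context mode I believe this goes under each failover group rather than as a global command, something like:

failover group 1
  replication http
failover group 2
  replication http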

 

Scroll up for the output of "show failover".

According to the manual and the CLI help, the command "failover replication http" just replicates HTTP sessions to the standby unit. The problem we had last time was with NFS sessions.

But that still does not explain why TCP sessions from context 1 could get lost when context 2 is failed over. Context 1 did not move and should not be affected.

The same failover configuration, with the same ASA software version and multi-context setup, works flawlessly on our other ASA 5516-X models in the branch offices, but not always on the Firepower 2120 models in our datacenter. Maybe that's because the branch office firewalls get power-cycled every couple of months due to building power maintenance, while the firewalls in our datacenter run for years, possibly accumulating session garbage in memory ... until I convince myself to go through the nightmare of an update again, with TCP sessions being killed and all affected servers needing a restart.
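
If accumulated state really is the difference, I will grab a baseline before the next long-uptime failover, roughly along these lines:

(system execution space)
show resource usage summary      <<- per-resource usage, incl. connections and xlates
show memory                      <<- used vs. free memory on the unit
changeto context ctx1
show conn count
show xlate count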

Is it just TCP sessions that are getting lost, or also UDP and/or other state (e.g. routing, VPN, etc.)?

I just had an issue where traffic stopped passing through an FTD. After a deep dive into the logs it turned out that the "portmanager" process had died.

Jump into expert mode, navigate to the following path, and analyse the messages and portmgr.out log files; use | grep to narrow the search if the output is large:
cd /ngfw/opt/cisco/platform/logs
less messages
less portmgr.out
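
For example, to narrow things down (the patterns are just guesses):

grep -i error messages | tail -n 50
grep -i -E 'exit|restart|died' portmgr.out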

--
Please remember to select a correct answer and rate helpful posts

We're using ASA, not FTD.

Unsupported Features

For Stateful Failover, the following state information is not passed to the standby Firepower Threat Defense device:

  • Sessions inside plaintext tunnels such as GRE or IP-in-IP. Sessions inside tunnels are not replicated and the new active node will not be able to reuse existing inspection verdicts to match the correct policy rules.

  • Decrypted TLS/SSL connections—The decryption states are not synchronized, and if the active unit fails, then decrypted connections will be reset. New connections will need to be established to the new active unit. Connections that are not decrypted (they match a do-not-decrypt rule) are not affected and are replicated correctly.

  • Multicast routing.




Stateful failover is used for health checks and for exchanging connection state, but on Firepower it has limitations (as seen above):
when a packet goes to Snort and is decrypted, the connection state is not exchanged between active and standby,
and all such connections will be reset.

We have none of that. In most cases it killed SSH sessions to servers through the firewall; last time it was NFS. I hadn't opened an SSH session beforehand for testing. The failover and update two years earlier did not kill any TCP sessions, and at some point in between node 2 even did a reboot that caused no issue. The failover config didn't change in between. Very weird.
