01-18-2023 10:03 PM
Hi,
Since 2019 we have been using ASA running on Firepower 2120. We started with ASA 9.10.x, went through 9.12.x and 9.14.x, and yesterday upgraded to 9.16.3.23.
We have four Firepower 2120 devices, running as two HA pairs in multi-context active/active mode with two or more contexts each.
Before I do a software update, I fail over the ASA contexts running on node 2 to node 1:
failover active group 2
Then I wait until the failover has completed before performing the update on the now-free node using the web chassis manager.
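To confirm that group 2 has actually moved before touching the free node, a quick check from the system context is something like this (just a sketch, the exact output differs per version):
show failover state
show failover | include Group
The first command shows which unit is currently active for each failover group; the second simply filters the full failover output down to the group state lines.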
In about 25% of the cases the firewall loses TCP sessions to servers hosted in attached VLANs ... but weirdly to servers hosted by the ASA context that was *not* failed over and remained running on node 1. Sometimes hosts are not reachable via SSH/ping while in failover; sometimes it just kills NFS sessions through the firewall and results in stale NFS mounts.
It seems to happen only when the firewall has been running for several months and we want to do a software update. When we try to reproduce the behaviour a few days later for debugging with Cisco TAC, the issue is not reproducible. It has sometimes happened on firewall HA pair 1, sometimes on HA pair 2. Both are set up identically and connected to the same uplink switches.
We have had this through all the years with ASA versions from 9.10.x to 9.14.x and opened a Cisco TAC case every time, but none of them ever found the root cause or provided a fix.
Are we the only ones using ASA on Firepower hardware and facing this issue? I have known Cisco firewalls since they were named PIX. They were stable until Cisco ported ASA to the Firepower hardware. Now that Cisco is dropping the ASA-X hardware and switching to Firepower-only, I have concerns ...
Regards,
Bernd
01-19-2023 03:40 AM
I cannot say that I have seen this issue. Something to check the next time it happens is the connection table (show conn) to make sure that the IPs or subnets in question are being sent out the correct interface. Though it is not the same as your issue, I have seen the ASA start routing traffic out the wrong interface after a reboot, and I needed to do a clear conn to solve it.
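For example, from the context that owns the affected subnet (context name and addresses here are placeholders):
changeto context ctx1
show conn address 10.0.46.10
clear conn address 10.0.46.10
show conn address lists the connections built for that host together with the ingress/egress interfaces, and clear conn address tears them down so they get rebuilt on the correct path.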
01-19-2023 05:08 AM
After the failover of group 2, the stateful xmit/rcv statistics looked okay. I didn't spend much time analysing the connection table because I was busy finishing the update and bringing everything else back to life.
# show failover
Failover On
Failover unit Primary
Failover LAN Interface: folink Port-channel11 (up)
Reconnect timeout 0:00:00
Unit Poll frequency 1 seconds, holdtime 15 seconds
Interface Poll frequency 5 seconds, holdtime 25 seconds
Interface Policy 1
Monitored Interfaces 20 of 1292 maximum
MAC Address Move Notification Interval not set
Version: Ours 9.14(2)15, Mate 9.14(2)15
Serial Number: Ours JAD2249001U, Mate JAD22250MHC
Group 1 last failover at: 20:38:08 CEST Jul 14 2021
Group 2 last failover at: 20:01:56 CET Jan 18 2023
This host: Primary
Group 1 State: Active
Active time: 47777354 (sec)
Group 2 State: Active
Active time: 23 (sec)
slot 0: FPR-2120 hw/sw rev (1.3/9.14(2)15) status (Up Sys)
ctx1 Interface xyz (10.0.x.1): Normal (Monitored)
ctx2 Interface abc (10.0.y.1): Normal (Monitored)
[...]
Other host: Secondary
Group 1 State: Standby Ready
Active time: 0 (sec)
Group 2 State: Standby Ready
Active time: 3438678 (sec)
slot 0: FPR-2120 hw/sw rev (1.2/9.14(2)15) status (Up Sys)
ctx1 Interface xyz (10.0.x.2): Normal (Monitored)
ctx2 Interface abc (10.0.y.2): Normal (Monitored)
[...]
Stateful Failover Logical Update Statistics
Link : folink Port-channel11 (up)
Stateful Obj xmit xerr rcv rerr
General 31970394284 0 591474553 1452
sys cmd 6370070 0 6369888 0
up time 0 0 0 0
RPC services 0 0 0 0
TCP conn 17961653620 0 274284881 0
UDP conn 13961233499 0 266997375 5
ARP tbl 40861466 0 43510811 0
Xlate_Timeout 0 0 0 0
IPv6 ND tbl 0 0 0 0
VPN IKEv1 SA 0 0 0 0
VPN IKEv1 P2 0 0 0 0
VPN IKEv2 SA 0 0 0 0
VPN IKEv2 P2 0 0 0 0
VPN CTCP upd 0 0 0 0
VPN SDI upd 0 0 0 0
VPN DHCP upd 0 0 0 0
SIP Session 90 0 20489 0
SIP Tx 53 0 10647 0
SIP Pinhole 0 0 4749 1447
Route Session 275483 0 275710 0
Router ID 0 0 0 0
User-Identity 3 0 3 0
CTS SGTNAME 0 0 0 0
CTS PAC 0 0 0 0
TrustSec-SXP 0 0 0 0
IPv6 Route 0 0 0 0
STS Table 0 0 0 0
Umbrella Device-ID 0 0 0 0
Logical Update Queue Information
Cur Max Total
Recv Q: 0 123 617319545
Xmit Q: 0 318 32048036240
01-19-2023 04:15 AM
What IP do you use as the gateway on the servers/hosts on the inside?
01-19-2023 05:00 AM
The primary IP of course, not the standby.
interface Port-channel3.194
description OpenShift 4 Cluster Zone
nameif openshift4
security-level 30
ip address 10.0.46.1 255.255.255.0 standby 10.0.46.2
context admin
allocate-interface Management1/1
storage-url private disk0:/private-storage disk0
storage-url shared disk0:/shared-storage shared
config-url disk0:/config/admin.cfg
join-failover-group 1
context ctx1
description Application zones (prod)
allocate-interface Port-channel1.1140
allocate-interface Port-channel3.149
allocate-interface Port-channel3.153-Port-channel3.154
allocate-interface Port-channel3.156-Port-channel3.157
allocate-interface Port-channel3.163
allocate-interface Port-channel3.174
allocate-interface Port-channel3.176-Port-channel3.179
allocate-interface Port-channel3.185-Port-channel3.194
storage-url private disk0:/private-storage disk0
storage-url shared disk0:/shared-storage shared
config-url disk0:/config/ctx1.cfg
join-failover-group 1
context ctx2
description Application zones (non-prod)
allocate-interface Port-channel1.1140
allocate-interface Port-channel3.115
allocate-interface Port-channel3.120
allocate-interface Port-channel3.158
allocate-interface Port-channel3.900
storage-url private disk0:/private-storage disk0
storage-url shared disk0:/shared-storage shared
config-url disk0:/config/ctx2.cfg
join-failover-group 2
01-19-2023 05:11 AM
There is a switch connecting the two FWs. On it, do:
show mac address-table <<- see whether the switch points to the right active FW or not
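For example, on a Cisco IOS uplink switch (the VLAN and MAC below are placeholders):
show mac address-table vlan 194
show mac address-table address aaaa.bbbb.cccc
Then compare the port on which the ASA interface MAC is learned with the port that leads to the unit that should be active for that context; the MAC on the ASA side can be read from show interface in the affected context.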
01-19-2023 09:19 PM
The failover link is a 2 x 1 Gb/s directly attached port-channel with 1 m Cat 7 cables. If it were a failover-link network issue, it would happen every time a failover is done, not just once after an uptime of several months.
interface Port-channel11
description LAN/STATE Failover Interface
failover
failover lan unit primary
failover lan interface folink Port-channel11
failover link folink Port-channel11
failover interface ip folink 172.16.0.1 255.255.255.248 standby 172.16.0.2
failover group 1
preempt
failover group 2
secondary
preempt
no failover wait-disable
01-20-2023 07:01 AM - edited 01-20-2023 07:07 AM
show failover
Please share the output of this command.
01-20-2023 11:45 AM - edited 01-20-2023 11:45 AM
failover replication http
I think I found the issue here: HTTP is based on TCP, and to make both FWs exchange the TCP session info for HTTP connections we need to add the above command on the ASA FW. I see the same for FTD:
https://www.cisco.com/c/en/us/support/docs/security/firepower-management-center/212699-configure-ftd-high-availability-on-firep.html
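If I read the ASA configuration guide correctly, for Active/Active failover this is enabled per failover group rather than with the global command, so in the system context it would look roughly like this (a sketch, please verify against the guide for your version):
failover group 1
 replication http
failover group 2
 replication http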
01-21-2023 11:16 PM
Scroll up for the output of "show failover".
According to the manual and the CLI help, the command "failover replication http" just replicates HTTP sessions to the standby unit. The problem we had last time was with NFS sessions.
But that still does not explain why TCP sessions from context 1 could get lost when context 2 is failed over. Context 1 did not move and should not be affected.
The same failover configuration with the same ASA software version and multi-context mode works flawlessly on our ASA 5516-X models in the branch offices, but not always on the Firepower 2120 models in our datacenter. Maybe that is because the branch-office firewalls get a power cycle every couple of months due to building power maintenance, while the datacenter firewalls run for years, accumulating session garbage in memory ... until I convince myself to go through the nightmare of an update that kills TCP sessions again and forces me to restart all the failing servers.
01-24-2023 02:36 AM
Is it just the TCP sessions that are getting lost, or also UDP and/or others (e.g. routing, VPN, etc.)?
I just had an issue where traffic stopped passing through an FTD. After a deep dive into the logs it turned out that the "portmanager" process had died.
Jump into expert mode, navigate to the following path, and analyse the messages and portmgr.out log files; use | grep to narrow the search if the output is large:
cd /ngfw/opt/cisco/platform/logs
less messages
less portmgr.out
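For example, to pull out only the port-manager related lines (the search patterns are just guesses at what to look for):
grep -i portmgr messages | less
grep -i "error\|fail" portmgr.out | less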
01-24-2023 09:37 AM
We're using ASA, not FTD.
01-26-2023 11:02 AM
For Stateful Failover, the following state information is not passed to the standby Firepower Threat Defense device:
Sessions inside plaintext tunnels such as GRE or IP-in-IP. Sessions inside tunnels are not replicated and the new active node will not be able to reuse existing inspection verdicts to match the correct policy rules.
Decrypted TLS/SSL connections—The decryption states are not synchronized, and if the active unit fails, then decrypted connections will be reset. New connections will need to be established to the new active unit. Connections that are not decrypted (they match a do-not-decrypt rule) are not affected and are replicated correctly.
Multicast routing.
Stateful failover is used for health checks and for exchanging connection state, but on FPR it has the limitations listed above: when a packet goes to Snort and is decrypted, the connection state is not exchanged between active and standby, and all such connections will be reset.
01-26-2023 09:07 PM
We have none of that. In most cases it killed SSH sessions to servers through the firewall; last time it was NFS. I hadn't opened an SSH session beforehand for testing. The failover and update two years before that did not kill any TCP sessions, and at some point in between node 2 did a reboot that also caused no issue. The failover config didn't change in between. Very weird.