
Firepower Cluster Problem

MARTIN HUERTER
Level 1

We have two Firepower 4145 firewall appliances clustered together on our network, with one node in each of two separate data centers (Data Center A and Data Center B). Below the cluster (inside interfaces) we have a pair of Catalyst 6880 switches virtualized with VSS, and above the cluster (outside interfaces) we have two Catalyst 9500 switches in a StackWise Virtual pair. The cluster control links also connect to the pair of Catalyst 9500 switches. On a normal day of operations this design works great: we can upgrade the software on the cluster, forcing failover from the control node to the standby node and back again, without a problem. I have attached a diagram of the devices in question.

Within the past six months we have encountered complete power failures in both of our data centers. The first power outage was in Data Center B. When this happened, we anticipated that the control node would continue to function and pass traffic. However, the control node (node 1) went into a failed state and stopped forwarding traffic. The switches above and below the cluster were functioning as expected, but we could not get cluster node 1 to work. Once power was restored to Data Center B, node 1 remained in the same non-forwarding state, and node 2 was not forwarding traffic either. We power cycled node 1, and when it finished booting both nodes joined the cluster and started functioning properly.

On the second power outage, just a few weeks ago, power was lost in Data Center A and the same situation occurred. Node 2 stopped forwarding traffic, but the switches above and below node 2 were still functioning. Once power was restored to Data Center A, we had to go back to Data Center B and power cycle node 2 in order to get the firewalls passing traffic once again.

Unfortunately, we did not have time to troubleshoot the problem or capture any logs while the event was happening. When the node in the affected data center lost power, its logs were lost, and when we power cycled the non-functioning node, those logs were lost as well. FMC only showed that the cluster was down with no members in it. After the power cycle, once the failed node finished rebooting, FMC simply saw both nodes come up and join the cluster. So little to no forensic information is available.

Since we can force failovers gracefully during software upgrades without problems, I don't think it's the failover routine itself. My suspicion is that when one of the Catalyst 9500 switches loses power, something goes wrong on the cluster control links, and the node that is supposed to carry on as the sole member of the cluster never takes over. That is the only difference I can see between a graceful failover during a software upgrade and a complete power failure of the devices above and below the firewall cluster. When power is restored to both nodes, a power cycle has to be performed on the node in the bad (non-forwarding) state in order for it to come up and start functioning again.

I hope I was able to articulate this problem into an understandable narrative. If anyone has any ideas, I welcome them. If someone is knowledgeable about the exchange between the cluster nodes over the cluster control links, I would appreciate your input, or your speculation on why our firewall cluster does not fail over properly during a power outage.

6 Replies

tvotna
Spotlight

@MARTIN HUERTER, hard to say without logs; probably something is wrong with the switches. BTW, ASA cluster logs survive reboots. Example:

ASA# dir disk0:/log

Directory of disk0:/log/

1074429412 -rwx 252724 12:37:08 Aug 17 2023 asa-appagent.log
1074429413 -rw- 276025 02:12:00 Aug 12 2023 asa-miovif.log
1074429420 -rw- 1154997 13:28:38 Aug 17 2023 asa-ssp_ntp.log
1074429425 -rw- 1048476 18:51:10 Aug 10 2023 cluster_trace.log.4
1074429426 -rw- 1048485 01:02:53 Aug 11 2023 cluster_trace.log.3
1074429427 -rw- 1048512 02:05:11 Aug 13 2023 cluster_trace.log.2
1074429423 -rw- 1048461 20:23:35 Aug 15 2023 cluster_trace.log.1
1074429424 -rw- 642214 13:30:50 Aug 17 2023 cluster_trace.log

8 file(s) total size: 6519894 bytes
21475885056 bytes total (21241675776 bytes free/98% free)

ASA# show cluster info trace level critical
Aug 17 12:37:07.949 [CRIT]Received heartbeat event 'slave heartbeat failure' for member unit-2-1 (ID: 2). Member stats:
HB count: 1028705
HB drops: 0
Average gap (ms): 665
Maximum slip (ms): 1
Last activity since (ms): 1998
Event delay (ms): 0
Poll count: 3
Aug 17 12:37:07.949 [CRIT]Received datapath event 'slave heartbeat failure' with parameter 2.
ASA#
ASA# show cluster info trace level info time Aug 17 12:40:00
Aug 17 12:37:07.949 [INFO]ASLR enabled, text region 55abc3163000-55abc8519735
Aug 17 12:37:07.949 [INFO]Notify chassis de-bundle port for blade unit-2-1, stack 0x000055abc4a4965b 0x000055abc4a14a28 0x000055abc4a0fced
Aug 17 12:37:07.949 [INFO]State machine notify event CLUSTER_EVENT_MEMBER_STATE (id 2,DISABLED,0x0000000000000000)
Aug 17 12:37:07.949 [INFO]ASLR enabled, text region 55abc3163000-55abc8519735
Aug 17 12:37:07.949 [INFO]Notify chassis de-bundle port for blade unit-2-1, stack 0x000055abc4a03856 0x000055abc4a0cb48 0x000055abc4cda62d
Aug 17 12:37:07.949 [CRIT]Received heartbeat event 'slave heartbeat failure' for member unit-2-1 (ID: 2). Member stats:
HB count: 1028705
HB drops: 0
Average gap (ms): 665
Maximum slip (ms): 1
Last activity since (ms): 1998
Event delay (ms): 0
Poll count: 3
Aug 17 12:37:07.949 [CRIT]Received datapath event 'slave heartbeat failure' with parameter 2.

 

MARTIN HUERTER
Level 1

tvotna,

Thank you for your response. Our firewall cluster is running in FTD mode, not ASA mode. The switches above and below the firewall cluster continue to function; it is just the surviving cluster node that fails to pass traffic after the failover.

On FTD, clustering is implemented by the underlying ASA code, so there is no difference. Check disk0:/log from "system support diagnostic-cli". The "show cluster info trace" command may not show messages from before the reboot, but the files on disk may still contain this info, unless they were overwritten by new messages.
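For reference, the sequence from the FTD CLI would look roughly like this (a sketch; the "firepower" prompts are placeholders and the exact log file names may differ on your build):

> system support diagnostic-cli
firepower> enable
firepower# dir disk0:/log
firepower# more disk0:/log/cluster_trace.log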

By "something wrong with switches" I mean either an interop issue or a configuration issue. E.g. if it was ASA5585, you'd need to configure clustering specifically to support VSS (channel-group 1 mode active [vss-id {1 | 2 }], port-channel span-cluster [vss-load-balance], health-check holdtime <timeout> [vss-enabled]). I don't remember anything like this in FXOS, so presumably Firepower chassis figures out everything automatically. We're on Nexus.

 

The Firepower NGFW platforms run FXOS as the base operating system, which manages the chassis; you then stand up the firewalls as FTD or ASA for the FW/IPS/IDS/VPN functions. So FTD and ASA are never associated with each other; it is an either/or operation.

We originally had an ASA 5585-X firewall cluster in this same location of our network, and we configured the inside interfaces with the VSS characteristics. Since we have replaced them with FTD firewalls, the VSS consideration no longer needs to be addressed in the interface configurations. As far as the FTD firewall cluster knows, it looks like one logical switch.

 

Believe it or not, the FTD data plane is still 90-99% the same as on the ASA, because it runs ASA code. And the clustering data plane and control plane are purely an ASA feature on FTD.

Have you had a chance to check the disk0:/log directory? Do you at least have output that illustrates unit 1 in a "failed state", as you mentioned? Did you confirm from the CLI that unit 1 (control) was kicked out of the cluster and the cluster was hence left with no active units?
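If it happens again, something like the following from the diagnostic CLI on each node should answer those questions (standard ASA clustering show commands; the "firepower" prompt is a placeholder):

firepower# show cluster info
firepower# show cluster history
firepower# show cluster info trace level critical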

Clustering logs can shed some light on why this happened. BTW, I see that the "vss-enabled" option (health-check holdtime <timeout> [vss-enabled]) is still there on a Firepower-based cluster; I'm not sure if it can be configured in the FMC GUI or requires FlexConfig. Docs:

vss-enabled —Floods the heartbeat messages on all EtherChannel interfaces in the cluster control link to ensure that at least one of the switches can receive them. If you configure the cluster control link as an EtherChannel (recommended), and it is connected to a VSS, vPC, StackWise, or StackWise Virtual pair, then you might need to enable the vss-enabled option. For some switches, when one node in the redundant system is shutting down or booting up, EtherChannel member interfaces connected to that switch may appear to be Up to the ASA, but they are not passing traffic on the switch side. The ASA can be erroneously removed from the cluster if you set the ASA holdtime timeout to a low value (such as .8 seconds), and the ASA sends keepalive messages on one of these EtherChannel interfaces.
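If it does turn out to be configurable (e.g., via FlexConfig), the resulting data-plane configuration would presumably look something like this (a sketch; the cluster group name and holdtime value are placeholders):

cluster group my-cluster
 ! vss-enabled floods CCL heartbeats on all EtherChannel member interfaces
 health-check holdtime 3 vss-enabled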

 

I sent you a message, please check it.
