
Firepower 2110/4110 HA/Cluster causing Split-Brain due to power issue

alex.f.
Level 1

Hello,

I witnessed some strange behaviour in two of our eight Firepower clusters.
Affected are one FTD 2110 HA pair and one FTD 4110 HA pair, both on 6.6.5.

After a major power outage in our DC, all systems went down.
(VxRail, switches, routers, firewalls all offline for an hour.)
The staff on site switched the power back on and all devices came back online in no particular order.
Nothing was broken, but some clients couldn't reach their gateways.

It turned out that one DC cluster went active/active and the units didn't negotiate their HA state.
The gateway IP was active on both FTDs and the clients lost their connections from time to time.
"Switch Active Peer" had no effect and the sync between the two FTDs didn't finish after 30 minutes, so we rebooted the "standby" FTD and left the active one up and running.
The reboot changed nothing and we had to power off one FTD (the non-working "standby").

I did some research and found no specific bug.
I will try the following steps in a maintenance window next week (rough CLI outline after the list):

- Suspend HA on the "active" FTD
(boot the "standby" and look for logs or crash reports)
- Resume HA
- Reboot both FTDs
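
For reference, this is roughly the CLI I have in mind for the window (FTD 6.6.x; syntax from memory, so the exact wording may differ slightly):

> configure high-availability suspend
(on the "active" FTD: it keeps forwarding traffic while HA negotiation is paused)

> show failover history
> expert
(on the booted "standby": check /ngfw/var/log/ and look for any crash files)

> configure high-availability resume
> show failover
(resume HA and confirm the pair ends up Active / Standby Ready before rebooting both FTDs)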


Any thoughts or tasks would be helpful.

regards
Alex

8 Replies

balaji.bandi
Hall of Fame

Since there was an unexpected power outage, something might have crashed.

 

- Please confirm that Layer 2 connectivity between the devices is OK (a quick check is sketched below).

- For the offline unit: remove all the connections, boot the device on its own, and check that it boots as expected before you go to the next step.
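
Something like this from the FTD CLI on each unit should show whether the failover link is physically up and what state each side believes it is in (commands as I remember them; 10.10.10.2 is only a placeholder for the failover IP of the mate):

> show failover state
> show interface ip brief
> ping 10.10.10.2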

 

 

BB


Yes, both clusters are directly connected with an Ethernet cable.

 

StefanH1
Level 1

Hi Alex,

 

we have already seen similar issues in several customer deployments after power outages and a simultaneous boot of the primary and secondary nodes.

Please send the output of

show failover
show failover history

from the primary and secondary appliances to validate my suspicion.

 

If you notice something like this on the secondary node:

> show failover
Failover Off (pseudo-Standby)
Failover unit Secondary
Failover LAN Interface: failover-link Ethernet1/8 (up)
Reconnect timeout 0:00:00
Unit Poll frequency 1 seconds, holdtime 15 seconds
Interface Poll frequency 5 seconds, holdtime 25 seconds
Interface Policy 1
Monitored Interfaces 3 of 1288 maximum
MAC Address Move Notification Interval not set
> show failover history
==========================================================================
From State                 To State                   Reason
==========================================================================
16:03:29 UTC Jul 14 2021
Disabled                   Negotiation                Set by the config command

16:03:31 UTC Jul 14 2021
Negotiation                Cold Standby               Detected an Active mate

16:03:32 UTC Jul 14 2021
Cold Standby               App Sync                   Detected an Active mate

16:04:05 UTC Jul 14 2021
App Sync                   Disabled                   CD App Sync error is App Config Apply Failed
16:06:17 UTC Jul 14 2021
Disabled                   Negotiation                Set by the config command

16:06:19 UTC Jul 14 2021
Negotiation                Cold Standby               Detected an Active mate

16:06:20 UTC Jul 14 2021
Cold Standby               App Sync                   Detected an Active mate

16:06:54 UTC Jul 14 2021
App Sync                   Disabled                   CD App Sync error is App Config Apply Failed
==========================================================================

You should be able to let the secondary node resync with the primary via the command

configure high-availability resume

Validate via

show failover
show failover history
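
If the resync goes through, the secondary should step through Negotiation, Cold Standby and App Sync again and then settle in Standby Ready instead of dropping back to Disabled; the show failover output on the secondary should then look roughly like this (excerpt only, values will differ):

> show failover
Failover On
Failover unit Secondary
...
This host: Secondary - Standby Ready
Other host: Primary - Active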

If you rebooted the secondary while it was in pseudo-Standby state, it might actually be in Failover Off state. In this case you will have to:

+ break the HA

+ de-register the affected device and register it again.

+ add it back to the HA pair.

 

+ if the device is still not able to sync after this, it most likely needs to be reimaged.

 

Also see: https://www.cisco.com/c/en/us/support/docs/security/firepower-management-center/212699-configure-ftd-high-availability-on-firep.html#anc12

Especially the FAQ section.

 

Best regards

Stefan

Hi Stefan,

here are my findings so far ...

#2110
Cisco Fire Linux OS v6.6.5 (build 13)
Cisco Firepower 2110 Threat Defense v6.6.5.1 (build 15)

> show failover
descriptor exec history interface state statistics |
> show failover history
==========================================================================
From State To State Reason
==========================================================================
08:29:15 UTC Jan 18 2022
Not Detected Disabled No Error

08:29:23 UTC Jan 18 2022
Disabled Negotiation Set by the config command

08:30:08 UTC Jan 18 2022
Negotiation Just Active No Active unit found

08:30:09 UTC Jan 18 2022
Just Active Active Drain No Active unit found

08:30:09 UTC Jan 18 2022
Active Drain Active Applying Config No Active unit found

08:30:09 UTC Jan 18 2022
Active Applying Config Active Config Applied No Active unit found

08:30:09 UTC Jan 18 2022
Active Config Applied Active No Active unit found

==========================================================================
>
>
>
> show failover
Failover On
Failover unit Primary
Failover LAN Interface: Failover Ethernet1/12 (down)
Reconnect timeout 0:00:00
Unit Poll frequency 1 seconds, holdtime 15 seconds
Interface Poll frequency 5 seconds, holdtime 25 seconds
Interface Policy 1
Monitored Interfaces 3 of 1292 maximum
MAC Address Move Notification Interval not set
failover replication http
Version: Ours 9.14(3)15, Mate 9.14(3)15
Serial Number: Ours ###########, Mate Unknown
Last Failover at: 08:30:09 UTC Jan 18 2022
This host: Primary - Active
Active time: 1065178 (sec)
slot 0: FPR-2110 hw/sw rev (1.1/9.14(3)15) status (Up Sys)
Interface outside (XXX.XXX.XXX.1): Unknown (Waiting)
Interface inside (XXX.XXX.YYY.81): Unknown (Waiting)
Interface diagnostic (0.0.0.0): Unknown (Waiting)
slot 1: snort rev (1.0) status (up)
slot 2: diskstatus rev (1.0) status (up)
Other host: Secondary - Failed
Active time: 0 (sec)
slot 0: FPR-2110 hw/sw rev (1.1/9.14(3)15) status (Unknown/Unknown)
Interface outside (XXX.XXX.XXX.2): Unknown (Waiting)
Interface inside (XXX.XXX.YYY.82): Unknown (Waiting)
Interface diagnostic (0.0.0.0): Unknown (Waiting)
slot 1: snort rev (1.0) status (up)
slot 2: diskstatus rev (1.0) status (up)

Stateful Failover Logical Update Statistics
Link : Failover Ethernet1/12 (down)
Stateful Obj xmit xerr rcv rerr
General 0 0 0 0
sys cmd 0 0 0 0
up time 0 0 0 0
RPC services 0 0 0 0
TCP conn 0 0 0 0
UDP conn 0 0 0 0
ARP tbl 0 0 0 0
Xlate_Timeout 0 0 0 0
IPv6 ND tbl 0 0 0 0
VPN IKEv1 SA 0 0 0 0
VPN IKEv1 P2 0 0 0 0
VPN IKEv2 SA 0 0 0 0
VPN IKEv2 P2 0 0 0 0
VPN CTCP upd 0 0 0 0
VPN SDI upd 0 0 0 0
VPN DHCP upd 0 0 0 0
SIP Session 0 0 0 0
SIP Tx 0 0 0 0
SIP Pinhole 0 0 0 0
Route Session 0 0 0 0
Router ID 0 0 0 0
User-Identity 0 0 0 0
CTS SGTNAME 0 0 0 0
CTS PAC 0 0 0 0
TrustSec-SXP 0 0 0 0
IPv6 Route 0 0 0 0
STS Table 0 0 0 0
Rule DB B-Sync 0 0 0 0
Rule DB P-Sync 0 0 0 0
Rule DB Delete 0 0 0 0

Logical Update Queue Information
Cur Max Total
Recv Q: 0 0 0
Xmit Q: 0 0 0
>

 

 

#4110
> show failover history
==========================================================================
From State To State Reason
==========================================================================
10:04:50 CET Jan 15 2022
Not Detected Disabled No Error

10:04:52 CET Jan 15 2022
Disabled Negotiation Set by the config command

10:05:07 CET Jan 15 2022
Negotiation Just Active No Active unit found

10:05:07 CET Jan 15 2022
Just Active Active Drain No Active unit found

10:05:07 CET Jan 15 2022
Active Drain Active Applying Config No Active unit found

10:05:07 CET Jan 15 2022
Active Applying Config Active Config Applied No Active unit found

10:05:07 CET Jan 15 2022
Active Config Applied Active No Active unit found

==========================================================================
>

> show failover
Failover On
Failover unit Primary
Failover LAN Interface: Failover Port-channel2 (down)
Reconnect timeout 0:00:00
Unit Poll frequency 1 seconds, holdtime 15 seconds
Interface Poll frequency 5 seconds, holdtime 25 seconds
Interface Policy 1
Monitored Interfaces 30 of 1291 maximum
MAC Address Move Notification Interval not set
failover replication http
Version: Ours 9.14(3)15, Mate 9.14(3)15
Serial Number: Ours ############, Mate Unknown
Last Failover at: 10:05:07 CET Jan 15 2022
This host: Primary - Active
Active time: 1323243 (sec)
slot 0: UCSB-B200-M3-U hw/sw rev (0.0/9.14(3)15) status (Up Sys)
Interface A (x.x.x.1): Normal (Waiting)
Interface A (x.x.x.1): Normal (Waiting)
Interface A (x.x.x.1): Normal (Waiting)
Interface A (x.x.x.1): Normal (Waiting)
Interface A (x.x.x.1): Normal (Waiting)
Interface V (x.x.x.1): Normal (Waiting)
Interface V (x.x.x.1): Normal (Waiting)
Interface V (x.x.x.1): Normal (Waiting)
Interface V (x.x.x.1): Normal (Waiting)
Interface D (x.x.x.1): Normal (Waiting)
Interface D (x.x.x.1): Normal (Waiting)
Interface D (x.x.x.1): Normal (Waiting)
Interface D (x.x.x.1): Normal (Waiting)
Interface D (x.x.x.1): Normal (Waiting)
Interface D (x.x.x.1): Normal (Waiting)
Interface D (x.x.x.1): Normal (Waiting)
Interface D (x.x.x.1): Normal (Waiting)
Interface D (x.x.x.1): Normal (Waiting)
Interface D (x.x.x.1): Normal (Waiting)
Interface D (x.x.x.1): Normal (Waiting)
Interface D (x.x.x.1): Normal (Waiting)
Interface D (x.x.x.11): Normal (Waiting)
Interface D (x.x.x.11): Normal (Waiting)
Interface D (x.x.x.11): Normal (Waiting)
Interface D (x.x.x.1): Normal (Waiting)
Interface D (x.x.x.1): Normal (Waiting)
Interface D (x.x.x.1): Normal (Waiting)
Interface D (x.x.x.1): Normal (Waiting)
Interface D (x.x.x.1): Normal (Waiting)
Interface diagnostic (0.0.0.0): Unknown (Waiting)
slot 1: snort rev (1.0) status (up)
slot 2: diskstatus rev (1.0) status (up)
Other host: Secondary - Failed
Active time: 0 (sec)
slot 0: UCSB-B200-M3-U hw/sw rev (0.0/9.14(3)15) status (Unknown/Unknown)
Interface A (x.x.x.2): Unknown (Waiting)
Interface A (x.x.x.2): Unknown (Waiting)
Interface A (x.x.x.2): Unknown (Waiting)
Interface A (x.x.x.2): Unknown (Waiting)
Interface A (x.x.x.2): Unknown (Waiting)
Interface V (x.x.x.2): Unknown (Waiting)
Interface V (x.x.x.2): Unknown (Waiting)
Interface V (x.x.x.2): Unknown (Waiting)
Interface V (x.x.x.2): Unknown (Waiting)
Interface D (x.x.x.2): Unknown (Waiting)
Interface D (x.x.x.2): Unknown (Waiting)
Interface D (x.x.x.2): Unknown (Waiting)
Interface D (x.x.x.2): Unknown (Waiting)
Interface D (x.x.x.2): Unknown (Waiting)
Interface D (x.x.x.2): Unknown (Waiting)
Interface D (x.x.x.2): Unknown (Waiting)
Interface D (x.x.x.2): Unknown (Waiting)
Interface D (x.x.x.2): Unknown (Waiting)
Interface D (x.x.x.2): Unknown (Waiting)
Interface D (x.x.x.2): Unknown (Waiting)
Interface D (x.x.x.2): Unknown (Waiting)
Interface D (x.x.x.12): Unknown (Waiting)
Interface D (x.x.x.12): Unknown (Waiting)
Interface D (x.x.x.12): Unknown (Waiting)
Interface D (x.x.x.2): Unknown (Waiting)
Interface D (x.x.x.2): Unknown (Waiting)
Interface D (x.x.x.2): Unknown (Waiting)
Interface D (x.x.x.2): Unknown (Waiting)
Interface D (x.x.x.2): Unknown (Waiting)
Interface diagnostic (0.0.0.0): Unknown (Waiting)
slot 1: snort rev (1.0) status (up)
slot 2: diskstatus rev (1.0) status (up)

Stateful Failover Logical Update Statistics
Link : Failover Port-channel2 (down)
Stateful Obj xmit xerr rcv rerr
General 0 0 0 0
sys cmd 0 0 0 0
up time 0 0 0 0
RPC services 0 0 0 0
TCP conn 0 0 0 0
UDP conn 0 0 0 0
ARP tbl 0 0 0 0
Xlate_Timeout 0 0 0 0
IPv6 ND tbl 0 0 0 0
VPN IKEv1 SA 0 0 0 0
VPN IKEv1 P2 0 0 0 0
VPN IKEv2 SA 0 0 0 0
VPN IKEv2 P2 0 0 0 0
VPN CTCP upd 0 0 0 0
VPN SDI upd 0 0 0 0
VPN DHCP upd 0 0 0 0
SIP Session 0 0 0 0
SIP Tx 0 0 0 0
SIP Pinhole 0 0 0 0
Route Session 0 0 0 0
Router ID 0 0 0 0
User-Identity 0 0 0 0
CTS SGTNAME 0 0 0 0
CTS PAC 0 0 0 0
TrustSec-SXP 0 0 0 0
IPv6 Route 0 0 0 0
STS Table 0 0 0 0
Rule DB B-Sync 0 0 0 0
Rule DB P-Sync 0 0 0 0
Rule DB Delete 0 0 0 0

Logical Update Queue Information
Cur Max Total
Recv Q: 0 0 0
Xmit Q: 0 0 0
>

 

Failover On
Failover unit Primary
Failover LAN Interface: Failover Port-channel2 (down)

 

Failover On
Failover unit Primary
Failover LAN Interface: Failover Ethernet1/12 (down)

Is this output from the primaries of two different clusters? Both appliances are primary, and these seem to be a 2110 and a 4110.

Can you please add the output from the secondary appliances of both clusters?

Yes, these are the outputs from two different clusters.

On Wednesday I have a maintenance window and can switch on the standby FTD again.

I will report my findings.

I have seen similar behaviour and ended up upgrading the FMC/firewalls to 7.0.1, which seems to have fixed the issue.

alex.f.
Level 1

We had to break the HA, reimage the Firepower, and rebuild the HA.
