Split-Brained ASA Cisco 5516-X - Please help me experts!

Synatrix · ‎12-08-2023

I have enjoyed a pair of Cisco firewalls on my platform for a number of years.

The past 2 years have been hell though!

The latest incident saw my complete environment being unavailable for 8 hours.

My provider told me that this was not their fault but rather Cisco's, because the dual firewalls I pay them to manage, maintain and monitor had become "split-brained". As a professional hosting provider, they have told me that it is normal for this to take 8 hours to resolve.

Can anyone here with experience from Cisco (or using Cisco products) provide me with a sanity check: is it normal for a dual high-availability pair of firewalls to go down for 8 hours?

Many Thanks,
David

MHM Cisco World · ‎12-08-2023

Can you elaborate a bit more?

Do you have an issue with ASA HA?

Is it a split-brain in the ASA HA pair? How do you know that?

MHM

Synatrix · ‎12-08-2023

Thank you.

It is an HA pair that we rent from ioMart, and they told us that it was a split-brain issue. It took them 8 hours to diagnose and fix before connectivity was restored for the dozen (or so) clients using this solution.

We've had the solution for a number of years and the firewalls have failed over successfully in the past, but on this occasion we lost an entire day of the platform and I had nearly all of my clients screaming at me.

Ultimately, I just want to know if 8 hours is what the community here would deem an appropriate amount of time for a professional hosting company such as ioMart to identify and resolve an issue on enterprise grade "high availability" gear.

marce1000 · ‎12-08-2023

>....me with a sanity check that having a dual high availability pair of firewalls going down for 8 hours is normal.
- You should get an idea of what the firewalls are doing by examining the logs of the ASAs. It is also quite possible that this was caused by external network events or problems, so you should get an integrated view of what is going on. This can be done, for instance, by configuring the ASAs and the switches they are connected to, along with the rest of the network they serve, to log to a central syslog server. Examining the central logs could then provide better insights.
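
As a rough illustration of the "integrated view" idea, here is a minimal Python sketch that merges per-device log streams into a single timeline ordered by timestamp. It assumes the logs have already been collected and parsed into (timestamp, device, message) records; the `LogEntry` type, the `merge_timelines` helper and the device names are hypothetical and not part of any Cisco tooling.

```python
import heapq
from datetime import datetime
from typing import Iterable, Iterator, NamedTuple


class LogEntry(NamedTuple):
    """One parsed log record (hypothetical structure, not a Cisco format)."""
    timestamp: datetime
    device: str
    message: str


def merge_timelines(*streams: Iterable[LogEntry]) -> Iterator[LogEntry]:
    """Merge per-device streams into one timeline ordered by timestamp.

    Each input stream must already be sorted by timestamp, which is
    normally true for a log file written in arrival order.
    """
    return heapq.merge(*streams, key=lambda entry: entry.timestamp)


if __name__ == "__main__":
    # Placeholder records, invented purely for illustration.
    asa_primary = [
        LogEntry(datetime(2023, 1, 1, 9, 0, 0), "asa-primary", "placeholder event A"),
        LogEntry(datetime(2023, 1, 1, 9, 0, 5), "asa-primary", "placeholder event C"),
    ]
    switch = [
        LogEntry(datetime(2023, 1, 1, 9, 0, 2), "switch-1", "placeholder event B"),
    ]
    for entry in merge_timelines(asa_primary, switch):
        print(entry.timestamp.isoformat(), entry.device, entry.message)
```

Reading the merged timeline makes it easier to see whether a switch-side event came just before the firewalls changed state, which is the kind of correlation a single device's log cannot show.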

M.

-- Each morning when I wake up and look into the mirror I always say ' Why am I so brilliant ? '
The mirror then always responds to me with ' The only thing that exceeds your brilliance is your beauty! '

Synatrix · ‎12-08-2023

Thank you M,

That's great advice - I will request the full logs so that I can attempt to validate what they're telling me is correct.

I really appreciate this community.

MHM Cisco World · ‎12-10-2023

If the failover link is UP, OK.

And if the monitored link shows "Monitored",

then in 75% of the cases I see, both ASAs use the same virtual MAC, which makes the switch not forward heartbeats between them, so each one assumes the peer is dead.

Check this point.

MHM

tvotna · ‎12-10-2023

This is not correct. Firewalls exchange messages over both the failover link and the regular interfaces, so split-brain should normally never happen.

It can only happen if there is a bug in the software, if the network design is wrong (this most commonly happens when most data interfaces and the failover link are connected to the same switch, which is totally wrong), or under extreme traffic loads due to CPU hogs.

The split-brain may not be noticed by the operator for some time unless a customer calls in to complain, although there are SNMP traps and MIBs that can be polled to detect it. But fixing the issue is typically trivial: all you need to do is log in from the console (e.g. from FXOS) and disable all data interfaces on one box. Usually this takes less than 8 hours.
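
To illustrate the detection side, here is a minimal Python sketch of the check a monitoring system could run once it has each unit's failover role. How the roles are collected (SNMP poll, CLI scrape, etc.) is out of scope, and the role strings and the `classify_pair` function are assumptions made for this example, not actual MIB values or Cisco APIs.

```python
def classify_pair(primary_role: str, secondary_role: str) -> str:
    """Classify an HA pair from the role each unit reports about itself.

    The roles are assumed to be plain strings such as "active" or
    "standby" (hypothetical values chosen for this sketch).
    """
    roles = [primary_role.strip().lower(), secondary_role.strip().lower()]
    active_count = roles.count("active")

    if active_count == 2:
        # Both units believe they own the traffic: the split-brain case.
        return "split-brain"
    if active_count == 0:
        # Nobody is forwarding traffic; different problem, still an outage.
        return "no-active-unit"
    if "standby" in roles:
        return "healthy"
    # One unit is active, the other is failed, unknown, or unreachable.
    return "degraded"


if __name__ == "__main__":
    print(classify_pair("active", "active"))
    print(classify_pair("active", "standby"))
```

Alerting on the "split-brain" result would let the operator find out from monitoring rather than from a customer call.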

Synatrix · ‎12-13-2023

Well, my provider "ioMart" has still not provided me with any details of why we experienced an 8 hour outage.

Worse still, they are not prepared to share the detailed logs with me even though we rent the dedicated hardware from them (and have done for the past 7 years).

They are claiming that this is related to an LACP bug, and yet we worked with them several months ago to ensure that the firewalls received the latest 9.16.1 firmware, because they had previously claimed this was needed to resolve a known LACP bug that was causing infrequent outages on our rented equipment.

They (ioMart) say that they have an open case with the manufacturer but that Cisco has not yet come back to them since they reported it on December 7th (6 days ago at the time of writing this message).

Do you (experts) think this sounds right? Am I just being paranoid here or am I being shafted by a company without either (or both) the experience or aptitude to properly manage this specialist hardware?

Thanks again all,
David

MHM Cisco World · ‎12-13-2023

@tvotna you need to read about this case more

@Synatrix

This is the Cisco doc about split brain:

https://www.cisco.com/c/en/us/support/docs/security/adaptive-security-appliance-asa-software/217691-troubleshoot-split-brain-issues-on-asa-f.html#:~:text=Split%2Dbrain%20is%20a%20scenario,resulting%20in%20loss%20of%20services.

MHM

Rob Ingram · ‎12-13-2023

@Synatrix 8 hours to resolve this issue seems excessive, and I agree that the quick/temporary fix suggested by @tvotna could have been implemented by the provider to restore service. As you do not manage the system yourself, you are in the hands of your provider. In a managed/monitored environment there should be logs that show where the issue was, for TAC to troubleshoot. I'd chase for an update on the TAC case to see where that leads.