09-09-2021 09:33 AM
Hello!
We have been running our FTD HA pair without any major issues for about 5 months, but all of a sudden both units decided they wanted to be active and we ended up in a split-brain scenario. This brought our entire network down, and I could see that both nodes had a status of active. The quick fix was to shut down the 2nd node and reboot the primary node, after which network connectivity was restored. I opened a high-priority TAC case, but it wasn't taken seriously at all and they were not able to determine a root cause. Looking back through the event logs, I noticed that the primary node's "primary detection engine" failed, then it appeared that snort started crashing/restarting, and we were seeing snort memory usage up in the 90% range. We are currently running FMC version 6.7 and the HA links are directly connected. Has anyone ever experienced anything like this? When this occurred, nobody was even logged into FMC or making any changes; it happened out of the blue.
Thank you
09-09-2021 10:13 AM
We are running 6.5 on FTD 4K and have not come across this issue.
Five months of uptime should not cause any issue by itself. The only possible reasons I can see for the split brain are below.
Did the two FTDs stop seeing each other because Layer 2 connectivity on the sync link was lost? Do you monitor all interfaces for failover?
Look at the logs to find the reason the failover took place.
Check whether the directly connected switch has any logs (any STP loops?).
If all of the above looks good, then come back to your logs.
- The primary node's "primary detection engine" failed: since you mentioned this was the issue, you need to engage TAC to investigate it. They are the technical experts and should be able to shed some light on the cause by reviewing the crash logs.
I am sure they will come back with a solution, such as a patch or an upgrade suggestion, based on the outcome.
09-09-2021 10:30 AM
Hey Balaji,
TAC had me run outputs and deliver logs to them, and they were not able to find any misconfiguration or any reason why this happened other than "there was a communication error between the HA pairs". They were not able to pinpoint what the error would have been, though. Is a direct HA link between the two nodes not recommended?
09-09-2021 11:36 AM
Is a direct HA link between the two nodes not recommended?
When you say direct link, where are these devices? Are they in the same place, next to each other, or in different locations with something like DWDM connecting them?
Can you give more information so we can understand the issue?
09-09-2021 11:42 AM
Sorry, they are right next to each other and directly connected, no DWDM.
09-09-2021 10:18 AM
I have experienced similar situations, but that was way back when the ASA5585 was first integrated with the Sourcefire module. Could you describe your setup in more detail? Which hardware are you running FTD on? I am assuming the FTD devices are running the same version as the FMC and not lower?
Do you have standby IPs setup for at least one of the data interfaces if not for all the data interfaces? Do you have monitoring enabled for these data interfaces that have standby IPs?
09-09-2021 10:37 AM
Hi Marius,
Thank you for the reply. We are running Firepower 2110 security appliances and both are on the same version. The HA Link is connected via Ethernet1/11 interfaces with a separate primary IP and secondary IP. The State link is set up on Ethernet1/12 interfaces with separate Primary and secondary IPs. I hope that answers your question.
09-09-2021 10:46 AM
What I meant was not the failover and state links, as these are required to have standby IPs; I mean the data links. Do these have standby IPs, and are they configured to be monitored?
The reason I ask is that if the failover link fails and you do not have monitoring on the data interfaces, the FTDs will not be able to tell whether this is a real failover situation. Each unit will then assume the peer is dead and take over the active role, which results in a split-brain situation. When you have standby IPs configured on the data interfaces and have configured them to be "monitored", then if the failover link fails, hello packets will be sent out the data interfaces to verify whether this is a true failure situation. If a reply is received there will not be a failover, but if the peer does not send a reply on the data interface after 3 tries, a failover situation has occurred and the standby takes over the active role.
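The decision described above can be sketched roughly in Python. This is only a simplified model of the behaviour as explained in this post, not how FTD actually implements failover. The names `MonitoredInterface`, `standby_decision` and `peer_replies` are made up for illustration, and the retry count of 3 is taken from the description above rather than from any configuration.

```python
from dataclasses import dataclass
from typing import Callable, List

# "after 3 tries" as described in the post; treated here as a fixed assumption.
HELLO_RETRIES = 3


@dataclass
class MonitoredInterface:
    """Hypothetical model of one data interface on the HA pair."""
    name: str
    standby_ip_configured: bool
    monitored: bool


def standby_decision(
    failover_link_up: bool,
    interfaces: List[MonitoredInterface],
    peer_replies: Callable[[str], bool],
) -> str:
    """Return "stay-standby" or "become-active" for the standby unit.

    peer_replies(name) is a hypothetical probe that returns True when the
    peer answers a hello sent out the named data interface.
    """
    if failover_link_up:
        # Failover link is healthy, so there is nothing to verify.
        return "stay-standby"

    # Only interfaces with a standby IP AND monitoring enabled can be used
    # to check whether the peer is still alive.
    usable = [i for i in interfaces if i.standby_ip_configured and i.monitored]
    if not usable:
        # No way to verify the peer: it is assumed dead. If the peer is
        # actually alive, both units end up active (split brain).
        return "become-active"

    for iface in usable:
        for _ in range(HELLO_RETRIES):
            if peer_replies(iface.name):
                # Peer answered on a data interface: only the failover
                # link is broken, so no failover takes place.
                return "stay-standby"

    # No reply on any monitored data interface after all tries.
    return "become-active"
```

For example, calling `standby_decision(False, [], lambda name: True)` returns `"become-active"` even though the peer would have answered, because there is no monitored data interface to ask on. That is the split-brain case in this model.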