08-16-2011 07:02 AM - edited 02-21-2020 04:25 AM
Hi all
I have a failover pair of ASA 5520s (Software Version 8.2(4)4) located in two different data centers.
Because of a network issue, the layer 2 connection between the two locations was interrupted for a couple of seconds and the ASAs went into split-brain, as one would expect.
The thing is that after approx. 1 minute the secondary ASA switched off its failover configuration (i.e. "show run" gives "no failover") without anybody telling it to do so. Here is the "show failover history" of the device:
07:57:34 MESZ Aug 15 2011
Standby Ready            Just Active              HELLO not heard from mate
07:57:34 MESZ Aug 15 2011
Just Active              Active Drain             HELLO not heard from mate
07:57:34 MESZ Aug 15 2011
Active Drain             Active Applying Config   HELLO not heard from mate
07:57:34 MESZ Aug 15 2011
Active Applying Config   Active Config Applied    HELLO not heard from mate
07:57:34 MESZ Aug 15 2011
Active Config Applied    Active                   HELLO not heard from mate
07:58:03 MESZ Aug 15 2011
Active                   Cold Standby             Failover state check
07:58:18 MESZ Aug 15 2011
Cold Standby             Disabled                 HA state progression failed
At this point failover was switched off completely, and the split-brain remained even after the layer 2 connection had been re-established.
This is no good. :( I have searched for "HA state progression failed" without finding any useful result or explanation.
Why did the device switch off failover on its own, and how can we ensure that it won't do this again?
Best regards,
Grischa
08-18-2011 06:59 AM
Hi Grischa,
Can you confirm whether the failover link is connected directly between the two units, while the rest of the interfaces are connected through a switch?
You might be hitting this bug:
However, further research on the issue suggests that one possible workaround would be to manually enable failover on the secondary device again.
Let me know if this helps.
Thanks,
Varun
08-19-2011 03:07 AM
Hi Varun
The devices are installed at different locations. The failover link is an MPLS-L2transport across the provider's backbone from one location to the other. That is, the failover interfaces are connected to a switch on either side, and the switches have a dot1q trunk to the provider's access routers, which connect the two ports via MPLS-L2transport:
ASA1 - Switch1 - PE1-Router ----MPLS-L2transport---- PE2-Router - Switch2 - ASA2
Due to a network issue the MPLS-L2transport was interrupted for a couple of seconds, and afterwards we had a persistent split-brain situation.
Of course I have enabled failover manually again. But the plan is that the ASAs recover by themselves into normal operation as soon as the network is stable again.
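Until the root cause is known, the disabled state can at least be detected from the running config. The following is only a minimal Python sketch, assuming the config text has already been collected from the device by some other means; the function names are hypothetical and not part of any Cisco tool. It relies only on what was observed here: "show run" contains a "no failover" line when failover is off, and the "failover" command turns it back on.

```python
def failover_disabled(running_config: str) -> bool:
    """Return True if the config text contains a bare 'no failover' line."""
    return any(line.strip() == "no failover"
               for line in running_config.splitlines())


def remediation_commands(running_config: str) -> list:
    """Suggest the command that was issued manually in this thread.

    Returns an empty list when failover is not disabled. Sending the
    command to the device is deliberately left out of this sketch.
    """
    return ["failover"] if failover_disabled(running_config) else []
```

Whether re-enabling should be automated at all is a judgment call; the sketch only reports what a human would have to type.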
Regards,
Grischa
08-19-2011 04:27 AM
I would suggest you open a TAC case for it, because this seems to be unusual behavior. If you manually enable failover again, does the secondary become standby and function properly?
-Varun
08-19-2011 06:49 AM
Yes, the only thing I needed to do was issue "failover" on the secondary. It detected its active mate and went properly into standby:
09:16:18 MESZ Aug 15 2011
Disabled                 Negotiation              Set by the config command
09:16:19 MESZ Aug 15 2011
Negotiation              Cold Standby             Detected an Active mate
09:16:21 MESZ Aug 15 2011
Cold Standby             Sync Config              Detected an Active mate
09:16:31 MESZ Aug 15 2011
Sync Config              Sync File System         Detected an Active mate
09:16:31 MESZ Aug 15 2011
Sync File System         Bulk Sync                Detected an Active mate
09:16:31 MESZ Aug 15 2011
Bulk Sync                Standby Ready            Detected an Active mate
I guess we will go the TAC way if we encounter this situation a second time. This time we will be warned and know where to look.
Is there really no documentation available for the "HA state progression failed" message? What does it mean, and how is it usually triggered?
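To know where to look next time, the history output can be scanned for a transition into "Disabled". Below is a minimal Python sketch, assuming the "show failover history" text has already been captured as a string. The state list contains only the states that appear in this thread (it is not a complete list of ASA failover states), and the function names are hypothetical.

```python
import re

# States seen in this thread only; longest first so that "Active Drain"
# is tried before "Active".
KNOWN_STATES = sorted(
    [
        "Standby Ready", "Just Active", "Active Drain",
        "Active Applying Config", "Active Config Applied", "Active",
        "Cold Standby", "Disabled", "Negotiation", "Sync Config",
        "Sync File System", "Bulk Sync",
    ],
    key=len,
    reverse=True,
)

# Matches lines such as "07:58:18 MESZ Aug 15 2011".
TIMESTAMP = re.compile(r"^\d{2}:\d{2}:\d{2} \S+ \w{3} \d{1,2} \d{4}$")


def _take_state(text):
    """Split a leading known state off text; return (state, remainder)."""
    for state in KNOWN_STATES:
        if text == state or text.startswith(state + " "):
            return state, text[len(state):].strip()
    return None, text


def parse_history(output):
    """Return (timestamp, from_state, to_state, reason) tuples."""
    records = []
    stamp = None
    for raw in output.splitlines():
        line = raw.strip()
        if not line:
            continue
        if TIMESTAMP.match(line):
            stamp = line
            continue
        src, rest = _take_state(line)
        dst, reason = _take_state(rest)
        if src is None or dst is None:
            continue  # header, separator, or a state not in KNOWN_STATES
        records.append((stamp, src, dst, reason))
    return records


def disabled_transitions(records):
    """Keep only the transitions whose target state is Disabled."""
    return [r for r in records if r[2] == "Disabled"]


# The last two transitions from the first history in this thread.
SAMPLE_HISTORY = """\
07:58:03 MESZ Aug 15 2011
Active Cold Standby Failover state check
07:58:18 MESZ Aug 15 2011
Cold Standby Disabled HA state progression failed
"""

if __name__ == "__main__":
    for record in disabled_transitions(parse_history(SAMPLE_HISTORY)):
        print(record)
```

Matching against a fixed state list keeps the parser independent of column spacing, at the cost of silently skipping any line whose state is not in the list.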
Regards,
Grischa
08-19-2011 07:00 AM
Well, going by the symptoms and the conditions of your issue, it definitely seems to be the bug I pointed you to. I've tried doing some research on it but could not find any documentation, since this is not expected behavior. I guess opening a TAC case would be the right step if it happens again. When the first instance of this issue was noticed, it could not be reproduced.
This seems to be encountered every time the switch has an issue. If it happens again, a TAC case would be best, so that we have some data to pull while the secondary is in the failed state and can identify whether it is the same problem.
Let me know if this answers your question.
Thanks,
Varun