ASA failover: secondary ASA disabled failover on its own

grischast · 08-16-2011

Hi all

I have a failover pair of ASA 5520s (software version 8.2(4)4) located in two different data centers.

Because of a network issue, the layer 2 connection between the two locations was interrupted for a couple of seconds, and the ASAs went into split-brain, as one would expect them to do.

The thing is that after approx. 1 minute the secondary ASA switched off its failover configuration (i.e. "show run" gives "no failover") without anybody telling it to do so. Here is the "show failover history" of the device:

From State                To State                  Reason
07:57:34 MESZ Aug 15 2011
Standby Ready             Just Active               HELLO not heard from mate
07:57:34 MESZ Aug 15 2011
Just Active               Active Drain              HELLO not heard from mate
07:57:34 MESZ Aug 15 2011
Active Drain              Active Applying Config    HELLO not heard from mate
07:57:34 MESZ Aug 15 2011
Active Applying Config    Active Config Applied     HELLO not heard from mate
07:57:34 MESZ Aug 15 2011
Active Config Applied     Active                    HELLO not heard from mate
07:58:03 MESZ Aug 15 2011
Active                    Cold Standby              Failover state check
07:58:18 MESZ Aug 15 2011
Cold Standby              Disabled                  HA state progression failed

At this point failover was switched off completely, and the split-brain remained even after the layer 2 connection was re-established.
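For anyone who wants to catch this condition from a saved copy of the output, here is a minimal Python sketch. It assumes the history is available as plain text in the shape quoted in this thread (a timestamp line followed by a "from state, to state, reason" line). The state list contains only the states seen in this post, and the names `parse_history` and `disabled_events` are hypothetical helpers, not part of any Cisco tool.

```python
import re

# State names exactly as they appear in the history quoted in this thread;
# other software versions may report additional states (assumption).
KNOWN_STATES = sorted(
    [
        "Standby Ready", "Just Active", "Active Drain",
        "Active Applying Config", "Active Config Applied", "Active",
        "Cold Standby", "Disabled",
    ],
    key=len,
    reverse=True,  # try longest names first so "Active Drain" wins over "Active"
)

# Matches lines such as "07:58:18 MESZ Aug 15 2011".
TIMESTAMP = re.compile(r"^\d{2}:\d{2}:\d{2} \S+ \w{3} \d{1,2} \d{4}$")


def take_state(text):
    """Split a leading known state name off text; return (state, rest)."""
    for state in KNOWN_STATES:
        if text == state or text.startswith(state + " "):
            return state, text[len(state):].strip()
    raise ValueError(f"no known state at start of: {text!r}")


def parse_history(raw):
    """Return a list of (timestamp, from_state, to_state, reason) tuples."""
    events = []
    stamp = None
    for line in raw.splitlines():
        line = " ".join(line.split())  # collapse any column padding
        if not line:
            continue
        if TIMESTAMP.match(line):
            stamp = line
            continue
        if stamp is None:
            continue  # skip header lines before the first timestamp
        from_state, rest = take_state(line)
        to_state, reason = take_state(rest)
        events.append((stamp, from_state, to_state, reason))
    return events


def disabled_events(events):
    """Keep only the transitions that ended in the Disabled state."""
    return [event for event in events if event[2] == "Disabled"]


SAMPLE = """\
07:58:03 MESZ Aug 15 2011
Active Cold Standby Failover state check
07:58:18 MESZ Aug 15 2011
Cold Standby Disabled HA state progression failed
"""

for stamp, from_state, to_state, reason in disabled_events(parse_history(SAMPLE)):
    print(f"{stamp}: {from_state} -> {to_state} ({reason})")
# prints: 07:58:18 MESZ Aug 15 2011: Cold Standby -> Disabled (HA state progression failed)
```

Matching on known state names rather than on column spacing keeps the sketch usable on output whose padding was lost when it was pasted.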

This is no good. :( I have searched for "HA state progression failed" without finding any useful result or explanation.

Why did the device switch off failover on its own, and how can we ensure that it won't do this again?

Best regards,

Grischa

varrao · 08-18-2011

Hi Grischa,

Can you confirm whether the failover link is connected directly between the two units and the rest of the interfaces are connected through a switch?

You might be hitting this bug:

http://tools.cisco.com/Support/BugToolKit/search/getBugDetails.do?method=fetchBugDetails&bugId=CSCtg55257

However, further research on the issue suggests that one possible workaround would be to manually enable failover on the secondary device again.

Let me know if this helps.

Thanks,

Varun


grischast · 08-19-2011

Hi Varun

The devices are installed at different locations. The failover link is an MPLS-L2transport across the provider's backbone from one location to the other, i.e. the failover interfaces are connected to a switch on either side, and the switches have a dot1q trunk to the provider's access routers, which connect the two ports via MPLS-L2transport:

ASA1 - Switch1 - PE1-Router ----MPLS-L2transport---- PE2-Router - Switch2 - ASA2

Due to a network issue, the MPLS-L2transport was interrupted for a couple of seconds, and afterwards we had a persisting split-brain situation.

Of course I have enabled failover manually again. But the plan is that the ASAs recover on their own into normal operation as soon as the network is stable again.
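Until the root cause is known, one way to at least detect the condition is to check the running configuration for the "no failover" line. The following is only a sketch in Python: `failover_disabled` and `recovery_commands` are hypothetical helper names, the sample configuration text is made up, and fetching the configuration from the ASA or sending commands to it is deliberately left out.

```python
def failover_disabled(running_config):
    """Return True if the config text contains a bare "no failover" line."""
    return any(
        line.strip() == "no failover" for line in running_config.splitlines()
    )


def recovery_commands(running_config):
    """Return the commands an operator would enter to re-enable failover.

    Returns an empty list when failover is not disabled. Whether to send
    these automatically or only raise an alert is a policy decision.
    """
    if failover_disabled(running_config):
        return ["configure terminal", "failover"]
    return []


# Made-up config excerpts for illustration only.
broken = "hostname asa2\nno failover\n"
healthy = "hostname asa2\nfailover\n"

print(recovery_commands(broken))   # ['configure terminal', 'failover']
print(recovery_commands(healthy))  # []
```

Keeping the check as a pure function on text means it can be tried against saved "show run" output before it is wired to anything that touches the device.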

Regards,

Grischa

varrao · 08-19-2011

I would suggest you open a TAC case for it, because this seems to be unusual behavior. If you manually enable failover again, does the secondary become standby and function properly?

-Varun


grischast · 08-19-2011

Yes, the only thing I needed to do was issue "failover" on the secondary. It detected its active mate and went properly into standby:

From State                To State                  Reason
09:16:18 MESZ Aug 15 2011
Disabled                  Negotiation               Set by the config command
09:16:19 MESZ Aug 15 2011
Negotiation               Cold Standby              Detected an Active mate
09:16:21 MESZ Aug 15 2011
Cold Standby              Sync Config               Detected an Active mate
09:16:31 MESZ Aug 15 2011
Sync Config               Sync File System          Detected an Active mate
09:16:31 MESZ Aug 15 2011
Sync File System          Bulk Sync                 Detected an Active mate
09:16:31 MESZ Aug 15 2011
Bulk Sync                 Standby Ready             Detected an Active mate
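To put numbers on the two histories, the timestamps can be subtracted directly. This is a small Python sketch; `parse_stamp` and `elapsed_seconds` are hypothetical helpers, and the code assumes both timestamps carry the same time zone label, which it ignores.

```python
from datetime import datetime

# Month abbreviations mapped by hand so parsing does not depend on the locale.
MONTHS = {
    name: number
    for number, name in enumerate(
        ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
         "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"],
        start=1,
    )
}


def parse_stamp(stamp):
    """Parse e.g. '09:16:18 MESZ Aug 15 2011'; the zone label is ignored."""
    clock, _zone, month, day, year = stamp.split()
    hour, minute, second = (int(part) for part in clock.split(":"))
    return datetime(int(year), MONTHS[month], int(day), hour, minute, second)


def elapsed_seconds(first, last):
    """Whole seconds between two history timestamps in the same zone."""
    return int((parse_stamp(last) - parse_stamp(first)).total_seconds())


# Manual re-enable until Standby Ready:
print(elapsed_seconds("09:16:18 MESZ Aug 15 2011", "09:16:31 MESZ Aug 15 2011"))  # 13
# First missed HELLO until failover was disabled:
print(elapsed_seconds("07:57:34 MESZ Aug 15 2011", "07:58:18 MESZ Aug 15 2011"))  # 44
```

So the secondary needed 13 seconds to return to Standby Ready after the manual "failover", and 44 seconds passed between the first missed HELLO and the Disabled state.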

I guess we will go the TAC way if we encounter this situation a second time. Next time we will be forewarned and know where to look.

Is there really no documentation available for the "HA state progression failed" message? What does it mean, and how is it usually triggered?

Regards,

Grischa

varrao · 08-19-2011

Well, going by the symptoms and conditions of your issue, it definitely seems to be the bug I pointed you to. I tried doing some research on it but could not find any documentation, since this is not expected behavior. I guess opening a TAC case would be the right step if it happens again. When the first instance of this issue was noticed, it was not possible to recreate it.

It appears to be encountered whenever the switch runs into an issue. If it happens again, a TAC case would be best, so that we have some data to pull while the secondary is in the failed state and can identify whether it is the same problem.

Let me know if this answers your question.

Thanks,

Varun
