Solved: FWSM failed to standby after deleting multiple contexts

dwilliams · ‎12-02-2009

We are running redundant FWSMs in two 7613 routers in multiple context mode. The FWSMs are configured for Active/Standby failover. We were deleting several contexts on the primary FWSM in an effort to reclaim some unused licenses when the units suddenly failed over. Could we have inadvertently triggered the failover by deleting too many contexts, and their associated interfaces, too quickly? Has anyone ever experienced anything like this?

Kureli Sankar · ‎12-02-2009

I have not heard of or seen an issue similar to this, but it certainly is possible. Without knowing what you had configured for failover parameters and what the logs gave as the reason for failing over, it is hard to say why it failed over.

sh fail history

sh run fail

syslogs from the time of the problem

It may be a good idea to first remove the VLANs from being pushed down from the switch hosting the standby unit (firewall vlan commands), and then remove the same VLANs from being pushed down from the switch hosting the active unit. Verify that both units see the same number of VLANs and that the failover status shows Active/Standby Ready, and then delete the corresponding contexts from the active unit first.
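The verification step above can be written down as a small pre-check. This is only a sketch: the function name is made up for the example, the VLAN lists and state strings are assumed to be transcribed by hand from the show output on each unit, and the exact state wording ("Active", "Standby Ready") is an assumption about how the status is reported.

```python
def safe_to_delete_contexts(active_vlans, standby_vlans, active_state, standby_state):
    """Return (ok, reason) for the pre-check described above (illustrative only).

    active_vlans / standby_vlans -- VLAN IDs each unit currently sees
    active_state / standby_state -- failover state strings copied from each unit
    """
    # Both units must see exactly the same set of VLANs.
    if set(active_vlans) != set(standby_vlans):
        diff = sorted(set(active_vlans) ^ set(standby_vlans))
        return False, f"VLAN mismatch between units: {diff}"
    # The pair must be in a normal Active / Standby Ready state.
    if active_state != "Active" or standby_state != "Standby Ready":
        return False, f"unexpected failover state: {active_state}/{standby_state}"
    return True, "ok"


# Example with made-up VLAN numbers: VLAN 101 is still pushed to the active unit only.
print(safe_to_delete_contexts([73, 74, 100, 101], [73, 74, 100], "Active", "Standby Ready"))
```

Only when the check passes would you go on to delete the contexts from the active unit.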

David Williams · ‎12-02-2009

Thanks Kusankar.

I pasted in the basic failover setup that we have configured. I wish I had the show commands or even syslogs from when this took place. Unfortunately, another engineer was doing the work when this happened and none of that information was collected. Since that time I have upgraded the router code and reloaded the 7613s, so I'm afraid any hope of gleaning anything useful at this point is slim.

It comes up now because I have to go through and do another round of cleanup. Given the history, I'm wary of just jumping in and making the changes. I do have a lab with 7606s and a similar setup that I can do some testing in. I was hoping that perhaps this could have been caused by deleting contexts without shutting down and/or removing the interfaces first. Perhaps if the FWSM saw some number of interfaces vanish from the primary, it would trigger a forced failover. I haven't been successful in a search for documentation that would support such a theory. I know there is some interface monitoring that can be done, but we never implemented that, and it is my understanding that it isn't enabled by default in multiple context mode.

If you or anyone else has any information, I would appreciate any suggestions. At this point I think I need to schedule some lab time and try to replicate the problem. I was hoping to avoid that but it is looking like it may be unavoidable.

failover
failover lan unit primary
failover lan interface failover Vlan73
failover interface-policy 1
failover replication http
failover link state Vlan74
failover interface ip failover 10.254.254.1 255.255.255.0 standby 10.254.254.2
failover interface ip state 10.254.253.1 255.255.255.0 standby 10.254.253.2

failover
failover lan unit secondary
failover lan interface failover Vlan73
failover interface-policy 1
failover replication http
failover link state Vlan74
failover interface ip failover 10.254.254.1 255.255.255.0 standby 10.254.254.2
failover interface ip state 10.254.253.1 255.255.255.0 standby 10.254.253.2

Thanks again Kusankar!

Kureli Sankar · ‎12-02-2009

Dave,

failover interface-policy 1

That line right there says that the unit will fail over even if only one interface fails.

http://www.cisco.com/en/US/docs/security/fwsm/fwsm40/command/reference/ef.html#wp1667030

If the number of failed interfaces meets the configured policy and the other FWSM is functioning properly, the FWSM will mark itself as failed and a failover may occur (if the active FWSM is the one that fails). Only interfaces that are designated as monitored by the monitor-interface command count towards the policy.
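The rule quoted above can be sketched as a small model. This is only an illustration of the documented behavior, not FWSM code: the function and parameter names are made up for the example, and it assumes the policy is given as an absolute number of interfaces rather than a percentage.

```python
def should_mark_failed(monitored, failed, policy, peer_healthy):
    """Illustrative model of 'failover interface-policy <num>'.

    monitored    -- names of interfaces enabled with monitor-interface
    failed       -- names of interfaces currently considered failed
    policy       -- the <num> threshold from 'failover interface-policy'
    peer_healthy -- True if the other unit is functioning properly
    """
    # Only monitored interfaces count towards the policy.
    failed_monitored = set(monitored) & set(failed)
    return peer_healthy and len(failed_monitored) >= policy


# With a policy of 1, a single failed monitored interface is enough.
print(should_mark_failed({"inside", "outside"}, {"inside"}, 1, True))
# A failed interface that is not monitored does not count.
print(should_mark_failed({"outside"}, {"inside"}, 1, True))
```

In this model, a unit with no monitored interfaces never reaches the threshold, which is why the number of monitored interfaces matters.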

Now, the question is: how many interfaces are you monitoring? Please post the output of "sh run monitor".

We do see quite a few postmortem cases without the necessary data for us to arrive at a root cause. Without all the data we can only guess...

I am sure you have SmartNet; if so, you are welcome to open a TAC case, where we can take the time to put your config in the lab and repeat the same context/interface removal that you did to observe the behavior.

-KS

David Williams · ‎12-02-2009

Hi Kusankar,

That is good info. We do have smartnet, but I knew going into this that I had next to no information to provide TAC, so I thought I would start here.

Currently, we don't have any monitor-interface commands configured, but it did get me thinking. If a context and its associated interfaces were deleted from the primary/active FWSM, and one of those interfaces was configured with the monitor-interface command, would that command be removed when the interface was deleted? That, in conjunction with the failover interface-policy 1 command, could explain the failover. Of course this is all speculation without more information about the specific incident.

I tell you what. I will lab this up quick later this week and see if I can duplicate. I will post my results and open a TAC case if we still have reservations about moving forward.

Thanks again for all your advice Kusankar! At least I have some things to try in the lab that I didn't have before. And who knows. After I post my testing results, maybe it will help someone else out down the road.

Kureli Sankar · ‎12-02-2009

There are no "monitor-interface" lines in any of the contexts presently?

http://www.cisco.com/en/US/docs/security/fwsm/fwsm40/command/reference/m.html#wp1765154

This means that this context did get deleted. That could very well have been the trigger of the failover. The failover lines go in the system space, but the monitor-interface lines go in the individual contexts.

Good luck with your recreate.

-KS