Re: Failback of primary CSM and FWSM after reload

CCC4046_2 · ‎05-06-2005

I am testing a dual 6500 switch configuration with FWSM and CSM. Both CSM and FWSM have stateful failover and MSFC has HSRP. CSM is bridged and FWSM is routing.

The configuration is basically Client - MSFC - FWSM - CSM - Real server.

The 6500s have 10G trunk between them and I have two access switches (L2) on the real server side. The L2 access switches provide a secondary path for the fault tolerance VLANs. I am pinging machines connected to each L2 switch.

When I reload the primary box, all the connected machines continue to ping and the real servers drop for approximately 5 seconds before the CSM fails over. I am satisfied with the compromise of CSM timeout and stability (I found if the CSM timeouts were too small R-PVST convergence was not fast enough and both CSMs bridged, creating a broadcast storm).

When the primary switch returns to service, HSRP fails back seamlessly and the firewall fails back seamlessly. However, when the CSM fails back, I can no longer ping the real server.

Even though I cannot ping the real servers, I can still access the servers via the VIP. After a period of time the pings return (presumably after a timeout, approx 60sec).

While the pings from outside are failing, if I ping the default gateway (FWSM interface) from the real server, the FWSM responds and the pings from the outside suddenly restart.

I think the issue is to do with either the ARP cache in the FWSM or a problem with the movement of the real server mac-address within the switch network.

Does this ring any bells for anyone?

pascal_parrot · ‎05-06-2005

Hi,

We had a somewhat similar issue. We upgraded the FWSM to 2.3.2 and the problem seems to have disappeared.

Best regards,

Pascal

CCC4046_2 · ‎05-06-2005

Thanks Pascal,

I will give it a try on Monday.

Cheers,

Dave B

Gilles Dufour · ‎05-07-2005

I believe that would be the following bug fixed in 2.3.2 of FWSM.

CSCeg53853 - FWSM fails to update ARP entry when a packet is not targeted for FW

Gilles.

CCC4046_2 · ‎05-08-2005

I checked the FWSM code today and it is already at 2.3.2. However, I have found the source of the problem.

When the CSM function moves from the secondary back to the primary, the mac-address of the real server remains on Po258 of the CSM that has been taken out of service.

If I clear the mac-addresses on Po258 of the inactive CSM, the switch broadcasts the frame, and the return frame updates the mac-address table so that frames are forwarded towards the active CSM. Similarly, if the real server sends a frame, the mac-address table is also updated.
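
The behaviour described above can be illustrated with a toy model of a learning bridge. This is only a sketch of generic L2 learning, not of the 6500's actual forwarding logic; the MAC address and the port label "to-active-csm" are hypothetical, and only Po258 comes from the thread.

```python
class MacTable:
    """Toy L2 forwarding table: MAC address -> port, learned from source MACs."""

    def __init__(self):
        self.entries = {}

    def learn(self, src_mac, port):
        # A frame arriving on `port` (re)binds its source MAC to that port.
        self.entries[src_mac] = port

    def forward(self, dst_mac):
        # Known destination: send out one port. Unknown destination: flood.
        return self.entries.get(dst_mac, "flood")

    def clear_port(self, port):
        # Drop every entry that was learned on `port`.
        self.entries = {m: p for m, p in self.entries.items() if p != port}


SERVER = "0000.0000.0001"  # hypothetical real-server MAC

table = MacTable()
table.learn(SERVER, "Po258")            # learned while the secondary CSM was bridging
after_failback = table.forward(SERVER)  # stale entry: frames still go out Po258

table.clear_port("Po258")
after_clear = table.forward(SERVER)     # unknown MAC, so the frame is flooded

table.learn(SERVER, "to-active-csm")    # return frame arrives via the active CSM
after_reply = table.forward(SERVER)     # forwarding now follows the active path
```

In this model, either clearing the stale entry or any frame sourced by the real server repairs forwarding, which matches both observations in the post.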

Now to find the solution. I think I will raise a TAC case for it.

pascal_parrot · ‎05-09-2005

Dave,

Configuring a shorter time for the mac address table aging time might solve the problem:

mac-address-table aging-time 10 vlan

Best regards,

Pascal

CCC4046_2 · ‎05-09-2005

Thanks Pascal,

I raised a TAC case yesterday.

I shortened the aging timer on the Client side yesterday to 10 seconds, which improved it quite a bit, but it is still by no means hitless.
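
A rough way to see why a shorter aging timer helps without making failback hitless: if nothing refreshes the stale entry, it lives until the aging time has elapsed since the real server was last heard on the old path. The function below is a simplified sketch under that assumption; its name and parameters are illustrative, and real switches may age entries in coarser sweeps.

```python
def stale_window(aging_time, last_seen, failback_at):
    """Seconds after failback during which the stale MAC entry still exists.

    aging_time  -- configured MAC aging time in seconds
    last_seen   -- time the server's MAC was last learned on the old port
    failback_at -- time the CSM failed back
    """
    expires_at = last_seen + aging_time
    return max(0, expires_at - failback_at)


# Server last heard at t=0, CSM fails back at t=5.
with_300s_timer = stale_window(300, 0, 5)  # 295 seconds of possible blackholing
with_10s_timer = stale_window(10, 0, 5)    # 5 seconds
already_aged = stale_window(10, 0, 12)     # 0: entry expired before failback
```

Under this model the outage is bounded by the aging time rather than removed, which is consistent with a 10 second timer improving the failback without making it hitless.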

There was a bug in earlier CSM code that resulted in gratuitous ARPs not being sent for real servers on failover, but it has been fixed in later code. I am going to roll back the code from 4.2.1 to 3.1.10.

I am sure when I tested it under 3.1 code, the failover was hitless.

I will update the forum when I have tested it under 3.1.10.

CCC4046_2 · ‎05-10-2005

Same thing happens under 3.1.10 code.

Back to the drawing board.

I have observed the same issue when failing the CSM from primary to secondary (and vi

I admit 5 seconds is not too bad when the entire primary unit fails (switch, FWSM and CSM), but it takes at least 10 seconds when the primary returns to service!