Cisco 6509-E VSS Replace Supervisor

leonardo_liberati · ‎10-19-2011

Good Morning,

We have to replace the Sup720 in one chassis of a VSS pair.

I've read the how-to on cisco.com web site:

http://www.cisco.com/en/US/products/ps9336/products_configuration_example09186a0080a64891.shtml

but I have some questions about the replacement:

While we perform the recovery procedure, is a malfunction likely even though one of the two VSS nodes is still up and running?

What kind of impact can we expect?

Can a fault occur on the running chassis during the recovery procedure?

In the event of a fault, what is the recovery process in these cases, and especially the estimated timing?

Note: We have a VSS with an FWSM (one in each 6509) and WiSMs (two in each 6509).

Thanks a lot

Leonardo

Jon Marshall · ‎10-19-2011

Leo

However, forcing preemption transitions requires a reload and potential traffic outage.

Yes, exactly. So with no preemption, the sup in the active chassis fails. The standby becomes active (without a reboot?). You replace the failed sup, boot it up, and it becomes the secondary.

No loss of forwarding traffic, other than obviously only having half the bandwidth until the secondary comes online.

I trust you and Reza on this, so what am I missing?

Jon

Leo Laohoo · ‎10-19-2011

I trust you and Reza on this, so what am I missing?

I'll trust Reza to show up soon and give me some pointers, as he seems to be THE master (or at least one of them) in VSS. I only got my hands dirty a few months ago.

Jon Marshall · ‎10-19-2011

Leonardo

Apologies for hijacking your thread but hopefully it has helped.

Jon

leonardo_liberati · ‎10-19-2011

I followed the guide and had zero impact on the network.

I replaced the supervisor with a new one, but I got this error (taken from the active VSS node):

Oct 20 00:21:45: %LINK-3-UPDOWN: Interface TenGigabitEthernet1/5/4, changed state to up

Oct 20 00:21:45: %LINK-3-UPDOWN: Interface TenGigabitEthernet1/5/5, changed state to up

Oct 20 00:21:45: %LINK-SW1_SP-3-UPDOWN: Interface TenGigabitEthernet1/5/4, changed state to up

Oct 20 00:21:45: %LINK-SW1_SP-3-UPDOWN: Interface TenGigabitEthernet1/5/5, changed state to up

Oct 20 00:21:58: %ISSU-SW1_SP-3-PEER_IMAGE_INCOMPATIBLE: Peer image (s72033_sp-ADVENTERPRISEK9_WAN-M), version (12.2.(33)SXI4) on peer uid (-511934427) is incompatible

Oct 20 00:22:09: %VSLP-SW1_SP-5-RRP_ROLE_RESOLVED: Role resolved as ACTIVE by VSLP

Oct 20 00:22:09: %VSL-SW1_SP-5-VSL_CNTRL_LINK: New VSL Control Link Te1/5/5

Oct 20 00:22:10: %VSLP-SW1_SP-5-VSL_UP: Ready for control traffic

Oct 20 00:22:12: %ISSU-SW1_SP-3-PEER_IMAGE_INCOMPATIBLE: Peer image (s72033_sp-ADVENTERPRISEK9_WAN-M), version (12.2.(33)SXI4) on peer uid (-511934427) is incompatible

I tried clearing the BOOT variable in ROMMON mode, but nothing changed. Any suggestions?

Leo Laohoo · ‎10-19-2011

IOS and/or feature set not the same?

leonardo_liberati · ‎10-19-2011

They are the same. I copied the IOS image from the active supervisor to a CompactFlash card and then followed the Cisco procedure.

Jon Marshall · ‎10-19-2011

Leonardo

Is that the only image you had on the active sup? Are you sure you copied the one that the active is currently running?

Check the image on the active supervisor, i.e. which one it is actually running.
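As a quick aid for that check, here is a minimal Python sketch that pulls the peer image name and version out of the `PEER_IMAGE_INCOMPATIBLE` messages, so they can be compared by eye with what the active supervisor reports. The regular expression is based only on the log lines pasted in this thread, and the function name `peer_images` is made up for illustration; other releases may format the message differently.

```python
import re

# Pattern derived from the ISSU message pasted in this thread (an assumption:
# other IOS releases may word or format the message differently).
PEER_IMAGE_RE = re.compile(
    r"%ISSU-\S*PEER_IMAGE_INCOMPATIBLE: "
    r"Peer image \((?P<image>[^)]+)\), "
    r"version \((?P<version>.+?)\) on peer uid"
)


def peer_images(log_text):
    """Return the set of (image, version) pairs reported as incompatible."""
    return {
        (m.group("image"), m.group("version"))
        for m in PEER_IMAGE_RE.finditer(log_text)
    }


if __name__ == "__main__":
    sample = (
        "Oct 20 00:21:58: %ISSU-SW1_SP-3-PEER_IMAGE_INCOMPATIBLE: Peer image "
        "(s72033_sp-ADVENTERPRISEK9_WAN-M), version (12.2.(33)SXI4) on peer "
        "uid (-511934427) is incompatible\n"
    )
    for image, version in sorted(peer_images(sample)):
        print(f"peer runs {image} version {version}")
```

The script only reports what the peer is running; deciding whether that matches the active supervisor still means checking the active's own running image by hand.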

Jon

Leo Laohoo · ‎10-19-2011

Hey Jon,

Can I tell you the story about the nightmare we had trying to upgrade the IOS on a pair of chassis rigged for VSS?

So we followed the documentation about upgrading a VSS. In every case, when we told the blades to run the new IOS as per the documentation, the secondary would boot the new IOS and wait for the first one to come up with the new IOS. So far so good, right? Next, one of the two would boot into the OLD IOS. Then the second one would boot into the OLD IOS.

Anyway, three of us tried the same procedure and got the same or a similar result: after a few minutes the VSS would roll back to the OLD IOS.

After hours of screaming at the computer monitor, we tried low-tech:

1. Let the VSS stabilize with the old IOS.

2. Copy the new IOS to the primary (sup-bootdisk:) and secondary (slavesup-bootdisk:).

3. Configure the boot string.

4. Reboot the VSS.

This method works for us.
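The four steps above can be sketched as a small Python helper that prints the command sequence for a given image. This is only an outline under assumptions: the image file name and TFTP URL are hypothetical placeholders, clearing the old boot statements with `no boot system` is my assumption about step 3, and the exact syntax should be checked against the documentation for your release before use.

```python
def low_tech_upgrade_commands(image, source_url):
    """Return IOS command lines for the copy / boot string / reboot steps.

    image:      file name of the new IOS image (hypothetical placeholder).
    source_url: where the image is copied from (hypothetical placeholder).
    """
    return [
        # Step 2: copy the new image to both supervisors' bootdisks.
        f"copy {source_url} sup-bootdisk:{image}",
        f"copy sup-bootdisk:{image} slavesup-bootdisk:{image}",
        # Step 3: point the boot string at the new image and save it.
        # Removing the old boot statements first is an assumption.
        "configure terminal",
        "no boot system",
        f"boot system flash sup-bootdisk:{image}",
        "end",
        "copy running-config startup-config",
        # Step 4: reboot the VSS.
        "reload",
    ]


if __name__ == "__main__":
    for line in low_tech_upgrade_commands(
        "new-image.bin", "tftp://192.0.2.10/new-image.bin"
    ):
        print(line)
```

Step 1 (letting the VSS stabilize on the old IOS) has no command of its own, so it is not part of the list.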

Jon Marshall · ‎10-19-2011

Leo

From the VSS Q & A

Q. What is enhanced fast software upgrade (eFSU)?

A. eFSU is a mechanism to perform software upgrades while maintaining high availability. It leverages the existing features of Nonstop Forwarding (NSF) and Stateful Switchover (SSO) and significantly reduces the downtime to less than 200 ms.

So it sounds like it took a bit longer than 200 ms then.

That's the trouble with these sorts of systems. They are so critical to the network that you don't get the chance to actually test any of it. You just have to schedule an outage and hope it all works. The last place I worked was big enough to get access to Cisco labs, so we could just book in there and test as much as we liked.

Still didn't guarantee it would work, but it sure made you feel a bit better!

Jon

Leo Laohoo · ‎10-19-2011

Thanks Jon,

I believe the 2nd attempt was using this process. Didn't work either.

We had an entire weekend's worth of power works, so we were fine with our outage. Getting the correct process down was the issue.

Reza Sharifi · ‎10-19-2011

Jon, Leo,

Sorry guys for getting back to this thread so late. I was doing work email and eating dinner.

We tested the VSS extensively. Here is my take on it.

First, don't even try using ISSU if you are running a modular image and trying to go to a non-modular one, or vice versa. It just doesn't work. We finally opened a ticket with TAC and got in contact with the IOS BU, only to find out that it is not possible.

Second, ISSU works fine (most of the time) if you are upgrading using the same image type.

The way we tested this was by using a PC connected to an access switch that was uplinked to a pair of 6500s (VSS).

We configured a loopback address on the VSS and ran a continuous ping from the PC. During the upgrade, the maximum number of pings we missed was 2. Now, as I said in the beginning, this has worked most of the time, because we had problems using ISSU with SXI2a and earlier versions. Starting with SXI4, we have seen this work at least 95% of the time. The reason I mentioned an outage window in my first post is that there is that 5% chance of things going wrong, and because of it I would not risk doing this during production hours. I have not tested ISSU with SXI5 and SXJ and don't know if there is any issue with these images when upgrading using ISSU. At the end of the day, in my opinion, it all comes down to the IOS version you are using.
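For anyone repeating this kind of test, here is a minimal Python sketch of how the ping results could be summarised afterwards. It assumes you have already turned the ping output into a list of booleans (reply or timeout), and the one-second probe interval is an assumption, not something stated above. The function name `loss_summary` is made up for illustration.

```python
def loss_summary(replies, interval_s=1.0):
    """Summarise a continuous ping run.

    replies:    iterable of booleans, True for a reply, False for a timeout.
    interval_s: seconds between probes (assumed to be 1 s here).
    """
    total = missed = longest = current = 0
    for ok in replies:
        total += 1
        if ok:
            current = 0  # a reply ends the current run of timeouts
        else:
            missed += 1
            current += 1
            longest = max(longest, current)
    return {
        "sent": total,
        "missed": missed,
        "longest_gap": longest,
        # Rough estimate only: the real outage may be somewhat shorter or
        # longer than the number of consecutive timeouts times the interval.
        "approx_outage_s": longest * interval_s,
    }


if __name__ == "__main__":
    # Hypothetical run: 100 probes with two consecutive timeouts mid-upgrade.
    run = [True] * 50 + [False, False] + [True] * 48
    print(loss_summary(run))
```

With one-second probes the measurement is too coarse to confirm or refute a sub-second figure, so a shorter interval would be needed to compare against numbers like 200 ms.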

Hope this was helpful guys.

Thanks,

Reza

Leo Laohoo · ‎10-19-2011

Very helpful Reza. Thanks!

Jon Marshall · ‎10-20-2011

Thanks Reza, much appreciated.

Another question, if you don't mind. As you can see from the thread, preemption was discussed.

Take the scenario where you have one sup per chassis and the active sup fails. The standby sup takes over and traffic forwarding is uninterrupted, albeit halved. You replace the failed sup and it boots up. If this sup was configured to preempt, what actually happens?

I don't understand a couple of things -

1) If the new sup preempts, then I would have thought that meant it was ready to begin forwarding. So whether the other chassis reloads or not, why is there a possible outage?

2) Does the other chassis have to reload when it is preempted? If so, do you know why? Doesn't seem logical to me.

Jon

Reza Sharifi · ‎10-20-2011

Jon,

We never tested preemption. We were told to stay away from it, because it causes more rebooting and really does not provide any benefit, since both switches are logically one.

So, when the new sup comes online, it will be the standby device. If preemption is configured, then the standby will take over and become the active device. At this point the previous active switch reboots, and when it has reloaded, it will be the new standby. In any case, there should not be any outage in any of these scenarios, since you always have one switch forwarding traffic while the other one is rebooting.
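To make the sequence easier to follow, here is a toy Python model of the role changes as described in this post. It is not how VSLP role resolution is implemented, and the switch names `sw1` and `sw2` and both function names are made up for illustration. It only checks that some switch holds the active role at every step.

```python
def replace_failed_sup(preempt):
    """Walk through the described sequence and return the role snapshots."""
    roles = {"sw1": "active", "sw2": "standby"}
    steps = []

    # The supervisor in the active chassis fails; the standby takes over.
    roles["sw1"], roles["sw2"] = "down", "active"
    steps.append(dict(roles))

    # The replacement supervisor boots and joins as the standby.
    roles["sw1"] = "standby"
    steps.append(dict(roles))

    if preempt:
        # With preemption the rejoined switch takes over the active role
        # and the previous active switch reloads...
        roles["sw1"], roles["sw2"] = "active", "reloading"
        steps.append(dict(roles))
        # ...and comes back as the new standby.
        roles["sw2"] = "standby"
        steps.append(dict(roles))
    return steps


def always_forwarding(steps):
    """True if at least one switch is active in every snapshot."""
    return all("active" in step.values() for step in steps)


if __name__ == "__main__":
    for preempt in (False, True):
        steps = replace_failed_sup(preempt)
        print(f"preempt={preempt}: {len(steps)} steps, "
              f"always forwarding: {always_forwarding(steps)}")
```

The model shows why preemption adds an extra reload without changing the end result: in both cases one switch is always active, but with preemption the chassis that was forwarding alone is rebooted a second time.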

Reza

Jon Marshall · ‎10-20-2011

Thanks again Reza.

Jon