Nexus 5000 vPC primary failover

eric.ahernandez · ‎03-12-2012

Good day all,

We have seven Nexus 2000 FEXes dual-homed to two Cisco Nexus 5548UP switches using vPC in production. They are deployed as a routed access layer because the Nexus 5000s have the Layer 3 daughter card and run EIGRP routing and HSRP.

This weekend we ran some high-availability tests. The first was to take down the vPC peer-link, and it behaved as stated in the documentation: the vPC secondary shut down its vPC member ports, we restored the peer-link, and all was good, no problems whatsoever. The second test was shutting down the vPC secondary Nexus 5000, and we had no problems there either; we powered the Nexus 5000 on again and all was good.

The problem was with our final test. We shut down the Nexus 5000 configured as vPC primary, with no problems. Then we powered the switch back on and the vPC adjacency formed OK. After the delay the vPCs came up and the FEXes started to go online one by one. Just when I was thinking "what a wonderful technology", the final FEX came online and something happened: communication across the whole datacenter became intermittent. 6 to 10 pings got through, then 10 pings timed out, and so on, and it happened to all the devices in the datacenter. I didn't have much time to troubleshoot because it affected the whole datacenter, but the show vpc output seemed OK and all the FEXes appeared as Online, so I don't really know what happened. I had to reload the vPC primary Nexus 5000, and when it came back up everything worked.

Have any of you encountered this problem? I thought the failure of a vPC peer (primary or secondary) would be seamless, or at least that's what Cisco states.

Both N5Ks have the same specs and NX-OS version:

Software

BIOS: version 3.5.0

loader: version N/A

kickstart: version 5.0(3)N2(1)

system: version 5.0(3)N2(1)

power-seq: Module 1: version v1.0

Module 2: version v1.0

Module 3: version v5.0

uC: version v1.2.0.1

SFP uC: Module 1: v1.0.0.0

BIOS compile time: 02/03/2011

kickstart image file is: bootflash:/n5000-uk9-kickstart.5.0.3.N2.1.bin

kickstart compile time: 6/13/2011 6:00:00 [06/13/2011 07:43:33]

system image file is: bootflash:/n5000-uk9.5.0.3.N2.1.bin

system compile time: 6/13/2011 6:00:00 [06/13/2011 09:33:42]

Hardware

cisco Nexus5548 Chassis ("O2 32X10GE/Modular Universal Platform Supervisor")

Intel(R) Xeon(R) CPU with 8299528 kB of memory.

mikemu · ‎03-13-2012

Eric,

Hard to say what's going on without looking at the configs and the entire setup. But once the other 5K comes up and they begin to sync ARP info and so on, it's not out of the realm of possibility to lose a bit of ICMP. If it keeps occurring after a minute or so, though, something in the config needs a tweak, possibly STP.

eric.ahernandez · ‎03-13-2012

Hi Michael, thanks for your response.

Yeah, the problem kept happening for a while: ICMP was lost for about 5 seconds, then got through for another 5, and this went on for 6 or 7 minutes, so I reloaded the vPC primary N5K (and then I ran into bug CSCti82166, but that's a different story). I was thinking it was something to do with STP, just like you pointed out, but I thought vPC didn't rely on STP except when the vPC is torn down. Or is it really necessary?

Anyway, what would be the best practice for STP on these Nexus 5000 vPC pairs? I saw some documents that use the peer-switch command to make both peer switches share the same priority and act as root whenever the other peer fails, but I think they removed that command because my N5Ks don't accept it. The other best practice I found was to configure STP to match the vPC roles, making the vPC primary the root primary and the vPC secondary the root secondary. If you'd like, I could post some config.
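For reference, the second approach (STP priorities aligned with the vPC roles) might look roughly like the sketch below. The vPC domain ID, VLAN range, and priority values are placeholders chosen for illustration; they are not taken from the attached configs.

```
! On the switch intended to be vPC primary and STP root
! (hypothetical domain 1 and VLANs 1-100)
vpc domain 1
  role priority 1000
spanning-tree vlan 1-100 priority 8192

! On the switch intended to be vPC secondary and backup root
vpc domain 1
  role priority 2000
spanning-tree vlan 1-100 priority 16384
```

The lower vPC role priority is preferred for the primary role and the lower STP priority wins the root election, so both point at the same switch. show vpc role and show spanning-tree root should confirm the result.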

Thanks anyway

mikemu · ‎03-13-2012

Eric,

There are quite a few other discussions on this topic within the support forums that may help. Also, here is a link to review that provides some detail and design/config examples.

eric.ahernandez · ‎03-13-2012

I'm sorry Michael, but I don't see any link. I've been checking the support forum since this happened, but I haven't found anything like what happened to me. I thought that when the vPC primary recovers, the vPC adjacency would form and there would be no need for STP, or at least that's my understanding of how vPC behaves in these kinds of scenarios.

eric.ahernandez · ‎03-13-2012

Attaching a Diagram and the configs of both Nexus 5548UP switches for better understanding.

Prashanth Krishnappa · ‎03-13-2012

Hi Eric

One bug that comes to mind is

http://tools.cisco.com/Support/BugToolKit/search/getBugDetails.do?method=fetchBugDetails&bugId=CSCts26382

If you can, I suggest that you try 5.0(3)N2(2b), which has the fix for this.

eric.ahernandez · ‎03-13-2012

Thanks Prashanth!

That's an idea. One question though: the bug states that "The observed impact is noticed while vPC delay restore is in effect". Could you elaborate on this? As I understand it, this happens while the vPC primary is restoring all its functions. When it happened to our N5Ks, the vPC adjacency was already formed and the vPCs were up; even the FEXes were online. Please correct me if I'm wrong.

Anyway, I'll try to upgrade NX-OS to the latest version, 5.1(3)N1(1a), since I'm also trying to fix bug CSCti82166, and hopefully it will fix this issue as well. It's hard to troubleshoot these kinds of issues.

Prashanth Krishnappa · ‎03-13-2012

Hi Eric

When a vPC secondary switch comes up, all vPCs on it are kept down until the delay restore timer (default 240 seconds on L3 switches) expires. However, if you have any traffic coming into the vPC secondary switch on an L3 interface (which does not have this 240-second delay restore), the expected behavior is to send the traffic over the peer-link so that the vPC primary can send it to the destination across the vPC. But due to CSCts26382, this does not happen for 240 seconds in 5.0(3)N2(1), which leads to blackholing of traffic coming into the vPC secondary on L3 links.
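A minimal sketch of where this timer lives, assuming a hypothetical vPC domain ID of 1. The 240-second value simply restates the default mentioned above; it is not a tuning recommendation.

```
! Hypothetical vPC domain ID; use the one from your own config
vpc domain 1
  delay restore 240

! Check the vPC configuration and the state of the peer-link and vPCs
show running-config vpc
show vpc brief
```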

eric.ahernandez · ‎03-14-2012

Wow, thanks a lot for the explanation, Prashanth. Does this also happen when the vPC primary comes up? That's when I got the problems. When I did the test shutting down the vPC secondary, everything went well; the problem was with the vPC primary shutdown scenario, and it happened when the primary came back up.

Jess Probasco · ‎08-15-2012

Can anyone confirm whether or not the 2K extenders will stay up and use ISSU when upgrading Nexus 5548s running Layer 3 NX-OS to a newer version? The 5548s would be upgraded one at a time, and we have the same setup as above. I just want to confirm that the 2K extenders would not reboot or go offline once the updated 5548 came back online with a newer NX-OS version.

Prashanth Krishnappa · ‎08-15-2012

Hello Jess

If you have the Layer 3 module in the Nexus 5548s, ISSU is not supported and hence the upgrade will be disruptive. However, if your FEX is dual-homed (Active/Active), the FEX will reload when the second 5548 is upgraded.

Thanks

-Prashanth

Jess Probasco · ‎08-15-2012

So all FEXes will reboot and all connected servers will lose network connectivity during that process? Or do the FEXes support a non-disruptive upgrade?

eric.ahernandez · ‎08-15-2012

OK, so I solved it. I upgraded both Nexus 5548UPs (kickstart image and system image) to the most recent version (5.1.3N2.1 I think) and it worked when we did the failover tests.

Also, Jess: when I did the upgrade it was disruptive, as Prashanth said. When you have two Nexus 5500s with vPC configured and your FEXes are dual-homed, you recover connectivity to your servers after you upgrade the second Nexus 5500.

This is how the upgrade went for me: I upgraded one of the Nexus 5500s (the vPC secondary), and the FEXes stayed up until that Nexus 5500 came back online. Then the FEXes went offline because of the version mismatch between the two vPC switches, and they came back online after I upgraded and reloaded the vPC primary.
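In command terms, the upgrade of each switch is roughly the sketch below. The image file names are left as placeholders in angle brackets, since they depend on the target release.

```
! Preview whether the upgrade will be disruptive before committing to it
show install all impact kickstart bootflash:<kickstart-image> system bootflash:<system-image>

! Run the upgrade on this switch (secondary first, then primary)
install all kickstart bootflash:<kickstart-image> system bootflash:<system-image>

! After the switch comes back, check the vPC and FEX state
show vpc brief
show fex
```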

Jess Probasco · ‎08-15-2012

Thanks Eric and Prashanth for the quick replies,

I put together a lab today and ran through the same steps you described above, except that before I installed the new NX-OS on the primary vPC peer I issued a reload fex command and forced each FEX to come up on the upgraded 5500 with the new NX-OS code. The outage was about 2 minutes per FEX and can be better controlled.

I will now do our production network, and the ability to reload FEX by FEX will minimize the outage impact.

I found this document, which helped me out along with this forum.
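The FEX-by-FEX step looks roughly like this sketch, run before upgrading the vPC primary. The FEX number 101 is a hypothetical example, not one from our setup.

```
! List the FEXes and their current state
show fex

! Reload one FEX at a time (101 is a hypothetical FEX number)
reload fex 101

! Wait until it shows Online again before moving on to the next FEX
show fex
```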

http://www.cisco.com/en/US/docs/switches/datacenter/nexus5000/sw/upgrade/503_N1_1/n5k_upgrade_downgrade_503.html#wp641802