Access redundancy testing on Nexus 5000 and VPC

franciscohs
Level 1

I'm doing some redundancy tests for servers connected through CNAs (only Ethernet for the moment) to two Nexus 5010 switches with vPC. The test lab looks like the attached file. Additionally, both Nexus 5K switches are connected through the management port, and the vPC keepalive uses it. These are simple tests for now: the test machine ping floods the ESX host (about 1000 pings per second) and I count how many pings I lose in each case, which gives me a rough downtime in milliseconds.
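For reference, the test traffic is nothing fancy, just a fast ping from the Linux test machine, roughly along these lines (sub-second intervals need root, the address is obviously lab-specific, and the exact tool/interval is whatever you have handy):

    ping -i 0.001 <esx-host-ip>

At about 1000 pings per second, each lost reply corresponds to roughly 1 ms of downtime, so counting the losses gives the failover time.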

So far I've tested taking down individual links to the servers, the uplinks, and the vPC peer link, and failover times have been very good, as expected.

My main problem is in a scenario where an N5K fails. When I take the Nexus down, failover is very quick, as expected, similar to any other link failure.

The problem comes when the Nexus comes back up. At that point I see a downtime of over 3 seconds and I'm not sure why. During that outage the vPC keepalive shows the other device up; in fact, the downtime seems to start when the vPC keepalive comes up, so I suspect this has to do with some kind of vPC inconsistency (there isn't much time to test and each reboot takes a long time). Changing the keepalive timers doesn't help much either, so it doesn't seem to be related to them.
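For anyone wanting to reproduce this, the obvious things to look at right after the rebooted switch comes back are the standard vPC show commands (nothing exotic):

    show vpc
    show vpc peer-keepalive
    show vpc consistency-parameters global
    show port-channel summary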

Can anyone explain what might be the root of this problem and how to overcome it?

Thanks

Francisco

15 Replies

Darren Ramsey
Level 4

Is your ESX server set to use link up/down or beaconing?

Darren, the ESX is set to use link status.

We saw the same thing during our initial HA testing of our N5K deployment. Basically, the ESX server sees link up and starts to forward traffic before vPC allows the rebooted switch to forward north/south traffic. What you get is a black hole for several seconds.

Set it to beaconing, reboot the N5K, and see if that makes a difference.

http://kb.vmware.com/kb/1005577
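If you'd rather script the change than click through the vSphere client, on builds with the newer esxcli namespaces something along these lines should switch a standard vSwitch to beacon probing (the vSwitch name is just an example, and double-check the exact option names against your ESX version):

    esxcli network vswitch standard policy failover set --vswitch-name vSwitch0 --failure-detection beacon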

By enabling the beaconing setting and running PowerPath/VE for multipath Fibre Channel, we can reboot a single N5K in a vPC pair and lose no Ethernet or FC frames.

With beaconing, just make sure you run code later than 4.2(1)N1(1). There is an issue in NX-OS on the N5K where we mark the frame with an incorrect length at ingress and stomp it with a CRC; egress then detects the CRC error, which in turn could cause some performance degradation. We have seen a few customers run into this issue. It is documented in CSCtb96758.

Thanks, Vinayak

Good info... We POC'ed the N5K and 2232 on the DeeWhy beta and then went directly to 4.2 for production, so that's why I never saw that bug in the lab.

Thanks, that's good info. For the moment we are using port channels between the Nexus switches and the ESX hosts, which doesn't allow beaconing.

In principle I used trunks, since I think this is the most consistent way of configuring the networking part of the solution: every switch connection is a port channel (in this case, Nexus switch to Distributed Switch).

We may need to reconsider this, and this is a good reason to. We are not using the Nexus 1000V, but one of the reasons I wanted to try it out was to be able to use LACP on the port channels.

What is the official Cisco recommendation about this?

If you aren't using the N1KV, then unfortunately we can't do LACP binding. It would be a regular EtherChannel with mode on.

Sample Configuration of Etherchannel/ Link Aggregation with ESX and Cisco/HP Switches

http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1004048

Supported Cisco configuration: EtherChannel Mode ON (enable EtherChannel only)
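On the Nexus side that translates to a static port channel configured as a vPC member port, roughly like this (the interface, port-channel, and vpc numbers are just examples):

    interface port-channel10
      switchport mode trunk
      vpc 10

    interface Ethernet1/10
      switchport mode trunk
      channel-group 10 mode on

The ESX vSwitch then needs "Route based on IP hash" load balancing so the static channel works, per the KB above.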

Well, yes, that's exactly what I'm doing right now, and I'm aware of the LACP limitation.

I still wonder if LACP would help, though. I just tried disconnecting the host from one of the N5Ks, leaving the vPC configured on both, and if I reboot the device with no ESX connected, I still lose about 24 pings spaced 50 ms apart to the host when the N5K comes up. That's for a host that is a member of the vPC but not connected to the rebooted device at all.

Is there an explanation for this?

So I take it you are doing "Route based on IP hash" and static 802.3ad? LACP on both the Nexus and the ESX would likely fix the packet loss issue, but as you pointed out, you need the 1000V.
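If it helps, the IP hash setting lives in the vSwitch NIC teaming policy; on newer ESXi builds it can also be set from the shell with something like this (the vSwitch name is an example, and the option names are from memory, so verify against your version):

    esxcli network vswitch standard policy failover set --vswitch-name vSwitch0 --load-balancing iphash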

Have not tested LACP with the 1000V, but I would be interested in hearing the results during and after an N5K reload.

Of course, with ISSU the need for a reload should be minimal, but you have to plan for the worst case and know your failover times in the data center.

That's correct, it would be static 802.3ad, or a trunk in HP's case. I don't have the testing results readily available with me; hopefully someone on the forum can respond about that.

Thanks, Vinayak

Francisco, I see you updated the thread but I don't see your response. Please let us know if you have further questions.

I replied to an earlier message of yours and the message ended up in the middle of the linear-looking thread. The message is higher up in the thread. Anyway, my response was:

Well, yes, that's exactly what I'm doing right now, and I'm aware of the LACP limitation.

I still wonder if LACP would help, though. I just tried disconnecting the host from one of the N5Ks, leaving the vPC configured on both, and if I reboot the device with no ESX connected, I still lose about 24 pings spaced 50 ms apart to the host when the N5K comes up. That's for a host that is a member of the vPC but not connected to the rebooted device at all.

Is there an explanation for this?

What release of code are you running on the Nexus switches? Also, the Nexus switch you rebooted (the one from which you removed the connection to the host), was it the vPC primary initially?
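If it's easier, the role can be checked on either peer with the standard show commands:

    show vpc role
    show vpc brief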

Version is 4.2(1)N2(1a)

The rebooted device is the vPC secondary.