Odd behavior with Nexus in failover recovery?

Andrew Cormier · ‎04-11-2014

Hi,

Not quite sure what to make of this.

We have a couple of ESX servers that each have one 10G NIC in Nex1 and a second in Nex2. The server links are not port-channeled. The two Nexus switches are linked by two 10G connections, channeled and vPC'ed. Each Nexus has two 1G connections port-channeled to a 3750 (our main access switch).

When we shut down one nexus all is good. No loss of connectivity, maybe a little blip but for the most part all is good.

When we bring the shut-down Nexus back up, we are good UNTIL it actually comes online. Then all the nodes lose connectivity for about 30-60 seconds.

The only thing we see in the logs is this. The suspension lasts 9 seconds (as opposed to 30-60), but the timing is similar. (Eth1/47 is one of the inter-switch trunks.)

%ETHPORT-5-IF_UP: Interface Ethernet1/47 is up in mode trunk
%ETHPORT-3-IF_ERROR_VLANS_SUSPENDED: VLANs 1,34 on Interface port-channel5 are being suspended. (Reason: vPC peer is not reachable over cfs)

and 9 seconds later they are unsuspended:

ERROR_VLANS_REMOVED: VLANs 1,34 on Interface port-channel5 are removed from suspended state.

I don't know if this is normal, but it would be nice if we could have two switches for HA.

Thoughts?

Thanks

sean_evershed · ‎04-11-2014

Sounds like you had a peer link failure during the test.

I suggest that you configure auto-recovery on both your vPC peers then re-test.

See the config example below:

http://www.cisco.com/c/en/us/support/docs/switches/nexus-7000-series-switches/116187-configure-vpc-00.html
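
For quick reference, a minimal sketch of enabling auto-recovery on a vPC peer. The domain ID 10 is a placeholder and not taken from this thread; use the domain ID already configured on your switches, and apply the same change on both peers.

```
! Domain ID 10 is a placeholder (assumption); substitute your own
vpc domain 10
  auto-recovery
```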

Andrew Cormier · ‎04-12-2014

Thanks Sean

I had enabled auto-recovery on both switches yesterday, but no luck.

The odd thing is this (and I tested again): if I reboot switch 2 and ping both one of the hosts and switch 2, I see the host up and the switch down (normal). The second that switch 2 starts replying to pings, the host goes offline for about a minute. It is like switch 1 is saying "OH, my vPC peer is back, I am going to shut down my ports and wait until all the vPCs are up".

Andrew Cormier · ‎04-12-2014

NOTE: We see the same behavior regardless of which switch is restarted. I am wondering if this is related to role priority? Stumped :(
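
One way to look into the role question, as a sketch: these standard NX-OS show commands report the current vPC role and priority, the peer and vPC status, and the keepalive state. Running them on both switches before the reload and again once the peer is back would show whether the roles change when the peer returns.

```
show vpc role
show vpc brief
show vpc peer-keepalive
```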

Brian McPhillips · ‎03-27-2019

Hi Andrew,

Did you get this resolved? I am seeing the same issue with our customer. Primary is powered off, a few pings are missed. Primary is powered back on and we miss 20-30 pings during the bootup. I did see the traffic switch back to VPC port channel, but HSRP did not fail back at the same time.

Auto recovery is not enabled in this setup.

Mark Malone · ‎03-27-2019

Hi
Is peer-switch enabled in the vPC domain for convergence? If not, I would enable it and try again to see if it improves convergence during failures.
If your traffic is coming from NAS servers, make sure peer-gateway is set in the vPC domain too.

Brian McPhillips · ‎03-27-2019

Thanks Mark for the quick reply.

peer-switch is enabled on both. Strangely, peer-gateway is only enabled on the primary and not the secondary. I'm guessing it needs to be on both.

peer-config-check-bypass and delay restore 150 are on both. This setup was configured by someone else, so I am trying to remedy this issue. Is it worth configuring auto-recovery also? It seems to be best practice from what I can see.

Mark Malone · ‎03-27-2019

Hi
Yes, both vPC configs should basically mirror each other, apart from roles obviously, and auto-recovery is good practice and should be in place too. What version of NX-OS are they on? Turn on ip arp synchronize too for convergence; this speeds up rebuilding the address (ARP) tables during a convergence.
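
Pulling the suggestions in this thread together, here is a sketch of what a mirrored vPC domain section might look like on each peer. The domain ID 10 is a placeholder (an assumption, not from this thread), and the delay restore value of 150 is the one Brian reported. Check each command against the NX-OS version and platform in use before applying it.

```
! Apply the same settings on both vPC peers
! Domain ID 10 is a placeholder (assumption); use the existing domain ID
vpc domain 10
  peer-switch
  peer-gateway
  auto-recovery
  delay restore 150
  ip arp synchronize
```

Comparing the output of show running-config vpc from each peer afterwards is a simple way to confirm the two sides match.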