08-18-2020 06:03 AM
Hello,
In our DC we're using the vPC feature with pairs of Nexus switches (N9K, N6K, N5K). Sometimes there's an urgent need to reload one of the switches in a pair - the reason could be an upgrade or device misbehavior (for example because of more than 13xx days of uptime, etc.). Some of the servers connected to these Nexus switches run a cloud (OpenStack) service with Ceph storage, and even the smallest traffic disruption causes a big disruption for those storage servers. So I started to think about how to minimize the traffic loss during a reload. I noticed that the traffic disruption appears when the reloaded device starts to boot up.
Theoretically it could be that the vPC port-channels come up before the IGP routing adjacencies are up - that would explain why the loss occurs: the other devices start sending traffic down the port-channel leg that just came up, while the reloaded switch doesn't yet know any L3 routes. But this is just a guess; I don't know which element starts first when a Nexus device boots up. Maybe you do? Another theory: if routing starts before vPC, we would be able to work around the problem by adding the OSPF "max-metric router-lsa on-startup" command to alter the OSPF costs.
Maybe you know some tips and tricks to prevent traffic loss in situations like these? Thanks
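To illustrate the OSPF workaround I have in mind - something roughly like this (just a sketch; the OSPF instance tag "1" and the 300-second announce time are example values that would have to be verified against our platforms and NX-OS versions):

! on each vPC switch: advertise max metric for a while after boot,
! so the freshly reloaded switch is not used for transit until OSPF has converged
router ospf 1
  max-metric router-lsa on-startup 300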
08-18-2020 06:53 AM
Technically, if the servers are dual-homed you should not see any interruption in service.
IGP peering needs to be done with the IP address bound to a physical interface, not with a VIP - peering with a VIP is not recommended.
Do you have a high-level network diagram to help understand more? When did the last reboot take place, and which service did you lose?
08-19-2020 01:17 AM
Hello,
Thanks for the reply - by IGP peering I meant the vPC switches' Layer 3 uplinks. The diagram looks like this:
It's hard for me to understand how it would be possible not to experience any packet drops when one of a server's port-channel member ports goes down. AFAIK there's no graceful removal of a physical interface from a port-channel.
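The only workaround I can think of today is to drain the switch by hand before the reload, e.g. (a rough sketch - port-channel 101 is just an example of one server-facing vPC leg):

! on the switch that is about to be reloaded
interface port-channel101
  shutdown
! the server's bond/LACP then fails over to the leg on the other vPC peer,
! and only afterwards is the switch reloaded

But that is manual and still relies on the server-side bonding reacting quickly.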
08-19-2020 04:21 AM
Personally, for better high availability I prefer FEX dual-homed to both parent devices -
in your case the FEX are not dual-homed (that is also a supported deployment method, as long as its behavior is understood), so failover control is left with the server.
Attached are the deployment options.
Also, if the parent devices run a Layer 3 IGP, a separate link is required - you cannot use the vPC peer link for IGP peering (that is a limitation to be aware of, if it applies to your case).
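For illustration, a dedicated routed link for the IGP peering between the parent switches could look roughly like this (a sketch only - the interface, addressing, and OSPF instance tag are example values):

! dedicated point-to-point L3 link between the two parent switches
feature ospf
router ospf 1
interface Ethernet1/49
  no switchport
  ip address 10.1.1.1/30
  ip ospf network point-to-point
  ip router ospf 1 area 0.0.0.0
  no shutdown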
08-19-2020 07:35 AM
Thanks. From a design perspective we're using separate L3 links and a separate vPC peer link per vPC switch pair. Still, since the servers are attached to the Nexus switches via Layer 2 (port-channels), I think it may be that when a switch is rebooted the physical interfaces come up faster than the switch can establish its OSPF neighborships, and that would explain the loss of a few packets.
Would you agree? Or maybe you know the internal boot-up sequence of the switch - which elements are brought up first? ;)
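For what it's worth, one knob I came across that seems aimed at exactly this ordering problem is the vPC delay restore timer, which holds the vPC legs (and optionally the SVIs) down for a while after the switch comes back - roughly like this (a sketch only; the domain ID and timer values are examples, and the defaults/ranges would need to be checked for our NX-OS versions):

! on each vPC switch: keep the vPC member ports down for a while after reload,
! giving the routing protocol time to converge before traffic is attracted back
vpc domain 10
  delay restore 240
  delay restore interface-vlan 240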
08-19-2020 08:10 AM
A loss of 1 or 2 pings is to be expected while the device convergence takes place.
But the application should be able to cope with that, I guess.
08-20-2020 01:00 AM
Thanks, but I'm afraid the Ceph storage can't cope with even a 1-2 ping loss.