Re: N9K HW Refresh with GIR for leafs running VXLAN and vPC

malcolmsalmons · 02-28-2025

Hi

We're about to start a HW refresh project to replace EoL N9Ks with new N9K hardware, so it's like-for-like in terms of ports etc.

This is for VXLAN leafs running OSPF, MP-BGP and vPC. For the replacement we'd like to cause minimal disruption to dual-attached servers, i.e. one connection to each leaf in the vPC pair.

I've been looking at GIR (putting the switches in maintenance mode prior to swap-out etc.) and am interested to know if anyone's done this and, if so, what kind of impact it had on the connected servers.

Any thoughts appreciated.

Thanks

Malc

AshSe · 03-04-2025

Hello @malcolmsalmons

Here is my one cent for this maintenance window:

When performing a hardware refresh for Nexus 9000 (N9K) switches in a VXLAN fabric with OSPF, MP-BGP, and vPC, minimizing disruption to dual-attached servers is critical. Using features like Graceful Insertion and Removal (GIR) and Maintenance Mode can help reduce the impact, but there are still some considerations and potential impacts to be aware of.

Key Considerations for Minimal Disruption

vPC Dual-Attached Servers:
1. Since your servers are dual-attached (one connection to each leaf in the vPC pair), the goal is to ensure that at least one path remains active at all times.
2. vPC provides redundancy, so as long as one switch in the vPC pair remains operational, traffic should continue to flow without significant disruption.
GIR and Maintenance Mode:
1. GIR allows you to gracefully remove a switch from the network by isolating it from the control plane and data plane while maintaining existing traffic flows as much as possible.
2. When you put a switch into Maintenance Mode, it isolates (or, depending on the maintenance profile, shuts down) its routing protocols such as OSPF and BGP, so that neighbors stop sending traffic through it. This ensures that traffic is rerouted to other available paths before the switch is powered down or removed.
3. For vPC, the peer switch will take over the traffic for the dual-attached servers when one switch is in Maintenance Mode.
Impact on Servers:
1. Minimal Impact Expected: If the servers are properly dual-attached and the vPC configuration is healthy, the servers should not experience significant disruption. Traffic will fail over to the remaining active switch in the vPC pair.
2. Potential Brief Disruption: There may be a brief moment of packet loss during the failover process, especially if the vPC peer-link or keepalive link is not functioning optimally. Ensure these links are healthy before starting the process.
3. Single-Attached Devices: If there are any single-attached devices connected to the switch being replaced, they will lose connectivity during the process.
Control Plane Protocols (OSPF and MP-BGP):
1. When a switch is placed into Maintenance Mode, it will withdraw its OSPF and BGP routes. This can cause a brief reconvergence in the network, but the impact should be minimal if the rest of the fabric is healthy and redundant.
2. Ensure that the BGP graceful restart feature is enabled, as this can help minimize disruption during the control plane reconvergence.
vPC Role Considerations:
1. If the switch being replaced is the vPC primary, the secondary switch will take over the primary role. Ensure that the vPC role transition is smooth and that the peer-link and keepalive are stable.
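2. If the vPC peer-link goes down while the peer-keepalive is still up, the secondary switch suspends its vPC member ports. If both the peer-link and the keepalive are lost, the pair can end up in a split-brain (dual-active) scenario. Either case leads to traffic disruption, so validate the health of the vPC domain before proceeding.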

Recommended Steps for Minimal Disruption

Pre-Maintenance Checks:
1. Verify the health of the vPC domain (peer-link and keepalive).
2. Check the status of OSPF and BGP neighbors to ensure there are no existing issues.
3. Confirm that all servers are properly dual-attached to both switches in the vPC pair.
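As a rough sketch of these pre-checks, the following NX-OS show commands are one way to gather the state on both vPC peers. This assumes an EVPN control plane (the l2vpn evpn address family), which the original post does not state explicitly. Output formats and command availability vary by platform and release, so treat it as a starting point and not a complete checklist.

```
! vPC domain health: peer status, role, keepalive, consistency
show vpc
show vpc role
show vpc peer-keepalive
show vpc consistency-parameters global

! Server-facing port-channels: confirm members are up on both peers
show port-channel summary

! Underlay and overlay control plane
show ip ospf neighbors
show bgp l2vpn evpn summary
show nve peers
```

Saving the output from both switches before the window gives you a baseline to compare against once the new hardware is in.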
Graceful Removal of the Switch:
1. Place the switch into Maintenance Mode using the system mode maintenance command (entered in global configuration mode).
2. Verify that the switch has withdrawn its routes and that traffic has shifted to the remaining active switch in the vPC pair.
3. Monitor the network for any anomalies or packet loss.
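A minimal sketch of the graceful-removal step, assuming the default maintenance profile that NX-OS generates. The command prompts for confirmation, and what you see afterwards depends on whether the profile isolates the protocols (adjacencies typically stay up) or shuts them down, so it is worth reviewing the profile first.

```
! Review what the maintenance profile will do before entering it
show maintenance profile

! Enter maintenance mode (asks for confirmation)
configure terminal
 system mode maintenance
end

! Confirm the mode, then check how neighbors and vPC reflect the change
show system mode
show ip ospf neighbors
show bgp l2vpn evpn summary
show vpc
```

Only power the switch down once the server-facing traffic has visibly moved to the peer.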
Hardware Replacement:
1. Power down the switch and replace the hardware.
2. Configure the new switch with the same settings as the old one (vPC domain, OSPF, BGP, VXLAN, etc.).
Reintegration:
1. Bring the new switch online and verify its configuration.
2. Take the switch out of Maintenance Mode (no system mode maintenance) and allow it to rejoin the network.
3. Verify that OSPF and BGP neighbors are established and that the vPC pair is healthy.
Post-Maintenance Validation:
1. Check the vPC status to ensure both switches are operational and synchronized.
2. Verify that all servers have connectivity and that traffic is flowing as expected.
3. Monitor the network for any anomalies.
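For the reintegration and validation steps, a sketch along these lines may help. It assumes the replacement switch was brought up in maintenance mode, which is my assumption and not something GIR requires. If it was not, skip the first part and go straight to the checks.

```
! Return the switch to normal mode (asks for confirmation)
configure terminal
 no system mode maintenance
end
show system mode

! Confirm vPC, underlay and overlay have recovered
show vpc
show vpc consistency-parameters global
show ip ospf neighbors
show bgp l2vpn evpn summary
show nve peers
```

Comparing this output with what you captured before the window is a quick way to spot a missing neighbor or a vPC consistency mismatch.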

Potential Risks and Mitigations

vPC Peer-Link Failure:
1. Risk: If the vPC peer-link fails during the process, the secondary switch suspends its vPC member ports; if the keepalive is lost as well, it can cause a split-brain scenario.
2. Mitigation: Validate the health of the peer-link and keepalive before starting. Have a rollback plan in case of issues.
Control Plane Reconvergence:
1. Risk: OSPF and BGP reconvergence can cause brief traffic disruption.
2. Mitigation: Use BGP graceful restart and ensure the rest of the fabric is healthy.
Configuration Mismatch:
1. Risk: The new switch may have configuration mismatches, causing vPC or protocol issues.
2. Mitigation: Pre-stage the configuration and validate it before reintegration.

Summary

Using GIR and Maintenance Mode is a good approach to minimize disruption during the hardware refresh. For dual-attached servers, the impact should be minimal as long as the vPC domain is healthy and the failover process is smooth. However, you should still expect a brief moment of packet loss during the transition. Proper planning, pre-maintenance checks, and post-maintenance validation are key to ensuring a successful and low-impact hardware replacement.

Hope This Helps!!!

AshSe

malcolmsalmons · 03-04-2025

Hi AshSe

Thanks a lot for the comprehensive response, that aligns with what I've read.

Out of interest, have you undertaken a HW swap using this method, and if so, what kind of hit do you see on end hosts? I.e. is it a sub-second thing that servers etc. won't see as a problem, or is it still long enough to cause issues for servers, storage etc.?

Thanks

Malc