04-22-2024 08:37 AM
Hello,
We have deployed a pair of 9500-32C switches in StackWise Virtual. The pair is the aggregation node of a large site and connects to the backbone over two eBGP links (one from each switch).
I noticed a long downtime during ISSU, and since we are deploying more of these pairs to other sites, I prepared a lab as per the drawing below.
The lab runs cat9k_iosxe.17.09.04a.SPA.bin, which was the suggested release at the time.
I am now testing convergence during a forced switchover (redundancy force-switchover), which is part of the ISSU process.
In the following test run, switch 1 was active (eBGP neighbor 10.0.0.1 is connected to it) and switch 2 was standby (eBGP neighbor 10.0.0.3 is connected to it).
BGP is configured to redistribute connected networks, which for this test are the Loopback and Vlan100 (192.168.100.0/24), with the test5 switch SVI at 192.168.100.100. Before the failover, both test3 and test4 can reach test5.
[2024-04-15 16:09:19] test1#show bgp summary
[2024-04-15 16:09:20] <shortened>
[2024-04-15 16:09:20] Neighbor V AS MsgRcvd MsgSent TblVer InQ OutQ Up/Down State/PfxRcd
[2024-04-15 16:09:20] 10.0.0.1 4 65002 29 30 8 0 0 00:22:06 5
[2024-04-15 16:09:20] 10.0.0.3 4 65002 29 30 8 0 0 00:21:17 5
[2024-04-15 16:09:34] test1#show bgp ipv4 unicast neighbors 10.0.0.1 advertised-routes
[2024-04-15 16:09:34] <shortened>
[2024-04-15 16:09:34] Network Next Hop Metric LocPrf Weight Path
[2024-04-15 16:09:34] *> 10.0.0.0/31 0.0.0.0 0 32768 ?
[2024-04-15 16:09:34] *> 10.0.0.2/31 0.0.0.0 0 32768 ?
[2024-04-15 16:09:34] *> 10.0.1.0/31 10.0.0.1 0 0 65002 ?
[2024-04-15 16:09:35] *> 192.168.1.1/32 0.0.0.0 0 32768 ?
[2024-04-15 16:09:35] *> 192.168.1.3/32 10.0.0.1 0 0 65002 ?
[2024-04-15 16:09:35] *> 192.168.1.4/32 10.0.0.1 0 65002 ?
[2024-04-15 16:09:35] *> 192.168.100.0 0.0.0.0 0 32768 ?
[2024-04-15 16:09:35] Total number of prefixes 7
[2024-04-15 16:09:35] test1#show bgp ipv4 unicast neighbors 10.0.0.3 advertised-routes
[2024-04-15 16:09:42] BGP table version is 8, local router ID is 192.168.1.1
[2024-04-15 16:09:42] Status codes: s suppressed, d damped, h history, * valid, > best, i - internal,
[2024-04-15 16:09:42] r RIB-failure, S Stale, m multipath, b backup-path, f RT-Filter,
[2024-04-15 16:09:42] x best-external, a additional-path, c RIB-compressed,
[2024-04-15 16:09:43] t secondary path, L long-lived-stale,
[2024-04-15 16:09:43] Origin codes: i - IGP, e - EGP, ? - incomplete
[2024-04-15 16:09:43] RPKI validation codes: V valid, I invalid, N Not found
[2024-04-15 16:09:43]
[2024-04-15 16:09:43] Network Next Hop Metric LocPrf Weight Path
[2024-04-15 16:09:43] *> 10.0.0.0/31 0.0.0.0 0 32768 ?
[2024-04-15 16:09:43] *> 10.0.0.2/31 0.0.0.0 0 32768 ?
[2024-04-15 16:09:43] *> 10.0.1.0/31 10.0.0.1 0 0 65002 ?
[2024-04-15 16:09:43] *> 192.168.1.1/32 0.0.0.0 0 32768 ?
[2024-04-15 16:09:43] *> 192.168.1.3/32 10.0.0.1 0 0 65002 ?
[2024-04-15 16:09:43] *> 192.168.1.4/32 10.0.0.1 0 65002 ?
[2024-04-15 16:09:43] *> 192.168.100.0 0.0.0.0 0 32768 ?
[2024-04-15 16:09:43]
[2024-04-15 16:09:43] Total number of prefixes 7
Then I performed the failover and moved the console cable to switch 2 of the SWV pair:
[2024-04-15 16:10:17] test1#redundancy force-switchover
[2024-04-15 16:10:19] Proceed with switchover to standby RP? [confirm]
[2024-04-15 16:10:23] Manual Swact = enabled
[2024-04-15 16:10:25]
[2024-04-15 16:10:25] *Apr 15 14:11:07.667: %REDUNDANCY-3-SWITCHOVER: RP switchover (PEER_NOT_PRESENT)
[2024-04-15 16:10:25] *Apr 15 14:11:07.668: %REDUNDANCY-3-SWITCHOVER: RP switchover (PEER_DOWN)
[2024-04-15 16:10:25] *Apr 15 14:11:07.668: %REDUNDANCY-3-SWITCHOVER: RP switchover (PEER_REDUNDANCY_STATE_CHANGE)
[2024-04-15 16:10:41] *Apr 15 14:11:14.842: %BGP-5-ADJCHANGE: neighbor 10.0.0.3 Up show bgp summary
[2024-04-15 16:10:44] <shortened>
[2024-04-15 16:10:44] Neighbor V AS MsgRcvd MsgSent TblVer InQ OutQ Up/Down State/PfxRcd
[2024-04-15 16:10:44] 10.0.0.1 4 65002 0 0 1 0 0 never Idle
[2024-04-15 16:10:44] 10.0.0.3 4 65002 6 2 1 0 0 00:00:11 4
After the failover the BGP session is reset, but it seems to be re-established immediately. According to my testing, the long outage is caused by test1 not advertising its connected networks for almost 90 seconds.
[2024-04-15 16:10:47] test1>show bgp ipv4 unicast neighbors 10.0.0.3 advertised-routes
[2024-04-15 16:10:51]
[2024-04-15 16:10:51] Total number of prefixes 0
...
[2024-04-15 16:11:02] Total number of prefixes 0
[2024-04-15 16:11:02] test1>show bgp ipv4 unicast neighbors 10.0.0.3 advertised-routes
[2024-04-15 16:11:03]
[2024-04-15 16:11:03] Total number of prefixes 0
....
[2024-04-15 16:11:03] test1>show bgp summary | i 10.0.0.3
[2024-04-15 16:11:12] 10.0.0.3 4 65002 6 2 1 0 0 00:00:39 4
....
[2024-04-15 16:11:42] test1>show bgp ipv4 unicast neighbors 10.0.0.3 advertised-routes
[2024-04-15 16:11:42] BGP table version is 7, local router ID is 192.168.1.1
[2024-04-15 16:11:42] Status codes: s suppressed, d damped, h history, * valid, > best, i - internal,
[2024-04-15 16:11:43] r RIB-failure, S Stale, m multipath, b backup-path, f RT-Filter,
[2024-04-15 16:11:43] x best-external, a additional-path, c RIB-compressed,
[2024-04-15 16:11:43] t secondary path, L long-lived-stale,
[2024-04-15 16:11:43] Origin codes: i - IGP, e - EGP, ? - incomplete
[2024-04-15 16:11:43] RPKI validation codes: V valid, I invalid, N Not found
[2024-04-15 16:11:43]
[2024-04-15 16:11:43] Network Next Hop Metric LocPrf Weight Path
[2024-04-15 16:11:43] *> 10.0.0.2/31 0.0.0.0 0 32768 ?
[2024-04-15 16:11:43] *> 192.168.1.1/32 0.0.0.0 0 32768 ?
[2024-04-15 16:11:43] *> 192.168.100.0 0.0.0.0 0 32768 ?
[2024-04-15 16:11:43]
[2024-04-15 16:11:43] Total number of prefixes 3
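To put a number on the gap from a console capture like this one, the terminal timestamps can be pulled out with a short script. This is only a sketch: the names `advertisement_gap` and `SAMPLE` are made up for illustration, and the sample lines are copied from the capture above. It measures from the first %REDUNDANCY-3-SWITCHOVER message to the first "Total number of prefixes" line with a non-zero count, so the result is an upper bound that depends on how often the show command was repeated.

```python
import re
from datetime import datetime

# Terminal-emulator timestamp prefix, e.g. "[2024-04-15 16:10:25] ..."
STAMP = re.compile(r"^\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\]\s?(.*)$")
PREFIXES = re.compile(r"Total number of prefixes (\d+)")


def parse(line):
    """Return (timestamp, text) for a stamped line, or None if unstamped."""
    m = STAMP.match(line)
    if not m:
        return None
    return datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S"), m.group(2)


def advertisement_gap(lines):
    """Seconds from the first switchover message to the first non-empty
    'Total number of prefixes' line that follows it, or None if not found."""
    switchover = None
    for line in lines:
        parsed = parse(line)
        if parsed is None:
            continue
        stamp, text = parsed
        if switchover is None:
            # Ignore everything captured before the switchover.
            if "%REDUNDANCY-3-SWITCHOVER" in text:
                switchover = stamp
            continue
        m = PREFIXES.search(text)
        if m and int(m.group(1)) > 0:
            return (stamp - switchover).total_seconds()
    return None


# Lines copied from the capture above (hypothetical variable name).
SAMPLE = [
    "[2024-04-15 16:09:43] Total number of prefixes 7",
    "[2024-04-15 16:10:25] *Apr 15 14:11:07.667: %REDUNDANCY-3-SWITCHOVER: RP switchover (PEER_NOT_PRESENT)",
    "[2024-04-15 16:10:51] Total number of prefixes 0",
    "[2024-04-15 16:11:03] Total number of prefixes 0",
    "[2024-04-15 16:11:43] Total number of prefixes 3",
]

print(advertisement_gap(SAMPLE))  # 78.0
```

On these lines the script reports 78 seconds from the switchover message. Counted from the moment the command was entered (16:10:17), the gap is about 86 seconds.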
I have tried adjusting the graceful-restart timers, but the core of the issue seems to be the long advertisement delay.
Such a long downtime seems inappropriate for an HA solution, especially for a manually triggered failover.
My configuration on test1:
stack-mac persistent timer 0
boot system bootflash:packages.conf
switch 1 provision c9500-32c
switch 2 provision c9500-32c
stackwise-virtual
 domain 200
 dual-active detection pagp trust channel-group 1
!
redundancy
 mode sso
!
interface HundredGigE1/0/1
 switchport mode trunk
 channel-group 1 mode desirable
!
interface HundredGigE1/0/2
!
interface HundredGigE1/0/3
!
interface HundredGigE1/0/4
 stackwise-virtual link 1
interface HundredGigE1/0/32
 no switchport
 ip address 10.0.0.0 255.255.255.254
!
interface HundredGigE2/0/1
 switchport mode trunk
 channel-group 1 mode desirable
!
interface HundredGigE2/0/2
!
interface HundredGigE2/0/3
!
interface HundredGigE2/0/4
 stackwise-virtual link 1
!
interface HundredGigE2/0/32
 no switchport
 ip address 10.0.0.2 255.255.255.254
!
interface Vlan1
 no ip address
!
interface Vlan100
 ip address 192.168.100.1 255.255.255.0
!
router bgp 65001
 bgp log-neighbor-changes
 bgp graceful-restart restart-time 1
 bgp graceful-restart stalepath-time 1
 bgp graceful-restart
 neighbor 10.0.0.1 remote-as 65002
 neighbor 10.0.0.1 ha-mode graceful-restart
 neighbor 10.0.0.3 remote-as 65002
 neighbor 10.0.0.3 ha-mode graceful-restart
 !
 address-family ipv4
  redistribute connected
  neighbor 10.0.0.1 activate
  neighbor 10.0.0.1 soft-reconfiguration inbound
  neighbor 10.0.0.3 activate
  neighbor 10.0.0.3 soft-reconfiguration inbound
 exit-address-family
test1#show issu state detail
Current ISSU Status: Enabled
Previous ISSU Operation: Successful
=======================================================
System Check Status
-------------------------------------------------------
Platform ISSU Support Yes
Standby Online Yes
Autoboot Enabled Yes
SSO Mode Yes
Install Boot Yes
Valid Boot Media Yes
Operational Mode HA-REMOTE
=======================================================
No ISSU operation is in progress
According to this presentation, failover should be sub-second, but it only shows L2 scenarios: https://www.ciscolive.com/c/dam/r/ciscolive/emea/docs/2023/pdf/BRKENS-2095.pdf
Am I missing some configuration? Thanks
04-22-2024 09:25 AM
Maybe try the NSF configuration and test again. I do not see the SSO configuration?
Check the guide below:
04-24-2024 01:57 AM
Can I see the configuration of both switches? I am interested in this issue and am trying to understand how you configured it.
MHM
04-24-2024 03:36 AM
Hello,
unless you really need redistribution, what happens if you just advertise your connected networks?
network 10.0.0.0 mask 255.255.255.254
network 10.0.0.2 mask 255.255.255.254
network 192.168.100.0 mask 255.255.255.0
Or you could try BFD (neighbor x.x.x.x fall-over bfd).
06-19-2024 10:05 PM
I am experiencing the same issue, running the same version (17.9.4a). We have opened a TAC case with Cisco.
08-21-2024 01:48 AM
Hello PAPL,
was TAC of any help?
I am about to get back to this, as I need to tear down the lab and move the nodes to production, and we are about to deploy more of these SWV pairs to production.
Thanks
Michal
08-21-2024 01:57 AM
Can you elaborate on what the issue is? Maybe I can help you.
MHM