08-30-2012 06:05 AM - edited 03-07-2019 08:36 AM
Hi Guys,
I'm having terrible problems with a new VSS setup I have going at the moment.
It consists of :
2x 7600's for the core
2x 6500's for the vss
Anyway,
When I do a failover, it fails over fine but OSPF seems to drop and then come back.
The VSS has MEC's going to the 7600's, so when one VSS switch dies, only one uplink in a MEC goes down. The port-channel stays up, so in theory OSPF should stay up?
The problem is that we get a good 5 to 30 seconds of outage when a failover happens, and I can't seem to find a reliable config to keep it up happily.
Any thoughts?
Base OSPF config on the VSS = :
router ospf 7
router-id x.x.x.x
log-adjacency-changes
auto-cost reference-bandwidth 10000
nsf
nsf cisco helper disable
passive-interface default
network 0.0.0.0 255.255.255.255 area 0
default-information originate
Base OSPF on the 7600's:
router ospf 7
max-metric router-lsa on-startup 120
log-adjacency-changes
auto-cost reference-bandwidth 10000
network 0.0.0.0 255.255.255.255 area 0
Thanks
G
08-30-2012 06:32 AM
Hi Graham,
which side brings OSPF down, the 7600 or the VSS?
Also, you need NSF on the 7600 as well in order for it to support neighbors experiencing SSO switchover. From your configuration I don't see it enabled.
Finally, did you by any chance enable OSPF Fast helloes on the core interfaces (between 7600 and VSS)? If yes they prevent NSF from properly working.
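A minimal sketch of the change being suggested here (the process number and interface name are taken from the thread; adjust to your setup):

```
! On each 7600: enable NSF so it can participate fully,
! not just run as a default helper for the VSS
router ospf 7
 nsf
!
! And verify no sub-second (fast) hellos on the core-facing links,
! since they would declare the neighbor dead before NSF can help:
! show ip ospf interface Port-channel3 | include Hello
```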
Please attach the logs of OSPF going down (with timestamps) from both the VSS and the 7600 so we can understand which side tears the session down first.
Riccardo
08-30-2012 07:29 AM
Hi thanks for your reply.
I'm pretty sure fast hello isn't on ( just running default OSPF ):
Timer intervals configured, Hello 2, Dead 8, Wait 8, Retransmit 5
Supports Link-local Signaling (LLS)
Cisco NSF helper support enabled
IETF NSF helper support enabled
On another note: I just enabled NSF on all of it ( the 7600's and the VSS ), and when I did a failover of the VSS, BOTH! 7600's crashed and reloaded lol! ( Not the VSS, but the core 7600's reloaded... The VSS failed over as normal. )
I have that raised with our vendor, as evidently something is wrong software-wise..
Next time I do a failover and can actually get the logs, I'll paste them here. Just waiting for our support people to figure out what crashed both 7600's.
So.. what you think it should be:
No fast hello.
And
router ospf x
nsf
end
on ALL routers?
From what I could see in Cisco's documentation, only the VSS needed NSF enabled. ( http://www.cisco.com/en/US/docs/solutions/Enterprise/Campus/VSS30dg/VSS-dg_appa-configs.html#wp1051802 )
Thanks again for your time.
-G
Edit:
Looking on the core:
NSF helper seems to be enabled ( even though it's not in the config ):
Timer intervals configured, Hello 2, Dead 8, Wait 8, Retransmit 5
oob-resync timeout 40
Hello due in 00:00:00
Supports Link-local Signaling (LLS)
Cisco NSF helper support enabled
IETF NSF helper support enabled
08-30-2012 09:27 AM
Hi Graham,
I double checked, and on newer releases NSF helper mode is on by default; this is why the 7600 shows NSF helper support enabled.
For the crash, you did the right thing raising it with the vendor; it might not be relevant to this issue.
However in order to troubleshoot this issue we need to see some logs and debugs.
The idea is that when the SSO switchover occurs on the VSS, the new active RP of the new active Supervisor sends a so-called grace LSA (since we are talking about OSPF here) to its neighbors to signal that it is undergoing a graceful restart.
The 7600 at this point, upon receipt of such a grace LSA, should enter helper mode and start a timer. During this time the adjacency is not declared down, and forwarding continues based on the mls entries, which are not flushed. If the routing adjacency has not come back up and the LSAs have not been re-synced by the time this timer expires, the adjacency is declared down.
Now we don't know 1) whether the VSS actually sends a grace LSA, 2) whether the 7600 receives it and moves into helper mode, and 3) who is declaring the session down.
For that, as I wrote, we need logs; and since you need to test this feature again in order to get the logs, I suggest you enable some useful debugs right away to speed up the investigation.
If you enable "debug ip ospf nsf detail" on both routers before starting a new test, we should get valid info.
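To make sure the next test captures everything, something like this on both routers before the switchover might help (the buffer size is just an example value):

```
! Timestamp debugs and buffer them so they survive console scroll
service timestamps debug datetime msec localtime
logging buffered 256000 debugging
!
debug ip ospf nsf detail
! ...perform the VSS failover, then collect and stop:
! show logging
! undebug all
```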
regards,
Riccardo
08-30-2012 09:36 AM
Thanks for the reply.
That makes perfect sense, Ill get the debugs added and try again.
I've been told to move to 15.3 IOS, as the NSF bug is fixed in that. ( Yet for some reason I can't find 15.3 on cisco.com; only 15.2 is the latest for the 7606's, grrrr lol )
Once I have them upgraded Ill re-run the tests with the debugs.
I will reply tomorrow hopefully with these outputs.
Thanks very much for your help. As I thought, it seems odd that OSPF dropped, so hopefully my recent enablement of NSF, plus a new IOS to stop the NSF crash bug, may fix it hehe.
Thanks again
-G
08-30-2012 10:16 AM
Hi,
I've placed a newer IOS on, and it "seemed" happier. Didn't lose any packets to stuff behind the VSS from what I could see..
Here is the console output from one core router:
Does this look normal?
I can see LDP dropped.. OSPF changed state a few times, but I'm not sure if it died completely or simply went into helper..
Looks like NSF did its job happily..
Not sure why BGP died either, as OSPF was still up.. Hmm..
Aug 30 18:14:31.394 BST: %BGP_SESSION-5-ADJCHANGE: neighbor xxx.xxx.95.225 IPv4 Unicast topology base removed from session NSF peer closed the session
Aug 30 18:14:31.394 BST: %BGP-5-ADJCHANGE: neighbor xxx.xxx.95.225 Down NSF peer closed the session
Aug 30 18:14:33.514 BST: %BGP-5-ADJCHANGE: neighbor XX00:ED0::3C Down Peer closed the session
Aug 30 18:14:33.514 BST: %BGP_SESSION-5-ADJCHANGE: neighbor XX00:ED0::3C IPv6 Unicast topology base removed from session Peer closed the session
Aug 30 18:14:34.154 BST: %OSPFv3-5-ADJCHG: Process 7, Nbr xxx.xxx.95.225 on Port-channel3 from LOADING to FULL, Loading Done
Aug 30 18:14:39.638 BST: %LDP-5-GR: GR session xxx.xxx.95.225:0 (inst. 2): interrupted--recovery pending
Aug 30 18:14:39.638 BST: %LDP-5-NBRCHG: LDP Neighbor xxx.xxx.95.225:0 (0) is DOWN (Discovery Hello Hold Timer expired)
Aug 30 18:14:50.290 BST: %BGP-5-ADJCHANGE: neighbor XX00:ED0::3C Up
Aug 30 18:14:52.834 BST: OSPF-7 NSF_C Po3: OOB resync from Nbr xxx.xxx.95.225 XX.173.96.22
Aug 30 18:14:52.834 BST: OSPF-7 NSF_C Po3: Starting OOB resync with xxx.xxx.95.225 address XX.173.96.22 (receiver)
Aug 30 18:14:52.834 BST: %OSPF-5-ADJCHG: Process 7, Nbr xxx.xxx.95.225 on Port-channel3 from FULL to EXSTART, OOB resynchronization
Aug 30 18:14:57.610 BST: %OSPF-5-ADJCHG: Process 7, Nbr xxx.xxx.95.225 on Port-channel3 from EXSTART to EXCHANGE, Negotiation Done
Aug 30 18:14:57.618 BST: %OSPF-5-ADJCHG: Process 7, Nbr xxx.xxx.95.225 on Port-channel3 from EXCHANGE to LOADING, Exchange Done
Aug 30 18:14:57.618 BST: %OSPF-5-ADJCHG: Process 7, Nbr xxx.xxx.95.225 on Port-channel3 from LOADING to FULL, Loading Done
Aug 30 18:14:57.618 BST: OSPF-7 NSF_C Po3: OOB resync completed with xxx.xxx.95.225 address XX.173.96.22
Aug 30 18:15:12.406 BST: %BGP-5-ADJCHANGE: neighbor xxx.xxx.95.225 Up
Aug 30 18:15:27.454 BST: %LDP-5-NBRCHG: LDP Neighbor xxx.xxx.95.225:0 (2) is UP
08-31-2012 03:33 AM
Right,
OSPF seems happy now.
I don't lose any packets to stuff behind the VSS.
However, I'm wondering: do you know what may be needed to get the same results with MPLS?
For example, stuff on a normal vlan on the VSS doesn't lose any packets. However, LDP seems to drop, and when it comes back it takes quite some time to finally get VRF routes before I can contact stuff attached to a VRF again.
I assume it's the same kind of setup..
Hmm..
Thanks again for all your help!
-G
08-31-2012 05:05 AM
Hi Graham,
first of all, make sure that you have NSF enabled for BGP and LDP (mpls ldp graceful-restart).
About BGP: it seems that NSF is on, but the peer tears the BGP session down.
Can you paste the relevant configuration from the routers (VSS and neighbors), with the IOS info too?
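For reference, the knobs being asked about would look roughly like this (a generic sketch; the AS number is a placeholder, not from the thread):

```
! LDP graceful restart (global, on the VSS and on its LDP neighbors)
mpls ldp graceful-restart
!
! BGP graceful restart (again, on both sides of each session)
router bgp 65000
 bgp graceful-restart
!
! Verification:
! show mpls ldp graceful-restart
! show ip bgp neighbors x.x.x.x | include Graceful
```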
Riccardo
08-31-2012 05:49 AM
Hi, Thanks for the reply.
Core IOS: c7600rsp72043-advipservicesk9-mz.151-2.S
VSS IOS: s72033-advipservicesk9_wan-mz.122-33.SXJ3.bin
Re LDP:
heres the config I have:
Core:
sh run | inc ldp
mpls ldp graceful-restart timers neighbor-liveness 5
mpls ldp graceful-restart timers max-recovery 15
mpls ldp graceful-restart
mpls ldp router-id Loopback0 force
show mpls ldp graceful-restart
LDP Graceful Restart is enabled
Neighbor Liveness Timer: 5 seconds
Max Recovery Time: 15 seconds
Forwarding State Holding Time: 600 seconds
Down Neighbor Database (0 records):
Graceful Restart-enabled Sessions:
VRF default:
Peer LDP Ident: xxx.125.95.224:0, State: estab
Peer LDP Ident: xxx.125.95.225:0, State: estab
VSS:
sh run | inc ldp
mpls ldp graceful-restart
sh mpls ldp graceful-restart
LDP Graceful Restart is enabled
Neighbor Liveness Timer: 120 seconds
Max Recovery Time: 120 seconds
Forwarding State Holding Time: 600 seconds
Down Neighbor Database (0 records):
Graceful Restart-enabled Sessions:
VRF default:
Peer LDP Ident: xxx.125.95.224:0, State: estab
Peer LDP Ident: xxx.125.95.223:0, State: estab
I'm thinking these timers are borked ( the core and the VSS don't match )...?
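If the mismatch matters, one option (my assumption, simply copying the core's values) would be to configure the same GR timers on the VSS so both ends work from the same expectations:

```
! On the VSS, matching the timers already set on the core
mpls ldp graceful-restart timers neighbor-liveness 5
mpls ldp graceful-restart timers max-recovery 15
```

Bear in mind a 15-second max-recovery gives the restarting box very little time to re-sync its label bindings; the 120-second defaults on the VSS side are the more forgiving choice, so raising the core's timers to the defaults instead could equally be argued.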
RE: BGP
Core:
router bgp xxxx
no bgp enforce-first-as
bgp log-neighbor-changes
bgp deterministic-med
bgp graceful-restart restart-time 120
bgp graceful-restart stalepath-time 360
bgp graceful-restart
bgp bestpath compare-routerid
bgp maxas-limit 100
VSS:
router bgp xxxx
bgp log-neighbor-changes
bgp deterministic-med
bgp graceful-restart restart-time 120
bgp graceful-restart stalepath-time 360
bgp graceful-restart
bgp bestpath compare-routerid
bgp maxas-limit 100
Thanks again for your time. I'll be over the moon if I can get BGP / LDP to not die during failover.
08-31-2012 06:17 AM
Full logs when it fails over.
OSPF seems happy ( connectivity stays ).
However, I lose MPLS for AGES.
I can see LDP neighbors, but for some reason all routes from the rest of the network vanish from the VSS VRF's.
As soon as it all comes back ( after a minute or two ), I get this message:
*Aug 31 14:15:33.255 BST: %LDP-5-GR: GR session xxx.125.95.225:0 (inst. 4): completed graceful recovery
Seems odd why it would take so long, or why the routes vanish.
Aug 31 13:13:03.887: %VSLP-SW2_SPSTBY-3-VSLP_LMP_FAIL_REASON: Te2/5/4: Link down
Aug 31 13:13:03.887: %VSLP-SW2_SPSTBY-3-VSLP_LMP_FAIL_REASON: Te2/5/5: Link down
Aug 31 13:13:03.887: %VSLP-SW2_SPSTBY-2-VSL_DOWN: Last VSL interface Te2/5/5 went down
Aug 31 13:13:03.899: %VSLP-SW2_SPSTBY-2-VSL_DOWN: All VSL links went down while switch is in Standby role
Aug 31 13:13:03.899: %DUAL_ACTIVE-SW2_SPSTBY-1-VSL_DOWN: VSL is down - switchover, or possible dual-active situation has occurred
Aug 31 13:13:03.899: %PFREDUN-SW2_SPSTBY-6-ACTIVE: Initializing as Virtual Switch ACTIVE processor
Aug 31 13:13:04.639: %SYS-SW2_SPSTBY-3-LOGGER_FLUSHED: System was paused for 00:00:00 to ensure console debugging output.
Aug 31 13:13:05.817: %C6KPWR-SP-4-PSOK: power supply 1 turned on.
Aug 31 13:13:05.817: %C6KPWR-SP-4-PSOK: power supply 2 turned on.
Aug 31 13:13:06.897: %PIM-5-NBRCHG: neighbor xxx.173.96.21 UP on interface Port-channel3
Aug 31 13:13:06.901: %PIM-5-NBRCHG: neighbor xxx.173.111.129 UP on interface Port-channel4
Aug 31 13:13:07.245: %OSPF-5-ADJCHG: Process 7, Nbr xxx.125.95.224 on Port-channel4 from DOWN to INIT, Received Hello
Aug 31 13:13:07.245: %OSPF-5-ADJCHG: Process 7, Nbr xxx.125.95.224 on Port-channel4 from INIT to 2WAY, 2-Way Received
Aug 31 13:13:07.245: %OSPF-5-ADJCHG: Process 7, Nbr xxx.125.95.224 on Port-channel4 from 2WAY to FULL, NSF Adjacency Pickup
Aug 31 13:13:07.245: %OSPF-5-ADJCHG: Process 7, Nbr xxx.125.95.223 on Port-channel3 from DOWN to INIT, Received Hello
Aug 31 13:13:07.245: %OSPF-5-ADJCHG: Process 7, Nbr xxx.125.95.223 on Port-channel3 from INIT to 2WAY, 2-Way Received
Aug 31 13:13:07.245: %OSPF-5-ADJCHG: Process 7, Nbr xxx.125.95.223 on Port-channel3 from 2WAY to FULL, NSF Adjacency Pickup
Aug 31 13:13:06.189: %OIR-SW2_SP-6-INSCARD: Card inserted in slot 1, interfaces are now online
Aug 31 13:13:06.189: %OIR-SW2_SP-6-INSCARD: Card inserted in slot 9, interfaces are now online
Aug 31 13:13:06.281: %OIR-SW2_SP-6-INSCARD: Card inserted in slot 5, interfaces are now online
Aug 31 13:13:07.213: %PM-SW2_SP-4-PORT_BOUNCED: Port Gi2/1/1 was bounced by Consistency Check IDBS Up.
Aug 31 13:13:07.605: %LDP-5-GR: LDP restarting gracefully. Preserving forwarding state for 600 seconds.
Aug 31 13:13:07.897: %PIM-5-DRCHG: DR change from neighbor 0.0.0.0 to xxx.173.96.22 on interface Port-channel3
Aug 31 13:13:07.901: %PIM-5-DRCHG: DR change from neighbor 0.0.0.0 to xxx.173.111.130 on interface Port-channel4
Aug 31 13:13:09.157: %PFREDUN-6-ACTIVE: Switching over to active state completed
Aug 31 13:13:11.817: %TDP-5-INFO: Port-channel3: LDP started
Aug 31 13:13:11.817: %TDP-5-INFO: Port-channel4: LDP started
Aug 31 13:13:11.854: %VSDA-SW2_SP-3-LINK_DOWN: Interface Gi2/1/1 is no longer dual-active detection capable
Aug 31 13:13:13.801: %OSPFv3-5-ADJCHG: Process 7, Nbr xxx.125.95.223 on Port-channel3 from LOADING to FULL, Loading Done
Aug 31 13:13:14.193: %OSPFv3-5-ADJCHG: Process 7, Nbr xxx.125.95.224 on Port-channel4 from LOADING to FULL, Loading Done
Aug 31 13:13:16.369: %BGP-5-ADJCHANGE: neighbor xxx.23.191.2 vpn vrf XXXX-DC-BACKEND Up
Aug 31 13:13:25.473: %BGP-5-ADJCHANGE: neighbor 2A00:ED0::3B Up
Aug 31 13:13:27.609: %BGP-5-ADJCHANGE: neighbor 2A00:ED0::3A Up
Aug 31 13:13:27.737: %OSPF-5-ADJCHG: Process 7, Nbr xxx.125.95.223 on Port-channel3 from FULL to EXSTART, OOB-Resynchronization
Aug 31 13:13:27.737: %OSPF-5-ADJCHG: Process 7, Nbr xxx.125.95.224 on Port-channel4 from FULL to EXSTART, OOB-Resynchronization
Aug 31 13:13:32.245: %OSPF-5-ADJCHG: Process 7, Nbr xxx.125.95.223 on Port-channel3 from EXSTART to EXCHANGE, Negotiation Done
Aug 31 13:13:32.245: %OSPF-5-ADJCHG: Process 7, Nbr xxx.125.95.223 on Port-channel3 from EXCHANGE to LOADING, Exchange Done
Aug 31 13:13:32.269: %OSPF-5-ADJCHG: Process 7, Nbr xxx.125.95.223 on Port-channel3 from LOADING to FULL, Loading Done
Aug 31 13:13:32.477: %OSPF-5-ADJCHG: Process 7, Nbr xxx.125.95.224 on Port-channel4 from EXSTART to EXCHANGE, Negotiation Done
Aug 31 13:13:32.477: %OSPF-5-ADJCHG: Process 7, Nbr xxx.125.95.224 on Port-channel4 from EXCHANGE to LOADING, Exchange Done
Aug 31 13:13:32.477: %OSPF-5-ADJCHG: Process 7, Nbr xxx.125.95.224 on Port-channel4 from LOADING to FULL, Loading Done
Aug 31 13:13:32.853: %BGP-5-ADJCHANGE: neighbor xxx.125.95.224 Up
Aug 31 13:13:32.853: %BGP-5-ADJCHANGE: neighbor xxx.125.95.223 Up
Aug 31 13:13:33.145: %LDP-5-NBRCHG: LDP Neighbor xxx.125.95.224:0 (1) is UP
Aug 31 13:13:35.781: %LDP-5-NBRCHG: LDP Neighbor xxx.125.95.223:0 (2) is UP
Aug 31 13:13:36.585: %BGP-5-ADJCHANGE: neighbor xxx.125.95.52 Up
Aug 31 13:13:43.013: %BGP-5-ADJCHANGE: neighbor xxx.125.95.21 Up
08-31-2012 06:47 AM
I think I may have figured out why the MPLS routes vanish..
My normal BGP neighbors ( v4 / v6 ) all have graceful restart advertised and received,
yet my vpnv4 BGP neighbours don't ( my vpnv4 route reflectors don't have it enabled! ):
sh ip bgp vpnv4 all neighbors xxx.125.95.21 | inc race
Graceful Restart Capability: advertised
I have a feeling that if I enable GR on the RR's it will fix it.
This makes sense: "internet" routes are OK, but MPLS routes are not ( because the vpnv4 reflectors don't have GR enabled ).
Question:
If GR is enabled on one device but isn't on the other, what issues can happen? Ie: if I put GR on my route reflectors, 90% of all devices don't have GR enabled.. Will it cause a problem for those? Or will BGP just die in its normal fashion?
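My understanding (not something confirmed in this thread) is that GR is a per-session negotiated capability, so peers that don't support it simply won't negotiate it and will keep tearing sessions down as before; note also that capabilities are exchanged at session establishment, so an existing session needs to be reset before newly configured GR takes effect. You can confirm what each session actually negotiated:

```
! On the RR after adding "bgp graceful-restart" and resetting the session:
! show ip bgp vpnv4 all neighbors xxx.125.95.21 | include Graceful
!   "advertised and received"  -> both ends will use GR
!   "advertised" (only)        -> the session behaves as before
```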
Thanks
G
09-05-2012 04:19 PM
Hi Graham
Did this fix your issue?
I have a very similar problem with LDP/MPLS only with the igp being isis.
Thanks in advance
Martin
Sent from Cisco Technical Support iPhone App
09-06-2012 01:19 AM
Annoyingly, I have to wait a week before I can do the change ( red tape etc ).
However, I do strongly believe this is the cause.
One thing to check when it fails over:
Does isis keep its routing table, and does your normal connected addressing work fine ( no loss etc, it fails over ok )?
Is it just MPLS that seems to take forever to come back?
If that's the case, do a failover and have a look at the routing table for a VRF.. Does it only show local routes ( ie: no routes from other ldp neighbors etc )?
This is what I'm seeing, so if this is the case with you then it probably is related.
It does certainly seem that you need GR enabled on all BGP address families ( I assume you have BGP over isis? )
09-06-2012 02:35 AM
We are running the following tests:
- Pinging between PE loopbacks which are learned via ISIS
- Pinging through the PE's from boxes either side which routes that are learned via MP-BGP
What we see is:
- A few drops of the pings via ISIS, which is a concern but not the main one
- The P node takes a few seconds to rebuild the BGP session with the PE, then about 2 minutes to receive the BGP routes from it. It appears the PE waits for some time after establishing the BGP neighbor before running the BGP best-path algorithm. We can see the routes in the BGP table of the PE, but it appears it doesn't send them up to the P until it has run best path, which like I say takes about 2 minutes.
Yes we have graceful failover configured on BGP and NSF on ISIS.
09-06-2012 04:15 AM
What's the output of:
sh ip bgp neighbors xx.xx.xx.xx | inc race
Graceful Restart Capability: advertised and received
on all BGP peers on all your neighbours ( both ends )?
Hopefully you're getting "advertised and received" on all routers?