08-30-2012 06:05 AM - edited 03-07-2019 08:36 AM
Hi Guys,
I'm having terrible problems with a new VSS setup I have going at the moment.
It consists of :
2x 7600's for the core
2x 6500's for the vss
Anyway,
When I do a failover, it fails over fine but OSPF seems to drop and then come back.
The VSS has MEC's going to the 7600's, so when one VSS switch dies, only one uplink in a MEC goes down. The port-channel stays up, so in theory OSPF should stay up?
The problem is that we get a good 5 to 30 seconds of outage when a failover happens, and I can't seem to find a reliable config to keep it up happily.
Any thoughts?
Base OSPF config on the VSS = :
router ospf 7
router-id x.x.x.x
log-adjacency-changes
auto-cost reference-bandwidth 10000
nsf
nsf cisco helper disable
passive-interface default
network 0.0.0.0 255.255.255.255 area 0
default-information originate
Base OSPF on the 7600's:
router ospf 7
max-metric router-lsa on-startup 120
log-adjacency-changes
auto-cost reference-bandwidth 10000
network 0.0.0.0 255.255.255.255 area 0
Thanks
G
08-30-2012 06:32 AM
Hi Graham,
which side brings OSPF down, the 7600 or the VSS?
Also, you need NSF on the 7600 as well in order for it to support neighbors experiencing SSO switchover. From your configuration I don't see it enabled.
Finally, did you by any chance enable OSPF Fast helloes on the core interfaces (between 7600 and VSS)? If yes they prevent NSF from properly working.
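A minimal sketch of the change being suggested here (the process number and interface name are taken from the thread; adjust to your setup):

```
! On each 7600: enable NSF so it can participate fully,
! not just run as a default helper for the VSS
router ospf 7
 nsf
!
! And verify no sub-second (fast) hellos on the core-facing links,
! since they would declare the neighbor dead before NSF can help:
! show ip ospf interface Port-channel3 | include Hello
```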
Please attach the logs of OSPF going down (with timestamps) from both the VSS and the 7600 so we can understand which side tears the session down first.
Riccardo
08-30-2012 07:29 AM
Hi thanks for your reply.
I'm pretty sure fast hello isn't on ( just running default OSPF ):
Timer intervals configured, Hello 2, Dead 8, Wait 8, Retransmit 5
Supports Link-local Signaling (LLS)
Cisco NSF helper support enabled
IETF NSF helper support enabled
On another note: I just enabled NSF on all of it ( the 7600's and the VSS ), and when I did a failover of the VSS, BOTH! 7600's crashed and reloaded lol! ( Not the VSS, but the core 7600's reloaded... The VSS failed over as normal. )
I have that raised with our vendor, as evidently something is wrong software-wise..
Next time I do a failover and can actually get the logs, I'll paste them here. Just waiting for our support people to figure out what crashed both 7600's.
So.. what you think it should be:
No fast hello.
And
router ospf x
nsf
end
on ALL routers?
From what I could see in Cisco's documentation, only the VSS needed NSF enabled. ( http://www.cisco.com/en/US/docs/solutions/Enterprise/Campus/VSS30dg/VSS-dg_appa-configs.html#wp1051802 )
Thanks again for your time.
-G
Edit:
Looking on the core:
NSF helper seems to be enabled ( even though it's not in the config ):
Timer intervals configured, Hello 2, Dead 8, Wait 8, Retransmit 5
oob-resync timeout 40
Hello due in 00:00:00
Supports Link-local Signaling (LLS)
Cisco NSF helper support enabled
IETF NSF helper support enabled
08-30-2012 09:27 AM
Hi Graham,
I double checked, and on newer releases NSF helper mode is on by default; this is why the 7600 shows NSF helper support enabled.
For the crash, you did the right thing raising it with the vendor; it might not be relevant to this issue.
However in order to troubleshoot this issue we need to see some logs and debugs.
The idea is that when the SSO switchover occurs on the VSS, the new active RP of the new active Supervisor sends a so-called grace LSA (since we are talking about OSPF here) to its neighbors to signal that it is undergoing a graceful restart.
The 7600 at this point, upon receipt of such a grace LSA, should enter helper mode and start a timer. During this time the adjacency is not declared down, and forwarding continues based on the mls entries, which are not flushed. If the routing adjacency has not come back up and the LSAs have not been re-synced by the time this timer expires, the adjacency is declared down.
Now we don't know 1) whether the VSS actually sends a grace LSA, 2) whether the 7600 receives it and moves into helper mode, and 3) who is declaring the session down.
For that, as I wrote, we need logs; and since you need to test this feature again in order to get the logs, I suggest you enable some useful debugs right away to speed up the investigation.
If you enable "debug ip ospf nsf detail" on both routers before starting a new test, we should get valid info.
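To make sure the next test captures everything, something like this on both routers before the switchover might help (the buffer size is just an example value):

```
! Timestamp debugs and buffer them so they survive console scroll
service timestamps debug datetime msec localtime
logging buffered 256000 debugging
!
debug ip ospf nsf detail
! ...perform the VSS failover, then collect and stop:
! show logging
! undebug all
```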
regards,
Riccardo
08-30-2012 09:36 AM
Thanks for the reply.
That makes perfect sense, Ill get the debugs added and try again.
I've been told to move to 15.3 IOS, as the NSF bug is fixed in that. ( Yet for some reason I can't find 15.3 on cisco.com; only 15.2 is the latest for the 7606's, grrrr lol )
Once I have them upgraded Ill re-run the tests with the debugs.
I will reply tomorrow hopefully with these outputs.
Thanks very much for your help. As I thought, it seems odd that OSPF dropped, so hopefully my recent enablement of NSF, plus a new IOS to stop the NSF crash bug, may fix it hehe.
Thanks again
-G
08-30-2012 10:16 AM
Hi,
I've placed a newer IOS on, and it "seemed" happier. Didn't lose any packets to stuff behind the VSS from what I could see..
Here is the console output from one core router:
Does this look normal?
I can see LDP dropped.. OSPF changed state a few times, but I'm not sure if it died completely or simply went into helper..
Looks like NSF did its job happily..
Not sure why BGP died either, as OSPF was still up.. Hmm..
Aug 30 18:14:31.394 BST: %BGP_SESSION-5-ADJCHANGE: neighbor xxx.xxx.95.225 IPv4 Unicast topology base removed from session NSF peer closed the session
Aug 30 18:14:31.394 BST: %BGP-5-ADJCHANGE: neighbor xxx.xxx.95.225 Down NSF peer closed the session
Aug 30 18:14:33.514 BST: %BGP-5-ADJCHANGE: neighbor XX00:ED0::3C Down Peer closed the session
Aug 30 18:14:33.514 BST: %BGP_SESSION-5-ADJCHANGE: neighbor XX00:ED0::3C IPv6 Unicast topology base removed from session Peer closed the session
Aug 30 18:14:34.154 BST: %OSPFv3-5-ADJCHG: Process 7, Nbr xxx.xxx.95.225 on Port-channel3 from LOADING to FULL, Loading Done
Aug 30 18:14:39.638 BST: %LDP-5-GR: GR session xxx.xxx.95.225:0 (inst. 2): interrupted--recovery pending
Aug 30 18:14:39.638 BST: %LDP-5-NBRCHG: LDP Neighbor xxx.xxx.95.225:0 (0) is DOWN (Discovery Hello Hold Timer expired)
Aug 30 18:14:50.290 BST: %BGP-5-ADJCHANGE: neighbor XX00:ED0::3C Up
Aug 30 18:14:52.834 BST: OSPF-7 NSF_C Po3: OOB resync from Nbr xxx.xxx.95.225 XX.173.96.22
Aug 30 18:14:52.834 BST: OSPF-7 NSF_C Po3: Starting OOB resync with xxx.xxx.95.225 address XX.173.96.22 (receiver)
Aug 30 18:14:52.834 BST: %OSPF-5-ADJCHG: Process 7, Nbr xxx.xxx.95.225 on Port-channel3 from FULL to EXSTART, OOB resynchronization
Aug 30 18:14:57.610 BST: %OSPF-5-ADJCHG: Process 7, Nbr xxx.xxx.95.225 on Port-channel3 from EXSTART to EXCHANGE, Negotiation Done
Aug 30 18:14:57.618 BST: %OSPF-5-ADJCHG: Process 7, Nbr xxx.xxx.95.225 on Port-channel3 from EXCHANGE to LOADING, Exchange Done
Aug 30 18:14:57.618 BST: %OSPF-5-ADJCHG: Process 7, Nbr xxx.xxx.95.225 on Port-channel3 from LOADING to FULL, Loading Done
Aug 30 18:14:57.618 BST: OSPF-7 NSF_C Po3: OOB resync completed with xxx.xxx.95.225 address XX.173.96.22
Aug 30 18:15:12.406 BST: %BGP-5-ADJCHANGE: neighbor xxx.xxx.95.225 Up
Aug 30 18:15:27.454 BST: %LDP-5-NBRCHG: LDP Neighbor xxx.xxx.95.225:0 (2) is UP
08-31-2012 03:33 AM
Right,
OSPF seems happy now.
I don't lose any packets to stuff behind the VSS.
However, I'm wondering: do you know what may be needed to get the same results with MPLS?
For example, stuff on a normal vlan on the VSS doesn't lose any packets. However, LDP seems to drop, and when it comes back it takes quite some time to finally get VRF routes before I can contact stuff attached to a VRF again.
I assume it's the same kind of setup..
Hmm..
Thanks again for all your help!
-G
08-31-2012 05:05 AM
Hi Graham,
first of all, make sure that you have NSF enabled for BGP and LDP (mpls ldp graceful-restart).
About BGP: it seems that NSF is on, but the peer tears the BGP session down.
Can you paste the relevant configuration from the routers (VSS and neighbors), with the IOS info too?
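For reference, the knobs being asked about would look roughly like this (a generic sketch; the AS number is a placeholder, not from the thread):

```
! LDP graceful restart (global, on the VSS and on its LDP neighbors)
mpls ldp graceful-restart
!
! BGP graceful restart (again, on both sides of each session)
router bgp 65000
 bgp graceful-restart
!
! Verification:
! show mpls ldp graceful-restart
! show ip bgp neighbors x.x.x.x | include Graceful
```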
Riccardo
08-31-2012 05:49 AM
Hi, Thanks for the reply.
Core IOS: c7600rsp72043-advipservicesk9-mz.151-2.S
VSS IOS: s72033-advipservicesk9_wan-mz.122-33.SXJ3.bin
Re LDP:
heres the config I have:
Core:
sh run | inc ldp
mpls ldp graceful-restart timers neighbor-liveness 5
mpls ldp graceful-restart timers max-recovery 15
mpls ldp graceful-restart
mpls ldp router-id Loopback0 force
show mpls ldp graceful-restart
LDP Graceful Restart is enabled
Neighbor Liveness Timer: 5 seconds
Max Recovery Time: 15 seconds
Forwarding State Holding Time: 600 seconds
Down Neighbor Database (0 records):
Graceful Restart-enabled Sessions:
VRF default:
Peer LDP Ident: xxx.125.95.224:0, State: estab
Peer LDP Ident: xxx.125.95.225:0, State: estab
VSS:
sh run | inc ldp
mpls ldp graceful-restart
sh mpls ldp graceful-restart
LDP Graceful Restart is enabled
Neighbor Liveness Timer: 120 seconds
Max Recovery Time: 120 seconds
Forwarding State Holding Time: 600 seconds
Down Neighbor Database (0 records):
Graceful Restart-enabled Sessions:
VRF default:
Peer LDP Ident: xxx.125.95.224:0, State: estab
Peer LDP Ident: xxx.125.95.223:0, State: estab
I'm thinking these timers are borked ( the core and the VSS don't match )...?
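If the mismatch matters, one option (my assumption, simply copying the core's values) would be to configure the same GR timers on the VSS so both ends work from the same expectations:

```
! On the VSS, matching the timers already set on the core
mpls ldp graceful-restart timers neighbor-liveness 5
mpls ldp graceful-restart timers max-recovery 15
```

Bear in mind a 15-second max-recovery gives the restarting box very little time to re-sync its label bindings; the 120-second defaults on the VSS side are the more forgiving choice, so raising the core's timers to the defaults instead could equally be argued.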
RE: BGP
Core:
router bgp xxxx
no bgp enforce-first-as
bgp log-neighbor-changes
bgp deterministic-med
bgp graceful-restart restart-time 120
bgp graceful-restart stalepath-time 360
bgp graceful-restart
bgp bestpath compare-routerid
bgp maxas-limit 100
VSS:
router bgp xxxx
bgp log-neighbor-changes
bgp deterministic-med
bgp graceful-restart restart-time 120
bgp graceful-restart stalepath-time 360
bgp graceful-restart
bgp bestpath compare-routerid
bgp maxas-limit 100
Thanks again for your time. I'll be over the moon if I can get BGP / LDP to not die during failover.
08-31-2012 06:17 AM
Full logs when it fails over.
OSPF seems happy ( connectivity stays ).
However, I lose MPLS for AGES.
I can see LDP neighbors, but for some reason all routes from the rest of the network vanish from the VSS VRF's.
As soon as it all comes back ( after a minute or two ), I get this message:
*Aug 31 14:15:33.255 BST: %LDP-5-GR: GR session xxx.125.95.225:0 (inst. 4): completed graceful recovery
Seems odd why it would take so long, or why the routes vanish.
Aug 31 13:13:03.887: %VSLP-SW2_SPSTBY-3-VSLP_LMP_FAIL_REASON: Te2/5/4: Link down
Aug 31 13:13:03.887: %VSLP-SW2_SPSTBY-3-VSLP_LMP_FAIL_REASON: Te2/5/5: Link down
Aug 31 13:13:03.887: %VSLP-SW2_SPSTBY-2-VSL_DOWN: Last VSL interface Te2/5/5 went down
Aug 31 13:13:03.899: %VSLP-SW2_SPSTBY-2-VSL_DOWN: All VSL links went down while switch is in Standby role
Aug 31 13:13:03.899: %DUAL_ACTIVE-SW2_SPSTBY-1-VSL_DOWN: VSL is down - switchover, or possible dual-active situation has occurred
Aug 31 13:13:03.899: %PFREDUN-SW2_SPSTBY-6-ACTIVE: Initializing as Virtual Switch ACTIVE processor
Aug 31 13:13:04.639: %SYS-SW2_SPSTBY-3-LOGGER_FLUSHED: System was paused for 00:00:00 to ensure console debugging output.
Aug 31 13:13:05.817: %C6KPWR-SP-4-PSOK: power supply 1 turned on.
Aug 31 13:13:05.817: %C6KPWR-SP-4-PSOK: power supply 2 turned on.
Aug 31 13:13:06.897: %PIM-5-NBRCHG: neighbor xxx.173.96.21 UP on interface Port-channel3
Aug 31 13:13:06.901: %PIM-5-NBRCHG: neighbor xxx.173.111.129 UP on interface Port-channel4
Aug 31 13:13:07.245: %OSPF-5-ADJCHG: Process 7, Nbr xxx.125.95.224 on Port-channel4 from DOWN to INIT, Received Hello
Aug 31 13:13:07.245: %OSPF-5-ADJCHG: Process 7, Nbr xxx.125.95.224 on Port-channel4 from INIT to 2WAY, 2-Way Received
Aug 31 13:13:07.245: %OSPF-5-ADJCHG: Process 7, Nbr xxx.125.95.224 on Port-channel4 from 2WAY to FULL, NSF Adjacency Pickup
Aug 31 13:13:07.245: %OSPF-5-ADJCHG: Process 7, Nbr xxx.125.95.223 on Port-channel3 from DOWN to INIT, Received Hello
Aug 31 13:13:07.245: %OSPF-5-ADJCHG: Process 7, Nbr xxx.125.95.223 on Port-channel3 from INIT to 2WAY, 2-Way Received
Aug 31 13:13:07.245: %OSPF-5-ADJCHG: Process 7, Nbr xxx.125.95.223 on Port-channel3 from 2WAY to FULL, NSF Adjacency Pickup
Aug 31 13:13:06.189: %OIR-SW2_SP-6-INSCARD: Card inserted in slot 1, interfaces are now online
Aug 31 13:13:06.189: %OIR-SW2_SP-6-INSCARD: Card inserted in slot 9, interfaces are now online
Aug 31 13:13:06.281: %OIR-SW2_SP-6-INSCARD: Card inserted in slot 5, interfaces are now online
Aug 31 13:13:07.213: %PM-SW2_SP-4-PORT_BOUNCED: Port Gi2/1/1 was bounced by Consistency Check IDBS Up.
Aug 31 13:13:07.605: %LDP-5-GR: LDP restarting gracefully. Preserving forwarding state for 600 seconds.
Aug 31 13:13:07.897: %PIM-5-DRCHG: DR change from neighbor 0.0.0.0 to xxx.173.96.22 on interface Port-channel3
Aug 31 13:13:07.901: %PIM-5-DRCHG: DR change from neighbor 0.0.0.0 to xxx.173.111.130 on interface Port-channel4
Aug 31 13:13:09.157: %PFREDUN-6-ACTIVE: Switching over to active state completed
Aug 31 13:13:11.817: %TDP-5-INFO: Port-channel3: LDP started
Aug 31 13:13:11.817: %TDP-5-INFO: Port-channel4: LDP started
Aug 31 13:13:11.854: %VSDA-SW2_SP-3-LINK_DOWN: Interface Gi2/1/1 is no longer dual-active detection capable
Aug 31 13:13:13.801: %OSPFv3-5-ADJCHG: Process 7, Nbr xxx.125.95.223 on Port-channel3 from LOADING to FULL, Loading Done
Aug 31 13:13:14.193: %OSPFv3-5-ADJCHG: Process 7, Nbr xxx.125.95.224 on Port-channel4 from LOADING to FULL, Loading Done
Aug 31 13:13:16.369: %BGP-5-ADJCHANGE: neighbor xxx.23.191.2 vpn vrf XXXX-DC-BACKEND Up
Aug 31 13:13:25.473: %BGP-5-ADJCHANGE: neighbor 2A00:ED0::3B Up
Aug 31 13:13:27.609: %BGP-5-ADJCHANGE: neighbor 2A00:ED0::3A Up
Aug 31 13:13:27.737: %OSPF-5-ADJCHG: Process 7, Nbr xxx.125.95.223 on Port-channel3 from FULL to EXSTART, OOB-Resynchronization
Aug 31 13:13:27.737: %OSPF-5-ADJCHG: Process 7, Nbr xxx.125.95.224 on Port-channel4 from FULL to EXSTART, OOB-Resynchronization
Aug 31 13:13:32.245: %OSPF-5-ADJCHG: Process 7, Nbr xxx.125.95.223 on Port-channel3 from EXSTART to EXCHANGE, Negotiation Done
Aug 31 13:13:32.245: %OSPF-5-ADJCHG: Process 7, Nbr xxx.125.95.223 on Port-channel3 from EXCHANGE to LOADING, Exchange Done
Aug 31 13:13:32.269: %OSPF-5-ADJCHG: Process 7, Nbr xxx.125.95.223 on Port-channel3 from LOADING to FULL, Loading Done
Aug 31 13:13:32.477: %OSPF-5-ADJCHG: Process 7, Nbr xxx.125.95.224 on Port-channel4 from EXSTART to EXCHANGE, Negotiation Done
Aug 31 13:13:32.477: %OSPF-5-ADJCHG: Process 7, Nbr xxx.125.95.224 on Port-channel4 from EXCHANGE to LOADING, Exchange Done
Aug 31 13:13:32.477: %OSPF-5-ADJCHG: Process 7, Nbr xxx.125.95.224 on Port-channel4 from LOADING to FULL, Loading Done
Aug 31 13:13:32.853: %BGP-5-ADJCHANGE: neighbor xxx.125.95.224 Up
Aug 31 13:13:32.853: %BGP-5-ADJCHANGE: neighbor xxx.125.95.223 Up
Aug 31 13:13:33.145: %LDP-5-NBRCHG: LDP Neighbor xxx.125.95.224:0 (1) is UP
Aug 31 13:13:35.781: %LDP-5-NBRCHG: LDP Neighbor xxx.125.95.223:0 (2) is UP
Aug 31 13:13:36.585: %BGP-5-ADJCHANGE: neighbor xxx.125.95.52 Up
Aug 31 13:13:43.013: %BGP-5-ADJCHANGE: neighbor xxx.125.95.21 Up
08-31-2012 06:47 AM
I think I may have figured out why the MPLS routes vanish..
My normal BGP neighbors ( v4 / v6 ) all have graceful restart advertised and received,
yet my vpnv4 BGP neighbours don't ( my vpnv4 route reflectors don't have it enabled! ):
sh ip bgp vpnv4 all neighbors xxx.125.95.21 | inc race
Graceful Restart Capability: advertised
I have a feeling that if I enable GR on the RR's it will fix it.
This makes sense: "internet" routes are OK, but MPLS routes are not ( because the vpnv4 reflectors don't have GR enabled ).
Question:
If GR is enabled on one device but isn't on the other, what issues can happen? Ie: if I put GR on my route reflectors, 90% of all devices don't have GR enabled.. Will it cause a problem for those? Or will BGP just die in its normal fashion?
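My understanding (not something confirmed in this thread) is that GR is a per-session negotiated capability, so peers that don't support it simply won't negotiate it and will keep tearing sessions down as before; note also that capabilities are exchanged at session establishment, so an existing session needs to be reset before newly configured GR takes effect. You can confirm what each session actually negotiated:

```
! On the RR after adding "bgp graceful-restart" and resetting the session:
! show ip bgp vpnv4 all neighbors xxx.125.95.21 | include Graceful
!   "advertised and received"  -> both ends will use GR
!   "advertised" (only)        -> the session behaves as before
```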
Thanks
G
09-05-2012 04:19 PM
Hi Graham
Did this fix your issue?
I have a very similar problem with LDP/MPLS only with the igp being isis.
Thanks in advance
Martin
Sent from Cisco Technical Support iPhone App
09-06-2012 01:19 AM
Annoyingly, I have to wait a week before I can do the change ( red tape etc ).
However, I do strongly believe this is the cause.
One thing to check when it fails over:
Does isis keep its routing table, and does your normal connected addressing work fine ( no loss etc, it fails over ok )?
Is it just MPLS that seems to take forever to come back?
If that's the case, do a failover and have a look at the routing table for a VRF.. Does it only show local routes ( ie: no routes from other ldp neighbors etc )?
This is what I'm seeing, so if this is the case with you then it probably is related.
It does certainly seem that you need GR enabled on all BGP address families ( I assume you have BGP over isis? )
09-06-2012 02:35 AM
We are running the following tests:
- Pinging between PE loopbacks which are learned via ISIS
- Pinging through the PE's from boxes either side which routes that are learned via MP-BGP
What we see is:
- A few drops of the pings via ISIS, which is a concern but not the main one
- The P node takes a few seconds to rebuild the BGP session with the PE, then about 2 minutes to receive the BGP routes from it. It appears the PE waits for some time after establishing the BGP neighbor before running the BGP best-path algorithm. We can see the routes in the BGP table of the PE, but it appears it doesn't send them up to the P until it has run best path, which like I say takes about 2 minutes.
Yes we have graceful failover configured on BGP and NSF on ISIS.
09-06-2012 04:15 AM
What's the output of:
sh ip bgp neighbors xx.xx.xx.xx | inc race
Graceful Restart Capability: advertised and received
on all BGP peers on all your neighbours ( both ends )?
Hopefully you're getting "advertised and received" on all routers?