09-13-2011 06:52 AM - edited 03-07-2019 02:11 AM
Hi there,
I'm having a strange issue with OSPF fast hellos. We are currently implementing a new MAN in the Netherlands with two new C6500 sup720-10G core routers and multiple C6500 sup720-3B routers. The core routers are currently connected only to two distribution routers at two locations. Next month we will connect multiple sites and the two core locations using Ethernet connections over a DWDM infrastructure.
On both core routers OSPF is currently active over direct UTP connections to two distribution routers. This worked fine for months, until a colleague swapped the cables at one site two weeks ago. Now, every day, the OSPF adjacencies to the PLW neighbors are reset multiple times at irregular intervals, always for both distribution routers at the same time. However, a third OSPF neighbor, a backup connection to a C3750 at a remote site, is not being reset. We did of course replace the cables at this location once more, but without result; the neighborship is still being reset. The only things that help are removing the "ip ospf dead-interval minimal hello-multiplier 4" command, or raising the hello timers; I used 1 second, so the dead interval is 4 seconds. This is not a satisfactory solution though, as we want the fast convergence of the fast hellos, and I really don't see any reason why this shouldn't work on our Gigabit MAN.
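For reference, these are the two timer variants described above, as a sketch (interface name taken from our config; only the relevant lines shown):

```
! Fast hellos: dead interval fixed at 1 second, hellos sent 4x per second
interface GigabitEthernet1/3
 ip ospf dead-interval minimal hello-multiplier 4
!
! The slower workaround: 1-second hellos, 4-second dead interval
interface GigabitEthernet1/3
 ip ospf hello-interval 1
 ip ospf dead-interval 4
```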
I also had this issue on an interconnection between two distribution routers at another site, but since that port-channel link was going to be replaced by an L3 VLAN and is not very important anyway, I ignored it. But now that I am having the same problem on a core router, I am afraid we will run into more of these problems when we activate all the Ethernet-over-DWDM links in our new core in October.
I did some research on the Cisco support site and some other sites, but I couldn't find a note on a similar problem. Are there any known issues with OSPF fast hellos? Any idea in which direction I should look to solve this problem? I find it very strange that both links from core to distribution routers are always reset at the same time.
I hope someone can help me.
Best regards,
Joris van Rooden
sh version:
Cisco IOS Software, s72033_rp Software (s72033_rp-ADVENTERPRISEK9_WAN-M), Version 12.2(33)SXI5, RELEASE SOFTWARE (fc2)
Config:
router ospf 1
router-id 10.100.0.102
log-adjacency-changes detail
auto-cost reference-bandwidth 40000
area 0 authentication message-digest
timers throttle spf 10 100 5000
timers throttle lsa all 10 100 5000
timers lsa arrival 80
passive-interface default
no passive-interface GigabitEthernet1/3
no passive-interface GigabitEthernet1/20
no passive-interface GigabitEthernet2/3
network 10.100.0.102 0.0.0.0 area 0
network 10.100.0.0 0.0.255.255 area 0
!
!
interface GigabitEthernet1/3
description Distribution-router-1 Gi3/13
dampening
mtu 9216
ip address 10.100.102.9 255.255.255.248
ip ospf message-digest-key 1 md5 7 xxxxxxxx
ip ospf network point-to-point
ip ospf dead-interval minimal hello-multiplier 4
logging event link-status
load-interval 30
carrier-delay msec 0
mpls ip
end
sh ip ospf neighbor:
Neighbor ID Pri State Dead Time Address Interface
10.100.0.127 0 FULL/ - 812 msec 10.100.113.10 GigabitEthernet1/20
10.100.0.106 0 FULL/ - 00:00:03 10.100.102.26 GigabitEthernet2/3
10.100.0.105 0 FULL/ - 852 msec 10.100.102.10 GigabitEthernet1/3
Sh log:
002913: Sep 13 07:52:29.842 CEST: %OSPF-5-ADJCHG: Process 1, Nbr 10.100.0.106 on GigabitEthernet2/3 from INIT to 2WAY, 2-Way Received
002914: Sep 13 07:52:29.842 CEST: %OSPF-5-ADJCHG: Process 1, Nbr 10.100.0.106 on GigabitEthernet2/3 from 2WAY to EXSTART, AdjOK?
002915: Sep 13 07:52:29.842 CEST: %OSPF-5-ADJCHG: Process 1, Nbr 10.100.0.106 on GigabitEthernet2/3 from EXSTART to EXCHANGE, Negotiation Done
002916: Sep 13 07:52:30.138 CEST: %OSPF-5-ADJCHG: Process 1, Nbr 10.100.0.106 on GigabitEthernet2/3 from EXCHANGE to LOADING, Exchange Done
002917: Sep 13 07:52:30.138 CEST: %OSPF-5-ADJCHG: Process 1, Nbr 10.100.0.106 on GigabitEthernet2/3 from LOADING to FULL, Loading Done
002918: Sep 13 08:16:40.210 CEST: %OSPF-5-ADJCHG: Process 1, Nbr 10.100.0.105 on GigabitEthernet1/3 from FULL to INIT, 1-Way
002919: Sep 13 08:16:40.498 CEST: %OSPF-5-ADJCHG: Process 1, Nbr 10.100.0.106 on GigabitEthernet2/3 from FULL to INIT, 1-Way
002920: Sep 13 08:16:40.758 CEST: %OSPF-5-ADJCHG: Process 1, Nbr 10.100.0.105 on GigabitEthernet1/3 from INIT to 2WAY, 2-Way Received
002921: Sep 13 08:16:40.762 CEST: %OSPF-5-ADJCHG: Process 1, Nbr 10.100.0.105 on GigabitEthernet1/3 from 2WAY to EXSTART, AdjOK?
002922: Sep 13 08:16:40.762 CEST: %OSPF-5-ADJCHG: Process 1, Nbr 10.100.0.105 on GigabitEthernet1/3 from EXSTART to EXCHANGE, Negotiation Done
002923: Sep 13 08:16:40.818 CEST: %OSPF-5-ADJCHG: Process 1, Nbr 10.100.0.105 on GigabitEthernet1/3 from EXCHANGE to LOADING, Exchange Done
002924: Sep 13 08:16:40.818 CEST: %OSPF-5-ADJCHG: Process 1, Nbr 10.100.0.105 on GigabitEthernet1/3 from LOADING to FULL, Loading Done
002925: Sep 13 08:16:40.954 CEST: %OSPF-5-ADJCHG: Process 1, Nbr 10.100.0.106 on GigabitEthernet2/3 from INIT to 2WAY, 2-Way Received
002926: Sep 13 08:16:40.954 CEST: %OSPF-5-ADJCHG: Process 1, Nbr 10.100.0.106 on GigabitEthernet2/3 from 2WAY to EXSTART, AdjOK?
002927: Sep 13 08:16:40.954 CEST: %OSPF-5-ADJCHG: Process 1, Nbr 10.100.0.106 on GigabitEthernet2/3 from EXSTART to EXCHANGE, Negotiation Done
002928: Sep 13 08:16:41.034 CEST: %OSPF-5-ADJCHG: Process 1, Nbr 10.100.0.106 on GigabitEthernet2/3 from EXCHANGE to LOADING, Exchange Done
002929: Sep 13 08:16:41.034 CEST: %OSPF-5-ADJCHG: Process 1, Nbr 10.100.0.106 on GigabitEthernet2/3 from LOADING to FULL, Loading Done
002930: Sep 13 08:52:56.226 CEST: %OSPF-5-ADJCHG: Process 1, Nbr 10.100.0.105 on GigabitEthernet1/3 from FULL to INIT, 1-Way
09-13-2011 12:58 PM
Hello Joris,
Try BFD, if supported on your systems, or Ethernet OAM.
They may be better tools than fast hellos.
Your issue is quite strange, and it would be wise to open a TAC service request; however, there are alternatives, such as the ones I have listed above.
Hope to help
Giuseppe
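As a rough sketch of what the BFD alternative could look like (timer values here are illustrative only, not a recommendation; check platform and line-card support first):

```
! Per-interface BFD session parameters (illustrative timers)
interface GigabitEthernet1/3
 bfd interval 250 min_rx 250 multiplier 3
!
! Register OSPF as a BFD client on all interfaces
router ospf 1
 bfd all-interfaces
```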
09-13-2011 04:59 PM
Disclaimer
The Author of this posting offers the information contained within this posting without consideration and with the reader's understanding that there's no implied or expressed suitability or fitness for any purpose. Information provided is for informational purposes only and should not be construed as rendering professional advice of any kind. Usage of this posting's information is solely at reader's own risk.
Liability Disclaimer
In no event shall Author be liable for any damages whatsoever (including, without limitation, damages for loss of use, data or profit) arising out of the use or inability to use the posting's information even if Author has been advised of the possibility of such damage.
Posting
Changing a cable alone would seem a very unusual catalyst (pun not intended) to impact OSPF, unless you've since been corrupting frames. I assume interface stats are all clean?
100% nothing else has changed? Overall CPU load fairly low without any spikes? No increase, especially bursts, in link traffic?
As Giuseppe suggested, BFD, if supported, might be tried. Supposedly less load on the control plane.
What type of line cards are these gig ports on?
PS:
Since you're working toward fast convergence (I also note the OSPF timers in your posted configuration), you might also consider using iSPF, if supported.
09-14-2011 01:08 AM
Thanks for the input!
Nothing else has changed.
Interface stats are indeed clean, and the load on the interfaces and CPU is currently around 1%, as we are not actually routing over these links yet. The line cards are WS-X6748-SFP. We will start using this new MAN in October, that is, if this problem is solved of course. The problem even qualified for a reboot of the 6500, but unfortunately it still remains.
The design for this network was made by Cisco AS. I'm afraid I can't adjust the design by implementing new features like BFD and iSPF in the short term, but I will keep them in mind for the future. I will now try to contact the Cisco project engineer responsible for the design. Once I know the cause of this problem and its solution, I'll post them here.
Cheers,
Joris
09-14-2011 02:39 AM
From what you've further described, it doesn't sound like there should be this issue.
Since this is a MAN link, is the cable end-to-end all yours, or is there Service Provider topology "hidden" along the path? In other words, might it be possible a Service Provider is dropping the packets in a way that is invisible to you?
Also from what you describe, I doubt it would make a difference in this instance, but does the line card have a CFC or DFC?
Otherwise, I will hold you to your promise of posting the solution.
09-15-2011 06:59 AM
The cable is ours; this router is connected locally with a 5 m MM fiber to our distribution switches. I've got support from the Cisco consultant now, but still no solution. We actually tried BFD, but the BFD session stays up when the OSPF neighborship is reset. A very strange issue. I just upgraded the core router IOS; we'll see if this brings any improvement...
09-26-2011 05:40 AM
As promised, the cause and solution for this interesting problem:
The Route Processor input queue got overloaded and Selective Packet Discard (SPD) was dropping the OSPF hellos, so the adjacency was reset. Apparently we use too many routing and L2 keepalive protocols (OSPF, BGP, HSRP, etc.), and during short periods when high-priority packets (BGP sync?) were passing, the Route Processor of the 6500 couldn't handle all the packets. You can see this with the hidden "show interfaces switching" command.
The solution was to increase the SPD queue thresholds, and now the problem is solved.
spd extended-headroom 1000
ip spd queue max-threshold 1499
ip spd queue min-threshold 1498
More info here:
http://www.cisco.com/en/US/products/hw/routers/ps167/products_tech_note09186a008012fb87.shtml
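For anyone running into this later, a sketch of the commands we used to inspect SPD behaviour (output omitted here, as the format varies per IOS release):

```
! Show current SPD state, mode, and queue thresholds
show ip spd
! Hidden command: per-protocol switching stats, including SPD drops
show interfaces gigabitEthernet 1/3 switching
```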
Regards,
Joris
09-26-2011 06:19 AM
Joris,
Thank you very much for sharing these insights with us!
Best regards,
Peter
09-26-2011 11:49 AM
A promise kept - thank you.
PS:
Oh, just wondering, when I initially asked whether interface stats were clean, and you responded they were, did the stats not show any input queue drops? Something like this 6500's gig interface?
Input queue: 0/75/2/2 (size/max/drops/flushes); Total output drops: 1925
09-26-2011 12:21 PM
Oops, wrong button; now it seems you answered your own question, but no matter.
I probably looked more at the other interface statistics, but I don't recall seeing these drops or my colleague at Cisco mentioning them. If I understand it correctly, what you see in this input queue output are the statistics of the hardware input queue of the interface. SPD uses two additional queues, called the headroom and the extended headroom, and it is there that the OSPF hellos were dropped. These SPD drops are only shown with `show interfaces switching`. I do think that, as these additional queues are only filled when the standard queue is full, you would expect to see drops on the interface itself as well. I'll try to find out tomorrow whether this was the case; I hope I saved an output of `show interfaces` somewhere. These drops would be a good indication to look further into SPD if one is facing the problem we had last week.
I found this link also to be very informative:
http://www.cisco.com/web/about/security/intelligence/spd.html