09-13-2011 06:52 AM - edited 03-07-2019 02:11 AM
Hi there,
I'm having a strange issue with OSPF fast hellos. We are currently implementing a new MAN in the Netherlands with two new C6500 sup720-10G core routers and multiple C6500 sup720-3B routers. The core routers are currently connected only to two distribution routers at two locations. Next month we will connect multiple sites and the two core locations using Ethernet connections over a DWDM infrastructure.
On both core routers OSPF is currently active over direct UTP connections to two distribution routers. This worked fine for months, until a colleague swapped the cables at one site two weeks ago. Now, every day, the OSPF adjacencies to the PLW neighbors are reset multiple times at irregular intervals, always for both distribution routers at the same time. However, a third OSPF neighbor, a backup connection to a C3750 at a remote site, is not being reset. We did of course replace the cables at this location once more, but without result; the neighborship is still being reset. The only things that help are removing the "ip ospf dead-interval minimal hello-multiplier 4" command, or raising the hello timers; I used 1 second, so the dead interval is 4 seconds. This is not a satisfactory solution though, as we want the fast convergence of the fast hellos, and I really don't see any reason why this shouldn't work on our Gigabit MAN.
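For reference, these are the two timer variants described above, as a sketch (interface name taken from our config; only the relevant lines shown):

```
! Fast hellos: dead interval fixed at 1 second, hellos sent 4x per second
interface GigabitEthernet1/3
 ip ospf dead-interval minimal hello-multiplier 4
!
! The slower workaround: 1-second hellos, 4-second dead interval
interface GigabitEthernet1/3
 ip ospf hello-interval 1
 ip ospf dead-interval 4
```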
I also had this issue on an interconnection between two distribution routers at another site, but since that port-channel link was going to be replaced by an L3 VLAN and is not very important anyway, I ignored it. But now that I am having the same problem on a core router, I am afraid we will run into more of these problems when we activate all the Ethernet-over-DWDM links in our new core in October.
I did some research on the Cisco support site and some other sites, but I couldn't find a note on a similar problem. Are there any known issues with OSPF fast hellos? Any idea in which direction I should look to solve this problem? I find it very strange that both links from core to distribution routers are always reset at the same time.
I hope someone can help me.
Best regards,
Joris van Rooden
sh version:
Cisco IOS Software, s72033_rp Software (s72033_rp-ADVENTERPRISEK9_WAN-M), Version 12.2(33)SXI5, RELEASE SOFTWARE (fc2)
Config:
router ospf 1
router-id 10.100.0.102
log-adjacency-changes detail
auto-cost reference-bandwidth 40000
area 0 authentication message-digest
timers throttle spf 10 100 5000
timers throttle lsa all 10 100 5000
timers lsa arrival 80
passive-interface default
no passive-interface GigabitEthernet1/3
no passive-interface GigabitEthernet1/20
no passive-interface GigabitEthernet2/3
network 10.100.0.102 0.0.0.0 area 0
network 10.100.0.0 0.0.255.255 area 0
!
!
interface GigabitEthernet1/3
description Distribution-router-1 Gi3/13
dampening
mtu 9216
ip address 10.100.102.9 255.255.255.248
ip ospf message-digest-key 1 md5 7 xxxxxxxx
ip ospf network point-to-point
ip ospf dead-interval minimal hello-multiplier 4
logging event link-status
load-interval 30
carrier-delay msec 0
mpls ip
end
sh ip ospf neighbor:
Neighbor ID Pri State Dead Time Address Interface
10.100.0.127 0 FULL/ - 812 msec 10.100.113.10 GigabitEthernet1/20
10.100.0.106 0 FULL/ - 00:00:03 10.100.102.26 GigabitEthernet2/3
10.100.0.105 0 FULL/ - 852 msec 10.100.102.10 GigabitEthernet1/3
Sh log:
002913: Sep 13 07:52:29.842 CEST: %OSPF-5-ADJCHG: Process 1, Nbr 10.100.0.106 on GigabitEthernet2/3 from INIT to 2WAY, 2-Way Received
002914: Sep 13 07:52:29.842 CEST: %OSPF-5-ADJCHG: Process 1, Nbr 10.100.0.106 on GigabitEthernet2/3 from 2WAY to EXSTART, AdjOK?
002915: Sep 13 07:52:29.842 CEST: %OSPF-5-ADJCHG: Process 1, Nbr 10.100.0.106 on GigabitEthernet2/3 from EXSTART to EXCHANGE, Negotiation Done
002916: Sep 13 07:52:30.138 CEST: %OSPF-5-ADJCHG: Process 1, Nbr 10.100.0.106 on GigabitEthernet2/3 from EXCHANGE to LOADING, Exchange Done
002917: Sep 13 07:52:30.138 CEST: %OSPF-5-ADJCHG: Process 1, Nbr 10.100.0.106 on GigabitEthernet2/3 from LOADING to FULL, Loading Done
002918: Sep 13 08:16:40.210 CEST: %OSPF-5-ADJCHG: Process 1, Nbr 10.100.0.105 on GigabitEthernet1/3 from FULL to INIT, 1-Way
002919: Sep 13 08:16:40.498 CEST: %OSPF-5-ADJCHG: Process 1, Nbr 10.100.0.106 on GigabitEthernet2/3 from FULL to INIT, 1-Way
002920: Sep 13 08:16:40.758 CEST: %OSPF-5-ADJCHG: Process 1, Nbr 10.100.0.105 on GigabitEthernet1/3 from INIT to 2WAY, 2-Way Received
002921: Sep 13 08:16:40.762 CEST: %OSPF-5-ADJCHG: Process 1, Nbr 10.100.0.105 on GigabitEthernet1/3 from 2WAY to EXSTART, AdjOK?
002922: Sep 13 08:16:40.762 CEST: %OSPF-5-ADJCHG: Process 1, Nbr 10.100.0.105 on GigabitEthernet1/3 from EXSTART to EXCHANGE, Negotiation Done
002923: Sep 13 08:16:40.818 CEST: %OSPF-5-ADJCHG: Process 1, Nbr 10.100.0.105 on GigabitEthernet1/3 from EXCHANGE to LOADING, Exchange Done
002924: Sep 13 08:16:40.818 CEST: %OSPF-5-ADJCHG: Process 1, Nbr 10.100.0.105 on GigabitEthernet1/3 from LOADING to FULL, Loading Done
002925: Sep 13 08:16:40.954 CEST: %OSPF-5-ADJCHG: Process 1, Nbr 10.100.0.106 on GigabitEthernet2/3 from INIT to 2WAY, 2-Way Received
002926: Sep 13 08:16:40.954 CEST: %OSPF-5-ADJCHG: Process 1, Nbr 10.100.0.106 on GigabitEthernet2/3 from 2WAY to EXSTART, AdjOK?
002927: Sep 13 08:16:40.954 CEST: %OSPF-5-ADJCHG: Process 1, Nbr 10.100.0.106 on GigabitEthernet2/3 from EXSTART to EXCHANGE, Negotiation Done
002928: Sep 13 08:16:41.034 CEST: %OSPF-5-ADJCHG: Process 1, Nbr 10.100.0.106 on GigabitEthernet2/3 from EXCHANGE to LOADING, Exchange Done
002929: Sep 13 08:16:41.034 CEST: %OSPF-5-ADJCHG: Process 1, Nbr 10.100.0.106 on GigabitEthernet2/3 from LOADING to FULL, Loading Done
002930: Sep 13 08:52:56.226 CEST: %OSPF-5-ADJCHG: Process 1, Nbr 10.100.0.105 on GigabitEthernet1/3 from FULL to INIT, 1-Way
09-13-2011 12:58 PM
Hello Joris,
Try BFD, if supported on your systems, or Ethernet OAM.
They may be better tools than fast hellos.
Your issue is quite strange, and it would be wise to open a TAC service request; however, there are alternatives, such as the ones I have listed above.
Hope to help
Giuseppe
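As a rough sketch of what the BFD alternative could look like (timer values here are illustrative only, not a recommendation; check platform and line-card support first):

```
! Per-interface BFD session parameters (illustrative timers)
interface GigabitEthernet1/3
 bfd interval 250 min_rx 250 multiplier 3
!
! Register OSPF as a BFD client on all interfaces
router ospf 1
 bfd all-interfaces
```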
09-13-2011 04:59 PM
Disclaimer
The Author of this posting offers the information contained within this posting without consideration and with the reader's understanding that there's no implied or expressed suitability or fitness for any purpose. Information provided is for informational purposes only and should not be construed as rendering professional advice of any kind. Usage of this posting's information is solely at reader's own risk.
Liability Disclaimer
In no event shall Author be liable for any damages whatsoever (including, without limitation, damages for loss of use, data or profit) arising out of the use or inability to use the posting's information even if Author has been advised of the possibility of such damage.
Posting
Changing a cable alone would seem a very unusual catalyst (pun not intended) to impact OSPF, unless you've since been corrupting frames. I assume interface stats are all clean?
100% nothing else has changed? Overall CPU load fairly low without any spikes? No increase, especially bursts, in link traffic?
As Giuseppe suggested, BFD, if supported, might be tried. Supposedly less load on the control plane.
What type of line cards are these gig ports on?
PS:
Since you're working toward fast convergence (I also note the OSPF timers in your posted configuration), you might also consider using iSPF, if supported.
09-14-2011 01:08 AM
Thanks for the input!
Nothing else has changed.
Interface stats are indeed clean, and the load on the interfaces and CPU is currently around 1%, as we are not actually routing over these links yet. The line cards are WS-X6748-SFP. We will start using this new MAN in October, that is, if this problem is solved of course. The problem even qualified for a reboot of the 6500, but unfortunately it still remains.
The design for this network was made by Cisco AS. I'm afraid I can't adjust the design by implementing new features like BFD and iSPF in the short term, but I will keep them in mind for the future. I will now try to contact the Cisco project engineer responsible for the design. Once I know the cause of this problem and its solution, I'll post them here.
Cheers,
Joris
09-14-2011 02:39 AM
From what you've further described, it doesn't sound like there should be this issue.
Since this is a MAN link, is the cable end-to-end all yours, or is there Service Provider topology "hidden" along the path? In other words, might it be possible a Service Provider is dropping the packets in a way that is invisible to you?
Also from what you describe, I doubt it would make a difference in this instance, but does the line card have a CFC or DFC?
Otherwise, I will hold you to your promise of posting the solution.
09-15-2011 06:59 AM
The cable is ours; this router is connected locally with a 5 m MM fiber to our distribution switches. I've got support from the Cisco consultant now, but still no solution. We actually tried BFD, but the BFD session stays up when the OSPF neighborship is reset. A very strange issue. I just upgraded the core router IOS; we'll see if this brings any improvement...
09-26-2011 05:40 AM
As promised, the cause and solution for this interesting problem:
The Route Processor input queue got overloaded and Selective Packet Discard (SPD) was dropping the OSPF hellos, so the adjacency was reset. Apparently we use too many routing and L2 keepalive protocols (OSPF, BGP, HSRP, etc.), and during short periods when high-priority packets (BGP sync?) were passing, the Route Processor of the 6500 couldn't handle all the packets. You can see this with the hidden "show interfaces switching" command.
The solution was to increase the SPD queue thresholds, and now the problem is solved.
spd extended-headroom 1000
ip spd queue max-threshold 1499
ip spd queue min-threshold 1498
More info here:
http://www.cisco.com/en/US/products/hw/routers/ps167/products_tech_note09186a008012fb87.shtml
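For anyone running into this later, a sketch of the commands we used to inspect SPD behaviour (output omitted here, as the format varies per IOS release):

```
! Show current SPD state, mode, and queue thresholds
show ip spd
! Hidden command: per-protocol switching stats, including SPD drops
show interfaces gigabitEthernet 1/3 switching
```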
Regards,
Joris
09-26-2011 06:19 AM
Joris,
Thank you very much for sharing these insights with us!
Best regards,
Peter
09-26-2011 11:49 AM
A promise kept - thank you.
PS:
Oh, just wondering, when I initially asked whether interface stats were clean, and you responded they were, did the stats not show any input queue drops? Something like this 6500's gig interface?
Input queue: 0/75/2/2 (size/max/drops/flushes); Total output drops: 1925
09-26-2011 12:21 PM
Oops, wrong button; now it seems you answered your own question, but no matter.
I probably looked more at the other interface statistics, but I don't recall seeing these drops or my colleague at Cisco mentioning them. If I understand it correctly, what you see in this input queue output are the statistics of the hardware input queue of the interface. SPD uses two additional queues, called the headroom and the extended headroom, and it is there that the OSPF hellos were dropped. These SPD drops are only shown with `show interfaces switching`. I do think that, as these additional queues are only filled when the standard queue is full, you would expect to see drops on the interface itself as well. I'll try to find out tomorrow whether this was the case; I hope I saved an output of `show interfaces` somewhere. These drops would be a good indication to look further into SPD if one is facing the problem we had last week.
I found this link also to be very informative:
http://www.cisco.com/web/about/security/intelligence/spd.html