OSPF Fast Hellos & CPU/Priority Bizarre Problem

Matt Reader · ‎06-16-2013

Hi, I'll try and summarise this problem we have and I'll keep the info brief!

We're implementing a LAN generally based around 4500 Series (4506-E/4510R) running Layer 3 with multiple VRFs to the access layer. OSPF is configured as the routing protocol over point-to-point links, and the timers have been amended as follows:

ip ospf dead-interval minimal hello-multiplier 4

timers throttle spf 10 100 5000

timers throttle lsa all 10 100 5000

timers lsa arrival 80
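
For context, here is a sketch of where these commands would typically sit in an IOS configuration. The interface name, OSPF process ID and VRF name are placeholders, not taken from the actual switches; the comments give the usual meaning of each timer.

```
! Placeholder interface, process ID and VRF name (assumptions)
interface TenGigabitEthernet1/1
 ! Dead interval fixed at 1 second, 4 hellos per second (one every 250 ms)
 ip ospf dead-interval minimal hello-multiplier 4
!
router ospf 1 vrf EXAMPLE
 ! SPF throttle: 10 ms initial delay, 100 ms hold time, 5000 ms maximum wait
 timers throttle spf 10 100 5000
 ! LSA generation throttle: 10 ms start, 100 ms hold, 5000 ms maximum
 timers throttle lsa all 10 100 5000
 ! Minimum of 80 ms between accepting instances of the same LSA
 timers lsa arrival 80
```

With a 1-second dead interval, a neighbor is declared down once no hello has been processed for about one second, so any stall in hello handling of that length is enough to drop the adjacency.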

Now, the problem we have is that the adjacencies on these links keep dropping on a random and inconsistent basis (and come back up instantly). We're fairly confident that there aren't physical problems on these links. I've even mirrored the physical interface and captured the hellos in both directions with Wireshark when the problem occurs: there is no flooded traffic or anything out of the ordinary.

The interesting part is that when we were checking CPU history etc. on the switches, everything seemed fairly normal until a colleague and I noticed an anomaly: when you enter basic CLI commands such as show run, the adjacencies occasionally drop at the very same time. When running show tech-support on the switches, the adjacencies flap virtually every time. Unless I'm going off on a wild tangent, is the CPU being interrupted too much for whatever reason and simply dropping the OSPF Hellos?

Thanks
Matt

pille1234 · ‎06-16-2013

Unless I'm going off on a wild tangent, is the CPU being interrupted too much for whatever reason and simply dropping the OSPF Hellos?

That's obviously what is happening here. While I have no experience with the 4500, I can see several seconds of CPU load above 90% on our 6500 switches when doing a show run. It looks like the CPU is too busy to process the incoming hellos or to send outgoing hello packets.

The question is, why do you need sub-second timers in your setup? With p2p links, wouldn't the devices detect a link failure immediately anyway?

Depending on the hardware capabilities (as I said, no experience with the 4500), you may want to have a look at Bidirectional Forwarding Detection (BFD). That protocol works in sub-second timeframes as well and can be associated with OSPF. At least on the Nexus 7k there is the additional advantage that the hello processing is offloaded to the line cards, so it is independent of CPU load.
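
As a rough illustration, this is a minimal sketch of what tying BFD to OSPF looks like on IOS platforms that support it. The interface name and the timer values are illustrative assumptions, and whether BFD is available at all depends on the hardware and software release.

```
! Placeholder interface; timer values are examples only
interface TenGigabitEthernet1/1
 ! Send and expect BFD control packets every 50 ms; fail after 3 misses
 bfd interval 50 min_rx 50 multiplier 3
 ! Register OSPF as a BFD client on this interface
 ip ospf bfd
```

With BFD handling failure detection, the OSPF hello and dead timers can normally be left at less aggressive values.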

Regards Pille

Matt Reader · ‎06-20-2013

Thanks for the reply. We had BFD on a list as something to try, but it appears it isn't supported on the kit we have.

We're going to try removing fast hello and see what in reality the convergence times are with some test scenarios.
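
A sketch of what removing fast hellos might look like, assuming the command was applied per interface; the interface name is a placeholder.

```
! Placeholder interface name
interface TenGigabitEthernet1/1
 ! Remove fast hellos; timers return to the defaults for the network type
 no ip ospf dead-interval
```

The resulting hello and dead intervals can be checked with show ip ospf interface. Both ends of a link need matching timers, otherwise the adjacency will not form.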

Regards

Matt

Joseph W. Doherty · ‎06-20-2013

Disclaimer

The Author of this posting offers the information contained within this posting without consideration and with the reader's understanding that there's no implied or expressed suitability or fitness for any purpose. Information provided is for informational purposes only and should not be construed as rendering professional advice of any kind. Usage of this posting's information is solely at reader's own risk.

Liability Disclaimer

In no event shall Author be liable for any damages whatsoever (including, without limitation, damages for loss of use, data or profit) arising out of the use or inability to use the posting's information even if Author has been advised of the possibility of such damage.

Posting

If you have a support contract, you might raise this issue with TAC. Ideally, most CLI commands shouldn't be stealing CPU from routing processes.

I don't know whether the 4500 supports any variant of the "scheduler" commands, and if it does, whether such a modification would make a difference.