Odd OSPF problem on Catalyst 4500-X-32 switch

PatrickCavell85782 · ‎11-27-2023

My customer is having an odd OSPF problem on a Cat 4500X-32 switch with an Enterprise Services license and running cat4500e-universalk9.SPA.03.11.04.E.152-7.E4.bin software.

Switch has 6 OSPF adjacencies, all on broadcast network types. Each adjacency is on a different VLAN. They are all going down regularly (every few minutes) due to the Dead timers expiring. Timers are set at Hello = 10, Dead = 40. Nothing has changed in this environment for months. Also, MTUs on the links have not changed. Before this problem cropped up last week, things were working fine.
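For reference, the timer arithmetic can be sketched in a few lines of Python. This is only an illustration, not how IOS implements it: the function name and the sample arrival times are made up. It shows that with Hello = 10 and Dead = 40, an adjacency only drops after roughly four consecutive Hellos fail to arrive.

```python
def dead_timer_expiries(hello_arrivals, dead_interval=40, observe_until=None):
    """Return the times (in seconds) at which the Dead timer would expire.

    hello_arrivals: sorted times at which a Hello from the neighbor arrived.
    Every received Hello restarts the timer; it fires if no Hello arrives
    within dead_interval seconds. Simplification: a Hello arriving exactly
    at the expiry instant is treated as arriving in time.
    """
    arrivals = list(hello_arrivals)
    if not arrivals:
        return []
    end = observe_until if observe_until is not None else arrivals[-1]
    expiries = []
    # Compare each Hello with the next one (or with the end of observation).
    for prev, nxt in zip(arrivals, arrivals[1:] + [end]):
        if nxt - prev > dead_interval:
            expiries.append(prev + dead_interval)
    return expiries


# Hellos at 40, 50, 60 and 70 s are lost, so the timer fires at 30 + 40 = 70 s.
print(dead_timer_expiries([0, 10, 20, 30, 80, 90]))  # [70]
```

Losing only two Hellos in a row (for example arrivals at 0, 10, 40, 50) would not bring the adjacency down, which is why a drop every few minutes points to sustained loss rather than the odd missed packet.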

Debug shows the switch sending Hellos every 10 seconds to all neighbors but not receiving them from neighbors every 10 seconds. A check of at least one neighbor shows it sending Hellos to this switch every 10 seconds and also receiving Hellos from this switch every 10 seconds. Don’t see any interface errors or interfaces going up and down. All neighbors are on different physical interfaces. The 6 neighbors are not reporting Dead timer expiration.

Switch had been up 16+ weeks and I had the customer reboot it yesterday. Problem came back after the reboot.

Any thoughts on what might be the culprit? Could this be a bug?

Thanks for any guidance.

balaji.bandi · ‎11-27-2023

How is STP configured for those VLANs, and how is VTP configured?

Check for any STP changes (show spanning-tree summary) and make sure the parent switch is root for the VLANs.

Post the logs, and also enable debug for OSPF packets and post that output here.

BB

Joseph W. Doherty · ‎11-27-2023

A possible culprit would be enough traffic that the OSPF packets are being dropped. I would think that unlikely, though, given that all 6 neighbors are affected.

Is your topology such that you might try setting the OSPF neighbors (logically) as p2p?

PatrickCavell85782 · ‎11-27-2023

They are configured as broadcast because they will eventually be replaced by Meraki L3 switches. Meraki implements OSPF with broadcast interfaces. Don't think Meraki supports OSPF p2p interfaces.

Joseph W. Doherty · ‎11-27-2023

Well, until you move to Meraki, using p2p might be worth considering.

For now, if you were to try it, I would suggest trying it on only one pair to see if it makes any difference to that adjacency.

What about the possibility of congestion dropping OSPF hellos? I haven't seen it often, but I have seen it. If that's the case, using QoS to ensure priority for OSPF packets tends to fix that kind of problem.

MHM Cisco World · ‎11-27-2023

Hi friend,
can you check the CPU of the switch?
Is it high or not?

PatrickCavell85782 · ‎11-27-2023

Good question. Meant to include that in the initial query. When the problem was occurring over the weekend I did see high CPU. Today, the switch has gone 3.5+ hours without any OSPF adjacencies going down, and CPU is low. The really odd thing is that this is a K12 environment and the problem started last Wed (day before Thanksgiving) and continued for the next several days. The students had been out on vacation this whole time. This morning the students are back in and it seems more stable (at least for the last 3.5 hours). So, it doesn't seem to be load related.

MHM Cisco World · ‎11-27-2023

Friend, use EEM to monitor the CPU and send a log message when the CPU is high.

Then check whether those log entries occur at the same time as the OSPF down events.

I am fully sure it is a CPU issue, not an OSPF issue.

You need to check which of the top five CPU processes (show processes cpu) is causing this on the switch.
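Once both kinds of messages are in the log, the comparison can be done offline. The following is a rough Python sketch, not a finished tool. The HIGH_CPU tag is a hypothetical string that the EEM applet would be configured to write, the sample lines are made up, and the timestamp format assumes "service timestamps log datetime msec" (the log carries no year, so one is assumed).

```python
import re
from datetime import datetime, timedelta

# Matches timestamps such as "Nov 26 03:14:07.123" (milliseconds optional).
TS = re.compile(r"([A-Z][a-z]{2}\s+\d{1,2} \d{2}:\d{2}:\d{2}(?:\.\d+)?)")


def parse_ts(line, year=2023):
    """Extract the timestamp from a log line, or return None if there is none."""
    m = TS.search(line)
    if not m:
        return None
    text = " ".join(m.group(1).split())  # collapse padding such as "Nov  6"
    fmt = "%Y %b %d %H:%M:%S.%f" if "." in text else "%Y %b %d %H:%M:%S"
    return datetime.strptime(f"{year} {text}", fmt)


def correlate(lines, window_s=60, cpu_marker="HIGH_CPU",
              ospf_marker="Dead timer expired"):
    """For each OSPF dead-timer event, list CPU alerts logged up to window_s seconds before it."""
    cpu, ospf = [], []
    for line in lines:
        ts = parse_ts(line)
        if ts is None:
            continue
        if cpu_marker in line:
            cpu.append(ts)
        elif ospf_marker in line:
            ospf.append(ts)
    window = timedelta(seconds=window_s)
    return [(o, [c for c in cpu if timedelta(0) <= o - c <= window]) for o in ospf]


# Made-up sample lines for illustration only.
SAMPLE = [
    "Nov 26 03:14:07.123: %HA_EM-6-LOG: cpu_watch: HIGH_CPU",
    "Nov 26 03:14:41.500: %OSPF-5-ADJCHG: Process 1, Nbr 10.0.0.2 on Vlan10 "
    "from FULL to DOWN, Neighbor Down: Dead timer expired",
    "Nov 26 09:00:00.000: %OSPF-5-ADJCHG: Process 1, Nbr 10.0.0.3 on Vlan20 "
    "from FULL to DOWN, Neighbor Down: Dead timer expired",
]

for event, alerts in correlate(SAMPLE):
    print(event, "CPU alerts in preceding window:", len(alerts))
```

If most adjacency drops have a CPU alert in the preceding window, that supports the CPU theory; drops with no alert nearby would point elsewhere.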

MHM

Reza Sharifi · ‎11-27-2023

Have a look at this post with Peter's explanation and the command he is suggesting to troubleshoot the problem. Same platform and the same version of IOS.

https://community.cisco.com/t5/switching/4500x-high-cpu-and-ospf-dropping/td-p/4908113

HTH

PatrickCavell85782 · ‎11-29-2023

It's been over 2 days and the problem has not presented itself again. It occurred with much regularity from last Wed through this Sunday (during the US Thanksgiving holidays) when students were out and activity was low. Students came back Monday morning and it's been rock solid. The link Reza sent above is very interesting. I think something like that was occurring, causing the CPU to spike. Why it stopped I have no idea. Customer claims nothing was changed and no events were reported (ex: power fluctuations). My feeling is that something on that network was behaving such that it spiked the CPU, and now that device is off or is behaving normally. I can't help but think some power issue unbeknownst to the IT staff occurred. In any event, the customer has activated an EEM script that triggers on the OSPF Dead timer going off and gathers show tech and other info.

Thanks for all the feedback.

Joseph W. Doherty · ‎11-29-2023

"I think something like that was occurring, causing the CPU to spike."

Further, whatever that something might be, it would likely need to "starve" OSPF of processing cycles, which I would think is uncommon, as I would expect a routing protocol process to have higher priority for CPU than many other processes. In other words, not just any kind of high CPU would necessarily cause OSPF issues.

Also BTW, since switches have dedicated hardware (e.g. ASICs) to handle most of their work, i.e. the dataplane, it's not unusual for their CPU to have little overall capacity. I.e., it often doesn't take much traffic that requires CPU processing to overwhelm a switch's CPU.

PatrickCavell85782 · ‎12-12-2023

We may have solved the problem. Set an EEM script to do a packet capture on the control plane when triggered by the OSPF Dead timers going off. We are finding that some Dell computers are sending a lot of IPv6 Multicast Listener Report packets. We think this is happening when they go to sleep. These are new Dell computers in a K12 lab, just deployed a few weeks ago. This morning the customer disabled IPv6 on these Dells. Now in wait-and-see mode. Note that we do not see this rash of IPv6 packets in a control plane capture when the system is stable.
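For anyone doing the same analysis, a tally along these lines is enough to spot the top talkers. It is a sketch only: it assumes the source address and ICMPv6 type of each packet have already been pulled out of the capture by whatever tool is at hand, and the function name and addresses are made up. ICMPv6 types 131 and 143 are the MLDv1 and MLDv2 Listener Reports.

```python
from collections import Counter

# ICMPv6 message types used for Multicast Listener Discovery reports.
MLD_REPORT_TYPES = {131: "MLDv1 report", 143: "MLDv2 report"}


def top_mld_reporters(packets, limit=5):
    """Count MLD reports per source address.

    packets: iterable of (source_address, icmpv6_type) tuples extracted
    from the control-plane capture. Returns the busiest sources first.
    """
    counts = Counter(src for src, icmp_type in packets
                     if icmp_type in MLD_REPORT_TYPES)
    return counts.most_common(limit)


# Made-up example: one host sends three MLDv2 reports, another one MLDv1
# report, and a neighbor solicitation (type 135) is ignored.
example = [("fe80::1", 143)] * 3 + [("fe80::2", 131), ("fe80::3", 135)]
print(top_mld_reporters(example))  # [('fe80::1', 3), ('fe80::2', 1)]
```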

MHM Cisco World · ‎12-12-2023

As I mentioned before, it is the CPU, and using EEM was a smart step.

Good luck friend

MHM

MHM Cisco World · ‎12-12-2023

A few more points to help you with troubleshooting:

show traffic statistics

Check multicast count

You can reduce this multicast by applying storm control to limit the multicast passing through the suspect interfaces; in the end, you cannot simply disconnect these users.

MHM

Joseph W. Doherty · ‎12-13-2023

Ah, yup, just the kind of thing to spike your CPU. Further, "accepting" the multicast packets for "review" might also be the kind of thing to preempt OSPF processing.

Other than eliminating the multicast generation on the PCs, another possible way to mitigate this would be via CoPP.