Random CPU / Unresponsiveness Issue

jonathanw84 · ‎11-08-2017

Hello,

We have a number of satellite offices in various parts of the country, and since last week we have been experiencing a very odd issue at nearly every site. Every 24 hours, the 4510 chassis at each location (with each site having a different internet circuit / ISP) stops responding to ICMP / SNMP / SSH requests and CPU utilization goes high. Regular traffic is still being passed, because any endpoints that hang off these switches remain up and accessible. This looks like an issue with the control plane on these devices, and perhaps a scan or something else unknown on the network is causing it. I've opened a TAC case, but they can find nothing wrong. We are running version 03.04.07 SG on these devices. What is the best way to go about troubleshooting this? How can we look at the control plane and determine what is hammering it on all of these devices?

Thanks,

Jonathan

Matt Delony · ‎11-08-2017

Hello Jonathan,

Extended periods of 99%+ CPU utilization can cause issues with processing control traffic in a timely manner.

For 4500 switches, I usually follow this guide. Basically, it boils down to the following steps:

Identify top CPU process contributing to high CPU
- show process cpu sorted
- If process is "cat4k Mgmt LoPri":
  - show platform health
  - Check for any process where "actual" utilization is higher than "target" utilization. A management process is considered high priority while its actual utilization stays at or below target. Once it exceeds target, it is treated as a low priority management process, which is why the "cat4k Mgmt LoPri" process goes high.
Check for any CPU queues that are getting overwhelmed
- show platform cpu packet statistics

Once you identify the top process contributing to high CPU, it should give more evidence as to the nature of the cause.

If the cause of the high CPU is traffic-based, then I usually move on to the built-in CPU sniffer to identify the details of the offending traffic (src/dst ip, src/dst mac, ingress interface, reason for hitting CPU, etc.).
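As a rough sketch of what using the CPU sniffer can look like, the commands below are the ones Cisco's Catalyst 4500 high-CPU documentation describes for capturing packets punted to the CPU. I am assuming they are available in the same form on your software train (03.04.07 SG), so please confirm with TAC before relying on them.

```
! Start capturing packets received by the CPU into a circular buffer
! (assumption: same syntax on this release as in the Catalyst 4500 guide)
debug platform packet all receive buffer

! After the CPU has been high for a short while, dump the captured packets
show platform cpu packet buffered

! Stop the capture when finished
undebug all
```

The buffered output is where you would look for a repeating source/destination, ingress interface, or CPU event that lines up with the busy queue seen in show platform cpu packet statistics.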

The best time to collect these outputs is while the issue is actively occurring. Otherwise, most of the data will be irrelevant. Depending on how long the CPU spike lasts, you could collect these outputs manually if you have enough time. If the high CPU utilization is intermittent, I usually recommend implementing an EEM script so that the aforementioned outputs can be collected automatically and written to a file on flash. Perhaps you can ask TAC if this is an option?
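For illustration only, an EEM applet along these lines could collect the outputs automatically. The applet name, the 90% threshold, the 5-second poll interval, and the bootflash:high_cpu.txt file name are all my own placeholders. The OID is the CISCO-PROCESS-MIB 5-second CPU object that Cisco's EEM examples commonly use; the trailing index (.1) is an assumption and may differ on your supervisor.

```
! Hypothetical applet: triggers when 5-second CPU utilization reaches 90%
event manager applet HIGH_CPU_CAPTURE
 event snmp oid 1.3.6.1.4.1.9.9.109.1.1.1.1.3.1 get-type exact entry-op ge entry-val 90 poll-interval 5
 action 1.0 syslog msg "High CPU detected, collecting outputs to bootflash"
 action 2.0 cli command "enable"
 action 3.0 cli command "show clock | append bootflash:high_cpu.txt"
 action 4.0 cli command "show process cpu sorted | append bootflash:high_cpu.txt"
 action 5.0 cli command "show platform health | append bootflash:high_cpu.txt"
 action 6.0 cli command "show platform cpu packet statistics | append bootflash:high_cpu.txt"
```

Since the applet can fire again on later polls while the CPU stays high, the file may grow quickly, so check free space on flash and remove the applet once you have the data you need.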