05-03-2010 09:20 AM - edited 03-06-2019 10:54 AM
Hi all,
This weekend our data center switch had a meltdown. I consoled into the switch and saw that CPU utilization was at 100%. I ran "show proc cpu sort" and ARP Input was at 14%. I could not isolate which other process was hogging the CPU, so I rebooted the switch and that resolved the issue. I also did a Bug Toolkit search on this particular IOS release and nothing really stood out. We have been running this IOS for almost a year and a half.
I am wondering if there is an IOS command that can backtrace which process hogged the CPU and help find the root cause of the problem.
Thanks.
Sup-720 base
IOS: advance IP services 12.2(18)SXF15
05-03-2010 11:14 AM
Hello Kevin,
>> I am wondering if there is an IOS command that can backtrace which process hogged the CPU and help find the root cause
Not after a reload, but I understand that you needed a quick workaround.
If possible, in similar cases you should capture sh proc cpu and sh log and save the output to a text file before reloading the device (the logs can also be retrieved from a syslog server if the device exports its log messages to an external server).
Also, interrupts could be the main consumers of CPU resources. The second number in the five-second CPU utilization field shows how much CPU is used at interrupt level, so it is possible to have the CPU at 100% while sh proc cpu sorted shows no single process above, say, 20 percent.
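To make the two numbers easier to read in saved output, here is a minimal Python sketch that splits the five-second field into interrupt-level and process-level usage. The helper name split_cpu and the sample header line are made up for illustration; only the "total%/interrupt%" layout of the field is taken from the sh proc cpu output.

```python
import re

# Matches the five-second field in the first line of "show processes cpu",
# which has the form: total% / interrupt-level%
HEADER_RE = re.compile(r"CPU utilization for five seconds:\s*(\d+)%/(\d+)%")

def split_cpu(header: str) -> dict:
    """Split the five-second CPU figure into total, interrupt and process shares."""
    match = HEADER_RE.search(header)
    if match is None:
        raise ValueError("no five-second CPU utilization field found")
    total = int(match.group(1))
    interrupt = int(match.group(2))
    # Whatever is not spent at interrupt level is spent in scheduled processes.
    return {"total": total, "interrupt": interrupt, "process": total - interrupt}

# Hypothetical sample line, not taken from the switch in this thread.
sample = "CPU utilization for five seconds: 100%/82%; one minute: 97%; five minutes: 95%"
print(split_cpu(sample))
```

With a line like the sample, most of the load is at interrupt level, which would explain why no individual process in the sorted list looks guilty.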
Very high CPU usage can be triggered by a bridging loop that leads to a broadcast storm, forcing the CPU to process a rapidly increasing number of broadcast frames (there is no TTL at OSI layer 2, so frames are not dropped after circulating many times in the loop, and they are multiplied on every pass).
A bridging loop can also have side effects, for example on HSRP groups: if the broadcast domains of two VLANs are joined, the device can receive on VLAN X the multicast HSRP hello packets of VLAN Y, and this can cause problems even on C6500 devices. A safeguard is to use a different MD5 authentication key for the HSRP groups in different VLANs, so that if two VLANs are accidentally joined, HSRP group N ignores the frames of group M and vice versa.
Hope to help
Giuseppe
05-03-2010 11:17 AM
Configure the service nagle command, which might help you access the box when the CPU is at 100%. Then you will be able to run the commands Giuseppe suggests before you have to reboot it.
Victor