Re: How to troubleshoot high CPU utilization on Nexus 5010

sachin goel · ‎07-21-2010

My monitoring tool is reporting alerts for high CPU utilization on a Nexus 5010. The image is 4.1(3)N1(1).

The only command supported on this code is sh proc cpu, and its output does not really show the current CPU utilization. How do I troubleshoot the cause of high CPU on Nexus switches?

Any info will be much appreciated

thx

Jayakrishna Mada · ‎08-05-2010

Hi,

show system resources is the command you are looking for. This along with show proc cpu will help you troubleshoot high cpu.
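
If you want to feed the "show system resources" output to a script, a minimal sketch follows. It assumes the output contains a "CPU states" line with user, kernel, and idle percentages; the sample line and the helper names are illustrative assumptions, not output captured from this thread. Some releases reportedly print nan% in place of numbers, so the parser keeps that as NaN instead of failing.

```python
import re

# Assumed shape of the "CPU states" line; the values are made up for illustration.
SAMPLE = "CPU states  :   3.9% user,   6.1% kernel,   90.0% idle"

# Matches "<number or nan>% <field>" for the three fields of interest.
_FIELD = re.compile(r"(nan|\d+(?:\.\d+)?)%\s+(user|kernel|idle)")


def parse_cpu_states(line):
    """Return a dict such as {'user': 3.9, 'kernel': 6.1, 'idle': 90.0}.

    A literal 'nan' in the output becomes float('nan').
    """
    return {name: float(value) for value, name in _FIELD.findall(line)}


def busy_percent(states):
    """Busy CPU is whatever is not idle, rounded to one decimal place."""
    return round(100.0 - states["idle"], 1)


states = parse_cpu_states(SAMPLE)
print(states, busy_percent(states))
```

A NaN idle value propagates into the busy figure, which is a reasonable signal to distrust that sample rather than alert on it.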

JayaKrishna

sean.wang · ‎08-07-2010

I have the same experience observing frequent high CPU on Nexus 5010 and 5020, while there isn't a significant amount of traffic.

No command seems to be able to pinpoint the process consuming the CPU.

Anybody else also observing this? So far traffic forwarding has been functional. Occasionally command prompt was very slow to respond. I'd appreciate if there is some definitive information on this questionable symptom.

5020-access# sh proc cpu hist
                                                        11 1
                                1                 1 1 00901 1
     8    3    8 1     918      1 8    2     81 6 304900606 12
100                                                     ####
90                                                     ####
80                                                     ####
70                                                     ####
60                                                     ####
50                                                     ####
40                                                     ####
30                                                     ####
20                                               #     #####
10 #         #       # #      # #          #   # # ###### #
    0....5....1....1....2....2....3....3....4....4....5....5....
              0    5    0    5    0    5    0    5    0    5

CPU% per second (last 60 seconds)
# = average CPU%

    111111111111111111111111111111111111111111111111111111111111
    000000000000000000000000000000000000000000000000000000000000
    000000000000000000000000000000000000000000000000000000000000
100 ************************************************************
90 ************************************************************
80 ************************************************************
70 ************************************************************
60 ************************************************************
50 ************************************************************
40 ************************************************************
30 ************************************************************
20 ************************************************************
10 *********##**********#**#*******#******#***#**********#*****
    0....5....1....1....2....2....3....3....4....4....5....5....
              0    5    0    5    0    5    0    5    0    5

CPU% per minute (last 60 minutes)
* = maximum CPU% # = average CPU%

Jayakrishna Mada · ‎08-09-2010

Hi,

Can you post the sorted "show proc cpu" output from the switch that is seeing this symptom? Do you have SNMP configured on this switch? If so, can you turn it off and keep monitoring?

JayaKrishna

sean.wang · ‎08-10-2010

I tried turning off SNMP, with no obvious difference. I never saw "show proc cpu sort" come up with any runaway process.

I am starting to question whether this is a real CPU issue or a faulty display. Why would the last 60 minutes always show a high peak while the last 72 hours show a very low peak?

5010-sw2# sh proc cpu sort

PID    Runtime(ms)  Invoked   uSecs  1Sec    Process
-----  -----------  --------  -----  ------  -----------
 3759         2348   5266079      0    4.0%  pfma
    1         1444    337537      4    0.0%  init

# sh proc cpu hist
                             1
       1        1           302      1        1
    1 84      1 81    11 84709   1 82    4 3 731    1 1 81 1
100                          #
90                          #
80                          #
70                          #
60                          #
50                          #
40                         ##
30                         ###
20             #           ###               #
10   ##        #         # ###     ##        #          #
    0....5....1....1....2....2....3....3....4....4....5....5....
              0    5    0    5    0    5    0    5    0    5

CPU% per second (last 60 seconds)
# = average CPU%

                    1 111    11 111111111111111111111111111111
    999999898889999909900099990099000000000000000000000000000000
    766632509978555608900042550088000000000000000000000000000000
100 ****       *********** ************************************
90 ************************************************************
80 ************************************************************
70 ************************************************************
60 ************************************************************
50 ************************************************************
40 ************************************************************
30 ************************************************************
20 ************************************************************
10 #**********#**********#**********#**********#**********#****
    0....5....1....1....2....2....3....3....4....4....5....5....
              0    5    0    5    0    5    0    5    0    5

CPU% per minute (last 60 minutes)
* = maximum CPU% # = average CPU%

      1
    781777778779888777877798777768778689777888888897768787788779787768978677
100
90
80
70
60
50
40
30
20
10 ######**####*##########################*################****************
    0....5....1....1....2....2....3....3....4....4....5....5....6....6....7.
              0    5    0    5    0    5    0    5    0    5    0    5    0

CPU% per hour (last 72 hours)
* = maximum CPU% # = average CPU%
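
For anyone reading these charts by script: the digit rows above each graph are read top to bottom, one column per time slot, so a column of " ", "9", "7" means 97 and a column of "1", "0", "0" means 100. A minimal sketch of that decoding follows. The function name and the four-column sample are hypothetical, and it assumes the column alignment of the pasted output is intact, which forum formatting does not always preserve.

```python
def decode_vertical_digits(rows):
    """Decode stacked digit rows (hundreds, tens, units) into one int per column.

    Rows are padded to equal width so a short top row does not drop columns.
    A column with no digits at all decodes to 0.
    """
    width = max(len(r) for r in rows)
    padded = [r.ljust(width) for r in rows]
    values = []
    for col in range(width):
        digits = "".join(row[col] for row in padded).replace(" ", "")
        values.append(int(digits) if digits else 0)
    return values


# Hypothetical four-column header with the left margin already stripped.
header = [
    " 1  ",
    "9079",
    "7053",
]
print(decode_vertical_digits(header))
```

Comparing the decoded per-minute values with the per-hour ones is one way to check whether the two charts really disagree or whether the alignment was lost in the paste.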

s-pirrello · ‎01-04-2011

Anyone find the cause of this? I've been observing this behavior on all Nexus 5010s living on my network.

seanxwang · ‎01-05-2011

This is likely a related bug: http://tools.cisco.com/Support/BugToolKit/search/getBugDetails.do?method=fetchBugDetails&bugId=CSCth19083

Although the bug description does not say that it only occurs with DCNM, DCNM is probably the only cause of frequent opening and closing of SSH sessions on the 5k. In our case, disabling DCNM monitoring of the 5k was the fix. Check whether you have a DCNM system. If not, try temporarily disabling and re-enabling SSH to see if it's the cause.

Note the bug is fixed in release 5.

Regards,

sean

www.seanxwang.com

c.sauvageot · ‎10-29-2012

I have a similar situation with some Nexus 5020s. "show proc cpu history" indicates high CPU utilization when looking at the max value, but the average is 10% or lower. I opened a TAC case and the engineer indicated that this is common on the 5K platform. I'm paraphrasing here: NX-OS is Linux based, and low priority processes are allowed to run the processor up to 100% for very short durations to keep it clear for high priority processes. That is why the average (in my case) is always very low but the max values can reach 100% during many of the one-minute intervals displayed in the "show proc cpu history" output. The TAC engineer also indicated that, unless average processor utilization exceeds 50% on a regular basis, there really is not an issue. I did not realize this condition existed until a new implementation of LMS began receiving traps for high CPU utilization from the Nexus 5020s. Based on TAC's response to my case, I'm no longer concerned about the max values I'm seeing, but I'll be monitoring average CPU% as a more meaningful indicator.
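
To make the "average, not max" rule concrete, here is a minimal sketch of an alert check. The 50% threshold comes from the TAC guidance paraphrased above; the function name and the min_fraction parameter (my reading of "on a regular basis") are assumptions, as are the sample numbers.

```python
def sustained_high_cpu(avg_samples, threshold=50.0, min_fraction=0.5):
    """Return True when enough average-CPU samples exceed the threshold.

    avg_samples  : per-interval AVERAGE CPU percentages (not the maxima)
    threshold    : average CPU% considered high (50% per the TAC guidance)
    min_fraction : share of samples that must be high; an assumed stand-in
                   for "on a regular basis"
    """
    if not avg_samples:
        return False
    high = sum(1 for sample in avg_samples if sample > threshold)
    return high / len(avg_samples) >= min_fraction


# Made-up per-minute averages: peaks may hit 100%, but the averages stay low.
quiet_switch = [8, 9, 10, 7, 12, 9]
busy_switch = [62, 71, 55, 48, 80, 66]
print(sustained_high_cpu(quiet_switch), sustained_high_cpu(busy_switch))
```

Alerting on this kind of check instead of on per-minute maxima should avoid the traps described above while still catching a switch that is busy on average.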

dbass · ‎01-05-2011

I would agree that it sounds like a monitoring system of some sort causing it. Because it lasts for such a short period of time, you are unlikely to catch it with the "sh proc cpu sorted" command. I've seen similar behaviour on 6500s, and while I knew for sure that it was SNMP polling, I could never actually catch it in the act because it happens so quickly.

Keep in mind that a lot of these commands aren't necessarily all that exact either ;-). It's also extremely hard to find out exactly how many of these statistical "show" commands actually work, as a lot of them generate their data from different polling cycles (depending on who wrote the application), and the exact information is proprietary.

seanxwang · ‎01-09-2011

That's exactly it. "show proc cpu" has limitations. Even running it with an automated script did not produce conclusive results. However, it was useful to analyze the patterns with "show proc cpu history". If the CPU spikes periodically, it is likely in sync with DCNM polling. See how the pattern behaves when you change the DCNM polling interval.
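
One way to check for that periodicity is to measure the gaps between spikes in the per-second values. This is a minimal sketch with made-up data; the function name and the 80% spike threshold are assumptions, not anything DCNM or NX-OS defines.

```python
def spike_intervals(samples, threshold=80):
    """Return the gaps, in samples, between the starts of consecutive spikes.

    A spike starts when a value reaches the threshold after a value below it,
    so a spike that lasts several samples is counted once.
    """
    starts = []
    above = False
    for index, value in enumerate(samples):
        if value >= threshold and not above:
            starts.append(index)
        above = value >= threshold
    return [later - earlier for earlier, later in zip(starts, starts[1:])]


# Made-up per-second CPU%: idle at 5% with spikes starting every 30 seconds.
samples = [5] * 90
for i in (5, 6, 35, 36, 65):
    samples[i] = 100
print(spike_intervals(samples))
```

If the gaps cluster around the polling interval and move when the interval is changed, that points at the poller.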

sean

www.seanxwang.com

Douglas Bradfield · ‎01-10-2012

I tried using the command "sh proc cpu hist" to see the overall CPU utilization on one of my 5010s, but that command doesn't work. Meanwhile our monitoring keeps alerting that the switch is running above 95%. Before we open a TAC case, I want to see for myself on that specific switch that it is spiking. There is also nothing in the logs. The version is 4.1(3)N2(1).

alanjbrown · ‎01-11-2012

Douglas,

Have you looked at these two bugs? We had a similar issue:

CSCte81951 -- show system resources does not show correct cpu utilization

CSCth08102 -- Gatos XL/Carmel: CPU states shows "nan% user" instead of numbers

thks,

Al

Sivagami Narayanan · ‎11-01-2012

Hi

I have tried to compile the related bugs that may cause high CPU utilization on the Nexus 5000.

Please refer to: Troubleshooting High CPU Utilization on Nexus 5010

Do rate the correct answer and the document if you find it useful

Cheers

Sivagami.N

habookans · ‎06-11-2018

https://supportforums.cisco.com/t5/network-management/cpu/m-p/3079496/highlight/false#M113815

Please visit this link; it may help you!

sbhadrav@cisco.com · ‎06-14-2018

Actually, there is a limitation/restriction on sup8E board. When you use sup8e either in RPR or SSO mode, only the first four uplinks on each supervisor engine are available. The second set of four uplinks are unavailable.

Regarding the uplink BW, when the daughter card is activated, Supervisor Engine 8-E baseboard uplink bandwidth is restricted to 40G as the default configuration in a ten-slot chassis.

In non-redundancy mode, the supervisor can support the first 4 active interfaces.

In redundancy mode, the first two interfaces on both the active and the standby supervisors become active.

In your case, you have a redundant sup installed and you see ports 1-4 as active and the remaining ports 5-8 as disabled. Since you are using dual sups, you should usually see the first 2 ports on each sup in the active/up state.

What you see in your situation is expected.

Here is a useful CCO link for your reference. Please go through it:

http://www.cisco.com/c/en/us/td/docs/switches/lan/catalyst4500/XE3-7-0E/15-23E/configuration/guide/xe-370-configuration/sw_int.html#pgfId-1236145