Ask the Expert: Troubleshooting High CPU and Other Issues in the Cisco Catalyst 4500 Series Switches

ciscomoderator (Community Manager)

With Nikolay Karpyshev and Ivan Shirshin

Welcome to the Cisco Support Community Ask the Expert conversation. This is an opportunity to learn about and ask questions on the architecture and troubleshooting of the Cisco Catalyst 4500, the industry's most widely deployed modular access platform, with Cisco experts Nikolay Karpyshev and Ivan Shirshin.

Nikolay and Ivan are Customer Support Engineers on the High-Touch Technology Support (HTTS) team at Cisco, specializing in LAN switching and routing. They support the Cisco Nexus 7000 and Catalyst 6500, 3750, 3560, 4500, and 2900 switches as well as a variety of routing platforms, and work as senior and escalation engineers. Both were previously part of the Cisco Sales Associate program, and they hold the CCNP, CCSP, and CCDP certifications.

Remember to use the rating system to let Nikolay and Ivan know if you have received an adequate response. 

Nikolay and Ivan might not be able to answer every question due to the volume expected during this event. Remember that you can continue the conversation in the Network Infrastructure sub-community discussion forum shortly after the event. This event lasts through October 19, 2012. Visit this forum often to view responses to your questions and the questions of other community members.

33 Replies

Hi Ivan,

Thanks for the reply,

Q.1 Can you please explain what you mean when you say the 7200 router is CPU-based? Do you mean the speed?

Q.2 What is hardware forwarding?

Q.3 Also, I wanted to know: in Cisco routers, is memory needed only to run the IOS, or do the commands we execute on the router also need memory?

Regards,

Hi Fahad,

Let me answer your questions:

Q.1 Can you please explain what you mean when you say the 7200 router is CPU-based? Do you mean the speed?

There are two main forwarding technologies in routing and switching: software switching and hardware switching. In software switching the CPU makes every forwarding decision, but since the CPU also controls many other processes on the device, the switching work was offloaded from it. This is done with dedicated ASICs (forwarding engines) built into the routing and switching processors. The CPU still controls all the processes and protocols, but it programs the routing and switching information down into those hardware engines. Whenever a packet arrives on a device capable of hardware switching, it passes through the hardware engine for the forwarding decision, so the CPU stays available for other tasks. Certain kinds of packets are still always handled by the CPU, but the overall load is much lower. The 7200 router architecture does not support any hardware engines, so all forwarding decisions are made by the CPU. In comparison, the 7600 moves most forwarding decisions to the feature cards on the supervisor and line cards, keeping the CPU free of that work.

Since this topic is about the 4500: on that platform CEF controls all aspects of hardware forwarding (in fact CEF is used on multiple platforms to program the hardware switching engines and interface with the control plane):

http://www.cisco.com/en/US/docs/switches/lan/catalyst4500/12.2/31sga/configuration/guide/cef.html
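To see where forwarding is actually happening, a few show commands are useful (a quick sketch; these are standard IOS commands, but exact output varies by platform and release):

```
show ip cef summary                       ! CEF (hardware forwarding) state and prefix counts
show platform cpu packet statistics       ! Cat4500: what is punted to the CPU, per queue
show processes cpu sorted | exclude 0.00  ! software processes actually consuming CPU
```

If the CPU stays low while traffic is high, the hardware engine is doing the forwarding; a sustained high "IP Input" process usually means packets are being switched in software.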

Q.2 What is hardware forwarding?

I hope that is answered above. To add: different platforms have different types of hardware forwarding engines.

Q.3 Also, I wanted to know: in Cisco routers, is memory needed only to run the IOS, or do the commands we execute on the router also need memory?

Not sure if I understood your question correctly, but there are several types of memory on each platform. Taking the 7600 as an example, it has: NVRAM (primarily to store the configuration), RAM (holding the running processes, traffic buffers, and the IOS), EPROM (storing ROMMON), and a number of permanent flash partitions for storing files (the IOS image, crashinfo, etc.).

When you check "show file systems", you can see the different permanent file storages, such as bootflash: and sup-bootflash:. You can use the "dir" command to list the files on each of them, e.g. "dir bootflash:".

If you want to check the usage of operating memory, use the command "show processes memory": it shows you the total, used, and free sizes, along with the processes using memory and the amount each one holds.

You can check following document for more info:

http://www.cisco.com/en/US/partner/docs/ios/12_1/configfun/configuration/guide/fcd204.html
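As a quick recap of the commands mentioned above (command names only; outputs are omitted since they differ per platform and release):

```
show file systems        ! list all file systems (bootflash:, sup-bootflash:, nvram:, ...)
dir bootflash:           ! list files on one permanent storage
show processes memory    ! total/used/free RAM plus per-process usage
```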

Hope it helps.

Let me know if you have any further questions to discuss.

Nik

HTH,
Niko

Hi Akhtar,

Sorry for the delay in replying; somehow the notification for your post went missing.

This is really a topic for open discussion, as it is hard to make a recommendation without specific requirements.

From some of the risk assessments I have seen done for 4500 and 4900 switches, 15.0(2)SG2 was considered a much more stable release than 12.2(54)SG1, with many serious bugs fixed.

Speaking about CPU specifically, the latest release in this branch is 15.0(2)SG5, and I see only one unresolved bug in it that relates to high CPU:

- CSCtz04599 MU: Cat4500: dot1x fail - MAB success - dot1x fail leads to High CPU

It will be fixed in SG6, which is expected in November.

In any case it is a subject for discussion, and I would also appreciate it if others would share their best practices.

Kind Regards,
Ivan Shirshin

**Please grade this post if you find it useful.

Kind Regards,
Ivan

sr1482613 (Level 4)

Hello, Nikolay and Ivan!

This is Hank from ISONET.

I'm a systems engineer and have been assigned to support a customer who is operating a C4K with a high CPU utilization issue.

I have a question about high CPU utilization for you. I already opened a TAC case and got an answer from a TAC engineer: he said it is a performance issue, so I might have to change the network design.

Here is the log that I gave the TAC engineer.

- show proc cpu sorted | ex 0.00

XXX#show proc cpu sorted | ex 0.00
CPU utilization for five seconds: 96%/3%; one minute: 91%; five minutes: 64%
 PID Runtime(ms)   Invoked      uSecs   5Sec   1Min   5Min TTY Process
 106  30357545111460547752       2078 57.63% 53.56% 32.34%   0 IP Input
  55  1994448688 896053981       2225 19.10% 19.58% 12.01%   0 Cat4k Mgmt LoPri
  54  26567903491005451955       2642 13.50% 11.92% 13.80%   0 Cat4k Mgmt HiPri
  43    46356722   2103973      22032  0.95%  0.11%  0.06%   0 Per-minute Jobs
 177        6004     10025        598  0.55%  0.45%  0.92%   1 Virtual Exec
  39   239320691  60960958       3925  0.31%  0.33%  0.31%   0 IDB Work
 210     51910933756189628          0  0.23%  0.21%  0.22%   0 HSRP Common
  99    50335097  30310150       1660  0.23%  0.09%  0.08%   0 CDP Protocol
  88     1528720 609626516          2  0.07%  0.09%  0.08%   0 UDLD
 211    29795068 380029744         78  0.07%  0.04%  0.05%   0 HSRP IPv4
  14    36539866 111072493        328  0.07%  0.06%  0.07%   0 ARP Input
 113     5869671  75026899         78  0.07%  0.14%  0.15%   0 Spanning Tree

Based on this information, I suspect that traffic punted to the CPU is causing the high CPU. The highest CPU consumer is the IP Input process, which means too many packets are being punted to the CPU.

Here are the other logs.

K2CpuMan Review       30.00  32.28     30     83  100  500   37  35    7  54119:03
K2AccelPacketMan: Tx  10.00  13.43     20      0  100  500   13  12    3  20377:35

As per the platform CPU packet statistics, I can see that packet drops occurred on the L3 Rx Low queue.

Packets Dropped by Packet Queue
Queue                  Total           5 sec avg 1 min avg 5 min avg 1 hour avg
---------------------- --------------- --------- --------- --------- ----------
Host Learning                     2225         0         0         0          0
L2 Fwd Low                        2610         0         0         0          0
L3 Rx Low                     50680864        66        78        37          3

Packets Dropped by Packet Queue
Queue                  Total           5 sec avg 1 min avg 5 min avg 1 hour avg
---------------------- --------------- --------- --------- --------- ----------
Host Learning                     2225         0         0         0          0
L2 Fwd Low                        2610         0         0         0          0
L3 Rx Low                     50681748        22        66        36          3

Packets Received at CPU per Input Interface
Interface              Total           5 sec avg 1 min avg 5 min avg 1 hour avg
---------------------- --------------- --------- --------- --------- ----------
Gi2/1                              233         4         0         0          0
Gi2/2                               41         0         0         0          0
Gi2/3                               32         0         0         0          0
Gi2/6                            77232      2217       685        81          0
Gi3/1                               39         0         0         0          0
Gi3/7                               25         0         0         0          0
Gi3/9                               26         0         0         0          0
Gi3/11                               1         0         0         0          0
Gi3/14                               1         0         0         0          0
Gi3/47                             243         6         0         0          0
Gi3/48                               7         0         0         0          0

Packets Received at CPU per Input Interface
Interface              Total           5 sec avg 1 min avg 5 min avg 1 hour avg
---------------------- --------------- --------- --------- --------- ----------
Gi2/1                              285         3         0         0          0
Gi2/2                               56         0         0         0          0
Gi2/3                               41         0         0         0          0
Gi2/6                            96543      1823       788        81         16
Gi3/1                               52         0         0         0          0
Gi3/7                               27         0         0         0          0
Gi3/9                               28         0         0         0          0
Gi3/11                               1         0         0         0          0
Gi3/14                               1         0         0         0          0
Gi3/16                               4         0         0         0          0
Gi3/47                             311         5         0         0          0
Gi3/48                               8         0         0         0    

I can see these counters increasing quickly, and the input interface where the packet count increases the most is Gi2/6.

When I captured the traffic, the traffic punted to the CPU was broadcast, including real-time stock market prices coming from outside the company over the WAN, so I couldn't kill that traffic.

Is there any solution to reduce the high CPU utilization?

If you want the service request number, you can refer to SR 622799643. That SR is already closed.

thanks.

Hi Hank,

Well, I see good analysis has already been done on this case. So you found that broadcast packets are causing the high CPU. All switches and routers are designed to examine L3 broadcasts in the CPU to decide whether any action should be taken on them; this is the default behavior for most devices.

In your case the broadcast hitting the switch seems to be destined to some other devices, and the 4500 should just forward it. Since there is no way to remove that traffic from the network, you can configure the 4500 to limit the amount of broadcast sent toward the CPU with Control Plane Policing (CoPP). CoPP is a tool that inspects the traffic heading to the CPU against a set of pre-configured ACLs and dedicates only a certain bandwidth to each class of traffic, dropping whatever exceeds the limit.

Thus you configure the basic CoPP template and create your own access lists limiting certain traffic (e.g. broadcast) to certain boundaries. Broadcast will still be forwarded correctly in hardware to all ports within the broadcast domain, but the portion of it hitting the CPU will be limited to the rate you configure, keeping the CPU safe.

You can consult the following page on how to implement it:

http://www.cisco.com/en/US/docs/switches/lan/catalyst4500/12.2/54sg/configuration/guide/cntl_pln.html
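A minimal CoPP sketch along those lines might look as follows. This is only an illustration: the ACL number, class-map name, and police rate are placeholders, the base template should first be applied with "macro global apply system-cpp", and the exact syntax should be verified against the configuration guide for your release:

```
access-list 120 permit ip any host 255.255.255.255
!
class-map match-any my-copp-broadcast
 match access-group 120
!
policy-map system-cpp-policy
 class my-copp-broadcast
  police 32000 1000 conform-action transmit exceed-action drop
!
control-plane
 service-policy input system-cpp-policy
```

Hardware forwarding of the broadcast is unaffected; only the copy punted to the CPU is rate-limited.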

Let us know if you have any further questions.

Nik

HTH,
Niko

Hi Nik,

What are the broadcast threshold values in bps and pps that can be used in CoPP so as not to kill the CPU?

Regards,

Akhtar

Hi Akhtar,

I have seen people in many cases using "police 32000 1000", but I recommend tuning the values to your specific setup; the better solution is to do some testing in your specific scenario to find the optimal thresholds.

Kind Regards,
Ivan Shirshin

**Please grade this post if you find it useful.

Kind Regards,
Ivan

kthned (Level 3)

Hi Experts,

I would like to take this forum as an opportunity to understand high CPU usage on one of our 6500 switches running IOS 12.2(33)SXI3. The problem is that the switch processor (SP) is consuming much more CPU than the route processor; at times the SP reaches 99%. The command "sh process cpu" shows the following two processes on top: NDE - IPV4 and Spanning Tree. Could you tell me how to root-cause such an issue? The "sh process cpu" files are attached.

I am not sure whether high SP CPU is normal, as I used to check only the route processor CPU usage with "sh proc cpu". I discovered the high SP CPU via an SNMP walk.

[nnmserver ~]$ snmpwalk -v2c -c public aaa-ddd 1.3.6.1.4.1.9.9.109.1.1.1.1.8

SNMPv2-SMI::enterprises.9.9.109.1.1.1.1.8.1 = Gauge32: 7

SNMPv2-SMI::enterprises.9.9.109.1.1.1.1.8.3 = Gauge32: 68

Thanks !

Regards,
Umair

Hi,

It seems some traffic is being sent to the SP CPU, as it is high on interrupts - 46%:

     switch_6500#remote command switc sh process cpu sorte
     CPU utilization for five seconds: 61%/46%; one minute: 69%; five minutes: 68%
      PID Runtime(ms)   Invoked      uSecs   5Sec   1Min   5Min TTY Process
      281  11313742443353170351          0  8.00%  7.01%  7.05%   0 Spanning Tree
      253  1146778924  48986767      23410  2.15%  1.81%  1.78%   0 Vlan Statistics
      470  1427548824 109539080      13032  1.35%  4.65%  4.34%   0 NDE - IPV4
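Since NDE - IPV4 is also near the top, it may be worth checking the NetFlow Data Export state while we are at it (a side suggestion; both commands are from the 6500 MLS NetFlow feature set):

```
show mls nde               ! NDE status, export destination, and export packet counts
show mls netflow ip count  ! number of NetFlow entries currently maintained in hardware
```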

Do you have any issues with stability in the network or spanning tree?

Please send the "show tech" and "show spanning-tree summary" outputs.

Kind Regards,
Ivan Shirshin

**Please grade this post if you find it useful.

Kind Regards,
Ivan

Hi Ivan

Thanks for your input & consideration.

Here are the "show tech" and "show spanning-tree summary" outputs. Please note that the IP addresses and domain names have been replaced with arbitrary values.

Regards,

Umair

Hi,

Spanning tree is fine, but I do see some log statements in the ACL (such packets are sent to the CPU for accounting):

     access-list 2460 permit tcp yy.24.16.0 0.7.239.255 host xxx.225.53.82 range 137 138 log
     ...
     access-list 2460 permit udp yy.24.16.0 0.7.239.255 host xxx.225.53.82 eq netbios-ss log
     ...
     access-list 2460 permit udp yy.24.16.0 0.7.239.255 host xxx.225.53.82 eq 445 log
     access-list 2460 permit tcp yy.24.16.0 0.7.239.255 host xxx.225.53.83 range 137 138 log

Interrupts can be caused by those ACL logs, by some function constantly using CPU resources (usually due to bugs), or by traffic hitting the SP CPU.

Let's check the functions first by doing CPU profiling (not service impacting). To prepare the correct procedure, please provide me with these outputs:

! login to the SP
switch# remote login switch
switch-sp# show region
switch-sp# show mem stat
switch-sp# exit
switch#

Kind Regards,
Ivan Shirshin

**Please grade this post if you find it useful.

Kind Regards,
Ivan

Thanks for the help, Ivan. Here is the output of "show region" and "show mem stat".

Hi,

Please follow this procedure for the CPU profiling (to identify the functions responsible for the interrupts):

1. Set up the profile:

# profile 40101328 42247FFF 4
# profile task interrupt

2. Now run the following command and don't do anything on the router for about 5 minutes. You can ask everyone logged in to leave the router alone by using "send *". If it is not left alone, the profiling results could be corrupted by the CPU processing user commands.

# profile start

3. After waiting about 5 minutes, run:

# profile stop

4. Next, run the following 4 commands in sequence via Telnet. Note that these commands may generate a large amount of data. Do NOT attempt this via the console port: the console port is slow and does not obey flow control, so data may be lost.

# terminal length 0 // turn off the "more" page scrolling feature
# show profile terse
# show profile detail
# terminal length 40 // turn the "more" page scrolling feature back on

5. Finally, run the following to release the memory:

# clear profile
# unprofile 40101328 42247FFF 4

6. After that, please send me the output of the following:

# show processes cpu
# show memory statistics
# show region
# show alignment

Kind Regards,
Ivan Shirshin

**Please grade this post if you find it useful.

Kind Regards,
Ivan

Hi Ivan

Thanks for the input. I shall update you on Monday, as the weekend has already started here in the EU. I hope you can continue to help us with this next week. Thanks a lot and have a nice weekend!

Regards,

Umair

sbertsch (Level 1)

Experts,

I have an odd issue on a C4503 with a Sup IV running 12.2(50)SG7, where the CPU is apparently being driven by an ESP flow that is being forwarded by the CPU for no apparent reason.

I used CPU SPAN to capture the CPU tx and rx traffic. All other traffic appears normal with no smoking gun (e.g. ICMP traffic). The only differences between the rx and tx copies of the ESP packets are the TTL decrement and checksum updates.

See attached image of one of the ESP frames captured from the CPU SPAN.
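For anyone who wants to reproduce this kind of capture, CPU SPAN on the Catalyst 4500 can be set up roughly as below (a sketch: the session number and destination interface are illustrative, and availability of the "cpu" source keyword depends on the release):

```
monitor session 1 source cpu both               ! mirror traffic punted to / sent by the CPU
monitor session 1 destination interface Gi3/48  ! port where the sniffer is attached
```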

#show processes cpu
CPU utilization for five seconds: 99%/0%; one minute: 96%; five minutes: 96%
PID Runtime(ms)   Invoked      uSecs   5Sec   1Min   5Min TTY Process
  48  3887011340  80841077      48082 17.09%  8.78%  8.22%   0 Cat4k Mgmt HiPri
  49  42937403361140819070       3763 16.53% 24.29% 24.76%   0 Cat4k Mgmt LoPri
  97   9055618683276278949        276 50.71% 55.99% 55.30%   0 IP Input
 192         800        92       8695 13.17%  1.05%  0.21%   1 SSH Process

#show platform health
                     %CPU   %CPU    RunTimeMax   Priority  Average %CPU  Total
                     Target Actual Target Actual   Fg   Bg 5Sec Min Hour  CPU
K2CpuMan Review       30.00  25.76     30     25  100  500   33  32   20  78196:03
K2AccelPacketMan: Tx  10.00   3.68     20      1  100  500    2   2    2  26900:32
K2PortMan Review       3.00   2.79     15     11  100  500    2   2    1  19009:51
K2Fib Consistency Ch   1.00   8.34      5      3  100  500    9   2    1  19781:29
K2PacketBufMonitor-P   3.00   2.00     10      1  100  500    2   2    1  26238:29
%CPU Totals          214.80  47.58

#show platform cpu packet statistics
Packets Received by Packet Queue
Queue                  Total           5 sec avg 1 min avg 5 min avg 1 hour avg
---------------------- --------------- --------- --------- --------- ----------
L3 Rx Low                  17404309562      2642      3271      2534       2008

Packets Dropped by Packet Queue
Queue                  Total           5 sec avg 1 min avg 5 min avg 1 hour avg
---------------------- --------------- --------- --------- --------- ----------
L3 Rx Low                    319042488      1179      1226       734        409

#show platform hardware ip route summary
TCAM running in 144 bit mode. (16 routes per block)
5525 blocks used out of 8192 (67.44%)
87906 K2Fib TCAM entries used out of 131072 (67.06%)
(512 entries are fixed overhead)
294 K2FibAdjs used out of 32768 (0.89%)
87394 IrmFibEntries used out of 262144 (33.3333%)
5 IrmMfibEntries used out of 65536 (0.00%)
281 IrmFibAdjs used out of 49152 (0.57%)
K2FibAdj allocation failures: 0
K2FibEntry allocation failures: 0
K2FibRegion block reshuffles: 0
IrmFibAdj allocation failures: 0
Number of Entries using RpfFloodSet: 0
VRF Vlans using software forwarding due to resource exhaustion: 0
Consistency Checker failures:
  reported:  cam: 0 mask: 0 fte: 0
  suppressed:  cam: 0 mask: 0 fte: 0
