Lisa Latour · ‎05-05-2015

This is an opportunity to learn and ask questions about high CPU conditions that you might be facing in your environment, and about troubleshooting them with the tools and techniques available within the platform, with Cisco expert Vinit Jain.

Ask questions from Monday, May 11th, 2015 to Friday, May 22, 2015

High CPU is a very common problem in production environments and can have a huge impact on services if it is not taken care of in time. High CPU falls primarily into two categories: 1) high CPU due to a process and 2) high CPU due to interrupts (traffic). Cisco expert Vinit Jain will cover and answer all of your questions about troubleshooting high CPU on Cisco IOS.

Vinit Jain, 3X CCIE #22854, is a Technical Lead in the HTTS (High Touch Technical Support) team, supporting customers in the areas of routing, MPLS, TE, IPv6, multicast, and a wide variety of platform issues such as high CPU and memory leaks across the IOS, IOS XE, IOS XR and NX-OS code bases. He has been delivering training within Cisco on various technologies as well as platform troubleshooting topics. He has also written a workbook on IOS XR fundamentals on the Cisco Support Community. Vinit holds CCIEs in R&S, SP and Security, and multiple certifications in programming and databases.

Vinit Jain will also be speaking at Cisco Live in June 2015 on Troubleshooting BGP (BRKRST-3320).

Find other events at https://supportforums.cisco.com/expert-corner/events.

**Ratings Encourage Participation!**
Please be sure to rate the answers to questions.

Vinit Jain · ‎05-12-2015

Glad I was able to help.

Feel free to post if you have any further questions.

Thanks
--Vinit

Manish Kumar · ‎05-08-2015

Hi Vinit,

Could you share an example where EEM can be used to troubleshoot high CPU caused by IGP (for example, OSPF) flapping? It's a catch-22: the IGP can flap due to high CPU, and if the IGP is constantly flapping it can drive the CPU high, and your remote connection session is generally very slow or unresponsive during those times.

Thanks

Manish

Vinit Jain · ‎05-08-2015

Hello Manish

Yes, there are scenarios in which an IGP flap can cause the CPU to spike. The question is how to approach the problem. If we start by troubleshooting the high CPU, that will lead us to look at the IGP flaps.

Suppose BFD is flapping, which in turn causes OSPF to flap. We can then use the script below to troubleshoot the problem:

event manager applet OSPF_Monitor
event syslog pattern "Neighbor Down: BFD node down"
action 1.01 syslog priority critical msg "**** BFD Failure Detected - Statistics Logged ****"
action 1.02 cli command "enable"
action 1.03 cli command "show clock | append bootdisk:cpu_stats"
action 1.04 cli command "show proc cpu sort | append bootdisk:cpu_stats"
action 1.05 cli command "debug netdr cap rx"
action 1.06 cli command "show netdr cap | append bootdisk:cpu_stats"
action 1.07 cli command "undebug all"
action 1.08 cli command "end"

The above applet performs a netdr capture when a BFD flap occurs, to see which packets are hitting the CPU. The capture can then be decoded to further understand what is happening on the router. We can also add show commands related to BFD or OSPF to the same EEM applet.
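As a rough sketch of capturing BFD and OSPF state on the same trigger: the applet name, action labels and file name below are illustrative assumptions, not taken from the original script. `show bfd neighbors details` and `show ip ospf neighbor` are standard IOS commands, but their availability and output vary by platform and release, and `bootdisk:` should be replaced with a file system that actually exists on the device.

```
! Sketch only: applet name and file name are hypothetical.
! Logs BFD and OSPF neighbor state when the BFD-down syslog is seen.
event manager applet OSPF_BFD_State
event syslog pattern "Neighbor Down: BFD node down"
action 1.01 cli command "enable"
action 1.02 cli command "show clock | append bootdisk:ospf_bfd_stats"
action 1.03 cli command "show bfd neighbors details | append bootdisk:ospf_bfd_stats"
action 1.04 cli command "show ip ospf neighbor | append bootdisk:ospf_bfd_stats"
action 1.05 cli command "show log | append bootdisk:ospf_bfd_stats"
```

Keeping these show commands in a separate file from the netdr capture makes it easier to line up neighbor state against the timestamps of each flap.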

If we don't know which process or protocol is causing the high CPU, or when it is causing it, we can configure another EEM script on the router that is triggered when the CPU spikes:

event manager applet HIGHCPU
event snmp oid "1.3.6.1.4.1.9.9.109.1.1.1.1.3.1" get-type exact entry-op gt entry-val "90" exit-op lt exit-val "70" poll-interval 5 maxrun 200
action 1.0 syslog msg "START of TAC-EEM: High CPU"
action 1.1 cli command "enable"
action 1.3 cli command "debug netdr clear-capture"
action 1.4 cli command "debug netdr capture rx"
action 2.0 cli command "sh clock | append disk0:proc_CPU"
action 2.1 cli command "show process cpu sorted | append disk0:proc_CPU"
action 2.2 cli command "show proc cpu history | append disk0:proc_CPU"
action 2.3 cli command "show netdr capture | append disk0:proc_CPU"
action 3.1 cli command "show log | append disk0:proc_CPU"
action 4.0 syslog msg "END of TAC-EEM: High CPU"

In the above EEM script, the applet is triggered when the CPU utilization polled through the SNMP OID goes above 90%, and it re-arms once the value falls below 70%. The entry and exit values can be adjusted to set the CPU range on which the trigger occurs.

The more important question is why the IGP is flapping. It could be due to packet drops, MTU issues, a rate-limiter dropping legitimate packets, and so on.

Hope this helps.

Vinit

PS: Please do rate the replies if you find them useful

Thanks
--Vinit

Manish Kumar · ‎05-08-2015

thank you sir, this is exactly what I was looking for.

Great work.

CSCO10662744_2 · ‎05-12-2015

hi Vinit,

Two of a remote site's Cat4K's were running IOS-XE 3.4.4, and their CPU would spike from time to time.

Would the TShoot steps be the same for a Cat4500-X running IOS-XE?
Can I use the same HIGHCPU EEM that you had posted before?

Is there any way to set certain operations as "high" priority, within IOS/IOSd, so that IGP flaps, or STP events don't bring down the network?
How about setting a limit/policer, so that no processes can take up the entire CPU?

CORE-S1#sh proc cpu sort
Core 0: CPU utilization for five seconds: 3%; one minute: 31%; five minutes: 24%
Core 1: CPU utilization for five seconds: 98%; one minute: 71%; five minutes: 78%
PID Runtime(ms) Invoked uSecs 5Sec 1Min 5Min TTY Process
8571 1733061 33879058 727 51.28 51.10 50.92 0 iosd

CORE-S2#sh proc cpu sort
Core 0: CPU utilization for five seconds: 3%; one minute: 8%; five minutes: 9%
Core 1: CPU utilization for five seconds: 95%; one minute: 76%; five minutes: 77%
PID Runtime(ms) Invoked uSecs 5Sec 1Min 5Min TTY Process
8685 2782447 16732559 422 48.90 42.45 43.04 0 iosd

Vinit Jain · ‎05-12-2015

Hello,

The EEM can be helpful, but we need to understand which commands to capture. The netdr capture works only on the 6500/7600 platforms. Here the CPU is high due to the iosd process, which is consuming around 40-50% of the CPU cycles, but we also need to understand what else is causing the CPU to spike. There may be some CPU-bound traffic arriving at the switch, which can be analyzed using the troubleshooting steps below.

Troubleshooting Steps:

+ Use the built-in CPU sniffer:

   + debug platform packet all receive buffer (Wait for 2-3 minutes after issuing this command.)
   + show platform cpu packet buffered
   + undebug all

+ Identify the interface that sends traffic to the CPU:

   + debug platform packet all count (Wait for 2-3 minutes after issuing this command.)
   + show platform cpu packet statistics
   + undebug all

+ Use a sniffer to verify the packets that were hitting the CPU

Could you please share the below output:

- show version
- show platform cpu packet statistics all
- show process cpu detailed
- show platform health

It's also important to remember that the Cat 4500 considers the CPU underutilized unless it is at 100%. Below is the information from CCO:

"The Catalyst 4500 considers the CPU underutilized unless the CPU is used at 100 percent for a single time slot. There is another very important implementation detail of Catalyst 4500 CPU packet handling. If the CPU has already serviced high-priority packets or processes but has more spare CPU cycles for a particular time period, the CPU services the low-priority queue packets or performs background processes of lower priority. High CPU utilization as a result of low-priority packet processing or background processes is considered normal because the CPU constantly tries to use all the time available. In this way, the CPU strives for maximum performance of the switch and network without a compromise of the stability of the switch."

I would also like to understand when you started facing the high CPU problem. Were any changes made recently, or has it been an issue since day 1?

You can also refer to a good CCO document on troubleshooting high CPU on the Cat4500:

http://www.cisco.com/c/en/us/support/docs/switches/catalyst-4000-series-switches/65591-cat4500-high-cpu.html

Hope this helps.

Vinit

PS: Feel free to rate the posts if you find them useful

Thanks
--Vinit

CSCO10662744_2 · ‎05-14-2015

Thank you Vinit.
The CPU was high, when we were doing a new LAN deployment.
There were multiple things happening through the week...spantree convergences (new access switches being installed in closets), EIGRP & BGP peering, PIM peering, etc.

After an upgrade from IOS-XE 3.4.4 to 3.4.5, and no more changes & events, the CPU has been quiet.
I was asking these questions for future references.

Do these debugs affect the CPU, or are they considered "low" priority?
If we're already having high CPU utilization, we certainly don't want to make it worse...
debug platform packet all receive buffer
debug platform packet all count
===========

You said the netdr capture only works on 6500/7600 platform. (debug netdr)
Could you tell us what commands, or EEM scripts we'd need to run on other platforms?
Cat3K/4K, Nexus5K/7K, ISR-G1/G2, ASR1K...

thanks,
Kevin

Vinit Jain · ‎05-14-2015

Hello Kevin

The above debug commands won't have much impact, as they are actually packet capture tools built into the platform; their commands just happen to start with the keyword "debug". Also, there are ASICs already present in the platform that take care of these captures, so they won't impact other processes. For all IOS/IOS XE platforms, you can use the same or a similar EEM script, changing only the commands that you want to run when the event occurs.
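As an illustration only, a Cat4500 flavor of the HIGHCPU applet might look something like the sketch below, swapping the netdr commands for the platform capture and show commands mentioned earlier in this thread. Several things here are assumptions: the applet name, the `bootflash:` file system, the 120-second `wait` action (which needs an EEM version that supports it, and a `maxrun` longer than the wait), and the trailing `.1` CPU index in the SNMP OID, which may differ on a given supervisor.

```
! Sketch only: verify the OID index, file system and EEM version first.
event manager applet HIGHCPU_4K
event snmp oid "1.3.6.1.4.1.9.9.109.1.1.1.1.3.1" get-type exact entry-op gt entry-val "90" exit-op lt exit-val "70" poll-interval 5 maxrun 300
action 1.0 syslog msg "START of EEM: High CPU"
action 1.1 cli command "enable"
! Start the built-in CPU packet capture, then let it collect for 2 minutes.
action 1.2 cli command "debug platform packet all receive buffer"
action 1.3 wait 120
action 2.0 cli command "show clock | append bootflash:proc_CPU"
action 2.1 cli command "show process cpu detailed | append bootflash:proc_CPU"
action 2.2 cli command "show platform health | append bootflash:proc_CPU"
action 2.3 cli command "show platform cpu packet statistics all | append bootflash:proc_CPU"
action 2.4 cli command "show platform cpu packet buffered | append bootflash:proc_CPU"
! Stop the capture once the outputs are saved.
action 3.0 cli command "undebug all"
action 4.0 syslog msg "END of EEM: High CPU"
```

If the `wait` action is not available on the release in use, it can simply be dropped, at the cost of a shorter capture window.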

EEM on Nexus platform is used a bit differently. You can refer to the below CCO doc for an overview:

http://www.cisco.com/c/en/us/td/docs/switches/datacenter/sw/5_x/nx-os/system_management/configuration/guide/sm_nx_os_cg/sm_12eem.html

Please let me know if you have any further questions/queries.

Hope this helps.

Vinit

Thanks
--Vinit

useridcisco · ‎05-16-2015

Hi Kevin, you can check this document from Cisco Support Community, Packet Capture Capabilities of Cisco Routers and Switches.

Vinit Jain · ‎05-12-2015

One other thing you might want to look at is implementing CoPP. This will help protect your control-plane traffic and also help protect the router from excessive control-plane traffic (just in case). I don't think there is a way to give priority to processes within iosd.

Which processes/traffic get higher or lower priority is already predefined; for example, BFD is high priority, whereas SNMP is a low-priority process.
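For reference, a minimal sketch of what a CoPP policy can look like in generic IOS MQC syntax is shown below. The ACL, class-map and policy-map names and the 64 kbps rate are hypothetical placeholders; real values should come from a measured baseline, and some platforms (including Catalyst switches) have their own CoPP conventions and predefined classes, so the platform configuration guide should be checked before applying anything like this.

```
! Sketch only: names and the police rate are hypothetical.
! Rate-limit SNMP traffic punted to the CPU; everything else is untouched.
ip access-list extended COPP-SNMP-ACL
 permit udp any any eq snmp
!
class-map match-all COPP-SNMP
 match access-group name COPP-SNMP-ACL
!
policy-map COPP-POLICY
 class COPP-SNMP
  police 64000 conform-action transmit exceed-action drop
!
control-plane
 service-policy input COPP-POLICY
```

Starting with a single low-priority class such as SNMP keeps the risk small; routing-protocol classes are usually added only after their normal packet rates are known.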

Thanks
--Vinit

light1001 · ‎05-15-2015

Vinit,

We are using a 2821 router as a centralized TDM-to-SIP voice gateway for a few remote facilities. The voice traffic is first sent as H323 between a Vendor_B voice gateway at the remote site to a central Vendor_B voice gateway through GRE tunnels. The GRE tunnels from the remote facilities terminate on the 2821, which then routes the traffic to the central Vendor_B voice gateway. The voice traffic is then processed by a central PBX and handed back to the 2821 as TDM through 2 MFT T1 ports. The 2821 then sends the calls as SIP to the ITSP.

The router is overloading at around 40-50 concurrent calls due to interrupt traffic. All calls are G.711 and, even though the voice traffic is being processed twice, the bandwidth rarely exceeds 7 Mbps. Please see the attached diagram. Thanks so much for your help!

Vinit Jain · ‎05-15-2015

Hello,

The first thing we need to understand is whether the traffic arriving at the router is hitting the CPU and, if so, why. Could you please share the following outputs:

- show process cpu sort | ex 0.0    // capture this output 2-3 times
- show interface
- show interface switching
- Running config of all relevant interfaces

You can attach the above outputs as a file.

Thanks
--Vinit

light1001 · ‎05-16-2015

Vinit,

Please see the attached file sir. We were able to capture this output when the router had an average call volume of 25-28 concurrent calls. We did not get output from the "show interface switching" command, but we will collect this output again tomorrow when call volume picks up again. Thanks!

Vinit Jain · ‎05-16-2015

One interesting thing I see in the output is that interface Gi0/0 carries around 3.5 Mb of traffic in both the input and output directions, while the other interface, Gi0/1 (which I am assuming faces the other side of the network), is not carrying much traffic. It is quite possible that the traffic on that interface is simply destined to this router.

A few questions:

What has the CPU baseline been over the last few days? Has the CPU utilization increased over time, or has this been seen since day 1?

There are no packet capture tools that we can use on the 2800. The only option to understand the traffic hitting the CPU, and the processes working on those packets, is to perform CPU profiling.

To perform CPU profiling, I will need the "show stacks" output. I will then share the commands for CPU profiling.

Thanks
--Vinit

light1001 · ‎05-16-2015

Vinit,

For the traffic patterns sir, I believe this might be the result of the Voice Gateway design we are using. I have attached a diagram that should help. The voice traffic is first received as data inbound on interface G0/0 through GRE tunnels, and routed to the back-end PBX system on G0/1. After the call has been processed by the central PBX, it is sent once again to the router, this time as voice calls on TDM interfaces. The router then converts this to SIP and it is sent out to the ITSP on a sub-interface of G0/0. So the traffic is transmitted & received twice through G0/0, first as data and the second as outbound SIP generated by the router.

The CPU average has always correlated with the call volume since installation. With no calls, CPU usage is around 1%. With 25-30 calls it is around 40-50%, and with around 50 calls it's 90-100%. CPU usage is always mostly interrupt traffic, with very little being IP Input or any other process.

I have attached the output of the "show stacks" cmd.

Thanks for your help!

Ask the Expert: High CPU on IOS Questions with Cisco Expert Vinit Jain