Troubleshooting tools to analyze high CPU utilization issues on Catalyst 6500 Series switches - FAQs...

Souvik Ghosh · ‎01-19-2012

Introduction


Souvik Ghosh is a customer support engineer at the Cisco Technical Assistance Center in Bangalore, India. He has three and a half years of experience in LAN switching technologies. His areas of expertise are LAN switching products such as the Cisco Catalyst 6500, 4500, 3750, and 2960 Series Switches. He has been involved in various escalation requests from India, Singapore, and Australia and currently works as a technical lead for the LAN switching team in Bangalore, India. He holds CCNP and CCIP certifications.

The following experts helped Souvik answer some of the questions asked during the session: Amit Singh, Akshay Balaganur, and Ranganatha Raju. Amit, Akshay, and Ranganatha are support engineers with extensive knowledge of 6500-related topics.

You can download the slides of the presentation in PDF format here. The related Ask The Expert session is available here.

The Complete Recording of this live Webcast can be accessed here.

General Questions

Q. We have an issue with the Catalyst 6500 in VSS and broadcast coming from a checkpoint cluster. As a result, there is high CPU on Catalyst 6500. Is there any way to drop broadcast traffic and not make it go to the control plane?

A. This depends entirely on the kind of broadcast. We can implement storm control on the access layer switches to protect the CPU in case of a broadcast storm. For ARP there are two options. The first is the "mls qos protocol arp" command, which rate limits ARP packets in hardware. The second is to enable DHCP snooping along with ARP inspection and configure a rate limit for ARP inspection. The drawback to both of these is that they are system wide, so normal broadcast packets are also dropped. As a result, storm control is the best option.
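As a rough illustration of the first two options, the fragment below sketches storm control on an access port and the hardware ARP policer on the 6500. The interface name, the 1 percent threshold, and the 32000 bps policer rate are placeholder assumptions, not recommendations, and the exact syntax and supported ranges vary by platform and release.

```
! Access-layer switch: cap broadcast on a host-facing port
! (interface and threshold are illustrative only)
interface GigabitEthernet1/0/1
 storm-control broadcast level 1.00
!
! Catalyst 6500: police ARP in hardware, system wide
! (QoS must be enabled globally for the policer to apply)
mls qos
mls qos protocol arp police 32000
```

Enabling "mls qos" globally also changes the default port trust behavior, so review the QoS impact before applying it to a production switch.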

Q. Is it normal to see CPU usage around 60%?

A. Constant CPU usage of 60% is not normal. However, an intermittent spike might occur and be normal. It also depends on whether you are using Non-modular IOS (mz image) or Modular IOS (vz image).

Q. I have a problem with a Sup2 running 12.2(18)SXD6. I want to stop PIM v1 from working, as I do on all other 6500s with Sup720 using COPP and an ACL with policing. Is there any option to filter PIM v1 without a big performance impact?

A. In addition to COPP, we have the option of hardware-based limiters on the CPU.

Q. I have a 7600 with sup 720 in an ISP environment. The CPU reaches 100 every time we have a DDoS attack on an end user. What approach should I take?

A. Refer to Protection against Denial of Services, which contains details to avoid DDoS attacks.

Q. Do we need to specifically check the SP CPU by logging into the SP using the “remote login” command? If the RP was quiet and the SP was busy, would you have to log in to the SP or can you see this without logging in?

A. We can log in remotely to SP, or use the command “remote command switch show process cpu” and “remote command switch show process cpu history”.
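For reference, a typical sequence run from the RP prompt might look like the sketch below. The "sorted" keyword is an addition here and is assumed to be available on your release.

```
! RP (route processor) view, busiest processes first
show process cpu sorted
!
! SP (switch processor) view, without logging in to the SP
remote command switch show process cpu
remote command switch show process cpu history
```

In the first line of the "show process cpu" output, "five seconds: X%/Y%" gives the total CPU utilization (X) followed by the portion spent at interrupt level (Y). The difference between the two is the time spent in processes.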

Q. Are the drops from input queue related only to the packets that need to be processed switched?

A. That depends on whether it is a Layer 3 port or a Layer 2 port. For a Layer 3 port, the input queue is a software queue. For a Layer 2 port, the input queue is a hardware queue.

Q. Is it possible to check forwarding plane utilization for each line card?

A. The CPU on the line card is used for communicating with the Supervisor. This CPU will not be used for maintaining the control plane protocols and will only be used for responding to diagnostic requests from the Supervisor. You can use "attach <module no>" command to log on to the module and check the cpu utilization, using the "show process cpu" command.

Q. Why does the TTL value become 1 during Multicasting on 720 which in turn causes high CPU?

A. It depends on the kind of network we are dealing with. The TTL reaches 1 when there is a loop in the network, because each pass through the loop decrements it. The TTL itself is not the cause of the high CPU. Rather, high CPU is a symptom of some other issue, such as a loop in the network. It can also be caused by a bug within an application that sends packets with a TTL of 1.

Q. Since each line card has specific forwarding capacity, can we check each line card's forwarding utilization?

A. Yes, the "show platform hardware capacity" command will show the utilization.
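A short sketch of how this might be narrowed down, assuming the "forwarding" and "fabric" keywords are available on your release (check with "?" on the switch):

```
! Full capacity report for the chassis
show platform hardware capacity
!
! Forwarding-engine resources (FIB, NetFlow, ACL usage) per module
show platform hardware capacity forwarding
!
! Switch fabric channel utilization per module
show platform hardware capacity fabric
```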

Q. How can we reduce the process for SSH on 6500?

A. If the process is Virtual Exec, it is the process that services the vty lines, which are used for logging in to the switch. If we are dumping a large output such as "show tech", high CPU is expected and is not a matter for concern.

Q. How do we troubleshoot a CPU spike due to IOS-BASE?

A. High CPU utilization under the IOS-BASE process is for Supervisors running modular IOS. First, find out the PID of the IOS-BASE process from the output of the "show process cpu" command. Following that use the command "show process cpu detail <PID of the IOS-BASE process>" and find out which sub-process under the IOS-BASE process is consuming most of the CPU cycles.

Q. Why does running “show tech” increase the CPU? Is this due to monolithic IOS?

A. Seeing a very small CPU spike while running “show tech” is expected as it is caused by “SSH Process”. It is not a result of IOS.

Q. What is the best way to baseline the traffic in a network in order to implement hardware based limiters or COPP?

A. These are two different things. Hardware rate limiters are available for specific features. If no limiter is available for the feature we are looking for, we need to fall back to COPP. Hardware rate limiters are the best option because of where the packets are dropped (right at the PFC). In COPP, the policing is done in software, so the packet needs to travel from the PFC toward software and is dropped along the way.
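To make the contrast concrete, here is a minimal sketch of both mechanisms. The TTL-failure limiter is just one example of a feature-specific limiter. The ACL, class, and policy names are hypothetical, and all rates and burst values are placeholders that would have to come from a baseline of your own traffic.

```
! Hardware rate limiter: available only for specific features
! (100 packets per second, burst of 10; values are illustrative)
mls rate-limit all ttl-failure 100 10
!
! COPP: fallback for traffic with no dedicated hardware limiter
! (names and police values below are hypothetical)
ip access-list extended COPP-PIM-ACL
 permit pim any any
!
class-map match-all COPP-PIM
 match access-group name COPP-PIM-ACL
!
policy-map COPP-POLICY
 class COPP-PIM
  police 32000 1500 1500 conform-action transmit exceed-action drop
!
control-plane
 service-policy input COPP-POLICY
!
! Verification
show mls rate-limit
show policy-map control-plane
```

A common way to baseline is to apply the policy with a permissive action first, watch the per-class counters in "show policy-map control-plane" over a representative period, and tighten the rates afterward.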

Q. In an ideal situation, what would be the CPU utilization of my switch?

A. Ideally, once the network protocols have converged, it should be 0% or, at a maximum, 10%.

Q. If I upgrade my Sup2 to Sup 720/Sup 32, can I expect to see the difference in CPU utilization?

A. Yes and no. Depending on what caused the CPU utilization, we need to check if upgrade will help or not.

Q. Do we have a CPU on the Line cards running DFCs? Do we see high CPU utilization on the LCs running DFCs as well?

A. Yes, we have a CPU on line cards running DFCs. However, it is not used for maintaining control plane protocols. It is used for diagnostic purposes, that is, for responding to the Sup when the Sup polls the LC. You might see some CPU utilization there, but it has to do with internal management of the LC and the system, not network traffic. You should get in touch with TAC in such instances, as this can be a bug as well.

Q. Is there any dependency of IOS for capturing process utilization, and, if a process is not identified, will it be seen as spike on CPU utilization?

A. If there is CPU utilization due to process switching, you should see some process associated with it. If you do not see a process, it could be a bug in the IOS code.

Questions Related to the Tools Available

Q. Can hardware-based rate limiters limit to zero? The customer wants to stop the automatic fallback to PIM v1 for security reasons.

A. Yes, we can make it zero.

Q. Would the Netdr capture overload the CPU?

A. From version 12.2(18)SXF onwards, you do not see any impact on the switch from enabling Netdr capture. On older versions of code, it is just like enabling any other debug, so there would be some impact on the CPU.

Q. Would setting up a CPU rising threshold capture the SP CPU?

A. CPU rising threshold monitors both RP and SP. If you need to monitor only SP CPU, then this EEM script is helpful:

event manager applet POLL-SP-CP
 event snmp oid 1.3.6.1.4.1.9.9.109.1.1.1.1.7.2 get-type exact entry-op gt entry-val "60" poll-interval 10
 action 1.1 syslog priority notifications msg "CPU utilization >60% on SP processor, logging data"
 action 2.1 cli command "enable"
 action 2.2 cli command "rem comm swi show proc cpu sort | redirect bootflash:SP-CPU"
 action 2.3 cli command "end"

Q. Can we run Netdr capture on 7600 platform as well?

A. Yes, it is the same for 7600 platform.

Q. I can see a lot of UDP packets punting to my CPU. At the same time I can see that “UDP no port” is rising in the “show ip traffic”. Can I configure the 720 not to process unwanted UDP traffic for the servers without the UDP service running?

A. First analyze the netdr captured outputs in order to understand why these UDP packets are hitting the CPU. If these packets are to be dropped, use an ACL.
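Once the netdr output has identified the offending flow, a drop ACL might look like the sketch below. The ACL name, server address, UDP port, and VLAN interface are all hypothetical placeholders.

```
! Drop the unwanted UDP flow in hardware (all values are illustrative)
ip access-list extended DROP-UNWANTED-UDP
 deny udp any host 192.0.2.10 eq 1434
 permit ip any any
!
interface Vlan10
 ip access-group DROP-UNWANTED-UDP in
 ! Avoid generating ICMP unreachables for denied packets,
 ! which would otherwise involve the CPU again
 no ip unreachables
```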

Q. Do we need to be cautious with the "show buffer input-interface" command?

A. It is safe to use this command as it just takes a dump of packets in the buffer.

Q. Can we use an ELAM to get specific traffic to the CPU or is this for data plane only?

A. Yes, we can use ELAM to find what traffic is going to CPU. Note that this is an internal tool, so you should have TAC assistance to run this tool.

Q. If my top process is “IP Input”, will I have to capture the packets and analyze them on an external system?

A. Yes. The IP Input process handles the interrupt/process switched traffic, meaning that the traffic is handled by CPU. As a result, we need to do a packet capture to see what traffic is being processed by CPU.

Q. Can RP/SP inband span be used on the 7600 or are there limitations?

A. Yes, you can use RP/SP inband span on 7600 as well. There are no limitations.
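As a sketch only: on releases that support the CPU as a SPAN source (12.2(33)SXH and later, as far as I recall), the session could be set up roughly as follows. The session number and destination interface are placeholders, and older releases use different commands, so confirm the syntax for your software version.

```
! Mirror traffic on the RP inband channel to a port with a sniffer attached
monitor session 1 source cpu rp
monitor session 1 destination interface GigabitEthernet1/1
!
! Confirm the session
show monitor session 1
```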

Q. Is there any recommended COPP for 6500s from Cisco?

A. No, we do not have any recommended COPP configuration. It totally depends on your network and traffic pattern.

Q. How do I take a capture for the packets punted to the MSFC in case there is a redundant supervisor?

A. In this case, only one Sup will be in the active state. As a result, you will not see high CPU on the redundant supervisor. You will see high CPU on the active supervisor only, so that is the one you need to troubleshoot using the tools we discussed.

Q. Will there be an increase in CPU utilization, if we have an EEM script running?

A. No. Refer to the Feature Navigator tool to check which version supports EEM. From that version onwards, a specific driver is used for EEM functionality. As a result, we should not see any high CPU due to EEM running in the background.

Q. Do we have a similar span capture for Sup2 running in native mode?

A. Yes, but depending on IOS version running, the commands can differ.

Q. I can see many options which are available with netdr capture. What are the different scenarios in which we can use netdr?

A. Netdr can always be used when you are trying to capture packets that are bound to or from the CPU. If you suspect that the CPU is not sending certain traffic, you can configure Netdr in the "Tx" direction and check the output of the "show netdr capture" command. The other options available with Netdr are used for filtering the traffic captured by the "debug netdr capture" command.
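A minimal capture sequence, shown here without any filter options, might look like this. The full form of the display command is assumed to be "show netdr captured-packets", which "show netdr capture" abbreviates.

```
! Packets punted to the CPU
debug netdr capture rx
show netdr captured-packets
!
! Packets originated by the CPU
debug netdr capture tx
show netdr captured-packets
!
! Stop the capture and empty the buffer when finished
undebug all
debug netdr clear-capture
```

These commands act on the RP when entered at the normal prompt. To capture on the SP, log in with "remote login switch" first and run the same sequence there.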

Q. How can we determine the direction of the inband span?

A. It will be Tx direction.