Community Tech-Talk Series: Troubleshooting High CPU due to Multicast


The Tech-Talk series is a great platform for sharing knowledge on specific topics, like troubleshooting and advanced features, that is normally hard to find. As a TAC engineer who has worked on both the Routing and the Switching teams, I come across high CPU issues almost daily, and the most common cause I find is multicast traffic. So I thought, why not use this amazing platform to share some insights into the troubleshooting process?



Common problems specific to multicast traffic that cause high CPU


Let's first look at a few common situations in which multicast traffic cannot be CEF (Cisco Express Forwarding) switched:

Note: In the video as well as in this blog, "software switched" means packets that are NOT switched by CEF, and "hardware switched" means packets that are switched either by software CEF on software-switched platforms or by hardware CEF (not software CEF) on hardware-switched platforms.

1. CEF is disabled: The title says it all: if CEF is disabled, packets have to be switched in software. This is true for unicast as well as multicast traffic.
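As a quick sketch of how to check this (standard IOS commands; exact output varies by platform and release):

Router# show ip cef summary     ! confirms whether CEF is up and running
Router(config)# ip cef          ! re-enables CEF globally if it was disabled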

2. Presence of "ip igmp join-group": When we have no way to receive IGMP reports from clients, or when we want multicast to keep flowing through an interface even when no receiver is present, we need to statically configure a join for a multicast group on that interface of the router.

There are two interface-level commands to do so:

A) "ip igmp join-group <group_address>"  makes the multicast to go out of the interface as well as send a copy  of multicast traffic to CPU. So every multicast packet coming in (for  the multicast group configured) would be sent to the CPU.


B) "ip igmp static-group <group_address>" makes the multicast  to go out of the interface and not send a copy to CPU.


So if you want to statically configure a multicast group in a production network, option B is the right choice. We normally use "join-group" during pre-production to verify that multicast is working, precisely because the traffic goes to the CPU and the router can therefore take action on it, for example sending an ICMP reply to an ICMP request sent to a multicast group.
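As a minimal configuration sketch of both options, with GigabitEthernet0/1 and group 239.1.1.1 as placeholders:

! Pre-production testing only: traffic for 239.1.1.1 is also punted to the CPU,
! so the router can, for example, answer a ping sent to the group
interface GigabitEthernet0/1
 ip igmp join-group 239.1.1.1

! Production: traffic for 239.1.1.1 goes out of the interface,
! but no copy is sent to the CPU
interface GigabitEthernet0/1
 no ip igmp join-group 239.1.1.1
 ip igmp static-group 239.1.1.1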


3. Traffic failing the RPF (Reverse Path Forwarding) check: Multicast traffic that fails the RPF check is always sent to the CPU. RPF is a mechanism that ensures there is no loop for multicast traffic in the network: we forward multicast traffic only if it arrives on the RPF interface.
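To see which interface the router considers the RPF interface for a given source, "show ip rpf" can be used; the source 10.1.1.1 and the output below are purely illustrative:

Router# show ip rpf 10.1.1.1
RPF information for ? (10.1.1.1)
  RPF interface: GigabitEthernet0/1
  RPF neighbor: ? (10.0.0.2)
  RPF route/mask: 10.1.1.0/24
  RPF type: unicast (ospf 1)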

4. PIM registration process: When a source needs to be registered with the RP (Rendezvous Point), the first-hop router encapsulates the multicast packets into unicast packets. This process is done completely in software, so if the registration process does not complete, all the multicast traffic keeps being software switched on the first-hop router.

5. Traffic to reserved multicast groups: Multicast traffic destined to 224.0.0.1 - 224.0.0.255, 224.0.1.39 and 224.0.1.40 is always software switched.

6. TTL = 1 packets: Packets that have TTL = 1 and need multicast routing are always sent to the CPU.


7. Fragmentation: If a packet needs to be fragmented before it can be sent out, it must be sent to the CPU, because fragmentation cannot be done in hardware.

8. Platform limitations: Before designing multicast, we must ensure that we meet all the criteria for traffic to be switched in hardware on that particular platform.

Troubleshooting Approach and Useful Tools

Now let's look at the troubleshooting approach we would follow:

1. First we need to determine what kind of packets are hitting the CPU. There are two main ways to do that. The first is to sniff the CPU by connecting a PC running Wireshark or Ethereal to the problem device; however, most of the time this is not possible because the device is at a remote location. The other way is to run some platform-specific commands:

A) 7600/6500 platforms: "Netdr capture". This is an internal buffer that can capture up to 4096 packets that are going to the CPU. It is safe to run in high CPU situations:

i)   To enable the capture: "debug netdr capture rx"

ii)  To display the packets captured: "show netdr capture"

iii) To clear the capture buffer: "debug netdr clear"

iv) To stop the capture: "undebug netdr capture"
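Put together, a typical netdr session looks like this (the file name netdr.txt is just a placeholder for the saved output, reused in step 2 below):

Router# debug netdr capture rx
! wait a few seconds while the buffer fills (it holds up to 4096 packets)
Router# undebug netdr capture
Router# show netdr capture
! save this output to a file, e.g. netdr.txt, for offline analysis
Router# debug netdr clear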


B) 4500 platforms: "CPU packet dump utility"


Details can be found here: Troubleshooting high CPU on 4500 devices
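From what I recall of that document, the dump utility is driven by two commands; please verify the exact syntax for your release before using them:

Switch# debug platform packet all receive buffer   ! buffer packets punted to the CPU
Switch# show platform cpu packet buffered          ! display the buffered packets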

C) 3560/3750/ME3XXX platforms: "CPU receive queue dump utility". Please take care while using this, as the command floods the console with a lot of data.

Details are available here: Troubleshooting high CPU on 3560/3750/ME3XXX Platforms

D) On platforms like the ISR/7200, packets are switched in software and there is no way to sniff or dump the packets going to the CPU. We can only see packets in the input buffers of an interface using the command "show buffers input-interface <> packet", but since packets are dequeued very quickly for processing, we cannot see all of them. The other way to dump the packets is to use the Embedded Packet Capture (EPC) utility, if you are running release 12.4(20)T or later.

Information on EPC can be found here: Configuring Embedded Packet Capture
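As a minimal EPC sketch for 12.4(20)T (the buffer name PUNT and capture-point name PUNT-IN are placeholders); the "process-switched" capture point is the relevant one here, since it catches packets handled in software:

Router# monitor capture buffer PUNT size 512 circular
Router# monitor capture point ip process-switched PUNT-IN in
Router# monitor capture point associate PUNT-IN PUNT
Router# monitor capture point start PUNT-IN
! reproduce the high CPU condition, then:
Router# monitor capture point stop PUNT-IN
Router# show monitor capture buffer PUNT dump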

2. Next we need to analyze what we captured and see if we can find what is causing the CPU to go high. If we see that it is multicast packets, we then need to zero in on the multicast group that we see hitting the CPU the most. If we cannot single one out, just choose one based on your best judgement. We will call this the problem group; if we solve the issue for this group, we can probably apply the same solution to the other groups.

The tools that can help us with this job depend on what kind of captures we have. If we have Wireshark or Ethereal captures, we can use filter expressions to narrow down the problem group. However, if we have outputs from the built-in CPU sniffer captures, we cannot use software like Wireshark; the way to go about it is to use Linux/Unix commands like "grep" in combination with "cut", "sort" and "uniq".

For example, let us say we have a netdr capture (in a file netdr.txt) and we want to find the number of packets received for each destination IP. This is the command I would execute from the directory containing the "netdr.txt" file:

grep 'ttl' "netdr.txt" | cut -d, -f6 | sort | uniq -c

What this command does is parse the complete netdr.txt and select the lines containing "ttl"; these are the same lines that carry the destination IP. Next we do a "cut" operation, which extracts only the destination IP address from each line. Then we "sort", so that identical destination IPs come together, and finally "uniq -c" counts consecutive identical destination IPs. More details on each of these commands can be found in its man page.
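To rank the destinations and surface the problem group directly, the same pipeline can be extended with a numeric reverse sort (field 6 matches this particular netdr layout and may need adjusting for other captures):

# count packets per destination IP, busiest first
grep 'ttl' netdr.txt | cut -d, -f6 | sort | uniq -c | sort -rn | head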

3. Once we have our problem group, let us analyze the packets for this problem group and see if the packet itself requires special handling, this being the reason it is going to the CPU. We need to check:

A) If the TTL value of the packet is 1.

B) If the length of the packet is more than the MTU configured on the interfaces.

C) If the destination IP is in the reserved ranges, that is 224.0.0.1 - 224.0.0.255, or is 224.0.1.39 or 224.0.1.40.

D) If any IP options are present. If they are, the packet will be handled by the CPU, as IP options cannot be processed in hardware.
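These checks can also be scripted against a text capture like netdr.txt; since field names and positions vary by release, inspect one sample line first. The group address 239.1.1.1 is a placeholder:

# look at one full line to learn the field layout of your capture
grep 'ttl' netdr.txt | head -1
# packets of the problem group that arrived with TTL 1
grep '239.1.1.1' netdr.txt | grep 'ttl 1'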

4. If we do not find anything wrong with the packet itself, we need to check if there is something wrong with the network or the configuration. We would follow these sub-steps:

A) Check if "ip igmp join-group <problem_group>" is present in the config. If it is, change it to "ip igmp static-group <problem_group>" (see the configuration sketch earlier in this post).


B) Check which PIM flavor you are running for that multicast group. Depending on the flavor, your multicast tree will be formed differently.

C) Check "show ip mroute <problem_group>" to see if inbound and outbound interfaces are correctly listed and are  in accordance with multicast tree that is supposed to be built.

D) Check "show ip mroute <problem_group> count" to see if RPF is failing. If it is then this is the reason of high CPU  and we might need to see why multicast traffic is not coming through RPF  interface. It is quiet possible that PIM is not enabled on RPF  interface or there are some "static mroutes" wrongly configured. There  might be other reasons and we would need to refer PIM troubleshooting: IP Multicast Troubleshooting Guide

E) If we see the "Registering" flag in "show ip mroute", it means there is some problem in the registration process, and we need to check why it is not completing. Probably we do not have a route to the RP on the first-hop router, or a route to the source on the RP. We might also have a problem with the (S,G) tree between the first-hop router and the RP. We would need to go hop by hop, starting from the RP, to find what is wrong with the multicast tree.
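As an illustration of sub-step D, here is a sketch of the output to look for; the group, source and counters are made up, and the exact layout varies by release. The signature of an RPF problem is a non-zero, increasing "RPF failed" counter:

Router# show ip mroute 239.1.1.1 count
IP Multicast Statistics
Forwarding Counts: Pkt Count/Pkts per second/Avg Pkt Size/Kilobits per second
Other counts: Total/RPF failed/Other drops(OIF-null, rate-limit etc)

Group: 239.1.1.1, Source count: 1, Packets forwarded: 0, Packets received: 152
  Source: 10.1.1.1/32, Forwarding: 0/0/0/0, Other: 152/152/0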

5. We can also run some multicast-related debug and show commands that could help us figure out what is going wrong. Please refer to this link for such commands and debug outputs: Basic Multicast Troubleshooting Tools

6. By now we should have found the problem, but if not, we might need to check a few other things. For example, if IGMP snooping is disabled for a VLAN and we have an SVI with an IP address in that VLAN, packets will be sent to the CPU: all multicast packets are then flooded in the VLAN at Layer 2 and thus also reach the SVI, and since ownership of the SVI lies with the CPU, the packet is punted. We should also check that we have a correct CEF entry; if the CEF entry is not present, traffic will be sent to the CPU. Finally, there might be some platform limitation, which we can find in the configuration guide for that platform.
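A quick sketch for the IGMP snooping case, with VLAN 10 as a placeholder:

Switch# show ip igmp snooping vlan 10       ! verify whether snooping is enabled for the VLAN
Switch(config)# ip igmp snooping vlan 10    ! re-enable it if it was turned off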

Watch the Tech-Talk and check out the presentation slides

I hope this has been an informative session and proves useful for troubleshooting multicast high CPU situations. Please do share your feedback and opinions via the comments section below.

Thank you for watching!

9 Comments

Hi Ruchir Jain,

Wonderful document, just what I was looking for... thanks a lot...

Regards

Deben

Cisco Employee

Hi Deben,

Thanks a lot for your kind words. I appreciate it.

You could also check out the video that has just been added for more detailed information.

Regards,

Ruchir

Beginner

Thank you for the great Video. You should develop another video for NX-OS.

Cisco Employee

Thanks, jmercado1986, for the nice words. I will see what I can do for NX-OS.

Cheers,

Ruchir

Beginner

Thanks for that great presentation. I have one criticism though: the video doesn't appear in Chrome.

Cisco Employee

Hi Gabriel,

Thanks for the appreciation and the feedback. I will pass it on to the right folks so that they can look into it.

Regards,

Ruchir

Cisco Employee

Hi Gabriel,

To load the video in Chrome, you may click the shield icon in the address bar as shown below:

[Screenshot: shield icon in the Chrome address bar]

Thanks,

Satish

Beginner

Great info....thank you so much..

Cisco Employee

great Video Ruchir, very useful.. thanks
