The Tech Talk series is a great platform to share knowledge on specific topics, like troubleshooting and advanced features, that is normally hard to find. As a TAC engineer on both the Routing and Switching teams, I come across high CPU issues almost daily, and the most common cause I find is multicast traffic. So I thought: why not use this platform to share some insights into the troubleshooting process?
Let's first look at a few of the common situations in which multicast traffic is not CEF (Cisco Express Forwarding) switched:
Note: In the video as well as this blog, "software switched" means packets that are NOT switched by CEF, and "hardware switched" means packets that are switched either by software CEF on software-switched platforms or by hardware CEF (not software CEF) on hardware-switched platforms.
1. CEF is disabled: The title says it all: if CEF is disabled, packets must be switched in software. This is true for unicast as well as multicast traffic.
2. Presence of "ip igmp join-group": When we have no way to receive IGMP reports from clients, or when we wish to ensure multicast continues to flow through an interface even if no receiver is present, we need to statically configure a join for a multicast group on that interface of the router.
There are two interface-level commands to do so:
A) "ip igmp join-group <group_address>" sends the multicast out of the interface and also punts a copy of the traffic to the CPU. So every incoming packet for the configured multicast group is sent to the CPU.
B) "ip igmp static-group <group_address>" sends the multicast out of the interface without sending a copy to the CPU.
So if you want to statically configure a multicast group in a production network, option B is the choice. We normally use "join-group" during pre-production to verify that multicast is working, precisely because the traffic goes to the CPU and the router can then take some action, for example send an ICMP reply to an ICMP request sent to a multicast group.
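As a minimal sketch (the interface name and group address below are hypothetical), the production-safe variant would look like:

```
interface GigabitEthernet0/1
 ! Forward traffic for group 239.1.1.1 out of this interface
 ! without punting a copy of every packet to the CPU
 ip igmp static-group 239.1.1.1
```

During pre-production testing, replacing the last line with "ip igmp join-group 239.1.1.1" would instead punt the traffic to the CPU, allowing the router itself to respond to, say, a ping sent to the group.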
3. Traffic failing the RPF (Reverse Path Forwarding) check: Multicast traffic that fails the RPF check is always sent to the CPU. RPF is a mechanism to ensure there are no loops for multicast traffic in the network: we only forward multicast traffic if we receive it on the RPF interface.
4. PIM registration process: When a source needs to register with the RP (Rendezvous Point), the first-hop router encapsulates the multicast packet into a unicast packet. This process is done completely in software, so if the registration process does not complete, all the multicast traffic will be software switched on the first-hop router.
5. Traffic to reserved multicast groups: Multicast traffic destined to 224.0.0.0 - 224.0.0.255, 224.0.1.39 and 224.0.1.40 is always software switched.
6. TTL = 1 packets: Packets having TTL=1 that need multicast routing are always sent to the CPU.
7. Fragmentation: If a packet needs to be fragmented before it is sent out, it must be sent to the CPU, as fragmentation cannot be done in hardware.
8. Platform limitations: Before designing a multicast deployment, we must ensure we meet all the criteria for traffic to be switched in hardware on that particular platform.
Troubleshooting Approach and Useful Tools
Now let's look at the troubleshooting approach we would follow:
1. First we need to determine what kind of packets are hitting the CPU. There are mainly two ways to do that. The first is to sniff the CPU by connecting a PC running Wireshark or Ethereal to the problem device. However, most of the time this is not possible because the device is at a remote location. The other way is to run some platform-specific commands:
A) 7600/6500 platforms: "Netdr capture". This is an internal buffer that can capture up to 4096 packets going to the CPU. It is safe to run in high CPU situations:
i) To enable the capture: "debug netdr capture rx"
ii) To display the packets captured: "show netdr capture"
iii) To clear the capture buffer: "debug netdr clear"
iv) To stop the capture: "undebug netdr capture"
B) 4500 platforms: "CPU packet dump utility"
Details can be found here: Troubleshooting high CPU on 4500 devices
C) 3560/3750/ME3XXX platforms: "CPU receive queue dump utility". Please take care while using this, as the command floods the console with a lot of data.
Details are available here: Troubleshooting high CPU on 3560/3750/ME3XXX Platforms
D) On platforms like the ISR/7200, packets are switched in software and there is no way to sniff or dump the packets going to the CPU. We can only see packets in the input buffers of an interface using the command "show buffers input-interface <> packet", but since packets are dequeued very quickly for processing, we cannot see all of them. Another way to dump the packets is to use the Embedded Packet Capture (EPC) utility if you are running release 12.4(20)T or later.
Information on EPC can be found here: Configuring Embedded Packet Capture
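As a sketch of the EPC workflow (the buffer and capture-point names and the interface are hypothetical, and exact syntax varies by release), the sequence of exec commands would look like:

```
! Create a circular capture buffer and a CEF capture point on an interface
monitor capture buffer HICPU-BUF size 512 max-size 1024 circular
monitor capture point ip cef HICPU-PT GigabitEthernet0/0 both
monitor capture point associate HICPU-PT HICPU-BUF
! Start the capture, let it run for a while, then stop and inspect it
monitor capture point start HICPU-PT
monitor capture point stop HICPU-PT
show monitor capture buffer HICPU-BUF dump
```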
2. Next we need to analyze what we captured and see if we can find what is causing the CPU to go high. If we see multicast packets, we need to zero in on the multicast group we see hitting the CPU the most. If we cannot zero in on any, just choose one based on your best judgement. We will call this the problem group; if we solve the issue for this group, we can probably apply the same solution to the other groups.
Tools that can help with this job depend on what kind of captures we have. If we have Wireshark or Ethereal captures, we can use filter expressions to narrow down the problem group. However, if we have outputs from the built-in CPU sniffer captures, we cannot use software like Wireshark. The way to go about this is to use Linux/Unix commands like "grep" in combination with "cut", "uniq" and "sort".
For example, let us say we have a netdr capture (in a file netdr.txt) and we want to find the number of packets received for each destination IP. This is the command I would execute from the directory containing the "netdr.txt" file:
grep 'ttl' "netdr.txt" | cut -d, -f6 | sort | uniq -c
What this command does is parse the complete netdr.txt and select the lines containing "ttl"; these are the same lines that contain the destination IP. Next, the "cut" operation extracts only the destination IP address from each line. We then "sort" so that identical destination IPs come together, after which "uniq -c" counts consecutive identical destination IPs. More details can be found in the manual page of each command.
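To make the pipeline concrete, here is a minimal self-contained sketch against a synthetic capture file. The line layout below is an assumption carried over from the command above (comma-separated fields with the destination IP in field 6); real netdr output differs between releases, so adjust the field number accordingly.

```shell
# Create a tiny synthetic capture in the assumed layout:
# comma-separated fields, "ttl" on each line, destination IP in field 6.
cat > netdr.txt <<'EOF'
ttl: 1, prot: 17, len: 100, id: 0, src: 10.1.1.1, dst: 224.1.1.1
ttl: 16, prot: 17, len: 100, id: 0, src: 10.1.1.2, dst: 224.1.1.1
ttl: 64, prot: 17, len: 200, id: 0, src: 10.1.1.3, dst: 224.0.1.40
EOF

# Count packets per destination IP; "sort -rn" puts the busiest
# group (the likely problem group) on the first line.
grep 'ttl' netdr.txt | cut -d, -f6 | sort | uniq -c | sort -rn
```

Here 224.1.1.1 appears twice and would surface as the problem group; the same pattern works for source IPs or TTL values by changing the field number given to "cut".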
3. Once we have our problem group, let us analyze its packets and see if the packet itself requires special handling, and whether that is the reason it is going to the CPU. We need to check:
A) If TTL value of packet is 1.
B) If the length of the packet is more than the MTU configured on the interfaces.
C) If the destination IP is in a reserved range, that is 224.0.0.0 - 224.0.0.255 or 224.0.1.39 - 224.0.1.40.
D) If any IP options are present. If they are, the packet will be handled by the CPU, as IP options cannot be processed in hardware.
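Several of these checks can be scripted against the same kind of capture file. This is a sketch under the same assumptions as before (a synthetic file in a comma-separated layout, with the reserved groups taken to be the link-local range 224.0.0.0/24 plus the Auto-RP groups 224.0.1.39-40):

```shell
# Synthetic capture: one TTL=1 packet to an Auto-RP group,
# one oversized packet to an ordinary group.
cat > netdr.txt <<'EOF'
ttl: 1, prot: 17, len: 100, id: 0, src: 10.1.1.1, dst: 224.0.1.40
ttl: 64, prot: 17, len: 1600, id: 0, src: 10.1.1.3, dst: 239.1.1.1
EOF

# Check A: packets that arrive with TTL = 1
grep -c 'ttl: 1,' netdr.txt

# Check B: packets larger than an assumed interface MTU of 1500
awk -F'len: ' '{ split($2, a, ","); if (a[1] + 0 > 1500) n++ } END { print n+0 }' netdr.txt

# Check C: packets for link-local or Auto-RP groups
grep -cE 'dst: (224\.0\.0\.[0-9]+|224\.0\.1\.(39|40))$' netdr.txt
```

Each command prints a count of matching packets; any non-zero count points at the corresponding check as a likely reason the traffic is punted.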
4. If we do not find anything wrong with the packet itself, we need to check whether something is wrong with the network or configuration. We follow these sub-steps:
A) Check if we have "ip igmp join-group <problem_group>" present in the config. If it is present, change it to "ip igmp static-group <problem_group>".
B) Check which PIM flavor you are running for that multicast group; depending on the flavor, your multicast tree is formed differently.
C) Check "show ip mroute <problem_group>" to see if the inbound and outbound interfaces are correctly listed and are in accordance with the multicast tree that is supposed to be built.
D) Check "show ip mroute <problem_group> count" to see if RPF is failing. If it is, this is the reason for the high CPU, and we need to see why multicast traffic is not arriving on the RPF interface. It is quite possible that PIM is not enabled on the RPF interface, or that some static mroutes are wrongly configured. There might be other reasons, and we would need to refer to PIM troubleshooting: IP Multicast Troubleshooting Guide
E) If we see the "Registering" flag in "show ip mroute", it means there is some problem in the registration process and we need to check why it is not completing. Probably we do not have a route to the RP on the first-hop router, or a route to the source on the RP. We might also have a problem with the (S,G) tree between the first-hop router and the RP. We would need to go hop by hop, starting from the RP, to find what is wrong with the multicast tree.
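For the RPF check in sub-step D, an illustrative "show ip mroute count" in the failing case would look roughly like this (the addresses are hypothetical and the exact fields vary by IOS release):

```
Router# show ip mroute 239.1.1.1 count
Forwarding Counts: Pkt Count/Pkts per second/Avg Pkt Size/Kilobits per second
Other counts: Total/RPF failed/Other drops(OIF-null, rate-limit etc)

Group: 239.1.1.1, Source count: 1, Packets forwarded: 0, Packets received: 1250
  Source: 10.1.1.1/32, Forwarding: 0/0/0/0, Other: 1250/1250/0
```

The second field of "Other" is the RPF-failed count; here every received packet failed RPF and nothing was forwarded, which would point at sub-step D as the culprit.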
5) We can also run some multicast-related debug and show commands that could help us figure out what is going wrong. Please refer to this link for such commands and debug outputs: Basic Multicast Troubleshooting Tools
6) By now we should have found the problem, but if not, we might need to check a few other things. For example, if IGMP snooping is disabled for a VLAN and we have an SVI with an IP address for that VLAN, packets are sent to the CPU: all multicast packets are flooded in the VLAN at Layer 2 and therefore also reach the SVI interface, and since ownership of the SVI lies with the CPU, the packet is punted to the CPU. We should also verify that we have a correct CEF entry; if the CEF entry is not present, traffic is sent to the CPU. Finally, there might be some platform limitation, which we can find in the configuration guide for that platform.
I hope this has been an informative session and proves useful for troubleshooting multicast high CPU situations. Please do share your feedback and opinions via the comments section below.
Thank you for watching!