Ping Delay going through Sup720/6509

I have a single fiber SFP link into a gig Ethernet port on a Sup720, and pings going through that link take over 9 ms on average (often over 10 ms).  In other words, pinging that switch directly via the Sup720 gig port takes about 0.5 ms, but pinging something connected behind that switch takes 9 ms or longer.  That 6509 has another 6509 connected via an EtherChannel of 4 gig Ethernet ports on fabric-enabled cards.  Pinging one 6509 through the fiber SFP link on the other 6509 takes 9-30 ms or longer.

That doesn't make any sense to me.  This hasn't always been the case either - it looks like it started on the 22nd, about 5 minutes to midnight.  The link into the 6509 is a Dot1Q trunk from a 4900M.  Any thoughts?

Thanks

John

12 Replies

andtoth
Level 4

Hi John,

What has changed in the network or configuration recently and around the time you started to observe this?

Do you have QoS enabled on the devices?

Have you changed any of the QoS configurations of the devices?

What's the traffic rate on the links (on each one) between the 2 end devices?

Could you please attach an output of the following commands from the devices in the path?

sh ver

sh mod

sh proc cpu sort

sh interface (for all involved interfaces)

Best regards,

Andras

What has changed in the network or configuration recently and around the time you started to observe this?

There weren't any network config changes when this started.

Do you have QoS enabled on the devices? No

Have you changed any of the QoS configurations of the devices? No

What's the traffic rate on the links (on each one) between the 2 end devices?

They're both 1 Gigabit rated connections, but are now running at around 100Mb/s.

What's also odd is that over the weekend, rates slowed down to around 35 Mb/s, even though hosts were still trying to send traffic.

Hi,

Thanks for your reply. By looking at the outputs, I see the following:

On the 6500 switch, GigabitEthernet5/1 is receiving about 400 Mbps of traffic on the 5-minute average:

5 minute input rate 416533000 bits/sec, 43622 packets/sec

You might try configuring 'load-interval 30' on this interface to see a more granular rate.
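For example, something like this (GigabitEthernet5/1 being the interface from your output above):

conf t
 interface GigabitEthernet5/1
  load-interval 30
 end

sh interface GigabitEthernet5/1 | include rate

With the 30-second average, short bursts are easier to see than with the default 5-minute average.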

Also, on the 4500 switch, the CPU utilization seems to be a bit high but that depends on your actual baseline.

CPU utilization for five seconds: 25%/1%; one minute: 24%; five minutes: 24%

Do you have a CPU utilization graph for this switch to see if the CPU usage went up or not?

Could you please collect 'sh platform health' from this switch?

Could you please clarify whether these switches perform L3 routing or act as L2 switches only?

Also, which interface is used for connecting the switches and which interface is connecting to the end devices?

Could you please describe the path from one end device to the other one?

Best regards,

Andras

The CPU load on the 4900M has been in that range since before the first of the year, as well as before this past month.  The 6509 and 4900M switches are running layer 3.

Enclosed are the output of 'sh platform health' for the 4900M, graphs of its CPU, and a diagram showing the paths and the ping times along those paths.

thanks

john

Hi John,

Thanks for the graphs and outputs; the CPU usage looks normal compared to its historical graph.

I understand that the switches are running Layer 3. Are these switches both routing traffic between the 2 devices where you're testing or acting as Layer 2 for this?

If you move the hosts to the same device, do you still see the issue? Try moving them to the 6500 and routing between them, then try the same by connecting them to the 4900M, in order to narrow down the issue and see where the delay is added. If no significant delay is observed in either scenario, the delay is most likely happening between the 2 switches.

Did you have a chance to check the interface rates with load-interval 30 ?

Are you seeing this issue continuously or is it intermittent?

What is the pattern, is it steady or fluctuating?

If you start sending a large amount of traffic (not just pings) between the devices where the issue is happening, and check the CPU utilization of the devices in the path, do you see a significant increase? If yes, on which device and what's the rate?
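For example, while the test traffic is running, something like the following gives a quick view of which processes (if any) are spiking and how the overall utilization has trended:

sh proc cpu sorted | exclude 0.00

sh proc cpu history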

Make sure the traffic is really taking the expected path by looking at the interface counters before and while sending traffic, and comparing them with the rate reported on the source.

On the 6500, please check the output of 'sh counters interface <interface>' for both the ingress and egress interfaces while sending a high volume of traffic, and see which counters are increasing.

On the 4900M switch, use the 'sh interface counters all' command to see the detailed counters.
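For example, on the 6500 (GigabitEthernet5/1 here is just the interface from your earlier output; substitute your actual ingress and egress ports):

sh counters interface GigabitEthernet5/1

[send the test traffic for a minute or two]

sh counters interface GigabitEthernet5/1

Then compare the two snapshots to see which counters moved.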

Best regards,

Andras

Yes, all the switches are L3 and routing.  They are using vlan/subnet 172.16.2.0/24 for L3 routing.

When I ping from a device on the same switch, I get these results:

On rtr-core-a, vlan3  -> rtr-core-a vlan3:  rtt min/avg/max/mdev = 0.149/1.347/15.766/3.671 ms

On rtr-core-a, vlan33 -> rtr-core-a vlan3:  rtt min/avg/max/mdev = 0.214/6.876/15.671/2.542 ms

On rtr-core-a, vlan33 -> rtr-core-b vlan3:  rtt min/avg/max/mdev = 0.391/6.302/9.064/1.611 ms

On rtr-core-a, vlan3  -> rtr-core-b vlan3:  rtt min/avg/max/mdev = 1.247/6.318/34.419/2.819 ms


On rtr-core-b, vlan3  -> rtr-core-b vlan3:  rtt min/avg/max/mdev = 0.110/0.309/17.692/1.524 ms

On rtr-core-b, vlan33 -> rtr-core-b vlan3:  rtt min/avg/max/mdev = 0.155/1.663/7.374/1.236 ms

On rtr-core-b, vlan33 -> rtr-core-a vlan3:  rtt min/avg/max/mdev = 0.158/6.341/203.723/21.430 ms

On rtr-core-b, vlan3  -> rtr-core-a vlan3:  rtt min/avg/max/mdev = 0.167/3.082/82.633/7.222 ms

The delays seem to happen when anything routes through rtr-core-a.  One thing I noticed is that rtr-core-a is standby for 172.16.3.0 and it's the spanning-tree root for that vlan.  I'm not sure that should make a difference, however.  rtr-core-a is active for vlan/subnet 33, and rtr-core-b is the active router for subnet 172.16.3.0.

The next hop from the colo to the hosts connected to rtr-core-a and rtr-core-b is rtr-core-a.  From there, the next hop to reach a host in the 172.16.3.0/24 subnet will be the active router for that subnet, rtr-core-b.  Pings to hosts in subnet 172.16.3.0/vlan 3 take about the same amount of time regardless of which rtr-core switch they are plugged into, 'a' or 'b'.

I'm not seeing high CPU spikes when traffic goes up.  Also, I've set the load-interval to 30, and the load looks constant.  That makes sense, since hosts at the colo are constantly trying to send backup data to the host plugged into the rtr-core-b switch.
john

Yesterday, pings went back to normal for a period, and the CPU on rtr-core-b also went down for that exact same period.  As it turns out, rtr-core-b is the HSRP active for vlan 1 also.  'sh proc cpu sorted' looks like:

rtr-core-b#sh proc cpu sorted
CPU utilization for five seconds: 16%/12%; one minute: 20%; five minutes: 19%
PID Runtime(ms)   Invoked      uSecs   5Sec   1Min   5Min TTY Process
  12   123801904 208089948        594  1.03%  0.86%  0.76%   0 ARP Input
  24    86942872 661073029        131  0.87%  0.36%  0.30%   0 IPC Seat Manager
273   1313500001562271313         84  0.55%  0.54%  0.52%   0 IP Input
355    59945764  30920137       1938  0.39%  0.43%  0.42%   0 CEF: IPv4 proces
  66         132       126       1047  0.23%  0.14%  0.03%   1 SSH Process
297      2387002476283753          0  0.15%  0.14%  0.15%   0 Ethernet Msec Ti
267     2684700  22217580        120  0.15%  0.16%  0.15%   0 CDP Protocol
572      6911362434323989          0  0.15%  0.16%  0.15%   0 IP SLAs Event Pr
561      4529001217279652          0  0.15%  0.15%  0.15%   0 HSRP Common
562     8873356 146666652         60  0.07%  0.03%  0.02%   0 HSRP IPv4
563    11103264 245990939         45  0.07%  0.11%  0.12%   0 OSPF-1 Hello
558     3509804   8943714        392  0.07%  0.01%  0.01%   0 DHCPD Receive
342     2076636   4327321        479  0.07%  0.03%  0.00%   0 XDR mcast
  51      875704  19708775         44  0.07%  0.07%  0.07%   0 Per-Second Jobs
376     6617720   9808264        674  0.07%  0.04%  0.05%   0 HIDDEN VLAN Proc
343    13602276   4068070       3343  0.07%  0.06%  0.05%   0 IPC LC Message H
407      162388 609137650          0  0.07%  0.03%  0.02%   0 RADIUS
242      165700 609286155          0  0.07%  0.03%  0.02%   0 ACE Tunnel Task
  18           0         1          0  0.00%  0.00%  0.00%   0 IFS Agent Manage
  17        8488     66428        127  0.00%  0.00%  0.00%   0 EEM ED Syslog
  16          84       189        444  0.00%  0.00%  0.00%   0 Entity MIB API

What else can I check to try and find out why the CPU is so high on that switch?

thanks,

John

Hi John,

Thanks for your reply. The 20% usage is not necessarily high, depending on your network. On the other hand, if the traffic you're sending through the switch is software switched for some reason (normally it should be forwarded in hardware), then the traffic goes to the CPU, which can delay it and cause higher RTT. (In your 'sh proc cpu' output, the second number in '16%/12%' is the CPU time spent at interrupt level, which is where software-switched traffic typically shows up.)

Please refer to the following documentation for more information about troubleshooting high CPU usage on 6500 devices.

Catalyst 6500/6000 Switch High CPU Utilization

http://www.cisco.com/en/US/products/hw/switches/ps708/products_tech_note09186a00804916e0.shtml

I would also recommend checking the following documentation.

Troubleshooting with a NETDR capture on a sup720/6500

https://supportforums.cisco.com/docs/DOC-15608

If you're not seeing the ping packets you're sending in the netdr capture (see the previous link; netdr captures packets seen or forwarded by the CPU), it's likely not a high CPU issue: the traffic is being forwarded in hardware rather than software, which is what we want.

I'm a bit confused about your following statement:

The next hop from the colo to the hosts connected to rtr-core-a and rtr-core-b is rtr-core-a.  From there, the next hop to reach a host in the 172.16.3.0/24 subnet will be the active router for that subnet, rtr-core-b.

Does this mean that packets are sent to rtr-core-a, which then sends them to rtr-core-b, which forwards them to the destination? Based on your topology, this might trigger ICMP redirects, which will indeed cause traffic to be software switched and add delay. Could you please try configuring 'no ip redirects' on the Vlan interfaces of the routers to make sure ICMP redirects are not sent out? A minimal example is sketched below.
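For example, something like this on each router (Vlan3 and Vlan33 here are just the SVIs from your earlier outputs; apply it to whichever Vlan interfaces are in the path):

conf t
 interface Vlan3
  no ip redirects
 interface Vlan33
  no ip redirects
 end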

Please refer to the following documentation for more information about ICMP redirects:

When Are ICMP Redirects Sent?

http://www.cisco.com/en/US/tech/tk365/technologies_tech_note09186a0080094702.shtml

If you cannot determine the root cause, I would suggest opening a TAC case so an engineer can assist you in investigating this further.

Best regards,

Andras

The next hop from the colo to the hosts connected to rtr-core-a and rtr-core-b is rtr-core-a.  From there, the next hop to reach a host in the 172.16.3.0/24 subnet will be the active router for that subnet, rtr-core-b.

That was bad typing.  The path is: source host -> colo-l3-a -> rtr-core-a -> rtr-core-b -> dest host

thanks for the links!

John

Hi John,

Thanks for clarifying. Just to confirm, make sure you're pinging between servers and not from or to the Cisco devices; they might have rate-limiters enabled, and if their CPU (control plane) is busy calculating something, ping RTT values can be higher.
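If you want to see whether any hardware rate-limiters are configured on the Sup720, for example:

sh mls rate-limit

This lists the rate-limiters and their current settings.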

You might also try disabling firewalls during the troubleshooting, as certain firewall applications on the servers might prioritize ping packets lower than other traffic, so you might see higher RTT values as well.

Let me know how it goes and if you need any assistance with analyzing additional outputs.

By the way, if you have a contract to open a TAC Service Request, you can convert this discussion thread to a case so the TAC engineer will already know what has been checked so far.

Best regards,

Andras

Thanks Andras,

I have a question.  If a device in vlan 3 is sending traffic to another vlan that is present on that switch, shouldn't that traffic be forwarded without going through the CPU?  Isn't part of the advantage of layer 3 switching using the switch fabric instead of the CPU to forward packets?  I see a lot of inband traffic going between two servers.  Is that to be expected?

Again, thanks for all the great info, advice, and help

john

Hi John,

On the Catalyst switches, packet forwarding should normally be done in hardware, regardless of L2 or L3 forwarding. When traffic is software switched or packets are punted to the CPU for further processing, it is usually because some entries (ARP or IP route) are missing, a feature that does not support hardware forwarding (such as ACL logging) is in use, the TTL is expiring, or an ICMP redirect has to be sent. Software switched traffic should not normally be seen on Catalyst switches. The reasons for CPU punts are discussed in the documentation I included earlier, in the Packets and Conditions That Require Special Processing section.
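If you want to check that a particular flow is programmed for hardware forwarding on the Sup720, you can look up its hardware FIB entry, for example (the addresses below are just placeholders for your actual source and destination IPs):

sh mls cef exact-route 172.16.2.10 172.16.3.20

If that returns a valid adjacency and egress interface, the flow should be switched in hardware.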

You can see the inband rate with the 'sh ibc' command. With a netdr capture, you can check which packets are actually punted to the CPU, either to support some feature (usually to send back ICMP messages) or to software switch traffic.

While you're sending the pings via the switch, are you seeing those ICMP packets appearing in the netdr capture? If so, please attach the output of sh netdr cap so that I can investigate further. You might try sending ping packets faster to make sure you're indeed seeing them in the netdr capture.

You can use the below commands to check which packets are seen by the CPU:

debug netdr clear

debug netdr capture rx

[wait for 10 seconds]

undeb all

term len 0

sh netdr cap

Best regards,

Andras
