01-15-2010 02:17 AM - edited 03-04-2019 07:12 AM
Hello
Please give me some advice on troubleshooting.
We have several sites connected through one ISP via L3 MPLS VPNs. There is no routing protocol between our routers and the ISP routers; we have point-to-point GRE tunnels from each site to every other site, with OSPF running inside them. One site uses only static routing inside the GRE.
Now we have the following strange situation:
Ping from the site1 router to its local ISP router is clean. Ping from site1 to the remote ISP router is also clean. Ping from the site1 router to the site2 router is not clean; we are getting 5% drops. Ping from the site2 router to its local ISP router is also clean.
I have no clue how to deal with it. It seems our routers are dropping ICMP, but the links are not saturated, there are no rules limiting ICMP, and the CPU load is about 5-7%. Drops appear both when packets travel inside the tunnel and outside it.
ISP says that it can successfully ping our interfaces from any point of their network.
We have 3845 routers at our sites; the IOS versions differ: 12.4(7d) advipservices and 12.4(24)T1 advipservices.
Traceroutes between these sites are identical. We use NM-16ESW module interfaces for these WAN links.
interface configuration:
site 1
interface FastEthernet2/7
no switchport
ip address x.x.x.x x.x.x.x
ip flow ingress
load-interval 30
duplex full
speed 10
no cdp enable
end
interface Tunnel266
bandwidth 2048
ip unnumbered Loopback0
ip mtu 1476
ip flow ingress
ip tcp adjust-mss 1436
load-interval 30
qos pre-classify
keepalive 2 3
cdp enable
tunnel source FastEthernet2/7
tunnel destination y.y.y.y
site 2
interface FastEthernet2/0
no switchport
ip address y.y.y.y y.y.y.y
ip flow ingress
ip flow egress
duplex full
speed 10
no cdp enable
interface Tunnel259
ip unnumbered Loopback0
ip mtu 1476
ip flow ingress
ip tcp adjust-mss 1436
load-interval 30
qos pre-classify
keepalive 2 3
cdp enable
tunnel source FastEthernet2/0
tunnel destination x.x.x.x
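The 1476/1436 values in both tunnel configs match the standard GRE overhead, so the drops are unlikely to be a plain MTU mismatch. A quick sanity check, assuming a 1500-byte MTU on the underlying Ethernet link (the target address is a placeholder for the remote tunnel-side address):

```
! GRE over IP adds 24 bytes of overhead: 20 (outer IP) + 4 (GRE),
! so tunnel MTU = 1500 - 24 = 1476 bytes,
! and TCP MSS = 1476 - 20 (IP) - 20 (TCP) = 1436 bytes.
site1#show ip interface Tunnel266 | include MTU
! A 1476-byte ping with DF set should pass through the tunnel;
! 1477 bytes should trigger "fragmentation needed":
site1#ping <remote-tunnel-address> size 1476 df-bit
```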
Maybe someone has had the same experience. Are there any ideas on how to troubleshoot it?
Thanks
01-16-2010 09:52 PM
When you try to ping the remote site and are getting the 5% drops, use a sniffer to capture the sent and received packets. You may capture some other ICMP packets that tell you what happened (destination unreachable, port unreachable, fragmentation needed, ...). These messages, if present, may not be visible from Windows.
Hope this helps.
01-17-2010 07:27 PM
I am pinging from my border routers; there are no destination unreachable, port unreachable, or fragmentation needed ICMP messages. Just drops.
It is very strange that the ping to the ISP interfaces is perfectly clean.
If I am not mistaken, I cannot mirror a WAN interface on my router (only a switch supports port mirroring), so I'm not sure how to sniff these links without breaking the channel (it is a production network).
01-18-2010 02:07 AM
Hi,
if your IOS supports the feature, you could try capturing the packets on the router using Cisco IOS Embedded Packet Capture (EPC).
See http://www.cisco.com/en/US/products/ps9913/products_ios_protocol_group_home.html
for details.
You could even export the captured data in PCAP format suitable for analysis using an external tool such as Wireshark.
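A minimal EPC sequence might look like this (a sketch: EPC requires 12.4(20)T or later, so of the two IOS versions mentioned only the 12.4(24)T1 router would support it; the buffer/capture-point names and the TFTP server address are made up):

```
! Define a circular capture buffer and a capture point on the WAN port
monitor capture buffer CAPBUF size 512 max-size 1024 circular
monitor capture point ip cef CAPPT FastEthernet2/7 both
monitor capture point associate CAPPT CAPBUF
monitor capture point start CAPPT
! ... reproduce the ping loss, then stop and export for Wireshark:
monitor capture point stop CAPPT
monitor capture buffer CAPBUF export tftp://192.0.2.1/drops.pcap
```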
BR,
Mikan
01-17-2010 08:52 PM
Here is the partial output of the "show ip cache flow" from the sites
site1:
sh ip cache flow
IP packet size distribution (7731M total packets):
1-32 64 96 128 160 192 224 256 288 320 352 384 416 448 480
.004 .314 .275 .050 .043 .022 .040 .016 .026 .006 .004 .004 .006 .007 .004
512 544 576 1024 1536 2048 2560 3072 3584 4096 4608
.003 .003 .019 .029 .115 .000 .000 .000 .000 .000 .000
IP Flow Switching Cache, 278544 bytes
1094 active, 3002 inactive, 482041572 added
2564275832 ager polls, 0 flow alloc failures
Active flows timeout in 30 minutes
Inactive flows timeout in 15 seconds
IP Sub Flow Cache, 66824 bytes
1094 active, 954 inactive, 482037958 added, 482037958 added to flow
0 alloc failures, 217 force free
2 chunks, 212 chunks added
last clearing of statistics never
Protocol Total Flows Packets Bytes Packets Active(Sec) Idle(Sec)
-------- Flows /Sec /Flow /Pkt /Sec /Flow /Flow
TCP-Telnet 48583 0.0 11 83 0.1 1.4 2.2
TCP-FTP 22323 0.0 3 45 0.0 9.8 14.7
TCP-FTPD 192 0.0 26997 114 1.2 162.6 2.1
TCP-WWW 32696526 7.6 5 355 40.6 0.6 2.1
TCP-SMTP 1355864 0.3 157 751 49.5 5.0 2.9
TCP-X 237 0.0 158 411 0.0 16.1 7.6
TCP-BGP 1513 0.0 8 57 0.0 9.1 9.1
TCP-NNTP 13 0.0 1 46 0.0 0.3 9.4
TCP-Frag 55330 0.0 38 44 0.5 6.9 15.5
TCP-other 200008374 46.5 12 461 580.8 2.6 6.8
UDP-DNS 61554309 14.3 1 72 14.5 0.1 15.5
UDP-NTP 6153186 1.4 1 78 1.4 0.0 15.5
UDP-TFTP 2608 0.0 1 71 0.0 0.0 15.5
UDP-Frag 222287 0.0 4 65 0.2 11.9 15.5
UDP-other 112130203 26.1 12 163 338.7 2.9 15.5
ICMP 67057733 15.6 4 65 69.7 5.9 15.4
IPINIP 53500 0.0 1245 160 15.5 51.2 15.1
GRE 661234 0.1 4373 190 673.3 276.4 13.5
IP-other 16463 0.0 3479 82 13.3 1677.1 3.1
Total: 482040478 112.2 16 284 1799.9 3.1 10.9
site2:
sh ip cache flow
IP packet size distribution (20846M total packets):
1-32 64 96 128 160 192 224 256 288 320 352 384 416 448 480
.005 .175 .301 .105 .031 .033 .049 .032 .015 .013 .010 .006 .008 .005 .004
512 544 576 1024 1536 2048 2560 3072 3584 4096 4608
.003 .003 .003 .036 .154 .000 .000 .000 .000 .000 .000
IP Flow Switching Cache, 278544 bytes
376 active, 3720 inactive, 344235011 added
1313258552 ager polls, 0 flow alloc failures
Active flows timeout in 30 minutes
Inactive flows timeout in 15 seconds
IP Sub Flow Cache, 42120 bytes
374 active, 1674 inactive, 344100314 added, 344100301 added to flow
0 alloc failures, 0 force free
2 chunks, 286 chunks added
last clearing of statistics never
Protocol Total Flows Packets Bytes Packets Active(Sec) Idle(Sec)
-------- Flows /Sec /Flow /Pkt /Sec /Flow /Flow
TCP-Telnet 11912 0.0 57 41 0.1 4.6 9.1
TCP-FTP 2851 0.0 4 61 0.0 1.7 6.8
TCP-FTPD 226 0.0 7679 1308 0.4 134.4 2.5
TCP-WWW 30882552 7.1 5 211 40.8 0.5 1.8
TCP-SMTP 624168 0.1 185 1204 27.0 4.6 4.8
TCP-X 212 0.0 175 152 0.0 17.1 6.1
TCP-BGP 4 0.0 1 40 0.0 0.0 1.4
TCP-NNTP 9 0.0 1 45 0.0 1.3 4.6
TCP-Frag 104 0.0 3 1012 0.0 4.1 12.2
TCP-other 141375905 32.9 12 440 395.2 2.5 4.7
UDP-DNS 4549509 1.0 22 72 23.3 15.2 15.4
UDP-NTP 1938294 0.4 1 79 0.5 0.0 15.5
UDP-Frag 155488 0.0 2 785 0.0 3.0 15.4
UDP-other 132236086 30.7 7 110 243.8 2.1 15.4
ICMP 30017601 6.9 10 60 74.9 8.3 15.4
IPINIP 79580 0.0 1044 922 19.3 35.6 15.2
GRE 2351269 0.5 7357 359 4027.7 274.3 13.4
IP-other 159 0.0 28 955 0.0 5.9 15.5
Total: 344225929 80.1 60 353 4853.5 4.7 9.8
I don't know whether any of this is abnormal.
I am still getting drops, not only between these two sites but between others too...
01-18-2010 01:10 AM
Hi,
does your provider allow traceroute through the network?
If yes, I'd try it to see if the backbone is not dropping your packets.
You say you are losing 5% of your Pings.
Do you mean a default Ping with 100Bytes packet size?
Have you tested larger packets with Don't Fragment bit set to 1, e.g.?
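For example, from the site1 router (the target is whatever site2 address you have been pinging; 1000 repeats give a steadier loss percentage than the default 5):

```
site1#ping <site2-address> size 1500 df-bit repeat 1000
```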
BR,
Milan
01-18-2010 01:29 AM
I'm not able to see inner provider structure, traceroute shows only border routers.
Yes, I do mean the default 100byte packets.
The same happens with larger packets: 1500-byte pings with the DF bit set go through with the same average 5% drops.
01-18-2010 02:19 AM
Hi,
as you say in your original post:
"Ping from the site1 router to its local ISP router is clean. Ping from site1 to the remote ISP router is also clean. Ping from the site1 router to the site2 router is not clean; we are getting 5% drops. Ping from the site2 router to its local ISP router is also clean."
What about a Ping from site1 router to the local ISP router on site2?
Without using any tunnel, if possible?
If it's not clean, I'd start blaming the provider for something wrong in his backbone.
BR,
Milan
01-18-2010 02:58 AM
Ping from the site1 router to the local ISP router on site2 is clean, without using the tunnel. Moreover, the provider's engineers say they can ping our interfaces from any point of their network without any loss.
I will try to inspect traffic with Embedded Packet Capture. I did not know about this feature, thank you for the advice.
It's hard to blame the provider when they are sending their ideal statistics, though...
01-18-2010 03:19 AM
Hi,
well, I suggested blaming the provider only if the ping from the site1 router to the local ISP router on site2 was not clean.
Unfortunately, that's not the case...
But it's really weird that the ping to the ISP router on site2 is clean while just one hop further (your router on site2) it is not.
And the ping between your router on site2 and the ISP router on site2 is clean again, so it's not a circuit problem.
What about Pings between PCs on site1 and site2?
Losing also 5%?
BR,
Milan
01-18-2010 04:02 AM
Yes, drops between the PCs are also 2-5% on average.
01-19-2010 01:28 AM
Well,
it seems your routers really are causing the trouble :-(
I'd guess some kind of overload, but it's difficult to say without knowing all the details.
And you said the CPU was running at only 5-7%... Does sh proc cpu history confirm that?
How many tunnels are configured on each router?
I'd try to check if the routers are really using CEF (sh cef not-cef-switched).
And I'd also try to simplify the config on one router pair as much as possible (removing NetFlow, reducing the number of tunnels, using static routing only if possible, etc.) and observe whether that helps.
This might lead you to the problem cause.
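As a sketch of that simplification on the site1 tunnel from the configs above (remove the optional features one at a time and watch whether the loss changes):

```
interface Tunnel266
 no ip flow ingress
 no qos pre-classify
 no keepalive
```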
Good luck,
Milan
01-19-2010 02:04 AM
I am really confused by this situation.
Here is the output of some show commands
site1#sh proc cpu | ex 0.00
CPU utilization for five seconds: 10%/7%; one minute: 11%; five minutes: 12%
site2#sh proc cpu | ex 0.00
CPU utilization for five seconds: 7%/3%; one minute: 5%; five minutes: 5%
There are 2 to 6 GRE tunnels per site. Each starts from the same interface; only the destinations differ. The interfaces are 10 Mbps, average port utilization is 40-70%, and there are no output drops on the interfaces.
site1:
site1#sh ip cef switching statistics
Reason Drop Punt Punt2Host
RP LES No route 1040159 0 310687
RP LES Packet destined for us 0 208257377 34579
RP LES Encapsulation resource 0 146498617 0
RP LES No adjacency 2 0 0
RP LES Incomplete adjacency 7622 0 0
RP LES Unresolved route 329 0 0
RP LES Bad checksum 66 0 0
RP LES TTL expired 0 0 103549588
RP LES IP options set 0 0 3250
RP LES Fragmentation failed 12795 0 26454
RP LES Unclassified reason 843 0 0
RP LES Neighbor resolution req 36 0 0
RP LES Total 1061852 354755994 103924558
All Total 1061852 354755994 103924558
site1#sh cef not-cef-switched
% Command accepted but obsolete, see 'show (ip|ipv6) cef switching statistics [feature]'
IPv4 CEF Packets passed on to next switching layer
Slot No_adj No_encap Unsupp'ted Redirect Receive Options Access Frag
RP 0 0 250074545 0 208289900 3250 0 26454
Unsupp'ted and Receive counters are increasing here
site2:
site2#sh cef not-cef-switched
CEF Packets passed on to next switching layer
Slot No_adj No_encap Unsupp'ted Redirect Receive Options Access Frag
RP 3661 0 0 2155804 3482468521 0 0 0
Receive counter is increasing here
site3:
site3k#sh cef not-cef-switched
CEF Packets passed on to next switching layer
Slot No_adj No_encap Unsupp'ted Redirect Receive Options Access Frag
RP 16724 0 0 0 2819093294 0 0 0
Receive counter is increasing here
site4:
site4#sh cef not-cef-switched
% Command accepted but obsolete, see 'show (ip|ipv6) cef switching statistics [feature]'
IPv4 CEF Packets passed on to next switching layer
Slot No_adj No_encap Unsupp'ted Redirect Receive Options Access Frag
RP 1 0 731499062 0 577875181 110 0 423885
Unsupp'ted and Receive counters are increasing here
site4#sh ip cef switching statistics
Reason Drop Punt Punt2Host
RP LES No route 199509 0 86431
RP LES Packet destined for us 2 574316518 3559493
RP LES Encapsulation resource 0 81119928 0
RP LES No adjacency 112696 0 1
RP LES Incomplete adjacency 56753 0 2
RP LES Unresolved route 27 0 0
RP LES Unsupported 0 541260 0
RP LES Bad checksum 14 0 0
RP LES TTL expired 0 0 649416183
RP LES IP options set 0 0 110
RP LES Fragmentation failed 310402 0 423888
RP LES Routed to Null0 1062030 0 2077024
RP LES Unclassified reason 41790 0 0
RP LES Neighbor resolution req 225981 35 0
RP LES Total 2009204 655977741 655563132
Looks awful. :)
I will read up on this command. I will also try to simplify the config as much as possible, though that is hard on a production network.
01-20-2010 02:50 AM
I have figured out that an extended ping with the "record" option is 100% successful between the sites. It looks like some CEF-related problem, as you mentioned.
According to the docs, such packets are process-switched, while simple ICMP packets are CEF-switched. I still don't know how to fix it.
Disabling CEF globally or per interface does not seem like a good idea; maybe you can give me some advice, Milan?
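For reference, the per-interface switching-path counters seem to back this up; something like the following shows how many packets take the process path versus the fast/CEF path (counter layout varies by IOS version):

```
site1#show interfaces Tunnel266 stats
site1#show ip interface Tunnel266 | include switching
```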
01-20-2010 07:15 AM
Hi,
I'm really guessing here, but:
Why do you need so many GRE tunnels?
Couldn't several GRE tunnels using one Loopback as the source address cause CEF trouble?
Looking at the CEF statistics you provided, I have a bad feeling about the high Unsupported counter.
I tried to find out what the unsupported features could be: NAT, policy-based routing, and accounting were the examples I found.
So possibly NetFlow? Have you tried removing it from your GRE tunnels?
BR,
Milan