01-15-2010 02:17 AM - edited 03-04-2019 07:12 AM
Hello
Please give me some advice on troubleshooting.
We have several sites connected through one ISP via L3 MPLS VPNs. There is no routing protocol between our routers and the ISP routers; we have point-to-point GRE tunnels from each site to every other site, with OSPF running inside them. One site uses only static routing inside the GRE.
Now we have the following strange situation:
Ping from the site1 router to its local ISP router is clean. Ping from site1 to the remote ISP router is also clean. Ping from the site1 router to the site2 router is not clean; we are getting 5% drops. Ping from the site2 router to its local ISP router is also clean.
I have no clue how to deal with it. It seems our routers are dropping ICMP, but the links are not saturated, there are no rules limiting ICMP, and the CPU load is about 5-7%. Drops appear both when packets travel inside the tunnel and outside it.
ISP says that it can successfully ping our interfaces from any point of their network.
We have 3845 routers at our sites; the IOS versions differ: 12.4(7d) advipservices and 12.4(24)T1 advipservices.
Traceroutes between these sites are identical. We use NM-16ESW module interfaces for these WAN links.
interface configuration:
site 1
interface FastEthernet2/7
no switchport
ip address x.x.x.x x.x.x.x
ip flow ingress
load-interval 30
duplex full
speed 10
no cdp enable
end
interface Tunnel266
bandwidth 2048
ip unnumbered Loopback0
ip mtu 1476
ip flow ingress
ip tcp adjust-mss 1436
load-interval 30
qos pre-classify
keepalive 2 3
cdp enable
tunnel source FastEthernet2/7
tunnel destination y.y.y.y
site 2
interface FastEthernet2/0
no switchport
ip address y.y.y.y y.y.y.y
ip flow ingress
ip flow egress
duplex full
speed 10
no cdp enable
interface Tunnel259
ip unnumbered Loopback0
ip mtu 1476
ip flow ingress
ip tcp adjust-mss 1436
load-interval 30
qos pre-classify
keepalive 2 3
cdp enable
tunnel source FastEthernet2/0
tunnel destination x.x.x.x
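The 1476/1436 values in both tunnel configs match the standard GRE overhead, so the drops are unlikely to be a plain MTU mismatch. A quick sanity check, assuming a 1500-byte MTU on the underlying Ethernet link (the target address is a placeholder for the remote tunnel-side address):

```
! GRE over IP adds 24 bytes of overhead: 20 (outer IP) + 4 (GRE),
! so tunnel MTU = 1500 - 24 = 1476 bytes,
! and TCP MSS = 1476 - 20 (IP) - 20 (TCP) = 1436 bytes.
site1#show ip interface Tunnel266 | include MTU
! A 1476-byte ping with DF set should pass through the tunnel;
! 1477 bytes should trigger "fragmentation needed":
site1#ping <remote-tunnel-address> size 1476 df-bit
```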
Maybe someone has had the same experience. Are there any ideas on how to troubleshoot it?
Thanks
01-16-2010 09:52 PM
When you try to ping the remote site and are getting the 5% drops, use a sniffer to capture the sent and received packets. You may capture some other ICMP packets that tell you what happened (destination unreachable, port unreachable, fragmentation needed, ...). These messages, if present, may not be visible from Windows.
Hope this helps.
01-17-2010 07:27 PM
I am pinging from my border routers; there are no destination unreachable, port unreachable, or fragmentation needed ICMP messages. Just drops.
It is very strange that the ping to the ISP interfaces is perfectly clean.
If I am not mistaken, I cannot mirror a WAN interface on my router (only a switch supports port mirroring), so I'm not sure how to sniff these links without breaking the channel (it is a production network).
01-18-2010 02:07 AM
Hi,
if your IOS supports the feature, you could try capturing the packets on the router using Cisco IOS Embedded Packet Capture (EPC).
See http://www.cisco.com/en/US/products/ps9913/products_ios_protocol_group_home.html
for details.
You could even export the captured data in PCAP format suitable for analysis using an external tool such as Wireshark.
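A minimal EPC sequence might look like this (a sketch: EPC requires 12.4(20)T or later, so of the two IOS versions mentioned only the 12.4(24)T1 router would support it; the buffer/capture-point names and the TFTP server address are made up):

```
! Define a circular capture buffer and a capture point on the WAN port
monitor capture buffer CAPBUF size 512 max-size 1024 circular
monitor capture point ip cef CAPPT FastEthernet2/7 both
monitor capture point associate CAPPT CAPBUF
monitor capture point start CAPPT
! ... reproduce the ping loss, then stop and export for Wireshark:
monitor capture point stop CAPPT
monitor capture buffer CAPBUF export tftp://192.0.2.1/drops.pcap
```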
BR,
Mikan
01-17-2010 08:52 PM
Here is the partial output of the "show ip cache flow" from the sites
site1:
sh ip cache flow
IP packet size distribution (7731M total packets):
1-32 64 96 128 160 192 224 256 288 320 352 384 416 448 480
.004 .314 .275 .050 .043 .022 .040 .016 .026 .006 .004 .004 .006 .007 .004
512 544 576 1024 1536 2048 2560 3072 3584 4096 4608
.003 .003 .019 .029 .115 .000 .000 .000 .000 .000 .000
IP Flow Switching Cache, 278544 bytes
1094 active, 3002 inactive, 482041572 added
2564275832 ager polls, 0 flow alloc failures
Active flows timeout in 30 minutes
Inactive flows timeout in 15 seconds
IP Sub Flow Cache, 66824 bytes
1094 active, 954 inactive, 482037958 added, 482037958 added to flow
0 alloc failures, 217 force free
2 chunks, 212 chunks added
last clearing of statistics never
Protocol Total Flows Packets Bytes Packets Active(Sec) Idle(Sec)
-------- Flows /Sec /Flow /Pkt /Sec /Flow /Flow
TCP-Telnet 48583 0.0 11 83 0.1 1.4 2.2
TCP-FTP 22323 0.0 3 45 0.0 9.8 14.7
TCP-FTPD 192 0.0 26997 114 1.2 162.6 2.1
TCP-WWW 32696526 7.6 5 355 40.6 0.6 2.1
TCP-SMTP 1355864 0.3 157 751 49.5 5.0 2.9
TCP-X 237 0.0 158 411 0.0 16.1 7.6
TCP-BGP 1513 0.0 8 57 0.0 9.1 9.1
TCP-NNTP 13 0.0 1 46 0.0 0.3 9.4
TCP-Frag 55330 0.0 38 44 0.5 6.9 15.5
TCP-other 200008374 46.5 12 461 580.8 2.6 6.8
UDP-DNS 61554309 14.3 1 72 14.5 0.1 15.5
UDP-NTP 6153186 1.4 1 78 1.4 0.0 15.5
UDP-TFTP 2608 0.0 1 71 0.0 0.0 15.5
UDP-Frag 222287 0.0 4 65 0.2 11.9 15.5
UDP-other 112130203 26.1 12 163 338.7 2.9 15.5
ICMP 67057733 15.6 4 65 69.7 5.9 15.4
IPINIP 53500 0.0 1245 160 15.5 51.2 15.1
GRE 661234 0.1 4373 190 673.3 276.4 13.5
IP-other 16463 0.0 3479 82 13.3 1677.1 3.1
Total: 482040478 112.2 16 284 1799.9 3.1 10.9
site2:
sh ip cache flow
IP packet size distribution (20846M total packets):
1-32 64 96 128 160 192 224 256 288 320 352 384 416 448 480
.005 .175 .301 .105 .031 .033 .049 .032 .015 .013 .010 .006 .008 .005 .004
512 544 576 1024 1536 2048 2560 3072 3584 4096 4608
.003 .003 .003 .036 .154 .000 .000 .000 .000 .000 .000
IP Flow Switching Cache, 278544 bytes
376 active, 3720 inactive, 344235011 added
1313258552 ager polls, 0 flow alloc failures
Active flows timeout in 30 minutes
Inactive flows timeout in 15 seconds
IP Sub Flow Cache, 42120 bytes
374 active, 1674 inactive, 344100314 added, 344100301 added to flow
0 alloc failures, 0 force free
2 chunks, 286 chunks added
last clearing of statistics never
Protocol Total Flows Packets Bytes Packets Active(Sec) Idle(Sec)
-------- Flows /Sec /Flow /Pkt /Sec /Flow /Flow
TCP-Telnet 11912 0.0 57 41 0.1 4.6 9.1
TCP-FTP 2851 0.0 4 61 0.0 1.7 6.8
TCP-FTPD 226 0.0 7679 1308 0.4 134.4 2.5
TCP-WWW 30882552 7.1 5 211 40.8 0.5 1.8
TCP-SMTP 624168 0.1 185 1204 27.0 4.6 4.8
TCP-X 212 0.0 175 152 0.0 17.1 6.1
TCP-BGP 4 0.0 1 40 0.0 0.0 1.4
TCP-NNTP 9 0.0 1 45 0.0 1.3 4.6
TCP-Frag 104 0.0 3 1012 0.0 4.1 12.2
TCP-other 141375905 32.9 12 440 395.2 2.5 4.7
UDP-DNS 4549509 1.0 22 72 23.3 15.2 15.4
UDP-NTP 1938294 0.4 1 79 0.5 0.0 15.5
UDP-Frag 155488 0.0 2 785 0.0 3.0 15.4
UDP-other 132236086 30.7 7 110 243.8 2.1 15.4
ICMP 30017601 6.9 10 60 74.9 8.3 15.4
IPINIP 79580 0.0 1044 922 19.3 35.6 15.2
GRE 2351269 0.5 7357 359 4027.7 274.3 13.4
IP-other 159 0.0 28 955 0.0 5.9 15.5
Total: 344225929 80.1 60 353 4853.5 4.7 9.8
I don't know whether any of this is abnormal.
I am still getting drops, not only between these two sites but between others too...
01-18-2010 01:10 AM
Hi,
does your provider allow traceroute through the network?
If yes, I'd try it to see if the backbone is not dropping your packets.
You say you are losing 5% of your Pings.
Do you mean a default Ping with 100Bytes packet size?
Have you tested larger packets with Don't Fragment bit set to 1, e.g.?
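For example, from the site1 router (the target is whatever site2 address you have been pinging; 1000 repeats give a steadier loss percentage than the default 5):

```
site1#ping <site2-address> size 1500 df-bit repeat 1000
```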
BR,
Milan
01-18-2010 01:29 AM
I'm not able to see inner provider structure, traceroute shows only border routers.
Yes, I do mean the default 100byte packets.
The same happens with larger packets: 1500-byte pings with the DF bit set go through with the same average 5% drops.
01-18-2010 02:19 AM
Hi,
as you say in your original post:
"Ping from the site1 router to its local ISP router is clean. Ping from site1 to the remote ISP router is also clean. Ping from the site1 router to the site2 router is not clean; we are getting 5% drops. Ping from the site2 router to its local ISP router is also clean."
What about a Ping from site1 router to the local ISP router on site2?
Without using any tunnel, if possible?
If it's not clean, I'd start blaming the provider for something wrong in his backbone.
BR,
Milan
01-18-2010 02:58 AM
Ping from the site1 router to the local ISP router on site2 is clean, without using the tunnel. Moreover, the provider's engineers say they can ping our interfaces from any point of their network without any loss.
I will try to inspect traffic with Embedded Packet Capture. I did not know about this feature, thank you for the advice.
It's hard to blame the provider when they are sending their ideal statistics, though...
01-18-2010 03:19 AM
Hi,
well, I suggested blaming the provider only if the ping from the site1 router to the local ISP router on site2 was not clean.
Unfortunately, that's not the case...
But it's really weird that the ping to the ISP router on site2 is clean while just one hop further (your router on site2) it is not.
And the ping between your router on site2 and the ISP router on site2 is clean again, so it's not a circuit problem.
What about Pings between PCs on site1 and site2?
Losing also 5%?
BR,
Milan
01-18-2010 04:02 AM
Yes, drops between the PCs are also 2-5% on average.
01-19-2010 01:28 AM
Well,
it seems your routers really are causing the trouble :-(
I'd guess some kind of overload, but it's difficult to say without knowing all the details.
And you said the CPU was running at only 5-7%... Does sh proc cpu history confirm that?
How many tunnels are configured on each router?
I'd try to check if the routers are really using CEF (sh cef not-cef-switched).
And I'd also try to simplify the config on one router pair as much as possible (removing NetFlow, reducing the number of tunnels, using static routing only if possible, etc.) and observe whether that helps.
This might lead you to the problem cause.
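As a sketch of that simplification on the site1 tunnel from the configs above (remove the optional features one at a time and watch whether the loss changes):

```
interface Tunnel266
 no ip flow ingress
 no qos pre-classify
 no keepalive
```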
Good luck,
Milan
01-19-2010 02:04 AM
I am really confused by this situation.
Here is the output of some show commands
site1#sh proc cpu | ex 0.00
CPU utilization for five seconds: 10%/7%; one minute: 11%; five minutes: 12%
site2#sh proc cpu | ex 0.00
CPU utilization for five seconds: 7%/3%; one minute: 5%; five minutes: 5%
There are 2 to 6 GRE tunnels per site. Each starts from the same interface; only the destinations differ. The interfaces are 10 Mbps, average port utilization is 40-70%, and there are no output drops on the interfaces.
site1:
site1#sh ip cef switching statistics
Reason Drop Punt Punt2Host
RP LES No route 1040159 0 310687
RP LES Packet destined for us 0 208257377 34579
RP LES Encapsulation resource 0 146498617 0
RP LES No adjacency 2 0 0
RP LES Incomplete adjacency 7622 0 0
RP LES Unresolved route 329 0 0
RP LES Bad checksum 66 0 0
RP LES TTL expired 0 0 103549588
RP LES IP options set 0 0 3250
RP LES Fragmentation failed 12795 0 26454
RP LES Unclassified reason 843 0 0
RP LES Neighbor resolution req 36 0 0
RP LES Total 1061852 354755994 103924558
All Total 1061852 354755994 103924558
site1#sh cef not-cef-switched
% Command accepted but obsolete, see 'show (ip|ipv6) cef switching statistics [feature]'
IPv4 CEF Packets passed on to next switching layer
Slot No_adj No_encap Unsupp'ted Redirect Receive Options Access Frag
RP 0 0 250074545 0 208289900 3250 0 26454
Unsupp'ted and Receive counters are increasing here
site2:
site2#sh cef not-cef-switched
CEF Packets passed on to next switching layer
Slot No_adj No_encap Unsupp'ted Redirect Receive Options Access Frag
RP 3661 0 0 2155804 3482468521 0 0 0
Receive counter is increasing here
site3:
site3k#sh cef not-cef-switched
CEF Packets passed on to next switching layer
Slot No_adj No_encap Unsupp'ted Redirect Receive Options Access Frag
RP 16724 0 0 0 2819093294 0 0 0
Receive counter is increasing here
site4:
site4#sh cef not-cef-switched
% Command accepted but obsolete, see 'show (ip|ipv6) cef switching statistics [feature]'
IPv4 CEF Packets passed on to next switching layer
Slot No_adj No_encap Unsupp'ted Redirect Receive Options Access Frag
RP 1 0 731499062 0 577875181 110 0 423885
Unsupp'ted and Receive counters are increasing here
site4#sh ip cef switching statistics
Reason Drop Punt Punt2Host
RP LES No route 199509 0 86431
RP LES Packet destined for us 2 574316518 3559493
RP LES Encapsulation resource 0 81119928 0
RP LES No adjacency 112696 0 1
RP LES Incomplete adjacency 56753 0 2
RP LES Unresolved route 27 0 0
RP LES Unsupported 0 541260 0
RP LES Bad checksum 14 0 0
RP LES TTL expired 0 0 649416183
RP LES IP options set 0 0 110
RP LES Fragmentation failed 310402 0 423888
RP LES Routed to Null0 1062030 0 2077024
RP LES Unclassified reason 41790 0 0
RP LES Neighbor resolution req 225981 35 0
RP LES Total 2009204 655977741 655563132
Looks awful. :)
I will read up on this command. I will also try to simplify the config as much as possible, though that is hard on a production network.
01-20-2010 02:50 AM
I have figured out that an extended ping with the "record" option is 100% successful between the sites. It looks like some CEF-related problem, as you mentioned.
According to the docs, such packets are process-switched, while simple ICMP packets are CEF-switched. I still don't know how to fix it.
Disabling CEF globally or per interface does not seem like a good idea; maybe you can give me some advice, Milan?
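For reference, the per-interface switching-path counters seem to back this up; something like the following shows how many packets take the process path versus the fast/CEF path (counter layout varies by IOS version):

```
site1#show interfaces Tunnel266 stats
site1#show ip interface Tunnel266 | include switching
```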
01-20-2010 07:15 AM
Hi,
I'm really guessing here, but:
Why do you need so many GRE tunnels?
Couldn't several GRE tunnels using one Loopback as the source address cause CEF trouble?
Looking at the CEF statistics you provided, I have a bad feeling about the high Unsupported counter.
I tried to find out what the unsupported features could be: NAT, policy-based routing, and accounting were the examples I found.
So possibly NetFlow? Have you tried removing it from your GRE tunnels?
BR,
Milan