Dropping packets and high latency across the WAN

zrunner626 · ‎01-06-2015

Good morning,

Over the past week or two we have started randomly seeing periods with many dropped packets and high latency across the WAN. Sometimes it lasts 5-30 minutes, sometimes hours. Of course the service provider has tested (during working times) and doesn't see the issue on their side. Any ideas what could be the issue on my router's WAN interface? Reviewing the service provider's usage charts, this does not appear to be happening during times of heavy network utilization; often the traffic is around 10-20% of the available bandwidth. CPU utilization is 10-20% and my router has not rebooted, but the logs show the tunnel across the WAN going up and down.

Here are some baseline pings/traceroutes taken while my network is stable. Should I be concerned that the last reply from my router shows an increased response time and that some of the traceroute and ping packets are being dropped? Does it indicate a problem that is being exacerbated when another variable is introduced?

Thanks for any ideas!

nash#trace x.x.58.86
Type escape sequence to abort.
Tracing the route to (x.x.58.86)
VRF info: (vrf in name/id, vrf out name/id)
1 x.x.129.65 4 msec 0 msec 2 msec
2 (x.x.58.85) [AS x828] 20 msec 20 msec 18 msec
3 (x.x.58.86) [AS x828] 30 msec * 24 msec
Nash#trace x.x.58.86
1 x.x.129.65 2 msec 2 msec 0 msec
2 (x.x.58.85) [AS x828] 20 msec 20 msec 20 msec
3 (x.x.58.86) [AS x828] 24 msec * *
Nash#trace x.x.58.86
1 x.x.129.65 2 msec 2 msec 0 msec
2 (x.x.58.85) [AS x828] 20 msec 20 msec 20 msec
3 (x.x.58.86) [AS x828] 24 msec * 22 msec

Nash#ping x.x.58.86 repeat 200
Type escape sequence to abort.
Sending 200, 100-byte ICMP Echos to x.x.58.86, timeout is 2 seconds:
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!.!!!!!!!!!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
Success rate is 99 percent (199/200), round-trip min/avg/max = 22/24/58 ms

bnar01#ping x.x.58.86 repeat 200
Type escape sequence to abort.
Sending 200, 100-byte ICMP Echos to x.x.58.86, timeout is 2 seconds:
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
Success rate is 100 percent (200/200), round-trip min/avg/max = 22/26/90 ms

bnar01#trace x.x.58.86
1 x.x.129.65 2 msec 0 msec 2 msec
2 (x.x.58.85) [AS x828] 20 msec 20 msec 20 msec
3 (x.x.58.86) [AS x828] 22 msec * 22 msec
bnar01#trace x.x.58.86
1 x.x.129.65 14 msec 2 msec 2 msec
2 (x.x.58.85) [AS X828] 18 msec 20 msec 28 msec
3 (x.x.58.86) [AS x828] 24 msec * 26 msec

bnar01#ping x.x.58.86 repeat 200
Type escape sequence to abort.
Sending 200, 100-byte ICMP Echos to x.x.58.86, timeout is 2 seconds:
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
Success rate is 100 percent (200/200), round-trip min/avg/max = 22/24/32 ms

bnar01#ping x.x.58.86 repeat 200
Type escape sequence to abort.
Sending 200, 100-byte ICMP Echos to x.x.58.86, timeout is 2 seconds:
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!!!.!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
Success rate is 99 percent (199/200), round-trip min/avg/max = 22/24/40 ms

Dallas#trace x.x.x.86
Type escape sequence to abort.
Tracing the route to (x.x.58.86)
VRF info: (vrf in name/id, vrf out name/id)
1 (x.x.12.5) 24 msec 32 msec 32 msec
2 (x.x.58.85) [AS x828] 32 msec 48 msec 52 msec
3 (x.x.58.86) [AS x828] 68 msec * 184 msec
dallas#trace x.x.x.86
1 (x.x.12.5) 24 msec 24 msec 24 msec
2 (x.x.58.85) [AS x828] 36 msec 32 msec 36 msec
3 (x.x.58.86) [AS x828] 48 msec * 68 msec
dallas#trace x.x.x.86
1 (x.x.12.5) 12 msec 8 msec 12 msec
2 (x.x.58.85) [AS x828] 20 msec 28 msec 16 msec
3 (x.x.58.86) [AS x828] 24 msec * 104 msec

dallas#ping x.x.58.86 repeat 200
Type escape sequence to abort.
Sending 200, 100-byte ICMP Echos to x.x.58.86, timeout is 2 seconds:
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
Success rate is 100 percent (200/200), round-trip min/avg/max = 20/73/192 ms
dallas#ping x.x.58.86 repeat 200
Type escape sequence to abort.
Sending 200, 100-byte ICMP Echos to x.x.58.86, timeout is 2 seconds:
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!.!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
Success rate is 99 percent (199/200), round-trip min/avg/max = 16/56/164 ms

Joseph W. Doherty · ‎01-06-2015

Disclaimer

The Author of this posting offers the information contained within this posting without consideration and with the reader's understanding that there's no implied or expressed suitability or fitness for any purpose. Information provided is for informational purposes only and should not be construed as rendering professional advice of any kind. Usage of this posting's information is solely at reader's own risk.

Liability Disclaimer

In no event shall Author be liable for any damages whatsoever (including, without limitation, damages for loss of use, data or profit) arising out of the use or inability to use the posting's information even if Author has been advised of the possibility of such damage.

Posting

Increased latency often indicates traffic being queued (generally because rate offered is greater than available transmission rate).

Drops are often due to a queue overflowing.

The two often go hand-in-hand.

Also keep in mind, service providers, unfortunately, are often unable to "see" problems within their network until you "rub their nose in it". (NB: I'm not saying your provider is the cause of your issue; just sometimes they are the problem even when they say they're not.)

Insufficient information to comment further.

casanavep · ‎01-06-2015

Have you run anything such as MTR on a Linux box (or the WinMTR equivalent on a PC)? If so, can you find a trend in loss or high latency at a specific hop on the path? I would adjust the ICMP payload to a larger size, such as 1000 bytes, and set the ping interval to every two seconds or so. This helps ensure you are not running into an issue where the provider is rate limiting your pings, which is not uncommon for some providers if the pings (ICMP messages) are terminating on their endpoints.
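
As a rough illustration of the larger-payload test, the same ping used in the original post can be repeated from the router with a bigger datagram. This is only a sketch: it reuses the masked target address from the thread, and it does not show the two-second pacing, which is easier to get from MTR's interval option than from this form of the ping command.

```
! Same target as the baseline tests, but with 1000-byte datagrams
! instead of the 100-byte default shown in the earlier output
ping x.x.58.86 size 1000 repeat 200
```

If the loss or latency pattern changes noticeably between the 100-byte and 1000-byte runs, that difference is worth including in whatever you send to the provider.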

Do you have QoS policies applied on interfaces on either end of these pings / traces? If so, do you have assurance that ICMP messages will not be impacted by queue-based dropping or shaping latency? One solution is to place ICMP traffic to or from your ping and trace endpoints in a priority queue with adequate bandwidth (the requirement should be very low). This may not seem to make sense since your bandwidth utilization is low, but shaping of busy flows can actually occur long before congestion, depending on your design.
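
A minimal sketch of the priority-queue idea follows, assuming an MQC-style policy on the WAN-facing interface. The addresses (192.0.2.1 and 192.0.2.2), the names ICMP-TEST and WAN-EDGE-OUT, the interface, and the 64 kbps figure are all hypothetical placeholders, not values from this thread. If a service policy is already applied, the class would go into that existing policy rather than a new one.

```
! Hypothetical test endpoints; replace with your own ping/trace addresses
ip access-list extended ICMP-TEST
 permit icmp host 192.0.2.1 host 192.0.2.2
 permit icmp host 192.0.2.2 host 192.0.2.1
!
class-map match-all ICMP-TEST
 match access-group name ICMP-TEST
!
! Give the test traffic a small priority (low-latency) queue, in kbps
policy-map WAN-EDGE-OUT
 class ICMP-TEST
  priority 64
!
! Hypothetical WAN-facing interface
interface GigabitEthernet0/0
 service-policy output WAN-EDGE-OUT
```

Note that this ACL only matches ICMP. Cisco traceroute sends UDP probes by default, so only the ICMP replies to a trace would land in this class unless the ACL is widened.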

Another item that may give you better insight is running and monitoring / graphing IP SLA probes between your routers on each end. You could then trend issues and give graphed evidence to your provider. They could then compare your lossy and high-latency periods to their appliance interface, memory, and CPU loads to see if they can find a correlating trend. It can be a hard battle to get ISPs to not only admit they have issues, but also allocate resources to isolate and resolve them. Good SLA probe data showing that their paths are not meeting delivery standards speaks much louder to them than pings.

zrunner626 · ‎01-06-2015

Would you set up the IP SLA to monitor from the troubled site back to HQ or the other way around?

Thanks

casanavep · ‎01-06-2015

It depends on the probe type that you use. Some, such as udp-jitter, provide bidirectional statistics, so they can be originated on either side. I would probably use this in your case, since it will utilize multiple packets for analysis and generate good path quality metrics such as per-direction latency, loss, and jitter (variation in latency between packets).
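
For reference, a udp-jitter probe could look something like the sketch below. The operation number (10), the addresses (192.0.2.1 as the sender, 192.0.2.2 as the far end), the UDP port, and the packet count and timers are hypothetical values chosen for illustration. Older IOS releases use the `ip sla monitor` form of these commands, so the exact syntax may differ on your code version.

```
! On the far-end router: answer the probe packets
ip sla responder
!
! On the originating router: 20 packets, 20 ms apart, once every 60 seconds
ip sla 10
 udp-jitter 192.0.2.2 16384 source-ip 192.0.2.1 num-packets 20 interval 20
 frequency 60
!
! Start now and keep running
ip sla schedule 10 life forever start-time now
```

Results can then be checked with `show ip sla statistics 10` or polled by a monitoring system for graphing. The one-way latency figures are only meaningful if both routers have synchronized clocks (for example via NTP); loss and jitter do not depend on that.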

zrunner626 · ‎01-08-2015

Thanks all for the suggestions. It ended up being a DDoS attack on the service provider. But I will be implementing some IP SLA for the future.