Packet loss investigation - 6.14.23

jmaxwellUSAF · 06-14-2023

Hello.

GOAL: To determine the source of probable packet loss within a LAN/WAN circuit.

GIVEN:
1. Server application stress test produces excessive errors/failures. Pcaps suggest packet loss on the path from an AWS server to a server on the enterprise LAN. This circuit crosses a C1101 enterprise edge gateway device configured with BGP, as well as an L2L VPN.

2. Because the stress test produces many brief, quickly terminating connections instead of one long connection, the packet loss investigation is atypical.

3. The data is in two attached images summarizing 3 pcaps, taken at both server endpoints and at the C1101 gateway.
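
With many short connections, one way to quantify loss is to compare the sender-side and receiver-side captures segment by segment rather than following a single stream. The following is a minimal sketch, assuming each capture has already been exported to a list of TCP segments and that addresses and ports are not rewritten between the two capture points. The `Pkt` type, its field names, and the sample addresses are hypothetical and are not taken from the attached pcaps.

```python
from collections import Counter
from typing import Iterable, NamedTuple, Tuple


class Pkt(NamedTuple):
    """One captured TCP segment, reduced to fields that identify it end to end."""
    src: str
    sport: int
    dst: str
    dport: int
    seq: int
    payload_len: int


def missing_at_receiver(sent: Iterable[Pkt], received: Iterable[Pkt]) -> Counter:
    """Multiset of segments seen in the sender capture but not in the receiver capture.

    A retransmitted segment appears twice at the sender; if only one copy
    arrives, the difference reports one missing copy.
    """
    return Counter(sent) - Counter(received)


def loss_by_flow(missing: Counter) -> Counter:
    """Aggregate missing segments per (src, sport, dst, dport) flow."""
    per_flow: Counter = Counter()
    for pkt, count in missing.items():
        flow: Tuple[str, int, str, int] = (pkt.src, pkt.sport, pkt.dst, pkt.dport)
        per_flow[flow] += count
    return per_flow


# Hypothetical sample data: one flow loses a segment and retransmits it.
a1 = Pkt("10.0.0.1", 40000, "10.1.0.5", 443, 1, 100)
a2 = Pkt("10.0.0.1", 40000, "10.1.0.5", 443, 101, 100)
b1 = Pkt("10.0.0.1", 40001, "10.1.0.5", 443, 1, 50)

sender_capture = [a1, a2, a2, b1]   # a2 was sent twice (retransmission)
receiver_capture = [a1, a2, b1]     # only one copy of a2 arrived

missing = missing_at_receiver(sender_capture, receiver_capture)
print(loss_by_flow(missing))
```

Grouping the result per flow helps show whether loss is spread across all the short connections or concentrated in a few of them.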

QUESTIONS:

1. Is the "red boxed" data strong evidence of packet loss? Is this the best evidence of packet loss out of all this data?

2. The "blue boxed" data on the lower left: why does the C1101 device register so few captured packets (about 75% fewer) relative to the AWS-Cloud server? (This device pcap was taken not with Wireshark, but with the OS's internal software pcap feature.)

3. When investigating this packet loss symptom, is there anything else particularly interesting in this data?

4. What is the recommended next step in investigating the cause of this packet loss?

5. My conclusion is that my next step is to inspect the statistics / health of the C1101 gateway device. If that shows no symptoms, I will then inspect the AWS gateway router. May you please advise me on what you think should be my course of action on this troubleshoot?

Thank you.

Ramblin Tech · 06-14-2023

End-to-end packet loss in a network transport typically comes down to a couple of high-level causes:
1 - Transmission errors/impairments on the network links.

2 - Over-subscription of some resource in the network elements (switches/routers).

For #1, check the ingress interface stats on the network elements for incrementing errors, especially CRC/FCS errors. Network interface hardware should throw away any received frame with a bad CRC (but should also increment the counter).
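
As a rough illustration of "incrementing errors", the sketch below compares two snapshots of interface counters taken some time apart and reports only the counters that grew. It assumes the counters have already been collected into dictionaries by whatever means is available (CLI output, SNMP, and so on); the counter names are hypothetical, and counter resets or wraps between snapshots are not handled.

```python
from typing import Dict


def incrementing_errors(before: Dict[str, int], after: Dict[str, int]) -> Dict[str, int]:
    """Return the counters that increased between two snapshots, with their deltas."""
    return {
        name: after[name] - before[name]
        for name in after
        if name in before and after[name] > before[name]
    }


# Hypothetical snapshots taken before and after a stress-test run.
snapshot_before = {"input_errors": 12, "crc": 12, "output_drops": 0}
snapshot_after = {"input_errors": 57, "crc": 57, "output_drops": 0}

print(incrementing_errors(snapshot_before, snapshot_after))
```

A counter that is non-zero but static points to a past event; a counter that grows during the test window is the one worth chasing.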

For #2, most modern elements can forward at wire-speed for real-world packet sizes, so over-subscription of the fabric bandwidth (measured in bps) or packet forwarding rate (measured in pps) is possible, but not the most likely cause. The most likely cause will be over-subscription of an egress interface. That is, packets are arriving at the egress interface faster than they can be transmitted out. This results in egress queues filling until buffers are exhausted. With no egress buffers available, newly arriving packets are dropped.

Look at the QoS stats of your egress interfaces to determine if you are experiencing buffer exhaustion and queue drops. Do not rely on 5-minute or even 30-second snapshots of link utilization, as these numbers will misleadingly smooth out utilization peaks and completely miss micro-bursts of traffic that can overrun your buffers. What to do about chronic queue drops is another conversation focused on capacity and QoS.
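
To see why averaged utilization can hide drops, here is a toy simulation of a single FIFO egress queue with tail drop. It is only a sketch under simplified assumptions (fixed time ticks, one queue, no QoS classes), and every number in it (100 Mbps link, 500,000-bit buffer, a 10 ms burst arriving at 1 Gbps) is made up for illustration.

```python
from typing import List, Tuple


def simulate_egress(arrivals_bits: List[int], link_bps: int,
                    buffer_bits: int, ticks_per_second: int) -> Tuple[float, int]:
    """Simulate one tail-drop FIFO queue; return (average utilization, dropped bits)."""
    drain_per_tick = link_bps // ticks_per_second  # bits the link can send per tick
    queue = 0
    sent = 0
    dropped = 0
    for arriving in arrivals_bits:
        backlog = queue + arriving
        transmitted = min(backlog, drain_per_tick)
        backlog -= transmitted
        overflow = max(0, backlog - buffer_bits)   # what does not fit in the buffer
        queue = backlog - overflow
        sent += transmitted
        dropped += overflow
    utilization = sent / (drain_per_tick * len(arrivals_bits))
    return utilization, dropped


# One second in 1 ms ticks: a 10 ms burst at 1 Gbps, then silence.
arrivals = [1_000_000] * 10 + [0] * 990
utilization, dropped = simulate_egress(
    arrivals, link_bps=100_000_000, buffer_bits=500_000, ticks_per_second=1000
)
print(f"average utilization: {utilization:.1%}, dropped bits: {dropped}")
```

In this made-up case the one-second average utilization is 1.5%, yet 8,500,000 of the 10,000,000 offered bits are dropped, which is exactly the kind of event a long averaging interval cannot show.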

Disclaimer: I am long in CSCO

Joseph W. Doherty · 06-17-2023

Is this a continuation of the same issue from your prior posting?

Regardless, when I read "Server application stress test produces excessive errors/failures.", it raises a very large red flag for me! If this testing stresses the network, the network will often drop packets, which is generally adverse to applications running across it. My first troubleshooting step is determining whether the network is being stressed.

What do I consider network stress? Anytime the network's capacity is being exceeded, regardless of for how long.

If the network isn't being stressed, then usually, but not always, it is the application itself that is having issues under such testing.

Consider that you have a rope "rated" for 1,000 pounds. In theory, if you never lift more than 1,000 pounds, the rope should be fine. However, the moment you attempt to lift more than 1,000 pounds, you have, let's say, voided the warranty. If you try the latter and the rope does break, can you troubleshoot the cause beyond the fact that you exceeded the rope's capacity? Sure you can, but to what purpose?

In my prior example, if the 1,000 pound rope breaks lifting 1,000 pounds, or less, now you have good reason to determine cause of this "unexpected" failure.

Likewise with networks. Has the application stress test exceeded the capacity of the network?

I'm far from doing expert level packet capture analysis, because in my experience, it very seldom provides information which I find helpful, beyond other troubleshooting tests I do. However, when I've done network troubleshooting, I've had access to information that's probably unavailable to you. In such cases, a packet capture can provide symptom information, but I suspect it won't, alone, be very helpful in cause identification.

"May you please advise me on what you think should be my course of action on this troubleshoot?"

Determine whether the network's capacity is being exceeded during such application stress testing. If it is, the most likely cause of the problem is the stress testing exceeding that capacity.
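
One rough way to check this from a capture is to compute the offered rate over short windows and compare the peak against the capacity of the slowest link on the path. The sketch below assumes packets are available as (timestamp in milliseconds, size in bytes) pairs from a capture taken on the sending side, before any bottleneck; the function names, sample traffic, and the 10 Mbps capacity are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def peak_rate_bps(packets: List[Tuple[int, int]], window_ms: int) -> float:
    """Highest bit rate seen in any fixed window of window_ms milliseconds."""
    bits_per_window: Dict[int, int] = defaultdict(int)
    for timestamp_ms, size_bytes in packets:
        bits_per_window[timestamp_ms // window_ms] += size_bytes * 8
    if not bits_per_window:
        return 0.0
    return max(bits_per_window.values()) * 1000 / window_ms


def capacity_exceeded(packets: List[Tuple[int, int]], window_ms: int,
                      capacity_bps: int) -> bool:
    """True if the peak windowed rate is above the given link capacity."""
    return peak_rate_bps(packets, window_ms) > capacity_bps


# Hypothetical traffic: ten 1500-byte packets in 10 ms, then one more at 500 ms.
sample = [(t, 1500) for t in range(10)] + [(500, 1500)]

print(peak_rate_bps(sample, window_ms=10))     # short window exposes the burst
print(peak_rate_bps(sample, window_ms=1000))   # long window averages it away
print(capacity_exceeded(sample, window_ms=10, capacity_bps=10_000_000))
```

The same traffic can look far under capacity over a one-second window and over capacity over a 10 ms window, so the window size should be chosen with the buffer depth of the path in mind.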

If the network's capacity is not being exceeded, then you need to determine whether the application itself is having issues when being stressed, or whether there is some issue with the network. The latter can sometimes be difficult to identify (especially when you don't have full access to everything), and is often even more difficult to convince others of.