Bursty Traffic? Help

Hi all,

 

Hope someone can help me.

A brief explanation: I have an ASA on a customer's network that is having loads of packet loss, but I can't understand why. Everything is connected via fiber, and on the Cisco switch I can see that the port is at 40% capacity (100 Mbps port). I think the issue could be microbursting, but I can't find software that measures this well, and the only tool I could use is Wireshark... So I mirrored the uplink port and used Wireshark's I/O graph to see the amount of bits passing through the port with a 10 ms scale. My issue is that this is beyond my knowledge and I can't read the graph.
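(In case it helps anyone reproduce this outside Wireshark, here is roughly what that per-10 ms binning does, a rough Python/scapy sketch; the capture.pcap file name and the 90% burst threshold are just placeholders, not from my actual setup:)

# Sketch: bin a mirrored-port capture into 10 ms buckets and flag near-line-rate bins,
# assuming the mirrored capture was saved as "capture.pcap" (placeholder name).
from collections import defaultdict
from scapy.all import rdpcap

BIN = 0.010               # 10 ms bins, same scale as the Wireshark I/O graph
LINK_BPS = 100_000_000    # 100 Mbps uplink

bits_per_bin = defaultdict(int)
for pkt in rdpcap("capture.pcap"):
    bits_per_bin[int(float(pkt.time) / BIN)] += len(pkt) * 8

for idx, bits in sorted(bits_per_bin.items()):
    util = bits / (LINK_BPS * BIN)   # fraction of what the link can carry in one bin
    if util > 0.9:                   # bins near line rate would indicate microbursts
        print(f"t={idx * BIN:.3f}s  {bits} bits  ({util:.0%} of link capacity)")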

Going by this graph:

https://drive.google.com/file/d/1J0frAEfu89f8GnPRcrGmnlTaac0Es75u/view?usp=drivesdk

 

Can anyone please tell me if the traffic is bursty or not?

Thank you


Hi all, I'm bumping this post with new graphs that are easier to read:

 

1ms Graph Capture

https://ibb.co/SvTNk4S

 

100ms Graph Capture

https://ibb.co/HPk1YwW

 

Can anyone please tell me how to read these?

Thank you

Joseph W. Doherty
Hall of Fame
If the port is showing drops but a low average load, that's due to bursty traffic. (Although really high offered load can also result in low average utilization, from excessive drops.)

If the port shows drops, what does its utilization graph look like over time?

What's the actual device and IOS running on it?

The device is a ComNet and unfortunately doesn't show much, but according to the port statistics all the packets are being transmitted correctly.
Because the ComNet doesn't show much, I tried with a small 8-port SG250 and didn't see any packet drops there either, not even tail-dropped bytes. But the way the problem is presenting itself (the problem being images breaking up) points to bandwidth issues.
If I use QoS I can mitigate the problem, but my customer is adamant that we shouldn't have to use it if the link is at 40% capacity.
Honestly I've already tried everything, and I'm out of ideas.
I must add, I'm talking about IP CCTV.

40% utilization doesn't necessarily mean you do not need QoS.

Live streaming video can be sensitive to jitter and latency, besides being sensitive to loss. It's much like VoIP traffic in its service needs, although generally both more demanding of bandwidth and more variable in its bandwidth usage.

I would suggest you try to prioritize your IP CCTV traffic, and see if that mitigates the issues you've been having.

BTW, your graphs are interesting, but I wonder about their accuracy.

On the 100 ms diagram, I see a spike (at about 11:33:12) above 10 Mbits, but 10 Mbits is the max possible, @ 100 Mbps, for 100 ms. That noted, average utilization does appear to be about 40%.

On the 1 ms, I see most of the graph above 100 Kbits, but @ 100 Mbps, for 1 ms, the max possible is 100 Kbits, so that graph seems far off.

Similar issue for your OP 10 ms graph.
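To make that sanity check explicit: the most a link can deliver in one graphing interval is simply rate × interval, so any plotted point above that cap suggests a scaling problem in the graph. A quick sketch (the numbers below assume your 100 Mbps link):

# Maximum bits a 100 Mbps link can carry per graphing interval;
# anything plotted above these caps points to a scaling issue in the graph.
LINK_BPS = 100_000_000  # 100 Mbps

for interval_s in (0.100, 0.010, 0.001):  # the 100 ms, 10 ms and 1 ms graphs
    cap_bits = LINK_BPS * interval_s
    print(f"{interval_s * 1000:>5.0f} ms bin -> max {cap_bits / 1000:,.0f} kbit")
# 100 ms -> 10,000 kbit (10 Mbit), 10 ms -> 1,000 kbit, 1 ms -> 100 kbit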

As I noted with video, its bandwidth demand is variable. When you set it to use some amount of bandwidth, that's an average consumption. As I believe your graphs correctly show, bandwidth demand is highly variable, even during small time intervals. (NB: with many video codecs, you'll see bandwidth demand jump as more changes on the screen; for a still picture, the bandwidth demand can go way down.)

You describe the problem as arising just by adding video sources; QoS, indeed, might be unable to cure that. Why? Because, effectively, the problem could be within the video traffic itself, i.e. the video traffic competing against itself. Normally, QoS works great when you have something like video competing against data file copying.

From what you describe, yes, if you prioritize the first set/group of cameras, they should work well in spite of the second group; the problem, though, is the second group may (or probably will) not function well. I'm sure your customer wants all of them to work.

You mention different brands of video equipment. Well, different brands might use different video codecs. I.e., for the same video source, one codec's bandwidth usage could be more variable than another's for the "same" bandwidth setting.

Your mention of using flow control seeming to address the problem, that's very, very interesting. Flow control engages when a port is about to be overrun. Basically it tells something further upstream to pause sending (avoiding drops). However, you mention you didn't see drops on the interfaces. So, that doesn't make much sense. Perhaps the switch is dropping packets, but they don't register as drops? Unfortunately, I have no experience with the SMB Cisco switches, like the SG. With an Enterprise-level Cisco switch, we can generally "tune" resource allocation and see if that mitigates the problem.

That said, if flow control seems to mitigate the problem, and it, itself, doesn't cause any problems, use it. Again, its primary purpose is to avoid drops, but it can increase latency and/or jitter. However, so does increasing queue/buffer resources. (Actually, flow control uses buffer resources further upstream.) The one major issue with flow control is that the original/common version is an all-or-nothing approach.
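For what it's worth, standard 802.3x flow control works by having the congested port send a PAUSE frame to its link partner. Purely as an illustration of what that frame carries (switches generate these themselves; the source MAC below is a placeholder, and this is not something you'd normally craft by hand):

# Illustration only: structure of an 802.3x PAUSE frame, the mechanism behind "flow control".
from scapy.all import Ether, Raw

pause = (
    Ether(dst="01:80:c2:00:00:01",      # reserved multicast MAC used for PAUSE frames
          src="00:11:22:33:44:55",      # placeholder source MAC
          type=0x8808)                  # MAC Control EtherType
    / Raw(b"\x00\x01"                   # opcode 0x0001 = PAUSE
          b"\xff\xff")                  # pause time, in 512-bit-time quanta (max value)
)
pause.show()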

Oh, some more "food for thought".

 

To avoid most queuing, you want an average overall load to be 1/3 or less.

 

Up to about 1/2 average overall load, you have little queuing.

 

Up to about 2/3 average load, you often start to queue, but often not a lot.

 

Over 2/3 average load, your queuing can start to increase exponentially.
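Those rough thresholds follow from basic queueing behavior: queues stay small at low utilization and blow up as utilization approaches 100%. A tiny sketch, using the textbook M/M/1 approximation only to show the shape (real switch queues aren't exactly M/M/1):

# Rough M/M/1 approximation: mean queue length grows as rho^2 / (1 - rho),
# so queuing (and delay) explodes as utilization nears 100%.
for rho in (0.33, 0.50, 0.67, 0.80, 0.90, 0.95):
    lq = rho * rho / (1.0 - rho)   # mean packets waiting (excluding the one in service)
    print(f"utilization {rho:.0%}: avg queue ~ {lq:.2f} packets")
# 33% -> 0.16, 50% -> 0.50, 67% -> 1.36, 90% -> 8.1, 95% -> 18.0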

 

I mention this because, again, looking at your 100 ms graph at about 11:33:12, the average there looks to be around 70%.

 

Also, while you might think a 40% average load isn't really busy, it can be for traffic with highly variable bandwidth demand that's "sensitive" to delay, jitter, and/or loss.

 

The above is why Cisco recommends LLQ allocations not exceed about 1/3 of the bandwidth (actually they say it's to provide bandwidth for other traffic). LLQ gets full priority over all other traffic, so it's as if it has all the bandwidth to itself, but again, it's a question of "like" traffic, at the same QoS treatment level, getting sufficient bandwidth.

 

You mention trying TCP rather than UDP. TCP might be okay, or possibly even better, for buffered video streaming traffic, but for live video, where buffering causes problems, or for buffered video with some form of its own transmission-quality and/or rate control, UDP is almost always the better choice.

 

Lastly, in the grand scheme of QoS, situations like this are best addressed by some form of admission control being able to obtain guaranteed bandwidth.  For example, as each camera comes on-line to transmit, it would notify the network of the resources it needs, and only transmit if permitted (when the network can, indeed, provide/guarantee them). Unfortunately, in the "real world", it's not uncommon for one additional stream to not only not work (well) itself, as it starts to transmit, but to cause issues to all other concurrent traffic of the same kind.

 

Sometimes the only practical solution is "better" hardware resources, whether that's "better" network devices and/or additional bandwidth. More often than not, I've shown QoS can avoid the need for "better" hardware resources, but, again, not always. Yours might be such a situation.

Thank you very much Joseph for the replies!

I've been troubleshooting the network and basically found one big issue... One of the links is made up of two 5-port Netgear ProSafe switches before it enters the ComNet switch. I found that if the switches were hard-coded to operate at 100 Mbps full duplex, for some reason only 10 Mbps would pass through that link. I did the test using iperf.

When I set the link to auto-negotiate, and it negotiated at 1 Gbps, I could achieve the 100 Mbps speed, well, 91 Mbps according to iperf (the overall speed of that link is 100 Mbps due to constraints on the link speed from a third party).

After doing this I found that the image was still breaking up, but no packet loss was being shown on the switches, not even on the 2960. On the 2960 I can see some output drops, but no drops in the input queues. I honestly don't know why I have output drops if I have virtually no traffic going out through that port.

Well, the only way for me to get a stable picture, according to the consultant's specs, was to activate flow control on all the switches up to the 2960.

Between the switches they have symmetric flow control activated, and all the cameras are transmitting over UDP. This way the consultant accepts the solution, but not if I had used QoS.

This is clearly a bandwidth issue, but I can't see anything being dropped in the switches... it's really weird what's happening there.

Some points of possible interest/concern, from what you wrote.

If any device is hard-coded full duplex, but the other side of the link is set for auto, speed will match, but the auto side will run at half duplex. Traffic will pass, but very slowly vs. link speed. This might account for your 10 Mbps throughput over a 100 Mbps link.

Second, you mention once the ports were set to auto, and link speed was increased to gig, a 3rd party limits throughput to 100 Mbps. Well, that's a bottleneck, which might be having drops, which you cannot "see". If the 3rd party limits throughput to 100 Mbps, if you can, run the connecting ports also at 100. Then you can "see" if there's any congestion at your port, and if so, you then have the option to manage it with QoS.

Lastly, if all is working well with flow control (besides having at least one way to mitigate the problem, which is good), flow control should engage when device(s) detect congestion, which is generally right before they start to drop packets. So, everything else being equal, I would expect drops without flow control. Yet that's not the case(?). That makes for a "curious" situation.

BTW, the only "problem" with normal flow control it's an all-or-nothing bandwidth management. There's are enhanced variants of flow control, which can selectively pause just some traffic.

Hi Joseph,

 

Just to give an update: everything is now working with flow control on, not the best option, but the option the customer accepts.

Everything is working correctly and the customer is happy.

Thank you very much for your extremely insightful explanations.

Kind regards