
Microbursts - Software to use


Hi all,

Lately, almost every network that I put my hands on is suffering with this, but until now I haven't found a reliable tool where I can easily see this.

Anyone knows of a software that I can use to easily detect microbursting using SPAN?

Thank you


7 Replies

Joseph W. Doherty
Hall of Fame

Possibly something like Wireshark.

BTW, I find it interesting you use the words "lately", "almost every network" and "suffering".

I suspect the root cause of most micro-bursts is TCP's slow start.  If so, nothing new there, but there are some factors which would make slow start's micro-bursts more impactful.  (Some high-bandwidth, variable-rate video might also be a micro-burst contributor.)

For instance, later TCP stacks, I believe, use a much larger RWIN than they did in the past.  This allows a sender to send larger bursts before it bumps up against the RWIN limit.  (TCP variants that don't adhere to slow start's "rules" compound the problem.)
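To make that concrete, here's a rough back-of-the-envelope sketch.  All the numbers (initial window, receive window, link speed) are just illustrative assumptions, but they show how each doubling of the window leaves the sender back-to-back at line rate, i.e. as a micro-burst:

# Rough sketch: how TCP slow start turns into line-rate micro-bursts.
# MSS, receive window and link speed below are illustrative assumptions,
# not measurements.

MSS = 1460                  # bytes per segment
RWIN = 4 * 1024 * 1024      # 4 MB receive window (modern stacks advertise large windows)
LINK_BPS = 1_000_000_000    # sender attached at 1 Gbps

cwnd = 10 * MSS             # common initial window (RFC 6928: 10 segments)
rtt_round = 0
while cwnd < RWIN:
    burst_bits = cwnd * 8
    burst_ms = burst_bits / LINK_BPS * 1000   # time the burst occupies the wire
    print(f"RTT {rtt_round}: window {cwnd / 1024:7.1f} KB -> "
          f"~{burst_ms:.2f} ms of back-to-back packets at line rate")
    cwnd *= 2               # slow start: window roughly doubles every RTT
    rtt_round += 1

Each of those windows arrives at the next hop as a short line-rate burst; whether it queues or drops there depends on what else hits the same egress port at the same instant.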

Another factor: later networks likely have more hosts than ever, and each of those hosts might transmit more data than ever.  Multiple hosts, sending more data, are more likely to have their micro-bursts coincide, such that even our high-bandwidth uplink ports will still need to (briefly) queue them.

Yet another possible factor: in larger networks there seems to be a bit more centralization of shared servers, often on the other end of a WAN.  So, although WAN bandwidths have increased, have they actually increased proportionally compared to the past?

In any case, I've found proper utilization of "QoS" avoids suffering, in both ye olde networks, and current networks.

Unfortunately (and as I mount my soapbox), although the industry moved away from punched cards and reel tapes decades ago, networking devices often seem to be stuck using single FIFO egress queues, which is where micro-bursts often cause much "suffering".

Thank you once again for the reply, Joseph.

Wireshark is extremely helpful, but presenting a graph from Wireshark to a customer can be extremely daunting, and it is almost impossible to make them understand it. I was looking for something friendlier, with easy-to-read graphs. I know there is paid software that does this, but it is extremely expensive.

Let's see what more I can find.

Regarding the network: I work in the CCTV industry, and my networks are made up only of CCTV cameras. Everything uses H264 and, on some occasions, H265 (rare, though). Despite the compression, and because of the erratic behaviour of the video, the networks get loads of micro-bursting. Then, to avoid spending much money, the company dimensions the links basically to pass the amount of data the camera says. If we program a camera for 8Mbps, they dimension the link to accommodate 8Mbps and don't leave much overhead. What happens is that the cameras do indeed average 8Mbps, but because of the I-frames they have spikes that can reach 16Mbps.
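Just to illustrate the kind of mismatch I mean (the camera count and rates below are only an example, not a real site), the averages can look comfortable while a handful of simultaneous I-frames momentarily oversubscribe the uplink:

# Illustrative only: averages vs. aligned I-frame peaks on a CCTV uplink.
# 8 camera streams at 8 Mbps average / 16 Mbps I-frame peak on a 100 Mbps
# link are example numbers, not measurements.

cameras = 8
avg_mbps = 8.0     # configured/average bitrate per camera
peak_mbps = 16.0   # momentary rate while an I-frame is on the wire
link_mbps = 100.0

avg_total = cameras * avg_mbps
print(f"Average load: {avg_total:.0f} Mbps ({avg_total / link_mbps:.0%} of the link)")

# How many cameras need to send an I-frame at the same instant before the
# instantaneous demand exceeds the uplink?
for bursting in range(cameras + 1):
    demand = bursting * peak_mbps + (cameras - bursting) * avg_mbps
    status = "OVERSUBSCRIBED -> queuing / possible drops" if demand > link_mbps else "fits"
    print(f"{bursting} camera(s) on an I-frame at once: {demand:5.0f} Mbps  {status}")

With these example numbers the link averages 64%, yet five cameras hitting an I-frame at the same instant is already more than the link can serialize, which is exactly the moment the queues fill.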

And because my customer basically wants the cameras to output the maximum they can, to have the best image possible, this behaviour gets even worse.

It has been a battle for me to handle this. I'm trying to use QoS to the best of my knowledge, but sometimes, even with QoS and with all the available resources of the switch in play, the queues get full and packets start to drop.

Most of the time my customer doesn't see this, and they put a lot of pressure on me to make it work with the resources I have. It's really been a big battle...

But well, this is the main reason why I'm trying to find software where it would be easy to show these things. I show them averages of, say, 40/50/60% load on a 100Mbps link, for example, and they come back to me saying: "Ah! But all the cameras combined don't even reach 50Mbps (they read the bitrate at face value, 8+8, etc.) and you're telling me the link is already getting full? Put it to work..."

 

 

I understand.  I did mention video as a common micro-burst contributor because many people, with variable-rate video, don't fully appreciate the difference between video's average bandwidth usage and its peak bandwidth usage, especially for hi-def, which can consume a fair chunk of bandwidth.

Also, many don't understand queuing theory, so they don't understand that once you pass beyond roughly 1/3 utilization, queuing becomes more and more likely.  Between 1/3 and 2/3, such queuing isn't "bad", which seems to hold with your stats at a load of 40 to 60%.  (Once you get beyond 2/3 utilization, queuing can start to skyrocket.)
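If it helps to show why those 1/3 and 2/3 figures aren't arbitrary, here's a minimal sketch using the textbook M/M/1 queuing result.  Real traffic is burstier than that model assumes, so real queues are usually worse, but the shape of the curve is the point:

# Minimal sketch: why queuing "takes off" well before 100% utilization.
# Uses the classic M/M/1 formulas (random arrivals, single server); real
# traffic is burstier, so real queues are typically worse than this.

for util in (0.10, 0.33, 0.50, 0.67, 0.80, 0.90, 0.95):
    avg_in_system = util / (1 - util)          # average packets queued + in service
    avg_waiting = util * util / (1 - util)     # average packets waiting in the queue
    print(f"{util:4.0%} utilization -> ~{avg_waiting:5.2f} waiting, "
          f"~{avg_in_system:5.2f} in system on average")

Notice the waiting figure is still small around 1/3 utilization, modest at 2/3, and then climbs steeply toward infinity as you approach 100%.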

What I sometimes have found helps is asking someone who argues that 50% load shouldn't cause problems why drops, which arise from queue overflows, are happening.

BTW, for video that's not being viewed in real time, increasing queue sizes to buffer the bursts and avoid drops is often all you need to do.  For real-time video, other than sufficient bandwidth to avoid queuing, about the only other thing you can do, if supported on the equipment, is use something like FQ if there are multiple streams.  This won't preclude the drops, but it often avoids drops against the streams which are not causing the transient congestion.

There was an interesting 3rd-party technology, (quite) some time back, that Cisco deployed in some platforms' QoS.  This technology would analyze traffic down at the millisecond level (on the device itself) and "tell you" what bandwidth the traffic needed to meet specifications you defined.  See QoS Bandwidth Estimation.  You couldn't always trust what this feature told you, and I think Cisco stopped providing it.  I also believe the 3rd party providing this technology is still in business.  Likely, also an expensive product.

I've found average load figures often useless for detailed analysis.  But monitoring, and graphing, current queue depth and/or drops can be very revealing, and that might be accomplished with something like PRTG.  The only issue, though: if the bursts are infrequent enough, they can be hard to catch.  That's what was nice about the 3rd-party feature; it seemed to see all the traffic hitting the interface, so even a very infrequent queue buildup wouldn't go unnoticed, and it (should) have used that when calculating the bandwidth needed to meet a specified SLA.
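If you want a quick do-it-yourself check alongside (or before buying) something like PRTG, even a small script polling the standard IF-MIB ifOutDiscards counter and printing the deltas can show when the drops cluster.  A minimal sketch; the management address, SNMP community and interface index below are placeholders, and it assumes net-snmp's snmpget is installed:

# Minimal sketch: poll an interface's output-drop counter and print the rate.
# HOST, COMMUNITY and IF_INDEX are placeholders you'd need to change; the
# script shells out to net-snmp's snmpget.  ifOutDiscards is the standard
# IF-MIB counter for packets discarded on output (typically queue overflows).

import re
import subprocess
import time

HOST = "192.0.2.1"        # placeholder management address
COMMUNITY = "public"      # placeholder SNMP community string
IF_INDEX = 10101          # placeholder ifIndex of the congested port
OID = f"1.3.6.1.2.1.2.2.1.19.{IF_INDEX}"   # IF-MIB::ifOutDiscards.<ifIndex>
INTERVAL = 10             # seconds between polls

def read_discards():
    out = subprocess.check_output(
        ["snmpget", "-v2c", "-c", COMMUNITY, "-Ovq", HOST, OID], text=True)
    return int(re.search(r"\d+", out).group())

prev = read_discards()
while True:
    time.sleep(INTERVAL)
    cur = read_discards()
    print(f"{time.strftime('%H:%M:%S')}  +{cur - prev} output drops "
          f"in the last {INTERVAL}s")
    prev = cur

Graph those deltas over a day and the bursty nature of the drops usually becomes obvious, even when the average utilization looks harmless.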

Unfortunately, none of the above really provides a solution to your issue.  Hopefully, it will provide some new avenues of research and something to use when trying to explain why 50% utilization can be a problem.


Thank you very much for the reply Joseph.

I'm going to show this reply to one of the people I was arguing with about the 60%. He has a CCNP, and he should know these things, but he was adamant that a 60% average was not much.

Yesterday I had a big issue where I was seeing packet drops at these rates. I stated that it was due to micro-bursting and that I was seeing packets being dropped in the queues. But to no avail: I had to make the switch handle the dropped packets no matter what. In that case we were talking about HP switches, and, unlike the Ciscos, there isn't much that can be done with their QoS; we can't even change the buffers for the queues/bandwidth.

The only option I had was to limit the number of queues in order to increase the buffers on those queues. By doing this I was able to reduce the number of drops. But it's a losing battle: the customer says it has to be done, even the CCNP guy, and I'm the one left to do miracles...

The traffic itself is a mix of both streaming and recorded traffic, and as far as I know the 3650s don't have FQ, am I correct? I was also looking and I think they have WFQ; can that be used?

I'm going to try PRTG and see if I can prove something to show them.

 

No, the 3650 doesn't support FQ (I don't recall any Cisco L3 switch that does), except between its hardware queues.  I.e., not like FQ on a Cisco router.

However, there is some QoS settings tuning that can sometimes make a huge difference on the 3650/3850.

For example, you often start with "qos queue-softmax-multiplier 1200", mentioned in Catalyst 3850: Troubleshooting Output drops.  BTW, the whole paper is worth reading.

For transient (brief) congestion, I've found using a common pool of buffers works well.  (Maybe so has Cisco.  They had that on the 35xx switches, but not, by default, on the 3560/3750 series of switches, and they now have DTS on the 3650/3850 switches.)

Besides the softmax command, you might want to assign a larger buffer share to the queue with the video in it and/or (if possible) cut back on the hardmax reservations (if any).
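As a sketch only of the shape that tuning might take on a 3650/3850 (the class name, DSCP value, percentages and buffer ratios are placeholders you'd adjust for your own traffic and verify against your software release):

! Sketch only: class name, DSCP value, percentages and ratios are placeholders.
! Give the shared (soft) buffer pool more headroom globally:
qos queue-softmax-multiplier 1200
!
class-map match-any CCTV-VIDEO
 match dscp af41
!
! Give the bursty video queue a larger share of the buffers:
policy-map UPLINK-OUT
 class CCTV-VIDEO
  bandwidth remaining percent 60
  queue-buffers ratio 50
 class class-default
  bandwidth remaining percent 40
  queue-buffers ratio 25
!
interface GigabitEthernet1/0/48
 service-policy output UPLINK-OUT

The idea is simply to let the queue carrying the bursty video borrow more of the shared buffer space during a transient, rather than trying to police or shape it.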

As an example, years ago I had a 3750, gig ports, with a couple of SAN servers on a couple of ports.  Those ports were showing multiple drops per second.  After bumping up the allowance for what could be used from the common pool, and reducing buffer reservations per port (more like hardmax on the 3650/3850s), my drops on those ports were reduced to a few per day!  (Other ports didn't appear to suffer.)

BTW, I was just recently reading something on iPerf (How do you test microburst effects using iPerf?) that said it has a command to control bursting.  If so, what might help in making believers of the bursting issues is a lab setup where you can demonstrate those issues, and their mitigation.
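If I recall the iperf3 syntax correctly, the burst control is the optional "/count" suffix on the bitrate option, so a lab test might look something like this (addresses and numbers are placeholders):

# On the receiving host (placeholder addresses throughout):
iperf3 -s

# On the sending host: 8 Mbit/s of UDP, but delivered in bursts of 200
# packets at a time instead of evenly paced, to roughly mimic I-frame bursts:
iperf3 -c 192.0.2.10 -u -b 8M/200 -t 60

Running that across a link sized for the "average" rate, while watching the egress drop counters, is usually a convincing demonstration of why averages alone don't tell the story.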

I'm starting to think these people are like Trump: you show them the truth, but they don't care...

Thank you very much, Joseph; as always, your explanations are amazing, and I have learned a lot from them.

Thank you

 

Laugh - not so much, I think, that people don't care, but perhaps more like "don't confuse me with the facts".

Thank you for the compliment on my explanations.  Personally, I wouldn't rate them amazing; for that title, I would nominate Peter Paluch's postings.  Lots of other great postings from other contributors too, especially the many from our VIP members.
