
CBS350 AES67 PTPv2 Jitter Under Load

SeanOGorman
Level 1

We are working through some issues with our CBS350 network with regard to AES67 PTPv2 timing. To simplify things, we have a test network set up on the bench containing two CBS350-24P-4G switches with a CBS350-8S-E-2G switch sitting in between them (see the attached Network.pdf file).

We have a few VLANs set up to keep the AES67 traffic separate from the Dante traffic. The switches are connected via trunks. QoS is enabled and configured for the AES67 networks.

When the switches have a low load (very few devices connected), PTPv2 timing is spot on across devices on the network. If we take ~20 Dante PoE devices and connect them to one of the CBS350-24P-4G switches, the two AES67 devices on that switch start drifting as much as 1.5 ms from the GM and lose sync. If we move the ~20 Dante devices to the other switch, we see the first switch settle down while the switch that now has the high device count starts to drift off the network.

Some notes / observations:

  • All switches are running current firmware and have been factory reset.
  • There are no dropped packets in the QoS queue log.
  • At no point does any interface get above 8% utilization.  

It almost feels like the switch is not fast enough when switching a large number of devices.
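
For reference, here is a minimal sketch (Python, with made-up timestamps) of the standard PTP offset/delay math; it shows how a few milliseconds of one-way queuing delay on Sync messages would appear as roughly the 1.5 ms offset we are seeing:

# Minimal sketch of the standard PTP offset/mean-path-delay math.
# All timestamp values below are made up for illustration (seconds).
# t1: Sync sent by GM, t2: Sync received by the slave,
# t3: Delay_Req sent by the slave, t4: Delay_Req received by the GM.

def ptp_offset_and_delay(t1, t2, t3, t4):
    """Standard PTP servo inputs; assumes the path delay is symmetric."""
    offset = ((t2 - t1) - (t4 - t3)) / 2.0           # apparent slave clock error vs GM
    mean_path_delay = ((t2 - t1) + (t4 - t3)) / 2.0
    return offset, mean_path_delay

# Idle network: ~5 us of forwarding delay each way, clocks actually in sync.
print(ptp_offset_and_delay(0.000000, 0.000005, 0.001000, 0.001005))
# -> roughly (0.0, 5e-06): no offset error

# Loaded network: Sync frames sit behind Dante multicast in an egress queue
# and pick up ~3 ms extra one way; half of that asymmetry lands in the offset.
print(ptp_offset_and_delay(0.000000, 0.003005, 0.001000, 0.001005))
# -> roughly (0.0015, 0.0015): ~1.5 ms of apparent offset from a one-way queuing delay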

 

Thoughts?

4 Replies

Harald Steindl
Level 1

Very interesting. Since this post is quite old: did you manage to solve this one?
My question/suggestion would be to change your trunks. To my knowledge, putting PTPv2 traffic into a multi-VLAN trunk is never a good idea. I know of multiple manufacturers insisting on a dedicated AES67 trunk between switches, no matter which switch model.

HST Consulting - raising the bar

Interesting indeed. I don't recall this post from when it was posted, perhaps because its title/label didn't grab my attention.

There's insufficient information to say for sure what the problem is, but I think it unlikely it's due to ". . . the switch is not fast enough when switching a large number of devices." I think it more likely the switch's architecture isn't suited to supporting the necessary services for the large number of AES67 PTPv2 flows, and/or the QoS capabilities aren't being used optimally.

BTW:

  • There are no dropped packets in the QoS queue log.
  • At no point does any interface get above 8% utilization.

That does not rule out suboptimal QoS for the service needs, especially as we don't know what the QoS configuration was.

"I know of multiple manufacturers insisting on having a dedicated AES67 trunk between switches no matter which switch model."

Unsurprising, as that's often easier for someone to provide than having the necessary QoS when using a shared link.

Above I mentioned switch architecture.

Consider what happens if same-priority frames arrive on all 24 ingress ports at the "same" time, and we need to forward them all out one "uplink" port. We can start to transmit one frame immediately and queue the other 23.

Which of the 24 that arrived at the "same" time will be transmitted first?  How will the remaining 23 frames be queued for transmission?  If another batch of "2nd" frames now arrives at the "same" time, will their queuing sequence be guaranteed to be the same as for the 1st batch, or might it differ?  If it can differ, we now have a jitter issue.
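
As a rough illustration (a toy model with assumed numbers, not a claim about the CBS350's actual scheduler), here is what just the egress serialization of one such batch looks like on a 1 Gb/s uplink; if a Sync frame's position in the batch varies from batch to batch, the whole spread shows up as PTP jitter:

# Toy model: one egress port draining a batch of same-priority frames.
# Frame size and port count are assumed; this is not any specific switch's scheduler.

LINK_BPS = 1_000_000_000        # 1 Gb/s uplink
FRAME_BYTES = 300               # assumed average frame size in the batch
PORTS = 24                      # frames arriving "simultaneously" on 24 ingress ports

serialization_s = FRAME_BYTES * 8 / LINK_BPS    # time to put one frame on the wire

# The k-th frame in the batch waits behind k earlier frames.
delays_us = [k * serialization_s * 1e6 for k in range(PORTS)]

print(f"per-frame serialization: {serialization_s * 1e6:.2f} us")
print(f"first frame waits {delays_us[0]:.2f} us, last frame waits {delays_us[-1]:.2f} us")
# If a PTP Sync frame can land anywhere in that batch, its per-hop delay varies
# by roughly the full spread above, batch to batch.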

Thanks for replying.

As people way smarter than me found out, QoS normally does not help you all that much in real life for PTP troubles. In other words, do not expect QoS to be the solution unless you have a decently loaded network.
The way it was explained to me goes like this:
QoS helps the device choose when there are two packets to be sent out "now" and the switch needs a helping hand in deciding which one should go out first. However, in an only lightly loaded network this doesn't happen, as packets come in only "once in a while", so to speak. QoS does not speed things up in general! If there are not many packets waiting, then there is nothing to prioritize; each packet gets sent out as soon as possible, as nothing holds it up.
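
A quick way to see that point (a toy, single-port model; the helper names and numbers are mine, nothing vendor-specific): with an otherwise empty port, FIFO and strict priority give the identical departure time, and only when a burst is already queued does the priority decision change anything.

# Toy single-egress-port model comparing FIFO vs strict priority for a small
# timing/audio packet. All sizes are assumed; preemption of a frame already
# on the wire is ignored.

LINK_BPS = 1_000_000_000   # 1 Gb/s port

def tx_time(nbytes):
    """Seconds needed to serialize nbytes onto the wire."""
    return nbytes * 8 / LINK_BPS

def departure_delay(pkt_bytes, queued_bulk_bytes, strict_priority):
    """Delay until our packet has finished transmitting."""
    ahead = 0 if strict_priority else queued_bulk_bytes   # priority jumps the bulk queue
    return tx_time(ahead) + tx_time(pkt_bytes)

PTP_BYTES = 100
for backlog in (0, 10 * 1500):          # empty port vs ten 1500-byte frames queued
    fifo = departure_delay(PTP_BYTES, backlog, strict_priority=False) * 1e6
    prio = departure_delay(PTP_BYTES, backlog, strict_priority=True) * 1e6
    print(f"backlog {backlog:>6} B: FIFO {fifo:7.2f} us, strict priority {prio:5.2f} us")
# With no backlog the two numbers are identical; the priority marking only
# changes the outcome once there is actually something to get in front of.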

Take a look at this video, SUPER helpful: https://www.youtube.com/watch?v=HsmZu8Ooo6A

HST Consulting - raising the bar

You might want to keep looking for smarter people if they believe QoS, optimally done, is generally ineffective.

I do agree that if an interface never has a queue, QoS generally isn't needed, but in real networks it's rare for interfaces never to be oversubscribed.

I only very briefly skimmed your referenced video, and (perhaps) saw one of the biggest common mistakes, where they showed a link loaded at (?) 10%, 50% and 90% having no congestion issues. (I didn't see their method of loading the link.)

Firstly, understand that at any instant an interface is at only one of two usage percentages, 0 or 100. Anything else is a capacity average over some time period.

For example, if I transmit continuously for 10 seconds, and your measurement period is the same 10 seconds, usage would be seen as 100%. If those 10 seconds were measured across 20 seconds, then usage would be 50%. But even in the latter case, for 10 seconds of the 20, the link was completely saturated.

But say, for that 10-second transmission, a faster interface or multiple interfaces were feeding the traffic to the egress interface, which couldn't queue more than 10 seconds' worth of traffic, and 11 seconds' worth of traffic was actually received. That is, we dropped about 10% of the traffic. Yet, looking at just the link loading over 20 seconds, usage was (only) 50%.
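
To put rough numbers on that example (simple arithmetic with assumed figures):

# Simple arithmetic behind the averaging example above (assumed figures).

LINK_BPS = 1_000_000_000     # 1 Gb/s egress link
burst_seconds = 10           # the sender transmits flat out for 10 seconds
poll_seconds = 20            # utilization counters are sampled over 20 seconds

offered_bits = LINK_BPS * 11               # 11 seconds' worth of traffic arrives
carried_bits = LINK_BPS * burst_seconds    # the most the link can carry in 10 seconds
dropped_bits = offered_bits - carried_bits

print(f"reported utilization: {carried_bits / (LINK_BPS * poll_seconds):.0%}")  # 50%
print(f"traffic dropped during the burst: {dropped_bits / offered_bits:.0%}")   # ~9%
# A "50% average utilization" reading coexists with 10 full seconds of
# saturation and roughly 10% loss inside the same measurement window.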

The "poster child" for QoS is mixing something like FTP and VoIP traffic on the same intrface, because FTP can cause transient problems for VoIP during the 10 second period.

Further, even if you were running just FTP, I personally believe QoS is needed to ensure the best "goodput" transmission rate for the FTP traffic.

Don't misunderstand: I don't propose QoS is always of benefit, or always needed, but in my experience it's needed much more often than many think.

Also, except for things like VoIP traffic, I find your typical textbook QoS models, like Cisco's 12-class model or the RFC model, often don't work very well in the "real world". Since they don't, it unfortunately furthers the common misconception that QoS isn't very effective.

Personally, I wouldn't qualify myself as a smarter person, but drawing on much QoS literature, I spent about a decade working to improve perceived network performance across an Americas WAN. Over that time, I found which QoS techniques were truly a benefit and which weren't.
Two stories I like to tell about the foregoing. First, the senior WAN engineer felt QoS was just worthless "voodoo", until the day he came to work and his phone was lit up like a Christmas tree with people complaining, "what's wrong with the network?" It turned out that overnight one of our HQ routers had rebooted itself and dropped its QoS settings. He restored the QoS settings and the phone calls stopped.

Second, the company I was doing this for had three major regions: Americas, EMEA and Asia-Pacific. During one of their annual network conferences, the two other regions were complaining that no matter how much bandwidth they purchased, the poor network performance complaints just didn't stop. I mentioned that since we started to use QoS, we seldom increased bandwidths, yet our users no longer complained. Well, the other regions also seemed to consider QoS "voodoo" and continued as they were.
