I have a QoS policy that I'm trying to verify and have run into an issue. The policy looks something like this:
policy-map SHAPER
 class class-default
  shape average 10000000
  service-policy CHILD
policy-map CHILD
 class PING-SSH
  bandwidth percent 50
 class VOICE-VIDEO
  priority percent 50
Basically: shape all traffic down to 10 Mb/s, reserve 50% of that bandwidth for traffic like pings and SSH, and give priority to up to 50% of it for traffic like voice and video. This is simplified to an extent, but you get the idea. Anyway, I have two IxChariot endpoints on either side of the interface where this policy is applied. If I send a 100 Mb/s UDP stream matching the VOICE-VIDEO class from one endpoint to the other, I get 10 Mb/s of throughput, since only one class is sending and there's no contention between classes for the queue to arbitrate. If I start a TCP throughput test matching the PING-SSH class, wait 10 seconds, and then start the same UDP stream, I get about 5 Mb/s of throughput for each test, meaning the policy works just fine. However, if I start those two tests at the exact same time, the TCP test times out and the UDP test takes all 10 Mb/s of bandwidth. Similarly, if I just start the UDP test and then try to ping or SSH across the network, the pings either time out or return incredibly slow round-trip times, and the SSH session fails altogether.
Now, I'm pretty sure this is happening because of the way priority classes are treated. As soon as traffic matching a priority class hits the egress queue, it's sent straight to the front and shipped out immediately. Traffic in the bandwidth classes goes to the back of the queue and waits its turn to egress the interface. If I send a single ping or a TCP SYN, it sits in the queue while all that priority traffic jumps ahead of it and gets sent out. I also see the queue depth for my PING-SSH class increase during these times, which, from my understanding, means the router is holding those packets in a buffer.
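The starvation described above can be sketched with a toy scheduler. Slotted time, fixed-size packets, and the queue sizes here are my own simplifying assumptions for illustration, not a model of what IOS-XE actually does internally:

```python
from collections import deque

# Toy model of strict-priority egress scheduling behind a shaper.
LINK_PKTS_PER_TICK = 1        # shaped link drains 1 packet per tick
PRIO_ARRIVALS_PER_TICK = 10   # priority (UDP) offered at 10x the shaped rate
TICKS = 1000

prio_q, bw_q = deque(), deque()
sent = {"priority": 0, "bandwidth": 0}

bw_q.append("ping")           # a single ping waiting in the bandwidth class

for tick in range(TICKS):
    # Priority traffic arrives faster than the shaper can drain it,
    # so the priority queue never empties.
    for _ in range(PRIO_ARRIVALS_PER_TICK):
        prio_q.append("udp")
    # Strict priority: the bandwidth queue is only served when the
    # priority queue is empty -- which never happens here.
    for _ in range(LINK_PKTS_PER_TICK):
        if prio_q:
            sent["priority"] += 1
            prio_q.popleft()
        elif bw_q:
            sent["bandwidth"] += 1
            bw_q.popleft()

print(sent)       # every transmit slot went to the priority class
print(len(bw_q))  # the ping is still sitting in the queue
```

Once the priority offered load drops below the shaped rate (the "wait 10 seconds" case), the bandwidth queue starts getting slots again, which matches what the tests show.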
Class-map: PING-SSH (match-any)
  948 packets, 1295850 bytes
  30 second offered rate 0000 bps, drop rate 0000 bps
  Match: dscp cs5 (40)
  queue limit 64 packets
  (queue depth/total drops/no-buffer drops) 10/0/0
  (pkts output/bytes output) 948/1295850
  bandwidth 10% (150 kbps)
That queue depth can stay the same even after 5-10 seconds, implying the router is holding those packets for an excessive amount of time before finally shipping them out. Now, if there's already a steady stream of TCP traffic, QoS works just fine (as evidenced by waiting 10 seconds before starting the priority UDP stream in the example above), but if a large priority stream is already using all the available bandwidth, there's no way for a new TCP connection to ramp up. That could happen if there are more voice or video calls going over the interface than it can handle, or possibly if a DDoS attack somehow matched a priority class. My workaround for now is to put SSH traffic into a priority class so that, if something like this were to happen, I could at least get into my network gear and shut down ports or adjust ACLs to stop the problematic traffic, but I'm wondering if there's a better way to do that. Any thoughts?
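As a rough sanity check on those counters: the output above implies an average packet of about 1367 bytes, so even if the class were being drained at its full 150 kbps guarantee, ten queued packets represent most of a second of delay; under strict-priority starvation the effective drain rate is closer to zero. The arithmetic (all figures taken from the class output shown earlier):

```python
# Back-of-the-envelope delay estimate from the show output.
total_bytes, total_pkts = 1295850, 948
avg_pkt_bytes = total_bytes / total_pkts      # ~1367 bytes per packet

queue_depth = 10                              # from "(queue depth/...) 10/0/0"
guaranteed_bps = 150_000                      # from "bandwidth 10% (150 kbps)"

# Time for 10 queued packets to drain at the guaranteed rate alone:
drain_s = queue_depth * avg_pkt_bytes * 8 / guaranteed_bps
print(f"{drain_s:.2f} s")                     # ~0.73 s for the queue to empty
```

~0.73 s is already in the neighborhood of a TCP initial retransmission timeout, and if the priority class is consuming every transmit slot, the real wait is unbounded, which is consistent with the SYNs timing out.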
Unfortunately, that doesn't appear to be possible:
Router(config-pmap)# class VOICE-VIDEO
Police and Priority with bandwidth/percent are not allowed in the same class
Was that what you had in mind or did I misunderstand?
By the way, I failed to mention that if I replace the bandwidth command under the PING-SSH class with a priority command, my pings work just fine. Also, this is all on a Cisco 4431.
Fair enough. In my particular case, that class is intended for VoIP and video conferencing, so I'd want to keep it in a priority class. There really should never be a case where voice and video traffic use more than the bandwidth allocated by the policy, but strange things happen, and I'm not comfortable with the idea of that traffic being able to consume all the available bandwidth. My solution is going to be to put SSH traffic into its own priority class so I can at least reach my equipment if voice and video do end up using all the available bandwidth.