
Owen Mould · 04-15-2016

We have a multi-site network mostly interconnected with an Opt-E-MAN cloud (multi-point Layer 2 network). Sometimes one server or another at one site or another will shove a whole lot of data down the pipe all at once, to the detriment of other traffic. If it's a Microsoft update server or a Quorum backup box, this can persist for hours.

I've already put in place class-based weighted fair queuing and service policies that give business-critical traffic priority on all outbound links (sample below). That doesn't seem to be enough: earlier this week a couple of boxes dumped so much traffic on their part of the net that X-ray image uploads--a top priority--went from taking 5 minutes to 5 hours.

Two questions:

How can I assess how the service policies are working? I would like to see what traffic they're holding back, what they're flinging forward, and what they're throwing away.
Would it help to put service policies on inbound traffic as well?

Here's a sample policy and interface implementation:

Policy Map PM-cbwfq
    Class CM-Priority-1
      set ip dscp af11
      bandwidth 5000 (kbps)
    Class CM-Priority-2
      set ip dscp af12
      bandwidth 2500 (kbps)
    Class CM-VoiceVideo
      set ip dscp ef
      bandwidth 2500 (kbps)
    Class CM-Network
      set ip dscp ef
      bandwidth 1250 (kbps)
    Class CM-bulk
      set ip dscp default
    Class class-default
      fair-queue

...

interface GigabitEthernet0/1
description Opt-E-MAN
bandwidth 100000
ip address 10.100.1.10 255.255.255.0
rate-limit output 100000000 18750000 37500000 conform-action transmit exceed-action drop
duplex auto
speed auto
service-policy output PM-cbwfq

Thanks!

Vasilii Mikhailovskii · 04-17-2016

Hello.

I'm not sure about your problem, but here are some notes about the configuration:

I doubt it's a good idea to mix legacy and MQC;
With MQC you should run a shaper, not a policer, otherwise you won't get good QoS;
If you have a voice class, it should typically be LLQ (priority);
the shape average rate should be based on your contracted (CIR) bandwidth, typically 90-95% of CIR;
if you are running a pre-HQF IOS release (before 12.4(20)T), your class-default and CM-bulk may starve.
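To illustrate the LLQ point, here is a minimal sketch, assuming the class names from the sample policy above and a 15.x (HQF) IOS: CM-VoiceVideo is moved from a CBWFQ bandwidth guarantee to a priority queue. A class cannot hold both `bandwidth` and `priority`, so the existing guarantee is removed first; the 2500 kbps figure is simply carried over from the sample.

```text
policy-map PM-cbwfq
 class CM-VoiceVideo
  ! bandwidth and priority are mutually exclusive within one class
  no bandwidth 2500
  ! LLQ: strict priority, implicitly policed to 2500 kbps during congestion
  priority 2500
```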

PS: does your provider support any kind of QoS?

Owen Mould · 04-19-2016

Thanks for your comments. I reviewed Cisco's MQC transition documents but didn't see the legacy commands you refer to. Are you referring to the dscp commands?

The shaping must take into account that the interface in question is wide-area Ethernet, with no burst provisions. We get the bandwidth specified, no more, no less; if we let the router transmit too much, the excess is simply dropped. So the rate-limit line up there is wrong, but I don't think that's my problem here. It's beginning to look like a multi-line rate-limit command (specifying different rate limits for different dscp values) might be useful, but if my problem is on the LAN interface it won't help.
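For comparison, per-DSCP limits can also be expressed in MQC instead of stacking several legacy rate-limit statements. A minimal sketch, where the class-map and policy-map names are hypothetical and the 20 Mbps rate is only a placeholder. Keep in mind that a policer drops excess traffic rather than queueing it, which is why a shaper is generally preferred for the link itself.

```text
! Hypothetical names; 20000000 bps is a placeholder rate
class-map match-any CM-dscp-default
 match dscp default
!
policy-map PM-dscp-police
 class CM-dscp-default
  ! Drop (not queue) anything above 20 Mbps in this class
  police cir 20000000 conform-action transmit exceed-action drop
```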

Vasilii Mikhailovskii · 04-19-2016

Hello.

By "legacy" I mean the rate-limit command under the physical interface; it should be replaced with shaping in a hierarchical MQC policy.

Joseph W. Doherty · 04-18-2016

Disclaimer

The Author of this posting offers the information contained within this posting without consideration and with the reader's understanding that there's no implied or expressed suitability or fitness for any purpose. Information provided is for informational purposes only and should not be construed as rendering professional advice of any kind. Usage of this posting's information is solely at reader's own risk.

Liability Disclaimer

In no event shall Author be liable for any damages whatsoever (including, without limitation, damages for loss of use, data or profit) arising out of the use or inability to use the posting's information even if Author has been advised of the possibility of such damage.

Posting

If you have a multi-point topology, you have the classic problem that multiple sites can send to the same destination site at the same time and overrun it. Basic per-site egress QoS does not address this.

What can be done?

The simplest and often the best solution (unfortunately often not available) is for your service provider to provide egress QoS from their "cloud" to each site. (Although service providers rarely offer QoS support, they are often quite willing to sell you additional bandwidth.)

Another simple solution is to shape each site's egress to ensure that no aggregate combination of sites can overrun any other individual site. For example, if you have 10 sites with FE connections, shape each site to 10 Mbps. (Hopefully, you can see all the disadvantages of this.)
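A minimal sketch of that per-site cap, assuming 10 sites on FE hand-offs: with every site shaped to 10 Mbps, the aggregate from all sites stays at or below 100 Mbps, so no single site's FE link can be overrun. The policy names are hypothetical, and the sketch assumes no other output policy is already attached. Whatever child policy is nested must have bandwidth guarantees that fit within 10 Mbps; the sample PM-cbwfq reserves 5000 + 2500 + 2500 + 1250 = 11250 kbps, so IOS should refuse it as a child here.

```text
! Hypothetical names; PM-site-child must be defined, with guarantees totalling 10 Mbps or less
policy-map PM-site-cap
 class class-default
  shape average 10000000
  service-policy PM-site-child
!
interface GigabitEthernet0/1
 service-policy output PM-site-cap
```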

A more complicated solution is to use some kind of performance monitoring that can dynamically shape based on perceived load.

Owen Mould · 04-19-2016

The problem in this case is not different sites overrunning one site's connection (though that can happen). It's one or two hosts on one site's LAN flooding that site's WAN link. Schematic:

    x-ray images (1st priority)\
other business-critical traffic >= g0/0.210 -> {traffic queued by priority} -> g0/1 -> (((WAN)))
 backup traffic (4th priority) /

It's even possible that the congestion is happening on the LAN side (g0/0.210)--the backup server may be putting out so many packets that the x-ray image packets can't even get into the router to get queued properly. That's why I'm thinking about inbound traffic shaping on the LAN interface, though when I tried it, it didn't seem to improve throughput.
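If you do want to act on traffic as it enters the LAN subinterface, note that on IOS routers `shape` is generally accepted only in output policies, so an ingress policy would police or mark instead. A minimal sketch, where the ACL, class, and policy names, the backup-server address (192.0.2.50), and the 20 Mbps rate are all hypothetical placeholders:

```text
! Hypothetical backup-server address and names
ip access-list extended ACL-BACKUP-HOSTS
 permit ip host 192.0.2.50 any
!
class-map match-any CM-backup-in
 match access-group name ACL-BACKUP-HOSTS
!
policy-map PM-lan-in
 class CM-backup-in
  ! Policing drops the excess; TCP senders should back off in response
  police cir 20000000 conform-action transmit exceed-action drop
!
interface GigabitEthernet0/0.210
 service-policy input PM-lan-in
```

This only acts once packets reach the router; if the congestion is on the switch port feeding g0/0, the switch's own QoS is where to look.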

The command for checking policy-map performance is, not surprisingly,

sh policy-map int <interface>
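Expanded, and assuming the egress policy is attached to GigabitEthernet0/1 as in the original post, the commands below show the counters. In the per-class output, the "offered rate" and "drop rate" lines and the "(queue depth/total drops/no-buffer drops)" counters show what each class is sending, holding in queue, and discarding. The second command narrows the output to one class; the third shows counters for the legacy rate-limit statement, which drops traffic independently of the MQC policy.

```text
show policy-map interface GigabitEthernet0/1 output
show policy-map interface GigabitEthernet0/1 output class CM-Priority-1
show interfaces GigabitEthernet0/1 rate-limit
```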

Joseph W. Doherty · 04-19-2016

Ah, well in that case, "normal" QoS should do the trick.

However, looking again at your OP, how you've configured your QoS is not the way you want to do it.

What's the platform?

Owen Mould · 04-19-2016

Platforms are a 3925 and a 3925e, both running IOS version 15.3(3)M7.

How would you recommend configuring the QoS?

Joseph W. Doherty · 04-20-2016

In situations where your logical bandwidth is less than your physical hand-off bandwidth, you can use a hierarchical policy (as Vasilii also mentioned). A "parent" policy shapes to the logical bandwidth cap; a "child" policy manages congestion within the shaper.

e.g. (100 Mbps cap on a gig interface):

policy-map parent
 class class-default
  shape average 85000000
  service-policy PM-cbwfq

BTW, for your EF classes, you might consider using LLQ (the priority statement).

You apply the parent policy as the interface's egress policy.
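Assuming the interface configuration from the original post, attaching it means swapping out the flat policy (an interface takes only one output service-policy) and, per Vasilii's point, dropping the legacy rate-limit so that the parent shaper alone caps the rate. A minimal sketch:

```text
interface GigabitEthernet0/1
 ! Remove the legacy CAR limiter by repeating the full statement with "no"
 no rate-limit output 100000000 18750000 37500000 conform-action transmit exceed-action drop
 ! Only one output policy per interface: detach the flat policy first
 no service-policy output PM-cbwfq
 ! PM-cbwfq now runs as the child inside "parent"
 service-policy output parent
```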

If you're wondering why 85 Mbps rather than 100 Mbps in the above, that's because many Cisco shapers (and policers) don't appear to account for L2 overhead. If you have a platform/IOS that does, then you would set it to 100 Mbps.

Owen Mould · 04-26-2016

Something like this, then? Please forgive my denseness--I've looked at the docs, but there seem to be a million pages all saying different things.

policy-map parent-out    ! egress policy        
 class class-default
  shape average 85000000 
  service-policy PM-cbwfq 

policy-map child-Xray
 class CM-Xray           ! top priority traffic
  priority 10000 

policy-map child-VoiceVideo
 class CM-VoiceVideo
  priority 4000         ! 4 Mbps ought to be enough for voice and video,
                        ! "priority" should invoke LLQ

...                     ! other child policy-maps

policy-map child-Bulk
 class CM-Bulk          ! ACL uses time-range to strictly        
  police 1000           ! limit b/w during clinic hours
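For comparison, here is a minimal sketch of the usual layout, assuming the class-maps from the post already exist and treating the rates as placeholders. A parent policy nests one child policy, so all the classes normally go into that single child (PM-child is a hypothetical name) rather than into separate policy-maps. Also, `police` takes bits per second, so `police 1000` would mean 1 kbps; 1 Mbps is written as 1000000. Comments are kept on their own lines, since as far as I know IOS does not accept trailing `!` comments after a command.

```text
! Hypothetical child policy holding all classes
policy-map PM-child
 class CM-Xray
  ! Both priority classes share the LLQ; each is policed to its own rate under congestion
  priority 10000
 class CM-VoiceVideo
  priority 4000
 class CM-Bulk
  ! 1 Mbps policer; the class-map's ACL time-range decides when traffic matches
  police cir 1000000 conform-action transmit exceed-action drop
 class class-default
  fair-queue
!
! Parent shapes to the logical cap and nests the child
policy-map parent-out
 class class-default
  shape average 85000000
  service-policy PM-child
!
! Assumes no other output policy is attached to this interface
interface GigabitEthernet0/1
 service-policy output parent-out
```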

It seems to me I could profitably add an ingress policy that assigns dscp values that I could use throughout the network.
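A minimal sketch of such an ingress marking policy on the LAN subinterface, assuming the class-maps from the post classify the traffic. The policy name is hypothetical and the DSCP values are placeholders (AF21 for x-ray, CS1 for bulk as a low-priority marking); egress classes elsewhere could then simply `match dscp` on these values.

```text
! Hypothetical ingress marking policy; DSCP values are placeholders
policy-map PM-lan-mark-in
 class CM-Xray
  set dscp af21
 class CM-Bulk
  set dscp cs1
!
interface GigabitEthernet0/0.210
 service-policy input PM-lan-mark-in
```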

Thanks!

enforcing traffic priorities, checking up on service policies