ASR9000/XR: Understanding QOS, default marking behavior and troubleshooting

xthuijs · ‎03-07-2011

Introduction

This document provides details on how QOS is implemented in the ASR9000 and how to interpret and troubleshoot QOS-related issues.

Core Issue

QOS is always a complex topic and with this article I'll try to describe the QOS architecture and provide some tips for troubleshooting.

I'll keep enhancing this document based on the feedback I receive.

The ASR9000 employs an end-to-end QOS architecture throughout the whole system, which means that priority is propagated throughout the system's forwarding ASICs. This is done via backpressure between the different forwarding ASICs.

One very key aspect of the A9K's qos implementation is the concept of using VOQ's (virtual output queues). Each network processor, or in fact every 10G entity in the system is represented in the Fabric Interfacing ASIC (FIA) by a VOQ on each linecard.

That means in a fully loaded system with, say, 24 x 10G cards, each linecard having 8 NPUs and 4 FIAs, a total of 192 VOQs (24 10G entities times 8 slots) are represented at each FIA of each linecard.

The VOQ's have 4 different priority levels: Priority 1, Priority 2, Default priority and multicast.

The different priority levels used are assigned on the packets fabric headers (internal headers) and can be set via QOS policy-maps (MQC; modular qos configuration).

When you define a policy-map and apply it to a (sub)interface, and in that policy map certain traffic is marked as priority level 1 or 2 the fabric headers will represent that also, so that this traffic is put in the higher priority queues of the forwarding asics as it traverses the FIA and fabric components.

If you don't apply any QOS configuration, all traffic is considered to be "default" in the fabric queues. In order to leverage the strength of the ASR9000's ASIC priority levels, you will need to configure (ingress) QOS at the ports to apply the priority level desired.

In this example T0 and T1 are receiving a total of 16G of traffic destined for T0 on the egress linecard. For a 10G port that is obviously too much.

T0 will flow off some of the traffic, depending on the queue, eventually signaling it back to the ingress linecard. While T0 on the ingress linecard also has some traffic for T1 on the egress LC (green), this traffic is not affected and continues to be sent to the destination port.

Resolution

The ASR9000 supports 4 levels of QOS; a sample configuration and implementation detail are presented in this picture:

Policer having exceed drops, not reaching configured rate

When defining policers at high(er) rates, make sure the committed burst and excess burst are set correctly.

This is the formula to follow:

Set the Bc to CIR bps * (1 byte) / (8 bits) * 1.5 seconds

and

Be=2xBc
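As a worked instance of the two formulas above, here is a minimal Python sketch. The function name and the 100 Mbps example rate are illustrative assumptions, not anything taken from the platform.

```python
def recommended_bursts(cir_bps: int) -> tuple[int, int]:
    """Return (Bc, Be) in bytes for a policer rate given in bits per second."""
    # Bc = CIR bps * (1 byte / 8 bits) * 1.5 seconds, i.e. CIR * 3 / 16
    bc_bytes = cir_bps * 3 // 16
    # Be = 2 x Bc
    be_bytes = 2 * bc_bytes
    return bc_bytes, be_bytes


# Hypothetical 100 Mbps policer
bc, be = recommended_bursts(100_000_000)
print(f"Bc = {bc} bytes, Be = {be} bytes")
```

For the assumed 100 Mbps rate this gives a Bc of 18,750,000 bytes and a Be of 37,500,000 bytes.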

Default burst values are not optimal

To visualize the problem: say you are allowing 1 pps, and for one second you don't send anything, but in the next second you want to send 2 packets. In that second you'll see an exceed.
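The 1 pps example can be reproduced with a toy token bucket. This is only a sketch of a generic single-rate policer counted in packets, not the NP's actual implementation; the function and variable names are made up for illustration.

```python
def police(arrivals, rate_pps, burst_pkts):
    """Toy single-rate token bucket measured in packets.

    arrivals: list of (time_in_seconds, packet_count), sorted by time.
    Returns (conformed, exceeded) packet counts.
    """
    tokens = float(burst_pkts)  # bucket starts full
    last_t = 0.0
    conformed = exceeded = 0
    for t, count in arrivals:
        # Refill for the elapsed time, but never beyond the burst size
        tokens = min(burst_pkts, tokens + (t - last_t) * rate_pps)
        last_t = t
        for _ in range(count):
            if tokens >= 1:
                tokens -= 1
                conformed += 1
            else:
                exceeded += 1
    return conformed, exceeded


# Nothing sent during the first second, two packets sent in the next one
idle_then_two = [(1.0, 2)]
print(police(idle_then_two, rate_pps=1, burst_pkts=1))  # (1, 1): one exceed
print(police(idle_then_two, rate_pps=1, burst_pkts=2))  # (2, 0): burst absorbs it
```

With a burst of one packet the unused second is lost and the second packet exceeds; with a burst of two packets both conform, which is why the burst size matters for reaching the configured rate.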

Alternatively, Bc and Be can be configured in time units, e.g.:

policy-map OUT
 class EF
  police rate percent 25 burst 250 ms peak-burst 500 ms
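To get a feel for what those time-based values mean in bytes, the sketch below converts a burst duration at the police rate. It assumes the burst time is simply multiplied by the policed rate and assumes a 1G interface; both are assumptions made for illustration, and the values actually programmed should be checked on the router.

```python
def burst_bytes(police_rate_bps: int, burst_ms: int) -> int:
    """Bytes that can be sent at police_rate_bps during burst_ms milliseconds."""
    return police_rate_bps * burst_ms // 8000  # /8 bits per byte, /1000 ms per s


interface_bps = 1_000_000_000            # assumed 1G interface
police_bps = interface_bps * 25 // 100   # police rate percent 25

print(burst_bytes(police_bps, 250))  # burst 250 ms
print(burst_bytes(police_bps, 500))  # peak-burst 500 ms
```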

To view the Bc and Be applied in hardware, run "show qos interface <interface> [input|output]".

Why do I see non-zero values for Queue(conform) and Queue(exceed) in show policy-map commands?

On the ASR9k, every HW queue has a configured CIR and PIR value. These correspond to the "guaranteed" bandwidth for the queue, and the "maximum" bandwidth (aka shape rate) for the queue.

In some cases the user-defined QoS policy does NOT explicitly use both of these. However, depending on the exact QoS config, the queueing hardware may require some nonzero value for these fields. In that case the system will choose a default value for the queue CIR. The "conform" counter in show policy-map is the number of packets/bytes that were transmitted within this CIR value, and the "exceed" counter is the number of packets/bytes that were transmitted above the CIR but within the PIR value.

Note that "exceed" in this case does NOT equate to a packet drop, but rather a packet that is above the CIR rate on that queue.

You could change this behavior by explicitly configuring a bandwidth and/or a shape rate on each queue, but in general it's just easier to recognize that these counters don't apply to your specific situation and ignore them.

What is counted in QOS policers and shapers?

When we define a shaper in a qos pmap, the shaper takes the L2 header into consideration.

A defined shape rate of, say, 1Mbps means that if I have no dot1q or QinQ, I can technically send more IP traffic than when I have QinQ, which has more L2 overhead. When I define a bandwidth statement in a class the same applies: L2 is also taken into consideration.

When defining a policer, it looks at L2 also.

In Ingress, for both policer & shaper, we use the incoming packet size (including the L2 header).

In order to account the L2 header in ingress shaper case, we have to use a TM overhead accounting feature, that will only let us add overhead in 4 byte granularity, which can cause a little inaccuracy.

In egress, for both policer & shaper we use the outgoing packet size (including the L2 header).

ASR9K Policer implementation supports 64Kbps granularity. When a rate specified is not a multiple of 64Kbps the rate would be rounded down to the next lower 64Kbps rate.
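The rounding can be sketched as below. The helper name is hypothetical and the sketch only mirrors the round-down rule described above; the rate actually programmed should be verified with show qos interface.

```python
GRANULARITY_KBPS = 64  # policer granularity stated above


def programmed_rate_kbps(configured_kbps: int) -> int:
    """Round a configured policer rate down to a multiple of 64 kbps."""
    return (configured_kbps // GRANULARITY_KBPS) * GRANULARITY_KBPS


print(programmed_rate_kbps(1000))  # 960: not a multiple, rounded down
print(programmed_rate_kbps(1024))  # 1024: already a multiple, unchanged
```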

For policing, shaping, BW command for ingress/egress direction the following fields are included in the accounting.

MAC DA

MAC SA

EtherType

VLANs..

L3 headers/payload

CRC
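To illustrate the effect of this accounting, the sketch below estimates how much L3 traffic fits in a given shape rate for untagged versus QinQ frames, using only the fields listed above (standard Ethernet sizes: 6+6+2 bytes of header, 4 bytes per VLAN tag, 4 bytes of CRC). The 500-byte packet size and the function name are assumptions for the example.

```python
ETH_HEADER = 6 + 6 + 2  # MAC DA + MAC SA + EtherType, in bytes
VLAN_TAG = 4            # bytes per 802.1Q tag
CRC = 4                 # bytes


def l3_rate_bps(shape_bps: float, l3_bytes: int, vlan_tags: int) -> float:
    """L3 (headers + payload) throughput that fits in an L2-accounted rate."""
    accounted = ETH_HEADER + VLAN_TAG * vlan_tags + l3_bytes + CRC
    return shape_bps * l3_bytes / accounted


shape = 1_000_000  # 1 Mbps shaper
for label, tags in (("untagged", 0), ("dot1q", 1), ("QinQ", 2)):
    print(f"{label:9s} {l3_rate_bps(shape, 500, tags):.0f} bps of IP traffic")
```

The more tags on the frame, the smaller the share of the shaped rate that is left for IP traffic, and the effect grows as packets get smaller.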

Port level shaping

A shaping action requires a queue on which the shaping is applied. This queue must be created by a child-level policy. Typically the shaper is applied at the parent or grandparent level, to allow for differentiation between traffic classes within the shaper. If there is a need to apply a flat port-level shaper, a child policy should be configured with 100% bandwidth explicitly allocated to class-default.

Understanding show policy-map counters

QOS counters and show interface drops:

Policer counts are directly against the (sub)interface and will get reported on the "show interface" drops count.
The drop counts you see are an aggregate of what the NP has dropped (in most cases) as well as policer drops.

Packets that get dropped before the policer is aware of them are not accounted for by the policy-map policer drops but may
show under the show interface drops and can be seen via the show controllers np count command.

Policy-map queue drops are not reported on the subinterface drop counts.
The reason for that is that subinterfaces may share queues with each other or the main interface and therefore we don’t
have subinterface granularity for queue related drops.

Counters come from the show policy-map interface command

Class name as per configuration	Class precedence6
Statistics for this class	Classification statistics (packets/bytes) (rate - kbps)
Packets that were matched	Matched : 31583572/2021348608 764652
packets that were sent to the wire	Transmitted : Un-determined
packets that were dropped for any reason in this class	Total Dropped : Un-determined
Policing stats	Policing statistics (packets/bytes) (rate - kbps)
Packets that were below the CIR rate	Policed(conform) : 31583572/2021348608 764652
Packets that fell into the 2nd bucket above CIR but < PIR	Policed(exceed) : 0/0 0
Packets that fell into the 3rd bucket above PIR	Policed(violate) : 0/0 0
Total packets that the policer dropped	Policed and dropped : 0/0
Statistics for Q'ing	Queueing statistics <<<----
Internal unique queue reference	Queue ID : 136
how many packets were q'd/held at max one time (value not supported by HW)	High watermark (Unknown)
number of 512-byte particles which are currently waiting in the queue	Inst-queue-len (packets) : 4096
how many packets on average we have to buffer (value not supported by HW)	Avg-queue-len (Unknown)
packets that could not be buffered because we held more than the max length	Taildropped(packets/bytes) : 31581615/2021223360
see description above (queue exceed section)	Queue(conform) : 31581358/2021206912 764652
see description above (queue exceed section)	Queue(exceed) : 0/0 0
Packets that were subject to Random Early Detection and were dropped	RED random drops(packets/bytes) : 0/0

Understanding the hardware qos output

With this command the actual hardware programming of the qos policy on the interface can be verified (the output is not related to the previous example above):

RP/0/RSP0/CPU0:A9K-TOP#show qos interface g0/0/0/0 output

Tue Mar 8 16:46:21.167 UTC
Interface: GigabitEthernet0_0_0_0 output
Bandwidth configured: 1000000 kbps Bandwidth programed: 1000000
ANCP user configured: 0 kbps ANCP programed in HW: 0 kbps
Port Shaper programed in HW: 0 kbps
Policy: Egress102 Total number of classes: 2
----------------------------------------------------------------------
Level: 0 Policy: Egress102 Class: Qos-Group7
QueueID: 2 (Port Default)
Policer Profile: 31 (Single)
Conform: 100000 kbps (10 percent) Burst: 1248460 bytes (0 Default)
Child Policer Conform: TX
Child Policer Exceed: DROP
Child Policer Violate: DROP
----------------------------------------------------------------------
Level: 0 Policy: Egress102 Class: class-default
QueueID: 2 (Port Default)
----------------------------------------------------------------------

Default Marking behavior of the ASR9000

If you don't configure any service policies for QOS, the ASR9000 will set an internal cos value based on the IP Precedence, the 802.1p priority field or the MPLS EXP bits.

Depending on the routing or switching scenario, this internal cos value will be used to do potential marking on newly imposed headers on egress.

Scenario 1

Scenario 2

Scenario 3

Scenario 4

Scenario 5

Scenario 6

Special consideration:

If the node is L3 forwarding, then there is no L2 CoS propagation or preservation as the L2 domain stops at the incoming interface and restarts at the outgoing interface.

Default marking PHB on L3 retains no L2 CoS information even if the incoming interface happened to be an 802.1q or 802.1ad/q-in-q sub interface.

CoS may appear to be propagated, if the corresponding L3 field (prec/dscp) used for default marking matches the incoming CoS value and so, is used as is for imposed L2 headers at egress.

If the node is L2 switching, then the incoming L2 header will be preserved unless the node has ingress or egress rewrites configured on the EFPs.
If an L2 rewrite results in new header imposition, then the default marking derived from the 3-bit PCP (as specified in 802.1p) on the incoming EFP is used to mark the new headers.

An exception to the above is that the DEI bit value from incoming 802.1ad / 802.1ah headers is propagated to imposed or topmost 802.1ad / 802.1ah headers for both L3 and L2 forwarding.

Related Information

ASR9000 Quality of Service configuration guide

Xander Thuijs, CCIE #6775

Alejandro Rivera · ‎03-22-2016

Hello Xander,

Thanks. Awesome post. I have a question regarding a certain behavior on an ASR9010, where the following class-maps cannot be committed. These class-maps have been applied to the ASR903 platform without any problem.

Please let me know If there is any solution, since it is a previously implemented template on IOS XE platforms in a live network.

class-map match-all GOLD
match mpls experimental topmost 2
match dscp cs2 af21 af22 af23
end-class-map
!
class-map match-any SILVER
match mpls experimental topmost 1
match dscp cs1 af11 af12 af13
end-class-map
!

policy-map MPLSCORE

class GOLD

bandwidth remaining percent 21

!

class SILVER

bandwidth remaining percert 16

!

class class-default

!

end-policy-map

!

interface Gi0/0/0/0

service-policy output MPLSCORE

!

Thanks in advance.

Alejandro Rivera · ‎03-22-2016

Hello Xander.

Thanks, Awesome post.

I have a question regarding the behavior of a certain classification template on an ASR9010 platform, which cannot be committed and returns a semantic error. The thing is, this template is already applied on ASR903 IOS XE devices, but it isn't allowed on this ASR9010.

Please, let me know of any suggestion, thanks in advance.

This is the template:

class-map match-all GOLD
match mpls experimental topmost 2
match dscp cs2 af21 af22 af23
end-class-map
!
class-map match-any SILVER
match mpls experimental topmost 1
match dscp cs1 af11 af12 af13
end-class-map
!

policy-map MPLSCORE

!

class GOLD

bandwidth remaining percert 21

!

class SILVER

bandwidth remaining percent 16

!

class class-default

!

interface GigabitEthernet0/0/0/0

service-policy output MPLSCORE

!

Aleksandar Vidakovic · ‎03-23-2016

hi Alejandro,

the syntax error is because of the match-all in GOLD, where you are trying to match simultaneously on the fields from MPLS header and IP header. When MPLS packets are received, we're looking into the EXP to classify the packet. We can't simultaneously look into the MPLS EXP and IP ToS for classification. I doubt that other platforms can actually do it. Can you convert that to match-any and try to commit?

Btw, there's a typo as well: 'percert' instead of 'percent'

regards,

Aleksandar

racarvalho · ‎03-23-2016

Hi Xander,

I need to build a qos policy to provide different services, but I have doubts about the difference between allocating bandwidth to a class and the priority of a class. Ex.

policy-map EGRESS
 class VOICE
  priority 
  police rate percent 10 
!
 class CONTROL
  bandwidth remaining percent 2 
!
 class PREMIUM
  bandwidth remaining percent 50
  random-detect 10 ms 20 ms
 !
 class SILVER
  bandwidth remaining percent 25
  random-detect 10 ms 20 ms
 ! 
 class class-default
  bandwidth remaining percent 18 
  random-detect 60 ms 80 ms 
 !

In the example above the voice class is forwarded before any other class.

But what happens to the other classes?

Are they all forwarded equally, only limited by bandwidth?
Is the bandwidth command in any way influencing the priority/weight?

How can I guarantee the class control has more priority than gold, and gold more than silver and so on? What I need is something like CBWFQ.

Thanks

RAC

xthuijs · ‎03-23-2016

hey Rac, you would want to add the priority level to the voice class like:

priority level 1 (or 2 or 3).

The way the scheduler works is:

1) first P1 is served, THEN P2, THEN P3.

2) the left over bandwidth is then scheduled ratio wise accordingly.

say a class X has a BW of 10 and the other one Y has a BW of 20

then the scheduler says, X give me 1, class Y give me 2.

The way the scheduler derives whether it will take the 1 or 2 from X or Y respectively is that it looks at the size of the packet that is head of the queue, and if there are enough tokens available the packet gets transmitted; otherwise it waits a token refresh cycle (aka Tc) to accumulate enough tokens to get the packet on the wire.

Basically a Q might say: hey, I have 1000 bytes to send now, yes or no? If no, a shaper would queue it and hold it for the next cycle; a policer would apply the exceed (or violate) action, which is likely drop.

So the shaped classes are all DQ'd in a WRR fashion, whereby the scheduler cycles through the BW queues and takes the packet from that Q (or not and keep it buffered).

Every token refresh, a BW amount of tokens is added to the bucket for that Q, and hence if there are enough tokens available for the packet (size) that is head of Q, it is taken.

As you can see, queue limit (how many packets I can buffer) and burst size (can I borrow tokens from the future) determine here whether a packet is ready for xmit or gets buffered.

To answer your question more directly I guess, CONTROL, Premium and SILVER all have the same scheduling priority, but they all have a reserved BW on the circuit.

VOICE is always DQ'd first for as long as there are packets in its queue. Hence a policer is very important because that rate limits what enters that Q. If there is no policer on a PQ, it could *starve* all the BW on the circuit!
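The ratio-wise split of the leftover bandwidth described in point 2 above can be sketched like this. It is a simplification that assumes all bandwidth classes are congested at the same time; the function name, the 1G link and the 100 Mbps of priority traffic are hypothetical values for illustration.

```python
def class_rates_kbps(link_kbps, priority_kbps, weights):
    """Split what is left after priority traffic according to class weights."""
    leftover = max(link_kbps - priority_kbps, 0)
    total_weight = sum(weights.values())
    return {name: leftover * w / total_weight for name, w in weights.items()}


# Class X has a BW of 10 and class Y has a BW of 20, as in the example above
rates = class_rates_kbps(
    link_kbps=1_000_000,    # assumed 1G circuit
    priority_kbps=100_000,  # assumed load in the priority queue
    weights={"X": 10, "Y": 20},
)
print(rates)  # Y gets twice what X gets out of the leftover 900000 kbps
```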

xander

racarvalho · ‎03-24-2016

Thanks for the quick reply Xander,

Just to make sure I understood correctly: for the policy to serve first the VOICE, then CONTROL, then PREMIUM, then SILVER and finally default (I understood that it's not first but more, right?), the policy should look like this.

policy-map EGRESS
 class VOICE
  priority level 1
  police rate percent 10 
!
 class CONTROL
  bandwidth remaining percent 40
!
 class PREMIUM
  bandwidth remaining percent 30
  random-detect 10 ms 20 ms
 !
 class SILVER
  bandwidth remaining percent 20
  random-detect 10 ms 20 ms
 ! 
 class class-default
  bandwidth remaining percent 10
  random-detect 60 ms 80 ms 
 !

this way we have:

VOICE - LLQ served always

CONTROL - WRR served 4 times

PREMIUM - WRR served 3 times

SILVER - WRR served 2 times

default - WRR served 1 times

(i know its not really like this, this is just a simplification for comprehension)

Reasoning from the above, when in congestion, the premium class has 4 times less probability of being dropped than the default; is this a correct assumption?

Just to conclude, the "bandwidth remaining percent" is not the interface bandwidth but the queuing bandwidth (tokens).

Best regards,

RAC

xthuijs · ‎03-24-2016

Nice! yeah you got it RAC! :)

and correct too, it is not the intf bw, but the remaining bw after all prio stuff is served, this is known as "service rate". More info on service rate check the sandiego id 2904 CL preso if you're interested.

xander

bn.thiyagarajan · ‎04-05-2016

Hello Aleksander,

Can we use MOD80 SE/TR for your said option 1. Since MOD 80 has one NP per subslot, do we have the option of proper load-balancing in terms of an interface configured with qos.

Warm regards,

Thiyagarajan B

xthuijs · ‎04-05-2016

yeah if you use a say 4x10MPA, these 4 intfs are served by a single npu.

xander

bn.thiyagarajan · ‎04-05-2016

Thanks Xander, Can we have the bundle as l2 and create a bvi for l3 for load-balancing traffic when qos is applied. Will that work out?

Warm Regards,

Thiyagarajan B

xthuijs · ‎04-05-2016

qos on bvi is limited. If you'd only have the bundle efp and the bvi in the bridge-domain, you'd be impacting pps due to the use of bvi (requiring an l2 and l3 pass); in that case it is best to use an L3 efp on the bundle directly. The loadbalancing over the bundle members is (configurably) the same between L2 and L3 bundle interfaces. cheers xander

wblackcenic · ‎05-06-2016

As always, great information. I have a question about the order in which QoS is acted-on for the input direction.

If we were to match input traffic for IP Prec 5 or 6 and re-write that traffic to IP Prec 0, would that have an impact on any traffic destined to the local device, such as BGP protocol traffic that is internally set to IP Prec 6. In other words, is the BGP traffic marked with IP Prec 6 punted to the CPU before it might be re-written to Prec 0? I want to confirm that we won't negatively impact control-plane traffic with Prec 5/6 markings.

If I need to further clarify my question, please let me know.

Thanks.

xthuijs · ‎05-06-2016

hey! :) thank you! :) traffic "for me" is bypassing ingress QOS and handled via lpts policers, so a set prec whatever should not affect bgp etc.

cheers!

xander

catalin.petrescu · ‎05-09-2016

hi Xander ,

got confirmation from tac under CSCuy85148: multicast traffic is treated differently than unicast.

Using 4.3.4 . Not sure about other versions.

Regards,

Catalin

xthuijs · ‎05-09-2016

hi catalin, I somehow missed addressing the original question you posted 6 months ago, shame on me! :)

There was a decision made for mcast, or more specifically mvpn, or encapped multicast to use the inner prec instead of the outer exp. this change was in 42<something>

The ddts that describes that is CSCtr35679 It is used to leverage inner IP PREC to egress COS instead of outer EXP for MVPN decap path.

cheers

xander