We have a hub-and-spoke ISP network that connects some of our remote sites back to our core network. It consists of a 50 Mbps headend circuit at our data center, with various sizes of smaller circuits at the remote sites (1.5 Mbps all the way up to 20 Mbps) that connect back to the headend circuit. The 50 Mbps headend circuit connects to an ISR4451-X running IOS-XE v16.9.2 (our current IOS-XE standard version for a variety of reasons). EIGRP is utilized in this network to advertise networks at the remote sites to the headend router.
Our QoS policy on the headend router consists of a nested policy whose goal is to primarily accomplish two things: shape traffic to the size of the circuit at each site, and prioritize traffic based on a standard 5-class model (realtime, control, critical data, scavenger, class-default) - the main benefit being the priority realtime queue for voice traffic, since we utilize Cisco VoIP at all of our sites.
Below is an example of the policy configuration for a single 5 Mbps site and the configuration of the nested policy-map, both configured on the headend router. Comments are preceded by an !, and I have named the ACL, class-map, and policy-maps consistently to hopefully make it easier to follow:
! First entry is the remote site Data network, second is the remote site Voice network,
! third is the IP on the remote site router that faces the ISP network,
! fourth is the loopback IP for the remote site router
ip access-list extended acl_Remote-Site-1
 permit ip any 10.1.1.0 0.0.0.255
 permit ip any 10.2.2.0 0.0.0.255
 permit ip any host 10.3.3.10
 permit ip any host 10.4.4.10

class-map match-any cm_Remote-Site-1
 match access-group name acl_Remote-Site-1

! Shaping traffic to 5 Mbps for the specified class-map and calling the nested policy-map
policy-map pm_WAN-Network
 class cm_Remote-Site-1
  shape average 5000000
  service-policy pm_WAN-Branch-5-Class

! Applying the policy-map to the ISP-facing interface on the ISR4451-X
interface g0/0/1
 service-policy output pm_WAN-Network
This is the configuration for the nested policy-map. It is a standard 5-class model:
class-map match-any REALTIME
 match dscp ef
 match dscp cs5
 match dscp cs4
class-map match-any CONTROL
 match dscp cs6
 match dscp cs3
 match dscp cs2
class-map match-any CRITICAL-DATA
 match dscp af41 af42 af43
 match dscp af31 af32 af33
 match dscp af21 af22 af23
 match dscp af11 af12 af13
class-map match-all SCAVENGER
 match dscp cs1

policy-map pm_WAN-Branch-5-Class
 class REALTIME
  priority percent 32
 class CONTROL
  bandwidth percent 7
 class CRITICAL-DATA
  bandwidth percent 35
  fair-queue
  random-detect dscp-based
 class SCAVENGER
  bandwidth percent 1
 class class-default
  bandwidth percent 25
  fair-queue
  random-detect dscp-based
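For reference, here is a quick sketch (in Python, using the class names and shape rate from the example above) of what those percentages work out to under the 5 Mbps shaper. This is just arithmetic for illustration, not anything the router runs:

```python
# Translate the 5-class percentages into guaranteed rates under the
# 5 Mbps parent shaper (shape average 5000000 in the example above).
SHAPE_BPS = 5_000_000

child_policy = {           # class name -> percent of parent shape rate
    "REALTIME": 32,        # priority percent
    "CONTROL": 7,
    "CRITICAL-DATA": 35,
    "SCAVENGER": 1,
    "class-default": 25,
}

# Every percent of the parent shape rate is explicitly allocated;
# allocations over 100 percent would be rejected by the router.
assert sum(child_policy.values()) == 100

for name, pct in child_policy.items():
    print(f"{name}: {SHAPE_BPS * pct // 100} bps")
# REALTIME: 1600000 bps, CONTROL: 350000 bps, CRITICAL-DATA: 1750000 bps,
# SCAVENGER: 50000 bps, class-default: 1250000 bps
```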
This is a configuration I inherited and have since modified to apply consistently to all sites on the hub-and-spoke ISP network, and to better fit a company-wide QoS policy, should that ever come to fruition. (The config I inherited was applied haphazardly and basically only separated voice from non-voice traffic; in some cases it didn't even do that.)
My questions/concerns regarding this config:
To your second question: not that I'm aware of, although on newer devices like yours you could run a script on the router itself that examines the route table from time to time and rewrites your remote-site QoS ACLs to match the destination routes for each site. The "hardest" part of such a script might be "knowing" what bandwidth a particular remote site should be shaped to. (Possibly you could "key" off a table, of some kind, that maps each remote's next-hop IP to its expected bandwidth.) [BTW, keeping shaper bandwidth allocations accurate is another maintenance issue. I've seen a site's bandwidth physically upgraded, but then no one remembered to update the shaper allowance. For some "reason", users saw no improvement in performance - ouch!]
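A rough sketch of that table-keyed idea, in Python. The site names, prefixes, and rates here are invented for illustration; a real version would pull each site's prefixes from the route table (via EEM, RESTCONF, Netmiko, or similar) rather than hard-coding them:

```python
# Hypothetical generator for per-site QoS config, keyed off a table that
# maps each remote's next-hop IP to its site name, shaped bandwidth, and
# advertised prefixes. Simplified: it only permits the site prefixes and
# the next-hop host, not the loopback as in the original ACL.
SITES = {
    # next-hop IP: (site name, shaped bandwidth in bps, prefixes as ACL wildcards)
    "10.3.3.10": ("Remote-Site-1", 5_000_000,
                  ["10.1.1.0 0.0.0.255", "10.2.2.0 0.0.0.255"]),
    "10.3.3.20": ("Remote-Site-2", 1_544_000,
                  ["10.5.5.0 0.0.0.255"]),
}

def render_site_qos(next_hop: str) -> str:
    """Render the ACL, class-map, and shaper entry for one remote site."""
    name, bps, prefixes = SITES[next_hop]
    lines = [f"ip access-list extended acl_{name}"]
    lines += [f" permit ip any {p}" for p in prefixes]
    lines += [
        f" permit ip any host {next_hop}",
        f"class-map match-any cm_{name}",
        f" match access-group name acl_{name}",
        "policy-map pm_WAN-Network",
        f" class cm_{name}",
        f"  shape average {bps}",
        "  service-policy pm_WAN-Branch-5-Class",
    ]
    return "\n".join(lines)

print(render_site_qos("10.3.3.10"))
```

The bandwidth column in the table is exactly the maintenance problem mentioned above: someone still has to keep it in sync with the physical circuits, but at least it lives in one place instead of being scattered across shaper statements.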
To your first question, is your policy ideal for your goals? Possibly not. Notwithstanding many minor points I might comment on (if you want to go into those, let me know), the two biggest items might be, first: what are you doing for QoS at the remote sites? You could have congestion both leaving a remote site and entering the hub site (the latter if the sum of all the remotes' maximums can exceed 50 Mbps).
Second, what do you do to manage hub link egress congestion, if possible (does the sum of all the remote shapers exceed 50 Mbps)?
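That oversubscription check is just arithmetic; here is a sketch with invented per-site numbers (your real shaper values would come from the actual config):

```python
# If the sum of the per-site shapers exceeds the 50 Mbps headend circuit,
# the hub link itself can congest even though every individual site is
# shaped correctly. Shaper values below are made up for illustration.
HEADEND_BPS = 50_000_000

site_shapers_bps = [5_000_000, 20_000_000, 10_000_000, 10_000_000, 10_000_000]

total = sum(site_shapers_bps)
print(total, "oversubscribed" if total > HEADEND_BPS else "fits")
# 55000000 oversubscribed
```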
BTW, I believe many Cisco shapers do not account for L2 overhead. If so, you need to shape (about) 15% slower to allow for it.
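A back-of-envelope for that rule of thumb. The 15% figure is my estimate, not a Cisco constant; actual overhead depends on the L2 encapsulation and your packet-size mix:

```python
# If the shaper counts only L3 bytes, shape below the circuit rate so
# that L3 payload plus L2 overhead still fits the physical circuit.
# The 15% default is a rough rule of thumb, not a measured value.
def adjusted_shape_rate(circuit_bps: int, l2_overhead: float = 0.15) -> int:
    """Return a shaper rate reduced to leave headroom for L2 overhead."""
    return round(circuit_bps * (1 - l2_overhead))

print(adjusted_shape_rate(5_000_000))   # 4250000 for the 5 Mbps site
print(adjusted_shape_rate(50_000_000))  # 42500000 for the headend
```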