VoIP QoS problems over point to point WAN

dbrown · ‎01-28-2012

I have a customer with a main office and 9 branch offices. Their phone system is an NEC NetLink, which I admittedly know little about, as I was not involved in its implementation. The main office has an ASR 1001 with a trunk Ethernet handoff to Time Warner, and each branch office has a 2921 with an access handoff to Time Warner. The branches are a mix of 5Mbps and 50Mbps circuits, and the 5Mbps circuits often get saturated, resulting in the phone system going offline.

The way the phone guy explained it to me is that there are hosts at each branch that communicate back to the 'main' host via a TCP heartbeat. Within the phone system itself, all signaling (heartbeat) traffic is marked with DSCP 24 (CS3) and the RTP voice traffic is marked with DSCP 46 (EF). The Time Warner 'cloud' is nothing more than L2 VLANs using QinQ to provide virtual point-to-point links, with no QoS in between (just traffic policing to ensure we stay near our CIR).

Here is the sanitized config for the main office ASR:

!
class-map match-any voip-signaling
 match dscp cs3
class-map match-any voip-rtp
 match dscp ef
class-map match-any phone
 match access-group 103
 match qos-group 46
 match qos-group 24
 match dscp ef
 match dscp cs3
!

policy-map Branch1
 class phone
  priority 384 2048
policy-map Branch2
 class phone
  priority 384 2048
policy-map Branch3
 class phone
  priority 384 2048
policy-map Branch4
 class phone
  priority 384 2048
policy-map Branch5
 class phone
  priority 384 2048
policy-map Branch6
 class phone
  priority 384 2048
policy-map Branch7-voip
 class voip-signaling
  bandwidth 500
  set dscp cs3
 class voip-rtp
  set dscp ef
  priority 2000
policy-map Branch8
 class phone
  priority 2048
policy-map Branch9
 class phone
  priority 384 2048
policy-map Branch1-Shape
 class class-default
  shape average 50000000
  service-policy Branch1
policy-map Branch2-Shape
 class class-default
  shape average 5000000
  service-policy Branch2
policy-map Branch3-Shape
 class class-default
  shape average 5000000
  service-policy Branch3
policy-map Branch4-Shape
 class class-default
  shape average 5000000
  service-policy Branch4
policy-map Branch5-Shape
 class class-default
  shape average 50000000
  service-policy Branch5
policy-map Branch6-Shape
 class class-default
  shape average 50000000
  service-policy Branch6
policy-map Branch7-Shape
 class class-default
  shape average 5000000
  service-policy Branch7-voip
policy-map Branch8-Shape
 class class-default
  shape average 50000000
  service-policy Branch8
policy-map Branch9-Shape
 class class-default
  shape average 5000000
  service-policy Branch9

!
interface GigabitEthernet0/0/1
 no ip address
 speed 1000
 no negotiation auto
!
interface GigabitEthernet0/0/1.1525
 description Branch1 50Mbps link
 bandwidth 50000
 encapsulation dot1Q 1525
 ip address 192.168.254.5 255.255.255.252
 service-policy output Branch1-Shape
!
interface GigabitEthernet0/0/1.1526
 description Branch3 5Mbps Link
 bandwidth 5000
 encapsulation dot1Q 1526
 ip address 192.168.254.13 255.255.255.252
 service-policy output Branch3-Shape
!
interface GigabitEthernet0/0/1.1527
 description Branch2 5Mbps Link
 bandwidth 5000
 encapsulation dot1Q 1527
 ip address 192.168.254.9 255.255.255.252
 service-policy output Branch2-Shape
!
interface GigabitEthernet0/0/1.1528
 description Branch4 5Mbps Link
 bandwidth 5000
 encapsulation dot1Q 1528
 ip address 192.168.254.17 255.255.255.252
 service-policy output Branch4-Shape
!
interface GigabitEthernet0/0/1.1529
 description Branch5 50Mbps Link
 bandwidth 50000
 encapsulation dot1Q 1529
 ip address 192.168.254.21 255.255.255.252
 service-policy output Branch5-Shape
!
interface GigabitEthernet0/0/1.1530
 description Branch6 50Mbps Link
 bandwidth 50000
 encapsulation dot1Q 1530
 ip address 192.168.254.25 255.255.255.252
 service-policy output Branch6-Shape
!
interface GigabitEthernet0/0/1.1531
 description Branch7 5Mbps Link
 bandwidth 5000
 encapsulation dot1Q 1531
 ip address 192.168.254.29 255.255.255.252
 service-policy output Branch7-Shape
!
interface GigabitEthernet0/0/1.1532
 description Branch8 50Mbps Link
 bandwidth 50000
 encapsulation dot1Q 1532
 ip address 192.168.254.33 255.255.255.252
 service-policy output Branch8-Shape
!
interface GigabitEthernet0/0/1.1533
 description Branch9 5Mbps Link
 bandwidth 5000
 encapsulation dot1Q 1533
 ip address 192.168.254.37 255.255.255.252
 service-policy output Branch9-Shape
!

The two branches experiencing outages are Branch3 and Branch7. During the outages, I can see that the point-to-point circuit for that branch is completely saturated (from main office to branch; usage is very low the other way). The strange thing is that Branch2 and Branch4 appear to be unaffected, although they also have 5Mbps circuits that actually seem to stay saturated more than Branch3 and Branch7. Based on the config snippet above, does this seem like a QoS problem on the ASR or a misconfiguration in the phone system? I lean toward the latter, but I keep getting pushback that it is a network issue. As you can see above, I changed the policy-map for Branch7, but it made no difference whatsoever.

Mohamed Sobair · ‎01-28-2012

Hello,

Where is your phone system located at the HQ? Do you have classification and marking on the interface connected to the phone system (the voice VLAN interface)? Can you post the complete config, highlighting the voice VLAN interface?

Regards,

Mohamed

dbrown · ‎01-28-2012

The phone system consists of several NEC SV8100s (one at each branch), with the 'main' one being at Branch8. As I said before, I was not involved in the implementation, or it would have been at the main branch. As it is now, all voice traffic must traverse the circuit back to the main office, then the circuit to Branch8, which is not very efficient. That said, the link to Branch8 is never completely saturated (seldom exceeds 50% of the CIR), and I can see very clearly that the point of contention is the egress interface of the ASR.

This is a pretty flat network, too. Everything rides on the default VLAN (1), and the switches are a mix of Netgear and Cisco (2950) with very flat configs. The tagging I referred to is supposedly done by the phone system itself.

Joseph W. Doherty · ‎01-29-2012

Disclaimer

The Author of this posting offers the information contained within this posting without consideration and with the reader's understanding that there's no implied or expressed suitability or fitness for any purpose. Information provided is for informational purposes only and should not be construed as rendering professional advice of any kind. Usage of this posting's information is solely at reader's own risk.

Liability Disclaimer

In no event shall Author be liable for any damages whatsoever (including, without limitation, damages for loss of use, data or profit) arising out of the use or inability to use the posting's information even if Author has been advised of the possibility of such damage.

Posting

What's curious is your description that two branches have issues with link saturation and two don't.

Is it possible there's branch-to-branch traffic? If there is, it's possible that when a branch is saturated from the HQ, another branch oversubscribes bandwidth to the same destination branch.

Have you confirmed/tested you can obtain the same bandwidth to "like" branches? It's possible your service provider's policing (I'm assuming branches with 5 Mbps have "faster" physical handoffs) differs between 5 Mbps links.

In your last post, when you say ". . . I can see very clearly that the point of contention is the egress interface of the ASR.", are you referring to your subinterfaces with their shapers or to the physical interface itself? Does the HQ link also have less than physical interface bandwidth?

Have you confirmed that the problem HQ subinterfaces, when congested, show traffic flowing through the LLQ class? Have you also confirmed the LLQ class isn't discarding packets? The LLQ appears to be configured with a non-default (?) Bc; if so, why? (NB: on the Bc question, there may be nothing wrong with the Bc setting, but if it has been manually configured, I would like to understand the reasoning for that setting.)
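
As a rough sketch of how to check both points, the per-class counters can be read from the HQ ASR while a link is congested. The subinterface numbers below are taken from the posted config (Branch3 and Branch7); nothing else is assumed about the setup.

```
! Branch3: the "phone" class carries both RTP and signaling
show policy-map interface GigabitEthernet0/0/1.1526 output
!
! Branch7: separate voip-rtp and voip-signaling classes
show policy-map interface GigabitEthernet0/0/1.1531 output
```

If the packet counters for the priority class stay at or near zero while calls are up, the traffic is probably not arriving with the expected markings. If the drop counters for that class increase during congestion, the priority rate or burst is likely too small for the offered voice load.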

dbrown · ‎01-29-2012

Branch to branch traffic still passes through the HQ, as this is not an MPLS cloud, but rather point to point links (logical star). We use MRTG to graph out bandwidth utilization, and the egress traffic for the ASR's subinterface connecting to Branch4 actually appears to stay saturated longer, yet they say their phones work fine. The more I look at it, the more it points to differences in the actual phone system between branches, as only 2 branches suffer if the QoS policies match across the board. The physical interface of the ASR is a 1Gbps full-duplex handoff, with the physical interfaces for all branches (5Mbps or 50Mbps) at 100Mbps full-duplex. I'm not sure about the 50Mbps links, but the ISP is policing traffic on the 5Mbps links to ensure I stay under 6Mbps. I'm using 'shape average 5000000' to keep the egress traffic from exceeding 5Mbps on those links, so I could try to raise it closer to the ISP's limit of 6Mbps, but our CIR is 5Mbps. According to MRTG, the router is doing what it is told, as I will see utilization spike to 5Mbps and sustain that level for an hour straight, but it never goes over 5Mbps.

The saturation is mostly in the morning, as all the PCs get powered on around the same time and many of the staff download some large reports regarding the previous day's activity. These reports are the cause of the congestion, and while they are being downloaded, the phones suffer. Sometimes it is just garbled voice from RTP packets being dropped, and sometimes the system reboots at one branch, with the latter being more common.

There is a TCP heartbeat from each branch back to Branch8, and if this heartbeat is lost, the local system reboots into a standalone operating mode. Once the heartbeat is re-established, the system reboots again to return to a master/slave operating mode.

Joseph W. Doherty · ‎01-29-2012

The easiest explanation is that something does involve the difference in phone systems, if that is truly the only real difference between sites that have this issue and sites that don't.

You may want to further investigate whether VoIP packets are being properly marked. You might try defining SLA tests using the same markings.
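
A minimal sketch of such tests, assuming IP SLA is what is used and that the Branch3 router sits at 192.168.254.14 (the other host address in the /30 configured on the HQ subinterface). The operation numbers 10 and 11 and the UDP port 16384 are arbitrary choices for illustration. The tos values are the DSCP markings shifted into the ToS byte: EF is 46 x 4 = 184 and CS3 is 24 x 4 = 96.

```
! On the HQ ASR: a voice-like probe marked EF
ip sla 10
 udp-jitter 192.168.254.14 16384 codec g711ulaw
 tos 184
ip sla schedule 10 life forever start-time now
!
! On the HQ ASR: a probe marked CS3, like the signaling traffic
ip sla 11
 icmp-echo 192.168.254.14
 tos 96
 frequency 60
ip sla schedule 11 life forever start-time now
!
! On the branch 2921: required as the target of the udp-jitter probe
ip sla responder
```

Comparing the loss and jitter reported by "show ip sla statistics" during the morning congestion against a quiet period should show whether traffic carrying these markings is actually being protected on that link.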

Do your MRTG graphs cover both sides of the link and confirm that one side's egress matches the other side's ingress?