Losing ALL WAN connectivity - MLPPP / BGP / MPLS

rkendrick_2
Level 1

I have an issue at one of my sites: it loses ALL WAN connectivity (VoIP calls drop and other WAN-bound access halts, including ICMP to remote nodes) whenever 1 out of 8 T1s bounces or drops.  This is really an issue at this site because there are nearly 250 VoIP users taking calls, and a circuit bounce totally disrupts their workflow.  On one occasion I did a manual shutdown of the errant T1 serial interface and the users still lost connectivity.  Connectivity returns after 30-60 seconds.  I think it may have something to do with my BGP configuration.  We have an MPLS network.  The router (Nebula) has 8 bonded T1 circuits, all configured into multilink PPP group 1969 with a Multilink1969 interface.  The upstream PE router has the same configuration.
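
The relevant part of the bundle and BGP configuration looks roughly like this (serial numbering, the local address, and our AS number are placeholders; the attached config has the real values):

interface Multilink1969
 ! local side of the /30 facing the PE (placeholder address)
 ip address 172.17.1.17 255.255.255.252
 ppp multilink
 ppp multilink group 1969
!
interface Serial0/1/0:0
 no ip address
 encapsulation ppp
 ppp multilink
 ppp multilink group 1969
!
! ...the other seven T1 serial interfaces are configured the same way
!
router bgp 65000
 ! 65000 is a placeholder for our AS; 12272 is the provider AS
 neighbor 172.17.1.18 remote-as 12272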

I've attached the running config.  Please help me resolve this issue.  Thanks.

4 Replies

Shelley Bhalla
Level 3

2 things you can try...

1/ Remove ppp multilink fragment disable and test again.

2/ Remove ip accounting output-packets and test again.
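
In other words, something along these lines on the bundle interface (assuming both settings live under the Multilink interface; if they are on the member serials instead, remove them there), testing after each change:

Nebula(config)# interface Multilink1969
Nebula(config-if)# no ppp multilink fragment disable
Nebula(config-if)# no ip accounting output-packets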

Also, does upgrading to the latest/greatest IOS fix the issue?

This will help narrow down the issue.

Also, what do the logs suggest? Anything interesting recorded there when this issue occurs?

Shelley.

Giuseppe Larosa
Hall of Fame

Hello RKendrick,

>>  whenever 1 out of 8 T1s bounces or drops

>> The connectivity returns after 30-60 seconds.

>> I think it may have something to do with my BGP configuration.

You have a single PPP bundle made of 8 T1s, but when one T1 fails there are connectivity problems.

It is not clear whether this happens only when one specific T1 line fails or regardless of which line fails.

We cannot know what BGP timers the provider side is using; on your side you are using the defaults of 60 seconds keepalive and 180 seconds holdtime.

If the AT&T router also uses the default settings, BGP takes 180 seconds to declare the session down. You can easily check with show ip bgp summary, which gives the uptime of the eBGP session.

I would suggest the following:

Divide the 8 x T1 bundle into two 4 x T1 bundles, so that if one T1 fails you still have a working 4 x T1 bundle.

This needs cooperation with the provider, and you also need faster detection of failures, either using lower BGP timers or using BFD (if supported by your device and by the SP node).
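
A rough sketch of the faster-detection part (timer and BFD values are only examples, both need to be agreed with the SP side, and BFD also depends on your platform supporting it on a multilink interface):

router bgp <your AS>
 ! option 1: lower keepalive/holdtime for this neighbor, for example 10/30 seconds
 neighbor 172.17.1.18 timers 10 30
 ! option 2: BFD, if supported on both ends of the bundle
 neighbor 172.17.1.18 fall-over bfd
!
interface Multilink1969
 bfd interval 250 min_rx 250 multiplier 3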

Your current design puts all the eggs in the same basket, so anything that impairs the bundle gives you these 30-60 seconds of connectivity problems.

Hope to help

Giuseppe

rkendrick_2
Level 1

Can I simply start by reducing the BGP timers for holdtime and keepalives?

Why don't the bundle and traffic just operate smoothly with only 1 of the 8 T1s down?  I'm not understanding why one errant T1 would cause the BGP disruption.  The other seven T1s would be just fine and experiencing no errors.

sh ip bgp nei
BGP neighbor is 172.17.1.18,  remote AS 12272, external link
  BGP version 4, remote router ID x.x.x.233
  BGP state = Established, up for 4w0d
  Last read 00:00:59, last write 00:00:50, hold time is 180, keepalive interval is 60 seconds


  Neighbor capabilities:
    Route refresh: advertised and received(new)
    Address family IPv4 Unicast: advertised and received
  Message statistics:
    InQ depth is 0
    OutQ depth is 0
   
                         Sent       Rcvd
    Opens:                  2          2
    Notifications:          0          1
    Updates:                4        721
    Keepalives:        131219     132203
    Route Refresh:          0          0
    Total:             131225     132927
  Default minimum time between advertisement runs is 30 seconds

Giuseppe Larosa
Hall of Fame

Hello RKendrick,

The BGP timers are the defaults, and the BGP session looks stable:

>> BGP state = Established, up for 4w0d

The problem is in the bundle. Reducing the timers may help detect the connectivity issue faster if there is an alternate path; otherwise it just resets the BGP session.

>> I'm not understanding why 1 errant T1 would cause the BGP disruption.  The other 7 T1s would be just fine and experiencing no errors.

Again, the BGP session is stable as shown above. It is the forwarding plane, the capability to send packets, that is affected, because the MLPPP bundle has a 30-60 second "block".

Dividing the bundle into two groups may even reduce the duration of the connectivity issue, and it gives you a way to keep one working path, as sketched below.
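
Something like this on your side (the provider has to mirror it; the second /30, the serial numbering, and the AS are placeholders):

interface Multilink1969
 ! bundle A - keeps four of the eight T1s and the existing /30
 ppp multilink
 ppp multilink group 1969
!
interface Multilink1970
 ! bundle B - new /30 agreed with the provider
 ip address <second /30 address> 255.255.255.252
 ppp multilink
 ppp multilink group 1970
!
! move four of the member serials to the new bundle, for example:
interface Serial0/1/4:0
 ppp multilink group 1970
! ...repeat for the other three moved T1s
!
router bgp <your AS>
 ! a second eBGP session over bundle B gives the alternate path
 neighbor <PE address on bundle B> remote-as 12272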

I agree that it requires some effort.

Hope to help

Giuseppe

