Degraded link failover

slavb · ‎02-11-2005

Hi

Could anybody tell me how to force EIGRP to stop routing traffic over a link that suffers degraded reliability?

I had a problem where a router with two equal-metric Frame Relay links kept forwarding traffic over a link that was severely degraded but not down. The design provides redundancy with two links, but the end effect was that applications timed out and, to the customer, the network looked extremely slow.

I know that EIGRP can use the reliability of a circuit when calculating its metric, but I would like to make sure that adjusting the K values will not introduce new problems. Also, is there any way other than K value adjustment to ensure that, past a certain reliability threshold, the routing protocol routes all traffic through the good circuit only?

Thank you.

Georg Pauwen · ‎02-12-2005

Hello,

You might want to look into IP Event Dampening. This is a feature which, similar to BGP dampening, assigns penalties to flapping interfaces, thereby ensuring a more stable routing environment. IP Event Dampening is not specifically designed for EIGRP, but it might help in your situation.

Check this link for a detailed explanation of that feature:

IP Event Dampening

http://www.cisco.com/en/US/partner/products/sw/iosswrel/ps1839/products_feature_guide09186a0080110bc8.html

HTH,

GP

vcjones · ‎02-12-2005

Assuming you are using the term "degraded" to mean the link has a high bit error rate but is otherwise up (so the link is not flapping, it's just unusable), you have a real challenge. The problem is that the short hello packets used by EIGRP to test the link are getting through frequently enough to keep the link up, while full size packets have a very high failure rate.

Side note: Consider a link with a BER of 10^-4. Assuming independent bit errors, the probability of a 50 byte frame getting delivered is about 96%, while a 1500 byte frame survives the journey only about 30% of the time, and at a BER of 10^-3 that drops to effectively zero.
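
For anyone who wants to check the arithmetic in the side note, here is a minimal Python sketch. It assumes bit errors are independent (real circuits often have bursty errors, so treat the numbers as rough), and the function name is just for illustration. Under that assumption a BER of 1e-4 gives roughly 96% for a 50 byte frame and about 30% for a 1500 byte frame; the large-frame figure drops to essentially zero at 1e-3.

```python
def frame_delivery_probability(ber: float, frame_bytes: int) -> float:
    """Probability that every bit of a frame arrives intact.

    Assumes independent bit errors, which is a simplification.
    """
    bits = frame_bytes * 8
    return (1.0 - ber) ** bits


if __name__ == "__main__":
    # Compare a hello-sized frame with a full-size frame at several error rates.
    for ber in (1e-5, 1e-4, 1e-3):
        small = frame_delivery_probability(ber, 50)
        large = frame_delivery_probability(ber, 1500)
        print(f"BER {ber:.0e}: 50-byte {small:.2%}, 1500-byte {large:.4%}")
```

The gap between the two columns is the whole problem: the small frames that keep the neighbor relationship alive survive far more often than the traffic the link is supposed to carry.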

If this is your problem, welcome to the club. Short term, the only solution is to shut down the link until the phone company fixes it. Longer term, the "correct" solution is to use PPP Link Quality Monitoring to detect a failing link and shut it down at the link level.

Unfortunately, while Cisco supports LQM, when I last tested it, their implementation made it worthless. The link gets shut down when the percentage of failed frames exceeds a threshold. The problem is that the percentage is calculated on the total number of frames since the link was initialized rather than on a running average. As a result, a link which fails after a year of heavy use will take a long time to reach even a low threshold.
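
To illustrate why a since-initialization percentage reacts so slowly compared with a running average, here is a small Python model. It is not Cisco's LQM code, just a sketch of the behavior described above; the interval counts, frame counts, loss fraction, and threshold are all made-up inputs.

```python
def loss_percent(samples):
    """Loss percentage over a list of (frames_sent, frames_lost) samples."""
    sent = sum(s for s, _ in samples)
    lost = sum(l for _, l in samples)
    return 100.0 * lost / sent if sent else 0.0


def intervals_until_shutdown(healthy_intervals, frames_per_interval,
                             bad_loss_fraction, threshold_percent,
                             window=None, max_bad_intervals=100_000):
    """Count degraded intervals needed before the loss figure exceeds the threshold.

    window=None models a percentage taken over all frames since link-up;
    window=N models a running figure over the last N intervals only.
    Returns None if the threshold is never exceeded within max_bad_intervals.
    """
    history = [(frames_per_interval, 0)] * healthy_intervals
    lost_per_bad = int(frames_per_interval * bad_loss_fraction)
    for n in range(1, max_bad_intervals + 1):
        history.append((frames_per_interval, lost_per_bad))
        sample = history if window is None else history[-window:]
        if loss_percent(sample) > threshold_percent:
            return n
    return None


if __name__ == "__main__":
    # Hypothetical link: 1000 clean intervals, then 50% frame loss, 10% threshold.
    since_init = intervals_until_shutdown(1000, 10_000, 0.5, 10.0)
    running = intervals_until_shutdown(1000, 10_000, 0.5, 10.0, window=5)
    print("since link-up:", since_init, "intervals")
    print("5-interval window:", running, "intervals")
```

The longer the clean history, the more bad intervals the since-link-up figure needs, while the windowed figure reacts in about the same time regardless of how long the link has been up.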

Another possible solution (this problem has been around for a LONG time) is to use a routing protocol, such as IS-IS, which supports padding the hello packets to an arbitrary size. This approach is another good idea which turns out to sound better than it works. As you can imagine, the impact on efficiency makes this approach questionable on slow links, where BER problems are most likely to occur.

It also turns out to be too slow to react, as the link will stay up as long as at least one out of every three hellos gets through, while the performance impact typically becomes unacceptable when the failure rate is still in single digit percentages.
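
A rough way to put numbers on "too slow to react": if the neighbor only drops after three hellos in a row are lost, the expected wait for that to happen can be estimated with the standard formula for the waiting time until a run of consecutive events. This Python sketch assumes independent hello losses, and the 10 second hello interval is an assumed value to be replaced with whatever is configured.

```python
def expected_hellos_until_drop(loss_rate: float, missed_needed: int = 3) -> float:
    """Expected number of hellos sent before `missed_needed` are lost in a row.

    Uses E = (1 - p^k) / (p^k * (1 - p)) for a run of k events of probability p.
    Assumes each hello is lost independently with probability `loss_rate`.
    """
    if not 0.0 < loss_rate < 1.0:
        raise ValueError("loss_rate must be strictly between 0 and 1")
    p_run = loss_rate ** missed_needed
    return (1.0 - p_run) / (p_run * (1.0 - loss_rate))


if __name__ == "__main__":
    hello_interval_s = 10  # assumed hello interval, not taken from any config
    for loss in (0.05, 0.10, 0.30, 0.50):
        hellos = expected_hellos_until_drop(loss)
        hours = hellos * hello_interval_s / 3600
        print(f"hello loss {loss:.0%}: ~{hellos:,.0f} hellos, ~{hours:.1f} hours")
```

At single digit loss percentages the expected wait runs to many hours, which matches the experience that users are suffering long before the adjacency ever drops.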

Currently, the only viable tool is monitoring error rates using a network management tool and getting the telco to repair the links before the BER gets high enough to be significant. Easier said than done. Meanwhile, you can try PPP LQM (if PPP is supported on your interfaces) and beat up on your Cisco account team to fix the implementation so it is useful (which may have already happened, although I have not noticed any announcements to that effect, which is why I suggest testing first).

Good luck and have fun! Been there, done that, been burnt :-[

Vincent C Jones

www.networkingunlimited.com

slavb · ‎02-14-2005

Thank you both very much for a quick response.

From your responses I see that there is no definitive answer to this problem. It does make me wonder, though: Cisco pushes so many advanced features in their advertisements and yet does not offer a solution to what seems to be a fundamental issue. The site in question is a hospital. I hope this event will give me some ammunition to argue the need for a reliable network monitoring tool, which they presently lack and are not willing to spend money on.

Best

Slav

jroyster · ‎02-14-2005

Well, you could adjust the K values to include reliability in the metric. Does the "reliability" actually show less than 255/255? I don't know if that counts line protocol or physical interface.

The only other thing might be using SAA (Service Assurance Agent, I think), which you can use to send test packets/applications and take action accordingly.
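
To get a feel for what including reliability does to the metric, here is a Python sketch of the classic EIGRP composite metric formula. It uses floating point where the router uses integer arithmetic, so results will differ slightly from what the router shows, and the T1 bandwidth and delay figures in the example are assumed values.

```python
def eigrp_classic_metric(min_bandwidth_kbps, total_delay_usec,
                         reliability=255, load=1,
                         k1=1, k2=0, k3=1, k4=0, k5=0):
    """Classic EIGRP composite metric (floating-point approximation).

    metric = 256 * (K1*bw + K2*bw/(256 - load) + K3*delay)
    and, only when K5 is non-zero, multiplied by K5 / (reliability + K4).
    bw is 10^7 / lowest path bandwidth in kbps; delay is in tens of microseconds.
    """
    bw = 10_000_000 / min_bandwidth_kbps
    delay = total_delay_usec / 10
    metric = k1 * bw + (k2 * bw) / (256 - load) + k3 * delay
    if k5 != 0:
        metric *= k5 / (reliability + k4)
    return 256 * metric


if __name__ == "__main__":
    # Assumed T1-like link: 1544 kbps, 20000 usec delay.
    print("defaults:", round(eigrp_classic_metric(1544, 20000)))
    for rel in (255, 200, 128):
        m = eigrp_classic_metric(1544, 20000, reliability=rel, k5=1)
        print(f"K5=1, reliability {rel}/255:", round(m))
```

Two cautions before trying this: K values have to match on both neighbors or the adjacency will not form, and as far as I know EIGRP does not send a new update just because the interface reliability value changes, so test whether the metric actually moves when the circuit degrades.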

Of course the real answer here is to get the provider to fix the circuit.

;-)