UDLD and how to recover automatically after failure

Hi All,

I have a question about automatic recovery after a failure that was detected by UDLD.

Background: We are running dual Ethernet circuits provided by a carrier between two remote sites. Both circuits are configured as a cross-stack EtherChannel on a 3750X stack at each site. The problem we are facing is that we have had a couple of outages caused by a logical error of some kind in the carrier network. Both sites had link on both circuits and the EtherChannel was up, although one of the circuits was no longer bi-directional and was not passing traffic in one direction. This caused an outage because, with load balancing on the EtherChannel, some packets went down the uni-directional circuit and did not make it to the other site.

We are now thinking of using UDLD to detect such failures. The idea is to have UDLD in aggressive mode disable the circuit when it becomes uni-directional, so that it is unbundled from the EtherChannel and all packets use the remaining bi-directional circuit.
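
A minimal sketch of that idea, for illustration only. The interface names (GigabitEthernet1/0/1 and GigabitEthernet2/0/1, one per stack member), the channel-group number, the static "on" mode and the 300-second recovery interval are all assumptions, not taken from the actual setup. It also assumes that UDLD is enabled on both sites and that the carrier passes UDLD frames transparently.

```
! Sketch only: interface names, channel-group number, mode and timer are assumptions
errdisable recovery cause udld
errdisable recovery interval 300
!
interface GigabitEthernet1/0/1
 description Carrier circuit A (hypothetical)
 udld port aggressive
 channel-group 1 mode on
!
interface GigabitEthernet2/0/1
 description Carrier circuit B (hypothetical)
 udld port aggressive
 channel-group 1 mode on
```

The two errdisable lines are what introduce the automatic recovery that the question below is about; without them, a port disabled by UDLD stays down until it is re-enabled manually.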

My concern is the automatic recovery after the error-disable timer has expired:

Does UDLD check that the circuit is bi-directional before re-enabling it after the error-disable timer has expired? If it doesn't, that would cause an outage because packets would flow down the uni-directional circuit. Or even worse, if the circuit does not become bi-directional, it would stay up although uni-directional, because UDLD will only disable the circuit when it goes from a bi-directional state to a uni-directional state and not from a disabled state to a uni-directional state.

Thanks for reading. Any comments are appreciated.

Thanks

Markus

10 Replies

Peter Paluch
Cisco Employee

Hi Markus,

"Does UDLD check the circuit for being bi-directional prior to re-enabling it after the error-disable timer has expired?"

UDLD checks the circuit after it is reenabled. If the err-disable recovery puts a port previously disabled by UDLD back into service, UDLD will test whether the link is uni-directional. However, this happens while the port is up and operating. In other words, the port will come up, perhaps start transmitting data, and at the same time, UDLD will try to verify if the link is uni-directional. If it is, it will disable the port again (within 30-60 seconds I believe; I can debug this for you).

Best regards,

Peter

Hi Peter,

thanks for your comment. Have you seen UDLD behave that way yourself? I'm asking because I found the following comment on a non-Cisco blog, which is what started me thinking about this in the first place:

http://blog.ine.com/2008/07/05/udld-modes-of-operation/

" At least in my opinion biggest problem with UDLD is it’s inability to recover from fault state. Sure, it disables port in aggressive mode and errdisable recovery re-enables port after configured delay. However recovery is done blindly without checking if UDLD partner has actually come back or not. Port is simply enabled and no further UDLD processing is done on that port until partner has returned and port has changed to bidirectional mode at least once. After that if new fault has occurred it will take port down as expected. For this reason UDLD is fine when not using errdisable recovery or running it in non-aggressive mode. Which also means you’re prepared to always manually fix problem and have off-band management access to all of your network equipment. For automated operations UDLD offers no help making it completely useless for many setups where such monitoring would be needed (dumb fiber transceivers, EoMPLS etc). Based on comments where people claim they use UDLD successfully makes me believe they have never actually tested different fault scenarios and simply assume it will function properly when needed."

Kind regards

Markus

Hi Markus,

I've just tested it now in UDLD Normal mode - I've interconnected two switches, Sw1 and Sw2, on their Fa0/1 ports. Sw1 is configured as follows:

mac access-list extended Deny
 deny   any any
!
interface FastEthernet0/1
 udld port
 mac access-group Deny in

In essence, Sw1 is emulating a link endpoint towards which the "fiber" is broken, i.e. it does not hear Sw2. Sw2 is configured simply as:

errdisable recovery cause udld
errdisable recovery interval 30
!
interface FastEthernet0/1
 udld port

Now this is what I get on Sw2's console, repeatedly:

Sw2(config-if)#
*Mar  1 00:11:29.971: %PM-4-ERR_RECOVER: Attempting to recover from udld err-disable state on Fa0/1
*Mar  1 00:11:33.511: %LINK-3-UPDOWN: Interface FastEthernet0/1, changed state to up
*Mar  1 00:11:34.518: %LINEPROTO-5-UPDOWN: Line protocol on Interface FastEthernet0/1, changed state to up
*Mar  1 00:11:36.984: %UDLD-4-UDLD_PORT_DISABLED: UDLD disabled interface Fa0/1, unidirectional link detected
*Mar  1 00:11:36.984: %PM-4-ERR_DISABLE: udld error detected on Fa0/1, putting Fa0/1 in err-disable state
*Mar  1 00:11:37.990: %LINEPROTO-5-UPDOWN: Line protocol on Interface FastEthernet0/1, changed state to down
*Mar  1 00:11:38.989: %LINK-3-UPDOWN: Interface FastEthernet0/1, changed state to down
*Mar  1 00:12:06.990: %PM-4-ERR_RECOVER: Attempting to recover from udld err-disable state on Fa0/1
*Mar  1 00:12:10.672: %LINK-3-UPDOWN: Interface FastEthernet0/1, changed state to up
*Mar  1 00:12:11.679: %LINEPROTO-5-UPDOWN: Line protocol on Interface FastEthernet0/1, changed state to up
*Mar  1 00:12:13.969: %UDLD-4-UDLD_PORT_DISABLED: UDLD disabled interface Fa0/1, unidirectional link detected
*Mar  1 00:12:13.969: %PM-4-ERR_DISABLE: udld error detected on Fa0/1, putting Fa0/1 in err-disable state
*Mar  1 00:12:14.976: %LINEPROTO-5-UPDOWN: Line protocol on Interface FastEthernet0/1, changed state to down
*Mar  1 00:12:15.974: %LINK-3-UPDOWN: Interface FastEthernet0/1, changed state to down
*Mar  1 00:12:43.967: %PM-4-ERR_RECOVER: Attempting to recover from udld err-disable state on Fa0/1
*Mar  1 00:12:47.515: %LINK-3-UPDOWN: Interface FastEthernet0/1, changed state to up
*Mar  1 00:12:48.522: %LINEPROTO-5-UPDOWN: Line protocol on Interface FastEthernet0/1, changed state to up
*Mar  1 00:12:50.963: %UDLD-4-UDLD_PORT_DISABLED: UDLD disabled interface Fa0/1, unidirectional link detected
*Mar  1 00:12:50.971: %PM-4-ERR_DISABLE: udld error detected on Fa0/1, putting Fa0/1 in err-disable state
*Mar  1 00:12:51.978: %LINEPROTO-5-UPDOWN: Line protocol on Interface FastEthernet0/1, changed state to down
*Mar  1 00:12:52.976: %LINK-3-UPDOWN: Interface FastEthernet0/1, changed state to down

What happens here is that Sw2 hears UDLD packets from Sw1 but Sw1 does not hear Sw2. That leads to Sw1 sending UDLD packets with an empty echo list. Sw2 expects to see itself in the echo list - and when it finds it empty, it assumes the link is faulty and brings it down. After 30 seconds, the process repeats.

I believe that the key to understanding the comment on the UDLD article you linked is the following statement:

"However recovery is done blindly without checking if UDLD partner has actually come back or not. Port is simply enabled and no further UDLD processing is done on that port until partner has returned and port has changed to bidirectional mode at least once."

There is a grain of truth here but it is not the complete truth.

If a UDLD-protected port comes up but hears no UDLD packets whatsoever, it assumes that there is no UDLD peer connected. It cannot disable the port in that case because that is not proof of a uni-directional link. This is, by the way, what happens on Sw1: it sees its Fa0/1 port going up and down, but because of the ACL, it does not hear any UDLD packets from Sw2. It therefore does not err-disable the port itself.

However, on a uni-directional link, one peer must, by definition, hear the other side. In this case, it is Sw2. It can hear UDLD packets from Sw1, and once it finds out, after repeated attempts to establish the peering, that Sw1 does not respond with Sw2's ID, it brings the port down.

The statement above is mistaken in the assumption that the state of the port must move to Bidirectional before UDLD can do anything. That is not true, as you can see yourself. UDLD makes a series of checks before it declares a port to be Bidirectional, and failing those checks will cause the port to be err-disabled.
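
To observe this in a lab like the one above, the following standard Catalyst IOS show commands could be run on Sw2 between recovery attempts. This is only a suggested checklist; output is omitted because it varies by platform and release.

```
! Per-port UDLD state and neighbor/echo information
show udld FastEthernet0/1
! Ports that are currently err-disabled, with the reason
show interfaces status err-disabled
! Which err-disable causes have timer-based recovery enabled, and the interval
show errdisable recovery
```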

Best regards,

Peter

"We are now thinking of using UDLD to detect such failures. The idea is to have UDLD aggressive disable the circuit when it becomes uni-directional, so that it will be unbundled from the etherchannel and all packets will be using the remaining bi-directional circuit."

I understand your logic about using UDLD aggressive but I don't understand why you want to consider auto-recovery.

In my line of work, I wouldn't even dream of using auto-recovery. UDLD aggressive plays an important part in our network. If the link goes err-disabled due to UDLD, it means that I have either a faulty fibre optic connection or a faulty module. Either one means I have to act. If you enable auto-recovery, how will you be able to determine that you've got a fault somewhere?

Another thing to consider is when you have auto-recovery enabled and you have routing.  Sending the line up and then down regularly can send your routing protocol nuts.

Peter, your opinion would be welcome.

Hi Leo,

I know you are a staunch advocate against any automagic err-disable recovery practices, and I agree with you.

It seems, though, that Markus's situation is a little different, as the uni-directional connectivity in his case is caused by the provider and, as he writes, because of configuration issues, not because of underlying physical uni-directional condition. If this is true indeed then using UDLD with automatic recovery may not be a totally bad idea, although his mileage may vary - it is still possible to blast your leg with the autorecovery.

What I would suggest, though, is running LACP on the EtherChannels. If LACPDUs cease to be received, the physical link should be dropped from the bundle.
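
A minimal sketch of that suggestion, with hypothetical member ports (GigabitEthernet1/0/1 and GigabitEthernet2/0/1) and a hypothetical channel-group number. It assumes both sites are changed to LACP together and that the carrier forwards LACPDUs transparently. If the ports are currently in a static bundle, the existing channel-group membership would probably have to be removed before the mode can be changed.

```
! Sketch only: interface names and channel-group number are assumptions
interface GigabitEthernet1/0/1
 channel-group 1 mode active
!
interface GigabitEthernet2/0/1
 channel-group 1 mode active
!
! Afterwards, from exec mode, check bundling state and LACP partner details
! show etherchannel summary
! show lacp neighbor
```

With the default (slow) LACP timers, it can take on the order of 90 seconds before a member that no longer receives LACPDUs is taken out of the bundle, so this is worth measuring in a lab before relying on it.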

Best regards,

Peter

"It seems, though, that Markus's situation is a little different, as the uni-directional connectivity in his case is caused by the provider and, as he writes, because of configuration issues, not because of underlying physical uni-directional condition. If this is true indeed then using UDLD with automatic recovery may not be a totally bad idea, although his mileage may vary - it is still possible to blast your leg with the autorecovery."

LOL. Ok. UDLD needs to be enabled both ways for this to work, meaning both Markus and the ISP need to get this enabled.

The problem with this issue is someone needs to tell the provider that the link is going wonky.  And this is why I still don't agree with auto-recovery of a link disabled by UDLD. 

"What I would suggest, though, is running LACP on the EtherChannels. If LACPDUs cease to be received, the physical link should be dropped from the bundle."

Good option.

Please use the command "errdisable recovery cause udld" with extreme caution. Because this command was in the configuration of my data center Nexus 5Ks, it created a network loop.

I experienced a rare hardware failure on my distribution switch where control traffic was no longer passing over an access<->distribution link. UDLD aggressive mode did as it should have done and disabled the port on the N5K access switch. However, the above recovery command re-enabled the link after 5 seconds and created a loop that was basically hidden. Once the link came back up, UDLD was not able to re-establish, the port was not able to rejoin the port-channel as LACPDUs were not negotiated (thus creating the loop), and STP was not able to do anything about it because it is also control traffic.

I've since removed this hazardous command from my configurations. I initially thought this was a global default, as it's not in any of my configuration templates but is in every one of my production configurations. At this point I have no evidence that it's a global default, although I still suspect that it is and that it may have to do with Nexus vPCs and FEXs.

I see no good reason for this command. Why would you want to automatically bring back up a link that has an obvious problem with passing traffic (uni-directional, no control traffic, etc.)? If UDLD detects an issue, it should disable the port and allow for manual intervention, given the high chance of a loop condition being introduced.

Chuck

Peter, Leo and Chuck,

thanks a lot for your advice, ideas and efforts taken to answer my question.

Maybe LACPDUs are the answer to my problem. I will take a closer look at it and try to simulate possible failure scenarios in a lab environment.

I also fully understand why you are arguing against any automatic recovery from a UDLD-detected error. Nevertheless, since the root causes of the issues I have experienced were always beyond my control, I still think that automatic recovery can be a good idea here. For example, let's say one of my circuits becomes faulty (uni-directional etc.). UDLD will shut it down. If the service provider recovers from this error, I will not know it before I manually re-enable the err-disabled port. If the other circuit fails as well during the time between the service provider's recovery and me manually getting that port up again, we will experience an outage, although in principle we had an operational circuit that was just disabled by UDLD.

If there are any other ideas or remarks, they are of course welcome.

Kind regards

Markus

"If the service provider recovers from this error, I will not know it before I manually re-enable the err-disabled port. If during the time between the service provider recovery and me, manually getting that port up again, the other circuit fails as well, we will experience an outage, although in principle we had an operational circuit, that was just disabled by UDLD."

I fully understand why you need to use auto-recovery. But my intention is this: if there is a fault, I want the link to stay down so I know there's a fault and I can intervene (to re-enable the link). If I enable auto-recovery, this means I will NEVER KNOW there was a fault. The service provider could continue to happily collect the service fees from your company without doing anything to investigate or fix this issue. Over time, your link issue could get worse. UDLD auto-recovery, in my humble opinion, is like "sweeping under the carpet" or singing the song "I'm not listening to you, do-dah, do-dah".

Personally, I'd like to know WHY the link would go uni-directional.   Again, this is my own opinion. 

Leo,

we are thinking along the same lines. The final resolution can only be that the carrier accepts the issues, identifies the root cause and fixes it. To date, though, we do not have enough evidence to put serious pressure on the carrier, and they keep telling me that they do not see any issues and that they are meeting their SLAs for the circuits.

So while we are trying to produce more evidence, I do not want my client to suffer. I want to have something to offer to my client and tell them, hey look, we have two options here, make your choice:

  1. we wait for the next outage and are prepared to collect some clear evidence, so we can put pressure on the carrier and have them fix the root cause, or
  2. if you don't want to accept another outage, we can switch to LACP and have a reliable way to detect and recover from the issues, but you then have to accept that we might not be able to find and fix the root cause.

Once again thanks for your advice, insights and time

Markus
