02-17-2014 05:21 AM - edited 03-07-2019 06:15 PM
Hi All,
I have a question about automatic recovery after a failure detected by UDLD.
Background: we are running dual Ethernet circuits provided by a carrier between two remote sites. Both circuits are configured as a
cross-stack EtherChannel on a 3750X stack at each site. The problem we are facing is that we have had a couple of outages caused by
a logical error of some kind in the carrier network. Both sites had link on both circuits and the EtherChannel was up, although
one of the circuits was no longer bi-directional and was not passing traffic in one direction. This caused an outage: because of load balancing
on the EtherChannel, some packets went down the uni-directional circuit and did not make it to the other site.
We are now thinking of using UDLD to detect such failures. The idea is to have UDLD in aggressive mode disable the circuit when it becomes uni-directional, so
that it will be unbundled from the EtherChannel and all packets will use the remaining bi-directional circuit.
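A minimal sketch of the setup described above, assuming two hypothetical member ports (GigabitEthernet1/0/1 and GigabitEthernet2/0/1, one per stack member) and a hypothetical channel-group number 1. The interface names, the group number and the "on" channel mode are illustrative assumptions, not taken from the actual configuration. UDLD must also be running on the far end of each circuit, and the carrier has to pass UDLD frames transparently, for detection to work:

```
! Hypothetical member ports of the cross-stack EtherChannel
interface GigabitEthernet1/0/1
 channel-group 1 mode on
 udld port aggressive
!
interface GigabitEthernet2/0/1
 channel-group 1 mode on
 udld port aggressive
```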
My concern is the automatic recovery after the error-disable timer has expired:
Does UDLD check the circuit for being bi-directional prior to re-enabling it after the error-disable timer has expired? If it doesn't, that would cause
an outage, because packets would flow down the uni-directional circuit. Or even worse, if the circuit does not become bi-directional, it will stay
up although uni-directional, because UDLD will only disable the circuit when it goes from a bi-directional state to a uni-directional state and not
from a disabled state to a uni-directional state.
Thanks for reading. Any comments are appreciated.
Thanks
Markus
02-17-2014 07:56 AM
Hi Markus,
Does UDLD check the circuit for being bi-directional prior to re-enabling it after the error-disable timer has expired?
UDLD checks the circuit after it is re-enabled. If the err-disable recovery puts a port previously disabled by UDLD back into service, UDLD will test whether the link is uni-directional. However, this happens while the port is up and operating. In other words, the port will come up, perhaps start transmitting data, and at the same time UDLD will try to verify whether the link is uni-directional. If it is, it will disable the port again (within 30-60 seconds I believe; I can debug this for you).
Best regards,
Peter
02-17-2014 08:25 AM
Hi Peter,
Thanks for your comment. Have you seen UDLD behave that way yourself? I'm asking because I found the following comment on a non-Cisco blog, which is what got me thinking about this in the first place:
http://blog.ine.com/2008/07/05/udld-modes-of-operation/
" At least in my opinion biggest problem with UDLD is it’s inability to recover from fault state. Sure, it disables port in aggressive mode and errdisable recovery re-enables port after configured delay. However recovery is done blindly without checking if UDLD partner has actually come back or not. Port is simply enabled and no further UDLD processing is done on that port until partner has returned and port has changed to bidirectional mode at least once. After that if new fault has occurred it will take port down as expected. For this reason UDLD is fine when not using errdisable recovery or running it in non-aggressive mode. Which also means you’re prepared to always manually fix problem and have off-band management access to all of your network equipment. For automated operations UDLD offers no help making it completely useless for many setups where such monitoring would be needed (dumb fiber transceivers, EoMPLS etc). Based on comments where people claim they use UDLD successfully makes me believe they have never actually tested different fault scenarios and simply assume it will function properly when needed."
Kind regards
Markus
02-17-2014 09:12 AM
Hi Markus,
I've just tested it now in UDLD Normal mode - I've interconnected two switches, Sw1 and Sw2, on their Fa0/1 ports. Sw1 is configured as follows:
mac access-list extended Deny
 deny any any
!
interface FastEthernet0/1
 udld port
 mac access-group Deny in
In essence, Sw1 is emulating a link endpoint towards which the "fiber" is broken, i.e. it does not hear Sw2. Sw2 is configured simply as:
errdisable recovery cause udld
errdisable recovery interval 30
!
interface FastEthernet0/1
 udld port
Now this is what I get on Sw2's console, repeatedly:
Sw2(config-if)#
*Mar 1 00:11:29.971: %PM-4-ERR_RECOVER: Attempting to recover from udld err-disable state on Fa0/1
*Mar 1 00:11:33.511: %LINK-3-UPDOWN: Interface FastEthernet0/1, changed state to up
*Mar 1 00:11:34.518: %LINEPROTO-5-UPDOWN: Line protocol on Interface FastEthernet0/1, changed state to up
*Mar 1 00:11:36.984: %UDLD-4-UDLD_PORT_DISABLED: UDLD disabled interface Fa0/1, unidirectional link detected
*Mar 1 00:11:36.984: %PM-4-ERR_DISABLE: udld error detected on Fa0/1, putting Fa0/1 in err-disable state
*Mar 1 00:11:37.990: %LINEPROTO-5-UPDOWN: Line protocol on Interface FastEthernet0/1, changed state to down
*Mar 1 00:11:38.989: %LINK-3-UPDOWN: Interface FastEthernet0/1, changed state to down
*Mar 1 00:12:06.990: %PM-4-ERR_RECOVER: Attempting to recover from udld err-disable state on Fa0/1
*Mar 1 00:12:10.672: %LINK-3-UPDOWN: Interface FastEthernet0/1, changed state to up
*Mar 1 00:12:11.679: %LINEPROTO-5-UPDOWN: Line protocol on Interface FastEthernet0/1, changed state to up
*Mar 1 00:12:13.969: %UDLD-4-UDLD_PORT_DISABLED: UDLD disabled interface Fa0/1, unidirectional link detected
*Mar 1 00:12:13.969: %PM-4-ERR_DISABLE: udld error detected on Fa0/1, putting Fa0/1 in err-disable state
*Mar 1 00:12:14.976: %LINEPROTO-5-UPDOWN: Line protocol on Interface FastEthernet0/1, changed state to down
*Mar 1 00:12:15.974: %LINK-3-UPDOWN: Interface FastEthernet0/1, changed state to down
*Mar 1 00:12:43.967: %PM-4-ERR_RECOVER: Attempting to recover from udld err-disable state on Fa0/1
*Mar 1 00:12:47.515: %LINK-3-UPDOWN: Interface FastEthernet0/1, changed state to up
*Mar 1 00:12:48.522: %LINEPROTO-5-UPDOWN: Line protocol on Interface FastEthernet0/1, changed state to up
*Mar 1 00:12:50.963: %UDLD-4-UDLD_PORT_DISABLED: UDLD disabled interface Fa0/1, unidirectional link detected
*Mar 1 00:12:50.971: %PM-4-ERR_DISABLE: udld error detected on Fa0/1, putting Fa0/1 in err-disable state
*Mar 1 00:12:51.978: %LINEPROTO-5-UPDOWN: Line protocol on Interface FastEthernet0/1, changed state to down
*Mar 1 00:12:52.976: %LINK-3-UPDOWN: Interface FastEthernet0/1, changed state to down
What happens here is that Sw2 hears UDLD packets from Sw1 but Sw1 does not hear Sw2. That leads to Sw1 sending UDLD packets with an empty echo list. Sw2 expects to see itself in the echo list - and when it finds it empty, it assumes the link is faulty and brings it down. After 30 seconds, the process repeats.
I believe that the key to understanding the comment in the UDLD article you linked is the following statement:
However recovery is done blindly without checking if UDLD partner has actually come back or not. Port is simply enabled and no further UDLD processing is done on that port until partner has returned and port has changed to bidirectional mode at least once.
There is a grain of truth here but it is not the complete truth.
If a UDLD-protected port comes up but hears no UDLD packets whatsoever, it assumes that there is no UDLD peer connected. It cannot disable the port in that case because that is not proof of a uni-directional link. This is, by the way, what happens on Sw1 - it sees its Fa0/1 port going up and down, but because of the ACL, it does not hear any UDLD packets from Sw2. It therefore does not err-disable the port itself.
However, on a uni-directional link, one peer must - by definition - hear the other side. In this case, it is Sw2. It can hear UDLD packets from Sw1, and once it finds out, after repeated attempts to establish the peering, that Sw1 does not respond with Sw2's ID, it brings the port down.
The statement above is mistaken in the assumption that the state of port must move to Bidirectional before UDLD can do anything. That is not true, as you can see yourself. UDLD makes a series of checks before it declares a port to be Bidirectional, and failing those checks will cause the port to be err-disabled.
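To follow this on a live switch, the commands below are one way to check the UDLD state and the recovery timer. This is only a sketch that reuses the Fa0/1 port from the test above; the output is omitted because it varies by platform and software release:

```
! Per-port UDLD state and neighbor information
show udld FastEthernet0/1
! Ports currently in err-disable state, with the reason
show interfaces status err-disabled
! Which err-disable causes auto-recover, and the time left on each port
show errdisable recovery
```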
Best regards,
Peter
02-17-2014 01:18 PM
We are now thinking of using UDLD to detect such failures. The idea is to have UDLD in aggressive mode disable the circuit when it becomes uni-directional, so that it will be unbundled from the EtherChannel and all packets will use the remaining bi-directional circuit.
I understand your logic about using UDLD aggressive but I don't understand why you want to consider auto-recovery.
In my line of work, I wouldn't even dream of using auto-recovery. UDLD aggressive plays an important part in our network. If the link goes err-disabled due to UDLD, it means that I have either a faulty fibre optic connection or a faulty module. Either one means I have to act. If you enable auto-recovery, how will you be able to determine that you've got a fault somewhere?
Another thing to consider is when you have auto-recovery enabled and you have routing. Sending the line up and then down regularly can send your routing protocol nuts.
Peter, your opinion would be welcome.
02-17-2014 02:24 PM
Hi Leo,
I know you are a staunch advocate against any automagic err-disable recovery practices. And I agree with you.
It seems, though, that Markus's situation is a little different, as the uni-directional connectivity in his case is caused by the provider and, as he writes, because of configuration issues, not because of underlying physical uni-directional condition. If this is true indeed then using UDLD with automatic recovery may not be a totally bad idea, although his mileage may vary - it is still possible to blast your leg with the autorecovery.
What I would suggest, though, is running LACP on the EtherChannels. If LACPDUs cease to be received, the physical link should be dropped from the bundle.
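A minimal sketch of that suggestion, assuming hypothetical member ports GigabitEthernet1/0/1 and GigabitEthernet2/0/1 and a hypothetical channel-group number 1 (none of these names come from the actual configuration). With "mode active" on both ends, a member that stops receiving LACPDUs is removed from the bundle once the LACP timeout expires, which with the default slow rate can take up to 90 seconds. This only helps if the carrier passes LACPDUs end to end:

```
! Hypothetical member ports; apply the equivalent on the remote stack
interface range GigabitEthernet1/0/1 , GigabitEthernet2/0/1
 channel-group 1 mode active
```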
Best regards,
Peter
02-17-2014 02:43 PM
It seems, though, that Markus's situation is a little different, as the uni-directional connectivity in his case is caused by the provider and, as he writes, because of configuration issues, not because of underlying physical uni-directional condition. If this is true indeed then using UDLD with automatic recovery may not be a totally bad idea, although his mileage may vary - it is still possible to blast your leg with the autorecovery.
LOL. OK. UDLD needs to be enabled both ways for this to work, meaning both Markus and the ISP need to get this enabled.
The problem with this issue is someone needs to tell the provider that the link is going wonky. And this is why I still don't agree with auto-recovery of a link disabled by UDLD.
What I would suggest, though, is running LACP on the EtherChannels. If LACPDUs cease to be received, the physical link should be dropped from the bundle.
Good option.
02-18-2014 04:58 AM
Please use the command "errdisable recovery cause udld" with extreme caution. Because this command was in the configuration of my data center Nexus 5Ks, it created a network loop.
I experienced a rare hardware failure on my distribution switch where control traffic was no longer passing over an access<->distribution link. UDLD aggressive mode did as it should have done and disabled the port on the N5K access switch. However, the above recovery command re-enabled the link after 5 seconds and created a loop that was basically hidden. Once the link came back up, UDLD was not able to re-establish, the port was not able to rejoin the port-channel as LACPDUs were not negotiated (thus creating the loop), and STP was not able to do anything about it because it is also control traffic.
I've since removed this hazardous command from my configurations. I initially thought this was a global default, as it is not in any of my configuration templates but is in every one of my production configurations. At this point I have no evidence that it's a global default, although I still suspect that it is and that it may have to do with Nexus vPCs and FEXs.
I see no good reason for this command. Why would you want to automatically bring back up a link that has an obvious problem with passing traffic (uni-directional, no control traffic, etc.)? If UDLD detects an issue, it should disable the port and allow for manual intervention, given the high chance of a loop condition being introduced.
Chuck
02-18-2014 07:03 AM
Peter, Leo and Chuck,
Thanks a lot for your advice, ideas and the effort taken to answer my question.
Maybe LACPDUs are the answer to my problem. I will take a closer look at them and try to simulate
possible failure scenarios in a lab environment.
I also fully understand why you are arguing against any automatic recovery from a UDLD-detected error.
Nevertheless, since the root causes of the issues I have experienced were always beyond my control,
I still think that automatic recovery can be a good idea here. For example, let's say one of my circuits becomes
faulty (uni-directional etc.). UDLD will shut it down. If the service provider recovers from this
error, I will not know it until I manually re-enable the err-disabled port. If the other circuit fails as well during the time
between the service provider's recovery and my manually bringing that port up again, we will experience
an outage, although in principle we had an operational circuit that was merely disabled by UDLD.
If there are any other ideas or remarks, they are of course welcome.
Kind regards
Markus
02-18-2014 01:46 PM
If the service provider recovers from this error, I will not know it until I manually re-enable the err-disabled port. If the other circuit fails as well during the time between the service provider's recovery and my manually bringing that port up again, we will experience an outage, although in principle we had an operational circuit that was merely disabled by UDLD.
I fully understand why you need to use auto-recovery. But my intention is this: if there is a fault, I want the link to stay down so I know there's a fault and I can intervene (to enable the link). If I enable auto-recovery, this means I will NEVER KNOW there was a fault. The service provider could continue to happily collect the service fees from your company without doing anything to investigate or fix this issue. Over time, your link issue could get worse. UDLD auto-recovery, in my humble opinion, is like "sweeping under the carpet" or singing the song "I'm not listening to you, do-dah, do-dah".
Personally, I'd like to know WHY the link would go uni-directional. Again, this is my own opinion.
02-20-2014 01:37 AM
Leo,
We are thinking along the same lines. The final resolution can only be that the carrier accepts the issues, identifies the root cause and fixes it. To date, though, we do not have enough evidence to put real pressure on the carrier, and they keep telling me that they do not see any issues and that they are meeting their SLAs for the circuits.
So while we are trying to produce more evidence, I do not want my client to suffer. I want to have something to offer to my client and tell them, hey look, we have two options here, make your choice:
Once again, thanks for your advice, insights and time.
Markus