01-09-2013 02:15 PM - edited 03-07-2019 11:00 AM
We have encountered issues where a 4500 chassis switch, connected to a pair of 6509s and running udld port aggressive on its interfaces, gets dropped off the network when the 4500 crashes and reloads. Apparently the 6500s see the loss of UDLD packets and put their interfaces into the err-disabled state. This could likely happen with other access layer switches, but our experience has been with the 4500. Manual intervention is required to get the access switch back online with its upstream neighbors. Does anyone know of a tweak, nerd knob, or other workaround to get past this issue?
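For reference, the uplinks are configured along these lines (interface numbers and descriptions below are only illustrative, not our actual configuration):

! Access-switch uplink config sketch - interface names are placeholders
interface GigabitEthernet1/1
 description Uplink to 6509-1
 udld port aggressive
!
interface GigabitEthernet1/2
 description Uplink to 6509-2
 udld port aggressive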
01-09-2013 02:23 PM
You can set it so that an err-disabled port tries to re-enable itself after a given period of time. You do still need to determine why it goes err-disabled; it should not be doing that. The 4500 should not be crashing either.
In order to turn on errdisable recovery and choose the errdisable conditions, issue this command:
cat6knative#errdisable recovery cause ?
  all                  Enable timer to recover from all causes
  arp-inspection       Enable timer to recover from arp inspection error disable state
  bpduguard            Enable timer to recover from BPDU Guard error disable state
  channel-misconfig    Enable timer to recover from channel misconfig disable state
  dhcp-rate-limit      Enable timer to recover from dhcp-rate-limit error disable state
  dtp-flap             Enable timer to recover from dtp-flap error disable state
  gbic-invalid         Enable timer to recover from invalid GBIC error disable state
  l2ptguard            Enable timer to recover from l2protocol-tunnel error disable state
  link-flap            Enable timer to recover from link-flap error disable state
  mac-limit            Enable timer to recover from mac limit disable state
  pagp-flap            Enable timer to recover from pagp-flap error disable state
  psecure-violation    Enable timer to recover from psecure violation disable state
  security-violation   Enable timer to recover from 802.1x violation disable state
  udld                 Enable timer to recover from udld error disable state
  unicast-flood        Enable timer to recover from unicast flood disable state
This example shows how to enable the BPDU guard errdisable recovery condition:
cat6knative(Config)#errdisable recovery cause bpduguard
A nice feature of this command is that, if you enable errdisable recovery, the command lists general reasons that the ports have been put into the error-disable state. In this example, notice that the BPDU guard feature was the reason for the shutdown of port 2/4:
cat6knative#show errdisable recovery
ErrDisable Reason    Timer Status
-----------------    --------------
udld                 Disabled
bpduguard            Enabled
security-violatio    Disabled
channel-misconfig    Disabled
pagp-flap            Disabled
dtp-flap             Disabled
link-flap            Disabled
l2ptguard            Disabled
psecure-violation    Disabled
gbic-invalid         Disabled
dhcp-rate-limit      Disabled
mac-limit            Disabled
unicast-flood        Disabled
arp-inspection       Disabled

Timer interval: 300 seconds

Interfaces that will be enabled at the next timeout:

Interface    Errdisable reason    Time left(sec)
---------    -----------------    --------------
Fa2/4        bpduguard            290

If any one of the errdisable recovery conditions is enabled, the ports with this condition are reenabled after 300 seconds. You can also change this default of 300 seconds if you issue this command:
cat6knative(Config)#errdisable recovery interval timer_interval_in_seconds

This example changes the errdisable recovery interval from 300 to 400 seconds:
cat6knative(Config)#errdisable recovery interval 400
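For the UDLD case discussed in this thread specifically, a minimal sketch would be the following (the 30-second interval is just an example value, not a recommendation):

cat6knative(Config)#errdisable recovery cause udld
cat6knative(Config)#errdisable recovery interval 30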
01-10-2013 04:53 AM
Thanks for the post; we already use these recovery features.
01-09-2013 02:34 PM
will get dropped off the network in the event of a crash and reload of the 4500.
If the port goes into "err-disable" I want to see WHY. Can you post the output to the command "sh interface status error"?
01-10-2013 04:57 AM
The command 'show interface status err-disabled' yields no information since none of the interfaces are currently in that state. However, the log of the adjacent 6500 shows the following:
Apr 19 07:37:04.463: %LINEPROTO-5-UPDOWN: Line protocol on Interface GigabitEthernet2/29, changed state to down
Apr 19 07:37:04.471: %LINK-3-UPDOWN: Interface GigabitEthernet2/29, changed state to down
Apr 19 07:37:04.459: %UDLD-SP-4-UDLD_PORT_DISABLED: UDLD disabled interface Gi2/29, aggressive mode failure detected
Apr 19 07:37:04.459: %PM-SP-4-ERR_DISABLE: udld error detected on Gi2/29, putting Gi2/29 in err-disable state
This likely indicates a loss of UDLD packets, and the resulting determination that the link was unidirectional.
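For reference, these are the kinds of commands to check the UDLD and err-disable state on the 6500 side (the hostname prompt below is just a placeholder):

6509#show udld GigabitEthernet 2/29
6509#show interfaces status err-disabled
6509#show errdisable recovery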
01-10-2013 01:57 PM
Remove UDLD from both interfaces.
Is G2/29 a fiber port?
01-11-2013 05:47 AM
Yes, G2/29 is a fiber port.
01-11-2013 02:07 PM
Yes, G2/29 is a fiber port.
Ok, I think you've got a fibre issue. Could be a fault with the fibre link itself, patch cables, modules. Before you start looking for it, can you post the output to the command "sh interface
01-13-2013 03:57 AM
My original post explained how both uplinks to an access layer switch went err-disabled following a crash/reload of a 4500. This is not a fiber issue, as we have seen identical outcomes from crash/reload events on other devices. Controlled reloads do not trigger the err-disable event.
04-05-2013 06:09 AM
Daniel,
Other than an actual physical layer problem (which you said is not the case here), the most likely scenario for a "false positive" with UDLD is high CPU on either side of the link. That's one reason UDLD aggressive mode is sometimes not recommended: it is more prone to false positives.
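If the false positives keep happening and you still want UDLD on those uplinks, one option is to drop back from aggressive to normal mode on both ends of each uplink. A sketch for the 6500 side, using the port from your log:

interface GigabitEthernet2/29
 no udld port aggressive
 udld port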
I would recommend getting the crash looked at by TAC (assuming this still happens; I know the post is a few months old). It sounds like whatever is happening during the 4500's crash is affecting its ability to send UDLD frames. UDLD frames are generated and received by the CPU, so if the 4500 is crashing due to a high CPU condition, it's possible the CPU is too busy to send the UDLD frames, which is why the 6500 is err-disabling its ports.
Checking 'show proc cpu history' would show you what the CPU looked like. A crashinfo file may have that output (I don't know offhand with the 4500). Either way, TAC can look at the tracebacks in the 4500's crash file to determine whether it was most likely due to high CPU, in which case the UDLD issue is just a symptom of a larger problem where traffic is being software-switched by the CPU. The following doc has more information on high CPU on the 4500 if you are interested --
http://www.cisco.com/en/US/products/hw/switches/ps663/products_tech_note09186a00804cef15.shtml
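If you want to dig into it yourself first, the commands from that doc I would start with on the 4500 are along these lines (the hostname prompt is just a placeholder):

4500#show processes cpu sorted
4500#show processes cpu history
4500#show platform health
4500#show platform cpu packet statistics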
David Kosich