We are having some issues with our 4500X switches running VSS. The version is 03.06.06.E.
Basically, the VSS just stops working for a few minutes and then comes back up. There is no power loss on either switch or loss of network connectivity, and the two switches are directly connected. The logs are below.
001887: Oct 30 16:43:16.433: %PM-4-ERR_RECOVER: Attempting to recover from udld err-disable state on Te1/1/8
Seriously? Auto-recovery when UDLD kicks in is enabled?
Post the complete output of the command "sh redundancy".
What is the issue with having UDLD enabled?
Please see the "show redundancy" output below.
Why would you disable UDLD? It is there for a reason: to protect against a single fiber going down.
UDLD is doing its job if it sees a link go down.
UDLD is a life saver. Think about it: Once it detects a potential issue with the link it will bring it down. Why? Because if this link happens to be carrying very large routing tables, you don't want this link to be flapping or it'll kill the CPU.
Look at the output of the "sh redundancy". Notice that the 2nd card has an uptime of 20 hours? I bet I know what caused that.
If UDLD kicks in, tough luck. Take the time to investigate WHY UDLD got triggered. Enabling the auto-recovery is just sweeping the problem under the rug.
Common sense must be used: If UDLD is enabled, disable the auto-recovery. If auto-recovery is enabled then disable UDLD.
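For reference, a minimal sketch of checking and turning off UDLD auto-recovery on IOS/IOS-XE. Verify the syntax against your release before applying it:

```
! Show which err-disable causes have auto-recovery enabled, and the timer
show errdisable recovery
!
configure terminal
 ! Stop ports err-disabled by UDLD from being re-enabled automatically
 no errdisable recovery cause udld
end
```

With this in place, a port that UDLD shuts down stays down until someone does a shut / no shut on it.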
NOTE: The config-register value is 0x2101.
When you say you bet you know what caused it, what are your thoughts?
The secondary line card has a very low uptime. This means the card either lost power (unlikely) or crashed, and I'm leaning towards a crash.
UDLD is enabled, and someone then enabled auto-recovery for links that UDLD puts into err-disable. This is what usually happens:
Let's presume you've got a VSS pair and they are doing BGP routing.
1. UDLD detects a problem & one of the links goes into error-disabled;
NOTE: Let's just say that that link is the PRIMARY path to the internet.
2. This means everything stops;
3. 30 seconds later the link auto-recovers;
4. Guess what: BGP routes and advertisements come flooding in;
5. When this happens the first thing that gets hit is the CPU;
6. After a few minutes, repeat #1 to #5 for about four to five "cycles".
Now what do you think is going to happen to the supervisor card?
Another thing: If this has been happening for a while, look inside the crashinfo or coredump folder.
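A quick way to check for crash files, assuming the platform exposes the usual crashinfo: file system (the file system names, especially for the standby switch, vary by platform and release):

```
! List crash files on the active switch
dir crashinfo:
! Core dumps, if any were written, may also be in bootflash
dir bootflash:
```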
Prove me wrong, Carl: Disable UDLD auto-recovery and then investigate which link(s) go into UDLD error-disable. Try it for four days.
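While that test runs, these commands are a reasonable starting point for finding the affected links. Te1/1/8 comes from the log above; the rest is a sketch, not a complete procedure:

```
! Ports currently in err-disable, with the reason
show interfaces status err-disabled
! UDLD state and neighbor details for the port named in the log
show udld TenGigabitEthernet1/1/8
! Past UDLD events still in the log buffer (the match is case-sensitive)
show logging | include UDLD|udld
```

If the same port keeps showing up, check the optics, the fiber and the far end of that link first.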