I came across a serious issue the other day that occurs when our access layer switch reloads, and I have yet to find a workaround or solution. You'll see why this is such a serious issue by the last paragraph.
Typical network setup:
Distribution switch: 3750X-24S-S (15.0(1)SE3)
Access switch: 2960CG-8TC-L (15.0(2)SE)
Switches are connected by either SM or MM fiber, and all links are trunks (production and management traffic).
Until recently everything had been working normally. Then, due to maintenance being performed on our building's electrical system, our access layer switches lost power and reloaded. After the reboot, the access layer switch shows a green light on the uplink port (Gi0/9 or Gi0/10). Consoling into the access switch shows no CDP neighbors, no errors in the log, and no other indication that anything is wrong. However, the uplink shows connected.
The distribution switch shows its port to the access switch as not connected. There was an error in the log indicating a udld on the link however the switch tries to recover from the error. However, the port never reconnects. In the output of show interface status, the port shows notconnect, not err-disabled.
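For anyone wanting to compare against their own gear, the state described above can be checked from the distribution switch with show commands along these lines. This is only a sketch: GigabitEthernet1/0/1 is a placeholder interface name, not one taken from this network, and the output will differ by platform and release.

```
! On the distribution switch, privileged EXEC mode.
! GigabitEthernet1/0/1 is a placeholder for the downlink to the access switch.

! Port state: look for notconnect versus err-disabled on the downlink
show interfaces status
show interfaces status err-disabled

! UDLD state and neighbor information for the downlink
show udld GigabitEthernet1/0/1

! Whether the access switch is seen as a CDP neighbor
show cdp neighbors

! Look for UDLD or link messages around the time of the reload
show logging
```

Running show interfaces status and show cdp neighbors on the access switch as well gives the other half of the picture (connected, but no neighbors).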
We are currently running out of usable switchports, and we are expecting another power outage next month. I am also unable to open a TAC case, as none of these are on maintenance, contrary to what I had originally thought. We also have five other switches of the same model out there running the same IOS version. Three of these have experienced the same issue.
Does anyone have any ideas or pointers, or has anyone experienced this?
There was an error in the log indicating a udld on the link however the switch tries to recover from the error.
1. Why is UDLD enabled on a port that is known to be going to another switch?
2. Why is error-disable recovery (for UDLD) enabled? Does someone want instability in the network?
1. Disable UDLD on known ports that are going to another switch (better yet, enable 802.1q Trunking).
2. Take out ANY error-disable auto-recovery global commands.
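As a rough sketch of what the second suggestion could look like on the CLI, assuming UDLD auto-recovery was turned on with the errdisable recovery global command (the thread does not show the actual configuration):

```
! Check what is currently configured
show running-config | include udld
show errdisable recovery

! Remove automatic recovery for ports err-disabled by UDLD
configure terminal
 no errdisable recovery cause udld
 end

! Confirm the change
show errdisable recovery
```

The per-port UDLD command for the first suggestion is left out on purpose. Its exact form depends on the platform and release, and on whether UDLD is enabled globally or per interface, so check the command reference for your IOS version before changing it.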
Thanks for your input.
A note on my statement above (ruled out a bad access switch): this test switch was connected over a 1 meter fiber jumper so I could quickly test other ports and configurations.
The only common thing I can see in all of this, aside from udld being configured, is that there is something wrong with the distribution switch, but I can't figure out what. As I also stated, there are 5 other switches out there in 5 separate buildings displaying the same behaviour. Also note that when I originally configured and installed these devices, no more than a year or so ago, everything was working as advertised, even with udld enabled.
You are not alone; we are seeing the exact same symptoms with a number of 3750X-s-24s switch stacks being used as distribution gear. At this point, I've seen this issue span 3 different, isolated networks across 4 locations.
The only recovery option we've been able to identify to date is a total reboot of the switch stack containing the 3750X-s-24s gear. Again, like you, the access switch shows connected with no CDP data, and the local port on the distribution side is down/down.
At this point, we're out of ports and shifting them around is no longer an option.
Did you discover anything yet regarding the root cause? I'm anxious to know as we're at our wits' end... next stop, tech support I guess.
This was definitely a weird one. Cisco TAC was unable to reproduce the anomalies found on the production systems using the same hardware and IOS. Big bummer there; I thought TAC could figure anything out.
They did, however, suggest we upgrade the IOS on our 3750s to a current maintenance release, because the IOS we were running is a deferred release. We went from 15.0(1)SE3 to 15.0(2)SE8. The IOS upgrade went well and the issue was fixed, with no other issues noted to date. I'd like to say the root cause was a bug in the IOS, but since TAC couldn't find one, it's anyone's guess.
Hope this helps you out.
Thanks Dave. If it is a "bug", it must exist from the 12.2 strain on up to 15.0(1)SE3. We ran into it on multiple versions... ok, if an upgrade seems to be the answer, that is the path we'll be on.
Thanks for your prompt response,
Ah. Ok. Thank you. I'm seeing this issue after power outages (albeit on 3850s), and the workaround we found was simply shutting the stack ports to the fiber module, which caused it to reload its config.
Everything came up immediately.