
3750X-24S-S fails to reconnect after access layer reload

WAN-SOLO
Level 1

I came across a serious issue the other day and have yet to find a workaround or solution: when our access layer switch reloads, its uplink to the distribution switch fails to reconnect. You'll see why this is so serious in the last paragraph.

 

Typical network setup

Distribution switch: 3750X-24S-S (15.0(1)SE3)
Access switch: 2960CG-8TC-L (15.0(2)SE)

Switches are connected by either SM or MM fiber, and all links are trunks (production and management traffic).

 

Issue

Until recently everything had been working normally. Then, due to maintenance being performed on our building electrical system, our access layer switches lost power and reloaded. After reboot, the access layer switch shows a green light on the uplink port (Gi0/9 or Gi0/10). Consoling into the access switch shows no CDP neighbors, no errors in the log, and no other indication that anything is wrong; the uplink shows connected.

The distribution switch shows the port to the access switch as not connected. There was an error in the log indicating a udld on the link however the switch tries to recover from the error. However, the port never reconnects. In show interface status the port shows notconnect, not err-disabled.

 

Troubleshooting

  • Ruled out the fiber as both strands were tested with no loss.
  • Ruled out the SFP as there were several different ones tested on both ends.
  • Set the errdisable recovery time to 5 minutes; no change.
  • Disabled UDLD on both ends; no change.
  • Shut / no shut on the port; no change.
  • Ruled out a bad access switch: connected a test switch to the port, and the test switch shows connected, but there is no link on the distribution switch.
  • Tried to force an err-disable on the distribution switch by looping fiber on the SFP port and re-enabling UDLD. The port light never comes on and there is no indication that the port is error-disabling.
  • Tried other open ports on the switch, and after a couple of attempts a link finally formed.
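
For anyone retracing these checks, the port state can be inspected from the distribution switch with standard IOS show commands. This is only a sketch: `GigabitEthernet1/0/1` is a placeholder interface name, not one taken from this thread.

```
! On the distribution switch (GigabitEthernet1/0/1 is a placeholder)
show interfaces status
show interfaces status err-disabled
show interfaces GigabitEthernet1/0/1
show udld GigabitEthernet1/0/1
show errdisable recovery
show cdp neighbors
show logging
```

A port that UDLD has shut down would normally be listed by the second command; in the case described here it shows notconnect instead.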

 

We are running out of usable switchports and are expecting another power outage next month. We are also unable to open a TAC case, as none of these switches are on maintenance, contrary to what I had originally thought. We have five other switches of the same model out there running the same IOS version, and three of them have experienced the same issue.

 

Does anyone have any ideas or pointers, or has anyone experienced this?

 

 

12 Replies

Leo Laohoo
Hall of Fame
There was an error in the log indicating a udld on the link however the switch tries to recover from the error.

1.  Why is UDLD enabled on a port that is known to be going to another switch?

2.  Why is error-disable recovery (for UDLD) enabled?  Does someone want instability in the network?

 

Recommendation: 

1.  Disable UDLD on known ports that are going to another switch (better yet, enable 802.1q Trunking). 

2.  Take out ANY error-disable auto-recovery global commands.
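
A minimal sketch of recommendation 2 in IOS, assuming UDLD is the err-disable cause that has auto-recovery configured (the thread does not show the actual configuration):

```
configure terminal
 ! remove automatic recovery for UDLD-triggered err-disable
 no errdisable recovery cause udld
 end
! confirm which causes still have recovery enabled
show errdisable recovery
```

How UDLD itself is turned off on a fiber port varies by platform and release, so check the command reference for the switch in question rather than relying on a generic example.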

Leo,

Thanks for your input.

  1. UDLD is enabled to prevent any loops and to verify the health of the link. This is standard on any of our LAN templates and is configured without giving it any thought.
  2. UDLD recovery is configured to allow the switch to try to recover if someone were to, say, bump one of the fiber pairs and cause an accidental loss of signal, or if a technician patches it in incorrectly. It saves us a lot of time too, as this has happened on more than one occasion.

 

  1. I have disabled UDLD on both links and dot1q trunking was already enabled. This made no difference as the link on the distribution end still wouldn't connect.
  2. I will get rid of the error recovery commands and see if this makes any difference once I'm back in the office.

A note on my statement above (ruled out bad access switch): the test switch was connected over a 1 meter fiber jumper so I could quickly test other ports and configurations.

The only common thing I can see in all of this, aside from UDLD being configured, is that there is something wrong with the distribution switch, but I can't figure out what. As I stated, there are also 5 other switches out there in 5 separate buildings displaying the same behaviour. Also note that when I originally configured and installed these devices, no more than a year or so ago, everything was working as advertised, even with UDLD enabled.

 

Dave

Dave,

You are not alone and we are seeing the exact same symptoms with a number of 3750X-s-24s switch stacks being used as distribution gear.  At this point, I've seen this issue span 3 different, isolated, networks across 4 locations. 

The only recovery option we've been able to identify to date is a total reboot of the switch stack containing the 3750X-s-24s gear.  Again, like you, access switch shows connected, absent cdp data, and the local port is down/down.

At this point, we're out of ports and shifting them around is no longer an option.

Have you discovered anything yet in regard to root cause?  I'm anxious to know as we're at our wits' end...  next stop, tech support I guess.

This was definitely a weird one. Cisco TAC was unable to reproduce the anomalies found on the production systems using the same hardware and IOS. Big bummer there; I thought TAC could figure anything out.

They did, however, suggest we upgrade the IOS on our 3750s to a current maintenance release, because the IOS we were running is a deferred release. We went from 15.0(1)SE3 to 15.0(2)SE8. The upgrade went well and the issue was fixed, with no other issues noted to date. I'd like to say the root cause was a bug in the IOS, but since TAC couldn't find one, it's anyone's guess.
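
To confirm which release each switch or stack member is running before and after an upgrade like this, the standard commands below can be used. No output is shown because none was posted in this thread.

```
! running IOS version and image
show version
! stack members and their state
show switch
! image the switch will boot on next reload
show boot
```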

Hope this helps you out.

Thanks Dave.  If it is a "bug", it must exist from the 12.2 strain on up to 15.0(1)SE3.  We ran into it on multiple versions...  ok, if an upgrade seems to be the answer, that is the path we'll be on.

Thanks for your prompt response,

Charles

Did the upgrade fix the issue? 

Yes, the IOS upgrade fixed the problem. To date no other issues have been noticed.

Ah. Ok. Thank you. I'm seeing this issue after power outages (albeit on 3850s), and the workaround we found was simply shutting the stack ports to the fiber module, causing it to reload its config.

Everything came up immediately. 

What model of 3850 are you guys using? 

Actually, I have to correct myself. It was a 3750 as well. 

3750X-24S-S? 

WS-C3750X-48P
