Solved: Random port drops

Davy.Cave · ‎01-07-2013

Hi,

Lately I've noticed some strange behavior on some of the switchports.

When I go through the logs of my SGE2000/2010 stack, I see that some of the ports randomly lose their connection:

2147482703	05-Jan-2013 04:11:43	Warning	%LINK-W-Down: 2/g14
2147482704	05-Jan-2013 03:35:20	Warning	%STP-W-PORTSTATUS: 2/g33: STP status Forwarding
2147482705	05-Jan-2013 03:34:50	Informational	%LINK-I-Up: 2/g33
2147482706	05-Jan-2013 03:34:47	Warning	%LINK-W-Down: 2/g33
2147482707	05-Jan-2013 03:34:19	Informational	%LINK-I-Up: 2/g33
2147482708	05-Jan-2013 03:34:17	Warning	%LINK-W-Down: 2/g33
2147482709	05-Jan-2013 03:34:15	Informational	%LINK-I-Up: 2/g33
2147482710	05-Jan-2013 03:34:14	Warning	%LINK-W-Down: 2/g33
2147482711	05-Jan-2013 03:34:12	Warning	%STP-W-PORTSTATUS: 1/g15: STP status Forwarding
2147482712	05-Jan-2013 03:33:42	Informational	%LINK-I-Up: 1/g15
2147482713	05-Jan-2013 03:33:40	Warning	%LINK-W-Down: 1/g15
2147482714	05-Jan-2013 03:33:20	Warning	%STP-W-PORTSTATUS: 1/g15: STP status Forwarding
2147482715	05-Jan-2013 03:32:50	Informational	%LINK-I-Up: 1/g15
2147482716	05-Jan-2013 03:32:47	Warning	%LINK-W-Down: 1/g15
2147482717	05-Jan-2013 03:31:48	Warning	%STP-W-PORTSTATUS: 2/g5: STP status Forwarding
2147482718	05-Jan-2013 03:31:18	Informational	%LINK-I-Up: 2/g5

I'm having trouble locating the source of the problem. The devices connected to the port are servers and desktops.

This happens frequently throughout the day, but not always on the same ports.
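
To see which ports flap most often, the log can be tallied per port. The following is only a minimal Python sketch: the sample lines are re-typed from the excerpt above with spaces instead of tabs, and the function name `count_link_downs` is made up for illustration.

```python
import re
from collections import Counter

# A few lines re-typed from the log excerpt above (spaces instead of tabs).
SAMPLE_LOG = """\
2147482703 05-Jan-2013 04:11:43 Warning %LINK-W-Down: 2/g14
2147482706 05-Jan-2013 03:34:47 Warning %LINK-W-Down: 2/g33
2147482707 05-Jan-2013 03:34:19 Informational %LINK-I-Up: 2/g33
2147482708 05-Jan-2013 03:34:17 Warning %LINK-W-Down: 2/g33
2147482710 05-Jan-2013 03:34:14 Warning %LINK-W-Down: 2/g33
2147482713 05-Jan-2013 03:33:40 Warning %LINK-W-Down: 1/g15
2147482716 05-Jan-2013 03:32:47 Warning %LINK-W-Down: 1/g15
"""

# Matches the port name that follows a link-down message.
DOWN_RE = re.compile(r"%LINK-W-Down:\s*(\S+)")


def count_link_downs(log_text):
    """Return a Counter mapping port name -> number of LINK-W-Down events."""
    counts = Counter()
    for line in log_text.splitlines():
        match = DOWN_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


flaps = count_link_downs(SAMPLE_LOG)
for port, n in flaps.most_common():
    print(f"{port}: {n} link-down event(s)")
```

Run against a full day of logs, a tally like this shows whether the drops cluster on a few ports or are spread evenly across the stack.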

What could cause the random drops?

Thanks in advance!

Tom Watts · ‎01-07-2013

Hi Davy, looks like you've got a stack. The stack implementation on the older SFE/SGE switches wasn't very good and does have some stability issues.

The common causes for ports to go up/down may include:

Spanning tree (such as root bridge elections or max age timeouts resetting the tables)
Negotiation (speed/duplex)
Discovery protocols such as Bonjour
Over-utilization (system resources or Layer 3)
Firmware problems

If it is at all possible, I'd break the stack and have the switches standalone. I would attribute 90% of the problems to the stack. Most of the time it's just that, unfortunately.

If you'd like to troubleshoot the 5 points listed above, you can make sure your root bridges are set correctly, to avoid max age timer updates causing the CAM tables to be dropped.

You may also manually set port speeds/negotiations to see if it stabilizes the connection. Discovery protocols like Bonjour can cause unexpected errors, so you may want to disable them.

If the switches have a really heavy load or high CPU/memory use, you may try to remove a few connections. If the switches are operating in layer 3, you may be experiencing SFFT overflow errors since the software can't route fast enough.

Of course, it could always be a firmware issue. Make sure you're on the latest!

-Tom
Please mark answered for helpful posts
http://blogs.cisco.com/smallbusiness/

Davy.Cave · ‎01-07-2013

Hi Tom,

First of all, thanks for the reply!

I will try your suggestions and will give feedback on it asap.

Our firmware is indeed outdated, so I'll give that a shot first.

Davy.Cave · ‎01-16-2013

Hi,

I've tried the answers you suggested, but so far I've been out of luck.

We do have some stand-alone SGE2000 switches in our network as well.

They've been showing the same behavior as the stacks:

2147483044	15-Jan-2013 13:26:43	Warning	%STP-W-PORTSTATUS: g4: STP status Forwarding
2147483045	15-Jan-2013 13:26:41	Warning	%LINK-W-Down: g4
2147483046	15-Jan-2013 08:30:07	Informational	%LINK-I-Up: g4
2147483047	15-Jan-2013 08:30:07	Warning	%STP-W-PORTSTATUS: g4: STP status Forwarding
2147483048	15-Jan-2013 08:30:04	Warning	%LINK-W-Down: g4
2147483049	15-Jan-2013 08:30:04	Informational	%LINK-I-Up: g4
2147483050	15-Jan-2013 08:30:04	Warning	%STP-W-PORTSTATUS: g4: STP status Forwarding
2147483051	15-Jan-2013 08:30:02	Warning	%LINK-W-Down: g4

We do have a lot of STP topology changes when I check it in the properties screen.

Might this be the cause of it?

And if so, how can I troubleshoot this?

Root bridge elections are all in order and the max age timer is set to 20 seconds.

Also, our last topology change was 3 days ago, but we get these random port drops every day.
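
One thing the stack log from the first post does show is how long a port stays unusable after each flap. The sketch below measures the gap between a port's link-up and its "STP status Forwarding" message. It assumes classic STP with a forward delay of 15 seconds (the 802.1D default; the thread only confirms the max age of 20 seconds), and `forwarding_delays` is a name chosen for this example. Month parsing with `%b` assumes an English locale.

```python
from datetime import datetime

# (timestamp, event, port) tuples re-typed from the stack log in the first post.
EVENTS = [
    ("05-Jan-2013 03:35:20", "forwarding", "2/g33"),
    ("05-Jan-2013 03:34:50", "up", "2/g33"),
    ("05-Jan-2013 03:34:12", "forwarding", "1/g15"),
    ("05-Jan-2013 03:33:42", "up", "1/g15"),
    ("05-Jan-2013 03:31:48", "forwarding", "2/g5"),
    ("05-Jan-2013 03:31:18", "up", "2/g5"),
]

FORWARD_DELAY_S = 15  # assumption: 802.1D default, not confirmed in the thread


def parse_ts(text):
    """Parse the switch log timestamp format, e.g. '05-Jan-2013 03:34:50'."""
    return datetime.strptime(text, "%d-%b-%Y %H:%M:%S")


def forwarding_delays(events):
    """Return seconds from each link-up to the next forwarding event on that port."""
    delays = {}
    last_up = {}
    # Process events oldest first (the log itself lists the newest first).
    for ts, kind, port in sorted(events, key=lambda e: parse_ts(e[0])):
        when = parse_ts(ts)
        if kind == "up":
            last_up[port] = when
        elif kind == "forwarding" and port in last_up:
            delays[port] = int((when - last_up.pop(port)).total_seconds())
    return delays


delays = forwarding_delays(EVENTS)
for port, seconds in delays.items():
    note = "listening + learning" if seconds == 2 * FORWARD_DELAY_S else "other"
    print(f"{port}: {seconds} s before forwarding ({note})")
```

Each of these ports waits 30 seconds, which matches two forward-delay periods (listening plus learning). So every physical flap costs the attached device roughly half a minute of connectivity, even though no topology change is recorded at the root. Many switches offer an edge-port or fast-link style setting for end-device ports that skips this wait; the exact option name on the SGE2000 is not something this thread confirms.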

Tom Watts · ‎01-16-2013

Hi Davy, each switch has a default bridge priority of 32768. What you want to do is give the head-most switch (the root bridge) priority 4096, then the next in line 8192, the next in line 12288, etc., incrementing by 4096. Additionally, you may try to globally filter BPDUs.
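
The priority ladder can be sketched in a few lines of Python. This is illustration only: the switch names are hypothetical and `priority_ladder` is not a switch command, just a helper that shows the arithmetic and checks that every value is a multiple of 4096.

```python
STEP = 4096               # bridge priorities are set in multiples of 4096
DEFAULT_PRIORITY = 32768  # default bridge priority mentioned above
MAX_PRIORITY = 61440      # highest multiple of 4096 that fits the priority field


def is_valid_priority(value):
    """A bridge priority must be a multiple of 4096 between 0 and 61440."""
    return 0 <= value <= MAX_PRIORITY and value % STEP == 0


def priority_ladder(switch_names, start=STEP):
    """Assign priorities in steps of 4096, most preferred switch first."""
    ladder = {}
    for index, name in enumerate(switch_names):
        priority = start + index * STEP
        if not is_valid_priority(priority):
            raise ValueError(f"no valid priority left for {name!r}")
        ladder[name] = priority
    return ladder


# Hypothetical switch names, ordered from the intended root outward.
ladder = priority_ladder(["head-switch", "second-switch", "third-switch"])
for name, priority in ladder.items():
    print(f"{name}: {priority}")
```

Any switch left at the default of 32768 then loses the election to every switch on the ladder, as long as the ladder stays below that value.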

-Tom

Davy.Cave · ‎01-16-2013

Tom,

This has already been configured.

Our first stack is the root bridge with 24576

Our backup root bridge has 28672

Our other (stand-alone SGE2000 switches) are configured as 32768.

I have configured BPDU filtering instead of flooding on all our switches as well.
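
As a sanity check of the priorities listed above, here is a small Python sketch of how the root election is decided: the bridge with the lowest bridge ID (priority first, MAC address as tie-breaker) wins. The priorities are the ones from this post; the switch names and MAC addresses are made up for illustration.

```python
# Priorities from this post; names and MAC addresses are hypothetical.
BRIDGES = [
    {"name": "stack-1", "priority": 24576, "mac": "00:00:5e:00:53:03"},
    {"name": "backup-root", "priority": 28672, "mac": "00:00:5e:00:53:01"},
    {"name": "standalone-a", "priority": 32768, "mac": "00:00:5e:00:53:02"},
    {"name": "standalone-b", "priority": 32768, "mac": "00:00:5e:00:53:04"},
]


def bridge_id(bridge):
    """Bridge ID as a sortable tuple: priority first, then MAC as an integer."""
    mac_value = int(bridge["mac"].replace(":", ""), 16)
    return (bridge["priority"], mac_value)


def elect_root(bridges):
    """The bridge with the numerically lowest bridge ID becomes root."""
    return min(bridges, key=bridge_id)


def election_order(bridges):
    """Names in the order they would take over as root if earlier ones failed."""
    return [b["name"] for b in sorted(bridges, key=bridge_id)]


root = elect_root(BRIDGES)
print("root bridge:", root["name"])
print("takeover order:", election_order(BRIDGES))
```

With these priorities the first stack wins regardless of MAC addresses, and the backup takes over if it fails. Only the stand-alone switches, which all share 32768, fall back to the MAC tie-breaker.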

I've added a picture to give you a better view of the topology:

Davy.Cave · ‎01-21-2013

Hi,

An update on the situation so far.

Setting the port speed/duplex to a static value seems to have helped for our stand-alone switches!

The problem still persists on some of the ports on the stacks though.

This raised a few questions:

Why did the auto negotiation setting cause this problem?
As mentioned earlier, breaking up the stack seems to be the only solution to completely get rid of this problem?
- What is the cause of the stack instability anyways?
Also, if we'd decide to break up our stack, we will redesign our network.
For the moment, our network is a full L2 model. Would it be beneficial to implement a collapsed core model for example?
- And if it'd be beneficial, what performance could we expect from our SGE2000/2010s in L3 mode?

Thanks in advance!

Tom Watts · ‎01-21-2013

Hey Davy,

Thanks for the couple questions back. I'm not sure I'll give you the greatest answer but I will try.

Auto negotiation can be affected by a myriad of things. It could be (and some may seem silly...) the switch being gigE and a NIC being 100: if the NIC is not advertising, it is up to the switch to figure out what it is doing. This can lead to duplex mismatch, etc. This is often NOT seen on gigE between node and switch, since half duplex is essentially never used there (does it even exist??? never seen it). It can also be the media used. Cat5 is usually associated with 100 mbit, while Cat5e and Cat6 are both rated for gigE, but a marginal, damaged or badly terminated run of any category can make a gigE link flap. So it may be what's in between giving the fits. I'd recommend testing or swapping the cable and patch leads on one of the affected ports.
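
The duplex mismatch case can be illustrated with a simplified Python model. This is a sketch under assumptions, not how any particular switch implements it: when both ends negotiate, the best common mode wins; when one end is hard-set and sends no advertisement, the negotiating end can only sense the speed and falls back to half duplex. The function names are invented for this example, and the model treats 1000 Mbit as requiring negotiation.

```python
# Modes from most to least preferred (simplified priority order).
PRIORITY = [
    (1000, "full"), (1000, "half"),
    (100, "full"), (100, "half"),
    (10, "full"), (10, "half"),
]


def negotiate(local_abilities, partner_abilities):
    """Both ends negotiate: pick the best mode that both advertise."""
    for mode in PRIORITY:
        if mode in local_abilities and mode in partner_abilities:
            return mode
    return None


def parallel_detect(local_abilities, partner_forced):
    """Partner is hard-set: local end senses the speed and assumes half duplex."""
    speed, _ = partner_forced
    if speed >= 1000:
        return None  # 1000BASE-T relies on negotiation, so no fallback here
    fallback = (speed, "half")
    return fallback if fallback in local_abilities else None


def link_outcome(local_abilities, partner_abilities=None, partner_forced=None):
    """Return (local_mode, partner_mode, duplex_mismatch)."""
    if partner_forced is not None:
        local = parallel_detect(local_abilities, partner_forced)
        partner = partner_forced
    else:
        local = partner = negotiate(local_abilities, partner_abilities)
    mismatch = local is not None and local[1] != partner[1]
    return local, partner, mismatch


switch_port = set(PRIORITY)                 # gigE switch port on auto
nic_auto = {m for m in PRIORITY if m[0] <= 100}  # 10/100 NIC on auto

print(link_outcome(switch_port, partner_abilities=nic_auto))
print(link_outcome(switch_port, partner_forced=(100, "full")))
```

The first case settles on 100 full at both ends. In the second, the NIC stays at 100 full while the switch port ends up at 100 half, which is the mismatch described above. Setting only one end to a static value can therefore create the problem rather than fix it; both ends need to agree.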

Second question: if you break the stack, the topology doesn't have to change. I do recommend a couple of redundant links somewhere, just in case of a layer 1 break somewhere, and let spanning-tree be spanning-tree. You never want a switch isolated due to wiring issues.

Last one, L3 mode: there is no performance benefit from the switch's point of view. If you don't need the switches routing, don't use it. If your router is overloaded, making the switch L3 will alleviate the router load, since only traffic that needs a router resource (such as internet) gets sent to it.

-Tom

Davy.Cave · ‎01-21-2013

Hi Tom,

We use Cat5e cabling throughout the building, so it could indeed be the wiring.

Anyways, thanks a lot for your time and help!

I've marked this question as answered! :-)