Solved: Loop reboots switch

Aurelio Llorente · ‎08-10-2016

Hi all,

I've been trying to configure 4 new switches to improve the core of the network.
We already had two C3650-48TS in a stack as the network core (nwcore2). Now I am expanding the core with two C3560-CX (nwcore3 and nwcore4). Those switches are not stackable, so I am using a gigabit cable to connect them to each other and two 1Gb fibre links to connect them to nwcore2. Finally, I connected a stack (2 x C2960-X, nwswitch32) to nwcore3 and nwcore4 using two gigabit links.
The spanning tree configuration I am using is:

nwcore2
spanning-tree mode rapid-pvst
spanning-tree loopguard default
spanning-tree logging
spanning-tree extend system-id
spanning-tree vlan 1-10,12,14-17 priority 4096

nwcore3 and nwcore4
spanning-tree mode rapid-pvst
spanning-tree loopguard default
spanning-tree logging
spanning-tree extend system-id
spanning-tree vlan 1-10,12,14-17 priority 16384

nwswitch32
spanning-tree mode rapid-pvst
spanning-tree loopguard default
spanning-tree logging
spanning-tree portfast edge default
spanning-tree portfast edge bpduguard default
spanning-tree extend system-id

All "user" ports are configured as portfast and bpduguard enabled.
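
To make the effect of these priorities concrete, here is a minimal Python sketch, not anything taken from the switches themselves. It relies on two assumptions: that with "spanning-tree extend system-id" the advertised bridge priority is the configured value plus the VLAN ID, and that nwswitch32 keeps the default priority of 32768 because no priority line is shown for it. The helper names are made up for illustration.

```python
def expand_vlans(spec: str) -> list[int]:
    """Expand an IOS-style VLAN list such as '1-10,12,14-17' into VLAN IDs."""
    vlans = []
    for part in spec.split(","):
        if "-" in part:
            low, high = part.split("-")
            vlans.extend(range(int(low), int(high) + 1))
        else:
            vlans.append(int(part))
    return vlans


def bridge_priority(configured: int, vlan: int) -> int:
    """With extended system ID, the VLAN ID is added to the configured priority."""
    return configured + vlan


# Configured priorities from the post; 32768 for nwswitch32 is an assumption
# (the default, since no priority command is listed for that stack).
DEFAULT_PRIORITY = 32768
configured = {
    "nwcore2": 4096,
    "nwcore3": 16384,
    "nwcore4": 16384,
    "nwswitch32": DEFAULT_PRIORITY,
}


def expected_root(vlan: int) -> str:
    """Switch with the lowest priority for this VLAN (MAC tie-breaks are ignored)."""
    return min(configured, key=lambda name: bridge_priority(configured[name], vlan))


vlans = expand_vlans("1-10,12,14-17")
for vlan in vlans:
    print(vlan, expected_root(vlan), bridge_priority(configured["nwcore2"], vlan))
```

Under those assumptions nwcore2 should be the root for every listed VLAN, which is consistent with the blocked links described next.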

Spanning tree blocks the link between nwcore3 and nwcore4 and one of the links to nwswitch32 stack as you can see in the diagram.

Everything works fine for about 5 hours and 50 minutes. Then I see a huge amount of traffic, and the whole network freezes until nwcore3 reboots itself. Another 5 hours and 50 minutes later the problem begins again.

It seems to be a loop, but I don't know why, as RSTP seems to be correctly configured, doesn't it? I see no spanning tree messages in the logs, only MAC addresses flapping between the links.

What happens every 5 hours and 50 minutes? Why that specific amount of time?

Can anyone help me, please?

Mark Malone · ‎08-10-2016

Hi

So where is the loop coming from? Have you tried tracing the STP topology changes to see what the trigger is? Something is causing it, be it a flaky link, a bad port, etc. I don't think it's your config. More likely something is setting it off, causing a recalculation at layer 2 that opens the redundant link back up. You need to find what triggers it at layer 2.

You can trace STP changes with the command below. It shortens the detail output and shows you where the last change came from. If you keep following the STP changes from switch to switch, it can lead you to the source of the change.

sh spanning-tree detail | i ieee|occur|from|is exec

As an example, on a test switch it shows that the last STP change came from Po127. So I would check the EtherChannel to get the physical port, then check CDP, then go to that neighbouring switch and run the command again, until hopefully you can trace the source of the issue.

xxxxxxxx#sh spanning-tree detail | i ieee|occur|from|is exec
Number of topology changes 22 last change occurred 6d04h ago
from Port-channel127
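
If you collect this output from several switches, a small script can pull out the fields that matter. The following is only a rough Python sketch: it assumes the filtered output looks like the two lines in the example above, and the function name and record layout are made up for illustration.

```python
import re

# Patterns based only on the two output lines shown in the example above.
CHANGE_RE = re.compile(
    r"Number of topology changes (\d+) last change occurred (\S+) ago"
)
FROM_RE = re.compile(r"^\s*from (\S+)")


def parse_topology_changes(output: str) -> list[dict]:
    """Return one record per 'Number of topology changes' line found."""
    records = []
    for line in output.splitlines():
        match = CHANGE_RE.search(line)
        if match:
            records.append({
                "changes": int(match.group(1)),
                "last_change": match.group(2),
                "from_port": None,
            })
            continue
        match = FROM_RE.match(line)
        if match and records:
            # The 'from' line belongs to the most recent counter line.
            records[-1]["from_port"] = match.group(1)
    return records


sample = """\
Number of topology changes 22 last change occurred 6d04h ago
from Port-channel127
"""

for record in parse_topology_changes(sample):
    print(record)
```

The port in "from_port" is the one to follow to the next switch, where the same command is run again.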

If that link is supposed to be the blocked link, set the STP cost on each side higher than the actual cost (for example, spanning-tree cost 500) to force it to be the redundant link. This won't prevent the issue, but it's always good to know which links should be redundant. I wouldn't leave it all up to STP. You can manually control some of it, which is a good idea because problems are then easier to trace at layer 2.
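
To illustrate why a raised cost pins down the redundant link, here is a small Python sketch of root path costs for the topology in the original post. It is a simplification under stated assumptions: nwcore2 is the root, each core switch has one 1Gb uplink to it, and every 1Gb link has a cost of 4 (the short path cost value). Tie-breaking by bridge ID and port ID is ignored, and the function names are hypothetical.

```python
import heapq


def root_path_costs(links: dict, root: str) -> dict:
    """Dijkstra over undirected links: lowest total cost from each switch to the root."""
    graph = {}
    for (a, b), cost in links.items():
        graph.setdefault(a, []).append((b, cost))
        graph.setdefault(b, []).append((a, cost))
    best = {root: 0}
    queue = [(0, root)]
    while queue:
        cost, node = heapq.heappop(queue)
        if cost > best.get(node, float("inf")):
            continue  # stale queue entry
        for neighbour, link_cost in graph[node]:
            new_cost = cost + link_cost
            if new_cost < best.get(neighbour, float("inf")):
                best[neighbour] = new_cost
                heapq.heappush(queue, (new_cost, neighbour))
    return best


def cost_via(links: dict, best: dict, switch: str, neighbour: str) -> int:
    """Root path cost the switch would see on the port facing `neighbour`."""
    link_cost = links.get((switch, neighbour), links.get((neighbour, switch)))
    return link_cost + best[neighbour]


GIG_COST = 4  # assumed cost of a 1Gb link
links = {
    ("nwcore2", "nwcore3"): GIG_COST,
    ("nwcore2", "nwcore4"): GIG_COST,
    ("nwcore3", "nwcore4"): GIG_COST,
    ("nwcore3", "nwswitch32"): GIG_COST,
    ("nwcore4", "nwswitch32"): GIG_COST,
}
best = root_path_costs(links, "nwcore2")

# Same topology with the inter-core link raised to cost 500.
raised = dict(links)
raised[("nwcore3", "nwcore4")] = 500
best_raised = root_path_costs(raised, "nwcore2")

print(cost_via(links, best, "nwcore3", "nwcore4"))
print(cost_via(raised, best_raised, "nwcore3", "nwcore4"))
```

In this model the best paths to the root are the same in both cases. Only the cost seen across the nwcore3 to nwcore4 link changes, so that link stays the one STP blocks, and it is obvious from the configuration which link is meant to be redundant.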

Anyway, that's just one option. It's how I have found what caused loops in larger layer 2 networks after the fact, but it can be tricky.

Also check the switch that you need to reboot before you reboot it. What is the CPU sitting at? Is it maxed out? Are there memory errors? Things like that can make switches act irregularly and cause issues at layer 2, as they can't get BPDUs out or pass traffic correctly.

Aurelio Llorente · ‎08-23-2016

Thanks Mark!

I finally managed to fix the issue. At first I couldn't find any way to stop the switch rebooting: even with all the loops disconnected, it kept rebooting.

After I received the switches, I updated them to the latest IOS available (15.2(5)E and 15.2(4)E2), and then the issue started.

When I was almost setting the switches on fire, I downgraded them to the stock IOS version (15.2(3)E2) and everything started to work fine again.

I think there is a bug in those last two versions.

Anyway, I had already wasted so much time on this issue, so both switches are in production now.

Regards

Mark Malone · ‎08-23-2016

Glad you got it sorted. Good move downgrading. Sometimes an IOS change can fix everything, even if you're not 100% sure whether it's a bug or not.

Leo Laohoo · ‎08-22-2016

C3560-CX (nwcore3 and nwcore4). Those switches are not stackable

Uhhh, I beg to differ.

The new 3560 "CX" range will now support Horizontal Stacking.

Aurelio Llorente · ‎08-23-2016

Oh! That's interesting... However, my switches are the model C3560CX-12TC-S, which has no 10G ports and no horizontal stacking.

Anyway... Is it worth stacking two core switches?

I am not a big fan of stacking, as I have had network outages in the past when updating one member of a stack (not sure if they were Cisco).

Thanks for your answer.