Re: STP: Max Age / fast changing from blocking to listening

rolf.fischer_2 · ‎10-13-2005

I’m seeing a lot of unusual STP topology changes coming from a port which is supposed to be in blocking state: Cat4006 Gi1/2 (see the sketch below)

+------------------ PortChannel ------------------+

Cat6513A --- (Gi1/1) Cat4006 (Gi1/2) --- Cat6513S

forw -------- forw ----------- blocking ------ forw

Cat6513A = Root, BridgeID = hex6000

Cat6513S = Standby, BridgeID = hex7000

Cat4006 = Access-Switch, BridgeID = hexC000 (UplinkFast)

I enabled debugging STP on the Cat4006 and saw interface Gi1/2 sporadically changing into listening state and changing back to blocking an instant later:

...

Oct 13 13:52:10: STP: VLAN0998 Gi1/2 -> listening

Oct 13 13:52:10: set portid: VLAN0999 Gi1/2: new port id 8002

Oct 13 13:52:10: STP: VLAN0999 Gi1/2 -> listening

Oct 13 13:52:11: STP: VLAN0310 Gi1/2 -> blocking

...
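A minimal sketch of how debug output in the format shown above could be grouped per VLAN to spot listening/blocking flaps on one port. The regular expression is only an assumption based on the few lines quoted here, and the function names are made up for illustration:

```python
import re
from collections import defaultdict

# Pattern inferred from the quoted debug lines only (an assumption,
# not an official log format description).
STP_LINE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s+\d\d:\d\d:\d\d): STP: "
    r"(?P<vlan>VLAN\d+) (?P<port>\S+) -> (?P<state>\w+)\s*$"
)


def parse_transitions(lines):
    """Return (timestamp, vlan, port, new_state) for every STP state line."""
    found = []
    for line in lines:
        match = STP_LINE.match(line.strip())
        if match:
            found.append(
                (match["stamp"], match["vlan"], match["port"], match["state"])
            )
    return found


def states_per_vlan(transitions, port):
    """Group the state changes of one port by VLAN, in log order."""
    per_vlan = defaultdict(list)
    for stamp, vlan, seen_port, state in transitions:
        if seen_port == port:
            per_vlan[vlan].append((stamp, state))
    return dict(per_vlan)


# The lines quoted in the post; the "set portid" line is skipped.
sample = [
    "Oct 13 13:52:10: STP: VLAN0998 Gi1/2 -> listening",
    "Oct 13 13:52:10: set portid: VLAN0999 Gi1/2: new port id 8002",
    "Oct 13 13:52:10: STP: VLAN0999 Gi1/2 -> listening",
    "Oct 13 13:52:11: STP: VLAN0310 Gi1/2 -> blocking",
]
transitions = parse_transitions(sample)
by_vlan = states_per_vlan(transitions, "Gi1/2")
```

With a full capture, a VLAN whose list shows "listening" followed by "blocking" within a second or so would match the behaviour described.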

Because of this I assumed that Cat6513S was failing to send BPDUs to Cat4006 and started tracing its (backup) uplink port to Cat4006.

And as a matter of fact there are interruptions - exactly when the TCs occur.

Now my question: The time period of not sending BPDUs is short, only 4 - 8 seconds, which means 2 - 4 BPDUs are not received by Cat4006 Gi1/2.

Isn’t that much too quick to change into listening state? I thought it takes 20 seconds (max age) to make the port change from blocking into listening. Is the behaviour of the Cat4006 normal (which would mean the problem is with Cat6513S)?

I tried UDLD and LoopGuard as well, but these features don’t work over such short time periods.

Thanks in advance

Rolf Fischer

leonvd79 · ‎10-13-2005

It should take at least 20 seconds to change port states from blocking to listening, another 15 seconds from listening to learning, and finally 15 seconds from learning to forwarding.

I understand from your detailed post that you have LoopGuard and UDLD enabled. Is there by any chance high CPU utilization on the Cat6513S connecting to the Cat4006? Somehow the port fails to receive BPDUs and triggers a TC.

Perhaps, if you're running CatOS on the 6500s, you can use BPDU Skew Detection to see whether your CPU and link utilization are running high.
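As a rough sketch of the timer arithmetic, assuming the 802.1D default timers (hello 2 s, max age 20 s, forward delay 15 s) and ignoring the message age carried in the BPDU. The function names are made up for illustration:

```python
# 802.1D default timers in seconds (assumed; the thread does not state
# that the defaults were left unchanged on these switches).
HELLO_TIME = 2
MAX_AGE = 20
FORWARD_DELAY = 15


def seconds_blocking_to_forwarding(max_age=MAX_AGE, forward_delay=FORWARD_DELAY):
    """Max age expiry, then listening, then learning."""
    return max_age + forward_delay + forward_delay


def missed_bpdus(gap_seconds, hello=HELLO_TIME):
    """Number of hello intervals that fit into a gap without BPDUs."""
    return gap_seconds // hello


def gap_expires_max_age(gap_seconds, max_age=MAX_AGE):
    """True if a BPDU gap is long enough for stored BPDU info to age out."""
    return gap_seconds >= max_age


total = seconds_blocking_to_forwarding()   # 20 + 15 + 15 = 50
worst_gap_missed = missed_bpdus(8)         # 4 BPDUs in an 8 second gap
worst_gap_ages_out = gap_expires_max_age(8)  # False: 8 s is well below 20 s
```

Under these assumptions, a gap of 4 to 8 seconds is far too short for max age to expire, which is why the observed transition to listening looks unusual.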

http://www.cisco.com/warp/public/473/84.html#BPDUskewdetect

rolf.fischer_2 · ‎10-13-2005

Thanks for your responses.

Here are some additions:

All switches (Catalyst 6513 and Catalyst 4006) are running IOS (current 12.1), standard STP 802.1d/PVSTP (no RSTP) configured, checked and re-checked.

I tried udld and loopguard successively but they were unable to detect 1-3 missing bpdus (2-6 seconds).

The crucial question is: Why does the Catalyst 4006 change from blocking into listening if only 1-3 bpdus fail to appear? This seems to be very unusual. I don't think there's a problem with the 6513, because it has links to about 30 switches but the TCs occur only on this particular one. If it failed to send bpdus for 10 seconds or so, I'd assume the problem was there, but we're talking about 2-6 seconds (usually only 2s; 6s was the longest period so far).

And I currently don't have a loop problem, because the 4006 changes back to blocking immediately after changing to listening state.

atman1 · ‎10-13-2005

These are recommendations to protect the networks against forwarding loops.

SPANTREE BEST PRACTICES

1. Make sure you have the complete topology diagram of the entire network with all the switches that carry the vlans.

2. Make sure the ROOT for all these vlans in the network is on the 6500 switch at the core or distribution whichever is the highest level for your layer 2 network.

3. If you have etherchannels configured between switches anywhere, make sure the MODE is set to DESIRABLE-DESIRABLE on both sides for PAgP negotiation.

4. Same for trunks: make sure they are set to "desirable" on both sides and the native VLAN matches.

5. Enable UDLD on all fiber links bi-directionally. This will help detect a uni-directional link that could cause a loop.

6. Enable PORTFAST on all edge ports like workstations, PCs, APs, printers etc. to eliminate unwanted TCNs.

7. Enable loopguard on all non-designated ports, in particular on the access switches.

This link explains these recommendations in detail:

http://cco/en/US/partner/tech/tk389/tk621/technologies_tech_note09186a0080136673.shtml#secure_loops

atman1 · ‎10-13-2005

Troubleshooting:

When you get the error messages please do the following to identify if there is an STP loop:

sho cam aging vlan # - if it is 15, it means there is an STP loop.

show logging buffer 1023 - check for flapping ports.

show top - To see the top port talkers - evaluates port utilization over a 30 sec period of time.

Starting with the root for vlan # and working down to the switches where you saw the log messages, run the following command

sh spantree stats # ( for each port on this switch in vlan #)

What you are looking for is to find the switch or port that initiated the topology change notification.

Repeat these steps on all switches: check for the last topology change initiator, then move on to the switch connected to that port, until you get to the bottom.

Francois Tallet · ‎10-14-2005

Be careful that the aging time going down to 15 seconds is just an indication of a topology change in the network, not a bridging loop.
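To illustrate the point, here is a small sketch of the standard 802.1D behaviour: while a topology change is being signaled, bridges shorten the MAC address aging time to the forward delay. The default values (aging 300 s, forward delay 15 s, max age 20 s) and the function name are assumptions for illustration, not taken from these switches:

```python
# Assumed defaults in seconds; actual values depend on configuration.
DEFAULT_AGING = 300
FORWARD_DELAY = 15
MAX_AGE = 20


def effective_aging_time(seconds_since_tc,
                         default_aging=DEFAULT_AGING,
                         forward_delay=FORWARD_DELAY,
                         max_age=MAX_AGE):
    """Aging time a bridge would use, given the time since the root
    started flagging a topology change (None = no change seen)."""
    # The root keeps the TC flag set for max age + forward delay.
    tc_window = max_age + forward_delay
    if seconds_since_tc is not None and 0 <= seconds_since_tc < tc_window:
        return forward_delay
    return default_aging


during_tc = effective_aging_time(10)    # 15: topology change in progress
after_tc = effective_aging_time(40)     # 300: window of 35 s has passed
no_tc = effective_aging_time(None)      # 300: no topology change at all
```

So an aging time of 15 only says that a topology change was signaled recently. It says nothing about whether a loop exists.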

From the symptoms, I have the feeling that the port is suddenly going back to the initial state, reset as if the link was flapping. From the initial listening state, it goes to blocking as soon as it receives a BPDU from the neighbor. I don't know what the cause of this is, but it is not STP related in my opinion.

Regards,

Francois

rolf.fischer_2 · ‎10-14-2005

Thank you Francois.

I agree with you, from the symptoms it looks like link flapping. And actually we had some link problems in the past.

But: I traced the forwarding of bpdus on that link at both sides and they are transmitted as well as received.

And whenever the TCs / changes from blocking into listening occur, I see some 2 or 3 bpdus missing (also on both sides). Until this happens, the bpdus are received in the normal 2 second interval.

This is rather remarkable, isn't it?
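For anyone who wants to check a BPDU trace for the kind of gaps described above, a minimal sketch follows. It assumes the arrival times have already been extracted as seconds, a hello time of 2 s, and an arbitrary tolerance of 0.5 s. The function name and the sample values are hypothetical:

```python
def find_bpdu_gaps(arrival_times, hello=2.0, tolerance=0.5):
    """Return (last_seen, next_seen, estimated_missed) for each gap
    between consecutive BPDUs that exceeds hello + tolerance."""
    gaps = []
    for prev, cur in zip(arrival_times, arrival_times[1:]):
        delta = cur - prev
        if delta > hello + tolerance:
            # A gap of n hello intervals means roughly n - 1 BPDUs missing.
            missed = round(delta / hello) - 1
            gaps.append((prev, cur, missed))
    return gaps


# Hypothetical arrival times in seconds: regular until t=4, then an 8 s gap.
arrivals = [0.0, 2.0, 4.0, 12.0, 14.0]
gaps = find_bpdu_gaps(arrivals)
```

Comparing the gap timestamps with the timestamps of the listening/blocking transitions would show whether the two always coincide.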