Re: Spanning Tree: Temp. Loops, Debugging-Messages

rolf.fischer_2 · ‎07-27-2005

I am having a problem with temporary L2 loops. We saw link-flapping, host-flapping and MAC-relearning syslog events.

A few words about our topology:

Two Catalyst 6513 (IOS), priority 24576 (active) and 28672 (standby), are connected via port-channel as the backbone. Several Catalyst 4006 (IOS) access switches are connected to each 6513 via an 802.1q trunk and run uplinkfast.

I’ve been debugging STP root, events and config on one particular Cat4006 and see messages like these:

.Jul 26 04:00:33: STP CFG: found port cfg FastEthernet3/48 (6D48F50)

.Jul 26 06:28:23: STP CFG: found port cfg FastEthernet2/40 (6D21B0C)

.Jul 26 06:28:23: STP SW: Fa2/40 new blocking req for 1 vlans

.Jul 26 06:28:23: STP SW: Fa2/40 new forwarding req for 1 vlans

.Jul 26 06:29:29: STP CFG: found port cfg FastEthernet2/40 (6D21B0C)

.Jul 26 06:29:29: STP SW: Fa2/40 new blocking req for 1 vlans

.Jul 26 06:29:29: STP SW: Fa2/40 new forwarding req for 1 vlans

Does anybody know of resources that describe those messages or, even better, whether the messages are meaningful and what the best workaround would be? Since I started debugging (about 1 week ago) no more loops have occurred, but I fear it might start again sometime.

Thanks in advance

Rolf Fischer

mchin345 · ‎08-02-2005

This document presents a list of recommendations that help to implement a safe network with regard to bridging for Cisco Catalyst switches that run Catalyst OS (CatOS) and Cisco IOS Software. This document discusses some of the common reasons that Spanning Tree Protocol (STP) can fail and the information for which to look to identify the source of the problem. The document also shows the kind of design that minimizes spanning tree-related issues and is easy to troubleshoot.

http://www.cisco.com/warp/public/473/16.html

andrew.butterworth · ‎08-02-2005

I would question why you have got any STP loops with the equipment you have. Surely you aren't running this as a layer-2 network? If you are, then I would seriously consider a quick redesign to remove the Layer-2 loops from your topology. This should result in faster failover and the ELIMINATION of STP problems.

That being said, what IOS versions are you running? I have had quite a few problems with STP, especially Rapid STP (hence my preference to design it out of the network if possible). How many VLANs do you have, and are you spanning all VLANs everywhere?

Andy

rolf.fischer_2 · ‎08-03-2005

Thanks for your responses.

Seems like I need to explain our topology more precisely.

The backbone is formed by two Cat6513 (SUP2/MSFC, IOS) linked via Portchannel (8 x GigabitEthernet) running HSRP.

Some access-switches are connected by Layer3, some by Layer2.

We have about 50 VLANs, most of them completely distributed via VTP (at least to our Cat4006, Cat3550 and Cat29xx access-switches). Some trunks are restricted by ‘switchport trunk allowed vlan …’

Additionally we have some HP-Blade switches (Nortel inside) connected via 802.1q-Trunk to the 6513s and Enterasys Matrix E7 connected with access-ports (only 1 VLAN, no tagging) of the 6513s.

At the moment I’m checking the Blades and the E7 because I didn’t configure them myself (I’ve been here for 3 months now), but I have found nothing suspicious so far.

The loops occurred from March to July; we haven’t seen any in the last two weeks.

I updated some really out-of-date IOS versions and checked the configs as recommended in ‘Troubleshooting STP on Catalyst Switches Running IOS System Software’.

Most access-switches run STP uplinkfast and I’m not convinced that I should trust it.

Since the loops don’t occur currently, it’s hard to find the fault.

It could be anything: Bad configuration, corrupted hardware, IOS, incompatibilities, port-channel problems, etc.

So I have to go on checking and rechecking.

I wonder if the debugging-messages (STP SW: Fa2/40 new blocking req for 1 vlans) are meaningful, especially because the FastEth-Ports in those messages (Fa2/40 in this one) are access-ports.

I’m grateful for any ideas

Rolf

williamwbishop · ‎08-07-2005

Add in power problems (a power dip can cause this too, iirc)....

You shouldn't have any problems with uplinkfast. It's a good function, and it won't cause what you are seeing (imo).

Are you saying it doesn't occur often, or that it occurred often and is not currently happening? What is the cycle on this thing?

rolf.fischer_2 · ‎08-07-2005

It hasn’t occurred for three weeks now, but I’m not convinced that the problem is solved. As far as I know we saw it for the first time in April. In May and the first two weeks of June we saw it about once a week, interestingly mostly at weekends, when only very few people are supposed to work here.

Since then we have tried to optimize the configurations (root guard, bpdu guard etc.) and, as mentioned before, it hasn’t occurred for a couple of weeks.

I wonder what exactly happened and, above all, why.

Here are the typical syslog messages seen at the access-switches:

Jun 18 16:17:46: %C4K_EBM-4-HOSTFLAPPING: Host 00:50:DA:61:45:34 in vlan 30 is flapping between port Fa2/29 and port Gi1/1

(Gi1/1 is the uplink)

May 8 08:11:15 switch99.domain 5234: May 8 06:11:14: %RTD-1-ADDR_FLAP: FastEthernet0/1 relearning 102 addrs per min

I think that means the affected switches discard their CAM tables (because of topology changes?).

The big question is: What can I do now to monitor WHAT exactly is happening WHEN it is happening? When the problem occurs we have to make the LAN work again very quickly (we normally shut redundant ports down until the LAN stabilises), so unfortunately there’s little time to analyze what’s going on in those situations.
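One low-effort way to keep evidence for later is to let the syslog server collect everything and summarize the flapping messages afterwards. The following is only a minimal sketch in Python, assuming the log lines look like the HOSTFLAPPING example quoted above; the function name `summarize_hostflaps` and the idea of feeding it lines from a syslog file are illustrative assumptions, not something from the switches themselves.

```python
import re
from collections import Counter

# Pattern modelled on the C4K_EBM-4-HOSTFLAPPING line quoted in this post.
# Other platforms or IOS releases may word the message differently.
HOSTFLAP_RE = re.compile(
    r"%C4K_EBM-4-HOSTFLAPPING: Host (?P<mac>[0-9A-Fa-f:]{17}) "
    r"in vlan (?P<vlan>\d+) is flapping between port (?P<port_a>\S+) "
    r"and port (?P<port_b>\S+)"
)


def summarize_hostflaps(lines):
    """Count HOSTFLAPPING events per VLAN, per MAC and per port pair.

    Lines that do not match the pattern (e.g. ADDR_FLAP) are ignored.
    """
    per_vlan = Counter()
    per_mac = Counter()
    per_ports = Counter()
    for line in lines:
        m = HOSTFLAP_RE.search(line)
        if m is None:
            continue
        per_vlan[int(m["vlan"])] += 1
        per_mac[m["mac"].lower()] += 1
        # Sort the two ports so "A and B" and "B and A" count as one pair.
        per_ports[tuple(sorted((m["port_a"], m["port_b"])))] += 1
    return per_vlan, per_mac, per_ports


if __name__ == "__main__":
    sample = [
        "Jun 18 16:17:46: %C4K_EBM-4-HOSTFLAPPING: Host 00:50:DA:61:45:34 "
        "in vlan 30 is flapping between port Fa2/29 and port Gi1/1",
        "May  8 06:11:14: %RTD-1-ADDR_FLAP: FastEthernet0/1 "
        "relearning 102 addrs per min",
    ]
    vlans, macs, ports = summarize_hostflaps(sample)
    print(vlans.most_common())
    print(macs.most_common())
    print(ports.most_common())
```

Run against the collected syslog of one incident, the three counters would show whether the flapping is confined to one VLAN, whether the same MACs keep moving, and which port pairs are involved most often.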

MATTHIAS SCHAERER · ‎08-10-2005

Rolf,

try to find out which VLAN is the one causing trouble by looking at 'show spanning-tree detail' repeatedly. You can see how many topology changes have occurred and when the last one happened. Then you can have a closer look at the failing VLAN, continue analyzing the spanning tree (repeat your analysis if you say that it changes) and narrow down the failing device (or devices). With that information you can dig further into the configuration or the bug database to resolve your problem.

HTH

mat
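The repeated look at 'show spanning-tree detail' can be scripted once the output has been captured to text. Below is a rough Python sketch that assumes the per-VLAN lines have the form "VLAN0030 is executing ...", "Number of topology changes N last change occurred T ago" and "from <interface>"; the exact wording may differ between IOS releases, and the sample output, the function names `topology_changes` and `changed_vlans`, and the way the text is captured are all illustrative assumptions.

```python
import re

VLAN_RE = re.compile(r"^\s*VLAN(\d+) is executing")
TC_RE = re.compile(
    r"Number of topology changes (\d+) last change occurred (\S+) ago"
)
FROM_RE = re.compile(r"^\s*from (\S+)")


def topology_changes(text):
    """Map VLAN id -> topology change counter, age and source interface."""
    result = {}
    vlan = None
    for line in text.splitlines():
        m = VLAN_RE.match(line)
        if m:
            vlan = int(m.group(1))
            continue
        if vlan is None:
            continue
        m = TC_RE.search(line)
        if m:
            result[vlan] = {
                "changes": int(m.group(1)),
                "last": m.group(2),
                "from": None,
            }
            continue
        m = FROM_RE.match(line)
        if m and vlan in result and result[vlan]["from"] is None:
            result[vlan]["from"] = m.group(1)
    return result


def changed_vlans(before, after):
    """Return {vlan: increase} for counters that grew between snapshots."""
    return {
        vlan: after[vlan]["changes"] - before[vlan]["changes"]
        for vlan in after
        if vlan in before
        and after[vlan]["changes"] > before[vlan]["changes"]
    }


if __name__ == "__main__":
    # Made-up sample output for illustration only, not taken from the thread.
    snapshot_1 = """\
 VLAN0030 is executing the ieee compatible Spanning Tree protocol
  Number of topology changes 210 last change occurred 00:05:12 ago
          from GigabitEthernet1/1
 VLAN0040 is executing the ieee compatible Spanning Tree protocol
  Number of topology changes 3 last change occurred 2w1d ago
          from FastEthernet2/40
"""
    snapshot_2 = snapshot_1.replace("changes 210", "changes 214")
    print(changed_vlans(topology_changes(snapshot_1),
                        topology_changes(snapshot_2)))
```

Comparing two snapshots taken a few minutes apart points at the VLANs whose counters are still moving, and the "from" interface gives a direction in which to follow the topology change towards its source.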

bsc · ‎08-10-2005

Do you have servers that are connected to multiple/two switches? Some servers can do that, and when they switch over (sometimes people do this over weekends ;)) you can see this message in the log, although this should not generate an STP loop. Do your host ports use portfast?

Is the flapping MAC always the same?

Just to add something more to check.

rolf.fischer_2 · ‎08-11-2005

Unfortunately the affected VLAN is our biggest one: 20-bit subnet mask.

There are dozens of flapping addresses, but (as far as I can see) all of them are in that VLAN.
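To put a number on "biggest": a 20-bit mask leaves 12 host bits. A quick check in Python, where the network address 10.30.0.0 is made up for illustration and only the /20 prefix length comes from this post:

```python
import ipaddress

# 10.30.0.0 is a hypothetical address; only the /20 length is from the post.
net = ipaddress.ip_network("10.30.0.0/20")

# Subtract the network and broadcast addresses to get usable host addresses.
usable_hosts = net.num_addresses - 2

print(net.netmask, usable_hosts)  # prints "255.255.240.0 4094"
```

So up to 4094 hosts can share this one broadcast domain.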

Yes, we have some servers connected to both core switches but they are in a different VLAN.

And yes, we use portfast on host ports. I’m planning to secure them (all the portfast ports) with bpdu guard to prevent those ports from participating in the spanning tree.

Additionally I started configuring UDLD for the fiber uplinks.

I think securing the network with the right features should fix the problem, and if not, the good thing is that these features generate messages which will hopefully show us the source of the problem next time.

Thank you all for your interesting responses!

Rolf