
BPDU Guard and bridge loops: Best Practice Ideas?

cpcall
Level 1

We have an environment where users create a lot of bridge loops.  We have tried to send e-mails about it and educate the users, but it is almost a lost cause at this point.  The loops are created when users don't pay attention and they plug a patch cable coming off of an access port into ANOTHER access port by mistake.

All of our access ports are on 3750 stacked switches.  The way we tried to deal with this in the beginning was with BPDU Guard and errdisable (bpduguard) auto recovery.  We turned BPDU Guard on globally and left the errdisable auto recovery interval at the default value (I believe it was 30 seconds), so a loop would be detected and, after 30 seconds, the switch would try to re-enable the port; if the loop still existed, it would close the port for 30 more seconds.  Then we started having problems with printers getting "fried".  Their NICs would die and the control board would need to be replaced.  After a lot of troubleshooting and testing, it was determined that allowing the ports to come out of the errdisabled state would flood the network, the packet rate would climb into the millions per second, and that would fry the NICs of these printers.
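
For reference, the relevant global configuration we started with looked roughly like this (from memory, so the exact commands and the interval value may not match our old running config exactly; portfast was also enabled on the access ports):

spanning-tree portfast default
spanning-tree portfast bpduguard default
errdisable recovery cause bpduguard
errdisable recovery interval 30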

The fix for this, and what saved the printers, was terrible.  We removed errdisable auto recovery and just let looped ports stay in the errdisabled state.  We wait for the user to figure out the loop, try to use the port, and put in a work order.  Then we physically visit the site, verify the port was shut (errdisabled) from a loop, bounce the port (shut/no shut), and everything is resolved.

I did lab tests with a looped switch and a printer on the switch and watched it fry.  We have had no printers fry since we removed auto recovery at every location.  Only the locations where loops existed and auto recovery was running had printers going bad.  What I found during my lab tests was that each time the port was auto-recovered (yes, even for that brief moment while it checks whether a loop still exists), more packets were generated, and eventually enough was re-broadcast that printers would go down.  We never had a problem with computer NICs.  I guess the cheaper printer NICs couldn't handle the broadcast storms created by this.  I tried playing with the auto recovery timers, and even the highest setting would eventually re-create these storms.

So my question is: what best practices are others using?  Should we get rid of BPDU Guard and just let spanning tree handle these bridge loops?  Is there something else I can try?  I'm not a CCNA by any means, just trying to do what I can in my environment.  Manually visiting sites when loops occur is becoming more and more my job, though, and I have plenty of other things to be doing…

Thanks in advance,

Chris

9 Replies

mgalazka
Level 1

Hi Chris.

It sounds like you have two recurring, related issues, and you've done a lot of homework to understand and try to alleviate them.  Nice work thus far!  I'll share my thoughts, just to give you another perspective on it.

First problem: users are creating bridge loops.  To address this, I would definitely continue using BPDU Guard, as it is a great protection mechanism.  With regard to whether or not to use auto-recovery, keep in mind that by not auto-recovering, you are likely forcing a user to put in a ticket when they disable ports.  This presents a teaching moment to help correct the behavior.  You may also be able to glean other information and spot something that can be changed from a process perspective, e.g. you find that untrained users are patching directly to the switch instead of using data drops or engaging a support person.  That could be addressed by putting physical security around your communication closet to limit who has access.  Also, if you keep auto-recovery off, you may want to look into gathering syslogs or another monitoring method to see when a port is errdisabled due to bpduguard.  That lets you track when and where it happens, even if a user does not call it in.
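
For example, something along these lines would get the bpduguard errdisable events to a syslog server and let you spot disabled ports remotely (the server address here is just a placeholder):

logging host 10.1.1.50
logging trap informational

and then, from the CLI:

show interfaces status err-disabled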

Second problem: there are printers with NICs that can't seem to handle a high number of packets per second without being damaged.  It sounds like you have mainly seen this when a unicast flood / broadcast storm happens.  Would you consider putting these printers in their own VLAN, forcing all other access to the printers to be routed?  This would be a basic step you could take to limit the amount of unwanted traffic (broadcasts, unicast flooding) that hits these printer NICs.
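
As a rough sketch of what I mean (the VLAN number and interface range are only examples):

vlan 50
 name PRINTERS
!
interface range FastEthernet1/0/10 - 12
 switchport mode access
 switchport access vlan 50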

I realize I haven't given you any smoking gun from a technical perspective, but I hope this is still helpful.

Regards,

Matt

Joseph W. Doherty
Hall of Fame

Disclaimer

The Author of this posting offers the information contained within this posting without consideration and with the reader's understanding that there's no implied or expressed suitability or fitness for any purpose.  Information provided is for informational purposes only and should not be construed as rendering professional advice of any kind.  Usage of this posting's information is solely at reader's own risk.

Liability Disclaimer

In no event shall Author be liable for any damages whatsoever (including, without limitation, damages for loss of use, data or profit) arising out of the use or inability to use the posting's information even if Author has been advised of the possibility of such damage.

Posting

The "safe" alternative is not using portfast.  If you use RSTP or MST, I believe the learning delay is reduced compared to traditional STP.
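
As a rough sketch of that direction, assuming Rapid PVST+ and removing portfast from the access ports (the interface name is only an example):

spanning-tree mode rapid-pvst
!
interface FastEthernet1/0/1
 no spanning-tree portfast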

Thank you both for your responses!

Matt, unfortunately we do not have support personnel at each location, and users aren't patching directly into the switch in a closet.  We are in a public school environment where users are patching into ports in their classrooms.  Each classroom may have 6 ports, and no matter how much we educate them, we have this problem.  It happens 3-4 times per week.

I have actually tried setting up traps and going out to search for the culprits as they happen, but things aren't labeled very well, so it makes for a frustrating, long wild goose chase to find the guilty party.  Even when we catch them and educate them, there seem to be 5 or 6 more culprits waiting in the wings -haha.

I appreciate your recommendation on moving the printers into their own VLAN.  I have thought about this and have two considerations.  One: this would mean we still have these millions of packets.  Will this start to take away processing power at the switch and give us high utilization errors in the affected VLANs?  Two: the printers use static IPs, and they are moved often from port to port as the teachers shuffle their classrooms around and move from class to class throughout the year.  This would make managing VLANs on each port troublesome.  I have thought about trying to do this, though, and using Cisco's Auto Smartports to match the MAC address vendor IDs of the printers and drop them into their own VLAN as they are moved from port to port.  However, I have not found much info on Smartports, setting up a lab for them using Cisco documentation has proven difficult, and their relative obscurity in the market, in forums, and in the limited resources available makes me believe no one is using this technology and that it may not be supported in the long run.

However, if I find that it will solve my problem, I would go down that road.  We have nearly a thousand printers, and we would of course have to dial in to each one to change the IP addresses before we migrated them to new VLANs, so that would be a big project, but I'm willing to try it.

Joseph, I'll read into both of those options, but I know we have been using portfast for as long as I have been here.  If one of them seems like it could help us, I will definitely investigate it!

Anyone else have any thoughts on the matter?  I'd love to hear ideas/input!

Thanks again to you both,

Chris

Hello Christopher,

One other recommendation that may be useful to you is storm control.  Set up correctly, it could greatly mitigate the flooding issues caused by delinquent nodes.

http://www.cisco.com/en/US/docs/switches/lan/catalyst3750x_3560x/software/release/12.2_55_se/configuration/guide/swtrafc.html#wp1063295
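
Just as a rough illustration of the per-port config (the interface and the 1% threshold are only example values; see the guide above for how to size them and for the trap/shutdown action options):

interface FastEthernet1/0/1
 storm-control broadcast level 1.00
 storm-control multicast level 1.00
 storm-control action shutdown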

Cheers,
Gabriel

Hello

What about designating a couple of VLANs for your printers?

Res
Paul

Sent from Cisco Technical Support iPad App


Please rate and mark as an accepted solution if you have found any of the information provided useful.
This then could assist others on these forums to find a valuable answer and broadens the community’s global network.

Kind Regards
Paul

cpcall
Level 1

Thanks again for everyone's help on this.

After reading all of the suggestions and doing some research on the options, I think I am going to test a location as follows:

Remove portfast from all access ports

Remove BPDUGuard from all access ports

I will then test and see how the network responds to loops.

What got me to this?  I did more tests with errdisable recovery and found that with one loop, the recovery interval set at 30 seconds, and only one host generating traffic on the test switch, there were 4 million packets in 5 minutes because of the loop, and the CPU utilization went high and low accordingly.

When I removed portfast and BPDU Guard, the same test generated 186 packets in five minutes.

Thoughts against other ideas:

Port Security: I thought about trying port security and limiting the maximum MAC addresses per port to 1.  (As an added benefit, this would help prevent unauthorized devices such as routers and hubs brought in from home, which our users love to do.)  The problem is that we have VoIP phones with a port on them for a computer.  We also regularly have vendors installing APs, and we don't want to reconfigure every port when that happens.  I could look into Auto Smartports to pull port security off of ports that see CDP packets, but again, I'm not sure how practical Smartports are given the lack of documentation and, from what I can gather, of people actually using the technology.  (A rough per-port config sketch is below, after these notes.)

Storm Control: Getting the threshold setting right kind of scares me.  How will I know whether it's just multicast video traffic or a storm using up all of the bandwidth?  I still may look into this, though.

Separate VLAN for my printers: This would be a chore to do for all of our printers, and I would have to use Auto Smartports to set the ports to the correct VLAN as printers are moved from room to room and port to port.  Also, this is a band-aid at best, because creating a VLAN for printers while still using BPDU Guard and portfast on the workstation VLAN will still give me high CPU spikes and 4 million packets in a 5-minute window on that VLAN.  Not something I want to see at each location.
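
For what it's worth, the kind of per-port config I would be weighing for the port security idea is roughly the following; the VLAN numbers and the maximum of 3 are only illustrative, and the maximum would need lab testing with our phones, since the phone and the PC behind it each count as a MAC address on the port:

interface FastEthernet1/0/1
 switchport mode access
 switchport access vlan 10
 switchport voice vlan 20
 switchport port-security
 switchport port-security maximum 3
 switchport port-security violation shutdown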

I guess what I'm trying to accomplish is a network that is not as reliant on frequent visits from our very small staff.  I'm trying to find the best balance of performance versus maintenance (site visits).

Does anyone have any input on this?

Thanks for all of the help and replies from everyone so far on this problem!

Chris

I have several thoughts. First, do you have remote access to these switches? Why are techs physically inspecting when a connection causes a loop?

Second, port security can be easily configured to allow 2 MACs per switchport just as it can for one.  Are you willing to put up with every Linksys and Netgear imaginable on your network?

Third, I'd like to know the model of printer that fries NICs on reception of packets.  I've never heard of that.  Ever.

Fourth, you have vendors installing APs on the network, but you don't want to configure the ports each time?  Do you have any confidential information on this network?

You need port security, storm control, and separate VLANs for your printers (and phones and PCs and security equipment, etc.).  You need bpduguard; you may or may not need Auto Smartports; etc.  But it really sounds like you need a NETWORK ADMIN most of all.

Sent from Cisco Technical Support iPad App

Hello Christopher,

You should specify the Cisco switch model and IOS version, and the model and vendor of the printers, to make this thread useful to all other colleagues.

I would NEVER disable BPDU Guard, so I would consider removing auto-recovery instead, and I strongly agree on enabling storm-control everywhere with appropriately low thresholds (1% is enough for /24 IP subnets; you can specify rising and falling thresholds for broadcast, unknown unicast, and multicast frames).
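
As a sketch of what I mean (the interface range and the 1% rising / 0.5% falling values are only examples to be tuned per site):

interface range FastEthernet1/0/1 - 48
 storm-control broadcast level 1.00 0.50
 storm-control multicast level 1.00 0.50
 storm-control unicast level 1.00 0.50
 storm-control action trap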

You may be hitting a SW bug, and storm control can give you the right additional tool to solve your issue.

I have a customer with the same setup, and up to now I haven't heard of fried printers on their network.

In my case the access layer switches are C3560 with 12.2(35)SEE or 12.2(25)SEE, and a few newer C3560v2 with 12.2(50)SE.

Edit:

An errdisable recovery timeout of 30 seconds is too low; the end user is not able to realize something is wrong.  You should use 10 minutes instead, so that there is a chance the offending cable/device has been removed before the re-enabling attempt.

Combined with storm-control and with a longer timer, you may also be able to keep the auto-recovery function.
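
Something like this, with 600 seconds only as an example value:

errdisable recovery cause bpduguard
errdisable recovery interval 600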

Some years ago I saw NICs burned out by a broadcast storm; they were not in PCs, they were the NICs of turnstiles used for physical access control.

Hope to help

Giuseppe

cpcall
Level 1

Thanks again to everyone for the replies.

Jeff, here are replies to your points:

1. I do have remote access to the switches.  But when we get a work order saying ports are not working, we don't have exact port numbers.  We are a LARGE school system with a lot of users and ports spread around a fairly large county.  We have to visit the school to see if this is a legitimate port issue (broken pins, wire shorts), and more often than not when we visit the sites, it is a loop caused by users.  I'm talking in the range of 4-5 per week.  No matter how much we educate them, it happens.  We could remotely bounce the ports on every switch at their site and hope that fixes the problem, but then we are hurting the users who have wire short problems; they wait a few weeks, see that their issue was never resolved, and put in another work order.  So for the customer service sake of the minority, we visit every problem, and it just so happens that most are from loops...

2. We currently have every Linksys and Netgear imaginable on the network.  That's part of why I'm trying to find options to limit this.  I tried labbing port security yesterday and was having some problems, but I will continue to look at it.  (My assumption is that maximum-1-MAC port security should disable the port when there is a loop on the same switch as I am describing.  When I labbed it up, I could not get port security to disable the port or do anything, so I need to read up on it a little more and continue testing.)  This sounds like my best option if it works as I hope.

3. The printers are Dell 1710 and Dell 1720 models.  They are very popular printers, and we actually never had this problem until we installed the 3750s and moved from plain spanning tree to bpduguard.  As for the printers frying, I had never heard of that before either.  But it's what is happening.  Period.  I stumbled on the problem when, after upgrading to 3750s, entire schools had 1710 and 1720 printers that were not working.  There were 3 locations spread around the county.  Someone suspected power issues had fried all of the printers, and then one day I visited a location that had a loop and I removed the loop.  At that time, "errdisable recovery cause bpduguard" and the default 30-second interval were opening the ports and causing micro storms that affected the CPUs on the 3750s.  After I found and removed the loop and was leaving the school, someone said, "Thanks!  You fixed the printers!  They have been acting weird the past few days and a lot of them even fried!"

A light bulb went off for me, as that lined up with the loop problem.  I added the loop back into the network (I know, "part of the problem or part of the cure," right?) and again the printers were acting strange.  I removed the loop and all worked fine (minus the ones that were already fried).  So I went to the lab, grabbed a printer (1710), hooked it up to a switch, added a loop, and let bpduguard and errdisable recovery do their thing.  After a few minutes, the printer's service lights came on.  I removed the loop and the service lights went away.  The printer was still working fine...  I plugged the loop back up and stared at the printer.  After about an hour, all of the lights on the printer were solid (which signifies a blown printer, as Dell puts it).  I tried to use the printer and it would not work.  I ordered and replaced the board with the NIC on the printer and it was fixed.

I then implemented our current "fix" by removing errdisable recovery from all of our locations, and we have not lost a printer since.  Three schools' worth of bad printers (160 or so in total) and none since I removed errdisable recovery.  So... that's the story of how I figured the problem out (if you're still reading this...).

4. We do have vendors with confidentiality agreements who have won bids and they install devices all of the time.  That's just the way it is in a large school district. 

Finally, I will need to look at storm control again and keep trying to get port security to work.  We have contacted Cisco about this and they say it's a Dell problem (they make the printers, after all...), so that's why I'm posting here.  We could get consultants in, but so far this small problem isn't big on our radar.  It's just an annoyance that I feel could be handled more efficiently.  We do have separate VLANs for security, phones, access control, etc.  But our printers and PCs happen to be in the same VLAN.  I agree that we need a network admin, but our management disagrees with you and that's out of my control.  I do what I can with what we have; I'm the low guy on the totem pole, trying to make things more efficient and doing research to try to make that happen.  Funny you mention getting a network admin, because the consultants that installed the 3750s and set up the configs had bpduguard and bpdufilter on every port.  That's what they recommended.  Of course, filtering BPDUs on every port and then trying to use BPDU Guard to look for BPDUs and shut ports is counter-intuitive.

Thanks again for your reply.

Giuseppe, here are my replies to your comments:

1. Cisco 3750 switch stacks, all running 12.2(53)SE2, are my access switches.  The printers are Dell 1710 and 1720 (very popular corporate printers).

2. I will continue to look at storm control, but in my short testing, even setting the threshold low and creating several loops on my lab switch did not set off storm control.  When I get back to work I will lab this and continue testing until I get it working.  I was going to remove bpduguard by removing portfast, but that does not seem to work, as WinXP asks for a DHCP address before RSTP puts the port into forwarding.  Oh well, back to the drawing board.  I think I am going to try to get port security and storm control working, as most people seem to recommend them.  I would leave BPDU Guard on and have auto recovery set to something, though.  Disabling auto recovery means we have to visit the site, and that's what I'm trying to get away from…

3. I kept the 30-second timer as the default because I figured most of my users (obviously, from what I am seeing) never even notice the loops.  I could dial out right now and I bet we have 100-150 active loops.  We only get about 3-5 a week who actually notice the ports are not working.  I figured the people who create a loop and figure it out will unplug the cable and hook it up to a computer.  Then they will test the computer or try to use it.  When it does not work, they will put in a work order.  If the timer is too high (10 minutes, an hour), they will never go back to try the computer after that time.  They try it once and then it's work order time.  I was hoping a short recovery timer would mean the port was back and working by the time they sat down at the computer and tried to use the port.

Again, thanks to everyone for their input.  I'll keep troubleshooting and labbing until I come up with a smarter solution.
