We have 10 Cisco SG200-50 switches running firmware 126.96.36.199. The switches are used in a small data center where we are a tenant and the manager. The switches normally work smoothly with no problems. When the switches fail we experience packet loss and the only way to fix them is to power cycle them.
As the manager of the data center we use a Cisco 7507 router to take internet bandwidth from multiple carriers, split our external IP addresses into different subnets, put those subnets into different VLANs (600-649), and deliver the VLANs to the customers of the data center. As a customer of the data center we give our external VLANs (600-624) to our routers and firewalls and add internal VLANs (10-19) for our internal subnets.
Switch - A is the root of our spanning tree with a bridge priority of 16384.
Switch - B has a bridge priority of 20480.
All other switches have a bridge priority of 32768.
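To illustrate why these priorities make Switch - A the root and Switch - B the fallback, here is a minimal Python sketch of the spanning tree root election, where the lowest bridge ID (priority first, then MAC address) wins. The MAC addresses are made up for the example; only the priorities and switch names come from our setup.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Bridge:
    name: str
    priority: int
    mac: str  # hypothetical MAC address, used only as the tie-breaker

    @property
    def bridge_id(self):
        # STP compares priority first, then the MAC address.
        return (self.priority, int(self.mac.replace(":", ""), 16))


def elect_root(bridges):
    """Return the bridge with the numerically lowest bridge ID."""
    return min(bridges, key=lambda b: b.bridge_id)


bridges = [
    Bridge("Switch - A", 16384, "00:00:00:00:00:0a"),
    Bridge("Switch - B", 20480, "00:00:00:00:00:0b"),
    Bridge("Switch - 104", 32768, "00:00:00:00:00:68"),
    Bridge("Switch - 106", 32768, "00:00:00:00:00:6a"),
]

root = elect_root(bridges)
# If the root disappears, the next-lowest bridge ID takes over.
backup = elect_root([b for b in bridges if b != root])
print("root:", root.name, "| backup:", backup.name)
```

Any device that joins the spanning tree with a priority below 16384 would win this election instead, which is one thing to check on whatever gets plugged into a trunk port.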
As a customer we turn off Spanning Tree on some ports because we use Cisco Local Directors for load balancing.
The default VLAN Id on all the switches is 1.
When we are having a packet loss problem:
1. As a customer we have intermittent ping loss (for example, get 2, lose 5) when pinging the 7507 router from a computer on Customer 1's switches (like Switch - 104).
2. Switch - A's management interface is very slow or unusable.
3. Pings to Switch - A from a computer connected to Switch - B show an increasing response time until they go back to the normal 1 ms. (For example, we will see response times of 1 ms, 1 ms, 1 ms, 5 ms, 10 ms, 14 ms, 20 ms, and then back to 1 ms. The response times will loop like this until we power cycle Switch - A.)
4. Switch - A is set up to send informational log data to a syslog server, but nothing relevant is logged.
Packet Loss Scenario 1:
We configured a Trunk port on Switch - B with some VLANs on it. We then configured a trunk port on a Cisco Catalyst 2950 and connected it to Switch - B. The CPU usage on the 2950 went to 100% with the spanning-tree process taking 80%. Unplugging the 2950 from Switch - B did not fix the problem. The 2950 supports STP and PVST while the SG200 supports STP and RSTP.
Workaround: change the Trunk port on Switch - B to a General port and only allow tagged frames. I think the trunk ports on the SG200 switches require allowing untagged packets. Why does the change from a Trunk port to a General port fix this problem?
Packet Loss Scenario 2:
I accidentally plugged 2 new computers into ports on Switch - 106 that were configured as Trunk ports allowing untagged traffic on VLAN 1. We started losing packets on Customer 1's switches and then the problem spread to Switch - A. Unplugging the computers from Switch - 106 did not fix the problem. Because the problem spread from the customer's switches to the data center's switches I am forbidden from using the SG200 switches as a customer until this issue is resolved. We had to replace our SG200 switches with our legacy Catalyst 3500 and 2900 switches.
Why are we having these packet loss problems?
Why does the packet loss problem spread from the customer switches to the data center's switches?
Why does unplugging the equipment that caused the problem not fix the problem?
Why is a power cycle necessary to fix the problem?
Does the default VLAN Id need to be different for each customer?
I'm not sure where to go next because I haven't been able to reproduce either scenario in a test environment. I think I will turn off as many extra features as possible (discovery protocols, smart ports, and LAG trunks, which I will replace with single-cable trunks) and turn the logging up to debug. None of that fixes the problems we are experiencing, though; it just eliminates potential causes. There is also a new firmware update available, but I would like to be able to reproduce our problem before upgrading the firmware.
Since the issue persists throughout the network even after you disconnect the units you suspect are the point of failure, it points to one of two things:
1. A possible TCAM / MAC table overflow
2. A possible spanning tree / storm control issue
You indicated that the switches' CPU and memory are getting maxed out. There are potentially thousands of causes for this, including spanning tree, storm control, MAC/TCAM exhaustion, heavy data loads, etc. The SG200 switches are "light managed" switches, and the older 2900 and 3500 series are considerably more robust than the SG200 product.
Regarding your observation about General mode vs. Trunk mode, there isn't a huge technical difference. One could argue a General port is closer to a true 802.1Q port. My guess is that the change to General mode helps because, with Smartport negotiating, a Trunk port requires one untagged VLAN (the native VLAN) while a General port does not.
Additionally, for spanning tree, you should verify the Edge port configuration. If PortFast (edge port) is being negotiated on any port that links to another network device such as a switch or router, it must be disabled on that port. Otherwise the port skips the listening and learning states, and a loop can form before BPDUs are processed, which will cause chaos on your network.
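As a rough way to audit this, the check above can be sketched in Python: list every port that has edge/PortFast enabled but is not connected to an end host. The port names and the `neighbor` field are hypothetical; you would fill the inventory in by hand from your own port documentation, since this is not something the switch exports in this form.

```python
from dataclasses import dataclass


@dataclass
class Port:
    name: str       # hypothetical port label
    edge: bool      # True if edge port / PortFast is enabled
    neighbor: str   # "host", "switch", or "router" (filled in by hand)


def risky_edge_ports(ports):
    """Return names of edge-enabled ports that face another network device."""
    return [p.name for p in ports if p.edge and p.neighbor != "host"]


# Hypothetical inventory for one switch.
inventory = [
    Port("g1", edge=True, neighbor="host"),
    Port("g2", edge=True, neighbor="switch"),   # should have edge disabled
    Port("g49", edge=False, neighbor="switch"),
    Port("g50", edge=True, neighbor="router"),  # should have edge disabled
]

print(risky_edge_ports(inventory))
```

Any port this reports is a candidate for having edge mode turned off before further testing.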
Trunk mode VLAN: by default sets egress to tagged, supports multiple VLANs, does not set PVID (native VLAN, ingress untagged), native VLAN cannot be a configured Trunk VLAN or 4095 (discard VLAN).
General mode VLAN: by default sets egress to tagged, supports multiple VLANs, does not set PVID (native VLAN, ingress untagged), native VLAN can be any defined VLAN. Setting the PVID removes the default VLAN (VID=1) for that port. PVID can be 4095 (discard VLAN). General mode allows a mix of tagged and untagged VLANs in the egress direction.
Since the resolution is power cycling the switch, this would indicate the switches may be overloaded, either from a networking error (spanning tree) or simply from too much traffic. Also, the 188.8.131.52 firmware has been pulled by the business unit. The 184.108.40.206 is the current supported software.
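To make the difference concrete, here is a simplified Python model of how an ingress frame would be classified under each mode. It is only a sketch of the behavior described above, not the switch's actual logic: the `admit_untagged` flag stands in for the "only allow tagged frames" setting on a General port, and the VLAN lists are example values.

```python
from dataclasses import dataclass
from typing import Optional

DISCARD_VLAN = 4095


@dataclass
class PortConfig:
    mode: str                    # "trunk" or "general"
    pvid: int                    # VLAN given to untagged ingress frames
    allowed: frozenset           # VLAN IDs accepted when tagged
    admit_untagged: bool = True  # General ports may be set to tagged-only

    def __post_init__(self):
        # Modeled restriction: a trunk always has a usable native VLAN.
        if self.mode == "trunk" and (
            not self.admit_untagged or self.pvid == DISCARD_VLAN
        ):
            raise ValueError("trunk port must accept untagged frames")


def classify_ingress(port: PortConfig, tag: Optional[int]) -> Optional[int]:
    """Return the VLAN a frame lands in, or None if it is dropped."""
    if tag is None:  # untagged frame
        if not port.admit_untagged or port.pvid == DISCARD_VLAN:
            return None
        return port.pvid
    return tag if tag in port.allowed else None


trunk = PortConfig("trunk", pvid=1, allowed=frozenset({600, 601}))
general = PortConfig("general", pvid=DISCARD_VLAN,
                     allowed=frozenset({600, 601}), admit_untagged=False)

print(classify_ingress(trunk, None))    # untagged frame joins VLAN 1
print(classify_ingress(general, None))  # untagged frame is dropped
print(classify_ingress(general, 600))   # tagged frame is accepted
```

In this model, untagged traffic (including untagged BPDUs or frames from a misplugged computer) always lands in VLAN 1 on the Trunk port, while the tagged-only General port drops it at ingress.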
Please mark answered for helpful posts