Re: sx300 series requires reboot every 2 months

cavan1975 · ‎11-16-2011

We have 3 sf-300 series switches in Layer3 mode deployed in different offices. We have found that approximately every 2-3 months at all 3 locations users experience a serious reduction in bandwidth. Only after rebooting the cisco does the problem go away and we're okay for another few months. Has anyone else experienced this? Does anyone have any ideas on some setting/feature that may be contributing to this? We are only using several ports and 1 static route on each switch. We are not using any of the bells and whistles on the switch. 2 of the switches are using the original firmware, whereas the other is using the newest firmware. Maybe this is just what we should expect from a Small Business switch?

Thanks.

David Hornstein · ‎11-16-2011

Hi Cavan1975

No, it's not something to expect from the 300 series, especially since these switches have been running on voice networks since about July, when the 1.1 firmware was released.

Your switches would not have shipped with 1.1.1.8 firmware loaded, so try the new firmware. But like you, I am also interested in other folks' comments.

regards Dave.

Boudewijn Plomp · ‎11-16-2011

We have exactly the same problem. I have it at home and we have it at our office! Seriously, without doubt a recurring event.

At home I use two SG300-20 switches in layer 3 mode.

At the office we have one SG300-28 switch in layer 3 mode and six SG300-10(P) switches in layer 2 mode.

At a certain point the network performance becomes unacceptable. Users complain about slow logons. Connections to servers and virtual machines are unacceptably slow. When you copy files over the network you are lucky to get 100KB/s, where normally it should be around 100MB/s. When you check the switch's CPU utilization, it is higher than normal but still below 50%, though it spikes up and down. If you reboot the switch(es) the CPU is normal again, around 10-15% as far as I remember.

henry2535 · ‎11-16-2011

I also experience the same poor performance problem! There is a very noticeable degradation in bandwidth, and it appears to correlate with CPU utilization above 50%.

We are using four SG300-24 switches with a mix of original and latest firmware. All require a reboot every 2-3 months. Most configurations are Layer 3 mode with a few VLANs and 1 static route.

Please let me know if there are any settings to change that can help resolve this issue.

David Hornstein · ‎11-16-2011

Hi All

I wonder what is going on in the network when you have to reboot the switch. The anecdotal information regarding higher than normal CPU utilization is also interesting and cannot be discounted.

I can't say that you have exactly the same problem as the other posters in this thread, but higher than normal CPU utilization is one symptom you share. It indicates that something (traffic) is hitting the CPU at a higher pps rate.

As an example, my SG300-28P shows the following CPU performance:

switch4cf17c#show cpu input rate

Input Rate to CPU is 1 pps

switch4cf17c#show cpu utilization

CPU utilization service is on.

CPU utilization

---------------

five seconds: 4%; one minute: 3%; five minutes: 3%
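
If you want to log these readings over time and compare a healthy switch against a slow one, the output above can be parsed with a small script. This is only a minimal Python sketch, assuming the output format matches my sample exactly. The function names and the 50% threshold are made up for illustration, and how you collect the text from the switch (SSH, serial, copy and paste) is left out.

```python
import re

# Sample text copied from the 'show cpu utilization' output above.
SAMPLE = """CPU utilization service is on.
CPU utilization
---------------
five seconds: 4%; one minute: 3%; five minutes: 3%
"""

def parse_cpu_utilization(text):
    """Return a dict such as {'five seconds': 4, ...} from the CLI text."""
    pattern = re.compile(r"(five seconds|one minute|five minutes):\s*(\d+)%")
    return {name: int(value) for name, value in pattern.findall(text)}

def is_elevated(stats, threshold=50):
    """True if any reading is at or above the (assumed) threshold in percent."""
    return any(value >= threshold for value in stats.values())

stats = parse_cpu_utilization(SAMPLE)
print(stats)
print("elevated:", is_elevated(stats))
```

Saving one reading like this per day would give the support technician a baseline to compare against when the slowdown returns.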

This sort of quantitative data would be interesting when coupled with a Wireshark capture, so download Wireshark and have it ready on a PC.

Get a feel for capturing some packets, maybe for 20 seconds, and hang on to that capture for comparison purposes.

You probably will be asked to capture some packets when you ring into Small Business Support Center (SBSC) at the time of slow network performance.

The switch does use the Secure Core Technology (SCT) feature to ensure that the switch will receive and process management and protocol traffic, no matter how much total traffic is received.

What is causing the CPU to go above 50 percent? What type of traffic is running through the network at the time of a slowdown? These are questions that may be asked when you call SBSC.

With only the anecdotal information presented, I am flying blind in trying to identify the root cause of the network slowdown.

It may be a broadcast / multicast storm from a bad NIC that is reset when the switch port powers off and on during a manual reboot.

But the anecdotal information presented regarding CPU utilization cannot be discounted, as maybe the switch is reacting to traffic on the network or to your network topology at the time of the slowdown.

When the problem re-occurs, refrain from rebooting the switch, and call the good folk at the Small Business Support Center (SBSC) if you need assistance. They are there to help with break fixes on Cisco hardware.

It may be some environmental issue on the network, such as a broadcast / multicast problem caused by an intermittently bad NIC. The cause may be identified by:

taking a few wireshark captures
capturing switch port counters
show tech
show log
running copper cable diagnostics (after hours) to identify bad copper cables.
etc...

http://www.cisco.com/en/US/support/tsd_cisco_small_business_support_center_contacts.html

regards Dave

reuben.farrelly · ‎11-28-2011

I personally have two of the SG-300 8's and they don't seem to exhibit this problem, but then I don't load them up all that much. Mine typically go many weeks to months between restarts and the restarts are usually unplanned.

However I do know of someone who has some of the 24 port versions who has reported to me that his units needed restarting every month or so as well otherwise the throughput would decrease dramatically. He puts his units under a lot of load (unlike mine). We tested with multiple versions of code including 1.1.1.8 and the problem persisted across multiple versions from the initial 1.0.

In the end the units got shelved. Performance was great while they were working and fresh after restarting, but the requirement to restart them every month was a problem, so he ended up rolling back to unmanaged D-Link switches which, while not as fast, were more reliable (on account of not requiring restarting).

This, along with the other reports on here, pretty strongly indicates to me that there is a much more widespread bug causing this problem. It sounds a bit like a memory leak or something.

sutton.matthew · ‎11-28-2011

This just happened to my sg300-52. Network bandwidth went to crap. I decided to reboot the switch before searching about it. The switch had a runtime of 52 days. After the reboot everything was back to normal.

It is running in L3 mode. It was running 1.1.0.73, after I rebooted I noticed there was a firmware upgrade, so I upgraded it to 1.1.1.8.

Will see what happens in the next 60 days.

David Hornstein · ‎11-29-2011

Hi Matthew,

Too many smart people in this post are saying the same thing. I don't think this scenario is a trend, otherwise this posting would be very popular/busy indeed.

I will run this posting by the Product Manager, but like myself, he will be running blind without some empirical data collected by you.

Remember, when/if it happens again, follow the instructions I posted above, and maybe refer the technician to this posting.

regards Dave

sutton.matthew · ‎11-29-2011

Another thing that I did notice last night. I have Link Aggregation set up between the sg300 and some Dell gigabit access switches. When I noticed latency and bandwidth degradation, it was first to a specific server connected to the sg300. I was doing a file copy, and a 90 meg file was going to take over an hour. I first looked at the server and did not see anything wrong, then noticed it was happening to all traffic being routed by the sg300. I double checked that the LAGs were fully up. I decided to shut down the port on the switch that the file server was connected to, and noticed it was still happening to other servers. So I brought the port back up, and noticed that when the port came up the RAM log showed "%LINK-I-Up: gi27 (Aggregated)" or something about aggregation in parentheses next to it. I thought, wait, that port is not part of a LAG group, and double checked, and it was not. I then power cycled the switch and everything was back to normal. The port came up like it should, "%LINK-I-Up: gi27"

If it happens again I will try to isolate the problem more, and do as you suggested above.

David Hornstein · ‎11-29-2011

Hi Matthew,

I will look into your setup in more detail tonight.

But are you forgetting something about LAG, especially when you test the functionality? I have taken the liberty of copying and pasting a section from page 89 of the 300 series admin guide, which discusses packet distribution in a LAG and load balancing:

Load Balancing

Traffic forwarded to a LAG is load-balanced across the active member ports, thus achieving an effective bandwidth close to the aggregate bandwidth of all the active member ports of the LAG.

Traffic load balancing over the active member ports of a LAG is managed by a hash-based distribution function that distributes Unicast traffic based on Layer 2 or Layer 3 packet header information. Multicast packets behave in the same way as Unicast packets.

The switch supports two modes of load balancing:

By MAC Addresses—Based on the destination and source MAC addresses of all packets.

By IP and MAC Addresses—Based on the destination and source IP addresses for IP packets, and destination and source MAC addresses for non-IP packets.

So, if you unicast from one IP host to another IP Host, LAG will decide which link (not links) in a LAG group to send that traffic over.

If you send Traffic from one source IP address to many destination IP addresses, then the switch LAG hashing algorithm can start to distribute traffic over multiple links in a LAG group.

So, a web server that serves many remote IP hosts (destination IP addresses) on a LAN could start to distribute traffic over a LAG more efficiently. The more remote hosts there are, the more evenly the hashing algorithm will spread the traffic over the interfaces in a LAG group.

You will find that the more IP hosts or Ethernet hosts there are on a network, the more evenly traffic will distribute over a LAG group. But in testing from one IP host to another, the switch will send traffic over only one link until that link fails. If that link fails, you would observe the traffic being sent over another member of the LAG group.

But depending on the switch's hashing algorithm, you may be unlucky enough to see one link in a LAG group used more heavily than the others.
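
To make the point about one conversation sticking to one link concrete, here is a small Python sketch. It is not the hashing algorithm the switch actually uses (I don't know the details of that). It is a toy hash I made up, XOR of the source and destination IP addresses modulo the number of links, and the addresses are invented for illustration.

```python
import ipaddress
from collections import Counter

def pick_link(src_ip, dst_ip, num_links):
    """Toy hash (an assumption, not the switch's real one): XOR both IPs, mod link count."""
    src = int(ipaddress.ip_address(src_ip))
    dst = int(ipaddress.ip_address(dst_ip))
    return (src ^ dst) % num_links

# One host talking to one host: every packet hashes to the same link.
links = {pick_link("192.168.1.10", "192.168.1.20", 2) for _ in range(1000)}
print("links used by a single conversation:", links)

# One server talking to 40 different clients: traffic spreads over both links.
usage = Counter(
    pick_link("192.168.1.10", f"192.168.1.{host}", 2) for host in range(20, 60)
)
print("conversations per link:", dict(usage))
```

With this toy hash the single conversation always lands on one link, while the 40 client conversations split across both. A single file copy between two hosts therefore cannot show the aggregate bandwidth of the LAG.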

See the following IEEE link for a summary of how 802.3ad works. http://www.ieee802.org/3/hssg/public/apr07/frazier_01_0407.pdf

It also mentions:

"All packets associated with a given “conversation” are transmitted on the same link to prevent mis-ordering"

regards Dave

Boudewijn Plomp · ‎12-27-2011

Today we had a performance issue again at our office. The network was unbelievably slow. Servers were nearly unresponsive. Even an RDP connection to a server was very, very slow and disconnected several times in a minute.

I really wanted to do a Wireshark capture to help solve this issue. But I was working remotely and employees at our office could not work normally. I even had a very hard time getting on the web interface of our central layer 3 switch (Cisco SG300-28).

Anyway. After the switch was rebooted everything worked normally again.

At our office all our servers (except one) are virtualized. All other devices are clients, a few IP phones and one voice gateway, which are connected to separate access switches (SG300-10). Isn't there a log, or aren't there alerts stored in the Cisco switch, that can show some more information?

sutton.matthew · ‎12-27-2011

Try upgrading to 1.1.2.x firmware.

It fixes: "Some MAC addresses are not showing (relearned) after a period of 4 to 6 weeks in the switch table. Cisco0000263 (FDB cache timestamp wraparound issue)"

If the switch cannot relearn a MAC address, it sounds like it effectively becomes a hub instead of a switch for that address, and would flood those Ethernet frames out every port in the broadcast domain. And the 4-6 weeks roughly matches the 2 month reboot cycle. Just a guess.
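
To show what a timestamp wraparound can do in general, here is a Python sketch. I don't know the counter width or tick rate the firmware uses, so the 32-bit counter and 1000 ticks per second below are pure assumptions for illustration, and this is not the switch's actual code.

```python
COUNTER_BITS = 32        # assumption: width of the timestamp counter
TICKS_PER_SECOND = 1000  # assumption: counter ticks once per millisecond
MODULUS = 1 << COUNTER_BITS

def days_until_wrap(bits, ticks_per_second):
    """How many days a free-running counter takes to roll over to zero."""
    return (1 << bits) / ticks_per_second / 86400

def naive_age(now, learned):
    """Plain subtraction: goes negative once the counter has wrapped."""
    return now - learned

def wrap_safe_age(now, learned):
    """Modular subtraction: still correct across a single wrap."""
    return (now - learned) % MODULUS

learned = MODULUS - 5000  # entry stamped 5 seconds before the wrap
now = 10000               # counter reads 10 seconds after the wrap

print("days until wrap:", round(days_until_wrap(COUNTER_BITS, TICKS_PER_SECOND), 1))
print("naive age in ticks:", naive_age(now, learned))
print("wrap-safe age in ticks:", wrap_safe_age(now, learned))
```

With the naive subtraction, an entry stamped just before the wrap appears to have a negative age, which is the kind of thing that could confuse aging or relearning logic after some weeks of uptime.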

-Matt

Boudewijn Plomp · ‎01-02-2012

Indeed, that was my plan for January as well. But after this happened I updated all switches to firmware v1.1.2.x straight away. Hopefully this will solve the issue.

At this point I can't think of anything else that might cause an issue. We didn't have this issue with our previous switches. Almost every server is a virtual machine; the rest of the network is clients, printers, IP phones and such. Our Hyper-V server was also replaced three months ago. Of course I don't know 100% for sure that the switch(es) are causing the problem.

jyoopro4ia · ‎02-12-2013

Any updates to this op?

I also have 2 clients that are experiencing this issue. One experiences this more than the other. This particular client's SF 300 switch requires rebooting every few months. When the problem occurs, they experience no connectivity on some of the switch ports. Reboot fixes this immediately. I think I've had to reboot their switch about 4 times last year.

Another client recently experienced this as well. All of a sudden, a few of their phones were not registering to the UC320 system, and a reboot resolved the issue.

sutton.matthew · ‎02-12-2013

I took mine out of production. Having to reboot a switch in the middle of the day is ridiculous.