Cisco Catalyst 2960S stack problem

ciscomagu
Level 1

Hi,

I have a new problem with the Catalyst 2960S. We have four switches in a stack, and now I get the message:

“%PLATFORM_RPC-3-MSG_THROTTLED: RPC Msg Dropped by throttle mechanism: type 37, class 14, max_msg 32, total throttled 73968 (hostname1-2)”

Traceback= 13A686C 160862C 160E0B4 15E2088 184FD48 18467B8

sh switch de

Switch/Stack Mac Address : 68bd.abc9.0000
                                           H/W   Current
Switch#  Role   Mac Address     Priority Version  State
----------------------------------------------------------
 1       Member 68bd.abc9.2580     6      1       Ready
 2       Member aca0.16f2.ff80     8      1       Ready
 3       Member 68bd.abc9.1e00     10     1       Ready
*4       Master 68bd.abc9.0000     15     1       Ready

         Stack Port Status             Neighbors
Switch#  Port 1     Port 2           Port 1   Port 2
--------------------------------------------------------
  1        Ok         Ok                4      None
  2        Ok         Ok                1        3
  3        Ok         Ok                2        4
  4        Ok         Ok                3        1

sh switch stack-ports

Switch #    Port 1       Port 2

--------    ------       ------

1           Ok           Ok

2           Ok           Ok

3           Ok           Ok

4           Ok           Ok

The software version is:

Cisco IOS Software, C2960S Software (C2960S-UNIVERSALK9-M), Version 12.2(55)SE, RELEASE SOFTWARE (fc2)

Any ideas what it means?

Best Regards

Magnus

36 Replies

mushroom83
Level 1

Hello,

I got the same problem yesterday after reproducing a stack instability found with the 2960S.

I have two stacks of 4x 2960S switches (initially working in Full_Ring_Mode). Under network load we encountered performance loss, and after some analysis we found strange stack topology changes on both 2960S stacks.

These stack topology changes also induce STP topology changes.

Here is a log example:

Dec  2 13:54:19 MET: TDM change state from Stable to Converging; topology_type=UnResolved
Dec  2 13:54:19 MET: TDM change state from Converging to Converging; topology_type=FullRing
Dec  2 13:54:19 MET: %SPANTREE-5-ROOTCHANGE: Root Changed for vlan 161: New Root Port is StackPort4. New Root Mac Address is 5475.d041.9800 (afpasw03sopbou-4)
Dec  2 13:54:20 MET: TDM change state from Converging to Stable; topology_type=FullRing
Dec  2 13:54:20 MET:  TDM pass normal topology to DTM
Dec  2 13:54:21 MET: TDM: received drop table ready signal from DTM.
Dec  2 13:54:21 MET: TDM: enable dataTxQueues on StackPort-1 ...
Dec  2 13:54:21 MET: TDM: enable dataTxQueues on StackPort-2 ...
Dec  2 13:54:21 MET: %SPANTREE-6-PORT_STATE: Port Po1 instance 180 moving from forwarding to blocking
Dec  2 13:54:21 MET: %SPANTREE-5-ROOTCHANGE: Root Changed for vlan 180: New Root Port is GigabitEthernet1/0/52. New Root Mac Address is 5475.d041.9800
Dec  2 13:54:21 MET: %SPANTREE-6-PORT_STATE: Port Po1 instance 185 moving from forwarding to blocking
Dec  2 13:54:21 MET: %SPANTREE-6-PORT_STATE: Port Gi1/0/52 instance 185 moving from forwarding to blocking
Dec  2 13:54:21 MET: %SPANTREE-6-PORT_STATE: Port Po1 instance 185 moving from blocking to forwarding
Dec  2 13:54:21 MET: %SPANTREE-6-PORT_STATE: Port Gi1/0/52 instance 180 moving from forwarding to blocking
Dec  2 13:54:21 MET: %SPANTREE-6-PORT_STATE: Port Po1 instance 180 moving from blocking to forwarding
Dec  2 13:54:21 MET: %SPANTREE-6-PORT_STATE: Port Po1 instance 185 moving from forwarding to forwarding
Dec  2 13:54:21 MET: %SPANTREE-5-TOPOTRAP: Topology Change Trap for vlan 180
Dec  2 13:54:21 MET: %SPANTREE-6-PORT_STATE: Port Gi1/0/52 instance 185 moving from blocking to forwarding
Dec  2 13:54:21 MET: %SPANTREE-6-PORT_STATE: Port Gi1/0/52 instance 180 moving from blocking to forwarding
Dec  2 13:54:21 MET: %SPANTREE-6-PORT_STATE: Port Po1 instance 180 moving from forwarding to forwarding

I have opened a case and had a WebEx with a Cisco engineer, but at the moment there is no explanation at all.

The only way I have found to get a stable architecture is to disconnect the redundant stack cable and run in Half_Ring_Mode.

Has anyone else encountered a similar problem with a Cisco 2960S stack running in Full_Ring_Mode?

The software version is 12.2(53)SE2.

I am having the same issue.

12-21-2010 11:02:46 Local7.Error 10.100.128.121 64712: Dec 21 17:02:44.737: %PLATFORM_RPC-3-MSG_THROTTLED: RPC Msg Dropped by throttle mechanism: type 37, class 14, max_msg 32, total throttled 181116 (HSTerrill.121-2)

HSTerrill.121#show switch detail
Switch/Stack Mac Address : 9c4e.2078.ae00
                                           H/W   Current
Switch#  Role   Mac Address     Priority Version  State
----------------------------------------------------------
*1       Master 9c4e.2078.ae00     14     1       Ready
2       Member 9c4e.2078.4980     1      1       Ready
3       Member 3037.a697.4980     7      1       Ready
4       Member 3037.a697.4600     1      1       Ready

         Stack Port Status             Neighbors
Switch#  Port 1     Port 2           Port 1   Port 2
--------------------------------------------------------
  1        Ok         Ok                4        2
  2        Ok         Ok                1        3
  3        Ok         Ok              None       4
  4        Ok         Ok                3        1

HSTerrill.121#show switch stack-ports
  Switch #    Port 1       Port 2
  --------    ------       ------
    1           Ok           Ok
    2           Ok           Ok
    3           Ok           Ok
    4           Ok           Ok

Cisco IOS Software, C2960S Software (C2960S-UNIVERSALK9-M), Version 12.2(55)SE, RELEASE SOFTWARE (fc2)

I guess I will reboot the stack and see what happens.

paolo bevilacqua
Hall of Fame

I have the same issue, with the messages above, after 7 weeks uptime.

There were no apparent STP changes but two stack members went into "removed" state.

Connectivity was intermittent. We had to reload all stack members except master to restore stability.

I will open a TAC case, if you can post your SR numbers here it would be useful.

For the issue where the stack falls apart after a number of weeks, there's a known bug, CSCtg77276.  Are you running 12.2(55)SE or later?  If so, please open a TAC case so we can get development re-engaged on the issue.

-Matt

Thank you Matt for providing the bug ID. We are running 53SE2 so I guess we will have to update.

To be honest I'm surprised that this is only a Sev3 bug while in truth it can be quite catastrophic to a business.

It was filed as a sev 3, but it was fixed pretty quickly.  Also, given the challenges of confirming the fix for an issue that only happens every 4+ weeks, the turnaround time wasn't bad at all.  The issue was first found in 12.2(53)SE2 and fixed in 12.2(55)SE, which was the next release.

-Matt

The issue was first found in 12.2(53)SE2 and fixed for 12.2(55)SE which was the next release.

I'm not sure it's fixed.  The topic was opened with 12.2(55)SE IOS running and the problem was evident.

There's a couple of different problems being reported in this thread.

First is the "%PLATFORM_RPC-3-MSG_THROTTLED: RPC Msg Dropped by throttle mechanism:  type 37, class 14, max_msg 32, total throttled 73968 (hostname1-2)" message, which is usually indicative of a high-CPU event or high traffic across the stack ring, such as during a loop.

Second is STP instability inside the stack which is fixed by making the stack half duplex and sounds a lot like CSCtj30652.

Third is a stack member just falling out of the stack after several weeks and that sounds like CSCtg77276.

For the third problem they are running 12.2(53)SE2, for the first they are running 12.2(55)SE, and the second is 12.2(53)SE2.

Ideally, the recommended thing to do is to upgrade all of your 2960S stacks to 12.2(55)SE1, which is out on cisco.com now.  All of these issues are fixed there and it's a stable release for the platform.

If anyone is seeing the symptoms of CSCtg77276 after upgrade to 12.2(55)SE or later please open up a TAC case so we can investigate further.

-Matt

Matt,

I understand that in your view the first and third problems are unrelated, but in my experience they may not be.

On our crash we got the following sequence:

- 7 weeks of normal network operation, no loops of any kind (the stack has a single link to the core), no high traffic.

- Member 1 falls from the stack; however, there is still intermittent connectivity.

- After 7 minutes, member 4 falls from the stack.

- After 3 minutes, %PLATFORM_RPC-3-MSG_THROTTLED. Major network disruption.

- Members 1 and 4 are manually restarted. Member 3 crashes in the process, but within 5 minutes the stack has stabilized again.

I have to say that in our case the number of throttled messages was very small, unlike the case in the opening post.

Hey Paolo,

Out of sheer curiosity, have you tried 12.2(55)SE1?  It was released on 09 December 2010.

Hey Paolo,

While the two things are related, they also aren't the same thing.  To provide some more detail about CSCtg77276: a portion of the CPU gets stuck and frames get discarded at the process level, causing the stack members to fall out of the stack, and a reload is the only way to get everything back up.  So I am not surprised that you see the throttle messages when this issue occurs.

That being said, this bug isn't the only cause of those messages.  They indicate that packets going to the CPU are being throttled, so you can also get this message in the event of a flood of messages across the stack ring or during other kinds of CPU events.  So just because you see the throttle messages, it doesn't mean you are hitting CSCtg77276.

When you get the throttle messages followed by the stack members falling out the throttle messages are the symptom of the larger problem.

Hope this helps clear things up.

-Matt


I upgraded all sixty of my switches to 12.2(55)SE1. Most of my switches are in stacks. Before the upgrade I would see packets suddenly dropped on all of the Ethernet interfaces in a stack, I would sometimes see a neighbor status of "None", and I would sometimes see spanning-tree changes on the stack interfaces.

The switches have been running for nine days since the upgrade. So far the only issue I have is that I still see packets dropped for short periods of time, but only on individual interfaces (versus all monitored interfaces), and not very often. The devices connected to these ports still work fine too. This may not even be a switch issue at all, since the devices connected to the interfaces that drop packets are things like time clocks, alarm systems and the like. So far I have no problems with access points and computers. This is far, far better than all of the issues I was having before the upgrade. I definitely would recommend the upgrade. The upgrade was not easy, though.

I used archive download-sw tftp://x.x.x.x/c2960s-universalk9-tar.122-55.SE1.tar to upgrade the switches. Of the sixty switches I upgraded, four would not boot. All of the switches that did not boot were in different stacks. No individual switch failed during the upgrade itself; after the upgrade and reboot, they would start to load the code and then just hang, each of the four at a different point in loading the code from flash. I found some documentation on how to load the IOS via the serial port; the documentation said that only 9600 baud would work. (This is better than punch cards, but please tell me there is a better way to do this.) It took three hours and fifteen minutes for each switch to upload the code.

I did notice that on all four switches where the upgrade failed, the file size started with 109xxxxx instead of 10894217 bytes. Hopefully that will save you some grief if it happens to you.

   I've seen several occurrences (just yesterday) where copying an image file, even across a stack cable between stack members, resulted in a corrupt IOS file on the destination switch.  DEFINITELY check the file size of the IOS image after copying, or better yet, run an MD5 against it with this command:

verify /md5 flash:{IOS image name}

  You can copy the result and compare it with the MD5 run against the image file on your computer. I use an MD5 utility by Tsoft that automatically loads the MD5 checksum you copied from the command above and allows you to easily compare it to the original image file. You can download the utility from http://software.techrepublic.com.com/abstract.aspx?docid=770725