01-31-2017 05:15 AM - edited 03-08-2019 09:07 AM
One of my Catalyst 2960X Stacks is misbehaving in the following way:
- Gradually quits forwarding frames on ports, generally clustered on a single Member (attaching a sniffer shows *zero* transmitted frames ... a normal port sees plenty of broadcast traffic from clients, not to mention CDP / LLDP / BPDUs / HSRP Hellos)
- Starts ejecting Members (Switch Status changes to Removed)
- CLI responsiveness becomes jerky, sometimes hanging
- Significant commands hang without completing, e.g. "show tech" and "reload"
- Sniffers attached to various ports (I have hundreds of these pcaps now) show intermittent but intense bursts of duplicated frames, e.g. tens of thousands or even hundreds of thousands of HSRP Hellos per second (the normal cadence is 4 per second ... two for the commodity VLAN, two for the VoIP VLAN ... why two? One from the upstream vPC / HSRP Active distribution box, the other from the upstream vPC / HSRP Standby box). Plenty of other duplicated frames (duplicate IP Ident numbers). See the pcap-counting sketch after this list.
- Sometimes, the Stack will reboot itself, after a few hours of this. Mostly, we walk into the IDF and cold boot (unplug / replug all eight power cords).
- Generally, rebooting fixes the issue, although sometimes we have to power cycle a specific Member, to get it to rejoin the Stack.
- Intermittent behavior -- sometimes, it will run for a day or two without major issue; sometimes, we cold boot it every handful of hours.
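For anyone wanting to spot these bursts in their own captures, here is a minimal sketch that bins HSRP Hellos per second. It assumes Scapy is installed; the capture filename is purely illustrative:

from collections import Counter
from scapy.all import rdpcap
from scapy.layers.inet import UDP

# HSRP Hellos ride UDP port 1985; bin them per second of capture time
pkts = rdpcap("idf-port.pcap")                 # illustrative filename
per_second = Counter()
for p in pkts:
    if p.haslayer(UDP) and p[UDP].dport == 1985:
        per_second[int(p.time)] += 1

for t, n in sorted(per_second.items()):
    flag = "  <== burst" if n > 10 else ""     # normal cadence is ~4/sec
    print(f"{t}: {n} HSRP Hellos{flag}")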
Last time this happened (December), we replaced one Member at a time, until the issue cleared. We got lucky -- the second Member we replaced fixed the issue.
Is there a smarter way to identify a bad Member?
Actually, the challenge is larger than this -- I suspect that a bad Stacking Module or Stacking Cable could cause odd / intermittent problems, not just failing RAM / TCAM inside a Member. Generically, is there a smarter way to isolate which of the (8) Members, (8) Stacking Modules, and (8) Stacking Cables might be failing?
If the issue were reproducible in minutes, we could use binary search -- i.e. power off half the Stack ... if the problem persists, power off half of the remaining units ... and continue (see the sketch below). But since the issue takes hours or even days to reproduce ... dang, binary search would consume some serious calendar time.
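The bookkeeping itself is trivial -- the cost is entirely in the observation step. A minimal sketch, assuming exactly one faulty component and a (slow) pass/fail observation of whatever subset is left powered on:

def isolate(suspects, problem_reproduces):
    # Halve the suspect set each round. 'problem_reproduces' is the
    # hypothetical observation step -- in practice, "run the reduced
    # Stack for hours or days and watch for the pathology".
    while len(suspects) > 1:
        half = suspects[:len(suspects) // 2]
        if problem_reproduces(half):
            suspects = half                        # fault is in this half
        else:
            suspects = suspects[len(suspects) // 2:]
    return suspects[0]

# e.g. isolate([f"member-{n}" for n in range(1, 9)], problem_reproduces)
# needs ceil(log2(8)) = 3 rounds -- each round potentially days long.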
==> How to identify failing hardware in a Stack -- this is the question I want to address with my query here. Sure, I would like to solve this immediate problem ... but looking ahead, I want some sort of methodology for identifying bad hardware in a Stack. Suggestions?
Random notes:
- I have a TAC case open
- Stack is running 15.2(2)E6
- I have a handful of these (8) Member Cat2960X stacks ... fortunately, this is the only Stack which has been hitting problems.
01-31-2017 06:06 AM
Hi
You can execute the commands:
show switch
show switch detail
show switch stack-ring
These will help you identify how the stack ring is formed. Also please check this website: http://www.cisco.com/c/en/us/td/docs/switches/lan/catalyst3750/software/troubleshooting/switch_stacks.html
Also try the 'show switch' command with its other arguments.
If this is useful, please rate the comment :-) Thanks
Regards
01-31-2017 06:46 AM
OK Julio, so you are pointing out that 'show switch' output can reveal some level of problems:
5n-esx#sh switch detail
Load for five secs: 39%/0%; one minute: 40%; five minutes: 42%
Time source is NTP, 06:40:35.045 pst Tue Jan 31 2017
Switch/Stack Mac Address : 00da.553e.f080
H/W Current
Switch# Role Mac Address Priority Version State
----------------------------------------------------------
*1 Master 00da.553e.f080 15 4 Ready
2 Member 38ed.1813.2780 14 4 Ready
3 Member 38ed.1812.e500 13 4 Ready
4 Member 38ed.1813.7980 12 4 Ready
5 Member 38ed.1813.7900 11 4 Ready
6 Member 1cde.a7a9.3f80 10 4 Ready
7 Member 00da.5513.4b00 9 4 Ready
8 Member 0038.df04.b600 8 4 Ready
Stack Port Status Neighbors
Switch# Port 1 Port 2 Port 1 Port 2
--------------------------------------------------------
1 Ok Ok 2 8
2 Ok Ok 3 1
3 Ok Ok 4 2
4 Ok Ok 5 3
5 Ok Ok 6 4
6 Ok Ok 7 5
7 Ok Ok 8 6
8 Ok Ok 1 7
5n-esx#sh switch stack-ports
Load for five secs: 39%/0%; one minute: 40%; five minutes: 42%
Time source is NTP, 06:40:38.869 pst Tue Jan 31 2017
Switch # Port 1 Port 2
-------- ------ ------
1 Ok Ok
2 Ok Ok
3 Ok Ok
4 Ok Ok
5 Ok Ok
6 Ok Ok
7 Ok Ok
8 Ok Ok
5n-esx#
And the URL you reference illustrates using 'show platform stack-manager all' output to point toward other issues (version mismatches, for example):
5n-esx# show platform stack manager all
Load for five secs: 40%/0%; one minute: 43%; five minutes: 43%
Time source is NTP, 06:36:08.033 pst Tue Jan 31 2017
Switch/Stack Mac Address : 00da.553e.f080
H/W Current
Switch# Role Mac Address Priority Version State
----------------------------------------------------------
*1 Master 00da.553e.f080 15 4 Ready
2 Member 38ed.1813.2780 14 4 Ready
3 Member 38ed.1812.e500 13 4 Ready
4 Member 38ed.1813.7980 12 4 Ready
5 Member 38ed.1813.7900 11 4 Ready
6 Member 1cde.a7a9.3f80 10 4 Ready
7 Member 00da.5513.4b00 9 4 Ready
8 Member 0038.df04.b600 8 4 Ready
Stack Port Status Neighbors
Switch# Port 1 Port 2 Port 1 Port 2
--------------------------------------------------------
1 Ok Ok 2 8
2 Ok Ok 3 1
3 Ok Ok 4 2
4 Ok Ok 5 3
5 Ok Ok 6 4
6 Ok Ok 7 5
7 Ok Ok 8 6
8 Ok Ok 1 7
Stack Discovery Protocol View
==============================================================
Switch Active Role Current Sequence Dirty
Number State Number Bit
--------------------------------------------------------------------
1 TRUE Master Ready 250 FALSE
2 TRUE Member Ready 130 FALSE
3 TRUE Member Ready 194 FALSE
4 TRUE Member Ready 195 FALSE
5 TRUE Member Ready 228 FALSE
6 TRUE Member Ready 206 FALSE
7 TRUE Member Ready 131 FALSE
8 TRUE Member Ready 083 FALSE
Stack State Machine View
==============================================================
Switch Master/ Mac Address Version Current
Number Member (maj.min) State
-----------------------------------------------------------
1 Master 00da.553e.f080 1.56 Ready
2 Member 38ed.1813.2780 1.56 Ready
3 Member 38ed.1812.e500 1.56 Ready
4 Member 38ed.1813.7980 1.56 Ready
5 Member 38ed.1813.7900 1.56 Ready
6 Member 1cde.a7a9.3f80 1.56 Ready
7 Member 00da.5513.4b00 1.56 Ready
8 Member 0038.df04.b600 1.56 Ready
Last Conflict Parameters
Switch Master/ Cfgd Default Image H/W # of Mac Address
Number Member Prio Config Type Prio Members
-----------------------------------------------------------------------
1 Master 15 No 4 5 6 00da.553e.f080
3 Member 13 No 4 5 0 38ed.1812.e500
4 Member 12 No 4 5 0 38ed.1813.7980
5 Member 11 No 4 5 0 38ed.1813.7900
6 Member 10 No 4 5 0 1cde.a7a9.3f80
7 Member 9 No 4 5 0 00da.5513.4b00
8 Member 8 No 4 5 0 0038.df04.b600
And the URL mentions other possible clues -- e.g. Stack Master flapping, Stack Partitioning -- neither of which applies here.
I think I'm working in a different problem space, though:
- That URL does not mention the pathology I'm seeing
- And in any case, as far as 'show switch' and 'show platform stack-manager all' go, this Stack is fine.
- Except that this Stack most definitely is not fine. (One tactic that follows: poll those snapshot commands continuously and diff the output, per the sketch below, so a transient stack-port state change gets caught.)
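A minimal polling sketch, assuming Netmiko is available; the hostname and credentials are placeholders:

import time
from netmiko import ConnectHandler

conn = ConnectHandler(device_type="cisco_ios", host="5n-esx",
                      username="admin", password="***")   # placeholders
previous = None
while True:
    current = conn.send_command("show switch stack-ports")
    if previous is not None and current != previous:
        # A transient Down on a stack port shows up here even if
        # one-off snapshots always look clean.
        print(time.strftime("%F %T"), "stack-port status changed:")
        print(current)
    previous = current
    time.sleep(10)                                        # poll interval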
I wonder if I'm bumping into one of the challenges of troubleshooting a distributed system (i.e. numerous brains ... (8) in this case ... collaborating) -- that this is just hard ... perhaps binary search is the only effective solution.
Anyway, other insights into how to troubleshoot, generically, a Stack issue?
--sk
01-31-2017 06:52 AM
As an aside, I note that this Stack logs these sorts of messages, when it is crumping:
2017-01-30T15:01:55.019979-08:00 5n-esx-mgmt 990: 000980: Jan 30 15:01:54.543 pst: %SUPQ-4-PORT_QUEUE_STUCK: Port queue Stuck for asic 1 port 14 queue 0
2017-01-30T15:01:55.021329-08:00 5n-esx-mgmt 991: 000981: Jan 30 15:01:54.543 pst: Error disabling queue 0 for asic 1 port 14
2017-01-30T15:02:36.012774-08:00 5n-esx-mgmt 999: 000989: Jan 30 15:02:35.996 pst: %XDR-3-XDROOS: Received an out of sequence IPC message. Expected 9373 but got 9369 from slot 5.
2017-01-30T15:02:42.566506-08:00 5n-esx-mgmt 1026: -Traceback= 555EECz 7139Cz 7658Cz 8A768z 102113Cz 1021C0Cz 1025CF4z 1026A58z 1027280z 1027A14z 28252C4z 275BA4Cz 275E770z 2743F78z 273B524z 273B4CCz
2017-01-30T15:06:46.324424-08:00 5n-esx-mgmt 1827: 001406: 000035: Jan 30 15:06:45.249 pst: %XDR-6-XDRLCDISABLEREQUEST: Client XDR Interrupt Priority Client requested to be disabled. Due to XDR Keepalive Timeout (5n-esx-7)
But again, I think the specifics are not so important; rather, I'm interested in methodology for isolating bad hardware.
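That said, these particular messages do name suspects ("from slot 5", "(5n-esx-7)"), so one generic move is to tally which Member the errors implicate. A minimal sketch, assuming syslog lines shaped like the ones above; the filename is illustrative:

import re
from collections import Counter

# Match "from slot N" and the "(5n-esx-N)" suffix seen in the logs above
slot_pat = re.compile(r"from slot (\d+)|\(5n-esx-(\d+)\)")
hits = Counter()

with open("stack-syslog.txt") as f:            # illustrative filename
    for line in f:
        if "%SUPQ-" in line or "%XDR-" in line:
            m = slot_pat.search(line)
            if m:
                hits[m.group(1) or m.group(2)] += 1

for member, count in hits.most_common():
    print(f"member {member}: {count} error messages")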
--sk
01-31-2017 07:21 AM
Yeah, I was investigating, and this error can be generated by mls qos, or by a bug that requires an IOS upgrade. Please see this similar case and the associated bug:
https://supportforums.cisco.com/discussion/12519776/cisco-2960-x-error-supq-4-portqueuestuck
https://quickview.cloudapps.cisco.com/quickview/bug/CSCtx83354
Are the devices still under contract, so you can open a Cisco TAC ticket? From the command output, everything looks OK with your stack.