01-31-2017 05:15 AM - edited 03-08-2019 09:07 AM
One of my Catalyst 2960X Stacks is misbehaving in the following way:
- Gradually quits forwarding frames on ports, generally clustered on a single Member (attaching a sniffer shows *zero* transmitted frames ... a normal port sees plenty of broadcast traffic from clients, not to mention CDP / LLDP / BPDUs / HSRP Hellos)
- Starts ejecting Members (Switch Status changes to Removed)
- CLI responsiveness becomes jerky, sometimes hanging
- Significant commands hang without completing, e.g. "show tech" and "reload"
- Sniffers attached to various ports (I have hundreds of these pcaps now) show intermittent but intense bursts of duplicated frames, e.g. tens of thousands or even hundreds of thousands of HSRP Hellos per second (the normal cadence is 4 per second ... two for the commodity VLAN, two for the VoIP VLAN ... why two? One from the upstream vPC / HSRP Active distribution box, the other from the upstream vPC / HSRP Standby box). Plenty of other duplicated frames (duplicate IP Ident numbers). See the pcap-counting sketch after this list.
- Sometimes, the Stack will reboot itself, after a few hours of this. Mostly, we walk into the IDF and cold boot (unplug / replug all eight power cords).
- Generally, rebooting fixes the issue, although sometimes we have to power cycle a specific Member, to get it to rejoin the Stack.
- Intermittent behavior -- sometimes, it will run for a day or two without major issue; sometimes, we cold boot it every handful of hours.
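For anyone wanting to spot these bursts in their own captures, here is a minimal sketch that bins HSRP Hellos per second. It assumes Scapy is installed; the capture filename is purely illustrative:

from collections import Counter
from scapy.all import rdpcap
from scapy.layers.inet import UDP

# HSRP Hellos ride UDP port 1985; bin them per second of capture time
pkts = rdpcap("idf-port.pcap")                 # illustrative filename
per_second = Counter()
for p in pkts:
    if p.haslayer(UDP) and p[UDP].dport == 1985:
        per_second[int(p.time)] += 1

for t, n in sorted(per_second.items()):
    flag = "  <== burst" if n > 10 else ""     # normal cadence is ~4/sec
    print(f"{t}: {n} HSRP Hellos{flag}")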
Last time this happened (December), we replaced one Member at a time, until the issue cleared. We got lucky -- the second Member we replaced fixed the issue.
Is there a smarter way to identify a bad Member?
Actually, the challenge is larger than this -- I suspect that a bad Stacking Module or Stacking Cable could cause odd / intermittent problems, not just failing RAM / TCAM inside a Member. Generically, is there a smarter way to isolate which of the (8) Members, (8) Stacking Modules, and (8) Stacking Cables might be failing?
If the issue were reproducible in minutes, we could use binary search -- i.e. power off half the Stack ... if the problem persists, power off half of the remaining units ... and continue (see the sketch below). But since the issue takes hours or even days to reproduce ... dang, binary search would consume some serious calendar time.
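The bookkeeping itself is trivial -- the cost is entirely in the observation step. A minimal sketch, assuming exactly one faulty component and a (slow) pass/fail observation of whatever subset is left powered on:

def isolate(suspects, problem_reproduces):
    # Halve the suspect set each round. 'problem_reproduces' is the
    # hypothetical observation step -- in practice, "run the reduced
    # Stack for hours or days and watch for the pathology".
    while len(suspects) > 1:
        half = suspects[:len(suspects) // 2]
        if problem_reproduces(half):
            suspects = half                        # fault is in this half
        else:
            suspects = suspects[len(suspects) // 2:]
    return suspects[0]

# e.g. isolate([f"member-{n}" for n in range(1, 9)], problem_reproduces)
# needs ceil(log2(8)) = 3 rounds -- each round potentially days long.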
==> How to identify failing hardware in a Stack -- this is the question I want to address with my query here. Sure, I would like to solve this immediate problem ... but looking ahead, I want some sort of methodology for identifying bad hardware in a Stack. Suggestions?
Random notes:
- I have a TAC case open
- Stack is running 15.2(2)E6
- I have a handful of these (8) Member Cat2960X stacks ... fortunately, this is the only Stack which has been hitting problems.
01-31-2017 06:06 AM
Hi
You can execute the commands:
show switch
show switch detail
show switch stack-ring
These will help you identify how the stack ring is formed. Also please check this website: http://www.cisco.com/c/en/us/td/docs/switches/lan/catalyst3750/software/troubleshooting/switch_stacks.html
Also try the 'show switch' command with its other arguments.
If this is useful, please rate the comment :-) Thanks
Regards
01-31-2017 06:46 AM
OK Julio, so you are pointing out that 'show switch' output can reveal some level of problems:
5n-esx#sh switch detail
Load for five secs: 39%/0%; one minute: 40%; five minutes: 42%
Time source is NTP, 06:40:35.045 pst Tue Jan 31 2017
Switch/Stack Mac Address : 00da.553e.f080
H/W Current
Switch# Role Mac Address Priority Version State
----------------------------------------------------------
*1 Master 00da.553e.f080 15 4 Ready
2 Member 38ed.1813.2780 14 4 Ready
3 Member 38ed.1812.e500 13 4 Ready
4 Member 38ed.1813.7980 12 4 Ready
5 Member 38ed.1813.7900 11 4 Ready
6 Member 1cde.a7a9.3f80 10 4 Ready
7 Member 00da.5513.4b00 9 4 Ready
8 Member 0038.df04.b600 8 4 Ready
Stack Port Status Neighbors
Switch# Port 1 Port 2 Port 1 Port 2
--------------------------------------------------------
1 Ok Ok 2 8
2 Ok Ok 3 1
3 Ok Ok 4 2
4 Ok Ok 5 3
5 Ok Ok 6 4
6 Ok Ok 7 5
7 Ok Ok 8 6
8 Ok Ok 1 7
5n-esx#sh switch stack-ports
Load for five secs: 39%/0%; one minute: 40%; five minutes: 42%
Time source is NTP, 06:40:38.869 pst Tue Jan 31 2017
Switch # Port 1 Port 2
-------- ------ ------
1 Ok Ok
2 Ok Ok
3 Ok Ok
4 Ok Ok
5 Ok Ok
6 Ok Ok
7 Ok Ok
8 Ok Ok
5n-esx#
And the URL you reference illustrates using 'show platform stack-manager all' output to point toward other issues (version mismatches, for example):
5n-esx# show platform stack manager all
Load for five secs: 40%/0%; one minute: 43%; five minutes: 43%
Time source is NTP, 06:36:08.033 pst Tue Jan 31 2017
Switch/Stack Mac Address : 00da.553e.f080
H/W Current
Switch# Role Mac Address Priority Version State
----------------------------------------------------------
*1 Master 00da.553e.f080 15 4 Ready
2 Member 38ed.1813.2780 14 4 Ready
3 Member 38ed.1812.e500 13 4 Ready
4 Member 38ed.1813.7980 12 4 Ready
5 Member 38ed.1813.7900 11 4 Ready
6 Member 1cde.a7a9.3f80 10 4 Ready
7 Member 00da.5513.4b00 9 4 Ready
8 Member 0038.df04.b600 8 4 Ready
Stack Port Status Neighbors
Switch# Port 1 Port 2 Port 1 Port 2
--------------------------------------------------------
1 Ok Ok 2 8
2 Ok Ok 3 1
3 Ok Ok 4 2
4 Ok Ok 5 3
5 Ok Ok 6 4
6 Ok Ok 7 5
7 Ok Ok 8 6
8 Ok Ok 1 7
Stack Discovery Protocol View
==============================================================
Switch Active Role Current Sequence Dirty
Number State Number Bit
--------------------------------------------------------------------
1 TRUE Master Ready 250 FALSE
2 TRUE Member Ready 130 FALSE
3 TRUE Member Ready 194 FALSE
4 TRUE Member Ready 195 FALSE
5 TRUE Member Ready 228 FALSE
6 TRUE Member Ready 206 FALSE
7 TRUE Member Ready 131 FALSE
8 TRUE Member Ready 083 FALSE
Stack State Machine View
==============================================================
Switch Master/ Mac Address Version Current
Number Member (maj.min) State
-----------------------------------------------------------
1 Master 00da.553e.f080 1.56 Ready
2 Member 38ed.1813.2780 1.56 Ready
3 Member 38ed.1812.e500 1.56 Ready
4 Member 38ed.1813.7980 1.56 Ready
5 Member 38ed.1813.7900 1.56 Ready
6 Member 1cde.a7a9.3f80 1.56 Ready
7 Member 00da.5513.4b00 1.56 Ready
8 Member 0038.df04.b600 1.56 Ready
Last Conflict Parameters
Switch Master/ Cfgd Default Image H/W # of Mac Address
Number Member Prio Config Type Prio Members
-----------------------------------------------------------------------
1 Master 15 No 4 5 6 00da.553e.f080
3 Member 13 No 4 5 0 38ed.1812.e500
4 Member 12 No 4 5 0 38ed.1813.7980
5 Member 11 No 4 5 0 38ed.1813.7900
6 Member 10 No 4 5 0 1cde.a7a9.3f80
7 Member 9 No 4 5 0 00da.5513.4b00
8 Member 8 No 4 5 0 0038.df04.b600
And the URL mentions other possible clues -- e.g. Stack Master flapping, Stack Partitioning -- neither of which applies here.
I think I'm working in a different problem space, though:
- That URL does not mention the pathology I'm seeing
- And in any case, as far as 'show switch' and 'show platform stack-manager all' go, this Stack is fine.
- Except that this Stack most definitely is not fine. (One tactic that follows: poll those snapshot commands continuously and diff the output, per the sketch below, so a transient stack-port state change gets caught.)
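A minimal polling sketch, assuming Netmiko is available; the hostname and credentials are placeholders:

import time
from netmiko import ConnectHandler

conn = ConnectHandler(device_type="cisco_ios", host="5n-esx",
                      username="admin", password="***")   # placeholders
previous = None
while True:
    current = conn.send_command("show switch stack-ports")
    if previous is not None and current != previous:
        # A transient Down on a stack port shows up here even if
        # one-off snapshots always look clean.
        print(time.strftime("%F %T"), "stack-port status changed:")
        print(current)
    previous = current
    time.sleep(10)                                        # poll interval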
I wonder if I'm bumping into one of the challenges of troubleshooting a distributed system (i.e. numerous brains ... (8) in this case ... collaborating) -- that this is just hard ... perhaps binary search is the only effective solution.
Anyway, other insights into how to troubleshoot, generically, a Stack issue?
--sk
01-31-2017 06:52 AM
As an aside, I note that this Stack logs these sorts of messages, when it is crumping:
2017-01-30T15:01:55.019979-08:00 5n-esx-mgmt 990: 000980: Jan 30 15:01:54.543 pst: %SUPQ-4-PORT_QUEUE_STUCK: Port queue Stuck for asic 1 port 14 queue 0
2017-01-30T15:01:55.021329-08:00 5n-esx-mgmt 991: 000981: Jan 30 15:01:54.543 pst: Error disabling queue 0 for asic 1 port 14
2017-01-30T15:02:36.012774-08:00 5n-esx-mgmt 999: 000989: Jan 30 15:02:35.996 pst: %XDR-3-XDROOS: Received an out of sequence IPC message. Expected 9373 but got 9369 from slot 5.
2017-01-30T15:02:42.566506-08:00 5n-esx-mgmt 1026: -Traceback= 555EECz 7139Cz 7658Cz 8A768z 102113Cz 1021C0Cz 1025CF4z 1026A58z 1027280z 1027A14z 28252C4z 275BA4Cz 275E770z 2743F78z 273B524z 273B4CCz
2017-01-30T15:06:46.324424-08:00 5n-esx-mgmt 1827: 001406: 000035: Jan 30 15:06:45.249 pst: %XDR-6-XDRLCDISABLEREQUEST: Client XDR Interrupt Priority Client requested to be disabled. Due to XDR Keepalive Timeout (5n-esx-7)
But again, I think the specifics are not so important; rather, I'm interested in methodology for isolating bad hardware.
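That said, these particular messages do name suspects ("from slot 5", "(5n-esx-7)"), so one generic move is to tally which Member the errors implicate. A minimal sketch, assuming syslog lines shaped like the ones above; the filename is illustrative:

import re
from collections import Counter

# Match "from slot N" and the "(5n-esx-N)" suffix seen in the logs above
slot_pat = re.compile(r"from slot (\d+)|\(5n-esx-(\d+)\)")
hits = Counter()

with open("stack-syslog.txt") as f:            # illustrative filename
    for line in f:
        if "%SUPQ-" in line or "%XDR-" in line:
            m = slot_pat.search(line)
            if m:
                hits[m.group(1) or m.group(2)] += 1

for member, count in hits.most_common():
    print(f"member {member}: {count} error messages")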
--sk
01-31-2017 07:21 AM
Yeah, I was investigating, and this error can be generated by mls qos, or by a bug that requires an IOS upgrade. Please see this similar case and the associated bug:
https://supportforums.cisco.com/discussion/12519776/cisco-2960-x-error-supq-4-portqueuestuck
https://quickview.cloudapps.cisco.com/quickview/bug/CSCtx83354
Are the devices still under contract, so you can open a Cisco TAC ticket? From the command output, everything looks OK with your stack.