How to identify bad hardware in a Stack

stuartkendrick
Level 1

One of my Catalyst 2960X Stacks is misbehaving in the following way:

- Gradually quits forwarding frames on ports, generally clustered on a single Member (attaching a sniffer shows *zero* transmitted frames ... a normal port sees plenty of broadcast traffic from clients, not to mention CDP / LLDP / BPDUs / HSRP Hellos)

- Starts ejecting Members (Switch Status changes to Removed)

- CLI responsiveness becomes jerky, sometimes hanging

- Significant commands hang without completing, e.g. "show tech" and "reload"

- Sniffers attached to various ports (I have hundreds of these pcaps now) show intermittent but intense bursts of duplicated frames, e.g. tens of thousands or even hundreds of thousands of HSRP Hellos per second. The normal cadence is 4 per second: two for the commodity VLAN, two for the VoIP VLAN. Why two? One from the upstream vPC / HSRP Active distribution box, the other from the upstream vPC / HSRP Standby box. Plenty of other duplicated frames too (duplicate IP Ident numbers).

- Sometimes, the Stack will reboot itself, after a few hours of this.  Mostly, we walk into the IDF and cold boot (unplug / replug all eight power cords).

- Generally, rebooting fixes the issue, although sometimes we have to power cycle a specific Member, to get it to rejoin the Stack.

- Intermittent behavior -- sometimes, it will run for a day or two without major issue; sometimes, we cold boot it every handful of hours.
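
For anyone wanting to quantify the duplicate-frame bursts described in the list above, here is a minimal Python sketch. It assumes the pcaps have already been exported to (timestamp, source IP, IP Ident) tuples; the function name `duplicate_bursts` and the `min_duplicates` threshold are made up for illustration and are not part of any tool mentioned in this post.

```python
from collections import Counter

def duplicate_bursts(frames, min_duplicates=1000):
    """Return [(second, duplicate_count)] for seconds with heavy duplication.

    frames: iterable of (timestamp_seconds, src_ip, ip_id) tuples,
            assumed to be exported from a capture by some other tool.
    min_duplicates: hypothetical threshold; tune it to the traffic.
    """
    # Count how often each (second, source, IP Ident) combination appears.
    seen = Counter()
    for ts, src, ip_id in frames:
        seen[(int(ts), src, ip_id)] += 1

    # Every appearance beyond the first counts as a duplicate.
    dup_per_second = Counter()
    for (second, _src, _ip_id), count in seen.items():
        if count > 1:
            dup_per_second[second] += count - 1

    return sorted(
        (second, dups)
        for second, dups in dup_per_second.items()
        if dups >= min_duplicates
    )
```

Caveat: IP Ident is only 16 bits, so a very busy sender can legitimately reuse a value; treat the result as a hint about when bursts occurred, not as proof.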

Last time this happened (December), we replaced one Member at a time, until the issue cleared.  We got lucky -- the second Member we replaced fixed the issue.

Is there a smarter way to identify a bad Member?

Actually, the challenge is larger than this -- I suspect that a bad Stacking Module or Stacking Cable could cause odd / intermittent problems, not just failing RAM / TCAM inside a Member.  Generically, is there a smarter way to isolate which of the (8) Members, (8) Stacking Modules, and (9) Stacking Cables might be failing?

If the issue were reproducible in minutes, we could use binary search -- i.e. power-off half the Stack ... if the problem persists, then power off half of the remaining units ... and continue.  But since the issue takes hours or even days to reproduce ... dang, binary search would consume some serious calendar time.
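
To put a rough number on that calendar time, here is a back-of-the-envelope Python sketch. It assumes exactly one faulty component, and that watching a half-Stack for a fixed window is enough to declare it clean; both are assumptions, and the 2-day window is just a guess based on the "day or two" the Stack sometimes runs without issue. The function name is hypothetical.

```python
def binary_search_cost(candidates, observation_days):
    """Return (rounds, total_days) for a halving search.

    candidates: number of suspect components (assumes exactly one is bad).
    observation_days: assumed time needed per round to trust a clean result.
    """
    # ceil(log2(candidates)) computed with integers to avoid float rounding.
    rounds = (candidates - 1).bit_length()
    return rounds, rounds * observation_days


# 8 Members only, 2-day observation window per round.
print(binary_search_cost(8, 2))           # (3, 6)
# All 8 Members + 8 Stacking Modules + 9 Stacking Cables = 25 suspects.
print(binary_search_cost(8 + 8 + 9, 2))   # (5, 10)
```

So even in the best case this is roughly a week of degraded service for Members alone, and longer if modules and cables are treated as separate suspects.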

==> How to identify failing hardware in a Stack -- this is the question I want to address with my query here.  Sure, I would like to solve this immediate problem ... but looking ahead, I want some sort of methodology for identifying bad hardware in a Stack.  Suggestions?

Random notes:

- I have a TAC case open

- Stack is running 15.2(2)E6

- I have a handful of these (8) Member Cat2960X stacks ... fortunately, this is the only Stack which has been hitting problems.
