677 Views · 1 Helpful · 10 Replies

CBS220 - random periodic crash and/or reboot

jason-g
Level 1

We have (4) CBS220 switches that periodically reboot. At first I thought it was maybe a power issue, but we have confirmed that our UPS units are good, and another Cisco switch (a CBS250) on the same UPS as one of our CBS220s does not reboot. A CBS220 may be up a few weeks and then the next day will show an uptime (in "show version") of just a few hours. The "show logging" output does not reveal anything in particular.

I've started by disabling unused services:

no pnp enable

That did not appear to resolve it (reboots occurred again two to three weeks later).

I next disabled Bonjour:

no bonjour enable

That was a few days ago, and we are continuing to monitor for any further reboots.
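For reference, the full sequence for those two changes, including saving the config so it survives the next reboot, is roughly the following (the save step assumes the usual CBS "copy running-config startup-config" syntax):

configure terminal
no pnp enable        ! disable the Plug-and-Play agent
no bonjour enable    ! stop Bonjour advertisements
exit
copy running-config startup-config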

There are Cisco TAC cases (bug IDs) I see someone has opened with a similar experience, but specific to having a MikroTik router in the environment:

https://bst.cisco.com/bugsearch/bug/CSCwh29387

https://bst.cisco.com/bugsearch/bug/CSCwf78354

We are running v2.0.2.14, which I believe is the latest version available.

 

I'm also watching "show memory statistics" to see if available memory drops slowly over time.

CPU appears to be normal when I check.

I manage these CBS220 switches strictly via SSH, not via the web UI, as that interface is just painfully slow. I suppose I could also do "no ip http secure-server" ("no ip http server" is already present in the config).
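For the record, the periodic health checks plus that extra hardening step amount to something like this ("show cpu utilization" is the command name on the other CBS/SMB models, so it may differ here):

show memory statistics               ! watch free memory over time
show cpu utilization                 ! spot-check CPU load
configure terminal
no ip http secure-server             ! drop the HTTPS UI as well
exit
copy running-config startup-config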

Any other thoughts, or things to check or disable?

I'm trying to avoid disabling other services like CDP/LLDP, as those are generally useful for troubleshooting.

regards,

Jason

10 Replies

marce1000
VIP

 

 - It would be advisable, besides 'show logging', to configure a central syslog server on these switches to capture overall logging; the logs captured just before a device reboots can then be examined to check whether a (last-gasp) pattern can be detected.
   The same applies to configuring and sending all SNMP traps to an SNMP manager (trap receiver), which should be examined for the same purpose when these devices reboot.
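   As a rough sketch of the switch side (the IP is a placeholder and the severity keyword may differ slightly on CBS220 firmware, so verify against its CLI guide):

configure terminal
logging host 192.0.2.50 severity informational   ! copy syslog messages off-box to the collector
exit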

 M.



-- Each morning when I wake up and look into the mirror I always say 'Why am I so brilliant?'
    The mirror then always responds with 'The only thing that exceeds your brilliance is your beauty!'

Cristian Matei
VIP Alumni

Hi, 

     There have been known issues with this platform; I assume the switches are kept under normal environmental conditions, right? Make sure you are running the latest version (if not, upgrade), and open a TAC case if it keeps happening; most likely you'll get an RMA.

Best,

Cristian.

jason-g
Level 1

Yes, latest version 2.0.2.14 and normal temperature conditions. Knock on wood, it has been stable so far with Bonjour disabled starting Friday, Oct 25th, so it's only been 6 days or so; we need several weeks of uptime to gain some confidence that this is a valid workaround. I did set up logging to file ("warning" level or better). Unfortunately we don't have a syslog server or SNMP management server to capture logs (small non-profit here). Maybe I can spin up something temporarily for syslog troubleshooting.

 

 - A syslog server is easy to configure on a Linux VM ('spin it up!'); the same probably goes for an SNMP trap receiver on the same host.
   Searching the internet for these subjects will get you there fast.
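   Once something like snmptrapd (from net-snmp) is listening on that VM, the switch side would be roughly the following; the IP and community string are placeholders, and the exact keywords should be double-checked against the CBS220 CLI guide:

configure terminal
snmp-server server                                   ! enable the SNMP agent if it is not already on
snmp-server community public ro                      ! read-only community string (choose your own)
snmp-server host 192.0.2.50 traps version 2c public  ! send v2c traps to the collector VM
exit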

 M.




jason-g
Level 1

Still continuing to monitor; still stable without any reboots. Available memory stats are about where they were back on Oct 25th, holding steady.

jason-g
Level 1

Well, we made it to about 42 days of uptime and then it rebooted. Three of the four CBS220s we have rebooted in the last week, at different times. Not much in the log files.

jason-g
Level 1

I turned on debug logging and noticed these entries every 15 minutes or so in the log:


*Dec 03 2024 13:38:45.768+-500: %MCAST-7-GROUP_RANGE_INVALID: Received invalid group range 224.0.0.X
*Dec 03 2024 13:38:44.808+-500: %MCAST-7-GROUP_RANGE_INVALID: Received invalid group range 224.0.0.X
*Dec 03 2024 13:38:44.158+-500: %MCAST-7-GROUP_RANGE_INVALID: Received invalid group range 224.0.0.X
*Dec 03 2024 13:38:41.098+-500: %MCAST-7-GROUP_RANGE_INVALID: Received invalid group range 224.0.0.X
*Dec 03 2024 13:38:40.518+-500: %MCAST-7-GROUP_RANGE_INVALID: Received invalid group range 224.0.0.X
*Dec 03 2024 13:38:39.808+-500: %MCAST-7-GROUP_RANGE_INVALID: Received invalid group range 224.0.0.X
*Dec 03 2024 13:25:23.237+-500: %MCAST-7-GROUP_RANGE_INVALID: Received invalid group range 224.0.0.X
*Dec 03 2024 13:25:22.947+-500: %MCAST-7-GROUP_RANGE_INVALID: Received invalid group range 224.0.0.X
*Dec 03 2024 13:25:22.157+-500: %MCAST-7-GROUP_RANGE_INVALID: Received invalid group range 224.0.0.X
*Dec 03 2024 13:25:17.807+-500: %MCAST-7-GROUP_RANGE_INVALID: Received invalid group range 224.0.0.X
*Dec 03 2024 13:25:17.467+-500: %MCAST-7-GROUP_RANGE_INVALID: Received invalid group range 224.0.0.X
*Dec 03 2024 13:25:17.357+-500: %MCAST-7-GROUP_RANGE_INVALID: Received invalid group range 224.0.0.X

 

These (4) CBS220 switches have IGMP snooping enabled (although the documentation states IGMP snooping is disabled by default, and the running-config has no line items related to IGMP). So, as another "thing to try," I've disabled IGMP snooping to see if that helps (again, grasping at straws); the rough command sequence is below. I determined the source of the IGMPv3 messages that trigger the above syslog entries, and it is a known system on the same local broadcast domain as the mgmt interfaces of these switches.
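For reference, disabling it globally from the CLI is roughly the following (the global toggle matches the other CBS models; double-check against the CBS220 CLI guide):

configure terminal
no ip igmp snooping    ! globally disable IGMP snooping
exit
copy running-config startup-config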

 

jason-g
Level 1

Examples/screenshots of the IGMPv3 messages that correlate to the "%MCAST-7-GROUP_RANGE_INVALID: Received invalid group range 224.0.0.X" entries are attached.

jason-g
Level 1

Our 4th CBS220 rebooted over the weekend at around 50 days of uptime. That was with IGMP snooping disabled as well. I had syslog logging to an external syslog server set at debug level; I will check those logs, but I'm not confident that they will reveal anything.

As expected, there was nothing in the syslogs around the time of the suspected crash and reboot. That was with debug level enabled.