Re: CBS220 - random periodic crash and/or reboot

jason-g · ‎10-30-2024

We have four CBS220 switches that periodically reboot. At first I thought it was maybe a power issue, but we have confirmed that our UPS units are good, and another Cisco switch (a CBS250) on the same UPS as one of our CBS220s does not reboot. A CBS220 may be up for a few weeks and then the next day show an uptime (in "show version") of just a few hours. "show logging" does not reveal anything in particular.

I started by disabling unused services:

no pnp enable

That did not appear to resolve it (we had reboots a few weeks later).

Next I disabled Bonjour:

no bonjour enable

That was a few days ago, and we are continuing to monitor to see if any further reboots occur.

I see there are Cisco bug reports describing a similar experience, but specific to having a MikroTik router in the environment:

https://bst.cisco.com/bugsearch/bug/CSCwh29387

https://bst.cisco.com/bugsearch/bug/CSCwf78354

We are running v2.0.2.14, which I believe is the latest version available.

I'm also watching "show memory statistics" to see if available memory drops slowly over time.

CPU appears to be normal when I check.

I manage these CBS220s strictly via SSH and not via the web UI, as that interface is just painfully slow. I suppose I could also do "no ip http secure-server" ("no ip http server" is already present in the config).

Any other thoughts, or things to check or disable?

I'm trying to avoid disabling other services like CDP/LLDP as those are generally useful for troubleshooting.

regards,

Jason

marce1000 · ‎10-30-2024

- Besides 'show logging', it would be advisable to configure a central syslog server for these switches to capture all logging; the logs from just before a reboot can then be examined to check whether a (last gasp) pattern can be detected.
The same applies to sending all SNMP traps to an SNMP manager (trap receiver), which can be examined for the same purpose when these devices reboot.

M.

-- Each morning when I wake up and look into the mirror I always say ' Why am I so brilliant ? '
When the mirror will then always repond to me with ' The only thing that exceeds your brilliance is your beauty! '

Cristian Matei · ‎10-30-2024

Hi,

There have been known issues with this platform; I assume the switches are kept under normal environmental conditions, right? Make sure you run the latest version (upgrade if not) and open a TAC case if it keeps happening; most likely you'll get an RMA.

Best,

Cristian.

jason-g · ‎10-31-2024

Yes, latest version 2.0.2.14 and normal temperature conditions. Knock on wood, so far stable with Bonjour disabled starting on Friday, Oct 25th. So it has only been 6 days or so... we need several weeks of uptime to gain some confidence that this is a valid workaround. I did set up logging to file ("warning" level or better). Unfortunately we don't have a syslog server or SNMP management server to capture logs (small non-profit business here). Maybe I can spin up something temporarily for syslog troubleshooting.

marce1000 · ‎10-31-2024

- A syslog server is easy to configure on a Linux VM ('spin it up!'); the same probably goes for an SNMP trap receiver on the same host.
Searching the internet for these subjects will get you there fast.
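
As a rough illustration of how little is needed for temporary troubleshooting, here is a minimal sketch of a UDP syslog receiver in Python (standard library only). The file name `switch-syslog.log` and port 5514 are placeholders chosen for this example, not anything from the switches' defaults; standard syslog uses UDP 514, which normally requires root to bind. It assumes the switches send classic syslog datagrams with a leading `<PRI>` value.

```python
import socketserver
from datetime import datetime

# Hypothetical output file for this sketch.
LOG_PATH = "switch-syslog.log"

# Severity names from the syslog standard, indexed by numeric severity.
SEVERITIES = ["emerg", "alert", "crit", "err",
              "warning", "notice", "info", "debug"]


def split_pri(message: str):
    """Split a leading '<PRI>' into (facility, severity name, rest).

    Returns (None, None, message) when no valid PRI prefix is present.
    """
    if message.startswith("<") and ">" in message[:6]:
        pri, rest = message[1:].split(">", 1)
        if pri.isdigit():
            value = int(pri)
            return value // 8, SEVERITIES[value % 8], rest
    return None, None, message


class SyslogHandler(socketserver.BaseRequestHandler):
    """Append every received datagram to LOG_PATH with a receive timestamp."""

    def handle(self):
        data, _sock = self.request  # for UDP servers: (bytes, socket)
        text = data.decode("utf-8", errors="replace").rstrip()
        facility, severity, body = split_pri(text)
        line = (f"{datetime.now().isoformat()} {self.client_address[0]} "
                f"facility={facility} severity={severity} {body}")
        with open(LOG_PATH, "a", encoding="utf-8") as fh:
            fh.write(line + "\n")


def serve(host: str = "0.0.0.0", port: int = 5514):
    """Listen for syslog datagrams until interrupted (Ctrl+C)."""
    with socketserver.UDPServer((host, port), SyslogHandler) as server:
        server.serve_forever()
```

Calling `serve()` on the VM and pointing the switches' remote logging at that host and port would leave a timestamped file to inspect after the next reboot. A packaged daemon such as rsyslog is the more robust choice for anything longer term.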

M.

jason-g · ‎11-06-2024

Still continuing to monitor... still stable without any reboots. Available memory stats are about where they were back on Oct 25th, holding steady.

jason-g · ‎12-02-2024

Well, made it to about 42 days or so of uptime and then it rebooted. 3 of our 4 CBS220s have rebooted in the last week, at different times. Not much in the log files.

jason-g · ‎12-03-2024

Turned on debug logging and noticed these every 15 min or so in the log...

*Dec 03 2024 13:38:45.768+-500: %MCAST-7-GROUP_RANGE_INVALID: Received invalid group range 224.0.0.X
*Dec 03 2024 13:38:44.808+-500: %MCAST-7-GROUP_RANGE_INVALID: Received invalid group range 224.0.0.X
*Dec 03 2024 13:38:44.158+-500: %MCAST-7-GROUP_RANGE_INVALID: Received invalid group range 224.0.0.X
*Dec 03 2024 13:38:41.098+-500: %MCAST-7-GROUP_RANGE_INVALID: Received invalid group range 224.0.0.X
*Dec 03 2024 13:38:40.518+-500: %MCAST-7-GROUP_RANGE_INVALID: Received invalid group range 224.0.0.X
*Dec 03 2024 13:38:39.808+-500: %MCAST-7-GROUP_RANGE_INVALID: Received invalid group range 224.0.0.X
*Dec 03 2024 13:25:23.237+-500: %MCAST-7-GROUP_RANGE_INVALID: Received invalid group range 224.0.0.X
*Dec 03 2024 13:25:22.947+-500: %MCAST-7-GROUP_RANGE_INVALID: Received invalid group range 224.0.0.X
*Dec 03 2024 13:25:22.157+-500: %MCAST-7-GROUP_RANGE_INVALID: Received invalid group range 224.0.0.X
*Dec 03 2024 13:25:17.807+-500: %MCAST-7-GROUP_RANGE_INVALID: Received invalid group range 224.0.0.X
*Dec 03 2024 13:25:17.467+-500: %MCAST-7-GROUP_RANGE_INVALID: Received invalid group range 224.0.0.X
*Dec 03 2024 13:25:17.357+-500: %MCAST-7-GROUP_RANGE_INVALID: Received invalid group range 224.0.0.X
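
To put a number on the "every 15 min or so" spacing, the timestamps can be parsed and grouped into bursts. This is only a sketch: the regular expression assumes the timestamp layout seen in the lines above, the 60-second burst threshold is an arbitrary choice, and the sample contains just the first and last line of each burst copied from the log.

```python
import re
from datetime import datetime

# Matches the leading "*Dec 03 2024 13:38:45.768" part of each log line.
STAMP = re.compile(r"^\*(\w{3} \d{2} \d{4} \d{2}:\d{2}:\d{2}\.\d{3})")


def parse_stamps(lines):
    """Return the timestamps found in the log lines, oldest first."""
    stamps = []
    for line in lines:
        match = STAMP.match(line)
        if match:
            stamps.append(
                datetime.strptime(match.group(1), "%b %d %Y %H:%M:%S.%f"))
    return sorted(stamps)


def burst_starts(stamps, gap_seconds=60):
    """Return the first timestamp of each burst.

    A new burst begins when a message arrives more than gap_seconds
    after the previous one (threshold is an assumption of this sketch).
    """
    starts = []
    for i, ts in enumerate(stamps):
        if i == 0 or (ts - stamps[i - 1]).total_seconds() > gap_seconds:
            starts.append(ts)
    return starts


MSG = ": %MCAST-7-GROUP_RANGE_INVALID: Received invalid group range 224.0.0.X"
sample = [
    "*Dec 03 2024 13:38:45.768+-500" + MSG,
    "*Dec 03 2024 13:38:39.808+-500" + MSG,
    "*Dec 03 2024 13:25:23.237+-500" + MSG,
    "*Dec 03 2024 13:25:17.357+-500" + MSG,
]

starts = burst_starts(parse_stamps(sample))
interval_seconds = (starts[1] - starts[0]).total_seconds()
print(f"bursts: {len(starts)}, interval: {interval_seconds / 60:.1f} min")
```

For the two bursts shown, the gap works out to roughly 13.4 minutes. Running the same functions over a longer capture would show whether the interval is constant, which might help match it to a timer on the sending system.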

These four CBS220 switches have IGMP snooping enabled (although the documentation states IGMP snooping is disabled by default, and the running-config has no line items related to IGMP). So, as another "thing to try", I've disabled IGMP snooping to see if that helps (again, grasping at straws). I determined the source of the IGMPv3 messages that trigger the above syslog messages, and it is a known system on the same local broadcast domain as the management interfaces of these switches.

jason-g · ‎12-03-2024

examples/screenshots attached of IGMPv3 messages that correlate to the "%MCAST-7-GROUP_RANGE_INVALID: Received invalid group range 224.0.0.X"

jason-g · ‎12-09-2024

Our 4th CBS220 rebooted over the weekend at around 50 days of uptime. That was with IGMP snooping disabled as well. I had logging to an external syslog server set at debug level. I will check those logs, but I'm not confident that they will reveal anything.

jason-g · ‎12-09-2024

As expected, nothing in the syslogs around the time of the suspected crash and reboot. That was with debug level enabled.

jason-g · ‎01-13-2025

Two of four CBS220s appeared to "reboot" over the weekend (one on Friday and one on Sunday) at the 50-day mark. I have a new theory: the switches are NOT rebooting; something resets the "uptime" after 50 days. Why do I think that? The logs do not show any up/down events for the uplink ports in/around the time of the "reboot". I am using NTP (an internal NTP server on our network, not the default NTP servers) and it shows in sync. The log entries ("show logging") all show with an asterisk before them. Usually with Cisco IOS/IOS-XE switches that means the time was NOT in sync when the log entry was generated. But "show sntp configuration" and "show clock" are showing "time source is sntp" and "sntp server status: Up".

So - what do folks think about this theory? Is something amiss with the "uptime" function in the "show version" output that causes it to reset uptime at 50 days? I have two consistent 50-day cycles now on 2 of 4 switches. My other two switches are at 36 and 41 days, so I'm going to watch those at 50 days.
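
One piece of arithmetic that fits a ~50-day cycle: an unsigned 32-bit counter that ticks once per millisecond wraps to zero after about 49.7 days. Whether the CBS220 firmware actually derives the "show version" uptime from such a counter is purely an assumption here; nothing in the logs or documentation confirms it. The sketch below just works out the wrap period for two common tick rates.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def wrap_period_days(bits: int, ticks_per_second: int) -> float:
    """Days until an unsigned counter of the given width wraps to zero."""
    return (2 ** bits) / ticks_per_second / SECONDS_PER_DAY


# 32-bit counter in milliseconds: wraps after roughly 49.7 days.
ms_wrap = wrap_period_days(32, 1000)

# 32-bit counter in hundredths of a second (the SNMP TimeTicks unit):
# wraps after roughly 497 days, far too long to explain a 50-day cycle.
cs_wrap = wrap_period_days(32, 100)

print(f"millisecond counter wraps after {ms_wrap:.2f} days")
print(f"centisecond counter wraps after {cs_wrap:.1f} days")
```

If that is what is happening, each switch would show its uptime dropping back to zero just short of 50 days after its last real boot, with no link up/down events around that time.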

mp25 · ‎01-28-2025

I just posted about a reboot around 50 days. Your message interests me especially since my switch currently has an uptime of 48 days, while the tech support file indicates 147 days.
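
Those two figures are at least consistent with a wrapped counter. As a hedged check, assuming (without confirmation) that the displayed uptime comes from a 32-bit millisecond counter while the tech support file reports the true value, the displayed figure would be the true uptime modulo about 49.7 days:

```python
# Assumed wrap period of an unsigned 32-bit millisecond counter, in days.
WRAP_DAYS = 2 ** 32 / 1000 / 86_400


def displayed_uptime_days(true_days: float) -> float:
    """Uptime a wrapped counter would show for a given true uptime."""
    return true_days % WRAP_DAYS


def wrap_count(true_days: float) -> int:
    """Number of times the counter would have wrapped."""
    return int(true_days // WRAP_DAYS)


true_days = 147  # figure from the tech support file
shown = displayed_uptime_days(true_days)
print(f"{wrap_count(true_days)} wraps, displayed uptime about {shown:.1f} days")
```

With 147 days of true uptime this gives two wraps and a displayed value between 47 and 49 days, which lines up with the 48 days shown. It does not prove the cause, only that the numbers do not contradict it.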

mp25 · ‎01-28-2025

But no, it can’t be that. My CCTV server stops recording because the PoE cameras reboot.

EDIT: no, finally it does not cut PoE power with this last firmware.

jason-g · ‎01-29-2025

This is not the same here - other than "show version" indicating a recent uptime/reboot, there is no actual reboot or ports down/up in the syslog. In my case, it truly appears to be something wonky with the uptime in "show version". Maybe just a cosmetic bug in my case.