5508 Unexpected Reboots - Software reset

cmonks119 · 05-20-2021

I have three HA pairs of 5508s and one standalone 5508 that have all of a sudden started rebooting every few days. All are on 8.3.143.0 and have been for over a year with no major changes.

It seems like something is triggering the reloads, because they happen pretty consistently between 7:15 and 7:30pm. Sometimes it is one of them, sometimes two, 10-20 minutes apart. All 7 of them have rebooted at least twice over the past two weeks. The HA pairs do an HA failover; the standalone one of course just reboots.

"Last Reset" for all of them is the same: Software reset

There are NO crash logs and NO coredumps. I had coredumps enabled on two of the WLC pairs and there is nothing. I just enabled them on all of the controllers to be sure.

I have enabled syslog on all of them, but don't have access to the syslog server currently. I'll see if anything is in there as soon as I get access, but I don't imagine there will be anything.

On the HA nodes, this is in the msglog:

*spamApTask2: May 20 19:23:26.972: %RMGR-3-RED_WLC_SWITCHOVER: [PA]rmgr_sm.c:3333 WLC HA - Switchover Occurred, role changed from Standby to Active, Reason:HB Timeout, Peer HealthSt:0x10 (| Config changed |)
*rmgrMain: May 20 19:23:26.651: %RMGR-3-RED_HA_KA_STATS: [PS]rmgr_main.c:688 Keep-alive stats: peer RP KA loss count 3, peer RMI KA received count 0
*rmgrMain: May 20 19:23:26.651: %RMGR-3-RED_HA_GW_STATS: [PS]rmgr_main.c:687 Default gateway stats: ping loss count 0, ping received count 1
*rmgrMain: May 20 19:23:26.651: %RMGR-3-RED_HEARTBEAT_TMOUT: [PS]rmgr_sm.c:1850 Standby WLC has lost keep-alives with peer.
*rmgrMain: May 20 19:23:26.547: %RMGR-3-RED_HA_KA_STATS: [PS]rmgr_main.c:688 Keep-alive stats: peer RP KA loss count 2, peer RMI KA received count 0
*rmgrMain: May 20 19:23:26.547: %RMGR-3-RED_HA_GW_STATS: [PS]rmgr_main.c:687 Default gateway stats: ping loss count 0, ping received count 1
*rmgrMain: May 20 19:23:26.443: %RMGR-3-RED_HA_KA_STATS: [PS]rmgr_main.c:688 Keep-alive stats: peer RP KA loss count 1, peer RMI KA received count 259120
*rmgrMain: May 20 19:23:26.443: %RMGR-3-RED_HA_GW_STATS: [PS]rmgr_main.c:687 Default gateway stats: ping loss count 0, ping received count 1

I'm not getting anything out of this other than that the peer just crashed and is no longer replying to HA keepalives. I'm ruling out HA issues as causing the problem, because the exact same issue is happening to a non-HA WLC. So the HA failure seems to just be a result of a crash of the primary.
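
To make the timing in the msglog excerpt easier to read, here is a minimal Python sketch that parses lines in the format shown above and lists the keep-alive counters oldest-first. The regular expressions are inferred only from these eight lines, and the function names and the `year` parameter (the log lines carry no year) are my own assumptions, not anything from Cisco.

```python
import re
from datetime import datetime

# Shape inferred from the msglog excerpt:
# *task: Mon DD HH:MM:SS.mmm: %FACILITY-SEV-MNEMONIC: message
LINE_RE = re.compile(
    r"^\*(?P<task>\w+): "
    r"(?P<ts>\w{3} +\d{1,2} \d{2}:\d{2}:\d{2}\.\d{3}): "
    r"%(?P<facility>\w+)-(?P<severity>\d)-(?P<mnemonic>\w+): "
    r"(?P<message>.*)$"
)
KA_RE = re.compile(
    r"peer RP KA loss count (\d+), peer RMI KA received count (\d+)"
)


def parse_msglog(text, year=2021):
    """Parse msglog text into events sorted oldest-first.

    The msglog prints newest entries first, so the result is re-sorted.
    `year` is an assumption: the log lines themselves do not include one.
    """
    events = []
    for line in text.splitlines():
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # skip anything that does not look like a msglog line
        ts = datetime.strptime(f"{year} {m['ts']}", "%Y %b %d %H:%M:%S.%f")
        events.append(
            {
                "time": ts,
                "task": m["task"],
                "mnemonic": m["mnemonic"],
                "message": m["message"],
            }
        )
    return sorted(events, key=lambda e: e["time"])


def keepalive_timeline(events):
    """Return (time, rp_ka_loss_count, rmi_ka_received_count) tuples."""
    timeline = []
    for event in events:
        m = KA_RE.search(event["message"])
        if m:
            timeline.append((event["time"], int(m.group(1)), int(m.group(2))))
    return timeline
```

Fed the excerpt above, `keepalive_timeline(parse_msglog(text))` should show the RP keep-alive loss count stepping 1, 2, 3 roughly 100 ms apart while the RMI keep-alive received count drops from 259120 to 0, which looks consistent with the peer going silent abruptly rather than degrading slowly.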

So basically I'm not getting any clues as to what is going on. Is there any additional logging/monitoring I can set up to figure out what is causing 'Software reset'? My next step is to set up packet captures on the management traffic to see if the trigger is something coming over SNMP, or some kind of network DoS.

Any other suggestions?

Thanks.

marce1000 · 05-21-2021

> ....because they are happening pretty consistently at 7:15-7:30pm.

That usually indicates an external issue (indeed), especially if everything ran fine for a long time.

>...but don't have access to the syslog server currently.

That access is really needed now to keep track of this problem.

- You should indeed monitor the network, watch out for attacks, traffic surges etc.

M.


Leo Laohoo · 05-24-2021

Upgrade the firmware.

cmonks119 · 05-24-2021

Unfortunately our SmartNet on the 5508s just expired. We're in the process of migrating off them since they are EOL, and the renewal was too expensive to justify. Of course, as luck would have it, as soon as SmartNet expires we start having these issues. That's why I came looking for help: I can't upgrade or open a TAC case, and I need to keep these running for a little while longer until we can get off them.

I'm looking at SNMP probes from ServiceNow as a possible trigger for the resets.

Scott Fella · 05-24-2021

Seems like it is something else. As Marce mentioned, it happening at the same time of day, and also on the standalone controller, just seems odd. Are there any new tools or automation polling or accessing the WLC via SNMP? Maybe define a CPU ACL and allow access only from a specific subnet or IP, just to test.
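
One way to narrow down which poller to test against is to line up the reset times with whatever is talking to the management interface just beforehand. The Python below is a rough sketch under my own assumptions: `sources_before_resets` is a made-up helper, the 10-minute window is arbitrary, and the events are expected to be (timestamp, source IP) pairs exported from a packet capture or NMS log in whatever way is available.

```python
from collections import Counter
from datetime import datetime, timedelta


def sources_before_resets(resets, events, window=timedelta(minutes=10)):
    """Count, per source, how many resets it was seen shortly before.

    resets: list of datetime objects, one per observed reset.
    events: list of (datetime, source_ip) pairs, e.g. SNMP requests
            seen on the WLC management interface (hypothetical input).
    window: how far back before each reset to look (arbitrary default).
    """
    hits = Counter()
    for reset in resets:
        # A set, so a chatty source counts once per reset, not once per packet.
        seen = {src for ts, src in events if reset - window <= ts < reset}
        hits.update(seen)
    return hits


def suspects(resets, events, window=timedelta(minutes=10)):
    """Sources that were active before every single reset."""
    hits = sources_before_resets(resets, events, window)
    return sorted(src for src, count in hits.items() if count == len(resets))
```

A source that shows up before every reset is not proof of anything, since a regular poller will match by chance, but it gives a short list of addresses to block with the CPU ACL first.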

-Scott

Leo Laohoo · 05-24-2021

@cmonks119 wrote:

> Unfortunately our SmartNet on the 5508s just expired

Read the Cisco security advisory "Cisco IOS XE Software for Catalyst 9800 Series and Cisco AireOS Software for Cisco WLC Flexible NetFlow Version 9 Denial of Service Vulnerability".

Scroll down to the "Customers Without Service Contracts" section and read it very carefully:

> Customers who purchase directly from Cisco but do not hold a Cisco service contract and customers who make purchases through third-party vendors but are unsuccessful in obtaining fixed software through their point of sale should obtain upgrades by contacting the Cisco TAC.
>
> Customers should have the product serial number available and be prepared to provide the URL of this advisory as evidence of entitlement to a free upgrade.

saravlak · 05-24-2021

Check syslog. If you aren't capturing it already, start now to try to get a clue; it's crucial to obtain since this is happening on all WLCs.
It could be a silent reboot without a crash file or coredump, possibly caused by a memory leak or a similar issue. What is the last reboot reason showing?
Even if you had SmartNet, 8.3 isn't supported. Try updating one WLC to 8.5 and/or the FUS image and check.
If you're a hospital network, you can still get an exception to open a TAC case through your SE.
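
If the real syslog server stays out of reach, a throwaway receiver on any reachable host can at least capture what the controllers send in the minutes before a reset. The Python below is a minimal sketch using only the standard library; `LISTEN_ADDR`, `LISTEN_PORT`, `LOG_PATH`, and the function names are placeholders of my own, and the WLCs would still need to be configured to send syslog to this host.

```python
import socketserver
from datetime import datetime

# Hypothetical settings; adjust for the host running the listener.
LISTEN_ADDR = "0.0.0.0"
LISTEN_PORT = 514  # standard syslog UDP port; binding usually needs elevated privileges
LOG_PATH = "wlc-syslog.log"


def format_record(data, client_ip, now=None):
    """Turn one raw syslog datagram into a line with a local receive time."""
    now = now or datetime.now()
    text = data.decode("utf-8", errors="replace").strip()
    return f"{now.isoformat(timespec='milliseconds')} {client_ip} {text}"


class SyslogHandler(socketserver.BaseRequestHandler):
    def handle(self):
        # For UDP servers, self.request is a (data, socket) pair.
        data, _sock = self.request
        line = format_record(data, self.client_address[0])
        # Append and close on every message so nothing is lost if the
        # listener itself is stopped abruptly.
        with open(LOG_PATH, "a", encoding="utf-8") as fh:
            fh.write(line + "\n")


def run_server():
    """Block forever, writing every received datagram to LOG_PATH."""
    with socketserver.UDPServer((LISTEN_ADDR, LISTEN_PORT), SyslogHandler) as server:
        server.serve_forever()
```

Calling `run_server()` starts the listener. Stamping each message with the receiver's own clock is deliberate: it gives one consistent timeline across all seven controllers even if their clocks drift, which matters when the resets land 10-20 minutes apart.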