11-04-2022 11:00 AM
I have a stack of 2 C9200L-48P-4X running 16.12.4, and about once a week whichever member is set as master will randomly crash/reboot/reload. It doesn't matter which member is master at the time; it has happened with each stack member as master. Log output for the time of the reload is:
Nov 4 07:08:31.047: %PLATFORM_INFRA-5-IOS_INTR_OVER_LIMIT: IOS thread disabled interrupt for 272 msec
-Traceback= 1#714eae76875aa6a2c50124258b444fa8 :10000+421D2B8 :10000+1697708 :10000+1673728 :10000+1676EE8 :10000+56054A0
Nov 4 07:08:33.090: %HMANRP-6-HMAN_IOS_CHANNEL_INFO: HMAN-IOS channel event for switch 2: EMP_RELAY: Channel UP!
Nov 4 07:08:35.218: %PLATFORM-6-HASTATUS: RP switchover, received chassis event to become active
Nov 4 07:08:35.368: %REDUNDANCY-3-SWITCHOVER: RP switchover (PEER_NOT_PRESENT)
Nov 4 07:08:35.381: %REDUNDANCY-3-SWITCHOVER: RP switchover (PEER_DOWN)
Nov 4 07:08:35.381: %REDUNDANCY-3-SWITCHOVER: RP switchover (PEER_REDUNDANCY_STATE_CHANGE)
Nov 4 07:08:36.912: %PLATFORM-6-HASTATUS: RP switchover, sent message became active. IOS is ready to switch to primary after chassis confirmation
Nov 4 07:08:38.276: %HMANRP-6-EMP_NO_ELECTION_INFO: Could not elect active EMP switch, setting emp active switch to 0: EMP_RELAY: Could not elect switch with mgmt port UP
Nov 4 07:08:38.296: %PLATFORM-6-HASTATUS: RP switchover, received chassis event became active
Nov 4 07:08:38.299: %PLATFORM_FEP-1-FRU_PS_SIGNAL_OK: Switch 2: signal on power supply A is restored
Nov 4 07:08:38.344: %PLATFORM_FEP-1-FRU_PS_SIGNAL_OK: Switch 2: signal on power supply B is restored
Nov 4 07:08:38.600: %PLATFORM-6-HASTATUS_DETAIL: RP switchover, received chassis event became active. Switch to primary (count 4)
Nov 4 07:08:38.707: %HA-6-SWITCHOVER: Route Processor switched from standby to being active
Nov 4 07:08:38.838 UTC: Unable to set IPV4 table id for BT interface
Nov 4 07:08:38.852 UTC: Unable to set IPV6 table id for BT interface
Nov 4 07:08:42.145: %SYS-6-LOGGINGHOST_STARTSTOP: Logging to host 172.24.253.2 port 514 started - CLI initiated
Nov 4 07:08:42.315: %HMANRP-5-CHASSIS_DOWN_EVENT: Chassis 1 gone DOWN!
CPU utilization also spikes to 90+% during this time.
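For anyone comparing several of these weekly reloads, a small script can help pick the high-severity events out of the saved log. This is only a rough sketch: the regex assumes the `timestamp: %FACILITY-SEVERITY-MNEMONIC: text` layout seen in the output above, and the function names (`parse_line`, `summarize`) and the `SAMPLE` list are made up for illustration, with `SAMPLE` holding a few lines copied from that output.

```python
import re
from collections import Counter

# Assumed layout: "<Mon> <day> <hh:mm:ss.mmm>[ TZ]: %FACILITY-SEV-MNEMONIC: text"
SYSLOG_RE = re.compile(
    r"^(?P<timestamp>\w{3}\s+\d{1,2} \d{2}:\d{2}:\d{2}\.\d+)(?: \w+)?: "
    r"%(?P<facility>[A-Z0-9_]+)-(?P<severity>\d)-(?P<mnemonic>[A-Z0-9_]+): "
    r"(?P<message>.*)$"
)


def parse_line(line):
    """Return a dict of syslog fields, or None if the line is not in that layout."""
    match = SYSLOG_RE.match(line.strip())
    return match.groupdict() if match else None


def summarize(lines, max_severity=5):
    """Count messages whose severity is numerically <= max_severity (0 is most severe)."""
    counts = Counter()
    for line in lines:
        rec = parse_line(line)
        if rec and int(rec["severity"]) <= max_severity:
            key = f'{rec["facility"]}-{rec["severity"]}-{rec["mnemonic"]}'
            counts[key] += 1
    return counts


# A few lines copied from the log output in this post.
SAMPLE = [
    "Nov 4 07:08:31.047: %PLATFORM_INFRA-5-IOS_INTR_OVER_LIMIT: IOS thread disabled interrupt for 272 msec",
    "Nov 4 07:08:35.368: %REDUNDANCY-3-SWITCHOVER: RP switchover (PEER_NOT_PRESENT)",
    "Nov 4 07:08:35.381: %REDUNDANCY-3-SWITCHOVER: RP switchover (PEER_DOWN)",
    "Nov 4 07:08:35.381: %REDUNDANCY-3-SWITCHOVER: RP switchover (PEER_REDUNDANCY_STATE_CHANGE)",
    "Nov 4 07:08:38.838 UTC: Unable to set IPV4 table id for BT interface",
    "Nov 4 07:08:42.315: %HMANRP-5-CHASSIS_DOWN_EVENT: Chassis 1 gone DOWN!",
]

if __name__ == "__main__":
    for key, count in summarize(SAMPLE, max_severity=3).items():
        print(key, count)
```

Lines without a `%FACILITY-SEVERITY-MNEMONIC` tag, such as the traceback and the "Unable to set IPV4 table id" message, are skipped rather than guessed at.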
11-04-2022 12:38 PM
First, I would power down the stack, re-seat the stack cables, and make sure they are tight. If you have spare cables, replace them with new ones.
If the issue persists after that, I suggest upgrading to the latest IOS XE 17.3.x release and testing again.
=====Preenayamo Vasudevam=====
11-04-2022 01:11 PM
I've tested and reseated the cables, and they are tightened down as far as they will go. Unfortunately, I don't have any spare stack cables. My next course of action, when I can get a maintenance window, is to upgrade the OS to 16.12.8, which is the latest recommended release for the 16.x train.
11-04-2022 04:56 PM
Post the complete output of the following commands:
If the stack is on 16.12.4, then I suspect there is a memory leak due to FN - 72323 - Cisco IOS XE Software: QuoVadis Root CA 2 Decommission Might Affect Smart Licensing, Smart Call Home, and Other Functionality
If neither workaround is applied, my guess is that it would take about 6 to 8 months before the memory leak looks like the picture below:
3850 (4 x switches), Firmware version: 16.12.4, Uptime: 1y43w4d
And if this is the case, then I'd drill down further to determine the process monopolizing the memory.
Memory leak due to "keyman" process
The "keyman" process is usually between 10k and 15k. It is one of the few processes attributed to Cisco Smart Licensing.
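If you record how much memory a suspect process holds at regular intervals, a simple trend line shows whether it is climbing steadily. The sketch below is only an illustration: the function name `growth_per_day` and the sample values are hypothetical, and collecting the per-process figures from the switch is left to whatever method you already use.

```python
def growth_per_day(samples):
    """Least-squares slope of memory held versus time.

    samples: list of (day, memory_held) pairs, in whatever unit you recorded.
    Returns the average change in memory_held per day.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_x = sum(x for x, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    var_x = sum((x - mean_x) ** 2 for x, _ in samples)
    if var_x == 0:
        raise ValueError("samples must span more than one point in time")
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in samples)
    return cov_xy / var_x


if __name__ == "__main__":
    # Hypothetical readings, not taken from a real switch: (day, memory held)
    readings = [(0, 12000), (1, 13000), (2, 14000)]
    print(growth_per_day(readings))
```

A slope that stays positive across weeks of samples, with no plateau, is the pattern you would expect from a leak rather than from normal fluctuation.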
Finally, 16.12.4 is not a stable firmware version for the 9200. Downgrade to the latest 16.6.X for stability purposes.