
9200 Stack reloads once in a while

fgasimzade
Level 4

Hello Everyone,

We have a stack of 3 C9200-48T switches with the following IOS

Switch Ports Model              SW Version        SW Image              Mode
------ ----- -----              ----------        ----------            ----
     1 56    C9200-48T          16.12.3a          CAT9K_LITE_IOSXE      INSTALL
     2 56    C9200-48T          16.12.3a          CAT9K_LITE_IOSXE      INSTALL
*    3 56    C9200-48T          16.12.3a          CAT9K_LITE_IOSXE      INSTALL

 

The stack has started reloading on its own: three times in the past week.

 

This is what we got in the logs

 

May 28 18:46:28: %HMANRP-6-HMAN_IOS_CHANNEL_INFO: HMAN-IOS channel event for switch 3: EMP_RELAY: Channel UP!
May 28 18:46:28: %HMANRP-6-HMAN_IOS_CHANNEL_INFO: HMAN-IOS channel event for switch 2: EMP_RELAY: Channel UP!
May 28 18:46:28: %PLATFORM-6-HASTATUS: RP switchover, received chassis event to become active
May 28 18:46:28: %REDUNDANCY-3-SWITCHOVER: RP switchover (PEER_NOT_PRESENT)
May 28 18:46:28: %REDUNDANCY-3-SWITCHOVER: RP switchover (PEER_DOWN)
May 28 18:46:28: %REDUNDANCY-3-SWITCHOVER: RP switchover (PEER_REDUNDANCY_STATE_CHANGE)
May 28 18:46:28: %PM-4-PORT_INCONSISTENT: Port Gi2/0/37 is inconsistent: IDB state down (set 00:00:02 ago),
link: up (2d03h ago), admin: up (2d03h ago).
May 28 18:46:28: %PM-4-PORT_INCONSISTENT: Port Gi2/0/38 is inconsistent: IDB state down (set 00:00:02 ago),
link: up (2d03h ago), admin: up (2d03h ago).
May 28 18:46:28: %PM-4-PORT_INCONSISTENT: Port Gi3/0/11 is inconsistent: IDB state down (set 00:00:02 ago),
link: up (2d03h ago), admin: up (2d03h ago).
May 28 18:46:28: %PLATFORM-6-HASTATUS: RP switchover, sent message became active. IOS is ready to switch to primary after chassis confirmation
May 28 18:46:28: %HMANRP-6-EMP_NO_ELECTION_INFO: Could not elect active EMP switch, setting emp active switch to 0: EMP_RELAY: Could not elect switch with mgmt port UP
May 28 18:46:28: %PLATFORM-6-HASTATUS: RP switchover, received chassis event became active
May 28 18:46:28: %PLATFORM_FEP-1-FRU_PS_SIGNAL_OK: Switch 3: signal on power supply A is restored
May 28 18:46:28: %PLATFORM_FEP-1-FRU_PS_SIGNAL_OK: Switch 3: signal on power supply B is restored
May 28 18:46:28: %STACKMGR-4-SWITCH_REMOVED: Switch 2 R0/0: stack_mgr: Switch 1 has been removed from the stack.
May 28 18:46:28: %SYS-6-LOGGINGHOST_STARTSTOP: Logging to host 10.30.26.130 port 514 started - CLI initiated
May 28 18:46:28: %STACKMGR-4-SWITCH_REMOVED: Switch 3 R0/0: stack_mgr: Switch 1 has been removed from the stack.
May 28 18:46:28: %PLATFORM-6-HASTATUS_DETAIL: RP switchover, received chassis event became active. Switch to primary (count 1)
May 28 18:46:28: %HA-6-SWITCHOVER: Route Processor switched from standby to being active
May 28 18:46:28: %IOSXE_MGMTVRF-3-SET_TABLEID_FAIL: Installing ipv4 Management interface tableid 0x1 failed
May 28 18:46:28: %IOSXE_MGMTVRF-3-SET_TABLEID_FAIL: Installing ipv6 Management interface tableid 0x1E000001 failed
May 28 18:46:28: Unable to set IPV4 table id for BT interface

May 28 18:46:28: Unable to set IPV6 table id for BT interface

May 28 18:46:28: %HMANRP-6-EMP_NO_ELECTION_INFO: Could not elect active EMP switch, setting emp active switch to 0: EMP_RELAY: Could not elect switch with mgmt port UP
May 28 18:46:29: %SMART_LIC-5-EVAL_START: Entering evaluation period
May 28 18:46:29: %SMART_LIC-5-EVAL_START: Entering evaluation period
May 28 18:46:29: %SMART_LIC-5-EVAL_START: Entering evaluation period
May 28 18:46:29: %PM-4-PORT_INCONSISTENT: Port Gi3/0/10 is inconsistent: IDB state down (set 00:00:02 ago),
link: up (2d03h ago), admin: up (2d03h ago).
May 28 18:46:29: %PM-4-PORT_INCONSISTENT: Port Te3/1/1 is inconsistent: IDB state down (set 00:00:02 ago),
link: up (2d03h ago), admin: up (2d03h ago).
May 28 18:46:29: %PM-4-PORT_INCONSISTENT: Port Te3/1/2 is inconsistent: IDB state down (set 00:00:02 ago),
link: up (2d03h ago), admin: up (2d03h ago).
May 28 18:46:29: %SMART_LIC-5-EVAL_START: Entering evaluation period
May 28 18:46:29: %STACKMGR-6-STACK_LINK_CHANGE: Switch 3 R0/0: stack_mgr: Stack port 2 on Switch 3 is down
May 28 18:46:30: %HMANRP-5-CHASSIS_DOWN_EVENT: Chassis 1 gone DOWN!
May 28 18:46:31: %SMART_LIC-5-EVAL_START: Entering evaluation period
May 28 18:46:31: %SMART_LIC-5-EVAL_START: Entering evaluation period
May 28 18:46:31: %STACKMGR-6-STACK_LINK_CHANGE: Switch 3 R0/0: stack_mgr: Stack port 1 on Switch 3 is down
May 28 18:46:32: %HMANRP-5-CHASSIS_DOWN_EVENT: Chassis 2 gone DOWN!
May 28 18:46:33: %SMART_LIC-6-HA_ROLE_CHANGED: Smart Agent HA role changed to Active.
May 28 18:46:33: %PM-4-PORT_BOUNCED: Port Gi3/0/10 was bounced by Consistency Check IDBS Down.
May 28 18:46:33: %PM-4-PORT_BOUNCED: Port Gi3/0/11 was bounced by Consistency Check IDBS Down.
May 28 18:46:33: %PM-4-PORT_BOUNCED: Port Te3/1/1 was bounced by Consistency Check IDBS Down.
May 28 18:46:33: %PM-4-PORT_BOUNCED: Port Te3/1/2 was bounced by Consistency Check IDBS Down.
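
When an excerpt like the one above gets long, it can help to count messages by their %FACILITY-SEVERITY-MNEMONIC tag so the most severe events stand out. The following is only a rough sketch, assuming the log has been saved as plain text; the function name `summarize` and the sample list are made up for illustration, and the sample lines are copied from the log above.

```python
import re
from collections import Counter

# Matches the %FACILITY-SEVERITY-MNEMONIC tag used in IOS-XE syslog messages.
TAG = re.compile(r"%([A-Z0-9_]+)-(\d)-([A-Z0-9_]+):")

def summarize(lines):
    """Count messages per (facility, severity, mnemonic) tag."""
    counts = Counter()
    for line in lines:
        m = TAG.search(line)
        if m:
            counts[(m.group(1), int(m.group(2)), m.group(3))] += 1
    return counts

# A few lines taken from the excerpt above.
sample = [
    "May 28 18:46:28: %REDUNDANCY-3-SWITCHOVER: RP switchover (PEER_NOT_PRESENT)",
    "May 28 18:46:28: %REDUNDANCY-3-SWITCHOVER: RP switchover (PEER_DOWN)",
    "May 28 18:46:28: %STACKMGR-4-SWITCH_REMOVED: Switch 2 R0/0: stack_mgr: Switch 1 has been removed from the stack.",
    "May 28 18:46:30: %HMANRP-5-CHASSIS_DOWN_EVENT: Chassis 1 gone DOWN!",
]

counts = summarize(sample)
# Lower severity numbers are more urgent, so print those first.
for (fac, sev, mnem), n in sorted(counts.items(), key=lambda kv: (kv[0][1], kv[0][0])):
    print(f"sev {sev}  {fac}-{mnem}: {n}")
```

Continuation lines without a tag (such as the wrapped "link: up ..." lines) are simply skipped.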

 

Any ideas why this is happening? It seems like the stack is being rebuilt.

Thank you in advance

22 Replies

Hello Leo,

 

The show version output is attached.

The other outputs are below.

 

SWH_STACK_Server_Room# dir flash-1:/core
Directory of flash:/core/

64775 -rw- 1 May 31 2021 10:05:52 +04:00 .callhome
64772 drwx 4096 Dec 4 2020 13:34:09 +04:00 modules

1956839424 bytes total (429297664 bytes free)

 

SWH_STACK_Server_Room# dir flash-2:/core
Directory of flash-2:/core/

89061 -rw- 1 May 26 2021 14:44:29 +04:00 .callhome
89059 drwx 4096 Dec 4 2020 17:32:35 +04:00 modules

1957167104 bytes total (429916160 bytes free)

 

SWH_STACK_Server_Room# dir flash-3:/core
Directory of flash-3:/core/

48581 -rw- 1 May 31 2021 10:08:34 +04:00 .callhome
48579 drwx 4096 Dec 4 2020 17:32:22 +04:00 modules

1957167104 bytes total (429391872 bytes free)

 

SWH_STACK_Server_Room#dir crashinfo-1:
Directory of crashinfo:/

36641 drwx 53248 May 31 2021 10:10:59 +04:00 tracelogs
17 -rw- 2454114 May 29 2021 01:38:50 +04:00 SWH_STACK_Server_Room_1_RP_0_trace_archive_2-20210529-013845.tar.gz
16 -rw- 2259322 May 29 2021 01:13:36 +04:00 SWH_STACK_Server_Room_trace_archive_0-20210529-011331.tar.gz
15 -rw- 1163745 May 26 2021 14:50:53 +04:00 SWH_STACK_Server_Room_1_RP_0_trace_archive_1-20210526-145049.tar.gz
14 -rw- 1273017 May 1 2021 14:38:31 +04:00 SWH_STACK_Server_Room_1_RP_0_trace_archive_0-20210501-143829.tar.gz
13 -rw- 920718 Jan 29 2021 11:57:52 +04:00 SWH_STACK_Server_Room_1_RP_0_trace_archive_1-20210129-075750.tar.gz
11 -rw- 891918 Jan 29 2021 11:56:49 +04:00 SWH_STACK_Server_Room_1_RP_0_trace_archive_0-20210129-075647.tar.gz
12 -rw- 0 Dec 11 2019 20:56:58 +04:00 koops.dat

825638912 bytes total (764428288 bytes free)

 

SWH_STACK_Server_Room#dir crashinfo-2:
Directory of crashinfo-2:/

14657 drwx 32768 May 31 2021 10:09:47 +04:00 tracelogs
18 -rw- 11674381 May 30 2021 22:16:30 +04:00 SWH_STACK_Server_Room_2_RP_0-system-report_2_20210530-221619-Baku.tar.gz
17 -rw- 2959211 May 29 2021 01:38:48 +04:00 SWH_STACK_Server_Room_2_RP_0_trace_archive_0-20210529-013844.tar.gz
16 -rw- 2720388 May 29 2021 01:13:34 +04:00 SWH_STACK_Server_Room_2_RP_0_trace_archive_0-20210529-011329.tar.gz
15 -rw- 1736020 May 28 2021 18:46:33 +04:00 SWH_STACK_Server_Room_2_RP_0_trace_archive_0-20210528-184628.tar.gz
14 -rw- 1416108 May 26 2021 14:50:57 +04:00 system-report_2_20210526-145054-Baku.tar.gz
13 -rw- 1063824 May 26 2021 14:50:53 +04:00 SWH_STACK_Server_Room_trace_archive_1-20210526-145049.tar.gz
11 -rw- 1805733 May 1 2021 14:38:30 +04:00 SWH_STACK_Server_Room_2_RP_0_trace_archive_0-20210501-143828.tar.gz
12 -rw- 0 Dec 11 2019 20:56:58 +04:00 koops.dat

825753600 bytes total (749731840 bytes free)

 

SWH_STACK_Server_Room#dir crashinfo-3:
Directory of crashinfo-3:/

21985 drwx 24576 May 31 2021 10:11:08 +04:00 tracelogs
25 -rw- 10583211 May 30 2021 22:16:26 +04:00 SWH_STACK_Server_Room_3_RP_0-system-report_3_20210530-221619-Baku.tar.gz
24 -rw- 4472184 May 30 2021 22:16:20 +04:00 SWH_STACK_Server_Room_3_RP_0_trace_archive_0-20210530-221613.tar.gz
23 -rw- 1288907 May 29 2021 01:40:11 +04:00 system-report_3_20210529-014010-Baku.tar.gz
22 -rw- 1269934 May 29 2021 01:39:51 +04:00 SWH_STACK_Server_Room_trace_archive_1-20210529-013946.tar.gz
21 -rw- 2340336 May 29 2021 01:13:40 +04:00 system-report_3_20210529-011336-Baku.tar.gz
20 -rw- 1287126 May 29 2021 01:13:35 +04:00 SWH_STACK_Server_Room_3_RP_0_trace_archive_2-20210529-011330.tar.gz
19 -rw- 1070398 May 28 2021 18:46:32 +04:00 SWH_STACK_Server_Room_trace_archive_0-20210528-184629.tar.gz
18 -rw- 1748605 May 26 2021 14:50:53 +04:00 SWH_STACK_Server_Room_3_RP_0_trace_archive_0-20210526-145049.tar.gz
17 -rw- 1611290 May 1 2021 15:04:27 +04:00 system-report_3_20210501-150424-Baku.tar.gz
16 -rw- 1215691 May 1 2021 14:40:31 +04:00 SWH_STACK_Server_Room_3_RP_0_trace_archive_1-20210501-144029.tar.gz
15 -rw- 1217171 May 1 2021 14:39:31 +04:00 SWH_STACK_Server_Room_3_RP_0_trace_archive_0-20210501-143929.tar.gz
14 -rw- 2000413 Jan 29 2021 11:58:04 +04:00 system-report_3_20210129-075801-UTC.tar.gz
13 -rw- 950865 Jan 29 2021 11:57:51 +04:00 SWH_STACK_Server_Room_trace_archive_1-20210129-075750.tar.gz
12 -rw- 871872 Jan 29 2021 11:57:49 +04:00 SWH_STACK_Server_Room_3_RP_0_trace_archive_0-20210129-075747.tar.gz
11 -rw- 0 Dec 11 2019 20:56:58 +04:00 koops.dat

825753600 bytes total (742391808 bytes free)

 

SWH_STACK_Server_Room#sh log on switch 1 up detail
--------------------------------------------------------------------------------
UPTIME SUMMARY INFORMATION
--------------------------------------------------------------------------------
First customer power on : 11/28/2020 12:19:43
Total uptime : 0 years 15 weeks 2 days 1 hours 25 minutes
Total downtime : 0 years 10 weeks 6 days 19 hours 34 minutes
Number of resets : 14
Number of slot changes : 0
Current reset reason : stack merge due to incompatiblity
Current reset timestamp : 05/30/2021 22:18:56
Current slot : 1
Chassis type : 247
Current uptime : 0 years 0 weeks 0 days 11 hours 0 minutes
--------------------------------------------------------------------------------

--------------------------------------------------------------------------------
UPTIME CONTINUOUS INFORMATION
--------------------------------------------------------------------------------
Time Stamp | Reset | Uptime
MM/DD/YYYY HH:MM:SS | Reason | years weeks days hours minutes
--------------------------------------------------------------------------------
11/28/2020 12:19:43 Power Failure or Unknown 0 0 0 0 0
11/28/2020 12:34:22 Image Install 0 0 0 0 10
11/28/2020 12:37:51 Reload Command 0 0 0 0 0
11/30/2020 04:46:43 Power Failure or Unknown 0 0 0 0 0
12/04/2020 13:34:49 Reload Command 0 0 0 0 0
12/04/2020 13:49:38 Image Install 0 0 0 0 10
12/04/2020 13:53:08 Reload Command 0 0 0 0 0
01/25/2021 10:37:47 Power Failure or Unknown 0 0 0 0 0
01/25/2021 11:31:25 Power Failure or Unknown 0 0 0 0 30
02/17/2021 10:01:03 Power Failure or Unknown 0 0 4 3 0
02/17/2021 10:31:46 Power Failure or Unknown 0 0 0 0 10
02/17/2021 14:10:00 Power Failure or Unknown 0 0 0 0 25
05/28/2021 18:48:55 stack merge due to incompatiblity 0 14 2 4 0
05/30/2021 18:14:26 Image Install 0 0 1 23 0
05/30/2021 22:18:56 stack merge due to incompatiblity 0 0 0 4 0
--------------------------------------------------------------------------------

 

SWH_STACK_Server_Room#sh log on switch 2 up detail
--------------------------------------------------------------------------------
UPTIME SUMMARY INFORMATION
--------------------------------------------------------------------------------
First customer power on : 11/28/2020 17:02:20
Total uptime : 0 years 15 weeks 1 days 23 hours 40 minutes
Total downtime : 0 years 10 weeks 6 days 16 hours 35 minutes
Number of resets : 18
Number of slot changes : 1
Current reset reason : stack merge
Current reset timestamp : 05/30/2021 22:18:56
Current slot : 2
Chassis type : 247
Current uptime : 0 years 0 weeks 0 days 11 hours 0 minutes
--------------------------------------------------------------------------------

--------------------------------------------------------------------------------
UPTIME CONTINUOUS INFORMATION
--------------------------------------------------------------------------------
Time Stamp | Reset | Uptime
MM/DD/YYYY HH:MM:SS | Reason | years weeks days hours minutes
--------------------------------------------------------------------------------
11/28/2020 17:02:20 Power Failure or Unknown 0 0 0 0 0
11/28/2020 17:17:07 Image Install 0 0 0 0 10
11/28/2020 17:20:42 Reload Command 0 0 0 0 0
12/04/2020 04:32:48 Power Failure or Unknown 0 0 0 0 0
12/04/2020 17:33:15 Reload Command 0 0 0 0 0
12/04/2020 17:47:57 Image Install 0 0 0 0 10
12/04/2020 17:51:27 Reload Command 0 0 0 0 0
01/25/2021 10:41:17 Power Failure or Unknown 0 0 0 0 0
01/25/2021 11:35:02 Power Failure or Unknown 0 0 0 0 30
01/29/2021 12:00:24 lost both active and standby 0 0 3 23 56
02/17/2021 10:02:43 Power Failure or Unknown 0 0 0 2 0
02/17/2021 10:33:26 Power Failure or Unknown 0 0 0 0 8
02/17/2021 14:11:39 Power Failure or Unknown 0 0 0 0 23
05/26/2021 14:53:18 stack merge 0 13 6 23 57
05/28/2021 18:48:55 lost both active and standby 0 0 2 3 0
05/29/2021 01:16:01 lost both active and standby 0 0 0 6 0
05/29/2021 01:44:06 EHSA standby down 0 0 0 0 25
05/30/2021 18:14:25 Image Install 0 0 1 16 0
05/30/2021 22:18:56 stack merge 0 0 0 4 0
--------------------------------------------------------------------------------

 

SWH_STACK_Server_Room#sh log on switch 3 up detail
--------------------------------------------------------------------------------
UPTIME SUMMARY INFORMATION
--------------------------------------------------------------------------------
First customer power on : 11/28/2020 17:35:30
Total uptime : 0 years 15 weeks 1 days 23 hours 57 minutes
Total downtime : 0 years 10 weeks 6 days 15 hours 45 minutes
Number of resets : 19
Number of slot changes : 1
Current reset reason : stack merge
Current reset timestamp : 05/30/2021 22:18:56
Current slot : 3
Chassis type : 247
Current uptime : 0 years 0 weeks 0 days 11 hours 0 minutes
--------------------------------------------------------------------------------

--------------------------------------------------------------------------------
UPTIME CONTINUOUS INFORMATION
--------------------------------------------------------------------------------
Time Stamp | Reset | Uptime
MM/DD/YYYY HH:MM:SS | Reason | years weeks days hours minutes
--------------------------------------------------------------------------------
11/28/2020 17:35:30 Power Failure or Unknown 0 0 0 0 0
11/28/2020 17:50:15 Image Install 0 0 0 0 10
11/28/2020 17:53:47 Reload Command 0 0 0 0 0
12/04/2020 04:32:21 Power Failure or Unknown 0 0 0 0 0
12/04/2020 17:33:02 Reload Command 0 0 0 0 0
12/04/2020 17:47:45 Image Install 0 0 0 0 10
12/04/2020 17:51:15 Reload Command 0 0 0 0 0
12/04/2020 17:54:56 Power Failure or Unknown 0 0 0 0 0
01/25/2021 10:37:11 Power Failure or Unknown 0 0 0 0 0
01/25/2021 11:31:05 Power Failure or Unknown 0 0 0 0 30
01/29/2021 12:00:25 stack merge 0 0 4 0 0
02/17/2021 10:01:20 Power Failure or Unknown 0 0 0 2 0
02/17/2021 10:32:02 Power Failure or Unknown 0 0 0 0 9
02/17/2021 14:10:13 Power Failure or Unknown 0 0 0 0 24
05/01/2021 15:06:48 stack merge 0 10 2 23 57
05/26/2021 14:53:19 lost both active and standby 0 3 3 23 0
05/29/2021 01:16:01 stack merge 0 0 2 9 59
05/29/2021 01:44:06 stack merge 0 0 0 0 20
05/30/2021 18:14:28 Image Install 0 0 1 15 59
05/30/2021 22:18:56 stack merge 0 0 0 4 0
--------------------------------------------------------------------------------
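
To see which resets hit several members at the same moment, the "UPTIME CONTINUOUS INFORMATION" rows from each switch can be merged and grouped by timestamp. This is a sketch only, assuming the rows are pasted as text; `parse_rows`, `cluster`, and the 5-minute window are illustrative choices, and the sample rows are copied from the tables above.

```python
import re
from datetime import datetime, timedelta

# One history row: timestamp, free-text reset reason, then five uptime counters.
ROW = re.compile(
    r"^(\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2})\s+(.+?)"
    r"\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)$"
)

def parse_rows(text):
    """Return [(timestamp, reason), ...] for every history row in text."""
    rows = []
    for line in text.splitlines():
        m = ROW.match(line.strip())
        if m:
            ts = datetime.strptime(m.group(1), "%m/%d/%Y %H:%M:%S")
            rows.append((ts, m.group(2)))
    return rows

def cluster(per_switch, window=timedelta(minutes=5)):
    """Group resets from all switches that fall within `window` of each other."""
    events = sorted(
        (ts, sw, reason) for sw, rows in per_switch.items() for ts, reason in rows
    )
    groups = []
    for ev in events:
        if groups and ev[0] - groups[-1][-1][0] <= window:
            groups[-1].append(ev)
        else:
            groups.append([ev])
    return groups

per_switch = {
    1: parse_rows("05/28/2021 18:48:55 stack merge due to incompatiblity 0 14 2 4 0\n"
                  "05/30/2021 22:18:56 stack merge due to incompatiblity 0 0 0 4 0"),
    2: parse_rows("05/28/2021 18:48:55 lost both active and standby 0 0 2 3 0\n"
                  "05/30/2021 22:18:56 stack merge 0 0 0 4 0"),
    3: parse_rows("05/29/2021 01:16:01 stack merge 0 0 2 9 59\n"
                  "05/30/2021 22:18:56 stack merge 0 0 0 4 0"),
}

groups = cluster(per_switch)
for g in groups:
    members = ", ".join(f"sw{sw}: {reason}" for _, sw, reason in g)
    print(g[0][0], "->", members)
```

Groups that contain every member point to a whole-stack event, while single-member groups point to one switch dropping out on its own.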


@fgasimzade wrote:

05/30/2021 18:14:28 Image Install 0 0 1 15 59
05/30/2021 22:18:56 stack merge 0 0 0 4 0


I can see the firmware was upgraded at 1814 UTC, and at 2218 UTC the stack crashed and rebooted again. Is this correct?

Can I see the output of the following command: sh platform software status con brief

Hello Leo,

Yes, you are correct about the timing

 

SWH_STACK_Server_Room#sh platform software status con brief
Load Average
 Slot  Status  1-Min  5-Min 15-Min
1-RP0 Healthy   0.49   0.48   0.51
2-RP0 Healthy   0.17   0.19   0.18
3-RP0 Healthy   0.42   0.40   0.41

Memory (kB)
 Slot  Status    Total     Used (Pct)     Free (Pct) Committed (Pct)
1-RP0 Healthy  4028924  1093936 (27%)  2934988 (73%)   2126760 (53%)
2-RP0 Healthy  4028924   800532 (20%)  3228392 (80%)   1211564 (30%)
3-RP0 Healthy  4028924  1066596 (26%)  2962328 (74%)   2133864 (53%)

CPU Utilization
 Slot  CPU   User System   Nice   Idle    IRQ   SIRQ IOwait
1-RP0    0   6.65   4.46   0.00  87.73   0.72   0.41   0.00
         1   7.32   3.87   0.00  87.86   0.62   0.31   0.00
         2   5.81   4.98   0.00  88.16   0.62   0.41   0.00
         3   4.78   5.30   0.00  89.08   0.51   0.31   0.00
2-RP0    0   3.34   3.04   0.00  92.90   0.50   0.20   0.00
         1   3.46   3.05   0.00  92.87   0.40   0.20   0.00
         2   3.26   2.95   0.00  93.26   0.40   0.10   0.00
         3   3.55   2.74   0.00  93.08   0.40   0.20   0.00
3-RP0    0   4.70   5.01   0.00  89.34   0.62   0.31   0.00
         1   5.62   5.93   0.00  87.61   0.51   0.30   0.00
         2   5.16   5.99   0.00  87.91   0.51   0.41   0.00
         3   3.94   5.39   0.00  89.71   0.62   0.31   0.00

 

 

 

 - Keep working on the syslog-server solution too. In these conditions, the messages logged just before the stack reloads may point to upcoming issues or contain useful 'last gasp' messages.
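
If no syslog server is in place yet, a throwaway receiver along these lines could be enough to catch those last messages. It is a minimal sketch, not a replacement for a real syslog daemon: it assumes the stack sends plain UDP syslog (the log earlier in the thread shows port 514), and `LOG_PATH`, `format_record`, `SyslogHandler`, and `serve` are made-up names for illustration.

```python
import socketserver
from datetime import datetime, timezone

LOG_PATH = "stack_syslog.log"  # hypothetical output file

def format_record(source_ip, payload, now=None):
    """Build one log line: receive time, sender address, decoded message."""
    now = now or datetime.now(timezone.utc)
    text = payload.decode("utf-8", errors="replace").strip()
    return f"{now.isoformat()} {source_ip} {text}"

class SyslogHandler(socketserver.BaseRequestHandler):
    def handle(self):
        # For UDP servers, self.request is a (bytes, socket) pair.
        data, _sock = self.request
        line = format_record(self.client_address[0], data)
        # Open, append, and close per message so each line reaches the file
        # before the next datagram is handled.
        with open(LOG_PATH, "a", encoding="utf-8") as fh:
            fh.write(line + "\n")

def serve(host="0.0.0.0", port=514):
    """Block forever, appending every received datagram to LOG_PATH."""
    with socketserver.UDPServer((host, port), SyslogHandler) as server:
        server.serve_forever()

# To run the receiver, call serve(). Binding to port 514 usually needs
# elevated privileges on the host.
```

Stamping each line with the receive time on the server side makes it easier to line up the last message from each member against the reset timestamps, even if the switch clocks drift.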

 M.



-- Let everything happen to you  
       Beauty and terror
      Just keep going    
       No feeling is final
Rainer Maria Rilke (1899)

OK, the snapshot shows that CPU and memory utilization are low.

Here's what I would do: 

1.  Keep the stack under observation.  If the stack does not crash in the next 7 days, then good. 

2.  If the stack crashes again, pull the power cables from all stack members.  Make sure the entire stack is dark, which means the fans of the power supplies have also stopped spinning.  Wait about 5 seconds, then restore power.

Let us see if this fixes it.


@fgasimzade wrote:
SWH_STACK_Server_Room_2_RP_0-system-report_2_20210530-221619-Baku.tar.gz
SWH_STACK_Server_Room_3_RP_0-system-report_3_20210530-221619-Baku.tar.gz
SWH_STACK_Server_Room_3_RP_0_trace_archive_0-20210530-221613.tar.gz

May I also see these three files?

Hello Leo,

 

Noted, thank you

 

Have you been able to identify the cause of the last reload?


@fgasimzade wrote:

Have you been able to identify the cause of the last reload?


Not sure yet.  

Last reload reason: stack merge due to incompatiblity

This comes from the latest "sh version" output, and it still points to the Bug ID I suspected earlier.

That output shows this:

 

Switch 02
---------
Switch uptime : 15 hours, 46 minutes

Base Ethernet MAC Address : e8:eb:34:22:45:80
Motherboard Assembly Number : 73-18792-04
Motherboard Serial Number : JAE24472552
Model Revision Number : C1
Motherboard Revision Number : B0
Model Number : C9200-48T
System Serial Number : JAE24472552
Last reload reason : stack merge
CLEI Code Number : INM9Y00ERA

Switch 03
---------
Switch uptime : 15 hours, 47 minutes

Base Ethernet MAC Address : e8:eb:34:10:69:80
Motherboard Assembly Number : 73-18792-04
Motherboard Serial Number : JAE24480GZS
Model Revision Number : C1
Motherboard Revision Number : B0
Model Number : C9200-48T
System Serial Number : JAE24480GZS
Last reload reason : stack merge
CLEI Code Number : INM9Y00ERA

Configuration register is 0x102

 

But this doesn't seem right, since "stack merge" was also the reason shown when we restarted after the IOS upgrade.