11-11-2022 08:46 AM
I have a few 9200L stacks that crash every so often (usually after about a week, sometimes as long as 3 weeks). The stack goes down and does not recover, and I am unable to console into 3 of the 4 switches in the stack until I power cycle the whole stack. The switch that stays up is ALWAYS the standby switch. Hardware is C9200L-48P-4X. IOS release is 17.03.05 (the same issue was present on 17.03.04b).
I'm still skimming the crash logs, but I do see an instance where it looks like the stack-mgr process crashes. It lists: PROCESS : exit code for stack-mgr was 69. I've posted some outputs below. Thanks!
sh log on sw 1 up de
--------------------------------------------------------------------------------
UPTIME SUMMARY INFORMATION
--------------------------------------------------------------------------------
First customer power on : 10/18/2021 09:39:10
Total uptime : 0 years 24 weeks 2 days 20 hours 43 minutes
Total downtime : 0 years 31 weeks 1 days 9 hours 23 minutes
Number of resets : 30
Number of slot changes : 0
Current reset reason : Power Failure or Unknown
Current reset timestamp : 11/11/2022 13:46:37
Current slot : 1
Chassis type : 247
Current uptime : 0 years 0 weeks 0 days 2 hours 0 minutes
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
UPTIME CONTINUOUS INFORMATION
--------------------------------------------------------------------------------
Time Stamp | Reset | Uptime
MM/DD/YYYY HH:MM:SS | Reason | years weeks days hours minutes
--------------------------------------------------------------------------------
10/18/2021 09:39:10 Power Failure or Unknown 0 0 0 0 0
10/18/2021 09:57:41 Image Install 0 0 0 0 15
10/18/2021 10:02:02 Power Failure or Unknown 0 0 0 0 0
10/18/2021 11:08:06 Power Failure or Unknown 0 0 0 0 0
10/18/2021 12:23:58 Reload Command 0 0 0 0 0
10/18/2021 12:42:35 Image Install 0 0 0 0 15
10/25/2021 06:34:16 Power Failure or Unknown 0 0 0 0 0
10/25/2021 06:40:31 Reload Command 0 0 0 0 0
10/25/2021 06:45:32 Power Failure or Unknown 0 0 0 0 0
05/17/2022 14:44:49 Power Failure or Unknown 0 0 0 0 0
05/17/2022 16:38:59 Image Install 0 0 0 1 15
05/17/2022 18:08:45 Reload Command 0 0 0 1 0
05/17/2022 18:23:49 Reload Command 0 0 0 0 10
05/17/2022 18:37:47 Power Failure or Unknown 0 0 0 0 5
05/25/2022 15:14:17 Power Failure or Unknown 0 0 0 20 0
05/25/2022 16:23:48 Reload Slot Command 0 0 0 1 0
06/28/2022 21:03:03 Power Failure or Unknown 0 4 6 1 10
06/30/2022 19:11:42 Power Failure or Unknown 0 0 1 20 0
08/03/2022 05:11:00 Power Failure or Unknown 0 4 5 9 2
10/02/2022 19:46:57 Power Failure or Unknown 0 8 4 14 5
10/07/2022 01:56:31 EHSA keepalive timeout 0 0 4 6 0
10/07/2022 14:04:42 Power Failure or Unknown 0 0 0 12 0
10/11/2022 17:28:52 EHSA keepalive timeout 0 0 4 3 0
10/11/2022 18:52:36 Power Failure or Unknown 0 0 0 1 0
10/11/2022 19:18:07 active removed before switch beca 0 0 0 0 20
10/15/2022 21:46:30 EHSA keepalive timeout 0 0 4 2 0
10/17/2022 12:19:49 Power Failure or Unknown 0 0 1 14 0
10/19/2022 01:29:28 Image Install 0 0 1 13 0
11/11/2022 00:38:05 EHSA keepalive timeout 0 3 1 23 2
11/11/2022 13:46:37 Power Failure or Unknown 0 0 0 13 0
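For anyone who wants to tally the reset reasons in a table like the one above instead of counting by eye, here is a minimal Python sketch. It assumes each row has the flattened layout shown in this post (timestamp, free-text reason, then five uptime integers). The names `SAMPLE`, `ROW` and `tally_reset_reasons` are made up for illustration and are not part of any Cisco tool.

```python
import re
from collections import Counter

# A few rows copied from the "UPTIME CONTINUOUS INFORMATION" table above.
SAMPLE = """\
10/02/2022 19:46:57 Power Failure or Unknown 0 8 4 14 5
10/07/2022 01:56:31 EHSA keepalive timeout 0 0 4 6 0
10/07/2022 14:04:42 Power Failure or Unknown 0 0 0 12 0
10/11/2022 17:28:52 EHSA keepalive timeout 0 0 4 3 0
10/11/2022 18:52:36 Power Failure or Unknown 0 0 0 1 0
10/11/2022 19:18:07 active removed before switch beca 0 0 0 0 20
"""

# Assumed row layout: date, time, reason text, then
# years / weeks / days / hours / minutes of uptime.
ROW = re.compile(
    r"^(\d{2}/\d{2}/\d{4}) (\d{2}:\d{2}:\d{2}) "
    r"(.+?) "
    r"(\d+) (\d+) (\d+) (\d+) (\d+)$"
)


def tally_reset_reasons(text):
    """Count how often each reset reason appears in the pasted rows."""
    counts = Counter()
    for line in text.splitlines():
        match = ROW.match(line.strip())
        if match:  # skip headers, separators and blank lines
            counts[match.group(3)] += 1
    return counts


if __name__ == "__main__":
    for reason, count in tally_reset_reasons(SAMPLE).most_common():
        print(f"{count:3d}  {reason}")
```

Pasting the whole table into `SAMPLE` should work the same way, since lines that do not match the row pattern are ignored.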
sh pla so statu con bri
Load Average
Slot Status 1-Min 5-Min 15-Min
1-RP0 Healthy 1.27 1.29 1.35
2-RP0 Healthy 0.74 0.84 0.83
3-RP0 Healthy 0.44 0.44 0.49
4-RP0 Healthy 0.35 0.42 0.44
5-RP0 Healthy 0.33 0.54 0.63
6-RP0 Healthy 0.93 0.90 0.83
Memory (kB)
Slot Status Total Used (Pct) Free (Pct) Committed (Pct)
1-RP0 Healthy 1984308 1313808 (66%) 670500 (34%) 1878608 (95%)
2-RP0 Healthy 1984308 1282816 (65%) 701492 (35%) 1722128 (87%)
3-RP0 Healthy 1984308 826592 (42%) 1157716 (58%) 841292 (42%)
4-RP0 Healthy 1984308 830172 (42%) 1154136 (58%) 832164 (42%)
5-RP0 Healthy 1984308 829388 (42%) 1154920 (58%) 815784 (41%)
6-RP0 Healthy 1984308 829624 (42%) 1154684 (58%) 815096 (41%)
CPU Utilization
Slot CPU User System Nice Idle IRQ SIRQ IOwait
1-RP0 0 18.29 12.61 0.00 66.66 1.89 0.52 0.00
1 17.19 11.78 0.00 69.74 0.74 0.53 0.00
2 17.92 9.65 0.00 71.15 0.74 0.53 0.00
3 18.55 10.27 0.00 69.91 0.62 0.62 0.00
2-RP0 0 11.14 8.12 0.00 79.06 1.25 0.41 0.00
1 11.55 8.50 0.00 78.78 0.73 0.42 0.00
2 13.72 7.73 0.00 77.29 0.72 0.51 0.00
3 10.42 7.19 0.00 81.23 0.62 0.52 0.00
3-RP0 0 8.52 3.35 0.00 87.00 1.01 0.10 0.00
1 7.09 3.39 0.00 88.90 0.51 0.10 0.00
2 6.95 3.53 0.00 88.88 0.51 0.10 0.00
3 8.76 3.36 0.00 87.15 0.50 0.20 0.00
4-RP0 0 8.57 3.71 0.00 86.57 0.92 0.20 0.00
1 7.86 3.67 0.00 87.74 0.51 0.20 0.00
2 7.96 3.20 0.00 88.00 0.62 0.20 0.00
3 7.05 3.57 0.00 88.65 0.51 0.20 0.00
5-RP0 0 9.48 3.67 0.00 85.40 1.12 0.30 0.00
1 9.37 4.38 0.00 85.42 0.50 0.30 0.00
2 11.96 3.44 0.00 83.87 0.60 0.10 0.00
3 10.70 4.28 0.00 84.40 0.50 0.10 0.00
6-RP0 0 12.06 4.22 0.00 82.16 1.34 0.20 0.00
1 9.69 4.58 0.00 84.88 0.62 0.20 0.00
2 11.32 4.63 0.00 83.21 0.61 0.20 0.00
3 13.17 4.80 0.00 81.30 0.51 0.20 0.00
11-12-2022 05:54 PM
I can get with my supervisor tomorrow and see what we can work out. There shouldn't be anyone at this site on the weekends. This stack did already have to be physically rebooted a few days ago, as the only way to recover the stack is to go on location and pull the power cable on all the switches.
11-12-2022 07:02 PM
@CaeCae wrote:
This stack did already have to be physically rebooted a few days ago
The "reload" command is not as effective as a cold reboot.
11-12-2022 07:20 PM
It was not rebooted with the reload command. The whole stack is frozen when this happens - no ssh, no console. We have to do a cold reboot when this occurs, and that is what we did a few days ago.
I will perform another one after hours. Is there any command output or file you will need after I do so?
Thanks
11-13-2022 01:07 AM - edited 11-13-2022 01:07 AM
Before cold-rebooting the entire stack, please share the complete output of the command "sh platform software status con brief".
11-14-2022 05:21 AM
I haven't had an opportunity to reset the stack yet, but here is the requested output. I will run the command again before I cold reboot and then once more after it's booted.
sh pla so statu con bri
Load Average
Slot Status 1-Min 5-Min 15-Min
1-RP0 Healthy 0.80 1.39 1.51
2-RP0 Healthy 0.82 0.77 0.80
3-RP0 Healthy 0.63 0.60 0.55
4-RP0 Healthy 0.33 0.42 0.45
5-RP0 Healthy 0.61 0.66 0.63
6-RP0 Healthy 0.92 0.81 0.73
Memory (kB)
Slot Status Total Used (Pct) Free (Pct) Committed (Pct)
1-RP0 Healthy 1984308 1354164 (68%) 630144 (32%) 1880480 (95%)
2-RP0 Healthy 1984308 1294264 (65%) 690044 (35%) 1751224 (88%)
3-RP0 Healthy 1984308 837992 (42%) 1146316 (58%) 883120 (45%)
4-RP0 Healthy 1984308 838384 (42%) 1145924 (58%) 850312 (43%)
5-RP0 Healthy 1984308 836344 (42%) 1147964 (58%) 841324 (42%)
6-RP0 Healthy 1984308 836112 (42%) 1148196 (58%) 822352 (41%)
CPU Utilization
Slot CPU User System Nice Idle IRQ SIRQ IOwait
1-RP0 0 17.78 9.72 0.00 69.97 1.88 0.62 0.00
1 16.64 10.89 0.00 71.20 0.73 0.52 0.00
2 16.61 10.51 0.00 71.60 0.73 0.52 0.00
3 19.39 8.59 0.00 70.75 0.73 0.52 0.00
2-RP0 0 9.17 5.63 0.00 83.52 1.14 0.52 0.00
1 10.54 6.10 0.00 82.21 0.72 0.41 0.00
2 11.22 6.59 0.00 81.15 0.61 0.41 0.00
3 12.22 6.11 0.00 80.62 0.62 0.41 0.00
3-RP0 0 7.75 3.77 0.00 87.24 1.02 0.20 0.00
1 8.51 3.89 0.00 86.87 0.51 0.20 0.00
2 7.62 3.55 0.00 88.10 0.60 0.10 0.00
3 7.86 2.89 0.00 88.61 0.51 0.10 0.00
4-RP0 0 9.42 3.85 0.00 85.51 1.01 0.20 0.00
1 8.34 4.62 0.00 86.33 0.50 0.20 0.00
2 9.48 3.97 0.00 85.93 0.40 0.20 0.00
3 9.23 2.94 0.00 87.20 0.50 0.10 0.00
5-RP0 0 9.65 3.49 0.00 85.62 1.02 0.20 0.00
1 8.31 4.00 0.00 86.96 0.51 0.20 0.00
2 10.85 3.95 0.00 84.38 0.60 0.20 0.00
3 9.48 3.57 0.00 86.32 0.51 0.10 0.00
6-RP0 0 10.04 4.96 0.00 83.33 1.44 0.20 0.00
1 12.64 5.03 0.00 81.60 0.51 0.20 0.00
2 9.95 4.40 0.00 84.90 0.52 0.20 0.00
3 14.07 4.34 0.00 80.84 0.51 0.20 0.00
11-14-2022 01:56 PM
@CaeCae wrote:
1-RP0 Healthy 1984308 1354164 (68%) 630144 (32%) 1880480 (95%)
2-RP0 Healthy 1984308 1294264 (65%) 690044 (35%) 1751224 (88%)
Switch 1 and switch 2 memory utilization is abnormally high.
Post the complete output of the following commands:
sh processes memory platform sort location switch 1 r0
sh processes memory platform sort location switch 2 r0
NOTE: Please provide only the "first page" of output for each command. There is no need to see the rest of the output.
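As a rough illustration of the comparison being made on the memory rows, the following Python sketch parses the "Memory (kB)" table and lists the slots whose used percentage exceeds a cutoff. The 60% cutoff is an arbitrary value chosen for this example, not a documented Cisco threshold, and the names `MEMORY_ROWS`, `parse_memory_rows` and `slots_above` are hypothetical.

```python
import re

# Memory rows copied from the "sh pla so statu con bri" output in this thread.
MEMORY_ROWS = """\
1-RP0 Healthy 1984308 1354164 (68%) 630144 (32%) 1880480 (95%)
2-RP0 Healthy 1984308 1294264 (65%) 690044 (35%) 1751224 (88%)
3-RP0 Healthy 1984308 837992 (42%) 1146316 (58%) 883120 (45%)
4-RP0 Healthy 1984308 838384 (42%) 1145924 (58%) 850312 (43%)
5-RP0 Healthy 1984308 836344 (42%) 1147964 (58%) 841324 (42%)
6-RP0 Healthy 1984308 836112 (42%) 1148196 (58%) 822352 (41%)
"""

# Assumed columns: Slot, Status, Total, Used (Pct), Free (Pct), Committed (Pct).
ROW = re.compile(
    r"^(\S+) (\S+) (\d+) (\d+) \((\d+)%\) (\d+) \((\d+)%\) (\d+) \((\d+)%\)$"
)


def parse_memory_rows(text):
    """Turn each matching line into a dict; non-matching lines are skipped."""
    rows = []
    for line in text.splitlines():
        match = ROW.match(line.strip())
        if not match:
            continue
        rows.append({
            "slot": match.group(1),
            "status": match.group(2),
            "total_kb": int(match.group(3)),
            "used_kb": int(match.group(4)),
            "used_pct": int(match.group(5)),
            "free_kb": int(match.group(6)),
            "free_pct": int(match.group(7)),
            "committed_kb": int(match.group(8)),
            "committed_pct": int(match.group(9)),
        })
    return rows


def slots_above(rows, used_pct_threshold=60):
    """Return slots whose used percentage is above the (assumed) cutoff."""
    return [r["slot"] for r in rows if r["used_pct"] > used_pct_threshold]


if __name__ == "__main__":
    print(slots_above(parse_memory_rows(MEMORY_ROWS)))
```

With the rows above, only the first two slots exceed the example cutoff, which lines up with the two rows quoted in this post.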
11-14-2022 02:06 PM
Here is the requested output.
sh processes memory platform sort location switch 1 r0
System memory: 1984308K total, 1357284K used, 627024K free,
Lowest: 567380K
Pid Text Data Stack Dynamic RSS Name
----------------------------------------------------------------------
7598 132681 412940 136 116 412940 linux_iosd-imag
9326 159 160996 136 104 160996 fed main event
16906 1701 132924 148 76064 132924 confd
1072 675 39032 136 92 39032 smand
1895 4835 34408 136 44 34408 fman_rp
18863 5709 32580 136 52 32580 fman_fp_image
12750 61 29720 136 168 29720 pubd
13320 289 26152 136 40 26152 ndbmand
2246 166 23900 136 104 23900 dbm
20113 116 22244 136 104 22244 sessmgrd
3951 7 20048 136 2780 20048 python3
8297 263 18196 136 96 18196 sif_mgr
11703 1183 17020 136 104 17020 cmand
2772 41 16340 136 52 16340 cli_agent
13960 143 15404 136 92 15404 dmiauthd
96 116 14652 132 8448 14652 systemd-journal
21478 135 12652 136 32 12652 repm
8034 399 11104 136 92 11104 stack_mgr
11149 430 10792 484 48 10792 hman
10888 288 9920 136 92 9920 install_mgr
610 58 9616 136 32 9616 psd
17790 200 7984 136 32 7984 iomd
1391 204 7896 136 32 7896 tms
12092 190 7816 136 92 7816 btman
9991 91 7744 136 92 7744 keyman
11950 57 7044 136 92 7044 bt_logger
9479 129 7040 136 92 7040 lman
9795 190 6880 136 92 6880 btman
1 1059 6732 132 1052 6732 systemd
8967 910 6644 308 4828 6644 ncd.sh
12534 910 6552 308 4768 6552 auto_upgrade_cl
10361 910 6536 304 4632 6536 issu_stack.sh
7847 44 6376 136 84 6376 tamd_proc
22051 67 5988 136 32 5988 plogd
14647 269 5888 136 92 5888 ncsshd
15072 910 5708 304 4632 5708 issu_stack.sh
15064 910 5636 304 4632 5636 issu_stack.sh
7708 52 5544 136 84 5544 tams_proc
2497 104 5056 136 32 5056 cmm
13168 910 5052 308 3284 5052 periodic.sh
20059 1424 5036 136 1680 5036 nginx
6382 910 5016 304 3132 5016 rollback_timer.
14359 764 4864 136 32 4864 ncsshd_bp
11342 41 4592 136 84 4592 tam_svcs_esg_cf
8594 230 4280 136 92 4280 nif_mgr
6747 910 4208 304 2348 4208 psvp.sh
6818 910 4160 304 2336 4160 pvp.sh
8335 910 3932 304 2092 3932 pvp.sh
10160 910 3888 304 1960 3888 pvp.sh
4015 910 3824 300 1976 3824 reflector.sh
3776 910 3804 300 1976 3804 droputil.sh
6383 910 3724 304 1944 3724 chasync.sh
12583 48 3652 136 24 3652 pttcd
sh processes memory platform sort location switch 2 r0
System memory: 1984308K total, 1295440K used, 688868K free,
Lowest: 659512K
Pid Text Data Stack Dynamic RSS Name
----------------------------------------------------------------------
7581 132681 340724 136 116 340724 linux_iosd-imag
9256 159 175880 136 104 175880 fed main event
18123 1701 115588 148 62776 115588 confd
15421 61 59796 136 168 59796 pubd
2572 166 48780 136 104 48780 dbm
21377 5709 47172 136 52 47172 fman_fp_image
2079 4835 44540 136 44 44540 fman_rp
22105 116 37796 136 104 37796 sessmgrd
16007 289 36804 136 40 36804 ndbmand
11706 1183 33140 136 104 33140 cmand
8293 263 31928 136 96 31928 sif_mgr
28013 7 29304 136 2712 29304 python3
1211 675 28304 136 92 28304 smand
22819 135 21660 136 32 21660 repm
3236 41 20216 136 52 20216 cli_agent
3784 58 16188 136 32 16188 psd
8047 399 15928 136 92 15928 stack_mgr
10922 288 14292 136 92 14292 install_mgr
105 116 13536 132 8004 13536 systemd-journal
11140 430 11940 480 48 11940 hman
16492 143 11128 136 92 11128 dmiauthd
20640 200 10516 136 32 10516 iomd
9545 129 10240 136 92 10240 lman
11947 57 10204 136 92 10204 bt_logger
12063 190 9456 136 92 9456 btman
1423 204 9136 136 32 9136 tms
9773 190 8840 136 92 8840 btman
10018 91 8348 136 92 8348 keyman
23283 67 6976 136 32 6976 plogd
7865 44 6856 136 84 6856 tamd_proc
1 1059 6656 132 1044 6656 systemd
8873 910 6644 308 4828 6644 ncd.sh
2951 104 6572 136 32 6572 cmm
12523 910 6552 308 4768 6552 auto_upgrade_cl
19202 1424 6528 136 1680 6528 nginx
10307 910 6512 308 4632 6512 issu_stack.sh
8566 230 6272 136 92 6272 nif_mgr
16911 764 6264 136 32 6264 ncsshd_bp
7696 52 5940 136 84 5940 tams_proc
15568 910 5712 308 4632 5712 issu_stack.sh
15562 910 5640 308 4632 5640 issu_stack.sh
13156 910 5052 308 3284 5052 periodic.sh
6397 910 5016 304 3132 5016 rollback_timer.
11342 41 4996 136 84 4996 tam_svcs_esg_cf
15267 48 4452 136 24 4452 pttcd
17191 269 4312 136 132 4312 ncsshd
6746 910 4208 304 2348 4208 psvp.sh
6818 910 4156 300 2340 4156 pvp.sh
8353 910 3928 304 2092 3928 pvp.sh
10205 910 3888 304 1960 3888 pvp.sh
4014 910 3824 300 1976 3824 reflector.sh
3756 910 3804 300 1976 3804 droputil.sh
21933 910 3752 300 1936 3752 brelay_console.
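If it helps to compare the two switches, a short Python sketch like the one below can parse the first page of this output and report the largest processes by RSS as a share of total memory. It assumes the RSS column uses the same kilobyte units as the "System memory" line and that the process name is everything after the sixth column (names such as "fed main event" contain spaces). The names `SWITCH2_SAMPLE`, `parse_processes` and `top_by_rss` are made up for illustration.

```python
# First few rows copied from the switch 2 output above.
SWITCH2_SAMPLE = """\
Pid Text Data Stack Dynamic RSS Name
----------------------------------------------------------------------
7581 132681 340724 136 116 340724 linux_iosd-imag
9256 159 175880 136 104 175880 fed main event
18123 1701 115588 148 62776 115588 confd
15421 61 59796 136 168 59796 pubd
"""

TOTAL_KB = 1984308  # from the "System memory: 1984308K total" line


def parse_processes(text):
    """Parse rows of: Pid Text Data Stack Dynamic RSS Name."""
    procs = []
    for line in text.splitlines():
        parts = line.split(None, 6)  # name keeps any embedded spaces
        if len(parts) != 7 or not all(p.isdigit() for p in parts[:6]):
            continue  # header, separator, or unrelated line
        procs.append({"pid": int(parts[0]),
                      "rss_kb": int(parts[5]),
                      "name": parts[6].strip()})
    return procs


def top_by_rss(procs, n=3, total_kb=TOTAL_KB):
    """Return (name, rss_kb, percent_of_total) for the n largest processes."""
    ranked = sorted(procs, key=lambda p: p["rss_kb"], reverse=True)[:n]
    return [(p["name"], p["rss_kb"], 100.0 * p["rss_kb"] / total_kb)
            for p in ranked]


if __name__ == "__main__":
    for name, rss_kb, pct in top_by_rss(parse_processes(SWITCH2_SAMPLE)):
        print(f"{name:20s} {rss_kb:8d} kB  {pct:5.1f}%")
```

Running the same functions on the switch 1 page would give a side-by-side view of which processes differ between the active and standby switches.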
11-14-2022 02:35 PM
Numbers look OK.
Try upgrading to 17.6.4.
11-19-2022 06:55 AM
We just upgraded one of our IDF stacks to 17.6.4, and memory utilization is already better than it was before. Note: this is not the same stack as in the outputs provided earlier in the thread; it is the same environment minus 2 switches. All outputs and crashes were the same as or similar to those on the other stacks. I will continue to monitor and update this thread with any changes. I'm going to accept this as a solution in the meantime. Thank you both for your help!
sh pla so statu con bri
Load Average
Slot Status 1-Min 5-Min 15-Min
1-RP0 Healthy 0.56 1.15 1.29
2-RP0 Healthy 0.35 0.67 0.78
3-RP0 Healthy 0.20 0.41 0.47
4-RP0 Healthy 0.35 0.61 0.55
Memory (kB)
Slot Status Total Used (Pct) Free (Pct) Committed (Pct)
1-RP0 Healthy 1984168 1153740 (58%) 830428 (42%) 1866780 (94%)
2-RP0 Healthy 1984168 1078448 (54%) 905720 (46%) 1717740 (87%)
3-RP0 Healthy 1984168 800268 (40%) 1183900 (60%) 840688 (42%)
4-RP0 Healthy 1984168 801976 (40%) 1182192 (60%) 840456 (42%)
CPU Utilization
Slot CPU User System Nice Idle IRQ SIRQ IOwait
1-RP0 0 10.75 6.88 0.00 80.89 1.25 0.20 0.00
1 9.17 7.50 0.00 82.48 0.52 0.31 0.00
2 11.15 5.73 0.00 82.27 0.62 0.20 0.00
3 9.51 5.33 0.00 84.41 0.52 0.20 0.00
2-RP0 0 7.08 3.69 0.00 88.29 0.82 0.10 0.00
1 6.48 2.98 0.00 89.80 0.51 0.20 0.00
2 7.82 2.88 0.00 88.67 0.51 0.10 0.00
3 5.48 3.61 0.00 90.27 0.51 0.10 0.00
3-RP0 0 6.87 1.82 0.00 90.39 0.80 0.10 0.00
1 5.72 3.06 0.00 90.70 0.40 0.10 0.00
2 6.39 2.94 0.00 90.15 0.40 0.10 0.00
3 6.67 3.13 0.00 89.58 0.50 0.10 0.00
4-RP0 0 9.21 3.95 0.00 85.61 1.11 0.10 0.00
1 7.88 3.84 0.00 87.76 0.40 0.10 0.00
2 7.47 3.68 0.00 88.21 0.51 0.10 0.00
3 9.77 3.86 0.00 85.84 0.40 0.10 0.00