03-15-2023 01:56 AM
After upgrading our two C9800-L-F-K9 devices (not in HA) to the current version 17.06.05, one device no longer boots: its LEDs stay off. After a power cycle the onboard LEDs light up but go off again after a few seconds, while the fans keep running at high speed.
Quick workaround: open the device, remove the lowermost 8 GB RAM module, close the device, and power it up again. The device then boots with half of its 16 GB of RAM.
With "sh version" I can't see any differences between our devices. I guess it is a hardware issue.
03-15-2023 01:59 AM
Just RMA the box.
03-16-2023 06:00 AM
As far as I know, Cisco gives only a meager one-year warranty on these devices, and ours is slightly older than one year. Moreover, I guess it will take a few days until a replacement device is in place. Since Wi-Fi is essential these days, I am thinking more about getting a second appliance for RP or changing it to a virtual controller...
03-16-2023 07:34 AM
You can go either route, but if you are getting an appliance, I suggest you get support. This provides TAC support and/or replacement of the hardware. If you go with a virtual controller, you will most likely put it in the cloud if you can, or else keep it on-prem, but that changes your design if you are running local mode.
Things break and some devices last forever; it's the luck of the draw.
03-16-2023 09:01 AM
>....second appliance for RP or change it to a virtual controller...
That also depends on how big the wireless infrastructure will eventually get. Usually you need the high-end physical boxes if you have lots of access points and need high network throughput, and that may not be possible at all when you deploy a virtual controller, for example in the cloud.
M.
03-17-2023 02:41 AM
Thanks for your thoughts. We won't exceed 250 APs at this site, so I think a virtual controller on-prem will work. Our switch to flex also saves a lot of bandwidth to and from the controller.
03-17-2023 02:43 AM
Thanks for your thoughts. We won't exceed 250 APs at this site, so I think a virtual controller on-prem will work. Our switch to FlexConnect also saves a lot of bandwidth to and from the controller.
03-15-2023 03:16 AM - edited 03-15-2023 03:17 AM
>...I guess it is an hardware issue
It probably is. If you have an available controller, with or without the particular 8 GB RAM module, the following diagnostic commands may be useful:
show platform hardware slot R0 dram statistics
show logging onboard dram
show logging onboard slot 0 dram
show platform hardware slot R0 alarms visual
show facility-alarm status
show logging profile hardware-diagnostics
show logging onboard slot 0 voltage
Note that, depending on the platform type, some of these commands may not be available for your particular model. If you have a stable controller, it is also always advisable to upgrade to https://software.cisco.com/download/home/286321399/type/282046486/release/16.12(3r)
and https://software.cisco.com/download/home/286321399/type/283425232/release/17.11.1 if applicable and not yet done.
M.
03-16-2023 05:49 AM
Thank you. For me these commands show no further information, except the first one. But I can't figure out the meaning:
show platform hardware slot R0 dram statistics
def. device:
DRAM ECC Errors [* = last process affected]:
MME  MBE  SBE  SBET  SBEC  PID*  NAME*    ADDR*       Last Update
-------------------------------------------------------------------------------
0    0    5    0     1     9663  kworker  0X17683000  03/15/23 11:12:33
non def. device:
DRAM ECC Errors [* = last process affected]:
MME  MBE  SBE  SBET  SBEC  PID*  NAME*    ADDR*       Last Update
-------------------------------------------------------------------------------
0    0    0    0     0     0              00000000    03/16/23 13:05:47
show logging onboard dram
both:
> no output
show logging onboard slot 0 dram
both:
> no output
show platform hardware slot R0 alarms visual
def:
Current Visual Alarm States
Critical: On
Major : Off
Minor : Off
ndef:
Current Visual Alarm States
Critical: On
Major : Off
Minor : Off
show facility-alarm status
def:
System Totals Critical: 4 Major: 0 Minor: 0
Source Time Severity Description [Index]
------ ------ -------- -------------------
TwoGigabitEthernet0/0/0 Mar 14 2023 22:17:17 CRITICAL Physical Port Link Down [1]
TwoGigabitEthernet0/0/1 Mar 14 2023 22:17:17 CRITICAL Physical Port Link Down [1]
TwoGigabitEthernet0/0/2 Mar 14 2023 22:17:17 CRITICAL Physical Port Link Down [1]
TwoGigabitEthernet0/0/3 Mar 14 2023 22:17:17 CRITICAL Physical Port Link Down [1]
ndef:
System Totals Critical: 4 Major: 0 Minor: 0
Source Time Severity Description [Index]
------ ------ -------- -------------------
TwoGigabitEthernet0/0/0 Mar 14 2023 19:41:40 CRITICAL Physical Port Link Down [1]
TwoGigabitEthernet0/0/1 Mar 14 2023 19:41:40 CRITICAL Physical Port Link Down [1]
xcvr container 0/1/0 Mar 14 2023 19:41:39 CRITICAL Transceiver Missing - Link Down [1]
xcvr container 0/1/1 Mar 14 2023 19:41:39 CRITICAL Transceiver Missing - Link Down [1]
show logging profile hardware-diagnostics
both:
Displaying logs from the last 0 days, 0 hours, 10 minutes, 0 seconds
executing cmd on chassis 1 ...
show logging onboard slot 0 voltage
both:
> no output
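To check whether any of the critical alarms listed above are something other than a link being down, the "show facility-alarm status" output can be filtered with a short script. This is only a minimal Python sketch: the row layout is assumed from the output pasted above, and the names parse_alarms and non_link_alarms are made up for this example, not part of any Cisco tool.

```python
import re

# Assumed row layout (taken from the pasted output):
#   <source> <Mon DD YYYY HH:MM:SS> <SEVERITY> <description> [<index>]
ALARM_ROW = re.compile(
    r"^(?P<source>.+?)\s+"
    r"(?P<time>[A-Z][a-z]{2} +\d{1,2} \d{4} \d{2}:\d{2}:\d{2})\s+"
    r"(?P<severity>[A-Z]+)\s+"
    r"(?P<description>.+?)\s+\[(?P<index>\d+)\]$"
)


def parse_alarms(text):
    """Return one dict per alarm row; header and separator lines are skipped."""
    alarms = []
    for line in text.splitlines():
        match = ALARM_ROW.match(line.strip())
        if match:
            alarms.append(match.groupdict())
    return alarms


def non_link_alarms(alarms):
    """Keep only alarms whose description does not mention a link being down."""
    return [a for a in alarms if "Link Down" not in a["description"]]


# Output of the non-defective device, copied from the post above.
SAMPLE = """\
System Totals Critical: 4 Major: 0 Minor: 0
Source Time Severity Description [Index]
------ ------ -------- -------------------
TwoGigabitEthernet0/0/0 Mar 14 2023 19:41:40 CRITICAL Physical Port Link Down [1]
TwoGigabitEthernet0/0/1 Mar 14 2023 19:41:40 CRITICAL Physical Port Link Down [1]
xcvr container 0/1/0 Mar 14 2023 19:41:39 CRITICAL Transceiver Missing - Link Down [1]
xcvr container 0/1/1 Mar 14 2023 19:41:39 CRITICAL Transceiver Missing - Link Down [1]
"""

alarms = parse_alarms(SAMPLE)
remaining = non_link_alarms(alarms)
print(len(alarms), "alarms parsed,", len(remaining), "not related to a link being down")
```

For the output above, all four critical alarms are link or transceiver related, so nothing is left after filtering.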
03-16-2023 06:19 AM
>....But I can't figure out the meaning:
Neither can I. The SBE in that row probably means single-bit error(s), and the count appears to be 5 when you align the header with the row beneath it.
An MBE is a multi-bit error, meaning that several bits in a memory word are incorrect; a nonzero MBE count can often be used to verify a hardware failure.
The '9663 kworker' refers to the PID and name of the process that was involved. The rest is probably food for TAC (LOL), but I don't think there is a real memory problem (currently). The critical states seem to be correlated with some uplinks being down, so they are probably not important.
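As a rough illustration of reading those counters, here is a minimal Python sketch that maps the two data rows posted above onto the header columns. It assumes the first five fields of a row are integer counters in header order (MME, MBE, SBE, SBET, SBEC); the function names are made up for this example and do not come from any Cisco tooling.

```python
# Counter columns, in the order they appear in the header row of
# "show platform hardware slot R0 dram statistics" as posted above.
ECC_COUNTERS = ["MME", "MBE", "SBE", "SBET", "SBEC"]


def parse_ecc_row(row):
    """Map the first five whitespace-separated fields of a data row to the counters."""
    fields = row.split()
    return {name: int(value) for name, value in zip(ECC_COUNTERS, fields[:5])}


def has_ecc_errors(counters):
    """True if any of the five counters is nonzero."""
    return any(value > 0 for value in counters.values())


# Data rows copied from the post above (defective and non-defective device).
defective = parse_ecc_row("0 0 5 0 1 9663 kworker 0X17683000 03/15/23 11:12:33")
healthy = parse_ecc_row("0 0 0 0 0 0 00000000 03/16/23 13:05:47")

print("defective:", defective, has_ecc_errors(defective))
print("healthy:  ", healthy, has_ecc_errors(healthy))
```

Under that assumption the defective device shows SBE = 5 and SBEC = 1 with MBE = 0, while the other device shows all zeros.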
M.
03-17-2023 02:59 AM
The ROMMON image is at 16.12(3r) on both devices. But I found that C9800-L-hw-programmables.16.12.04a.SPA.pkg is still in place on the failed device. On the newer device this package is missing (no C9800-L-hw-programmables.xxxx.pkg file can be found on that device). Could this HWP-PKG cause the issue?
03-17-2023 03:27 AM - edited 03-17-2023 03:29 AM
>...On the newer device this package is missing...
- It's a bit unclear what you mean by that. The package won't show up in an inventory as such; however, you can verify the controller firmware version(s) with:
show platform hardware chassis active qfp datapath pmd ifdev | i FW
As far as 'Could this HWP-PKG cause the issue?' is concerned: for these controller types (C9800-L-F-K9) it's always advisable to install the latest version, because it can indeed be part of the issue: https://software.cisco.com/download/home/286321399/type/283425232/release/17.11.1
M.
03-17-2023 05:13 AM
The package resides as a PKG file in the bootflash folder of the failed device. That device was originally delivered with V 16.12. In fall 2021 we upgraded it to 17.03.03 without any boot failure. The boot failure mentioned above occurred after upgrading from 17.03.03 to 17.06.05.
The newer device was originally delivered with V 17.03.04c; no HWP-PKG file is present on it.
Your command shows no differences between the devices:
FW Version : 0x80000757
FW MDIO : 9.1.2 ID: 43503 vers: 1385
FW Version : 0x80000757
FW MDIO : 9.1.2 ID: 43503 vers: 1385
FW Version : 0x80000756
FW MDIO : 9.1.2 ID: 43503 vers: 1385
FW Version : 0x80000756
FW MDIO : 9.1.2 ID: 43503 vers: 1385
FW Version : 3.1.106
FW Version : 3.1.106
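Comparing such outputs by eye is error-prone, so here is a minimal Python sketch of how the FW lines of two devices could be compared position by position. The helper names fw_lines and fw_differences are made up for this example, and since the output above is the same on both devices, the sample simply compares it with itself.

```python
from itertools import zip_longest


def fw_lines(output):
    """Keep only lines mentioning FW, with whitespace normalized."""
    return [" ".join(line.split()) for line in output.splitlines() if "FW" in line]


def fw_differences(output_a, output_b):
    """Return (position, line_a, line_b) for every position where the outputs differ."""
    pairs = zip_longest(fw_lines(output_a), fw_lines(output_b))
    return [(i, a, b) for i, (a, b) in enumerate(pairs) if a != b]


# Output of "show platform hardware chassis active qfp datapath pmd ifdev | i FW"
# as posted above.
POSTED = """\
FW Version : 0x80000757
FW MDIO : 9.1.2 ID: 43503 vers: 1385
FW Version : 0x80000757
FW MDIO : 9.1.2 ID: 43503 vers: 1385
FW Version : 0x80000756
FW MDIO : 9.1.2 ID: 43503 vers: 1385
FW Version : 0x80000756
FW MDIO : 9.1.2 ID: 43503 vers: 1385
FW Version : 3.1.106
FW Version : 3.1.106
"""

differences = fw_differences(POSTED, POSTED)
print(differences if differences else "no firmware differences")
```

In practice the second argument would be the output captured from the other controller.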
03-17-2023 11:31 PM
- Probably all devices are up to date. The boot failure is most likely only coincidentally correlated with the upgrade, and/or a native hardware problem was already 'coming up'.
M.