
C9800-L-F-K9 boot issue after upgrade

t.ricco
Level 1

After upgrading our two C9800-L-F-K9 devices (not in HA) to the current version 17.06.05, one device won't boot anymore: its LEDs stay off. After a power cycle the onboard LEDs light up but go off again after a few seconds, while the fans keep running at high speed.

Quick workaround: open the device, remove the lowermost 8 GB RAM module, close the device, and power it up again. The device then boots with half of its 16 GB RAM.

With "sh version" I can't see any differences between our devices. I guess it is a hardware issue.

13 Replies

Leo Laohoo
Hall of Fame

Just RMA the box.

As far as I know, Cisco gives only a poor one-year warranty on these devices, and our device is slightly older than one year. Moreover, I guess it will take a few days until a new device is in place. Since Wi-Fi is essential these days, I am thinking about whether to get a second appliance for RP or change it to a virtual controller...

You can go either route, but if you are getting an appliance, I suggest you get a support contract. This provides TAC support and/or replacement of the hardware. If you go with a virtual controller, you will most likely put it in the cloud if you can, or else keep it on-prem, but that changes your design if you are running local mode.

Things break and some devices last forever; it's the luck of the draw.

-Scott
*** Please rate helpful posts ***

 

>....second appliance for RP or change it to a virtual controller...
  That also depends on how big the wireless infrastructure will get in the end. Usually you need the high-end physical boxes if you have lots of access points and need high network throughput, and that is not possible at all when you deploy a virtual controller in the cloud, for example.

 M.



-- Each morning when I wake up and look into the mirror I always say 'Why am I so brilliant?'
    The mirror then always responds with 'The only thing that exceeds your brilliance is your beauty!'

Thanks for your thoughts. We won't go above 250 APs at this site, so I think an on-prem virtual controller will work. Our switch to FlexConnect also saves a lot of bandwidth to and from the controller.

marce1000
VIP

 

>...I guess it is an hardware issue
  It probably is. If you have an available controller, with or without that particular 8 GB RAM module, the following diagnostic commands may be useful:
    show platform hardware slot R0 dram statistics
    show logging onboard dram
    show logging onboard slot 0 dram
    show platform hardware slot R0 alarms visual
    show facility-alarm status

    show logging profile hardware-diagnostics
    show logging onboard slot 0 voltage

  Note that, depending on the platform type, some of these commands may not be available on your particular model. Also, if you have a stable controller, it is always advisable to upgrade to https://software.cisco.com/download/home/286321399/type/282046486/release/16.12(3r)
  and https://software.cisco.com/download/home/286321399/type/283425232/release/17.11.1 if applicable and not yet done.

 M.




Thank you. For me these commands show no further information, except the first one. But I can't figure out the meaning:

show platform hardware slot R0 dram statistics

def. device:
DRAM ECC Errors [* = last process affected]:
MME  MBE  SBE  SBET  SBEC  PID*  NAME*    ADDR*       Last Update
-------------------------------------------------------------------------------
0    0    5    0     1     9663  kworker  0X17683000  03/15/23 11:12:33

non def. device:
DRAM ECC Errors [* = last process affected]:
MME  MBE  SBE  SBET  SBEC  PID*  NAME*    ADDR*       Last Update
-------------------------------------------------------------------------------
0    0    0    0     0     0              00000000    03/16/23 13:05:47


show logging onboard dram

both:
> no output

show logging onboard slot 0 dram

both:
> no output


show platform hardware slot R0 alarms visual

def:
Current Visual Alarm States
Critical: On
Major : Off
Minor : Off

ndef:
Current Visual Alarm States
Critical: On
Major : Off
Minor : Off


show facility-alarm status

def:
System Totals  Critical: 4  Major: 0  Minor: 0
Source                   Time                  Severity  Description [Index]
------                   ------                --------  -------------------
TwoGigabitEthernet0/0/0  Mar 14 2023 22:17:17  CRITICAL  Physical Port Link Down [1]
TwoGigabitEthernet0/0/1  Mar 14 2023 22:17:17  CRITICAL  Physical Port Link Down [1]
TwoGigabitEthernet0/0/2  Mar 14 2023 22:17:17  CRITICAL  Physical Port Link Down [1]
TwoGigabitEthernet0/0/3  Mar 14 2023 22:17:17  CRITICAL  Physical Port Link Down [1]

ndef:
System Totals  Critical: 4  Major: 0  Minor: 0
Source                   Time                  Severity  Description [Index]
------                   ------                --------  -------------------
TwoGigabitEthernet0/0/0  Mar 14 2023 19:41:40  CRITICAL  Physical Port Link Down [1]
TwoGigabitEthernet0/0/1  Mar 14 2023 19:41:40  CRITICAL  Physical Port Link Down [1]
xcvr container 0/1/0     Mar 14 2023 19:41:39  CRITICAL  Transceiver Missing - Link Down [1]
xcvr container 0/1/1     Mar 14 2023 19:41:39  CRITICAL  Transceiver Missing - Link Down [1]
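
To check quickly whether any alarm in these listings is something other than an unused port, the saved output can be filtered with a few lines of Python. This is only a rough sketch: the function name is made up, the two "benign" descriptions are simply the ones that appear in the listings here, and the MAJOR and MINOR severity spellings are an assumption, since only CRITICAL shows up in this output.

```python
# Descriptions taken from the listings in this thread; treating them as
# harmless is an assumption that only holds for ports that are unused on purpose.
BENIGN_DESCRIPTIONS = (
    "Physical Port Link Down",
    "Transceiver Missing - Link Down",
)

# Only CRITICAL appears in the posted output; MAJOR and MINOR are assumed spellings.
SEVERITIES = ("CRITICAL", "MAJOR", "MINOR")


def unexplained_alarms(output):
    """Return alarm lines that are not explained by a link-down condition."""
    remaining = []
    for line in output.splitlines():
        # Skip the totals line, the header and the separator (no upper-case severity).
        if not any(sev in line for sev in SEVERITIES):
            continue
        # Skip alarms that only report an unplugged port or a missing transceiver.
        if any(desc in line for desc in BENIGN_DESCRIPTIONS):
            continue
        remaining.append(line.strip())
    return remaining
```

Fed with either of the two listings from this post, the function returns an empty list, meaning every critical alarm is a link-down or missing-transceiver entry.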


show logging profile hardware-diagnostics

both:
Displaying logs from the last 0 days, 0 hours, 10 minutes, 0 seconds
executing cmd on chassis 1 ...


show logging onboard slot 0 voltage

both:
> no output

 

>....But I can't figure out the meaning:
 Neither can I. SBE in that row probably means single-bit error(s), of which there seem to be 5 when you align the header with the line beneath it.
  An MBE (multi-bit error) check is a hardware check to see whether multiple bits in a memory value are incorrect, and it can often be used to verify a hardware failure.
'9663 kworker' refers to the PID and name of the process that was involved. The rest is probably 'food for TAC' (LOL), but I don't think there is a real memory problem (currently). The critical states seem to be correlated with some uplinks being down, so they are probably not important.
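
For anyone who wants to compare the counters on several controllers, the data row of that table can be read with a short script. This is a minimal Python sketch and not an official tool: the function names are made up, it assumes the first five whitespace-separated fields follow the header order (MME, MBE, SBE, SBET, SBEC) and the last two fields are the date and time, and the verdict strings only reflect how ECC memory usually behaves (single-bit errors get corrected, multi-bit errors do not).

```python
ECC_COLUMNS = ["MME", "MBE", "SBE", "SBET", "SBEC"]


def parse_ecc_row(row):
    """Parse the data row of 'show platform hardware slot R0 dram statistics'."""
    fields = row.split()
    # Assumption: the first five fields are the counters, in header order.
    counters = {name: int(value) for name, value in zip(ECC_COLUMNS, fields[:5])}
    # The last two fields are the date and time of the last update.
    counters["last_update"] = " ".join(fields[-2:])
    return counters


def assess(counters):
    """Give a rough reading of the counters; this is not a TAC-grade diagnosis."""
    if counters["MBE"] > 0:
        return "multi-bit errors recorded: likely hardware fault"
    if counters["SBE"] > 0:
        return "single-bit errors recorded: worth monitoring"
    return "no ECC errors recorded"


defective = parse_ecc_row("0 0 5 0 1 9663 kworker 0X17683000 03/15/23 11:12:33")
print(defective["SBE"], assess(defective))
```

Only the first five and the last two fields are used, so the row from the healthy device, where the process name is empty, parses the same way.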

 M.




The ROMMON image is at 16.12(3r) on both devices. But I found that C9800-L-hw-programmables.16.12.04a.SPA.pkg is still in place on the failed device. On the newer device this package is missing (no C9800-L-hw-programmables.xxxx.pkg file can be found on that device). Could this HWP-PKG cause the issue?

 

>...On the newer device this package is missing...
  - It's a bit unclear what you mean by that. The package won't be seen in an inventory as such; however, you can verify the controller firmware version(s) with:
    show platform hardware chassis active qfp datapath pmd ifdev | i FW
 As for 'Could this HWP-PKG cause the issue?': for these controller types (C9800-L-F-K9) it's always advisable to install the latest version (because it can indeed be part of the issue): https://software.cisco.com/download/home/286321399/type/283425232/release/17.11.1

 M.




The package resides as a PKG file in the bootflash folder of the failed device. That device was originally delivered with version 16.12. In fall 2021 we upgraded it to 17.03.03 without any boot failure. The boot failure mentioned above occurred after upgrading from 17.03.03 to 17.06.05.

The newer device was originally delivered with version 17.03.04c; no HWP-PKG file is present on it.

Your command shows no differences between the devices:

FW Version : 0x80000757
FW MDIO : 9.1.2 ID: 43503 vers: 1385
FW Version : 0x80000757
FW MDIO : 9.1.2 ID: 43503 vers: 1385
FW Version : 0x80000756
FW MDIO : 9.1.2 ID: 43503 vers: 1385
FW Version : 0x80000756
FW MDIO : 9.1.2 ID: 43503 vers: 1385
FW Version : 3.1.106
FW Version : 3.1.106
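
If comparing the two outputs by eye gets tedious, the FW lines can be diffed with a small script. This is a minimal Python sketch, assuming the output of the command was saved as text for each device; the function names are made up for illustration, and lines are compared by position.

```python
def fw_lines(output):
    """Return the lines starting with 'FW', with whitespace normalised."""
    return [
        " ".join(line.split())
        for line in output.splitlines()
        if line.strip().startswith("FW")
    ]


def fw_differences(output_a, output_b):
    """Compare FW lines by position and return (index, line_a, line_b) per mismatch."""
    lines_a, lines_b = fw_lines(output_a), fw_lines(output_b)
    differences = []
    for index in range(max(len(lines_a), len(lines_b))):
        # A shorter capture is padded with a placeholder so extra lines still show up.
        line_a = lines_a[index] if index < len(lines_a) else "<missing>"
        line_b = lines_b[index] if index < len(lines_b) else "<missing>"
        if line_a != line_b:
            differences.append((index, line_a, line_b))
    return differences
```

An empty list from `fw_differences` means the two captures report the same firmware lines in the same order.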

 

 - Probably all devices are up to date. The boot failure is most likely only coincidentally correlated with the upgrade, and/or an underlying hardware problem was already 'coming up'.

 M.


