
Cisco Nexus N9K-C92160YC-X kernel panics at bootup and is stuck in an endless power cycle

I believe I have a bad Cisco Nexus N9K-C92160YC-X, and I would like someone to confirm that before I submit an RMA.

I received this switch brand new from my supplier with a blank configuration. It booted successfully the first time I powered it on. I gave interface VLAN 1 an IP address and decided to update the firmware, so I uploaded nxos.9.3.3.bin to the bootflash and ran the "show install all impact" command to verify that the installation would succeed. Everything came back good in the report. However, I never actually installed it. I shut the switch down for the weekend, planning to run the install on Monday. This morning I powered it on and now it won't boot.

The switch is stuck in a constant power-cycle loop. During bootup I can see the kernel panic, and it appears to be due to a memory error (an uncorrectable multiple-bit error is reported just before the panic).

Here's the output of the boot sequence:

CISCO SWITCH Ver7.59
Device detected on 0:6:0 after 0 msecs
Device detected on 0:1:2 after 0 msecs
Device detected on 0:1:1 after 0 msecs
Device detected on 0:1:0 after 0 msecs
MCFrequency 1333Mhz
Adjusting Clock synthesizer
CLK AFT: 0: ff
CLK AFT: 1: 9e
CLK AFT: 2: 3f
CLK AFT: 3: 75
CLK AFT: 4: 3
CLK AFT: 5: 7
CLK AFT: 6: 13
CLK AFT: 7: 1
CLK AFT: 8: a
CLK AFT: 9: 46
Relocated to memory
Time: 6/1/2020 14:56:17
Pre-Reserving Memory Bar of size 8000000 for root-port B0|D1C|F0 7ffffff 4
Detected CISCO IOFPGA
MIFPGA Present
Code Signing Results: 0x0
Using Upgrade FPGA
Checking and setting PSU fan directions
Booting from Primary Bios
FPGA Revison : 0x17
FPGA ID : 0x1505787
FPGA Date : 0x20161121
Power Debug Register: 0x0
Reset Cause Register: 0x80000000
Boot Ctrl Register : 0xe0ff
FPGA Update Status : 0x20
Detected CISCO MIFPGA
FPGA Update Status : 0x20
Version 2.16.1240. Copyright (C) 2013 American Megatrends, Inc.
Board type 2
IOFPGA @ 0xc8000000
SLOT_ID @ 0xf
Standalone chassis
check_bootmode: grub: Continue grub
Trying to read config file /boot/grub/menu.lst.local from (hd0,4)
Filesystem type is ext2fs, partition type 0x83

Booting bootflash:/nxos.7.0.3.I3.1.bin ...
Booting bootflash:/nxos.7.0.3.I3.1.bin
Trying diskboot
Filesystem type is ext2fs, partition type 0x83
Image valid


Image Signature verification was Successful.

Boot Time: 6/1/2020 14:56:50
Unprotecting eUSB ...
INIT: version 2.88 booting
Unprotecting eUSB ...
Unsquashing rootfs ...

Loading IGB driver ...
Installing SSE module ... done
Creating the sse device node ... done
Loading I2C driver ...
Installing CCTRL driver for card_type 33 ...
CCTRL driver for card_index 21125 ...
Micron_M500IT_MT
Checking SSD firmware ...
Model Number: Micron_M500IT_MTFDDAT064SBD
Serial Number: MSA2210036X
Firmware Revision: MU01.00

Checking all filesystems.......
Installing default sprom values ...
done.Configuring network ...
Installing LC netdev ...
Installing veobc ...
Installing OBFL driver ...
mounting plog for N9k!
invalid group file entry
delete line 'aaa-db-operator:508:'? No
grpck: no changes
..done Mon Jun 1 14:57:08 UTC 2020
tune2fs 1.42.1 (17-Feb-2012)
Setting reserved blocks percentage to 0% (0 blocks)
Starting portmap daemon...
creating NFS state directory: done
starting 8 nfsd kernel threads: done
starting mountd: done
starting statd: done
Saving image for img-sync ...
Loading system software
Installing local RPMS
Patch Repository Setup completed successfully
dealing with default shell..
file /proc/cmdline found, look for shell
unset shelltype, nothing to do..
user add file found..edit it
Uncompressing system image: Mon Jun 1 14:57:14 UTC 2020
blogger: nothing to do.

..done Mon Jun 1 14:57:14 UTC 2020
Creating /dev/mcelog
Starting mcelog daemon
Overwriting dme stub lib
Replaced dme stub lib
INIT: Entering runlevel: 3
Running S93thirdparty-script...

Populating conf files for hybrid sysmgr ...
Starting hybrid sysmgr ...
[ 33.846766] [1591023444] NMI: PCI system error (SERR) for reason b1 on CPU 0.
[ 33.931884] [1591023444] Memory ERR Staus 0x3
[ 33.983795] [1591023444] pci 0000:00:00.0: Memory ERR Staus 0x3
[ 34.054412] [1591023445] Channel 0 ECC Regs 0x28fa0003 0x200fff1
[ 34.127105] [1591023445] Channel 1 ECC Regs 0x0 0x0
[ 34.186301] [1591023445] ***Channel 0: Un-Correctable mutiple-bit error ***
[ 34.269391] [1591023445] Kernel panic - not syncing: ***Channel 2: Un-Correctable mutiple-bit error ***
[ 34.269393] [1591023445]
[ 34.412682] [1591023445] Pid: 0, comm: swapper/0 Tainted: P O 3.4.43-WR5.0.1.13_standard #1
[ 34.522767] [1591023445] Call Trace:
[ 34.565338] [1591023445] <NMI> [<ffffffff816b13d9>] panic+0xfb/0x23d
[ 34.643228] [1591023445] [<ffffffff8101faad>] host_bridge_memory_errors_reporting+0x47d/0x510
[ 34.746041] [1591023445] [<ffffffff816bb9f0>] ? do_nmi+0x190/0x4e0
[ 34.820812] [1591023445] [<ffffffff816b1585>] ? printk+0x6a/0x83
[ 34.893502] [1591023445] [<ffffffff816bb9f9>] do_nmi+0x199/0x4e0
[ 34.966199] [1591023445] [<ffffffff816bad6c>] end_repeat_nmi+0x1a/0x1e
[ 35.045125] [1591023446] [<ffffffff81305376>] ? intel_idle+0xb6/0xf0
[ 35.121969] [1591023446] [<ffffffff81305376>] ? intel_idle+0xb6/0xf0
[ 35.198816] [1591023446] [<ffffffff81305376>] ? intel_idle+0xb6/0xf0
[ 35.275658] [1591023446] <<EOE>> [<ffffffff81522aaf>] cpuidle_enter_state+0x4f/0xe0
[ 35.369125] [1591023446] [<ffffffff81522c69>] cpuidle_idle_call+0x129/0x220
[ 35.453246] [1591023446] [<ffffffff8100b35f>] cpu_idle+0x7f/0xb0
[ 35.525939] [1591023446] [<ffffffff8168d4c9>] rest_init+0x6d/0x74
[ 35.599671] [1591023446] [<ffffffff81cfac3e>] start_kernel+0x466/0x473
[ 35.678591] [1591023446] [<ffffffff81cfa54f>] ? repair_env_string+0x5a/0x5a
[ 35.762707] [1591023446] [<ffffffff81cfa32a>] x86_64_start_reservations+0x131/0x135
[ 35.855133] [1591023446] [<ffffffff81cfa140>] ? early_idt_handlers+0x140/0x140
[ 35.942368] [1591023446] [<ffffffff81cfa430>] x86_64_start_kernel+0x102/0x111
[ 36.028560] [1591023447] Dumping interrupt statistics
[ 36.088785] [1591023447] CPU0 CPU1 CPU2 CPU3 intrs/last_sec max_intrs/sec
[ 36.210285] [1591023447] 0: 57 0 0 0 57 57 IO-APIC-edge timer
[ 36.341130] [1591023447] 4: 134 0 0 0 10 28 IO-APIC-edge serial
[ 36.473017] [1591023447] 8: 1 0 0 0 0 1 IO-APIC-edge rtc0
[ 36.602823] [1591023447] 9: 0 0 0 0 0 0 IO-APIC-fasteoi acpi
[ 36.732637] [1591023447] 23: 26 0 0 0 25 25 IO-APIC-fasteoi ehci_hcd:usb1
[ 36.871790] [1591023447] 40: 0 0 0 0 0 0 PCI-MSI-edge PCIe PME
[ 37.005753] [1591023447] 41: 0 0 0 0 0 0 PCI-MSI-edge PCIe PME
[ 37.139716] [1591023448] 42: 0 0 0 0 0 0 PCI-MSI-edge PCIe PME
[ 37.273679] [1591023448] 43: 0 0 0 0 0 0 PCI-MSI-edge PCIe PME
[ 37.407644] [1591023448] 44: 0 0 0 0 0 0 PCI-MSI-edge PCIe PME
[ 37.541607] [1591023448] 45: 0 0 0 0 0 0 PCI-MSI-edge PCIe PME
[ 37.675569] [1591023448] 46: 0 0 0 0 0 0 PCI-MSI-edge PCIe PME
[ 37.809533] [1591023448] 47: 0 0 0 0 0 0 PCI-MSI-edge PCIe PME
[ 37.943498] [1591023448] 48: 3488 0 0 0 0 1445 PCI-MSI-edge ahci
[ 38.073310] [1591023449] 58: 0 0 0 0 0 0 PCI-MSI-edge cctrl_tor3_plat_io_isr
[ 38.221809] [1591023449] 59: 0 0 0 0 0 0 PCI-MSI-edge cctrl_tor3_portlib_mi_isr
[ 38.373432] [1591023449] sending NMI to all CPUs:
[ 38.429513] [1591023449] NMI backtrace for cpu 2
[ 38.484550] [1591023449] CPU 2
[ 38.519856] [1591023449] Modules linked in: klm_procfs_init(PO) klm_i2c_stub(O) ata_piix klm_isan_kthread(PO) klm_cmos(PO) klm_ins_igb(O) klm_psdev(O) klm_pfmsvcs(PO) klm_sse(O) klm_tlv(PO) klm_mping(PO) klm_kpss(PO) klm_modlock(O) klm_sdwrap(O) klm_cctrli(PO) klm_if_index(PO) klm_vdc_mgr(O) klm_dc_sprom(O) klm_nvram(O) lc_netdev.mod(O) klm_vdc(O) klm_veobc(O) klm_obfl(O) klm_rwsem(PO) klm_pss(O) klm_aipc(PO) klm_kadb(O) klm_mts(PO) klm_mtsfilter(PO) klm_cctrli_bg(PO) klm_sup_ctrl_mc(PO) klm_rdn_dummy(PO) klm_usd(O) klm_misc(O) klm_gpl(PO) klm_lc_diag_stat(O) klm_ls_notify(PO) klm_fcfwd(PO) klm_fcoe(PO) klm_fc2(PO) klm_cisco_nb(O) klm_kfsmutils(PO) klm_sysmgr-hb(O) klm_sysmgr-hb_lc(O) klm_utaker(O) klm_kgdb(PO)
[ 39.274823] [1591023449]
[ 39.305984] [1591023449] Pid: 10117, comm: sysmgr Tainted: P O 3.4.43-WR5.0.1.13_standard #1 To be filled by O.E.M. To be filled by O.E.M./Aptio CRB
[ 39.475250] [1591023449] RIP: 0010:[<ffffffff810392de>] [<ffffffff810392de>] native_flush_tlb_others+0xce/0x110
[ 39.596757] [1591023449] RSP: 0000:ffff880451c3dbe8 EFLAGS: 00000202
[ 39.673607] [1591023449] RAX: 00000000000008d1 RBX: ffff88044094b740 RCX: 0000000000000001
[ 39.772266] [1591023449] RDX: 0000000000000000 RSI: 0000000000000001 RDI: 0000000000000286
[ 39.870920] [1591023449] RBP: ffff880451c3dc18 R08: ffff88044094b740 R09: 0000000000000000
[ 39.969575] [1591023449] R10: dfed912167b8c580 R11: 0000000000000000 R12: 0000000000000080
[ 40.068231] [1591023449] R13: ffffffff81dcd180 R14: 0000000000000002 R15: 000000000a041a7c
[ 40.166886] [1591023449] FS: 0000000000000000(0000) GS:ffff88047fd00000(0063) knlGS:00000000eca9d940
[ 40.276964] [1591023449] CS: 0010 DS: 002b ES: 002b CR0: 0000000080050033
[ 40.359002] [1591023449] CR2: 000000000a041a7c CR3: 00000004515ae000 CR4: 00000000001407e0
[ 40.457659] [1591023449] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 40.556312] [1591023449] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 40.654968] [1591023449] Process sysmgr (pid: 10117, threadinfo ffff880451c3c000, task ffff88041ab0c3b0)
[ 40.768161] [1591023449] Stack:
[ 40.805542] [1591023449] ffff880451c3dbf8 ffffffff810098e9 ffff88044094b480 ffff88044094b740
[ 40.907309] [1591023449] 000000000a041a7c ffff880453eae3f0 ffff880451c3dc48 ffffffff810394ad
[ 41.009079] [1591023449] ffff88044094b480 ffff880453eae3f0 000000000a041a7c 0000000000000001
[ 41.110849] [1591023449] Call Trace:
[ 41.153438] [1591023449] [<ffffffff810098e9>] ? sched_clock+0x9/0x10
[ 41.230286] [1591023449] [<ffffffff810394ad>] flush_tlb_page+0x8d/0xa0
[ 41.309209] [1591023449] [<ffffffff8103805e>] ptep_set_access_flags+0x4e/0x70
[ 41.395404] [1591023449] [<ffffffff8111f54b>] do_wp_page+0x2fb/0x820
[ 41.472250] [1591023449] [<ffffffff811217b6>] handle_pte_fault+0xa86/0xb10
[ 41.555328] [1591023449] [<ffffffff81121bba>] handle_mm_fault+0x1da/0x200
[ 41.637369] [1591023449] [<ffffffff810cde39>] ? trace_clock_local+0x9/0x10
[ 41.720448] [1591023449] [<ffffffff810d44ef>] ? rb_reserve_next_event.isra.33+0x9f/0x300
[ 41.818067] [1591023449] [<ffffffff816be3c3>] do_page_fault+0x353/0x5d0
[ 41.898026] [1591023449] [<ffffffff810639ab>] ? queue_delayed_work+0x2b/0x30
[ 41.983181] [1591023449] [<ffffffff810639cb>] ? schedule_delayed_work+0x1b/0x20
[ 42.071453] [1591023449] [<ffffffff810d8a16>] ? trace_wake_up+0x26/0x30
[ 42.151416] [1591023449] [<ffffffff810da588>] ? trace_current_buffer_unlock_commit+0x48/0x60
[ 42.253185] [1591023449] [<ffffffff810e8884>] ? ftrace_syscall_exit+0xb4/0xd0
[ 42.339379] [1591023449] [<ffffffff8100f172>] ? syscall_trace_leave+0xb2/0x170
[ 42.426611] [1591023449] [<ffffffff811481b9>] ? sys_read+0x59/0x100
[ 42.502419] [1591023449] [<ffffffff816ba9e5>] page_fault+0x25/0x30
[ 42.577186] [1591023449] Code: c1 3b ca 00 41 8d b6 cf 00 00 00 49 8d 7d 18 ff 90 d8 00 00 00 41 8b b4 24 18 d1 dc 81 85 f6 74 0e 0f 1f 40 00 f3 90 41 8b 4d 18 <85> c9 75 f6 83 3d 7b 4e ca 00 20 49 c7 84 24 00 d1 dc 81 00 00
[ 42.816027] [1591023449] Call Trace:
[ 42.858614] [1591023449] [<ffffffff810098e9>] ? sched_clock+0x9/0x10
[ 42.935463] [1591023449] [<ffffffff810394ad>] flush_tlb_page+0x8d/0xa0
[ 43.014389] [1591023449] [<ffffffff8103805e>] ptep_set_access_flags+0x4e/0x70
[ 43.100584] [1591023449] [<ffffffff8111f54b>] do_wp_page+0x2fb/0x820
[ 43.177427] [1591023449] [<ffffffff811217b6>] handle_pte_fault+0xa86/0xb10
[ 43.260507] [1591023449] [<ffffffff81121bba>] handle_mm_fault+0x1da/0x200
[ 43.342548] [1591023449] [<ffffffff810cde39>] ? trace_clock_local+0x9/0x10
[ 43.422133] [1591023449] END: PANIC REPORT GENERATED AT 1591023449
[ 43.422139] [1591023449] CCTRL PANIC DUMP
[ 43.422140] [1591023449] =========================
[ 43.422142] [1591023449] WDT last punched at 0
[ 43.422145] [1591023449] REG(0x300) = baadbeef
[ 43.422148] [1591023449] REG(0x304) = baadbeef
[ 43.422149] [1591023449] =========================
[ 43.422152] [1591023449] pstore: Dump l1 0 l2 96741 ToDump 65512 Dumped 0
[ 43.906455] [1591023449] [<ffffffff810d44ef>] ? rb_reserve_next_event.isra.33+0x9f/0x300
[ 44.004069] [1591023449] [<ffffffff816be3c3>] do_page_fault+0x353/0x5d0
[ 44.084027] [1591023449] [<ffffffff810639ab>] ? queue_delayed_work+0x2b/0x30
[ 44.169185] [1591023449] [<ffffffff810639cb>] ? schedule_delayed_work+0x1b/0x20
[ 44.257462] [1591023449] [<ffffffff810d8a16>] ? trace_wake_up+0x26/0x30
[ 44.337424] [1591023449] [<ffffffff810da588>] ? trace_current_buffer_unlock_commit+0x48/0x60
[ 44.439196] [1591023449] [<ffffffff810e8884>] ? ftrace_syscall_exit+0xb4/0xd0
[ 44.525385] [1591023449] [<ffffffff8100f172>] ? syscall_trace_leave+0xb2/0x170
[ 44.612617] [1591023449] [<ffffffff811481b9>] ? sys_read+0x59/0x100
[ 44.688426] [1591023449] [<ffffffff816ba9e5>] page_fault+0x25/0x30
[ 44.763197] [1591023449] NMI backtrace for cpu 3
[ 44.818233] [1591023449] CPU 3
[ 44.853536] [1591023449] Modules linked in: klm_procfs_init(PO) klm_i2c_stub(O) ata_piix klm_isan_kthread(PO) klm_cmos(PO) klm_ins_igb(O) klm_psdev(O)
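
For anyone triaging a similar console capture, the lines that matter are the NMI and ECC messages just before the panic, not the long backtraces that follow. Below is a minimal Python sketch, assuming the console output has been saved as text. The function name `find_memory_errors` and the regular expressions are illustrative (not a Cisco tool) and only match the message formats seen in this particular log.

```python
import re
from datetime import datetime, timezone

# Patterns modeled on the messages in the log above (assumption: other
# NX-OS releases may word these lines differently).
LINE_RE = re.compile(
    r"\[\s*(?P<uptime>\d+\.\d+)\]\s+\[(?P<epoch>\d+)\]\s+(?P<msg>.*)"
)
ERROR_RE = re.compile(
    r"Un-Correctable|Kernel panic|PCI system error|ECC Regs", re.IGNORECASE
)


def find_memory_errors(log_text):
    """Return (utc_time, uptime_seconds, message) for each memory-error line."""
    hits = []
    for line in log_text.splitlines():
        m = LINE_RE.search(line)
        if m and ERROR_RE.search(m.group("msg")):
            # The second bracketed field is a Unix epoch timestamp.
            when = datetime.fromtimestamp(int(m.group("epoch")), tz=timezone.utc)
            hits.append((when, float(m.group("uptime")), m.group("msg").strip()))
    return hits


# Three lines copied from the log above; only the first two should match.
sample = """\
[ 33.846766] [1591023444] NMI: PCI system error (SERR) for reason b1 on CPU 0.
[ 34.186301] [1591023445] ***Channel 0: Un-Correctable mutiple-bit error ***
[ 34.565338] [1591023445] <NMI> [<ffffffff816b13d9>] panic+0xfb/0x23d
"""

for when, uptime, msg in find_memory_errors(sample):
    print(when.isoformat(), uptime, msg)
```

The epoch value 1591023444 converts to 2020-06-01 14:57:24 UTC, which is consistent with the "Boot Time: 6/1/2020 14:56:50" line plus roughly 34 seconds of kernel uptime. In other words, the memory error is raised right as sysmgr starts, before any of the later backtraces.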


8 Replies

Reza Sharifi
Hall of Fame

It appears that something had gone wrong with the software upgrade. Since this is a new switch, I would open a ticket with TAC and RMA it to Cisco for a replacement.

HTH

I forgot to mention that I also applied the following commands:

 

diagnostic monitor interval module 1 test PrimaryBootROM hour 23 min 59 sec 59

diagnostic monitor interval module 1 test SecondaryBootROM hour 23 min 59 sec 59

I applied these to avoid the potential issues outlined in Cisco bug CSCvk30831. Details regarding this bug can be found here: https://www.cisco.com/c/en/us/support/docs/field-notices/703/fn70320.html

We've run into this bug with several other 9Ks, so I wanted to avoid that problem before I upgraded.

This may or may not have anything to do with the problem I'm having today, but I wanted to at least include that information with this post.

It could be related, but I don't see any BIOS-related issues in the output you posted originally.

Did you issue "show install all impact" to see what needs to be upgraded before starting the upgrade?

HTH

I don't even remember actually performing the upgrade. I know I ran the "show install all impact" command and everything was good. I can't remember for certain whether I actually performed the upgrade after issuing that command. Is there something in the output that I'm not seeing that indicates that I did?

In fact, the output shows that it's attempting to boot NX-OS 7.0.3, which is the version the switch shipped with. I was planning to upgrade to 9.3.3.

Booting bootflash:/nxos.7.0.3.I3.1.bin ...
Booting bootflash:/nxos.7.0.3.I3.1.bin
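
One way to double-check which image the loader picked is to pull the `Booting bootflash:/...` line out of the captured console output. Here is a small Python sketch. `booted_image` is a hypothetical helper name, and the pattern assumes the loader prints the line in the form shown above.

```python
import re

# Assumption: the loader announces the image as "Booting bootflash:/<file>".
BOOT_RE = re.compile(r"Booting bootflash:/(?P<image>\S+)")


def booted_image(log_text):
    """Return the first image filename the loader reports, or None."""
    m = BOOT_RE.search(log_text)
    return m.group("image") if m else None


# The two loader lines quoted above.
sample = """\
Booting bootflash:/nxos.7.0.3.I3.1.bin ...
Booting bootflash:/nxos.7.0.3.I3.1.bin
"""

image = booted_image(sample)
print(image)
print(
    "upgrade image booted"
    if image == "nxos.9.3.3.bin"
    else "still on the original image"
)
```

If the reported filename is not the 9.3.3 image, the upgrade was never installed and the switch is still booting the image it shipped with.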

There are a lot of trace calls, which looks like a bug involving memory.

[ 42.071453] [1591023449] [<ffffffff810d8a16>] ? trace_wake_up+0x26/0x30
[ 42.151416] [1591023449] [<ffffffff810da588>] ? trace_current_buffer_unlock_commit+0x48/0x60
[ 42.253185] [1591023449] [<ffffffff810e8884>] ? ftrace_syscall_exit+0xb4/0xd0
[ 42.339379] [1591023449] [<ffffffff8100f172>] ? syscall_trace_leave+0xb2/0x170

 

I would open a ticket with TAC and see if this is a known bug.

 

HTH

So I ended up calling Cisco support, and they told me that the warranty has expired and that the switch isn't covered under any service contract. I know I had mentioned that I got this from my supplier, but in truth it came from one of our Army customers. They must have purchased this switch a long time ago. They sent it to me on loan specifically to test our implementation of 802.1X/RADIUS. I have to give it back to them in 4 weeks, and apparently there isn't any service contract on it.

Does this mean I just bricked my customer's switch and there's nothing I can do about it short of paying for extended support?

Turns out the switch was covered under the Army service contract. I was able to get the switch replaced within a day.  The replacement switch is working great.
