C9332C - blk_update_request: I/O error, dev sda, sector 0 - kernel

justclash4 · ‎12-17-2022

We have two N9K-C9332C switches running NX-OS 9.3.10.

One of them (Spine-2) has this issue:

%KERN-3-SYSTEM_MSG: [203074.763600] ata2.00: exception Emask 0x0 SAct 0x40 SErr 0x0 action 0x6 frozen - kernel
%KERN-3-SYSTEM_MSG: [203074.763606] ata2.00: failed command: WRITE FPDMA QUEUED - kernel
%KERN-3-SYSTEM_MSG: [203074.763614] ata2.00: cmd 61/18:30:20:ac:67/00:00:01:00:00/40 tag 6 ncq 12288 out - kernel
%KERN-3-SYSTEM_MSG: [203074.763614] res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout) - kernel
%KERN-3-SYSTEM_MSG: [203074.763625] ata2.00: status: { DRDY } - kernel
%KERN-3-SYSTEM_MSG: [203084.792210] ata2: COMRESET failed (errno=-16) - kernel
%KERN-3-SYSTEM_MSG: [203084.792221] ata2: reset failed, giving up - kernel
%KERN-3-SYSTEM_MSG: [203084.792316] blk_update_request: I/O error, dev sda, sector 23571488 - kernel
%KERN-3-SYSTEM_MSG: [203084.792355] Aborting journal on device sda3-8. - kernel
%KERN-3-SYSTEM_MSG: [203084.792364] blk_update_request: I/O error, dev sda, sector 93050960 - kernel
%KERN-3-SYSTEM_MSG: [203084.792378] Aborting journal on device sda7-8. - kernel
%KERN-3-SYSTEM_MSG: [203084.792404] blk_update_request: I/O error, dev sda, sector 23554048 - kernel
%KERN-3-SYSTEM_MSG: [203084.792407] Buffer I/O error on dev sda3, logical block 131072, lost sync page write - kernel
%KERN-3-SYSTEM_MSG: [203084.792416] JBD2: Error -5 detected when updating journal superblock for sda3-8. - kernel
%KERN-3-SYSTEM_MSG: [203084.792420] blk_update_request: I/O error, dev sda, sector 93024256 - kernel
%KERN-3-SYSTEM_MSG: [203084.792423] Buffer I/O error on dev sda7, logical block 4751360, lost sync page write - kernel
%KERN-3-SYSTEM_MSG: [203084.792433] JBD2: Error -5 detected when updating journal superblock for sda7-8. - kernel
%KERN-3-SYSTEM_MSG: [203085.019909] blk_update_request: I/O error, dev sda, sector 22505472 - kernel
%KERN-3-SYSTEM_MSG: [203085.019912] Buffer I/O error on dev sda3, logical block 0, lost sync page write - kernel
%KERN-2-SYSTEM_MSG: [203085.019923] EXT4-fs error (device sda3): ext4_journal_check_start:56: Detected aborted journal - kernel
%KERN-2-SYSTEM_MSG: [203085.019929] EXT4-fs (sda3): Remounting filesystem read-only - kernel
%KERN-3-SYSTEM_MSG: [203085.019932] EXT4-fs (sda3): previous I/O error to superblock detected - kernel
%KERN-3-SYSTEM_MSG: [203085.019965] blk_update_request: I/O error, dev sda, sector 22505472 - kernel
%KERN-3-SYSTEM_MSG: [203085.019968] Buffer I/O error on dev sda3, logical block 0, lost sync page write - kernel
%KERN-3-SYSTEM_MSG: [203085.051414] blk_update_request: I/O error, dev sda, sector 55013376 - kernel
%KERN-3-SYSTEM_MSG: [203085.051417] Buffer I/O error on dev sda7, logical block 0, lost sync page write - kernel
%KERN-2-SYSTEM_MSG: [203085.051429] EXT4-fs error (device sda7): ext4_journal_check_start:56: Detected aborted journal - kernel
%KERN-2-SYSTEM_MSG: [203085.051434] EXT4-fs (sda7): Remounting filesystem read-only - kernel
%KERN-3-SYSTEM_MSG: [203085.051438] EXT4-fs (sda7): previous I/O error to superblock detected - kernel
%KERN-3-SYSTEM_MSG: [203085.051471] blk_update_request: I/O error, dev sda, sector 55013376 - kernel
%KERN-3-SYSTEM_MSG: [203085.051474] Buffer I/O error on dev sda7, logical block 0, lost sync page write - kernel
%KERN-3-SYSTEM_MSG: [2288057.729289] blk_update_request: I/O error, dev loop10, sector 0 - kernel
%KERN-3-SYSTEM_MSG: [2288097.327236] blk_update_request: I/O error, dev sda, sector 0 - kernel
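
For anyone triaging a similar log, here is a minimal Python sketch that counts the block-layer I/O errors per device and lists the partitions that ext4 remounted read-only. The function name `summarize_disk_errors` is hypothetical, and the two regular expressions only assume the message formats visible in the log above; other kernel versions may word these lines differently.

```python
import re
from collections import Counter

# Block-layer error lines, e.g.
# "blk_update_request: I/O error, dev sda, sector 23571488"
IO_ERROR = re.compile(r"blk_update_request: I/O error, dev (\w+), sector (\d+)")

# ext4 read-only remount lines, e.g.
# "EXT4-fs (sda3): Remounting filesystem read-only"
REMOUNT_RO = re.compile(r"EXT4-fs \((\w+)\): Remounting filesystem read-only")


def summarize_disk_errors(log_text):
    """Return (Counter of I/O errors per device, sorted list of read-only partitions)."""
    io_errors = Counter()
    read_only = set()
    for line in log_text.splitlines():
        match = IO_ERROR.search(line)
        if match:
            io_errors[match.group(1)] += 1
            continue
        match = REMOUNT_RO.search(line)
        if match:
            read_only.add(match.group(1))
    return io_errors, sorted(read_only)
```

Feeding it the saved syslog text shows at a glance which device is failing and which partitions are no longer writable, which is also why saving the configuration fails once the filesystem is read-only.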

A reload clears the problem, but it usually comes back after about 3 days. The issue occurs on both NX-OS 9.3.8 and 9.3.10, and only on one of the spines. What should we do? How can I fix this? Among other things, it causes errors when saving the configuration.

Bootflash model on both spines:
ATA
Micron_1100_MTFDDAV256TBN

I found these two bugs with no workarounds!
https://bst.cisco.com/bugsearch/bug/CSCvm94379
https://bst.cisco.com/bugsearch/bug/CSCvu07378
It's not acceptable for me to have to reload a critical switch in my datacenter once a week.

I'll be happy to hear about your experiences and workarounds.

justclash4 · ‎12-18-2022

.

Tomson · ‎02-08-2023

Hi, you are probably hitting one of these bugs:
CSCvu07378 CSCvm94379
Try this to find your SSD model, firmware version and power-on hours:
switch# conf t
switch(config)# feature bash
switch(config)# run bash sudo su
bash-4.2# smartctl -a /dev/sda | egrep 'Model|Firmware|Hours'
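
If you want to collect the same three fields from several switches, a small Python sketch like the one below could parse saved `smartctl -a` output. The function name `parse_smartctl` is hypothetical. It assumes the usual smartctl text layout for ATA drives: "Device Model:" and "Firmware Version:" lines, plus a ten-column attribute table where the raw value is the last column. Check it against your own output before relying on it.

```python
import re


def parse_smartctl(output):
    """Extract model, firmware and power-on hours from `smartctl -a` text output."""
    info = {"model": None, "firmware": None, "power_on_hours": None}
    for line in output.splitlines():
        if line.startswith("Device Model:"):
            info["model"] = line.split(":", 1)[1].strip()
        elif line.startswith("Firmware Version:"):
            info["firmware"] = line.split(":", 1)[1].strip()
        elif "Power_On_Hours" in line:
            # Assumed attribute row layout:
            # ID# NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
            fields = line.split()
            if len(fields) >= 10:
                match = re.match(r"\d+", fields[9])
                if match:
                    info["power_on_hours"] = int(match.group(0))
    return info
```

Save the command output to a file on each switch, read it in, and pass the text to `parse_smartctl` to compare firmware versions and power-on hours between the healthy and the failing spine.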

justclash4 · ‎02-22-2023

Here it is.

What to do next?

Tomson · ‎02-28-2023

Well... your only option is to reload to get past this issue (after that you can run copy run start again).
But it does not seem to be a permanent fix: you can hit the issue again, and you may also hit it on your other devices.
There is no known date for a fix, so we have to wait until Cisco provides one.
Historically, Micron SSDs have been pretty bad, and similar issues were seen on other platforms with Micron SSDs, so I hope these SSDs will be avoided in the future.

sp2720401 · ‎02-28-2023

Doesn't 9.3.11 or 10.2.4 fix this?

Tomson · ‎03-02-2023

As far as I know, there is no fix yet.