Disk 1 stuck at 0% rebuilding

lco · 10-24-2022

Hello all,

I'm facing a problem on a UCS server, model UCSB-B200-M4: after a re-ack of the server, the disk in slot 1 has been rebuilding for several hours without getting past 0%. The disk is recognized and equipped, and its serial number and other information are visible, so the hardware appears to be OK, but the rebuild never finishes. I also performed an OIR (online insertion and removal) of the disk with the server running, but the rebuild has stayed at 0% for more than 3 hours. Could anyone tell me what the problem is and how to solve it?

sterifein10 · 10-25-2022

In my opinion, this is a common issue when rebuilding on the UCS server model UCSB-B200-M4. Last week I was doing a rebuild on one for www.ihorhnatewicz.com/best-youtube-automation-tools/ and it showed the same issue you describe. After spending hours on it, we selected the alternative disk port, which worked for us.

lco · 10-25-2022

What would the alternate-port procedure you mentioned look like?

Wes Austin · 10-25-2022

You would need to check the RAID controller logs and determine what is happening and why the rebuild is not moving forward.

You may be able to connect to the CIMC on the server in question and run the command "messages" to see any storage-related messages.

UCSM# connect cimc x/y    <--- x = chassis, y = slot

#messages

Otherwise, if you collect a tech support from the CIMC, the storage logs should be located there for review.
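If the "messages" output is saved to a text file, the storage-related lines can be pulled out with a short script instead of reading the whole dump. This is only a minimal sketch: it assumes storage events are logged under the process name "storaged", which may differ between CIMC releases, and the function name is made up for illustration.

```python
import re

# Assumption: storage events in the saved "messages" output carry the
# process name "storaged" followed by a PID; adjust if your release differs.
STORAGE_TAG = re.compile(r":storaged:\d+:")


def storage_lines(text):
    """Return only the lines that look like storage daemon messages."""
    return [line for line in text.splitlines() if STORAGE_TAG.search(line)]
```

To use it, read the saved file into a string, pass it to storage_lines(), and print the result.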

lco · 10-26-2022

I collected the output as recommended. Attached is the TXT file with the "messages" output.

Wes Austin · 10-26-2022

I just see this looping in the logs over and over:

6:2022 Oct 24 15:01:07 UTC:3.1(23e):storaged:8783: log.c:214:SAS:Firmware initialization started (PCI ID 005d/1000/0124/1137)
6:2022 Oct 24 15:01:07 UTC:3.1(23e):storaged:8783: log.c:214:SAS:Firmware version 4.620.01-7265
6:2022 Oct 24 15:01:07 UTC:3.1(23e):storaged:8783: log.c:214:SAS:Controller reset requested by host, completed
6:2022 Oct 24 15:01:07 UTC:3.1(23e):storaged:8783: log.c:214:SAS:Package version 24.12.1-0205
6:2022 Oct 24 15:01:07 UTC:3.1(23e):storaged:8783: log.c:214:SAS:Board Revision 06001
6:2022 Oct 24 15:01:07 UTC:3.1(23e):storaged:8783: log.c:214:SAS:Inserted: PD 04(e62/s1)
6:2022 Oct 24 15:01:07 UTC:3.1(23e):storaged:8783: log.c:214:SAS:Controller Hot Plug detected
6:2022 Oct 24 15:01:07 UTC:3.1(23e):storaged:8783: log.c:214:SAS:Inserted: PD 04(e1/s1) Info: enclPd=3e, scsiType=0, portMap=00, sasAddr=50000397a84bfff6,0000000000000000
6:2022 Oct 24 15:01:07 UTC:3.1(23e):storaged:8783: log.c:214:SAS:Inserted: PD 06(e62/s2)
6:2022 Oct 24 15:01:07 UTC:3.1(23e):storaged:8783: log.c:214:SAS:Inserted: PD 06(e1/s2) Info: enclPd=3e, scsiType=0, portMap=01, sasAddr=5000039a884b1bc6,0000000000000000
6:2022 Oct 24 15:01:07 UTC:3.1(23e):storaged:8783: log.c:214:SAS:State change on PD 04(e62/s1) from REBUILD(14) to OFFLINE(10)
6:2022 Oct 24 15:01:07 UTC:3.1(23e):storaged:8783: log.c:214:SAS:Rebuild resumed on PD 04(e62/s1)
6:2022 Oct 24 15:01:07 UTC:3.1(23e):storaged:8783: log.c:214:SAS:State change on PD 04(e62/s1) from OFFLINE(10) to REBUILD(14)
6:2022 Oct 24 15:01:07 UTC:3.1(23e):storaged:8783: log.c:214:SAS:Controller operating temperature within normal range, full operation restored
6:2022 Oct 24 15:01:07 UTC:3.1(23e):storaged:8783: log.c:214:SAS:Host driver is loaded and operational
6:2022 Oct 24 15:01:07 UTC:3.1(23e):storaged:8783: log.c:214:SAS:Setting VD 00/0 as boot device

Based on some related issues with the same behavior, it's usually a sign the disk needs replacement. In this case, HDD1.

Are you able to swap the disk with a new one and see if the behavior persists?
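The loop in the excerpt above (PD 04 going from REBUILD to OFFLINE and straight back to REBUILD) can be spotted in a longer log with a short script. This is a rough sketch that assumes the "State change on PD ..." wording shown in the excerpt; the function names and the threshold of 2 are arbitrary choices for illustration, not anything defined by the controller or CIMC.

```python
import re
from collections import Counter

# Matches lines such as:
#   State change on PD 04(e62/s1) from REBUILD(14) to OFFLINE(10)
STATE_CHANGE = re.compile(
    r"State change on PD (\w+)\((e\w+/s\d+)\) from (\w+)\(\d+\) to (\w+)\(\d+\)"
)


def count_transitions(lines):
    """Count (slot, old_state, new_state) transitions found in the log lines."""
    counts = Counter()
    for line in lines:
        match = STATE_CHANGE.search(line)
        if match:
            _pd, slot, old, new = match.groups()
            counts[(slot, old, new)] += 1
    return counts


def flapping_slots(counts, threshold=2):
    """Return slots seen bouncing REBUILD -> OFFLINE -> REBUILD repeatedly."""
    slots = {slot for slot, _, _ in counts}
    return sorted(
        slot for slot in slots
        if counts[(slot, "REBUILD", "OFFLINE")] >= threshold
        and counts[(slot, "OFFLINE", "REBUILD")] >= threshold
    )
```

A slot reported by flapping_slots() only shows that the rebuild keeps restarting; it does not by itself say whether the disk, the slot, or the controller is at fault.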

lco · 10-26-2022

Yes. Yesterday I had already requested an RMA for the disk in case a replacement was needed. As another test to confirm the need for the RMA, I ran "shutdown server", then "reset CIMC", and once that had finished 100%, "Boot server". Finally, I did a new decommission and re-ack of the server. When Discovery and Associate finished, a critical alert appeared for disk 1 and it was shown as "Unconfigured Bad", which is why I requested the RMA. Before that, the disk had gone straight to the "Rebuilding" status, so I suppose the CIMC reset via UCSM somehow caused the disk to show as bad in the alerts.

Since your analysis of the messages also points to an RMA, I believe replacing the disk will solve the problem. As soon as I replace it, I'll share the result with everyone.

lco · 10-27-2022

Hi!

Last night we held a maintenance window (MW) to replace the disk and unfortunately it didn't work: the disk is still in Rebuilding status and the progress bar doesn't go beyond 0%.

Actions taken:

- Disk RMA (replacement), which didn't help;
- Swapped the disk slots: the disk was initially in the Disk1 slot, and I moved it to the Disk2 slot and left it there;
- For each of these changes, I did an OIR of the blade to trigger discovery and association of the Service Profile, and still had no success.

Could you let me know if there are any other plans of action please?

Awaiting, thanks!

Wes Austin · 10-27-2022

If you can private message me your TAC case number, I will take a look and see what the current status is based on the logs.

lco · 10-27-2022

I sent you the SR number in a private message. I opened the TAC case this afternoon.