Cisco UCS C220 M3 - removing failed disk caused CUCM to go down for 30 seconds. Why?

andrewjosephson
Level 1

Two of the four disks in the C220 have failed.

I removed one of the failed disks and the phones went out for about one minute.

ESXi host log shows: "Lost access to volume (datastore1) due to connectivity issues. Recovery attempt is in progress and outcome will be reported shortly." One second later the log shows "Successfully restored access".

Why would removing a failed disk cause this? What is the likely impact of installing the replacement drive(s)?

AJ

5 Replies

Wes Austin
Cisco Employee

Hello,

Did both drives fail simultaneously?

Without further investigation into the storage controller logs, it would be hard to understand what caused the access loss. You would want to make sure your controller firmware is up to date, as well as your storage driver (lsi_mr3/megaraid_sas) in ESXi. If these are out of spec, they may be related to the issue you are seeing.

https://ucshcltool.cloudapps.cisco.com/public/
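
For reference, you can check the driver from the ESXi shell (assuming SSH access; the exact VIB names vary by ESXi release) with something like:

# List the storage adapters and the driver each one is bound to
esxcli storage core adapter list

# Show the installed driver VIB version (lsi-mr3 on ESXi 6.x, scsi-megaraid-sas on older releases)
esxcli software vib list | grep -i -E 'lsi|megaraid'

Compare the reported driver and controller firmware versions against the HCL tool results.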

Once this checks out, I would attempt to re-build the RAID during off-hours to be safe and prevent further business disruption.

HTH,

Wes

No, I do not believe both drives failed simultaneously.

BTW - when the second drive failed 7 days ago, the datastore was inaccessible for 7 minutes (with the same error messages in the ESXi log).

The problem is I don't want to power off the server with only two working drives. To make matters worse, I can't access the CIMC and the fan is running flat out (i.e., noisy). I believe it is this bug:

https://bst.cloudapps.cisco.com/bugsearch/bug/CSCun88303/?reffering_site=dumpcr

What would be the best way to handle this scenario?

I.e.:

- Two of the four drives have failed

- Seemingly, when a change is made (e.g., a disk removed), the RAID is unavailable for a period

- CIMC cannot be accessed

- Fan on full throttle

AJ

AJ,

If you are hitting that bug mentioned, then you would need to AC power cycle the server regardless in order to recover the CIMC.

I would attempt this first and then install your new drives to rebuild the RAID. Alternatively, you can install the new drives before the power cycle; however, it would be hard to monitor the rebuild status without the CIMC.
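
If you do go the second route, one option (assuming the Broadcom/LSI storcli utility is installed on the host as a VIB; the path below is typical but may vary by version) is to monitor the rebuild from the ESXi shell instead of the CIMC:

# Controller summary, including the firmware package build
/opt/lsi/storcli/storcli /c0 show

# Per-drive rebuild progress
/opt/lsi/storcli/storcli /c0/eall/sall show rebuild

# Virtual drive state (e.g. Dgrd vs. Optl)
/opt/lsi/storcli/storcli /c0/vall show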

HTH,

Wes

Thanks for the advice.

I shut down and backed up the VMs, then power cycled the server.

CIMC access was restored, the fan noise stopped, and ESXi booted.

Inserted one of the replacement drives. The rebuild process began (90 minutes to complete).

CUCM did not glitch when the replacement drive was inserted (unlike when the faulty drive was removed), nor during the rebuild process.

The server still has one failed drive inserted. It will be interesting to see whether CUCM/ESXi momentarily loses access to the datastore when it is removed (as it did when the first disk was removed before the reboot).
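
One way to watch for that in real time (log paths assumed for a typical ESXi install and may differ by release) would be to tail the host logs while the drive is pulled:

tail -f /var/log/vmkernel.log /var/log/hostd.log | grep -i 'access to volume'

The vSphere event log should show the same "Lost access to volume" / "Successfully restored access" pair if it happens again.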

AJ

The CIMC memory leak issue is fixed in the 2.0(3i) or higher HUU.

You may want to consider the 3.0(1) HUU level to get the HTML5-based KVM.

It will take log review to confirm what is actually occurring during your I/O issues and drive swaps.

Thanks,

Kirk...
