Cisco UCS C220 M3 - removing failed disk caused CUCM to go down for 30 seconds. Why?

andrewjosephson
Level 1

Two of the four disks in the C220 have failed.

I removed one of the failed disks and the phones went out for about one minute.

ESXi host log shows: "Lost access to volume (datastore1) due to connectivity issues. Recovery attempt is in progress and outcome will be reported shortly." One second later the log shows "Successfully restored access".

Why would removing a failed disk cause this? What is the likely impact of installing the replacement drive(s)?

AJ

5 Replies

Wes Austin
Cisco Employee

Hello,

Did both drives fail simultaneously?

Without further investigation into the storage controller logs, it would be hard to understand what caused the access loss. You would want to make sure your controller firmware is up to date, as well as your storage driver (lsi_mr3/megaraid_sas) in ESXi. If these are out of spec, they may be related to the issue you are seeing.

https://ucshcltool.cloudapps.cisco.com/public/
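
For reference, you can check the driver from the ESXi shell (assuming SSH access; the exact VIB names vary by ESXi release) with something like:

# List the storage adapters and the driver each one is bound to
esxcli storage core adapter list

# Show the installed driver VIB version (lsi-mr3 on ESXi 6.x, scsi-megaraid-sas on older releases)
esxcli software vib list | grep -i -E 'lsi|megaraid'

Compare the reported driver and controller firmware versions against the HCL tool results.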

Once this checks out, I would attempt to re-build the RAID during off-hours to be safe and prevent further business disruption.

HTH,

Wes

No, I do not believe both drives failed simultaneously.

BTW - when the second drive failed 7 days ago, the datastore was inaccessible for 7 minutes (with the same error messages in the ESXi log).

The problem is I don't want to power off the server with only two working drives. To make matters worse, I can't access the CIMC and the fan is running flat out (i.e., noisy). I believe it is this bug:

https://bst.cloudapps.cisco.com/bugsearch/bug/CSCun88303/?reffering_site=dumpcr

What would be the best way to handle this scenario?

I.e.:

- Two of the four drives have failed

- Seemingly, when a change is made (e.g., a disk removed), the RAID is unavailable for a period

- CIMC cannot be accessed

- Fan on full throttle

AJ

AJ,

If you are hitting that bug mentioned, then you would need to AC power cycle the server regardless in order to recover the CIMC.

I would attempt this first and then install your new drives to rebuild the RAID. Alternatively, you can install the new drives before the power cycle; however, it would be hard to monitor the rebuild status without the CIMC.
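
If you do go the second route, one option (assuming the Broadcom/LSI storcli utility is installed on the host as a VIB; the path below is typical but may vary by version) is to monitor the rebuild from the ESXi shell instead of the CIMC:

# Controller summary, including the firmware package build
/opt/lsi/storcli/storcli /c0 show

# Per-drive rebuild progress
/opt/lsi/storcli/storcli /c0/eall/sall show rebuild

# Virtual drive state (e.g. Dgrd vs. Optl)
/opt/lsi/storcli/storcli /c0/vall show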

HTH,

Wes

Thanks for the advice.

I shut down and backed up the VMs, then power cycled the server.

CIMC access was restored, the fan noise stopped, and ESXi booted.

Inserted one of the replacement drives. The rebuild process began (90 minutes to complete).

CUCM did not glitch when the replacement drive was inserted (unlike when the faulty drive was removed), nor during the rebuild process.

The server still has one failed drive inserted. It will be interesting to see whether CUCM/ESXi momentarily loses access to the datastore when it is removed (as it did when the first disk was removed before the reboot).
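
One way to watch for that in real time (log paths assumed for a typical ESXi install and may differ by release) would be to tail the host logs while the drive is pulled:

tail -f /var/log/vmkernel.log /var/log/hostd.log | grep -i 'access to volume'

The vSphere event log should show the same "Lost access to volume" / "Successfully restored access" pair if it happens again.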

AJ

The CIMC memory leak issue is fixed in the 2.0(3i) or higher HUU.

You may want to consider the 3.0(1) HUU level to get the HTML5-based KVM.

It will take log review to confirm what is actually occurring during your I/O issues and drive swaps.

Thanks,

Kirk...
