2068 Views | 10 Helpful | 18 Replies
Beginner

hosts are not responding state/frozen state after upgrade from 5.5U3 to 6.5U2 on C240-M4S

We recently upgraded ESXi 5.5 U3 to ESXi 6.5 U2 using the Cisco customized image on C240-M4S servers. We performed the upgrade in two steps: first we upgraded the Cisco firmware from 2.0(6) to 4.0(1c), then we upgraded the ESXi hosts from 5.5 U3 to 6.5 U2. (Please see the attached text file for driver and firmware details before and after the upgrade.)

 

After the upgrade, hosts go into a not-responding/frozen state: the ESXi hosts remain reachable via ping over the network, but we are unable to reconnect them to vCenter.

 

While a host is in the not-responding state, we can log in via PuTTY with multiple sessions, but we can't run any commands or see their output (e.g., df -h, or cat on logs under /var/log). When we run df -h, the host displays nothing and the session hangs until we close the PuTTY session and reconnect.

While a host is in the not-responding state, its VMs continue to run, but we can't migrate them to another host, and we are also unable to manage them via the vCloud panel.

 

We have to reboot the host to recover it, after which it reconnects to vCenter.

 

We have been working with VMware and Cisco for three weeks with no resolution yet.

 

We see a lot of "Valid sense data: 0x5 0x24 0x0" entries in vmkernel.log, and VMware suspects something with the LSI MegaRAID (MRAID12G) driver. VMware therefore asked us to contact the hardware vendor to check for hardware/firmware issues as well as LSI issues.

 

 cpu20:66473)ScsiDeviceIO: 3001: Cmd(0x439d48ebd740) 0x1a, CmdSN 0xea46b from world 0 to dev "naa.678da6e715bb0c801e8e3fab80a35506" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x24 0x0
This command failed  4234 times on "naa.678da6e715bb0c801e8e3fab80a35506"

Display Name: Local Cisco Disk (naa.678da6e715bb0c801e8e3fab80a35506)
Vendor: Cisco | Model: UCSC-MRAID12G | Is Local: true | Is SSD: false
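For anyone matching this signature: the fields in that ScsiDeviceIO line follow the standard SCSI status and sense tables. A minimal POSIX-sh sketch of the decode (the helper name is mine; the mappings come from the SCSI spec, not from VMware tooling):

```shell
# Decode the recurring failure above:
#   Cmd 0x1a            -> MODE SENSE(6) opcode
#   H:0x0 D:0x2 P:0x0   -> host OK, device CHECK CONDITION, plugin OK
#   sense 0x5 0x24 0x0  -> sense key / ASC / ASCQ
decode_sense() {
  case "$1" in 0x5) key="ILLEGAL REQUEST" ;; *) key="(other)" ;; esac
  case "$2-$3" in 0x24-0x0) asc="INVALID FIELD IN CDB" ;; *) asc="(other)" ;; esac
  echo "sense key $1 = $key; ASC/ASCQ $2/$3 = $asc"
}
decode_sense 0x5 0x24 0x0
# -> sense key 0x5 = ILLEGAL REQUEST; ASC/ASCQ 0x24/0x0 = INVALID FIELD IN CDB
```

In other words, the controller is rejecting a MODE SENSE request as malformed rather than reporting a medium error, which is why the driver (not the disks) came under suspicion.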

 

Cisco did not see any issues with the server/hardware after analyzing the tech-support logs, and we also ran Cisco diagnostics on a few servers; all component tests/checks look good. The only recommendation from Cisco was to change the power management policy from Balanced to High Performance (ESXi host > Configure > Hardware > Power Mgmt > Active policy > High Performance).
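For reference, the same power-policy change can be scripted from the ESXi shell through the /Power/CpuPolicy advanced option ("static" maps to High Performance in my experience; verify the option and value names on your build before relying on this):

```shell
# Show the current CPU power policy ("dynamic" = Balanced,
# "static" = High Performance, "low" = Low Power).
esxcli system settings advanced list -o /Power/CpuPolicy

# Switch to High Performance, per Cisco's recommendation.
esxcli system settings advanced set -o /Power/CpuPolicy --string-value static
```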

 

Can someone help me find the cause/fix?

18 Replies
Cisco Employee

Re: hosts are not responding state/frozen state after upgrade from 5.5U3 to 6.5U2 on C240-M4S

It sounds like the issue is specific to the management network, if I am reading correctly, since the VMs are fine but the hosts are not accessible in vCenter. Is that correct?

 

If the hosts can ping, it doesn't sound like a networking issue, but rather something specific to the ESXi OS.

 

Does this issue happen on all hosts at the same time or on random hosts? Did it start directly after the firmware upgrade or after the ESXi upgrade?

 

Do your hosts boot from local disk or FC storage? The LSI driver looks okay, but the QLogic looks to be on an earlier release. I would look at correcting the FC driver at some point.

 

Per UCS HCL https://ucshcltool.cloudapps.cisco.com/public/

 

  • Firmware Version: 8.08.03
  • Driver Version: 2.1.74.0-1OEM.600.0.0.2768847 (qlnativefc)
  • Adapter BIOS: 3.43

You probably need to investigate the VMware vmkernel.log and hostd.log to understand what is causing the host to hang. If this was related to a hardware hang, I would expect to see something in the UCS SEL logs.

 

If you PM me your TAC SR, I can take a look and see if there are any further suggestions.

Beginner

Re: hosts are not responding state/frozen state after upgrade from 5.5U3 to 6.5U2 on C240-M4S

Thanks, Wes, for the information.

 

It sounds like the issue is specific to the management network if I am reading correctly, since the VMs are fine, but the host are not accessible in vCenter. Is that correct?

                           Yes, the VMs are running, and the host is also reachable via ping over the network, but the host does not respond to any input/commands via a PuTTY session.

 

Does this issue happen on all hosts at the same time or on random hosts? Did it start directly after the firmware upgrade or after the ESXi upgrade.

                          The issue started after the host upgrade, which we performed in two steps: first we upgraded the server firmware from 2.0(6) to 4.0(1c), and then we upgraded ESXi 5.5 U3 to 6.5 U2 using the Cisco customized image.

The issue occurs randomly and has affected 6 out of 12 hosts.

 

Do your hosts boot from local disk or FC storage? LSI driver looks okay, but the qlogic looks to be on an earlier release. I would look towards correcting the FC driver at some point.

                         The hosts boot from local disk. As mentioned in the text file, we have the QLE2672 QLogic 2-port 16Gb Fibre Channel adapter, running FC firmware version 8.03.06 (d0d5) and driver version 2.1.53.0.

 

You probably need to investigate the VMware vmkernel.log and hostd.log to understand what is causing the host to hang. If this was related to a hardware hang, I would expect to see something in the UCS SEL logs.

                                 The hosts hang completely after this until a reboot, as shown in the following log excerpts:
vmksummary.log
2019-02-11T01:13:39Z bootstop: Host has booted

vmkernel.log
2019-02-10T16:19:19.016Z cpu24:66473)ScsiDeviceIO: 3001: Cmd(0x439d4548d980) 0x1a, CmdSN 0xa6575 from world 0 to dev "naa.678da6e715bad3f01db8fff50c7431f0" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x24 0x0
VMB: 112: mbMagic: 2badb002, mbInfo 0x10185c >>> start of the reboot

vmkernel.log
2019-02-11T01:47:24.255Z cpu30:66473)ScsiDeviceIO: 3001: Cmd(0x439d40f31800) 0x1a, CmdSN 0x3aac from world 0 to dev "naa.678da6e715bad3f01db8fff50c7431f0" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x24 0x0
This command failed 1904 times on "naa.678da6e715bad3f01db8fff50c7431f0"

Cisco Employee

Re: hosts are not responding state/frozen state after upgrade from 5.5U3 to 6.5U2 on C240-M4S

I took a quick look at the logs and don't see any problems on the UCS. I see a lot of these, indicating a reboot initiated from the OS:

 

 BIOS | System Event #0x83 | OEM System Boot Event | Asserted

Your storage controller logs look fine; I wouldn't expect this to have anything to do with LSI or local storage.

 

If the VMs continue to run, a process within VMware is having an issue. If there were a problem with the hardware, I would expect it to freeze all the VMs as well.

 

I would work with VMware to set up logging or monitoring so you can see whether something is causing problems on the ESXi side. Otherwise, you could roll back to 6.5 U1 or a different release and see if the issue persists.

 

You keep saying "the host is completely hung," but if it were hung, how would the VMs still function? Can you access the host via the KVM in CIMC?

 

Your vmkernel logs have a lot of these. Any comment from VMware?

 

2019-02-19T23:33:03.574Z cpu22:206011)Tcpip_Vmk: 129: get connection pkt trace failed with error code 195887136
2019-02-19T23:33:03.574Z cpu22:206011)Tcpip_Vmk: 129: get connection pkt trace failed with error code 195887136
2019-02-19T23:33:03.574Z cpu22:206011)Tcpip_Vmk: 96: get connection stats failed with error code 195887136
2019-02-19T23:33:03.574Z cpu22:206011)Tcpip_Vmk: 129: get connection pkt trace failed with error code 195887136
2019-02-19T23:33:03.574Z cpu22:206011)Tcpip_Vmk: 129: get connection pkt trace failed with error code 195887136
2019-02-19T23:33:03.574Z cpu22:206011)Tcpip_Vmk: 96: get connection stats failed with error code 195887136
2019-02-19T23:33:03.574Z cpu22:206011)Tcpip_Vmk: 129: get connection pkt trace failed with error code 195887136
2019-02-19T23:33:03.574Z cpu22:206011)Tcpip_Vmk: 129: get connection pkt trace failed with error code 195887136
2019-02-19T23:33:03.574Z cpu22:206011)Tcpip_Vmk: 96: get connection stats failed with error code 195887136
2019-02-19T23:33:03.574Z cpu22:206011)Tcpip_Vmk: 129: get connection pkt trace failed with error code 195887136
2019-02-19T23:33:03.574Z cpu22:206011)Tcpip_Vmk: 129: get connection pkt trace failed with error code 195887136
2019-02-19T23:33:03.575Z cpu22:206011)Tcpip_Vmk: 96: get connection stats failed with error code 195887136

 

 

Beginner

Re: hosts are not responding state/frozen state after upgrade from 5.5U3 to 6.5U2 on C240-M4S

You keep saying "the host is completely hung" but if it was hung, how would VM still function? Can you access the host via KVM in CIMC?

                        The hypervisor agent on the server hangs; however, we can see that the vpxa and hostd agents are running on the ESXi host, and we can log in to the host via PuTTY with multiple sessions, but we are unable to view any logs from the ESXi host (e.g., viewing vmkernel logs, running df -h, etc.).

Yes, we can access the server via KVM/CIMC and can also log in to the ESXi host via CIMC. As the VMs reside on shared storage, they continue to run.

 

Your vmkernel logs have a lot of these? Any comment from VMware?

                        Probably not.

 

Thanks for your info

Beginner

Re: hosts are not responding state/frozen state after upgrade from 5.5U3 to 6.5U2 on C240-M4S

We found a Cisco bug (https://quickview.cloudapps.cisco.com/quickview/bug/CSCuw38385) which indicates that, if it occurs a large number of times, it can cause unresponsive hosts under the following conditions. However, Cisco said the bug does not apply to the current firmware version of our hardware:

Server: C240-M4SX or C240-M4S

OS: ESXi 5.5 and 6.0

RAID Controller: Cisco 12G SAS Modular RAID Controller

Cisco Employee

Re: hosts are not responding state/frozen state after upgrade from 5.5U3 to 6.5U2 on C240-M4S

This bug affects older firmware and older ESXi versions, and it has been terminated on the Cisco side, meaning there was not enough information to conclude anything.

 

 

Beginner

Re: hosts are not responding state/frozen state after upgrade from 5.5U3 to 6.5U2 on C240-M4S

We are still working with Cisco and VMware; there is still no resolution or root cause for the issue.

 

However, we made the changes below, as VMware's analysis points to the local disk drive causing the hosts to go into the non-responsive state. VMware suspects an issue either with the boot disk or with the lsi_mr3 driver/firmware, but Cisco did not find any issues or errors with the HDDs or firmware version when verifying the CIMC logs.

We also performed the remediations below to identify and fix the issue:

 

We configured a scratch partition to store logs on a dedicated LUN for further analysis.
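For anyone wanting to do the same, redirecting scratch to a dedicated datastore can be done from the ESXi shell; a sketch with placeholder datastore and directory names (the option path follows VMware's persistent-scratch guidance, and a reboot is required for it to take effect):

```shell
# Create a per-host directory on the dedicated log LUN
# ("LOG_LUN" and ".locker-esx01" are placeholders).
mkdir -p /vmfs/volumes/LOG_LUN/.locker-esx01

# Point the persistent scratch location at it; takes effect after reboot.
esxcli system settings advanced set \
    -o /ScratchConfig/ConfiguredScratchLocation \
    -s /vmfs/volumes/LOG_LUN/.locker-esx01
```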

We upgraded the Seagate HDD firmware from 0003/0004 to A005 (the ESXi hosts are Cisco C240-M4S).

 

Even after the HDD firmware upgrade we still had issues, so we ordered new Seagate disks and replaced all the old disks with new Seagate HDDs running firmware N0B1.

 

But we are still facing the host not-responding issue.

 

While a host was in the not-responding state, I tried to analyze the logs from the scratch partition and found local disk drive I/O errors:

 

vmkwarning

2019-03-22T23:43:00.828Z cpu13:68580)WARNING: Partition: 1158: Partition table read from device naa.678da6e715bb0c801e8e3fab80a35506 failed: I/O error

2019-03-22T23:44:20.833Z cpu1:68583)WARNING: Partition: 1158: Partition table read from device naa.678da6e715bb0c801e8e3fab80a35506 failed: I/O error

2019-03-22T23:45:01.405Z cpu34:2412446)ALERT: hostd detected to be non-responsive

 

vmksummary.log

 

2019-03-23T04:19:19.576Z cpu0:68539)WARNING: Partition: 1158: Partition table read from device naa.678da6e715bb0c801e8e3fab80a35506 failed: I/O error

2019-03-23T04:19:19.576Z cpu11:66468)NMP: nmp_ThrottleLogForDevice:3562: last error status from device naa.678da6e715bb0c801e8e3fab80a35506 repeated 456 times

2019-03-23T04:19:19.576Z cpu11:66468)NMP: nmp_ThrottleLogForDevice:3616: Cmd 0x28 (0x43955db34080, 67662) to dev "naa.678da6e715bb0c801e8e3fab80a35506" on path "vmhba4:C2:T0:L0" Failed: H:0x0 D:0x8 P:0x0 Invalid sense data: 0x0 0x0 0x0. Act:NONE

2019-03-23T04:19:19.576Z cpu11:66468)ScsiDeviceIO: 2980: Cmd(0x43955db34080) 0x28, CmdSN 0x1 from world 67662 to dev "naa.678da6e715bb0c801e8e3fab80a35506" failed H:0x0 D:0x8 P:0x0 Invalid sense data: 0x61 0x74 0x68.

2019-03-23T04:19:19.715Z cpu0:66468)NMP: nmp_ThrottleLogForDevice:3545: last error status from device naa.678da6e715bb0c801e8e3fab80a35506 repeated 10 times.

 

Display Name: Local Cisco Disk (naa.678da6e715bb0c801e8e3fab80a35506)

 

naa.678da6e715bb0c801e8e3fab80a35506

 

 

The driver information:

Key Value Instance: lsi_mr3-578da6e715bb1470/LSI Incorporation

Name: MR-DriverVersion | Type: string | Value: 7.703.19.00

Name: MR-HBAModel | Type: string | Value: Avago (LSI) HBA 1000:5d:1137:db

Name: MR-FWVersion | Type: string | Value: Fw Rev. 24.12.1-0433

Name: MR-ChipRevision | Type: string | Value: Chip Rev. C0

Name: MR-CtrlStatus | Type: string | Value: FwState c0000000

Key Value Instance: MOD_PARM/qlogic

Name: DRIVERINFO | Type: string | Value: Driver version 2.1.53.0

 

Please, can someone help if you find any clue or resolution?

 

Beginner

Re: hosts are not responding state/frozen state after upgrade from 5.5U3 to 6.5U2 on C240-M4S

Hey there, out of curiosity, were you able to get any resolution to this? We have B200 M4 blades using the LSI MegaRAID drivers and are seeing the same issues. We have ESXi 6.5 U2 installed on our local SCSI drives, and the hosts randomly disconnect. The only solution is to restart ESXi. Both Cisco and VMware have pointed fingers at each other. We saw this issue on ESXi 6.0, and VMware told us to upgrade, so we went to 6.5, but that didn't resolve the issue.

Cisco Employee

Re: hosts are not responding state/frozen state after upgrade from 5.5U3 to 6.5U2 on C240-M4S

Please provide me the following information:
Server Model:
UCSM firmware:
Network Adapter Model:
LSI Raid controller model:

OS firmware:
ENIC driver:
FNIC driver (if FC is being used):
Raid controller Driver:

Commands to run in ESXi cli
vmware -vl
esxcfg-scsidevs -a
esxcfg-nics -l
vmkload_mod -s fnic
vmkload_mod -s nenic
vmkload_mod -s megaraid_sas
vmkload_mod -s lsi_mr3

Can you also provide a screenshot of PSOD?

- Josh O
Beginner

Re: hosts are not responding state/frozen state after upgrade from 5.5U3 to 6.5U2 on C240-M4S

Hey Josh,

 

I never received a PSOD. The symptom I am seeing is that hosts disconnect and show as unresponsive, but the VMs continue to run. The only way to get a host to reconnect is to reset it and kill the VMs running on it. Here are the items you asked for:

 

---------------------------
UCS
---------------------------
Name Model Running Version Startup Version Backup Version Update Status Activate Status
Adapters
Adapter 1 Cisco UCS VIC 1340 4.2(2b) 4.2(2b) 4.1(2e) Ready Ready
Adapter 2 Cisco UCS VIC 1380 4.2(2b) 4.2(2b) 4.1(2e) Ready Ready
BIOS Cisco UCS B200 M4 B200M4.3.1.3i.0.032120171710 B200M4.3.1.3i.0.032120171710 B200M4.3.1.3g.0.011820171448 Ready Ready
Board Controller Cisco UCS B200 M4 14 14 N/A N/A Ready
CIMC Controller Cisco UCS B200 M4 3.1(25d) 3.1(25d) 3.1(21f) Ready Ready
FlexFlash Controller 1 1.3.2 build 170 N/A N/A N/A
Storage Controller SAS 1 Cisco FlexStorage 12G SAS RAID Controller 24.5.0-0021 24.5.0-0021 N/A N/A Ready
Disks
Disk 1 A03-D300GA2 1 1 N/A N/A Ready
Disk 2 A03-D300GA2 1 1 N/A N/A Ready

---------------------------
ESXi
---------------------------
vmware -vl
VMware ESXi 6.5.0 build-9298722
VMware ESXi 6.5.0 Update 2

---------------------------

esxcfg-scsidevs -a
vmhba0 lsi_mr3 link-n/a sas.518e728372736380 (0000:01:00.0) Avago (LSI / Symbios Logic) MegaRAID SAS Invader Controller
vmhba64 iscsi_vmk online iqn.1998-01.com.vmware:sdcorgesx3s1-21063a36iSCSI Software Adapter

---------------------------

esxcfg-nics -l
Name PCI Driver Link Speed Duplex MAC Address MTU Description
vmnic0 0000:06:00.0 nenic Up 1000Mbps Full 00:25:b5:00:06:df 1500 Cisco Systems Inc Cisco VIC Ethernet NIC
vmnic1 0000:07:00.0 nenic Up 20000Mbps Full 00:25:b5:00:07:2f 9000 Cisco Systems Inc Cisco VIC Ethernet NIC
vmnic10 0000:85:00.0 nenic Up 1000Mbps Full 00:25:b5:00:06:ff 1500 Cisco Systems Inc Cisco VIC Ethernet NIC
vmnic11 0000:86:00.0 nenic Up 20000Mbps Full 00:25:b5:00:07:4f 9000 Cisco Systems Inc Cisco VIC Ethernet NIC
vmnic12 0000:87:00.0 nenic Up 20000Mbps Full 00:25:b5:00:07:5f 9000 Cisco Systems Inc Cisco VIC Ethernet NIC
vmnic13 0000:88:00.0 nenic Up 20000Mbps Full 00:25:b5:00:06:ef 9000 Cisco Systems Inc Cisco VIC Ethernet NIC
vmnic14 0000:89:00.0 nenic Up 20000Mbps Full 00:25:b5:00:07:0f 9000 Cisco Systems Inc Cisco VIC Ethernet NIC
vmnic15 0000:0f:00.0 nenic Up 20000Mbps Full 00:25:b5:00:06:1f 1500 Cisco Systems Inc Cisco VIC Ethernet NIC
vmnic16 0000:10:00.0 nenic Up 20000Mbps Full 00:25:b5:00:06:2f 1500 Cisco Systems Inc Cisco VIC Ethernet NIC
vmnic17 0000:11:00.0 nenic Up 20000Mbps Full 00:25:b5:00:06:3f 9000 Cisco Systems Inc Cisco VIC Ethernet NIC
vmnic18 0000:12:00.0 nenic Up 20000Mbps Full 00:25:b5:00:06:4f 1500 Cisco Systems Inc Cisco VIC Ethernet NIC
vmnic19 0000:13:00.0 nenic Up 20000Mbps Full 00:25:b5:00:06:8f 1500 Cisco Systems Inc Cisco VIC Ethernet NIC
vmnic2 0000:08:00.0 nenic Up 20000Mbps Full 00:25:b5:00:07:3f 9000 Cisco Systems Inc Cisco VIC Ethernet NIC
vmnic3 0000:09:00.0 nenic Up 20000Mbps Full 00:25:b5:00:06:bf 9000 Cisco Systems Inc Cisco VIC Ethernet NIC
vmnic4 0000:0a:00.0 nenic Up 20000Mbps Full 00:25:b5:00:07:1f 9000 Cisco Systems Inc Cisco VIC Ethernet NIC
vmnic5 0000:8e:00.0 nenic Up 20000Mbps Full 00:25:b5:00:06:0f 1500 Cisco Systems Inc Cisco VIC Ethernet NIC
vmnic6 0000:8f:00.0 nenic Up 20000Mbps Full 00:25:b5:00:06:6f 1500 Cisco Systems Inc Cisco VIC Ethernet NIC
vmnic7 0000:90:00.0 nenic Up 20000Mbps Full 00:25:b5:00:06:7f 9000 Cisco Systems Inc Cisco VIC Ethernet NIC
vmnic8 0000:91:00.0 nenic Up 20000Mbps Full 00:25:b5:00:06:5f 1500 Cisco Systems Inc Cisco VIC Ethernet NIC
vmnic9 0000:92:00.0 nenic Up 20000Mbps Full 00:25:b5:00:06:9f 1500 Cisco Systems Inc Cisco VIC Ethernet NIC

---------------------------

vmkload_mod -s nenic
vmkload_mod module information
input file: /usr/lib/vmware/vmkmod/nenic
Version: 1.0.25.0-1OEM.650.0.0.4598673
Build Type: release
License: Proprietary
Required name-spaces:
com.vmware.vmkapi#v2_4_0_0
Parameters:
debug_mask: ulong
Enabled debug mask (default: DRIVER | UPLINK | QUEUE | HW)

---------------------------

vmkload_mod -s fnic
vmkload_mod module information
input file: /usr/lib/vmware/vmkmod/fnic
Version: Version 1.6.0.44, Build: 2494585, Interface: 9.2 Built on: Mar 26 2018
Build Type: release
License: GPLv2
Name-space: com.cisco.fnic#9.2.3.0
Required name-spaces:
com.vmware.libfcoe#9.2.3.0
com.vmware.libfc#9.2.3.0
com.vmware.driverAPI#9.2.3.0
com.vmware.vmkapi#v2_3_0_0
Parameters:
skb_mpool_max: int
Maximum attainable private socket buffer memory pool size for the driver.
skb_mpool_initial: int
Driver's minimum private socket buffer memory pool size.
heap_max: int
Maximum attainable heap size for the driver.
heap_initial: int
Initial heap size allocated for the driver.
fnic_max_qdepth: uint
Queue depth to report for each LUN
fnic_fc_trace_max_pages: uint
Total allocated memory pages for fc trace buffer
fnic_trace_max_pages: uint
Total allocated memory pages for fnic trace buffer

---------------------------

vmkload_mod -s megaraid_sas
vmkload_mod module information
input file: /usr/lib/vmware/vmkmod/megaraid_sas
Version: Version 6.610.16.00, Build: 2494585, Interface: 9.2 Built on: Sep 13 2016
Build Type: release
License: GPL
Required name-spaces:
com.vmware.driverAPI#9.2.3.0
com.vmware.vmkapi#v2_3_0_0
Parameters:
heap_max: int
Maximum attainable heap size for the driver.
heap_initial: int
Initial heap size allocated for the driver.
max_msix_count: int
To change MSI-X vector count. Default: Set by Firmware
class_event_print: int
To print Event details. Default: 2
msix_disable: int
Disable MSI interrupt handling. Default: 0
cmd_per_lun: int
Maximum number of commands per logical unit (default=128)
max_sectors: int
Maximum number of sectors per IO command
fast_load: int
megasas: Faster loading of the driver, skips physical devices! (default=0)
lb_pending_cmds: int
Change raid-1 load balancing outstanding threshold. Valid Values are 1-128. Default: 4
disable_1MB_IO: int
megasas: Set TRUE to disable Extended IO feature
disable_dual_qd: int
megasas: Set TRUE to disable iMR extended QD feature

---------------------------

vmkload_mod -s lsi_mr3
vmkload_mod module information
input file: /usr/lib/vmware/vmkmod/lsi_mr3
Version: 7.703.19.00-1OEM.650.0.0.4598673
Build Type: release
License: Proprietary
Required name-spaces:
com.vmware.vmkapi#v2_4_0_0
Parameters:
lb_pending_cmds: int
Desc : Change raid-1 load balancing outstanding threshold
Default : 4
Range : 1 - 128
max_msix_count: int
Desc : Maximum MSI-X vector count to allocate
Default : No. of Physical Sockets
Range : 1 - Max System Supported
disable_1MB_IO: int
Desc : Disable 1MB IO support
Default : 0
Range : 0 - 1
disable_dual_qd: int
Desc : Disable dual queue depth mode
Default : 0
Range : 0 - 1
block_SynchCache: int
Desc : Loop SYNC_CACHE SCSI command
Default : 0
Range : 0 - 1
class_event_print: int
Desc : Level of FW event details to log
Default : 2
Range : -2 (low severity) to 4 (high severity)
disable_TB_support: int
Desc : Disable SAS2.5 Thunderbolt controller support
Default : 0
Range : 0 - 1
mfiDumpFailedCmd: int
Desc : Log hex dump of failed command
Default : 0
Range : 0 - 1
max_sectors: int
Desc : Maximum number of sectors per IO command
Default : FW supported
Range : 1 - FW supported
max_mpool_sz: int
Desc : Maximum mem pool size in units of MB
Default : 200
Range : 16 - 1024
max_heap_sz: int
Desc : Maximum heap size in units of MB
Default : 50
Range : 16 - 128
max_mgmtheap_sz: int
Desc : Maximum management heap size in units of MB
Default : 5
Range : 1 - 20

 

These hosts are B200 M4 hosts, and I did find this bug in the library: CSCut37134. It sounds very similar to the issues we were seeing. We went ahead and removed the VIBs as recommended, and the hosts are now loading the megaraid_sas driver on reboot. I am hoping that resolves our issues.
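For anyone following the same workaround: per my reading of the CSCut37134 bulletin, it comes down to removing the async lsi_mr3 driver VIB so the inbox megaraid_sas driver claims the controller at boot. Roughly (the VIB name below is what it was called on my image; confirm it against the listing and the bulletin before removing anything):

```shell
# Find the MegaRAID-related driver VIBs on this host (names vary by image).
esxcli software vib list | grep -iE 'lsi|megaraid'

# Remove the lsi_mr3 VIB so megaraid_sas takes over on next boot
# (confirm the exact VIB name from the listing above first).
esxcli software vib remove -n lsi-mr3
reboot
```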

 

After following the work around:

 

esxcfg-scsidevs -a
vmhba0 megaraid_sas link-n/a unknown.vmhba0 (0000:01:00.0) Avago (LSI / Symbios Logic) MegaRAID SAS

Re: hosts are not responding state/frozen state after upgrade from 5.5U3 to 6.5U2 on C240-M4S

Has anyone found a resolution to this error, other than restarting the host? I have the same issue at two customers, both running Cisco UCS B200 M4.

 

Cisco Employee

Re: hosts are not responding state/frozen state after upgrade from 5.5U3 to 6.5U2 on C240-M4S

Have you opened a TAC case to investigate? This problem is fairly generic and you might not be facing the exact same issue outlined in this post.

Beginner

Re: hosts are not responding state/frozen state after upgrade from 5.5U3 to 6.5U2 on C240-M4S

Hey there michael.hedberg@atea.se,

 

I can confirm I was being affected by this bug: https://quickview.cloudapps.cisco.com/quickview/bug/CSCut37134.

 

Ever since I performed the steps posted in the bulletin, my hosts have not disconnected. I was being affected every couple of months, and it has not happened since I followed the steps in the link. It sounds like you have a similar situation to mine.

 

Re: hosts are not responding state/frozen state after upgrade from 5.5U3 to 6.5U2 on C240-M4S

Great, thanks :)