01-29-2013 09:40 AM - edited 03-01-2019 10:50 AM
We have had 2 hosts lose their network connectivity with no messages or errors reported to UCSM.
A restart of the NIC interface within the OS fixes the issue.
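For reference, the workaround amounts to bouncing the interface from the OS (on the ESXi host the equivalent is `esxcli network nic down/up`). A minimal sketch of the RHEL side — interface name is a hypothetical example, and the actual `ip link` calls are left commented so this is safe to paste:

```shell
# Bounce a NIC that has silently dropped off the network.
# IFACE is a hypothetical example; substitute the affected interface.
IFACE="${IFACE:-eth0}"

restart_nic() {
  # Requires root on the affected host; commented out here so the
  # sketch is safe to source without side effects.
  # ip link set "$1" down
  # sleep 2
  # ip link set "$1" up
  echo "would bounce interface: $1"
}

restart_nic "$IFACE"
```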
Both servers have the following spec:
B200 M3
2 x 3.3 GHz Xeon CPUs
VIC 1240 CNA adapter
One server is running RHEL 5.8 and the other is running VMware ESXi 5.1 build 914609.
The servers are connected to 2 separate Cisco 6120 FI pairs, but both pairs run 2.0(2q) firmware.
There are no events in the event log. The FIs log to a syslog server, and no errors were reported.
The RHEL 5.8 Linux host driver details are:
[root@host1 ~]$ uname -a
Linux host1 2.6.18-308.1.1.el5 #1 SMP Fri Feb 17 16:51:01 EST 2012 x86_64 x86_64 x86_64 GNU/Linux
[root@host1 ~]$ ethtool -i eth0
driver: enic
version: 2.1.1.24
firmware-version: 2.0(2q)
bus-info: 0000:06:00.0
VMware ESXi 5.1 host:
~ # uname -a
VMkernel host2 5.1.0 #1 SMP Release build-914609 Nov 18 2012 12:01:37 x86_64 GNU/Linux
# ethtool -i vmnic0
driver: enic
version: 1.4.2.15a
firmware-version: 2.0(2q)
bus-info: 0000:06:00.0
Both servers have the Cisco VIC 1240 CNA adapter with a Part ID of UCSB-MLOM-40G-01.
Both servers showed a very similar pattern of behaviour during the failure and were part of the same order of servers in late 2012.
It really does smell of a faulty batch of VIC cards or a firmware/driver issue.
Is anyone else seeing this problem, or does anyone know of any issues with the VIC or these servers?
We run about 50 B200 M3s purchased earlier in 2012 to a different spec and have not seen the same issues.
01-29-2013 10:10 AM
OS logs showing the NIC outage would be useful; all we have so far are word-of-mouth symptoms.
You might want to generate a tech-support bundle for the adapter and see if you can correlate the OS-seen failure with the adapter logs.
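A rough sketch of capturing the OS-side half of that correlation (the bundle path and interface name are arbitrary placeholders; the adapter-side tech-support is generated from UCSM itself):

```shell
# Collect OS-side NIC evidence into a timestamped bundle so it can be
# lined up against the adapter's tech-support logs afterwards.
bundle="/tmp/nic-debug-$(date +%Y%m%d-%H%M%S)"
mkdir -p "$bundle"

uname -a > "$bundle/uname.txt"
dmesg > "$bundle/dmesg.txt" 2>/dev/null || true

# ethtool may not be present everywhere; skip quietly if missing.
# eth0 is a placeholder for the affected interface.
if command -v ethtool >/dev/null 2>&1; then
  ethtool -i eth0 > "$bundle/ethtool-i.txt" 2>&1 || true
  ethtool -S eth0 > "$bundle/ethtool-S.txt" 2>&1 || true
fi

echo "evidence collected in $bundle"
```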
Regards,
Robert
01-30-2013 01:33 AM
Forgot to mention that there is nothing in the OS logs for the Linux host.
Only on the VMware server was there more information (I had to prompt one of my VMware colleagues for this):
Jan 19 04:48:35 2013-01-19T04:48:35.571Z lvuatesx111.igi.ig.local vobd: [netCorrelator] 653089124356us: [vob.net.vmnic.linkstate.down] vmnic vmnic0 linkstate down
Jan 19 04:48:35 2013-01-19T04:48:35.571Z vobd: [netCorrelator] 653089124035us: [vob.net.pg.uplink.transition.down] Uplink: vmnic0 is down. Affected portgroup: VMkernel_Management. 1 uplinks up. Failed criteria: 128
Jan 19 04:48:35 2013-01-19T04:48:35.571Z vobd: [netCorrelator] 653089124028us: [vob.net.pg.uplink.transition.down] Uplink: vmnic0 is down. Affected portgroup: VLAN 3402. 1 uplinks up. Failed criteria: 128
Jan 19 04:48:35 2013-01-19T04:48:35.533Z vmkernel: cpu2:8744)<3>enic 0000:06:00.0: vmnic0: Failed to alloc notify buffer, aborting.
Jan 19 04:48:35 2013-01-19T04:48:35.533Z vmkernel: cpu2:8744)<3>enic: Busy devcmd 21
Jan 19 04:48:35 2013-01-19T04:48:35.533Z vmkernel: cpu2:8744)VMK_VECTOR: 138: Added handler for vector 81, flags 0x10
Jan 19 04:48:35 2013-01-19T04:48:35.533Z vmkernel: cpu2:8744)IRQ: 233: 0x51 <vmnic0-notify> exclusive (entropy source), flags 0x10
Jan 19 04:48:35 2013-01-19T04:48:35.533Z vmkernel: cpu2:8744)VMK_VECTOR: 138: Added handler for vector 73, flags 0x10
For the Linux host, as mentioned, it was a silent disconnect at Layer 2: the physical link was up, but there were no messages or dmesg entries to speak of.
The only thing that appears is NFS timeout errors from mounts on the host.
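For that silent case it can help to capture what the kernel itself thinks of the link at the moment of failure. A small sketch reading sysfs (interface name is a placeholder):

```shell
# Report the kernel's view of a NIC's link state from sysfs.
link_state() {
  local dev="$1"
  if [ -r "/sys/class/net/$dev/operstate" ]; then
    cat "/sys/class/net/$dev/operstate"
  else
    echo "no-such-device"
  fi
}

# Example: the loopback device always exists on Linux; on the affected
# host substitute the real interface, e.g. link_state eth0.
echo "lo operstate: $(link_state lo)"
```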
01-30-2013 06:25 AM
I really believe these are unrelated host OS failures. UCSM is not showing any failed hardware, so going from your logs it's a driver or OS-level issue. I would open a VMware case and have them investigate the vmkernel logs more deeply.
Ask them what "Failed criteria: 128" and "flags 0x10" refer to.
Again, for the Linux host, it looks like the OS-level stack is failing. It doesn't smell like a Layer 1 issue at all.
Robert
04-11-2013 06:16 AM
To close this discussion off, the results of the investigations were:
The RHEL 5.8 host falling off the network was due to stale DHCP leases hanging about unnecessarily. The service profile uses a MAC pool and was allocated a MAC address previously used by another service profile (which had recently been removed, making the address available in the MAC pool again). On its first boot after service profile association, the host broadcast a DHCP request and picked up an IP address that was already in use. This created some very strange behaviour, with ssh sessions hanging and the clustering software getting very confused.
RHEL 6-based DHCP servers do not perform any checks as per the RFC specs and just dish out IPs when asked.
One to look out for in a changing environment.
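A cheap guard against that scenario is to probe an address before trusting it, e.g. with arping's duplicate-address-detection mode. A sketch, with the interface and address as placeholders (192.0.2.10 is from the documentation range):

```shell
# Probe whether an IP address is already in use on the local segment.
# Returns non-zero if another host answers for it.
check_dup_ip() {
  local iface="$1" addr="$2"
  if ! command -v arping >/dev/null 2>&1; then
    echo "arping not installed; cannot probe $addr" >&2
    return 2
  fi
  # -D: duplicate address detection mode, -c 3: three probes
  arping -D -c 3 -I "$iface" "$addr" >/dev/null
}

# Usage on the affected host (placeholders):
# check_dup_ip eth0 192.0.2.10 || echo "address already in use"
```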
As for the VMware issue, it turns out the NIC driver on the VMware host was not updated after a firmware update, resulting in the random reboot with no warning or logs.
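That mismatch is exactly what the `ethtool -i` output earlier in the thread exposes. A small sketch that pulls the driver and firmware versions out of that output (here fed from the sample pasted above; on a live host pipe `ethtool -i vmnic0` in instead) so they can be checked against the UCS interoperability matrix:

```shell
# Extract enic driver and firmware versions from "ethtool -i" output.
# The sample is the ESXi host's output pasted earlier in this thread.
sample='driver: enic
version: 1.4.2.15a
firmware-version: 2.0(2q)
bus-info: 0000:06:00.0'

drv_ver=$(printf '%s\n' "$sample" | awk -F': ' '/^version:/ {print $2}')
fw_ver=$(printf '%s\n' "$sample" | awk -F': ' '/^firmware-version:/ {print $2}')

echo "enic driver $drv_ver on firmware $fw_ver"
```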
Unfortunately, with a lot going on in the environment at the same time, there was suspicion that the two events were related, but they weren't. The VMware problem was easily spotted; the RHEL DHCP issue was a difficult one to nail down.
Thanks for any input.