08-21-2014 12:49 PM - edited 03-01-2019 11:48 AM
Hi,
we have a lot of trouble with a new setup. We brought a new Open-E storage server with QLogic HBAs into our fabric (Nexus 55xx) and connected it through zoning to our B200 M3 blades (2.1.2a and 2.1.3a BIOS) running ESXi 5.1 Update 1b on top.
After the reboot, no connected host sees any LUN from this device anymore.
Only a reboot of the Open-E server brings the LUNs back. Once the device has been rebooted, all hosts immediately regain access to the LUNs.
While rebooting a host, we got these messages:
2014-08-21T15:42:07.882+02:00 vmkernel cpu18:8210)<7>fnic : 4 :: abts cmpl recd. id 236 status FCPIO_TIMEOUT
2014-08-21T15:42:07.882+02:00 vmkernel cpu14:600697)<7>fnic : 4 :: Returning from abort cmd type 2 FAILED
2014-08-21T15:42:14.477+02:00 vmkernel cpu0:600785)<7>fnic : 4 :: Abort Cmd called FCID 0x10300, LUN 0x3 TAG f2 flags 3
2014-08-21T15:42:14.487+02:00 vmkernel cpu10:601534)<7>fnic : 4 :: Abort Cmd called FCID 0x10300, LUN 0x3 TAG f3 flags 3
2014-08-21T15:42:16.491+02:00 vmkernel cpu18:8210)<7>fnic : 4 :: abts cmpl recd. id 242 status FCPIO_TIMEOUT
2014-08-21T15:42:16.491+02:00 vmkernel cpu0:600785)<7>fnic : 4 :: Returning from abort cmd type 2 FAILED
2014-08-21T15:42:16.491+02:00 vmkernel cpu8:8840)WARNING: NMP: nmp_DeviceRequestFastDeviceProbe:237: NMP device "eui.3564333930626263" state in doubt; requested fast path state update...
2014-08-21T15:42:16.491+02:00 vmkernel cpu8:8840)ScsiDeviceIO: 2331: Cmd(0x4124003d20c0) 0x1a, CmdSN 0xaae9 from world 0 to dev "eui.3564333930626263" failed H:0x8 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0.
2014-08-21T15:42:16.491+02:00 vmkernel cpu0:9278)WARNING: ScsiDeviceIO: 7370: READ CAPACITY on device "eui.3564333930626263" from Plugin "NMP" failed. I/O error
2014-08-21T15:42:16.491+02:00 vmkernel cpu0:9278)FSS: 4972: No FS driver claimed device 'eui.3564333930626263:1': Not supported
2014-08-21T15:42:16.492+02:00 vmkernel cpu2:601289)VisorFSRam: 752: inode 4773189553399820288
2014-08-21T15:42:16.492+02:00 vmkernel cpu2:601289)VisorFSRam: 770: ramdisk snmptraps
2014-08-21T15:42:16.492+02:00 vmkernel cpu2:601289)VisorFSRam: 911: for ramdisk snmptraps
2014-08-21T15:42:16.493+02:00 vmkernel cpu0:9278)VC: 1591: Device rescan time 361264 msec (total number of devices 14)
2014-08-21T15:42:16.493+02:00 vmkernel cpu0:9278)VC: 1594: Filesystem probe time 901243 msec (devices probed 14 of 14)
2014-08-21T15:42:16.501+02:00 vmkernel cpu18:8210)<7>fnic : 4 :: abts cmpl recd. id 243 status FCPIO_TIMEOUT
2014-08-21T15:42:16.501+02:00 vmkernel cpu10:601534)<7>fnic : 4 :: Returning from abort cmd type 2 FAILED
2014-08-21T15:42:17.487+02:00 vmkernel cpu14:600697)<7>fnic : 4 :: Abort Cmd called FCID 0x10300, LUN 0x0 TAG f5 flags 3
When we talked to VMware, they pointed us to this KB article:
kb.vmware.com/kb/1033409
But we've already installed these drivers:
net-enic 2.1.2.38-1OEM.500.0.0.472560
scsi-fnic 1.5.0.45-1OEM.500.0.0.472560
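For reference, this is roughly how we verify the installed driver versions on the ESXi 5.1 hosts (standard esxcli / vmkload_mod calls; the grep is just for convenience):

~ # esxcli software vib list | grep -E 'enic|fnic'
~ # vmkload_mod -s fnic | grep Version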
Any help would be appreciated, especially with troubleshooting SAN traffic on the Nexus, the fabric interconnects and maybe the blades themselves...
Thx!
Update: 22.08.14
What we can see now: hosts with 4 vHBAs have problems, and not only with Open-E but also with another storage vendor.
For example, after a reboot a host sees only 2 (of 4) paths to 4 of its 5 LUNs; on one LUN it sees all 4 again.
Hosts with only two vHBAs that are not connected to Open-E seem to have no trouble.
Our guess would be that there is some kind of trouble with the driver?
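In case it helps with reproducing: this is roughly how we count the paths per LUN on a host (the eui device name is the one from the log above; any affected device works):

~ # esxcli storage core path list -d eui.3564333930626263
~ # esxcli storage nmp device list -d eui.3564333930626263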
08-27-2014 06:40 AM
Hello
Your enic / fnic versions are OK.
I don't have any experience with the Open-E storage server.
I can't find it on the interop matrix either:
http://www.cisco.com/c/dam/en/us/td/docs/unified_computing/ucs/interoperability/matrix/r_hcl_B_2-12.pdf
Regarding the 4-vHBA configuration:
- I assume your UCS FI is in end-host mode, and the N5K has NPIV enabled?
- Do you have a dual fabric, with different VSANs?
- Do you see the FLOGI of all 4 vHBAs (show flogi database vsan ... on the N5K)? See the example commands at the end of this post.
- Did you see any strange error messages on the UCS or the N5K?
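Something along these lines should answer the FLOGI question; the VSAN number is only a placeholder for your actual one:

On the N5K (NPIV core):
  show feature | include npiv
  show flogi database vsan 10
  show fcns database vsan 10

On the UCS FI (NPV edge, reached via connect nxos):
  show npv status
  show npv flogi-table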
08-27-2014 12:53 PM
Hi wdey,
yeah, that is another problem with this "matrix"-only thing...
It's just plain old Intel x86 server hardware from a well-known manufacturer, running Open-E as so-called "software-defined storage". The connection is done with plain old, very well known QLA2562 8 Gbit dual-port HBAs, which are connected to our N5K.
Your questions answered:
Regarding the 4-vHBA configuration:
- I assume your UCS FI is in end-host mode, and the N5K has NPIV enabled?
- Do you have a dual fabric, with different VSANs?
YES
- Do you see the FLOGI of all 4 vHBAs (show flogi database vsan ... on the N5K)?
Only two vHBAs are connected to the N5K fabric, but YES, we see them all there.
And of course we see them in our second, non-N5K fabric as well, in different VSANs than on the N5K.
- Did you see any strange error messages on the UCS or the N5K?
NO, none so far. That's the thing. Not even error counters etc. rising.
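For completeness, this is roughly where we looked on the N5K for rising counters and log messages (fc2/1 is just an example interface, not our actual port):

  show interface fc2/1
  show interface fc2/1 counters
  show logging last 50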
08-27-2014 01:09 PM
Hi
More questions:
- What kind is the non-N5K fabric? Brocade, MDS, ...?
- What is the other storage vendor? Is it on the support matrix?
- Is it correct that you have 2 vHBAs for each fabric (A, B), and they are all in different VSANs, e.g. (10, 20) for the N5K fabric and (11, 21) for the non-N5K fabric?
08-27-2014 01:34 PM
OK, more questions ;-)
- What kind is the non-N5K fabric? Brocade, MDS, ...?
QLogic SANbox 58xx
- What is the other storage vendor? Is it on the support matrix?
Nope. It's also a software-defined storage system on a plain old Intel server with QLA2462 HBAs.
- Is it correct that you have 2 vHBAs for each fabric (A, B), and they are all in different VSANs, e.g. (10, 20) for the N5K fabric and (11, 21) for the non-N5K fabric?
Totally correct.
And I should mention that the loss of paths to LUNs with the other vendor has been solved now. We found the problem, which was in the storage (serve/unserve): somebody had presented the LUNs only on some of the ports for those hosts... It took just a few hours to get to the bottom of that...
But the problem above, FCPIO_TIMEOUT, is still there. The other problem was just a side effect that wouldn't have been there if everything had been done correctly. Sorry for that...
And to explain: we need that setup for a migration to Open-E, and a newer version of the other vendor will then also be connected to the N5K fabric. So 4 vHBAs into two different fabrics, just for a migration. But still... we can't get the Open-E stable with that...
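To keep track of which vHBA ends up in which fabric during the migration, we match the vmhba WWPNs on each host against the FLOGI entries on both fabrics; on the ESXi side that is roughly:

~ # esxcli storage core adapter list
~ # esxcfg-scsidevs -a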
08-27-2014 11:01 PM
2014-08-21T15:42:14.487+02:00 vmkernel cpu10:601534)<7>fnic : 4 :: Abort Cmd called FCID 0x10300, LUN 0x3 TAG f3 flags 3
2014-08-21T15:42:16.491+02:00 vmkernel cpu18:8210)<7>fnic : 4 :: abts cmpl recd. id 242 status FCPIO_TIMEOUT
This is an FC protocol issue; FLOGI seems to work OK.
You have to open a TAC case, and I fear the TAC engineer will tell you that this storage system is not on the interop matrix.
I don't know how widely this system is used in the field; you could try to convince product management to do a certification, or at least an RPQ.
Good luck
Walter.
08-28-2014 12:04 AM
Hi Walter,
I agree.
But what else can I do? We installed all the drivers, updates and so on. And still it is not possible to attach just a plain old x86 server running software-defined storage to a UCS environment?
This error must be coming from somewhere. And according to a lot of sites, it also affects EMC, HP and other vendors.
And even with 2.2.1(d) there are errors of the same kind:
https://communities.vmware.com/message/2402255?tstart=0#2402255
Maybe not directly related, but clearly there are problems, and they have to come from somewhere. The question is: from where and why? And who will be responsible for that?
And there are tools from Cisco, like this one:
But somehow even TAC doesn't know about them...
We already opened a case with TAC yesterday. But the first thing they said was:
"It is not in our matrix." Yeah, well, great. A lot of things aren't, in this world.
And not everybody can afford EMC or NetApp as their storage vendor...
But thx for your help!
Timo
08-28-2014 07:59 AM
I agree 100%
I checked the MDS / N5K storage interop matrix as well:
http://www.cisco.com/c/en/us/td/docs/switches/datacenter/mds9000/interoperability/matrix/intmatrx/Matrix1.html
RH and SUSE are there with QLogic HBA support; which Linux derivative is used on this storage system?
To troubleshoot this issue, you need an FC analyzer; the tool you are referring to is more for monitoring.
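If your N5K image supports it (it does on MDS; I would double-check on the 55xx), the built-in fcanalyzer can at least capture locally terminated FC control frames (FLOGI, PLOGI, PRLI, ...) without an external analyzer; for a full frame-level trace you still need a proper analyzer setup. Roughly:

  conf t
   fcanalyzer local brief
  (Ctrl-C stops the capture output)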
The FC protocol is tricky; look at the storage vendors' interop matrices, it's a nightmare.
Did you try to fix all the FC attributes on the ports connecting the storage, so that no negotiation is necessary, e.g. port type = F, speed = ...?
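On the N5K side that would look roughly like this (fc2/3, the speed and the trunk setting are placeholders for your actual values):

  conf t
  interface fc2/3
    switchport mode F
    switchport speed 8000
    switchport trunk mode off
    no shutdown

On the storage side the QLogic HBA BIOS usually lets you pin the topology (point-to-point) and the link speed as well, so nothing is left to auto-negotiation.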
08-28-2014 12:59 PM
Thanks a lot, first of all, for your help.
And thanks for checking the matrix. It is indeed Linux running on the Open-E, but it is surely no RH or SuSE. It's a very new kernel release, as far as I remember from the last boot.
And the FC protocol is a tricky one, you're right. We don't have tools for that.
The storage vendor matrix is indeed a nightmare. 100% agreed. ;-)
Of course I tried to fix all possible attributes, as far as I know them and as far as the documentation covers them. Which is not much, I have to say...
But I will check this again tomorrow and give an update if I find something...
Anyway, thx for your help!