08-21-2014 12:49 PM - edited 03-01-2019 11:48 AM
Hi,
we have a lot of trouble with a new setup. We brought a new Open-E storage server with QLogic HBAs into our fabric (Nexus 55xx) and connected it through zoning to our B200 M3 blades (2.1.2a and 2.1.3a BIOS) running ESXi 5.1 Update 1b on top.
After the reboot, no connected host sees any LUN from this device anymore.
Only a reboot of the Open-E server brings the LUNs back. Once the device has been rebooted, all hosts immediately regain access to the LUNs.
While rebooting a host, we got these messages:
2014-08-21T15:42:07.882+02:00 vmkernel cpu18:8210)<7>fnic : 4 :: abts cmpl recd. id 236 status FCPIO_TIMEOUT
2014-08-21T15:42:07.882+02:00 vmkernel cpu14:600697)<7>fnic : 4 :: Returning from abort cmd type 2 FAILED
2014-08-21T15:42:14.477+02:00 vmkernel cpu0:600785)<7>fnic : 4 :: Abort Cmd called FCID 0x10300, LUN 0x3 TAG f2 flags 3
2014-08-21T15:42:14.487+02:00 vmkernel cpu10:601534)<7>fnic : 4 :: Abort Cmd called FCID 0x10300, LUN 0x3 TAG f3 flags 3
2014-08-21T15:42:16.491+02:00 vmkernel cpu18:8210)<7>fnic : 4 :: abts cmpl recd. id 242 status FCPIO_TIMEOUT
2014-08-21T15:42:16.491+02:00 vmkernel cpu0:600785)<7>fnic : 4 :: Returning from abort cmd type 2 FAILED
2014-08-21T15:42:16.491+02:00 vmkernel cpu8:8840)WARNING: NMP: nmp_DeviceRequestFastDeviceProbe:237: NMP device "eui.3564333930626263" state in doubt; requested fast path state update...
2014-08-21T15:42:16.491+02:00 vmkernel cpu8:8840)ScsiDeviceIO: 2331: Cmd(0x4124003d20c0) 0x1a, CmdSN 0xaae9 from world 0 to dev "eui.3564333930626263" failed H:0x8 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0.
2014-08-21T15:42:16.491+02:00 vmkernel cpu0:9278)WARNING: ScsiDeviceIO: 7370: READ CAPACITY on device "eui.3564333930626263" from Plugin "NMP" failed. I/O error
2014-08-21T15:42:16.491+02:00 vmkernel cpu0:9278)FSS: 4972: No FS driver claimed device 'eui.3564333930626263:1': Not supported
2014-08-21T15:42:16.492+02:00 vmkernel cpu2:601289)VisorFSRam: 752: inode 4773189553399820288
2014-08-21T15:42:16.492+02:00 vmkernel cpu2:601289)VisorFSRam: 770: ramdisk snmptraps
2014-08-21T15:42:16.492+02:00 vmkernel cpu2:601289)VisorFSRam: 911: for ramdisk snmptraps
2014-08-21T15:42:16.493+02:00 vmkernel cpu0:9278)VC: 1591: Device rescan time 361264 msec (total number of devices 14)
2014-08-21T15:42:16.493+02:00 vmkernel cpu0:9278)VC: 1594: Filesystem probe time 901243 msec (devices probed 14 of 14)
2014-08-21T15:42:16.501+02:00 vmkernel cpu18:8210)<7>fnic : 4 :: abts cmpl recd. id 243 status FCPIO_TIMEOUT
2014-08-21T15:42:16.501+02:00 vmkernel cpu10:601534)<7>fnic : 4 :: Returning from abort cmd type 2 FAILED
2014-08-21T15:42:17.487+02:00 vmkernel cpu14:600697)<7>fnic : 4 :: Abort Cmd called FCID 0x10300, LUN 0x0 TAG f5 flags 3
When we talked to VMware, they pointed us to this KB article:
kb.vmware.com/kb/1033409
But we've already installed these drivers:
net-enic 2.1.2.38-1OEM.500.0.0.472560
scsi-fnic 1.5.0.45-1OEM.500.0.0.472560
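For reference, this is roughly how we verify the installed driver versions on the ESXi 5.1 hosts (standard esxcli / vmkload_mod calls; the grep is just for convenience):

~ # esxcli software vib list | grep -E 'enic|fnic'
~ # vmkload_mod -s fnic | grep Version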
Any help would be appreciated, especially with troubleshooting SAN traffic on the Nexus, the fabric interconnects and maybe the blades themselves...
Thx!
Update: 22.08.14
What we can see now: hosts with 4 vHBAs have problems, and not only with Open-E but also with another storage vendor.
For example, after a reboot a host sees only 2 (of 4) paths to 4 of its 5 LUNs; on one LUN it sees all 4 again.
Hosts with only two vHBAs that are not connected to Open-E seem to have no trouble.
Our guess would be that there is some kind of trouble with the driver?
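In case it helps with reproducing: this is roughly how we count the paths per LUN on a host (the eui device name is the one from the log above; any affected device works):

~ # esxcli storage core path list -d eui.3564333930626263
~ # esxcli storage nmp device list -d eui.3564333930626263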
08-27-2014 06:40 AM
Hello
Your enic / fnic versions are OK.
I don't have any experience with the Open-E storage server.
I can't find it on the interop matrix either:
http://www.cisco.com/c/dam/en/us/td/docs/unified_computing/ucs/interoperability/matrix/r_hcl_B_2-12.pdf
Regarding the 4-vHBA configuration:
- I assume your UCS FI is in end-host mode, and the N5K has NPIV enabled?
- Do you have a dual fabric, with different VSANs?
- Do you see the FLOGI of all 4 vHBAs (show flogi database vsan ... on the N5K)? See the example commands at the end of this post.
- Did you see any strange error messages on the UCS or the N5K?
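Something along these lines should answer the FLOGI question; the VSAN number is only a placeholder for your actual one:

On the N5K (NPIV core):
  show feature | include npiv
  show flogi database vsan 10
  show fcns database vsan 10

On the UCS FI (NPV edge, reached via connect nxos):
  show npv status
  show npv flogi-table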
08-27-2014 12:53 PM
Hi wdey,
yeah, that is another problem with this "matrix"-only thing...
It's just plain old Intel x86 server hardware from a well-known manufacturer, running Open-E as so-called "software-defined storage". The connection is done with plain old, very well known QLA2562 8 Gbit dual-port HBAs, which are connected to our N5K.
Your questions answered:
Regarding the 4-vHBA configuration:
- I assume your UCS FI is in end-host mode, and the N5K has NPIV enabled?
- Do you have a dual fabric, with different VSANs?
YES
- Do you see the FLOGI of all 4 vHBAs (show flogi database vsan ... on the N5K)?
Only two vHBAs are connected to the N5K fabric, but YES, we see them all there.
And of course we see them in our second, non-N5K fabric as well, in different VSANs than on the N5K.
- Did you see any strange error messages on the UCS or the N5K?
NO, none so far. That's the thing. Not even error counters etc. rising.
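For completeness, this is roughly where we looked on the N5K for rising counters and log messages (fc2/1 is just an example interface, not our actual port):

  show interface fc2/1
  show interface fc2/1 counters
  show logging last 50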
08-27-2014 01:09 PM
Hi
More questions:
- What kind is the non-N5K fabric? Brocade, MDS, ...?
- What is the other storage vendor? Is it on the support matrix?
- Is it correct that you have 2 vHBAs for each fabric (A, B), and they are all in different VSANs, e.g. (10, 20) for the N5K fabric and (11, 21) for the non-N5K fabric?
08-27-2014 01:34 PM
OK, more questions ;-)
- What kind is the non-N5K fabric? Brocade, MDS, ...?
QLogic SANbox 58xx
- What is the other storage vendor? Is it on the support matrix?
Nope. It's also a software-defined storage system on a plain old Intel server with QLA2462 HBAs.
- Is it correct that you have 2 vHBAs for each fabric (A, B), and they are all in different VSANs, e.g. (10, 20) for the N5K fabric and (11, 21) for the non-N5K fabric?
Totally correct.
And I should mention that the loss of paths to LUNs with the other vendor has been solved now. We found the problem, which was in the storage (serve/unserve): somebody had presented the LUNs only on some of the ports for those hosts... It took just a few hours to get to the bottom of that...
But the problem above, FCPIO_TIMEOUT, is still there. The other problem was just a side effect that wouldn't have been there if everything had been done correctly. Sorry for that...
And to explain: we need that setup for a migration to Open-E, and a newer version of the other vendor will then also be connected to the N5K fabric. So 4 vHBAs into two different fabrics, just for a migration. But still... we can't get the Open-E stable with that...
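To keep track of which vHBA ends up in which fabric during the migration, we match the vmhba WWPNs on each host against the FLOGI entries on both fabrics; on the ESXi side that is roughly:

~ # esxcli storage core adapter list
~ # esxcfg-scsidevs -a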
08-27-2014 11:01 PM
2014-08-21T15:42:14.487+02:00 vmkernel cpu10:601534)<7>fnic : 4 :: Abort Cmd called FCID 0x10300, LUN 0x3 TAG f3 flags 3
2014-08-21T15:42:16.491+02:00 vmkernel cpu18:8210)<7>fnic : 4 :: abts cmpl recd. id 242 status FCPIO_TIMEOUT
This is an FC protocol issue; FLOGI seems to work OK.
You have to open a TAC case, and I fear the TAC engineer will tell you that this storage system is not on the interop matrix.
I don't know how widely this system is used in the field; you could try to convince product management to do a certification, or at least an RPQ.
Good luck
Walter.
08-28-2014 12:04 AM
Hi Walter,
I agree.
But what else can I do? We installed all the drivers, updates and so on. And still it is not possible to attach just a plain old x86 server running software-defined storage to a UCS environment?
This error must be coming from somewhere. And according to a lot of sites, it also affects EMC, HP and other vendors.
And even with 2.2.1(d) there are errors of the same kind:
https://communities.vmware.com/message/2402255?tstart=0#2402255
Maybe not directly related, but clearly there are problems, and they have to come from somewhere. The question is: from where and why? And who will be responsible for that?
And there are tools from Cisco, like this one:
But somehow even TAC doesn't know about them...
We already opened a case with TAC yesterday. But the first thing they said was:
"It is not in our matrix." Yeah, well, great. A lot of things aren't, in this world.
And not everybody can afford EMC or NetApp as their storage vendor...
But thx for your help!
Timo
08-28-2014 07:59 AM
I agree 100%
I checked the MDS / N5K storage interop matrix as well:
http://www.cisco.com/c/en/us/td/docs/switches/datacenter/mds9000/interoperability/matrix/intmatrx/Matrix1.html
RH and SUSE are there with QLogic HBA support; which Linux derivative is used on this storage system?
To troubleshoot this issue, you need an FC analyzer; the tool you are referring to is more for monitoring.
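If your N5K image supports it (it does on MDS; I would double-check on the 55xx), the built-in fcanalyzer can at least capture locally terminated FC control frames (FLOGI, PLOGI, PRLI, ...) without an external analyzer; for a full frame-level trace you still need a proper analyzer setup. Roughly:

  conf t
   fcanalyzer local brief
  (Ctrl-C stops the capture output)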
The FC protocol is tricky; look at the storage vendors' interop matrices, it's a nightmare.
Did you try to fix all the FC attributes on the ports connecting the storage, so that no negotiation is necessary, e.g. port type = F, speed = ...?
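On the N5K side that would look roughly like this (fc2/3, the speed and the trunk setting are placeholders for your actual values):

  conf t
  interface fc2/3
    switchport mode F
    switchport speed 8000
    switchport trunk mode off
    no shutdown

On the storage side the QLogic HBA BIOS usually lets you pin the topology (point-to-point) and the link speed as well, so nothing is left to auto-negotiation.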
08-28-2014 12:59 PM
Thanks a lot, first of all, for your help.
And thanks for checking the matrix. It is indeed Linux running on the Open-E, but it is surely no RH or SuSE. It's a very new kernel release, as far as I remember from the last boot.
And the FC protocol is a tricky one, you're right. We don't have tools for that.
The storage vendor matrix is indeed a nightmare. 100% agreed. ;-)
Of course I tried to fix all possible attributes, as far as I know them and as far as the documentation covers them. Which is not much, I have to say...
But I will check this again tomorrow and give an update if I find something...
Anyway, thx for your help!