We have a pretty serious issue that we are struggling to get to the bottom of. I have had a Cisco TAC case open for close to a week now (and have just opened a NetApp case), and we effectively have a UCS environment that is totally down. Luckily this is not a production system yet, but the issue is very concerning; if it were, the entire environment would be dead in the water! What is even more perplexing is that we have an exact mirror of our environment at a DR site that is working perfectly!
This environment was working fine up to the point the UCS chassis and Nexus switches were rebooted. The NetApp has not been rebooted (and we don't want to, as we need to identify a root cause). Since then, we have been totally unable to get any UCS blades (in either of the two chassis) to see any LUNs.
We have stripped down the environment as much as possible, so we now have effectively this:-
Cisco UCS B200 series blade -> FCoE -> UCS Fabric Interconnects -> 4-port Fibre Channel port-channel trunk -> Nexus 5k -> FCoE -> NetApp SAN with CNA
There is a port channel between the Nexus 5k and the NetApp, but we have shut down all ports apart from one. The vfc interface is bound to the port channel.
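For anyone following along, the vfc-to-port-channel binding on the Nexus 5k side looks roughly like this (interface and VSAN numbers are illustrative, not our actual config):

```
! NX-OS (Nexus 5k) -- interface/VSAN numbers are illustrative
feature fcoe
interface port-channel100
  switchport mode trunk
interface vfc100
  bind interface port-channel100
  no shutdown
vsan database
  vsan 10 interface vfc100
```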
The issue originally manifested as a boot-from-SAN issue, but we have now installed ESX locally on a blade, and if we do a rescan on the HBA that is connected to the NetApp, we get an error similar to this in the VMware messages log:-
could not open device xxxxxxxxxxxxxxxxxxx for probing permission denied
If we do a rescan on an HBA that is not connected (because we have shut down one of the fabrics), it rescans fine (but of course doesn't see any LUNs).
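For anyone wanting to reproduce this, the rescan can also be driven from the ESXi shell rather than the vSphere client (the adapter name vmhba1 is illustrative; check your own adapter list first):

```
# ESXi shell -- vmhba1 is illustrative; list adapters first
esxcli storage core adapter list
esxcli storage core adapter rescan --adapter vmhba1
# then see which paths/devices (if any) turned up:
esxcli storage core path list
```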
On the NetApp, we see the initiator logged into the igroup fine:-
(logged in on: vtic, 4a)
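This is from the 7-Mode CLI, where the login state can be checked like this (the igroup name is illustrative):

```
# NetApp Data ONTAP 7-Mode -- igroup name is illustrative
fcp show initiators
igroup show esx_boot_igroup
```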
We don't believe this is a VMware issue, as we can also reproduce it by attempting to boot from SAN (where the BIOS of the vHBA is performing the connection), and the LUN can be accessed if we present it via iSCSI!
Chris what is the TAC SR #? I'll have a look into it with the assigned engineer.
This level of complex issue is best dealt with via the TAC case. There are too many disparate devices, OSes, and other factors to guess at potential problems.
We can update the post here once the resolution has been found.
OK, I have the solution! It was actually NetApp that found it, although it is actually a Cisco bug. There is a bug in the Cisco Nexus 5k in version 5.0(2)N1(1). The impact is that it prevents Fibre Channel data commands from being passed between the UCS (or any other FC initiator) and a NetApp SAN when using FCoE and a CNA. What is totally confusing is that FC logins ARE allowed, so your environment looks totally healthy, but you can't connect to any LUNs.
This is a pretty horrible bug, as from the description it is far from clear what impact it could have on your environment. However, there are no field notices or software advisories against this image; it is just quietly mentioned in the release notes. The potential impact of this bug is pretty scary, as it can clearly take down an entire UCS environment. If you know what is causing it, though, it is easily fixed by bouncing the FC interfaces on the NetApp, and ultimately by upgrading the Nexus 5k.
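For anyone else hitting this, the "bounce" on the NetApp side can be sketched as follows (7-Mode syntax; the adapter name 0c is illustrative for your FCoE target port):

```
# NetApp Data ONTAP 7-Mode -- adapter 0c is illustrative
fcp config 0c down
fcp config 0c up
# confirm the initiators log back in and, this time, can actually pass data commands
fcp show initiators
```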
Symptom: With Cisco NX-OS Release 5.0(2)N1(1), when a VFC interface is shut down, in addition to the FIP Clear Virtual Link message, an FCoE LOGO is sent to the CNA. Some CNA vendors may have problems processing the FCoE LOGO. As a result, the VFC interface may not come back after a no-shut.
Here is the NetApp bug info:-
Bug ID: 467760
Title: FCoE port hangs following a port shutdown on a Cisco Nexus switch
Duplicate of Bug: -
Severity: 2 - System barely usable
Bug Status: Not Fixed
Product: Unknown
Bug Type: Unknown
Description:
An FCoE port can hang following a port shutdown on a Cisco Nexus switch. The hang results from receiving and mishandling an incorrect ELS LOGO operation. This problem has been observed with Nexus switches running NX-OS 5.0(2)N1(1).

Workaround:
Reset the FCoE port using the command 'fcp config

Notes:
The Nexus switch sends an FCoE-encoded ELS LOGO upon a switchport shutdown, instead of a FIP LOGO. The FCoE port accepts the FCoE-encoded ELS LOGO, at which time the FCoE port can no longer respond to data packets, only logins and aborts. Since the FCoE port will remain online in this state, it could cause MPIO to fail to detect a path failure.
A complete list of releases where this bug is fixed is available here.
This bug is fixed in 5.0(3)N1 which is currently available on CCO.
I'm having the 5.0(3) release notes updated to reflect this.
We are running a rev much more recent than the one listed here, and I can confirm the issue is still happening. Flapping the interface did resolve our issue.
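If you suspect you are hitting the same thing, it is worth confirming the exact NX-OS release and that logins look healthy while data I/O fails before flapping anything (standard Nexus 5k show commands; vfc100 is illustrative):

```
! NX-OS (Nexus 5k) -- vfc100 is illustrative
show version
show flogi database
show fcoe database
show interface vfc100
```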