09-07-2013 03:34 PM - edited 03-01-2019 11:14 AM
Hi all,
We went through a firmware upgrade on our UCS environment from 2.1.1e to 2.1.2c on Thursday. The updates appeared to go well, but almost immediately, we've been seeing storage performance issues on 5 of our 9 blades (ESXi 5.1U1 servers). The issue seems to be isolated to the HBA on fabric A for the 5 hosts in question. The HBA for fabric B seems OK.
I'm hoping for some tips on how to troubleshoot this from the UCS side. I've been able to verify that everything is unchanged and appears fine everywhere in the chain from the blades through to the storage array, but I just don't have solid tools for troubleshooting an issue like this.
We've got 2x 6248UP talking to a 5548UP which talks to a VNX5500 via FC. Our 9 blades (B200M3) are spread across 3 chassis (3 per chassis). Blades with problems are in all three chassis, so it's not isolated to a specific chassis. I have a VSAN for each fabric, connected to an 8 Gb FC port on each SP (i.e. VSAN 200 talks to port 0 on SPA and SPB, and VSAN 201 talks to port 1 on SPA and SPB). Each FI uses a 2-port port channel for FCoE traffic to the 5548.
I appear to have only two symptoms that are visible to me:
1) ScsiDeviceIO failures in ESXi logs. On affected systems, these are happening several times/second. It appears that IO eventually goes through, but performance is degraded.
2) PowerPath reports path failures in the ESXi logs and error counters via rpowermt.
I'm able to put the HBA for fabric A into standby mode using PowerPath to force all IO to fabric B, and the issues appear to clear (no ScsiDeviceIO errors, no path failures), so we're still functional.
There are no errors in UCS manager, nothing visible on the storage array or 5548 switch.
I will likely be opening a support case for this issue, but ahead of or alongside that, can anyone provide some feedback on how to clearly troubleshoot conditions where FCoE HBAs or connections appear to be having issues, but aren't completely down? I'd like to strengthen my knowledge in this area as it's my weakest in managing our environment and I don't like being in a position where I'm unable to help myself.
Thanks!
Jason
09-07-2013 09:04 PM
OK, in looking at the interface statistics for the FCoE uplink interface from fabric A, I see the following:
svcs-ucs-A(nxos)# show interface vfc 729
vfc729 is trunking (Not all VSANs UP on the trunk)
Bound interface is port-channel200
Hardware is Virtual Fibre Channel
Port WWN is 22:d8:00:2a:6a:08:a9:3f
Admin port mode is NP, trunk mode is on
snmp link state traps are enabled
Port mode is TNP
Port vsan is 200
Trunk vsans (admin allowed and active) (1,200)
Trunk vsans (up) (200)
Trunk vsans (isolated) ()
Trunk vsans (initializing) (1)
1 minute input rate 13653512 bits/sec, 1706689 bytes/sec, 1477 frames/sec
1 minute output rate 11278520 bits/sec, 1409815 bytes/sec, 976 frames/sec
441177740 frames input, 516663417296 bytes
302726 discards, 0 errors
392931870 frames output, 600864345020 bytes
0 discards, 0 errors
last clearing of "show interface" counters never
Interface last changed at Thu Sep 5 11:11:27 2013
That seems like a large number of discards.
On FI B, I see 0 discards:
vfc730 is trunking (Not all VSANs UP on the trunk)
Bound interface is port-channel201
Hardware is Virtual Fibre Channel
Port WWN is 22:d9:00:2a:6a:08:b4:bf
Admin port mode is NP, trunk mode is on
snmp link state traps are enabled
Port mode is TNP
Port vsan is 201
Trunk vsans (admin allowed and active) (1,201)
Trunk vsans (up) (201)
Trunk vsans (isolated) ()
Trunk vsans (initializing) (1)
1 minute input rate 37581640 bits/sec, 4697705 bytes/sec, 3933 frames/sec
1 minute output rate 30401584 bits/sec, 3800198 bytes/sec, 2701 frames/sec
1021397482 frames input, 1281781952976 bytes
0 discards, 0 errors
691947891 frames output, 981663490120 bytes
0 discards, 0 errors
last clearing of "show interface" counters never
Interface last changed at Thu Sep 5 10:39:30 2013
So, I guess the next question would be, how do I dig into the reason for the discards? It's certainly a more bounded question.
Jason
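To put the discard counts above in proportion, here is a minimal Python sketch that parses the frame and discard lines from pasted "show interface vfc" text and computes discards as a fraction of frames. The function names (parse_vfc_counters, discard_ratio, discards_per_second) are made up for illustration and are not part of any Cisco tool; the regular expressions assume the counter lines look exactly like the output pasted in this thread. Since the counters have never been cleared, the last helper assumes you would take two readings some seconds apart to see whether the discards are still climbing.

```python
import re

def parse_vfc_counters(text):
    """Parse frame/discard counters from pasted 'show interface vfc' text.

    Assumes each 'N frames input|output, M bytes' line is followed by a
    'D discards, E errors' line, as in the output quoted in this thread.
    """
    counters = {}
    direction = None
    for raw in text.splitlines():
        line = raw.strip()
        m = re.match(r"(\d+) frames (input|output), (\d+) bytes", line)
        if m:
            direction = m.group(2)
            counters[direction] = {"frames": int(m.group(1)),
                                   "bytes": int(m.group(3))}
            continue
        m = re.match(r"(\d+) discards, (\d+) errors", line)
        if m and direction:
            counters[direction]["discards"] = int(m.group(1))
            counters[direction]["errors"] = int(m.group(2))
            direction = None
    return counters

def discard_ratio(c):
    """Discards as a fraction of frames for one direction."""
    return c["discards"] / c["frames"] if c["frames"] else 0.0

def discards_per_second(earlier, later, seconds):
    """Growth rate between two readings of the same discard counter."""
    return (later - earlier) / seconds

# Counter lines copied from the vfc729 (fabric A) and vfc730 (fabric B) output.
FABRIC_A = """\
441177740 frames input, 516663417296 bytes
302726 discards, 0 errors
392931870 frames output, 600864345020 bytes
0 discards, 0 errors
"""
FABRIC_B = """\
1021397482 frames input, 1281781952976 bytes
0 discards, 0 errors
691947891 frames output, 981663490120 bytes
0 discards, 0 errors
"""

for name, text in (("A", FABRIC_A), ("B", FABRIC_B)):
    c = parse_vfc_counters(text)
    print(f"fabric {name}: input discard ratio {discard_ratio(c['input']):.4%}")
```

A lifetime ratio only says how many frames were dropped since the counters started, not whether drops are ongoing, so comparing two readings taken a minute apart is probably the more useful number to bring to a support case.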
09-08-2013 03:58 PM
Hi Jason,
What do you see for 'Fcs (errors)' and 'Rcv (errors)' in UCSM for the server ports and FC/FCoE uplink ports? Also, what do we see if we take a look at the counters from here:
-Mastin-B /org/service-profile/vhba # show stats
Vnic Stats:
Time Collected: 2013-09-08T21:36:04.585
Monitored Object: sys/chassis-2/blade-4/adaptor-1/host-fc-1/vnic-stats
Suspect: No
Bytes Rx (bytes): 0
Packets Rx (packets): 0
Bytes Tx (bytes): 0
Packets Tx (packets): 0
Errors Tx (errors): 0
Errors Rx (errors): 0
Dropped Tx (packets): 0
Dropped Rx (packets): 0
Thresholded: 0
09-11-2013 01:45 AM
Hello JASON,
Please see the link below for the complete steps. I hope it helps.
Regards
keshav