Tips for troubleshooting FCoE

sredniv2009
Level 1

Hi all,

We upgraded the firmware on our UCS environment from 2.1.1e to 2.1.2c on Thursday.  The updates appeared to go well, but almost immediately afterward we began seeing storage performance issues on 5 of our 9 blades (ESXi 5.1U1 servers).  The issue seems to be isolated to the HBA on fabric A for the 5 hosts in question.  The HBA for fabric B seems OK.

I'm hoping for some tips on how to troubleshoot this from the UCS side.  I've been able to verify that everything is unchanged and appears fine everywhere in the chain from the blades through to the storage array, but I just don't have solid tools for troubleshooting an issue like this.

We've got 2x 6248UP talking to a 5548UP, which talks to a VNX5500 via FC.  Our 9 blades (B200M3) are spread across 3 chassis (3 per chassis).  Blades with problems are in all three chassis, so the issue is not isolated to specific ones.  I have a VSAN for each fabric, connected to an 8 Gb FC port on each SP (i.e. VSAN 200 talks to port 0 on SPA and SPB, and VSAN 201 talks to port 1 on SPA and SPB).  Each FI uses a 2-port port channel for FCoE traffic to the 5548.

I appear to have only two symptoms that are visible to me:

1)  ScsiDeviceIO failures in ESXi logs.  On affected systems, these are happening several times/second.  It appears that IO eventually goes through, but performance is degraded.

2)  PowerPath reports path failures in the ESXi logs and error counters via rpowermt.

I'm able to put the HBA for fabric A into standby mode using PowerPath to force all IO to fabric B, and the issues appear to clear (no ScsiDeviceIO errors, no path failures), so we're still functional.

There are no errors in UCS Manager, and nothing is visible on the storage array or the 5548 switch.

I will likely be opening a support case for this issue, but ahead of or alongside that, can anyone provide some feedback on how to clearly troubleshoot conditions where FCoE HBAs or connections appear to be having issues but aren't completely down?  I'd like to strengthen my knowledge in this area, as it's my weakest in managing our environment and I don't like being in a position where I'm unable to help myself.

Thanks!

Jason

3 Replies

sredniv2009
Level 1

OK, in looking at the interface statistics for the FCoE uplink interface from fabric A, I see the following:

svcs-ucs-A(nxos)# show interface vfc 729
vfc729 is trunking (Not all VSANs UP on the trunk)
    Bound interface is port-channel200
    Hardware is Virtual Fibre Channel
    Port WWN is 22:d8:00:2a:6a:08:a9:3f
    Admin port mode is NP, trunk mode is on
    snmp link state traps are enabled
    Port mode is TNP
    Port vsan is 200
    Trunk vsans (admin allowed and active) (1,200)
    Trunk vsans (up)                       (200)
    Trunk vsans (isolated)                 ()
    Trunk vsans (initializing)             (1)
    1 minute input rate 13653512 bits/sec, 1706689 bytes/sec, 1477 frames/sec
    1 minute output rate 11278520 bits/sec, 1409815 bytes/sec, 976 frames/sec
      441177740 frames input, 516663417296 bytes
        302726 discards, 0 errors
      392931870 frames output, 600864345020 bytes
        0 discards, 0 errors
    last clearing of "show interface" counters never
    Interface last changed at Thu Sep  5 11:11:27 2013

That seems like a large number of discards.
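To put that count in proportion, the input counters can be turned into a ratio. The following is only a minimal Python sketch, assuming the `show interface vfc` output has been pasted into a string; the names `parse_counters` and `discard_ratio` and the regular expression are illustrative and not part of any Cisco tool.

```python
import re

# Counter lines copied from the vfc729 output on fabric A.
SAMPLE = """\
      441177740 frames input, 516663417296 bytes
        302726 discards, 0 errors
      392931870 frames output, 600864345020 bytes
        0 discards, 0 errors
"""

def parse_counters(text):
    """Return {'input': (frames, discards), 'output': (frames, discards)}.

    \\s+ spans the line break (and any blank lines) between the
    "frames" line and the "discards" line that follows it.
    """
    pattern = re.compile(
        r"(\d+) frames (input|output), \d+ bytes\s+(\d+) discards, \d+ errors"
    )
    return {
        direction: (int(frames), int(discards))
        for frames, direction, discards in pattern.findall(text)
    }

def discard_ratio(frames, discards):
    """Fraction of frames that were discarded (0.0 if no frames were seen)."""
    return discards / frames if frames else 0.0

counters = parse_counters(SAMPLE)
in_frames, in_discards = counters["input"]
ratio = discard_ratio(in_frames, in_discards)
print(f"input discard ratio: {ratio:.4%}")
```

The ratio is small in relative terms (well under 0.1% of input frames), but FCoE traffic is expected to be lossless, so any nonzero count on one fabric and zero on the other is still worth chasing.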

On FI B, I see 0 discards:

vfc730 is trunking (Not all VSANs UP on the trunk)
    Bound interface is port-channel201
    Hardware is Virtual Fibre Channel
    Port WWN is 22:d9:00:2a:6a:08:b4:bf
    Admin port mode is NP, trunk mode is on
    snmp link state traps are enabled
    Port mode is TNP
    Port vsan is 201
    Trunk vsans (admin allowed and active) (1,201)
    Trunk vsans (up)                       (201)
    Trunk vsans (isolated)                 ()
    Trunk vsans (initializing)             (1)
    1 minute input rate 37581640 bits/sec, 4697705 bytes/sec, 3933 frames/sec
    1 minute output rate 30401584 bits/sec, 3800198 bytes/sec, 2701 frames/sec
      1021397482 frames input, 1281781952976 bytes
        0 discards, 0 errors
      691947891 frames output, 981663490120 bytes
        0 discards, 0 errors
    last clearing of "show interface" counters never
    Interface last changed at Thu Sep  5 10:39:30 2013

So, I guess the next question would be: how do I dig into the reason for the discards?  It's certainly a more bounded question.
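One thing the totals alone can't show, since the counters have never been cleared, is whether the discards are still accumulating or are left over from an earlier event. A rough Python sketch of comparing two samples of the same counters follows. The first sample is the vfc729 reading above; the second sample and the 60-second interval are made-up placeholder values for illustration only, and `discard_rate` is a hypothetical helper name.

```python
def discard_rate(first, second, interval_s):
    """Compare two (frames_input, discards) samples taken interval_s apart.

    Returns (discards_per_second, discards_per_frame) for the interval.
    """
    d_frames = second[0] - first[0]
    d_discards = second[1] - first[1]
    if interval_s <= 0 or d_frames < 0 or d_discards < 0:
        # Negative deltas would mean the counters were cleared or wrapped.
        raise ValueError("samples are out of order or counters were reset")
    per_second = d_discards / interval_s
    per_frame = d_discards / d_frames if d_frames else 0.0
    return per_second, per_frame

# First sample: values from the vfc729 output in this thread.
sample_1 = (441177740, 302726)
# Second sample: HYPOTHETICAL values, not taken from the real system.
sample_2 = (441266360, 302786)

per_second, per_frame = discard_rate(sample_1, sample_2, interval_s=60)
print(f"{per_second:.2f} discards/sec, {per_frame:.4%} of frames in the interval")
```

If the delta stays at zero while the host is still logging ScsiDeviceIO errors, the discards are probably historical and the search should move elsewhere in the path.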

Jason

Hi Jason,

What do you see for 'Fcs (errors)' and 'Rcv (errors)' in UCSM for the server ports and FC/FCoE uplink ports?  Also, what do we see if we take a look at the counters from here:

-Mastin-B /org/service-profile/vhba # show stats
Vnic Stats:
    Time Collected: 2013-09-08T21:36:04.585
    Monitored Object: sys/chassis-2/blade-4/adaptor-1/host-fc-1/vnic-stats
    Suspect: No
    Bytes Rx (bytes): 0
    Packets Rx (packets): 0
    Bytes Tx (bytes): 0
    Packets Tx (packets): 0
    Errors Tx (errors): 0
    Errors Rx (errors): 0
    Dropped Tx (packets): 0
    Dropped Rx (packets): 0
    Thresholded: 0

Hello Jason,

Please see the link below for the complete steps. I hope it helps.

http://www.cisco.com/en/US/products/ps10281/products_configuration_example09186a0080afd130.shtml#stepsCreatevHBA

Regards

keshav
