Tracing Fibre Channel Aborts on the UCS

Qiese Dides · 05-07-2016

Introduction

This guide helps you trace Fibre Channel aborts on the UCS when you have an upstream switch such as an MDS. This information helps pinpoint the exact origin of an abort, which in turn saves time in finding the root cause. We will find aborts in two ways: from the tech support file and live from the CLI.

Tracing FC Aborts on the UCS

1. The first way to find aborts is to download a tech support file for the specific blade in question. If you do not know how to download a tech support file, please visit the following link:

http://www.cisco.com/c/en/us/support/docs/servers-unified-computing/ucs-manager/115023-visg-tsfiles-00.html

Once the file is downloaded, unzip the chassis log, then unzip the Mezz_Techsupport.Tar file in question. For example, for Chassis 1 Blade 2 you would download the Chassis 1 tech support file and then unzip the Mezz21_TechSupport.Tar file. Once that file is unzipped, open the obfl.tar file in Notepad++, press CTRL + F to search for the keyword “abort”, and gather the relevant entries (pay attention to the timestamps).
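If you would rather script the search than use a text editor, the following is a minimal Python sketch of the same keyword search. It assumes obfl.tar is a regular tar archive whose members are plain-text logs; the function name find_abort_lines is made up for this illustration and is not part of any Cisco tool.

```python
import io
import tarfile


def find_abort_lines(fileobj, keyword="abort"):
    """Return (member_name, line) pairs for every line containing keyword.

    fileobj is a binary, seekable file object for a tar archive
    (assumption: the archive members are plain-text log files).
    """
    hits = []
    with tarfile.open(fileobj=fileobj, mode="r:*") as archive:
        for member in archive.getmembers():
            if not member.isfile():
                continue  # skip directories, links, and so on
            handle = archive.extractfile(member)
            if handle is None:
                continue
            text = handle.read().decode("utf-8", errors="replace")
            for line in text.splitlines():
                # Case-insensitive match, like a default CTRL + F search
                if keyword.lower() in line.lower():
                    hits.append((member.name, line))
    return hits
```

To use it, open the archive in binary mode, for example with open("obfl.tar", "rb") as f, and pass f to find_abort_lines. Each hit keeps the member name, so you can tell which log file inside the archive the abort came from.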

2. The second way is to tail the log files from an SSH session to your fabric interconnect.

We will be connecting to Chassis 1 Blade 2 Adapter 1 in the screenshot below. The command show-log shows the last 50 entries in the adapter log file; from here, keep pressing the space bar to page through the entries and look for aborts.

3. Below is an example of some aborts I received in my test environment with a VIC 1240. These are exactly the type of abort messages you will see in the logs or in the show-log output. What I have bolded is the source ID (s_id) and the destination ID (d_id) of the abort, which show where it is being generated from (either the upstream storage or the UCS). The numbers in bold are the FCIDs.

160210-22:32:49.292933 ecom.ecom_main ecom(4:1): abort called for exch 5a6d, status 3 rx_id 0 s_stat 0x0 xmit_recvd 0x0 burst_offset 0x0 sgl_err 0x0 last_param 0x0 last_seq_cnt 0x0 tot_bytes_exp 0x400 h_seq_cnt 0x0 exch_type 0x1 s_id 0x40666 d_id 0x411ef host_tag 0x4

160210-22:32:49.292933 ecom.ecom_main ecom(4:1): abort called for exch 5a36, status 3 rx_id 0 s_stat 0x0 xmit_recvd 0x0 burst_offset 0x0 sgl_err 0x0 last_param 0x0 last_seq_cnt 0x0 tot_bytes_exp 0x400 h_seq_cnt 0x0 exch_type 0x1 s_id 0x40666 d_id 0x411ef host_tag 0x50

160210-22:32:49.292933 ecom.ecom_main ecom(4:1): abort called for exch 5411, status 1 rx_id 0 s_stat 0x0 xmit_recvd 0x0 burst_offset 0x0 sgl_err 0x0 last_param 0x0 last_seq_cnt 0x0 tot_bytes_exp 0x10 h_seq_cnt 0x0 exch_type 0x0 s_id 0x40400 d_id 0x40400 host_tag 0xa1
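When there are many abort lines, it can help to pull the s_id and d_id out programmatically. The sketch below assumes every abort line follows the layout of the three examples above ("abort called for exch ..., status ..." followed by name/value pairs). The function parse_abort is a hypothetical helper written for this post, and it only collects the 0x-prefixed fields.

```python
import re

# Matches the fixed prefix of an abort message (layout taken from the examples above)
ABORT_RE = re.compile(r"abort called for exch (?P<exch>[0-9a-fA-F]+), status (?P<status>\d+)")
# Matches "name 0xVALUE" pairs; fields without a 0x prefix (such as rx_id) are skipped
FIELD_RE = re.compile(r"(\w+) (0x[0-9a-fA-F]+)")


def parse_abort(line):
    """Return a dict of fields for an abort log line, or None if it is not one."""
    match = ABORT_RE.search(line)
    if match is None:
        return None
    fields = {name: int(value, 16) for name, value in FIELD_RE.findall(line[match.end():])}
    fields["exch"] = match.group("exch")  # kept as text, exactly as logged
    fields["status"] = int(match.group("status"))
    return fields


sample = (
    "160210-22:32:49.292933 ecom.ecom_main ecom(4:1): abort called for exch 5a6d, "
    "status 3 rx_id 0 s_stat 0x0 xmit_recvd 0x0 burst_offset 0x0 sgl_err 0x0 "
    "last_param 0x0 last_seq_cnt 0x0 tot_bytes_exp 0x400 h_seq_cnt 0x0 "
    "exch_type 0x1 s_id 0x40666 d_id 0x411ef host_tag 0x4"
)
record = parse_abort(sample)
print(hex(record["s_id"]), hex(record["d_id"]))  # prints: 0x40666 0x411ef
```

The two printed values are the FCIDs to look up on the upstream switch.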

4. Now that we have the FCID information, we go to the upstream switch and run show fcns database. This displays the FCID database, and with a little reading we can find exactly where the source ID is coming from. Below is a snippet of the show fcns database output from the MDS for this example. The source ID 0x40400 from the third abort matches the entry 0x040400, so we can tell that the source is EMC and can now open a ticket with EMC to figure out why this is occurring.

0x040400 N 50:00:14:42:b0:55:e4:02 (EMC) scsi-fcp
0x040360 N 50:00:14:42:a0:57:12:02 (EMC) scsi-fcp
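Note that the adapter log prints the FCID without the leading zero (0x40400) while the MDS prints six hex digits (0x040400), so a plain text search for one form can miss the other. The following sketch compares them as numbers instead. It assumes the fcns rows have the column layout shown in the snippet above (FCID, type, PWWN, then vendor and FC4 type); parse_fcns and lookup_fcid are illustrative names only.

```python
def parse_fcns(output):
    """Map each FCID (as an int) to the remaining columns of its fcns row."""
    table = {}
    for line in output.splitlines():
        parts = line.split()
        # Data rows start with an FCID such as 0x040400; skip everything else
        if len(parts) >= 3 and parts[0].lower().startswith("0x"):
            table[int(parts[0], 16)] = {
                "type": parts[1],
                "pwwn": parts[2],
                "details": " ".join(parts[3:]),
            }
    return table


def lookup_fcid(table, fcid_text):
    """Look up an FCID from the adapter log, ignoring leading zeros."""
    return table.get(int(fcid_text, 16))


fcns_output = """\
0x040400 N 50:00:14:42:b0:55:e4:02 (EMC) scsi-fcp
0x040360 N 50:00:14:42:a0:57:12:02 (EMC) scsi-fcp
"""
fcns_table = parse_fcns(fcns_output)
entry = lookup_fcid(fcns_table, "0x40400")
print(entry["pwwn"], entry["details"])  # prints: 50:00:14:42:b0:55:e4:02 (EMC) scsi-fcp
```

An FCID that is missing from the table returns None, which usually just means the snippet you captured does not include that device.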

Reading FC Aborts on the UCS

160210-22:32:49.292933 ecom.ecom_main ecom(4:1): abort called for exch 5a6d, status 3 rx_id 0 s_stat 0x1 xmit_recvd 0x0 burst_offset 0x0 sgl_err 0x0 last_param 0x0 last_seq_cnt 0x0 tot_bytes_exp 0x400 h_seq_cnt 0x0 exch_type 0x1 s_id 0x40666 d_id 0x411ef host_tag 0x4

1. s_stat => 0x1 (at least one frame has been received)
2. exch_type => 0x1 (exchange is ingress and is active)
3. Total bytes expected => 0x400
4. Received => 0x0
5. burst_offset => 0x0
6. Host SCSI layer's IO tag for this request => 0x4
7. Source ID => 0x40666
8. Dest Target ID => 0x411ef
9. Seq ID => 0x0
10. RX ID => 0

In this log message, the initiator is still expecting more data from the target, and that data is being dropped either at the target or somewhere on the path. This causes a timeout at the initiator end.
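As a quick worked example of that reading, the sketch below converts the two counters from the log line into decimal and computes how much data is still outstanding. It assumes that tot_bytes_exp and xmit_recvd are both byte counts, as the list above suggests; summarize_transfer is a name invented for this illustration.

```python
def summarize_transfer(tot_bytes_exp, xmit_recvd):
    """Compare expected and received byte counts taken from an abort line.

    Both arguments are the hex strings exactly as they appear in the log
    (assumption: both counters are expressed in bytes).
    """
    expected = int(tot_bytes_exp, 16)
    received = int(xmit_recvd, 16)
    return {
        "expected_bytes": expected,
        "received_bytes": received,
        "missing_bytes": expected - received,
    }


summary = summarize_transfer("0x400", "0x0")
print(summary["missing_bytes"])  # prints: 1024
```

Here all 0x400 (1024) expected bytes are still missing when the abort is issued, which fits the picture of data being lost before it reaches the initiator.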

david.martin41 · 02-01-2017

If the aborts are sourced from the UCS hosts, what could be the cause? Is this the result of an FNIC bug?

Wes Austin · 02-01-2017

Hello,

If the abort is sourced from the UCS, it just means that we issued the abort. It indicates that the blade is not able to read from or write to storage. I would investigate the following:

1. Physical layer issues on the path to storage. (SFP, Cabling)

2. Correct fnic driver version.

3. Health of the storage LUN.

HTH,

Wes