08-30-2011 01:16 PM
Hi,
I have a customer who lost all connectivity from the ESX host for both networking and FCoE because (as the title suggests) the interfaces were error disabled. This happened across all 8 dual-ported, dual-homed CNAs at the same time. Does anyone have any idea what causes this error? They are using ESX 4.0 and are running a Nexus 5020 with 4.2(1)N2(1a).
Thanks,
Thom
09-29-2011 07:56 AM
Thom,
I am seeing this same sort of issue with a Nexus 5548P running 5.0(2)N1(1). The servers at issue are Oracle M5000 servers with QLogic CNAs. The CNAs in these servers are configured as dual port, dual homed. The issue occurs when the servers are rebooted.
I have noticed that if I issue a "show interface brief" command, the Ethernet port shows a state of "unknown" and the vfc port shows a state of "errdisable". A shut and no shut of the interfaces seems to clear this up, and the server will connect correctly after a reboot.
Have you had any luck in troubleshooting this issue?
Thanks,
John
09-29-2011 12:51 PM
Hi John,
The issue, according to QLogic support, was the driver on ESX for the CNAs. I was suspicious of the CNAs only because all of the ports were error disabled within seconds of one another. So far there haven't been any shutdowns, but we may have to wait some time to feel safe that this was the resolution.
Thanks,
Thom
10-17-2011 02:26 PM
Hi guys, what was the result? We are having the exact same problem. We upgraded all drivers and replaced what seemed to be a faulty CNA. It had been fine for about 2 weeks, and then last night all the ports err-disabled again! GRRRRRRRRRR.
10-18-2011 05:36 AM
It was a driver issue for us. We haven't seen the issue since the update.
Thom
10-19-2011 06:13 AM
DCBX Type-Length-Values (TLVs) are packaged within an LLDP frame which is exchanged between the switch and the CNA. One such TLV, the Control Sub-TLV, is used for ACKs and is sequence based. For example, the switch sends this Control Sub-TLV with a SeqNo of 1 and an AckNo of 2. The host is supposed to invert this and send an LLDP frame with a Control Sub-TLV with a SeqNo of 2 and an AckNo of 1.
We expect this exchange every 30 seconds from the host, and if the switch does not see it for 100 PDUs (100 times 30, which is 3000 seconds, or 50 minutes), the switch error disables the port with the following error:
2011 May 13 12:03:23 CSX_5020_A1 %ETHPORT-2-IF_DOWN_ERROR_DISABLED: Interface Ethernet115/1/17 is down (Error disabled. Reason:DCX-No ACK in 100 PDUs)
2011 May 13 12:03:27 CSX_5020_A1 %ETHPORT-2-IF_DOWN_ERROR_DISABLED: Interface Ethernet116/1/16 is down (Error disabled. Reason:DCX-No ACK in 100 PDUs)
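The 50-minute figure follows directly from the two numbers quoted above. As a quick sanity check, here is a minimal Python sketch of that arithmetic; the constant and function names are made up for illustration, and the 30-second interval and 100-PDU limit are simply the values stated in this post, not values read from any switch.

```python
# Values taken from the explanation above; names are illustrative only.
PDU_INTERVAL_S = 30      # host is expected to answer every 30 seconds
MAX_UNACKED_PDUS = 100   # "No ACK in 100 PDUs"


def errdisable_timeout_s(interval_s: int = PDU_INTERVAL_S,
                         max_pdus: int = MAX_UNACKED_PDUS) -> int:
    """Seconds of missing ACKs before the port is error disabled."""
    return interval_s * max_pdus


timeout_s = errdisable_timeout_s()
timeout_min = timeout_s / 60
print(f"{timeout_s} seconds = {timeout_min:.0f} minutes")
```

This is why several posters in this thread see the ports go down on a 50-minute cycle.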
Some commands on the switch that help narrow down the root cause:
F340.24.10-5548-1# show lldp interface ethernet 1/22
Interface Information:
Enable (tx/rx/dcbx): Y/Y/Y Port Mac address: 00:05:73:ab:29:bd
Peer's LLDP TLVs:
Type Length Value
---- ------ -----
001 007 040000c9 9d2372
002 007 030000c9 9d2372
003 002 0078
006 045 456d756c 6578204f 6e65436f 6e6e6563 74203130 4762204d 756c7469
2066756e 6374696f 6e204164 61707465 72
007 004 00800080
127 055 001b2102 020a0000 00000002 00000001 04110000 c0000001 00003232
00000000 00000206 060000c0 00080808 0a0000c0 00890600 1b2108
000 000
F340.24.10-5548-1# show lldp dcbx interface ethernet 1/22
Local DCBXP Control information:
Operation version: 00 Max version: 00 Seq no: 1 Ack no: 2 <<---Our sequence # and Ack #
Type/
Subtype Version En/Will/Adv Config
003/000 000 Y/N/Y 0808
004/000 000 Y/N/Y 8906001b21 08
002/000 000 Y/N/Y 0001000032 32000000 00000002
Peer's DCBXP Control information:
Operation version: 00 Max version: 00 Seq no: 2 Ack no: 1 <<---Peer sequence # and Ack # should be reversed.
Type/ Max/Oper
Subtype Version En/Will/Err Config
002/000 000/000 Y/Y/N 0001000032 32000000 00000002
003/000 000/000 Y/Y/N 0808
004/000 000/000 Y/Y/N 8906001b21 08
F340.24.10-5548-1#
The root cause of this problem in most cases is a misbehaving CNA/server or incorrect firmware/driver on the CNA.
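If you need to check many ports, the Seq/Ack comparison can be scripted. The following is a rough Python sketch that pulls the two "Seq no / Ack no" pairs out of pasted "show lldp dcbx interface" output and checks that each side acknowledges the other's sequence number. The function name is hypothetical, and the rule it applies (peer Ack equals local Seq, and local Ack equals peer Seq) is an assumption inferred from the outputs posted in this thread, not something taken from Cisco documentation.

```python
import re

# Matches lines such as "... Seq no: 1 Ack no: 2" in the pasted CLI output.
SEQ_ACK = re.compile(r"Seq no:\s*(\d+)\s+Ack no:\s*(\d+)")


def dcbx_acks_match(show_output: str) -> bool:
    """Return True if local and peer Seq/Ack numbers acknowledge each other.

    Assumes the first Seq/Ack pair in the text is the local side and the
    second is the peer, as in the output of 'show lldp dcbx interface'.
    """
    pairs = [(int(seq), int(ack)) for seq, ack in SEQ_ACK.findall(show_output)]
    if len(pairs) < 2:
        raise ValueError("expected both local and peer Seq/Ack lines")
    (local_seq, local_ack), (peer_seq, peer_ack) = pairs[:2]
    return peer_ack == local_seq and local_ack == peer_seq


healthy = """
Local DCBXP Control information:
Operation version: 00 Max version: 00 Seq no: 1 Ack no: 2
Peer's DCBXP Control information:
Operation version: 00 Max version: 00 Seq no: 2 Ack no: 1
"""
print(dcbx_acks_match(healthy))
```

A port where the peer section is missing (only the local pair is printed) raises ValueError here, which is itself worth flagging for that port.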
10-19-2011 02:29 PM
Thanks for that. Checked on a working switch and was the same as your example.
This is our output from a problem switch. The ACK seems to be "1"
Local DCBXP Control information:
Operation version: 00 Max version: 00 Seq no: 1 Ack no: 1
Type/
Subtype Version En/Will/Adv Config
003/000 000 Y/N/Y 0808
004/000 000 Y/N/Y 8906001b21 08
002/000 000 Y/N/Y 0001000032
32000000 00000002
Peer's DCBXP Control information:
Operation version: 00 Max version: 00 Seq no: 1 Ack no: 0
Type/ Max/Oper
Subtype Version En/Will/Err Config
002/000 000/000 Y/Y/N 0001000032
32000000 00000002
003/000 000/000 Y/Y/N 0801
004/000 000/000 Y/Y/N 8906001b21
10-21-2011 07:29 AM
Hi Simon
That is a problem for sure. It's OK for the ACK and sequence numbers to be the same. Here is one such example from my lab:
24.10.5020B.1# show lldp dcbx interface ethernet 1/16
Local DCBXP Control information:
Operation version: 00 Max version: 00 Seq no: 4 Ack no: 4
Type/
Subtype Version En/Will/Adv Config
004/000 000 Y/N/Y 8906001b21 08
003/000 000 Y/N/Y 0808
002/000 000 Y/N/Y 0001000032 32000000 00000002
Peer's DCBXP Control information:
Operation version: 00 Max version: 00 Seq no: 4 Ack no: 4
Type/ Max/Oper
Subtype Version En/Will/Err Config
002/000 000/000 Y/Y/N 0001000032 32000000 00000002
003/000 000/000 Y/Y/N 0801
004/000 000/000 Y/Y/N 8906001b21 08891400 1b2108
24.10.5020B.1#
Now the question is whether the CNA is sending an incorrect ACK or the N5k is interpreting it incorrectly. If you can sniff the Ethernet interface, the capture would point to the culprit. Or you could use ethanalyzer if you know the source MAC of the CNA. Here is an example:
ethanalyzer local interface inbound-hi det display-filter eth.src==00:00:c9:9d:23:72
Wireshark/ethanalyzer does not decode the DCBX contents of the LLDP frames, but if you can send the captures to me, I have a way to figure it out.
Thanks
-Prashanth
10-26-2011 12:24 AM
Hi Prashanth,
I have a similar issue, but this is on my port channel between the two N5Ks. It goes down every 50 minutes. One side runs NX-OS 4.0 and the other runs 4.2. Can you please help me with this?
Regards,
Ajith
10-26-2011 02:53 PM
Ajith
4.0 is a very old NX-OS release, and I am not sure what was supported in it that could explain the problem you are seeing. I suggest that you upgrade both of your 5Ks to a newer 4.2 or 5.0(3) release, and you should be fine.
Thanks
-Prashanth
10-26-2011 02:30 PM
Finally had a response from Dell, as QLogic had washed their hands of us. Dell make a firmware change on the cards when they build them and have finally admitted to problems. They sent us an upgrade, which we have applied, and we now see the correct info: Seq 1, Ack 2 and then Seq 2, Ack 1 on all the ports. Will keep you updated. Cheers for the help.
10-26-2011 02:52 PM
Hello Simon
Thanks for the update with the resolution. This is a common case generator. If you do not mind, can you let everyone know the driver/firmware you were running and the versions you upgraded to, and mark this question as resolved? This would help other community members seeing a similar issue.
Thanks
-Prashanth
01-20-2012 07:48 AM
We have similar issues with a storage server:
switch3# show lldp dcbx interface e114/1/2
Local DCBXP Control information:
Operation version: 00 Max version: 00 Seq no: 1 Ack no: 0
Does this mean that the server isn't sending back an ACK? It err-disabled the server interface with the same error message. There are actually 2 servers (a failover pair) with the same drivers etc., but the err-disable happened on only one server!
01-24-2012 06:52 PM
You can disable LLDP and it will work fine.
06-03-2012 06:25 PM
I'm getting the same issue as the OP.
Nexus 5548s with HPB22 FEX running 5.1(3)N2(1) into HP G7 blade servers with Emulex OCe10100 CNA adapters.
Every 50 minutes we get:
VDC-1 %$ %ETHPORT-2-IF_DOWN_ERROR_DISABLED: Interface Ethernet100/1/1 is down (Error disabled. Reason:DCX-No ACK in 100 PDUs)
switch# sh lldp interface e100/1/1
Interface Information:
Enable (tx/rx/dcbx): Y/Y/Y Port Mac address: 70:ca:9b:f4:b3:42
Peer's LLDP TLVs:
Type Length Value
---- ------ -----
001 007 04e83935 2b5125
002 007 03e83935 2b5125
003 002 0078
006 045 456d756c 6578204f 6e65436f 6e6e6563 74203130 4762204d 756c7469
2066756e 6374696f 6e204164 61707465 72
007 004 00800080
127 055 001b2102 020a0000 00000001 00000000 04110000 c0000000 10003232
00000000 00000206 06000000 00100808 0a0000c0 000cbc01 1b2110
switch# sh lldp dcbx interface e100/1/1
Local DCBXP Control information:
Operation version: 00 Max version: 00 Seq no: 1 Ack no: 0
That "Ack no: 0" indicates some kind of problem on the host, right?
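One way to see what the host itself is advertising is to decode the type 127 TLV from the "show lldp interface" output above. Below is a rough Python sketch of that. The function name is hypothetical, and the byte layout is an assumption based on the CEE DCBX base protocol: OUI 00:1b:21, subtype 02, then a Control Sub-TLV with a 7-bit type and 9-bit length, one byte each for operating and max version, and 4 bytes each for SeqNo and AckNo. That layout is consistent with the working example earlier in the thread, where the peer shows Seq 2 / Ack 1 and the TLV contains 00000002 00000001, but treat it as a sketch rather than a reference decoder.

```python
import struct

DCBX_CEE_OUI = bytes.fromhex("001b21")  # assumed OUI for CEE DCBX TLVs
DCBX_CEE_SUBTYPE = 0x02                 # assumed subtype for CEE DCBX
CONTROL_SUBTLV_TYPE = 1                 # assumed type of the Control Sub-TLV


def decode_dcbx_control(tlv_hex: str) -> dict:
    """Decode Seq/Ack numbers from the hex value of a type 127 LLDP TLV."""
    data = bytes.fromhex("".join(tlv_hex.split()))
    if len(data) < 16:
        raise ValueError("TLV value too short for a control sub-TLV")
    if data[:3] != DCBX_CEE_OUI or data[3] != DCBX_CEE_SUBTYPE:
        raise ValueError("not a CEE DCBX TLV")
    # Sub-TLV header: upper 7 bits are the type, lower 9 bits the length.
    (header,) = struct.unpack_from("!H", data, 4)
    sub_type, sub_len = header >> 9, header & 0x1FF
    if sub_type != CONTROL_SUBTLV_TYPE or sub_len < 10:
        raise ValueError("first sub-TLV is not a control sub-TLV")
    oper_ver, max_ver, seq_no, ack_no = struct.unpack_from("!BBII", data, 6)
    return {"oper_version": oper_ver, "max_version": max_ver,
            "seq_no": seq_no, "ack_no": ack_no}


# Value of TLV 127 as pasted from Ethernet100/1/1 above.
peer_tlv = """
001b2102 020a0000 00000001 00000000 04110000 c0000000 10003232
00000000 00000206 06000000 00100808 0a0000c0 000cbc01 1b2110
"""
print(decode_dcbx_control(peer_tlv))
```

Under that layout assumption, the bytes pasted for Ethernet100/1/1 decode to Seq 1 / Ack 0 from the peer, which would mean the adapter is not acknowledging the switch's sequence number. A packet capture is still the way to confirm it.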