07-05-2019 07:15 AM
Just to start, we have been running a large UCS environment for our VMware infrastructure for 5 years with no real issues. Now, however, we're building a new environment for our large XenServer deployment to replace an old HP environment.
The new UCS environment consists of:
2 pcs Fabric Interconnect 6332-16UP
7 pcs 5108 Chassis (56 pcs B200 M5 blades)
14 pcs 2304 IOM Modules
The fabrics are configured with 4 unified ports each, each connected at 16 Gbps to a Brocade SAN fabric.
Because Cisco and Brocade do not support configuring a port channel between them for the SAN, each link being just an ISL, we have paired the FC uplinks into redundancy pairs with pinning; each pin pair consists of one FC port from Fabric A and one from Fabric B. We did it this way because the system otherwise has no working failback, meaning that over time we might end up with several of the FC links saturated while others sit unused. Having to manually rebalance everything after a link outage is painful and time-consuming.
All of the chassis are connected to the FIs with 40 Gbps ports, one to each fabric.
We have created server templates for our Xen hosts, each with 6 NICs and 8 HBAs. HBA0 and HBA1 are the first redundancy pair, HBA2 and 4 the next one, etc. XenServer should be configured correctly with multipathing. We did an outage test, and everything seemed to work as intended.
The XenServers all boot from SAN, and the XenServer version we're running is 7.6. We have configured 40 XenServer hosts at the moment.
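For readers sizing up the setup, here is a rough sketch of the login arithmetic implied by the numbers above. It assumes the FIs run in FC end-host mode (which the pinning suggests, but the post does not state) and that the vHBAs are spread evenly across both fabrics and all uplinks; the function name is just an illustration.

```python
# Rough login-count arithmetic for the setup described in the post.
HOSTS = 40              # XenServer hosts configured so far (from the post)
VHBAS_PER_HOST = 8      # HBAs per server template (from the post)
FABRICS = 2             # Fabric A and Fabric B
FC_UPLINKS_PER_FI = 4   # unified ports used as FC uplinks per FI (from the post)


def logins_per_uplink(hosts, vhbas_per_host, fabrics, uplinks_per_fi):
    """Return (total vHBAs, vHBAs per fabric, vHBAs per uplink).

    Assumption: vHBAs are split evenly across fabrics and uplinks.
    """
    total = hosts * vhbas_per_host
    per_fabric = total // fabrics
    per_uplink = per_fabric // uplinks_per_fi
    return total, per_fabric, per_uplink


if __name__ == "__main__":
    total, per_fabric, per_uplink = logins_per_uplink(
        HOSTS, VHBAS_PER_HOST, FABRICS, FC_UPLINKS_PER_FI
    )
    print(f"{total} vHBAs total, {per_fabric} per fabric, {per_uplink} per uplink")
```

With 40 hosts this gives 320 vHBAs, 160 per fabric and 40 per uplink; with all 56 blades in service the same arithmetic gives 56 per uplink. Whether that is anywhere near a per-port login limit on the SAN side is something to check with the Brocade team.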
Now, here's the problem. When the environment is running and we start to power on all the VMs, the environment suddenly starts to crash. This seems to happen at random, but mostly after reaching 500-700 powered-on VMs. The XenServers are no longer connected to the SAN and start to crash. At this point the SAN guys can see that none of the links from the FIs are logged in. All the FC uplinks to the FIs are down. So somehow the FIs are being dropped by the Brocade, or the FIs themselves are dropping the connection… Or it might be XenServer-related, although I can't see how XenServer could actually cause the FI to drop an uplink. A XenServer error causing the host itself to drop a NIC or an HBA would be more plausible than the same error causing the FI to start dropping uplinks and thus crash the entire system. We have found nothing in the logs that states what is causing this issue.
In the FIs' syslogs we can see the following, and these entries always show up when this failure happens, almost down to the second:
2019 Jun 28 13:58:33 UCSP-13-A %VIM-2-NIV_VIF_LIST_SET: Received an unsuccessful VIF_LIST_SET response from adaptor for VIFINDEX: 1019
2019 Jun 28 13:58:33 UCSP-13-A %VIM-2-NIV_VIF_LIST_SET: Received an unsuccessful VIF_LIST_SET response from adaptor for VIFINDEX: 1024
2019 Jun 28 13:58:34 UCSP-13-A %VIM-2-NIV_VIF_LIST_SET: Received an unsuccessful VIF_LIST_SET response from adaptor for VIFINDEX: 1019
2019 Jun 28 13:58:36 UCSP-13-A %VIM-2-NIV_VIF_LIST_SET: Received an unsuccessful VIF_LIST_SET response from adaptor for VIFINDEX: 1025
(There are several more of them, all identical except for the VIFINDEX.)
Since these log entries always occur in connection with the problem, I know they are somehow related. But I don't know whether I'm looking at the culprit or just a symptom caused by some other issue.
We have experienced this problem on both firmware 4.02d and 4.04b, and we have tried both XenServer 8 and XenServer 7.6, with no luck. The Brocade logs don't seem to give any good clues either. The SAN guys say it seems the FI just dropped the connection. And we find nothing in the XenServer logs other than entries related to the server losing disk connectivity to the boot volume, nothing as to WHY this happens.
Has anyone in the community run into the same issue, or have any clues?
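To line these entries up against the time of an outage, a small parser can help. The following is only a sketch built around the four syslog lines quoted above; the regular expression and the function names are my own, and the month parsing assumes an English locale.

```python
import re
from collections import Counter
from datetime import datetime

# Matches lines such as:
# 2019 Jun 28 13:58:33 UCSP-13-A %VIM-2-NIV_VIF_LIST_SET: ... VIFINDEX: 1019
LINE_RE = re.compile(
    r"^(?P<ts>\d{4} \w{3} +\d{1,2} \d{2}:\d{2}:\d{2}) (?P<host>\S+) "
    r"%VIM-2-NIV_VIF_LIST_SET: .*VIFINDEX: (?P<vif>\d+)"
)

SAMPLE = """\
2019 Jun 28 13:58:33 UCSP-13-A %VIM-2-NIV_VIF_LIST_SET: Received an unsuccessful VIF_LIST_SET response from adaptor for VIFINDEX: 1019
2019 Jun 28 13:58:33 UCSP-13-A %VIM-2-NIV_VIF_LIST_SET: Received an unsuccessful VIF_LIST_SET response from adaptor for VIFINDEX: 1024
2019 Jun 28 13:58:34 UCSP-13-A %VIM-2-NIV_VIF_LIST_SET: Received an unsuccessful VIF_LIST_SET response from adaptor for VIFINDEX: 1019
2019 Jun 28 13:58:36 UCSP-13-A %VIM-2-NIV_VIF_LIST_SET: Received an unsuccessful VIF_LIST_SET response from adaptor for VIFINDEX: 1025
"""


def parse_vif_errors(lines):
    """Return a list of (timestamp, host, vifindex) for matching lines."""
    events = []
    for line in lines:
        m = LINE_RE.match(line.strip())
        if m:
            ts = datetime.strptime(m.group("ts"), "%Y %b %d %H:%M:%S")
            events.append((ts, m.group("host"), int(m.group("vif"))))
    return events


def summarize(events):
    """Return (count per VIFINDEX, first timestamp, last timestamp)."""
    if not events:
        return Counter(), None, None
    per_vif = Counter(vif for _, _, vif in events)
    times = [ts for ts, _, _ in events]
    return per_vif, min(times), max(times)


if __name__ == "__main__":
    per_vif, first, last = summarize(parse_vif_errors(SAMPLE.splitlines()))
    print(dict(per_vif), first, last)
```

Running the same summary over the full syslog of both FIs would show whether the errors start before or after the uplinks lose their fabric login, which is one way to tell a cause from a symptom.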
07-07-2019 06:53 AM - edited 07-07-2019 08:22 AM
Does this happen to one Xenserver at a time, only one side (A|B), and then eventually seem to spread, or all at the same time?
Are your A and B FC fabrics completely separate?
When you say 'drop an uplink' you mean the FI FC port goes into an error disabled state or down?
It is possible for an erroring host/vHBA to send a bunch of CRC'd frames that could cause the pinned uplink port (or the connected device's port) to auto-disable, although this shouldn't take an entire environment down unless we are hitting some sort of slow-drain scenario...
I'm assuming you have no actual FI FC port counter errors, or FI eth port counter errors going to the IOMs?
There are a lot of ports that should probably be checked: VIC adapter-level stats, IOM ports, and the FIs' Eth and FC ports.
What kind of errors have your FC ports logged from show int fc x/y?
What's the output of #show int fc x/y transceiver detail on your FI ports?
Is there a particular host and pinned FC uplink that seems to go down first?
What version of FNIC driver are you running?
You have a TAC case open for this I'm assuming?
Kirk...
07-08-2019 12:21 AM
"Does this happen to one Xenserver at a time, only one side (A|B), and then eventually seem to spread, or all at the same time?"
We have not been able to identify this fully, as the system seems to run fine from my perspective (infrastructure admin) until suddenly the XenServer guys start to crowd my office because their servers are unresponsive. At that point none of the FI uplink ports to the FC network are logged in. To see how it starts I would have to sit and monitor this system extensively, and sadly I also have other environments to monitor, leaving me short of time.
"Are your A and B FC fabrics completely separate?"
Yes, the A and B FC fabrics should be completely separate, but I will double-check that with the FC admin team. They are the same fabrics used by our VMware environment, which runs just fine and has done so for years.
"When you say 'drop an uplink' you mean the FI FC port goes into an error disabled state or down?"
No, there are no errors reported on the FI port itself. The port shows "green" and "up". The problem is that it's no longer logged in to the FC fabric (lost FLOGI). So the connection is down and carries no traffic, even though the link is still up. In fact, to get the system running again, my only option is to disable and then re-enable the unified FC ports to force a new login to the fabric. The service profile VIFs, however, show the HBAs as error-disabled and down.
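The "link up but not logged in" condition is easy to miss when only the port status is watched. Below is a minimal sketch of the check, assuming the link state has been collected from the FI and the login count from the Brocade side by whatever means are available; the `UplinkState` type, its field names, and the sample values are hypothetical and not taken from any Cisco or Brocade tool.

```python
from dataclasses import dataclass


@dataclass
class UplinkState:
    name: str            # label for the uplink, e.g. "A fc1/1"
    link_up: bool        # link state as shown on the FI
    fabric_logins: int   # logins the SAN switch currently sees on this link


def stale_uplinks(states):
    """Return names of uplinks whose link is up but which have no fabric login."""
    return [s.name for s in states if s.link_up and s.fabric_logins == 0]


if __name__ == "__main__":
    # Hypothetical snapshot for illustration only.
    snapshot = [
        UplinkState("A fc1/1", link_up=True, fabric_logins=0),
        UplinkState("A fc1/2", link_up=True, fabric_logins=41),
        UplinkState("B fc1/1", link_up=False, fabric_logins=0),
    ]
    print(stale_uplinks(snapshot))
```

A port that is physically down is deliberately not reported here, since that case already shows up as a link fault; the point is to catch the silent case described above.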
"I'm assuming you have no actual FI FC port counter errors, or FI eth port counter errors going to the IOMs?"
That is correct. There are no errors reported and no indication of any problems with the connections. It seems as if the FC fabric just decides to evict the ports from the fabric for no apparent reason. There is no indication on the FC fabric itself either as to why this happens, so from that point of view it seems as if the FI just decided to stop sending traffic.
"Is there a particular host and pinned FC uplink that seems to go down first?"
Not that we have been able to identify.
"You have a TAC case open for this I'm assuming?"
Yes we have a TAC case open, but have not been able to solve the problem yet.
I will look into the output of the command you specified and post the results shortly.
07-08-2019 06:14 AM - edited 07-08-2019 10:18 AM
If all FC ports on both FIs are losing FLOGI with the Brocade ports, I would tend to suspect something in the Brocade fabric.
This may ultimately require some sort of FC analyzer/TAP (i.e. Finisar) inserted between the FIs and brocade to determine root cause.
Are there any other non-UCS FC devices connected to the same linecards on the brocade switches?
Kirk...
07-08-2019 03:27 AM
These are the results of #show int fc x/y transceiver detail on Fabric A
fc1/1 sfp is present
Name is CISCO-AVAGO
Manufacturer's part number is AFBR-57F5PZ-CS1
Revision is B2
Serial number is AVJ2245J0YR
FC Transmitter type is short wave laser w/o OFC (SN)
FC Transmitter supports short distance link length
Transmission medium is multimode laser with 62.5 um aperture (M6)
Supported speeds are - Min speed: 4000 Mb/s, Max speed: 16000 Mb/s
Nominal bit rate is 14000 Mb/s
Link length supported for 50/125um fiber is 30 m
Link length supported for 62.5/125um fiber is 10 m
Link length supported for 50/125um OM3 fiber is 100 m
Cisco extended id is unknown (0x0)
No tx fault, no rx loss, no sync exists, diagnostic monitoring type is 0x68
SFP Diagnostics Information:
----------------------------------------------------------------------------
                                  Alarms                 Warnings
                             High        Low         High        Low
----------------------------------------------------------------------------
Temperature  38.23 C        75.00 C     -5.00 C     70.00 C      0.00 C
Voltage       3.28 V         3.63 V      2.97 V      3.46 V      3.13 V
Current       7.40 mA       10.50 mA     2.50 mA    10.50 mA     2.50 mA
Tx Power     -2.25 dBm       1.70 dBm  -13.01 dBm   -1.30 dBm   -9.03 dBm
Rx Power     -2.48 dBm --    3.00 dBm  -16.02 dBm    0.00 dBm  -11.94 dBm
Transmit Fault Count = 3
----------------------------------------------------------------------------
Note: ++ high-alarm; + high-warning; -- low-alarm; - low-warning
fc1/2 sfp is present
Name is CISCO-AVAGO
Manufacturer's part number is AFBR-57F5PZ-CS1
Revision is B2
Serial number is AVJ2245J0KU
FC Transmitter type is short wave laser w/o OFC (SN)
FC Transmitter supports short distance link length
Transmission medium is multimode laser with 62.5 um aperture (M6)
Supported speeds are - Min speed: 4000 Mb/s, Max speed: 16000 Mb/s
Nominal bit rate is 14000 Mb/s
Link length supported for 50/125um fiber is 30 m
Link length supported for 62.5/125um fiber is 10 m
Link length supported for 50/125um OM3 fiber is 100 m
Cisco extended id is unknown (0x0)
No tx fault, no rx loss, no sync exists, diagnostic monitoring type is 0x68
SFP Diagnostics Information:
----------------------------------------------------------------------------
                                  Alarms                 Warnings
                             High        Low         High        Low
----------------------------------------------------------------------------
Temperature  40.56 C        75.00 C     -5.00 C     70.00 C      0.00 C
Voltage       3.27 V         3.63 V      2.97 V      3.46 V      3.13 V
Current       7.09 mA       10.50 mA     2.50 mA    10.50 mA     2.50 mA
Tx Power     -2.25 dBm       1.70 dBm  -13.01 dBm   -1.30 dBm   -9.03 dBm
Rx Power     -2.87 dBm --    3.00 dBm  -16.02 dBm    0.00 dBm  -11.94 dBm
Transmit Fault Count = 3
----------------------------------------------------------------------------
Note: ++ high-alarm; + high-warning; -- low-alarm; - low-warning
fc1/3 sfp is present
Name is CISCO-AVAGO
Manufacturer's part number is AFBR-57F5PZ-CS1
Revision is B2
Serial number is AVJ2245J0Z8
FC Transmitter type is short wave laser w/o OFC (SN)
FC Transmitter supports short distance link length
Transmission medium is multimode laser with 62.5 um aperture (M6)
Supported speeds are - Min speed: 4000 Mb/s, Max speed: 16000 Mb/s
Nominal bit rate is 14000 Mb/s
Link length supported for 50/125um fiber is 30 m
Link length supported for 62.5/125um fiber is 10 m
Link length supported for 50/125um OM3 fiber is 100 m
Cisco extended id is unknown (0x0)
No tx fault, no rx loss, no sync exists, diagnostic monitoring type is 0x68
SFP Diagnostics Information:
----------------------------------------------------------------------------
                                  Alarms                 Warnings
                             High        Low         High        Low
----------------------------------------------------------------------------
Temperature  38.22 C        75.00 C     -5.00 C     70.00 C      0.00 C
Voltage       3.28 V         3.63 V      2.97 V      3.46 V      3.13 V
Current       6.04 mA       10.50 mA     2.50 mA    10.50 mA     2.50 mA
Tx Power     -2.14 dBm       1.70 dBm  -13.01 dBm   -1.30 dBm   -9.03 dBm
Rx Power     -8.48 dBm --    3.00 dBm  -16.02 dBm    0.00 dBm  -11.94 dBm
Transmit Fault Count = 3
----------------------------------------------------------------------------
Note: ++ high-alarm; + high-warning; -- low-alarm; - low-warning
fc1/4 sfp is present
Name is CISCO-AVAGO
Manufacturer's part number is AFBR-57F5PZ-CS1
Revision is B2
Serial number is AVJ2245J4C2
FC Transmitter type is short wave laser w/o OFC (SN)
FC Transmitter supports short distance link length
Transmission medium is multimode laser with 62.5 um aperture (M6)
Supported speeds are - Min speed: 4000 Mb/s, Max speed: 16000 Mb/s
Nominal bit rate is 14000 Mb/s
Link length supported for 50/125um fiber is 30 m
Link length supported for 62.5/125um fiber is 10 m
Link length supported for 50/125um OM3 fiber is 100 m
Cisco extended id is unknown (0x0)
No tx fault, no rx loss, no sync exists, diagnostic monitoring type is 0x68
SFP Diagnostics Information:
----------------------------------------------------------------------------
                                  Alarms                 Warnings
                             High        Low         High        Low
----------------------------------------------------------------------------
Temperature  42.23 C        75.00 C     -5.00 C     70.00 C      0.00 C
Voltage       3.27 V         3.63 V      2.97 V      3.46 V      3.13 V
Current       7.40 mA       10.50 mA     2.50 mA    10.50 mA     2.50 mA
Tx Power     -2.98 dBm       1.70 dBm  -13.01 dBm   -1.30 dBm   -9.03 dBm
Rx Power     -2.14 dBm --    3.00 dBm  -16.02 dBm    0.00 dBm  -11.94 dBm
Transmit Fault Count = 3
----------------------------------------------------------------------------
Note: ++ high-alarm; + high-warning; -- low-alarm; - low-warning
07-08-2019 03:29 AM
These are the results of running #show int fc x/y transceiver detail on Fabric Interconnect B
fc1/1 sfp is present
Name is CISCO-AVAGO
Manufacturer's part number is AFBR-57F5PZ-CS1
Revision is B2
Serial number is AVJ2245J4C7
FC Transmitter type is short wave laser w/o OFC (SN)
FC Transmitter supports short distance link length
Transmission medium is multimode laser with 62.5 um aperture (M6)
Supported speeds are - Min speed: 4000 Mb/s, Max speed: 16000 Mb/s
Nominal bit rate is 14000 Mb/s
Link length supported for 50/125um fiber is 30 m
Link length supported for 62.5/125um fiber is 10 m
Link length supported for 50/125um OM3 fiber is 100 m
Cisco extended id is unknown (0x0)
No tx fault, no rx loss, no sync exists, diagnostic monitoring type is 0x68
SFP Diagnostics Information:
----------------------------------------------------------------------------
                                  Alarms                 Warnings
                             High        Low         High        Low
----------------------------------------------------------------------------
Temperature  40.32 C        75.00 C     -5.00 C     70.00 C      0.00 C
Voltage       3.26 V         3.63 V      2.97 V      3.46 V      3.13 V
Current       7.07 mA       10.50 mA     2.50 mA    10.50 mA     2.50 mA
Tx Power     -2.17 dBm       1.70 dBm  -13.01 dBm   -1.30 dBm   -9.03 dBm
Rx Power     -1.97 dBm --    3.00 dBm  -16.02 dBm    0.00 dBm  -11.94 dBm
Transmit Fault Count = 3
----------------------------------------------------------------------------
Note: ++ high-alarm; + high-warning; -- low-alarm; - low-warning
fc1/2 sfp is present
Name is CISCO-AVAGO
Manufacturer's part number is AFBR-57F5PZ-CS1
Revision is B2
Serial number is AVJ2245J0Z2
FC Transmitter type is short wave laser w/o OFC (SN)
FC Transmitter supports short distance link length
Transmission medium is multimode laser with 62.5 um aperture (M6)
Supported speeds are - Min speed: 4000 Mb/s, Max speed: 16000 Mb/s
Nominal bit rate is 14000 Mb/s
Link length supported for 50/125um fiber is 30 m
Link length supported for 62.5/125um fiber is 10 m
Link length supported for 50/125um OM3 fiber is 100 m
Cisco extended id is unknown (0x0)
No tx fault, no rx loss, no sync exists, diagnostic monitoring type is 0x68
SFP Diagnostics Information:
----------------------------------------------------------------------------
                                  Alarms                 Warnings
                             High        Low         High        Low
----------------------------------------------------------------------------
Temperature  41.86 C        75.00 C     -5.00 C     70.00 C      0.00 C
Voltage       3.27 V         3.63 V      2.97 V      3.46 V      3.13 V
Current       7.34 mA       10.50 mA     2.50 mA    10.50 mA     2.50 mA
Tx Power     -2.53 dBm       1.70 dBm  -13.01 dBm   -1.30 dBm   -9.03 dBm
Rx Power     -2.19 dBm --    3.00 dBm  -16.02 dBm    0.00 dBm  -11.94 dBm
Transmit Fault Count = 3
----------------------------------------------------------------------------
Note: ++ high-alarm; + high-warning; -- low-alarm; - low-warning
fc1/3 sfp is present
Name is CISCO-AVAGO
Manufacturer's part number is AFBR-57F5PZ-CS1
Revision is B2
Serial number is AVJ2245J0L4
FC Transmitter type is short wave laser w/o OFC (SN)
FC Transmitter supports short distance link length
Transmission medium is multimode laser with 62.5 um aperture (M6)
Supported speeds are - Min speed: 4000 Mb/s, Max speed: 16000 Mb/s
Nominal bit rate is 14000 Mb/s
Link length supported for 50/125um fiber is 30 m
Link length supported for 62.5/125um fiber is 10 m
Link length supported for 50/125um OM3 fiber is 100 m
Cisco extended id is unknown (0x0)
No tx fault, no rx loss, no sync exists, diagnostic monitoring type is 0x68
SFP Diagnostics Information:
----------------------------------------------------------------------------
                                  Alarms                 Warnings
                             High        Low         High        Low
----------------------------------------------------------------------------
Temperature  40.41 C        75.00 C     -5.00 C     70.00 C      0.00 C
Voltage       3.27 V         3.63 V      2.97 V      3.46 V      3.13 V
Current       6.85 mA       10.50 mA     2.50 mA    10.50 mA     2.50 mA
Tx Power     -2.20 dBm       1.70 dBm  -13.01 dBm   -1.30 dBm   -9.03 dBm
Rx Power     -2.76 dBm --    3.00 dBm  -16.02 dBm    0.00 dBm  -11.94 dBm
Transmit Fault Count = 3
----------------------------------------------------------------------------
Note: ++ high-alarm; + high-warning; -- low-alarm; - low-warning
fc1/4 sfp is present
Name is CISCO-AVAGO
Manufacturer's part number is AFBR-57F5PZ-CS1
Revision is B2
Serial number is AVJ2245J0KT
FC Transmitter type is short wave laser w/o OFC (SN)
FC Transmitter supports short distance link length
Transmission medium is multimode laser with 62.5 um aperture (M6)
Supported speeds are - Min speed: 4000 Mb/s, Max speed: 16000 Mb/s
Nominal bit rate is 14000 Mb/s
Link length supported for 50/125um fiber is 30 m
Link length supported for 62.5/125um fiber is 10 m
Link length supported for 50/125um OM3 fiber is 100 m
Cisco extended id is unknown (0x0)
No tx fault, no rx loss, no sync exists, diagnostic monitoring type is 0x68
SFP Diagnostics Information:
----------------------------------------------------------------------------
                                  Alarms                 Warnings
                             High        Low         High        Low
----------------------------------------------------------------------------
Temperature  42.31 C        75.00 C     -5.00 C     70.00 C      0.00 C
Voltage       3.27 V         3.63 V      2.97 V      3.46 V      3.13 V
Current       6.85 mA       10.50 mA     2.50 mA    10.50 mA     2.50 mA
Tx Power     -2.19 dBm       1.70 dBm  -13.01 dBm   -1.30 dBm   -9.03 dBm
Rx Power     -2.27 dBm --    3.00 dBm  -16.02 dBm    0.00 dBm  -11.94 dBm
Transmit Fault Count = 3
----------------------------------------------------------------------------
Note: ++ high-alarm; + high-warning; -- low-alarm; - low-warning
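With eight nearly identical dumps it is easy to overlook an outlier, so here is a small sketch that pulls the diagnostics rows out of pasted output and computes how far a reading sits above its low-warning threshold. The regular expression and function names are my own; the column order (current value, optional flag, high alarm, low alarm, high warning, low warning) is assumed from the table header in the output above. The sample rows are the ones posted for fc1/3 on Fabric A.

```python
import re

# One diagnostics row: label, current value, optional flag, four thresholds.
ROW_RE = re.compile(
    r"^(?P<label>Temperature|Voltage|Current|Tx Power|Rx Power)\s+"
    r"(?P<cur>-?\d+\.\d+)\s+(?P<unit>\S+)\s*(?P<flag>\+\+|--|\+|-)?\s+"
    r"(?P<ha>-?\d+\.\d+)\s+\S+\s+(?P<la>-?\d+\.\d+)\s+\S+\s+"
    r"(?P<hw>-?\d+\.\d+)\s+\S+\s+(?P<lw>-?\d+\.\d+)\s+\S+\s*$"
)

# Rows as posted for fc1/3 on Fabric A.
SAMPLE = """\
Temperature 38.22 C 75.00 C -5.00 C 70.00 C 0.00 C
Voltage 3.28 V 3.63 V 2.97 V 3.46 V 3.13 V
Current 6.04 mA 10.50 mA 2.50 mA 10.50 mA 2.50 mA
Tx Power -2.14 dBm 1.70 dBm -13.01 dBm -1.30 dBm -9.03 dBm
Rx Power -8.48 dBm -- 3.00 dBm -16.02 dBm 0.00 dBm -11.94 dBm
"""


def parse_dom_rows(text):
    """Return {label: {...}} for every diagnostics row found in text."""
    rows = {}
    for line in text.splitlines():
        m = ROW_RE.match(line.strip())
        if m:
            rows[m.group("label")] = {
                "current": float(m.group("cur")),
                "unit": m.group("unit"),
                "flag": m.group("flag") or "",
                "high_alarm": float(m.group("ha")),
                "low_alarm": float(m.group("la")),
                "high_warning": float(m.group("hw")),
                "low_warning": float(m.group("lw")),
            }
    return rows


def low_warning_margin(row):
    """Distance between the current reading and the low-warning threshold."""
    return row["current"] - row["low_warning"]


if __name__ == "__main__":
    rows = parse_dom_rows(SAMPLE)
    print(f"Rx margin: {low_warning_margin(rows['Rx Power']):.2f} dB")
```

For the posted values, Rx power on Fabric A fc1/3 (-8.48 dBm) is noticeably lower than on the other seven ports (roughly -2 to -3 dBm), although it is still above the low-warning threshold of -11.94 dBm. Whether that matters for this problem is not established here.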