
XenServer / Fabric Interconnect losing SAN connectivity

aarsheim1976
Level 1

Just to start: we have been running a large UCS environment for our VMware infrastructure for 5 years with no real issues. Now, however, we're building a new environment for our large XenServer deployment to replace an old HP environment.

 

The new UCS environment consists of:

2 pcs Fabric Interconnect 6332-16UP

7 pcs 5108 Chassis (56 pcs B200 M5 blades)

14 pcs 2304 IOM Modules

 

Each fabric is configured with 4 unified ports, each connected at 16 Gbps to a Brocade SAN fabric.

Because Cisco and Brocade do not support configuring a port channel between them for the SAN, each link is just an individual ISL. We have therefore paired the FC uplinks into redundancy pairs using pinning; each pin pair consists of one FC port from Fabric A and one from Fabric B. We did it this way because the system otherwise has no working failback, meaning that over time we might end up with several FC links saturated while others sit unused. Having to manually rebalance everything after a link outage is painful and time consuming.

 

All of the chassis are connected to the FIs with 40 Gbps ports, one to each fabric.

 

We have created server templates for our Xen hosts, each with 6 NICs and 8 HBAs. HBA0 and HBA1 are the first redundancy pair, HBA2 and 4 the next one, etc. XenServer should be configured correctly with multipathing. We did an outage test, and everything seemed to work as intended.

 

The XenServers all use SAN boot, and the XenServer version we're running is 7.6. We have configured 40 XenServer hosts at the moment.

 

Now, here's the problem. When the environment is running and we start to power on all the VMs, the environment suddenly starts to crash. This seems to happen at random, but mostly after reaching 500-700 powered-on VMs. The XenServers lose their connection to the SAN and start to crash. At this point the SAN guys can see that none of the links from the FIs are logged in; all the FC uplinks to the FIs are down. So either the FIs are being dropped by the Brocade, or the FIs themselves are dropping the connection. It might also be XenServer-related, but I can't see how XenServer could actually cause the FI to drop an uplink. A XenServer error causing a host to drop its own NIC or HBA would be more plausible than the same error causing the FI to drop uplinks and crash the entire system. We have not been able to find anything in the logs stating what is causing this issue.

 

In the FIs' syslogs we can see the following, and these entries always show up when this failure happens, almost down to the second:

 

2019 Jun 28 13:58:33 UCSP-13-A %VIM-2-NIV_VIF_LIST_SET: Received an unsuccessful VIF_LIST_SET response from adaptor for VIFINDEX: 1019

2019 Jun 28 13:58:33 UCSP-13-A %VIM-2-NIV_VIF_LIST_SET: Received an unsuccessful VIF_LIST_SET response from adaptor for VIFINDEX: 1024

2019 Jun 28 13:58:34 UCSP-13-A %VIM-2-NIV_VIF_LIST_SET: Received an unsuccessful VIF_LIST_SET response from adaptor for VIFINDEX: 1019

2019 Jun 28 13:58:36 UCSP-13-A %VIM-2-NIV_VIF_LIST_SET: Received an unsuccessful VIF_LIST_SET response from adaptor for VIFINDEX: 1025

 

......there are several more of them, but they're all the same except for the VIFINDEX....
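To make it easier to see which VIFs are involved and which one reported first, the entries above can be tallied with a short script. This is only a minimal Python sketch: it assumes the syslog lines look exactly like the ones quoted above, and the function name `summarize` and the `SAMPLE` variable are made up for illustration. Mapping a VIFINDEX back to a specific host or vHBA still has to be done on the FI itself.

```python
import re
from collections import Counter
from datetime import datetime

# Matches only the line shape quoted in this post; other syslog formats are ignored.
LINE_RE = re.compile(
    r"^(?P<ts>\d{4} \w{3} +\d{1,2} \d{2}:\d{2}:\d{2}) (?P<host>\S+) "
    r"%VIM-2-NIV_VIF_LIST_SET: .*VIFINDEX: (?P<vif>\d+)"
)

def summarize(lines):
    """Return (count per VIFINDEX, earliest timestamp per VIFINDEX)."""
    counts = Counter()
    first_seen = {}
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue
        # %b is locale-dependent; this assumes English month abbreviations.
        ts = datetime.strptime(m["ts"], "%Y %b %d %H:%M:%S")
        vif = int(m["vif"])
        counts[vif] += 1
        if vif not in first_seen or ts < first_seen[vif]:
            first_seen[vif] = ts
    return counts, first_seen

# The four entries quoted above, used here as sample input.
SAMPLE = """\
2019 Jun 28 13:58:33 UCSP-13-A %VIM-2-NIV_VIF_LIST_SET: Received an unsuccessful VIF_LIST_SET response from adaptor for VIFINDEX: 1019
2019 Jun 28 13:58:33 UCSP-13-A %VIM-2-NIV_VIF_LIST_SET: Received an unsuccessful VIF_LIST_SET response from adaptor for VIFINDEX: 1024
2019 Jun 28 13:58:34 UCSP-13-A %VIM-2-NIV_VIF_LIST_SET: Received an unsuccessful VIF_LIST_SET response from adaptor for VIFINDEX: 1019
2019 Jun 28 13:58:36 UCSP-13-A %VIM-2-NIV_VIF_LIST_SET: Received an unsuccessful VIF_LIST_SET response from adaptor for VIFINDEX: 1025
"""

counts, first_seen = summarize(SAMPLE.splitlines())
# Print VIFs in the order they first reported a failure.
for vif, ts in sorted(first_seen.items(), key=lambda kv: kv[1]):
    print(f"VIFINDEX {vif}: first seen {ts.isoformat()}, {counts[vif]} entries")
```

Run against the full syslog from both FIs, the same tally might help show whether one VIF or one fabric consistently reports first.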

 

Since these log entries always occur in connection with the problem, I know they are somehow related. But I don't know if I'm looking at the culprit or just a symptom caused by some other issue.

 

We have experienced this problem on both firmware 4.0(2d) and 4.0(4b), and we have tried both XenServer 8 and XenServer 7.6, with no luck. The Brocade logs don't seem to give any good clues either. The SAN guys say it looks as if the FI just dropped the connection. And we find nothing in XenServer other than things related to the server losing disk connectivity to the boot volume, nothing as to WHY this happens.

 

Does anyone in the community have any clues, or has anyone experienced the same issue?

 


Kirk J
Cisco Employee

Does this happen to one Xenserver at a time, only one side (A|B), and then eventually seem to spread, or all at the same time?

Are your A and B FC fabrics completely separate?

When you say 'drop an uplink' you mean the FI FC port goes into an error disabled state or down?

It is possible for an erroring host/vHBA to send a bunch of CRC'd frames that could cause the pinned uplink port (or the connected device's port) to auto-disable, although this shouldn't take an entire environment down unless we are hitting some sort of slow drain scenario...

I'm assuming you have no actual FI FC port counter errors, or FI eth port counter errors going to the IOMs?

There are a lot of ports that should probably be checked: VIC adapter-level stats, IOM ports, and the FIs' Eth and FC ports.

 

What kind of errors have your FC ports logged from show int fc x/y?

 

What's the output of #show int fc x/y transceiver detail on your FI ports?

Is there a particular host and pinned FC uplink that seems to go down first?

What version of FNIC driver are you running?

 

You have a TAC case open for this I'm assuming?

 

Kirk...

 

"Does this happen to one Xenserver at a time, only one side (A|B), and then eventually seem to spread, or all at the same time?"

We have not been able to identify this fully, as the system seems to run fine from my perspective (infrastructure admin) until suddenly the XenServer guys start to crowd my office because their servers are non-responsive. At that point none of the FI uplink ports to the FC network are logged in. To see how it unfolds I would have to sit and monitor this system extensively, and sadly I also have other environments to monitor, leaving me short of time.

 

"Are your A and B FC fabrics completely separate?"

Yes, the A and B FC fabrics should be completely separate, but I will double-check that with the FC admin team. They are the same fabrics used by our VMware environment, which runs just fine and has done so for years.

 

"When you say 'drop an uplink' you mean the FI FC port goes into an error disabled state or down?"

No, there are no errors reported on the FI port itself. The port shows "green" and "up". The problem is that it's no longer logged in to the FC fabric (lost FLOGI), so the connection carries no traffic even though the link is still up. In fact, to get the system running again, my only option is to disable and then re-enable the unified FC ports to force a new login to the fabric. The service profile VIFs, however, show the HBAs as error-disabled and down.

 

"I'm assuming you have no actual FI FC port counter errors, or FI eth port counter errors going to the IOMs?"

That is correct. There are no errors reported and no indications of any problems with the connections. It seems as if the FC fabric just decides to evict the ports for no given reason, and there is no indication on the FC fabric itself as to why this happens. So from that point of view it seems as if the FI just decided to stop sending traffic.

 

"Is there a particular host and pinned FC uplink that seems to go down first?"

Not that we have been able to identify.

 

"You have a TAC case open for this I'm assuming?"

Yes we have a TAC case open, but have not been able to solve the problem yet.

 

I will look into the output of the command you specified and post the results shortly. 

If all FC ports on both FIs are losing FLOGI with the Brocade ports, I would tend to suspect something in the Brocade fabric.

This may ultimately require some sort of FC analyzer/TAP (e.g. Finisar) inserted between the FIs and the Brocade to determine root cause.

Are there any other non-UCS FC devices connected to the same line cards on the Brocade switches?

Kirk...

These are the results of #show int fc x/y transceiver detail on Fabric Interconnect A

fc1/1 sfp is present

    Name is CISCO-AVAGO

    Manufacturer's part number is AFBR-57F5PZ-CS1

    Revision is B2

    Serial number is AVJ2245J0YR

    FC Transmitter type is short wave laser w/o OFC (SN)

    FC Transmitter supports short distance link length

    Transmission medium is multimode laser with 62.5 um aperture (M6)

    Supported speeds are - Min speed: 4000 Mb/s, Max speed: 16000 Mb/s

    Nominal bit rate is 14000 Mb/s

    Link length supported for 50/125um fiber is 30 m

    Link length supported for 62.5/125um fiber is 10 m

    Link length supported for 50/125um OM3 fiber is 100 m

    Cisco extended id is unknown (0x0)

 

    No tx fault, no rx loss, no sync exists, diagnostic monitoring type is 0x68

    SFP Diagnostics Information:

----------------------------------------------------------------------------

                                     Alarms                  Warnings

                                High        Low         High          Low

----------------------------------------------------------------------------

  Temperature  38.23 C         75.00 C     -5.00 C     70.00 C        0.00 C

  Voltage       3.28 V          3.63 V      2.97 V      3.46 V        3.13 V

  Current       7.40 mA        10.50 mA     2.50 mA    10.50 mA       2.50 mA

  Tx Power     -2.25 dBm        1.70 dBm  -13.01 dBm   -1.30 dBm     -9.03 dBm

  Rx Power     -2.48 dBm --     3.00 dBm  -16.02 dBm    0.00 dBm    -11.94 dBm

  Transmit Fault Count = 3

----------------------------------------------------------------------------

  Note: ++  high-alarm; +  high-warning; --  low-alarm; -  low-warning

 

 

fc1/2 sfp is present

    Name is CISCO-AVAGO

    Manufacturer's part number is AFBR-57F5PZ-CS1

    Revision is B2

    Serial number is AVJ2245J0KU

    FC Transmitter type is short wave laser w/o OFC (SN)

    FC Transmitter supports short distance link length

    Transmission medium is multimode laser with 62.5 um aperture (M6)

    Supported speeds are - Min speed: 4000 Mb/s, Max speed: 16000 Mb/s

    Nominal bit rate is 14000 Mb/s

    Link length supported for 50/125um fiber is 30 m

    Link length supported for 62.5/125um fiber is 10 m

    Link length supported for 50/125um OM3 fiber is 100 m

    Cisco extended id is unknown (0x0)

 

    No tx fault, no rx loss, no sync exists, diagnostic monitoring type is 0x68

    SFP Diagnostics Information:

----------------------------------------------------------------------------

                                     Alarms                  Warnings

                                High        Low         High          Low

----------------------------------------------------------------------------

  Temperature  40.56 C         75.00 C     -5.00 C     70.00 C        0.00 C

  Voltage       3.27 V          3.63 V      2.97 V      3.46 V        3.13 V

  Current       7.09 mA        10.50 mA     2.50 mA    10.50 mA       2.50 mA

  Tx Power     -2.25 dBm        1.70 dBm  -13.01 dBm   -1.30 dBm     -9.03 dBm

  Rx Power     -2.87 dBm --     3.00 dBm  -16.02 dBm    0.00 dBm    -11.94 dBm

  Transmit Fault Count = 3

----------------------------------------------------------------------------

  Note: ++  high-alarm; +  high-warning; --  low-alarm; -  low-warning

 

fc1/3 sfp is present

    Name is CISCO-AVAGO

    Manufacturer's part number is AFBR-57F5PZ-CS1

    Revision is B2

    Serial number is AVJ2245J0Z8

    FC Transmitter type is short wave laser w/o OFC (SN)

    FC Transmitter supports short distance link length

    Transmission medium is multimode laser with 62.5 um aperture (M6)

    Supported speeds are - Min speed: 4000 Mb/s, Max speed: 16000 Mb/s

    Nominal bit rate is 14000 Mb/s

    Link length supported for 50/125um fiber is 30 m

    Link length supported for 62.5/125um fiber is 10 m

    Link length supported for 50/125um OM3 fiber is 100 m

    Cisco extended id is unknown (0x0)

 

    No tx fault, no rx loss, no sync exists, diagnostic monitoring type is 0x68

    SFP Diagnostics Information:

----------------------------------------------------------------------------

                                     Alarms                  Warnings

                                High        Low         High          Low

----------------------------------------------------------------------------

  Temperature  38.22 C         75.00 C     -5.00 C     70.00 C        0.00 C

  Voltage       3.28 V          3.63 V      2.97 V      3.46 V        3.13 V

  Current       6.04 mA        10.50 mA     2.50 mA    10.50 mA       2.50 mA

  Tx Power     -2.14 dBm        1.70 dBm  -13.01 dBm   -1.30 dBm     -9.03 dBm

  Rx Power     -8.48 dBm --     3.00 dBm  -16.02 dBm    0.00 dBm    -11.94 dBm

  Transmit Fault Count = 3

----------------------------------------------------------------------------

  Note: ++  high-alarm; +  high-warning; --  low-alarm; -  low-warning

 

fc1/4 sfp is present

    Name is CISCO-AVAGO

    Manufacturer's part number is AFBR-57F5PZ-CS1

    Revision is B2

    Serial number is AVJ2245J4C2

    FC Transmitter type is short wave laser w/o OFC (SN)

    FC Transmitter supports short distance link length

    Transmission medium is multimode laser with 62.5 um aperture (M6)

    Supported speeds are - Min speed: 4000 Mb/s, Max speed: 16000 Mb/s

    Nominal bit rate is 14000 Mb/s

    Link length supported for 50/125um fiber is 30 m

    Link length supported for 62.5/125um fiber is 10 m

    Link length supported for 50/125um OM3 fiber is 100 m

    Cisco extended id is unknown (0x0)

 

    No tx fault, no rx loss, no sync exists, diagnostic monitoring type is 0x68

    SFP Diagnostics Information:

----------------------------------------------------------------------------

                                     Alarms                  Warnings

                                High        Low         High          Low

----------------------------------------------------------------------------

  Temperature  42.23 C         75.00 C     -5.00 C     70.00 C        0.00 C

  Voltage       3.27 V          3.63 V      2.97 V      3.46 V        3.13 V

  Current       7.40 mA        10.50 mA     2.50 mA    10.50 mA       2.50 mA

  Tx Power     -2.98 dBm        1.70 dBm  -13.01 dBm   -1.30 dBm     -9.03 dBm

  Rx Power     -2.14 dBm --     3.00 dBm  -16.02 dBm    0.00 dBm    -11.94 dBm

  Transmit Fault Count = 3

----------------------------------------------------------------------------

  Note: ++  high-alarm; +  high-warning; --  low-alarm; -  low-warning

These are the results of running #show int fc x/y transceiver detail on Fabric Interconnect B

fc1/1 sfp is present

    Name is CISCO-AVAGO

    Manufacturer's part number is AFBR-57F5PZ-CS1

    Revision is B2

    Serial number is AVJ2245J4C7

    FC Transmitter type is short wave laser w/o OFC (SN)

    FC Transmitter supports short distance link length

    Transmission medium is multimode laser with 62.5 um aperture (M6)

    Supported speeds are - Min speed: 4000 Mb/s, Max speed: 16000 Mb/s

    Nominal bit rate is 14000 Mb/s

    Link length supported for 50/125um fiber is 30 m

    Link length supported for 62.5/125um fiber is 10 m

    Link length supported for 50/125um OM3 fiber is 100 m

    Cisco extended id is unknown (0x0)

 

    No tx fault, no rx loss, no sync exists, diagnostic monitoring type is 0x68

    SFP Diagnostics Information:

----------------------------------------------------------------------------

                                     Alarms                  Warnings

                                High        Low         High          Low

----------------------------------------------------------------------------

  Temperature  40.32 C         75.00 C     -5.00 C     70.00 C        0.00 C

  Voltage       3.26 V          3.63 V      2.97 V      3.46 V        3.13 V

  Current       7.07 mA        10.50 mA     2.50 mA    10.50 mA       2.50 mA

  Tx Power     -2.17 dBm        1.70 dBm  -13.01 dBm   -1.30 dBm     -9.03 dBm

  Rx Power     -1.97 dBm --     3.00 dBm  -16.02 dBm    0.00 dBm    -11.94 dBm

  Transmit Fault Count = 3

----------------------------------------------------------------------------

  Note: ++  high-alarm; +  high-warning; --  low-alarm; -  low-warning

 

 

fc1/2 sfp is present

    Name is CISCO-AVAGO

    Manufacturer's part number is AFBR-57F5PZ-CS1

    Revision is B2

    Serial number is AVJ2245J0Z2

    FC Transmitter type is short wave laser w/o OFC (SN)

    FC Transmitter supports short distance link length

    Transmission medium is multimode laser with 62.5 um aperture (M6)

    Supported speeds are - Min speed: 4000 Mb/s, Max speed: 16000 Mb/s

    Nominal bit rate is 14000 Mb/s

    Link length supported for 50/125um fiber is 30 m

    Link length supported for 62.5/125um fiber is 10 m

    Link length supported for 50/125um OM3 fiber is 100 m

    Cisco extended id is unknown (0x0)

 

    No tx fault, no rx loss, no sync exists, diagnostic monitoring type is 0x68

    SFP Diagnostics Information:

----------------------------------------------------------------------------

                                     Alarms                  Warnings

                                High        Low         High          Low

----------------------------------------------------------------------------

  Temperature  41.86 C         75.00 C     -5.00 C     70.00 C        0.00 C

  Voltage       3.27 V          3.63 V      2.97 V      3.46 V        3.13 V

  Current       7.34 mA        10.50 mA     2.50 mA    10.50 mA       2.50 mA

  Tx Power     -2.53 dBm        1.70 dBm  -13.01 dBm   -1.30 dBm     -9.03 dBm

  Rx Power     -2.19 dBm --     3.00 dBm  -16.02 dBm    0.00 dBm    -11.94 dBm

  Transmit Fault Count = 3

----------------------------------------------------------------------------

  Note: ++  high-alarm; +  high-warning; --  low-alarm; -  low-warning

 

fc1/3 sfp is present

    Name is CISCO-AVAGO

    Manufacturer's part number is AFBR-57F5PZ-CS1

    Revision is B2

    Serial number is AVJ2245J0L4

    FC Transmitter type is short wave laser w/o OFC (SN)

    FC Transmitter supports short distance link length

    Transmission medium is multimode laser with 62.5 um aperture (M6)

    Supported speeds are - Min speed: 4000 Mb/s, Max speed: 16000 Mb/s

    Nominal bit rate is 14000 Mb/s

    Link length supported for 50/125um fiber is 30 m

    Link length supported for 62.5/125um fiber is 10 m

    Link length supported for 50/125um OM3 fiber is 100 m

    Cisco extended id is unknown (0x0)

 

    No tx fault, no rx loss, no sync exists, diagnostic monitoring type is 0x68

    SFP Diagnostics Information:

----------------------------------------------------------------------------

                                     Alarms                  Warnings

                                High        Low         High          Low

----------------------------------------------------------------------------

  Temperature  40.41 C         75.00 C     -5.00 C     70.00 C        0.00 C

  Voltage       3.27 V          3.63 V      2.97 V      3.46 V        3.13 V

  Current       6.85 mA        10.50 mA     2.50 mA    10.50 mA       2.50 mA

  Tx Power     -2.20 dBm        1.70 dBm  -13.01 dBm   -1.30 dBm     -9.03 dBm

  Rx Power     -2.76 dBm --     3.00 dBm  -16.02 dBm    0.00 dBm    -11.94 dBm

  Transmit Fault Count = 3

----------------------------------------------------------------------------

  Note: ++  high-alarm; +  high-warning; --  low-alarm; -  low-warning

 

fc1/4 sfp is present

    Name is CISCO-AVAGO

    Manufacturer's part number is AFBR-57F5PZ-CS1

    Revision is B2

    Serial number is AVJ2245J0KT

    FC Transmitter type is short wave laser w/o OFC (SN)

    FC Transmitter supports short distance link length

    Transmission medium is multimode laser with 62.5 um aperture (M6)

    Supported speeds are - Min speed: 4000 Mb/s, Max speed: 16000 Mb/s

    Nominal bit rate is 14000 Mb/s

    Link length supported for 50/125um fiber is 30 m

    Link length supported for 62.5/125um fiber is 10 m

    Link length supported for 50/125um OM3 fiber is 100 m

    Cisco extended id is unknown (0x0)

 

    No tx fault, no rx loss, no sync exists, diagnostic monitoring type is 0x68

    SFP Diagnostics Information:

----------------------------------------------------------------------------

                                     Alarms                  Warnings

                                High        Low         High          Low

----------------------------------------------------------------------------

  Temperature  42.31 C         75.00 C     -5.00 C     70.00 C        0.00 C

  Voltage       3.27 V          3.63 V      2.97 V      3.46 V        3.13 V

  Current       6.85 mA        10.50 mA     2.50 mA    10.50 mA       2.50 mA

  Tx Power     -2.19 dBm        1.70 dBm  -13.01 dBm   -1.30 dBm     -9.03 dBm

  Rx Power     -2.27 dBm --     3.00 dBm  -16.02 dBm    0.00 dBm    -11.94 dBm

  Transmit Fault Count = 3

----------------------------------------------------------------------------

  Note: ++  high-alarm; +  high-warning; --  low-alarm; -  low-warning
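Since every port above prints a `--` (low-alarm) marker next to Rx Power, it may be worth recomputing the flags from the thresholds shown in the same table. The following is a minimal Python sketch that assumes the row layout pasted above (value, high alarm, low alarm, high warning, low warning); the helper names `parse_row`, `expected_flag` and `check` are made up for illustration and are not part of any Cisco tooling.

```python
FLAGS = {"++", "--", "+", "-"}

def parse_row(line):
    """Return (name, printed_flag, [value, hi_alarm, lo_alarm, hi_warn, lo_warn]) or None."""
    name, nums, flag = [], [], None
    for tok in line.split():
        if tok in FLAGS:
            flag = tok
            continue
        try:
            nums.append(float(tok))
        except ValueError:
            if not nums:  # words before the first number form the row name
                name.append(tok)
    if not name or len(nums) != 5:
        return None  # headers, separators and other lines are skipped
    return " ".join(name), flag, nums

def expected_flag(value, hi_alarm, lo_alarm, hi_warn, lo_warn):
    """Recompute the marker using the legend from the output's Note line."""
    if value > hi_alarm:
        return "++"
    if value < lo_alarm:
        return "--"
    if value > hi_warn:
        return "+"
    if value < lo_warn:
        return "-"
    return None

def check(text):
    """Return (name, printed_flag, recomputed_flag) for every diagnostics row found."""
    results = []
    for line in text.splitlines():
        row = parse_row(line)
        if row:
            name, printed, nums = row
            results.append((name, printed, expected_flag(*nums)))
    return results

# Diagnostics rows for fc1/1 on Fabric Interconnect A, copied from the output above.
SAMPLE = """\
  Temperature  38.23 C         75.00 C     -5.00 C     70.00 C        0.00 C
  Voltage       3.28 V          3.63 V      2.97 V      3.46 V        3.13 V
  Current       7.40 mA        10.50 mA     2.50 mA    10.50 mA       2.50 mA
  Tx Power     -2.25 dBm        1.70 dBm  -13.01 dBm   -1.30 dBm     -9.03 dBm
  Rx Power     -2.48 dBm --     3.00 dBm  -16.02 dBm    0.00 dBm    -11.94 dBm
  Transmit Fault Count = 3
"""

results = check(SAMPLE)
mismatches = [name for name, printed, expected in results if printed != expected]
for name, printed, expected in results:
    print(f"{name}: printed={printed!r} recomputed={expected!r}")
```

For the fc1/1 values, Rx Power is the only row where the printed marker and the recomputed one disagree: -2.48 dBm sits between the -16.02 dBm low alarm and the 3.00 dBm high alarm, yet the output shows `--`. Whether that is a display quirk or something meaningful is not settled in this thread, so it may be a detail to raise in the TAC case.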
