UCS 1.4(2b), N1K and uplinks

danilo-dicesare
Level 1

Hi all,

I've got a UCS cluster connected in vPC mode to a Nexus 7010.

The UCS blades run VMware ESXi 4.1U1 and the N1K with a vPC-HM port channel; the UCS NIC is the M71KR.

My question is: what happens if both uplinks of one Fabric Interconnect fail, say from a fibre cut or something like that? Will the N1K channel still have two active links?

Maybe the redundancy is done by re-ARPing to find the MAC address (I've got UCS in switch mode), but I'm not sure. I also saw a feature in the new release for handling a complete uplink failure, Network State Tracking. How does it work?

Last question about redundancy: what happens if a UCS IOM resets? Will I see traffic disruption?

Thanks,

Dan


3 Replies

Robert Burns
Cisco Employee

Dan,

Just to be clear, confirm the following:

- UCS in switch mode

- M71KR adapters

- N1K using MAC pinning (I assume)

- Upstream connectivity from each FI is a vPC to a pair of N7Ks.

In this case the N1K has no visibility into the UCS uplinks. All each VEM host sees is two uplinks (one going to each Fabric Interconnect). If one of the two uplinks on an Interconnect fails, traffic will be re-pinned to the remaining uplink on that FI. If BOTH uplinks on an FI fail, UCS will down the server links (called Link-down) and traffic will be routed through the VEM's other uplink, going to the other FI. You can change this behavior to keep the server links up (for local switching only), but UCSM's default action is to shut the corresponding server links if there are no uplinks available on an FI. Make sense?
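
For reference, an N1K Ethernet uplink port profile using MAC pinning typically looks something like the sketch below. The profile name and VLAN IDs are placeholders for your environment:

port-profile type ethernet system-uplink
  vmware port-group
  switchport mode trunk
  switchport trunk allowed vlan 1-100
  ! MAC pinning builds a vPC-HM channel; no port-channel config is needed upstream
  channel-group auto mode on mac-pinning
  no shutdown
  ! list your control/packet VLANs as system VLANs (placeholder ID here)
  system vlan 10
  state enabled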

Now, in the latest version of the N1K (1.4) there is a new feature called Network State Tracking (NST) for use with vPC-HM (such as MAC pinning). This feature tests the connectivity of a VLAN by sending a probe packet out one uplink and expecting it to be returned on another uplink/sub-group. If you have a network or VLAN that MUST be up, you can track it with NST. If that network becomes unavailable, you can choose to shut the affected uplink and re-route traffic to another uplink. This is handy for detecting failures beyond the first hop (which would be the Interconnects), such as a failure somewhere at your N7K level or beyond.
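
The NST configuration itself is only a few lines on the VSM. A sketch of the 1.4 syntax from memory (the interval/threshold values and VLAN ID are just examples; verify the exact commands against the command reference linked below):

track network-state enable
! probe interval in seconds, and how many missed probes trigger the action
track network-state interval 5
track network-state threshold miss-count 7
! VLAN used to carry the broadcast probe (placeholder ID)
track network-state broadcast vlan 10
! on a detected split, re-pin traffic away from the broken sub-group
track network-state split action repin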

Configuration Guide: http://www.cisco.com/en/US/partner/docs/switches/datacenter/nexus1000/sw/4_2_1_s_v_1_4/interface/configuration/guide/n1000v_if_5portchannel.html#wp1284321

White paper: https://communities.cisco.com/docs/DOC-20657

Command Reference: http://www.cisco.com/en/US/docs/switches/datacenter/nexus1000/sw/4_2_1_s_v_1_4/command/reference/n1000v_cmds_t.html#wp1295608

For your last question about an IOM failure/reset: all of the corresponding adapter ports on each blade in that chassis will lose connectivity. This is where redundancy at the host level comes into play to re-route traffic. In the case of your N1K VEM hosts, they would simply re-route traffic out the other path, to the functional IOM of the chassis.

One additional point to consider: the M71 and M81 adapters support Fabric Failover. This is failover at the adapter level for a failure of any device along the path between the adapter and the uplinks (such as the IOM or FI). Fabric Failover is an adapter-configurable option that re-routes traffic in the adapter's Menlo ASIC to the "other" fabric, such that the host will NOT see either of its two ports go down. Without Fabric Failover, a failure of an IOM or FI would be seen by the adapter, and that particular port would go down. FF just adds a level of redundancy in the adapter without relying on any host OS teaming/failover. The M51KR, M61KR, and M72KR adapters do NOT support this feature.
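
If you do want to enable Fabric Failover on a vNIC, it's a checkbox in the UCSM GUI; from the UCSM CLI it looks roughly like the following. The service profile and vNIC names are placeholders, and the syntax is from memory, so verify it against your UCSM release:

UCS-A# scope org /
UCS-A /org # scope service-profile esx-host-1
UCS-A /org/service-profile # scope vnic vnic-a
UCS-A /org/service-profile/vnic # set fabric a-b
UCS-A /org/service-profile/vnic # commit-buffer

Here "a-b" means fabric A is primary with failover to fabric B.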

Regards,

Robert 

Hi Robert,

I confirm all of the above.

My network control policy is currently set not to shut the links. Do you think it's better to shut them and let the VEM re-route everything? Is there any issue with leaving the links up?

I've read somewhere that with the N1K and VMware, Cisco doesn't recommend Fabric Failover. Or am I wrong?

Thanks,

Dan

It all depends. If you want to be able to do local switching (between servers/adapters connected to the same FI), then you want to set the Network Control Policy's uplink-fail action to "warning". In terms of the N1K, I'd let the system fail traffic over to the other available link. To be clear, that is not Fabric Failover, it's regular failover: the VEM sees the failed link and re-routes traffic to the other uplink.
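
For what it's worth, a Network Control Policy with that behavior can be created from the UCSM CLI roughly as follows (the policy name is a placeholder; double-check the syntax on your UCSM release):

UCS-A# scope org /
UCS-A /org # create nwctrl-policy keep-links-up
UCS-A /org/nwctrl-policy # set uplink-fail-action warning
UCS-A /org/nwctrl-policy # commit-buffer

The default uplink-fail-action is link-down, which is what shuts the server links when an FI loses all of its uplinks.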

Fabric Failover is fine on UCSM version 1.4 or later. Prior to 1.4, traffic could potentially become "black-holed", because the UCSM-assigned MAC address of the vmnic is never used to source traffic. Remember, VMware sources traffic from the actual virtual machine or VMkernel port MAC addresses. If Fabric Failover was enabled and a failure did occur, UCSM knew enough to re-route traffic originating from the inside outbound along the other path. However, during a fabric failover, inbound traffic (from upstream) might still try to use the original path, which may no longer be valid, since UCSM didn't track the source MAC addresses learned on the peer FI. We fixed this with a feature called MAC sync, which lets UCSM synchronize the L2 MAC tables between the Fabric Interconnects so that during a FF event a gratuitous ARP quickly notifies the upstream switches of the new path to the destination blade.

Long story short: if you have a mechanism for handling failover natively (as the N1K does), we would still recommend relying on that, as there's no real need for FF in this scenario. It's also a bit simpler to troubleshoot without the extra virtual adapters that Fabric Failover brings into play.

Regards,

Robert
