UCS ARP ? issue

jason.harlow · ‎12-20-2013

Having trouble tracking down why this might be happening....

We have a UCS chassis with fabric A and fabric B (two 6248s), with the management IPs clustered:

x.x.x.4 = fab a

x.x.x.5 = fab b

x.x.x.6 = cluster IP

The management ports are plugged into a separate HP 1 Gb switch.

For this example, A is primary.

Upstream, the fabric interconnects are connected to two Nexus 5548s.

From the "outside world" (i.e. my desktop coming in through the hp switch), I can ping all three addresses just fine.

However, from my VMware hosts on the Cisco UCS blades, if I go through a NIC on fabric A I can only ping .4 and .6. If I go through fabric B, I can only ping .5.

All other hosts on that VLAN work fine over both fabric A and B; it's just the management ports on the 6248s.

richbarb · ‎12-20-2013

Hello Jason,

For sure this is something outside the UCS scope.

How is your HP switch connected to the 5548s?

Try creating a vPC to connect the HP management switch to both Nexus 5548s. This vPC should allow the management VLAN, which should also be configured in the UCS domain, down to each vNIC that needs it.

Richard.

jason.harlow · ‎12-26-2013

I thought that was what was configured, but apparently I was wrong. Only Nexus A was connected to the HP switch.

However, when I created a vPC and enabled it across both switches, I could no longer connect to anything on the management network from the UCS blades. Everything looks up/OK on the Cisco and HP switches, but no traffic is making it over.

The vPC is configured with the same VLANs as when just the single Nexus was connected.

I tried with LACP mode active and with mode on.

The ports are spanning-tree port type normal and show a status of forwarding. I'm scratching my head over why this is not working.

Nexus A:

interface Ethernet1/1
  description "HP SW Int 42"
  switchport mode trunk
  switchport trunk allowed vlan 1,101
  speed 1000
  channel-group 20 mode active

interface port-channel20
  switchport mode trunk
  switchport trunk allowed vlan 1,101
  spanning-tree port type normal
  speed 1000
  vpc 20

Nexus B:

interface Ethernet1/1
  description HP SW Int 44
  switchport mode trunk
  switchport trunk allowed vlan 1,101
  speed 1000
  channel-group 20 mode active

interface port-channel20
  switchport mode trunk
  switchport trunk allowed vlan 1,101
  spanning-tree port type normal
  speed 1000
  vpc 20

HP Switch:

ProCurve Switch 2510G-48(eth-44)# sh lacp

                           LACP

   PORT   LACP      TRUNK     PORT      LACP      LACP
   NUMB   ENABLED   GROUP     STATUS    PARTNER   STATUS
   ----   -------   -------   -------   -------   -------
   42     Active    Dyn1      Up        Yes       Success
   44     Active    Dyn1      Up        Yes       Success

interface 42
   name "Trunk to Nexus 5k"
   lacp Active
exit
interface 44
   name "Trunk to Nexus 5k"
   lacp Active
vlan 1
   name "DEFAULT_VLAN"
   untagged 1-24,26,28,30,32,34,36,38,40,45-48
   ip address 172.20.0.200 255.255.0.0
   tagged 42,44
   no untagged 25,27,29,31,33,35,37,39,41,43
exit
vlan 101
   name "NEW_MGMT"
   untagged 25,27,29,31,33,35,37,39,41,43
   tagged 42,44
exit
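The `sh lacp` output from the HP switch above can also be checked mechanically. The following is a minimal Python sketch, not an HP or Cisco tool: the function name `parse_lacp_rows` and the dictionary keys are hypothetical, and the two data rows are copied from the output in this post. It confirms that both uplink ports landed in the same dynamic trunk group and that each reports an LACP partner.

```python
# Data rows copied from the `sh lacp` output posted above.
LACP_OUTPUT = """\
PORT LACP TRUNK PORT LACP LACP
NUMB ENABLED GROUP STATUS PARTNER STATUS
---- ------- ------- ------- ------- -------
42 Active Dyn1 Up Yes Success
44 Active Dyn1 Up Yes Success
"""


def parse_lacp_rows(text):
    """Return one dict per data row; header and separator rows are skipped."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        # Data rows have six columns and start with a numeric port number.
        if len(fields) == 6 and fields[0].isdigit():
            port, enabled, group, status, partner, lacp_status = fields
            rows.append({
                "port": int(port),
                "enabled": enabled,
                "group": group,
                "status": status,
                "partner": partner,
                "lacp_status": lacp_status,
            })
    return rows


rows = parse_lacp_rows(LACP_OUTPUT)
groups = {r["group"] for r in rows}          # trunk groups in use
all_up = all(r["status"] == "Up" and r["partner"] == "Yes" for r in rows)
print(groups, all_up)
```

A single group with every port up only shows that LACP negotiated on the HP side; it says nothing about whether traffic is actually forwarded.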

Originally, Nexus A simply had Eth1/1 connected to HP port 42, and Eth1/1 on Nexus B was shut down.

The half-working Nexus A config was:

interface Ethernet1/1
  description "HP SW Int 42"
  switchport mode trunk
  switchport trunk allowed vlan 1,101
  speed 1000

The HP VLAN config was untouched; the port was just not running LACP.

jason.harlow · ‎12-26-2013

# sh vpc

vPC domain id                     : 1
Peer status                       : peer adjacency formed ok
vPC keep-alive status             : peer is alive
Configuration consistency status  : success
Per-vlan consistency status       : success
Type-2 consistency status         : success
vPC role                          : primary
Number of vPCs configured         : 5
Peer Gateway                      : Enabled
Peer gateway excluded VLANs       : -
Dual-active excluded VLANs        : -
Graceful Consistency Check        : Enabled
Auto-recovery status              : Enabled (timeout = 240 seconds)

vPC Peer-link status
---------------------------------------------------------------------
id   Port   Status Active vlans
--   ----   ------ --------------------------------------------------
1    Po1    up     1-2,101-104,200-201,1000-1001

vPC status
----------------------------------------------------------------------------
id     Port        Status Consistency Reason       Active vlans
------ ----------- ------ ----------- ------------ --------------------
15     Po15        up     success     success      102,200-201
16     Po16        up     success     success      102,200-201
17     Po17        up     success     success      101-104,200-201,1000
18     Po18        up     success     success      101-104,200-201,1000
20     Po20        up     success     success      1,101
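One thing worth ruling out from this output is a VLAN that is active on the vPC but not carried on the peer link. The following is a minimal Python sketch for that check, assuming the comma-and-range VLAN list format shown above; `expand_vlans` is a hypothetical helper name, and the two VLAN strings are copied from the `sh vpc` output.

```python
def expand_vlans(spec):
    """Expand a VLAN list such as '1-2,101-104,1000' into a set of VLAN IDs."""
    vlans = set()
    for part in spec.replace(" ", "").split(","):
        if not part:
            continue
        if "-" in part:
            low, high = part.split("-", 1)
            vlans.update(range(int(low), int(high) + 1))
        else:
            vlans.add(int(part))
    return vlans


# Values copied from the `sh vpc` output above.
peer_link = expand_vlans("1-2,101-104,200-201,1000-1001")  # Po1
vpc20 = expand_vlans("1,101")                              # Po20

# VLANs active on vPC 20 but absent from the peer link.
missing = vpc20 - peer_link
print(sorted(missing))
```

For the values posted here the difference is empty, so VLANs 1 and 101 are both carried on the peer link and this is not the cause.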

jason.harlow · ‎12-31-2013

The really strange bit (and why I suspected that it's something in UCS) is that when I only have a link between Nexus A and the HP switch, everything in VMware over both fabrics can get to everything ELSE on the HP switch that's on that VLAN, just not the fabric interconnect management port for the other fabric.

Example:

(route based on originating port ID is used here so that I know all traffic for a VM is going over a particular vmnic)

VM 1: Can ping 10.1.101.4 and 10.1.101.6 (active fabric interconnect + cluster IP), but gets network unreachable errors for .5 (FI B).

VM 2: Can ping 10.1.101.5, but not .4 or .6.

Both of those VMs can ping/connect to everything else that's directly connected to the HP switch on VLAN 101. 10.1.101.4 and .5 are both also directly connected to the HP switch.
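The pattern in the example above is easier to see as a small reachability table. This is a minimal Python sketch that only restates the ping results reported in this post; the helper names `unreachable_by_source` and `reachable_from_all` are hypothetical, and no pings are actually sent.

```python
# Ping results as reported in the post above (True = reply received).
observed = {
    "VM 1": {"10.1.101.4": True, "10.1.101.5": False, "10.1.101.6": True},
    "VM 2": {"10.1.101.4": False, "10.1.101.5": True, "10.1.101.6": False},
}


def unreachable_by_source(results):
    """Map each source to the sorted list of targets it could not reach."""
    return {
        src: sorted(ip for ip, ok in targets.items() if not ok)
        for src, targets in results.items()
    }


def reachable_from_all(results):
    """Return the set of targets that every source could reach."""
    per_source = [{ip for ip, ok in t.items() if ok} for t in results.values()]
    return set.intersection(*per_source) if per_source else set()


print(unreachable_by_source(observed))
print(reachable_from_all(observed))
```

No management address is reachable from both VMs, and each VM fails exactly on the addresses that the other one reaches.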

mgkramer99 · ‎02-22-2014

We are having the exact same issue. Did you ever find a fix?

Walter Dey · ‎02-22-2014

Could you please post a network diagram of your setup?

If I understand you correctly:

The out-of-band management IP address is on the same VLAN / IP subnet as your VM IP addresses.

Management is out of band; VM communication is in-band!

Do you have northbound Ethernet uplinks to the HP switch, in addition to the management connections?

Can you please post the MAC address tables on the two FIs as well as the HP switch?

padramas · ‎02-23-2014

Hello

This could be due to CSCun19289: FI mgmt0 dropping traffic coming from blades behind that FI.

Are you initiating the ping to the FI mgmt interface from a VM hosted on UCS blades?

Padma

mgkramer99 · ‎02-23-2014

Hello,

Thank you for your response. CSCun19289 is exactly what we are running into. We just went to 2.2(1c) this weekend and started having this issue. We will implement the workaround for now.

Thanks

Matt