
1000v Uplinks to 2x Cisco 4500 & vPC-HM

allannelson
Level 1

Hi All,

I have been testing a 1000v setup recently and have everything working as I wanted. I have 8 pNICs connecting to 2 switches (4+4). I am using vPC-HM on the Nexus and a standard port channel on each of the 4500s. All control, packet, management, and data traffic goes over these uplinks. I have no vSwitches left in the environment, just the DVS. I have 2 physical hosts in the environment at the moment.

I wanted to do some tests to make sure that if the whole environment had to be turned off, it would all come back up again.

I shut everything down and then powered on the hosts. I was able to connect to the service consoles, so the system VLANs seemed to be working fine. I then started up the VSM (primary first), and when it came up, there were lots of errors:

Some are below:

2009 Dec 30 11:10:56 ESXvSwitch %KERN-1-SYSTEM_MSG: Dropping received frames from duplicate VSM saddr (0x1010000) - kernel

2009 Dec 30 11:10:12 ESXvSwitch %PLATFORM-2-MOD_PWRUP: Module 4 powered up (Serial number )
2009 Dec 30 11:10:20 ESXvSwitch %PLATFORM-2-PFM_VEM_REMOVE_NO_HB: Removing VEM 4 (heartbeats lost)
2009 Dec 30 11:10:20 ESXvSwitch %PLATFORM-2-MOD_REMOVE: Module 4 removed (Serial number )


2009 Dec 30 11:10:50 ESXvSwitch %PLATFORM-2-MOD_PWRUP: Module 3 powered up (Serial number )
2009 Dec 30 11:11:00 ESXvSwitch %PLATFORM-2-PFM_VEM_REMOVE_NO_HB: Removing VEM 3 (heartbeats lost)
2009 Dec 30 11:11:00 ESXvSwitch %PLATFORM-2-MOD_REMOVE: Module 3 removed (Serial number )

Modules 3 and 4 are the 2 VEMs.

I checked the port-channel status on the Nexus and did a little troubleshooting, and in the end I disabled all ports from one of the 4500s to each of the hosts, so I only had 8 ports from a single 4500 going to the 2 hosts (no redundant path). Both modules then came up and all the errors stopped. I left it for a few minutes, then un-shut the ports on the other 4500, and everything went back to normal.

The only thing I can think of that could be causing this is Spanning Tree running on the 4500s.

I can't turn off spanning tree across the whole switch, but I think I may need to disable it on just the ports going to the Nexus.

Can anyone confirm that this is what I should do, and what the best command is to do it (bpdufilter? would portfast trunk work?)?
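
For reference, the two options I am weighing on the 4500 side would look something like this (the interface here is just a placeholder, not one of my actual uplink ports):

interface GigabitEthernet1/1
  ! Option 1: skip the listening/learning delay on a trunk port
  spanning-tree portfast trunk
  ! Option 2: stop sending and processing BPDUs on the port
  spanning-tree bpdufilter enable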

I have attached a Visio layout of the setup (ignore the 3rd host, as there will be another added at a later date).

Any help would be much appreciated.

Regards

Allan

18 Replies

nenduri
Cisco Employee

Hi Allan,

What is the version of N1K you are running? There was a known issue with the first release of N1K, which was fixed in the most recent release (4.0(4)SV1(2)).

Thanks,

Naren

Hi Naren,

This is my current show ver:

ESXvSwitch# show ver
Cisco Nexus Operating System (NX-OS) Software
TAC support: http://www.cisco.com/tac
Copyright (c) 2002-2009, Cisco Systems, Inc. All rights reserved.
The copyrights to certain works contained in this software are
owned by other third parties and used and distributed under
license. Certain components of this software are licensed under
the GNU General Public License (GPL) version 2.0 or the GNU
Lesser General Public License (LGPL) Version 2.1. A copy of each
such license is available at
http://www.opensource.org/licenses/gpl-2.0.php and
http://www.opensource.org/licenses/lgpl-2.1.php

Software
  loader:    version 1.2(2) [last: image booted through mgmt0]
  kickstart: version 4.0(4)SV1(2)
  system:    version 4.0(4)SV1(2)
  kickstart image file is:
  kickstart compile time:  9/22/2009 2:00:00
  system image file is:    bootflash:/nexus-1000v-mz.4.0.4.SV1.2.bin
  system compile time:     9/22/2009 2:00:00 [12/10/2009 05:21:33]


Hardware
  Cisco Nexus 1000V Chassis ("Virtual Supervisor Module")
  Intel(R) Xeon(R) CPU         with 2075012 kB of memory.
  Processor Board ID T5056A10FF3

  Device name: ESXvSwitch
  bootflash:    2332296 kB

Kernel uptime is 6 day(s), 21 hour(s), 38 minute(s), 32 second(s)


plugin
  Core Plugin, Ethernet Plugin

So it looks like I am running the latest version.

In general though, should I be enabling portfast or bpdufilter on the switch ports that uplink to the Nexus 1000v?

Thanks for your reply.

Regards

Allan

nenduri
Cisco Employee

Hi Allan,

Can you post your uplink profile configuration that is being used by the 8 uplinks?

-Naren

Naren,

My Uplink profile is below:

port-profile type ethernet core-uplink
  vmware port-group
  switchport mode trunk
  switchport trunk allowed vlan 18,20,25,30,300,310,320
  channel-group auto mode on sub-group cdp
  no shutdown
  system vlan 18,300,310
  state enabled

Just to recap: everything works fine; the only issue happens when I restart the hosts. It seems that the ports don't come up quickly enough (STP is taking too long, or something similar) for the Nexus to keep the module loaded (the hello/heartbeat packets time out before the ports are active).

So to me it looks like an STP issue on the core switches (the 4500s), so I really just want to know whether I should enable bpdufilter/portfast trunk, and whether that is OK practice with the Nexus on the other end.

I read somewhere that the Nexus does not use STP, so I don't want to disable STP on the 4500 and then find out that a loop has been created between the 4500 and the Nexus.

Thanks for your help.

Allan

I have the same problem. I assumed the new version would fix it, but I'm looking at duplicate VSM frame errors again right now.

nashwj,

What is your setup like?

I only get these errors when bringing the environment up from a cold start with all switch ports enabled. If I shut down/unplug one of the core switches, the errors go away, and I can un-shut/plug in that core switch after a few minutes and never see the errors again. Hence my thinking that it is something to do with STP and the time it takes for the ports to be put into forwarding: the longer that takes, the longer it takes for the CDP packets to get through, which is what sets up the sub-groups.
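
To check this theory on the next cold start, I plan to watch the port state on the 4500 while a host boots, with something like this (the interface is just an example):

show spanning-tree interface gigabitethernet 1/1

If the port sits in the listening/learning states for around 30 seconds while the VEM heartbeats are timing out, that would line up with what I am seeing.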

If it was not a live environment that I was testing in, I would just enable bpdufilter/portfast trunk on each port that heads to the Nexus and test my theory, but I don't want to bring down anything else if I am wrong.

Unfortunately, I don't have access to more hardware to test this in a lab setup... I might end up setting one up at home and testing it there if I can't get it sorted soon.

Allan

I have 18 ESX servers, each connected to two Nexus 5K switches. Since I'm using first-generation CNA adapters and FCoE, I can't do normal port channels on the Nexus 5Ks, so I have to use vPC-HM on each ESX server. I see what you see. To stop it, I've got the VSMs connected via a single Gb link to the network instead of running them through a vPC-HM link.

Allan,

Glad to know everything is working fine. Regarding STP: since the N1K is an access switch and the N1K VEMs don't form loops, we don't run STP on the N1K.

You can safely enable PortFast on the upstream switch ports that are connected to the N1K VEMs, and this is the recommended practice.

If you still see the error messages when restarting the hosts after enabling PortFast, please change your configuration as follows.

1. In the uplink port profile where you have the vPC-HM configuration 'sub-group cdp', change it as follows:

channel-group auto mode on mac-pinning

2. With the above configuration, you should not configure a channel on the upstream switch ports. So you need to remove the channel configuration from the upstream switch ports that are connected to the VEMs.
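
As a sketch, the removal on the 4500 side would be something like this (the interface range is an example only; use your actual member ports):

interface range GigabitEthernet1/1 - 4
  no channel-group

With mac-pinning, each uplink is pinned to a single physical NIC, so the upstream switch sees independent ports and no channel is needed there.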

Thanks,

Naren

Thanks for your reply and suggestions, Naren.

I just want to clarify: as these ports are trunk ports, should I be using "spanning-tree portfast trunk" on the 4500 instead of just "spanning-tree portfast"?

Thanks

Allan

Allan,

Yes, you need portfast trunk. If you have configured a channel on the upstream switch, you need this configuration on both the member ports and the port-channel interface.
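
For example (interface and channel numbers are placeholders; substitute your own):

interface Port-channel10
  spanning-tree portfast trunk
interface range GigabitEthernet1/1 - 4
  spanning-tree portfast trunk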

Thanks,

Naren

Hi Naren.

I have tried your advice and put portfast on all the ports.

This did not seem to work, so I went to make the other change you suggested: apply the mac-pinning command and remove the channel groups on the upstream switch.

I get this error when trying to apply the command:

ESXvSwitch(config-port-prof)# channel-group auto mode on mac-pinning
ERROR: cannot execute command with 'mac-pinning' option.
The system vem feature level does not meet the minimum requirements for this feature.

Am I missing something? I didn't know there were different feature levels for the VEM.

Thanks

Allan

Also,

I have also considered that CDP is taking too long to supply the details for auto-configuration of the sub-groups, so I tried specifying the sub-groups manually using "sub-group id #" on the physical interfaces, and modifying the uplink profile to use "channel-group auto mode on". A sketch of what I tried is below.
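
Roughly what I tried (interface and sub-group numbers are examples, and the exact keyword may differ slightly on this release):

interface Ethernet3/1
  sub-group-id 0
interface Ethernet3/2
  sub-group-id 1

port-profile type ethernet core-uplink
  channel-group auto mode on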

This also did not seem to help.

Thanks

Allan

Hi again Naren,

Just an update.

I worked out how to upgrade the feature set of the VEMs. I had upgraded the VSM a while back but never did the VEM part of it. I have now done that and upgraded the feature set, and everything is working as suggested.

I was able to apply the mac-pinning command, and after some testing everything came up properly.
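
For anyone following this thread, the uplink profile now looks like this (identical to my earlier one apart from the channel-group line):

port-profile type ethernet core-uplink
  vmware port-group
  switchport mode trunk
  switchport trunk allowed vlan 18,20,25,30,300,310,320
  channel-group auto mode on mac-pinning
  no shutdown
  system vlan 18,300,310
  state enabled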

Thanks for your help.

Now I just have to work out why sub-group cdp did not work properly, as it is my preferred method and allows better use of the bandwidth.

Regards

Allan

Hi Allan,

With CDP you can expect some delay between the upstream switches sending CDP and the corresponding ports on the N1K side being put into the forwarding state. This is expected behavior.

Thanks,

Naren