Solved: Re: Nexus 1K VEM module shutdown (with DELL BLADE server)

Seok-Bin Lim · ‎12-27-2013

Hello, this is Vince.

I am running a PoC with an important customer.

Can anyone help explain what the problem is?

I have found a couple of strange behaviors in a Nexus 1000V running on Dell blade servers.

The network diagram is shown below.

I installed vSphere ESXi on each of two Dell blade servers.

As the diagram shows, each server is connected to the Cisco N5K through the Dell M8024 blade switches.

- Two N1KV VMs are installed on ESXi (as primary and secondary, of course).
- The N5K is connected to the M8024 in a vPC.
- The VSM and the VEMs reach each other over the Layer 3 control interface.
- The uplink port-profile uses mac-pinning for port-channel load balancing.

interface control0
  ip address 10.10.100.10/24

svs-domain
  domain id 1
  control vlan 1
  packet vlan 1
  svs mode L3 interface control0

port-profile type ethernet Up-Link
  vmware port-group
  switchport mode trunk
  switchport trunk allowed vlan 1-2,10,16,30,77-78,88,100,110,120-121,130
  switchport trunk allowed vlan add 140-141,150,160-161,166,266,366
  service-policy type queuing output N1KV_SVC_Uplink
  channel-group auto mode on mac-pinning
  no shutdown
  system vlan 1,10,30,100
  state enabled
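
A quick sanity check on the uplink profile above: as far as I know, every system VLAN on an Ethernet port-profile should also appear in its trunk allowed list. The sketch below expands the two allowed-VLAN lines and the system VLAN line from this config and reports anything missing. The helper name `expand_vlans` is made up for illustration; it is not part of any Cisco tooling.

```python
def expand_vlans(spec):
    """Expand an NX-OS style VLAN list such as '1-2,10,77-78' into a set of IDs."""
    vlans = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            low, high = part.split("-", 1)
            vlans.update(range(int(low), int(high) + 1))
        else:
            vlans.add(int(part))
    return vlans


# Values copied from the Up-Link port-profile in this post.
allowed = expand_vlans("1-2,10,16,30,77-78,88,100,110,120-121,130")
allowed |= expand_vlans("140-141,150,160-161,166,266,366")
system = expand_vlans("1,10,30,100")

# System VLANs that are not carried on the trunk (should be empty).
missing = sorted(system - allowed)
print("system VLANs missing from the trunk:", missing if missing else "none")
```

For this profile the check finds nothing missing, so the uplink side of the system VLAN configuration looks consistent.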

n1000v# show module
Mod  Ports  Module-Type                       Model               Status
---  -----  --------------------------------  ------------------  ------------
1    0      Virtual Supervisor Module         Nexus1000V          ha-standby
2    0      Virtual Supervisor Module         Nexus1000V          active *
3    332    Virtual Ethernet Module           NA                  ok
4    332    Virtual Ethernet Module           NA                  ok

Mod  Sw                  Hw
---  ------------------  ------------------------------------------------
1    4.2(1)SV2(2.1a)     0.0
2    4.2(1)SV2(2.1a)     0.0
3    4.2(1)SV2(2.1a)     VMware ESXi 5.5.0 Releasebuild-1331820 (3.2)
4    4.2(1)SV2(2.1a)     VMware ESXi 5.5.0 Releasebuild-1331820 (3.2)

Mod  Server-IP        Server-UUID                           Server-Name
---  ---------------  ------------------------------------  --------------------
1    10.10.10.10      NA                                    NA
2    10.10.10.10      NA                                    NA
3    10.10.10.101     4c4c4544-0038-4210-8053-b5c04f485931  10.10.10.101
4    10.10.10.102     4c4c4544-0043-5710-8053-b4c04f335731  10.10.10.102

Now let me explain the strange behavior.

If I move the primary N1KV from the ESXi host of module 3 to the other ESXi host (module 4), the VEM suddenly shuts down.

Here are the syslog messages.

2013 Dec 20 15:45:22 n1000v %VEM_MGR-2-VEM_MGR_REMOVE_NO_HB: Removing VEM 4 (heartbeats lost)

2013 Dec 20 15:45:22 n1000v %VIM-5-IF_DETACHED_MODULE_REMOVED: Interface Ethernet4/7 is detached (module removed)

2013 Dec 20 15:45:22 n1000v %VIM-5-IF_DETACHED_MODULE_REMOVED: Interface Ethernet4/8 is detached (module removed)

2013 Dec 20 15:45:22 n1000v %VIM-5-IF_DETACHED_MODULE_REMOVED: Interface Vethernet1 is detached (module removed)

2013 Dec 20 15:45:22 n1000v %VIM-5-IF_DETACHED_MODULE_REMOVED: Interface Vethernet17 is detached (module removed)

2013 Dec 20 15:45:22 n1000v %VIM-5-IF_DETACHED_MODULE_REMOVED: Interface Vethernet9 is detached (module removed)

2013 Dec 20 15:45:22 n1000v %VIM-5-IF_DETACHED_MODULE_REMOVED: Interface Vethernet37 is detached (module removed)

....

2013 Dec 20 15:46:53 n1000v %VEM_MGR-2-MOD_OFFLINE: Module 4 is offline
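
For anyone who wants to pull the key facts out of a longer log, here is a minimal Python sketch that parses messages in the format shown above. It picks out which module lost heartbeats, which interfaces were detached, and how long it took until the module went offline. The regular expression is an assumption based only on these sample lines, not an official syslog grammar, and the sample is shortened to four of the lines above.

```python
import re
from datetime import datetime

# A shortened copy of the syslog lines from this post.
LOG = """\
2013 Dec 20 15:45:22 n1000v %VEM_MGR-2-VEM_MGR_REMOVE_NO_HB: Removing VEM 4 (heartbeats lost)
2013 Dec 20 15:45:22 n1000v %VIM-5-IF_DETACHED_MODULE_REMOVED: Interface Ethernet4/7 is detached (module removed)
2013 Dec 20 15:45:22 n1000v %VIM-5-IF_DETACHED_MODULE_REMOVED: Interface Vethernet1 is detached (module removed)
2013 Dec 20 15:46:53 n1000v %VEM_MGR-2-MOD_OFFLINE: Module 4 is offline
"""

# timestamp, hostname, %FACILITY-SEVERITY-MNEMONIC: free text
LINE_RE = re.compile(
    r"^(?P<ts>\d{4} \w{3} +\d{1,2} \d{2}:\d{2}:\d{2}) (?P<host>\S+) "
    r"%(?P<facility>[A-Z_]+)-(?P<severity>\d)-(?P<mnemonic>[A-Z_]+): (?P<text>.*)$"
)


def parse(log):
    """Return one dict per line that matches the expected syslog layout."""
    events = []
    for line in log.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            event = match.groupdict()
            event["time"] = datetime.strptime(event["ts"], "%Y %b %d %H:%M:%S")
            events.append(event)
    return events


events = parse(LOG)

# Modules removed because heartbeats were lost.
lost_modules = [
    int(re.search(r"VEM (\d+)", e["text"]).group(1))
    for e in events
    if e["mnemonic"] == "VEM_MGR_REMOVE_NO_HB"
]

# Interfaces detached as a consequence (second word of the message text).
detached = [
    e["text"].split()[1]
    for e in events
    if e["mnemonic"] == "IF_DETACHED_MODULE_REMOVED"
]

# Seconds between the heartbeat-loss message and the module going offline.
first_loss = next(e["time"] for e in events if e["mnemonic"] == "VEM_MGR_REMOVE_NO_HB")
offline = next(e["time"] for e in events if e["mnemonic"] == "MOD_OFFLINE")
seconds_to_offline = int((offline - first_loss).total_seconds())

print(lost_modules, detached, seconds_to_offline)
```

The month abbreviation is parsed with `%b`, so this assumes an English or C locale.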

To make it work again, I have to do two things.

First, the vSwitch load-balancing method has to be set to source MAC.

(Port ID is the default.)

Second, the failover order on the vSwitch is very important.

If I change this order, the VEM goes offline very soon.

Here is a screen capture of these options. (You may not be able to read the Korean text.)

In my opinion, the main problem is the links between ESXi and the M8024.

As you saw, each ESXi host is connected to two M8024 Dell blade switches separately.

I read the manual on N1K uplink load balancing.

Even though there are 16 different port-channel load-balancing methods, only src-mac should be used if the upstream switches do not support port channels.

But I don't know exactly why this happened.

Can anyone help me make this work better?

Thanks in advance.

Best Regards,

Vince

plowden · ‎01-07-2014

Hi, Vince,

Sorry for the late reply. Cisco was shut down over the holidays, so most of us were on vacation.

Thank you for the excellent debugging information.

The VEM_MGR-2-VEM_MGR_REMOVE_NO_HB means the heartbeat was lost between the control interface of the active VSM (which you're referring to as the active N1KV VM) and the VEM when they're on the same ESXi server.

Does the n1kv-l3-control port profile include the configuration entry "system vlan 1"? If not, try that first.

If that doesn't work, do you have another vmkernel interface in the same IP subnet as the control interface? ESXi will always use the lowest numbered vmk for outgoing packets. If this is not the control vmk, heartbeats will be dropped as soon as its MAC table entry on the Dell M8024 ages out or there's a MAC move, which is the case when you move the VSM to the other server.
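
To illustrate the vmk question above, here is a small Python sketch that takes a list of vmkernel interfaces, finds the ones sharing the control vmk's subnet, and applies the "lowest numbered vmk wins" rule described in this reply. The interface names and every address except 10.10.10.101 are hypothetical; substitute the real values from your hosts.

```python
import ipaddress

# Hypothetical vmkernel inventory for one ESXi host.
# Only 10.10.10.101 comes from the "show module" output; the rest is made up.
vmks = {
    "vmk0": "10.10.10.101/24",   # management
    "vmk1": "10.10.100.50/24",   # assumed extra vmk in the control subnet
    "vmk2": "10.10.100.101/24",  # assumed L3 control vmk
}
control_vmk = "vmk2"


def vmk_number(name):
    """Numeric suffix of a vmk name, e.g. 'vmk2' -> 2."""
    return int(name[len("vmk"):])


def vmks_in_control_subnet(vmks, control_vmk):
    """Return all vmks whose subnet matches the control vmk's, lowest number first."""
    control_net = ipaddress.ip_interface(vmks[control_vmk]).network
    same = [
        name
        for name, addr in vmks.items()
        if ipaddress.ip_interface(addr).network == control_net
    ]
    return sorted(same, key=vmk_number)


same_subnet = vmks_in_control_subnet(vmks, control_vmk)
# Per the rule above, outgoing packets use the lowest numbered vmk in the subnet.
expected_egress = same_subnet[0]

if expected_egress != control_vmk:
    print(f"{expected_egress} shares the subnet with {control_vmk} and would be used instead")
else:
    print(f"{control_vmk} is the only or lowest vmk in its subnet")
```

With this made-up inventory, vmk1 would carry the outgoing control traffic instead of vmk2, which is the situation to look for.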

In any case, the best practice is to use the management interface for control with "svs mode L3 interface Mgmt0" so you don't have to create a separate vethernet port profile for a control vmk.

Hope this helps,

Phil Lowden

Cisco Consulting SE
