Nexus 1K VEM module shutdown (with DELL BLADE server)

Seok-Bin Lim
Level 1

Hello, this is Vince.

I am running a PoC for an important customer.

Can anyone help explain what the problem is?

I have found a couple of strange behaviors in a Nexus 1000V running on Dell blade servers.

The network diagram is shown below.

[Image: network diagram (screenshot, 2013-12-24 3:47:37 AM)]

I installed vSphere ESXi on two Dell blade servers.

As the diagram shows, each server is connected to the Cisco N5K through the Dell M8024 blade switches.

- Two N1KV VMs are installed on ESXi (as primary and secondary, of course).

- The N5K is connected to the M8024 with vPC.

- The VSM and the VEMs reach each other through the Layer 3 control interface.

- The uplink port-profile uses mac-pinning for port-channel load balancing.

interface control0

  ip address 10.10.100.10/24

svs-domain

  domain id 1

  control vlan 1

  packet vlan 1

  svs mode L3 interface control0

port-profile type ethernet Up-Link

  vmware port-group

  switchport mode trunk

  switchport trunk allowed vlan 1-2,10,16,30,77-78,88,100,110,120-121,130

  switchport trunk allowed vlan add 140-141,150,160-161,166,266,366

  service-policy type queuing output N1KV_SVC_Uplink

  channel-group auto mode on mac-pinning

  no shutdown

  system vlan 1,10,30,100

  state enabled

n1000v# show module

Mod  Ports  Module-Type                       Model               Status

---  -----  --------------------------------  ------------------  ------------

1    0      Virtual Supervisor Module         Nexus1000V          ha-standby

2    0      Virtual Supervisor Module         Nexus1000V          active *

3    332    Virtual Ethernet Module           NA                  ok

4    332    Virtual Ethernet Module           NA                  ok

Mod  Sw                  Hw    

---  ------------------  ------------------------------------------------

1    4.2(1)SV2(2.1a)     0.0                                            

2    4.2(1)SV2(2.1a)     0.0                                            

3    4.2(1)SV2(2.1a)     VMware ESXi 5.5.0 Releasebuild-1331820 (3.2)   

4    4.2(1)SV2(2.1a)     VMware ESXi 5.5.0 Releasebuild-1331820 (3.2)   

Mod  Server-IP        Server-UUID                           Server-Name

---  ---------------  ------------------------------------  --------------------

1    10.10.10.10      NA                                    NA

2    10.10.10.10      NA                                    NA

3    10.10.10.101     4c4c4544-0038-4210-8053-b5c04f485931  10.10.10.101

4    10.10.10.102     4c4c4544-0043-5710-8053-b4c04f335731  10.10.10.102

Let me explain the strange behavior.

If I move the primary N1KV VM from the ESXi host of module 3 to the other ESXi host (module 4), the VEM suddenly shuts down.

Here are the syslogs.

2013 Dec 20 15:45:22 n1000v %VEM_MGR-2-VEM_MGR_REMOVE_NO_HB: Removing VEM 4 (heartbeats lost)

2013 Dec 20 15:45:22 n1000v %VIM-5-IF_DETACHED_MODULE_REMOVED: Interface Ethernet4/7 is detached (module removed)

2013 Dec 20 15:45:22 n1000v %VIM-5-IF_DETACHED_MODULE_REMOVED: Interface Ethernet4/8 is detached (module removed)

2013 Dec 20 15:45:22 n1000v %VIM-5-IF_DETACHED_MODULE_REMOVED: Interface Vethernet1 is detached (module removed)

2013 Dec 20 15:45:22 n1000v %VIM-5-IF_DETACHED_MODULE_REMOVED: Interface Vethernet17 is detached (module removed)

2013 Dec 20 15:45:22 n1000v %VIM-5-IF_DETACHED_MODULE_REMOVED: Interface Vethernet9 is detached (module removed)

2013 Dec 20 15:45:22 n1000v %VIM-5-IF_DETACHED_MODULE_REMOVED: Interface Vethernet37 is detached (module removed)

....

2013 Dec 20 15:46:53 n1000v %VEM_MGR-2-MOD_OFFLINE: Module 4 is offline

To make it work again, I have to do two things.

First, the vSwitch load-balancing method has to be set to source MAC hash.

(Port ID is the default.)

Second, the switch failover order is very important.

If I change this order, the VEM goes offline very soon.

Here is a screen capture of these options. (You may not understand the Korean text.)

[Image: vSwitch load-balancing and failover-order settings (screenshot, 2013-12-24 2:30:18 PM)]

In my opinion, the main problem is the links between ESXi and the M8024.

As you saw, each ESXi host is connected to two separate Dell M8024 blade switches.

I checked the manual for the N1K uplink load-balancing methods.

There are 16 different port-channel load-balancing methods,

but only src-mac should be used if the upstream switches do not support port channels.

But I don't know exactly why this happens.

Can anyone help me make this work better?

Thanks in advance.

Best Regards,

Vince

Accepted Solution

plowden
Cisco Employee

Hi, Vince,

Sorry for the late reply.  Cisco was shut down over the holidays, so most of us were on vacation.

Thank you for the excellent debugging information.

The VEM_MGR-2-VEM_MGR_REMOVE_NO_HB means the heartbeat was lost between the control interface of the active VSM (which you're referring to as the active N1KV VM) and the VEM when they're on the same ESXi server.

Does the n1kv-l3-control port profile include the configuration entry "system vlan 1"?  If not, try that first.
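For reference, a minimal sketch of what such a port profile could look like. The profile name n1kv-l3-control comes from the question above; placing the control vmk in access VLAN 1 is an assumption based on the posted svs-domain settings, so adjust the VLAN to whatever the control vmkernel interface actually uses.

```
! Sketch only: the VLAN number is an assumption, not taken from the running config
port-profile type vethernet n1kv-l3-control
  capability l3control
  vmware port-group
  switchport mode access
  switchport access vlan 1
  no shutdown
  system vlan 1
  state enabled
```

The same VLAN also needs to be a system VLAN on the uplink profile; the posted Up-Link profile already lists "system vlan 1,10,30,100", so VLAN 1 is covered there.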

If that doesn't work, do you have another vmkernel interface in the same IP subnet as the control interface?  ESXi will always use the lowest numbered vmk for outgoing packets.  If this is not the control vmk, heartbeats will be dropped as soon as its MAC table entry on the Dell M8024 ages out or there's a MAC move, which is the case when you move the VSM to the other server.

In any case, the best practice is to use the management interface for control with "svs mode L3 interface mgmt0" so you don't have to create a separate vethernet port profile for a control vmk.
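As a rough sketch, the change on the VSM would look something like the lines below. The domain id is copied from the posted config; everything else about the environment is assumed, and changing the SVS mode on a live system can interrupt VSM-to-VEM connectivity, so treat this as an outline to try in the PoC rather than a verified procedure.

```
! Sketch only: switches L3 control from control0 to mgmt0
svs-domain
  domain id 1
  svs mode L3 interface mgmt0
```

Afterwards, "show svs domain" should report the L3 control interface, and "show module" should list both VEMs as ok again.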

Hope this helps,

Phil Lowden

Cisco Consulting SE
