1000v on UCS dropping VEMS

Cy Hauptmann
Level 1

Hi all, I was wondering if any of you have noticed that if you run the 1000v VSM on a UCS server, the VEMs on other UCS servers will power cycle during the day? Anyone else seen this?

My environment has a UCS and 3 Dell R900s running ESXi 4.1.  When I put the VSM on the UCS, the VEMs on the other hosts will power cycle ... but if I put the VSM on the R900s, the VEMs are stable.

Any thoughts?

11 Replies

Robert Burns
Cisco Employee

Cy,

I've never seen/heard of this.  For a case as abstract as this, you would be best to open a TAC case.  There are many entities involved.

A few questions:

- What model of UCS server (and is this a blade/rack)?

- Have you checked the vmkernel logs of the "rebooting" ESX hosts for any clues?  VMware should be logging information there.

- With the VSM running on the UCS servers, are the Dell servers also rebooting?

Odds are this is more of an OS-related issue, as it only happens when the VSMs are on UCS servers.  There should also be SEL event entries on UCS detailing why the blade was power cycled.  If there's no information, then UCSM wasn't involved and it was an OS-initiated shutdown/reboot task.

Regards,

Robert

Robert, thanks for replying, but I think you may have misunderstood what I was trying to say.  The ESXi host doesn't actually reboot.  When I keep the VSM within the UCS chassis, the VEMs will lose connectivity with it and power cycle themselves, so I lose connectivity to that particular host for 1-2 seconds while it re-establishes connectivity with the VSM.  I've got this set up in two different environments, and if I keep the VSM within the UCS architecture, I see this behavior.  In the environment where the Dell R900s host the VSM and the VEMs have to go through the upstream switch to reach the VSM, they are stable and able to maintain connectivity to the VSM.

As for the UCS, I have both 6120 and 6140 Fabric Interconnects running in my two environments.  I may have something misconfigured within my UCS that is causing this behavior, but everything looks the way it should to allow everything to work.

Cy.

Cy, it sounds like you may not have the proper control/packet VLANs trunked to the vNIC that the VSM gets placed on in UCS. It could be on the UCS side, the VMware side, or both.

I thought about that too, but with the VSM on the Dell R900s everything works fine, so the control/packet VLANs are on the vNICs within both UCS and VMware.

Cy, you can verify this by SSHing into an ESX host and running "vem-health check <VSM AIPC MAC>"; you can get the VSM's AIPC MAC by first running "show svs neighbors" from the VSM.

Take a look at this troubleshooting guide as well

http://www.cisco.com/en/US/docs/switches/datacenter/nexus1000/sw/4_0_4_s_v_1_3/troubleshooting/configuration/guide/n1000v_trouble_5modules.html#wp1201605
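Putting the two commands together, the check looks roughly like this (the MAC address below is a placeholder; the exact prompt and output vary by release):

```
! On the VSM, note the AIPC MAC from the neighbor table:
n1000v# show svs neighbors

! Then, from the ESX host's console, check VEM-to-VSM health
! against that MAC:
~ # vem-health check 00:50:56:aa:bb:cc
```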

Cy,

Yeah, I did misunderstand your question.  The way you phrased it made it sound like the VEMs were rebooting.  Now I see it's the VSM rebooting.

The only way a VSM would reboot on its own would be related to network connectivity within the Control VLAN.  If this communication is lost, you'll run into a split brain, an election will happen, and the losing VSM will become subordinate and reboot.

Please paste your VSM running config.

Robert

Cy Hauptmann
Level 1

Robert, I still think you are misunderstanding my problem.  Last night I went ahead and upgraded my VSMs and VEMs to 1.4.1, hoping to resolve it.

The problem I'm having is this.  I have 10 ESXi hosts running: 6 B200 blades and 1 B250 blade on the UCS chassis connected to a 6120 Fabric Interconnect, plus 3 Dell R900 servers.  The error I'm getting on the VSM is "%VEM_MGR-2-VEM_MGR_REMOVE_NO_HB: Removing VEM 10 (heartbeats lost)".  Once this happens, it pretty much resets the VEM connectivity and the VEM module reconnects to the VSM.  The whole process takes about 6-10 seconds to come back online.  Of course, during this process all the VMs on that host lose connectivity, as all their network cards lose connectivity to the 1000v.

This only happens for the ESXi hosts on the UCS chassis; it never happens for the Dell R900s.  It happens randomly throughout the day and on no particular blade server.  Also, if I have the VSM VM running on one of the UCS blade servers, it happens more frequently, so I have since moved the VSM VMs over to the Dell R900s.

Any thoughts?

Cy.

Cy,

Based on what you are describing, Rob is exactly right. There is an issue with the control VLAN or the interface that the VSM uses to communicate with the VEMs. It could be something as simple as congestion on the network or something more complex like a network issue upstream.

We use the control network to maintain the virtual backplane between the VSMs and the VEMs. After 6 missed heartbeats (1 per second) we will drop the module. It's the same between the primary and secondary VSMs: if the control network goes down between them, one VSM becomes primary and forces the other to reboot. The rebooting VSM will continue to reboot until the control network comes back between the two VSMs.
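As an illustration of the timeout logic described above (a toy sketch, not Cisco's implementation; the class and threshold names are made up), the VSM-side bookkeeping amounts to dropping any module whose last heartbeat is more than six one-second intervals old:

```python
import time

HEARTBEAT_INTERVAL_S = 1.0   # each VEM sends one heartbeat per second
MISSED_HEARTBEAT_LIMIT = 6   # after 6 missed heartbeats the module is removed

class VemTracker:
    """Toy model of VSM-side heartbeat tracking (illustrative only)."""

    def __init__(self):
        self.last_seen = {}  # module number -> timestamp of last heartbeat

    def heartbeat(self, module, now=None):
        """Record a heartbeat from a VEM module."""
        self.last_seen[module] = now if now is not None else time.monotonic()

    def stale_modules(self, now=None):
        """Return modules whose heartbeats have been lost for too long."""
        now = now if now is not None else time.monotonic()
        deadline = HEARTBEAT_INTERVAL_S * MISSED_HEARTBEAT_LIMIT
        return [m for m, t in self.last_seen.items() if now - t > deadline]

tracker = VemTracker()
tracker.heartbeat(10, now=0.0)   # VEM 10 last heard from at t=0 s
tracker.heartbeat(11, now=5.5)   # VEM 11 heard from recently
print(tracker.stale_modules(now=7.0))  # VEM 10 is past the 6 s window -> [10]
```

This is why brief control-VLAN congestion is enough to trigger the "heartbeats lost" removal: the window is only a few seconds wide.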

Since the problem seems to happen only on the UCS system, most likely there is a networking issue or misconfiguration on the UCS that is causing it. If you want to continue to analyze it locally, take a good look at the network config on the UCS and check for any congestion or issue that would cause packet drops. The other suggestion is to open a TAC case and have someone from TAC take a look. The team that works on UCS also works on the Nexus 1000V, so they are very aware of the interaction between the two.

louis

Cy,

Please post the following outputs (attach the log files) to this thread - I'll take a look.

From VSM: "show run", "show logging last 100"

From VEM: "vemlog show all > /tmp/vemlog.txt" (output redirected to file)

From UCS CLI: "show configuration"

Regards,

Robert

Cy Hauptmann
Level 1

After speaking with a TAC engineer, it turns out I was missing "channel-group auto mode mac-pinning" on the port-profile that I'm using for the UCS ESXi servers.  Once I put that in, I haven't seen any heartbeats-lost messages in my logs, and connectivity to the UCS has been stable.
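For anyone hitting the same symptom, the fix goes in the Ethernet uplink port-profile on the VSM. A minimal sketch of what that profile might look like (the profile name and VLAN IDs here are placeholders, not from my actual config):

```
port-profile type ethernet UCS-UPLINK
  vmware port-group
  switchport mode trunk
  switchport trunk allowed vlan 10,20,30
  channel-group auto mode mac-pinning
  no shutdown
  system vlan 10,20
  state enabled
```

The system VLANs should include your control and packet VLANs so the VEMs can reach the VSM before the profile is programmed.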

Thanks all for your help.

Cy.

Cy - Thanks for closing the loop on this (I would have told you the same as soon as I saw your uplink port profile).  I do find it interesting that you didn't see this behavior on non-UCS hosts when they were hosting the VSM, unless you had only a single physical uplink connected to the VEM/DVS, in which case the mac-pinning command would be irrelevant.

Coming back to our original suspicion, this was a problem with your Control traffic.  It was unable to move reliably between your VSM and VEM hosts on UCS when using dual uplinks with no channel grouping.  Believe it or not, this is a very common configuration oversight. MAC pinning is our auto channel-grouping mode of choice.

All in all, great to hear you're all set.  Let us know if you have any other questions/issues!

Regards,

Robert