1000v does not recover after power outage + WAN link failure

NguyenT11
Level 1

Hi all,

We have 2xESXi clusters and the 1000v deployed to many remote sites. The clusters themselves are managed centrally at HQ with Virtual Center (VC). The other day one of our remote sites lost power and the local carrier equipment failed. When the servers came back up, the WAN was still down, so the connection to VC was not active. I can see in the logs that both VSMs were talking and syncing configurations, but the VEM did not come up until the VC connection was re-established (5 hrs later).

I had thought that the 1000v VEM would come back up with the last known configuration, so that regardless of the lack of VC, traffic would still flow once the servers came back up. Am I wrong?

Another item to note is that we are running HA/DRS, so all the VMs got moved to one host. Could this be causing problems, with the VEM changing configs and the VSM not being able to register the changes with VC?

Thanks


Tho

3 Replies

mipetrin
Cisco Employee

Hi Tho,

While the VSM is down, the VEMs continue to forward traffic using the last known configuration. Any new virtual machines that are started on those VEMs will not have connectivity because the VSM will not be available to set up the port configurations, or any other configuration changes. When the virtual machine is migrated, the virtual Ethernet (vEth) ports will not be configured on the new host because the VSM is not there.

Now, if you reboot the ESX host while the VSM is down, the only VLANs that will be forwarding upon the reboot are those defined as system VLANs. All other VLANs need to be reprogrammed by the VSM and thus reachability of the VSM is a must.

Additionally, the VSM can only make configuration changes when there is an active SVS connection. It seems from your description that the SVS connection was down prior to the reboot of the ESX servers, and thus when they came back up, no changes could be made on the VSM. As a result, the VEM was not programmed, except for system VLANs.
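
To make the reboot rule above concrete, here is a minimal Python sketch. It models only the statement that, after a host reboot with the VSM unreachable, just the system VLANs forward. The function name and the VLAN IDs are hypothetical and chosen for illustration; this is not Nexus 1000V code.

```python
def forwarding_vlans_after_reboot(configured_vlans, system_vlans, vsm_reachable):
    """Return the set of VLANs expected to forward after an ESX host reboot.

    Hypothetical model: if the VSM is reachable, it reprograms every
    configured VLAN; otherwise only VLANs marked as system VLANs forward.
    """
    configured = set(configured_vlans)
    if vsm_reachable:
        return configured
    # VSM down: only system VLANs that are actually configured keep forwarding.
    return configured & set(system_vlans)


# Illustrative values only: 260/261 stand in for control/packet system VLANs.
print(forwarding_vlans_after_reboot([10, 20, 260, 261], [260, 261], vsm_reachable=False))
```

With the VSM unreachable, the data VLANs (10 and 20 in this made-up example) drop out of the result until the VSM can reprogram them.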

Hope that helps to clarify

Thanks,

Michael

Makes sense, thanks. 

My only question now is why my VEM did not come up even though my Control/Packet VLANs are designated as system VLANs. According to the logs, the VEM did not recover until my SVS connection came back, which was 5 hrs after the outage occurred.

(I did simulate a WAN failure, booting up a single server at a time to cause all VMs to come up on a single host. Everything worked as expected, even without the SVS connection and with VMs coming back up on different hosts. Looking at the logs, after reboot, the VEM came up right away.)

Hi Tho,

It is still expected that the VEM was offline, not because of the SVS connection but because of the WAN connectivity issue to the vCenter server. Let me explain.

Certain configuration on the VSM is pushed to the vCenter server and stored within the vCenter MOB. This is known as opaque data. This opaque data is a collection of Cisco Nexus 1000V Series configuration parameters that is maintained by the VSM and VMware vCenter Server when the link between the two is established (that is, SVS connection). The opaque data contains configuration details that each VEM needs to establish connectivity to the VSM during VEM installation.

Among other content, the opaque data contains:

   * Switch domain ID

   * Switch name

   * Control and packet VLAN IDs

   * System port profiles (those defined with system VLANs)

When a new VEM comes online, either after initial installation or upon restart of a VMware ESX host, it is essentially an unprogrammed line card. To be correctly configured, the VEM needs to communicate with the VSM. VMware vCenter Server automatically sends the opaque data to the VEM, which the VEM uses to establish communication with the VSM and download the appropriate configuration data.

In a slightly different scenario, if the VEM can communicate with the VC (to pull down the last version of opaque data) but the SVS connection is still down, then the following would happen:

   * The VEM would program its system VLANs

   * The VEM would then be able to communicate with the VSM and pull down the rest of the configuration, such as the port-profile configurations (from the last sync of the SVS connection)

   * This would then allow the VMs on the VEM to have network connectivity again

   * However, no changes would be possible (such as vMotion or port-profile changes) as the SVS connection is still down

   * Only once the SVS connection is restored will all changes be possible again

The SVS connection is needed only for the VSM to push any new configuration or changes in configuration to the vCenter server. The VEM does not depend on the SVS connection to come up, but it does rely on a connection to the vCenter server. So essentially the VEM would pull down whatever opaque data is available in the vCenter MOB, and once the SVS connection is back up (and IF there are any changes) the VSM would push new opaque data to the vCenter Server, which would then make its way to the VEM, and the VEM would reprogram itself again.

In your scenario, the SVS connection was down (due to the WAN failure), which meant that the VSM could not push opaque data to the vCenter server. Furthermore, the VEM couldn't communicate with the vCenter server (also due to the WAN failure) in order to pull down the opaque data that was stored in the MOB. Since it couldn't obtain this data, it could not program its system VLANs in order to communicate with the VSM and then obtain the rest of the configuration from the VSM. This was only resolved once your WAN connection was fixed and the VEM had connectivity to the vCenter server.
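
The dependency chain described above can be sketched as a small Python model. This is only an illustration of the reasoning in this reply, not actual VEM or VSM behavior; `SiteState`, `vem_recovery`, and all field names are hypothetical, and the `vem_reaches_vsm` flag is an added assumption that the local path between VEM and VSM works once system VLANs are programmed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SiteState:
    vem_reaches_vcenter: bool  # host can pull opaque data from the vCenter MOB
    vem_reaches_vsm: bool      # assumed: local control/packet path to the VSM is intact
    svs_connection_up: bool    # VSM <-> vCenter link (SVS connection)


def vem_recovery(state: SiteState) -> dict:
    """Walk the bring-up steps for a freshly booted (unprogrammed) VEM."""
    result = {
        "system_vlans_programmed": False,
        "vm_connectivity": False,
        "config_changes_possible": False,
    }
    # Step 1: without opaque data from vCenter, system VLANs cannot be programmed.
    if not state.vem_reaches_vcenter:
        return result
    result["system_vlans_programmed"] = True
    # Step 2: over the system VLANs, the VEM pulls the remaining config from the VSM.
    if state.vem_reaches_vsm:
        result["vm_connectivity"] = True
    # Step 3: new changes (vMotion, port-profile edits) also need the SVS connection.
    result["config_changes_possible"] = (
        result["vm_connectivity"] and state.svs_connection_up
    )
    return result


# The outage as described: WAN down, so neither vCenter nor the SVS connection is usable.
print(vem_recovery(SiteState(False, True, False)))
# The "slightly different scenario": vCenter reachable by the VEM, SVS connection still down.
print(vem_recovery(SiteState(True, True, False)))
```

In the first call nothing gets programmed, which matches the 5 hr outage. In the second, VMs regain connectivity but changes stay blocked until the SVS connection returns.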

Thanks,

Michael
