
C3560G causes VMs to shut down in ESXi 4.1

greg.garman
Level 1

When I connect a Cisco 3560G to the network at the distribution layer, the VMware VMs shut down. We get no messages about this in the core switch or distribution switch logs at all. My initial thought was that spanning tree was causing this, so I connected a 3750G and got the same result. Then I connected a 3750X and had no issue. I have tried this multiple times with both the 12.2 IOS train and the 15.0 IOS train. All these switches have the same spanning tree setup; I use MST on this network. The VM host cluster is connected to a 3750X that connects to the core 6500 switch.
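
For reference, here is the kind of check used to confirm that all of these switches really are in the same MST region; if the region name, revision, or VLAN-to-instance mapping differs anywhere, the new switch is treated as a separate region and joining it triggers an immediate topology change. The region name, revision, and VLAN mappings below are placeholders, not this network's actual values:

! Compare this output on the 3560G, the 3750X, and the core;
! name, revision, and instance mappings must match exactly.
show spanning-tree mst configuration

! Placeholder example of how the region is normally defined:
spanning-tree mode mst
spanning-tree mst configuration
 name CAMPUS
 revision 1
 instance 1 vlan 10,20,30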

When this happens we get an alarm on the ESXi host cluster that says HA is reconfiguring. One attempted remedy was to set HA not to shut down the VMs, but that is not a proper way to operate because HA may not function correctly: the hosts will not release the VMs to fail over.

Has anyone seen this and what is the cause/cure?

10 Replies

Philip D'Ath
VIP Alumni

Are your physical hosts plugging into this 3560G?

If so, are all the VMware physical host port types configured in a similar way on the 3560G?

E.g., the kernel ports are configured the same way. The guest ports are configured the same way. Any storage ports are configured the same way.

Most likely it is failing because it thinks there is a loss of connectivity.

The physical hosts are trunked into a Cisco 3750X that is trunked into the core.

The kernel ports are on their own VLAN on access ports on the 3750X.

The FAS storage devices are plugged directly into the hosts with fiber and also into the 3750X switch on access ports.
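
For reference, those kernel and storage access ports follow the usual pattern, something like the sketch below; the interface name and VLAN number are placeholders rather than the actual values from this network, and the portfast line is a common practice on host-facing access ports, not something confirmed here:

! Hypothetical access port on the 3750X for a VMkernel or FAS storage NIC.
! Interface name and VLAN 100 are placeholders.
interface GigabitEthernet1/0/10
 description ESXi VMkernel / FAS storage uplink (placeholder)
 switchport mode access
 switchport access vlan 100
 spanning-tree portfast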

Whether I plug the 3560G into a separate distribution switch that is connected to the core, or directly into the core itself, it will cause the VMs to shut down.

Can you post the config you are using for one of the physical trunk ports to ESX?

What is doing Layer 3 in this network? The 3750/3560, or are they Layer 2 only through to the core?

The routing is done at the core. The access- and distribution-level switches use a default gateway pointed at the core.

Configuration on the trunk ports to the ESXi hosts:

switchport trunk encapsulation dot1q
switchport mode trunk
switchport nonegotiate

I wonder if some spanning tree issue is putting the ports into a learning state.

Try adding this to the ports going to the physical servers:

spanning-tree portfast trunk
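
Applied to the trunk configuration posted above, the ESXi-facing ports would look roughly like this (12.2-style syntax; the interface name is a placeholder):

! Hypothetical ESXi-facing trunk port with portfast trunk added.
interface GigabitEthernet0/1
 description Trunk to ESXi host (placeholder)
 switchport trunk encapsulation dot1q
 switchport mode trunk
 switchport nonegotiate
 spanning-tree portfast trunk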

Well, spanning tree was my first thought also. But why does it have a problem with the 3560 or 3750G switches and not with the 3750X switches?

So I tried the spanning-tree portfast settings as prescribed, but portfast is a little different these days. There is no longer a portfast trunk command in the 15.0 train.

Beginning with Cisco IOS Release 12.2(33)SXI, you can specifically configure a spanning tree port as either an edge port, a network port, or a normal port. The port type determines the behavior of the port with respect to STP extensions.

- An edge port, which is connected to a Layer 2 host, can be either an access port or a trunk port, or so Cisco says.

- A network port is connected only to a Layer 2 switch or bridge.

When I configured the ports as spanning-tree portfast network ports, we could not talk to the VC or the hosts. When I configured the ports as edge ports, I got this response:

%Warning: Portfast should only be enabled on ports connected to a single host. Connecting hubs, concentrators, switches, bridges, etc... to this port can cause temporary bridging loops. Use with caution.

%Portfast has been configured on this port but will only have effect when the port is in non-trunking mode.

Well, these ports are trunks, so by that warning this will have no effect. And sure enough, when we connected a 3560G to the network we got the red diamonds on the cluster, and our application lost connectivity as before.
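
For what it is worth, on the 15.0 train the old portfast trunk behavior is normally reached through the edge keyword instead; whether the trunk option is accepted depends on the exact platform and release, so treat this as a sketch rather than confirmed syntax for these switches:

! Possible 15.0-style equivalent of "spanning-tree portfast trunk".
! Interface name is a placeholder; verify keyword availability first.
interface GigabitEthernet0/1
 switchport trunk encapsulation dot1q
 switchport mode trunk
 switchport nonegotiate
 spanning-tree portfast edge trunk

! Confirm the port is actually being treated as an edge/portfast port:
show spanning-tree interface GigabitEthernet0/1 detail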

We need to find out more specifically what is causing the "red" diamonds. Are you able to determine exactly what can't talk to what when the "red" diamond issue occurs?

The red diamonds are an alarm on the ESXi host cluster that says HA is reconfiguring.

There must be some log entry to say why.  For example, "lost communication on NIC xxx".  We just need a bit more of a hint as to why ESXi is failing over.
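
On the switch side, a few standard commands can help pin down whether a topology change is hitting the ESXi-facing ports at the moment the 3560G is connected (the interface name is a placeholder; treat these as a sketch, not a guaranteed diagnostic):

! When and from which port the last topology change was seen:
show spanning-tree detail

! Role, state, and portfast status of the ESXi-facing trunk:
show spanning-tree interface GigabitEthernet0/1 detail

! Watch spanning tree events live while the 3560G is plugged in
! (debugs can be chatty; use with care on a production core):
debug spanning-tree events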

I am not an ESXi expert, and our local ESXi guy and I spent several hours looking for some sort of indication of what happened, but we couldn't identify any smoking gun. Those logs are a confusing mess.

We checked the clocks to make sure they were syncing with NTP, then connected a switch to create the fault. We then went through the logs looking for entries that coincide with the connect time recorded in the core switch logs. The only thing I could definitively identify was an entry in the AAM log that said it was recovering from a failure, but what that failure was is not indicated; there is no earlier entry pointing to a failure. One of the statements we saw several times was "for details see the VMKernel log". I cannot find any log with that name, or one that looks like it might be related.
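
One small thing that makes this kind of cross-device correlation easier is millisecond timestamps in the switch logs, so entries can be lined up against the ESXi events. A minimal sketch (the NTP server address is a placeholder, and the clocks here are already reported as NTP-synced):

! Enable millisecond, local-time log timestamps on each switch involved.
configure terminal
 service timestamps log datetime msec localtime
 ntp server 192.0.2.10
 end
! Then review the buffered log around the time the 3560G was connected:
show logging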