03-20-2018 04:17 AM - edited 03-01-2019 01:27 PM
We updated two of our UCS ESXi hosts to enic version 2.3.0.14, and now have a strange problem with ASAv VMs running on those hosts (never had this with the old enic 2.3.0.7):
As soon as an ASAv is migrated to one of the two updated hosts, it is no longer reachable via SSH, and after a while some sessions passing through the firewall on trunk ports are lost. We usually notice this with SMTP sessions that simply time out.
We used to have a very similar problem with HP elxnet drivers a couple of years ago, but I don't remember in which version that was fixed.
Has anyone seen something similar?
03-20-2018 07:07 AM - edited 03-20-2018 07:08 AM
Greetings.
Do you have any other Linux-type guestVMs that you move to the hosts in question, and do they keep their SSH reachability?
You might need to use the ESXi pktcap-uw packet capture utility to confirm the frames at both the VMNIC/uplink level and the DVS port level.
Is this a standard VMware DVS, or the ACI VMM-integrated one?
Thanks,
Kirk...
03-20-2018 08:00 AM
03-20-2018 08:36 AM - edited 03-20-2018 08:54 AM
Yeah, packet captures cut to the chase.
If you see packet issues, especially when captured at the VMnic/uplink level, you may want to open a TAC case.
That makes me want to ask another question: can you initiate the SSH from another guestVM in the same VLAN, on the same host? (You might have to check which vmnic it's pinned to, to make sure the traffic stays inside the DVS.)
How often is this reproducible?
Does this only happen from another subnet? What about the same subnet, or from a guestVM on the same host, etc.?
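To run those same-subnet/cross-subnet checks quickly, a throwaway reachability probe can help. This is a generic sketch to run from a test guestVM; the addresses are placeholders (not from this thread), and it uses bash's /dev/tcp rather than anything ESXi-specific:

```shell
#!/bin/bash
# Probe TCP/22 on each address with a 3-second timeout per host.
# The addresses used below are invented example values.
check_ssh() {
  local host
  for host in "$@"; do
    if timeout 3 bash -c "exec 3<>/dev/tcp/$host/22" 2>/dev/null; then
      echo "$host: SSH reachable"
    else
      echo "$host: no answer"
    fi
  done
}

check_ssh 10.0.1.10 10.0.2.10
```

Running it from peers in the same subnet, a different subnet, and on the same host narrows down where the frames get lost.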
Referencing VMware article https://kb.vmware.com/s/article/2051814, some sample capture commands:
# pktcap-uw --uplink vmnic2 --dir 0 -o /tmp/inbound.cap & pktcap-uw --uplink vmnic2 --dir 1 -o /tmp/outbound.cap &
The above is actually two commands, run in the background, capturing frames in both directions.
As noted in the VMware article, you'll have to kill the capture processes afterwards with:
# kill $(lsof | grep pktcap-uw | awk '{print $1}' | sort -u)
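A side note on that kill one-liner: it only works because it feeds `awk '{print $1}'` into `kill`, i.e. the first column of lsof output on the ESXi host is a process ID. A quick off-box simulation with invented sample lines shows what the pipeline extracts:

```shell
# Invented sample in the shape of ESXi lsof output (first column = PID):
lsof_sample='2098987 pktcap-uw  FILE  4  /tmp/inbound.cap
2098988 pktcap-uw  FILE  4  /tmp/outbound.cap
1001    hostd      FILE  3  /var/log/hostd.log'

# Same pipeline as the kill command, minus the kill itself:
printf '%s\n' "$lsof_sample" | grep pktcap-uw | awk '{print $1}' | sort -u
# → 2098987 and 2098988 (one PID per capture direction; hostd is filtered out)
```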
Thanks,
Kirk...
04-06-2018 03:50 AM
After a lot of debugging we found out that we were completely on the wrong track: the reason for our ASAvs misbehaving was that they happened to share those two hosts with two VMs of a different ASAv cluster. Those VMs had their interfaces in promiscuous mode, for reasons yet unknown, and were processing packets destined for other firewalls in the same subnet that ended up on the same ESXi host. Since they didn't know about those sessions, they answered with RSTs, effectively terminating the connections.
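To make that failure mode concrete, here is a toy model (pure shell, all flow values invented) of what the promiscuous-mode firewall was effectively doing: any sniffed segment that is not in its own connection table gets answered with a RST, tearing down some other firewall's session:

```shell
# Flows this firewall actually owns (invented example values):
KNOWN_FLOWS="10.0.0.5:34512->192.0.2.25:25"

# Decide what happens to a sniffed TCP segment for the given flow.
handle_segment() {
  case " $KNOWN_FLOWS " in
    *" $1 "*) echo "forward"  ;;
    *)        echo "send RST" ;;   # unknown session: reset it
  esac
}

handle_segment "10.0.0.5:34512->192.0.2.25:25"   # its own flow
handle_segment "10.0.0.9:40001->192.0.2.25:25"   # another firewall's flow
```

In promiscuous mode the second case fires constantly, because the VM sees every frame on the port group, not just its own, which matches the SMTP timeouts described at the start of the thread.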
We disabled promiscuous mode on the dvSwitch port groups in question, and the problem has gone away for now. Why that firewall cluster behaves differently from all the others we have is the next unsolved puzzle.