VMware installed CSR1000v Causes DUP Ping Response from Linux Hosts

DarthNetwork · ‎07-17-2015

We have deployed an HSRP pair of CSR1000v routers on clustered ESXi servers utilizing the VMware 5.5 distributed switch. The routers are used to switch our privately addressed hosts in different networks/vlans on the distributed switch in an infrastructure service provider "cloud" environment.

In our development we've found that we are getting DUP ping responses (3 DUP responses to be exact) from Linux hosts that ping other Linux hosts on the same network when either one of the Linux hosts is on the same clustered ESXi server as our active CSR.

Some observations:

1. The DUP responses do not happen for Windows hosts under the same circumstances.

2. The DUP responses from the Linux hosts go away when the HSRP configuration is removed from the routers.

3. Linux, Windows, and the CSRs are all using the same virtual host adapter type (vmxnet3).

4. The CSR interfaces are set up with basic HSRP, no ip redirects, and no proxy-arp set.

5. The security settings for the VLANs on the VMware distributed switch are set to "Accept" for promiscuous mode and forged transmits (the only way HSRP seems to work).

Has anyone seen this type of problem or have any suggestions on how to resolve/troubleshoot it?

Thanks,

Mike

Vinit Jain · ‎07-18-2015

Hello Mike

I guess there are a lot of similar issues seen with VMware. Take a look at the one below:

http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1017612

Hope this helps.

Vinit


Steve Fuller · ‎07-20-2015

Hi Mike,

This one piqued my interest so I spent a bit of time playing around with this. Better get a coffee as it's fairly lengthy.

So let me provide an answer to the point you made in your observation 5, that being HSRP only works when Promiscuous Mode and Forged Transmit are set to Accept on the VMware virtual switch.

The first thing to understand is that neither the VMware VSS (Virtual Standard Switch) nor the VDS (vSphere Distributed Switch) implements MAC learning like a traditional network switch. This is because vSphere already knows which MAC address is assigned to a VM and therefore the MAC that's associated with a virtual switch port.

Next remember that HSRP changes the MAC addresses that are used. The active router sources hello packets from its configured IP address and the HSRP virtual MAC address. The active router sources frames from the virtual MAC such that normal learning switches/bridges know which port or segment the active router is connected to. The standby router sources its hellos with its configured IP address and burned-in MAC address (BIA).
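For reference, the HSRP virtual MAC is derived from the group number. Here is a minimal Python sketch of that mapping, using the well-known prefixes 0000.0c07.acXX (HSRPv1) and 0000.0c9f.fXXX (HSRPv2, IPv4). The function name is made up for illustration, and the group numbers in the example are assumptions, since the thread never states which group is configured.

```python
def hsrp_virtual_mac(group: int, version: int = 1) -> str:
    """Return the well-known HSRP virtual MAC for an IPv4 group.

    Hypothetical helper for illustration only.
    """
    if version == 1:
        if not 0 <= group <= 255:
            raise ValueError("HSRPv1 groups are 0-255")
        raw = 0x00000C07AC00 | group   # 0000.0c07.acXX
    elif version == 2:
        if not 0 <= group <= 4095:
            raise ValueError("HSRPv2 groups are 0-4095")
        raw = 0x00000C9FF000 | group   # 0000.0c9f.fXXX
    else:
        raise ValueError("version must be 1 or 2")
    return ":".join(f"{octet:02x}" for octet in raw.to_bytes(6, "big"))


# Example group numbers are assumptions, not taken from the thread.
print(hsrp_virtual_mac(1))      # 00:00:0c:07:ac:01
print(hsrp_virtual_mac(10, 2))  # 00:00:0c:9f:f0:0a
```

None of these addresses is a MAC that vSphere assigned to a vNIC, which is why the virtual switch security policies come into play.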

If we don't change Forged Transmit to Accept, then the hello packets sent from the HSRP active router sourced with the HSRP virtual MAC will be dropped by the virtual switch. So changing Forged Transmit to Accept essentially allows the two HSRP routers to discover each other. Without this you'll see both routers showing "active" as local and "standby" as unknown.

Take a read of the post "How The VMware Forged Transmits Security Policy Works" over at Chris Wahl's blog if more insight is needed.

As for promiscuous mode, if you don't change promiscuous mode to "Accept", then the virtual switch will only forward frames with the destination MAC address assigned to the VM. In the case of HSRP the virtual switch also needs to forward frames with a destination of the HSRP virtual MAC address to the HSRP active router.
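To make the two policies concrete, here is a rough Python model of the behaviour as described in this post. It is a deliberate simplification and not how the vSwitch is actually implemented. The MAC addresses are the ones from the captures later in this thread; the HSRP virtual MAC assumes HSRPv1 group 1, which the thread does not state, and all function and class names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Port:
    name: str
    assigned_mac: str  # MAC address vSphere assigned to the VM's vNIC


# MAC addresses taken from the captures in this thread.
RHEL12 = Port("rhel12", "00:50:56:b2:66:6a")
ACTIVE = Port("c1kv-f5-1", "00:50:56:b0:c3:e7")
STANDBY = Port("c1kv-f5-2", "00:50:56:b0:1a:71")
PORTS = [RHEL12, ACTIVE, STANDBY]

# Assumption: HSRPv1 group 1. The thread never states the group number.
HSRP_VMAC = "00:00:0c:07:ac:01"


def allow_transmit(port: Port, src_mac: str, forged_transmits: bool) -> bool:
    """A frame sent by a VM passes if its source MAC is the assigned one,
    or if Forged Transmits is set to Accept."""
    return forged_transmits or src_mac == port.assigned_mac


def receivers(sender: Port, dst_mac: str, promiscuous: bool) -> list[str]:
    """Ports that receive a copy of a unicast frame in this simplified model."""
    return [p.name for p in PORTS
            if p != sender and (promiscuous or p.assigned_mac == dst_mac)]


print(allow_transmit(ACTIVE, HSRP_VMAC, forged_transmits=False))  # False
print(allow_transmit(ACTIVE, HSRP_VMAC, forged_transmits=True))   # True
print(receivers(RHEL12, HSRP_VMAC, promiscuous=False))            # []
print(receivers(RHEL12, STANDBY.assigned_mac, promiscuous=True))  # ['c1kv-f5-1', 'c1kv-f5-2']
```

The last line is the interesting one: with promiscuous mode set to Accept, a frame addressed only to the standby router is also handed to the active router.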

So anyway, I built this out in the lab and I'm seeing duplicate pings also. I've tested that with IOS-XE 03.13.01.S [IOS 15.4(3)S1] and 03.11.00.S [IOS 15.4(1)S] and get the same behaviour on both.

In my setup I have two CSR 1000v called c1kv-f5-1 (192.168.22.3) and c1kv-f5-2 (192.168.22.4). The active router is c1kv-f5-1.

So on your observation 1, I checked using a Windows XP host and I see duplicate packets also, but the Windows CLI doesn't tell you that it's received them. I ran a packet capture on the Windows host and see exactly the same behaviour as that which we see on the Linux hosts. So this is not just a Linux problem.

One additional observation, and perhaps you can confirm the same, is that if I ping the HSRP virtual IP address or the real IP address of the HSRP active router, then I don't see duplicate packets. I only see the duplicate packets when I ping the real IP address of the HSRP standby router.

So here are pings from my Linux host 192.168.22.100 to the HSRP virtual, the HSRP active real and the HSRP standby real.

[sfuller@rhel12 ~]$ ping -c2 192.168.22.1
PING 192.168.22.1 (192.168.22.1) 56(84) bytes of data.
64 bytes from 192.168.22.1: icmp_seq=1 ttl=255 time=4.96 ms
64 bytes from 192.168.22.1: icmp_seq=2 ttl=255 time=1.37 ms
--- 192.168.22.1 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1000ms
rtt min/avg/max/mdev = 1.372/3.166/4.961/1.795 ms
[sfuller@rhel12 ~]$ ping -c2 192.168.22.3
PING 192.168.22.3 (192.168.22.3) 56(84) bytes of data.
64 bytes from 192.168.22.3: icmp_seq=1 ttl=255 time=3.05 ms
64 bytes from 192.168.22.3: icmp_seq=2 ttl=255 time=2.73 ms
--- 192.168.22.3 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 999ms
rtt min/avg/max/mdev = 2.739/2.899/3.059/0.160 ms
[sfuller@rhel12 ~]$ ping -c2 192.168.22.4
PING 192.168.22.4 (192.168.22.4) 56(84) bytes of data.
64 bytes from 192.168.22.4: icmp_seq=1 ttl=255 time=0.427 ms
64 bytes from 192.168.22.4: icmp_seq=1 ttl=255 time=0.428 ms (DUP!)
64 bytes from 192.168.22.4: icmp_seq=1 ttl=254 time=0.514 ms (DUP!)
64 bytes from 192.168.22.4: icmp_seq=1 ttl=254 time=0.518 ms (DUP!)
64 bytes from 192.168.22.4: icmp_seq=2 ttl=255 time=0.497 ms
--- 192.168.22.4 ping statistics ---
2 packets transmitted, 2 received, +3 duplicates, 0% packet loss, time 1000ms
rtt min/avg/max/mdev = 0.427/0.476/0.518/0.049 ms

I captured the ping to the 192.168.22.4 real, and when I look at a packet trace on my Linux host I see the following:

Packet 35. Echo req Ethernet II, Src: 00:50:56:b2:66:6a, Dst: 00:50:56:b0:1a:71
Internet Protocol Version 4, Src: 192.168.22.100, Dst: 192.168.22.4
IP TTL=64
Packet 36. Echo rep Ethernet II, Src: 00:50:56:b0:1a:71, Dst: 00:50:56:b2:66:6a
Internet Protocol Version 4, Src: 192.168.22.4, Dst: 192.168.22.100
IP TTL=255
Packet 37. Echo rep Ethernet II, Src: 00:50:56:b0:c3:e7, Dst: 00:50:56:b2:66:6a
Internet Protocol Version 4, Src: 192.168.22.4, Dst: 192.168.22.100
IP TTL=254
Packet 38. Echo rep Ethernet II, Src: 00:50:56:b0:1a:71, Dst: 00:50:56:b2:66:6a
Internet Protocol Version 4, Src: 192.168.22.4, Dst: 192.168.22.100
IP TTL=255
Packet 39. Echo rep Ethernet II, Src: 00:50:56:b0:c3:e7, Dst: 00:50:56:b2:66:6a
Internet Protocol Version 4, Src: 192.168.22.4, Dst: 192.168.22.100
IP TTL=254

So the reply in frame 36 is sourced from MAC 0050.56b0.1a71 (c1kv-f5-2), i.e., the device with the 192.168.22.4 address, but the reply in frame 37, while sourced from the 192.168.22.4 IP address, has a source MAC address of 0050.56b0.c3e7 (c1kv-f5-1), i.e., the HSRP active router. Note also that the IP TTL is only 254 on the reply in frame 37, indicating it's been through another router hop.

The next thing I checked was what was seen on the ESX virtual switch. Since ESX 5.5 there's been an enhanced packet capture utility called pktcap-uw that allows packet capture at various points on the hypervisor, including switch ports to which VMs are attached.

Assuming you've access to the vSphere CLI you can find the switchport using esxtop, pressing n for the network view and then taking the ID from the PORT-ID column for the VM you're interested in. My VMs are attached to the following ports:

  PORT-ID           USED-BY  TEAM-PNIC DNAME              PKTTX/s  MbTX/s    PKTRX/s  MbRX/s %DRPTX %DRPRX
[..]
100663301 4553795:rhel12          void vSwitch4              0.00    0.00       0.00    0.00   0.00   0.00
100663302 4706701:winxp3          void vSwitch4              0.00    0.00       0.00    0.00   0.00   0.00
100663304 4727660:c1kv-f5-1       void vSwitch4            119.53    0.09     119.72    0.09   0.00   0.00
100663305 4727766:c1kv-f5-2       void vSwitch4              0.59    0.00       0.39    0.00   0.00   0.00
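One thing that can be confusing: esxtop shows PORT-ID in decimal, while pktcap-uw echoes the same ID back in hexadecimal. A tiny Python sketch of the conversion, using the port IDs from the table above (the function name is just for illustration):

```python
def port_id_hex(port_id: int) -> str:
    """Format a decimal esxtop PORT-ID the way pktcap-uw prints it."""
    return f"0x{port_id:08x}"


# Port IDs taken from the esxtop output above.
for vm, port_id in {"rhel12": 100663301, "c1kv-f5-2": 100663305}.items():
    print(vm, port_id, port_id_hex(port_id))
# rhel12 100663301 0x06000005
# c1kv-f5-2 100663305 0x06000009
```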

Making sure the virtual switch only receives as many pings as we sent:

~ # pktcap-uw --switchport 100663301 --dir 0 -o vss_cap_rhel12_rx.pcap
The switch port id is 0x06000005
The dir is Rx
The output file is vss_cap_rhel12_rx.pcap
No server port specifed, select 38149 as the port
Local CID 2
Listen on port 38149
Accept...Vsock connection from port 1032 cid 2
Dump: 2, broken : 0, drop: 0, file err: 0Destroying session 8
Dumped 2 packet to file vss_cap_rhel12_rx.pcap, dropped 0 packets.
Done.

And when I look at that capture I see two packets. We probably knew that, but best to confirm.

[sfuller@centos651 vmstore]$ tshark -r vss_cap_rhel12_rx_2.pcap
1  0.000000 192.168.22.100 -> 192.168.22.4 ICMP 98 Echo (ping) request id=0xb421, seq=1/256, ttl=64
2  1.000503 192.168.22.100 -> 192.168.22.4 ICMP 98 Echo (ping) request id=0xb421, seq=2/512, ttl=64
[sfuller@centos651 vmstore]$

If I now capture the traffic from the virtual switch in the Tx direction to the standby router:

~ # pktcap-uw --switchport 100663305 --dir 1 -o vss_cap_c1kv1-f5-2_tx_2.pcap
The switch port id is 0x06000009
The dir is Tx
The output file is vss_cap_c1kv1-f5-2_tx_2.pcap
No server port specifed, select 47141 as the port
Local CID 2
Listen on port 47141
Accept...Vsock connection from port 1040 cid 2
Dump: 7, broken : 0, drop: 0, file err: 0Destroying session 16
Dumped 7 packet to file vss_cap_c1kv1-f5-2_tx_2.pcap, dropped 0 packets.
Done.

When I look at that capture I can see that although only two Echo requests were sent from the Linux host, four were sent out the virtual switch port to the standby router. As with the replies we saw above, the second copy of each Echo request has a TTL one lower than the original.

[sfuller@centos651 vmstore]$ tshark -r vss_cap_c1kv1-f5-2_tx_2.pcap
1   0.000000 192.168.22.3 -> 224.0.0.2    HSRP 62 Hello (state Active)
2   2.688828 192.168.22.3 -> 224.0.0.2    HSRP 62 Hello (state Active)
3   2.873541 192.168.22.100 -> 192.168.22.4 ICMP 98 Echo (ping) request id=0xb421, seq=1/256, ttl=64
4   2.874020 192.168.22.100 -> 192.168.22.4 ICMP 98 Echo (ping) request id=0xb421, seq=1/256, ttl=63
5   3.874027 192.168.22.100 -> 192.168.22.4 ICMP 98 Echo (ping) request id=0xb421, seq=2/512, ttl=64
6   3.874991 192.168.22.100 -> 192.168.22.4 ICMP 98 Echo (ping) request id=0xb421, seq=2/512, ttl=63
7   5.361105 192.168.22.3 -> 224.0.0.2    HSRP 62 Hello (state Active)
[sfuller@centos651 vmstore]$
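If you want to check a larger capture for the same pattern, something like the following Python sketch would do. It assumes the tshark one-line summary format shown above and simply groups echo requests by sequence number; the sample text is the ICMP portion of the capture above, and the regular expression is my own, not anything tshark provides.

```python
import re
from collections import defaultdict

# ICMP lines copied from the tshark summary above.
TSHARK_SUMMARY = """\
3   2.873541 192.168.22.100 -> 192.168.22.4 ICMP 98 Echo (ping) request id=0xb421, seq=1/256, ttl=64
4   2.874020 192.168.22.100 -> 192.168.22.4 ICMP 98 Echo (ping) request id=0xb421, seq=1/256, ttl=63
5   3.874027 192.168.22.100 -> 192.168.22.4 ICMP 98 Echo (ping) request id=0xb421, seq=2/512, ttl=64
6   3.874991 192.168.22.100 -> 192.168.22.4 ICMP 98 Echo (ping) request id=0xb421, seq=2/512, ttl=63
"""

ECHO_REQUEST = re.compile(r"Echo \(ping\) request id=\w+, seq=(\d+)/\d+, ttl=(\d+)")

# Collect the TTL of every copy seen for each ICMP sequence number.
ttls_by_seq = defaultdict(list)
for line in TSHARK_SUMMARY.splitlines():
    match = ECHO_REQUEST.search(line)
    if match:
        ttls_by_seq[int(match.group(1))].append(int(match.group(2)))

for seq, ttls in sorted(ttls_by_seq.items()):
    print(f"seq={seq}: {len(ttls)} copies, TTLs {ttls}")
# seq=1: 2 copies, TTLs [64, 63]
# seq=2: 2 copies, TTLs [64, 63]
```

Any sequence number with more than one copy, and a TTL that drops by one, points at a copy that has been through a router.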

If I look in a little more detail at frames 3 and 4 we can see the source MAC address is different for both frames. Frame 3 is sourced from the MAC address of my Linux host as expected, but as above frame 4 is sourced from the MAC address of the active HSRP router, c1kv-f5-1:

[sfuller@centos651 vmstore]$ tshark -V -r vss_cap_c1kv1-f5-2_tx_2.pcap
Frame 3: 98 bytes on wire (784 bits), 98 bytes captured (784 bits)
    [..]
Ethernet II, Src: Vmware_b2:66:6a (00:50:56:b2:66:6a), Dst: Vmware_b0:1a:71 (00:50:56:b0:1a:71)
    Destination: Vmware_b0:1a:71 (00:50:56:b0:1a:71)
        Address: Vmware_b0:1a:71 (00:50:56:b0:1a:71)
        .... ..0. .... .... .... .... = LG bit: Globally unique address (factory default)
        .... ...0 .... .... .... .... = IG bit: Individual address (unicast)
    Source: Vmware_b2:66:6a (00:50:56:b2:66:6a)
        Address: Vmware_b2:66:6a (00:50:56:b2:66:6a)

Frame 4: 98 bytes on wire (784 bits), 98 bytes captured (784 bits)
    [..]
Ethernet II, Src: Vmware_b0:c3:e7 (00:50:56:b0:c3:e7), Dst: Vmware_b0:1a:71 (00:50:56:b0:1a:71)
    Destination: Vmware_b0:1a:71 (00:50:56:b0:1a:71)
        Address: Vmware_b0:1a:71 (00:50:56:b0:1a:71)
        .... ..0. .... .... .... .... = LG bit: Globally unique address (factory default)
        .... ...0 .... .... .... .... = IG bit: Individual address (unicast)
    Source: Vmware_b0:c3:e7 (00:50:56:b0:c3:e7)
        Address: Vmware_b0:c3:e7 (00:50:56:b0:c3:e7)

So what seems to be happening is this. Because we need promiscuous mode set to Accept, the virtual switch sends the ICMP echo request out all ports in the portgroup, including the port to the HSRP active router. The HSRP active router receives and processes the frame, hence the decremented TTL, and then sends it on its way to the IP address of the standby router. The standby router receives this and dutifully replies, and so the duplicate. The reason we see three duplicates is that the echo request is duplicated on the way to the standby router, and then the echo replies are duplicated.

So we see two echo requests:

1 - Linux to HSRP standby router direct

2 - Linux to HSRP standby router via HSRP active router

And four echo replies:

1 - HSRP standby router to Linux direct for 1 above

2 - HSRP standby router to Linux via HSRP active router for 1 above
3 - HSRP standby router to Linux direct for 2 above
4 - HSRP standby router to Linux via HSRP active router for 2 above

As we're expecting one echo reply, the Linux host sees three as duplicates.
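The counting above can be sketched in a few lines of Python. This is only a model of the explanation in this post (every frame between the Linux host and the standby router travels both directly and via the active router); the starting reply TTL of 255 is taken from the ping output earlier.

```python
from itertools import product

PATHS = ("direct", "via active router")

# Each echo request reaches the standby router once per path.
requests = list(PATHS)

# Every request is answered, and every reply again travels both paths.
replies = list(product(requests, PATHS))

# The standby router sends replies with TTL 255; a pass through the
# active router decrements it once.
reply_ttls = [255 - (reply_path == "via active router")
              for _, reply_path in replies]

for (request_path, reply_path), ttl in zip(replies, reply_ttls):
    print(f"request {request_path:<17} reply {reply_path:<17} ttl={ttl}")

duplicates = len(replies) - 1
print(f"{len(requests)} requests, {len(replies)} replies, {duplicates} duplicates")
# 2 requests, 4 replies, 3 duplicates
```

Two of the four replies carry TTL 255 and two carry TTL 254, which matches what ping reports.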

What I've not figured out at this stage is why the active HSRP router is processing packets for all MAC addresses. I can only assume there's some weird behaviour here, in that when it's HSRP active, the interface essentially goes into some form of promiscuous mode.

Regards

DarthNetwork · ‎07-20-2015

Steve,

Thank you for your response! I appreciate the details you added. We were seeing some of the same traffic sources from tcpdump on our Linux host showing the additional ping responses and decremented TTL counter. My first impression was that the router was responding in duplicate to traffic that should have been handled by the distributed switch. Your response makes the exchange much clearer!

Thank you for also clarifying that Windows boxes respond in the same way (although they hide it in the generic CLI response). That was a mystery to me that makes more sense now.

I'm still a bit curious why it only happens when one of the hosts is on the same ESXi server as the CSR but not if the CSR is on a different host. Regardless of CSR location, I would have figured the interaction would be identical.

To answer your questions:

I too only see duplicate responses when pinging the standby router. The HSRP vip and active router IPs do not exhibit the DUP response.

Thank you again for looking into this. I'm still struggling to find a solution to the problem.

-Mike

Steve Fuller · ‎07-20-2015

Hi Mike,

I think the reason you only see the duplicates when the CSR is on the same host as your Linux VM is explained in the Promiscuous Portgroup Myth post. While the VMware switch doesn't learn MAC addresses and which ports they're connected to, the real switch you have connecting your hosts does.

In the diagram on that post, if VM A is the active HSRP router, and VM B is the standby HSRP router, then when VM C (your Linux host) pings the standby router, that traffic will not get sent out port A to Host A as the real switch knows the destination MAC address of the standby router is out port B.

As for a solution, not sure at this stage. I noticed the show controllers command on the CSR that is HSRP active shows "Software MAC Filter Enabled", which is probably how they make the router listen to traffic destined to the BIA (if you can call a vSphere assigned MAC a BIA) and the HSRP MAC. The fact that it's processing all MACs could be a bug, or perhaps it's by design.

Any chance you can raise a TAC case against it? I think we're pretty close, but getting that last bit is probably going to take some knowledge of the inner workings of the beast.

Regards

DarthNetwork · ‎07-20-2015

Thanks Steve,

I just finished reading the linked blog (I missed it my first go round). It makes sense now. I can't open a TAC case on this yet. We've set this up in our development lab as an evaluation and have just sent the PO out to make the purchase with support for the routers. Hopefully in the next week or so we can actually open a TAC case.

In the meantime, we're attacking this from our VMware contract side as well.

Thanks,

Mike

sandybreezebt · ‎07-24-2015

This caught my eye. I can also confirm that I can replicate the issue on both 3.13.1S and 3.14.1S with the above-mentioned configuration (HSRP, and only when one of the guests is on the same hypervisor as the CSR). I've captured with EPC and can see this behaviour from the router. It's as if, with HSRP configured, the router thinks this ICMP is destined for itself and punts it. I need to look at the punt statistics more closely, but I'll raise it with TAC in the meantime and let you know how I get on.

Interestingly, a colleague of mine who has a variety of services behind these routers mentioned issues with incoming MySQL connections too - not necessarily persistent connections, and only when HSRP is configured. I've yet to confirm this as we're looking to create a more isolated environment (as I'm sure this will speed up the TAC case).

Sandy

DarthNetwork · ‎07-24-2015

Thank you for the confirmation Sandy. I appreciate the additional sets of eyes on the problem to see if there is a solution. A TAC case would be great! We're still waiting for our procurement to finish off the licensing and support contract purchase with Cisco before I can open a case.

-Mike

DarthNetwork · ‎08-26-2015

We just recently received our service contract (procurement moves at a snail's pace here) and I was able to open a TAC case on my issue.

Thanks to your suggestions and feedback, Steve and Sandy, the Cisco engineer was able to quickly suggest an option to work around the issue. By forcing the routers to use their BIA instead of the floating virtual MAC with the "standby use-bia" command, the duplicate pings cleared up!

I am going to do some failover testing to make sure that this option does not cause issues with connected hosts, but initially things look promising.

Additionally, the TAC engineer mentioned that a Cisco case (CSCup28090) was filed to add support for this feature by default with FHRPs. The case is not public facing yet.

-Mike

Steve Fuller · ‎08-26-2015

Hi Mike,

Thanks for coming back to the forum and posting the answer. Unfortunately I don't think you can mark your own answer as correct :)

Regards

sandybreezebt · ‎09-02-2015

Hi Mike,

I'm glad you've found a workaround. Though for completeness, I thought it would be helpful to share what I’ve found.

The actual root cause is not the use of an FHRP as such, but a consequence of using one: the VMware implementation deals with unknown MACs by flooding out all ports, and this combines with a CSR bug in which the router chooses not to ignore L3 packets whose L2 destination is not its own. See: CSCuv63708.

What is happening is that, because the portgroup will not deliver frames for unknown MAC addresses unless promiscuous mode is configured, promiscuous mode is enabled and the CSR gets a copy of the frame when it's flooded. Well, that is if the CSR is sitting on the same hypervisor as one of the VMs. Despite the L2 header not being addressed to the CSR, the CSR runs the frame through the routing algorithm, decrements the TTL, rewrites the source MAC to its own and forwards the frame out the destination interface, which if on the same LAN results in duplicates. This is not the correct behavior. In the physical world this is unlikely to happen as a downstream switch will not (unless configured to) flood out all ports.
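As a rough illustration of the check one would expect before a frame is handed to the routing code, here is a small Python sketch. It is a conceptual model only, not how IOS-XE is implemented: the function names are hypothetical, the HSRP virtual MAC assumes HSRPv1 group 1 (the group is not stated in this thread), and treating every group address as accepted is a simplification.

```python
def is_group_address(mac: str) -> bool:
    """True for broadcast/multicast: the I/G bit is the low bit of the first octet."""
    return bool(int(mac.split(":")[0], 16) & 0x01)


def accepts_for_routing(dst_mac: str, own_macs: set[str]) -> bool:
    """Expected L2 check: only frames addressed to one of the router's own
    MACs (or to a group address) should reach L3 processing."""
    dst = dst_mac.lower()
    return is_group_address(dst) or dst in own_macs


# Active router c1kv-f5-1: its vSphere-assigned MAC plus the HSRP virtual
# MAC (assumed here to be HSRPv1 group 1).
ACTIVE_OWN_MACS = {"00:50:56:b0:c3:e7", "00:00:0c:07:ac:01"}
STANDBY_MAC = "00:50:56:b0:1a:71"  # c1kv-f5-2

print(accepts_for_routing(STANDBY_MAC, ACTIVE_OWN_MACS))          # False
print(accepts_for_routing("00:00:0c:07:ac:01", ACTIVE_OWN_MACS))  # True
print(accepts_for_routing("ff:ff:ff:ff:ff:ff", ACTIVE_OWN_MACS))  # True
```

The behaviour observed in this thread is as if the first call returned True: the active router routes a frame whose destination MAC belongs to the standby router.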

If you have physical switches in between your hypervisors, then your physical switches will stop some of this flooded traffic in cases where, for example, the destination MAC is known on the receiving port. It's also worth noting that this is the case for all traffic, not just ICMP. ICMP just shows you the issue in its DUP packet output.

Cisco have marked the bug as severe, but currently there are only 2 support cases listed and 1 known affected version. I’ve demonstrated this exists in more than one version and would appreciate those reading this taking their case back to TAC and getting it added to the bug.

We’re also working with VMware to see if there is a workaround that changes this behavior while somehow maintaining functionality.

Sandy

Steve Fuller · ‎09-02-2015

Thanks for posting an update Sandy.

Regards