OTV Issue

JACQUES DU PLESSIS · ‎01-22-2014

Hi Guys,

We have an existing OTV implementation that extends some VLANs between our Prod and DR sites. We recently lost some disks and had to bring up the machines at the DR site. Since then we have seen some strange issues: for example, machines can ping each other but cannot RDP to each other. The problem only affects machines talking over the tunnel (not machines at the same site), and only when they are in different VLANs. When I give one machine an interface on the VLAN of the other device, they can communicate fine, so the VLAN extension itself is working. Both sites have HSRP configured with the same virtual address, and HSRP isolation is set up.

So all I can think of is that a machine at site A uses the site A gateway, the traffic ends up at site B, and the destination responds but uses site B's gateway to reach the source. Somewhere in this process something breaks. Is this how it is supposed to work, though?

Jacques

Amit Singh · ‎01-22-2014

I am curious to know whether this was working before the disk swap. Was the same setup running without any changes?

I think I see what's happening now: is your inter-VLAN traffic being routed by the local gateway at each site and then following the WAN route from wherever the IP route was learned?

What does the routing table look like at each site? Where are the routes for the IP subnets learned from? What does the traceroute look like when you trace to a specific machine's IP?

JACQUES DU PLESSIS · ‎01-22-2014

Hi, it is difficult to say whether it was working; this is the first time we have had to do it. The current configuration has been in place since the beginning.

If host A at site A in VLAN A tries to talk to host B at site B in VLAN B, and both VLANs are extended, the trace is simply the gateway and then the destination, because the gateway has an interface in the destination VLAN. In other words, there is only one hop between them logically, but physically there are 4 or 5 routers.

Both sites are advertising their connected subnets with OSPF, but the DR site has a higher cost.

Jacques

Aries Fernandes · ‎03-24-2014

This could be an issue with the ARP timeout values. I would recommend clearing the ARP entries on all your boxes and then checking whether this works. A similar case was seen when VM1 and VM2, belonging to VLAN 11 and VLAN 12, were pinging one another and working fine, but after 30 minutes the pings started to fail. Reason: ARP caching in the L3 gateways resulted in the OTV MAC addresses aging out at 30 minutes and the packets then being dropped. Hope you have understood the problem. Change the ARP timeout value to something less than 1800 seconds [ip arp timeout <value>], under the interface configuration.
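
To make the timer relationship in this suggestion concrete, here is a minimal Python sketch. It only encodes the rule stated above (the gateway's ARP timeout should be shorter than the 1800-second MAC aging time, so the ARP refresh happens before the MAC entry ages out). The function name and the sample timeout values are illustrative assumptions, not values taken from any device in this thread.

```python
# MAC aging time from the post above: entries aged out at 30 minutes.
MAC_AGING_SECONDS = 30 * 60  # 1800

def arp_timeout_is_safe(arp_timeout_seconds, mac_aging_seconds=MAC_AGING_SECONDS):
    """Return True if the ARP entry is refreshed before the MAC entry ages out.

    Hypothetical helper: it checks only the 'ARP timeout < MAC aging' rule
    from the post and knows nothing about real device state.
    """
    return arp_timeout_seconds < mac_aging_seconds

if __name__ == "__main__":
    # Sample candidate values (assumptions chosen for illustration only).
    for candidate in (1500, 1800, 14400):
        verdict = "ok" if arp_timeout_is_safe(candidate) else "too long"
        print(f"ip arp timeout {candidate}: {verdict}")
```

A timeout equal to the aging time is treated as unsafe here, since the post asks for a value strictly less than 1800 seconds.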

Jami Bailey · ‎01-22-2014

Jacques,

It's a shot in the dark, but can you verify the MTU end to end? OTV sets the DF bit on each packet that egresses the join interface, so any packet that would need to be fragmented is dropped instead. I'm curious whether there is a consistent MTU both inside the datacenter and across the overlay. This could potentially produce results similar to what you are seeing.

JACQUES DU PLESSIS · ‎01-22-2014

Hi, thank you for your suggestion. I am leaning toward a similar explanation, but the join interfaces and the routed interfaces are all set to MTU 1560, which should cater for it. The bottom line is that VLAN A talks to VLAN A fine, which would mean the extended VLANs are fine, but VLAN B talking to VLAN A does not work.

Jacques
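
The MTU reasoning in the last two posts can be checked with a little arithmetic. The sketch below is a rough Python illustration, assuming the commonly cited 42-byte OTV encapsulation overhead (that figure is not stated in this thread) and assuming the 1560-byte MTU applies to every hop between the join interfaces. The function names are hypothetical.

```python
# Assumption: OTV encapsulation adds 42 bytes to the original IP packet.
OTV_OVERHEAD_BYTES = 42

def fits_without_fragmentation(inner_packet_bytes, transport_mtu_bytes,
                               overhead_bytes=OTV_OVERHEAD_BYTES):
    """True if the encapsulated packet fits the transport MTU.

    Because the DF bit is set on encapsulated packets, a packet that does
    not fit is dropped rather than fragmented.
    """
    return inner_packet_bytes + overhead_bytes <= transport_mtu_bytes

def max_inner_packet(transport_mtu_bytes, overhead_bytes=OTV_OVERHEAD_BYTES):
    """Largest original IP packet that survives encapsulation unfragmented."""
    return transport_mtu_bytes - overhead_bytes

if __name__ == "__main__":
    transport_mtu = 1560  # value reported for the join and routed interfaces
    print(fits_without_fragmentation(1500, transport_mtu))
    print(max_inner_packet(transport_mtu))
```

Under these assumptions a full 1500-byte host packet fits within 1560 bytes, which is consistent with the observation that same-VLAN traffic over the overlay works. The check says nothing about a hop in the path with a smaller MTU than reported, so a DF-bit ping at increasing sizes between hosts in different VLANs would still be worth running.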