Solved: WAAS TCP Resets

Lamont Bullock · ‎06-08-2015

Hello,

I may have a problem with TCP resets on the WAVE-7571. I say "may" because I feel the WAAS is probably doing what it is supposed to and I need to just understand why.

Scenario Description: We have a scenario with 2 WAN links. On both links, and on both sides of each link, we have transparent firewalls directly connected to the WAN, C3925 routers connected to the inside of the firewalls, and WAVE-7571 appliances connected to interfaces on the 3925 routers. The user hosts are connected to a VLAN on the router as well, on a separate VLAN from the WAVEs. The routers and the WAVEs are configured to use WCCPv2 with GRE. The WAVEs are not clustered. The firewalls are not configured for stateful Active/Active or Active/Standby failover, nor for clustering. The firewalls are configured to bypass TCP state monitoring for specific traffic that we want to fail over to the other link without connection state being an issue. The network gear works as 2 separate links, except that the hosts can fail over from one router to the other via HSRP. We have object tracking enabled so that HSRP fails over when one of the links goes down or specific routes are not learned from the WAN.

Problem Description: Right after a link fails and HSRP switches the users over to the other router, one of the WAASs seems to send TCP resets to the server and clients, terminating the connections. We cannot pinpoint which WAAS is sending the resets, but when we remove the WCCP redirect commands from the routers to bypass the WAAS, the reset messages are not seen. We run a proprietary software application that requires the TCP connections to stay up all the time. The resets force the users to restart the application, which takes a long time, and they are losing valuable time. I tried to pinpoint which WAAS was sending the resets by doing a packet capture on the WAAS and comparing the TTLs of the reset messages to those of the normal packets, but in some cases the TTLs are 255 and in other cases they match the TTL seen in other packets being optimized. I have also seen the sent/received reset counters increase during the failovers using "show statistics tcp". In some cases it appears that both the WAAS on the failed link and the WAAS on the good link are sending resets.

Questions:

1. Why would the WAAS on the failed link send TCP resets, when it is no longer receiving traffic for the TCP connection?

2. Is this due to a timeout value?

3. Why would the gaining WAAS send the TCP resets? I was under the impression the gaining WAAS should pass the connections un-optimized as Pass-Through but still forward them.

Lamont

finn.poulsen · ‎06-11-2015

Hi Lamont,

Are you 100% sure that it is the WAAS that is generating the RST?

When you write that the WAASes are not clustered, do you mean that each WAAS has only one router (the local one) as its WCCP "partner" and each router has only one WAAS (the local one) as its partner?

Beware that WAAS acts as a TCP proxy, so it uses one set of TCP sequence numbers towards the unoptimized side (LAN) and another set of sequence numbers towards the optimized side (WAN), where the sequence numbers are actually shifted (+ some other magic).

So when a WAAS suddenly goes into pass-through for a session (or the session is switched to another WAAS as in your scenario), the original sequence numbers (from the server or client) are sent unchanged to the server or client, which sees that there is a severe sequence number mismatch and interprets this as a man-in-the-middle attack - and resets the TCP session.
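A toy model of the mismatch described above may make it easier to picture. This is only a sketch: the shift value, window size, and variable names are made up for illustration, and the check used is the generic TCP "is this sequence number inside the receive window" test, not anything taken from WAAS itself. How a given stack reacts to a segment that fails the check is not modelled here.

```python
MOD = 2 ** 32  # TCP sequence numbers wrap at 32 bits


def in_receive_window(seg_seq: int, rcv_nxt: int, rcv_wnd: int) -> bool:
    """Return True if seg_seq lies in [rcv_nxt, rcv_nxt + rcv_wnd), modulo 2**32."""
    return (seg_seq - rcv_nxt) % MOD < rcv_wnd


# Hypothetical values, chosen only to illustrate the effect.
SEQ_SHIFT = 2_000_000_000  # assumed offset applied on the optimized (WAN) side
client_seq = 1_000_500     # next sequence number in the original (LAN-side) numbering
window = 65_535            # assumed receive window

# The receiver of the optimized stream expects the shifted numbering.
expected_on_wan = (client_seq + SEQ_SHIFT) % MOD

# Normal case: the segment was shifted by the local proxy before crossing the WAN.
shifted_ok = in_receive_window((client_seq + SEQ_SHIFT) % MOD, expected_on_wan, window)

# Failover case: the segment arrives with its original, unshifted number.
unshifted_ok = in_receive_window(client_seq, expected_on_wan, window)

print("shifted segment accepted:  ", shifted_ok)
print("unshifted segment accepted:", unshifted_ok)
```

The same reasoning applies in the other direction: a host that was talking to a proxy in one numbering suddenly sees segments in a different one.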

Newer applications just start up a new TCP session and continue from there... I know that this doesn't help you ;-)

Best regards

Finn Poulsen

Lamont Bullock · ‎06-16-2015

Finn,

To be honest, I am not 100% certain it was the WAAS. I tried to narrow it down to the WAAS by doing some packet captures on the WAAS and looking at the TTL value of the TCP reset packets. I found that a lot of the packets had the same TTL as other packets in the stream, while some had the TTL set to 255. So I concluded the ones with the TTL set to 255 were generated by the WAAS, but I couldn't account for the ones that looked like they came from end hosts.
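The TTL comparison described above can be sketched as a small script. The packet format here (a list of dicts with `ttl` and `rst` keys) is a hypothetical stand-in for whatever the capture tool exports, and the result is only a heuristic: a matching TTL suggests, but does not prove, that a reset came from the same end host as the rest of the stream. It should be run per direction, since each direction normally has its own TTL.

```python
from collections import Counter


def classify_resets(packets):
    """Split RST packets by whether their TTL matches the flow's usual TTL.

    `packets` is a list of dicts with 'ttl' (int) and 'rst' (bool) keys.
    """
    resets = [p for p in packets if p["rst"]]
    normal_ttls = Counter(p["ttl"] for p in packets if not p["rst"])
    if not normal_ttls:
        # No non-RST packets to compare against, so nothing can be matched.
        return {"matches_flow": [], "other": resets}

    usual_ttl = normal_ttls.most_common(1)[0][0]
    result = {"matches_flow": [], "other": []}
    for p in resets:
        key = "matches_flow" if p["ttl"] == usual_ttl else "other"
        result[key].append(p)
    return result


# Made-up sample: two data packets and two resets in one direction.
sample = [
    {"ttl": 126, "rst": False},
    {"ttl": 126, "rst": False},
    {"ttl": 126, "rst": True},
    {"ttl": 255, "rst": True},
]
groups = classify_resets(sample)
print(len(groups["matches_flow"]), "reset(s) match the flow TTL")
print(len(groups["other"]), "reset(s) have a different TTL")
```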

When I stated the WAASs were not clustered, I did mean each ISR router had its own WAAS attached and only communicated with that WAAS.

I really like your explanation of the proxying and the sequence numbers being altered. This may account for why a lot of the TCP resets I saw had the same TTL value as other packets in the stream: they came from the end hosts, not the WAAS. Thank you for clearing that up. Is it possible to disable the sequence number changing without adverse effects?

Lamont

finn.poulsen · ‎06-16-2015

Hi Lamont,

You cannot disable the sequence number changing. It is one of the methods by which a WAAS determines whether the session is being optimized: if the sequence numbers suddenly change back to "normal" values, the WAAS knows that the remote WAAS has gone away.

This is also the problem that occurs when deploying non-Cisco firewalls (or an ASA without an inspect WAAS policy), because they also think there is a man-in-the-middle attack going on.

Unless there is too large a delay between the two ISRs, why not use both WAASes as a farm and redirect to both, depending on IP addresses? This means that both ISRs will always redirect to the same WAAS depending on, e.g., the destination IP address.

And remember to run WCCP negotiated return, so that you always return traffic to the redirecting router and avoid asymmetric routing.

You'll still encounter the same problem with your legacy applications if one of the WAASes fails and the sequence number reset occurs again.

Hope this helps

Finn

Lamont Bullock · ‎06-16-2015

Finn,

I really appreciate your responses on this topic. We had looked into doing a cluster early on and even did some testing in a clustered configuration, but our customer has a requirement that half their data be sent over one link and the other half over the other link. The 2 data streams work independently of each other. This is to prevent a scenario where something bad happens to one of the links, or the equipment that makes up that link fails, and the failure affects the flow of both streams of data. If we used clustering and both routers decided to forward traffic to the same WAAS, and that WAAS failed, both streams would be affected, and this was unacceptable for our customer. It was because of this requirement that we removed the clustering. I wasn't able to find any WCCP feature that forces a router to send to a designated WAAS until it fails, other than the load balancing algorithm, but I vaguely remember that not being definitive enough. I didn't know about the negotiated return though. I had read that the WAAS will forward the traffic to its configured default router.

Do you know of a technique using WCCP to force the router to prefer a certain WAAS for all data streams and then use the other WAAS if the primary has failed?

finn.poulsen · ‎06-17-2015

Hi Lamont,

If you're running negotiated return, the return traffic from the WAAS will be GRE encapsulated and sent back to the originating (=redirecting) router.

Otherwise the default behaviour is to send the response back to the default "gateway" configured on the WAAS.

What I think you want (based on your description) is to exercise more (manual) control over where traffic is going under what circumstances.

Normally it's the WCCP load balancing algorithm that does this.

As I see it you have two "potential" options here :

1) Use mask assignment, and try to control the redirection by that (hash assignment can't be steered in this way) by using a very small mask (0x1), which would send all even IP addresses to one WAAS and the odd ones to the other (might require a change of IP addresses on your systems).

2) look into using APP-NAV, where you can exercise much more control

By the word "potential" I mean :

a) this is not what the proposed techniques are recommended for

b) this might not work

c) I haven't tried it

So if you want to go this way (you probably don't want to), be sure to test it carefully.
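To illustrate the even/odd split that a 0x1 mask would give in option 1, here is a minimal sketch. It only models the bit arithmetic. The addresses and the bucket-to-WAAS mapping are hypothetical, and in a real WCCP deployment the assignment of buckets to devices is negotiated by the protocol rather than written down like this.

```python
import ipaddress


def bucket_for(ip: str, mask: int = 0x1) -> int:
    """Return the address bits selected by the mask (the 'bucket' value)."""
    return int(ipaddress.IPv4Address(ip)) & mask


# Assumed mapping for illustration: bucket 0 -> WAE1, bucket 1 -> WAE2.
WAE_BY_BUCKET = {0: "WAE1", 1: "WAE2"}


def wae_for(ip: str) -> str:
    """Name the WAE that would receive traffic for this address under the assumed mapping."""
    return WAE_BY_BUCKET[bucket_for(ip)]


# Made-up addresses: one even, one odd in the last octet.
for addr in ("10.1.1.10", "10.1.1.11"):
    print(addr, "->", wae_for(addr))
```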

Best Regards

Finn

Lamont Bullock · ‎06-17-2015

Finn,

I will look into the APP-NAV as you suggested. At this point we cannot re-address the systems. The servers are used by other systems at the sites, so re-addressing would cause significant changes that would require a lot more coordination and time to pull off, which also means a lot more money would get spent.

Thanks again,

Lamont

finn.poulsen · ‎06-17-2015

Hi Lamont,

You do not necessarily have to change the IP addresses of the servers.

If you can play around with the masks and make the two (??) IP addresses fall into different "buckets", it might do the trick.

Check this out :

https://supportforums.cisco.com/document/12012766/waas-wccp-v2-mask-assignment-calculation-method

Beware that this will influence all the redirection, not only for this system.
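As a rough aid for "playing around with the masks", the sketch below lists every single-bit mask that would put two given addresses into different buckets. The two addresses are invented for the example, and this only covers the arithmetic: whether a particular mask is acceptable for all the other redirected traffic still has to be checked separately.

```python
import ipaddress


def separating_masks(ip_a: str, ip_b: str) -> list:
    """List every single-bit mask under which the two addresses differ."""
    diff = int(ipaddress.IPv4Address(ip_a)) ^ int(ipaddress.IPv4Address(ip_b))
    return [1 << bit for bit in range(32) if diff & (1 << bit)]


# Hypothetical server addresses; both are even, so mask 0x1 would not split them.
masks = separating_masks("10.1.1.10", "10.1.1.14")
print([hex(m) for m in masks])
```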

/Finn

Lamont Bullock · ‎06-18-2015

Finn,

I started looking into the APP-NAV and it appears to be another appliance we would have to purchase and deploy. Part of the challenge I have is not having the funds to purchase new hardware. We already spent a lot of money on four WAVE-7571s. I did have an idea I wanted to run by you to see if it sounds like it will work.

Idea description:

1) Put the two routers and two WAEs at each site into a common VLAN such that both routers can access both WAEs.
I'll refer to the routers and WAEs at one site as Router1, Router2, WAE1 & WAE2. The same configuration exists at the other site like a mirror.
Router1 & WAE1 use WAN-link1 and Router2 & WAE2 use WAN-link2

2) Create 2 HSRP groups so that each router is active for 1 of the groups
Router1 will be active for group1 and router2 will be active for group2

3) Configure each WAE so its default-gw is the HSRP IP of one of the HSRP groups.
So WAE1's default-gw is Router1 and WAE2's default-gw is Router2

4) Instead of using WCCP, use PBR to direct the traffic to the WAEs based on the source/destination & protocol
Router1 can be set up to always redirect to WAE1 unless WAE1 is down or WAN-link1 is down

If Router1 fails, Router2 can take over as HSRP active and forward across WAN-link2
When the packet gets to Router2 at the other site, it will forward to WAE1 based on the PBR policy
So from one site to the other the same pair of WAEs is used for the TCP session, just forwarded across a different link
If this works, it will solve the TCP reset problem when a link fails. Recently, WAN link problems have been triggering the failover response, which has an adverse effect on the applications
So the only failure we should worry about is if WAE1 fails, which should be a very rare case. If WAE1 does fail, WAE2 can be used to handle the traffic until it is replaced.
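The intended PBR behaviour can be modelled as an ordered preference list that both routers at a site share for a given traffic class. This is a sketch of the decision logic only, not router configuration, and it simplifies by modelling WAE reachability alone. The names and the idea of a `reachable` set (standing in for whatever health tracking the routers would use) are assumptions for illustration.

```python
from typing import Iterable, Optional, Set


def pick_next_hop(preference: Iterable[str], reachable: Set[str]) -> Optional[str]:
    """Return the first preferred WAE that is reachable, or None to forward unredirected."""
    for wae in preference:
        if wae in reachable:
            return wae
    return None


# Assumed policy: this traffic class prefers WAE1, with WAE2 as backup.
# Both Router1 and Router2 would apply the same list.
PREFERENCE = ["WAE1", "WAE2"]

print(pick_next_hop(PREFERENCE, {"WAE1", "WAE2"}))  # normal operation
print(pick_next_hop(PREFERENCE, {"WAE2"}))          # WAE1 has failed
print(pick_next_hop(PREFERENCE, set()))             # no WAE available
```

Because the choice depends only on the traffic class and WAE health, a change of HSRP active router does not change which WAE handles an existing session in this model.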

A few questions that come to mind, which may make this setup a little more robust, are:
Do the WAEs absolutely rely on the configured default-gw to forward traffic back to the router?
What I am getting at is, can I omit the default router configuration parameter and just rely on static routes in the WAE to choose which router to forward traffic back to?
Does the WAE have any PBR-like functions I can leverage for choosing a router to forward to?

Lamont