2170 Views · 15 Helpful · 9 Replies

ACE sending RST

michaelhostbaek
Level 1

Hi,

I have been using my ACE for several years, it is in front of a Varnish reverse proxy.

Recently I have noticed something strange.

I am seeing some Timed-Out and Failed connections:

+------------------------------------------+
+------- Connection statistics ------------+
+------------------------------------------+
Total Connections Created : 72541
Total Connections Current : 192
Total Connections Destroyed: 72335
Total Connections Timed-out: 218
Total Connections Failed : 138

and when running tcpdump on the Varnish Unix server, I see the following:

22:09:56.330416 IP 172.31.255.248.31118 > 172.16.0.100.http: Flags [S], seq 1012507648, win 32768, options [mss 1460], length 0
22:09:56.330420 IP 172.16.0.100.http > 172.31.255.248.31118: Flags [S.], seq 377765974, ack 1012507649, win 65535, options [mss 1240], length 0
22:09:56.330525 IP 172.31.255.248.31118 > 172.16.0.100.http: Flags [.], ack 1, win 32768, length 0
22:09:56.330607 IP 172.31.255.248.31118 > 172.16.0.100.http: Flags [P.], seq 1:388, ack 1, win 32768, length 387: HTTP: GET / HTTP/1.1
22:09:56.330666 IP 172.16.0.100.http > 172.31.255.248.31118: Flags [.], seq 1:12401, ack 388, win 65535, length 12400: HTTP: HTTP/1.1 200 OK
22:09:56.330956 IP 172.31.255.248.31118 > 172.16.0.100.http: Flags [.], ack 1241, win 31528, length 0
22:09:56.330960 IP 172.16.0.100.http > 172.31.255.248.31118: Flags [.], seq 12401:13641, ack 388, win 65535, length 1240: HTTP
22:09:56.530272 IP 172.31.255.248.31118 > 172.16.0.100.http: Flags [F.], seq 388, ack 1241, win 31000, length 0
22:09:56.530275 IP 172.16.0.100.http > 172.31.255.248.31118: Flags [.], ack 389, win 65535, length 0
22:09:56.587149 IP 172.16.0.100.http > 172.31.255.248.31118: Flags [.], seq 1241:2481, ack 389, win 65535, length 1240: HTTP
22:09:56.709494 IP 172.31.255.248.31118 > 172.16.0.100.http: Flags [R], seq 1012508037, win 0, length 0

172.31.255.248= ACE
172.16.0.100= Varnish
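The pattern in the trace above - the ACE half-closes with a FIN, Varnish keeps sending segments, and the ACE answers with a RST - is what TCP stacks commonly do when data arrives for a socket the application has already fully closed. A minimal sketch of that behavior (assuming a Linux-like stack and loopback; the roles, ports, and timing are illustrative, not the ACE's actual internals):

```python
import socket
import threading
import time

# One side fully closes its socket (sending a FIN); the peer then keeps
# sending data. The closed side's kernel can no longer deliver that data
# to any application, so it answers with a RST - the same shape as the
# ACE/Varnish trace above.

def early_closer(port):
    c = socket.socket()
    c.connect(("127.0.0.1", port))
    c.close()  # full close -> FIN; any later incoming data draws a RST

srv = socket.socket()
srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
srv.bind(("127.0.0.1", 0))
srv.listen(1)
port = srv.getsockname()[1]

t = threading.Thread(target=early_closer, args=(port,))
t.start()
conn, _ = srv.accept()
t.join()
time.sleep(0.2)  # let the peer's FIN arrive

reset_seen = False
try:
    conn.sendall(b"late data")  # lands after the peer fully closed
    time.sleep(0.2)             # the RST for it comes back
    conn.sendall(b"late data")  # a further write surfaces the reset
except (ConnectionResetError, BrokenPipeError):
    reset_seen = True
conn.close()
srv.close()
print("reset observed:", reset_seen)
```

On the wire this looks exactly like the capture: FIN from the closing side, data from the peer, then a RST for the stray data. Whether the first or second write raises the error depends on timing, hence the two sends.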

Why is the ACE sending a RST in this case? 

Thanks, 

9 Replies

Aleksey Pan
Cisco Employee

Hi Michael,

This is not the full picture; it is only a back-end capture.

It could be due to several things:

- no normalization on the ACE interface

- the front-end connection was closed before the ACE received the FIN,ACK from the server

- the client sends the RST

To see the full picture, it is better to have a SPAN capture of the ACE backplane port; then you can see both the front-end client-ACE and the back-end ACE-server conversation.

Hi Alex,

Thanks for your answer.

Indeed I have seen several posts about normalization on the ACE interface, that might relate to my issue. My understanding is that normalization is enabled by default, and I have not specifically disabled it.

The ACE is a managed service from my provider, and I only have access to the context, not the supervisor engine - so no backplane capture is possible.

I guess I could do a capture on the WAN interface and see if there is a corresponding RST sent from the client when I notice the issue on the reverse proxy... ?

Would that be the way forward? 

thanks, 

Hi Michael,

Sorry for the late reply.

If you check the interface on the ACE with "sh run int ...", make sure you don't have "no normalization".

By taking a capture on the WAN, yes, at least we can check whether the client sends the RST or not.

- If there is no possibility to take the SPAN capture, then it would be better to have at least the following:

- captures from the client and the server at the same time, to catch the connection from the beginning (SYN)

- output of "show conn det client-ip x.x.x.x", tracked until you see the RST (log the session; then it will be easy to check)

- show service-policy det | b <VIP address>

- check whether dropped conns are incrementing under L7 or under the serverfarm

Also, this doc might be helpful:

http://docwiki.cisco.com/wiki/Cisco_Application_Control_Engine_%28ACE%29_Troubleshooting_Guide_--_Troubleshooting_Connectivity

Another question:

- Do you see the RST only when the connection is terminated by FIN, or does it happen in the middle of a connection too?

Hi Alex, 

Thanks for your reply.

Here's my "sh run int"

interface vlan 1212

  ip address x.xx.xx.xxx 255.255.255.240

  alias x.xx.xx.xxx 255.255.255.240

  peer ip address x.xx.xx.xxx 255.255.255.240

  access-group input ANY

  service-policy input WEB-to-vIPs

  service-policy input SNMP_POLICY

  service-policy input ICMP_POLICY

  service-policy input L4_TCP_POLICY

  no shutdown

interface vlan 2424

  ip address 172.31.255.251 255.240.0.0

  alias 172.31.255.249 255.240.0.0

  peer ip address 172.31.255.250 255.240.0.0

  no normalization

  fragment timeout 10

  access-group input ANY

  nat-pool 1 172.31.255.248 172.31.255.248 netmask 255.240.0.0 pat

  service-policy input REMOTE_MGMT_ALLOW_POLICY

  service-policy input L4_TCP_POLICY

  no shutdown

vlan1212 is the WAN interface (traffic from Cloudflare)

vlan2424 is the LAN interface (to my varnish proxy)

I have added "no normalization" to the "internal" interface, and that seemed to have helped.

Furthermore I added the following connection parameter map: 

parameter-map type connection TCP_PARAMETER_MAP

  set timeout inactivity 500

  set tcp wan-optimization rtt 0

  exceed-mss allow

and lastly, I added the following to the http parameter map:

server-conn reuse

With the above changes, first off, the majority of connections moved from the Varnish box ("netstat -an | wc -l" used to show 3-4000 connections) to the ACE. The Varnish box only has 2-300 connections now, and the Cisco shows 3-5000.

Secondly, it now seems that I get very few RSTs from the ACE to the Varnish box. I am not entirely sure why, however, as I still see some strange statistics:

+------------------------------------------+

+------- Connection statistics ------------+

+------------------------------------------+

 Total Connections Created  : 392300

 Total Connections Current  : 5320

 Total Connections Destroyed: 137095

 Total Connections Timed-out: 245923

 Total Connections Failed   : 8833

Policy-map : WEB-to-vIPs

Status     : ACTIVE

Description: -----------------------------------------

Interface: vlan 1212

  service-policy: WEB-to-vIPs

    class: L4-WEB-IP

      nat:

        nat dynamic 1 vlan 2424

        curr conns       : 22        , hit count        : 88188

        dropped conns    : 145

        client pkt count : 39368436  , client byte count: 2438708515

        server pkt count : 60869373  , server byte count: 73323471026

        conn-rate-limit      : 0         , drop-count : 0

        bandwidth-rate-limit : 0         , drop-count : 0

     VIP Address:                              Protocol:  Port:

     x.xx.xx.xxx                               tcp    eq   80

      loadbalance:

        L7 loadbalance policy: WEB_L7_POLICY

        Regex dnld status           : SUCCESSFUL

        Rgx comp success cnt        : 2

        Last Regex comp success : Mon Mar 14 20:11:30 2016

        Rgx comp timeout cnt        : 0

        Rgx comp failed cnt         : 0

        VIP Route Metric     : 77

        VIP Route Advertise  : DISABLED

        VIP ICMP Reply       : ENABLED-WHEN-ACTIVE

        VIP State: INSERVICE

        VIP DWS state: DWS_DISABLED

        Persistence Rebalance: ENABLED

        curr conns       : 5239      , hit count        : 334004

        dropped conns    : 542

        conns per second    : 0

        client pkt count : 41024202  , client byte count: 2505314498

        server pkt count : 60869373  , server byte count: 73323471026

        conn-rate-limit      : 0         , drop-count : 0

        bandwidth-rate-limit : 0         , drop-count : 0

        L7 Loadbalance policy : WEB_L7_POLICY

          class/match : class-default

            LB action: :

               sticky group: web-sticky

                  primary serverfarm: VARNISH

                    state:UP

               backup serverfarm : FARM_WEB_V3

                    state: UP

            hit count        : 333608

            dropped conns    : 0

            compression      : off

      compression:

        bytes_in  : 0                          bytes_out : 0

        Compression ratio : 0.00%

                Gzip: 0               Deflate: 0

      compression errors:

        User-Agent  : 0               Accept-Encoding    : 0

        Content size: 0               Content type       : 0

        Not HTTP 1.1: 0               HTTP response error: 0

        Others      : 0

        Parameter-map(s):

          HTTP_PARAMETER_MAP

+------------------------------------------+

+------- Loadbalance statistics -----------+

+------------------------------------------+

 Total version mismatch                       : 0

 Total Layer4 decisions                       : 0

 Total Layer4 rejections                      : 0

 Total Layer7 decisions                       : 1270452

 Total Layer7 rejections                      : 119

 Total Layer4 LB policy misses                : 0

 Total Layer7 LB policy misses                : 0

 Total times rserver was unavailable          : 0

 Total ACL denied                             : 0

 Total FT Invalid Id                          : 0

 Total IDMap Lookup Failures                  : 0

 Total Proxy misses                           : 0

 Total Misc Errors                            : 0

 Total L4 Close Before Process                : 0

 Total L7 Close Before Parse                  : 0

 Total Close Msg for Valid Real               : 281311

 Total Close Msg for Non-Existing Real        : 0

 Total Cipher Lookup Failures                 : 0

 Total Close Before Dest decision             : 0

 Total Optimization Msg sent to Real Servers  : 0

 Total Invalid Proxy Id              : 0

So there are some drops under L4 and some under L7, none from the serverfarm.
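To put those "strange statistics" in perspective, the counters in the connection-statistics block can be turned into ratios with a small throwaway parser (the pasted text is the output shown above):

```python
import re

# Parse the ACE "Connection statistics" block pasted above and express
# each terminal counter as a share of the connections created.
stats_text = """
 Total Connections Created  : 392300
 Total Connections Current  : 5320
 Total Connections Destroyed: 137095
 Total Connections Timed-out: 245923
 Total Connections Failed   : 8833
"""

stats = {}
for line in stats_text.strip().splitlines():
    m = re.match(r"\s*Total Connections ([\w-]+)\s*:\s*(\d+)", line)
    if m:
        stats[m.group(1)] = int(m.group(2))

created = stats["Created"]
for key in ("Destroyed", "Timed-out", "Failed"):
    share = stats[key] / created
    print(f"{key:10s}: {share:6.1%} of created")
```

Roughly 63% of the connections created end as Timed-out. That may simply reflect the 500-second inactivity timeout in TCP_PARAMETER_MAP reaping idle reused connections rather than indicating a fault, but it is worth confirming.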

Any input would be greatly appreciated. 

Thanks, 

Hi Michael,

"

I have added "no normalization" to the "internal" interface, and that seemed to have helped.

"

- Not sure at all how that could help; it is not what reduces the RSTs. For example, when the server sends a FIN packet, the ACE will reply with a RST (this feature reduces unnecessary ACKs), so instead of graceful termination it can just send RSTs.

"

Looking at a tcpdump on the Varnish box, all [RST, ACK] packets I get from the ACE are preceded by [TCP ZeroWindow] or [TCP Window Full]

"

- If you see a lot of "TCP Zero Window" from the ACE, and they occur during traffic, that means there is not enough buffer on the ACE to handle the traffic. If you see "TCP Window Full" from the ACE to Varnish, that means the TCP buffer on Varnish is filled up, and the ACE is waiting for an updated TCP window size before it continues to send data.

I don't know of any reason these packets would cause a RST, unless there is a long period with no response from Varnish with an updated buffer. Also, if there are a lot of "TCP Window Full" messages from the ACE to Varnish, the ACE cannot continue transferring data to that device and will keep everything in its own buffer; at some point the ACE's buffer fills up, and it will reply to the client or server (the source pushing the data) with "TCP Zero Window". That can cause a chain of this issue.
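The backpressure chain described above can be reproduced in miniature with plain sockets (a sketch, assuming a Linux-like stack and loopback; the buffer sizes are illustrative): when the receiver stops draining data, its advertised window shrinks toward zero and the sender's writes eventually cannot make progress.

```python
import socket

# Receiver never reads; sender keeps writing. Once the sender's send
# buffer and the receiver's receive buffer are both full (the receiver
# is advertising a zero window), a non-blocking send fails - the same
# backpressure the ACE feels when Varnish stops draining data.
srv = socket.socket()
srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
srv.bind(("127.0.0.1", 0))
srv.listen(1)

sender = socket.socket()
sender.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, 4096)
sender.connect(srv.getsockname())
receiver, _ = srv.accept()

sender.setblocking(False)
sent = 0
stalled = False
chunk = b"x" * 4096
for _ in range(100_000):          # far more than any buffer can hold
    try:
        sent += sender.send(chunk)
    except BlockingIOError:
        stalled = True            # peer window (and local buffer) full
        break

print(f"sender stalled: {stalled} after {sent} bytes buffered")
sender.close()
receiver.close()
srv.close()
```

Whether a stack then resets such a stalled connection is implementation policy; the point is that a zero window is a resource signal, not an error by itself.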

"

parameter-map type connection TCP_PARAMETER_MAP

  set timeout inactivity 500

  set tcp wan-optimization rtt 0

  exceed-mss allow

"

- That is not related to the RSTs, but it could be related to the drops. "exceed-mss allow" is helpful.

"

and lastly, I added the following to the http parameter map:

server-conn reuse

"

- This is probably why you now see fewer RSTs: it reduces the number of newly opened connections between the ACE and the server, and fewer new connections means fewer RSTs for them.

- But make sure you have PAT configured, otherwise it can create collisions between connections from the same client. I see you have PAT configured, so you are good.

- The question is whether the RSTs are really causing a problem or not. If they are just replies to FINs, that is fine. If they appear in the middle of connections unexpectedly, then we need to figure out why, with traces from client to server all along the path.
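That "after FIN vs. mid-connection" distinction can be checked mechanically against tcpdump's default text output. A throwaway sketch (the first two sample lines come from the capture earlier in the thread; the third is a made-up mid-connection example):

```python
import re

# Classify each RST in a tcpdump text capture: "after FIN" (benign
# teardown) vs "mid-connection" (unexpected abort worth investigating).
capture = """\
22:09:56.530272 IP 172.31.255.248.31118 > 172.16.0.100.http: Flags [F.], seq 388, ack 1241, win 31000, length 0
22:09:56.709494 IP 172.31.255.248.31118 > 172.16.0.100.http: Flags [R], seq 1012508037, win 0, length 0
22:10:01.000000 IP 172.31.255.248.40000 > 172.16.0.100.http: Flags [R.], seq 1, ack 1, win 0, length 0
"""

line_re = re.compile(r"IP (\S+) > (\S+): Flags \[([^\]]+)\]")
fin_seen = set()
findings = []
for line in capture.splitlines():
    m = line_re.search(line)
    if not m:
        continue
    src, dst, flags = m.groups()
    flow = frozenset((src, dst))      # direction-insensitive flow key
    if "F" in flags:
        fin_seen.add(flow)
    if "R" in flags:
        kind = "after FIN" if flow in fin_seen else "mid-connection"
        findings.append((src, dst, kind))
        print(f"RST {src} -> {dst}: {kind}")
```

Feeding a full capture through this separates the benign teardown RSTs from the ones that deserve a packet-by-packet look.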

"

So some drops under L4 and some under L7. None from the serverfarm.

and

+------------------------------------------+
+------- Connection statistics ------------+
+------------------------------------------+
Total Connections Created : 72541
Total Connections Current : 192
Total Connections Destroyed: 72335
Total Connections Timed-out: 218
Total Connections Failed : 138

"

Those values are for all connections through this ACE overall. It is impossible to tell from here which connections had issues. In this case it is better to monitor each service-policy for drops and investigate piece by piece. The drops could be due to a lot of reasons: denies based on resources (memory, throughput, license, etc.), MSS exceeded, L7 policy (matches), SSL/TLS negotiations, timeouts (client/server), failed probes (aggressive config, network or server issues with replies), and so on.

Hope this helps.

Regards,

Alex

Actually - I've re-added normalization. It does not seem to have any effect (on my issue) whether it is on or off.

"- If you see a lot of "TCP Zero Window" from the ACE, and they occur during traffic, that means there is not enough buffer on the ACE to handle the traffic. If you see "TCP Window Full" from the ACE to Varnish, that means the TCP buffer on Varnish is filled up, and the ACE is waiting for an updated TCP window size before it continues to send data."

Actually it is the other way around - I see [TCP Window Full] sent from Varnish to the ACE.

"- The question is whether the RSTs are really causing a problem or not. If they are just replies to FINs, that is fine. If they appear in the middle of connections unexpectedly, then we need to figure out why, with traces from client to server all along the path."

Yes... maybe the RSTs are normal and I am worrying about nothing. I know that a great deal of client connections originate from Africa, and perhaps the RSTs are simply due to poor network connectivity and/or to clients cancelling the request (hitting stop in their browser).

Again - many thanks for your help! 

Hi Michael,

If you see "TCP Window Full" from Varnish and "TCP Zero Window" from the ACE, that means there are not enough resources (buffer) on the ACE.

Increasing the buffer in one context can cause problems in other contexts.

I would suggest first:

parameter-map type connection TCP_PARAMETER_MAP

set tcp wan-optimization rtt 0

tcp-options selective-ack allow

tcp-options window-scale allow

and if you have just one context, and you think this class-map has higher priority and/or you are OK with resources, you can also increase the buffer under the same parameter-map:

set tcp buffer-share 65536

Note: the default value is 32768; you can vary it from 8192 to 262143.

Here is the command reference as well: http://www.cisco.com/c/en/us/td/docs/interfaces_modules/services_modules/ace/vA5_1_0/command/reference/ACE_cr/parammap.html#wp1233449

Hope this helps.

Regards,

Alex

Looking at a tcpdump on the Varnish box, all [RST, ACK] packets I get from the ACE are preceded by [TCP ZeroWindow] or [TCP Window Full].

I wonder if this is because of "no normalization" on the vlan2424 interface? 

Also, it might be relevant to know that I have Cloudflare before the Cisco ACE.

So it looks like this:

Client -> Cloudflare -> Cisco ACE -> Varnish -> backend server(s) 

If I remove the Cisco ACE from the equation, and send all traffic from Cloudflare directly to the Varnish machine, I no longer see those TCP resets.

I am not quite sure how to move forward on this. I have tried almost everything.

Any ideas would be greatly appreciated! 

thanks,