ACE30 - High No of Total Connections Failed

Daniel Anderson · ‎01-29-2014

Hi,

We have some VIPs that were created recently. From a high-level perspective everything appears to be working as expected, though we've had some murmurings that some clients are experiencing slow connectivity. These clients are located in excess of 5000 miles away, so the slowness has been put down to latency for the moment.

I've been doing a little digging within the relevant context on the ACE30, and noticed the following stats within that context (I reset the counters less than 25 minutes before taking the output below):

HOSTNAME# sh stats connection

+------------------------------------------+
+------- Connection statistics ------------+
+------------------------------------------+
Total Connections Created : 26718
Total Connections Current : 166
Total Connections Destroyed: 4110
Total Connections Timed-out: 6218
Total Connections Failed : 16372

HOSTNAME# sh stats loadbalance

+------------------------------------------+
+------- Loadbalance statistics -----------+
+------------------------------------------+
Total version mismatch                       : 0
Total Layer4 decisions                       : 0
Total Layer4 rejections                      : 0
Total Layer7 decisions                       : 10159046
Total Layer7 rejections                      : 565
Total Layer4 LB policy misses                : 0
Total Layer7 LB policy misses                : 0
Total times rserver was unavailable          : 0
Total ACL denied                             : 0
Total FT Invalid Id                          : 0
Total IDMap Lookup Failures                  : 0
Total Proxy misses                           : 0
Total Misc Errors                            : 0
Total L4 Close Before Process                : 0
Total L7 Close Before Parse                  : 0
Total Close Msg for Valid Real               : 189621
Total Close Msg for Invalid Real             : 9969365
Total Cipher Lookup Failures                 : 0
Total Close Before Dest decision             : 0
Total Optimization Msg sent to Real Servers : 0

My main concern is around 'Total Connections Failed', as this number appears to be incrementing quite rapidly. Looking at the Cisco documentation, this appears to be related to an rserver within the VIP configuration 'not replying to a SYN within the pending timeout period or it replied with a RST'. From an IP perspective, connectivity to each of the rservers appears to be consistently responsive at under 1 msec, and I've been advised they're running as expected.
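To put a number on how fast the failed counter is growing relative to everything else, a rough sketch like the one below can be run against pasted 'sh stats connection' output. The function names are my own, and the regex simply assumes each line has the 'Name : value' shape shown above.

```python
import re

# Counter lines copied from the 'sh stats connection' output above.
SAMPLE = """\
Total Connections Created : 26718
Total Connections Current : 166
Total Connections Destroyed: 4110
Total Connections Timed-out: 6218
Total Connections Failed : 16372
"""


def parse_counters(text):
    """Return {counter name: int} for lines shaped like 'Name : 123'."""
    counters = {}
    for line in text.splitlines():
        match = re.match(r"^\s*(.+?)\s*:\s*(\d+)\s*$", line)
        if match:
            counters[match.group(1)] = int(match.group(2))
    return counters


def failed_ratio(counters):
    """Ratio of the Failed counter to the Created counter (0.0 if none created)."""
    created = counters["Total Connections Created"]
    failed = counters["Total Connections Failed"]
    return failed / created if created else 0.0


stats = parse_counters(SAMPLE)
print(f"failed/created = {failed_ratio(stats):.1%}")
```

With the figures above, the failed counter comes out at roughly 61% of the created counter, which is why it stands out.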

The VIP in question is configured to listen on 443 (the incoming connection is https), and is passing the traffic through to the rservers on the same port. We're not doing any sort of encryption/decryption on the load balancer. The only config we have in place out of the ordinary is a parameter map setting the max-parse-length to 76 (see below). This is used for the SSL sticky sessions.

parameter-map type generic PARAMETER-MAP
set max-parse-length 76

From a troubleshooting perspective, I'm looking to obtain a Wireshark capture from one of the rservers within the VIP to analyse. Do others have any pointers on where an issue may lie that could be causing the stats to increment?

TIA,

Dan

Kanwaljeet Singh · ‎01-29-2014

Hi Daniel,

Do you see failures under the serverfarm as well, with the counter increasing rapidly? Do you see any probe failure statistics? How about the resource usage denied counters shooting up? If it is a slowness problem, I doubt the counter for total connections failed would be responsible. That counter means the connection never got established. But yes, we need to figure out why that counter is increasing along with the performance issues.

Does the issue happen in peak traffic hours? Does it happen with a specific group of users? Was this there from the beginning or did it start recently? Is it an issue with one single context and VIP, or with all VIPs in that context and other contexts? What do you see in "show stats sticky" and "show stats http"?

Also, per the documentation, it is recommended that the max-parse-length should never be more than 70, but I see you have 76. Not sure if it is related. If you have a user who has the problem, then packet captures on the ACE itself, or simultaneous front-end and back-end pcaps, along with two show tech outputs taken during the problem would be a great help. If you have all this information you can open a TAC case for further investigation, unless something stands out.

Regards,

Kanwal

Daniel Anderson · ‎01-30-2014

Hi Kanwal - Many thanks for your reply.

Just to run through your questions:

We only see a couple of failures within the serverfarm, just 20 between all 3 rservers behind the VIP. This hasn't incremented since I reset the stats counters yesterday afternoon. The probes for the rservers also only show 2 failures, and again this has not incremented recently. Running 'show resource usage' within the context also shows that none of the connections have been denied.

HOSTNAME/CONTEXT# sh resource usage

                                                   Allocation
  Resource                Current       Peak        Min        Max     Denied
-------------------------------------------------------------------------------
Context: CONTEXT
  conc-connections            254        342          0    7999900          0
  mgmt-connections             10         38          0      99900          0
  proxy-connections             0         48          0    1048572          0
  xlates                        0          0          0    1048572          0
  bandwidth                149839    4424695          0  622500016          0
  throughput               149743    4354885          0  498750016          0
  mgmt-traffic rate            96      69810          0  123750000          0
  connection rate               5         34          0     599900          0
  ssl-connections rate          0          0          0      30000          0
  mac-miss rate                 0          2          0       2000          0
  inspect-conn rate             0          0          0     240000          0
  http-comp rate                0          0          0  786432000          0
  to-cp-ipcp rate               0          0          0       5000          0
  acl-memory                 2904       2904          0   78608304          0
  sticky                      647        847          0    4194304          0
  regexp                        9          9          0    1048576          0
  syslog buffer              1024       1024          0    4194304          0
  syslog rate                   0          1          0     100000          0

HOSTNAME/CONTEXT#

From what I understand, the issue appears to be ongoing; it doesn't occur at any particular time or for any particular user. There is only a single VIP within this context, with the 3 rservers behind that VIP. Looking at 'sh stats connection' again this morning, I notice the Connections Failed counter appears to increment quickly in blocks of 4. Not sure if this is of any particular assistance.

Do you have any pointers to the documentation that speaks of the max parse length and the ideal values? I've had a quick scan of the Cisco documentation this morning but nothing seemed to jump out at me.
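To check the "blocks of 4" observation more rigorously than by eye, successive readings of the counter can be diffed. This is only a sketch: the readings below are made-up values for illustration, not taken from the device, and the helper names are my own.

```python
def deltas(samples):
    """Differences between consecutive counter readings."""
    return [later - earlier for earlier, later in zip(samples, samples[1:])]


def all_multiples_of(values, step):
    """True if every value is an exact multiple of step."""
    return all(value % step == 0 for value in values)


# Hypothetical successive readings of 'Total Connections Failed'.
readings = [16372, 16376, 16384, 16388]

steps = deltas(readings)
print(steps, all_multiples_of(steps, 4))
```

If the increments do turn out to be consistent multiples of 4, that would be worth mentioning alongside the packet captures.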

Thanks again for your response.

Dan

Kanwaljeet Singh · ‎01-30-2014

Hi Daniel,

My pleasure. Please visit the below link:

http://www.cisco.com/en/US/docs/interfaces_modules/services_modules/ace/vA5_1_0/configuration/slb/guide/slbgd.pdf

In there, please go to the section on SSL session-ID stickiness and then have a look at "configuration considerations and requirements". Pasting the relevant portion here:

•Configure a generic parameter map to specify the maximum number of bytes in the TCP payload that you want the ACE to parse. The value of the maximum parse length should always be 70.

Resource usage looks good, and it may well not be the ACE that is causing those failed connections to increase. It could be due to the servers sending RSTs or not replying to SYNs. This is just speculation, though, and strangely you don't see failures increasing on the rservers under the serverfarm.

Can you also do "show service-policy detail" and see where the connections are dropping? I also see that the "Total Layer7 rejections" counter has some value. Is that increasing too? It means that a connection was rejected because the traffic did not match the condition defined.

Regards,

Kanwal

Daniel Anderson · ‎01-30-2014

Many thanks for the link.

I think I may be making a little progress. To give a little more info around the setup: we currently have 2 x Cisco 6500 chassis, and each chassis has 2 x ACE30s within it, located in Slots 2 and 3 respectively. On the ACE30s in Slot 2 we have only Production VIPs configured; the devices in Slot 3 have Dev/Test VIPs.

What I've noticed this morning is that within the Admin context on each ACE, they're all using Vlan 811 as the FT Vlan, each device using a separate address within that Vlan (see below). From a failover perspective Device A, Slot 2 is peering with Device B Slot 2, and the same for the devices in Slot 3.

Device A - Slot 2 - 192.168.1.1 /24

Device B - Slot 2 - 192.168.1.2 /24

Device A - Slot 3 - 192.168.1.3 /24

Device B - Slot 3 - 192.168.1.4 /24

It appears that using the same FT Vlan across both ACEs in each chassis may not be the ideal scenario, as we're seeing the following message within the log:

%ACE-4-313004: Denied ICMP type=icmp_type, from source_address on interface interface_name to dest_address:no matching session

My question is, from an FT perspective, what would be the ideal scenario - I'm assuming we need to move away from using the same vlan for the ft setup across all the modules, or would using the same vlan id work as long as we weren't using the same address space across each vlan?

TIA

Dan

Kanwaljeet Singh · ‎01-30-2014

Hi Dan,

If the FT VLAN is not carrying any other traffic, it is okay to have the same VLAN for different modules. Just ensure that the FT group numbers are different, since the VMAC generated is based on the FT group value. If you have the same FT group value on two different ACEs in the same chassis, they are going to have the same VMAC, and the switch will log MAC flapping messages since the same MAC would be learned from two different interfaces.
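As a quick way to audit this, the sketch below flags any FT group number that is reused by more than one module in the same chassis. The inventory is hypothetical (I don't know your actual group numbers) and would need to be filled in from each module's FT configuration.

```python
from collections import defaultdict

# Hypothetical inventory: (chassis, slot, ft_group_number).
# Replace with the real values from each module's FT configuration.
modules = [
    ("A", 2, 1),
    ("A", 3, 1),
    ("B", 2, 1),
    ("B", 3, 1),
]


def ft_group_clashes(inventory):
    """Return {(chassis, ft_group): [slots]} where a group number is
    used by more than one module in the same chassis."""
    seen = defaultdict(list)
    for chassis, slot, group in inventory:
        seen[(chassis, group)].append(slot)
    return {key: slots for key, slots in seen.items() if len(slots) > 1}


for (chassis, group), slots in sorted(ft_group_clashes(modules).items()):
    print(f"chassis {chassis}: FT group {group} reused in slots {slots}")
```

An empty result means no two modules in a chassis share a group number, so by the reasoning above they should not end up with the same VMAC.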

Regards,

Kanwal

Daniel Anderson · ‎01-30-2014

Thanks - I'll need to get that changed then, as we have the same Group No's across the different modules. Thanks for confirming.

Also, just to clarify re: the Max Parse Length, we're currently running Vers A4(2.1) on our ACE devices. The config guide for this recommends running a value of 76:

http://www.cisco.com/en/US/docs/interfaces_modules/services_modules/ace/vA4_2_0/configuration/slb/guide/slbgd.pdf

"Configure a generic parameter map to specify the maximum number of bytes in the TCP payload that you want the ACE to parse. The value of the maximum parse length should always be 76"

Whereas the Vers 5 guide advises a value of 70, as your previous message highlighted. Is there a difference in the values that should be run between the software versions, or is that an anomaly in the documentation?

Thanks again for your assistance.

Kanwaljeet Singh · ‎01-30-2014

Hi Dan,

It could be a documentation anomaly. I looked at the internal database and it seems you should be okay with 76. I also found this known bug: CSCso12679. Please have a look at that and ensure that you have the correct configuration.

Also, in some cases I see development recommending the value of 70, and it did resolve the issues people were facing, but in those cases they had set a very high parse length (like 2000, 4000 etc.), which actually caused the connection to hang/time out, or the ACE not even sending the client hello to the backend server.
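If you want to sanity-check saved configs for this, a minimal sketch along these lines pulls out every max-parse-length and flags anything other than the two values the guides mention. The set of accepted values is my assumption based on this thread (76 from the A4(2.x) guide, 70 from the A5(1.x) guide), and the function names are made up for the example.

```python
import re

# Config fragment as posted earlier in the thread.
CONFIG = """\
parameter-map type generic PARAMETER-MAP
  set max-parse-length 76
"""

# Assumption: the only values the two quoted guides recommend.
DOCUMENTED_VALUES = {70, 76}


def max_parse_lengths(config):
    """Return every 'set max-parse-length N' value found in the config text."""
    pattern = re.compile(r"^\s*set max-parse-length (\d+)\s*$", re.MULTILINE)
    return [int(value) for value in pattern.findall(config)]


def unexpected_values(config):
    """Values that match neither documented recommendation."""
    return [n for n in max_parse_lengths(config) if n not in DOCUMENTED_VALUES]


print(max_parse_lengths(CONFIG), unexpected_values(CONFIG))
```

A config carrying something like 2000 or 4000 would show up in the second list, which is the pattern seen in the cases mentioned above.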

Regards,

Kanwal