Datacenter troubleshooting guide - day 9

"Datacenter troubleshooting guide" - a blog by Gilles Dufour.

Day 9 - Understanding me-stats (continued)

Before we continue with the me-stats, I would like to bring up something I forgot to mention when talking about stickiness and the associated counter "num stolen for reuse".

Over the years working with the ACE platform, I have seen many configurations where the resources defined for a context are:

limit-resource all minimum 0.00 maximum unlimited

limit-resource sticky minimum 1.00 maximum unlimited      

The idea is that ACE will select whatever it needs.

The problem is that the sticky resource is static.  In other words, what you get is the minimum and nothing else.  It does not grow with your needs.

Therefore, with such a configuration, your context can only use 1% of the sticky resource.

This is very low, so the "num stolen for reuse" counter will most probably increase and your users will most certainly complain about connectivity issues.

If you do not expect to add more contexts, you can go ahead and assign 50% of the resources as the minimum.

If you know you will grow, try to determine to what extent and split the resources accordingly.

I would personally assign 10% of the resources to each context as the minimum.

You don't want to use 0%, as a minimum amount of resources is required for a context to work and to save your rules.
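
The sizing advice above can be sketched as a small Python helper. This is only an illustration: the function names and the equal-split policy are assumptions, not an ACE tool, and the generated line simply reuses the "limit-resource sticky" form shown earlier.

```python
def split_minimum(expected_contexts: int) -> float:
    """Equal minimum percentage per context (hypothetical helper).

    Assumes you want to split 100% evenly across the number of
    contexts you expect to end up with, including future growth.
    """
    if expected_contexts < 1:
        raise ValueError("expected_contexts must be at least 1")
    return round(100.0 / expected_contexts, 2)


def sticky_line(min_pct: float) -> str:
    """Render the sticky limit in the same form as the config shown above."""
    return f"limit-resource sticky minimum {min_pct:.2f} maximum unlimited"


# Two contexts and no growth expected: 50% each.
print(sticky_line(split_minimum(2)))
# Growth to ten contexts expected: 10% each.
print(sticky_line(split_minimum(10)))
```

Because the sticky resource is static, the minimum value is the number that matters in the generated line.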

Now, back to our me-stats.

After buffering the packets (RX), determining whether we hit a known connection (FastPath), selecting a matching rule (ICM), terminating the connection if needed (TCP), and making the load-balancing decision (LB), the packet needs to be NATed before going out.

NAT and the outbound ACL check are done in the OCM module.

Scimitar1/Admin# show np 1 me-stats "-socm -v"
OCM Statistics: (Current)
--------------
LB dest decision received:                       39             0
Errors:                                           0             0
Connection create received:                    5695             3
Nat app fixup recieved:                           0             0
Connection unproxy received:                      2             0
Connection reproxy received:                      1             0
IPCP received:                                    0             0
ACK trigger received:                             1             0
TCP connected received                            1             0
Unknown message received:                         0             0
Drop [LB dest decision fail]:                     0             0
Drop [invalid ifid]                               0             0
Drop [Out of buffers]:                            0             0
Dest decision transmitted:                        2             0
TCP connect transmitted:                         37             0
ACK trigger transmitted:                          0             0
IPCP transmitted:                                 0             0
NAT[static mapped]:                               0             0
NAT[static real]:                                 0             0
NAT[xlate alloc fail]:                            0             0
NAT[xlate real hit]:                              0             0
NAT[xlate mapped hit]:                            0             0
NAT[invalid xlate]:                               0             0
NAT[dump xlate]:                                  0             0
NAT[xlate release failed]:                        0             0
NAT Pool Alloc [fail]:                            0             0
NAT Pool Alloc [addr]:                            0             0
NAT Pool Alloc [addr/port]:                       0             0
NAT Pool Free [addr]:                             0             0
NAT Pool Free [addr/port]:                        0             0
NAT Pool Free [orphan IP]:                        0             0

Reuse retrieve link update conn invalid           0             0
Reuse retrieve link update conn not on r          0             0
Reuse retrieve success but conn invalid:          0             0
Drop [Next Hop queue full]:                       0             0
Reuse retrieve miss:                              0             0
OCM Packet count (Hi & Lo):                       0             0
Packet forward received:                          0             0
NAF Error [no route or unresolved adjace          0             0
NAF Error [nat resp fail]:                        0             0
(Context ALL Statistics)
Connection inserted:                             37             0
Packet message transmitted:                       1             0
Reuse conns retrieved:                            0             0
Drop [out of connections]:                        0             0
Drop [out of proxies]:                            0             0
Drop [out of ssl]:                                0             0
Drop [mac lookup fail]:                          41             0
Drop [route lookup fail]:                      5655             3
Drop [nat fail]                                   0             0
Drop [ip sanity check fail]                       0             0
Drop [acl deny]:                                  0             0
Drop [redundant connection]:                      0             0

Drop [Reproxy fail]:                              0             0
Drop [dest nat fail]:                             0             0

The ACE box-wide limits are 4M concurrent connections, 1M concurrent proxied connections, and 200k concurrent SSL connections.

The first two can also be controlled separately for each context with a resource allocation map.

If you reach one of these limits, the connection/packet is dropped and the appropriate counter is incremented.

These are, respectively, "Drop [out of connections]", "Drop [out of proxies]", and "Drop [out of ssl]".

Note that a UDP connection requires a proxy resource while ACE creates the flow entries.  This proxy resource is released after setting up the flows.

But if you have reached the proxy limit due to SSL or HTTP connections, you won't be able to accept new UDP connections.

Then come "Drop [mac lookup fail]" and "Drop [route lookup fail]".  Like any IP device, ACE needs a route to the destination and the MAC address of the next hop to forward its packets.

If either the route or the MAC address is missing, the connection/packet is dropped and the appropriate counter is incremented.
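
When you capture this output regularly, a small script can point out which drop counters are non-zero. The Python sketch below is an assumption-laden illustration: the function names are made up, and it only assumes that each counter line is a name followed by two integer columns, as in the output above. It does not interpret what the two columns mean.

```python
import re

# One counter line: a name (optionally ending in ':') and two integer columns.
COUNTER_RE = re.compile(r"^(?P<name>.+?):?\s+(?P<first>\d+)\s+(?P<second>\d+)\s*$")


def parse_me_stats(text):
    """Return {counter name: (first column, second column)}."""
    counters = {}
    for line in text.splitlines():
        match = COUNTER_RE.match(line.strip())
        if match:  # headers, separators and blank lines do not match
            counters[match["name"]] = (int(match["first"]), int(match["second"]))
    return counters


def nonzero_drops(counters):
    """Keep only the 'Drop ...' counters where at least one column is non-zero."""
    return {name: values for name, values in counters.items()
            if name.startswith("Drop") and any(values)}


# A few lines copied from the output above.
SAMPLE = """\
Connection create received:                    5695             3
Drop [out of proxies]:                            0             0
Drop [mac lookup fail]:                          41             0
Drop [route lookup fail]:                      5655             3
Drop [nat fail]                                   0             0
"""

drops = nonzero_drops(parse_me_stats(SAMPLE))
for name, (first, second) in drops.items():
    print(name, first, second)
```

On the sample lines, only the MAC lookup and route lookup failures are reported, which matches what you would spot by eye in the full output.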

As mentioned in the introduction, OCM is responsible for all NAT.  That is why you see all the NAT allocation and NAT free counters here.

You also have "NAT Pool Alloc [fail]", which counts the number of times ACE could not find a free address or free port to perform NAT/PAT.

Typically, the "Drop [nat fail]" counter increments each time there is a NAT allocation failure.

But it also counts all other internal errors.  If you do not see the allocation failure counter incrementing but still have NAT failures, you may want to reboot the box, as the cause is probably some config corruption or a memory leak.
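
The reasoning above can be written down as a simple check on how much each counter grew between two captures. This is a minimal Python sketch: the function name and the returned labels are assumptions made for illustration, not ACE terminology.

```python
def classify_nat_drops(nat_fail_delta: int, pool_alloc_fail_delta: int) -> str:
    """Classify NAT drops from the increase of two counters between captures.

    nat_fail_delta:        increase of "Drop [nat fail]"
    pool_alloc_fail_delta: increase of "NAT Pool Alloc [fail]"
    """
    if nat_fail_delta <= 0:
        return "no NAT drops"
    if pool_alloc_fail_delta > 0:
        # Allocation failures explain (at least part of) the drops.
        return "NAT pool exhaustion"
    # Drops without allocation failures: the "other internal errors" case.
    return "internal error suspected"


print(classify_nat_drops(nat_fail_delta=12, pool_alloc_fail_delta=12))
print(classify_nat_drops(nat_fail_delta=12, pool_alloc_fail_delta=0))
```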

The "Drop [ip sanity check fail]" counter keeps track of how many times we dropped a packet because its source and destination IP addresses were identical after performing NAT.

The "Drop [redundant connection]" counter is interesting.  You will remember that ICM is called to handle new connections.  It identifies the layer 3 rule and decides which path the packet should follow inside the box.  ICM is responsible for setting up the inbound flow, which saves all this information.  OCM is called just before the packet is sent out.  By then it has all the information about the next hop, NAT, and so on.  Therefore OCM needs to create the outbound flow associated with the connection (see the example below, where conn id 476920 is the outbound flow created by OCM).

476919     1  in  TCP   20   27.1.2.34:37992      10.10.10.10:448       ESTAB
476920     1  out TCP   40   10.10.10.10:448      192.168.40.10:37992   ESTAB

If a flow already exists with exactly the same tuple (vlan, proto, ip src, ip dst, port src, port dst), there is a collision and the new flow can't be set up.

The whole connection is dropped.

If this counter is incrementing in your network, the problem is typically related to UDP traffic.

UDP is connectionless, so ACE needs to use an idle timeout to decide when to remove a flow.

If the timeout is too high, you risk running out of connections and dropping new ones.

If the timeout is too low, you can end up in the following situation.  The client sends a packet, and the two flows associated with the connection are created so that ACE knows what to do with the server response.

202  in  UDP   20   10.0.0.100:37992      192.168.1.10:53       ESTAB
203  out UDP   40   172.16.1.50:53        10.0.0.100:37992      ESTAB

If the connection has timed out when the response from the server comes in, the server packet is seen by ACE as a new connection.  It will most probably be routed to the client and a new connection will be created (see below).

210  in  UDP   40   172.16.1.50:53        10.0.0.100:37992       ESTAB
211  out UDP   20   10.0.0.100:37992      172.16.1.50:53         ESTAB

You can quickly detect the problem.  If vlan 40 is your server vlan, you should not see an inbound ("in") connection associated with that vlan (except for valid connections initiated by the server, for example to a database server or a DNS server).

So when the client 10.0.0.100:37992 comes back with a packet for the VIP 192.168.1.10:53, ICM will be able to create the inbound flow #252:

210  in  UDP   40   172.16.1.50:53        10.0.0.100:37992       ESTAB
211  out UDP   20   10.0.0.100:37992      172.16.1.50:53         ESTAB

252  in  UDP   20    10.0.0.100:37992      192.168.1.10:53       ESTAB

But when OCM attempts to set up the outbound flow:

253  out UDP   40    172.16.1.50:53        10.0.0.100:37992      ESTAB

it will collide with flow #210.  Therefore, flows 252 and 253 get dropped.
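
The collision can be reproduced with a toy model of a flow table keyed on the full tuple. This Python sketch is a simplification written for illustration: the class and method names are assumptions, and it says nothing about how ACE stores flows internally. It only replays the flow entries from the example.

```python
from typing import Dict, NamedTuple


class FlowKey(NamedTuple):
    vlan: int
    proto: str
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int


class FlowTable:
    """Toy flow table: one entry per exact tuple."""

    def __init__(self) -> None:
        self.flows: Dict[FlowKey, int] = {}  # tuple -> flow id
        self.redundant_drops = 0

    def create_connection(self, in_id: int, in_key: FlowKey,
                          out_id: int, out_key: FlowKey) -> bool:
        """Set up the inbound flow, then the outbound flow.

        If the outbound tuple already exists, the whole connection is
        dropped: the inbound flow is removed and the counter incremented.
        """
        if in_key in self.flows:
            return False  # would match an existing flow; not modelled here
        self.flows[in_key] = in_id
        if out_key in self.flows:
            del self.flows[in_key]
            self.redundant_drops += 1
            return False
        self.flows[out_key] = out_id
        return True


table = FlowTable()

# Late server response, seen as a new connection: flows 210 and 211.
late = table.create_connection(
    210, FlowKey(40, "UDP", "172.16.1.50", 53, "10.0.0.100", 37992),
    211, FlowKey(20, "UDP", "10.0.0.100", 37992, "172.16.1.50", 53))

# Client sends again to the VIP: 252 is created, 253 collides with 210.
retry = table.create_connection(
    252, FlowKey(20, "UDP", "10.0.0.100", 37992, "192.168.1.10", 53),
    253, FlowKey(40, "UDP", "172.16.1.50", 53, "10.0.0.100", 37992))

print(late, retry, table.redundant_drops)
```

After the second call only flows 210 and 211 remain in the table, which is the state that keeps blocking the client until those flows time out.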

The solution is to adjust the idle timeout so that connections do not time out too early.

But that does not guarantee that no server response will ever arrive after the timeout.

Another solution is to configure an ACL on the server vlan to block all incoming traffic.

This ACL will only be applied to new connections.  So server traffic sent in response to a client request will not be blocked by this ACL.

Finally, the counter "Drop [Next Hop queue full]" indicates that the input queue of the next module is full, which prevents the packet from following its path within the box.

This is typically a sign of an overloaded platform and a performance issue.

Next time I will look into SSL.

Regards.

Gilles Dufour

1 Comment
Beginner

great set of articles so far Gilles, thank you for sharing some of this knowledge! Cheers
