NTLM through the CSS not always working...

Ken Stieers · ‎12-28-2010

Good afternoon,

We're using a CSS in a one-armed config, so everything is in a destination group. We have a set of web servers behind it, and we use the CSS to split traffic based on URL (e.g. /* goes to server1 and server2, /Sites* goes to server3 and server4). All four are running IIS, we do have "http-method parse RFC-2518-methods" set, and we are using NTLM.

That generally works fine, but we see two bits of strange behavior:

1. Sometimes traffic that should go to server3 and server4 goes to server1 or server2.

2. Sometimes we get prompted for login info, even though it should flow through.

We also see problem number 2 when pointed at a content rule that has just one service added to it...

Any pointers?

Thanks,

Ken

jsirstin · ‎01-14-2011

Ken,

Are these long-lived TCP connections, or does the client open several different TCP connections for this app? The reason I ask is that problem 1 sounds like a possible garbage collection issue.

Take a look at this snippet from the link below.

http://www.cisco.com/en/US/products/hw/contnetw/ps789/products_qanda_item09186a00801cb75b.shtml#q9

Q. How is flow garbage collected?

A. The flow manager has a timer task that wakes up every second. This timer task performs garbage collection in an interval that depends on the total number of flows on either a single port or the session processor as a whole. By default, garbage collection will run every eight seconds. Each time the garbage collection runs, it looks at a number of slots in a hash table of mapped flows. Each flow is checked to see if it is older than a certain number of seconds, which varies depending on protocol. If the flow is older than a given number of seconds, the fastpath is asked if there has been any activity on the flow within a protocol dependent number of seconds. Thus, you would expect to allow more "dead" time for a chat flow than an NFS flow. If the fastpath either cannot find the flow asked about or the flow has been idle for longer than the specified number of seconds, the flow is torn down by the fastpath.
While we allow for a protocol number of seconds for a flow to be idle, the actual time that it takes to garbage collect a dead flow is that number of seconds plus the time it takes to get through the flow map. This could be as long as four extra minutes, or as short as exactly the timeout interval for a flow of a given protocol type. It is not strictly deterministic because of the way that the collection algorithm works.
As the load increases on a given port or the session processor, the number of seconds between garbage collection intervals goes down first to four then to every two seconds. The number of slots looked at also doubles each time you pass a load threshold. The busier you get, the more cleanup efforts are made. Usually, this works pretty well to maintain resources.
If you ever run short of buffers, immediately try to garbage collect, and also take any buffers queued to a flow (only relevant for content aware processing) and return them to the system buffer pool. This will cause a retransmission for the client but will maintain the system integrity. This is reviewed as a minor penalty. This also is very rare.
For example, if the flow is set up in a hash bin that the garbage collector gets to just after the default timeout interval, and there has been no activity within that timeout interval, the flow will be removed. The flow is torn down via garbage collection in as little as 15 seconds.
When a flow is removed, it is placed at the end of the free list. This list contains all allocated flows (used or not) in the system. The flow remains in a list that is hashed via network tuples so that you can forward any frames until such time as the flow is reallocated off the front of the free list. Under these circumstances, you may see a connection be destroyed in as little as 15 + ( (total allocated FCBs / 2) / sustained flow rate per second) seconds. The flow gets torn down rapidly, however, it is around long enough to do useful work till it gets reallocated from the head of the free list.
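As a rough illustration of the last formula in the quoted answer, here is a small Python sketch. The function name and the sample numbers (100,000 allocated FCBs, 1,000 new flows per second) are made up for the example and are not measurements from any CSS; only the formula itself comes from the text above.

```python
def min_teardown_seconds(total_allocated_fcbs, flows_per_second, base_timeout=15.0):
    """Estimate how soon an idle connection can be destroyed, per the quoted formula:

        base_timeout + ((total allocated FCBs / 2) / sustained flow rate per second)

    base_timeout defaults to the 15 seconds mentioned in the quoted answer.
    """
    if flows_per_second <= 0:
        raise ValueError("flows_per_second must be positive")
    return base_timeout + (total_allocated_fcbs / 2) / flows_per_second


# Hypothetical load: 100,000 allocated FCBs, 1,000 new flows per second.
print(min_teardown_seconds(100_000, 1_000))  # 65.0
```

The busier the box (higher sustained flow rate), the sooner a collected flow is reallocated off the free list, so the estimate shrinks toward the base timeout.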

When the flow is in the free list, we do not look to see which rule is the better match; the request gets sent to whatever server was last used when the flow was garbage collected. The snippet above mentions 15 seconds, but for HTTP traffic this is really 8 seconds of idle time before we garbage collect. You can try increasing the time the CSS takes to garbage collect a flow to see if this helps. It may also be part of the cause of the second issue.

The command is "flow-timeout-multiplier x", where x is multiplied by 16 seconds to get the idle timeout, so "flow-timeout-multiplier 4" tells the CSS not to garbage collect the flow until 64 seconds (just over 1 minute) of idle time is seen. Add this command to every content rule that is seeing this issue. Give this a shot to see if it helps.
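If it helps to pick a value, the arithmetic can be sketched in a few lines of Python. This only assumes the 16-second unit described above; the function names are invented for the example and nothing here talks to the CSS.

```python
import math

BASE_SECONDS = 16  # unit that flow-timeout-multiplier is applied to, per this post


def idle_seconds(multiplier):
    """Idle time, in seconds, before the flow becomes eligible for garbage collection."""
    return multiplier * BASE_SECONDS


def min_multiplier(desired_idle_seconds):
    """Smallest multiplier (at least 1) whose idle time covers the desired number of seconds."""
    return max(1, math.ceil(desired_idle_seconds / BASE_SECONDS))


print(idle_seconds(4))      # 64
print(min_multiplier(300))  # 19, i.e. 304 seconds of idle time
```

So to ride out, say, five minutes of idle time between requests on the same connection, the sketch suggests a multiplier of 19 on the affected content rules.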

Regards

Jim