We run four Oracle Application Server instances that serve JSP/JS pages. They are load balanced, and stickiness is maintained using arrowpoint-cookie.
We get intermittent '408 -' errors in the server access log. Analysing with tcpdump, we found that each '408 -' error appears exactly 5 minutes after the GET request reaches the server. The request may be a simple GET for a .js or .svg file. 5 minutes is the Timeout value [not the keepalive timeout] set in the httpd.conf of the application server.
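The timing check described above can be sketched in Python. This is only an illustration: the timestamp format, the 2-second tolerance, and the sample values are assumptions, not data from our capture.

```python
from datetime import datetime

# Timeout directive from httpd.conf in seconds (5 minutes, as stated above).
TIMEOUT_SECONDS = 300

def parse_ts(text):
    """Parse a timestamp such as '2009-03-12 10:15:32' (format is an assumption)."""
    return datetime.strptime(text, "%Y-%m-%d %H:%M:%S")

def matches_timeout(get_seen, error_logged, tolerance_seconds=2):
    """Return True if the 408 was logged about TIMEOUT_SECONDS after the GET arrived."""
    gap = (parse_ts(error_logged) - parse_ts(get_seen)).total_seconds()
    return abs(gap - TIMEOUT_SECONDS) <= tolerance_seconds

# Hypothetical sample values: GET time from tcpdump, 408 time from the access log.
print(matches_timeout("2009-03-12 10:15:32", "2009-03-12 10:20:32"))
print(matches_timeout("2009-03-12 10:15:32", "2009-03-12 10:15:47"))
```

A gap close to 300 seconds points at the Timeout directive, while a gap close to 15 seconds would point at the keepalive timeout instead.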
Is it possible that the CSS deletes the flow [due to its garbage collection] while the server is still using that flow to send data back?
Keepalive timeouts on our web/app servers are currently set to 15 seconds.
These 408 errors are currently causing intermittent timeout/hang issues for the users.
An interesting point is that these 408 errors initially appeared on only one of the servers, serverA. Thinking that the server had a problem, we shut it down, and after almost 20 days we found that the 408 errors had shifted to a different server, serverB.
We brought up serverA after a few days but the 408 errors remained on the same server - serverB.
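To track how the errors move between servers, a tally like the following could be run over the access logs. It is a rough sketch that assumes Common Log Format lines; the sample lines, file paths, and IP addresses are made up for illustration.

```python
import re
from collections import Counter

# Assumed Common Log Format, e.g.:
# 10.1.2.3 - - [12/Mar/2009:10:20:32 +0000] "GET /app/menu.js HTTP/1.1" 408 -
LOG_LINE = re.compile(
    r'\[(?P<day>[^:]+):[^\]]+\] "(?P<request>[^"]*)" (?P<status>\d{3}) '
)

def count_408(lines_by_server):
    """Count 408 responses per (server, day) from access log lines."""
    counts = Counter()
    for server, lines in lines_by_server.items():
        for line in lines:
            match = LOG_LINE.search(line)
            if match and match.group("status") == "408":
                counts[(server, match.group("day"))] += 1
    return counts

# Hypothetical sample lines, not copied from our logs.
sample = {
    "serverA": [
        '10.1.2.3 - - [12/Mar/2009:10:20:32 +0000] "GET /app/menu.js HTTP/1.1" 408 -',
        '10.1.2.4 - - [12/Mar/2009:10:21:01 +0000] "GET /app/logo.svg HTTP/1.1" 200 512',
    ],
    "serverB": [
        '10.1.2.5 - - [12/Mar/2009:10:22:10 +0000] "GET /app/menu.js HTTP/1.1" 200 2048',
    ],
}
for key, total in sorted(count_408(sample).items()):
    print(key, total)
```

Running this per day across all four servers would show whether the 408s really sit on one server at a time or are only more frequent there.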
We compared just about everything among these servers, starting with TCP parameters, network statistics, config files, the Application Server installation, etc., but could not find anything conclusive.
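For the config file comparison, a small script along these lines could flag timeout-related directives that differ between servers. It is a simplified sketch: it only looks at a handful of standard Apache directives, ignores included files and per-VirtualHost scoping, and the sample config text is invented.

```python
# Standard Apache directives related to timeouts and keepalive.
DIRECTIVES = ("Timeout", "KeepAlive", "KeepAliveTimeout", "MaxKeepAliveRequests")
# Directive names are case-insensitive in httpd.conf, so match on lowercase.
LOOKUP = {name.lower(): name for name in DIRECTIVES}

def read_directives(conf_text):
    """Extract the directives of interest from httpd.conf text."""
    values = {}
    for raw in conf_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        parts = line.split(None, 1)
        if len(parts) == 2 and parts[0].lower() in LOOKUP:
            values[LOOKUP[parts[0].lower()]] = parts[1].strip()
    return values

def differing(confs):
    """Return directives whose values are not identical on every server."""
    parsed = {name: read_directives(text) for name, text in confs.items()}
    diffs = {}
    for directive in DIRECTIVES:
        seen = {name: values.get(directive) for name, values in parsed.items()}
        if len(set(seen.values())) > 1:
            diffs[directive] = seen
    return diffs

# Hypothetical config snippets for two servers.
confs = {
    "serverA": "Timeout 300\nKeepAliveTimeout 15\n",
    "serverB": "# tuned by ops\nTimeout 300\nkeepalivetimeout 5\n",
}
print(differing(confs))
```

An empty result would support what we saw by hand, namely that the servers are configured identically for these settings.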
We were able to reproduce the 408 error by pulling the network cable at the PC end just after the GET request had reached the app server, which led us to think that the CSS must be deleting the flows before they are formally closed.
Since these errors happen on only one of the servers at a time, it is a bit confusing.
Please share your thoughts.