12-04-2015 05:09 AM - edited 03-08-2019 02:57 AM
Hi Guys,
I have been investigating an issue for a month now and still can't figure out what the issue is.
Basically, we have clients connecting to a server on port 80, but 1% of the connections fail due to a socket error. The packets traverse the following path:
Client - firewall - ipsec tunnel - firewall - loadbalancer - server
When we run a tcpdump between the firewall and the loadbalancer, we can see the successful connections, which follow the normal 3-way handshake as expected. When we look at the failed transactions, however, we see the following:
SYN SEQ(lets say 1000000)
[TCP ACKed unseen segment]SYN,ACK SEQ(totally random number 124124) ACK(totally random number + 1 - 124125)
[TCP Spurious Retransmission] SYN SEQ(1000000)
[TCP ACKed unseen segment]SYN,ACK SEQ(124124) ACK(124125)
[TCP Spurious Retransmission] SYN SEQ(1000000)
And so on until the packet is dropped. So it seems either the LB or the server is sending the wrong SYN,ACK SEQ number? I checked previous TCP streams and also filtered on this wrong SEQ number and can't find it anywhere in my capture, so it's not something that was stuck in a previous segment. I can also confirm I have no packets missing from my Wireshark capture.
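To make the mismatch concrete, here is a minimal Python sketch of the check I am doing by eye: a valid SYN,ACK should acknowledge the client's SYN sequence number plus one. The `Segment` record and `find_bad_synacks` helper are hypothetical names for illustration, and the sketch assumes raw (not Wireshark-relative) sequence numbers that have already been extracted from the capture.

```python
from typing import Dict, List, NamedTuple, Tuple


class Segment(NamedTuple):
    """One TCP segment as read from a capture (hypothetical record layout)."""
    src: str    # sender as "ip:port"
    dst: str    # receiver as "ip:port"
    flags: str  # "S" for SYN, "SA" for SYN,ACK
    seq: int    # raw sequence number
    ack: int    # raw acknowledgement number


def find_bad_synacks(segments: List[Segment]) -> List[Tuple[Segment, int]]:
    """Return (segment, expected_ack) for every SYN,ACK that does not
    acknowledge the most recent SYN seen in the opposite direction."""
    last_syn: Dict[Tuple[str, str], int] = {}
    bad: List[Tuple[Segment, int]] = []
    for seg in segments:
        if seg.flags == "S":
            # Remember the client's initial sequence number per direction.
            last_syn[(seg.src, seg.dst)] = seg.seq
        elif seg.flags == "SA":
            syn_seq = last_syn.get((seg.dst, seg.src))
            if syn_seq is None:
                continue  # no matching SYN in the capture, cannot judge
            expected = (syn_seq + 1) % 2**32  # sequence numbers wrap at 2^32
            if seg.ack != expected:
                bad.append((seg, expected))
    return bad


# Illustrative values taken from the failed exchange above, plus one
# made-up healthy handshake for comparison.
trace = [
    Segment("client:40000", "server:80", "S", 1000000, 0),
    Segment("server:80", "client:40000", "SA", 124124, 124125),
    Segment("client:40001", "server:80", "S", 500, 0),
    Segment("server:80", "client:40001", "SA", 9000, 501),
]

for seg, expected in find_bad_synacks(trace):
    print(f"bad SYN,ACK from {seg.src}: ack={seg.ack}, expected {expected}")
```

In the failed exchange the ACK is the responder's own SEQ plus one rather than the client's, which is exactly what this check flags.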
The issue is totally random during the day and not a load or busy period issue.
From what I have read online, the LB might be involved in this 3-way handshake, even though one would expect it to take place between the server and the client.
I can't do a capture between the server and the LB because the server is a VM on a host attached to a different switch that I don't have access to at the moment. Also, the LB logs don't show any issues.
12-23-2015 05:55 PM
Greetings,
The LB can be configured to act as a proxy for TCP connections, so that clients complete the 3-way handshake with the LB and the LB then connects to the server over a different socket. (This can even be an already open socket being reused.)
You could check the LB configuration but it's hard to misconfigure a single server for TCP proxy and not the others if they are in the same server farm.
Do you have a lot of servers behind the LB for this service? It may be that only one specific server has the issue, and only some of the time. This could explain the 1%.
If you have permission and sufficient capacity you can disable servers one at a time on the LB to see if the issue remains or not.
If it does, then I'd focus more on the LB. If it goes away when a certain server is disabled, then look at that specific server.
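As a rough sanity check on the single-server idea, here is a small Python sketch of the arithmetic. It assumes the LB spreads connections evenly across the farm, which may not match your balancing method, and the function name and the server count of 10 are made up for illustration.

```python
def required_server_failure_rate(n_servers: int, overall_rate: float) -> float:
    """Failure rate one bad server would need in order to produce
    `overall_rate` across the whole farm, assuming even distribution
    and that every other server is healthy."""
    if n_servers < 1:
        raise ValueError("n_servers must be at least 1")
    # The bad server only sees 1/n of the connections, so its own
    # failure rate must be n times the farm-wide rate.
    return overall_rate * n_servers


# Hypothetical farm of 10 servers with the 1% overall failure rate reported.
rate = required_server_failure_rate(10, 0.01)
print(f"one bad server would be failing about {rate:.0%} of its connections")
```

If the rate this gives for your real server count exceeds 100%, a single server cannot be the whole story.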
Hope this helps.
JF