I am running a number of sites with 4 circuits between them using load balance by packet.
They work well as far as balance goes, but I have had issues with file transfer throughput being very low. During testing I can actually get better throughput by shutting down 2 of the circuits.
There are no errors anyplace and all circuits are exactly the same as far as speed and latency.
Packet captures from both client machines show a huge number of packet retransmissions. There is no actual loss of any packet in the captures, so these are not really valid retransmissions.
From what I can tell it is due to the packets arriving out of order. This causes many ACKs to be sent for the same sequence number, which the far end interprets as loss, so it retransmits. From reading the RFC it appears that 3 duplicate ACKs is the magic number to cause a retransmission.
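To make the mechanism concrete, here is a minimal sketch of how reordering alone can produce a retransmission. It is a toy model, not a real TCP stack: segments are numbered 0, 1, 2, ... instead of by byte offset, the receiver sends one cumulative ACK per arriving segment, and the function names and the arrival orders are made up for illustration. The threshold of 3 duplicate ACKs is the one described above.

```python
def cumulative_acks(arrival_order):
    """Return the cumulative ACK the receiver sends after each arriving segment.

    The ACK value is the next segment number the receiver is still waiting for.
    """
    received = set()
    next_expected = 0
    acks = []
    for seq in arrival_order:
        received.add(seq)
        # Advance past every segment that is now contiguous.
        while next_expected in received:
            next_expected += 1
        acks.append(next_expected)
    return acks


def fast_retransmits(acks, dupack_threshold=3):
    """Return the segment numbers the sender would retransmit.

    A retransmission fires when the same ACK value has been repeated
    dupack_threshold times after its first appearance.
    """
    retransmitted = []
    last_ack = None
    dup_count = 0
    for ack in acks:
        if ack == last_ack:
            dup_count += 1
            if dup_count == dupack_threshold:
                retransmitted.append(ack)
        else:
            last_ack = ack
            dup_count = 0
    return retransmitted


# Hypothetical arrival order: segment 1 is overtaken by three later segments,
# which is plausible when packets are sprayed across 4 circuits.
badly_reordered = [0, 2, 3, 4, 1, 5]
# Hypothetical arrival order: segment 1 is overtaken by only one segment.
mildly_reordered = [0, 2, 1, 3, 4, 5]

print(fast_retransmits(cumulative_acks(badly_reordered)))
print(fast_retransmits(cumulative_acks(mildly_reordered)))
```

In the first order nothing is lost, yet segment 1 is retransmitted because three later segments arrive ahead of it. In the second order the single duplicate ACK stays below the threshold and nothing is retransmitted.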
I know I can use multilink PPP to solve the out-of-order issue, but it raises the knowledge level needed to support it. I have a few NOC guys that will not understand that circuits can be down but still appear to ping.
Any suggestions? I have been looking to see if there is any parameter in the TCP stack that would affect this, but this appears to be very fundamental to TCP.
Yes, the definition of IP in itself doesn't guarantee that packet order is preserved, but the first distinction we must make is that a 'good and proper' IP network _does_ guarantee that, along with service parameters like packet loss, latency, etc.
In that we differ: to me, an IP network that delivers packets out of order is not just broken, it's seriously broken.
This is the goal we network engineers strive for, and it is accomplished using good devices, good circuits, and best design practices. Considering the large amount of money that a proper network costs, it's reasonable to expect that we deliver packets in order to our customers.
In other words, I'm not interested at all in exploring nice properties of TCP or other "applications". I know how the 'good and proper' IP network has to behave, and I deliver that, or nothing at all (I leave the business to somebody else).
To answer your note about TCP internals, I concede that out-of-order packets can cause duplicate ACKs; even more, I concede that they cause mess and havoc.
The thing is that, in all honesty, I think that once you begin messing around with that, you're only doing more damage, even if there is a possibility that an obscure parameter can bring some improvement on this or that operating system.
That is something nice to know for your studies, but totally inadequate for enterprise and day-to-day operation.
Apparently we do differ somewhat in philosophy. I guess I am the "somebody else" who delivers real results, for "impossible" situations, in the real world. (Likely also due to the fact I'm not a network engineer either.)
As an example, about 7 or 8 years ago I was working with a customer doing huge database replications across the "pond". They upgraded from multiple T1/E1s to dual T3/E3s. However, the replication transfers were not taking advantage of the additional bandwidth. The cry went out, there's something wrong with the network, fix it!
One requirement made the transfer rate critical: the transfer had to be completed within a set time window. (In this case, between end of business Friday night and start of business Monday morning.)
The network support group validated there wasn't anything wrong. No packet loss, expected latency, not even duplicate ACKs.
From my "studies", I suspected this was the classic LFN (long fat network) TCP issue. At the network level, all you can do to fix this problem is provide a trans-Atlantic WAN with typical LAN latencies. The physics of electrical and/or optical propagation speed makes that impossible to accomplish.
I suggested multiple ways to increase the transfer rate. One was to increase the receive window size on the receiving Windows NT server. They, like you, weren't keen on changing some "obscure parameter" within the OS. They also questioned whether it would provide any improvement.
Since they hadn't found any other solution, and were required to find a solution, and this was the least complex possible solution, they tried it. They were surprised and happy when the transfer rate literally increased 5x.
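The reason a larger receive window helps on a long fat network can be sketched with a little arithmetic: a single TCP connection can have at most one window of data in flight per round trip, so its throughput is capped at window / RTT regardless of link speed. The snippet below is only an illustration. The 80 ms round-trip time and the window sizes are assumptions, not figures from the case above; 44.736 Mbit/s is the nominal T3 line rate.

```python
def max_throughput_bps(window_bytes, rtt_seconds):
    """Upper bound on one TCP connection's throughput: one window per round trip."""
    return window_bytes * 8 / rtt_seconds


def window_needed_bytes(link_bps, rtt_seconds):
    """Bandwidth-delay product: the window needed to keep the link full."""
    return link_bps * rtt_seconds / 8


T3_BPS = 44_736_000          # nominal T3 line rate
ASSUMED_RTT = 0.080          # assumed trans-Atlantic round trip, in seconds

# Illustrative window sizes (hypothetical, not taken from the original case).
for window in (16_384, 65_535, 5 * 65_535):
    mbps = max_throughput_bps(window, ASSUMED_RTT) / 1e6
    print(f"window {window:>7} bytes -> at most {mbps:.2f} Mbit/s")

needed = window_needed_bytes(T3_BPS, ASSUMED_RTT)
print(f"window needed to fill one T3 at this RTT: {needed:.0f} bytes")
```

Under these assumptions a 65,535-byte window caps a single connection at roughly 6.5 Mbit/s, far below a T3, and the cap scales linearly with the window. That is consistent with a transfer that ignores added bandwidth until the window is raised.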
Where we might agree: this solution was an EXCEPTION to the norm. For instance, it was done on just one server.
Today, there are additional ways to solve the above problem, such as using a Vista host or some type of WAAS device, but still none of them really involve the network itself providing perfect packet sequencing, no loss, ideal latency, etc.
For the poster's problem, I too still prefer avoiding the out-of-order issue, but believe all effective solutions should be on the table, though with a correct analysis of their pros and cons. Even for the suggested MLPPP, the possible performance impact on the end-link routers should be considered.