08-24-2016 11:18 AM - edited 03-05-2019 04:33 AM
Hello,
Here is an issue that has me stumped; any insight would be very helpful.
For no apparent reason, two of our corporate campus locations started experiencing intermittent issues with multiple applications. These two locations use an MTU of 4470 on their WAN routers (mirrored on the Data Center routers). This configuration had been working fine for about two years. Again, only the sites with an MTU of 4470 were affected. All other sites use the standard MTU of 1500 and are configured with ip tcp adjust-mss 1380, and none of those sites with the lowered MSS value (roughly 300) experienced any issues. Adding ip tcp adjust-mss 1380 to the affected sites immediately resolved the issues. We have validated that the MTU values have not changed on the ISP side, and there are 8 circuits between the affected sites and the Data Centers, each through a different POP.
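The MSS arithmetic behind these two configurations can be sketched as follows. A host normally advertises an MSS equal to its outgoing interface MTU minus 40 bytes (a 20-byte IPv4 header plus a 20-byte TCP header, no options), and an adjust-mss style clamp can only lower that value. This is a minimal sketch that assumes the end hosts sit on ordinary 1500-byte Ethernet LANs, which the post does not state; the function names and scenarios are illustrative only.

```python
from typing import Optional

IPV4_HEADER = 20  # bytes, assuming no IP options
TCP_HEADER = 20   # bytes, assuming no TCP options


def natural_mss(mtu: int) -> int:
    """MSS a host advertises for its outgoing interface MTU."""
    return mtu - IPV4_HEADER - TCP_HEADER


def clamped_mss(advertised: int, clamp: Optional[int]) -> int:
    """MSS left in a SYN after an adjust-mss style clamp (None means no clamp).

    The clamp only lowers the value; a smaller advertisement passes through.
    """
    return advertised if clamp is None else min(advertised, clamp)


def max_ip_packet(mss: int) -> int:
    """Largest IPv4 packet a sender builds while honoring this MSS."""
    return mss + IPV4_HEADER + TCP_HEADER


if __name__ == "__main__":
    # Hypothetical scenarios: host MTUs are assumptions, not taken from the post.
    scenarios = [
        ("1500 LAN host, adjust-mss 1380", 1500, 1380),
        ("1500 LAN host, no clamp", 1500, None),
        ("4470 jumbo-frame host, no clamp", 4470, None),
    ]
    for label, host_mtu, clamp in scenarios:
        mss = clamped_mss(natural_mss(host_mtu), clamp)
        print(f"{label}: MSS {mss}, max packet {max_ip_packet(mss)} bytes")
```

Under the 1500-byte LAN assumption, the 4470-byte WAN MTU never enters the hosts' MSS calculation: both ends still advertise 1460, so unclamped TCP traffic tops out at 1500-byte packets, while a clamp of 1380 caps them at 1420 bytes.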
So the question is: what could have caused this, assuming there isn't an issue in the provider's MPLS network (which we are still looking into)?
08-24-2016 11:18 PM
Hi,
Has there been any recent change (reconfiguration, new switch) in the Data Center LAN?
If you are 100% sure the WAN is OK, could the LAN be causing the issue?
Best regards,
Milan
08-26-2016 07:05 PM
There have been no WAN changes, and we have spent the past few days verifying that the WAN is not the issue. This is a very odd problem. We have roughly 300 sites, all with ip tcp adjust-mss configured, and none of them were affected. The two affected sites did not have this configuration until we added it to resolve the issue, and they are configured with an MTU of 4470 on their WAN routers. Also worth noting: the Data Center location has an attached corporate campus with roughly 4,000 employees who were not impacted, which would seem to rule out something in the Data Center. So the question is, what would cause this issue if not the WAN? It's clearly related to packet size/MTU. This has got to be one of the weirdest issues I have ever seen, and I can't help but think I am missing something...
08-27-2016 02:51 AM
Hi,
Reading your original post once more:
"...We have validated that the MTU values have not changed on the ISP side,"
Does that mean you are able to send non-fragmented packets of 4470 bytes between the site router and your DC router?
I can imagine that something changed in the provider backbone (or at its Layer 1 supplier, for example) without all of the provider's teams noticing...
BR,
Milan
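The question above about non-fragmented 4470-byte packets is usually answered with DF-bit probes, and one detail matters when repeating such tests from different devices: Cisco IOS ping counts its size as the whole IPv4 datagram, whereas Linux iputils `ping -s` takes the ICMP data length, which is 28 bytes smaller (20-byte IPv4 header plus 8-byte ICMP header). The sketch below converts between the two and builds one DF-bit probe per sweep step. It only constructs command lines; the target 192.0.2.1 is a documentation placeholder, and the helper names are hypothetical.

```python
from typing import List

IPV4_HEADER = 20  # bytes, assuming no IP options
ICMP_HEADER = 8   # bytes, ICMP echo request header


def icmp_data_len(datagram_size: int) -> int:
    """Convert a total IPv4 datagram size (as Cisco IOS ping counts it)
    into the ICMP data length that Linux 'ping -s' expects."""
    data = datagram_size - IPV4_HEADER - ICMP_HEADER
    if data < 0:
        raise ValueError(f"{datagram_size} bytes cannot hold the IPv4 and ICMP headers")
    return data


def sweep_sizes(start: int, stop: int, step: int) -> List[int]:
    """Datagram sizes from start to stop, always including stop itself."""
    if step <= 0 or start > stop:
        raise ValueError("need step > 0 and start <= stop")
    sizes = list(range(start, stop + 1, step))
    if sizes[-1] != stop:
        sizes.append(stop)
    return sizes


def linux_df_ping(host: str, datagram_size: int) -> List[str]:
    """argv for a single iputils ping probe with fragmentation prohibited (DF set)."""
    return ["ping", "-M", "do", "-c", "1", "-s", str(icmp_data_len(datagram_size)), host]


if __name__ == "__main__":
    host = "192.0.2.1"  # placeholder (TEST-NET-1), not a real DC router
    for size in sweep_sizes(36, 4470, 1000):
        print(" ".join(linux_df_ping(host, size)))
```

Such probes only prove anything about the WAN when they are sent from a host whose own interface MTU is at least the probe size; with `-M do`, a larger probe is refused locally before it ever leaves the host.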
08-27-2016 05:05 AM
That is correct. We have sent literally millions of ICMP packets across all of the circuits between the affected sites and the Data Center, sized at 4470 bytes with the DF bit set. We have also run ping sweeps from 36 to 4470 bytes. At no time were any packets lost. Since ip tcp adjust-mss takes effect during the TCP three-way handshake (it rewrites the MSS value advertised in the SYN packets, which caps the size of the segments that follow), we are looking into the remote possibility that something is wrong in the Data Center. Even though that would seemingly also affect the users on the campus that is LAN-connected to the Data Center, I have a theory about this: if the issue is in the Data Center, perhaps the local users are unaffected because the applications eventually negotiate a smaller window size, and since this happens at LAN speed, the issue isn't noticeable. The same behavior across the WAN, however, would most likely be problematic. We'll be taking some captures in the Data Center next, but for now I can't see any reason to look at the WAN anymore.
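A router applying an MSS clamp does not resize packets itself; it lowers the MSS option carried in segments with the SYN flag set, and the endpoints then build segments no larger than that value. The sketch below illustrates the general technique on a raw TCP header: it walks the options of a SYN or SYN-ACK, lowers an MSS option that exceeds the clamp, and patches the TCP checksum incrementally using the RFC 1624 formula. It is an illustration under stated assumptions (IPv4 TCP header passed in on its own, helper names hypothetical), not a description of how Cisco IOS implements ip tcp adjust-mss.

```python
import struct

TCP_FLAG_SYN = 0x02
OPT_END, OPT_NOP, OPT_MSS = 0, 1, 2


def ones_complement_add(a: int, b: int) -> int:
    """16-bit one's complement addition with end-around carry."""
    s = a + b
    return (s & 0xFFFF) + (s >> 16)


def swap16(v: int) -> int:
    return ((v & 0xFF) << 8) | (v >> 8)


def incremental_checksum(old_csum: int, old_word: int, new_word: int) -> int:
    """RFC 1624, eqn. 3: HC' = ~(~HC + ~m + m')."""
    s = ones_complement_add(~old_csum & 0xFFFF, ~old_word & 0xFFFF)
    s = ones_complement_add(s, new_word)
    return ~s & 0xFFFF


def clamp_syn_mss(tcp_header: bytes, clamp: int) -> bytes:
    """Return the header with its MSS option lowered to `clamp` if it is larger.

    Only segments with the SYN flag set (SYN and SYN-ACK) are touched.
    """
    hdr = bytearray(tcp_header)
    if len(hdr) < 20 or not hdr[13] & TCP_FLAG_SYN:
        return bytes(hdr)
    end = min((hdr[12] >> 4) * 4, len(hdr))  # data offset bounds the options
    i = 20
    while i < end:
        kind = hdr[i]
        if kind == OPT_END:
            break
        if kind == OPT_NOP:
            i += 1
            continue
        if i + 1 >= end or hdr[i + 1] < 2:
            break  # truncated or malformed option list
        length = hdr[i + 1]
        if kind == OPT_MSS and length == 4 and i + 4 <= end:
            (mss,) = struct.unpack_from("!H", hdr, i + 2)
            if mss > clamp:
                struct.pack_into("!H", hdr, i + 2, clamp)
                old_w, new_w = mss, clamp
                if (i + 2) % 2:  # value straddles two 16-bit checksum words
                    old_w, new_w = swap16(old_w), swap16(new_w)
                (csum,) = struct.unpack_from("!H", hdr, 16)
                struct.pack_into("!H", hdr, 16, incremental_checksum(csum, old_w, new_w))
            break
        i += length
    return bytes(hdr)
```

Patching the checksum incrementally avoids needing the IP pseudo-header, which is one reason this approach suits a device rewriting headers in transit; a real implementation would also handle IPv6 and options that appear after malformed ones.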