07-10-2009 01:18 AM - edited 03-12-2019 09:20 AM
The effect called "PMTUD black hole" is a failure of the TCP Path MTU Discovery due to ICMP messages "Destination Unreachable, Fragmentation needed" (Type 3, Code 4) not reaching the node that sends the TCP segments that are too large for the link with a smaller MTU within the path.
This page gives an overview of what exactly happens, and lists some points to consider in order to avoid causing it inadvertently.
This diagram represents an example network with two hosts, which we call "Client" and "Server" for reference, and two clouds, "serverside cloud" and "clientside cloud", joined by a single link whose MTU is smaller than that of all the other links in the network. Real-world situations can be more complicated and may involve multiple links with different MTUs and asymmetric traffic paths, so that the lower-MTU link is traversed in only one direction. Both clouds use links with the "standard" MTU and provide full connectivity; they also have further clients attached over links with the same "standard" MTU and full connectivity.
We assume the client host is an HTTP client (web browser) that makes a GET request to the server. Again, for simplicity of the discussion, we assume that the GET request fits into a single segment that is small enough to travel from the client to the server without exceeding the MTU of the smaller-MTU link. Our example network assumes there is no transparent HTTP caching performed anywhere.
The client successfully completes the three-way handshake with the server (the client sends a SYN segment, the server responds with a SYN-ACK, and the client responds with an ACK). The segments with the SYN flag set also carry the MSS (Maximum Segment Size) option, equal to the sending node's interface MTU minus 40 bytes, so for the common case of a 1500-byte MTU it will be 1460. Then the client sends the HTTP request, which takes, for example, 400 bytes (a realistic request size when there are no cookies). This segment would typically have the PUSH flag set to ensure the data is passed to the HTTP server application immediately, so the server does just that and immediately sends an ACK segment to confirm. This segment is also small, so it reaches the client with no difficulties, even though the IP DF ("Don't Fragment") bit is set. After processing the request, the server starts to send the HTTP reply, which in the majority of cases is bigger than the MSS, which means that at least one full-sized TCP segment with data will be sent towards the client. This segment traverses the serverside cloud without anything special happening; however, when it is time for RouterB to send the IP packet towards RouterA, it cannot. The packet is too big to transmit on the link with the smaller MTU, and RouterB cannot fragment it because the DF bit is set. So RouterB sends an ICMP "Destination Unreachable" message (ICMP Type: 3) with additional information on why the destination is unreachable (ICMP Code: 4), and includes the MTU of the interface on which it was unable to send the packet, along with the original IP header and 64 bits of the upper-layer protocol data (see RFC1191).
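The MSS arithmetic and the layout of the "Fragmentation needed" message described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, IP and TCP headers are assumed to carry no options (20 bytes each), and the quoted packet is a dummy byte string rather than a real capture. The field layout (type, code, checksum, unused, next-hop MTU, then the original IP header plus 64 bits) follows RFC1191.

```python
import struct

IP_HEADER_LEN = 20   # assumption: IPv4 header without options
TCP_HEADER_LEN = 20  # assumption: TCP header without options


def mss_for_mtu(mtu: int) -> int:
    """MSS advertised in a SYN: interface MTU minus 40 bytes of headers."""
    return mtu - IP_HEADER_LEN - TCP_HEADER_LEN


def internet_checksum(data: bytes) -> int:
    """16-bit one's-complement checksum as used by ICMP."""
    if len(data) % 2:
        data += b"\x00"
    total = sum(struct.unpack(f"!{len(data) // 2}H", data))
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def build_frag_needed(next_hop_mtu: int, original_packet: bytes) -> bytes:
    """Build an ICMP Type 3, Code 4 message quoting the dropped packet."""
    # Quote the original IP header plus 64 bits (8 bytes) of upper-layer data.
    quoted = original_packet[:IP_HEADER_LEN + 8]
    unchecked = struct.pack("!BBHHH", 3, 4, 0, 0, next_hop_mtu) + quoted
    checksum = internet_checksum(unchecked)
    return struct.pack("!BBHHH", 3, 4, checksum, 0, next_hop_mtu) + quoted


def parse_frag_needed(message: bytes):
    """Return the next-hop MTU, or None if this is not Type 3 / Code 4."""
    icmp_type, code, _checksum, _unused, mtu = struct.unpack("!BBHHH", message[:8])
    if (icmp_type, code) != (3, 4):
        return None
    return mtu


if __name__ == "__main__":
    dummy_packet = bytes(60)  # placeholder for the dropped IP packet
    msg = build_frag_needed(1400, dummy_packet)
    print(mss_for_mtu(1500), len(msg), parse_frag_needed(msg))
```

The receiving host only needs the next-hop MTU field and the quoted header, which lets it match the error to the connection that sent the oversized packet.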
The server receives this ICMP packet and reacts to it by reducing the MTU it uses for outgoing packets to this destination down to the MTU reported in the ICMP packet. It then retransmits the data in a smaller segment, which is now small enough to be transmitted onto the link between RouterA and RouterB, and which therefore reaches the client. The TCP connection then proceeds as normal, with the client receiving the data and acknowledging it; the exact details of this are out of the scope of this page.
This scenario proceeds as the previous one, up to and including the point where RouterB drops the segment with the data from the server and sends the ICMP "Destination Unreachable" packet. If this packet is consistently dropped on the way back to the server, for whatever reason, the server will keep waiting for an acknowledgment from the client, which never arrives (RouterB dropped the TCP segments because they were too big). Not seeing the acknowledgment, the server assumes the segment was lost and retransmits it, again using the exact same MTU, so the retransmission is also dropped by RouterB, and the ICMP Unreachable it sends back is again lost. From the client's point of view the server did not send any data, so the browser just displays "Loading..." in its status line.
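The difference between the working and the black-holed case can be illustrated with a toy simulation. It is a sketch under simplifying assumptions, not a model of a real TCP stack: the function name, the list of link MTUs and the retry limit are all hypothetical, and a single boolean stands in for whether the ICMP message reaches the sender.

```python
def simulate_send(packet_size, link_mtus, icmp_delivered, max_tries=5):
    """Send one DF-marked packet along a path of links.

    Returns (delivered, attempts, final_packet_size).
    """
    size = packet_size
    for attempt in range(1, max_tries + 1):
        # The first link whose MTU is smaller than the packet drops it.
        blocking_mtu = next((mtu for mtu in link_mtus if mtu < size), None)
        if blocking_mtu is None:
            return True, attempt, size
        if icmp_delivered:
            # Sender learns the next-hop MTU and shrinks its packets.
            size = blocking_mtu
        # Otherwise the sender learns nothing and retransmits the same size.
    return False, max_tries, size


if __name__ == "__main__":
    path = [1500, 1400, 1500]  # hypothetical path with one smaller-MTU link
    print(simulate_send(1500, path, icmp_delivered=True))
    print(simulate_send(1500, path, icmp_delivered=False))
```

With the ICMP message delivered, the second attempt succeeds at the reduced size; without it, every retransmission is dropped at the same link and the transfer never completes.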
The ICMP "Destination Unreachable" messages are either filtered by firewall devices on the way back to the server, or are not sent at all because "no ip unreachables" is configured on the RouterB interface. Either is the result of a deliberate action by the site administering these devices. Either action should be taken only after very careful consideration of the risks: how big is the problem that one is trying to solve, and is the solution worse than the problem? This typically requires collaboration between the security and network personnel. Blanket blocking of ICMP messages that are essential for the proper operation of the protocols in the vast majority of installations should be treated as a misconfiguration.
The "ICMP Attacks Illustrated" paper from SANS gives a good overview of the potential attacks that use ICMP. This ICMP-related entry in the SANS FAQ gives some more practical details. The CERT vulnerability note VU#222750 illustrates some other potential issues.
How much of a risk these constitute for a site depends on the site's policies. Situations involving ICMP Unreachables do have some implications; however, it may be possible to mitigate them in a more granular fashion.
The point of view from the network perspective can typically be summarised as "ICMP Errors should not be filtered at all, period". RFC2923 has a chapter dedicated to this topic.
The simplest thing is to lower the MTU value on the interface of the side that is under your control, be it the server side or the client side, if this is permitted by the site policies. In this example scenario, lowering the client MTU causes the client to send a smaller MSS to the server, which prevents the server from sending large packets. Lowering the MTU on the server side directly prevents the server from sending large packets. Adjusting the MSS on the routers along the path can also be an option. Both approaches are frequently used as workarounds for this issue, because there can be at least three parties involved: the party that hosts the client, the party that hosts the server, and the party that has the link with the lower MTU. The PIX/ASA has the configuration command "sysopt connection tcpmss 1380" by default, which mitigates the impact to some extent by decreasing the MSS in the SYN/SYN-ACK segments passing through. For quite a few common smaller-MTU scenarios (IPsec tunneling) it takes care of the issue by default; however, if the decrease of the MTU due to tunneling overhead is large enough, the issue will resurface.
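The arithmetic behind these workarounds can be sketched as follows. This is a minimal illustration, assuming 20-byte IP and TCP headers with no options; the function names and the example path MTUs of 1440 and 1400 are hypothetical values chosen for illustration, not figures for any particular tunnel type.

```python
HEADERS = 40  # assumption: 20-byte IPv4 header + 20-byte TCP header, no options


def largest_packet(sender_mtu, receiver_mtu, mss_clamp=None):
    """Largest IP packet the sender emits for a full-sized TCP segment."""
    # MSS option the receiver advertises in its SYN or SYN-ACK.
    advertised_mss = receiver_mtu - HEADERS
    if mss_clamp is not None:
        # A device on the path rewrites the MSS option to a lower value.
        advertised_mss = min(advertised_mss, mss_clamp)
    # The sender is also bound by its own interface MTU.
    own_limit = sender_mtu - HEADERS
    return min(advertised_mss, own_limit) + HEADERS


def fits(packet_size, path_mtu):
    """True if the packet can cross the smallest link without fragmentation."""
    return packet_size <= path_mtu


if __name__ == "__main__":
    clamped = largest_packet(1500, 1500, mss_clamp=1380)
    print(clamped, fits(clamped, 1440), fits(clamped, 1400))
```

An MSS of 1380 yields packets of at most 1420 bytes, which cross a 1440-byte link but not a 1400-byte one; this is the case where the overhead is large enough for the issue to resurface.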
These methods are only workarounds for when full PMTUD operation cannot be achieved, because frequently the party that experiences the effect of the ICMP Unreachables being blocked is not the same party that blocks them, and the party that has the smaller-MTU link is yet another organization.
For the most part, other protocols do not set the DF bit in the IP header, so their packets will be fragmented. Although this seems harmless, at high data rates the fragmentation mechanism of IPv4 is no longer adequate; [RFC4840] discusses why. Additionally, if the fragments are created from the outer tunnel encapsulation, the burden of their reassembly lies on the receiving router, which causes performance to drop significantly. Some of the other protocols can perform PMTUD to avoid the above issue with fragmentation, so they will be subject to blackholing as well if PMTUD is configured for them (refer to this paper that is mentioned in the "further reading" section).
As we can see, wildcard blocking is very harmful for PMTUD. However, from the security perspective, wildcard permitting of all ICMP is not the appropriate solution either. So creating the optimal configuration for a site requires the cooperation of the "security" and "network" departments, to ensure that the security matches the site policies while having minimal operational impact on other unsuspecting parties.
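As a rough illustration of what "granular" can mean here, the sketch below expresses a filter decision as an explicit allow-list of ICMP type/code pairs instead of a blanket permit or deny. The function name and the choice of entries are assumptions made for illustration only; which messages a given site should permit is a policy decision, and this is not a recommended rule set.

```python
# Hypothetical allow-list: (ICMP type, ICMP code) pairs permitted inbound.
ALLOWED_ICMP = {
    (3, 4),   # Destination Unreachable, Fragmentation needed (used by PMTUD)
    (11, 0),  # Time Exceeded, TTL exceeded in transit (used by traceroute)
}


def permit_icmp(icmp_type: int, code: int) -> bool:
    """Permit only the listed type/code pairs; everything else is denied."""
    return (icmp_type, code) in ALLOWED_ICMP


if __name__ == "__main__":
    print(permit_icmp(3, 4))   # message that PMTUD depends on
    print(permit_icmp(5, 0))   # Redirect, not on the list
```

The point of the sketch is only that the message PMTUD depends on can be singled out and permitted while other ICMP traffic is still handled according to site policy.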
RFC4459 - MTU and Fragmentation Issues with In-the-Network Tunneling.
Team Cymru's document on ICMP filtering - provides a practical example of an ACL that permits the correct functioning of PMTUD and traceroute.
Resolve IP Fragmentation, MTU, MSS, and PMTUD Issues with GRE and IPSEC: the paper discusses in much more detail the fragmentation, PMTUD - in the light of GRE/IPSEC tunnels usage.
Disabling unreachables breaks PMTUD - a case study showing the harm caused by blindly disabling the unreachables.
Raising the Internet MTU - this documents the state of affairs with the new, more robust method of path MTU discovery that is described in RFC4821.
IANA registry for ICMP types/codes - the list of all registered ICMP types (and, where applicable, codes), along with the references to the corresponding documents defining them.