01-04-2014 06:11 AM
Dear all,
I've had some issues with the MSS of MPBGP sessions between two 7609S routers. I'm trying to understand how the PMTUD procedure works before the MSS of BGP's TCP session is set, which is why I'm capturing all the traffic exchanged between both routers. By default, PMTUD is enabled for BGP sessions with 'bgp transport path-mtu-discovery', but it is not enabled globally. Before the TCP handshake and the subsequent BGP OPEN, I expected to see UDP packets of growing length from both sides to verify the path MTU and so determine the MSS to use when opening BGP's TCP session. To my surprise, I don't see this traffic, and both routers choose an MSS equal to the link MTU minus 40 bytes (IP + TCP headers). Could you tell me how the PMTUD procedure works for BGP sessions?
Both routers run IOS 15.2(4)S1 and another strange issue I find is that in case PMTUD is disabled for BGP sessions 'no bgp transport path-mtu-discovery' the chosen MSS instead of 536 bytes is 576 bytes. Any idea about why this number?
Thanks in advance
Kind regards
Octavio
01-04-2014 01:06 PM
Hi Octavio,
in case PMTUD is disabled for BGP sessions 'no bgp transport path-mtu-discovery' the chosen MSS instead of 536 bytes is 576 bytes. Any idea about why this number?
In IPv4, the minimum datagram size every host must accept is 576 bytes, so with PMTUD disabled the router uses this value as the maximum size for the packets it sends. The MSS is typically calculated as the MTU minus 40 bytes (IP and TCP headers; the MSS is just the TCP data size).
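As a quick check of that arithmetic, here is a minimal Python sketch. The function name is made up for illustration, and it assumes a 20-byte IPv4 header and a 20-byte TCP header with no options:

```python
IPV4_HEADER = 20  # bytes, assuming no IP options
TCP_HEADER = 20   # bytes, assuming no TCP options


def mss_for_mtu(mtu: int) -> int:
    """Return the TCP MSS for a given IP MTU (MTU minus 40 bytes)."""
    if mtu <= IPV4_HEADER + TCP_HEADER:
        raise ValueError("MTU too small to carry any TCP payload")
    return mtu - IPV4_HEADER - TCP_HEADER


print(mss_for_mtu(576))   # 536
print(mss_for_mtu(1500))  # 1460
```

So 576 bytes corresponds to the familiar 536-byte default MSS, and a 1500-byte Ethernet MTU gives 1460.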
When you enable PMTUD, the router sends packets up to the MTU size of its output interface, and the MSS can be negotiated between both end hosts during the TCP handshake (both just exchange their local values; the path is not checked this way).
The DF bit is set on every packet, so if some interface in the path has a lower MTU, the device concerned should send back an ICMP "Destination Unreachable/Fragmentation needed and DF bit set" message so that the router can lower the MTU it uses for that interface/connection:
ICMP: dst (1.1.1.1) frag. needed and DF set unreachable rcv from 192.168.12.2
TCP0: ICMP datagram too big received (576), MSS changes from 1460 to 536
R1#show tcp | i Timer|segment|mtu
Event Timers (current time is 0x506CE0):
Timer Starts Wakeups Next
PmtuAger 1 0 0x525928
Status Flags: active open, path mtu discovery
Option Flags: nagle, path mtu capable
Datagrams (max data segment is 536 bytes):
Then, as long as the MSS value for an interface is lower than the result of the negotiation during the TCP handshake, the router tries to discover path MTU changes dynamically:
TCP0: Pathmtu-Discovery, MSS changes from 536 to 966
R1#show tcp | i Timer|segment|mtu
Event Timers (current time is 0x52E3A8):
Timer Starts Wakeups Next
PmtuAger 2 1 0x542DE8
Status Flags: active open, path mtu discovery
Option Flags: nagle, path mtu capable
Datagrams (max data segment is 966 bytes):
TCP0: Pathmtu-Discovery, MSS changes from 966 to 1452
R1#show tcp | i Timer|segment|mtu
Event Timers (current time is 0x5507B4):
Timer Starts Wakeups Next
PmtuAger 3 2 0x5602A8
Status Flags: active open, path mtu discovery
Option Flags: nagle, path mtu capable
Datagrams (max data segment is 1452 bytes):
TCP0: Pathmtu-Discovery, MSS changes from 1452 to 1460
R1#show tcp | i Timer|segment|mtu
Event Timers (current time is 0x569304):
Timer Starts Wakeups Next
PmtuAger 3 3 0x0
Option Flags: nagle, path mtu capable
Datagrams (max data segment is 1460 bytes):
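The sequence in the debug output above (536, 966, 1452, 1460) can be reproduced with a small Python sketch. It assumes the router steps up through the MTU plateau table from RFC 1191 (1006 and 1492 are two of its entries) and subtracts 40 bytes each time. Whether IOS uses exactly this table internally is an assumption based on the numbers matching, and the function names are illustrative only:

```python
# MTU plateau values from RFC 1191 (Table 7-1), in ascending order.
PLATEAU_MTUS = [68, 296, 508, 1006, 1492, 2002, 4352, 8166, 17914, 32000, 65535]
HEADERS = 40  # 20-byte IPv4 header + 20-byte TCP header, no options


def mss_after_frag_needed(current_mss, next_hop_mtu):
    """Lower the MSS after an ICMP 'fragmentation needed' reporting next_hop_mtu."""
    return min(current_mss, next_hop_mtu - HEADERS)


def next_probe_mss(current_mss, ceiling_mss):
    """Step up to the next plateau above the current MTU, capped at ceiling_mss."""
    current_mtu = current_mss + HEADERS
    for plateau in PLATEAU_MTUS:
        if plateau > current_mtu:
            return min(plateau - HEADERS, ceiling_mss)
    return ceiling_mss


def probe_sequence(start_mss, ceiling_mss):
    """List the MSS values visited while climbing from start_mss to ceiling_mss."""
    sequence = [start_mss]
    while sequence[-1] < ceiling_mss:
        sequence.append(next_probe_mss(sequence[-1], ceiling_mss))
    return sequence


reduced = mss_after_frag_needed(1460, 576)
print(reduced)                        # 536
print(probe_sequence(reduced, 1460))  # [536, 966, 1452, 1460]
```

In the output above, each upward step coincides with one more PmtuAger wakeup, and the climb stops once the MSS is back at the value from the handshake (1460).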
HTH
Rolf
01-07-2014 03:14 PM
Thank you very much, Rolf, for your explanation. I made a mistake in my original message. What is strange in IOS 15.2(4)S1 is that, when PMTUD is not enabled (no bgp transport path-mtu-discovery), the MSS value is not 536, as expected, but 556 (I wrote 576 bytes by mistake), as shown in the following output:
C7609S-01#show tcp | i segment|Option
...
Option Flags: higher precendence, nagle <<--- no "path mtu capable"
Datagrams (max data segment is 556 bytes):
...
Any idea about why this number instead of the ordinary 536?
Thanks in advance
Octavio
01-04-2014 03:00 PM
Great explanation by fischer. Here is another great article on PMTUD and MSS.
http://www.cisco.com/en/US/tech/tk827/tk369/technologies_white_paper09186a00800d6979.shtml
If you have different types of layer 1 connectivity end to end, then we really need to understand where and how much extra overhead is added, as that limits the BGP UPDATE packet size/MSS.
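To illustrate how encapsulation overhead limits the MSS, here is a rough Python sketch. The overhead values are the common ones for a basic GRE-over-IPv4 header (24 bytes) and a single MPLS label (4 bytes). The table and function names are made up for the example, and the real overhead on a given path depends on how it is configured:

```python
# Per-encapsulation overhead in bytes (illustrative, not exhaustive).
OVERHEAD_BYTES = {
    "gre_ipv4": 24,   # 20-byte outer IPv4 header + 4-byte basic GRE header
    "mpls_label": 4,  # one MPLS label stack entry
}
TCP_IP_HEADERS = 40  # 20-byte IPv4 header + 20-byte TCP header, no options


def mss_after_overhead(link_mtu, encapsulations):
    """Return the largest MSS that fits once each encapsulation is subtracted."""
    overhead = sum(OVERHEAD_BYTES[name] for name in encapsulations)
    return link_mtu - overhead - TCP_IP_HEADERS


print(mss_after_overhead(1500, []))            # 1460
print(mss_after_overhead(1500, ["gre_ipv4"]))  # 1436
print(mss_after_overhead(4470, ["gre_ipv4"]))  # 4406
```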
Thanks,
Madhu
01-07-2014 03:15 PM
Thanks a lot, Madhu. Really interesting. I really appreciate your help.
Regards
Octavio
01-07-2014 04:23 PM
Hi Octavio,
Can you send the "sh run" of the interfaces and the "show ip interface" output on both sides, and describe how they are connected? For example, for back-to-back connected interfaces, the MSS will still be the interface MTU minus 40 bytes even when you disable PMTUD. You can use "neighbor a.b.c.d transport path-mtu-discovery disable" instead of disabling it globally under BGP, which affects all neighbors. For some reason it is accounting for 20 bytes more than expected. If possible, could you run "debug ip tcp transaction" when this is happening?
Thanks,
Madhu
01-08-2014 03:59 PM
Thank you, Madhu. I just wanted to clarify why 556 bytes, and I didn't provide you with all the information. I will try to do so now. In fact, I get 556 bytes as the MSS without disabling PMTUD.
I have two Cisco 7609S routers (R1 runs 15.2(4)S1 and R2 runs 15.3(3)S1) running iBGP between them, with OSPF as the IGP, connected by an STM16/OC48 SONET link. In addition, they have several iBGP and eBGP sessions with different peers. They all have PMTUD enabled by default. My suspicion is that 15.2(4)S1 has some sort of bug. If you take a look at the MSS values, you will find something strange:
R2 BGP sessions:
Peer    MSS   Flag options
XXXXXX  1460  nagle, path mtu capable
XXXXXX  1918  nagle, path mtu capable
XXXXXX  1460  nagle, path mtu capable  <-- IPX
R1      4406  nagle, path mtu capable
XXXXXX  1460  nagle, path mtu capable
XXXXXX  1460  nagle, path mtu capable
XXXXXX  4406  nagle, path mtu capable
R1 BGP sessions:
Peer    MSS   Flag options
XXXXXX  556   nagle
XXXXXX  556   nagle
R2      556   nagle
XXXXXX  1460  higher precendence, nagle, path mtu capable
XXXXXX  556   higher precendence, nagle
XXXXXX  556   higher precendence, nagle
XXXXXX  556   higher precendence, nagle
XXXXXX  556   VRF id set, higher precendence, nagle
XXXXXX  1460  higher precendence, nagle, path mtu capable, md5
XXXXXX  556   higher precendence, nagle
As you can see, R2 seems to be able to run PMTUD. On R1, however, even though PMTUD is enabled, only two sessions get an MSS other than 556 bytes, and most of them don't show "path mtu capable".
If I focus on the R1-R2 iBGP session over the STM4 link, R2 has a 4406-byte MSS and R1 has 556 bytes.
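For anyone who wants to repeat this comparison across many sessions, a small Python sketch along the following lines could pull the MSS and the PMTUD flag out of saved "show tcp | i segment|Option" output. The function name and the sample text are illustrative (the sample is modeled on the lines quoted in this thread), and it assumes each "Option Flags" line is followed by its "Datagrams" line:

```python
import re

MSS_PATTERN = re.compile(r"max data segment is (\d+) bytes")


def parse_show_tcp(text):
    """Pair each 'Option Flags' line with the 'Datagrams' line that follows it."""
    sessions = []
    flags = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if line.startswith("Option Flags:"):
            flags = [flag.strip() for flag in line.split(":", 1)[1].split(",")]
            continue
        match = MSS_PATTERN.search(line)
        if match and flags is not None:
            sessions.append({
                "mss": int(match.group(1)),
                "pmtud": "path mtu capable" in flags,
            })
            flags = None
    return sessions


# Sample input modeled on the output format shown in this thread.
SAMPLE = """\
Option Flags: higher precendence, nagle
Datagrams (max data segment is 556 bytes):
Option Flags: nagle, path mtu capable
Datagrams (max data segment is 1460 bytes):
"""

for session in parse_show_tcp(SAMPLE):
    if not session["pmtud"]:
        print("no PMTUD, MSS", session["mss"])
```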
R1 outputs:
interface POS4/0/0
ip address XXXXXX 255.255.255.254
load-interval 30
mpls traffic-eng tunnels
mpls traffic-eng attribute-flags 0xC0000008
mls qos trust dscp
pos framing sdh
pos report rdool
pos report lais
pos report lrdi
pos report pais
pos report prdi
pos report puneq
pos report pplm
pos report ptim
pos report ptiu
pos report sd-ber
pos flag s1 ignore
service-policy input Policy_NetworkIngress
ip rsvp bandwidth percent 100
end
R1#sh ip interface POS4/0/0
POS4/0/0 is up, line protocol is up
Internet address is XXXXXX/31
Broadcast address is 255.255.255.255
Address determined by non-volatile memory
MTU is 4470 bytes
Helper address is not set
Directed broadcast forwarding is disabled
Multicast reserved groups joined: 224.0.0.14 224.0.0.17 224.0.0.5
Outgoing access list is not set
Inbound access list is not set
Proxy ARP is enabled
Local Proxy ARP is disabled
Security level is default
Split horizon is enabled
ICMP redirects are always sent
ICMP unreachables are always sent
ICMP mask replies are never sent
IP fast switching is enabled
IP Flow switching is disabled
IP CEF switching is enabled
IP CEF switching turbo vector
IP Null turbo vector
Associated unicast routing topologies:
Topology "base", operation state is UP
IP multicast fast switching is enabled
IP multicast distributed fast switching is disabled
IP route-cache flags are Fast, CEF, CWAN
Router Discovery is disabled
IP output packet accounting is disabled
IP access violation accounting is disabled
TCP/IP header compression is disabled
RTP/IP header compression is disabled
Probe proxy name replies are disabled
Policy routing is disabled
Network address translation is disabled
BGP Policy Mapping is disabled
Input features: QoS Classification, MCI Check
Output features: HW Shortcut Installation
Post encapsulation features: HW Shortcut Installation
Sampled Netflow is disabled
IP Routed Flow creation is disabled in netflow table
IP Bridged Flow creation is disabled in netflow table
WCCP Redirect outbound is disabled
WCCP Redirect inbound is disabled
WCCP Redirect exclude is disabled
R2 outputs:
interface POS4/0/0
ip address XXXXXXX 255.255.255.254
load-interval 30
mpls traffic-eng tunnels
mpls traffic-eng attribute-flags 0xC0000008
mls qos trust dscp
pos framing sdh
pos report rdool
pos report lais
pos report lrdi
pos report pais
pos report prdi
pos report puneq
pos report pplm
pos report ptim
pos report ptiu
pos report sd-ber
pos flag s1 ignore
aps group 4
aps protect 1 XXXXXXXX
aps revert 10
service-policy input Policy_NetworkIngress
ip rsvp bandwidth percent 100
end
C7600-BAT01-01#sh ip interface pos4/0/0
POS4/0/0 is up, line protocol is down (APS protect - inactive)
Internet address is XXXXXX/31
Broadcast address is 255.255.255.255
Address determined by non-volatile memory
MTU is 4470 bytes
Helper address is not set
Directed broadcast forwarding is disabled
Multicast reserved groups joined: 224.0.0.14 224.0.0.17 224.0.0.5
Outgoing access list is not set
Inbound access list is not set
Proxy ARP is enabled
Local Proxy ARP is disabled
Security level is default
Split horizon is enabled
ICMP redirects are always sent
ICMP unreachables are always sent
ICMP mask replies are never sent
IP fast switching is enabled
IP Flow switching is disabled
IP CEF switching is enabled
IP CEF switching turbo vector
IP Null turbo vector
Associated unicast routing topologies:
Topology "base", operation state is UP
IP multicast fast switching is enabled
IP multicast distributed fast switching is disabled
IP route-cache flags are Fast, CEF, CWAN
Router Discovery is disabled
IP output packet accounting is disabled
IP access violation accounting is disabled
TCP/IP header compression is disabled
RTP/IP header compression is disabled
Probe proxy name replies are disabled
Policy routing is disabled
Network address translation is disabled
BGP Policy Mapping is disabled
Input features: QoS Classification, MCI Check
Output features: HW Shortcut Installation
Post encapsulation features: HW Shortcut Installation
Sampled Netflow is disabled
IP Routed Flow creation is disabled in netflow table
IP Bridged Flow creation is disabled in netflow table
IPv4 WCCP Redirect outbound is disabled
IPv4 WCCP Redirect inbound is disabled
IPv4 WCCP Redirect exclude is disabled
Thanks a lot for your help.
Kind regards
Octavio
01-10-2014 09:28 PM
Hi Octavio,
The MSS value can be different in each direction if there is asymmetric routing. If you enable "debug ip tcp transaction" and clear BGP between these two peers, does it get a different MSS value? With an Ethernet link and directly connected neighbors, I doubt PMTUD will kick in, as it takes the interface MTU as the basis for the MSS calculation. I'm not sure whether it is the same for any underlying link or a bit different for serial links. Enabling the above debug will clarify whether R1 is sending its capability to negotiate or not.
Thanks,
Madhu
01-19-2014 03:04 PM
Dear Madhu,
Sorry for my late response. Once again, thanks a lot for your help. You will see I found something strange.
Please, remember my scenario:
I have two Cisco 7609S (R1 runs 15.2(4)S1 and R2 runs 15.3(3)S1) running iBGP between them, apart from OSPF as IGP, and connected by a STM16/OC48 SONET link.
Additionally, we set "ip tcp mss 1460" on R2.
I was only allowed to reset an MPBGP session between R1 and R2 that runs over a GRE tunnel between both routers.
As expected, both routers ended up with an MSS of 1460 bytes (R2 advertises 1460 to R1 and sets 1460 as its local MSS due to "ip tcp mss 1460", as it's smaller than the 4406-byte MSS advertised by R1). Output taken from R2:
006421: Jan 13 09:42:38.069: TCB232B59D0 setting property TCP_SSO_TYPE (27) 3A413E70
006422: Jan 13 09:42:38.069: TCP: SSO already disabled for 232B59D0
006423: Jan 13 09:42:38.069: TCB232B59D0 setting property TCP_SSO_TYPE (27) 22C46A78
006424: Jan 13 09:42:38.069: TCP: SSO already disabled for 232B59D0
006425: Jan 13 09:42:38.069: TCP0: state was ESTAB -> FINWAIT1 [179 -> 10.188.188.1(64695)]
006426: Jan 13 09:42:38.069: TCP0: sending FIN
006427: Jan 13 09:42:38: %BGP-5-ADJCHANGE: neighbor 10.188.188.1 vpn vrf Mgmt Down User reset
006428: Jan 13 09:42:38: %BGP_SESSION-5-ADJCHANGE: neighbor 10.188.188.1 IPv4 Unicast vpn vrf Mgmt topology base removed from session User reset
006429: Jan 13 09:42:38.189: TCP0: state was FINWAIT1 -> FINWAIT2 [179 -> 10.188.188.1(64695)]
006430: Jan 13 09:42:38.193: TCP0: FIN processed
006431: Jan 13 09:42:38.193: TCP0: state was FINWAIT2 -> TIMEWAIT [179 -> 10.188.188.1(64695)]
006432: Jan 13 09:42:38.397: TCB318670D0 created
006433: Jan 13 09:42:38.397: TCB318670D0 setting property TCP_VRFTABLEID (20) 227AE0A4
006434: Jan 13 09:42:38.397: TCB318670D0 setting property TCP_MD5KEY (4) 0
006435: Jan 13 09:42:38.397: TCB318670D0 setting property TCP_ACK_RATE (37) 3A5F614C
006436: Jan 13 09:42:38.397: TCB318670D0 setting property TCP_TOS (11) 3A5F6150
006437: Jan 13 09:42:38.397: TCB318670D0 setting property TCP_PMTU (45) 3A5F6110
006438: Jan 13 09:42:38.397: TCB318670D0 setting property TCP_RTRANSTMO (36) 3A5F6148
006439: Jan 13 09:42:38.397: TCP: Random local port generated 38824, network 1
006440: Jan 13 09:42:38.397: TCB318670D0 bound to 10.188.188.2.38824
006441: Jan 13 09:42:38.397: Reserved port 38824 in Transport Port Agent for TCP IP type 1
006442: Jan 13 09:42:38.397: TCP: sending SYN, seq 1477543876, ack 0
006443: Jan 13 09:42:38.397: TCP0: Connection to 10.188.188.1:179, advertising MSS 1460
006444: Jan 13 09:42:38.397: TCP0: state was CLOSED -> SYNSENT [38824 -> 10.188.188.1(179)]
006445: Jan 13 09:42:38.521: TCP0: state was SYNSENT -> ESTAB [38824 -> 10.188.188.1(179)]
006446: Jan 13 09:42:38.521: TCP: tcb 318670D0 connection to 10.188.188.1:179, peer MSS 1460, MSS is 1460
006447: Jan 13 09:42:38.521: TCB318670D0 connected to 10.188.188.1.179
006448: Jan 13 09:42:38.521: TCB318670D0 setting property TCP_NO_DELAY (0) 3A5F6148
006449: Jan 13 09:42:38.521: TCB318670D0 setting property TCP_RTRANSTMO (36) 3A5F6148
006450: Jan 13 09:42:38: %BGP-5-ADJCHANGE: neighbor 10.188.188.1 vpn vrf Mgmt Up
So far so good.
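The outcome described above follows the usual rule that each side sends segments no larger than the smaller of the MSS its peer advertised and its own local limit. A minimal Python sketch, with a hypothetical function name and the values as described above:

```python
def effective_send_mss(peer_advertised_mss, local_limit_mss):
    """Each side sends segments no larger than the smaller of the two values."""
    return min(peer_advertised_mss, local_limit_mss)


# R2: local limit 1460 from "ip tcp mss 1460", peer value 4406
print(effective_send_mss(4406, 1460))  # 1460
# R1: local value 4406, R2 advertised 1460
print(effective_send_mss(1460, 4406))  # 1460
```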
However, two days later the session's MSS value on R1 changed from (Peer MSS Flag-options) "10.188.188.2 1460 higher precendence, nagle, path mtu capable" to "10.188.188.2 556 higher precendence, nagle". The BGP session wasn't restarted, and R2 kept the MSS values from the last restart. It's as if the "path mtu capable" capability was disabled and the session's MSS was changed to 556 bytes.
To be honest, I'm unable to find any other explanation than a bug in 15.2(4)S1.
Thanks a lot in advance
Kind regards
Octavio