Single TCP Flow Performance on Firewall Services Module (FWSM)

Andrew Ossipov · ‎08-20-2010

Overview
TCP Performance Considerations
FWSM Impact on Single TCP Flow Performance
Summary
Sample Performance

Overview

Firewall Services Module (FWSM) is positioned as an aggregation edge firewall. Its architecture is primarily designed to service a high number of low-bandwidth flows. When the FWSM is used to protect environments involving a few high-bandwidth flows (such as network backup applications), the observed performance on such flows is frequently lower than expected. This guide goes over the existing limitations and provides several ways to improve single TCP flow performance.

TCP Performance Considerations

Even without an FWSM in the path, the maximum throughput of a single TCP flow is capped by the combination of the TCP receive window size and the Round Trip Time (RTT) between the endpoints. The TCP window size advertised by an endpoint indicates how much data the other side can send before expecting a TCP ACK. Since the sender cannot transmit more data than the receiver's advertised TCP window size during an RTT interval (i.e. the time it takes for the first block of data to arrive at the receiver and for the TCP ACK to come back to the sender), the maximum throughput of a TCP flow can be calculated as follows:

Maximum Throughput [bps] = (TCP Window Size [bytes] / RTT [seconds]) * 8 [bits/byte]

In this and the following calculations we assume that the send buffer of the transmitting endpoint can accommodate at least the size of the TCP receive window of the other side. Inversely, to calculate the appropriate TCP window size to take the maximum advantage of the available bandwidth, the following formula can be used:

Optimal TCP Window Size [bytes] = (Minimum Link Bandwidth [bps] / 8 [bits/byte]) * RTT [seconds]

For instance, assume that host A is transmitting data to host B and host B has advertised an 8 Kbyte receive window. The RTT between the two hosts is 500 msec (0.5 sec). The maximum throughput of the TCP flow would be (8000 bytes / 0.5 sec) * 8 bits/byte = 128 Kbps. If the actual bandwidth of the link between the hosts is 10 Gbps, the optimal TCP window size would be (10,000,000,000 bps / 8 bits/byte) * 0.5 sec = 625 Mbytes. Notice that the link is severely underutilized when the receiver uses a TCP window of 8 Kbytes. To achieve maximum utilization, it should use a window of 625 Mbytes instead.

However, here lies a problem. Per RFC 793, the length of the window size field in the TCP header is 16 bits. Hence, the maximum achievable window size value is 65535 bytes. RFC 1323 introduces a TCP option called Window Scale that expands the window size by a fixed multiplier, which is a power of two with an exponent of up to 14. For instance, host B will advertise the window scale of 14 during the three-way handshake with host A to imply that any TCP window size subsequently advertised by host B should be multiplied by 2^14 = 16384. Now, host B can advertise the TCP window of 38147 bytes that host A (provided it supports Window Scaling) will multiply by 16384 to get the actual TCP window size of 625,000,448 bytes that will allow the transfer to occur at the maximum possible speed.
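The arithmetic above can be checked with a short script. This is a minimal sketch: the function names are made up for illustration, and the only protocol facts it relies on are the 16-bit window field (RFC 793) and the maximum Window Scale shift count of 14 (RFC 1323).

```python
import math

MAX_WINDOW_FIELD = 65535   # largest value of the 16-bit window field (RFC 793)
MAX_WINDOW_SHIFT = 14      # largest Window Scale shift count (RFC 1323)


def max_throughput_bps(window_bytes, rtt_seconds):
    """Upper bound on single-flow throughput for a given receive window."""
    return window_bytes / rtt_seconds * 8


def optimal_window_bytes(bandwidth_bps, rtt_seconds):
    """Window needed to keep the link full for the whole RTT."""
    return bandwidth_bps / 8 * rtt_seconds


def window_scale_for(window_bytes):
    """Smallest shift count and 16-bit field value that cover window_bytes."""
    for shift in range(MAX_WINDOW_SHIFT + 1):
        field = math.ceil(window_bytes / (1 << shift))
        if field <= MAX_WINDOW_FIELD:
            return shift, field
    raise ValueError("window exceeds what Window Scale can express")


print(max_throughput_bps(8000, 0.5))      # 128000.0 bps
print(optimal_window_bytes(10e9, 0.5))    # 625000000.0 bytes
print(window_scale_for(625_000_000))      # (14, 38147)
```

For the 10 Gbps, 500 msec path, only the largest permitted shift count of 14 is enough to express a 625 Mbyte window in the 16-bit field.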

Another issue that significantly affects TCP throughput is packet loss. Since an endpoint can only learn about one lost TCP segment per RTT, it significantly slows down the transfer. Furthermore, without selective acknowledgement, any data sent after the lost segment has to be retransmitted even if it successfully arrived at the receiver. When Window Scaling is used and the RTT is high, the amount of needlessly retransmitted data can be tremendous. RFC 2018 introduces a mechanism for Selective Acknowledgement (SACK). It allows the receiver to request retransmission of only certain TCP segments while acknowledging the receipt of subsequent data. This is accomplished through embedding the information about the left and right edges (sequence numbers) of the successfully received data in TCP ACK retransmission requests.

Consider an example TCP ACK segment that carries a SACK option. The TCP ACK on the segment is set to 1069276099, implying that this is the sequence number of the next expected segment from the other side. However, the embedded SACK option lists the data from 1069277089 through 1069277090 that was successfully received. Hence, the sender only needs to retransmit the data from 1069276099 through 1069277089. On large data transfers with occasional packet loss, this mechanism provides significant advantages.
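The way a sender derives what to retransmit from such a segment can be sketched as follows. The function is hypothetical and simplified: it treats each SACK right edge as the sequence number just past the received block and ignores 32-bit sequence number wraparound.

```python
def segments_to_retransmit(ack, sack_blocks):
    """Return the (start, end) byte ranges the receiver is still missing.

    ack         -- cumulative ACK: next sequence number the receiver expects
    sack_blocks -- list of (left_edge, right_edge) ranges already received
    Each returned range runs from start up to, but not including, end.
    """
    holes = []
    expected = ack
    for left, right in sorted(sack_blocks):
        if left > expected:
            # Everything between the last known byte and this block is missing.
            holes.append((expected, left))
        expected = max(expected, right)
    return holes


# Values from the example segment described above.
print(segments_to_retransmit(1069276099, [(1069277089, 1069277090)]))
# [(1069276099, 1069277089)]
```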

FWSM Impact on Single TCP Flow Performance

Multilevel Packet Processing

FWSM employs a distributed processing architecture that involves several low-level Network Processors (NPs) as well as the general-purpose Control Point. The majority of the traffic is handled by the NPs, which have the highest forwarding capacity (hence sometimes referred to as “Fastpath”). Only certain traffic (such as that subject to application inspection) is sent to the Control Point. Since the Control Point may impose additional limitations on the throughput as well as the properties of the TCP traffic, this discussion will only consider the connections flowing exclusively through the NPs. As a general rule, avoid enabling application inspection on any traffic unnecessarily as it will significantly impact the throughput of these flows.

Backplane Etherchannel

FWSM communicates with the network through the 6 Gbps data plane in the form of an Etherchannel with the local switch. The Etherchannel comprises 6 individual GigabitEthernet ports. As with any other Etherchannel, all packets in one direction of a flow (for instance, a TCP connection from host A to host B) always land on the same port. Consequently, any single TCP flow going through the FWSM cannot transmit data at more than a 1 Gbps rate. Furthermore, several flows sharing the same port will reduce the maximum throughput of each individual flow even further.
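The effect of per-flow port selection can be illustrated with a toy model. The hash below (CRC32 over the addresses and ports) is purely an assumption for illustration; the real load-balancing algorithm is determined by the switch and is not described in this document. The point is only that one direction of a flow always maps to a single 1 Gbps member port.

```python
import zlib

PORTS = 6               # GigabitEthernet members of the backplane Etherchannel
PORT_SPEED_BPS = 1e9    # capacity of each member port


def member_port(src_ip, dst_ip, src_port, dst_port):
    """Map one direction of a flow to a member port (illustrative hash only)."""
    key = f"{src_ip}:{src_port}->{dst_ip}:{dst_port}".encode()
    return zlib.crc32(key) % PORTS


def per_flow_ceiling_bps(flows_on_port):
    """Best-case equal share of one member port among the flows hashed to it."""
    return PORT_SPEED_BPS / flows_on_port


flow = ("10.0.0.1", "10.0.1.1", 40000, 5001)  # hypothetical addresses and ports
# Every packet of the flow lands on the same port, however many are sent.
assert member_port(*flow) == member_port(*flow)
print(per_flow_ceiling_bps(1))  # 1000000000.0
print(per_flow_ceiling_bps(4))  # 250000000.0
```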

Packet Payload Size

As mentioned earlier, the FWSM architecture is optimized to handle a large number of relatively low-bandwidth flows. Due to the lock structure of the hardware Network Processors (NPs), packets belonging to a single flow cannot be processed in a truly parallel fashion. As a result, every single TCP flow is capped by a certain maximum packet rate. Consequently, the more TCP payload is sent per packet, the higher the throughput that can be achieved. During the three-way handshake, each endpoint advertises its TCP Maximum Segment Size (MSS) value, which indicates the maximum amount of data it can process per TCP segment. The default MTU of 1500 bytes typically leaves 1460 bytes for the payload. However, the default FWSM setting is to adjust the value of TCP MSS advertised by the endpoints to 1380 bytes. While this approach may be justified in certain cases, this value can be increased or the adjustment turned off altogether with the per-context sysopt connection tcpmss command:

FWSM(config)# sysopt connection tcpmss ?

configure mode commands/options:
  <0-65535>  TCP MSS limit in bytes, minimum default is 0,
             maximum default is 1380 bytes
  minimum    Set minimum limit of TCP MSS

When going from 1380 to 1460 bytes of payload per packet, the typical performance increase is about 6%. To increase the amount of data transmitted in every packet even further, Jumbo Frames can be used as well. FWSM supports Jumbo Frames of up to 8500 bytes in size, so this setting can be used end to end (including the switch and the respective endpoint ports) to achieve much higher firewalled throughput. To enable Jumbo Frame support on the FWSM itself, use the mtu <nameif> 8500 command for every associated interface:

FWSM(config)# mtu inside ?

configure mode commands/options:
  <300-8500>  MTU bytes
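Because a single flow is capped by packet rate, the payload carried per packet translates directly into throughput. The sketch below makes that relationship explicit; it assumes 40 bytes of IPv4 and TCP headers with no options, and the helper names are illustrative only.

```python
IP_TCP_HEADERS = 40  # assumed: 20-byte IPv4 header + 20-byte TCP header, no options


def mss_for_mtu(mtu):
    """TCP payload that fits into one packet of the given MTU."""
    return mtu - IP_TCP_HEADERS


def throughput_bps(packets_per_second, payload_bytes):
    """Payload throughput of a flow limited to a fixed packet rate."""
    return packets_per_second * payload_bytes * 8


def relative_gain(old_payload, new_payload):
    """Fractional payload-throughput gain at an unchanged packet rate."""
    return new_payload / old_payload - 1


print(mss_for_mtu(1500))                    # 1460
print(mss_for_mtu(8500))                    # 8460
print(f"{relative_gain(1380, 1460):.1%}")   # 5.8%
```

The 5.8% figure matches the roughly 6% improvement quoted above. The same ratio for Jumbo Frames is only a theoretical ceiling, since the packet rate is not the only limit on a real transfer.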

TCP Option Processing

Since we have established that the TCP Window Scale and SACK options can significantly improve the performance of TCP flows, it is advisable not to clear them on the FWSM. By default, each FWSM context permits these options. You can use the show run sysopt command to ensure that the following lines are present:

FWSM# show run sysopt
[…]
sysopt connection tcp window-scale
sysopt connection tcp sack-permitted

TCP Sequence Number Randomization and SACK

Even when TCP SACK is permitted through the FWSM, there is a problem introduced by the TCP Sequence Number Randomization feature that is enabled by default. The feature hides the sequence numbers generated by the endpoints behind the higher-security interface by shifting them by a certain value (determined in a random fashion for each TCP connection). However, the feature does not rewrite the left and right edge values embedded into the TCP SACK option. As a result, a TCP ACK requesting selective retransmission that traverses from a lower- to a higher-security interface makes no sense to the inside endpoint (since the TCP sequence numbers embedded into the SACK option represent the “randomized” values known only on the outside of the FWSM).

Consider an example of such a TCP ACK segment as seen on the inside of the FWSM. The TCP ACK is requesting retransmission of the TCP segment with the sequence number of 3973898807. This number actually makes sense to the inside host since it was “de-randomized” by the FWSM on the way in. However, the embedded TCP SACK option confirms receipt of the segments from 1069277089 through 1069277090. These sequence numbers represent the “randomized” values and hence make no sense to the inside host. As a result, the inside host ignores TCP SACK and retransmits the entire stream of data, thus wasting the bandwidth.

Since TCP Sequence Number Randomization is a legacy feature that was supposed to protect hosts that use predictable algorithms for initial TCP sequence number generation, it does not provide much additional security on modern TCP stacks. Hence, the feature can be selectively disabled to take full advantage of TCP SACK and achieve the maximum throughput on a single TCP flow. The best way to disable the randomization is to use Modular Policy Framework (MPF); you can also narrow the class down to just those trusted hosts that do the high-speed transfers:

class-map TCP
 match port tcp range 1 65535
policy-map global_policy
 class TCP
  set connection random-sequence-number disable
service-policy global_policy global
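The mismatch between the rewritten ACK field and the untouched SACK edges can be reproduced with simple modular arithmetic. This is a sketch under one stated assumption: the two example segments in this document are treated as the same ACK seen on the outside (ACK 1069276099) and on the inside (ACK 3973898807) of the FWSM, which lets the per-connection shift be derived from them.

```python
MOD = 2 ** 32  # TCP sequence numbers are 32 bits wide and wrap around


def derandomize(seq, offset):
    """Translate an outside (randomized) sequence number to the inside view."""
    return (seq - offset) % MOD


# Assumption: both values describe the same ACK on either side of the FWSM.
ack_outside = 1069276099
ack_inside = 3973898807
offset = (ack_outside - ack_inside) % MOD   # per-connection random shift

sack_outside = (1069277089, 1069277090)

# What the FWSM does: rewrite the ACK field only and leave the SACK edges
# alone, so the inside host sees edges far from its own sequence space.
print(derandomize(ack_outside, offset), sack_outside)

# What the inside host would need: SACK edges shifted by the same offset.
sack_inside = tuple(derandomize(edge, offset) for edge in sack_outside)
print(sack_inside)
```

With randomization disabled the offset is zero, so the ACK field and the SACK edges stay in the same sequence space and the inside host can act on them.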

TCP Reordering

Yet another factor that can negatively impact TCP flow performance is packet reordering. When multiple paths between the endpoints are used and load-balancing is deployed, it is possible for the receiver to get TCP segments out of order. Sometimes, such a condition can be mistakenly recognized as packet loss, resulting in unnecessary retransmissions and a reduction in throughput. Due to the parallel processing architecture, the FWSM itself may put certain TCP segments out of order. This is especially true for those flows that involve smaller packets within a batch of larger ones. To combat this undesirable behavior, the FWSM contains a module called the NP Completion Unit that ensures that the packets leave the NPs in the same order that they came in. It should be noted that it will only preserve the ingress order and not correct the out-of-order conditions introduced before the FWSM. Furthermore, it will not be able to preserve the order of TCP segments flowing through the Control Point or of traffic processed by the FWSM capture feature. While the Completion Unit may introduce minor latency into the packet processing path, the typical performance improvements significantly outweigh this side effect. The Completion Unit is disabled by default but can be enabled globally (from within the admin context if running in multiple-context mode) with the sysopt np completion-unit command:

FWSM(config)# sysopt np ?

configure mode commands/options:
  completion-unit  Set Completion-unit on FP NPs
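The idea of releasing packets in ingress order can be modeled in a few lines. This is a conceptual sketch only, not a description of how the NP Completion Unit is implemented: packets are tagged with their arrival order, parallel processing may finish them in any order, and a packet is held until all earlier ones have been released.

```python
def release_in_ingress_order(completions):
    """Yield packets in ingress order given out-of-order completions.

    completions -- iterable of ingress tags (0, 1, 2, ...) in the order the
                   parallel workers finished processing the packets
    """
    pending = set()
    next_tag = 0
    for tag in completions:
        pending.add(tag)
        # Release every packet whose predecessors have all been released.
        while next_tag in pending:
            pending.remove(next_tag)
            yield next_tag
            next_tag += 1


# Packets 3 and 4 finish early but are held until packet 2 completes.
print(list(release_in_ingress_order([1, 0, 3, 4, 2])))  # [0, 1, 2, 3, 4]
```

The holding step is where the minor added latency mentioned above comes from in this model.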

Additionally, ensure that the FWSM packet capture functionality is disabled on the high-bandwidth flows as it negates the effect of the Completion Unit. The Switched Port Analyzer (SPAN) feature on the switch should be used for any performance-related FWSM troubleshooting tasks instead.

Summary

To achieve the maximum single TCP flow performance when going through an FWSM, one should implement the following:

Use the optimal TCP window size as well as TCP Window Scale and SACK mechanisms on the endpoints.
Ensure TCP Window Scale and SACK options are not cleared by the FWSM.
Increase the default limit or disable TCP MSS adjustment on the FWSM.
Disable TCP Sequence Number Randomization for the high-bandwidth flows on the FWSM.
Enable NP Completion Unit on the FWSM.
Ensure that the traffic is not being captured on the FWSM itself.
Use Jumbo Frames end to end.

Sample Performance

All tests were run with iPerf using a 256 Kbyte TCP window size between two test hosts connected to 1 Gbps ports on a single Cisco 6509 switch. The FWSM is running 4.0(12) software. Bear in mind that individual results may vary depending on the specific hardware and software levels used as well as the traffic patterns and the amount of other load on the FWSM.

Test Case Description	Transfer Size (Gbytes)	Bandwidth (Mbits/sec)

Control Environment: no FWSM in the traffic path	2.85	817
Default FWSM Configuration: interface MTU set to 1500 bytes; TCP MSS adjusted to 1380 bytes; TCP Window Scale and SACK permitted; TCP Sequence Number Randomization enabled; NP Completion Unit disabled	1.60	458
Optimized FWSM Configuration: interface MTU set to 1500 bytes; TCP MSS adjusted to 1460 bytes; TCP Window Scale and SACK permitted; TCP Sequence Number Randomization disabled; NP Completion Unit enabled	2.15	615
Optimized FWSM Configuration With Jumbo Frames: interface MTU set to 8500 bytes; TCP MSS adjustment disabled; TCP Window Scale and SACK permitted; TCP Sequence Number Randomization disabled; NP Completion Unit enabled	2.40	686
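To put the measured bandwidth figures side by side, the short script below expresses each result relative to the control environment and to the default FWSM configuration. It only restates the numbers from the table; the dictionary keys are shortened labels chosen for illustration.

```python
# Bandwidth results from the table above, in Mbits/sec.
results_mbps = {
    "control (no FWSM)": 817,
    "default FWSM": 458,
    "optimized FWSM": 615,
    "optimized FWSM + jumbo frames": 686,
}

control = results_mbps["control (no FWSM)"]
default = results_mbps["default FWSM"]

for name, mbps in results_mbps.items():
    print(f"{name}: {mbps / control:.0%} of control, "
          f"{mbps / default - 1:+.0%} vs. default")
```

In this particular test, the default configuration reaches about 56% of the control throughput, the optimized configuration about 75%, and the optimized configuration with Jumbo Frames about 84%.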


golly_wog · ‎08-24-2010

This information is like golddust!

Thank you! :-)

Andrew Ossipov · ‎08-24-2010

Thank you for the kind comments, Golly!

Andrew

thiland · ‎10-12-2010

Finally a clear and much needed explanation for FWSM tuning in today's data centers! Contains all of the info I need for a change request

patoberli · ‎02-03-2011

Thanks for this good post!

I have some questions though.

What would happen if I disable TCP MSS adjustment, but leave the MTU on 1500?

And are there any applications that could break because of this configuration?

Andrew Ossipov · ‎02-03-2011

If the TCP MSS adjustment is disabled on the FWSM, the hosts would advertise it normally (just like they would if there was no FWSM in the path). Unless there is an underlying problem in the network where one needs to artificially limit the payload of a transit TCP segment, there should be no impact.

Andrew

patoberli · ‎02-04-2011

Thanks for the answer.

One more question, to disable the adjustment, is it either

no sysopt connection tcpmss

or sysopt connection tcpmss 65535

or sysopt connection tcpmss 0

Thanks for your help!

Patrick

Shobith K · ‎02-04-2011

I have implemented the third option without any problems (Optimized FWSM Configuration) and the throughput for data transfer has increased three times. Good document !

Andrew Ossipov · ‎02-04-2011

You should use 'sysopt connection tcpmss 0' to disable the adjustment. If you use 'no sysopt connection tcpmss' command, it will default to 1380.

Andrew

Andrew Ossipov · ‎02-04-2011

Thank you for the feedback! Glad that it was helpful.

aostberg · ‎05-10-2011

"TCP Sequence Number Randomization is a legacy feature that was supposed to protect hosts that use predictable algorithms for initial TCP sequence number generation"

So if I read this correctly, we could potentially break some legacy apps by turning off the randomization.

I guess my question really is, are there any negative side effects to turning off the randomization?

Andrew Ossipov · ‎05-10-2011

It will not break any applications, but it may expose those TCP stacks that use a very predictable (such as sequential) assignment of initial sequence numbers to external attackers.

sachin.sharma711 · ‎07-19-2013

Great insight document

Thanks

noli.pineda · ‎09-28-2013

Hi Andrew,

Thanks for sharing this very good article.

I have a question though on disabling the TCP Sequence Number Randomization feature. I can see that your example above was applied to the global policy. Can this feature also be disabled in a per-interface policy?

I did a test configuration on a dev firewall but the interface doesn't seem to pick up the setting.

Thanks in anticipation and looking forward to your response.

Andrew Ossipov · ‎09-30-2013

Hi Noli,

The policy can be applied on per-interface basis as well. You may want to open a TAC case to troubleshoot your issue.

Andrew

noli.pineda · ‎09-30-2013

Thanks for the confirmation Andrew!

I will open a TAC case to troubleshoot.