1 Introduction

Panos Kampanakis · ‎09-20-2010

1 Introduction
2 Identification
2.1 Problem nature
2.2 CPU
2.3 Interfaces
2.4 Load
3 Mitigation / Alleviation
3.1 Processes
3.2 Traffic
3.3 Active/Active failover
3.4 More hardware

1 Introduction

Indications of oversubscription or excessive load on a firewall or other network device are often not enough to prove that oversubscription is really happening, so identifying and solving such issues can be confusing. This document presents the basic troubleshooting steps needed to pinpoint an oversubscription problem on a Cisco FWSM firewall and proposes potential solutions to overcome it. The corresponding document for the ASA is located here.

2 Identification

The most important aspect of solving an oversubscription issue is identifying it. Network engineers will often incorrectly attribute network problems to excessive traffic, which leads to devices like firewalls being wrongly considered the bottleneck. Other times they will focus on other parts of the network in cases where the firewall's processing power is not enough to handle the traffic. There can be multiple indications of load problems on firewall devices, and putting them together will help us understand whether traffic is indeed the cause of the problem or whether we should focus elsewhere. That is what this section describes.

2.1 Problem nature

Oversubscription almost never shows up by itself. Most of the time it presents as another network problem that results from it, such as packet loss, slow response or drops. In general, an oversubscribed device that can't handle the load will inevitably drop some packets. Packet drops will affect sensitive applications or will cause TCP retransmissions, and they affect the user experience by making transactions appear to take more time to complete. If we wanted to summarize the problems that occur due to excessive load, we would describe them as network degradation. Of course, we must be careful NOT to attribute every problem that falls under the "degradation umbrella" to load. The indications presented below will help in deciding whether such issues should be attributed to excessive load.

2.2 CPU

A "busy" firewall device will almost always show it on its CPU. We can check the CPU use with the command "show cpu".

FWSM# show cpu

CPU utilization for 5 seconds = 14%; 1 minute: 10%; 5 minutes: 10%

CPU utilization that stays above 80%-90% could indicate high traffic load.
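As a minimal sketch of automating this check, the snippet below parses a captured "show cpu" line and compares the averages against a threshold. The function names are hypothetical, and the 80% default is simply the lower end of the range mentioned above, not an official limit.

```python
import re

# Matches the "show cpu" line format shown in this document.
CPU_RE = re.compile(
    r"CPU utilization for 5 seconds = (\d+)%; 1 minute: (\d+)%; 5 minutes: (\d+)%"
)

def parse_show_cpu(output):
    """Return (five_sec, one_min, five_min) percentages, or None if no match."""
    match = CPU_RE.search(output)
    if match is None:
        return None
    return tuple(int(value) for value in match.groups())

def cpu_is_high(output, threshold=80):
    """True when the 1-minute or 5-minute average is at or above threshold.

    The default threshold is an assumption based on the 80%-90% range above.
    """
    parsed = parse_show_cpu(output)
    if parsed is None:
        raise ValueError("no CPU utilization line found")
    _, one_min, five_min = parsed
    return one_min >= threshold or five_min >= threshold

sample = "CPU utilization for 5 seconds = 14%; 1 minute: 10%; 5 minutes: 10%"
print(parse_show_cpu(sample))  # (14, 10, 10)
print(cpu_is_high(sample))     # False
```

The 5-second value is deliberately ignored in the comparison so that a short spike does not trigger the check; that is a design choice of this sketch.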

However, the FWSM's architecture differs from that of an ASA firewall: its CPU does not handle regular packet processing as it does on an ASA. As a result, an FWSM could show low CPU while being oversubscribed. The reason is that on an FWSM the traffic is handled by three network processors (NP1, NP2, NP3) that are not included in the "sh cpu" output. On the FWSM only, we can therefore also check the aforementioned network processors' block thresholds.

FWSM# sh np block

                  MAX    FREE  THRESH_0  THRESH_1   THRESH_2
NP1 (ingress)   32768   32768      1234    242333   34343434
    (egress)   521206  521206         0         0          0
NP2 (ingress)   32768   32768         0         0          0
    (egress)   521206  521206         0         0          0
NP3 (ingress)   32768   32768      2333     44443  324434354
    (egress)   521206  521206         0         0          0

Counters that keep increasing for these thresholds show that the processors are getting close to their limits. The NP blocks and what they mean are explained here.
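Because it is the increase of the threshold counters that matters, it helps to compare two captures of "sh np block" taken some time apart. The following is a rough sketch under stated assumptions: the helper names are hypothetical, the row layout is assumed to match the output shown above, and the second capture is invented for illustration.

```python
import re

# One row: optional "NPx" label, direction, MAX, FREE, THRESH_0..THRESH_2.
ROW_RE = re.compile(
    r"^\s*(?:(NP\d+)\s+)?\((ingress|egress)\)"
    r"\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s*$"
)

def parse_np_block(output):
    """Map (np, direction) -> (thresh_0, thresh_1, thresh_2)."""
    counters = {}
    current_np = None
    for line in output.splitlines():
        match = ROW_RE.match(line)
        if match is None:
            continue  # header, prompt, or blank line
        np_name, direction, _max, _free, t0, t1, t2 = match.groups()
        if np_name:
            current_np = np_name
        if current_np is None:
            continue  # egress row seen before any NP label
        counters[(current_np, direction)] = (int(t0), int(t1), int(t2))
    return counters

def increasing_thresholds(before, after):
    """Return the (np, direction) keys whose threshold counters grew."""
    grown = []
    for key, new in after.items():
        old = before.get(key, (0, 0, 0))
        if any(n > o for n, o in zip(new, old)):
            grown.append(key)
    return sorted(grown)

first_capture = """NP1 (ingress) 32768 32768 1234 242333 34343434
(egress) 521206 521206 0 0 0
NP3 (ingress) 32768 32768 2333 44443 324434354
(egress) 521206 521206 0 0 0"""

# Hypothetical later capture: only NP3 ingress THRESH_0 has moved.
second_capture = first_capture.replace("2333 44443", "2400 44443")

print(increasing_thresholds(parse_np_block(first_capture),
                            parse_np_block(second_capture)))
# [('NP3', 'ingress')]
```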

2.3 Interfaces

The FWSM interface statistics cannot be used to check whether traffic can be handled by the device. These interfaces do not physically exist on the device. They are "virtual" interfaces between the switch and the FWSM backplane (6 Gigabit interfaces forming an EtherChannel), so they will not show errors, overruns or underruns even if the FWSM is oversubscribed due to high traffic. We can, however, check the "show nic" command output.

FWSM# show nic

interface gb-ethernet0 is up, line protocol is up

Hardware is i82543 rev02 gigabit ethernet, address is 0012.0023.9200

PCI details are - Bus:0, Dev:0, Func:0

MTU 16000 bytes, BW 1 Gbit full duplex

15233374 packets input, 5822183269034098688 bytes, 0 no buffer

Received 0 broadcasts, 0 runts, 0 giants

0 input errors, 0 CRC, 0 frame, 0 overrun, 0 ignored, 0 abort

10486585 packets output, 4727775737442992128 bytes, 0 underruns

input queue (curr/max blocks): hardware (0/25) software (0/0)

output queue (curr/max blocks): hardware (0/13) software (0/0)

interface gb-ethernet1 is up, line protocol is up

Hardware is i82543 rev02 gigabit ethernet, address is 0012.0023.9200

PCI details are - Bus:0, Dev:0, Func:0

MTU 16000 bytes, BW 1 Gbit full duplex

8849745 packets input, 4807174976778010624 bytes, 0 no buffer

Received 0 broadcasts, 0 runts, 0 giants

0 input errors, 0 CRC, 0 frame, 0 overrun, 0 ignored, 0 abort

4 packets output, 1477468749824 bytes, 0 underruns

input queue (curr/max blocks): hardware (0/16) software (0/0)

output queue (curr/max blocks): hardware (0/1) software (0/0)

This command shows the interfaces between the CPU and NP3 (mentioned above). Seeing "no buffer" or "error" counters going up could mean excessive control plane traffic between the CPU and NP3, which could point to oversubscription.
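To watch whether these counters are going up, two captures of "show nic" can be compared. The sketch below is only illustrative: the function names are hypothetical, it reads just the "no buffer", "input errors", "overrun" and "underruns" fields from the line format shown above, and the second capture is made up.

```python
import re

# Counter label -> pattern, based on the "show nic" lines in this document.
PATTERNS = (
    ("no_buffer", r"(\d+) no buffer"),
    ("input_errors", r"(\d+) input errors"),
    ("overrun", r"(\d+) overrun"),
    ("underruns", r"(\d+) underruns"),
)

def parse_show_nic(output):
    """Map interface name -> dict of the error-related counters."""
    stats = {}
    current = None
    for line in output.splitlines():
        header = re.match(r"\s*interface (\S+) is", line)
        if header:
            current = header.group(1)
            stats[current] = {}
            continue
        if current is None:
            continue
        for label, pattern in PATTERNS:
            found = re.search(pattern, line)
            if found:
                stats[current][label] = int(found.group(1))
    return stats

def growing_counters(before, after):
    """List (interface, counter, delta) for every counter that increased."""
    grown = []
    for iface, counters in after.items():
        for label, value in counters.items():
            delta = value - before.get(iface, {}).get(label, 0)
            if delta > 0:
                grown.append((iface, label, delta))
    return grown

first_capture = """interface gb-ethernet0 is up, line protocol is up
15233374 packets input, 5822183269034098688 bytes, 0 no buffer
0 input errors, 0 CRC, 0 frame, 0 overrun, 0 ignored, 0 abort
10486585 packets output, 4727775737442992128 bytes, 0 underruns"""

# Hypothetical later capture in which "no buffer" has started to climb.
second_capture = first_capture.replace("0 no buffer", "12 no buffer")

print(growing_counters(parse_show_nic(first_capture),
                       parse_show_nic(second_capture)))
# [('gb-ethernet0', 'no_buffer', 12)]
```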

2.4 Load

Next it is worth checking the traffic that the device is seeing. We need to clear the traffic statistics ("clear traffic" command) before checking them ("show traffic" command). We do that because we want to see the traffic while the problem is occurring and thus be able to tell whether load is related to the problem under investigation. Without clearing, the aggregate output of "show traffic" covers the whole period since the last reload or the last time the counters were cleared, so it will not help us identify how much traffic the box is seeing during the time we are troubleshooting. After the "clear traffic" we let the box collect statistics for 2-5 minutes and then run "show traffic" to get the traffic the interfaces saw.

FWSM# clear traffic

...

...5 minutes go by...

...

FWSM# show traffic

..

int2:

received (in 2090512.330 secs):

2327338 packets 319964508 bytes

1 pkts/sec 1 bytes/sec

transmitted (in 2090512.330 secs):

2327498 packets 338246 bytes

23323 pkts/sec 24324456 bytes/sec

int3:

received (in 2090512.330 secs):

1858298 packets 139580776 bytes

32422 pkts/sec, 291777777 bytes/sec

transmitted (in 2090512.330 secs):

139235118 packets 103732100 bytes

234242343 pkts/sec, 2232423237 bytes/sec

Monitoring tools and NetFlow can also help identify traffic and connection rates.
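The packet and byte totals that "show traffic" reports for the collection window can be turned into rates by hand. The following is a small sketch of that arithmetic; the function name and the sample numbers (a 300-second window) are invented for illustration and are not taken from the output above.

```python
def interface_rates(packets, byte_count, seconds):
    """Return (packets/sec, bits/sec, average packet size in bytes)."""
    if seconds <= 0:
        raise ValueError("collection window must be positive")
    pps = packets / seconds
    bps = byte_count * 8 / seconds
    avg_packet_bytes = byte_count / packets if packets else 0.0
    return pps, bps, avg_packet_bytes

# Hypothetical totals collected during 5 minutes after "clear traffic".
pps, bps, avg_size = interface_rates(
    packets=30_000_000, byte_count=15_000_000_000, seconds=300
)
print(f"{pps:,.0f} pkts/sec, {bps / 1e6:,.0f} Mbps, {avg_size:.0f}-byte average packet")
# 100,000 pkts/sec, 400 Mbps, 500-byte average packet
```

The average packet size is worth keeping next to the bit rate, since small packets load a firewall more than the bit rate alone suggests.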

For the FWSM only, we could also check the input and output packets of the port-channel ports <slot of fwsm>/1-6. For example, if the FWSM is in slot 3, we would use SNMP to check the packets that ports Gig3/1-6 are seeing in order to measure the load that is pushed to the Firewall Services Module. We can then calculate the aggregate throughput the device is passing by adding up the traffic that all of these interfaces saw, and we will be able to see whether it is being pushed to its limits.

To do that we need to check the device specs. As the FWSM datasheet mentions "Cisco Firewall Services Module (FWSM)—a high-speed, integrated firewall module for Cisco Catalyst 6500 switches and Cisco 7600 Series routers—provides the fastest firewall data rates in the industry: 5-Gbps throughput, 100,000 CPS, and 1M concurrent connections.". Also, the FWSM 2.2 guide mentioned "With 64-byte Ethernet frames, the FWSM supports 2.84 Mpps throughput; with 1500-byte frames, the FWSM supports 5.456 Gbps throughput". The FWSM specifications can be found under the Reference sections of the configuration guides here.
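The aggregate calculation across the six backplane ports can be sketched as follows. This assumes two polls of 64-bit octet counters (for example, input octets) taken a known number of seconds apart; the port names follow the slot 3 example above, and the counter values and function names are made up for illustration. How the counters are polled is left out.

```python
COUNTER64_MODULUS = 2 ** 64  # assumption: 64-bit octet counters

def counter_delta(old, new, modulus=COUNTER64_MODULUS):
    """Difference between two counter samples, allowing for one wrap."""
    return (new - old) % modulus

def aggregate_gbps(first_poll, second_poll, interval_seconds):
    """Sum the per-port octet deltas and convert to Gbps."""
    if interval_seconds <= 0:
        raise ValueError("interval must be positive")
    total_octets = sum(
        counter_delta(first_poll[port], second_poll[port]) for port in first_poll
    )
    return total_octets * 8 / interval_seconds / 1e9

ports = [f"Gi3/{n}" for n in range(1, 7)]

# Hypothetical counter samples taken 300 seconds apart.
first_poll = {port: 1_000_000 for port in ports}
second_poll = {port: 1_000_000 + 18_750_000_000 for port in ports}

gbps = aggregate_gbps(first_poll, second_poll, 300)
print(f"{gbps:.2f} Gbps, {gbps / 5.0:.0%} of the 5-Gbps datasheet figure")
# 3.00 Gbps, 60% of the 5-Gbps datasheet figure
```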

Whether a firewall or any other device is hitting its traffic processing limits can be the subject of long discussions. Experience has shown that there is controversy over what the numbers show and what engineers consider as being close to the numbers or not. It is worth clarifying a few points.

Let's use the FWSM as an example. Its rated throughput is 5.5Gbps. So the question is, "if my FWSM sees about 5Gbps is it at its limits or not?". A quick answer would be "No". However, we must not forget that there are many factors involved in this question. In the network industry, the rated speeds of devices are measured under certain tests. These tests are repeated and an average is presented as the maximum speed. "Real-world" traffic, though, is not always the same as the traffic used in the tests. Usually, the rated speed tests involve stateless protocols with big packets. For a TCP web browsing application, the packets are much smaller, and TCP uses ACKs and is a "synchronized" protocol by nature. That adds more load to the firewall itself, which makes its maximum throughput drop. On top of that, if the FWSM has HTTP inspection configured (which does deep packet inspection for HTTP), its maximum processing throughput would be less than 5Gbps.

So even though 5.5Gbps is indeed the throughput the device can achieve, its real-world throughput, depending on applications, traffic nature and configuration, could in practice be less. That is why in our performance documents we also try to provide other metrics. These include "packets per second" (pps) and what is often seen as "real-world HTTP".
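The effect of packet size can be worked out from the two figures quoted from the FWSM 2.2 guide above (2.84 Mpps with 64-byte frames, 5.456 Gbps with 1500-byte frames). The sketch below simply converts between packets per second and bits per second; it counts only the frame bytes and ignores any per-frame overhead on the wire, which is a simplifying assumption.

```python
def gbps_from_pps(pps, frame_bytes):
    """Bit rate in Gbps for a packet rate at a fixed frame size."""
    return pps * frame_bytes * 8 / 1e9

def pps_from_gbps(gbps, frame_bytes):
    """Packet rate for a bit rate in Gbps at a fixed frame size."""
    return gbps * 1e9 / (frame_bytes * 8)

small_frame_gbps = gbps_from_pps(2.84e6, 64)
large_frame_pps = pps_from_gbps(5.456, 1500)

print(f"{small_frame_gbps:.3f} Gbps")  # 1.454 Gbps
print(f"{large_frame_pps:,.0f} pps")   # 454,667 pps
```

In other words, at the quoted packet rate for 64-byte frames the module moves roughly 1.45 Gbps, far below the 5-Gbps headline number, which is why the pps metric matters for traffic made of small packets.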

Another consideration on top of traffic load for firewall devices is connections and connection rates. That is another area that could trigger various disagreements. The commands we would use to see the connections on our firewall are "show conn count" and "show resource usage".

FWSM# show conn count

2 in use, 86 most used

FWSM# show resource usage

Resource Current Peak Limit Denied Context

Telnet 1 1 5 0 System

Syslogs [rate] 1 293 N/A 0 System

Conns 2 86 10000 0 System

Xlates 5 116 N/A 0 System

Hosts 6 49 N/A 0 System

FWSM-multi-context# show resource usage

Resource Current Peak Limit Denied Context

SSH 1 1 15 0 admin

Syslogs [rate] 118 348 unlimited 0 context1

Conns 89 893 unlimited 0 context1

Xlates 150 1115 unlimited 0 context1

Hosts 15 18 unlimited 0 context1

Conns [rate] 603 14694 unlimited 0 context1

...
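To compare a peak value from this output with a specification number, the rows can be parsed and the peak expressed as a fraction of a reference figure. This is only a sketch: the function names are hypothetical, the row layout is assumed to match the output above, and the 100,000 reference is the CPS figure from the datasheet quote earlier in this document.

```python
import re

# Resource name (may contain spaces), Current, Peak, Limit, Denied, Context.
ROW_RE = re.compile(r"^\s*(.+?)\s+(\d+)\s+(\d+)\s+(\S+)\s+(\d+)\s+(\S+)\s*$")

def parse_resource_usage(output):
    """Return a list of dicts, one per resource row."""
    rows = []
    for line in output.splitlines():
        match = ROW_RE.match(line)
        if match is None:
            continue  # header, prompt, or "..." line
        name, current, peak, limit, denied, context = match.groups()
        rows.append({
            "resource": name,
            "current": int(current),
            "peak": int(peak),
            "limit": limit,  # kept as text: may be "N/A" or "unlimited"
            "denied": int(denied),
            "context": context,
        })
    return rows

def peak_fraction(rows, resource, reference):
    """Peak value of a resource as a fraction of a reference figure."""
    for row in rows:
        if row["resource"] == resource:
            return row["peak"] / reference
    raise KeyError(resource)

sample = """Resource Current Peak Limit Denied Context
Conns 89 893 unlimited 0 context1
Conns [rate] 603 14694 unlimited 0 context1"""

rows = parse_resource_usage(sample)
print(f"{peak_fraction(rows, 'Conns [rate]', 100_000):.1%}")  # 14.7%
```

A low percentage here does not by itself rule out oversubscription, because the figure is an average over an interval.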

Now, let's ask one more question about the output from our FWSM above: "The peak connection rate I see is about 15K connections per second, and the specifications say that the maximum supported rate is 100K conns/second. 15K is much less than 100K, so why do I see NP3 threshold counters increasing, showing me that I am overloading Network Processor 3, which handles connection establishment?". To answer that question, we need to keep in mind that the rate mentioned in the specifications is the average rate over one full second. To explain it better, here are a few examples:

Let's say we have a stable rate of 100K new conns per second. This connection rate conforms to the FWSM limits.
Now let's say we have 1000K new conns per 10 seconds. That is also a rate of 100K per second and conforms to the FWSM limits.
Now let's say we have 200K new conns in 1 second and 800K over the next 9 seconds. That totals 1000K per 10 seconds, which is an average of 100K per second and conforms to 100K conns/second. But the FWSM was oversubscribed for the 1 second during which it was seeing a rate of 200K/second.

So, bursts of traffic or connections can affect the performance of a firewall even if the averages over time do not seem to exceed the limits.
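The third example above can be checked numerically. In this sketch the 800K connections of the last 9 seconds are spread almost evenly across those seconds, which is an assumption made only so that the numbers add up to exactly 1000K; the function names are hypothetical.

```python
def average_rate(per_second_counts):
    """Average new connections per second over the whole window."""
    return sum(per_second_counts) / len(per_second_counts)

def seconds_over_limit(per_second_counts, limit):
    """Indexes of the seconds in which the count exceeded the limit."""
    return [i for i, count in enumerate(per_second_counts) if count > limit]

LIMIT = 100_000  # conns/second figure from the specifications quoted above

steady = [100_000] * 10
# 200K in the first second, then 800K in total over the next 9 seconds.
bursty = [200_000] + [88_889] * 8 + [88_888]

print(average_rate(steady), seconds_over_limit(steady, LIMIT))  # 100000.0 []
print(average_rate(bursty), seconds_over_limit(bursty, LIMIT))  # 100000.0 [0]
```

Both windows have the same average, but only the per-second view reveals the second in which the rate was double the limit.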

Additionally, having few connections through the box does not necessarily mean that traffic is low. Theoretically speaking, we could have 10 connections passing 1Gbps each, and thus oversubscribe an FWSM with very few conns.

3 Mitigation / Alleviation

It is equally important to mention options for overcoming an oversubscription issue. Keep in mind that if a device is oversubscribed, it is usually best to add more processing power by using more devices or more powerful ones. There might be cases, though, where we can get by with workarounds after identifying the root cause and the traffic profiles. Determining the causes of oversubscription/excessive load should rely on external tools and traffic analysis.

3.1 Processes

When the CPU is high, we can try to see where it is being spent, and then we might be able to relieve it of the process that takes the most CPU cycles. We can collect the output of the "show process" command, wait for 1 minute and collect it once more.

ASA# show process

PC SP STATE Runtime SBASE Stack Process

Lwe 0805510c d52a0cf4 09fbeed8 0 d529edf0 7544/8192 block_diag

Mrd 081beaa4 d52d087c 09fbe438 873 d52b0a38 123848/131072 Dispatch Unit

Msi 08f6348f d5784f8c 09fbde4c 13 d5783088 7792/8192 y88acs06 OneSec Thread

Mwe 08068bc6 d578938c 09fbde4c 0 d57874e8 7576/8192 Reload Control Thread

Mwe 08070976 d5794314 09fc07f8 0 d5790760 12496/16384 aaa

Mwe 08d094ed d60111ec 09fbde4c 4 d57948e8 6872/8192 UserFromCert Thread

Mwe 08c331eb d57987f4 d57d47d0 0 d5796a70 6920/8192 Boot Message Proxy Process

Mwe 080a49f6 d579d37c 09fc0854 107 d5799488 8968/16384 CMGR Server Process

Mwe 080a4f05 d579f4a4 09fbde4c 20 d579d610 7696/8192 CMGR Timer Process

Lwe 081bdecc d57a8b9c 09fceba8 0 d57a6c98 7216/8192 dbgtrace

Mwe 08498525 d57b11c4 09fbde4c 172 d57af440 4712/8192 eswilp_svi_init

Msi 0861af45 d57c4734 09fbde4c 28 d57c2850 6952/8192 MUS Timeout Check Thread

Mwe 08d094ed d5a3845c 09fbde4c 0 d57cb0e0 7016/8192 netfs_thread_init

Mwe 09378625 d57d952c 09fbde4c 0 d57d76d8 7612/8192 Chunk Manager

Mwe 08932ea4 d57eadfc 09ebdb4c 0 d57e8ef8 7904/8192 IP Address Assign

Mwe 089c501f d597faa4 09ebebd0 0 d597dba0 7904/8192 Client Update Task

Lwe 093c1dba d5984404 09fbde4c 685 d5980570 15888/16384 Checkheaps

Mwe 08b9e1f2 d5994bf4 09fbde4c 1 d598cd80 31888/32768 Session Manager

Mwe 08cb45b5 d599aae4 d7cbd3b0 4 d5997090 14312/16384 uauth

Mwe 08c52475 d599d11c 09f0f884 0 d599b218 7376/8192 Uauth_Proxy

Msp 08c893ce d59a35b4 09fbde4c 2 d59a16b0 7792/8192 SSL

Mwe 08cb1f46 d59a5754 09f15434 0 d59a3870 7272/8192 SMTP

Mwe 08caac96 d59a98dc 09f15398 30 d59a59f8 15096/16384 Logger

Mwe 08cab4c5 d59ab9f4 09fbde4c 0 d59a9b80 7728/8192 Syslog Retry Thread

Mwe 08ca511e d59adb9c 09fbde4c 0 d59abd08 7192/8192 Thread Logger

Mwe 08e9c492 d59d83a4 09f492e8 0 d59d64c0 7040/8192 vpnlb_thread

...

We can then compute the difference in the "Runtime" column for each process between the two outputs (keep in mind that a process might show up twice or more). Sorting the differences from largest to smallest shows the processes that take most of the CPU. For example, we might see an inspection process or the logging process taking most of the CPU. In such cases we can disable the inspections if they are not needed, or turn down the logging level, and save some CPU for the device. Please note that processes like "Dispatch_Unit" and "interface polling" relate to regular packet processing, and there is not much that can be done to relieve the CPU of them.
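The Runtime comparison can be sketched as below. The helper names are hypothetical and the line layout is assumed to match the "show process" output shown above. Runtime values of processes that appear more than once are added together, which is a choice made for this sketch, and the second capture is invented for illustration.

```python
import re
from collections import defaultdict

# Fields: state, PC, SP, STATE, Runtime, SBASE, Stack (used/total), Process.
PROCESS_RE = re.compile(
    r"^\s*\S+\s+[0-9a-f]+\s+[0-9a-f]+\s+[0-9a-f]+\s+(\d+)"
    r"\s+[0-9a-f]+\s+\d+/\d+\s+(.+?)\s*$",
    re.IGNORECASE,
)

def runtimes(output):
    """Map process name -> total Runtime, summing duplicate names."""
    totals = defaultdict(int)
    for line in output.splitlines():
        match = PROCESS_RE.match(line)
        if match:
            totals[match.group(2)] += int(match.group(1))
    return dict(totals)

def busiest(first_output, second_output):
    """Runtime increase per process, largest first."""
    before = runtimes(first_output)
    after = runtimes(second_output)
    diffs = {name: after[name] - before.get(name, 0) for name in after}
    return sorted(diffs.items(), key=lambda item: item[1], reverse=True)

first_capture = """PC SP STATE Runtime SBASE Stack Process
Mrd 081beaa4 d52d087c 09fbe438 873 d52b0a38 123848/131072 Dispatch Unit
Lwe 093c1dba d5984404 09fbde4c 685 d5980570 15888/16384 Checkheaps
Mwe 08caac96 d59a98dc 09f15398 30 d59a59f8 15096/16384 Logger"""

# Hypothetical capture one minute later.
second_capture = """PC SP STATE Runtime SBASE Stack Process
Mrd 081beaa4 d52d087c 09fbe438 1473 d52b0a38 123848/131072 Dispatch Unit
Lwe 093c1dba d5984404 09fbde4c 690 d5980570 15888/16384 Checkheaps
Mwe 08caac96 d59a98dc 09f15398 130 d59a59f8 15096/16384 Logger"""

for name, delta in busiest(first_capture, second_capture):
    print(f"{delta:>6}  {name}")
```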

3.2 Traffic

If the traffic hitting the firewall is excessive, we can also try to send only the necessary traffic through it. Although this solution is not practical in most setups, there might be cases where alternate routes exist for the traffic and not all packets need to be "firewalled". In such scenarios we can use policy-based routing (PBR) to divert to the firewall only the traffic that needs to be "firewalled".

For the FWSM specifically, we MIGHT be able to change the load balancing on the backplane of the switch for the 6-Gig port-channel between the FWSM and the switch, in order to hash the packets equally to both network processors. This is a viable solution only if one of NP1 and NP2 is oversubscribed and the other is not. We would suggest opening a TAC case to investigate such a solution if only one NP (1 or 2) is increasing its threshold counters.

3.3 Active/Active failover

When two firewalls are used in failover in Active/Standby mode and the Active unit cannot handle the traffic, you might be able to temporarily use an Active/Active setup to share the traffic between both units. You would need to have the firewalls in multi-context mode and have one or more contexts active on the primary unit and one or more contexts active on the secondary. That way both firewalls will be passing traffic (for the contexts in which they are active) and might not be oversubscribed. Remember, though, that if one of the units fails, all contexts (and thus all traffic) will be running on one unit and you will be back to an oversubscribed scenario. Active/Active failover for oversubscription cases should only be used (if used at all) as a temporary solution and with caution, until a permanent solution is put in place.

3.4 More hardware

Finally, the ultimate solution would be to add more hardware to the network or use more firewalls. That way each device would receive only the traffic it can handle and there would be no oversubscription.

louis.beaudoin · ‎01-22-2015

Hi,

This document describes how to get the FWSM statistics from the command line interface with commands like

sh np block

or

show traffic

Is there an equivalent method to collect these statistics over time on a console, with SNMP and a MIB for instance?

Thanks

FWSM oversubscription Troubleshooting