02-22-2012 06:12 PM - edited 03-11-2019 03:34 PM
We have a pair of PIX 535s with Gigabit Ethernet interfaces for the inside/outside networks, running in an active/passive configuration. At peak times we're pushing about 60 Mbps. Over the past few months we've had 2-3 events where traffic stopped passing for up to about a minute. One event occurred during business hours; the other occurred after hours when usage was minimal. The interface is connected to a Catalyst 3560G SFP port, which doesn't show any logs or errors.
Logically, we believe the issue must be with the PIX itself (a hardware or OS issue), the Gigabit Ethernet card in the PIX, or the SFP in the Catalyst 3560G. We plan to start by failing over to the secondary and seeing if the events subside, but we'll have to wait a while to know whether that resolves the issue. If not, we then plan to upgrade to 7.0(8)6 to ensure we're not exposed to any DoS vulnerabilities. We're hesitant to make more changes than necessary because we plan to decommission these units soon.
Any thoughts?
PIX# show log
Nov 28 2011 20:27:16 PIX : %PIX-1-105005: (Primary) Lost Failover communications with mate on interface inside
Nov 28 2011 20:27:16 PIX : %PIX-1-105008: (Primary) Testing Interface inside
Nov 28 2011 20:27:16 PIX : %PIX-1-105009: (Primary) Testing on interface inside Passed
Feb 22 2012 15:55:33 PIX : %PIX-1-105005: (Primary) Lost Failover communications with mate on interface inside
Feb 22 2012 15:55:34 PIX : %PIX-1-105008: (Primary) Testing Interface inside
Feb 22 2012 15:55:34 PIX : %PIX-1-105009: (Primary) Testing on interface inside Passed
PIX# show failover
Failover On
Cable status: Normal
Failover unit Primary
Failover LAN Interface: N/A - Serial-based failover enabled
Unit Poll frequency 15 seconds, holdtime 45 seconds
Interface Poll frequency 15 seconds
Interface Policy 1
Monitored Interfaces 4 of 250 maximum
Version: Ours 7.0(6), Mate 7.0(6)
Last Failover at: 17:45:22 CDT Aug 10 2011
This host: Primary - Active
Active time: 16953870 (sec)
Interface outside (207.71.25.99): Normal
Interface inside (192.168.1.1): Normal
Interface dmz1 (172.29.1.1): Normal
Interface admin1 (172.30.30.222): Link Down (Waiting)
Other host: Secondary - Standby Ready
Active time: 0 (sec)
Interface outside (207.71.25.98): Normal
Interface inside (192.168.1.2): Normal
Interface dmz1 (172.29.1.6): Normal
Interface admin1 (172.30.30.223): Link Down (Waiting)
Stateful Failover Logical Update Statistics
Link : Unconfigured.
TSPIX# show int inside
Interface GigabitEthernet1 "inside", is up, line protocol is up
Hardware is i82543 rev02, BW 1000 Mbps
(Full-duplex), 1000 Mbps(1000 Mbps)
MAC address 0003.47df.847a, MTU 1500
IP address 192.168.1.1, subnet mask 255.255.255.248
29514684464 packets input, 15965565724242 bytes, 257280 no buffer
Received 15643 broadcasts, 0 runts, 0 giants
0 input errors, 0 CRC, 38 frame, 176271 overrun, 0 ignored, 0 abort
0 L2 decode drops
31542559822 packets output, 24913481903406 bytes, 0 underruns
0 output errors, 0 collisions
0 late collisions, 0 deferred
input queue (curr/max blocks): hardware (2/0) software (0/0)
output queue (curr/max blocks): hardware (0/100) software (0/0)
TSPIX# show ver
Cisco PIX Security Appliance Software Version 7.0(6)
Compiled on Tue 22-Aug-06 13:22 by builders
System image file is "flash:/pix706.bin"
Config file at boot was "startup-config"
TSPIX up 270 days 22 hours
failover cluster up 270 days 22 hours
Hardware: PIX-535, 1024 MB RAM, CPU Pentium III 1000 MHz
Flash i28F640J5 @ 0x300, 16MB
BIOS Flash DA28F320J5 @ 0xfffd8000, 128KB
0: Ext: GigabitEthernet0 : address is 0003.47df.8478, irq 255
1: Ext: GigabitEthernet1 : address is 0003.47df.847a, irq 12
2: Ext: Ethernet0 : address is 0003.479b.01cb, irq 255
3: Ext: Ethernet1 : address is 0002.b3d5.2e93, irq 255
Licensed features for this platform:
Maximum Physical Interfaces : 14
Maximum VLANs : 150
Inside Hosts : Unlimited
Failover : Active/Active
VPN-DES : Enabled
VPN-3DES-AES : Enabled
Cut-through Proxy : Enabled
Guards : Enabled
URL Filtering : Enabled
Security Contexts : 2
GTP/GPRS : Disabled
VPN Peers : Unlimited
This platform has an Unrestricted (UR) license.
02-29-2012 01:02 PM
Hi Jordan,
Both failovers (November and February) occurred because the Primary unit stopped receiving failover hello messages from its mate on the inside interface. This would trigger a failover event and since you aren't using stateful failover, all connections would have to rebuild and you would see a brief outage while everything recovered.
If you don't see any link down events on the switch for the ports that connect to the PIX's inside interfaces, the interfaces would have stayed up and the hello packets were likely dropped somewhere in the path. This can certainly happen due to a bad SFP, but I would also look at the PIX interfaces and switch ports to see if there were any CRC, overrun, or underrun errors at the time.
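To check for those errors on both sides of the link, commands along these lines can be run periodically and the counters compared over time (interface names below are examples; substitute the ports that actually connect the PIX inside interfaces):

```
! On the Catalyst 3560G:
show interfaces gigabitEthernet 0/1
show interfaces gigabitEthernet 0/1 counters errors

! On the PIX -- watch the "no buffer" and "overrun" counters:
show interface inside
show traffic
```

If the counters only jump around the time of the failover events, that points at a burst or a busy CPU rather than a steady-state problem.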
Another possibility is that the PIX was too busy to process failover traffic. This could be because of high CPU utilization, memory block depletion, or high traffic bursts/loops. Whatever the problem was, it was extremely brief as you can see from the logs (the interface was marked as failed and recovered within the same second).
-Mike
03-01-2012 06:21 AM
Thanks Mike. Our PIX 535s are connected to SFPs on two Cat 3560s. I checked the Cat 3560 interfaces and we have 0 errors. On the PIXes, we're seeing some "no buffer" and overrun errors. I'm surprised we're seeing these because we're using gigabit interfaces and only hit about 60 Mbps at peak times. We're running PIX OS 7.0. We've been hesitant to upgrade because it's been stable for the most part and we're trying to get these things decommissioned.
Over about 15-16 hours:
PIX# show int outside
Interface GigabitEthernet0 "outside", is up, line protocol is up
Hardware is i82543 rev02, BW 1000 Mbps
(Full-duplex), 1000 Mbps(1000 Mbps)
91005761 packets input, 80904499890 bytes, 50458 no buffer
Received 21397 broadcasts, 0 runts, 0 giants
0 input errors, 0 CRC, 0 frame, 15988 overrun, 0 ignored, 0 abort
TSPIX# show xlate
1753 in use, 6679 most used
Our config is pretty straightforward. We only have an inbound ACL on the outside interface; it's about 600 lines long. And we have about 100 static NATs.
Do I need to clean up this config to simplify packet processing? Is there anything we can do to resolve these overrun errors? I assume that if a failover hello/response packet gets dropped because of the no buffer/overruns, then we'll see the error and experience the behavior that we're seeing.
03-01-2012 06:30 AM
Hi Jordan,
The overruns can certainly cause that behavior. There are 2 main things that can cause overruns:
1) Packets arrive at the interface at a rate faster than the interface buffers can handle
2) The PIX's CPU is too busy to pull packets out of the interface's receive buffer
For #1, this is caused by the packets/sec rate rather than the throughput (Mbps) of the link. One example is a short burst of many small packets: the packets/sec rate is too high for the interface to put all of the packets into the Rx buffer, but the throughput is still very low since the packets are small. If this is the problem, there is not much you can do from a PIX 7.0 standpoint other than reducing the load on the interface.
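Quick arithmetic (a sketch, ignoring Ethernet preamble and inter-frame gap) shows why 60 Mbps of small packets is a very different load than 60 Mbps of full-size frames:

```python
def pps_at(mbps, frame_bytes):
    """Packets/sec needed to sustain a given Mbps at a given frame size."""
    return (mbps * 1_000_000) / (frame_bytes * 8)

large = pps_at(60, 1500)  # full-size frames: ~5,000 pps
small = pps_at(60, 64)    # minimum-size frames: ~117,000 pps

print(round(large), round(small))
```

So a burst of minimum-size packets hits the interface with more than 20x the packet rate of the same throughput in full-size frames, which is how overruns can occur on a gigabit interface that is "only" at 60 Mbps.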
For #2, you would need to see if the CPU utilization of the PIX is very high, or if you see CPU hogs in the output of 'show proc cpu-hog' at the time when the overruns are increasing. If this is the cause, the fix would be to reduce the load on the CPU or get a software fix for the hogs.
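A minimal check for #2 on the PIX would look something like the following (the hog buffer survives until the next reload, so this can be run after the fact):

```
show cpu usage
show proc cpu-hog
show interface inside | include overrun
```

Comparing the cpu-hog timestamps against the times when the overrun counter jumped is what ties the two together.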
You should also consider using stateful failover so users don't need to rebuild all of their connections if a failover occurs. This would help reduce the time of the outage:
http://www.cisco.com/en/US/docs/security/asa/asa70/configuration/guide/failover.html
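With serial (cable-based) failover, enabling stateful failover requires dedicating a spare interface to state replication. A rough sketch of the PIX 7.0 config, with a hypothetical interface and addressing (the unit has unused Ethernet0/Ethernet1 ports that could serve this role):

```
! Hypothetical: dedicate a spare interface to state replication
interface Ethernet0
 description STATE failover link
 no shutdown
!
failover link statelink Ethernet0
failover interface ip statelink 10.0.0.1 255.255.255.252 standby 10.0.0.2
```

The link and addressing here are illustrative only; see the failover guide linked above for the full procedure.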
-Mike
03-01-2012 08:46 AM
It's going to be hard for us to catch the CPU hogs since the events happen rarely. Currently these PIXes are providing NAT for internal IPs. We're working to phase those out so that we can decommission them entirely. That process will take some time. Is there anything we can do in the meantime? For example, would an OS upgrade to 8.2 so we can enable "flow control send on" help any? Would replacing them with ASAs help? I'd like to determine what our options are until we can get the firewalls removed entirely. Thanks.
03-01-2012 08:57 AM
Hi Jordan,
The CPU hog information is stored in a buffer and timestamped so as long as you don't reload the PIX you can go back and look at the output of 'show proc cpu-hog' and see the last few hogs that occurred. You would need to keep an eye on these and check after the overruns increase again (assuming they are not constantly increasing) to see if there were any recent CPU hogs. If this turns out to be the case, I would recommend upgrading to the latest 7.2 or 8.0 image, or opening a TAC case for further investigation.
Flow control would help with issue #1 that I mentioned above, but this is not supported on any PIX version, so you would need an ASA for that.
If the switch(es) that connect to the PIX support Netflow, you may want to consider setting that up with a collector so you can identify the source of the packet bursts (assuming #1 was the issue).
Otherwise, you could consider tuning the interface poll/hold times so that the PIX waits longer before failing over. This might help since the interface communication problem seems to be very brief:
http://www.cisco.com/en/US/docs/security/asa/asa70/configuration/guide/failover.html#wp1073912
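As a sketch of that tuning (exact command options and ranges vary by software version, so verify against the guide above for 7.0), the knobs are the interface poll interval and the interface failure policy:

```
! Require more than one failed interface before triggering failover
! (the value 2 is illustrative; tune to your environment):
failover interface-policy 2
!
! Lengthen the interface poll interval (allowed range is version-dependent):
failover polltime interface 15
```

Since the show failover output already reports a 15-second interface poll, raising the interface policy is likely the more useful of the two here.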
-Mike
03-16-2012 08:25 AM
We did some clean-up on the PIX. We disabled HTTP inspection to minimize perfmon stats and also removed some ACLs, which dropped our ACL length from about 680 lines to 300, in hopes that would cut down on CPU utilization (though we didn't really have CPU utilization issues before; I'm sure this is minimal for the platform already). We also upgraded to 7.2(4)38, and I think that's what made the difference thus far. After a few days we're running clean with no errors. Software queues are 0/0 and hardware queues are 0/36. No CPU hogs, and CPU runs at about 8% during peak hours.