ASA 5520 with multiple contexts becomes unresponsive

igor.hamzic
Level 1

Hi all. We have encountered a peculiar problem with a pair of our ASA 5520 firewalls running two contexts (each context active on a different ASA). What we are seeing is that sometimes, when there is a sudden increase of inbound traffic (mostly HTTP) towards servers behind the firewalls, they seem to go bananas, for lack of a better expression.

They become inaccessible via SSH and traffic drops significantly. The problem is mitigated by disabling one of the monitored failover interfaces (on one of the switches the firewall is connected to) so that both contexts become active on one firewall. After that the firewalls seem to come to their senses and we can enable the switch interface again, but sometimes one of the pair needs to be rebooted to restore full functionality.

To us it seems like a problem with failover and contexts, but we haven't been able to pin it down. The failover link isn't stateful, and when we tested failover it worked fine both ways, with each ASA taking up the full load when the other ASA of the pair was unavailable.

Did anyone come across a similar situation with their firewalls?

1 Accepted Solution


Is Gig0/3 a dedicated failover link?

How often does the problem happen?

When the problem happens, might it be related to a huge load of traffic at the time?

Assuming you have physical access to the units, if the problem happens again and remote access is not available, can you console into both ASAs and get the output of "show tech" before and after disabling the interface?

Also, before trying that, try to clear the interface counters [clear interface], just to get more accurate info from the "show tech"s.

Any logs from when the problem happens? If you don't have logging set up, consider configuring a syslog server; it has proved to be useful.
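A minimal syslog setup on the ASA, assuming a hypothetical collector at 192.0.2.10 reachable through the inside interface, could look like this (interface name, collector IP, and severity level are placeholders to adjust):

```
! Sketch only -- adjust interface name, collector IP, and trap level
logging enable
logging timestamp
logging trap informational
logging host inside 192.0.2.10
```

In multiple-context mode this would be configured per context, so each context sends its own logs.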


4 Replies

Favaloro.
Level 1

Can you share your failover configuration and the output of the "show failover" command from the system context?

Also, when you disable the monitored interface, is it always the same interface?

What code version are your firewalls running?

When they "go bananas", what's the resource utilization [CPU, memory, bandwidth] on the units?

Do you get any logs that might provide a hint of what's going on?

We are using ASA version 8.2(5).

The configuration of the failover is:

failover
failover lan unit primary
failover lan interface fail_int GigabitEthernet0/3
failover interface ip fail_int x.x.x.x 255.255.255.252 standby x.x.x.x
failover group 1
  preempt
failover group 2
  secondary
  preempt

Output of the "show failover":

  This host:    Primary
  Group 1       State:          Active
                Active time:    399409 (sec)
  Group 2       State:          Standby Ready
                Active time:    111 (sec)
                slot 0: ASA5520 hw/sw rev (2.0/8.2(5)) status (Up Sys)
                  admin Interface out (x.x.x.x): Normal (Waiting)
                  admin Interface inside (x.x.x.x): Normal (Waiting)
                  admin Interface dmz4 (x.x.x.x): Normal
                  admin Interface dmz1 (x.x.x.x): Normal (Not-Monitored)
                  C1 Interface out (x.x.x.x): Normal (Waiting)
                  C1 Interface inside (x.x.x.x): Normal (Waiting)
                  C1 Interface dmz5 (x.x.x.x): Normal
                  C1 Interface dmz1 (x.x.x.x): Normal (Not-Monitored)
                slot 1: empty
  Other host:   Secondary
  Group 1       State:          Standby Ready
                Active time:    0 (sec)
  Group 2       State:          Active
                Active time:    398992 (sec)
                slot 0: ASA5520 hw/sw rev (2.0/8.2(5)) status (Up Sys)
                  admin Interface out (x.x.x.x): Normal (Waiting)
                  admin Interface inside (x.x.x.x): Normal (Waiting)
                  admin Interface dmz4 (x.x.x.x): Normal
                  admin Interface dmz1 (x.x.x.x): Normal (Not-Monitored)
                  C1 Interface out (x.x.x.x): Normal (Waiting)
                  C1 Interface inside (x.x.x.x): Normal (Waiting)
                  C1 Interface dmz5 (x.x.x.x): Normal
                  C1 Interface dmz1 (x.x.x.x): Normal (Not-Monitored)
                slot 1: empty

Stateful Failover Logical Update Statistics
        Link : Unconfigured.
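The "Link : Unconfigured" line confirms the failover link carries no state replication. If stateful failover were wanted later, a sketch (assuming the existing LAN failover interface on GigabitEthernet0/3 is reused for state traffic, which the ASA supports) would be:

```
! Sketch only -- reuse the existing LAN failover link for state replication
failover link fail_int GigabitEthernet0/3
```

Sharing one physical link for LAN failover and state traffic works, though under heavy connection churn a dedicated state interface is usually preferred.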

When I disabled the monitored interface it was always the same interface, although I believe the same effect could be achieved by disabling any of the monitored interfaces.

As for memory and CPU when it happens, I cannot access the units to get a reading, but I assume it's through the roof.

What troubles me more is that the situation persists after the load drops, and I have to perform the workaround from the first post. One would assume that once the load dropped, both firewalls would start to behave normally again.

And I see that I haven't mentioned it before, but after the load drops both units continue to handle traffic normally; as a side effect, however, I sometimes cannot SSH to one of the units. That unit usually has to be restarted.


igor.hamzic
Level 1

Hi. Sorry for the late reply, but I was swamped with work and replying to this slipped my mind.

We have determined that the problem wasn't with the firewalls themselves. It was in fact a SYN-flood DDoS that sent CPU and memory usage through the roof.

The firewalls were unresponsive because they couldn't allocate memory for SSH connections. I have lowered the embryonic-connection timeout to 5 seconds, which frees enough memory for me to connect to the firewalls.
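Lowering the embryonic (half-open TCP connection) timeout is done through the Modular Policy Framework; a sketch, with illustrative class-map and policy-map names (an existing global policy may already be in place, in which case the class is added to it instead):

```
! Sketch only -- class/policy names are illustrative
class-map SYN_PROTECT
 match any
policy-map global_policy
 class SYN_PROTECT
  set connection timeout embryonic 0:0:05
service-policy global_policy global
```

A related knob worth considering for SYN floods is `set connection embryonic-conn-max`, which caps the number of half-open connections and triggers the ASA's TCP intercept behavior once the limit is reached.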

Thanks for all the help.
