ASA 5545X & Firepower Issues

philipspe · ‎09-29-2022

This is my first time posting here, so bear with me.

Our company runs ASA 5545X units with Firepower in each of our 2 data centers. For the last 2 years, we have had an ongoing issue where they stop passing traffic. In the last couple of months it has gotten really bad, sometimes happening multiple times a week and requiring us to manually fail over to restore services. We have opened multiple tickets with TAC, and they never find anything that really points to an issue. We have been instructed to go through multiple code upgrades, but the issue continues. Most recently, this week, we upgraded the SFRs to 6.6.7, and the ASA is at 9.8.4(44). From what we can tell, it seems to be the SFRs causing the issue, but I can't confirm anything.

When it happens, our external polling solution, which runs 24x7, can still ping the outside interface. From the ASA and from our internet router we can ping out to Google. But nothing can ping from inside the network out, and nothing from outside makes it past the firewall. Has anyone experienced this kind of issue? It's affecting both data centers, and we have 2 units per data center.

Aref Alsouqi · ‎09-29-2022

Are there any interesting logs generated when this issue happens?

philipspe · ‎09-29-2022

Not that I'm aware of. TAC always asks for a show tech, core dumps, and things like that, but according to them they never actually find anything. The last time it happened was this past Monday, and the TAC engineer saw the message below being logged every minute. That is when they instructed us to upgrade to 6.6.7, but nothing yet explains this ongoing issue of the last 2 years.

A couple of months ago we even asked for a physical replacement of one of the units, but the issue has persisted.

SF-IMS[3837]: [3837] pm:process [INFO] Killing mojo_server with /usr/local/sf/bin/mojostop.pl
https://bst.cloudapps.cisco.com/bugsearch/bug/CSCvz46879

Aref Alsouqi · ‎09-29-2022

Not sure, but it seems to be a buggy behaviour of the SFR module. Maybe @Marvin Rhoads came across this issue.

tvotna · ‎09-29-2022

Since this is a Firepower module setup, you can disable traffic redirection to the module by removing the "sfr" command from the policy-map, or replace it with "sfr fail-open monitor-only", and see if this helps. If it does, this is a Snort issue. Otherwise it is an ASA issue or a DAQ (ASA-SFR path) issue. Troubleshooting Snort and the other features running on the module isn't easy. There is a TECSEC-3301 presentation available on the Internet which can give you a few ideas.
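As a rough sketch of the second option: the names below are assumptions, not taken from this thread. It assumes the default policy-map name global_policy and a hypothetical class called "sfr" that currently carries "sfr fail-open". Check "show running-config policy-map" for the real names and for whether the existing line is fail-open or fail-close. As far as I know, the ASA does not accept inline and monitor-only redirection at the same time, so the existing line may need to be removed first.

```
! Assumed names: policy-map "global_policy", class "sfr" (hypothetical).
policy-map global_policy
 class sfr
  ! Remove the existing inline redirection (use "no sfr fail-close" if that is what is configured).
  no sfr fail-open
  ! Send only a copy of the traffic to the module; the ASA no longer waits on the module's verdict.
  sfr fail-open monitor-only
```

For the first option (no redirection at all), stop after the "no sfr fail-open" line. Afterwards, "show service-policy sfr" should report the mode the ASA is actually using, and "show module sfr" should show the module status.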

philipspe · ‎09-29-2022

Yes, I thought about issuing the sfr fail-open monitor-only command, but we are always panicking and just trying to restore service, so I immediately fail it over via ASDM. I am going to try that next time, though. Also, how dangerous is it to run in monitor-only mode, since I'll be losing that layer of protection? We do our geoblocking etc. on Firepower.

tvotna · ‎09-29-2022

Obviously, the answer depends on your security policy. If you're protecting the inside network from Internet threats, this may not be a good idea. Many people use this product to limit what inside users can do; in that case it is OK to disable protection temporarily. ASA L3/L4 ACLs will still be in place anyway if you put the module into monitor-only mode.

Aref Alsouqi · ‎09-29-2022

Bypassing the SFR module will bypass any protection provided by that module.

Marvin Rhoads · ‎09-29-2022

I've not seen this issue in any of my customers' deployments. If you are engaged with a Cisco Account Manager sometimes they can help escalate the level of resources TAC devotes to your case. I know what it's like to have a bug affect you for month after month without resolution - this certainly sounds like one of those (unfortunately).

philipspe · ‎09-29-2022

Thanks, we have been in contact with our account manager and our dedicated SE, and they have been helping each time. I'll update if this happens again and whether monitor-only mode for the SFR works during that time. Thanks

philipspe · ‎10-02-2022

Well, it happened again. We started getting alerts, and I had my engineer issue the sfr fail-open monitor-only command, which seemed to restore service immediately. We ran fine all week after upgrading to 6.6.7 last Monday, and then this happened. Our Snort rules update every Sunday at 3AM and our Geolocation database updates every Sunday at 4AM. I really wonder if these are causing the issue. Lately it seems to hit us later in the day on Sunday or on Monday morning, every week. Has anyone experienced issues after these updates run? We're updating TAC to let them know that the sfr fail-open monitor-only command immediately restores service, but of course we can't leave the module in monitor-only mode. This is so frustrating.

RachelGomez161999 · ‎10-03-2022

Troubleshooting
Problems encountered during Policy Deployment might be due to, but not limited to:

Misconfiguration
Communication between FMC and FTD
Database and System health
Software defects and Caveats
Other Unique situations
Some of these issues might be easily fixed, while others might require assistance from the Cisco Technical Assistance Center (TAC).

The goal of this section is to provide techniques to isolate the issue or determine the root cause.

FMC Graphical User Interface (GUI)
Cisco recommends each troubleshooting session for deployment failures to start on the FMC appliance.

On the failure notification window, on all versions beyond 6.2.3, there are additional tools that can assist with other possible failures.

Utilize The Deployment Transcripts
Step 1. Pull up the Deployments list on the FMC Web UI.

Step 2. While the Deployments tab is selected, click Show History.

Step 3. Inside the Deployment History box, you can see all previous deployments from your FMC. Select the deployment in which you would like to see more data.

Step 4. Once a deployment element is selected, the Deployment Details section displays a list of all devices inside the transaction. These entries are broken down into these columns: Device Number, Device Name, Status, and Transcript.

Step 5. Select the device in question and click on the transcript option to see the individual deployment transcript which can inform you of failures as well as configurations that are placed on the managed devices.

Step 6. This transcript can designate certain failure conditions as well as indicate a very important number for the next step: Transaction ID.

Step 7. In a Firepower Deployment, the Transaction ID is what can be used to track each individual section of a policy deployment. With this, on the Command-Line of the Device, you can obtain a more in-depth version of this data for remediation and analysis.

This may help you,

Rachel Gomez

TODavies · ‎10-07-2022

We have experienced similar ongoing issues and are still working through them with TAC! I have noticed that the CPU on the Firepower modules hits 100% every time these events occur. The most stable version I've been able to run for any length of time is 6.2.3.16. Upgrading to anything beyond that causes high-CPU events and an outage; manually failing over restores service immediately. I have tried versions 6.4.0.15, 6.6.5.2, and 6.6.7, and witnessed the same problems on each.

We have a requirement to upgrade our FMC but having firepower modules running 6.2.3.16 really limits the version that the FMC can run.

TAC have pointed us to the following bugs -

Please let me know if you get anywhere with your TAC case.

philipspe · ‎10-07-2022

TODavies,

Thank you for letting me know you have been experiencing the same issue. Do you by chance see any pattern to when the outages hit? Ours seem to happen sometime after our Snort updates run on Sunday at 3AM and our Geolocation updates run at 4AM. Usually we see the issue within 12-18 hours after the updates run. We usually fail over and reboot the ASAs, and they run until the next week when the updates run again. Have you seen the same behavior? For now, I am having my team disable the recurring updates, and we plan to run them manually in a maintenance window. We also see the CPU spike when the outage occurs.

philipspe · ‎12-19-2022

Just wanted to check in to see if you're still having an issue with this. We ran fine for about 2 months after disabling the Geolocation and Snort updates. We decided to run them last night since it had been a while, and ever since then we have had the issue again. CPU spikes and traffic is dropped until we fail over or go to monitor-only mode. We have had TAC on Webex since this morning, and they are still saying this is a bug with no fix for it. What in the world!!!

We're running 6.6.7

https://bst.cloudapps.cisco.com/bugsearch/bug/CSCvv60849