
FTD 2110 Snort3 problem

viktar23
Level 1

Hello everyone! In our organization we have two Cisco FTD 2110s managed by FMC and configured as an HA pair. Until recently they functioned as regular stateful L3-L4 firewalls. However, a couple of months ago the authorities demanded that higher-level filtering be enabled on them. At that time the FTD version was 7.3.0. I started by configuring IPS: I took the Balanced Security and Connectivity policy as a basis and turned on Cisco recommendations. Everything was fine for a month and the firewalls worked stably. During this month I added several URL rules based on our organization's security policy. Then I decided to turn on AMP. Two weeks after that, we had a Snort3 crash with the error "The Primary Detection Engine process terminated unexpectedly 1 time(s)". I decided not to jump to conclusions and left everything as it was, since I could not determine the cause of the crash. A week later the situation repeated itself, and it kept repeating for another month (once or twice a week we had a Snort3 crash).

I found that Snort3 could crash due to SMB traffic generated by an attacker: https://sec.cloudapps.cisco.com/security/center/content/CiscoSecurityAdvisory/cisco-sa-ftd-smbsnort3-dos-pfOjOYUV

The security department and I did not find any signs of penetration into our network; nevertheless, I decided to exclude SMB traffic by placing it in the Prefilter Policy. As expected, this did not help. After that I decided to update FTD to the latest version available at the time, 7.4.1. After the update the problem went away for three weeks. But then hell broke loose.

Exactly three weeks later, at around 10:30, FTD began to drop traffic badly: almost all packets were lost, and those that got through were delayed by 2-3 seconds. There was no Snort crash and no failover this time. There were no fresh core files in /ngfw/var/common/, and likewise no .dmp files in /ngfw/var/log or /ngfw/var/log/crashinfo. Logs from /ngfw/var/log/messages are attached. Also notable was the load on the CPU, specifically on cores 6 and 10 (cpu_load image), and the Packet Queue Receive Utilization (snort_load image). Through top I looked at which processes were loading core 6 (top image); syslog-ng and fail2ban-server were constantly at the top. I couldn't draw any useful conclusions from this. I had to perform a manual failover, and everything worked again.

The next day exactly the same thing happened (same log, same load on the CPU cores, same Snort queue utilization, and syslog-ng and fail2ban-server again in the top output) at about the same time, +/- 5 minutes. I had to fail over manually again. For three weeks the problem repeated day after day at approximately the same time, except on weekends, when the load on the FTD was 2-3 times lower than on weekdays. Since the problem appeared only three weeks after the update, I decided it had nothing to do with the upgrade and that I had run into a new problem. Apparently some specific traffic had appeared on our network that was bringing the FTD down (although that sounds strange, because at that time we had not introduced any new products and the traffic could hardly have changed significantly).

An experienced colleague from another organization advised me to exclude elephant flows from deep inspection, since in his opinion those flows could be the cause of the failures. I caught all the elephant flows and put them in the prefilter policy, but it didn't help. After that I excluded from deep inspection all the flows that were passing at the time of the failure. That helped only in that, during the failures, FTD no longer stopped passing traffic completely; the delay just increased slightly (but the CPU load, Snort queue utilization, and processes in top stayed the same). After that I thought the problem might be in my specific hardware, but no: a new, unquestionably healthy FTD with the same version 7.4.1 and Snort3 behaved the same way on our network.
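For reference, the checks above (core files, crash dumps, messages, per-core load) were run from FTD expert mode, i.e. the regular Linux shell on the box; roughly the following, with the paths mentioned above:

expert
sudo su -
ls -lt /ngfw/var/common/ | head                       # any fresh Snort3 core files?
ls -lt /ngfw/var/log/*.dmp /ngfw/var/log/crashinfo/   # any crash dumps?
tail -n 100 /ngfw/var/log/messages                    # messages around the failure window
top                                                   # press "1" for per-core view; cores 6 and 10 were pegged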

I did not make any significant changes to the policies before that fateful day. The only change was that I had disabled some of the standard rules a few days before the crash, but that had no positive effect on the situation. Then I remembered that relatively recently, while I was still setting up IPS, I had applied a fairly large ACL (more than 1,000 entries) to the Control Plane to block brute-force attempts against our RA VPN. I removed it and... lo and behold! Four weeks of silence, stable network operation, and no failures. But that wasn't the end of the story. Exactly four weeks later it all started again, and this is terrible! Among the latest changes: I moved the RA VPN to another FTD to offload it a bit and removed AMP from the policy. But that didn't help either. I don't understand at all what the problem might be. I hope for your help. Unfortunately, our organization does not have a service contract.
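For context, in the underlying LINA configuration a control-plane ACL of that kind looks roughly like this (the ACL name, interface name, and addresses here are illustrative, not my real entries):

access-list RAVPN_BLOCK extended deny ip host 203.0.113.10 any
access-list RAVPN_BLOCK extended deny ip host 203.0.113.11 any
! ... more than 1,000 similar deny entries ...
access-list RAVPN_BLOCK extended permit ip any any
access-group RAVPN_BLOCK in interface outside control-plane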

3 Replies

@viktar23 hi, it seems like you have already tried many things to solve this issue. The next step should be Cisco TAC. Since you mentioned that you don't have a service contract for these devices, I highly recommend purchasing one to get support from TAC, because these kinds of internal issues can be solved by the TAC team, and you have already done a lot of testing and corrections yourself.

Please rate this and mark it as the solution/answer if it resolved your issue.
Good luck
KB

Marvin Rhoads
Hall of Fame

It certainly sounds like you have performed much more than the typical troubleshooting on your own. I would agree with @Kasun Bandara that getting support would be the best move at this point. It's odd that management insists on enabling more advanced features but won't pay for TAC support.

It wouldn't hurt to upgrade to 7.4.2.1 to get the latest bug fixes.

Have you tried disabling the file policy (AMP) in the rules that have it enabled? I find file policies of quite limited value since 90% or more of Internet edge traffic is encrypted and thus not inspectable for malware payloads (unless you have SSL decryption, which only a very small percentage of customers do).

It also seemed to me that AMP would have little effect on the Snort load because of the encrypted traffic. However, given that the problems with the FTD began after I enabled this policy, I decided to disable it anyway. Yes, I have disabled the file policy for all rules in the ACP in use. I also disabled adaptive profiles just in case. We do have traffic decryption enabled, but it is currently used only in test mode, literally for a few traffic flows.