Somewhere in upgrading to ASA code 9.1.4 and CX code 22.214.171.124 (52), we ran into a known and still-open bug (CSCud54665). The symptom was frequent failover back and forth due to 'Service card in other unit has failed'. This continued for a couple of days until we finally had to bypass the CX modules altogether.
While I wait for the bug to (hopefully) be resolved, has anyone else come across this? Is there a better workaround than turning off the CX modules? With them off, we're no longer logging traffic or proactively blocking malware.
Has anyone successfully downgraded their CX module(s)?
Thank you in advance
I just wanted to add that I did find a supported way to downgrade my CX modules back to what they were and the problem is still present. This potentially means that the problem was introduced in ASA code 9.1.3 or 9.1.4. I'm not brave enough to try to downgrade back to 9.1.2 which is where I started.
Thanks for updating your thread.
So you had to back your CX code down to 9.1(2) as well (or I guess you did that first in the troubleshooting process)? Because the latest 9.2(1) CX requires ASA 9.1(3) or higher. (Reference)
That's disappointing if so because it would mean not being able to use the NGFW IPS licenses at all.
Yup, I downgraded the CX modules first and still found that I was failing back and forth frequently. This forced me to turn off CX inspection to stabilize the situation. Now that I've downgraded back to ASA 9.1.2 (listed as a stable, recommended release), I've turned CX inspection back on and we're in business again. I even went as far as bringing us up to 9.1.3 of the CX code and we're still good.
You're absolutely right that 9.2 CX code requires 9.1.3 ASA code or higher. I guess I'll wait for 9.1.5 or whatever the next recommended release turns out to be.
For now, emergency over!
I'm having similar issues. We ended up downgrading to 9.1.3 and disabling CX inspection. Does anyone know of a good stable release of ASA code when running CX code...126.96.36.199-82? Below is a history of the upgrades/downgrades I have had to do over the past month, each listed under the bug that prompted it.
CSCuj99176 - Make ASA-SSM cplane keepalives more tolerable to communication delays
  - Upgraded ASAs to 9.1.3
  - Upgraded CX modules to 188.8.131.52-82
CSCun48868 - ASA changes to improve CX throughput and prevent unnecessary failovers
  - Upgraded to 9.1.5 interim release 10
CSCul77722 - Traceback with Assertion 0 (ASA Clientless VPN Denial of Service)
  - Downgraded to 9.1.3
We just ran into this issue when configuring failover on our pair of new 5515-Xs for the first time. The primary unit has been in use as a single device since 11 Feb with no issues. On 11 March, I added the second unit as the secondary in an Active/Standby pair. Within an hour of doing so, the secondary went active. Every time I forced the primary back into active service, the units would fail over again, sometimes within 30 minutes, sometimes after a few hours.
We just opened a case on this issue. For now we've decided to just turn off the secondary unit until a solution can be found.
Here is a syslog I captured from the primary during one of the events:
Mar 12 2014 13:24:19: %ASA-1-323006: Module cxsc experienced a data channel communication failure, data channel is DOWN.
Mar 12 2014 13:24:19: %ASA-6-720032: (VPN-Primary) HA status callback: id=3,seq=200,grp=0,event=406,op=130,my=Standby Ready,peer=Active.
Mar 12 2014 13:24:19: %ASA-6-720028: (VPN-Primary) HA status callback: Peer state Active.
Mar 12 2014 13:24:19: %ASA-6-721002: (WebVPN-Primary) HA status change: event HA_STATUS_PEER_STATE, my state Standby Ready, peer state Active.
Mar 12 2014 13:24:19: %ASA-1-104002: (Primary) Switching to STANDBY - Other unit wants me Standby. Secondary unit switch reason: Service card in other unit has failed.
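In case it helps anyone who wants to track how often this happens, here is a rough Python sketch that pulls the module-failure and switchover messages out of a log capture like the one above. It assumes the timestamp layout shown in these lines and only watches the two message IDs that appear here (323006 and 104002). The function names and the SAMPLE text are my own illustration, not anything from Cisco.

```python
import re
from datetime import datetime

# Assumed line layout, taken from the capture above:
# "Mar 12 2014 13:24:19: %ASA-1-104002: (Primary) Switching to STANDBY - ..."
SYSLOG_RE = re.compile(
    r"^(?P<ts>\w{3} +\d{1,2} \d{4} \d{2}:\d{2}:\d{2}): "
    r"%ASA-(?P<sev>\d)-(?P<id>\d{6}): (?P<text>.*)$"
)
# Pulls the text after "switch reason:" and drops the trailing period.
REASON_RE = re.compile(r"switch reason: (?P<reason>.+?)\.?$")

# Only the two IDs seen in this thread; extend as needed for your own logs.
WATCHED_IDS = {"323006", "104002"}


def parse_line(line):
    """Return a dict for one ASA syslog line, or None if it does not match."""
    m = SYSLOG_RE.match(line.strip())
    if m is None:
        return None
    # Collapse repeated spaces so single-digit days parse the same way.
    ts = datetime.strptime(" ".join(m.group("ts").split()), "%b %d %Y %H:%M:%S")
    return {
        "time": ts,
        "severity": int(m.group("sev")),
        "id": m.group("id"),
        "text": m.group("text"),
    }


def failover_events(lines):
    """Yield parsed records for the watched IDs, adding the switch reason if present."""
    for line in lines:
        rec = parse_line(line)
        if rec is None or rec["id"] not in WATCHED_IDS:
            continue
        reason = REASON_RE.search(rec["text"])
        rec["reason"] = reason.group("reason") if reason else None
        yield rec


SAMPLE = """\
Mar 12 2014 13:24:19: %ASA-1-323006: Module cxsc experienced a data channel communication failure, data channel is DOWN.
Mar 12 2014 13:24:19: %ASA-6-720028: (VPN-Primary) HA status callback: Peer state Active.
Mar 12 2014 13:24:19: %ASA-1-104002: (Primary) Switching to STANDBY - Other unit wants me Standby. Secondary unit switch reason: Service card in other unit has failed.
"""

if __name__ == "__main__":
    for event in failover_events(SAMPLE.splitlines()):
        print(event["time"], event["id"], event["reason"] or event["text"])
```

Feeding it a few days of logs and counting the 104002 records per day would give a quick picture of how frequent the flip-flopping is before and after a code change.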
The really interesting thing here is that we had a VERY similar issue with our old 5520s back in 2010. The pair would flip-flop randomly due to a service card failure (back then, the SSM card). The failover was eventually traced back to the "fover_health_monitoring_thread" process timing out because the "logger_save" process was hogging too much CPU.
I'll post an update here if anything else comes up. Thanks for starting this post.