Every day or so I receive this notification: "SMA down; Outbreak quarantine rescan failed". The SMA is never truly down, nor is there a network hiccup as far as I can tell. What could be causing this? It happens numerous times a week, and each time I have to test and then notify the entire team that the SMA is NOT down.
Is the SMA at the same location as all of your gateways? Is it possible a switch or other network equipment could be having issues?
Also I was having a similar issue, but my SSH was not responding on my ESAs because I was also using the CCS (a known bug). I was able to fix this by using IP and SSH for ClusterConfig. You did not mention that you are using cluster config so I'll assume this is not the cause, but something to look into if you are.
I would set up a request with Cisco Support from the appliance. They should be able to confirm whether it is your ESAs, your SMA, or neither, in which case you may need to track down a connectivity issue in the network.
Happy New Year
Yes, there is a known bug: https://tools.cisco.com/bugsearch/bug/%20CSCuq05636. Basically, SSH and CCS can't be enabled at the same time. CCS is not really needed, and the configuration for ClusterConfig over SSH is pretty simple. I was able to migrate to it with minimal alerts. When you initially make the change you will get a few complaints from each of the appliances that they can't communicate, but nothing more than what you're already dealing with.
This is from memory, but should be close. From your response you're probably familiar enough to figure it out.
1. SSH to one of the ESAs
3. connstatus or list - note if any of the cluster members are not communicating. Reboot if needed to get them communicating before the change to SSH.
- you will be using the option to communicate with IP (clusterconfig with DNS does not seem to support SSH)
- Then select SSH instead of CCS
Hit Enter until you are back at the main prompt, then commit and enter any appropriate comments.
Now check your cluster, usually with clustercheck. It will list any configuration issues, or any ESAs it cannot communicate with. Deal with these before moving on.
5. Log in to the GUI of each ESA and, in Machine mode, ensure CCS is unchecked for each interface.
Monitor for the few communication alerts, reboot appliances as needed, also check your SMA to ensure everything is now stable and data is transferring.
If this fixes your issue, mark it as the answer. I give credit to Cisco Support Engineer, Michael F. who helped me out with this same issue a few days ago.
It didn't work or change the connection issues. I ran clustercheck and they all checked out fine. I then changed the cluster communication and it still failed.
I ended up opening a TAC support case about a month ago and they are STILL looking into the issue. Since it is taking so long, I imagine it is a real issue dealing with a bug that isn't easily fixed.
Could I ask you to post the entire failed alert you received? (Remove the serial numbers if you want.)
From what I can tell so far, if you are using CCS on your ESAs, the previously mentioned bug would cause SSH to fail on the ESA. When that happens, it causes connection issues between the ESA and the SMA, since the SMA needs SSH on port 22 for communication and then also connects on the required port for the service that you're running.
Let me know.
The Critical message is:
Quarantine: Could not connect to the SMA 10.X.X.X at port 7025. Messages in Outbreak quarantine could not be rescanned.
Serial Number: XXXXXXXXXXXXXXXXXXX
Timestamp: 26 Mar 2015 21:37:31 -0400
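If these alerts are collected from email, it can help to pull the SMA address, port, and timestamp out of each one so they can be compared against network monitoring later. The following is only a rough Python sketch that assumes the alert body looks exactly like the one above; the function name parse_alert and the regular expressions are my own, not anything provided by the appliance.

```python
import re
from datetime import datetime

# Patterns are assumptions based only on the alert text posted above.
ALERT_RE = re.compile(
    r"Could not connect to the SMA (?P<host>\S+) at port (?P<port>\d+)"
)
TS_RE = re.compile(r"Timestamp:\s*(?P<ts>.+)")


def parse_alert(text):
    """Return host, port and timestamp from an alert body, or None if no match."""
    conn = ALERT_RE.search(text)
    stamp = TS_RE.search(text)
    if conn is None or stamp is None:
        return None
    # Example timestamp format: "26 Mar 2015 21:37:31 -0400"
    when = datetime.strptime(stamp.group("ts").strip(), "%d %b %Y %H:%M:%S %z")
    return {
        "host": conn.group("host"),
        "port": int(conn.group("port")),
        "when": when,
    }


SAMPLE_ALERT = """Quarantine: Could not connect to the SMA 10.X.X.X at port 7025. Messages in Outbreak quarantine could not be rescanned.
Serial Number: XXXXXXXXXXXXXXXXXXX
Timestamp: 26 Mar 2015 21:37:31 -0400"""
```

Calling parse_alert(SAMPLE_ALERT) gives a dictionary with the masked host string, the port as an integer, and a timezone-aware datetime.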
Is this issue reproducible?
From your ESA are you able to telnet to the SMA on 10.X.X.X on port 7025 on the delivery IP interface on the ESA?
(You can check this with CLI > deliveryconfig)
When you check 'tophosts' on the ESA, does "the.cpq.host" appear on your list as well?
It looks like the SMA was unreachable on the required port for transfer at that given time.
When you check ESA to SMA and SMA to ESA connectivity on port 22 as well, are both connecting and showing the SSH protocol as up?
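For anyone who wants to repeat the port checks from a management workstation rather than by hand with telnet, here is a minimal Python sketch. It assumes the workstation sits on a network path comparable to the ESA's, which may not be true in your environment, so a success here does not prove the ESA itself can connect. The function names are hypothetical; ports 22 and 7025 are the ones mentioned in this thread.

```python
import socket


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


def check_ports(host, ports=(22, 7025), timeout=3.0):
    """Probe each port once and return a {port: reachable} mapping."""
    return {port: port_is_open(host, port, timeout) for port in ports}


# Example (replace with your SMA address; 192.0.2.10 is a documentation IP):
# print(check_ports("192.0.2.10"))
```

A single successful probe only shows the port was reachable at that moment, so for an intermittent problem it is more useful to run this on a schedule and log the results.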
tophosts shows the.cpq.host at the top, so I know it's working most of the time. Telnet checks out on all ports and the connection is fine (I tested this before opening this thread).
What I do see in SolarWinds monitoring is tons of outages and disconnects, as if the network is dropping all the time. I don't think this is the issue, but I've asked my network team to investigate it while TAC works on the logs.
Thanks for the update.
It is my understanding that you have a TAC case open at the moment investigating the system?
If the issue is intermittent then it may require further log review.
Typically, SMA to ESA connectivity (the port 22 connection that you establish on the SMA when first setting up) needs to remain intact. If connections are being dropped, this will interrupt the ESA's transfer to the SMA. In addition, if the connection is not available, the SMA will likely refuse the connection attempts, as it registers that the service is not available for the ESA's connecting IP.
That is correct, TAC was involved and had to make back-end code changes for the system quarantines and now they all seem to be working as desired.
I also upgraded my ESAs to VER9, which also seemed to help. So far over the last week no new alerts have been thrown and emails are being quarantined correctly.
They keep coming back sporadically, but it doesn't seem to be affecting anything, so I have been ignoring the alerts. I have now narrowed this down to poorly functioning networking, which is currently being investigated by our network engineering team. They say there is nothing wrong, but I know it is the root cause; I just have no way to prove it yet.
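One possible way to build evidence for the network theory is to log periodic probe results and check whether the alert timestamps fall inside the outage windows. The sketch below is only an illustration in Python: the sample format (a list of time and success pairs), the function names, and the one-minute slack are all my assumptions, and a match shows correlation, not cause.

```python
from datetime import datetime, timedelta


def outage_windows(samples):
    """Collapse (time, ok) probe samples, sorted by time, into outage windows.

    A window runs from the first failed probe to the next successful one,
    or to the last sample if the probes never recovered.
    """
    windows = []
    start = None
    last = None
    for when, ok in samples:
        if not ok and start is None:
            start = when
        elif ok and start is not None:
            windows.append((start, when))
            start = None
        last = when
    if start is not None:
        windows.append((start, last))
    return windows


def alerts_in_outages(alert_times, windows, slack=timedelta(minutes=1)):
    """Return the alert times that fall within any window, plus or minus slack."""
    return [
        alert
        for alert in alert_times
        if any(begin - slack <= alert <= end + slack for begin, end in windows)
    ]
```

All datetimes passed in need to be either all timezone-aware or all naive, since Python refuses to compare the two kinds.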
Please let me know if you have come to any resolution. We have gone so far as to RMA the appliance. We sporadically get this error as well.
There never was a real solution. TAC engineers went through everything, we changed the network and made system changes, and nothing fully fixed it.
I found the latest release of the SMA/ESAs made it better, but it is still an issue.