
c600v queue slow (4-5 per minute) but utilization percentages low

DanielG_2020
Level 1

I encountered a significant bottleneck on my c600v today. We had one address receive approximately 1000 emails over the course of an hour or two, and the work queue spiked to over 1000 and took several hours to clear out, causing significant mail send/receive delays for everybody else.

 

We were able to speed it up a little bit by increasing the rate for that particular domain in destination controls, which resulted in a sustained rate of approximately 4 to 5 messages clearing the queue per minute; even so, individual messages ended up delayed by 2 hours or more.

 

It's acting like it's overloaded, but the CPU usage graphs never went above about 20%, and the individual module usage never went above 50% (anti-spam being the highest). 

 

I don't see any particular alerts, and we typically handle about 12k attempted incoming messages per day with queue times normally near-instant.

 

I'm open to any suggestions on where to look to figure out why the system is going so slowly, yet looks like it's barely being utilized.

7 Replies

Mathew Huynh
Cisco Employee

Hey Daniel,

 

Typically I would say a TAC case would be the best bet for this type of issue. From my experience, if it was a bottleneck at the workqueue (rather than delivery - the two are independent of each other), then a large burst of emails in quick succession (a few minutes) can normally cause a spike; combine this with normal mail flow and it's a recipe for disaster.

 

The best thing I can suggest: at the time you saw the workqueue start to build (you can use the GUI report to get an approximate time), run this grep on the mail_logs (assuming the logs are still there):   grep -i "Jun 29 10:.*ready" mail_logs   -- replace the date and time with the approximate window you found.

 

This will give us all the emails that entered the workqueue in that window; hopefully you can correlate it with the time the queue build started and check what type of messages were coming in.
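For illustration, run from the ESA CLI it might look something like this (hostname, MIDs and sender are made up, and the exact log wording can vary by AsyncOS version):

esa.example.com> grep -i "Jun 29 10:.*ready" mail_logs

Tue Jun 29 10:02:11 2021 Info: MID 123456 ready 38420 bytes from <reports@bulk-sender.example>
Tue Jun 29 10:02:13 2021 Info: MID 123457 ready 37911 bytes from <reports@bulk-sender.example>
Tue Jun 29 10:02:14 2021 Info: MID 123458 ready 39102 bytes from <reports@bulk-sender.example>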

 

If you see a common trend of the same sender and/or larger messages, that would point quite heavily to a potential queue problem. If the sender is someone you don't really trust, perhaps utilising an envelope sender rate limit would be beneficial (keeping in mind this is a global change and impacts all senders to your ESA/CES).

You could also grep out the MIDs in that window to check which service/process had the longest processing time - that would let you pinpoint whether it's a content problem or just a raw capacity issue at that stage.
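As a hypothetical example of that per-MID check (the MID number is made up):

esa.example.com> grep "MID 123456" mail_logs

This pulls every pipeline event for that one message along with its timestamp, so a large jump between two consecutive lines shows you which stage is eating the time.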


Remember the CPU gauges in the GUI show an average across all CPUs; there are cases where a single process is pinning one CPU and getting throttled, so the average looks fine while that one CPU is at its peak.
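As a rough cross-check from the CLI (just a sketch - output omitted here, and the exact fields vary by AsyncOS version):

esa.example.com> status detail
esa.example.com> workqueue

status detail breaks utilization down by function (antispam, antivirus, reporting, etc.) rather than giving a single averaged figure, and workqueue shows the current depth of the work queue and lets you pause/resume it.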

 

Other than this, it's just very top-level troubleshooting at this stage.

 

Feel free to post back with anything you have questions on.

 

Thanks,

Mathew

I did some grepping per your suggestion. I found one specific email address that caused the flood of incoming emails. Unfortunately, it is an expected email that does have high volume.

 

I did spot check a bunch of MIDs from before the queue piled up, then during, then later on, and I did notice a pattern.

The initial receipt from START to READY goes instantly, and the spam, av, outbreak, dlp, dkim sections all take 1-2 seconds tops.

 

However, I do get this warning on all emails:

"Warning: MID XXXX unable to lookup SLBL for recipient because the DB server is unavailable", after approximately a 30-90 second delay. 

 

But where I do see a significant gap is between the "ready" line and the "matched all recipients for per-recipient policy xxxx in the inbound table" line.

 

When the queue is running normally, there is no appreciable time between those two steps.  However, when the queue starts building up, the gap between "ready" and "matched all recipients" increases to at least 35 minutes, maybe even longer.
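To illustrate the pattern (MID, policy name and timestamps below are made up), the two lines in mail_logs end up looking like:

Tue Jun 29 10:03:02 2021 Info: MID 123460 ready 38112 bytes from <reports@bulk-sender.example>
Tue Jun 29 10:41:17 2021 Info: MID 123460 matched all recipients for per-recipient policy DEFAULT in the inbound table

i.e. roughly 38 minutes of nothing between them.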

 

So, does this mean that the recipient matching is where the delay is occurring, or is there some other process occurring between READY and that step?

 

The SLBL DB error is clearly an issue that needs to be resolved too, but that comes 30-90 seconds after the processing delay has already occurred so I do not believe that is the primary cause of the delay.

Hey Daniel,

 

Excellent investigative work!

A gap from "ready" to "matched incoming mail policy" etc. means the queue is backed up to the point that new emails can't get into the pipeline for scanning.

 

It is very likely the SLBL lookup on the earlier emails is causing a backlog, so every new email has to wait to even enter the workqueue pipeline.

 

A good way to see it is...

 

The SMTP connection level is everything from the "New SMTP ICID" line all the way to the "ready ... from xxx@domain.com" line.

After that line, the email is ready for the workqueue pipeline.

 

In the workqueue, matching is done first at message filters (per the pipeline), then the incoming/outgoing mail policy match, then all the scanning engines.
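Roughly, a healthy trace in mail_logs walks through those stages in order; everything below is illustrative (IDs, hostnames, policy name and engine verdicts are made up, and the exact wording varies by version):

Info: New SMTP ICID 987654 interface Data1 (10.1.1.5) address 203.0.113.10 reverse dns host mail.sender.example verified yes
Info: ICID 987654 ACCEPT SG UNKNOWNLIST match sbrs[-1.0:10.0] SBRS 3.5
Info: Start MID 123456 ICID 987654
Info: MID 123456 ICID 987654 From: <reports@bulk-sender.example>
Info: MID 123456 ICID 987654 RID 0 To: <user@yourdomain.example>
Info: MID 123456 ready 38420 bytes from <reports@bulk-sender.example>        <-- end of the SMTP connection phase
Info: MID 123456 matched all recipients for per-recipient policy DEFAULT in the inbound table   <-- workqueue: policy match
Info: MID 123456 interim verdict using engine: CASE spam negative             <-- antispam (where the SLBL is consulted)
Info: MID 123456 antivirus negative
Info: MID 123456 queued for delivery

A backed-up workqueue shows up as dead time between the "ready" line and the policy-match line.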

 

If there is already a backed-up queue, then it's expected for emails to have that gap between "ready" and matching a policy; however, if your queue is NOT backed up and you still see this delay, that means there is likely an issue with message filters and/or matching policies.

 

In your case, I believe it's simply a queue backup.

You traced a delay on the SLBL lookup due to the DB (the 30-90 seconds you saw); if this occurs for every email, you can imagine the extreme delays that adds up to.

 

As the workqueue is a FIFO setup, if whatever came in earlier is stuck at the SLBL lookup, all new emails will continue to wait at the "ready" line before they can move forward.

 

SLBL is implemented at the antispam level. On the emails -before- the queue build-up where you saw the SLBL error, did you notice a delay from "ready" to "matched per-recipient policy"? (I would assume not.) If that's the case, I'd put this down as the SLBL issue that needs to be squared away.

 

Regards,

Mathew

Thanks for the update and the info.

 

My ESA was in need of a version upgrade, so I'm processing that now. If the SLBL issue persists, I will continue with more troubleshooting on that. 

 

Unfortunately, after upgrading to AsyncOS 14, the SLBL error is still present, and after receiving a bunch of emails my queue has again ballooned to 60 minutes or more.

 

Is there any way I can try to troubleshoot or diagnose the problem myself?

 

edit: just a quick note for anyone who encounters this issue: I was able to turn off the end-user safelist/blocklist in the spam quarantine settings, and my queue dropped from 500+ to 0 almost instantly. Now I just need to figure out how to fix that feature.

I'd start with downloading the SLBL and cleaning it up...

Remove entries for users that are not with the company any longer.

Remove entries for bounce addresses with random codes.



That may clear the issue.
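If you'd rather do that cleanup in bulk, the CLI has an slblconfig command for exporting/importing the end-user safelist/blocklist as a .csv (a rough sketch - the exact prompts, subcommand names and file handling may differ on your AsyncOS version):

esa.example.com> slblconfig
[]> export      <-- writes the current safelist/blocklist to a .csv in the configuration directory
(pull the file off the box, delete the stale rows - departed users, bounce addresses - then)
[]> import      <-- loads the cleaned .csv back in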


Do you have the full error message? If you're using an SMA for your spam quarantine, try logging into the CLI and running "displayalerts" to see if there are any alerts/app faults; if you can share them, I'll have a look to see if there's anything.

 

Otherwise, what Ken shared is good - ensure the SLBL isn't overpopulated; a large SLBL (especially when you are doing SMA backups or such) could potentially cause the DB to not come up properly or, in the worst case, to corrupt.