
Possible Issues with Cisco DNS Resolution Causing Mail Delivery Problems

hanifena
Level 1

Cisco community,

Reaching out to see if anybody can assist with a very sporadic DNS resolution issue that GoDaddy, Cisco TAC, and the third-party DNS provider that originally hosted our client's DNS have been unable to resolve.  Hoping someone with a larger brain can provide some insight and possibly get Cisco to look into it.  Cisco has dismissed it as an issue with the client's DNS records, which doesn't line up with our results.  We ourselves are not a Cisco customer, so our avenues for support are limited.  Long-winded detailed explanation and diagnostic steps taken below.

We are a small MSP with a client (clientdomain) who has received reports of multiple mail delivery issues that only seem to occur when a customer of theirs (customerdomain) utilizes a Cisco IronPort hosted service.  The exact errors from message traces provided by multiple customers of our client are as follows:

##############################

02 Nov 2020 09:24:05 (GMT -08:00)

(DCID 3161342) Message 92757962 to USER@clientdomain.com  bounced by destination server. Reason: 5.4.7 - Delivery expired (message too old) ('000', ['DNS Soft Error looking up clientdomain.com (MX) while asking recursive_nameserver1.parent. Error was: unable to reach nameserver on any valid IP'])

02 Nov 2020 09:24:05 (GMT -08:00)

Start message 92922748 on incoming connection (ICID 0).

02 Nov 2020 09:24:05 (GMT -08:00)

A new message 92922748 was generated to handle bounce of message 92757962.

02 Nov 2020 09:24:05 (GMT -08:00)

Message 92922748 enqueued on incoming connection (ICID 0) from .

02 Nov 2020 09:24:05 (GMT -08:00)

Message 92922748 on incoming connection (ICID 0) added recipient (USER@customerdomain.com ).

02 Nov 2020 09:24:05 (GMT -08:00)

Message 92922748 (7601 bytes) from ready.

02 Nov 2020 09:24:05 (GMT -08:00)

Message 92922748 queued for delivery.

02 Nov 2020 09:24:05 (GMT -08:00)

(DCID 3161343) Message 92757962 to USER@clientdomain.com  bounced by destination server. Reason: 5.4.7 - Delivery expired (message too old) ('000', ['DNS Soft Error looking up customerdomain.com (MX) while asking recursive_nameserver1.parent. Error was: unable to reach nameserver on any valid IP'])

###############################
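For anyone collecting these traces from several customers, here is a minimal Python sketch for pulling the relevant fields out of a bounce line so they can be compared side by side. The regular expression is fitted only to the line quoted above; the function name `parse_bounce` and the field names are my own, and other ESA log lines may well be formatted differently.

```python
import re

# Pattern fitted to the single bounce line quoted in this post.
# It is an assumption that other traces follow the same layout.
BOUNCE_RE = re.compile(
    r"\(DCID (?P<dcid>\d+)\) Message (?P<mid>\d+) to (?P<rcpt>\S+)\s+"
    r"bounced by destination server\. "
    r"Reason: (?P<code>\d\.\d\.\d) - (?P<reason>[^(]+?) \((?P<detail>[^)]*)\) "
    r".*?DNS Soft Error looking up (?P<domain>\S+) \((?P<rtype>\w+)\) "
    r"while asking (?P<nameserver>\S+)\. Error was: (?P<error>[^'\]]+)"
)

SAMPLE = (
    "(DCID 3161342) Message 92757962 to USER@clientdomain.com  bounced by "
    "destination server. Reason: 5.4.7 - Delivery expired (message too old) "
    "('000', ['DNS Soft Error looking up clientdomain.com (MX) while asking "
    "recursive_nameserver1.parent. Error was: unable to reach nameserver on "
    "any valid IP'])"
)


def parse_bounce(line):
    """Return a dict of fields from a bounce line, or None if it does not match."""
    match = BOUNCE_RE.search(line)
    return match.groupdict() if match else None


fields = parse_bounce(SAMPLE)
print(fields["domain"], fields["rtype"], fields["nameserver"], fields["error"])
```

Grouping the parsed results by `domain` and `nameserver` should make it easier to see whether every failure points at the same resolver.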

This error started occurring on the 17th and 18th of September, 2020.  After a month of back and forth with the original DNS hosting provider and web developer, we thought we had identified the issue.  The DNS hosting provider our client was utilizing is a small vendor local to our area of coverage.  They had some missing AAAA glue records which their engineer said may have affected mail delivery.  The moment those missing records were added, the customer domains that were failing to reach our client started coming through.

Less than a week later, mail delivery from the affected customer domains to our client was again held up, and the previous recursive_nameserver1.parent error returned.  Since we had been troubleshooting for so long and our client was pressed for a solution, we decided to move their DNS to their existing GoDaddy account temporarily, on the assumption that the issue was upstream of the third-party DNS provider and that the root cause would take longer to identify.

Again, immediately after DNS propagation, the failing customer domains emailing into our client started coming through again.  We assumed, incorrectly, that this was an issue with the third-party provider or their upstream DNS resolvers.  Earlier this week, the issue returned again, but it is now sporadically allowing some messages to come through.  We are stumped.  Here is what we know for sure:

1.  All failing emails are FROM the customers TO our client.  There has never been an issue with mail delivery FROM our client TO their customers.

2.  Every failing message into our client's domain seems to occur when the customer utilizes a hosted IronPort service.  We haven't been able to determine if an on-prem IronPort would cause the same issue, as there are no customers of our client that seem to be using one.

3.  All publicly available tools we've used for troubleshooting indicate no issues with DNS on our client's end.  This includes MXToolbox, CheckTLS, etc.

4.  When Cisco DNS servers are used, the recursive_nameserver1.parent error can be reproduced.  However, there are periods of time when emails come through, indicating this is a sporadic/transient issue and not a consistent one.  The messages that do come through seem to have long delays of 1800+ minutes (30+ hours).  Perhaps indicative of a Cisco IronPort greylisting issue?

5.  Our client is not on any public blacklists, does not send out marketing spam, etc.

6.  We, and our client, utilize Microsoft 365 for mail services, with no third-party spam filtering solutions or non-default mail configurations/custom spam filters in place.  Our emails back and forth are instantaneous and work without incident.

7.  DKIM has been temporarily turned off while we troubleshoot the issue, to remove any variables that may lead to inconsistent test results.  The DMARC policy was set to NONE when it was in use at the start of this issue.

8.  Any major DNS change that has been made, like the addition of the AAAA glue record at the original DNS host, and the move from their hosting to GoDaddy DNS hosting, seems to trigger something temporarily that allows some of the messages from the customers to reach our client before errors return a few days later.  This is indicative of an issue with DNS somewhere, but we can't determine where.

9.  We have double, triple, quadruple checked our DNS records and see no issues with what we have entered.  Other sources have verified our records and can’t find an issue either.
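Since the error reads "unable to reach nameserver on any valid IP" and the failures are intermittent, one check is to probe every IPv4 and IPv6 address of the authoritative nameservers repeatedly and count the failures per address. Below is a rough Python sketch of that idea. It only tests TCP reachability on port 53, which is an assumption about what is worth measuring and not what the Cisco resolvers actually do; the addresses in the usage comment are documentation placeholders, and the function names are my own.

```python
import socket


def tcp53_reachable(ip, timeout=3.0):
    """Return True if a TCP connection to port 53 on `ip` succeeds."""
    try:
        # create_connection accepts both IPv4 and IPv6 literals.
        with socket.create_connection((ip, 53), timeout=timeout):
            return True
    except OSError:  # covers timeouts, refusals, and unreachable networks
        return False


def tally_failures(ips, rounds=5, probe=tcp53_reachable):
    """Probe each address `rounds` times and return {ip: failure_count}."""
    ips = list(ips)
    failures = {ip: 0 for ip in ips}
    for _ in range(rounds):
        for ip in ips:
            if not probe(ip):
                failures[ip] += 1
    return failures


# Example usage (placeholder addresses; substitute the nameserver IPs
# listed in the zone's NS/glue records):
#   print(tally_failures(["192.0.2.53", "2001:db8::53"], rounds=10))
```

If one address family (for example, only the AAAA glue addresses) fails far more often than the other, that would narrow down where to look. Running it from several networks would help, since the failure may only be visible from the sender's side.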

Again, all issues are present only with customers mailing into the client domain that utilize Cisco Ironport hosted services.  We are not ourselves a Cisco customer.  All troubleshooting we can perform on our end is relatively fruitless and has to come from customers of our client willing to provide us with logs.  So far, they've been more than happy to assist, but the client is getting more frustrated with our inability to resolve as the days pass.  I'm willing to share more detailed logs, DNS zone file, and unredacted info via DM or email if someone thinks they may have the answer or be able to assist with troubleshooting.

Thank you Cisco Community!  Hope we can get to the bottom of this and possibly help out other Ironport users that may be encountering the same issues!

5 Replies

hanifena
Level 1

Posted this to Reddit too.  Just leaving this link here in case others encounter the same issue.  I believe we may be on to some sort of solution:  https://www.reddit.com/r/sysadmin/comments/jopuuw/issues_with_cisco_dns_resolution_causing_mail/

Libin Varghese
Cisco Employee

For CES (hosted ESAs) I've seen instances of EDNS compliance causing DNS lookup issues, but that is not necessarily the case here, since you mentioned it's intermittent.
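To illustrate what an EDNS compliance check could look like, here is a minimal Python sketch that builds an MX query with and without an EDNS0 OPT record (RFC 6891) and sends it over UDP to a nameserver. It is a simplified stand-in and not how the ESA resolver works internally; the function names, the query ID, and the 1232-byte payload size are choices made for this example.

```python
import socket
import struct


def encode_name(name):
    """Encode a domain name as length-prefixed DNS labels."""
    out = b""
    for label in name.rstrip(".").split("."):
        raw = label.encode("ascii")
        out += bytes([len(raw)]) + raw
    return out + b"\x00"


def build_mx_query(domain, query_id=0x1234, edns=True, udp_size=1232):
    """Build a DNS MX query, optionally with an EDNS0 OPT pseudo-record."""
    arcount = 1 if edns else 0
    # Header: ID, flags (RD clear; target assumed authoritative),
    # QDCOUNT=1, ANCOUNT=0, NSCOUNT=0, ARCOUNT.
    header = struct.pack("!HHHHHH", query_id, 0x0000, 1, 0, 0, arcount)
    question = encode_name(domain) + struct.pack("!HH", 15, 1)  # MX, IN
    if not edns:
        return header + question
    # OPT RR: root name, TYPE=41, CLASS=UDP payload size,
    # TTL=0 (extended RCODE, version 0, no flags), RDLENGTH=0.
    opt = b"\x00" + struct.pack("!HHIH", 41, udp_size, 0, 0)
    return header + question + opt


def probe(server_ip, domain, edns=True, timeout=3.0):
    """Send the query; return the reply's RCODE, or None if no reply arrives."""
    family = socket.AF_INET6 if ":" in server_ip else socket.AF_INET
    with socket.socket(family, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(build_mx_query(domain, edns=edns), (server_ip, 53))
            reply, _ = sock.recvfrom(4096)
        except OSError:
            return None
    if len(reply) < 12:
        return None
    flags = struct.unpack("!H", reply[2:4])[0]
    return flags & 0x000F  # low 4 bits of the flags field hold the RCODE


# Example usage (placeholder nameserver address):
#   print(probe("192.0.2.53", "clientdomain.com", edns=True))
#   print(probe("192.0.2.53", "clientdomain.com", edns=False))
```

If a nameserver answers the plain query with RCODE 0 but times out on the EDNS query, or returns something other than a normal answer or FORMERR, that would point toward an EDNS handling problem on that server or on a device in front of it.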

A TAC case can be opened if issues are still being seen since it'll need to be investigated in the CES network by the Ops team.

 

There are a couple of workarounds that can be implemented on the ESAs:

1. Configure an alternate DNS server for this domain (clientdomain.com), such as 8.8.8.8 or another DNS server of choice. This would still need a TAC case, since CES customers do not have access to the DNS section of the configuration.

2. Configure a static SMTP route for clientdomain.com (clientdomain.com -> xx.xx.xx.xx), where xx.xx.xx.xx is the IP address you would like all emails for that domain delivered to. This effectively bypasses the need for a DNS lookup.

 

Regards,

Libin

hanifena
Level 1

Libin, that is promising news!  Would you please open the case with TAC so we can figure out how to resolve it permanently?  I plan on moving the client back to their previous DNS host from GoDaddy after resolution.  This way, they can keep their other services such as automatic certificate renewal and website failover.  A workaround using Google's public DNS resolver seems like a fair solution while the TAC case is being reviewed.  I've redacted my client's domain in the post, so I understand I'll need to provide additional info.  Please let me know what the next steps are, and thank you so much for your assistance!

Libin Varghese
Cisco Employee

To avoid confusion, the TAC case needs to be opened by the CES customer using their support contract details; it cannot be opened through the support forums.

 

Cisco Contact Numbers: https://www.cisco.com/c/en/us/support/web/tsd-cisco-worldwide-contacts.html

Libin, not a problem.  I have this issue on the radar of two different customers of my client.  I'll be following up with them and having them reference this topic and your suggestions so we can move forward.  Again, thank you so much for your reply!  Very happy to have a potential solution!
