On Beta site: Still getting a bunch of DNS alerts all over the place

davebainum
Level 1

Hi there,

We still keep getting 2-8+ DNS alerts from multiple sites, pretty much every 1-2 days... just curious how best to tweak these, if it's possible.  Any thoughts?

Generally the "DNS alert" is within about 2-4 minutes of the "DNS OK" message, and the site doesn't seem to exhibit any other anomalies - at least, that the client has reported, anyways.  ;-)

TIA,

-- Dave Bainum, PMP* (dbainum@ritetech.net)

RiteTech LLC / www.ritetech.net / Tel. +1 (703) 561-0607

[*PMP=PMI Certified Project Management Professional]


Michael Holloway
Cisco Employee

Hi Dave, sorry for the late response. I just rotated back onto the development team from a two-month stint in the Service Operations group for the OnPlus service, doing some knowledge transfer with their team. I even got to wear a pager and wake up at 3am when things weren't happy.

Regarding the DNS alerts, the thresholds are already pretty high for these monitors, so these events likely indicate actual issues being experienced at your sites. By default (these settings are configurable in the ON100's 'monitors' tab), the device attempts to resolve www.cisco.com and allows 15 seconds for the DNS response to be returned. If there is no response from the DNS server, no A record is returned, or the timeout threshold is exceeded, the ON100 waits 60 seconds and then checks again. If that second check also fails for one of the above reasons, the event is generated and any configured notification is sent.
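To make the check-wait-recheck behaviour concrete, here is a rough Python sketch of the logic as described above. It is not the ON100's actual code: the function names (`resolve_a`, `check_once`, `dns_monitor`) are made up for illustration, and it uses the operating system's resolver rather than whatever tool the device runs internally. The 15-second timeout and 60-second retry delay are the defaults mentioned above.

```python
import socket
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout


def resolve_a(hostname):
    """Return the IPv4 addresses the system resolver gives for hostname."""
    infos = socket.getaddrinfo(hostname, None, family=socket.AF_INET)
    return sorted({info[4][0] for info in infos})


def check_once(hostname, timeout_s=15.0, resolver=resolve_a):
    """True if the name resolves to at least one address within timeout_s."""
    pool = ThreadPoolExecutor(max_workers=1)
    try:
        future = pool.submit(resolver, hostname)
        return bool(future.result(timeout=timeout_s))
    except (FutureTimeout, OSError):
        # No response, resolver error, or timeout all count as a failed check.
        return False
    finally:
        pool.shutdown(wait=False)


def dns_monitor(hostname="www.cisco.com", timeout_s=15.0, retry_delay_s=60.0,
                resolver=resolve_a, sleep=time.sleep):
    """Return "OK", or "ALERT" only after two consecutive failed checks."""
    if check_once(hostname, timeout_s, resolver):
        return "OK"
    sleep(retry_delay_s)  # wait before the confirming second check
    if check_once(hostname, timeout_s, resolver):
        return "OK"
    return "ALERT"
```

Calling `dns_monitor()` with no arguments runs one monitoring cycle against www.cisco.com. The `resolver` and `sleep` parameters exist only so the logic can be exercised without a network or a real 60-second wait.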

From the ON100's perspective (just another device on the LAN) it tried to resolve something and failed twice, 60 seconds apart. Other devices on the network should also be seeing the same 'outage' at the time.

The first thing to try is to look for common nameservers used by the multiple sites that are reporting the problem. Are they ISP nameservers? Robert Wyatt was recently speaking to another partner who was having similar DNS issues being reported by OnPlus, and it turned out that their prominent North American cable TV/internet company's DNS servers were being flaky every other day. I believe he switched over to using alternate DNS servers while the cable company acknowledged and worked on fixing the issue.

Inside the ON100's 'info' tab, you can see which dhcp_dns_servers got assigned to the device (if using DHCP). Frequently this is just the DHCP server's own IP address, meaning it is attempting to satisfy DNS queries from LAN clients directly. If this is the case, perhaps the DHCP server could be set to offer nameservers other than itself. Google offers the nameservers 8.8.8.8 and 8.8.4.4 for public use; it might be worth trying these at one of the sites experiencing regular DNS events detected by OnPlus. Alternatively, you could 'pause' the DNS monitor at these sites if you just need the noise to go away. But personally I'd investigate further. End users rarely think to blame DNS servers when a webpage doesn't load; they just go get a cup of coffee, come back, and the website is working again.

-mike

Mike,

This is a fantastic response. Thanks for that.

Are there publicly exposed nameserver sites other than Google's that you'd recommend?

Is there a way to change the DNS lookup target from www.cisco.com, or to specify more than one target? cisco.com is a pretty reliable target, but failure of two or more (geographically or topologically dispersed) different targets would be even more reliable as an indicator of "real" ISP DNS nameserver issues.

Dave C

Can't wait to try the NTOP service... :-)

Go Rangers!!!

Hi Dave,

I'm not really familiar with any DNS servers that are available for public use today other than Google's, although many exist. I did find a recently updated list at the link below, but unfortunately the list doesn't provide links to each operator's public-use policy:

http://www.tech-faq.com/public-dns-servers.html

I frequently use GTE Verizon's nameservers for external testing, and I'm sure that many people on this forum do as well.

Regarding the lookup target, yes, this can easily be changed by editing the ON100 device and going to its 'Monitors' tab. While the default www.cisco.com is a reasonable choice because of the level of redundancy in Cisco's nameservers, I do recommend setting the target to one that the VAR has control and visibility over, possibly a DNS record that should always be resolvable by the customer's computers when their DNS servers are working properly. If the customer uses MS SBS servers that provide the DNS service for the users at a site, then checking a locally hosted record for which the server itself is the Start of Authority (SOA) would be ideal, as it cuts the internet out of the picture when troubleshooting DNS problems being reported by OnPlus. The target could also be the VAR's domain name, or another common domain, but in the event of a DNS outage for that domain, the VAR might receive tens or hundreds of events from all of their customers. It should definitely be a well-hosted domain with plenty of working backup nameservers in the registration record.

It's possible to have OnPlus check multiple DNS records (separately) by adding a DNS monitor to another device on the customer's dashboard. The default DNS monitor on the ON100 device will check the configurable DNS record on the nameservers being offered via DHCP to the computers on that network (or the statically configured nameservers if you log into the ON100 device and set these). But you can also add a DNS monitor to any other device on the customer's network, or you can even add a new device that isn't really on the network, such as Google's nameservers answering as 8.8.8.8. You can set the record to be checked to a different record if you wish.

Unfortunately, there isn't currently any mechanism to tie the results of 2 separate DNS monitors together, so you would still receive events (and notifications if you have them configured) if either of the names failed to resolve from their respective devices, but you would at least know that it was a problem only with a single domain and that the other name was still resolving for the customer's devices.

There is definitely room for improvement here. We could add the ability to attach multiple DNS monitors to a single device (such as the ON100), and we could also stand to implement some basic event correlation so that an event is only generated when a cluster of [DNS] monitors fail.
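As a rough idea of what that kind of correlation could look like (purely a sketch, not something OnPlus does today), the Python function below raises an event only when at least `min_failures` monitors in a group have failed. The function name, the monitor labels, and the default threshold of 2 are all assumptions for illustration.

```python
def correlated_event(results, min_failures=2):
    """Decide whether a group of DNS monitor results warrants one event.

    results maps a monitor label to True (resolved) or False (failed).
    An event is raised only when at least min_failures monitors failed.
    """
    failed = sorted(label for label, ok in results.items() if not ok)
    return {"event": len(failed) >= min_failures, "failed": failed}
```

With results such as `{"www.cisco.com": False, "intranet record": True}`, a single failing name is reported in `failed` but does not raise an event; both names failing would.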

Dave B. Just to mention that the DNS errors that I have experienced in the past... have been fixed by the firmware upgrades they made available shortly afterwards..  So maybe they are real alerts? 

Yes, we did recently change the process behind the DNS monitor (we moved from using the deprecated nslookup tool to using dig), and I believe the detection thresholds may have been slightly softened around the same time-frame. The monitor should be useful in detecting DNS outages lasting longer than 1 minute, and I don't believe we've yet seen an example of a false event being detected from this monitor since the changes. Cisco could always move its nameservers around (possibly at odd hours) and cause a brief disruption to the domain name long enough for this monitor to trigger an event, so it is best to use a locally hosted record instead if possible.

I also need to clarify/correct my previous statement that:

>> you can also add a DNS monitor to any other device on the customer's network, or you can even add a new device that isn't really on the network, such as Google's nameservers answering as 8.8.8.8.

You should only ever add a DNS monitor to a device on the network that *IS* a DNS server, as that device's IP is what the monitor will attempt to connect to and resolve from. If you want to check different external nameservers than what the ON100 will check by default (the nameservers the ON100 is configured to use via DHCP, or statically set), you'll need to add a new host with the new DNS server's IP address (such as 8.8.8.8). 
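To illustrate why the monitored device has to be a DNS server itself, here is a minimal Python sketch of a check that sends an A-record query straight to one nameserver's IP, roughly what `dig @8.8.8.8 www.cisco.com A` does. It is an illustration only, not how the OnPlus monitor is implemented; the function names are hypothetical, it uses plain UDP with no TCP fallback, and it only confirms that some answer record came back, without verifying that it is an A record.

```python
import socket
import struct


def build_a_query(hostname, query_id=0x1234):
    """Build a minimal DNS query packet asking for the A record of hostname."""
    # Header: id, flags (recursion desired), 1 question, no other records.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    labels = hostname.rstrip(".").split(".")
    qname = b"".join(bytes([len(l)]) + l.encode("ascii") for l in labels) + b"\x00"
    return header + qname + struct.pack("!HH", 1, 1)  # QTYPE=A, QCLASS=IN


def reply_ok(reply, query_id=0x1234):
    """True if reply is a response to our query with RCODE 0 and >= 1 answer."""
    if len(reply) < 12:
        return False
    rid, flags, _qd, ancount, _ns, _ar = struct.unpack("!HHHHHH", reply[:12])
    is_response = bool(flags & 0x8000)
    rcode = flags & 0x000F
    return rid == query_id and is_response and rcode == 0 and ancount > 0


def server_answers(server_ip, hostname, timeout_s=15.0):
    """Ask server_ip directly (UDP port 53) whether it can resolve hostname."""
    query = build_a_query(hostname)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout_s)
        try:
            sock.sendto(query, (server_ip, 53))
            reply, _ = sock.recvfrom(512)
        except OSError:
            # Timeout or network error: the server did not answer.
            return False
    return reply_ok(reply)
```

For example, `server_answers("8.8.8.8", "www.cisco.com")` queries Google's public nameserver directly. Pointing the same call at a device that is not running a DNS service would simply time out, which is why a DNS monitor only makes sense on a host that is actually a nameserver.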

The default ON100 DNS monitor will always continue to only inspect the DNS servers that the ON100 itself is configured to use. However, you can optionally pause that monitor if you wind up manually adding additional DNS monitors to other devices on the network. For example, if you manually add a DNS monitor to a pair of DNS servers on the network and they happen to be the same DNS servers that are being offered via DHCP to the ON100, then the ON100 would be doing duplicate testing of those services (4 DNS tests in total per period) until you paused the default monitor on the ON100 device. Otherwise, if you just add a new host (e.g. for 8.8.8.8), I'd recommend leaving the default ON100 DNS monitor running as well, because the servers it checks are the ones that other devices on the network should be receiving from the same DHCP server. You may want to know when client computers are experiencing a DNS outage, and adding a check of a third-party nameserver (and also possibly checking a different domain record) just helps you quantify the magnitude of a problem when it occurs. Are all names failing to resolve, or just www.cisco.com? And is it just the nameservers the LAN computers are using that are seeing the issue, or are third-party nameservers seeing it too?
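Those two questions can be answered mechanically once you have results from more than one monitor. The Python sketch below is only an illustration of that reasoning; the function name and the `(nameserver, record)` keys are assumptions, and OnPlus itself does not combine monitor results this way. The summary is deliberately coarse: it reports which names and which nameservers appear in any failure, not the exact combinations.

```python
def diagnose(results):
    """Summarise the scope of a DNS problem.

    results maps (nameserver_label, record_name) to True if that record
    resolved via that nameserver, or False if the check failed.
    """
    names = {name for _, name in results}
    servers = {server for server, _ in results}
    failed = [key for key, ok in results.items() if not ok]
    if not failed:
        return "all checks passed"
    failed_names = {name for _, name in failed}
    failed_servers = {server for server, _ in failed}
    if failed_names == names:
        name_scope = "all names"
    else:
        name_scope = "only " + ", ".join(sorted(failed_names))
    if failed_servers == servers:
        server_scope = "all nameservers"
    else:
        server_scope = "only " + ", ".join(sorted(failed_servers))
    return f"{name_scope} failing on {server_scope}"
```

If every record fails through the LAN's nameservers but resolves through 8.8.8.8, the summary points at the local or ISP nameservers rather than at any one domain.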

-mike

Thanks for the guidance and advice.  We'll have to take a closer look at our DNS messages.

There are two sites, in particular, where they seem to be occurring fairly frequently.  Both sites use a different ISP, however.  The primary DNS at one site is just a Windows 2008 R2 server; the other (I'm pretty sure) is either just the ISP or Google's DNS (8.8.8.8), or possibly OpenDNS.  However, we'll take a closer look, and see what we can figure out.

Cheers,

-- Dave Bainum, PMP* (dbainum@ritetech.net)

RiteTech LLC / www.ritetech.net / Tel. +1 (703) 561-0607

[*PMP=PMI Certified Project Management Professional]