Re: Events don't e-mail out from one client...

netguyz08 · 05-28-2010

Since one client (1086) was down for a few days, came up and the firmware updated, I haven't been receiving e-mails for warnings and critical events. I checked the Event History to confirm entries were showing up, and some do.

I also observed that when the Linksys router was down, nothing was reported at all. So I removed and re-added the monitors for the server, the router and the wireless router. I restarted the wireless router as a test. I had received an event earlier today that it was "up" but never an event when it was supposedly down. After restarting it just now, no events were recorded at all.

Michael Holloway · 05-29-2010

netguyz08 wrote:

Since one client (1086) was down for a few days, came up and the firmware updated, I haven't been receiving e-mails for warnings and critical events. I checked the Event History to confirm entries were showing up, and some do.

I also observed that when the Linksys router was down, nothing was reported at all. So I removed and re-added the monitors for the server, the router and the wireless router. I restarted the wireless router as a test. I had received an event earlier today that it was "up" but never an event when it was supposedly down. After restarting it just now, no events were recorded at all.

Hi Edward,

Discovery runs pretty much constantly, mostly listening to passive broadcast protocols and running some active probes every few minutes to update the network topology. Each monitor, however, actively runs only once every 5 minutes. This avoids adding any notable load on the network and on the devices being monitored, and it's very common for active network monitoring. The checks for each service are scheduled 5 minutes apart, all interleaved and evenly distributed across each hour. Upon finding a downed service on one of these 5-minute checks, another check is immediately scheduled for 60 seconds later. If the service or server is still down, the event is considered 'real' and is sent to the portal, where it may trigger notifications. This means that the time from when a service or server goes offline until it's detected and reported to the portal could be anywhere from a few seconds to about 6 minutes.

I'm wondering if the reboot time of your Linksys router was quick enough to fall in-between two 5 minute checks and so never be found down.
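To make the timing concrete, here is a minimal sketch of the check-and-recheck schedule described above. It is a simplified model, not the TBA's actual scheduler: the function name `outage_report_time` is hypothetical, checks are assumed to land on exact multiples of the interval, and probe time-outs and portal delivery time are ignored.

```python
import math
from typing import Optional


def outage_report_time(outage_start: float, outage_len: float,
                       interval: float = 300.0,
                       recheck: float = 60.0) -> Optional[float]:
    """Return when a 'down' event would be sent, or None if the outage is missed.

    Hypothetical model: checks run at t = 0, interval, 2*interval, ... and a
    failed check schedules one confirming re-check `recheck` seconds later.
    """
    outage_end = outage_start + outage_len
    # First scheduled check at or after the moment the host went down.
    first_check = math.ceil(outage_start / interval) * interval
    if first_check >= outage_end:
        return None  # host was back up before any check ran
    confirm = first_check + recheck
    if confirm >= outage_end:
        return None  # host recovered before the confirming re-check
    return confirm


# A 90-second reboot that starts 10 s after a check is never seen.
print(outage_report_time(outage_start=10, outage_len=90))    # None
# The same reboot starting 10 s before a check is caught and confirmed.
print(outage_report_time(outage_start=290, outage_len=90))   # 360.0
```

In this model, whether a short reboot is reported depends entirely on where it falls relative to the schedule, which would match seeing an "up" event at some point but no "down" event for a quick restart.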

I see a notification rule you have set up to send all CRITICAL and higher severity events via email, and I see one such event on 5/28 for this customer 1086 at 1:05pm PDT; the previous CRITICAL or higher event occurred back on 5/20. We'd expect that 1 email was sent today, so if you didn't receive it please let us know and we'll investigate further.

-mike

Marcos Hernandez · 06-03-2010

Hi Ed,

Was this fixed?

Thanks,

Marcos

netguyz08 · 06-04-2010

Marcos,

Yes, taking the events and adding them, then shutting them off worked.

I've noticed some other issues, so I am hoping to see what the next drop will bring, because this last firmware seems to have made the reporting in general less consistent. Also, a router I took off a network today won't disappear, but the new one appeared (and wasn't identified - an RV042).

Michael Holloway · 06-04-2010

Hi Edward,

Regarding the device that doesn't disappear, do you mean that it does not show as missing in the topology or dashboard views? We never remove devices automatically once discovered, but should mark them as missing (red X).

As for the RV042 being detected but not properly labelled: understood, this is a known issue in the current phase of the trial. We currently identify only a handful of products, but that list will grow in future drops.

You say 'taking the events and adding them, then shutting them off worked.' Can you please explain what you did here so that we can examine the issue further? Did you mean adding and then removing monitors, or notifications?

Also, what events transpired that you expected to receive a notification for, but did not? In my previous message I listed the notification rule that you had configured, and the single event that met the criteria. Did we have a failure to record an event here?

Thanks!

-mike

netguyz08 · 06-07-2010

Mike,

The RV042 shows up correctly, but the old Linksys BEFSX41 still shows as well. Both have the same IP address, but the BEFSX41 is totally gone from the network. It shows up though as if it still exists and I have no option to delete it.

For disabling events, I would add the Host State monitor on a device, and set Host Up and Host Down to "No Event." This prevented me from being bugged about the device any further.

And then to answer the failure to record an event, yes - that happened. I took the BEFSX41 at one point and did a "reset" on it through the web interface and it was never recorded as being down even though I lost my Remote Desktop connection into the LAN. I updated the firmware and rebooted it again and still no event. An event did finally record that it went offline when I unplugged the thing completely.

All of that has been about events recorded or not recorded in the Event History for a given customer. I have also seen events logged in the Event History that are Critical, Errors or Warnings (which I used to get e-mail alerts on) and that no longer send out an e-mail notification (which is what this thread was about originally). It isn't consistent enough, but I noticed the drop in notices from one site when I updated the firmware to the latest and didn't change anything else.

Michael Holloway · 06-07-2010

netguyz08 wrote:

The RV042 shows up correctly, but the old Linksys BEFSX41 still shows as well. Both have the same IP address, but the BEFSX41 is totally gone from the network. It shows up though as if it still exists and I have no option to delete it.

Ah, yes I see both devices on this site claiming 192.168.1.1 and both appearing as present on the network. There is probably something going on with ARP that is making the TBA believe the BEFSX41 is still present on the network. We'll take a close look at this one in the morning.

netguyz08 wrote:

For disabling events, I would add the Host State monitor on a device, and set Host Up and Host Down to "No Event." This prevented me from being bugged about the device any further.

Gotcha, this issue is fixed in drop 3.

netguyz08 wrote:

And then to answer the failure to record an event, yes - that happened. I took the BEFSX41 at one point and did a "reset" on it through the web interface and it was never recorded as being down even though I lost my Remote Desktop connection into the LAN. I updated the firmware and rebooted it again and still no event. An event did finally record that it went offline when I unplugged the thing completely.

By reset, I take that to mean that you caused it to reboot, not a 'factory reset', and your RDC went offline during this, so you know that it did indeed reset. Since the TBA only actively checks that a host is 'up' every 5 minutes (see earlier in this thread), it's not surprising that this device reset and was back in action before the next time the TBA looked to see if the host was up. In drop 3, we increased the resolution a bit for host state checks to once every 2 minutes. If the host doesn't answer, another check is done 60 seconds later, and if it is still down the event is sent (3 minutes total instead of 6).
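The "3 minutes instead of 6" arithmetic, and what it means for short reboots, can be sketched as follows. This is only an illustration under a simplified model (checks exactly on schedule, one confirming re-check, probe time-outs ignored); `detection_bounds` and `DetectionBounds` are hypothetical names, not part of the TBA.

```python
from typing import NamedTuple


class DetectionBounds(NamedTuple):
    never_reported_up_to_s: int   # outages this short are never reported
    always_reported_from_s: int   # outages this long are always reported
    worst_case_delay_s: int       # upper bound on time from failure to event


def detection_bounds(interval_s: int, recheck_s: int) -> DetectionBounds:
    """Bounds for a periodic check followed by one confirming re-check.

    An outage must still be in progress at the re-check to be reported, so
    anything no longer than `recheck_s` is always missed. An outage lasting
    `interval_s + recheck_s` or more always spans a check and its re-check.
    """
    worst = interval_s + recheck_s
    return DetectionBounds(recheck_s, worst, worst)


for label, interval in (("5-minute checks", 300), ("drop 3 host state", 120)):
    b = detection_bounds(interval, recheck_s=60)
    print(f"{label}: always reported from {b.always_reported_from_s} s, "
          f"worst-case delay {b.worst_case_delay_s} s")
```

Outages between the two bounds may or may not be reported, depending on where they fall relative to the schedule.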

netguyz08 wrote:

All of that has been about events recorded or not recorded in the Event History for a given customer. I have also seen events logged in the Event History that are Critical, Errors or Warnings (which I used to get e-mail alerts on) and that no longer send out an e-mail notification (which is what this thread was about originally). It isn't consistent enough, but I noticed the drop in notices from one site when I updated the firmware to the latest and didn't change anything else.

You might want to review the notifications configured on the portal and make sure that one is set to send Warning events and higher to your email (if that is what you want to receive emails on). When I last looked, I don't think I saw such a notification rule, and that would certainly keep emails from being sent.
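As a rough illustration of why a rule set to CRITICAL and higher stays silent for Warnings and Errors, here is a minimal sketch of threshold-based matching. The class names, the severity ordering, and the example address are all assumptions made for the example; the portal may define more levels and may match rules differently.

```python
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    # Assumed ordering; the portal may define more levels than these.
    WARNING = 1
    ERROR = 2
    CRITICAL = 3


@dataclass
class Event:
    customer: int
    severity: Severity
    message: str


@dataclass
class NotificationRule:
    min_severity: Severity
    email: str

    def matches(self, event: Event) -> bool:
        # Notify only for events at or above the rule's threshold.
        return event.severity >= self.min_severity


events = [
    Event(1086, Severity.WARNING, "host slow to respond"),
    Event(1086, Severity.ERROR, "service check failed"),
    Event(1086, Severity.CRITICAL, "host down"),
]

critical_only = NotificationRule(Severity.CRITICAL, "admin@example.com")
warning_up = NotificationRule(Severity.WARNING, "admin@example.com")

print([e.message for e in events if critical_only.matches(e)])
print(len([e for e in events if warning_up.matches(e)]))
```

With the threshold at CRITICAL, the Warning and Error events still appear in the history but trigger no email; lowering the threshold to WARNING makes all three match.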

-mike

netguyz08 · 06-07-2010

Mike,

Ok, thanks for letting me know about the notifications. I fixed that, so I should be receiving the other notifications now.

Ok, and thanks for letting me know about the restart. That sounds exactly like what is happening. I never did a factory reset, just a quick restart on the router. I have even experimented and rebooted a server of mine a few times, and noticed that since it comes back on so quickly it never gets reported as up or down.

In Drop 3 will we be able to change the time on host state checks? I could see where I'd want to adjust it at times to something lower to see if some issue is occurring. Might that adjustment be possible in the future?

Michael Holloway · 06-08-2010

netguyz08 wrote:

In Drop 3 will we be able to change the time on host state checks? I could see where I'd want to adjust it at times to something lower to see if some issue is occurring. Might that adjustment be possible in the future?

Not in drop 3, which is currently in testing, because of the scope of this change, but it is something we could add to a future drop. What intervals would work for you, assuming the GUI was a dropdown list? 30 seconds, 1, 2, 3, 4, and 5 minutes? Under 30 seconds probably isn't an option, because the scheduler needs to account for protocol time-outs in the previous check.

5-minute checks are fairly standard for active service monitoring (so as not to induce unnecessary load on what is being tested), but we're building this solution around your feedback and your needs. Do you see this setting needing to be available for the other monitors besides host state? Or is checking services every 5 minutes still ok? Anyone else have an opinion here?

The secondary 'down' check doesn't *need* to occur after 60 seconds either. In fact, we could probably change this to 30 seconds for all re-checks, which should still account for various protocol time-outs. The secondary checks are desirable to keep alerts from being generated by a restarting service, and generally cut down the number of false positives to a tolerable level.
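The idea of the secondary check can be sketched like this. It is a generic confirm-before-alerting pattern rather than the TBA's code; `confirmed_down`, the `probe` callable, and the injectable `sleep` are hypothetical names chosen for the example.

```python
import time
from typing import Callable


def confirmed_down(probe: Callable[[], bool],
                   recheck_delay_s: float = 60.0,
                   sleep: Callable[[float], None] = time.sleep) -> bool:
    """Return True only if two probes, `recheck_delay_s` apart, both fail.

    `probe` returns True when the host or service answered. The delay is a
    parameter so it could be shortened (for example to 30 seconds).
    """
    if probe():
        return False          # first check passed, nothing to report
    sleep(recheck_delay_s)    # give a restarting service time to come back
    return not probe()        # report only if it is still down


# Simulated probes: a service that restarts quickly vs. one that stays down.
restarting = iter([False, True])
still_down = iter([False, False])
no_wait = lambda seconds: None  # skip the real delay in this demo

print(confirmed_down(lambda: next(restarting), sleep=no_wait))  # False
print(confirmed_down(lambda: next(still_down), sleep=no_wait))  # True
```

A shorter re-check delay reports real outages sooner but gives a restarting service less time to recover before an alert is raised.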

-mike

Michael Holloway · 06-08-2010

netguyz08 wrote:

The RV042 shows up correctly, but the old Linksys BEFSX41 still shows as well. Both have the same IP address, but the BEFSX41 is totally gone from the network. It shows up though as if it still exists and I have no option to delete it.

Hi Edward, an update for you on this issue. We found that the Bonjour discovery daemon on the TBA was hung, so it was not detecting that this node had disappeared from the network, and the device was never marked as missing for you to delete. I've restarted this discovery protocol on the TBA for you, and the device is now detected as missing and can be deleted. We'll continue to try to get to the bottom of how exactly the Bonjour daemon locked up. Resetting the TBA would have also cleared this error, but thanks for letting us take a look first.

-mike