09-11-2010 11:35 PM
Noticed one of the devices has been hung since 8/25. Another one had been hung since 9/1, but I just reset the power to it.
Any way you can take a look at the hung one, or should I drive out and reset the power?
09-12-2010 04:13 AM
Hello Edward,
Just curious: No notification from the cloud side about the unreachable appliance?
-Kurt.
09-12-2010 08:02 AM
Edward,
This would be for cust 1118? I see the Portal fired a warning event at that time. It fires warnings when a connection is lost unexpectedly and notices otherwise (i.e., if the device is rebooting for new firmware, we reduce the event severity). I see a separate connection warning at about the same time of day but three days earlier, as well as a bunch of warnings regarding DNS lookup latency, all on the second DNS server. I don't know that there is much information we can derive from that, except maybe that the second DNS server might have issues. Or not; I see it is off-net and the primary is local.
So, when you have a chance please reboot the TBA and drop us a note so we can take a look at its logs.
In the meantime, I'll take a look at 1086, which I figure was the other device, now back up.
Andy
09-12-2010 11:23 AM
Hi Andy,
Received a "Site Comms down" (Event Id 3370096) for 1150 today, not long after the customer was set up.
CONN, CLOSE or HEARTBEAT: CLOSED
UP or DOWN: DOWN
A brief check later shows the site is reachable from the cloud.
I care about this because no "UP" notification was ever sent!
For the moment, both appliances are NATed to the same public WAN IP. As the SSL tunnel is apparently set up from the appliance side, even some additional internal NAT should not create any issues. But I could be wrong here.
Sorry for hijacking this discussion - a moderator is free to move this to a new thread.
Regards,
-Kurt.
09-12-2010 11:37 AM
Hi Edward, Kurt,
We have two separate issues going on in the thread, and the good news is we think we've identified the causes of both.
First, for the 'hung' TBA issue, we believe that we've identified scenarios in which the TBA fails to acquire a DHCP lease (or lease renewal). Now that we understand the cause, it should be easy to correct.
Second, the lack of an 'UP' event. These are currently set as 'Informational' severity. If you didn't have a notification rule to send these specific event types, or events of 'Informational' severity (which most people would not), you would never see these. The portal periodically scrubs 'Informational' events, so if you look for the 'UP' event a few days later you may not find it. To address this issue, we'll change these events to 'Notice' severity, to match the 'DOWN' event severity. These are both 'normal but significant condition' types of events. Glad you caught this one.
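To illustrate why an 'Informational' event never reaches a notification rule while a 'Notice' event does, here is a minimal sketch. It assumes syslog-style numeric severities (lower number means more severe) and a rule that notifies at a threshold or worse; the names `Severity` and `should_notify` are hypothetical and are not the portal's actual API.

```python
from enum import IntEnum


class Severity(IntEnum):
    # Assumption: syslog-style ordering, where a lower number is more severe.
    WARNING = 4
    NOTICE = 5
    INFORMATIONAL = 6


def should_notify(event_severity, rule_threshold):
    """Return True if the event is at least as severe as the rule's threshold."""
    return event_severity <= rule_threshold


# Hypothetical rule: notify on 'Notice' or anything more severe.
rule_threshold = Severity.NOTICE

# The 'DOWN' event at 'Notice' passes the rule.
down_sent = should_notify(Severity.NOTICE, rule_threshold)

# The 'UP' event at 'Informational' is filtered out by the same rule.
up_sent_before_change = should_notify(Severity.INFORMATIONAL, rule_threshold)

# Once 'UP' is raised to 'Notice', it passes as well.
up_sent_after_change = should_notify(Severity.NOTICE, rule_threshold)
```

With a threshold of 'Notice', the 'DOWN' event is delivered and the 'Informational' 'UP' event is not, which matches the behavior described above.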
-mike
09-12-2010 10:28 PM
Ok, I will have to go there and reset the power on it. And then hopefully you'll be able to glean more info from it once it comes back online.
The network there has continued to beat and despite DNS issues reported by it, no complaints from users have come in... interesting.
09-13-2010 07:37 AM
I wouldn't really expect problems on the site because the primary DNS is not having issues. The probe wants an answer within 5 seconds, so it will fire if the DNS server is only a "little" slow. A 5-second response would be a big drag on a network if it were affecting all lookups, but as the primary server remains responsive, probably nobody but the TBA would notice.
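As a rough illustration of that kind of latency check, the sketch below times a lookup and flags it when it takes longer than 5 seconds. Only the 5-second threshold comes from this thread; the function name `probe_dns`, the injected `resolve` callable, and the choice to also flag a failed lookup are assumptions, not the TBA's actual probe.

```python
import time

THRESHOLD_SECONDS = 5.0  # from the thread: the probe wants an answer within 5 seconds


def probe_dns(resolve, hostname, threshold=THRESHOLD_SECONDS, clock=time.monotonic):
    """Time one lookup and report whether it should raise a warning.

    `resolve` is any callable that takes a hostname and raises OSError on
    failure (hypothetical interface, chosen so the lookup can be swapped out).
    """
    start = clock()
    try:
        resolve(hostname)
        ok = True
    except OSError:
        # Assumption: a failed lookup is treated the same as a slow one.
        ok = False
    elapsed = clock() - start
    return {
        "hostname": hostname,
        "elapsed": elapsed,
        "warn": (not ok) or elapsed > threshold,
    }
```

For a quick manual check, `probe_dns(socket.gethostbyname, "example.com")` would time a lookup through the system resolver. Note that this goes through whatever servers the OS is configured to use, so it cannot single out the secondary server the way the probe in the thread apparently does.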
09-15-2010 03:58 AM
Andy,
I had a "duh" moment and went ahead and did an administrative shutdown and no shutdown on the PoE port (the TBA is on an ASA 5505). It is back up now so you can take a further look at it.
-Ed
09-15-2010 09:59 AM
Edward,
Thanks. I've taken a look and logs indicate a hard crash at 2010.08.25 02:38:26, but there is no clear indication as to why. Could have been a power outage, kernel crash, etc. Unfortunately we have very limited postmortem tools on these small devices. I also see some errors logged regarding not getting DHCP responses but there are some inconsistencies we're still trying to resolve.
We have found a problem with the DHCP client on the TBA in that it will give up making DHCP requests if it doesn't get an answer after a few minutes. That means that if there was a temporary issue with DHCP at a site and the TBA reboots, it may never recover. This problem is corrected in the forthcoming release, which also offers the ability to set a static IP, bypassing the issue entirely.
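For readers wondering what "not giving up" can look like in general, here is a minimal sketch of a retry loop that keeps requesting a lease indefinitely, with a capped exponential backoff between attempts. This is an illustration only and not the actual fix in the release; `acquire_lease`, `request_lease`, and the delay values are all hypothetical.

```python
import time


def acquire_lease(request_lease, sleep=time.sleep, initial_delay=1.0, max_delay=60.0):
    """Keep requesting a lease until one is returned; never give up.

    `request_lease` is a hypothetical callable that returns a lease, or None
    when no DHCP answer arrived. Returns (lease, number_of_attempts).
    """
    delay = initial_delay
    attempts = 0
    while True:
        attempts += 1
        lease = request_lease()
        if lease is not None:
            return lease, attempts
        # No answer: wait, then retry. The delay doubles up to a cap, so a
        # long DHCP outage does not cause a flood of requests.
        sleep(delay)
        delay = min(delay * 2, max_delay)
```

The point of the cap is that the client keeps trying at a steady, modest rate for as long as the outage lasts, so a temporary DHCP problem at boot time does not leave the device permanently offline.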
Andy