Re: Thunderbolt hung

netguyz08 · ‎09-11-2010

Noticed one of the devices has been hung since 8/25. Another one had been hung since 9/1, but I just reset the power to it.

Any way you can take a look at the hung one, or should I drive out and reset the power?

Kurt Schumacher · ‎09-12-2010

Hello Edward,

Just curious: No notification from the cloud side about the unreachable appliance?

-Kurt.

afullfor · ‎09-12-2010

Edward,

This would be for cust 1118? I see the Portal fired a warning event at that time. It fires warnings when a connection is lost unexpectedly and notices otherwise (i.e., if the device is rebooting for new firmware, we reduce the event severity). I see a separate connection warning at about the same time of day but three days earlier, as well as a bunch of warnings regarding DNS lookup latency, all on the second DNS server. I don't know that there is much information we can derive from that, except maybe that the second DNS server might have issues. Or not; I see it is off-net and the primary is local.
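The severity rule described above can be sketched roughly as follows. This is only an illustration of the stated behavior, not the Portal's actual code; the `Severity` enum and the `connection_lost_severity` function are hypothetical names.

```python
from enum import Enum


class Severity(Enum):
    # Hypothetical severity levels, lowest to highest.
    INFORMATIONAL = 1
    NOTICE = 2
    WARNING = 3


def connection_lost_severity(expected: bool) -> Severity:
    """Pick the severity for a lost-connection event.

    expected=True models a planned disconnect (for example a reboot
    for new firmware), which gets the reduced severity.
    """
    return Severity.NOTICE if expected else Severity.WARNING


print(connection_lost_severity(expected=False).name)  # WARNING
print(connection_lost_severity(expected=True).name)   # NOTICE
```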

So, when you have a chance please reboot the TBA and drop us a note so we can take a look at its logs.

In the meantime, I'll take a look at 1086, which I figure was the other device, now back up.

Andy

Kurt Schumacher · ‎09-12-2010

Hi Andy,

Received a "Site Comms down" event (Event Id 3370096) for 1150 today, not long after the customer was set up earlier today.

CONN, CLOSE or HEARTBEAT: CLOSED
UP or DOWN: DOWN

A brief check later shows the site is reachable from the cloud.

I care about this because no "UP" notification was sent!

For the moment, both appliances are NATed to the same public WAN IP. As the SSL tunnel is apparently set up from the appliance side, even some additional internal NAT should not create any issues. But I could be wrong here.

Sorry for hijacking this discussion - a moderator is free to move this to a new thread.

Regards,

-Kurt.

Michael Holloway · ‎09-12-2010

Hi Edward, Kurt,

We have two separate issues going on in the thread, and the good news is we think we've identified the causes of both.

First, for the 'hung' TBA issue, we believe that we've identified scenarios in which the TBA fails to acquire a DHCP lease (or lease renewal). Now that we understand the cause, it should be easy to correct.

Second, the lack of an 'UP' event. These are currently set as 'Informational' severity. If you didn't have a notification rule to send these specific event types, or events of 'Informational' severity (which most people would not), you would never see these. The portal periodically scrubs 'Informational' events, so if you look for the 'UP' event a few days later you may not find it. To address this issue, we'll change these events to 'Notice' severity, to match the 'DOWN' event severity. These are both 'normal but significant condition' types of events. Glad you caught this one.
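To make the effect of the severity change concrete, here is a minimal sketch of a notification rule with a minimum severity. The rule format, the `SEVERITY_RANK` table, and the function names are assumptions for illustration and are not taken from the portal.

```python
# Hypothetical ordering of the severities mentioned in this thread.
SEVERITY_RANK = {"Informational": 0, "Notice": 1, "Warning": 2}


def should_notify(event_severity, rule_min_severity):
    """True if the event is at or above the rule's minimum severity."""
    return SEVERITY_RANK[event_severity] >= SEVERITY_RANK[rule_min_severity]


def notified(events, rule_min_severity="Notice"):
    """Names of the events that a rule with this minimum would send."""
    return [e["name"] for e in events
            if should_notify(e["severity"], rule_min_severity)]


# Current behavior: 'UP' is Informational and falls below the rule.
before = [{"name": "DOWN", "severity": "Notice"},
          {"name": "UP", "severity": "Informational"}]
# Planned behavior: 'UP' is raised to Notice, matching 'DOWN'.
after = [{"name": "DOWN", "severity": "Notice"},
         {"name": "UP", "severity": "Notice"}]

print(notified(before))  # ['DOWN']
print(notified(after))   # ['DOWN', 'UP']
```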

-mike

netguyz08 · ‎09-12-2010

Ok, I will have to go there and reset the power on it. And then hopefully you'll be able to glean more info from it once it comes back online.

The network there has continued to beat and despite DNS issues reported by it, no complaints from users have come in... interesting.

afullfor · ‎09-13-2010

I wouldn't really expect problems on the site because the primary DNS is not having issues. The probe wants an answer within 5 seconds, so it will fire if the DNS server is only a "little" slow. A 5-second response would be a big drag on a network if it were affecting all lookups, but as the primary server remains responsive, probably nobody but the TBA would notice.
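For anyone curious, a latency probe of this kind could look roughly like the sketch below. Only the 5-second threshold comes from the post; `probe_dns`, the injectable `resolve` and `clock` parameters, and the choice to treat a failed lookup as a firing probe are all assumptions, not the TBA's real implementation.

```python
import time

THRESHOLD_SECONDS = 5.0  # the limit mentioned in the post


def probe_dns(resolve, hostname, threshold=THRESHOLD_SECONDS,
              clock=time.monotonic):
    """Time one lookup and return (latency_seconds, fired).

    resolve is any callable that looks up hostname; it is injected so
    the probe can target a specific server or be tested offline.
    """
    start = clock()
    try:
        resolve(hostname)
    except OSError:
        # Assumption: a failed lookup counts as the probe firing.
        return clock() - start, True
    latency = clock() - start
    return latency, latency > threshold
```

Passing `socket.gethostbyname` as `resolve` would time the system resolver as a whole; checking only the secondary server would need a resolver that queries that server directly.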

netguyz08 · ‎09-15-2010

Andy,

I had a "duh" moment and went ahead and did an administrative shutdown and no shutdown on the PoE port (the TBA is on an ASA 5505). It is back up now so you can take a further look at it.

-Ed

afullfor · ‎09-15-2010

Edward,

Thanks. I've taken a look and the logs indicate a hard crash at 2010.08.25 02:38:26, but there is no clear indication as to why. It could have been a power outage, kernel crash, etc. Unfortunately we have very limited postmortem tools on these small devices. I also see some errors logged regarding not getting DHCP responses, but there are some inconsistencies we're still trying to resolve.

We have found a problem with the DHCP client on the TBA in that it will give up making DHCP requests if it doesn't get an answer after a few minutes. That means that if there was a temporary issue with DHCP at a site and the TBA reboots, it may never recover. This problem is corrected in the forthcoming release, which also offers the ability to set a static IP, bypassing the issue entirely.
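The thread does not say how the new release fixes this, but the general idea of a client that never gives up can be sketched as a retry loop with a capped backoff. Everything here is hypothetical: `acquire_lease`, the `request_lease` callable, and the 4-second and 64-second delays are illustrative choices, not the TBA's actual DHCP client.

```python
import itertools
import time


def acquire_lease(request_lease, sleep=time.sleep,
                  base_delay=4.0, max_delay=64.0):
    """Keep requesting a lease until one is granted.

    request_lease() is a hypothetical callable that returns a lease,
    or None if no DHCP server answered in time. The delay between
    attempts doubles up to max_delay and then stays there, so the
    client keeps trying indefinitely instead of giving up.
    """
    delay = base_delay
    for attempt in itertools.count(1):
        lease = request_lease()
        if lease is not None:
            return lease, attempt
        sleep(delay)
        delay = min(delay * 2, max_delay)
```

With a loop like this, a DHCP server that is down while the device boots only delays startup; the lease is picked up once the server answers again.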

Andy