OnPlus - Asymmetric Events and Useless Timestamps - DOWN, but no UP...

Kurt Schumacher · ‎06-01-2011

Ongoing... DOWN events, but no UP events...

Every site is part of a time zone. Sorry to say, we're not interested in knowing what time zone the cloud servers are located in. We expect either standardized (GMT) or LOCAL time, including time zone and DST, on any event. To conclude, time is still not handled according to ISO: back to R&D.

Event		OnPlus: Connecton status
Event Date/Time		2011-05-30 18:31:56-05:00
Event Message		Site Comms down: 84.nn.nn.nn
Customer Name		KCS (SOHO)
Device ID		00:50:43:nn:nn:nn
CONN, CLOSE or HEARTBEAT		HEARTBEAT
UP or DOWN		DOWN

Event		OnPlus: Connecton status
Event Date/Time		2011-05-30 19:09:46-05:00
Event Message		Site Comms down: 84.nn.nn.nn
Customer Name		KCS (SOHO)
Device ID		00:50:43:nn:nn:nn
CONN, CLOSE or HEARTBEAT		HEARTBEAT
UP or DOWN		DOWN

Tastes to me like a bad (stateless) system design....

jamwyatt · ‎06-01-2011

Hi Kurt,

While I can't comment on the time zone concerns, I can comment on the missing 'up' events. It turns out that they are simply of a lower severity and don't show in the default event view (which shows warnings and above). Further, there are two types of 'down' events. The ones you posted are generated when we detect loss of connection with the site (a cable-pull type of event). The second class is generated when we expected the loss of connection (i.e. after we trigger a reboot from the topology view) or when the operating system was still active when the heart client was stopped (a software upgrade causing a reboot). In both of those cases we got a TCP packet from the site to close the socket, and both are also of a lower severity than the default view shows.

While those are the details of today's operation, I can note that we discussed the severity issue several times. It was finally decided to use this two-severity setting. While it is easy to change, the question is: should we? The final thinking was to leave the 'heartbeat' failure at a higher severity so that the user could trigger alerts on 'warnings' and avoid general warnings from normal 'up/down' events.
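To illustrate why the UP events never show up, here is a minimal Python sketch of a severity-filtered view. It is not the OnPlus implementation: the `Severity` levels (in particular `ERROR`), the `Event` class, the `default_view` function, and the assignment of `CONN`/`CLOSE` events to the 'information' level are assumptions made for the example. Only the "warnings and above" threshold and the HEARTBEAT DOWN at a higher severity come from the description above.

```python
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    # Hypothetical levels; only 'information' and 'warning' are named in the thread.
    INFORMATION = 1
    WARNING = 2
    ERROR = 3


@dataclass(frozen=True)
class Event:
    kind: str           # "CONN", "CLOSE" or "HEARTBEAT"
    state: str          # "UP" or "DOWN"
    severity: Severity


def default_view(events, threshold=Severity.WARNING):
    """Return only the events at or above the view's severity threshold."""
    return [e for e in events if e.severity >= threshold]


events = [
    Event("HEARTBEAT", "DOWN", Severity.WARNING),   # unexpected loss of connection
    Event("CONN", "UP", Severity.INFORMATION),      # site reconnects (assumed level)
    Event("CLOSE", "DOWN", Severity.INFORMATION),   # expected close, e.g. a reboot
]

visible = default_view(events)
print([(e.kind, e.state) for e in visible])
```

With a threshold of 'warning', only the HEARTBEAT DOWN survives the filter, so the view shows a DOWN with no matching UP even though the UP event exists.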

Robert

Michael Holloway · ‎06-01-2011

Kurt,

Regarding the timestamps being formatted in the wrong time zone, you are absolutely correct in stating that we don't have it quite right yet. We're aware of the issue and I'm certain that it will eventually be addressed to your satisfaction. Gone are the days in the trial when we presented ambiguous timestamps with no time zone listed and English abbreviations for days and months (Fri Jun 2010). Internally, we store all timestamps in a way that lets us generate an ISO 8601 style timestamp with the offset for any time zone.
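As a rough illustration of rendering one stored instant with different offsets, the following Python sketch takes the timestamp from the first event in this thread and re-expresses it in UTC and in a site-local offset. The `render` helper and the UTC+02:00 site offset are assumptions for the example; a fixed offset does not track DST, which would need a real time zone database entry for the site.

```python
from datetime import datetime, timedelta, timezone


def render(raw, tz):
    """Parse an ISO 8601 style timestamp and re-express it in the given zone."""
    return datetime.fromisoformat(raw).astimezone(tz).isoformat()


raw = "2011-05-30 18:31:56-05:00"          # as shown in the event above
site_tz = timezone(timedelta(hours=2))     # hypothetical site offset (UTC+02:00)

print(render(raw, timezone.utc))   # 2011-05-30T23:31:56+00:00
print(render(raw, site_tz))        # 2011-05-31T01:31:56+02:00
```

All three strings name the same instant; only the presentation differs, which is why storing the instant unambiguously makes the display zone a per-site rendering choice.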

We've picked up additional development folks, and right now we're in the mode of working down a prioritized list of issues for stability, hardening, and a few features left to be implemented before we can release. You've pointed out some of the bugs and missing features in other posts, and we're always appreciative of your unfiltered feedback. It helps our management folks reorder the priorities correctly.

Kurt Schumacher · ‎06-14-2011

Robert,

Both DOWN and UP must have the same priority. In every case.

Why? Because a human is receiving and monitoring these. When something goes DOWN and comes UP again later: fine, we have to investigate what potentially went wrong.

When something goes DOWN and does not let us know on the same path that it is back UP again (regardless of the trigger, i.e. a reboot), every operator is URGED to start investigating and escalating after some time. Being baffled and finding "Hey, the device is up and running, can't find anything wrong" after some time is a waste of time and a result of poorly allocated resources, caused by your mis-designed implementation.
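The operator's side of this can be sketched in a few lines of Python: keep state per device, and treat any DOWN that has not been answered by an UP within a grace period as something to escalate. The `unresolved_downs` function, the 30-minute grace period, the check time, and the added UP event are all hypothetical; only the device ID and the two DOWN timestamps come from the events posted earlier.

```python
from datetime import datetime, timedelta


def unresolved_downs(events, now, grace=timedelta(minutes=30)):
    """Return {device_id: time of first unanswered DOWN} older than `grace`.

    `events` is an iterable of (device_id, state, when) in time order.
    """
    open_outages = {}
    for device_id, state, when in events:
        if state == "DOWN":
            # Remember only the first DOWN of an outage; repeats do not reset it.
            open_outages.setdefault(device_id, when)
        elif state == "UP":
            # A matching UP closes the outage for that device.
            open_outages.pop(device_id, None)
    return {d: t for d, t in open_outages.items() if now - t >= grace}


device = "00:50:43:nn:nn:nn"
events = [
    (device, "DOWN", datetime.fromisoformat("2011-05-30 18:31:56-05:00")),
    (device, "DOWN", datetime.fromisoformat("2011-05-30 19:09:46-05:00")),
]
now = datetime.fromisoformat("2011-05-30 20:00:00-05:00")  # hypothetical check time

# Without any UP on the same path, the device still looks down.
print(unresolved_downs(events, now))

# With a (hypothetical) UP delivered on the same path, nothing is left to chase.
recovered = events + [
    (device, "UP", datetime.fromisoformat("2011-05-30 19:15:00-05:00")),
]
print(unresolved_downs(recovered, now))
```

If the UP is filtered out before it reaches the operator, the first case is all they ever see, and the escalation starts even though the device has recovered.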

Back to the decision makers - Fix it!

-Kurt.

jamwyatt · ‎06-30-2011

As mentioned before, we've had this discussion more than a few times internally. Combined with your polite request and others' input, we decided to make the change so that the heartbeat UP and DOWN events match at the 'warning' severity. All other management and maintenance UP/DOWN events will be at the 'information' severity. The change should be made available in the next scheduled update of the beat node.

Thanks,

Robert