
OnPlus - Asymmetric Events and Useless Timestamps - DOWN, but no UP...

Kurt Schumacher
Level 1

Ongoing... DOWN events, but no UP events...

Every site is part of a time zone. Sorry to say, but we're not interested in knowing which time zone the cloud servers are located in. We expect either standardized (GMT) or LOCAL time, including time zone and DST, on any event. To conclude, time is still not handled according to ISO: back to R&D.

Event: OnPlus: Connecton status
Event Date/Time: 2011-05-30 18:31:56-05:00
Event Message: Site Comms down: 84.nn.nn.nn
Customer Name: KCS (SOHO)
Device ID: 00:50:43:nn:nn:nn
CONN, CLOSE or HEARTBEAT: HEARTBEAT
UP or DOWN: DOWN

Event: OnPlus: Connecton status
Event Date/Time: 2011-05-30 19:09:46-05:00
Event Message: Site Comms down: 84.nn.nn.nn
Customer Name: KCS (SOHO)
Device ID: 00:50:43:nn:nn:nn
CONN, CLOSE or HEARTBEAT: HEARTBEAT
UP or DOWN: DOWN


Tastes to me like a bad (stateless) system design...


jamwyatt
Level 1

Hi Kurt,

While I can't comment on the timezone concerns, I can comment on the missing 'up' events. It turns out that they simply have a lower severity and don't show in the default event view (which shows warnings and above). Further, there are two types of 'down' events. The ones you list above are generated when we detect a loss of connection with the site (a cable-pull type of event). The second class is generated when we expected the loss of connection (i.e. after we trigger a reboot from the topology view) or when the operating system was still active when the heart client was stopped (a software upgrade causing a reboot). In both of those cases we got a TCP packet from the site to close the socket, and the events are also of a lower severity than the default view shows.

While those are the details of today's operation, I can note that we have discussed the severity issue several times. It was finally decided to use this two-severity setting. While it is easy to change, the question is: should we? The final thinking was to leave the 'heartbeat' failure at a higher severity so that the user could trigger alerts on 'warnings' and avoid general warnings from normal 'up/down' events.
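To make the asymmetry concrete, here is a minimal sketch of how a severity floor on the default view hides the 'up' events. This is not the OnPlus code: the `Event` class, the `default_view` function, the severity ordering, and the choice of 'information' for the lower severity are all assumptions made for illustration.

```python
from dataclasses import dataclass

# Assumed ordering; only 'warning' and 'information' are named in this thread.
SEVERITY_RANK = {"information": 0, "warning": 1, "error": 2}

@dataclass
class Event:
    kind: str      # "CONN", "CLOSE" or "HEARTBEAT"
    state: str     # "UP" or "DOWN"
    severity: str  # key of SEVERITY_RANK

def default_view(events, minimum="warning"):
    """Return only the events at or above the view's minimum severity."""
    floor = SEVERITY_RANK[minimum]
    return [e for e in events if SEVERITY_RANK[e.severity] >= floor]

events = [
    Event("HEARTBEAT", "DOWN", "warning"),   # unexpected loss of connection
    Event("CONN", "UP", "information"),      # site reconnects (assumed kind)
    Event("CLOSE", "DOWN", "information"),   # expected loss, e.g. a reboot
]

visible = default_view(events)
print([(e.kind, e.state) for e in visible])  # only the heartbeat DOWN survives
```

With these assumed severities the operator sees the DOWN but never the matching UP, which is exactly the pattern reported at the top of the thread.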

Robert

Kurt,

Regarding the timestamps being formatted in the wrong timezone, you are absolutely correct in stating that we don't have it quite right yet. We're aware of the issue and I'm certain that it will eventually be addressed to your satisfaction. Gone are the days in the trial when we presented ambiguous timestamps with no timezone listed and English abbreviations for days and months (Fri Jun 2010). Internally, we store all timestamps in a way that lets us generate an ISO 8601 style timestamp with the correct offset for any timezone.
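As a rough illustration of that last point (not our actual implementation), the sketch below normalizes an event timestamp to UTC and then renders it with a different offset. The function names `to_utc` and `render` and the +02:00 site offset are assumptions. A fixed offset is used to keep the example self-contained; real DST handling would need a named zone, for example via Python's `zoneinfo` module.

```python
from datetime import datetime, timedelta, timezone

def to_utc(stamp: str) -> datetime:
    """Parse an ISO 8601 style timestamp with offset and normalize it to UTC."""
    dt = datetime.fromisoformat(stamp)
    if dt.tzinfo is None:
        raise ValueError("timestamp has no UTC offset, so it is ambiguous")
    return dt.astimezone(timezone.utc)

def render(dt_utc: datetime, offset_hours: float) -> str:
    """Render a stored UTC instant with the given fixed offset from UTC."""
    site_tz = timezone(timedelta(hours=offset_hours))
    return dt_utc.astimezone(site_tz).isoformat(sep=" ")

# Timestamp taken from the first event in this thread.
stored = to_utc("2011-05-30 18:31:56-05:00")

print(render(stored, 0))  # 2011-05-30 23:31:56+00:00
print(render(stored, 2))  # 2011-05-31 01:31:56+02:00 (assumed site offset)
```

The stored instant never changes; only the rendering depends on where the reader's site is.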

We've picked up additional development folks, and right now we're working down a prioritized list of issues for stability, hardening, and a few features left to be implemented before we can release. You've pointed out some of the bugs and missing features in other posts, and we're always appreciative of your unfiltered feedback. It helps our management folks reorder the priorities correctly.

Robert,

Both DOWN and UP must have the same priority. In every case.

Why? Because a human is receiving and monitoring these. When something goes DOWN and comes UP again later: fine, we have to investigate what potentially went wrong.

When something goes DOWN and does not let us know on the same path that it is back UP again (regardless of the trigger, i.e. a reboot), every operator is URGED to start investigating and escalating after some time. Being baffled and finding "Hey, the device is up and running, can't find anything wrong" after some time is a waste of time and a result of poorly adjusted resources, caused by your mis-designed implementation.

Back to the decision makers - Fix it!

-Kurt.

As mentioned before, we've had this discussion more than a few times internally. Combined with your polite request and others' input, we decided to make the change so that the heartbeat UP and DOWN events both have the 'warning' severity. All other management and maintenance UP/DOWN events will be at the 'information' severity. The change should be made available in the next scheduled update of the beat node.
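In rough terms, the new rule could be sketched like this. It is only an illustration of the decision described above, not the shipped code, and the function name `severity_for` is hypothetical.

```python
def severity_for(kind: str, state: str) -> str:
    """Map an event to a severity under the revised rule.

    kind is "CONN", "CLOSE" or "HEARTBEAT"; state is "UP" or "DOWN".
    The severity depends only on the kind, so UP and DOWN always match.
    """
    if state not in ("UP", "DOWN"):
        raise ValueError(f"unknown state: {state!r}")
    return "warning" if kind == "HEARTBEAT" else "information"

for kind in ("HEARTBEAT", "CONN", "CLOSE"):
    print(kind, severity_for(kind, "DOWN"), severity_for(kind, "UP"))
```

Because the state no longer influences the severity, a view filtered on 'warning' shows the heartbeat UP next to the heartbeat DOWN.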

Thanks,

Robert
