Inside Datadog’s $5M Outage [The Pragmatic Engineer]

Sean Dahlberg
Cisco Employee

Many of you may remember that back on March 8th, Datadog had a major outage that lasted a little over 24 hours.

https://status.datadoghq.com/incidents/nhrdzp86vqtp?u=cvhfvzn2nr1g

Gergely Orosz has a great article about this on The Pragmatic Engineer. Here's a quick excerpt from it:

Datadog is an observability service used to monitor services and applications, and to alert teams when anomalies occur. Observability services are essential in confirming that services operate reliably. When they stop working, the results can be bad for business. On Wednesday, 8 March, this happened to Datadog, which suffered serious reliability issues for more than 24 hours. This incident was costly, at $5M in lost revenue directly as a result of the outage, as revealed in a later earnings call. This was due to usage-based billing: Datadog most likely did not charge customers for data transfers while the system was down. The loss represents about a day’s worth of revenue for the company.

The incident in March was the first global outage at Datadog, in which all its regions were simultaneously impacted, and every customer experienced downtime. More than 2 months later, there has still not been an external postmortem published, which is unusual for such a major incident. When I reached out to Datadog about this delay a week ago, I was told more details will be published, but not when.

Confusingly, company CEO Olivier Pomel suggested during the earnings call on 4 May, that a postmortem had been published.

If that got you interested, you can read the whole article here:

https://newsletter.pragmaticengineer.com/p/inside-the-datadog-outage

One of the points I found interesting is in section 5, on follow-up actions:

2. No more automatic updates. Datadog has disabled the legacy security update channel in the Ubuntu base image, and rolled this change out across all regions. From now on, the company will manually roll out all updates, including security updates, in a controlled fashion. 

While I understand the appeal of automatic updates, we hear stories all too often about people getting burned by them.

Although it's a long read (definitely have your favorite beverage ready; I had my first cup of coffee while reading it), "Inside Datadog’s $5M Outage (Real-World Engineering Challenges #8)" is interesting and informative.