Decoding DORA Metrics at Doist

How DORA metrics can improve software delivery and offer in-depth insights into operational performance.

By Henning Muszynski

Introduction

When it comes to ensuring quick and reliable delivery in the fast-paced world of software development, most organizations rely on gut feelings and manual pulse checks with their engineering teams. Others follow inefficient or borderline harmful metrics like lines of code written to gauge the productivity of their teams. Neither approach is objective or easily correlated with business results.

Luckily, there’s research on what to do instead – most famously the DORA (DevOps Research and Assessment) metrics. DORA metrics capture important aspects of the software delivery process; by tracking them, teams gain a deeper understanding of operational performance, identify areas for improvement, and work toward consistently delivering the highest-quality software at high velocity.

In this blog post, we’ll take a look at each DORA metric, discuss how we’re tracking them at Doist, and share some insights and learnings from our experiences.

Understanding DORA Metrics

DORA metrics were first introduced in the 2014 State of DevOps report and later gained industry traction with the book “Accelerate: The Science of Lean Software and DevOps” by the same authors: Nicole Forsgren, Jez Humble, and Gene Kim. They are a set of four key indicators that help businesses measure their engineering performance.

  1. Deployment Frequency measures how often code is deployed to production.
  2. Lead Time for Changes tracks the time from code commit to code successfully running in production.
  3. Time to Restore Service quantifies how long it takes for a team to recover from a failure in the production environment.
  4. Change Failure Rate gauges the percentage of deployments causing a failure in production.
Overview of the DORA clusters, State of DevOps report 2022

Since 2022, there’s a fifth metric, Reliability, which represents operational performance: it tracks how well services meet user expectations, such as availability and performance. For this article, we’re going to focus on the original four metrics only.

According to the research, there is little to no benefit in optimizing your performance further once you’ve joined the high-performing cluster. Optimization for optimization’s sake is a waste of time – time that’s better spent talking to engineers and leaders in your organization, uncovering and resolving the actual problems they’re facing.

It’s paramount to keep in mind that DORA metrics are a conversation starter, not a solution. They lack deeper context on why certain metrics are the way they are in each organization. Hence, they cannot replace getting together with your teams to discuss processes and workflows and to understand their unique problems and opportunities.

DORA at Doist

Tooling

We have been pragmatic about our approach from day one. Without knowing what benefits a more profound understanding of where we stood on each DORA metric would bring, we weren’t convinced it was worth investing in a dedicated observability tool. We’re tracking everything as “Metrics” in Datadog, our main monitoring tool. We already had experience piping data into Datadog and visualizing it there, so we decided to build on that existing knowledge.

Most of the data collection happens in our CI/CD setup on GitHub Actions, where we leverage the datadog-actions-metrics action to do most of the heavy lifting. The added benefit of this approach is that we were able to build basic CI observability dashboards (how many actions ran, the most costly actions, the most flaky actions, …) from the same data – win-win.

Doist’s Definitions

Deployment Frequency

Each engineering team measures how often their deployment pipeline (GitHub Actions workflow) runs and sends an event to Datadog upon successful completion. This works for our Backend and Frontend teams, which deploy multiple times per day, as well as for our Android and Apple teams, which usually deploy only once or twice per week.
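
For illustration, here’s a minimal Python sketch of the kind of call that happens at the end of a successful deploy: submit a count metric to Datadog’s v2 metrics API so a dashboard can sum deployments per day. The metric name, tags, and environment variable are hypothetical – in our actual setup, the datadog-actions-metrics action handles the submission for us.

```python
# Hypothetical sketch: report one successful deployment to Datadog.
# Assumes DD_API_KEY is set; the metric name and tags are made up for illustration.
import os
import time

import requests  # pip install requests


def report_deployment(team: str, service: str) -> None:
    payload = {
        "series": [
            {
                "metric": "doist.deploys.count",  # hypothetical metric name
                "type": 1,  # 1 = count in the v2 metrics API
                "points": [{"timestamp": int(time.time()), "value": 1}],
                "tags": [f"team:{team}", f"service:{service}"],
            }
        ]
    }
    resp = requests.post(
        "https://api.datadoghq.com/api/v2/series",
        json=payload,
        headers={"DD-API-KEY": os.environ["DD_API_KEY"]},
        timeout=10,
    )
    resp.raise_for_status()


if __name__ == "__main__":
    report_deployment(team="frontend", service="todoist-web")
```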

Lead Time for Changes

Every engineering team measures how long each merged PR has been open. We measure in seconds and report it in hours.

For now, we’re not overthinking it and treat all PRs as equal. Future iterations could exclude the time a PR spends in draft state from this metric, and might also exclude automated PRs. We’ll refine when we need to, but not before – pragmatism beats perfectionism here again.
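
As a rough sketch of how lead time can be derived, the snippet below uses the standard fields on merged pull requests in the GitHub REST API (from created_at to merged_at), measured in seconds and reported in hours. The repository name, token handling, and lack of pagination are illustrative simplifications, not our actual pipeline.

```python
# Hypothetical sketch: compute lead time (hours) for recently merged PRs.
# OWNER/REPO and the GITHUB_TOKEN env var are placeholders.
import os
from datetime import datetime

import requests  # pip install requests

PULLS_URL = "https://api.github.com/repos/OWNER/REPO/pulls"


def merged_pr_lead_times(token: str) -> list[float]:
    """Return lead times in hours for merged PRs on the first page of results."""
    resp = requests.get(
        PULLS_URL,
        params={"state": "closed", "per_page": 50},
        headers={"Authorization": f"Bearer {token}"},
        timeout=10,
    )
    resp.raise_for_status()
    lead_times = []
    for pr in resp.json():
        if not pr.get("merged_at"):
            continue  # closed without merging
        created = datetime.fromisoformat(pr["created_at"].replace("Z", "+00:00"))
        merged = datetime.fromisoformat(pr["merged_at"].replace("Z", "+00:00"))
        seconds_open = (merged - created).total_seconds()  # measured in seconds...
        lead_times.append(seconds_open / 3600)             # ...reported in hours
    return lead_times


if __name__ == "__main__":
    hours = merged_pr_lead_times(os.environ["GITHUB_TOKEN"])
    if hours:
        print(f"Average lead time: {sum(hours) / len(hours):.1f}h across {len(hours)} PRs")
```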

Time to Restore Service

To measure “time to restore service,” one would normally look at system metrics like error rate or availability and measure how long they take to recover. However, not all of our clients track these metrics in a comparable way.

We decided to define “time to restore service” as the time it takes to close a critical or high-severity issue. This is less clear-cut: it takes time for a user to report an issue to Customer Experience, for them to reproduce and escalate it, and for our engineers to solve it. However, this is the flow we care most about optimizing, which makes it a healthy definition for our use case.
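
A simplified version of that calculation could look like the sketch below: take closed critical or high-severity issues and measure how long each stayed open. The issue records and severity labels are illustrative stand-ins for whatever issue tracker feeds the metric, not our real data.

```python
# Hypothetical sketch: time to restore service, defined as how long a
# critical/high-severity issue stays open. Issue records are illustrative.
from datetime import datetime, timedelta
from statistics import median


def time_to_restore(issues: list[dict]) -> timedelta:
    """Median open duration of closed critical/high-severity issues."""
    durations = [
        issue["closed_at"] - issue["created_at"]
        for issue in issues
        if issue["severity"] in {"critical", "high"} and issue.get("closed_at")
    ]
    return median(durations)


if __name__ == "__main__":
    sample = [
        {"severity": "high", "created_at": datetime(2023, 9, 1), "closed_at": datetime(2023, 9, 3)},
        {"severity": "critical", "created_at": datetime(2023, 9, 5), "closed_at": datetime(2023, 9, 6)},
        {"severity": "low", "created_at": datetime(2023, 9, 7), "closed_at": datetime(2023, 9, 8)},
    ]
    print(time_to_restore(sample))  # -> 1 day, 12:00:00
```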

Change Failure Rate

We define a change failure as a deployment that leads to a critical or high-severity issue being raised. Change failure rate is tracked as a 7-day rolling average, since not every deployment can be automatically linked to a reported issue.

Similar to “time to restore service,” this metric is less clear-cut because it’s not coupled to hard data like an error rate or real-user monitoring. Again, it makes sense for our use case and optimizes for the things we actually care about.
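
Conceptually, the computation reduces to something like this sketch: over a trailing 7-day window, divide the deployments that led to a critical or high-severity issue by the total number of deployments. The data structures here are illustrative; in practice these series live in Datadog.

```python
# Hypothetical sketch: change failure rate over a 7-day rolling window.
# Timestamps and the "caused_incident" flag are illustrative stand-ins.
from datetime import datetime, timedelta


def change_failure_rate(deployments: list[dict], now: datetime) -> float:
    """Share of deployments in the last 7 days that led to a critical/high issue."""
    window_start = now - timedelta(days=7)
    recent = [d for d in deployments if d["deployed_at"] >= window_start]
    if not recent:
        return 0.0
    failures = sum(1 for d in recent if d["caused_incident"])
    return failures / len(recent)


if __name__ == "__main__":
    now = datetime(2023, 9, 30)
    deployments = [
        {"deployed_at": datetime(2023, 9, 29), "caused_incident": False},
        {"deployed_at": datetime(2023, 9, 28), "caused_incident": True},
        {"deployed_at": datetime(2023, 9, 10), "caused_incident": True},  # outside window
    ]
    print(f"{change_failure_rate(deployments, now):.2%}")  # -> 50.00%
```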

Defining Failure

For both “Time to Restore Service” and “Change Failure Rate,” we struggled to come up with a unified definition of failure across engineering teams. When we first set out to define and track DORA metrics, we had to accept a disconnect between theory and our practice. In theory, we would want to characterize each new, unhandled error in production as a failure of the system. Looking at our monitoring solutions (Sentry, Crashlytics, Datadog), this turned out to be infeasible at the time.

As an example, our Frontend app reports 5-10 new errors per release. Some of them are valid and addressed swiftly (undefined is not a function, anyone?). Others are differently grouped occurrences of known issues. And still others might have causes outside our control.

As alluded to above, the right thing to do is clean it all up and maintain proper reporting and alerting hygiene. The reality of a bootstrapped product company, however, is that we need to keep working on features, fix the bugs users are actually encountering (support tickets), and pay back ~10 years of technological evolution to eradicate whole categories of bugs instead of fixing them one by one.

We still want to get there, and eventually will, but we needed a more pragmatic approach to start. We settled on relating failure to support tickets, as closing them as quickly as possible is something every engineering team at Doist strives for. Measuring, reporting, and improving our handling of user-reported errors has the added benefit of being universally understood as a “good thing” by the entire company.

Spilling the beans

Now that you’ve read this far, let’s take a look at how Doist’s engineering organization actually performs on these metrics:

Screenshot of our DORA dashboard in Datadog

| Metric                  | July       | August     | September  |
| ----------------------- | ---------- | ---------- | ---------- |
| Deployment Frequency    | 15 per day | 26 per day | 18 per day |
| Lead Time for Changes   | 63 hours   | 55 hours   | 101 hours  |
| Time to Restore Service | 177 days   | 174 days   | 32 days    |
| Change Failure Rate     | 3.38%      | 2.88%      | 5.53%      |

According to the 2022 State of DevOps report, this puts us in the “high performing” cluster for every metric except time to restore service, where we’re worse than their “low performing” cluster.

Entering the high-performing cluster for time to restore service means being able to resolve most user reports in under a day. Technically, we have all the puzzle pieces in place to achieve this.

Why are we looking at 32 days to restore service, then? There are at least two reasons for this:

  1. Coupling time to restore service to customer tickets is inherently slower than, e.g., automatic collection of crashes or other real-time service health metrics. There are too many hoops.
  2. By coupling this to our (rather large) backlog of customer support tickets, we skew the metric every time we close out an old issue. Or even worse, an old issue gets reopened just to be closed again.

Generally speaking, it seems likely we’re defining a “failure in the production environment” too strictly and might fare better by coupling it to automatically reported error rates like the rest of the industry. Before changing our definition, though, we’re going to improve our testing, observability, alerting, and prioritization of issues to bring this metric down into the single-digit range. To join the high-performing cluster in the next State of DevOps report, yes, but more importantly because it’s the right thing to do for users 💪

Reflections

Our approach to tracking DORA metrics may look simple, maybe even a tad too pragmatic. That’s because it is. We believe that perfection is a journey, not a destination, and that having some data – even if imperfect – is much better than having none at all.

Each month, we review our metrics, checking for and discussing any outliers. We value these discussions as they spark deeper insights, fueling our continuous improvement journey.

Our use of DORA metrics will evolve in the future. Currently, they’re a mere pulse check on the health of our engineering teams and product delivery processes. However, as we continue to iterate, we’d like to make them the basis for informed staffing decisions. By coupling DORA metrics to SLOs, we could put more people on issue resolution or CD improvements when one of the metrics falls below a certain threshold. On the flip side, we could confidently commit more people to feature work when our DORA metrics are healthy. Automating these decisions would further streamline our processes and increase operational efficiency.
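
To make the idea concrete, such an automation could look roughly like the sketch below: compare each metric against an SLO threshold and flag where attention should go. The thresholds and metric names are purely illustrative assumptions, not actual Doist targets, and nothing like this is automated today; the sample values are September’s numbers from the table above.

```python
# Hypothetical sketch: flag DORA metrics that breach an SLO threshold.
# Thresholds are illustrative, not real Doist targets.

SLOS = {
    "deployment_frequency_per_day": 10,  # at least this many deploys per day
    "lead_time_hours": 72,               # at most this many hours
    "time_to_restore_days": 7,           # at most this many days
    "change_failure_rate": 0.15,         # at most this share of deployments
}

HIGHER_IS_BETTER = {"deployment_frequency_per_day"}


def recommend_focus(current: dict[str, float]) -> list[str]:
    """Return the metrics currently breaching their SLO."""
    breaches = []
    for metric, threshold in SLOS.items():
        value = current[metric]
        breached = value < threshold if metric in HIGHER_IS_BETTER else value > threshold
        if breached:
            breaches.append(metric)
    return breaches


if __name__ == "__main__":
    september = {
        "deployment_frequency_per_day": 18,
        "lead_time_hours": 101,
        "time_to_restore_days": 32,
        "change_failure_rate": 0.0553,
    }
    print(recommend_focus(september))  # -> ['lead_time_hours', 'time_to_restore_days']
```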


Lifting the veil on DORA metrics at Doist has shown how we’re approaching this topic not only to improve our processes, but also to foster a culture of continuous improvement and growth. We invite you to join us on this data-driven journey, inspired by our relentless pursuit of excellence.