CORE METRICS

Mean Time To Resolution

The average time it takes to fully resolve an incident from detection to service restoration.

MTTR = Σ Durations / n

Mean Time To Resolution (MTTR) is one of the "DORA metrics" and a critical key performance indicator (KPI) in Incident Management. It measures the average time elapsed from the moment an incident is **detected** to the moment it is **fully resolved** (i.e., the service is restored and functioning normally for users).

### Why MTTR is the "North Star" Metric

MTTR is often considered the single most important metric for SRE teams because it directly correlates with customer downtime. Unlike Mean Time Between Failures (MTBF), which measures reliability, MTTR measures **resilience**: how quickly your system bounces back when (not if) it fails.

> "You cannot prevent every failure, but you can control how fast you recover."

*Note: In common industry usage, "Average" and "Mean" are used interchangeably. We use "Average" here for clarity, though "Mean" is the precise statistical term.*

### How to Calculate MTTR

To calculate MTTR, divide the **total downtime** of all incidents during a specific period by the **total number of incidents** in that same period.

**The Formula:**

```math
MTTR = (Sum of all incident durations) / (Total number of incidents)
```

**Example Calculation:**

If your team faced 4 incidents in Q1 with durations of 30m, 60m, 15m, and 15m:

* Total Downtime = 30 + 60 + 15 + 15 = 120 minutes
* Total Incidents = 4
* **MTTR = 120 / 4 = 30 minutes**

### MTTR vs. Other Metrics

* **MTTD (Detect)**: Time to realize there is a problem.
* **MTTA (Acknowledge)**: Time for a human to start working.
* **MTTR (Resolve)**: The total time until the fix is live.

Example: The "Database Lockdown" Scenario

"A bad deployment causes the primary database to lock up, preventing all user logins. Alerts fire immediately."

Impact

100% of users are unable to log in. Support tickets spike by 500% in 10 minutes.

Resolution

The on-call engineer rolls back the deployment to the previous stable version. Login functionality is restored in 12 minutes.

Why MTTR Matters

While you cannot prevent every failure, you can control how fast you recover. A low MTTR indicates resilient systems and high-performing teams.

In our State of Incident Management 2025 synthesis: 73% of orgs reported outages linked to ignored/suppressed alerts (Splunk), and a 250-engineer org can lose ~$9.4M/year to manual toil (simplified model). [Read the full methodology and sources → State of Incident Management 2025](/blog/state-of-incident-management-2025).

The Formula

MTTR = Σ Durations / n

Incident A: 30 minutes

Incident B: 90 minutes

MTTR: 120 / 2 = 60 minutes
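The calculation above can be sketched in Python. This is a minimal illustration, assuming each incident is stored as a (detected, resolved) pair of timestamps. The function name `mean_time_to_resolution` and the sample timestamps are hypothetical and not tied to any particular tool.

```python
from datetime import datetime, timedelta


def mean_time_to_resolution(incidents):
    """Return the mean detection-to-resolution time as a timedelta.

    `incidents` is a list of (detected_at, resolved_at) datetime pairs.
    """
    if not incidents:
        raise ValueError("MTTR is undefined when there are no incidents")
    # Sum the individual durations, then divide by the incident count.
    total = sum((resolved - detected for detected, resolved in incidents), timedelta())
    return total / len(incidents)


# Hypothetical sample data matching the example: 30 and 90 minutes.
incidents = [
    (datetime(2025, 1, 6, 9, 0), datetime(2025, 1, 6, 9, 30)),       # Incident A
    (datetime(2025, 1, 14, 22, 15), datetime(2025, 1, 14, 23, 45)),  # Incident B
]

mttr = mean_time_to_resolution(incidents)
print(mttr.total_seconds() / 60)  # 60.0
```

Working with timestamps rather than pre-computed durations keeps the start and stop points of the clock explicit, which matters when comparing MTTR across teams.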

MTTR vs. Other Metrics

MTTA: Alert → Acknowledged

MTTR: Alert → Resolved

MTTD: Issue → Alert fires

MTBF: Uptime between failures
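To make the intervals concrete, here is a rough Python sketch that derives MTTD, MTTA, and MTTR from the same incident records. It assumes each incident carries four timestamps. The `Incident` fields, the `summarize` helper, and the sample data are hypothetical names chosen for illustration, and MTBF is left out because it needs uptime data between incidents.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    started_at: datetime       # when the issue actually began
    alerted_at: datetime       # when the alert fired (detection)
    acknowledged_at: datetime  # when a human acknowledged the alert
    resolved_at: datetime      # when service was restored for users


def mean_delta(deltas):
    """Average a non-empty list of timedeltas."""
    return sum(deltas, timedelta()) / len(deltas)


def summarize(incidents):
    """Compute each metric using the intervals listed above."""
    return {
        "MTTD": mean_delta([i.alerted_at - i.started_at for i in incidents]),
        "MTTA": mean_delta([i.acknowledged_at - i.alerted_at for i in incidents]),
        "MTTR": mean_delta([i.resolved_at - i.alerted_at for i in incidents]),
    }


# Hypothetical sample data.
incidents = [
    Incident(datetime(2025, 3, 3, 10, 0), datetime(2025, 3, 3, 10, 4),
             datetime(2025, 3, 3, 10, 7), datetime(2025, 3, 3, 10, 34)),
    Incident(datetime(2025, 3, 9, 15, 0), datetime(2025, 3, 9, 15, 2),
             datetime(2025, 3, 9, 15, 7), datetime(2025, 3, 9, 15, 52)),
]

summary = summarize(incidents)
for name, value in summary.items():
    print(f"{name}: {value.total_seconds() / 60:.0f} min")
```

With this sample data the sketch reports an MTTD of 3 minutes, an MTTA of 4 minutes, and an MTTR of 40 minutes.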

Common Pitfalls

Stopping the clock too early

Ensure "Resolved" means the customer is back online, not just that the code is merged.

Averaging wild outliers

One 24-hour outage can skew your monthly MTTR. Consider using Median TTR for a more representative metric.
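The effect of a single outlier is easy to see with a small Python sketch. The durations below are hypothetical: five routine incidents plus one 24-hour outage in the same month.

```python
from statistics import mean, median

# Hypothetical month of incident durations, in minutes.
durations_min = [20, 25, 30, 35, 40, 24 * 60]

mean_ttr = mean(durations_min)
median_ttr = median(durations_min)

print(f"Mean TTR:   {mean_ttr:.0f} min")    # Mean TTR:   265 min
print(f"Median TTR: {median_ttr:.1f} min")  # Median TTR: 32.5 min
```

Here the mean suggests a typical incident takes over four hours, while the median stays close to the routine 30-minute range. Reporting both numbers side by side is one way to keep the outlier visible without letting it dominate the headline figure.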

How to Use MTTR

🔍 Better Observability: Reduce MTTD to give your team a head start.

📖 Automated Runbooks: Attach runbooks directly to alerts.

🎯 Practice Incidents: Run chaos engineering drills regularly.

Industry Benchmarks

Excellent (Top 5%): < 30 min

Good (Top 15%): 30-60 min

Average (Top 40%): 1-2 hours

Struggling (Below Avg): > 2 hours
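As a rough illustration, the tiers above can be mapped to a small Python helper. The function name `benchmark_tier` is hypothetical. The handling of the exact boundaries is also an assumption: the ranges above do not say which tier 60 or 120 minutes belongs to, so this sketch assigns both to the faster tier.

```python
def benchmark_tier(mttr_minutes):
    """Map an MTTR in minutes to one of the benchmark tiers listed above."""
    if mttr_minutes < 0:
        raise ValueError("MTTR cannot be negative")
    if mttr_minutes < 30:
        return "Excellent"
    if mttr_minutes <= 60:   # assumption: exactly 60 min counts as "Good"
        return "Good"
    if mttr_minutes <= 120:  # assumption: exactly 2 hours counts as "Average"
        return "Average"
    return "Struggling"


print(benchmark_tier(12))   # Excellent
print(benchmark_tier(45))   # Good
print(benchmark_tier(90))   # Average
print(benchmark_tier(300))  # Struggling
```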

Frequently Asked Questions

What is a good MTTR?

World-class engineering teams (Elite performers) typically achieve an MTTR of less than 1 hour. Average teams range between 1 and 24 hours. If your MTTR is greater than 24 hours, you are considered a "Low" performer according to DORA standards.

Does MTTR include detection time?

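It depends on where you start the clock. As defined on this page, MTTR runs from the moment an incident is detected to the moment service is restored, so the time an issue goes unnoticed before the alert fires is tracked separately as MTTD. Some organizations instead start from when the incident actually began, which folds detection time into MTTR and covers the full customer-impacting window. Others track "Mean Time to Repair" specifically from acknowledgment. Whichever start point you choose, apply it consistently so that MTTR values remain comparable over time.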

How do runbooks reduce MTTR?

Runbooks provide step-by-step instructions for known issues. By removing the need for an engineer to "figure out" what to do, runbooks can reduce diagnosis and repair time by 50-80%.