Mean Time To Resolution
The average time it takes to fully resolve an incident from detection to service restoration.
Mean Time To Resolution (MTTR) is one of the "DORA metrics" and a critical key performance indicator (KPI) in Incident Management. It measures the average time elapsed from the moment an incident is **detected** to the moment it is **fully resolved** (i.e., the service is restored and functioning normally for users).

### Why MTTR is the "North Star" Metric

MTTR is often considered the single most important metric for SRE teams because it directly correlates with customer downtime. Unlike Mean Time Between Failures (MTBF), which measures reliability, MTTR measures **resilience**: how quickly your system bounces back when (not if) it fails.

> "You cannot prevent every failure, but you can control how fast you recover."

*Note: In common industry usage, "Average" and "Mean" are used interchangeably. We use "Average" here for clarity, though "Mean" is the precise statistical term.*

### How to Calculate MTTR

To calculate MTTR, divide the **total downtime** of all incidents during a specific period by the **total number of incidents** in that same period.

**The Formula:**

```math
MTTR = (Sum of all incident durations) / (Total number of incidents)
```

**Example Calculation:**

If your team faced 4 incidents in Q1 with durations of 30m, 60m, 15m, and 15m:

* Total Downtime = 30 + 60 + 15 + 15 = 120 minutes
* Total Incidents = 4
* **MTTR = 120 / 4 = 30 minutes**

### MTTR vs. Other Metrics

* **MTTD (Detect)**: Time to realize there is a problem.
* **MTTA (Acknowledge)**: Time for a human to start working.
* **MTTR (Resolve)**: The total time until the fix is live.
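The calculation described under "How to Calculate MTTR" can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `mean_time_to_resolution`, the `(detected_at, resolved_at)` tuple layout, and the timestamps are all assumptions made for the example. Only the four durations (30, 60, 15, and 15 minutes) come from the example calculation.

```python
from datetime import datetime, timedelta


def mean_time_to_resolution(incidents):
    """Return MTTR as a timedelta.

    `incidents` is a list of (detected_at, resolved_at) datetime pairs.
    This layout is hypothetical; adapt it to your incident tracker's export.
    """
    if not incidents:
        # The average is undefined when there were no incidents in the period.
        raise ValueError("MTTR is undefined for a period with no incidents")
    # Sum each incident's duration (resolved minus detected).
    total_downtime = sum(
        (resolved - detected for detected, resolved in incidents),
        timedelta(),
    )
    return total_downtime / len(incidents)


# Illustrative timestamps reproducing the 30m, 60m, 15m, 15m example.
start = datetime(2025, 1, 6, 9, 0)
durations_min = [30, 60, 15, 15]
incidents = [
    (start + timedelta(days=i), start + timedelta(days=i, minutes=m))
    for i, m in enumerate(durations_min)
]

mttr = mean_time_to_resolution(incidents)
print(mttr)  # 0:30:00
```

Working from detection and resolution timestamps rather than pre-computed durations keeps the measurement window explicit: the clock starts at detection and stops at full resolution, matching the definition above.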
Example: The "Database Lockdown" Scenario
"A bad deployment causes the primary database to lock up, preventing all user logins. Alerts fire immediately."
Why MTTR Matters
While you cannot prevent every failure, you can control how fast you recover. A low MTTR indicates resilient systems and high-performing teams.
In our State of Incident Management 2025 synthesis: 73% of orgs reported outages linked to ignored/suppressed alerts (Splunk), and a 250-engineer org can lose ~$9.4M/year to manual toil (simplified model). [Read the full methodology and sources → State of Incident Management 2025](/blog/state-of-incident-management-2025).
The Formula
MTTR = Σ Durations / n
MTTR vs. Other Metrics
Common Pitfalls
How to Use MTTR
Industry Benchmarks