CORE METRICS

Mean Time Between Failures

Q: Does scheduled maintenance count?

Usually no. MTBF measures *failures* (unexpected outages). Scheduled downtime is planned.

Q: Is high MTBF always good?

Yes, but not at the cost of velocity. If you never ship code, your MTBF will be infinite, but your product will die. Balance innovation with reliability.

The average time between system failures or incidents.

MTBF = Total Uptime / Number of Failures
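As a minimal sketch of the formula above, the snippet below computes MTBF from an observation window and a list of unplanned outages. The function name `mtbf_hours`, the 30-day window, and the outage timestamps are all hypothetical, chosen only for illustration. It assumes every outage falls entirely inside the window.

```python
from datetime import datetime, timedelta


def mtbf_hours(window_start, window_end, outages):
    """Return MTBF in hours: total uptime divided by number of failures.

    `outages` is a list of (start, end) datetimes for unplanned outages.
    Assumption: each outage lies entirely inside the observation window.
    """
    if not outages:
        raise ValueError("MTBF is undefined with zero failures in the window")
    window = window_end - window_start
    # Sum outage durations, starting from a zero timedelta.
    downtime = sum((end - start for start, end in outages), timedelta())
    uptime = window - downtime
    return uptime.total_seconds() / 3600 / len(outages)


# Hypothetical 30-day window (720 h) with three unplanned outages (3 h total).
start = datetime(2024, 1, 1)
end = start + timedelta(days=30)
outages = [
    (datetime(2024, 1, 5, 10, 0), datetime(2024, 1, 5, 11, 0)),     # 1 h
    (datetime(2024, 1, 14, 2, 0), datetime(2024, 1, 14, 2, 30)),    # 0.5 h
    (datetime(2024, 1, 22, 18, 0), datetime(2024, 1, 22, 19, 30)),  # 1.5 h
]

print(mtbf_hours(start, end, outages))  # (720 - 3) / 3 = 239.0
```

Scheduled maintenance windows would simply be left out of `outages`, in line with the usual convention that planned downtime is not a failure.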

## The Reliability Metric

**MTBF (Mean Time Between Failures)** is a classic engineering metric (borrowed from hardware) that measures how often things break.

### MTBF vs. MTTR

* **MTBF** asks: "How robust is the system?" (Reliability)
* **MTTR** asks: "How fast can we fix it?" (Resilience)

### The Availability Equation

Availability (e.g., 99.9%) is mathematically derived from these two numbers:

```math
Availability = MTBF / (MTBF + MTTR)
```

To improve availability, you can either **crash less often** (increase MTBF) or **fix it faster** (decrease MTTR). Modern SRE teams tend to focus more on MTTR because complex systems *will* eventually fail, so recovering quickly is more sustainable than trying to prevent every failure.
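The two levers in the availability equation can be compared with a short sketch. The function name `availability` and the figures (200 h MTBF, 2 h MTTR) are hypothetical and used only for illustration.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Availability = MTBF / (MTBF + MTTR), as a fraction between 0 and 1."""
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTBF must be positive and MTTR must be non-negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Hypothetical baseline: fails every 200 h, takes 2 h to recover.
baseline = availability(200, 2)
# Lever 1: crash half as often (double MTBF).
fewer_failures = availability(400, 2)
# Lever 2: recover twice as fast (halve MTTR).
faster_recovery = availability(200, 1)

for label, value in [
    ("baseline", baseline),
    ("double MTBF", fewer_failures),
    ("halve MTTR", faster_recovery),
]:
    print(f"{label}: {value:.4%}")
```

Under these assumed numbers, doubling MTBF and halving MTTR produce the same availability, because the equation depends only on the ratio of MTTR to MTBF. Which lever is cheaper to pull is an engineering judgment, not something the formula decides.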

Example: The Memory Leak

"A server process had a slow memory leak that caused it to crash every 48 hours (MTBF = 48h)."

Impact

Frequent, predictable outages annoyed users.

Resolution

Engineers fixed the leak. Now the server runs 90 days without crashing (MTBF = 2160h).
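To put the before and after figures in perspective, here is a rough sketch that converts each MTBF into an expected number of crashes per year. The helper name `expected_failures_per_year` is hypothetical, and the estimate assumes a 365-day year and treats recovery time as negligible compared with MTBF.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 h, ignoring leap years


def expected_failures_per_year(mtbf_hours: float) -> float:
    """Approximate yearly failure count, treating repair time as negligible."""
    if mtbf_hours <= 0:
        raise ValueError("MTBF must be positive")
    return HOURS_PER_YEAR / mtbf_hours


before = expected_failures_per_year(48)       # leak present: crash every 48 h
after = expected_failures_per_year(90 * 24)   # leak fixed: 90 days = 2160 h

print(f"before: {before:.1f} crashes/year")   # 182.5
print(f"after:  {after:.1f} crashes/year")    # 4.1
print(f"improvement: {2160 / 48:.0f}x")       # 45x
```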

Why MTBF Matters

MTBF measures system reliability. Higher MTBF means more stable infrastructure.

MTBF is used alongside MTTR to calculate availability: Availability = MTBF / (MTBF + MTTR).

MTBF vs. Other Metrics

MTBF: Uptime between failures

MTTR: Downtime duration

Availability: MTBF / (MTBF + MTTR)

Common Pitfalls

Tracking MTBF for Software

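Hardware tends to fail from wear-out, while software tends to fail after changes such as deployments or configuration updates. Because software failures do not follow a wear-out pattern, MTBF is less useful for software than for hardware.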

Ignoring Recovery

Optimizing for high MTBF (never failing) is expensive. Optimizing for low MTTR (fast recovery) is often better.

How to Use MTBF

🔧 Preventive Maintenance: Automate patch management and dependency updates.

🧪 Testing: Catch bugs in staging with integration and E2E tests.

📈 Capacity Planning: Auto-scale before you hit resource limits.

Industry Benchmarks

Excellent (Top 5%): > 720 hours

Good (Top 15%): 168-720 hours

Average (Top 40%): 72-168 hours

Struggling (Below Avg): < 72 hours
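The benchmark bands can be expressed as a small lookup, sketched below. The function name `benchmark_tier` is hypothetical. The published ranges overlap at exactly 168 and 72 hours, so this sketch assumes a boundary value belongs to the higher tier.

```python
def benchmark_tier(mtbf_hours: float) -> str:
    """Map an MTBF in hours to the benchmark band listed above.

    Assumption: boundary values (exactly 168 h or 72 h) count toward
    the higher tier, since the source ranges overlap at those points.
    """
    if mtbf_hours < 0:
        raise ValueError("MTBF cannot be negative")
    if mtbf_hours > 720:
        return "Excellent"
    if mtbf_hours >= 168:
        return "Good"
    if mtbf_hours >= 72:
        return "Average"
    return "Struggling"


for hours in (48, 100, 239, 2160):
    print(f"{hours} h -> {benchmark_tier(hours)}")
```

With this mapping, the memory leak server from the earlier example would move from Struggling (48 h) to Excellent (2160 h).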

Related Terms

MTTR, Availability, Reliability, SLO
