Mean Time Between Failures
The average time between system failures or incidents.
The average time between system failures or incidents.
## The Reliability Metric **MTBF (Mean Time Between Failures)** is a classic engineering metric (borrowed from hardware) that measures how often things break. ### MTBF vs. MTTR * **MTBF** asks: "How robust is the system?" (Reliability) * **MTTR** asks: "How fast can we fix it?" (Resilience) ### The Availability Equation Availability (e.g., 99.9%) is mathematically derived from these two numbers: ```math Availability = MTBF / (MTBF + MTTR) ``` To improve availability, you can either **crash less often** (Increase MTBF) or **fix it faster** (Decrease MTTR). Modern SRE teams focus more on MTTR because complex systems *will* eventually fail, so being able to recover fast is sustainable than trying to prevent every failure.
ExThe Memory Leak
"A server process had a slow memory leak that caused it to crash every 48 hours (MTBF = 48h)."
Why MTBF Matters
MTBF measures system reliability. Higher MTBF means more stable infrastructure.
Used alongside MTTR to calculate availability: Availability = MTBF / (MTBF + MTTR).