Mean Time To Detect
The average time from when an issue occurs to when an alert fires.
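As a rough illustration of the definition, MTTD can be computed by averaging the gap between when each incident started and when its alert fired. The sketch below assumes incidents are available as `(occurred_at, detected_at)` timestamp pairs; the function name `mean_time_to_detect` and the sample data are hypothetical, not taken from any particular tool.

```python
from datetime import datetime, timedelta


def mean_time_to_detect(incidents):
    """Average detection gap over (occurred_at, detected_at) datetime pairs."""
    if not incidents:
        raise ValueError("need at least one incident to compute MTTD")
    gaps = [detected_at - occurred_at for occurred_at, detected_at in incidents]
    # sum() needs a timedelta start value, since the default start is the int 0.
    return sum(gaps, timedelta()) / len(gaps)


# Hypothetical incident log: detection gaps of 4, 20, and 6 minutes.
incidents = [
    (datetime(2024, 1, 1, 10, 0), datetime(2024, 1, 1, 10, 4)),
    (datetime(2024, 1, 8, 14, 30), datetime(2024, 1, 8, 14, 50)),
    (datetime(2024, 1, 20, 9, 0), datetime(2024, 1, 20, 9, 6)),
]
mttd = mean_time_to_detect(incidents)
print(mttd)  # prints 0:10:00
```

In practice the hard part is usually the `occurred_at` timestamp, which often has to be reconstructed after the fact from logs or metrics.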
## The "Eyes" of the System

**MTTD (Mean Time To Detect)** measures the gap between "It broke" and "We know it broke."

### The "Scream Test"

If your MTTD is high (hours or days), you likely rely on customers to report bugs via support tickets or Twitter. This is called the "Scream Test," and it is the worst way to monitor a system.

### How to Improve MTTD

1. **Observability**: You cannot detect what you cannot see. Instrument your code.
2. **Synthetics**: Have a bot attempt to "Login" and "Checkout" every minute. If it fails, alert immediately.
3. **Anomaly Detection**: Use AI/ML to detect weird patterns (e.g., "Traffic dropped by 50%").
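The synthetic check and the traffic-drop example above could be sketched roughly as follows. Everything here is an assumption for illustration: `run_synthetic_check` and `traffic_dropped` are hypothetical helpers, the probes are stand-in callables rather than real login or checkout requests, and the anomaly check is a simple median baseline rather than an AI/ML model. A scheduler (not shown) would call the check once per minute.

```python
import statistics
from typing import Callable, Sequence


def run_synthetic_check(name: str, probe: Callable[[], bool],
                        alert: Callable[[str], None]) -> bool:
    """Run one probe and alert immediately if it fails or raises."""
    try:
        ok = probe()
    except Exception as exc:  # a crashing probe counts as a failed check
        alert(f"{name} raised {exc!r}")
        return False
    if not ok:
        alert(f"{name} failed")
    return ok


def traffic_dropped(history: Sequence[float], current: float,
                    threshold: float = 0.5) -> bool:
    """True if current traffic is at least `threshold` below the median of history."""
    baseline = statistics.median(history)
    if baseline <= 0:
        return False  # no meaningful baseline to compare against
    return current <= baseline * (1 - threshold)


# Stand-in probes; a real one would exercise the login and checkout flows.
alerts: list[str] = []
run_synthetic_check("login", lambda: True, alerts.append)
run_synthetic_check("checkout", lambda: False, alerts.append)
print(alerts)

# Hypothetical requests-per-minute samples.
recent = [1000, 980, 1020, 1010]
print(traffic_dropped(recent, 480))
print(traffic_dropped(recent, 900))
```

Passing the probe and the alert function in as arguments keeps the check logic testable without network access or a paging system.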
Example: The Silent Cache Failure
"A caching layer failed, slowing the site down by 500ms. No errors were thrown, so no alerts fired."
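A failure like this only becomes detectable if something alerts on latency rather than on errors. One possible sketch, assuming request latencies in milliseconds are already being collected: compare the current 95th-percentile latency against a known-healthy baseline and alert when the gap exceeds a budget. The helper names, the 200 ms budget, and the sample values are all hypothetical.

```python
import statistics
from typing import Sequence


def p95(samples_ms: Sequence[float]) -> float:
    """95th percentile: the last of the 19 cut points that split data into 20 groups."""
    return statistics.quantiles(samples_ms, n=20, method="inclusive")[-1]


def latency_regressed(baseline_ms: Sequence[float], current_ms: Sequence[float],
                      budget_ms: float = 200.0) -> bool:
    """True if p95 latency has risen above the baseline by more than budget_ms."""
    return p95(current_ms) - p95(baseline_ms) > budget_ms


# Hypothetical samples: healthy traffic, then the same traffic 500 ms slower.
healthy = [120, 130, 125, 140, 135, 128, 132, 138, 126, 131]
degraded = [ms + 500 for ms in healthy]

print(latency_regressed(healthy, degraded))
print(latency_regressed(healthy, healthy))
```

With a check like this in place, the cache failure would trip an alert even though no request ever returned an error.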
Why MTTD Matters
A high MTTD is a hidden killer of uptime. Resolution cannot begin until an issue has been detected, so the faster you detect issues, the faster you can resolve them.
Many teams discover outages from customers first. Proactive detection prevents reputation damage.