CORE METRICS

Mean Time To Detect

Q: Should I alert on everything?

No. Alerting on every minor jitter causes Alert Fatigue. Focus on symptoms (User Pain) rather than causes (High CPU).

The average time from when an issue occurs to when an alert fires.

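MTTD = (Σ Detection Times) / n, where each detection time runs from the moment an issue starts to the moment its alert fires, and n is the number of incidents.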
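As a worked example of the formula, here is a minimal sketch that averages detection times over a handful of incidents. The function name `mean_time_to_detect` and the sample timestamps are illustrative assumptions, not data from a real system.

```python
from datetime import datetime, timedelta

def mean_time_to_detect(incidents):
    """Average the (alert fired - issue started) gap over all incidents.

    incidents: list of (started_at, alerted_at) datetime pairs.
    """
    if not incidents:
        raise ValueError("need at least one incident to compute MTTD")
    total = sum((alerted - started for started, alerted in incidents), timedelta())
    return total / len(incidents)

# Hypothetical incident log: detection took 2, 4, and 9 minutes.
incidents = [
    (datetime(2024, 1, 5, 10, 0), datetime(2024, 1, 5, 10, 2)),
    (datetime(2024, 1, 9, 14, 30), datetime(2024, 1, 9, 14, 34)),
    (datetime(2024, 2, 1, 3, 15), datetime(2024, 2, 1, 3, 24)),
]

mttd = mean_time_to_detect(incidents)
print(mttd)  # 0:05:00, i.e. (2 + 4 + 9) / 3 = 5 minutes
```

Note that a mean is sensitive to outliers: a single incident that went unnoticed for days will dominate the result, so it can be worth looking at the individual detection times as well.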


## The "Eyes" of the System

**MTTD (Mean Time To Detect)** measures the gap between "It broke" and "We know it broke."

### The "Scream Test"

If your MTTD is high (hours or days), you likely rely on customers to report bugs via support tickets or Twitter. This is called the "Scream Test," and it is the worst way to monitor a system.

### How to Improve MTTD

1. **Observability**: You cannot detect what you cannot see. Instrument your code.
2. **Synthetics**: Have a bot attempt to "Login" and "Checkout" every minute. If it fails, alert immediately.
3. **Anomaly Detection**: Use AI/ML to detect weird patterns (e.g., "Traffic dropped by 50%").
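The "Traffic dropped by 50%" pattern can be illustrated without any AI/ML at all. The sketch below swaps in a plain baseline comparison as a stand-in for a real anomaly detector; the function name `traffic_dropped`, the 0.5 threshold, and the sample request counts are assumptions made for illustration.

```python
from statistics import mean

def traffic_dropped(history, current, drop_threshold=0.5):
    """Return True if `current` is at least `drop_threshold` (as a fraction)
    below the mean of `history`."""
    if not history:
        return False  # no baseline yet, so nothing to compare against
    baseline = mean(history)
    if baseline <= 0:
        return False  # avoid dividing by zero on an idle service
    return (baseline - current) / baseline >= drop_threshold

# Hypothetical requests-per-minute samples; the baseline is 1000.
recent = [980, 1010, 1005, 995, 1010]

print(traffic_dropped(recent, 900))  # False: only a 10% dip
print(traffic_dropped(recent, 480))  # True: a 52% drop
```

A fixed baseline like this ignores daily and weekly seasonality, which is the usual reason teams reach for more sophisticated detectors.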

Example: The Silent Cache Failure

"A caching layer failed, slowing the site down by 500ms. No errors were thrown, so no alerts fired."

Impact

The issue persisted for 3 days until a user complained about slowness.

Resolution

Team added Latency SLOs (alert if p95 latency > 200ms). MTTD for slowness dropped to <1 minute.
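The latency rule from this example (alert if p95 latency > 200ms) might look like the sketch below. It uses a nearest-rank percentile, which is one of several common percentile definitions; the helper names and sample latencies are hypothetical, and a real monitoring system would evaluate this over a sliding time window.

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]

def latency_slo_breached(latencies_ms, threshold_ms=200, pct=95):
    """True if the chosen percentile exceeds the threshold."""
    return percentile(latencies_ms, pct) > threshold_ms

# Hypothetical window of 100 requests while the cache is working.
healthy = [120] * 95 + [180] * 5
# The same traffic with the 500ms slowdown described above.
degraded = [latency + 500 for latency in healthy]

print(latency_slo_breached(healthy))   # False: p95 is 120ms
print(latency_slo_breached(degraded))  # True: p95 is 620ms
```

No error is ever thrown in the degraded case, which is why an error-rate alert alone missed it while a latency threshold catches it.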

Why MTTD Matters

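A high MTTD is the hidden killer of uptime. Repair cannot start until the issue is detected, so every minute of detection delay adds directly to the outage. The faster you detect issues, the faster you can resolve them.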

Many teams discover outages from customers first. Proactive detection prevents reputation damage.

MTTD vs. Other Metrics

MTTD: Issue → Alert fires

MTTA: Alert → Acknowledged

MTTR: Alert → Resolved
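To make the three intervals concrete, the sketch below splits one incident timeline into the pieces each metric averages. The timestamps are made up, and the resolve interval follows this page's convention of starting at the alert; some teams measure MTTR from the start of the issue instead.

```python
from datetime import datetime

def phase_durations(started, alerted, acknowledged, resolved):
    """Split one incident's timeline into the three intervals compared above."""
    return {
        "time_to_detect": alerted - started,            # averaged into MTTD
        "time_to_acknowledge": acknowledged - alerted,  # averaged into MTTA
        "time_to_resolve": resolved - alerted,          # averaged into MTTR
    }

# Hypothetical incident timeline.
durations = phase_durations(
    started=datetime(2024, 3, 1, 12, 0),
    alerted=datetime(2024, 3, 1, 12, 4),
    acknowledged=datetime(2024, 3, 1, 12, 7),
    resolved=datetime(2024, 3, 1, 12, 49),
)

for name, value in durations.items():
    print(f"{name}: {value}")
```

Here detection took 4 minutes, acknowledgement 3 minutes, and resolution 45 minutes after the alert.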

Common Pitfalls

Monitoring Only Uptime

A server can be "up" (responding 200 OK) but serving blank white pages. Monitor functionality, not just headers.
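One way to monitor functionality rather than headers is to assert on the page content as well as the status code. This is a minimal sketch: `page_is_healthy` and the "Add to cart" marker are hypothetical, and the status code and body would come from whatever HTTP client your checks already use.

```python
def page_is_healthy(status_code, body, required_marker):
    """Healthy only if the server answered 200 AND the body contains
    content a real user would expect to see."""
    return status_code == 200 and required_marker in body

# A blank white page still returns 200 OK, but fails the content check.
print(page_is_healthy(200, "", "Add to cart"))                                # False
print(page_is_healthy(200, "<button>Add to cart</button>", "Add to cart"))    # True
print(page_is_healthy(503, "<button>Add to cart</button>", "Add to cart"))    # False
```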

Missing Third-Party Failures

If Stripe goes down, your payments fail. Monitor your dependencies.

How to Use MTTD

📊 Comprehensive Monitoring: Implement the "Four Golden Signals" (Latency, Traffic, Errors, Saturation).

🎯 SLOs: Alert when error budget burns too fast.
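"Burns too fast" is usually expressed as a burn rate: the observed error rate divided by the error rate the SLO allows. The sketch below assumes a 99.9% availability SLO and a paging threshold of 14.4, a value often used for a one-hour window against a 30-day budget; both numbers, and the function names, are illustrative assumptions rather than recommendations.

```python
def burn_rate(errors, requests, slo_target):
    """How many times faster than 'exactly on budget' errors are occurring.

    A burn rate of 1.0 spends the error budget exactly over the SLO period.
    """
    if requests == 0:
        return 0.0
    error_budget = 1.0 - slo_target  # e.g. 0.001 for a 99.9% SLO
    return (errors / requests) / error_budget

def should_page(errors, requests, slo_target=0.999, threshold=14.4):
    """Page when the budget is burning faster than the chosen threshold."""
    return burn_rate(errors, requests, slo_target) >= threshold

# Hypothetical one-hour windows of 10,000 requests each.
print(should_page(errors=5, requests=10_000))    # False: burn rate is about 0.5
print(should_page(errors=150, requests=10_000))  # True: burn rate is about 15
```

Alerting on burn rate ties the alert to user-visible impact, which fits the advice to alert on symptoms rather than causes.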

🔍 Synthetic Monitoring: Simulate user traffic to catch issues 24/7.
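A synthetic check can be sketched as a list of named steps that a scheduler runs on a fixed interval. Everything here is a hypothetical skeleton: the step functions stand in for real browser or API calls, and `alert` stands in for whatever paging integration you use.

```python
def run_synthetic_check(steps, alert):
    """Run each (name, step) pair in order; alert and stop at the first failure.

    Each step is a zero-argument callable returning True on success.
    Returns True only if every step succeeded.
    """
    for name, step in steps:
        try:
            ok = step()
        except Exception as exc:
            alert(f"synthetic step '{name}' raised {exc!r}")
            return False
        if not ok:
            alert(f"synthetic step '{name}' failed")
            return False
    return True

# Stand-in steps: in practice these would drive a real login and checkout.
def fake_login():
    return True

def fake_checkout():
    return False  # simulate a broken checkout flow

alerts = []
passed = run_synthetic_check(
    [("login", fake_login), ("checkout", fake_checkout)],
    alert=alerts.append,
)
print(passed)  # False
print(alerts)  # ["synthetic step 'checkout' failed"]
```

A scheduler (cron, or a loop with a sleep) would call `run_synthetic_check` at the chosen interval, such as once a minute.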

Industry Benchmarks

Excellent (Top 5%): < 1 min

Good (Top 15%): 1-5 min

Average (Top 40%): 5-15 min

Struggling (Below Avg): > 15 min

Related Terms

MTTA, MTTR, Observability, Monitoring

Frequently Asked Questions

Is MTTD zero if I have instantaneous alerts?

Technically yes, but in practice there is always some lag (e.g., a 30s polling interval). Under 1 minute is the gold standard.
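The effect of a polling interval can be worked out directly. The sketch below is a simplified model that assumes checks run exactly every `poll_interval_s` seconds and that a failure is noticed by the first poll after it begins; it ignores alert evaluation and notification delays, and the function name is hypothetical.

```python
def detection_lag(failure_offset_s, poll_interval_s):
    """Seconds until the next poll notices a failure that began
    failure_offset_s seconds after the previous poll."""
    if not 0 <= failure_offset_s < poll_interval_s:
        raise ValueError("failure must start within one polling interval")
    return poll_interval_s - failure_offset_s

interval = 30  # the 30s polling interval from the answer above

# Try a failure starting at every whole-second offset within one interval.
lags = [detection_lag(offset, interval) for offset in range(interval)]

worst = max(lags)                # failure starts right after a poll
average = sum(lags) / len(lags)  # roughly half the interval

print(worst, average)  # 30 15.5
```

Under this model a 30s poll adds about 15 seconds to detection on average and up to 30 seconds in the worst case, before any alerting delay is counted.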
