Monitoring
The process of collecting, analyzing, and using data to track the health of applications and infrastructure.
## "The Dashboard"

**Monitoring** is about knowing the knowns. It answers: "Is the system healthy according to the rules we defined?"

### Blackbox vs. Whitebox

* **Blackbox Monitoring**: Testing from the outside. "Is the website returning 200 OK?" (Pingdom, UptimeRobot).
* **Whitebox Monitoring**: Testing from the inside. "Is the JVM heap usage < 80%?" (Prometheus, CloudWatch).

### The Golden Signals (Google SRE)

If you monitor nothing else, monitor these four:

1. **Latency**: Time to serve a request.
2. **Traffic**: Demand on the system (RPS).
3. **Errors**: Rate of failed requests.
4. **Saturation**: How "full" the service is (CPU, Memory, Disk).
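To make the four golden signals concrete, here is a minimal sketch that summarizes one observation window of request records. The `Request` record shape, the `golden_signals` function, the choice of the 95th percentile for latency, and the rule that only 5xx statuses count as failures are all assumptions made for illustration. A real setup would normally get these values from a metrics system such as Prometheus rather than compute them by hand.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Request:
    """One served request (hypothetical record shape)."""
    duration_ms: float
    status: int


def golden_signals(
    requests: List[Request], window_s: float, cpu_fraction: float
) -> Dict[str, float]:
    """Summarize one observation window into the four golden signals.

    cpu_fraction stands in for saturation here (assumption); memory or
    disk utilization could be used the same way.
    """
    if not requests:
        return {
            "latency_p95_ms": 0.0,
            "traffic_rps": 0.0,
            "error_rate": 0.0,
            "saturation": cpu_fraction,
        }

    durations = sorted(r.duration_ms for r in requests)
    # Nearest-rank 95th percentile: rank = ceil(0.95 * n), in integer math.
    rank = (95 * len(durations) + 99) // 100
    p95 = durations[rank - 1]

    # Assumption: only server-side errors (5xx) count as failed requests.
    failures = sum(1 for r in requests if r.status >= 500)

    return {
        "latency_p95_ms": p95,                       # Latency
        "traffic_rps": len(requests) / window_s,     # Traffic
        "error_rate": failures / len(requests),      # Errors
        "saturation": cpu_fraction,                  # Saturation
    }
```

Reporting a high percentile rather than the mean is a deliberate choice in this sketch: a small number of very slow requests can hide behind a healthy-looking average.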
Example: The Silent HDD Failure
"A database server's hard drive filled up. The 'Disk Full' alert had been disabled because it was 'noisy'."
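One possible alternative to disabling a noisy disk alert is to require a sustained breach before it fires. The sketch below is an illustration under assumptions, not a description of what the team in this example ran: the helper names `disk_saturation` and `should_alert`, the 90% threshold, and the three-sample rule are all hypothetical. Only `shutil.disk_usage` is a real standard-library call.

```python
import shutil
from typing import Sequence


def disk_saturation(path: str = "/") -> float:
    """Return the used fraction (0.0 to 1.0) of the filesystem holding `path`."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total


def should_alert(
    samples: Sequence[float], threshold: float = 0.9, consecutive: int = 3
) -> bool:
    """Fire only when the most recent `consecutive` samples all breach the threshold.

    A single spike does not page anyone, which reduces noise without
    silencing the alert entirely. Both defaults are illustrative assumptions.
    """
    if len(samples) < consecutive:
        return False
    return all(s >= threshold for s in samples[-consecutive:])
```

A scheduler would call `disk_saturation()` at a fixed interval, append the result to a history list, and pass that list to `should_alert`.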
Why Monitoring Matters
Monitoring is the foundation of incident detection. Without it, you rely on customers to tell you something is broken.
Good monitoring means knowing about problems before your users do.