Reliability
The probability that a system will function correctly under stated conditions for a specified period.
## Trust Validation

**Reliability** is bigger than uptime. It is: "Does the system do what the user expects it to do?"

### Reliability vs. Availability

* **Availability**: "The site loads."
* **Reliability**: "The site loads *and* I can add items to my cart."

A site that returns a 200 OK status code but serves a blank white page is **Available** but **Unreliable**.

### Principles of Reliable Systems

1. **Redundancy**: No single point of failure (N+1).
2. **Degradation**: If the search bar breaks, the rest of the site should still work (Graceful Degradation).
3. **Simplicity**: Boring is better. Complex systems break in complex ways.
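The availability/reliability distinction can be sketched as two checks on the same response. This is a minimal illustration, not a production probe: the `Response` type, the function names, and the `add-to-cart` marker are all hypothetical, and a real check would fetch the page over HTTP and probably exercise the feature itself rather than search for a string.

```python
from dataclasses import dataclass


@dataclass
class Response:
    """Minimal stand-in for an HTTP response (hypothetical type for this sketch)."""
    status_code: int
    body: str


def is_available(resp: Response) -> bool:
    # Availability: the server answered with a success status.
    return 200 <= resp.status_code < 300


def is_reliable(resp: Response, expected_marker: str) -> bool:
    # Reliability: the server answered AND the page contains what the user needs.
    # The marker string is an assumption standing in for "the feature works".
    return is_available(resp) and expected_marker in resp.body


# A 200 OK with a blank body: available, but not reliable.
blank_page = Response(status_code=200, body="")
working_page = Response(
    status_code=200,
    body='<button id="add-to-cart">Add to cart</button>',
)

print(is_available(blank_page), is_reliable(blank_page, "add-to-cart"))
print(is_available(working_page), is_reliable(working_page, "add-to-cart"))
```

A status-code-only health check would report both pages as healthy; only the content-aware check catches the blank page.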
Example: The "Ghost" Site
A video streaming site was "up" (users could browse movies), but video playback failed for 10% of users. The site was available, yet it was not reliable.
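To put numbers on this scenario, the sketch below measures the browse path and the playback path separately. The request counts are invented for illustration and chosen only to match the 10% failure figure; `success_ratio` is a hypothetical helper, not part of any monitoring tool.

```python
def success_ratio(successes: int, attempts: int) -> float:
    """Fraction of attempts that succeeded."""
    if attempts <= 0:
        raise ValueError("attempts must be positive")
    return successes / attempts


# Assumed counts for illustration only: every browse request succeeds,
# while 10% of playback attempts fail.
browse_ok, browse_total = 1000, 1000
playback_ok, playback_total = 900, 1000

browse_rate = success_ratio(browse_ok, browse_total)
playback_rate = success_ratio(playback_ok, playback_total)

print(f"Browse success:   {browse_rate:.1%}")
print(f"Playback success: {playback_rate:.1%}")
```

A dashboard that tracks only the browse path would show no problem, while one in ten playback attempts fails.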
Why Reliability Matters
Reliability is the foundation of user trust. Unreliable systems lose customers and damage reputation.
Site Reliability Engineering (SRE) is fundamentally about engineering reliability into systems from the start.