Site Reliability Engineering
A discipline that incorporates aspects of software engineering and applies them to infrastructure and operations problems.
A discipline that incorporates aspects of software engineering and applies them to infrastructure and operations problems.
## "Hope is not a Strategy" **Site Reliability Engineering (SRE)** is what happens when you ask a software engineer to design an operations function. Coined by Google, it replaces the traditional "SysAdmin" model with a focus on automation, reliability, and scale. ### Core Principles 1. **Embrace Risk**: 100% uptime is too expensive and prevents innovation. Use **Error Budgets** to balance speed and stability. 2. **Eliminate Toil**: If a human has to push a button 10 times a day, write a script to do it. 3. **Monitor Distributed Systems**: Focus on symptoms (latency, errors) rather than potential causes (CPU usage). ### SRE vs. DevOps * **DevOps** is the culture/philosophy ("Break down silos"). * **SRE** is the implementation ("Here is how we measure reliability"). * *Analogy*: DevOps is "Chemistry"; SRE is "Chemical Engineering."
ExThe Manual Launch
"A team manually deployed code to 500 servers. It took 6 hours and 3 people."
Why SRE Matters
SRE bridges the gap between development and operations.
SRE principles (SLOs, Error Budgets, Blameless Postmortems) define modern incident management.