ROLES

Site Reliability Engineering

A discipline that incorporates aspects of software engineering and applies them to infrastructure and operations problems.

SRE

A discipline that incorporates aspects of software engineering and applies them to infrastructure and operations problems.

## "Hope is not a Strategy" **Site Reliability Engineering (SRE)** is what happens when you ask a software engineer to design an operations function. Coined by Google, it replaces the traditional "SysAdmin" model with a focus on automation, reliability, and scale. ### Core Principles 1. **Embrace Risk**: 100% uptime is too expensive and prevents innovation. Use **Error Budgets** to balance speed and stability. 2. **Eliminate Toil**: If a human has to push a button 10 times a day, write a script to do it. 3. **Monitor Distributed Systems**: Focus on symptoms (latency, errors) rather than potential causes (CPU usage). ### SRE vs. DevOps * **DevOps** is the culture/philosophy ("Break down silos"). * **SRE** is the implementation ("Here is how we measure reliability"). * *Analogy*: DevOps is "Chemistry"; SRE is "Chemical Engineering."

ExThe Manual Launch

"A team manually deployed code to 500 servers. It took 6 hours and 3 people."

Impact

Deployments were so painful they only happened once a month. Features rotted.

Resolution

An SRE wrote a deployment pipeline. Deploys now take 10 minutes and happen 20 times a day automatically.

Why SRE Matters

SRE bridges the gap between development and operations.

SRE principles (SLOs, Error Budgets, Blameless Postmortems) define modern incident management.

Common Pitfalls

Rebranding Ops

Renaming your Ops team to "SRE" without changing *how* they work (still doing manual tickets).

Ignoring Error Budgets

Tracking SLOs but ignoring them when they burn down.

Gatekeeping

SREs becoming the "Department of No" instead of enabling developers.

How to Use SRE

📉

Reduce Toil: Automate repetitive work to free up engineering time.

🎯

Start with SLOs: Define reliability goals before building solutions.

🤝

Blameless Culture: Focus on system failure, not human error.

Related Terms

DevOps Reliability Toil

Frequently Asked Questions

Do I need a dedicated SRE team?

Not initially. You can practice SRE principles (SLOs, automation) within your product teams.

Is SRE just a new name for Ops?

No. Traditional Ops closes tickets. SREs write code to close tickets forever.

What is Toil?

Manual, repetitive, automatable work. SREs aim to cap toil at 50% of their time.