You've seen the sales deck: "99.9% uptime guaranteed." Ask the engineering team what that means. What happens when you miss it? Who decides what counts as downtime? Often, nobody can answer quickly.

SLA, SLO, and SLI get used interchangeably. Teams set arbitrary targets ("let's do 99.9% because everyone else does"), then wonder why customers are angry when "nothing technically broke."

These aren't synonyms. They serve completely different purposes. Here's what each one actually means and how to use them without creating busywork.

---

## What You'll Learn

- What SLI, SLO, and SLA actually mean (and why the order matters)
- How to pick SLIs that customers care about (not just what's easy to measure)
- How to set realistic SLO targets (not copy-paste 99.9%)
- Error budgets: the framework that stops "is this urgent?" arguments
- Copy-paste SLO template (30-minute setup)
- Common mistakes and how to avoid them

---

## SLI: What You Measure

Service Level Indicator. The actual metric you track. SLI is the measurement. SLO is the target. SLA is the promise.

**Good SLIs:** Error rate, latency (p95, p99), availability. Things customers notice.

**Bad SLIs:** CPU utilization, memory usage, disk space. Things ops teams notice but users don't.

The trap: picking SLIs because they're easy to measure, not because they matter. Track CPU as your SLI and you'll spend months optimizing it. Meanwhile, API latency spikes to 5 seconds and customers can't log in. Your dashboard looks perfect. Customers are furious.

**The rule:** If a user wouldn't notice it breaking, it's not an SLI. It's just a metric.
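To make this concrete, here's a minimal sketch of computing two user-facing SLIs from raw request data. The log format (a list of `(status_code, latency_ms)` pairs) is a made-up stand-in for whatever your metrics pipeline actually emits:

```python
import math

def success_rate(requests):
    """Fraction of requests that returned a 2xx status -- an availability SLI."""
    ok = sum(1 for status, _ in requests if 200 <= status < 300)
    return ok / len(requests)

def p95_latency(requests):
    """95th-percentile latency in ms (nearest-rank method) -- a latency SLI."""
    latencies = sorted(latency for _, latency in requests)
    rank = max(1, math.ceil(0.95 * len(latencies)))  # nearest rank: ceil(p * n)
    return latencies[rank - 1]

# Hypothetical sample: 100 requests, 75 successes, tail latency at 450 ms.
log = [(200, 120), (200, 90), (503, 40), (200, 450)] * 25
print(f"success rate: {success_rate(log):.2%}")   # 75.00%
print(f"p95 latency:  {p95_latency(log)} ms")     # 450 ms
```

Note that both SLIs are ratios or percentiles over what users actually experienced, not host-level metrics like CPU.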
### Common SLIs by Service Type

| Service Type | Good SLI | Why It Matters |
|--------------|----------|----------------|
| API | Success rate (2xx/total requests) | Users see errors directly |
| API | Latency (p95 < 500ms) | Slow = broken for users |
| Database | Query success rate | Failed queries = broken features |
| Frontend | Time to interactive | Users abandon slow pages |
| Background jobs | Processing time per job | Delayed jobs = broken workflows |

Pick 1-2 SLIs per service. More than that and you're tracking everything, optimizing nothing.

---

## SLO: Your Internal Target

Service Level Objective. The number you're aiming for. SLOs are internal targets. SLAs are external promises.

**Example:** "99.5% of API requests succeed within 500ms."

- SLI = request success rate + latency
- SLO = 99.5% threshold

SLOs are **internal**. You don't publish them to customers. They're how engineering defines "good enough" and aligns with [incident response playbooks](/blog/incident-response-playbook).

### How to Pick an SLO (Don't Copy-Paste 99.9%)

**Step 1: Look at your last 30 days**

What are you actually delivering right now? If you're at 99.3%, don't set a target of 99.9%. You'll miss it immediately and the number becomes meaningless.

**Step 2: Set the target slightly below current reality**

Give yourself room for bad days.

- Current performance: 99.7%
- Target SLO: 99.5%
- Buffer: 0.2% for unexpected issues

**Step 3: Validate it maps to user experience**

Ask: "If we hit 99.5%, will customers be happy?" If the answer is no, your SLI is wrong (not your target).
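Steps 1 and 2 fit in a few lines of code. A sketch, assuming performance is expressed as a percentage and the buffer follows the 0.2% example (the function name is mine, not a standard API):

```python
def suggest_slo(last_30_days_pct, buffer_pct=0.2):
    """Step 1: measure reality. Step 2: target slightly below it."""
    if not 0 < buffer_pct < last_30_days_pct:
        raise ValueError("buffer must be positive and below measured performance")
    return round(last_30_days_pct - buffer_pct, 2)

print(suggest_slo(99.7))   # 99.5 -- matches the example above
print(suggest_slo(99.3))   # 99.1 -- not a copy-pasted 99.9
```

Step 3 (does the target map to user experience?) stays a human judgment call; no helper function can answer it.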
### Monthly vs Weekly SLOs

Most teams use **monthly SLOs** because:

- SLAs (contracts) are typically monthly
- Industry standard for reporting
- Easier to absorb bad days

But track **weekly burn rate** to avoid surprises:

- Monthly SLO: 99.5% = 216 minutes allowed downtime
- Weekly burn rate: 216 ÷ 4.33 ≈ 50 minutes/week
- If you burn 200 minutes in week 1, you're in trouble

**Policy example:**

- Track monthly SLO (99.5%)
- Review weekly burn rate
- Trigger escalation at 50% of monthly budget burned

### The Cost of Nines

Each additional "9" is often an order-of-magnitude more effort/cost, depending on architecture and org maturity.

| Uptime Target | Downtime/Year | Downtime/Month | What It Takes |
|---------------|---------------|----------------|---------------|
| 99% | 3.65 days | ~7.2 hours | Basic monitoring, manual responses |
| 99.5% | 1.83 days | ~3.6 hours | Automated alerts, on-call rotation |
| 99.9% | 8.77 hours | ~43 minutes | Redundancy, automated failover |
| 99.99% | 52 minutes | ~4 minutes | Multi-region, chaos engineering |

Promise 99.99% to win a deal and you might spend $50k/month on infrastructure for a $5k/month customer. Sales shouldn't set SLOs without engineering sign-off.

---

## SLA: Your External Promise

Service Level Agreement. The contract with consequences. SLAs are **external**. They define what happens when you miss your target.

**Example:** "We commit to 99.5% monthly uptime. If we fall below, you get a 10% service credit."

### Who Needs an SLA?

**Yes:**

- B2B selling to enterprises
- Contracts with procurement teams
- Customers who require guaranteed uptime

**No:**

- Early-stage startups (under 50 customers)
- Internal tools
- Self-serve products with monthly billing

A 20-person startup calculating SLA credits for $50/month customers is creating accounting busywork without meaningful upside.

### Smart Buffer: Internal SLO > External SLA

Don't promise externally what you barely deliver internally.
**Example setup:**

- Internal SLO: 99.7% (what engineering targets)
- External SLA: 99.5% (what customers get promised)
- Buffer: 0.2% for unexpected issues

Gives you room to have a bad week without breaching customer contracts.

---

## Error Budget: What Makes This Actually Useful

Error budget is how teams decide: ship features, or pay down reliability debt. SLOs without error budgets are just numbers on a dashboard. Error budgets turn SLOs into a [prioritization framework](/blog/how-to-reduce-mttr).

### The Math

**Error budget = 100% - SLO target**

If your SLO is 99.5%, your error budget is 0.5%.

| SLO Target | Error Budget/Month | Weekly Burn Rate Estimate |
|------------|-------------------|---------------------------|
| 99.9% | ~43 minutes | ~10 minutes |
| 99.5% | ~3.6 hours | ~50 minutes |
| 99% | ~7.2 hours | ~1.7 hours |

*Weekly burn rate = monthly budget ÷ 4.33 weeks. Track weekly to avoid burning the entire monthly budget early.*

### How Teams Use Error Budgets

**The rule:** If you have budget left, ship features. If you're burning budget, stop shipping and fix reliability.

**Example policy:**

- Weekly error budget drops below 50%? → Triage. Identify root cause.
- Weekly error budget drops below 20%? → Feature freeze. Reliability becomes priority #1.
- Error budget refills weekly. Start fresh every Monday.

No more arguments about "is this urgent?" Burning error budget = urgent. Not burning = queue it.

---

## How to Set Your First SLO in 30 Minutes

Here's the step-by-step process.

### Step 1: Pick Your Most Important Service (5 minutes)

Start with one service. The one customers complain about when it breaks. API? Database? Frontend?

### Step 2: Choose 1-2 SLIs (10 minutes)

Ask: "What do users notice when this breaks?"
**For an API:**

- Success rate (requests returning 2xx / total requests)
- Latency (p95 response time)

**For a database:**

- Query success rate
- Query latency (p99)

**For a frontend:**

- Page load time (p95)
- Time to interactive

Pick the one that matters most. Don't track everything.

### Step 3: Measure Current Performance (10 minutes)

Pull the last 30 days of data. What's your actual success rate? 99.2%? 99.7%? 98.5%? Be honest. No aspirational numbers.

### Step 4: Set Target Slightly Below Reality (5 minutes)

- Current: 99.7%
- Target SLO: 99.5%

Give yourself buffer.

### Done. You Have an SLO.

Now track it weekly. When you burn error budget, investigate. When you have budget, ship features.

---

## SLO Template (Copy-Paste)

Use this to document your first SLO.

```markdown
## SLO: [Service Name]

**Service:** [e.g., Payment API]
**Owner:** [Team name]
**Last updated:** [Date]

### SLI (What We Measure)
- Metric: [e.g., Request success rate]
- Definition: [e.g., HTTP 2xx responses / total requests]
- Measurement window: [e.g., Monthly, evaluated weekly]

### SLO (Our Target)
- Target: [e.g., 99.5% success rate]
- Current performance (last 30 days): [e.g., 99.7%]
- Error budget: [e.g., 0.5% = 216 minutes/month or ~50 minutes/week burn rate]

### SLA (External Promise) - Optional
- Customer promise: [e.g., 99.5% monthly uptime]
- Consequence: [e.g., 10% service credit if breached]
- Measurement period: [e.g., Monthly]

### Escalation Policy
- Error budget < 50%: Triage, identify root cause
- Error budget < 20%: Feature freeze, fix reliability
- Error budget refills: Weekly (every Monday)

### How We Measure
- Dashboard: [Link to dashboard]
- Alert: [Link to alert config]
- On-call: [Link to on-call schedule]
```

Combine with [incident severity levels](/blog/incident-severity-levels) to align response urgency.

Copy this. Fill in the blanks. You're done.

---

## Real Examples (What This Looks Like in Practice)

Here are common patterns.
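First, a sketch of how an escalation policy like the template's turns error-budget numbers into decisions. The 50%/20% thresholds come from the template; the function names and the sample numbers are illustrative, not a prescribed implementation:

```python
MINUTES_PER_MONTH = 30 * 24 * 60  # 43,200 -- the 30-day month used throughout

def budget_remaining(slo_pct, downtime_minutes, window_minutes=MINUTES_PER_MONTH):
    """Fraction of the error budget still unspent (1.0 = untouched, 0.0 = gone)."""
    budget_minutes = (100.0 - slo_pct) / 100.0 * window_minutes
    return max(0.0, 1.0 - downtime_minutes / budget_minutes)

def escalation(remaining):
    """The template's policy: < 20% freeze, < 50% triage, otherwise ship."""
    if remaining < 0.20:
        return "feature freeze, fix reliability"
    if remaining < 0.50:
        return "triage, identify root cause"
    return "ship features"

# 99.5% monthly SLO = 216 minutes of budget; suppose 130 minutes already burned.
left = budget_remaining(99.5, downtime_minutes=130)
print(f"{left:.0%} of budget left -> {escalation(left)}")  # 40% ... triage
```

The point of encoding the policy is that nobody argues about thresholds during an incident; the number decides.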
### Example 1: API Service (B2B SaaS)

**Service:** User authentication API
**SLI:** Request success rate
**Internal SLO:** 99.7% weekly
**External SLA:** 99.5% monthly
**Error budget:** ~30 min/week (internal), ~3.6 hours/month (external)

**How they use it:**

- Daily dashboard shows weekly SLO burn rate
- If weekly drops below 99.5%, all-hands triage
- Sales can't promise below 99.5% without engineering sign-off
- If error budget hits 20%, feature work pauses

**Why it works:** Clear line between "we're fine" and "drop everything."

### Example 2: Background Job Processing

**Service:** Email sending queue
**SLI:** Processing time per job
**Internal SLO:** 95% of jobs processed within 5 minutes
**External SLA:** None (internal tool)
**Error budget:** 5% of jobs can exceed 5 minutes

**How they use it:**

- Jobs taking > 5 minutes get logged
- If more than 5% exceed the threshold in a day, investigate
- No external SLA because it's internal tooling

**Why it works:** Simple threshold, no customer promises needed.

### Example 3: The Team That Set 99.99% and Regretted It

A startup promised 99.99% uptime to land an enterprise deal. The contract was $10k/month. The infrastructure to deliver 99.99%? $30k/month in redundancy, multi-region failover, and 24/7 on-call.

Six months in, they renegotiated down to 99.5%. The customer didn't care (they never checked the SLA). Engineering stopped hemorrhaging budget.

**The lesson:** Don't promise nines you can't afford.

---

## What Teams Get Wrong

### Mistake 1: Copying 99.9% Without Doing the Math

99.9% uptime = ~8.7 hours/year downtime allowed
99.99% uptime = ~52 minutes/year downtime allowed

The gap is often an order-of-magnitude more expensive to achieve. Chase 99.99% because a competitor claimed it and you'll discover they measured it differently.
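Doing the math takes one line per target. A sketch of the downtime each extra nine actually permits, assuming a 365-day year (figures shift slightly if you use 365.25 days):

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600

def allowed_downtime_min_per_year(uptime_pct):
    """Minutes of downtime per year that an uptime target still allows."""
    return (100.0 - uptime_pct) / 100.0 * MINUTES_PER_YEAR

for target in (99.0, 99.5, 99.9, 99.99):
    minutes = allowed_downtime_min_per_year(target)
    print(f"{target:>5}% -> {minutes:8.1f} min/year ({minutes / 60:.1f} hours)")
```

Run it before the sales call, not after the contract is signed.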
### Mistake 2: Setting SLOs You Can't Measure

Team sets 99.9% uptime but doesn't have:

- Automated monitoring
- A clear definition of what counts as "down"
- Alerting when they're out of SLO

Your SLO is 99.9%. Someone asks "how did we do last month?" and the answer is "we haven't set that up yet." That's not an SLO. That's a goal written on a napkin.

### Mistake 3: No Buffer Between Internal and External

Team sets:

- Internal SLO: 99.5%
- External SLA: 99.5%

First bad week? Immediate SLA breach. Customer credits. Angry emails.

**Better:**

- Internal SLO: 99.7%
- External SLA: 99.5%
- Buffer: 0.2% wiggle room

Gives you space to have a bad week without breaching contracts.

### Mistake 4: Too Many SLOs

Team tracks 15 SLOs across 3 services. Result: everything's yellow. Nothing's a priority. Analysis paralysis.

**Better:** 1-2 SLOs per service. Track what matters. Ignore the rest.

### Mistake 5: SLOs Nobody Checks

Team sets SLOs in a wiki. Nobody looks at them until a customer complains.

**Better:** Daily dashboard. Weekly review. Automated alerts when burning error budget. If nobody's checking your SLO, you don't have an SLO.

---

## Error Budget Calculator

Use this to calculate your error budget.

**Formula:**

```
Error budget (minutes/month) = (100% - SLO%) × 43,200 minutes
```

**Examples:**

| SLO | Calculation | Error Budget/Month |
|-----|-------------|-------------------|
| 99.9% | (100% - 99.9%) × 43,200 | 43.2 minutes |
| 99.5% | (100% - 99.5%) × 43,200 | 216 minutes (3.6 hours) |
| 99% | (100% - 99%) × 43,200 | 432 minutes (7.2 hours) |
| 95% | (100% - 95%) × 43,200 | 2,160 minutes (36 hours) |

**Weekly estimate (from a monthly SLO):** Divide the monthly minutes by 4.33 (weeks per month).

99.5% monthly SLO = ~50 minutes/week error budget

---

## Quick Reference

| Term | What It Is | Who Sets It | Example | Public? |
|------|-----------|-------------|---------|---------|
| **SLI** | The metric you track | Engineering | Error rate, latency | No |
| **SLO** | Your internal target | Engineering | 99.5% success rate | No |
| **SLA** | Your external promise | Business/Legal | "99.5% uptime or 10% credit" | Yes |

**Key insight:** SLIs and SLOs are for engineering. SLAs are for customers and contracts.

---

## The Bottom Line

- **SLI** = what you measure (pick what users notice, not what's easy)
- **SLO** = your internal target (set it below current reality, not aspirational)
- **SLA** = your external promise (only if selling to enterprises)

Use error budgets to drive prioritization. Stop arguing about "is this urgent?" Let your error budget decide. Start with 1 service, 1-2 SLIs, 1 SLO. Add complexity only when needed.

If you're setting SLOs based on competitor claims, you'll end up optimizing the wrong thing. Set them based on what you can actually deliver, then improve.

---

## Common Questions

**What's the difference between SLO and SLA?**

SLO = internal target (what engineering aims for). SLA = external promise with contractual consequences (what customers get). Your SLO should be stricter than your SLA to give yourself buffer.

**What SLO should I set?**

Look at your last 30 days of actual performance. Set the target slightly below that (0.2-0.5% buffer). Don't copy-paste 99.9% because it sounds good.

**Do I need an SLA?**

Only if you're selling to enterprises that require contractual guarantees. Most startups don't need SLAs until Series B+. Internal tools never need SLAs.

**How many SLOs should I have?**

Start with 1-2 per service. More than that and you're tracking everything, prioritizing nothing. Focus beats coverage.

**What if we miss our SLO?**

Nothing happens contractually (that's what SLAs are for). But if you miss consistently, either (1) you have a reliability problem, or (2) your target is wrong. Investigate which.
**How do I calculate error budget?**

Error budget = 100% - SLO target. For 99.5% SLO, error budget is 0.5%. In a 30-day month (43,200 minutes), that's 216 minutes or 3.6 hours of allowed downtime.

**What's a realistic SLO for a startup?**

99% to 99.5% is realistic for most startups. 99.9% requires significant investment. 99.99% is overkill unless you're in fintech, healthcare, or selling to enterprises with hard requirements.

**Should internal tools have SLOs?**

Only if they're critical. Your deployment pipeline? Maybe. Your internal wiki? Probably not. Don't create SLO overhead for tools that don't need it.

**How often should I review SLOs?**

Weekly for error budget burn. Quarterly for target adjustments. If your performance drifts significantly (up or down), update the SLO target.

---

## Next Reads

- [Incident Severity Levels: The Framework That Actually Works](/blog/incident-severity-levels)
- [How to Reduce MTTR in 2026: The Coordination Framework](/blog/how-to-reduce-mttr)
- [Incident Management at Scale: Research from 25+ Teams](/blog/scaling-incident-management)

---

<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "FAQPage",
  "mainEntity": [
    {
      "@type": "Question",
      "name": "What's the difference between SLO and SLA?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "SLO is your internal target (what engineering aims for). SLA is your external promise with contractual consequences (what customers get). Your SLO should be stricter than your SLA to give yourself buffer."
      }
    },
    {
      "@type": "Question",
      "name": "What SLO should I set?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "Look at your last 30 days of actual performance. Set the target slightly below that (0.2-0.5% buffer). Don't copy-paste 99.9% because it sounds good."
      }
    },
    {
      "@type": "Question",
      "name": "Do I need an SLA?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "Only if you're selling to enterprises that require contractual guarantees. Most startups don't need SLAs until Series B+. Internal tools never need SLAs."
      }
    },
    {
      "@type": "Question",
      "name": "How many SLOs should I have?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "Start with 1-2 per service. More than that and you're tracking everything, prioritizing nothing. Focus beats coverage."
      }
    },
    {
      "@type": "Question",
      "name": "What if we miss our SLO?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "Nothing happens contractually (that's what SLAs are for). But if you miss consistently, either you have a reliability problem, or your target is wrong. Investigate which."
      }
    },
    {
      "@type": "Question",
      "name": "How do I calculate error budget?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "Error budget = 100% - SLO target. For 99.5% SLO, error budget is 0.5%. In a 30-day month (43,200 minutes), that's 216 minutes or 3.6 hours of allowed downtime."
      }
    },
    {
      "@type": "Question",
      "name": "What's a realistic SLO for a startup?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "99% to 99.5% is realistic for most startups. 99.9% requires significant investment. 99.99% is overkill unless you're in fintech, healthcare, or selling to enterprises with hard requirements."
      }
    },
    {
      "@type": "Question",
      "name": "Should internal tools have SLOs?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "Only if they're critical. Your deployment pipeline? Maybe. Your internal wiki? Probably not. Don't create SLO overhead for tools that don't need it."
      }
    },
    {
      "@type": "Question",
      "name": "How often should I review SLOs?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "Weekly for error budget burn. Quarterly for target adjustments. If your performance drifts significantly up or down, update the SLO target."
      }
    }
  ]
}
</script>
SLA vs. SLO vs. SLI: What Actually Matters (With Templates)
SLI = what you measure. SLO = your target. SLA = your promise. Here's how to set realistic targets, use error budgets to prioritize, and avoid the 99.9% trap.