
Incident Severity Matrix (SEV0-SEV4): Free Template & Generator

Stop arguing over SEV1 vs SEV2. Use our SEV0-SEV4 matrix and decision tree to standardize your incident classification and reduce alert fatigue.

Runframe Team · Jan 17, 2026 · 11 min read

# Incident Severity Levels: The Framework That Actually Works

A team told us someone paged the entire org at 3 AM because a dashboard was loading 200ms slower than usual. Meanwhile, actual customer-impacting outages got ignored because "everything is a SEV1."

When you're scaling from 20 to 200 people, it's tough to get severity levels right the first time. Without clear definitions, every incident feels like a crisis and on-call burns out. Here's what we've seen work across dozens of teams at your stage.

Without clear severity levels, you can't prioritize response. Teams often confuse incident response (fixing fast) with incident management (preventing recurrence). [Read our incident management vs incident response guide to see why MTTR alone isn't enough](/blog/incident-management-vs-incident-response).

---

## TL;DR

- We recommend SEV0-SEV4 (clearer than SEV1-SEV5, but start with what works for you)
- SEV0 = catastrophic, SEV1 = core service down, SEV2 = degraded with workaround, SEV3 = minor, SEV4 = proactive
- Classify in 30 seconds using: "Is revenue/users impacted? Is there a workaround?"
- Consider adding SEV4 for proactive work (teams report it prevents 80% of incidents)
- Severity ≠ Priority (severity = impact, priority = fix order)

---

![Incident Severity Matrix](/images/articles/incident-severity-levels/incident-severity-levels-og.webp)

---

## SEV0-SEV4: The Framework

We recommend starting at zero, not one. SEV0 = zero room for error—it's more intuitive than SEV1 being your worst case. That said, if your team is under 50 people, you might start with just 3 levels (SEV1-SEV3) and add SEV0 and SEV4 as you scale.

Here's the full framework:

<table>
  <caption>Complete SEV0-SEV4 framework showing impact description, response target time, and who responds for each severity level</caption>
  <thead>
    <tr>
      <th>Severity</th>
      <th>Impact</th>
      <th>Response</th>
      <th>Who</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><strong>SEV0</strong></td>
      <td>Catastrophic. Data loss, security breach, total outage, or critical revenue-impacting failure</td>
      <td>Ack target: 15 min</td>
      <td>War room (IC + core responders; exec notification depends on your org)</td>
    </tr>
    <tr>
      <td><strong>SEV1</strong></td>
      <td>Critical. Core service down for everyone</td>
      <td>Ack target: 30 min</td>
      <td>On-call + backup</td>
    </tr>
    <tr>
      <td><strong>SEV2</strong></td>
      <td>Major. Significant degradation, workaround exists</td>
      <td>Ack target: 1 hour</td>
      <td>On-call</td>
    </tr>
    <tr>
      <td><strong>SEV3</strong></td>
      <td>Minor. Limited impact, business hours fix</td>
      <td>Business hours</td>
      <td>Don't page</td>
    </tr>
    <tr>
      <td><strong>SEV4</strong></td>
      <td>Pre-emptive. Could break, proactive fix</td>
      <td>Backlog</td>
      <td>Owner + due window</td>
    </tr>
  </tbody>
</table>

The difference between SEV1 and SEV2? One question: **Is there a workaround?**

Checkout completely broken = SEV1 (no workaround). Search down but category browsing works = SEV2 (workaround exists). Simple.

---

**What teams at your stage say:**

> *"Start with 3 levels. Don't over-engineer day one. You can always add SEV0 and SEV4 later."*
> — CTO, 40-person startup

> *"We added SEV4 when we hit 80 people. Prevented 38 out of 47 potential incidents in 6 months."*
> — Engineering Manager, Series B SaaS

---

## Why SEV4 Matters (And When to Add It)

Many teams start without SEV4—it can feel like overhead when you're just trying to survive incidents. "If nothing's broken, why track it?" Fair question. Here's when it becomes valuable:

**If you're under 50 people:** You probably don't need SEV4 yet. Focus on responding to actual incidents first.

**When you hit 75-100 people:** This is when SEV4 becomes valuable. You have enough operational maturity to track "could break" work systematically.
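A starting SEV4 pipeline can be as small as a scheduled check that files a ticket instead of paging. Here's a minimal sketch; the 80% threshold, the function name, and the ticket payload shape are illustrative assumptions, not part of the framework:

```python
# Sketch of a proactive (SEV4) threshold check. The threshold, the payload
# fields, and the due window are all illustrative: wire the output to your
# own tracker and tune the numbers to your infrastructure.
DISK_SEV4_THRESHOLD = 0.80  # file a SEV4 ticket once disk usage crosses 80%

def classify_disk_usage(used_bytes, total_bytes):
    """Return a SEV4 ticket payload when usage crosses the threshold, else None."""
    used_fraction = used_bytes / total_bytes
    if used_fraction >= DISK_SEV4_THRESHOLD:
        return {
            "severity": "SEV4",
            "title": f"Disk at {used_fraction:.0%}",
            "due": "within 30 days",  # SEV4 = owner + due window, no page
        }
    return None
```

Run a check like this from cron and push the payload into your ticket tracker. The point is the shape of the output: a ticket with an owner and a due window, never a page.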
**What happens without SEV4 at scale:**

→ Disk space hits 100% at 2 AM (could have been SEV4 at 80%)
→ SSL cert expires, users see security warnings (could have been SEV4 at 30 days)
→ Database query gets 10x slower overnight (could have been SEV4 when it hit 2x)

Without SEV4, you're always reacting. Never preventing.

---

## What Each Level Means

### SEV0: The Building Is On Fire

Complete outage. Data loss. Security breach. Critical revenue-impacting failure.

Database corrupted? Multi-region outage? Authentication completely broken? Payment processing down? That's SEV0.

Wake everyone. War room. You have 15 minutes.

**Real examples:**

- Database corruption with data loss (can't recover from backup)
- AWS us-east-1 down AND your backup region failed
- Security breach exposing customer data
- Authentication completely broken (nobody can log in)
- Payment processing down (revenue loss >$10K/hour)

### SEV1: Core Service Down

Major impact but not catastrophic. Core service unavailable for most/all customers, with no workaround.

API totally down. Checkout completely broken. Search gone (if search is a core workflow for your product). Auth intermittent for a meaningful subset of users.

Page on-call immediately. All hands on deck during business hours. 30-minute target.

**Real examples:**

- Total API outage (all endpoints returning 500)
- Checkout flow completely broken (can't process payments)
- Search functionality down (core feature for your product)
- Authentication intermittent (meaningful subset of users can't log in)
- Performance degradation (APIs materially degraded, not just slower)

### SEV2: Significant but Workaround Exists

Broken but usable. Meaningful subset of customers affected, or core functionality degraded but usable.

Checkout failing for some users? File uploads broken? API materially degraded but responding?

Primary on-call handles it. Don't wake backup. 1-hour target.
**Real examples:**

- Checkout failing for some users (payment gateway issue for some cards)
- File uploads completely broken (users can't upload, but can use existing files)
- API materially degraded but usable (users can still complete key workflows, possibly slower)
- Dashboard not loading (users can still use core product)
- Single region degradation (multi-region setup, one region struggling)

### SEV3: Minor

Partial failure. Limited impact. Not urgent.

Profile pictures broken. Intermittent errors that auto-recover. Reporting delayed.

Fix during business hours. Don't page on-call. Can wait until morning.

**Real examples:**

- Minor feature broken (user profile pictures not displaying)
- Intermittent errors that auto-recover (happens a few times/hour, clears itself)
- Reporting delay (analytics data not real-time, updates hourly)
- Non-critical integration failing (Slack notifications delayed, email works)
- UI polish issues (button misaligned, font wrong)

### SEV4: Pre-emptive

Nothing broken yet. But something could.

Disk at 80%. SSL expiring soon. Query slowing down. Dependency vulnerability. Monitoring gap.

Create a ticket with an owner + due window (e.g., "this sprint" / "within 30 days"). No page needed.

**Real examples:**

- Disk space at 80% (not critical yet, but will be in 2 weeks)
- SSL certificate expiring in 30 days
- Database query degrading (taking 2x longer, not failed yet)
- Dependency vulnerability (CVE in a library, not exploited)
- Monitoring gap discovered (no alerting for a critical service)

---

## Classify Fast. Don't Debate.

**Target: 30 seconds to classify.**

When you're in the middle of an incident, speed matters more than perfection. If you're debating SEV1 vs SEV2 for 5 minutes while customers wait, just pick one and move on.

Pro tip: Default higher when uncertain. It's easier to downgrade a SEV1 to SEV2 later than explain why you under-classified and delayed response.

**Is this catastrophic** (data loss, security breach, total outage)? → SEV0

**Is a core workflow blocked for most users?**

- No workaround → SEV1
- Workaround exists → SEV2

**Otherwise:** limited impact → SEV3; not broken yet → SEV4

**Tie-breaker:** pick higher, note why, downgrade later.

---

## Common Questions (What We've Learned from Teams at Your Stage)

### "It's 2 AM and I'm not sure if this is SEV1 or SEV2"

Default SEV1. Assess the situation. Page backup only if blocked or primary hasn't responded within your escalation window.

You can downgrade in the morning. You can't un-break customer trust.

### "Only 5% of users are affected, but they're our biggest customers"

Use your "materially impacted" definition. If those 5% represent 40% of revenue, it's material. SEV1.

### "The bug is cosmetic but our CEO is freaking out"

Still SEV3. Severity = customer impact, not internal panic. But maybe add "Executive visibility" as a separate flag. Some teams use:

- Severity: SEV3 (minor)
- Priority: P1 (fix today)
- Visibility: High (CEO watching)

This way you fix it fast without training on-call to page for non-issues.

### "We fixed it in 5 minutes, do we still call it SEV1?"

Yes. Severity is based on potential impact, not duration. If the database was completely down (even for 5 minutes), that's SEV1.

Duration doesn't change severity. It goes in MTTR metrics.

---

## What Makes Severity Levels Actually Work

The key is specificity.

**Vague (doesn't help at 3 AM):** "SEV1 is when something important is broken."

**Specific (makes decisions instant):** "SEV1 is when a core service is down for all customers, with no workaround."

---

## Frameworks That Actually Work (Choose Based on Your Size)

### Startup Starter (20-50 people)

Start simple with 3 levels. Add more as you scale.
<table>
  <caption>Starter severity framework for startups with 20-50 people</caption>
  <thead>
    <tr>
      <th>Severity</th>
      <th>Impact</th>
      <th>Response</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>SEV1</td>
      <td>Core service down</td>
      <td>Page everyone</td>
    </tr>
    <tr>
      <td>SEV2</td>
      <td>Degraded but usable</td>
      <td>Page on-call</td>
    </tr>
    <tr>
      <td>SEV3</td>
      <td>Minor, can wait</td>
      <td>Business hours</td>
    </tr>
  </tbody>
</table>

### Scaling Company (50-150 people)

Add SEV0 when catastrophic incidents become possible.

<table>
  <caption>Severity framework for scaling companies of 50-150 people with acknowledgment SLAs</caption>
  <thead>
    <tr>
      <th>Severity</th>
      <th>Impact</th>
      <th>Page Who?</th>
      <th>Ack SLA</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>SEV0</td>
      <td>Catastrophic</td>
      <td>War room</td>
      <td>15 min</td>
    </tr>
    <tr>
      <td>SEV1</td>
      <td>Core service down</td>
      <td>On-call + backup</td>
      <td>30 min</td>
    </tr>
    <tr>
      <td>SEV2</td>
      <td>Significant degradation</td>
      <td>On-call</td>
      <td>1 hour</td>
    </tr>
    <tr>
      <td>SEV3</td>
      <td>Minor issues</td>
      <td>Business hours</td>
      <td>1 day</td>
    </tr>
    <tr>
      <td>SEV4</td>
      <td>Proactive work</td>
      <td>Backlog</td>
      <td>None</td>
    </tr>
  </tbody>
</table>

### Enterprise-Bound (150+ people)

Full framework with war rooms and executive escalation.
<table>
  <caption>Enterprise severity framework for 150+ person organizations with SLAs and escalation</caption>
  <thead>
    <tr>
      <th>Severity</th>
      <th>Impact</th>
      <th>Page Who?</th>
      <th>Ack SLA</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>SEV0</td>
      <td>Catastrophic</td>
      <td>War room</td>
      <td>15 min</td>
    </tr>
    <tr>
      <td>SEV1</td>
      <td>Core service down</td>
      <td>On-call + backup</td>
      <td>30 min</td>
    </tr>
    <tr>
      <td>SEV2</td>
      <td>Significant degradation</td>
      <td>On-call</td>
      <td>1 hour</td>
    </tr>
    <tr>
      <td>SEV3</td>
      <td>Minor issues</td>
      <td>Business hours</td>
      <td>1 day</td>
    </tr>
    <tr>
      <td>SEV4</td>
      <td>Proactive work</td>
      <td>Backlog</td>
      <td>None</td>
    </tr>
  </tbody>
</table>

---

## How to Evolve Your Severity Levels as You Scale

### Starting with SEV1 vs SEV0

**If you're under 50 people:** Starting with SEV1-SEV3 is totally fine. Many teams do this.

**As you grow past 100 people:** Consider adding SEV0 for truly catastrophic incidents (data loss, security breaches). "Zero" = zero room for error, which makes the hierarchy more intuitive.

**Why it matters:** As your maximum possible blast radius grows, you need a tier above "critical outage" for existential threats.

### When to Add SEV4 (Proactive Work)

As covered above: teams under 50 people rarely need SEV4—focus on responding to actual incidents first. Around 75-100 people, you'll have enough operational maturity to track "could break" work systematically.

**What changes:** Instead of jumping from "everything's fine" to "everything's on fire," you can track warning signs (disk at 80%, SSL expiring soon, query degrading) and fix them before they page someone at 3 AM.

One team added SEV4 at 80 people and prevented 80% of potential incidents over 6 months.

### Don't Ignore Business Impact

**The problem:** Technical severity ≠ business severity. A "minor" pricing page typo can be catastrophic if it causes chargebacks.

**The fix:** Define severity in terms of customer impact and revenue, not technical complexity.

---

## Severity vs Priority

Teams confuse these constantly.

**Severity** = Business impact (doesn't change)
**Priority** = Fix order (changes based on context)

**Example:** Footer has a typo: "Contact sales@compnay.com"

- Severity: SEV3 (minor impact, users can still email sales@company.com directly)
- Priority: P3 (fix this week)

BUT: Legal says the wrong email violates our [contract SLA](/blog/sla-vs-slo-vs-sli).

- Severity: Still SEV3 (customer experience unchanged)
- Priority: Now P1 (fix today, legal risk)

Severity didn't change. Priority did.

**Another example:** Database completely down.

- Severity: SEV0 (catastrophic)
- Priority: P1 (obviously)

But your lead DBA is on vacation and backup doesn't know the system.

- Severity: Still SEV0 (impact unchanged)
- Priority: Still P1, but now you escalate to vendor support

Severity = "how bad is it?" Priority = "when/how do we fix it?" Don't conflate them.

> *"Severity is 'how bad is it?' Priority is 'when do we fix it?' Don't conflate them."*
> — Engineering Manager, Series B Healthcare SaaS

---

## Make It Work: Rollout Plan

### Week 1: Start Simple

**If you're 20-50 people:** Copy the 3-level version (SEV1-SEV3) and customize examples to your product.

**If you're 50-150 people:** Use the 4-level version (SEV0-SEV3 or SEV1-SEV4).

**If you're 150+ people:** Go with the full 5-level framework (SEV0-SEV4).

The key is customizing examples to YOUR business. B2B looks different than B2C. Enterprise SaaS looks different than consumer apps.

### Week 1: Get Buy-In

Share in Slack. Review in standup. **Most importantly:** Get agreement from the people who'll be woken up at 3 AM. If on-call hates it, they won't use it.

> *"The best severity framework is the one your team actually uses. If on-call hates it, they'll ignore it."*
> — SRE Manager, 180-person infrastructure company

### Weeks 2-5: Use It

Classify every incident. Track how it goes.

### Week 6: Iterate

After 30 days, ask:

- Classification debates? → Clarify definitions
- SEV3s waking people? → Make "don't page" explicit
- SEV4s actually getting fixed? → It's working

Expect to adjust 2-3 times in the first 6 months. That's normal.

---

## Quick Reference: During an Incident

**Q: "Is this SEV1 or SEV2?"**
A: Can customers work around it? Yes = SEV2. No = SEV1.

**Q: "Only 10% of users affected. Still SEV1?"**
A: Is that 10% material to your business? (Check your definition.)

**Q: "We fixed it fast. Was it really SEV1?"**
A: Severity = potential impact, not duration. Yes, still SEV1.

**Q: "CEO is panicking but customer impact is minor"**
A: Severity = customer impact. This is SEV3. (But maybe Priority P1.)

**Q: "Not sure. What do I do?"**
A: Default higher. Downgrade later if needed.

---

## FAQ

**Q: SEV0-SEV4 or SEV1-SEV5?**
A: SEV0-SEV4. "Zero" means no room for error. Mature teams use this.

**Q: Can't tell if SEV1 or SEV2?**
A: Default higher (SEV1). Easier to downgrade than explain under-classification.

**Q: How many levels?**
A: Start with 3-4. Most end up at 5 (SEV0-SEV4).

**Q: Does severity change during an incident?**
A: No. Base it on initial impact. If things change dramatically, document it in the postmortem.

**Q: Who decides?**
A: Incident commander or first responder. Disagreement? Default higher, resolve in postmortem.

---

## Generate Your Framework in 2 Minutes

If you want a copy/paste template, there's a severity matrix generator here: [severity matrix generator](/tools/incident-severity-matrix-generator)

Or copy the table from this article and adapt it. Either way, have something defined before your next incident.

---

## Next Reads

- [SLA vs. SLO vs. SLI: What Actually Matters (With Templates)](/blog/sla-vs-slo-vs-sli)
- [Incident Response Playbook: Scripts, Roles & Templates](/blog/incident-response-playbook)
- [How to Reduce MTTR in 2026: The Coordination Framework](/blog/how-to-reduce-mttr)

---
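Bonus: if you want the 30-second decision tree in executable form, say behind a runbook script or a Slack slash command, here's a minimal sketch. The function and parameter names are hypothetical; rename the prompts to match your own definitions:

```python
# Sketch of the 30-second triage tree. The boolean inputs are hypothetical
# prompts a responder would answer (e.g., in a Slack modal); adapt the
# wording to your org's own severity definitions.
def classify(catastrophic, core_workflow_blocked, workaround_exists, limited_impact):
    """Severity triage: when in doubt, answer True and downgrade later."""
    if catastrophic:               # data loss, security breach, total outage
        return "SEV0"
    if core_workflow_blocked:      # most users can't complete a core workflow
        return "SEV2" if workaround_exists else "SEV1"
    if limited_impact:             # broken, but narrow blast radius
        return "SEV3"
    return "SEV4"                  # nothing broken yet: proactive fix
```

The tie-breaker rule still applies to the humans answering the prompts: when unsure, err toward the higher-impact answer and downgrade in the morning.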


Automate Your Incident Response

Runframe replaces manual copy-pasting with a dedicated Slack workflow. Page the right people, spin up incident channels, and force structured updates—all without leaving Slack.