
Scaling Incident Management: A Guide for Teams of 40-180 Engineers

Is your incident process breaking as you grow? Learn the 4 stages of incident management for teams of 40-180. Scale your SRE practices without the chaos.

Runframe Team · Dec 15, 2025 · 12 min read

Before building anything, we wanted to understand how teams actually handle incidents in production. Not the polished version from case studies or the theoretical best practices from [SRE books](https://sre.google/books/), but the messy, 3 AM reality of what happens when the database goes down.

Over the past few months, we conducted 22 calls and collected 5 async writeups from engineering teams ranging from 12-person startups to 180-person scale-ups (skewing toward teams already using Slack heavily). Some were using established incident management platforms, some were using newer tools, and a surprising number were still running incidents through ad-hoc Slack channels and Python scripts.

Looking for the practical guide? Read [The Silent Killer of Engineering Productivity in Incident Management](/blog/engineering-productivity-incident-management).

We asked every team the same questions: What works in your incident response? What breaks? What do you wish existed? The conversations challenged a lot of our assumptions. We expected to hear about cost barriers and alert fatigue. Instead, the problems that kept teams up at night were setup complexity and coordination breakdowns.

---

## What Is Scaling Incident Management?

Scaling incident management is the process of evolving your incident response practices as your engineering team grows. What works for a 10-person startup (informal Slack coordination) breaks down at 50 people, which needs formal on-call rotations, dedicated tools, and clear escalation paths. Most teams go through four predictable stages:

1. **Single Slack channel** (5-15 people)
2. **Python scripts** (15-40 people)
3. **"Should buy a tool" limbo** (40-100 people) ← where most teams get stuck
4. **Formal tool adoption** (100+ people)

The challenge isn't technical; it's organizational. As teams grow, informal coordination ("whoever's around handles it") stops working.
You need clear ownership, documented processes, and tools that reduce coordination overhead rather than add complexity. This research examines how 25+ engineering teams navigated these transitions, what blocked them, and what actually worked.

---

### Key Findings

- **✓ Most teams get stuck at Stage 3** (40-100 people). They've outgrown Python scripts but can't commit to enterprise tools
- **✓ Setup complexity blocks adoption, not cost.** Almost no teams mentioned price as the primary barrier
- **✓ Coordination matters more than speed.** The technical fix is usually straightforward; getting everyone aligned is the hard part
- **✓ 40-50 people is the inflection point.** That's when informal "whoever's around" on-call stops working and formal rotations become necessary

## The 4 Stages of Incident Management Maturity

<table>
  <caption>The four stages of incident management maturity from startup to enterprise with team sizes, setup time, what works, and what breaks at each stage</caption>
  <thead>
    <tr>
      <th>Stage</th>
      <th>Team Size</th>
      <th>Setup</th>
      <th>What Works</th>
      <th>What Breaks</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><strong>1. Single Slack Channel</strong></td>
      <td>5-15 people</td>
      <td>5 min</td>
      <td>Informal coordination, founder-led</td>
      <td>Multiple concurrent incidents</td>
    </tr>
    <tr>
      <td><strong>2. Python Scripts</strong></td>
      <td>15-40 people</td>
      <td>1 day</td>
      <td>Auto-channel creation, some automation</td>
      <td>Script maintenance, API changes, no docs</td>
    </tr>
    <tr>
      <td><strong>3. "Should Buy Tool" Limbo</strong></td>
      <td>40-100 people</td>
      <td>Months of indecision</td>
      <td>Nothing; stuck evaluating</td>
      <td>Setup complexity, decision fatigue</td>
    </tr>
    <tr>
      <td><strong>4. Formal Tool</strong></td>
      <td>100+ people</td>
      <td>1-2 weeks</td>
      <td>Structured process, clear ownership</td>
      <td>Feature overload, workflow mismatch</td>
    </tr>
  </tbody>
</table>

Most teams get stuck at Stage 3 for 6-12 months before a crisis forces Stage 4 adoption.
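For concreteness, the Stage 2 "Python scripts" row above usually amounts to something like the sketch below: a minimal auto-channel helper written against a Slack-style client. This is our illustration, not any team's actual script; the naming scheme and kickoff message are assumptions, and the `client` is expected to look like the `slack_sdk` WebClient (whose `conversations_create` and `chat_postMessage` calls are real).

```python
# Hedged sketch of a typical "Stage 2" incident script.
# The channel naming convention and messages are illustrative assumptions.
import re
from datetime import datetime, timezone
from typing import Optional


def incident_channel_name(title: str, when: Optional[datetime] = None) -> str:
    """Build a Slack-safe channel name like inc-2025-12-15-checkout-is-down."""
    when = when or datetime.now(timezone.utc)
    # Slack channel names allow lowercase letters, digits, and dashes.
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")[:40]
    return f"inc-{when:%Y-%m-%d}-{slug}"


def open_incident(client, title: str) -> str:
    """Create a dedicated channel and post a kickoff message.

    `client` is assumed to be a slack_sdk WebClient (or anything with the
    same conversations_create / chat_postMessage methods). Returns the
    new channel's id.
    """
    name = incident_channel_name(title)
    resp = client.conversations_create(name=name)
    channel_id = resp["channel"]["id"]
    client.chat_postMessage(
        channel=channel_id,
        text=f":rotating_light: Incident opened: {title}. Post updates here.",
    )
    return channel_id
```

The catch, as the stories below show, is everything this sketch leaves out: deduplicating triggers, handling API errors, and documenting it for whoever inherits it.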
## What You'll Learn

- [The 4 stages every team goes through](#four-stages) (and why most get stuck at Stage 3)
- [Why teams avoid adopting tools](#real-reason) (hint: it's not cost)
- [The "just works" gap](#just-works) that tools are missing
- [What actually matters: coordination vs speed](#coordination)
- [The on-call rotation inflection point](#on-call)
- [The pattern for success](#pattern) based on what worked for teams

<h2 id="four-stages">The 4 Stages of Scaling Incident Management (And Why Teams Get Stuck at Stage 3)</h2>

This pattern showed up in roughly 20 of the 25+ conversations; the wording differed, but the structure was consistent.

**Stage 1: The Single Slack Channel (5-15 people)**

Everything goes into #incidents. One of the founders or senior engineers declares "we have an incident," people jump in, someone figures it out, and everyone moves on.

One CTO at a 10-person startup told us: "We have maybe two real incidents a month. Why would I pay $200/month for a tool when a Slack channel works fine?"

Fair point. At this stage, the Slack channel IS the incident management system.

**Stage 2: The Python Script Phase (15-40 people)**

Once you hit two concurrent incidents, the single channel breaks down. Conversations overlap. People lose track of who's working on what. The history becomes impossible to parse. So someone (usually a senior engineer who's annoyed by the chaos) spends an afternoon writing a script that:

- Creates a dedicated Slack channel per incident
- Posts to a Notion page or Linear issue
- Maybe tags the right people based on keywords

This works great. For a few months. Then something changes: the engineer who wrote it leaves, gets promoted, or just stops maintaining it. Sometimes Slack's API changes. And weird things start happening.

We heard from an engineering manager at a Series B company: "Our script created 11 channels for the same incident last month.
Turns out it was triggering on every alert notification, not just the initial one. Nobody caught it because the person who wrote it had left six months ago, and honestly, we were all scared to touch the code."

We asked to see the script. It was 380 lines of Python with zero comments and variable names like `ch_id` and `usr_grp_2`.

**Stage 3: The "We Should Probably Buy a Tool" Discussion (40-100 people)**

This is where we found most teams stuck. It's also the point where teams need formal on-call rotations; see [our on-call rotation guide with weekly primary+backup schedules and 5-minute escalation rules](/blog/on-call-rotation-guide).

They've outgrown the janky script. Incidents are happening more frequently, maybe 8-12 per month now. The script breaks in new and creative ways. Everyone agrees they need something more robust. So they start evaluating tools. And then... nothing happens for months.

At first, we thought this was about price. The tools are expensive (roughly $15-20 per user per month from what teams shared, so ballpark $750-1,000/month for a 50-person team). But when we dug deeper, price wasn't the main blocker.

A VP of Engineering explained: "We got budget approved for an incident management platform. Then our platform lead spent two weeks trying to set it up. He got frustrated with the escalation policies config and basically gave up. We're still using the script."

Another team had bought a tool, used it for one incident, and then just stopped. When we asked why, the EM said: "I think people found it easier to just create the Slack channel manually. We still pay for it, we just don't use it."

**Stage 4: Finally Adopting a Tool (Usually Post-Crisis)**

The teams that successfully adopted a tool almost always had the same trigger: a bad incident that exposed the gaps in their janky setup.

"We had a P0 on Black Friday," a CTO shared.
"Our Python script was down (ironically) and we ended up with three different incident channels that people created manually, each with different subsets of the team. It was chaos. The next Monday I told our platform team: find a tool, get it set up, I don't care what it costs." They adopted a modern incident management platform and were live within a week. What struck us: this company had been "planning to adopt a tool" for over a year. The incident finally forced the decision. <h2 id="real-reason">Why Teams Avoid Incident Management Tools (It's Not Cost)</h2> We went into these conversations assuming the barrier was price. SaaS incident management tools are expensive, and startups are budget-conscious. Cost came up, but it wasn't the first thing teams complained about. Setup complexity and decision fatigue dominated the conversations. The real barrier? **Decision fatigue and setup overhead.** ### The Enterprise Platform Problem Eight teams had tried to set up enterprise incident management platforms and abandoned the process mid-way. The pattern was similar across most teams: An engineer starts the setup, gets to the escalation policies configuration, realizes they need to make a dozen decisions they don't have answers for, and just stops. Though we also heard about integration complexity and change management resistance as blockers. "I opened the setup guide and it was 40+ pages," an engineer mentioned. "Questions like: How many severity levels do we need? What's our escalation policy? Who's the primary, secondary, and tertiary on-call for each service? We're 30 people. I don't know the answer to these questions. So I closed the tab and went back to our script." Enterprise platforms are comprehensive. But comprehensive means complex. And complex means decisions. For teams that already have a mature incident response process, these tools are powerful. They give you the flexibility to model complex **incident response workflows** with clear roles. 
But for teams still figuring out their process? All that flexibility is overwhelming. You're still defining your [incident commander](/learn/incident-commander) role, building your first [runbook](/learn/runbook), and establishing a [blameless culture](/learn/blameless-postmortem) around postmortems.

### The Feature Overload Problem

Some newer tools have improved the setup experience compared to legacy platforms. But teams mentioned a different issue: feature overload.

"The tool we tried is great," one EM said. "But we maybe use 20% of the features. AI postmortems, [status page](/learn/status-page) updates, call integrations... nice to have, but not what we actually needed. We just wanted a way to create incident channels and track what happened."

Another team had a more specific complaint: "The voice call feature is cool, but we're async-first. Nobody wants to jump on a call at 11 PM when an incident happens. We just want a Slack channel and good thread organization."

The insight: tools often impose a specific incident response philosophy (synchronous, structured, process-heavy) that doesn't match how all teams actually work.

<h2 id="just-works">The "Just Works" Gap in Incident Management Tools</h2>

The pattern became clear after the fifth conversation:

> "I just want incident management to work out of the box. I don't want to become an expert in incident response theory just to configure a tool. I want reasonable defaults that make sense for a team our size."

This quote is from a tech lead at a 45-person startup, but we heard variations of it repeatedly. What does "just works" actually mean? We asked teams to be specific.

**Reasonable defaults:**

- "If someone is primary on-call, try them first. Wait 5 minutes, then escalate to their backup. Don't make me design an escalation policy from scratch."
- "Give me 3 severity levels: P0 (customer-facing), P1 (degraded), P2 (non-urgent). Don't make me define my own severity matrix."
- "Auto-create an incident channel with a sensible name. Post updates there. That's 90% of what we need." **Low maintenance:** - "When someone joins or leaves the team, it should just update automatically from Slack/email. I don't want to maintain a separate user list." - "If our integrations break, tell me clearly what broke and how to fix it. Don't make me dig through error logs." **"Don't force me to configure everything day one"** "Let me start simple: one on-call rotation, basic alerts, Slack channels. Then when we grow, let me add more complexity. Don't force me to set up stakeholder notifications and status pages on day one; I'll add those when I need them." One technical founder summed it up: "I want the Heroku of incident management. Just make it work. I'll customize it later if I need to." ## The Alert Fatigue Myth We expected to hear a lot about alert fatigue: too many alerts, teams ignoring notifications, etc. And we did hear about it. But not in the way we expected. The conventional wisdom is: "Companies have too many alerts. They need better monitoring and smarter alerting rules." But what we heard was more nuanced. **Problem wasn't volume. It was relevance.** "We get maybe 15 alerts per day," an SRE explained. "That's not overwhelming. The problem is that 12 of them don't actually need a response. So we've learned to ignore alerts. Which means when a real incident happens, it takes us longer to notice because we're conditioned to ignore the notifications." Another team had the opposite problem: too few alerts. "We're worried we're under-alerting," an engineering lead said. "We've tuned our alerts to be very conservative because we don't want to wake people up for nothing. But I think we're missing real issues because we're not alerting enough." What both teams wanted: better signal-to-noise ratio. One team had found a creative solution: "We have two alert channels. #alerts-info for things that are off but not urgent. 
And #alerts-action for things that need immediate response. The key is that #alerts-action is almost always quiet. When something hits that channel, everyone knows it's real."

Simple, but apparently it took them three months of experimentation to figure out.

<h2 id="coordination">Incident Coordination vs Speed: What Actually Matters</h2>

The most counterintuitive finding? We expected teams to focus on [MTTR (Mean Time to Resolution)](https://www.atlassian.com/incident-management/kpis/common-metrics): how quickly they fix incidents. But when we asked "What matters most in your incident response?", few teams mentioned MTTR. The most common answer: **coordination and communication.**

To be clear: leaders still track MTTR as a KPI. But the engineers and on-call responders we spoke with consistently cited coordination breakdowns as the dominant pain point in their **incident response workflow**.

"The technical fix is usually straightforward," a CTO noted. "The hard part is making sure everyone knows what's happening, who's working on what, and what's already been tried."

A technical lead at a 60-person company told us: "Our worst incidents aren't the ones that take longest to fix. They're the ones where communication breaks down. Three people debugging the same thing. The customer support team not knowing we're working on it. Management asking for updates every 10 minutes because they haven't heard anything."

This kept coming up: **the incident itself is usually solvable. The coordination problem is harder.**

After every incident, teams need effective postmortems. See our [post-incident review templates with 3 ready-to-use formats (15-minute, standard, and comprehensive)](/blog/post-incident-review-template).

This also explains why [Slack](https://slack.com/)-based incident management is popular. It's not that Slack is the best tool for incident management. It's that Slack is where coordination already happens.
One engineer put it perfectly: "During an incident, I need to coordinate with 5-10 people. If your tool requires me to leave Slack to manage the incident, you're adding overhead at the worst possible time. I'll just coordinate in Slack and skip your tool."

Interestingly, three teams mentioned they had *more* incidents after adopting a formal tool. When we dug in, it turned out they weren't creating problems; they were finally tracking incidents they'd previously ignored. The tool didn't increase incidents; it made existing problems visible. As one team put it: "We realized we were having 15-20 incidents a month, not the 5-6 we thought. We just weren't counting the ones we fixed quickly."

<h2 id="on-call">The On-Call Rotation Problem: When Teams Hit 40-50 People</h2>

We asked teams about their on-call setup, and this was eye-opening. **Most teams didn't have formal on-call rotations.** Their approach? "Whoever's around handles it."

At first this seemed dysfunctional. But when we dug into it, we found it was often intentional.

"We tried doing formal on-call," an EM shared. "It created more problems than it solved. People would wait for the on-call person instead of just fixing things. And our incidents are unpredictable. Sometimes they need the database person, sometimes the frontend person. A generic on-call rotation didn't make sense."

Their solution: "We have a #incidents channel. When something breaks, someone posts. Usually 2-3 people who are around and know that system jump in. It's informal but it works."

For teams under 40 people, this informal approach was common. But teams over 50 people almost always had formal rotations. "You can't rely on 'whoever's around' when you're 80 people across 5 timezones," a VP of Engineering explained.

The inflection point seemed to be around 40-50 people. That's when informal coordination stops scaling.
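The "reasonable defaults" responders asked for earlier (try the primary first, wait five minutes, escalate to the backup, three severity levels) are small enough to express directly. Here's a hedged sketch of those defaults as data; the names and structure are ours, not any vendor's actual configuration model:

```python
# Sketch of the defaults teams described, not a real tool's schema.
from dataclasses import dataclass

# "Give me 3 severity levels": P0 customer-facing, P1 degraded, P2 non-urgent.
SEVERITIES = {"P0": "customer-facing", "P1": "degraded", "P2": "non-urgent"}


@dataclass
class Rotation:
    primary: str
    backup: str
    escalate_after_min: int = 5  # "wait 5 minutes, then escalate to their backup"

    def page_order(self):
        """Return (who-to-page, minutes-to-wait-before-paging) in order."""
        return [(self.primary, 0), (self.backup, self.escalate_after_min)]
```

The point of defaults like these is that a new team only decides two things (who's primary, who's backup) and can override the rest later as their process matures, which is exactly the progressive complexity the quotes above ask for.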
<h2 id="pattern">Incident Management Best Practices: What Works at Each Stage</h2>

Based on these conversations, here's what we'd suggest.

**If you're at the "single Slack channel" stage:**

Don't rush to adopt a tool. If incidents are rare (< 5/month) and the team is small (< 20 people), a Slack channel is probably fine. For teams under 20 people, see [our guide to reducing context switching during incidents with a 10-minute coordination framework](/blog/engineering-productivity-incident-management). But do document your incident response process. Even just a simple doc: "Here's how we handle incidents. Here's who owns what system."

**If you're maintaining a janky Python script:**

You're probably at the point where a proper tool makes sense. But don't just start evaluating tools randomly. The successful teams we talked to audited their process first:

- How many incidents/month are we handling?
- What breaks in our current process?
- Do we need formal on-call, or is informal okay?
- What actually matters: speed, coordination, documentation?

Then they evaluated tools based on those answers.

**If you're evaluating tools:**

**Migrating from OpsGenie?** With OpsGenie shutting down in April 2027, [read our complete migration guide](/blog/opsgenie-migration-guide) with real timelines, pricing comparisons, and step-by-step plans from teams who've already migrated.

For general tool evaluation: don't just do free trials. Actually run a real incident through each tool. Pay attention to:

- **Setup time** - If you get frustrated during setup, your team will too
- **Workflow match** - Does it fit how you actually work (async vs sync, lightweight vs process-heavy)?
- **Appropriate complexity** - Is it sized right for your team, or built for a different scale?

The right tool is the one that matches YOUR workflow, not what's popular or feature-rich.

**If you already have a tool but nobody uses it:**

This was more common than we expected.
Teams paying for tools they've abandoned. Figure out why. Usually it's one of:

- Setup was too complex (nobody finished configuring it)
- It didn't match the team's workflow (tool is synchronous, team is async)
- It added overhead instead of reducing it

Sometimes the answer is "switch tools." Sometimes it's "finish the setup you abandoned." Sometimes it's "go back to Slack and cancel the subscription." For the tactical playbook, read our [incident coordination guide](/blog/engineering-productivity-incident-management).

## Looking for Incident Management Software?

We're building Runframe based on these insights: reasonable defaults that work out of the box, low maintenance overhead, a home in Slack where teams already coordinate, and progressive complexity as you grow. Built for teams stuck between Python scripts and enterprise platforms (20-100 people).

We're in private beta. If you're dealing with these challenges, we'd love to hear about your setup. [Join the waitlist](/contact) or email us at hello@runframe.io.

---

**Want the next step?** Read [our incident coordination guide to reduce context switching](/blog/engineering-productivity-incident-management), [post-incident review templates that work](/blog/post-incident-review-template), or [our on-call rotation guide](/blog/on-call-rotation-guide).

---

## Scaling Incident Management FAQ

**At what team size should I adopt an incident management tool?**

Most teams successfully adopt tools between 40-100 people. Below 40, a Python script or Slack channel often works fine. Above 100, you need structured incident management with formal on-call rotations.

**Why do teams get stuck at Stage 3?**

Setup complexity and decision fatigue. Enterprise tools require dozens of upfront decisions (severity levels, escalation policies, on-call schedules) that teams don't have answers for yet. This blocks adoption for months.

**What's the biggest incident management mistake teams make?**

Choosing tools based on features rather than workflow fit.
A feature-rich tool that doesn't match how your team actually works (async vs sync, lightweight vs process-heavy) won't get adopted.

**When should I move from a Python script to a real tool?**

When the script breaks frequently, the person who wrote it has left, or you're handling 8+ incidents per month. If setup complexity is blocking you, look for tools with "just works" defaults rather than enterprise platforms.

**What's more important: MTTR or coordination?**

Coordination. Engineers consistently cited coordination breakdowns as their biggest pain point, not incident duration. The technical fix is usually straightforward; getting everyone aligned is harder.

**What's the on-call rotation inflection point?**

Around 40-50 people. Below that, informal "whoever's around" coordination often works. Above 50, you need formal rotations with clear primary/backup ownership.

---

*Thanks to the 25+ engineering teams who shared their incident war stories with us. Several of you will probably recognize your (anonymized) quotes in this piece. If we got anything wrong, let us know. We're still learning.*


Related Articles

Feb 18, 2026

Build vs Buy Incident Management: 2026 Cost & Decision Framework

A defensible 2026 build vs buy framework for incident management: real TCO ranges, reliability gotchas, hybrid options, and a decision checklist.

Read more
Feb 1, 2026

Incident Communication: 8 Copy-Paste Templates for Status, Email & Execs

Stop writing updates at 2 AM. Copy-paste templates for status pages, emails, exec updates, and social posts. Plus cadence and ownership rules for SREs.

Read more
Jan 26, 2026

SLA vs. SLO vs. SLI: What Actually Matters (With Templates)

SLI = what you measure. SLO = your target. SLA = your promise. Here's how to set realistic targets, use error budgets to prioritize, and avoid the 99.9% trap.

Read more
Jan 24, 2026

Runbook vs Playbook: The Difference That Confuses Everyone

Runbooks document technical execution. Playbooks document roles, escalation, and comms. Here's when to use each, with copy-paste templates.

Read more
Jan 23, 2026

OpsGenie Shutdown 2027: The Complete Migration Guide

OpsGenie ends support April 2027. Real migration timelines, export guides, and pricing for 7 alternatives (PagerDuty, incident.io, Squadcast).

Read more
Jan 19, 2026

How to Reduce MTTR in 2026: The Coordination Framework

MTTR isn't just about debugging faster. Learn why coordination is the biggest lever for reducing incident duration for startups scaling from seed to Series C.

Read more
Jan 17, 2026

Incident Severity Matrix (SEV0-SEV4): Free Template & Generator

Stop arguing over SEV1 vs SEV2. Use our SEV0-SEV4 matrix and decision tree to standardize your incident classification and reduce alert fatigue.

Read more
Jan 15, 2026

Incident Management vs Incident Response: The Difference That Matters for MTTR & Recurrence

Don't confuse response with management. Learn why fast MTTR isn't enough to stop recurring fires and how to build a long-term incident lifecycle.

Read more
Jan 10, 2026

2026 State of Incident Management Report: Key Statistics & Benchmarks

Operational toil rose to 30% in 2025 despite AI. Get the latest data on burnout, alert fatigue, and why engineering teams are struggling to keep up.

Read more
Jan 7, 2026

Slack Incident Response Playbook: Roles, Scripts & Templates (Copy-Paste)

Stop the 3 AM chaos. Copy our battle-tested Slack incident playbook: includes scripts, roles, escalation rules, and templates for production outages.

Read more
Jan 2, 2026

On-Call Rotation Templates & The 2-Minute Handoff Guide

Move your on-call from a Google Sheet to a repeatable system. Learn our 2-minute handoff framework and get templates for primary and backup rotations.

Read more
Dec 29, 2025

Post-Incident Review Templates: 3 Real-World Examples (Make Copy)

Skip the 5-page docs nobody reads. Use our 3 ready-to-use postmortem templates and examples to drive real learning and stop recurring incidents.

Read more
Dec 22, 2025

Reducing Context Switching: The 10-Minute Incident Coordination Framework for Slack

Outages are expensive; coordination is harder. Use our 10-minute framework to cut context switching and speed up MTTR during Slack-based incidents.

Read more

Automate Your Incident Response

Runframe replaces manual copy-pasting with a dedicated Slack workflow. Page the right people, spin up incident channels, and force structured updates—all without leaving Slack.