## TL;DR

We expected AI to reduce toil. Every report, every vendor, every conference deck said the same thing. But when we looked at the data from 20+ industry reports and spoke to 25+ engineering teams, we found something different. **Toil rose to 30% (from 25%) — for the first time in five years.**

Here's what's actually happening in incident management right now:

1. **AI isn't delivering (yet):** Many organizations are investing $1M+ in AI initiatives (51% deployed, 86% expect to by 2027), yet operational toil rose from 25% to 30%. The first rise in five years.
2. **People are burning out:** 78% of developers spend ≥30% of their time on manual toil. 73% of organizations experienced outages linked to ignored alerts ([Splunk](https://www.splunk.com/en_us/blog/observability/state-of-observability-2025.html), n=1,855). This isn't sustainable.
3. **The market is consolidating fast:** OpsGenie is scheduled to shut down in 2027. Freshworks acquired FireHydrant. SolarWinds acquired Squadcast. Organizations are moving from "best-of-breed" stacks to unified platforms because they can't manage 7+ tools anymore.

Meanwhile, observability keeps evolving from troubleshooting tool to strategic asset ([65% say it positively impacts revenue](https://www.splunk.com/en_us/blog/observability/state-of-observability-2025.html)), which means incident management has to keep pace.

**The uncomfortable truth:** While executives expect 171% ROI from AI investments, the reality is more complexity, not less. Developer toil can cost ~$9.4M/year per 250 engineers (simplified model). The "AI revolution" has paradoxically increased the blast radius of bad deployments for 92% of teams.

And it's getting more expensive to get it wrong. High-impact IT outages now cost ~$2M/hour ([New Relic Observability Forecast 2025](https://newrelic.com/resources/report/observability-forecast/2025), n=1,700).
Organizations lose a median of ~$76M annually from unplanned downtime ([New Relic Observability Forecast 2025](https://newrelic.com/sites/default/files/2025-09/new-relic-2025-observability-forecast-report.pdf)).

This report synthesizes 20+ industry reports and surveys published in 2025, focusing on the latest trends in incident response automation and operational efficiency.

**Scope:** This report focuses on SRE/engineering incident response and operational toil, not security operations (SOC).

## The 2025 Incident Index

<table>
  <caption>Key 2025 incident management statistics and findings from industry reports</caption>
  <thead>
    <tr>
      <th>Finding</th>
      <th>Statistic</th>
      <th>Source</th>
    </tr>
  </thead>
  <tbody>
    <tr><td>AI agents deployed</td><td>51%</td><td>PagerDuty, 2025</td></tr>
    <tr><td>Expect AI agents by 2027</td><td>86%</td><td>PagerDuty, 2025</td></tr>
    <tr><td>Expected ROI from AI</td><td>171% avg</td><td>PagerDuty, 2025</td></tr>
    <tr><td>AI increases blast radius</td><td>92%</td><td>Harness, 2025</td></tr>
    <tr><td>Toil percentage (up from 25%)</td><td>30%</td><td>Catchpoint, 2025</td></tr>
    <tr><td>Devs spend ≥30% on toil</td><td>78%</td><td>Harness, 2025</td></tr>
    <tr><td>Outages from ignored alerts</td><td>73%</td><td>Splunk, 2025</td></tr>
    <tr><td>Developers work >40 hours/week</td><td>88%</td><td>Harness, 2025</td></tr>
    <tr><td>Observability impacts revenue</td><td>65%</td><td>Splunk, 2025</td></tr>
    <tr><td>High performers ROI advantage</td><td>+53%</td><td>Splunk, 2025</td></tr>
    <tr><td>High-impact outage cost per hour</td><td>$2M</td><td>New Relic, 2025</td></tr>
    <tr><td>Annual outage cost (median)</td><td>~$76M</td><td><a href="https://newrelic.com/sites/default/files/2025-09/new-relic-2025-observability-forecast-report.pdf">New Relic Observability Forecast 2025</a></td></tr>
    <tr><td>CrowdStrike global impact</td><td>~8.5M devices; >$5B economic impact</td><td><a href="https://www.parametrixinsurance.com/reports-white-papers/crowdstrikes-impact-on-the-fortune-500">Parametrix, Reuters</a>, 2024</td></tr>
  </tbody>
</table>

---

## About This Research

**Methodology:**

- 20+ industry reports analyzed
- 25+ engineering team interviews conducted July–December 2025 (Series A to enterprise, 30–60 minute structured interviews)
- Major incident analysis (CrowdStrike, AWS, OpenAI)
- Published: January 2026

**Why we wrote this:** We're building Runframe after talking to 25+ engineering teams about their incident management pain. The conversations kept surfacing the same themes: AI isn't delivering, alert fatigue is crushing teams, tooling is too complex. This report synthesizes what we heard from across the industry. *Disclosure: we're building Runframe; we've aimed to keep the analysis vendor-neutral.*

**Who should read this:**

- Engineering leaders evaluating incident management tools
- SREs dealing with alert fatigue and burnout
- CTOs planning 2026 tooling strategy
- Anyone migrating away from OpsGenie

---

## 1. The AI Trust Gap: Why Toil Rose to 30% (From 25%)

### What executives are betting on

- 51% of companies have already deployed AI agents ([PagerDuty Agentic AI Survey 2025](https://www.pagerduty.com/newsroom/agentic-ai-survey-2025/), n=1,000)
- 86% expect to be operational with AI agents by 2027
- 75% of organizations are investing $1M+ in AI
- 62% expect more than 100% ROI, with an average expected return of 171%
- 100% of organizations are now using AI in some capacity, and AI capabilities are now the #1 criterion for selecting observability tools ([Dynatrace](https://www.dynatrace.com/news/blog/ai-observability-business-impact-2025/), n=842)

The hype is real. Executives are all-in.
<img src="/images/articles/state-of-incident-management-2025/ai_expectation_reality_gap.png" alt="State of Incident Management 2025: AI Operational Toil Expectation vs Reality Gap Graph" class="w-full max-w-xl mx-auto rounded-lg shadow-md my-8 border border-border/50" />

<img src="/images/articles/state-of-incident-management-2025/operational_toil_trend.png" alt="State of Incident Management 2025: Global Operational Toil Trend 2021-2025 Statistics" class="w-full max-w-xl mx-auto rounded-lg shadow-md my-8 border border-border/50" />

### What's actually happening

- **Operational toil rose to 30% from 25%** — the first rise in five years ([Catchpoint SRE Report 2025](https://www.catchpoint.com/press-releases/the-sre-report-2025-highlighting-critical-trends-in-site-reliability-engineering), n=301)
- Enterprise incidents increased **16% YoY** ([PagerDuty State of Digital Operations 2024](https://www.pagerduty.com/blog/news-announcements/2024-state-of-digital-operations/))
- **92%** of developers say AI tools increase the "blast radius" of bad deployments ([Harness State of Software Delivery 2025](https://www.harness.io/state-of-software-delivery), n=500)

The first wave of AI deployments has added new layers of complexity: new tools to monitor, new alerts to triage, new skills to learn, and more code to review.

---

> *"What was most eye opening from our report findings this year was that, for most teams, it seems the burden of operational tasks has grown for the first time in five years. The expectation was that AI would reduce toil, not exacerbate it."*
>
> *— Catchpoint SRE Report 2025*

---

### The Implementation Gap (Not a Tech Failure)

- **69%** of AI-powered decisions are still verified by humans ([Dynatrace](https://www.dynatrace.com/news/blog/ai-observability-business-impact-2025/))
- **25%** of leaders believe improving trust in AI should be a top priority

**Crucially, the technology isn't failing — our implementation strategy is.** We are living through the "awkward adolescence" of AI. This is likely the **worst version** of these models we will ever use: powerful, but prone to hallucinations, and requiring human verification for almost every action.

The rise in toil to 30% (from 25%) isn't because AI is bad. It's because we've added a "verification tax" on top of existing workloads without removing anything yet. We are in the messy middle: not fully autonomous, but no longer purely manual.

### The Rise of Agentic AI in SRE

Multi-agent systems are now being deployed for complex incident resolution. AWS and others are shipping "agent" concepts aimed at reducing time-to-triage and time-to-mitigate (early-stage; outcomes vary). Platforms like Rootly, Harness, and PagerDuty are shipping AI-powered runbook execution and autonomous triage capabilities.

The future of AI in incident management is human-in-the-loop, not fully autonomous. AI suggests, humans approve.

---

**Takeaway:** Organizations invested heavily in AI expecting reduced toil. Instead, toil rose to 30% — the first rise in five years. The AI correction phase is coming in 2026.

---

## 2. The Burnout Tax: The $9.4M Cost of Silence

### The $9.4M annual waste nobody talks about (simplified model)

- **78%** of developers spend at least 30% of their time on manual, repetitive tasks ([Harness](https://www.harness.io/state-of-software-delivery))
- Average software engineer salary: **$125,000** ([Indeed](https://www.indeed.com/career-software-engineer/salaries), [Glassdoor](https://www.glassdoor.com/Salaries/united-states-software-engineer-salary-SRCH_IL.0,13_IN1_KO14,31.htm), [ZipRecruiter](https://www.ziprecruiter.com/Salaries/Software-Engineer-Salary)) *— varies widely by market/level; treat ranges as directional*
- 30% toil × $125,000 = **$37,500 of wasted investment per engineer annually**
- For an organization with 250 engineers: **~$9.4M in lost productivity annually** *(simplified model: assumes $125k avg salary, 30% time on toil; actual costs vary by geography, role mix, and toil type)*

Developers report that frequent overtime leads to burnout, increases stress and anxiety, steals time from family and friends, and eventually makes them leave.
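The arithmetic behind the simplified model is easy to reproduce for your own headcount and salary numbers. A minimal sketch (the function name and defaults are ours; the $125k salary and 30% toil figures are the assumptions stated above, not measurements):

```python
def annual_toil_cost(engineers: int,
                     avg_salary: float = 125_000,
                     toil_fraction: float = 0.30) -> float:
    """Estimate annual productivity lost to toil (simplified model).

    Assumes every engineer loses `toil_fraction` of a year that costs
    `avg_salary`; real costs vary by geography, role mix, and toil type.
    """
    return engineers * avg_salary * toil_fraction

# 250 engineers x $125k x 30% toil
print(f"${annual_toil_cost(250):,.0f}")  # $9,375,000
```

Swap in your own salary distribution and measured toil percentage to get a directional number rather than a precise cost.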
*For more on sustainable on-call rotations, see our <a href="/blog/on-call-rotation-guide" target="_blank" rel="noopener noreferrer">On-Call Rotation Guide</a>.*

### [Alert fatigue](/learn/alert-fatigue) increases the chance of missed signals

- **73%** of organizations experienced outages linked to ignored or suppressed alerts ([Splunk State of Observability 2025](https://www.splunk.com/en_us/blog/observability/state-of-observability-2025.html), n=1,855)
- Industry analyses suggest **as many as 67% of alerts are ignored daily** ([incident.io blog](https://incident.io/blog/alert-fatigue-solutions-for-dev-ops-teams-in-2025-what-works); underlying primary dataset not published)
- **Customer-impacting incidents increased 43%**, each costing nearly **$800,000** ([PagerDuty Cost of Incidents study](https://www.pagerduty.com/newsroom/study-cost-of-incidents/))

<img src="/images/articles/state-of-incident-management-2025/alerts_ignored_67.png" alt="State of Incident Management 2025: Industry reports suggest ~67% of alerts are ignored daily (incident.io, 2025)" class="w-full max-w-xl mx-auto rounded-lg shadow-md my-8 border border-border/50" />

This is what we heard over and over in our interviews: teams are drowning in alerts. They've learned to ignore them. Then real incidents happen and nobody responds.

---

> *"Our on-call engineers get 200+ pages per week. Maybe 5 are real. The rest? Threshold noise, flapping alerts, things that auto-resolved. We've trained our team to ignore alerts — which is terrifying."*
>
> *— VP Engineering, Healthcare SaaS (160 engineers)*

---

### On-call burnout is at crisis levels

- **88%** of developers work more than 40 hours per week ([Harness](https://www.harness.io/state-of-software-delivery))
- **Unstable organizational priorities** lead to meaningful decreases in productivity and substantial increases in burnout ([DORA 2024 Report](https://services.google.com/fh/files/misc/2024_final_dora_report.pdf))

### The Firefighting Trap

- **20%** say they often or always start a "war room" with members of many teams until an issue is resolved, and **43%** spend too much time responding to alerts ([Splunk State of Observability 2025](https://www.splunk.com/en_us/blog/observability/state-of-observability-2025.html), n=1,855)
- Alert fatigue is taking its toll, with teams missing critical signals in the noise

Organizations that break free from the firefighting trap prioritize alert hygiene and invest in automated noise reduction, correlation, and intelligent routing.

---

**What this means:** Alert fatigue increases the chance of missed signals. ~$9.4M/year lost per 250 engineers (simplified model). Burnout is at crisis levels. The 30-day rule: delete alerts nobody acts on.

---

## 3. The Great Consolidation: Why Best-of-Breed Is Dead

### The Incident Management Market Is Undergoing Unprecedented Consolidation

#### OpsGenie Shutdown (June 2025 – April 2027)

- **June 4, 2025**: No new OpsGenie accounts can be created
- **April 5, 2027**: Complete service shutdown
- Forcing thousands of organizations to evaluate alternatives
- [Official Atlassian announcement](https://www.atlassian.com/software/opsgenie/migration) | [Read our migration guide](/blog/opsgenie-migration-guide)

#### SolarWinds Acquires Squadcast (March 2025)

- Announced March 3, 2025
- Unifying observability and incident response
- [Press release](https://www.solarwinds.com/company/newsroom/press-releases/solarwinds-acquires-squadcast-unifying-observability-and-incident-response)

#### Freshworks Acquires FireHydrant (December 2025)

- Freshworks is acquiring FireHydrant's AI-native incident management platform
- Deepening its IT service and operations portfolio
- [Press release](https://www.freshworks.com/press-releases/freshworks-to-deepen-its-it-service-and-operations-portfolio-with-acquisition-of-firehydrants-ai-native-incident-management-and-reliability-platform/)

### Why Consolidation Is Happening

1. **Burnout-driven consolidation:** Organizations can no longer manage 7+ different tools
2. **Integration complexity:** Best-of-breed stacks create too many integration points
3. **Economic pressure:** Unified platforms reduce licensing and training costs
4. **AI capabilities:** Vendors with unified data have an advantage in building AI features

Organizations are actively comparing incident.io vs. FireHydrant vs. PagerDuty. Migration away from OpsGenie is accelerating given the 2027 shutdown deadline, with many teams searching for modern, Slack-native alternatives.

---

**What this means:** The incident management market is consolidating rapidly: the OpsGenie shutdown in 2027, and the Freshworks-FireHydrant and SolarWinds-Squadcast acquisitions.
Organizations are moving from 7-tool stacks to unified platforms to reduce burnout and complexity.

---

## Major Incidents (2024–2025): Why Incident Response Mattered

*Learn how to run incidents with clear roles and escalation in our <a href="/blog/incident-response-playbook" target="_blank" rel="noopener noreferrer">Incident Response Playbook</a>.*

### July 2024: CrowdStrike Global Outage — The $5B Wake-Up Call

**The Incident:**

- **Impact**: ~8.5 million Windows devices crashed globally ([Reuters, citing Microsoft](https://www.reuters.com/technology/microsoft-says-about-85-million-its-devices-affected-by-crowdstrike-related-2024-07-20/))
- **Duration**: Some businesses recovered in hours; others took days
- **Business impact**: Airlines grounded, hospitals disrupted, financial services halted; economic impact estimates exceed $5B (e.g., [Parametrix analysis](https://www.parametrixinsurance.com/reports-white-papers/crowdstrikes-impact-on-the-fortune-500); methodologies vary)

**Why Incident Response Was the Difference:**

Organizations with established incident response processes recovered significantly faster. The difference wasn't technical architecture. It was **coordination, communication, and decision-making**:

- Companies with **pre-defined escalation paths** knew who could authorize system-wide changes
- Teams with **customer communication templates** kept stakeholders informed instead of scrambling
- Organizations with **incident command structures** avoided decision paralysis

> *"The difference between a 2-hour outage and a 2-day outage wasn't the bug. It was how quickly teams could coordinate remediation, communicate with customers, and execute rollback procedures."*

### October 2025: AWS US-East-1 Outage — Coordination Chaos

**The Incident:**

- **Duration**: ~15 hours ([ThousandEyes](https://www.thousandeyes.com/blog/aws-outage-analysis-october-20-2025))
- **Impact**: Services across multiple industries affected
- **Business impact**: Widespread service disruption; direct revenue impact varied by company

**What Went Wrong:**

For many organizations impacted by the outage, the breakdown wasn't infrastructure. It was **incident response**:

- **Unclear ownership**: Teams spent critical hours determining who was responsible for what
- **Missing communication loops**: Stakeholders learned about outages from social media, not internal updates
- **No pre-defined response**: Organizations improvised instead of executing established playbooks

**The Lesson:** Multi-region strategies help, but they're useless without **incident management discipline**. Some industry analyses claim organizations with documented runbooks and clear roles reduced their MTTR by up to 60% compared to those improvising ([Xurrent](https://www.xurrent.com/incident-management-response); *treat as directional*).
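A documented runbook doesn't require special tooling to start with. Even encoding an escalation path as a small decision tree removes the "who decides what" scramble these outages exposed. A hypothetical sketch (the checks, remediation steps, and channel names are illustrative, not drawn from any cited playbook):

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Step:
    """One node in a runbook decision tree."""
    question: str                      # check the responder performs
    if_yes: Optional["Step"] = None    # next node when the check passes
    if_no: Optional["Step"] = None     # next node when the check fails
    action: Optional[str] = None       # leaf: the remediation to execute

# Illustrative runbook for a "service unreachable" page.
RUNBOOK = Step(
    question="Is the database responding?",
    if_no=Step(question="", action="Fail over to the replica; page the DBA on-call"),
    if_yes=Step(
        question="Did a deploy land in the last 30 minutes?",
        if_yes=Step(question="", action="Roll back the deploy; notify #incidents"),
        if_no=Step(question="", action="Escalate to the service owner"),
    ),
)

def walk(step: Step, answer: Callable[[str], bool]) -> str:
    """Follow the tree, using `answer` to resolve each check; return the action."""
    while step.action is None:
        step = step.if_yes if answer(step.question) else step.if_no
    return step.action

# Example: database healthy, recent deploy -> the rollback path
print(walk(RUNBOOK, lambda question: True))
```

The point isn't the code; it's that every branch and owner is decided before the incident, so responders execute instead of improvising.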
### December 2024: OpenAI ChatGPT Outage — The Recovery Challenge

**The Incident:**

- **Duration**: ~4 hours of global service disruption
- **Impact**: Millions of users unable to access ChatGPT, the API, and developer tools
- **Root cause**: A new telemetry service deployment created Kubernetes circular dependencies ([OpenAI status page](https://status.openai.com/incidents/01JMYB483C404VMPCW726E8MET))

**The Hidden Story:**

While OpenAI's official postmortem focused on the technical root cause, the incident illustrates a broader **incident response challenge**:

- **Recovery complexity**: When systems have circular dependencies, recovery requires coordinated decision-making across multiple teams
- **Status communication**: With millions of users affected, timely updates become critical, yet challenging without established communication protocols
- **Break-glass dilemma**: OpenAI noted they're implementing "break-glass mechanisms" for future incidents, highlighting that manual recovery procedures must be defined in advance, not improvised during an outage

**The Lesson:** When complex infrastructure fails, the difference between a 2-hour outage and a 4-hour outage often comes down to **incident response discipline**: pre-defined recovery procedures, clear escalation paths, and established communication channels. Technical root causes will happen; response processes determine how long they impact your business.

### The Pattern: Alert Fatigue Causes Real Outages

Multiple 2025 incidents shared a common contributing factor: **real alerts were ignored because teams were drowning in noise**.

- In our interviews, financial services teams reported outages extended by hours when preceding alerts were dismissed as noise
- Healthcare SaaS teams told us incidents were delayed 20-30 minutes due to "is this real?" debate. That's time that matters when patient care is at stake
- 73% of organizations report outages caused by ignored or suppressed alerts

**The Connection to Incident Management:** Alert noise isn't a monitoring problem. It's an **incident management problem**. Without proper alert routing, noise reduction, and escalation processes, teams train themselves to ignore notifications. Then real incidents happen.

> *"We've built an incident management system that cries wolf. Actual humans are paying the price when real incidents occur."*

---

## What We Heard Firsthand: Insights from 25+ Engineering Teams

As part of building Runframe, we conducted discovery interviews with 25+ engineering teams. These conversations (with leaders from Series A startups to Fortune 500 enterprises) informed both our product and this report.

### On AI Adoption

> *"We deployed Copilot company-wide expecting a 30% productivity boost. Six months in, we're spending more time reviewing AI-generated code than we saved writing it. The junior engineers are the most affected — they're accepting suggestions they don't fully understand."*
> — **Engineering Manager, Series C Fintech (150 engineers)**

> *"The AI tools are great for boilerplate. But for incident response? We tried an AI runbook assistant and it confidently gave wrong commands during a P1. We turned it off that night."*
> — **SRE Lead, E-commerce Platform (80 engineers)**

### On Alert Fatigue

> *"Our on-call engineers get 200+ pages per week. Maybe 5 are real. The rest? Threshold noise, flapping alerts, things that auto-resolved. We've trained our team to ignore alerts — which is terrifying."*
> — **VP Engineering, Healthcare SaaS (160 engineers)**

### On DevOps Burnout

> *"We lost three senior SREs in six months. All cited on-call burden. These are people with 10+ years of experience who could work anywhere. We couldn't retain them."*
> — **CTO, Infrastructure Startup (60 engineers)**

> *"I asked my team what would make their lives better. Number one answer: 'Fewer tools.' We use 7 different systems to manage incidents. Seven."*
> — **Director of Platform, Media Company (120 engineers)**

### On What's Actually Working

> *"The single biggest improvement we made was deleting 80% of our alerts. Not tuning them — deleting. If nobody acts on an alert for 30 days, it's gone. Our MTTA dropped by 40%."*
> — **SRE Manager, Gaming Company (90 engineers)**

> *"We stopped doing weekly on-call rotations. Moved to follow-the-sun with 3 regional teams. Burnout complaints dropped to almost zero."*
> — **Head of Reliability, Global SaaS (175 engineers)**

### On Market Consolidation

> *"With OpsGenie shutting down, we had to migrate 200+ users. We chose a Slack-native alternative that meant no context switching. Our MTTR dropped 25% in the first month."*
> — **DevOps Lead, Series B SaaS (75 engineers)**

---

## What This Means for 2026

The data in this report is sobering. But here's what gives us hope: **the market is correcting fast.** The problems are obvious. The solutions are emerging. 2026 will be the year incident management software catches up to the complexity we've created.

### 1. AI Tools Will Actually Work (Finally)

**What's changing:** The first wave of AI tools shipped features. The second wave will ship **outcomes**.

- **Toil reduction becomes the primary metric:** Not "lines of code generated" or "suggestions accepted." Vendors will measure: "Did operational toil decrease?"
- **Governance becomes built-in:** No more "AI suggested we delete the production database." Human-in-the-loop approval for high-impact changes becomes standard.
- **Multi-agent systems mature:** Instead of one "AI assistant," you'll have specialized agents: a triage agent (routes incidents), an RCA agent (analyzes logs), a remediation agent (executes fixes), a communication agent (updates stakeholders). Each does one thing well.

**Why we're optimistic:** The ~$9.4M/year toil cost (simplified model) is too expensive to ignore. Organizations that figure out AI-powered toil reduction will have a massive competitive advantage. The winners will be the ones who ship AI that **reduces complexity**, not adds to it.

**Prediction (Confidence: Medium):** Q2–Q3 2026. The first wave of "AI that actually reduces toil" ships.

### 2. Alert Fatigue Gets Solved (It Has To)

**What's changing:** 73% of organizations experienced outages linked to ignored or suppressed alerts. The tooling to fix this already exists — organizations just haven't deployed it yet.

- **Intelligent correlation becomes standard:** Tools like Splunk, Dynatrace, and new entrants are shipping AI-powered alert correlation. 200 alerts become 3 actionable incidents.
- **Context-aware routing:** Alerts route to the right person based on who's on-call, who owns the service, who fixed it last time, and current load. No more "ping everyone and hope."
- **Self-healing loops:** For known issues (database connection pool exhausted, cache miss storm), systems will auto-remediate and only alert if remediation fails.

**What changes at the org level:** Teams will adopt the **"30-day rule"**: if an alert hasn't been acted on in 30 days, delete it. Not tune it. Delete it. Organizations that do this will see MTTA drop 40%+ (we've seen it firsthand).

**Why we're optimistic:** The cost of ignoring alerts is now measurable: 73% of orgs had outages because real alerts were ignored. Leadership now cares. Budget will follow.

**Prediction:** H1 2026. Alert fatigue becomes a board-level discussion.

### 3. Consolidation Creates Better Tools (Not Worse)

**What's changing:** Yes, the market is consolidating. But consolidation isn't inherently bad — it's a forcing function for **better integration**.

The "best-of-breed" stack era (monitoring + alerting + incident response + postmortems + on-call + status pages + chat ops) created integration hell. Seven tools, seven logins, seven contexts to switch between.
**What replaces it:**

- **Unified platforms with specialized workflows:** Not "one tool for everything," but platforms that handle the full incident lifecycle without context switching.
- **Slack/Teams-native workflows:** Work where your team already works. No separate "incident management app" to check.
- **Open ecosystems:** The winners won't be closed platforms. They'll be the ones with the best APIs, webhooks, and extensibility.

**Why we're optimistic:** The OpsGenie shutdown is forcing thousands of teams to re-evaluate. They're not just migrating — they're **rethinking** their entire incident stack. This is an opportunity to fix 5+ years of accumulated tool sprawl.

**Prediction:** Throughout 2026. The "great migration" happens.

### 4. Incident Response Becomes a Discipline (Not Just Firefighting)

**What's changing:** For too long, incident management has been "whoever's around figures it out." That's changing. Organizations are realizing that **process matters as much as tooling**.

- **Incident Commander becomes a recognized role:** Not "whoever got paged," but trained ICs who know how to coordinate, communicate, and close incidents properly.
- **Runbooks evolve into decision trees:** Not static docs, but interactive workflows: "Is the database responding? No → Try this. Yes → Check this."
- **[SLOs](/learn/slo) become operational:** 50% of organizations are investigating or implementing SLOs ([Grafana Observability Survey 2025](https://grafana.com/observability-survey/2025/))

**Why we're optimistic:** The CrowdStrike and AWS incidents showed the difference between teams with process and teams without. Companies that recovered in hours had clear playbooks. Companies that took days didn't. The lesson is obvious.

**Prediction:** 2026–2027. Industry-wide shift from reactive to proactive.

### 5. Agentic AI Gets Real (With Guardrails)

**What's changing:** The hype around "autonomous agents" will mature into **practical, constrained automation**.
- **Agents with boundaries:** Not "AI does everything," but "AI handles known scenarios, escalates unknowns."
- **Specialized by domain:** Triage agent, RCA agent, remediation agent. Each with clear scope.
- **Human approval for high-impact actions:** AI can restart a service. It can't delete a database without approval.

**What this looks like in practice:**

> Incident declared. Triage agent analyzes symptoms, suggests root cause. RCA agent pulls relevant logs, identifies the failing deployment. Remediation agent proposes: "Rollback to v2.3.1?" Human approves. Agent executes. Communication agent posts update to status page. Time saved: 20+ minutes of coordination.

**Why we're optimistic:** The technology exists. The challenge was trustability. 2026 will be the year we figure out the right balance: AI speed + human judgment. Models have gotten so much better — tooling will catch up in 2026.

**Prediction:** Late 2026. First production-ready agentic incident systems ship.

## The Bottom Line: Things Will Get Better

Yes, 2025 was hard. Toil went up. Burnout is real. Alert fatigue is crushing teams.

But the problems are now **measurable**. And what gets measured gets fixed.

- ~$9.4M/year in developer toil (simplified model) → CFOs now care
- 73% had outages linked to ignored or suppressed alerts → Boards now care
- 88% of developers work >40 hours/week → Retention now threatened ([Harness, 2025](https://www.harness.io/state-of-software-delivery))

The market is responding. Better tooling is coming. Organizations are consolidating, simplifying, and focusing on outcomes over features.

**Prediction (Confidence: Medium):** Toil drops back toward 25%. Alert noise decreases 50%+. The first wave of incident response platforms that actually reduce complexity ships.

The future isn't more tools. It's **better tools**. And they're closer than you think.

---

## What Engineering Teams Should Do in 2026

Based on this research, here's what we recommend:

### **If you're drowning in alert noise:**

1. Implement the 30-day rule: Delete alerts that haven't been acted on in 30 days
2. Deploy correlation tools (Splunk, Dynatrace, or alternatives)
3. Measure: What % of alerts are noise? Target <20%

### **If your team is burning out:**

1. Audit your on-call rotation: Are people working >40 hours plus on-call?
2. Implement recovery time: Paged at 2 AM? Start late the next day
3. Consider compensation: $200-400/week or TOIL ([industry benchmark](https://www.ilert.com/blog/on-call-compensation-2025))

### **If you're managing 5+ incident tools:**

1. List all tools you use for: monitoring, alerting, incident response, postmortems, on-call, status pages, chat ops
2. Calculate the total cost (licenses + engineering time)
3. Evaluate unified platforms (you'll be surprised by the savings)

### **If you're migrating from OpsGenie:**

- Timeline: June 2025 = no new accounts, April 2027 = shutdown
- Key vendors to consider: PagerDuty, incident.io, and emerging platforms
- Prioritize: Slack-native workflows, alert correlation, unified platform
- **Read our complete [OpsGenie Migration Guide](/blog/opsgenie-migration-guide)** for timelines, pricing, and step-by-step plans

### **If you're investing in AI:**

1. Measure toil before and after deployment
2. Implement human-in-the-loop for high-impact changes
3. Track: Did operational toil decrease? (Not "lines of code generated")

**Need help?** [Join the waitlist for early access](https://runframe.io/contact) | <a href="/blog" target="_blank" rel="noopener noreferrer">Read our blog</a>

---

## Sources

### Industry Research Reports

1. [Splunk State of Observability 2025](https://www.splunk.com/en_us/blog/observability/state-of-observability-2025.html) — n=1,855 professionals
2. [Dynatrace State of Observability 2025](https://www.dynatrace.com/news/blog/ai-observability-business-impact-2025/) — n=842 senior leaders
3. [PagerDuty Agentic AI Survey 2025](https://www.pagerduty.com/newsroom/agentic-ai-survey-2025/) — n=1,000 executives
4. [Harness State of Software Delivery 2025](https://www.harness.io/state-of-software-delivery) — n=500 practitioners
5. [Catchpoint SRE Report 2025](https://www.catchpoint.com/press-releases/the-sre-report-2025-highlighting-critical-trends-in-site-reliability-engineering) — n=301 professionals
6. [New Relic Observability Forecast 2025](https://newrelic.com/sites/default/files/2025-09/new-relic-2025-observability-forecast-report.pdf)
7. [DORA Report 2024](https://services.google.com/fh/files/misc/2024_final_dora_report.pdf) — Google Cloud

### Additional Sources

8. [Atlassian State of Incident Management 2024](https://www.atlassian.com/incident-management/2024-state-of-incident-management) — n=500+ practitioners
9. [PagerDuty State of Digital Operations 2024](https://www.pagerduty.com/blog/news-announcements/2024-state-of-digital-operations/)
10. [PagerDuty Cost of Incidents Study](https://www.pagerduty.com/newsroom/study-cost-of-incidents/)
11. [DevOps.com Burnout Survey 2024](https://devops.com/survey-surfaces-high-devops-burnout-rates-despite-ai-advances/)

### Major Incidents & Case Studies

12. [CrowdStrike Global Outage — Microsoft estimate (Reuters)](https://www.reuters.com/technology/microsoft-says-about-85-million-its-devices-affected-by-crowdstrike-related-2024-07-20/) — July 2024
13. [AWS US-East-1 Outage Analysis (ThousandEyes)](https://www.thousandeyes.com/blog/aws-outage-analysis-october-20-2025) — October 2025
14. [OpenAI Outage Postmortem (OpenAI status)](https://status.openai.com/incidents/01JMYB483C404VMPCW726E8MET) — December 2024

### Market News

15. [OpsGenie Shutdown — Official Atlassian Announcement](https://www.atlassian.com/software/opsgenie/migration)
16. [SolarWinds Acquires Squadcast](https://www.solarwinds.com/company/newsroom/press-releases/solarwinds-acquires-squadcast-unifying-observability-and-incident-response)
17. [Freshworks Acquires FireHydrant](https://www.freshworks.com/press-releases/freshworks-to-deepen-its-it-service-and-operations-portfolio-with-acquisition-of-firehydrants-ai-native-incident-management-and-reliability-platform/)

---

## Report Highlights

> "The AI Paradox: While 75% of organizations invest $1M+ in AI expecting 171% ROI, operational toil has increased for the first time in five years. The promise of automation has become a burden of complexity."

> "The $9.4M Developer Waste: With 78% of developers spending 30% of their time on manual toil, a 250-person engineering team can lose ~$9.4M annually (simplified model)."

> "The Alert Fatigue Crisis: 73% of organizations experienced outages linked to ignored or suppressed alerts ([Splunk](https://www.splunk.com/en_us/blog/observability/state-of-observability-2025.html), n=1,855). Industry analyses suggest ~67% of alerts may be ignored daily ([incident.io blog](https://incident.io/blog/alert-fatigue-solutions-for-dev-ops-teams-in-2025-what-works); underlying dataset not published). Teams are being trained to ignore notifications."

> "High-impact IT outages now cost ~$2 million per hour. Organizations lose a median of ~$76 million annually from unplanned downtime. The economic case for incident management has never been stronger."

---

## About This Report

This research was compiled by the [Runframe](https://runframe.io) team to help engineering organizations navigate the changing landscape of incident management. Published January 2026.

**Runframe** is an incident management platform being built for modern engineering teams. We're building it because the problems in this report are real. If you're dealing with alert fatigue, tool sprawl, or burnout on your team, we'd love to help.

👉 **[Join the waitlist for early access at runframe.io](https://runframe.io/contact):** Launching Q1 2026
2026 State of Incident Management Report: Key Statistics & Benchmarks
Operational toil rose to 30% in 2025 despite AI. Get the latest data on burnout, alert fatigue, and why engineering teams are struggling to keep up.