Incident Automation
The use of technology to perform tasks with reduced human assistance.
## "Robots Do It Better"

**Automation** is the application of code to remove manual effort. In Incident Management, it saves minutes when every second counts.

### What to Automate

1. **Detection**: Alerts (obviously).
2. **Diagnostics**: Scripts that run automatically when an alert fires to gather logs/graphs.
3. **Remediation**: Auto-restarting bad pods, auto-scaling clusters.
4. **Administration**: Creating Slack channels, Jira tickets, and Zoom links.

### The Automation Paradox

Automation saves time, but it takes time to build. You must weigh the "Return on Investment" (ROI). If a task takes 1 minute and happens once a year, don't spend 2 weeks automating it.
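
The ROI trade-off can be made concrete with a little arithmetic. The sketch below is a minimal illustration, not a standard formula: the function names are hypothetical, and it assumes that "2 weeks" means 80 working hours (two 40-hour weeks) and that maintenance cost is zero.

```python
def break_even_years(minutes_per_run: float, runs_per_year: float,
                     build_hours: float) -> float:
    """Years until the time saved equals the time spent building.

    Assumption: the automation removes the whole manual task and
    needs no maintenance afterwards.
    """
    saved_minutes_per_year = minutes_per_run * runs_per_year
    if saved_minutes_per_year <= 0:
        return float("inf")  # nothing is ever saved
    return (build_hours * 60) / saved_minutes_per_year


def worth_automating(minutes_per_run: float, runs_per_year: float,
                     build_hours: float, horizon_years: float = 1.0) -> bool:
    """True if the build cost is paid back within the chosen horizon."""
    return break_even_years(minutes_per_run, runs_per_year,
                            build_hours) <= horizon_years


# The example from the text: a 1-minute task, once a year,
# versus 2 weeks (assumed 80 working hours) of build time.
print(break_even_years(1, 1, 80))   # 4800.0
print(worth_automating(1, 1, 80))   # False
```

Under these assumptions the one-minute yearly task would take 4,800 years to pay back, which is why it is not worth automating. Real decisions also weigh things this sketch ignores, such as reduced error rates and faster response during an incident.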
Example: The Slack Bot
"Setting up a war room (invite people, create doc, create channel) took 15 minutes of clicking."
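
A bot can collapse those manual clicks into a single call. The sketch below is illustrative only: `ChatClient`, `DocClient`, `open_war_room`, and the fake clients are hypothetical names invented here, not the API of Slack or any document tool. A real bot would put the vendor's SDK behind the same interfaces.

```python
from dataclasses import dataclass, field
from typing import Protocol


class ChatClient(Protocol):
    """Hypothetical chat interface; a real bot would wrap a vendor SDK."""
    def create_channel(self, name: str) -> str: ...
    def invite(self, channel: str, users: list[str]) -> None: ...
    def post(self, channel: str, text: str) -> None: ...


class DocClient(Protocol):
    """Hypothetical document interface."""
    def create_doc(self, title: str) -> str: ...


@dataclass
class WarRoom:
    channel: str
    doc_url: str
    responders: list[str]


def open_war_room(incident_id: str, summary: str, responders: list[str],
                  chat: ChatClient, docs: DocClient) -> WarRoom:
    """Create the channel and doc, invite people, and link the doc."""
    channel = chat.create_channel(f"inc-{incident_id.lower()}")
    doc_url = docs.create_doc(f"Incident {incident_id}: {summary}")
    chat.invite(channel, responders)
    chat.post(channel, f"Incident doc: {doc_url}")
    return WarRoom(channel, doc_url, list(responders))


# In-memory fakes so the sketch runs without any external service.
@dataclass
class FakeChat:
    log: list = field(default_factory=list)

    def create_channel(self, name: str) -> str:
        self.log.append(("create_channel", name))
        return name

    def invite(self, channel: str, users: list[str]) -> None:
        self.log.append(("invite", channel, tuple(users)))

    def post(self, channel: str, text: str) -> None:
        self.log.append(("post", channel, text))


class FakeDocs:
    def create_doc(self, title: str) -> str:
        slug = title.split(":")[0].replace(" ", "-").lower()
        return f"https://docs.example.invalid/{slug}"


if __name__ == "__main__":
    room = open_war_room("1234", "API errors", ["alice", "bob"],
                         FakeChat(), FakeDocs())
    print(room.channel, room.doc_url)
```

Keeping the chat and document tools behind small interfaces means the orchestration can be tested with fakes, as above, before it is connected to real services.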
Why Incident Automation Matters
Automation reduces toil, minimizes human error, and speeds up incident response.
The goal of SRE is to automate this year's job away.