Learn DevOps & SRE Concepts
Essential incident management and reliability engineering terms explained. From MTTR to SLOs, master the concepts that power elite engineering teams.
Complete Glossary
56 terms across 9 categories
Core Metrics (9 terms)
The average time it takes to fully resolve an incident from detection to service restoration.
The average time from when an alert fires to when a human acknowledges it.
The average time from when an issue occurs to when an alert fires.
The average time between system failures or incidents.
A target reliability threshold for a service, typically expressed as a percentage over a time period.
A quantitative measure of service performance, such as latency or error rate, used to track SLOs.
The maximum amount of unreliability a service can accumulate over the SLO period before violating its SLO.
The period of time during which a system or service is unavailable or failing to perform its primary function.
The percentage of time that a system is fully operational and available to users.
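The arithmetic behind these metrics can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the `Incident` record, its field names, and the function names are all hypothetical, and it assumes "detection" means the moment the alert fired.

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import mean


@dataclass
class Incident:
    # Hypothetical record layout; the field names are illustrative only.
    started: datetime       # when the issue began
    alerted: datetime       # when the alert fired (treated as detection)
    acknowledged: datetime  # when a human acknowledged the alert
    resolved: datetime      # when service was restored


def mean_minutes(deltas):
    """Average an iterable of timedeltas and return the result in minutes."""
    return mean(d.total_seconds() for d in deltas) / 60


def mean_times(incidents):
    """Mean time to detect, acknowledge, and resolve, in minutes."""
    return {
        "detect": mean_minutes(i.alerted - i.started for i in incidents),
        "acknowledge": mean_minutes(i.acknowledged - i.alerted for i in incidents),
        "resolve": mean_minutes(i.resolved - i.alerted for i in incidents),
    }


def error_budget_minutes(slo_percent, window_days):
    """Minutes of unreliability allowed by an availability SLO over the window."""
    window_minutes = window_days * 24 * 60
    return window_minutes * (100 - slo_percent) / 100


def availability_percent(downtime_minutes, window_days):
    """Percentage of the window during which the service was available."""
    window_minutes = window_days * 24 * 60
    return 100 * (window_minutes - downtime_minutes) / window_minutes
```

Under these assumptions, a 99.9% SLO over a 30-day period leaves an error budget of about 43.2 minutes, and spending all of it brings measured availability down to the 99.9% target.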
Roles (5 terms)
The person responsible for all high-level coordination and decision-making during an incident.
The person responsible for all internal and external communication during an incident.
The person responsible for accurately documenting the timeline, actions, and decisions during an incident.
The technical specialist responsible for diagnosing and fixing the specific service or component causing the incident.
A discipline that incorporates aspects of software engineering and applies them to infrastructure and operations problems.
Severity (6 terms)
Critical emergency. The system is completely unusable for a significant portion of users, or data integrity is at risk.
Major incident. Significant functionality is broken or degraded, but a workaround may exist or the impact is partial.
Degraded performance or minor functionality broken for some users. Workarounds may exist.
Minor bug or confusing but non-critical issue. No immediate user impact.
Cosmetic issues, typos, or internal-only problems. Zero user impact.
A classification system (P0-P4 or SEV1-SEV5) that determines the urgency and response time required for an incident.
Processes (7 terms)
The process of detecting, responding to, and resolving system incidents or outages.
A meeting to analyze what happened during an incident and identify improvements.
A post-incident analysis that focuses on system and process failures rather than individual blame.
A step-by-step guide for handling specific operational tasks or incidents.
A comprehensive guide containing strategies, procedures, and best practices for handling a broad class of operational scenarios.
Predefined rules for when and how to escalate incidents to additional resources or management.
The use of technology to perform tasks with reduced human assistance.
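The escalation rules described above can be expressed as data plus a small lookup. The sketch below is only one possible shape: the `EscalationStep` type, the thresholds, and the role names in `POLICY` are invented for illustration and do not come from any particular tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EscalationStep:
    after_minutes: int  # minutes the alert has gone unacknowledged
    notify: str         # who gets paged once this threshold is reached


# Hypothetical policy; thresholds and roles are assumptions for this example.
POLICY = [
    EscalationStep(0, "primary on-call"),
    EscalationStep(15, "secondary on-call"),
    EscalationStep(30, "engineering manager"),
]


def notified_so_far(policy, minutes_unacknowledged):
    """Return everyone who should have been paged by this point."""
    return [
        step.notify
        for step in policy
        if minutes_unacknowledged >= step.after_minutes
    ]
```

With this policy, an alert left unacknowledged for 20 minutes has reached the primary and secondary on-call engineers but not yet the manager.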
On-Call Management (6 terms)
A systematic method for identifying the underlying causes of problems or incidents.
The practice of designating specific team members to be available to respond to urgent system issues outside of standard working hours.
A roster that determines which engineer is responsible for responding to incidents at any given time.
The engineer currently designated to receive and act upon system alerts and incidents.
The structured process of transferring incident context, active alerts, and duties from one on-call engineer to the next.
A global on-call scheduling model where shifts are assigned to teams in active time zones to avoid night shifts.
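Both an on-call rotation and a follow-the-sun model reduce to simple lookups. The following sketch assumes a fixed-length round-robin rotation and three equal UTC coverage windows; the roster, region names, and window boundaries are hypothetical.

```python
from datetime import date


def on_call_for(roster, rotation_start, day, shift_days=7):
    """Round-robin rotation: who is on call on `day`, given the roster order."""
    shifts_elapsed = (day - rotation_start).days // shift_days
    return roster[shifts_elapsed % len(roster)]


# Hypothetical follow-the-sun coverage windows, in UTC hours [start, end).
COVERAGE = [
    (0, 8, "APAC"),
    (8, 16, "EMEA"),
    (16, 24, "AMER"),
]


def region_for_utc_hour(hour):
    """Return the regional team whose working window covers this UTC hour."""
    for start, end, team in COVERAGE:
        if start <= hour < end:
            return team
    raise ValueError(f"hour must be in 0..23, got {hour}")
```

Real schedules usually also need overrides, holidays, and handoff overlap, which this sketch deliberately leaves out.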
Monitoring & Observability (5 terms)
A contractual commitment to customers regarding service performance and availability.
The measure of how well internal states of a system can be inferred from knowledge of its external outputs.
The process of collecting, analyzing, and using data to track the health of applications and infrastructure.
The proportion of time a system is operational and accessible.
The four key metrics that represent the health of a system: Latency, Traffic, Errors, and Saturation.
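As a rough illustration of the four signals, the sketch below derives them from a batch of request records and a resource reading. The `Request` type, the choice of 95th-percentile latency, the treatment of 5xx statuses as errors, and CPU as the saturation resource are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class Request:
    latency_ms: float
    status: int  # HTTP-style status code


def golden_signals(requests, window_seconds, cpu_used, cpu_capacity):
    """Summarize latency, traffic, errors, and saturation for one window."""
    if not requests:
        raise ValueError("need at least one request")
    latencies = sorted(r.latency_ms for r in requests)
    n = len(latencies)
    # Nearest-rank 95th percentile, using integer math to avoid float rounding.
    rank = (95 * n + 99) // 100
    errors = sum(1 for r in requests if r.status >= 500)
    return {
        "latency_p95_ms": latencies[rank - 1],
        "traffic_rps": n / window_seconds,
        "error_rate": errors / n,
        "saturation": cpu_used / cpu_capacity,
    }
```

In practice these values normally come from a metrics system rather than raw request lists; the point here is only what each signal measures.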
Incident Response (8 terms)
The complete lifecycle of how organizations prevent, detect, respond to, and learn from incidents.
A schedule that determines which engineer is responsible for answering alerts during a specific time period.
The practice of intentionally injecting failures into systems to build resilience.
A scheduled practice session where teams simulate incidents to test response procedures.
A dedicated physical or virtual space where incident responders coordinate during major incidents.
The initial phase of incident response where the severity, impact, and required expertise are determined.
The end-to-end journey of an incident from the moment it occurs until the post-incident review is completed.
The point in the incident lifecycle where the service is restored to full functionality for the customer.
Engineering Practices (5 terms)
Operational work that tends to be manual, repetitive, and automatable.
The probability that a system will function correctly under stated conditions for a specified period.
A cultural philosophy that combines software development (Dev) and IT operations (Ops) to shorten the systems development life cycle.
A public or private dashboard that communicates the current health of services to users.
A collaboration model that connects people, tools, and scripts into a transparent workflow, usually inside a chat platform such as Slack or Microsoft Teams.
Health & Well-being (5 terms)
Desensitization caused by excessive or low-quality alerts, leading to missed critical alerts.
Physical and emotional exhaustion caused by frequent sleep interruption, excessive alerts, and the stress of maintaining high availability.
An on-call practice that prioritizes human health and long-term team viability alongside system reliability.
A scheduling principle ensuring that the burden of on-call duties (including weekends and holidays) is distributed equitably across the team.
The practice of measuring and equalizing the amount of time and effort each team member spends on on-call duties.
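Measuring and equalizing on-call load can start with a simple tally. This sketch assumes load is tracked as hours per shift and that the next shift goes to whoever has carried the least so far; the function names and the tie-breaking rule are illustrative choices, not an established method.

```python
def on_call_hours(shifts):
    """Total on-call hours per engineer from (engineer, hours) records."""
    totals = {}
    for engineer, hours in shifts:
        totals[engineer] = totals.get(engineer, 0) + hours
    return totals


def next_for_shift(totals):
    """Pick the engineer with the fewest hours; ties break alphabetically."""
    return min(totals, key=lambda engineer: (totals[engineer], engineer))
```

A fuller version might weight nights, weekends, and holidays more heavily than weekday daytime hours, in line with the fairness principle described above.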