On a call last month, an engineering manager said:

> "We have an on-call schedule in a Google Sheet. The problem is, nobody looks at it. When something breaks at 2 AM, everyone waits for someone else to speak up first. By the time someone actually responds, you've lost 20 minutes."

That's the moment the "informal" system starts costing real minutes. "Whoever's around" can work at 10-15 people. Around 40-50 people, it starts failing in predictable ways. You have two options: keep winging it, or put in a rotation that's boring, explicit, and repeatable.

Across dozens of conversations, the teams that avoid burnout tend to converge on the same structure. Here's what works.

**TL;DR:** Primary + backup (weekly). No-response rule (5 min). Written handoff (2 min). Visible in Slack daily. Recovery after overnight pages.

**This guide includes:**

- 3 copy-paste templates (handoff, escalation, rotation schedule)
- Severity matrix (SEV-0 through SEV-3)
- Compensation benchmarks ($200-500/week)
- When to use spreadsheets vs. tools
- 8 FAQs covering real edge cases

Based on conversations with 25+ engineering teams. Bookmark this; you'll come back to it.

---

## What Is On-Call Rotation?

On-call rotation is a scheduled system where your **incident response team** takes turns being the primary responder for production incidents. It includes:

- **Primary responder** - First person contacted when something breaks
- **Backup responder** - Steps in if the primary doesn't respond in 5 minutes
- **Clear escalation rules** - When and how to page the backup or manager. See: [escalation policy](/learn/escalation-policy)
- **Defined time boundaries** - Usually weekly (Monday 9 AM → Monday 9 AM)
- **Written handoffs** - 2-minute transfer of context between shifts

The goal: 24/7 coverage without burning out any single person.
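To make the moving parts concrete, here is a minimal sketch (Python, with hypothetical names) of how a weekly primary + backup schedule can be generated from a list of engineers. The pairing rule is an assumption that matches the example rotation later in this guide: the primary cycles through the team, and each week's backup is the previous week's primary, so whoever just handed off stays reachable.

```python
def weekly_rotation(engineers, weeks):
    """Yield (week_number, primary, backup) for a weekly rotation.

    Assumed pairing rule: the backup each week is the previous
    week's primary, so whoever just handed off stays reachable
    as a safety net.
    """
    n = len(engineers)
    for week in range(weeks):
        primary = engineers[week % n]
        backup = engineers[(week - 1) % n]  # previous week's primary
        yield week + 1, primary, backup

for week, primary, backup in weekly_rotation(["alice", "charlie", "bob"], 3):
    print(f"Week {week}: {primary} (primary), {backup} (backup)")
```

Under this rule, a shift swap is just swapping two names in the list before regenerating the schedule.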
---

## On-Call Rotation Approaches Compared

<table>
  <caption>Comparison of on-call rotation approaches showing team size fit, failure point, and why each approach fails</caption>
  <thead>
    <tr>
      <th>Approach</th>
      <th>Works For</th>
      <th>Breaks At</th>
      <th>Why It Fails</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>"Whoever's around"</td>
      <td>&lt;15 people</td>
      <td>40+ people</td>
      <td>Assumes everyone knows who to call</td>
    </tr>
    <tr>
      <td>Solo on-call</td>
      <td>Almost never</td>
      <td>Immediately</td>
      <td>No backup when they're unavailable</td>
    </tr>
    <tr>
      <td>Daily rotation</td>
      <td>Rarely</td>
      <td>Always</td>
      <td>Constant anxiety, no clean "off" time</td>
    </tr>
    <tr>
      <td><strong>Weekly primary + backup</strong></td>
      <td><strong>20-100 people</strong></td>
      <td><strong>Rarely (if done right)</strong></td>
      <td><strong>Only if you skip recovery time</strong></td>
    </tr>
    <tr>
      <td>Enterprise tools</td>
      <td>100+ people</td>
      <td>Cost-sensitive &lt;100</td>
      <td>Overkill for team size</td>
    </tr>
  </tbody>
</table>

---

## Why On-Call Breaks as Teams Grow

These were the most common failure modes:

**Solo on-call.** One person is "it" for the week. If they're sick, unreachable, or asleep through a page, you lose time fast. One 30-person team told me their on-call was out sick mid-week. The incident lasted 3 hours before someone finally called the CTO directly, because nobody knew who to escalate to. Everyone paid for the ambiguity.

**Office-hours-only coverage.** "Maria is on-call 9-5." Then production breaks at 8 PM and people hesitate because "it's not covered." The "schedule" becomes an excuse to delay escalation.

**Unknown escalation path.** Who do you call when the on-call doesn't respond? A Series B company wasted 45 minutes during a database outage because nobody knew who to escalate to. They had a backup on paper, but nobody could name them under pressure.

**Daily rotations.** They look fair, but they keep people anxious because they're always "up next." You never get a clean "off" period.
One team tried this, and morale collapsed within weeks.

**On-call as punishment.** "You broke it, you're on-call." I heard this from three teams. It teaches people to delay reporting and quietly patch around problems.

**No compensation or recovery time.** Three teams told me they expected engineers to do on-call "as part of the job": no stipend, no comp time, no acknowledgment. Two of them had someone quit within 6 months, specifically citing on-call burden as the reason.

---

## The Worst On-Call Setup I've Seen

A 35-person startup had a monthly rotation with no backup and no escalation path. One person was expected to be available 24/7 for 30 days straight. Three things happened:

**Their best senior engineer quit after two rotations.** "I couldn't plan anything for a month at a time. Every weekend was 'maybe I'll get paged, maybe not.' I couldn't commit to anything."

**During one rotation, the on-call was at a wedding with no cell service.** A database failure went undetected for 4 hours. Customers started emailing support before the team even knew there was a problem.

**Junior engineers started refusing to do on-call.** The rotation fell apart. The VP of Engineering personally covered 3 months straight until they redesigned it.

They switched to weekly rotations with a backup. Turnover dropped. Nobody quit over on-call again.

Don't do monthly solo on-call. Just don't.

---

## Why Teams Move Away From PagerDuty and Opsgenie

**Migrating from Opsgenie?** [Read our complete migration guide with timelines, pricing, and step-by-step plans](/blog/opsgenie-migration-guide).

Before we get to what works, here's what doesn't: enterprise on-call tools for teams under 100 people. The teams we talked to had similar complaints:

**"Too complex for our size."** A 40-person team: "PagerDuty has features we'll never use. We just need scheduling and escalation."

**"Expensive for what we need."** Another team: "We're paying $50+/seat. For our size, that's overkill."
**"Not where we work."** Multiple teams: "Our team lives in Slack. PagerDuty feels like another tool to check."

Most teams sit in this gap: too big for spreadsheets, too small (or too budget-conscious) for PagerDuty.

---

## An On-Call Rotation Setup That Prevents Burnout

Most sustainable setups look like this:

### Primary + Backup + Escalation Rules

The primary is the first person to respond when something breaks. If the primary hasn't responded in 5 minutes, page the backup (any severity). If the backup hasn't responded in another 5 minutes, escalate to the engineering manager for SEV-0/SEV-1. For SEV-2+, escalate at 30 minutes (or next business hours), unless impact increases.

A 40-person fintech team told me: "Primary for the week, backup as a safety net. The rule is simple enough that nobody argues in the moment."

The 5-minute rule is for *no response*, not technical escalation. It removes hesitation: when nobody responds, the clock decides. It also forces visibility: if nobody responds, you've found a broken escalation path, fast.

The backup should carry a lower load by design. They're not expected to hover, just to be reachable. That asymmetry matters: being backup burns people out far less than being solo on-call.
**Severity levels guide escalation timing:**

<table>
  <caption>Severity levels, response targets, and escalation rules</caption>
  <thead>
    <tr>
      <th>Severity</th>
      <th>Description</th>
      <th>Response Target</th>
      <th>Escalation Rule</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><strong>SEV-0</strong></td>
      <td>Complete outage, all customers down</td>
      <td>Immediate</td>
      <td>5 min → backup, 10 min → EM</td>
    </tr>
    <tr>
      <td><strong>SEV-1</strong></td>
      <td>Major feature down, significant impact</td>
      <td>&lt;5 minutes</td>
      <td>5 min → backup, 10 min → EM</td>
    </tr>
    <tr>
      <td><strong>SEV-2</strong></td>
      <td>Minor feature down, some users affected</td>
      <td>&lt;15 minutes</td>
      <td>30 min or next business day</td>
    </tr>
    <tr>
      <td><strong>SEV-3</strong></td>
      <td>Degraded performance, no customer impact</td>
      <td>Next business day</td>
      <td>No escalation needed</td>
    </tr>
  </tbody>
</table>

Use these response targets to maintain **SLA compliance** for your customers while protecting your team from burnout. (More on compensation and recovery time below; it matters more than most teams realize.)

**Page Policy (to prevent burnout):** Page only for customer impact, data-loss risk, security, or hard downtime. Everything else becomes a ticket for business hours.

### Weekly Rotations (Default for Most Teams)

Daily rotations are too stressful. Monthly rotations are too long. Weekly is the simplest cadence most teams can sustain.

"The Monday handoff became a predictable ritual. Everyone knew their week was coming and could plan around it," a staff engineer told me.

Some teams move to 2-week rotations once they have enough redundancy. Weekly is still the default for most.

### Time Zones: Don't Page People at 2 AM Local Time

If your team spans time zones, on-call needs to account for that. A global team (SF/London/Singapore) told me: "We used to have one global on-call. The person in SF was getting paged at 2 AM constantly. We fixed it with regional coverage blocks. SF covers SF hours. London covers EMEA.
Singapore covers APAC. Much more humane."

If you can't do regional coverage, align on-call with your riskiest window (deploys, peak traffic, known batch jobs). If you're doing a big deploy on Friday, the on-call that week is someone who's around on Friday, not someone taking Friday off.

Rule of thumb: if you routinely page someone at 2 AM their time, the system is mis-designed (the rotation, the alerts, or both).

### Handoffs: 2 Minutes, Written, In Public

The teams that scale on-call keep handoff friction close to zero. The outgoing on-call posts a short handoff note: what paged, what's unresolved, what to watch. The incoming on-call replies to confirm ownership. If someone misses the handoff, they post as soon as they're online (no silent gaps).

A 30-person infrastructure team: "Our handoff takes 2 minutes. Post what happened, acknowledge receipt, done. The teams that struggled had handoff meetings that nobody attended. Friction kills adoption."

These handoffs feed directly into [post-incident reviews](/blog/post-incident-review-template): document what happened so the whole team learns.

### Make "Who's On-Call?" Impossible to Miss

The most common complaint I heard: "Nobody knows who's on-call." The fix: make it visible where the work happens.

Put it in Slack: channel topic + pinned message + a daily post. Make sure incident declarations tag the primary (and name the backup).

A pattern that works: a bot posts daily in #incidents: "On-call: @primary · Backup: @backup". That's it. Now everyone knows who to ping.

The teams that struggled had the information hidden in a spreadsheet. The teams that worked made it impossible to miss.

### Compensation and Recovery Time

This came up in almost every conversation: on-call deserves recognition.

**Money is the clearest signal.** What I saw teams actually doing: $200-300/week for startups under 50 people, $400-500/week at larger companies. It's direct, it's fair, and it acknowledges that on-call is work outside normal hours.
**If you can't do stipends, recovery time is non-negotiable.** If you get paged overnight, you start later or take the morning off, no permission needed. Many teams offer TOIL (time off in lieu): if you spend 2 hours at 2 AM fixing an incident, you get 2+ hours off to recover. This directly addresses burnout.

**Other recognition patterns:** No on-call immediately before or after vacations. Swap-friendly scheduling, so people can trade shifts when they have conflicts. Public acknowledgment of on-call contributions.

A 25-person startup: "We give $200/week for on-call plus a comp day if paged overnight. It's not about the money. It's about recognizing the burden."

On-call has a real cost. If you can't pay for it, at minimum give time back. If you ignore both, you'll pay for it in attrition.

### Why "Follow the Sun" On-Call Is Usually Overrated

A lot of advice says: "If you have global teams, do follow-the-sun on-call where each region covers its own hours." Sounds great in theory. In practice, many teams under 100 people don't need true follow-the-sun.

**It can fragment context.** When APAC hands off to EMEA, who hands off to the US, context gets lost. "Redis was flaky" becomes "something was weird" and the thread resets. One team told me: "We tried follow-the-sun. Half our incidents got worse because the person picking it up had no context."

**It can hide a noisy-alert problem.** If you're getting paged at 3 AM every night, the issue isn't your rotation; it's your monitoring. This causes [alert fatigue](/learn/alert-fatigue), where your team stops responding because they're conditioned to ignore pages. Reduce pages first: tighten alerting, add runbooks, automate common fixes. Don't build a 24/7 rotation to work around noisy alerts.

**Regional coverage is often enough.** You don't need "follow the sun." You need "don't wake someone at 2 AM in their time zone." Have a US on-call and an EMEA on-call; that covers 16+ hours. For the gap, either accept a delayed response or rotate who covers it.
Exception: if you have true 24/7 SLAs *and* real usage across all time zones, follow-the-sun can be worth the complexity. But most startups have follow-the-sun guilt, not follow-the-sun need.

For more on managing incidents at scale, read our [engineering productivity guide](/blog/engineering-productivity-incident-management).

---

## On-Call Rotation Template

This works for 20-100 person teams. Adapt it to your needs. **Setup time:** ~10 minutes if you keep it simple.

### The Setup

Set a clear boundary: Monday 9 AM → Monday 9 AM (local time). Coverage is a primary (first responder) plus a backup (5-minute escalation). Handoff happens Monday morning in #on-call (written, not a meeting).

**Example rotation for 3 engineers** (each week's backup is the previous week's primary):

```
Week 1: Alice (primary), Bob (backup)
Week 2: Charlie (primary), Alice (backup)
Week 3: Bob (primary), Charlie (backup)
[Repeat]
```

With 6 engineers, the same pattern simply takes 6 weeks to cycle. For larger teams, add more people to the cycle first; only then consider 2-week rotations.

### Handoff Message Template

Every Monday morning, the outgoing on-call posts in #on-call:

```
👋 On-Call Handoff - Week of Jan 13 (Mon 9 AM → Mon 9 AM)

Outgoing: @alice → Incoming: @bob

Pages / incidents this week:
- Tuesday: Database alert, false positive
- Thursday: API latency, fixed by restarting cache

Notes for next week:
- Cache has been flaky, keep an eye on it
- Check the [runbook](/learn/runbook) for cache restarts if latency spikes again

@bob - can you confirm you're primary for this week?
```

The incoming on-call confirms:

```
✅ Confirmed, I'm on-call for this week
```

That's it. Two minutes. Done.

### Escalation Path (No-Response Rule)

Write this down and put it everywhere:

1. Page the primary (wait 5 minutes)
2. If no response: page the backup at 5 minutes (wait another 5 minutes)
3. If no response from the backup: escalate to the engineering manager at 10 minutes total (for SEV-0/SEV-1)

Note: For SEV-2+ incidents, escalate at 30 minutes or next business hours unless impact increases.
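The escalation ladder above is mechanical enough to encode. Here is a minimal Python sketch of the timing rule, assuming you track minutes elapsed since the first page and use this guide's severity labels; the function name and its return strings are illustrative, not taken from any real paging tool:

```python
def who_to_page(minutes_since_page: int, severity: str, acknowledged: bool) -> str:
    """Return who should be paged next under the no-response rule.

    SEV-0/SEV-1: primary -> backup at 5 min -> engineering manager at 10 min.
    SEV-2 and below: backup at 30 minutes (or next business hours).
    """
    if acknowledged:
        return "nobody - incident is owned"
    if severity in ("SEV-0", "SEV-1"):
        if minutes_since_page >= 10:
            return "engineering manager"
        if minutes_since_page >= 5:
            return "backup"
        return "primary"
    # SEV-2+: don't wake people; escalate slowly unless impact grows
    if minutes_since_page >= 30:
        return "backup"
    return "primary"

print(who_to_page(7, "SEV-1", acknowledged=False))  # prints "backup"
```

The point is that the rule has no judgment calls left in it: given a clock and a severity, anyone (or any bot) can decide who gets paged next.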
### Slack Channels to Create

Create #on-call for handoffs, schedule updates, and meta discussion. Create #incidents for incident declarations and coordination only. Optionally, create #incidents-private for customer details and security issues.

---

## Common On-Call Rotation Scenarios (Copy/Paste Rules)

**On-call person doesn't respond?** If there's no response, page the backup at 5 minutes. For SEV-0/SEV-1, escalate to the EM at 10 minutes total.

"We used to wait 30 minutes because we didn't want to bother people. Now we escalate at 5 minutes. It's not rude; it's responsible," a senior engineer told me.

Waiting feels polite, but it's expensive. Every minute you spend wondering "should I escalate?" is a minute where the incident is getting worse. Make escalation automatic.

**Someone is sick or unavailable?** Make it okay to say "I can't do this week." Post in #on-call for a swap, or have the engineering manager cover. If the process punishes real life, it won't survive contact with reality.

For more on coordinating across the team during incidents, see our [guide to scaling incident management](/blog/scaling-incident-management).

**Someone refuses to do on-call?** First, make sure your on-call isn't miserable. Are they getting paged for non-urgent things? Are they responding at 2 AM to things that could wait? Do they have a proper backup? Are they being compensated or recognized?

If the process is solid and someone still refuses, have a direct conversation. One VP of Engineering: "We made it clear: on-call is part of the role. If you're not willing to do it, we need to talk about role fit. Harsh but fair."

Most resistance I saw wasn't about on-call itself; it was about bad on-call. Fix the process first.

**Too small for formal on-call?** If you're under 20 people and pages are rare, you probably don't need a formal rotation. Just document "who to contact for what" and make sure coverage isn't falling on the same 1-2 people.
A CTO at a 12-person startup: "We don't have on-call rotations. Infrastructure issues go to @alice, frontend goes to @bob. It works. We'll revisit when we're bigger."

Don't add ceremony before you have the problem.

---

## When Spreadsheets Stop Working (and What to Add First)

Most teams start with a spreadsheet. That's fine. The pain shows up when:

- Nobody remembers to update the sheet
- People miss handoffs because there's no reminder
- You waste time figuring out "who's on-call right now?" during an incident
- Shift swaps require manual coordination
- You're coordinating on-call across multiple services or time zones

At that point, either add a small Slack layer (visibility + reminders) or adopt scheduling software. PagerDuty and Opsgenie make sense when you have multiple services, complex schedules, and real 24/7 requirements. They're powerful but often overkill for smaller teams. A lighter option can help earlier if it lives in Slack and removes "who's on-call?" confusion.

A platform lead at a 100-person company: "We used a Google Sheet for years. Once we hit 80 people and multiple services, we switched. The sheet was getting unwieldy."

Another team at 40 people: "The sheet works for us. But we built a Slack bot to post who's on-call every morning. That solved 90% of our pain."

---

## Start This Week (20 Minutes)

Keep it boring. Here's the minimum viable setup:

1. Pick a primary + backup for this week (write it down)
2. Post in #on-call: "Primary: @alice · Backup: @bob · No-response rule: 5 minutes → backup · SEV-0/1: 10 minutes → EM"
3. Set a recurring reminder for the Monday 9 AM handoff
4. Document common fixes in a [runbook](/learn/runbook) so the next person doesn't start from scratch
5. Keep the rules stable for 4 weeks, then adjust based on pages and misses

That's it. Start simple, and add complexity only when you hit pain points. The goal isn't elegance. It's eliminating "who owns this?" when production is on fire.
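Step 2, the daily visibility post, is easy to automate. A minimal sketch using a Slack incoming webhook; the webhook URL below is a placeholder you would generate in your own workspace, and you would trigger the script from cron or any scheduler:

```python
import json
import urllib.request

# Placeholder: create a real incoming webhook in your Slack workspace.
WEBHOOK_URL = "https://hooks.slack.com/services/XXX/YYY/ZZZ"

def daily_oncall_message(primary: str, backup: str) -> dict:
    """Build the daily 'who's on-call' post for #incidents."""
    return {
        "text": (
            f"On-call: @{primary} · Backup: @{backup} · "
            "No-response rule: 5 min → backup · SEV-0/1: 10 min → EM"
        )
    }

def post_to_slack(message: dict) -> None:
    """POST the message to the incoming webhook as JSON."""
    req = urllib.request.Request(
        WEBHOOK_URL,
        data=json.dumps(message).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)  # Slack replies "ok" on success

# To actually post (requires a real webhook URL):
# post_to_slack(daily_oncall_message("alice", "bob"))
```

Incoming webhooks accept a simple `{"text": ...}` JSON payload, which is all this needs; anything fancier (mentions that actually ping users, schedule lookups) is where a real bot or tool starts to earn its keep.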
---

## FAQ

**How often should on-call rotate?** Weekly hits the sweet spot for most teams under 50 people. Daily is too stressful. Monthly is too long. Some larger teams with 80+ people do 2-week rotations.

**What if the on-call person is on vacation?** Plan ahead. Don't schedule people for on-call right before or after vacations. If emergencies happen, let people swap shifts or have the engineering manager cover.

**Should on-call get paid extra?** Most teams do some form of compensation: a flat stipend of $100-500/week, comp days if paged overnight, or extra PTO. It's not required, but it recognizes the burden.

**What if someone refuses to do on-call?** First, make sure your on-call process isn't miserable. Are they getting paged for non-urgent things? Do they have a backup? If the process is solid and someone still refuses, have a direct conversation about role expectations.

**How do we handle time zones?** Prefer regional coverage blocks. If you can't, align on-call with known risk windows and avoid repeated 2 AM pages.

**When should we switch from spreadsheets to on-call software?** When "who's on-call?" costs minutes, swaps are frequent, or you're coordinating across time zones or services. If a spreadsheet + Slack bot works, stick with it.

**What's the difference between on-call and incident management?** On-call is who responds. Incident management is how the team coordinates, documents, and communicates once the response starts. You need both. [Read our post-incident review template guide with 3 downloadable formats and action-item tracking](/blog/post-incident-review-template) for the documentation part.

**How do we handle on-call for engineers with families?** Same as everyone else: primary plus backup plus the 5-minute escalation. Some teams offer "family-friendly" rotations where people with young children can opt into backup-heavy roles or take shifts during school hours. But the structure stays the same.
Don't assume people with families can't do on-call; ask them what they need.

---

**Want the next step?** Read [our post-incident review template guide with 3 downloadable formats and action-item tracking](/blog/post-incident-review-template).

---

## Looking for On-Call Management Software?

We're building on-call for Slack: auto-handoff reminders, one-click escalation, and the rotation visible in your #incidents channel. No separate app to check. Built for teams of 20-100 people who think PagerDuty is overkill.

[Join the waitlist for Q1 early access](/contact)

---
On-Call Rotation Templates & The 2-Minute Handoff Guide
Move your on-call from a Google Sheet to a repeatable system. Learn our 2-minute handoff framework and get templates for primary and backup rotations.