At 3am, under pressure, no one invents a good process. You don't rise to the occasion. You fall to your runbook. This is the runbook to fall to, and it is yours to copy, fill in, and put where on-call can reach it in under a minute.
The teams that recover fast are not braver or smarter in the moment. They have a plan they already filled in, tested, and trust. The difference shows up in minutes, and minutes are money.
The common pattern: improvise the response live, hope the right person is awake, and write the process down after the outage.
The approach that works: a filled-in runbook with a severity matrix, named roles, and pre-written comms that on-call can follow without thinking.
Severity drives everything else: how fast you respond, who gets paged, and how often you communicate. Pick the level early.
Name an incident commander, a comms lead, and an ops/scribe.
Stop the bleeding first. Roll back, flip a feature flag, or fail over.
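One way to keep severity from becoming a debate is to encode the matrix as data, so a single lookup answers "how fast, who, and how often." The sketch below is a minimal illustration, not a prescribed format. The level names beyond SEV1 and SEV2, the role names, and every number are hypothetical placeholders to replace with your own definitions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SeverityPolicy:
    respond_within_min: int   # time allowed before first response
    page: tuple[str, ...]     # roles paged at declare time
    update_every_min: int     # comms cadence while the incident is open


# Hypothetical values for illustration only. Replace with your own matrix.
SEVERITY_MATRIX = {
    "SEV1": SeverityPolicy(5, ("primary", "secondary", "incident_commander"), 15),
    "SEV2": SeverityPolicy(15, ("primary", "incident_commander"), 30),
    "SEV3": SeverityPolicy(60, ("primary",), 120),
    "SEV4": SeverityPolicy(480, (), 1440),
}


def policy_for(severity: str) -> SeverityPolicy:
    """Return the response policy for a severity label such as 'sev1'."""
    try:
        return SEVERITY_MATRIX[severity.strip().upper()]
    except KeyError:
        raise ValueError(f"unknown severity {severity!r}") from None
```

Calling `policy_for("sev2")` returns the SEV2 row, and an unknown label raises `ValueError` rather than silently picking a default, which forces whoever declares the incident to choose a real level.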
Four severity levels, each with a clear definition, real examples, a response time, who gets paged, and a comms cadence.
The three roles you assign at declare time: incident commander, comms lead, and ops/scribe.
Pre-written internal and external status update formats.
A ready-to-fill escalation table with primary on-call, secondary on-call, and an escalation path for every critical service.
Set severity fast, assign roles, stop the bleeding before chasing root cause, communicate on a cadence, and review without blame.
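The escalation table can also live as a small lookup that on-call tooling reads, assuming you keep one ordered path per critical service. This is a sketch under that assumption. The service name `checkout-api`, the function name `next_contact`, and the bracketed entries are hypothetical and stand in for your real values.

```python
# Ordered escalation path per critical service: primary, secondary, then up.
# Hypothetical service name; bracketed entries are placeholders to fill in.
ESCALATION = {
    "checkout-api": [
        "[primary on-call]",
        "[secondary on-call]",
        "[engineering manager]",
    ],
}


def next_contact(service: str, unacked_pages: int) -> str:
    """Return who to page next, given how many pages have gone unacknowledged."""
    if unacked_pages < 0:
        raise ValueError("unacked_pages must be non-negative")
    path = ESCALATION[service]  # KeyError means the service has no path yet
    # Stay on the last entry once the path is exhausted.
    return path[min(unacked_pages, len(path) - 1)]
```

A missing service raises `KeyError` on purpose: a critical service with no escalation path is a gap to fix before the next incident, not something to paper over with a default.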
An incident response runbook is a single document your on-call team follows when something breaks. It defines severity levels, who does what, how to escalate, what to tell customers, and how to review the incident afterward. The point is to remove guesswork at the worst possible moment, so the person who picks up the page knows what to do, who to call, and what to say.
Replace every bracketed placeholder with your own values. Fill the escalation table with your real on-call names and paths, set your severity definitions, and pin the link in your incident channel and alerting tool. Then test it in a game day before you rely on it in production.
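A quick way to confirm nothing was missed is to scan the filled-in runbook for leftover brackets before the game day. The sketch below assumes placeholders look like `[NAME]` on a single line and skips markdown-style links such as `[docs](https://...)`. The function name `unfilled_placeholders` is made up for this example.

```python
import re

# A bracketed span on one line that is not immediately followed by "(",
# so markdown-style links are not reported as placeholders.
PLACEHOLDER = re.compile(r"\[[^\[\]\n]+\](?!\()")


def unfilled_placeholders(text: str) -> list[tuple[int, str]]:
    """Return (line_number, placeholder) pairs still present in the text."""
    return [
        (lineno, match.group(0))
        for lineno, line in enumerate(text.splitlines(), start=1)
        for match in PLACEHOLDER.finditer(line)
    ]
```

Run it over the runbook text and treat any non-empty result as a failed check. Wiring it into whatever validates the document before you pin the link keeps a half-filled template from reaching on-call.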
Review the runbook after every SEV1 and SEV2, and on a fixed cadence even when nothing breaks. A runbook that drifts from reality is worse than none, because people trust it and it misleads them. Update it for anything it got wrong, and confirm the escalation table is current for every critical service.
This runbook is for SRE leads, on-call engineers, and platform teams. If you run production services and carry a pager, this is for you. It works for a two-person startup and a large platform group alike. Use it as written or strip it for parts and keep the pieces that fit your setup.
A blameless review asks what in the system let the incident happen, not who to blame. People act reasonably given what they know in the moment. If a person could trigger the incident, the system allowed it, and the system is what you fix. No names are attached to fault, which is what makes people honest about what really happened.