BLUEPRINT

Incident Response Runbook Template

At 3am, under pressure, no one invents a good process. You don't rise to the occasion. You fall to your runbook. This is the runbook to fall to, and it is yours to copy, fill in, and put where on-call can reach it in under a minute.

Download White Paper

Train the runbook, not the heroics

The teams that recover fast are not braver or smarter in the moment. They have a plan they have already filled in and tested, and that they trust. The difference shows up in minutes, and minutes are money.

The common pattern: improvise the response live, hope the right person is awake, and write the process down after the outage.
The approach that works: a filled-in runbook with a severity matrix, named roles, and pre-written comms that on-call can follow without thinking.


The Numbers That Make This a Board-Level Conversation

$5,600 per minute

the average cost of IT downtime, per Gartner, which works out to roughly $336,000 an hour.

$9,000 per minute

what enterprise downtime now costs on average, with many large firms reporting

2.5x

more frequent deployments and faster recovery
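To put the per-minute figure into your own terms, here is a minimal sketch of the arithmetic. Only the $5,600 per minute rate comes from the numbers above. The 45-minute outage and the function name `downtime_cost` are hypothetical, chosen for illustration.

```python
# Per-minute rate quoted above. Swap in your own estimate if you have one.
COST_PER_MINUTE_USD = 5_600


def downtime_cost(minutes: int, cost_per_minute: int = COST_PER_MINUTE_USD) -> int:
    """Return the estimated cost in USD of an outage lasting `minutes`."""
    if minutes < 0:
        raise ValueError("minutes must not be negative")
    return minutes * cost_per_minute


# One hour reproduces the hourly figure; 45 minutes is a made-up example outage.
print(f"60 minutes: ${downtime_cost(60):,}")
print(f"45 minutes: ${downtime_cost(45):,}")
```

At this rate every minute saved by following a runbook instead of improvising is worth $5,600, which is the case for filling it in before the outage.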

The Three Moves Every On-Call Team Needs

Set severity in the first two minutes

Severity drives everything else: how fast you respond, who gets paged, and how often you communicate. Pick the level early.

Assign roles the moment you declare

Name an incident commander, a comms lead, and a scribe.

Mitigate before you chase root cause

Stop the bleeding first. Roll back, flip a feature flag, or fail over.

What's Inside the Runbook

Severity matrix

Four severity levels, each with a clear definition, real examples, a response time, who gets paged, and a comms cadence.
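If you keep the matrix next to your alerting config, it can also live as data. The sketch below shows one possible shape, assuming the levels are named SEV1 through SEV4. Every definition, time, and pager target in it is an illustrative placeholder, not a value from the template.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SeverityLevel:
    name: str
    definition: str
    response_minutes: int       # how quickly someone must respond
    pages: tuple[str, ...]      # who gets paged
    update_every_minutes: int   # comms cadence


# Placeholder values for illustration only. Replace with your own.
SEVERITY_MATRIX = {
    level.name: level
    for level in (
        SeverityLevel("SEV1", "Full outage or data loss", 5,
                      ("primary", "secondary", "engineering-manager"), 30),
        SeverityLevel("SEV2", "Major feature degraded", 15,
                      ("primary", "secondary"), 60),
        SeverityLevel("SEV3", "Minor feature degraded", 60,
                      ("primary",), 240),
        SeverityLevel("SEV4", "Cosmetic or low-impact issue", 1440,
                      (), 1440),
    )
}


def lookup(severity: str) -> SeverityLevel:
    """Return the matrix row for a severity name, ignoring case."""
    try:
        return SEVERITY_MATRIX[severity.upper()]
    except KeyError:
        raise ValueError(
            f"Unknown severity {severity!r}; expected one of {sorted(SEVERITY_MATRIX)}"
        ) from None


row = lookup("sev1")
print(row.pages, row.update_every_minutes)
```

Keeping response time, paging, and cadence in one row means picking the level answers all three questions at once.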

Incident roles

The three roles you assign at declare time: incident commander, comms lead, and ops/scribe.
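One way to make "assign at declare time" hard to skip is to refuse to open an incident until all three roles have a name. This is a sketch under that assumption. The class `IncidentRoles` and the people named in it are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IncidentRoles:
    incident_commander: str
    comms_lead: str
    scribe: str

    def __post_init__(self) -> None:
        # Reject blank assignments so no role is left implicit.
        for role, person in vars(self).items():
            if not person.strip():
                raise ValueError(f"{role} must be assigned at declare time")


# Hypothetical names for illustration.
roles = IncidentRoles(incident_commander="alice", comms_lead="bob", scribe="carol")
print(roles)
```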

Communications templates

Pre-written internal and external status update formats.
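As an illustration of what "pre-written" can mean in practice, the sketch below fills a status update from a fixed format using Python's standard `string.Template`. The wording and field names are assumptions, not the template's own text.

```python
from string import Template

# Hypothetical format. The runbook's own wording may differ.
EXTERNAL_UPDATE = Template(
    "[$severity] $service: $summary\n"
    "Impact: $impact\n"
    "Next update by: $next_update"
)


def render_update(**fields: str) -> str:
    """Fill the status update. Raises KeyError if any field is missing."""
    return EXTERNAL_UPDATE.substitute(**fields)


print(render_update(
    severity="SEV2",
    service="checkout",
    summary="Elevated error rates",
    impact="Some payments fail",
    next_update="14:30 UTC",
))
```

`substitute` fails loudly on a missing field, so a half-filled update cannot go out by accident.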

Escalation table and blameless review

A ready-to-fill escalation table with primary on-call, secondary on-call, and an escalation path for every critical service.
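The escalation table maps naturally onto a lookup keyed by service. The sketch below assumes one row per service with a primary, a secondary, and an ordered escalation path. The service name and people are hypothetical.

```python
# Hypothetical rows for illustration. Fill in your real services and names.
ESCALATION_TABLE = {
    "checkout-api": {
        "primary": "alice",
        "secondary": "bob",
        "escalation": ["payments-lead", "vp-engineering"],
    },
}


def page_order(service: str) -> list[str]:
    """Return who to page for a service, in order, starting with primary on-call."""
    entry = ESCALATION_TABLE.get(service)
    if entry is None:
        raise KeyError(f"No escalation row for {service!r}; add one before relying on it")
    return [entry["primary"], entry["secondary"], *entry["escalation"]]


print(page_order("checkout-api"))
```

A missing row raises an error instead of returning an empty list, so a gap in the table shows up in a game day and not during an outage.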

Incidents are won before they start

Set severity fast, assign roles, stop the bleeding before chasing root cause, communicate on a cadence, and review without blame.


Frequently Asked Questions

What is an incident response runbook?

It is a single document your on-call team follows when something breaks. It defines severity levels, who does what, how to escalate, what to tell customers, and how to review the incident afterward. The point is to remove guesswork at the worst possible moment, so the person who picks up the page knows what to do, who to call, and what to say.

How do I customize it for my team?

Replace every bracketed placeholder with your own values. Fill the escalation table with your real on-call names and paths, set your severity definitions, and pin the link in your incident channel and alerting tool. Then test it in a game day before you rely on it in production.
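Before the game day, it helps to confirm that no placeholder was left behind. The sketch below scans runbook text for leftover brackets. It assumes placeholders are written in uppercase inside square brackets, such as [PRIMARY ON-CALL]; adjust the pattern if your copy uses a different convention.

```python
import re

# Assumed convention: uppercase text in square brackets, e.g. [PRIMARY ON-CALL].
PLACEHOLDER = re.compile(r"\[[A-Z][A-Z0-9 _-]*\]")


def unfilled_placeholders(text: str) -> list[str]:
    """Return the distinct placeholders still present in the text, sorted."""
    return sorted(set(PLACEHOLDER.findall(text)))


# Hypothetical excerpt of a partly filled runbook.
sample = (
    "Primary on-call: [PRIMARY ON-CALL]\n"
    "Secondary on-call: dana\n"
    "Incident channel: [INCIDENT_CHANNEL]\n"
)
print(unfilled_placeholders(sample))
```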

How often should we update the runbook?

Review it after every SEV1 and SEV2, and on a fixed cadence even when nothing breaks. A runbook that drifts from reality is worse than none, because people trust it and it misleads them. Update it for anything it got wrong, and confirm the escalation table is current for every critical service.

Who is this template for?

SRE leads, on-call engineers, and platform teams. If you run production services and carry a pager, this is for you. It works for a two-person startup and a large platform group alike. Use it as written or strip it for parts and keep the pieces that fit your setup.

What makes a post-incident review blameless?

A blameless review asks what in the system let the incident happen, not who to blame. People act reasonably given what they know in the moment. If a person could trigger the incident, the system allowed it, and the system is what you fix. No names are attached to fault, which is what makes people honest about what really happened.