A reliability playbook for Heads of SRE turning availability targets into measured outcomes: honest SLOs from the customer's perspective, error budgets that actually change behavior, and the deploy hygiene that prevents most incidents.
Your dashboard says four. Your customers know.
Five-nines is 5.26 minutes of downtime per year. It is a small number. It is also the number written into healthcare contracts that drive renewal and reference revenue.
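The arithmetic behind that figure is short enough to check by hand. A minimal sketch, assuming a 365.25-day year; the function name `allowed_downtime_minutes` is illustrative, not part of any tool:

```python
MINUTES_PER_YEAR = 365.25 * 24 * 60  # 525,960 minutes, averaging in leap years


def allowed_downtime_minutes(availability: float,
                             window_minutes: float = MINUTES_PER_YEAR) -> float:
    """Minutes of downtime an availability target permits over a window."""
    if not 0.0 < availability <= 1.0:
        raise ValueError("availability must be in (0, 1]")
    return (1.0 - availability) * window_minutes


for label, target in [("three nines", 0.999),
                      ("four nines", 0.9999),
                      ("five nines", 0.99999)]:
    print(f"{label}: {allowed_downtime_minutes(target):.2f} min/year")
# three nines: 525.96 min/year
# four nines: 52.60 min/year
# five nines: 5.26 min/year
```

Passing a one-week window (10,080 minutes) puts the same five-nines target at roughly six seconds per week.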
The aspirational SLO is the failure mode. The team commits to five-nines because the contract demands it, then measures from inside the platform where the number flatters them, and learns the truth from customer escalations.
SLOs defined from the customer's perspective. Per critical user journey. Measured from outside the platform on the paths the customer actually uses, so the number on the dashboard is the number the customer feels.
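One way to picture a per-journey, outside-in measurement is the sketch below. It assumes external probes report one result per synthetic request. The journey names, the `ProbeResult` shape, and the 2,000 ms latency threshold are hypothetical placeholders, not values from any contract.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class ProbeResult:
    journey: str       # critical user journey name (hypothetical labels below)
    ok: bool           # the request succeeded from the customer's point of view
    latency_ms: float  # measured at the external probe, not inside the platform


def journey_sli(results, latency_threshold_ms=2000.0):
    """Fraction of good probes per journey; good means succeeded AND fast enough."""
    good = defaultdict(int)
    total = defaultdict(int)
    for r in results:
        total[r.journey] += 1
        if r.ok and r.latency_ms <= latency_threshold_ms:
            good[r.journey] += 1
    return {journey: good[journey] / total[journey] for journey in total}


results = [
    ProbeResult("login", True, 180.0),
    ProbeResult("login", True, 2500.0),   # succeeded, but too slow to count as good
    ProbeResult("login", False, 90.0),
    ProbeResult("login", True, 210.0),
    ProbeResult("checkout", True, 300.0),
    ProbeResult("checkout", True, 320.0),
]
print(journey_sli(results))  # {'login': 0.5, 'checkout': 1.0}
```

Counting a slow success as a failure is a deliberate choice in this sketch: a page that loads after the customer has given up is an outage from where they sit.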
Error budgets are calculated weekly. When the budget is on track, the team ships. When the budget is burning, the team stops shipping and works the burn. The budget is a rule, not a suggestion.
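The weekly calculation can be sketched as follows. This assumes a request-based SLO and a freeze once the week's budget is fully consumed. The `freeze_at` threshold and the names here are assumptions for illustration; a team might reasonably freeze earlier.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BudgetStatus:
    allowed_bad: float        # bad events the SLO tolerates this week
    consumed_fraction: float  # 1.0 means the week's budget is fully spent
    can_ship: bool


def weekly_budget(slo_target: float, total_events: int, bad_events: int,
                  freeze_at: float = 1.0) -> BudgetStatus:
    """Compare the week's bad events with what the SLO target allows."""
    allowed_bad = (1.0 - slo_target) * total_events
    if allowed_bad > 0:
        consumed = bad_events / allowed_bad
    else:
        # A 100% target (or no traffic) leaves no budget to spend.
        consumed = 0.0 if bad_events == 0 else float("inf")
    return BudgetStatus(allowed_bad, consumed, consumed < freeze_at)


on_track = weekly_budget(slo_target=0.999, total_events=1_000_000, bad_events=400)
burning = weekly_budget(slo_target=0.999, total_events=1_000_000, bad_events=1_200)
print(on_track.can_ship, round(on_track.consumed_fraction, 2))  # True 0.4
print(burning.can_ship, round(burning.consumed_fraction, 2))    # False 1.2
```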
Most incidents are deployment-related. Deployment hygiene reduces them — progressive rollout, automated rollback, change windows for the riskiest services, and a kill switch on every new path to production.
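A minimal sketch of how progressive rollout, automated rollback, and a kill switch fit together. The three callables stand in for whatever traffic-shifting, metrics, and flag systems a platform actually uses; the stage percentages and error threshold are hypothetical, and a real rollout would also wait out a bake period at each stage before reading the error rate.

```python
from typing import Callable, Sequence


def progressive_rollout(
    set_traffic_percent: Callable[[int], None],
    error_rate: Callable[[], float],
    kill_switch_on: Callable[[], bool],
    stages: Sequence[int] = (1, 5, 25, 50, 100),
    max_error_rate: float = 0.001,
) -> bool:
    """Advance stage by stage; drop to 0% on a bad signal. True if fully rolled out."""
    for percent in stages:
        if kill_switch_on():
            set_traffic_percent(0)  # an operator pulled the switch
            return False
        set_traffic_percent(percent)
        if error_rate() > max_error_rate:
            set_traffic_percent(0)  # automated rollback
            return False
    return True


# Simulated run: the error rate spikes once 25% of traffic is on the new version.
history = []
rates = iter([0.0002, 0.0004, 0.02])
done = progressive_rollout(history.append, lambda: next(rates), lambda: False)
print(done, history)  # False [1, 5, 25, 0]
```

The small early stages are what keep a bad deploy cheap: in this run the fault is caught before three quarters of the traffic ever sees it.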
If your reliability claims do not match what your customers measure, the gap is operational discipline.
Because that is what the customer experiences. Internal SLOs flatter the platform; external SLOs reveal it.
Synthetic traffic, chaos drills, read-only failovers before write failovers. We never test failover for the first time on real customer traffic.
The team stops feature work on that service and works the burn: root-cause the recent incidents, harden the deploy path, and resume shipping only when the budget recovers.
Initially, yes, when the budget burns. Long term, no — fewer incidents means more capacity for shipping.
No. Five-nines belongs on the critical user journeys the contract covers. Internal tooling and exploration tiers run lower targets so the engineering investment goes where it matters.