Site reliability engineering (SRE) is the discipline of running production systems by applying software engineering to operations problems: reliability gets defined numerically, measured continuously, and engineered deliberately, and the work of keeping systems running gets automated rather than performed. The canonical compression, from the Google team that named the field: SRE is what happens when you ask a software engineer to design an operations team.
The discipline's core mechanism is the SLO and error budget loop. A service level objective (SLO) states, numerically, how reliable a service should be from its users' perspective (99.9% of requests succeed within 300ms, measured over 30 days). The gap between the SLO and perfection is the error budget: the quantity of unreliability the service is allowed to spend. While budget remains, teams ship features at full speed; when the budget is exhausted, releases slow and engineering shifts to reliability work, by prior agreement rather than by crisis negotiation. The loop's genius is that it converts the eternal tug-of-war between velocity and stability into an explicit, self-regulating contract: reliability stops being a virtue argued about and becomes a resource managed.
The second pillar is the war on toil. Toil is the operational work that is manual, repetitive, automatable, and scales linearly with the service (restarting things, provisioning by hand, responding to the same alert weekly); SRE practice caps it (Google's formulation: no more than 50% of an SRE's time) and treats the other half as engineering time spent eliminating toil's sources. The cap is structural, not aspirational: a team allowed to drown in toil becomes an operations team with a fancier title, which is the discipline's best-documented failure mode.
Around these pillars sits a recognizable practice set: golden-signal monitoring and alerting tuned to symptoms users feel rather than causes engineers imagine, blameless postmortems that extract systemic fixes from incidents instead of extracting apologies from individuals, structured incident response with defined roles, capacity planning, and release engineering (the gradual rollouts and automated verification that make change safe, since change causes most outages). SRE shares DevOps's goals and differs in form: DevOps is a broad cultural movement about collapsing the wall between development and operations; SRE is a specific, opinionated implementation with named roles, numeric contracts, and prescribed mechanisms.
This page covers the SLO/error budget machinery, the toil economics, the incident and postmortem culture, and what adopting SRE actually requires at organizations that are not Google, where the discipline's transplant record is mixed for reasons worth understanding.
The measurement stack has three layers that get conflated. SLIs (service level indicators) are the metrics: the actual measured ratio of good events to total events (successful requests, fast-enough responses, fresh-enough data). SLOs are the targets set on those indicators: internal engineering contracts stating what good enough means. SLAs are the external legal versions: customer contracts with penalties, set deliberately looser than the internal SLOs that protect them. The ordering matters: SLIs are facts, SLOs are decisions, SLAs are lawyers' renderings of decisions, and organizations that start with the SLA and work backwards usually discover they never decided what they were actually promising.
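The three layers can be made concrete in a few lines. What follows is a minimal sketch rather than a reference implementation: the function name `sli`, the event counts, and both target values are illustrative assumptions, not figures from any real service.

```python
def sli(good_events: int, total_events: int) -> float:
    """SLI: the measured ratio of good events to total events."""
    if total_events == 0:
        return 1.0  # assumption: a window with no traffic counts as no failures
    return good_events / total_events


# Hypothetical counts for one 30-day window.
measured = sli(good_events=998_700, total_events=1_000_000)  # the fact

SLO_TARGET = 0.999  # the internal decision (assumed value)
SLA_TARGET = 0.995  # the external contract, deliberately looser (assumed value)

slo_met = measured >= SLO_TARGET
sla_met = measured >= SLA_TARGET
print(f"SLI={measured:.4%} slo_met={slo_met} sla_met={sla_met}")
```

With these numbers the service misses its internal SLO while still honoring the SLA, which is exactly the margin the looser SLA is meant to provide.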
Good SLOs are user-centric, few, and deliberately imperfect. User-centric: measure what users experience (request success at the load balancer, end-to-end checkout completion), not what infrastructure reports (CPU, node health), because users do not experience CPU. Few: three to five per service, covering availability, latency, and correctness dimensions that matter; twenty SLOs are a dashboard, not a contract. Deliberately imperfect: 100% is the wrong target for almost everything, because each additional nine costs roughly ten times the last and users cannot distinguish 99.99% from 99.999% through their own flaky WiFi. The right target is the cheapest one users will not notice, which is an economic finding, not an engineering aspiration.
The error budget is what gives SLOs teeth. A 99.9% monthly SLO yields roughly 43 minutes of allowed full downtime (or its equivalent in partial degradation); that allowance is the budget, and the policy attached to it is the actual governance instrument: while budget remains, ship freely; when it burns too fast, slow releases, add review, divert engineering to reliability; when it is exhausted, freeze feature releases until the budget recovers. The policy must be agreed by product and engineering leadership in advance, because its entire value is that nobody negotiates it during the incident. An SLO without an agreed budget policy is a chart; the policy is the contract.
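The budget arithmetic and the three-state policy above can be sketched as follows. The function names and the policy labels ("ship", "slow", "freeze") are hypothetical, and a real policy would define "burning too fast" precisely; here it is simply passed in as a flag.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Allowed full-downtime minutes implied by an availability SLO."""
    window_minutes = window_days * 24 * 60
    return (1.0 - slo) * window_minutes


def release_policy(budget_remaining: float, burning_fast: bool) -> str:
    """Map budget state to a pre-agreed action (labels are illustrative).

    budget_remaining is the fraction of the window's budget still unspent.
    """
    if budget_remaining <= 0.0:
        return "freeze"  # feature releases stop until the budget recovers
    if burning_fast:
        return "slow"    # slow releases, add review, divert engineering
    return "ship"        # budget remains: ship freely


for slo in (0.99, 0.999, 0.9999):
    print(f"{slo:.2%} -> {error_budget_minutes(slo):.1f} min per 30 days")
```

For a 99.9% target over 30 days the function returns about 43.2 minutes, matching the figure in the text. The point of encoding the policy as a pure function of budget state is that the decision is made before the incident, not during it.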
Alerting follows from budget mathematics rather than threshold guessing. Burn-rate alerting (page when the budget is being consumed at a rate that would exhaust it in hours; ticket when the slow leak would exhaust it in days) replaces the classic pathology of threshold alerts: hundreds of cause-based rules (disk 80%, CPU high) that page humans for conditions users never feel, training responders to ignore the pager that then cries wolf through a real outage. Symptom-based, budget-derived alerting is the single most transferable SRE technique, and teams can adopt it without adopting anything else.
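A burn-rate check reduces to one division and two comparisons. The sketch below assumes a single measurement window and a full budget by default; the 24-hour paging threshold and the 240-hour ticket threshold are placeholder assumptions, and production setups commonly evaluate several windows at once to avoid paging on short spikes.

```python
def burn_rate(error_ratio: float, slo: float) -> float:
    """How many times faster than sustainable the budget is being spent.

    A burn rate of 1.0 spends exactly the whole budget over the SLO window.
    """
    return error_ratio / (1.0 - slo)


def hours_to_exhaustion(error_ratio: float, slo: float,
                        budget_remaining: float = 1.0,
                        window_days: int = 30) -> float:
    """Hours until the remaining budget is gone if the current rate persists."""
    rate = burn_rate(error_ratio, slo)
    if rate <= 0.0:
        return float("inf")
    return budget_remaining * window_days * 24 / rate


def alert_action(error_ratio: float, slo: float,
                 page_within_hours: float = 24.0,     # assumed threshold
                 ticket_within_hours: float = 240.0,  # assumed threshold
                 ) -> str:
    """Page on fast burns, ticket on slow leaks, stay silent otherwise."""
    remaining = hours_to_exhaustion(error_ratio, slo)
    if remaining <= page_within_hours:
        return "page"
    if remaining <= ticket_within_hours:
        return "ticket"
    return "none"


print(alert_action(error_ratio=0.05, slo=0.999))
print(alert_action(error_ratio=0.005, slo=0.999))
```

Against a 99.9% SLO, a 5% error ratio is a burn rate of about 50 and would exhaust a full 30-day budget in roughly 14 hours, so it pages; a 0.5% error ratio would take about six days, so it files a ticket.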
The budget also prices architectural honesty. Dependencies compound: a service with a 99.9% target built atop five dependencies each offering 99.9% cannot meet its own number without resilience engineering (retries, timeouts, fallbacks, redundancy) to mask dependency failures. The SLO machinery surfaces this arithmetic early, which is why mature organizations treat SLOs as architectural inputs (what must this design survive?) rather than monitoring afterthoughts, and why the hardest SLO conversations are about the dependency chain nobody wants to own.
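The compounding is easy to verify. This sketch assumes dependency failures are independent and that the service needs every dependency to be up, which is the pessimistic serial case; `with_redundancy` is a hypothetical helper standing in for the simplest form of resilience engineering (independent replicas where any one suffices), and it does not model retries, timeouts, or fallbacks.

```python
import math


def serial_availability(dependency_availabilities: list[float]) -> float:
    """Availability ceiling when every dependency must be up.

    Assumes failures are independent.
    """
    return math.prod(dependency_availabilities)


def with_redundancy(availability: float, replicas: int) -> float:
    """Availability of N independent replicas where any one suffices."""
    return 1.0 - (1.0 - availability) ** replicas


ceiling = serial_availability([0.999] * 5)
masked = serial_availability([with_redundancy(0.999, 2)] * 5)
print(f"five serial 99.9% dependencies: {ceiling:.4%}")
print(f"same chain, each dependency duplicated: {masked:.4%}")
```

Five serial dependencies at 99.9% cap the service at about 99.5% before it contributes any failures of its own, which is why the 99.9% target is unreachable without masking.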
Toil has a precise definition worth preserving: work that is manual, repetitive, automatable, tactical (interrupt-driven rather than strategic), devoid of enduring value (the system is no better after the work than before), and scaling linearly with service growth. The definition excludes things teams mislabel: incident response to novel failures is not toil (it is unavoidable operational work), and neither is genuine engineering judgment. The category exists to be measured and attacked, and the measurement (what fraction of team time goes to toil?) is the health metric that distinguishes SRE teams from relabeled ops teams.
The 50% cap is an economic argument, not a comfort policy. Every hour of toil performed is an hour not spent eliminating the toil's source, so a team past the tipping point degenerates: toil grows with the service, engineering time shrinks, the automation debt compounds, and within a few quarters the team is purely reactive, which is precisely the operations model SRE was designed to replace. The cap forces the investment decision (automate, redesign the service, refuse the work, or staff up) before the spiral, and teams that enforce it report the counterintuitive result: operational load stays roughly flat while the services they run grow severalfold, because the engineering half of the time compounds.
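Measuring the toil fraction needs nothing more than classified time entries. The record shape below (`TimeEntry`, with an `is_toil` flag set by the team against the definition above) is a hypothetical one, as are the sample hours; most teams would source this from ticket or time-tracking data.

```python
from dataclasses import dataclass

TOIL_CAP = 0.50  # the canonical cap discussed above


@dataclass
class TimeEntry:
    engineer: str
    hours: float
    is_toil: bool  # classified by the team against the toil definition


def toil_fraction(entries: list[TimeEntry]) -> float:
    """Share of recorded hours spent on toil (0.0 if nothing is recorded)."""
    total = sum(e.hours for e in entries)
    if total == 0:
        return 0.0
    return sum(e.hours for e in entries if e.is_toil) / total


def over_cap(entries: list[TimeEntry], cap: float = TOIL_CAP) -> bool:
    """True when the team has crossed the cap and owes an investment decision."""
    return toil_fraction(entries) > cap


# Hypothetical week for a two-person team.
week = [
    TimeEntry("alice", 22.0, True),
    TimeEntry("alice", 18.0, False),
    TimeEntry("bob", 12.0, True),
    TimeEntry("bob", 28.0, False),
]
print(f"toil fraction: {toil_fraction(week):.1%}, over cap: {over_cap(week)}")
```

Tracked per quarter rather than per week, the same fraction shows whether the team is approaching the tipping point before the spiral starts.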
The automation ladder has rungs worth distinguishing: documented procedure (a human follows a runbook), automated procedure (a human triggers a script), gated automation (the system acts, a human approves), autonomous operation (the system detects, decides, acts, and reports), and finally designed-away (the service no longer has the failure mode; nothing needs to act). Each rung up trades implementation effort for permanent toil reduction, and the top rung is the quietly radical one: the best automation is the system redesign that makes the runbook unnecessary, which is why SREs need authority over service design and not just operations tooling.
Self-service platforms are toil elimination at organizational scale. The pattern: instead of SREs provisioning, configuring, and deploying on behalf of fifty product teams (linear toil), the SRE function builds the paved road (templates, pipelines, guardrails, golden paths) that lets teams operate themselves within safe boundaries. This converges SRE with platform engineering (the disciplines overlap heavily in practice), and a consistent division of labor emerges at mature organizations: platform teams build the road, while SRE defines what safe means (SLO frameworks, production-readiness standards) and handles the system-level reliability that no single team owns.
And toil discipline has a refusal component that organizations find culturally hard. SRE teams that accept every operational request become the company's outsourced hands; the practice requires saying no (or "yes, via the platform") to work that does not meet the bar, and handing pagers back to teams whose services are too operationally immature to support (the production-readiness review is the gate). These refusals are the mechanism that keeps the function an engineering discipline; organizations unwilling to grant SRE that authority get the title without the economics.
Incident response is structured because cognition degrades under stress. The practice imports incident command structure: a designated incident commander who coordinates (and explicitly does not debug), an operations lead doing the hands-on work, a communications lead handling stakeholders, with clean handoffs as incidents outlast shifts. The structure exists because the failure pattern it prevents (five engineers all debugging, nobody deciding, leadership pinging the people typing) is universal and expensive. Severity levels with predefined response expectations make engagement proportionate, and the on-call rotation underneath is staffed and compensated as real work, with page load monitored as a team health metric rather than endured as a rite.
Blamelessness is an epistemological position, not a nicety. The postmortem's premise: the engineer who typed the destructive command is the proximate cause; the system that made the command possible, plausible, and unverified is the actual cause, and only the second is fixable (replacing the engineer recruits a new person for the same trap). Blameless culture exists because punished engineers stop reporting near-misses, and near-miss reporting is the cheapest reliability data an organization can collect. The practical test of whether a culture is actually blameless: do postmortems for self-inflicted outages name the systemic gaps with the same rigor as vendor-inflicted ones, and do the authors of honest postmortems prosper?
The postmortem's value is entirely in its action items, which means the discipline lives or dies in follow-through. A postmortem that produces insight without owned, prioritized, deadline-bearing fixes is an essay; the recurring industry failure is the action-item graveyard (everything filed, nothing funded). Working systems treat postmortem actions as first-class engineering work (tracked in the same backlog, reviewed at the same cadence, with reliability work's share protected by the error-budget policy when budgets burn). The meta-metric worth watching: repeat incidents from known, unfixed causes, which is the precise measure of an organization writing fiction in its postmortems.
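The repeat-incident metric can be computed from two small inputs. The sketch assumes incidents carry a normalized cause label and that each cause has at most one fix date; the cause names and dates are invented for illustration.

```python
from datetime import date


def repeat_incidents(incidents, fix_dates):
    """Return incidents that recur from a cause already seen and not yet fixed.

    incidents: list of (date, cause) tuples, in any order.
    fix_dates: dict mapping cause -> date its postmortem action shipped;
               causes absent from the dict are still unfixed.
    """
    seen = set()
    repeats = []
    for when, cause in sorted(incidents):
        fixed_on = fix_dates.get(cause)
        already_fixed = fixed_on is not None and fixed_on <= when
        if cause in seen and not already_fixed:
            repeats.append((when, cause))
        seen.add(cause)
    return repeats


# Hypothetical incident log.
log = [
    (date(2024, 1, 5), "cert-expiry"),
    (date(2024, 2, 9), "cert-expiry"),
    (date(2024, 2, 20), "disk-full"),
    (date(2024, 3, 1), "disk-full"),
    (date(2024, 3, 15), "cert-expiry"),
]
fixes = {"disk-full": date(2024, 2, 25)}  # the cert-expiry action never shipped

print(repeat_incidents(log, fixes))
```

Here the two later cert-expiry incidents count as repeats, while the second disk-full incident does not, because a fix had shipped before it occurred; that one is a failed fix and deserves a separate look.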
The learning loop extends beyond failures. Mature practice reviews near-misses (the outage that almost happened teaches the same lesson at no customer cost), runs game days and chaos exercises (inject failures deliberately, verify the recovery path and the humans, in business hours while everyone is calm), and treats production readiness as a teachable checklist (new services pass an SRE review covering SLOs, monitoring, runbooks, capacity, and failure modes before receiving production traffic or SRE support). Each practice converts reliability from tribal knowledge into institutional capability, which is the actual asset.
And the culture's load-bearing wall is leadership behavior during budget exhaustion. The error-budget contract gets tested the first time it blocks a launch the business wants: if leadership overrides it casually, the entire machinery (SLOs, budgets, postmortem priorities) is revealed as decoration, and the organization reverts to arguing about reliability per-incident. Organizations where the first override became a leadership-level decision with a documented rationale kept the machinery intact; organizations where it became a precedent did not. SRE is, in the end, a set of agreements, and agreements are only as real as their enforcement under pressure.
The renaming failure dominates the adoption record. The pattern: the operations team becomes the "SRE team" by org-chart edit, inherits the same ticket queue, the same manual work, the same lack of engineering authority, and acquires a hiring problem (the title attracts engineers who leave upon discovering the reality). No mechanism changed: no SLOs with consequences, no toil cap, no postmortem discipline, no authority over production readiness. The diagnostic question for any SRE adoption: which specific decisions does the error budget now make, and what work did the team refuse last quarter because of the toil cap? No answer, no SRE.
Right-sized adoption starts with mechanisms, not headcount. The transferable core for a mid-size organization: SLOs on the handful of services that matter, with burn-rate alerting replacing threshold noise; an error-budget policy that product leadership has actually signed; blameless postmortems with tracked actions; and a toil measurement that informs automation investment. This is achievable by existing teams (often a single platform/infrastructure team carrying the framework) without a dedicated SRE org, and it captures most of the discipline's value. Dedicated SRE teams earn their existence at the scale where system-level reliability (the cross-service failure modes no product team owns) becomes a full-time engineering domain.
The embedding model matters as much as the existence of the function. The patterns in production use: centralized SRE (one team owns reliability frameworks and the highest-stakes services; risks becoming a bottleneck or an ivory tower), embedded SRE (reliability engineers seated within product teams; risks dissolving into feature work), and consulting SRE (the center sets standards and reviews readiness; teams operate their own services; scales best, demands the most cultural maturity). Most successful non-Google implementations land on the consulting model with a small centralized core, because the alternative models require either Google's staffing ratios or unusual discipline.
The prerequisites are unglamorous and frequently absent. SLO machinery presupposes observability (you cannot budget what you cannot measure); toil elimination presupposes automation-capable infrastructure (infrastructure as code, deployment pipelines); error-budget enforcement presupposes deployment control (gradual rollouts, fast rollback, the zero-downtime machinery); and all of it presupposes leadership willing to let a number slow a launch. Organizations missing these foundations should build them first and call the work what it is; SRE adopted atop manual deployments and absent monitoring is a vocabulary purchase.
What transfers universally, even where the full discipline does not: user-centric SLOs as the definition of reliability (replacing infrastructure-metric theater), burn-rate alerting (replacing pager fatigue), blameless postmortems with enforced follow-through (replacing the blame-and-forget cycle), and the toil concept as a budgeting lens (replacing the assumption that operational pain is weather rather than debt). Any team can adopt these four within a quarter, and they constitute the majority of the discipline's portable value, which makes them the honest starting point for everyone whose context is not a planet-scale service company.
The rhythm of a functioning SRE team is mostly not firefighting. A typical week: the on-call engineer handles pages and interrupt work (and only that, with the load tracked); the rest of the team works the engineering backlog (automation eliminating last quarter's toil, reliability improvements from postmortem actions, capacity work ahead of the traffic curve); the weekly review walks the SLO dashboards (which budgets are burning, which services are drifting toward trouble), the page statistics (is on-call load sustainable), and the postmortem action queue (what is overdue). The visible artifact of health is boring weeks; the visible artifact of dysfunction is heroics, celebrated.
Production-readiness reviews are the recurring gatekeeping work. A team wants to launch a new service or hand an existing one to SRE support: the review walks a standard checklist (SLOs defined and instrumented, dashboards and alerts in place, runbooks written, deployment and rollback automated, failure modes enumerated, capacity modeled), and the outcome is a punch list or a pass. Done well, the review is a teaching instrument that spreads reliability practice through the organization one launch at a time; done badly, it is a bureaucratic gate that teams route around, and the difference is whether the reviewing engineers also help close the gaps they find.
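The checklist-in, punch-list-out shape of the review can be sketched directly. The item identifiers mirror the checklist named above; the dictionary format for a service's state is a hypothetical convention.

```python
CHECKLIST = (
    "slos_defined_and_instrumented",
    "dashboards_and_alerts_in_place",
    "runbooks_written",
    "deploy_and_rollback_automated",
    "failure_modes_enumerated",
    "capacity_modeled",
)


def readiness_review(service_state: dict[str, bool]) -> list[str]:
    """Return the punch list: checklist items not yet satisfied.

    An empty list means the service passes. Missing keys count as unmet.
    """
    return [item for item in CHECKLIST if not service_state.get(item, False)]


# Hypothetical service partway through preparation.
candidate = {
    "slos_defined_and_instrumented": True,
    "dashboards_and_alerts_in_place": True,
    "runbooks_written": False,
    "deploy_and_rollback_automated": True,
}
punch_list = readiness_review(candidate)
print("pass" if not punch_list else f"punch list: {punch_list}")
```

Treating absent items as unmet keeps the gate honest: a service cannot pass by leaving a question unanswered.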
Capacity and performance work runs on a longer cycle: forecasting traffic against resource headroom, load-testing ahead of seasonal peaks, hunting the regressions that creep into latency budgets, and pushing the architectural changes (caching, sharding, queue redesigns) that keep the SLOs affordable as scale grows. This is the work that prevents the incidents that never happen, which makes it perpetually vulnerable to deprioritization, and protecting it is one of the error budget's quieter functions: a service burning budget on capacity-related degradation has a contractual claim on engineering time.
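The simplest forecast of traffic against headroom is a compound-growth runway calculation. This is a deliberately naive sketch: it assumes steady weekly growth and a fixed headroom reserve (both parameters are assumptions), and it ignores seasonality, which real forecasts cannot.

```python
import math


def weeks_until_exhausted(current_peak: float, capacity: float,
                          weekly_growth: float, headroom: float = 0.2) -> float:
    """Weeks until peak traffic exceeds usable capacity under compound growth.

    Usable capacity is capacity * (1 - headroom); headroom is the reserve
    kept free for spikes and failover (20% here is an assumed default).
    """
    usable = capacity * (1.0 - headroom)
    if current_peak >= usable:
        return 0.0  # already inside the reserve
    if weekly_growth <= 0.0:
        return math.inf  # flat or shrinking traffic never exhausts capacity
    return math.log(usable / current_peak) / math.log(1.0 + weekly_growth)


# Hypothetical service: 4,000 rps peak, 10,000 rps capacity, 5% weekly growth.
runway = weeks_until_exhausted(current_peak=4_000, capacity=10_000,
                               weekly_growth=0.05)
print(f"runway: {runway:.1f} weeks")
```

With these inputs the runway is a little over 14 weeks, which is the kind of number that turns capacity work from a vague worry into a dated backlog item.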
And game days punctuate the calendar. On a schedule, the team deliberately breaks things (kills a dependency in staging or production, fails a region, corrupts a config) and watches whether the defenses hold: did the alerts fire, did the runbook work, did the on-call engineer find the dashboard, did the failover actually fail over. Each exercise produces a fix list, and the cumulative effect over quarters is the difference between an incident response that is rehearsed and one that is improvised. The chaos-engineering tradition formalized this; the SRE practice adopted it as routine hygiene.
SRE is the discipline of running production systems through software engineering: reliability defined numerically as SLOs, managed as an error budget that regulates release velocity, and operational work systematically automated rather than manually performed.
DevOps is the broad cultural movement (collapse the dev/ops wall, automate delivery, share ownership); SRE is a specific, opinionated implementation with named mechanisms: SLOs, error budgets, toil caps, blameless postmortems, production-readiness gates. The standard formulation: SRE is a concrete way of doing DevOps, and the two are complementary rather than competing.
SLI: the measured indicator (fraction of requests that succeeded within latency target). SLO: the internal objective set on that indicator (99.9% over 30 days), an engineering contract. SLA: the external customer agreement with penalties, set looser than the SLO that protects it. Facts, decisions, and legal renderings of decisions, in that order.
An error budget is the allowed unreliability implied by an SLO: 99.9% monthly means roughly 43 minutes of downtime-equivalent to spend. While budget remains, teams ship at full speed; when it burns fast or exhausts, a pre-agreed policy slows releases and redirects engineering to reliability. The budget converts the velocity-stability argument into arithmetic.
Toil is operational work that is manual, repetitive, automatable, devoid of enduring value, and scaling with the service: hand-provisioning, recurring restarts, the same weekly alert. Novel incident response and genuine judgment calls are not toil. SRE practice measures it, caps it (the canonical 50%), and spends the protected time eliminating its sources.
Postmortems are blameless because the typing engineer is the proximate cause and the system that allowed the action is the fixable cause, and because punished engineers stop reporting near-misses, which destroys the cheapest reliability data available. Blamelessness is an accuracy discipline: it keeps the analysis pointed at what can actually be changed.
The right SLO target is the cheapest number users will not notice, found by asking what reliability the user experience actually requires and what each additional nine costs (roughly tenfold per nine). Most internal services are honestly served by 99.5-99.9%; user-facing revenue paths by 99.9-99.95%; the five-nines conversation is almost always about a narrower critical path than the one being discussed.
A dedicated SRE team is not needed to start. The core mechanisms (SLOs with a signed budget policy, burn-rate alerting, blameless postmortems, toil measurement) can be carried by existing platform or infrastructure teams. Dedicated SRE earns its headcount when cross-service reliability becomes a full-time engineering domain, and the consulting model (standards and reviews from a small core, teams operating their own services) scales best outside very large organizations.
To get started, pick the one service whose failures hurt most; define two or three user-centric SLOs with current performance as the baseline; replace its threshold alerts with burn-rate alerts; agree the budget policy with the product owner; and run the next incident's postmortem blamelessly with tracked actions. That sequence costs a quarter, requires no reorganization, and demonstrates the machinery on which everything larger is built.