Site Reliability Engineering (SRE) is the discipline of running production systems using software engineering principles. SREs build automation, define service level objectives (SLOs), manage error budgets, and treat operational problems as engineering problems to be solved with code rather than manual effort. The practice originated at Google, where they applied software engineering rigor to operations work that other companies handled with traditional system administration approaches.
The defining insight of SRE is that operations work that scales linearly with system size eventually becomes unsustainable. If running one server takes one hour of operations work per day, running a thousand servers takes a thousand hours, which is impossible. The only way out is automation: turn the operations work into software so it scales without proportional human effort. SREs are software engineers who specialize in this kind of automation.
Google published "Site Reliability Engineering: How Google Runs Production Systems" in 2016, codifying the practices they had developed internally. The book made SRE accessible to organizations beyond Google. By 2026 SRE has become standard practice in many large software organizations, though specific implementations vary widely. Some companies have dedicated SRE teams; others embed SRE practices in development teams; others apply SRE principles without using the title.
The practice has well-defined components. SLIs (service level indicators) measure specific aspects of system behavior: latency, error rate, availability. SLOs (service level objectives) set targets for those indicators: 99.9% availability, P95 latency under 200ms. Error budgets calculate the allowed amount of unreliability based on SLOs; when the budget is exhausted, the team shifts focus from features to reliability work. Toil reduction identifies and automates manual operational work. Blameless post-mortems extract lessons from incidents without assigning individual blame.
SRE overlaps with DevOps but emphasizes different specifics. DevOps is a broader cultural movement bringing development and operations together. SRE is a specific implementation with explicit tools (SLOs, error budgets) and a particular focus on reliability through software engineering. Many organizations use both terms; the practices substantially overlap, differing mainly in vocabulary.
Service Level Objectives are explicit reliability targets. "99.9% of requests complete within 500 milliseconds." "99.95% availability over 30 days." "Less than 0.1% error rate." The targets reflect what users actually need rather than what is technically achievable. They drive everything else in SRE practice.
Defining SLOs requires understanding user experience. A web service might have multiple SLOs: availability of the home page, latency of search, success rate of checkout. Each captures a different aspect of what matters. The targets are calibrated to user expectations: how reliable does this need to be for the use case?
Error budgets convert SLOs into operational decisions. A 99.9% availability target allows 43 minutes of downtime per month. The 43 minutes is the error budget. Within the budget, the team can take risks (deploy new code, run experiments, do migrations). When the budget is exhausted, the team must prioritize reliability work over new features until the system stabilizes.
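The arithmetic is simple enough to sketch in a few lines of Python; the function below is an illustration of the calculation, not any particular tool's API:

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Allowed downtime (in minutes) for an availability SLO over a window."""
    total_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * total_minutes

# A 99.9% SLO over 30 days allows roughly 43.2 minutes of downtime.
print(error_budget_minutes(0.999))   # 43.2
print(error_budget_minutes(0.9995))  # 21.6
```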
Error budgets create natural feedback loops. Teams that consistently consume their budget learn to be more careful with changes. Teams that have unused budget can take more risks. The system aligns incentives toward appropriate reliability without dictating specific behavior.
The hard part of SLOs is calibration. Targets that are too aggressive cause constant alerts and force every change into reliability work. Targets that are too loose let real problems through. Most organizations spend the first six to twelve months tuning their SLOs as they learn what is achievable and what users actually need.
Toil is manual, repetitive operational work that scales linearly with service size. Restarting services after crashes. Handling routine alerts. Performing regular maintenance. The defining characteristic is that the work has to happen but does not contribute to lasting improvement. Each instance is unproductive in itself.
SRE practice identifies toil and reduces it through automation. The classic Google guidance is that SREs should spend no more than 50% of their time on toil; the rest should go to engineering work that reduces future toil. Teams that exceed the 50% threshold are signaling that they need more automation investment or fewer responsibilities.
Toil reduction strategies include automation of common operations (auto-remediation for known issues), self-healing systems (services that recover from failures without human intervention), better tooling (dashboards and runbooks that make operations work faster), and architectural changes (eliminating the work entirely).
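As a concrete illustration of the first strategy, a minimal auto-remediation handler might look like the sketch below. The alert names and remediation stubs are hypothetical; a real handler would call an orchestrator or runbook-automation API instead:

```python
import logging

log = logging.getLogger("auto_remediation")

# Hypothetical remediation actions; real deployments would call an
# orchestrator API rather than these logging stubs.
def restart_service(alert: dict) -> None:
    log.info("Restarting %s", alert["service"])  # stub: orchestrator call goes here

def clear_tmp_disk(alert: dict) -> None:
    log.info("Clearing temp files on %s", alert["host"])  # stub

# Mapping from known alert names to automated fixes.
REMEDIATIONS = {
    "ServiceCrashLooping": restart_service,
    "DiskNearlyFull": clear_tmp_disk,
}

def handle_alert(alert: dict) -> bool:
    """Try a known remediation; return False to escalate to a human."""
    action = REMEDIATIONS.get(alert["name"])
    if action is None:
        return False  # unknown issue: page the on-call engineer
    action(alert)
    return True
```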
The investment in toil reduction pays back over time. Time spent automating an operation saves time every time the operation would have happened. Compounding over months and years, the savings are significant. Teams that under-invest in automation often find themselves overwhelmed by toil and unable to escape.
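With hypothetical numbers, the payback is easy to estimate: an operation that takes 15 minutes and happens ten times a week costs 2.5 hours weekly, so a 20-hour automation investment breaks even in about eight weeks:

```python
# Illustrative numbers only; plug in your own toil measurements.
automation_cost_hours = 20   # one-time engineering investment
toil_minutes_each = 15       # manual time per occurrence
occurrences_per_week = 10

saved_hours_per_week = toil_minutes_each * occurrences_per_week / 60  # 2.5
break_even_weeks = automation_cost_hours / saved_hours_per_week       # 8.0
print(f"Automation pays for itself after {break_even_weeks:.0f} weeks")
```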
The 50% threshold is a guideline, not a hard rule. Some teams in some periods need more operational work. The threshold is a signal: if it is consistently exceeded, something needs to change. The change might be more SREs, less responsibility, or more aggressive automation investment.
Production incidents will happen. SRE practice prepares for them with on-call rotations, runbooks for common issues, escalation paths, and incident response procedures. The goal is not to prevent all incidents (impossible) but to detect and resolve them quickly when they occur.
On-call rotations distribute the burden of being available for incidents. Engineers rotate through on-call duty for periods (typically a week at a time). During on-call, they are reachable for production issues. The rotation prevents any one person from carrying the operational burden alone.
Runbooks document the steps for resolving common issues. When alerts fire, the on-call engineer follows the runbook. Runbooks are living documents that improve as the team learns more. Good runbooks dramatically reduce mean time to recovery for known issues.
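A common lightweight convention is to attach the runbook link directly to the alert, so the page itself tells the engineer where to start. A toy sketch of that convention; the alert names and wiki URLs are placeholders:

```python
# Hypothetical mapping from alert name to its runbook; real setups usually
# attach these links in the alerting rule itself (e.g., as annotations).
RUNBOOKS = {
    "HighErrorRate": "https://wiki.example.com/runbooks/high-error-rate",
    "QueueBacklog": "https://wiki.example.com/runbooks/queue-backlog",
}

def page_with_runbook(alert_name: str) -> str:
    """Compose the page text the on-call engineer receives."""
    link = RUNBOOKS.get(alert_name, "no runbook yet: write one after resolving")
    return f"[{alert_name}] firing. Runbook: {link}"

print(page_with_runbook("HighErrorRate"))
```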
Blameless post-mortems analyze incidents without assigning individual blame. The focus is on system and process improvement rather than punishing the engineer who happened to be involved. The pattern is essential because blame discourages people from admitting mistakes, which prevents the team from learning. Without blame, people share what happened openly, and the team can address root causes.
Post-mortems produce action items: changes to systems, processes, or training that should reduce the chance of similar incidents. Tracking these action items to completion is part of the SRE practice. Post-mortems that produce no action items or whose action items get ignored fail to deliver the value of the post-mortem process.
To restate the relationship: DevOps is the broader cultural movement bringing development and operations together, while SRE is a specific implementation of DevOps principles with explicit tools (SLOs, error budgets) and a particular focus on reliability through software engineering.
The specific differences: SRE sets explicit reliability targets through SLOs; SRE emphasizes the 50% time-on-toil guideline; and SRE comes from a single origin (Google) whose practices have become canonical, while DevOps is more inclusive of varied approaches.
The practical overlap is large. Both emphasize automation. Both bring development and operations together. Both treat operational problems as engineering problems. Both use modern tooling (CI/CD, IaC, observability). The differences are mostly in vocabulary and specific frameworks rather than fundamental practice.
The choice between SRE and DevOps as organizational vocabulary is largely cultural. Organizations that have a tradition of working closely with Google's practices often adopt SRE explicitly. Organizations with broader influences often use DevOps as the umbrella term. Some organizations use both: DevOps as the cultural framing, SRE as the specific reliability practice.
An SLO is a specific reliability target for a service: 99.9% availability, P95 latency under 500 ms, 99.99% data durability. It drives error budget calculations and operational decisions. SLOs reflect what users need rather than what is technically achievable. The practice of defining SLOs forces clarity about what reliability means for a service: a team that cannot articulate clear SLOs probably does not have a shared understanding of what reliable means. Working through SLO definitions usually surfaces useful conversations about user expectations and engineering priorities.
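One way to force that clarity is to record SLOs as structured data rather than prose. A minimal sketch; the field names are illustrative, not any particular tool's schema:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SLO:
    name: str         # which user experience this protects
    sli: str          # the indicator being measured
    target: float     # fraction of good events required
    window_days: int  # rolling evaluation window

# A service typically carries several SLOs, one per aspect that matters.
CHECKOUT_SLOS = [
    SLO("availability", "successful checkouts / all checkout requests", 0.9995, 30),
    SLO("latency", "checkouts served under 500 ms / all checkout requests", 0.999, 30),
]
```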
An error budget is the allowed unreliability implied by an SLO. A 99.9% SLO allows a 0.1% error rate or downtime; over 30 days that is roughly 43 minutes the system can be down. The error budget is the operational manifestation of the SLO. When the team consumes its error budget, something needs to change: either the team focuses on reliability work until the system stabilizes, or the SLO is wrong (probably too aggressive) and needs adjustment. Error budgets create natural feedback loops that align team behavior with reliability targets.
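Consumption is commonly tracked as a burn rate: the observed error rate divided by the rate the SLO allows, where 1.0 means the budget lasts exactly the window. A sketch, assuming good/total event counts are queryable:

```python
def burn_rate(good: int, total: int, slo_target: float) -> float:
    """How fast the budget is burning: 1.0 = exactly on budget, >1 = too fast."""
    if total == 0:
        return 0.0
    observed_error_rate = 1.0 - good / total
    allowed_error_rate = 1.0 - slo_target
    return observed_error_rate / allowed_error_rate

# 99.9% SLO: 50 failures in 10,000 requests burns budget 5x faster than allowed.
print(burn_rate(good=9_950, total=10_000, slo_target=0.999))  # 5.0
```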
When the budget is exhausted, teams shift focus from features to reliability. New feature work pauses and engineering effort goes to fixing reliability issues. Once the budget recovers (because the system has been stable for a period), feature work resumes. The pattern provides explicit alignment between leadership priorities and engineering work. Without error budgets, reliability work often gets postponed in favor of feature work, which produces unreliable systems. Error budgets force the conversation about trade-offs to happen at a regular cadence rather than crisis by crisis.
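The policy can be made mechanical. A sketch of a deploy gate that blocks feature deploys when the remaining budget drops below a threshold; the 10% cutoff is an illustrative choice, not a standard:

```python
def remaining_budget_fraction(consumed_minutes: float, budget_minutes: float) -> float:
    """Fraction of the error budget still unspent in the current window."""
    return max(0.0, 1.0 - consumed_minutes / budget_minutes)

def may_deploy(consumed_minutes: float, budget_minutes: float,
               freeze_threshold: float = 0.10) -> bool:
    """Allow feature deploys only while more than 10% of the budget remains."""
    return remaining_budget_fraction(consumed_minutes, budget_minutes) > freeze_threshold

# 38 of 43.2 minutes consumed: ~12% left, deploys still allowed.
print(may_deploy(consumed_minutes=38.0, budget_minutes=43.2))  # True
print(may_deploy(consumed_minutes=42.0, budget_minutes=43.2))  # False
```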
The ratio of SREs to developers varies widely. Google traditionally has lower ratios (more developers per SRE) than smaller organizations adopting SRE practices. Typical ratios in 2026 range from one SRE per twenty developers (well-tooled organizations with mature SRE practice) to one per five (organizations earlier in their SRE adoption or with more complex operational requirements). The ratio is a derived metric rather than a target; the actual question is whether there is enough SRE capacity to support the application teams effectively. If application teams are blocked on SRE work or SREs are overwhelmed, the ratio is too low. If SREs have unused capacity, it might be too high.
The health of an SRE practice shows up in a handful of metrics: SLO attainment (are services meeting their reliability targets), error budget consumption (are teams using their budgets reasonably), toil percentage (are SREs spending appropriate time on automation versus operations), mean time to recovery (how quickly incidents get resolved), and deployment frequency (does reliability allow fast deployment). Together the metrics capture whether SRE is actually working. High SLO attainment plus low toil plus fast recovery suggests effective practice; any of these out of balance signals an issue to investigate.
The four golden signals are latency, traffic, errors, and saturation: the key metrics for service health. The framing comes from Google's SRE book and has become standard practice. Together the signals capture most of what matters for service reliability without requiring exhaustive metric collection. Most observability platforms support golden signal monitoring out of the box, dashboards for the signals are standard SRE practice, and alerts on them catch most production issues. The framework is simple enough to apply consistently across services.
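Computing three of the four signals from raw request records is straightforward (saturation usually comes from host or resource metrics instead). A sketch assuming each record carries a latency and an HTTP status code:

```python
from dataclasses import dataclass

@dataclass
class Request:
    latency_ms: float
    status: int  # HTTP status code

def golden_signals(requests: list[Request], window_seconds: float) -> dict:
    """Latency (simple P95 approximation), traffic, and error rate."""
    latencies = sorted(r.latency_ms for r in requests)
    p95 = latencies[int(0.95 * (len(latencies) - 1))] if latencies else 0.0
    errors = sum(1 for r in requests if r.status >= 500)
    return {
        "p95_latency_ms": p95,
        "traffic_rps": len(requests) / window_seconds,
        "error_rate": errors / len(requests) if requests else 0.0,
    }
```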
Platform engineering and SRE are complementary rather than competing. Platforms often include SRE-style observability and reliability tooling, and platform teams may include SRE roles or apply SRE practices to the platform itself. The platform then provides SRE capabilities (monitoring, alerting, runbooks) to application teams. Platform engineering operationalizes SRE practices for many teams; SRE practices inform what the platform should provide. Together they make production engineering scalable across large organizations.
Chaos engineering is the practice of deliberately introducing failures to test resilience: running controlled experiments that fail components to verify the system handles failures gracefully. SRE teams often practice it to validate reliability. Tools like Chaos Monkey (Netflix) and Gremlin (commercial) support chaos experiments. The pattern works particularly well for distributed systems, where actual behavior under failure is hard to predict from architecture alone. Running chaos experiments in production (with appropriate safeguards) reveals issues that testing in development misses.
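At its core, a chaos experiment is an injected fault plus a verified expectation. A toy sketch of that shape; the fault injection and retry logic here are stand-ins for what tools like Chaos Monkey exercise against real infrastructure:

```python
import random

def call_dependency() -> str:
    """Stand-in for a downstream call; the chaos hook fails it 30% of the time."""
    if CHAOS_ENABLED and random.random() < 0.30:
        raise TimeoutError("injected fault")
    return "ok"

def resilient_fetch(retries: int = 3) -> str:
    """The behavior under test: retry, then degrade instead of crashing."""
    for _ in range(retries):
        try:
            return call_dependency()
        except TimeoutError:
            continue
    return "degraded"  # graceful fallback rather than a user-facing error

CHAOS_ENABLED = True
results = [resilient_fetch() for _ in range(1_000)]
assert all(r in ("ok", "degraded") for r in results)  # no unhandled failures
print(f"degraded responses: {results.count('degraded')}")
```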
The near-term trajectory includes tighter integration with AI for incident response (AI-assisted log analysis, suggested remediation, automated detection), broader adoption in non-tech industries as software becomes core to more businesses, continued maturation of tooling and practices, and recognition as a distinct career path with clear progression. The bigger trend is SRE practices becoming standard expectations rather than a specialist discipline. Modern engineering organizations assume SLO-based reliability, blameless post-mortems, and toil reduction as baseline practices. The frontier is in how those practices integrate with platform engineering, AI assistance, and broader operational excellence.