A service level objective, or SLO, is a target for how reliable a service should be, stated as a measurable number, like 99.9 percent of requests succeeding over a month. An error budget is the flip side of that target: if your SLO allows 0.1 percent of requests to fail, that 0.1 percent is your budget for failures, the amount of unreliability you have explicitly decided is acceptable. Together they turn reliability from a vague aspiration into a concrete, measurable quantity that you can track, spend, and make decisions against, which is what makes them so useful.
The insight behind them is that perfect reliability is neither achievable nor worth pursuing. Every nine you add to an availability target cuts the permitted failure by a factor of ten and costs far more to achieve than the last, and beyond a certain point users cannot even perceive the difference, so chasing perfection wastes enormous effort for no benefit. SLOs accept this by setting reliability at a level that is good enough for users and explicitly tolerating the rest. The error budget names that tolerance: it is the gap between your SLO and perfection, and it represents failures you have decided in advance you can live with, which reframes reliability as a deliberate trade-off rather than an absolute.
The error budget is powerful because it is something you spend, and that framing changes behavior. When the service is well within its budget, you have room to take risks, ship faster, and try things, because you can afford some failures. When the budget is exhausted, you have used up your allowance for unreliability and the priority shifts to stabilizing the service before shipping more. The budget gives an objective, agreed-upon signal for when to push and when to slow down, replacing the endless subjective argument between moving fast and staying stable with a number everyone can see.
The concepts come from site reliability engineering, where they were developed to manage the tension between development teams wanting to ship features and operations teams wanting stability. By 2026 they are widely adopted well beyond their origin, because the underlying idea (measure reliability, budget for failure, and let the budget govern the pace of change) is broadly useful for any service where reliability matters. The discipline is less about the specific math and more about the cultural shift of treating reliability as a measurable, spendable resource rather than an all-or-nothing ideal.
This page covers what SLOs and error budgets are, how they turn reliability into a measurable budget, why they resolve the reliability-versus-features fight, and how to set them well. The tooling for measuring and tracking them keeps improving. The underlying idea, that reliability is a deliberate target with a budget for failure that governs how fast you change things, is the durable and valuable part.
The starting point is choosing what to measure, which is captured in a service level indicator, or SLI. An SLI is the actual metric of reliability, such as the proportion of requests that succeed, the proportion served faster than some latency threshold, or the proportion of time the service is available. The SLI has to reflect what users actually experience, because the whole point is to measure reliability as users feel it, not as some internal metric that may not correspond to their experience. Picking the right SLI is the foundation, because everything else is built on measuring the thing that genuinely matters to users.
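As a minimal sketch of how such indicators can be computed, the following assumes a hypothetical `Request` record with a success flag and a latency. The record shape, the function names, and the choice to count a request as "fast" only if it also succeeded are illustrative assumptions, not something the text prescribes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    """Hypothetical record of one user request (an assumption for illustration)."""
    succeeded: bool    # did the user get a correct response?
    latency_ms: float  # how long the user waited


def availability_sli(requests: list) -> float:
    """Proportion of requests that succeeded."""
    if not requests:
        raise ValueError("cannot compute an SLI over zero requests")
    return sum(r.succeeded for r in requests) / len(requests)


def latency_sli(requests: list, threshold_ms: float) -> float:
    """Proportion of requests that succeeded and beat the latency threshold."""
    if not requests:
        raise ValueError("cannot compute an SLI over zero requests")
    good = sum(r.succeeded and r.latency_ms < threshold_ms for r in requests)
    return good / len(requests)
```

Both functions return a ratio between 0 and 1 over whatever window of requests is passed in, which is the shape an SLO target can then be set against.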
The SLO sets the target on that indicator. If the SLI is the proportion of successful requests, the SLO might be that 99.9 percent of requests succeed over a rolling month. The number is a deliberate choice about how reliable the service needs to be, balancing what users need against the cost of achieving it. This is where reliability becomes concrete: instead of "we want the service to be reliable," you have a specific, measurable commitment that you can track continuously, so that at any moment you know whether you are meeting it.
The error budget falls out of the SLO automatically. If the SLO is 99.9 percent success, then 0.1 percent failure is permitted, and over a month that 0.1 percent translates into a specific quantity of allowable failures or downtime. That quantity is the budget. As failures occur, they consume the budget, and you can watch the budget deplete over the period the way you watch any budget. The arithmetic is simple, but the reframing is profound: failures stop being purely bad events and become expenditures against an allowance you decided on, which is what makes them manageable.
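The arithmetic can be sketched in a few lines. The function names and the traffic figure used below are assumptions for illustration; the only inputs taken from the text are the 99.9 percent target and a month-long period (here a 30-day window).

```python
from datetime import timedelta


def allowed_failures(slo_target: float, total_requests: int) -> float:
    """Number of requests the SLO permits to fail in the period."""
    return (1 - slo_target) * total_requests


def allowed_downtime(slo_target: float, window: timedelta) -> timedelta:
    """Amount of downtime the SLO permits in the period."""
    return window * (1 - slo_target)


# Hypothetical month: 1,000,000 requests against a 99.9 percent SLO.
print(allowed_failures(0.999, 1_000_000))
print(allowed_downtime(0.999, timedelta(days=30)))
```

For a 99.9 percent target over 30 days, the downtime form of the budget works out to about 43.2 minutes, and the request form to roughly 1,000 failed requests per million.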
Tracking the budget over time turns it into a live decision-making tool. You monitor how much of the budget you have consumed and how fast, and that consumption rate tells you whether you are on track or burning through the budget too quickly. A sudden spike in failures shows up as a rapid burn that warns you before the budget is gone; a steady, slow consumption tells you the service is comfortably within target. This continuous view of the budget is what makes it operationally useful, because it gives an early, objective signal about the state of reliability rather than a verdict only after the period ends.
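One way to picture this running view is a cumulative tally of failures against the budget. The sketch below assumes a hypothetical list of daily failure counts and a budget already expressed as a number of failures; both are illustrative inputs.

```python
from itertools import accumulate


def budget_consumed_by_day(daily_failures: list, budget_failures: float) -> list:
    """Cumulative fraction of the error budget consumed after each day.

    A value of 1.0 means the budget is exactly exhausted.
    """
    if budget_failures <= 0:
        raise ValueError("budget must be positive")
    return [total / budget_failures for total in accumulate(daily_failures)]


# Hypothetical data: a quiet start, then a spike on days three and four.
for day, used in enumerate(budget_consumed_by_day([10, 20, 70, 100], 1000), start=1):
    print(f"day {day}: {used:.0%} of budget used")
```

A sudden jump between consecutive values is the rapid burn described above, while small, even steps indicate a service comfortably within target.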
The classic organizational conflict is between developers who want to ship features fast and operations who want to keep the service stable, and it is usually unresolvable because both sides are right and there is no objective way to adjudicate. Developers argue that shipping is the point; operations argue that stability protects users; and without a shared measure, the argument comes down to who has more influence, which satisfies no one and shifts with politics. The fight recurs endlessly because there is no agreed standard for how much reliability is enough.
Error budgets supply exactly that missing standard. By defining in advance how much unreliability is acceptable, they give both sides a number to point to. When the budget is healthy, the developers are right that the team can afford to ship and take risks, because there is reliability headroom to spend. When the budget is exhausted, operations is right that the team should slow down and stabilize, because the agreed tolerance for failure has been used up. The same number answers the question in both directions, and crucially it was agreed before the specific situation arose, so neither side is making it up to win the current argument.
This converts a political dispute into a data-driven decision, which lowers the temperature considerably. Instead of two teams arguing about judgment and priorities, both look at the budget and reach the same conclusion, because the conclusion follows from the number rather than from whose opinion prevails. The decision about whether to push or hold becomes mechanical in the best sense: the policy is set in advance, the budget shows the current state, and the action follows. This is why teams that adopt error budgets often find that a chronic source of friction simply dissolves.
The deeper effect is aligning incentives rather than just settling arguments. When developers know that burning the error budget will halt feature work, they have a direct incentive to ship reliably, because unreliable shipping costs them their ability to keep shipping. The budget makes reliability the developers' concern, not just operations', because it directly affects their freedom to do what they want. This shared incentive, where both teams care about staying within budget for their own reasons, is more durable than any negotiated truce, because it makes reliability and feature velocity two sides of the same managed resource.
The hardest and most important part is choosing SLOs that reflect what users actually care about, not what is easy to measure. A service can hit an internal availability target while users have a terrible experience, if the metric does not capture what matters to them. The SLO should be grounded in the user experience, measuring the reliability of the things users actually depend on, so that meeting the SLO genuinely means users are well served. Getting this wrong produces SLOs that are met while users suffer, which is worse than useless because it provides false confidence.
The target level should be set where additional reliability stops being worth the cost, which is usually lower than instinct suggests. Teams often reach for very high targets out of a sense that more reliability is always better, but each additional nine costs exponentially more and delivers diminishing perceptible benefit, so an excessively high SLO commits the team to enormous effort for gains users cannot feel. The right target is the level that keeps users genuinely satisfied, and no higher, which leaves a meaningful error budget to actually use. An SLO set too high produces a budget so small that it is always exhausted, which makes the whole system punitive rather than enabling.
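To see how quickly the budget shrinks as the target rises, the sketch below computes the downtime allowed over a 30-day window for targets with two to five nines. The window length and the function name are assumptions for illustration. The code shows only the size of the budget, not the cost of achieving each target.

```python
from fractions import Fraction

WINDOW_MINUTES = 30 * 24 * 60  # assumed 30-day window


def downtime_budget_minutes(nines: int) -> Fraction:
    """Allowed downtime in minutes for a target with this many nines.

    For example, nines=3 means a 99.9 percent target.
    """
    if nines < 1:
        raise ValueError("nines must be at least 1")
    unavailability = Fraction(1, 10 ** nines)
    return WINDOW_MINUTES * unavailability


for n in range(2, 6):
    minutes = float(downtime_budget_minutes(n))
    print(f"{n} nines: {minutes:.3f} minutes of downtime per 30 days")
```

Each additional nine leaves one tenth of the previous budget, which is why a very high target leaves almost nothing to spend.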
The SLO has to come with a real policy for what happens when the budget is spent, or it is just a number. The power of the error budget comes from the agreement that exhausting it triggers a change in behavior, typically halting risky feature work to focus on reliability. Without that policy and the organizational commitment to honor it, the budget is a measurement with no teeth, and the features-versus-stability fight returns. Defining the policy in advance, and having leadership back it so that the budget actually governs decisions, is what turns the measurement into a tool that changes behavior.
SLOs need to be revisited as the service and its users change, rather than set once and forgotten. The right reliability target shifts as a service matures, as user expectations evolve, and as the business context changes, so an SLO that was right a year ago may be too loose or too strict now. Treating SLOs as living targets that are reviewed periodically keeps them aligned with reality, while a stale SLO gradually loses its connection to what actually matters and stops being a useful guide. The discipline is ongoing, like the reliability work the SLOs are meant to govern.
A concrete example makes the concepts tangible. Imagine an API that other teams depend on. A reasonable SLO might be that 99.9 percent of requests succeed and 99 percent return within 300 milliseconds, measured over a rolling 30-day window. The error budget is then the 0.1 percent of requests allowed to fail, which over a month of traffic is a specific number of failed requests. The team watches that budget through the month, and as long as they stay within it, they have room to ship changes; if they burn through it, they shift focus to reliability.
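The API example can be sketched as a small evaluation over request counts for the window. The two targets come from the example above; the `WindowCounts` type, the field names, and the traffic numbers in the usage lines are hypothetical.

```python
from dataclasses import dataclass
from fractions import Fraction

SUCCESS_SLO = Fraction(999, 1000)  # 99.9 percent of requests succeed
LATENCY_SLO = Fraction(99, 100)    # 99 percent return within 300 ms


@dataclass(frozen=True)
class WindowCounts:
    """Hypothetical request counts for one rolling 30-day window."""
    total: int
    succeeded: int
    within_300ms: int


def evaluate(counts: WindowCounts) -> dict:
    """Check both SLOs and report the remaining failure budget."""
    if counts.total <= 0:
        raise ValueError("window contains no requests")
    failures_allowed = (1 - SUCCESS_SLO) * counts.total
    failures_seen = counts.total - counts.succeeded
    return {
        "success_slo_met": Fraction(counts.succeeded, counts.total) >= SUCCESS_SLO,
        "latency_slo_met": Fraction(counts.within_300ms, counts.total) >= LATENCY_SLO,
        "failures_allowed": failures_allowed,
        "failures_remaining": failures_allowed - failures_seen,
    }


# Hypothetical month of traffic: 2,000,000 requests, 1,500 of them failed.
report = evaluate(WindowCounts(total=2_000_000, succeeded=1_998_500, within_300ms=1_985_000))
print(report)
```

Exact fractions are used here only to avoid floating-point edge cases when a ratio sits right at the target; plain floats would serve for a dashboard.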
A consumer-facing service shows how SLOs map to user experience. For a website, the SLO might focus on the proportion of page loads that succeed and load quickly, because that is what users actually feel, rather than on backend server uptime that may not correspond to the user experience. If a backend is technically up but pages load slowly or error intermittently, an uptime metric would look fine while users suffer, which is exactly the trap a well-chosen SLO avoids by measuring what users experience. The example underlines that the SLI must reflect the user's reality, not the system's internal state.
A data pipeline illustrates SLOs beyond request-response systems. For a pipeline that delivers data consumers depend on, an SLO might target the proportion of runs that deliver complete, correct data on time, with the error budget being the runs allowed to be late or incomplete. This shows that SLOs are not only for web services; the same idea, a measurable reliability target with a budget for failure, applies wherever reliability matters and can be measured, including data systems, where it connects naturally to data observability and reliability practices.
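A pipeline SLI of this kind might be sketched as follows. The `PipelineRun` fields are hypothetical, and row counts are used as a stand-in for "complete, correct data", which is a simplifying assumption; a real check would depend on what the consumers of the data need.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class PipelineRun:
    """Hypothetical record of one scheduled pipeline run."""
    deadline: datetime
    finished_at: Optional[datetime]  # None if the run never finished
    rows_expected: int
    rows_delivered: int


def is_good_run(run: PipelineRun) -> bool:
    """A run is good if it finished by its deadline with all expected rows."""
    on_time = run.finished_at is not None and run.finished_at <= run.deadline
    complete = run.rows_delivered >= run.rows_expected
    return on_time and complete


def pipeline_sli(runs: list) -> float:
    """Proportion of runs that delivered complete data on time."""
    if not runs:
        raise ValueError("cannot compute an SLI over zero runs")
    return sum(is_good_run(r) for r in runs) / len(runs)
```

The error budget is then the share of runs allowed to fail this check over the period, exactly as with failed requests in a web service.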
Across these examples the structure is identical even as the specifics differ: pick an indicator that reflects what consumers actually care about, set a target that is good enough without chasing perfection, derive the error budget from the gap, and use the budget to govern the pace of change. The examples vary in what they measure (requests, page loads, pipeline runs), but the discipline is the same. Seeing the pattern applied across different systems makes clear that SLOs are a general tool for managing reliability as a measurable, spendable quantity, not a technique tied to one kind of system.
The error budget becomes operationally powerful when you watch how fast it is being consumed, which is called the burn rate. A burn rate tells you whether you are spending the budget faster than the period can sustain: a service that burns its entire monthly budget in a few days is in trouble even if the budget is not yet fully gone, because at that rate it will be exhausted long before the month ends. Burn rate turns the budget from a number you check at the end into an early warning you can act on during the period, which is where much of its operational value comes from.
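A common way to express burn rate, assumed here for illustration, is the observed error ratio divided by the error ratio the SLO allows. On that definition a burn rate of 1 spends the budget exactly over the full window, and higher values exhaust it proportionally sooner. The function names and the 30-day default are assumptions.

```python
def burn_rate(observed_error_ratio: float, slo_target: float) -> float:
    """How many times faster than sustainable the budget is being spent."""
    allowed_error_ratio = 1 - slo_target
    if allowed_error_ratio <= 0:
        raise ValueError("an SLO of 100 percent leaves no budget to burn")
    return observed_error_ratio / allowed_error_ratio


def days_until_exhausted(rate: float, window_days: float = 30) -> float:
    """Days a full budget would last if this burn rate held constant."""
    if rate <= 0:
        return float("inf")
    return window_days / rate


# Hypothetical spike: 1 percent of requests failing against a 99.9 percent SLO.
rate = burn_rate(0.01, 0.999)
print(f"burn rate {rate:.1f}, budget gone in {days_until_exhausted(rate):.1f} days")
```

A service failing 1 percent of requests against a 99.9 percent target is burning roughly ten times faster than sustainable, so a full 30-day budget would last about three days.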
Fast burn and slow burn call for different responses, and good alerting distinguishes them. A sudden, fast burn (a spike of failures consuming the budget rapidly) signals an acute problem that needs immediate attention, and alerts tuned to fast burn catch incidents as they happen. A slow, steady burn that is nonetheless faster than sustainable signals a chronic issue degrading reliability over time, which warrants attention but not a page in the middle of the night. Alerting on burn rate, rather than just on the budget being gone, lets the team respond proportionately to how urgently the budget is being depleted.
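The distinction can be sketched as a classifier over two burn rates, one measured over a short recent window and one over a longer window. The threshold values and the names below are illustrative assumptions, not values given in the text; a real setup would tune them to the service.

```python
from enum import Enum


class Action(Enum):
    PAGE = "page someone now"
    TICKET = "open a ticket for working hours"
    NONE = "no action"


# Assumed thresholds for illustration only.
FAST_BURN_THRESHOLD = 10.0  # short-window burn rate that signals an acute incident
SLOW_BURN_THRESHOLD = 1.0   # anything above 1.0 is faster than sustainable


def classify_burn(short_window_rate: float, long_window_rate: float) -> Action:
    """Map burn rates over a short and a long window to a proportionate response."""
    if short_window_rate >= FAST_BURN_THRESHOLD:
        return Action.PAGE
    if long_window_rate > SLOW_BURN_THRESHOLD:
        return Action.TICKET
    return Action.NONE
```

Checking the short window first means an acute spike always pages, while a chronic drift that never spikes still surfaces as a ticket rather than going unnoticed.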
Acting on the budget means having a real policy that consumption triggers, not just observation. When the budget is healthy and burning slowly, the team can take risks and ship freely; when burn accelerates or the budget nears exhaustion, the policy shifts behavior toward reliability, slowing risky changes and prioritizing stability. This is the mechanism that makes the whole system more than measurement: the budget and its burn rate drive concrete decisions about whether to push forward or pull back. Without acting on it, the budget is just a dashboard; acting on it is what makes it govern the pace of change.
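Such a policy can be written down as a small decision function so that the action follows from the numbers. The three outcomes mirror the behavior described above; the 25 percent cutoff and all names are hypothetical and would be agreed by the team in advance.

```python
from enum import Enum


class Pace(Enum):
    SHIP_FREELY = "ship and take risks"
    SLOW_DOWN = "slow risky changes, prioritize stability"
    FREEZE = "pause risky feature work until back within budget"


LOW_BUDGET_FRACTION = 0.25  # assumed cutoff for "nearing exhaustion"


def release_pace(budget_remaining_fraction: float, burn_rate: float) -> Pace:
    """Decide the pace of change from remaining budget and current burn rate."""
    if budget_remaining_fraction <= 0:
        return Pace.FREEZE
    if budget_remaining_fraction < LOW_BUDGET_FRACTION or burn_rate > 1.0:
        return Pace.SLOW_DOWN
    return Pace.SHIP_FREELY
```

Encoding the policy this way keeps the decision mechanical: the thresholds are debated once, ahead of time, rather than during an incident.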
The discipline of responding to burn rate is what keeps reliability managed continuously rather than addressed only in crises. A team watching burn rate notices reliability degrading while there is still budget to spare and can correct course early, rather than discovering the problem when the budget is gone and users are already affected. This continuous, proportionate response, enabled by treating the error budget as a live resource with a burn rate to watch, is the operational heart of the practice, turning the SLO from a static target into an active tool that guides day-to-day decisions about reliability and change.
An SLI is the actual measurement of reliability, such as the proportion of requests that succeed. An SLO is the target you set on that measurement, such as 99.9 percent success over a month. The error budget is the amount of failure the SLO permits, the 0.1 percent in that example, expressed as a concrete quantity over the period. The SLI is what you measure, the SLO is the goal, and the error budget is your allowance for falling short, which you treat as a resource to spend.
Aiming for 100 percent reliability is a mistake because it is neither achievable nor worth it. Every additional nine of reliability costs exponentially more effort and money, and beyond a certain point users cannot perceive the difference, so chasing perfection wastes enormous resources for no real benefit. It also leaves no error budget, which means no room to ship changes, since all change carries some risk. SLOs deliberately set reliability at a level that genuinely satisfies users and explicitly tolerate the rest, which is both realistic and far more useful than an impossible ideal.
Error budgets resolve the conflict between shipping features and staying stable by giving both sides an agreed number to point to, set in advance. When the budget is healthy, the team can afford to ship and take risks; when it is exhausted, the team should slow down and stabilize. The same number answers the question in both directions, so the decision follows from data rather than from whose opinion prevails. Because the policy was agreed before the situation arose, neither side is inventing it to win the current argument, which dissolves a chronic source of friction.
What happens when the error budget runs out depends on the policy you set in advance, but typically risky feature work pauses and the team focuses on reliability until the service is back within budget. This is the mechanism that gives the error budget teeth: exhausting it triggers a real change in behavior. Without such a policy and the organizational commitment to honor it, the budget is just a measurement with no consequences, and the old features-versus-stability conflict returns. Defining and backing the policy is what makes the whole system work.
Set the SLO target where additional reliability stops being worth the cost, which is usually lower than instinct suggests. Base it on the user experience, then pick the level that keeps users genuinely satisfied and no higher, so a meaningful error budget remains to actually use. Targets set too high commit the team to exponential effort for imperceptible gains and produce a budget so small it is always exhausted, making the system punitive. The right target balances user needs against the real cost of each additional increment of reliability.
A good SLI is one that reflects what users actually experience, not what is convenient to measure. A service can meet an internal availability metric while users have a poor experience if the metric does not capture what they depend on. A good SLI measures the reliability of the things users genuinely care about, such as whether their requests succeed and respond quickly, so that meeting the SLO truly means users are well served. Choosing the SLI that corresponds to real user experience is the foundation everything else is built on.
SLOs and error budgets are not only for large companies with dedicated SRE teams. They originated in site reliability engineering, but the underlying idea (measure reliability, budget for failure, and let the budget govern the pace of change) is useful for any service where reliability matters, regardless of company size. You do not need a dedicated SRE team to define a meaningful SLO and use an error budget to guide decisions. The concepts scale down well, and even a small team benefits from replacing vague reliability aspirations with a concrete target and an explicit tolerance for failure.
Yes, SLOs should change over time. The right reliability target shifts as a service matures, as user expectations evolve, and as the business context changes, so SLOs should be revisited periodically rather than set once and forgotten. A target that was right a year ago may now be too loose or too strict, and a stale SLO gradually loses its connection to what actually matters. Treating SLOs as living targets, reviewed and adjusted over time, keeps them aligned with reality and useful as a guide for decisions.
Burn rate is how fast you are consuming the error budget. It matters because it turns the budget into an early warning you can act on during the period rather than a number you check at the end. A service burning its whole monthly budget in a few days is in trouble even before the budget is gone, because at that rate it will be exhausted long before the month ends. Alerting on burn rate, and distinguishing a fast burn that needs immediate attention from a slow burn that signals chronic degradation, lets the team respond proportionately and manage reliability continuously instead of only in crises.