High-availability systems are designed to keep running and stay reachable even when individual parts of them fail. The goal is to minimize the time a service is unavailable to the people or systems that depend on it, by building it so the failure of any single component does not take the whole thing down. Hardware breaks, networks drop, processes crash, and entire data centers go dark. A high-availability system is engineered with the assumption that these things will happen and built to survive them. Availability is the property of being up and usable when needed, and high availability is the deliberate engineering to make that property hold despite the constant failures underneath.
The problem high availability addresses is that everything fails eventually, and a naive system inherits the failure of every part it depends on. A single server fails, a single database goes down, a single network link drops, and if your service runs on one of each, any one of those failures takes it offline. The more components a system depends on, the more often something is broken somewhere, so a system that needs all its parts working at once becomes less reliable as it grows, not more. High availability exists to break this chain, so the failure of a part does not become the failure of the whole.
The central mechanism is redundancy combined with the ability to fail over. You run more than one of the things that can fail, and you build the system to detect when one has failed and shift the work to the others. Redundancy alone is not enough, because a spare nobody switches to does not help, and failover is impossible without something to fail over to. Having backups and being able to use them automatically and fast is what turns a collection of fallible parts into a service that stays up. Almost every high-availability design is some arrangement of redundancy and failover applied to the parts most likely to fail.
By 2026 high availability is an expectation rather than a luxury for any service that matters, and the cloud has made much of it easier while raising the bar for what counts as acceptable. Managed services offer multi-zone and multi-region options, load balancers and orchestrators handle failover, and the patterns are well understood. But you still design for it deliberately, because the defaults do not always give it to you and the gap between a system that usually works and one that stays up through real failures is filled with specific engineering choices someone has to make.
This page covers what high-availability systems are, how redundancy and failover work, why measuring availability in nines matters, and the design patterns that keep services running. The specific technologies keep changing. The underlying idea, that you achieve availability by assuming failure and engineering around it with redundancy and failover, is durable and applies to every system people need to depend on.
Redundancy means running more than one of any component whose failure would otherwise stop the service. Instead of one server, you run several behind a load balancer; instead of one database, a primary with replicas; instead of one network path, more than one. The idea is simple: if a part can fail and you have only one of it, its failure is your failure, but if you have several, the failure of one is absorbed by the others. The art is identifying which parts actually need redundancy, because duplicating everything is expensive, and putting it where a single failure would do the most damage.
Failover is the mechanism that makes redundancy useful: the act of detecting a failure and shifting work to a healthy replacement. A load balancer that stops sending traffic to a dead server is doing failover; a database cluster that promotes a replica when the primary dies is doing failover; an orchestrator that reschedules a crashed container onto a healthy node is doing failover. What matters is how quickly the system detects the failure and how cleanly it shifts the work, because the time between the failure and the completed shift is the window during which the service is degraded or down. Good failover is fast and automatic, so the redundancy actually protects availability rather than just sitting there.
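As a rough illustration of detection followed by failover, the sketch below tracks heartbeats and picks the first node in a preference order that has reported recently. The names `HeartbeatMonitor` and `choose_active` and the timeout value are hypothetical, and the current time is passed in explicitly to keep the logic easy to follow; a real deployment would normally rely on the health checks built into its load balancer or orchestrator.

```python
from typing import Dict, List, Optional, Sequence


class HeartbeatMonitor:
    """Tracks the last time each node reported in (hypothetical helper)."""

    def __init__(self, timeout_s: float) -> None:
        self.timeout_s = timeout_s
        self.last_seen: Dict[str, float] = {}

    def heartbeat(self, node: str, now: float) -> None:
        # Record that the node was alive at time `now` (seconds).
        self.last_seen[node] = now

    def healthy_nodes(self, now: float) -> List[str]:
        # A node counts as healthy if it reported within the timeout.
        return sorted(
            node for node, seen in self.last_seen.items()
            if now - seen <= self.timeout_s
        )


def choose_active(
    preference: Sequence[str], monitor: HeartbeatMonitor, now: float
) -> Optional[str]:
    """Return the most preferred healthy node, or None if none is healthy."""
    healthy = set(monitor.healthy_nodes(now))
    for node in preference:
        if node in healthy:
            return node
    return None
```

The timeout is the detection half of the failover window: a dead node keeps being treated as healthy until the timeout expires, so a shorter timeout detects failures sooner at the cost of more false alarms.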
Active-active and active-passive are the two basic shapes of redundancy. In active-active, all the redundant copies serve traffic at once, so a failure just means the survivors carry more load, with no failover delay because everyone was already working. In active-passive, a standby sits idle until the active copy fails and then takes over, which is simpler and sometimes necessary for stateful components but introduces a failover delay and the risk the standby was not ready. Active-active gives better availability and uses the spare capacity, but it is harder to build correctly, especially for anything that holds state, so the right choice depends on the component.
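The claim that survivors "carry more load" in active-active has a simple capacity consequence, sketched below under the assumption that nodes are identical and load spreads evenly. The function name `max_safe_utilization` is hypothetical. If N nodes each run at utilization u and f of them fail, each survivor ends up at N·u / (N − f), which stays at or below full capacity only when u ≤ (N − f) / N.

```python
def max_safe_utilization(nodes: int, failures_to_survive: int = 1) -> float:
    """Highest per-node utilization at which the survivors can absorb the load.

    Assumes identical nodes and evenly spread load (a simplification).
    """
    if nodes < 1 or failures_to_survive < 0:
        raise ValueError("nodes must be >= 1 and failures_to_survive >= 0")
    if failures_to_survive >= nodes:
        raise ValueError("cannot survive losing every node")
    return (nodes - failures_to_survive) / nodes
```

With two nodes, each can run at no more than half capacity if one is to absorb the other's traffic; with four nodes the limit rises to three quarters, which is one reason wider active-active clusters use spare capacity more efficiently.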
State is what makes failover hard, and where most high-availability designs get complicated. Stateless components are easy to make redundant, because any copy can handle any request and a failure just means routing elsewhere. Stateful components such as databases, queues, and anything else that remembers things are difficult, because the replacement has to have the state. That means replicating data fast enough that a promoted replica is current, and handling the awkward cases where a failover might lose recent writes or leave two nodes that both think they are in charge. Most of the real engineering in high availability is in handling state across failures, which is why people push state into a small number of carefully built systems and keep everything above them stateless.
Availability is expressed as the percentage of time the system is up, and counting nines exists because the interesting differences are at the high end. Saying a system is 99 percent available sounds good until you work out that it allows more than three and a half days of downtime a year. Three nines, 99.9 percent, allows just under nine hours; four nines, 99.99 percent, a little under an hour; five nines, 99.999 percent, a little over five minutes. Each additional nine cuts the allowed downtime by a factor of ten, and counting nines is the practical way to talk about a number whose meaning is all in how close to 100 percent it gets.
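The downtime figures above follow from simple arithmetic, which the sketch below reproduces. It assumes a 365-day year and treats availability as a plain fraction of wall-clock time; the function name `allowed_downtime_hours` is hypothetical.

```python
HOURS_PER_YEAR = 365 * 24  # ignores leap years


def allowed_downtime_hours(availability_percent: float) -> float:
    """Hours of downtime per year permitted by an availability percentage."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability_percent must be between 0 and 100")
    return HOURS_PER_YEAR * (1 - availability_percent / 100)


if __name__ == "__main__":
    for pct in (99.0, 99.9, 99.99, 99.999):
        hours = allowed_downtime_hours(pct)
        print(f"{pct}%: {hours:.2f} hours/year ({hours * 60:.1f} minutes)")
```

Two nines works out to 87.6 hours (about 3.65 days), three nines to 8.76 hours, four nines to roughly 52.6 minutes, and five nines to roughly 5.3 minutes.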
Each additional nine costs much more than the last, which is the most important economic fact about high availability. Going from one nine to two is cheap, three to four requires real engineering, and four to five demands eliminating nearly every single point of failure, automating every recovery, and engineering for failures most systems never have to think about. The cost curve is steep and superlinear, so the question is never how much availability you can achieve but how much the service actually needs, because buying nines it does not need spends heavily for downtime reductions no one will value.
Measuring availability honestly requires defining what counts as down and from whose point of view, which is harder than the clean percentage suggests. A service that responds but is too slow to use is arguably down; one that is up for most users but unreachable for a region is partially down; one whose backend is healthy but whose login is broken is down for everyone even though most of it works. Meaningful measurement is from the perspective of the user trying to do something, not the server reporting itself healthy, and getting this right is what makes the nines correspond to something real rather than a comforting number that hides actual outages.
The point of measuring availability is to set a deliberate target and engineer to it rather than chase the maximum. A well-run service decides what availability it needs, based on what the service is for and what downtime costs, sets that as an objective, and engineers the redundancy and failover to meet it, no more and no less. This connects to the discipline of service level objectives and error budgets, where the agreed target governs how much risk the team can take, and it replaces the false goal of perfect uptime with a managed trade-off between availability and the cost and pace of change. The nines are a tool for that conversation, not a score to maximize.
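One way to make the error budget idea concrete is to turn the target into minutes of permitted downtime per window and subtract what has already been spent. This is a minimal sketch; the 30-day window and the function names are assumptions made for illustration, and real budgets are often computed over failed requests rather than wall-clock minutes.

```python
def error_budget_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Total downtime the objective permits over the window, in minutes."""
    window_minutes = window_days * 24 * 60
    return window_minutes * (1 - slo_percent / 100)


def budget_remaining_minutes(
    slo_percent: float, downtime_minutes: float, window_days: int = 30
) -> float:
    """Budget left after the downtime already observed (negative if overspent)."""
    return error_budget_minutes(slo_percent, window_days) - downtime_minutes
```

A 99.9 percent objective over 30 days permits 43.2 minutes of downtime. A team that has already used 10 of them has about 33 left to spend on risky changes, while a team that has used 50 is over budget and would usually slow down.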
A single point of failure is any component whose failure takes the whole system down, and finding and eliminating these is the central discipline of high-availability design. The reason is direct: when every request needs several components, their unavailability compounds, so a system is at most as available as its least available necessary part, and one component with no backup caps the availability of everything, no matter how redundant the rest is. You can run a hundred redundant servers, but if they all depend on one database or one load balancer with no alternative, that one thing is your single point of failure and your real availability is no better than its availability. The work of high availability is largely the work of hunting these down.
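The capping effect can be shown with the standard formulas for components in series and in parallel. The sketch below assumes failures are independent, which real systems only approximate, and the function names are hypothetical.

```python
import math
from typing import Iterable


def serial_availability(availabilities: Iterable[float]) -> float:
    """All components are required: the availabilities multiply."""
    return math.prod(availabilities)


def parallel_availability(availabilities: Iterable[float]) -> float:
    """Any one component suffices: the service is down only if all are down."""
    return 1 - math.prod(1 - a for a in availabilities)


# Three redundant servers at 99% each are almost always available together...
servers = parallel_availability([0.99, 0.99, 0.99])
# ...but one non-redundant database at 99.9% still caps the whole system.
system = serial_availability([servers, 0.999])
```

Here `servers` is about 0.999999, yet `system` comes out slightly below 0.999, because the single database sets the ceiling regardless of how many servers sit in front of it.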
Finding single points of failure means tracing every dependency and asking what happens if it fails. It is not enough to make the obvious components redundant; you have to follow the chain, the database the servers depend on, the DNS that routes to the load balancer, the authentication service every request needs, and find where a single failure has no alternative. These hidden dependencies are where availability quietly leaks away, because teams make their application redundant and forget it all funnels through one shared service that was never built to be highly available. Systematic dependency analysis is how you find what you missed.
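The dependency walk described above can be sketched as a graph traversal that flags every transitively required component running fewer than two copies. The data model is an assumption made for illustration: `depends_on` maps a component to the components it needs, `replicas` gives the number of independent copies of each, every dependency is treated as mandatory, and components missing from `replicas` are pessimistically counted as having one copy.

```python
from typing import Dict, List


def find_single_points_of_failure(
    depends_on: Dict[str, List[str]],
    replicas: Dict[str, int],
    entry_point: str,
) -> List[str]:
    """Return every component reachable from entry_point with fewer than 2 copies."""
    seen = {entry_point}
    stack = list(depends_on.get(entry_point, []))
    spofs = []
    while stack:
        component = stack.pop()
        if component in seen:
            continue
        seen.add(component)
        # Unknown replica counts are treated as 1, the pessimistic choice.
        if replicas.get(component, 1) < 2:
            spofs.append(component)
        stack.extend(depends_on.get(component, []))
    return sorted(spofs)
```

For example, with hypothetical component names:

```python
deps = {
    "checkout": ["load_balancer", "dns"],
    "load_balancer": ["app"],
    "app": ["database", "auth"],
}
copies = {"load_balancer": 2, "dns": 2, "app": 3, "database": 2}
find_single_points_of_failure(deps, copies, "checkout")  # flags "auth"
```

The shared `auth` service was never given a replica count, so it surfaces as the hidden single point of failure behind an otherwise redundant application.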
Some single points of failure are inside components you do not control, which is where the cloud and managed services change the picture in both directions. A managed database or load balancer can remove a single point of failure, because the provider runs it redundantly across zones better than you would, but it can also create one if you depend on a single regional service or a single provider and that has an outage. The provider's availability becomes your ceiling, so understanding what your dependencies actually guarantee, and whether they span zones and regions, is part of finding your single points of failure rather than assuming the provider has handled it.
Eliminating a single point of failure usually means redundancy, but sometimes the better answer is to remove the dependency or contain its failure. If a component is hard to make redundant, you can sometimes design the system to keep working in a degraded way when it fails, caching its last known state, queuing work until it returns, or falling back to a simpler path, so its failure degrades rather than stops the service. This graceful degradation is often more achievable than full redundancy for awkward dependencies, and it reflects the deeper goal: the point is not that nothing ever fails, but that no failure of a part becomes a total failure of the whole.
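Caching the last known state is the simplest of these fallbacks to sketch. The wrapper below, with the hypothetical name `LastKnownGood`, calls a dependency and, if the call fails, returns the most recent successful result marked as stale. It is a minimal illustration that assumes serving stale data is acceptable for the use case and leaves out cache expiry.

```python
from typing import Any, Callable, Tuple


class LastKnownGood:
    """Serve the last successful result when the dependency is failing."""

    def __init__(self, fetch: Callable[[], Any]) -> None:
        self._fetch = fetch
        self._cached: Any = None
        self._has_value = False

    def get(self) -> Tuple[Any, bool]:
        """Return (value, is_stale)."""
        try:
            value = self._fetch()
        except Exception:
            if self._has_value:
                # Degrade: return the stale value instead of failing outright.
                return self._cached, True
            # Nothing cached yet, so there is no degraded answer to give.
            raise
        self._cached = value
        self._has_value = True
        return value, False
```

Returning the staleness flag lets the caller decide how to present degraded data, for example by showing a notice or disabling actions that need a fresh value.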
Load balancing across redundant instances is the foundational pattern for stateless services, and it does a lot of the work of high availability on its own. A load balancer sits in front of several identical instances, health-checks them, and routes traffic only to the healthy ones, so a failed instance simply stops receiving requests while the others carry on. This makes the service tolerant of instance failures with no manual intervention and no visible outage, and it scales naturally because adding capacity is just adding instances behind the balancer. For anything stateless, this pattern plus enough spare capacity to absorb failures is most of what high availability requires.
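The routing behaviour described above can be sketched as a round-robin selector that skips instances marked unhealthy. The class name `RoundRobinBalancer` is hypothetical, and health status is set by hand here; a real load balancer would update it from periodic health checks.

```python
from typing import Iterable, List, Set


class RoundRobinBalancer:
    """Rotate through instances, routing only to those marked healthy."""

    def __init__(self, instances: Iterable[str]) -> None:
        self.instances: List[str] = list(instances)
        self.healthy: Set[str] = set(self.instances)
        self._next = 0

    def mark(self, instance: str, healthy: bool) -> None:
        # In practice this would be driven by a health-check loop.
        if healthy:
            self.healthy.add(instance)
        else:
            self.healthy.discard(instance)

    def pick(self) -> str:
        # Try each instance at most once per call, starting after the last pick.
        for _ in range(len(self.instances)):
            candidate = self.instances[self._next % len(self.instances)]
            self._next += 1
            if candidate in self.healthy:
                return candidate
        raise RuntimeError("no healthy instances available")
```

A failed instance simply drops out of the rotation and rejoins once it is marked healthy again, with no change needed on the client side.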
Spreading across availability zones and regions protects against larger failures, such as the loss of a whole data center or region. Cloud providers organize capacity into zones that fail independently and regions that are geographically separate. A high-availability design places redundant copies across multiple zones so the loss of one zone does not take the service down and, for the highest availability, across multiple regions so the loss of a whole region does not either. Multi-zone is now a reasonable default for important services and not very hard; multi-region is much harder, especially for stateful systems, because of the distance and the data consistency problems, and it is reserved for services that genuinely need to survive a regional outage.
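A small sketch of the placement idea: spread replicas across zones in rotation, then check whether enough replicas remain if the most heavily populated zone is lost. The zone names, replica naming, and function names are hypothetical, and the check assumes zones fail one at a time.

```python
from collections import Counter
from typing import Dict, Sequence


def spread_across_zones(replica_count: int, zones: Sequence[str]) -> Dict[str, str]:
    """Assign replicas to zones in rotation so no zone holds more than its share."""
    if not zones:
        raise ValueError("at least one zone is required")
    return {f"replica-{i}": zones[i % len(zones)] for i in range(replica_count)}


def survives_any_single_zone_loss(placement: Dict[str, str], min_replicas: int) -> bool:
    """True if at least min_replicas remain after losing the fullest zone."""
    per_zone = Counter(placement.values())
    worst_loss = max(per_zone.values(), default=0)
    return len(placement) - worst_loss >= min_replicas
```

Four replicas over two zones leave two survivors after a zone loss, so the layout is safe only if two replicas can carry the load; three replicas in a single zone survive nothing at the zone level, however many of them there are.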
Replication and consensus are the patterns that bring high availability to stateful systems, the hard part. A database achieves availability by replicating its data to other nodes that can take over, and the difficulty is keeping the replicas current and deciding which one is in charge after a failure without losing data or ending up with two primaries. Modern data systems use consensus protocols and careful replication to promote a healthy replica automatically when the primary fails, and getting this right, fast failover without data loss or split brain, is what makes a stateful system highly available. Most teams rely on databases that have solved this rather than building it themselves, because it is genuinely difficult to get correct.
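The way majority-based protocols avoid two primaries can be illustrated with the quorum rule on its own: a node may act as primary only while it can reach a strict majority of the cluster, counting itself. This is a sketch of that one rule, not a consensus implementation, and the function name `has_quorum` is hypothetical.

```python
def has_quorum(reachable_nodes: int, cluster_size: int) -> bool:
    """True if this side of a partition holds a strict majority of the cluster.

    reachable_nodes counts the node itself plus the peers it can contact.
    """
    return reachable_nodes > cluster_size // 2


# A 5-node cluster split 3/2: only the larger side may elect a primary.
majority_side = has_quorum(3, 5)
minority_side = has_quorum(2, 5)
```

Because two disjoint groups cannot both hold a strict majority, at most one side of any partition can promote a primary. The cost is that an even split, such as 2/2 in a four-node cluster, leaves neither side with quorum, which is one reason clusters are commonly sized with an odd number of nodes.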
Graceful degradation and isolation keep partial failures from becoming total ones, and they reflect a mature view of availability. Rather than assuming everything will be up, a resilient system is built so that when a dependency fails, the parts that need it degrade while the rest keeps working, through timeouts that prevent one slow dependency from hanging everything, circuit breakers that stop hammering a failed service, bulkheads that isolate failures to one area, and fallbacks that serve something useful when the ideal answer is unavailable. These patterns accept that failures will happen and contain their blast radius, which is often more valuable than preventing every failure, because the goal is for the user to keep getting a usable service even when something underneath is broken.
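Of the patterns listed, the circuit breaker is the least obvious, so here is a minimal sketch. After a set number of consecutive failures the breaker "opens" and returns the fallback without calling the dependency; once a cooldown has passed it lets one trial call through. The class name, thresholds, and injectable clock are assumptions made for illustration, and a production implementation would also need thread safety and per-call timeouts.

```python
import time
from typing import Any, Callable, Optional


class CircuitBreaker:
    """Stop calling a failing dependency for a cooldown period."""

    def __init__(
        self,
        failure_threshold: int = 3,
        reset_after_s: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._failure_threshold = failure_threshold
        self._reset_after_s = reset_after_s
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    @property
    def is_open(self) -> bool:
        return self._opened_at is not None

    def call(self, func: Callable[[], Any], fallback: Callable[[], Any]) -> Any:
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self._reset_after_s:
                # Open: skip the dependency entirely and serve the fallback.
                return fallback()
            # Cooldown elapsed: fall through and allow one trial call.
        try:
            result = func()
        except Exception:
            self._failures += 1
            if self._failures >= self._failure_threshold:
                # Open (or re-open) the breaker and restart the cooldown.
                self._opened_at = self._clock()
            return fallback()
        # Success closes the breaker and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

While the breaker is open the failing service receives no traffic from this caller, which gives it room to recover and keeps callers from waiting on requests that are likely to fail.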
High availability and disaster recovery are related but distinct, and confusing them leads to systems that handle the wrong failures. High availability is about staying up through ordinary, frequent failures, such as a server dying, a zone going down, or a process crashing, by having redundancy and failover that keep the service running with little or no interruption. Disaster recovery is about recovering from rare, large failures, such as the loss of a whole region, a catastrophic data corruption, or a major security incident, where the goal is not zero downtime but getting the service back within an acceptable time and with acceptable data loss. They overlap but aim at different failure classes, and a system needs both.
The two are measured by different objectives, which clarifies the distinction. High availability is measured by uptime and the nines, how little the service is down during normal operation. Disaster recovery is measured by recovery time objective, how long it takes to restore service after a disaster, and recovery point objective, how much data you can afford to lose. A system can have excellent high availability and poor disaster recovery, surviving daily failures gracefully but unable to recover from a regional catastrophe, or the reverse, so the two have to be designed and measured separately even though they share mechanisms like replication.
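The two recovery objectives can be checked against a plan with very little arithmetic. The sketch below assumes a simplified model in which data is protected only by periodic backups, so the worst-case data loss equals the backup interval, and in which the restore duration has been measured in a drill. The function and parameter names are hypothetical.

```python
from datetime import timedelta


def meets_recovery_objectives(
    backup_interval: timedelta,
    restore_duration: timedelta,
    rpo: timedelta,
    rto: timedelta,
) -> bool:
    """True if the plan satisfies both the recovery point and recovery time objectives."""
    # With periodic backups, a disaster just before the next backup loses
    # up to one full interval of writes (simplified model).
    worst_case_data_loss = backup_interval
    return worst_case_data_loss <= rpo and restore_duration <= rto
```

A nightly backup cannot satisfy a four-hour recovery point objective however fast the restore is, which shows why the two objectives have to be checked separately.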
The mechanisms overlap, which is why people conflate them, but they are tuned differently. Replicating data across zones serves high availability by allowing fast failover; replicating across regions and keeping backups serves disaster recovery by allowing recovery from a regional loss. The same replication technology serves both, but the design choices differ: high availability wants the replica current and ready to take over in seconds, while disaster recovery wants a copy far enough away and protected enough to survive whatever takes out the primary, even if recovering from it takes longer. Recognizing that the same tools serve different goals is what lets you design for both deliberately.
The practical implication is that you decide your requirements for both and design accordingly, rather than assuming one covers the other. A service needs an availability target its high-availability design meets, and a recovery time and recovery point objective its disaster recovery plan meets, and these are different conversations driven by different failures. Many serious outages come from teams that built good high availability and assumed it would also save them from a disaster, only to discover that surviving a server failure is nothing like recovering from a region loss or a data corruption. Treating the two as the distinct disciplines they are, both necessary, is what produces a service that survives both the common failures and the rare catastrophes.
A high-availability system is one engineered to keep running and stay reachable even when individual components fail, by assuming hardware, networks, and processes will fail and designing around it. The core mechanism is redundancy combined with failover: running more than one of each fallible part and shifting work automatically to the healthy ones when one fails. The goal is to minimize the time the service is unavailable to the people who depend on it, so the failure of a part does not become the failure of the whole. High availability is a deliberate engineering effort, not a property systems have by default.
Availability is the percentage of time the system is up, and nines is shorthand for how close to 100 percent that is. Two nines, 99 percent, allows more than three and a half days of downtime a year; three nines just under nine hours; four nines a little under an hour; five nines a little over five minutes. Each additional nine cuts the allowed downtime by a factor of ten, which is why people count nines rather than quote the raw percentage. Each nine also costs much more than the last to achieve, so the number should reflect what the service genuinely needs.
A single point of failure is any component whose failure takes the whole system down, because it has no redundant alternative. A system is at most as available as its least available necessary part, so one component with no backup caps the availability of everything else, no matter how redundant the rest is. Finding and eliminating these, by tracing every dependency including hidden ones like DNS, authentication, or a shared database, is the central discipline of high-availability design. Where full redundancy is hard, the alternative is to design the system to degrade gracefully when that component fails rather than stop entirely.
In active-active, all the redundant copies serve traffic at once, so a failure just means the survivors carry more load and there is no failover delay. In active-passive, a standby sits idle until the active copy fails and then takes over, which is simpler but introduces a failover delay and the risk the standby was not ready. Active-active gives better availability and uses the spare capacity but is harder to build correctly, especially for stateful components. The right choice depends on what the component is and whether it holds state, which is what makes redundancy hard.
State makes failover hard because a stateless component is easy to make redundant, since any copy can handle any request and a failure just means routing elsewhere, while a stateful component needs its data on the replacement before failover helps. That means replicating data fast enough that a promoted replica is current, and handling awkward cases where a failover might lose recent writes or leave two nodes that both think they are in charge. Most of the real engineering in high availability is in handling state across failures, which is why teams push state into a small number of carefully built data systems and keep everything above them stateless.
The cloud makes much of high availability easier, with managed services that run redundantly across zones, load balancers and orchestrators that handle failover, and multi-zone and multi-region options available. But the defaults do not always give you high availability, so you still design for it deliberately. The cloud also shifts where single points of failure live, since a managed service can remove one by running redundantly, or create one if you depend on a single region or provider that has an outage. The provider's availability becomes your ceiling, so you have to understand what your dependencies actually guarantee.
High availability is about staying up through ordinary, frequent failures like a server dying or a zone going down, measured by uptime and the nines. Disaster recovery is about recovering from rare, large failures like the loss of a whole region or a catastrophic data corruption, measured by recovery time and recovery point objectives. They overlap in mechanisms like replication but aim at different failure classes and are tuned differently. A system needs both, and assuming good high availability also provides disaster recovery is a common and dangerous mistake, because surviving a server failure is nothing like recovering from a region loss.
Graceful degradation means designing the system so that when a dependency fails, the parts that need it degrade while the rest keeps working, rather than the whole service stopping. It uses patterns like timeouts that stop one slow dependency from hanging everything, circuit breakers that stop hammering a failed service, bulkheads that isolate failures to one area, and fallbacks that serve something useful when the ideal answer is unavailable. It accepts that failures will happen and contains their blast radius, which is often more achievable than preventing every failure and reflects the real goal: keeping the user on a usable service even when something underneath is broken.
A service should have as much availability as it genuinely needs and no more, because each additional nine costs far more than the last. The right approach is to decide what availability the service requires, based on what it is for and what downtime actually costs the business and its users, set that as an explicit target, and engineer the redundancy and failover to meet it. This connects to service level objectives and error budgets, where the agreed target governs how much risk the team can take. Chasing the maximum number of nines for a service that does not need them spends heavily on downtime reductions no one will value.