Cloud architecture is the design of how an application or system uses cloud services to meet its functional and operational requirements. The discipline covers compute choice (containers, serverless, VMs), data storage (object stores, databases, caches), networking (VPC design, service mesh, edge), security boundaries (identity, encryption, network controls), and the cross-cutting concerns of observability, reliability, and cost. Real examples reveal which architectural patterns companies actually ship to production, what failures emerge after years of operation, and where the gap between vendor reference architectures and lived reality matters.
The discipline diverged from traditional system architecture as cloud-specific affordances accumulated. Auto-scaling groups changed how to think about capacity. Managed databases changed how to think about persistence. Serverless changed how to think about compute. Multi-region deployment changed how to think about availability. Each affordance brought new design choices that did not exist in the pre-cloud era and that older architecture habits did not address well.
The category in 2026 has settled into recognizable patterns. Most new application workloads run on Kubernetes or serverless platforms. Most data workloads run on cloud-native warehouses or lakehouses. Most public-facing systems sit behind CDNs and use cloud-native edge services. The specific vendor choices vary (AWS, GCP, Azure, sometimes multi-cloud), but the architectural shapes converge across vendors.
What separates a working cloud architecture from a struggling one is usually the coherence of choices and the operational maturity around them. Architectures that pick a primary cloud, lean into its native services, and operate them well produce better outcomes than architectures that try to stay vendor-neutral by avoiding cloud-specific services. The neutrality almost always costs more than it saves.
This page surveys real cloud architectures across consumer-facing platforms, enterprise applications, data platforms, and AI/ML workloads. The architectural patterns are more stable than the specific service names; the patterns translate across vendors and across years.
Netflix's architecture is one of the most-documented cloud-native systems. The platform runs on AWS, uses microservices with Spring Cloud-derived patterns, leans heavily on AWS managed services (S3, DynamoDB, Kinesis, EMR), and exemplifies the cell-based deployment pattern that contains blast radius for failures. The architecture has been documented extensively in Netflix's tech blog and conference talks.
Airbnb's architecture runs primarily on AWS with a substantial Kubernetes deployment. The platform handles searching, booking, and managing hospitality inventory at global scale. The migration from a Rails monolith to a service-oriented architecture took years and produced patterns the team has shared publicly.
Stripe operates on AWS with a service-oriented architecture and substantial investment in their own internal platform. The architecture handles payment processing at scale with stringent latency, reliability, and consistency requirements. Stripe's engineering team has published extensively on the patterns they use for high-reliability systems.
Shopify runs on Google Cloud (after a major migration from on-premises infrastructure) with Kubernetes-based deployment and a service-oriented architecture. The platform handles e-commerce for millions of merchants with traffic spikes during sales events. The architecture choices reflect the operational reality of multi-tenant SaaS at scale.
Coinbase, Robinhood, and similar fintech platforms have detailed engineering blogs describing their cloud architectures. The patterns reflect strict regulatory and security requirements layered onto cloud-native foundations. Common patterns include strict network segmentation, comprehensive audit logging, and aggressive observability.
Many enterprise organizations have published their cloud transformation case studies through cloud vendor partnerships. Capital One on AWS. HSBC on Google Cloud. BMW on Azure. The case studies describe architectures that combine cloud-native services with enterprise governance, security, and compliance requirements.
Kubernetes is the most common compute substrate for new builds. EKS on AWS, GKE on Google Cloud, and AKS on Azure are the managed Kubernetes services. The patterns include namespace-based multi-tenancy, GitOps for deployment, service meshes (Istio, Linkerd) for advanced traffic management, and operators for managing stateful workloads. The Kubernetes layer is operationally complex but provides portability and a rich ecosystem.
Serverless compute (AWS Lambda, Google Cloud Run, Azure Functions) fits workloads with sporadic traffic, event-driven processing, and unpredictable scaling. The pattern eliminates capacity planning for these workloads and scales to zero between events. The trade-offs are execution time limits and cold-start latency, which some workloads cannot tolerate.
Container services without Kubernetes (AWS ECS, Google Cloud Run, Azure Container Apps) offer simpler operation than Kubernetes for teams that want containers without the full Kubernetes complexity. The trade-off is fewer ecosystem options and vendor lock-in to the specific container service.
Virtual machines persist for legacy workloads, applications that do not containerize cleanly, and high-performance computing. The pattern is mature and well-understood; the operational practices have not changed dramatically with cloud adoption beyond what auto-scaling groups enable.
Mixed compute architectures are typical at large companies. The same architecture might use Kubernetes for the main application services, serverless for event processing, container services for specific operational tools, and VMs for legacy systems. The mix matches each workload to the compute model that fits best.
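The matching of workload to compute model can be sketched as a small decision function. This is a minimal illustration, not a definitive rule set: the `Workload` fields, the order of the checks, and the "more than one team" cutoff are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    """Hypothetical workload description; field names are illustrative."""
    name: str
    containerized: bool          # does it package cleanly as a container?
    event_driven: bool           # sporadic, event-triggered traffic?
    tolerates_cold_start: bool   # can callers absorb cold-start latency?
    team_count: int              # teams shipping services for this workload


def pick_compute(w: Workload) -> str:
    """Return a coarse compute recommendation (assumed rules of thumb)."""
    if not w.containerized:
        # Legacy or hard-to-containerize systems stay on virtual machines.
        return "vm"
    if w.event_driven and w.tolerates_cold_start:
        # Sporadic, event-driven work benefits from scale-to-zero.
        return "serverless"
    if w.team_count > 1:
        # Several teams sharing orchestration is where Kubernetes pays off.
        return "kubernetes"
    # A small number of services: a simpler container platform is enough.
    return "container-service"


if __name__ == "__main__":
    for w in (
        Workload("legacy-erp", False, False, False, 1),
        Workload("thumbnailer", True, True, True, 1),
        Workload("storefront", True, False, False, 6),
    ):
        print(w.name, "->", pick_compute(w))
```

In practice the inputs are judgment calls rather than booleans, so a function like this is more useful as a checklist for design review than as automation.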
Operational data lives in managed databases (RDS, Cloud SQL, Azure SQL, plus NoSQL options like DynamoDB, Firestore, Cosmos DB). Most new builds use these rather than self-managed databases on VMs. Operational simplicity almost always wins; the cost premium is usually smaller than the operational savings.
Analytical data lives in cloud warehouses and lakehouses such as Snowflake, BigQuery, Redshift, and Databricks. The pattern has been established for years and is the default for any non-trivial analytics workload.
Cache layers use managed services (ElastiCache, Memorystore, Azure Cache for Redis) or self-managed Redis on Kubernetes. The choice depends on cost at scale; managed services are easier but more expensive per gigabyte at large sizes.
Object storage (S3, GCS, Azure Blob) serves a long list of use cases beyond data lakes: application static assets, backups, ML model artifacts, video files, document storage. The service is the workhorse of cloud architecture and shows up in almost every system.
Specialized data stores fill specific niches. Elasticsearch or OpenSearch for full-text search. Time-series databases (Timestream, InfluxDB) for metrics. Graph databases (Neptune, Neo4j) for relationship queries. Vector databases (Pinecone, Weaviate, plus integrated options in mainstream databases) for embedding queries.
VPC design separates workloads and provides network-level isolation. The patterns include public subnets for load balancers, private subnets for application tiers, and isolated subnets for sensitive data tiers. Network ACLs and security groups enforce traffic rules between subnets.
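The three-tier subnet layout can be sketched with Python's standard `ipaddress` module. This is a planning aid only and creates no cloud resources; the `10.0.0.0/16` range, the `/20` subnet size, and the two availability zones per tier are assumptions chosen for illustration.

```python
import ipaddress

TIERS = ("public", "private", "isolated")


def plan_subnets(vpc_cidr: str, new_prefix: int = 20, zones: int = 2) -> dict:
    """Carve a VPC CIDR into equal subnets and assign them to tiers.

    Returns a mapping of tier name to a list of subnet CIDR strings,
    one subnet per availability zone.
    """
    vpc = ipaddress.ip_network(vpc_cidr)
    blocks = list(vpc.subnets(new_prefix=new_prefix))
    needed = len(TIERS) * zones
    if len(blocks) < needed:
        raise ValueError(f"{vpc_cidr} cannot hold {needed} /{new_prefix} subnets")
    plan = {}
    for i, tier in enumerate(TIERS):
        # Consecutive blocks go to the same tier, one block per zone.
        plan[tier] = [str(b) for b in blocks[i * zones:(i + 1) * zones]]
    return plan


if __name__ == "__main__":
    for tier, cidrs in plan_subnets("10.0.0.0/16").items():
        print(f"{tier:9s} {cidrs}")
```

The unassigned blocks left over in the VPC range are deliberate headroom for tiers or zones added later.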
Identity and access management has shifted toward fine-grained permissions and short-lived credentials. AWS IAM Roles for Service Accounts. GCP Workload Identity. Azure Managed Identities. The patterns eliminate long-lived credentials in favor of role-based access that the platform provides dynamically.
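The short-lived credential pattern can be sketched from the consumer's side: the application never stores a permanent secret and instead asks the platform for a fresh token shortly before the current one expires. The `Credential` and `CredentialCache` names, the `fetch` callback, and the 60-second refresh margin are hypothetical; a real workload would rely on the cloud SDK's own credential provider rather than code like this.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Credential:
    token: str
    expires_at: float  # Unix timestamp at which the token stops working


class CredentialCache:
    """Cache a short-lived credential and refresh it before it expires."""

    def __init__(
        self,
        fetch: Callable[[], Credential],      # hypothetical call to the platform
        refresh_margin: float = 60.0,         # seconds of safety margin (assumed)
        clock: Callable[[], float] = time.time,
    ) -> None:
        self._fetch = fetch
        self._margin = refresh_margin
        self._clock = clock
        self._current: Optional[Credential] = None

    def get(self) -> str:
        """Return a token that is valid for at least `refresh_margin` seconds."""
        cur = self._current
        if cur is None or cur.expires_at - self._clock() <= self._margin:
            cur = self._fetch()
            self._current = cur
        return cur.token


if __name__ == "__main__":
    def demo_fetch() -> Credential:
        # Stand-in for a platform call; issues a 15-minute token.
        return Credential(token="example-token", expires_at=time.time() + 900)

    cache = CredentialCache(demo_fetch)
    print(cache.get())
```

Injecting the clock keeps the refresh logic testable without waiting for real tokens to expire.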
Edge architecture uses CDNs (CloudFront, Cloud CDN, Azure Front Door, plus third-party options like Cloudflare and Fastly) for static asset delivery and increasingly for edge compute. Lambda@Edge, Cloudflare Workers, and similar services run logic at the edge for latency-sensitive use cases.
Encryption is table stakes across the architecture. Data at rest is encrypted with cloud-managed keys or customer-managed keys in cloud KMS. Data in transit is protected with TLS for all service-to-service communication. The cloud providers make these the default; opting out is harder than opting in.
Audit logging captures who accessed what. CloudTrail on AWS, Cloud Audit Logs on GCP, Activity Log on Azure. The logs feed both security tooling and compliance reporting. The pattern is essential for regulated environments and useful for operational forensics in any environment.
Single-region single-cloud handles most workloads adequately. The complexity of multi-region or multi-cloud is real and should only be taken on when there is a specific reason. The reasons include strict regulatory requirements, disaster recovery requirements that single-region cannot meet, latency requirements for global users, and vendor risk concerns at very large scale.
Active-active multi-region deployment serves users from the geographically closest region with low latency. The pattern requires careful data architecture (replication, conflict resolution, consistency choices) and significantly more operational complexity than single-region. Companies running this pattern at scale (Netflix, Twitter, Discord) have published material on the trade-offs.
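One of the simplest conflict-resolution choices is last-writer-wins, sketched below. It is shown only to make the trade-off concrete: it silently discards the older of two concurrent writes, which is why some systems choose other strategies instead. The `VersionedValue` type, the millisecond timestamps, and the region-name tie-breaker are assumptions for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VersionedValue:
    value: str
    timestamp_ms: int  # writer's clock at write time (assumes roughly synced clocks)
    region: str        # tie-breaker so every replica picks the same winner


def resolve(a: VersionedValue, b: VersionedValue) -> VersionedValue:
    """Last-writer-wins: newest timestamp wins; ties break on region name."""
    return max(a, b, key=lambda v: (v.timestamp_ms, v.region))


def merge_replicas(local: dict, remote: dict) -> dict:
    """Merge two regions' key/value state into one converged view."""
    merged = dict(local)
    for key, incoming in remote.items():
        if key in merged:
            merged[key] = resolve(merged[key], incoming)
        else:
            merged[key] = incoming
    return merged


if __name__ == "__main__":
    us = {"cart:42": VersionedValue("2 items", 1000, "us-east")}
    eu = {"cart:42": VersionedValue("3 items", 1200, "eu-west")}
    # Both merge directions converge on the same state.
    print(merge_replicas(us, eu) == merge_replicas(eu, us))
```

The property that matters is convergence: merging in either direction yields the same result, so regions can exchange state in any order.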
Active-passive multi-region for disaster recovery serves all traffic from one region with a standby region ready to take over if the primary fails. The pattern is simpler than active-active but has its own challenges: keeping the standby warm enough to take over, periodic failover testing, data consistency during failover.
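The failover decision itself can be sketched as a small state machine: promote the standby only after several consecutive failed health probes, so a single blip does not trigger a region switch. The class name, region names, and threshold of three are hypothetical, and the sketch deliberately leaves failback as a manual step.

```python
class FailoverController:
    """Track primary health probes and decide which region is active."""

    def __init__(self, primary: str, standby: str, failure_threshold: int = 3) -> None:
        self.primary = primary
        self.standby = standby
        self.failure_threshold = failure_threshold  # assumed value, tune per system
        self.active = primary
        self._consecutive_failures = 0

    def record_probe(self, primary_healthy: bool) -> str:
        """Record one health probe result and return the active region."""
        if self.active != self.primary:
            # Already failed over; returning to the primary is left to operators.
            return self.active
        if primary_healthy:
            self._consecutive_failures = 0
        else:
            self._consecutive_failures += 1
            if self._consecutive_failures >= self.failure_threshold:
                self.active = self.standby
        return self.active


if __name__ == "__main__":
    controller = FailoverController("region-a", "region-b")
    for healthy in (True, False, False, False):
        print(controller.record_probe(healthy))
```

Running the same controller logic against a deliberately failed primary is one way to approach the periodic failover testing mentioned above.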
Multi-cloud architectures are rare in pure form. More common is primary-cloud-plus-secondary, where one cloud hosts the main workload and another hosts specific services that are better there or that exist for vendor risk mitigation. True workload portability across clouds is expensive and usually not worth the cost.
Trying to stay vendor-neutral by avoiding cloud-native services. The architecture ends up reinventing services the cloud provides while losing the benefits of the managed versions. The fix is leaning into one cloud's native services and accepting some lock-in.
Multi-region complexity without need. The team builds for active-active across regions for resilience that single-region with backups would have provided. The operational burden becomes a permanent drag without commensurate benefit. The fix is starting single-region and adding regions only when need is demonstrated.
Networking complexity that no one fully understands. VPCs, peering, transit gateways, VPNs, service meshes all combined produce architectures only one person on the team understands. The fix is simplification, documentation, and limiting networking complexity to what use cases actually require.
Security retrofitted after problems emerge. The architecture ships with permissive defaults, incidents reveal the gaps, and security gets bolted on afterward. The fix is security-aware architecture from day one with explicit threat modeling.
Cost growth that surprises leadership. Cloud bills grow faster than the business expects, explanations are muddled, and cuts arrive as a crisis response. The fix is FinOps practices, cost attribution, and ongoing optimization rather than reactive responses to bill spikes.
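Cost attribution can be sketched as a roll-up of billing line items by an ownership tag. The line-item shape (`cost_usd`, `tags`) and the `team` tag key are assumptions for the example and do not match any provider's billing export format.

```python
from collections import defaultdict


def attribute_costs(line_items: list, tag_key: str = "team") -> dict:
    """Sum cost per owner tag; items without the tag land in 'untagged'."""
    totals = defaultdict(float)
    for item in line_items:
        owner = item.get("tags", {}).get(tag_key, "untagged")
        totals[owner] += item["cost_usd"]
    return dict(totals)


if __name__ == "__main__":
    # Hypothetical line items, not real billing data.
    bill = [
        {"service": "compute", "cost_usd": 1200.0, "tags": {"team": "checkout"}},
        {"service": "storage", "cost_usd": 300.0, "tags": {"team": "checkout"}},
        {"service": "warehouse", "cost_usd": 800.0, "tags": {"team": "data"}},
        {"service": "networking", "cost_usd": 150.0, "tags": {}},
    ]
    for owner, cost in sorted(attribute_costs(bill).items()):
        print(f"{owner:10s} ${cost:,.2f}")
```

The `untagged` bucket is the useful signal here: spend that nobody owns is the spend nobody is optimizing.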
Choose whichever cloud your team has experience with, or whichever has the services your workload most needs. AWS has the broadest service catalog. GCP has strong data and AI services. Azure has strong enterprise integration with Microsoft tooling. The differences between major clouds are smaller than the differences between using any one well or poorly.
Kubernetes is worth adopting if you have multiple teams shipping containerized services and need orchestration across them. If you have a few services and could use a simpler container platform (ECS, Cloud Run, Container Apps), the simpler platform is often easier to operate. Kubernetes earns its complexity at scale; below that scale, it adds overhead.
Default strongly to managed services unless there is a specific reason to self-manage. The operational savings usually exceed the cost premium. Self-manage when you need capabilities the managed service does not provide, when cost at scale demands it, or when vendor independence is genuinely required.
Go multi-region only if you have a specific reason: regulatory requirements, a disaster recovery SLA, global latency, or vendor risk at scale. If you do, start with active-passive for disaster recovery rather than active-active. The complexity of active-active is significant, and most workloads do not benefit from it enough to justify it.
Hybrid cloud is common at large enterprises with existing on-premises investments. The architectures usually keep specific workloads on-premises (latency-sensitive, regulatory, sunk capital) and move new workloads to the cloud. Hybrid is more typical than pure cloud at enterprise scale; the operational patterns differ enough that hybrid is its own architectural style.
Treat observability as an architectural concern from day one. Collect logs, metrics, and traces across all services, centralized through Datadog, Splunk, Honeycomb, or cloud-native services. Align alerting policies to user impact rather than infrastructure metrics. The investment pays back many times in operational visibility.
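One way to align alerting with user impact is to alert on how fast a service is consuming its error budget, rather than on CPU or memory. The sketch below computes a burn rate from request counts; the 99.9% SLO target and the paging threshold of 14.4 are assumptions for illustration and should be set per service.

```python
def burn_rate(failed: int, total: int, slo_target: float) -> float:
    """Ratio of the observed error rate to the error rate the SLO allows.

    A value of 1.0 means the error budget is being spent exactly on schedule;
    higher values mean it will run out early.
    """
    if total == 0:
        return 0.0  # no traffic in the window, nothing to burn
    error_budget = 1.0 - slo_target
    return (failed / total) / error_budget


def should_page(
    failed: int,
    total: int,
    slo_target: float = 0.999,   # assumed SLO
    threshold: float = 14.4,     # assumed fast-burn paging threshold
) -> bool:
    """Page a human only when users are being hurt faster than the budget allows."""
    return burn_rate(failed, total, slo_target) >= threshold


if __name__ == "__main__":
    # 20 failures in 1,000 requests is a 2% error rate against a 0.1% budget.
    print(round(burn_rate(20, 1000, 0.999), 1), should_page(20, 1000))
```

A host running at high CPU with no failed requests never pages under this scheme, while a quiet host returning errors does.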
ML training workloads run on GPU instances or specialized services (SageMaker, Vertex AI, Azure ML). Inference runs on container or serverless platforms depending on latency requirements. Model artifacts live in object storage. Feature data lives in warehouses or feature stores. The patterns integrate with the rest of the cloud architecture rather than living separately.
Use edge computing for workloads with strict latency requirements that the regional cloud cannot meet. CDN edge compute (CloudFront Functions, Cloudflare Workers) handles request-time logic close to users. IoT-style edge devices handle local processing for telemetry. Most workloads do not need edge; those that do should explicitly evaluate the trade-offs.
Cloud architecture is heading toward more managed services that reduce operational burden, toward better cross-region and multi-cloud abstractions where they are needed, toward more AI-assisted architecture work in design and operation, and toward more standardization around Kubernetes, OpenTelemetry, and similar cross-vendor patterns. The fundamentals are stable; the tooling continues to mature.