Cloud infrastructure is the set of compute, storage, network, and supporting resources provisioned from a cloud provider that an application runs on. Where cloud architecture is the design, cloud infrastructure is what actually exists at runtime: the VMs, containers, load balancers, storage volumes, networks, DNS records, certificates, and dozens of other resources that the application depends on. Real examples reveal what production cloud infrastructure looks like at companies of different sizes, how it gets managed, and where the operational reality differs from the marketing pictures of cloud's effortless elasticity.
Provisioning has shifted decisively toward Infrastructure as Code over the past decade. The hand-clicked, console-managed infrastructure of early cloud adoption has become rare in production environments. Terraform, Pulumi, CDK, and similar tools define infrastructure in code; the code goes through pull requests and CI; the resulting infrastructure is reproducible and traceable. The shift lets teams operate infrastructure the way they operate application code.
The category in 2026 covers the cloud-provider native resources (AWS EC2/RDS/S3 and equivalents at GCP and Azure), the layered services that provide higher-level abstractions (Kubernetes, managed databases, serverless platforms), and the third-party infrastructure that integrates with cloud (Cloudflare, Datadog, GitHub, Vercel, and many others). Modern production stacks combine all three categories rather than relying on just the cloud provider's native services.
What separates well-managed cloud infrastructure from struggling cloud infrastructure is usually the discipline around lifecycle, change management, and cost. Well-managed infrastructure has explicit ownership, automated provisioning, clear lifecycle policies, and active cost management. Struggling infrastructure has unowned resources accumulating in unknown regions, manual changes that nobody documented, and cost surprises every month.
This page surveys real cloud infrastructure setups across startups, scale-ups, and enterprises, plus the patterns that have emerged for managing infrastructure as systems grow. The specific resource types and vendor offerings evolve continuously; the patterns for managing them are more stable.
A typical pre-Series-A startup runs on a single cloud (usually AWS or GCP), a managed Kubernetes cluster or container service, a managed PostgreSQL, object storage, a CDN (Cloudflare or CloudFront), and observability through Datadog or a cheaper alternative. The whole infrastructure fits in one Terraform repository managed by a few engineers who also write application code.
A Series-B-to-C startup typically has the same components but with more sophistication: separate environments (dev, staging, production), more Kubernetes namespaces, multiple managed databases for different services, secrets management through AWS Secrets Manager or HashiCorp Vault, and dedicated platform engineers responsible for the infrastructure layer. Cost has become a real line item that finance asks about.
A scale-up at hundreds of engineers has dedicated platform and SRE teams. Multiple Kubernetes clusters across regions for various workloads. Service mesh for advanced traffic management. Centralized observability across all services. Cost allocation across teams. Compliance work (SOC 2, HIPAA, PCI as applicable) shaping infrastructure choices. The infrastructure team is a recognized engineering organization rather than a side responsibility.
An enterprise has hundreds or thousands of cloud accounts, multiple primary clouds in some cases, decades of legacy systems coexisting with cloud-native ones, formal landing zones and governance, dedicated cloud centers of excellence, FinOps teams, and security operations centers. The complexity is on a different scale; the principles of infrastructure management still apply but with more layers of process and tooling.
The published case studies from scaled companies (Stripe, Airbnb, Shopify, Coinbase, Robinhood) describe infrastructure setups at these scales. The patterns are recognizable; the magnitudes vary; the underlying tooling overlaps significantly across companies.
Managed Kubernetes (EKS, GKE, AKS) dominates compute infrastructure for new builds. Worker nodes run as EC2 instances or equivalents; the cluster runs application workloads in pods. Karpenter or Cluster Autoscaler manages node capacity; horizontal pod autoscaling manages application capacity. The patterns are well-established and supported by extensive tooling.
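The split between node capacity and application capacity is easier to follow with the replica rule written out. The following is a minimal Python sketch of the ratio calculation described in the Kubernetes horizontal pod autoscaler documentation. The function name, the default bounds, and the sample numbers are illustrative assumptions; the real controller also accounts for unready pods, missing metrics, and stabilization windows.

```python
import math


def desired_replicas(current_replicas: int,
                     current_metric: float,
                     target_metric: float,
                     min_replicas: int = 1,
                     max_replicas: int = 10,
                     tolerance: float = 0.1) -> int:
    """Simplified HPA rule: scale replicas by the ratio of observed to target metric."""
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        # Close enough to the target: leave the replica count alone.
        desired = current_replicas
    else:
        desired = math.ceil(current_replicas * ratio)
    # Clamp to the configured bounds (illustrative defaults).
    return max(min_replicas, min(max_replicas, desired))


# Four pods averaging 90% CPU against a 60% target.
print(desired_replicas(4, 90, 60))
```

Node autoscalers such as Karpenter or Cluster Autoscaler only come into play afterward, when the extra pods cannot be scheduled on existing nodes.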
Serverless platforms (Lambda, Cloud Run, Azure Functions) handle event-driven and sporadic workloads. The infrastructure is invisible at runtime; functions just run when triggered. Provisioning is in code (SAM, Serverless Framework, CDK) but there are no instances to manage. The pattern eliminates large amounts of operational work for workloads that fit it.
Virtual machines persist for legacy workloads, specific high-performance use cases, and applications that have not been containerized. The patterns include auto-scaling groups for elastic capacity, spot or preemptible instances for cost optimization, and dedicated hosts for licensing or compliance requirements.
GPU infrastructure has grown significantly with AI workloads. The patterns include dedicated GPU instance pools for training workloads, GPU-enabled Kubernetes nodes for ML serving, and managed GPU services (SageMaker, Vertex AI Workbench) for data science teams. GPU costs are high and capacity is sometimes constrained; infrastructure decisions around GPUs deserve careful attention.
The mixed compute model is typical at scale. Different parts of the workload run on different compute types based on what fits best. Application services on Kubernetes. Event processing on serverless. ML training on GPU pools. Legacy systems on VMs. The mix is intentional and the cost of running multiple compute types is usually less than the cost of forcing everything onto one.
Object storage (S3, GCS, Azure Blob) handles the bulk of unstructured data: static assets, backups, ML artifacts, log archives, video and image files, data lake content. The economics are favorable for any volume; the operational simplicity is the main draw. Lifecycle policies move data to cheaper storage classes as it ages.
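As a rough illustration of how an age-based lifecycle policy behaves, the sketch below picks a storage class from an object's age. The tier names and the 90-day and 365-day thresholds are hypothetical; real providers define their own classes, minimum storage durations, and retrieval fees.

```python
from datetime import date

# Hypothetical tiers as (minimum age in days, storage class), oldest first.
RULES = [(365, "archive"), (90, "infrequent-access"), (0, "standard")]


def storage_class_for(last_modified: date, today: date, rules=RULES) -> str:
    """Return the first tier whose minimum age the object has reached."""
    age_days = (today - last_modified).days
    for min_age, storage_class in rules:
        if age_days >= min_age:
            return storage_class
    # Objects dated in the future fall back to the youngest tier.
    return rules[-1][1]


print(storage_class_for(date(2025, 1, 1), date(2025, 5, 1)))
```

In practice the provider evaluates rules like these on its own schedule; the IaC only declares the thresholds.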
Block storage (EBS, Persistent Disk, Managed Disks) backs VMs and Kubernetes persistent volumes. The choice is between SSD and HDD performance tiers, with various throughput and IOPS options. Storage classes in Kubernetes abstract the underlying block storage details for application developers.
Managed databases (RDS, Cloud SQL, Azure SQL) provide PostgreSQL, MySQL, and proprietary options. The managed service handles backups, patching, replication, and failover. The trade-off is some loss of low-level control compared to self-managed; the operational savings are substantial.
NoSQL managed services (DynamoDB, Firestore, Cosmos DB, Bigtable) handle high-throughput, low-latency, schema-flexible workloads. Each has its specific strengths and limitations. The pricing models can produce surprises if not understood; DynamoDB on-demand can be expensive at high constant throughput, for example.
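The pricing surprise can be made concrete with simple arithmetic. The sketch below compares a pay-per-request model with a provisioned-capacity model for a steady workload. The prices are made-up placeholders, not any provider's actual rates, and the model assumes one capacity unit serves one request per second.

```python
HOURS_PER_MONTH = 730  # common approximation


def monthly_cost_on_demand(requests_per_second: float,
                           price_per_million_requests: float) -> float:
    """Cost when every request is billed individually."""
    total_requests = requests_per_second * 3600 * HOURS_PER_MONTH
    return total_requests / 1_000_000 * price_per_million_requests


def monthly_cost_provisioned(capacity_units: int,
                             price_per_unit_hour: float) -> float:
    """Cost when capacity is reserved by the hour, used or not."""
    return capacity_units * price_per_unit_hour * HOURS_PER_MONTH


# Placeholder prices for a constant 1,000 requests per second.
on_demand = monthly_cost_on_demand(1000, price_per_million_requests=1.0)
provisioned = monthly_cost_provisioned(1000, price_per_unit_hour=0.001)
print(on_demand, provisioned)
```

The comparison flips for spiky or mostly idle workloads, where reserved capacity sits unused; substitute the provider's current prices before drawing conclusions.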
Specialized storage shows up for specific needs. Elasticsearch or OpenSearch clusters for search. Redis or Memcached for caching. Time-series stores for metrics. Vector stores for embeddings. Each fits a specific workload type that general-purpose databases cannot serve well.
VPC topology shapes how traffic flows between services. Public subnets for internet-facing load balancers. Private subnets for application tiers. Isolated subnets for sensitive data tiers. Security groups and network ACLs enforce traffic rules. The topology gets defined in IaC and changes go through review.
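A small sketch of how the three tiers might be carved out of one address range, using Python's standard `ipaddress` module. The tier names follow the paragraph above; the `/16` VPC range, the `/24` subnet size, and the two availability zones are assumptions chosen for illustration.

```python
import ipaddress


def plan_subnets(vpc_cidr: str,
                 tiers=("public", "private", "isolated"),
                 zones: int = 2,
                 new_prefix: int = 24) -> dict:
    """Assign consecutive, non-overlapping subnets to each tier, one per zone."""
    vpc = ipaddress.ip_network(vpc_cidr)
    needed = len(tiers) * zones
    available = 2 ** (new_prefix - vpc.prefixlen)
    if needed > available:
        raise ValueError(f"{vpc_cidr} cannot hold {needed} /{new_prefix} subnets")
    blocks = vpc.subnets(new_prefix=new_prefix)  # generator of subnets in order
    return {tier: [str(next(blocks)) for _ in range(zones)] for tier in tiers}


for tier, cidrs in plan_subnets("10.0.0.0/16").items():
    print(tier, cidrs)
```

In an IaC setup the resulting CIDR blocks would feed the subnet resources, so the address plan is reviewed alongside the rest of the topology.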
Cross-VPC connectivity uses peering, transit gateways, or PrivateLink-style services. The choice depends on scale and topology. Transit gateways centralize connectivity for hub-and-spoke architectures common at enterprise scale. Peering works for simpler topologies.
Hybrid connectivity (Direct Connect, ExpressRoute, Cloud Interconnect) links on-premises networks to cloud. The patterns are mature; the operational considerations include redundancy, monitoring, and the boundary between cloud-managed and on-premises-managed network segments.
CDN and edge services (CloudFront, Cloud CDN, Cloudflare, Fastly, Akamai) deliver content from edge locations near users. The patterns include cache configuration, origin failover, and increasingly edge compute for latency-sensitive logic.
Service meshes (Istio, Linkerd, Consul) provide advanced traffic management, mTLS, observability, and policy enforcement at the network layer. The complexity is significant; adoption is justified for teams that need the features and can operate the mesh well.
State management for IaC. Terraform state in S3 with DynamoDB locking is the most common pattern. Terraform Cloud, Spacelift, Env0, and similar platforms add features for team collaboration, policy enforcement, and audit. The state is the source of truth for what infrastructure exists; managing it well is critical.
CI/CD for infrastructure changes. Changes flow through pull requests with terraform plan in CI; merging triggers apply; the deployed state matches the code. The pattern brings software engineering discipline to infrastructure changes.
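One way a CI job can make a plan easier to review is to summarize it. The sketch below assumes the plan has been exported with `terraform show -json` and reads the `resource_changes` entries from that JSON. The function name and the sample resource addresses are hypothetical.

```python
import json
from collections import Counter


def summarize_plan(plan_json: str):
    """Count planned actions and list resources that would be deleted or replaced."""
    plan = json.loads(plan_json)
    counts = Counter()
    destructive = []
    for change in plan.get("resource_changes", []):
        actions = change["change"]["actions"]
        counts.update(actions)
        if "delete" in actions:
            # A replacement shows up as both "delete" and "create".
            destructive.append(change["address"])
    return dict(counts), destructive


# Hypothetical plan excerpt for illustration.
SAMPLE = json.dumps({"resource_changes": [
    {"address": "aws_s3_bucket.assets", "change": {"actions": ["create"]}},
    {"address": "aws_security_group.web", "change": {"actions": ["update"]}},
    {"address": "aws_db_instance.main", "change": {"actions": ["delete", "create"]}},
]})

counts, destructive = summarize_plan(SAMPLE)
print(counts, destructive)
```

A pipeline could post this summary on the pull request and require an extra approval whenever the destructive list is non-empty.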
Policy enforcement through OPA, Sentinel, or cloud-native services (AWS Config, Azure Policy, GCP Organization Policy). Policies prevent certain kinds of infrastructure changes from being applied: open security groups, unencrypted storage, untagged resources. The approach shifts the work from detecting problems after deployment to preventing them at the proposal stage.
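The three example policies can be expressed as a small check over proposed resources. This is a plain-Python illustration of the idea, not how OPA or Sentinel express rules (they use their own policy languages). The resource dictionary shape, the required tag names, and the allowance for public port 443 are assumptions.

```python
REQUIRED_TAGS = {"owner", "environment"}  # assumed tagging standard
PUBLIC_PORTS = {443}                      # ports allowed to face the internet (assumption)


def find_violations(resource: dict) -> list:
    """Return human-readable policy violations for one proposed resource."""
    problems = []
    kind = resource.get("type")
    if kind == "security_group":
        for rule in resource.get("ingress", []):
            if rule.get("cidr") == "0.0.0.0/0" and rule.get("port") not in PUBLIC_PORTS:
                problems.append(f"port {rule.get('port')} open to the internet")
    if kind == "bucket" and not resource.get("encrypted", False):
        problems.append("storage is not encrypted")
    missing = REQUIRED_TAGS - set(resource.get("tags", {}))
    if missing:
        problems.append("missing tags: " + ", ".join(sorted(missing)))
    return problems


proposed = {"type": "bucket", "encrypted": False, "tags": {"owner": "data-team"}}
print(find_violations(proposed))
```

Run against the planned changes in CI, a non-empty result fails the build before anything is applied.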
Drift detection identifies infrastructure that has changed outside the IaC: manual console changes, automatic adjustments by cloud services, untracked modifications. The drift gets reported and reconciled. The practice is essential at scale, where untracked changes otherwise accumulate unnoticed.
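At its core, drift detection is a comparison between what the code declares and what the provider reports. The sketch below shows that comparison over two dictionaries keyed by resource ID. The data shapes and resource names are hypothetical; real tools obtain the declared side from IaC state and the actual side from provider APIs.

```python
def detect_drift(declared: dict, actual: dict) -> dict:
    """Compare declared resources with observed ones and report the differences."""
    drift = {"missing": [], "unmanaged": [], "changed": {}}
    for resource_id, want in declared.items():
        if resource_id not in actual:
            drift["missing"].append(resource_id)  # declared but not deployed
            continue
        have = actual[resource_id]
        diff = {
            key: (want.get(key), have.get(key))  # (declared value, actual value)
            for key in want.keys() | have.keys()
            if want.get(key) != have.get(key)
        }
        if diff:
            drift["changed"][resource_id] = diff
    # Deployed resources that no code declares.
    drift["unmanaged"] = sorted(set(actual) - set(declared))
    drift["missing"].sort()
    return drift


declared = {"sg-web": {"port": 443}, "db-main": {"size": "large"}}
actual = {"sg-web": {"port": 443, "extra_port": 22}, "tmp-vm": {"size": "small"}}
print(detect_drift(declared, actual))
```

Each category calls for a different response: changed resources get reconciled, unmanaged ones get imported or deleted, and missing ones get re-applied.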
Cost management tooling (cloud-native cost explorers plus third-party tools like Vantage, CloudZero, Spot.io) tracks spending across teams and surfaces optimization opportunities. The tools work best when paired with team-level cost attribution and active optimization practices.
Infrastructure changes made through the console rather than IaC. The infrastructure drifts from the code; reproducibility breaks; the next IaC apply produces unexpected changes. The fix is enforcing IaC for all changes and using drift detection to catch the exceptions.
Unowned resources accumulating across accounts. Old development environments, abandoned projects, forgotten experiments. The cost grows without anyone responsible. The fix is required tagging, regular reviews, and aggressive cleanup of unowned resources.
Cost growth that no one understands. The bill grows; explanations are unclear; cuts come reactively. The fix is FinOps practices, cost attribution to teams, and ongoing optimization rather than crisis response.
Security gaps from default-permissive configurations. Open S3 buckets, overly broad IAM permissions, security groups allowing too much. The fix is policy enforcement that prevents these configurations from being applied in the first place.
Operational concentration in a few infrastructure experts. The team that knows the infrastructure is small; turnover creates risk; expansion requires extensive ramp-up. The fix is documentation, automation that reduces specialist requirements, and steady investment in growing infrastructure capability.
Terraform is the safe default with the largest community and broadest provider support. Choose Pulumi if you want to define infrastructure in a general-purpose language (Python, TypeScript, Go) rather than HCL. Choose CDK if you are deeply in AWS and prefer its idioms. The choice matters less than using one of them consistently.
Organize infrastructure code by environment and component. Separate state files for separate concerns limit the blast radius, so one mistake cannot affect everything. Common patterns include one repo per service with infrastructure inside it, or a separate infrastructure monorepo with modules for each environment. Either pattern works; consistency matters more than the specific choice.
A dedicated platform team pays off at scale. The threshold is fuzzy but sits somewhere around fifty engineers across multiple services. Below that, application engineers can handle infrastructure as part of their work. Above that, dedicated platform engineers produce better outcomes than asking application engineers to do it part-time.
Manage secrets through a managed secrets store (AWS Secrets Manager, GCP Secret Manager, Azure Key Vault) or a dedicated tool (HashiCorp Vault). Applications retrieve secrets at runtime through IAM-based access. Never put secrets in IaC code, in environment variables written into container definitions, or anywhere else that ends up in source control.
Pick one primary region close to most users or most consumed services. Add more regions only when there is a specific reason. Multi-region adds substantial operational complexity; a single region with backups avoids that complexity and still meets most reliability requirements.
Define an RPO and RTO that match the business requirements. Most workloads do well with backup-and-restore patterns (managed database snapshots plus infrastructure recreated from IaC), which suit modest RPO and RTO targets, typically hours rather than minutes. Higher availability requirements push toward multi-region active-passive or active-active, with the operational complexity those imply.
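A quick way to test a backup-and-restore plan against its targets is to write the arithmetic down. The sketch below assumes worst-case data loss equals the backup interval and that restoring data and recreating infrastructure happen one after the other. The class name, field names, and sample numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RecoveryPlan:
    backup_interval_minutes: float  # time between snapshots
    restore_minutes: float          # time to restore the latest snapshot
    reprovision_minutes: float      # time to recreate infrastructure from IaC

    def worst_case_rpo(self) -> float:
        # Failure just before the next snapshot loses a full interval of data.
        return self.backup_interval_minutes

    def estimated_rto(self) -> float:
        # Assumes reprovisioning and restore run sequentially.
        return self.reprovision_minutes + self.restore_minutes


def meets_targets(plan: RecoveryPlan, rpo_target: float, rto_target: float) -> bool:
    """True when both the RPO and RTO estimates fit within the targets (minutes)."""
    return plan.worst_case_rpo() <= rpo_target and plan.estimated_rto() <= rto_target


plan = RecoveryPlan(backup_interval_minutes=60, restore_minutes=45, reprovision_minutes=30)
print(plan.worst_case_rpo(), plan.estimated_rto(), meets_targets(plan, 240, 120))
```

If the targets tighten to minutes, no backup interval or restore speed in this model will satisfy them, which is the point at which multi-region designs enter the conversation. The estimates are only as good as the last restore test.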
Adopt the practices early rather than waiting for a cost crisis. Tag for attribution. Surface costs to teams. Review the most expensive resources monthly. Set budgets and alerts. Look for the obvious waste (idle resources, oversized instances, no lifecycle on storage). The practices are well-known; the discipline to apply them is the harder part.
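Tag-based attribution reduces to a group-by over billing line items. The sketch below assumes each line item is a dictionary with a `cost` and a `tags` mapping, and that teams are identified by a `team` tag; the field names and sample figures are hypothetical rather than any provider's billing export format.

```python
from collections import defaultdict


def cost_by_team(line_items, tag_key: str = "team") -> dict:
    """Sum cost per team tag, largest first; untagged spend gets its own bucket."""
    totals = defaultdict(float)
    for item in line_items:
        team = item.get("tags", {}).get(tag_key, "untagged")
        totals[team] += item["cost"]
    return dict(sorted(totals.items(), key=lambda pair: pair[1], reverse=True))


items = [
    {"cost": 120.0, "tags": {"team": "payments"}},
    {"cost": 80.0, "tags": {"team": "search"}},
    {"cost": 50.0, "tags": {}},
    {"cost": 30.0, "tags": {"team": "payments"}},
]
print(cost_by_team(items))
```

The size of the untagged bucket is a useful metric in its own right: it shows how much spend still has no owner.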
Apply least-privilege IAM. Encrypt at rest and in transit. Enable audit logging. Use policy enforcement to prevent risky configurations. Run vulnerability scanning on infrastructure. Have incident response procedures ready. The patterns are mature; the failures usually come from skipping them, not from inadequate patterns.
The direction is toward more managed services that absorb operational complexity, more AI-assisted infrastructure work in design, debugging, and optimization, better cross-cloud abstractions where they are needed, and more sophisticated cost management tooling. The category remains foundational and continues to evolve through tooling rather than through paradigm shifts.