Cloud Infrastructure: Real Examples & Use Cases

Definition

Cloud infrastructure is the set of compute, storage, network, and supporting resources provisioned from a cloud provider that an application runs on. Where cloud architecture is the design, cloud infrastructure is what actually exists at runtime: the VMs, containers, load balancers, storage volumes, networks, DNS records, certificates, and dozens of other resources that the application depends on. Real examples show what production cloud infrastructure looks like at companies of different sizes, how it gets managed, and where operational reality differs from the marketing picture of effortless elasticity.

Provisioning has shifted decisively toward Infrastructure as Code (IaC) over the past decade. The hand-clicked, console-managed infrastructure of early cloud adoption has become rare in production environments. Terraform, Pulumi, AWS CDK, and similar tools define infrastructure in code; the code goes through pull requests and CI; the resulting infrastructure is reproducible and traceable. As a result, infrastructure is now operated much like application code.

The category in 2026 covers the cloud-provider native resources (AWS EC2/RDS/S3 and equivalents at GCP and Azure), the layered services that provide higher-level abstractions (Kubernetes, managed databases, serverless platforms), and the third-party infrastructure that integrates with the cloud (Cloudflare, Datadog, GitHub, Vercel, and many others). Modern production stacks combine all three categories rather than relying on the cloud provider's native services alone.

What separates well-managed cloud infrastructure from struggling cloud infrastructure is usually the discipline around lifecycle, change management, and cost. Well-managed infrastructure has explicit ownership, automated provisioning, clear lifecycle policies, and active cost management. Struggling infrastructure has unowned resources accumulating in unknown regions, manual changes that nobody documented, and cost surprises every month.

This page surveys real cloud infrastructure setups across startups, scale-ups, and enterprises, plus the patterns that have emerged for managing infrastructure as systems grow. The specific resource types and vendor offerings evolve continuously; the patterns for managing them are more stable.

Key Takeaways

Cloud infrastructure is the runtime collection of compute, storage, network, and supporting resources an application depends on.
Infrastructure as Code (Terraform, Pulumi, CDK) has become the standard for provisioning; hand-clicked console management is mostly legacy.
Production stacks combine cloud-provider native services, layered abstractions like Kubernetes, and third-party infrastructure services.
The operational practices around infrastructure (ownership, lifecycle, change management, cost) matter more than the specific resource choices.
The patterns scale across company sizes; what changes is the team structure and tooling, not the underlying infrastructure approach.

Production Infrastructure at Different Scales

A typical pre-Series-A startup runs on a single cloud (usually AWS or GCP), a managed Kubernetes cluster or container service, a managed PostgreSQL, object storage, a CDN (Cloudflare or CloudFront), and observability through Datadog or a cheaper alternative. The whole infrastructure fits in one Terraform repository managed by a few engineers who also write application code.

A Series-B-to-C startup typically has the same components but with more sophistication: separate environments (dev, staging, production), more Kubernetes namespaces, multiple managed databases for different services, secrets management through AWS Secrets Manager or HashiCorp Vault, and dedicated platform engineers responsible for the infrastructure layer. Cost has become a real line item that finance asks about.

A scale-up at hundreds of engineers has dedicated platform and SRE teams. Multiple Kubernetes clusters across regions for various workloads. Service mesh for advanced traffic management. Centralized observability across all services. Cost allocation across teams. Compliance work (SOC 2, HIPAA, PCI as applicable) shaping infrastructure choices. The infrastructure team is a recognized engineering organization rather than a side responsibility.

An enterprise has hundreds or thousands of cloud accounts, multiple primary clouds in some cases, decades of legacy systems coexisting with cloud-native ones, formal landing zones and governance, dedicated cloud centers of excellence, FinOps teams, security operations centers. The complexity is on a different scale; the principles of infrastructure management still apply but with more layers of process and tooling.

The published case studies from scaled companies (Stripe, Airbnb, Shopify, Coinbase, Robinhood) describe infrastructure setups at these scales. The patterns are recognizable; the magnitudes vary; the underlying tooling overlaps significantly across companies.

Compute Infrastructure Patterns

Managed Kubernetes (EKS, GKE, AKS) dominates compute infrastructure for new builds. Worker nodes run as EC2 instances or equivalents; the cluster runs application workloads in pods. Karpenter or Cluster Autoscaler manages node capacity; horizontal pod autoscaling manages application capacity. The patterns are well-established and supported by extensive tooling.
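
To make the horizontal pod autoscaling step concrete, the sketch below approximates the ratio rule described in the Kubernetes documentation for a single metric: scale the replica count by the ratio of the observed metric to the target, round up, and skip changes inside a small tolerance band. The function name, the `min_replicas` and `max_replicas` bounds, and the default values are illustrative assumptions, not a copy of the controller's implementation.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=1, max_replicas=10, tolerance=0.1):
    """Approximate the HPA scaling rule for one metric (illustrative only)."""
    ratio = current_metric / target_metric
    # Inside the tolerance band the replica count is left unchanged.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    # Clamp to the configured bounds (hypothetical defaults above).
    return max(min_replicas, min(max_replicas, desired))
```

With four replicas averaging 90% CPU against a 60% target, this sketch asks for six replicas. Node-level tools such as Karpenter or Cluster Autoscaler then decide whether the cluster has room for them.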

Serverless platforms (Lambda, Cloud Run, Azure Functions) handle event-driven and sporadic workloads. The infrastructure is invisible at runtime; functions just run when triggered. Provisioning is in code (SAM, Serverless Framework, CDK) but there are no instances to manage. The pattern eliminates large amounts of operational work for workloads that fit it.

Virtual machines persist for legacy workloads, specific high-performance use cases, and applications that have not been containerized. The patterns include auto-scaling groups for elastic capacity, spot or preemptible instances for cost optimization, and dedicated hosts for licensing or compliance requirements.

GPU infrastructure has grown significantly with AI workloads. The patterns include dedicated GPU instance pools for training workloads, GPU-enabled Kubernetes nodes for ML serving, and managed GPU services (SageMaker, Vertex AI Workbench) for data science teams. GPU costs are high and capacity is sometimes constrained; infrastructure decisions around GPUs deserve careful attention.

The mixed compute model is typical at scale. Different parts of the workload run on different compute types based on what fits best. Application services on Kubernetes. Event processing on serverless. ML training on GPU pools. Legacy systems on VMs. The mix is intentional and the cost of running multiple compute types is usually less than the cost of forcing everything onto one.

Storage Infrastructure Patterns

Object storage (S3, GCS, Azure Blob) handles the bulk of unstructured data: static assets, backups, ML artifacts, log archives, video and image files, data lake content. The economics are favorable for any volume; the operational simplicity is the main draw. Lifecycle policies move data to cheaper storage classes as it ages.
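
The age-based tiering that a lifecycle policy performs can be sketched as a simple threshold lookup. In practice the provider applies these rules from a lifecycle configuration; the code below only illustrates the decision. The thresholds and the class names (`standard`, `infrequent-access`, `archive`) are hypothetical placeholders, not any provider's actual storage classes.

```python
from datetime import date

# Hypothetical tiering rules: (minimum age in days, storage class name).
# Ordered from the oldest threshold down so the first match wins.
LIFECYCLE_RULES = [
    (365, "archive"),
    (90, "infrequent-access"),
    (0, "standard"),
]


def storage_class_for(last_modified: date, today: date) -> str:
    """Return the storage class an object of this age would be moved to."""
    age_days = (today - last_modified).days
    for min_age, storage_class in LIFECYCLE_RULES:
        if age_days >= min_age:
            return storage_class
    # Objects dated in the future fall back to the hot tier.
    return "standard"
```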

Block storage (EBS, Persistent Disk, Managed Disks) backs VMs and Kubernetes persistent volumes. The choice is between SSD and HDD performance tiers, with various throughput and IOPS options. Storage classes in Kubernetes abstract the underlying block storage details for application developers.

Managed databases (RDS, Cloud SQL, Azure SQL) provide PostgreSQL, MySQL, and proprietary options. The managed service handles backups, patching, replication, and failover. The trade-off is some loss of low-level control compared to self-managed; the operational savings are substantial.

NoSQL managed services (DynamoDB, Firestore, Cosmos DB, Bigtable) handle high-throughput, low-latency, schema-flexible workloads. Each has its specific strengths and limitations. The pricing models can produce surprises if not understood; DynamoDB on-demand can be expensive at high constant throughput, for example.
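
The on-demand pricing surprise comes down to arithmetic: per-request pricing scales linearly with a constant load, while provisioned capacity is billed per unit-hour. The sketch below compares the two for a steady write rate. Every price passed in is a placeholder to be replaced with current figures from the provider's pricing page, and the model assumes one provisioned write unit sustains one small write per second, ignoring reads, storage, and reserved-capacity discounts.

```python
HOURS_PER_MONTH = 730
SECONDS_PER_HOUR = 3600


def on_demand_monthly_cost(writes_per_second, price_per_million_writes):
    """Monthly cost when every write request is billed individually."""
    requests = writes_per_second * SECONDS_PER_HOUR * HOURS_PER_MONTH
    return requests / 1_000_000 * price_per_million_writes


def provisioned_monthly_cost(writes_per_second, price_per_unit_hour, headroom=1.0):
    """Monthly cost when capacity is reserved per unit-hour.

    Assumes one provisioned unit sustains one write per second; `headroom`
    is a hypothetical multiplier for over-provisioning (1.2 = 20% spare).
    """
    units = writes_per_second * headroom
    return units * price_per_unit_hour * HOURS_PER_MONTH
```

Running both functions with real prices at the workload's steady-state throughput shows where the break-even sits. Spiky or unpredictable traffic shifts the comparison back toward on-demand, because provisioned capacity must be sized for the peak.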

Specialized storage shows up for specific needs. Elasticsearch or OpenSearch clusters for search. Redis or Memcached for caching. Time-series stores for metrics. Vector stores for embeddings. Each fits a specific workload type that general-purpose databases cannot serve well.

Network Infrastructure Patterns

VPC topology shapes how traffic flows between services. Public subnets for internet-facing load balancers. Private subnets for application tiers. Isolated subnets for sensitive data tiers. Security groups and network ACLs enforce traffic rules. The topology gets defined in IaC and changes go through review.
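
As a rough illustration of the subnet tiers described above, the following sketch carves a VPC address range into equal, non-overlapping blocks and assigns them to tiers across availability zones using Python's standard `ipaddress` module. The tier names, the zone count, and the `/20` subnet size are assumptions for the example; real layouts are normally declared in the IaC tool itself.

```python
import ipaddress


def plan_subnets(vpc_cidr, tiers=("public", "private", "isolated"),
                 zones=2, new_prefix=20):
    """Assign one subnet per tier per zone from a VPC CIDR block."""
    vpc = ipaddress.ip_network(vpc_cidr)
    # subnets() yields equal-size, non-overlapping blocks in address order.
    blocks = vpc.subnets(new_prefix=new_prefix)
    plan = {}
    for tier in tiers:
        # next() raises StopIteration if the VPC range is too small.
        plan[tier] = [str(next(blocks)) for _ in range(zones)]
    return plan
```

For `10.0.0.0/16` this assigns `10.0.0.0/20` and `10.0.16.0/20` to the public tier, with the private and isolated tiers taking the next four blocks.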

Cross-VPC connectivity uses peering, transit gateways, or PrivateLink-style services. The choice depends on scale and topology. Transit gateways centralize connectivity for hub-and-spoke architectures common at enterprise scale. Peering works for simpler topologies.

Hybrid connectivity (Direct Connect, ExpressRoute, Cloud Interconnect) links on-premises networks to the cloud. The patterns are mature; the operational considerations include redundancy, monitoring, and the boundary between cloud-managed and on-premises-managed network segments.

CDN and edge services (CloudFront, Cloud CDN, Cloudflare, Fastly, Akamai) deliver content from edge locations near users. The patterns include cache configuration, origin failover, and increasingly edge compute for latency-sensitive logic.

Service meshes (Istio, Linkerd, Consul) provide advanced traffic management, mTLS, observability, and policy enforcement at the network layer. The complexity is significant; adoption is justified for teams that need the features and can operate the mesh well.

Operational Tooling Around Infrastructure

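State management for IaC. Terraform state in an S3 bucket with state locking is the most common pattern; locking has traditionally used a DynamoDB table, and recent Terraform releases also support a lock file stored in the S3 bucket itself. Terraform Cloud (now HCP Terraform), Spacelift, env0, and similar platforms add features for team collaboration, policy enforcement, and audit. The state is the source of truth for what infrastructure exists; managing it well is critical.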

CI/CD for infrastructure changes. Changes flow through pull requests with terraform plan in CI; merging triggers apply; the deployed state matches the code. The pattern brings software engineering discipline to infrastructure changes.
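
One way a CI job can act on a plan is to read its JSON form (as produced by `terraform show -json` on a saved plan file) and gate risky changes. The sketch below assumes only the `resource_changes` list and each entry's `change.actions` field from that JSON. The helper names and the rule that deletes and replacements need manual approval are hypothetical choices for illustration.

```python
from collections import Counter


def summarize_plan(plan: dict) -> Counter:
    """Count planned actions per kind from a Terraform JSON plan."""
    counts = Counter()
    for rc in plan.get("resource_changes", []):
        actions = rc["change"]["actions"]
        if actions in (["no-op"], ["read"]):
            continue
        # A replacement lists both "delete" and "create" for one resource.
        key = "replace" if set(actions) == {"create", "delete"} else actions[0]
        counts[key] += 1
    return counts


def needs_manual_approval(plan: dict) -> bool:
    """Hypothetical gate: destructive changes require a human sign-off."""
    counts = summarize_plan(plan)
    return counts["delete"] > 0 or counts["replace"] > 0
```

A pipeline could load the plan with `json.load`, post the summary as a pull request comment, and block auto-apply when `needs_manual_approval` returns true.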

Policy enforcement through OPA, Sentinel, or cloud-native services (AWS Config, Azure Policy, GCP Organization Policy). Policies prevent certain kinds of infrastructure changes from being applied: open security groups, unencrypted storage, untagged resources. The practice shifts from detecting problems after deployment to preventing them at the proposal stage.
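
Real policy engines such as OPA or Sentinel use their own policy languages, but the three checks named above can be sketched in plain Python to show the shape of the logic. The resource dictionaries here use a simplified, hypothetical schema (`name`, `type`, `tags`, `config`) rather than any tool's actual plan format, and allowing port 443 from the internet is an assumed policy choice.

```python
def policy_violations(resources, required_tags=("owner", "cost-center")):
    """Return human-readable violations for a list of proposed resources."""
    violations = []
    for res in resources:
        name, kind = res["name"], res["type"]
        cfg = res.get("config", {})
        if kind == "security_group":
            for rule in cfg.get("ingress", []):
                # Assumed policy: only HTTPS may be open to the whole internet.
                if rule.get("cidr") == "0.0.0.0/0" and rule.get("port") != 443:
                    violations.append(
                        f"{name}: port {rule.get('port')} open to the internet")
        if kind == "bucket" and not cfg.get("encrypted", False):
            violations.append(f"{name}: storage is not encrypted")
        missing = [t for t in required_tags if t not in res.get("tags", {})]
        if missing:
            violations.append(f"{name}: missing tags {', '.join(missing)}")
    return violations
```

A CI step would fail the pull request whenever the returned list is non-empty, which is what moves the check from after deployment to the proposal stage.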

Drift detection identifies infrastructure that has changed outside the IaC. Manual console changes, automatic adjustments by cloud services, untracked modifications. The drift gets reported and reconciled. The patterns are essential at scale where unowned changes accumulate without detection.
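
At its core, drift detection is a comparison between what the code declares and what the provider reports. The sketch below assumes both sides have already been reduced to dictionaries keyed by a resource identifier, with attribute dictionaries as values; that normalization step, and the names used here, are hypothetical. Tools such as `terraform plan` perform the real comparison against provider APIs.

```python
def detect_drift(declared: dict, actual: dict) -> dict:
    """Compare declared resources with observed ones.

    Returns resources missing from the cloud, resources present but not in
    code, and per-attribute differences as (declared, actual) pairs.
    """
    drift = {"missing": [], "unmanaged": [], "changed": {}}
    for rid, want in declared.items():
        if rid not in actual:
            drift["missing"].append(rid)
            continue
        have = actual[rid]
        # Only attributes the code declares are compared.
        diffs = {k: (v, have.get(k)) for k, v in want.items() if have.get(k) != v}
        if diffs:
            drift["changed"][rid] = diffs
    drift["unmanaged"] = sorted(set(actual) - set(declared))
    return drift
```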

Cost management tooling (cloud-native cost explorers plus third-party tools like Vantage, CloudZero, Spot.io) tracks spending across teams and surfaces optimization opportunities. The tools work best when paired with team-level cost attribution and active optimization practices.
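
Team-level cost attribution usually reduces to grouping billing line items by a tag. The sketch below assumes line items shaped as dictionaries with a `cost` field and an optional `tags` mapping, which is a simplified stand-in for a real billing export. The `team` tag key and the `unattributed` bucket name are illustrative.

```python
from collections import defaultdict


def cost_by_team(line_items, tag_key="team"):
    """Sum cost per team tag; untagged spend goes to 'unattributed'."""
    totals = defaultdict(float)
    for item in line_items:
        team = item.get("tags", {}).get(tag_key, "unattributed")
        totals[team] += item["cost"]
    return dict(totals)
```

The size of the `unattributed` bucket is a useful signal in itself: it measures how much spend has no owner and therefore nobody optimizing it.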

Common Failure Modes

Infrastructure changes made through the console rather than IaC. The infrastructure drifts from the code; reproducibility breaks; the next IaC apply produces unexpected changes. The fix is enforcing IaC for all changes and using drift detection to catch the exceptions.

Unowned resources accumulating across accounts. Old development environments, abandoned projects, forgotten experiments. The cost grows without anyone responsible. The fix is required tagging, regular reviews, and aggressive cleanup of unowned resources.

Cost growth that no one understands. The bill grows; explanations are unclear; cuts come reactively. The fix is FinOps practices, cost attribution to teams, and ongoing optimization rather than crisis response.

Security gaps from default-permissive configurations. Open S3 buckets, overly broad IAM permissions, security groups allowing too much. The fix is policy enforcement that prevents these configurations from being applied in the first place.

Operational concentration in a few infrastructure experts. The team that knows the infrastructure is small; turnover creates risk; expansion requires extensive ramp-up. The fix is documentation, automation that reduces specialist requirements, and steady investment in growing infrastructure capability.

Best Practices

Define all infrastructure as code with version control, CI checks, and explicit review processes.
Implement policy enforcement that prevents non-compliant infrastructure changes at the proposal stage.
Tag everything for cost attribution, ownership, and lifecycle management; enforce tagging through policy.
Run drift detection to catch infrastructure changes made outside the IaC.
Treat cost as a managed operational concern with team-level attribution and ongoing optimization.

Common Misconceptions

"Cloud infrastructure is unlimited and elastic." In practice, quotas, capacity constraints (especially for GPUs), and cost limits all matter.
"Managed services eliminate operational work." They reduce it but do not eliminate it; ownership and operational understanding still matter.
"IaC means infrastructure changes are safe." Bad IaC can still cause incidents; the discipline shifts the risk but does not remove it.
"Multi-cloud infrastructure is a hedge against vendor lock-in." It usually creates a different kind of lock-in (to the portability abstraction) that costs more.
"The cloud is more expensive than on-premises." The comparison depends on workload, utilization, and which operational costs are counted.

Frequently Asked Questions (FAQs)

What IaC tool should I use?

How do I structure my IaC repository?

Do I need a dedicated platform team?

How do I manage secrets in cloud infrastructure?

How should I think about cloud regions?

How do I handle disaster recovery?

What about FinOps?

How do I keep infrastructure secure?

Where is cloud infrastructure heading?