A Data Lakehouse: Real Examples & Use Cases

Definition

A data lakehouse is a storage and query architecture that puts warehouse-style table semantics on top of cheap object storage, giving the same data both the analytical performance of a warehouse and the openness and economics of a lake. The pattern combines an open table format (Iceberg, Delta, or Hudi) over a storage layer like S3, GCS, or ADLS, with one or more compute engines that can read and write those tables. Real examples reveal which companies have actually consolidated their stacks on this pattern, what they gave up, and what they kept.

The lakehouse argument is that the historical separation between warehouses and lakes was a side effect of older technology limitations, not a fundamental requirement. Warehouses had fast query but locked-in storage. Lakes had cheap storage but weak semantics. Open table formats closed the gap by giving lake-stored data the ACID transactions, time travel, schema evolution, and indexing that used to require a warehouse. The result is a single architecture that can serve analytics, ML, and operational workloads from one copy of the data.

The category in 2026 is dominated by three open formats. Apache Iceberg has the broadest cross-vendor support, with Snowflake, BigQuery, Redshift, Databricks, AWS Glue, and most independent query engines reading and writing it. Delta Lake is the format Databricks promotes and the default in their managed environment. Apache Hudi has strong adoption for streaming-update use cases, particularly where CDC ingestion is the primary write pattern.

What separates a real lakehouse from a renamed lake is the presence of the table format layer and the operational discipline around it. A real lakehouse has tables with schemas you query reliably, transactions that prevent partial reads, and metadata that tracks history. A lake without those things still has the problems lakes always had: query plans that scan the world, schemas that drift silently, and partial writes that corrupt analyses.
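To make the table format layer concrete, here is a minimal sketch of the idea, not any real format's implementation. The `ToyTable` and `Snapshot` names are hypothetical. Each commit publishes a complete new snapshot, so a reader sees either the old file list or the new one and never a partial write. Older snapshots stay addressable, which is what time travel relies on.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Snapshot:
    snapshot_id: int
    files: frozenset  # data files visible to readers of this snapshot


@dataclass
class ToyTable:
    # Hypothetical stand-in for a table's metadata log; starts with an empty snapshot.
    snapshots: list = field(default_factory=lambda: [Snapshot(0, frozenset())])

    def current(self) -> Snapshot:
        return self.snapshots[-1]

    def commit_append(self, new_files) -> Snapshot:
        # Build the full new snapshot first, then publish it in a single step.
        base = self.current()
        snap = Snapshot(base.snapshot_id + 1, base.files | frozenset(new_files))
        self.snapshots.append(snap)
        return snap

    def scan(self, snapshot_id=None) -> list:
        # With no snapshot_id, read the latest state; otherwise read history.
        if snapshot_id is None:
            snap = self.current()
        else:
            by_id = {s.snapshot_id: s for s in self.snapshots}
            snap = by_id[snapshot_id]
        return sorted(snap.files)


table = ToyTable()
table.commit_append(["a.parquet"])
table.commit_append(["b.parquet", "c.parquet"])
print(table.scan())   # latest state
print(table.scan(1))  # state as of the first commit
```

Real formats store snapshots as metadata files in object storage and publish them through a catalog, but the reader-visible contract is the same: one snapshot at a time.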

This page surveys real lakehouse implementations across analytics, ML, and operational use cases. The specific format choices and vendor combinations evolve fast; the architectural pattern of open formats on object storage with multi-engine compute is stable enough to plan around.

Key Takeaways

A lakehouse uses open table formats over object storage to combine warehouse semantics with lake economics and openness.
Apache Iceberg, Delta Lake, and Apache Hudi are the three major table formats; pick based on engine support and write pattern.
The same data can serve analytics, ML training, and operational workloads from a single copy.
Catalog choice (Glue, Unity, Polaris, Tabular, Nessie) shapes which engines can interoperate cleanly.
The pattern works best for new builds; migrations from existing warehouses involve real engineering work.

Production Lakehouse Deployments

Netflix built much of the original tooling that became Apache Iceberg to handle its own scale problem: petabyte-scale tables with thousands of concurrent readers and writers across multiple engines (Spark, Trino, Flink). Netflix donated the project to the Apache Software Foundation, where it became the de facto open table format for cross-engine deployments. Netflix continues to run one of the largest Iceberg installations.

Apple uses Iceberg at scale across their internal data platform. Apple's contributions to the project shape its direction; the engineering investment reflects how seriously they take open standards for storage. Many of Iceberg's enterprise-readiness features came out of Apple's production requirements.

Stripe runs Iceberg-on-S3 with Trino as the primary query engine. The architecture lets Stripe's data team operate at petabyte scale without warehouse-vendor compute costs. The engineering team has written about the work to operate Iceberg in production, particularly around catalog choice and table maintenance.

Airbnb adopted Iceberg as part of their broader data platform consolidation. The migration from earlier formats took significant engineering effort but consolidated Airbnb's data infrastructure on a single format that supported all their downstream engines.

Databricks customers run Delta Lake at every scale from small teams to the largest enterprises. The format is the default for any data written through the Databricks platform. Comcast, Conde Nast, Shell, and many similar large companies have published case studies of Delta-based lakehouse implementations on Databricks.

Uber developed Apache Hudi to handle its CDC-heavy ingestion patterns, where rows update frequently rather than being append-only. Hudi's design optimizes for the streaming-update use case, which Iceberg and Delta originally handled less well. Uber continues to run Hudi at scale across its data platform.

Format Comparisons That Hold Up

Iceberg leads in cross-engine support. Snowflake, BigQuery, Redshift, Databricks (with some caveats), Trino, Spark, Flink, DuckDB, and most independent query engines read and write Iceberg tables. The interoperability story is the strongest reason for new builds to default to Iceberg.

Delta Lake is the natural choice in Databricks-heavy environments. The format has feature parity with Iceberg on most dimensions and tighter integration with the Databricks tooling stack. Outside Databricks, support is good but less universal than Iceberg.

Hudi remains the strongest choice for streaming-update workloads where the same rows get updated continuously. The merge-on-read storage pattern serves this use case efficiently. For mostly-append workloads, Iceberg and Delta have caught up enough that the historical Hudi advantage no longer matters as much.
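The merge-on-read idea can be sketched in a few lines. This is a toy illustration under stated assumptions, not Hudi's actual file layout or API: `merge_on_read`, the `op` values, and the row shape are all hypothetical. Writes append small change records to a log instead of rewriting base files, and the reader combines base rows with the log at query time.

```python
def merge_on_read(base_rows, delta_log, key="id"):
    """Combine base-file rows with a later change log at read time.

    base_rows: list of dicts from the (rarely rewritten) base files.
    delta_log: list of {"op": "upsert" | "delete", "row": {...}} in commit order.
    """
    merged = {row[key]: dict(row) for row in base_rows}
    for change in delta_log:
        k = change["row"][key]
        if change["op"] == "delete":
            merged.pop(k, None)
        else:
            # Upsert: later values override earlier ones for the same key.
            merged[k] = {**merged.get(k, {}), **change["row"]}
    return [merged[k] for k in sorted(merged)]


base = [{"id": 1, "status": "new"}, {"id": 2, "status": "new"}]
log = [
    {"op": "upsert", "row": {"id": 1, "status": "paid"}},
    {"op": "delete", "row": {"id": 2}},
    {"op": "upsert", "row": {"id": 3, "status": "new"}},
]
print(merge_on_read(base, log))
```

The trade-off is that writes stay cheap while reads do extra work, which is why periodic compaction of the log back into base files still matters.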

The format wars have cooled. Databricks announced Iceberg compatibility through its UniForm feature and acquired Tabular (one of the major commercial Iceberg companies). Snowflake added native Iceberg support. The major engines now read each other's formats well enough that the lock-in concern has receded. Teams pick a primary format but worry less about whether they can switch later.

The differences that still matter in practice: write throughput at scale, support for specific features like row-level operations and merge syntax, catalog ecosystem, and the maturity of cleanup operations. The differences are smaller than they were two years ago and shrinking further each release cycle.

Multi-Engine Architectures

A working lakehouse typically has multiple query engines reading the same tables. Trino or Presto for interactive analytical queries. Spark for heavy transformations and ML training data preparation. Flink for streaming jobs. DuckDB or Athena for ad hoc analysis. Each engine picks the part of the workload it serves well.

Stripe's setup runs Trino as the primary interactive engine with Spark for heavy batch work. The two engines share the same Iceberg tables. Analysts query through Trino; engineers run transformations through Spark; the data is the same data with no copy.

Databricks customers typically use Spark (through Databricks) for both transformation and query. Some add Trino or DuckDB for low-latency interactive use cases that benefit from a different engine. The pattern still gives multi-engine flexibility on the same Delta tables.

Snowflake customers increasingly read Iceberg tables that other tools write. The pattern lets Snowflake be the consumption engine for the BI layer while Spark or Flink handles the production pipelines. The data lives once; multiple engines serve different audiences.

The catalog layer coordinates this. AWS Glue, Databricks Unity Catalog, Apache Polaris (open-sourced by Snowflake), and Project Nessie are the major catalog options. The catalog tracks which tables exist, where they live, and what their current state is. Different engines authenticate to the catalog and read or write through it; the catalog enforces consistency across engines.
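One common way a catalog can enforce consistency across engines is an atomic compare-and-swap on each table's current-metadata pointer. The sketch below assumes that design; `ToyCatalog`, `CommitConflict`, and `commit_with_retry` are hypothetical names, not any real catalog's API. A writer that based its work on a stale pointer is rejected and has to reload and retry.

```python
import threading


class CommitConflict(Exception):
    """Raised when another writer committed first."""


class ToyCatalog:
    # Hypothetical in-memory catalog: maps table name -> current metadata pointer.
    def __init__(self):
        self._pointers = {}
        self._lock = threading.Lock()

    def load(self, table):
        return self._pointers.get(table)

    def swap(self, table, expected, new):
        # Atomic compare-and-swap: succeed only if nobody committed in between.
        with self._lock:
            if self._pointers.get(table) != expected:
                raise CommitConflict(f"{table}: pointer is no longer {expected!r}")
            self._pointers[table] = new


def commit_with_retry(catalog, table, make_metadata, max_attempts=3):
    # Optimistic concurrency: reload the pointer, rebuild metadata, try again.
    for _ in range(max_attempts):
        current = catalog.load(table)
        new = make_metadata(current)
        try:
            catalog.swap(table, current, new)
            return new
        except CommitConflict:
            continue
    raise CommitConflict(f"{table}: gave up after {max_attempts} attempts")


catalog = ToyCatalog()
catalog.swap("orders", None, "v1")
print(commit_with_retry(catalog, "orders", lambda cur: f"v{int(cur[1:]) + 1}"))
```

Because every engine commits through the same pointer, two engines cannot both believe they produced the table's next version.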

Migration Patterns from Existing Stacks

Companies migrating from Snowflake or Redshift toward lakehouse usually go gradually. New pipelines write to Iceberg or Delta tables on object storage. Existing pipelines continue writing to the warehouse. Over time, more workloads move to the lakehouse side; the warehouse becomes the consumption engine for some workloads and gets replaced for others.

The economics favor migration for high-volume workloads. Object storage is typically cheaper per terabyte than warehouse-managed storage, though the size of the gap depends on the vendor and pricing model. Compute is decoupled and can be sized to the workload. The savings show up most dramatically at petabyte scale, where warehouse storage costs become significant line items.

The complexity to manage is real. Self-managed lakehouse infrastructure (catalog operations, table maintenance, compaction, optimization) needs engineering investment that the warehouse handles for you. Teams that migrate without budgeting for that investment end up with cheaper storage and more operational pain.

Managed lakehouse offerings (Databricks, Tabular's services, the various cloud-native options) reduce the operational burden while preserving the openness benefits. The pricing is usually still favorable compared to traditional warehouses for the same workloads, particularly at scale.

The honest pattern: most migrations are partial. The warehouse keeps serving the BI and ad hoc query workloads where its query optimizer and concurrency story remain best-in-class. The lakehouse picks up the heavy data engineering, ML training data, and large historical storage. The two coexist for years before one fully displaces the other, if ever.

ML and Operational Workloads on Lakehouse

ML training data assembly works well on lakehouse architectures. The training pipeline reads from the same tables that analytics queries read. Data scientists do not need a separate data dump; they query the lakehouse with Spark or DuckDB and assemble training sets directly. The pattern reduces duplication and keeps training data consistent with analytical data.

Feature serving for online inference is usually not done from the lakehouse directly because the latency profile is wrong. Lakehouse tables serve batch inference and feature backfill. Online serving typically uses a separate fast store (Redis, DynamoDB) that gets populated from the lakehouse. The pattern combines batch lakehouse processing with operational online storage.

CDC ingestion lands changes from operational databases into lakehouse tables. The pattern uses Debezium or similar to capture changes, Kafka to transport them, and a streaming engine to apply them to lakehouse tables. The result is a near-real-time replica of operational state in a format that analytics can query without touching production databases.
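The apply step at the end of that pipeline can be sketched as follows. The event shape loosely follows the Debezium envelope (`op` of `c`, `u`, `d`, or `r`, with `before` and `after` row images); the `seq` ordering field and the `apply_cdc_batch` function are assumptions for illustration, and a dict stands in for the lakehouse table. The batch is first reduced to the latest event per key, then applied as upserts and deletes, which is roughly what a MERGE statement does in a streaming engine.

```python
def apply_cdc_batch(table, events, key="id"):
    """Apply a batch of change events to `table`, a dict keyed by primary key.

    Each event is {"op": "c" | "u" | "d" | "r", "seq": int, "before": row, "after": row}.
    `seq` is a hypothetical monotonically increasing position in the change stream.
    """
    # Step 1: keep only the newest event per key, since batches may arrive out of order.
    latest = {}
    for ev in events:
        row = ev["before"] if ev["op"] == "d" else ev["after"]
        k = row[key]
        if k not in latest or ev["seq"] > latest[k]["seq"]:
            latest[k] = ev
    # Step 2: apply deletes and upserts.
    for k, ev in latest.items():
        if ev["op"] == "d":
            table.pop(k, None)
        else:
            table[k] = dict(ev["after"])
    return table


replica = {1: {"id": 1, "email": "a@example.com"}}
batch = [
    {"op": "u", "seq": 3, "before": None, "after": {"id": 1, "email": "c@example.com"}},
    {"op": "u", "seq": 2, "before": None, "after": {"id": 1, "email": "b@example.com"}},
    {"op": "c", "seq": 4, "before": None, "after": {"id": 2, "email": "z@example.com"}},
    {"op": "d", "seq": 5, "before": {"id": 2, "email": "z@example.com"}, "after": None},
]
print(apply_cdc_batch(replica, batch))
```

Deduplicating before applying keeps the write to the table small, which also limits the number of small files each micro-batch produces.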

Reverse ETL reads from lakehouse tables and pushes derived data into operational systems. The pattern lets the lakehouse serve as the source of truth for derived metrics and segments that operational tools consume. Census, Hightouch, and similar tools read from lakehouse tables the same way they read from warehouses.

Common Failure Modes

Skipping table maintenance leads to performance degradation. Iceberg, Delta, and Hudi all need periodic compaction, expired snapshot cleanup, and metadata pruning. Without maintenance, query performance drops and storage costs grow. The fix is scheduled maintenance jobs running as part of the production pipeline.
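As a rough sketch of what a snapshot-expiration job decides, the function below picks snapshots older than a retention window while always protecting the newest few. The function name, parameters, and tuple layout are hypothetical; real formats expose this through their own maintenance procedures.

```python
from datetime import datetime, timedelta


def snapshots_to_expire(snapshots, now, max_age, min_to_keep=1):
    """Return snapshot ids that are safe to expire.

    snapshots: list of (snapshot_id, committed_at) tuples.
    max_age: timedelta; snapshots committed before now - max_age are candidates.
    min_to_keep: the newest N snapshots are never expired, however old they are.
    """
    ordered = sorted(snapshots, key=lambda s: s[1], reverse=True)  # newest first
    protected = {sid for sid, _ in ordered[:min_to_keep]}
    cutoff = now - max_age
    return [sid for sid, ts in ordered if ts < cutoff and sid not in protected]


history = [
    (1, datetime(2026, 1, 1)),
    (2, datetime(2026, 1, 5)),
    (3, datetime(2026, 1, 9)),
]
print(snapshots_to_expire(history, now=datetime(2026, 1, 10), max_age=timedelta(days=3)))
```

Expiring a snapshot shortens the time-travel window, so the retention period is a product decision as much as a storage one.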

Catalog fragmentation produces multi-engine pain. Different engines pointing at different catalog implementations cannot easily share tables. The fix is consolidating on a single catalog that all engines can use, even if it means migrating away from an engine's default catalog.

Small file accumulation degrades read performance. Streaming writes especially produce many small files that query engines have to read individually. The fix is regular compaction jobs that rewrite small files into larger ones.
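A compaction job first has to decide which small files to rewrite together. The sketch below uses first-fit-decreasing bin packing toward a target file size; this is one plausible planning strategy, not the algorithm any particular format uses, and `plan_compaction` and its parameters are hypothetical.

```python
def plan_compaction(file_sizes, target_bytes, small_threshold=None):
    """Group small files into rewrite batches of at most target_bytes each.

    file_sizes: dict of path -> size in bytes.
    small_threshold: files at or above this size are left alone
                     (defaults to half the target).
    Returns a list of groups; each group is a list of paths to merge into one file.
    """
    if small_threshold is None:
        small_threshold = target_bytes // 2
    small = sorted(
        ((size, path) for path, size in file_sizes.items() if size < small_threshold),
        reverse=True,
    )
    groups = []  # each entry is [total_bytes, [paths]]
    for size, path in small:
        for group in groups:
            if group[0] + size <= target_bytes:
                group[0] += size
                group[1].append(path)
                break
        else:
            groups.append([size, [path]])
    # A group of one file gains nothing from being rewritten.
    return [paths for _, paths in groups if len(paths) > 1]


sizes = {"a": 10, "b": 20, "c": 30, "d": 200, "e": 40}
print(plan_compaction(sizes, target_bytes=100))
```

Running the planner per partition keeps each rewrite small and avoids mixing files that queries would otherwise prune separately.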

Schema evolution gets mishandled. The table format supports evolution, but engines have different rules about which changes they accept. The fix is testing schema changes against all consumer engines before deploying them.
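A pre-deployment check along these lines can be automated. The sketch below is illustrative only: the capability matrix is made up, the engine names are placeholders rather than real products, and `unsupported_by` is a hypothetical helper. In practice the matrix would come from running each proposed change against a test table in every consumer engine.

```python
# Hypothetical capability matrix; these are placeholder engines, not real products.
ENGINE_RULES = {
    "engine_a": {"add_column", "rename_column", "widen_type"},
    "engine_b": {"add_column", "widen_type"},
    "engine_c": {"add_column"},
}


def unsupported_by(changes, engine_rules=ENGINE_RULES):
    """Return {engine: [change kinds it rejects]} for every engine that rejects any.

    changes: list of dicts with at least a "kind" field, e.g. {"kind": "add_column"}.
    An empty result means every consumer engine accepts the whole change set.
    """
    requested = {change["kind"] for change in changes}
    problems = {}
    for engine, allowed in engine_rules.items():
        rejected = sorted(requested - allowed)
        if rejected:
            problems[engine] = rejected
    return problems


proposed = [
    {"kind": "add_column", "name": "discount"},
    {"kind": "rename_column", "old": "amt", "new": "amount"},
]
print(unsupported_by(proposed))
```

Wiring a check like this into CI blocks a schema change before it reaches a table that a stricter engine still reads.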

Migrations ignore the operational burden. Teams move data from the warehouse to the lakehouse and discover they now need to operate things the warehouse handled automatically. The fix is either budgeting for the operational work or using a managed lakehouse service.

Best Practices

Pick one primary table format and one primary catalog; standardize across the platform.
Schedule table maintenance jobs (compaction, snapshot expiration, metadata cleanup) as core platform operations.
Use the table format's schema evolution features rather than recreating tables when columns change.
Treat catalog operations as production infrastructure with backup, monitoring, and access control.
Validate that all consumer engines support the features you plan to use before standardizing on them.

Common Misconceptions

A lakehouse is a renamed lake; the table format layer and operational discipline make it a different system from an unstructured lake.
The lakehouse replaces the warehouse; in practice, the two often coexist with each handling workloads it serves best.
Iceberg, Delta, and Hudi are functionally identical; they differ enough in write patterns and ecosystem that the choice matters for some workloads.
Going lakehouse saves money automatically; storage gets cheaper, but operational complexity can eat the savings without engineering investment.
Lakehouse means no vendor lock-in; the catalog and the engine layer still create lock-in even when the storage format is open.

Frequently Asked Questions (FAQs)

Should new builds default to lakehouse or warehouse?

Which table format should I pick?

Do I need a separate catalog?

How does lakehouse interact with my BI tool?

Can I do ACID transactions in a lakehouse?

How do I handle streaming ingest into lakehouse tables?

What about time travel and historical queries?

Does lakehouse work for small teams?

Where is the lakehouse pattern heading?