A data lakehouse is a storage and query architecture that puts warehouse-style table semantics on top of cheap object storage, giving the same data both the analytical performance of a warehouse and the openness and economics of a lake. The pattern combines an open table format (Iceberg, Delta, or Hudi) over a storage layer like S3, GCS, or ADLS, with one or more compute engines that can read and write those tables. Real examples reveal which companies have actually consolidated their stacks on this pattern, what they gave up, and what they kept.
The lakehouse argument is that the historical separation between warehouses and lakes was a side effect of older technology limitations, not a fundamental requirement. Warehouses had fast query but locked-in storage. Lakes had cheap storage but weak semantics. Open table formats closed the gap by giving lake-stored data the ACID transactions, time travel, schema evolution, and indexing that used to require a warehouse. The result is a single architecture that can serve analytics, ML, and operational workloads from one copy of the data.
The category in 2026 is dominated by three open formats. Apache Iceberg has the broadest cross-vendor support, with Snowflake, BigQuery, Redshift, Databricks, AWS Glue, and most independent query engines reading and writing it. Delta Lake is the format Databricks promotes and the default in their managed environment. Apache Hudi has strong adoption for streaming-update use cases, particularly where CDC ingestion is the primary write pattern.
What separates a real lakehouse from a renamed lake is the presence of the table format layer and the operational discipline around it. A real lakehouse has tables with schemas you query reliably, transactions that prevent partial reads, and metadata that tracks history. A lake without those things still has the problems lakes always had: query plans that scan the world, schemas that drift silently, and partial writes that corrupt analyses.
This page surveys real lakehouse implementations across analytics, ML, and operational use cases. The specific format choices and vendor combinations evolve fast; the architectural pattern of open formats on object storage with multi-engine compute is stable enough to plan around.
Netflix built much of the original tooling that became Apache Iceberg to handle their own scale problem: petabyte-scale tables with thousands of concurrent readers and writers across multiple engines (Spark, Trino, Flink). The open-sourced project moved to the Apache Software Foundation and became the de facto open table format for cross-engine deployments. Netflix continues to run one of the largest Iceberg installations.
Apple uses Iceberg at scale across their internal data platform. Apple's contributions to the project shape its direction; the engineering investment reflects how seriously they take open standards for storage. Many of Iceberg's enterprise-readiness features came out of Apple's production requirements.
Stripe runs Iceberg-on-S3 with Trino as the primary query engine. The architecture lets Stripe's data team operate at petabyte scale without warehouse-vendor compute costs. The engineering team has written about the work to operate Iceberg in production, particularly around catalog choice and table maintenance.
Airbnb adopted Iceberg as part of their broader data platform consolidation. The migration from earlier formats took significant engineering effort but consolidated Airbnb's data infrastructure on a single format that supported all their downstream engines.
Databricks customers run Delta Lake at every scale from small teams to the largest enterprises. The format is the default for any data written through the Databricks platform. Comcast, Condé Nast, Shell, and many similar large companies have published case studies of Delta-based lakehouse implementations on Databricks.
Uber developed Apache Hudi to handle their CDC-heavy ingestion patterns, where rows update frequently rather than being append-only. Hudi's design optimizes for the streaming-update use case, which Iceberg and Delta originally handled less well. Uber continues to run Hudi at scale across their data platform.
Iceberg leads in cross-engine support. Snowflake, BigQuery, Redshift, Databricks (with some caveats), Trino, Spark, Flink, DuckDB, and most independent query engines read and write Iceberg tables. The interoperability story is the strongest reason for new builds to default to Iceberg.
Delta Lake is the natural choice in Databricks-heavy environments. The format has feature parity with Iceberg on most dimensions and tighter integration with the Databricks tooling stack. Outside Databricks, support is good but less universal than Iceberg.
Hudi remains the strongest choice for streaming-update workloads where the same rows get updated continuously. The merge-on-read storage pattern serves this use case efficiently. For mostly-append workloads, Iceberg and Delta have caught up enough that the historical Hudi advantage no longer matters as much.
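The merge-on-read idea can be illustrated with a toy in-memory model. This is a sketch of the general pattern only, not Hudi's implementation: the class name `MergeOnReadTable` and its methods are hypothetical, and plain dictionaries stand in for base files and row-level log files.

```python
from dataclasses import dataclass, field


@dataclass
class MergeOnReadTable:
    """Toy model: a compacted base plus an append-only log of row changes."""
    base: dict = field(default_factory=dict)       # key -> row, stands in for base files
    delta_log: list = field(default_factory=list)  # (key, row or None), stands in for log files

    def upsert(self, key, row):
        # Writes are cheap: append to the log and never rewrite the base.
        self.delta_log.append((key, row))

    def delete(self, key):
        # A delete is recorded as a tombstone (row is None).
        self.delta_log.append((key, None))

    def read(self):
        # Reads pay the merge cost: replay the log over a copy of the base.
        merged = dict(self.base)
        for key, row in self.delta_log:
            if row is None:
                merged.pop(key, None)
            else:
                merged[key] = row
        return merged

    def compact(self):
        # Fold the log into a new base so later reads are cheap again.
        self.base = self.read()
        self.delta_log = []


table = MergeOnReadTable(base={1: {"status": "new"}, 2: {"status": "new"}})
table.upsert(1, {"status": "paid"})
table.upsert(1, {"status": "shipped"})   # the same row updated again
table.delete(2)
merged_before = table.read()             # merged view, base untouched so far
table.compact()
```

The trade-off the sketch shows is that repeated updates to the same row cost one log append each, while the work of reconciling them is deferred to reads and to compaction.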
The format wars have cooled. Databricks announced Iceberg compatibility through their UniForm feature and acquired Tabular (one of the major commercial Iceberg companies). Snowflake added native Iceberg support. The major engines now read each other's formats well enough that the lock-in concern has receded. Teams pick a primary format but worry less about whether they can switch later.
The differences that still matter in practice: write throughput at scale, support for specific features like row-level operations and merge syntax, catalog ecosystem, and the maturity of cleanup operations. The differences are smaller than they were two years ago and shrinking further each release cycle.
A working lakehouse typically has multiple query engines reading the same tables. Trino or Presto for interactive analytical queries. Spark for heavy transformations and ML training data preparation. Flink for streaming jobs. DuckDB or Athena for ad hoc analysis. Each engine picks the part of the workload it serves well.
Stripe's setup runs Trino as the primary interactive engine with Spark for heavy batch work. The two engines share the same Iceberg tables. Analysts query through Trino; engineers run transformations through Spark; the data is the same data with no copy.
Databricks customers typically use Spark (through Databricks) for both transformation and query. Some add Trino or DuckDB for low-latency interactive use cases that benefit from a different engine. The pattern still gives multi-engine flexibility on the same Delta tables.
Snowflake customers increasingly read Iceberg tables that other tools write. The pattern lets Snowflake be the consumption engine for the BI layer while Spark or Flink handles the production pipelines. The data lives once; multiple engines serve different audiences.
The catalog layer coordinates this. AWS Glue, Databricks Unity Catalog, Apache Polaris (originally from Snowflake), and Project Nessie are the major catalog options. The catalog tracks which tables exist, where they live, and what their current state is. Different engines authenticate to the catalog and read or write through it; the catalog enforces consistency across engines.
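One common way a catalog enforces consistency is an atomic compare-and-swap on each table's current metadata pointer: a writer's commit succeeds only if the pointer still matches what that writer read. The sketch below is loosely modeled on that idea and is not any real catalog's API; `ToyCatalog`, `CommitConflict`, `commit_with_retry`, and the metadata file names are all hypothetical.

```python
import threading


class CommitConflict(Exception):
    """Raised when another writer committed first."""


class ToyCatalog:
    """Hypothetical in-memory catalog: table name -> current metadata location."""

    def __init__(self):
        self._pointers = {}
        self._lock = threading.Lock()

    def current(self, table):
        with self._lock:
            return self._pointers.get(table)

    def commit(self, table, expected, new):
        # Atomic compare-and-swap: succeed only if nobody moved the pointer.
        with self._lock:
            if self._pointers.get(table) != expected:
                raise CommitConflict(f"{table}: expected {expected!r}")
            self._pointers[table] = new


def commit_with_retry(catalog, table, make_metadata, max_attempts=3):
    """Re-read the pointer and retry when a concurrent commit wins."""
    for _ in range(max_attempts):
        base = catalog.current(table)
        try:
            catalog.commit(table, base, make_metadata(base))
            return
        except CommitConflict:
            continue
    raise CommitConflict(f"{table}: gave up after {max_attempts} attempts")


catalog = ToyCatalog()
catalog.commit("events", None, "metadata-v1.json")

# Two engines read the same starting state, then both try to commit.
seen_by_first = catalog.current("events")
seen_by_second = catalog.current("events")
catalog.commit("events", seen_by_first, "metadata-v2.json")       # first writer wins
try:
    catalog.commit("events", seen_by_second, "metadata-v3.json")  # stale base is rejected
    conflict = False
except CommitConflict:
    conflict = True
```

A rejected writer re-reads the current state and tries again, which is what `commit_with_retry` does. In a real system the retry would also have to re-validate the writer's changes against whatever the winning commit altered.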
Companies migrating from Snowflake or Redshift toward lakehouse usually go gradually. New pipelines write to Iceberg or Delta tables on object storage. Existing pipelines continue writing to the warehouse. Over time, more workloads move to the lakehouse side; the warehouse becomes the consumption engine for some workloads and gets replaced for others.
The economics favor migration for high-volume workloads. Object storage can be as much as an order of magnitude cheaper than warehouse storage at scale, depending on how the warehouse prices storage. Compute is decoupled and can be sized to the workload. The savings show up most dramatically at petabyte scale where warehouse storage costs become significant line items.
The complexity to manage is real. Self-managed lakehouse infrastructure (catalog operations, table maintenance, compaction, optimization) needs engineering investment that the warehouse handles for you. Teams that migrate without budgeting for that investment end up with cheaper storage and more operational pain.
Managed lakehouse offerings (Databricks, Tabular's services, the various cloud-native options) reduce the operational burden while preserving the openness benefits. The pricing is usually still favorable compared to traditional warehouses for the same workloads, particularly at scale.
The honest pattern: most migrations are partial. The warehouse keeps serving the BI and ad hoc query workloads where its query optimizer and concurrency story remain best-in-class. The lakehouse picks up the heavy data engineering, ML training data, and large historical storage. The two coexist for years before one fully displaces the other, if ever.
ML training data assembly works well on lakehouse architectures. The training pipeline reads from the same tables that analytics queries read. Data scientists do not need a separate data dump; they query the lakehouse with Spark or DuckDB and assemble training sets directly. The pattern reduces duplication and keeps training data consistent with analytical data.
Feature serving for online inference is usually not done from the lakehouse directly because the latency profile is wrong. Lakehouse tables serve batch inference and feature backfill. Online serving typically uses a separate fast store (Redis, DynamoDB) that gets populated from the lakehouse. The pattern combines batch lakehouse processing with operational online storage.
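The batch-to-online handoff can be sketched in a few lines. This is a minimal illustration under stated assumptions: a plain dictionary stands in for the fast store (Redis, DynamoDB), the rows stand in for the result of a lakehouse query, and the function names and the `entity_id`, `event_ts`, and `orders_30d` fields are hypothetical.

```python
def latest_features(rows):
    """Keep the most recent feature row per entity."""
    latest = {}
    for row in rows:
        current = latest.get(row["entity_id"])
        if current is None or row["event_ts"] > current["event_ts"]:
            latest[row["entity_id"]] = row
    return latest


def sync_to_online_store(rows, online_store):
    """Overwrite each entity's online features with its latest batch values."""
    for entity_id, row in latest_features(rows).items():
        online_store[entity_id] = {k: v for k, v in row.items() if k != "entity_id"}
    return online_store


# Stand-in for rows read from a lakehouse feature table.
batch_rows = [
    {"entity_id": "u1", "event_ts": 1, "orders_30d": 2},
    {"entity_id": "u1", "event_ts": 5, "orders_30d": 3},
    {"entity_id": "u2", "event_ts": 3, "orders_30d": 7},
]
online_store = {}   # stand-in for a key-value store
sync_to_online_store(batch_rows, online_store)
```

The online store holds only the latest value per key, so lookups stay fast, while the full history remains in the lakehouse for backfill and batch inference.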
CDC ingestion lands changes from operational databases into lakehouse tables. The pattern uses Debezium or similar to capture changes, Kafka to transport them, and a streaming engine to apply them to lakehouse tables. The result is a near-real-time replica of operational state in a format that analytics can query without touching production databases.
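The apply step at the end of that pipeline can be sketched as follows. The events use a simplified envelope in the style of Debezium (`op` codes `c`, `r`, `u`, `d` with `before` and `after` row images); real events carry more fields. The function name `apply_change_events` is hypothetical, and a dictionary keyed by primary key stands in for the lakehouse table.

```python
def apply_change_events(table, events, key="id"):
    """Apply change events in order to a dict keyed by primary key."""
    for event in events:
        op = event["op"]
        if op in ("c", "r", "u"):
            # create, snapshot read, and update all become an upsert
            row = event["after"]
            table[row[key]] = row
        elif op == "d":
            # a delete carries the old row image in "before"
            table.pop(event["before"][key], None)
        else:
            raise ValueError(f"unknown op: {op!r}")
    return table


events = [
    {"op": "c", "before": None, "after": {"id": 1, "email": "a@example.com"}},
    {"op": "c", "before": None, "after": {"id": 2, "email": "b@example.com"}},
    {"op": "u", "before": {"id": 1, "email": "a@example.com"},
     "after": {"id": 1, "email": "a2@example.com"}},
    {"op": "d", "before": {"id": 2, "email": "b@example.com"}, "after": None},
]
replica = apply_change_events({}, events)
```

Ordering matters here: applying the same events out of order could resurrect a deleted row or keep a stale value. Transporting changes through a log that preserves per-key ordering is what guards against that.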
Reverse ETL reads from lakehouse tables and pushes derived data into operational systems. The pattern lets the lakehouse serve as the source of truth for derived metrics and segments that operational tools consume. Census, Hightouch, and similar tools read from lakehouse tables the same way they read from warehouses.
Skipping table maintenance leads to performance degradation. Iceberg, Delta, and Hudi all need periodic compaction, expired snapshot cleanup, and metadata pruning. Without maintenance, query performance drops and storage costs grow. The fix is scheduled maintenance jobs running as part of the production pipeline.
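A snapshot retention policy is the simplest of these maintenance jobs to reason about. The sketch below only decides which snapshots to keep; it is not a format's own expiration procedure, and the function name, parameters, and sample history are hypothetical.

```python
def expire_snapshots(snapshots, now_ts, max_age, min_keep=1):
    """Split snapshots into (kept, expired).

    snapshots: list of (snapshot_id, commit_ts) tuples.
    A snapshot is kept if it is no older than max_age or is among
    the min_keep newest, so the table always keeps some history.
    """
    ordered = sorted(snapshots, key=lambda s: s[1], reverse=True)
    kept, expired = [], []
    for index, snap in enumerate(ordered):
        if index < min_keep or now_ts - snap[1] <= max_age:
            kept.append(snap)
        else:
            expired.append(snap)
    return kept, expired


DAY = 86_400  # seconds
history = [("s1", 0), ("s2", 3 * DAY), ("s3", 9 * DAY), ("s4", 10 * DAY)]
kept, expired = expire_snapshots(history, now_ts=10 * DAY, max_age=7 * DAY, min_keep=2)
```

In a real table, expiring a snapshot is what lets the data files referenced only by that snapshot be deleted. That is where the storage savings come from.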
Catalog fragmentation produces multi-engine pain. Different engines pointing at different catalog implementations cannot easily share tables. The fix is consolidating on a single catalog that all engines can use, even if it means migrating away from an engine's default catalog.
Small file accumulation degrades read performance. Streaming writes especially produce many small files that query engines have to read individually. The fix is regular compaction jobs that rewrite small files into larger ones.
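Planning a compaction is essentially a bin-packing problem: group small files so each rewritten file lands near a target size. The sketch below uses first-fit decreasing as one plausible strategy; the formats' own compaction tooling may plan differently, and the function name and sample sizes are hypothetical.

```python
def plan_compaction(file_sizes, target_size):
    """Group small files into rewrite batches of at most target_size bytes.

    Files already at or above target_size are left alone. Uses first-fit
    decreasing bin packing and returns a list of groups of file sizes.
    """
    small = sorted((s for s in file_sizes if s < target_size), reverse=True)
    groups = []   # each entry: [total_bytes, [sizes...]]
    for size in small:
        for group in groups:
            if group[0] + size <= target_size:
                group[0] += size
                group[1].append(size)
                break
        else:
            # no existing group had room, so start a new one
            groups.append([size, [size]])
    # A group holding a single file gains nothing from a rewrite.
    return [members for _, members in groups if len(members) > 1]


MB = 1024 * 1024
sizes = [512 * MB, 40 * MB, 30 * MB, 30 * MB, 20 * MB, 8 * MB, 100 * MB]
plan = plan_compaction(sizes, target_size=128 * MB)
```

With these inputs, six small files collapse into two rewritten files and the 512 MB file is left untouched. That is the effect compaction is meant to have on a streaming-written table.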
Mishandled schema evolution breaks downstream consumers. The table format supports evolution, but engines have different rules about what evolutions they accept. The fix is testing schema changes against all consumer engines before deploying them.
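A pre-deployment check for schema changes might look like the sketch below. Everything in it is an assumption for illustration: the engine names and their rule sets in `ENGINE_RULES` are placeholders rather than the behavior of any real engine, and schemas are reduced to simple column-to-type mappings.

```python
# Hypothetical rule sets: which kinds of schema change each consumer engine accepts.
ENGINE_RULES = {
    "engine_a": {"add_column", "widen_type", "drop_column"},
    "engine_b": {"add_column", "widen_type"},
    "engine_c": {"add_column"},
}

# Type changes treated as safe widenings in this sketch.
WIDENINGS = {("int", "long"), ("float", "double")}


def classify_change(old_schema, new_schema):
    """Return the set of change kinds between two {column: type} schemas."""
    kinds = set()
    for column, new_type in new_schema.items():
        if column not in old_schema:
            kinds.add("add_column")
        elif old_schema[column] != new_type:
            pair = (old_schema[column], new_type)
            kinds.add("widen_type" if pair in WIDENINGS else "incompatible_type")
    if any(column not in new_schema for column in old_schema):
        kinds.add("drop_column")
    return kinds


def rejecting_engines(old_schema, new_schema, rules=ENGINE_RULES):
    """List engines that would not accept every change in this evolution."""
    kinds = classify_change(old_schema, new_schema)
    return sorted(name for name, accepted in rules.items() if not kinds <= accepted)


old = {"id": "int", "amount": "float"}
new = {"id": "long", "amount": "float", "currency": "string"}
blocked = rejecting_engines(old, new)   # engines that would break on this change
```

In practice the rule table would be replaced by actually running representative queries on each engine against a staging copy of the table. The check above only captures the shape of the gate.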
Migrations that ignore the operational burden stall after cutover. Teams move data from the warehouse to the lakehouse and discover they now need to operate things the warehouse handled automatically. The fix is either budgeting for the operational work or using a managed lakehouse service.
For most new builds in 2026, lakehouse-first is the right default. Storage is cheaper, openness preserves optionality, and the engine support is now broad enough that consumption is not a bottleneck. The exception is small workloads where a managed warehouse's operational simplicity outweighs the storage savings.
Choose Iceberg as the default format unless you have a specific reason to choose otherwise; its cross-engine support is the strongest. Choose Delta if you are committed to Databricks. Choose Hudi if your write pattern is streaming-update-heavy and you need its merge-on-read storage.
A lakehouse needs a catalog in addition to a table format. The table format defines table state on disk; the catalog tracks which tables exist and coordinates concurrent operations. AWS Glue Data Catalog is the most common starting point, Unity Catalog fits if you are in Databricks, and Polaris, Nessie, or Tabular's services cover more advanced needs.
The BI tool connects to a query engine that reads lakehouse tables. Trino, Snowflake, BigQuery, Athena, and Databricks all serve this role. The BI tool sees normal SQL tables; it does not know whether the underlying storage is a warehouse or a lakehouse.
Lakehouse tables do support ACID transactions: the table format provides ACID guarantees for single-table operations. Multi-table transactions are not always supported and not always needed; most analytical workloads do not require cross-table transactional consistency.
For streaming ingestion, use a streaming engine (Spark Structured Streaming, Flink) that writes to the lakehouse format. The format handles concurrent writes with appropriate isolation. Compaction jobs run periodically to consolidate small streaming files.
All three major formats support time travel through snapshot history. You can query the table as of a previous point in time, useful for audits, debugging, and reproducible ML training. Snapshot retention has cost implications and needs explicit lifecycle policies.
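Conceptually, a time-travel query resolves a timestamp to the latest snapshot committed at or before it, then reads only the files that snapshot references. The sketch below models just that lookup; `Snapshot`, `SnapshotLog`, and the file names are hypothetical and do not correspond to any format's metadata layout.

```python
import bisect
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    snapshot_id: int
    committed_at: int   # epoch seconds
    files: tuple        # data files visible in this snapshot


class SnapshotLog:
    """Toy snapshot history ordered by commit time."""

    def __init__(self):
        self._snapshots = []

    def commit(self, committed_at, files):
        if self._snapshots and committed_at <= self._snapshots[-1].committed_at:
            raise ValueError("commits must move forward in time")
        snap = Snapshot(len(self._snapshots) + 1, committed_at, tuple(files))
        self._snapshots.append(snap)
        return snap

    def as_of(self, timestamp):
        """Return the latest snapshot committed at or before timestamp."""
        times = [s.committed_at for s in self._snapshots]
        index = bisect.bisect_right(times, timestamp)
        if index == 0:
            raise LookupError("no snapshot exists at or before that time")
        return self._snapshots[index - 1]


log = SnapshotLog()
log.commit(100, ["a.parquet"])
log.commit(200, ["a.parquet", "b.parquet"])
log.commit(300, ["c.parquet"])     # for example, after a compaction rewrite
audit_view = log.as_of(250)        # the table as it looked at t=250
```

The lookup only works while the older snapshots and their files still exist, which is why retention policy directly bounds how far back an audit or a reproducible training run can reach.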
A lakehouse can work for a small team, but a managed warehouse is often simpler. The operational overhead of running a self-managed lakehouse pays back at scale; below a certain volume, the warehouse's managed simplicity wins. Managed lakehouse services bridge the gap if you want lakehouse benefits without the operational work.
The category is moving toward more vendor and format convergence as Iceberg becomes a near-universal standard, toward better catalog interoperability through projects like Polaris and the Iceberg REST catalog, and toward more lakehouse-native features in traditional warehouses as those vendors absorb the openness story. The boundary between warehouse and lakehouse is dissolving.