A Data Lake: Real Examples & Use Cases

Definition

A data lake is a centralized repository that stores raw data in its native format at low cost, without imposing the schema and modeling requirements that a warehouse would. The defining trait is the deferral of structure: data lands first, structure gets applied at read time by whatever consumer wants to use it. Real examples reveal what the pattern looks like after a decade of production use, where it still earns its place in modern architectures, and where it has been displaced or absorbed by the lakehouse pattern that grew out of it.

The pattern emerged in the early 2010s as Hadoop deployments scaled. The promise was that companies could land everything cheaply on HDFS, defer the modeling work, and unlock new use cases when ready. The reality was messier. The lakes that lacked discipline became swamps where data accumulated and no one could find or trust anything. The lakes that worked well had organizational practices that prevented that outcome.

The category in 2026 has shifted significantly. Pure on-premises Hadoop lakes are mostly legacy; new builds rarely choose that path. Cloud object storage (S3, GCS, ADLS) replaced HDFS as the storage layer. The lakehouse pattern added structure on top of lake storage. Pure data lakes without lakehouse table formats still exist, but the category has narrowed to specific use cases where the deferred-structure model genuinely fits.

What separates a useful lake from a swamp is the discipline around what lands, where it lands, who owns it, and how it gets discovered. The technology is the same in both cases; the operational practice determines the outcome. Lakes that succeed have zone organization, naming conventions, ownership, cataloging, and lifecycle policies. Lakes that fail have none of these.

This page surveys real lake implementations across raw data archival, ML training data, log storage, and the broader role of object storage in modern data architectures. Tooling has consolidated significantly; the patterns about what lakes are good for remain useful.

Key Takeaways

A data lake stores raw data in its native format on cheap storage, deferring structural decisions to read time.
Cloud object storage (S3, GCS, ADLS) replaced HDFS as the dominant lake storage layer; pure on-premises Hadoop is mostly legacy.
The lakehouse pattern grew out of the lake by adding open table formats on top of object storage.
Lakes still serve specific use cases well: raw data archival, ML training inputs, logs and observability data, and as the storage layer underneath lakehouses.
The difference between a useful lake and a swamp is organizational practice, not technology choice.

Production Lake Deployments

Netflix's S3-based storage layer holds petabytes of raw data from operational systems, viewing events, and content metadata. The lake provides the foundation that the Iceberg-based lakehouse layer sits on top of. The pure lake portion (raw events before any structured transformation) still functions as a lake in the original sense.

Airbnb operates substantial S3-based storage for raw event data, operational replicas, and historical archives. The team's transition over the years from an on-premises Hadoop lake to S3-based storage parallels the broader industry shift. The patterns around governance and ownership informed many of the lake-design lessons that have been published.

Capital One built a large AWS-based lake covering operational data, customer interactions, risk data, and regulatory records. The financial services context puts strong governance requirements on the lake, including audit trails, access controls, and retention policies. The implementation involves significant tooling to maintain compliance at scale.

Expedia, Booking.com, and similar travel platforms maintain lakes that hold raw booking events, search interactions, and inventory snapshots from hundreds of supplier systems. The raw archival is valuable because business questions evolve and historical reprocessing becomes necessary periodically.

Uber's lake storage holds the raw events that feed both the analytics platform and ML pipelines. The lake-plus-lakehouse architecture is typical of large companies: raw data lands in the lake, structured analytical and ML data lives in lakehouse tables on top of the same storage layer.

Many telcos, banks, and large industrial companies still operate on-premises Hadoop lakes that have not been fully migrated to cloud object storage. The migration projects are multi-year, and the existing systems continue serving production workloads through the transition.

What Lakes Still Do Well

Raw data archival at very low cost. Object storage pricing is hard to beat for long-term retention of data that may or may not be queried. Compliance retention requirements (years to decades of records for financial and medical data) fit lake storage naturally. The cost-per-terabyte is low enough that retention is cheap insurance.

ML training data inputs. Training pipelines often read raw or lightly processed data rather than fully modeled analytical data. The lake holds the inputs in their native format; training jobs read directly from object storage; the structure happens in the training pipeline itself. The pattern fits how data scientists actually work with data.
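As a rough illustration of structure being applied inside the training pipeline, the sketch below reads raw JSON lines and keeps only the fields one job needs. The event shape, field names, and the `to_training_rows` helper are hypothetical, chosen only for the example.

```python
import json
from typing import Iterator

# Hypothetical raw events as they might sit in the lake: one JSON object per line.
RAW_LINES = [
    '{"user_id": "u1", "event": "play", "duration_s": 120, "device": "tv"}',
    '{"user_id": "u2", "event": "play", "duration_s": "45"}',
    '{"user_id": "u3", "event": "pause"}',
]


def to_training_rows(lines: list[str]) -> Iterator[tuple[str, float]]:
    """Apply structure at read time: keep only what this training job needs."""
    for line in lines:
        record = json.loads(line)
        if record.get("event") != "play":
            continue  # this job only trains on play events
        # The lake did not enforce a type at write time, so coerce here.
        duration = float(record.get("duration_s", 0))
        yield record["user_id"], duration


rows = list(to_training_rows(RAW_LINES))
print(rows)
```

A different consumer could read the same raw lines and project a different set of fields, which is the practical meaning of deferring structure to read time.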

Log and observability storage. Application logs, infrastructure metrics, distributed traces. The volumes are huge; the access patterns are scan-heavy and infrequent. Lake storage handles this well at cost levels that traditional databases cannot match. Tools like Datadog, Splunk, and the various open-source alternatives often store the bulk historical data on object storage.

Backup and disaster recovery. Operational databases get backed up to object storage. The lake serves as a long-term repository for backups that might be needed for recovery, audit, or historical reconstruction. The use case does not require the analytical capability of a lakehouse but does need cheap reliable storage.

Data exchange and export. Sharing data with partners, exporting bulk data for customers, providing data dumps for regulators. The lake serves as the staging area for outbound data movement. Parquet, JSON, CSV, and other portable formats live well in object storage and can be shared through pre-signed URLs or partner-shared buckets.

Lake Zone Patterns

Raw zone (sometimes called bronze or landing) holds data exactly as it arrived. No transformation, no cleanup, no schema enforcement. The principle is that the raw data is the source of truth for everything downstream; if a transformation later proves wrong, the raw zone allows reprocessing without going back to the original source.

Conformed zone (sometimes called silver or refined) holds data that has been cleaned, deduplicated, and brought into consistent structure but not yet modeled for consumption. The data is queryable and reusable across downstream uses without each consumer redoing the basic cleanup work.

Curated zone (sometimes called gold or business) holds modeled data ready for specific business uses: dimensional tables for analytics, feature tables for ML, integration tables for operational consumption. The structure here matches the consumption pattern.

Naming conventions across zones matter. Predictable paths (raw/source/table/date/), consistent partitioning, and standardized metadata files let tools and humans navigate the lake without specialized knowledge. The conventions feel bureaucratic at first. Once the lake holds thousands of objects across hundreds of sources, they pay back many times over.
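One way to keep paths predictable is to build every object key through a single helper rather than by hand. The sketch below assumes the raw/source/table/date/ layout mentioned above; the `raw_zone_key` function, the lowercase rule, and the ISO date format are illustrative choices, not a standard.

```python
from datetime import date


def raw_zone_key(source: str, table: str, day: date, filename: str) -> str:
    """Build an object key of the form raw/source/table/date/filename."""
    for part in (source, table, filename):
        if not part or "/" in part:
            raise ValueError(f"invalid path component: {part!r}")
    # Lowercase names and an ISO date keep listings sortable and predictable.
    return f"raw/{source.lower()}/{table.lower()}/{day.isoformat()}/{filename}"


key = raw_zone_key("Billing", "invoices", date(2024, 3, 1), "part-0000.json")
print(key)  # raw/billing/invoices/2024-03-01/part-0000.json
```

Rejecting components that contain a slash keeps a stray value from silently adding an extra directory level, which is the kind of drift that makes paths unparseable later.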

Lifecycle policies move data between storage tiers as it ages. Frequently accessed data lives on the hottest tier; rarely accessed data moves to colder, cheaper tiers (S3 Standard-IA, S3 Glacier, GCS Coldline). The policies have material cost impact at scale and almost no impact below the tens-of-terabytes threshold.
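A tiering policy can be expressed as data. The sketch below builds one rule in the shape that S3 lifecycle configurations use; the prefix, the day thresholds, and the `lifecycle_rule` helper are assumptions made for illustration, and real thresholds depend on access patterns and retention requirements.

```python
def lifecycle_rule(prefix: str, ia_after_days: int, glacier_after_days: int,
                   expire_after_days: int) -> dict:
    """Build one rule in the shape S3 lifecycle configurations use."""
    if not ia_after_days < glacier_after_days < expire_after_days:
        raise ValueError("tiers must be ordered: IA < Glacier < expiration")
    return {
        "ID": f"tiering-{prefix.strip('/').replace('/', '-')}",
        "Filter": {"Prefix": prefix},
        "Status": "Enabled",
        "Transitions": [
            {"Days": ia_after_days, "StorageClass": "STANDARD_IA"},
            {"Days": glacier_after_days, "StorageClass": "GLACIER"},
        ],
        "Expiration": {"Days": expire_after_days},
    }


# Illustrative numbers only: 30 days hot, 180 days warm, about 7 years total.
config = {"Rules": [lifecycle_rule("raw/billing/", 30, 180, 2555)]}
print(config["Rules"][0]["ID"])  # tiering-raw-billing
```

With boto3, a dictionary like `config` is what `put_bucket_lifecycle_configuration` takes as its `LifecycleConfiguration` argument. That call is left out here because it needs credentials and a real bucket.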

Lake-to-Lakehouse Evolution

The lakehouse pattern grew out of recognition that the deferred-structure lake had real problems. Schema drift went undetected. Concurrent writes corrupted data. Time travel was impossible. Query engines had to scan entire tables for selective queries. The lakehouse added a table format layer (Iceberg, Delta, Hudi) that solved these issues without giving up the lake's storage economics.
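To make the idea of a table format layer concrete, here is a deliberately simplified toy model: a list of immutable snapshots, each naming the data files visible at that version. It is not how Iceberg, Delta, or Hudi are implemented; the `ToyTable` class and its methods are hypothetical and only illustrate why a metadata layer enables time travel and conflict detection.

```python
from dataclasses import dataclass, field


@dataclass
class ToyTable:
    """A toy stand-in for a table format's metadata layer (not a real format)."""
    # Version 0 is the empty table; each commit appends a new snapshot.
    snapshots: list[tuple[str, ...]] = field(default_factory=lambda: [()])

    def current_version(self) -> int:
        return len(self.snapshots) - 1

    def commit(self, based_on: int, added_files: list[str]) -> int:
        """Append a snapshot only if nobody committed since `based_on`."""
        if based_on != self.current_version():
            raise RuntimeError("conflict: table changed, retry on the new version")
        self.snapshots.append(self.snapshots[-1] + tuple(added_files))
        return self.current_version()

    def files_at(self, version: int) -> tuple[str, ...]:
        """Time travel: list the data files visible at an older version."""
        return self.snapshots[version]


table = ToyTable()
v1 = table.commit(based_on=0, added_files=["part-0.parquet"])
v2 = table.commit(based_on=v1, added_files=["part-1.parquet"])
try:
    table.commit(based_on=v1, added_files=["part-2.parquet"])  # stale writer
    conflict = False
except RuntimeError:
    conflict = True
print(table.files_at(v1), table.files_at(v2), conflict)
```

In a plain lake there is no such version list, so two writers can overwrite each other's files without either one noticing.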

Most modern lakes have become lakehouses for the data that needs structure. The raw zone often remains lake-native (Parquet or JSON files without a table format). The conformed and curated zones use table formats. The result is a hybrid: raw landing in lake style, structured layers in lakehouse style, all sharing the same underlying storage.

The migration from pure lake to lakehouse can happen incrementally. New tables get the table format treatment from the start. Existing tables stay as plain files until there is a reason to convert them. The conversion is mechanical (rewrite to Parquet, add table format metadata) but takes time to roll out across an established lake.

Some workloads still genuinely benefit from pure lake patterns. ML pipelines that read raw event JSON. Log analysis that scans recent files without needing transactional guarantees. Backup archives that just need to exist. The lakehouse adds value when the data has analytical consumers; the lake remains the right pattern when consumers do not need analytical guarantees.

Governance Patterns That Prevent Swamps

Ownership at the table or dataset level. Every meaningful dataset has an owner team responsible for its quality, documentation, and lifecycle. Without ownership, datasets accumulate without anyone responsible for keeping them useful.

Cataloging that exposes what is in the lake. AWS Glue Data Catalog, DataHub, OpenMetadata, or commercial alternatives. Consumers cannot use what they cannot find. The catalog has to be populated automatically from production sources or it drifts immediately.
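Automatic population can start from nothing more than the object listing. The sketch below assumes keys follow a five-part zone/source/table/date/file convention; the `build_catalog` function and the entry fields are hypothetical, and a real catalog tool would also record schema, owner, and descriptions.

```python
def build_catalog(object_keys: list[str]) -> tuple[dict[str, dict], list[str]]:
    """Derive dataset entries from keys shaped like zone/source/table/date/file."""
    catalog: dict[str, dict] = {}
    unmatched: list[str] = []
    for key in object_keys:
        parts = key.split("/")
        if len(parts) != 5:
            unmatched.append(key)  # off-convention objects need a human look
            continue
        zone, source, table, day, _filename = parts
        entry = catalog.setdefault(
            f"{zone}.{source}.{table}",
            {"partitions": set(), "object_count": 0},
        )
        entry["partitions"].add(day)
        entry["object_count"] += 1
    return catalog, unmatched


keys = [
    "raw/billing/invoices/2024-03-01/part-0000.json",
    "raw/billing/invoices/2024-03-02/part-0000.json",
    "raw/crm/contacts/2024-03-01/export.csv",
    "tmp/scratch.csv",
]
catalog, unmatched = build_catalog(keys)
print(sorted(catalog), unmatched)
```

Reporting the unmatched keys separately matters as much as the catalog itself, since objects that do not fit the convention are the ones nobody will be able to find later.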

Access control aligned to organizational rules. Lake Formation in AWS, Unity Catalog in Databricks, IAM-based policies in GCS, or custom access layers. The control prevents accidental exposure of sensitive data and provides audit trails for compliance.
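The core of sensitivity-based access control is a comparison plus a record of the decision. The sketch below is a toy model, not the API of Lake Formation, Unity Catalog, or any IAM system; the level names, the `authorize_read` function, and the in-memory `audit_log` are all assumptions for illustration.

```python
# Hypothetical sensitivity levels, ordered from least to most restricted.
LEVELS = {"public": 0, "internal": 1, "confidential": 2}

# In practice this would be a durable audit store, not a Python list.
audit_log: list[dict] = []


def authorize_read(user: str, clearance: str, dataset: str, sensitivity: str) -> bool:
    """Allow a read only when the user's clearance covers the dataset's level."""
    allowed = LEVELS[clearance] >= LEVELS[sensitivity]
    # Record every decision, including denials, so access can be audited later.
    audit_log.append({"user": user, "dataset": dataset, "allowed": allowed})
    return allowed


denied = authorize_read("analyst_a", "internal", "conformed.crm.contacts", "confidential")
granted = authorize_read("risk_b", "confidential", "conformed.crm.contacts", "confidential")
print(denied, granted, len(audit_log))  # False True 2
```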

Schema documentation that captures intent. The schema in the files describes what the data is structurally; documentation captures what it means. The documentation lives in the catalog or in a wiki linked from the catalog. Without it, consumers misuse the data.

Lifecycle policies that prevent infinite growth. Data with no consumers gets deprecated and eventually deleted. The policies require explicit decisions about retention; without them, lakes grow forever and cost more each year than they should.

Common Failure Modes

Schema drift in raw zone files. The producer changes the JSON shape; consumers expect the old shape; processing breaks. The fix is schema enforcement at the conformed zone boundary and producer contracts where possible.
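Enforcement at the conformed zone boundary can be as simple as checking each raw record against an expected contract and quarantining anything that does not match. The sketch below uses a hypothetical order dataset; the `EXPECTED` contract and the `validate` and `promote` helpers are illustrative names, and production pipelines would typically use a schema library rather than hand-written checks.

```python
import json

# Hypothetical contract for one dataset: field name -> required Python type.
EXPECTED = {"order_id": str, "amount": float, "currency": str}


def validate(record: dict) -> list[str]:
    """Return a list of contract violations; empty means the record conforms."""
    problems = []
    for name, expected_type in EXPECTED.items():
        if name not in record:
            problems.append(f"missing field: {name}")
        elif not isinstance(record[name], expected_type):
            problems.append(f"wrong type for {name}: {type(record[name]).__name__}")
    return problems


def promote(raw_lines: list[str]) -> tuple[list[dict], list[tuple[dict, list[str]]]]:
    """Split raw records into those safe to promote and those to quarantine."""
    accepted, quarantined = [], []
    for line in raw_lines:
        record = json.loads(line)
        problems = validate(record)
        if problems:
            quarantined.append((record, problems))
        else:
            accepted.append(record)
    return accepted, quarantined


raw = [
    '{"order_id": "o1", "amount": 9.5, "currency": "EUR"}',
    '{"order_id": "o2", "total": 12.0, "currency": "EUR"}',    # producer renamed a field
    '{"order_id": "o3", "amount": "7.25", "currency": "USD"}',  # producer changed a type
]
accepted, quarantined = promote(raw)
print(len(accepted), [problems for _, problems in quarantined])
```

Quarantining instead of failing the whole batch keeps conforming records flowing while making the drift visible to the dataset owner.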

Discovery breakdown when the catalog is incomplete. Consumers cannot find the data they need; they re-create datasets that already exist; the lake fills with duplicates. The fix is automated catalog population from production paths and conventions that make discovery work.

Cost growth from indefinite retention. Years of accumulated data with no usage but full retention costs. The bill grows without business value. The fix is lifecycle policies and explicit retention decisions per dataset.
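An explicit retention review needs a list of candidates. The sketch below flags datasets that have not been read within a chosen window; the dataset names, the last-read dates, the 180-day threshold, and the `stale_datasets` helper are all hypothetical, and real last-access data would come from storage access logs or the query engine.

```python
from datetime import date


def stale_datasets(last_read: dict[str, date], today: date,
                   max_idle_days: int) -> list[str]:
    """Return dataset names whose most recent read is older than the window."""
    return sorted(
        name for name, read_on in last_read.items()
        if (today - read_on).days > max_idle_days
    )


last_read = {
    "raw.billing.invoices": date(2024, 5, 20),
    "raw.legacy.clicks": date(2022, 1, 15),
    "conformed.crm.contacts": date(2023, 6, 1),
}
candidates = stale_datasets(last_read, today=date(2024, 6, 1), max_idle_days=180)
print(candidates)  # ['conformed.crm.contacts', 'raw.legacy.clicks']
```

The output is a review list rather than a delete list: datasets held for compliance retention may be legitimately idle, so the owner still makes the decision.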

Security incidents from inadequate access controls. Sensitive data lands in the lake; access is overly permissive; an unrelated user or breach exposes it. The fix is access control aligned to data sensitivity from the start, not retrofitted after an incident.

Lakes that no team owns. The original team that built the lake has moved on; no one is responsible for the lake as a whole; problems go unaddressed. The fix is explicit platform team ownership of lake infrastructure even when individual datasets have separate owners.

Best Practices

Organize the lake into zones (raw, conformed, curated) with clear progression rules between them.
Establish naming conventions and partition patterns that are predictable and discoverable.
Catalog everything automatically; manual catalogs drift to uselessness within months.
Apply access controls based on data sensitivity from the first table, not as a retrofit.
Set lifecycle policies that move data to cheaper tiers and eventually delete it when retention requirements expire.

Common Misconceptions

Misconception: data lakes failed and lakehouses replaced them. Reality: lakes still serve real use cases, and lakehouses extended the pattern rather than replacing it.
Misconception: lakes are unstructured. Reality: the data within them has structure; the lake just defers when and how that structure gets applied.
Misconception: lake storage is always cheaper than warehouse storage. Reality: lake storage is cheaper per byte, but operational overhead can offset the savings at smaller scales.
Misconception: the lake replaces the warehouse. Reality: lakes and warehouses serve different parts of the data lifecycle, and many stacks have both.
Misconception: schema-on-read means you do not need schemas. Reality: schemas exist in every consumer of the data; the lake just does not enforce them at write time.

Frequently Asked Questions (FAQs)

Should I build a lake or a lakehouse for new analytics?

What storage layer should the lake use?

How do I prevent the lake from becoming a swamp?

What goes in the raw zone?

How do I catalog the lake?

What about access control?

How do lakes handle deletes?

How do I migrate from an existing lake to a lakehouse?

Where are data lakes heading?