A data lake is a centralized repository that stores raw data in its native format at low cost, without imposing the schema and modeling requirements that a warehouse would. The defining trait is the deferral of structure: data lands first, structure gets applied at read time by whatever consumer wants to use it. Real examples reveal what the pattern looks like after a decade of production use, where it still earns its place in modern architectures, and where it has been displaced or absorbed by the lakehouse pattern that grew out of it.
The pattern emerged in the early 2010s as Hadoop deployments scaled. The promise was that companies could land everything cheaply on HDFS, defer the modeling work, and unlock new use cases when ready. The reality was messier. The lakes that lacked discipline became swamps where data accumulated and no one could find or trust anything. The lakes that worked well had organizational practices that prevented that outcome.
The category in 2026 has shifted significantly. Pure on-premises Hadoop lakes are mostly legacy; new builds rarely choose that path. Cloud object storage (S3, GCS, ADLS) replaced HDFS as the storage layer. The lakehouse pattern added structure on top of lake storage. Pure data lakes without lakehouse table formats still exist, but the category has narrowed to specific use cases where the deferred-structure model genuinely fits.
What separates a useful lake from a swamp is the discipline around what lands, where it lands, who owns it, and how it gets discovered. The technology is the same in both cases; the operational practice determines the outcome. Lakes that succeed have zone organization, naming conventions, ownership, cataloging, and lifecycle policies. Lakes that fail have none of these.
This page surveys real lake implementations across raw data archival, ML training data, log storage, and the broader role of object storage in modern data architectures. Tooling has consolidated significantly; the patterns about what lakes are good for remain useful.
Netflix's S3-based storage layer holds petabytes of raw data from operational systems, viewing events, and content metadata. The lake provides the foundation that the Iceberg-based lakehouse layer sits on top of. The pure lake portion (raw events before any structured transformation) still functions as a lake in the original sense.
Airbnb operates substantial S3-based storage for raw event data, operational replicas, and historical archives. The team's transition over the years from a Hadoop-on-prem lake to S3-based storage parallels the broader industry shift. The patterns around governance and ownership informed many of the lake-design lessons that have been published.
Capital One built a large AWS-based lake covering operational data, customer interactions, risk data, and regulatory records. The financial services context puts strong governance requirements on the lake, including audit trails, access controls, and retention policies. The implementation involves significant tooling to maintain compliance at scale.
Expedia, Booking.com, and similar travel platforms maintain lakes that hold raw booking events, search interactions, and inventory snapshots from hundreds of supplier systems. The raw archival is valuable because business questions evolve and historical reprocessing becomes necessary periodically.
Uber's lake storage holds the raw events that feed both the analytics platform and ML pipelines. The lake-plus-lakehouse architecture is typical of large companies: raw data lands in the lake, structured analytical and ML data lives in lakehouse tables on top of the same storage layer.
Many telcos, banks, and large industrial companies still operate Hadoop-on-premise lakes that have not been fully migrated to cloud object storage. The migration projects are multi-year and the existing systems continue serving production workloads through the transition.
Raw data archival at very low cost. Object storage pricing is hard to beat for long-term retention of data that may or may not be queried. Compliance retention requirements (years to decades of records for financial and medical data) fit lake storage naturally. The cost-per-terabyte is low enough that retention is cheap insurance.
ML training data inputs. Training pipelines often read raw or lightly processed data rather than fully modeled analytical data. The lake holds the inputs in their native format; training jobs read directly from object storage; the structure happens in the training pipeline itself. The pattern fits how data scientists actually work with data.
Log and observability storage. Application logs, infrastructure metrics, distributed traces. The volumes are huge; the access patterns are scan-heavy and infrequent. Lake storage handles this well at cost levels that traditional databases cannot match. Tools like Datadog, Splunk, and the various open-source alternatives often store the bulk historical data on object storage.
Backup and disaster recovery. Operational databases get backed up to object storage. The lake serves as a long-term repository for backups that might be needed for recovery, audit, or historical reconstruction. The use case does not require the analytical capability of a lakehouse but does need cheap reliable storage.
Data exchange and export. Sharing data with partners, exporting bulk data for customers, providing data dumps for regulators. The lake serves as the staging area for outbound data movement. Parquet, JSON, CSV, and other portable formats live well in object storage and can be shared through pre-signed URLs or partner-shared buckets.
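The ML training use case above depends on structure being applied at read time rather than at landing time. A minimal sketch of that idea, assuming newline-delimited JSON events; the field names (`user_id`, `event_type`, `ts`) and the `read_events` helper are hypothetical and only illustrate the pattern:

```python
import json
from typing import Iterable, Iterator, Tuple


def read_events(lines: Iterable[str], fields: Tuple[str, ...]) -> Iterator[dict]:
    """Parse raw lines and project only the fields this consumer cares about."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines in the raw file
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skipped here; a real pipeline might quarantine bad lines
        if not isinstance(record, dict):
            continue  # only JSON objects are treated as events
        # Fields missing from the raw record come back as None instead of failing.
        yield {name: record.get(name) for name in fields}


# Hypothetical raw-zone content: the producer wrote whatever it had.
raw_lines = [
    '{"user_id": 1, "event_type": "play", "ts": "2026-01-05T10:00:00Z", "device": "tv"}',
    '{"user_id": 2, "event_type": "pause"}',
    "not json",
]
rows = list(read_events(raw_lines, ("user_id", "event_type", "ts")))
```

The raw lines are never modified; a different consumer could call the same function with a different field tuple and get its own view of the same files.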
Raw zone (sometimes called bronze or landing) holds data exactly as it arrived. No transformation, no cleanup, no schema enforcement. The principle is that the raw data is the source of truth for everything downstream; if a transformation later proves wrong, the raw zone allows reprocessing without going back to the original source.
Conformed zone (sometimes called silver or refined) holds data that has been cleaned, deduplicated, and brought into consistent structure but not yet modeled for consumption. The data is queryable and reusable across downstream uses without each consumer redoing the basic cleanup work.
Curated zone (sometimes called gold or business) holds modeled data ready for specific business uses: dimensional tables for analytics, feature tables for ML, integration tables for operational consumption. The structure here matches the consumption pattern.
Naming conventions across zones matter. Predictable paths (raw/source/table/date/), consistent partitioning, and standardized metadata files let tools and humans navigate the lake without specialized knowledge. The conventions feel bureaucratic until the lake has thousands of objects across hundreds of sources, at which point they pay back many times over.
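Conventions are easier to keep when paths are generated by code rather than typed by hand. The sketch below assumes a hypothetical `zone/source/table/dt=YYYY-MM-DD/` layout modeled on the raw/source/table/date/ example above; the zone names, the `dt=` partition key, and the lowercase-with-underscores rule are illustrative choices, not a standard:

```python
from datetime import date

# Zone names follow the raw / conformed / curated split described above.
ZONES = ("raw", "conformed", "curated")


def lake_prefix(zone: str, source: str, table: str, day: date) -> str:
    """Build a predictable object-storage prefix for one daily partition."""
    if zone not in ZONES:
        raise ValueError(f"unknown zone: {zone!r}")
    for part in (source, table):
        # Assumed rule: lowercase letters, digits, and underscores only.
        if not part or part != part.lower() or not part.replace("_", "").isalnum():
            raise ValueError(f"invalid path component: {part!r}")
    return f"{zone}/{source}/{table}/dt={day.isoformat()}/"


prefix = lake_prefix("raw", "bookings", "search_events", date(2026, 1, 5))
```

Rejecting bad names at write time is the point of the helper: a path that cannot be built cannot land in the lake under an unpredictable name.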
Lifecycle policies move data between storage tiers as it ages. Frequently accessed data lives on the hottest tier; rarely accessed data moves to colder, cheaper tiers (for example S3 Standard-IA and S3 Glacier, or GCS Coldline). The policies have material cost impact at scale and almost no impact below the tens-of-terabytes threshold.
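As an illustration, a tiering rule can be expressed as plain data. The sketch below builds a dictionary shaped like an S3 lifecycle rule (the structure boto3's `put_bucket_lifecycle_configuration` accepts inside `LifecycleConfiguration={"Rules": [...]}`); the helper name, the prefix, and the day thresholds are hypothetical, and nothing here calls AWS:

```python
from typing import Optional


def lifecycle_rule(
    prefix: str,
    ia_after_days: int,
    archive_after_days: int,
    expire_after_days: Optional[int] = None,
) -> dict:
    """Describe tiering for one prefix: warm -> infrequent access -> archive."""
    if not 0 < ia_after_days < archive_after_days:
        raise ValueError("expected 0 < ia_after_days < archive_after_days")
    if expire_after_days is not None and expire_after_days <= archive_after_days:
        raise ValueError("expiration must come after the archive transition")
    rule = {
        "ID": "tiering-" + prefix.strip("/").replace("/", "-"),
        "Filter": {"Prefix": prefix},
        "Status": "Enabled",
        "Transitions": [
            {"Days": ia_after_days, "StorageClass": "STANDARD_IA"},
            {"Days": archive_after_days, "StorageClass": "GLACIER"},
        ],
    }
    if expire_after_days is not None:
        # Omitting Expiration means objects are kept indefinitely.
        rule["Expiration"] = {"Days": expire_after_days}
    return rule


# Hypothetical thresholds: IA after 30 days, archive after 180, delete after ~7 years.
rule = lifecycle_rule("raw/logs/", 30, 180, 2555)
```

Keeping the rule as data makes the retention decision reviewable in version control alongside the rest of the lake configuration.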
The lakehouse pattern grew out of recognition that the deferred-structure lake had real problems. Schema drift went undetected. Concurrent writers could leave datasets in an inconsistent state because plain files offer no transactions. Time travel was not available without keeping manual snapshots. Query engines often had to list and scan far more files than a selective query needed. The lakehouse added a table format layer (Iceberg, Delta, Hudi) that addressed these issues without giving up the lake's storage economics.
Most modern lakes have become lakehouses for the data that needs structure. The raw zone often remains lake-native (Parquet or JSON files without a table format). The conformed and curated zones use table formats. The result is a hybrid: raw landing in lake style, structured layers in lakehouse style, all sharing the same underlying storage.
The migration from pure lake to lakehouse can happen incrementally. New tables get the table format treatment from the start. Existing tables stay as plain files until there is a reason to convert them. The conversion is mechanical (rewrite to Parquet, add table format metadata) but takes time to roll out across an established lake.
Some workloads still genuinely benefit from pure lake patterns. ML pipelines that read raw event JSON. Log analysis that scans recent files without needing transactional guarantees. Backup archives that just need to exist. The lakehouse adds value when the data has analytical consumers; the lake remains the right pattern when consumers do not need analytical guarantees.
Ownership at the table or dataset level. Every meaningful dataset has an owner team responsible for its quality, documentation, and lifecycle. Without ownership, datasets accumulate without anyone responsible for keeping them useful.
Cataloging that exposes what is in the lake. AWS Glue Data Catalog, DataHub, OpenMetadata, or commercial alternatives. Consumers cannot use what they cannot find. The catalog has to be populated automatically from production sources or it drifts immediately.
Access control aligned to organizational rules. Lake Formation in AWS, Unity Catalog in Databricks, IAM-based policies in GCS, or custom access layers. The control prevents accidental exposure of sensitive data and provides audit trails for compliance.
Schema documentation that captures intent. The schema in the files describes what the data is structurally; documentation captures what it means. The documentation lives in the catalog or in a wiki linked from the catalog. Without it, consumers misuse the data.
Lifecycle policies that prevent infinite growth. Data with no consumers gets deprecated and eventually deleted. The policies require explicit decisions about retention; without them, lakes grow forever and cost more each year than they should.
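Several of the practices above reduce to metadata that either exists for a dataset or does not, which makes them easy to check automatically. A minimal sketch, assuming a hypothetical `Dataset` record exported from whatever catalog is in use; the field names are illustrative:

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Dataset:
    """Hypothetical catalog entry; real catalogs expose richer metadata."""
    name: str
    owner: Optional[str] = None
    description: Optional[str] = None
    retention_days: Optional[int] = None


def governance_gaps(ds: Dataset) -> List[str]:
    """List the governance fields that are missing for one dataset."""
    gaps = []
    if not ds.owner:
        gaps.append("owner")
    if not ds.description:
        gaps.append("description")
    if ds.retention_days is None:
        gaps.append("retention_days")
    return gaps


datasets = [
    Dataset("raw.bookings.search_events", owner="search-team",
            description="Raw search interactions", retention_days=730),
    Dataset("raw.legacy.export_2019"),
]
# Map each non-compliant dataset to the fields it is missing.
report = {ds.name: governance_gaps(ds) for ds in datasets if governance_gaps(ds)}
```

A report like this, run on a schedule, turns "every dataset has an owner" from a policy statement into something that can fail visibly.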
Schema drift in raw zone files. The producer changes the JSON shape; consumers expect the old shape; processing breaks. The fix is schema enforcement at the conformed zone boundary and producer contracts where possible.
Discovery breakdown when the catalog is incomplete. Consumers cannot find the data they need; they re-create datasets that already exist; the lake fills with duplicates. The fix is automated catalog population from production paths and conventions that make discovery work.
Cost growth from indefinite retention. Years of accumulated data with no usage but full retention costs. The bill grows without business value. The fix is lifecycle policies and explicit retention decisions per dataset.
Security incidents from inadequate access controls. Sensitive data lands in the lake; access is overly permissive; an unrelated user or breach exposes it. The fix is access control aligned to data sensitivity from the start, not retrofitted after an incident.
Lakes that no team owns. The original team that built the lake has moved on; no one is responsible for the lake as a whole; problems go unaddressed. The fix is explicit platform team ownership of lake infrastructure even when individual datasets have separate owners.
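For the schema drift failure mode, enforcement at the conformed zone boundary can be as simple as checking each raw record against an expected shape before it is promoted. The sketch below uses a hypothetical contract (`booking_id`, `amount`, `currency`) and plain type checks; a production setup would more likely rely on a schema registry or a validation library:

```python
from typing import Dict, List, Tuple

# Hypothetical producer contract for one raw feed.
EXPECTED: Dict[str, type] = {"booking_id": str, "amount": float, "currency": str}


def validate(record: dict, schema: Dict[str, type]) -> List[str]:
    """Return a list of problems; an empty list means the record conforms."""
    problems = []
    for field, expected_type in schema.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            problems.append(
                f"wrong type for {field}: expected {expected_type.__name__}, "
                f"got {type(record[field]).__name__}"
            )
    # Fields the contract does not know about are the usual first sign of drift.
    for field in sorted(record.keys() - schema.keys()):
        problems.append(f"unexpected field: {field}")
    return problems


def split_batch(records: List[dict], schema: Dict[str, type]) -> Tuple[List[dict], List[dict]]:
    """Separate records that can be promoted from records to quarantine."""
    accepted, rejected = [], []
    for record in records:
        (rejected if validate(record, schema) else accepted).append(record)
    return accepted, rejected


good = {"booking_id": "b1", "amount": 10.0, "currency": "EUR"}
drifted = {"booking_id": "b2", "amount": "10.00", "currency": "EUR", "channel": "app"}
accepted, rejected = split_batch([good, drifted], EXPECTED)
```

Rejected records stay in the raw zone untouched, so they can be reprocessed once the contract or the producer is fixed.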
A lakehouse for analytics. The lake-plus-table-format pattern gives you the storage economics of a lake with the analytical capabilities of a warehouse. Pure lakes still fit for raw archival and non-analytical use cases, but new analytical workloads should land in lakehouse tables.
Cloud object storage. S3 on AWS, GCS on Google Cloud, ADLS Gen2 on Azure. The choice usually follows the rest of the cloud architecture. On-premise HDFS persists in legacy environments but new builds should go to object storage.
A lake stays out of swamp territory through organizational practices: explicit ownership, automated cataloging, access controls, naming conventions, and lifecycle policies. The technology supports any outcome; the practices determine which outcome you get.
The raw zone holds data exactly as it arrived from the source. No transformation, no cleanup, no schema enforcement. The raw zone is the source of truth that downstream zones derive from. If a downstream transformation later proves wrong, the raw zone lets you reprocess.
Keep the catalog current with automated extraction from production paths: AWS Glue Crawlers, similar tools in other clouds, or open-source projects like DataHub and OpenMetadata. The catalog should populate from the actual files in the lake, not from manual documentation. Manual catalogs drift; automatic catalogs stay current.
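To make the idea concrete, here is a rough sketch of deriving catalog entries from object keys alone. It assumes a hypothetical `zone/source/table/dt=YYYY-MM-DD/file` layout and takes the keys as a plain list (in practice they would come from listing the bucket); a real crawler also infers schemas, which this does not attempt:

```python
from typing import Dict, Iterable, Tuple

CatalogKey = Tuple[str, str, str]  # (zone, source, table)


def build_catalog(keys: Iterable[str]) -> Dict[CatalogKey, dict]:
    """Group object keys into tables with their partitions and file counts."""
    catalog: Dict[CatalogKey, dict] = {}
    for key in keys:
        parts = key.split("/")
        # Keys that do not follow the assumed layout are ignored, not guessed at.
        if len(parts) < 5 or not parts[3].startswith("dt="):
            continue
        zone, source, table, partition = parts[:4]
        entry = catalog.setdefault(
            (zone, source, table), {"partitions": set(), "files": 0}
        )
        entry["partitions"].add(partition[len("dt="):])
        entry["files"] += 1
    return catalog


keys = [
    "raw/bookings/search_events/dt=2026-01-05/part-0.json",
    "raw/bookings/search_events/dt=2026-01-05/part-1.json",
    "raw/bookings/search_events/dt=2026-01-06/part-0.json",
    "tmp/scratch.csv",  # off-convention object: invisible to the catalog
]
catalog = build_catalog(keys)
```

The off-convention key shows why naming conventions and discovery are linked: anything that lands outside the agreed layout never shows up for consumers to find.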
Apply access control from the start, aligned to data sensitivity. Use Lake Formation, Unity Catalog, or IAM-based policies. Track who has access to what and audit periodically. Sensitive data (PII, financial, regulated) needs stricter controls than internal operational data.
Objects in cloud object storage are immutable, so a single record cannot be deleted in place; you write new files that omit the record, then remove the old ones. The pattern is awkward for compliance use cases like the GDPR right to erasure (the "right to be forgotten"). Lakehouse table formats provide better delete semantics; pure file-based lakes require custom handling.
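A minimal sketch of what that custom handling can look like, using local newline-delimited JSON files as a stand-in for objects. The `user_id` field and the `erase_subject` helper are hypothetical; on real object storage the final step would be an upload of the new object followed by deletion of the old one, and on versioned buckets earlier versions would also need to be removed:

```python
import json
import os
import tempfile
from pathlib import Path


def erase_subject(path: Path, user_id: str) -> int:
    """Rewrite one file without the subject's records; return how many were removed.

    Assumes every non-blank line is a JSON object.
    """
    removed = 0
    fd, tmp_name = tempfile.mkstemp(dir=str(path.parent), suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as out, path.open(encoding="utf-8") as src:
        for line in src:
            if not line.strip():
                continue
            if json.loads(line).get("user_id") == user_id:
                removed += 1  # drop the record by not copying it
                continue
            out.write(line if line.endswith("\n") else line + "\n")
    if removed:
        os.replace(tmp_name, path)  # the new file supersedes the old one
    else:
        os.remove(tmp_name)  # nothing to erase: leave the original untouched
    return removed
```

Every file that might contain the subject has to go through this rewrite, which is why a catalog that records which datasets hold personal data matters for erasure requests.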
Migration to a lakehouse works incrementally. New tables get the table format treatment from the start. Existing tables convert when there is a reason (a new consumer that benefits from the format, or performance issues that the format would address). Full conversion across an established lake takes time but does not need to happen all at once.
The pure lake pattern is becoming the substrate underneath lakehouses rather than the primary architecture. Most new builds use lakehouse table formats for analytical data and keep pure lake storage for raw archival, ML inputs, and logs. The lake remains a useful concept but rarely the headline architecture choice for analytics.