
Data Lineage: Real Examples & Use Cases

Definition

Data lineage is the recorded relationship between datasets and the operations that produce them, showing where any piece of data came from and what depends on it. The recording can be column-level (this column derives from these columns through this transformation) or table-level (this table is produced from these source tables by this job). Real examples reveal which lineage implementations actually serve the use cases teams care about and where the metadata becomes shelf-ware that no one consults.

The use cases for lineage are concrete. When a pipeline breaks, lineage shows which downstream reports are affected. When a regulator asks where personal data lives, the answer comes from lineage. When a metric definition needs to change, the impact assessment uses lineage. When a developer wants to understand an unfamiliar table, lineage shows the upstream provenance. The pattern earns its place by answering questions that are otherwise expensive to investigate.

The category in 2026 covers a range of implementations. Commercial catalogs (Atlan, Collibra, Alation) and open-source catalogs (DataHub, OpenMetadata) all provide lineage as a core feature. Open standards like OpenLineage define how tools should emit lineage events. Warehouse-native features in Snowflake, BigQuery, and Databricks expose lineage within their own ecosystems. Custom in-house lineage exists at large companies whose stacks predate vendor coverage.

What separates working lineage from theatrical lineage is whether the graph reflects reality. Lineage automatically extracted from query logs, transformation code, and orchestration definitions usually reflects reality because it derives from the actual production systems. Lineage maintained manually as documentation usually does not, because the documentation drifts the moment a developer makes a change.

This page surveys real lineage implementations and the patterns that have emerged across different scales and stack choices. The tooling moves fast; the underlying patterns about extraction, scope, and use cases are stable.

Key Takeaways

  • Data lineage records the relationships between datasets and the operations that produce them, used for impact analysis, debugging, governance, and discovery.
  • Automatic extraction from query logs, transformation code, and orchestration definitions produces lineage that reflects reality; manual lineage drifts.
  • Column-level lineage is more useful than table-level when available but harder to extract reliably for complex transformations.
  • The OpenLineage standard is converging the way tools emit lineage events, reducing custom integration work.
  • Lineage shows its value when it answers concrete operational questions, not when it sits in a catalog dashboard no one consults.

Production Lineage Deployments

Airbnb's Dataportal exposed lineage as a core feature when the team built it in the mid-2010s; the project was one of the early influential data discovery platforms and shaped what later vendors built. The internal Dataportal is still in use; the patterns informed many follow-on tools.

LinkedIn's WhereHows and later DataHub (which LinkedIn open-sourced) built lineage as a first-class feature of the metadata graph. DataHub has grown into one of the most-adopted open-source catalogs with strong lineage support across many data sources.

Lyft's Amundsen (also open-sourced) focused on data discovery with lineage as supporting metadata. Adoption of Amundsen has slowed relative to DataHub and OpenMetadata, but the project remains in use at companies that adopted it during its peak.

Netflix's internal data platform tracks lineage extensively across their Iceberg-based stack. The lineage feeds operational tooling for cost allocation, impact analysis, and incident response. The implementation predates many of the open-source alternatives and influenced their direction.

Vendor catalog deployments include thousands of enterprise customers running Atlan, Collibra, Alation, and similar platforms. The customers span industries; financial services and healthcare are particularly heavy adopters because the compliance use cases for lineage are strong in regulated industries.

dbt's built-in lineage shows up in every dbt project. The lineage covers transformations within dbt but does not extend upstream to source ingestion or downstream to BI consumption automatically. Many teams treat dbt lineage as part of a larger lineage graph that includes upstream and downstream context.

How Lineage Gets Extracted

SQL parsing extracts table-level and often column-level lineage from query text. The parser walks the SQL AST to identify which tables and columns are read, which are written, and how they relate. The approach works well for SQL-based transformations and is the primary technique for warehouse-centric lineage.
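
As a rough illustration of table-level extraction, the sketch below pulls the target and source tables out of a single INSERT ... SELECT or CREATE TABLE ... AS SELECT statement. It uses regular expressions rather than a real AST walk, so it is a toy under stated assumptions: one statement at a time, no CTEs, no quoted identifiers, no IF NOT EXISTS clauses. The function name and the sample table names are hypothetical.

```python
import re

# Toy table-level extractor. Production tools walk a full SQL AST;
# this regex version only handles plain INSERT/CTAS statements.
TARGET_RE = re.compile(
    r"\b(?:insert\s+into|create\s+(?:or\s+replace\s+)?table)\s+([\w.]+)",
    re.IGNORECASE,
)
SOURCE_RE = re.compile(r"\b(?:from|join)\s+([\w.]+)", re.IGNORECASE)


def table_edges(sql: str) -> set[tuple[str, str]]:
    """Return (source_table, target_table) pairs for one statement."""
    target = TARGET_RE.search(sql)
    if target is None:
        return set()  # a bare SELECT writes nothing, so it adds no edges
    sources = {name.lower() for name in SOURCE_RE.findall(sql)}
    return {(src, target.group(1).lower()) for src in sources}


sql = """
INSERT INTO analytics.daily_revenue
SELECT o.order_date, SUM(o.amount)
FROM raw.orders o
JOIN raw.customers c ON c.id = o.customer_id
GROUP BY o.order_date
"""
edges = table_edges(sql)
```

Here `edges` holds two pairs, one from `raw.orders` and one from `raw.customers`, both pointing at `analytics.daily_revenue`. Anything beyond this simple shape is where a real parser earns its keep.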

Orchestration metadata captures pipeline-level lineage from the dependency definitions in Airflow, Dagster, Prefect, and similar tools. The metadata is rich for the pipeline structure but does not capture detail inside individual tasks. The combination of orchestration metadata plus SQL parsing covers most of a typical warehouse pipeline.
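
To make the orchestration case concrete, here is a minimal sketch that assumes each task declares the datasets it reads and writes. The `tasks` dictionary is a hypothetical stand-in for whatever metadata an orchestrator exposes; it is not the API of Airflow, Dagster, or Prefect.

```python
# Hypothetical pipeline definition: each task declares the datasets it
# reads and writes, as an orchestrator's metadata might expose them.
tasks = {
    "ingest_orders": {"reads": ["postgres.orders"], "writes": ["raw.orders"]},
    "build_revenue": {"reads": ["raw.orders"], "writes": ["analytics.revenue"]},
    "export_report": {"reads": ["analytics.revenue"], "writes": ["bi.revenue_dashboard"]},
}


def dataset_edges(tasks: dict) -> list[tuple[str, str, str]]:
    """Flatten task declarations into (source, target, task) edges."""
    edges = []
    for task_name, io in tasks.items():
        for src in io["reads"]:
            for dst in io["writes"]:
                edges.append((src, dst, task_name))
    return edges


def task_dependencies(tasks: dict) -> dict[str, set[str]]:
    """Infer task order: B depends on A if B reads a dataset that A writes."""
    writers = {ds: name for name, io in tasks.items() for ds in io["writes"]}
    return {
        name: {writers[ds] for ds in io["reads"] if ds in writers}
        for name, io in tasks.items()
    }


edges = dataset_edges(tasks)
deps = task_dependencies(tasks)
```

The edges record which task produced each relationship but say nothing about what happens inside the task, which is the gap SQL parsing fills.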

Code-level instrumentation extracts lineage from non-SQL transformations (Python, Spark). OpenLineage's Spark integration emits lineage events as Spark jobs run, capturing source and destination tables. Similar integrations exist for Flink and other engines.

Query log analysis derives lineage from production query history. Snowflake, BigQuery, Databricks, and others expose query logs that lineage tools parse to reconstruct what queries actually ran and what they read and wrote. The approach has the advantage of capturing reality rather than intent; if a query runs in production, it shows up.
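
A sketch of the aggregation step, assuming the query log has already been parsed into rows that list what each query read and wrote. The row shape and field names are made up for illustration; each warehouse exposes this information in its own schema.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical, already-parsed query log rows.
query_log = [
    {"ran_at": "2026-01-10T02:00:00", "reads": ["raw.orders"], "writes": ["stg.orders"]},
    {"ran_at": "2026-01-11T02:00:00", "reads": ["raw.orders"], "writes": ["stg.orders"]},
    {"ran_at": "2026-01-11T03:00:00", "reads": ["stg.orders"], "writes": ["mart.revenue"]},
    {"ran_at": "2026-01-11T09:30:00", "reads": ["mart.revenue"], "writes": []},  # ad hoc SELECT
]


def edges_from_log(rows):
    """Aggregate observed (source, target) edges with run count and last-seen time."""
    stats = defaultdict(lambda: {"runs": 0, "last_seen": None})
    for row in rows:
        ran_at = datetime.fromisoformat(row["ran_at"])
        for src in row["reads"]:
            for dst in row["writes"]:
                entry = stats[(src, dst)]
                entry["runs"] += 1
                if entry["last_seen"] is None or ran_at > entry["last_seen"]:
                    entry["last_seen"] = ran_at
    return dict(stats)


observed = edges_from_log(query_log)
```

Keeping a run count and a last-seen timestamp per edge is a design choice, not something the source prescribes; it makes it possible to tell a nightly dependency from a one-off query later on.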

Manual annotation fills gaps where automatic extraction is impossible: external vendor data feeds, custom transformations in older languages, and business rules implemented outside the standard stack. Manual lineage works for cases where the underlying flow does not change often enough for the documentation to drift.

Use Cases That Earn Their Place

Impact analysis when a data quality issue is detected. The lineage graph shows which downstream tables, dashboards, and ML models depend on the broken source. The team can communicate to affected stakeholders and prioritize fixes by impact. Without lineage, impact analysis is detective work.
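
Impact analysis reduces to a downstream graph traversal. The sketch below assumes the lineage graph is available as a plain adjacency mapping; the asset names are hypothetical.

```python
from collections import deque

# Hypothetical lineage graph: each key feeds the assets in its list.
downstream = {
    "raw.orders": ["stg.orders"],
    "stg.orders": ["mart.revenue", "mart.customer_ltv"],
    "mart.revenue": ["dashboard.exec_kpis"],
    "mart.customer_ltv": ["ml.churn_model", "dashboard.exec_kpis"],
}


def impacted(graph, broken):
    """Breadth-first walk returning every asset downstream of `broken`, with hop distance."""
    distance = {}
    queue = deque([(broken, 0)])
    while queue:
        node, hops = queue.popleft()
        for child in graph.get(node, []):
            if child not in distance:  # skip assets already reached
                distance[child] = hops + 1
                queue.append((child, hops + 1))
    return distance


affected = impacted(downstream, "stg.orders")
```

For a break in `stg.orders`, `affected` contains the two marts at one hop and the dashboard and churn model at two hops. The hop distance gives a first, crude ordering for who to notify first.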

Root cause investigation when a downstream metric looks wrong. The analyst walks lineage backward from the metric to find which upstream source might explain the discrepancy. Lineage points the investigation at the most likely source instead of requiring the analyst to know the entire pipeline by memory.

Governance and compliance reporting. Where does PII for European customers live? Lineage from the customer table downstream identifies all tables and reports that contain or derive from that data. The reporting work that used to take weeks of manual investigation becomes a graph query.

Migration planning when systems change, whether that means moving off an old warehouse, deprecating a transformation tool, or replacing a BI platform. Lineage identifies every consumer of the old system that needs to be migrated. The planning gets specific instead of guess-driven.

Cost allocation across teams. Lineage from compute jobs to the consumers that benefit lets the platform team allocate warehouse spend back to the teams that drove it. The accountability changes behavior in ways that flat platform charges do not.
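
A worked example of the arithmetic, under one assumed allocation rule: each job's cost is split across consuming teams in proportion to how many queries they ran against its outputs. The rule, the figures, and the team names are all hypothetical; other rules (equal split, bytes scanned) fit the same shape.

```python
# Hypothetical inputs: monthly compute cost per producing job, and how many
# queries each team ran against the tables that job produces.
job_cost = {"build_revenue": 1200.0, "build_ltv": 800.0}
consumer_queries = {
    "build_revenue": {"finance": 300, "marketing": 100},
    "build_ltv": {"marketing": 50, "data_science": 150},
}


def allocate(job_cost, consumer_queries):
    """Split each job's cost across consuming teams in proportion to query count."""
    bill = {}
    for job, cost in job_cost.items():
        usage = consumer_queries.get(job, {})
        total = sum(usage.values())
        if total == 0:
            # No lineage-visible consumer: leave the cost with the platform.
            bill["platform"] = bill.get("platform", 0.0) + cost
            continue
        for team, count in usage.items():
            bill[team] = bill.get(team, 0.0) + cost * count / total
    return bill


bill = allocate(job_cost, consumer_queries)
```

With these inputs, finance is charged 900, marketing 500 (300 from one job plus 200 from the other), and data science 600, which sums back to the 2,000 total.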

Column-Level vs Table-Level

Table-level lineage is easier to extract and covers most operational use cases. When a table breaks, knowing which downstream tables and dashboards depend on it is usually enough. The simplicity of table-level lineage makes it the right starting point for most teams.

Column-level lineage matters when the granularity changes the answer. A table might be consumed by many downstream tables, but a specific column might only be consumed by a few. Impact analysis at column level is more precise; without it, every column change looks like an everything change.
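
The difference in precision can be shown with a small sketch. It assumes column-level edges are stored as a mapping from (table, column) to the downstream columns that read them; the tables and columns are hypothetical.

```python
# Hypothetical column-level edges: (source_table, source_column) -> consumers.
column_edges = {
    ("stg.orders", "amount"): [("mart.revenue", "total_amount")],
    ("stg.orders", "order_date"): [("mart.revenue", "day"), ("mart.retention", "cohort_day")],
    ("stg.orders", "coupon_code"): [("mart.promo_usage", "coupon_code")],
}


def table_level_impact(edges, table):
    """Every downstream table that reads any column of `table`."""
    return {
        dst_table
        for (src_table, _src_col), targets in edges.items()
        if src_table == table
        for dst_table, _dst_col in targets
    }


def column_level_impact(edges, table, column):
    """Only the downstream tables that read the specific column."""
    return {dst_table for dst_table, _dst_col in edges.get((table, column), [])}


coarse = table_level_impact(column_edges, "stg.orders")
precise = column_level_impact(column_edges, "stg.orders", "coupon_code")
```

A change to `coupon_code` looks like it touches three marts at table level but only `mart.promo_usage` at column level.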

Column-level lineage is also valuable for compliance. Knowing that a specific PII column flows into specific downstream artifacts is more useful than knowing that the table containing PII flows everywhere. Regulatory reporting often needs column-level precision.

Extracting column-level lineage from complex SQL is hard. Joins, aggregations, window functions, dynamic SQL, stored procedures, and UDFs all complicate parsing. Most production column-level lineage tools achieve good coverage for simple queries and degraded coverage for complex ones.

The honest pattern: most teams start with table-level and add column-level coverage gradually for the high-value tables. Complete column-level lineage across an entire stack is rare and expensive to maintain.

Cross-System Lineage Patterns

Within a single warehouse, lineage extraction is straightforward. The warehouse logs queries; the catalog parses them; the graph reflects what happened. Snowflake's ACCOUNT_USAGE views, BigQuery's INFORMATION_SCHEMA views, and similar native sources give catalogs everything they need.

Cross-system lineage is harder. A pipeline reads from an operational database via CDC, lands in Kafka, processes through Flink, lands in a lakehouse table, gets transformed by dbt, lands in a warehouse, and feeds a BI tool. Each system has its own lineage representation; the catalog has to stitch them together.

OpenLineage emerged to address this. The standard defines how systems should emit lineage events. Producers integrate once with OpenLineage; consumers (catalogs, observability tools) read OpenLineage events from any producer. Adoption is growing but not universal.
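
For a sense of what an emitted event looks like, the sketch below builds a dictionary shaped like an OpenLineage run event. The field names (`eventType`, `eventTime`, `run.runId`, `job`, `inputs`, `outputs`, `producer`) follow the OpenLineage specification as best understood here; the namespace and producer values are placeholders, and the `schemaURL` field that the spec also expects is left out, so check the current spec before relying on this shape.

```python
import json
import uuid
from datetime import datetime, timezone


def run_event(event_type, job_name, inputs, outputs, namespace="example-namespace"):
    """Build a dict shaped like an OpenLineage run event (illustrative only)."""
    return {
        "eventType": event_type,  # e.g. "START" or "COMPLETE"
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "run": {"runId": str(uuid.uuid4())},
        "job": {"namespace": namespace, "name": job_name},
        "inputs": [{"namespace": namespace, "name": name} for name in inputs],
        "outputs": [{"namespace": namespace, "name": name} for name in outputs],
        # Placeholder producer URI, assumed for illustration.
        "producer": "https://example.com/hypothetical-producer",
    }


event = run_event("COMPLETE", "build_revenue", ["raw.orders"], ["mart.revenue"])
payload = json.dumps(event)  # what a producer would send to a consumer
```

Because every producer emits the same shape, a consumer only needs one reader for this payload rather than one per tool.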

Manual mapping fills gaps where standards have not reached. The catalog supports custom edges between systems. The team defines them based on operational knowledge. The mapping degrades over time as systems change; periodic refresh keeps it usable.
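
Stitching can be sketched as merging per-system edge lists into one graph and adding manual edges for the hops no system reports. The identifiers and the normalization rule (case-insensitive comparison) are assumptions for illustration; real stacks usually need a more careful naming convention.

```python
# Hypothetical edges reported by three systems, each with its own naming.
kafka_edges = [("pg://shop/orders", "kafka://orders_cdc")]
lakehouse_edges = [("kafka://orders_cdc", "lake.bronze.orders")]
warehouse_edges = [("LAKE.BRONZE.ORDERS", "ANALYTICS.REVENUE")]

# Manual edges cover hops no system reports, such as a BI extract.
manual_edges = [("analytics.revenue", "bi://dashboards/exec_kpis")]


def normalize(name: str) -> str:
    """Assumed convention: dataset identifiers compare case-insensitively."""
    return name.strip().lower()


def stitch(*edge_lists):
    """Merge per-system edge lists into one graph keyed by normalized name."""
    graph = {}
    for edges in edge_lists:
        for src, dst in edges:
            graph.setdefault(normalize(src), set()).add(normalize(dst))
    return graph


def path_exists(graph, start, end):
    """Depth-first check that lineage connects `start` to `end`."""
    target = normalize(end)
    stack, seen = [normalize(start)], set()
    while stack:
        node = stack.pop()
        if node == target:
            return True
        if node not in seen:
            seen.add(node)
            stack.extend(graph.get(node, ()))
    return False


graph = stitch(kafka_edges, lakehouse_edges, warehouse_edges, manual_edges)
connected = path_exists(graph, "pg://shop/orders", "bi://dashboards/exec_kpis")
```

With the manual edge in place the operational table connects all the way to the dashboard; drop `manual_edges` from the `stitch` call and the path ends at the warehouse, which is exactly the kind of boundary gap described above.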

The cross-system gap is shrinking but not closed. Most production lineage implementations are still strongest within a single platform and weaker at the boundaries between platforms. Companies whose data flows across many systems should expect lineage gaps and plan for manual filling.

Common Failure Modes

Manual lineage that drifts immediately. The team documents lineage in a wiki when the pipeline first ships; the pipeline evolves; the wiki does not. The lineage becomes worse than useless because consumers trust it and get wrong answers. The fix is automatic extraction from production systems.

Incomplete coverage that limits usefulness. The catalog covers warehouse tables but not upstream ingestion. Or covers BI tools but not the transformations behind them. Impact analysis is unreliable because parts of the graph are missing. The fix is steady expansion of coverage to the systems that matter for the use cases the team cares about.

Stale lineage that no one trusts. The graph claims a table feeds a dashboard that was deprecated months ago. Consumers learn the lineage is unreliable and stop using it. The fix is refresh cadence that keeps the graph current and explicit handling of deprecated assets.
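
One way to handle staleness, sketched under assumptions: each edge carries the last time it was observed in production, and assets can be flagged as deprecated. The 30-day window, the asset names, and the dates are hypothetical.

```python
from datetime import datetime, timedelta

# Hypothetical edges with the last time each was observed in production.
edges = {
    ("mart.revenue", "dashboard.exec_kpis"): datetime(2026, 3, 1),
    ("mart.revenue", "dashboard.old_finance"): datetime(2025, 9, 15),
}
deprecated_assets = {"dashboard.old_finance"}


def current_edges(edges, now, max_age_days=30, deprecated=frozenset()):
    """Keep edges seen recently whose endpoints are not marked deprecated."""
    cutoff = now - timedelta(days=max_age_days)
    return {
        (src, dst): seen
        for (src, dst), seen in edges.items()
        if seen >= cutoff and src not in deprecated and dst not in deprecated
    }


fresh = current_edges(edges, now=datetime(2026, 3, 10), deprecated=deprecated_assets)
```

Only the edge to `dashboard.exec_kpis` survives. Filtering at read time rather than deleting keeps the history available if someone later asks what a deprecated dashboard used to depend on.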

Lineage with no use case attached. The catalog has rich lineage; nobody consults it; the metadata grows without producing value. The fix is identifying specific use cases (impact analysis, compliance reporting, migration planning) and building workflows that consume the lineage actively.

Column-level lineage with poor accuracy. The extraction works for simple queries but produces wrong edges for complex ones; consumers learn to distrust column-level answers. The fix is honest accuracy reporting per query type and limiting column-level claims to the queries the extractor handles well.
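
Accuracy reporting per query type could look like the sketch below. It assumes a hand-reviewed audit sample and uses precision and recall as the accuracy measures; both the sample and the choice of metrics are assumptions, not something a particular tool is known to report.

```python
from collections import defaultdict

# Hypothetical audit sample: for each query, the edges the extractor produced
# and the edges a reviewer confirmed by hand.
audit = [
    {"type": "simple_select", "extracted": {("a.x", "b.x")}, "truth": {("a.x", "b.x")}},
    {"type": "simple_select", "extracted": {("a.y", "b.y")}, "truth": {("a.y", "b.y")}},
    {
        "type": "window_function",
        "extracted": {("a.x", "c.rank"), ("a.z", "c.rank")},
        "truth": {("a.x", "c.rank"), ("a.y", "c.rank")},
    },
]


def precision_recall_by_type(samples):
    """Per query type: share of extracted edges that are right, and share of true edges found."""
    counts = defaultdict(lambda: {"hit": 0, "extracted": 0, "truth": 0})
    for sample in samples:
        c = counts[sample["type"]]
        c["hit"] += len(sample["extracted"] & sample["truth"])
        c["extracted"] += len(sample["extracted"])
        c["truth"] += len(sample["truth"])
    return {
        qtype: {
            "precision": c["hit"] / c["extracted"] if c["extracted"] else None,
            "recall": c["hit"] / c["truth"] if c["truth"] else None,
        }
        for qtype, c in counts.items()
    }


report = precision_recall_by_type(audit)
```

In this sample the extractor scores 1.0 on both measures for simple selects and 0.5 on both for window functions, which is the kind of split that justifies limiting column-level claims to the query types handled well.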

Best Practices

  • Extract lineage automatically from production systems; manual lineage drifts before it ships.
  • Start with table-level coverage across the stack before chasing column-level depth.
  • Tie lineage to concrete use cases (impact analysis, compliance, migration) that consume it actively.
  • Adopt OpenLineage where producers and consumers support it to reduce custom integration work.
  • Refresh lineage continuously rather than in scheduled batches so the graph reflects current reality.

Common Misconceptions

  • "Lineage is for compliance only." The operational use cases (impact analysis, debugging, discovery) are usually higher value.
  • "Column-level lineage is always better than table-level." Column-level is more precise but harder to extract reliably, and it is not always needed.
  • "The catalog tool does the lineage work." The tool provides the infrastructure; integrating it with each producer is the work.
  • "Lineage shows what should happen." Automatically extracted lineage shows what actually happened, because it is derived from running systems.
  • "Lineage tools work out of the box." Coverage and accuracy depend heavily on integration depth with each source.

Frequently Asked Questions (FAQs)

Do I need a separate lineage tool?

Usually not. The catalog vendors all include lineage as a core feature; dbt has built-in lineage for its scope; warehouse-native features cover within-warehouse lineage. A separate dedicated lineage tool is rare; most teams get lineage from one of these embedded sources.

How do I get lineage from non-SQL transformations?

Through code-level instrumentation. OpenLineage's Spark and Flink integrations emit lineage from those engines. Custom Python transformations may need manual annotation or wrapping in instrumented frameworks. Coverage gaps in non-SQL parts of the stack are common.

How is lineage different from data discovery?

Discovery helps you find datasets; lineage tells you how datasets relate. The two are complementary and usually live in the same tool (a catalog). Discovery is broader; lineage is one dimension of the metadata a catalog provides.

How accurate is column-level lineage in practice?

Highly accurate for simple SELECT/JOIN/aggregation queries. Less accurate for complex window functions, dynamic SQL, and queries that use UDFs or stored procedures. Some vendors publish accuracy data for their parsers; accuracy varies meaningfully across vendors and across query patterns within a single tool.

Can lineage track data outside the data platform?

It can, with appropriate integration. Lineage into operational systems (the CRM that consumes reverse ETL output, the recommendation system that consumes ML features) requires custom integration most of the time. Some catalogs support this; coverage is sparser than for traditional analytics flows.

How does OpenLineage help?

OpenLineage standardizes how systems emit lineage events. Producers integrate once with OpenLineage; any consumer that reads OpenLineage gets lineage from those producers without custom integration. The standard reduces the per-tool integration work that previously dominated lineage projects.

How do I use lineage for incident response?

When a data quality alert fires on a source table, the on-call person consults lineage to identify affected downstream consumers, notifies the affected teams, and prioritizes the fix by downstream impact. Without lineage, the same investigation requires manual tracing through pipeline code.

What is the cost of comprehensive lineage?

Tool licensing for vendor catalogs at enterprise scale runs in the low six figures annually. On top of that comes the engineering investment to integrate the producers, configure the extraction, and operate the catalog. Most production lineage programs are ongoing investments rather than one-time projects.

Where is data lineage heading?

Toward better automated extraction across more system types. Toward broader OpenLineage adoption that reduces custom integration. Toward tighter integration between lineage and adjacent capabilities (observability, governance, cost management). The metadata layer is consolidating around a few dominant patterns and standards.