A data pipeline is a sequence of operations that moves data from a source through one or more processing stages to a destination, with the operations chained so that the output of each stage feeds the next. The simplest pipeline copies a table once a day. The most elaborate pipelines branch, merge, retry, backfill, and serve hundreds of downstream consumers from the same source. Real examples reveal how the pattern varies by use case, what production pipelines actually look like under the hood, and which design choices separate pipelines that run for years from pipelines that need a rewrite every quarter.
The pipeline pattern shows up across analytics ingestion, ML training data assembly, event activation, CDC replication, and many other contexts. The shared structure is the chained sequence of steps with explicit data flow between them. The differences across use cases live in latency requirements, failure handling, schema evolution, and what counts as "done" for a pipeline run.
Pipelines in 2026 are usually written declaratively rather than imperatively. The pipeline definition describes the graph of operations and their dependencies; the orchestrator schedules the actual execution, handles retries, and reports status. The declarative shift made pipelines testable, versionable, and reviewable in ways that ad hoc scripts never were.
What separates a production pipeline from a one-off script is operational discipline. Production pipelines run on schedule without human intervention, report when they fail, recover from common failure modes, can be backfilled across historical date ranges, and have clear ownership. Scripts run when someone remembers to run them. The gap is structural; turning a script into a production pipeline takes real engineering work.
This page surveys real implementations across ingestion, transformation, ML, and activation pipelines. Tool choices vary by company and by year; the underlying patterns are stable enough to be useful even as the specific tools rotate.
Airbyte and Fivetran handle the high-volume case of "I need data from System X into my warehouse." Both maintain hundreds of connectors. Fivetran's managed approach trades configuration depth for operational simplicity; Airbyte's open-source approach trades the reverse. Most companies pick one of the two as their default ingestion tool and only write custom pipelines for the sources neither covers well.
Custom ingestion pipelines show up where managed connectors are insufficient: internal databases with unusual schemas, third-party APIs with rate limits and pagination requirements, legacy file drops on SFTP, or sources where the data needs significant cleanup before landing. Teams typically build these on the same orchestrator they use for transformations.
CDC-based ingestion replicates changes from operational databases into the warehouse continuously. Debezium captures changes from PostgreSQL, MySQL, MongoDB, and others, publishing them to Kafka. Downstream consumers (warehouse loaders, search indexes, caches) read the Kafka stream. The pattern keeps the warehouse seconds-to-minutes fresh rather than the hours-to-days of batch ingestion.
Event ingestion pipelines capture web and mobile events from clients, validate them against a schema, and route them to destinations. Segment, Rudderstack, and Snowplow are the major vendors. The pipelines often combine schema validation, enrichment with user attributes, and fan-out to multiple destinations like the warehouse plus marketing automation tools.
Log ingestion at large scale uses pipelines that route from application logs through aggregation systems (Vector, Fluentd, Logstash) to long-term stores (S3, ClickHouse, OpenSearch). The volumes can dwarf application data; pipeline efficiency matters because the cost of careless log ingestion can exceed the cost of the workload that produced the logs.
dbt projects function as transformation pipelines on top of warehouses. The team defines models in SQL or Python, dbt resolves the dependency graph, and the pipeline executes the models in order. The pattern has become the default at companies using Snowflake, BigQuery, Databricks, or Redshift. Tens of thousands of companies run dbt in production.
Airflow DAGs predate dbt and remain common for orchestrating heavier transformation workflows that combine SQL, Python, and external systems. Airflow's strength is flexibility; the trade-off is operational complexity that smaller teams sometimes underestimate. Many companies run dbt inside Airflow tasks.
Dagster takes a different approach with asset-based pipelines. Pipeline definitions describe the data assets being produced rather than the tasks being executed. The model fits modern data engineering better than the task-based abstractions of older orchestrators. Adoption has grown steadily.
Spark and Flink pipelines handle workloads too large or too latency-sensitive for warehouse-native transformation. Netflix, Uber, Pinterest, and similar large-data companies run extensive Spark transformation pipelines for batch and Flink pipelines for streaming. The engineering investment is real but the throughput and flexibility justify it at scale.
Notebook-driven pipelines exist but are usually a code smell. Notebooks are great for exploration; they are bad pipeline runtimes. The teams that ran production on notebooks for years eventually migrate to proper pipelines after enough silent failures and reproducibility problems.
Training data pipelines assemble the features, labels, and splits that a model trains on. The pipelines join across many sources, handle time-aware features carefully to avoid leakage, and produce versioned outputs the training run can consume reproducibly. Bad training pipelines are the most common source of subtle model quality problems.
Feature pipelines compute features that feed both training and serving. The same logic that produces training data for a model also produces the features that serve at inference time. Feature stores (Tecton, Feast, Feathr, the platform-native offerings) coordinate the dual-purpose pipelines so training and serving stay consistent.
Inference pipelines apply trained models to new data, either in batch or real-time. Batch inference scores all customers nightly for a recommendation system. Real-time inference scores a single request as it arrives. The pipeline shape differs dramatically between the two; both are valid patterns and both show up at the same companies.
Retraining pipelines refresh models on schedule or when drift detection triggers a refresh. The pipeline runs the training, evaluates against a held-out set, compares against the current production model, and either promotes the new model or alerts a human reviewer. The pattern automates model lifecycle work that used to require constant human attention.
Evaluation pipelines run models against test sets and produce quality metrics. The pipelines belong to the model rather than to the inference path. Running evaluation on every model change catches regressions before they ship. The discipline is borrowed from software testing and has been steadily adopted in ML engineering.
Uber's surge pricing pipeline processes real-time location and demand data to produce price multipliers. The latency budget is seconds because riders are waiting for prices to load. The pipeline combines Kafka transport, Flink processing, and operational stores serving the inference layer.
Real-time fraud pipelines at payment companies score transactions in milliseconds. The pipeline ingests the transaction, joins it with historical data and recent transaction patterns, runs a model, and returns an allow or deny decision. The whole sequence has to complete within the payment authorization timeout.
Real-time personalization pipelines at content platforms (Netflix, Spotify, YouTube) update user feature stores as users interact, so the next page load reflects the very recent activity. The patterns combine event capture, feature computation, and feature store writes within the latency window between user actions.
CDC pipelines from operational databases to search indexes and caches keep auxiliary stores fresh without dual-writing from the application. The pattern frees application code from coordinating multiple writes and provides a single source of truth (the database) that feeds the others.
Streaming pipelines need careful handling of late data, out-of-order events, and exactly-once semantics. The engineering complexity is real and underestimated. Teams that adopt streaming for use cases that do not strictly require it often end up running batch-with-extra-steps and paying the streaming complexity premium for no real benefit.
Airflow is the most-deployed orchestrator. The patterns are well understood, the community is large, and most data engineers have used it. The complaints are the same complaints that come up year after year: scheduling can be fragile at scale, the task abstractions are aging, and the UI shows its age. The new generation of orchestrators competes on better abstractions.
Dagster offers asset-based abstractions, better local development, and tighter integration with modern data tooling. Adoption has grown fastest at teams starting fresh in the last two years. Migration from Airflow is feasible but not trivial.
Prefect emphasizes Python-native pipeline definition and a more flexible execution model than Airflow's DAGs. The product has good adoption in ML-heavy teams. Argo Workflows shows up in Kubernetes-heavy environments where the team wants pipelines that look like the rest of their infrastructure.
Managed orchestration (AWS Step Functions, GCP Workflows, Azure Data Factory, dbt Cloud, the various cloud-native data pipeline services) reduces operational burden at the cost of flexibility. The choice is usually about how much custom code the pipeline needs. Simple pipelines fit well in managed services; complex pipelines outgrow them.
Operations matter as much as orchestration choice. Pipelines need on-call ownership, SLA definitions, monitoring dashboards, and clear runbooks for common failures. Teams that skip the operational layer end up with brittle pipelines that need constant manual intervention.
Upstream schema changes that break the pipeline silently. The producer adds a column, the pipeline drops the new data without warning, downstream consumers compute on incomplete inputs. The fix is schema contracts plus drift detection that alerts on producer changes.
Late-arriving data that the pipeline does not handle. The pipeline runs at 2am, the upstream source publishes the last batch at 2:15am, the pipeline output is missing fifteen minutes of data. The fix is either delaying the pipeline or designing it to backfill late data on subsequent runs.
Backfill blowups when historical reprocessing exceeds the cluster's capacity. The team needs to rebuild ninety days of derived data, kicks off the backfill, and brings the warehouse to its knees. The fix is rate-limited backfills and the discipline of designing transformations to be incrementally backfillable.
Cascading failures where one bad source corrupts downstream computations. The fix is data quality checks at the source boundary plus the ability to halt processing rather than propagating bad data downstream.
Operational drift where pipelines accumulate one-off tweaks that no one fully understands. Years of accumulated changes leave the team afraid to touch the pipeline. The fix is code review, documentation, and periodic refactoring.
As often as the downstream use case actually requires, and no more. Dashboards updated daily can run their pipeline daily. Operational integrations that activate based on recent user behavior may need hourly or sub-hourly. Real-time use cases need streaming. Running pipelines more frequently than the use case requires wastes compute and adds operational risk for no benefit.
An orchestrator schedules and monitors pipeline execution (Airflow, Dagster, Prefect). A transformation framework defines what the pipeline does (dbt, SQLMesh, custom Spark code). The two layers are complementary; most production pipelines combine an orchestrator for scheduling with a transformation framework for transformations.
Retry transient failures automatically with exponential backoff. Alert on persistent failures to whoever owns the pipeline. Log enough context that the on-call person can understand what failed. Build idempotent pipelines so reruns produce the same output as a single successful run. Have backfill capability for cases where you need to reprocess after fixing a bug.
Incremental when you can, full refresh when you must. Incremental is cheaper at scale but harder to reason about. Full refresh is simpler but expensive at scale. Many production pipelines are incremental day-to-day with a periodic full refresh as a safety net.
Unit tests on transformation logic where possible. Integration tests that run the pipeline against small representative inputs. Data quality tests on production outputs (row counts, nulls, distributions, referential integrity). Pre-deployment tests that catch regressions before they ship. The combination catches more problems than any one layer alone.
The same way you version application code: git, branches, pull requests, code review. The pipeline definition is code. Deployment goes through CI/CD. The discipline is borrowed from software engineering and produces the same benefits: history, review, rollback, reproducibility.
Contracts define the expected shape of data at producer-consumer boundaries. The producer commits to the contract; the consumer depends on the contract; changes go through a defined process. Contracts catch the most common breakage mode (silent schema changes) before it propagates downstream.
Track metrics: success rate per pipeline, mean time to detect failures, mean time to recover, data freshness against SLA. Set targets and review monthly. Reliability is something you measure, not something you assume.
Toward more declarative definitions, more incremental processing, more integration between batch and streaming abstractions, and more AI-assisted pipeline development and debugging. The orchestrator landscape will continue to consolidate around a few dominant players. The patterns of building reliable pipelines will remain mostly the same; the tools that implement them will keep getting better.