The Term Gets Stretched
"Data pipeline" gets used loosely in product documentation, vendor marketing, and casual technical conversation. The looseness produces confusion because data pipelines have a precise technical meaning that shapes design, operation, and procurement decisions. Confusion at the definition stage produces miscommunication that compounds into architecture mistakes.
This primer anchors the definition, distinguishes the types, and grounds each type in real examples. It is the reference for the conversations that have to happen before serious data pipeline design or procurement.
The Definition
A data pipeline is a system that moves data from sources to destinations, applying transformations along the way. The definition has four components.
Sources are where data originates. Application databases, event streams, APIs, files, sensors, third-party services. The data exists somewhere, and the pipeline starts there.
Destinations are where data needs to end up. Data warehouses, data lakes, operational databases, analytical systems, ML training environments, user-facing applications. The pipeline ends when data reaches its destination in usable form.
Transformations are what happens to the data in transit. Cleaning, normalization, enrichment, aggregation, format conversion, schema alignment. The transformations make the destination's view of the data different from the source's view in deliberate ways.
The movement is what connects the four components. Pipelines move data; they do not just store it. The movement is reliable, repeatable, and observable in well-built pipelines.
A system that meets all four components is a data pipeline. A system missing any one is something else. A storage system is not a pipeline. A transformation script that operates on already-loaded data is not a pipeline. An API integration that does not include transformation is more accurately called integration.
The definition matters because pipelines have specific design considerations that other categories of system do not.
The Main Types
Data pipelines come in distinguishable types based on how they move data and what they do with it.
ETL pipelines extract data from sources, transform it in transit, and load the transformed data into destinations. The transformation happens between extraction and loading. ETL is the classical model. Examples include traditional data warehouse loading processes that read source databases, apply business logic, and load curated tables.
ELT pipelines extract data and load it in raw form, then transform inside the destination. The destination has the compute to handle transformation, which has been increasingly true as cloud data warehouses scaled. Examples include Snowflake or BigQuery deployments where raw data lands and dbt or similar tools transform it in place.
Streaming pipelines move data continuously rather than in batches. Each event flows through processing as it arrives. Examples include Kafka-based architectures that process events for fraud detection, real-time recommendations, or operational monitoring.
CDC (change data capture) pipelines move incremental changes from source databases rather than full snapshots. Updates, inserts, and deletes propagate through the pipeline to destinations. Examples include Debezium-based architectures that keep analytical systems synchronized with operational databases.
ML pipelines specifically prepare data for machine learning training and inference. Feature engineering, dataset construction, and inference serving may all be considered ML pipelines depending on context. Examples include feature stores feeding ML models with consistent feature data for both training and inference.
Each type has different operational characteristics. The type a workload needs depends on what the data does, not on which vendor wants to sell what tool.
Real Examples That Distinguish Types
Specific examples make the differences concrete.
A retail company runs nightly ETL that extracts the day's transactions from operational databases, transforms them through business logic that computes margins and categorizes products, and loads the results into a data warehouse where business analysts query them in the morning.
The same retail company runs streaming pipelines that capture in-store events as they happen, process them through real-time analytics, and update operational dashboards that store managers monitor throughout the day.
A financial services company runs ELT that lands raw bank account changes into a cloud data warehouse, then applies dbt transformations to produce regulatory reports, customer analytics, and risk models.
A media company runs CDC that synchronizes content metadata from its content management system to its recommendation engine. As content is published or updated, the recommendation engine sees the changes within seconds.
A SaaS company runs ML pipelines that compute features about user behavior, store them in a feature store, and serve them to ML models for personalization and churn prediction.
Each pipeline is solving a specific problem with a fit-for-purpose architecture. Confusing the types or applying the wrong type produces operational pain.
What Distinguishes Good Pipelines From Bad
Independent of type, pipelines vary in quality. Several characteristics distinguish well-built from poorly-built.
Idempotence. Running the same pipeline twice produces the same result. Errors during processing do not produce double-counted data or partially-updated state. Idempotence makes recovery from failures clean.
Observability. The pipeline emits metrics and logs that let operators understand what it is doing. Data volumes processed, error rates, latency, lineage. Without observability, problems become customer reports.
Schema management. The pipeline handles schema changes from sources and to destinations gracefully. Schema breaking changes are detected and either handled automatically or fail loudly enough to be addressed.
Data quality enforcement. The pipeline checks the data it moves. Missing required fields, value ranges, referential integrity. Quality issues are caught at the pipeline rather than downstream where they cause harder-to-diagnose problems.
Lineage tracking. The pipeline records where data came from, what transformations were applied, and where it went. Lineage supports debugging, compliance, and trust in the destination data.
A well-built pipeline has all five characteristics. A poorly-built pipeline can produce results that look correct and are not.
Where Pipelines Sit in the Broader Architecture
Data pipelines are infrastructure that supports analytics, AI, operations, and business intelligence. The pipeline itself does not produce business value directly; it makes business value possible by getting data to where decisions are made.
Above the pipeline sit the consumers: business intelligence tools, ML training environments, operational dashboards, customer-facing applications. The pipeline serves these consumers with current, correct, accessible data.
Below the pipeline sit the data sources: operational systems, third-party services, event producers. The pipeline reads from these and abstracts the source-specific access patterns.
Beside the pipeline sit related systems: schema registries, data catalogs, lineage tools, observability platforms. These support the pipeline's operation and the consumers' usage.
A pipeline is not a complete data system. It is one layer of a complete data system. Treating it as the whole produces architectures where the pipeline is over-built and the surrounding capabilities are under-built.
What Logiciel Does Here
Logiciel works with engineering and data teams designing or modernizing their data pipeline architecture. The work is typically structured around defining the pipeline categories the workload needs, selecting fit-for-purpose architecture for each, and integrating with the broader data infrastructure.
The AI Data Pipelines framework covers AI-specific pipeline patterns. The Data Pipelines Batch Streaming Hybrid framework covers the three primary architectural choices.
A 30-minute working session is enough to clarify which pipeline types your workload actually needs.
Frequently Asked Questions
What is the difference between ETL and ELT?
The order of operations. ETL transforms data between extraction and loading. ELT loads raw data first, then transforms in the destination. ELT has become more common as cloud data warehouses gained the compute capacity to handle in-warehouse transformation.
When do I need streaming versus batch?
Streaming when the data's value drops sharply with latency above a threshold (real-time fraud detection, live recommendations, operational monitoring). Batch when data can be processed in scheduled windows (daily reports, weekly analytics, periodic model retraining).
What about reverse ETL?
Reverse ETL is a specific pattern where data flows from analytical destinations back to operational systems. Examples include syncing customer segments from a data warehouse to a marketing automation tool. It is a pipeline category that has emerged in 2022-2024 as cloud data warehouses became central data hubs.
Do I need a data engineer to build pipelines?
For non-trivial pipelines, yes. Modern pipeline tools have lowered the bar for simple cases (Fivetran, Airbyte for extraction; dbt for transformation), but design decisions about architecture, schema management, and operational discipline still require engineering judgment.
What tools should I use?
Depends on type and scale. For batch ETL/ELT, Airflow plus dbt is common. For streaming, Kafka plus Flink or managed alternatives. For CDC, Debezium or managed CDC products. The tool choice should follow type and scale decisions, not lead them. Sources: - dbt Labs, "Analytics Engineering Benchmarks 2024" - Confluent, "State of Data in Motion 2024"