Data integration is the process of combining data from multiple sources into a unified view. A company might have customer information in Salesforce, transaction history in Stripe, and user activity on its website. Each system stores data in different formats and locations. Integration brings them together so analysts and systems can work with a complete picture of the data.
Integration solves the problem of data silos. Without it, getting a complete answer to a simple question requires querying multiple systems separately then manually stitching the results together. An analyst might spend an hour pulling data from four sources and reconciling the differences. With integration, they query one system and get results in seconds.
Data integration encompasses several patterns. ETL (extract, transform, load) extracts data from sources, transforms it, then loads it into a target. ELT (extract, load, transform) loads data first, then transforms it in the target. CDC (change data capture) replicates only the changes made at the source since the last run, making updates efficient. API integration connects to sources through their APIs. Data virtualization presents multiple sources as a single database without moving data. Each pattern has trade-offs between simplicity, cost, and freshness.
Modern data infrastructure is built on integration. Data flows from operational systems into a central data warehouse. From there, it's transformed, analyzed, and used to drive decisions. Integration is the critical layer that makes this flow possible.
ETL is the classical data integration pattern. Extract reads data from a source system. A database query pulls all customer records. A file export provides transaction history. The extract phase gets raw data out of the source. Transform applies business logic. Data is cleaned, reshaped, and enriched. A customer's first name and last name are combined into a full name. Transactions are converted from one currency to another. Duplicate records are identified and merged. The transform phase prepares data for use. Load writes the transformed data into a target system, usually a data warehouse.
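The three phases map naturally to three functions. Below is a minimal sketch, using SQLite as a stand-in for both the source system and the warehouse; the table and column names, and the assumption that the source already contains a customers(id, first_name, last_name, email) table, are hypothetical.

```python
import sqlite3

def extract(source_path: str) -> list[tuple]:
    """Extract: pull raw customer rows out of the source system."""
    with sqlite3.connect(source_path) as conn:
        return conn.execute(
            "SELECT id, first_name, last_name, email FROM customers"
        ).fetchall()

def transform(rows: list[tuple]) -> list[tuple]:
    """Transform: combine names, normalize emails, merge duplicates."""
    seen, out = set(), []
    for cid, first, last, email in rows:
        if cid in seen:  # duplicate records are merged (first occurrence wins)
            continue
        seen.add(cid)
        out.append((cid, f"{first} {last}".strip(), (email or "").lower()))
    return out

def load(target_path: str, rows: list[tuple]) -> None:
    """Load: write transformed rows into the warehouse table."""
    with sqlite3.connect(target_path) as conn:
        conn.execute(
            "CREATE TABLE IF NOT EXISTS dim_customer "
            "(id INTEGER PRIMARY KEY, full_name TEXT, email TEXT)"
        )
        conn.executemany(
            "INSERT OR REPLACE INTO dim_customer VALUES (?, ?, ?)", rows
        )

load("warehouse.db", transform(extract("source.db")))
```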
ETL keeps data transformation outside the target system. This has advantages. The transformation environment can be optimized for transformation work. Complex business logic can be expressed in code. Data validation and quality checks happen before data enters the warehouse. The warehouse receives clean, validated data. The downside is that the upfront transformation must be comprehensive. If you miss a needed field, you have to change the pipeline and rerun the ETL.
ETL is often batch-oriented. Run nightly at 2am. Extract all data since the last run. Transform and load it. This creates a predictable window where data is fresh. Most traditional data warehouses rely on nightly batch ETL. An analyst knows that by 8am, yesterday's data is in the warehouse.
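Schedulers make the nightly window explicit. A minimal sketch with Apache Airflow 2.x, where the DAG id and the task callables are hypothetical placeholders:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def run_extract():    # placeholder callables for the three phases
    ...

def run_transform():
    ...

def run_load():
    ...

with DAG(
    dag_id="nightly_etl",
    start_date=datetime(2024, 1, 1),
    schedule="0 2 * * *",   # run nightly at 2am
    catchup=False,
) as dag:
    extract = PythonOperator(task_id="extract", python_callable=run_extract)
    transform = PythonOperator(task_id="transform", python_callable=run_transform)
    load = PythonOperator(task_id="load", python_callable=run_load)
    extract >> transform >> load   # enforce phase ordering
```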
ELT inverts the ETL order. Extract and load happen first. Raw data goes directly into the warehouse. Transform happens second, inside the warehouse using SQL. This is possible because modern cloud warehouses are fast and cheap. Running SQL transformations on a terabyte of data in Snowflake is fast and costs little. The advantage is simplicity. Loading raw data is straightforward. No transformation logic to write during load. The warehouse does all transformation.
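Concretely, ELT can be as simple as two steps against the warehouse itself. A minimal sketch, using SQLite as a stand-in for a cloud warehouse, with hypothetical table names and sample rows:

```python
import sqlite3

raw_rows = [
    (1, "Ada", "Lovelace", "ADA@example.com"),
    (2, "Alan", "Turing", "alan@example.com"),
]

with sqlite3.connect("warehouse.db") as conn:
    # Load: raw rows go straight into a staging table, untransformed.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS raw_customers "
        "(id INTEGER, first_name TEXT, last_name TEXT, email TEXT)"
    )
    conn.executemany("INSERT INTO raw_customers VALUES (?, ?, ?, ?)", raw_rows)

    # Transform: business logic runs afterwards, inside the warehouse, as SQL.
    conn.execute("DROP TABLE IF EXISTS dim_customer")
    conn.execute(
        """
        CREATE TABLE dim_customer AS
        SELECT id,
               first_name || ' ' || last_name AS full_name,
               lower(email) AS email
        FROM raw_customers
        """
    )
```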
ELT also provides flexibility. Analysts can access raw data while transformations run. They can build ad-hoc analyses or troubleshoot issues. Transformation logic is expressed in SQL, which many analysts know. Tools like dbt make SQL transformations manageable. You write transformation logic as code, test it, and version control it. dbt has become the standard for ELT transformation.
The downside of ELT is that the warehouse holds both raw and transformed data, so storage is partly duplicated. Query performance depends on how well transformations are optimized. If you write inefficient SQL, queries are slow. But for most use cases, the benefits outweigh the downsides. ELT has become the default for modern data stacks because it aligns with cloud warehouse architecture.
CDC captures changes to data at the source and replicates only the changes to a target. A customer updates their email. CDC detects this change and sends an update event to the warehouse. The warehouse applies the change. This is far more efficient than extracting the entire customer table daily. CDC scales to massive datasets. Extracting a billion-record table daily is expensive and slow. Replicating millions of daily changes is efficient and fast.
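Applying change events is mostly upserts and deletes. A minimal sketch, where the event envelope ("op", "before", "after") loosely follows Debezium's format as an assumption, and the table and columns are hypothetical:

```python
import sqlite3

def apply_event(conn: sqlite3.Connection, event: dict) -> None:
    """Apply one CDC change event to the warehouse table."""
    if event["op"] in ("c", "u"):          # create or update: upsert the row
        row = event["after"]
        conn.execute(
            "INSERT OR REPLACE INTO customers (id, email) VALUES (?, ?)",
            (row["id"], row["email"]),
        )
    elif event["op"] == "d":               # delete: remove the row
        conn.execute(
            "DELETE FROM customers WHERE id = ?", (event["before"]["id"],)
        )

with sqlite3.connect("warehouse.db") as conn:
    conn.execute(
        "CREATE TABLE IF NOT EXISTS customers (id INTEGER PRIMARY KEY, email TEXT)"
    )
    apply_event(conn, {"op": "u", "after": {"id": 7, "email": "new@example.com"}})
```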
CDC requires the source system to log or track changes. Most modern databases support CDC through transaction logs. PostgreSQL has logical decoding. MySQL has binary logs. SQL Server has CDC built-in. The source system writes changes to these logs. A CDC tool reads the logs and replicates changes downstream. Kafka Connect, a data integration framework, includes CDC connectors for many databases. These connectors monitor source databases and stream changes to Kafka, which then feeds a warehouse.
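With Kafka Connect, CDC setup is largely configuration. The sketch below registers a Debezium PostgreSQL connector through the Connect REST API; the hostnames, credentials, table list, and topic prefix are placeholders, and the config keys follow Debezium 2.x naming:

```python
import requests

connector = {
    "name": "crm-cdc",
    "config": {
        "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
        "plugin.name": "pgoutput",             # use Postgres logical decoding
        "database.hostname": "crm-db.internal",
        "database.port": "5432",
        "database.user": "replicator",
        "database.password": "secret",         # store real credentials in a vault
        "database.dbname": "crm",
        "table.include.list": "public.customers",
        "topic.prefix": "crm",                 # Kafka topics: crm.public.customers
    },
}

# Kafka Connect exposes a REST API for connector management.
resp = requests.post("http://localhost:8083/connectors", json=connector)
resp.raise_for_status()
```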
CDC enables real-time or near-real-time data movement. As data changes in the source, those changes reach the warehouse within seconds. This freshness is valuable for operational analytics. Real-time dashboards show current data. Alerts trigger immediately when something changes. CDC is increasingly the preferred pattern for high-volume databases because it's efficient and keeps data fresh.
Many data sources, especially SaaS applications, only expose data through APIs. Salesforce, HubSpot, Stripe, and others provide APIs for accessing data. API integration pulls data from these APIs on a schedule or on-demand, then loads it into a warehouse. API integration is simple conceptually. Call the API, get data, store it. The challenge is that APIs have limitations. They return data in their own format. Rate limits restrict how fast you can pull. Pagination makes retrieving large datasets slow.
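A minimal pull loop has to handle both pagination and rate limits. The endpoint, page parameters, and response shape below are hypothetical; real APIs vary (cursor tokens, Link headers):

```python
import time

import requests

def fetch_all(base_url: str, api_key: str) -> list[dict]:
    """Pull every page of a paginated endpoint, backing off on rate limits."""
    records, page = [], 1
    while True:
        resp = requests.get(
            f"{base_url}/customers",
            headers={"Authorization": f"Bearer {api_key}"},
            params={"page": page, "per_page": 100},
        )
        if resp.status_code == 429:   # rate limited: wait and retry this page
            time.sleep(int(resp.headers.get("Retry-After", "5")))  # assumes seconds
            continue
        resp.raise_for_status()
        batch = resp.json()["data"]
        if not batch:                 # an empty page means we're done
            return records
        records.extend(batch)
        page += 1
```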
API integration works well for slowly changing data. If customer data in Salesforce changes daily, pulling via API once daily is acceptable. For faster-changing data or massive datasets, APIs become a bottleneck. A SaaS application with millions of records can't be pulled via API in a timely way. Some platforms, like Stripe, provide data exports or CDC options that are more efficient than APIs.
Pre-built connectors for popular SaaS applications have become standard. Fivetran and Airbyte include connectors for hundreds of SaaS sources. You configure the connector with your credentials. The tool handles API calls, pagination, and error handling. The connector runs on a schedule and loads data into your warehouse. This removes the burden of building custom API integrations.
Data virtualization presents multiple data sources as a single virtual database. An analyst writes a query against the virtual view. The system routes parts of the query to different sources. A query asking for customer age and purchase count might fetch age from the CRM and purchase count from the transaction database. The system combines results and returns them. From the analyst's perspective, it's one database.
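Query engines like Trino are one common way to implement this. A sketch of that customer-age-and-purchase-count query, where the catalog and schema names are assumptions about how the two sources were registered with the engine:

```python
import trino

# Federated query: Trino routes each branch of the join to its source system.
conn = trino.dbapi.connect(host="trino.internal", port=8080, user="analyst")
cur = conn.cursor()
cur.execute("""
    SELECT c.age, count(t.id) AS purchase_count
    FROM crm.public.customers c           -- lives in the CRM database
    JOIN payments.public.transactions t   -- lives in the transaction database
      ON t.customer_id = c.id
    GROUP BY c.age
""")
rows = cur.fetchall()
```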
Data virtualization avoids moving data. No duplication. Data lives where it was created. This keeps data fresh because you always query live sources. The downside is query performance depends on source system performance. If querying the CRM is slow, your entire query is slow. If the transaction database is overloaded, queries fail or time out. Most virtualization tools include caching to mitigate this. Frequently accessed data is cached locally.
Data virtualization is useful when you have a few stable sources and real-time data is critical. For large-scale analytics with many sources, data integration to a warehouse is usually better. Moving data gives you control. You can cache, index, and optimize. You're not dependent on source system performance. But virtualization excels in specific scenarios where you need real-time data from a few sources.
The market for data integration tools is large and fragmented. Fivetran and Airbyte are no-code cloud platforms. They offer hundreds of pre-built connectors. You configure a source, select a warehouse, and data flows automatically. Fivetran is commercial, mature, and used by thousands of organizations. Airbyte is open-source, growing rapidly, and has lower costs. Both are excellent for standard integrations. They remove the burden of building connectors.
Enterprise platforms like Talend and MuleSoft handle integration across the entire organization. They support databases, APIs, applications, and more. They're powerful but complex. They suit large organizations with sophisticated integration needs. Smaller organizations usually find Fivetran or Airbyte sufficient. Custom solutions in Spark or Python are appropriate for unique sources or transformations that pre-built tools don't support. If you need a custom connector or complex logic, code gives you flexibility.
The choice depends on your sources, frequency, and team. Pre-built SaaS connectors suit most organizations. Custom logic suits specific needs. Budget matters. Fivetran costs money per source. Building custom connectors costs engineering time. The cost equation differs for each organization. Most organizations start with a managed tool like Fivetran, then build custom connectors for unique sources as needed.
Data from sources is often dirty. Null values, missing fields, inconsistent formats. Integration pipelines must address this. The transform phase includes data quality checks. You validate that records have required fields. You check that numeric columns are actually numeric. You verify foreign keys exist. If a customer record references an invalid order ID, the pipeline doesn't reject the record. It flags the invalid reference for manual review while loading the rest.
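A sketch of this validate-and-flag approach, with hypothetical field names and rules:

```python
REQUIRED_FIELDS = {"id", "email", "order_id"}

def problems_for(record: dict, valid_order_ids: set) -> list[str]:
    """Return a list of quality problems found in one record."""
    problems = []
    missing = [f for f in REQUIRED_FIELDS if record.get(f) is None]
    if missing:
        problems.append(f"missing required fields: {missing}")
    if record.get("order_id") not in valid_order_ids:
        problems.append(f"unknown order_id: {record.get('order_id')}")
    return problems

def partition(records: list[dict], valid_order_ids: set):
    """Split records into loadable rows and rows flagged for review."""
    clean, flagged = [], []
    for record in records:
        issues = problems_for(record, valid_order_ids)
        if issues:
            flagged.append({**record, "_issues": issues})  # manual review queue
        else:
            clean.append(record)                           # safe to load
    return clean, flagged
```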
Most integration tools include data quality features. You define rules. The tool checks them and logs violations. Some tools include built-in transformations for common cleaning tasks. Converting dates to a standard format. Trimming whitespace. Removing duplicates. These built-in functions reduce custom logic. Schema validation ensures source data matches expectations. If the source adds an unexpected column, you're alerted. If a column type changes, it's caught.
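Done by hand, those common cleaning tasks are short. A sketch, assuming dates arrive as %m/%d/%Y strings and each record carries an id, a name, and a signup_date:

```python
from datetime import datetime

def clean(records: list[dict]) -> list[dict]:
    """Standardize dates, trim whitespace, and drop duplicate ids."""
    seen_ids, out = set(), []
    for r in records:
        if r["id"] in seen_ids:                  # remove duplicates
            continue
        seen_ids.add(r["id"])
        r = dict(r)                              # avoid mutating the input
        r["name"] = r["name"].strip()            # trim whitespace
        r["signup_date"] = (                     # convert dates to ISO-8601
            datetime.strptime(r["signup_date"], "%m/%d/%Y").date().isoformat()
        )
        out.append(r)
    return out

print(clean([{"id": 1, "name": "  Ada ", "signup_date": "03/14/2024"}]))
```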
Monitoring data quality over time is important. A source that was clean might degrade. Manual data entry errors accumulate. Your quality checks catch this. Dashboards show quality metrics over time. Are null values increasing? Is the duplicate rate growing? Trends alert you to problems early. Data quality is often iterative. You discover issues, add checks, and improve quality continuously.
Schema changes break pipelines silently. A source adds a column. Your ETL expects specific columns. The pipeline loads successfully, but the new column is ignored. Weeks later, an analyst needs the new column. It doesn't exist. Monitoring schema changes is critical. Most tools include schema detection and alert on changes. You should review schema changes regularly. Some columns might be required. Others optional. Your pipeline must handle both.
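A sketch of a schema drift check against an expected contract; the expected column map is an assumption for illustration:

```python
# Expected schema for the source table (hypothetical contract).
EXPECTED = {"id": "INTEGER", "full_name": "TEXT", "email": "TEXT"}

def diff_schema(actual: dict[str, str]) -> dict[str, list]:
    """Compare the observed schema against the expected contract."""
    return {
        "added":   [c for c in actual if c not in EXPECTED],
        "removed": [c for c in EXPECTED if c not in actual],
        "retyped": [c for c in actual
                    if c in EXPECTED and actual[c] != EXPECTED[c]],
    }

drift = diff_schema({"id": "INTEGER", "full_name": "TEXT",
                     "email": "TEXT", "signup_source": "TEXT"})
if any(drift.values()):
    print(f"schema drift detected: {drift}")   # alert instead of silently ignoring
```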
Maintaining connectors for hundreds of sources is operationally challenging. Each source evolves. APIs change. Database schemas are altered. Your connectors must adapt. This is why managed services like Fivetran are popular. They maintain connectors for you. When an API changes, Fivetran's engineers update the connector. You don't have to. Custom connectors are your responsibility. You must monitor and update them.
API rate limits are a common bottleneck. A SaaS API might limit you to 100 requests per hour, so pulling a large dataset can take days. Batching requests or requesting increased limits helps. Some APIs offer bulk export endpoints that are faster than sequential API calls. Understanding source capabilities and constraints is important.
Data security and compliance add complexity. Credentials must be stored securely. Data at rest must be encrypted. Some regulations require specific handling of personal data. Your integration pipeline must address these requirements.
Monitoring data completeness is often overlooked. Data fails to load for many reasons. A source is down. Network connectivity is lost. Quotas are exceeded. Your pipeline should alert if data fails to load or is suspiciously incomplete. An analyst might not notice that a source is missing until they try to use it. Alerting catches this immediately. Data freshness is another metric. Is the data you're using actually current? Are there gaps where data failed to load? These metrics should be monitored actively.
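Both checks are cheap to automate. A sketch, assuming a transactions table with an ISO-8601 loaded_at timestamp (including a UTC offset) and a hypothetical row-count threshold:

```python
import sqlite3
from datetime import datetime, timedelta, timezone

def check_load(db_path: str, expected_min_rows: int = 1000) -> list[str]:
    """Return alert messages for incomplete or stale loads."""
    alerts = []
    with sqlite3.connect(db_path) as conn:
        count, latest = conn.execute(
            "SELECT count(*), max(loaded_at) FROM transactions"
        ).fetchone()
    if count < expected_min_rows:      # completeness: suspiciously few rows
        alerts.append(f"only {count} rows loaded, expected >= {expected_min_rows}")
    cutoff = datetime.now(timezone.utc) - timedelta(hours=24)
    if latest is None or datetime.fromisoformat(latest) < cutoff:   # freshness
        alerts.append(f"data is stale: last load at {latest}")
    return alerts
```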
Data integration is combining data from multiple sources into a unified view. A company collects customer data from a CRM system, transaction data from a payment processor, and usage data from a website. Each system stores data in a different format and location. Integration combines them into a single source of truth. An analyst can query customer age from the CRM, purchase history from the payment processor, and website behavior from the usage data, all in one query. Without integration, analysts write separate queries against each system, then manually reconcile the data. Integration automates this.
ETL (extract-transform-load) extracts data from sources, transforms it into a standard format, and loads it into a target system like a data warehouse. The transformation happens outside the target system. ELT (extract-load-transform) loads raw data into the target first, then transforms it there. The transformation happens inside the warehouse where it often runs faster. CDC (change data capture) extracts only changed records since the last run, improving efficiency. API integration connects to sources through their APIs, pulling data on demand. Data virtualization presents multiple sources as a single virtual database without physically moving data. Each pattern has trade-offs. ETL is traditional and well-understood. ELT leverages modern warehouse compute. CDC is efficient for continuous updates. APIs are simple but limited to what the API exposes. Data virtualization is flexible but query performance depends on source system performance.
ETL transforms data outside the target system. You extract data, clean and reshape it using code or tools, then load the transformed data into the warehouse. Transformation happens in the transformation environment, not the warehouse. ELT loads raw data into the warehouse first, then transforms it using SQL. Both accomplish the same goal, but in a different order. ELT became popular with modern cloud warehouses because they're fast at SQL transformations. Loading raw data is simpler than transforming first. Analysts can access raw data while transformations run. ETL requires more upfront transformation, but some teams prefer this because it produces cleaner data entering the warehouse. The choice depends on whether your transformations are simpler in SQL or code.
CDC captures changes to data at the source and replicates them to a target. Instead of extracting all data every time, CDC extracts only changes. A customer updates their phone number. CDC detects this change and sends the update to the warehouse. This is far more efficient than extracting the entire customer table daily. CDC requires the source system to log changes. Most modern databases support CDC. It's typically implemented using database transaction logs or triggers. Kafka Connect includes CDC connectors for many databases. CDC enables real-time or near-real-time replication. As data changes in the source, those changes flow to the target immediately. This is more responsive than daily batch ETL. CDC is becoming more common as organizations need fresher data in their warehouses.
Data virtualization presents multiple data sources as a single virtual database. An analyst writes a query against the virtual view. The system routes parts of the query to different sources, retrieves data, and combines results. The analyst doesn't know or care where data lives. From their perspective, it's one database. Data virtualization avoids physically moving data. No duplication. Easier to keep data fresh because you query live sources. The downside is query performance depends on source system performance. If the CRM is slow, queries are slow. If the payment processor is overloaded, queries time out. Most virtualization tools include caching to mitigate this. Data virtualization is useful when you have a few sources and real-time data is critical. For large-scale analytics, data integration to a warehouse is often better.
API integration connects to data sources through their APIs, pulling data on demand or on a schedule. A SaaS application exposes an API returning customer data. You write a connector that calls the API, retrieves data, and loads it into your warehouse. API integration is simple for sources that don't support database connectors. Most SaaS applications have APIs. API integration works well for slowly changing data. If customer data changes daily, running API calls daily is acceptable. API integration is limited to what the API exposes. If the API doesn't return a particular field, you can't get it. Rate limiting can be an issue. APIs often limit requests per second. Pulling massive amounts of data is slow. API integration is commonly used for SaaS data that doesn't have native connectors.
Fivetran and Airbyte are the dominant no-code data integration platforms. They handle hundreds of data sources with pre-built connectors. You configure a source, pick the warehouse, and data flows. Fivetran is commercial and mature. Airbyte is open-source and growing. Talend and MuleSoft are enterprise iPaaS platforms. They handle integration across the entire organization, including APIs, databases, and applications. They're more powerful but more complex. Stitch (acquired by Talend) and Segment focus on specific use cases. dbt is increasingly used for transformation logic after data loads. Apache NiFi handles complex data flows and routing. Kafka Connect handles CDC and event streaming. The choice depends on your sources, frequency of changes, and technical preferences. No-code tools like Fivetran suit most organizations. Custom solutions in Spark or Python suit organizations with unique requirements.
Schema evolution is when data sources change their structure. A table adds a new column. A field changes type from integer to string. Your integration pipeline must handle these changes. ETL tools include schema detection. They infer the schema of source data and handle new columns automatically. Most modern tools add new columns to the target table transparently. Column type changes are trickier. If a field changes type, you might have incompatibility. Some data lands as an integer, other data as a string. Integration tools typically reject mismatches or convert types. You should monitor schema changes and review them. A column name change in the source might mean you're suddenly pulling from a different field. Schema validation catches this. Most tools support schema registries that version schemas and validate compatibility.
Data quality in integration means ensuring data is accurate, complete, and consistent. Source data might be dirty. A customer record has a null email. An order has a product ID that doesn't exist. Integration pipelines should validate and clean data. Null emails might be rejected or marked for manual review. Invalid product IDs might be flagged. Data quality checks happen during transformation. You verify record counts match expectations. You validate that foreign keys exist. You check that numeric columns are actually numeric. Most integration tools include data quality features. You define rules. The tool checks them and logs violations. Logging violations without failing is important. You don't want to reject valid data because one record is bad. Flag it for review while loading the rest. Data quality is often an iterative process. You discover issues over time and add checks.
Start by asking how fresh your data needs to be. If daily freshness is acceptable, use batch ETL. If real-time is required, use CDC or frequent API polling. Also consider source capabilities. Does the source support CDC? If not, API or batch extraction are your options. Consider data volume. Large datasets are costly to extract frequently; CDC is more efficient than daily full extraction. Consider the complexity of transformation. Simple reshaping works well in ELT. Complex logic might be easier in ETL. Consider cost. ELT into a cheap cloud warehouse is usually cost-effective; heavy API usage might not be. Consider team expertise. If your team knows Python, ETL in Python is natural. If your team knows SQL, ELT in the warehouse is natural. Consider operational burden. No-code tools like Fivetran require less operational effort. Custom solutions require more expertise. Most organizations use multiple patterns. High-volume databases use CDC. SaaS applications use API integration. Slowly changing data uses batch ETL. The right pattern depends on the specific source.