
What Is Change Data Capture?

Definition

Change data capture (CDC) is a technology that captures changes (INSERT, UPDATE, DELETE) made to a database and streams those changes to downstream systems in real time or near-real-time. CDC reads from the database's transaction log, which records every change with exact timing. As applications update the source database, those changes are automatically captured and streamed to other systems like data warehouses, caches, event brokers, or other databases.

CDC enables real-time data replication without waiting for batch jobs to run. Instead of loading all data nightly, CDC pushes changes as they happen. This keeps data fresh and enables real-time analytics, event-driven architectures, and microservices synchronization. CDC inverts the traditional ETL model: instead of pulling data on a schedule, it pushes changes continuously.

The core value of CDC is eliminating latency. Traditional batch ETL jobs run overnight, so data in the warehouse is hours or days old by morning. CDC makes data available in minutes or seconds. This matters for real-time dashboards, fraud detection, and recommendation systems that need current data to work well.

CDC has different flavors: log-based CDC reads transaction logs directly, which is efficient and captures all changes. Query-based CDC runs periodic queries to detect changes, which is simpler but slower and less reliable. The right choice depends on your source database and requirements.

Key Takeaways

  • Change data capture streams INSERT/UPDATE/DELETE operations from source databases to downstream systems in real time, eliminating the latency of batch ETL jobs.
  • Log-based CDC reads database transaction logs directly for efficiency and completeness, while query-based CDC runs periodic queries and is simpler but slower and less reliable.
  • CDC enables real-time analytics, event streaming, data replication, cache invalidation, and microservices synchronization without rebuilding batch infrastructure.
  • Common CDC tools include Debezium (open-source), Fivetran (managed), Airbyte (hybrid), and many others that integrate with streaming platforms like Kafka.
  • CDC challenges include managing ordering, handling schema changes, deduplication, exactly-once delivery, and monitoring lag to ensure downstream systems stay current.
  • CDC is complementary to batch ETL: use CDC for real-time serving and streaming, use batch ETL for historical analysis and aggregations that don't need live freshness.

How CDC Works: Log-Based vs Query-Based

Log-based CDC reads from the database's transaction log. In MySQL, this is the binary log (binlog). In PostgreSQL, it is the write-ahead log (WAL). In Oracle, it is the redo log. These logs record every change made to the database in sequence. CDC systems read these logs, parse the changes, and stream them downstream. This approach is efficient because the log already exists for database recovery. CDC just consumes it. Log-based CDC also captures all changes, even those from applications that bypass normal queries or use bulk operations.

Query-based CDC runs periodic queries to detect changes. A common approach is querying rows where updated_at is newer than the last run time. This requires the source to have timestamp columns and be designed for this pattern. Query-based CDC is simpler to implement because it does not require database-specific log access. But it has drawbacks: it misses changes when the timestamp column is not updated, it misses intermediate states between polls, it rescans tables repeatedly, and it cannot detect deletes unless you use soft-delete patterns.
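
The polling pattern described above can be sketched in a few lines. This is a minimal illustration, not any specific tool's implementation; the table, column names, and watermark scheme are assumptions, and an in-memory SQLite database stands in for the source.

```python
# Query-based CDC sketch: poll for rows whose updated_at is newer than the
# last watermark. Uses an in-memory SQLite database to stay self-contained;
# all names are illustrative.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT, updated_at INTEGER)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                 [(1, "new", 100), (2, "new", 101)])

def poll_changes(conn, last_watermark):
    """Return rows changed since the watermark, plus the new watermark."""
    rows = conn.execute(
        "SELECT id, status, updated_at FROM orders "
        "WHERE updated_at > ? ORDER BY updated_at",
        (last_watermark,),
    ).fetchall()
    new_watermark = rows[-1][2] if rows else last_watermark
    return rows, new_watermark

changes, watermark = poll_changes(conn, 0)           # first poll sees both rows
conn.execute("UPDATE orders SET status = 'shipped', updated_at = 102 WHERE id = 1")
changes2, watermark = poll_changes(conn, watermark)  # next poll sees only the update
# Blind spot: a hard DELETE between polls would never appear in these results.
```

Note how a hard delete simply vanishes between polls, which is exactly the weakness the paragraph above describes.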

Log-based is preferred for production systems. It is efficient, reliable, and captures all changes. Query-based is acceptable for simpler scenarios or databases that don't support log access. Some CDC tools support both, letting you choose based on your database capabilities.

CDC vs Traditional ETL: Different Architectures

Traditional ETL follows a schedule. At 2am, an extraction job runs. It selects all data from the source, transforms it, and loads it into the warehouse. By 8am, when analysts arrive, data from 2am is in the warehouse. Any changes that happened between 2am and 8am are invisible until the next run. This delay is inherent to batch architecture.

CDC inverts this. As applications update the source database, those changes are captured and streamed immediately. A warehouse consumer sees those changes in minutes or seconds. Dashboards show current data. Real-time systems have fresh information. The tradeoff is complexity: CDC systems are more complex to implement and operate than batch jobs.

The two are complementary. CDC provides fresh updates for real-time systems. Batch ETL provides aggregations and transformations that are expensive to do in real-time. Many modern data stacks use both: CDC feeds raw changes to a warehouse in real-time. Batch jobs run nightly to aggregate and compute things that do not need live freshness. This hybrid approach gets benefits of both without committing to purely streaming architecture.

Log-Based CDC Internals: Understanding Transaction Logs

Database transaction logs record every change made to the database. When you INSERT a row, the INSERT is logged. When you UPDATE a row, the UPDATE (before and after values) is logged. When you DELETE a row, the DELETE is logged. These logs are written sequentially and contain enough information to recover the database if it crashes.
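
A change event carrying before and after row images can be replayed to reconstruct table state. The envelope below is a simplified sketch loosely modeled on what log-based CDC tools emit; the field names ("op", "before", "after") are illustrative assumptions, not any specific tool's format.

```python
# Simplified change events with before/after row images, replayed in order
# to rebuild an in-memory copy of the table. Field names are illustrative.
events = [
    {"op": "insert", "key": 1, "before": None,
     "after": {"id": 1, "status": "new"}},
    {"op": "update", "key": 1, "before": {"id": 1, "status": "new"},
     "after": {"id": 1, "status": "shipped"}},
    {"op": "delete", "key": 1, "before": {"id": 1, "status": "shipped"},
     "after": None},
]

def apply_event(state, event):
    """Replay one change event against a dict keyed by primary key."""
    if event["op"] == "delete":
        state.pop(event["key"], None)
    else:  # insert and update both install the after-image
        state[event["key"]] = event["after"]
    return state

table = {}
for e in events[:2]:
    apply_event(table, e)
# After insert + update, the row reflects the latest after-image.
```

Replaying the full event list, including the delete, leaves the table empty, which is why an ordered, complete change log is sufficient to recover any state.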

CDC systems position themselves as database replicas. They read the transaction log just as a read replica would. As the log advances, CDC captures changes. Some CDC tools read the transaction log files directly. Others use the database's replication protocol (such as MySQL replication or PostgreSQL replication slots), which abstracts log access and leaves the database to handle the complexity of managing the log.

The advantage is that CDC sees all changes in order. A transaction that updates multiple tables is logged atomically. CDC can stream those changes together. This preserves logical consistency. Query-based CDC cannot guarantee this because it queries tables at different times.

The challenge is database-specific log formats. PostgreSQL's WAL format is different from MySQL's binlog format. Oracle's redo logs are different again. CDC tools must handle these format differences. This is why many CDC tools support specific databases well and others less well.

CDC Tools and Platforms

Debezium is an open-source CDC platform maintained by Red Hat. It supports many databases including MySQL, PostgreSQL, MongoDB, Oracle, SQL Server, and others. Debezium publishes changes to Kafka, so downstream systems can consume them. Debezium handles database-specific log formats and shields applications from complexity. For teams running Kafka internally, Debezium is a natural choice.

Fivetran offers managed CDC connectors with little configuration. You authenticate to your source, Fivetran handles everything. They maintain and support the connectors. Fivetran can load to many destinations (data warehouses, data lakes, applications). The tradeoff is cost: managed services cost more than open-source.

Airbyte is an open-source data integration platform with growing CDC support. It offers both open-source and managed versions. Airbyte is more general-purpose than Debezium, supporting both batch and streaming. Airbyte can load to many destinations, making it a complete solution.

Many other CDC tools exist including Stitch (managed), Maxwell (MySQL-specific), and native CDC support in databases like SQL Server and Oracle. The choice depends on your source databases, destinations, and infrastructure preferences.

CDC Use Cases: When to Use It

Real-time analytics is a primary use case. Stream changes from your operational database to your data warehouse. Dashboards always show current data. Reports reflect reality within minutes rather than showing yesterday's snapshot. This matters for dashboards monitoring key metrics where staleness causes poor decisions.

Event sourcing uses CDC to capture all changes as immutable events. These events form an event log that can be replayed to rebuild any past state. If something goes wrong, you can replay events to fix it. This architecture is powerful for systems where you need complete audit trails.

Data replication keeps multiple systems synchronized without full copies. Replicate your production database to a read replica, cache, or search index. When data changes, CDC streams those changes downstream. Both systems stay synchronized automatically. This is more efficient than periodic full copies.

Microservices synchronization uses CDC to keep services updated when data changes. Service A's database changes, CDC streams those changes to Service B's event bus, Service B consumes the events and updates its own database. Services stay loosely coupled but synchronized.

Compliance and audit logs use CDC to maintain immutable records of all data changes. Regulatory requirements might demand knowing exactly when and how data changed. CDC provides this audit trail automatically.

CDC Challenges and How to Handle Them

Exactly-once delivery is hard. Network failures can cause CDC messages to be sent multiple times or skipped. Downstream systems must be idempotent: applying the same change twice produces the same result as applying it once. This requires careful design. Using change timestamps or sequence numbers helps detect duplicates.
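
One common way to get the idempotence described above is to track the highest sequence number applied per key and skip anything at or below it. This is a minimal sketch under assumed field names ("key", "seq", "value"); real systems would persist the applied-sequence map durably.

```python
# Idempotent apply sketch: each change carries a per-key sequence number;
# the consumer skips anything already seen, so redelivered messages are
# harmless. Field names are illustrative assumptions.
def apply_idempotent(state, applied_seq, event):
    key, seq = event["key"], event["seq"]
    if seq <= applied_seq.get(key, -1):
        return False                     # duplicate or stale: skip it
    state[key] = event["value"]          # apply the change
    applied_seq[key] = seq               # remember the high-water mark
    return True

state, applied = {}, {}
e = {"key": "user:1", "seq": 7, "value": {"email": "a@example.com"}}
first = apply_idempotent(state, applied, e)   # applied
second = apply_idempotent(state, applied, e)  # redelivery: skipped
```

Applying the same message twice leaves the state unchanged, which is the property downstream systems need when delivery is at-least-once.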

Ordering is critical. If row X is updated twice, the changes must be applied in order or the final state is wrong. Most CDC systems partition changes by primary key: all changes to the same row go to the same partition or topic, ensuring order is preserved. But if you have a distributed downstream system, ensuring order across all updates is harder.
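
The key-to-partition mapping above can be sketched as a stable hash modulo the partition count. This mirrors Kafka's default key-based partitioning in spirit, though Kafka itself uses murmur2; the hash choice and partition count here are illustrative assumptions.

```python
# Partitioning sketch: hash the primary key onto a fixed number of
# partitions so every change to a given row lands in the same partition
# and is consumed in order.
import zlib

NUM_PARTITIONS = 8

def partition_for(primary_key: str) -> int:
    # crc32 is stable across processes and runs, unlike Python's built-in
    # hash(), which is randomized per interpreter session.
    return zlib.crc32(primary_key.encode()) % NUM_PARTITIONS
```

Because the mapping is deterministic, two updates to "order:42" always go to the same partition, but two different keys may land anywhere, which is why cross-row ordering is not guaranteed.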

Large initial snapshots are slow. Capturing all existing data before streaming changes is time-consuming for large tables. Most CDC tools batch snapshot work to speed it up. Some use parallel extraction. But for very large tables, snapshot time can be hours or days.

Schema evolution is complex. If the source schema changes (add column, drop column, change type), CDC systems must detect this and communicate it downstream. Downstream systems must handle schema changes. This requires coordination and testing.

Monitoring CDC lag is essential. If CDC falls behind, the message queue grows. Downstream systems show stale data. If lag exceeds thresholds, alert and investigate. Lag can indicate capacity problems, downstream slowness, or upstream issues. Understanding lag helps you size infrastructure correctly.
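
A basic lag check compares the source commit timestamp of the most recently applied event with the current time. The threshold and function names below are illustrative assumptions, not from any monitoring product.

```python
# Lag-monitoring sketch: lag = now minus the commit timestamp of the last
# event the consumer applied; alert once it crosses a threshold.
import time

LAG_ALERT_SECONDS = 300  # e.g. alert when more than 5 minutes behind

def cdc_lag_seconds(last_applied_commit_ts, now=None):
    now = time.time() if now is None else now
    return max(0.0, now - last_applied_commit_ts)

def should_alert(lag_seconds):
    return lag_seconds > LAG_ALERT_SECONDS

lag = cdc_lag_seconds(1_000.0, now=1_400.0)  # consumer is 400 s behind
```

Trending this number over time also reveals capacity problems: steadily growing lag means the consumer cannot keep up with the change rate.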

Operational Complexity of CDC Systems

CDC systems add operational complexity. You now have another system to monitor and maintain. Debezium connectors can fail. Streaming platforms can run out of capacity. Downstream systems can lag behind. You need monitoring dashboards for CDC lag, failure rates, and message throughput. You need runbooks for common failure modes.

Schema changes require coordination. When the source schema changes, downstream systems must adapt. If they don't, data might be truncated or rejected. You need processes to detect schema changes, test impacts downstream, and roll out changes safely. This coordination is more complex than batch ETL where you test everything in development before deploying.

Debugging is harder. With batch ETL, you run a job, see the output, debug issues. With CDC, you have a continuous stream. If something goes wrong, you might have processed millions of messages incorrectly before noticing. You need comprehensive monitoring and testing to catch issues early.

Cost can be significant. CDC requires a streaming platform (Kafka is popular but not free). It requires compute to run CDC connectors. It requires capacity to handle peak change volume. For low-volume data, batch ETL is cheaper. For high-volume, high-frequency changes, CDC can be expensive unless you optimize carefully.

Finally, CDC is most valuable when downstream systems actually consume the changes in real-time. If you capture changes via CDC but load them in batch, you get the operational complexity of CDC without the real-time benefits. Make sure your use cases actually need real-time before committing to CDC.

Best Practices

  • Partition CDC messages by primary key to ensure that all changes to the same row are processed in order, preventing inconsistency from out-of-order updates.
  • Design downstream systems to be idempotent so applying the same CDC message twice produces the correct result, handling duplicates gracefully.
  • Monitor CDC lag continuously and alert if it exceeds thresholds, ensuring downstream systems stay current and detecting capacity or performance problems early.
  • Test schema changes in development before applying to production to verify that CDC systems and downstream consumers handle changes correctly without breaking.
  • Use CDC for real-time requirements where freshness matters (dashboards, fraud detection, recommendations) and batch ETL for historical analysis and aggregations that don't need live updates.

Common Misconceptions

  • CDC is a replacement for ETL, when in fact they serve different purposes and many modern data stacks use both for different use cases.
  • Implementing CDC means all data is immediately available in real-time, when in fact CDC requires complete infrastructure (streaming platforms, monitoring, downstream consumers) to deliver value.
  • Query-based CDC is sufficient if your database has timestamp columns, when it misses deletes, requires constant querying, and is inefficient compared to log-based approaches.
  • Once CDC is set up, it requires no monitoring because CDC systems are reliable, when in fact CDC systems can fall behind, connectors can fail, and careful operational oversight is essential.
  • CDC captures changes completely and correctly automatically, when ordering, deduplication, schema changes, and late-arriving data all require careful handling in downstream systems.

Frequently Asked Questions (FAQs)

What is change data capture (CDC) and how does it work?

Change data capture is a technique for capturing changes (INSERT, UPDATE, DELETE) made to a database and streaming those changes downstream to other systems. CDC reads from a database's transaction log, which records every change. As applications update the database, those changes are captured and streamed to downstream systems like data warehouses, caches, or other databases in real time or near-real-time.

CDC enables real-time data replication, event streaming, and analytics on fresh data without waiting for batch jobs to run. The core benefit is eliminating latency. Instead of loading all data nightly, CDC pushes changes as they happen. This makes data available in minutes or seconds rather than hours or days.

CDC systems read database transaction logs which are designed for recovery and already exist. CDC just consumes them. This is efficient and captures all changes, even those from bulk operations or applications that bypass normal queries.

What is the difference between log-based CDC and query-based CDC?

Log-based CDC reads from the database transaction log (binlog in MySQL, WAL in PostgreSQL, redo logs in Oracle). The log records every change with exact timing. This is efficient because the log already exists for database recovery. CDC just reads it. Log-based CDC captures all changes, including those from applications that bypass normal queries.

Query-based CDC runs periodic queries (like 'SELECT * FROM some_table WHERE updated_at > last_run') to detect changes. This is simpler to implement but misses changes, requires timestamp columns, and is inefficient because it queries the entire table repeatedly. Log-based is preferred for mission-critical systems. Query-based is good for simpler scenarios or databases that don't support log access.

The performance difference is substantial. Log-based CDC adds minimal load to the database. Query-based CDC forces periodic full table scans. For large tables, query-based CDC becomes prohibitively expensive.

How does CDC compare to traditional ETL?

Traditional ETL runs on a schedule: every night, extract all data from the source, transform it, load it into the warehouse. Data is stale until the next run. CDC captures changes continuously as they happen and pushes them downstream. This creates real-time or near-real-time data without waiting for batch jobs.

ETL is simpler to implement and debug because runs are discrete and repeatable. CDC is more complex but enables real-time analytics and replication. Many modern data stacks use both: CDC for real-time updates and ETL for periodic aggregations and transformations. This hybrid approach gets benefits of both architectures.

Choose CDC for use cases where freshness matters: real-time dashboards, fraud detection, recommendations. Choose ETL for historical analysis and aggregations that don't need live freshness. Both have their place in a modern data stack.

What are common CDC tools and platforms?

Debezium is an open-source CDC platform that supports many databases (MySQL, PostgreSQL, MongoDB, Oracle, SQL Server). It publishes changes to Kafka so downstream systems can consume them. Fivetran offers managed CDC connectors for many sources with minimal configuration. Airbyte has open-source and managed CDC support. Stitch (owned by Talend) is another managed data pipeline service. Maxwell is an open-source tool for MySQL CDC.

The choice depends on your source databases, infrastructure preferences (managed vs self-hosted), and downstream systems. Debezium is a natural choice if you are running Kafka internally. Fivetran is good if you want a managed solution and don't mind the cost. Airbyte is a good middle ground: open-source with managed options and many destination integrations.

Many databases also have native CDC support, like SQL Server Change Data Capture and Oracle GoldenGate. If your database has native support, that might be simpler than external tools.

What are the main use cases for CDC?

Real-time analytics: stream changes to a data warehouse so dashboards show current data, not stale snapshots from last night. Event sourcing: capture all changes as events and replay them to rebuild state or drive business logic. Data replication: keep multiple databases or systems synchronized without full copies. Cache invalidation: when source data changes, invalidate caches immediately so stale data is not served.
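
The cache-invalidation pattern mentioned above is small enough to sketch directly: when a change event arrives for a row, evict the corresponding cache entry so the next read repopulates it from the source. The key scheme and event shape are illustrative assumptions.

```python
# Cache-invalidation sketch: a CDC consumer that evicts cache entries for
# rows that changed, rather than trying to patch cached values in place.
cache = {"user:1": {"name": "Ada"}, "user:2": {"name": "Grace"}}

def on_change(event, cache):
    """Evict the cache entry for the changed row, if present."""
    cache.pop(f"user:{event['key']}", None)

on_change({"key": 1, "op": "update"}, cache)  # user:1 is evicted; user:2 stays
```

Evicting rather than updating keeps the consumer simple and avoids serving a cached value assembled from a partially applied change.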

Microservices sync: when data changes in one service, keep other services updated in real time. Compliance and audit: maintain immutable logs of all data changes for regulatory requirements. CDC enables all these patterns without building custom change-tracking logic into your applications.

The key is understanding when real-time matters. If your business works fine with daily data updates, batch ETL is sufficient and cheaper. If some users need real-time, add CDC for those use cases. Evaluate your use cases carefully before committing to CDC infrastructure.

What challenges does CDC introduce?

Exactly-once delivery is hard. Network failures can cause duplicate messages or missed changes. You need idempotent downstream systems or deduplication logic. Ordering can be tricky. If multiple changes happen to the same row, you must apply them in order or the final state is wrong. Large initial snapshots are slow. Capturing all existing data before streaming changes is time-consuming.

Schema evolution is complex. If the source schema changes, CDC systems must adapt. Downstream systems must handle schema changes. Late-arriving data is possible. A change might be captured late due to transaction timing or network delays. Downstream systems need to handle late data gracefully. Monitoring lag is essential. If CDC falls behind, the message queue grows and downstream systems show stale data.

Each of these challenges is solvable but requires thoughtful design. Understanding them upfront helps you build robust CDC systems.

How do you ensure CDC messages are processed in the correct order?

Many CDC tools partition changes by primary key: all changes to the same row go to the same partition. Partitions are processed in order. This ensures that if the same row is updated multiple times, the changes are applied in sequence. Downstream systems must respect this ordering when applying changes; processing a row's changes out of order produces an incorrect final state.

Kafka and similar systems preserve order within a partition, so using partitioning is the standard solution. Each row's primary key maps to a partition. All changes to that row go to that partition. Consumers process partitions sequentially. This guarantees order for each row.

The tradeoff is that you lose ordering across rows: a change to row 1 might be processed after a change to row 2 even if it happened first. For most use cases, per-row ordering is sufficient. If you need global ordering, you must sacrifice parallelism.

What is the difference between snapshot and incremental CDC?

Snapshot CDC captures the entire current state of a table as a starting point. Then it streams changes incrementally. This is necessary because downstream systems need the full data before incremental changes make sense. Incremental CDC alone would miss the initial data. Setting up snapshots is time-consuming for large tables, which is why many CDC tools batch snapshot work.

They might snapshot a large table in parallel chunks to speed things up. For a 100GB table, they might split it into 10 parallel jobs instead of one sequential job. After the snapshot, incremental changes are streamed continuously. This approach balances speed with resource usage.

The snapshot and incremental phases must be coordinated. You capture a snapshot at a specific point in time. Then you stream changes from that point forward. Any changes that happened during snapshot are included exactly once, not duplicated.
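
The coordination described above hinges on recording the log position at which the snapshot was taken and streaming only changes past that position. The sketch below assumes a simple monotonic position counter; real systems use the database's own LSN or binlog coordinates.

```python
# Snapshot/incremental coordination sketch: snapshot the state as of a known
# log position, then stream only changes with a later position, so changes
# committed during the snapshot are applied exactly once. All values are
# illustrative.
source_log = [
    {"pos": 1, "key": "a", "value": 1},
    {"pos": 2, "key": "b", "value": 2},
    {"pos": 3, "key": "a", "value": 10},  # committed after the snapshot point
]

snapshot_pos = 2
snapshot = {"a": 1, "b": 2}               # table state as of position 2

state = dict(snapshot)
for event in source_log:
    if event["pos"] > snapshot_pos:       # skip what the snapshot already contains
        state[event["key"]] = event["value"]
```

The position filter is what prevents double-applying changes the snapshot already reflects, and it is also the resume point if the stream is restarted.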

How do you handle schema changes in CDC systems?

Schema changes (adding columns, changing types, dropping columns) must be handled gracefully. Log-based CDC captures schema change events from the transaction log. Downstream systems must be aware of these changes and adapt. Some CDC tools automate schema evolution: they detect schema changes and communicate them to Kafka schemas or downstream systems.

Others require manual intervention. The safest approach is to test schema changes in development before applying to production, and to have a strategy for backward and forward compatibility. If you add a new optional column, it is backward compatible. If you drop a required column, you must update downstream systems that use it.
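
One way to make a consumer tolerant of the backward-compatible case above (a new optional column) is to normalize each incoming row against an expected schema with defaults. The column names and defaults below are illustrative assumptions, not a standard mechanism.

```python
# Backward-compatibility sketch: a consumer that fills defaults for missing
# optional columns and drops columns it does not know, instead of rejecting
# the event when the upstream schema changes.
EXPECTED_COLUMNS = {"id": None, "status": "unknown", "priority": "normal"}

def normalize(row):
    """Project a row onto the expected schema, filling defaults."""
    return {col: row.get(col, default) for col, default in EXPECTED_COLUMNS.items()}

old_row = normalize({"id": 1, "status": "new"})                      # pre-change event
new_row = normalize({"id": 2, "status": "new", "priority": "high"})  # post-change event
```

This handles additive changes gracefully; a dropped required column still needs the coordinated rollout the surrounding text describes.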

Many teams use schema registries (like Confluent Schema Registry) to manage schema evolution. The schema registry tracks all versions and enforces compatibility rules. This prevents downstream systems from breaking when schema changes.

What is the cost of running CDC systems?

CDC has three cost components. Database load: CDC reads from transaction logs, which adds some load to the source database. This is usually small, but can be significant if the database is already stressed. Streaming platform cost: capturing changes requires a streaming platform like Kafka. Larger volumes mean larger clusters.

Downstream compute: applying changes to multiple systems costs compute. Real-time replication to five destinations costs 5x more than batch loading to one. The total cost depends on data volume, change frequency, and number of downstream systems. For high-volume, high-frequency changes to many destinations, CDC can be expensive. For low-volume data, batch ETL is cheaper and simpler.

Cost optimization requires sizing infrastructure for peak volume and monitoring utilization. If CDC systems are running at 10% capacity, you are over-provisioned. If they are at 90% and lagging, you are under-provisioned. Regular capacity reviews help optimize cost.

How do you choose between CDC and batch ETL?

Use CDC for real-time requirements: fraud detection needs current customer data. Recommendation engines need recent user behavior. Live dashboards need current metrics. Use batch ETL for historical analysis, data science, and aggregations that do not need real-time freshness. Use both together: CDC for real-time serving, batch ETL for overnight aggregations and historical snapshots.

Many teams start with batch ETL because it is simpler, then add CDC later when real-time requirements emerge. Evaluate your use cases and user expectations. If most users can wait until morning, batch ETL is sufficient and cheaper. If some users need real-time data, add CDC for those use cases. This incremental approach avoids over-engineering for theoretical future needs.

The hybrid approach is most common in modern data stacks: CDC provides fresh updates for real-time systems. Batch jobs run nightly to aggregate and compute things that don't need live freshness. This hybrid approach gets benefits of both without committing to purely streaming architecture.

What happens if CDC falls behind and messages pile up?

If CDC cannot keep up with change volume, messages queue up. A Kafka topic might accumulate millions of messages. If the backlog is large, it takes a long time to catch up. During this period, downstream systems show stale data. If the queue grows unbounded, it might run out of storage or memory.

You must monitor CDC lag (how far behind real time it is) and alert if it exceeds thresholds. A lag of 5 minutes might be acceptable. A lag of 5 hours means downstream systems are stale. If CDC falls behind, you might need to increase streaming platform capacity, optimize downstream processing, or reduce the freshness requirements.

For critical systems, you need enough capacity that CDC never falls behind. This requires capacity planning based on peak change volume. Size your streaming platform to handle 2-3x typical load. Monitor utilization closely and increase capacity before hitting limits.
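
The 2-3x headroom rule above amounts to simple arithmetic; this sketch just makes it explicit, with an illustrative peak rate and headroom factor.

```python
# Capacity-sizing sketch: provision streaming capacity as a multiple of the
# observed peak change rate, per the 2-3x headroom guideline. Numbers are
# illustrative.
def required_throughput(peak_changes_per_sec, headroom=2.5):
    """Changes/sec the streaming platform should sustain without lagging."""
    return peak_changes_per_sec * headroom

capacity = required_throughput(4_000)  # provision for 10,000 changes/sec
```

Reviewing actual utilization against this number periodically catches both the over-provisioned (10% busy) and under-provisioned (90% busy and lagging) cases described above.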