
What Is Change Data Capture?

Definition

Change data capture (CDC) is a technology that captures changes (INSERT, UPDATE, DELETE) made to a database and streams those changes to downstream systems in real time or near-real-time. CDC reads from the database's transaction log, which records every change with exact timing. As applications update the source database, those changes are automatically captured and streamed to other systems like data warehouses, caches, event brokers, or other databases.

CDC enables real-time data replication without waiting for batch jobs to run. Instead of loading all data nightly, CDC pushes changes as they happen. This keeps data fresh and enables real-time analytics, event-driven architectures, and microservices synchronization. CDC inverts the traditional ETL model: instead of pulling data on a schedule, it pushes changes continuously.

The core value of CDC is eliminating latency. Traditional batch ETL jobs run overnight, so data in the warehouse is hours or days old by morning. CDC makes data available in minutes or seconds. This matters for real-time dashboards, fraud detection, and recommendation systems that need current data to work well.

CDC has different flavors: log-based CDC reads transaction logs directly, which is efficient and captures all changes. Query-based CDC runs periodic queries to detect changes, which is simpler but slower and less reliable. The right choice depends on your source database and requirements.

Key Takeaways

  • Change data capture streams INSERT/UPDATE/DELETE operations from source databases to downstream systems in real time, eliminating the latency of batch ETL jobs.
  • Log-based CDC reads database transaction logs directly for efficiency and completeness, while query-based CDC runs periodic queries and is simpler but slower and less reliable.
  • CDC enables real-time analytics, event streaming, data replication, cache invalidation, and microservices synchronization without rebuilding batch infrastructure.
  • Common CDC tools include Debezium (open-source), Fivetran (managed), Airbyte (hybrid), and many others that integrate with streaming platforms like Kafka.
  • CDC challenges include managing ordering, handling schema changes, deduplication, exactly-once delivery, and monitoring lag to ensure downstream systems stay current.
  • CDC is complementary to batch ETL: use CDC for real-time serving and streaming, use batch ETL for historical analysis and aggregations that don't need live freshness.

How CDC Works: Log-Based vs Query-Based

Log-based CDC reads from the database's transaction log. In MySQL, this is the binary log (binlog). In PostgreSQL, it is the write-ahead log (WAL). In Oracle, it is the redo log. These logs record every change made to the database in sequence. CDC systems read these logs, parse the changes, and stream them downstream. This approach is efficient because the log already exists for database recovery. CDC just consumes it. Log-based CDC also captures all changes, even those from applications that bypass normal queries or use bulk operations.

Query-based CDC runs periodic queries to detect changes. A common approach is querying rows where updated_at is newer than the last run time. This requires the source to have timestamp columns and be designed for this pattern. Query-based CDC is simpler to implement because it does not require database-specific log access. But it has drawbacks: it misses changes when the timestamp column is not updated, it misses intermediate states between polls, it rescans tables repeatedly, and it cannot detect deletes unless you use soft-delete patterns.
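
The polling pattern described above can be sketched in a few lines. This is a minimal illustration, not any specific tool's implementation; the table, column names, and watermark scheme are assumptions, and an in-memory SQLite database stands in for the source.

```python
# Query-based CDC sketch: poll for rows whose updated_at is newer than the
# last watermark. Uses an in-memory SQLite database to stay self-contained;
# all names are illustrative.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT, updated_at INTEGER)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                 [(1, "new", 100), (2, "new", 101)])

def poll_changes(conn, last_watermark):
    """Return rows changed since the watermark, plus the new watermark."""
    rows = conn.execute(
        "SELECT id, status, updated_at FROM orders "
        "WHERE updated_at > ? ORDER BY updated_at",
        (last_watermark,),
    ).fetchall()
    new_watermark = rows[-1][2] if rows else last_watermark
    return rows, new_watermark

changes, watermark = poll_changes(conn, 0)           # first poll sees both rows
conn.execute("UPDATE orders SET status = 'shipped', updated_at = 102 WHERE id = 1")
changes2, watermark = poll_changes(conn, watermark)  # next poll sees only the update
# Blind spot: a hard DELETE between polls would never appear in these results.
```

Note how a hard delete simply vanishes between polls, which is exactly the weakness the paragraph above describes.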

Log-based is preferred for production systems. It is efficient, reliable, and captures all changes. Query-based is acceptable for simpler scenarios or databases that don't support log access. Some CDC tools support both, letting you choose based on your database capabilities.

CDC vs Traditional ETL: Different Architectures

Traditional ETL follows a schedule. At 2am, an extraction job runs. It selects all data from the source, transforms it, and loads it into the warehouse. By 8am, when analysts arrive, data from 2am is in the warehouse. Any changes that happened between 2am and 8am are invisible until the next run. This delay is inherent to batch architecture.

CDC inverts this. As applications update the source database, those changes are captured and streamed immediately. A warehouse consumer sees those changes in minutes or seconds. Dashboards show current data. Real-time systems have fresh information. The tradeoff is complexity: CDC systems are more complex to implement and operate than batch jobs.

The two are complementary. CDC provides fresh updates for real-time systems. Batch ETL provides aggregations and transformations that are expensive to do in real-time. Many modern data stacks use both: CDC feeds raw changes to a warehouse in real-time. Batch jobs run nightly to aggregate and compute things that do not need live freshness. This hybrid approach gets benefits of both without committing to purely streaming architecture.

Log-Based CDC Internals: Understanding Transaction Logs

Database transaction logs record every change made to the database. When you INSERT a row, the INSERT is logged. When you UPDATE a row, the UPDATE (before and after values) is logged. When you DELETE a row, the DELETE is logged. These logs are written sequentially and contain enough information to recover the database if it crashes.
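
A change event carrying before and after row images can be replayed to reconstruct table state. The envelope below is a simplified sketch loosely modeled on what log-based CDC tools emit; the field names ("op", "before", "after") are illustrative assumptions, not any specific tool's format.

```python
# Simplified change events with before/after row images, replayed in order
# to rebuild an in-memory copy of the table. Field names are illustrative.
events = [
    {"op": "insert", "key": 1, "before": None,
     "after": {"id": 1, "status": "new"}},
    {"op": "update", "key": 1, "before": {"id": 1, "status": "new"},
     "after": {"id": 1, "status": "shipped"}},
    {"op": "delete", "key": 1, "before": {"id": 1, "status": "shipped"},
     "after": None},
]

def apply_event(state, event):
    """Replay one change event against a dict keyed by primary key."""
    if event["op"] == "delete":
        state.pop(event["key"], None)
    else:  # insert and update both install the after-image
        state[event["key"]] = event["after"]
    return state

table = {}
for e in events[:2]:
    apply_event(table, e)
# After insert + update, the row reflects the latest after-image.
```

Replaying the full event list, including the delete, leaves the table empty, which is why an ordered, complete change log is sufficient to recover any state.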

CDC systems position themselves as database replicas. They read the transaction log just as a read replica would. As the log advances, CDC captures changes. Some CDC tools read the transaction log files directly. Others use the database's replication protocol (such as MySQL replication or PostgreSQL replication slots), which abstracts log access and leaves the database to handle the complexity of managing the log.

The advantage is that CDC sees all changes in order. A transaction that updates multiple tables is logged atomically. CDC can stream those changes together. This preserves logical consistency. Query-based CDC cannot guarantee this because it queries tables at different times.

The challenge is database-specific log formats. PostgreSQL's WAL format is different from MySQL's binlog format. Oracle's redo logs are different again. CDC tools must handle these format differences. This is why many CDC tools support specific databases well and others less well.

CDC Tools and Platforms

Debezium is an open-source CDC platform maintained by Red Hat. It supports many databases including MySQL, PostgreSQL, MongoDB, Oracle, SQL Server, and others. Debezium publishes changes to Kafka, so downstream systems can consume them. Debezium handles database-specific log formats and shields applications from complexity. For teams running Kafka internally, Debezium is a natural choice.

Fivetran offers managed CDC connectors with little configuration. You authenticate to your source, Fivetran handles everything. They maintain and support the connectors. Fivetran can load to many destinations (data warehouses, data lakes, applications). The tradeoff is cost: managed services cost more than open-source.

Airbyte is an open-source data integration platform with growing CDC support. It offers both open-source and managed versions. Airbyte is more general-purpose than Debezium, supporting both batch and streaming. Airbyte can load to many destinations, making it a complete solution.

Many other CDC tools exist including Stitch (managed), Maxwell (MySQL-specific), and native CDC support in databases like SQL Server and Oracle. The choice depends on your source databases, destinations, and infrastructure preferences.

CDC Use Cases: When to Use It

Real-time analytics is a primary use case. Stream changes from your operational database to your data warehouse. Dashboards always show current data. Reports reflect reality within minutes rather than showing yesterday's snapshot. This matters for dashboards monitoring key metrics where staleness causes poor decisions.

Event sourcing uses CDC to capture all changes as immutable events. These events form an event log that can be replayed to rebuild any past state. If something goes wrong, you can replay events to fix it. This architecture is powerful for systems where you need complete audit trails.

Data replication keeps multiple systems synchronized without full copies. Replicate your production database to a read replica, cache, or search index. When data changes, CDC streams those changes downstream. Both systems stay synchronized automatically. This is more efficient than periodic full copies.

Microservices synchronization uses CDC to keep services updated when data changes. Service A's database changes, CDC streams those changes to Service B's event bus, Service B consumes the events and updates its own database. Services stay loosely coupled but synchronized.

Compliance and audit logs use CDC to maintain immutable records of all data changes. Regulatory requirements might demand knowing exactly when and how data changed. CDC provides this audit trail automatically.

CDC Challenges and How to Handle Them

Exactly-once delivery is hard. Network failures can cause CDC messages to be sent multiple times or skipped. Downstream systems must be idempotent: applying the same change twice produces the same result as applying it once. This requires careful design. Using change timestamps or sequence numbers helps detect duplicates.
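
One common way to get the idempotence described above is to track the highest sequence number applied per key and skip anything at or below it. This is a minimal sketch under assumed field names ("key", "seq", "value"); real systems would persist the applied-sequence map durably.

```python
# Idempotent apply sketch: each change carries a per-key sequence number;
# the consumer skips anything already seen, so redelivered messages are
# harmless. Field names are illustrative assumptions.
def apply_idempotent(state, applied_seq, event):
    key, seq = event["key"], event["seq"]
    if seq <= applied_seq.get(key, -1):
        return False                     # duplicate or stale: skip it
    state[key] = event["value"]          # apply the change
    applied_seq[key] = seq               # remember the high-water mark
    return True

state, applied = {}, {}
e = {"key": "user:1", "seq": 7, "value": {"email": "a@example.com"}}
first = apply_idempotent(state, applied, e)   # applied
second = apply_idempotent(state, applied, e)  # redelivery: skipped
```

Applying the same message twice leaves the state unchanged, which is the property downstream systems need when delivery is at-least-once.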

Ordering is critical. If row X is updated twice, the changes must be applied in order or the final state is wrong. Most CDC systems partition changes by primary key: all changes to the same row go to the same partition or topic, ensuring order is preserved. But if you have a distributed downstream system, ensuring order across all updates is harder.
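
The key-to-partition mapping above can be sketched as a stable hash modulo the partition count. This mirrors Kafka's default key-based partitioning in spirit, though Kafka itself uses murmur2; the hash choice and partition count here are illustrative assumptions.

```python
# Partitioning sketch: hash the primary key onto a fixed number of
# partitions so every change to a given row lands in the same partition
# and is consumed in order.
import zlib

NUM_PARTITIONS = 8

def partition_for(primary_key: str) -> int:
    # crc32 is stable across processes and runs, unlike Python's built-in
    # hash(), which is randomized per interpreter session.
    return zlib.crc32(primary_key.encode()) % NUM_PARTITIONS
```

Because the mapping is deterministic, two updates to "order:42" always go to the same partition, but two different keys may land anywhere, which is why cross-row ordering is not guaranteed.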

Large initial snapshots are slow. Capturing all existing data before streaming changes is time-consuming for large tables. Most CDC tools batch snapshot work to speed it up. Some use parallel extraction. But for very large tables, snapshot time can be hours or days.

Schema evolution is complex. If the source schema changes (add column, drop column, change type), CDC systems must detect this and communicate it downstream. Downstream systems must handle schema changes. This requires coordination and testing.

Monitoring CDC lag is essential. If CDC falls behind, the message queue grows. Downstream systems show stale data. If lag exceeds thresholds, alert and investigate. Lag can indicate capacity problems, downstream slowness, or upstream issues. Understanding lag helps you size infrastructure correctly.
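
A basic lag check compares the source commit timestamp of the most recently applied event with the current time. The threshold and function names below are illustrative assumptions, not from any monitoring product.

```python
# Lag-monitoring sketch: lag = now minus the commit timestamp of the last
# event the consumer applied; alert once it crosses a threshold.
import time

LAG_ALERT_SECONDS = 300  # e.g. alert when more than 5 minutes behind

def cdc_lag_seconds(last_applied_commit_ts, now=None):
    now = time.time() if now is None else now
    return max(0.0, now - last_applied_commit_ts)

def should_alert(lag_seconds):
    return lag_seconds > LAG_ALERT_SECONDS

lag = cdc_lag_seconds(1_000.0, now=1_400.0)  # consumer is 400 s behind
```

Trending this number over time also reveals capacity problems: steadily growing lag means the consumer cannot keep up with the change rate.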

Operational Complexity of CDC Systems

CDC systems add operational complexity. You now have another system to monitor and maintain. Debezium connectors can fail. Streaming platforms can run out of capacity. Downstream systems can lag behind. You need monitoring dashboards for CDC lag, failure rates, and message throughput. You need runbooks for common failure modes.

Schema changes require coordination. When the source schema changes, downstream systems must adapt. If they don't, data might be truncated or rejected. You need processes to detect schema changes, test impacts downstream, and roll out changes safely. This coordination is more complex than batch ETL where you test everything in development before deploying.

Debugging is harder. With batch ETL, you run a job, see the output, debug issues. With CDC, you have a continuous stream. If something goes wrong, you might have processed millions of messages incorrectly before noticing. You need comprehensive monitoring and testing to catch issues early.

Cost can be significant. CDC requires a streaming platform (Kafka is popular but not free). It requires compute to run CDC connectors. It requires capacity to handle peak change volume. For low-volume data, batch ETL is cheaper. For high-volume, high-frequency changes, CDC can be expensive unless you optimize carefully.

Finally, CDC is most valuable when downstream systems actually consume the changes in real-time. If you capture changes via CDC but load them in batch, you get the operational complexity of CDC without the real-time benefits. Make sure your use cases actually need real-time before committing to CDC.

Best Practices

  • Partition CDC messages by primary key to ensure that all changes to the same row are processed in order, preventing inconsistency from out-of-order updates.
  • Design downstream systems to be idempotent so applying the same CDC message twice produces the correct result, handling duplicates gracefully.
  • Monitor CDC lag continuously and alert if it exceeds thresholds, ensuring downstream systems stay current and detecting capacity or performance problems early.
  • Test schema changes in development before applying to production to verify that CDC systems and downstream consumers handle changes correctly without breaking.
  • Use CDC for real-time requirements where freshness matters (dashboards, fraud detection, recommendations) and batch ETL for historical analysis and aggregations that don't need live updates.

Common Misconceptions

  • CDC is a replacement for ETL, when in fact they serve different purposes and many modern data stacks use both for different use cases.
  • Implementing CDC means all data is immediately available in real-time, when in fact CDC requires complete infrastructure (streaming platforms, monitoring, downstream consumers) to deliver value.
  • Query-based CDC is sufficient if your database has timestamp columns, when it misses deletes, requires constant querying, and is inefficient compared to log-based approaches.
  • Once CDC is set up, it requires no monitoring because CDC systems are reliable, when in fact CDC systems can fall behind, connectors can fail, and careful operational oversight is essential.
  • CDC captures changes completely and correctly automatically, when ordering, deduplication, schema changes, and late-arriving data all require careful handling in downstream systems.

Frequently Asked Questions (FAQs)

What is change data capture (CDC) and how does it work?

Change data capture is a technique for capturing changes (INSERT, UPDATE, DELETE) made to a database and streaming those changes downstream to other systems. CDC reads from a database's transaction log, which records every change. As applications update the database, those changes are captured and streamed to downstream systems like data warehouses, caches, or other databases in real time or near-real-time.

CDC enables real-time data replication, event streaming, and analytics on fresh data without waiting for batch jobs to run. The core benefit is eliminating latency. Instead of loading all data nightly, CDC pushes changes as they happen. This makes data available in minutes or seconds rather than hours or days.

CDC systems read database transaction logs which are designed for recovery and already exist. CDC just consumes them. This is efficient and captures all changes, even those from bulk operations or applications that bypass normal queries.

What is the difference between log-based CDC and query-based CDC?

Log-based CDC reads from the database transaction log (binlog in MySQL, WAL in PostgreSQL, redo logs in Oracle). The log records every change with exact timing. This is efficient because the log already exists for database recovery. CDC just reads it. Log-based CDC captures all changes, including those from applications that bypass normal queries.

Query-based CDC runs periodic queries (like 'SELECT * FROM some_table WHERE updated_at > last_run') to detect changes. This is simpler to implement but misses changes, requires timestamp columns, and is inefficient because it queries the entire table repeatedly. Log-based is preferred for mission-critical systems. Query-based is good for simpler scenarios or databases that don't support log access.

The performance difference is substantial. Log-based CDC adds minimal load to the database. Query-based CDC forces periodic full table scans. For large tables, query-based CDC becomes prohibitively expensive.

How does CDC compare to traditional ETL?

Traditional ETL runs on a schedule: every night, extract all data from the source, transform it, load it into the warehouse. Data is stale until the next run. CDC captures changes continuously as they happen and pushes them downstream. This creates real-time or near-real-time data without waiting for batch jobs.

ETL is simpler to implement and debug because runs are discrete and repeatable. CDC is more complex but enables real-time analytics and replication. Many modern data stacks use both: CDC for real-time updates and ETL for periodic aggregations and transformations. This hybrid approach gets benefits of both architectures.

Choose CDC for use cases where freshness matters: real-time dashboards, fraud detection, recommendations. Choose ETL for historical analysis and aggregations that don't need live freshness. Both have their place in a modern data stack.

What are common CDC tools and platforms?

Debezium is an open-source CDC platform that supports many databases (MySQL, PostgreSQL, MongoDB, Oracle, SQL Server). It publishes changes to Kafka so downstream systems can consume them. Fivetran offers managed CDC connectors for many sources with minimal configuration. Airbyte has open-source and managed CDC support. Stitch (owned by Talend) is another managed data pipeline service. Maxwell is an open-source tool for MySQL CDC.

The choice depends on your source databases, infrastructure preferences (managed vs self-hosted), and downstream systems. Debezium is a natural choice if you are running Kafka internally. Fivetran is good if you want a managed solution and don't mind the cost. Airbyte is a good middle ground: open-source with managed options and many destination integrations.

Many databases also have native CDC support, like SQL Server Change Data Capture and Oracle GoldenGate. If your database has native support, that might be simpler than external tools.

What are the main use cases for CDC?

Real-time analytics: stream changes to a data warehouse so dashboards show current data, not stale snapshots from last night. Event sourcing: capture all changes as events and replay them to rebuild state or drive business logic. Data replication: keep multiple databases or systems synchronized without full copies. Cache invalidation: when source data changes, invalidate caches immediately so stale data is not served.
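
The cache-invalidation pattern mentioned above is small enough to sketch directly: when a change event arrives for a row, evict the corresponding cache entry so the next read repopulates it from the source. The key scheme and event shape are illustrative assumptions.

```python
# Cache-invalidation sketch: a CDC consumer that evicts cache entries for
# rows that changed, rather than trying to patch cached values in place.
cache = {"user:1": {"name": "Ada"}, "user:2": {"name": "Grace"}}

def on_change(event, cache):
    """Evict the cache entry for the changed row, if present."""
    cache.pop(f"user:{event['key']}", None)

on_change({"key": 1, "op": "update"}, cache)  # user:1 is evicted; user:2 stays
```

Evicting rather than updating keeps the consumer simple and avoids serving a cached value assembled from a partially applied change.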

Microservices sync: when data changes in one service, keep other services updated in real time. Compliance and audit: maintain immutable logs of all data changes for regulatory requirements. CDC enables all these patterns without building custom change-tracking logic into your applications.

The key is understanding when real-time matters. If your business works fine with daily data updates, batch ETL is sufficient and cheaper. If some users need real-time, add CDC for those use cases. Evaluate your use cases carefully before committing to CDC infrastructure.

What challenges does CDC introduce?

Exactly-once delivery is hard. Network failures can cause duplicate messages or missed changes. You need idempotent downstream systems or deduplication logic. Ordering can be tricky. If multiple changes happen to the same row, you must apply them in order or the final state is wrong. Large initial snapshots are slow. Capturing all existing data before streaming changes is time-consuming.

Schema evolution is complex. If the source schema changes, CDC systems must adapt. Downstream systems must handle schema changes. Late-arriving data is possible. A change might be captured late due to transaction timing or network delays. Downstream systems need to handle late data gracefully. Monitoring lag is essential. If CDC falls behind, the message queue grows and downstream systems show stale data.

Each of these challenges is solvable but requires thoughtful design. Understanding them upfront helps you build robust CDC systems.

How do you ensure CDC messages are processed in the correct order?

Many CDC tools partition changes by primary key: all changes to the same row go to the same partition. Partitions are processed in order. This ensures that if the same row is updated multiple times, the changes are applied in sequence. Downstream systems must respect this ordering when applying changes; processing a row's changes out of order produces an incorrect final state.

Kafka and similar systems preserve order within a partition, so using partitioning is the standard solution. Each row's primary key maps to a partition. All changes to that row go to that partition. Consumers process partitions sequentially. This guarantees order for each row.

The tradeoff is that you lose ordering across rows: a change to row 1 might be processed after a change to row 2 even if it happened first. For most use cases, per-row ordering is sufficient. If you need global ordering, you must sacrifice parallelism.

What is the difference between snapshot and incremental CDC?

Snapshot CDC captures the entire current state of a table as a starting point. Then it streams changes incrementally. This is necessary because downstream systems need the full data before incremental changes make sense. Incremental CDC alone would miss the initial data. Setting up snapshots is time-consuming for large tables, which is why many CDC tools batch snapshot work.

They might snapshot a large table in parallel chunks to speed things up. For a 100GB table, they might split it into 10 parallel jobs instead of one sequential job. After the snapshot, incremental changes are streamed continuously. This approach balances speed with resource usage.

The snapshot and incremental phases must be coordinated. You capture a snapshot at a specific point in time. Then you stream changes from that point forward. Any changes that happened during snapshot are included exactly once, not duplicated.
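
The coordination described above hinges on recording the log position at which the snapshot was taken and streaming only changes past that position. The sketch below assumes a simple monotonic position counter; real systems use the database's own LSN or binlog coordinates.

```python
# Snapshot/incremental coordination sketch: snapshot the state as of a known
# log position, then stream only changes with a later position, so changes
# committed during the snapshot are applied exactly once. All values are
# illustrative.
source_log = [
    {"pos": 1, "key": "a", "value": 1},
    {"pos": 2, "key": "b", "value": 2},
    {"pos": 3, "key": "a", "value": 10},  # committed after the snapshot point
]

snapshot_pos = 2
snapshot = {"a": 1, "b": 2}               # table state as of position 2

state = dict(snapshot)
for event in source_log:
    if event["pos"] > snapshot_pos:       # skip what the snapshot already contains
        state[event["key"]] = event["value"]
```

The position filter is what prevents double-applying changes the snapshot already reflects, and it is also the resume point if the stream is restarted.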

How do you handle schema changes in CDC systems?

Schema changes (adding columns, changing types, dropping columns) must be handled gracefully. Log-based CDC captures schema change events from the transaction log. Downstream systems must be aware of these changes and adapt. Some CDC tools automate schema evolution: they detect schema changes and communicate them to Kafka schemas or downstream systems.

Others require manual intervention. The safest approach is to test schema changes in development before applying to production, and to have a strategy for backward and forward compatibility. If you add a new optional column, it is backward compatible. If you drop a required column, you must update downstream systems that use it.
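
One way to make a consumer tolerant of the backward-compatible case above (a new optional column) is to normalize each incoming row against an expected schema with defaults. The column names and defaults below are illustrative assumptions, not a standard mechanism.

```python
# Backward-compatibility sketch: a consumer that fills defaults for missing
# optional columns and drops columns it does not know, instead of rejecting
# the event when the upstream schema changes.
EXPECTED_COLUMNS = {"id": None, "status": "unknown", "priority": "normal"}

def normalize(row):
    """Project a row onto the expected schema, filling defaults."""
    return {col: row.get(col, default) for col, default in EXPECTED_COLUMNS.items()}

old_row = normalize({"id": 1, "status": "new"})                      # pre-change event
new_row = normalize({"id": 2, "status": "new", "priority": "high"})  # post-change event
```

This handles additive changes gracefully; a dropped required column still needs the coordinated rollout the surrounding text describes.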

Many teams use schema registries (like Confluent Schema Registry) to manage schema evolution. The schema registry tracks all versions and enforces compatibility rules. This prevents downstream systems from breaking when schema changes.

What is the cost of running CDC systems?

CDC has three cost components. Database load: CDC reads from transaction logs, which adds some load to the source database. This is usually small, but can be significant if the database is already stressed. Streaming platform cost: capturing changes requires a streaming platform like Kafka. Larger volumes mean larger clusters.

Downstream compute: applying changes to multiple systems costs compute. Real-time replication to five destinations costs 5x more than batch loading to one. The total cost depends on data volume, change frequency, and number of downstream systems. For high-volume, high-frequency changes to many destinations, CDC can be expensive. For low-volume data, batch ETL is cheaper and simpler.

Cost optimization requires sizing infrastructure for peak volume and monitoring utilization. If CDC systems are running at 10% capacity, you are over-provisioned. If they are at 90% and lagging, you are under-provisioned. Regular capacity reviews help optimize cost.

How do you choose between CDC and batch ETL?

Use CDC for real-time requirements: fraud detection needs current customer data. Recommendation engines need recent user behavior. Live dashboards need current metrics. Use batch ETL for historical analysis, data science, and aggregations that do not need real-time freshness. Use both together: CDC for real-time serving, batch ETL for overnight aggregations and historical snapshots.

Many teams start with batch ETL because it is simpler, then add CDC later when real-time requirements emerge. Evaluate your use cases and user expectations. If most users can wait until morning, batch ETL is sufficient and cheaper. If some users need real-time data, add CDC for those use cases. This incremental approach avoids over-engineering for theoretical future needs.

The hybrid approach is most common in modern data stacks: CDC provides fresh updates for real-time systems. Batch jobs run nightly to aggregate and compute things that don't need live freshness. This hybrid approach gets benefits of both without committing to purely streaming architecture.

What happens if CDC falls behind and messages pile up?

If CDC cannot keep up with change volume, messages queue up. A Kafka topic might accumulate millions of messages. If the backlog is large, it takes a long time to catch up. During this period, downstream systems show stale data. If the queue grows unbounded, it might run out of storage or memory.

You must monitor CDC lag (how far behind real time it is) and alert if it exceeds thresholds. A lag of 5 minutes might be acceptable. A lag of 5 hours means downstream systems are stale. If CDC falls behind, you might need to increase streaming platform capacity, optimize downstream processing, or reduce the freshness requirements.

For critical systems, you need enough capacity that CDC never falls behind. This requires capacity planning based on peak change volume. Size your streaming platform to handle 2-3x typical load. Monitor utilization closely and increase capacity before hitting limits.
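
The 2-3x headroom rule above amounts to simple arithmetic; this sketch just makes it explicit, with an illustrative peak rate and headroom factor.

```python
# Capacity-sizing sketch: provision streaming capacity as a multiple of the
# observed peak change rate, per the 2-3x headroom guideline. Numbers are
# illustrative.
def required_throughput(peak_changes_per_sec, headroom=2.5):
    """Changes/sec the streaming platform should sustain without lagging."""
    return peak_changes_per_sec * headroom

capacity = required_throughput(4_000)  # provision for 10,000 changes/sec
```

Reviewing actual utilization against this number periodically catches both the over-provisioned (10% busy) and under-provisioned (90% busy and lagging) cases described above.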