LS LOGICIEL SOLUTIONS

Change Data Capture (CDC): Real Examples & Use Cases

Definition

Change Data Capture is the practice of recording row-level changes in a source database and propagating those changes to downstream systems in near-real-time. Log-based capture reads the database's transaction log rather than querying the tables, which gives a complete and ordered record of every insert, update, and delete while adding very little load to the source. Real examples reveal where CDC has actually displaced batch replication, what failure modes show up in production, and where the operational cost is still high enough that teams reach for alternatives.

The pattern matters because batch replication has a fundamental freshness ceiling. If the warehouse loads daily, downstream consumers cannot react to changes faster than that. CDC pushes the freshness down to seconds-to-minutes, which unlocks use cases batch cannot support: real-time analytics, fresh feature stores, search indexes that stay in sync, operational caches, and event-driven workflows downstream of the database.

The category in 2026 includes open-source connectors (Debezium dominates), commercial CDC products (Fivetran's HVR, Striim, Qlik Replicate, Estuary Flow), cloud-native services (AWS DMS, GCP Datastream, Azure Data Factory CDC), and database-native CDC features (Postgres logical replication, MySQL binlog, SQL Server CDC, MongoDB change streams). Most production CDC pipelines combine the database's native CDC primitives with a connector that turns them into a stream of events.

What separates production CDC from prototype CDC is operational discipline around the failure modes that only show up in production: replication slots filling up, schema changes mid-flight, very large transactions, snapshot-and-catch-up after outages, and tombstone handling for deletes. The connectors handle the happy path; the operational layer handles the rest.

This page surveys real CDC implementations across analytics ingestion, search index synchronization, operational caching, and event-driven architectures. The connector ecosystem evolves quickly; the underlying patterns about source databases, downstream sinks, and operational handling are stable.

Key Takeaways

  • CDC captures row-level changes from source databases and streams them to downstream systems in near-real-time.
  • The pattern reads transaction logs rather than querying tables, so it puts minimal load on the source.
  • Debezium is the dominant open-source CDC connector; commercial options trade openness for managed operations.
  • Common downstream uses include warehouse ingestion, search index sync, cache invalidation, and event-driven workflows.
  • Production CDC requires operational discipline around schema evolution, replication slot management, and large-transaction handling.

Companies Running CDC at Scale

LinkedIn pioneered the production CDC pattern with their Databus system in the early 2010s. Databus captured changes from Oracle and distributed them to downstream consumers including search indexes and caches. The work informed the broader industry's adoption of CDC patterns and influenced Debezium's design years later.

Netflix uses CDC extensively from Cassandra and other operational stores into their data platform. The changes feed both analytics and ML feature stores. The Netflix team has published material on the operational challenges of running CDC at their scale, particularly around schema management and very high write throughput.

Uber's data platform consumes CDC from operational MySQL and Postgres databases through Debezium and into their lakehouse. The data lands in Hudi tables that handle the streaming-update pattern efficiently. The combination of Debezium plus Hudi is a recognized pattern for high-volume CDC-to-lakehouse use cases.

Shopify operates CDC pipelines from operational MySQL into analytics and into their internal microservice ecosystem. Service teams subscribe to change streams to react to operational events without polling the source databases. The pattern reduced coupling between services and the source databases.

Stripe consumes CDC from internal Postgres databases for fraud detection, analytics, and downstream services. The freshness requirement for fraud detection (catching fraudulent payments within the authorization window) drove much of the CDC infrastructure investment.

Many smaller companies run CDC for the analytics-freshness use case alone, replicating from Postgres or MySQL into Snowflake, BigQuery, or Databricks for analytics that needs to be fresher than batch can deliver.

Source Database Patterns

Postgres CDC works through logical replication. A logical replication slot tracks the connector's position in the write-ahead log (WAL) and streams decoded changes; Debezium or another connector reads from the slot and converts the changes into events. The pattern is mature and widely used. The operational concerns include slot maintenance (a slot retains WAL until the connector confirms it, so a lagging connector can fill the disk) and handling of large transactions.

MySQL CDC reads the binary log (binlog). Connectors decode binlog events into change records. The pattern is similarly mature. MySQL rotates binlog files and purges old ones according to its retention settings, so connectors must track their position carefully; if the files a connector still needs are purged during an outage, those events are gone.

SQL Server has a built-in CDC feature in which a capture job reads the transaction log and records changes in side tables (change tables). Connectors poll the change tables and emit events. The pattern is more polling-based than the log-reading patterns of Postgres and MySQL but works similarly from a downstream perspective.

MongoDB exposes change streams as a native API. Consumers subscribe to a collection's change stream and receive events for inserts, updates, and deletes. The pattern is simpler to operate than transaction-log reading because MongoDB handles the underlying mechanics; the consumer just reads the stream.

DynamoDB Streams provides similar functionality for DynamoDB tables. The stream is consumed by Lambda functions, Kinesis pipelines, or other downstream processors. The pattern is the standard way to react to DynamoDB changes in AWS-native architectures.

Cassandra CDC is supported through CDC files on each node. The pattern is harder to operate cleanly than log-based CDC on relational databases because of Cassandra's distributed nature; aggregating CDC across nodes requires extra coordination.

Downstream Sink Patterns

Warehouse ingestion is the most common use case. CDC streams land in staging tables in the warehouse; transformations apply the changes to produce current-state and history tables. The pattern keeps the warehouse fresh to within minutes of the source. Snowflake, BigQuery, Databricks, and Redshift all have established CDC ingestion patterns.
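The apply step described above can be sketched in a few lines. This is a minimal illustration, not a warehouse implementation: it assumes a Debezium-style envelope (`op` of `c`, `u`, `d`, or `r`, with `before` and `after` row images and a `ts_ms` timestamp) and a primary-key column named `id`. The function name `apply_changes` and the in-memory dict and list are hypothetical stand-ins for a MERGE into a current-state table and an append into a history table.

```python
from typing import Any

Row = dict[str, Any]


def apply_changes(events: list[dict], current: dict[Any, Row],
                  history: list[Row], key: str = "id") -> None:
    """Apply ordered change events to a current-state table and a history table."""
    for ev in events:
        op = ev["op"]
        if op in ("c", "u", "r"):          # create, update, snapshot read
            row = ev["after"]
            current[row[key]] = row        # upsert the latest row image
        elif op == "d":                    # delete
            row = ev["before"]
            current.pop(row[key], None)    # remove from current state
        else:
            raise ValueError(f"unknown op: {op!r}")
        # History keeps every change, including deletes.
        history.append({**row, "_op": op, "_ts_ms": ev["ts_ms"]})


# Hypothetical staged events, in commit order.
staged = [
    {"op": "c", "before": None, "after": {"id": 1, "status": "new"}, "ts_ms": 1000},
    {"op": "u", "before": {"id": 1, "status": "new"},
     "after": {"id": 1, "status": "paid"}, "ts_ms": 2000},
    {"op": "c", "before": None, "after": {"id": 2, "status": "new"}, "ts_ms": 3000},
    {"op": "d", "before": {"id": 2, "status": "new"}, "after": None, "ts_ms": 4000},
]
current_state: dict[Any, Row] = {}
history_rows: list[Row] = []
apply_changes(staged, current_state, history_rows)
```

Events must be applied in commit order per key; applying them out of order would let an older image overwrite a newer one.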

Lakehouse ingestion works through table formats designed for update workloads. Apache Hudi was built for this use case; Iceberg and Delta have added similar capabilities in recent versions. The pattern lands CDC changes in lakehouse tables that downstream queries can read like normal tables, with the most recent state per primary key.

Search index synchronization keeps Elasticsearch, OpenSearch, or similar search systems consistent with the source database. The CDC stream becomes a sequence of index operations. The pattern eliminates the dual-write problem (writing to the database and the search index from application code) by reading the source of truth from the database.
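As a rough sketch of how a CDC stream becomes a sequence of index operations, the function below maps each change event to either an index or a delete operation. It assumes a Debezium-style envelope (`op`, `before`, `after`) and a primary key named `id`; the operation dicts are a hypothetical intermediate shape, not the request format of any particular search client.

```python
def to_index_ops(events: list[dict], key: str = "id") -> list[dict]:
    """Translate ordered change events into search index operations."""
    ops = []
    for ev in events:
        if ev["op"] == "d":
            # The row is gone from the source, so remove the document.
            ops.append({"action": "delete", "id": ev["before"][key]})
        else:
            # Creates, updates, and snapshot reads all write the full document,
            # which makes replaying the same event harmless.
            doc = ev["after"]
            ops.append({"action": "index", "id": doc[key], "doc": doc})
    return ops


product_events = [
    {"op": "c", "before": None, "after": {"id": 7, "title": "Kettle"}},
    {"op": "u", "before": {"id": 7, "title": "Kettle"},
     "after": {"id": 7, "title": "Electric kettle"}},
    {"op": "d", "before": {"id": 7, "title": "Electric kettle"}, "after": None},
]
index_ops = to_index_ops(product_events)
```

Writing the whole document on every create or update keeps the operation idempotent, which matters because most CDC pipelines deliver at least once.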

Cache invalidation uses CDC to expire or update cache entries when source data changes. Redis or Memcached entries that depend on database state get invalidated based on the CDC stream. The pattern keeps caches consistent without coupling the cache invalidation logic into application code.
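A minimal sketch of CDC-driven invalidation follows. It assumes cache keys of the form `table:primary_key` and a Debezium-style envelope; both are illustrative choices, and the plain dict stands in for Redis or Memcached.

```python
def invalidate_from_events(cache: dict, events: list[dict],
                           table: str, key: str = "id") -> list[str]:
    """Drop cache entries for every row touched by the change events."""
    invalidated = []
    for ev in events:
        # Deletes carry the row in "before"; everything else in "after".
        image = ev["after"] if ev["after"] is not None else ev["before"]
        cache_key = f"{table}:{image[key]}"
        if cache_key in cache:
            del cache[cache_key]
            invalidated.append(cache_key)
    return invalidated


cache = {
    "customers:1": {"id": 1, "tier": "free"},
    "customers:2": {"id": 2, "tier": "pro"},
    "orders:1": {"id": 1, "total": 40},
}
customer_events = [
    {"op": "u", "before": {"id": 1, "tier": "free"}, "after": {"id": 1, "tier": "pro"}},
    {"op": "d", "before": {"id": 3, "tier": "free"}, "after": None},
]
dropped = invalidate_from_events(cache, customer_events, table="customers")
```

Deleting the entry rather than rewriting it leaves the next read to repopulate the cache from the database, which avoids having to reproduce the application's serialization logic in the CDC consumer.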

Event-driven service architectures consume CDC as service events. Service B listens for changes in Service A's data and reacts without Service A having to explicitly publish events. Consuming table changes directly is debated because it couples Service B to Service A's schema; the related transactional outbox pattern addresses this by having Service A write events to a dedicated outbox table that CDC then relays. Both approaches work well in many production architectures.

Common Operational Challenges

Schema changes mid-flight are the most common operational problem. The source database adds a column; the CDC stream now contains records with new fields; downstream consumers may or may not handle the new field. The fix is automated schema detection in the connector plus contracts that govern how schema changes propagate.
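One way a downstream consumer can tolerate an added column is to project each row onto the columns it knows about and report the rest instead of failing. The sketch below shows that idea only; the function name and the column list are hypothetical, and real pipelines would usually pair this with a schema registry or a data contract.

```python
def project_known(row: dict, known_columns: list[str]) -> tuple[dict, list[str]]:
    """Keep only the columns the sink expects and report any unexpected ones."""
    # Missing known columns become None rather than raising.
    kept = {col: row.get(col) for col in known_columns}
    # Unknown columns are surfaced so someone can decide how to propagate them.
    unknown = sorted(set(row) - set(known_columns))
    return kept, unknown


sink_columns = ["id", "email"]
incoming = {"id": 1, "email": "a@example.com", "plan": "pro"}  # "plan" was just added
kept_row, unexpected = project_known(incoming, sink_columns)
```

Reporting unknown columns instead of silently dropping them gives the team a signal that a schema change has arrived before anyone depends on the new field.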

Replication slot management for Postgres is a frequent source of outages. The slot retains WAL while the connector is down; if it retains too much, the database disk fills up. The fix is monitoring the WAL retained per slot, alerting when it grows past a threshold, and having procedures for either restarting the connector or dropping the slot in emergencies.
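A sketch of the monitoring check is shown below. The SQL uses the `pg_replication_slots` view together with `pg_wal_lsn_diff` and `pg_current_wal_lsn` (PostgreSQL 10 and later); the Python function, the threshold, and the sample rows are illustrative assumptions, and running the query is left to whatever database client the team already uses.

```python
# Query to run against the source database; returns one row per slot.
SLOT_RETENTION_SQL = """
SELECT slot_name,
       active,
       pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) AS retained_bytes
FROM pg_replication_slots
"""


def slots_over_threshold(rows: list[tuple], threshold_bytes: int) -> list[str]:
    """Return the names of slots retaining more WAL than the threshold."""
    return [
        slot_name
        for slot_name, _active, retained_bytes in rows
        # retained_bytes is NULL (None) when the slot has no restart_lsn yet.
        if retained_bytes is not None and retained_bytes > threshold_bytes
    ]


# Hypothetical query results: (slot_name, active, retained_bytes).
sample_rows = [
    ("debezium_orders", True, 5_000_000),
    ("forgotten_slot", False, 80_000_000_000),
    ("new_slot", False, None),
]
alerts = slots_over_threshold(sample_rows, threshold_bytes=10 * 1024**3)
```

An inactive slot with growing retained WAL is the pattern to alert on first, since nothing is consuming from it.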

Large transactions create memory pressure on the connector. A transaction that updates a million rows produces a million events that the connector may try to hold in memory. The fix is configuring connectors for streaming behavior and tuning batch sizes appropriately.
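The difference between buffering and streaming can be illustrated on the consumer side. The sketch below pulls events from an iterator in fixed-size batches so that memory use is bounded by the batch size rather than the transaction size. It is a generic illustration with a hypothetical function name, not a description of any connector's internals or configuration.

```python
from itertools import islice
from typing import Iterable, Iterator


def in_batches(events: Iterable[dict], batch_size: int) -> Iterator[list[dict]]:
    """Yield events in lists of at most batch_size without materializing them all."""
    it = iter(events)
    # islice pulls only batch_size items at a time from the underlying stream.
    while batch := list(islice(it, batch_size)):
        yield batch


# A generator stands in for a very large transaction's events.
big_transaction = ({"op": "u", "after": {"id": i}} for i in range(10))
batch_sizes = [len(batch) for batch in in_batches(big_transaction, batch_size=4)]
```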

Snapshot-and-catch-up after extended outages requires care. When a connector has been down long enough that the log has rotated past where it stopped, the connector needs to re-snapshot from the table state and then resume from the current log position. The operational procedure for this is important to have ready before it is needed.
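The decision at the heart of this procedure is a comparison between the connector's saved position and the oldest position the source still retains. The sketch below models log positions as monotonically increasing integers, which is a simplifying assumption (real positions are LSNs, binlog file and offset pairs, or similar), and the function name is hypothetical.

```python
def recovery_action(saved_position: int, oldest_retained_position: int) -> str:
    """Decide whether a restarted connector can resume or must re-snapshot."""
    if saved_position >= oldest_retained_position:
        # The log still contains everything after the saved position.
        return "resume"
    # Events between the saved position and the oldest retained one are gone,
    # so the only consistent option is a new snapshot followed by streaming.
    return "resnapshot"


short_outage = recovery_action(saved_position=1_500, oldest_retained_position=1_000)
long_outage = recovery_action(saved_position=1_500, oldest_retained_position=2_200)
```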

Tombstone handling for deletes is sometimes overlooked. The CDC stream emits a delete event; downstream consumers need to handle it correctly. Warehouse loaders need to soft-delete or remove rows; search indexes need to delete documents; caches need to invalidate entries. Inconsistent handling across consumers leads to stale data in some sinks.
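A sketch of one consistent delete policy, soft deletion, is shown below. It assumes records arrive as `(key, value)` pairs in the style of Debezium on Kafka, where a delete is an event with `op` set to `d` that may be followed by a tombstone record whose value is null. The `_deleted` column and the function name are illustrative.

```python
def apply_with_soft_delete(records: list[tuple], table: dict) -> None:
    """Apply keyed change records to a table, marking deletes instead of removing rows."""
    for key, value in records:
        if value is None:
            # Tombstone record: exists for log compaction; the preceding
            # "d" event already carried the delete, so there is nothing to do.
            continue
        if value["op"] == "d":
            row = table.get(key)
            if row is not None:
                row["_deleted"] = True
        else:
            table[key] = {**value["after"], "_deleted": False}


customer_records = [
    (1, {"op": "c", "before": None, "after": {"id": 1, "name": "Ada"}}),
    (2, {"op": "c", "before": None, "after": {"id": 2, "name": "Lin"}}),
    (1, {"op": "d", "before": {"id": 1, "name": "Ada"}, "after": None}),
    (1, None),  # tombstone following the delete
]
customers: dict = {}
apply_with_soft_delete(customer_records, customers)
```

A consumer that does not expect the null-valued record will typically fail on it, which is one reason tombstones deserve an explicit branch.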

Connector Options That Get Used

Debezium dominates the open-source space. It supports Postgres, MySQL, SQL Server, MongoDB, Oracle, and others, runs on Kafka Connect (or standalone), and has the largest production deployment base. The trade-off is operational complexity; running Debezium at scale requires real expertise.

Estuary Flow is a newer commercial CDC platform that bundles capture, transformation, and delivery in a managed service. The product fits teams that want CDC without operating Kafka and Debezium themselves. Adoption has grown in mid-market data teams.

Fivetran's HVR acquisition gave Fivetran high-end CDC capabilities for enterprise databases including Oracle, SAP HANA, and DB2. The product fits enterprises with legacy database stacks where the open-source connectors have less mature support.

AWS DMS handles CDC across many source and destination types in AWS-native architectures. The service is fully managed and integrates with the AWS ecosystem; the operational simplicity is the main draw. Limitations exist around specific source-destination pairs and around very high throughput.

GCP Datastream provides similar functionality for Google Cloud-native architectures. The service captures from Postgres, MySQL, and Oracle into BigQuery, GCS, or other Google Cloud destinations. The integration with BigQuery is particularly tight.

Native database features (PostgreSQL's built-in logical replication or the pglogical extension, MySQL's binlog tools) work for simple cases without an additional CDC product. Teams sometimes start here and add a managed product as their CDC needs grow.

Common Failure Modes

Forgotten replication slots that fill source database disks. The connector crashes silently; the slot accumulates WAL; the database hits disk full; production goes down. The fix is monitoring and alerting on slot size as a first-class metric.

Schema changes that break downstream consumers. The source adds a non-null column with no default; the CDC stream propagates rows that downstream loaders cannot apply; the pipeline halts. The fix is coordinating schema changes with downstream consumers and configuring connectors for safe handling of unknown columns.

Dual-write inconsistency when CDC is used alongside application-level writes. The application writes to the database and to a secondary store; the CDC pipeline also writes to the secondary store; conflicts and drift result. The fix is picking one pattern and using it consistently.

Snapshot-and-catch-up failures after extended downtime. The team realizes the connector has been down for days; the log no longer has the events; restarting requires a full re-snapshot that takes hours and is operationally disruptive. The fix is monitoring connector lag and having a rehearsed procedure for restarting the connector before the logs are purged.

Cost surprises from very chatty source databases. A source database that updates a hot row a hundred times per second produces a hundred CDC events per second downstream. Downstream costs (warehouse ingestion, network, sink processing) scale with the event count. The fix is awareness of upstream write patterns and downstream cost monitoring.
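The arithmetic behind the hot-row example is worth making explicit. The sketch below turns an update rate into daily event and byte volumes; the average event size of 1 KiB is an assumption for illustration, since real sizes depend on row width and envelope format.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def daily_volume(updates_per_second: int, avg_event_bytes: int) -> tuple[int, int]:
    """Return (events per day, bytes per day) for a steady update rate."""
    events = updates_per_second * SECONDS_PER_DAY
    return events, events * avg_event_bytes


# One hot row updated 100 times per second, assuming 1 KiB per event.
events_per_day, bytes_per_day = daily_volume(100, avg_event_bytes=1024)
gib_per_day = bytes_per_day / 1024**3
```

Under these assumptions a single hot row produces 8.64 million events and a little over 8 GiB per day before any downstream fan-out.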

Best Practices

  • Monitor replication slot sizes, connector lag, and source database resource usage as first-class operational metrics.
  • Define schema change procedures that coordinate source changes with CDC connector and downstream consumer updates.
  • Test snapshot-and-catch-up procedures before they are needed in an emergency.
  • Pick one consistent pattern (CDC or application events, not both) for any given data flow.
  • Plan for tombstone handling explicitly at every downstream sink so deletes propagate consistently.

Common Misconceptions

  • "CDC is the same as ETL with shorter intervals." In practice, CDC reads transaction logs and captures every change including deletes, which polling-based ETL cannot do reliably.
  • "CDC removes the need for batch loading." CDC handles incremental changes well, but initial snapshots and historical backfills still need batch patterns.
  • "CDC is always real-time." Latency from source change to downstream consumption is typically seconds to minutes, not microseconds.
  • "Debezium works out of the box." Production deployment requires substantial operational investment around the connector's failure modes.
  • "CDC eliminates the dual-write problem entirely." The pattern reduces it but still requires careful handling of downstream consumer consistency.

Frequently Asked Questions (FAQs)

When should I use CDC instead of batch loading?

When downstream consumers need data fresher than the batch interval can provide. Typical cases are analytics that needs minute-level freshness, search indexes that need to reflect new content immediately, and ML features that depend on recent activity. For consumers that only need daily freshness, batch is simpler and cheaper.

What is the typical latency from source change to downstream consumption?

Seconds to a few minutes for well-tuned setups. The latency includes time for the source database to commit, the connector to read the log, the message bus to deliver the event, and the downstream consumer to apply it. Sub-second is achievable but requires investment; minutes-to-tens-of-minutes is typical without tuning.
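The stages listed above can be measured from timestamps carried on each event. The sketch below assumes a Debezium-style envelope in which `source.ts_ms` is the time of the change in the database and the top-level `ts_ms` is the time the connector processed it; the `applied_at_ms` value is assumed to be recorded by the consumer when it finishes applying the event. Clock skew between hosts will distort the numbers.

```python
def latency_breakdown(event: dict, applied_at_ms: int) -> dict[str, int]:
    """Split end-to-end latency into capture time and delivery-plus-apply time."""
    changed_at = event["source"]["ts_ms"]   # change committed in the source
    captured_at = event["ts_ms"]            # connector processed the change
    return {
        "capture_ms": captured_at - changed_at,
        "delivery_and_apply_ms": applied_at_ms - captured_at,
        "total_ms": applied_at_ms - changed_at,
    }


sample_event = {"op": "u", "ts_ms": 1_700_000_000_450,
                "source": {"ts_ms": 1_700_000_000_000}}
breakdown = latency_breakdown(sample_event, applied_at_ms=1_700_000_002_000)
```

Tracking the two components separately shows whether tuning effort belongs on the connector or on the sink.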

Does CDC put load on my source database?

Minimal, when configured correctly. Log-based CDC reads the transaction log that the database is already maintaining, so the read overhead is low. The pattern is much lighter than polling-based replication that runs SELECT statements against the tables. The main exception is the initial snapshot, which does read the tables directly.

How do I handle schema changes?

With explicit coordination. Test the change in a staging environment that includes the connector and downstream sinks. Roll out the source change with the connector configured to handle new fields gracefully. Update downstream consumers to use the new fields after they appear. The discipline prevents the common failure of source-side changes breaking downstream pipelines.

What happens during connector downtime?

The connector picks up where it left off when it restarts, reading from the saved position in the source log. The acceptable downtime depends on the log retention. Postgres replication slots retain WAL until the connector confirms it, indefinitely by default (with the risk of filling the disk; PostgreSQL 13 and later can cap this with max_slot_wal_keep_size). MySQL binlog retention is configurable. Monitor lag and recover quickly when outages happen.

How do I handle large transactions?

Configure the connector for streaming behavior rather than buffering full transactions in memory. Tune batch sizes for downstream sinks. Some connectors expose configuration for handling very large transactions specifically; consult the connector documentation.

What about CDC for non-relational databases?

MongoDB, DynamoDB, and Cassandra all have CDC primitives; the connectors are less mature than for relational databases but production-usable. The patterns are similar in shape, the operational details differ.

Is CDC suitable for cross-region replication?

Yes, though latency and network costs become considerations. Many production setups use CDC for cross-region replication of analytical data; the pattern tolerates long-distance links well when configured for batched delivery.

Where is CDC heading?

Toward more managed offerings that reduce the operational burden. Toward tighter integration with lakehouse table formats that handle the streaming-update pattern efficiently. Toward broader source database coverage as connectors mature for more legacy systems. The category is mature in pattern but continues to improve in tooling.