LS LOGICIEL SOLUTIONS

What Is Apache Kafka?

Definition

Apache Kafka is a distributed event streaming platform. It is not a message queue, though people often call it one. A message queue temporarily stores messages until they are consumed. Kafka stores data persistently in a distributed log, allowing multiple consumers to read the same events independently and replay history. This distinction is fundamental to understanding what Kafka is good for.

At its core, Kafka is a sequence of events stored durably across multiple servers. A producer writes events (a user clicked a button, a payment was processed, a sensor reading arrived). These events flow into Kafka topics. Consumers read from those topics. The key insight is that events remain in Kafka after being read, available to other consumers at different times. With a traditional message queue, once a message is consumed, it is gone. With Kafka, a message can be consumed by hundreds of different consumers at their own pace.

Kafka handles scale. Billions of events per day flow through Kafka clusters at companies like Netflix, LinkedIn, and Uber. It is designed to be a central nervous system for data: the place where everything that happens in your systems becomes visible and accessible. Event streams feed into real-time analytics, machine learning systems, data warehouses, and other
applications.

The platform consists of producers (things that write events), brokers (the servers that store events), topics (named streams), partitions (parallel units within topics), and consumers (things that read events). Understanding these components and how they interact is essential to using
Kafka effectively.

Key Takeaways

  • Kafka is a distributed event streaming platform, not a message queue; it persists events and allows multiple independent consumers to read the same events at different times.
  • Topics contain events, partitions within topics provide parallelism and ordering guarantees, and consumer groups distribute reading load across multiple consumers.
  • Kafka enables event sourcing (storing all changes as events), change data capture (streaming database changes), and real-time analytics at massive scale.
  • Replication stores each partition on multiple brokers, ensuring durability and availability; at-least-once delivery is the default, with exactly-once achievable with careful design.
  • Common use cases include event-driven applications, log aggregation, streaming analytics, microservices communication, and building audit trails of all system changes.
  • Managed Kafka services (Confluent Cloud, AWS MSK, Azure Event Hubs) reduce operational complexity; self-hosted Kafka requires careful monitoring, capacity planning, and expertise.

Understanding Kafka's Architecture: Brokers, Topics, and Partitions

Kafka's architecture centers around a cluster of brokers. Each broker is a server that stores partitions. Together, brokers replicate data, handle producer and consumer requests, and manage coordination. A Kafka cluster typically consists of three or more brokers for high availability. If one broker fails, others continue operating.

Data flows into topics. A topic is a logical stream of events, like a channel. For example, user-events might be a topic for all user interactions; payment-completed might be another. Topics are divided into partitions, which are the unit of parallelism. Each partition is an ordered sequence of events. When you write an event to a topic, it goes to exactly one partition (determined by the event key). All events with the same key go to the same partition, preserving order for that key. This is how Kafka guarantees that all events about a specific user stay in order while still processing millions of users in parallel.
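
To make key-based routing concrete, here is a minimal Python sketch. Kafka's default partitioner hashes keys with murmur2; the crc32 hash and the six-partition topic below are stand-in assumptions for illustration only.

```python
import zlib

NUM_PARTITIONS = 6  # assumed topic size, for illustration

def partition_for(key: bytes, num_partitions: int = NUM_PARTITIONS) -> int:
    """Route a keyed event to one partition. Kafka's default partitioner
    uses murmur2 hashing; crc32 stands in here purely for illustration."""
    return zlib.crc32(key) % num_partitions

# All events carrying the same key hash to the same partition, so that
# key's events stay ordered; different keys spread across partitions.
clicked = partition_for(b"user-42")
logged_out = partition_for(b"user-42")
assert clicked == logged_out            # same key, same partition
assert 0 <= partition_for(b"user-7") < NUM_PARTITIONS
```

Because the routing is a pure function of the key, every producer in the system agrees on where a given user's events go, with no coordination required.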

Replication happens at the partition level. Each partition is replicated to multiple brokers. One broker is the leader (handles writes and reads), others are followers (mirror the data). If the leader fails, one follower becomes the new leader. The replication factor (usually 3) determines how many copies exist. Higher replication means better durability but more storage and network cost. Each broker stores some partitions as a leader and others as a follower, distributing load.

Producers send events to topics. The producer client library handles routing to the correct partition. Consumers read from topics. Multiple consumers can read the same topic independently. Consumer groups coordinate: each partition is read by exactly one consumer in a group. This allows parallel processing. If you have ten partitions and five consumers in a group, each consumer reads two partitions. If you add consumers, Kafka rebalances automatically, redistributing partitions. This design scales horizontally: add more partitions and more consumers and load distributes automatically.
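
The partition-to-consumer assignment can be sketched as simple round-robin, one of the classic strategies Kafka supports. The real protocol involves a coordinator broker and considerably more bookkeeping; this is an illustrative model only.

```python
def assign_partitions(partitions, consumers):
    """Round-robin assignment: each partition goes to exactly one
    consumer in the group. (Kafka ships several assignors; range and
    round-robin are the classic ones — this sketches round-robin.)"""
    assignment = {c: [] for c in consumers}
    for i, p in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment

partitions = list(range(10))
consumers = ["c0", "c1", "c2", "c3", "c4"]
assignment = assign_partitions(partitions, consumers)
assert all(len(ps) == 2 for ps in assignment.values())  # 10 / 5 = 2 each

# A consumer joining triggers a "rebalance": recompute the assignment.
assignment = assign_partitions(partitions, consumers + ["c5"])
assert sum(len(ps) for ps in assignment.values()) == 10
```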

How Producers and Consumers Work in Kafka

Producers are applications or services that write events to Kafka. A web server might emit user-clicked events. A database might emit change events. A sensor might emit readings. Producers send events to a topic. The producer client specifies the topic, the value (the event data), and optionally a key (used to determine the partition). The producer client library batches events and sends them in bulk to improve efficiency. Configuration options control batching behavior, retry logic, and how long to wait for broker acknowledgment before returning.

The producer sends to a partition leader. The leader replicates the write to followers. Configuration controls consistency: you can wait for all in-sync replicas to acknowledge (acks=all, safe but slower), only the leader (acks=1, fast but risky), or no acknowledgment at all (acks=0). This is a fundamental trade-off in distributed systems: durability versus latency. Most production setups use acks=all together with min.insync.replicas=2, balancing both concerns. If a producer cannot reach Kafka, it retries automatically. Idempotent producers ensure that duplicates caused by retries do not become duplicate events in Kafka: brokers track producer sequence numbers and drop retried writes, which adds a little overhead but guarantees no retry-induced duplicates reach the log.
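
Broker-side deduplication for idempotent producers can be modeled roughly like this. Real brokers track a producer ID, epoch, and per-partition sequence numbers; this toy version keeps only the last sequence per producer.

```python
class Broker:
    """Toy model of broker-side producer deduplication: the broker
    remembers the last sequence number appended per producer and drops
    a retry that re-sends it (how Kafka's idempotent producer avoids
    duplicates, greatly simplified)."""
    def __init__(self):
        self.log = []
        self.last_seq = {}  # producer_id -> last appended sequence

    def append(self, producer_id, seq, event):
        if self.last_seq.get(producer_id) == seq:
            return  # duplicate caused by a retry: drop it
        self.log.append(event)
        self.last_seq[producer_id] = seq

broker = Broker()
broker.append("p1", 0, "order-created")
broker.append("p1", 0, "order-created")  # retry after a timed-out ack
broker.append("p1", 1, "order-paid")
assert broker.log == ["order-created", "order-paid"]  # no duplicate
```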

Consumers are applications that read events from Kafka. A consumer subscribes to topics and reads events in order. Consumer client libraries handle connecting to brokers, maintaining offsets (position in the partition), and rebalancing when consumers join or leave. As a consumer reads events, it periodically commits its offset. If the consumer crashes and restarts, it resumes from the last committed offset; any events processed after that commit are processed again (at-least-once is the default guarantee). Offsets are stored in an internal Kafka topic, making offset management simple and reliable.
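
The at-least-once behaviour described above can be simulated in a few lines: processing happens first, the offset commit second, and a crash between the two causes reprocessing on restart. This is an illustrative model, not client code.

```python
def consume(log, committed_offset, crash_before_commit_at=None):
    """Read from the last committed offset; commit after processing.
    A crash between processing and committing means those events are
    processed again on restart — Kafka's at-least-once behaviour."""
    processed = []
    offset = committed_offset
    for event in log[offset:]:
        processed.append(event)
        offset += 1
        if offset == crash_before_commit_at:
            return processed, committed_offset  # crashed: commit lost
        committed_offset = offset  # commit the new position
    return processed, committed_offset

log = ["e0", "e1", "e2", "e3"]
first, committed = consume(log, 0, crash_before_commit_at=2)  # die after e1
assert first == ["e0", "e1"] and committed == 1  # e1's commit never landed
second, committed = consume(log, committed)  # restart: e1 is re-processed
assert second == ["e1", "e2", "e3"]
```

This is why consumers are usually written to be idempotent: the duplicate delivery of e1 is by design, not a bug.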

Consumer groups allow multiple consumers to read from the same topic in parallel. Each consumer in a group reads from a subset of partitions. If a group has more consumers than partitions, the extra consumers sit idle with no partitions to read. If a group has fewer consumers than partitions, each consumer reads from multiple partitions. Rebalancing happens when consumers join or leave. During rebalancing, consumers stop reading briefly. This is unavoidable with Kafka's design. Applications need to tolerate brief pauses. Consumer lag indicates how far behind a consumer is: if lag is zero, the consumer is caught up; high lag means the consumer is slow and falling behind.
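
Consumer lag is simple arithmetic over offsets, which is worth seeing explicitly. The topic name and offset values below are made up for illustration.

```python
def consumer_lag(log_end_offsets, committed_offsets):
    """Lag per partition = latest offset in the log minus the
    consumer group's committed offset. Zero means caught up."""
    return {p: log_end_offsets[p] - committed_offsets.get(p, 0)
            for p in log_end_offsets}

lag = consumer_lag({"orders-0": 1500, "orders-1": 980},
                   {"orders-0": 1500, "orders-1": 400})
assert lag == {"orders-0": 0, "orders-1": 580}  # partition 1 is behind
```

Monitoring tools compute exactly this difference per partition; a lag that only ever grows is the signal that the group needs more consumers or faster processing.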

Real-World Use Cases for Kafka

Event sourcing stores all changes to state as immutable events. Instead of updating a user record, you emit a user-updated event. All events for a user are stored in Kafka. To get current state, you replay all events from the beginning. This enables powerful features: audit trails (see exactly what changed and when), temporal queries (what was the state at time X), and recovery (if you fix a bug in your event processing logic, replay events with the fixed code). Event sourcing is complex but powerful for critical state like payments or user identity.
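
A minimal sketch of replaying events to rebuild state, under an assumed (hypothetical) user-event schema:

```python
def replay(events):
    """Reconstruct current state by folding over the full event
    history, instead of storing mutable records."""
    state = {}
    for event in events:
        user = state.setdefault(event["user_id"],
                                {"status": None, "email": None})
        if event["type"] == "user-created":
            user.update(status="active", email=event["email"])
        elif event["type"] == "user-deactivated":
            user["status"] = "inactive"
    return state

history = [
    {"type": "user-created", "user_id": "u1", "email": "a@example.com"},
    {"type": "user-deactivated", "user_id": "u1"},
]
assert replay(history)["u1"]["status"] == "inactive"
# Temporal query: replay only a prefix of events for state-at-time-X.
assert replay(history[:1])["u1"]["status"] == "active"
```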

Change data capture captures changes from databases and streams them to Kafka. A database change (row inserted, updated, or deleted) becomes a Kafka event. Systems can react to these changes in real time. For example, when a customer is created in the operational database, a CDC stream captures that event. Multiple services (email service, analytics, recommendations) subscribe to the stream and react. This decouples systems: the database does not call each service directly; it publishes events that services consume. Kafka makes CDC practical at scale. Tools like Debezium capture changes from MySQL, PostgreSQL, Oracle, and other databases.

Real-time analytics processes events as they arrive, computing aggregate statistics. Instead of waiting for a daily batch job, you get up-to-the-minute metrics. A commerce site might compute per-second revenue or per-minute top-selling products from Kafka event streams. Stream processors like Kafka Streams or Flink consume Kafka events, compute aggregations, and output results to dashboards or databases. This enables real-time insights and rapid response to trends.
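
A tumbling-window aggregation of the kind a stream processor runs continuously can be sketched like this; the per-minute bucketing and the sale events are illustrative assumptions.

```python
from collections import Counter

def per_minute_revenue(events):
    """Tumbling one-minute windows: bucket each sale by its minute and
    sum revenue per bucket — the kind of aggregation a stream
    processor computes continuously over a Kafka topic."""
    totals = Counter()
    for ts, amount in events:       # ts in epoch seconds
        totals[ts // 60] += amount  # integer division = minute bucket
    return dict(totals)

sales = [(0, 10.0), (30, 5.0), (65, 20.0)]
assert per_minute_revenue(sales) == {0: 15.0, 1: 20.0}
```

A real stream processor does the same fold incrementally as events arrive, emitting updated window totals instead of recomputing from scratch.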

Log aggregation centralizes logs from many services. Instead of SSH-ing into servers to read log files, all logs flow to Kafka. Log processing services consume Kafka and write to Elasticsearch or other storage. This is critical for monitoring and debugging. When something goes wrong, you search logs to understand what happened. Kafka handles the scale: thousands of services producing logs.

Microservices communication uses Kafka instead of direct service-to-service calls. When a service needs to notify others of an event, it publishes to Kafka. Other services subscribe. This is loosely coupled: services do not need to know about each other or their addresses. They only care about events. This enables independent scaling and deployment. If one service is slow, others are not blocked waiting for it; they consume events at their own pace. Building event-driven microservice architectures is a major Kafka use case.

Kafka vs. Alternative Technologies

RabbitMQ is a traditional message broker. It is mature, widely used, and handles point-to-point messaging well. RabbitMQ is simpler to deploy initially and has lower latency. However, RabbitMQ deletes messages once consumed. If you add a new consumer later, it cannot read old messages. This limitation makes RabbitMQ unsuitable for event sourcing, replay, or event-driven architectures where multiple consumers need the same events. RabbitMQ scales vertically (bigger machines) better than horizontally. For simple queue use cases, RabbitMQ is often better than Kafka: less to operate, lower latency, clearer semantics.

AWS SQS is a managed queue service. It is even simpler than RabbitMQ: you do not manage infrastructure. SQS is good for decoupling services with simple point-to-point messaging. But like RabbitMQ, it deletes messages after consumption and does not support replay. Google Cloud Pub/Sub is more Kafka-like: it retains messages and supports multiple subscribers. Azure Event Hubs is Microsoft's Kafka equivalent. Both are managed services that reduce operational burden compared to self-hosted Kafka.

The choice comes down to requirements. Use Kafka if you need: high throughput (millions of events per second), message replay, multiple independent consumers, or event-driven architecture. Use RabbitMQ if you need: simplicity, lower latency, or point-to-point messaging. Use SQS if you want fully managed and do not need replay. Use Pub/Sub or Event Hubs if you want the Kafka model but do not want to operate it yourself. Many companies use multiple: Kafka for core data streams, RabbitMQ for service communication, and SQS for durable work queues. They are complementary, not competitive.

Kafka is not for everything. It is not suitable for request-response communication (use HTTP or gRPC). It is not suitable for serving state queries (use a database). It is not suitable for very low-latency trading (use shared memory or custom networks). But for anything involving streams of events at scale, Kafka is often the right choice. Its dominance in the industry reflects that it solves a real problem well.

Replication and Delivery Guarantees Explained

Kafka stores each partition on multiple brokers. Replication factor is the number of copies. With a factor of three, three brokers store each partition. One is the leader (handles writes and reads), the other two are followers (replicate data). If the leader fails, a follower becomes the new leader. This protects against broker failures. If a broker crashes, other brokers have the data. Replication is automatic: you specify the factor and Kafka distributes replicas across brokers.

Write acknowledgments control consistency. A producer can wait for acknowledgment from the leader only (acks=1), meaning the write is recorded but not yet replicated. This is fast but risky: if the leader crashes before replicating, the message is lost. Or a producer can wait for all in-sync replicas to acknowledge (acks=all), meaning the data is replicated before the write succeeds. This is slower but safe. The practical middle ground is acks=all combined with min.insync.replicas=2 (given a replication factor of three): a write must reach a majority of replicas, but a single lagging replica does not block it. This balances speed and safety.

Delivery guarantees describe what happens if failures occur. At-least-once means every message is delivered, but duplicates are possible. If a consumer crashes after processing but before committing the offset, it reprocesses when it restarts. Most Kafka use cases tolerate duplicates and ensure their applications are idempotent (processing twice has the same effect as once). Exactly-once is harder: it requires coordinating Kafka and the application. Kafka Streams handles exactly-once for stream processing. For writing to external systems, exactly-once requires careful implementation: using distributed transactions or idempotent writes. Most real systems use at-least-once with idempotent consumers, which is simpler and more practical.
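
An idempotent consumer can be as simple as remembering which event IDs it has already applied. In production the "seen" set would usually be a unique key enforced in the same database transaction as the write; this sketch keeps it in memory.

```python
class IdempotentConsumer:
    """At-least-once delivery plus idempotent processing: track which
    event IDs have been handled, so a redelivered event is a no-op."""
    def __init__(self):
        self.processed_ids = set()  # in production: a DB unique constraint
        self.balance = 0

    def handle(self, event_id, amount):
        if event_id in self.processed_ids:
            return  # duplicate delivery: processing twice == once
        self.balance += amount
        self.processed_ids.add(event_id)

c = IdempotentConsumer()
c.handle("evt-1", 100)
c.handle("evt-1", 100)  # redelivered after a crash-and-restart
c.handle("evt-2", 50)
assert c.balance == 150  # the duplicate had no effect
```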

Replication is automatic but configuration matters. Too much replication (factor of five) is expensive and slow. Too little (factor of one) is risky. Factor of three is standard. The min.insync.replicas setting affects durability: how many in-sync replicas must acknowledge a write before it succeeds. This is a critical setting: too strict and writes are slow, too loose and durability is compromised. Default settings are usually reasonable, but critical topics might warrant different settings. Understanding replication is essential to operating Kafka reliably.

Challenges When Operating and Using Kafka at Scale

Operational complexity grows significantly as Kafka scales. Running a three-broker test cluster is straightforward. Running a 50-broker production cluster handling billions of events is complex. You need monitoring (alerting on high consumer lag, under-replicated partitions, broker failures), capacity planning (predicting when you need more brokers), and troubleshooting procedures (what to do when a broker is stuck or network connectivity is flaky). Most organizations use managed Kafka services precisely because self-hosting is complex. If you self-host, plan for dedicated operations expertise.

Consumer lag is a constant challenge. Slow consumers fall behind and lag grows. If lag exceeds your retention period, consumers lose messages. The usual fix is adding consumers, but rebalancing causes brief pauses, and large rebalances can take minutes, during which consumers do not process. Some teams deliberately pause consumption and drain in-flight work during a rebalance, which avoids duplicate processing but slows consumers further. Getting this right requires understanding your specific workloads and configuring accordingly. There is no universal solution.

Schema evolution breaks consumers unprepared for new fields. If you add a field to events, old consumers might not understand it. This seems minor but causes real issues. Using a schema registry helps: it enforces compatibility rules so producers cannot make breaking changes. Without it, surprises happen at 3am. This is often overlooked until it bites you.

Storage management is another challenge. With high volume and long retention, disk usage grows rapidly. A system ingesting one million events per second with seven-day retention uses hundreds of terabytes. You need to plan storage capacity and retention policies. Keeping all data forever is not practical. Setting retention too short loses data you later want. Getting this balance right requires understanding your use cases and planning ahead. Cost surprises happen if you do not plan.

Debugging failures is hard. When something goes wrong in a Kafka cluster, isolating the cause requires understanding distributed systems. Is it a leader election issue, replication lag, network partition, broker crash? Logs are verbose. Network issues are particularly hard: you cannot easily detect them with monitoring; you just see latency spikes and timeouts. Building expertise takes time and experience.

Best Practices for Kafka

  • Start with a managed Kafka service (Confluent Cloud, AWS MSK) rather than self-hosting unless you have infrastructure expertise and significant scale justifying the operational investment.

  • Design topics with appropriate replication factor (usually three) and retention policies based on your use cases and storage constraints, and monitor retention to catch unexpected growth.

  • Partition strategically using keys to ensure related events stay ordered while distributing load; avoid single-partition topics unless you specifically need strict ordering across all events.

  • Use consumer groups to parallelize processing and distribute load across multiple instances with proper rebalancing configuration and lag monitoring.

  • Implement schema management from the start using a schema registry to prevent breaking changes and handle schema evolution as your events evolve.

Common Misconceptions About Kafka

  • Kafka is a message queue like RabbitMQ; it is actually an event streaming platform designed for durability, replay, and multiple independent consumers.

  • You should use Kafka for all data movement; sometimes a simpler tool like RabbitMQ or a database is more appropriate for your specific problem.

  • Exactly-once delivery is always necessary and always achievable; in practice, at-least-once with idempotent consumers is simpler and usually sufficient.

  • Operating Kafka is not much harder than operating a database; at scale Kafka requires specialized expertise in distributed systems and is often better managed by third-party services.

  • Kafka solves all your event stream problems once you install it; the real work is designing topics, partitions, consumer groups, and handling failure cases correctly.

Frequently Asked Questions (FAQs)

Is Kafka a message queue or an event streaming platform?

Kafka is often called a message queue, but that is technically inaccurate. A message queue (like RabbitMQ) stores messages temporarily and deletes them once consumed. Kafka is an event streaming platform. It stores data persistently in a distributed log, allowing multiple consumers to read the same events independently and at their own pace.

This is the fundamental difference. With RabbitMQ, if you add a new consumer after messages have been processed, you cannot replay them. With Kafka, you can. This makes Kafka suitable for building systems where history matters: audit logs, event sourcing, and rebuilding state. Message queues solve point-to-point communication problems. Kafka solves broadcast and replay problems at scale.

The terminology matters because it affects how you use the tool. If you treat Kafka like a message queue, you will not leverage its strengths in replay and event sourcing. If you treat it like a distributed log, you can build powerful systems on top of it.

How do Kafka topics, partitions, and consumer groups work?

A Kafka topic is like a named channel or stream of events. All events of a certain type (say, user clicks) go into one topic. Topics are divided into partitions, which are ordered sequences of events. Each event is assigned to a partition based on its key. Events with the same key always go to the same partition, ensuring order for that key.

Consumer groups are teams of consumers that together read from a topic. Each partition is read by exactly one consumer in a group, distributing the load. If you have ten partitions and five consumers in a group, each consumer reads from two partitions. If you add more consumers, Kafka rebalances automatically.

This design allows Kafka to scale horizontally: add more partitions and more consumers and distribute load automatically. Understanding these three concepts (topics, partitions, consumer groups) is essential to using Kafka effectively. They are the building blocks that enable both ordering and parallelism.

What are common use cases for Kafka?

Event sourcing stores all changes to state as immutable events, allowing you to reconstruct state at any point in time. Kafka suits this well because events can be retained for as long as your retention policy allows. Change data capture (CDC) captures changes from databases and streams them elsewhere, useful for keeping multiple systems in sync.

Real-time analytics processes events as they arrive rather than waiting for batch jobs. Log aggregation sends logs from many services to a central location; tools like Elasticsearch often consume them from Kafka. Event-driven applications let services react to events published by other services.

Stream processing transforms events as they flow through: filtering, enriching, aggregating. Microservices use Kafka for asynchronous communication: one service publishes an event, others respond to it. Kafka enables these patterns by guaranteeing durability, ordering, and replay. Any system that needs to process data streams at scale can benefit from Kafka.

How does Kafka compare to RabbitMQ and other message brokers?

RabbitMQ is a traditional message queue. It is good at point-to-point messaging, has lower latency, and is simpler to deploy initially. But RabbitMQ deletes messages once consumed, making it unsuitable for event sourcing or replay. RabbitMQ scales vertically (bigger machines) better than horizontally.

Kafka is an event streaming platform built for durability and replay at massive scale. Kafka retains messages for configurable periods, allows multiple consumers to read independently, and scales horizontally by adding partitions and brokers. Kafka has higher latency than RabbitMQ (typically single-digit milliseconds versus sub-millisecond) but handles far more throughput.

AWS SQS is a managed queue service. It is simpler than RabbitMQ but still follows the queue model. Google Cloud Pub/Sub combines queue and streaming: it looks like a queue but allows replaying messages. Azure Event Hubs is Microsoft's streaming platform, very similar to Kafka. Choose Kafka if you need durability, replay, and high throughput. Choose RabbitMQ if you need simplicity and lower latency for point-to-point work. SQS is good if you want fully managed and do not want to operate infrastructure.

What does 'exactly-once' delivery mean in Kafka?

Message delivery guarantees are critical in streaming. At-most-once means a message might not be delivered, but if it is, it is delivered once. At-least-once means every message is delivered, but some might be delivered multiple times. Exactly-once means every message is delivered exactly once, no more, no less.

Kafka guarantees at-least-once delivery by default: if a consumer dies, it can restart and reprocess the last few messages. Achieving exactly-once is harder because you need to coordinate two systems: the message broker and whatever system you are writing to. If you write to a database and then commit to Kafka, network failure could cause you to commit twice.

The solution is idempotency: make your writes idempotent (writing twice has the same effect as writing once) or use distributed transactions. Kafka Streams has built-in exactly-once semantics for stream processing. For exactly-once end-to-end, you need to coordinate carefully. Most systems tolerate occasional duplicates instead of implementing exactly-once. It is one of the hardest problems in distributed systems.

What is Kafka replication and why does it matter?

Replication means Kafka stores each partition on multiple broker nodes. If one broker fails, another has a copy. The replication factor (number of copies) is configurable. A replication factor of three means three copies exist. If two brokers fail, data is not lost. Replication is critical for durability: you need it to survive broker failures.

The trade-off is that replication increases storage costs and network traffic. Each write must be replicated to multiple nodes. Write latency depends on your replication settings: if you require all replicas to acknowledge the write before returning, latency is higher. If you require only one replica (the leader) to acknowledge, latency is lower but durability is reduced.

Configuration depends on your use case: critical data gets high replication, less critical data gets less. Kafka handles replication automatically. You specify the replication factor and Kafka distributes partitions across brokers to ensure replicas are on different nodes. This is one of the engineering problems that made Kafka complex but powerful.

What tools exist for managing Kafka in production?

Operating Kafka yourself requires managing brokers, monitoring them, handling failures, and scaling. This is why managed Kafka services exist. Confluent Cloud is a managed Kafka offering from the creators of Kafka. You do not manage infrastructure; Confluent handles that. Amazon MSK (Managed Streaming for Kafka) is AWS's managed Kafka. It handles provisioning and operations.

Azure Event Hubs is Microsoft's alternative. Aiven is a third-party Kafka hosting service. These services reduce operational burden significantly. The trade-off is cost: managed services cost more than self-hosted Kafka. For most organizations, the cost is worth it because operating Kafka at scale is complex. You need monitoring, alerting, backup strategies, upgrade procedures. Managed services handle all of this.

If you do self-host, tooling like CMAK (formerly Kafka Manager), Kafdrop, Confluent Control Center, or Prometheus helps with monitoring and operations. The general recommendation is to use a managed service unless you have significant infrastructure expertise and scale that justifies self-hosting. Most organizations find managed services more cost-effective when accounting for operational overhead.

What is Kafka Streams and how is it different from standalone Kafka?

Kafka Streams is a library for building stream processing applications. Instead of writing custom consumers that read from Kafka and process events, Kafka Streams provides abstractions (streams, tables, topologies) that make this easier. You write code that defines how events flow and transform, then run it as an application. It is simpler than building custom consumers and provides exactly-once semantics automatically.

ksqlDB, from Confluent, is a SQL-based stream processor that runs on top of Kafka. Instead of writing code, you write SQL queries to transform streams. It is simpler than Kafka Streams but less flexible. Apache Flink is an alternative stream processing engine. It is more powerful and sophisticated than Kafka Streams but also more complex.

The choice depends on complexity and preference. For simple transformations, Kafka Streams or KsqlDB work well. For complex event processing, Flink or other engines might be better. All of these tools read from Kafka topics and produce to Kafka topics, making Kafka the central data infrastructure that many stream processors feed from and write to.

How do you decide between Kafka and other data infrastructure?

Kafka is not always the right choice. Use Kafka when you need: event streaming at scale (thousands of events per second), replay of historical events, multiple independent consumers reading from the same stream, durable log storage, or building event-driven applications. Do not use Kafka if you need: simple point-to-point messaging (use RabbitMQ or SQS), real-time request-response (use HTTP or gRPC), or batch data processing on stored data (use a data warehouse or Spark).

If you have less than a thousand events per second, Kafka might be overkill; RabbitMQ or managed queues might be simpler. If you have terabytes of historical data that needs processing, a data warehouse might be more appropriate than Kafka. Kafka excels at the middle ground: high-volume streaming data that many consumers need to access independently.

Many companies use both: Kafka for streaming events, a data warehouse for historical analysis. They are complementary, not competitive. The right architecture uses the right tool for each problem. Kafka is powerful for streaming, data warehouses are powerful for analytics, databases are powerful for transactional consistency. Combining them gives you the best of each.

How do you handle schema evolution in Kafka?

As your events evolve, their structure changes. A user event might add a new field. Consumers need to handle both old and new versions. Schema registries like Confluent Schema Registry or the AWS Glue Schema Registry help. They store all versions of a schema and enforce compatibility rules.

Avro is a data format that works with schema registries. It is compact and includes schema information. Protocol Buffers (protobuf) and JSON Schema are alternatives. Without schema management, adding fields breaks consumers that do not expect them. With a schema registry, you define compatibility rules: consumers on the new schema must be able to read messages written with the old schema (backward compatibility), consumers on the old schema must be able to read messages written with the new schema (forward compatibility), or both must hold (full compatibility).

Most systems use backward compatibility: consumers built against the new schema can still read messages written with the old one, so you upgrade consumers before producers. Schema management is underrated. Ignoring it causes production issues when you add fields or change types without coordinating consumers. Invest in schema management from day one.
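
A backward-compatible read, sketched with plain dictionaries standing in for deserialized records (the field names and schema versions are hypothetical):

```python
def read_user_event(raw: dict) -> dict:
    """A consumer built against schema v2 (which added "country")
    reading a v1 message: the new optional field falls back to a
    default, which is what backward compatibility requires."""
    return {
        "user_id": raw["user_id"],
        "email": raw["email"],
        "country": raw.get("country", "unknown"),  # added in v2, with default
    }

old_message = {"user_id": "u1", "email": "a@example.com"}          # schema v1
new_message = {"user_id": "u2", "email": "b@example.com",
               "country": "DE"}                                     # schema v2
assert read_user_event(old_message)["country"] == "unknown"
assert read_user_event(new_message)["country"] == "DE"
```

In Avro terms, this is why new fields added under backward compatibility must carry a default value.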

What are the performance characteristics of Kafka?

Throughput: Kafka handles millions of events per second on a commodity cluster. This is one of its defining characteristics. With a properly sized cluster, Kafka can ingest and process petabytes of data. Latency: Kafka's latency is milliseconds under load, which is higher than message queues but acceptable for streaming use cases. If you need microsecond latencies, Kafka is not the right tool.

Disk usage: Kafka stores data on disk. With a seven-day retention policy, a system ingesting one million events per second uses about 600 terabytes of storage, assuming roughly one kilobyte per event and before replication. You need to plan storage carefully. This is why retention policies are important: storing forever is not practical. Processing time: reading from Kafka and processing is reasonably fast. The bottleneck is usually the consumer application, not Kafka.
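
That retention figure can be checked with back-of-envelope arithmetic; the one-kilobyte event size is an assumption, and replication multiplies the result by the replication factor.

```python
# Back-of-envelope check for the retention figure above. The event
# size (1 KB) is an assumption; real sizes vary widely.
events_per_sec = 1_000_000
retention_days = 7
bytes_per_event = 1_000

total_bytes = events_per_sec * 86_400 * retention_days * bytes_per_event
terabytes = total_bytes / 1e12
assert round(terabytes) == 605  # ≈ 600 TB before replication

# With replication factor 3, raw disk usage roughly triples.
replicated_tb = terabytes * 3
assert round(replicated_tb) == 1814  # ≈ 1.8 PB across the cluster
```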

Scalability: Kafka scales horizontally. Adding more brokers increases throughput. Adding more partitions lets more consumers read in parallel. These characteristics make Kafka suitable for high-volume streaming but unsuitable for low-latency request-response work or systems that need microsecond timing. Know your performance requirements before committing to Kafka.

How do you monitor and operate Kafka in production?

Key metrics to monitor include: consumer lag (how far behind a consumer is from the latest message), partition leader balance (whether leaders are distributed across brokers), under-replicated partitions (partitions with fewer replicas than configured), and broker availability. High consumer lag indicates your consumers are slow or overwhelmed. If lag keeps growing, you need more consumers.

Under-replicated partitions indicate a broker might be down. Leader imbalance can cause load imbalance. Alerting is crucial. Set up alerts for consumer lag exceeding a threshold, for any under-replicated partitions, for broker availability. Include context in alerts: which topic, which consumer group, what is the trend. Logs are invaluable for debugging.

Enable verbose logging only when debugging production issues, since it impacts performance. Capacity planning matters: monitor disk usage, network bandwidth, and CPU. Kafka is I/O intensive, so as load grows you need faster disks or more brokers. Many teams use Prometheus plus Grafana for monitoring, or their cloud provider's monitoring tools. The key is visibility: you cannot operate Kafka without understanding its state.

What are common mistakes people make with Kafka?

Not thinking about scale: people design single-partition topics and then find they are bottlenecked. Design for the scale you expect, not just what you have now. Ignoring retention policies: without retention limits, Kafka's disks fill up with old data. Set appropriate retention based on your use cases.

Poor consumer design: if your consumer crashes, lag grows. Build resilient consumers with proper error handling and monitoring. Not planning for consumer group rebalancing: when consumers join or leave a group, Kafka rebalances partition assignments, which pauses consumption. Design for these pauses. Using Kafka for everything: not every data problem needs Kafka. Sometimes a database or a simpler queue is more appropriate.

Forgetting about schema evolution: changing event schemas can break consumers that are not prepared for the change. Manage schemas from the start. Not monitoring: Kafka operates fine at small scale without monitoring, but at scale you are blind without metrics, so set up monitoring early. Not planning for growth: you can absorb 10x growth far more easily than 100x. Build headroom into capacity planning.
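One defensive pattern for schema evolution is the tolerant reader: the consumer supplies defaults for fields it does not recognize or that are missing, instead of failing. A minimal sketch with invented field names (a schema registry is the more robust production approach):

```python
import json

def parse_user_event(raw: bytes) -> dict:
    event = json.loads(raw)
    # Tolerant reader: default missing fields rather than crashing
    # when producers start (or stop) sending them.
    return {
        "user_id": event["user_id"],               # required field
        "status": event.get("status", "unknown"),  # optional, with default
        "region": event.get("region", "unset"),    # added in a later schema
    }

old = parse_user_event(b'{"user_id": 1, "status": "active"}')
new = parse_user_event(b'{"user_id": 2, "status": "active", "region": "eu"}')
```

The same consumer code handles events written before and after the `region` field was introduced, which matters in Kafka because old events stay in the log and can be replayed.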

How does Kafka's distributed log model enable event sourcing?

Event sourcing stores all changes to state as immutable events. Instead of storing a user's current status (active, inactive), you store all status changes (created, activated, deactivated). You can replay all events to reconstruct state at any point. Kafka's distributed log is perfect for this: events are appended sequentially, immutable, and persisted.

A user's history is a sequence of events in a Kafka topic. To get current state, you process all events in order. To audit what happened, you examine the event log. To recover from bugs, you replay events (with fixed code) to rebuild state correctly. This pattern is powerful but requires different thinking. You do not update records; you emit events. The database becomes a derived cache of events. Changes become auditable.
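Replaying events to reconstruct state is a left fold over the log. A minimal sketch, using an invented user-status event type:

```python
from functools import reduce

# An event-sourced user status: the current state is never stored
# directly, only derived by folding the events in log order.
events = [
    {"type": "created"},
    {"type": "activated"},
    {"type": "deactivated"},
]

def apply(state: str, event: dict) -> str:
    # Each event maps the previous state to the next one.
    transitions = {
        "created": "inactive",
        "activated": "active",
        "deactivated": "inactive",
    }
    return transitions[event["type"]]

current = reduce(apply, events, "nonexistent")
# Replaying the same immutable log always yields the same state,
# and folding a prefix of the log gives the state at that earlier point.
```

Fixing a bug in `apply` and replaying the full log is exactly the recovery path described above: the events are the source of truth, the derived state is disposable.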

This enables replay, audit, and debugging that traditional databases cannot easily provide. Some systems use event sourcing for everything. Most use it selectively for critical state (payment history, user changes) and traditional databases for less critical state. Kafka is widely used for event sourcing at scale because its append-only log matches the pattern naturally. Understanding this use case reveals why Kafka matters.

What is the learning curve for Kafka and how do you get started?

The basics (topics, producers, consumers) are straightforward. You can get a working example in an hour. Operating Kafka at scale is much harder and requires understanding distributed systems, handling failures, and managing resources. Start by installing Kafka locally (using Docker or native installation), creating a topic, writing a producer that sends messages, and writing a consumer that reads them.

Use the command-line tools to explore. Then build a simple application: maybe a producer that reads from an API and a consumer that writes to a database. The Kafka documentation is excellent. Confluent has good tutorials. Most learning comes from building things. After the basics, learn about consumer groups and partitioning: these are the features that enable scale.
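The command-line exploration mentioned above looks roughly like this; these commands assume a broker already running on localhost:9092 (for example via the official quickstart or a Docker image) and are run from the Kafka installation directory:

```shell
# Create a topic with three partitions.
bin/kafka-topics.sh --bootstrap-server localhost:9092 \
  --create --topic demo-events --partitions 3 --replication-factor 1

# Produce a few messages interactively (Ctrl-C to stop).
bin/kafka-console-producer.sh --bootstrap-server localhost:9092 \
  --topic demo-events

# In another terminal, read the topic from the beginning.
bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 \
  --topic demo-events --from-beginning
```

Rerunning the consumer with `--from-beginning` replays the messages you already read, which is the persistent-log behavior that distinguishes Kafka from a traditional queue.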

Then learn Kafka Streams if you need to process events. Finally, learn operations: monitoring, troubleshooting, capacity planning. This progression takes months, not hours. Kafka is powerful but has a steep learning curve past the basics. The best way to learn is by using it: start with a small use case and expand from there. Do not try to master everything at once.