LS LOGICIEL SOLUTIONS

What Is Apache Kafka?

Definition

Apache Kafka is a distributed event streaming platform. It is not a message queue, though people often call it one. A message queue temporarily stores messages until they are consumed. Kafka stores data persistently in a distributed log, allowing multiple consumers to read the same events independently and replay history. This distinction is fundamental to understanding what Kafka is good for.

At its core, Kafka is a sequence of events stored durably across multiple servers. A producer writes events (a user clicked a button, a payment was processed, a sensor reading arrived). These events flow into Kafka topics. Consumers read from those topics. The key insight is that events remain in Kafka after being read, available to other consumers at different times. With a traditional message queue, once a message is consumed, it is gone. With Kafka, a message can be consumed by hundreds of different consumers at their own pace.

Kafka handles scale. Billions of events per day flow through Kafka clusters at companies like Netflix, LinkedIn, and Uber. It is designed to be a central nervous system for data: the place where everything that happens in your systems becomes visible and accessible. Event streams feed into real-time analytics, machine learning systems, data warehouses, and other
applications.

The platform consists of producers (things that write events), brokers (the servers that store events), topics (named streams), partitions (parallel units within topics), and consumers (things that read events). Understanding these components and how they interact is essential to using
Kafka effectively.

Key Takeaways

  • Kafka is a distributed event streaming platform, not a message queue; it persists events and allows multiple independent consumers to read the same events at different times.
  • Topics contain events, partitions within topics provide parallelism and ordering guarantees, and consumer groups distribute reading load across multiple consumers.
  • Kafka enables event sourcing (storing all changes as events), change data capture (streaming database changes), and real-time analytics at massive scale.
  • Replication stores each partition on multiple brokers, ensuring durability and availability; at-least-once delivery is the default, with exactly-once achievable with careful design.
  • Common use cases include event-driven applications, log aggregation, streaming analytics, microservices communication, and building audit trails of all system changes.
  • Managed Kafka services (Confluent Cloud, AWS MSK, Azure Event Hubs) reduce operational complexity; self-hosted Kafka requires careful monitoring, capacity planning, and expertise.

Understanding Kafka's Architecture: Brokers, Topics, and Partitions

Kafka's architecture centers around a cluster of brokers. Each broker is a server that stores partitions. Together, brokers replicate data, handle producer and consumer requests, and manage coordination. A Kafka cluster typically consists of three or more brokers for high availability. If one broker fails, others continue operating.

Data flows into topics. A topic is a logical stream of events, like a channel. For example, user-events might be a topic for all user interactions; payment-completed might be another. Topics are divided into partitions, which are the unit of parallelism. Each partition is an ordered sequence of events. When you write an event to a topic, it goes to exactly one partition (determined by the event key). All events with the same key go to the same partition, preserving order for that key. This is how Kafka guarantees that all events about a specific user stay in order while still processing millions of users in parallel.
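
To make key-based routing concrete, here is a minimal Python sketch. Kafka's default partitioner hashes keys with murmur2; the crc32 hash and the six-partition topic below are stand-in assumptions for illustration only.

```python
import zlib

NUM_PARTITIONS = 6  # assumed topic size, for illustration

def partition_for(key: bytes, num_partitions: int = NUM_PARTITIONS) -> int:
    """Route a keyed event to one partition. Kafka's default partitioner
    uses murmur2 hashing; crc32 stands in here purely for illustration."""
    return zlib.crc32(key) % num_partitions

# All events carrying the same key hash to the same partition, so that
# key's events stay ordered; different keys spread across partitions.
clicked = partition_for(b"user-42")
logged_out = partition_for(b"user-42")
assert clicked == logged_out            # same key, same partition
assert 0 <= partition_for(b"user-7") < NUM_PARTITIONS
```

Because the routing is a pure function of the key, every producer in the system agrees on where a given user's events go, with no coordination required.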

Replication happens at the partition level. Each partition is replicated to multiple brokers. One broker is the leader (handles writes and reads), others are followers (mirror the data). If the leader fails, one follower becomes the new leader. The replication factor (usually 3) determines how many copies exist. Higher replication means better durability but more storage and network cost. Each broker stores some partitions as a leader and others as a follower, distributing load.

Producers send events to topics. The producer client library handles routing to the correct partition. Consumers read from topics. Multiple consumers can read the same topic independently. Consumer groups coordinate: each partition is read by exactly one consumer in a group. This allows parallel processing. If you have ten partitions and five consumers in a group, each consumer reads two partitions. If you add consumers, Kafka rebalances automatically, redistributing partitions. This design scales horizontally: add more partitions and more consumers and load distributes automatically.
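
The partition-to-consumer assignment can be sketched as simple round-robin, one of the classic strategies Kafka supports. The real protocol involves a coordinator broker and considerably more bookkeeping; this is an illustrative model only.

```python
def assign_partitions(partitions, consumers):
    """Round-robin assignment: each partition goes to exactly one
    consumer in the group. (Kafka ships several assignors; range and
    round-robin are the classic ones — this sketches round-robin.)"""
    assignment = {c: [] for c in consumers}
    for i, p in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment

partitions = list(range(10))
consumers = ["c0", "c1", "c2", "c3", "c4"]
assignment = assign_partitions(partitions, consumers)
assert all(len(ps) == 2 for ps in assignment.values())  # 10 / 5 = 2 each

# A consumer joining triggers a "rebalance": recompute the assignment.
assignment = assign_partitions(partitions, consumers + ["c5"])
assert sum(len(ps) for ps in assignment.values()) == 10
```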

How Producers and Consumers Work in Kafka

Producers are applications or services that write events to Kafka. A web server might emit user-clicked events. A database might emit change events. A sensor might emit readings. Producers send events to a topic. The producer client specifies the topic, the value (the event data), and optionally a key (used to determine the partition). The producer client library batches events and sends them in bulk to improve efficiency. Configuration options control batching behavior, retry logic, and how long to wait for broker acknowledgment before returning.

The producer sends to a partition leader. The leader replicates the write to followers. Configuration controls consistency: you can wait for all in-sync replicas to acknowledge (acks=all, safe but slower), only the leader (acks=1, fast but risky), or no acknowledgment at all (acks=0). This is a fundamental trade-off in distributed systems: durability versus latency. Most production setups use acks=all together with min.insync.replicas=2, balancing both concerns. If a producer cannot reach Kafka, it retries automatically. Idempotent producers ensure that duplicates caused by retries do not become duplicate events in Kafka: brokers track producer sequence numbers and drop retried writes, which adds a little overhead but guarantees no retry-induced duplicates reach the log.
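
Broker-side deduplication for idempotent producers can be modeled roughly like this. Real brokers track a producer ID, epoch, and per-partition sequence numbers; this toy version keeps only the last sequence per producer.

```python
class Broker:
    """Toy model of broker-side producer deduplication: the broker
    remembers the last sequence number appended per producer and drops
    a retry that re-sends it (how Kafka's idempotent producer avoids
    duplicates, greatly simplified)."""
    def __init__(self):
        self.log = []
        self.last_seq = {}  # producer_id -> last appended sequence

    def append(self, producer_id, seq, event):
        if self.last_seq.get(producer_id) == seq:
            return  # duplicate caused by a retry: drop it
        self.log.append(event)
        self.last_seq[producer_id] = seq

broker = Broker()
broker.append("p1", 0, "order-created")
broker.append("p1", 0, "order-created")  # retry after a timed-out ack
broker.append("p1", 1, "order-paid")
assert broker.log == ["order-created", "order-paid"]  # no duplicate
```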

Consumers are applications that read events from Kafka. A consumer subscribes to topics and reads events in order. Consumer client libraries handle connecting to brokers, maintaining offsets (position in the partition), and rebalancing when consumers join or leave. As a consumer reads events, it periodically commits its offset. If the consumer crashes and restarts, it resumes from the last committed offset; any events processed after that commit are processed again (at-least-once is the default guarantee). Offsets are stored in an internal Kafka topic, making offset management simple and reliable.
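
The at-least-once behaviour described above can be simulated in a few lines: processing happens first, the offset commit second, and a crash between the two causes reprocessing on restart. This is an illustrative model, not client code.

```python
def consume(log, committed_offset, crash_before_commit_at=None):
    """Read from the last committed offset; commit after processing.
    A crash between processing and committing means those events are
    processed again on restart — Kafka's at-least-once behaviour."""
    processed = []
    offset = committed_offset
    for event in log[offset:]:
        processed.append(event)
        offset += 1
        if offset == crash_before_commit_at:
            return processed, committed_offset  # crashed: commit lost
        committed_offset = offset  # commit the new position
    return processed, committed_offset

log = ["e0", "e1", "e2", "e3"]
first, committed = consume(log, 0, crash_before_commit_at=2)  # die after e1
assert first == ["e0", "e1"] and committed == 1  # e1's commit never landed
second, committed = consume(log, committed)  # restart: e1 is re-processed
assert second == ["e1", "e2", "e3"]
```

This is why consumers are usually written to be idempotent: the duplicate delivery of e1 is by design, not a bug.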

Consumer groups allow multiple consumers to read from the same topic in parallel. Each consumer in a group reads from a subset of partitions. If a group has more consumers than partitions, the extra consumers sit idle with no partitions to read. If a group has fewer consumers than partitions, each consumer reads from multiple partitions. Rebalancing happens when consumers join or leave. During rebalancing, consumers stop reading briefly. This is unavoidable with Kafka's design. Applications need to tolerate brief pauses. Consumer lag indicates how far behind a consumer is: if lag is zero, the consumer is caught up; high lag means the consumer is slow and falling behind.
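
Consumer lag is simple arithmetic over offsets, which is worth seeing explicitly. The topic name and offset values below are made up for illustration.

```python
def consumer_lag(log_end_offsets, committed_offsets):
    """Lag per partition = latest offset in the log minus the
    consumer group's committed offset. Zero means caught up."""
    return {p: log_end_offsets[p] - committed_offsets.get(p, 0)
            for p in log_end_offsets}

lag = consumer_lag({"orders-0": 1500, "orders-1": 980},
                   {"orders-0": 1500, "orders-1": 400})
assert lag == {"orders-0": 0, "orders-1": 580}  # partition 1 is behind
```

Monitoring tools compute exactly this difference per partition; a lag that only ever grows is the signal that the group needs more consumers or faster processing.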

Real-World Use Cases for Kafka

Event sourcing stores all changes to state as immutable events. Instead of updating a user record, you emit a user-updated event. All events for a user are stored in Kafka. To get current state, you replay all events from the beginning. This enables powerful features: audit trails (see exactly what changed and when), temporal queries (what was the state at time X), and recovery (if you fix a bug in your event processing logic, replay events with the fixed code). Event sourcing is complex but powerful for critical state like payments or user identity.
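
A minimal sketch of replaying events to rebuild state, under an assumed (hypothetical) user-event schema:

```python
def replay(events):
    """Reconstruct current state by folding over the full event
    history, instead of storing mutable records."""
    state = {}
    for event in events:
        user = state.setdefault(event["user_id"],
                                {"status": None, "email": None})
        if event["type"] == "user-created":
            user.update(status="active", email=event["email"])
        elif event["type"] == "user-deactivated":
            user["status"] = "inactive"
    return state

history = [
    {"type": "user-created", "user_id": "u1", "email": "a@example.com"},
    {"type": "user-deactivated", "user_id": "u1"},
]
assert replay(history)["u1"]["status"] == "inactive"
# Temporal query: replay only a prefix of events for state-at-time-X.
assert replay(history[:1])["u1"]["status"] == "active"
```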

Change data capture captures changes from databases and streams them to Kafka. A database change (row inserted, updated, or deleted) becomes a Kafka event. Systems can react to these changes in real time. For example, when a customer is created in the operational database, a CDC stream captures that event. Multiple services (email service, analytics, recommendations) subscribe to the stream and react. This decouples systems: the database does not call each service directly; it publishes events that services consume. Kafka makes CDC practical at scale. Tools like Debezium capture changes from MySQL, PostgreSQL, Oracle, and other databases.

Real-time analytics processes events as they arrive, computing aggregate statistics. Instead of waiting for a daily batch job, you get up-to-the-minute metrics. A commerce site might compute per-second revenue or per-minute top-selling products from Kafka event streams. Stream processors like Kafka Streams or Flink consume Kafka events, compute aggregations, and output results to dashboards or databases. This enables real-time insights and rapid response to trends.
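
A tumbling-window aggregation of the kind a stream processor runs continuously can be sketched like this; the per-minute bucketing and the sale events are illustrative assumptions.

```python
from collections import Counter

def per_minute_revenue(events):
    """Tumbling one-minute windows: bucket each sale by its minute and
    sum revenue per bucket — the kind of aggregation a stream
    processor computes continuously over a Kafka topic."""
    totals = Counter()
    for ts, amount in events:       # ts in epoch seconds
        totals[ts // 60] += amount  # integer division = minute bucket
    return dict(totals)

sales = [(0, 10.0), (30, 5.0), (65, 20.0)]
assert per_minute_revenue(sales) == {0: 15.0, 1: 20.0}
```

A real stream processor does the same fold incrementally as events arrive, emitting updated window totals instead of recomputing from scratch.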

Log aggregation centralizes logs from many services. Instead of SSH-ing into servers to read log files, all logs flow to Kafka. Log processing services consume Kafka and write to Elasticsearch or other storage. This is critical for monitoring and debugging. When something goes wrong, you search logs to understand what happened. Kafka handles the scale: thousands of services producing logs.

Microservices communication uses Kafka instead of direct service-to-service calls. When a service needs to notify others of an event, it publishes to Kafka. Other services subscribe. This is loosely coupled: services do not need to know about each other or their addresses. They only care about events. This enables independent scaling and deployment. If one service is slow, others are not blocked waiting for it; they consume events at their own pace. Building event-driven microservice architectures is a major Kafka use case.

Kafka vs. Alternative Technologies

RabbitMQ is a traditional message broker. It is mature, widely used, and handles point-to-point messaging well. RabbitMQ is simpler to deploy initially and has lower latency. However, RabbitMQ deletes messages once consumed. If you add a new consumer later, it cannot read old messages. This limitation makes RabbitMQ unsuitable for event sourcing, replay, or event-driven architectures where multiple consumers need the same events. RabbitMQ scales vertically (bigger machines) better than horizontally. For simple queue use cases, RabbitMQ is often better than Kafka: less to operate, lower latency, clearer semantics.

AWS SQS is a managed queue service. It is even simpler than RabbitMQ: you do not manage infrastructure. SQS is good for decoupling services with simple point-to-point messaging. But like RabbitMQ, it deletes messages after consumption and does not support replay. Google Cloud Pub/Sub is more Kafka-like: it retains messages and supports multiple subscribers. Azure Event Hubs is Microsoft's Kafka equivalent. Both are managed services that reduce operational burden compared to self-hosted Kafka.

The choice comes down to requirements. Use Kafka if you need: high throughput (millions of events per second), message replay, multiple independent consumers, or event-driven architecture. Use RabbitMQ if you need: simplicity, lower latency, or point-to-point messaging. Use SQS if you want fully managed and do not need replay. Use Pub/Sub or Event Hubs if you want the Kafka model but do not want to operate it yourself. Many companies use multiple: Kafka for core data streams, RabbitMQ for service communication, and SQS for durable work queues. They are complementary, not competitive.

Kafka is not for everything. It is not suitable for request-response communication (use HTTP or gRPC). It is not suitable for serving state queries (use a database). It is not suitable for very low-latency trading (use shared memory or custom networks). But for anything involving streams of events at scale, Kafka is often the right choice. Its dominance in the industry reflects that it solves a real problem well.

Replication and Delivery Guarantees Explained

Kafka stores each partition on multiple brokers. Replication factor is the number of copies. With a factor of three, three brokers store each partition. One is the leader (handles writes and reads), the other two are followers (replicate data). If the leader fails, a follower becomes the new leader. This protects against broker failures. If a broker crashes, other brokers have the data. Replication is automatic: you specify the factor and Kafka distributes replicas across brokers.

Write acknowledgments control consistency. A producer can wait for acknowledgment from the leader only (acks=1), meaning the write is recorded but not yet replicated. This is fast but risky: if the leader crashes before replicating, the message is lost. Or a producer can wait for all in-sync replicas to acknowledge (acks=all), meaning the data is replicated before the write succeeds. This is slower but safe. The practical middle ground is acks=all combined with min.insync.replicas=2 (given a replication factor of three): a write must reach a majority of replicas, but a single lagging replica does not block it. This balances speed and safety.

Delivery guarantees describe what happens if failures occur. At-least-once means every message is delivered, but duplicates are possible. If a consumer crashes after processing but before committing the offset, it reprocesses when it restarts. Most Kafka use cases tolerate duplicates and ensure their applications are idempotent (processing twice has the same effect as once). Exactly-once is harder: it requires coordinating Kafka and the application. Kafka Streams handles exactly-once for stream processing. For writing to external systems, exactly-once requires careful implementation: using distributed transactions or idempotent writes. Most real systems use at-least-once with idempotent consumers, which is simpler and more practical.
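
An idempotent consumer can be as simple as remembering which event IDs it has already applied. In production the "seen" set would usually be a unique key enforced in the same database transaction as the write; this sketch keeps it in memory.

```python
class IdempotentConsumer:
    """At-least-once delivery plus idempotent processing: track which
    event IDs have been handled, so a redelivered event is a no-op."""
    def __init__(self):
        self.processed_ids = set()  # in production: a DB unique constraint
        self.balance = 0

    def handle(self, event_id, amount):
        if event_id in self.processed_ids:
            return  # duplicate delivery: processing twice == once
        self.balance += amount
        self.processed_ids.add(event_id)

c = IdempotentConsumer()
c.handle("evt-1", 100)
c.handle("evt-1", 100)  # redelivered after a crash-and-restart
c.handle("evt-2", 50)
assert c.balance == 150  # the duplicate had no effect
```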

Replication is automatic but configuration matters. Too much replication (factor of five) is expensive and slow. Too little (factor of one) is risky. Factor of three is standard. The min.insync.replicas setting affects durability: how many in-sync replicas must acknowledge a write before it succeeds. This is a critical setting: too strict and writes are slow, too loose and durability is compromised. Default settings are usually reasonable, but critical topics might warrant different settings. Understanding replication is essential to operating Kafka reliably.

Challenges When Operating and Using Kafka at Scale

Operational complexity grows significantly as Kafka scales. Running a three-broker test cluster is straightforward. Running a 50-broker production cluster handling billions of events is complex. You need monitoring (alerting on high consumer lag, under-replicated partitions, broker failures), capacity planning (predicting when you need more brokers), and troubleshooting procedures (what to do when a broker is stuck or network connectivity is flaky). Most organizations use managed Kafka services precisely because self-hosting is complex. If you self-host, plan for dedicated operations expertise.

Consumer lag is a constant challenge. Slow consumers fall behind and lag grows. If lag exceeds your retention period, consumers lose messages. The usual fix is adding consumers, but rebalancing causes brief pauses, and large rebalances can take minutes, during which consumers do not process. Some teams deliberately pause consumption and drain in-flight work during a rebalance, which avoids duplicate processing but slows consumers further. Getting this right requires understanding your specific workloads and configuring accordingly. There is no universal solution.

Schema evolution breaks consumers unprepared for new fields. If you add a field to events, old consumers might not understand it. This seems minor but causes real issues. Using a schema registry helps: it enforces compatibility rules so producers cannot make breaking changes. Without it, surprises happen at 3am. This is often overlooked until it bites you.

Storage management is another challenge. With high volume and long retention, disk usage grows rapidly. A system ingesting one million events per second with seven-day retention uses hundreds of terabytes. You need to plan storage capacity and retention policies. Keeping all data forever is not practical. Setting retention too short loses data you later want. Getting this balance right requires understanding your use cases and planning ahead. Cost surprises happen if you do not plan.

Debugging failures is hard. When something goes wrong in a Kafka cluster, isolating the cause requires understanding distributed systems. Is it a leader election issue, replication lag, network partition, broker crash? Logs are verbose. Network issues are particularly hard: you cannot easily detect them with monitoring; you just see latency spikes and timeouts. Building expertise takes time and experience.

Best Practices for Kafka

  • Start with a managed Kafka service (Confluent Cloud, AWS MSK) rather than self-hosting unless you have infrastructure expertise and significant scale justifying the operational investment.

  • Design topics with appropriate replication factor (usually three) and retention policies based on your use cases and storage constraints, and monitor retention to catch unexpected growth.

  • Partition strategically using keys to ensure related events stay ordered while distributing load; avoid single-partition topics unless you specifically need strict ordering across all events.

  • Use consumer groups to parallelize processing and distribute load across multiple instances with proper rebalancing configuration and lag monitoring.

  • Implement schema management from the start using a schema registry to prevent breaking changes and handle schema evolution as your events evolve.

Common Misconceptions About Kafka

  • Kafka is a message queue like RabbitMQ; it is actually an event streaming platform designed for durability, replay, and multiple independent consumers.

  • You should use Kafka for all data movement; sometimes a simpler tool like RabbitMQ or a database is more appropriate for your specific problem.

  • Exactly-once delivery is always necessary and always achievable; in practice, at-least-once with idempotent consumers is simpler and usually sufficient.

  • Operating Kafka is not much harder than operating a database; at scale Kafka requires specialized expertise in distributed systems and is often better managed by third-party services.

  • Kafka solves all your event stream problems once you install it; the real work is designing topics, partitions, consumer groups, and handling failure cases correctly.

Frequently Asked Questions (FAQs)

Is Kafka a message queue or an event streaming platform?

Kafka is often called a message queue, but that is technically inaccurate. A message queue (like RabbitMQ) stores messages temporarily and deletes them once consumed. Kafka is an event streaming platform. It stores data persistently in a distributed log, allowing multiple consumers to read the same events independently and at their own pace.

This is the fundamental difference. With RabbitMQ, if you add a new consumer after messages have been processed, you cannot replay them. With Kafka, you can. This makes Kafka suitable for building systems where history matters: audit logs, event sourcing, and rebuilding state. Message queues solve point-to-point communication problems. Kafka solves broadcast and replay problems at scale.

The terminology matters because it affects how you use the tool. If you treat Kafka like a message queue, you will not leverage its strengths in replay and event sourcing. If you treat it like a distributed log, you can build powerful systems on top of it.

How do Kafka topics, partitions, and consumer groups work?

A Kafka topic is like a named channel or stream of events. All events of a certain type (say, user clicks) go into one topic. Topics are divided into partitions, which are ordered sequences of events. Each event is assigned to a partition based on its key. Events with the same key always go to the same partition, ensuring order for that key.

Consumer groups are teams of consumers that together read from a topic. Each partition is read by exactly one consumer in a group, distributing the load. If you have ten partitions and five consumers in a group, each consumer reads from two partitions. If you add more consumers, Kafka rebalances automatically.

This design allows Kafka to scale horizontally: add more partitions and more consumers and distribute load automatically. Understanding these three concepts (topics, partitions, consumer groups) is essential to using Kafka effectively. They are the building blocks that enable both ordering and parallelism.

What are common use cases for Kafka?

Event sourcing stores all changes to state as immutable events, allowing you to reconstruct state at any point in time. Kafka suits this well because events can be retained for as long as your retention policy allows. Change data capture (CDC) captures changes from databases and streams them elsewhere, useful for keeping multiple systems in sync.

Real-time analytics processes events as they arrive rather than waiting for batch jobs. Log aggregation sends logs from many services to a central location; tools like Elasticsearch often consume them from Kafka. Event-driven applications let services react to events published by other services.

Stream processing transforms events as they flow through: filtering, enriching, aggregating. Microservices use Kafka for asynchronous communication: one service publishes an event, others respond to it. Kafka enables these patterns by guaranteeing durability, ordering, and replay. Any system that needs to process data streams at scale can benefit from Kafka.

How does Kafka compare to RabbitMQ and other message brokers?

RabbitMQ is a traditional message queue. It is good at point-to-point messaging, has lower latency, and is simpler to deploy initially. But RabbitMQ deletes messages once consumed, making it unsuitable for event sourcing or replay. RabbitMQ scales vertically (bigger machines) better than horizontally.

Kafka is an event streaming platform built for durability and replay at massive scale. Kafka retains messages for configurable periods, allows multiple consumers to read independently, and scales horizontally by adding partitions and brokers. Kafka has higher latency than RabbitMQ (typically single-digit milliseconds versus sub-millisecond) but handles far more throughput.

AWS SQS is a managed queue service. It is simpler than RabbitMQ but still follows the queue model. Google Cloud Pub/Sub combines queue and streaming: it looks like a queue but allows replaying messages. Azure Event Hubs is Microsoft's streaming platform, very similar to Kafka. Choose Kafka if you need durability, replay, and high throughput. Choose RabbitMQ if you need simplicity and lower latency for point-to-point work. SQS is good if you want fully managed and do not want to operate infrastructure.

What does 'exactly-once' delivery mean in Kafka?

Message delivery guarantees are critical in streaming. At-most-once means a message might not be delivered, but if it is, it is delivered once. At-least-once means every message is delivered, but some might be delivered multiple times. Exactly-once means every message is delivered exactly once, no more, no less.

Kafka guarantees at-least-once delivery by default: if a consumer dies, it can restart and reprocess the last few messages. Achieving exactly-once is harder because you need to coordinate two systems: the message broker and whatever system you are writing to. If you write to a database and then commit to Kafka, network failure could cause you to commit twice.

The solution is idempotency: make your writes idempotent (writing twice has the same effect as writing once) or use distributed transactions. Kafka Streams has built-in exactly-once semantics for stream processing. For exactly-once end-to-end, you need to coordinate carefully. Most systems tolerate occasional duplicates instead of implementing exactly-once. It is one of the hardest problems in distributed systems.

What is Kafka replication and why does it matter?

Replication means Kafka stores each partition on multiple broker nodes. If one broker fails, another has a copy. The replication factor (number of copies) is configurable. A replication factor of three means three copies exist. If two brokers fail, data is not lost. Replication is critical for durability: you need it to survive broker failures.

The trade-off is that replication increases storage costs and network traffic. Each write must be replicated to multiple nodes. Write latency depends on your replication settings: if you require all replicas to acknowledge the write before returning, latency is higher. If you require only one replica (the leader) to acknowledge, latency is lower but durability is reduced.

Configuration depends on your use case: critical data gets high replication, less critical data gets less. Kafka handles replication automatically. You specify the replication factor and Kafka distributes partitions across brokers to ensure replicas are on different nodes. This is one of the engineering problems that made Kafka complex but powerful.

What tools exist for managing Kafka in production?

Operating Kafka yourself requires managing brokers, monitoring them, handling failures, and scaling. This is why managed Kafka services exist. Confluent Cloud is a managed Kafka offering from the creators of Kafka. You do not manage infrastructure; Confluent handles that. Amazon MSK (Managed Streaming for Kafka) is AWS's managed Kafka. It handles provisioning and operations.

Azure Event Hubs is Microsoft's alternative. Aiven is a third-party Kafka hosting service. These services reduce operational burden significantly. The trade-off is cost: managed services cost more than self-hosted Kafka. For most organizations, the cost is worth it because operating Kafka at scale is complex. You need monitoring, alerting, backup strategies, upgrade procedures. Managed services handle all of this.

If you do self-host, tooling like CMAK (formerly Kafka Manager), Kafdrop, Confluent Control Center, or Prometheus helps with monitoring and operations. The general recommendation is to use a managed service unless you have significant infrastructure expertise and scale that justifies self-hosting. Most organizations find managed services more cost-effective when accounting for operational overhead.

What is Kafka Streams and how is it different from standalone Kafka?

Kafka Streams is a library for building stream processing applications. Instead of writing custom consumers that read from Kafka and process events, Kafka Streams provides abstractions (streams, tables, topologies) that make this easier. You write code that defines how events flow and transform, then run it as an application. It is simpler than building custom consumers and provides exactly-once semantics automatically.

ksqlDB, from Confluent, is a SQL-based stream processor that runs on top of Kafka. Instead of writing code, you write SQL queries to transform streams. It is simpler than Kafka Streams but less flexible. Apache Flink is an alternative stream processing engine. It is more powerful and sophisticated than Kafka Streams but also more complex.

The choice depends on complexity and preference. For simple transformations, Kafka Streams or KsqlDB work well. For complex event processing, Flink or other engines might be better. All of these tools read from Kafka topics and produce to Kafka topics, making Kafka the central data infrastructure that many stream processors feed from and write to.

How do you decide between Kafka and other data infrastructure?

Kafka is not always the right choice. Use Kafka when you need: event streaming at scale (thousands of events per second), replay of historical events, multiple independent consumers reading from the same stream, durable log storage, or building event-driven applications. Do not use Kafka if you need: simple point-to-point messaging (use RabbitMQ or SQS), real-time request-response (use HTTP or gRPC), or batch data processing on stored data (use a data warehouse or Spark).

If you have less than a thousand events per second, Kafka might be overkill; RabbitMQ or managed queues might be simpler. If you have terabytes of historical data that needs processing, a data warehouse might be more appropriate than Kafka. Kafka excels at the middle ground: high-volume streaming data that many consumers need to access independently.

Many companies use both: Kafka for streaming events, a data warehouse for historical analysis. They are complementary, not competitive. The right architecture uses the right tool for each problem. Kafka is powerful for streaming, data warehouses are powerful for analytics, databases are powerful for transactional consistency. Combining them gives you the best of each.

How do you handle schema evolution in Kafka?

As your events evolve, their structure changes. A user event might add a new field. Consumers need to handle both old and new versions. Schema registries like Confluent Schema Registry or the AWS Glue Schema Registry help. They store all versions of a schema and enforce compatibility rules.

Avro is a data format that works with schema registries. It is compact and includes schema information. Protocol Buffers (protobuf) and JSON Schema are alternatives. Without schema management, adding fields breaks consumers that do not expect them. With a schema registry, you define compatibility rules: consumers on the new schema must be able to read messages written with the old schema (backward compatibility), consumers on the old schema must be able to read messages written with the new schema (forward compatibility), or both must hold (full compatibility).

Most systems use backward compatibility: consumers built against the new schema can still read messages written with the old one, so you upgrade consumers before producers. Schema management is underrated. Ignoring it causes production issues when you add fields or change types without coordinating consumers. Invest in schema management from day one.
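
A backward-compatible read, sketched with plain dictionaries standing in for deserialized records (the field names and schema versions are hypothetical):

```python
def read_user_event(raw: dict) -> dict:
    """A consumer built against schema v2 (which added "country")
    reading a v1 message: the new optional field falls back to a
    default, which is what backward compatibility requires."""
    return {
        "user_id": raw["user_id"],
        "email": raw["email"],
        "country": raw.get("country", "unknown"),  # added in v2, with default
    }

old_message = {"user_id": "u1", "email": "a@example.com"}          # schema v1
new_message = {"user_id": "u2", "email": "b@example.com",
               "country": "DE"}                                     # schema v2
assert read_user_event(old_message)["country"] == "unknown"
assert read_user_event(new_message)["country"] == "DE"
```

In Avro terms, this is why new fields added under backward compatibility must carry a default value.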

What are the performance characteristics of Kafka?

Throughput: Kafka handles millions of events per second on a commodity cluster. This is one of its defining characteristics. With a properly sized cluster, Kafka can ingest and process petabytes of data. Latency: Kafka's latency is milliseconds under load, which is higher than message queues but acceptable for streaming use cases. If you need microsecond latencies, Kafka is not the right tool.

Disk usage: Kafka stores data on disk. With a seven-day retention policy, a system ingesting one million events per second uses about 600 terabytes of storage, assuming roughly one kilobyte per event and before replication. You need to plan storage carefully. This is why retention policies are important: storing forever is not practical. Processing time: reading from Kafka and processing is reasonably fast. The bottleneck is usually the consumer application, not Kafka.
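
That retention figure can be checked with back-of-envelope arithmetic; the one-kilobyte event size is an assumption, and replication multiplies the result by the replication factor.

```python
# Back-of-envelope check for the retention figure above. The event
# size (1 KB) is an assumption; real sizes vary widely.
events_per_sec = 1_000_000
retention_days = 7
bytes_per_event = 1_000

total_bytes = events_per_sec * 86_400 * retention_days * bytes_per_event
terabytes = total_bytes / 1e12
assert round(terabytes) == 605  # ≈ 600 TB before replication

# With replication factor 3, raw disk usage roughly triples.
replicated_tb = terabytes * 3
assert round(replicated_tb) == 1814  # ≈ 1.8 PB across the cluster
```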

Scalability: Kafka scales horizontally. Adding more brokers increases throughput. Adding more partitions lets more consumers read in parallel. These characteristics make Kafka suitable for high-volume streaming but unsuitable for low-latency request-response work or systems that need microsecond timing. Know your performance requirements before committing to Kafka.

How do you monitor and operate Kafka in production?

Key metrics to monitor include: consumer lag (how far behind a consumer is from the latest message), partition leader balance (whether leaders are distributed across brokers), under-replicated partitions (partitions with fewer replicas than configured), and broker availability. High consumer lag indicates your consumers are slow or overwhelmed. If lag keeps growing, you need more consumers.

Under-replicated partitions indicate a broker might be down. Leader imbalance can cause load imbalance. Alerting is crucial. Set up alerts for consumer lag exceeding a threshold, for any under-replicated partitions, for broker availability. Include context in alerts: which topic, which consumer group, what is the trend. Logs are invaluable for debugging.

Enable verbose logging only when debugging production issues, since it impacts performance. Capacity planning matters: monitor disk usage, network bandwidth, and CPU. Kafka is I/O intensive, so as load grows you need faster disks or more brokers. Many teams use Prometheus plus Grafana for monitoring, or their cloud provider's monitoring tools. The key is visibility: you cannot operate Kafka without understanding its state.

What are common mistakes people make with Kafka?

Not thinking about scale: people design single-partition topics and then find they are bottlenecked. Design for the scale you expect, not just what you have now. Ignoring retention policies: without retention limits, Kafka's disks fill up with old data. Set appropriate retention based on your use cases.

Poor consumer design: if your consumer crashes, lag grows. Build resilient consumers with proper error handling and monitoring. Not planning for consumer group rebalancing: when consumers join or leave a group, Kafka rebalances partition assignments, which pauses consumption. Design for these pauses. Using Kafka for everything: not every data problem needs Kafka. Sometimes a database or a simpler queue is more appropriate.

Forgetting about schema evolution: changing event schemas can break consumers that are not prepared for the change. Manage schemas from the start. Not monitoring: Kafka operates fine at small scale without monitoring, but at scale you are blind without metrics, so set up monitoring early. Not planning for growth: you can absorb 10x growth far more easily than 100x. Build headroom into capacity planning.
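One defensive pattern for schema evolution is the tolerant reader: the consumer supplies defaults for fields it does not recognize or that are missing, instead of failing. A minimal sketch with invented field names (a schema registry is the more robust production approach):

```python
import json

def parse_user_event(raw: bytes) -> dict:
    event = json.loads(raw)
    # Tolerant reader: default missing fields rather than crashing
    # when producers start (or stop) sending them.
    return {
        "user_id": event["user_id"],               # required field
        "status": event.get("status", "unknown"),  # optional, with default
        "region": event.get("region", "unset"),    # added in a later schema
    }

old = parse_user_event(b'{"user_id": 1, "status": "active"}')
new = parse_user_event(b'{"user_id": 2, "status": "active", "region": "eu"}')
```

The same consumer code handles events written before and after the `region` field was introduced, which matters in Kafka because old events stay in the log and can be replayed.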

How does Kafka's distributed log model enable event sourcing?

Event sourcing stores all changes to state as immutable events. Instead of storing a user's current status (active, inactive), you store all status changes (created, activated, deactivated). You can replay all events to reconstruct state at any point. Kafka's distributed log is perfect for this: events are appended sequentially, immutable, and persisted.

A user's history is a sequence of events in a Kafka topic. To get current state, you process all events in order. To audit what happened, you examine the event log. To recover from bugs, you replay events (with fixed code) to rebuild state correctly. This pattern is powerful but requires different thinking. You do not update records; you emit events. The database becomes a derived cache of events. Changes become auditable.
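Replaying events to reconstruct state is a left fold over the log. A minimal sketch, using an invented user-status event type:

```python
from functools import reduce

# An event-sourced user status: the current state is never stored
# directly, only derived by folding the events in log order.
events = [
    {"type": "created"},
    {"type": "activated"},
    {"type": "deactivated"},
]

def apply(state: str, event: dict) -> str:
    # Each event maps the previous state to the next one.
    transitions = {
        "created": "inactive",
        "activated": "active",
        "deactivated": "inactive",
    }
    return transitions[event["type"]]

current = reduce(apply, events, "nonexistent")
# Replaying the same immutable log always yields the same state,
# and folding a prefix of the log gives the state at that earlier point.
```

Fixing a bug in `apply` and replaying the full log is exactly the recovery path described above: the events are the source of truth, the derived state is disposable.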

This enables replay, audit, and debugging that traditional databases cannot easily provide. Some systems use event sourcing for everything. Most use it selectively for critical state (payment history, user changes) and traditional databases for less critical state. Kafka is widely used for event sourcing at scale because its append-only log matches the pattern naturally. Understanding this use case reveals why Kafka matters.

What is the learning curve for Kafka and how do you get started?

The basics (topics, producers, consumers) are straightforward. You can get a working example in an hour. Operating Kafka at scale is much harder and requires understanding distributed systems, handling failures, and managing resources. Start by installing Kafka locally (using Docker or native installation), creating a topic, writing a producer that sends messages, and writing a consumer that reads them.

Use the command-line tools to explore. Then build a simple application: maybe a producer that reads from an API and a consumer that writes to a database. The Kafka documentation is excellent. Confluent has good tutorials. Most learning comes from building things. After the basics, learn about consumer groups and partitioning: these are the features that enable scale.
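The command-line exploration mentioned above looks roughly like this; these commands assume a broker already running on localhost:9092 (for example via the official quickstart or a Docker image) and are run from the Kafka installation directory:

```shell
# Create a topic with three partitions.
bin/kafka-topics.sh --bootstrap-server localhost:9092 \
  --create --topic demo-events --partitions 3 --replication-factor 1

# Produce a few messages interactively (Ctrl-C to stop).
bin/kafka-console-producer.sh --bootstrap-server localhost:9092 \
  --topic demo-events

# In another terminal, read the topic from the beginning.
bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 \
  --topic demo-events --from-beginning
```

Rerunning the consumer with `--from-beginning` replays the messages you already read, which is the persistent-log behavior that distinguishes Kafka from a traditional queue.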

Then learn Kafka Streams if you need to process events. Finally, learn operations: monitoring, troubleshooting, capacity planning. This progression takes months, not hours. Kafka is powerful but has a steep learning curve past the basics. The best way to learn is by using it: start with a small use case and expand from there. Do not try to master everything at once.