LS LOGICIEL SOLUTIONS
Toggle navigation

Apache Kafka Explained: When You Need It and When You Don't

Apache Kafka Explained When You Need It and When You Don't

At some time no matter which engineering team you are part of you will likely receive a recommendation that goes something like this… “Let’s use Apache Kafka.” This type of recommendation generally comes during times of severe strain on the system usually due to increased volume being processed by the system and data pipelines failing to be reliable and more recently with the emergence of real-time data requirements so the idea of introducing Apache Kafka seems to be an obvious answer.

However in reality the large majority of engineering teams introduce Apache Kafka much too early or for all the wrong reasons. For a staff or principal-level engineer the true challenge isn’t understanding what Apache Kafka is; it is determining when should Apache Kafka be considered along with when does it just add more complexity than needed.

This guide provides a higher level overview of what Apache Kafka is providing more than just simple definitions but answering the important questions:

  • What Apache Kafka is and how it works with a little understanding as to why so many organizations are embracing it
  • When Apache Kafka is the best solution (use case) and when it is a poor decision (overkill)
  • How Apache Kafka compares with other real-time platforms that provide similar capabilities
  • Important considerations for implementing Kafka in production environments

What Is Apache Kafka?

Apache Kafka is a distributed event streaming platform designed to support high-volume high-speed real-time data feeds.

Apache Kafka Allows Now Systems To:

  • Publish Data Streams
  • Subscribe To Data Streams
  • Durably Store Data Streams
  • Process Data Streams In Real Time

Overview Of How Apache Kafka Works

Kafka Consists Of A Few Basic Components:

  • Producers Send Data
  • Consumers Read Data
  • Topics Organize Data
  • Partitions Support Scalability
  • Brokers Manage And Store Data

Why Was Apache Kafka Created?

LinkedIn Originally Built Kafka To Solve A Major Problem For The Company:

To Enable Reliability When Processing Large Amounts Of Real-Time Event Data.

Things You Should Know About Kafka

Kafka Is Much More Than A Message Queue.
Kafka Is Essentially A Distributed Log System Designed For Scalability And Durability.

Why Is Apache Kafka So Popular?

Kafka Has Found Itself Quickly Being Adopted In Organizations Across All Areas Of Industry For Several Reasons:

Why Is Apache Kafka So Popular?

1. Very High Throughput

Kafka Handles Millions Of Events Per Second.

2. Scalability

Kafka Scales Horizontally By Adding More Brokers And Partitions.

3. Durability

Data Is Stored Persistently And Replicated To Other Nodes In An Apache Kafka Cluster.

4. Decoupled Systems

Producers And Consumers Can Work Independently From Each Other.

5. Real-Time Processing

Kafka Provides Almost Real-Time Data Pipelines.

Example

For Example, A Major E-Commerce Company Uses Kafka For Monitoring User Activity, Processing Orders, And Updating Their Recommendation Engines.

Overall, Kafka Effectively Handles The Challenges Of Scaling And Processing Data In Real Time.

How Does Apache Kafka Work In Practice?

To Fully Understand How Apache Kafka Works, You Have To Understand The Flow Process Used In Kafka.

Flow Process

  • Step 1: Producers Write Event(s) To Topics
  • Step 2: Events Are Stored In Partitions On Brokers
  • Step 3: Consumers Read Events At Their Own Rate
  • Step 4: Kafka Tracks Which Messages Have Been Consumed
  • Step 5: Data will be replicated to ensure fault tolerance

Example Architecture

A standard design for the pipeline includes the following stages:

Applications ---> Kafka ---> Stream processing ---> Data Warehouse

Key takeaway from this example: Kafka acts as the "central nervous system" of the overall data flow.

When to Use Apache Kafka

Apache Kafka has great capabilities; however, it is not essential for every use case.

1. Real-Time Data Streaming

If you have a need for:

  • Live analytics
  • Event processed systems

Then Apache Kafka is a strong choice.

2. High-Throughput Systems

When doing the following:

  • Processing millions of events
  • Performing large scale data ingestion

Kafka is a leading solution for high-throughput solutions.

3. Microservices Communication

By using Kafka, your application can:

  • Communicate asynchronously
  • Maintain loose coupling

4. Data Integrating Across Systems

Apache Kafka can serve as the central data "backbone" in your organization.

5. Event Sourcing Architectures

With an event sourcing architecture, messages are retained in Apache Kafka for future retrieval, providing example of durable event logs.

Example Use Cases

A financial services firm utilizes Apache Kafka to achieve the following:

  • Processing transaction events in real-time
  • Identifying incidents of fraud
  • Providing input into downstream analytic solutions

Key takeaway from these examples: If application scalability and real-time processing are not negotiable, consider using Apache Kafka.

When Not to Use Apache Kafka

One of the common mistakes that companies make is to incorrectly manage their applications and prevent the use of Apache Kafka.

1. Low Volume Data Systems

If the application is processing:

  • Relatively small volume of data sets
  • Low volume of transaction events over time

Then Apache Kafka will add complexity without enhancing the application.

2. Basic ETL Pipelines

Most batch processing systems do not need to utilize Apache Kafka for their processes.

3. Early Stage Startups

If your organization is in early stages, scale and/or real-time requirements:

  • Will delay your application development

4. Limited Engineering Resources

Kafka systems require:

  • Operational expertise to support ongoing operations and maintenance

5. Over Engaging Problems

Many companies utilize Apache Kafka to address problems that they suspect may exist in the future.

Example

Startup uses Apache Kafka as its primary data streaming solution; as a result, the startup is experiencing:

  • More complex applications
  • Slower application development
  • Higher total cost of ownership

Key takeaway: Apache Kafka is a powerful technology; however, it is also expensive in complexity.

What are some major differences between Apache Kafka and other streaming platforms?

A typical way to compare Kafka would be against:

RabbitMQ

Kafka has high throughput with its distributed log, whereas RabbitMQ serves as a message broker and supports more basic use cases.

Kinesis

Kafka is more flexible and supports a wider ecosystem. They are very similar to one another; however, Kinesis is also a managed service.

Pulsar

Kafka has a more mature ecosystem, while Pulsar offers more advanced capabilities such as multi-tenancy.

Key Considerations

Many organizations prefer Kafka due to its flexibility and the maturity of its ecosystem.

Takeaway to remember: Although Kafka may cost more in the short term, it will be much more extensible than other options.

Considering Managed Kafka vs Self-Hosted

Managed Kafka vs Self-Hosted is also an important comparison.

Self-Hosted offers:

  • Full control over your system
  • Customized configuration options

However, with Self-Hosted Kafka comes the downside of:

  • Complexity of operations
  • Maintenance of operations

Managed Kafka offers:

e.g., Confluent Cloud, AWS Managed/Managed Service and their respective pricing.

Managed Kafka Is a Better Option for Growing Teams for Three Reasons:

  • Less Complexity
  • Faster to Deploy
  • Less Configured

Best Practices for Running Kafka in Production

  • Design for Partitioning. Partitions dictate the scalability of your solution.
  • Monitor Your Consumer Lag.
  • Implement Schema Management.
  • Secure Your Kafka Cluster.
  • Plan For Failure.
  • Optimize Retention Policies.

Some Common Mistakes Engineers Make

1. Using Kafka as a Queue

Kafka was originally designed as an event streaming solution, not a queuing solution.

2. A Bad Design Will Lower Efficiency

If you choose the appropriate topic design for your application, you will save time and build an efficient system.

3. Misjudging Your Operational Complexity

Kafka is an operational tool. You will need to manage your environment on an ongoing basis.

4. Not Considering How Scalable Your Design Is

The design decisions you make today will affect your scalability tomorrow.

5. Not Having Any Observability

Without observability, debugging your application will be difficult.

Who Uses Apache Kafka?

Many of the top organizations and companies use Kafka.

Common Use Cases Across Different Industries

  • Companies in the tech industry using event tracking
  • Financial services processing transactions
  • Retailers using customer analytics

Why Do Companies Use Kafka?

  • Reliability
  • Scalability
  • Flexibility

Insight Into Apache Kafka?

When organizations use Kafka, they cannot afford to have any downtime or high latency.

What Is the Future of Apache Kafka?

Kafka is continually being evolved.

Emerging Trends

  • Serverless Kafka
  • Stream processing integration (e.g., Flink)
  • AI-enabled data pipelines

What Is The Impact Of This On Apache Kafka?

Apache Kafka is becoming:

  • Easier to use
  • More integrated
  • More automated

Strategic Insight

Apache Kafka is likely to be one of the main components of the modern data architecture moving forward.

Frequently Asked Questions

What Is Apache Kafka?
Apache Kafka is an open-source platform for distributed event streaming and real-time data pipelines.
How Does Apache Kafka Work?
Apache Kafka requires producers, consumers, topics, and partitions to process data streams.
Who Created Apache Kafka?
Apache Kafka was created at LinkedIn and subsequently released as open-source software.
When Should I Use Apache Kafka?
If you have a high-volume, real-time data-streaming application and a distributed environment, then you should use Apache Kafka to manage it.
When Should I Not Use Apache Kafka?
Do not use Kafka if your application is on a small scale, consists of simple pipelines, or if the additional operational complexity will outweigh the other business benefits.

Conclusion: Apache Kafka Is A Strategic Asset and Must Not Be Viewed As A Default Solution.

Apache Kafka represents one of the more powerful tools that exist in data engineering today, however, it should not be deemed as the default solution. When used appropriately, Apache Kafka helps:

  • Create scalable real-time systems
  • Create reliable data pipelines
  • Decouple architectures

When used inappropriately, Apache Kafka creates:

  • Complexity
  • An additional cost to your application
  • Ongoing maintenance overhead

The Logiciel Point of View

At Logiciel Solutions, we help engineering teams design "AI-First" and event-driven architectures where tools like Apache Kafka are used strategically and not just by default. Our focus is on building scalable streaming systems to meet the various real business needs for your application, while providing performance, reliability, and maintainability over time.

If your team is evaluating the use of Apache Kafka, or is currently struggling with you current implementation of Kafka or a previous implementation, now is the time to review your architecture and to determine whether you have constructed your architecture for scaling or have restricted the ability to scale.

Let us help you design systems that will grow with your data and your company.

Submit a Comment

Your email address will not be published. Required fields are marked *