Data streaming is the continuous processing of data as it arrives from a source, rather than waiting to collect it in batches. Instead of gathering events over an hour or a day and then running analysis, streaming systems process events in near real-time, often within milliseconds of creation. This enables immediate decisions and reactions to new information.
The core of streaming architecture is the producer-broker-consumer pattern. Producers generate events from application logs, sensors, user interactions, or database changes. Brokers like Apache Kafka buffer and distribute these events reliably to many consumers. Consumers subscribe to event streams and process them using tools like Apache Flink or Spark Structured Streaming, applying transformations, aggregations, and state management to produce results.
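To make the pattern concrete, here is a minimal producer and consumer sketch using the kafka-python client; the broker address, topic name, and event fields are illustrative assumptions rather than anything prescribed above.

```python
# Minimal producer/consumer sketch with the kafka-python client.
# Assumes a broker at localhost:9092 and a topic named "payments";
# both names are illustrative.
import json
from kafka import KafkaProducer, KafkaConsumer

# Producer: serialize each event as JSON and send it to the broker.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send("payments", {"user_id": 42, "amount_usd": 19.99})
producer.flush()  # block until the broker has acknowledged the event

# Consumer: subscribe to the same topic and process events as they arrive.
consumer = KafkaConsumer(
    "payments",
    bootstrap_servers="localhost:9092",
    group_id="fraud-detector",
    auto_offset_reset="earliest",
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)
for message in consumer:  # blocks and yields events indefinitely
    event = message.value
    print(f"partition={message.partition} offset={message.offset} event={event}")
```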
Streaming differs fundamentally from batch processing in timing and workflow. Batch systems run on a schedule, processing all accumulated data in one pass. Streaming processes each event as it arrives, maintaining state across the stream to produce continuous outputs. This makes streaming lower-latency but more complex operationally, requiring careful management of state, ordering, and failure recovery.
The trade-off is deliberate. Streaming costs more in infrastructure and operational overhead but delivers insights and actions in seconds instead of hours. The right choice depends on your latency requirements. Fraud detection needs streaming. Historical analysis usually uses batch.
This three-layer architecture is the foundation of almost every streaming system. Producers emit events to a broker without knowing or caring who consumes them. This loose coupling is powerful. You can add new consumers to an existing stream without changing producers. You can replace consumers or add parallel processing without disrupting the event flow.
The broker acts as a persistent queue. Events arrive from producers, get written to disk for durability, and sit there until consumed. Multiple consumers can read the same events independently. One consumer processes events for analytics, another for fraud detection, a third for user notifications. Each maintains its own position in the stream, able to move backward to replay events or forward to skip ahead.
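A sketch of that independence, again with kafka-python and the same assumed broker and topic: two consumers in different groups track separate positions, and either one can rewind to replay whatever the broker still retains without affecting the other.

```python
# Two independent consumers of the same topic, each with its own position.
# kafka-python against an assumed local broker; names are illustrative.
from kafka import KafkaConsumer

analytics = KafkaConsumer(
    "payments",
    bootstrap_servers="localhost:9092",
    group_id="analytics",          # offsets are tracked per consumer group
    auto_offset_reset="earliest",
)
fraud = KafkaConsumer(
    "payments",
    bootstrap_servers="localhost:9092",
    group_id="fraud-detection",    # a separate group, a separate position
    auto_offset_reset="earliest",
)

# Either consumer can move backward to replay retained events
# without disturbing the other group's position.
fraud.poll(timeout_ms=1000)        # join the group and receive partition assignments
fraud.seek_to_beginning()          # rewind all assigned partitions
```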
This design also handles backpressure gracefully. If a consumer is slow, events accumulate in the broker. The producer continues uninterrupted. No requests are dropped. Without a broker, producers would have to wait for slow consumers before they could send more data, creating bottlenecks. Kafka's durability and partitioning make this model scale to millions of events per second across thousands of machines.
A stream processor consumes events from a broker, applies logic, and emits results. The logic ranges from simple filtering (keep only errors) to complex transformations. Filtering is stateless. Anything more interesting requires state. Counting how many events arrive per minute is stateful. So is tracking the total spend per customer or detecting unusual sequences of actions.
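The distinction is easy to see in plain Python, treating the stream as nothing more than an iterator of dicts (a toy stand-in for a real processor):

```python
# Toy stand-in for a stream processor: the stream is just an iterator of dicts.
from collections import defaultdict

def filter_errors(events):
    """Stateless: each event is kept or dropped on its own."""
    for event in events:
        if event["level"] == "ERROR":
            yield event

def count_per_minute(events):
    """Stateful: the running counts must survive from one event to the next."""
    counts = defaultdict(int)              # state held across events
    for event in events:
        minute = event["timestamp"] // 60  # epoch seconds -> minute bucket
        counts[minute] += 1
        yield minute, counts[minute]

events = [
    {"level": "INFO",  "timestamp": 60},
    {"level": "ERROR", "timestamp": 65},
    {"level": "ERROR", "timestamp": 130},
]
print(list(filter_errors(events)))
print(list(count_per_minute(events)))
```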
Apache Flink and Spark Structured Streaming are the two dominant open-source processors. Flink processes events individually, achieving millisecond-level latency for event-driven applications. Spark groups events into micro-batches, making it simpler to program and better for scenarios where you can tolerate 100-500ms latency. Both support windowing, joining multiple streams, and exactly-once semantics. Both are designed to run on distributed clusters, scaling the work across many machines.
State management is where streaming gets complex. The processor needs to remember information across events. If you're counting events per hour, the processor holds an in-memory map of counts, updating it as each event arrives. If a machine fails, that state is lost unless it was checkpointed to durable storage. Both Flink and Spark support checkpointing. On failure, they restore the last checkpointed state and replay the events received since, so no data is lost; keeping results from being duplicated additionally involves the output system, as discussed under exactly-once semantics below.
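In PyFlink, for instance, periodic checkpointing is enabled on the execution environment; a minimal sketch, with an illustrative interval rather than a recommended one:

```python
# Enable periodic checkpoints in PyFlink so operator state survives failures.
# The intervals below are illustrative; tune them for your workload.
from pyflink.datastream import StreamExecutionEnvironment

env = StreamExecutionEnvironment.get_execution_environment()
env.enable_checkpointing(10_000)  # snapshot state every 10,000 ms

# Optional knob: guarantee a minimum pause between checkpoints so that
# checkpointing never crowds out normal processing.
env.get_checkpoint_config().set_min_pause_between_checkpoints(5_000)
```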
Fraud detection is the canonical streaming use case. A transaction arrives from a customer. Within 100 milliseconds, the stream processor checks it against rules and models built from historical fraud patterns, along with geolocation anomalies and velocity checks. If suspicious, it blocks the transaction immediately. Batch processing can't do this. By the time batch results come in hours later, the fraud has already happened.
Recommendations work similarly. A user lands on your site. Streaming updates their profile in real-time, feeding a model that selects products or content to show. The personalization feels responsive because it's based on their current session, not yesterday's aggregate. E-commerce sites use streaming to track inventory as sales happen, ensuring you don't oversell. Financial traders stream market data and news feeds, making split-second decisions on pricing or hedging.
Operationally, streaming pipelines need constant monitoring. You track consumer lag, the distance between the latest event and what the consumer has processed. Growing lag indicates your processor can't keep up. You monitor state size to ensure it fits in available memory. You watch for poison messages, malformed events that cause processing to fail. Logging, alerting, and dashboards become more critical because failures impact real-time decisions, not batch jobs that can retry tomorrow.
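Consumer lag can be read straight off the broker by comparing each partition's end offset with the consumer's current position; a hedged sketch with kafka-python, where the topic and group names are assumptions:

```python
# Compute consumer lag per partition: broker end offset minus consumer position.
# kafka-python against an assumed local broker; topic/group names are illustrative.
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "payments",
    bootstrap_servers="localhost:9092",
    group_id="fraud-detection",
    enable_auto_commit=False,
)
consumer.poll(timeout_ms=1000)                  # trigger partition assignment
assignment = consumer.assignment()
end_offsets = consumer.end_offsets(assignment)  # latest offset per partition

for tp in assignment:
    lag = end_offsets[tp] - consumer.position(tp)
    print(f"{tp.topic}[{tp.partition}] lag={lag}")
    # In production this number would be exported to a metrics system
    # and alerted on when it keeps growing.
```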
Processing each event individually minimizes latency. An event can be processed milliseconds after arrival. But individual processing has overhead. The processor must acquire locks, manage state updates, and persist checkpoints. Handling 10,000 events per second with per-event processing is harder than handling 10,000 events in a second-long batch.
Micro-batching, used by Spark Structured Streaming and others, divides the stream into small batches, usually 100-500 milliseconds of events. Processing batches instead of individual events amortizes overhead. You acquire one lock for the batch, update state once, checkpoint once. This improves throughput significantly. Latency increases slightly because you wait for the batch to fill, but it stays sub-second.
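In Spark Structured Streaming the micro-batch cadence is set by the trigger; a sketch reading from Kafka, where the broker address, topic, and checkpoint path are assumptions and the Kafka connector package is expected to be on the classpath:

```python
# Micro-batching in Spark Structured Streaming: the trigger controls how often
# a new batch is started. Broker, topic, and paths are illustrative.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("microbatch-demo").getOrCreate()

events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "payments")
    .load()
)

query = (
    events.writeStream
    .format("console")
    .trigger(processingTime="500 milliseconds")  # one micro-batch every 500 ms
    .option("checkpointLocation", "/tmp/checkpoints/microbatch-demo")
    .start()
)
query.awaitTermination()
```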
The right choice depends on what you're building. A feature store for machine learning can accept 100ms batches because model training happens offline. A fraud detection system needs 50ms latency. A daily analytics report can use batch processing overnight. Understanding your actual latency requirements, not just assuming you need the lowest possible latency, prevents overengineering and keeps your infrastructure simple and cheap.
In distributed systems, events often arrive out of order. An event created in New York reaches your data center before an event created in San Francisco, even though they were created seconds apart. Some events arrive late, delayed by network congestion or by machines restarting after failures. Ignoring these problems breaks your results. A user's profile update might arrive after you've already used their old profile. A payment confirmation might arrive after the delivery event it logically precedes.
Windowing addresses this. You divide the stream into time windows. A tumbling window groups all events from 2:00-2:05pm. You keep the window open for a grace period after 2:05pm, catching events delayed by a few seconds. Events arriving after the grace period are either dropped or sent to a side output for manual inspection. Allowed lateness is a setting you tune based on how late your events typically are and how wrong you can afford your results to be.
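A sketch of a five-minute tumbling window with a 30-second grace period, expressed as a watermark in Spark Structured Streaming (column and topic names are assumptions; Flink expresses the same idea with allowed lateness and side outputs):

```python
# Five-minute tumbling windows with a 30-second grace period for late events.
# Events older than the watermark are dropped by Spark; names are illustrative.
from pyspark.sql import SparkSession
from pyspark.sql.functions import window, col

spark = SparkSession.builder.appName("windowing-demo").getOrCreate()

events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "payments")
    .load()
    .selectExpr("CAST(value AS STRING) AS body", "timestamp")
)

counts = (
    events
    .withWatermark("timestamp", "30 seconds")        # the grace period
    .groupBy(window(col("timestamp"), "5 minutes"))   # tumbling windows
    .count()
)

counts.writeStream.outputMode("update").format("console").start().awaitTermination()
```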
Exactly-once semantics means each event affects the result once, even if failures occur. The processor saves checkpoints of its internal state. The output system tracks which results have been committed. On failure, the processor rewinds to the last checkpoint, replays events, and the output system avoids duplicating results already written. This is expensive. Exactly-once requires coordination between processor and sink, slowing things down. For some use cases, at-least-once is acceptable if downstream systems can detect and ignore duplicates.
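When exactly-once is too expensive, the common fallback is at-least-once delivery paired with an idempotent sink that recognizes replays; a toy sketch of that dedup idea, with an in-memory set standing in for a keyed table or unique constraint in a real store:

```python
# At-least-once delivery plus an idempotent sink: replays of the same event id
# are detected and ignored, so duplicates never affect the stored result.
# The in-memory structures stand in for a real database with a unique key.
class IdempotentSink:
    def __init__(self):
        self.seen_ids = set()     # would be a unique index / keyed table in practice
        self.total_usd = 0.0

    def write(self, event):
        if event["event_id"] in self.seen_ids:
            return False          # duplicate from a replay; skip it
        self.seen_ids.add(event["event_id"])
        self.total_usd += event["amount_usd"]
        return True

sink = IdempotentSink()
sink.write({"event_id": "tx-1", "amount_usd": 10.0})
sink.write({"event_id": "tx-1", "amount_usd": 10.0})   # replayed after a failure
print(sink.total_usd)   # 10.0, not 20.0
```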
Debugging streaming pipelines is fundamentally harder than debugging batch jobs. Batch jobs have clear inputs and outputs. You can run them locally, reproducing issues. Streaming is continuous. Data moves constantly. State lives on a cluster. Reproducing a bug might require capturing production traffic and replaying it through a local cluster, a tedious process. Silent failures are worse. If your processor stops processing events but doesn't crash, no alert fires. Consumer lag grows silently. Hours later, someone notices the results are stale.
Backpressure handling becomes an operational puzzle. Your processor crashes and restarts. While it is down, events accumulate in the broker. When the processor comes back up, it tries to work through hours of backlog in minutes, using far more CPU and memory than normal. If the backlog grows faster than the processor can catch up, lag keeps climbing. You might need to scale up, but that's slow. Some systems add code to skip old events if lag is too high, losing data intentionally to keep the system responsive. It's an uncomfortable trade-off with no clean answer.
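One blunt version of that trade-off is a freshness check in the consumer: drop events older than a cutoff instead of falling further behind. A toy sketch, where the cutoff is an illustrative knob you would tune and alert on:

```python
# Intentionally dropping stale events to keep lag bounded during a backlog.
# The 5-minute cutoff is an illustrative knob, not a recommendation.
import time

MAX_EVENT_AGE_SECONDS = 300

def process_if_fresh(event, handle, dropped_counter):
    age = time.time() - event["timestamp"]
    if age > MAX_EVENT_AGE_SECONDS:
        dropped_counter["dropped"] += 1    # make the data loss visible in metrics
        return
    handle(event)

dropped = {"dropped": 0}
process_if_fresh({"timestamp": time.time() - 600, "body": "stale"}, print, dropped)
process_if_fresh({"timestamp": time.time(), "body": "fresh"}, print, dropped)
print(dropped)
```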
Schema evolution is another common problem. Your events have a structure defined by a schema, like a JSON or Avro definition. You want to add a new field to track an additional data point. Existing producers emit events with the old schema. New producers emit the new schema. How does the processor handle both? A poorly designed solution breaks, failing on unknown fields. A robust solution requires a schema registry and versioning discipline across the organization. That's extra operational burden.
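With Avro, the usual discipline is to give every new field a default so a reader on the new schema can still resolve records written with the old one; a sketch using fastavro, with illustrative field names:

```python
# Adding a field with a default keeps old records readable under the new schema.
# Uses fastavro; schemas and field names are illustrative.
import io
from fastavro import parse_schema, schemaless_writer, schemaless_reader

schema_v1 = parse_schema({
    "type": "record", "name": "Payment",
    "fields": [
        {"name": "user_id", "type": "long"},
        {"name": "amount_usd", "type": "double"},
    ],
})
schema_v2 = parse_schema({
    "type": "record", "name": "Payment",
    "fields": [
        {"name": "user_id", "type": "long"},
        {"name": "amount_usd", "type": "double"},
        # New field; the default is what old records resolve to.
        {"name": "currency", "type": "string", "default": "USD"},
    ],
})

# A producer still on v1 writes a record without the new field.
buf = io.BytesIO()
schemaless_writer(buf, schema_v1, {"user_id": 42, "amount_usd": 19.99})
buf.seek(0)

# A consumer on v2 reads it using the writer's schema plus its own reader schema.
record = schemaless_reader(buf, schema_v1, schema_v2)
print(record)   # {'user_id': 42, 'amount_usd': 19.99, 'currency': 'USD'}
```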
Resource management at scale introduces further complexity. A single Flink cluster might process terabytes per day. Memory usage depends on state size, which depends on how many unique customers or sessions you're tracking. Growing user bases mean growing state. State that fit in 10GB last month needs 50GB this month. You scale the cluster, adding machines. But scaling itself can trigger bugs if not done carefully. Some state might not redistribute evenly, leaving one machine overloaded. The coordination required to safely scale is nontrivial.
Data streaming processes data continuously as it arrives, enabling low-latency decisions on fresh data. Batch processing collects data over a period, then processes it in scheduled bulk runs. Streaming is optimized for real-time analytics and immediate action, while batch handles high-volume historical analysis efficiently. The choice depends on your latency requirements and use case. Fraud detection needs streaming because minutes matter. Historical trend analysis typically uses batch because the processing window is flexible.
A typical streaming pipeline has three layers: producers, brokers, and consumers. Producers generate events from application logs, user interactions, IoT sensors, or databases. Brokers like Kafka or Pulsar buffer and distribute these events reliably. Consumers subscribe to event streams and process them using tools like Flink, Spark Structured Streaming, or custom applications. This separation lets you scale and operate each layer independently. A producer failure doesn't block consumers, and you can add new consumers without touching producers.
Kafka is a distributed event streaming platform that acts as the broker in the producer-consumer model. It persists events to disk, allowing multiple consumers to replay the same data independently. Kafka scales horizontally across brokers and provides strong ordering guarantees within partitions. Most streaming architectures use Kafka or similar brokers because they decouple producers from consumers and provide durability. Without Kafka, producers would need to know about every consumer, making the system rigid and hard to change.
Stream processors read events from a broker, apply transformations or aggregations, and emit results to a downstream system. Apache Flink and Spark Structured Streaming are the two main open-source options. Flink processes events individually for lower latency, while Spark groups events into micro-batches for better throughput and easier semantics. Both support windowed aggregations, joins across multiple streams, and exactly-once processing guarantees. The processor maintains state to track things like running counts or session data across event boundaries.
Event streaming emphasizes the ordered flow of discrete events through a system, often for operational use cases like real-time notifications or transaction processing. Data streaming is broader, covering any continuous flow of data. In practice, the terms overlap. Event streaming usually implies a focus on individual events and causality, while data streaming might refer to high-volume numeric data from sensors or applications. Both use similar technologies, but event streaming often prioritizes ordering and exactly-once semantics more strictly.
Late events are common in distributed systems. Stream processors handle them with windowing strategies and allowed lateness parameters. You define a window (e.g., events from 2-3pm) and allow the processor to wait a grace period after the window closes for straggler events. Events arriving after that grace period are either dropped or sent to a side output for analysis. Out-of-order events are managed by buffering them long enough to reorder within your processing window. The trade-off is between accuracy and latency. Longer grace periods give more accurate results but delay final answers.
Fraud detection is perhaps the most well-known use case. Streaming lets you flag suspicious transactions in milliseconds. Recommendation systems use streaming to update user models as they interact with your product. Operational monitoring streams application metrics and logs to dashboards, alerting teams to issues before customers notice. Financial institutions stream market data for trading decisions. E-commerce platforms use streaming to track inventory in real-time. IoT applications stream sensor data for immediate anomaly detection. Each of these benefits from making decisions within seconds or less of the event occurring.
Latency is how quickly you can process an individual event. Throughput is how many events you process per unit of time. Processing each event individually gives low latency but often lower throughput because of overhead. Batching events before processing improves throughput but increases latency since you wait for a batch to fill. Micro-batching, used by Spark Structured Streaming, batches small groups of events (100ms windows) to balance both concerns. The right choice depends on your requirements. Stock trading needs low latency even if it means processing fewer events. Log aggregation can tolerate higher latency to get better throughput.