A streaming migration playbook for Data Engineering Leads moving healthcare workloads to real-time — bounded scope, managed Kafka and Flink, and a one-workload-per-week migration cadence with the batch path running in parallel.
Your healthcare platform needs real-time, but your stack is batch.
Healthcare data architectures grew up batch. Overnight ETL was the right answer for a generation of reporting workloads. It is the wrong answer for sepsis prediction, clinical alerts, care gap notification, and eligibility verification — the workloads where minutes change outcomes.
Most batch-to-streaming migrations we audit are too ambitious. The team tries to replace the warehouse, replace the ETL framework, and rebuild every dashboard at the same time, and the project ships in 18 months or not at all.
Identify the three to five workloads where streaming matters most. Healthcare teams typically pick ED throughput, clinical alerts, eligibility verification, care gap notification, and sepsis prediction. Everything else stays batch until the streaming layer has earned the right to expand.
Deploy a managed Kafka and Flink stack. Confluent Cloud, AWS MSK with Managed Flink, or equivalent. Self-building the streaming infrastructure consumes the entire migration window before any workload ships.
Migrate one workload per week to the streaming stack. Each migration ships behind a feature flag with the batch path still running in parallel, so rollback is one toggle and the team learns the streaming stack in production before it is the only path.
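The flag-plus-parallel-path pattern in this step can be sketched in a few lines. This is a minimal illustration, not a production design: WorkloadRouter, the in-memory flags dictionary, and the shadow_log list are hypothetical names, and a real deployment would presumably read flags from whatever flag service the team already uses.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class WorkloadRouter:
    """Serves one workload from either path while both keep running."""

    batch_path: Callable[[dict], str]
    streaming_path: Callable[[dict], str]
    # Hypothetical in-memory flag store, keyed by workload name.
    flags: Dict[str, bool] = field(default_factory=dict)
    # Paired outputs kept for later comparison of the two paths.
    shadow_log: List[Tuple[str, str, str]] = field(default_factory=list)

    def handle(self, workload: str, record: dict) -> str:
        # Both paths always run, so the batch output is ready if we roll back.
        batch_out = self.batch_path(record)
        stream_out = self.streaming_path(record)
        self.shadow_log.append((workload, batch_out, stream_out))
        # The flag only decides which output is served downstream.
        return stream_out if self.flags.get(workload, False) else batch_out


router = WorkloadRouter(
    batch_path=lambda r: f"batch:{r['patient_id']}",
    streaming_path=lambda r: f"stream:{r['patient_id']}",
)
router.flags["sepsis_prediction"] = True   # cut over to streaming
served = router.handle("sepsis_prediction", {"patient_id": "p1"})
router.flags["sepsis_prediction"] = False  # rollback is one toggle
```

Because the flag defaults to off, a workload that has not been migrated yet keeps serving the batch output without any extra configuration.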
If your healthcare platform needs real-time and your stack is batch, the migration ships in one quarter when scope is bounded and the streaming stack is managed.
Time. Self-building Kafka and Flink alone takes longer than the entire 13-week window. A managed stack gets you to production faster, and the time saved justifies the cost premium.
Each workload gets its own late-arrival window, with side outputs for very late records. We have run this with up to 30-minute late-arrival tolerance on AMI-style data and shorter tolerances on EHR data.
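To make the late-arrival idea concrete, here is a simplified plain-Python sketch of event-time windows with an allowed-lateness bound and a side output. It only imitates the behavior a stream processor such as Flink provides natively. The class name LatenessRouter, the event_time_s field, and the 60-second window are assumptions for illustration, and the watermark here is simply the highest event time seen so far.

```python
from collections import defaultdict
from typing import Dict, List


class LatenessRouter:
    """Tumbling event-time windows with a per-workload lateness bound."""

    def __init__(self, window_s: int, allowed_lateness_s: int) -> None:
        self.window_s = window_s
        self.allowed_lateness_s = allowed_lateness_s
        self.watermark = 0  # simplified: max event time observed so far
        self.windows: Dict[int, List[dict]] = defaultdict(list)
        self.side_output: List[dict] = []  # very late records land here

    def process(self, record: dict) -> None:
        ts = record["event_time_s"]
        self.watermark = max(self.watermark, ts)
        window_start = ts - (ts % self.window_s)
        window_end = window_start + self.window_s
        if window_end + self.allowed_lateness_s <= self.watermark:
            # The window closed longer ago than the tolerance allows:
            # keep the record for separate handling instead of dropping it.
            self.side_output.append(record)
        else:
            self.windows[window_start].append(record)


# 60-second windows with a 30-minute (1800 s) lateness tolerance.
lateness = LatenessRouter(window_s=60, allowed_lateness_s=1800)
for t in (0, 2000, 30, 300):
    lateness.process({"event_time_s": t})
```

In this run the record at t=30 arrives after the watermark has reached 2000, which is past its window end plus the tolerance, so it goes to the side output. The record at t=300 is late but still within tolerance and joins its window.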
Parallel operation. The batch alert and the streaming alert run side by side until reconciliation shows that the streaming output matches the batch output. Only then does the batch path retire.
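One way to express that reconciliation check is as a set comparison over alert keys. This is a sketch under stated assumptions: the (patient_id, alert_type) key, the function names, and the default requirement of an exact match before retiring the batch path are all illustrative choices, not details taken from a specific implementation.

```python
from typing import Dict, Set, Tuple

# Hypothetical alert identity: (patient_id, alert_type).
AlertKey = Tuple[str, str]


def reconcile(batch: Set[AlertKey], streaming: Set[AlertKey]) -> Dict[str, object]:
    """Compare the alerts each path produced for the same period."""
    union = batch | streaming
    matched = batch & streaming
    return {
        "missing_in_streaming": sorted(batch - streaming),
        "extra_in_streaming": sorted(streaming - batch),
        # Share of all alerts that both paths agree on.
        "match_rate": len(matched) / len(union) if union else 1.0,
    }


def ready_to_retire_batch(report: Dict[str, object], threshold: float = 1.0) -> bool:
    """Assumed policy: retire batch only once the match rate meets the threshold."""
    return float(report["match_rate"]) >= threshold


batch_alerts = {("p1", "sepsis"), ("p2", "sepsis")}
stream_alerts = {("p1", "sepsis"), ("p3", "sepsis")}
report = reconcile(batch_alerts, stream_alerts)
```

Listing the missing and extra alerts separately, rather than reporting only a rate, shows whether the streaming path is dropping alerts or raising ones the batch path never did.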
The same posture as the existing analytics platform: encryption in transit and at rest, a business associate agreement (BAA) with the managed provider, and role-based access. The streaming layer inherits the trust boundary.
Not all of it. Streaming replaces the batch outputs that benefit from real-time. Reporting workloads on long aggregations stay batch.