There is a pipeline that failed at 3 AM, and the engineer who got paged has never touched it. There is no runbook, the original author is asleep or gone, and the alert says only that a job exited non-zero. The next hour is spent reverse-engineering a system under pressure while a morning dashboard quietly depends on the outcome. By the time it is fixed, the data is late and the engineer is awake for the day.
This is more than a rough night. It is a failure of the data on-call operating model.
A functioning data on-call practice is more than a rotation and a pager. It is a designed system of runbooks, actionable alerts, clear escalation, and an operating model that lets whoever is on call recover a pipeline they did not build, quickly and safely, at 3 AM.
However, many teams add a pager to a rotation and call it on-call, and discover what was missing the first time someone is paged for a system they have never seen.
If you are a Head of Data or data engineering lead responsible for reliability, the intent of this article is:
- Define what a real data on-call practice consists of
- Walk through the runbooks and alerting that make 3 AM recoverable
- Lay out the operating model that keeps on-call sustainable
To do that, let's start with the basics.
Real Estate Platform Reduced Pipeline Costs 45%
A pipeline FinOps playbook for FinOps Leads who need cost reductions that survive next quarter.
What Is the Data On-Call Practice? The Basic Definition
At a high level, a data on-call practice is the combination of runbooks, actionable alerting, escalation paths, and operating discipline that lets any engineer on the rotation diagnose and recover a pipeline failure without depending on the system's original author.
To compare:
If a pager with no runbook is a smoke alarm that just screams, a real on-call practice is the alarm plus a labeled extinguisher and instructions on the wall. The fire is the same; one setup lets anyone present put it out.
Why Is a Data On-Call Practice Necessary?
Issues that a data on-call practice addresses or resolves:
- Letting any on-call engineer recover a pipeline they did not build
- Turning vague failures into actionable, diagnosable alerts
- Reducing mean time to recovery for the pipelines the business depends on
Resolved Issues by a Data On-Call Practice
- Removes dependence on the original author being available
- Converts cryptic failures into guided, documented recoveries
- Makes late data and outages shorter and less frequent
Core Components of a Data On-Call Practice
- Runbooks for each critical pipeline's known failure modes
- Actionable alerts that say what broke and what to check
- A defined escalation path when the first responder is stuck
- Severity and ownership so the right failures wake the right people
- A sustainable rotation that does not burn the team out
Modern Data On-Call Tools
- PagerDuty and Opsgenie for rotation, alerting, and escalation
- Airflow, Dagster, and Prefect alerting on task and SLA failures
- Data observability platforms that detect freshness and volume anomalies
- Runbook tooling and wikis linked directly from alerts
- Incident tracking to capture and learn from each failure
These tools support the practice, but no tool substitutes for the runbooks and operating model that make a 3 AM page survivable.
Other Core Issues They Will Solve
- Provide a record of failures to drive systemic fixes
- Give the business predictable recovery for critical data
- Allow new team members to take on-call without months of tenure
In Summary: A data on-call practice turns a 3 AM pipeline failure from an archaeology project into a routine, documented recovery.
Importance of a Data On-Call Practice in 2026
On-call has become essential as data has moved onto the critical path. Four reasons explain why it matters now.
1. Data is now load-bearing.
Pipelines feed dashboards, decisions, and models that the business depends on daily. A late or wrong pipeline is now an operational incident, not an inconvenience.
2. Pipelines fail at inconvenient times.
Upstream jobs, vendor feeds, and scheduled loads break overnight and on weekends. Recovery cannot wait for the author to wake up.
3. Tribal knowledge does not scale or persist.
The engineer who built a pipeline may be unavailable or gone. Runbooks turn their knowledge into something the rotation can use.
4. Burnout is a real and costly risk.
A bad on-call practice, noisy alerts and no runbooks, drives the best data engineers away. A good one protects the team as much as the data.
Traditional vs. Modern Data On-Call
- Page the author vs. any on-call engineer can recover
- Cryptic failure alerts vs. actionable alerts linked to runbooks
- Tribal knowledge vs. documented failure modes and recoveries
- Heroic firefighting vs. routine, repeatable recovery
In summary: A modern data on-call practice is documented, actionable, and sustainable, not a pager bolted onto a rotation.
Details About the Core Components of a Data On-Call Practice: What Are You Designing?
Let's go through each layer.
1. Runbook Layer
Documented recovery for each critical pipeline.
Runbook decisions:
- One runbook per critical pipeline's known failure modes
- Steps written for someone unfamiliar with the system
- Linked directly from the alert that fires
2. Alerting Layer
What gets the engineer's attention and why.
Alerting decisions:
- Alerts that state what broke and what to check
- Tuned to avoid noise that trains people to ignore them
- Severity matched to business impact
3. Escalation Layer
What happens when the first responder is stuck.
Escalation decisions:
- A clear next contact when the runbook does not resolve it
- Time-boxed before escalation
- A path that does not dead-end
4. Ownership Layer
Who is responsible for each pipeline.
Ownership decisions:
- An owner per critical pipeline accountable for its runbook
- Severity defined so the right failures page
- Non-critical failures handled in business hours
5. Sustainability Layer
How the rotation stays healthy.
Sustainability decisions:
- Alert volume kept low enough to be sustainable
- Toil from recurring failures fixed, not just absorbed
- Rotation sized so on-call is not punishing
Benefits Gained from Runbooks and Actionable Alerts
- Any on-call engineer can recover a pipeline they did not build
- Recovery is faster because the alert points to the fix
- The team is protected from burnout and the data from prolonged outages
How It All Works Together
A critical pipeline fails at 3 AM. An actionable alert fires, stating what broke and linking directly to the runbook for that failure mode. The on-call engineer, who did not build the pipeline, follows the documented steps and recovers it without waking anyone. If the runbook does not resolve it within a time box, a clear escalation path reaches the right next person. The incident is logged, and a recurring failure becomes a backlog item to fix the underlying cause rather than absorb again. The night is routine instead of heroic.
Common Misconception
On-call is just a rotation and a pager.
On-call is a rotation and a pager plus the runbooks, actionable alerts, escalation, and sustainability practices that make a page survivable for someone who did not build the system. Without those, the pager just distributes panic.
Key Takeaway: The pager is the easy part. The runbooks and the operating model are what turn a 3 AM page into a routine recovery.
Real-World Data On-Call in Action
Let's take a look at how a data on-call practice operates with a real-world example.
We worked with a data team whose on-call meant paging whoever built the broken pipeline, with these constraints:
- Let any on-call engineer recover critical pipelines
- Make alerts actionable rather than cryptic
- Keep the rotation sustainable for the team
Step 1: Identify and Rank the Critical Pipelines
Decide which failures actually warrant a page.
- Critical pipelines identified by business impact
- Severity assigned so only real impact pages
- Non-critical failures routed to business hours
Step 2: Write Runbooks for Known Failure Modes
Document recovery for each critical pipeline.
- Failure modes captured from history
- Steps written for an unfamiliar engineer
- Runbook linked from the alert
Step 3: Make Alerts Actionable
Tune alerts to say what broke and what to check.
- Alerts state the failure and the first checks
- Noise reduced so alerts are trusted
- Severity matched to impact
Step 4: Define Escalation
Ensure a stuck responder has somewhere to go.
- Clear next contact per pipeline
- Time box before escalation
- No dead-end paths
Step 5: Protect Sustainability
Keep on-call from burning out the team.
- Alert volume monitored and reduced
- Recurring failures fixed at the root
- Rotation sized fairly
Where It Works Well
- A runbook for every critical pipeline, linked from its alert
- Actionable alerts that point to the fix
- Recurring failures fixed at the root, not absorbed nightly
Where It Does Not Work Well
- Paging only the original author, who may be unavailable
- Cryptic alerts that start every recovery from scratch
- Noisy alerting that trains the team to ignore the pager
Key Takeaway: The on-call practice that makes 3 AM survivable is the one whose runbooks and actionable alerts let any engineer recover, not the one that just distributes a pager.
Common Pitfalls
i) A pager without runbooks
Paging someone for a system with no runbook turns every incident into archaeology. Write runbooks for the critical pipelines before relying on the rotation.
- Runbook per critical pipeline
- Written for an unfamiliar engineer
- Linked from the alert
ii) Cryptic alerts
An alert that says only "job failed" starts recovery from zero. Make alerts state what broke and what to check.
iii) Alert noise
Too many alerts train the team to ignore the pager, so the real one gets missed. Tune ruthlessly.
iv) Absorbing recurring toil
A failure that recurs every week is a fix waiting to happen, not a nightly chore. Turn repeated incidents into backlog items.
Takeaway from these lessons: Most painful on-call experiences trace to missing runbooks and noisy alerts, not to bad luck. Document recoveries, make alerts actionable, and fix recurring toil.
Data On-Call Best Practices: What High-Performing Teams Do Differently
1. Write runbooks for an unfamiliar engineer
Document each critical pipeline's failure modes so someone who never built it can recover it. That is the test of a good runbook.
2. Make every alert actionable
An alert should state what broke and what to check, and link to the runbook. Cryptic alerts waste the most expensive minutes.
3. Page only on real impact
Severity tied to business impact keeps the pager meaningful. Non-critical failures wait for business hours.
4. Fix recurring toil at the root
A failure that recurs is a backlog item, not a chore. Reducing toil is what keeps on-call sustainable.
5. Treat sustainability as a first-class goal
Monitor alert volume, size the rotation fairly, and protect the team. A burnout-driven on-call practice loses the engineers who run the data.
Logiciel's value add is helping teams identify critical pipelines, write the runbooks, tune alerting to be actionable, and design an operating model that keeps on-call both effective and sustainable.
Takeaway for High-Performing Teams: Focus on runbooks, actionable alerts, and sustainability. The pager is trivial; the operating model is what protects both the data and the team.
Signals You Are Running Data On-Call Correctly
How do you know the on-call practice is set up to succeed? Not in the existence of a rotation, but in the daily evidence the team produces. Below are the signals that distinguish programs on the path from programs that look like progress.
Anyone on the rotation can recover. The team can show that a recent 3 AM failure was resolved by an engineer who did not build the pipeline, using the runbook.
Alerts point to fixes. The team's alerts state what broke and what to check, not just that something failed.
Recurring failures get fixed. The team can name a toil source they eliminated rather than absorbed.
The pager is trusted. Alert volume is low enough that people respond rather than ignore.
On-call is sustainable. The team can describe the rotation as manageable, not as the reason people are looking to leave.
Adjacent Capabilities and Connected Work
This work does not exist in isolation. Data on-call depends on, and feeds into, several adjacent capabilities. Building one without thinking about the others is the most common scoping mistake.
In most enterprise programs, data on-call shares infrastructure with the orchestration layer, the observability stack, and the incident management process. It shares team capacity with data engineering, platform engineering, and the SRE practice if one exists. And it shares leadership attention with whatever the next reliability or data initiative is on the roadmap. Naming these adjacencies upfront helps the program scope realistically and helps leadership see the work as a portfolio rather than a one-off project.
The most common mistake in adjacent-capability scoping is treating each adjacency as someone else's problem. The observability that detects failures is your problem. The incident process that captures learnings is your problem. The upstream dependencies whose failures page you are your problem to monitor. Pretending otherwise pushes work to teams that did not plan for it, and the work returns to you later as a repeated 3 AM incident. Own the adjacencies you depend on; partner with the teams that own them; share the timeline.
Conclusion
A data on-call practice is what turns inevitable pipeline failures from heroics into routine. The discipline that makes 3 AM survivable is the same discipline behind any reliability work: document the recoveries, make the signals actionable, and protect the people who run the system.
Key Takeaways:
- On-call is runbooks and an operating model, not just a pager
- Runbooks must let an unfamiliar engineer recover the pipeline
- Actionable alerts and root-cause fixes keep on-call fast and sustainable
Building an effective data on-call practice requires runbook, alerting, and sustainability discipline. When done correctly, it produces:
- Any on-call engineer able to recover any critical pipeline
- Faster recovery and shorter data outages
- A team protected from burnout
- Recurring failures eliminated rather than endured
Healthcare CIO Cuts AI Costs Without Accuracy Loss
A field guide to AI cost optimization for VP Engineering teams running clinical and operational LLMs in production.
What Logiciel Does Here
If your on-call means paging whoever built the broken pipeline, identify your critical pipelines, write runbooks an unfamiliar engineer can follow, and make every alert actionable.
Learn More Here:
- The SLO Handbook: Setting Targets Your Team Can Actually Hit
- Data SLAs and Incident Response Services
- Data Observability: Why Your Dashboards Keep Lying to You
At Logiciel Solutions, we work with Heads of Data on on-call design, runbooks, alerting, and reliability operating models. Our reference patterns come from production data platforms.
Explore how to make your data on-call survivable at 3 AM.
Frequently Asked Questions
What makes data on-call different from software on-call?
The failures are often about data correctness and freshness, not just service availability, and they frequently originate upstream in vendor feeds or producer changes. Recovery depends on runbooks for data-specific failure modes, not just restarting a service.
What should a pipeline runbook contain?
The known failure modes for that pipeline and the step-by-step recovery for each, written so an engineer who never built the pipeline can follow it. It should be linked directly from the alert that fires.
How do we stop alert fatigue?
Tune alerts so only real business impact pages, make each alert actionable, and fix recurring failures at the root instead of alerting on them repeatedly. A pager that fires constantly trains people to ignore it.
How do we keep on-call sustainable?
Page only on real impact, size the rotation fairly, monitor alert volume, and turn recurring toil into backlog fixes. Sustainability protects the engineers who run the data and is as important as recovery speed.
What is the biggest mistake in data on-call?
Adding a pager to a rotation without runbooks or actionable alerts. It distributes panic rather than capability, because whoever is paged for an unfamiliar system has to reverse-engineer it under pressure at the worst possible time.