SLOs for Data Products: Operational Targets That Survive Audits

The Quality Commitment That Did Not Mean Anything

A CDO at a manufacturing company described his organization's data quality commitments to me as theatrical. The data team had published quality SLAs for years. The SLAs said things like "data is current within one business day" and "data quality is maintained at acceptable levels." Twenty-three different data products had similar documents. None of the documents specified what acceptable meant or how it would be measured.

When the audit committee asked specific questions in late 2023, the data team could not answer them with current numbers. They could not say which products were meeting their SLAs and which were not. They could not say what corrective action followed when an SLA was violated. The quality commitments existed as policy documents and not as operational discipline.

He told me his team had spent the following year converting theatrical SLAs into operational SLOs. The work was unglamorous and substantial. The result was that the data team could answer the audit committee's questions with specifics. They also discovered that many products had been quietly missing their SLOs for months, which led to specific remediation work that the theatrical SLAs had concealed.

The pattern is common. Most data SLAs are theatrical. The ones that survive audits and produce operational results have specific characteristics that distinguish them.

What an Operational SLO Actually Specifies

An operational SLO is more specific than a theatrical SLA. The specificity is the operational part.

The first specificity is measurable. The SLO defines what is measured (freshness, completeness, accuracy, availability) and how. The measurement runs automatically against production data. The result is a number, updated continuously.

The second specificity is bounded. The SLO defines a target value and a measurement window, for example freshness under one hour, 99 percent of the time, measured over a rolling 30-day window. The target and window together produce an unambiguous pass-fail measure.
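A bounded SLO of this shape can be evaluated mechanically. The sketch below is illustrative only: `BoundedSlo`, `evaluate`, and the sample layout are hypothetical names, and it assumes freshness lag is sampled at regular intervals and that a window with no measurements should count as a failure.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class BoundedSlo:
    threshold: timedelta   # a measurement is "good" when its lag is under this
    target_ratio: float    # fraction of measurements that must be good
    window: timedelta      # rolling measurement window


def evaluate(slo, samples, now):
    """samples: list of (measured_at, lag) pairs. Returns (compliance, passed)."""
    window_start = now - slo.window
    in_window = [lag for measured_at, lag in samples
                 if window_start < measured_at <= now]
    if not in_window:
        # Assumption: missing measurements are a failure, not a silent pass.
        return 0.0, False
    good = sum(1 for lag in in_window if lag < slo.threshold)
    compliance = good / len(in_window)
    return compliance, compliance >= slo.target_ratio


# Freshness under one hour, 99 percent of the time, rolling 30 days.
now = datetime(2024, 1, 31)
slo = BoundedSlo(timedelta(hours=1), 0.99, timedelta(days=30))

# 200 hourly measurements with a 20-minute lag, one of them a 3-hour late load.
samples = [(now - timedelta(hours=i), timedelta(minutes=20)) for i in range(200)]
samples[5] = (samples[5][0], timedelta(hours=3))

compliance, passed = evaluate(slo, samples, now)
```

With one late load out of 200 measurements the product still passes; a few more late loads in the same window would push it below target.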

The third specificity is consequenced. When the SLO is violated, something happens. The on-call team gets paged. The data product is marked degraded in the catalog. Downstream consumers receive notifications. Repeated violations trigger remediation work that gets prioritized appropriately.
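One way to keep consequences from being optional is to derive them from the violation itself. The sketch below is an assumption-laden illustration: the function name, the action strings, and the three-violation remediation threshold are hypothetical, and a real system would call paging, catalog, and notification integrations instead of returning text.

```python
def consequences(product, consecutive_violations, remediation_after=3):
    """Return the actions a violation of `product`'s SLO should trigger."""
    if consecutive_violations < 1:
        return []  # no violation, no consequences
    actions = [
        f"page on-call for {product}",
        f"mark {product} degraded in catalog",
        f"notify downstream consumers of {product}",
    ]
    # Repeated violations escalate from incident response to planned work.
    if consecutive_violations >= remediation_after:
        actions.append(f"open prioritized remediation work for {product}")
    return actions


first_miss = consequences("orders_daily", 1)
third_miss = consequences("orders_daily", 3)
```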

Without these three specificities, the SLO is a document. With them, the SLO is operational discipline that producers, consumers, and auditors can rely on.

The Five SLO Categories That Cover Most Data Products

Five SLO categories cover most production data products. Most products do not need all five; the relevant subset depends on the product and its consumers.

Freshness SLOs specify how recently the data was updated. Pertinent for time-sensitive analytical and operational workloads. The measurement is the lag between source event and data availability.

Completeness SLOs specify how complete the data is relative to expected volume. Pertinent for workloads where missing records produce wrong analysis. The measurement is the count of records received versus expected.

Accuracy SLOs specify how accurate the data is against ground truth. Pertinent for workloads where wrong values produce wrong decisions. The measurement is sampled validation against known-correct values or other authoritative sources.

Availability SLOs specify how often the data is queryable. Pertinent for operational workloads where unavailable data breaks user-facing features. The measurement is the percentage of queries that succeed.

Lineage SLOs specify how traceable the data is back to source. Pertinent for regulated workloads where audit reconstruction is required. The measurement is the fraction of data with complete lineage metadata.

Most data products have two to four of these categories as active SLOs. Heavy regulatory workloads use all five. Internal-only analytical workloads might use only freshness and completeness.
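Each of the five measurements reduces to a small calculation once the inputs exist. The following is a minimal sketch, not a reference implementation: every function name, the record layout, and the required lineage fields are hypothetical, and it assumes an empty denominator should score 0.0 rather than pass.

```python
import random
from datetime import datetime, timedelta


def ratio(part, whole):
    """Fraction of `whole`; an empty denominator scores 0.0, not a pass."""
    return part / whole if whole else 0.0


def freshness_lag(source_event_at, available_at):
    """Lag between the source event and the data becoming available."""
    return available_at - source_event_at


def completeness(received, expected):
    """Records received versus expected, capped at 1.0.
    Over-delivery (for example duplicates) needs its own check."""
    return min(ratio(received, expected), 1.0)


def sampled_accuracy(product, reference, sample_size, seed=0):
    """Share of sampled keys whose value matches the authoritative source."""
    keys = sorted(product.keys() & reference.keys())
    rng = random.Random(seed)  # seeded so a reported number can be reproduced
    sample = rng.sample(keys, min(sample_size, len(keys)))
    matches = sum(1 for key in sample if product[key] == reference[key])
    return ratio(matches, len(sample))


def availability(succeeded_queries, total_queries):
    """Percentage of queries that succeed, expressed as a fraction."""
    return ratio(succeeded_queries, total_queries)


def lineage_coverage(records, required=("source_system", "pipeline_run_id")):
    """Fraction of records whose lineage metadata has every required field."""
    complete = sum(
        1 for record in records
        if all((record.get("lineage") or {}).get(field) for field in required)
    )
    return ratio(complete, len(records))


lag = freshness_lag(datetime(2024, 1, 1, 8, 0), datetime(2024, 1, 1, 8, 45))
records = [
    {"lineage": {"source_system": "erp", "pipeline_run_id": "r1"}},
    {"lineage": {"source_system": "erp"}},  # missing run id
    {},                                     # no lineage at all
]
coverage = lineage_coverage(records)
```

Each function returns a plain number, which is what lets the same target-and-window evaluation apply to every category.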

The Three Patterns That Make SLOs Operational

Three implementation patterns distinguish operational SLOs from theatrical ones. The patterns are necessary because SLOs require infrastructure that theatrical SLAs do not.

The first pattern is automated measurement infrastructure. The SLO measurement runs as code, not as periodic manual checks. The infrastructure usually combines data observability tools (Monte Carlo, Bigeye, Anomalo, custom monitoring) with metric stores (Prometheus, Datadog) and dashboards.

The measurement infrastructure has its own operational requirements. It has to be reliable enough that SLO measurements are accurate. It has to be efficient enough that measurement cost is reasonable. It has to be accessible enough that producers, consumers, and auditors can see current SLO status.

The second pattern is integration with operational practice. The SLO measurement feeds into incident response. When SLOs are at risk, the on-call team gets notified. When SLOs are violated, post-incident review follows. The SLO is part of how the data team operates, not a parallel reporting activity.

This integration usually requires SLO ownership inside the team that operates the data product. Pure ownership by a separate quality function tends to disconnect SLO measurement from operational response. Ownership by the operating team keeps the loop tight.

The third pattern is consumer-facing communication. Data product consumers can see current SLO status. They know when their dependencies are at risk. They can make informed decisions about whether to act on data that has missed its freshness SLO or is otherwise in violation.

Communication channels vary. Some teams publish SLO status in data catalogs. Some publish to Slack channels. Some integrate with consumer dashboards directly. The specific channel matters less than the communication actually happening.

What SLO Achievement Actually Costs

Operating SLOs at production grade requires real investment. The investment falls in three areas.

Measurement infrastructure has tooling cost (monitoring platforms, alerting systems) and engineering cost (integration, customization, maintenance). For a moderate-sized data platform, the annual cost typically lands in the $100K to $400K range depending on platform scale.

Engineering capacity to respond to SLO signals is the second area. Producers need bandwidth to fix issues that the SLO surfaces. Without the bandwidth, SLO measurement just documents failures without producing improvement. The capacity allocation is typically 15-25 percent of data engineering time at sustained operation.

Cultural investment is the third area. The team has to operate against SLOs rather than around them. The shift from theatrical SLAs to operational SLOs requires management commitment and ongoing reinforcement. The cultural work is harder than the technical work in many organizations.

The combined investment is substantial. The return is data products that consumers trust and audits that the organization passes.

What Goes Wrong With SLO Programs

Three patterns of SLO program failure are common enough to call out.

The first is SLOs without enforcement. The program defines SLOs and measurement but does not connect them to operational response. Violations get logged and nothing changes. The SLO becomes informational documentation. Producers learn that violations are acceptable.

The second is SLOs that are too ambitious. The program defines SLOs that the current architecture cannot meet. The team chronically misses them. The SLO becomes background noise that the team learns to ignore. Eventually the program loses credibility.

The third is SLOs that are too lax. The program defines SLOs that the current architecture meets easily. The SLOs do not surface the issues that consumers actually experience. Consumers complain about data quality while SLOs report green. The disconnect erodes trust in both.

The recovery path from each is calibration. SLOs need to be ambitious enough to drive improvement and achievable enough to be respected. The calibration usually takes a few quarters of iteration to settle.
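The last two failure modes can be spotted from data a program already has. This is a rough sketch under stated assumptions: the function name, the return labels, and the 50 percent miss limit are hypothetical, and consumer complaints are assumed to be counted per review period by some other process.

```python
def calibration_signal(periods_met, consumer_complaints, miss_limit=0.5):
    """periods_met: one bool per reporting period, True when the SLO was met.
    consumer_complaints: quality complaints raised over the same periods."""
    if not periods_met:
        raise ValueError("need at least one reporting period")
    miss_rate = periods_met.count(False) / len(periods_met)
    if miss_rate >= miss_limit:
        # Chronically missed: the target exceeds what the architecture delivers.
        return "too_ambitious"
    if miss_rate == 0 and consumer_complaints > 0:
        # Always green while consumers complain: the target hides real issues.
        return "too_lax"
    return "calibrated"


chronic = calibration_signal([False, False, True, False], consumer_complaints=2)
green_but_unhappy = calibration_signal([True, True, True, True], consumer_complaints=5)
```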

What Logiciel Does Here

Logiciel works with data engineering and platform teams establishing operational SLO programs. The work is typically structured around assessment of current quality commitments, identification of operational gaps, and sequenced buildout of measurement and response infrastructure.

The Data Reliability Engineering framework covers the broader DRE discipline that SLOs sit within. The Data Observability framework covers the measurement infrastructure that SLOs depend on.

A 30-minute working session is enough to assess your current quality commitments against operational SLO patterns.

Frequently Asked Questions

How do I pick the right SLO targets?

Through analysis of historical performance and consumer requirements. Look at what the data product has actually achieved over the past few quarters. Set targets slightly tighter than recent achievement to drive improvement. Adjust based on consumer feedback about what level of service they actually need.
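As a starting point, the target can be derived from recent compliance and then compared with what consumers say they need. The sketch below is only illustrative: `propose_target`, `gap_to_consumer_need`, the 0.005 tightening step, and the 0.999 ceiling are all assumptions to tune per product.

```python
from statistics import mean


def propose_target(recent_compliance, step=0.005, ceiling=0.999):
    """Target slightly tighter than the average of recent periods."""
    if not recent_compliance:
        raise ValueError("need at least one period of history")
    return round(min(mean(recent_compliance) + step, ceiling), 4)


def gap_to_consumer_need(proposed, consumer_need):
    """How far the proposed target falls short of the consumer's stated need."""
    return round(max(consumer_need - proposed, 0.0), 4)


# Compliance achieved over the last three quarters (hypothetical numbers).
history = [0.970, 0.975, 0.980]
target = propose_target(history)
gap = gap_to_consumer_need(target, consumer_need=0.99)
```

A non-zero gap is a conversation to have with the consumer, not a reason to adopt a target the product has never met.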

How do I handle SLO violations without producing alert fatigue?

Through tiered alerting. Warning thresholds before violation. Page on violation. Severity that escalates with sustained violation. The alerting has to be calibrated to actual consequence; pages for issues that consumers do not care about produce fatigue.
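The tiers described above can be expressed as a small decision function. This is a sketch under assumptions: the level names, the 0.003 warning margin, and the 24-hour escalation point are hypothetical values, not recommendations.

```python
def alert_level(compliance, target, warn_margin=0.003,
                hours_in_violation=0.0, escalate_after_hours=24.0):
    """Map current SLO compliance to an alert tier."""
    if compliance >= target + warn_margin:
        return "ok"
    if compliance >= target:
        return "warn"            # at risk: notify, do not page
    if hours_in_violation >= escalate_after_hours:
        return "page_escalated"  # sustained violation raises severity
    return "page"


levels = [
    alert_level(0.995, 0.99),
    alert_level(0.991, 0.99),
    alert_level(0.980, 0.99, hours_in_violation=1),
    alert_level(0.980, 0.99, hours_in_violation=30),
]
```

Keeping the warning tier out of the pager is the part that limits fatigue: only an actual violation wakes someone up.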

What about products with no clear single consumer?

Through stewardship. The product team holds SLOs based on the most demanding legitimate consumer. The SLOs serve all consumers; the demanding consumer drives the target setting.
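In code terms, stewardship means taking the strictest requirement across consumers. A minimal sketch, assuming each consumer states a maximum acceptable freshness lag in minutes; the function name and consumer names are hypothetical.

```python
def steward_target(max_lag_minutes_by_consumer):
    """Return (driving consumer, target) for the strictest freshness need."""
    if not max_lag_minutes_by_consumer:
        raise ValueError("no consumers registered")
    # The smallest tolerated lag is the most demanding requirement.
    consumer = min(max_lag_minutes_by_consumer,
                   key=max_lag_minutes_by_consumer.get)
    return consumer, max_lag_minutes_by_consumer[consumer]


needs = {"finance_reporting": 1440, "fraud_scoring": 15, "exec_dashboard": 60}
driver, target_minutes = steward_target(needs)
```

Whether a consumer's requirement is legitimate is a judgment the steward makes before it enters the input; the code only picks the strictest of what remains.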

How do these interact with regulatory requirements?

Some regulations imply SLOs (GDPR data subject request timeliness, HIPAA breach notification timing, financial reporting timing). The operational SLOs should align with regulatory requirements where they exist. Operational SLOs also produce evidence that regulators look for.

How do I get producer teams to commit to SLOs?

Through demonstration that SLOs are achievable and useful. Start with one product team willing to operate SLOs rigorously. The success becomes a reference. Other teams adopt the pattern. Forcing SLOs across teams that resist them produces resentment without operational improvement.

Sources:
- Monte Carlo, "State of Data Quality 2024"
- Google SRE Book, SLO guidance, 2024 update
