A data observability playbook for Heads of Data who suspect the failures they don't see are the expensive ones.
When data fails silently, you don't know which numbers are wrong, and you don't know how wrong.
Silent data quality failures are the most expensive failures because they get into decisions.
Energy operators are particularly exposed.
Most data observability deployments are point monitoring dressed up.
Every dataset has an expected freshness — a maximum acceptable lag from source. Freshness monitoring fires when the lag exceeds the threshold.
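A freshness monitor can be sketched in a few lines. This is a minimal illustration, not any particular tool's implementation: the function name `is_stale`, the `last_loaded_at` timestamp, and the two-hour threshold in the usage lines are all assumptions made for the example.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def is_stale(
    last_loaded_at: datetime,
    max_lag: timedelta,
    now: Optional[datetime] = None,
) -> bool:
    """Return True when the lag from source exceeds the acceptable threshold.

    `last_loaded_at` is assumed to be the timezone-aware timestamp of the
    newest record (or the last successful load) for the dataset.
    """
    if now is None:
        now = datetime.now(timezone.utc)
    return (now - last_loaded_at) > max_lag


# Hypothetical usage: a dataset expected to lag its source by at most 2 hours.
check_time = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
last_load = datetime(2024, 1, 1, 9, 0, tzinfo=timezone.utc)
if is_stale(last_load, timedelta(hours=2), now=check_time):
    print("freshness alert: lag exceeds threshold")
```

Passing `now` explicitly keeps the check deterministic in tests; in production it would default to the current UTC time.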
Every dataset has an expected volume: record count, byte count, or both. Volume monitoring catches the second class of silent failure: the pipeline ran, but the data was the wrong size.
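A volume check can be as simple as comparing the observed record count against an expected count with a tolerance band. The sketch below assumes a fixed expected count and a symmetric 20% tolerance; both values, and the name `volume_in_range`, are hypothetical and would be set per dataset.

```python
def volume_in_range(row_count: int, expected: int, tolerance: float = 0.2) -> bool:
    """Return True when the observed row count is within the tolerance band.

    `tolerance` is a fraction of `expected` (0.2 means plus or minus 20%).
    """
    lower = expected * (1 - tolerance)
    upper = expected * (1 + tolerance)
    return lower <= row_count <= upper


# Hypothetical usage: the job succeeded, but only half the usual rows arrived.
if not volume_in_range(row_count=500, expected=1000):
    print("volume alert: pipeline ran, but the data was the wrong size")
```

The same shape works for byte counts; only the unit of `row_count` and `expected` changes.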
Schema changes are among the most common causes of silent failure: for example, a column type changes upstream.
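One way to detect this is to diff the schema observed at load time against a stored expected schema. The sketch below assumes each schema is available as a mapping from column name to type name; the function `schema_diff` and the column names in the usage lines are illustrative only.

```python
from typing import Dict, List


def schema_diff(expected: Dict[str, str], actual: Dict[str, str]) -> Dict[str, List[str]]:
    """Compare two column-name-to-type mappings and report what changed."""
    expected_cols = set(expected)
    actual_cols = set(actual)
    shared = expected_cols & actual_cols
    return {
        "added": sorted(actual_cols - expected_cols),
        "removed": sorted(expected_cols - actual_cols),
        # Columns present in both schemas whose declared type differs.
        "retyped": sorted(c for c in shared if expected[c] != actual[c]),
    }


# Hypothetical usage: a numeric column silently became a string upstream.
expected_schema = {"meter_id": "string", "reading_kwh": "float"}
actual_schema = {"meter_id": "string", "reading_kwh": "string", "site": "string"}
changes = schema_diff(expected_schema, actual_schema)
if any(changes.values()):
    print("schema alert:", changes)
```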
If you are a Head of Data and you suspect the failures you cannot see are the expensive ones, the answer is a five-class monitor program.
Job monitoring tells you whether the pipeline ran. Data observability tells you whether what came out of it was right. They are complementary.
We set thresholds from historical data, together with the data owner. We start sensitive and tune toward less noise as we learn the natural variation. Auto-tuning helps for stable signals.
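As a rough illustration of deriving a threshold from historical data, the sketch below builds an alert band from the mean and sample standard deviation of past values. This is one common approach, assumed here for the example rather than taken from the program itself; the multiplier `k`, the function name `threshold_band`, and the sample history are hypothetical. A small `k` gives a sensitive band, and raising it is one way to tune toward less noise.

```python
from statistics import mean, stdev
from typing import Sequence, Tuple


def threshold_band(history: Sequence[float], k: float = 2.0) -> Tuple[float, float]:
    """Return (lower, upper) alert bounds as mean +/- k sample standard deviations.

    Requires at least two historical observations.
    """
    mu = mean(history)
    sigma = stdev(history)
    return mu - k * sigma, mu + k * sigma


# Hypothetical usage: daily row counts for a stable dataset.
daily_counts = [100, 102, 98, 101, 99]
sensitive = threshold_band(daily_counts, k=2.0)  # tight band to start with
relaxed = threshold_band(daily_counts, k=4.0)    # wider band after tuning
print("sensitive band:", sensitive)
print("relaxed band:", relaxed)
```

For signals with strong seasonality or trend, a plain mean and standard deviation will misfire, which is where agreeing the band with the data owner matters.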
We have implemented this program on Monte Carlo, Anomalo, and on open-source stacks built on Soda + Datadog. Tool choice is downstream of the framework.