Models Decay Even When You Do Nothing
The model that worked yesterday may not work as well today. The system around it may be the same. The model itself may be the same. The performance may still drop. This is not a defect in the technology. It is the predictable consequence of running systems whose behavior depends on input distributions and external dependencies, both of which change over time.
DataRobot's 2024 production AI survey found 91 percent of production models showed measurable performance degradation within 12 months of deployment (DataRobot, "State of AI in Production 2024"). The number is large enough that monitoring for decay is mandatory rather than optional for any production-deployed model.
If your production model has been running for six months without monitoring beyond uptime, the actual quality state is probably worse than your last evaluation showed. The question worth answering is which decay mode is most active in your specific deployment.
The Three Decay Modes
Production AI decay comes in three modes. They are not mutually exclusive; a real system can suffer from all three simultaneously. They are operationally distinct because each one requires different monitoring and triggers different responses.
The first mode is input drift. The data flowing into the model changes over time. The customer mix shifts. The query patterns evolve. The document corpus the retrieval system searches grows or rotates. The model continues to produce output for the new inputs, but the output quality degrades because the inputs differ from what produced the model's training or eval baseline.
The second mode is dependency drift. The model, or something it depends on, changes in ways the operating team did not anticipate. Providers update models within a version, which is especially common for cloud-hosted models. External knowledge bases the model relies on (retrieval corpus, embedding indexes, fine-tuning data) shift in composition. Adjacent systems (other APIs the workflow depends on) modify their behavior. Each change can ripple into the model's effective behavior.
The third mode is concept drift. The relationship between inputs and correct outputs changes in the world. Customer support requests about a specific product feature mean something different after a UI redesign. Fraud patterns evolve. Medical guidelines update. The model was trained or prompted to produce outputs that were correct under prior conditions and continues to produce them under new conditions where they are wrong.
A monitoring program that addresses one or two of the three modes leaves the third uncovered. Production decay continues from the unmonitored mode regardless of how well the others are watched.
How to Detect Each Mode
The three modes each produce different observable signals. Knowing the signals shapes what monitoring to build.
Input drift produces measurable changes in input distribution: the distribution of input lengths, the topic mix of incoming queries, the demographic composition of users, the frequency of specific intents. Statistical drift detection on input features catches it. Tools like Evidently, WhyLabs, Arize, and the major MLOps platforms all support this.
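As a rough illustration of statistical drift detection on a single input feature, the sketch below computes the Population Stability Index (PSI) over query lengths. PSI is one common choice among several and is not something the tools above are claimed to use by default. The function name, the sample data, and the bin count are assumptions made for this example. A PSI above roughly 0.25 is often treated as a significant shift, but that cutoff is a rule of thumb to calibrate per workload.

```python
import math
from typing import Sequence


def psi(baseline: Sequence[float], current: Sequence[float], bins: int = 10) -> float:
    """Population Stability Index between two samples of one numeric feature."""
    lo, hi = min(baseline), max(baseline)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 if the baseline is constant

    def bucket_shares(values: Sequence[float]) -> list[float]:
        counts = [0] * bins
        for v in values:
            # Values outside the baseline range are clamped into the edge bins.
            idx = min(max(int((v - lo) / width), 0), bins - 1)
            counts[idx] += 1
        # A small floor keeps the logarithm defined when a bin is empty.
        return [max(c / len(values), 1e-6) for c in counts]

    base, cur = bucket_shares(baseline), bucket_shares(current)
    return sum((c - b) * math.log(c / b) for b, c in zip(base, cur))


# Hypothetical query lengths: a baseline, a sample with the same mix, and a shifted sample.
baseline_lengths = [20 + (i % 50) for i in range(1000)]
same_lengths = [20 + ((i * 7) % 50) for i in range(1000)]
shifted_lengths = [45 + (i % 50) for i in range(1000)]

psi_same = psi(baseline_lengths, same_lengths)        # near zero: no drift
psi_shifted = psi(baseline_lengths, shifted_lengths)  # large: inputs got longer
```

The same function can be applied per feature (length, intent frequency, segment share) on a rolling window, with the baseline frozen at the time of the last evaluation.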
Dependency drift produces changes in output behavior with unchanged input. The same input produces different output than it did before. Detection requires periodically re-running the same fixed inputs and comparing the outputs with stored results. This is regression testing applied to AI systems. A fixed eval set run on a schedule against the production model catches dependency drift.
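A minimal sketch of that regression check, assuming the model can be called as a plain function from prompt to text. The two stand-in models, the prompts, and the normalized exact-match comparison are all hypothetical. Generative outputs usually need a looser comparison, such as a similarity score or a graded rubric, in place of string equality.

```python
from typing import Callable, Dict, List


def normalized_equal(a: str, b: str) -> bool:
    """Compare outputs while ignoring case and surrounding whitespace."""
    return a.strip().lower() == b.strip().lower()


def find_regressions(
    model: Callable[[str], str],
    golden: Dict[str, str],
    same: Callable[[str, str], bool] = normalized_equal,
) -> List[str]:
    """Return the fixed inputs whose current output no longer matches the stored output."""
    return [prompt for prompt, expected in golden.items() if not same(model(prompt), expected)]


# Hypothetical stand-ins for the production model before and after a silent provider update.
def model_before(prompt: str) -> str:
    return {"refund policy?": "30 days", "support hours?": "9-5 weekdays"}[prompt]


def model_after(prompt: str) -> str:
    return {"refund policy?": "14 days", "support hours?": "9-5 Weekdays"}[prompt]


# Record golden outputs once, then re-run the same inputs on a schedule.
golden_outputs = {p: model_before(p) for p in ["refund policy?", "support hours?"]}
drifted = find_regressions(model_after, golden_outputs)
```

Here only the refund prompt is reported, because the capitalization change in the other answer is absorbed by the comparison function.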
Concept drift produces changes in output quality without obvious changes in inputs or outputs. The model still produces the same kind of output it always did. Users are less satisfied. Customer support tickets increase. Detection requires user feedback signals, customer outcome metrics, or human review of samples. Pure model-level monitoring misses concept drift entirely.
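One way to turn user feedback into a concept drift signal is to compare the negative-feedback rate in a recent window against a baseline window. The sketch below uses a two-proportion z-score for that comparison. The counts are invented for illustration, and the choice of test and any alerting cutoff are assumptions, not a prescribed method.

```python
import math


def feedback_shift_z(base_neg: int, base_total: int, recent_neg: int, recent_total: int) -> float:
    """Two-proportion z-score for the change in negative-feedback rate."""
    p_base = base_neg / base_total
    p_recent = recent_neg / recent_total
    # Pooled rate under the assumption that nothing has changed.
    pooled = (base_neg + recent_neg) / (base_total + recent_total)
    se = math.sqrt(pooled * (1 - pooled) * (1 / base_total + 1 / recent_total))
    return (p_recent - p_base) / se if se > 0 else 0.0


# Hypothetical counts: 4 percent thumbs-down at baseline, 7 percent in the recent window.
z = feedback_shift_z(base_neg=80, base_total=2000, recent_neg=70, recent_total=1000)
```

A positive score well above 2 suggests the rise in dissatisfaction is unlikely to be noise. It does not say why users are less satisfied, so it should trigger human review of samples, not an automatic fix.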
Mature production monitoring runs all three detection mechanisms. The combination catches what each mode produces.
The Response Patterns That Work
Detecting decay is half the work. Responding effectively is the other half. Each mode has different response patterns.
Input drift response usually involves updating the eval set to include the new input distribution, re-evaluating against the updated baseline, and adjusting the model or prompts if quality has degraded. The system may also need data pipeline changes if the input drift reflects underlying data quality issues that should be addressed upstream.
Dependency drift response depends on the source. Provider model updates may require version pinning, prompt adjustments, or migration. External knowledge base changes may require retrieval pipeline updates. Adjacent system changes may require interface adaptation. The response is specific to the dependency.
Concept drift response usually requires more substantial work. The model's underlying assumptions have to be updated. This can mean retraining if the model was trained, prompt updates and example refresh if the model was prompted, or workflow redesign if the underlying business reality has shifted enough that the AI workflow no longer fits.
Teams that have practiced these responses through drills handle real decay incidents faster than teams that improvise. The response runbooks for AI decay belong in the same operational discipline as incident response runbooks for traditional systems.
What This Costs
Building production-grade decay monitoring for a moderate AI workload typically requires one engineer for one to two months of initial setup, plus an ongoing 10 to 15 percent of one engineer's capacity for sustained operation.
The cost of not building it is the cost of decay-induced incidents, which usually appear as gradual quality regression that nobody noticed until customer complaints accumulated. The cost of these incidents is highly variable and often larger than the cost of the monitoring infrastructure.
For workloads where AI quality affects revenue or regulated outcomes, the monitoring investment is mandatory rather than optional regardless of the cost-benefit framing.
What Logiciel Does Here
Logiciel works with engineering teams operating production AI where the monitoring discipline has lagged the deployment pace. The work is typically structured around the three-decay-mode framework, with priority on whichever mode is currently causing the most quality degradation.
The AI Reliability Framework covers the four-surface observability that decay monitoring fits within. The Continuous Intelligence Reliability framework covers the broader system-level reliability that includes drift detection.
A 30-minute working session is enough to assess your current monitoring coverage against the three decay modes.
Frequently Asked Questions
How often should decay monitoring run?
Daily for the eval-against-fixed-inputs check (dependency drift). Continuous for input distribution monitoring (input drift). Weekly minimum for human review samples and user feedback aggregation (concept drift). The cadence reflects how quickly each mode typically emerges.
What is the right alert threshold?
Two thresholds usually. A warning threshold (5-10 percent degradation from baseline) that triggers investigation. A page threshold (15-20 percent degradation) that triggers immediate response. Calibrate based on workload sensitivity.
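The two-threshold scheme can be sketched as a small classifier over relative degradation from baseline. The function name and the default cutoffs (5 percent to warn, 15 percent to page, the low ends of the ranges above) are assumptions for illustration and should be calibrated per workload.

```python
def alert_level(baseline: float, current: float, warn_at: float = 0.05, page_at: float = 0.15) -> str:
    """Classify relative degradation from baseline as 'ok', 'warn', or 'page'."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    # Relative drop: 0.10 means the metric is 10 percent below its baseline.
    degradation = (baseline - current) / baseline
    if degradation >= page_at:
        return "page"
    if degradation >= warn_at:
        return "warn"
    return "ok"


# Hypothetical eval accuracy with a baseline of 0.90.
levels = [alert_level(0.90, score) for score in (0.89, 0.84, 0.75)]
```

With these inputs the three scores fall into the ok, warn, and page bands respectively.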
Can I rely on aggregate metrics or do I need per-segment monitoring?
Per-segment monitoring catches issues that aggregate hides. Aggregate accuracy can be stable while specific user segments experience meaningful regression. For workloads with identifiable segments (customer types, query categories, geographic regions), per-segment monitoring is worth the additional complexity.
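The effect is easy to reproduce with made-up numbers. In the sketch below, overall accuracy stays at 90 percent between two periods while one segment falls from 90 to 60 percent. The segment names and counts are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Record = Tuple[str, bool]  # (segment, was the output correct?)


def accuracy_by_segment(records: List[Record]) -> Dict[str, float]:
    """Accuracy computed separately for each segment."""
    totals, correct = defaultdict(int), defaultdict(int)
    for segment, is_correct in records:
        totals[segment] += 1
        correct[segment] += int(is_correct)
    return {s: correct[s] / totals[s] for s in totals}


def overall_accuracy(records: List[Record]) -> float:
    """Accuracy across all records, ignoring segments."""
    return sum(is_correct for _, is_correct in records) / len(records)


def make(segment: str, n_correct: int, n_total: int) -> List[Record]:
    """Build synthetic records with n_correct correct outputs out of n_total."""
    return [(segment, i < n_correct) for i in range(n_total)]


# Hypothetical data: the small enterprise segment regresses, the large smb segment improves.
before = make("enterprise", 90, 100) + make("smb", 810, 900)
after = make("enterprise", 60, 100) + make("smb", 840, 900)

overall_before, overall_after = overall_accuracy(before), overall_accuracy(after)
segments_after = accuracy_by_segment(after)
```

An alert that watches only the overall number would stay silent here, while a per-segment check would flag the enterprise segment.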
How do I distinguish concept drift from other quality issues?
By the signals. Concept drift typically shows user dissatisfaction without model output changes. Other issues (bugs, prompt errors, retrieval failures) typically show measurable output changes or specific error patterns. The diagnostic process narrows from broad to specific.
What tools should I use?
For input and dependency drift, Evidently (open source), WhyLabs, Arize, or the major MLOps platforms. For concept drift, customer feedback infrastructure plus periodic human review. The same vendor sometimes supplies tooling for both categories, but the disciplines are distinct.
Sources:
- DataRobot, "State of AI in Production 2024"
- MIT Sloan, "Closing the AI Reliability Gap," 2024