Model drift is what happens when an AI model that worked well at launch produces worse results over time. The data the model sees in production shifts, the relationships it learned during training stop matching reality, or the world simply moves on. The model has not changed. Everything around it has, and quality silently erodes until somebody notices.
The term covers a few related phenomena. Data drift is when the distribution of inputs changes: customers ask different questions, document formats shift, new product categories appear. Concept drift is when the relationship between inputs and correct outputs changes: what counted as fraud last year is not what counts as fraud this year. Prediction drift is the downstream effect: the model's outputs shift in distribution even when inputs look similar. All three matter, and production teams have to monitor for each.
In LLM-based systems there is a fourth flavor specific to this era: provider drift. The vendor updates the foundation model. The same prompts produce subtly different outputs. Quality changes overnight without any code change on your side. This is the version of drift current AI teams encounter most often, and it calls for different defenses than traditional ML drift.
The common thread: the model gets worse without anyone changing the model. Detection requires monitoring. Response requires retraining, prompt updates, or version pinning depending on what kind of drift hit you.
Data drift is the most studied form. The input distribution at inference time differs from the training distribution. A model trained on customer reviews from 2023 sees reviews in 2026 that include topics, slang, and product references that did not exist in training. Statistical measures such as the Kolmogorov-Smirnov (KS) test and the Population Stability Index (PSI) detect data drift by comparing distributions over time.
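A minimal sketch of the KS check, assuming scipy is available. The reference and current arrays stand in for a training-time sample and a recent production window of one numeric feature; the sample sizes and significance cutoff are placeholders.

```python
# Minimal sketch: two-sample KS test on a single numeric feature.
# `reference` and `current` are stand-ins for a training snapshot and a
# recent production window pulled from your own logs.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
reference = rng.normal(loc=0.0, scale=1.0, size=5_000)   # training-time sample
current = rng.normal(loc=0.3, scale=1.1, size=5_000)     # recent production sample

statistic, p_value = ks_2samp(reference, current)
if p_value < 0.01:
    print(f"Possible data drift: KS={statistic:.3f}, p={p_value:.4f}")
else:
    print(f"No significant shift detected: KS={statistic:.3f}, p={p_value:.4f}")
```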
Concept drift is when the mapping from input to correct output changes. A fraud detection model learns that a certain transaction pattern is fraudulent. Fraudsters change their pattern. Now the same model misses the new pattern even though the inputs look statistically similar to past inputs. Concept drift is harder to detect because it requires ground truth labels for current data, which often arrive late.
Prediction drift is the symptom of either or both: the distribution of the model's outputs changes. It is detected by monitoring output distributions over time, and it is useful as an early warning even when ground truth is delayed.
Provider drift is specific to LLM applications using vendor APIs. The provider releases a new model version (often labeled the same way). The same prompts produce different outputs. Pinning to specific versions where supported is the first defense. Running evaluation against new versions before adopting them is the second.
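A sketch of version pinning, assuming the OpenAI Python SDK; the dated model identifier below is illustrative, so check which snapshot names your provider actually supports.

```python
# Sketch of version pinning with the OpenAI Python SDK. The dated model
# identifier is illustrative; consult your provider's docs for the snapshot
# names it actually supports.
from openai import OpenAI

client = OpenAI()

PINNED_MODEL = "gpt-4o-2024-08-06"   # dated snapshot, not a floating alias

response = client.chat.completions.create(
    model=PINNED_MODEL,
    messages=[{"role": "user", "content": "Summarize this ticket: ..."}],
)
print(response.choices[0].message.content)
```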
Label drift is a related concept where the labels themselves shift in meaning over time. What was labeled "high risk" in 2022 may not match what gets labeled high risk today, even by the same labelers. This affects training data quality and downstream model behavior.
Three layers of monitoring catch most drift. First, monitor input distributions. Track features statistically over time and alert when distributions shift beyond a threshold. PSI, KL divergence, and similar metrics work for tabular features. Embedding distributions can be monitored similarly for unstructured inputs.
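A rough PSI sketch for one numeric feature, using equal-width bins derived from the reference data. The bin count and the small floor on bin proportions are arbitrary choices for illustration, not a standard.

```python
# Minimal PSI sketch: bin the reference distribution, then compare the
# proportion of values landing in each bin for reference vs. current data.
import numpy as np

def population_stability_index(reference, current, bins=10):
    # Equal-width bin edges taken from the reference distribution.
    edges = np.linspace(np.min(reference), np.max(reference), bins + 1)
    # Clip both samples into the reference range so out-of-range production
    # values fall into the outermost bins instead of being dropped.
    reference = np.clip(reference, edges[0], edges[-1])
    current = np.clip(current, edges[0], edges[-1])

    ref_frac = np.histogram(reference, bins=edges)[0] / len(reference)
    cur_frac = np.histogram(current, bins=edges)[0] / len(current)

    # Floor the proportions so empty bins do not blow up the log term.
    ref_frac = np.clip(ref_frac, 1e-6, None)
    cur_frac = np.clip(cur_frac, 1e-6, None)
    return float(np.sum((cur_frac - ref_frac) * np.log(cur_frac / ref_frac)))

rng = np.random.default_rng(1)
psi = population_stability_index(rng.normal(0, 1, 5_000), rng.normal(0.4, 1.2, 5_000))
print(f"PSI = {psi:.3f}")
```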
Second, monitor predictions. The output distribution often shifts before ground truth catches up. A sudden jump in predicted positives might mean concept drift, data drift, or both. Either way it is signal worth investigating.
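A minimal sketch of one such check, comparing the daily predicted-positive rate against a baseline band. The baseline and tolerance numbers are placeholders for values derived from your own prediction logs.

```python
# Sketch: alert when the daily predicted-positive rate moves well outside
# its historical range. Baseline and tolerance are illustrative.
BASELINE_RATE = 0.042   # long-run average positive rate
TOLERANCE = 0.015       # allowed absolute deviation before alerting

def check_prediction_drift(daily_positive_rate: float) -> None:
    deviation = abs(daily_positive_rate - BASELINE_RATE)
    if deviation > TOLERANCE:
        # In a real system this would page someone or open a ticket.
        print(f"Prediction drift alert: rate={daily_positive_rate:.3f}, "
              f"baseline={BASELINE_RATE:.3f}")

check_prediction_drift(0.071)
```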
Third, monitor performance against ground truth as it becomes available. For some use cases ground truth arrives in days (a recommendation either gets clicked or not). For others it takes months (a loan default plays out over years). Where possible, this is the gold standard signal.
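One way this might look in practice, sketched with pandas: join logged predictions to ground-truth labels as they arrive and track accuracy per week. Column names, IDs, and dates are illustrative.

```python
# Sketch: join logged predictions with ground-truth labels as they arrive,
# then track accuracy by week. Table and column names are placeholders.
import pandas as pd

predictions = pd.DataFrame({
    "example_id": [1, 2, 3, 4],
    "predicted": [1, 0, 1, 1],
    "predicted_at": pd.to_datetime(
        ["2026-01-05", "2026-01-05", "2026-01-12", "2026-01-12"]
    ),
})
labels = pd.DataFrame({
    "example_id": [1, 2, 3],   # the label for example 4 has not arrived yet
    "actual": [1, 0, 0],
})

joined = predictions.merge(labels, on="example_id", how="inner")
joined["correct"] = joined["predicted"] == joined["actual"]
weekly_accuracy = joined.groupby(pd.Grouper(key="predicted_at", freq="W"))["correct"].mean()
print(weekly_accuracy)
```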
For LLM applications, sampled production traces evaluated by humans or judge models catch generation quality drift. Run a fixed evaluation set monthly and watch for score changes. Compare current production quality to historical baselines.
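A sketch of the monthly eval run. judge_score here is a placeholder for whatever grader you use (human review, an LLM judge, or a task-specific metric); it is not a real library call, and the regression margin is illustrative.

```python
# Sketch: score a fixed evaluation set each month and compare to a stored
# baseline score for the same set.
from statistics import mean

def judge_score(prompt: str, output: str) -> float:
    # Placeholder grader: swap in human review, an LLM judge, or a task
    # metric. Returns a 0-1 quality score.
    return 1.0 if output.strip() else 0.0

def run_monthly_eval(eval_set, generate, baseline_score, max_drop=0.05):
    scores = [judge_score(item["prompt"], generate(item["prompt"])) for item in eval_set]
    current = mean(scores)
    if current < baseline_score - max_drop:
        print(f"Quality drift: eval score {current:.2f} vs baseline {baseline_score:.2f}")
    return current
```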
Tools that help include Arize, Fiddler, WhyLabs, Evidently, and the major cloud providers' built-in monitoring. The choice depends on stack and depth of analysis needed. Most production AI systems adopt one and integrate it with deployment.
The response depends on the cause. For data drift, retraining on recent data usually resolves it. The retrain cadence depends on how fast the data shifts; some workloads need weekly retraining, some annual.
For concept drift, retraining helps but you need fresh labels reflecting the current relationship. Active learning, where the model flags uncertain cases for human labeling, helps maintain a current label set.
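A minimal sketch of the selection step, assuming each prediction is logged with a confidence score. Field names and the labeling budget are illustrative.

```python
# Sketch: route the least-confident predictions to a human labeling queue so
# fresh labels track the current input-output relationship.
def select_for_labeling(records, budget=100):
    # `records` are dicts like {"example_id": ..., "confidence": 0.0-1.0}.
    # Lowest-confidence cases are usually the most informative to label first.
    ranked = sorted(records, key=lambda r: r["confidence"])
    return ranked[:budget]

queue = select_for_labeling(
    [{"example_id": 1, "confidence": 0.51}, {"example_id": 2, "confidence": 0.97}],
    budget=1,
)
print(queue)   # the uncertain case goes to human review
```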
For prediction drift without confirmed ground truth changes, the response is investigation. Look at recent inputs, sample outputs, check whether the model is doing the right thing or drifting unhelpfully. Often the cause is upstream data quality (a sensor changed format, a feature pipeline broke).
For provider drift in LLM applications, evaluate the new model version against your test set. If quality holds or improves, migrate. If it regresses, stay on the older version until the regression is addressed (prompt adjustment, fallback to a different model). Pin model versions where supported.
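A sketch of the migration gate this implies; the scores and margin are placeholders for results from your own eval harness.

```python
# Sketch of a migration gate: adopt a new provider model version only if its
# score on your fixed test set does not regress beyond a small margin.
def should_migrate(current_score: float, candidate_score: float, margin: float = 0.02) -> bool:
    return candidate_score >= current_score - margin

current = 0.86     # pinned version on the fixed test set
candidate = 0.83   # newly released version on the same set
if should_migrate(current, candidate):
    print("Migrate to the new version")
else:
    print("Stay pinned; investigate the regression first")
```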
Establish a regular review cadence. Weekly or monthly review of monitoring metrics, with someone responsible for action when thresholds fire. Drift caught early is much cheaper to fix than drift that has compounded for months.
Daily checks on input and prediction distributions are common for high-traffic systems. Weekly evaluation against a labeled set catches generation quality drift in LLM systems. Monthly deeper reviews assess broader trends. The right cadence depends on how fast the underlying data and task can shift; fraud detection might need daily monitoring while a stable classification model might survive on weekly review.
Typical conventions: PSI under 0.1 indicates no significant change, 0.1 to 0.25 indicates moderate change worth investigating, and above 0.25 indicates significant drift requiring action. These thresholds are starting points; tune them based on your data's normal variability. Some features naturally fluctuate more than others.
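As a small illustration, those bands might translate into a helper like this; the cutoffs are the conventional starting points above, not fixed rules.

```python
# Sketch mapping a PSI value to the conventional action bands. Tune the
# cutoffs per feature once you know its normal variability.
def psi_action(psi: float) -> str:
    if psi < 0.1:
        return "no significant change"
    if psi <= 0.25:
        return "moderate change: investigate"
    return "significant drift: act (retrain, fix pipeline, or adjust prompts)"

print(psi_action(0.18))   # -> "moderate change: investigate"
```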
Traditional ML drift focuses on feature distributions and concept changes around custom-trained models. LLM drift adds provider-driven changes (the foundation model changes underneath you), retrieval drift (your vector database content shifts), and prompt drift (changes to prompts or templates can produce regressions). The detection methods overlap but the response patterns differ. Traditional ML drift usually means retraining; LLM drift often means prompt updates or version management.
Drift cannot be prevented outright, but it can be managed. Stable underlying processes drift slower than rapidly changing ones. Robust feature engineering, and prompts that depend on stable concepts, drift less than those tuned to current specifics. Regular retraining and prompt review handle the drift that does occur. The goal is detect-and-respond, not prevention.
Retrieval drift in RAG systems works similarly: the content in the vector database changes over time as documents are added, removed, or updated. Queries that worked with one corpus state may behave differently after updates. Monitoring covers both the corpus (size, distribution by category, recency) and retrieval results for fixed test queries. When retrieval changes affect quality, the response often involves chunking strategy updates or embedding model changes.
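One hedged way to watch retrieval for fixed test queries: compare today's top-k document IDs against a stored snapshot and flag low overlap. The retriever, query, and document IDs here are placeholders.

```python
# Sketch: for a fixed set of test queries, compare today's top-k retrieved
# document IDs against a stored snapshot. Low overlap flags retrieval drift
# worth reviewing by hand.
def topk_overlap(previous_ids, current_ids):
    prev, cur = set(previous_ids), set(current_ids)
    return len(prev & cur) / max(len(prev | cur), 1)   # Jaccard similarity

snapshot = {"refund policy": ["doc-12", "doc-31", "doc-7"]}   # stored last month
current = {"refund policy": ["doc-12", "doc-88", "doc-90"]}   # retrieved today

for query, prev_ids in snapshot.items():
    overlap = topk_overlap(prev_ids, current[query])
    if overlap < 0.5:
        print(f"Retrieval drift on {query!r}: overlap={overlap:.2f}")
```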
Real-time systems need fast detection. Streaming statistics on inputs and outputs, alerts that fire within minutes or hours rather than days, and automated rollback or fallback paths when quality degrades sharply. The infrastructure cost is higher but the detection latency matters when degraded quality affects active users.
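A rough sketch of a streaming check: keep a fast and a slow exponentially weighted average of some per-request quality proxy (model confidence here) and alert when they diverge sharply. The decay rates and threshold are illustrative, not recommendations.

```python
# Sketch of a streaming drift check using two exponentially weighted averages.
class StreamingDriftDetector:
    def __init__(self, fast_alpha=0.3, slow_alpha=0.001, threshold=0.15):
        self.fast = None      # reacts quickly to recent traffic
        self.slow = None      # tracks the long-run baseline
        self.fast_alpha, self.slow_alpha, self.threshold = fast_alpha, slow_alpha, threshold

    def update(self, value: float) -> bool:
        if self.fast is None:
            self.fast = self.slow = value
            return False
        self.fast += self.fast_alpha * (value - self.fast)
        self.slow += self.slow_alpha * (value - self.slow)
        return abs(self.fast - self.slow) > self.threshold   # True -> alert / fallback path

detector = StreamingDriftDetector()
for confidence in [0.90, 0.88, 0.91, 0.55, 0.52, 0.50, 0.48]:
    if detector.update(confidence):
        print("Sharp degradation detected; trigger fallback")
        break
```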
Ground truth is the most reliable signal but often the slowest. When available (for use cases where outcomes confirm or refute predictions), ground truth performance metrics are the best indicator of drift. When unavailable or delayed, distribution monitoring and human review of sampled outputs serve as proxies. Most systems combine both.
Fine-tuned models drift like traditional ML models when input distributions shift. They also drift when the base model is updated and the fine-tuned version becomes incompatible (with many providers, fine-tunes have to be redone on new base models). Monitor both data drift and provider deprecation timelines, and plan to redo fine-tunes periodically as base models evolve.
Without monitoring, quality degradation reaches users gradually. Customer satisfaction declines without obvious cause. Business metrics shift in ways that take time to attribute. Eventually somebody investigates and finds the model has been performing worse for months. The fix is usually retraining or prompt updates, but the cost is months of degraded user experience and potentially lost trust. The cost of monitoring is small relative to the cost of unmonitored drift.
Some models reach a point where drift exceeds what regular retraining can address. The underlying task has shifted enough that the model architecture or feature set is no longer right. Retirement and replacement become more economical than continued patching. Recognizing this point requires honest evaluation; teams sometimes invest disproportionately in maintaining models that should be replaced.