An AI evaluation harness is the test suite for an AI system. It is a defined set of inputs, expected behavior or quality criteria, and a scoring mechanism that runs whenever something changes (a new prompt, a new model version, a new retrieval setup). The harness produces measurable results that tell you whether changes are improvements, regressions, or neutral.
The reason it deserves its own concept: AI systems do not have unit tests in the traditional sense. The output is non-deterministic. There is no exception thrown when the model returns a wrong answer. Quality is a distribution, not a binary. An evaluation harness is what gives teams something rigorous to measure against in a domain that resists rigorous measurement.
In 2025 and 2026 the harness has become one of the most important investments for a production AI system. Teams that built one early can iterate confidently. Teams that did not are guessing whether their changes help. The discipline mirrors traditional software testing in role but differs significantly in mechanics.
The test cases are the most important part of the harness. They should reflect real production traffic patterns, including the long tail of edge cases. A common structure is a CSV or JSON file where each row is a test case with input, expected output (where available), and scoring criteria.
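One possible shape for such a file, with hypothetical field names and cases rather than a required schema:

```json
[
  {
    "id": "refund-policy-basic",
    "input": "Can I get a refund after 30 days?",
    "expected_output": "No. Refunds are only available within 30 days of purchase.",
    "scoring": "exact_match"
  },
  {
    "id": "multi-plan-edge-case",
    "input": "I have both the Pro and Team plans. Which one covers SSO?",
    "expected_output": null,
    "scoring": "judge",
    "rubric": "Identifies SSO as a Team plan feature and does not invent Pro plan features."
  }
]
```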
The scoring mechanism varies by case type. Where ground truth exists, exact-match or normalized-match scoring works (the model produced the expected answer). For generative outputs, heuristic checks evaluate specific properties (response is in JSON, includes required fields, cites a real source, stays under length limit). Judge model evaluation uses another LLM to score the output against a rubric on dimensions like correctness, completeness, and clarity.
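Heuristic checks of this kind are easy to express in code. A minimal Python sketch; the required fields and the length limit are illustrative assumptions, not part of any particular tool:

```python
import json

def heuristic_checks(output: str, max_words: int = 200) -> dict:
    """Structural checks that need no ground truth. Field names are illustrative."""
    results = {}
    # Is the response valid JSON?
    try:
        parsed = json.loads(output)
        results["valid_json"] = True
        # Does it include the fields the prompt asked for? (assumed fields)
        results["has_required_fields"] = all(k in parsed for k in ("answer", "source"))
    except json.JSONDecodeError:
        results["valid_json"] = False
        results["has_required_fields"] = False
    # Does it stay under the length limit?
    results["under_length_limit"] = len(output.split()) <= max_words
    return results
```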
The runner executes the test cases against the system, captures outputs, applies scoring, and produces aggregate results. Most tools offer this; teams can also write it themselves in a few hundred lines of Python.
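The core loop is small. A sketch, assuming test cases shaped like the JSON above, a system_fn that wraps your application, and a score_fn that returns 1.0 for a pass:

```python
def run_harness(test_cases, system_fn, score_fn):
    """Minimal runner: call the system on each case, score it, aggregate."""
    results = []
    for case in test_cases:
        output = system_fn(case["input"])
        score = score_fn(case, output)  # exact match, heuristics, or a judge
        results.append({"id": case["id"], "score": score, "output": output})
    passed = sum(1 for r in results if r["score"] >= 1.0)
    return {"results": results, "pass_rate": passed / len(results)}
```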
The reporting layer compares current results to baseline. Did quality go up, down, or stay flat? Which test cases regressed? Which improved? Visualization helps the team scan results quickly and identify patterns.
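Continuing the sketch above, a baseline comparison can be a few lines; the result shapes are the hypothetical ones returned by the runner sketch:

```python
def compare_to_baseline(current, baseline):
    """Report per-case regressions and improvements against a stored baseline run."""
    base = {r["id"]: r["score"] for r in baseline["results"]}
    regressed = [r["id"] for r in current["results"] if r["score"] < base.get(r["id"], 0)]
    improved = [r["id"] for r in current["results"] if r["score"] > base.get(r["id"], 0)]
    return {
        "pass_rate_delta": current["pass_rate"] - baseline["pass_rate"],
        "regressed": regressed,
        "improved": improved,
    }
```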
Production sampling extends the harness to real traffic. Periodically take a sample of production interactions, score them, and compare to baseline. This catches drift the offline harness cannot see.
Start small. Twenty to fifty representative test cases with clear expected behavior are enough to begin. The mistake is waiting until you have a perfect set of hundreds before running anything. Imperfect coverage that runs is more valuable than perfect coverage that does not.
Pull test cases from real production traffic where possible. Logs of past interactions show what the system actually faces. Cherry-pick representative successes, failures, and edge cases.
Define scoring criteria explicitly. "Is the response good?" is too vague. "Does the response correctly identify the customer's product, propose the right next step, and stay under 200 words?" is measurable.
Mix scoring methods. Exact match for cases with ground truth. Heuristic checks for structural properties. Judge model evaluation for subjective quality. Each catches different kinds of issues.
Run on every change. Before deploying a prompt update, run the harness. Before adopting a new model version, run the harness. Make running the harness a CI step where possible.
Update the harness over time. When you find new failure modes in production, add them as test cases. When use cases evolve, retire test cases that no longer reflect current traffic. The harness is a living artifact.
Promptfoo is a popular open-source tool for prompt evaluation. It supports multiple providers, parallel execution, and various scoring methods. Lightweight and easy to integrate.
DeepEval and Ragas focus on RAG-specific evaluation, with metrics for retrieval quality (context relevance, context recall) and generation quality (faithfulness, answer relevance) tuned to retrieval-augmented systems.
Braintrust and LangSmith Evals provide more polished platforms with experiment tracking, comparison views, and integration with their broader observability tools. Useful when teams want a managed UX rather than scripts.
Custom evaluation harnesses written in Python or TypeScript work well for many teams. The logic is not complex; what matters is having the harness exist and run. A 200-line script can serve a sophisticated team well.
For LLM-as-judge scoring, the same foundation models used in production can evaluate outputs against rubrics. Some teams use a more capable model as judge than the one being evaluated. Others use the same model. The trade-off is cost versus calibration.
Minimum useful coverage is 20 to 50 representative cases. Production-grade harnesses for important use cases typically have 100 to 500. Beyond that, the marginal value of additional cases drops, but coverage of edge cases and specific failure modes can justify more. The right number depends on use case complexity and risk.
Three approaches work. First, define expected outputs verbatim where there is a clear correct answer. Second, define quality criteria as a rubric (the response should accurately identify X, propose Y, avoid Z). Third, use reference outputs (this is one good answer; others may also be acceptable) with judge model scoring against the reference. Mix approaches based on case type.
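Illustrated as hypothetical test cases (the field names are assumptions, not a standard schema):

```python
# One illustrative test case per approach.
cases = [
    {   # 1. Verbatim expected output, scored by exact or normalized match
        "input": "What is the capital of France?",
        "expected_output": "Paris",
        "scoring": "exact_match",
    },
    {   # 2. Rubric of quality criteria, scored by heuristics or a judge model
        "input": "Summarize the attached incident report.",
        "rubric": "Identifies the root cause, lists affected services, stays under 150 words.",
        "scoring": "judge",
    },
    {   # 3. Reference output: one known-good answer, judge compares against it
        "input": "Draft a reply declining the meeting politely.",
        "reference_output": "Thanks for the invite. Unfortunately I can't make it this week.",
        "scoring": "judge_vs_reference",
    },
]
```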
A pattern where another LLM scores outputs against criteria. The judge model gets the original input, the system's output, and a rubric, then produces a score. Works well for subjective quality dimensions where exact match does not apply. Calibration matters; check that judge scores correlate with human judgment for your use case before relying on them.
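A minimal sketch of the pattern, assuming a call_model function that wraps whatever client you use for the judge model; the prompt wording and the 1-to-5 scale are illustrative:

```python
JUDGE_PROMPT = """You are grading an AI assistant's response.

Input: {input}
Response: {output}
Rubric: {rubric}

Score the response from 1 (fails the rubric) to 5 (fully satisfies it).
Reply with only the number."""

def judge_score(case: dict, output: str, call_model) -> int:
    """call_model sends a prompt to the judge model and returns its text reply."""
    prompt = JUDGE_PROMPT.format(input=case["input"], output=output, rubric=case["rubric"])
    reply = call_model(prompt)
    return int(reply.strip())
```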
On every meaningful change to the system: prompt updates, model upgrades, retrieval changes, tool definition updates. Many teams integrate the harness into CI so changes cannot deploy without running it. Production sampling adds periodic runs (daily or weekly) against current production behavior.
Two metrics layers. Retrieval quality measures whether the right chunks were returned: precision, recall, mean reciprocal rank against a labeled set. Generation quality measures whether the model answered correctly given the retrieved chunks: faithfulness to context, answer relevance, completeness. Tools like Ragas implement both layers.
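The retrieval-layer metrics are simple to compute per query once relevant chunks are labeled; mean reciprocal rank is the average of the per-query reciprocal ranks. A sketch:

```python
def retrieval_metrics(retrieved_ids, relevant_ids, k=5):
    """Per-query retrieval metrics against a labeled set of relevant chunk ids."""
    top_k = retrieved_ids[:k]
    hits = [cid for cid in top_k if cid in relevant_ids]
    precision = len(hits) / k
    recall = len(hits) / len(relevant_ids) if relevant_ids else 0.0
    # Reciprocal rank: 1 / position of the first relevant chunk, 0 if none found
    rr = 0.0
    for rank, cid in enumerate(retrieved_ids, start=1):
        if cid in relevant_ids:
            rr = 1.0 / rank
            break
    return {"precision@k": precision, "recall@k": recall, "reciprocal_rank": rr}
```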
For a 100-case harness with a frontier model judge, each run typically costs a few dollars to a few tens of dollars depending on prompt and judge sizes. Daily runs are economical for most teams. The cost is usually trivial relative to the value of catching regressions before deployment.
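A back-of-envelope version of that estimate; the token counts and per-million-token prices are illustrative assumptions, not any model's published pricing:

```python
# Rough cost model for one full harness run with a judge model.
cases = 100
system_in, system_out = 2000, 500   # assumed tokens per system call
judge_in, judge_out = 2500, 100     # assumed tokens per judge call
price_in, price_out = 5.00, 15.00   # assumed dollars per million tokens

input_tokens = cases * (system_in + judge_in)
output_tokens = cases * (system_out + judge_out)
cost = input_tokens / 1e6 * price_in + output_tokens / 1e6 * price_out
print(f"~${cost:.2f} per full run")  # about $3 with these assumptions
```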
Run each test case multiple times (often 3 to 5) and look at distributions. A model that scores well on average with low variance is more reliable than one with higher average but high variance. Set temperature to zero or low values to reduce randomness during evaluation runs.
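A sketch of repeat runs, reusing the hypothetical shapes from the runner sketch above:

```python
import statistics

def run_with_repeats(case, system_fn, score_fn, repeats=5):
    """Score the same case several times and report the score distribution,
    not a single draw."""
    scores = [score_fn(case, system_fn(case["input"])) for _ in range(repeats)]
    return {
        "mean": statistics.mean(scores),
        "stdev": statistics.stdev(scores) if repeats > 1 else 0.0,
        "scores": scores,
    }
```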
When a specific failure case is found in production, add it to the harness as a regression test. Future changes that reintroduce the failure get caught before deployment. Over time the regression test set becomes a record of issues the team has solved, ensuring they do not recur.
In addition to output quality, capture and score the agent's path: did it use the right tools, in a reasonable order, within step and cost budgets, and hand off to a human when needed? Tools like LangSmith and Braintrust support agent-specific evaluation. The evaluation is more involved than for simple chat but follows the same pattern of representative cases plus scoring.
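A sketch of path scoring, assuming the trace is a list of step records; the field names, the handoff tool name, and the budgets are illustrative, not any framework's actual format:

```python
def score_agent_trace(trace, expected_tools, requires_handoff=False,
                      max_steps=10, max_cost=0.50):
    """Score the agent's path, not just its final answer.
    `trace` is assumed to be a list of dicts like {"tool": ..., "cost": ...}."""
    tools_used = [step["tool"] for step in trace]
    handed_off = "handoff_to_human" in tools_used
    return {
        "used_expected_tools": set(expected_tools).issubset(set(tools_used)),
        "within_step_budget": len(trace) <= max_steps,
        "within_cost_budget": sum(step.get("cost", 0.0) for step in trace) <= max_cost,
        "handoff_correct": handed_off == requires_handoff,
    }
```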
In version control alongside the code. Treat them like other code artifacts: reviewed when added or changed, owned by a specific team, refactored over time. Some teams use spreadsheets early and migrate to JSON or YAML in version control as the harness matures. The format matters less than the discipline of treating test cases as durable assets.