LLM evaluation and testing is the practice of measuring whether a system built on a large language model actually works: whether its outputs are correct, useful, safe, and stable enough to ship and keep shipping. It is the LLM era's answer to the question software testing used to answer with assertions, and it is harder, because the thing being tested is non-deterministic and the outputs have no single right answer.
Traditional testing assumes you can specify the expected output. LLM systems break that assumption twice over. The same prompt can yield different completions across runs, and for most real tasks (summarize this document, answer this support ticket, draft this email) many different outputs are acceptable and many subtly wrong ones look acceptable. So evaluation shifts from "does the output equal X" to "does the output satisfy these properties," and measuring property satisfaction at scale becomes its own engineering problem.
The discipline has settled into a few working layers. Offline evaluation runs the system against curated test sets before deployment: golden examples with reference answers, adversarial cases, regression suites built from past failures. Online evaluation measures the live system: user feedback, task completion rates, sampled human review of production traffic. Connecting them is the eval harness, the infrastructure that runs test sets against any version of the system and reports differences, which is what makes prompt changes and model upgrades safe rather than vibes-based.
The most consequential technique, and the most abused, is LLM-as-judge: using a strong model to grade the outputs of the system under test. It is the only approach that scales subjective quality assessment past what human review can afford, and it works well enough to be standard practice. It also imports its own biases (judges favor longer answers, favor their own model family, and drift across versions), so unvalidated judge scores are a measurement of something, just not necessarily quality.
This page covers what evaluation actually involves, the methods in production use, how teams build eval suites that predict real behavior, and the failure modes that let badly measured systems ship anyway.
Unit testing rests on determinism: same input, same output, assert equality. LLMs guarantee none of that. Sampling introduces run-to-run variance, and even at temperature zero, model updates and context changes shift outputs. Asserting string equality against an LLM is testing the random seed.
The deeper problem is answer multiplicity. "Summarize this contract" has thousands of acceptable outputs and an enormous space of plausible-looking wrong ones, including the dangerous class: fluent, confident, subtly incorrect. Human reviewers catch these; exact-match metrics do not; and the subtle failures are precisely the ones that damage trust in production.
Failures are also distributional, not binary. A traditional bug either reproduces or does not. An LLM system might handle a query category correctly 94% of the time, and whether that is shippable depends entirely on the category. The unit of quality stops being the test case and becomes the failure rate per slice of traffic, which forces statistical thinking onto teams that did not previously need it.
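A minimal sketch of what "failure rate per slice" can look like in code, assuming each eval result is reduced to a (slice name, passed) pair. The function names and the choice of a Wilson score interval are illustrative assumptions, not a prescribed method; the point is that a small slice yields a wide interval, so its failure rate should be read with more caution.

```python
import math
from collections import defaultdict


def wilson_interval(failures, total, z=1.96):
    """Approximate 95% Wilson score interval for a failure proportion."""
    if total == 0:
        return (0.0, 1.0)
    p = failures / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return (max(0.0, center - half), min(1.0, center + half))


def failure_rates_by_slice(results):
    """results: iterable of (slice_name, passed) pairs, one per eval case."""
    counts = defaultdict(lambda: [0, 0])  # slice -> [failures, total]
    for slice_name, passed in results:
        counts[slice_name][1] += 1
        if not passed:
            counts[slice_name][0] += 1
    return {
        name: {
            "failures": f,
            "total": n,
            "rate": f / n,
            "ci95": wilson_interval(f, n),
        }
        for name, (f, n) in counts.items()
    }


if __name__ == "__main__":
    # Hypothetical data: 100 billing cases with 6 failures, 5 refund cases with 1.
    demo = (
        [("billing", True)] * 94
        + [("billing", False)] * 6
        + [("refunds", True)] * 4
        + [("refunds", False)]
    )
    for name, stats in failure_rates_by_slice(demo).items():
        print(name, stats)
```

Whether a given rate is shippable is still a product decision per slice; the interval only says how much the sample can support that decision.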
And the system under test is usually more than the model. A RAG pipeline can fail in retrieval (wrong documents), in generation (right documents, wrong answer), or in orchestration (right answer, wrong format for the downstream parser). Agent systems add tool calls and multi-step plans. End-to-end scores alone cannot localize a failure; useful evaluation measures components separately, the way integration tests never replaced unit tests.
What survives from traditional practice is the discipline, not the assertions: versioned test sets, CI gates, regression tracking, and the habit of turning every production incident into a test case. Teams that treat evals as a casual spreadsheet get casual quality; teams that treat them as test infrastructure get the compounding benefits testing has always given.
Reference-based metrics compare outputs to gold answers. Exact match and F1 work for closed tasks (extraction, classification, math with verifiable answers). Semantic similarity scores stretch to open tasks but reward fluent paraphrase of wrong content. The honest rule: reference-based metrics are strong where answers are checkable and weak everywhere else, and most product tasks live everywhere else.
Property-based checks assert what must be true without specifying the full answer: the output is valid JSON, cites only retrieved documents, contains no PII, stays under length, never mentions competitors. Cheap, deterministic, fast, and they catch a surprising share of real failures. Every production system should run these as hard gates; they are the unit tests that still work.
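The checks listed above can be sketched as a single gate function. This is an illustration under stated assumptions: the output schema (a JSON object with "answer" and "citations" keys), the property names, and the email regex standing in for real PII detection are all hypothetical, and a production system would likely use a dedicated PII detector.

```python
import json
import re

# Assumption: an email pattern as a stand-in for real PII detection.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def check_output(raw, retrieved_ids, max_chars=2000, banned_terms=()):
    """Return the names of failed properties; an empty list means pass."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return ["valid_json"]
    if not isinstance(data, dict):
        return ["valid_json"]

    failures = []
    answer = str(data.get("answer", ""))
    citations = data.get("citations", [])
    allowed = list(retrieved_ids)

    # Every cited id must be one of the documents actually retrieved.
    if not isinstance(citations, list) or any(c not in allowed for c in citations):
        failures.append("cites_only_retrieved")
    if EMAIL_RE.search(answer):
        failures.append("no_pii")
    if len(answer) > max_chars:
        failures.append("under_length")
    lowered = answer.lower()
    if any(term.lower() in lowered for term in banned_terms):
        failures.append("no_banned_terms")
    return failures


if __name__ == "__main__":
    sample = json.dumps({"answer": "Your refund was issued.", "citations": ["doc1"]})
    print(check_output(sample, ["doc1", "doc2"]))
```

Returning the list of failed property names, rather than a single boolean, keeps the gate hard while still telling the reviewer which property broke.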
LLM-as-judge handles the subjective middle: graded criteria like faithfulness, helpfulness, and tone. The mechanics matter enormously. Pairwise comparison (which of these two answers is better?) is more reliable than absolute scoring. Rubrics with explicit criteria beat "rate 1-10." Known biases need countermeasures: randomize position to beat order bias, control for length, and periodically check the judge against human labels. A judge that agrees with your human raters 90% of the time is a usable instrument; an unvalidated judge is a confident random number.
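A sketch of two of those countermeasures, position randomization and a check against human labels. The `judge` callable is a hypothetical stand-in for a real model call, assumed to return "first", "second", or "tie"; nothing here reflects a specific vendor API.

```python
import random


def pairwise_verdict(judge, question, answer_a, answer_b, rng):
    """Show the two answers in random order; map the verdict back to 'a', 'b', or 'tie'."""
    swapped = rng.random() < 0.5
    first, second = (answer_b, answer_a) if swapped else (answer_a, answer_b)
    verdict = judge(question, first, second)  # hypothetical judge interface
    if verdict == "tie":
        return "tie"
    if verdict not in ("first", "second"):
        raise ValueError(f"unexpected judge verdict: {verdict!r}")
    picked_first = verdict == "first"
    if swapped:
        return "b" if picked_first else "a"
    return "a" if picked_first else "b"


def agreement_rate(judge_labels, human_labels):
    """Fraction of cases where the judge's label matches the human label."""
    if len(judge_labels) != len(human_labels) or not human_labels:
        raise ValueError("label lists must be non-empty and the same length")
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(human_labels)


if __name__ == "__main__":
    # A deliberately broken judge that always prefers whatever it sees first.
    def always_first(question, first, second):
        return "first"

    rng = random.Random(0)
    verdicts = [pairwise_verdict(always_first, "q", "A", "B", rng) for _ in range(1000)]
    print("share won by a:", verdicts.count("a") / len(verdicts))
```

With randomized order, a judge that only ever picks the first position splits its wins roughly evenly between the two systems, so order bias shows up as noise instead of a systematic advantage. Randomization does nothing for length bias: a judge that prefers the longer answer picks it in either position, which is why length needs its own control.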
Human evaluation remains the ground truth and the budget constraint. Expert review is the only trustworthy measure for high-stakes domains (medical, legal, financial) and the calibration source for every automated judge. The economical pattern: humans label a few hundred examples to validate the judge, the judge grades tens of thousands, and humans return on a sampling cadence to catch judge drift.
Online measurement closes the loop. Thumbs ratings, edit distance between draft and what the user actually sent, task abandonment, escalation to human agents, A/B comparisons on real traffic. Online signals are noisy and confounded, but they measure the only thing that finally matters, which is whether the system helps. The mature setup uses offline evals to decide what may ship and online metrics to discover what the offline suite failed to anticipate.
Start from real traffic, not imagination. Test sets brainstormed in a conference room overrepresent clean, typical queries and miss how users actually write: fragments, typos, mixed languages, three questions in one, screenshots described in words. Pull from production logs (or pilot logs, or support tickets) from day one. A hundred real queries beat a thousand synthetic ones.
Weight the suite toward the edges. Uniform sampling produces suites that are 90% easy cases, where every candidate system scores well and the differences hide in the rounding. The informative cases live at the boundaries: ambiguous requests, adversarial inputs, questions whose answer is "I don't know," queries that should be refused. Deliberately oversample these. An eval suite's value is concentrated in the cases where systems disagree.
Make every incident a test case. The single highest-value habit: when production produces a bad output, it goes into the regression set with a target behavior, permanently. Over a year this builds the asset that generic benchmarks can never be, a suite that encodes your system's actual failure history. It is the same logic as regression testing, and it compounds the same way.
Version everything and gate on diffs. The suite, the judge prompts, the scoring config: all versioned alongside the system. Every prompt change, retrieval tweak, or model upgrade runs the full suite in CI, and the review artifact is the diff: which cases improved, which regressed, and whether the regressions are acceptable. Aggregate scores hide regressions; per-case diffs surface them. This is the mechanism that makes iteration safe, and it is the difference between teams that upgrade models in a day and teams that take a quarter.
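A minimal sketch of a per-case diff and a gate built on it, assuming each run is stored as a mapping from case id to pass/fail. The function names and the idea of a "blocking" set of high-stakes case ids are illustrative assumptions.

```python
def diff_runs(baseline, candidate):
    """baseline, candidate: dict mapping case_id -> bool (passed)."""
    shared = baseline.keys() & candidate.keys()
    n = len(shared)
    return {
        "improved": sorted(c for c in shared if not baseline[c] and candidate[c]),
        "regressed": sorted(c for c in shared if baseline[c] and not candidate[c]),
        "added": sorted(candidate.keys() - baseline.keys()),
        "removed": sorted(baseline.keys() - candidate.keys()),
        # Pass rates are computed over shared cases only, so they are comparable.
        "baseline_pass_rate": sum(baseline[c] for c in shared) / n if n else None,
        "candidate_pass_rate": sum(candidate[c] for c in shared) / n if n else None,
    }


def gate(diff, blocking_cases=frozenset()):
    """Fail the gate if any regression touches a blocking (high-stakes) case."""
    blocked = [c for c in diff["regressed"] if c in blocking_cases]
    return (not blocked, blocked)


if __name__ == "__main__":
    baseline = {"c1": True, "c2": False, "c3": True}
    candidate = {"c1": False, "c2": True, "c3": True}
    diff = diff_runs(baseline, candidate)
    print(diff)
    print(gate(diff, blocking_cases={"c1"}))
```

In the demo data both runs pass two of three cases, so the aggregate score is unchanged, while the diff shows that one case regressed and another improved. That is the situation an aggregate-only report hides.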
Keep the suite alive. Test sets rot in two directions: the product drifts (new features, new user populations the suite does not cover) and the suite leaks (its cases get used in prompts or fine-tuning data, so the system memorizes the test). Schedule refreshes, hold out cases that never touch development, and retire sections that no longer match traffic. An eval suite is a garden, not a monument.
RAG systems need the retrieval and generation layers scored separately. Retrieval gets classic IR metrics: does the relevant document appear in the top k, how high does it rank. Generation gets faithfulness scoring: given these retrieved documents, is the answer supported by them. The split matters because the fixes differ completely; bad retrieval is a chunking or embedding problem, unfaithful generation is a prompting or model problem. End-to-end accuracy alone cannot tell you which budget to spend.
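The two retrieval questions above ("does the relevant document appear in the top k, how high does it rank") map onto hit rate at k and mean reciprocal rank. A small sketch, assuming each query is recorded as a ranked list of document ids plus the set of ids labeled relevant; the data layout and function names are assumptions.

```python
def hit_at_k(ranked_ids, relevant_ids, k):
    """True if any relevant document appears in the top k results."""
    return any(doc in relevant_ids for doc in ranked_ids[:k])


def reciprocal_rank(ranked_ids, relevant_ids):
    """1 / rank of the first relevant document, or 0.0 if none was retrieved."""
    for rank, doc in enumerate(ranked_ids, start=1):
        if doc in relevant_ids:
            return 1.0 / rank
    return 0.0


def retrieval_report(queries, k=5):
    """queries: list of (ranked_ids, relevant_ids) pairs, one per test query."""
    if not queries:
        raise ValueError("no queries to score")
    n = len(queries)
    return {
        f"hit_rate@{k}": sum(hit_at_k(r, rel, k) for r, rel in queries) / n,
        "mrr": sum(reciprocal_rank(r, rel) for r, rel in queries) / n,
    }


if __name__ == "__main__":
    # Hypothetical queries: the first finds its document at rank 2, the second misses.
    demo = [(["d3", "d1", "d7"], {"d1"}), (["d2", "d4"], {"d9"})]
    print(retrieval_report(demo, k=2))
```

Scoring retrieval this way, before any generation happens, is what lets a low end-to-end score be attributed to one layer or the other.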
Hallucination measurement deserves its own machinery. The standard approach decomposes an answer into atomic claims and checks each against the provided sources, by judge model or NLI classifier. Claim-level checking catches the most damaging pattern (nine supported claims and one invented one in a fluent paragraph) that whole-answer scoring misses. For systems that answer from documents, faithfulness is usually the metric most worth investing in.
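A sketch of the claim-level scoring step, assuming the answer has already been decomposed into atomic claims. The `is_supported` callable is where a judge model or NLI classifier would plug in; the substring matcher below is a toy stand-in for demonstration only and would be far too strict and too naive for real use.

```python
def faithfulness(claims, sources, is_supported):
    """Score an answer by the share of its claims supported by the sources.

    is_supported(claim, sources) -> bool is a hypothetical hook for a judge
    model or entailment classifier.
    """
    if not claims:
        # No checkable claims: report no score rather than a perfect one,
        # so vague outputs are not rewarded.
        return {"score": None, "unsupported": []}
    unsupported = [c for c in claims if not is_supported(c, sources)]
    score = (len(claims) - len(unsupported)) / len(claims)
    return {"score": score, "unsupported": unsupported}


def toy_is_supported(claim, sources):
    """Toy verifier: case-insensitive substring match. Not a real entailment check."""
    needle = claim.lower().rstrip(".")
    return any(needle in source.lower() for source in sources)


if __name__ == "__main__":
    sources = ["The policy covers water damage. The deductible is $500."]
    claims = [
        "The policy covers water damage.",
        "The deductible is $500.",
        "Claims are paid within 3 days.",
    ]
    print(faithfulness(claims, sources, toy_is_supported))
```

Returning the unsupported claims alongside the score matters more than the score itself: the single invented claim is the thing a reviewer needs to see.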
Agents multiply the evaluation surface. A tool-using agent can pick the wrong tool, call the right tool with bad arguments, misread the result, or loop forever doing valid steps that accomplish nothing. Useful agent evals therefore score trajectories, not just final answers: was the plan sensible, were the calls well-formed, did it stop when done. Final-answer-only scoring passes an agent that succeeded by accident and teaches you nothing about one that failed for a fixable reason.
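A sketch of trajectory scoring that reports several measures separately instead of one pass/fail. The step format (a dict with "tool" and "args"), the schema of required argument names per tool, and the measure names are all assumptions for illustration; it covers only the mechanical checks, not whether the plan was sensible or the tool result was read correctly.

```python
import json


def score_trajectory(steps, tool_schemas, task_succeeded, max_steps=10):
    """Score an agent run on separate measures.

    steps: list of {"tool": str, "args": dict} (hypothetical trace format).
    tool_schemas: dict mapping tool name -> set of required argument names.
    """
    n = max(len(steps), 1)  # an empty trajectory scores 0.0 on the rate measures
    known = [s for s in steps if s["tool"] in tool_schemas]
    well_formed = [
        s for s in known if set(tool_schemas[s["tool"]]).issubset(s["args"])
    ]
    # An identical call (same tool, same arguments) repeated is a loop signal.
    signatures = [(s["tool"], json.dumps(s["args"], sort_keys=True)) for s in steps]
    return {
        "tool_selection": len(known) / n,
        "argument_validity": len(well_formed) / n,
        "no_repeated_calls": len(signatures) == len(set(signatures)),
        "terminated": len(steps) <= max_steps,
        "task_success": bool(task_succeeded),
    }


if __name__ == "__main__":
    schemas = {"search": {"query"}, "refund": {"order_id", "amount"}}
    trace = [
        {"tool": "search", "args": {"query": "order 42"}},
        {"tool": "refund", "args": {"order_id": "42"}},  # missing "amount"
        {"tool": "search", "args": {"query": "order 42"}},  # repeated call
    ]
    print(score_trajectory(trace, schemas, task_succeeded=True))
```

In the demo trace the task is marked successful, yet one call is malformed and one is repeated. Outcome-only scoring would record a clean pass.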
Safety evaluation is adversarial by nature, so test like the adversary. Red-team suites of injection attempts, jailbreaks, requests for harm in costume, and topic-boundary probes, run on every release. Safety regressions arrive silently with model swaps and prompt edits; a refusal behavior that held last month is not evidence it holds today. Treat the red-team suite as a hard CI gate, the way security teams treat their scanners.
Multi-turn behavior needs its own cases, because single-turn suites miss whole failure classes: context loss across turns, contradiction of earlier statements, degradation as the conversation lengthens, vulnerability to gradual manipulation. Building multi-turn evals is genuinely harder (the test must simulate a user), which is exactly why the teams that bother find failures their competitors ship.
Benchmark transfer. The model scored highly on MMLU and the leaderboards, so the team assumed competence on their insurance-claims task. Public benchmarks rank base models on academic distributions; they do not measure your prompts, your retrieval corpus, your users, or your failure costs. The model choice is an input; the system eval is the test.
Goodharted metrics. Whatever number gates releases becomes the number the team optimizes, including by accident. Judge favors long answers; answers get longer. Faithfulness counts only checkable claims; outputs get vague. The defenses: multiple metrics in tension, periodic human audit of high-scoring outputs, and suspicion of any score that improves while users stay unhappy.
Eval theater. A suite exists, runs occasionally, and is consulted after decisions rather than before them. The tell is a team that can quote its eval scores but cannot name the last release the evals blocked. Evals that never fail anything are decoration; the fix is wiring them into CI as gates with owners, not dashboards with viewers.
Contaminated measurement. Test cases drift into prompt examples or fine-tuning data, scores climb, capability does not. Quietly common in fast-moving teams sharing data across workstreams. The defense is boring data hygiene: held-out sets with access controls and a rule that no eval case is ever used to improve the system directly.
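Access controls are the main defense, but a cheap automated check can catch the obvious leaks. A sketch, assuming eval cases and development texts (prompt examples, fine-tuning rows) are available as plain strings; the word n-gram overlap heuristic and the default of 8 words are assumptions, and paraphrased leaks will slip past it.

```python
import re


def normalize(text):
    """Lowercase, drop punctuation, collapse whitespace."""
    stripped = re.sub(r"[^a-z0-9\s]", "", text.lower())
    return re.sub(r"\s+", " ", stripped).strip()


def ngrams(text, n=8):
    """Set of word n-grams. Texts shorter than n words yield an empty set."""
    tokens = normalize(text).split()
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def leaked_cases(eval_cases, dev_texts, n=8):
    """Return ids of eval cases sharing any n-gram with development material.

    eval_cases: dict mapping case_id -> case text.
    dev_texts: iterable of prompt examples or training rows.
    """
    dev_grams = set()
    for text in dev_texts:
        dev_grams |= ngrams(text, n)
    return sorted(
        case_id for case_id, text in eval_cases.items() if ngrams(text, n) & dev_grams
    )


if __name__ == "__main__":
    cases = {
        "case-1": "How do I cancel my subscription before the renewal date hits?",
        "case-2": "What is the refund window for annual plans purchased on sale?",
    }
    prompts = ["Example: how do I cancel my subscription before the renewal date hits? Answer: ..."]
    print(leaked_cases(cases, prompts))
```

Run as a CI step over prompt files and training data, a check like this turns the "no eval case is ever used to improve the system" rule from a convention into something that can fail a build.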
Judge drift and judge trust. The judge model gets upgraded, scores shift, and the team reads a real quality change where there is only a new grader. Pin judge versions, re-calibrate against human labels on a schedule, and when comparing two systems, never let one of them grade the contest. None of these failures announces itself; that is the property they share, and the reason evaluation needs the same skepticism as any other measurement instrument.
The practice of systematically measuring whether an LLM-based system's outputs are correct, useful, and safe, using curated test sets, automated graders, and production signals in place of the exact-match assertions traditional testing relied on.
The system is non-deterministic and most tasks have many acceptable answers, so equality assertions are replaced by property checks, graded scoring, and failure rates per traffic slice. The surviving inheritance from software testing is the discipline: versioned suites, CI gates, and regression tracking.
Fifty to a hundred real cases with clear pass criteria beat a thousand synthetic ones. Start small and honest, gate releases on it, and grow it with every production failure. Suites built this way reach genuine usefulness within a few months.
After validation, yes, within limits. Check judge agreement against a few hundred human-labeled examples, use rubrics and pairwise comparisons, randomize answer order, and control for length bias. Skip the validation and you are gating releases on noise that happens to produce a number.
Decompose answers into individual claims and verify each against the source documents, using a judge model or entailment classifier. Claim-level checking catches the one invented fact inside an otherwise supported answer, which is the failure mode whole-answer scores miss and users remember.
The full offline suite, diffed per case against the current system: which cases improved, which regressed, whether regressions touch high-stakes slices, plus an unchanged red-team safety pass. Teams with this in CI evaluate new models in days; teams without it either upgrade blind or freeze.
Yes, in two permanent roles: ground truth for high-stakes domains, and calibration for automated judges. The working economy: humans label hundreds, validated judges grade tens of thousands, and humans re-audit on a schedule to catch drift.
Score the trajectory, not just the outcome: tool selection, argument validity, response handling, loop termination, and final task success as separate measures. Outcome-only scoring passes agents that succeed by luck and teaches you nothing about the ones that almost worked.
The floor is nearly free: a versioned file of real cases, deterministic property checks, and a scripted judge pass, run on every change. The realistic steady state for a production product is a part-time engineering commitment plus modest judge-inference spend, which is cheap against the cost of shipping a regression to every user at once.