What Is LLM Evaluation And Testing?

Definition

LLM evaluation and testing is the practice of measuring whether a system built on a large language model actually works: whether its outputs are correct, useful, safe, and stable enough to ship and keep shipping. It is the LLM era's answer to the question software testing used to answer with assertions, and it is harder, because the thing being tested is non-deterministic and the outputs have no single right answer.

Traditional testing assumes you can specify the expected output. LLM systems break that assumption twice over. The same prompt can yield different completions across runs, and for most real tasks (summarize this document, answer this support ticket, draft this email) many different outputs are acceptable and many subtly wrong ones look acceptable. So evaluation shifts from "does the output equal X" to "does the output satisfy these properties," and measuring property satisfaction at scale becomes its own engineering problem.

The discipline has settled into a few working layers. Offline evaluation runs the system against curated test sets before deployment: golden examples with reference answers, adversarial cases, regression suites built from past failures. Online evaluation measures the live system: user feedback, task completion rates, sampled human review of production traffic. Connecting them is the eval harness, the infrastructure that runs test sets against any version of the system and reports differences, which is what makes prompt changes and model upgrades safe rather than vibes-based.

The most consequential technique, and the most abused, is LLM-as-judge: using a strong model to grade the outputs of the system under test. It is the only approach that scales subjective quality assessment past what human review can afford, and it works well enough to be standard practice. It also imports its own biases (judges favor longer answers, favor their own model family, and drift across versions), so unvalidated judge scores are a measurement of something, just not necessarily quality.

This page covers what evaluation actually involves, the methods in production use, how teams build eval suites that predict real behavior, and the failure modes that let badly measured systems ship anyway.

Key Takeaways

LLM evaluation measures correctness, usefulness, and safety for systems whose outputs are non-deterministic and have no single right answer.
Public benchmarks measure model families; they say almost nothing about your system on your task with your data.
A useful eval suite is built from your own failure cases and updated continuously, like a regression test suite.
LLM-as-judge makes subjective grading affordable at scale, but judges must themselves be validated against human judgments.
Offline evals gate deployment; online measurement catches what the test set missed; teams need both.

Why Familiar Testing Breaks Here

Unit testing rests on determinism: same input, same output, assert equality. LLMs guarantee neither a repeatable output nor a single correct one. Sampling introduces run-to-run variance, and even at temperature zero, model updates and context changes shift outputs. Asserting string equality against an LLM is testing the random seed.

The deeper problem is answer multiplicity. "Summarize this contract" has thousands of acceptable outputs and an enormous space of plausible-looking wrong ones, including the dangerous class: fluent, confident, subtly incorrect. Human reviewers catch these; exact-match metrics do not; and the subtle failures are precisely the ones that damage trust in production.

Failures are also distributional, not binary. A traditional bug either reproduces or does not. An LLM system might handle a query category correctly 94% of the time, and whether that is shippable depends entirely on the category. The unit of quality stops being the test case and becomes the failure rate per slice of traffic, which forces statistical thinking onto teams that did not previously need it.
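The shift from test cases to failure rates per slice can be sketched in a few lines. This is a minimal illustration, not a prescribed method: the `(slice_name, passed)` input shape and the function names are assumptions made for the example, and the Wilson score interval is one common choice for putting error bars on a rate measured from a small sample.

```python
import math
from collections import defaultdict


def wilson_interval(failures, total, z=1.96):
    """Approximate 95% Wilson score interval for a failure rate."""
    if total == 0:
        return (0.0, 1.0)  # no data: the rate could be anything
    p = failures / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return (max(0.0, center - half), min(1.0, center + half))


def failure_rates_by_slice(results):
    """results: iterable of (slice_name, passed) pairs (hypothetical shape)."""
    counts = defaultdict(lambda: [0, 0])  # slice -> [failures, total]
    for slice_name, passed in results:
        counts[slice_name][1] += 1
        if not passed:
            counts[slice_name][0] += 1
    return {
        name: {
            "failures": f,
            "total": n,
            "rate": f / n,
            "ci95": wilson_interval(f, n),
        }
        for name, (f, n) in counts.items()
    }
```

With 100 cases in a slice and 6 failures, the interval spans roughly 3% to 12%, which is a reminder that a "94% correct" headline from a small slice is a wide estimate, not a point.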

And the system under test is usually more than the model. A RAG pipeline can fail in retrieval (wrong documents), in generation (right documents, wrong answer), or in orchestration (right answer, wrong format for the downstream parser). Agent systems add tool calls and multi-step plans. End-to-end scores alone cannot localize a failure; useful evaluation measures components separately, the way integration tests never replaced unit tests.

What survives from traditional practice is the discipline, not the assertions: versioned test sets, CI gates, regression tracking, and the habit of turning every production incident into a test case. Teams that treat evals as a casual spreadsheet get casual quality; teams that treat them as test infrastructure get the compounding benefits testing has always given.

The Methods Actually in Use

Reference-based metrics compare outputs to gold answers. Exact match and F1 work for closed tasks (extraction, classification, math with verifiable answers). Semantic similarity scores stretch to open tasks but reward fluent paraphrase of wrong content. The honest rule: reference-based metrics are strong where answers are checkable and weak everywhere else, and most product tasks live everywhere else.
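For closed tasks, exact match and token-level F1 are simple enough to write by hand. The sketch below assumes a particular normalization (lowercasing, stripping punctuation, splitting on whitespace); real suites often normalize differently, and the function names are illustrative.

```python
import re
from collections import Counter


def normalize(text):
    """Lowercase, replace punctuation with spaces, split into tokens."""
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return text.split()


def exact_match(prediction, reference):
    """True when the normalized token sequences are identical."""
    return normalize(prediction) == normalize(reference)


def token_f1(prediction, reference):
    """Harmonic mean of token precision and recall (bag-of-tokens overlap)."""
    pred, ref = normalize(prediction), normalize(reference)
    if not pred or not ref:
        return float(pred == ref)  # both empty counts as a match
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)
```

Note what the metric cannot see: `token_f1` rewards shared words, so a fluent answer that reuses the reference vocabulary while stating the opposite can still score well.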

Property-based checks assert what must be true without specifying the full answer: the output is valid JSON, cites only retrieved documents, contains no PII, stays under length, never mentions competitors. They are cheap, deterministic, and fast, and they catch a surprising share of real failures. Every production system should run these as hard gates; they are the unit tests that still work.
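A hard gate of this kind might look like the following sketch. The output schema (`{"answer": ..., "citations": [...]}`), the violation labels, and the email regex used as a crude stand-in for PII detection are all assumptions made for illustration; a production PII check would need far more than one pattern.

```python
import json
import re

# Crude stand-in for PII detection: flags anything shaped like an email address.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def check_output(raw_output, retrieved_ids, max_chars=2000):
    """Return the list of violated properties; an empty list means pass."""
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError:
        return ["invalid_json"]
    if not isinstance(data, dict):
        return ["bad_schema"]

    answer = data.get("answer")
    citations = data.get("citations", [])
    if not isinstance(answer, str) or not isinstance(citations, list):
        return ["bad_schema"]

    violations = []
    if len(answer) > max_chars:
        violations.append("too_long")
    if any(cid not in retrieved_ids for cid in citations):
        violations.append("uncited_source")  # cites a document that was never retrieved
    if EMAIL_RE.search(answer):
        violations.append("possible_pii")
    return violations
```

Returning the list of violations rather than a single boolean keeps the gate strict while still telling the reviewer which property failed.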

LLM-as-judge handles the subjective middle: graded criteria like faithfulness, helpfulness, and tone. The mechanics matter enormously. Pairwise comparison (which of these two answers is better?) is more reliable than absolute scoring. Rubrics with explicit criteria beat "rate 1-10." Known biases need countermeasures: randomize position to beat order bias, control for length, and periodically check the judge against human labels. A judge that agrees with your human raters 90% of the time is a usable instrument; an unvalidated judge is a confident random number.
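One way to counter order bias in pairwise judging is sketched below. It goes a step beyond randomizing position: it asks the judge twice with the answers swapped and only records a winner when both orderings agree. The `judge` callable is hypothetical, a stand-in for whatever model call a team uses, and is assumed to return `"first"`, `"second"`, or `"tie"`.

```python
def debiased_preference(judge, question, answer_a, answer_b):
    """Ask a pairwise judge in both orders; count a win only if it is consistent.

    judge(question, first, second) is a hypothetical callable returning
    "first", "second", or "tie". Returns "A", "B", or "tie".
    """
    forward = judge(question, answer_a, answer_b)   # A shown first
    backward = judge(question, answer_b, answer_a)  # B shown first

    if forward == "first" and backward == "second":
        return "A"
    if forward == "second" and backward == "first":
        return "B"
    # Disagreement across orderings (or an explicit tie) is treated as no preference.
    return "tie"


# Two toy judges showing what the swap does and does not fix.
def position_biased_judge(question, first, second):
    return "first"  # always prefers whatever it sees first


def length_biased_judge(question, first, second):
    if len(first) == len(second):
        return "tie"
    return "first" if len(first) > len(second) else "second"
```

The swap neutralizes the position-biased judge (every comparison becomes a tie), but the length-biased judge still picks the longer answer consistently, which is why length needs its own control.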

Human evaluation remains the ground truth and the budget constraint. Expert review is the only trustworthy measure for high-stakes domains (medical, legal, financial) and the calibration source for every automated judge. The economical pattern: humans label a few hundred examples to validate the judge, the judge grades tens of thousands, and humans return on a sampling cadence to catch judge drift.
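Validating a judge against human labels comes down to comparing two label lists over the same examples. The sketch below reports raw agreement and Cohen's kappa, which corrects for the agreement two raters would reach by chance; using kappa here is a suggestion rather than something the text above prescribes, and the function name is illustrative.

```python
from collections import Counter


def agreement_and_kappa(human_labels, judge_labels):
    """Return (raw agreement, Cohen's kappa) for two aligned label lists."""
    if len(human_labels) != len(judge_labels) or not human_labels:
        raise ValueError("label lists must be non-empty and the same length")

    n = len(human_labels)
    observed = sum(h == j for h, j in zip(human_labels, judge_labels)) / n

    # Chance agreement: probability both raters pick the same label independently.
    human_counts = Counter(human_labels)
    judge_counts = Counter(judge_labels)
    expected = sum(human_counts[label] * judge_counts[label] for label in human_counts) / (n * n)

    if expected == 1.0:
        return observed, float("nan")  # kappa is undefined when only one label is ever used
    return observed, (observed - expected) / (1 - expected)
```

Raw agreement can look high simply because one label dominates; if 95% of cases are "pass", a judge that always says "pass" agrees 95% of the time while detecting nothing, and kappa exposes that.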

Online measurement closes the loop. The signals include thumbs ratings, edit distance between the draft and what the user actually sent, task abandonment, escalation to human agents, and A/B comparisons on real traffic. Online signals are noisy and confounded, but they measure the only thing that finally matters, which is whether the system helps. The mature setup uses offline evals to decide what may ship and online metrics to discover what the offline suite failed to anticipate.
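The draft-versus-sent signal can be approximated with the standard library. This sketch uses `difflib.SequenceMatcher` on word tokens, which yields a similarity ratio rather than a true Levenshtein distance; treating one minus that ratio as the "fraction edited" is an assumption made for illustration.

```python
from difflib import SequenceMatcher


def edit_fraction(draft, sent):
    """Rough share of the draft the user changed: 0.0 = sent as-is, 1.0 = rewritten.

    Compares word sequences with difflib's similarity ratio (not Levenshtein).
    """
    draft_words = draft.split()
    sent_words = sent.split()
    similarity = SequenceMatcher(None, draft_words, sent_words).ratio()
    return 1.0 - similarity
```

Tracked over time and per slice, a rising edit fraction suggests drafts are getting less usable even when thumbs ratings stay flat.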

Building an Eval Suite That Predicts Reality

Start from real traffic, not imagination. Test sets brainstormed in a conference room overrepresent clean, typical queries and miss how users actually write: fragments, typos, mixed languages, three questions in one, screenshots described in words. Pull from production logs (or pilot logs, or support tickets) from day one. A hundred real queries beat a thousand synthetic ones.

Weight the suite toward the edges. Uniform sampling produces suites that are 90% easy cases, where every candidate system scores well and the differences hide in the rounding. The informative cases live at the boundaries: ambiguous requests, adversarial inputs, questions whose answer is "I don't know," queries that should be refused. Deliberately oversample these. An eval suite's value is concentrated in the cases where systems disagree.
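Oversampling the edges amounts to sampling by per-category quota instead of uniformly. A minimal sketch, assuming cases have already been tagged with a category; the category names and quota numbers in the usage note are made up.

```python
import random


def build_suite(cases_by_category, quotas, seed=0):
    """Draw up to quotas[category] cases from each category pool.

    A fixed seed keeps the draw reproducible so the suite can be versioned.
    """
    rng = random.Random(seed)
    suite = []
    for category, quota in quotas.items():
        pool = cases_by_category.get(category, [])
        # Take the whole pool when it is smaller than the quota.
        suite.extend(rng.sample(pool, min(quota, len(pool))))
    return suite
```

For example, quotas such as `{"typical": 5, "ambiguous": 8, "refusal": 5}` give rare edge categories more seats than their share of traffic. The cost is that the suite's aggregate score no longer estimates the production failure rate, so report per-category results alongside it.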

Make every incident a test case. The single highest-value habit: when production produces a bad output, it goes into the regression set with a target behavior, permanently. Over a year this builds the asset that generic benchmarks can never be, a suite that encodes your system's actual failure history. It is the same logic as regression testing, and it compounds the same way.

Version everything and gate on diffs. The suite, the judge prompts, the scoring config: all versioned alongside the system. Every prompt change, retrieval tweak, or model upgrade runs the full suite in CI, and the review artifact is the diff: which cases improved, which regressed, and whether the regressions are acceptable. Aggregate scores hide regressions; per-case diffs surface them. This is the mechanism that makes iteration safe, and it is the difference between teams that upgrade models in a day and teams that take a quarter.
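The per-case diff is a small computation once each run is reduced to a map from case ID to pass or fail. The sketch below assumes that shape; the function name, field names, and the "block on any regression" policy in the usage note are illustrative choices.

```python
def diff_runs(baseline, candidate):
    """Compare two eval runs, each a dict of case_id -> bool (passed)."""
    shared = sorted(set(baseline) & set(candidate))
    return {
        "regressed": [c for c in shared if baseline[c] and not candidate[c]],
        "improved": [c for c in shared if not baseline[c] and candidate[c]],
        "missing": sorted(set(baseline) - set(candidate)),  # cases the new run skipped
        "added": sorted(set(candidate) - set(baseline)),
        "baseline_pass_rate": sum(baseline.values()) / len(baseline) if baseline else 0.0,
        "candidate_pass_rate": sum(candidate.values()) / len(candidate) if candidate else 0.0,
    }
```

A run where one case regresses and another improves shows identical pass rates before and after, which is exactly the situation an aggregate score hides. A CI gate could then fail the build whenever `regressed` or `missing` is non-empty and leave the accept-or-reject call to a reviewer.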

Keep the suite alive. Test sets rot in two directions: the product drifts (new features, new user populations the suite does not cover) and the suite leaks (its cases get used in prompts or fine-tuning data, so the system memorizes the test). Schedule refreshes, hold out cases that never touch development, and retire sections that no longer match traffic. An eval suite is a garden, not a monument.
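Leakage can be screened for cheaply by looking for long word sequences shared between eval cases and development material (prompt examples, fine-tuning data). This is a rough sketch under stated assumptions: whitespace tokenization, a default window of 8 words, and hypothetical function names. It catches verbatim reuse, not paraphrase.

```python
def word_ngrams(text, n):
    """Set of n-word windows from the lowercased, whitespace-split text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def leaked_cases(eval_cases, dev_texts, n=8):
    """Return IDs of eval cases sharing any n-word sequence with dev material.

    eval_cases: dict of case_id -> text. dev_texts: iterable of strings.
    Cases shorter than n words yield no n-grams and are never flagged.
    """
    dev_grams = set()
    for text in dev_texts:
        dev_grams |= word_ngrams(text, n)
    return [
        case_id
        for case_id, text in eval_cases.items()
        if word_ngrams(text, n) & dev_grams
    ]
```

Running a check like this whenever prompt examples or training data change turns the "no eval case is used to improve the system" rule into something enforceable.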

Evaluating Pipelines, Not Just Prompts

RAG systems need the retrieval and generation layers scored separately. Retrieval gets classic IR metrics: does the relevant document appear in the top k, how high does it rank. Generation gets faithfulness scoring: given these retrieved documents, is the answer supported by them. The split matters because the fixes differ completely; bad retrieval is a chunking or embedding problem, unfaithful generation is a prompting or model problem. End-to-end accuracy alone cannot tell you which budget to spend.
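The two retrieval questions (does a relevant document appear in the top k, and how high does it rank) correspond to hit rate at k and mean reciprocal rank. A minimal sketch, assuming each query is represented as a ranked list of document IDs plus a set of IDs labeled relevant; the names are illustrative.

```python
def hit_at_k(ranked_ids, relevant_ids, k):
    """True if any relevant document appears in the top k results."""
    return any(doc_id in relevant_ids for doc_id in ranked_ids[:k])


def reciprocal_rank(ranked_ids, relevant_ids):
    """1 / rank of the first relevant document, or 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def retrieval_metrics(runs, k=5):
    """runs: list of (ranked_ids, relevant_ids) pairs, one per query."""
    if not runs:
        raise ValueError("need at least one query")
    return {
        "hit_rate": sum(hit_at_k(r, rel, k) for r, rel in runs) / len(runs),
        "mrr": sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs),
    }
```

Scoring generation only on the queries where retrieval hit keeps the two failure sources from contaminating each other's numbers.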

Hallucination measurement deserves its own machinery. The standard approach decomposes an answer into atomic claims and checks each against the provided sources, by judge model or NLI classifier. Claim-level checking catches the most damaging pattern (nine supported claims and one invented one in a fluent paragraph) that whole-answer scoring misses. For systems that answer from documents, faithfulness is usually the metric most worth investing in.
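The claim-level approach has a simple skeleton once the two hard parts are abstracted away. In the sketch below, `split_claims` and `is_supported` are hypothetical callables standing in for the claim decomposer and the judge model or NLI classifier; the toy versions (sentence splitting and substring matching) exist only to make the example runnable and are far too weak for real use.

```python
def faithfulness_report(answer, sources, split_claims, is_supported):
    """Score an answer by the share of its claims supported by the sources.

    split_claims(answer) -> list of claim strings (hypothetical decomposer).
    is_supported(claim, sources) -> bool (hypothetical judge or NLI check).
    """
    claims = split_claims(answer)
    unsupported = [c for c in claims if not is_supported(c, sources)]
    score = 1.0 if not claims else 1 - len(unsupported) / len(claims)
    return {"claims": len(claims), "unsupported": unsupported, "score": score}


# Toy stand-ins, for illustration only.
def toy_split_claims(answer):
    return [s.strip() for s in answer.split(".") if s.strip()]


def toy_is_supported(claim, sources):
    return any(claim.lower() in source.lower() for source in sources)
```

Returning the unsupported claims themselves, not just the score, is what makes the single invented claim in an otherwise supported paragraph visible to a reviewer.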

Agents multiply the evaluation surface. A tool-using agent can pick the wrong tool, call the right tool with bad arguments, misread the result, or loop forever doing valid steps that accomplish nothing. Useful agent evals therefore score trajectories, not just final answers: was the plan sensible, were the calls well-formed, did it stop when done. Final-answer-only scoring passes an agent that succeeded by accident and tells you nothing about one that failed for a fixable reason.
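The mechanical part of trajectory scoring (well-formed calls, no loops, a clean stop) can be checked deterministically; whether the plan was sensible still needs a judge or a human. The sketch below assumes a trajectory is a list of `{"tool": ..., "args": {...}}` steps, that a tool named `finish` marks completion, and that `tool_schemas` maps each tool to its required argument names. All of these are illustrative conventions, not a standard format.

```python
import json


def score_trajectory(steps, tool_schemas, max_steps=10):
    """Return a list of (step_index, issue) pairs; an empty list means clean.

    Trajectory-level issues use None as the step index.
    """
    issues = []
    seen_calls = set()
    for index, step in enumerate(steps):
        tool = step.get("tool")
        args = step.get("args", {})
        if tool not in tool_schemas:
            issues.append((index, "unknown_tool"))
            continue
        if tool_schemas[tool] - set(args):
            issues.append((index, "missing_args"))
        # An identical repeated call is a cheap signal of looping.
        call_key = (tool, json.dumps(args, sort_keys=True))
        if call_key in seen_calls:
            issues.append((index, "repeated_call"))
        seen_calls.add(call_key)

    if len(steps) > max_steps:
        issues.append((None, "step_budget_exceeded"))
    if not steps or steps[-1].get("tool") != "finish":
        issues.append((None, "did_not_finish"))
    return issues
```

Because each issue carries the step index, a failing trajectory points at the specific call to fix rather than just lowering a score.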

Safety evaluation is adversarial by nature, so test like the adversary. Red-team suites of injection attempts, jailbreaks, requests for harm in costume, and topic-boundary probes, run on every release. Safety regressions arrive silently with model swaps and prompt edits; a refusal behavior that held last month is not evidence it holds today. Treat the red-team suite as a hard CI gate, the way security teams treat their scanners.

Multi-turn behavior needs its own cases, because single-turn suites miss whole failure classes: context loss across turns, contradiction of earlier statements, degradation as the conversation lengthens, vulnerability to gradual manipulation. Building multi-turn evals is genuinely harder (the test must simulate a user), which is exactly why the teams that bother find failures their competitors ship.

The Failure Modes That Let Bad Systems Ship

Benchmark transfer. The model scored highly on MMLU and the leaderboards, so the team assumed competence on their insurance-claims task. Public benchmarks rank base models on academic distributions; they do not measure your prompts, your retrieval corpus, your users, or your failure costs. The model choice is an input; the system eval is the test.

Goodharted metrics. Whatever number gates releases becomes the number the team optimizes, including by accident. Judge favors long answers; answers get longer. Faithfulness counts only checkable claims; outputs get vague. The defenses: multiple metrics in tension, periodic human audit of high-scoring outputs, and suspicion of any score that improves while users stay unhappy.

Eval theater. A suite exists, runs occasionally, and is consulted after decisions rather than before them. The tell is a team that can quote its eval scores but cannot name the last release the evals blocked. Evals that never fail anything are decoration; the fix is wiring them into CI as gates with owners, not dashboards with viewers.

Contaminated measurement. Test cases drift into prompt examples or fine-tuning data, scores climb, capability does not. Quietly common in fast-moving teams sharing data across workstreams. The defense is boring data hygiene: held-out sets with access controls and a rule that no eval case is ever used to improve the system directly.

Judge drift and judge trust. The judge model gets upgraded, scores shift, and the team reads a real quality change where there is only a new grader. Pin judge versions, re-calibrate against human labels on a schedule, and when comparing two systems, never let one of them grade the contest. None of these failures announces itself; that is the property they share, and the reason evaluation needs the same skepticism as any other measurement instrument.

Best Practices

Build the eval suite from production traffic and failures, oversampling edge cases, ambiguity, and adversarial inputs where systems actually differ.
Run cheap deterministic property checks (format, citations, PII, length) as hard gates on every output before any subjective scoring.
Validate LLM judges against human labels before trusting them, control for length and position bias, and pin judge versions.
Score pipeline components separately (retrieval vs. generation, plan vs. final answer) so failures localize to a fixable layer.
Wire the suite into CI with per-case diffs on every prompt, retrieval, or model change, and turn every production incident into a permanent test case.

Common Misconceptions

Strong benchmark scores do not mean the model works for your task; benchmarks rank base models, not your system on your data.
Non-determinism does not make testing impossible; it changes the unit of measurement from exact matches to failure rates and property satisfaction.
LLM-as-judge is not circular by definition; validated against human judgments and bias-controlled, it is a usable instrument, just never a free one.
Evaluation is not a pre-launch phase; suites that are not maintained stop predicting production behavior within months.
High average scores do not mean shippable; the per-slice failure rates, especially on the worst slices, are what users actually experience.

Frequently Asked Questions (FAQs)

What is LLM evaluation, in one sentence?

How is it different from regular software testing?

How many test cases do we need to start?

Is LLM-as-judge reliable enough to trust?

How do we measure hallucinations?

What should gate a model upgrade?

Do we still need human evaluation?

How do we evaluate agents rather than chatbots?

What does this cost, and what is the minimum viable setup?