"It feels better" is not a release decision. This is the fill-in scorecard you complete per model and per version so you ship on numbers instead of gut feel. Score each metric, set the target first, and let pass or fail decide whether the model goes out.
A vibe check is a sample of one: a single reviewer reading a handful of cases they happened to pick. An eval set is fixed cases, fixed gold answers, and fixed scoring, run on every version.
The common pattern: One reviewer reads a few outputs, trusts their gut, and approves the release. The check is biased toward the prompts they remember and blind to the ones they do not.
The approach that works: A versioned eval set runs on every candidate. A faithfulness drop or a leaked record shows up as a number before a user finds it.
Pick the line before you run the eval, not after. If you score first, you will be tempted to move the line so the model passes. A medical assistant needs faithfulness near 1.0; an internal brainstorming tool can run looser. Write the target down either way.
Once a user reports a bug, it goes into the eval set and stays there forever. This is how you stop shipping the same failure twice. Your eval set is the real asset, not the model, so version it and lock it.
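One way to make "it goes in and stays there" mechanical is an append-only file of known failures. The sketch below assumes a JSONL file with one case per line; the `EvalCase` fields, the `known_failures` category name, and the `add_known_failure` helper are all hypothetical, not part of the scorecard itself.

```python
import json
from dataclasses import asdict, dataclass
from pathlib import Path


@dataclass(frozen=True)
class EvalCase:
    # Hypothetical schema: adapt the fields to your own eval set.
    case_id: str
    prompt: str
    gold_answer: str
    category: str = "known_failures"


def add_known_failure(path: Path, case: EvalCase) -> bool:
    """Append a case to a JSONL eval file unless its id is already present.

    Returns True if the case was added, False if it was a duplicate.
    Existing lines are never rewritten or removed.
    """
    existing_ids = set()
    if path.exists():
        with path.open(encoding="utf-8") as f:
            for line in f:
                if line.strip():
                    existing_ids.add(json.loads(line)["case_id"])
    if case.case_id in existing_ids:
        return False
    with path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(asdict(case)) + "\n")
    return True
```

Keeping the file in version control alongside the rest of the eval set gives you the history of when each failure was added.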
A failing metric should block the release automatically, not start an argument. No green scorecard, no ship. Then watch the same metrics on live traffic, because offline eval catches regressions before launch and production monitoring catches drift after.
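A minimal sketch of such a gate, assuming scores arrive as a plain dictionary of metric name to number. The `Target` class, the metric keys, and the latency threshold are illustrative assumptions; only the faithfulness and jailbreak-resistance thresholds are the scorecard's starting defaults.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Target:
    threshold: float
    higher_is_better: bool = True  # set False for latency, cost, toxicity, leakage


# Faithfulness and jailbreak resistance use the scorecard's starting defaults.
# The latency figure is a placeholder assumption, not a recommendation.
TARGETS = {
    "faithfulness": Target(0.90),
    "jailbreak_resistance": Target(0.95),
    "p95_latency_s": Target(2.0, higher_is_better=False),
}


def failing_metrics(scores: dict[str, float], targets: dict[str, Target]) -> list[str]:
    """Return the names of metrics that miss their target, in target order."""
    failures = []
    for name, target in targets.items():
        score = scores.get(name)
        if score is None:
            failures.append(name)  # an unscored metric counts as a failure
        elif target.higher_is_better and score < target.threshold:
            failures.append(name)
        elif not target.higher_is_better and score > target.threshold:
            failures.append(name)
    return failures


def gate(scores: dict[str, float], targets: dict[str, Target]) -> int:
    """Return a process exit code: 0 if every metric passes, 1 otherwise."""
    failures = failing_metrics(scores, targets)
    for name in failures:
        print(f"FAIL {name}: score={scores.get(name)} target={targets[name].threshold}")
    return 1 if failures else 0
```

In a CI job, a small wrapper script would pass the return value of `gate` to `sys.exit`, so a non-zero code blocks the pipeline without anyone having to make the call by hand.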
The scorecard covers ten metrics across retrieval, generation, safety, and ops, each with what it measures, how to score it, and a starting target. The metrics are context precision, context recall, faithfulness, answer relevance, correctness, toxicity, jailbreak resistance, PII leakage, p95 latency, and cost per answer. Each metric gets one row with a pass-or-fail box.
The eval set is organized as a category table with minimum case counts: 50 happy-path cases, 30 edge cases, 20 out-of-scope, 40 adversarial, 30 retrieval-hard, and a known-failures bucket that grows over time. Each category needs enough cases that one bad output does not swing the score.
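The minimums can be checked before any scoring run. This is a small sketch that assumes each case carries a category label; the key names and the `coverage_gaps` helper are hypothetical, while the counts are the ones listed above. The known-failures bucket has no stated minimum, so it is left out of the check.

```python
from collections import Counter

# Minimum case counts per category, taken from the category table.
MIN_CASES = {
    "happy_path": 50,
    "edge": 30,
    "out_of_scope": 20,
    "adversarial": 40,
    "retrieval_hard": 30,
}


def coverage_gaps(categories: list[str]) -> dict[str, int]:
    """Map each under-filled category to the number of cases still missing.

    `categories` holds one label per eval case. An empty result means every
    category meets its minimum.
    """
    counts = Counter(categories)
    return {
        category: minimum - counts[category]
        for category, minimum in MIN_CASES.items()
        if counts[category] < minimum
    }
```

The counts matter because of simple arithmetic: with 20 cases in a category, one bad output moves that category's pass rate by 5 percentage points, while with 50 cases it moves it by 2.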
The rollout is a four-step sequence: build an offline eval set, use LLM-as-judge with a calibrated rubric for the open-ended metrics, wire the scorecard into CI/CD as a gate, and monitor the same metrics in production with alerts on drift.
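The text does not say how the rubric gets calibrated. One common approach, offered here as an assumption, is to have humans label a sample of outputs and measure how well the judge agrees with them beyond chance, for example with Cohen's kappa. The `cohens_kappa` function below is a plain-Python sketch of that statistic.

```python
from collections import Counter


def cohens_kappa(human: list[str], judge: list[str]) -> float:
    """Cohen's kappa between human labels and judge labels on the same cases.

    1.0 means perfect agreement, 0.0 means agreement no better than chance.
    """
    if len(human) != len(judge) or not human:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(human)
    # Observed agreement: fraction of cases where both raters gave the same label.
    observed = sum(h == j for h, j in zip(human, judge)) / n
    # Expected agreement if the two raters labeled independently at their own rates.
    human_counts, judge_counts = Counter(human), Counter(judge)
    labels = set(human) | set(judge)
    expected = sum(human_counts[label] * judge_counts[label] for label in labels) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)
```

If agreement is low, the usual move is to tighten the rubric wording and re-measure before trusting the judge's scores on the full eval set.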
You cannot improve what you do not measure, and you cannot regression-test a vibe.
The scorecard is a fill-in template that scores a single model and version against fixed metrics across retrieval, generation, safety, and ops. You set a target for each metric in advance, score the candidate, and the pass or fail result decides whether it ships.
You need enough cases per category that one bad output does not swing the score. The template suggests 50 happy-path cases, 30 edge cases, 20 out-of-scope, 40 adversarial, and 30 retrieval-hard, plus a known-failures set that grows every time a user reports a bug.
The scorecard runs as a gate in CI/CD on every candidate before promotion. A fail on any hard metric blocks the release. After launch you monitor the same metrics on production traffic and alert when any one slips, so drift gets caught the same way regressions do.
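For the production half, a rolling average compared against the same target is one simple way to alert on drift. The `DriftMonitor` class, its window size, and the rolling-mean rule are assumptions made for illustration; a real deployment would more likely express this in its monitoring stack.

```python
from collections import deque


class DriftMonitor:
    """Track a rolling mean of one metric and flag when it crosses the target."""

    def __init__(self, target: float, window: int = 100, higher_is_better: bool = True):
        if window < 1:
            raise ValueError("window must be at least 1")
        self.target = target
        self.higher_is_better = higher_is_better
        self.scores = deque(maxlen=window)  # oldest scores drop off automatically

    def observe(self, score: float) -> bool:
        """Record one production score; return True if an alert should fire.

        No alert fires until the window is full, so a single early outlier
        does not page anyone.
        """
        self.scores.append(score)
        if len(self.scores) < self.scores.maxlen:
            return False
        mean = sum(self.scores) / len(self.scores)
        if self.higher_is_better:
            return mean < self.target
        return mean > self.target
```

Using the offline target as the alert line keeps the two checks comparable: the same number that blocked a release is the one that triggers an alert after launch.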
The scorecard has ten metrics, grouped into four sections. Retrieval covers context precision and recall. Generation covers faithfulness, answer relevance, and correctness. Safety covers toxicity, jailbreak resistance, and PII leakage. Ops covers p95 latency and cost per answer. Faithfulness is the hallucination metric and usually the one to watch first.
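As a data structure, the grouping might look like the sketch below. The `SECTIONS` mapping mirrors the four sections and ten metrics named above; the row fields and the `blank_scorecard` helper are hypothetical.

```python
# The four sections and their metrics, as listed in the text.
SECTIONS = {
    "Retrieval": ["context precision", "context recall"],
    "Generation": ["faithfulness", "answer relevance", "correctness"],
    "Safety": ["toxicity", "jailbreak resistance", "PII leakage"],
    "Ops": ["p95 latency", "cost per answer"],
}


def blank_scorecard(model: str, version: str) -> list[dict]:
    """Return one unfilled row per metric for a given model and version.

    `target`, `score`, and `passed` start as None and are filled in that
    order: target first, then score, then the pass-or-fail result.
    """
    return [
        {
            "model": model,
            "version": version,
            "section": section,
            "metric": metric,
            "target": None,
            "score": None,
            "passed": None,
        }
        for section, metrics in SECTIONS.items()
        for metric in metrics
    ]
```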
The scorecard ships with starting defaults, like faithfulness at or above 0.90 and jailbreak resistance at or above 0.95, but you set your own per use case. The rule that matters is to pick the line before you score and write it down, so you are not tempted to move it later.
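One way to make "write it down" enforceable is to fingerprint the targets before the scoring run and store that fingerprint with the results. The sketch below uses the two defaults quoted above; the function names and the choice of SHA-256 over canonical JSON are assumptions for illustration.

```python
import hashlib
import json

# The two starting defaults quoted in the text; other metrics are omitted here.
DEFAULT_TARGETS = {"faithfulness": 0.90, "jailbreak_resistance": 0.95}


def targets_fingerprint(targets: dict[str, float]) -> str:
    """Return a SHA-256 hex digest of the targets, independent of key order."""
    canonical = json.dumps(targets, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def targets_unchanged(targets: dict[str, float], recorded_fingerprint: str) -> bool:
    """Check that the targets used for scoring match the ones recorded earlier."""
    return targets_fingerprint(targets) == recorded_fingerprint
```

Record the fingerprint when the targets are agreed, then have the scoring run refuse to report a pass if `targets_unchanged` returns False. A moved line then shows up as a deliberate, reviewable change instead of a quiet edit.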