Measuring LLM Systems
Scorers & LLM-as-judge
Lesson 2 of 5
What you'll learn
- Choose between exact, semantic, and rubric scorers
- Use an LLM-as-judge for open-ended output
- Recognize judge biases and how to constrain them
A scorer is only as honest as its match function. For "what is 2+2", exact match is perfect. For "summarize this ticket", exact match is useless: there are a thousand correct summaries, and none of them equals a single gold string. The hard part of evals is picking a scorer that captures what you actually care about without rewarding the wrong thing.
Three families of scorer
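Exact / programmatic. String or structural equality, regex, JSON-schema validation. Cheap and deterministic. A strict equality check never passes a wrong answer, though it can fail a correct one over formatting, and a loose regex or schema can pass a wrong one. Use it whenever the output space is closed: classifications, extracted fields, tool-call arguments.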
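As a rough illustration of the exact / programmatic family, here is a minimal sketch of three checks: normalized string equality, a regex test, and a structural check on JSON output. The function names (exactMatch, regexMatch, hasFields) and the sample strings are hypothetical, and hasFields is a hand-rolled type check standing in for real JSON-schema validation.

```javascript
// Hypothetical helper names; none of these come from the lesson.

// Exact match after trimming whitespace and lowercasing.
function exactMatch(output, gold) {
  const norm = (s) => s.trim().toLowerCase();
  return norm(output) === norm(gold) ? 1 : 0;
}

// Regex check: 1 if the pattern is found in the output, else 0.
// Pass a pattern without the "g" flag, since test() is stateful with it.
function regexMatch(output, pattern) {
  return pattern.test(output) ? 1 : 0;
}

// Structural check: the output must parse as JSON and carry each
// expected field with the expected typeof. A stand-in for a schema validator.
function hasFields(output, expected) {
  let parsed;
  try {
    parsed = JSON.parse(output);
  } catch {
    return 0; // not valid JSON at all
  }
  if (parsed === null || typeof parsed !== "object") return 0;
  const ok = Object.entries(expected).every(
    ([key, type]) => typeof parsed[key] === type
  );
  return ok ? 1 : 0;
}

console.log(exactMatch(" 4 ", "4"));
console.log(regexMatch("Order #12345 refunded", /#\d{5}\b/));
console.log(
  hasFields('{"label":"billing","priority":2}', {
    label: "string",
    priority: "number",
  })
);
```

All three return 0 or 1, so they can be averaged over a dataset without further conversion.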
Semantic. "Close enough in meaning." In production this is embedding similarity above a threshold; here we model it with normalization and keyword overlap. It tolerates paraphrase but can be fooled by fluent nonsense.
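The normalization-and-keyword-overlap stand-in might look something like the sketch below. The names tokenize, keywordOverlap, and semanticScore, the keyword list, and the 0.6 threshold are all assumptions made for illustration; a production scorer would compare embeddings instead.

```javascript
// Normalize: lowercase, replace punctuation with spaces, split on whitespace.
function tokenize(text) {
  return text
    .toLowerCase()
    .replace(/[^a-z0-9\s]/g, " ")
    .split(/\s+/)
    .filter(Boolean);
}

// Fraction of the expected keywords that appear in the output.
function keywordOverlap(output, keywords) {
  const tokens = new Set(tokenize(output));
  const wanted = keywords.map((k) => k.toLowerCase());
  if (wanted.length === 0) return 0;
  const hits = wanted.filter((k) => tokens.has(k)).length;
  return hits / wanted.length;
}

// Pass when the overlap clears a threshold (0.6 is an arbitrary choice).
function semanticScore(output, keywords, threshold = 0.6) {
  const overlap = keywordOverlap(output, keywords);
  return { overlap, pass: overlap >= threshold };
}

const keywords = ["refund", "order", "delayed"];

// A reasonable paraphrase passes.
console.log(semanticScore("Your order was delayed, so we issued a refund.", keywords));

// Keyword-stuffed nonsense passes too, which is the weakness described above.
console.log(semanticScore("Refund order delayed refund order delayed.", keywords));
```

Both calls clear the threshold, which is why a semantic score is usually paired with a second check rather than trusted alone.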
Rubric. A checklist of criteria, each scored, then combined. This is how you grade open-ended output without a single gold answer.
// Each criterion has a name and a weight; heavier criteria count more.
const rubric = [
  { name: "mentions_refund", weight: 2 },
  { name: "polite_tone", weight: 1 },
  { name: "under_50_words", weight: 1 },
];
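One way to turn that checklist into a number is a weighted pass rate: run one check per criterion, sum the weights of the criteria that pass, and divide by the total weight. The sketch below repeats the rubric so it runs on its own. The checks object and scoreWithRubric are hypothetical, and the polite_tone check is a crude keyword stand-in for something that really needs judgment.

```javascript
const rubric = [
  { name: "mentions_refund", weight: 2 },
  { name: "polite_tone", weight: 1 },
  { name: "under_50_words", weight: 1 },
];

// Hypothetical checks, one per criterion; each returns true or false.
const checks = {
  mentions_refund: (reply) => /\brefund(ed|s|ing)?\b/i.test(reply),
  // Crude keyword proxy; real tone grading needs a judge.
  polite_tone: (reply) =>
    /\b(please|thank you|thanks|sorry|apologi[sz]e)\b/i.test(reply),
  under_50_words: (reply) =>
    reply.trim().split(/\s+/).filter(Boolean).length < 50,
};

// Weighted pass rate in [0, 1], plus per-criterion details for debugging.
function scoreWithRubric(reply, rubric, checks) {
  let earned = 0;
  let possible = 0;
  const details = rubric.map(({ name, weight }) => {
    const passed = checks[name](reply);
    possible += weight;
    if (passed) earned += weight;
    return { name, weight, passed };
  });
  return { score: possible === 0 ? 0 : earned / possible, details };
}

const reply =
  "Sorry for the trouble. Your refund has been issued and should arrive in 3-5 days.";
console.log(scoreWithRubric(reply, rubric, checks));
```

Returning the per-criterion details alongside the score makes it easier to see which criterion a low-scoring reply missed.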
LLM-as-judge, and its biases
When even a rubric needs judgment ("is this tone polite?"), you hand the output to another model and ask it to score against the rubric, returning a number plus reasoning. This is LLM-as-judge, and it scales grading to open-ended tasks. It also imports the judge model's failure modes: a length bias (longer answers scored higher), a position bias in pairwise comparisons, and self-preference (a model favors text in its own style). Constrain it: force a fixed scale, demand reasoning before the score, and calibrate against a small human-labeled set.
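The constraints above can be sketched in code, under several assumptions: the prompt wording, the "SCORE:" output format, and the names buildJudgePrompt, parseJudgeOutput, and comparePairwise are all invented for illustration, and the judge functions are stubs rather than calls to any real model. The sketch fixes the scale to 1-5, asks for reasoning before the score, rejects malformed output, and runs pairwise comparisons in both orders to expose position bias.

```javascript
// Build a judge prompt that fixes the scale and asks for reasoning first.
function buildJudgePrompt(rubric, reply) {
  const criteria = rubric
    .map((c) => `- ${c.name} (weight ${c.weight})`)
    .join("\n");
  return [
    "Grade the reply against these criteria:",
    criteria,
    "First write one paragraph of reasoning.",
    "Then, on the last line, write exactly: SCORE: <integer from 1 to 5>",
    "",
    "Reply:",
    reply,
  ].join("\n");
}

// Parse the judge's raw text; anything off the fixed scale is rejected.
function parseJudgeOutput(raw) {
  const match = raw.match(/SCORE:\s*([1-5])\s*$/);
  if (!match) return null; // malformed output: do not guess a score
  return { score: Number(match[1]), reasoning: raw.slice(0, match.index).trim() };
}

// Pairwise comparison in both orders. judgeFn(x, y) returns "first" or
// "second". Only a verdict that survives the swap counts as a win.
function comparePairwise(judgeFn, a, b) {
  const forward = judgeFn(a, b);
  const swapped = judgeFn(b, a);
  if (forward === "first" && swapped === "second") return "a";
  if (forward === "second" && swapped === "first") return "b";
  return "tie"; // the verdict followed the position, not the content
}

// Stub judges for demonstration only.
const positionBiasedJudge = () => "first";
const contentJudge = (x) => (x.includes("refund") ? "first" : "second");

console.log(parseJudgeOutput("Polite and mentions the refund.\nSCORE: 4"));
console.log(comparePairwise(positionBiasedJudge, "We issued a refund.", "No."));
console.log(comparePairwise(contentJudge, "We issued a refund.", "No."));
```

The order swap only addresses position bias. Length bias and self-preference still need calibration against human labels.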
A judge is a system under test too
Never trust a judge you haven't evaluated. Hold out human-labeled cases and measure the judge's agreement with them. An uncalibrated judge gives you confident numbers that drift from reality.
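Measuring that agreement can be as simple as comparing the judge's score with the human label on each held-out case. The sketch below assumes both use the same 1-5 scale. The judgeAgreement function and the four sample cases are made up for illustration, and a real calibration set would be larger.

```javascript
// Compare judge scores with human labels on held-out cases.
// Each case is { human, judge }, both on the same 1-5 scale.
function judgeAgreement(cases) {
  if (cases.length === 0) throw new Error("need at least one labeled case");
  let exact = 0;
  let withinOne = 0;
  let absDiff = 0;
  for (const { human, judge } of cases) {
    const diff = Math.abs(human - judge);
    if (diff === 0) exact += 1;
    if (diff <= 1) withinOne += 1;
    absDiff += diff;
  }
  const n = cases.length;
  return {
    exact: exact / n, // share of cases with identical scores
    withinOne: withinOne / n, // share of cases at most one point apart
    meanAbsDiff: absDiff / n, // average size of the disagreement
  };
}

// Illustrative data only, not real measurements.
const heldOut = [
  { human: 5, judge: 5 },
  { human: 4, judge: 5 },
  { human: 2, judge: 4 },
  { human: 1, judge: 1 },
];

console.log(judgeAgreement(heldOut));
```

Tracking these numbers over time, and rerunning them whenever the judge prompt or model changes, is one way to notice drift before it contaminates the eval results.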
Run it. A deterministic rubric scorer grades a reply, and a simulated judge returns a 1-5 score with reasoning. No real model is called — the judge is plain JS over the rubric.
Which scorer is the right choice for open-ended output that has no single gold answer?