BuildBot

Measuring LLM Systems

Scorers & LLM-as-judge

Lesson 2 of 5

What you'll learn

  • Choose between exact, semantic, and rubric scorers
  • Use an LLM-as-judge for open-ended output
  • Recognize judge biases and how to constrain them

A scorer is only as honest as its match function. For "what is 2+2", exact match is perfect. For "summarize this ticket", exact match is useless: there are a thousand correct summaries, and none of them equals a single gold string. The hard part of evals is picking a scorer that captures what you actually care about without rewarding the wrong thing.

Three families of scorer

Exact / programmatic. String or structural equality, regex, JSON-schema validation. Cheap and deterministic, and strict equality produces no false positives (a loose regex still can). Use it whenever the output space is closed: classifications, extracted fields, tool-call arguments.
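As a rough illustration of the programmatic family, here is a minimal sketch of three such checks. The helper names (exactMatch, regexMatch, hasShape) are hypothetical, and hasShape is only a simplified stand-in for real JSON-schema validation.

```javascript
// Hypothetical helpers; each scorer returns 1 (pass) or 0 (fail).

// Exact match after trimming surrounding whitespace.
function exactMatch(output, gold) {
  return output.trim() === gold.trim() ? 1 : 0;
}

// Regex check: passes when the pattern matches anywhere in the output.
// Pass a non-global regex, since global regexes keep state between .test() calls.
function regexMatch(output, pattern) {
  return pattern.test(output) ? 1 : 0;
}

// Structural check: the output must parse as JSON and every required key
// must have the expected primitive type (a stand-in for schema validation).
function hasShape(output, shape) {
  let parsed;
  try {
    parsed = JSON.parse(output);
  } catch {
    return 0; // not valid JSON
  }
  if (parsed === null || typeof parsed !== "object") return 0;
  const ok = Object.entries(shape).every(
    ([key, type]) => typeof parsed[key] === type
  );
  return ok ? 1 : 0;
}

const shape = { label: "string", confidence: "number" };
console.log(exactMatch(" 4 ", "4"));                                  // 1
console.log(regexMatch("Order #A-1042 refunded", /#A-\d{4}\b/));      // 1
console.log(hasShape('{"label":"billing","confidence":0.9}', shape)); // 1
console.log(hasShape('{"label":"billing"}', shape));                  // 0
```

All three are deterministic: the same output always gets the same score, which is what makes them suitable for closed output spaces.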

Semantic. "Close enough in meaning." In production this is embedding similarity above a threshold; here we model it with normalization and keyword overlap. It tolerates paraphrase but can be fooled by fluent nonsense.
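The normalization-plus-keyword-overlap model mentioned above could look something like the following sketch. The stopword list, the Jaccard measure, and the 0.5 threshold are all assumptions chosen for illustration, not values taken from the lesson.

```javascript
// A tiny, hypothetical stopword list; a real one would be much longer.
const STOPWORDS = new Set([
  "a", "an", "the", "is", "was", "to", "of", "and", "for", "your", "has", "been",
]);

// Normalize: lowercase, strip punctuation, split on whitespace, drop stopwords.
function keywords(text) {
  return new Set(
    text
      .toLowerCase()
      .replace(/[^a-z0-9\s]/g, " ")
      .split(/\s+/)
      .filter((w) => w && !STOPWORDS.has(w))
  );
}

// Jaccard overlap between keyword sets: |A ∩ B| / |A ∪ B|, in [0, 1].
function overlap(output, gold) {
  const a = keywords(output);
  const b = keywords(gold);
  if (a.size === 0 && b.size === 0) return 1;
  let shared = 0;
  for (const w of a) if (b.has(w)) shared++;
  return shared / (a.size + b.size - shared);
}

// Pass when the overlap reaches the (assumed) threshold.
function semanticMatch(output, gold, threshold = 0.5) {
  return overlap(output, gold) >= threshold ? 1 : 0;
}

const gold = "Your refund has been issued.";
console.log(semanticMatch("The refund was issued!", gold));      // 1
console.log(semanticMatch("issued refund refund issued", gold)); // 1
console.log(semanticMatch("Please restart the router", gold));   // 0
```

The second call shows the weakness: a word salad that happens to contain the right keywords passes just as easily as a well-formed reply.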

Rubric. A checklist of criteria, each scored, then combined. This is how you grade open-ended output without a single gold answer.

const rubric = [
  { name: "mentions_refund", weight: 2 },
  { name: "polite_tone", weight: 1 },
  { name: "under_50_words", weight: 1 },
];
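One way to turn that rubric into a number is a weighted fraction of criteria passed. The sketch below assumes each criterion has a deterministic check keyed by its name; the checks object and the scoreRubric function are hypothetical, and the polite_tone check is only a crude keyword proxy for something that really needs judgment.

```javascript
const rubric = [
  { name: "mentions_refund", weight: 2 },
  { name: "polite_tone", weight: 1 },
  { name: "under_50_words", weight: 1 },
];

// Hypothetical deterministic checks, one per criterion name.
const checks = {
  mentions_refund: (reply) => /\brefund/i.test(reply),
  polite_tone: (reply) => /\b(please|thank you|thanks|sorry)\b/i.test(reply),
  under_50_words: (reply) => reply.trim().split(/\s+/).length < 50,
};

// Weighted fraction of criteria passed, in [0, 1], plus per-criterion results.
function scoreRubric(reply, rubric, checks) {
  let earned = 0;
  let total = 0;
  const results = {};
  for (const { name, weight } of rubric) {
    const passed = Boolean(checks[name](reply));
    results[name] = passed;
    total += weight;
    if (passed) earned += weight;
  }
  return { score: total === 0 ? 0 : earned / total, results };
}

const reply = "Sorry for the trouble. Your refund was issued today.";
console.log(scoreRubric(reply, rubric, checks).score); // 1
console.log(scoreRubric("No.", rubric, checks).score); // 0.25
```

Keeping the per-criterion results alongside the total makes a low score debuggable: you can see which criterion failed rather than only that something did.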

LLM-as-judge, and its biases

When even a rubric needs judgment ("is this tone polite?"), you hand the output to another model and ask it to score against the rubric, returning a number plus reasoning. This is LLM-as-judge, and it scales grading to open-ended tasks. It also imports the judge model's failure modes: a length bias (longer answers scored higher), a position bias in pairwise comparisons, and self-preference (a model favors text in its own style). Constrain it: force a fixed scale, demand reasoning before the score, and calibrate against a small human-labeled set.
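Two of those constraints can be enforced in code around the judge. The sketch below assumes a reply format of a Reasoning line followed by a Score line (that format is an assumption, not something the lesson specifies), and it controls position bias by running a pairwise comparison in both orders. The judge argument is a hypothetical stand-in; it is synchronous here because no model is called, whereas a real judge would be asynchronous.

```javascript
// Parse a judge reply of the assumed form:
//   Reasoning: <free text>
//   Score: <integer 1-5>
// Returns null when the format is broken, so malformed output is never scored.
function parseJudgeReply(text) {
  const match = text.match(
    /^Reasoning:\s*(\S[\s\S]*?)\s*^Score:\s*([1-5])\s*$/m
  );
  if (!match) return null;
  return { reasoning: match[1], score: Number(match[2]) };
}

// Pairwise comparison with position-bias control: ask twice with the order
// swapped and accept a winner only when both orderings agree.
// `judge(x, y)` is assumed to return "first" or "second".
function comparePair(judge, a, b) {
  const winner1 = judge(a, b) === "first" ? "a" : "b";
  const winner2 = judge(b, a) === "first" ? "b" : "a";
  return winner1 === winner2 ? winner1 : "tie";
}

// Stub judges for illustration only.
const alwaysFirst = () => "first"; // pure position bias
const prefersRefund = (x, y) =>
  /refund/i.test(x) && !/refund/i.test(y) ? "first" : "second";

console.log(parseJudgeReply("Reasoning: Mentions the refund.\nScore: 4"));
console.log(parseJudgeReply("Score: 4\nReasoning: Mentions the refund.")); // null
console.log(comparePair(alwaysFirst, "A", "B")); // tie
console.log(comparePair(prefersRefund, "Your refund is on its way.", "Noted.")); // a
```

A judge that only ever prefers the first position collapses to a tie under the swap, which is the signal that its verdict carried no information about the content.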

A judge is a system under test too

Never trust a judge you haven't evaluated. Hold out human-labeled cases and measure the judge's agreement with them. An uncalibrated judge gives you confident numbers that drift from reality.
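Measuring that agreement can be as simple as the sketch below. The held-out scores are made up purely for illustration, and the three statistics (exact agreement, within-one agreement, mean absolute error) are one reasonable choice among several, not a method prescribed by the lesson.

```javascript
// Each case pairs a human label with the judge's score on the same 1-5 scale.
// These values are invented for illustration only.
const heldOut = [
  { human: 5, judge: 5 },
  { human: 4, judge: 5 },
  { human: 2, judge: 2 },
  { human: 1, judge: 3 },
  { human: 3, judge: 3 },
];

// Summarize how closely the judge tracks the human labels.
function agreement(cases) {
  const n = cases.length;
  if (n === 0) throw new Error("need at least one labeled case");
  let exact = 0;
  let withinOne = 0;
  let absError = 0;
  for (const { human, judge } of cases) {
    const diff = Math.abs(human - judge);
    if (diff === 0) exact++;
    if (diff <= 1) withinOne++;
    absError += diff;
  }
  return {
    exact: exact / n,           // share of cases with identical scores
    withinOne: withinOne / n,   // share of cases off by at most one point
    meanAbsError: absError / n, // average distance from the human label
  };
}

console.log(agreement(heldOut));
// { exact: 0.6, withinOne: 0.8, meanAbsError: 0.6 }
```

Rerunning this check whenever the judge prompt or judge model changes is what catches the drift described above.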

A rubric scorer and a simulated judge

Run it. A deterministic rubric scorer grades a reply, and a simulated judge returns a 1-5 score with reasoning. No real model is called — the judge is plain JS over the rubric.
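A simulated judge of that kind might look like the sketch below. This is not the lesson's own code: simulatedJudge is a hypothetical function, the input shape { score, results } is an assumption, and the linear mapping from a 0-1 rubric score to a 1-5 rating is just one simple choice.

```javascript
// Map a rubric result onto a 1-5 scale and explain the rating.
// `rubricResult` is assumed to look like:
//   { score: <number in 0..1>, results: { <criterion name>: <boolean> } }
function simulatedJudge(rubricResult) {
  const { score, results } = rubricResult;
  const names = Object.keys(results);
  const passed = names.filter((name) => results[name]);
  const failed = names.filter((name) => !results[name]);

  // Build the reasoning first, then the rating, mirroring the
  // "reasoning before score" constraint placed on real judges.
  const reasoning =
    `Passed: ${passed.join(", ") || "none"}. ` +
    `Failed: ${failed.join(", ") || "none"}.`;
  const rating = 1 + Math.round(score * 4); // 0 -> 1, 0.5 -> 3, 1 -> 5

  return { reasoning, score: rating };
}

// Example input: the weight-2 and one weight-1 criterion passed, so 3 of 4.
const verdict = simulatedJudge({
  score: 0.75,
  results: { mentions_refund: true, polite_tone: true, under_50_words: false },
});
console.log(verdict.score);     // 4
console.log(verdict.reasoning); // Passed: mentions_refund, polite_tone. Failed: under_50_words.
```

Because the judge is plain code over rubric results, its output is fully reproducible, which makes it a convenient baseline before swapping in a real model.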

Knowledge check

Which scorer is the right choice for open-ended output that has no single gold answer?
