Measuring LLM Systems

Eval harnesses

Lesson 1 of 5

What you'll learn

Understand evals as tests for probabilistic systems
Build a case set, a scorer, and an aggregate metric
Know why you pin (freeze and version) the dataset

Unit tests assume determinism: same input, same output, assert equality. LLM systems break that assumption. The same prompt can return different text on every call, and "correct" is often fuzzy. An eval is the discipline that replaces the assertion: run your system over a fixed dataset, score each output, and aggregate into a single number you can track over time.

const dataset = [{ input: "2+2", expected: "4" }];

The eval is the only thing standing between "the demo worked once" and "this holds up across a thousand inputs." It is also the artifact a hiring manager looks for first — anyone can prompt a model, but a candidate who shows up with a harness has clearly run real systems in anger.

Cases, scorer, aggregate

Three pieces. A case set of inputs with expected behavior. A scorer that turns one output into a number (often 0/1, sometimes 0..1). An aggregate that collapses all the per-case scores into a headline metric like accuracy.

const score = (got, expected) => (got === expected ? 1 : 0);
const aggregate = (scores) => scores.reduce((a, b) => a + b, 0) / scores.length;

The aggregate is what you compare across prompt edits, model swaps, and refactors. If the number drops, you regressed — even if the demo still looks fine.

Pin the dataset

A score only means something if the dataset underneath it is frozen. If you quietly add easy cases, accuracy rises without the system improving. Version the dataset like code: a file in the repo, changed deliberately, reviewed in a PR.

Pin the dataset

A frozen, versioned dataset is what turns an eval into a regression test instead of a vibe check. When the number moves, you want to know it was the system that changed, not the ruler.

A tiny eval harness

Run it. The harness scores each case against ground truth, then aggregates accuracy. Notice one case fails — that's the signal.

Loading editor…

Knowledge check

Why must an eval dataset be pinned, meaning frozen and versioned, rather than edited freely?

Saved on this device. Sign in to sync your progress everywhere.

Next Scorers & LLM-as-judge