BuildBot

Measuring LLM Systems

pass@k & reliability under sampling

Lesson 3 of 5

What you'll learn

  • See why single-shot accuracy hides non-determinism
  • Compute pass@k from multiple sampled attempts per case
  • Reason about flaky outputs and when retries actually help

A model with temperature above zero is a slot machine. Run the same case ten times and you might pass seven. Reporting a single run as "70% accuracy" is a lie of precision — the next run reports 60%, and a teammate concludes you regressed. Reliability metrics make the randomness a first-class part of the measurement instead of noise you pretend away.

pass@k

pass@k asks: if I sample k attempts for a case, what's the probability at least one passes? It captures the reality of systems that retry. A case that passes 3 of 10 samples has a low pass@1 but a high pass@5 — retries rescue it. A case that passes 0 of 10 is dead no matter how many times you roll.

// Per case: out of n samples, c passed.
// Unbiased estimator: pass@k = 1 - C(n-c, k) / C(n, k)

The honest version uses the combinatorial estimator above rather than naively resampling, because it removes the variance of the estimate itself. The intuition: it's the chance that a random k-subset of your n samples contains at least one success.

Flaky vs. broken

pass@k separates two failure modes that single-shot accuracy blends together. A flaky case (passes sometimes) is a candidate for retry logic, self-consistency, or a higher sampling budget. A broken case (never passes) needs a prompt, tool, or model fix — no amount of retrying helps. Knowing which is which tells you where to spend.

Retries are a budget, not a free lunch

Every extra sample costs latency and tokens. pass@k tells you the ceiling retries can buy you; it doesn't make the ceiling free. Use it to decide whether the reliability is worth the spend.

Compute pass@k from sampled attempts

Run it. Each case has several sampled outputs (1 = pass, 0 = fail). The estimator reports pass@1 and pass@3 per case so you can see flaky cases lifted by retries.

Loading editor…
Knowledge check

What does pass@k measure for a non-deterministic system?

Saved on this device. Sign in to sync your progress everywhere.