Measuring LLM Systems
pass@k & reliability under sampling
Lesson 3 of 5
What you'll learn
- See why single-shot accuracy hides non-determinism
- Compute pass@k from multiple sampled attempts per case
- Reason about flaky outputs and when retries actually help
A model with temperature above zero is a slot machine. Run the same case ten times and you might pass seven. Reporting one run of your suite as "70% accuracy" is a lie of precision: the next run reports 60%, and a teammate concludes you regressed. Reliability metrics make the randomness a first-class part of the measurement instead of noise you pretend away.
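To put a number on that run-to-run wobble, here is a minimal sketch. It assumes a hypothetical suite of 10 independent cases that each pass 70% of the time; the function names and the seed are illustrative, not part of any library.

```python
import math
import random

def single_run_accuracy(pass_probs, rng):
    """Run every case once and return the fraction that passed."""
    passed = sum(1 for p in pass_probs if rng.random() < p)
    return passed / len(pass_probs)

def accuracy_std(pass_probs):
    """Standard deviation of single-run accuracy, assuming independent cases."""
    variance = sum(p * (1 - p) for p in pass_probs) / len(pass_probs) ** 2
    return math.sqrt(variance)

# Hypothetical suite: 10 cases, each with a 70% chance of passing.
pass_probs = [0.7] * 10
rng = random.Random(0)  # seeded only so the sketch is repeatable
runs = [single_run_accuracy(pass_probs, rng) for _ in range(20)]

print("accuracy per run:", runs)
print("spread:", min(runs), "to", max(runs))
print("std of one run:", round(accuracy_std(pass_probs), 3))
```

Under these assumptions the standard deviation of a single run is about 0.145, so a swing from 70% to 60% between two runs is ordinary noise, not evidence of a regression.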
pass@k
pass@k asks: if I sample k attempts for a case, what's the probability at least one passes? It captures the reality of systems that retry. A case that passes 3 of 10 samples has a low pass@1 (0.30) but a high pass@5 (about 0.92): retries rescue it. A case that passes 0 of 10 scores 0 at every k; as far as your samples can tell, it's dead no matter how many times you roll.
// Per case: out of n samples, c passed. Pick k <= n.
// Unbiased estimator: pass@k = 1 - C(n-c, k) / C(n, k)
// C(a, b) is "a choose b", and is 0 when b > a (so pass@k = 1 once k > n - c).
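The honest version uses the combinatorial estimator above rather than naively resampling, because it is unbiased and averages over every possible k-subset instead of a handful of random draws. That lowers the variance of the estimate; it does not remove it, since the result still depends on which n samples you happened to draw. The intuition: it's the chance that a random k-subset of your n samples contains at least one success.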
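A minimal sketch of the estimator, checked against the k-subset intuition by brute force. The function names are illustrative. The plug-in form `1 - (1 - c/n)**k` is included only for comparison: for k > 1 it is biased low in expectation.

```python
from itertools import combinations
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimate from n samples with c passes (needs 1 <= k <= n)."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # math.comb returns 0 when k > n - c, so the estimate is exactly 1.0 there.
    return 1.0 - comb(n - c, k) / comb(n, k)

def pass_at_k_bruteforce(outcomes, k):
    """Fraction of k-subsets of the samples that contain at least one pass."""
    subsets = list(combinations(outcomes, k))
    return sum(1 for s in subsets if any(s)) / len(subsets)

def pass_at_k_plugin(n, c, k):
    """Naive plug-in estimate, shown only for comparison."""
    return 1.0 - (1.0 - c / n) ** k

# The 3-of-10 case: 1 = pass, 0 = fail.
outcomes = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
n, c = len(outcomes), sum(outcomes)
for k in (1, 5):
    print(
        f"k={k}",
        f"estimator={pass_at_k(n, c, k):.4f}",
        f"bruteforce={pass_at_k_bruteforce(outcomes, k):.4f}",
        f"plugin={pass_at_k_plugin(n, c, k):.4f}",
    )
```

For k = 5 the estimator and the brute-force count agree at 231/252 (about 0.917), while the plug-in gives about 0.832. The brute-force version is there to show the intuition, not for real use: it enumerates every subset and gets slow quickly.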
Flaky vs. broken
pass@k separates two failure modes that single-shot accuracy blends together. A flaky case (passes sometimes) is a candidate for retry logic, self-consistency, or a higher sampling budget. A broken case (never passes) needs a prompt, tool, or model fix; no amount of retrying helps. Knowing which is which tells you where to spend your effort.
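The split can be sketched as a small triage helper. The label "stable" for always-passing cases and both function names are assumptions made for illustration. The second function is a sanity check on the "broken" label: it gives the chance that a case with a true pass rate `p` shows zero passes in `n` samples.

```python
def classify_case(outcomes):
    """Label a case from its sampled outcomes (1 = pass, 0 = fail)."""
    if not outcomes:
        raise ValueError("need at least one sampled outcome")
    passes = sum(outcomes)
    if passes == 0:
        return "broken"   # never passed in this sample
    if passes == len(outcomes):
        return "stable"   # always passed in this sample
    return "flaky"        # passed sometimes: a retry candidate

def miss_probability(p, n):
    """Chance that a case with true pass rate p shows 0 passes in n samples."""
    return (1.0 - p) ** n

print(classify_case([1, 0, 0, 1, 0]))
print(classify_case([0, 0, 0, 0, 0]))
print(round(miss_probability(0.1, 10), 3))
```

A case that truly passes 10% of the time still shows 0 of 10 roughly 35% of the time, so "broken" here means "broken as far as this sample can tell". If the label drives an expensive decision, sampling more before committing is a reasonable precaution.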
Retries are a budget, not a free lunch
Every extra sample costs latency and tokens. pass@k tells you the ceiling retries can buy you, assuming you can tell which attempt passed; it doesn't make the ceiling free. Use it to decide whether the reliability is worth the spend.
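One way to weigh that trade-off is to tabulate what each additional attempt buys. This is a sketch under stated assumptions: `cost_per_attempt` is a hypothetical flat cost per sample, and the cost column is the worst case where all k attempts are actually issued.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimate from n samples with c passes (1 <= k <= n)."""
    return 1.0 - comb(n - c, k) / comb(n, k)

def retry_table(n, c, max_k, cost_per_attempt):
    """Rows of (k, pass@k, gain over k-1, worst-case cost) for k = 1..max_k."""
    rows = []
    previous = 0.0
    for k in range(1, max_k + 1):
        rate = pass_at_k(n, c, k)
        rows.append((k, rate, rate - previous, k * cost_per_attempt))
        previous = rate
    return rows

# A case that passed 3 of 10 samples, at a hypothetical cost of 0.02 per attempt.
rows = retry_table(n=10, c=3, max_k=5, cost_per_attempt=0.02)
print(" k  pass@k   gain   cost")
for k, rate, gain, cost in rows:
    print(f"{k:>2}  {rate:.3f}  {gain:.3f}  {cost:.2f}")
```

For this case the gain per extra attempt shrinks from 0.300 at k = 1 to about 0.083 at k = 5 while the cost grows linearly, which is the usual shape: the first retry or two does most of the work.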
Run it. Each case has several sampled outputs (1 = pass, 0 = fail). The estimator reports pass@1 and pass@3 per case so you can see flaky cases lifted by retries.
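A standalone sketch of such a run, in case the interactive runner is not available. The case names and outcome lists are made-up illustration data, and averaging per-case estimates into a suite-level number is one common convention rather than the only option.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimate from n samples with c passes (1 <= k <= n)."""
    return 1.0 - comb(n - c, k) / comb(n, k)

def per_case_report(cases, ks=(1, 3)):
    """Map each case name to {k: pass@k} computed from its sampled outcomes."""
    report = {}
    for name, outcomes in cases.items():
        n, c = len(outcomes), sum(outcomes)
        report[name] = {k: pass_at_k(n, c, k) for k in ks}
    return report

# Hypothetical sampled outputs per case: 1 = pass, 0 = fail.
cases = {
    "flaky_case": [1, 0, 0, 1, 0],
    "broken_case": [0, 0, 0, 0, 0],
    "solid_case": [1, 1, 1, 1, 1],
}

report = per_case_report(cases)
for name, scores in report.items():
    print(f"{name:<12} pass@1={scores[1]:.2f}  pass@3={scores[3]:.2f}")

# Suite-level number: the mean of the per-case estimates.
suite_pass_at_3 = sum(scores[3] for scores in report.values()) / len(report)
print(f"suite pass@3={suite_pass_at_3:.2f}")
```

The flaky case moves from 0.40 at pass@1 to 0.90 at pass@3, while the broken case stays at 0.00 at both.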
What does pass@k measure for a non-deterministic system?