RAG quality & evaluation

Lesson 5 of 5

What you'll learn

  • Assemble retrieved context into a grounded, citable prompt
  • Compute precision@k and recall@k for a retrieval result
  • Name the RAGAS-style metrics that score end-to-end RAG quality

Retrieval is only half the system. Once you have ranked chunks you assemble context: dedupe near-identical chunks, fit within the token budget, and keep a stable ID on each so the model can produce citations. Citations aren't decoration — they let you verify the answer was grounded in retrieved evidence rather than invented, and they make failures debuggable.
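
The assembly steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not a reference implementation: the `Chunk` type, the `assemble_context` function, the word-trigram Jaccard check for near-duplicates, and the whitespace word count used as a token proxy are all hypothetical choices, and the sample chunk texts are invented for the example. A real system would count tokens with the model's own tokenizer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    chunk_id: str  # stable ID the model can cite, e.g. "d3"
    text: str


def _shingles(text, n=3):
    """Lowercased word n-grams, used for a crude near-duplicate check."""
    words = text.lower().split()
    if len(words) < n:
        return {tuple(words)}
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def _jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def assemble_context(ranked_chunks, token_budget, dup_threshold=0.8):
    """Walk chunks in rank order; skip near-duplicates and chunks that overflow the budget."""
    kept, kept_shingles, used = [], [], 0
    for chunk in ranked_chunks:
        sh = _shingles(chunk.text)
        if any(_jaccard(sh, prev) >= dup_threshold for prev in kept_shingles):
            continue  # near-identical to a chunk already kept
        cost = len(chunk.text.split())  # assumption: words as a rough token proxy
        if used + cost > token_budget:
            continue  # does not fit; a shorter, lower-ranked chunk still might
        kept.append(chunk)
        kept_shingles.append(sh)
        used += cost
    # Prefix each chunk with its ID so the model can cite it as [d3], [d1], ...
    context = "\n\n".join(f"[{c.chunk_id}] {c.text}" for c in kept)
    return context, [c.chunk_id for c in kept]


ranked = [
    Chunk("d3", "Refunds are issued within 5 business days of approval."),
    Chunk("d9", "REFUNDS are issued within 5 business days of approval."),
    Chunk("d1", "To request a refund, open a ticket from the billing page."),
]
context, kept_ids = assemble_context(ranked, token_budget=50)
print(kept_ids)  # ['d3', 'd1'] because d9 duplicates d3
print(context)
```

Keeping the ID inline with each chunk is what makes citations checkable later: any [id] the model emits can be looked up in `kept_ids`.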

Measure retrieval, not vibes

You cannot improve what you don't measure. With a labeled set of queries and their known-relevant doc IDs, two metrics anchor everything:

  • Precision@k — of the k chunks you returned, what fraction were relevant. Punishes noise in the context window.
  • Recall@k — of all relevant chunks that exist, what fraction landed in the top k. Punishes missing the answer entirely.
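
For example, take one labeled query and its retrieval result:

{
  "query": "how do refunds work",
  "retrieved": ["d3", "d9", "d1", "d7"],
  "relevant": ["d1", "d3", "d5"]
}

With k = 4, two of the four retrieved docs (d3 and d1) are relevant, so precision@4 = 2/4 = 0.5. Two of the three relevant docs were retrieved (d5 was missed), so recall@4 = 2/3 ≈ 0.67.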

The two trade off: raising k tends to lift recall and lower precision, which is why you report both. Add MRR (mean reciprocal rank) when the position of the first relevant hit matters.
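
MRR can be sketched in a few lines, assuming each retrieved list is in rank order. The function names are illustrative, and the second labeled query is hypothetical, added only so the mean covers more than one query.

```python
def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first relevant doc, or 0.0 if none was retrieved."""
    relevant_set = set(relevant)
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant_set:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(results):
    """Average reciprocal rank over (retrieved, relevant) pairs."""
    if not results:
        return 0.0
    return sum(reciprocal_rank(ret, rel) for ret, rel in results) / len(results)


labeled = [
    # The refunds example: the first relevant doc (d3) is at rank 1.
    (["d3", "d9", "d1", "d7"], ["d1", "d3", "d5"]),
    # Hypothetical second query: the first relevant doc (d4) is at rank 3.
    (["d8", "d2", "d4"], ["d4"]),
]
mrr = mean_reciprocal_rank(labeled)
print(round(mrr, 3))  # (1/1 + 1/3) / 2, about 0.667
```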

End-to-end metrics

Retrieval metrics don't tell you if the answer is good. RAGAS-style evaluation adds generation-aware scores: context precision and context recall (was the right context retrieved and ranked high), faithfulness (is every claim in the answer supported by the retrieved context, i.e. no hallucination), and answer relevance (does the answer address the question). Track these on a fixed eval set and every retriever change becomes a measurable experiment instead of a guess.

Faithfulness is the safety metric

A confident, fluent answer that isn't grounded in the retrieved context is the failure mode that erodes trust fastest. Faithfulness scoring — checking each claim against the cited chunks — is what catches it before your users do.
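
The shape of a faithfulness score, the fraction of answer claims supported by the cited chunks, can be sketched as below. This is a toy under loud assumptions: the claims are taken as already extracted from the answer, and `lexical_support` is a naive token-overlap stand-in for the judge. It misses paraphrase and is fooled by negation, so a real evaluation would plug an LLM or entailment model into `is_supported` instead. All names and sample texts are hypothetical, and this is not the RAGAS library's API.

```python
import re


def _tokens(text):
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def lexical_support(claim, context, min_overlap=0.8):
    """Naive stand-in judge: share of the claim's tokens that appear in the context."""
    claim_tokens = _tokens(claim)
    if not claim_tokens:
        return True
    overlap = len(claim_tokens & _tokens(context)) / len(claim_tokens)
    return overlap >= min_overlap


def faithfulness(claims, cited_chunks, is_supported=lexical_support):
    """Fraction of claims that the judge finds supported by the cited chunks."""
    if not claims:
        raise ValueError("no claims to score")
    context = " ".join(cited_chunks)
    supported = sum(1 for claim in claims if is_supported(claim, context))
    return supported / len(claims)


cited = ["Refunds are issued within 5 business days of approval."]
claims = [
    "Refunds are issued within 5 business days",  # grounded in the cited chunk
    "Refunds are paid in store credit only",      # not in the cited chunk
]
score = faithfulness(claims, cited)
print(score)  # 0.5: one of two claims is supported
```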

precision@k and recall@k

Run the code. It computes precision@k and recall@k for a retrieval result against a known set of relevant doc IDs.
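
A minimal sketch of that computation, using the refunds example from earlier in the lesson. The function names are illustrative rather than taken from the lesson's editor, and two conventions are assumptions: precision divides by k even if fewer than k docs were returned, and recall is reported as 0.0 when a query has no relevant docs.

```python
def precision_at_k(retrieved, relevant, k):
    """Of the top-k retrieved IDs, the fraction that are relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = len(set(retrieved[:k]) & set(relevant))
    return hits / k  # assumption: divide by k, not by the number actually returned


def recall_at_k(retrieved, relevant, k):
    """Of all relevant IDs, the fraction that appear in the top k."""
    relevant_set = set(relevant)
    if not relevant_set:
        return 0.0  # assumption: recall is undefined here, reported as 0.0
    hits = len(set(retrieved[:k]) & relevant_set)
    return hits / len(relevant_set)


retrieved = ["d3", "d9", "d1", "d7"]
relevant = ["d1", "d3", "d5"]

for k in range(1, 5):
    p = precision_at_k(retrieved, relevant, k)
    r = recall_at_k(retrieved, relevant, k)
    print(f"k={k}  precision={p:.2f}  recall={r:.2f}")
# k=1  precision=1.00  recall=0.33
# k=2  precision=0.50  recall=0.33
# k=3  precision=0.67  recall=0.67
# k=4  precision=0.50  recall=0.67
```

The k=1 and k=4 rows show the trade-off: the shortest list is the cleanest, while the longest one picks up more of the relevant docs along with more noise.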

Knowledge check

In RAGAS-style evaluation, what does the faithfulness metric specifically check?
