Retrieval Engineering

Chunking & indexing

Lesson 2 of 5

What you'll learn

Understand why documents are split before embedding
Reason about chunk size and overlap trade-offs
Know what text actually gets embedded versus stored

You rarely embed a whole document. Embedding models accept only a limited number of input tokens and, more importantly, a single vector for a 40-page PDF is a blurry average that matches everything weakly and nothing well. So you chunk: split the document into smaller passages, embed each one, and index them independently. Retrieval returns chunks, and your prompt is assembled from the best-scoring ones.
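To make the "blurry average" point concrete, here is a minimal sketch of the chunk, embed, index, retrieve loop. It assumes a toy bag-of-words `toy_embed` as a stand-in for a real embedding model, and the three passages are invented for illustration; only the shape of the pipeline is the point.

```python
import math
from collections import Counter

def toy_embed(text: str) -> Counter:
    """Stand-in for a real embedding model: a bag-of-words count vector."""
    tokens = (w.strip(".,:;!?").lower() for w in text.split())
    return Counter(t for t in tokens if t)

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Hypothetical passages, already split into chunks.
chunks = [
    "Refunds are issued within 14 days of the return.",
    "Shipping takes three to five business days.",
    "Support is available by email on weekdays.",
]

# Index: each chunk is embedded and stored independently.
index = [(chunk, toy_embed(chunk)) for chunk in chunks]

def retrieve(query: str, k: int = 1) -> list[str]:
    """Return the k chunks whose vectors score highest against the query."""
    q = toy_embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]

query = "how long does shipping take"
print(retrieve(query))

# One vector for the whole document scores lower than the matching chunk.
whole_doc = toy_embed(" ".join(chunks))
print(cosine(toy_embed(query), whole_doc) < cosine(toy_embed(query), index[1][1]))
```

With a real model the numbers differ, but the pattern is the same: the vector for the whole document has to represent refunds, shipping, and support at once, so it scores lower against a shipping question than the shipping chunk does.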

Size and overlap are a trade-off

Chunk size controls granularity. Small chunks (a few sentences) give precise matches but can lose the surrounding context needed to answer. Large chunks carry context but dilute the embedding — the signal for the relevant sentence is averaged with paragraphs of noise.

Overlap copies the tail of one chunk into the start of the next. It's insurance against splitting a sentence or an idea exactly at a boundary: a passage that straddles two chunks still lands intact in at least one of them, provided it is no longer than the overlap.

{
  "chunkSize": 512,
  "overlap": 64,
  "splitOn": "tokens",
  "note": "tune size/overlap per corpus; legal/medical favor larger, FAQs favor smaller"
}
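The cost of overlap is easy to work out. Each new chunk advances by `chunkSize - overlap` tokens (the stride), so overlap means more chunks to embed and store. The sketch below uses the values from the config above; `chunk_count` is a hypothetical helper and the 10,000-token document is an assumed example.

```python
import math

def chunk_count(n_tokens: int, chunk_size: int, overlap: int) -> int:
    """Number of fixed-size windows needed to cover n_tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    if n_tokens <= 0:
        return 0
    if n_tokens <= chunk_size:
        return 1
    stride = chunk_size - overlap  # tokens of new text per chunk
    return 1 + math.ceil((n_tokens - chunk_size) / stride)

print(512 - 64)                       # stride: 448
print(chunk_count(10_000, 512, 64))   # 23 chunks with overlap
print(chunk_count(10_000, 512, 0))    # 20 chunks without
```

Here a 64-token overlap turns 20 chunks into 23, roughly 15% more vectors for this document. That is usually a fair price, but it grows quickly as overlap approaches the chunk size.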

Embed the passage, store the rest

A common mistake is embedding everything you store. Embed only the passage text that carries meaning. Keep IDs, titles, URLs, timestamps, and section paths as metadata alongside the vector — you'll use them for filtering and for citations later, but they shouldn't pollute the semantic signal.
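One way to keep the two apart is to give each chunk a record with a single embedded field and a separate metadata bag. This is a sketch under assumptions: `ChunkRecord`, `text_to_embed`, `matches`, and every field value below are hypothetical, not a particular vector store's schema.

```python
from dataclasses import dataclass, field

@dataclass
class ChunkRecord:
    chunk_id: str
    text: str  # the only field sent to the embedding model
    metadata: dict = field(default_factory=dict)  # stored for filters and citations

def text_to_embed(record: ChunkRecord) -> str:
    """Return just the passage text; metadata never reaches the model."""
    return record.text

def matches(record: ChunkRecord, **filters) -> bool:
    """Metadata filter: every requested key must equal the stored value."""
    return all(record.metadata.get(k) == v for k, v in filters.items())

record = ChunkRecord(
    chunk_id="handbook-0042",  # hypothetical ID
    text="Refunds are issued within 14 days of the return.",
    metadata={
        "title": "Customer Handbook",
        "url": "https://example.com/handbook#returns",
        "section": "Returns",
    },
)

print(text_to_embed(record))
print(matches(record, section="Returns"))
```

The URL and title come back with the retrieved chunk, so you can cite them, but the vector itself is computed from the passage text alone.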

Chunk on structure when you can

Splitting on natural boundaries — headings, paragraphs, list items — beats blind fixed-width windows. A chunk that respects document structure is a chunk that reads as a coherent unit, which is exactly what the embedding model rewards.
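A minimal sketch of structure-aware chunking, assuming paragraphs are separated by blank lines and using a word count as a rough proxy for tokens. `chunk_paragraphs` is a hypothetical helper; it packs whole paragraphs greedily and never cuts inside one.

```python
import re

def chunk_paragraphs(text: str, max_words: int = 120) -> list[str]:
    """Pack whole paragraphs into chunks of at most max_words words.

    A single paragraph longer than max_words is kept whole as its own chunk.
    """
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    chunks, current, count = [], [], 0
    for para in paragraphs:
        n = len(para.split())
        # Flush the current chunk if this paragraph would push it over budget.
        if current and count + n > max_words:
            chunks.append("\n\n".join(current))
            current, count = [], 0
        current.append(para)
        count += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks

doc = "Alpha beta gamma.\n\nDelta epsilon zeta eta.\n\nTheta iota."
for chunk in chunk_paragraphs(doc, max_words=7):
    print(repr(chunk))
```

A fuller version would also split on headings and list items, and fall back to a fixed-width window for paragraphs that exceed the budget.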

Overlapping window chunker

Run it. It splits a text into word-based chunks of a fixed size with a fixed overlap, then prints each chunk and its boundaries.
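A minimal sketch of such a chunker, not necessarily the lesson's own code. The function name `chunk_words`, the sample text, and the default size and overlap are assumptions chosen to keep the output short.

```python
def chunk_words(text: str, size: int = 8, overlap: int = 2):
    """Split text into windows of `size` words that share `overlap` words.

    Returns (start, end, chunk_text) tuples; offsets count words, end exclusive.
    """
    if not 0 <= overlap < size:
        raise ValueError("overlap must be in [0, size)")
    words = text.split()
    stride = size - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(words), stride):
        end = min(start + size, len(words))
        chunks.append((start, end, " ".join(words[start:end])))
        if end == len(words):
            break  # stop here so no trailing chunk is made of overlap only
    return chunks

sample = (
    "You rarely embed a whole document. So you chunk: split the document "
    "into smaller passages, embed each one, and index them independently."
)

for start, end, chunk in chunk_words(sample, size=8, overlap=2):
    print(f"[{start:>2}:{end:>2}] {chunk}")
```

Try changing `size` and `overlap` and watch how the boundaries move: the first two words of each chunk repeat the last two of the one before it.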


Knowledge check

What is the main downside of making chunks very large before embedding them?

