Elasticsearch: the inverted index

Lesson 3 of 5

What you'll learn

  • Understand how a document is tokenized into terms
  • See why the inverted index makes search fast
  • Build and query a tiny inverted index

A relational database finds a row by scanning a table or by using an index on a column value. Elasticsearch does something different: when you index a document, it runs each text field through an analyzer that splits the text into terms (tokens) and lowercases them. Then it stores, for every term, the list of documents that contain it. That map is the inverted index.
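The analysis step can be sketched in a few lines of Python. This is only a rough approximation of what an analyzer does, assuming that splitting on non-alphanumeric characters is good enough for the example; the function name `analyze` is made up for illustration and is not part of any Elasticsearch API.

```python
import re


def analyze(text: str) -> list[str]:
    """Lowercase the text, then split it into terms.

    Assumption: a term is any run of ASCII letters or digits. Real
    analyzers follow Unicode word-boundary rules and are configurable.
    """
    return re.findall(r"[a-z0-9]+", text.lower())


print(analyze("Fast search with Elasticsearch"))
# ['fast', 'search', 'with', 'elasticsearch']
```

Because queries go through the same step, a search for "Search" or "SEARCH" ends up looking for the same term, search.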

{ "_id": 1, "title": "Fast search with Elasticsearch" }
{ "_id": 2, "title": "Search engines and ranking" }

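Analyzing the two titles produces terms, and the index inverts the relationship: instead of "doc -> words", it stores "word -> docs". (The listing leaves out the common words "with" and "and" for brevity. Elasticsearch's default standard analyzer does not remove stop words, so a real index would contain them too.)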

elasticsearch -> [1]
fast          -> [1]
search        -> [1, 2]
engines       -> [2]
ranking       -> [2]

Why this is fast

To answer "which documents contain search?", Elasticsearch does not read any documents; it jumps straight to the term search and returns its postings list [1, 2]. The cost of a lookup depends on the number of query terms and the length of their postings lists, not on the total size of the corpus, which is why search stays fast as data grows. Combining terms with AND or OR amounts to intersecting or unioning these lists.
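To make the AND/OR point concrete, here is a small sketch that reuses the postings lists from the example index above. Python sets stand in for the sorted-list merging a real engine would do, and the helper names (`postings`, `search_and`, `search_or`) are hypothetical.

```python
# Postings lists taken from the example index (term -> doc ids).
index = {
    "elasticsearch": [1],
    "fast": [1],
    "search": [1, 2],
    "engines": [2],
    "ranking": [2],
}


def postings(term: str) -> set[int]:
    """Return the doc ids for a term, or an empty set if the term is unknown."""
    return set(index.get(term, []))


def search_and(*terms: str) -> list[int]:
    """Documents containing every term: intersect the postings lists."""
    if not terms:
        return []
    result = postings(terms[0])
    for term in terms[1:]:
        result &= postings(term)
    return sorted(result)


def search_or(*terms: str) -> list[int]:
    """Documents containing at least one term: union the postings lists."""
    result: set[int] = set()
    for term in terms:
        result |= postings(term)
    return sorted(result)


print(search_and("fast", "search"))   # [1]
print(search_or("fast", "ranking"))   # [1, 2]
print(search_and("fast", "ranking"))  # []
```

Neither function touches a document body; both work only on the doc-id lists.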

Search the index, not the documents

The original documents are stored too, but matching never scans them. All the speed comes from looking up pre-computed term → document mappings.

The challenge builds an inverted index from a few documents, then looks up a term.

Build an inverted index

Run it. It tokenizes each document, builds a term -> doc-ids map, then looks up which documents contain a search term.
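A minimal version of the exercise might look like the sketch below. It is not the lesson's original editor code. The first two documents are the titles used earlier in this lesson, the third is a made-up extra, and `tokenize`, `build_index`, and `lookup` are illustrative names.

```python
import re
from collections import defaultdict

# Sample documents: doc id -> text. Document 3 is a hypothetical addition.
docs = {
    1: "Fast search with Elasticsearch",
    2: "Search engines and ranking",
    3: "Ranking search results fast",
}


def tokenize(text: str) -> list[str]:
    """Lowercase and split on non-alphanumeric characters (a simplification)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents: dict[int, str]) -> dict[str, list[int]]:
    """Build a term -> doc-ids map, recording each doc at most once per term."""
    index: dict[str, list[int]] = defaultdict(list)
    for doc_id in sorted(documents):  # ascending ids keep postings sorted
        for term in set(tokenize(documents[doc_id])):
            index[term].append(doc_id)
    return dict(index)


def lookup(index: dict[str, list[int]], term: str) -> list[int]:
    """Return the postings list for a term, or [] if no document contains it."""
    return index.get(term.lower(), [])


index = build_index(docs)
print(lookup(index, "search"))   # [1, 2, 3]
print(lookup(index, "ranking"))  # [2, 3]
print(lookup(index, "missing"))  # []
```

Building the index reads every document once. After that, each lookup is a single dictionary access, however many documents were indexed.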


Next: how field types and query types decide what actually matches.
