Inference & Ops

Distributed inference (model sharding)

Lesson 9 of 10

What you'll learn

Understand why a model may not fit on one machine
See how layer sharding spreads a model across RPC workers
Simulate assigning layers to workers and running one pipeline pass

Routing assumes some node holds the whole model. But a 70B-parameter model can exceed a single machine's memory: at 16 bits per weight, the weights alone take about 140 GB. Rather than buy a bigger GPU, you can shard the model: split its layers across several machines and run them as one. llama.cpp supports this with an RPC mode in which lightweight workers each hold a slice of the model and a coordinator drives a forward pass through them in sequence.
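The "won't fit" claim can be checked with back-of-the-envelope arithmetic. The sketch below is only an estimate: it counts weights alone and ignores the KV cache, activations, and runtime overhead. The helper names and the 24 GB device are hypothetical, and real quantized formats usually spend slightly more than their nominal bits per weight.

```javascript
// Rough weight-memory estimate: parameter count times bytes per parameter.
// Ignores KV cache, activations, and runtime overhead, which add more on top.
function weightBytes(paramCount, bitsPerParam) {
  return (paramCount * bitsPerParam) / 8;
}

// True when the weights alone would fit in the given device memory.
function fitsOn(paramCount, bitsPerParam, deviceBytes) {
  return weightBytes(paramCount, bitsPerParam) <= deviceBytes;
}

const GB = 1e9;
const params70B = 70e9;

const fp16Bytes = weightBytes(params70B, 16); // 16 bits per weight
const q4Bytes = weightBytes(params70B, 4);    // nominal 4 bits per weight

console.log(`16-bit weights: ${fp16Bytes / GB} GB`); // 140 GB
console.log(`4-bit weights: ${q4Bytes / GB} GB`);    // 35 GB
console.log(`fits on a 24 GB device at 4 bits: ${fitsOn(params70B, 4, 24 * GB)}`); // false
```

Even aggressively quantized, the weights in this example are larger than the assumed 24 GB device, which is the situation sharding is meant for.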

A transformer is a stack of layers applied in order. Sharding cuts the stack into contiguous ranges and hands each range to a worker. Activations flow worker-to-worker like a pipeline: worker 1 runs layers 0–19 and passes its output to worker 2, which runs 20–39, and so on. The token comes out the far end.
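The handoff described above can be sketched as a toy pipeline. This is an illustration, not how llama.cpp implements it: each "layer" is a stand-in function that adds 1 to a number, and the worker names and the 40-layer count are assumptions chosen to match the 0–19 / 20–39 example.

```javascript
// Toy model: every "layer" just adds 1 to a numeric activation,
// so the final value equals the number of layers applied.
const toyLayers = Array.from({ length: 40 }, () => (x) => x + 1);

// Hypothetical workers, each owning a contiguous, inclusive layer range.
const toyWorkers = [
  { name: "worker-1", first: 0, last: 19 },
  { name: "worker-2", first: 20, last: 39 },
];

// Run one forward pass: each worker applies its layers in order,
// then hands the activation to the next worker.
function forwardPass(input, workers, layers) {
  let activation = input;
  const trace = [];
  for (const w of workers) {
    for (let i = w.first; i <= w.last; i++) {
      activation = layers[i](activation);
    }
    trace.push(`${w.name} ran layers ${w.first}-${w.last}`);
  }
  return { activation, trace };
}

const passResult = forwardPass(0, toyWorkers, toyLayers);
console.log(passResult.trace.join(" -> "));
console.log(passResult.activation); // 40: every layer ran exactly once
```

Because the ranges are contiguous and cover the whole stack, the result is the same as running all 40 layers on one machine; only the place where each layer runs changes.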

# Each machine runs an RPC server exposing its GPU/CPU:
rpc-server --host 0.0.0.0 --port 50052

# The coordinator splits the model across those workers:
llama-cli -m model.gguf --rpc 10.0.0.2:50052,10.0.0.3:50052 -p "hi"

Quorum wraps this: it can auto-download the model and wire up the RPC workers, so a sharded model appears in the cluster like any other.

The network is now in the hot path

Sharding trades memory for bandwidth. Activations cross the wire at every boundary between workers, on every forward pass, so a slow link makes a sharded model crawl. It's the right tool when a model won't fit at all, not a way to make a model that already fits go faster.
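To get a feel for the cost, here is a rough per-token estimate under deliberately simple assumptions: one activation vector crosses each boundary between consecutive workers, every crossing pays a fixed latency, and nothing overlaps. The function name, the hidden size of 8192, and the link figures are all illustrative, not measurements of llama.cpp or Quorum.

```javascript
// Per-token network cost of a sharded pass, under simple assumptions:
// one activation vector of `hiddenSize` values crosses each boundary
// between consecutive workers, and each crossing pays a fixed latency.
function perTokenNetworkMs({ numWorkers, hiddenSize, bytesPerValue, bandwidthBytesPerSec, latencyMs }) {
  const hops = numWorkers - 1; // boundaries between workers
  const bytesPerHop = hiddenSize * bytesPerValue;
  const transferMs = (bytesPerHop / bandwidthBytesPerSec) * 1000;
  return hops * (latencyMs + transferMs);
}

// Assumed activation shape: 8192 values at 2 bytes each.
const activationShape = { hiddenSize: 8192, bytesPerValue: 2 };

// Assumed links: wired gigabit versus a slow 10 Mbit/s connection.
const fastLinkMs = perTokenNetworkMs({
  numWorkers: 3, ...activationShape, bandwidthBytesPerSec: 125e6, latencyMs: 0.5,
});
const slowLinkMs = perTokenNetworkMs({
  numWorkers: 3, ...activationShape, bandwidthBytesPerSec: 1.25e6, latencyMs: 5,
});
const singleNodeMs = perTokenNetworkMs({
  numWorkers: 1, ...activationShape, bandwidthBytesPerSec: 125e6, latencyMs: 0.5,
});

console.log(`fast link: ${fastLinkMs.toFixed(2)} ms per token`);
console.log(`slow link: ${slowLinkMs.toFixed(2)} ms per token`);
console.log(`single node: ${singleNodeMs} ms per token`);
```

A single node pays no network cost at all in this model, which is why sharding a model that already fits only adds overhead.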

The challenge models the split: assign a model's layers to N workers in contiguous ranges, then run one pass that hops worker to worker.

Shard layers across workers (JS model)

Run it. 8 layers are split across 3 workers; the pass walks the layers in order and runs each one on the worker that owns it. Try changing the worker count.
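One way the exercise might look is sketched below. It is a stand-in rather than the lesson's own code: `assignLayers` and `runPass` are hypothetical names, and giving the earlier workers one extra layer when the split is uneven is an assumed policy.

```javascript
// Split `numLayers` into `numWorkers` contiguous ranges. Early workers
// take one extra layer when the division is uneven (an assumed policy).
function assignLayers(numLayers, numWorkers) {
  const base = Math.floor(numLayers / numWorkers);
  const extra = numLayers % numWorkers;
  const ranges = [];
  let start = 0;
  for (let w = 0; w < numWorkers; w++) {
    const size = base + (w < extra ? 1 : 0);
    ranges.push({ worker: w, start, end: start + size }); // end is exclusive
    start += size;
  }
  return ranges;
}

// Walk the layers in order and record which worker runs each one.
function runPass(numLayers, ranges) {
  const visited = [];
  for (let layer = 0; layer < numLayers; layer++) {
    const owner = ranges.find((r) => layer >= r.start && layer < r.end);
    visited.push(owner.worker);
  }
  return visited;
}

const shardRanges = assignLayers(8, 3);
const passOwners = runPass(8, shardRanges);

console.log(shardRanges.map((r) => `worker ${r.worker}: layers ${r.start}-${r.end - 1}`).join("\n"));
console.log(passOwners.join(" ")); // 0 0 0 1 1 1 2 2
```

Changing the second argument of `assignLayers` changes the worker count; the pass still visits every layer exactly once, in order.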


Knowledge check

When is sharding a model across machines the right call?

Next: getting all of this into the world — deploying the relay and dashboard, and shipping updates.
