The Cluster
What an inference distribution network is
Lesson 1 of 10
What you'll learn
- Understand the problem an inference distribution network solves
- Name the four moving parts and what each is responsible for
- See how a request gets routed to a machine that holds the model
You probably already run models locally — Ollama on the laptop, LM Studio on the desktop, maybe a llama.cpp build on a box with a real GPU. Each is a private island: reachable only from that machine, idle whenever you're not sitting at it. An inference distribution network stitches those islands into one cluster behind a single endpoint, so any of your apps can hit one URL and have the request land on whichever machine actually holds the model.
Quorum is that network. Every machine runs the same desktop app, advertises the models it has, and shares them with the rest of the fleet. The payoff is a single OpenAI-compatible API — the same /v1/chat/completions shape every SDK already speaks — backed by hardware you own instead of a metered cloud.
The four pieces
Desktop App A ──┐ ┌── Desktop App B
(holds llama) │ WebSocket relay │ (holds qwen)
├──────────────► (VPS) ◄┤
Ollama/MLX │ │ llama.cpp
└── mDNS (LAN) ──────────┘
│
Convex control plane ── Web dashboard
(clusters, keys, presence, usage)
- Desktop app — runs on each machine; exposes local models and serves the OpenAI API.
- Relay — a small server on a VPS that connects apps across different networks.
- Control plane (Convex) — stores clusters, API keys, who's online, and usage.
- Dashboard — a website to manage keys and watch the fleet.
You already know Convex and Clerk, so this course skips them and focuses on the parts that are new: the desktop app, the networking, and the inference layer.
Peers, not a central brain
There's no single machine doing the work. A node that receives a request it can't serve routes it to a peer that can — over the LAN if the peer is on the same network, or through the relay if it isn't. Routing is just "find a node that holds this model, and forward."
OpenAI-compatible is the whole trick
Because the cluster speaks the OpenAI wire format, you don't write a custom client. Point any existing SDK at localhost:32768/v1 and it works — the distribution happens invisibly behind that endpoint.
Run it. Each node lists the models it holds; the router picks one that can serve the request.
What makes a request servable by a particular node in the cluster?
Next: what happens when the chosen holder is slow, busy, or offline — scheduling, retry, and fallback.
Saved on this device. Sign in to sync your progress everywhere.