Building an AI Inference Network
Turn the machines you own into one OpenAI-compatible AI cluster. This course unpacks the stack behind Quorum: a Go + Wails desktop app, mDNS LAN discovery, a WebSocket relay for cross-network routing, an OpenAI-compatible streaming API, pluggable local model backends, and distributed inference. You read real Go and config, and each mechanism is also runnable as a JavaScript model.
10 lessons · ~2.5 hours
1. The Cluster
2. The Desktop App
Wails: Go backend, web UI, one binary
How a single desktop binary runs a Go core and a React frontend that call each other directly — no HTTP server in between.
The local control plane UI
A local-first React UI fed by live events from the Go core, showing fleet nodes, models, and cloud status without polling.
3. The Network
LAN discovery with mDNS
How nodes on the same network find each other automatically — multicast announcements, heartbeats, and pruning the dead.
The WebSocket relay
Why cross-network routing needs a relay, and how it registers nodes, tracks presence, and forwards messages between them.
The OpenAI-compatible streaming API
Serving /v1/chat/completions with server-sent events so any OpenAI client works unchanged.
4. Inference & Ops
Local backends behind one interface
Wrapping Ollama, LM Studio, llama.cpp, MLX, and vLLM behind a single driver so the cluster doesn't care which one runs.
Distributed inference (model sharding)
Splitting one model across several machines with llama.cpp RPC when it won't fit on a single GPU.
Deploying & one-click self-update
Running the relay on Linux with systemd, shipping the dashboard, and updating the desktop app without a reinstall.