The Cluster
Scheduling, retry & fallback
Lesson 2 of 10
What you'll learn
- See how a scheduler picks among several nodes that hold a model
- Understand retry across peers as the cluster's resilience story
- Know why local execution is the last-resort fallback
Last lesson's router stopped at "first holder wins." Real clusters have several machines holding the same model, and machines fail mid-request — a peer goes to sleep, a Wi-Fi drop kills a socket, a backend hangs. The scheduler decides who gets the request; retry and fallback decide what happens when that choice doesn't work out.
The order matters: try a remote holder, and if it errors, try a different holder before giving up. Only when no peer can serve it does the node fall back to running the model itself. That sequence is what stops one flaky machine from turning into one failed request.
Don't retry the machine that just failed
The cardinal rule: a retry must go to a different holder than the one that just errored. Retrying the same dead node burns your attempts for nothing. Quorum's router walks the list of holders, skipping any it has already tried, with a bounded retry count so a request can't loop forever.
// apps/desktop/internal/router — simplified
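import "fmt"

func (r *Router) Serve(req Request) (Response, error) {
	holders := r.cluster.Holders(req.Model) // peers that have the model
	var lastErr error
	for attempt := 0; attempt < maxRetries && len(holders) > 0; attempt++ {
		node := r.schedule(holders) // pick one of the untried holders
		resp, err := r.forward(node, req)
		if err == nil {
			return resp, nil
		}
		lastErr = err
		holders = remove(holders, node) // never retry the same node
	}
	// Every peer attempt failed (or there were no holders): serve it ourselves.
	resp, err := r.runLocal(req)
	if err != nil && lastErr != nil {
		// Report the last peer error alongside the local failure.
		return resp, fmt.Errorf("local fallback failed: %w (last peer error: %v)", err, lastErr)
	}
	return resp, err
}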
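The lesson doesn't spell out how `r.schedule` chooses among holders. One plausible policy, shown here purely as an assumption, is "least loaded": pick the holder with the fewest requests in flight. The `Node` type, its `InFlight` field, and `pickLeastLoaded` are hypothetical names for illustration, not Quorum's actual API.

```go
package main

import "fmt"

// Node is a hypothetical view of a peer: its name and how many requests
// it is currently serving. Quorum's real node type is not shown in the lesson.
type Node struct {
	Name     string
	InFlight int
}

// pickLeastLoaded returns the holder with the fewest in-flight requests.
// Ties go to the earliest entry, so the choice is deterministic.
// It reports false when there are no holders to choose from.
func pickLeastLoaded(holders []Node) (Node, bool) {
	if len(holders) == 0 {
		return Node{}, false
	}
	best := holders[0]
	for _, n := range holders[1:] {
		if n.InFlight < best.InFlight {
			best = n
		}
	}
	return best, true
}

func main() {
	holders := []Node{{"laptop", 3}, {"desktop", 1}, {"mini", 1}}
	if n, ok := pickLeastLoaded(holders); ok {
		fmt.Println("chosen:", n.Name) // chosen: desktop
	}
}
```

Because the router removes a failed node from the candidate list before the next attempt, a policy like this never needs to know about retries: it simply picks the best of whatever is left.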
Retries must be safe to repeat
Streaming completions can fail after tokens have been sent. Quorum only retries before the first byte reaches the client — once output has started, a silent retry would splice two different generations together and produce garbled text. Retry the connection, not a half-streamed answer.
The challenge models the loop: a list of holders where some throw, a bounded retry count, and a local fallback when the peers are exhausted.
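A minimal Go sketch of that loop, under stated assumptions: the `holder` type, the `fail` flag, and the value of `maxRetries` are invented for illustration, since the lesson does not give the real ones.

```go
package main

import (
	"errors"
	"fmt"
)

// holder is a stand-in for a peer: a name plus a flag that makes it throw.
type holder struct {
	name string
	fail bool
}

// maxRetries is an assumed bound; the lesson does not state the real value.
const maxRetries = 3

// forward pretends to send the request to a peer.
func forward(h holder) (string, error) {
	if h.fail {
		return "", errors.New(h.name + ": unreachable")
	}
	return "served by " + h.name, nil
}

// serve tries up to maxRetries distinct holders in order, visiting each
// at most once, then falls back to local execution.
func serve(holders []holder) string {
	for attempt := 0; attempt < maxRetries && attempt < len(holders); attempt++ {
		resp, err := forward(holders[attempt])
		if err == nil {
			return resp
		}
		fmt.Println("attempt", attempt+1, "failed:", err)
	}
	return "served locally"
}

func main() {
	holders := []holder{{"a", true}, {"b", true}, {"c", false}}
	fmt.Println(serve(holders))
}
```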
Run it. Two holders fail before a third succeeds; if all fail, it falls back to local. Try setting every holder to fail.
Why does the router remove a node from the candidate list after it errors?
Next: where these nodes actually live — the Wails desktop app that runs the Go router and a web UI in a single binary.