BuildBot

The Cluster

Scheduling, retry & fallback

Lesson 2 of 10

What you'll learn

  • See how a scheduler picks among several nodes that hold a model
  • Understand retry across peers as the cluster's resilience story
  • Know why local execution is the last-resort fallback

Last lesson's router stopped at "first holder wins." Real clusters have several machines holding the same model, and machines fail mid-request — a peer goes to sleep, a Wi-Fi drop kills a socket, a backend hangs. The scheduler decides who gets the request; retry and fallback decide what happens when that choice doesn't work out.

The order matters: try a remote holder, and if it errors, try a different holder before giving up. Only when no peer can serve it does the node fall back to running the model itself. That sequence is what stops one flaky machine from turning into one failed request.

Don't retry the machine that just failed

The cardinal rule: a retry must go to a different holder than the one that just errored. Retrying the same dead node burns your attempts for nothing. Quorum's router walks the list of holders, skipping any it has already tried, with a bounded retry count so a request can't loop forever.

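// apps/desktop/internal/router (simplified; needs import "fmt")
func (r *Router) Serve(req Request) (Response, error) {
    holders := r.cluster.Holders(req.Model) // peers that hold the model
    var lastErr error                       // most recent peer failure
    for attempt := 0; attempt < maxRetries && len(holders) > 0; attempt++ {
        node := r.schedule(holders) // pick one of the untried holders
        resp, err := r.forward(node, req)
        if err == nil {
            return resp, nil
        }
        lastErr = err
        holders = remove(holders, node) // never retry the same node
    }
    // Fallback: no peer could serve the request, so run the model ourselves.
    resp, err := r.runLocal(req)
    if err != nil && lastErr != nil {
        // Report the last peer failure as well, so it is not silently dropped.
        return resp, fmt.Errorf("local fallback failed: %w (last peer error: %v)", err, lastErr)
    }
    return resp, err
}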
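
The listing above leaves r.schedule and remove undefined. As a minimal sketch of what they might look like, assuming each node tracks a count of in-flight requests, the scheduler could simply pick the least-loaded holder. The names Node, InFlight, pickLeastLoaded, and without are hypothetical and are not taken from Quorum's source; the real scheduler may weigh other signals such as latency or hardware.

```go
package main

// Node is a hypothetical view of a peer that holds the model.
type Node struct {
	ID       string
	InFlight int // requests currently being served (assumed to be tracked elsewhere)
}

// pickLeastLoaded returns the holder with the fewest in-flight requests.
// Ties go to the earliest entry, so the choice is deterministic.
// The second return value is false when holders is empty.
func pickLeastLoaded(holders []Node) (Node, bool) {
	if len(holders) == 0 {
		return Node{}, false
	}
	best := holders[0]
	for _, n := range holders[1:] {
		if n.InFlight < best.InFlight {
			best = n
		}
	}
	return best, true
}

// without returns a copy of holders with the node named id removed.
// Copying leaves the caller's slice untouched, which matters if the
// holder list is shared with other in-flight requests.
func without(holders []Node, id string) []Node {
	out := make([]Node, 0, len(holders))
	for _, n := range holders {
		if n.ID != id {
			out = append(out, n)
		}
	}
	return out
}
```

With holders a (3 in flight), b (1), and c (1), pickLeastLoaded returns b. After without(holders, "b"), the next pick is c, so a retry never lands on the node that just failed.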

Retries must be safe to repeat

Streaming completions can fail after tokens have been sent. Quorum only retries before the first byte reaches the client — once output has started, a silent retry would splice two different generations together and produce garbled text. Retry the connection, not a half-streamed answer.
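
One way to enforce the first-byte rule is to wrap the client's output callback and record whether anything has been emitted. The sketch below is an illustration under assumptions, not Quorum's implementation: Attempt, streamWithRetry, ErrPartialStream, isPartial, and the sample error errPeerDown are all hypothetical names.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrPartialStream marks a failure that happened after output had started.
var ErrPartialStream = errors.New("stream failed after first token")

// errPeerDown is a sample peer error for experimenting with the sketch.
var errPeerDown = errors.New("peer down")

// Attempt streams tokens from one holder by calling emit once per token.
type Attempt func(emit func(token string)) error

// streamWithRetry runs the attempts in order. It moves on to the next
// attempt only while nothing has reached the client; once a token has been
// emitted, a failure is returned immediately instead of being retried.
func streamWithRetry(attempts []Attempt, emit func(token string)) error {
	started := false
	guarded := func(token string) {
		started = true // the client has now seen output
		emit(token)
	}
	lastErr := errors.New("no holders to try")
	for _, try := range attempts {
		err := try(guarded)
		if err == nil {
			return nil
		}
		if started {
			// Retrying here would splice two generations together.
			return fmt.Errorf("%w: %v", ErrPartialStream, err)
		}
		lastErr = err
	}
	return lastErr
}

// isPartial reports whether err came from a stream that had already started.
func isPartial(err error) bool { return errors.Is(err, ErrPartialStream) }
```

An attempt that fails before emitting anything is followed by the next attempt. An attempt that emits "par" and then fails ends the request with an error for which isPartial returns true, and the client decides what to do with the partial text.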

The challenge models the loop: a list of holders where some throw, a bounded retry count, and a local fallback when the peers are exhausted.

Retry across holders, then local (JS model)

Run it. Two holders fail before a third succeeds; if all fail, it falls back to local. Try setting every holder to fail.
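
The same scenario can also be sketched in Go, the language of the router listing. This is a standalone model under assumptions, not the course's challenge code: Holder, failing, healthy, and serve are hypothetical names, and the "local" answer is a placeholder string rather than a real model call.

```go
package main

import "errors"

// Holder is a hypothetical stand-in for a peer: a name plus a function
// that either returns a completion or fails.
type Holder struct {
	Name  string
	Serve func(prompt string) (string, error)
}

var errDown = errors.New("holder unavailable")

// failing builds a holder that always errors.
func failing(name string) Holder {
	return Holder{Name: name, Serve: func(string) (string, error) {
		return "", errDown
	}}
}

// healthy builds a holder that always answers.
func healthy(name string) Holder {
	return Holder{Name: name, Serve: func(p string) (string, error) {
		return name + " answered: " + p, nil
	}}
}

// serve tries at most maxRetries holders, each one at most once, in list
// order. If none succeeds it falls back to a local answer. It returns the
// response and the names of the holders it tried.
func serve(holders []Holder, maxRetries int, prompt string) (string, []string) {
	tried := []string{}
	for i := 0; i < len(holders) && i < maxRetries; i++ {
		h := holders[i]
		tried = append(tried, h.Name)
		if resp, err := h.Serve(prompt); err == nil {
			return resp, tried
		}
	}
	// Peers exhausted or retry budget spent: serve it ourselves.
	return "local answered: " + prompt, tried
}
```

Calling serve with failing("a"), failing("b"), healthy("c") and a budget of 3 returns c's answer after trying all three. Making every holder fail, or lowering the budget to 2, returns the local answer instead.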

Knowledge check

Why does the router remove a node from the candidate list after it errors?

Next: where these nodes actually live — the Wails desktop app that runs the Go router and a web UI in a single binary.
