Inference & Ops

Local backends behind one interface

Lesson 8 of 10

What you'll learn

See why a backend abstraction keeps the rest of the system simple
Understand the adapter pattern applied to inference engines
Normalize different backend responses to one common shape

Different nodes might run different engines: Ollama on one machine, LM Studio on another, raw llama.cpp or Apple's MLX on a third, vLLM on a server. Each has its own API and its own quirks. If the router had to know all of them, every new engine would touch the whole codebase. Instead, Quorum defines one backend interface and writes a small adapter per engine. The router speaks only the interface.

The lucky break: most of these engines already expose an OpenAI-compatible endpoint. So the dominant adapter, openaicompat, covers Ollama, LM Studio, llama.cpp's server, and vLLM with a single driver pointed at different base URLs (in practice, different hosts and ports). Engines that don't fit get their own adapter, but they all satisfy the same Go interface.

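// apps/desktop/internal/backends: the contract every engine satisfies
package backends

import "context"

// Backend is implemented once per engine adapter.
// Model, ChatRequest, and Stream are the shared types (not shown here).
type Backend interface {
    Name() string                                              // engine identifier
    Models(ctx context.Context) ([]Model, error)               // model list in the common shape
    Chat(ctx context.Context, req ChatRequest) (Stream, error) // streamed chat reply
}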
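
To make the single-driver idea concrete, here is a minimal sketch of what an OpenAI-compatible adapter could look like. It is not Quorum's actual openaicompat code: the OpenAICompat struct, its fields, the Model shape, and the fakeEngine and collect helpers are all hypothetical names invented for illustration. The only thing taken as given is the OpenAI-style list endpoint, GET /v1/models, which returns a data array of entries with an id field. The fake engines are local test servers, so the sketch runs without any real engine installed.

```go
package main

import (
	"context"
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// Model is a hypothetical stand-in for the cluster's common model type.
type Model struct{ ID, Backend string }

// OpenAICompat is a hypothetical driver for any OpenAI-compatible engine.
// Only name and baseURL differ between engines.
type OpenAICompat struct {
	name    string
	baseURL string
	client  *http.Client
}

func (o *OpenAICompat) Name() string { return o.name }

// Models calls GET {baseURL}/v1/models and maps each entry to a Model.
func (o *OpenAICompat) Models(ctx context.Context) ([]Model, error) {
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, o.baseURL+"/v1/models", nil)
	if err != nil {
		return nil, err
	}
	resp, err := o.client.Do(req)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("%s: unexpected status %s", o.name, resp.Status)
	}
	var body struct {
		Data []struct {
			ID string `json:"id"`
		} `json:"data"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&body); err != nil {
		return nil, err
	}
	models := make([]Model, 0, len(body.Data))
	for _, m := range body.Data {
		models = append(models, Model{ID: m.ID, Backend: o.name})
	}
	return models, nil
}

// fakeEngine starts a local test server that answers /v1/models
// in the OpenAI list shape. It stands in for a real engine.
func fakeEngine(modelID string) *httptest.Server {
	return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if r.URL.Path != "/v1/models" {
			http.NotFound(w, r)
			return
		}
		json.NewEncoder(w).Encode(map[string]any{
			"object": "list",
			"data":   []map[string]string{{"id": modelID, "object": "model"}},
		})
	}))
}

// collect points the same driver at two fake engines on different ports.
func collect() ([]Model, error) {
	a, b := fakeEngine("llama3:8b"), fakeEngine("qwen2.5-7b")
	defer a.Close()
	defer b.Close()
	engines := []*OpenAICompat{
		{name: "ollama", baseURL: a.URL, client: a.Client()},
		{name: "vllm", baseURL: b.URL, client: b.Client()},
	}
	var all []Model
	for _, e := range engines {
		ms, err := e.Models(context.Background())
		if err != nil {
			return nil, err
		}
		all = append(all, ms...)
	}
	return all, nil
}

func main() {
	models, err := collect()
	fmt.Println(models, err)
}
```

The two entries in the engines slice differ only in configuration. The driver code is shared, which is what lets one adapter cover several engines.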

One shape out, many shapes in

The adapter's real work is normalization: take whatever the engine returns and reshape it into the cluster's common type. Ollama's native list endpoint returns a models array whose entries are keyed by name, while an OpenAI-style server returns a data array whose entries are keyed by id; the router never sees the difference because the adapter erases it.
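
As a rough illustration of that normalization step, the sketch below parses two differently shaped model lists into one common slice. The Model type and the function names normalizeOllama and normalizeOpenAI are hypothetical. The payloads are trimmed-down samples of the two list shapes (a models array keyed by name, and a data array keyed by id), not full engine responses.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Model is a hypothetical stand-in for the cluster's common model type.
type Model struct{ ID, Backend string }

// normalizeOllama maps an Ollama-style list, {"models":[{"name":...}]},
// onto the common shape.
func normalizeOllama(raw []byte) ([]Model, error) {
	var body struct {
		Models []struct {
			Name string `json:"name"`
		} `json:"models"`
	}
	if err := json.Unmarshal(raw, &body); err != nil {
		return nil, fmt.Errorf("ollama: %w", err)
	}
	out := make([]Model, 0, len(body.Models))
	for _, m := range body.Models {
		out = append(out, Model{ID: m.Name, Backend: "ollama"})
	}
	return out, nil
}

// normalizeOpenAI maps an OpenAI-style list, {"data":[{"id":...}]},
// onto the same common shape.
func normalizeOpenAI(raw []byte) ([]Model, error) {
	var body struct {
		Data []struct {
			ID string `json:"id"`
		} `json:"data"`
	}
	if err := json.Unmarshal(raw, &body); err != nil {
		return nil, fmt.Errorf("openaicompat: %w", err)
	}
	out := make([]Model, 0, len(body.Data))
	for _, m := range body.Data {
		out = append(out, Model{ID: m.ID, Backend: "openaicompat"})
	}
	return out, nil
}

func main() {
	a, err := normalizeOllama([]byte(`{"models":[{"name":"llama3:8b"}]}`))
	if err != nil {
		panic(err)
	}
	b, err := normalizeOpenAI([]byte(`{"object":"list","data":[{"id":"qwen2.5-7b"}]}`))
	if err != nil {
		panic(err)
	}
	// Both results have the same type, so callers can merge them freely.
	fmt.Printf("%+v\n", append(a, b...))
}
```

Everything engine-specific lives inside the two anonymous structs. Code that consumes the result only ever sees Model values.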

Add an engine, change nothing else

Because the router, the API server, and the UI all depend on the Backend interface — not on Ollama or vLLM specifically — supporting a new engine means writing one adapter that satisfies the interface. Nothing upstream changes. That's the entire point of the abstraction.
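
A small sketch can show why nothing upstream changes. Here listAll plays the role of router code and is written against the interface only. The Backend interface is trimmed to the two methods the sketch needs, and existingEngine, newEngine, listAll, and demo are hypothetical names with hard-coded model lists, not part of Quorum.

```go
package main

import (
	"context"
	"fmt"
)

// Model is a hypothetical stand-in for the cluster's common model type.
type Model struct{ ID, Backend string }

// Backend mirrors the lesson's interface, trimmed to the methods this sketch uses.
type Backend interface {
	Name() string
	Models(ctx context.Context) ([]Model, error)
}

// listAll stands in for router code. It depends only on Backend.
func listAll(ctx context.Context, backends []Backend) ([]Model, error) {
	var all []Model
	for _, b := range backends {
		ms, err := b.Models(ctx)
		if err != nil {
			return nil, fmt.Errorf("%s: %w", b.Name(), err)
		}
		all = append(all, ms...)
	}
	return all, nil
}

// existingEngine is a hypothetical adapter that was already registered.
type existingEngine struct{}

func (existingEngine) Name() string { return "existing" }
func (existingEngine) Models(context.Context) ([]Model, error) {
	return []Model{{ID: "llama3:8b", Backend: "existing"}}, nil
}

// newEngine is the only code added to support a new (hypothetical) engine.
type newEngine struct{}

func (newEngine) Name() string { return "new" }
func (newEngine) Models(context.Context) ([]Model, error) {
	return []Model{{ID: "phi-3-mini", Backend: "new"}}, nil
}

// demo registers both adapters and runs the unchanged router code.
func demo() ([]Model, error) {
	backends := []Backend{existingEngine{}, newEngine{}}
	return listAll(context.Background(), backends)
}

func main() {
	models, err := demo()
	if err != nil {
		panic(err)
	}
	for _, m := range models {
		fmt.Printf("%s via %s\n", m.ID, m.Backend)
	}
}
```

Supporting newEngine required one new type and one extra entry in the backends slice. listAll was not edited.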

The challenge models two backends with different response shapes, each wrapped by an adapter that returns the same normalized model list.

Adapters normalize backends (JS model)

Run it. Two engines report models differently; each adapter maps them to one common shape.


Knowledge check

What does the backend interface buy the rest of the system?

Next: what happens when one model is too big for any single machine — distributed inference.
