The Network

The OpenAI-compatible streaming API

Lesson 7 of 10

What you'll learn

  • Understand why matching the OpenAI wire format matters
  • Read the server-sent events (SSE) format used for token streaming
  • Parse an SSE stream into tokens, including the terminating sentinel

The cluster's whole value proposition is "point your existing tools at it." That only works if it speaks the exact dialect those tools already know: OpenAI's. So every node serves POST /v1/chat/completions on localhost:32768, accepting the same JSON and returning the same shape. Swap the base_url, keep your code.
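
To make "swap the base_url" concrete, here is a minimal sketch that builds the request with Python's standard library only. It assumes a node is listening on localhost:32768 as described above; the helper name build_chat_request is hypothetical, and the request is only constructed here, not sent.

```python
import json
import urllib.request

# Assumption: a local node on the port named in the lesson.
BASE_URL = "http://localhost:32768/v1"


def build_chat_request(prompt, model="llama3", base_url=BASE_URL):
    """Build (but do not send) a streaming chat-completions request."""
    body = {
        "model": model,
        "stream": True,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=base_url + "/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_chat_request("hi")
print(req.get_method(), req.full_url)
```

Only base_url ties this code to a particular server; the path and the JSON body stay the same. The curl command that follows sends the equivalent JSON from the shell.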

curl http://localhost:32768/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"llama3","stream":true,
       "messages":[{"role":"user","content":"hi"}]}'

Streaming with server-sent events

With "stream": true, you don't wait for the whole answer; tokens arrive as they're generated. The transport is SSE: the server holds the response open and writes a series of events, each one a data: line carrying a single JSON chunk, followed by a blank line. A literal data: [DONE] marks the end.

data: {"choices":[{"delta":{"content":"Hel"}}]}

data: {"choices":[{"delta":{"content":"lo"}}]}

data: [DONE]
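
To see exactly where the blank lines fall, it can help to look at the format from the writing side. The sketch below serializes text slices into the wire format shown above. The function name to_sse is hypothetical and this is an illustration of the framing only, not how any particular server is implemented.

```python
import json


def to_sse(tokens):
    """Serialize text slices into an OpenAI-style SSE body (illustrative only)."""
    events = []
    for token in tokens:
        chunk = {"choices": [{"delta": {"content": token}}]}
        # One event = one "data:" line followed by a blank line.
        payload = json.dumps(chunk, separators=(",", ":"))
        events.append("data: " + payload + "\n\n")
    # The sentinel is written as plain text, not as JSON.
    events.append("data: [DONE]\n\n")
    return "".join(events)


print(to_sse(["Hel", "lo"]), end="")
```

Each event ends with two newline characters, which is what produces the blank separator lines a parser splits on.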

Each chunk's delta.content is the next slice of text; concatenating the deltas reconstructs the message. This is the same format OpenAI streams, which is why a stock client renders Quorum's output token-by-token with no special handling. Quorum also passes through reasoning deltas for models that emit thinking tokens.
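
Concatenating the deltas can be sketched in a few lines. This example takes the two JSON payloads from the sample stream, with the data: prefix already removed, and joins their delta.content values. Treating a chunk with no content as an empty string is a defensive assumption here, and the helper name delta_text is hypothetical.

```python
import json

# The two JSON payloads from the sample stream, "data: " prefix removed.
payloads = [
    '{"choices":[{"delta":{"content":"Hel"}}]}',
    '{"choices":[{"delta":{"content":"lo"}}]}',
]


def delta_text(payload):
    """Return the text slice carried by one JSON chunk, or "" if it has none."""
    chunk = json.loads(payload)
    choices = chunk.get("choices") or []
    if not choices:
        return ""
    delta = choices[0].get("delta") or {}
    return delta.get("content") or ""


message = "".join(delta_text(p) for p in payloads)
print(message)  # Hello
```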

[DONE] is a sentinel, not JSON

The final data: [DONE] is a plain string, not a JSON object. Parsing every data: payload as JSON without checking for the sentinel first is the classic SSE bug — it throws right at the end of an otherwise perfect stream. Check for [DONE], then parse.
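
The check-then-parse order looks like this in practice. This is a minimal sketch that works on a complete response body held in a string; a live client would apply the same steps to lines as they arrive. The function name assemble is hypothetical, and normalizing \r\n line endings is an assumption about the input.

```python
import json

DONE = "[DONE]"


def assemble(raw):
    """Assemble the message text from a raw SSE body."""
    parts = []
    # Normalize line endings, then split the body into events on blank lines.
    for event in raw.replace("\r\n", "\n").split("\n\n"):
        for line in event.split("\n"):
            if not line.startswith("data:"):
                continue  # skip empty lines and any non-data fields
            payload = line[len("data:"):].strip()
            if payload == DONE:
                # Sentinel: stop here, before json.loads ever sees it.
                return "".join(parts)
            chunk = json.loads(payload)
            choices = chunk.get("choices") or []
            if choices:
                delta = choices[0].get("delta") or {}
                parts.append(delta.get("content") or "")
    return "".join(parts)  # stream ended without a sentinel


stream = (
    'data: {"choices":[{"delta":{"content":"Hel"}}]}\n\n'
    'data: {"choices":[{"delta":{"content":"lo"}}]}\n\n'
    'data: [DONE]\n\n'
)
print(assemble(stream))  # Hello
```

Moving the json.loads call above the sentinel check reproduces the bug described above: the two real chunks parse fine, then the last payload raises json.JSONDecodeError.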

The challenge parses a raw SSE stream into the assembled message, handling the sentinel correctly.

Parse an SSE token stream (JS model)

Run it. Split the stream into events, skip [DONE], and concatenate the deltas into the final text.

Knowledge check

What is the correct way to handle the final `data: [DONE]` line in an OpenAI-style stream?

Next: what actually generates those tokens — the local model backends behind one interface.
