Measuring LLM Systems

Agent observability & tracing

Lesson 4 of 5

What you'll learn

Model an agent run as a trace of spans (think → tool → observe)
Attribute latency, tokens, and cost to each step
Explain why observability is table stakes for agents in 2026

A single LLM call is easy to reason about. An agent loop — think, call a tool, observe the result, think again — is not. When it's slow, expensive, or wrong, "the agent" is too coarse to debug. You need to see which step burned three seconds, which tool call cost a dollar, which observation sent the model down a bad path. That visibility is a trace: a tree of timed, instrumented spans covering one run.

Spans and traces

A span is one unit of work with a start, an end, and attributes. A trace is all the spans for one agent run, linked by a shared trace ID and parent pointers. The standard agent loop produces a recognizable shape:

{
  "trace_id": "t_91",
  "spans": [
    { "name": "think",       "ms": 820, "tokens": 1400, "cost_usd": 0.0042 },
    { "name": "tool:search", "ms": 410, "tokens": 0,    "cost_usd": 0.0001 },
    { "name": "observe",     "ms": 30,  "tokens": 600,  "cost_usd": 0.0018 }
  ]
}

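Roll the spans up and you get the three numbers every operator watches: total wall-clock latency, total tokens, and total cost. For the trace above that is 1,260 ms, 2,000 tokens, and $0.0061. (Summing durations gives wall-clock time here because the steps run one after another.) Break the totals down by span and you find the one step worth optimizing: in this run, think accounts for about 65% of the latency and about 69% of the cost.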
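The roll-up can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed implementation: the trace dictionary copies the sample above, and the `summarize` function and its output keys are names assumed for this sketch.

```python
# Sample trace copied from the lesson; summarize() is a hypothetical helper.
trace = {
    "trace_id": "t_91",
    "spans": [
        {"name": "think", "ms": 820, "tokens": 1400, "cost_usd": 0.0042},
        {"name": "tool:search", "ms": 410, "tokens": 0, "cost_usd": 0.0001},
        {"name": "observe", "ms": 30, "tokens": 600, "cost_usd": 0.0018},
    ],
}


def summarize(trace):
    """Roll a trace up into totals and pick out the slowest span."""
    spans = trace["spans"]
    total_ms = sum(s["ms"] for s in spans)
    return {
        "trace_id": trace["trace_id"],
        # Summing durations equals wall-clock time only for sequential spans.
        "total_ms": total_ms,
        "total_tokens": sum(s["tokens"] for s in spans),
        # Round to avoid floating-point noise in the summed cost.
        "total_cost_usd": round(sum(s["cost_usd"] for s in spans), 6),
        "slowest": max(spans, key=lambda s: s["ms"])["name"],
        # Fraction of total latency spent in each span (0.0 if nothing was timed).
        "latency_share": {
            s["name"]: (s["ms"] / total_ms if total_ms else 0.0) for s in spans
        },
    }


summary = summarize(trace)
print(summary)
```

The per-span `latency_share` is what points at the step worth optimizing; the totals alone only tell you that something is slow or expensive.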

Why it's table stakes now

In 2026, observability is the difference between "we shipped an agent" and "we operate an agent." Without traces you cannot answer the questions that decide whether a system stays in production: Why did p95 latency double? Which user's run cost $4? Which tool is flaky? Teams instrument spans the way web backends instrument requests — and interviewers ask about it because it's the clearest tell that someone has run agents past the demo.
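Once every run has a trace, those three questions become small aggregations over the rolled-up runs. The sketch below assumes a hypothetical list of per-run summaries; the data, the field names (`user`, `total_ms`, `cost_usd`, `tool_calls`), and the helper functions are all invented for illustration, and the percentile uses the simple nearest-rank method.

```python
import math
from collections import defaultdict

# Hypothetical per-run summaries. Each tool call is (tool_name, succeeded).
runs = [
    {"trace_id": "t_1", "user": "u_a", "total_ms": 1200, "cost_usd": 0.01,
     "tool_calls": [("search", True)]},
    {"trace_id": "t_2", "user": "u_a", "total_ms": 1500, "cost_usd": 0.02,
     "tool_calls": [("search", True), ("fetch", True)]},
    {"trace_id": "t_3", "user": "u_c", "total_ms": 900, "cost_usd": 0.008,
     "tool_calls": [("fetch", False)]},
    {"trace_id": "t_4", "user": "u_b", "total_ms": 4800, "cost_usd": 4.00,
     "tool_calls": [("search", True), ("fetch", False), ("fetch", True)]},
    {"trace_id": "t_5", "user": "u_c", "total_ms": 1300, "cost_usd": 0.015,
     "tool_calls": [("search", True)]},
]


def p95(values):
    """Nearest-rank 95th percentile of a non-empty list."""
    ordered = sorted(values)
    rank = math.ceil(0.95 * len(ordered))
    return ordered[rank - 1]


def expensive_runs(runs, threshold_usd):
    """Which user's run cost at least threshold_usd?"""
    return [(r["user"], r["trace_id"]) for r in runs if r["cost_usd"] >= threshold_usd]


def tool_failure_rates(runs):
    """Which tool is flaky? Failure rate per tool across all runs."""
    calls, failures = defaultdict(int), defaultdict(int)
    for r in runs:
        for tool, ok in r["tool_calls"]:
            calls[tool] += 1
            if not ok:
                failures[tool] += 1
    return {tool: failures[tool] / calls[tool] for tool in calls}


latency_p95 = p95([r["total_ms"] for r in runs])
costly = expensive_runs(runs, threshold_usd=4.0)
flaky = tool_failure_rates(runs)
print(latency_p95, costly, flaky)
```

None of these queries is possible if the only thing recorded is the final answer; each depends on span-level attributes being captured during the run.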

Instrument the loop, not just the call

Wrapping only the model call hides where agents actually spend time and money: tool calls, retries, and waiting. Put a span around every step of the loop so the trace reflects the whole run.
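One way to put a span around every step is a small context manager that times whatever runs inside it and records a parent pointer. The sketch below is an assumption-laden illustration: `Tracer`, `fake_model`, and `fake_search` are hypothetical stand-ins, not the API of any real tracing library, and the token count is a made-up value.

```python
import time
import uuid
from contextlib import contextmanager


class Tracer:
    """Collects spans for one agent run (hypothetical helper, not a real library)."""

    def __init__(self):
        self.trace_id = uuid.uuid4().hex
        self.spans = []   # finished spans, in the order they closed
        self._stack = []  # ids of currently open spans, innermost last

    @contextmanager
    def span(self, name, **attrs):
        record = {
            "span_id": uuid.uuid4().hex[:8],
            "parent_id": self._stack[-1] if self._stack else None,
            "name": name,
            "status": "ok",
            **attrs,
        }
        self._stack.append(record["span_id"])
        start = time.perf_counter()
        try:
            yield record  # the caller can attach tokens, cost, and so on
        except Exception:
            record["status"] = "error"  # failed steps still show up in the trace
            raise
        finally:
            record["ms"] = (time.perf_counter() - start) * 1000
            self._stack.pop()
            self.spans.append(record)


def fake_model(prompt):
    """Stand-in for a model call; returns (text, tokens_used)."""
    return "tracing", 1400


def fake_search(query):
    """Stand-in for a tool call."""
    return ["result about " + query]


tracer = Tracer()
with tracer.span("agent_run"):
    with tracer.span("think") as s:
        query, s["tokens"] = fake_model("What is a trace?")
    with tracer.span("tool:search"):
        results = fake_search(query)
    with tracer.span("observe") as s:
        _, s["tokens"] = fake_model(str(results))

for s in tracer.spans:
    print(s["name"], s["parent_id"], round(s["ms"], 3))
```

Because the tool call and the observation get their own spans, time spent outside the model is visible in the trace instead of being folded into one opaque "agent" number. Spans are appended when they close, so the root `agent_run` span appears last.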

Build a trace and summarize it

Exercise: given the steps of one agent run, assemble a trace and roll it up into total latency, tokens, and cost. Then find the slowest span.


Knowledge check

Why should you wrap every step of an agent loop in a span rather than only the model call?
