Measuring LLM Systems
Agent observability & tracing
Lesson 4 of 5
What you'll learn
- Model an agent run as a trace of spans (think → tool → observe)
- Attribute latency, tokens, and cost to each step
- Explain why 2026 treats observability as table stakes
A single LLM call is easy to reason about. An agent loop — think, call a tool, observe the result, think again — is not. When it's slow, expensive, or wrong, "the agent" is too coarse to debug. You need to see which step burned three seconds, which tool call cost a dollar, which observation sent the model down a bad path. That visibility is a trace: a tree of timed, instrumented spans covering one run.
Spans and traces
A span is one unit of work with a start, an end, and attributes. A trace is all the spans for one agent run, linked by a shared trace ID and parent pointers. The standard agent loop produces a recognizable shape:
{
"trace_id": "t_91",
"spans": [
{ "name": "think", "ms": 820, "tokens": 1400, "cost_usd": 0.0042 },
{ "name": "tool:search", "ms": 410, "tokens": 0, "cost_usd": 0.0001 },
{ "name": "observe", "ms": 30, "tokens": 600, "cost_usd": 0.0018 }
]
}
Roll the spans up and you get the three numbers every operator watches: total wall-clock latency, total tokens, total cost. Break them down by span and you find the one step worth optimizing.
Why it's table stakes now
In 2026, observability is the difference between "we shipped an agent" and "we operate an agent." Without traces you cannot answer the questions that decide whether a system stays in production: Why did p95 latency double? Which user's run cost $4? Which tool is flaky? Teams instrument spans the way web backends instrument requests — and interviewers ask about it because it's the clearest tell that someone has run agents past the demo.
Instrument the loop, not just the call
Wrapping only the model call hides where agents actually spend time and money: tool calls, retries, and waiting. Put a span around every step of the loop so the trace reflects the whole run.
Run it. Given the steps of one agent run, assemble a trace and roll it up into total latency, tokens, and cost — then find the slowest span.
Why should you wrap every step of an agent loop in a span rather than only the model call?
Saved on this device. Sign in to sync your progress everywhere.