In the previous post I argued that agents will make your telemetry explode and most stacks are not ready. Several people wrote to ask the natural follow-up: so what do you actually do about it?
This is my answer.
The short version: you need to instrument the agent’s reasoning as first-class telemetry, before it ships anything to production. Not as an afterthought. Not as a dashboard you add later. As a design constraint from the beginning.
Here is what that means in practice.
The taxonomy that names the problem
Brian Suh wrote a piece that I keep referencing in client conversations. In his framing, without programmatic verification of what an agent did, you end up in one of three modes: babysitter (a human reviews every action), auditor (you run exhaustive end-to-end checks after the fact), or prayer (you ship and hope the outputs are acceptable).
If you’ve ever written MANDATORY or DO NOT SKIP in a prompt, you’ve hit the ceiling of prompting.
Most teams that are struggling with agent observability are in prayer mode without knowing it. They have traces. They have logs. But none of it connects to the agent’s actual decision path — which tool it called, why it called it, what it saw before it decided to make a change. The telemetry exists but it cannot answer the question you ask at 3am: what did the agent do and why.
The three things you actually need
The first is structured tool call logging. Every tool an agent invokes (read file, write file, call API, run test) needs to be a span. Not a log line. A span, with a duration, a trace ID, a parent span ID, and the input and output attached as attributes. This sounds obvious but almost nobody does it by default. Auto-instrumentation gives you HTTP calls and database queries. It does not give you the agent’s tool calls unless you instrument them explicitly.
The second is trace context propagation that survives agent boundaries. When an agent spawns a sub-agent, or hands off to another service, or writes a file that another process reads later — the trace context needs to follow. W3C trace context via OpenTelemetry is the standard, but it only works if every hop in the chain propagates it. One uninstrumented boundary and the trace breaks. You are back to correlating log timestamps by hand at 3am.
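The file handoff is the boundary people miss most often, because there is no HTTP header for a propagator to fill in. Below is a small sketch, using only the standard library, of carrying a W3C `traceparent` value alongside a payload written to disk. The envelope layout and the `write_handoff` and `read_handoff` helpers are my own assumptions, not part of any standard; only the `traceparent` format (version, 32-hex trace ID, 16-hex parent span ID, 2-hex flags) comes from the W3C spec. The parser here accepts version `00` only.

```python
import json
import re
from pathlib import Path

# W3C traceparent: version "00", 32-hex trace id, 16-hex parent span id, 2-hex flags.
_TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def format_traceparent(trace_id, span_id, sampled=True):
    return f"00-{trace_id}-{span_id}-{'01' if sampled else '00'}"


def parse_traceparent(header):
    """Return (trace_id, parent_span_id), or None if the header is malformed."""
    match = _TRACEPARENT.match(header or "")
    if not match:
        return None
    trace_id, span_id, _flags = match.groups()
    # All-zero ids are invalid per the spec.
    if set(trace_id) == {"0"} or set(span_id) == {"0"}:
        return None
    return trace_id, span_id


def write_handoff(path, payload, trace_id, span_id):
    """Writer side: store the payload together with the trace context."""
    envelope = {
        "traceparent": format_traceparent(trace_id, span_id),
        "payload": payload,
    }
    Path(path).write_text(json.dumps(envelope))


def read_handoff(path):
    """Reader side: recover the payload and the context to continue the trace."""
    envelope = json.loads(Path(path).read_text())
    return envelope["payload"], parse_traceparent(envelope.get("traceparent"))
```

The process that reads the file starts its first span with the recovered trace ID and parent span ID, so the work shows up in the same trace as the agent that wrote the file. If `read_handoff` returns `None` for the context, that is your uninstrumented boundary, and it is worth logging loudly.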
The third is treating the agent’s decision as a span of its own. The LLM call is not the unit of work — the decision is. “I read these three files, I determined the documentation was stale, I updated it” is the span. The LLM call is a child of that span. This distinction matters because the LLM call alone tells you nothing about intent. The decision span tells you what the agent was trying to accomplish and what it saw when it made the call.
Known shape, observable behavior
Will Larson makes the point that agents work best on recurring tasks where the shape of the work is known in advance. I think this is right, and I think it is also the key to making agents observable. If you know the shape of the work, you can define what correct looks like. And if you can define correct, you can instrument it — you know what spans to expect, what attributes to assert on, what SLOs to set.
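One way to turn "you know what spans to expect" into something executable is a small shape contract checked against each run. The sketch below is an assumption about how you might write one: the `EXPECTED_SHAPE` contract, the span and attribute names, and the `shape_violations` helper are all invented for illustration. It checks that the expected spans appear in order (extra spans in between are allowed) and that each carries its required attributes.

```python
# Hypothetical contract for one recurring task: which spans must appear, in
# order, and which attributes each one must carry.
EXPECTED_SHAPE = [
    ("decision.check_doc_drift", {"job.id"}),
    ("tool.read_file", {"tool.input"}),
    ("llm.call", {"model"}),
    ("tool.write_file", {"tool.input", "tool.output"}),
]


def shape_violations(spans, expected=EXPECTED_SHAPE):
    """Compare recorded spans (dicts with 'name' and 'attributes') to the contract."""
    problems = []
    names = [s["name"] for s in spans]
    cursor = 0  # only look at spans after the last one we matched
    for name, required in expected:
        try:
            index = names.index(name, cursor)
        except ValueError:
            problems.append(f"missing span: {name}")
            continue
        missing = required - set(spans[index]["attributes"])
        if missing:
            problems.append(f"{name} lacks attributes: {sorted(missing)}")
        cursor = index + 1
    return problems
```

An empty list means the run matched the known shape. Anything else is a signal you can alert on or count toward an SLO, which is only possible because the task was structured enough to have a shape in the first place.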
Dosu’s documentation drift detection in GitHub Actions is a clean example of this. The agent runs on a known trigger, performs a known type of analysis, produces a known type of output. Every step is structured. Every step is observable. Response time went from 42 hours to 12 minutes not because the agent is magic but because the task was defined precisely enough that the agent could own it fully — and the team could verify the result.
That contrast — 42 hours vs 12 minutes — is not about model quality. It is about task structure. Vague tasks produce unobservable behavior. Structured tasks produce observable behavior.
What Maggie Appleton is pointing at
Maggie Appleton from GitHub Next calls it zero alignment: multiple developers running multiple agents on the same codebase, with no shared context between the agents. Each agent has its own view of the world. Each agent produces its own telemetry. None of it is correlated.
This is the observability problem at the coordination layer, and it is coming for teams that are moving fast with agents. The solution is not a better dashboard. It is a shared trace context that all agents participate in — so that when two agents touch the same file in the same hour, you can see that in a single trace view rather than discovering the conflict at code review or, worse, in production.
OpenTelemetry’s baggage propagation is the mechanism. It lets you carry arbitrary key-value context across process boundaries. An agent that knows it is part of a larger job (say, a migration or a refactor) can put that job ID in baggage and copy it onto every span it produces. Baggage is not attached to spans automatically, so the copy has to be explicit, ideally in one central place rather than at every call site. Every other agent doing work under the same job does the same. Now you have a coherent view of what the agents collectively did, not just what each one did in isolation.
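As a sketch of the mechanics, here is a standard-library version of encoding a job ID into a W3C-style `baggage` header value, decoding it on the other side, and copying it onto a span's attributes. The function names and the `job.id` key are assumptions for illustration, and the parser is deliberately simplified (it ignores per-entry properties after `;`). With the OpenTelemetry SDK you would use its baggage API and propagators rather than handling the header yourself.

```python
from urllib.parse import quote, unquote


def encode_baggage(entries):
    """Serialize a dict into a W3C-style `baggage` header value."""
    return ",".join(
        f"{quote(k, safe='')}={quote(str(v), safe='')}" for k, v in entries.items()
    )


def decode_baggage(header):
    """Parse a `baggage` header value, ignoring per-entry properties after ';'."""
    entries = {}
    for member in filter(None, (m.strip() for m in header.split(","))):
        pair = member.split(";", 1)[0]
        key, sep, value = pair.partition("=")
        if sep:
            entries[unquote(key.strip())] = unquote(value.strip())
    return entries


def stamp(span_attributes, baggage_header, keys=("job.id",)):
    """Copy selected baggage entries onto a span's attributes.

    Baggage only travels with the request; nothing lands on a span unless
    code like this copies it there.
    """
    baggage = decode_baggage(baggage_header)
    return {**span_attributes, **{k: baggage[k] for k in keys if k in baggage}}
```

Copying only an allow-list of keys (here just `job.id`) is a deliberate choice: baggage crosses service boundaries, so you do not want every entry someone upstream added ending up as a span attribute in your backend.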
The design constraint
I said at the start that agent observability is a design constraint, not a dashboard problem. This is what I mean: if you wait until the agents are in production to figure out what to instrument, you will be too late. The decisions about what to log, what context to propagate, what spans to create — those need to happen before the agent ships its first PR. Not after the first incident.
The teams I work with that are getting this right all have one thing in common: they treat the agent’s observability requirements as a first-class part of the agent’s design. The same way you would not ship a service with no health check endpoint, you do not ship an agent with no decision trace.
If your agents are already in production and you are retrofitting this — which is most people’s reality — the place to start is the tool calls. Instrument every tool call as a span. Get that into your existing trace backend. That single change will show you more about what your agents are actually doing than any amount of log searching.
The rest follows from there.
If this is work you need help with, reach out.