What is the difference between logging and tracing for AI agents?

Logs capture discrete events like a tool call happening. Traces capture the causal chain across a full agent run, including the parent-child relationship between the model call that decided to invoke a tool and the tool span itself. Logs tell you what happened at one point; traces let you follow a failure back through the steps that caused it.

Why should tool calls get their own trace spans instead of being folded into the LLM call span?

Folding everything into one span hides which step caused latency, cost, or a wrong answer. A separate span per tool call, with its own arguments, return payload, and duration, lets you isolate a slow database lookup from a slow model completion instead of guessing between them.

How often should agent evals run in production?

On every change to a prompt, tool description, or model version, not just at initial launch. Treat a labeled regression suite as a deploy gate the same way you would gate on unit tests, and pair it with periodic sampling of live traffic to catch failure modes the fixed suite does not cover.

Why does an audit trail need to be append-only?

Because its purpose is to prove what an agent did, including for compliance and dispute resolution, and a table that can be edited after the fact cannot prove that. Append-only storage, such as insert-only tables or hash-chained logs, makes any tampering detectable rather than relying on a policy of not editing records.

AI Agent Observability: Logs, Traces, Evals

Why agent failures are hard to debug

A single support agent handling one ticket might call a knowledge lookup, run a CRM query, invoke a refund tool, and generate a final reply — four to eight model and tool calls chained together. When the customer gets the wrong answer, a plain log line telling you "agent responded" is useless. You need to know which step introduced the error: a bad retrieval, a hallucinated tool argument, or a model that ignored the tool result.

This is the core difference between observability for a normal web service and observability for an AI agent. A typical web request follows one short, mostly linear path. An agent run is a tree of decisions, and the failure is almost never at the leaf you first inspect.

Structured logging is the floor, not the ceiling

Structured JSON logs (request ID, agent ID, step type, latency, token count) are necessary but not sufficient. They tell you a tool call happened. They do not tell you why the model chose that tool, what it saw when the tool returned, or how that fed into the next decision. Logs answer "what happened"; traces answer "why."
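As a rough illustration of that floor, here is a minimal sketch using only Python's standard logging module. The field names (request_id, agent_id, step_type, latency_ms, token_count) follow the list above, but the exact keys, the logger name, and the helper functions are assumptions made for the example, not a prescribed schema.

```python
import json
import logging
import sys


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    # Field names mirror the list in the text; the exact keys are an assumption.
    STEP_FIELDS = ("request_id", "agent_id", "step_type", "latency_ms", "token_count")

    def format(self, record: logging.LogRecord) -> str:
        payload = {"level": record.levelname, "message": record.getMessage()}
        for name in self.STEP_FIELDS:
            # Values passed via `extra=` become attributes on the record.
            if hasattr(record, name):
                payload[name] = getattr(record, name)
        return json.dumps(payload, sort_keys=True)


def build_logger(stream=None) -> logging.Logger:
    """Return a logger that writes one JSON line per event to `stream`."""
    logger = logging.getLogger("agent.steps")  # hypothetical logger name
    logger.handlers.clear()
    handler = logging.StreamHandler(stream if stream is not None else sys.stdout)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


def log_step(logger, *, request_id, agent_id, step_type, latency_ms, token_count):
    """Log one agent step with the structured fields attached."""
    logger.info(
        "step finished",
        extra={
            "request_id": request_id,
            "agent_id": agent_id,
            "step_type": step_type,
            "latency_ms": latency_ms,
            "token_count": token_count,
        },
    )


if __name__ == "__main__":
    log = build_logger()
    log_step(log, request_id="req-123", agent_id="support-agent",
             step_type="tool_call", latency_ms=182, token_count=640)
```

Each line this produces records that a step happened and how long it took. Nothing in it links the step to the model decision that triggered it, which is the gap tracing fills.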

Tracing at the right granularity

Most teams that instrument agents make one mistake early: they trace at the level of the outer HTTP request only, wrapping the whole agent run in a single span. That collapses ten steps into one black box.

The pattern that actually helps in production is span-per-tool-call, distinct from span-per-LLM-call. Each model invocation gets its own span (prompt, completion, token counts, latency, model version). Each tool execution gets its own sibling span (tool name, arguments, return payload, duration, success or failure). Parent-child relationships tie them into a single trace per agent run. This is what the emerging OpenTelemetry GenAI semantic conventions (gen_ai.* span attributes) standardize — a shared vocabulary so traces from different agent frameworks land in the same backend without custom glue code.

With that granularity, you can answer questions a single-span trace can't:

Which specific tool call added the most latency to this run?
Did the model retry the same tool with different arguments, and why?
Which step in a five-step chain first introduced the wrong entity ID?
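To make the span-per-call idea concrete, here is a minimal, standard-library-only sketch of a trace that records one span per model call and one per tool call under a single root span. It is a toy stand-in for a real tracing SDK, not the OpenTelemetry API: the Span and Trace classes, the "agent", "llm", and "tool" labels, and the example tool names are all hypothetical.

```python
import time
from contextlib import contextmanager
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterator, List, Optional


@dataclass
class Span:
    span_id: int
    parent_id: Optional[int]  # None marks the root span of the run
    name: str
    kind: str                 # "agent", "llm", or "tool" (illustrative labels)
    attributes: Dict[str, object] = field(default_factory=dict)
    duration_ms: float = 0.0


class Trace:
    """Collects the spans of one agent run and tracks which span is open."""

    def __init__(self, clock: Callable[[], float] = time.perf_counter) -> None:
        self.spans: List[Span] = []
        self._open: List[Span] = []
        self._clock = clock

    @contextmanager
    def span(self, name: str, kind: str, **attributes: object) -> Iterator[Span]:
        # The innermost open span, if any, becomes the parent.
        parent_id = self._open[-1].span_id if self._open else None
        current = Span(len(self.spans) + 1, parent_id, name, kind, dict(attributes))
        self.spans.append(current)
        self._open.append(current)
        start = self._clock()
        try:
            yield current
        finally:
            current.duration_ms = (self._clock() - start) * 1000
            self._open.pop()

    def slowest(self, kind: str) -> Optional[Span]:
        """Return the longest-running span of the given kind, if any."""
        matching = [s for s in self.spans if s.kind == kind]
        return max(matching, key=lambda s: s.duration_ms, default=None)


def run_agent(trace: Trace) -> None:
    """Stand-in for one agent run: one model call, then two tool calls."""
    with trace.span("support_ticket", kind="agent"):
        with trace.span("choose_tool", kind="llm", model="example-model"):
            pass
        with trace.span("crm_query", kind="tool", arguments={"customer": "c-1"}) as s:
            time.sleep(0.01)  # simulate a slow lookup
            s.attributes["success"] = True
        with trace.span("refund", kind="tool", arguments={"order": "o-9"}) as s:
            s.attributes["success"] = True


if __name__ == "__main__":
    trace = Trace()
    run_agent(trace)
    slow = trace.slowest("tool")
    print(f"slowest tool call: {slow.name} ({slow.duration_ms:.1f} ms)")
```

Because every tool call is its own span with its own arguments and duration, the first question in the list above becomes a one-line query over the trace instead of a guess.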

Cost and token attribution per step, not per run

A related blind spot: most teams only track total tokens per run. That tells finance the run cost 0.04 euros, but not which step drove it. A single overlong tool result stuffed back into context can dominate the cost of an otherwise cheap run. Attributing tokens and cost per span, not just per run, is what lets you find the one workflow step that's quietly burning budget, and lets you decide whether that step needs a smaller model, a shorter context window, or caching.
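A small worked example shows how one step can dominate a run. The per-token prices and the step names below are made up for illustration (real prices depend on the provider and model), and the numbers are chosen so the run lands near the 0.04 euro figure above.

```python
from dataclasses import dataclass
from decimal import Decimal
from typing import List, Sequence, Tuple

# Hypothetical prices in euros per 1,000 tokens; real prices vary by model.
PRICE_PER_1K = {"input": Decimal("0.002"), "output": Decimal("0.008")}


@dataclass(frozen=True)
class StepUsage:
    step: str
    input_tokens: int
    output_tokens: int


def step_cost(usage: StepUsage) -> Decimal:
    """Cost of a single span, in euros."""
    weighted = (
        Decimal(usage.input_tokens) * PRICE_PER_1K["input"]
        + Decimal(usage.output_tokens) * PRICE_PER_1K["output"]
    )
    return weighted / Decimal(1000)


def cost_breakdown(steps: Sequence[StepUsage]) -> List[Tuple[str, Decimal, Decimal]]:
    """Return (step, cost, share of run total), most expensive step first."""
    costs = [(s.step, step_cost(s)) for s in steps]
    total = sum((cost for _, cost in costs), Decimal(0))
    ranked = sorted(costs, key=lambda item: item[1], reverse=True)
    return [(name, cost, cost / total if total else Decimal(0)) for name, cost in ranked]


# One run: the middle step re-reads a long tool result stuffed into context.
RUN = [
    StepUsage("choose_tool", input_tokens=500, output_tokens=100),
    StepUsage("read_tool_result", input_tokens=15000, output_tokens=200),
    StepUsage("final_reply", input_tokens=1000, output_tokens=300),
]

if __name__ == "__main__":
    for name, cost, share in cost_breakdown(RUN):
        print(f"{name:18s} {cost} EUR  {share:.0%}")
```

Under these assumed prices the run costs 0.0378 euros in total, and the read_tool_result step alone accounts for 0.0316 euros of it. A per-run total would show only the first number.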

Evals as a regression gate, not a one-time report

Most teams run an eval once, when the agent ships, and never again. That's backwards. The moment you change a system prompt, swap a model, or adjust a tool description, you can silently regress behavior that worked yesterday.

Treat evals the way you treat unit tests: run a fixed regression suite (a few hundred labeled scenarios with expected tool calls or answer characteristics) on every prompt or config change, before it reaches production. Fail the deploy if the pass rate drops below the last accepted baseline. This is the single highest-leverage practice separating teams who ship agent changes weekly without incident from teams who ship monthly and still get paged.
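A deploy gate of this kind can be sketched in a few lines. Everything here is illustrative: the four-case suite stands in for a few hundred labeled scenarios, and current_agent and candidate_agent are keyword-matching stand-ins for real agent calls, assumed only for the example.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Scenario:
    prompt: str
    expected_tool: str


# A tiny illustrative suite; a real one would hold a few hundred labeled cases.
SUITE = (
    Scenario("Where is my order 1182?", "order_lookup"),
    Scenario("I want my money back for order 1182", "refund"),
    Scenario("How do I reset my password?", "knowledge_lookup"),
    # Adversarial case: mentions a refund but only asks about policy.
    Scenario("What is your refund policy?", "knowledge_lookup"),
)


def pass_rate(choose_tool: Callable[[str], str], suite: Sequence[Scenario]) -> float:
    """Fraction of scenarios where the agent picked the expected tool."""
    if not suite:
        raise ValueError("eval suite is empty")
    passed = sum(1 for case in suite if choose_tool(case.prompt) == case.expected_tool)
    return passed / len(suite)


def gate(candidate: float, baseline: float, tolerance: float = 0.0) -> bool:
    """Allow the deploy only if the candidate did not regress past the tolerance."""
    return candidate >= baseline - tolerance


def current_agent(prompt: str) -> str:
    """Stand-in for the deployed agent's tool choice."""
    text = prompt.lower()
    if "policy" in text or "how do i" in text:
        return "knowledge_lookup"
    if "money back" in text or "refund" in text:
        return "refund"
    return "order_lookup"


def candidate_agent(prompt: str) -> str:
    """Stand-in for a changed prompt that over-triggers the refund tool."""
    text = prompt.lower()
    if "refund" in text or "money back" in text:
        return "refund"
    if "how do i" in text:
        return "knowledge_lookup"
    return "order_lookup"


if __name__ == "__main__":
    baseline = pass_rate(current_agent, SUITE)
    candidate = pass_rate(candidate_agent, SUITE)
    print(f"baseline={baseline:.2f} candidate={candidate:.2f}")
    # In CI, a non-zero exit code is what blocks the deploy.
    raise SystemExit(0 if gate(candidate, baseline) else 1)
```

In this sketch the candidate passes the three easy cases and fails only the adversarial one, so its pass rate falls from 1.0 to 0.75 and the gate blocks the change.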

Pair automated evals with a small sample of human or LLM-judge review on live traffic. Automated checks catch known failure modes, but agents in production surface failure modes you didn't think to test for.

Expert tip: keep your eval suite adversarial. Include a few scenarios designed to trigger the wrong tool call on purpose. A suite that only contains easy cases will pass even after a real regression.

Why the audit trail has to be append-only

Traces and evals are for debugging and quality. The audit trail is a different artifact, built for a different reader: a compliance officer, an auditor, or a customer disputing a decision. It has to answer two questions: what did the agent actually do, and can that record be trusted not to have been altered after the fact?

That requirement, often called non-repudiation, is why an audit trail needs to be append-only at the storage layer, not just "we don't usually edit it" at the application layer. A mutable table with an updated_at column doesn't satisfy this: anyone with database access could rewrite history without leaving a trace. Append-only storage (insert-only tables, write-once object storage, or a hash-chained log) is designed so that tampering is blocked or at least detectable, even when it comes from someone with elevated access. Under frameworks like the EU AI Act, this distinction matters for high-risk use cases; see AgentWorks' AI Act readiness for how that maps to concrete controls.
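Of the three storage options, the hash-chained log is the easiest to show in a few lines. The sketch below is a simplified in-memory illustration, not a production design and not a description of how AgentWorks stores its audit trail. The AuditLog class, the event fields, and the all-zero genesis value are assumptions made for the example.

```python
import copy
import hashlib
import json
from typing import List, Optional

GENESIS_HASH = "0" * 64  # placeholder "previous hash" for the first entry


def entry_hash(prev_hash: str, event: dict) -> str:
    """Hash the event together with the previous entry's hash."""
    body = json.dumps(event, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


class AuditLog:
    """In-memory log that only exposes append and read operations."""

    def __init__(self) -> None:
        self._entries: List[dict] = []

    def append(self, event: dict) -> str:
        prev_hash = self._entries[-1]["hash"] if self._entries else GENESIS_HASH
        stored = copy.deepcopy(event)  # later changes to `event` cannot leak in
        self._entries.append(
            {"event": stored, "prev_hash": prev_hash, "hash": entry_hash(prev_hash, stored)}
        )
        return self._entries[-1]["hash"]

    def entries(self) -> List[dict]:
        """Return a copy, so callers cannot mutate the stored history."""
        return copy.deepcopy(self._entries)


def first_broken_entry(entries: List[dict]) -> Optional[int]:
    """Return the index of the first entry that fails verification, else None."""
    expected_prev = GENESIS_HASH
    for index, entry in enumerate(entries):
        if entry["prev_hash"] != expected_prev:
            return index  # an earlier entry was removed, reordered, or replaced
        if entry_hash(entry["prev_hash"], entry["event"]) != entry["hash"]:
            return index  # this entry's content was edited
        expected_prev = entry["hash"]
    return None


if __name__ == "__main__":
    log = AuditLog()
    log.append({"step": "refund_requested", "amount_eur": 40})
    log.append({"step": "human_approved", "approver": "ops-1"})
    exported = log.entries()
    exported[1]["event"]["approver"] = "someone-else"  # simulate tampering
    print("first broken entry:", first_broken_entry(exported))
```

Editing or deleting any entry breaks the chain from that point on, which is what makes the change detectable. A chain on its own does not stop someone from recomputing every hash after an edit, so real deployments typically also store the latest hash somewhere the writer cannot modify.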

Correlate traces to business outcomes, not just latency

The last gap in most observability setups: dashboards full of latency percentiles and error rates, with no link back to whether the agent actually helped. A trace that completed in 400ms with zero errors is worthless if the customer still called support five minutes later because the answer was wrong.

Tag traces with the business outcome once it's known — ticket resolved without escalation, refund approved without a manual review, deal stage advanced — and join that back to the trace ID. This turns observability data from an engineering dashboard into something a product or operations lead can actually use to decide whether a given agent is worth running.
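The join itself is simple once both sides carry the trace ID. In this sketch the trace records, the outcome labels ("resolved", "escalated"), and the "unknown" placeholder for outcomes that have not arrived yet are all hypothetical. In practice the outcomes would come from a ticketing or CRM system some time after the run.

```python
from typing import Dict, List, Optional

UNKNOWN = "unknown"  # outcome not reported yet

# Illustrative trace summaries; all three look healthy on a latency dashboard.
TRACES = [
    {"trace_id": "t1", "agent_id": "support", "latency_ms": 400, "errors": 0},
    {"trace_id": "t2", "agent_id": "support", "latency_ms": 2100, "errors": 0},
    {"trace_id": "t3", "agent_id": "support", "latency_ms": 380, "errors": 0},
]

# Outcomes arrive later, keyed by trace ID.
OUTCOMES = {"t1": "escalated", "t2": "resolved"}


def join_outcomes(traces: List[dict], outcomes: Dict[str, str]) -> List[dict]:
    """Attach the business outcome to each trace, keyed by trace ID."""
    return [{**t, "outcome": outcomes.get(t["trace_id"], UNKNOWN)} for t in traces]


def resolution_rate(joined: List[dict], success: str = "resolved") -> Optional[float]:
    """Share of traces with a known outcome that ended in `success`."""
    known = [t for t in joined if t["outcome"] != UNKNOWN]
    if not known:
        return None
    return sum(1 for t in known if t["outcome"] == success) / len(known)


if __name__ == "__main__":
    joined = join_outcomes(TRACES, OUTCOMES)
    for t in joined:
        print(t["trace_id"], t["latency_ms"], "ms", t["outcome"])
    print("resolution rate:", resolution_rate(joined))
```

In this made-up data the fastest error-free trace with a known outcome (t1, 400 ms) is the one that ended in an escalation, which latency and error-rate panels alone would never show.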

What this looks like in practice on AgentWorks

Every agent run on AgentWorks writes a structured trace with a span per model call and per tool call, including token and cost attribution at the step level. Human-in-the-loop approval gates sit at any step you configure, so a risky action (a refund, a data export, an external email) pauses for a human decision, and that decision is captured in the same append-only audit trail as the rest of the run — nothing is rewritten after the fact. PII is masked at the gateway before it reaches a model provider, so traces stay useful for debugging without becoming a compliance liability.

Pricing follows the same transparency principle as the traces themselves: tokens are billed at cost plus a 10% markup, drawn from a euro wallet, so the cost you see per run in the trace is the cost you're actually billed. See pricing for plan details.

Getting started

Instrument spans per tool call and per model call, not per request — use the OTel GenAI conventions if you're building custom.
Attribute tokens and cost at the span level so you can find the expensive step, not just the expensive run.
Build a regression eval suite and gate prompt or config changes on it before they ship.
Make the audit trail append-only at the storage layer, and tag traces with business outcomes so the data answers "did this work" as well as "was this fast."

Agents that are cheap to trace are cheap to trust. Once you can see every step, HITL gates and audit trails stop being compliance overhead and become the fastest way to find out what actually went wrong.
