← All insights
TechnicalMay 26, 20265 min read

Evaluating Multi-Agent Workflows in Production: Beyond Per-Agent Accuracy

Share
Article cover placeholder

TL;DR

A five-layer evaluation harness for multi-agent workflows: per-agent, handoff, pipeline, production replay, and human evaluation. Plus the metrics, failure modes, and deployment sequence that catch regressions before users do.

Evaluating Multi-Agent Workflows in Production: Beyond Per-Agent Accuracy

Single-agent evaluation is a solved problem in principle. Build a test set, run inference, measure correctness against the gold standard. The methodology is well-understood, the tools are mature, the metrics are simple.

Multi-agent evaluation is not solved. Per-agent accuracy is necessary but insufficient. A pipeline of three agents each 95% accurate is not 95% accurate — it is closer to 86% if the errors compound. The pipeline can also fail in ways no individual agent fails: handoff format mismatches, context loss between steps, infinite loops between agents, deadlocks on approval gates.

This is the evaluation pattern that actually works for multi-agent workflows in production.

What single-agent evaluation gives you

A well-instrumented single-agent evaluation produces:

  • Accuracy against a labeled test set
  • Latency distribution
  • Cost per inference
  • Failure mode breakdown (which categories of error are most common)
  • Drift detection over time

These are necessary for each agent in your pipeline. They are not enough at the pipeline level.

What multi-agent evaluation has to add

Pipeline-level correctness: did the overall pipeline produce the right outcome, not just the right intermediate outputs? An agent that produces a beautiful structured output that the next agent cannot parse is correct individually and broken in pipeline.

Compound failure rate: when each agent has some error rate, what is the rate of pipeline-level errors? Often much higher than the worst individual agent, sometimes lower if downstream agents catch upstream errors.

Handoff integrity: between every pair of agents, does the data format that agent A produces match what agent B expects? Schema validation catches the obvious; semantic mismatch (correct schema, wrong content) is harder.

Loop and deadlock detection: multi-agent systems can loop indefinitely or deadlock. Production needs telemetry to detect these and circuit breakers to stop them.

Approval gate behaviour: do human approval gates resolve in expected time? What is the override rate? When approvals are slow, what is the queue depth?

Cost compounding: a pipeline's cost is the sum of its agents' costs. Predicting and capping pipeline cost requires evaluation at the pipeline level.

The evaluation harness pattern

Build the harness with these layers:

Layer 1: per-agent evaluation. Standard test sets per agent, run on each model version and prompt change. Gates the agent's deployment.

Layer 2: handoff evaluation. For each agent-to-agent boundary, a test set of upstream outputs and the expected downstream behaviour. Catches format mismatches and semantic drift at handoffs.

Layer 3: pipeline evaluation. End-to-end pipeline runs on representative inputs, comparing pipeline output to gold-standard pipeline output. Measures compound correctness.

Layer 4: production replay. Periodically replay production traces against current pipeline configuration, comparing current output to historical output. Catches silent regressions.

Layer 5: human evaluation. For subjective quality (helpfulness, tone, completeness), sampled human evaluation on production outputs. Tracks quality dimensions that automated metrics cannot.

Each layer answers different questions. All five together give you confidence in the pipeline's behaviour.

The metrics that actually matter

Beyond accuracy, multi-agent pipelines need:

  • End-to-end success rate: proportion of pipeline runs that produce a usable outcome
  • Time to outcome: distribution of total pipeline duration, broken out by phase
  • Cost per successful outcome: total agent costs divided by successful outcomes
  • Override rate per agent: fraction of agent outputs that a human reviewer rejects or modifies
  • Approval gate cycle time: distribution of time from "awaiting approval" to "approved"
  • Loop rate: fraction of runs that hit the loop limit
  • Failure mode distribution: categories of failure (model error, integration error, timeout, approval rejection)

For each metric, baseline and continuously measure. When the metric moves materially, investigate.

Specific failure modes to watch

Silent prompt injection: an upstream agent produces output that contains instructions to a downstream agent. The downstream agent follows the embedded instructions instead of its intended behaviour. Catch with output validation that strips control patterns and with prompt structure that distinguishes "data the agent should process" from "instructions the agent should follow."

Context window overflow: a pipeline that passes accumulated context between agents can hit the model's context limit on long runs. The agent then either truncates context arbitrarily (losing critical earlier information) or fails. Detect with context size monitoring and explicit truncation strategies.

Schema drift: an upstream agent's output schema evolves; the downstream agent has not been updated. Schema validation between agents catches this immediately.

Approval gate timeout: a human approver is unavailable; the pipeline waits indefinitely. Set timeouts on approval gates with escalation paths.

Cost runaway: an agent with a loop bug runs the model thousands of times per pipeline. Cap costs per agent and per pipeline; alert when caps are approached.

Cascading failure: an upstream agent failure causes downstream agents to receive garbage input and produce garbage output. Add explicit error states between agents rather than passing failed outputs as if they were valid.

The production playbook for catching regressions

When you change an agent (new prompt, new model, new tool), the deployment sequence:

  1. Per-agent evaluation passes on current test set
  2. Handoff evaluation passes on the boundaries this agent participates in
  3. Pipeline evaluation passes on the pipelines this agent is part of
  4. Shadow deployment: new agent runs in parallel with old, outputs compared
  5. Canary deployment: small percentage of traffic on new agent
  6. Production replay: weekly replay of historical pipelines on current configuration
  7. Full deployment

Each stage gates the next. Skipping stages is how silent regressions reach production.

What AgentWorks supports for evaluation

AgentWorks provides per-agent test sets and evaluation runs, pipeline-level evaluation harnesses, production replay, and the observability needed to track the metrics above. The audit log captures enough detail (see the audit trail article) that production replay can reconstruct exactly what happened on any past run for any pipeline.

The harder part is not the tooling; it is the discipline. Building the test sets, running the harness on every change, treating evaluation regressions as deployment blockers rather than nice-to-fixes. This is engineering culture work that the platform supports but does not replace.

For multi-agent workflows that matter — where errors have business consequences — the evaluation investment pays for itself within a quarter of the first prevented incident. The teams that skip it ship faster initially and rebuild from incidents repeatedly afterwards.

About the author

· Founder, AgentWorks

Erwin Berkouwer is the founder of AgentWorks — an AI agent platform purpose-built for European teams that need EU AI Act-ready governance, multi-LLM choice across OpenAI, Anthropic, Google and Mistral, and transparent per-token € pricing.

Read more about Erwin