Technical · May 26, 2026 · 6 min read

Agent Error Handling and Recovery Patterns: Production-Ready Resilience

TL;DR

Production error handling for AI agents: retry/backoff for transient model errors, try-different patterns for persistent failures, per-tool recovery, context degradation, pipeline failure paths, and the observability that makes resilience operational.

A demo agent that works on the happy path is two months of engineering away from a production agent that handles the rough edges. Most agent failures in production are not bugs in the agent logic — they are external failures the agent does not handle: model provider timeouts, tool call errors, malformed responses, partial pipeline executions interrupted by infrastructure issues.

This is the catalogue of error patterns and recovery techniques that turn brittle agents into resilient production systems.

The categories of failure

For production agent systems, failures fall into six categories:

Transient model errors: rate limits, temporary provider outages, network errors, occasional malformed responses. Common, retry-able.

Persistent model errors: model returns an output that fails downstream validation, model refuses to comply (safety filtering), model produces incorrect content that passes validation but is semantically wrong. Less common, often need different strategies.

Tool errors: external API returns an error, MCP server is unreachable, tool result is malformed or doesn't match the schema the agent expects. Common, varied recovery strategies.

Context errors: prompt exceeds the model's context window, retrieval returned no results, conversation history is corrupted.

Pipeline orchestration errors: a step in a multi-agent pipeline fails, a human approval times out, a parallel branch fails while others succeed.

Infrastructure errors: platform-side outage, network partition, database unavailable. Usually the platform's responsibility but the agent's behaviour matters during the incident.

Each category needs different handling.

Transient model errors: retry with backoff

The standard pattern:

  • Exponential backoff: 1s, 2s, 4s, 8s with jitter
  • Maximum retries: typically 3-5
  • Per-error-code policies: 429 (rate limit) retries with backoff; 5xx retries; any other 4xx (client error) is not retried unless the request is modified first
  • Circuit breaker: after persistent failures, stop retrying briefly to let the upstream recover
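The backoff and per-error-code bullets above can be sketched roughly as follows. This is a minimal illustration, not a gateway implementation: `ModelHTTPError`, `is_retryable`, `backoff_delay`, and `call_with_retry` are hypothetical names, and the "full jitter" variant (a random delay between zero and the exponential step) is one common choice among several.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class ModelHTTPError(Exception):
    """Hypothetical error type carrying the provider's HTTP status code."""

    def __init__(self, status: int, message: str = "") -> None:
        super().__init__(f"{status} {message}".strip())
        self.status = status


def is_retryable(status: int) -> bool:
    # 429 (rate limit) and 5xx are retried; other 4xx need a changed request.
    return status == 429 or 500 <= status < 600


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 8.0) -> float:
    # Exponential steps of 1s, 2s, 4s, 8s (capped), with full jitter applied.
    return random.uniform(0.0, min(cap, base * (2 ** attempt)))


def call_with_retry(
    call: Callable[[], T],
    max_retries: int = 4,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `call`, retrying retryable errors up to `max_retries` times."""
    for attempt in range(max_retries + 1):
        try:
            return call()
        except ModelHTTPError as err:
            if not is_retryable(err.status) or attempt == max_retries:
                raise  # client error, or retry budget exhausted
            sleep(backoff_delay(attempt))
    raise AssertionError("unreachable")
```

The `sleep` parameter is injectable so the policy can be tested without waiting; a caller would normally leave it at the default.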

What goes wrong: retrying without backoff during a provider incident multiplies load and worsens the outage. Retrying client errors without a retry limit loops indefinitely, because the same request fails the same way every time. Not retrying transient errors makes the system more fragile than it needs to be.

Implementation: the platform's model gateway should handle retry/backoff/circuit-breaker by default so individual agents do not each implement it differently.
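For the circuit-breaker part of that gateway behaviour, a minimal sketch might look like the following. The class name, the threshold of three failures, and the 30-second cooldown are assumptions chosen for illustration; the injectable `clock` exists only to make the logic testable.

```python
import time
from typing import Callable, Optional


class CircuitBreaker:
    """Stops calls after repeated failures, then allows a trial call after a cooldown."""

    def __init__(
        self,
        failure_threshold: int = 3,
        cooldown_seconds: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._threshold = failure_threshold
        self._cooldown = cooldown_seconds
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None  # None means the circuit is closed

    def allow_request(self) -> bool:
        if self._opened_at is None:
            return True
        # Once the cooldown has passed, let a trial request through (half-open).
        return self._clock() - self._opened_at >= self._cooldown

    def record_success(self) -> None:
        self._failures = 0
        self._opened_at = None

    def record_failure(self) -> None:
        self._failures += 1
        if self._failures >= self._threshold:
            # Open the circuit, or restart the cooldown if a trial call failed.
            self._opened_at = self._clock()
```

A gateway would check `allow_request()` before each upstream call and report the result with `record_success()` or `record_failure()`.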

Persistent model errors: try-different patterns

When the same call repeatedly fails:

Different model: route to a different provider or model. If GPT-4o is having issues, try Claude. If the primary model refuses the task, try a model with different safety training.

Different prompt: a variation of the prompt that achieves the same goal in different wording. Some prompts trigger model-specific failures that a rephrase avoids.

Different context: if the failure is context-window-related, summarise older context to free space. If it is retrieval-based, expand the query or try a different retrieval strategy.

Escalate to human: after retry strategies are exhausted, route to a human reviewer rather than producing a low-confidence output.

The order matters: cheap retries first (same provider, same prompt), then alternate-provider, then prompt rephrasing, then human escalation.
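The ordering described above can be expressed as an ordered list of strategies that is walked until one produces a valid output. This is a sketch under assumptions: `Attempt`, `Outcome`, and `run_with_fallbacks` are hypothetical names, and `is_valid` stands in for whatever downstream validation the agent already has.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Attempt:
    label: str                  # e.g. "primary", "alternate-provider"
    run: Callable[[], str]      # returns model output or raises


@dataclass
class Outcome:
    output: Optional[str]
    resolved_by: str            # label of the strategy that worked, or "human"
    errors: List[str]           # one entry per failed strategy, for the audit log


def run_with_fallbacks(
    attempts: List[Attempt], is_valid: Callable[[str], bool]
) -> Outcome:
    """Try strategies in the given order (cheapest first); escalate if all fail."""
    errors: List[str] = []
    for attempt in attempts:
        try:
            output = attempt.run()
        except Exception as err:
            # Every failure is recorded with its strategy label, never swallowed.
            errors.append(f"{attempt.label}: {err}")
            continue
        if is_valid(output):
            return Outcome(output, attempt.label, errors)
        errors.append(f"{attempt.label}: failed validation")
    # All strategies exhausted: hand off to a human instead of guessing.
    return Outcome(None, "human", errors)
```

Because the order is data rather than control flow, a platform could declare it per agent as configuration.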

Tool errors: defined recovery per tool

Each tool integration should declare its error handling:

  • Idempotent tools (lookup operations, read-only queries): retry safely with backoff
  • Non-idempotent tools (sending email, posting to chat, creating records): retry carefully, possibly with idempotency keys to prevent duplicate effects
  • Tools with rate limits: respect 429 responses, back off appropriately
  • Tools with structured error responses: parse the error, decide based on the error type
  • Tools that may be permanently down: failover to alternative tools if available; otherwise escalate

The agent's logic does not need to know all of this. The tool wrapper (MCP server, integration adapter) handles tool-specific recovery and exposes a simpler success/escalate interface to the agent.
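One way such a wrapper could look for a non-idempotent tool is sketched below. All names (`ToolResult`, `idempotency_key`, `SideEffectTool`) are hypothetical, and the in-memory dictionary is an assumption for brevity; a real deployment would persist completed keys and, where the downstream API supports it, pass the key along so the provider can deduplicate too.

```python
import hashlib
import json
from dataclasses import dataclass
from typing import Any, Callable, Dict, Optional


@dataclass
class ToolResult:
    status: str                     # "success" or "escalate"
    value: Any = None
    reason: Optional[str] = None


def idempotency_key(run_id: str, step: str, payload: Dict[str, Any]) -> str:
    # Same run, step, and payload always hash to the same key.
    body = json.dumps(payload, sort_keys=True)
    return hashlib.sha256(f"{run_id}:{step}:{body}".encode()).hexdigest()


class SideEffectTool:
    """Wraps a side-effecting call and exposes only success/escalate to the agent."""

    def __init__(self, send: Callable[[Dict[str, Any]], Any]) -> None:
        self._send = send
        self._completed: Dict[str, Any] = {}  # assumption: in-memory store

    def call(self, run_id: str, step: str, payload: Dict[str, Any]) -> ToolResult:
        key = idempotency_key(run_id, step, payload)
        if key in self._completed:
            # A retry of the same step: return the earlier result, send nothing.
            return ToolResult("success", self._completed[key])
        try:
            value = self._send(payload)
        except ConnectionError as err:
            return ToolResult("escalate", reason=str(err))
        self._completed[key] = value
        return ToolResult("success", value)
```

With this shape, a pipeline retry that repeats the same step with the same payload does not send a second email.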

Context errors: graceful degradation

Context window overflow is the common one:

  • Detect that the prompt is approaching the limit before it overflows
  • Summarise older conversation history into compact form
  • Drop the lowest-value retrieved chunks
  • For very long contexts, switch to a model with a larger context window (cost trade-off)
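The first three steps above could be combined into a single budgeting function along these lines. This is a rough sketch: `Chunk`, `estimate_tokens`, and `fit_context` are hypothetical names, the four-characters-per-token estimate is a crude assumption (a real system would use the provider's tokenizer), and `summarise` is a caller-supplied function, typically another model call.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Chunk:
    text: str
    score: float    # retrieval relevance; higher means more valuable


def estimate_tokens(text: str) -> int:
    # Crude assumption: roughly 4 characters per token.
    return max(1, len(text) // 4)


def fit_context(
    history: List[str],
    chunks: List[Chunk],
    budget: int,
    summarise: Callable[[List[str]], str],
    keep_recent: int = 2,
) -> Tuple[List[str], List[Chunk]]:
    """Shrink history and retrieved chunks until they fit the token budget."""

    def total(h: List[str], c: List[Chunk]) -> int:
        return sum(estimate_tokens(m) for m in h) + sum(
            estimate_tokens(x.text) for x in c
        )

    history = list(history)
    chunks = sorted(chunks, key=lambda c: c.score, reverse=True)

    # Step 1: summarise older turns, keeping the most recent ones verbatim.
    if total(history, chunks) > budget and len(history) > keep_recent:
        split = len(history) - keep_recent
        history = [summarise(history[:split])] + history[split:]

    # Step 2: drop the lowest-scoring chunks until the prompt fits.
    while chunks and total(history, chunks) > budget:
        chunks.pop()    # lowest score is last after the descending sort

    return history, chunks
```

If the result is still over budget after both steps, the caller is at the point where switching to a larger-context model becomes the remaining option.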

Retrieval returning no results:

  • The agent should know this can happen and ask for clarification rather than hallucinating
  • For some workflows, "no result" is an acceptable answer; for others it triggers escalation
  • Always log the empty-result case for retrieval debugging
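A small sketch of the empty-result handling described above, assuming a hypothetical per-workflow `on_empty` policy and using Python's standard `logging` module for the log entry:

```python
import logging
from typing import List

logger = logging.getLogger("retrieval")

# Assumed policy names; each workflow would declare which one applies.
EMPTY_POLICIES = {"clarify", "accept_empty", "escalate"}


def next_action(query: str, results: List[str], on_empty: str = "clarify") -> str:
    """Return 'answer' when there are results, otherwise the workflow's policy."""
    if on_empty not in EMPTY_POLICIES:
        raise ValueError(f"unknown on_empty policy: {on_empty}")
    if results:
        return "answer"
    # Always log the empty case so retrieval quality can be debugged later.
    logger.warning("empty retrieval result for query=%r", query)
    return on_empty
```

The point of returning an explicit action is that the agent never proceeds to generation with an empty context by accident.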

Pipeline orchestration errors: structured failure paths

Multi-agent pipelines need explicit failure paths:

Step failure: when a pipeline step fails, the pipeline should not silently emit garbage to the next step. Explicit failure states ("step X failed with reason Y") flow to a failure handler.

Failure handlers per pipeline: each pipeline declares what happens on failure. Some pipelines retry automatically. Some escalate to a human. Some compensate (undo upstream work) and exit.
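As an illustration of explicit failure states plus a pipeline-level failure policy, consider the sketch below. The names (`OnFailure`, `Step`, `StepResult`, `PipelineOutcome`, `run_pipeline`) are hypothetical; only the compensate path is executed here, while retry and escalate are returned as the next action for the orchestrator to carry out.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, List, Optional


class OnFailure(Enum):
    RETRY = "retry"
    ESCALATE = "escalate"
    COMPENSATE = "compensate"


@dataclass
class Step:
    name: str
    run: Callable[[Optional[str]], str]         # receives the previous output
    undo: Optional[Callable[[], None]] = None   # compensation, if any


@dataclass
class StepResult:
    step: str
    ok: bool
    output: Optional[str] = None
    reason: Optional[str] = None


@dataclass
class PipelineOutcome:
    results: List[StepResult]
    next_action: str    # "done", or the declared failure policy's value


def run_pipeline(steps: List[Step], on_failure: OnFailure) -> PipelineOutcome:
    results: List[StepResult] = []
    finished: List[Step] = []
    previous: Optional[str] = None
    for step in steps:
        try:
            previous = step.run(previous)
        except Exception as err:
            # Explicit failure state: "step X failed with reason Y".
            results.append(StepResult(step.name, ok=False, reason=str(err)))
            if on_failure is OnFailure.COMPENSATE:
                for done in reversed(finished):   # undo upstream work, newest first
                    if done.undo is not None:
                        done.undo()
            return PipelineOutcome(results, on_failure.value)  # nothing flows onward
        results.append(StepResult(step.name, ok=True, output=previous))
        finished.append(step)
    return PipelineOutcome(results, "done")
```

The failing step's output never reaches the next step, and the failure policy is a property of the pipeline rather than of any individual step.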

Idempotency for retry-able pipelines: when a pipeline run is retried, side effects from the partial first run need to be either reapplied safely or undone. Idempotency keys per step help.

Partial success handling: in parallel branches, if some succeed and some fail, the orchestration needs to decide whether the result is usable or needs the failed branches retried.

Approval timeout: human approval gates have timeouts. The timeout behaviour (escalate to another approver, abandon the pipeline, retry the approval request) is part of the pipeline definition.
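The approval-timeout behaviour can be declared as data and evaluated with a small pure function. The sketch below assumes one particular policy (escalate through an ordered list of approvers, then abandon); `ApprovalGate` and `approval_action` are hypothetical names, and times are plain seconds for simplicity.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ApprovalGate:
    approvers: List[str]        # ordered escalation chain
    timeout_seconds: float      # how long each approver gets; must be > 0
    requested_at: float         # timestamp of the original request


def approval_action(gate: ApprovalGate, now: float, decision: Optional[str]) -> str:
    """Decide what the pipeline should do at time `now`."""
    if decision is not None:
        return f"resolved:{decision}"
    elapsed = now - gate.requested_at
    # Each full timeout window moves the request to the next approver.
    index = int(elapsed // gate.timeout_seconds)
    if index < len(gate.approvers):
        return f"waiting:{gate.approvers[index]}"
    # Everyone in the chain timed out: abandon instead of hanging forever.
    return "abandon"
```

For example, with approvers `["alice", "bob"]` and a one-hour timeout, the request waits on alice for the first hour, on bob for the second, and is abandoned after that.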

Observability for errors

You cannot recover from what you cannot see:

  • Every failure recorded in the audit log with the error category and the recovery action taken
  • Aggregate dashboards: error rate per agent, per tool, per pipeline step, broken down by error type
  • Alerting on error rate spikes
  • Sampling of detailed failure traces for engineering investigation
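A minimal sketch of what a structured failure record and the aggregation behind such dashboards might look like. The field names, the category and recovery-action strings, and the 5% alert threshold are all assumptions for illustration.

```python
import json
from collections import Counter
from dataclasses import asdict, dataclass
from typing import Iterable, Tuple


@dataclass(frozen=True)
class FailureEvent:
    agent: str
    tool: str
    category: str           # e.g. "transient_model", "tool", "context"
    recovery_action: str    # e.g. "retried", "failed_over", "escalated"


def to_audit_line(event: FailureEvent) -> str:
    # One JSON object per line keeps the audit log machine-readable.
    return json.dumps(asdict(event), sort_keys=True)


def error_counts(events: Iterable[FailureEvent]) -> "Counter[Tuple[str, str]]":
    # Aggregate by (agent, category); the same idea extends to tool or step.
    return Counter((e.agent, e.category) for e in events)


def is_spiking(failures: int, total_calls: int, threshold: float = 0.05) -> bool:
    # Alert when the failure rate exceeds the threshold (assumed 5% here).
    return total_calls > 0 and failures / total_calls > threshold
```

Recording the recovery action alongside the category is what lets the same log serve both debugging and later review of which recovery paths are actually exercised.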

Without this, you discover that agents are failing only when users complain. With it, you see degradation before users notice.

Patterns that fail in production

Silent retry without limit: retry forever, never escalate. Looks fine until the upstream is permanently down and the agent loops indefinitely.

Generic exception handling: catch all exceptions, log a message, continue. Hides the actual failures and leaves no audit trail.

Treating warnings as success: model returns a partial response with a warning. Code treats it as success. Downstream consumes the partial result thinking it is complete.

No idempotency on side-effect tools: pipeline retry sends the same customer email three times because the email tool is not idempotent.

Approval gate with no timeout: pipeline stuck on an approval gate for a person who is on holiday. No escalation, no alert, no recovery.

Failure handling defined in code rather than as configuration: each agent has its own bespoke error handling. The platform-level patterns (retry, circuit-breaker, escalation) get reimplemented inconsistently.

What good error handling looks like

A production-ready agent has:

  • Documented error handling per tool
  • Documented failure modes per agent and the recovery for each
  • Pipeline failure paths declared at the pipeline level, not invented per step
  • Idempotency keys where side effects matter
  • Audit log entries for every failure with category and recovery action
  • Dashboards visible to the team that owns the agent
  • Alerting on error rate spikes
  • Regular review of error patterns to identify systematic improvements

Most of these are platform-level concerns rather than per-agent engineering. The platform earns its keep by providing them by default.

What AgentWorks supports

The platform handles model retry/backoff/circuit-breaker at the gateway. Tool integrations declare their error handling at integration time. Pipeline definitions include failure paths. Audit logs capture errors with the structure that supports both debugging and compliance evidence. Observability dashboards surface error patterns by agent, tool, and pipeline step.

The agents themselves can focus on their domain logic; the resilience patterns are handled at the platform layer. This is the difference between a one-off agent that works on the happy path and a production agent that runs reliably.

The honest summary: error handling is the difference between a demo and a production system. The patterns above are the ones that make agents reliable in front of real users with real edge cases.

About the author

Erwin Berkouwer · Founder, AgentWorks

Erwin Berkouwer is the founder of AgentWorks — an AI agent platform purpose-built for European teams that need EU AI Act-ready governance, multi-LLM choice across OpenAI, Anthropic, Google and Mistral, and transparent per-token € pricing.
