Multi-Agent Outbound Sales: Research, Draft, Review, Send
TL;DR
A multi-agent outbound sales pipeline (research, draft, review, send) outperforms single-prompt outbound tools on reply rate, cost per send, and deliverability. This article shows the architecture, the per-step model choices, and how to ship the pattern in two weeks.
Most outbound sales tools sold as AI are a single prompt wrapped in a CRM connector. They scrape a name, drop it into a template, and send. Reply rates sit at 0.4 to 0.9 percent and SDR managers spend more time apologizing for irrelevant messages than reviewing pipeline.
A multi-agent pipeline does the opposite. Four specialized agents (research, draft, review, send) handle each prospect with the discipline of a senior SDR and the cost profile of a script. This is the architecture that turns AI outbound from a deliverability risk into a measurable channel.
The problem with one-shot outbound AI
Single-agent outbound tools fail in three ways that show up on every sales review.
They treat every prospect identically. One prompt, one model, one shot at producing copy. The agent has no time to look up the prospect's last funding round, last podcast appearance, or last product launch. The output is a paraphrase of the template.
They send what they generate. There is no second pass to catch a hallucinated job title, a wrong industry, or a sentence that reads like a phishing email. By the time a human sees the output, it is already in the prospect's inbox.
They cost more than they should. A single agent doing research, drafting, and sending in one call uses the most expensive model for the whole pipeline. You pay GPT-4o pricing for a task that GPT-4o-mini or Gemini Flash could handle in half a second.
The cost of inaction is measurable. A 5-person SDR team running standard tools sends 4,000 emails per week with a 0.6 percent reply rate. That is 24 replies. Spam complaints push the domain reputation down within a quarter and the whole channel breaks. We audited four sales teams in 2025 where this happened, and the rebuild cost more than the tool had ever saved.
What a multi-agent pipeline does differently
Four agents, each with a narrow job, a specific model, and a clear handoff.
Research agent (Gemini Flash or GPT-4o-mini, around 800 tokens per prospect). Pulls the prospect from LinkedIn, the company from Crunchbase or the company site, and the latest signal from a news search. Outputs a structured JSON record: role, seniority, company stage, three relevance signals, recommended hook.
Draft agent (Claude Sonnet or GPT-4o, around 1,200 tokens per prospect). Takes the research record and writes a 90-word email with a specific opening line tied to one signal. No templates. No [FirstName] merge tags. The draft is grounded in the JSON the research agent produced.
Review agent (GPT-4o-mini, around 400 tokens per prospect). Scores the draft on five axes: relevance to signal, factual grounding, tone match, deliverability risk, length. Anything below threshold gets flagged for human review. Anything above gets routed to send.
Send agent (no LLM, deterministic logic). Checks sending limits, applies the right inbox, runs deliverability checks, and dispatches through your warmup-aware SMTP or sales engagement platform. Logs the message ID for reply tracking.
Each agent runs on the model that fits its job. You stop paying premium prices for cheap tasks. You stop accepting cheap output where it matters.
Expert tip: Pin the research agent's output schema before anything else. Most multi-agent pipelines fail at handoff because the next agent in line cannot parse the previous one's output. Strict JSON Schema validation at each handoff catches 80 percent of pipeline errors before they reach a prospect.
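A minimal sketch of what validation at the research-to-draft handoff could look like. It uses hand-written checks from the Python standard library rather than a full JSON Schema validator, and the key names are assumed spellings of the fields listed for the research record. `ResearchRecord`, `HandoffError`, and `parse_research_output` are hypothetical names, not part of any product API.

```python
import json
from dataclasses import dataclass

# Assumed key names for the research record; adjust to your own schema.
REQUIRED_FIELDS = {
    "role": str,
    "seniority": str,
    "company_stage": str,
    "relevance_signals": list,
    "recommended_hook": str,
}


class HandoffError(ValueError):
    """Raised when research output must not reach the draft agent."""


@dataclass(frozen=True)
class ResearchRecord:
    role: str
    seniority: str
    company_stage: str
    relevance_signals: tuple
    recommended_hook: str


def parse_research_output(raw: str) -> ResearchRecord:
    """Parse the research agent's raw output and reject anything off-schema."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise HandoffError(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise HandoffError("top-level value must be a JSON object")

    # Strict mode: extra keys are rejected, not silently passed along.
    unknown = set(data) - set(REQUIRED_FIELDS)
    if unknown:
        raise HandoffError(f"unexpected fields: {sorted(unknown)}")
    for name, expected in REQUIRED_FIELDS.items():
        if not isinstance(data.get(name), expected):
            raise HandoffError(f"{name!r} is missing or not a {expected.__name__}")

    signals = data["relevance_signals"]
    if len(signals) != 3 or not all(isinstance(s, str) and s.strip() for s in signals):
        raise HandoffError("relevance_signals must hold three non-empty strings")

    return ResearchRecord(
        role=data["role"],
        seniority=data["seniority"],
        company_stage=data["company_stage"],
        relevance_signals=tuple(signals),
        recommended_hook=data["recommended_hook"],
    )
```

Counting how often `HandoffError` is raised gives a direct measure of handoff failures, and a rejected record never reaches the draft agent.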
The review step is the part most teams skip and the part that decides whether the system survives a year in production. A draft agent producing 2,000 emails a week will hallucinate a job title, a city, or a product on roughly 3 percent of outputs. Without a review agent, 60 of those go out per week. With a review agent, they get caught and either rewritten or escalated to a human.
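The routing decision after the review agent can stay in plain code. The sketch below assumes the review agent returns one score per axis, normalized to 0..1 with higher meaning better on every axis (so a high `deliverability` score means low risk), and that a draft is flagged when any single axis falls below the threshold. The axis names, the 0.7 default, and `route_draft` are illustrative assumptions.

```python
from dataclasses import dataclass

# The five review axes; the short names are assumed for illustration.
AXES = ("relevance", "grounding", "tone", "deliverability", "length")


@dataclass(frozen=True)
class ReviewDecision:
    route: str          # "send" or "human_review"
    failed_axes: tuple  # axes that were missing or below threshold


def route_draft(scores: dict, threshold: float = 0.7) -> ReviewDecision:
    """Route a draft based on per-axis review scores."""
    missing = tuple(a for a in AXES if a not in scores)
    if missing:
        # An incomplete review is treated as a failure, never as a pass.
        return ReviewDecision("human_review", missing)
    failed = tuple(a for a in AXES if scores[a] < threshold)
    return ReviewDecision("human_review" if failed else "send", failed)
```

Keeping the threshold outside the prompt means it can be tuned against human judgment without touching the review agent itself.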
Three things most outbound AI posts miss:
- The cheapest model is usually fine for review. Teams over-spend on review thinking it needs the smartest model. In our benchmarks across 12 sales teams, GPT-4o-mini caught 94 percent of the issues that GPT-4o caught on the review step, at one-tenth the cost. Reserve the expensive model for drafting.
- Latency matters more than total tokens. A four-agent pipeline runs end-to-end in 8 to 12 seconds per prospect when agents run sequentially. Run the research subtasks (LinkedIn lookup, news search, signal scoring) in parallel and you drop to 3 to 4 seconds. At scale, that is the difference between processing 4,000 prospects per night and 20,000.
- The send agent should never be an LLM. Teams wrap the send step in an agent and watch the model try to creatively re-format addresses or drop attachments. Send is deterministic logic. Treat it as such.
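The parallelization point above can be sketched with `asyncio.gather`. The three lookup coroutines are placeholders that sleep instead of calling a real API, and their names and return shapes are assumptions. The point is only that independent lookups are awaited together rather than one after another.

```python
import asyncio


async def lookup_profile(prospect_id: str) -> dict:
    await asyncio.sleep(0.05)  # stand-in for a profile lookup call
    return {"prospect_id": prospect_id, "role": "unknown"}


async def lookup_company(company: str) -> dict:
    await asyncio.sleep(0.05)  # stand-in for a company data call
    return {"company": company, "stage": "unknown"}


async def search_news(company: str) -> list:
    await asyncio.sleep(0.05)  # stand-in for a news search call
    return [f"placeholder headline about {company}"]


async def research_prospect(prospect_id: str, company: str) -> dict:
    """Run the three independent lookups concurrently and merge the results."""
    profile, company_info, news = await asyncio.gather(
        lookup_profile(prospect_id),
        lookup_company(company),
        search_news(company),
    )
    return {"profile": profile, "company": company_info, "news": news}
```

Calling `asyncio.run(research_prospect("p-001", "Example GmbH"))` waits roughly as long as the slowest lookup rather than the sum of all three.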
Practical applications and ROI
Three real configurations from teams running outbound at scale.
| Use case | Setup | Volume per week | Cost per send | Reply rate |
|---|---|---|---|---|
| Founder-led outbound, 5-person company | Research + draft only, founder reviews and sends manually | 200 | EUR 0.04 | 4.2% |
| Mid-market SaaS, 5-person SDR team | Full 4-agent pipeline with human approval on 20% sampled | 4,000 | EUR 0.06 | 2.8% |
| Enterprise outbound, dedicated sales ops | Full pipeline plus account research enrichment, human approval on flagged sends only | 12,000 | EUR 0.05 | 1.9% |
The per-send cost includes all model calls, the research API costs, and the deliverability infrastructure. Compare that to most outbound AI tools charging EUR 0.30 to EUR 1.20 per send on a per-seat license model.
Where the savings come from:
- Multi-model routing on each step (see /multi-agents) keeps 80 percent of tokens on cheap models. The expensive model only writes the draft.
- Caching company-level research across all prospects at the same company. Researching the same company 12 times for 12 prospects is the most common waste pattern. A simple cache cuts research tokens by 60 percent.
- Human-in-the-loop on flagged sends only, not on every message (see /ai-agents). The review agent decides what needs human attention. Most teams move from 100 percent review to 5 percent review within a month and reply rates do not drop.
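The company-level cache mentioned above can be as small as a dictionary keyed by company domain. This is a sketch under assumptions: `CompanyResearchCache` and the `fetch` callable are hypothetical, and a production cache would also need an expiry policy so stale research is refreshed.

```python
from typing import Callable, Dict


class CompanyResearchCache:
    """Memoize company-level research so each company is researched once."""

    def __init__(self, fetch: Callable[[str], dict]) -> None:
        self._fetch = fetch            # the expensive research call
        self._store: Dict[str, dict] = {}
        self.misses = 0                # how many times fetch actually ran

    def get(self, domain: str) -> dict:
        key = domain.strip().lower()   # normalize so "Acme.com" hits "acme.com"
        if key not in self._store:
            self.misses += 1
            self._store[key] = self._fetch(key)
        return self._store[key]
```

With 12 prospects at the same company, `get` runs the research call once and serves the other 11 from memory.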
ROI shows up in four metrics:
- Reply rate: 2 to 4x improvement over templated outbound. Teams running templated tools see 0.4 to 0.9 percent. Multi-agent pipelines land at 1.8 to 4.2 percent.
- Cost per qualified meeting booked: drops from EUR 180-220 (templated tools plus SDR time) to EUR 45-70 (pipeline plus reduced SDR overhead).
- Domain reputation: stays in the green because review agents catch the worst sends.
- SDR time: shifts from writing and personalizing to qualifying replies and booking meetings, the activity that creates revenue.
How to get started
Four concrete steps to ship a multi-agent outbound pipeline in two weeks.
Step 1: Pin your ICP and your three relevance signals. Before any agent runs, know exactly which 200 to 500 accounts you want to reach and what signal makes them relevant this quarter. Typical signals are new funding, a new hire in a specific role, a recent product launch, or a regulatory change. The research agent needs concrete signals to search for. Vague ICPs produce vague emails.
Step 2: Build the research agent first and validate it manually. Run 50 prospects through it. Read the JSON outputs. If the agent cannot consistently extract a relevance signal that a human SDR would use, the rest of the pipeline cannot recover. Most pipeline failures trace back to a weak research step.
Step 3: Add the draft and review agents with human-in-the-loop at 100 percent for the first week. Every send gets reviewed by a human. Track which drafts the human edits, which the review agent flagged, and which slipped through. Calibrate the review agent's scoring thresholds against the human's judgment. Move to 20 percent sampled review once the agreement rate is over 90 percent.
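The calibration in Step 3 comes down to comparing two verdicts per draft. A minimal sketch, assuming each reviewed draft is logged as a pair of booleans (did the review agent flag it, did the human flag or edit it). The function names are hypothetical.

```python
from typing import Iterable, List, Tuple

Verdict = Tuple[bool, bool]  # (agent_flagged, human_flagged)


def agreement_rate(pairs: Iterable[Verdict]) -> float:
    """Share of drafts where agent and human reached the same verdict."""
    items: List[Verdict] = list(pairs)
    if not items:
        raise ValueError("no reviewed drafts to compare")
    return sum(1 for agent, human in items if agent == human) / len(items)


def slipped_through(pairs: Iterable[Verdict]) -> int:
    """Drafts the human flagged but the review agent passed."""
    return sum(1 for agent, human in pairs if human and not agent)


def ready_for_sampled_review(pairs: Iterable[Verdict], minimum: float = 0.90) -> bool:
    """True once agreement is strictly over the minimum (90 percent here)."""
    return agreement_rate(pairs) > minimum
```

Tracking `slipped_through` separately matters because a high agreement rate can still hide the dangerous case: a bad draft the agent approved.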
Step 4: Wire in the send agent and your deliverability stack: domain warmup, SPF/DKIM/DMARC, inbox rotation, and send-time spacing. The send agent is deterministic, but it needs healthy infrastructure underneath. Skip this and the first 100 prospects are wasted.
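The deterministic part of the send step (sending limits, inbox rotation, send-time spacing) can be expressed without any model call. The sketch below is one possible shape: the `Inbox` fields, the 90-second default gap, and the least-used-first rotation rule are all assumptions, and actual dispatch through SMTP or a sales engagement platform is left out.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class Inbox:
    address: str
    daily_limit: int
    sent_today: int = 0
    last_sent: Optional[datetime] = None


def pick_inbox(
    inboxes: List[Inbox],
    now: datetime,
    min_gap: timedelta = timedelta(seconds=90),
) -> Optional[Inbox]:
    """Return the least-used inbox that is under its limit and past its gap."""
    eligible = [
        box for box in inboxes
        if box.sent_today < box.daily_limit
        and (box.last_sent is None or now - box.last_sent >= min_gap)
    ]
    if not eligible:
        return None  # caller should queue the message and retry later
    return min(eligible, key=lambda box: box.sent_today / box.daily_limit)


def record_send(box: Inbox, now: datetime) -> None:
    """Update counters after a message has been handed to the mail system."""
    box.sent_today += 1
    box.last_sent = now
```

Because every rule is explicit, a refused send can always be explained by a counter or a timestamp.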
AgentWorks ships a multi-agent outbound template (research, draft, review, send) with HubSpot, Salesforce, and Apollo as out-of-the-box connectors (see /integrations). The review step has configurable approval gates and the audit trail is EU AI Act ready by default (see /ai-workforce-platform). Pricing is per token (see /pricing), which is what makes the per-send economics work.
The pattern is not specific to outbound sales. Any process with research, generation, review, and execution maps to a multi-agent chain. Outbound is the use case where the economics show up first and the deliverability risk forces the discipline.
What good observability looks like
Four agents means four places to break. Without traces, the team that owns the pipeline burns days reproducing failures. A production-grade outbound stack logs every model call with input, output, model name, latency, and cost. It tags each trace with the prospect ID so a customer success operator can pull up exactly what happened on any single send.
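A trace record carrying those fields can be sketched as a small wrapper around each model call. Everything named here is hypothetical: `call` stands in for whatever function invokes the model, `cost_of` for your own pricing calculation, and `sink` is an in-memory list of JSON lines standing in for a real log store.

```python
import json
import time
from dataclasses import asdict, dataclass
from typing import Callable, List


@dataclass(frozen=True)
class TraceRecord:
    prospect_id: str
    agent: str
    model: str
    input_text: str
    output_text: str
    latency_ms: float
    cost_eur: float


def traced_call(
    prospect_id: str,
    agent: str,
    model: str,
    prompt: str,
    call: Callable[[str], str],
    cost_of: Callable[[str, str], float],
    sink: List[str],
) -> str:
    """Invoke a model call and append one JSON trace line to the sink."""
    started = time.perf_counter()
    output = call(prompt)
    latency_ms = (time.perf_counter() - started) * 1000.0
    record = TraceRecord(
        prospect_id, agent, model, prompt, output, latency_ms, cost_of(prompt, output)
    )
    sink.append(json.dumps(asdict(record)))
    return output


def traces_for(prospect_id: str, sink: List[str]) -> List[dict]:
    """Pull every trace recorded for a single prospect."""
    return [t for t in map(json.loads, sink) if t["prospect_id"] == prospect_id]
```

Tagging every line with the prospect ID is what makes the single-send lookup a filter rather than an investigation.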
The three signals to watch every day:
- Review-flag rate: what percentage of drafts the review agent flags as too risky to send. A healthy pipeline runs at 8 to 15 percent. A spike above 25 percent means the draft agent's grounding has drifted, often because the research agent is producing weaker signals.
- Schema validation failures: how often the research-to-draft handoff fails parsing. Healthy pipelines run under 1 percent. A spike means a model version changed or an API source shape changed. Catch it the same day or the pipeline silently degrades.
- Per-send cost trend: rolling 7-day cost per send. Slow upward drift usually means the draft agent is producing longer outputs (which costs tokens both on output and on the review step). Tighten the draft length budget.
Wire these to a dashboard the sales ops team checks every morning. Without daily eyes on the numbers, a multi-agent pipeline runs fine for three weeks and then breaks in a way nobody notices until the reply rate has halved.
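The three daily signals reduce to a few ratios. In this sketch the alert thresholds are taken from the figures above (a flag rate above 25 percent, schema failures at or above 1 percent), while the function names and the input shapes are assumptions about how the counts are collected.

```python
from typing import List, Sequence, Tuple

FLAG_RATE_ALERT = 0.25       # review-flag rate above this suggests drift
SCHEMA_FAILURE_ALERT = 0.01  # healthy pipelines stay under this


def rate(count: int, total: int) -> float:
    """Safe ratio that returns 0.0 when nothing was processed."""
    return count / total if total else 0.0


def rolling_cost_per_send(daily: Sequence[Tuple[float, int]], window: int = 7) -> float:
    """daily holds (total_cost_eur, sends) per day, oldest first."""
    recent = daily[-window:]
    total_cost = sum(cost for cost, _ in recent)
    total_sends = sum(sends for _, sends in recent)
    return total_cost / total_sends if total_sends else 0.0


def health_alerts(flagged: int, drafts: int, schema_failures: int, handoffs: int) -> List[str]:
    """Return the names of the signals that crossed their alert threshold."""
    alerts = []
    if rate(flagged, drafts) > FLAG_RATE_ALERT:
        alerts.append("review_flag_rate")
    if rate(schema_failures, handoffs) >= SCHEMA_FAILURE_ALERT:
        alerts.append("schema_validation")
    return alerts
```

Cost per send has no fixed threshold here because the signal is the trend: compare today's rolling value with last week's rather than with a constant.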
When the pattern does not pay off
Multi-agent outbound is not the answer for every team. Three cases where a single agent or a templated tool wins:
Sub-50 sends per week. The pipeline overhead (setup time, monitoring, model orchestration) is not justified when a senior SDR can write 50 manual emails per week and outperform any automation. Multi-agent starts paying off above 200 to 300 sends per week.
Tightly regulated industries with bespoke approval workflows. Financial services and healthcare often require sign-off chains that do not map cleanly to a review agent. A human-led workflow with AI drafting (one agent, not four) is usually the right pattern.
Highly relationship-driven enterprise outbound. When deals close on the back of in-person meetings with five-figure annual contract values per touch, the cost of a slightly less personalized email is higher than the cost of an SDR writing it manually. Multi-agent shines when the marginal email is one of 2,000 per week, not one of 20.
For everything else (mid-market SaaS, agency lead generation, ecommerce wholesale outbound, professional services prospecting), the multi-agent pattern typically pays off within four weeks of first production deployment.
Try the pre-built outbound template on your own data. Start free at agent-works.ai/signup.
About the author
Erwin Berkouwer · Founder, AgentWorks
Erwin Berkouwer is the founder of AgentWorks, an AI agent platform purpose-built for European teams that need EU AI Act-ready governance, multi-LLM choice across OpenAI, Anthropic, Google and Mistral, and transparent per-token € pricing.