What is time to first token and why does it matter more than total latency?

Time to first token (TTFT) is the delay between sending a request and the first visible output reaching the user. It matters more than total completion time because users judge responsiveness from the first few hundred milliseconds, not the full duration. A response that starts streaming at 400ms and finishes at 6 seconds feels faster than one that stays silent for 3 seconds before showing anything.

Does prompt caching always reduce latency?

No. Prompt caching only helps when the cached prefix is genuinely stable across calls. Injecting a timestamp or reordering tool definitions at the start of a prompt invalidates the cache every time, adding overhead without the benefit. Wrapping an entire long-context payload in a cache block when only a small part changes each turn can also add latency instead of removing it.

Should every agent request go to the largest available model?

No. Routing every request to the biggest model is the most common latency and cost mistake in agent design. A cheapest-capable-model router handles simple tasks like classification or field extraction with a small fast model and only escalates to a larger model when the smaller one fails validation or produces low-confidence output.

How much does parallelizing tool calls actually save?

When tool calls are independent (an inventory check and a customer lookup, for example), running them concurrently collapses total wait time to the slowest single call instead of the sum of all of them. This reduces P90 latency without any quality tradeoff, as long as the agent's dependency logic correctly separates independent calls from calls that must run in sequence.

Reducing LLM Latency for AI Agents

A support agent that thinks for four seconds before it says anything reads as broken, even if the final answer is correct. Users judge speed within the first few hundred milliseconds of a response, long before the task is done. If you are shipping an AI agent that a human waits on, latency is not a performance metric you tune later. It is part of the product.

This article covers the five levers that actually move the needle: streaming, prompt-prefix caching, model routing, parallel tool calls, and perceived-latency design. Most latency guides stop at "turn on streaming." That gets you maybe a third of the way there.

Why raw response time is the wrong metric

Total completion time hides the part users actually feel: time to first token (TTFT), the gap between sending a request and the first visible output. A response that finishes in 6 seconds but starts streaming at 400ms feels instant. A response that finishes in 3 seconds but sits silent until the end feels slow, even though it's faster overall.

Prefill (processing the input) and decode (generating the output) are different bottlenecks. Prefill time scales with prompt length: a system prompt that grows from 4K to 12K tokens can roughly triple prefill time on a cache miss, because every token in the context has to pass through the model before the first output token appears. If your agent's TTFT drifts upward over a few sprints, check your system prompt size before you touch anything else. Tool definitions, few-shot examples, and accumulated instructions bloat quietly.

Prompt-prefix caching: design for cache hits, not just cache existence

Anthropic, OpenAI, and Google all offer prompt caching that skips reprocessing the parts of a prompt that haven't changed. Used correctly, it cuts TTFT substantially on multi-turn conversations, because the system prompt, tool schema, and early conversation history stay cached across turns instead of being reprocessed on every call.

The catch is that caching only pays off when your prompt has a stable prefix. If you inject a timestamp, a random request ID, or reordered tool definitions at the start of the prompt, you invalidate the cache on every call and get none of the benefit while still paying the cache-write overhead. The fix is structural: put everything static (system instructions, tool schemas, few-shot examples) first, and put anything that changes per request (user message, retrieved context, dynamic variables) last. Order the tool list once and keep it stable across calls rather than re-sorting it by relevance each turn.
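As a minimal sketch of that ordering, the snippet below splits a prompt into a static prefix and a dynamic suffix and fingerprints the prefix so you can log whether it stayed identical between calls. The function names (`build_prompt`, `prefix_fingerprint`) and the tool dictionary shape are illustrative assumptions, not any provider's API, and the SHA-256 hash is only a stand-in for whatever prefix matching your provider actually performs.

```python
import hashlib
import json
from typing import Any, Dict, List, Tuple


def build_prompt(
    system: str,
    tools: List[Dict[str, Any]],
    user_message: str,
    request_meta: Dict[str, Any],
) -> Tuple[str, str]:
    """Return (static_prefix, dynamic_suffix) for one request.

    Hypothetical helper: the prefix holds only content that is stable
    across calls; per-request values go in the suffix.
    """
    # Sort tools by name so the serialized schema is byte-identical
    # no matter what order the caller passed them in.
    ordered_tools = sorted(tools, key=lambda tool: tool["name"])
    prefix = system + "\n" + json.dumps(ordered_tools, sort_keys=True)
    # Timestamps, request IDs and the user message change every call,
    # so they are kept out of the prefix entirely.
    suffix = json.dumps(request_meta, sort_keys=True) + "\n" + user_message
    return prefix, suffix


def prefix_fingerprint(prefix: str) -> str:
    """Hash of the static prefix; log it to spot accidental prefix churn."""
    return hashlib.sha256(prefix.encode("utf-8")).hexdigest()
```

If the fingerprint changes between two turns of the same session, something dynamic has leaked into the prefix and is worth tracking down before looking at cache hit rates.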

It's also possible to over-cache. Wrapping an entire long-context RAG payload in a single cache block when the last few hundred tokens change between calls can add latency instead of removing it: the cached content no longer matches from one call to the next, so each request pays to write a cache entry that is never read back. Cache the parts that are genuinely stable across a session; leave dynamic retrieval results outside the cached prefix.

Expert tip: instrument TTFT and total-completion-time separately in your logs. A regression in one and not the other tells you immediately whether the problem is prefill (prompt size, cache misses) or decode (output length, model choice).
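One way to capture both numbers is to wrap the streamed response in a small timing generator. This is a sketch under assumptions: `timed_stream` and `StreamTiming` are hypothetical names, `chunks` stands for whatever iterable of text chunks your client library returns, and the injectable `clock` exists only to make the helper easy to test.

```python
import time
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator, Optional


@dataclass
class StreamTiming:
    """Timings for one streamed response, in seconds since the request."""
    ttft: Optional[float] = None   # time to first token
    total: Optional[float] = None  # time until the stream ended


def timed_stream(
    chunks: Iterable[str],
    timing: StreamTiming,
    clock: Callable[[], float] = time.perf_counter,
) -> Iterator[str]:
    """Pass chunks through unchanged while recording TTFT and total time.

    The start time is taken when this function is called, so call it
    right where the request is sent.
    """
    start = clock()

    def _generate() -> Iterator[str]:
        for chunk in chunks:
            if timing.ttft is None:
                # First chunk arrived: this is the prefill-side number.
                timing.ttft = clock() - start
            yield chunk
        # Stream exhausted: this includes decode time for the full output.
        timing.total = clock() - start

    return _generate()
```

Logging `timing.ttft` and `timing.total` as separate fields keeps the prefill and decode signals apart when you chart them.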

Model routing: stop defaulting to the biggest model

The most common latency mistake in agent design is routing every request, including simple ones, to the largest available model because it's the safest default during development. In production, that's the slowest and most expensive path for a large share of traffic.

A cheapest-capable-model router evaluates task complexity first and only escalates to a larger model when the smaller one is demonstrably insufficient — low confidence, a failed validation check, or an explicit request for clarification. Classifying an intent, extracting a field, or routing a support ticket rarely needs a frontier model; drafting a nuanced response or reasoning over ambiguous instructions often does. Splitting these into fast and slow paths, rather than running everything through one model tier, is what actually moves median latency, because most agent turns are simple.
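The escalation half of that idea can be sketched in a few lines. Everything here is an assumption for illustration: `Tier`, `route`, and the `is_valid` callback are hypothetical names, each tier's `call` stands in for a real model client, and the upfront complexity check described above is left out so the sketch only shows escalate-on-failed-validation.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Tier:
    """One model tier. `call` is a placeholder for a real model client."""
    name: str
    call: Callable[[str], str]


def route(
    prompt: str,
    tiers: List[Tier],
    is_valid: Callable[[str], bool],
) -> Tuple[str, str]:
    """Try tiers from cheapest to most capable.

    Returns (tier_name, output) for the first output that passes
    validation. If no tier passes, the last tier's output is returned
    so the caller still gets the most capable model's answer.
    """
    if not tiers:
        raise ValueError("at least one tier is required")
    output = ""
    for tier in tiers:
        output = tier.call(prompt)
        if is_valid(output):
            return tier.name, output
    return tiers[-1].name, output
```

In practice `is_valid` might be a schema check on extracted fields or a membership test against the allowed intent labels; the cheaper and stricter that check is, the more traffic stays on the fast path.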

An alternative for latency-critical, cost-tolerant paths is speculative parallel execution: fire the cheap and expensive model at once and stream from whichever responds first, falling back to the larger model's output if the smaller one fails validation. This trades extra compute cost for a latency floor set by your fastest model rather than your most capable one — worth it for the first message in a new chat, less so for a background batch job.
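A simplified, non-streaming sketch of that race using `asyncio` follows. It works on whole responses rather than token streams, and `race_models`, `small`, `large`, and `is_valid` are hypothetical names standing in for your own model clients and validation check.

```python
import asyncio
from typing import Awaitable, Callable

ModelFn = Callable[[str], Awaitable[str]]


async def race_models(
    prompt: str,
    small: ModelFn,
    large: ModelFn,
    is_valid: Callable[[str], bool],
) -> str:
    """Start both models at once and prefer the first usable answer.

    If the small model finishes first with a valid output, use it.
    Otherwise wait for (or use) the large model's output.
    """
    small_task = asyncio.create_task(small(prompt))
    large_task = asyncio.create_task(large(prompt))
    try:
        done, _pending = await asyncio.wait(
            {small_task, large_task},
            return_when=asyncio.FIRST_COMPLETED,
        )
        if small_task in done and small_task.exception() is None:
            result = small_task.result()
            if is_valid(result):
                return result
        # Small model lost the race, errored, or failed validation.
        return await large_task
    finally:
        # Stop whichever call is still running so it does not keep
        # consuming compute after the answer has been chosen.
        for task in (small_task, large_task):
            if not task.done():
                task.cancel()
```

Note that cancelling the client-side task does not necessarily stop billing on the provider side, which is part of the extra cost this pattern accepts.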

Parallel tool calls: the most underused lever

When an agent needs to check inventory, look up a customer record, and pull recent order history, most implementations still call each tool one after another, waiting for each result before starting the next. If the three calls are independent, that's three round trips serialized for no reason.

Modern tool-calling APIs support returning multiple tool calls in a single model turn, and running independent tool calls concurrently instead of sequentially reduces P90 latency without a quality cost, because the total wait time collapses to the slowest single call instead of the sum of all of them. The design work is in dependency analysis: a scheduler needs to know which tool calls can run in parallel (independent lookups) versus which have a strict order (a lookup that feeds the input of a later action). Get that dependency graph wrong and you either serialize unnecessarily or race a call that needed to wait.
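The dependency handling can be sketched as a small wave-based scheduler: every call whose dependencies are already satisfied runs concurrently, then the next wave starts. The names `run_tools`, `tools`, and `deps` are illustrative assumptions, and a production scheduler would likely add timeouts and per-call error handling.

```python
import asyncio
from typing import Any, Awaitable, Callable, Dict, List

# A tool receives the results of the calls it depends on, keyed by name.
ToolFn = Callable[[Dict[str, Any]], Awaitable[Any]]


async def run_tools(
    tools: Dict[str, ToolFn],
    deps: Dict[str, List[str]],
) -> Dict[str, Any]:
    """Run all tools, in concurrent waves that respect `deps`.

    `deps` maps a tool name to the names that must finish before it
    starts. Tools missing from `deps` are treated as independent.
    """
    results: Dict[str, Any] = {}
    pending = set(tools)
    while pending:
        # Every pending call whose dependencies have results is ready.
        ready = sorted(
            name for name in pending
            if all(dep in results for dep in deps.get(name, []))
        )
        if not ready:
            raise ValueError("dependency cycle or unknown dependency")
        # Independent calls in this wave run concurrently, so the wave
        # takes as long as its slowest call, not the sum of all calls.
        outputs = await asyncio.gather(*(
            tools[name]({dep: results[dep] for dep in deps.get(name, [])})
            for name in ready
        ))
        results.update(zip(ready, outputs))
        pending.difference_update(ready)
    return results
```

With an inventory check and a customer lookup declared independent and a refund action depending on the customer lookup, the first two run together and the refund starts only once the customer record is available.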

Streaming partial tool-call arguments as they're generated, rather than waiting for the full JSON payload to close, shaves further time off multi-tool turns — the client can start resolving a tool call's arguments before the model has finished writing them out.

Perceived latency: the trick most teams skip

TTFT is the number in your dashboard; perceived latency is what the user actually experiences, and the two are not the same thing. An interface that shows a typing indicator, a partial tool-call card ("Checking your order..."), or the first streamed sentence within 200ms feels responsive even if the full answer takes several more seconds to complete. An interface that shows nothing until the entire response is ready feels slow even at an objectively fast TTFT.

Concretely: render tool calls as they start, not after they resolve. Stream text token by token instead of buffering full paragraphs. Show a state change the moment the agent begins working, not the moment it finishes. These are frontend decisions, not model decisions, and they often close more of the perceived gap than any backend optimization.

Putting it together

None of these levers work in isolation. A well-cached prompt with a slow model still feels sluggish. A fast model with a serialized tool chain still stalls on the second lookup. A parallelized, well-routed backend with no streaming UI still feels like it's thinking in silence. Latency work on agents is a full-stack problem: prompt structure, model selection, tool orchestration, and interface feedback all have to move together.

AgentWorks routes each task to the cheapest model that can handle it — including Claude, GPT-5, Gemini, and Mistral, with an AUTO mode that picks automatically — streams every response and tool call as it happens, and runs independent tool calls in parallel rather than one at a time. See how the routing and execution model works on the AgentWorks agent platform page.

