Glossary

What is Retrieval-Augmented Generation (RAG)?

Last updated: 2026-05-05

Definition

Retrieval-Augmented Generation (RAG) is a technique that grounds a large language model in a specific corpus of documents at query time. Instead of relying only on what the model learned during training, a RAG system retrieves relevant passages from your data and adds them to the prompt, so the model can answer from knowledge that is current and proprietary.

Why Retrieval-Augmented Generation matters

RAG was introduced by Lewis et al. in 2020 (Facebook AI Research) and has since become the dominant pattern for grounding generative AI in enterprise data. According to a 2025 Stanford study, RAG-augmented LLMs reduce factual hallucination on internal-data questions by 40-60% compared to base models, while keeping inference cost roughly equivalent.

How Retrieval-Augmented Generation works

  1. Ingest your documents (PDFs, web pages, internal wikis) and split them into chunks, typically 200-1000 tokens each.
  2. Convert each chunk into a numerical embedding (a vector) using an embedding model.
  3. Store the embeddings in a vector database (for example Pinecone, Qdrant, or Postgres with pgvector).
  4. When a user asks a question, embed the question and search the vector database for the most similar chunks (top-k retrieval).
  5. Construct a prompt that includes the retrieved chunks as context, then send it to the LLM.
  6. The LLM answers using the provided context, ideally citing which chunk it drew from.
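
Steps 2 to 5 above can be sketched in a few lines of standard-library Python. This is a minimal illustration under stated assumptions, not a production design: `embed` is a toy term-count stand-in for a real embedding model, `InMemoryIndex` is a hypothetical stand-in for a vector database, the chunks are assumed to be already split, and the final LLM call is left out.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a sparse term-count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class InMemoryIndex:
    """Hypothetical stand-in for a vector database."""

    def __init__(self) -> None:
        self.items = []  # list of (chunk_id, text, vector)

    def add(self, chunk_id: str, text: str) -> None:
        self.items.append((chunk_id, text, embed(text)))

    def search(self, query: str, k: int = 2) -> list:
        """Return the k chunks most similar to the query (top-k retrieval)."""
        q = embed(query)
        ranked = sorted(self.items, key=lambda it: cosine(q, it[2]), reverse=True)
        return [(chunk_id, text) for chunk_id, text, _ in ranked[:k]]


def build_prompt(question: str, hits: list) -> str:
    """Put the retrieved chunks, labeled with their ids, ahead of the question."""
    context = "\n".join(f"[{chunk_id}] {text}" for chunk_id, text in hits)
    return (
        "Answer using only the context below and cite the chunk id you used.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )


# Illustrative chunks; the ids and texts are made up for this example.
index = InMemoryIndex()
index.add("doc1-0", "Refunds are issued within 14 days of a return request.")
index.add("doc2-0", "Our office is closed on public holidays.")
index.add("doc3-0", "Password resets are sent by email.")

question = "How long do refunds take?"
hits = index.search(question, k=2)
prompt = build_prompt(question, hits)  # this string would be sent to the LLM
```

In a real system the term-count vectors would be replaced by dense embeddings from a model, and the sorted scan by an approximate nearest-neighbor search in the vector database. The overall shape (embed, search, assemble prompt) stays the same.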

Examples

  • A support agent grounded on your help-center articles, so it can answer customer questions accurately.
  • A legal-research agent grounded on the EU AI Act regulation text, so it can cite specific articles when explaining obligations.
  • A sales-research agent grounded on your CRM notes, so it can summarize the last six months of customer interactions.

FAQ

Retrieval-Augmented Generation — common questions

Does RAG eliminate AI hallucinations?
No, but it reduces them substantially on questions about your own data. The model can still hallucinate if the retrieved chunks do not contain the answer or if the model misreads them. Pair RAG with citation requirements ("cite the chunk you used") and with evaluation loops to catch failures.
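A citation requirement is easier to enforce when it is checked mechanically. The sketch below assumes the prompt asked the model to cite chunks as bracketed ids such as [doc1-0]; that format and the name `check_citations` are illustrative choices, not a standard.

```python
import re

# Assumed citation format: a chunk id in square brackets, e.g. [doc1-0].
CITATION = re.compile(r"\[([A-Za-z0-9_.-]+)\]")


def check_citations(answer: str, retrieved_ids: set) -> dict:
    """Compare the ids cited in an answer with the ids that were retrieved."""
    cited = set(CITATION.findall(answer))
    return {
        "cited": sorted(cited & retrieved_ids),    # citations that match a retrieved chunk
        "unknown": sorted(cited - retrieved_ids),  # citations to chunks never retrieved
        "has_valid_citation": bool(cited & retrieved_ids),
    }


retrieved = {"doc1-0", "doc2-0"}
good = check_citations("Refunds take up to 14 days [doc1-0].", retrieved)
bad = check_citations("Refunds take 30 days [doc9-9].", retrieved)
```

A check like this only confirms that a citation points at a retrieved chunk. It does not verify that the chunk supports the claim, so it complements an evaluation loop rather than replacing one.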
How is RAG different from fine-tuning?
Fine-tuning bakes new knowledge into the model weights, which is slow, expensive, and hard to update. RAG keeps knowledge external in a vector database, which is fast to update and easy to audit, and the model can cite which document it used. RAG is the default for most enterprise grounding use cases.
What is the right chunk size for RAG?
There is no universal answer. Chunks of 200-500 tokens work for short Q&A; chunks of 800-1500 tokens work for longer contextual answers. Use overlap (typically 10-20% of the chunk size) so important context is not lost at chunk boundaries. Test against your own retrieval-quality benchmarks.
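Overlapping chunks can be produced with a sliding window. The sketch below is a minimal version that assumes the text has already been tokenized into a list; the function name `chunk_tokens` and its defaults are illustrative, and a real pipeline would use the tokenizer of its embedding model rather than arbitrary tokens.

```python
def chunk_tokens(tokens: list, chunk_size: int = 300, overlap_ratio: float = 0.15) -> list:
    """Split a token list into windows of chunk_size that share overlap_ratio of their tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap_ratio < 1:
        raise ValueError("overlap_ratio must be in [0, 1)")

    overlap = int(chunk_size * overlap_ratio)
    step = chunk_size - overlap  # always >= 1 because overlap < chunk_size

    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reached the end of the text
    return chunks


# Ten tokens, windows of four, one token (25%) shared between neighbors.
tokens = [str(i) for i in range(10)]
chunks = chunk_tokens(tokens, chunk_size=4, overlap_ratio=0.25)
# chunks == [['0', '1', '2', '3'], ['3', '4', '5', '6'], ['6', '7', '8', '9']]
```

Each boundary token appears in two neighboring chunks, so a sentence that straddles a boundary is still retrievable from at least one of them.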
Is RAG GDPR compliant?
RAG is a technique, not a compliance regime. GDPR compliance depends on what data you index. AgentWorks applies PII redaction at the gateway layer before any chunk reaches a third-party LLM, supports EU data residency on all knowledge bases, and logs every retrieval for audit.
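As a generic illustration of redacting PII before a chunk leaves your infrastructure, the sketch below masks email addresses and phone-like numbers with regular expressions. It is not AgentWorks' implementation: the patterns, the placeholder labels, and the name `redact` are assumptions, and regex-only redaction both misses PII and over-matches (the phone pattern can also hit dates and other long digit runs).

```python
import re

# Crude illustrative patterns; real deployments typically combine rules with NER models.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "PHONE": re.compile(r"\+?\d[\d\s-]{7,}\d"),
}


def redact(text: str) -> str:
    """Replace each matched span with a bracketed placeholder such as [EMAIL]."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text


chunk = "Contact jane.doe@example.com or +49 30 1234567."
safe_chunk = redact(chunk)
# safe_chunk == "Contact [EMAIL] or [PHONE]."
```

Running redaction on chunks before prompt construction means the placeholders, not the original values, are what a third-party LLM receives.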