Chunking Strategies for Enterprise RAG: Where Quality Lives and Dies
TL;DR
Chunking patterns for enterprise RAG by document type: structure-aware for markup, semantic for prose, hierarchical for long structured documents, table and code preservation, and per-type rules, plus the evaluation that catches bad chunking.
Retrieval-Augmented Generation is "just" three steps: chunk documents, embed and index, retrieve and generate. The chunk step looks like the simplest. It is not. Chunking decisions made on day one set a ceiling on RAG quality that no embedding upgrade or amount of prompt engineering can fully lift. Most RAG quality complaints we see at AgentWorks trace back to chunking that was wrong for the document type.
This is the practical guide: what chunking does, where it goes wrong, and the patterns that work for the document types enterprises actually run.
What chunking actually decides
The chunking step decides:
- The unit of retrieval (what gets returned when a query matches)
- The context window the model receives (one chunk, several chunks, or many chunks)
- The granularity of citations (you can only cite what you indexed as a chunk)
- The coupling between related pieces of information (split too aggressively, related info gets separated)
Every downstream stage depends on these decisions. Chunking is not pre-processing; it is information architecture.
The patterns that produce bad chunks
Fixed-size character chunks: split documents into 500- or 1000-character chunks regardless of content structure. Easy to implement, broken for any document with headings or sections. A chunk that ends mid-sentence loses meaning. A heading separated from its content makes both useless.
Naive sentence splitting: split on sentence boundaries. Better than character-based for prose but terrible for technical content (code blocks split mid-block, table rows orphaned).
No overlap: each chunk is independent with no overlap to neighbouring chunks. Queries that match content at chunk boundaries miss the context on either side.
Single chunking strategy for all document types: applying the same chunking rules to email, code, PDFs with tables, and structured policy documents. Each needs different handling.
The patterns that work
Structure-aware chunking
For documents with markup (Markdown, HTML, structured PDF), chunk along the document's natural structural boundaries:
- Markdown documents: chunk at heading boundaries, with parent headings prepended to each chunk for context. A chunk for "Refund Policy → International Orders" includes both headings so the model knows the scope.
- HTML content: chunk at section, article, or h2/h3 boundaries. Strip navigation and footer content before chunking.
- PDFs with structure: extract the document outline (table of contents, heading hierarchy) and chunk at appropriate levels. For technical PDFs, that is often the h2 or h3 level.
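For the Markdown case, heading-based chunking with parent headings prepended can be sketched as below. This is a minimal illustration assuming ATX-style `#` headings; `chunk_markdown` and `Chunk` are hypothetical names, not an AgentWorks API. It skips `#` lines inside fenced code blocks so that code comments are not mistaken for headings.

```python
import re
from dataclasses import dataclass

HEADING_RE = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")
FENCE = "`" * 3  # built this way so the example does not contain a literal fence


@dataclass
class Chunk:
    breadcrumb: list[str]  # parent headings, outermost first
    body: str

    def text(self) -> str:
        # Prepend the heading path so the chunk carries its own scope.
        if not self.breadcrumb:
            return self.body
        return " → ".join(self.breadcrumb) + "\n\n" + self.body


def chunk_markdown(doc: str) -> list[Chunk]:
    """Split Markdown at heading boundaries, keeping the full heading path per chunk."""
    chunks: list[Chunk] = []
    path: list[tuple[int, str]] = []  # stack of (level, title)
    lines: list[str] = []
    in_fence = False

    def flush() -> None:
        body = "\n".join(lines).strip()
        if body:
            chunks.append(Chunk([title for _, title in path], body))
        lines.clear()

    for line in doc.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence  # '#' lines inside code blocks are not headings
        match = None if in_fence else HEADING_RE.match(line)
        if match:
            flush()
            level = len(match.group(1))
            # Close any open sections at the same or a deeper level.
            while path and path[-1][0] >= level:
                path.pop()
            path.append((level, match.group(2)))
        else:
            lines.append(line)
    flush()
    return chunks
```

For a section "International Orders" nested under "Refund Policy", `chunk.text()` starts with `Refund Policy → International Orders`, which is what lets the model see the scope of an otherwise ambiguous paragraph.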
Semantic chunking for prose
For prose without clear structural markers (research papers, customer correspondence, long-form articles), use semantic chunking:
- Split into semantic units using embedding-distance-based boundaries
- A new chunk starts when the topic shifts significantly
- Typical chunk size: 300-800 tokens, with topic-driven boundaries rather than fixed length
Implementation: compute embeddings of sentences or short sliding windows, detect topic shifts via cosine distance, chunk at shift points.
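A minimal sketch of that approach, comparing each sentence with the one before it. `toy_embed` is a hypothetical bag-of-words stand-in so the example runs without a model; in practice you would pass your embedding model's function as `embed`. The 0.8 threshold is an illustrative assumption that needs tuning against real embeddings, whose distances fall in a different range, and the sketch does not enforce the 300-800 token range (a production version would also merge tiny chunks and cap large ones).

```python
import math
import re
from collections import Counter
from typing import Callable

Vector = dict[str, float]


def toy_embed(sentence: str) -> Vector:
    # Stand-in for a real embedding model: a bag-of-words count vector.
    counts = Counter(re.findall(r"[a-z]+", sentence.lower()))
    return {word: float(n) for word, n in counts.items()}


def cosine_distance(a: Vector, b: Vector) -> float:
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # treat empty vectors as maximally distant
    return 1.0 - dot / (norm_a * norm_b)


def semantic_chunks(
    text: str,
    embed: Callable[[str], Vector] = toy_embed,
    threshold: float = 0.8,
) -> list[str]:
    """Start a new chunk wherever adjacent sentences are far apart in embedding space."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    if not sentences:
        return []
    vectors = [embed(s) for s in sentences]
    chunks, current = [], [sentences[0]]
    for prev, vec, sentence in zip(vectors, vectors[1:], sentences[1:]):
        if cosine_distance(prev, vec) > threshold:  # topic shift detected
            chunks.append(" ".join(current))
            current = []
        current.append(sentence)
    chunks.append(" ".join(current))
    return chunks
```

On a paragraph that moves from refund timing to office opening hours, the two refund sentences share vocabulary and stay together, while the jump to the office topic has no shared terms and triggers a split.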
Hierarchical chunking
For long structured documents (policy manuals, technical documentation, regulations) where both fine and coarse retrieval matter:
- Index multiple chunk sizes from the same document
- Small chunks (200-500 tokens) for precise retrieval
- Medium chunks (1,000-2,000 tokens) for context
- Whole-document or whole-section summaries for navigation
The retrieval step picks the right chunk size based on query type. Specific factual queries match small chunks; "tell me about" queries match medium chunks.
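A sketch of two-level indexing with parent links and a naive router. Word counts stand in for token counts, the split ignores section boundaries for brevity, and the summary level is omitted; `build_hierarchy` and `choose_level` are hypothetical helpers. The router is a deliberately crude keyword heuristic: a production system might use a query classifier instead, or retrieve small chunks and expand to their parents. The default sizes sit inside the ranges above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    id: str
    level: str                       # "small" or "medium"
    text: str
    parent_id: Optional[str] = None  # small chunks point at their medium parent


def split_words(words: list[str], size: int) -> list[list[str]]:
    return [words[i:i + size] for i in range(0, len(words), size)]


def build_hierarchy(doc_id: str, text: str, medium: int = 1500, small: int = 300) -> list[Node]:
    """Index one document at two granularities, linking each small chunk to its parent."""
    nodes: list[Node] = []
    for m, medium_words in enumerate(split_words(text.split(), medium)):
        parent = Node(f"{doc_id}/m{m}", "medium", " ".join(medium_words))
        nodes.append(parent)
        for s, small_words in enumerate(split_words(medium_words, small)):
            nodes.append(Node(f"{parent.id}/s{s}", "small", " ".join(small_words), parent.id))
    return nodes


BROAD_PREFIXES = ("tell me about", "overview of", "explain", "summarise", "summarize")


def choose_level(query: str) -> str:
    """Naive router: broad 'tell me about' queries search medium chunks, the rest search small ones."""
    return "medium" if query.strip().lower().startswith(BROAD_PREFIXES) else "small"
```

The `parent_id` link is what makes the levels useful together: a precise hit on a small chunk can still be expanded to its surrounding context when the answer needs it.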
Table and code preservation
For documents with tables and code:
- Tables: keep the whole table as one chunk, with the table caption and the column headers preserved. Splitting tables across chunks destroys their meaning.
- Code blocks: keep the whole code block as one chunk, with surrounding prose context as metadata. Code split mid-function is useless.
This requires the chunking pipeline to recognise tables and code blocks before splitting.
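A sketch of that recognition step for Markdown input: `segment` walks the lines and emits fenced code blocks and pipe tables as indivisible blocks, and `atomic_chunks` attaches the preceding prose (often a caption or lead-in sentence) as context metadata. Both names are hypothetical, prose blocks would still go through a normal splitter afterwards, and PDF or HTML input would need a table extractor upstream.

```python
import re

FENCE = "`" * 3  # built this way so the example does not contain a literal fence
TABLE_ROW = re.compile(r"^\s*\|.*\|\s*$")  # a Markdown pipe-table row


def segment(doc: str) -> list[tuple[str, str]]:
    """Split Markdown into ("prose" | "table" | "code", text) blocks; tables and code stay whole."""
    blocks: list[tuple[str, str]] = []
    buf: list[str] = []
    kind = "prose"

    def flush() -> None:
        text = "\n".join(buf).strip("\n")
        if text.strip():
            blocks.append((kind, text))
        buf.clear()

    for line in doc.splitlines():
        if kind == "code":
            buf.append(line)
            if line.strip().startswith(FENCE):  # closing fence ends the block
                flush()
                kind = "prose"
            continue
        if line.strip().startswith(FENCE):  # opening fence starts a code block
            flush()
            kind = "code"
            buf.append(line)
            continue
        is_row = bool(TABLE_ROW.match(line))
        if is_row != (kind == "table"):  # entering or leaving a table
            flush()
            kind = "table" if is_row else "prose"
        buf.append(line)
    flush()
    return blocks


def atomic_chunks(doc: str) -> list[dict]:
    """One chunk per table or code block, with the preceding prose kept as context metadata."""
    chunks, prev_prose = [], ""
    for kind, text in segment(doc):
        if kind == "prose":
            prev_prose = text
            chunks.append({"kind": kind, "text": text})
        else:
            chunks.append({"kind": kind, "text": text, "context": prev_prose})
    return chunks
```

Because the header row, separator and data rows land in one block, the column headers are always retrieved together with the values they label.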
Per-document-type rules
The honest pattern in production: chunking is a function of document type, not a global setting. For an enterprise knowledge base mixing:
- Policy PDFs: heading-based with overlap
- Knowledge base articles (Markdown): heading-based with parent headings prepended
- Customer support transcripts: semantic chunking on conversation turns
- Code repositories: function-level or class-level boundaries
- Email threads: per-email with thread context as metadata
- Spreadsheets: per-sheet with column headers preserved
A single chunking strategy across this mix produces uneven retrieval quality. Per-type rules are more work to maintain but materially better in practice.
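The per-type rules above can be captured as a lookup table that the ingestion pipeline consults before chunking. This sketch is hypothetical: the type keys, strategy names and overlap fractions are illustrative values drawn from the ranges in this article, not AgentWorks configuration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChunkingRule:
    strategy: str         # which chunker to run
    overlap: float        # fraction of a chunk shared with its neighbour
    metadata: tuple = ()  # extra context attached to every chunk


# Illustrative rule table mirroring the document types listed above.
RULES: dict[str, ChunkingRule] = {
    "policy_pdf": ChunkingRule("headings", 0.15),
    "kb_markdown": ChunkingRule("headings_with_parents", 0.10),
    "support_transcript": ChunkingRule("semantic_turns", 0.20),
    "code": ChunkingRule("function_or_class", 0.0),
    "email_thread": ChunkingRule("per_email", 0.0, ("thread_subject", "participants")),
    "spreadsheet": ChunkingRule("per_sheet", 0.0, ("column_headers",)),
}


def rule_for(doc_type: str) -> ChunkingRule:
    """Look up the rule for a document type; fail loudly instead of using a global default."""
    try:
        return RULES[doc_type]
    except KeyError:
        raise ValueError(f"no chunking rule for document type {doc_type!r}") from None
```

Raising on an unknown type, rather than falling back to a default, is a deliberate choice here: a silent fallback is exactly how a single global strategy creeps back into a mixed knowledge base.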
The metadata that matters
Each chunk should carry metadata that makes retrieval smarter:
- Source identifier: which document, which version, which section
- Document type: for type-specific retrieval boosting
- Access controls: which users or roles can see this chunk
- Created and modified timestamps: for freshness-aware retrieval
- Parent context: for documents with hierarchical structure
- Custom domain tags: regulatory area, product line, customer segment
The metadata serves both retrieval (filter, rank, control access) and citation (the answer points to the source document with the right precision).
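A sketch of a per-chunk metadata record with an access-and-freshness filter and a citation helper. The field names, the role-based access model and the `visible` and `citation` helpers are hypothetical; many vector stores let you push an equivalent filter into the query itself, which is preferable to filtering after retrieval.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class ChunkMetadata:
    source_id: str                  # which document
    version: str                    # which version of it
    section: str                    # which section, for precise citations
    doc_type: str                   # for type-specific retrieval boosting
    allowed_roles: frozenset        # access control: roles that may see this chunk
    created: datetime
    modified: datetime              # for freshness-aware retrieval
    parent_id: Optional[str] = None  # parent chunk in a hierarchy
    tags: tuple = ()                # custom domain tags, e.g. ("GDPR", "retail")


def visible(chunks, user_roles, newer_than=None):
    """Keep only (text, metadata) pairs the user may see and, optionally, recent enough."""
    roles = set(user_roles)
    return [
        (text, meta)
        for text, meta in chunks
        if roles & meta.allowed_roles
        and (newer_than is None or meta.modified >= newer_than)
    ]


def citation(meta: ChunkMetadata) -> str:
    """Render a citation precise to the section and version the chunk came from."""
    return f"{meta.source_id} v{meta.version}, §{meta.section}"
```

Filtering on access before ranking matters: a chunk the user may not see should never reach the model's context, even if it is the best match.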
Overlap, and how much
Chunk overlap reduces the risk of relevant content falling on a boundary. Typical overlap:
- 10-15% for structure-aware chunking (the structure already provides natural boundaries)
- 15-25% for semantic chunking (semantic boundaries are imperfect)
- 0% for table and code chunks (you do not want partial tables overlapping)
Too much overlap wastes index space and produces duplicate retrieval results. Too little misses relevant context at boundaries.
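A sketch of fixed-size windows with fractional overlap, using whitespace tokens as a stand-in for model tokens (an assumption). `window` is a hypothetical helper; in a structure-aware pipeline you would apply it only inside a section or prose block that exceeds the size limit, and never to table or code chunks.

```python
def window(tokens: list[str], size: int, overlap: float) -> list[list[str]]:
    """Fixed-size windows in which each window repeats `overlap` (a fraction) of the previous one."""
    if size < 1 or not 0 <= overlap < 1:
        raise ValueError("size must be >= 1 and overlap in [0, 1)")
    step = max(1, round(size * (1 - overlap)))  # tokens to advance per window
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):  # this window already reaches the end
            break
    return chunks
```

The index cost follows from the step size: each window advances by `size * (1 - overlap)` tokens, so a long document is stored roughly `1 / (1 - overlap)` times, about 1.18× at 15% overlap and 1.33× at 25%.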
Evaluation: how you know your chunking is working
Chunking quality is best measured by retrieval quality, not by chunk inspection. Build an evaluation set:
- 50-200 representative queries that real users ask
- For each, the gold-standard answer and the source document(s)
- Run retrieval, measure recall@k (does the right source appear in the top k?) and precision@k (what fraction of the top k results is relevant?)
- Inspect failures to understand whether they are chunking, embedding, or query-side problems
When you change chunking strategy, re-run the evaluation. If recall@5 drops below baseline, the new strategy is worse. If recall@5 holds and precision improves, the new strategy is better.
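A sketch of the evaluation loop, assuming `retrieve(query)` returns the source-document IDs of the ranked chunks (a hypothetical interface). Recall@k here is the fraction of gold sources found in the top k, which reduces to the yes/no check above when each query has a single gold source; `compare` encodes the decision rule from the previous paragraph.

```python
from typing import Callable


def evaluate(
    eval_set: list[tuple[str, set[str]]],
    retrieve: Callable[[str], list[str]],
    k: int = 5,
) -> dict[str, float]:
    """Mean recall@k and precision@k over (query, gold_source_ids) pairs."""
    recall = precision = 0.0
    for query, gold in eval_set:
        top = retrieve(query)[:k]  # source IDs of the top-k retrieved chunks
        recall += len(gold & set(top)) / len(gold)       # share of gold sources found
        precision += sum(src in gold for src in top) / k  # share of top k that is relevant
    n = len(eval_set)
    return {"recall": recall / n, "precision": precision / n}


def compare(baseline: dict[str, float], candidate: dict[str, float]) -> str:
    """A recall drop loses; held recall plus better precision wins."""
    if candidate["recall"] < baseline["recall"]:
        return "worse"
    if candidate["precision"] > baseline["precision"]:
        return "better"
    return "inconclusive"
```

Run `evaluate` once with the current chunking to get the baseline, then again after re-indexing with the candidate strategy, and pass both results to `compare`.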
The common questions
"How big should my chunks be?" Wrong question. Ask "what is the right chunking strategy for each document type." Size emerges from the strategy.
"Should I use semantic chunking everywhere?" No. Semantic chunking is great for prose, wasteful for structured documents that already have structural boundaries.
"How do I handle very long documents?" Hierarchical chunking with multi-level indexes. Retrieve at the right level for the query.
"What about images and diagrams?" Extract text from images (OCR, image captioning models) and index the extracted content with a reference back to the original image. For documents where images carry the primary content (technical drawings, infographics), consider multimodal retrieval.
What AgentWorks ships
The knowledge-rag features in AgentWorks ship with structure-aware chunking as the default, semantic chunking as an option for prose-heavy content, and per-document-type configuration for production knowledge bases. The metadata model supports the per-chunk access controls and citation linking that production deployments need. For teams without ML engineering capacity to tune chunking strategies, the defaults work for typical knowledge bases; for teams with specific document types or quality requirements, the configuration surface lets you tune per type.
The lesson from production: spend the first week of any RAG project on chunking analysis, not on prompt engineering. The ceiling on the RAG quality you can ever reach is set by the chunking choices made at the start.
About the author
Erwin Berkouwer · Founder, AgentWorks
Erwin Berkouwer is the founder of AgentWorks — an AI agent platform purpose-built for European teams that need EU AI Act-ready governance, multi-LLM choice across OpenAI, Anthropic, Google and Mistral, and transparent per-token € pricing.