theLLMs

Last checked: 2026-05-24

Scope: Global. RAG and retrieval documentation was checked on 2026-05-24; this page is operational guidance, not a universal recipe.

AI draft model: gpt-5.4-mini

AI review model: llm-editor (deepseek-v4-pro)

Chunking documents for RAG: size, overlap and metadata choices

Chunking is one of those unglamorous choices that can make a retrieval system feel smart or stupid. Too large and every retrieved chunk drags in paragraphs of irrelevant text that bury the useful signal. Too small and the answer loses context, forcing the LLM to reconstruct meaning from fragments. The real job is to design chunks around how users actually search, not around convenient token limits.

Quick answer

Start with chunk sizes matched to document type and retrieval goal:

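  • Prose documents (reports, articles, manuals): 256–512 tokens with 10–20% overlap. This keeps most paragraphs intact while maintaining continuity across section boundaries. LangChain’s RecursiveCharacterTextSplitter is a reasonable starting point because it tries paragraph and line breaks before falling back to smaller separators, which preserves natural breakpoints better than flat character windows [1]. It measures chunk size in characters by default, not tokens, so set the size explicitly or configure token-based length counting rather than relying on its defaults.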
  • Code files: Chunk on function or class boundaries, not token counts. A 400-token chunk that splits a function body in half produces code that is semantically useless to both retriever and LLM [2].
  • Structured or tabular data: Keep logical rows together. Splitting a record across chunks destroys the relationships a downstream query needs.
  • Mixed corpora: Semantic chunking, which uses a sentence-level embedding model to find natural topic boundaries, often outperforms fixed-size windows, but it adds latency and model cost. Start with recursive character splitting (paragraph boundaries first, then line and word boundaries) before moving to semantic methods [1][2].
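
For the code-file case, one way to chunk on function and class boundaries in Python is to let the standard-library ast module find them. This is a minimal sketch, not how LangChain or LlamaIndex implement code splitting; the function name chunk_python_source and the SAMPLE source are made up for illustration, and only top-level definitions are handled.

```python
import ast


def chunk_python_source(source: str) -> list[dict]:
    """Return one chunk per top-level function or class in the source."""
    tree = ast.parse(source)
    lines = source.splitlines()
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            # Decorators sit above the def/class line, so start from the earliest one.
            start = min([node.lineno] + [d.lineno for d in node.decorator_list])
            end = node.end_lineno
            chunks.append({
                "name": node.name,
                "kind": type(node).__name__,
                "start_line": start,
                "end_line": end,
                "text": "\n".join(lines[start - 1:end]),
            })
    return chunks


# Hypothetical sample source used only for this illustration.
SAMPLE = '''import math

def area(radius):
    """Area of a circle."""
    return math.pi * radius ** 2

class Boiler:
    def __init__(self, model):
        self.model = model

    def label(self):
        return f"Boiler {self.model}"
'''

for chunk in chunk_python_source(SAMPLE):
    print(chunk["kind"], chunk["name"], chunk["start_line"], chunk["end_line"])
```

Module-level statements such as imports are dropped in this sketch; a fuller pipeline would probably keep them as a separate chunk or attach them as metadata.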

Overlap is a continuity aid, not a substitute for structure. Use 10–20% overlap for prose where sentences or concepts flow across chunk boundaries. Use less for structured data where each chunk is self-contained.
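
For structured data, where each chunk should be self-contained and overlap adds little, a rough sketch of row-preserving chunking might look like the following. It assumes CSV input with a header row; chunk_rows, rows_per_chunk and the TABLE data are illustrative names and values, not part of any library.

```python
import csv
import io


def chunk_rows(csv_text: str, rows_per_chunk: int = 2) -> list[str]:
    """Group whole records into chunks, repeating the header in each one."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    rows = [row for row in reader if row]  # skip blank lines
    chunks = []
    for start in range(0, len(rows), rows_per_chunk):
        out = io.StringIO()
        writer = csv.writer(out, lineterminator="\n")
        writer.writerow(header)  # every chunk carries its own column names
        writer.writerows(rows[start:start + rows_per_chunk])
        chunks.append(out.getvalue())
    return chunks


# Invented sample data for illustration.
TABLE = """model,warranty_years,exclusion
B100,2,"gas leaks, if self-installed"
B200,5,frost damage
B300,7,none
"""

for chunk in chunk_rows(TABLE):
    print(chunk)
```

Parsing with the csv module rather than splitting on newlines keeps a quoted field that contains a comma in one piece, and repeating the header means each chunk can be read without its neighbours.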

What this means

Chunking decides what the retriever can see and what it cannot. The embedding model turns each chunk into a vector; the retriever compares query vectors against those chunk vectors. If the chunk boundaries cut through a meaningful passage — a paragraph that explains why a specific method works, a function that implements a full algorithm — the retriever cannot find the complete answer. It can only find fragments.
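
A toy illustration of the fragment problem, assuming a bag-of-words count vector in place of a real embedding model. Real embeddings are far richer, so treat this only as a sketch of the mechanism; the passage, query and function names are invented for the example.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    # Assumption: word counts stand in for a learned embedding vector.
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def retrieve(query: str, chunks: list[str]) -> str:
    """Return the chunk whose vector is closest to the query vector."""
    q = embed(query)
    return max(chunks, key=lambda c: cosine(q, embed(c)))


PASSAGE = ("Report a gas leak to the installer. "
           "The installer must shut off the supply valve before any repair.")
# A fixed-width cut that lands mid-sentence.
FRAGMENTS = [PASSAGE[:24], PASSAGE[24:]]

QUERY = "what to do about a gas leak"
print(retrieve(QUERY, FRAGMENTS))   # best match among the two fragments
print(retrieve(QUERY, [PASSAGE]))   # best match when the passage is one chunk
```

With the passage cut in two, the best match is the fragment that mentions the leak but not the shut-off instruction; with the passage kept whole, the retriever returns both together.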

This is not a preprocessing detail. It is part of the product design. A team that treats chunk size as a fixed parameter, set once and forgotten, will see retrieval quality silently degrade as the corpus grows, as query patterns shift, and as embedding models change their context-window behaviour.

Where teams get it wrong, with specific consequences

Choosing chunk size by token count alone

A team sets chunk size to 512 tokens across their entire corpus because it fits the embedding model’s context window. The corpus includes warranty documents (long prose paragraphs), API reference material (short code blocks), and customer support transcripts (conversational turns). The fixed token window chops paragraphs mid-sentence, splits code blocks at arbitrary lines, and concatenates unrelated chat turns into single chunks.

Consequence: Semantic search queries like “what happens if my boiler leaks gas during installation?” retrieve chunks that start mid-sentence and end in the middle of a warranty exclusion — neither the answer nor the legal context is usable. The team blames the embedding model when the real problem is chunk geometry.

Practical fix: Use a splitter that respects document structure first, token budget second. LangChain offers splitters for character, recursive character, code language, and Markdown — choose the one that matches your document type rather than force-fitting a single strategy across everything [1].
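
A minimal sketch of "structure first, token budget second", written in plain Python rather than with LangChain's splitters. It assumes paragraphs are separated by blank lines and approximates tokens with whitespace-separated words; split_structure_first, count_tokens and the sample text are hypothetical names for illustration.

```python
import re


def count_tokens(text: str) -> int:
    # Assumption: whitespace-separated words stand in for model tokens.
    return len(text.split())


def split_structure_first(text: str, max_tokens: int = 60) -> list[str]:
    """Pack whole paragraphs into chunks; break a paragraph into sentences
    only when that paragraph alone exceeds the budget."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    units = []
    for para in paragraphs:
        if count_tokens(para) <= max_tokens:
            units.append(para)
        else:
            # Fall back to sentence boundaries for oversized paragraphs.
            units.extend(s for s in re.split(r"(?<=[.!?])\s+", para) if s)

    chunks, current = [], []
    for unit in units:
        # Close the current chunk if adding this unit would exceed the budget.
        if current and count_tokens(" ".join(current + [unit])) > max_tokens:
            chunks.append("\n\n".join(current))
            current = []
        current.append(unit)
    if current:
        chunks.append("\n\n".join(current))
    return chunks


# Invented sample text for illustration.
DOC = (
    "Warranty covers parts for two years. Labour is covered for one year.\n\n"
    "Exclusions apply to self-installed units. Gas leaks during installation "
    "must be reported to the installer immediately.\n\n"
    "Contact support with your serial number."
)

for chunk in split_structure_first(DOC, max_tokens=20):
    print(repr(chunk))
```

A single sentence longer than the budget still comes through oversized in this sketch; a production splitter would need a further fallback, and a real tokenizer in place of the word count.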

Using overlap as a substitute for good structure

A team knows their PDF parser loses section headings. Rather than fixing the parser, they set chunk overlap to 50% so that “most chunks contain a heading somewhere.” The result is massive chunk duplication: at 50% overlap each window advances by only half its length, so the index stores roughly 2x the original document volume. Retrieval then returns near-identical chunks from the same passage, because the same sentence appears in two overlapping chunks with slightly different context.

Consequence: The reranker or downstream LLM sees redundant content that inflates context windows without adding information. A user query that should return one relevant passage returns three near-copies. The effective context window shrinks because half of it is duplicate text.
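
The storage cost of overlap follows from the stride. With a window of S tokens and an overlap fraction o, each window advances by S × (1 − o) tokens, so the stored volume approaches 1 / (1 − o) times the original. The sketch below checks that arithmetic on an assumed 10,000-token document with 500-token windows; the function names and numbers are illustrative.

```python
def window_starts(total_tokens: int, chunk_size: int, overlap: float) -> list[int]:
    """Start offsets of fixed-size windows with a fractional overlap."""
    if not 0 <= overlap < 1:
        raise ValueError("overlap must be in [0, 1)")
    stride = max(1, round(chunk_size * (1 - overlap)))
    starts = []
    pos = 0
    while True:
        starts.append(pos)
        if pos + chunk_size >= total_tokens:
            break  # this window already reaches the end of the document
        pos += stride
    return starts


def storage_multiplier(total_tokens: int, chunk_size: int, overlap: float) -> float:
    """Tokens stored across all chunks, divided by tokens in the document."""
    starts = window_starts(total_tokens, chunk_size, overlap)
    stored = sum(min(chunk_size, total_tokens - s) for s in starts)
    return stored / total_tokens


for overlap in (0.0, 0.15, 0.5):
    print(f"overlap={overlap:.2f} multiplier={storage_multiplier(10_000, 500, overlap):.2f}")
```

At 50% overlap the multiplier comes out just under 2, while 15% overlap stays below 1.2, which is why the 10–20% range is a much cheaper continuity aid.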

Practical fix: Overlap should compensate for natural boundary loss, not for missing structure. If your source documents have headings, tables of contents, or section markers, expose them to the chunker rather than hiding them behind aggressive overlap. Use metadata to carry section hierarchy instead of relying on overlap to preserve continuity [3].

Dropping metadata that later matters for filters or citations

A chunking pipeline strips source metadata during preprocessing because “we only need the text for embedding.” Later, the team wants to filter retrieval results by document source, date range, or section heading — but there is no way to do it. Every chunk is a plain text blob with no provenance.

Consequence: A legal team using RAG to search contract archives cannot restrict retrieval to current-year agreements. A customer support system cannot cite which document version a chunk came from. The only option is to re-chunk the entire corpus with metadata tracking, which costs time and compute.

Practical fix: Carry at least a minimal metadata schema with every chunk:

  • source_path — original file or document identifier
  • heading_hierarchy — array of ancestor headings (e.g., ["Chapter 3", "Warranty terms", "Exclusions"])
  • chunk_index — position within the parent document (for reassembly)
  • page_number or equivalent document-internal locator

Pinecone, Weaviate and Qdrant all support metadata filtering natively [3][4]. Attaching these fields costs almost nothing at chunk time and saves a full corpus reindex later.
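
A sketch of how the schema above could be populated while chunking, assuming Markdown-style "#" headings in the source text. The Chunk dataclass, chunk_with_headings and filter_chunks are hypothetical helpers, not part of any vector database client; in practice the fields would be passed to the store as metadata alongside each vector.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Chunk:
    text: str
    source_path: str
    heading_hierarchy: list[str] = field(default_factory=list)
    chunk_index: int = 0
    page_number: Optional[int] = None  # unknown for unpaginated sources


def chunk_with_headings(text: str, source_path: str) -> list[Chunk]:
    """Split on headings and record each chunk's ancestor headings."""
    chunks: list[Chunk] = []
    stack: list[tuple[int, str]] = []  # (level, title) of open headings
    body: list[str] = []

    def flush() -> None:
        content = "\n".join(body).strip()
        if content:
            hierarchy = [title for _, title in stack]
            chunks.append(Chunk(content, source_path, hierarchy, len(chunks)))
        body.clear()

    for line in text.splitlines():
        if line.startswith("#"):
            flush()  # close the previous section before the stack changes
            level = len(line) - len(line.lstrip("#"))
            # Drop headings at the same or a deeper level before pushing.
            while stack and stack[-1][0] >= level:
                stack.pop()
            stack.append((level, line.lstrip("#").strip()))
        else:
            body.append(line)
    flush()
    return chunks


def filter_chunks(chunks: list[Chunk], heading: str) -> list[Chunk]:
    """Keep chunks that sit anywhere under the given heading."""
    return [c for c in chunks if heading in c.heading_hierarchy]


# Invented sample document and path for illustration.
DOC = """# Chapter 3
## Warranty terms
Parts are covered for two years.
### Exclusions
Damage from self-installation is excluded.
## Claims
Submit claims within 30 days.
"""

for c in chunk_with_headings(DOC, "manuals/boiler.md"):
    print(c.chunk_index, c.heading_hierarchy, repr(c.text))
```

This sketch treats any line starting with "#" as a heading, so a comment inside a code block would be misread; a real pipeline would take headings from the document parser instead.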

Practical decision check

  • What is the natural unit of meaning in your documents? A paragraph, a function, a table row, a conversational turn?
  • Does the retrieval task need continuity across adjacent chunks, or is each chunk self-contained?
  • Which metadata fields are required to filter, cite or reassemble results?
  • What chunking strategy matches your document structure: recursive character, code-aware, semantic, or fixed-size?
  • How does your chosen chunk size relate to the embedding model’s context window? Input beyond the model’s limit is truncated or rejected, and chunks that approach the limit tend to pack several topics into one vector, which makes them harder to match against a specific query [5].

What would change this advice

This guidance is current as of May 2026 and reflects documented behaviour in LangChain 0.3.x, LlamaIndex 0.12.x, OpenAI text-embedding-3-* and text-embedding-ada-002, and Pinecone/Weaviate/Qdrant metadata-filtering capabilities as of their latest stable releases.

The advice would need revision if:

  • Embedding models with drastically larger context windows become standard — e.g., models that can embed 8K+ token chunks without quality loss. That could reduce the need for aggressive splitting but would not eliminate the need for structure-respecting chunk boundaries.
  • Hybrid search (dense + sparse) becomes the default in vector databases — overlap strategies designed to compensate for boundary loss in pure semantic retrieval may become less important when keyword signals also contribute to ranking [4].
  • Vector databases begin to track provenance automatically. Metadata filtering is already native, but it depends on fields you attach at chunk time; if stores could apply source, date or section filters at query time without explicit chunk-level metadata, the provenance problem changes. For now, chunk metadata is still the reliable approach.

This page is operational guidance, not a universal constant. Test your settings against real queries and real retrieval failures before treating any heuristic as fixed.

Methodology and sources

Check date: 2026-05-24

What was checked: RAG framework chunking docs (LangChain text splitters, LlamaIndex node parsers), embedding provider documentation (OpenAI), and vector database metadata/indexing guidance (Pinecone, Weaviate, Qdrant).

What the sources were used for:

  • LangChain recursive character and code-aware splitting behaviour [1]
  • LlamaIndex semantic chunking strategies and node parser architecture [2]
  • Pinecone metadata filtering patterns [3]
  • Hybrid search and dense+sparse ranking strategies [4]
  • Embedding model context-window behaviour and chunk size trade-offs [5]

Assumptions and limits:

  • documents vary widely in structure, density and quality — no single chunk size works for every corpus
  • retrieval evaluation should measure performance on real user queries, not only accuracy on a curated test set
  • chunking interacts with embedding choice — the same chunk size may perform differently with text-embedding-ada-002 vs text-embedding-3-large
  • this is operational guidance, not a research paper

Change log

  • 2026-05-24: first draft built from the llm-editor-approved brief.
  • 2026-05-24: revised after editorial review — added inline citations, concrete examples, expanded failure scenarios, metadata schema, evidence-change paragraph, and production route links.

Source list

  1. LangChain text splitters docs — https://python.langchain.com/docs/concepts/text_splitters/
  2. LlamaIndex chunking docs — https://docs.llamaindex.ai/en/stable/module_guides/loading/node_parsers/
  3. Pinecone docs — https://docs.pinecone.io/
  4. Weaviate hybrid search — https://weaviate.io/developers/weaviate/search/hybrid
  5. OpenAI embeddings docs — https://platform.openai.com/docs/guides/embeddings