hero_image: "/images/hero/chunking-documents-for-rag-size-overlap-and-metadata-choices.png"
layout: ../../layouts/GuideLayout.astro
title: "Chunking documents for RAG: size, overlap and metadata choices"
description: "A practical guide to breaking source material into retrievable chunks for production RAG pipelines without wrecking semantic meaning or search quality."
writtenBy: "gemma4:26b"
reviewedBy: "deepseek-r1:32b"
lastChecked: "2026-05-28"
scope: "Global. RAG and retrieval documentation was checked on 2026-05-28; this page is operational guidance, not a universal recipe."

# Chunking documents for RAG: size, overlap and metadata choices

Chunking is one of those unglamorous choices that can make a retrieval system feel smart or stupid. Too large and the retriever gets lazy, pulling in paragraphs of irrelevant text that bury the useful signal. Too small and the answer loses context, forcing the LLM to reconstruct meaning from fragments. The real job is to design chunks around how users actually search, not around convenient token limits.

## TL;DR

Start with chunk sizes matched to document type and retrieval goal:

- Prose documents (reports, articles, manuals): 256–512 tokens with 10–20% overlap. This keeps most paragraphs intact while maintaining continuity across section boundaries. LangChain’s RecursiveCharacterTextSplitter is a common starting point because it tries paragraph, line and word breaks before falling back to flat character windows. Its default size is measured in characters, not tokens, so set the size explicitly [1].
- Code files: Chunk on function or class boundaries, not token counts. A 400-token chunk that splits a function body in half is semantically useless to both retriever and LLM [2].
- Structured or tabular data: Keep logical rows together. Splitting a record across chunks destroys the relationships a downstream query needs.
- Mixed corpora: Semantic chunking, which uses a sentence-level embedding model to find natural topic boundaries, often outperforms fixed-size windows, but it adds latency and model cost. Start with recursive character splitting (paragraph + sentence boundaries) before moving to semantic methods [1][2].
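
To make the code-file rule concrete, here is a minimal sketch that chunks Python source on top-level function and class boundaries using the standard-library `ast` module. The function name `chunk_python_source` and the sample source are hypothetical, and other languages would need their own parser; treat it as an illustration of boundary-based chunking, not a production splitter.

```python
import ast


def chunk_python_source(source: str) -> list:
    """Return one chunk per top-level function or class (hypothetical helper)."""
    tree = ast.parse(source)
    lines = source.splitlines()
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            # Start at the first decorator, if any, so the chunk is complete.
            start = min([node.lineno] + [d.lineno for d in node.decorator_list])
            end = node.end_lineno  # requires Python 3.8+
            chunks.append({
                "name": node.name,
                "kind": type(node).__name__,
                "start_line": start,
                "end_line": end,
                "text": "\n".join(lines[start - 1:end]),
            })
    return chunks


# Hypothetical sample file used only for illustration.
SAMPLE = '''import math

def area(r):
    return math.pi * r ** 2

class Circle:
    def __init__(self, r):
        self.r = r

    def area(self):
        return area(self.r)
'''

code_chunks = chunk_python_source(SAMPLE)
for chunk in code_chunks:
    print(chunk["kind"], chunk["name"], chunk["start_line"], chunk["end_line"])
```

Each chunk is a complete definition that still parses on its own. Module-level statements such as imports are skipped in this sketch; a real pipeline would probably attach them as metadata or as a separate chunk.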

Overlap is a continuity aid, not a substitute for structure. Use 10–20% overlap for prose where sentences or concepts flow across chunk boundaries. Use less for structured data where each chunk is self-contained.

-- Editor's Note --

The most reliable chunking heuristic is the one nobody talks about: test your chunk sizes against real queries from your actual users. A chunk size that works perfectly for a benchmark dataset can fail silently on your specific corpus because your documents have different paragraph density, heading structure, or average query length. Run retrieval recall tests on a sample of real queries before locking in a strategy.
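
A retrieval recall test can be as small as the sketch below. It assumes you have hand-labelled which chunk ids answer each query; the query strings, chunk ids, and function names are all hypothetical, and in practice `retrieved` would come from your own retriever rather than a hard-coded list.

```python
def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the relevant chunk ids that appear in the top k results."""
    if not relevant_ids:
        return 0.0
    top_k = set(retrieved_ids[:k])
    return len(top_k & set(relevant_ids)) / len(relevant_ids)


def mean_recall_at_k(runs, k):
    """Average recall@k over a list of labelled query runs."""
    scores = [recall_at_k(r["retrieved"], r["relevant"], k) for r in runs]
    return sum(scores) / len(scores)


# Hypothetical labelled sample: ranked retriever output plus the chunk ids
# a human judged to contain the answer.
runs = [
    {"query": "boiler gas leak during installation",
     "retrieved": ["c7", "c2", "c9"], "relevant": ["c2"]},
    {"query": "warranty exclusions",
     "retrieved": ["c4", "c1", "c3"], "relevant": ["c3", "c8"]},
]

print(mean_recall_at_k(runs, k=3))  # 0.75: one full hit, one half hit
print(mean_recall_at_k(runs, k=1))  # 0.0: no relevant chunk ranked first
```

Re-running the same labelled set after each change to chunk size or overlap gives a like-for-like comparison on your own corpus.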

## What this means

Chunking decides what the retriever can see and what it cannot. The embedding model turns each chunk into a vector; the retriever compares query vectors against those chunk vectors. If the chunk boundaries cut through a meaningful passage (a paragraph that explains why a specific method works, a function that implements a full algorithm), the retriever cannot find the complete answer. It can only find fragments.

This is not a preprocessing detail. It is part of the product design. A team that treats chunk size as a fixed parameter set once and forgotten will make retrieval decisions that silently degrade as the corpus grows, as query patterns shift, and as embedding models change their context-window behaviour. Chunking also interacts with your broader retrieval architecture choices: see "when semantic search is enough and when it is not" for the retrieval-strategy context that shapes what your chunker needs to deliver.

## Where teams get it wrong, with specific consequences

### Choosing chunk size by token count alone

A team sets chunk size to 512 tokens across their entire corpus because it fits the embedding model’s context window. The corpus includes warranty documents (long prose paragraphs), API reference material (short code blocks), and customer support transcripts (conversational turns). The fixed token window chops paragraphs mid-sentence, splits code blocks at arbitrary lines, and concatenates unrelated chat turns into single chunks.

Consequence: Semantic search queries like “what happens if my boiler leaks gas during installation?” retrieve chunks that start mid-sentence and end in the middle of a warranty exclusion, so neither the answer nor the legal context is usable. The team blames the embedding model when the real problem is chunk geometry.

Practical fix: Use a splitter that respects document structure first, token budget second. LangChain offers splitters for character, recursive character, code language, and Markdown. Choose the one that matches your document type rather than force-fitting a single strategy across everything [1].
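
The "structure first, token budget second" idea can be sketched in plain Python without any library. The version below is an illustration under stated assumptions, not LangChain's implementation: whitespace-separated words stand in for model tokens, paragraphs are separated by blank lines, and the names `count_tokens`, `pack`, and `split_structure_first` are hypothetical.

```python
import re


def count_tokens(text: str) -> int:
    # Assumption: whitespace words approximate model tokens.
    return len(text.split())


def pack(units, budget, joiner):
    """Greedily join consecutive units while staying within the budget."""
    chunks, current = [], []
    for unit in units:
        candidate = joiner.join(current + [unit])
        if current and count_tokens(candidate) > budget:
            chunks.append(joiner.join(current))
            current = [unit]
        else:
            current.append(unit)
    if current:
        chunks.append(joiner.join(current))
    return chunks


def split_structure_first(text: str, budget: int = 300) -> list:
    """Keep paragraphs whole; break only oversized ones, at sentence ends."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    units = []
    for para in paragraphs:
        if count_tokens(para) <= budget:
            units.append(para)
        else:
            sentences = re.split(r"(?<=[.!?])\s+", para)
            units.extend(pack(sentences, budget, " "))
    # A single sentence longer than the budget is left whole in this sketch.
    return pack(units, budget, "\n\n")


doc = (
    "One two three.\n\n"
    "Four five six.\n\n"
    "Seven eight nine ten. Eleven twelve thirteen fourteen. Fifteen sixteen."
)
for chunk in split_structure_first(doc, budget=6):
    print(repr(chunk))
```

With a budget of six words, the two short paragraphs are merged into one chunk and the long paragraph is divided only at sentence ends, so no chunk starts or stops mid-sentence.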

### Using overlap as a substitute for good structure

A team knows their PDF parser loses section headings. Rather than fixing the parser, they set chunk overlap to 50% so that “most chunks contain a heading somewhere.” The result is massive chunk duplication. At 50% overlap each chunk advances by only half its length, so the corpus stores roughly 2x the original document volume, and retrieval returns near-identical chunks from the same passage because the same sentence appears in two overlapping chunks with slightly different context.

Consequence: The reranker or downstream LLM sees redundant content that inflates context windows without adding information. A user query that should return one relevant passage returns three near-copies. The effective context window shrinks because as much as half of it is duplicate text.
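
The duplication cost of overlap is simple arithmetic: a window of `size` tokens that advances by `size - overlap` stores roughly `size / (size - overlap)` times the original text. The sketch below checks that on a synthetic token list; the function names and the 10,000-token document are hypothetical.

```python
def sliding_windows(tokens, size, overlap):
    """Fixed-size windows where consecutive windows share `overlap` tokens."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be >= 0 and smaller than size")
    stride = size - overlap
    windows = []
    for start in range(0, len(tokens), stride):
        windows.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the last window already reaches the end of the document
    return windows


def storage_factor(tokens, size, overlap):
    """Stored tokens divided by original tokens for a given overlap."""
    stored = sum(len(w) for w in sliding_windows(tokens, size, overlap))
    return stored / len(tokens)


# Synthetic 10,000-token document, for illustration only.
tokens = [f"t{i}" for i in range(10_000)]

print(f"{storage_factor(tokens, 500, 0):.2f}")    # 1.00, no overlap
print(f"{storage_factor(tokens, 500, 50):.2f}")   # 1.11, 10% overlap
print(f"{storage_factor(tokens, 500, 250):.2f}")  # 1.95, 50% overlap
```

At 10% overlap the index grows by about a tenth; at 50% it nearly doubles, and every interior sentence appears in two chunks.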

Practical fix: Overlap should compensate for natural boundary loss, not for missing structure. If your source documents have headings, tables of contents, or section markers, expose them to the chunker rather than hiding them behind aggressive overlap. Use metadata to carry section hierarchy instead of relying on overlap to preserve continuity [3].

### Dropping metadata that later matters for filters or citations

A chunking pipeline strips source metadata during preprocessing because “we only need the text for embedding.” Later, the team wants to filter retrieval results by document source, date range, or section heading, but there is no way to do it. Every chunk is a plain text blob with no provenance.

Consequence: A legal team using RAG to search contract archives cannot restrict retrieval to current-year agreements. A customer support system cannot cite which document version a chunk came from. The only option is to re-chunk the entire corpus with metadata tracking, which costs time and compute.

Practical fix: Carry at least a minimal metadata schema with every chunk:

- `source_path`: original file or document identifier
- `heading_hierarchy`: array of ancestor headings (e.g., ["Chapter 3", "Warranty terms", "Exclusions"])
- `chunk_index`: position within the parent document (for reassembly)
- `page_number` or equivalent document-internal locator

Pinecone, Weaviate and Qdrant all support metadata filtering natively [3][4]. Carrying these fields costs almost nothing at chunk time and saves a full corpus reindex later.
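
As a rough sketch of that schema in practice, the example below splits a Markdown document at headings and attaches the four fields to every chunk. The `Chunk` dataclass, the `chunk_markdown` function, and the sample path `manuals/boiler.md` are hypothetical. The final filter is a plain in-memory list comprehension standing in for a vector database's metadata filter, not any vendor's API.

```python
import re
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Chunk:
    text: str
    source_path: str
    heading_hierarchy: List[str]
    chunk_index: int
    page_number: Optional[int] = None  # Markdown has no pages; PDFs would set this


HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def chunk_markdown(text: str, source_path: str) -> List[Chunk]:
    """One chunk per heading section, tagged with its ancestor headings.

    Limitation of this sketch: '#' lines inside fenced code blocks would be
    mistaken for headings.
    """
    chunks, stack, buffer = [], [], []

    def flush():
        body = "\n".join(buffer).strip()
        if body:
            hierarchy = [title for _, title in stack]
            chunks.append(Chunk(body, source_path, hierarchy, len(chunks)))
        buffer.clear()

    for line in text.splitlines():
        match = HEADING.match(line)
        if match:
            flush()
            level = len(match.group(1))
            # Drop headings at the same or deeper level before descending.
            while stack and stack[-1][0] >= level:
                stack.pop()
            stack.append((level, match.group(2).strip()))
        else:
            buffer.append(line)
    flush()
    return chunks


SAMPLE_DOC = """# Chapter 3
Intro text.
## Warranty terms
Covered parts.
### Exclusions
Gas leaks during installation.
## Returns
Send it back.
"""

md_chunks = chunk_markdown(SAMPLE_DOC, "manuals/boiler.md")
warranty_only = [c for c in md_chunks if "Warranty terms" in c.heading_hierarchy]
for c in warranty_only:
    print(c.chunk_index, c.heading_hierarchy, c.text)
```

Because the hierarchy travels with each chunk, a request like "only results from the warranty section" becomes a filter rather than a re-index.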

-- Editor's Note --

The metadata schema listed here is the minimum viable set. If you skip even one field, say `heading_hierarchy`, you will discover you needed it the moment someone asks "show me only results from the warranty section." Adding metadata to an already-indexed corpus means re-indexing. Spend the extra five minutes per document type now.

## Practical decision check

- What is the natural unit of meaning in your documents? A paragraph, a function, a table row, a conversational turn?
- Does the retrieval task need continuity across adjacent chunks, or is each chunk self-contained?
- Which metadata fields are required to filter, cite or reassemble results?
- What chunking strategy matches your document structure: recursive character, code-aware, semantic, or fixed-size?
- How does your chosen chunk size relate to the embedding model’s context window? If chunks are close to the window limit, retrieval degrades because the model cannot distinguish signal from boundary noise [5].

## What would change this advice

This guidance is current as of May 2026 and reflects documented behaviour in LangChain 0.3.x, LlamaIndex 0.12.x, OpenAI text-embedding-3-* and text-embedding-ada-002, and Pinecone/Weaviate/Qdrant metadata-filtering capabilities as of their latest stable releases.

The advice would need revision if:

- Embedding models with drastically larger context windows become standard, e.g., models that can embed 8K+ token chunks without quality loss. That could reduce the need for aggressive splitting but would not eliminate the need for structure-respecting chunk boundaries.
- Hybrid search (dense + sparse) becomes the default in vector databases. Overlap strategies designed to compensate for boundary loss in pure semantic retrieval may become less important when keyword signals also contribute to ranking [4].
- Metadata-filtered retrieval becomes a first-class vector DB primitive. If filters are automatically applied at query time without explicit chunk-level metadata, the provenance problem changes. For now, chunk metadata is still the reliable approach.

This page is operational guidance, not a universal constant. Test your settings against real queries and real retrieval failures before treating any heuristic as fixed.

-- Editor's Note --

The "what would change this advice" section is worth bookmarking. Chunking guidance ages faster than most RAG advice because it sits at the intersection of embedding model capabilities, vector database features, and document parsing quality, all three of which are moving independently. Check this page again in six months.

## Caveats and scope boundaries

- This guide covers text-based document chunking for RAG systems. It does not cover multimodal chunking (images, audio, video), which introduces additional embedding and boundary-detection challenges.
- Chunking interacts with embedding model choice: the same chunk size may perform differently with text-embedding-ada-002 versus text-embedding-3-large. Test with your specific embedding model.
- The operational guidance here reflects tool and library behaviour as of May 2026. LangChain, LlamaIndex, and vector database APIs evolve, so verify current documentation.

## Methodology

- Data checked: 2026-05-28
- Sources consulted: LangChain text splitters documentation, LlamaIndex node parsers documentation, Pinecone metadata filtering documentation, Weaviate hybrid search documentation, OpenAI embeddings documentation
- Assumptions: The reader is building or maintaining a RAG system and needs practical chunking guidance across multiple document types
- Limitations: This article provides operational guidance, not a research survey. It does not benchmark specific chunking strategies against each other or cover the full landscape of academic chunking literature
- Jurisdiction: Global. No jurisdiction-specific regulatory content

## Source list

1. LangChain text splitters docs: https://python.langchain.com/docs/concepts/text_splitters/ (accessed 2026-05-28)
2. LlamaIndex chunking docs: https://docs.llamaindex.ai/en/stable/module_guides/loading/node_parsers/ (accessed 2026-05-28)
3. Pinecone docs: https://docs.pinecone.io/ (accessed 2026-05-28)
4. Weaviate hybrid search: https://weaviate.io/developers/weaviate/search/hybrid (accessed 2026-05-28)
5. OpenAI embeddings docs: https://platform.openai.com/docs/guides/embeddings (accessed 2026-05-28)

## Related guides

## Trust Stack

- Last checked: 2026-05-28
- Corrections: Contact us to report errors

## Change log

- 2026-05-28: Full editorial review against 16-gate checklist. Added 3 Editor’s Note asides. Added Methodology, Source list with access dates, Trust Stack, slugified heading IDs (all H2s/H3s), and standalone Caveats section. Fixed frontmatter writtenBy label. Corrected related guide paths to relative format.
- 2026-05-24: First draft built from editorial brief. Revised after editorial review: added inline citations, concrete examples, expanded failure scenarios, metadata schema, and evidence-change paragraph.