Best Chunking Methods (According to Chroma Research)
For the last few days I’ve been on a deep dive into what might be the best methods for a consumer-level AI model to retrieve and store its data.
If you have ever tried building a simple AI model, you quickly understand why a basic RAG system without an actual vector database isn’t viable, at least not while trying to maintain a simple approach (without the use of SQL or the PGAI Vectorizer).
What’s Chunking?⌗
If you are reading this, I presume you are already familiar with what chunking means, but if you haven’t yet come across the term: chunking is simply breaking large documents or text into smaller, manageable pieces.
In simple terms:
- Instead of dealing with entire documents, you split them into bite-sized chunks
- Each chunk contains related information
- When answering questions, the system only retrieves the relevant chunks
- Good chunking finds the right balance between including enough context without too much irrelevant information
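To make this concrete, here is a minimal pure-Python sketch of fixed-size chunking with overlap. It approximates tokens as whitespace-separated words (real pipelines count model tokens with an actual tokenizer), and the 250/125 numbers simply mirror the parameters discussed below.

```python
def chunk_tokens(text, chunk_size=250, overlap=125):
    """Split text into overlapping windows of roughly chunk_size tokens."""
    tokens = text.split()            # crude whitespace "tokenizer"
    step = chunk_size - overlap      # how far each window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break                    # this window already reached the end
    return chunks

# A toy 600-"token" document yields 4 overlapping chunks.
doc = " ".join(f"word{i}" for i in range(600))
chunks = chunk_tokens(doc)
print(len(chunks), len(chunks[0].split()))  # 4 250
```

Each consecutive pair of chunks shares 125 tokens, which is exactly the context-preserving redundancy that overlap buys you.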
Best Chunking Methods - According to Chroma Research⌗
Some of the key findings that stand out:
1. Best Overall Performance⌗
- Method: RecursiveCharacterTextSplitter
- Parameters:
  - Size: 250 tokens
  - Overlap: 125 tokens
  - Retrieve: 10 chunks
- Performance: Highest recall at 96.4 ± 16.9%
2. Alternative Strong Performers⌗
- Method: ClusterSemanticChunker (a novel strategy)
- Parameters:
  - Size: 200 tokens
  - Overlap: 0 tokens
  - Retrieve: 5 chunks
- Performance: Good precision and IoU scores
3. Performance Trade-offs⌗
- Larger chunk sizes (250 vs 200) → Better recall
- Overlap → Better recall but lower precision
- More retrieved chunks (10 vs 5) → Better recall but lower precision
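One way to see the overlap trade-off is index size: a sliding window stores each token roughly chunk_size / (chunk_size − overlap) times. Here is a quick back-of-the-envelope comparison of the two configurations above, assuming a hypothetical 10,000-token corpus:

```python
import math

def n_chunks(n_tokens, chunk_size, overlap):
    """Number of sliding windows needed to cover n_tokens tokens."""
    if n_tokens <= chunk_size:
        return 1
    step = chunk_size - overlap
    return math.ceil((n_tokens - chunk_size) / step) + 1

corpus = 10_000  # hypothetical corpus size in tokens
for size, overlap in [(250, 125), (200, 0)]:
    chunks = n_chunks(corpus, size, overlap)
    stored = chunks * size  # tokens actually indexed, overlap counted twice
    print(f"size={size} overlap={overlap}: {chunks} chunks, {stored} tokens")
# size=250 overlap=125: 79 chunks, 19750 tokens
# size=200 overlap=0: 50 chunks, 10000 tokens
```

The 125-token overlap nearly doubles the indexed text, which matches the pattern above: more chances to retrieve the right passage (recall up), but more duplicated content competing in the result set (precision down).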
My model currently uses a chunk size of ~2000 with an overlap of 400. Before seeing Chroma’s research paper, this was my best approach so far; keep in mind I was using FAISS instead of Chroma, but now I’m ready to make the switch.
My Current Implementation⌗
```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=2000, chunk_overlap=400)
```
This will be the Optimized Implementation⌗
```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=250,        # reduced from 2000
    chunk_overlap=125,     # reduced from 400
    length_function=len,   # counts characters; swap in a token counter if needed
    separators=["\n\n", "\n", " ", ""],  # explicit separators, coarsest first
)
```
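For intuition about what a recursive character splitter actually does, here is a simplified pure-Python sketch: try the coarsest separator first, recurse into oversized pieces with finer separators, then greedily merge small pieces back together up to the size limit. (This is an illustration only; the real LangChain class additionally handles chunk overlap, custom length functions, and many edge cases.)

```python
def recursive_split(text, chunk_size=250, separators=("\n\n", "\n", " ", "")):
    """Simplified recursive character splitting (no overlap handling)."""
    if len(text) <= chunk_size:
        return [text] if text else []
    sep, *rest = separators
    if sep == "":
        # Last resort: hard cut at chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    pieces = []
    for part in text.split(sep):
        if len(part) > chunk_size:
            pieces.extend(recursive_split(part, chunk_size, rest))
        elif part:
            pieces.append(part)
    # Greedily merge adjacent pieces back together up to chunk_size.
    merged, current = [], ""
    for piece in pieces:
        candidate = current + sep + piece if current else piece
        if len(candidate) <= chunk_size:
            current = candidate
        else:
            merged.append(current)
            current = piece
    if current:
        merged.append(current)
    return merged

doc = ("Paragraph one. " * 10).strip() + "\n\n" + ("Paragraph two. " * 30).strip()
chunks = recursive_split(doc, chunk_size=200)
print([len(c) for c in chunks])  # every chunk stays under the 200-char limit
```

Note how the short first paragraph stays intact while the long second one is split at word boundaries: the splitter prefers natural breaks and only falls back to finer separators when it must.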
The reason this might be “Better”⌗
- Smaller chunk sizes: instead of 2000, I’ll be switching to a leaner 250, which should give more precise retrieval.
- The new overlap is of moderate size and should maintain context better across chunk boundaries.
- RecursiveCharacterTextSplitter with these parameters showed the best overall performance in Chroma’s evaluation.
Key Takeaways⌗
- Optimal chunking significantly improves RAG performance
- Smaller, well-balanced chunks often outperform larger ones
- The right parameters depend on your specific use case
I will also leave a link to Chroma’s full research paper (Link Here), along with Adam Lucek’s comprehensive guide on The BEST Way to Chunk Text for RAG.