For the last few days I’ve been on a deep dive into what might be the best methods for a consumer-level AI model to retrieve and store its data.

If you have ever tried building a simple AI model, you can quickly understand why a simple RAG system without an actual vector database isn’t viable, at least while trying to maintain a simple approach (without the use of SQL or PGAI Vectorizer).


What’s Chunking?

If you are reading this, I presume you are already familiar with what chunking means, but if you haven’t yet encountered the term: chunking is simply breaking large documents or text into smaller, manageable pieces.

In simple terms:

  • Instead of dealing with entire documents, you split them into bite-sized chunks
  • Each chunk contains related information
  • When answering questions, the system only retrieves the relevant chunks
  • Good chunking finds the right balance between including enough context without too much irrelevant information
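To make the idea concrete, here is a minimal, character-based sketch of a fixed-size splitter with overlap. This is a toy illustration, not Chroma’s or LangChain’s actual implementation:

```python
def chunk_text(text: str, chunk_size: int = 250, overlap: int = 125) -> list[str]:
    """Split text into overlapping fixed-size chunks (character-based)."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks

doc = "word " * 200  # a 1,000-character stand-in for a real document
chunks = chunk_text(doc, chunk_size=250, overlap=125)
print(len(chunks))     # number of chunks produced
print(len(chunks[0]))  # each chunk is at most 250 characters
```

Because each chunk overlaps its neighbor by half, information near a chunk boundary still appears whole in at least one chunk, which is the point of overlap.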

Best Chunking Methods - According to Chroma Research

Some of the key findings that stand out are:

1. Best Overall Performance

  • Method: RecursiveCharacterTextSplitter
  • Parameters:
    • Size: 250 tokens
    • Overlap: 125 tokens
    • Retrieve: 10 chunks
  • Performance: Highest recall of 96.4 ± 16.9%

2. Alternative Strong Performers

  • Method: ClusterSemanticChunker (novel strategy)
  • Parameters:
    • Size: 200 tokens
    • Overlap: 0 tokens
    • Retrieve: 5 chunks
  • Performance: Good precision and IoU scores

3. Performance Trade-offs

  • Larger chunk sizes (250 vs 200) → Better recall
  • Overlap → Better recall but lower precision
  • More retrieved chunks (10 vs 5) → Better recall but lower precision
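The storage cost behind these trade-offs is easy to estimate: with overlap, each chunk only advances by `chunk_size - overlap` tokens, so a 125-token overlap on 250-token chunks roughly doubles the number of chunks you store and embed. A quick back-of-the-envelope sketch (the 10,000-token document is a made-up example):

```python
import math

def estimate_chunks(total_tokens: int, chunk_size: int, overlap: int) -> int:
    """Rough count of chunks produced by a sliding window over total_tokens."""
    step = chunk_size - overlap  # effective stride of the window
    return max(1, math.ceil((total_tokens - overlap) / step))

# A hypothetical 10,000-token document:
print(estimate_chunks(10_000, chunk_size=250, overlap=125))  # 79 chunks
print(estimate_chunks(10_000, chunk_size=200, overlap=0))    # 50 chunks
```

That extra storage is what buys the recall improvement; the precision loss comes from retrieving more near-duplicate text per query.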

Currently my model is using a chunk size of ~2000 with an overlap of 400. Before seeing Chroma’s research paper, this was my best approach so far; keep in mind I was using FAISS instead of Chroma. But now I’m ready to make the switch.

My Current Implementation

from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=2000, chunk_overlap=400)

This will be the Optimized Implementation

from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=250,      # Reduced from 2000
    chunk_overlap=125,   # Reduced from 400
    length_function=len,
    separators=["\n\n", "\n", " ", ""]  # Explicit separators
)
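The `separators` list is what makes the splitter “recursive”: it first tries to break on paragraph boundaries, then falls back to line breaks, spaces, and finally single characters whenever a piece is still too long. A simplified sketch of that fallback logic (not LangChain’s actual source, and it omits the step where the real splitter merges small pieces back together up to `chunk_size`):

```python
def recursive_split(text: str, chunk_size: int, separators: list[str]) -> list[str]:
    """Split on the first separator; recurse on oversized pieces with the next one."""
    if len(text) <= chunk_size:
        return [text]
    sep, rest = separators[0], separators[1:]
    pieces = list(text) if sep == "" else text.split(sep)
    chunks = []
    for piece in pieces:
        if len(piece) > chunk_size and rest:
            # still too big: retry with the next, finer-grained separator
            chunks.extend(recursive_split(piece, chunk_size, rest))
        elif piece:
            chunks.append(piece)
    return chunks

text = "First paragraph.\n\nSecond paragraph that runs a little longer than the limit."
for chunk in recursive_split(text, chunk_size=40, separators=["\n\n", "\n", " ", ""]):
    print(chunk)
```

The practical effect is that chunks tend to end at natural boundaries (paragraphs, sentences on their own lines) instead of cutting words or ideas in half.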

The reasons this might be “better”:

  1. Smaller chunk sizes: instead of the 2000-token chunks, I’ll be switching to a leaner 250, which should give more precise retrieval.
  2. The new overlap is of moderate size (125) and should maintain context across chunk boundaries better.
  3. These RecursiveCharacterTextSplitter parameters showed the best overall performance in Chroma’s evaluation.

Key Takeaways

  • Optimal chunking significantly improves RAG performance
  • Smaller, well-balanced chunks often outperform larger ones
  • The right parameters depend on your specific use case

I will also leave a link to Chroma’s full research paper: Link Here. And to Adam Lucek’s comprehensive guide on The BEST Way to Chunk Text for RAG.