RAPHAEL THYS

Technical

Document preparation: the key performance factor in RAG systems

In a RAG project, 80% of final quality is decided before the model: in how sources are cleaned, segmented, and indexed.

Raphael Thys · 10 min read
Diagram of a RAG pipeline, from source document to user query

RAG demos are misleading: they work on clean, well-labelled corpora. In companies, documents are messy, contradictory, and poorly structured. This article describes what happens upstream, and why that is where quality is decided.

RAG does not fix your data

A RAG system (Retrieval-Augmented Generation) queries a knowledge base before generating an answer. That is powerful, but the system inherits the quality of that base wholesale. A badly scanned PDF, a table exported without headers, an undetected duplicate: the model will faithfully reproduce the problem.
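
To make this concrete, here is a minimal sketch of the retrieval step, assuming a toy word-overlap scorer in place of a real embedding model. The corpus, the `tokens` and `retrieve` helpers, and the policy texts are all invented for illustration. The only point is that whatever ranks highest is handed to the model unchanged, contradictions included.

```python
import re


def tokens(text: str) -> set[str]:
    """Lowercased word set; a crude stand-in for a real embedding."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query: str, corpus: list[str], k: int = 2) -> list[str]:
    """Return the k passages that share the most words with the query."""
    q = tokens(query)
    ranked = sorted(corpus, key=lambda doc: len(q & tokens(doc)), reverse=True)
    return ranked[:k]


# Hypothetical knowledge base: the same policy is indexed in two versions.
corpus = [
    "Remote work policy 2021: employees may work remotely two days per week.",
    "Remote work policy 2024: employees may work remotely three days per week.",
    "Expense policy: meals are reimbursed up to a daily limit.",
]

context = retrieve("How many days of remote work per week?", corpus)

# Both versions score equally, so both land in the prompt and the model
# has no way to tell which one is still valid.
prompt = "Answer using only this context:\n" + "\n".join(context)
print(prompt)
```

Nothing in the retrieval step knows that the 2021 text is outdated. That information has to be added upstream, during preparation.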

[Migration in progress - full article body to be brought across from the original Notion source.]

The preparation checklist

  1. Cleaning - quality OCR, footer removal, encoding normalization.
  2. Segmentation - semantic chunking, not mechanical slicing.
  3. Metadata - date, source, status (current / outdated).
  4. Deduplication - two versions of the same document mean two contradictory votes at retrieval time.
  5. Evaluation - a reference question-and-answer set before any production launch.
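
The checklist above can be sketched in a few functions. This is an illustrative outline, not a production pipeline, and it rests on several assumptions: footers look like "Page N of M", paragraph boundaries stand in for semantic boundaries, and duplicates are exact once whitespace and case are normalized. Every name here (`Chunk`, `clean`, `segment`, `deduplicate`, `hit_rate`) is hypothetical.

```python
import hashlib
import re
import unicodedata
from dataclasses import dataclass


@dataclass
class Chunk:
    """Step 3: every chunk carries its metadata with it."""
    text: str
    source: str
    date: str    # ISO date of the source document, e.g. "2024-03-01"
    status: str  # "current" or "outdated"


def clean(raw: str) -> str:
    """Step 1: normalize encoding and drop footer lines.

    Assumption: footers look like "Page 3" or "Page 3 of 12".
    """
    text = unicodedata.normalize("NFKC", raw)
    footer = re.compile(r"\s*Page \d+( of \d+)?\s*")
    lines = [ln for ln in text.splitlines() if not footer.fullmatch(ln)]
    return "\n".join(lines).strip()


def segment(text: str, max_chars: int = 500) -> list[str]:
    """Step 2: split on blank lines, then pack whole paragraphs into chunks.

    Paragraphs are never cut mid-sentence; a paragraph longer than
    max_chars becomes its own oversized chunk.
    """
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    chunks: list[str] = []
    current = ""
    for p in paragraphs:
        if current and len(current) + len(p) + 2 > max_chars:
            chunks.append(current)
            current = p
        else:
            current = f"{current}\n\n{p}" if current else p
    if current:
        chunks.append(current)
    return chunks


def fingerprint(text: str) -> str:
    """Hash of the text after lowercasing and collapsing whitespace."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def deduplicate(chunks: list[Chunk]) -> list[Chunk]:
    """Step 4: keep one chunk per fingerprint, preferring the latest date."""
    best: dict[str, Chunk] = {}
    for c in chunks:
        key = fingerprint(c.text)
        if key not in best or c.date > best[key].date:
            best[key] = c
    return list(best.values())


def hit_rate(retrieve, reference: list[tuple[str, str]]) -> float:
    """Step 5: share of reference questions whose expected phrase is retrieved.

    `retrieve` is any callable mapping a question to a list of passages.
    """
    hits = sum(
        expected.lower() in " ".join(retrieve(question)).lower()
        for question, expected in reference
    )
    return hits / len(reference)


# Usage on a small invented document.
raw = (
    "Refund policy\u00a0v2\n\n"
    "Refunds are issued within 14 days.\n"
    "Page 1 of 3\n\n"
    "Contact support to start a refund."
)
chunks = [
    Chunk(text, "refund_policy.pdf", "2024-03-01", "current")
    for text in segment(clean(raw), max_chars=60)
]
```

Hash-based deduplication only catches copies that are identical after normalization. Two genuinely different versions of the same document need either near-duplicate detection or a filter on the `status` field, which is why the metadata step matters.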
