Technical
Document preparation: the key performance factor in RAG systems
In a RAG project, 80% of final quality is decided before the model: in how sources are cleaned, segmented, and indexed.
RAG demos are misleading: they work on clean, well-labelled corpora. In companies, documents are messy, contradictory, and poorly structured. This article describes what happens upstream, and why that is where quality is decided.
RAG does not fix your data
A RAG system (Retrieval-Augmented Generation) queries a knowledge base before generating an answer. That is powerful, but it inherits the quality of the base wholesale. A badly scanned PDF, a table exported without headers, an undetected duplicate: the model will faithfully reproduce the problem.
[Migration in progress - full article body to be brought across from the original Notion source.]
The preparation checklist
- Cleaning - quality OCR, footer removal, encoding normalization.
- Segmentation - semantic chunking, not mechanical slicing.
- Metadata - date, source, status (current / outdated).
- Deduplication - two versions of the same document means two contradictory votes.
- Evaluation - a reference question-and-answer set before any production launch.
Keep reading
AI Literacy • 2 Oct 2025 • EN
When should you start a new AI chat, and when should you continue?
The context window is a finite resource. Knowing when to reset and when to carry on is one of the highest-leverage skills in AI literacy.
AI Literacy • 18 Sept 2025 • EN
How to spot a hallucination before it spots you
Five practical tells that an AI answer is fabricated, written for non-technical readers who want to trust their tools without being burned by them.
AI Adoption • 12 Sept 2025 • EN
5 counter-intuitive lessons from AI training and coaching
Five lessons from supporting dozens of teams through AI activation: the opposite of what the market keeps promising.