RAG Explained Simply: Real-time Data & Why It Matters
November 22, 2024 • AI • Technical • LLM
Retrieval‑Augmented Generation (RAG) lets large language models answer with facts from your data—docs, websites, PDFs, databases—without retraining the model for every update. Think of it as giving your model a fast, trustworthy library at prompt‑time. This explainer keeps the math simple and the steps practical so developers, students, and advanced users can build and debug RAG systems with confidence.
Related: Model Context Protocol (MCP) servers—a standard way to expose secure tools (retrievers, rerankers, data access) that LLM apps and agents can discover and call.
Table of contents
- Why RAG (and why not just a bigger LLM)?
- High‑level architecture
- Ingestion: chunking, cleaning, and metadata
- Indexing: keywords, vectors, and hybrid search
- Query time: retrieval, reranking, and grounding
- Generation: prompts, guards, and citations
- Evaluation: correctness, coverage, and cost
- Production patterns: latency, caching, and safety
- Pitfalls and debugging
- FAQs
- Bottom line
Why RAG (and why not just a bigger LLM)?
Models memorize patterns from training data but don’t know your private docs or yesterday’s prices. Finetuning every time your data changes is slow, costly, and often unnecessary. RAG keeps your truth separate from the model: you update a search index, and the model cites the latest snippets at answer time. Benefits: freshness, verifiability, smaller models, and permissions that mirror your data store.
High‑level architecture
User Question
│
├─► Retriever (hybrid keyword + vector)
│ └─► Top‑K passages + metadata
│
├─► (Optional) Reranker (cross‑encoder)
│ └─► Top‑N grounded contexts
│
└─► LLM (prompt = question + contexts + instructions)
└─► Answer with citations
The retriever finds candidates, a reranker sorts the best few, and the LLM writes an answer using only what it sees. Grounding means the model’s claims are supported by retrieved text.
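In code, the whole flow is a short function. A minimal sketch, where retrieve, rerank, and generate are placeholders you wire to your own search index, cross-encoder, and LLM call (illustrative names, not a specific library's API):
// Illustrative RAG pipeline: retrieve → (optionally) rerank → generate.
async function answerQuestion(question, { retrieve, rerank, generate }) {
  const candidates = await retrieve(question);                    // Top-K passages + metadata
  const contexts = rerank ? await rerank(question, candidates) : candidates;
  return generate({ question, contexts: contexts.slice(0, 8) });  // answer grounded in Top-N contexts
}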
Ingestion: chunking, cleaning, and metadata
- Chunking: Split long docs into small, overlapping passages (e.g., 300–800 tokens, 10–20% overlap).
- Cleaning: Remove boilerplate nav, fix encoding, preserve headings and table structure.
- Metadata: Keep source URL, title, section, date, product/version, and permissions.
function chunk(document) {
  // document = { text, heading } (assumed shape); sizes are in tokens, with words as a rough proxy
  const CHUNK = 500, OVERLAP = 80;
  const words = document.text.split(/\s+/), chunks = [];
  // return overlapping slices, preserving headings in metadata
  for (let i = 0; i < words.length; i += CHUNK - OVERLAP)
    chunks.push({ text: words.slice(i, i + CHUNK).join(' '), heading: document.heading });
  return chunks;
}
Indexing: keywords, vectors, and hybrid search
Keyword search (BM25) is precise for exact terms; vector search matches meaning. Hybrid retrievers combine both: start with BM25 for precision, union with vectors for recall, and deduplicate.
- Embeddings: Use a strong multilingual embedding; store vectors with metadata.
- Filters: Apply product/version/date filters to cut noise before scoring.
- Scoring: Normalize and blend scores (e.g., 0.6 vector + 0.4 BM25).
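The scoring step as a sketch, assuming each candidate already carries raw vectorScore and bm25Score fields (the names and the 0.6/0.4 weights are illustrative starting points):
function hybridScore(results, wVec = 0.6, wBm25 = 0.4) {
  // min-max normalize each score list so the two systems are comparable
  const norm = (xs) => {
    const lo = Math.min(...xs), hi = Math.max(...xs);
    return xs.map((x) => (hi === lo ? 0 : (x - lo) / (hi - lo)));
  };
  const vec = norm(results.map((r) => r.vectorScore));
  const bm25 = norm(results.map((r) => r.bm25Score));
  return results
    .map((r, i) => ({ ...r, score: wVec * vec[i] + wBm25 * bm25[i] }))
    .sort((a, b) => b.score - a.score); // best blended score first
}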
Query time: retrieval, reranking, and grounding
- Rewrite the query if needed (expand acronyms, unify terms).
- Retrieve 50–200 candidates with hybrid search + filters.
- Rerank to 5–10 contexts with a cross‑encoder trained on relevance.
- Assemble a prompt with the question, contexts, and instructions (citations required).
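Step 3 as a sketch: scoreRelevance below stands in for whichever cross-encoder you call and is not a real API.
async function rerankContexts(question, candidates, scoreRelevance, topN = 8) {
  const scored = await Promise.all(
    candidates.map(async (c) => ({ ...c, rel: await scoreRelevance(question, c.text) }))
  );
  return scored.sort((a, b) => b.rel - a.rel).slice(0, topN); // Top-N grounded contexts
}
The assembled prompt for step 4 then looks like: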
PROMPT = [SYSTEM] You answer only with provided context. If missing, say you don't know.
[QUESTION]
[CONTEXT 1..N with source, title, url]
Generation: prompts, guards, and citations
- Ask for inline citations after each claim or a reference list at the end.
- Require a “Missing info” section when context is insufficient.
- For tasks with actions (code, SQL), add validators and tests in a tool‑use loop.
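A lightweight guard for the first two points, assuming inline citations are written as [1], [2], … (a sketch, not a complete validator):
function checkAnswer(answer, contextCount) {
  const cited = [...answer.matchAll(/\[(\d+)\]/g)].map((m) => Number(m[1]));
  return {
    hasCitations: cited.length > 0,                                  // at least one inline citation
    citationsValid: cited.every((n) => n >= 1 && n <= contextCount), // no references to missing contexts
    flagsMissingInfo: /missing info/i.test(answer),                  // "Missing info" section present
  };
}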
Evaluation: correctness, coverage, and cost
Create a small, realistic eval set (50–200 questions). Score groundedness (does the answer cite the right passage?), factuality, completeness, and helpfulness. Track latency and cost alongside quality. Re‑run after model or index updates.
- Groundedness: Every claim should trace to a context span.
- Coverage: All key points present; no critical omissions.
- Safety: No leakage of confidential info; refusal policies respected.
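A tiny harness along these lines; the return shape (answer, sources, costUsd) and the expectedSources field are assumptions about your pipeline and eval set:
// Run each eval question through the pipeline; record groundedness, latency, and cost.
async function runEval(questions, answerQuestion) {
  const rows = [];
  for (const q of questions) {
    const t0 = Date.now();
    const { answer, sources, costUsd } = await answerQuestion(q.text); // your RAG pipeline
    rows.push({
      id: q.id,
      grounded: sources.some((s) => q.expectedSources.includes(s)), // cited at least one expected source
      nonEmpty: answer.trim().length > 0,
      latencyMs: Date.now() - t0,
      costUsd,
    });
  }
  return rows; // aggregate and track over time alongside manual quality reviews
}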
Production patterns: latency, caching, and safety
- Caching: Cache embeddings and retrieval results for popular queries.
- Batching: Batch embedding and reranking calls to cut cost/latency.
- Async enrich: Add slow extras (tables, charts) after the first useful reply.
- Permissions: Enforce row/tenant filters at retrieval time.
- Observability: Log queries, contexts, sources, and user outcomes.
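For the caching item above, a sketch of an in-memory retrieval cache with a TTL so answers stay fresh (a production setup would more likely use Redis or your gateway's cache):
const retrievalCache = new Map();
async function cachedRetrieve(query, retrieve, ttlMs = 10 * 60 * 1000) {
  const key = query.trim().toLowerCase();       // naive normalization for the cache key
  const hit = retrievalCache.get(key);
  if (hit && Date.now() - hit.at < ttlMs) return hit.results;
  const results = await retrieve(query);        // cache miss: hit the index
  retrievalCache.set(key, { results, at: Date.now() });
  return results;
}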
Pitfalls and debugging
- Bad chunks: too big (dilutes relevance) or too small (loses meaning). Tune size/overlap.
- Wrong index: use hybrid; pure vector misses rare terms, pure BM25 misses paraphrases.
- No filters: add product/version/date to avoid stale or off‑topic passages.
- Prompt bloat: keep contexts under model limits; prefer fewer, high‑relevance chunks.
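For the prompt-bloat point, a sketch that keeps only as many top-ranked contexts as fit a token budget (the 4-characters-per-token estimate is a rough heuristic, not exact):
function fitToBudget(contexts, maxTokens = 6000, estimateTokens = (t) => Math.ceil(t.length / 4)) {
  const kept = [];
  let used = 0;
  for (const c of contexts) {            // contexts assumed sorted best-first
    const cost = estimateTokens(c.text);
    if (used + cost > maxTokens) break;  // stop before exceeding the model's context limit
    kept.push(c);
    used += cost;
  }
  return kept;
}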
FAQs
When do I finetune vs. use RAG? Finetune for style/format or stable domain skills; RAG for changing facts.
How big should chunks be? Start 300–800 tokens with 10–20% overlap; validate on your eval set.
Which embedding? Pick one strong multilingual model; consistency beats hopping models.
Bottom line
RAG works when retrieval is strong, prompts are disciplined, and answers are grounded with citations. Build a small eval set, tune chunking and hybrid search, and add lightweight guards. You’ll get fresher, verifiable answers without chasing ever‑larger models.
Real‑world use case: Ground a feature doc with RAG
Answer “what to build” with cited internal docs.
- Create small eval set
- Index spec docs
- Answer 10 queries with citations
Expected outcome: Better decisions with verifiable references.
Implementation guide
- Time: 60 minutes
- Tools: Embeddings DB (lite), Eval questions
- Prerequisites: Access to spec docs
Steps:
- Write 10 representative product questions.
- Index 5–10 core docs; retrieve top‑k; answer with citations.
- Note failures; adjust chunking or top‑k; repeat.
Prompt snippet
Answer using only these passages: [snippets]. Cite passage IDs. If unknown, say so.
Related Articles
MCP Server Use Cases
Exploring Model Context Protocol servers and their practical applications in AI development.
LLM Prompting: Getting Effective Output
Best practices for prompting large language models to get the results you need consistently.
Vibe Coding: Ship Faster with Focused Flow
Build software in a state of focused flow—guided by rapid feedback and clear intent—without abandoning engineering discipline.