The Problem RAG Solves

Large language models are trained on a fixed dataset with a knowledge cutoff date. They can't access your company's internal docs, recent data, or domain-specific information. When asked about things they don't know, they either refuse or hallucinate.

Retrieval-Augmented Generation (RAG) fixes this by giving the model access to external knowledge at query time. Instead of relying solely on what it memorized during training, the model retrieves relevant documents and uses them as context to generate answers.

How RAG Works

A RAG pipeline has three stages (the walkthrough below covers indexing in two steps, chunking and embedding):

  1. Indexing — Your documents are split into chunks, converted to vector embeddings, and stored in a vector database.
  2. Retrieval — When a user asks a question, the question is embedded and the most similar document chunks are retrieved.
  3. Generation — The retrieved chunks are inserted into the prompt as context, and the LLM generates an answer grounded in that context.
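
To make the three stages concrete, here is a toy sketch in plain Python. It is not a real RAG system: word-overlap scoring stands in for embeddings, the "generation" stage only builds the prompt that would be sent to an LLM, and the function names (index, retrieve, build_prompt) are hypothetical.

```python
# Toy sketch of the three stages. Word-overlap scoring stands in for
# embeddings, and the "generation" stage only builds the prompt.

def index(documents, chunk_words=40):
    """Stage 1: split each document into fixed-size word chunks."""
    chunks = []
    for doc in documents:
        words = doc.split()
        for start in range(0, len(words), chunk_words):
            chunks.append(" ".join(words[start:start + chunk_words]))
    return chunks

def retrieve(chunks, question, k=2):
    """Stage 2: rank chunks by how many question words they share."""
    q_words = set(question.lower().split())

    def overlap(chunk):
        return len(q_words & set(chunk.lower().split()))

    return sorted(chunks, key=overlap, reverse=True)[:k]

def build_prompt(retrieved, question):
    """Stage 3: place the retrieved chunks ahead of the question."""
    context = "\n\n".join(retrieved)
    return f"Context:\n{context}\n\nQuestion: {question}"

docs = [
    "Users log in with a password and receive a session token.",
    "Invoices are generated on the first day of each month.",
]
chunks = index(docs)
top = retrieve(chunks, "How do users log in?", k=1)
prompt = build_prompt(top, "How do users log in?")
print(prompt)
```

The steps that follow replace each toy piece with a real component: a proper text splitter, an embedding model, a vector database, and an LLM call.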

Step 1: Document Chunking

Raw documents are often too long to fit in an LLM's context window, and long passages match queries less precisely, so they're split into overlapping chunks, typically 200-500 tokens each.

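# Older LangChain releases: from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_text_splitters import RecursiveCharacterTextSplitter

# chunk_size and chunk_overlap are measured in characters by default,
# not tokens, so 500 here is well under 500 tokens.
splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=50,
    separators=["\n\n", "\n", ". ", " "]  # tried in order, coarsest first
)

# document_text is the raw text of one document, loaded elsewhere
chunks = splitter.split_text(document_text)
print(f"Created {len(chunks)} chunks")

Chunking matters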

Chunks too small lose context. Chunks too large dilute relevance. Overlap ensures ideas that span chunk boundaries aren't lost. Experiment with different sizes for your data.
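
To see what overlap actually does, the following minimal sketch implements a sliding-window splitter by hand. It assumes whitespace-separated words as a stand-in for tokens, and chunk_with_overlap is a hypothetical helper, not part of LangChain.

```python
def chunk_with_overlap(tokens, chunk_size, overlap):
    """Split a token list into windows that share `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the input
    return chunks

# Words stand in for tokens in this illustration.
tokens = "the quick brown fox jumps over the lazy dog today".split()
windows = chunk_with_overlap(tokens, chunk_size=4, overlap=1)
for window in windows:
    print(" ".join(window))
```

Each window starts with the last word of the previous one, so a phrase that straddles a boundary still appears whole in at least one chunk as long as it is no longer than the overlap plus one token.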

Step 2: Embedding and Indexing

Each chunk is converted into a dense vector (embedding) that captures its semantic meaning. Similar chunks will have similar vectors.

from openai import OpenAI

# Reads the API key from the OPENAI_API_KEY environment variable
client = OpenAI()

def embed(texts):
    """Return one embedding vector (a list of floats) per input string."""
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=texts
    )
    return [r.embedding for r in response.data]

# Embed all chunks
vectors = embed(chunks)

These vectors are stored in a vector database like Chroma, Pinecone, Weaviate, or pgvector. The database supports fast nearest-neighbor search.
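
Conceptually, nearest-neighbor search means "find the stored vectors closest to the query vector". The sketch below does this by brute force with squared Euclidean distance over made-up 3-dimensional vectors. It illustrates the idea only: real vector databases typically use approximate indexes to avoid scanning every vector, and the names here are hypothetical.

```python
import heapq

def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest_neighbors(query, vectors, k=2):
    """Return (index, distance) pairs for the k vectors closest to `query`.

    Every stored vector is scanned, so each query costs O(n).
    """
    scored = ((i, squared_l2(query, v)) for i, v in enumerate(vectors))
    return heapq.nsmallest(k, scored, key=lambda pair: pair[1])

# Made-up 3-dimensional "embeddings" for illustration only.
toy_vectors = [
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.8, 0.2, 0.1],
]
hits = nearest_neighbors([1.0, 0.0, 0.0], toy_vectors, k=2)
print(hits)  # closest first
```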

Step 3: Retrieval

When a user asks a question, embed the question using the same model and find the top-k most similar chunks:

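import chromadb

# Assumes the chunks and their vectors were already added to a "docs"
# collection stored at this (hypothetical) path.
chroma_client = chromadb.PersistentClient(path="./chroma_db")
collection = chroma_client.get_collection("docs")

question = "How does authentication work?"

# Pass the query embedding explicitly so the question is embedded with
# the same model as the chunks, rather than Chroma's default embedder.
results = collection.query(
    query_embeddings=embed([question]),
    n_results=5
)

# Results are grouped per query; take the documents for our single query
relevant_chunks = results["documents"][0]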

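Cosine similarity is a common metric: the higher the score, the more semantically similar the chunk is to the query. Note that Chroma reports distances rather than similarities, so lower values mean closer matches, and its default distance is squared L2 unless the collection is configured for cosine.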
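
As a worked illustration, cosine similarity is the dot product of two vectors divided by the product of their lengths, so it depends only on direction. The sketch below uses made-up 2-dimensional vectors; returning 0.0 for a zero vector is a convention chosen for this example.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)

query = [1.0, 0.0]
same_direction = [2.0, 0.0]   # longer, but points the same way
orthogonal = [0.0, 3.0]       # unrelated direction
opposite = [-1.0, 0.0]        # points the opposite way

print(cosine_similarity(query, same_direction))  # 1.0
print(cosine_similarity(query, orthogonal))      # 0.0
print(cosine_similarity(query, opposite))        # -1.0
```

Because length is divided out, a vector and a scaled copy of it score 1.0. Cosine distance, where lower means more similar, is simply 1 minus this value.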

Step 4: Generation with Context

The retrieved chunks are injected into the prompt, and the LLM generates an answer grounded in the retrieved information:

context = "\n\n".join(relevant_chunks)

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "Answer questions using only the provided context. If the context doesn't contain the answer, say so."},
        {"role": "user", "content": f"Context:\n{context}\n\nQuestion: How does authentication work?"}
    ]
)
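
In practice the retrieved chunks may not all fit in the prompt, and it helps to be able to tell which chunk an answer came from. The following sketch, with hypothetical helpers build_context and build_messages, numbers the chunks and stops adding them once a character budget is reached. A character budget is a simplification assumed here; a token count would be more precise.

```python
SYSTEM_PROMPT = (
    "Answer questions using only the provided context. "
    "If the context doesn't contain the answer, say so."
)

def build_context(chunks, max_chars=2000):
    """Number the chunks and stop before exceeding `max_chars`."""
    parts = []
    used = 0
    for i, chunk in enumerate(chunks, start=1):
        entry = f"[{i}] {chunk}"
        if used + len(entry) > max_chars:
            break  # chunks arrive best-first, so the tail is dropped
        parts.append(entry)
        used += len(entry) + 2  # account for the "\n\n" separator
    return "\n\n".join(parts)

def build_messages(question, chunks, max_chars=2000):
    """Assemble chat messages in the same shape as the call above."""
    context = build_context(chunks, max_chars)
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
    ]

sample_chunks = ["a" * 10, "b" * 10, "c" * 10]
messages = build_messages("How does authentication work?", sample_chunks, max_chars=30)
print(messages[1]["content"])
```

The resulting list can be passed as the messages argument of the chat completion call. With a budget of 30 characters only the first two sample chunks fit, so the third is left out.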

When to Use RAG

RAG vs Fine-Tuning

These are complementary, not competing approaches:

Common Pitfalls

RAG is one of the most practical ways to make LLMs useful for real-world applications. It's not glamorous, but it works, and that's what matters in production.