How LLMs learned to look something up before answering — RAG explained.
Architectures 12 min Intermediate April 26, 2026
You ask ChatGPT about your company's vacation policy. It confidently invents an answer — because it has never seen your HR handbook. RAG fixes this: instead of trusting the model's memory, it teaches the model to look things up first.
In this article, you will learn how RAG works: from the core principle through data preparation to searching vector databases. You will discover why the quality of a RAG application critically depends on how documents are prepared.
The RAG Principle — Look It Up, Don't Guess
Retrieval-Augmented Generation (RAG) extends a language model with an external knowledge base. Instead of relying on knowledge baked into the model's weights during training, the system retrieves relevant documents at query time and injects them into the prompt as context. The concept was introduced by Patrick Lewis et al. in 2020. RAG addresses three fundamental LLM problems: hallucinations, outdated knowledge, and data privacy.
Retrieval-Augmented Generation
Analogy:
A standard LLM is like a student taking a closed-book exam — they can only use what they memorized during studying. Sometimes they remember correctly, sometimes they confidently write something plausible but wrong. RAG is like an open-book exam: the student is allowed to look up facts in their textbook before writing their answer. They still need to understand the question and formulate a good answer, but the factual basis comes from a verified source.
Definition:
RAG is an architecture that connects an LLM with an external knowledge base. Documents are indexed, converted into vectors, and stored in a database. At query time, the most relevant document sections are retrieved and injected as context into the prompt. The model generates its answer based on these retrieved facts.
In a real open-book exam, the student picks up the textbook and flips through it. In RAG, the "flipping" is automated by a retrieval algorithm that might return the wrong pages. The quality of the lookup is not guaranteed.
The RAG Pipeline in Three Steps
1. Indexing: Split documents into chunks, encode as vectors, and store in a vector database.
2. Retrieval: Encode the user query as a vector and find the most similar chunks via similarity search.
3. Augmented Generation: Inject the retrieved chunks as context into the prompt and let the LLM generate an answer.
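The first two steps can be sketched in a few lines. This is a toy illustration, not a production setup: the `embed` function is a hypothetical stand-in that counts words instead of calling a real embedding model, the "vector database" is a plain Python list, and the three sample chunks are invented for the example.

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Step 1, indexing: store each chunk together with its vector.
chunks = [
    "Special leave must be requested from your manager in writing.",
    "The cafeteria is open from 8 am to 3 pm on weekdays.",
    "Expense reports are due by the fifth day of each month.",
]
index = [(chunk, embed(chunk)) for chunk in chunks]

def retrieve(query, k=1):
    """Step 2, retrieval: rank stored chunks by similarity to the query."""
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]

print(retrieve("How do I apply for special leave?"))
```

In step 3, the returned chunks would be placed into the prompt that is sent to the language model. A word-count vector only matches identical words, which is exactly the weakness a learned embedding model is meant to remove.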
Hallucinations: Answers are grounded in real documents instead of invented knowledge.
Up-to-date: The knowledge base can be updated at any time, with no expensive retraining needed.
Data Privacy: Sensitive data stays in your own database.
A company chatbot has 500 internal PDFs (HR handbook, compliance guidelines, product manuals). An employee asks: "How do I apply for special leave?" The system converts the question into a vector, finds the 3 most relevant sections from the current HR handbook, and injects them into the prompt. The language model generates an answer with the correct procedure. Without RAG, it would either refuse or invent a plausible but wrong procedure.
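The injection step from this example might look roughly like the sketch below. The prompt wording, the `build_prompt` helper, and the three handbook excerpts are all hypothetical. Real systems tune the template to their model and use case.

```python
def build_prompt(question, retrieved_chunks):
    """Inject retrieved chunks into the prompt as numbered context."""
    context = "\n".join(
        f"[{i}] {chunk}" for i, chunk in enumerate(retrieved_chunks, start=1)
    )
    return (
        "Answer the question using only the context below. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )

# Hypothetical excerpts standing in for the 3 retrieved handbook sections.
top_chunks = [
    "Special leave is requested through the HR portal.",
    "Requests need approval from the line manager.",
    "Approved special leave is confirmed by email.",
]
prompt = build_prompt("How do I apply for special leave?", top_chunks)
print(prompt)
```

Numbering the chunks is one simple way to let the model point back to the section an answer came from.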
RAG dramatically reduces hallucinations but does not eliminate them completely. The model can still misinterpret retrieved chunks, combine information from unrelated chunks incorrectly, or generate fluent nonsense when the retrieval step returns irrelevant results. RAG shifts the problem from "the model invents facts" to "the model might misread its reference material."
Interactive: RAG Pipeline as a Graph
The RAG pipeline consists of several stations that pass data along. The interactive graph shows the connections between documents, chunking, embedding, vector database, retrieval, and LLM, and lets you compare how breadth-first search (BFS) and depth-first search (DFS) take different paths through the pipeline.
Chunking & Embeddings — Preparing the Library
Before documents can be searched semantically, they must be split into smaller pieces called chunks and transformed into embedding vectors. Chunk size determines search quality: very small chunks (around 200 tokens or fewer) are precise but lose context, while very large chunks (1000 tokens or more) preserve context but dilute search precision. Overlap (50-100 tokens) prevents information loss at cut boundaries.
Chunking
Analogy:
Imagine you have a 200-page manual. Keeping it as one book makes targeted search impossible. Copying every single sentence onto a separate card is too fragmented. The practical solution: index cards that each contain one coherent paragraph, with overlap at the edges. Then for each card, write a one-line summary on the tab — that is the embedding.
Definition:
Chunking splits documents into searchable text blocks of typically 200-1000 tokens. Two parameters control quality: chunk size (how large each block is) and overlap (how many tokens adjacent chunks share). After chunking, an embedding model converts each chunk into a high-dimensional vector (e.g., 1536 dimensions). Semantically similar texts produce vectors that are close together in vector space.
Index card summaries are written by a human who understands the content. Embedding models compute vectors mathematically — they capture semantic similarity but not necessarily logical relationships.
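The two parameters from the definition, chunk size and overlap, can be illustrated with a minimal sliding-window splitter. This sketch assumes the document has already been turned into a list of tokens. The `chunk_tokens` name and the tiny sample values are made up for the example, and a real pipeline would use the tokenizer that belongs to its embedding model.

```python
def chunk_tokens(tokens, chunk_size, overlap):
    """Split a token list into chunks that share `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    stride = chunk_size - overlap  # how far each new chunk moves forward
    chunks = []
    for start in range(0, len(tokens), stride):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this chunk already reaches the end of the document
    return chunks

tokens = [f"t{i}" for i in range(10)]
for chunk in chunk_tokens(tokens, chunk_size=4, overlap=1):
    print(chunk)
```

With an overlap of 1, the last token of each chunk reappears as the first token of the next one, so a sentence that straddles a boundary is not lost entirely.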
Too-small chunks (200 tokens)
High precision but context loss. A pronoun like "it" cannot be resolved if the referent appeared in the previous chunk. For example, a warranty sentence ends up isolated from the context that explains what counts as a "manufacturing defect."
Too-large chunks (2000 tokens)
Context is preserved, but search precision is diluted. The warranty paragraph is mixed with unrelated content about product specifications, so each result carries more irrelevant text.
A 50-page product manual is chunked with 500 tokens per chunk and 100 tokens of overlap, so each new chunk starts 400 tokens after the previous one. For a manual of roughly 48,000 tokens (an assumed length), this produces about 120 chunks. Each chunk is converted into a vector with 1536 dimensions: a matrix of 120 x 1536 numbers stored in the vector database. When searching for "warranty conditions," the query is also embedded into a 1536-dimensional vector and compared against all 120 stored vectors.
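The numbers in this example can be checked with a short calculation. The text does not state how many tokens the manual contains, so the sketch below assumes about 48,000 tokens, a figure chosen because it is consistent with 120 chunks. The `chunk_count` helper is illustrative.

```python
import math

def chunk_count(total_tokens, chunk_size, overlap):
    """Number of sliding-window chunks needed to cover the document."""
    if total_tokens <= chunk_size:
        return 1 if total_tokens > 0 else 0
    stride = chunk_size - overlap
    return math.ceil((total_tokens - chunk_size) / stride) + 1

ASSUMED_MANUAL_TOKENS = 48_000  # assumption, not stated in the text
n_chunks = chunk_count(ASSUMED_MANUAL_TOKENS, chunk_size=500, overlap=100)
dims = 1536
print(n_chunks, "chunks ->", n_chunks * dims, "stored numbers")
```

Under that assumption the 120 x 1536 matrix holds 184,320 numbers. A manual with a different token count would produce proportionally more or fewer chunks.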
Misconception: Smallest possible chunks give the best precision
Smaller chunks increase precision but decrease recall and context. A chunk containing only "It lasts 24 months" without the surrounding context is useless — the "It" has no referent. The art of chunking is finding the size where each chunk is self-contained enough to be useful without being so large that it dilutes the search.
Chunking Strategies Compared
Fixed size: Document is cut into uniform blocks. Simple to implement but ignores content structure. A paragraph can be split mid-sentence.
Paragraph-based: Natural paragraph boundaries are respected. Preserves topical coherence but produces chunks of varying sizes.
Semantic: An embedding model detects topic changes in the text and sets boundaries there. Best quality but computationally expensive and more complex to implement.
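The difference between the first two strategies shows up even on a tiny input. The sketch below is a simplification: it cuts by characters instead of tokens to stay dependency-free, and the sample manual text is invented. Semantic chunking is left out because it needs an embedding model.

```python
def fixed_size_chunks(text, size):
    """Cut text into uniform character blocks, ignoring structure."""
    return [text[i:i + size] for i in range(0, len(text), size)]

def paragraph_chunks(text):
    """Split on blank lines so each chunk is one whole paragraph."""
    return [p.strip() for p in text.split("\n\n") if p.strip()]

manual = (
    "The warranty lasts 24 months.\n\n"
    "Manufacturing defects are covered. Wear and tear is not.\n\n"
    "Contact support to file a claim."
)

fixed = fixed_size_chunks(manual, 40)
paragraphs = paragraph_chunks(manual)

print(repr(fixed[0]))                # ends in the middle of a word
print([len(p) for p in paragraphs])  # whole paragraphs, varying sizes
```

The fixed-size cut splits "Manufacturing" in half, while the paragraph-based split keeps each thought intact at the cost of uneven chunk lengths.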
Vector Databases — The AI's Memory
A vector database is a specialized storage system for embedding vectors. Unlike traditional relational databases that match exact values (SQL: WHERE name = 'Max'), vector databases search by mathematical similarity — they find the stored vectors closest to a query vector in high-dimensional space.
Vector Database
Analogy:
A traditional SQL database is like a library catalog — you search for an exact title, author, or ISBN, and it either finds a match or doesn't. A vector database is like a recommendation system: you describe what you are looking for in your own words, and it returns the items whose descriptions are most similar to yours — even if none match your exact wording.
Definition:
Vector databases use distance metrics like cosine similarity or Euclidean distance to calculate the proximity of two vectors. For large-scale systems, Approximate Nearest Neighbor (ANN) algorithms trade a small amount of accuracy for massive speed gains. The ecosystem includes ChromaDB (beginner-friendly), FAISS (Facebook's library for fast local search), Pinecone (managed cloud solution), pgvector (PostgreSQL extension), and Weaviate (open source).
A real recommendation system uses collaborative filtering (what did similar users like?). A vector database relies purely on mathematical vector proximity — it captures semantic similarity but not user preferences.
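The two distance metrics named in the definition above behave differently, which a small sketch can make concrete. The vectors here are hypothetical 2-dimensional examples. Real embeddings have hundreds or thousands of dimensions, but the formulas are the same.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between a and b: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def euclidean_distance(a, b):
    """Straight-line distance between a and b: 0.0 means identical."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

query = [3.0, 4.0]
same_direction = [6.0, 8.0]   # twice as long, same direction
perpendicular = [4.0, -3.0]   # same length, at a right angle

print(cosine_similarity(query, same_direction),
      euclidean_distance(query, same_direction))
print(cosine_similarity(query, perpendicular),
      euclidean_distance(query, perpendicular))
```

Cosine similarity ignores vector length and only compares direction, while Euclidean distance is sensitive to both. Which metric fits best usually depends on how the embedding model was trained.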
SQL Search (Keyword)
Exact match required. LIKE '%return conditions%' finds nothing when the document says "refund policy." Fast for structured data but blind to synonyms and natural language.
Vector Search (Semantic)
Finds results by meaning. "Return conditions" and "refund policy" have similar embedding vectors despite different words. Enables natural-language queries.
A user types: "What are the return conditions?" The embedding model converts this question into a 1536-dimensional vector. The vector database finds the 5 closest stored vectors via cosine similarity. These come from the "Refund Policy" section — even though the user wrote "return conditions" and the manual says "refund policy." In contrast: an SQL query with LIKE '%return conditions%' would find nothing because the manual uses different words.
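The keyword half of this comparison is easy to reproduce with Python's built-in `sqlite3` module. The table name, the column names, and the section text below are invented for the illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sections (title TEXT, body TEXT)")
conn.execute(
    "INSERT INTO sections VALUES (?, ?)",
    ("Refund Policy", "Refund requests must be submitted within 30 days."),
)

def keyword_search(phrase):
    """Return titles of sections whose body contains the exact phrase."""
    rows = conn.execute(
        "SELECT title FROM sections WHERE body LIKE ?", (f"%{phrase}%",)
    ).fetchall()
    return [title for (title,) in rows]

print(keyword_search("return conditions"))  # the phrase never appears
print(keyword_search("refund"))             # the literal word does appear
```

The first query comes back empty even though the section answers the user's question, because `LIKE` compares characters, not meaning.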
Misconception: A vector database is just a regular database with an extra column
The search mechanism is fundamentally different. Relational databases use B-trees and hash indexes for exact matching. Vector databases use specialized data structures (HNSW graphs, IVF indexes) optimized for high-dimensional nearest-neighbor search. pgvector in PostgreSQL offers basic vector search, but dedicated vector databases like Pinecone or Weaviate include optimizations (sharding, replication, metadata filtering) designed specifically for vector workloads at scale.
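To give a feel for what such an index does, here is a heavily simplified sketch of the inverted-file (IVF) idea: vectors are grouped into buckets around centroids, and a query only scans the bucket of its closest centroid. The 2-dimensional vectors and hand-picked centroids are hypothetical. Real IVF indexes typically learn their centroids from the data and can probe several buckets per query.

```python
import math

def nearest(query, vectors):
    """Exact search: compare the query against every candidate."""
    return min(vectors, key=lambda v: math.dist(query, v))

# Hypothetical 2-D vectors and hand-picked centroids for illustration.
vectors = [(1.0, 1.0), (1.5, 2.0), (4.9, 5.0), (8.0, 8.0), (9.0, 8.5)]
centroids = [(1.0, 1.5), (8.0, 8.0)]

# Indexing: put each vector into the bucket of its closest centroid.
buckets = {c: [] for c in centroids}
for v in vectors:
    buckets[nearest(v, centroids)].append(v)

def ivf_search(query):
    """Approximate search: scan only the closest centroid's bucket."""
    return nearest(query, buckets[nearest(query, centroids)])

for q in [(8.5, 8.0), (4.0, 4.0)]:
    print(q, "exact:", nearest(q, vectors), "approximate:", ivf_search(q))
```

For the first query both searches agree while the approximate one inspects fewer vectors. For the second query, which lies between the two clusters, the approximate search misses the true nearest neighbor. That is the accuracy-for-speed trade that approximate nearest-neighbor indexes make.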
Interactive: The RAG Flow Step by Step
Follow the complete data flow of a RAG query: from document through chunking and embedding into the vector database, then back through retrieval to the LLM, which generates the final answer.
RAG Pipeline: From Document to Answer
Retrieval-Augmented Generation connects a knowledge base with a language model. Instead of only answering from its training, the LLM first searches for relevant documents and uses their content for a well-founded answer.
Why RAG?
Without RAG, an LLM only answers from its training state, which can be outdated or incomplete. With RAG, the model accesses current, specific documents. This reduces hallucinations and enables source-based answers. RAG is one of the most widely used methods for connecting LLMs with external knowledge.
RAG Limitations — No Silver Bullet
RAG significantly reduces hallucinations but does not eliminate them completely. The model can misinterpret retrieved information or incorrectly combine chunks from different contexts.
Garbage in, garbage out applies to RAG too: if the knowledge base is poorly structured, outdated, or contains errors, RAG also produces wrong answers — just with source references.
RAG and fine-tuning solve different problems. RAG adds external knowledge (ideal for current facts, company data). Fine-tuning changes the model's behavior (ideal for style, tone, domain-specific language). For best results, both approaches are often combined.
Key Takeaways
RAG teaches a language model to look up information before answering. This reduces hallucinations and addresses staleness and privacy concerns without retraining the model.
The quality of retrieval depends on chunking: chunks too small lose context, chunks too large dilute relevance. Overlap prevents information loss at cut boundaries.
Vector databases search by meaning, not by keywords — this is why a natural-language question can find the right document, even when the exact words differ.
Quiz: RAG
1. What is the main purpose of the "Retrieval" step in a RAG pipeline?
☐ A) Training the language model on new documents
☐ B) Finding the document chunks most semantically relevant to the user's query
☐ C) Generating the final answer for the user
☐ D) Converting documents into PDF format
2. You build a RAG system for a medical knowledge base. Doctors report that answers sometimes miss important dosage details that appear in the sentence right before a chunk boundary. What is the most effective fix?
☐ A) Use a larger language model
☐ B) Increase the overlap between adjacent chunks so boundary sentences appear in both
☐ C) Switch from vector database to SQL database
☐ D) Remove all chunk boundaries entirely
3. A customer asks your RAG chatbot: "Can I return shoes after 30 days?" The knowledge base document says: "Footwear refund requests must be submitted within four weeks of purchase." A keyword search for "return shoes 30 days" finds nothing. Why does the vector search succeed?
☐ A) Vector databases are faster than SQL databases
☐ B) The embedding vectors for "return shoes" and "footwear refund" capture their semantic similarity despite different words
☐ C) Vector databases store more data than relational databases
☐ D) The language model corrects the search query before searching
4. A startup uses RAG with the ChatGPT API to answer questions about its 200-page product manual. The chatbot sometimes gives wrong answers with high confidence. What is the most likely explanation?
☐ A) The language model is too small
☐ B) The retrieval step returned irrelevant chunks, and the model generated a fluent answer based on wrong context
☐ C) The manual is too long for RAG to handle
☐ D) RAG cannot work with ChatGPT
Answer Key: 1) B · 2) B · 3) B · 4) B
Comprehension Check
What are the three stages of a RAG pipeline — and what happens at each stage?
What are the consequences of too-small and too-large chunks on retrieval quality?
Why does a vector search find results that a keyword search would miss?