You ask ChatGPT about your company's vacation policy. It confidently invents an answer — because it has never seen your HR handbook. RAG fixes this: instead of trusting the model's memory, it teaches the model to look things up first.

In this article, you will learn how RAG works: from the core principle through data preparation to searching vector databases. You will discover why the quality of a RAG application critically depends on how documents are prepared.

The RAG Principle — Look It Up, Don't Guess

Retrieval-Augmented Generation (RAG) extends a language model with an external knowledge base. Instead of relying on knowledge baked into the model's weights during training, the system retrieves relevant documents at query time and injects them into the prompt as context. The concept was introduced by Patrick Lewis et al. in 2020. RAG addresses three fundamental LLM problems: hallucinations, outdated knowledge, and data privacy.

Analogy:

A standard LLM is like a student taking a closed-book exam — they can only use what they memorized during studying. Sometimes they remember correctly, sometimes they confidently write something plausible but wrong. RAG is like an open-book exam: the student is allowed to look up facts in their textbook before writing their answer. They still need to understand the question and formulate a good answer, but the factual basis comes from a verified source.

Definition:

RAG is an architecture that connects an LLM with an external knowledge base. Documents are split into chunks, converted into vectors, and stored in a vector database. At query time, the most relevant document sections are retrieved and injected as context into the prompt. The model generates its answer based on these retrieved facts.

In a real open-book exam, the student picks up the textbook and flips through it. In RAG, the "flipping" is automated by a retrieval algorithm that might return the wrong pages. The quality of the lookup is not guaranteed.

The RAG Pipeline in Three Steps

1. Indexing: Split documents into chunks, encode them as vectors, and store them in a vector database.
2. Retrieval: Encode the user query as a vector and find the most similar chunks via similarity search.
3. Augmented Generation: Inject the retrieved chunks as context into the prompt and let the LLM generate an answer.

Hallucinations: Answers are grounded in real documents instead of invented knowledge.

Up-to-date: The knowledge base can be updated at any time, with no expensive retraining needed.

Data Privacy: Sensitive data stays in your own database instead of being trained into the model.

A company chatbot has 500 internal PDFs (HR handbook, compliance guidelines, product manuals). An employee asks: "How do I apply for special leave?" The system converts the question into a vector, finds the 3 most relevant sections from the current HR handbook, and injects them into the prompt. The language model generates an answer with the correct procedure. Without RAG, it would either refuse or invent a plausible but wrong procedure.
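The three pipeline steps can be sketched end to end in a few lines of Python. This is a minimal sketch under loud assumptions: `embed` is a toy bag-of-words counter standing in for a real embedding model, the "vector database" is a plain list, and `build_prompt` only assembles the text that would be sent to an LLM (no model is called). All function names and the sample handbook snippets are hypothetical.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def index(chunks: list[str]) -> list[tuple[str, Counter]]:
    """Step 1, Indexing: store each chunk together with its vector."""
    return [(chunk, embed(chunk)) for chunk in chunks]


def retrieve(query: str, store: list[tuple[str, Counter]], k: int = 3) -> list[str]:
    """Step 2, Retrieval: return the k chunks most similar to the query."""
    q = embed(query)
    ranked = sorted(store, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]


def build_prompt(query: str, context: list[str]) -> str:
    """Step 3, Augmented Generation: inject retrieved chunks into the prompt."""
    sources = "\n".join(f"- {c}" for c in context)
    return (
        "Answer using only the context below. "
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{sources}\n\nQuestion: {query}"
    )


# Hypothetical handbook snippets, already chunked.
store = index([
    "Special leave must be requested from your manager via the HR portal.",
    "The cafeteria is open from 11:30 to 14:00 on weekdays.",
    "Product returns are accepted within 30 days of purchase.",
])
question = "How do I apply for special leave?"
top_chunks = retrieve(question, store, k=1)
prompt = build_prompt(question, top_chunks)
print(prompt)
```

With these sample snippets, only the special-leave chunk shares words with the question, so it is the one placed into the prompt. A word-count "embedding" only matches identical words; a real embedding model is what makes the retrieval semantic.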

RAG dramatically reduces hallucinations but does not eliminate them completely. The model can still misinterpret retrieved chunks, combine information from unrelated chunks incorrectly, or generate fluent nonsense when the retrieval step returns irrelevant results. RAG shifts the problem from "the model invents facts" to "the model might misread its reference material."

Interactive: RAG Pipeline as a Graph

The RAG pipeline consists of several stations that pass data along: documents, chunking, embedding, vector database, retrieval, and LLM. The interactive graph shows the connections between these stations and how BFS and DFS take different paths through the pipeline.

Chunking & Embeddings — Preparing the Library

Before documents can be searched semantically, they must be split into smaller pieces called chunks and transformed into embedding vectors. Chunk size strongly influences search quality: chunks that are too small (around 200 tokens or fewer) are precise but lose context, while chunks that are too large (around 2000 tokens) preserve context but dilute search precision. Overlap (50-100 tokens) prevents information loss at cut boundaries.

Analogy:

Imagine you have a 200-page manual. Keeping it as one book makes targeted search impossible. Copying every single sentence onto a separate card is too fragmented. The practical solution: index cards that each contain one coherent paragraph, with overlap at the edges. Then for each card, write a one-line summary on the tab — that is the embedding.

Definition:

Chunking splits documents into searchable text blocks of typically 200-1000 tokens. Two parameters control quality: chunk size (how large each block is) and overlap (how many tokens adjacent chunks share). After chunking, an embedding model converts each chunk into a high-dimensional vector (e.g., 1536 dimensions). Semantically similar texts produce vectors that are close together in vector space.
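The interplay of chunk size and overlap can be illustrated with a small sliding-window splitter. This is a sketch, not a reference implementation: the function name `chunk_tokens` is hypothetical, and whitespace-separated words stand in for model tokens (a real system would count tokens with the tokenizer of its embedding model).

```python
def chunk_tokens(tokens: list[str], size: int, overlap: int) -> list[list[str]]:
    """Split tokens into windows of `size` that share `overlap` tokens."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    stride = size - overlap  # how far each new chunk advances
    chunks = []
    for start in range(0, len(tokens), stride):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the last window already reaches the end of the text
    return chunks


# Assumption: whitespace-separated words stand in for model tokens.
tokens = "the warranty covers manufacturing defects and it lasts 24 months".split()
chunks = chunk_tokens(tokens, size=6, overlap=2)
for c in chunks:
    print(" ".join(c))
```

The ten sample words yield two chunks, and the words "defects and" appear at the end of the first chunk and at the start of the second. That shared span is the overlap that keeps boundary sentences from being cut off from their context.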

Index card summaries are written by a human who understands the content. Embedding models compute vectors mathematically — they capture semantic similarity but not necessarily logical relationships.

Too-small chunks (200 tokens)

High precision but context loss. A pronoun like "it" cannot be resolved if the referent appeared in the previous chunk. A warranty sentence, for example, becomes isolated from the context that explains what counts as a "manufacturing defect."

Too-large chunks (2000 tokens)

Context is preserved but search precision is diluted. The warranty paragraph is mixed with unrelated content about product specifications, so each result carries more irrelevant text.

A 50-page product manual is chunked with 500 tokens per chunk and 100 tokens of overlap, so each new chunk starts 400 tokens after the previous one. This produces 120 chunks, which corresponds to a manual of roughly 48,000 tokens. Each chunk is converted into a vector with 1536 dimensions, giving a matrix of 120 × 1536 numbers stored in the vector database. When searching for "warranty conditions," the query is also embedded into a 1536-dimensional vector and compared against all 120 stored vectors.
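The arithmetic behind this example can be checked with a short calculation. Two values here are assumptions that the text does not state: a manual length of 48,000 tokens (chosen because it reproduces the 120 chunks) and 4 bytes per stored number (float32). The helper name `chunk_count` is hypothetical.

```python
import math


def chunk_count(total_tokens: int, size: int, overlap: int) -> int:
    """Number of overlapping windows needed to cover total_tokens."""
    if total_tokens <= size:
        return 1 if total_tokens > 0 else 0
    stride = size - overlap  # each new chunk advances by this many tokens
    return math.ceil((total_tokens - overlap) / stride)


total_tokens = 48_000          # assumed length of the 50-page manual
n_chunks = chunk_count(total_tokens, size=500, overlap=100)
dims = 1536
n_values = n_chunks * dims     # entries in the chunks-by-dimensions matrix
n_bytes = n_values * 4         # assuming 4-byte float32 values

print(n_chunks, n_values, n_bytes)
```

Under these assumptions the manual yields 120 chunks, 184,320 stored numbers, and about 737 KB of raw vector data. At this size, comparing the query against every stored vector is cheap.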

Smaller chunks increase precision but decrease recall and context. A chunk containing only "It lasts 24 months" without the surrounding context is useless — the "It" has no referent. The art of chunking is finding the size where each chunk is self-contained enough to be useful without being so large that it dilutes the search.

Fixed size: Document is cut into uniform blocks. Simple to implement but ignores content structure. A paragraph can be split mid-sentence.

Paragraph-based: Natural paragraph boundaries are respected. Preserves topical coherence but produces chunks of varying sizes.

Semantic: An embedding model detects topic changes in the text and sets boundaries there. Best quality but computationally expensive and more complex to implement.
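As a contrast to fixed-size windows, the paragraph-based strategy can be sketched as follows. The assumptions are that a blank line marks a paragraph boundary and that words approximate tokens. The function name `chunk_by_paragraph` and the sample text are hypothetical.

```python
import re


def chunk_by_paragraph(text: str, max_words: int) -> list[str]:
    """Group consecutive paragraphs into chunks of at most max_words words.

    A single paragraph longer than max_words is kept whole rather than cut.
    """
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    chunks, current, current_len = [], [], 0
    for para in paragraphs:
        n = len(para.split())
        # Close the current chunk if this paragraph would exceed the budget.
        if current and current_len + n > max_words:
            chunks.append("\n\n".join(current))
            current, current_len = [], 0
        current.append(para)
        current_len += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks


# Hypothetical sample text with three paragraphs.
doc = (
    "Warranty covers manufacturing defects.\n\n"
    "It lasts 24 months from purchase.\n\n"
    "The device weighs 1.2 kg and ships with a charger."
)
chunks = chunk_by_paragraph(doc, max_words=10)
for c in chunks:
    print(repr(c))
```

Here the two warranty paragraphs end up in the same chunk, so the "It" in the second paragraph keeps its referent, while the unrelated specification paragraph becomes its own chunk. The trade-off named above is visible too: chunk sizes vary with the paragraph lengths.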

Vector Databases — The AI's Memory

A vector database is a specialized storage system for embedding vectors. Unlike traditional relational databases that match exact values (SQL: WHERE name = 'Max'), vector databases search by mathematical similarity — they find the stored vectors closest to a query vector in high-dimensional space.

Analogy:

A traditional SQL database is like a library catalog — you search for an exact title, author, or ISBN, and it either finds a match or doesn't. A vector database is like a recommendation system: you describe what you are looking for in your own words, and it returns the items whose descriptions are most similar to yours — even if none match your exact wording.

Definition:

Vector databases use measures such as cosine similarity or Euclidean distance to calculate the proximity of two vectors. For large-scale systems, Approximate Nearest Neighbor (ANN) algorithms trade a small amount of accuracy for large speed gains. The ecosystem includes ChromaDB (beginner-friendly), FAISS (Facebook's library for fast local search), Pinecone (managed cloud solution), pgvector (PostgreSQL extension), and Weaviate (open source).
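The two proximity measures named here are easy to compute by hand. The following sketch uses toy 2-dimensional vectors chosen for illustration; real embeddings have hundreds or thousands of dimensions, and production systems would use an optimized library rather than pure Python.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between a and b: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def euclidean_distance(a: list[float], b: list[float]) -> float:
    """Straight-line distance between a and b: 0.0 means identical."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Toy 2-dimensional vectors chosen for illustration.
u = [3.0, 4.0]
v = [6.0, 8.0]   # same direction as u, twice the length
w = [4.0, -3.0]  # perpendicular to u

print(cosine_similarity(u, v), euclidean_distance(u, v))
print(cosine_similarity(u, w), euclidean_distance(u, w))
```

Cosine similarity ignores vector length: `u` and `v` point in the same direction and score 1.0 even though they are 5.0 apart in Euclidean terms. When all vectors are normalized to unit length, the two measures rank neighbors identically.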

A real recommendation system uses collaborative filtering (what did similar users like?). A vector database relies purely on mathematical vector proximity — it captures semantic similarity but not user preferences.

SQL Search (Keyword)

Exact match required. LIKE '%return conditions%' finds nothing when the document says "refund policy." Fast for structured data but blind to synonyms and natural language.

Vector Search (Semantic)

Finds results by meaning. "Return conditions" and "refund policy" have similar embedding vectors despite different words. Enables natural-language queries.

A user types: "What are the return conditions?" The embedding model converts this question into a 1536-dimensional vector. The vector database finds the 5 closest stored vectors via cosine similarity. These come from the "Refund Policy" section — even though the user wrote "return conditions" and the manual says "refund policy." In contrast: an SQL query with LIKE '%return conditions%' would find nothing because the manual uses different words.
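The contrast between the two search styles can be made concrete with a toy example. The 3-dimensional vectors below are hand-picked for illustration and do not come from any real embedding model; the section titles, the query vector, and the function names are likewise hypothetical. The vector search shown is an exact brute-force scan, not an ANN index.

```python
import math

# Hypothetical section titles with HAND-PICKED 3-d vectors (not from a real model).
# Imagined axes: [money back, shipping, technical specs].
sections = {
    "Refund policy": [0.9, 0.1, 0.0],
    "Delivery times": [0.1, 0.9, 0.1],
    "Technical specifications": [0.0, 0.1, 0.9],
}


def keyword_search(phrase: str, titles) -> list[str]:
    """Mimic SQL LIKE '%phrase%': case-insensitive substring match."""
    return [t for t in titles if phrase.lower() in t.lower()]


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def vector_search(query_vec: list[float], store: dict, k: int = 1) -> list[str]:
    """Exact (brute-force) nearest-neighbor search by cosine similarity."""
    ranked = sorted(store, key=lambda t: cosine(query_vec, store[t]), reverse=True)
    return ranked[:k]


# Assumed embedding of the query "What are the return conditions?"
query_vec = [0.8, 0.2, 0.1]

keyword_hits = keyword_search("return conditions", sections)
vector_hits = vector_search(query_vec, sections, k=1)
print(keyword_hits)  # no title contains the literal phrase
print(vector_hits)   # closest vector belongs to the refund section
```

The substring match returns an empty list, while the similarity ranking puts "Refund policy" first. This only works because the assumed query vector lies close to the refund vector. In a real system that closeness has to come from the embedding model.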

The search mechanism is fundamentally different. Relational databases use B-trees and hash indexes for exact matching. Vector databases use specialized data structures (HNSW graphs, IVF indexes) optimized for high-dimensional nearest-neighbor search. pgvector adds vector search to PostgreSQL, including IVFFlat and HNSW indexes, but dedicated vector databases like Pinecone or Weaviate include optimizations (sharding, replication, metadata filtering) designed specifically for vector workloads at scale.

Interactive: The RAG Flow Step by Step

Follow the complete data flow of a RAG query: from document through chunking and embedding into the vector database, then back through retrieval to the LLM, which generates the final answer.

RAG Pipeline: From Document to Answer

Retrieval-Augmented Generation connects a knowledge base with a language model. Instead of only answering from its training, the LLM first searches for relevant documents and uses their content for a well-founded answer.

Why RAG?

Without RAG, an LLM answers only from its training state, which can be outdated or incomplete. With RAG, the model accesses current, specific documents. This reduces hallucinations and enables source-based answers. RAG is one of the most widely used methods for connecting LLMs with external knowledge.

RAG significantly reduces hallucinations but does not eliminate them completely. The model can misinterpret retrieved information or incorrectly combine chunks from different contexts.

Garbage in, garbage out applies to RAG too: if the knowledge base is poorly structured, outdated, or contains errors, RAG also produces wrong answers — just with source references.

RAG and fine-tuning solve different problems. RAG adds external knowledge (ideal for current facts, company data). Fine-tuning changes the model's behavior (ideal for style, tone, domain-specific language). For best results, both approaches are often combined.

Key Takeaways

RAG lets a language model look up information before answering, which addresses hallucinations, stale knowledge, and privacy concerns without retraining the model.
The quality of retrieval depends on chunking: chunks that are too small lose context, and chunks that are too large dilute relevance. Overlap prevents information loss at cut boundaries.
Vector databases search by meaning, not by keywords. This is why a natural-language question can find the right document even when the exact words differ.

Quiz: RAG

1. What is the main purpose of the "Retrieval" step in a RAG pipeline?

☐ A) Training the language model on new documents
☐ B) Finding the document chunks most semantically relevant to the user's query
☐ C) Generating the final answer for the user
☐ D) Converting documents into PDF format

2. You build a RAG system for a medical knowledge base. Doctors report that answers sometimes miss important dosage details that appear in the sentence right before a chunk boundary. What is the most effective fix?

☐ A) Use a larger language model
☐ B) Increase the overlap between adjacent chunks so boundary sentences appear in both
☐ C) Switch from vector database to SQL database
☐ D) Remove all chunk boundaries entirely

3. A customer asks your RAG chatbot: "Can I return shoes after 30 days?" The knowledge base document says: "Footwear refund requests must be submitted within four weeks of purchase." A keyword search for "return shoes 30 days" finds nothing. Why does the vector search succeed?

☐ A) Vector databases are faster than SQL databases
☐ B) The embedding vectors for "return shoes" and "footwear refund" capture their semantic similarity despite different words
☐ C) Vector databases store more data than relational databases
☐ D) The language model corrects the search query before searching

4. A startup uses RAG with a ChatGPT API to answer questions about their 200-page product manual. The chatbot sometimes gives wrong answers with high confidence. What is the most likely explanation?

☐ A) The language model is too small
☐ B) The retrieval step returned irrelevant chunks, and the model generated a fluent answer based on wrong context
☐ C) The manual is too long for RAG to handle
☐ D) RAG cannot work with ChatGPT

Answer Key: 1) B · 2) B · 3) B · 4) B

Comprehension Check

What are the three stages of a RAG pipeline — and what happens at each stage?
What are the consequences of too-small and too-large chunks on retrieval quality?
Why does a vector search find results that a keyword search would miss?
