Connecting Your Own Data (RAG)

How LLMs learned to look something up before answering — RAG explained.

Architectures · 12 min · Intermediate · April 26, 2026

You ask ChatGPT about your company's vacation policy. It confidently invents an answer — because it has never seen your HR handbook. RAG fixes this: instead of trusting the model's memory, it teaches the model to look things up first.

In this article, you will learn how RAG works: from the core principle through data preparation to searching vector databases. You will discover why the quality of a RAG application critically depends on how documents are prepared.

The RAG Principle — Look It Up, Don't Guess

Retrieval-Augmented Generation (RAG) extends a language model with an external knowledge base. Instead of relying on knowledge baked into the model's weights during training, the system retrieves relevant documents at query time and injects them into the prompt as context. The concept was introduced by Patrick Lewis et al. in 2020. RAG addresses three fundamental LLM problems: hallucinations, outdated knowledge, and data privacy.

Retrieval-Augmented Generation

Analogy
A standard LLM is like a student taking a closed-book exam: they can only use what they memorized while studying. Sometimes they remember correctly, sometimes they confidently write something plausible but wrong. RAG is like an open-book exam: the student is allowed to look up facts in their textbook before writing their answer. They still need to understand the question and formulate a good answer, but the factual basis comes from a verified source.

In a real open-book exam, the student picks up the textbook and flips through it. In RAG, the "flipping" is automated by a retrieval algorithm that might return the wrong pages. The quality of the lookup is not guaranteed.

The RAG Pipeline in Three Steps

  1. Indexing: Split documents into chunks, encode them as vectors, and store them in a vector database.
  2. Retrieval: Encode the user query as a vector and find the most similar chunks via similarity search.
  3. Augmented Generation: Inject the retrieved chunks as context into the prompt and let the LLM generate an answer.

  • Hallucinations: Answers are grounded in real documents instead of invented knowledge.
  • Up-to-date: The knowledge base can be updated at any time, with no expensive retraining needed.
  • Data privacy: Sensitive data stays in your own database.

A company chatbot has 500 internal PDFs (HR handbook, compliance guidelines, product manuals). An employee asks: "How do I apply for special leave?" The system converts the question into a vector, finds the 3 most relevant sections from the current HR handbook, and injects them into the prompt. The language model generates an answer with the correct procedure. Without RAG, it would either refuse or invent a plausible but wrong procedure.
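The three steps can be sketched end to end in a few lines of Python. This is a minimal illustration, not a production design: the `embed` function is a toy bag-of-words stand-in for a real embedding model, the three handbook chunks are invented for the example, and the final prompt is only printed rather than sent to an LLM.

```python
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Step 1, indexing: hypothetical handbook chunks stored with their vectors.
chunks = [
    "Special leave is requested through the HR portal and approved by your manager.",
    "The cafeteria is open from 11:30 to 14:00 on working days.",
    "Compliance training must be completed once per year.",
]
index = [(chunk, embed(chunk)) for chunk in chunks]

def retrieve(query: str, k: int = 1) -> list[str]:
    """Step 2, retrieval: return the k chunks most similar to the query."""
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]

def build_prompt(query: str, context: list[str]) -> str:
    """Step 3, augmentation: inject the retrieved chunks into the prompt."""
    joined = "\n".join(f"- {c}" for c in context)
    return f"Answer using only this context:\n{joined}\n\nQuestion: {query}"

question = "How do I apply for special leave?"
prompt = build_prompt(question, retrieve(question))
print(prompt)  # this string would be sent to the LLM
```

In a real system the word-count vectors would be replaced by dense embeddings and the list scan by a vector database query, but the shape of the pipeline stays the same.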

Misconception: RAG eliminates hallucinations completely

RAG dramatically reduces hallucinations but does not eliminate them completely. The model can still misinterpret retrieved chunks, combine information from unrelated chunks incorrectly, or generate fluent nonsense when the retrieval step returns irrelevant results. RAG shifts the problem from "the model invents facts" to "the model might misread its reference material."

Interactive: RAG Pipeline as a Graph

The RAG pipeline consists of several stations that pass data along: documents (Doc), chunking (Chk), embedding (Emb), vector database (VDB), retrieval (Ret), and LLM. The interactive graph on the original page connects these stations and steps through them with breadth-first search (BFS, using a FIFO queue) and depth-first search (DFS, using a stack), which take different paths through the pipeline.

Chunking & Embeddings — Preparing the Library

Before documents can be searched semantically, they must be split into smaller pieces called chunks and transformed into embedding vectors. Chunk size strongly influences search quality: chunks that are too small (around 200 tokens) are precise but lose context, while chunks that are too large (1000 tokens or more) preserve context but dilute search precision. An overlap of 50-100 tokens between neighboring chunks helps prevent information loss at cut boundaries.
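Chunking with overlap can be pictured as a sliding window over the token sequence. The sketch below is one simple way to do it; the function name `chunk_tokens` is made up for this example, and splitting on whitespace is only a crude stand-in for a real tokenizer. The tiny sizes (4 tokens, 1 overlap) are chosen so the output is easy to read.

```python
def chunk_tokens(tokens: list[str], size: int, overlap: int) -> list[list[str]]:
    """Split a token list into windows of `size` tokens sharing `overlap` tokens."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    stride = size - overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(tokens), stride):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the last window already reaches the end of the text
    return chunks

text = "the warranty covers manufacturing defects for 24 months after purchase"
tokens = text.split()  # whitespace split as a crude stand-in for a tokenizer
for chunk in chunk_tokens(tokens, size=4, overlap=1):
    print(chunk)
```

Each window starts with the last token of the previous one, so a phrase that straddles a cut boundary still appears whole in at least one chunk when the overlap is long enough.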

Chunking

Analogy
Imagine you have a 200-page manual. Keeping it as one book makes targeted search impossible. Copying every single sentence onto a separate card is too fragmented. The practical solution: index cards that each contain one coherent paragraph, with overlap at the edges. Then for each card, write a one-line summary on the tab — that is the embedding.

Index card summaries are written by a human who understands the content. Embedding models compute vectors mathematically — they capture semantic similarity but not necessarily logical relationships.

Too-small chunks (200 tokens)

High precision but context loss. A pronoun like "it" cannot be resolved if the referent appeared in the previous chunk. For example, a warranty sentence ends up isolated from the context that explains what counts as a "manufacturing defect."

Too-large chunks (2000 tokens)

Context preserved but search precision diluted. The warranty paragraph is mixed with unrelated content about product specifications. More irrelevant text per result.

A 50-page product manual is chunked with 500 tokens per chunk and 100 tokens overlap, so each new chunk advances by 400 tokens. Assuming the manual contains roughly 48,000 tokens (an illustrative figure), this produces 120 chunks. Each chunk is converted into a vector with 1536 dimensions: a matrix of 120 x 1536 numbers stored in the vector database. When searching for "warranty conditions," the query is also embedded into a 1536-dimensional vector and compared against all 120 stored vectors.
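The chunk count follows from simple arithmetic once the total length is known. The sketch below assumes a manual of 48,000 tokens, a figure chosen for illustration because the text does not state one; `chunk_count` is a hypothetical helper, not a library function.

```python
import math

def chunk_count(total_tokens: int, size: int, overlap: int) -> int:
    """Number of sliding-window chunks needed to cover total_tokens."""
    if total_tokens <= size:
        return 1 if total_tokens > 0 else 0
    stride = size - overlap  # each new chunk advances by this many tokens
    return math.ceil((total_tokens - size) / stride) + 1

total_tokens = 48_000  # assumed manual length, not stated in the text
n_chunks = chunk_count(total_tokens, size=500, overlap=100)
dimensions = 1536
print(n_chunks)               # 120
print(n_chunks * dimensions)  # 184320 numbers in the 120 x 1536 matrix
```

With a different token count per page the number of chunks changes accordingly, which is why chunk size and overlap directly drive storage and search cost.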

Misconception: Smallest possible chunks give the best precision

Smaller chunks increase precision but decrease recall and context. A chunk containing only "It lasts 24 months" without the surrounding context is useless — the "It" has no referent. The art of chunking is finding the size where each chunk is self-contained enough to be useful without being so large that it dilutes the search.

Fixed size: Document is cut into uniform blocks. Simple to implement but ignores content structure. A paragraph can be split mid-sentence.

Paragraph-based: Natural paragraph boundaries are respected. Preserves topical coherence but produces chunks of varying sizes.

Semantic: An embedding model detects topic changes in the text and sets boundaries there. Best quality but computationally expensive and more complex to implement.
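To make the contrast with fixed-size blocks concrete, here is a minimal sketch of the paragraph-based strategy. It assumes paragraphs are separated by blank lines and counts words instead of tokens; the function name and the sample manual text are invented for the example.

```python
def chunk_by_paragraph(text: str, max_words: int) -> list[str]:
    """Pack whole paragraphs into chunks of at most max_words words.

    A single paragraph longer than max_words is kept intact as its own chunk.
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current, current_len = [], [], 0
    for para in paragraphs:
        n = len(para.split())
        if current and current_len + n > max_words:
            # Adding this paragraph would exceed the limit: close the chunk.
            chunks.append("\n\n".join(current))
            current, current_len = [], 0
        current.append(para)
        current_len += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks

manual = (
    "The device ships with a two-year warranty.\n\n"
    "The warranty covers manufacturing defects only.\n\n"
    "To clean the device, use a dry cloth and avoid solvents of any kind."
)
for chunk in chunk_by_paragraph(manual, max_words=14):
    print(len(chunk.split()), "words:", repr(chunk))
```

No paragraph is ever cut mid-sentence, and the two warranty paragraphs stay together, at the price of chunks whose sizes differ.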

Vector Databases — The AI's Memory

A vector database is a specialized storage system for embedding vectors. Unlike traditional relational databases that match exact values (SQL: WHERE name = 'Max'), vector databases search by mathematical similarity — they find the stored vectors closest to a query vector in high-dimensional space.

Vector Database

Analogy
A traditional SQL database is like a library catalog — you search for an exact title, author, or ISBN, and it either finds a match or doesn't. A vector database is like a recommendation system: you describe what you are looking for in your own words, and it returns the items whose descriptions are most similar to yours — even if none match your exact wording.

A real recommendation system uses collaborative filtering (what did similar users like?). A vector database relies purely on mathematical vector proximity — it captures semantic similarity but not user preferences.

SQL Search (Keyword)

Requires a literal text match. LIKE '%return conditions%' finds nothing when the document says "refund policy." Fast for structured data but blind to synonyms and natural language.

Vector Search (Semantic)

Finds results by meaning. "Return conditions" and "refund policy" have similar embedding vectors despite different words. Enables natural-language queries.

A user types: "What are the return conditions?" The embedding model converts this question into a 1536-dimensional vector. The vector database finds the 5 closest stored vectors via cosine similarity. These come from the "Refund Policy" section — even though the user wrote "return conditions" and the manual says "refund policy." In contrast: an SQL query with LIKE '%return conditions%' would find nothing because the manual uses different words.
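The difference between the two searches can be shown with a toy example. The three-dimensional vectors below are hand-made stand-ins for real 1536-dimensional embeddings, and both the stored texts and the query vector are assumptions chosen so that related meanings point in similar directions.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Hand-made 3-dimensional vectors standing in for 1536-dimensional embeddings.
store = {
    "Refund policy: items can be sent back within 30 days.": [0.9, 0.1, 0.0],
    "Shipping takes three to five working days.": [0.2, 0.9, 0.1],
    "The device supports Bluetooth 5.0.": [0.0, 0.1, 0.9],
}
query_text = "return conditions"
query_vector = [0.8, 0.2, 0.1]  # assumed embedding of the query

def top_k(query: list[float], k: int) -> list[tuple[float, str]]:
    """Score every stored vector against the query and keep the k best."""
    scored = [(cosine_similarity(query, vec), text) for text, vec in store.items()]
    return sorted(scored, reverse=True)[:k]

keyword_hits = [text for text in store if query_text in text.lower()]
print(keyword_hits)            # [] because the wording differs
print(top_k(query_vector, 1))  # the refund policy chunk ranks first
```

The keyword filter behaves like the LIKE query and returns nothing, while the similarity ranking surfaces the refund policy text even though it shares no words with the query.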

Misconception: A vector database is just a regular database with an extra column

The search mechanism is fundamentally different. Relational databases use B-trees and hash indexes for exact matching. Vector databases use specialized data structures (HNSW graphs, IVF indexes) optimized for high-dimensional nearest-neighbor search. pgvector in PostgreSQL offers basic vector search, but dedicated vector databases like Pinecone or Weaviate include optimizations (sharding, replication, metadata filtering) designed specifically for vector workloads at scale.
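To give a feel for what an IVF (inverted file) index does, here is a deliberately simplified sketch. It is not how any particular product implements it: real IVF indexes typically learn their centroids with k-means, whereas the two centroids below are picked by hand, and the class name `ToyIVF` is made up for this example. The key idea shown is that a query only scans the cells nearest to it, which makes the search approximate, since a true neighbor may sit in a cell that is not probed.

```python
import math

def dist(a, b):
    return math.dist(a, b)  # Euclidean distance

class ToyIVF:
    """Inverted-file sketch: vectors are bucketed by their nearest centroid."""

    def __init__(self, centroids):
        self.centroids = centroids
        self.cells = {i: [] for i in range(len(centroids))}

    def _nearest_cells(self, vector, n):
        order = sorted(range(len(self.centroids)),
                       key=lambda i: dist(vector, self.centroids[i]))
        return order[:n]

    def add(self, item_id, vector):
        cell = self._nearest_cells(vector, 1)[0]
        self.cells[cell].append((item_id, vector))

    def search(self, query, k=1, nprobe=1):
        """Scan only the nprobe closest cells instead of every stored vector."""
        candidates = [entry for cell in self._nearest_cells(query, nprobe)
                      for entry in self.cells[cell]]
        candidates.sort(key=lambda entry: dist(query, entry[1]))
        return [item_id for item_id, _ in candidates[:k]], len(candidates)

index = ToyIVF(centroids=[(0.0, 0.0), (10.0, 10.0)])  # hand-picked, not learned
points = {"a": (0.5, 0.2), "b": (1.0, 1.5), "c": (9.0, 9.5), "d": (10.5, 10.0)}
for item_id, vector in points.items():
    index.add(item_id, vector)

ids, scanned = index.search((0.8, 1.0), k=1, nprobe=1)
print(ids, scanned)  # nearest neighbour found after scanning 2 of 4 vectors
```

Raising `nprobe` scans more cells, trading speed for a lower chance of missing the true nearest neighbor.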

Interactive: The RAG Flow Step by Step

Follow the complete data flow of a RAG query: from document through chunking and embedding into the vector database, then back through retrieval to the LLM, which generates the final answer.

RAG Pipeline: From Document to Answer

Retrieval-Augmented Generation connects a knowledge base with a language model. Instead of only answering from its training, the LLM first searches for relevant documents and uses their content for a well-founded answer.

[Animated diagram, six steps: Document 1, Document 2, Document 3, ..., Document N (knowledge base) → Embedding (vectorization: create vectors) → Vector Database (storage); User Query (input) → Top-K + Query → LLM (generation) → Answer]

Why RAG?

Without RAG, an LLM only answers from its training state, which can be outdated or incomplete. With RAG, the model accesses current, specific documents. This reduces hallucinations and enables source-based answers. RAG is one of the most widely used methods for connecting LLMs with external knowledge.

RAG significantly reduces hallucinations but does not eliminate them completely. The model can misinterpret retrieved information or incorrectly combine chunks from different contexts.

Garbage in, garbage out applies to RAG too: if the knowledge base is poorly structured, outdated, or contains errors, RAG also produces wrong answers — just with source references.

RAG and fine-tuning solve different problems. RAG adds external knowledge (ideal for current facts, company data). Fine-tuning changes the model's behavior (ideal for style, tone, domain-specific language). For best results, both approaches are often combined.

Key Takeaways

  1. RAG teaches a language model to look up information before answering, which addresses hallucinations, staleness, and privacy concerns without retraining the model.
  2. The quality of retrieval depends on chunking: chunks too small lose context, chunks too large dilute relevance. Overlap prevents information loss at cut boundaries.
  3. Vector databases search by meaning, not by keywords — this is why a natural-language question can find the right document, even when the exact words differ.

Quiz: RAG


What is the main purpose of the "Retrieval" step in a RAG pipeline?


Comprehension Check

  • What are the three stages of a RAG pipeline — and what happens at each stage?
  • What are the consequences of too-small and too-large chunks on retrieval quality?
  • Why does a vector search find results that a keyword search would miss?