RNNs process language one word at a time. For a 500-word paragraph, word 1 must survive 499 steps of compression before influencing the output — often, it does not survive. In 2017, Vaswani et al. proposed a radical alternative: what if every word could directly attend to every other word, all at once?

That idea — Self-Attention — eliminated the sequential bottleneck and unleashed a wave of scaling that produced GPT, BERT, and virtually every major AI system in use today.

Self-Attention — Every Token Sees Every Other Token

The central innovation of Transformers: parallel processing, direct connections, dynamic weights. Why this solves the RNN problems:

Analogy:

Imagine a classroom discussion. In the RNN approach, students sit in a long line and can only whisper to their immediate neighbor — a message from student 1 must pass through students 2, 3, 4, ... before reaching student 30. Information gets lost. In the Self-Attention approach, every student can see and hear every other student directly. Each student decides independently how much attention to pay to each speaker based on what is currently being discussed.

Example

In a real classroom, students have limited cognitive bandwidth and cannot truly attend to 30 speakers simultaneously. Self-Attention computes attention to ALL tokens mechanically and in parallel — there is no cognitive limit, only computational cost.

Definition:

Self-Attention computes, for every token in a sequence, how relevant every other token is to it — all at once, in parallel. Unlike RNNs that compress everything into a single hidden state, every token has a direct connection to every other token. The attention weights are dynamic: they are computed fresh for each input sequence.

RNN (sequential)

Tokens are processed one by one. Information from token 1 must pass through 499 hidden states to reach token 500. The vanishing gradient problem causes long-range dependencies to disappear.

Transformer (parallel)

Every token has direct access to every other — the distance is always exactly one step. No information loss over distance. Massive GPU parallelization possible.
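The difference in distance can be sketched with a toy calculation. The snippet below assumes, purely for illustration, that every recurrent step keeps a fixed fraction of a signal; the `retention` value of 0.9 is a hypothetical number, not a property of any real RNN. It compares the 499 steps between token 1 and token 500 with the single step needed under self-attention.

```python
def rnn_path_length(source: int, target: int) -> int:
    """Number of hidden-state updates between two positions in an RNN."""
    return abs(target - source)


def attention_path_length(source: int, target: int) -> int:
    """Under self-attention, two distinct positions are one step apart."""
    return 0 if source == target else 1


def surviving_signal(steps: int, retention: float = 0.9) -> float:
    """Toy model (assumption): each step keeps a fixed fraction of the signal."""
    return retention ** steps


rnn_steps = rnn_path_length(1, 500)
attention_steps = attention_path_length(1, 500)

print("RNN steps:", rnn_steps, "surviving signal:", surviving_signal(rnn_steps))
print("Attention steps:", attention_steps, "surviving signal:", surviving_signal(attention_steps))
```

Under this assumed decay, almost nothing of token 1 is left after 499 steps, while the one-step attention path keeps most of it. Real RNNs do not decay at a constant rate, so the numbers only illustrate the trend.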

Consider the word "bank" in two sentences: "The bank is by the river" vs. "The bank is bankrupt." With a static, context-free embedding, "bank" has the same representation in both. With Self-Attention, "bank" in sentence 1 attends strongly to "river", and its representation shifts toward "riverbank." In sentence 2, "bank" attends strongly to "bankrupt", and its representation shifts toward "financial institution." Same word, different meaning, computed automatically from context.

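The attention weights are dynamic: they are recomputed from the tokens of each input, whereas CNN filters and dense layer weights stay fixed once training ends. The learned parameters behind attention are fixed as well; what changes from input to input is the resulting weighting between tokens.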
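The "bank" example can be sketched in a few lines. Everything numeric here is an assumption made for illustration: the 2-D embeddings are hypothetical (axis 0 loosely stands for "nature", axis 1 for "finance"), and the embeddings are used directly as queries, keys, and values without learned projections.

```python
import math


def softmax(scores):
    """Turn a list of scores into positive weights that sum to 1."""
    peak = max(scores)  # subtracting the max avoids overflow in exp
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(query, keys):
    """Scaled dot-product scores of one query against all keys, then softmax."""
    d_k = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d_k) for key in keys]
    return softmax(scores)


def contextual_vector(query, keys, values):
    """Weighted sum of the value vectors, using the attention weights."""
    weights = attention_weights(query, keys)
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(len(values[0]))]


# Hypothetical embeddings: [nature, finance]
bank = [1.0, 1.0]      # ambiguous on its own
river = [3.0, 0.0]
bankrupt = [0.0, 3.0]

sentence_1 = [bank, river]
sentence_2 = [bank, bankrupt]

bank_in_1 = contextual_vector(bank, sentence_1, sentence_1)
bank_in_2 = contextual_vector(bank, sentence_2, sentence_2)

print("bank next to river:   ", bank_in_1)
print("bank next to bankrupt:", bank_in_2)
```

The same function and the same "bank" vector produce two different outputs, because the weights are recomputed from whatever tokens surround the word.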

Watch Out: Transformers Don't "Understand" Language

Transformers compute statistical patterns, not meaning. The attention weights show correlations, not real comprehension. The impressive results come from pattern recognition on massive datasets, not from cognition.

Query, Key, Value — The Mechanics of Attention

How attention is actually computed — step by step:

Analogy:

Think of a library search. You walk in with a specific research question (your Query). Every book has an index card describing its content (its Key). You compare your question against every index card — some match well (high dot product), others don't. For the matching books, you read the relevant passages (their Values). The final answer is a blend of information from all relevant books, weighted by how well each card matched your question.

Example

In a real library, you typically pick one or two best-matching books. Self-Attention computes a weighted combination of ALL values — every token contributes something, though most contributions are near-zero after softmax.

Definition:

Each token is projected into three vectors: Query (what information does this token seek?), Key (what information does it offer?), and Value (what actual content does it carry?). Attention scores are computed as scaled dot products: score = (Q · K^T) / sqrt(d_k). Softmax converts scores to probabilities. The output is a probability-weighted sum of all Value vectors.

Attention Formula

Attention(Q, K, V) = softmax(Q · K^T / √d_k) · V
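As a minimal sketch of this formula, the function below works on plain Python lists so that every step stays visible. The name `scaled_dot_product_attention` and the toy inputs are assumptions for illustration; real implementations use batched matrix operations on a GPU.

```python
import math


def softmax(scores):
    """Turn a list of scores into positive weights that sum to 1."""
    peak = max(scores)  # subtracting the max avoids overflow in exp
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(Q, K, V):
    """Q: n x d_k, K: m x d_k, V: m x d_v (lists of rows). Returns n x d_v."""
    d_k = len(K[0])
    d_v = len(V[0])
    outputs = []
    for q in Q:
        # Q · K^T / sqrt(d_k): one scaled score per key
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        # weighted sum of all value vectors
        outputs.append([sum(w * v[j] for w, v in zip(weights, V)) for j in range(d_v)])
    return outputs


# Hypothetical toy input: two tokens, d_k = 2, d_v = 2.
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]

result = scaled_dot_product_attention(Q, K, V)
uniform = scaled_dot_product_attention([[0.0, 0.0]], K, V)

print(result)
print(uniform)
```

Each output row is a blend of the value rows. A query that matches no key better than any other (the all-zero query) ends up with the plain average of the values.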

Mini-attention for "Der schlaue Fuchs" (German for "the clever fox"; 3 tokens, d_k = 4).

1. Dot Products: Query for "Fuchs": Q = [2, 0, 2, 0]. Keys: K_Der = [1, 0, 0, 0], K_schlaue = [1, 1, 0, 0], K_Fuchs = [2, 0, 2, 0]. Dot products: Q_Fuchs · K_Der = 2, Q_Fuchs · K_schlaue = 2, Q_Fuchs · K_Fuchs = 8.

2. Scaling: divide by sqrt(d_k) = sqrt(4) = 2. Result: [1, 1, 4].

3. Softmax: Softmax([1, 1, 4]) ≈ [0.045, 0.045, 0.909], rounded here to [0.05, 0.05, 0.90] so that the weights still sum to 1.

4. Weighted Values: assume the value vectors are V_Der = [1, 0], V_schlaue = [0, 1], V_Fuchs = [1, 1]. Weighted sum: 0.05·[1, 0] + 0.05·[0, 1] + 0.90·[1, 1] = [0.95, 0.95].

5. Result: "Fuchs" attends primarily to itself, so its own information dominates the output.
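The five steps can be retraced in code with the exact vectors from the example. This is only a sketch in plain Python; the variable names are chosen for illustration. Without intermediate rounding, the output comes out at approximately [0.955, 0.955], which matches the hand calculation to within rounding.

```python
import math

# Vectors taken from the worked example
q_fuchs = [2, 0, 2, 0]
keys = {"Der": [1, 0, 0, 0], "schlaue": [1, 1, 0, 0], "Fuchs": [2, 0, 2, 0]}
values = {"Der": [1, 0], "schlaue": [0, 1], "Fuchs": [1, 1]}
tokens = ["Der", "schlaue", "Fuchs"]
d_k = len(q_fuchs)

# Step 1: dot products of the query with every key
dots = [sum(q * k for q, k in zip(q_fuchs, keys[t])) for t in tokens]

# Step 2: scale by sqrt(d_k)
scaled = [d / math.sqrt(d_k) for d in dots]

# Step 3: softmax over the scaled scores
exps = [math.exp(s) for s in scaled]
weights = [e / sum(exps) for e in exps]

# Step 4: weighted sum of the value vectors
output = [sum(w * values[t][i] for w, t in zip(weights, tokens)) for i in range(2)]

print("dot products:", dots)
print("scaled:      ", scaled)
print("weights:     ", [round(w, 3) for w in weights])
print("output:      ", [round(o, 3) for o in output])
```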

Deep Dive: The Scaling Factor sqrt(d_k)

Without dividing by sqrt(d_k), dot products grow large in high dimensions, because each score sums d_k products. Large scores push softmax toward extreme outputs (near 0 or 1), where its gradients become vanishingly small, the so-called "softmax saturation." Dividing by sqrt(d_k) keeps the scores in a moderate range, so softmax stays out of saturation and training remains stable.
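A small experiment can make the effect visible. The sketch below assumes query and key components drawn independently from a standard normal distribution, which is an idealization and not a claim about trained models. It measures how widely the dot products spread with and without scaling, then compares softmax on one hypothetical set of raw scores.

```python
import math
import random


def softmax(scores):
    """Turn a list of scores into positive weights that sum to 1."""
    peak = max(scores)  # subtracting the max avoids overflow in exp
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot_product_std(d_k, samples=1000, scaled=False, seed=0):
    """Standard deviation of q · k for random q, k with unit-variance entries."""
    rng = random.Random(seed)
    results = []
    for _ in range(samples):
        q = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
        k = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
        dot = sum(a * b for a, b in zip(q, k))
        results.append(dot / math.sqrt(d_k) if scaled else dot)
    mean = sum(results) / samples
    return math.sqrt(sum((r - mean) ** 2 for r in results) / samples)


for d_k in (4, 64, 256):
    print(d_k, round(dot_product_std(d_k), 2), round(dot_product_std(d_k, scaled=True), 2))

# Hypothetical raw scores for d_k = 64, before and after scaling
raw_scores = [16.0, 8.0, 0.0]
unscaled_weights = softmax(raw_scores)
scaled_weights = softmax([s / math.sqrt(64) for s in raw_scores])

print("unscaled:", [round(w, 4) for w in unscaled_weights])
print("scaled:  ", [round(w, 4) for w in scaled_weights])
```

The unscaled spread grows with d_k, while the scaled spread stays close to 1. On the example scores, the unscaled softmax puts practically all weight on one token, whereas the scaled version still distributes attention.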

Q — Query

The Query vector encodes what information this token is looking for. Each token generates its own Query by multiplying its embedding with the learned weight matrix W_Q. The Query is then compared against all Keys to determine relevance.
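The projection step can be sketched as a plain vector-matrix product. Both the embedding `x_fuchs` and the matrix `W_Q` below are hypothetical values, chosen only so that the result equals the Query [2, 0, 2, 0] used in the worked example; in a trained model, W_Q is learned.

```python
def project(embedding, weight_matrix):
    """Row vector (length d_model) times matrix (d_model x d_k)."""
    d_k = len(weight_matrix[0])
    return [
        sum(embedding[i] * weight_matrix[i][j] for i in range(len(embedding)))
        for j in range(d_k)
    ]


# Hypothetical 3-dimensional embedding for "Fuchs"
x_fuchs = [1.0, 0.0, 1.0]

# Hypothetical weight matrix W_Q (3 x 4), standing in for learned parameters
W_Q = [
    [2.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 2.0, 0.0],
]

q_fuchs = project(x_fuchs, W_Q)
print(q_fuchs)
```

Keys and Values are produced the same way, each with its own weight matrix.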

Concrete Example

Formula: Attention(Q, K, V) = softmax(Q·Kᵀ/√dₖ) · V

Components: Q·Kᵀ (relevance scores), / √dₖ (numerical stability)

1. Q·Kᵀ: dot products measure relevance → [2, 2, 8]

2. Divide by √dₖ to prevent softmax saturation → [1, 1, 4]

3. Softmax normalizes to probabilities → [0.05, 0.05, 0.90]

4. Weighted sum of V vectors → output [0.95, 0.95]

The Transformer Block — Assembling the Machine

The three components of a Transformer Block in detail:

Analogy:

Think of a Transformer Block as a committee meeting in three phases. Phase 1 (Multi-Head Attention): multiple expert panels discuss the input simultaneously — one analyzes grammar, another meaning, another tracks references. Phase 2 (Add & Norm): each participant writes a brief combining the new discussion insights with their original notes (residual connection = "don't forget what you already knew"). Phase 3 (Feed-Forward): each participant privately processes and consolidates their updated understanding before the next round.

Example

Committee members influence each other during discussion. In Multi-Head Attention, the heads compute their attention patterns independently and in parallel; their results are combined only afterwards, when the head outputs are concatenated.

Definition:

A Transformer Block has three components: (1) Multi-Head Attention — multiple parallel attention computations learning different relationship patterns. (2) Add & Norm — a residual connection adds the original input to the attention output (prevents vanishing gradients), followed by Layer Normalization for training stability. (3) Feed-Forward Network — two dense layers with ReLU, applied independently to each token.

Transformer Block: Data Flow

Multi-Head Attention

Multiple parallel attention computations — each head learns different patterns (grammar, semantics, references)

Add & Norm

Residual connection + Layer Normalization — stabilizes gradient flow in deep networks

Feed-Forward Network

Two dense layers with ReLU — processes the information gathered by attention, per token

Add & Norm

Second residual connection + normalization — output ready for the next block
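The data flow above can be sketched in plain Python. This is a simplified illustration under stated assumptions, not a reference implementation: it uses a single attention head instead of several, omits biases and the learnable scale and shift of Layer Normalization, and all weight matrices are hypothetical values passed in by the caller.

```python
import math


def softmax(scores):
    peak = max(scores)  # subtracting the max avoids overflow in exp
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(vec, matrix):
    """Row vector (length a) times matrix (a x b)."""
    return [sum(vec[i] * matrix[i][j] for i in range(len(vec))) for j in range(len(matrix[0]))]


def self_attention(X, W_q, W_k, W_v):
    """Single-head attention (simplification of Multi-Head Attention)."""
    Q = [matvec(x, W_q) for x in X]
    K = [matvec(x, W_k) for x in X]
    V = [matvec(x, W_v) for x in X]
    d_k = len(K[0])
    out = []
    for q in Q:
        weights = softmax([sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K])
        out.append([sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))])
    return out


def layer_norm(vec, eps=1e-5):
    """Normalize one token vector to zero mean and (almost) unit variance."""
    mean = sum(vec) / len(vec)
    var = sum((v - mean) ** 2 for v in vec) / len(vec)
    return [(v - mean) / math.sqrt(var + eps) for v in vec]


def add_and_norm(x, sublayer_out):
    """Residual connection followed by Layer Normalization."""
    return layer_norm([a + b for a, b in zip(x, sublayer_out)])


def feed_forward(vec, W1, W2):
    """Two dense layers with ReLU in between, applied to one token."""
    hidden = [max(0.0, h) for h in matvec(vec, W1)]
    return matvec(hidden, W2)


def transformer_block(X, W_q, W_k, W_v, W1, W2):
    attended = self_attention(X, W_q, W_k, W_v)
    X1 = [add_and_norm(x, a) for x, a in zip(X, attended)]
    return [add_and_norm(x, feed_forward(x, W1, W2)) for x in X1]


# Hypothetical weights: d_model = 4, hidden size 8
identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
W1 = [[0.1 * ((i + j) % 3 - 1) for j in range(8)] for i in range(4)]
W2 = [[0.1 * ((i * j) % 3 - 1) for j in range(4)] for i in range(8)]

X = [[1.0, 0.0, 0.0, 0.0], [1.0, 1.0, 0.0, 0.0], [2.0, 0.0, 2.0, 0.0]]
block_output = transformer_block(X, identity, identity, identity, W1, W2)

for row in block_output:
    print([round(v, 3) for v in row])
```

The output has the same shape as the input (three tokens, four dimensions each), which is what allows blocks to be stacked one after another.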

Modern models stack dozens to hundreds of such blocks:

96 blocks: GPT-3 stacks 96 Transformer Blocks, each with 96 attention heads

175 billion parameters: GPT-3 has 175 billion trainable parameters

2017: the paper "Attention Is All You Need" by Vaswani et al.

Watch Out: More Attention Heads ≠ Automatically Better

More heads mean more parallel perspectives, but not necessarily better results. The heads share the total model dimension: at a fixed model size, adding heads makes each individual head smaller. The optimal number of heads depends on the task and model size.
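The trade-off is simple arithmetic. The sketch below assumes the common setup in which the model dimension is split evenly across heads; the model dimension of 768 is a hypothetical example value, and `head_dimension` is an illustrative helper name.

```python
def head_dimension(d_model: int, num_heads: int) -> int:
    """Size of each head when d_model is split evenly across the heads."""
    if d_model % num_heads != 0:
        raise ValueError("d_model must be divisible by num_heads")
    return d_model // num_heads


d_model = 768  # hypothetical model dimension

for num_heads in (1, 4, 12, 48):
    print(num_heads, "heads ->", head_dimension(d_model, num_heads), "dimensions per head")
```

With the model size held fixed, going from 12 to 48 heads shrinks each head from 64 to 16 dimensions, so each head has less capacity to represent its pattern.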

Deep Dive: O(n²), the Quadratic Cost of Attention

Self-Attention compares every token with every other: that's n × n comparisons. Double the sequence length, quadruple the cost. At 1,000 tokens: 1 million comparisons. At 10,000 tokens: 100 million. This is why context length is a hard problem: GPT-3 was limited to 2,048 tokens. Modern models use various optimizations to enable longer contexts. Flash Attention still computes exact attention but organizes the work to use far less memory traffic; Sparse Attention computes only a subset of the token pairs. For full attention, however, the fundamental quadratic complexity remains.
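The growth is easy to tabulate. In the sketch below, `attention_comparisons` counts the n × n token pairs, and `score_matrix_bytes` estimates the size of one full score matrix under the assumption that every score is stored as a 4-byte float for a single head in a single block; real systems differ in precision and in how much of the matrix they ever materialize.

```python
def attention_comparisons(n_tokens: int) -> int:
    """Number of token pairs compared by full self-attention."""
    return n_tokens * n_tokens


def score_matrix_bytes(n_tokens: int, bytes_per_score: int = 4) -> int:
    """Size of one n x n score matrix (assumption: 4-byte floats, one head)."""
    return attention_comparisons(n_tokens) * bytes_per_score


for n in (1_000, 2_048, 10_000):
    print(
        f"{n:>6} tokens: {attention_comparisons(n):>12,} comparisons, "
        f"{score_matrix_bytes(n) / 1e6:,.1f} MB per score matrix"
    )

growth_when_doubling = attention_comparisons(2_000) / attention_comparisons(1_000)
print("cost factor when doubling the length:", growth_when_doubling)
```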

Key Takeaways

Self-Attention gives every token direct access to every other token: in parallel, without a sequential bottleneck and without vanishing gradients over distance. The weights are dynamic (computed fresh for each input), not static like CNN filters.
The QKV mechanism structures attention as an information retrieval system: Query asks "what do I need?", Key offers "what do I have?", Value delivers the actual content. The dot product Q·K measures relevance, softmax normalizes to probabilities.
A Transformer Block combines Multi-Head Attention, residual connections with Layer Normalization, and a Feed-Forward Network. GPT-3 stacks 96 such blocks with 96 heads each — 175 billion parameters total.

Checkpoint

What two fundamental problems of RNNs does Self-Attention solve, and how exactly does it do so?
Walk through the steps of a QKV attention computation using a simple example: dot products, scaling, softmax, and weighted output.
What are the three components of a Transformer Block, and why does this architecture enable massive GPU parallelization?

Quiz: Transformers & Attention

1. What fundamental problem of RNNs does Self-Attention solve by giving every token direct access to every other token?

☐ A) RNNs are too slow to train on GPUs.
☐ B) Information from early tokens gets lost over long sequences (vanishing gradient), because it must pass through every intermediate hidden state.
☐ C) RNNs cannot process text, only images.
☐ D) RNNs produce too many parameters.

2. In the QKV attention computation, the dot product Q·K for token A and token B equals 12, and d_k = 64. What is the scaled score before softmax?

☐ A) 12
☐ B) 12/64 = 0.1875
☐ C) 12/8 = 1.5
☐ D) 12 × 8 = 96

3. A Transformer model has 12 blocks, each with 12 attention heads. How many different attention patterns are computed in total across all blocks?

☐ A) 12 (one per block)
☐ B) 24 (12 blocks + 12 heads)
☐ C) 144 (12 blocks × 12 heads)
☐ D) 48 (12 blocks × 4 tokens)

4. Self-Attention has O(n²) computational complexity where n is the sequence length. If processing a 1,000-token document requires X computation, approximately how much more does a 10,000-token document require?

☐ A) 10 times more (linear growth)
☐ B) 100 times more (quadratic growth: 10² = 100)
☐ C) 1,000 times more (cubic growth)
☐ D) The same amount (attention is constant-cost)

Answer Key: 1) B · 2) C · 3) C · 4) B
