Data Structures III (Key-Based)

Key-based data structures that inconspicuously power almost every system.

Arrays give you O(1) access by position. Binary search trees deliver O(log n) access by order. But what if you need to look up data by an arbitrary key — a username, a URL, or a session token? Hash tables complete the picture: they convert any key into an array position in constant time.

This article covers the three pillars of key-based data structures: the hash trick itself, the collision problem that every hash table must solve, and the caching principle that exploits the hash trick in practice.

Hash Tables: The O(1) Trick

Imagine a huge library. You could search every shelf — O(n). Or use an alphabetical catalogue that halves the search space each time — O(log n), like a binary search tree. Or you whisper the title to a magical librarian who instantly tells you the exact shelf number — O(1). The librarian is the hash function: a mechanical rule that converts the title into a number without understanding it.

Analogy:

Definition:

A hash table stores key-value pairs in an array of buckets. A hash function maps any key to an integer index: index = hash(key) % array_size. Looking up a key means computing the hash and jumping directly to the bucket — O(1) on average, regardless of table size. Good hash functions are deterministic, uniformly distributing, and fast to compute.

But beware: a real librarian recognises similarities and corrects typos. A hash function treats even the tiniest differences as completely different keys (avalanche effect). And what happens when two titles map to the same shelf? The magical librarian never has this problem — hash tables do.

Hash table with 10 slots:

hash("cat") = (99 + 97 + 116) % 10 = 312 % 10 = 2
hash("dog") = (100 + 111 + 103) % 10 = 314 % 10 = 4

Looking up "cat":
→ Compute hash → 2
→ Jump to slot 2
→ Done. One step instead of up to 10.

The ASCII values of the characters are summed and mapped to the table size using modulo. This gives you a direct address — no searching, no sorting, just compute and jump.

Array: O(1) by Index

Direct access via position number. Fast when you know the position — but you must know the number.

Hash Table: O(1) by Key

Direct access via any key (name, URL, ID). The hash function computes the position automatically.

O(1) is the average case with a good hash function and moderate load factor. In the worst case — many collisions, high load factor — access degrades to O(n). That is why rehashing exists: the table is enlarged and all entries are redistributed.

Collisions: When Two Keys Want the Same Slot

A collision occurs when two different keys hash to the same index. The pigeonhole principle guarantees this is unavoidable: if there are more possible keys than buckets, some must share.

Imagine a coat check with 100 numbered hooks. The attendant assigns hooks based on the first letter of your last name — A gets hook 1, B gets hook 2. When two guests named "Smith" and "Sullivan" both map to hook 19, the attendant has two options: hang both coats on the same hook and search through them later (separate chaining), or put the second coat on the next free hook (open addressing).

A real coat-check attendant improvises when running out of hooks — stacking coats, hanging them on the door, or placing them on a chair. A hash table cannot improvise. It needs strict, pre-programmed rules (chaining or probing) for every collision.

Separate Chaining Each bucket holds a linked list. On a collision, the new entry is appended to the list. Search: compute hash, then traverse the list in the bucket.

Open Addressing On a collision, the next free slot is found (linear probing, quadratic probing, or double hashing). All entries stay directly in the array.

Hash table with 7 slots, h(key) = len(key) % 7:

"cat" → len=3 → 3 % 7 = 3
"dog" → len=3 → 3 % 7 = 3  ← Collision!
"elephant" → len=8 → 8 % 7 = 1
"bird" → len=4 → 4 % 7 = 4

Separate chaining at slot 3:
[("cat", val1) → ("dog", val2)]

"cat" and "dog" both have 3 characters — their hashes collide at slot 3. With separate chaining, the slot holds a linked list of both entries. Searching for "dog" requires 2 comparisons instead of 1, but remains fast. As the load factor rises, chains grow — at 75%, rehashing is typically triggered.

Before and After: Rehashing

Before rehashing: 75% load factor, long chains, slower search

After rehashing: doubled size, short chains, fast search restored

Collisions are mathematically guaranteed by the pigeonhole principle. A well-designed hash table handles them efficiently. It only becomes problematic if the hash function distributes poorly or the load factor rises too high without rehashing.

Interactive: Hash Demonstrator

Enter a key and watch how the hash function computes a slot number from it. Insert multiple keys and experience collisions live — when two different keys map to the same slot.

Key

Table size

—

Caching: Trading Memory for Speed

Caching stores frequently or recently accessed data in fast memory to avoid slow re-computation or network calls. The key terms: cache hit (data found locally — fast), cache miss (data not in cache — must fetch from slow source), and cache invalidation (removing stale entries).

The pizza-in-the-fridge analogy: Ordering pizza from a delivery service takes 30-40 minutes (fetching from the original source). Reheating leftover pizza from the fridge takes minutes (cache hit). But it might be stale the next day (stale data). You must decide when to throw it out and order fresh (cache invalidation / TTL).

In real life, pizza goes bad gradually — you notice by smell or taste. In a digital cache, there is no gradual decay: data is either valid or stale, with no in-between. That is why invalidation must be controlled by exact metadata like max-age, not by intuition.

L1 Cache: ~1 ns (fastest, smallest memory)

L2 Cache: ~5 ns

L3 Cache: ~20 ns

RAM: ~100 ns

SSD: ~100 μs (1,000× slower than RAM)

Network: ~100 ms (1,000,000× slower than RAM)

First visit: the browser loads HTML, CSS, and images from the server (~2 seconds). The server sends Cache-Control: max-age=3600. Second visit within one hour: the browser serves from local cache (~50 ms). After one hour: cache expired, the browser re-fetches. Result: up to 50% faster perceived load time for returning visitors.

Caching only helps when data is accessed repeatedly. For one-time queries, cache misses dominate and the cache just wastes memory. Also, aggressive caching without proper invalidation serves stale data. Hence the famous saying: "The two hard things in CS: cache invalidation and naming things."

In large language models (LLMs), hash tables are used for token-to-ID mapping: every word or subword is mapped to an integer that serves as an index into the embedding matrix. Hashing also plays a central role in data deduplication — identical training data is identified and filtered via its hash value. The KV cache (key-value cache) in transformer models is another example: previously computed attention values are cached to avoid re-computation during token-by-token generation.

A hash function converts any key into an array index — that is the "magic trick" behind average O(1) lookup in dictionaries, maps, and caches.
Caching trades memory for speed: store the answer once, skip the expensive computation or network call next time — but stale data is the price without proper invalidation.
Collisions are not bugs — they are mathematically inevitable. Good hash tables handle them gracefully through chaining or probing and keep performance stable via rehashing.

What does a hash function do?

It sorts data alphabetically

It converts a key into an array index

It encrypts data

It compresses files

1. What does a hash function do?

☐ A) It sorts data alphabetically
☐ B) It converts a key into an array index
☐ C) It encrypts data
☐ D) It compresses files

2. A hash function computes the value 487 for a key. The table has 10 slots. In which slot does the entry end up?

☐ A) Slot 48
☐ B) Slot 4
☐ C) Slot 7
☐ D) Slot 487

3. Your cache has max-age=3600 (one hour). A user revisits the page after 30 minutes. What happens?

☐ A) Cache miss — data is fetched from the server
☐ B) Cache hit — data is served from local cache
☐ C) The cache is deleted
☐ D) The server automatically sends new data

4. A hash table with 100 slots stores 200 entries. Why must collisions occur?

☐ A) The hash function is buggy
☐ B) 200 keys cannot fit into 100 slots without sharing — pigeonhole principle
☐ C) The table is too slow
☐ D) Collisions only happen with bad keys

Answer Key: 1) B · 2) C · 3) B · 4) B

Checkpoint: Key-Based Data Structures

You store the token 'A1B2' in a hash table with 10 slots. The hash function computes the value 487. In which slot does the token end up — and why?
Two users 'Miller' and 'Meyer' both map to slot 13 through the hash function. What is this event called and what are the two strategies to handle it?
Your weather widget loads temperatures from a server and uses aggressive caching without invalidation. What problem arises for the users?

Data Structures III (Key-Based)

Hash Tables: The O(1) Trick

Hash Table

Analogy:

Definition:

Example: Hash Calculation

Misconception: Hash Tables Are Always O(1)

Collisions: When Two Keys Want the Same Slot

Collision Strategies

Example: Collision in Practice

Rehashing: When the Table Gets Full

Before and After: Rehashing

Misconception: Collisions Mean the Hash Table Is Broken

Interactive: Hash Demonstrator

Caching: Trading Memory for Speed

Memory Hierarchy: Latency Comparison

Real-World Example: Browser Caching

Misconception: Caching Always Makes Everything Faster

Deep Dive: Hash Tables in AI

Key Takeaways

Quiz: Key-Based Data Structures

What does a hash function do?

Checkpoint: Key-Based Data Structures

Hash Tables: The O(1) Trick

Hash Table

Analogy:

Definition:

Example: Hash Calculation

Misconception: Hash Tables Are Always O(1)

Collisions: When Two Keys Want the Same Slot

Collision Strategies

Example: Collision in Practice

Rehashing: When the Table Gets Full

Before and After: Rehashing

Misconception: Collisions Mean the Hash Table Is Broken

Interactive: Hash Demonstrator

Caching: Trading Memory for Speed

Memory Hierarchy: Latency Comparison

Real-World Example: Browser Caching

Misconception: Caching Always Makes Everything Faster

Deep Dive: Hash Tables in AI

Key Takeaways

Quiz: Key-Based Data Structures

What does a hash function do?

Checkpoint: Key-Based Data Structures

Related Content

Article

Data Structures II (Hierarchical & Networked)

Algorithmic Complexity

What Is an Algorithm?

Lists and Dictionaries

The Raw Material: Data Engineering for Machine Learning

Linear Data Structures

Glossary

Timeline