Data Structures III (Key-Based)

Key-based data structures that inconspicuously power almost every system.

Fundamentals 15 min Intermediate May 3, 2026

Arrays give you O(1) access by position. Binary search trees deliver O(log n) access by order. But what if you need to look up data by an arbitrary key — a username, a URL, or a session token? Hash tables complete the picture: they convert any key into an array position in constant time.

This article covers the three pillars of key-based data structures: the hash trick itself, the collision problem that every hash table must solve, and the caching principle that exploits the hash trick in practice.

Hash Tables: The O(1) Trick

Hash Table

AnalogyDefinition
Imagine a huge library. You could search every shelf — O(n). Or use an alphabetical catalogue that halves the search space each time — O(log n), like a binary search tree. Or you whisper the title to a magical librarian who instantly tells you the exact shelf number — O(1). The librarian is the hash function: a mechanical rule that converts the title into a number without understanding it.

But beware: a real librarian recognises similarities and corrects typos. A hash function treats even the tiniest differences as completely different keys (avalanche effect). And what happens when two titles map to the same shelf? The magical librarian never has this problem — hash tables do.

Example: Hash Calculation

Hash table with 10 slots:

hash("cat") = (99 + 97 + 116) % 10 = 312 % 10 = 2
hash("dog") = (100 + 111 + 103) % 10 = 314 % 10 = 4

Looking up "cat":
→ Compute hash → 2
→ Jump to slot 2
→ Done. One step instead of up to 10.

The ASCII values of the characters are summed and mapped to the table size using modulo. This gives you a direct address — no searching, no sorting, just compute and jump.

Array: O(1) by Index

Direct access via position number. Fast when you know the position — but you must know the number.

Hash Table: O(1) by Key

Direct access via any key (name, URL, ID). The hash function computes the position automatically.

Misconception: Hash Tables Are Always O(1)

O(1) is the average case with a good hash function and moderate load factor. In the worst case — many collisions, high load factor — access degrades to O(n). That is why rehashing exists: the table is enlarged and all entries are redistributed.

Collisions: When Two Keys Want the Same Slot

A collision occurs when two different keys hash to the same index. The pigeonhole principle guarantees this is unavoidable: if there are more possible keys than buckets, some must share.

Imagine a coat check with 100 numbered hooks. The attendant assigns hooks based on the first letter of your last name — A gets hook 1, B gets hook 2. When two guests named "Smith" and "Sullivan" both map to hook 19, the attendant has two options: hang both coats on the same hook and search through them later (separate chaining), or put the second coat on the next free hook (open addressing).

A real coat-check attendant improvises when running out of hooks — stacking coats, hanging them on the door, or placing them on a chair. A hash table cannot improvise. It needs strict, pre-programmed rules (chaining or probing) for every collision.

Collision Strategies

Separate Chaining Each bucket holds a linked list. On a collision, the new entry is appended to the list. Search: compute hash, then traverse the list in the bucket.
Open Addressing On a collision, the next free slot is found (linear probing, quadratic probing, or double hashing). All entries stay directly in the array.

Example: Collision in Practice

Hash table with 7 slots, h(key) = len(key) % 7:

"cat" → len=3 → 3 % 7 = 3
"dog" → len=3 → 3 % 7 = 3  ← Collision!
"elephant" → len=8 → 8 % 7 = 1
"bird" → len=4 → 4 % 7 = 4

Separate chaining at slot 3:
[("cat", val1) → ("dog", val2)]

"cat" and "dog" both have 3 characters — their hashes collide at slot 3. With separate chaining, the slot holds a linked list of both entries. Searching for "dog" requires 2 comparisons instead of 1, but remains fast. As the load factor rises, chains grow — at 75%, rehashing is typically triggered.

Rehashing: When the Table Gets Full

Before and After: Rehashing

Before rehashing: 75% load factor, long chains, slower search
After rehashing: doubled size, short chains, fast search restored

Misconception: Collisions Mean the Hash Table Is Broken

Collisions are mathematically guaranteed by the pigeonhole principle. A well-designed hash table handles them efficiently. It only becomes problematic if the hash function distributes poorly or the load factor rises too high without rehashing.

Interactive: Hash Demonstrator

Enter a key and watch how the hash function computes a slot number from it. Insert multiple keys and experience collisions live — when two different keys map to the same slot.

0
1
2
3
4
5
6
7
8
9

Caching: Trading Memory for Speed

Caching stores frequently or recently accessed data in fast memory to avoid slow re-computation or network calls. The key terms: cache hit (data found locally — fast), cache miss (data not in cache — must fetch from slow source), and cache invalidation (removing stale entries).

The pizza-in-the-fridge analogy: Ordering pizza from a delivery service takes 30-40 minutes (fetching from the original source). Reheating leftover pizza from the fridge takes minutes (cache hit). But it might be stale the next day (stale data). You must decide when to throw it out and order fresh (cache invalidation / TTL).

In real life, pizza goes bad gradually — you notice by smell or taste. In a digital cache, there is no gradual decay: data is either valid or stale, with no in-between. That is why invalidation must be controlled by exact metadata like max-age, not by intuition.

Memory Hierarchy: Latency Comparison

1
L1 Cache: ~1 ns (fastest, smallest memory)
2
L2 Cache: ~5 ns
3
L3 Cache: ~20 ns
4
RAM: ~100 ns
5
SSD: ~100 μs (1,000× slower than RAM)
6
Network: ~100 ms (1,000,000× slower than RAM)

Real-World Example: Browser Caching

First visit: the browser loads HTML, CSS, and images from the server (~2 seconds). The server sends Cache-Control: max-age=3600. Second visit within one hour: the browser serves from local cache (~50 ms). After one hour: cache expired, the browser re-fetches. Result: up to 50% faster perceived load time for returning visitors.

Misconception: Caching Always Makes Everything Faster

Caching only helps when data is accessed repeatedly. For one-time queries, cache misses dominate and the cache just wastes memory. Also, aggressive caching without proper invalidation serves stale data. Hence the famous saying: "The two hard things in CS: cache invalidation and naming things."

In large language models (LLMs), hash tables are used for token-to-ID mapping: every word or subword is mapped to an integer that serves as an index into the embedding matrix. Hashing also plays a central role in data deduplication — identical training data is identified and filtered via its hash value. The KV cache (key-value cache) in transformer models is another example: previously computed attention values are cached to avoid re-computation during token-by-token generation.

Key Takeaways

  1. A hash function converts any key into an array index — that is the "magic trick" behind average O(1) lookup in dictionaries, maps, and caches.
  2. Caching trades memory for speed: store the answer once, skip the expensive computation or network call next time — but stale data is the price without proper invalidation.
  3. Collisions are not bugs — they are mathematically inevitable. Good hash tables handle them gracefully through chaining or probing and keep performance stable via rehashing.

Quiz: Key-Based Data Structures

Question 1 / 4
Not completed

What does a hash function do?

Select one answer
Answer Key: 1) B · 2) C · 3) B · 4) B

Checkpoint: Key-Based Data Structures

  • You store the token 'A1B2' in a hash table with 10 slots. The hash function computes the value 487. In which slot does the token end up — and why?
  • Two users 'Miller' and 'Meyer' both map to slot 13 through the hash function. What is this event called and what are the two strategies to handle it?
  • Your weather widget loads temperatures from a server and uses aggressive caching without invalidation. What problem arises for the users?