Unsupervised Learning

Learning without an answer key — the more demanding but often more practical variant of machine learning.

Fundamentals 8 min Intermediate May 18, 2026

Your streaming service groups movies into categories that no one taught it. Your bank blocks fraudulent transactions in milliseconds — even though no one ever showed the system what fraud looks like. How do machines discover hidden patterns in data when there are no answers to learn from?

After seven articles on Supervised Learning, where models train on labeled data, we flip the approach. What happens when there are no labels, no correct answers, no teacher? Unsupervised Learning is how machines discover hidden structure in raw, unlabeled data — from customer segmentation to fraud detection. This article introduces three foundational techniques.

The Paradigm Without a Teacher

In Supervised Learning, the model has answers to learn from — the labels. In Unsupervised Learning, the algorithm must find patterns entirely on its own. This is not easier — it is fundamentally different.

Three pillars support this paradigm: clustering (finding groups), dimensionality reduction (finding the essence), and anomaly detection (finding outliers). Each technique tackles a different facet of the challenge, and each comes with trade-offs that the practitioner — not the machine — must navigate.

Clustering with k-Means

k-Means Clustering

Analogy
Imagine dumping a box of 200 mixed buttons on a table — different colors, sizes, materials — and asking a friend who has never seen buttons before to sort them into groups. Without any instructions about categories, the friend starts placing similar-looking buttons together: large wooden ones here, small shiny metal ones there. k-Means does exactly this in mathematical space — measuring distances between feature vectors instead of eyeballing appearance.

The Four-Step Algorithm

k-Means works in four clearly defined steps, two of which repeat:

1. Choose k random starting positions as cluster centers (centroids)
2. Assign each data point to the nearest centroid
3. Recalculate each centroid as the mean of all points assigned to it
4. Repeat steps 2 and 3 until the assignments stop changing (convergence)
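
The four steps can be sketched in plain Python. This is a minimal illustration under stated assumptions, not a production implementation: the function names, the toy points, the choice to start from k randomly picked data points, and the rule that an empty cluster keeps its old centroid are all illustrative.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean_point(points):
    """Component-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def k_means(points, k, max_iter=100, seed=0):
    rng = random.Random(seed)
    # Step 1: pick k distinct data points as starting centroids.
    centroids = rng.sample(points, k)
    labels = []
    for _ in range(max_iter):
        # Step 2: assign each point to its nearest centroid.
        labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        # Step 3: move each centroid to the mean of its assigned points.
        new_centroids = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            # Assumption: an empty cluster simply keeps its old centroid.
            new_centroids.append(mean_point(members) if members else centroids[j])
        # Step 4: stop once the centroids no longer move.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, labels


# Toy data: two visually obvious groups (made-up values).
points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1), (8.0, 8.0), (8.2, 7.9), (7.9, 8.3)]
centroids, labels = k_means(points, k=2)
print(centroids)
print(labels)
```

Changing the seed changes the starting centroids, which is a simple way to observe how sensitive the result is to initialization.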

Worked Example: Customer Segmentation

An online retailer analyzes 10,000 customers using three features: age, income, and purchase frequency. With k=4, k-Means discovers four customer profiles that were never provided as labels: budget-conscious students, young families with high purchase frequency, well-earning professionals, and affluent retirees. These groups emerge solely from geometric proximity in three-dimensional feature space — the algorithm has zero prior knowledge about life stages or purchasing behavior.
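
One practical detail worth keeping in mind for an example like this: age, income, and purchase frequency live on very different numeric scales, and Euclidean distance would be dominated by income unless the features are rescaled first. A minimal sketch of z-score standardization follows. The customer rows and the helper name `standardize` are made up for illustration and are not the retailer's data.

```python
import statistics

# Hypothetical rows: (age in years, annual income in EUR, purchases per month).
customers = [
    (22, 18000, 2),
    (34, 52000, 9),
    (45, 88000, 4),
    (68, 61000, 1),
]


def standardize(rows):
    """Rescale every feature column to mean 0 and standard deviation 1."""
    columns = list(zip(*rows))
    means = [statistics.mean(col) for col in columns]
    stdevs = [statistics.pstdev(col) for col in columns]
    return [
        tuple((value - m) / s for value, m, s in zip(row, means, stdevs))
        for row in rows
    ]


scaled = standardize(customers)
for row in scaled:
    print([round(v, 2) for v in row])
```

After scaling, a difference of one standard deviation counts the same in every feature, so no single column decides the clustering by its units alone.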

The k Problem: Who Decides the Number?

The biggest catch with k-Means: the machine cannot determine on its own how many natural groups exist in the data. You, the practitioner, must set the hyperparameter k in advance. Choose k too low, and distinct groups get merged. Choose k too high, and natural clusters get artificially split. The choice of k is a human decision, not a machine discovery.

A common approach for selecting k is the Elbow Method. You compute the sum of squared distances from each point to its cluster center (Within-Cluster Sum of Squares, WCSS) for various k values, say from k=1 to k=10. Plotting WCSS against k typically produces a curve that drops steeply at first and then flattens out. The bend in the curve (the "elbow") marks where additional clusters bring only marginal improvement. For example, if the curve drops sharply up to k=3, only slightly from k=3 to k=4, and barely at all beyond that, then k=3 or k=4 is a sensible choice. But be careful: WCSS keeps shrinking as k grows (it reaches zero when every point is its own cluster), so the lowest value is not the goal, and the method is a heuristic, not a guarantee. Sometimes there is no clear bend, and you need domain knowledge.
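
The WCSS curve can be computed with a few lines of plain Python. This is a sketch under assumptions: the synthetic three-blob data, the helper names, and the use of five random restarts per k (keeping the lowest WCSS) are illustrative choices, not part of the method's definition.

```python
import random


def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def k_means_wcss(points, k, rng, max_iter=100):
    """Run k-Means once and return its within-cluster sum of squares."""
    centroids = rng.sample(points, k)
    labels = []
    for _ in range(max_iter):
        labels = [min(range(k), key=lambda j: sq_dist(p, centroids[j])) for p in points]
        updated = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                updated.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                updated.append(centroids[j])  # assumption: keep an empty cluster's centroid
        if updated == centroids:
            break
        centroids = updated
    return sum(sq_dist(p, centroids[lab]) for p, lab in zip(points, labels))


def elbow_curve(points, k_values, restarts=5, seed=0):
    """Lowest WCSS over several random restarts for each k."""
    rng = random.Random(seed)
    return {
        k: min(k_means_wcss(points, k, rng) for _ in range(restarts))
        for k in k_values
    }


# Synthetic data: three blobs of 30 points each around made-up centers.
data_rng = random.Random(42)
blob_centers = [(0.0, 0.0), (10.0, 0.0), (5.0, 9.0)]
points = [
    (cx + data_rng.gauss(0, 1), cy + data_rng.gauss(0, 1))
    for cx, cy in blob_centers
    for _ in range(30)
]

curve = elbow_curve(points, range(1, 7))
for k, wcss in curve.items():
    print(f"k={k}: WCSS={wcss:.1f}")
```

Reading the printed values from top to bottom is the numeric equivalent of looking for the bend in the plot.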

Common Misconception

Myth: k-Means always finds the correct clusters.

Fact: The result heavily depends on the choice of k and the random initialization of centroids. Different starting positions can produce entirely different cluster assignments. The Elbow Method helps with choosing k but is a heuristic — not a guarantee.

How Certain Is the Cluster Assignment?

In soft clustering, each data point is not rigidly assigned to one cluster but receives a probability for each cluster. A temperature parameter controls how sharp that distribution is: at low temperature, the point is assigned almost certainly to the nearest cluster. At high temperature, all clusters become nearly equally likely and the assignment becomes uncertain.
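
The effect of temperature can be sketched with a softmax over negative squared distances. The text does not state which formula is used, so treat this as one common choice and an assumption. The centroids, the query point, and the function name `soft_assign` are made up for illustration.

```python
import math


def soft_assign(point, centroids, temperature=1.0):
    """Cluster probabilities from a softmax over negative squared distances."""
    scores = [
        -sum((a - b) ** 2 for a, b in zip(point, c)) / temperature
        for c in centroids
    ]
    top = max(scores)  # subtract the maximum for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]


centroids = [(0.0, 0.0), (2.0, 0.0), (0.0, 3.0)]
point = (0.5, 0.2)

for t in (0.1, 1.0, 100.0):
    probs = soft_assign(point, centroids, temperature=t)
    print(t, [round(p, 3) for p in probs])
```

Dividing by a small temperature exaggerates the distance gaps, so the nearest cluster takes almost all the probability. A large temperature shrinks the gaps and pushes the distribution toward uniform.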

Probability Distribution (at T=1.0)

  • Cluster A: 75.8%
  • Cluster B: 16.9%
  • Cluster C: 5.1%
  • Outlier: 1.4%
  • Noise: 0.8%

Dimensionality Reduction with PCA

Principal Component Analysis (PCA)

Analogy
Imagine photographing a sculpture from many angles. Most photos from adjacent positions look nearly identical — they carry redundant information. PCA finds the few "best angles" that capture the most visual diversity of the sculpture, and discards the rest. You lose subtle details (the texture on the back), but the essential shape is preserved in far fewer images.

Connection to Eigenvectors

PCA connects directly to your knowledge of eigenvectors from Path I.C. The algorithm computes the eigenvectors of the dataset's covariance matrix. These eigenvectors define the new axes in the space — the principal components. The corresponding eigenvalues indicate how much variance each axis captures. You keep the eigenvectors with the largest eigenvalues and discard the rest.
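
The eigenvector recipe can be sketched without any library. Everything here is illustrative: the five 2-D observations are made up, and power iteration is used only because it fits in a few lines of standard-library Python. In practice a numerical library's eigensolver would normally do this step.

```python
import math


def covariance_matrix(rows):
    """Sample covariance matrix; each row is one observation."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    return [
        [sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(d)]
        for a in range(d)
    ]


def top_eigenpair(matrix, iterations=200):
    """Largest eigenvalue and its unit eigenvector via power iteration.

    Assumes the start vector is not orthogonal to the top eigenvector.
    """
    d = len(matrix)
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iterations):
        w = [sum(matrix[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    mv = [sum(matrix[a][b] * v[b] for b in range(d)) for a in range(d)]
    eigenvalue = sum(v[a] * mv[a] for a in range(d))
    return eigenvalue, v


# Made-up observations lying roughly along the line y = 2x.
rows = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 7.8), (5.0, 10.1)]
cov = covariance_matrix(rows)
eigenvalue, component = top_eigenpair(cov)

# Total variance is the sum of the diagonal (equal to the sum of all eigenvalues).
total_variance = sum(cov[j][j] for j in range(len(cov)))
explained = eigenvalue / total_variance

# Project each centered observation onto the first principal component.
means = [sum(r[j] for r in rows) / len(rows) for j in range(2)]
scores = [sum((r[j] - means[j]) * component[j] for j in range(2)) for r in rows]

print("first component:", [round(x, 3) for x in component])
print("explained variance share:", round(explained, 4))
```

Keeping only `scores` turns each two-number observation into a single number, which is the same move that turns 784 pixels into a handful of components.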

784 → 50
Dimensionality reduction on MNIST: 784 pixel features are compressed to ~50 principal components (a 93% reduction) while the digits remain distinguishable

Worked Example: MNIST Digits

The MNIST dataset contains 70,000 handwritten digits, each with 28×28 = 784 pixels. Many border pixels are almost always black, so they carry almost no information. PCA exploits this redundancy: approximately 50 principal components capture roughly 80% of the total pixel variance (reaching 95% takes about 150 components). The digits remain largely distinguishable, and the feature vector that subsequent classifiers must store and process shrinks by about 93%.
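
How many components to keep is usually read off the cumulative share of variance. The sketch below shows the bookkeeping with made-up eigenvalues (they are not MNIST's), plus the arithmetic behind the 784 → 50 figure. The function name `components_for_variance` is illustrative.

```python
def components_for_variance(eigenvalues, target=0.95):
    """Smallest number of components whose variance share reaches `target`."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    running = 0.0
    for count, value in enumerate(ordered, start=1):
        running += value
        if running / total >= target:
            return count
    return len(ordered)


# Hypothetical eigenvalues of a 6-feature covariance matrix.
eigenvalues = [4.0, 3.0, 2.0, 0.75, 0.125, 0.125]
needed = components_for_variance(eigenvalues, target=0.95)
print("components needed for 95%:", needed)

# Size reduction when 784 features are replaced by 50 components.
reduction = 1 - 50 / 784
print(f"feature reduction: {reduction:.1%}")
```

The same function applied to a real dataset's eigenvalues gives the trade-off curve between retained variance and the number of features.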

Why is the eigenvector with the largest eigenvalue the first principal component? Because each eigenvalue of the covariance matrix equals the variance of the data along the corresponding eigenvector direction. The largest eigenvalue therefore marks the direction with the greatest data spread. The second principal component is orthogonal (perpendicular) to the first and captures the largest share of the spread not explained by the first. Because the axes are orthogonal eigenvectors of the covariance matrix, the data's coordinates along them are uncorrelated (a weaker property than statistical independence), so no variance is double-counted. From Path I.C you know that eigenvectors of a symmetric matrix belonging to different eigenvalues are always orthogonal to each other, and a covariance matrix is symmetric. This is precisely why PCA works so elegantly.

Common Misconception

Myth: PCA reduces dimensions without losing information.

Fact: Whenever PCA drops a component with a nonzero eigenvalue, it discards variance. The dropped components contained real signal, just less of it than the ones you kept. The question is never "Do we lose information?" but "How much loss is acceptable for how much efficiency gain?"

Anomaly Detection

Analogy
Think of a night security guard at a factory. Every night, the guard sees the same patterns: lights off at 10 PM, cleaning crew at 11 PM, silence until 6 AM. Nobody gave the guard a list of "suspicious events." Over weeks, the guard simply learns what normal looks like. When a door opens at 3 AM and footsteps head toward the server room — the guard flags it instantly. Not because they recognized a burglar, but because the event doesn't match the learned pattern of normality.

Two Approaches Compared

Distance-Based

Measures how far a point is from known clusters. Strength: Works well with spherical clusters. Weakness: Struggles with unevenly distributed data.

Density-Based

Measures how sparsely populated a point's neighborhood is. Strength: Detects anomalies even in irregularly shaped data. Weakness: Sensitive to the choice of neighborhood radius.
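
The two scoring ideas can be put side by side in a few lines. This is a minimal sketch: the data points, the single cluster center, the radius of 0.5, and both function names are assumptions chosen for illustration.

```python
import math


def distance_score(point, centroids):
    """Distance-based: distance to the nearest known cluster center."""
    return min(math.dist(point, c) for c in centroids)


def density_score(point, data, radius):
    """Density-based: how many data points lie within `radius` of the point."""
    return sum(1 for other in data if math.dist(point, other) <= radius)


# Made-up "normal" observations clustered around the origin.
data = [(0.0, 0.0), (0.1, 0.2), (-0.2, 0.1), (0.2, -0.1), (-0.1, -0.2), (0.0, 0.3)]
center = tuple(sum(c) / len(data) for c in zip(*data))

typical = (0.05, 0.05)
unusual = (3.0, 3.0)

for name, query in (("typical", typical), ("unusual", unusual)):
    print(
        name,
        "distance:", round(distance_score(query, [center]), 2),
        "neighbors:", density_score(query, data, radius=0.5),
    )
```

A large distance or a small neighbor count both point to an anomaly. Rerunning with a different radius shows how strongly the density score depends on that one parameter.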

99.9%
Class distribution: in credit card transaction data, 99.9% of all transactions are legitimate, which makes fraud detection a needle-in-a-haystack challenge

Worked Example: Credit Card Fraud

A credit card company monitors transactions. 99.9% are legitimate. The model learns each customer's purchase profile: German stores, 10–200 EUR range, daytime purchases. Then three rapid 5,000 EUR transactions come from a foreign country. These points are geometrically far from the learned profile and get flagged automatically — not because the system "knows" fraud, but because the transactions don't match the normal pattern.
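
A deliberately simplified sketch of this idea uses only the transaction amount and flags values that lie many standard deviations from the customer's usual spending. The purchase history is made up, the threshold of 3 standard deviations is an assumption, and a real system would also use features such as location and time of day.

```python
import statistics

# Hypothetical purchase history of one customer, in EUR (10 to 200 range).
history = [25.0, 80.0, 120.0, 45.0, 199.0, 10.0, 60.0, 150.0]

mean_amount = statistics.mean(history)
stdev_amount = statistics.stdev(history)


def is_anomalous(amount, threshold=3.0):
    """Flag an amount whose z-score against the learned profile exceeds the threshold."""
    z_score = abs(amount - mean_amount) / stdev_amount
    return z_score > threshold


for amount in (120.0, 5000.0):
    print(amount, "flagged" if is_anomalous(amount) else "ok")
```

Nothing in this check knows what fraud is. It only measures how far a new value sits from what was normal for this customer.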

Common Misconception

Myth: Unsupervised Learning is easier because you don't need labels.

Fact: The absence of labels makes evaluation extremely difficult. In Supervised Learning, you have ground truth to measure accuracy against. In anomaly detection, you often can't tell whether a flagged point is a true anomaly or a false positive without human investigation.

Key Takeaways

  • k-Means groups data into k clusters — but k itself is a human decision, not a machine discovery.
  • PCA compresses data by keeping the axes of greatest variance — deliberately trading some information for massive efficiency gains.
  • Anomaly detection learns what "normal" looks like and flags everything that deviates — a paradigm that powers fraud detection, quality control, and cybersecurity.

Knowledge Check: Unsupervised Learning

Question 1 / 6

What must a practitioner specify before running k-Means that the algorithm cannot determine on its own?

Answer Key: 1) B · 2) B · 3) B · 4) D · 5) B · 6) B

Checkpoint: Unsupervised Learning

  • Why can the k-Means algorithm not simply decide on its own how many groups exist in the data?
  • When PCA reduces 784 pixels to 50 principal components — what exactly is lost and why is it still worthwhile?
  • What is the difference between distance-based and density-based anomaly detection — and when is each one appropriate?