Learning without an answer key — the more demanding but often more practical variant of machine learning.
Fundamentals 8 min Intermediate May 18, 2026
Your streaming service groups movies into categories that no one taught it. Your bank blocks fraudulent transactions in milliseconds — even though no one ever showed the system what fraud looks like. How do machines discover hidden patterns in data when there are no answers to learn from?
After seven articles on Supervised Learning, where models train on labeled data, we flip the approach. What happens when there are no labels, no correct answers, no teacher? Unsupervised Learning is how machines discover hidden structure in raw, unlabeled data — from customer segmentation to fraud detection. This article introduces three foundational techniques.
The Paradigm Without a Teacher
In Supervised Learning, the model has answers to learn from — the labels. In Unsupervised Learning, the algorithm must find patterns entirely on its own. This is not easier — it is fundamentally different.
Three pillars support this paradigm: clustering (finding groups), dimensionality reduction (finding the essence), and anomaly detection (finding outliers). Each technique tackles a different facet of the challenge, and each comes with trade-offs that the practitioner — not the machine — must navigate.
Clustering with k-Means
k-Means Clustering
Analogy:
Imagine dumping a box of 200 mixed buttons on a table — different colors, sizes, materials — and asking a friend who has never seen buttons before to sort them into groups. Without any instructions about categories, the friend starts placing similar-looking buttons together: large wooden ones here, small shiny metal ones there. k-Means does exactly this in mathematical space — measuring distances between feature vectors instead of eyeballing appearance.
Definition:
k-Means is an iterative algorithm that partitions n data points into k groups. In each iteration, every point is assigned to the nearest cluster center, then the centers are recalculated as the means of the assigned points. The process repeats until convergence — that is, until the assignments no longer change.
The Four-Step Algorithm
k-Means works in four clearly defined, repeating steps:
1. Choose k random starting positions as cluster centers (centroids)
2. Assign each data point to the nearest centroid
3. Recalculate centroids as the mean of all assigned points
4. Repeat steps 2 and 3 until convergence is reached
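The four steps can be sketched in a few lines of NumPy. This is a minimal illustration, not a production implementation: the function name `k_means`, the choice to draw the starting centroids from the data points themselves, and the synthetic two-blob data are all assumptions made for the example.

```python
import numpy as np

def k_means(points, k, max_iter=100, seed=0):
    """Plain k-Means on an (n, d) array; returns (labels, centroids)."""
    points = np.asarray(points, dtype=float)
    rng = np.random.default_rng(seed)
    # Step 1: pick k distinct data points as starting centroids (an assumption;
    # other initialisation schemes exist).
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    labels = None
    for _ in range(max_iter):
        # Step 2: assign every point to its nearest centroid.
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = distances.argmin(axis=1)
        # Step 4: stop once the assignments no longer change.
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Step 3: move each centroid to the mean of its assigned points.
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:  # keep the old centroid if a cluster is empty
                centroids[j] = members.mean(axis=0)
    return labels, centroids

# Synthetic demo data: two well-separated blobs in 2D.
rng = np.random.default_rng(42)
blob_a = rng.normal(loc=[0.0, 0.0], scale=0.5, size=(50, 2))
blob_b = rng.normal(loc=[10.0, 10.0], scale=0.5, size=(50, 2))
data = np.vstack([blob_a, blob_b])

labels, centroids = k_means(data, k=2)
print(centroids)
```

On data this cleanly separated, the two centroids should end up near the blob centers. Real data is rarely this tidy, which is why the choice of k and the initialization matter.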
Worked Example: Customer Segmentation
An online retailer analyzes 10,000 customers using three features: age, income, and purchase frequency. With k=4, k-Means discovers four customer profiles that were never provided as labels: budget-conscious students, young families with high purchase frequency, well-earning professionals, and affluent retirees. These groups emerge solely from geometric proximity in three-dimensional feature space — the algorithm has zero prior knowledge about life stages or purchasing behavior.
The k Problem: Who Decides the Number?
The biggest catch with k-Means: the machine cannot determine on its own how many natural groups exist in the data. You, the practitioner, must set the hyperparameter k in advance. Choose k too low, and distinct groups get merged. Choose k too high, and natural clusters get artificially split. The choice of k is a human decision, not a machine discovery.
Deep Dive: The Elbow Method for Choosing k
A common heuristic for selecting k is the Elbow Method. You compute the sum of squared distances from each point to its cluster center (Within-Cluster Sum of Squares, WCSS) for a range of k values, say from k=1 to k=10. Plotting WCSS against k typically produces a curve that drops steeply at first and then flattens out. The bend in the curve, the "elbow", marks where additional clusters bring only marginal improvement. For example, if the curve drops sharply up to k=3, only slightly at k=4, and barely at all at k=5, then k=3 or k=4 is a sensible choice. But be careful: the method is a heuristic, not a guarantee. Sometimes there is no clear bend, and you need domain knowledge.
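The WCSS curve behind the Elbow Method can be computed as follows. This is a sketch under assumptions: the helper names (`k_means_labels`, `wcss`, `elbow_curve`) are hypothetical, the data is synthetic, and each k is run from several random starts with the best result kept, which is one common way to reduce the influence of a bad initialization.

```python
import numpy as np

def k_means_labels(points, k, rng, max_iter=100):
    """One k-Means run from random starting points; returns cluster labels."""
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    labels = None
    for _ in range(max_iter):
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = distances.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:
                centroids[j] = members.mean(axis=0)
    return labels

def wcss(points, labels):
    """Sum of squared distances from each point to the mean of its cluster."""
    total = 0.0
    for j in np.unique(labels):
        members = points[labels == j]
        total += ((members - members.mean(axis=0)) ** 2).sum()
    return float(total)

def elbow_curve(points, k_values, n_restarts=10, seed=0):
    """Best WCSS over several random restarts, for each k."""
    rng = np.random.default_rng(seed)
    return [
        min(wcss(points, k_means_labels(points, k, rng)) for _ in range(n_restarts))
        for k in k_values
    ]

# Synthetic demo data: three blobs in 2D.
rng = np.random.default_rng(3)
centers = np.array([[0.0, 0.0], [8.0, 0.0], [4.0, 7.0]])
data = np.vstack([rng.normal(loc=c, scale=0.6, size=(60, 2)) for c in centers])

k_values = list(range(1, 7))
curve = elbow_curve(data, k_values)
for k, value in zip(k_values, curve):
    print(f"k={k}: WCSS={value:.1f}")
```

Plotting `curve` against `k_values` gives the elbow plot. Reading the bend off the plot remains a judgment call, as the text above notes.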
Common Misconception
Myth: k-Means always finds the correct clusters.
Fact: The result heavily depends on the choice of k and the random initialization of centroids. Different starting positions can produce entirely different cluster assignments. The Elbow Method helps with choosing k but is a heuristic — not a guarantee.
How Certain Is the Cluster Assignment?
In soft clustering, a data point is not rigidly assigned to one cluster; it receives a probability for each cluster. A temperature parameter controls how sharp those probabilities are. At low temperature (around 0.1), the point is assigned almost certainly to the nearest cluster. At high temperature (2.0 and above), the probabilities flatten toward equal values and the assignment becomes uncertain. At the standard setting (T≈1.0), the probabilities are used unscaled.
Example distribution at T=1.0: Cluster A 75.8%, Cluster B 16.9%, Cluster C 5.1%, Outlier 1.4%, Noise 0.8%.
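The effect of temperature on a soft assignment can be sketched with a softmax over negative distances. The article does not specify which formula its demo uses, so this particular scoring rule, the function name `soft_assignment`, and the example coordinates are assumptions chosen for illustration.

```python
import numpy as np

def soft_assignment(point, centroids, temperature):
    """Softmax over negative distances: closer centroids get higher probability."""
    distances = np.linalg.norm(centroids - point, axis=1)
    logits = -distances / temperature
    logits = logits - logits.max()  # shift for numerical stability
    weights = np.exp(logits)
    return weights / weights.sum()

# Hypothetical centroids and a point that lies closest to the first one.
centroids = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 4.0]])
point = np.array([0.5, 0.5])

cold = soft_assignment(point, centroids, temperature=0.1)
standard = soft_assignment(point, centroids, temperature=1.0)
hot = soft_assignment(point, centroids, temperature=100.0)

print("T=0.1  ", np.round(cold, 3))
print("T=1.0  ", np.round(standard, 3))
print("T=100.0", np.round(hot, 3))
```

Dividing by a small temperature exaggerates the distance differences, so nearly all probability goes to the nearest cluster. Dividing by a large temperature shrinks them, so the probabilities approach a uniform split.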
Dimensionality Reduction with PCA
Principal Component Analysis (PCA)
Analogy:
Imagine photographing a sculpture from many angles. Most photos from adjacent positions look nearly identical — they carry redundant information. PCA finds the few "best angles" that capture the most visual diversity of the sculpture, and discards the rest. You lose subtle details (the texture on the back), but the essential shape is preserved in far fewer images.
Definition:
Principal Component Analysis is a linear transformation that projects high-dimensional data onto a smaller set of new axes (principal components). These axes are chosen to capture the maximum variance in the data. The first principal component captures the greatest spread, the second is orthogonal to it and maximizes the remaining variance.
Connection to Eigenvectors
PCA connects directly to your knowledge of eigenvectors from Path I.C. The algorithm computes the eigenvectors of the dataset's covariance matrix. These eigenvectors define the new axes in the space — the principal components. The corresponding eigenvalues indicate how much variance each axis captures. You keep the eigenvectors with the largest eigenvalues and discard the rest.
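The eigenvector recipe described here can be written out directly with NumPy. This is a minimal sketch: the function name `pca` and the synthetic three-feature dataset are assumptions for the example, and library implementations typically use a singular value decomposition instead of forming the covariance matrix explicitly.

```python
import numpy as np

def pca(data, n_components):
    """PCA via eigen-decomposition of the covariance matrix."""
    centered = data - data.mean(axis=0)
    covariance = np.cov(centered, rowvar=False)
    # eigh is intended for symmetric matrices and returns eigenvalues in ascending order.
    eigenvalues, eigenvectors = np.linalg.eigh(covariance)
    order = np.argsort(eigenvalues)[::-1]      # largest variance first
    eigenvalues = eigenvalues[order]
    eigenvectors = eigenvectors[:, order]
    components = eigenvectors[:, :n_components]  # keep the top axes, discard the rest
    projected = centered @ components
    return projected, components, eigenvalues

# Synthetic data: the second feature is roughly twice the first,
# the third is small noise, so most variance lies along one direction.
rng = np.random.default_rng(1)
t = rng.normal(size=200)
data = np.column_stack([
    t,
    2 * t + 0.1 * rng.normal(size=200),
    0.1 * rng.normal(size=200),
])

projected, components, eigenvalues = pca(data, n_components=2)
print("eigenvalues:", np.round(eigenvalues, 3))
print("first principal component:", np.round(components[:, 0], 3))
```

The first principal component should point roughly along the direction (1, 2, 0), up to sign, because that is where the synthetic data spreads the most.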
784 → 50
Dimensionality Reduction on MNIST: 784 pixel features compressed to about 50 principal components (a 93% reduction) while the digits remain distinguishable
Worked Example: MNIST Digits
The MNIST dataset contains 70,000 handwritten digits, each stored as 28×28 = 784 pixels. Many border pixels are almost always black and carry no relevant information. PCA finds that roughly 50 principal components retain most of the variation that distinguishes one digit from another; the exact share of variance kept depends on preprocessing and should be checked on the data. The digits remain distinguishable, while the number of features per image drops by about 93%, which reduces storage and computation for subsequent classifiers.
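How many components to keep is usually decided from the cumulative share of variance they explain. The sketch below shows the mechanics on a small synthetic dataset, not on MNIST itself. The helper names, the 95% threshold, and the data (20 features driven by 3 hidden factors) are assumptions for illustration.

```python
import numpy as np

def explained_variance_ratio(data):
    """Share of total variance captured by each principal component, largest first."""
    centered = data - data.mean(axis=0)
    eigenvalues = np.linalg.eigvalsh(np.cov(centered, rowvar=False))[::-1]
    eigenvalues = np.clip(eigenvalues, 0.0, None)  # guard against tiny negative round-off
    return eigenvalues / eigenvalues.sum()

def components_needed(data, threshold):
    """Smallest number of components whose cumulative variance share reaches threshold."""
    cumulative = np.cumsum(explained_variance_ratio(data))
    count = int(np.searchsorted(cumulative, threshold)) + 1
    return min(count, len(cumulative))

# Synthetic stand-in for image data: 20 features generated from 3 hidden factors.
rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 3))
mixing = rng.normal(size=(3, 20))
data = latent @ mixing + 0.05 * rng.normal(size=(500, 20))

ratio = explained_variance_ratio(data)
n_keep = components_needed(data, threshold=0.95)
print("components needed for 95% of the variance:", n_keep)

# The feature-count arithmetic from the MNIST example: 784 pixels down to 50 components.
feature_reduction = 1 - 50 / 784
print(f"feature reduction: {feature_reduction:.1%}")
```

Running the same two functions on real image data would show how much variance a given number of components actually retains, which is the figure worth checking before settling on 50.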
Deep Dive: Eigenvectors as Principal Components
Why is the eigenvector with the largest eigenvalue the first principal component? Because each eigenvalue of the covariance matrix equals the variance of the data along its eigenvector's direction. The largest eigenvalue therefore belongs to the direction with the greatest data spread. The second principal component is orthogonal (perpendicular) to the first and captures the largest share of the variance that the first leaves unexplained. Orthogonal axes of this kind yield uncorrelated coordinates, so no variance is counted twice. From Path I.C you know that the eigenvectors of a symmetric matrix can always be chosen to be mutually orthogonal, and a covariance matrix is symmetric. This is precisely why PCA works so elegantly.
Common Misconception
Myth: PCA reduces dimensions without losing information.
Fact: PCA always discards variance. The dropped components contained real data — just less of it than the ones you kept. The question is never "Do we lose information?" but "How much loss is acceptable for how much efficiency gain?"
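The loss can be made visible by projecting data onto a few components, mapping it back, and measuring how far the reconstruction is from the original. The following is a minimal sketch on synthetic data; the function name `reconstruction_error` and the chosen feature scales are assumptions for the example.

```python
import numpy as np

def reconstruction_error(data, n_components):
    """Mean squared error after projecting onto the top components and back."""
    mean = data.mean(axis=0)
    centered = data - mean
    _, eigenvectors = np.linalg.eigh(np.cov(centered, rowvar=False))
    components = eigenvectors[:, ::-1][:, :n_components]  # largest variance first
    reconstructed = centered @ components @ components.T + mean
    return float(((data - reconstructed) ** 2).mean())

# Synthetic data: five features with very different spreads.
rng = np.random.default_rng(5)
data = rng.normal(size=(300, 5)) * np.array([5.0, 3.0, 1.0, 0.5, 0.1])

errors = [reconstruction_error(data, m) for m in range(1, 6)]
for m, err in zip(range(1, 6), errors):
    print(f"{m} component(s): reconstruction error = {err:.4f}")
```

The error shrinks with every component added and only reaches (numerically) zero when all five are kept, which is the trade-off between loss and efficiency in concrete numbers.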
Anomaly Detection
Analogy:
Think of a night security guard at a factory. Every night, the guard sees the same patterns: lights off at 10 PM, cleaning crew at 11 PM, silence until 6 AM. Nobody gave the guard a list of "suspicious events." Over weeks, the guard simply learns what normal looks like. When a door opens at 3 AM and footsteps head toward the server room — the guard flags it instantly. Not because they recognized a burglar, but because the event doesn't match the learned pattern of normality.
Definition:
Anomaly detection encompasses techniques that learn the statistical profile of "normal" data and then identify observations that deviate significantly from that profile — without ever being shown examples of anomalies. The approach is fundamentally different from classification: instead of asking "What is this?" the system asks "Does this fit what I've learned?"
Two Approaches Compared
Distance-Based
Measures how far a point is from known clusters. Strength: Works well with spherical clusters. Weakness: Struggles with unevenly distributed data.
Density-Based
Measures how sparsely populated a point's neighborhood is. Strength: Detects anomalies even in irregularly shaped data. Weakness: Sensitive to the choice of neighborhood radius.
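The difference between the two approaches shows up clearly on data that is not a compact blob. In the sketch below, normal points lie on a ring and a single anomaly sits in the empty middle. The scoring functions, the ring data, and the radius of 1.0 are assumptions chosen for illustration, not a specific library's method.

```python
import numpy as np

def distance_score(points, center):
    """Distance-based: how far is each point from the cluster center?"""
    return np.linalg.norm(points - center, axis=1)

def density_score(points, radius):
    """Density-based: how many other points lie within `radius` of each point?"""
    pairwise = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    return (pairwise <= radius).sum(axis=1) - 1  # exclude the point itself

# 100 normal points on a ring of radius 5, plus one anomaly in the empty middle.
angles = np.linspace(0.0, 2.0 * np.pi, 100, endpoint=False)
ring = 5.0 * np.column_stack([np.cos(angles), np.sin(angles)])
anomaly = np.array([[0.0, 0.0]])
points = np.vstack([ring, anomaly])

distances = distance_score(points, center=points.mean(axis=0))
neighbors = density_score(points, radius=1.0)

print("distance to center, anomaly vs. typical ring point:", distances[100], distances[0])
print("neighbors within radius, anomaly vs. typical ring point:", neighbors[100], neighbors[0])
```

Measured by distance to the center, the anomaly looks like the most normal point of all, because it sits exactly where the mean of the ring lies. The neighbor count exposes it: it has no neighbors within the radius, while every ring point has several. A different radius would change those counts, which is the sensitivity named above.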
99.9%
Class Distribution: In credit card fraud detection, 99.9% of all transactions are legitimate, which makes finding fraud a needle-in-a-haystack challenge
Worked Example: Credit Card Fraud
A credit card company monitors transactions. 99.9% are legitimate. The model learns each customer's purchase profile: German stores, 10–200 EUR range, daytime purchases. Then three rapid 5,000 EUR transactions come from a foreign country. These points are geometrically far from the learned profile and get flagged automatically — not because the system "knows" fraud, but because the transactions don't match the normal pattern.
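A stripped-down version of this profile check might look as follows. It is a sketch under strong assumptions: the transaction history is synthetic, only two features are used (amount and hour of day, with the country left out), and the threshold of 3.0 is an arbitrary illustrative choice.

```python
import numpy as np

# Synthetic purchase history for one customer (hypothetical data):
# amounts between 10 and 200 EUR, purchases between 8:00 and 20:00.
rng = np.random.default_rng(7)
history = np.column_stack([
    rng.uniform(10, 200, size=500),  # amount in EUR
    rng.uniform(8, 20, size=500),    # hour of day
])

# "Learning normal" here just means storing the mean and spread of each feature.
profile_mean = history.mean(axis=0)
profile_std = history.std(axis=0)

def anomaly_score(transaction):
    """Distance from the learned profile, measured in standard deviations."""
    z = (np.asarray(transaction, dtype=float) - profile_mean) / profile_std
    return float(np.linalg.norm(z))

THRESHOLD = 3.0  # illustrative cut-off, not a recommended value

everyday = [50.0, 14.0]     # 50 EUR at 14:00
suspicious = [5000.0, 3.0]  # 5,000 EUR at 03:00

for name, tx in [("everyday", everyday), ("suspicious", suspicious)]:
    score = anomaly_score(tx)
    print(f"{name}: score={score:.1f}, flagged={score > THRESHOLD}")
```

No fraud example is used anywhere in the sketch. The large night-time transaction is flagged only because it lies far from the stored profile.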
Common Misconception
Myth: Unsupervised Learning is easier because you don't need labels.
Fact: The absence of labels makes evaluation extremely difficult. In Supervised Learning, you have ground truth to measure accuracy against. In anomaly detection, you often can't tell whether a flagged point is a true anomaly or a false positive — without human investigation.
Key Takeaways
k-Means groups data into k clusters — but k itself is a human decision, not a machine discovery.
PCA compresses data by keeping the axes of greatest variance — deliberately trading some information for massive efficiency gains.
Anomaly detection learns what "normal" looks like and flags everything that deviates — a paradigm that powers fraud detection, quality control, and cybersecurity.
Knowledge Check: Unsupervised Learning
1. What must a practitioner specify before running k-Means that the algorithm cannot determine on its own?
☐ A) The feature names to use
☐ B) The number of clusters k
☐ C) The distance metric
☐ D) The convergence threshold
2. You run k-Means on a customer dataset with k=5, but the Elbow Method plot shows a clear bend at k=3. What is the most appropriate next step?
☐ A) Keep k=5 because more clusters mean more detail
☐ B) Re-run with k=3 and compare cluster quality
☐ C) Use k=1 to avoid overfitting
☐ D) Switch to supervised learning instead
3. A dataset has 200 features. After PCA, you keep 15 principal components that explain 92% of the total variance. A colleague says: "We only lost 8% of our data." Why is this statement misleading?
☐ A) PCA creates new data, so nothing is lost
☐ B) 8% of variance does not equal 8% of data points — specific observations may have lost important individual features
☐ C) PCA always loses exactly 50% of information
☐ D) The colleague confused features with samples
4. An anomaly detection system monitoring network traffic flags 1,000 events per day. After investigation, 990 are false positives and 10 are real intrusions. Without the system, all 10 intrusions would go undetected. Is the system useful?
☐ A) No: when 99% of the alerts are false positives, the system is broken
☐ B) Yes — it catches all intrusions, and the cost of investigating 990 false alarms is worth it
☐ C) No — 1% precision means the model has not learned anything
☐ D) It depends on whether the cost of investigating false positives exceeds the cost of undetected intrusions
5. You apply PCA to the MNIST digit dataset (784 features per image) and reduce to 50 components. You then train a classifier on the reduced data. Which of the following is a direct benefit of this PCA step?
☐ A) The classifier achieves higher accuracy because PCA removes noise
☐ B) Training time decreases because the model processes fewer features per image
☐ C) The classifier can now recognize new digit styles it hasn't seen
☐ D) PCA guarantees that no relevant pixel information was discarded
6. A credit card company uses anomaly detection. A new customer makes a single large purchase on their first day. The system flags it as anomalous. Why might this be a false positive?
☐ A) The system cannot process transactions under 24 hours old
☐ B) The system has not yet built a sufficient "normal" profile for this customer, so any transaction deviates from the nearly empty baseline
☐ C) Anomaly detection only works with at least 1,000 prior transactions
☐ D) Large purchases are always flagged regardless of the model
Answer Key: 1) B · 2) B · 3) B · 4) D · 5) B · 6) B
Checkpoint: Unsupervised Learning
Why can the k-Means algorithm not simply decide on its own how many groups exist in the data?
When PCA reduces 784 pixels to 50 principal components — what exactly is lost and why is it still worthwhile?
What is the difference between distance-based and density-based anomaly detection — and when is each one appropriate?