Bayes & Conditional Probability

Conditional probability: the tool that sets statisticians apart, because they calculate differently.

A medical test is 95% accurate, and the disease it looks for affects 1% of the population. You test positive. How likely are you to actually be sick? Most people answer "95%." The real answer: about 16%. This is not a trick; it is Bayes' theorem in action.

Your brain is wired to get probabilities wrong — especially when rare events are involved. This article walks you through three tools: conditional probability (what "given that" means mathematically), Bayes' theorem (the formula for reversing probabilities), and Bayesian updating (how AI systems learn from data).

Conditional Probability — "Given That..."

You draw a card from a standard 52-card deck. P(Ace of Hearts) = 1/52. Someone tells you: "The card is red." This eliminates 26 cards. Your new universe is 26 red cards. P(Ace of Hearts | red) = 1/26. The information "red" halved your possibilities and doubled your probability. That is conditional probability: new information shrinks the possibility space.

The card analogy works perfectly because the numbers are exact and small enough to verify by hand. In reality, you work with estimated frequencies (disease prevalence, test accuracy), not perfectly known card counts.

Definition:

P(A|B) = P(A ∩ B) / P(B). The vertical bar "|" reads as "given that." You zoom into the world where B is true, and ask what fraction of that world also has A. Crucially: P(A|B) and P(B|A) are NOT the same thing. P(rain | October in Hamburg) is very different from P(October in Hamburg | rain). Confusing the direction of conditioning is the root of the Base Rate Fallacy.
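A minimal sketch of the definition, assuming a brute-force enumeration of the deck is acceptable: it counts cards to evaluate P(A|B) = P(A ∩ B) / P(B) for the card example and then reverses the direction of conditioning. The helper names (`prob`, `cond_prob`) are illustrative and not part of the article.

```python
from fractions import Fraction
from itertools import product

RANKS = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
SUITS = ["hearts", "diamonds", "clubs", "spades"]
DECK = list(product(RANKS, SUITS))  # 52 equally likely (rank, suit) pairs


def prob(event):
    """P(event): fraction of cards in the deck that satisfy the predicate."""
    return Fraction(sum(1 for card in DECK if event(card)), len(DECK))


def cond_prob(a, b):
    """P(A | B) = P(A and B) / P(B)."""
    return prob(lambda card: a(card) and b(card)) / prob(b)


def is_ace_of_hearts(card):
    return card == ("A", "hearts")


def is_red(card):
    return card[1] in ("hearts", "diamonds")


p_a = prob(is_ace_of_hearts)                          # 1/52
p_a_given_red = cond_prob(is_ace_of_hearts, is_red)   # 1/26
p_red_given_a = cond_prob(is_red, is_ace_of_hearts)   # 1
print(p_a, p_a_given_red, p_red_given_a)
```

Conditioning on "red" gives 1/26, while conditioning the other way round gives 1: the two directions answer different questions.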

Example: The Medical Test

1% of the population has disease X. The test detects sick people 95% of the time (sensitivity). In healthy people, it falsely shows positive 5% of the time. Take 1,000 people and count what happens on average:

1,000 people: 10 sick, 990 healthy

Of 10 sick: 9.5 test positive on average (95%)

Of 990 healthy: 49.5 false positives on average (5%)

Total positive: 9.5 + 49.5 = 59

P(sick | positive) = 9.5/59 ≈ 16%

You test positive. The probability of actually being sick: only about 16%. The 990 healthy people produce roughly 50 false alarms that swamp the roughly 10 true positives. The base rate (1% prevalence) dominates.
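The counting argument can be sketched as a small function. This is an illustration under the example's stated numbers (1% prevalence, 95% sensitivity, 5% false positive rate); the function name `natural_frequencies` is made up here, and it returns expected counts without rounding to whole people.

```python
def natural_frequencies(population, prevalence, sensitivity, false_positive_rate):
    """Expected number of people in each test outcome."""
    sick = population * prevalence
    healthy = population - sick
    true_pos = sick * sensitivity
    false_pos = healthy * false_positive_rate
    return {
        "sick": sick,
        "healthy": healthy,
        "true_positives": true_pos,
        "false_negatives": sick - true_pos,
        "false_positives": false_pos,
        "true_negatives": healthy - false_pos,
    }


counts = natural_frequencies(1000, 0.01, 0.95, 0.05)
positives = counts["true_positives"] + counts["false_positives"]
p_sick_given_positive = counts["true_positives"] / positives
print(round(positives, 1), round(p_sick_given_positive, 3))
```

Changing `prevalence` while keeping the test fixed shows how strongly the base rate drives the result.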

The Base Rate Fallacy

Do not confuse P(positive | sick) with P(sick | positive). Test accuracy (how well the test detects the sick) and diagnostic probability (how likely you are sick given a positive test) are completely different things. The base rate, meaning how rare the disease is, makes all the difference.

Misconception: 95% Accurate Test = 95% Probability of Being Sick

95% describes P(positive | sick): how well the test finds sick people. But you want P(sick | positive): how likely you are to be sick. Without the base rate (how common the disease is), the accuracy number is meaningless. For rare diseases, even excellent tests produce mostly false alarms.

Bayes' Theorem — The Formula for Updating Beliefs

You hear a noise at night. Your initial beliefs (priors): P(cat knocked something over) = 80%, P(burglar) = 0.1%. Then you hear glass breaking — P(glass | cat) = 10%, but P(glass | burglar) = 90%. Bayes updates your belief: P(burglar) rises significantly because glass breaking is much more likely under the burglar scenario. But the extremely low prior pulls the posterior back — it does not jump to 90%. The prior and the likelihood compete, and Bayes mediates.
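The example does not list every possible explanation for the noise, so a full posterior cannot be computed from it. As a hedged sketch, the snippet below compares only the two named hypotheses using the odds form of Bayes' theorem (posterior odds = prior odds × likelihood ratio). Restricting the comparison to "burglar vs. cat" is an assumption made here for illustration.

```python
# Numbers taken from the night-noise example.
prior = {"cat": 0.80, "burglar": 0.001}
likelihood_glass = {"cat": 0.10, "burglar": 0.90}

# Odds form of Bayes' theorem: posterior odds = prior odds * likelihood ratio
prior_odds = prior["burglar"] / prior["cat"]
likelihood_ratio = likelihood_glass["burglar"] / likelihood_glass["cat"]
posterior_odds = prior_odds * likelihood_ratio

# Share of belief for "burglar" if only these two explanations are considered
p_burglar_two_way = posterior_odds / (1 + posterior_odds)

print(f"prior odds (burglar : cat):     {prior_odds:.5f}")
print(f"likelihood ratio:               {likelihood_ratio:.1f}")
print(f"posterior odds (burglar : cat): {posterior_odds:.5f}")
print(f"burglar share vs. cat:          {p_burglar_two_way:.2%}")
```

The breaking glass multiplies the odds for a burglar roughly ninefold, yet the cat remains far more likely because the prior odds started at 1 to 800.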

Humans do not compute numbers at night — they react with gut feelings and heuristics. Bayes requires explicit probabilities. This gap between intuitive and mathematical updating is exactly what the article teaches.

Definition:

P(A|B) = P(B|A) × P(A) / P(B). Bayes' theorem connects P(A|B) and P(B|A), allowing you to reason "backwards," from observed evidence to underlying cause. The formula combines prior knowledge (prior) with new observation (likelihood) and normalizes by the total probability of the evidence. P(B) is computed via the law of total probability: P(B) = P(B|A) × P(A) + P(B|¬A) × P(¬A).

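The Four Building Blocks

Prior P(A): how probable A was before you saw the evidence.

Likelihood P(B|A): how probable the evidence B is if A is true.

Evidence P(B): how probable B is overall, whether or not A is true.

Posterior P(A|B): how probable A is after you have seen B.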

Worked Example

P(positive) = P(positive | sick) × P(sick) + P(positive | healthy) × P(healthy)
            = 0.95 × 0.01 + 0.05 × 0.99
            = 0.0095 + 0.0495
            = 0.059

P(sick | positive) = P(positive | sick) × P(sick) / P(positive)
                    = 0.95 × 0.01 / 0.059
                    = 0.0095 / 0.059
                    ≈ 0.161 → about 16.1%
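The worked example translates directly into two small functions. This is a sketch for the two-hypothesis case only (A versus not A); the function names are chosen here for illustration.

```python
def total_probability(prior, likelihood, likelihood_if_not):
    """P(B) = P(B|A) * P(A) + P(B|not A) * P(not A)."""
    return likelihood * prior + likelihood_if_not * (1 - prior)


def bayes_posterior(prior, likelihood, likelihood_if_not):
    """P(A|B) = P(B|A) * P(A) / P(B)."""
    evidence = total_probability(prior, likelihood, likelihood_if_not)
    return likelihood * prior / evidence


# Medical test: P(sick) = 0.01, P(positive | sick) = 0.95, P(positive | healthy) = 0.05
evidence = total_probability(0.01, 0.95, 0.05)
posterior = bayes_posterior(0.01, 0.95, 0.05)
print(f"P(positive) = {evidence:.4f}, P(sick | positive) = {posterior:.4f}")
```

The same call answers the spam example further down: `bayes_posterior(0.4, 0.8, 0.05)` evaluates 0.32 / 0.35.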

Frequency Table

Count 1,000 people, divide positive-tested sick by all positives. Intuitive, but only practical with small numbers.

Bayes' Formula

Prior x Likelihood / Evidence. Same answer (16%), but universally applicable — even when you cannot count 1,000 people.
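To see that counting and the formula agree, one can let a computer do the counting. The sketch below is a simple Monte Carlo simulation under the medical-test numbers; the sample size and random seed are arbitrary choices made here, and the estimate will differ slightly from the exact value because of sampling noise.

```python
import random


def simulate(n_people, prevalence, sensitivity, false_positive_rate, seed=0):
    """Estimate P(sick | positive) by simulating people and counting positives."""
    rng = random.Random(seed)
    true_pos = 0
    false_pos = 0
    for _ in range(n_people):
        sick = rng.random() < prevalence
        p_positive = sensitivity if sick else false_positive_rate
        if rng.random() < p_positive:
            if sick:
                true_pos += 1
            else:
                false_pos += 1
    return true_pos / (true_pos + false_pos)


estimate = simulate(200_000, 0.01, 0.95, 0.05)
exact = 0.95 * 0.01 / (0.95 * 0.01 + 0.05 * 0.99)
print(f"simulated: {estimate:.3f}, formula: {exact:.3f}")
```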

Misconception: Strong Evidence Overrides Everything

Even a high P(B|A) does not determine the result on its own. With an extremely low prior (e.g., a disease that affects 0.01% of people), the posterior stays low even with a 99% test. The prior always pulls the result in its direction. Only with moderate priors or multiple independent pieces of evidence can the evidence overcome the prior.
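A short sketch of this effect. Two assumptions are made here that the text does not spell out: the "99% test" is read as 99% sensitivity with a 1% false positive rate, and repeated tests are treated as independent given the true disease status.

```python
def update(prior, sensitivity, false_positive_rate):
    """One Bayes update after a positive test result."""
    numerator = sensitivity * prior
    return numerator / (numerator + false_positive_rate * (1 - prior))


belief = 0.0001          # prior: the disease affects 0.01% of people
history = [belief]
for _ in range(3):       # three positive results in a row
    belief = update(belief, sensitivity=0.99, false_positive_rate=0.01)
    history.append(belief)

for n, p in enumerate(history):
    print(f"after {n} positive test(s): {p:.4f}")
```

Under these assumptions one positive result leaves the probability below 1%, a second brings it to roughly one half, and only the third pushes it close to certainty.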

Interactive: Compute Bayes' Theorem

Adjust the prior, sensitivity, and false positive rate and watch how the posterior changes. Try the predefined scenarios from the article (rare disease, spam filter) and experiment with your own values.

Scenario: Medical Test

A disease affects a certain proportion of the population. A test detects the disease with a certain hit rate, but also produces false-positive results in healthy people. How likely is the disease really when the test comes back positive?

Input Values

P(A), Prior Probability: How common is the disease? (0.01 in this scenario)
P(B|A), Sensitivity: Detection rate for actually sick people (0.95)
P(B|¬A), False Positive Rate: False alarms in healthy people (0.05)

Bayes' Calculation

P(B) = P(B|A) × P(A) + P(B|¬A) × P(¬A)

P(B) = 0.9500 × 0.0100 + 0.0500 × 0.9900 = 0.0590

P(A|B) = P(B|A) × P(A) / P(B)

P(A|B) = 0.9500 × 0.0100 / 0.0590 = 0.1610

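Result

P(A|B), the probability of disease given a positive test: 16.1%

Before (Prior): 1%

After (Posterior): 16.1%

The positive test has increased the probability by a factor of 16.1 (from 1% to 16.1%). Strictly speaking, the Bayes factor is the likelihood ratio P(B|A) / P(B|¬A) = 0.95 / 0.05 = 19, which multiplies the odds rather than the probability.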

The Bayes Paradox

Although the test has 95% sensitivity, the probability of disease given a positive result is only 16.1%. This is due to the low base rate (1%): among 1,000 tested people, there are about 50 false alarms but fewer than 10 true hits. The false positives overwhelm the real cases.

Visualized: 1,000 People Tested

Expected counts for 1% prevalence, 95% sensitivity, and a 5% false positive rate:

Sick & tested positive (True Positive): 9.5
Sick & tested negative (False Negative): 0.5
Healthy & tested positive (False Positive): 49.5
Healthy & tested negative (True Negative): 940.5

Of the 59 expected positive tests, only 9.5 come from people who are actually sick. This gives P(A|B) = 9.5/59 ≈ 16.1%.

Bayesian Updating — Learning Step by Step

A doctor listens to symptoms one at a time: Patient arrives → Prior: common cold is most likely. Symptom fever → Posterior shifts toward flu. Symptom rash → Posterior shifts further, now considering measles. Each symptom is a Bayes update: the posterior from the previous symptom becomes the prior for the next one.

Doctors rarely compute explicit probabilities — they use pattern recognition and clinical experience. A Bayesian algorithm computes actual numbers. This contrast shows what Bayesian AI achieves that human intuition only approximates.

Definition:

Bayesian updating generalizes Bayes' theorem into an iterative learning process: start with a prior, observe evidence, compute a posterior, then use that posterior as the NEW prior for the next piece of evidence. With little data, the prior dominates. With lots of data, the evidence dominates: posteriors that started from different priors converge, provided none of those priors assigned zero probability to the truth (Bayesian convergence).
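One way to watch this convergence is a Beta-Binomial model, which is not used in the article and is brought in here only as a convenient illustration: the belief about an unknown success rate is a Beta(alpha, beta) distribution, and each batch of observations simply adds its successes and failures to the two parameters. The priors and the data batches below are made-up numbers.

```python
def beta_update(alpha, beta, successes, failures):
    """Conjugate update: the posterior is Beta(alpha + successes, beta + failures)."""
    return alpha + successes, beta + failures


def beta_mean(alpha, beta):
    """Mean of a Beta(alpha, beta) distribution."""
    return alpha / (alpha + beta)


# Two hypothetical starting beliefs about the same unknown rate
priors = {"flat": (1, 1), "optimistic": (20, 2)}

# Hypothetical data arriving in batches of (successes, failures), 30% successes each
batches = [(3, 7), (27, 63), (270, 630)]

trajectories = {}
for name, (alpha, beta) in priors.items():
    means = [beta_mean(alpha, beta)]
    for successes, failures in batches:
        # The posterior of one step becomes the prior of the next
        alpha, beta = beta_update(alpha, beta, successes, failures)
        means.append(beta_mean(alpha, beta))
    trajectories[name] = means
    print(name, [round(m, 3) for m in means])
```

After the first ten observations the two beliefs are still far apart; after a thousand, both posterior means sit close to 0.3.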

Example: Spam Filter (Naive Bayes)

A spam filter classifies emails using word frequencies. Prior: P(Spam) = 0.4, P(Ham) = 0.6. The email contains the word "winner":

P("winner" | Spam) = 0.8
P("winner" | Ham)  = 0.05

P(Spam | "winner") = (0.8 x 0.4) / (0.8 x 0.4 + 0.05 x 0.6)
                    = 0.32 / 0.35
                    ≈ 0.914 → 91.4% spam probability

If the email also contains "click," another Bayes update further increases the spam probability. Each word is a piece of evidence. This is Naive Bayes — "naive" because it assumes words are independent given the class. A simplification that works remarkably well in practice.
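A sketch of that second update. The likelihoods for "winner" and the prior come from the example; the likelihoods for "click" are not given in the article, so the values used here are hypothetical and serve only to show the mechanics.

```python
def update(prior_spam, p_word_given_spam, p_word_given_ham):
    """One Bayes update of P(Spam) after seeing a word."""
    numerator = p_word_given_spam * prior_spam
    return numerator / (numerator + p_word_given_ham * (1 - prior_spam))


# word: (P(word | Spam), P(word | Ham))
likelihoods = {
    "winner": (0.8, 0.05),  # from the example
    "click": (0.6, 0.2),    # hypothetical values, for illustration only
}

belief = 0.4  # prior P(Spam)
steps = []
for word in ["winner", "click"]:
    belief = update(belief, *likelihoods[word])  # posterior becomes the next prior
    steps.append((word, belief))
    print(f"after '{word}': P(Spam) = {belief:.3f}")
```

Because Naive Bayes treats the words as independent given the class, updating word by word gives the same result as multiplying all likelihoods at once.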

Python: Naive Bayes

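from sklearn.naive_bayes import MultinomialNB

# Training features: word counts per email (one column per vocabulary word)
X_train = [[5, 1, 0], [4, 2, 0], [0, 1, 3], [1, 0, 4]]
y_train = ['spam', 'spam', 'ham', 'ham']

# Fit a multinomial Naive Bayes classifier on the labeled emails
clf = MultinomialNB()
clf.fit(X_train, y_train)

# Classify a new email; predict() returns an array with one label per row
prediction = clf.predict([[3, 1, 0]])
print(prediction[0])  # spam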
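To make the calculation behind such a classifier visible, here is a from-scratch sketch of multinomial Naive Bayes with add-one (Laplace) smoothing, using only the standard library. It is a simplified illustration, not scikit-learn's implementation; the function names are chosen here, and `alpha=1.0` is assumed because it matches the documented default smoothing of `MultinomialNB`.

```python
import math
from collections import defaultdict


def train(X, y, alpha=1.0):
    """Return, per class, the log prior and smoothed log word probabilities."""
    n_words = len(X[0])
    word_counts = defaultdict(lambda: [0] * n_words)
    doc_counts = defaultdict(int)
    for row, label in zip(X, y):
        doc_counts[label] += 1
        for j, count in enumerate(row):
            word_counts[label][j] += count
    model = {}
    for label, counts in word_counts.items():
        total = sum(counts) + alpha * n_words
        log_prior = math.log(doc_counts[label] / len(y))
        log_word_probs = [math.log((c + alpha) / total) for c in counts]
        model[label] = (log_prior, log_word_probs)
    return model


def predict_proba(model, row):
    """Posterior probability of each class for one email (row of word counts)."""
    scores = {
        label: log_prior + sum(c * lw for c, lw in zip(row, log_word_probs))
        for label, (log_prior, log_word_probs) in model.items()
    }
    top = max(scores.values())  # subtract the maximum for numerical stability
    unnormalized = {label: math.exp(s - top) for label, s in scores.items()}
    norm = sum(unnormalized.values())
    return {label: u / norm for label, u in unnormalized.items()}


X_train = [[5, 1, 0], [4, 2, 0], [0, 1, 3], [1, 0, 4]]
y_train = ["spam", "spam", "ham", "ham"]

model = train(X_train, y_train)
probs = predict_proba(model, [3, 1, 0])
predicted = max(probs, key=probs.get)
print(predicted, {label: round(p, 4) for label, p in probs.items()})
```

Working in log space avoids multiplying many small probabilities, which matters once emails contain more than a handful of words.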

Misconception: The Prior Does Not Matter, Just Let the Data Speak

With large datasets the claim holds up: the prior's influence fades. But with limited data (common in practice: medical imaging, rare events, small user bases), the prior can dominate the posterior. Choosing a thoughtful prior is not bias; it is informed reasoning.

Deep Dive: AI Training as Bayesian Learning

Training an AI model loosely follows the same pattern: the model's random initial weights play the role of the prior, the training data is the evidence, and the trained model corresponds to the posterior. Each training batch updates the weights, much as Bayes updates the posterior with each new piece of evidence. Standard gradient-based training does not literally compute Bayes' theorem, so treat this as an analogy, but it is a useful one for understanding how AI learns.

Interactive: Prior vs. Posterior Comparison

Drag the slider to transition between prior (flat initial distribution before evidence) and posterior (concentrated distribution after evidence). Observe how Bayesian updating shifts belief from a vague guess to a precise conviction.

Underfitting vs. Overfitting

Move the slider to switch between underfitting (left) and overfitting (right). The blue dots are training data. The curve shows how the model interprets the data.

Underfitting

The model is too simple. It doesn't even recognize the obvious patterns in the training data. Like a student who hasn't understood the task.

Model complexity: Too low
Training error: High
Test error: High

Overfitting

The model is too complex. It memorizes every single data point, including noise. Like a student who memorizes answers instead of understanding.

Model complexity: Too high
Training error: Very low
Test error: High

Sweet Spot: Good Fit

The optimal compromise lies in the middle: complex enough to recognize real patterns, but simple enough to generalize to new data. Techniques like regularization, cross-validation, and early stopping help find this point.

Key Takeaways

P(A|B) ≠ P(B|A): confusing the direction of conditioning is one of the most common probability errors and underlies the Base Rate Fallacy.
Bayes' formula P(A|B) = P(B|A) × P(A) / P(B) combines prior knowledge (prior) with new observation (likelihood) to produce the updated belief (posterior).
Bayesian updating is iterative: each posterior becomes the new prior. With little data, the prior dominates; with lots of data, the evidence dominates. Spam filters and medical diagnostics apply this principle directly, and AI training follows it by analogy.

Quiz: Bayes & Conditional Probability

1. What does P(A|B) mean?

☐ A) The probability of A and B together
☐ B) The probability of A, given that B has already occurred
☐ C) The probability of B, given that A is true
☐ D) The probability that A causes B

2. Why are most positive test results wrong for rare diseases?

☐ A) Because the test is bad
☐ B) Because the large group of healthy people produces many false alarms that flood the few true positives
☐ C) Because doctors frequently make mistakes
☐ D) Because the disease is tested too rarely

3. A spam filter has prior P(Spam) = 0.5. An email contains "winner": P("winner" | Spam) = 0.6, P("winner" | Ham) = 0.1. What is P(Spam | "winner")?

☐ A) 60%
☐ B) 50%
☐ C) 86%
☐ D) 10%

4. An AI model starts with random weights and is trained with data. What Bayesian role do the initial weights play?

☐ A) Likelihood
☐ B) Evidence
☐ C) Posterior
☐ D) Prior

Answer Key: 1) B · 2) B · 3) C · 4) D

Checkpoint: Bayes & Conditional Probability

I can explain why the base rate (prior) is so crucial when interpreting a positive test result.
I can explain the difference between P(A|B) and P(B|A) with an example.
I can compute a posterior using Bayes' formula when given the prior, likelihood, and evidence.
