Learning Probability for Machine Learning

Understanding probability is fundamental to machine learning. In this post, we’ll explore one of the most important theorems in statistics: the Central Limit Theorem (CLT).

What is the Central Limit Theorem?

The Central Limit Theorem states that when you sum (or average) a large number of independent, identically distributed random variables with finite variance, the result (after standardization) tends toward a normal distribution, regardless of the original distribution of those variables.

This is powerful! It means that even if we start with something as simple as coin flips (each flip is a Bernoulli trial, and the count of heads follows a binomial distribution), once we take enough samples and look at their averages, a beautiful bell curve emerges.

Interactive Demonstration

Let’s demonstrate this through simulation! Below is an interactive visualization where you can:

  • Flip a coin multiple times
  • See how the distribution of averages converges to a normal distribution
  • Adjust parameters to see how sample size affects convergence
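If you can’t interact with the demo, or want to reproduce it offline, here is a minimal Python sketch of the same experiment. The function and parameter names (simulate_coin_averages, n_flips, n_experiments) are my own illustrative choices, not part of the demo:

```python
import numpy as np
import matplotlib.pyplot as plt

def simulate_coin_averages(n_flips, n_experiments, p=0.5, seed=0):
    """Run n_experiments batches of n_flips coin flips and
    return the proportion of heads in each batch."""
    rng = np.random.default_rng(seed)
    flips = rng.binomial(1, p, size=(n_experiments, n_flips))
    return flips.mean(axis=1)

# Histogram of the batch averages: this is the curve the demo draws.
averages = simulate_coin_averages(n_flips=100, n_experiments=1000)
plt.hist(averages, bins=30, density=True)
plt.xlabel("Proportion of heads")
plt.ylabel("Density")
plt.title("Sampling distribution of the coin-flip average")
plt.show()
```

With 100 flips per experiment and 1000 experiments, the histogram should already look distinctly bell-shaped.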

The Mathematics Behind It

Binomial Distribution

When we flip a fair coin \(n\) times, the number of heads \(X\) follows a binomial distribution:

\[P(X = k) = \binom{n}{k} p^k (1-p)^{n-k}\]

where:

  • \(n\) is the number of trials (coin flips)
  • \(k\) is the number of successes (heads)
  • \(p\) is the probability of success (0.5 for a fair coin)
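As a quick sanity check, we can evaluate this formula directly. The snippet below computes the closed form with Python's math.comb and, assuming SciPy is available, confirms it against scipy.stats.binom:

```python
from math import comb
from scipy.stats import binom

# P(exactly 6 heads in 10 fair flips)
n, k, p = 10, 6, 0.5
manual = comb(n, k) * p**k * (1 - p)**(n - k)  # the formula above
library = binom.pmf(k, n, p)                   # SciPy's binomial PMF
print(manual, library)                         # both ≈ 0.2051
```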

Central Limit Theorem Application

For a binomial random variable with \(n\) trials and probability \(p\):

  • Mean: \(\mu = np\)
  • Standard deviation: \(\sigma = \sqrt{np(1-p)}\)

As \(n \to \infty\), the standardized variable converges in distribution to a standard normal:

\[Z = \frac{X - np}{\sqrt{np(1-p)}} \xrightarrow{d} N(0, 1)\]
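We can watch this convergence numerically. The sketch below (sample sizes are arbitrary choices for illustration) standardizes a large batch of binomial draws exactly as in the formula and compares a tail probability with the standard normal's:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(42)
n, p = 100, 0.5
x = rng.binomial(n, p, size=100_000)         # 100,000 draws of "heads in 100 flips"
z = (x - n * p) / np.sqrt(n * p * (1 - p))   # standardize as in the formula above

# The standardized tail probability should be close to the normal tail.
print((z > 1.96).mean())   # ≈ 0.028 here; discreteness keeps it slightly above
print(1 - norm.cdf(1.96))  # ≈ 0.025, the standard normal tail probability
```

The small remaining gap comes from the discreteness of the binomial; it shrinks as \(n\) grows.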

Why This Matters for Machine Learning

The Central Limit Theorem is crucial in ML because:

  1. Statistical Inference: It allows us to make probabilistic statements about model parameters
  2. Bootstrap Methods: Understanding sampling distributions helps us estimate confidence intervals
  3. Optimization: Many gradient-based methods rely on the assumption that averages of gradients behave normally
  4. A/B Testing: We use the CLT to determine whether differences between models are statistically significant (sketched below)
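To make the A/B testing point concrete, here is a sketch of a CLT-based two-proportion z-test for comparing two models' accuracies. The counts are made-up illustration numbers, not real benchmark results:

```python
import numpy as np
from scipy.stats import norm

successes_a, trials_a = 530, 1000   # model A: hypothetical correct predictions
successes_b, trials_b = 560, 1000   # model B: hypothetical correct predictions
p_a, p_b = successes_a / trials_a, successes_b / trials_b

# Pooled proportion under the null hypothesis that the models are equal.
p_pool = (successes_a + successes_b) / (trials_a + trials_b)
se = np.sqrt(p_pool * (1 - p_pool) * (1 / trials_a + 1 / trials_b))

z = (p_b - p_a) / se                   # CLT: z is approximately N(0, 1) under the null
p_value = 2 * (1 - norm.cdf(abs(z)))   # two-sided p-value
print(z, p_value)
```

The CLT is what justifies treating the standardized difference of proportions as approximately normal, which is where the p-value comes from.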

Experiment with the Demo

Try these experiments with the visualization above:

  1. Start with 10 coin flips and 100 experiments - you’ll see some variation
  2. Increase to 50 flips - notice the distribution becomes more bell-shaped
  3. Try 100 flips with 1000 experiments - beautiful normal distribution!

The more coin flips per experiment and the more experiments you run, the more clearly you’ll see the normal distribution emerge.
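One way to see why more flips per experiment helps: the standard error of the coin-flip average is \(\sqrt{p(1-p)/n}\), so the bell curve narrows like \(1/\sqrt{n}\). A tiny sketch:

```python
import math

# Standard error of the average for the three experiment sizes above.
p = 0.5
for n in (10, 50, 100):
    se = math.sqrt(p * (1 - p) / n)
    print(f"n={n:>3}: standard error of the average ≈ {se:.3f}")
# n= 10: ≈ 0.158, n= 50: ≈ 0.071, n=100: ≈ 0.050
```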

Key Takeaways

  • The Central Limit Theorem is one of the most important results in probability theory
  • It explains why the normal distribution appears so frequently in nature and data science
  • You don’t need to start with normally distributed data to get normally distributed averages
  • Sample size matters: the larger the sample, the closer the distribution of averages gets to a normal distribution

Try adjusting the parameters in the visualization to build your intuition about how the CLT works!