Learning Probability for Machine Learning

Understanding probability is fundamental to machine learning. In this post, we’ll explore one of the most important theorems in statistics: the Central Limit Theorem (CLT).

What is the Central Limit Theorem?

The Central Limit Theorem states that when you sum (or average) a large number of independent, identically distributed random variables with finite variance, the result (after standardization) tends toward a normal distribution, regardless of the original distribution of those variables.

This is powerful! It means that even if we start with something as simple as coin flips (each flip is a Bernoulli trial, and the count of heads follows a binomial distribution), once we take enough samples and look at their averages, a beautiful bell curve emerges.

Interactive Demonstration

Let’s demonstrate this through simulation! Below is an interactive visualization where you can:

  • Flip a coin multiple times
  • See how the distribution of averages converges to a normal distribution
  • Adjust parameters to see how sample size affects convergence
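If you can’t interact with the demo, or want to reproduce it offline, here is a minimal Python sketch of the same experiment. The function and parameter names (simulate_coin_averages, n_flips, n_experiments) are my own illustrative choices, not part of the demo:

```python
import numpy as np
import matplotlib.pyplot as plt

def simulate_coin_averages(n_flips, n_experiments, p=0.5, seed=0):
    """Run n_experiments batches of n_flips coin flips and
    return the proportion of heads in each batch."""
    rng = np.random.default_rng(seed)
    flips = rng.binomial(1, p, size=(n_experiments, n_flips))
    return flips.mean(axis=1)

# Histogram of the batch averages: this is the curve the demo draws.
averages = simulate_coin_averages(n_flips=100, n_experiments=1000)
plt.hist(averages, bins=30, density=True)
plt.xlabel("Proportion of heads")
plt.ylabel("Density")
plt.title("Sampling distribution of the coin-flip average")
plt.show()
```

With 100 flips per experiment and 1000 experiments, the histogram should already look distinctly bell-shaped.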

The Mathematics Behind It

Binomial Distribution

When we flip a fair coin \(n\) times, the number of heads \(X\) follows a binomial distribution:

\[P(X = k) = \binom{n}{k} p^k (1-p)^{n-k}\]

where:

  • \(n\) is the number of trials (coin flips)
  • \(k\) is the number of successes (heads)
  • \(p\) is the probability of success (0.5 for a fair coin)
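As a quick sanity check, we can evaluate this formula directly. The snippet below computes the closed form with Python's math.comb and, assuming SciPy is available, confirms it against scipy.stats.binom:

```python
from math import comb
from scipy.stats import binom

# P(exactly 6 heads in 10 fair flips)
n, k, p = 10, 6, 0.5
manual = comb(n, k) * p**k * (1 - p)**(n - k)  # the formula above
library = binom.pmf(k, n, p)                   # SciPy's binomial PMF
print(manual, library)                         # both ≈ 0.2051
```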

Central Limit Theorem Application

For a binomial random variable with \(n\) trials and probability \(p\):

  • Mean: \(\mu = np\)
  • Standard deviation: \(\sigma = \sqrt{np(1-p)}\)

As \(n \to \infty\), the standardized variable converges in distribution to a standard normal:

\[Z = \frac{X - np}{\sqrt{np(1-p)}} \xrightarrow{d} N(0, 1)\]
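We can watch this convergence numerically. The sketch below (sample sizes are arbitrary choices for illustration) standardizes a large batch of binomial draws exactly as in the formula and compares a tail probability with the standard normal's:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(42)
n, p = 100, 0.5
x = rng.binomial(n, p, size=100_000)         # 100,000 draws of "heads in 100 flips"
z = (x - n * p) / np.sqrt(n * p * (1 - p))   # standardize as in the formula above

# The standardized tail probability should be close to the normal tail.
print((z > 1.96).mean())   # ≈ 0.028 here; discreteness keeps it slightly above
print(1 - norm.cdf(1.96))  # ≈ 0.025, the standard normal tail probability
```

The small remaining gap comes from the discreteness of the binomial; it shrinks as \(n\) grows.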

Why This Matters for Machine Learning

The Central Limit Theorem is crucial in ML because:

  1. Statistical Inference: It allows us to make probabilistic statements about model parameters
  2. Bootstrap Methods: Understanding sampling distributions helps us estimate confidence intervals
  3. Optimization: Many gradient-based methods rely on the assumption that averages of gradients behave normally
  4. A/B Testing: We use the CLT to determine whether differences between models are statistically significant (sketched below)
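To make the A/B testing point concrete, here is a sketch of a CLT-based two-proportion z-test for comparing two models' accuracies. The counts are made-up illustration numbers, not real benchmark results:

```python
import numpy as np
from scipy.stats import norm

successes_a, trials_a = 530, 1000   # model A: hypothetical correct predictions
successes_b, trials_b = 560, 1000   # model B: hypothetical correct predictions
p_a, p_b = successes_a / trials_a, successes_b / trials_b

# Pooled proportion under the null hypothesis that the models are equal.
p_pool = (successes_a + successes_b) / (trials_a + trials_b)
se = np.sqrt(p_pool * (1 - p_pool) * (1 / trials_a + 1 / trials_b))

z = (p_b - p_a) / se                   # CLT: z is approximately N(0, 1) under the null
p_value = 2 * (1 - norm.cdf(abs(z)))   # two-sided p-value
print(z, p_value)
```

The CLT is what justifies treating the standardized difference of proportions as approximately normal, which is where the p-value comes from.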

Experiment with the Demo

Try these experiments with the visualization above:

  1. Start with 10 coin flips and 100 experiments - you’ll see some variation
  2. Increase to 50 flips - notice the distribution becomes more bell-shaped
  3. Try 100 flips with 1000 experiments - beautiful normal distribution!

The more coin flips per experiment and the more experiments you run, the more clearly you’ll see the normal distribution emerge.
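One way to see why more flips per experiment helps: the standard error of the coin-flip average is \(\sqrt{p(1-p)/n}\), so the bell curve narrows like \(1/\sqrt{n}\). A tiny sketch:

```python
import math

# Standard error of the average for the three experiment sizes above.
p = 0.5
for n in (10, 50, 100):
    se = math.sqrt(p * (1 - p) / n)
    print(f"n={n:>3}: standard error of the average ≈ {se:.3f}")
# n= 10: ≈ 0.158, n= 50: ≈ 0.071, n=100: ≈ 0.050
```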

Key Takeaways

  • The Central Limit Theorem is one of the most important results in probability theory
  • It explains why the normal distribution appears so frequently in nature and data science
  • You don’t need to start with normally distributed data to get normally distributed averages
  • Sample size matters: the larger the sample, the closer the distribution of averages gets to a normal distribution

Try adjusting the parameters in the visualization to build your intuition about how the CLT works!