Learning Probability for Machine Learning
Understanding probability is fundamental to machine learning. In this post, we’ll explore one of the most important theorems in statistics: the Central Limit Theorem (CLT).
What is the Central Limit Theorem?
The Central Limit Theorem states that when you sum (or average) a large number of independent, identically distributed random variables with finite variance, the result tends to follow a normal distribution, regardless of the distribution of the individual variables.
This is powerful! It means that even if we start with something as simple as coin flips (each flip follows a Bernoulli distribution, and the number of heads a binomial), taking enough samples and looking at their averages will reveal a beautiful bell curve.
Interactive Demonstration
Let’s demonstrate this through simulation! Below is an interactive visualization where you can:
- Flip a coin multiple times
- See how the distribution of averages converges to a normal distribution
- Adjust parameters to see how sample size affects convergence (a minimal Python sketch of the same simulation follows this list)
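If you would rather run the simulation yourself, here is a minimal NumPy sketch of the same idea; the function and parameter names (coin_flip_averages, n_flips, n_experiments) are just illustrative, not part of the demo.

```python
import numpy as np

rng = np.random.default_rng(42)

def coin_flip_averages(n_flips: int, n_experiments: int) -> np.ndarray:
    """Return the proportion of heads in each of `n_experiments` experiments,
    where every experiment consists of `n_flips` fair-coin flips."""
    # Each row is one experiment; each entry is 1 (heads) or 0 (tails).
    flips = rng.integers(0, 2, size=(n_experiments, n_flips))
    return flips.mean(axis=1)

# With enough flips and experiments, a histogram of the averages is
# bell-shaped and centered on 0.5.
averages = coin_flip_averages(n_flips=100, n_experiments=1000)
counts, edges = np.histogram(averages, bins=20)
print(counts)
```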
The Mathematics Behind It
Binomial Distribution
When we flip a fair coin \(n\) times, the number of heads \(X\) follows a binomial distribution:
\[P(X = k) = \binom{n}{k} p^k (1-p)^{n-k}\]
where:
- \(n\) is the number of trials (coin flips)
- \(k\) is the number of successes (heads)
- \(p\) is the probability of success (0.5 for a fair coin)
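As a quick sanity check on the formula above, here is a short sketch that computes one binomial probability both directly and with SciPy; using scipy.stats here is an assumption of convenience, since the math module alone suffices for the manual version.

```python
from math import comb

from scipy.stats import binom

n, p = 10, 0.5  # ten flips of a fair coin
k = 6           # probability of exactly six heads

# Directly from the formula above.
manual = comb(n, k) * p**k * (1 - p) ** (n - k)

# The same value from SciPy's binomial distribution.
library = binom.pmf(k, n, p)

print(manual, library)  # both are about 0.205
```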
Central Limit Theorem Application
For a binomial random variable with \(n\) trials and probability \(p\):
- Mean: \(\mu = np\)
- Standard deviation: \(\sigma = \sqrt{np(1-p)}\) (both formulas are checked numerically in the sketch below)
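A brief numerical check of these two formulas, assuming NumPy is available; the sample size of 100,000 draws is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 0.5

# Theoretical moments from the formulas above.
mu = n * p                        # 50.0
sigma = np.sqrt(n * p * (1 - p))  # 5.0

# Empirical check: draw many binomial outcomes and compare.
samples = rng.binomial(n, p, size=100_000)
print(mu, samples.mean())    # both close to 50
print(sigma, samples.std())  # both close to 5
```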
As \(n \to \infty\), the standardized variable converges in distribution to a standard normal:
\[Z = \frac{X - np}{\sqrt{np(1-p)}} \xrightarrow{d} N(0, 1)\]
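To see this convergence concretely, the sketch below standardizes binomial samples for increasing \(n\) and measures how far their empirical distribution is from \(N(0, 1)\) with a Kolmogorov-Smirnov distance; on a typical run the distance shrinks as \(n\) grows.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
p = 0.5

for n in (10, 100, 1000):
    x = rng.binomial(n, p, size=50_000)
    # Standardize using the binomial mean and standard deviation.
    z = (x - n * p) / np.sqrt(n * p * (1 - p))
    # Kolmogorov-Smirnov distance to the standard normal CDF: smaller is closer.
    ks_distance, _ = stats.kstest(z, "norm")
    print(f"n = {n:>4}: KS distance = {ks_distance:.3f}")
```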
Why This Matters for Machine Learning
The Central Limit Theorem is crucial in ML because:
- Statistical Inference: It allows us to make probabilistic statements about model parameters
- Bootstrap Methods: Understanding sampling distributions helps us estimate confidence intervals
- Optimization: Many gradient-based methods rely on the assumption that averages of gradients behave normally
- A/B Testing: We use the CLT to determine if differences between models are statistically significant (see the sketch after this list)
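As a hedged illustration of the A/B-testing point, here is a sketch of a two-proportion z-test, one standard approach that leans on the CLT's normal approximation; the conversion counts are made-up numbers, not real results.

```python
import numpy as np
from scipy.stats import norm

# Hypothetical results: two models evaluated on separate groups of users.
conversions_a, users_a = 120, 1000   # model A: 12.0% conversion
conversions_b, users_b = 150, 1000   # model B: 15.0% conversion

p_a = conversions_a / users_a
p_b = conversions_b / users_b

# Under the null hypothesis of no difference, the pooled rate plus the CLT
# give an approximately normal sampling distribution for p_b - p_a.
p_pool = (conversions_a + conversions_b) / (users_a + users_b)
se = np.sqrt(p_pool * (1 - p_pool) * (1 / users_a + 1 / users_b))

z = (p_b - p_a) / se
p_value = 2 * norm.sf(abs(z))  # two-sided p-value

print(f"z = {z:.2f}, p-value = {p_value:.4f}")
```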
Experiment with the Demo
Try these experiments with the visualization above:
- Start with 10 coin flips and 100 experiments - you’ll see some variation
- Increase to 50 flips - notice the distribution becomes more bell-shaped
- Try 100 flips with 1000 experiments - beautiful normal distribution!
The more coin flips per experiment and the more experiments you run, the more clearly you’ll see the normal distribution emerge.
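If you want to reproduce these three experiments outside the interactive demo, here is a sketch that plots each histogram with the CLT-predicted normal curve overlaid, assuming NumPy, SciPy, and Matplotlib are available.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm

rng = np.random.default_rng(7)
settings = [(10, 100), (50, 100), (100, 1000)]  # (flips per experiment, experiments)

fig, axes = plt.subplots(1, len(settings), figsize=(12, 3))
for ax, (n_flips, n_experiments) in zip(axes, settings):
    # Proportion of heads in each experiment.
    averages = rng.integers(0, 2, size=(n_experiments, n_flips)).mean(axis=1)
    ax.hist(averages, bins=20, density=True, alpha=0.6)

    # Normal curve predicted by the CLT: mean 0.5, std 0.5 / sqrt(n_flips).
    xs = np.linspace(0.0, 1.0, 200)
    ax.plot(xs, norm.pdf(xs, loc=0.5, scale=0.5 / np.sqrt(n_flips)))
    ax.set_title(f"{n_flips} flips, {n_experiments} experiments")

plt.tight_layout()
plt.show()
```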
Key Takeaways
- The Central Limit Theorem is one of the most important results in probability theory
- It explains why the normal distribution appears so frequently in nature and data science
- You don’t need to start with normally distributed data to get normally distributed averages
- Sample size matters: larger samples converge faster to the normal distribution
Try adjusting the parameters in the visualization to build your intuition about how the CLT works!