The Bernoulli Distribution
The Distribution That Broke Deep Learning
Introduction
Bernoulli is ruthlessly simple. You are a 1 or you are a 0. There is no negotiation.
This makes it the most honest way to model a binary decision — a neuron either fires or it doesn't, a weight either exists or it doesn't. The L0 norm, the most truthful measure of a network's complexity, is just a sum of Bernoulli gates. Turn off the weights that don't matter. Count what's left.
The problem: backpropagation is the chain rule. The chain rule needs derivatives. And a Bernoulli sample has none worth using: as a function of its parameter, the sampled gate is a step, flat almost everywhere and undefined at the jump. The moment you put a Bernoulli gate inside a neural network, the gradient hits it and dies. The Colab session implodes. The most honest regularizer is mathematically incompatible with the only training method that works at scale.
This is the distribution that broke deep learning. Understanding why requires going back to the coin flip.
The Historical Context
The Bernoulli distribution is named after the Swiss mathematician Jacob Bernoulli (1655–1705). His magnum opus, Ars Conjectandi (The Art of Conjecturing), published posthumously in 1713, laid the foundations for probability theory and combinatorics. Before Bernoulli, probability was largely constrained to specific games of chance; Bernoulli generalized these concepts to apply to civil, moral, and economic domains.
Crucially, within Ars Conjectandi, Bernoulli introduced the concept of sequences of independent trials with two possible outcomes, which we now call Bernoulli trials. He used this foundation to formally prove the first version of the Law of Large Numbers, demonstrating that as the number of trials increases, the empirical proportion of successes converges to the theoretical probability $p$.
§1. The Basics
Consider an experiment that can result in only one of two states. Yes or no. 1 or 0.
For our example, we are going to work with the infamous coin: a standard coin with absolutely no special attributes. Let us take Heads (H) to be the success state, occurring with probability $p$. Consequently, Tails (T) is the failure state, occurring with probability $1 - p$ (often denoted as $q$).
It is that simple.
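A minimal sketch of a single trial in Python, assuming only the standard library. The function name `bernoulli_trial` and the seed are illustrative choices, not something the lesson defines.

```python
import random

def bernoulli_trial(p, rng=random):
    """Return 1 (Heads, success) with probability p, else 0 (Tails, failure)."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    # rng.random() is uniform on [0, 1), so the comparison is True with probability p.
    return 1 if rng.random() < p else 0

rng = random.Random(0)  # fixed seed (an arbitrary choice) so runs are repeatable
flips = [bernoulli_trial(0.5, rng) for _ in range(10)]
print(flips)  # a list of ten 0s and 1s
```

Passing a seeded `random.Random` instance keeps the experiment reproducible; dropping the `rng` argument falls back to the module-level generator.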
§2. The Expected Value
If you wanted to empirically calculate the probability that the next flip of our coin is Heads, you could throw it 10-15 times, write down each outcome, and divide the number of heads by the total number of trials. Simple enough!
But what if you wanted to know what to expect for the next flip? Intuitively, the expected value should simply be the probability of getting Heads. Why? Because we specifically defined Heads as $1$ and Tails as $0$.
When you calculate an expected value, you are taking a weighted average. Writing $X$ for the outcome of the flip, you multiply the value of Heads ($1$) by its probability ($p$), and add the value of Tails ($0$) multiplied by its probability ($1 - p$). The zero obliterates the Tails term, leaving only the probability of Heads:
$$E[X] = 1 \cdot p + 0 \cdot (1 - p) = p$$
$E[X] = p$ is almost disappointingly simple. The expected value of a coin flip is just the probability of getting Heads. Boring. But this sets up the contrast for what comes next.
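The weighted average and the "flip it and count" estimate can be compared directly. This is a rough sketch under assumed values: `expected_value`, `empirical_mean`, the bias 0.3, and the flip counts are all hypothetical choices for illustration.

```python
import random

def expected_value(p):
    """Weighted average of the two outcomes: 1 with weight p, 0 with weight 1 - p."""
    return 1 * p + 0 * (1 - p)

def empirical_mean(p, n_flips, seed=0):
    """Fraction of Heads observed in n_flips simulated flips of a coin with bias p."""
    rng = random.Random(seed)
    heads = sum(1 for _ in range(n_flips) if rng.random() < p)
    return heads / n_flips

p = 0.3
print("exact:", expected_value(p))
for n in (15, 1_000, 100_000):
    # The estimate is noisy for small n and settles near p as n grows.
    print(n, empirical_mean(p, n))
```

With only 15 flips the estimate can land well away from 0.3; the larger runs show the Law of Large Numbers at work.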
§3. Variance
The interesting one is variance. To understand why it behaves the way it does, don't look at the algebra first. Think about it this way: variance measures how surprised you get on average.
When are you most surprised by a coin flip? When $p = 0.5$. A fair coin. You genuinely have no idea what's coming. Maximum surprise, maximum variance. When are you least surprised? When $p = 0$ or $p = 1$. A two-tailed or two-headed coin. You already know the answer before you flip. Zero surprise, zero variance.
So variance should be a function that is zero at the edges and maximum at 0.5. What simple function does that? $p(1 - p)$. When $p = 0$: $0 \cdot 1 = 0$. When $p = 1$: $1 \cdot 0 = 0$. When $p = 0.5$: $0.5 \cdot 0.5 = 0.25$. Maximum.
The multiplication here isn't just an algebraic trick—it is the direct product of "how likely is success" and "how likely is failure." When both are substantial, uncertainty is maximum. When either collapses to zero, certainty takes over.
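For readers who do want the algebra, here is a short derivation that lands on the same product. It assumes $X$ is the 0/1 outcome of the flip with $E[X] = p$, and uses the standard identity $\mathrm{Var}(X) = E[X^2] - (E[X])^2$.

```latex
\begin{align*}
E[X^2] &= 1^2 \cdot p + 0^2 \cdot (1 - p) = p \\
\mathrm{Var}(X) &= E[X^2] - (E[X])^2 = p - p^2 = p(1 - p) \\
\frac{d}{dp}\, p(1 - p) &= 1 - 2p = 0 \iff p = \tfrac{1}{2},
\qquad \mathrm{Var}(X)\big|_{p = 1/2} = \tfrac{1}{4}
\end{align*}
```

The first line works because squaring a 0 or a 1 changes nothing, so $E[X^2]$ equals $E[X]$.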
§4. The PMF
Now, the compression. We need to unify these two distinct states, success and failure, into a single mathematical function: the Probability Mass Function (PMF). We want an equation that spits out $p$ when $x = 1$, and $1 - p$ when $x = 0$.
In programming, this is just an if/else statement. Math can write it piecewise too, but a single closed-form expression is far easier to manipulate. How do we build a switch without logic gates? We use exponents.
Anything to the power of 0 is 1, and anything to the power of 1 is itself. If we use our binary outcome as an exponent, it acts exactly like an on/off switch. $p^x$ turns on when $x = 1$. To get the inverse switch for failure, we use $(1 - p)^{1 - x}$, which turns on when $x = 0$.
Multiply them together, and you get a single elegant equation that perfectly routes the inputs:
$$P(X = x) = p^x (1 - p)^{1 - x}, \quad x \in \{0, 1\}$$
It is not an arbitrary definition. It is a mathematical trick to compress a binary state into a single line.
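The exponent switch can be checked against the plain if/else it replaces. A small sketch; the function names and the bias 0.25 are made up for illustration.

```python
def bernoulli_pmf(x, p):
    """P(X = x) = p**x * (1 - p)**(1 - x) for x in {0, 1}."""
    if x not in (0, 1):
        raise ValueError("x must be 0 or 1")
    # Exactly one of the two factors is raised to the power 1; the other becomes 1.
    return p**x * (1 - p) ** (1 - x)

def bernoulli_pmf_branch(x, p):
    """The same function written as an explicit if/else, for comparison."""
    return p if x == 1 else 1 - p

p = 0.25
for x in (0, 1):
    print(x, bernoulli_pmf(x, p), bernoulli_pmf_branch(x, p))
```

Python evaluates `0.0 ** 0` as `1.0`, so the formula still behaves at the edge cases $p = 0$ and $p = 1$.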
§5. Entropy
Before the flip, how much do you not know? That's the question entropy answers.
A fair coin, $p = 0.5$, is the most uncertain coin imaginable. You know nothing about the next outcome. Your best guess is still just a guess. Now bias it toward heads, say $p = 0.9$, and suddenly the flip is barely surprising. You'd bet heads every time and you'd be right nine times out of ten. The uncertainty collapsed.
Shannon [1948] showed that this intuition can be made precise. For any binary trial, the amount of uncertainty (the information entropy) is:
$$H(p) = -p \log_2 p - (1 - p) \log_2 (1 - p)$$
At $p = 0.5$, this returns exactly 1 bit, the purest atomic unit of uncertainty. Push $p$ toward either extreme and the entropy falls to 0 (using the convention $0 \log_2 0 = 0$). A coin that always lands heads carries no information at all; you already know what's coming.
Entropy runs much deeper than coin flips — it underpins cross-entropy loss, KL divergence, and the information-theoretic foundations of machine learning. We'll return to it properly in a later module.
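The binary entropy is easy to evaluate numerically. The sketch below assumes base-2 logarithms (bits) and the usual convention that a zero-probability outcome contributes nothing; `bernoulli_entropy` is an illustrative name.

```python
import math

def bernoulli_entropy(p):
    """Entropy in bits of a Bernoulli(p) trial, treating 0 * log2(0) as 0."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    total = 0.0
    for q in (p, 1.0 - p):
        if q > 0.0:  # skip zero-probability outcomes instead of calling log2(0)
            total -= q * math.log2(q)
    return total

for p in (0.0, 0.1, 0.5, 0.9, 1.0):
    print(p, round(bernoulli_entropy(p), 3))
```

The fair coin gives exactly 1 bit, while the coin biased to 0.9 carries a little under half a bit.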
§6. Independence
A single Bernoulli trial is the fundamental unit of probability, but its real power emerges when we start stacking them.
Imagine flipping our coin twice. If the flips are independent, it means the outcome of the first flip has absolutely zero influence on the second. The coin has no memory.
Mathematically, independence allows us to simply multiply the probabilities together. The probability of getting Heads and then Tails is precisely the probability of Heads multiplied by the probability of Tails: $P(H \text{ then } T) = p \cdot (1 - p)$.
This innocent property of multiplication is what allows us to scale a single coin flip into complex, massive systems. And that leads directly to the Binomial distribution.
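The product rule can be sanity-checked by simulation. This is a sketch with assumed values: the bias 0.6, the number of pairs, and the function name are all illustrative.

```python
import random

def estimate_heads_then_tails(p, n_pairs, seed=0):
    """Estimate P(Heads on flip 1, Tails on flip 2) from n_pairs simulated pairs."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_pairs):
        first_heads = rng.random() < p   # flip 1
        second_heads = rng.random() < p  # flip 2, a fresh draw that ignores flip 1
        if first_heads and not second_heads:
            hits += 1
    return hits / n_pairs

p = 0.6
print("product rule:", p * (1 - p))
print("simulation:  ", estimate_heads_then_tails(p, 200_000))
```

The two numbers should agree to roughly two decimal places; the remaining gap is sampling noise, not a failure of independence.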
The Missing Gradient
So why did Bernoulli break deep learning? Because you cannot differentiate a coin flip. To train a network that uses Bernoulli gates (for things like sparse routing or hardware-efficient quantization) we are forced to cheat. The Gumbel-Softmax trick swaps the hard sample for a continuous relaxation; the Straight-Through Estimator keeps the hard sample in the forward pass and pretends the function is continuous during the backward pass. The pure, discrete Bernoulli trial remains fundamentally hostile to the calculus of backpropagation.
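To make the dead gradient concrete, here is a framework-free sketch in plain Python. It assumes the gate probability is parameterized as `sigmoid(logit)` and that the noise `u` is held fixed, and it shows one common variant of the straight-through idea (passing the sigmoid's derivative through the sampling step). All function names are hypothetical; real training code would rely on an autodiff library instead.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def hard_gate(logit, u):
    """Forward pass: Bernoulli sample with p = sigmoid(logit), given noise u in [0, 1)."""
    return 1.0 if u < sigmoid(logit) else 0.0

def finite_difference(f, x, eps=1e-4):
    """Central-difference estimate of df/dx."""
    return (f(x + eps) - f(x - eps)) / (2 * eps)

def ste_gradient(logit, upstream_grad):
    """Backward pass of a straight-through variant (an assumption of this sketch):
    ignore the hard threshold and use d sigmoid / d logit in its place."""
    p = sigmoid(logit)
    return upstream_grad * p * (1.0 - p)

u = 0.9  # fixed noise draw, chosen arbitrarily
true_grad = finite_difference(lambda z: hard_gate(z, u), 0.0)
print("gate output:       ", hard_gate(0.0, u))
print("true gradient:     ", true_grad)               # the step is flat here
print("surrogate gradient:", ste_gradient(0.0, 1.0))  # what the optimizer is told
```

The true derivative of the hard gate is zero wherever it exists, so nothing upstream would ever learn. The surrogate is biased, but it gives the optimizer a usable direction.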
Interactive Laboratory
Put the theory to work. We've written some Python simulating the concepts above. The code is running entirely in your browser via WebAssembly. Break it, change the probabilities, and see what happens.
From Atomic to Aggregate
We have understood the single biased coin. What happens when we flip $n$ independent coins with the same bias $p$ and count the total number of successes? This leads us naturally to the Binomial Distribution.