Stochastic Gradient Descent
Why compute the exact gradient when an approximation is faster and fundamentally better? Discover how noise acts as kinetic energy to escape topological traps.
Historical Context
In 1951, Herbert Robbins and Sutton Monro published "A Stochastic Approximation Method," introducing a technique that could optimize a function using only noisy, uncertain estimates.
Today, this obscure trick is the undisputed engine of modern AI. When training massive neural networks, calculating the perfect gradient on all your data at once takes way too long. Instead, we estimate it using a small, random chunk of data called a mini-batch.
Because it's just an estimate, our direction is a bit noisy. But surprisingly, this noise is actually a superpower! It acts like a steady vibration, bumping our optimizer out of bad traps where a perfect gradient would get permanently stuck.
Layer 2: Escaping Saddle Points
Even with perfect directions, optimizers can still get stuck. A classic trap is a Saddle Point. Think of a horse saddle—it curves up in one direction and down in another, but the exact middle is perfectly flat.
A basic optimizer without noise stops dead in its tracks when it hits this flat spot because the gradient is zero.
But thanks to the built-in noise from mini-batches, SGD just jiggles off the flat part and slides down safely!
Note on Gradient Explosions: Sometimes in deep networks, the math compounds wildly and the steps become impossibly huge, launching the optimizer out of bounds entirely (NaN).
The fix is easy: we just clip the math, saying "never take a step larger than X."
Layer 3: Momentum
If normal SGD is someone blindly zigzagging down a hill, Momentum is putting them on a heavy sled.
Instead of instantly changing direction every time the noisy gradient points a new way, the sled builds up speed in the general direction it's been going. This powers it straight through the side-to-side wobbles and blasts it down the valley much faster.
Vanilla SGD
SGD with Momentum
- = Current position
- = Current velocity vector
- = Momentum coefficient (friction)
- = Learning rate
- = Gradient at current position
Layer 4: Learning Rate Schedulers
We know that taking big steps makes it impossible to settle at the exact bottom. The solution is a Learning Rate Schedule—slowly pressing the brakes as you get closer to the end.
We start with big steps to quickly move across the map and bounce out of traps. Then, we slowly shrink our step size.
This drains the energy out of the bouncing, letting the optimizer gently glide into the exact bottom.
Constant
Step Decay
Exponential Decay
Cosine Annealing w/ Restarts
- = Learning rate at step n
- = Initial learning rate
- = Decay factor
- = Exponential decay rate
- = Current step within the cosine restart period
- = Total steps in the current cosine period
Stochastic Gradient Descent Implementation
Placeholder description.
Momentum keeps us moving, and Schedules help us settle. But if the landscape is stretched like a valley, we still struggle.
What if the algorithm could give each coordinate its own personal learning rate?
The Anisotropic Problem