Stochastic Gradient Descent

Why compute the exact gradient when an approximation is faster and fundamentally better? Discover how noise acts as kinetic energy to escape topological traps.

Read Gradient Descent Primer

Historical Context

In 1951, Herbert Robbins and Sutton Monro published "A Stochastic Approximation Method," introducing a technique that could optimize a function using only noisy, uncertain estimates.

Today, this obscure trick is the undisputed engine of modern AI. When training massive neural networks, calculating the perfect gradient on all your data at once takes way too long. Instead, we estimate it using a small, random chunk of data called a mini-batch.

Because it's just an estimate, our direction is a bit noisy. But surprisingly, this noise is actually a superpower! It acts like a steady vibration, bumping our optimizer out of bad traps where a perfect gradient would get permanently stuck.

Layer 2: Escaping Saddle Points

Even with perfect directions, optimizers can still get stuck. A classic trap is a Saddle Point. Think of a horse saddle—it curves up in one direction and down in another, but the exact middle is perfectly flat.

A basic optimizer without noise stops dead in its tracks when it hits this flat spot because the gradient is zero.

But thanks to the built-in noise from mini-batches, SGD just jiggles off the flat part and slides down safely!

Note on Gradient Explosions: Sometimes in deep networks, the math compounds wildly and the steps become impossibly huge, launching the optimizer out of bounds entirely (NaN).

The fix is easy: we just clip the math, saying "never take a step larger than X."

Layer 3: Momentum

If normal SGD is someone blindly zigzagging down a hill, Momentum is putting them on a heavy sled.

Instead of instantly changing direction every time the noisy gradient points a new way, the sled builds up speed in the general direction it's been going. This powers it straight through the side-to-side wobbles and blasts it down the valley much faster.

Vanilla SGD

x_{n+1} = x_n - \eta \cdot \nabla f(x_n)

SGD with Momentum

v_{n+1} = \beta \cdot v_n + \nabla f(x_n)

x_{n+1} = x_n - \eta \cdot v_{n+1}

where:

$x_n$ = Current position
$v_n$ = Current velocity vector
$\beta$ = Momentum coefficient (friction)
$\eta$ = Learning rate
$\nabla f(x_n)$ = Gradient at current position

Layer 4: Learning Rate Schedulers

We know that taking big steps makes it impossible to settle at the exact bottom. The solution is a Learning Rate Schedule—slowly pressing the brakes as you get closer to the end.

We start with big steps to quickly move across the map and bounce out of traps. Then, we slowly shrink our step size.

This drains the energy out of the bouncing, letting the optimizer gently glide into the exact bottom.

Constant

\eta_n = \eta_0

Step Decay

\eta_n = \eta_0 \cdot \gamma^{\lfloor n / \text{step} \rfloor}

Exponential Decay

\eta_n = \eta_0 \cdot \exp(-k \cdot n)

Cosine Annealing w/ Restarts

\eta_n = \eta_{\min} + \frac{1}{2}(\eta_{\max} - \eta_{\min})\left(1 + \cos\left(\frac{T_{cur}}{T_i}\pi\right)\right)

where:

$\eta_n$ = Learning rate at step n
$\eta_0$ = Initial learning rate
$\gamma$ = Decay factor
$k$ = Exponential decay rate
$T_{cur}$ = Current step within the cosine restart period
$T_i$ = Total steps in the current cosine period

Stochastic Gradient Descent Implementation

Placeholder description.

Output

Output will appear here after running the code

Momentum keeps us moving, and Schedules help us settle. But if the landscape is stretched like a valley, we still struggle.

What if the algorithm could give each coordinate its own personal learning rate?

The Anisotropic Problem

Enter AdaGrad