Adam Optimizer

Why use the same learning rate for every parameter? Discover how Adam uses moving averages to dynamically scale step sizes and accelerate training.

Read Stochastic Gradient Descent Primer

Historical Context

In 2014, Diederik Kingma and Jimmy Ba introduced Adam (Adaptive Moment Estimation). It quickly became the default optimizer for almost every deep learning architecture, from CNNs to Transformers.

Adam combines the ideas of Momentum (remembering past gradients to keep moving) and RMSProp (scaling down the learning rate for parameters with large gradients). By keeping exponentially decaying averages of past gradients and squared gradients, Adam adapts the step size for each individual parameter.

The result is an optimizer that is highly robust to hyperparameter choices, handles sparse gradients well, and converges rapidly in complex, non-convex landscapes.

Layer 0: The Moment Generating Function

The magic of Adam is rooted in statistics. Imagine the gradients computed on each mini-batch as random samples drawn from a probability distribution. The Moment Generating Function (MGF) is a mathematical tool that describes the shape of this distribution.

For a normal distribution with mean $\mu$ and variance $\sigma^2$ , the MGF is $M(t) = \exp(\mu t + \frac{1}{2}\sigma^2 t^2)$ . The crucial property is that the derivatives of the MGF evaluated at $t=0$ give us the moments of the distribution.

The slope at $t=0$ gives the 1st moment (the mean). The curvature at $t=0$ gives the 2nd uncentered moment (related to variance). Play with the sliders below to see how changing the gradient distribution affects the MGF curve and its geometry at the origin.

Moment Generating Function

M(t) = \mathbb{E}[e^{t \cdot g}]

M'(0) = \mathbb{E}[g] = \mu \text{ (1st Moment)}

M''(0) = \mathbb{E}[g^2] = \mu^2 + \sigma^2 \text{ (2nd Moment)}

Layer 2: The AdaGrad Motivation

Before Adam, there was AdaGrad. Its breakthrough idea was to accumulate the squares of all past gradients. If a parameter had historically large gradients, AdaGrad would drastically shrink its learning rate.

This is brilliant for sparse data (like text), where rare words need large updates when they finally appear. However, AdaGrad has a fatal flaw for deep learning: the accumulated sum only ever grows. Eventually, the learning rates shrink to effectively zero, stopping training completely.

Adam fixes this by using a moving average instead of an infinite sum. It 'forgets' the distant past, allowing the learning rate to adapt continuously without decaying to zero.

This conceptual leap—adaptive rates that don't die—is the core reason Adam dominates today.

AdaGrad

G_t = G_{t-1} + g_t^2

\theta_t = \theta_{t-1} - \frac{\eta}{\sqrt{G_t + \epsilon}} \odot g_t

Layer 3: The Convergence Fix (AMSGrad)

Despite Adam's empirical success, researchers in 2018 found a flaw: in some specific cases, the moving average of the variance could shrink, causing the learning rates to spike late in training and destroying convergence.

The solution, AMSGrad, modifies Adam to guarantee that the variance term never decreases. It simply keeps the maximum variance seen so far. This subtle change provides a theoretical guarantee of convergence without sacrificing much empirical performance.

AMSGrad

v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2

\hat{v}_t = \max(\hat{v}_{t-1}, v_t)

\theta_t = \theta_{t-1} - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon} \hat{m}_t

Layer 4: Decoupled Weight Decay (AdamW)

In standard neural network training, we often add Weight Decay (L2 regularization) to prevent overfitting by penalizing large weights.

However, in Adam, adding weight decay to the gradient interacts weirdly with the adaptive learning rates. Parameters with large variances get less regularization! AdamW fixes this by decoupling weight decay from the gradient update, applying it directly to the weights. This small change is a key reason why modern models generalize so well.

L2 Regularization vs AdamW

L_{reg} = L + \frac{\lambda}{2}\|\theta\|^2

\theta_t = \theta_{t-1} - \eta \left( \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} + \lambda \theta_{t-1} \right)

Adam Implementation

Placeholder description.

Output

Output will appear here after running the code

Adam and its variants form the backbone of modern AI training.

Let's return to where it all began.

Back to the Beginning

Return to Babylonian Method