AdaGrad
Why should every parameter share the same learning rate? Discover how giving each dimension its own adaptive step size solves the zigzagging problem.
Historical Context
In 2011, John Duchi, Elad Hazan, and Yoram Singer published Adaptive Subgradient Methods for Online Learning and Stochastic Optimization (AdaGrad). Up until this point, algorithms largely applied a single, global learning rate to every parameter in a model.
AdaGrad asked a simple question: What if a parameter is associated with a very rare feature? With a global learning rate, rare features barely get updated, while frequent features get updated constantly. AdaGrad introduced the idea of adaptive, per-parameter learning rates.
By keeping a running sum of the squared gradients for each parameter, AdaGrad automatically shrinks the learning rate for frequently updated parameters and keeps it high for rarely updated ones. This was a massive conceptual leap that paved the way for modern optimizers like RMSProp and Adam.
Layer 1: The Anisotropic Landscape
In most real-world problems, loss landscapes are anisotropic—meaning they don't slope equally in all directions. Imagine a stretched valley that is very steep along the Y-axis but incredibly flat along the X-axis.
Standard Gradient Descent struggles here. If you set the learning rate low enough to avoid exploding on the steep Y-axis, the optimizer crawls at a snail's pace along the flat X-axis. If you increase the learning rate to move faster along X, it wildly overshoots and diverges along Y.
We need an algorithm that can take small steps in steep directions and large steps in flat directions, automatically.
Notice how Vanilla SGD (left) points either straight down or up the steep walls of the valley, mostly ignoring the path to the global minimum. It gets trapped bouncing back and forth across the Y-axis.
AdaGrad (right) tracks the history of the gradients. Because the Y gradients are huge, its accumulator grows rapidly, shrinking the effective learning rate in the Y direction. This acts as a brake and normalizes the step size, allowing it to point correctly down the center of the valley!
Vanilla SGD
AdaGrad
- = Next position
- = Current position
- = Base learning rate
- = Gradient of the objective function
- = Sum of squares of past gradients
- = Small constant for numerical stability
- = Element-wise multiplication
Vanilla SGD
AdaGrad
Layer 2: The Gradient Accumulator
AdaGrad solves this by giving every parameter its own personal learning rate. It does this by keeping a memory of every gradient it has ever seen.
For each coordinate, AdaGrad tracks the sum of squared gradients (let's call it G). The effective learning rate for that coordinate is the base learning rate divided by the square root of G.
If a direction is steep, its gradients are large, so G grows rapidly. Dividing by a large number shrinks the step size, acting as an automatic brake. If a direction is flat, G stays small, keeping the learning rate high.
Notice how the step ellipse adapts below: it squashes in the steep direction to prevent overshooting, and stretches in the flat direction to accelerate progress.
- = Accumulated squared gradients
- = Gradient at step i
- = Element-wise multiplication
- = Adaptive learning rate
- = Base learning rate
- = Small constant for numerical stability
Layer 3: The Fatal Flaw & Sparse Gradients
Look closely at the formula. G is a sum of squares, which means it can only ever increase. It never forgets the past.
Because we divide by this ever-growing sum, the learning rate for every parameter shrinks continuously over time. Eventually, the learning rates become infinitesimally small, and the algorithm completely freezes, often long before reaching the minimum.
Why use it then? Because for sparse data (like words in a massive vocabulary), most parameters get a gradient of zero on most steps. Their G barely grows, preserving their learning rate for when they finally appear in a batch. AdaGrad remains incredibly effective for sparse problems, even if its freezing flaw required the invention of RMSProp and Adam for dense deep learning.
Simulated Sparse Gradients
- = Gradient in the y direction
- = Probability of zeroing the y gradient
- = True partial derivative
AdaGrad Implementation
Placeholder description.
AdaGrad proved that per-parameter learning rates were the key to navigating complex, stretched landscapes.
But its infinite memory causes it to freeze prematurely. What if we allowed it to gradually forget the distant past?
Fixing the Accumulator