Gradient Descent
The workhorse of machine learning. A first-order optimization algorithm that takes steps proportional to the negative of the gradient of a function at the current point.
The Historical Context
The method of gradient descent was first proposed by the French mathematician Augustin-Louis Cauchy in 1847. He developed the technique to solve systems of simultaneous equations related to the orbital calculations of celestial bodies. Long before the advent of electronic computers or machine learning, Cauchy realized that moving in the direction of the steepest descent—the negative gradient—was a highly effective way to iteratively minimize a complex error function.
For over a century, the algorithm remained a relatively obscure numerical method within applied mathematics. It wasn't until the mid-20th century, particularly with the work of Haskell Curry in 1944 and later developments in non-linear optimization, that gradient descent began to take its modern form. In 1951, Herbert Robbins and Sutton Monro introduced Stochastic Gradient Descent (SGD), a profound variation that paved the way for optimizing functions with noisy data or massive datasets.
Today, gradient descent and its sophisticated descendants (like Momentum, RMSprop, and Adam) are the undisputed engines of modern artificial intelligence. Every time a deep neural network learns to recognize an image, translate a language, or generate text, it is fundamentally performing Cauchy's algorithm—navigating a high-dimensional loss landscape by stepping endlessly down the steepest slope.
Layer 4: The Local Minimum Trap
Gradient Descent is a fundamentally greedy algorithm. It only looks at its immediate, local neighborhood to decide where to step next.
This creates a fatal flaw: if the landscape has multiple valleys, the algorithm will happily descend into the first valley it finds and declare victory, completely oblivious to a much deeper, better valley just over the next ridge.
In the interactive plots below, we start in the shallow local minimum on the right. Press play and watch it converge. Then, click anywhere on the left side of the ridge to drop into the true global minimum.
The Asymmetric Double Bowl
Notice how the right basin is extremely shallow, while the left basin is much deeper. Because Gradient Descent is purely downhill-seeking, if you start anywhere on the right, you are mathematically doomed to get stuck in the shallow basin. The algorithm has no concept of "momentum" to carry it over the central ridge.
Gradient Descent Implementation
The algorithm powering every neural network ever trained. Implement it in 10 lines.
In classic Batch Gradient Descent, we compute the gradient using the entire dataset. The loss landscape is perfectly static, the gradient is perfectly accurate, and the descent is perfectly deterministic. That's precisely why it gets trapped.
But what if we compute the gradient on a small, random mini-batch instead? The approximation error acts as noise, shaking the landscape and bouncing us out of local minima.
Welcome to Stochastic Gradient Descent