Gradient Descent
Gradient descent is an iterative optimization algorithm that minimizes a function by moving in the direction of steepest descent, defined by the negative gradient.
Gradient descent is a first-order iterative optimization algorithm used to find a local minimum of a differentiable function; for convex functions, any local minimum is also a global minimum. The algorithm works by repeatedly taking steps proportional to the negative of the gradient (or an approximate gradient) of the function at the current point. Because the negative gradient points in the direction of steepest local decrease, each sufficiently small step lowers the function value. The proportionality factor, known as the step size or learning rate, controls how large each step is. The algorithm was first described by Augustin-Louis Cauchy in 1847.
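The update described above can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the function name gradient_descent, its parameters, and the example objective f(x) = (x - 3)^2 are all assumptions chosen for the example.

```python
def gradient_descent(grad, x0, learning_rate=0.1, n_steps=100):
    """Repeatedly step against the gradient, starting from x0."""
    x = x0
    for _ in range(n_steps):
        # Move in the direction of the negative gradient, scaled by the learning rate.
        x = x - learning_rate * grad(x)
    return x


# Illustrative objective (an assumption): f(x) = (x - 3)^2, with gradient 2 * (x - 3).
def grad_f(x):
    return 2.0 * (x - 3.0)


x_min = gradient_descent(grad_f, x0=0.0)
print(x_min)  # very close to 3.0, the minimizer of f
```

With this objective each step shrinks the distance to the minimizer by a constant factor of 0.8, so 100 steps are more than enough to reach it to high precision.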
In machine learning, gradient descent is the core algorithm for training many models, including linear regression, logistic regression, and neural networks. The function being minimized is typically a loss function that measures the error between the model’s predictions and the actual data. By adjusting the model’s parameters (weights and biases) to minimize this loss, the model learns from the data. Variants include batch gradient descent, which uses the entire dataset to compute the gradient; stochastic gradient descent (SGD), which uses a single randomly selected data point; and mini-batch gradient descent, which uses a small subset of the data.
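The three variants differ only in how many data points feed each gradient estimate, which a single batch_size parameter can express. The sketch below fits a one-variable linear regression in plain Python; the function minibatch_gd, the hyperparameter values, and the noise-free toy data are hypothetical choices made for illustration.

```python
import random


def minibatch_gd(xs, ys, batch_size, learning_rate=0.05, n_epochs=500, seed=0):
    """Fit y ~ w * x + b by minimizing mean squared error.

    batch_size == len(xs) gives batch gradient descent,
    batch_size == 1 gives stochastic gradient descent,
    and anything in between gives mini-batch gradient descent.
    """
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(n_epochs):
        rng.shuffle(indices)  # visit the data in a new random order each epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # Gradient of the mean squared error over this batch only.
            grad_w = sum(2 * (w * xs[i] + b - ys[i]) * xs[i] for i in batch) / len(batch)
            grad_b = sum(2 * (w * xs[i] + b - ys[i]) for i in batch) / len(batch)
            w -= learning_rate * grad_w
            b -= learning_rate * grad_b
    return w, b


# Hypothetical noise-free data generated from y = 2x + 1.
xs = [0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5]
ys = [2.0 * x + 1.0 for x in xs]

results = {}
for batch_size in (len(xs), 1, 4):  # batch, stochastic, mini-batch
    results[batch_size] = minibatch_gd(xs, ys, batch_size)
    print(batch_size, results[batch_size])
```

On this toy problem all three settings recover a slope near 2 and an intercept near 1. On real, noisy data the smaller batch sizes would give noisier parameter trajectories in exchange for cheaper updates.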
The algorithm’s convergence depends on the learning rate and the shape of the loss function. If the learning rate is too small, convergence is slow; if it is too large, the algorithm may overshoot the minimum, oscillate, or diverge. Adaptive methods such as Adam, RMSprop, and Adagrad scale the step for each parameter using running statistics of past gradients, which often improves convergence and reduces the need for manual tuning. Despite its simplicity, gradient descent remains a foundational tool in optimization and machine learning.
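The effect of the learning rate is easy to see on a toy problem. The sketch below minimizes f(x) = x^2, whose gradient is 2x, so each update multiplies x by (1 - 2 * learning_rate). The function name run and the three learning rates are assumptions picked to show the three regimes.

```python
def run(learning_rate, x0=1.0, n_steps=20):
    """Minimize f(x) = x**2 (gradient 2x) and return the final distance from 0."""
    x = x0
    for _ in range(n_steps):
        x = x - learning_rate * 2.0 * x
    return abs(x)


too_small = run(0.01)   # x shrinks by a factor of 0.98 per step: slow progress
reasonable = run(0.4)   # x shrinks by a factor of 0.2 per step: fast convergence
too_large = run(1.1)    # |x| grows by a factor of 1.2 per step: divergence
print(too_small, reasonable, too_large)
```

After 20 steps the small rate has covered only about a third of the distance to the minimum, the moderate rate has effectively reached it, and the large rate has moved further away than where it started.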
Why it matters
Gradient descent matters because it is a practical, scalable way to optimize models with millions of parameters. Each update needs only the gradient, whose cost grows roughly linearly with the number of parameters, whereas methods that rely on full second-derivative information become too expensive at that scale. Its stochastic and mini-batch variants make it possible to learn from datasets too large to process in a single pass per update, which has made it a cornerstone of modern artificial intelligence and data science.
First appeared
Cauchy, 1847.
FAQ
How does it work?
Gradient descent works by computing the gradient of the loss function with respect to the model’s parameters. It then updates the parameters by subtracting the gradient scaled by the learning rate. This process is repeated until a stopping criterion is met, for example when the loss stops decreasing, the gradient becomes very small, or a maximum number of iterations is reached.
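Written as a formula, one update step looks as follows. The symbols are notational assumptions, not taken from the text above: \theta_t denotes the parameters after t steps, \eta the learning rate, and L the loss function. The second part works through two steps by hand for the illustrative loss L(\theta) = \theta^2 with \theta_0 = 1 and \eta = 0.1.

```latex
\theta_{t+1} = \theta_t - \eta \, \nabla_\theta L(\theta_t)

% Worked example with L(\theta) = \theta^2, so \nabla_\theta L(\theta) = 2\theta:
\theta_1 = 1 - 0.1 \cdot (2 \cdot 1) = 0.8
\theta_2 = 0.8 - 0.1 \cdot (2 \cdot 0.8) = 0.64
```

Each step multiplies the parameter by 0.8, so the iterates approach the minimizer \theta = 0 geometrically.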
What is the difference between batch and stochastic gradient descent?
Batch gradient descent computes the gradient using the entire training dataset, which gives the exact gradient of the training loss but makes each update expensive for large datasets. Stochastic gradient descent (SGD) computes the gradient using a single randomly chosen data point, which makes each update much cheaper but noisier. Mini-batch gradient descent is a compromise: it averages the gradient over a small batch of data points, reducing the noise while keeping updates cheap.
When should adaptive methods like Adam be used instead of standard gradient descent?
Adaptive methods like Adam are often preferred when training deep neural networks because they automatically scale the step size for each parameter, which tends to cope better with sparse or noisy gradients and usually requires less learning-rate tuning. Standard gradient descent with a fixed learning rate is simpler and may work well for convex problems or when the learning rate is carefully tuned.
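To make the per-parameter adjustment concrete, here is a minimal plain-Python sketch of the Adam update. It keeps running averages of the gradient (m) and the squared gradient (v) for every parameter, applies bias correction, and divides the step by the square root of v. The function name adam, the toy objective, and the learning rate of 0.1 are assumptions for this example; the learning rate is larger than typical defaults so that the toy problem converges in few steps.

```python
import math


def adam(grad, theta0, learning_rate=0.1, beta1=0.9, beta2=0.999,
         eps=1e-8, n_steps=1000):
    """Minimal Adam sketch operating on a list of float parameters."""
    theta = list(theta0)
    m = [0.0] * len(theta)  # running average of gradients
    v = [0.0] * len(theta)  # running average of squared gradients
    for t in range(1, n_steps + 1):
        g = grad(theta)
        for i in range(len(theta)):
            m[i] = beta1 * m[i] + (1 - beta1) * g[i]
            v[i] = beta2 * v[i] + (1 - beta2) * g[i] ** 2
            # Bias correction compensates for m and v starting at zero.
            m_hat = m[i] / (1 - beta1 ** t)
            v_hat = v[i] / (1 - beta2 ** t)
            # Each parameter gets its own effective step size.
            theta[i] -= learning_rate * m_hat / (math.sqrt(v_hat) + eps)
    return theta


# Illustrative objective (an assumption): f(x, y) = (x - 3)^2 + 10 * (y + 1)^2.
# The two coordinates have very different curvature.
def grad_f(theta):
    x, y = theta
    return [2.0 * (x - 3.0), 20.0 * (y + 1.0)]


result = adam(grad_f, [0.0, 0.0])
print(result)  # expected to end up close to [3.0, -1.0]
```

Because the step for each coordinate is normalized by that coordinate's own gradient history, the steep y direction and the shallow x direction are handled with a single learning rate, which is the behavior the answer above refers to.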