Stochastic gradient descent (SGD) is an optimization algorithm commonly used to improve the performance of machine learning models. It is a variant of the traditional gradient descent algorithm, with a key modification: instead of relying on the entire dataset to compute the gradient at each step, SGD uses a single data sample at a time.
Gradient descent (GD) is an optimization algorithm that iteratively minimizes an objective function. In the context of machine learning (ML), gradient descent is fundamental to improving the performance of supervised learning models during their training phase. Machine learning models, like neural networks, are complex, nonlinear and high-dimensional. Unlike linear regression, these models have no closed-form normal equation for computing the optimal weights. Instead, iterative approximation methods such as gradient descent variants, Newton's method and expectation maximization are used, among others.
Every model has a loss function, sometimes called a cost function. This function measures how far a model's predictions are from the true data points; think of it as a measure of how "wrong" the model's predictions are. For example, the mean-squared error often serves as the loss function in regression problems. The model training phase is designed to find the parameter values that minimize this loss, which is why gradient descent is often the optimization technique used in training. The algorithm computes the gradient, or slope, of the loss with respect to the model's parameters. With this gradient, it then takes a step in the opposite direction to reduce the loss. The learning rate (also referred to as the step size or alpha) controls the size of these steps, and it remains fixed for all model parameters. This process repeats until the model converges near a minimum.
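As a concrete single-step example with illustrative numbers: suppose a parameter $\theta$ currently sits where the gradient of the loss is $\nabla_\theta L = 2.0$ and the learning rate is $\alpha = 0.1$. Then:

$$\theta_{\text{new}} = \theta - \alpha \nabla_\theta L = \theta - 0.1 \times 2.0 = \theta - 0.2$$

The parameter moves a small step in the direction that decreases the loss; repeating this update drives the loss downhill.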
Convergence ideally occurs at the global minimum. In the following visualization, you can see that a local minimum is a point where the loss value is lower than in its immediate surroundings, but not necessarily the lowest value overall. The global minimum is the absolute lowest value of the loss function across its entire domain, representing the best possible solution for the problem.
If the learning rate is poorly chosen, the algorithm can overshoot minima or settle in a local minimum instead of the global one. A well-chosen rate is essential for minimizing the loss function and giving the algorithm its best chance of converging at a global minimum.
This visualization depicts the effect of the learning rate on convergence. A small learning rate leads to slow but stable convergence (left), while a large learning rate might cause overshooting and instability (right).
The key differentiator between traditional gradient descent and stochastic gradient descent is that SGD updates model weights by using a single training example at a time, picked at random at each iteration.1 Gradient descent uses the entire training dataset to compute the gradient before each parameter update. This difference in data usage is what makes SGD much less computationally expensive per step and easier to scale to large datasets. Conversely, the convergence behavior of SGD is noisier than that of GD, because a single data point might not be a good representation of the dataset; a poor sample can push the parameters in a slightly "wrong" direction. However, this randomness is also what makes SGD faster and sometimes better for nonconvex optimization problems, because it can escape shallow local minima or saddle points.
Strictly speaking, SGD was originally defined to update parameters by using exactly one training sample at a time. In modern usage, the term "SGD" is used loosely to mean minibatch gradient descent, a variant of GD in which small batches of training data are used at a time. The major advantage of using subsets of data rather than a single sample is a lower noise level, because the gradient is the average of the losses over the minibatch. For this reason, minibatch gradient descent is the default in deep learning, and strict SGD is rarely used in practice. Most machine learning libraries, such as PyTorch and TensorFlow, conflate these terms: their optimizers are called "SGD," even though they typically operate on minibatches.
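The following sketch (using NumPy, with an illustrative squared-error gradient for a linear model; the dataset and batch size are made-up values) shows how the three variants differ only in how many samples feed each gradient estimate:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000,))           # toy inputs
y = 2 * X + 1                          # targets generated from y = 2x + 1

def grad(w, b, xs, ys):
    """Gradient of mean squared error for y_hat = w*x + b over the given samples."""
    err = (w * xs + b) - ys
    return 2 * np.mean(err * xs), 2 * np.mean(err)

w, b = 0.0, 0.0

# Full-batch GD: every sample contributes to one gradient estimate.
g_full = grad(w, b, X, y)

# Classic SGD: a single randomly chosen sample.
i = rng.integers(len(X))
g_sgd = grad(w, b, X[i:i+1], y[i:i+1])

# Minibatch GD: a small random subset averages out some of the noise.
idx = rng.choice(len(X), size=32, replace=False)
g_mini = grad(w, b, X[idx], y[idx])
```

All three estimates point roughly the same way on average; they differ in variance and in the cost of computing one step.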
The following illustration provides a clearer depiction of how increasing the sample size of training data reduces oscillations and “noise.”
Several other variants build on basic gradient descent by adding mechanisms to improve speed, stability and convergence.
Momentum-based methods, such as momentum gradient descent and Nesterov accelerated gradient (NAG), help SGD converge faster and with less oscillation by accumulating momentum in dimensions with consistent gradients and dampening updates in dimensions whose gradients change direction.2
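One common formulation of the momentum update (following the notation of Ruder's overview, with momentum coefficient $\gamma$, typically around 0.9) is:

$$v_t = \gamma\, v_{t-1} + \alpha\, \nabla_\theta L(\theta), \qquad \theta \leftarrow \theta - v_t$$

The velocity term $v_t$ accumulates past gradients, so directions that agree step after step build up speed, while oscillating directions cancel out.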
Adaptive learning rate methods, such as AdaGrad and RMSProp, are unique in that they adapt the learning rate for each parameter individually. This approach is in contrast to SGD methods, which use a fixed learning rate for all parameters.
AdaGrad (adaptive gradient algorithm): Adapts the learning rate for each parameter based on its accumulated past gradients. Features that appear less often receive higher learning rates, and frequent features receive lower rates, so infrequent features are learned more quickly than with SGD. This adaptivity makes AdaGrad well suited to natural language processing (NLP) and recommendation systems with sparse data, in which there is a large discrepancy in feature frequency.2
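In symbols, with $g_t$ the current gradient for a parameter, $G_t$ the running sum of its squared gradients, and a small $\epsilon$ for numerical stability, the per-parameter AdaGrad update is:

$$G_t = G_{t-1} + g_t^2, \qquad \theta \leftarrow \theta - \frac{\alpha}{\sqrt{G_t + \epsilon}}\, g_t$$

Because $G_t$ only grows, the effective learning rate shrinks over time; this is the diminishing learning rate problem that RMSProp addresses next.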
RMSProp (root mean square propagation): Another adaptive learning rate optimization technique, which scales the learning rate for each parameter by using a moving average of recent squared gradients. Older gradients decay exponentially, so recent gradient information dominates.4 The effective learning rate becomes larger for parameters with small gradients and smaller for those with large gradients. This design mitigates AdaGrad's diminishing learning rate problem. RMSProp helps keep training stable in deep learning, especially for models like recurrent neural networks (RNNs), and it works well on problems where the objective keeps changing, such as in reinforcement learning.
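A standard statement of the RMSProp update, with decay rate $\gamma$ (often 0.9):

$$E[g^2]_t = \gamma\, E[g^2]_{t-1} + (1-\gamma)\, g_t^2, \qquad \theta \leftarrow \theta - \frac{\alpha}{\sqrt{E[g^2]_t + \epsilon}}\, g_t$$

Replacing AdaGrad's ever-growing sum with this moving average keeps the denominator from blowing up, so the learning rate no longer vanishes over long training runs.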
SGD and other GD variants are useful when training time is the bottleneck.5
| Variant | Data used per step | Key feature | Common use |
|---|---|---|---|
| GD | All data | Stable but slow | Small datasets |
| SGD | One sample (classic SGD) | Noisy but fast | Online learning |
| Minibatch GD | Few samples | Balanced and scalable | Deep learning |
| Momentum | Batch/minibatch | Accelerates in consistent directions | Neural nets |
| NAG | Batch/minibatch | Look-ahead momentum | Faster convergence |
| AdaGrad | Minibatch | Adaptive learning rates | Sparse data |
| RMSProp | Minibatch | Fixes AdaGrad's decay | RNNs, deep nets |
| Adam | Minibatch | Momentum + RMSProp | Default choice today |
The goal of SGD is to find parameters $\theta$ that make our model's predictions $\hat{y}_i$ as close as possible to the true values $y_i$. In other words, we want to minimize the loss function, $L(\theta)$.
In the case of linear regression, those parameters are $w$ (weight) and $b$ (bias). So in this case, minimizing $L(\theta)$ is the same as minimizing $L(w, b)$.
A commonly used analogy when teaching gradient descent is that GD is like walking downhill on a mountain until you reach a valley (the minimum loss). The gradient of the loss function, $\nabla_\theta L$, points uphill, so to go downhill, we must step in the opposite direction.
The general update rule for a parameter $\theta$ is:

$$\theta \leftarrow \theta - \alpha\, \nabla_\theta L$$

where $\alpha$ is the learning rate and $\nabla_\theta L$ is the gradient of the loss with respect to $\theta$.
SGD uses just one randomly chosen sample $(x_i, y_i)$ to approximate the gradient:

$$\theta \leftarrow \theta - \alpha\, \nabla_\theta \ell_i$$

Note that lowercase $\ell_i$ represents the loss of a single training example, whereas uppercase $L$ is the overall loss function, the average of all individual losses across the dataset: $L = \frac{1}{n} \sum_{i=1}^{n} \ell_i$. This global error is what we're really trying to minimize in training.
Let’s finish walking through the example of linear regression with SGD.
For one sample $(x_i, y_i)$, the prediction is:

$$\hat{y}_i = w x_i + b$$
The local loss is the squared error for one sample:

$$\ell_i = (\hat{y}_i - y_i)^2$$
Now during backpropagation, the model's parameters are updated by using the chain rule, which computes the gradients of the loss function with respect to each parameter.5 The gradients (derivatives) are:

$$\frac{\partial \ell_i}{\partial w} = 2(\hat{y}_i - y_i)\, x_i, \qquad \frac{\partial \ell_i}{\partial b} = 2(\hat{y}_i - y_i)$$
With SGD, we update each of these parameters, $w$ and $b$, by using the following rules:

$$w \leftarrow w - \alpha\, \frac{\partial \ell_i}{\partial w}, \qquad b \leftarrow b - \alpha\, \frac{\partial \ell_i}{\partial b}$$
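To make this concrete, here is a worked update with illustrative numbers: start from $w = 0$, $b = 0$, draw the sample $(x_i, y_i) = (2, 5)$ and use $\alpha = 0.1$:

$$\hat{y}_i = 0 \cdot 2 + 0 = 0, \qquad \frac{\partial \ell_i}{\partial w} = 2(0 - 5)(2) = -20, \qquad \frac{\partial \ell_i}{\partial b} = 2(0 - 5) = -10$$

$$w \leftarrow 0 - 0.1(-20) = 2, \qquad b \leftarrow 0 - 0.1(-10) = 1$$

One noisy step already moves the parameters toward values that fit this particular sample; averaged over many random samples, the updates drift toward values that fit the whole dataset.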
Instead of calculating a heavy average gradient across the entire dataset, SGD uses a lightweight random estimate.
When working with machine learning frameworks, you can use built-in SGD optimizer classes instead of writing the update loop yourself.
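For example, here is a minimal sketch using PyTorch's torch.optim.SGD on a tiny linear model (the model, data point and learning rate are illustrative choices):

```python
import torch
import torch.nn as nn

model = nn.Linear(1, 1)                       # y_hat = w*x + b
loss_fn = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

x = torch.tensor([[2.0]])                     # one training sample
y = torch.tensor([[5.0]])

optimizer.zero_grad()                         # clear gradients from the previous step
loss = loss_fn(model(x), y)                   # forward pass: compute the loss
loss.backward()                               # backpropagation: compute gradients
optimizer.step()                              # update: theta <- theta - lr * grad
```

Passing a momentum argument (for example, momentum=0.9) turns this into momentum SGD, and in practice the data would arrive in minibatches from a data loader.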
For learning purposes, let’s walk through a simple Python implementation of SGD from scratch.
To reiterate, our objective is to find the best parameters (model weights) that minimize the loss function (a measure of how wrong our predictions are). We will update the parameters by using one sample at a time, or a very small batch.
To start, we initialize the parameter values (weights) randomly. Next, we select a random data point $(x_i, y_i)$ and compute the prediction and the error. For this simple demonstration, let's fit a line: $\hat{y} = wx + b$. The next step is backpropagation, in which the gradients of the loss function are computed with respect to the parameters. Because the gradient points in the direction of increase of the loss function, SGD subtracts each gradient from its respective current parameter value; we can think of this as moving in the opposite direction of the gradient to decrease the loss, hence the "descent" in stochastic gradient descent. We repeat these steps either for a fixed number of epochs or until the loss falls below a tolerance. The latter means the loss is hardly changing and we are no longer improving the objective function; in other words, we stop once the algorithm converges.
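Here is a minimal from-scratch sketch of these steps. The toy data, learning rate and tolerance are illustrative choices; because the data come from $y = 2x + 1$, we expect $w \approx 2$ and $b \approx 1$:

```python
import random

# Toy dataset generated from y = 2x + 1 (illustrative values).
data = [(x, 2 * x + 1) for x in [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]]

w, b = random.random(), random.random()  # random initialization
alpha = 0.01                             # learning rate
tolerance = 1e-6                         # stop when the loss is this small
epochs = 1000                            # or after a fixed number of passes

for epoch in range(epochs):
    random.shuffle(data)                 # visit samples in random order
    epoch_loss = 0.0
    for x_i, y_i in data:
        y_hat = w * x_i + b              # prediction for one sample
        error = y_hat - y_i
        epoch_loss += error ** 2

        # Gradients of the squared error for this single sample
        grad_w = 2 * error * x_i
        grad_b = 2 * error

        # Step opposite the gradient: the "descent"
        w -= alpha * grad_w
        b -= alpha * grad_b

    epoch_loss /= len(data)
    if epoch_loss < tolerance:           # converged: loss barely changing
        break

print(f"w = {w:.3f}, b = {b:.3f}")       # should approach w = 2, b = 1
```

Swapping the inner loop to operate on small random batches (and averaging their gradients) turns this into minibatch gradient descent.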
SGD is the most common optimization method for training deep neural networks. In deep learning, a subset of machine learning within the broader field of data science, the objective is for computers to simulate the complex decision-making power of the human brain. Traditional ML models use simple neural networks consisting of one or two layers, whereas deep learning models use three or more layers, often hundreds or even thousands. Given how easily SGD scales to large training sets, it is often the go-to approach for training deep neural networks. Other applications of SGD include ridge regression, regularized logistic regression and the optimization of the hinge loss function used in support vector machines (SVMs) with a linear kernel.
SGD is a variant of GD that minimizes a machine learning model's loss function by using a single data sample at a time, unlike GD, which depends on the entire dataset at each step to compute the gradient. Several other GD variants can be grouped as momentum-based or adaptive learning methods. Momentum gradient descent and Nesterov accelerated gradient are examples of the former: they accumulate momentum in dimensions with consistent gradients and dampen updates in dimensions with changing gradients, helping SGD converge faster and with less oscillation. Adaptive learning rate methods such as AdaGrad and RMSProp adapt the learning rate for each parameter individually, unlike traditional SGD, which uses a fixed learning rate. In addition, hybrid methods like Adam offer a powerful alternative by combining the strengths of momentum-based GD and RMSProp.
1 Bottou, L. (2010). Large-Scale Machine Learning with Stochastic Gradient Descent. Lechevallier, Y., Saporta, G. (eds) Proceedings of COMPSTAT’2010. Physica-Verlag HD.
2 Ruder, S. (2016). An overview of gradient descent optimization algorithms.
3 Tian, Y., Zhang, Y., & Zhang, H. (2023). Recent Advances in Stochastic Gradient Descent in Deep Learning. Mathematics, 11(3), 682.
4 Haji, S. H., & Abdulazeez, A. M. (2021). Comparison of optimization techniques based on gradient descent algorithm: A review. PalArch’s Journal of Archaeology of Egypt/Egyptology, 18(4), 2715-2743.
5 Bottou, L. (2012). Stochastic Gradient Descent Tricks. Montavon, G., Orr, G.B., Müller, KR. (eds) Neural Networks: Tricks of the Trade. Lecture Notes in Computer Science, vol 7700. Springer, Berlin, Heidelberg.