Stochastic gradient descent (SGD) is an optimization algorithm commonly used to improve the performance of machine learning models. It is a variant of the traditional gradient descent algorithm, with a key modification: instead of relying on the entire dataset to compute the gradient at each step, SGD uses a single data sample at a time.
Gradient descent (GD) is an optimization algorithm that iteratively minimizes an objective function. In the context of machine learning (ML), gradient descent is fundamental to improving the performance of supervised learning models during their training phase. Machine learning models, like neural networks, are complex, nonlinear and high-dimensional. Unlike linear regression, these models have no closed-form normal equation for computing the optimal weights. Instead, iterative approximation methods such as gradient descent variants, Newton's method and expectation maximization are used, among others.
Every model has a loss function, sometimes called a cost function. This function measures how far a model's predictions are from the true data points; think of it as a measure of how "wrong" the model's predictions are. For example, the mean-squared error often serves as the loss function in regression problems. The model training phase is designed to find the parameter values that minimize this loss, which is why gradient descent is often the optimization technique used in training. The algorithm computes the gradient, or slope, of the loss with respect to the model's parameters. With this gradient, it then takes a step in the opposite direction to reduce the loss. The learning rate (also referred to as the step size or alpha) controls the size of these steps, and it remains fixed for all model parameters. This process repeats until the model converges near a minimum.
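As a concrete single-step example with illustrative numbers: suppose a parameter $\theta$ currently sits where the gradient of the loss is $\nabla_\theta L = 2.0$ and the learning rate is $\alpha = 0.1$. Then:

$$\theta_{\text{new}} = \theta - \alpha \nabla_\theta L = \theta - 0.1 \times 2.0 = \theta - 0.2$$

The parameter moves a small step in the direction that decreases the loss; repeating this update drives the loss downhill.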
Convergence ideally occurs at the global minimum. In the following visualization, you can see that a local minimum is a point where the loss value is lower than in its immediate surroundings, but not necessarily the lowest value overall. The global minimum is the absolute lowest value of the loss function across its entire domain, representing the best possible solution for the problem.
If the learning rate is poorly chosen, the algorithm can overshoot minima or settle in a local minimum instead of the global one. A well-chosen rate is essential for minimizing the loss function and giving the algorithm its best chance of converging at a global minimum.
This visualization depicts the effect of the learning rate on convergence. A small learning rate leads to slow but stable convergence (left), while a large learning rate might cause overshooting and instability (right).
The key differentiator between traditional gradient descent and stochastic gradient descent is that SGD updates model weights by using a single training example at a time, picked at random at each iteration.1 Gradient descent uses the entire training dataset to compute the gradient before each parameter update. This difference in data usage is what makes SGD much less computationally expensive per step and easier to scale to large datasets. Conversely, the convergence behavior of SGD is noisier than that of GD, because a single data point might not be a good representation of the dataset; a poor sample can push the parameters in a slightly "wrong" direction. However, this randomness is also what makes SGD faster and sometimes better for nonconvex optimization problems, because it can escape shallow local minima or saddle points.
Strictly speaking, SGD was originally defined to update parameters by using exactly one training sample at a time. In modern usage, the term "SGD" is used loosely to mean minibatch gradient descent, a variant of GD in which small batches of training data are used at a time. The major advantage of using subsets of data rather than a single sample is a lower noise level, because the gradient is the average of the losses over the minibatch. For this reason, minibatch gradient descent is the default in deep learning, and strict SGD is rarely used in practice. Most machine learning libraries, such as PyTorch and TensorFlow, conflate these terms: their optimizers are called "SGD," even though they typically operate on minibatches.
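The following sketch (using NumPy, with an illustrative squared-error gradient for a linear model; the dataset and batch size are made-up values) shows how the three variants differ only in how many samples feed each gradient estimate:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000,))           # toy inputs
y = 2 * X + 1                          # targets generated from y = 2x + 1

def grad(w, b, xs, ys):
    """Gradient of mean squared error for y_hat = w*x + b over the given samples."""
    err = (w * xs + b) - ys
    return 2 * np.mean(err * xs), 2 * np.mean(err)

w, b = 0.0, 0.0

# Full-batch GD: every sample contributes to one gradient estimate.
g_full = grad(w, b, X, y)

# Classic SGD: a single randomly chosen sample.
i = rng.integers(len(X))
g_sgd = grad(w, b, X[i:i+1], y[i:i+1])

# Minibatch GD: a small random subset averages out some of the noise.
idx = rng.choice(len(X), size=32, replace=False)
g_mini = grad(w, b, X[idx], y[idx])
```

All three estimates point roughly the same way on average; they differ in variance and in the cost of computing one step.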
The following illustration provides a clearer depiction of how increasing the sample size of training data reduces oscillations and “noise.”
Several other variants build on basic gradient descent by adding mechanisms to improve speed, stability and convergence.
Momentum-based methods, such as momentum gradient descent and Nesterov accelerated gradient (NAG), help SGD converge faster and with less oscillation by accumulating momentum in dimensions with consistent gradients and dampening updates in dimensions whose gradients change direction.2
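One common formulation of the momentum update (following the notation of Ruder's overview, with momentum coefficient $\gamma$, typically around 0.9) is:

$$v_t = \gamma\, v_{t-1} + \alpha\, \nabla_\theta L(\theta), \qquad \theta \leftarrow \theta - v_t$$

The velocity term $v_t$ accumulates past gradients, so directions that agree step after step build up speed, while oscillating directions cancel out.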
Adaptive learning rate methods, such as AdaGrad and RMSProp, are unique in that they adapt the learning rate for each parameter individually. This approach is in contrast to SGD methods, which use a fixed learning rate for all parameters.
AdaGrad (adaptive gradient algorithm): Adapts the learning rate for each parameter based on its accumulated past gradients. Features that appear less often receive higher learning rates, and frequent features receive lower rates, so infrequent features are learned more quickly than with SGD. This adaptivity makes AdaGrad well suited to natural language processing (NLP) and recommendation systems with sparse data, in which there is a large discrepancy in feature frequency.2
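In symbols, with $g_t$ the current gradient for a parameter, $G_t$ the running sum of its squared gradients, and a small $\epsilon$ for numerical stability, the per-parameter AdaGrad update is:

$$G_t = G_{t-1} + g_t^2, \qquad \theta \leftarrow \theta - \frac{\alpha}{\sqrt{G_t + \epsilon}}\, g_t$$

Because $G_t$ only grows, the effective learning rate shrinks over time; this is the diminishing learning rate problem that RMSProp addresses next.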
RMSProp (root mean square propagation): Another adaptive learning rate optimization technique, which scales the learning rate for each parameter by using a moving average of recent squared gradients. Older gradients decay exponentially, so recent gradient information dominates.4 The effective learning rate becomes larger for parameters with small gradients and smaller for those with large gradients. This design mitigates AdaGrad's diminishing learning rate problem. RMSProp helps keep training stable in deep learning, especially for models like recurrent neural networks (RNNs), and it works well on problems where the objective keeps changing, such as in reinforcement learning.
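A standard statement of the RMSProp update, with decay rate $\gamma$ (often 0.9):

$$E[g^2]_t = \gamma\, E[g^2]_{t-1} + (1-\gamma)\, g_t^2, \qquad \theta \leftarrow \theta - \frac{\alpha}{\sqrt{E[g^2]_t + \epsilon}}\, g_t$$

Replacing AdaGrad's ever-growing sum with this moving average keeps the denominator from blowing up, so the learning rate no longer vanishes over long training runs.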
SGD and other GD variants are useful when training time is the bottleneck.5
| Variant | Data used per step | Key feature | Common use |
|---|---|---|---|
| GD | All data | Stable but slow | Small datasets |
| SGD | One sample (classic SGD) | Noisy but fast | Online learning |
| Minibatch GD | Few samples | Balanced and scalable | Deep learning |
| Momentum | Batch/minibatch | Accelerates in consistent directions | Neural nets |
| NAG | Batch/minibatch | Look-ahead momentum | Faster convergence |
| AdaGrad | Minibatch | Adaptive learning rates | Sparse data |
| RMSProp | Minibatch | Fixes AdaGrad's decay | RNNs, deep nets |
| Adam | Minibatch | Momentum + RMSProp | Default choice today |
The goal of SGD is to find parameters $\theta$ that make our model's predictions $\hat{y}_i$ as close as possible to the true values $y_i$. In other words, we want to minimize the loss function, $L(\theta)$.
In the case of linear regression, those parameters are $w$ (weight) and $b$ (bias). So in this case, minimizing $L(\theta)$ is the same as minimizing $L(w, b)$.
A commonly used analogy when teaching gradient descent is that GD is like walking downhill on a mountain until you reach a valley (the minimum loss). The gradient of the loss function, $\nabla_\theta L$, points uphill, so to go downhill, we must step in the opposite direction.
The general update rule for a parameter $\theta$ is:

$$\theta \leftarrow \theta - \alpha\, \nabla_\theta L$$

where $\alpha$ is the learning rate and $\nabla_\theta L$ is the gradient of the loss with respect to $\theta$.
SGD uses just one randomly chosen sample $(x_i, y_i)$ to approximate the gradient:

$$\theta \leftarrow \theta - \alpha\, \nabla_\theta \ell_i$$

Note that lowercase $\ell_i$ represents the loss of a single training example, whereas uppercase $L$ is the overall loss function, the average of all individual losses across the dataset: $L = \frac{1}{n} \sum_{i=1}^{n} \ell_i$. This global error is what we're really trying to minimize in training.
Let’s finish walking through the example of linear regression with SGD.
For one sample $(x_i, y_i)$, the prediction is:

$$\hat{y}_i = w x_i + b$$
The local loss is the squared error for one sample:

$$\ell_i = (\hat{y}_i - y_i)^2$$
Now during backpropagation, the model's parameters are updated by using the chain rule, which computes the gradients of the loss function with respect to each parameter.5 The gradients (derivatives) are:

$$\frac{\partial \ell_i}{\partial w} = 2(\hat{y}_i - y_i)\, x_i, \qquad \frac{\partial \ell_i}{\partial b} = 2(\hat{y}_i - y_i)$$
With SGD, we update each of these parameters, $w$ and $b$, by using the following rules:

$$w \leftarrow w - \alpha\, \frac{\partial \ell_i}{\partial w}, \qquad b \leftarrow b - \alpha\, \frac{\partial \ell_i}{\partial b}$$
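To make this concrete, here is a worked update with illustrative numbers: start from $w = 0$, $b = 0$, draw the sample $(x_i, y_i) = (2, 5)$ and use $\alpha = 0.1$:

$$\hat{y}_i = 0 \cdot 2 + 0 = 0, \qquad \frac{\partial \ell_i}{\partial w} = 2(0 - 5)(2) = -20, \qquad \frac{\partial \ell_i}{\partial b} = 2(0 - 5) = -10$$

$$w \leftarrow 0 - 0.1(-20) = 2, \qquad b \leftarrow 0 - 0.1(-10) = 1$$

One noisy step already moves the parameters toward values that fit this particular sample; averaged over many random samples, the updates drift toward values that fit the whole dataset.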
Instead of calculating a heavy average gradient across the entire dataset, SGD uses a lightweight random estimate.
When working with machine learning frameworks, you can use built-in SGD optimizer classes instead of writing the update loop yourself.
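For example, here is a minimal sketch using PyTorch's torch.optim.SGD on a tiny linear model (the model, data point and learning rate are illustrative choices):

```python
import torch
import torch.nn as nn

model = nn.Linear(1, 1)                       # y_hat = w*x + b
loss_fn = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

x = torch.tensor([[2.0]])                     # one training sample
y = torch.tensor([[5.0]])

optimizer.zero_grad()                         # clear gradients from the previous step
loss = loss_fn(model(x), y)                   # forward pass: compute the loss
loss.backward()                               # backpropagation: compute gradients
optimizer.step()                              # update: theta <- theta - lr * grad
```

Passing a momentum argument (for example, momentum=0.9) turns this into momentum SGD, and in practice the data would arrive in minibatches from a data loader.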
For learning purposes, let’s walk through a simple Python implementation of SGD from scratch.
To reiterate, our objective is to find the best parameters (model weights) that minimize the loss function (a measure of how wrong our predictions are). We will update the parameters by using one sample at a time, or a very small batch.
To start, we initialize the parameter values (weights) randomly. Next, we select a random data point $(x_i, y_i)$ and compute the prediction and the error. For this simple demonstration, let's fit a line: $\hat{y} = wx + b$. The next step is backpropagation, in which the gradients of the loss function are computed with respect to the parameters. Because the gradient points in the direction of increase of the loss function, SGD subtracts each gradient from its respective current parameter value; we can think of this as moving in the opposite direction of the gradient to decrease the loss, hence the "descent" in stochastic gradient descent. We repeat these steps either for a fixed number of epochs or until the loss falls below a tolerance. The latter means the loss is hardly changing and we are no longer improving the objective function; in other words, we stop once the algorithm converges.
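Here is a minimal from-scratch sketch of these steps. The toy data, learning rate and tolerance are illustrative choices; because the data come from $y = 2x + 1$, we expect $w \approx 2$ and $b \approx 1$:

```python
import random

# Toy dataset generated from y = 2x + 1 (illustrative values).
data = [(x, 2 * x + 1) for x in [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]]

w, b = random.random(), random.random()  # random initialization
alpha = 0.01                             # learning rate
tolerance = 1e-6                         # stop when the loss is this small
epochs = 1000                            # or after a fixed number of passes

for epoch in range(epochs):
    random.shuffle(data)                 # visit samples in random order
    epoch_loss = 0.0
    for x_i, y_i in data:
        y_hat = w * x_i + b              # prediction for one sample
        error = y_hat - y_i
        epoch_loss += error ** 2

        # Gradients of the squared error for this single sample
        grad_w = 2 * error * x_i
        grad_b = 2 * error

        # Step opposite the gradient: the "descent"
        w -= alpha * grad_w
        b -= alpha * grad_b

    epoch_loss /= len(data)
    if epoch_loss < tolerance:           # converged: loss barely changing
        break

print(f"w = {w:.3f}, b = {b:.3f}")       # should approach w = 2, b = 1
```

Swapping the inner loop to operate on small random batches (and averaging their gradients) turns this into minibatch gradient descent.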
SGD is the most common optimization method for training deep neural networks. In deep learning, a subset of machine learning within the broader field of data science, the objective is for computers to simulate the complex decision-making power of the human brain. Traditional ML models use simple neural networks consisting of one or two layers, whereas deep learning models use three or more layers, often hundreds or even thousands. Given how easily SGD scales to large training sets, it is often the go-to approach for training deep neural networks. Other applications of SGD include ridge regression, regularized logistic regression and the optimization of the hinge loss function used in support vector machines (SVMs) with a linear kernel.
SGD is a variant of GD that minimizes a machine learning model's loss function by using a single data sample at a time, unlike GD, which depends on the entire dataset at each step to compute the gradient. Several other GD variants can be grouped as momentum-based or adaptive learning methods. Momentum gradient descent and Nesterov accelerated gradient are examples of the former: they accumulate momentum in dimensions with consistent gradients and dampen updates in dimensions with changing gradients, helping SGD converge faster and with less oscillation. Adaptive learning rate methods such as AdaGrad and RMSProp adapt the learning rate for each parameter individually, unlike traditional SGD, which uses a fixed learning rate. In addition, hybrid methods like Adam offer a powerful alternative by combining the strengths of momentum-based GD and RMSProp.
1 Bottou, L. (2010). Large-Scale Machine Learning with Stochastic Gradient Descent. Lechevallier, Y., Saporta, G. (eds) Proceedings of COMPSTAT’2010. Physica-Verlag HD.
2 Ruder, S. (2016). An overview of gradient descent optimization algorithms.
3 Tian, Y., Zhang, Y., & Zhang, H. (2023). Recent Advances in Stochastic Gradient Descent in Deep Learning. Mathematics, 11(3), 682.
4 Haji, S. H., & Abdulazeez, A. M. (2021). Comparison of optimization techniques based on gradient descent algorithm: A review. PalArch’s Journal of Archaeology of Egypt/Egyptology, 18(4), 2715-2743.
5 Bottou, L. (2012). Stochastic Gradient Descent Tricks. Montavon, G., Orr, G.B., Müller, KR. (eds) Neural Networks: Tricks of the Trade. Lecture Notes in Computer Science, vol 7700. Springer, Berlin, Heidelberg.