Gradient Descent & Optimization: From SGD to Adam
Optimization is the heart of machine learning. In this guide, we'll explore how gradient descent works and the modern variants that make deep learning possible.
Table of Contents
- Vanilla Gradient Descent
- Stochastic Gradient Descent (SGD)
- Momentum
- RMSprop
- Adam Optimizer
- Learning Rate Schedules
Vanilla Gradient Descent
The fundamental idea: move in the direction of steepest descent.
Update Rule

$$\theta \leftarrow \theta - \eta \, \nabla_\theta J(\theta)$$

Where:
- $\theta$ represents all parameters
- $\eta$ is the learning rate
- $\nabla_\theta J(\theta)$ is the gradient of the loss with respect to the parameters
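To see the rule in action, here is a minimal sketch (a toy example of my own, not the guide's linear-regression setup) that minimizes $f(\theta) = \theta^2$, whose gradient is $2\theta$:

```python
theta = 5.0          # initial parameter
learning_rate = 0.1  # eta

for step in range(50):
    gradient = 2 * theta                      # d/d(theta) of theta**2
    theta = theta - learning_rate * gradient  # the update rule

print(theta)  # close to 0, the minimizer of f
```

Each step multiplies $\theta$ by $(1 - 2\eta)$, so with $\eta = 0.1$ the parameter shrinks geometrically toward the minimum.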
Types of Gradient Descent
- Batch Gradient Descent: Uses entire dataset per update
- Stochastic Gradient Descent: Uses one sample per update
- Mini-batch Gradient Descent: Uses small batches (most common)
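The batch variant is implemented just below. Since mini-batch gradient descent is the variant most training loops actually use (the momentum, RMSprop, and Adam examples later in this guide all use batches of 32), here is a minimal sketch of it on its own (my own illustration, in the same linear-regression setting):

```python
import numpy as np

def minibatch_gradient_descent(X, Y, theta, learning_rate, epochs, batch_size=32):
    """Plain mini-batch gradient descent for linear regression (sketch)."""
    m = len(Y)

    for epoch in range(epochs):
        indices = np.random.permutation(m)  # reshuffle every epoch

        for i in range(0, m, batch_size):
            batch_idx = indices[i:i+batch_size]
            X_batch, Y_batch = X[batch_idx], Y[batch_idx]

            # Gradient averaged over the current mini-batch only
            gradient = (1/len(batch_idx)) * X_batch.T @ (X_batch @ theta - Y_batch)
            theta = theta - learning_rate * gradient

    return theta
```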
```python
import numpy as np

def batch_gradient_descent(X, Y, theta, learning_rate, epochs):
    """
    Batch gradient descent for linear regression.
    """
    m = len(Y)

    for epoch in range(epochs):
        # Compute predictions
        predictions = X @ theta

        # Compute gradient
        gradient = (1/m) * X.T @ (predictions - Y)

        # Update parameters
        theta = theta - learning_rate * gradient

        # Compute cost (using the pre-update predictions)
        cost = (1/(2*m)) * np.sum((predictions - Y) ** 2)

        if epoch % 100 == 0:
            print(f"Epoch {epoch}: Cost = {cost:.4f}")

    return theta
```
Stochastic Gradient Descent
SGD updates parameters after each training example, introducing noise that can help escape local minima.
Update Rule

For sample $i$:

$$\theta \leftarrow \theta - \eta \, \nabla_\theta J\big(\theta;\, x^{(i)}, y^{(i)}\big)$$
Advantages
- Faster convergence for large datasets
- Can escape local minima
- Online learning capable
Disadvantages
- High variance in updates
- May never settle at minimum
- Requires careful learning rate tuning
```python
def sgd(X, Y, theta, learning_rate, epochs):
    """
    Stochastic Gradient Descent with shuffling.
    """
    m = len(Y)

    for epoch in range(epochs):
        # Shuffle data
        indices = np.random.permutation(m)
        X_shuffled = X[indices]
        Y_shuffled = Y[indices]

        for i in range(m):
            xi = X_shuffled[i:i+1]
            yi = Y_shuffled[i:i+1]

            # Compute gradient for a single sample
            gradient = xi.T @ (xi @ theta - yi)

            # Update
            theta = theta - learning_rate * gradient

    return theta
```
Momentum
Momentum accelerates SGD by accumulating a velocity vector in the direction of persistent gradients.
Update Rules

$$v_t = \gamma\, v_{t-1} + \nabla_\theta J(\theta)$$
$$\theta \leftarrow \theta - \eta\, v_t$$

Where:
- $v_t$ is the velocity (accumulated gradient)
- $\gamma$ is the momentum coefficient (typically 0.9)
Intuition
Think of a ball rolling down a hill:
- Accelerates in consistent directions
- Dampens oscillations in inconsistent directions
- Can roll past shallow local minima
```python
def sgd_momentum(X, Y, theta, learning_rate, momentum, epochs):
    """
    SGD with Momentum.
    """
    velocity = np.zeros_like(theta)
    m = len(Y)

    for epoch in range(epochs):
        indices = np.random.permutation(m)

        for i in range(0, m, 32):  # Mini-batch size 32
            batch_idx = indices[i:i+32]
            X_batch = X[batch_idx]
            Y_batch = Y[batch_idx]

            # Compute gradient
            gradient = (1/len(batch_idx)) * X_batch.T @ (X_batch @ theta - Y_batch)

            # Update velocity
            velocity = momentum * velocity + gradient

            # Update parameters
            theta = theta - learning_rate * velocity

    return theta
```
RMSprop
RMSprop adapts learning rates for each parameter based on the magnitude of recent gradients.
Update Rules

$$E[g^2]_t = \rho\, E[g^2]_{t-1} + (1 - \rho)\, g_t^2$$
$$\theta \leftarrow \theta - \frac{\eta}{\sqrt{E[g^2]_t} + \epsilon}\, g_t$$

Where:
- $E[g^2]_t$ is the exponential moving average of squared gradients
- $\rho$ is the decay rate (typically 0.9)
- $\epsilon$ is a small constant for numerical stability ($\approx 10^{-8}$)
Benefits
- Handles different gradient magnitudes
- Works well with non-stationary objectives
- Adapts to the geometry of the problem
```python
def rmsprop(X, Y, theta, learning_rate, decay_rate, epochs):
    """
    RMSprop optimizer.
    """
    cache = np.zeros_like(theta)
    epsilon = 1e-8
    m = len(Y)

    for epoch in range(epochs):
        indices = np.random.permutation(m)

        for i in range(0, m, 32):
            batch_idx = indices[i:i+32]
            X_batch = X[batch_idx]
            Y_batch = Y[batch_idx]

            # Compute gradient
            gradient = (1/len(batch_idx)) * X_batch.T @ (X_batch @ theta - Y_batch)

            # Update cache (exponential moving average of squared gradients)
            cache = decay_rate * cache + (1 - decay_rate) * gradient ** 2

            # Update parameters
            theta = theta - learning_rate * gradient / (np.sqrt(cache) + epsilon)

    return theta
```
Adam Optimizer
Adam (Adaptive Moment Estimation) combines momentum and RMSprop with bias correction.
Update Rules

Compute moments:

$$m_t = \beta_1 m_{t-1} + (1 - \beta_1)\, g_t$$
$$v_t = \beta_2 v_{t-1} + (1 - \beta_2)\, g_t^2$$

Bias correction (because $m_0 = v_0 = 0$, the raw moments are biased toward zero during the first steps; dividing by $1 - \beta^t$ removes that bias):

$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}$$

Update parameters:

$$\theta \leftarrow \theta - \frac{\eta\, \hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}$$

Recommended Hyperparameters
- $\eta = 0.001$ (learning rate)
- $\beta_1 = 0.9$ (first moment decay)
- $\beta_2 = 0.999$ (second moment decay)
- $\epsilon = 10^{-8}$ (numerical stability)
```python
class Adam:
    """
    Adam optimizer implementation.
    """
    def __init__(self, learning_rate=0.001, beta1=0.9, beta2=0.999, epsilon=1e-8):
        self.lr = learning_rate
        self.beta1 = beta1
        self.beta2 = beta2
        self.epsilon = epsilon
        self.m = None
        self.v = None
        self.t = 0

    def update(self, theta, gradient):
        if self.m is None:
            self.m = np.zeros_like(theta)
            self.v = np.zeros_like(theta)

        self.t += 1

        # Update biased first moment estimate
        self.m = self.beta1 * self.m + (1 - self.beta1) * gradient

        # Update biased second raw moment estimate
        self.v = self.beta2 * self.v + (1 - self.beta2) * gradient ** 2

        # Compute bias-corrected first moment estimate
        m_hat = self.m / (1 - self.beta1 ** self.t)

        # Compute bias-corrected second raw moment estimate
        v_hat = self.v / (1 - self.beta2 ** self.t)

        # Update parameters
        theta = theta - self.lr * m_hat / (np.sqrt(v_hat) + self.epsilon)

        return theta


def train_with_adam(X, Y, theta, epochs, batch_size=32):
    """
    Training loop with Adam optimizer.
    """
    optimizer = Adam(learning_rate=0.001)
    m = len(Y)
    costs = []

    for epoch in range(epochs):
        indices = np.random.permutation(m)

        for i in range(0, m, batch_size):
            batch_idx = indices[i:i+batch_size]
            X_batch = X[batch_idx]
            Y_batch = Y[batch_idx]

            # Compute gradient
            gradient = (1/len(batch_idx)) * X_batch.T @ (X_batch @ theta - Y_batch)

            # Update with Adam
            theta = optimizer.update(theta, gradient)

        # Compute cost on the full dataset once per epoch
        cost = (1/(2*m)) * np.sum((X @ theta - Y) ** 2)
        costs.append(cost)

        if epoch % 100 == 0:
            print(f"Epoch {epoch}: Cost = {cost:.4f}")

    return theta, costs
```
Learning Rate Schedules
The learning rate often needs to change during training.
Step Decay

Reduce the learning rate at specific epochs:

$$\eta_t = \eta_0 \cdot \gamma^{\lfloor t / s \rfloor}$$

Where $\gamma$ is the decay factor and $s$ is the step size (in epochs).

Exponential Decay

$$\eta_t = \eta_0 \cdot \gamma^{t}$$

Cosine Annealing

$$\eta_t = \eta_{\min} + \tfrac{1}{2}\,(\eta_0 - \eta_{\min})\left(1 + \cos\frac{\pi t}{T}\right)$$

Warm Restarts

Periodically reset the learning rate to its initial value to escape local minima.
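The scheduler class below implements the first three schedules plus warmup, but not warm restarts, so here is a minimal sketch of cosine annealing with warm restarts (my own illustration, assuming a fixed restart period rather than the growing periods used in SGDR):

```python
import numpy as np

def cosine_warm_restarts(initial_lr, epoch, restart_period=10, min_lr=0.0):
    """Cosine decay that jumps back to initial_lr every restart_period epochs."""
    t = epoch % restart_period  # position within the current cycle
    return min_lr + 0.5 * (initial_lr - min_lr) * (
        1 + np.cos(np.pi * t / restart_period)
    )

# Example: the LR decays from 0.1 and resets at epochs 10, 20, ...
lrs = [cosine_warm_restarts(0.1, e) for e in range(30)]
```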
```python
class LearningRateScheduler:
    """
    Various learning rate scheduling strategies.
    """

    @staticmethod
    def step_decay(initial_lr, epoch, drop_rate=0.5, epochs_drop=10):
        """Step decay: reduce LR by a factor every N epochs."""
        return initial_lr * (drop_rate ** (epoch // epochs_drop))

    @staticmethod
    def exponential_decay(initial_lr, epoch, decay_rate=0.95):
        """Exponential decay."""
        return initial_lr * (decay_rate ** epoch)

    @staticmethod
    def cosine_annealing(initial_lr, epoch, total_epochs, min_lr=0):
        """Cosine annealing from initial_lr down to min_lr."""
        return min_lr + 0.5 * (initial_lr - min_lr) * (
            1 + np.cos(np.pi * epoch / total_epochs)
        )

    @staticmethod
    def warmup_then_decay(initial_lr, epoch, warmup_epochs=5, total_epochs=100):
        """Linear warmup followed by cosine decay."""
        if epoch < warmup_epochs:
            return initial_lr * (epoch + 1) / warmup_epochs
        else:
            progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
            return initial_lr * 0.5 * (1 + np.cos(np.pi * progress))
```
Comparison Summary
| Optimizer | Pros | Cons | Best For |
|---|---|---|---|
| SGD | Simple, generalizes well | Slow, sensitive to LR | Convex problems |
| Momentum | Faster convergence | Extra hyperparameter | Most deep learning |
| RMSprop | Adapts per parameter | Can diverge | RNNs, non-stationary |
| Adam | Fast, robust | Memory overhead | Default choice |
Key Takeaways
- Adam is the default optimizer for most deep learning tasks
- Learning rate is the most important hyperparameter
- Momentum helps escape saddle points and smooth optimization
- Adaptive methods (RMSprop, Adam) handle sparse gradients well
- Learning rate schedules can significantly improve final performance
- Warmup helps stabilize early training
The choice of optimizer affects both training speed and final model quality. Start with Adam, but don't be afraid to try SGD with momentum for better generalization!
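As a rough illustration of that advice, here is a minimal sketch (my own toy comparison on synthetic linear-regression data, not part of the original material) that runs the `sgd_momentum` function and the Adam training loop defined above and compares their final costs:

```python
import numpy as np

# Assumes sgd_momentum and train_with_adam from this guide are in scope.
np.random.seed(0)
X = np.random.randn(500, 3)
true_theta = np.array([[2.0], [-1.0], [0.5]])
Y = X @ true_theta + 0.1 * np.random.randn(500, 1)

theta_sgd = sgd_momentum(X, Y, np.zeros((3, 1)),
                         learning_rate=0.01, momentum=0.9, epochs=200)
theta_adam, costs = train_with_adam(X, Y, np.zeros((3, 1)), epochs=200)

def cost(theta):
    """Mean squared error cost, matching the guide's convention."""
    return (1 / (2 * len(Y))) * np.sum((X @ theta - Y) ** 2)

print(f"SGD with momentum cost: {cost(theta_sgd):.6f}")
print(f"Adam cost:              {cost(theta_adam):.6f}")
```

On a well-conditioned toy problem like this, both reach a similar cost; the differences discussed above matter most on large, noisy, or poorly conditioned problems.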