
Gradient Descent & Optimization: From SGD to Adam

Master the optimization algorithms that power deep learning. Understand momentum, RMSprop, Adam, and learning rate schedules.

Optimization is the heart of machine learning. In this guide, we'll explore how gradient descent works and the modern variants that make deep learning possible.

Table of Contents

  1. Vanilla Gradient Descent
  2. Stochastic Gradient Descent (SGD)
  3. Momentum
  4. RMSprop
  5. Adam Optimizer
  6. Learning Rate Schedules

Vanilla Gradient Descent

The fundamental idea: move in the direction of steepest descent.

Update Rule

$$\theta := \theta - \alpha \nabla_\theta J(\theta)$$

Where:

  • $\theta$ represents all parameters
  • $\alpha$ is the learning rate
  • $\nabla_\theta J(\theta)$ is the gradient of the loss with respect to the parameters
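
To make the update rule concrete, here is a single step on the toy objective $J(\theta) = \theta^2$, whose gradient is $2\theta$ (a small illustration added here, not part of the original derivation):

```python
# One gradient-descent step on J(theta) = theta**2
theta = 1.0               # initial parameter
alpha = 0.1               # learning rate
grad = 2 * theta          # dJ/dtheta evaluated at theta = 1.0
theta = theta - alpha * grad
print(theta)              # 0.8 -- one step closer to the minimum at theta = 0
```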

Types of Gradient Descent

  1. Batch Gradient Descent: Uses entire dataset per update
  2. Stochastic Gradient Descent: Uses one sample per update
  3. Mini-batch Gradient Descent: Uses small batches (most common)
```python
import numpy as np

def batch_gradient_descent(X, Y, theta, learning_rate, epochs):
    """
    Batch gradient descent for linear regression.
    """
    m = len(Y)

    for epoch in range(epochs):
        # Compute predictions
        predictions = X @ theta

        # Compute gradient
        gradient = (1/m) * X.T @ (predictions - Y)

        # Update parameters
        theta = theta - learning_rate * gradient

        # Compute cost
        cost = (1/(2*m)) * np.sum((predictions - Y) ** 2)

        if epoch % 100 == 0:
            print(f"Epoch {epoch}: Cost = {cost:.4f}")

    return theta
```
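
The list above names mini-batch gradient descent as the most common choice; for comparison with the batch version, here is a minimal mini-batch sketch (the batch size is an arbitrary illustrative choice, and NumPy is assumed imported as above; the same pattern reappears in the momentum and RMSprop implementations later in the article):

```python
def minibatch_gradient_descent(X, Y, theta, learning_rate, epochs, batch_size=32):
    """
    Mini-batch gradient descent: each update uses a small random batch.
    """
    m = len(Y)

    for epoch in range(epochs):
        indices = np.random.permutation(m)

        for i in range(0, m, batch_size):
            batch_idx = indices[i:i+batch_size]
            X_batch = X[batch_idx]
            Y_batch = Y[batch_idx]

            # Gradient estimated from the current mini-batch only
            gradient = (1/len(batch_idx)) * X_batch.T @ (X_batch @ theta - Y_batch)

            theta = theta - learning_rate * gradient

    return theta
```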

Stochastic Gradient Descent

SGD updates parameters after each training example, introducing noise that can help escape local minima.

Update Rule

For sample $i$:

$$\theta := \theta - \alpha \nabla_\theta J(\theta; x^{(i)}, y^{(i)})$$

Advantages

  • Faster convergence for large datasets
  • Can escape local minima
  • Online learning capable

Disadvantages

  • High variance in updates
  • May never settle at minimum
  • Requires careful learning rate tuning
```python
def sgd(X, Y, theta, learning_rate, epochs):
    """
    Stochastic Gradient Descent with shuffling.
    """
    m = len(Y)

    for epoch in range(epochs):
        # Shuffle data
        indices = np.random.permutation(m)
        X_shuffled = X[indices]
        Y_shuffled = Y[indices]

        for i in range(m):
            xi = X_shuffled[i:i+1]
            yi = Y_shuffled[i:i+1]

            # Compute gradient for single sample
            gradient = xi.T @ (xi @ theta - yi)

            # Update
            theta = theta - learning_rate * gradient

    return theta
```

Momentum

Momentum accelerates SGD by accumulating a velocity vector in the direction of persistent gradients.

Update Rules

$$v_t = \beta v_{t-1} + \nabla_\theta J(\theta)$$

$$\theta := \theta - \alpha v_t$$

Where:

  • $v_t$ is the velocity (accumulated gradient)
  • $\beta$ is the momentum coefficient (typically 0.9)

Intuition

Think of a ball rolling down a hill:

  • Accelerates in consistent directions
  • Dampens oscillations in inconsistent directions
  • Can roll past shallow local minima
```python
def sgd_momentum(X, Y, theta, learning_rate, momentum, epochs):
    """
    SGD with Momentum.
    """
    velocity = np.zeros_like(theta)
    m = len(Y)

    for epoch in range(epochs):
        indices = np.random.permutation(m)

        for i in range(0, m, 32):  # Mini-batch size 32
            batch_idx = indices[i:i+32]
            X_batch = X[batch_idx]
            Y_batch = Y[batch_idx]

            # Compute gradient
            gradient = (1/len(batch_idx)) * X_batch.T @ (X_batch @ theta - Y_batch)

            # Update velocity
            velocity = momentum * velocity + gradient

            # Update parameters
            theta = theta - learning_rate * velocity

    return theta
```

RMSprop

RMSprop adapts learning rates for each parameter based on the magnitude of recent gradients.

Update Rules

$$s_t = \beta s_{t-1} + (1 - \beta)(\nabla_\theta J)^2$$

$$\theta := \theta - \frac{\alpha}{\sqrt{s_t + \epsilon}} \nabla_\theta J$$

Where:

  • $s_t$ is the exponential moving average of squared gradients
  • $\beta$ is the decay rate (typically 0.9)
  • $\epsilon$ is a small constant for numerical stability (typically $10^{-8}$)

Benefits

  • Handles different gradient magnitudes
  • Works well with non-stationary objectives
  • Adapts to the geometry of the problem
```python
def rmsprop(X, Y, theta, learning_rate, decay_rate, epochs):
    """
    RMSprop optimizer.
    """
    cache = np.zeros_like(theta)
    epsilon = 1e-8
    m = len(Y)

    for epoch in range(epochs):
        indices = np.random.permutation(m)

        for i in range(0, m, 32):
            batch_idx = indices[i:i+32]
            X_batch = X[batch_idx]
            Y_batch = Y[batch_idx]

            # Compute gradient
            gradient = (1/len(batch_idx)) * X_batch.T @ (X_batch @ theta - Y_batch)

            # Update cache (exponential moving average of squared gradients)
            cache = decay_rate * cache + (1 - decay_rate) * gradient ** 2

            # Update parameters
            theta = theta - learning_rate * gradient / (np.sqrt(cache) + epsilon)

    return theta
```

Adam Optimizer

Adam (Adaptive Moment Estimation) combines momentum and RMSprop with bias correction.

Update Rules

Compute moments:

$$m_t = \beta_1 m_{t-1} + (1 - \beta_1) \nabla_\theta J$$

$$v_t = \beta_2 v_{t-1} + (1 - \beta_2)(\nabla_\theta J)^2$$

Bias correction:

$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}$$

$$\hat{v}_t = \frac{v_t}{1 - \beta_2^t}$$

Update parameters:

$$\theta := \theta - \frac{\alpha \hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}$$

Typical default values:

  • $\alpha = 0.001$ (learning rate)
  • $\beta_1 = 0.9$ (first moment decay)
  • $\beta_2 = 0.999$ (second moment decay)
  • $\epsilon = 10^{-8}$ (numerical stability)
```python
class Adam:
    """
    Adam optimizer implementation.
    """
    def __init__(self, learning_rate=0.001, beta1=0.9, beta2=0.999, epsilon=1e-8):
        self.lr = learning_rate
        self.beta1 = beta1
        self.beta2 = beta2
        self.epsilon = epsilon
        self.m = None
        self.v = None
        self.t = 0

    def update(self, theta, gradient):
        if self.m is None:
            self.m = np.zeros_like(theta)
            self.v = np.zeros_like(theta)

        self.t += 1

        # Update biased first moment estimate
        self.m = self.beta1 * self.m + (1 - self.beta1) * gradient

        # Update biased second raw moment estimate
        self.v = self.beta2 * self.v + (1 - self.beta2) * gradient ** 2

        # Compute bias-corrected first moment estimate
        m_hat = self.m / (1 - self.beta1 ** self.t)

        # Compute bias-corrected second raw moment estimate
        v_hat = self.v / (1 - self.beta2 ** self.t)

        # Update parameters
        theta = theta - self.lr * m_hat / (np.sqrt(v_hat) + self.epsilon)

        return theta


def train_with_adam(X, Y, theta, epochs, batch_size=32):
    """
    Training loop with Adam optimizer.
    """
    optimizer = Adam(learning_rate=0.001)
    m = len(Y)
    costs = []

    for epoch in range(epochs):
        indices = np.random.permutation(m)

        for i in range(0, m, batch_size):
            batch_idx = indices[i:i+batch_size]
            X_batch = X[batch_idx]
            Y_batch = Y[batch_idx]

            # Compute gradient
            gradient = (1/len(batch_idx)) * X_batch.T @ (X_batch @ theta - Y_batch)

            # Update with Adam
            theta = optimizer.update(theta, gradient)

        # Compute cost
        cost = (1/(2*m)) * np.sum((X @ theta - Y) ** 2)
        costs.append(cost)

        if epoch % 100 == 0:
            print(f"Epoch {epoch}: Cost = {cost:.4f}")

    return theta, costs
```

Learning Rate Schedules

The learning rate often needs to change during training.

Step Decay

Reduce learning rate at specific epochs:

$$\alpha_t = \alpha_0 \cdot \gamma^{\lfloor t / s \rfloor}$$

Where $\gamma$ is the decay factor and $s$ is the step size.

Exponential Decay

$$\alpha_t = \alpha_0 \cdot e^{-kt}$$

Cosine Annealing

$$\alpha_t = \alpha_{min} + \frac{1}{2}(\alpha_{max} - \alpha_{min})\left(1 + \cos\left(\frac{t \cdot \pi}{T}\right)\right)$$

Warm Restarts

Periodically reset the learning rate to its initial value so the optimizer can escape poor local minima (see the sketch after the scheduler code below).

```python
class LearningRateScheduler:
    """
    Various learning rate scheduling strategies.
    """

    @staticmethod
    def step_decay(initial_lr, epoch, drop_rate=0.5, epochs_drop=10):
        """Step decay: reduce LR by a factor every N epochs."""
        return initial_lr * (drop_rate ** (epoch // epochs_drop))

    @staticmethod
    def exponential_decay(initial_lr, epoch, decay_rate=0.95):
        """Exponential decay."""
        return initial_lr * (decay_rate ** epoch)

    @staticmethod
    def cosine_annealing(initial_lr, epoch, total_epochs, min_lr=0):
        """Cosine annealing from initial_lr down to min_lr."""
        return min_lr + 0.5 * (initial_lr - min_lr) * (
            1 + np.cos(np.pi * epoch / total_epochs)
        )

    @staticmethod
    def warmup_then_decay(initial_lr, epoch, warmup_epochs=5, total_epochs=100):
        """Linear warmup followed by cosine decay."""
        if epoch < warmup_epochs:
            return initial_lr * (epoch + 1) / warmup_epochs
        else:
            progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
            return initial_lr * 0.5 * (1 + np.cos(np.pi * progress))
```
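
The cosine_annealing method above decays the rate monotonically over a single run; warm restarts periodically jump it back up. Here is a minimal sketch of SGDR-style cosine annealing with warm restarts, assuming a fixed cycle length `cycle_epochs` (a parameter introduced here for illustration):

```python
import numpy as np

def cosine_with_warm_restarts(initial_lr, epoch, cycle_epochs=10, min_lr=0.0):
    """Cosine annealing that restarts to initial_lr every cycle_epochs epochs."""
    t = epoch % cycle_epochs  # position within the current cycle
    return min_lr + 0.5 * (initial_lr - min_lr) * (1 + np.cos(np.pi * t / cycle_epochs))

# Example: the learning rate resets to 0.1 at epochs 0, 10, 20, ...
lrs = [cosine_with_warm_restarts(0.1, e) for e in range(30)]
```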

Comparison Summary

| Optimizer | Pros | Cons | Best For |
|-----------|------|------|----------|
| SGD | Simple, generalizes well | Slow, sensitive to LR | Convex problems |
| Momentum | Faster convergence | Extra hyperparameter | Most deep learning |
| RMSprop | Adapts per parameter | Can diverge | RNNs, non-stationary |
| Adam | Fast, robust | Memory overhead | Default choice |
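
As a quick sanity check, the implementations above can be compared on a synthetic linear-regression problem. A rough sketch, assuming the functions defined earlier in the article (batch_gradient_descent, sgd_momentum, rmsprop, train_with_adam) are in scope; the hyperparameters are illustrative rather than tuned:

```python
import numpy as np

# Synthetic linear-regression data: Y = X @ true_theta + noise
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
true_theta = np.array([1.5, -2.0, 0.5, 3.0, -1.0])
Y = X @ true_theta + 0.1 * rng.normal(size=1000)

results = {}
results["batch"] = batch_gradient_descent(X, Y, np.zeros(5), learning_rate=0.1, epochs=500)
results["momentum"] = sgd_momentum(X, Y, np.zeros(5), learning_rate=0.01, momentum=0.9, epochs=50)
results["rmsprop"] = rmsprop(X, Y, np.zeros(5), learning_rate=0.01, decay_rate=0.9, epochs=50)
results["adam"], _ = train_with_adam(X, Y, np.zeros(5), epochs=300)

# Final mean-squared error for each optimizer
for name, th in results.items():
    print(f"{name:10s} MSE = {np.mean((X @ th - Y) ** 2):.4f}")
```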

Key Takeaways

  1. Adam is the default optimizer for most deep learning tasks
  2. Learning rate is the most important hyperparameter
  3. Momentum helps escape saddle points and smooths the optimization trajectory
  4. Adaptive methods (RMSprop, Adam) handle sparse gradients well
  5. Learning rate schedules can significantly improve final performance
  6. Warmup helps stabilize early training

The choice of optimizer affects both training speed and final model quality. Start with Adam, but don't be afraid to try SGD with momentum for better generalization!
