Gradient Descent & Optimization: From SGD to Adam
Optimization is the heart of machine learning. In this guide, we'll explore how gradient descent works and the modern variants that make deep learning possible.
Table of Contents
- Vanilla Gradient Descent
- Stochastic Gradient Descent (SGD)
- Momentum
- RMSprop
- Adam Optimizer
- Learning Rate Schedules
Vanilla Gradient Descent
The fundamental idea: move in the direction of steepest descent.
Update Rule

$$\theta \leftarrow \theta - \eta \, \nabla_\theta J(\theta)$$

Where:
- $\theta$ represents all parameters
- $\eta$ is the learning rate
- $\nabla_\theta J(\theta)$ is the gradient of the loss with respect to the parameters
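To see the rule in action, here is a minimal sketch (a toy example of my own, not the guide's linear-regression setup) that minimizes $f(\theta) = \theta^2$, whose gradient is $2\theta$:

```python
theta = 5.0          # initial parameter
learning_rate = 0.1  # eta

for step in range(50):
    gradient = 2 * theta                      # d/d(theta) of theta**2
    theta = theta - learning_rate * gradient  # the update rule

print(theta)  # close to 0, the minimizer of f
```

Each step multiplies $\theta$ by $(1 - 2\eta)$, so with $\eta = 0.1$ the parameter shrinks geometrically toward the minimum.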
Types of Gradient Descent
- Batch Gradient Descent: Uses entire dataset per update
- Stochastic Gradient Descent: Uses one sample per update
- Mini-batch Gradient Descent: Uses small batches (most common)
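The batch variant is implemented just below. Since mini-batch gradient descent is the variant most training loops actually use (the momentum, RMSprop, and Adam examples later in this guide all use batches of 32), here is a minimal sketch of it on its own (my own illustration, in the same linear-regression setting):

```python
import numpy as np

def minibatch_gradient_descent(X, Y, theta, learning_rate, epochs, batch_size=32):
    """Plain mini-batch gradient descent for linear regression (sketch)."""
    m = len(Y)

    for epoch in range(epochs):
        indices = np.random.permutation(m)  # reshuffle every epoch

        for i in range(0, m, batch_size):
            batch_idx = indices[i:i+batch_size]
            X_batch, Y_batch = X[batch_idx], Y[batch_idx]

            # Gradient averaged over the current mini-batch only
            gradient = (1/len(batch_idx)) * X_batch.T @ (X_batch @ theta - Y_batch)
            theta = theta - learning_rate * gradient

    return theta
```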
```python
import numpy as np

def batch_gradient_descent(X, Y, theta, learning_rate, epochs):
    """
    Batch gradient descent for linear regression.
    """
    m = len(Y)

    for epoch in range(epochs):
        # Compute predictions
        predictions = X @ theta

        # Compute gradient
        gradient = (1/m) * X.T @ (predictions - Y)

        # Update parameters
        theta = theta - learning_rate * gradient

        # Compute cost (using the pre-update predictions)
        cost = (1/(2*m)) * np.sum((predictions - Y) ** 2)

        if epoch % 100 == 0:
            print(f"Epoch {epoch}: Cost = {cost:.4f}")

    return theta
```
Stochastic Gradient Descent
SGD updates parameters after each training example, introducing noise that can help escape local minima.
Update Rule

For sample $i$:

$$\theta \leftarrow \theta - \eta \, \nabla_\theta J\big(\theta;\, x^{(i)}, y^{(i)}\big)$$
Advantages
- Faster convergence for large datasets
- Can escape local minima
- Online learning capable
Disadvantages
- High variance in updates
- May never settle at minimum
- Requires careful learning rate tuning
```python
def sgd(X, Y, theta, learning_rate, epochs):
    """
    Stochastic Gradient Descent with shuffling.
    """
    m = len(Y)

    for epoch in range(epochs):
        # Shuffle data
        indices = np.random.permutation(m)
        X_shuffled = X[indices]
        Y_shuffled = Y[indices]

        for i in range(m):
            xi = X_shuffled[i:i+1]
            yi = Y_shuffled[i:i+1]

            # Compute gradient for a single sample
            gradient = xi.T @ (xi @ theta - yi)

            # Update
            theta = theta - learning_rate * gradient

    return theta
```
Momentum
Momentum accelerates SGD by accumulating a velocity vector in the direction of persistent gradients.
Update Rules

$$v_t = \gamma\, v_{t-1} + \nabla_\theta J(\theta)$$
$$\theta \leftarrow \theta - \eta\, v_t$$

Where:
- $v_t$ is the velocity (accumulated gradient)
- $\gamma$ is the momentum coefficient (typically 0.9)
Intuition
Think of a ball rolling down a hill:
- Accelerates in consistent directions
- Dampens oscillations in inconsistent directions
- Can roll past shallow local minima
```python
def sgd_momentum(X, Y, theta, learning_rate, momentum, epochs):
    """
    SGD with Momentum.
    """
    velocity = np.zeros_like(theta)
    m = len(Y)

    for epoch in range(epochs):
        indices = np.random.permutation(m)

        for i in range(0, m, 32):  # Mini-batch size 32
            batch_idx = indices[i:i+32]
            X_batch = X[batch_idx]
            Y_batch = Y[batch_idx]

            # Compute gradient
            gradient = (1/len(batch_idx)) * X_batch.T @ (X_batch @ theta - Y_batch)

            # Update velocity
            velocity = momentum * velocity + gradient

            # Update parameters
            theta = theta - learning_rate * velocity

    return theta
```
RMSprop
RMSprop adapts learning rates for each parameter based on the magnitude of recent gradients.
Update Rules

$$E[g^2]_t = \rho\, E[g^2]_{t-1} + (1 - \rho)\, g_t^2$$
$$\theta \leftarrow \theta - \frac{\eta}{\sqrt{E[g^2]_t} + \epsilon}\, g_t$$

Where:
- $E[g^2]_t$ is the exponential moving average of squared gradients
- $\rho$ is the decay rate (typically 0.9)
- $\epsilon$ is a small constant for numerical stability ($\approx 10^{-8}$)
Benefits
- Handles different gradient magnitudes
- Works well with non-stationary objectives
- Adapts to the geometry of the problem
```python
def rmsprop(X, Y, theta, learning_rate, decay_rate, epochs):
    """
    RMSprop optimizer.
    """
    cache = np.zeros_like(theta)
    epsilon = 1e-8
    m = len(Y)

    for epoch in range(epochs):
        indices = np.random.permutation(m)

        for i in range(0, m, 32):
            batch_idx = indices[i:i+32]
            X_batch = X[batch_idx]
            Y_batch = Y[batch_idx]

            # Compute gradient
            gradient = (1/len(batch_idx)) * X_batch.T @ (X_batch @ theta - Y_batch)

            # Update cache (exponential moving average of squared gradients)
            cache = decay_rate * cache + (1 - decay_rate) * gradient ** 2

            # Update parameters
            theta = theta - learning_rate * gradient / (np.sqrt(cache) + epsilon)

    return theta
```
Adam Optimizer
Adam (Adaptive Moment Estimation) combines momentum and RMSprop with bias correction.
Update Rules

Compute moments:

$$m_t = \beta_1 m_{t-1} + (1 - \beta_1)\, g_t$$
$$v_t = \beta_2 v_{t-1} + (1 - \beta_2)\, g_t^2$$

Bias correction (because $m_0 = v_0 = 0$, the raw moments are biased toward zero during the first steps; dividing by $1 - \beta^t$ removes that bias):

$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}$$

Update parameters:

$$\theta \leftarrow \theta - \frac{\eta\, \hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}$$

Recommended Hyperparameters
- $\eta = 0.001$ (learning rate)
- $\beta_1 = 0.9$ (first moment decay)
- $\beta_2 = 0.999$ (second moment decay)
- $\epsilon = 10^{-8}$ (numerical stability)
```python
class Adam:
    """
    Adam optimizer implementation.
    """
    def __init__(self, learning_rate=0.001, beta1=0.9, beta2=0.999, epsilon=1e-8):
        self.lr = learning_rate
        self.beta1 = beta1
        self.beta2 = beta2
        self.epsilon = epsilon
        self.m = None
        self.v = None
        self.t = 0

    def update(self, theta, gradient):
        if self.m is None:
            self.m = np.zeros_like(theta)
            self.v = np.zeros_like(theta)

        self.t += 1

        # Update biased first moment estimate
        self.m = self.beta1 * self.m + (1 - self.beta1) * gradient

        # Update biased second raw moment estimate
        self.v = self.beta2 * self.v + (1 - self.beta2) * gradient ** 2

        # Compute bias-corrected first moment estimate
        m_hat = self.m / (1 - self.beta1 ** self.t)

        # Compute bias-corrected second raw moment estimate
        v_hat = self.v / (1 - self.beta2 ** self.t)

        # Update parameters
        theta = theta - self.lr * m_hat / (np.sqrt(v_hat) + self.epsilon)

        return theta


def train_with_adam(X, Y, theta, epochs, batch_size=32):
    """
    Training loop with Adam optimizer.
    """
    optimizer = Adam(learning_rate=0.001)
    m = len(Y)
    costs = []

    for epoch in range(epochs):
        indices = np.random.permutation(m)

        for i in range(0, m, batch_size):
            batch_idx = indices[i:i+batch_size]
            X_batch = X[batch_idx]
            Y_batch = Y[batch_idx]

            # Compute gradient
            gradient = (1/len(batch_idx)) * X_batch.T @ (X_batch @ theta - Y_batch)

            # Update with Adam
            theta = optimizer.update(theta, gradient)

        # Compute cost on the full dataset once per epoch
        cost = (1/(2*m)) * np.sum((X @ theta - Y) ** 2)
        costs.append(cost)

        if epoch % 100 == 0:
            print(f"Epoch {epoch}: Cost = {cost:.4f}")

    return theta, costs
```
Learning Rate Schedules
The learning rate often needs to change during training.
Step Decay

Reduce the learning rate at specific epochs:

$$\eta_t = \eta_0 \cdot \gamma^{\lfloor t / s \rfloor}$$

Where $\gamma$ is the decay factor and $s$ is the step size (in epochs).

Exponential Decay

$$\eta_t = \eta_0 \cdot \gamma^{t}$$

Cosine Annealing

$$\eta_t = \eta_{\min} + \tfrac{1}{2}\,(\eta_0 - \eta_{\min})\left(1 + \cos\frac{\pi t}{T}\right)$$

Warm Restarts

Periodically reset the learning rate to its initial value to escape local minima.
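The scheduler class below implements the first three schedules plus warmup, but not warm restarts, so here is a minimal sketch of cosine annealing with warm restarts (my own illustration, assuming a fixed restart period rather than the growing periods used in SGDR):

```python
import numpy as np

def cosine_warm_restarts(initial_lr, epoch, restart_period=10, min_lr=0.0):
    """Cosine decay that jumps back to initial_lr every restart_period epochs."""
    t = epoch % restart_period  # position within the current cycle
    return min_lr + 0.5 * (initial_lr - min_lr) * (
        1 + np.cos(np.pi * t / restart_period)
    )

# Example: the LR decays from 0.1 and resets at epochs 10, 20, ...
lrs = [cosine_warm_restarts(0.1, e) for e in range(30)]
```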
```python
class LearningRateScheduler:
    """
    Various learning rate scheduling strategies.
    """

    @staticmethod
    def step_decay(initial_lr, epoch, drop_rate=0.5, epochs_drop=10):
        """Step decay: reduce LR by a factor every N epochs."""
        return initial_lr * (drop_rate ** (epoch // epochs_drop))

    @staticmethod
    def exponential_decay(initial_lr, epoch, decay_rate=0.95):
        """Exponential decay."""
        return initial_lr * (decay_rate ** epoch)

    @staticmethod
    def cosine_annealing(initial_lr, epoch, total_epochs, min_lr=0):
        """Cosine annealing from initial_lr down to min_lr."""
        return min_lr + 0.5 * (initial_lr - min_lr) * (
            1 + np.cos(np.pi * epoch / total_epochs)
        )

    @staticmethod
    def warmup_then_decay(initial_lr, epoch, warmup_epochs=5, total_epochs=100):
        """Linear warmup followed by cosine decay."""
        if epoch < warmup_epochs:
            return initial_lr * (epoch + 1) / warmup_epochs
        else:
            progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
            return initial_lr * 0.5 * (1 + np.cos(np.pi * progress))
```
Comparison Summary
| Optimizer | Pros | Cons | Best For |
|---|---|---|---|
| SGD | Simple, generalizes well | Slow, sensitive to LR | Convex problems |
| Momentum | Faster convergence | Extra hyperparameter | Most deep learning |
| RMSprop | Adapts per parameter | Can diverge | RNNs, non-stationary |
| Adam | Fast, robust | Memory overhead | Default choice |
Key Takeaways
- Adam is the default optimizer for most deep learning tasks
- Learning rate is the most important hyperparameter
- Momentum helps escape saddle points and smooth optimization
- Adaptive methods (RMSprop, Adam) handle sparse gradients well
- Learning rate schedules can significantly improve final performance
- Warmup helps stabilize early training
The choice of optimizer affects both training speed and final model quality. Start with Adam, but don't be afraid to try SGD with momentum for better generalization!
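As a rough illustration of that advice, here is a minimal sketch (my own toy comparison on synthetic linear-regression data, not part of the original material) that runs the `sgd_momentum` function and the Adam training loop defined above and compares their final costs:

```python
import numpy as np

# Assumes sgd_momentum and train_with_adam from this guide are in scope.
np.random.seed(0)
X = np.random.randn(500, 3)
true_theta = np.array([[2.0], [-1.0], [0.5]])
Y = X @ true_theta + 0.1 * np.random.randn(500, 1)

theta_sgd = sgd_momentum(X, Y, np.zeros((3, 1)),
                         learning_rate=0.01, momentum=0.9, epochs=200)
theta_adam, costs = train_with_adam(X, Y, np.zeros((3, 1)), epochs=200)

def cost(theta):
    """Mean squared error cost, matching the guide's convention."""
    return (1 / (2 * len(Y))) * np.sum((X @ theta - Y) ** 2)

print(f"SGD with momentum cost: {cost(theta_sgd):.6f}")
print(f"Adam cost:              {cost(theta_adam):.6f}")
```

On a well-conditioned toy problem like this, both reach a similar cost; the differences discussed above matter most on large, noisy, or poorly conditioned problems.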