
Regularization Techniques: L1, L2, Dropout, and Beyond

Prevent overfitting in machine learning models. Understand the mathematics of regularization and when to apply each technique.

Overfitting is the enemy of generalization. Regularization helps models perform well on unseen data.

The Bias-Variance Tradeoff

Total Error = Bias² + Variance + Irreducible Error

  • High bias: Underfitting (model too simple)
  • High variance: Overfitting (model too complex)

Regularization reduces variance at the cost of slightly increased bias.
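
To see the two failure modes concretely, here is a minimal sketch using scikit-learn (the synthetic dataset and model choices are purely illustrative): polynomials of increasing degree move from underfitting to overfitting on the same data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Noisy sine curve as a small toy regression problem
rng = np.random.default_rng(0)
X = np.sort(rng.uniform(-3, 3, size=(30, 1)), axis=0)
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=30)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for degree in [1, 4, 15]:
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    train_mse = mean_squared_error(y_train, model.predict(X_train))
    test_mse = mean_squared_error(y_test, model.predict(X_test))
    # Low degree: both errors high (high bias). High degree: train error drops
    # while test error typically stops improving or worsens (high variance).
    print(f"degree={degree:2d}  train MSE={train_mse:.3f}  test MSE={test_mse:.3f}")
```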


L2 Regularization (Ridge)

Add the squared magnitude of the weights to the loss:

$$J_{regularized} = J + \frac{\lambda}{2m} \sum_{j=1}^{n} w_j^2$$

Effect on Gradient

$$\frac{\partial J_{reg}}{\partial w_j} = \frac{\partial J}{\partial w_j} + \frac{\lambda}{m} w_j$$

Weight Update

$$w_j := w_j - \alpha \left( \frac{\partial J}{\partial w_j} + \frac{\lambda}{m} w_j \right)$$

$$w_j := w_j \left(1 - \frac{\alpha \lambda}{m}\right) - \alpha \frac{\partial J}{\partial w_j}$$

The factor $\left(1 - \frac{\alpha \lambda}{m}\right)$ multiplies each weight at every update, shrinking the weights toward zero; this is why L2 regularization is also known as weight decay.
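
Here is a minimal NumPy sketch of that update for linear regression (the function and variable names are illustrative, not from any particular library); the bias term is excluded from the penalty:

```python
import numpy as np

def gradient_step_l2(w, X, y, alpha=0.01, lambda_reg=0.1):
    """One gradient-descent step on MSE loss with an L2 penalty.

    Assumes X already contains an intercept column at index 0,
    which is excluded from the penalty.
    """
    m = X.shape[0]
    error = X @ w - y                     # predictions minus targets
    grad = (X.T @ error) / m              # gradient of the unregularized MSE term
    grad[1:] += (lambda_reg / m) * w[1:]  # add (λ/m)·w_j for every non-bias weight
    return w - alpha * grad               # each penalized w_j shrinks by the factor (1 - αλ/m)
```

For linear regression, the L2-regularized solution can also be computed in closed form, as the following implementation shows: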

```python
import numpy as np

def ridge_regression(X, y, lambda_reg=1.0):
    """
    Closed-form Ridge Regression solution.

    w = (X^T X + λI)^(-1) X^T y
    """
    n_features = X.shape[1]
    identity = np.eye(n_features)

    # Don't regularize bias term
    identity[0, 0] = 0

    w = np.linalg.inv(X.T @ X + lambda_reg * identity) @ X.T @ y

    return w


def l2_loss(y_true, y_pred, weights, lambda_reg):
    """L2 regularized MSE loss."""
    mse = np.mean((y_true - y_pred) ** 2)
    l2_penalty = lambda_reg * np.sum(weights[1:] ** 2)  # Skip bias
    return mse + l2_penalty
```

L1 Regularization (Lasso)

Add the absolute magnitude of the weights to the loss:

$$J_{regularized} = J + \frac{\lambda}{m} \sum_{j=1}^{n} |w_j|$$

Key Property: Sparsity

L1 drives some weights exactly to zero, performing feature selection.

Geometric Intuition:

  • L2: Circular constraint → weights shrink proportionally
  • L1: Diamond constraint → corners at axes → sparse solutions

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge, ElasticNet

def compare_regularization(X_train, y_train, X_test, y_test, alphas=[0.01, 0.1, 1.0]):
    """Compare L1, L2, and Elastic Net regularization."""

    results = []

    for alpha in alphas:
        # L1 (Lasso)
        lasso = Lasso(alpha=alpha)
        lasso.fit(X_train, y_train)
        lasso_score = lasso.score(X_test, y_test)
        lasso_nonzero = np.sum(lasso.coef_ != 0)

        # L2 (Ridge)
        ridge = Ridge(alpha=alpha)
        ridge.fit(X_train, y_train)
        ridge_score = ridge.score(X_test, y_test)

        # Elastic Net (L1 + L2)
        elastic = ElasticNet(alpha=alpha, l1_ratio=0.5)
        elastic.fit(X_train, y_train)
        elastic_score = elastic.score(X_test, y_test)
        elastic_nonzero = np.sum(elastic.coef_ != 0)

        results.append({
            'alpha': alpha,
            'lasso_r2': lasso_score,
            'lasso_features': lasso_nonzero,
            'ridge_r2': ridge_score,
            'elastic_r2': elastic_score,
            'elastic_features': elastic_nonzero
        })

    return results
```
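
The mechanism behind this sparsity is the soft-thresholding (proximal) operator used by coordinate-descent Lasso solvers: any coefficient whose magnitude falls below the threshold is set exactly to zero. A standalone sketch of the operator itself (not sklearn's internal code):

```python
import numpy as np

def soft_threshold(z, threshold):
    """Proximal operator of the L1 penalty: shrink toward zero, exact zero inside the threshold."""
    return np.sign(z) * np.maximum(np.abs(z) - threshold, 0.0)

coeffs = np.array([-2.0, -0.3, 0.1, 0.5, 3.0])
print(soft_threshold(coeffs, 0.4))  # entries with |z| <= 0.4 become exactly 0
```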

Elastic Net

Combines L1 and L2:

$$J_{regularized} = J + \lambda_1 \sum |w_j| + \lambda_2 \sum w_j^2$$

Or with mixing parameter $\rho$:

$$J_{regularized} = J + \lambda \left( \rho \sum |w_j| + \frac{1-\rho}{2} \sum w_j^2 \right)$$

Benefits:

  • Sparsity from L1
  • Stability from L2
  • Handles correlated features better than pure L1
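
The $\rho$ form is also how the earlier ElasticNet(alpha=..., l1_ratio=0.5) call is parameterized, with alpha standing in for $\lambda$ and l1_ratio for $\rho$ (sklearn scales the data term differently, so treat this as a sketch of the penalty only):

```python
import numpy as np

def elastic_net_penalty(w, alpha=1.0, l1_ratio=0.5):
    """Elastic Net penalty: alpha plays the role of λ, l1_ratio the role of ρ."""
    l1 = np.sum(np.abs(w))
    l2 = np.sum(w ** 2)
    return alpha * (l1_ratio * l1 + 0.5 * (1 - l1_ratio) * l2)

w = np.array([0.0, 1.5, -2.0, 0.5])
print(elastic_net_penalty(w, alpha=0.1, l1_ratio=0.5))  # blends the L1 and L2 terms
```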

Dropout

Randomly zero out neurons during training:

$$\tilde{h}^{(l)} = h^{(l)} \odot m^{(l)}$$

where $m^{(l)} \sim \text{Bernoulli}(p)$ is a binary mask and $p$ is the keep probability.

Inverted Dropout

Scale activations during training to maintain expected values:

$$\tilde{h}^{(l)} = \frac{h^{(l)} \odot m^{(l)}}{p}$$

```python
import numpy as np
import torch
import torch.nn as nn

class MLPWithDropout(nn.Module):
    """
    MLP with Dropout regularization.
    """
    def __init__(self, input_size, hidden_sizes, output_size, dropout_rate=0.5):
        super().__init__()

        layers = []
        prev_size = input_size

        for hidden_size in hidden_sizes:
            layers.extend([
                nn.Linear(prev_size, hidden_size),
                nn.ReLU(),
                nn.Dropout(p=dropout_rate)
            ])
            prev_size = hidden_size

        layers.append(nn.Linear(prev_size, output_size))

        self.network = nn.Sequential(*layers)

    def forward(self, x):
        return self.network(x)


# Manual (inverted) dropout implementation in NumPy
def dropout_forward(A, drop_prob, training=True):
    """
    Apply dropout to activations.
    """
    if not training or drop_prob == 0:
        return A, None

    # Create mask
    mask = (np.random.rand(*A.shape) > drop_prob).astype(float)

    # Apply inverted dropout
    A_dropout = A * mask / (1 - drop_prob)

    return A_dropout, mask


def dropout_backward(dA, mask, drop_prob):
    """
    Backprop through dropout.
    """
    if mask is None:
        return dA

    return dA * mask / (1 - drop_prob)
```
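
One practical note on the PyTorch version: nn.Dropout is active only in training mode, so you switch it on and off with model.train() and model.eval(). A short usage sketch continuing from the MLPWithDropout class above (shapes and sizes are arbitrary):

```python
model = MLPWithDropout(input_size=20, hidden_sizes=[64, 32], output_size=1, dropout_rate=0.5)

model.train()   # dropout active: units are zeroed at random and the rest rescaled
# ... training steps ...

model.eval()    # dropout disabled: the full network is used for predictions
with torch.no_grad():
    preds = model(torch.randn(8, 20))
```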

Batch Normalization

Normalize layer inputs to reduce internal covariate shift:

$$\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}$$

$$y_i = \gamma \hat{x}_i + \beta$$

where $\gamma$ and $\beta$ are learnable parameters.

```python
import numpy as np

class BatchNorm:
    """
    Batch Normalization layer.
    """
    def __init__(self, num_features, epsilon=1e-5, momentum=0.1):
        self.epsilon = epsilon
        self.momentum = momentum

        # Learnable parameters
        self.gamma = np.ones(num_features)
        self.beta = np.zeros(num_features)

        # Running statistics for inference
        self.running_mean = np.zeros(num_features)
        self.running_var = np.ones(num_features)

    def forward(self, x, training=True):
        if training:
            # Compute batch statistics
            self.batch_mean = x.mean(axis=0)
            self.batch_var = x.var(axis=0)

            # Update running statistics
            self.running_mean = (1 - self.momentum) * self.running_mean + self.momentum * self.batch_mean
            self.running_var = (1 - self.momentum) * self.running_var + self.momentum * self.batch_var

            # Normalize
            self.x_norm = (x - self.batch_mean) / np.sqrt(self.batch_var + self.epsilon)
        else:
            # Use running statistics
            self.x_norm = (x - self.running_mean) / np.sqrt(self.running_var + self.epsilon)

        # Scale and shift
        return self.gamma * self.x_norm + self.beta
```
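
In practice you would normally reach for a framework layer rather than this hand-rolled version, for example PyTorch's nn.BatchNorm1d. A minimal sketch (layer sizes are arbitrary):

```python
import torch
import torch.nn as nn

net = nn.Sequential(
    nn.Linear(20, 64),
    nn.BatchNorm1d(64),   # learnable gamma/beta, running mean/var tracked automatically
    nn.ReLU(),
    nn.Linear(64, 1),
)

net.train()                        # normalizes with batch statistics, updates running estimates
out = net(torch.randn(32, 20))

net.eval()                         # normalizes with the stored running statistics
out = net(torch.randn(5, 20))
```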

Early Stopping

Monitor validation loss and stop when it starts increasing:

```python
class EarlyStopping:
    """
    Early stopping to prevent overfitting.
    """
    def __init__(self, patience=10, min_delta=0.001, restore_best=True):
        self.patience = patience
        self.min_delta = min_delta
        self.restore_best = restore_best

        self.best_loss = float('inf')
        self.counter = 0
        self.best_weights = None

    def __call__(self, val_loss, model):
        if val_loss < self.best_loss - self.min_delta:
            self.best_loss = val_loss
            self.counter = 0
            if self.restore_best:
                self.best_weights = {k: v.clone() for k, v in model.state_dict().items()}
        else:
            self.counter += 1

        if self.counter >= self.patience:
            if self.restore_best and self.best_weights:
                model.load_state_dict(self.best_weights)
            return True  # Stop training

        return False


# Usage in training loop
early_stopping = EarlyStopping(patience=10)

for epoch in range(max_epochs):
    train_loss = train_epoch(model, train_loader)
    val_loss = validate(model, val_loader)

    if early_stopping(val_loss, model):
        print(f"Early stopping at epoch {epoch}")
        break
```

Summary: When to Use Each

| Technique | Use When | Effect |
| --- | --- | --- |
| L2 (Ridge) | Many features, all useful | Shrinks weights |
| L1 (Lasso) | Many features, few useful | Sparse solution |
| Elastic Net | Correlated features | Both effects |
| Dropout | Deep networks | Ensemble effect |
| Batch Norm | Deep networks | Stabilizes training |
| Early Stopping | Always | Prevents overtraining |

Key Takeaways

  1. L2 regularization shrinks all weights proportionally
  2. L1 regularization creates sparse models (feature selection)
  3. Dropout prevents co-adaptation of neurons
  4. Batch normalization enables higher learning rates
  5. Early stopping is simple and effective
  6. Combine techniques for best results
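
As a rough sketch of point 6, here is one way several of these techniques can be combined in a single PyTorch setup (hyperparameters and layer sizes are arbitrary; EarlyStopping is the helper class defined earlier):

```python
import torch
import torch.nn as nn

# Dropout + batch norm inside the model, L2 via the optimizer's weight_decay,
# and early stopping wrapped around the training loop.
model = nn.Sequential(
    nn.Linear(20, 64),
    nn.BatchNorm1d(64),
    nn.ReLU(),
    nn.Dropout(p=0.3),
    nn.Linear(64, 1),
)

# For SGD, weight_decay adds λ·w to each gradient, i.e. the L2 update derived earlier.
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, weight_decay=1e-4)
early_stopping = EarlyStopping(patience=10)
```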

Regularization is essential for building models that generalize to new data!
