Regularization Techniques: L1, L2, Dropout, and Beyond
Overfitting is the enemy of generalization. Regularization helps models perform well on unseen data.
The Bias-Variance Tradeoff
Total Error = Bias² + Variance + Irreducible Error
- High bias: Underfitting (model too simple)
- High variance: Overfitting (model too complex)
Regularization reduces variance at the cost of slightly increased bias.
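To see the two failure modes concretely, here is a minimal sketch (not from the original text; the synthetic data and polynomial degrees are illustrative) that fits models of increasing complexity and compares training and validation error:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=200)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

for degree in [1, 4, 15]:
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    train_mse = mean_squared_error(y_train, model.predict(X_train))
    val_mse = mean_squared_error(y_val, model.predict(X_val))
    # degree 1: high bias (both errors high); degree 15: high variance
    # (low training error, much higher validation error)
    print(f"degree={degree:2d}  train MSE={train_mse:.3f}  val MSE={val_mse:.3f}")
```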
L2 Regularization (Ridge)
Add the squared magnitude of the weights to the loss:

$$J(w) = L(w) + \lambda \sum_i w_i^2$$

Effect on Gradient

$$\frac{\partial J}{\partial w} = \frac{\partial L}{\partial w} + 2\lambda w$$

Weight Update

$$w \leftarrow w - \eta\left(\frac{\partial L}{\partial w} + 2\lambda w\right) = (1 - 2\eta\lambda)\,w - \eta\,\frac{\partial L}{\partial w}$$

The $(1 - 2\eta\lambda)$ factor shrinks the weights toward zero at every step, which is why L2 regularization is also known as weight decay.
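As a quick illustration of the update rule above, here is a minimal sketch of one gradient step on an L2-regularized squared-error loss for a linear model; `sgd_step_with_l2` and its arguments are illustrative names, not part of the original text:

```python
import numpy as np

def sgd_step_with_l2(w, X_batch, y_batch, lr=0.01, lambda_reg=0.1):
    """One gradient step on MSE + lambda * ||w||^2 (weight decay)."""
    y_pred = X_batch @ w
    grad_data = 2 * X_batch.T @ (y_pred - y_batch) / len(y_batch)  # gradient of the MSE term
    grad_penalty = 2 * lambda_reg * w                              # gradient of the L2 penalty
    # Equivalent "decay" form: w_new = (1 - 2*lr*lambda_reg) * w - lr * grad_data
    return w - lr * (grad_data + grad_penalty)
```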
```python
import numpy as np


def ridge_regression(X, y, lambda_reg=1.0):
    """
    Closed-form Ridge Regression solution.

    w = (X^T X + λI)^(-1) X^T y
    """
    n_features = X.shape[1]
    identity = np.eye(n_features)

    # Don't regularize bias term
    identity[0, 0] = 0

    w = np.linalg.inv(X.T @ X + lambda_reg * identity) @ X.T @ y

    return w


def l2_loss(y_true, y_pred, weights, lambda_reg):
    """L2 regularized MSE loss."""
    mse = np.mean((y_true - y_pred) ** 2)
    l2_penalty = lambda_reg * np.sum(weights[1:] ** 2)  # Skip bias
    return mse + l2_penalty
```

L1 Regularization (Lasso)
Add the absolute magnitude of the weights to the loss:

$$J(w) = L(w) + \lambda \sum_i |w_i|$$
Key Property: Sparsity
L1 drives some weights exactly to zero, performing feature selection.
Geometric Intuition:
- L2: Circular constraint → weights shrink proportionally
- L1: Diamond constraint → corners at axes → sparse solutions
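The exact zeros can also be seen algebraically: proximal and coordinate-descent solvers for the Lasso apply a soft-thresholding step that snaps small weights to exactly zero. The sketch below is illustrative and not the internals of any particular library:

```python
import numpy as np

def soft_threshold(z, threshold):
    """Soft-thresholding: sign(z) * max(|z| - threshold, 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - threshold, 0.0)

weights = np.array([0.8, -0.05, 0.02, -1.3])
print(soft_threshold(weights, 0.1))  # [ 0.7 -0.   0.  -1.2]: small weights become exactly zero
```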
```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge, ElasticNet


def compare_regularization(X_train, y_train, X_test, y_test, alphas=(0.01, 0.1, 1.0)):
    """Compare L1, L2, and Elastic Net regularization."""

    results = []

    for alpha in alphas:
        # L1 (Lasso)
        lasso = Lasso(alpha=alpha)
        lasso.fit(X_train, y_train)
        lasso_score = lasso.score(X_test, y_test)
        lasso_nonzero = np.sum(lasso.coef_ != 0)

        # L2 (Ridge)
        ridge = Ridge(alpha=alpha)
        ridge.fit(X_train, y_train)
        ridge_score = ridge.score(X_test, y_test)

        # Elastic Net (L1 + L2)
        elastic = ElasticNet(alpha=alpha, l1_ratio=0.5)
        elastic.fit(X_train, y_train)
        elastic_score = elastic.score(X_test, y_test)
        elastic_nonzero = np.sum(elastic.coef_ != 0)

        results.append({
            'alpha': alpha,
            'lasso_r2': lasso_score,
            'lasso_features': lasso_nonzero,
            'ridge_r2': ridge_score,
            'elastic_r2': elastic_score,
            'elastic_features': elastic_nonzero
        })

    return results
```

Elastic Net
Combines the L1 and L2 penalties:

$$J(w) = L(w) + \lambda_1 \sum_i |w_i| + \lambda_2 \sum_i w_i^2$$

Or with a mixing parameter $\rho \in [0, 1]$ (the role played by `l1_ratio` in scikit-learn):

$$J(w) = L(w) + \lambda \left( \rho \sum_i |w_i| + (1 - \rho) \sum_i w_i^2 \right)$$
Benefits:
- Sparsity from L1
- Stability from L2
- Handles correlated features better than pure L1 (see the sketch below)
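A minimal sketch of that last point, using synthetic data with two nearly identical features (the settings are illustrative, not from the original text): Lasso tends to keep one of the pair and zero out the other, while Elastic Net spreads weight across both.

```python
import numpy as np
from sklearn.linear_model import Lasso, ElasticNet

rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
x2 = x1 + rng.normal(scale=0.05, size=500)   # almost perfectly correlated with x1
X = np.column_stack([x1, x2])
y = 3 * x1 + rng.normal(scale=0.1, size=500)

print("Lasso:      ", Lasso(alpha=0.1).fit(X, y).coef_)
print("Elastic Net:", ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y).coef_)
```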
Dropout
Randomly zero out neurons during training:

$$\tilde{a}_i = m_i \cdot a_i$$

where $m_i \sim \text{Bernoulli}(1 - p)$ and $p$ is the dropout probability.
Inverted Dropout
Scale the kept activations during training to maintain their expected values, so no extra scaling is needed at inference time:

$$\tilde{a}_i = \frac{m_i \cdot a_i}{1 - p}$$
```python
import numpy as np
import torch
import torch.nn as nn


class MLPWithDropout(nn.Module):
    """
    MLP with Dropout regularization.
    """
    def __init__(self, input_size, hidden_sizes, output_size, dropout_rate=0.5):
        super().__init__()

        layers = []
        prev_size = input_size

        for hidden_size in hidden_sizes:
            layers.extend([
                nn.Linear(prev_size, hidden_size),
                nn.ReLU(),
                nn.Dropout(p=dropout_rate)
            ])
            prev_size = hidden_size

        layers.append(nn.Linear(prev_size, output_size))

        self.network = nn.Sequential(*layers)

    def forward(self, x):
        return self.network(x)


# Manual dropout implementation (NumPy)
def dropout_forward(A, drop_prob, training=True):
    """
    Apply (inverted) dropout to activations.
    """
    if not training or drop_prob == 0:
        return A, None

    # Create mask: 1 keeps a unit, 0 drops it
    mask = (np.random.rand(*A.shape) > drop_prob).astype(float)

    # Apply inverted dropout: rescale so expected activations are unchanged
    A_dropout = A * mask / (1 - drop_prob)

    return A_dropout, mask


def dropout_backward(dA, mask, drop_prob):
    """
    Backprop through dropout: gradients flow only through kept units.
    """
    if mask is None:
        return dA

    return dA * mask / (1 - drop_prob)
```

Batch Normalization
Normalize layer inputs to reduce internal covariate shift:

$$\hat{x} = \frac{x - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}, \qquad y = \gamma \hat{x} + \beta$$

where $\mu_B$ and $\sigma_B^2$ are the batch mean and variance, and $\gamma$ and $\beta$ are learnable scale and shift parameters.
```python
import numpy as np


class BatchNorm:
    """
    Batch Normalization layer (NumPy).
    """
    def __init__(self, num_features, epsilon=1e-5, momentum=0.1):
        self.epsilon = epsilon
        self.momentum = momentum

        # Learnable parameters
        self.gamma = np.ones(num_features)
        self.beta = np.zeros(num_features)

        # Running statistics for inference
        self.running_mean = np.zeros(num_features)
        self.running_var = np.ones(num_features)

    def forward(self, x, training=True):
        if training:
            # Compute batch statistics
            self.batch_mean = x.mean(axis=0)
            self.batch_var = x.var(axis=0)

            # Update running statistics (exponential moving average)
            self.running_mean = (1 - self.momentum) * self.running_mean + self.momentum * self.batch_mean
            self.running_var = (1 - self.momentum) * self.running_var + self.momentum * self.batch_var

            # Normalize using batch statistics
            self.x_norm = (x - self.batch_mean) / np.sqrt(self.batch_var + self.epsilon)
        else:
            # Use running statistics at inference time
            self.x_norm = (x - self.running_mean) / np.sqrt(self.running_var + self.epsilon)

        # Scale and shift with learnable gamma and beta
        return self.gamma * self.x_norm + self.beta
```

Early Stopping
Monitor the validation loss and stop training once it has stopped improving for a set number of epochs (the patience):
```python
class EarlyStopping:
    """
    Early stopping to prevent overfitting.
    """
    def __init__(self, patience=10, min_delta=0.001, restore_best=True):
        self.patience = patience
        self.min_delta = min_delta
        self.restore_best = restore_best

        self.best_loss = float('inf')
        self.counter = 0
        self.best_weights = None

    def __call__(self, val_loss, model):
        if val_loss < self.best_loss - self.min_delta:
            self.best_loss = val_loss
            self.counter = 0
            if self.restore_best:
                self.best_weights = {k: v.clone() for k, v in model.state_dict().items()}
        else:
            self.counter += 1

        if self.counter >= self.patience:
            if self.restore_best and self.best_weights:
                model.load_state_dict(self.best_weights)
            return True  # Stop training

        return False


# Usage in a training loop (train_epoch, validate, and the data loaders are assumed to exist)
early_stopping = EarlyStopping(patience=10)

for epoch in range(max_epochs):
    train_loss = train_epoch(model, train_loader)
    val_loss = validate(model, val_loader)

    if early_stopping(val_loss, model):
        print(f"Early stopping at epoch {epoch}")
        break
```

Summary: When to Use Each
| Technique | Use When | Effect |
|---|---|---|
| L2 (Ridge) | Many features, all useful | Shrinks weights |
| L1 (Lasso) | Many features, few useful | Sparse solution |
| Elastic Net | Correlated features | Both effects |
| Dropout | Deep networks | Ensemble effect |
| Batch Norm | Deep networks | Stabilizes training |
| Early Stopping | Always | Prevents overtraining |
Key Takeaways
- L2 regularization shrinks all weights proportionally
- L1 regularization creates sparse models (feature selection)
- Dropout prevents co-adaptation of neurons
- Batch normalization enables higher learning rates
- Early stopping is simple and effective
- Combine techniques for best results (a combined setup is sketched below)
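As a closing illustration, here is a hedged sketch of how several of these techniques can be combined in one PyTorch training setup, reusing the `MLPWithDropout` and `EarlyStopping` classes defined above; the layer sizes, hyperparameters, and the `train_loader` / `val_loader` objects are assumptions for the example, not part of the original text.

```python
import torch
import torch.nn as nn

# Assumed to exist: train_loader and val_loader (torch DataLoaders),
# plus MLPWithDropout and EarlyStopping from the snippets above.
model = MLPWithDropout(input_size=64, hidden_sizes=[128, 64],
                       output_size=10, dropout_rate=0.5)           # dropout
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3,
                             weight_decay=1e-4)                    # L2 penalty (weight decay)
criterion = nn.CrossEntropyLoss()
early_stopping = EarlyStopping(patience=10)                        # early stopping

for epoch in range(100):
    model.train()                          # dropout active
    for xb, yb in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(xb), yb)
        loss.backward()
        optimizer.step()

    model.eval()                           # dropout disabled for evaluation
    with torch.no_grad():
        val_loss = sum(criterion(model(xb), yb).item()
                       for xb, yb in val_loader) / len(val_loader)

    if early_stopping(val_loss, model):
        print(f"Early stopping at epoch {epoch}")
        break
```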
Regularization is essential for building models that generalize to new data!