Regularization Techniques: L1, L2, Dropout, and Beyond
Overfitting is the enemy of generalization. Regularization helps models perform well on unseen data.
The Bias-Variance Tradeoff
Total Error = Bias² + Variance + Irreducible Error
- High bias: Underfitting (model too simple)
- High variance: Overfitting (model too complex)
Regularization reduces variance at the cost of slightly increased bias.
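To see the two failure modes concretely, here is a minimal sketch (not from the original text; the synthetic data and polynomial degrees are illustrative) that fits models of increasing complexity and compares training and validation error:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=200)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

for degree in [1, 4, 15]:
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    train_mse = mean_squared_error(y_train, model.predict(X_train))
    val_mse = mean_squared_error(y_val, model.predict(X_val))
    # degree 1: high bias (both errors high); degree 15: high variance
    # (low training error, much higher validation error)
    print(f"degree={degree:2d}  train MSE={train_mse:.3f}  val MSE={val_mse:.3f}")
```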
L2 Regularization (Ridge)
Add the squared magnitude of the weights to the loss:

$$J(w) = L(w) + \lambda \sum_i w_i^2$$

Effect on Gradient

$$\frac{\partial J}{\partial w} = \frac{\partial L}{\partial w} + 2\lambda w$$

Weight Update

$$w \leftarrow w - \eta\left(\frac{\partial L}{\partial w} + 2\lambda w\right) = (1 - 2\eta\lambda)\,w - \eta\,\frac{\partial L}{\partial w}$$

The $(1 - 2\eta\lambda)$ factor shrinks the weights toward zero at every step, which is why L2 regularization is also known as weight decay.
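As a quick illustration of the update rule above, here is a minimal sketch of one gradient step on an L2-regularized squared-error loss for a linear model; `sgd_step_with_l2` and its arguments are illustrative names, not part of the original text:

```python
import numpy as np

def sgd_step_with_l2(w, X_batch, y_batch, lr=0.01, lambda_reg=0.1):
    """One gradient step on MSE + lambda * ||w||^2 (weight decay)."""
    y_pred = X_batch @ w
    grad_data = 2 * X_batch.T @ (y_pred - y_batch) / len(y_batch)  # gradient of the MSE term
    grad_penalty = 2 * lambda_reg * w                              # gradient of the L2 penalty
    # Equivalent "decay" form: w_new = (1 - 2*lr*lambda_reg) * w - lr * grad_data
    return w - lr * (grad_data + grad_penalty)
```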
```python
import numpy as np


def ridge_regression(X, y, lambda_reg=1.0):
    """
    Closed-form Ridge Regression solution.

    w = (X^T X + λI)^(-1) X^T y
    """
    n_features = X.shape[1]
    identity = np.eye(n_features)

    # Don't regularize bias term
    identity[0, 0] = 0

    w = np.linalg.inv(X.T @ X + lambda_reg * identity) @ X.T @ y

    return w


def l2_loss(y_true, y_pred, weights, lambda_reg):
    """L2 regularized MSE loss."""
    mse = np.mean((y_true - y_pred) ** 2)
    l2_penalty = lambda_reg * np.sum(weights[1:] ** 2)  # Skip bias
    return mse + l2_penalty
```

L1 Regularization (Lasso)
Add the absolute magnitude of the weights to the loss:

$$J(w) = L(w) + \lambda \sum_i |w_i|$$
Key Property: Sparsity
L1 drives some weights exactly to zero, performing feature selection.
Geometric Intuition:
- L2: Circular constraint → weights shrink proportionally
- L1: Diamond constraint → corners at axes → sparse solutions
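The exact zeros can also be seen algebraically: proximal and coordinate-descent solvers for the Lasso apply a soft-thresholding step that snaps small weights to exactly zero. The sketch below is illustrative and not the internals of any particular library:

```python
import numpy as np

def soft_threshold(z, threshold):
    """Soft-thresholding: sign(z) * max(|z| - threshold, 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - threshold, 0.0)

weights = np.array([0.8, -0.05, 0.02, -1.3])
print(soft_threshold(weights, 0.1))  # [ 0.7 -0.   0.  -1.2]: small weights become exactly zero
```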
```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge, ElasticNet


def compare_regularization(X_train, y_train, X_test, y_test, alphas=(0.01, 0.1, 1.0)):
    """Compare L1, L2, and Elastic Net regularization."""

    results = []

    for alpha in alphas:
        # L1 (Lasso)
        lasso = Lasso(alpha=alpha)
        lasso.fit(X_train, y_train)
        lasso_score = lasso.score(X_test, y_test)
        lasso_nonzero = np.sum(lasso.coef_ != 0)

        # L2 (Ridge)
        ridge = Ridge(alpha=alpha)
        ridge.fit(X_train, y_train)
        ridge_score = ridge.score(X_test, y_test)

        # Elastic Net (L1 + L2)
        elastic = ElasticNet(alpha=alpha, l1_ratio=0.5)
        elastic.fit(X_train, y_train)
        elastic_score = elastic.score(X_test, y_test)
        elastic_nonzero = np.sum(elastic.coef_ != 0)

        results.append({
            'alpha': alpha,
            'lasso_r2': lasso_score,
            'lasso_features': lasso_nonzero,
            'ridge_r2': ridge_score,
            'elastic_r2': elastic_score,
            'elastic_features': elastic_nonzero
        })

    return results
```

Elastic Net
Combines the L1 and L2 penalties:

$$J(w) = L(w) + \lambda_1 \sum_i |w_i| + \lambda_2 \sum_i w_i^2$$

Or with a mixing parameter $\rho \in [0, 1]$ (the role played by `l1_ratio` in scikit-learn):

$$J(w) = L(w) + \lambda \left( \rho \sum_i |w_i| + (1 - \rho) \sum_i w_i^2 \right)$$
Benefits:
- Sparsity from L1
- Stability from L2
- Handles correlated features better than pure L1 (see the sketch below)
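A minimal sketch of that last point, using synthetic data with two nearly identical features (the settings are illustrative, not from the original text): Lasso tends to keep one of the pair and zero out the other, while Elastic Net spreads weight across both.

```python
import numpy as np
from sklearn.linear_model import Lasso, ElasticNet

rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
x2 = x1 + rng.normal(scale=0.05, size=500)   # almost perfectly correlated with x1
X = np.column_stack([x1, x2])
y = 3 * x1 + rng.normal(scale=0.1, size=500)

print("Lasso:      ", Lasso(alpha=0.1).fit(X, y).coef_)
print("Elastic Net:", ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y).coef_)
```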
Dropout
Randomly zero out neurons during training:

$$\tilde{a}_i = m_i \cdot a_i$$

where $m_i \sim \text{Bernoulli}(1 - p)$ and $p$ is the dropout probability.
Inverted Dropout
Scale the kept activations during training to maintain their expected values, so no extra scaling is needed at inference time:

$$\tilde{a}_i = \frac{m_i \cdot a_i}{1 - p}$$
```python
import numpy as np
import torch
import torch.nn as nn


class MLPWithDropout(nn.Module):
    """
    MLP with Dropout regularization.
    """
    def __init__(self, input_size, hidden_sizes, output_size, dropout_rate=0.5):
        super().__init__()

        layers = []
        prev_size = input_size

        for hidden_size in hidden_sizes:
            layers.extend([
                nn.Linear(prev_size, hidden_size),
                nn.ReLU(),
                nn.Dropout(p=dropout_rate)
            ])
            prev_size = hidden_size

        layers.append(nn.Linear(prev_size, output_size))

        self.network = nn.Sequential(*layers)

    def forward(self, x):
        return self.network(x)


# Manual dropout implementation (NumPy)
def dropout_forward(A, drop_prob, training=True):
    """
    Apply (inverted) dropout to activations.
    """
    if not training or drop_prob == 0:
        return A, None

    # Create mask: 1 keeps a unit, 0 drops it
    mask = (np.random.rand(*A.shape) > drop_prob).astype(float)

    # Apply inverted dropout: rescale so expected activations are unchanged
    A_dropout = A * mask / (1 - drop_prob)

    return A_dropout, mask


def dropout_backward(dA, mask, drop_prob):
    """
    Backprop through dropout: gradients flow only through kept units.
    """
    if mask is None:
        return dA

    return dA * mask / (1 - drop_prob)
```

Batch Normalization
Normalize layer inputs to reduce internal covariate shift:

$$\hat{x} = \frac{x - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}, \qquad y = \gamma \hat{x} + \beta$$

where $\mu_B$ and $\sigma_B^2$ are the batch mean and variance, and $\gamma$ and $\beta$ are learnable scale and shift parameters.
```python
import numpy as np


class BatchNorm:
    """
    Batch Normalization layer (NumPy).
    """
    def __init__(self, num_features, epsilon=1e-5, momentum=0.1):
        self.epsilon = epsilon
        self.momentum = momentum

        # Learnable parameters
        self.gamma = np.ones(num_features)
        self.beta = np.zeros(num_features)

        # Running statistics for inference
        self.running_mean = np.zeros(num_features)
        self.running_var = np.ones(num_features)

    def forward(self, x, training=True):
        if training:
            # Compute batch statistics
            self.batch_mean = x.mean(axis=0)
            self.batch_var = x.var(axis=0)

            # Update running statistics (exponential moving average)
            self.running_mean = (1 - self.momentum) * self.running_mean + self.momentum * self.batch_mean
            self.running_var = (1 - self.momentum) * self.running_var + self.momentum * self.batch_var

            # Normalize using batch statistics
            self.x_norm = (x - self.batch_mean) / np.sqrt(self.batch_var + self.epsilon)
        else:
            # Use running statistics at inference time
            self.x_norm = (x - self.running_mean) / np.sqrt(self.running_var + self.epsilon)

        # Scale and shift with learnable gamma and beta
        return self.gamma * self.x_norm + self.beta
```

Early Stopping
Monitor the validation loss and stop training once it has stopped improving for a set number of epochs (the patience):
```python
class EarlyStopping:
    """
    Early stopping to prevent overfitting.
    """
    def __init__(self, patience=10, min_delta=0.001, restore_best=True):
        self.patience = patience
        self.min_delta = min_delta
        self.restore_best = restore_best

        self.best_loss = float('inf')
        self.counter = 0
        self.best_weights = None

    def __call__(self, val_loss, model):
        if val_loss < self.best_loss - self.min_delta:
            self.best_loss = val_loss
            self.counter = 0
            if self.restore_best:
                self.best_weights = {k: v.clone() for k, v in model.state_dict().items()}
        else:
            self.counter += 1

        if self.counter >= self.patience:
            if self.restore_best and self.best_weights:
                model.load_state_dict(self.best_weights)
            return True  # Stop training

        return False


# Usage in a training loop (train_epoch, validate, and the data loaders are assumed to exist)
early_stopping = EarlyStopping(patience=10)

for epoch in range(max_epochs):
    train_loss = train_epoch(model, train_loader)
    val_loss = validate(model, val_loader)

    if early_stopping(val_loss, model):
        print(f"Early stopping at epoch {epoch}")
        break
```

Summary: When to Use Each
| Technique | Use When | Effect |
|---|---|---|
| L2 (Ridge) | Many features, all useful | Shrinks weights |
| L1 (Lasso) | Many features, few useful | Sparse solution |
| Elastic Net | Correlated features | Both effects |
| Dropout | Deep networks | Ensemble effect |
| Batch Norm | Deep networks | Stabilizes training |
| Early Stopping | Always | Prevents overtraining |
Key Takeaways
- L2 regularization shrinks all weights proportionally
- L1 regularization creates sparse models (feature selection)
- Dropout prevents co-adaptation of neurons
- Batch normalization enables higher learning rates
- Early stopping is simple and effective
- Combine techniques for best results (a combined setup is sketched below)
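As a closing illustration, here is a hedged sketch of how several of these techniques can be combined in one PyTorch training setup, reusing the `MLPWithDropout` and `EarlyStopping` classes defined above; the layer sizes, hyperparameters, and the `train_loader` / `val_loader` objects are assumptions for the example, not part of the original text.

```python
import torch
import torch.nn as nn

# Assumed to exist: train_loader and val_loader (torch DataLoaders),
# plus MLPWithDropout and EarlyStopping from the snippets above.
model = MLPWithDropout(input_size=64, hidden_sizes=[128, 64],
                       output_size=10, dropout_rate=0.5)           # dropout
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3,
                             weight_decay=1e-4)                    # L2 penalty (weight decay)
criterion = nn.CrossEntropyLoss()
early_stopping = EarlyStopping(patience=10)                        # early stopping

for epoch in range(100):
    model.train()                          # dropout active
    for xb, yb in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(xb), yb)
        loss.backward()
        optimizer.step()

    model.eval()                           # dropout disabled for evaluation
    with torch.no_grad():
        val_loss = sum(criterion(model(xb), yb).item()
                       for xb, yb in val_loader) / len(val_loader)

    if early_stopping(val_loss, model):
        print(f"Early stopping at epoch {epoch}")
        break
```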
Regularization is essential for building models that generalize to new data!