LSTM Networks for Time Series Forecasting
Long Short-Term Memory (LSTM) networks are powerful for sequential data. In this guide, we'll explore how they work and apply them to financial time series.
The Problem with Vanilla RNNs
Standard RNNs suffer from the vanishing gradient problem. Backpropagating through time chains the recurrent Jacobians across timesteps:

$$\frac{\partial h_t}{\partial h_k} = \prod_{i=k+1}^{t} \frac{\partial h_i}{\partial h_{i-1}}$$

As $t - k$ grows, this product either vanishes (when $\left\|\frac{\partial h_i}{\partial h_{i-1}}\right\| < 1$) or explodes (when $\left\|\frac{\partial h_i}{\partial h_{i-1}}\right\| > 1$).
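To make this concrete, here is a minimal numpy sketch (illustrative values only, not part of the article's pipeline) that multiplies the tanh-RNN Jacobian across 50 timesteps and watches the accumulated gradient shrink toward zero:

import numpy as np

# Minimal sketch: accumulate the product of recurrent Jacobians for a tanh RNN.
# The recurrent weight scale (0.1) is an assumption chosen to show vanishing.
np.random.seed(0)
hidden_size = 32
W_hh = np.random.randn(hidden_size, hidden_size) * 0.1  # recurrent weights

grad = np.eye(hidden_size)  # running product of Jacobians d h_i / d h_{i-1}
for t in range(50):
    h_pre = np.random.randn(hidden_size)           # stand-in pre-activation at step t
    jac = np.diag(1 - np.tanh(h_pre) ** 2) @ W_hh  # tanh'(pre) * W_hh
    grad = grad @ jac

print(np.linalg.norm(grad))  # effectively zero after 50 steps: the gradient vanished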
LSTM Architecture
LSTMs solve this with a gating mechanism and a cell state that acts as a highway for gradients.
The Cell State
The cell state flows through time with minimal modification:

$$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$$

where $\odot$ is element-wise multiplication.
The Gates
Forget Gate - decides what to forget:

$$f_t = \sigma(W_f [h_{t-1}, x_t] + b_f)$$

Input Gate - decides what to store:

$$i_t = \sigma(W_i [h_{t-1}, x_t] + b_i)$$

Candidate Values:

$$\tilde{c}_t = \tanh(W_c [h_{t-1}, x_t] + b_c)$$

Output Gate - decides what to output:

$$o_t = \sigma(W_o [h_{t-1}, x_t] + b_o)$$

Hidden State:

$$h_t = o_t \odot \tanh(c_t)$$
import numpy as np

class LSTMCell:
    """
    Single LSTM cell implementation.
    """
    def __init__(self, input_size, hidden_size):
        self.hidden_size = hidden_size

        # Combined weights for efficiency
        # Shape: (4 * hidden_size, input_size + hidden_size)
        self.W = np.random.randn(4 * hidden_size, input_size + hidden_size) * 0.01
        self.b = np.zeros((4 * hidden_size, 1))

    def forward(self, x, h_prev, c_prev):
        """
        Forward pass through LSTM cell.

        Parameters:
        x -- input at current timestep (input_size, 1)
        h_prev -- previous hidden state (hidden_size, 1)
        c_prev -- previous cell state (hidden_size, 1)

        Returns:
        h_next -- next hidden state
        c_next -- next cell state
        cache -- values needed for backprop
        """
        # Concatenate input and previous hidden state
        concat = np.vstack([h_prev, x])

        # Compute all gates at once
        gates = self.W @ concat + self.b

        # Split into individual gates
        h = self.hidden_size
        f_gate = self._sigmoid(gates[:h])       # Forget gate
        i_gate = self._sigmoid(gates[h:2*h])    # Input gate
        c_tilde = np.tanh(gates[2*h:3*h])       # Candidate
        o_gate = self._sigmoid(gates[3*h:])     # Output gate

        # Update cell state
        c_next = f_gate * c_prev + i_gate * c_tilde

        # Compute hidden state
        h_next = o_gate * np.tanh(c_next)

        cache = (x, h_prev, c_prev, f_gate, i_gate, c_tilde, o_gate, c_next)

        return h_next, c_next, cache

    def _sigmoid(self, x):
        return 1 / (1 + np.exp(-np.clip(x, -500, 500)))
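A quick usage sketch of the cell above (the sizes are made up for illustration) unrolls it over a short random sequence, carrying the hidden and cell state forward:

import numpy as np

# Illustrative usage of the LSTMCell defined above; dimensions are assumptions.
input_size, hidden_size, seq_len = 3, 8, 10
cell = LSTMCell(input_size, hidden_size)

h = np.zeros((hidden_size, 1))  # initial hidden state
c = np.zeros((hidden_size, 1))  # initial cell state

for t in range(seq_len):
    x_t = np.random.randn(input_size, 1)  # one timestep of input features
    h, c, _ = cell.forward(x_t, h, c)     # state is carried across timesteps

print(h.shape, c.shape)  # both (hidden_size, 1)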
Time Series Forecasting with LSTM
Data Preparation
For time series, we create sequences of input-output pairs:
import numpy as np
import pandas as pd

def create_sequences(data, seq_length, forecast_horizon=1):
    """
    Create sequences for LSTM training.

    Parameters:
    data -- numpy array of shape (n_samples, n_features)
    seq_length -- number of time steps to look back
    forecast_horizon -- number of steps to predict ahead

    Returns:
    X -- sequences of shape (n_sequences, seq_length, n_features)
    y -- targets of shape (n_sequences, forecast_horizon)
    """
    X, y = [], []

    for i in range(len(data) - seq_length - forecast_horizon + 1):
        X.append(data[i:(i + seq_length)])
        y.append(data[i + seq_length:i + seq_length + forecast_horizon, 0])

    return np.array(X), np.array(y)


def prepare_financial_data(df, feature_cols, target_col, seq_length=60):
    """
    Prepare financial data for LSTM.
    """
    # Calculate returns and technical features
    df['returns'] = df['close'].pct_change()
    df['log_returns'] = np.log(df['close'] / df['close'].shift(1))
    df['volatility'] = df['returns'].rolling(20).std()
    df['ma_ratio'] = df['close'] / df['close'].rolling(20).mean()

    # Drop NaN
    df = df.dropna()

    # Normalize features
    from sklearn.preprocessing import StandardScaler
    scaler = StandardScaler()
    scaled_data = scaler.fit_transform(df[feature_cols])

    # Create sequences
    X, y = create_sequences(scaled_data, seq_length)

    return X, y, scaler
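To illustrate the expected shapes, a toy run might look like the following (the synthetic 'close' series and the chosen columns are assumptions for this example, not requirements of the pipeline):

import numpy as np
import pandas as pd

# Illustrative usage with a synthetic random-walk price series.
n = 500
df = pd.DataFrame({
    'close': 100 * np.exp(np.cumsum(np.random.randn(n) * 0.01)),
})

feature_cols = ['returns', 'log_returns', 'volatility', 'ma_ratio']
X, y, scaler = prepare_financial_data(df, feature_cols, target_col='returns', seq_length=60)

print(X.shape)  # (n_sequences, 60, 4)
print(y.shape)  # (n_sequences, 1) -- the first feature column, one step ahead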
Complete LSTM Model
import torch
import torch.nn as nn

class LSTMForecaster(nn.Module):
    """
    LSTM model for time series forecasting.
    """
    def __init__(self, input_size, hidden_size, num_layers, output_size, dropout=0.2):
        super().__init__()

        self.hidden_size = hidden_size
        self.num_layers = num_layers

        self.lstm = nn.LSTM(
            input_size=input_size,
            hidden_size=hidden_size,
            num_layers=num_layers,
            batch_first=True,
            dropout=dropout if num_layers > 1 else 0
        )

        self.fc = nn.Sequential(
            nn.Linear(hidden_size, hidden_size // 2),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(hidden_size // 2, output_size)
        )

    def forward(self, x):
        # x shape: (batch, seq_len, features)

        # LSTM output
        lstm_out, (h_n, c_n) = self.lstm(x)

        # Use last hidden state
        out = self.fc(lstm_out[:, -1, :])

        return out


def train_lstm(model, train_loader, val_loader, epochs=100, lr=0.001):
    """
    Training loop for LSTM forecaster.
    """
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    model = model.to(device)

    criterion = nn.MSELoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
        optimizer, patience=10, factor=0.5
    )

    best_val_loss = float('inf')

    for epoch in range(epochs):
        # Training
        model.train()
        train_loss = 0

        for X_batch, y_batch in train_loader:
            X_batch = X_batch.to(device)
            y_batch = y_batch.to(device)

            optimizer.zero_grad()
            predictions = model(X_batch)
            loss = criterion(predictions, y_batch)
            loss.backward()

            # Gradient clipping
            torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

            optimizer.step()
            train_loss += loss.item()

        # Validation
        model.eval()
        val_loss = 0

        with torch.no_grad():
            for X_batch, y_batch in val_loader:
                X_batch = X_batch.to(device)
                y_batch = y_batch.to(device)
                predictions = model(X_batch)
                val_loss += criterion(predictions, y_batch).item()

        train_loss /= len(train_loader)
        val_loss /= len(val_loader)

        scheduler.step(val_loss)

        if val_loss < best_val_loss:
            best_val_loss = val_loss
            torch.save(model.state_dict(), 'best_model.pt')

        if epoch % 10 == 0:
            print(f"Epoch {epoch}: Train Loss = {train_loss:.6f}, Val Loss = {val_loss:.6f}")

    return model
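Wiring the prepared arrays into this training loop could look like the sketch below; the chronological 80/20 split, batch size, and hidden sizes are assumptions, not recommendations:

import torch
from torch.utils.data import TensorDataset, DataLoader

# Illustrative glue code using the X, y arrays from the data-preparation step.
X_t = torch.tensor(X, dtype=torch.float32)
y_t = torch.tensor(y, dtype=torch.float32)

split = int(0.8 * len(X_t))  # chronological split: validation data comes after training data
train_ds = TensorDataset(X_t[:split], y_t[:split])
val_ds = TensorDataset(X_t[split:], y_t[split:])

train_loader = DataLoader(train_ds, batch_size=64, shuffle=True)  # shuffle windows within the training period only
val_loader = DataLoader(val_ds, batch_size=64)

model = LSTMForecaster(input_size=X_t.shape[2], hidden_size=64,
                       num_layers=2, output_size=y_t.shape[1])
model = train_lstm(model, train_loader, val_loader, epochs=50)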
Practical Considerations
Sequence Length
The optimal sequence length depends on:
- Market regime duration
- Computational constraints
- Memory requirements
Rule of thumb: Start with 20-60 time steps for daily data.
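One tradeoff behind that rule of thumb: with a fixed history, a longer lookback leaves fewer training sequences and larger inputs. A quick check, reusing create_sequences from above with assumed sizes:

import numpy as np

# Illustrative only: how the lookback length trades off against sample count.
data = np.random.randn(1000, 4)  # roughly four years of daily data, 4 features (assumed)

for seq_length in (20, 60, 120):
    X, y = create_sequences(data, seq_length)
    print(seq_length, X.shape)  # (1000 - seq_length, seq_length, 4)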
Stationarity
Financial time series are often non-stationary. Transform to:
- Returns: $r_t = \frac{p_t - p_{t-1}}{p_{t-1}}$
- Log returns: $r_t = \ln\left(\frac{p_t}{p_{t-1}}\right)$
- Normalized prices: $\tilde{p}_t = \frac{p_t - \mu_t}{\sigma_t}$, with $\mu_t$ and $\sigma_t$ taken over a rolling window (see below)
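In pandas these transforms are one-liners; the sketch below assumes a DataFrame with a 'close' column, mirroring prepare_financial_data above:

import numpy as np
import pandas as pd

# Sketch of the three transforms; df is assumed to have a 'close' price column.
returns = df['close'].pct_change()
log_returns = np.log(df['close'] / df['close'].shift(1))
normalized = (df['close'] - df['close'].rolling(252).mean()) / df['close'].rolling(252).std()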
Avoiding Look-Ahead Bias
Always use rolling normalization:
def rolling_normalize(data, window=252):
    """
    Rolling z-score normalization to avoid look-ahead bias.
    """
    rolling_mean = data.rolling(window=window).mean()
    rolling_std = data.rolling(window=window).std()

    normalized = (data - rolling_mean) / rolling_std

    return normalized.dropna()
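As a sanity check on a synthetic series (illustrative only), compare this with a full-sample z-score, which uses future observations and therefore leaks information:

import numpy as np
import pandas as pd

# Illustrative comparison on a synthetic random-walk price series.
prices = pd.Series(100 * np.exp(np.cumsum(np.random.randn(1000) * 0.01)))

leaky = (prices - prices.mean()) / prices.std()  # full-sample stats: look-ahead bias
safe = rolling_normalize(prices, window=252)     # each value uses only the past 252 observations

print(len(prices), len(safe))  # 1000 vs 749: undefined until a full window of history exists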
Key Takeaways
- LSTMs solve vanishing gradients with cell state and gates
- Proper data preparation is crucial - normalize, handle stationarity
- Sequence length affects what patterns can be learned
- Gradient clipping prevents exploding gradients
- Rolling normalization avoids look-ahead bias
- Validation strategy must respect temporal ordering
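On that last point, a minimal sketch of a temporally ordered split using scikit-learn's TimeSeriesSplit (the fold count and the gap of one lookback window are assumptions):

import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Illustrative walk-forward validation: every validation fold lies strictly after
# its training fold, and a gap of one lookback window keeps overlapping sequences
# from straddling the boundary.
X = np.random.randn(1000, 60, 4)  # assumed sequence array (n_sequences, seq_len, features)
y = np.random.randn(1000, 1)

tscv = TimeSeriesSplit(n_splits=5, gap=60)
for fold, (train_idx, val_idx) in enumerate(tscv.split(X)):
    print(f"fold {fold}: train ends at {train_idx[-1]}, validation spans {val_idx[0]}-{val_idx[-1]}")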
LSTMs are powerful but require careful implementation for financial applications!