Feature Engineering for Algorithmic Trading

Features are the fuel for machine learning models. In trading, good features can mean the difference between a profitable strategy and random noise.

Categories of Features

Price-based features - derived from OHLCV
Technical indicators - classical TA
Microstructure features - order book, trades
Alternative data - sentiment, fundamentals
Time features - seasonality, calendar effects

Price-Based Features

Returns

$r_t = \frac{p_t - p_{t-1}}{p_{t-1}}$

$r_t^{log} = \log(p_t) - \log(p_{t-1})$
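
The two definitions aggregate differently over time: simple returns compound multiplicatively, while log returns add. The sketch below illustrates this on a made-up price series (the prices are hypothetical and chosen only for illustration).

```python
import numpy as np
import pandas as pd

# Hypothetical closing prices, for illustration only
prices = pd.Series([100.0, 102.0, 99.0, 105.0])

simple = prices.pct_change().dropna()
log_ret = np.log(prices).diff().dropna()

# Total return over the whole span, computed directly from first and last price
total_simple = prices.iloc[-1] / prices.iloc[0] - 1

# Simple returns compound multiplicatively
compounded = (1 + simple).prod() - 1

# Log returns add; exponentiating the sum recovers the simple total return
from_logs = np.exp(log_ret.sum()) - 1

print(total_simple, compounded, from_logs)
```

All three numbers agree up to floating-point error, which is why log returns are convenient when summing over multi-period windows.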

Volatility

Historical volatility (realized):

$\sigma_t = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (r_{t-i} - \bar{r})^2}$

Parkinson volatility (using high-low):

$\sigma_P = \sqrt{\frac{1}{4 \ln(2)} \cdot \frac{1}{n} \sum_{i=1}^{n} (\ln H_i - \ln L_i)^2}$

python

import pandas as pd
import numpy as np

def calculate_returns(df):
    """Calculate various return measures."""
    df['returns'] = df['close'].pct_change()
    df['log_returns'] = np.log(df['close'] / df['close'].shift(1))

    # Multi-period returns
    for period in [5, 10, 20]:
        df[f'returns_{period}d'] = df['close'].pct_change(period)

    return df


def calculate_volatility(df, windows=(5, 10, 20, 60)):
    """Calculate volatility features (expects 'returns', 'high' and 'low' columns)."""
    for window in windows:
        # Standard deviation of returns
        df[f'volatility_{window}d'] = df['returns'].rolling(window).std()

        # Parkinson volatility
        df[f'parkinson_vol_{window}d'] = np.sqrt(
            (1 / (4 * np.log(2))) *
            ((np.log(df['high'] / df['low']) ** 2).rolling(window).mean())
        )

    return df

Technical Indicators

Moving Averages

Simple Moving Average:

$SMA_n = \frac{1}{n} \sum_{i=0}^{n-1} p_{t-i}$

Exponential Moving Average:

$EMA_t = \alpha \cdot p_t + (1 - \alpha) \cdot EMA_{t-1}$

Where $\alpha = \frac{2}{n+1}$
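
To make the recursion concrete, here is a minimal sketch that applies it by hand and compares the result with pandas. The helper name `ema_recursive` and the choice to seed the EMA with the first price are assumptions made for this example. Note that pandas only reproduces this recursion with `adjust=False`; the default `adjust=True` uses a different weighting for early observations.

```python
import numpy as np
import pandas as pd

def ema_recursive(prices, n):
    """EMA via the recursion above, seeded with the first price (an assumption)."""
    alpha = 2.0 / (n + 1)
    out = np.empty(len(prices))
    out[0] = prices[0]
    for t in range(1, len(prices)):
        out[t] = alpha * prices[t] + (1 - alpha) * out[t - 1]
    return out

# Hypothetical prices, for illustration only
prices = pd.Series([100.0, 101.0, 103.0, 102.0, 104.0, 107.0])

manual = ema_recursive(prices.to_numpy(), n=3)
pandas_recursive = prices.ewm(span=3, adjust=False).mean().to_numpy()
pandas_default = prices.ewm(span=3).mean().to_numpy()

print(np.allclose(manual, pandas_recursive))  # the recursion matches adjust=False
print(np.allclose(manual, pandas_default))    # the default weighting differs early on
```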

RSI (Relative Strength Index)

$RSI = 100 - \frac{100}{1 + RS}$

Where $RS = \frac{\text{Average Gain}}{\text{Average Loss}}$
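
The formula leaves open how the averages are computed. A common choice is Wilder's smoothing, which is an exponential average with $\alpha = 1/n$. The sketch below is one possible implementation under that assumption; the function name `rsi_wilder` is made up for this example.

```python
import numpy as np
import pandas as pd

def rsi_wilder(close, period=14):
    """RSI with Wilder-style smoothing (exponential average, alpha = 1 / period)."""
    delta = close.diff()
    gain = delta.clip(lower=0)       # positive moves, zero otherwise
    loss = (-delta).clip(lower=0)    # magnitude of negative moves, zero otherwise

    avg_gain = gain.ewm(alpha=1 / period, adjust=False, min_periods=period).mean()
    avg_loss = loss.ewm(alpha=1 / period, adjust=False, min_periods=period).mean()

    rs = avg_gain / avg_loss         # becomes inf when there are no losses
    return 100 - 100 / (1 + rs)

# A strictly rising hypothetical series has no losses, so RSI saturates at 100
close = pd.Series(np.arange(1.0, 31.0))
rsi_up = rsi_wilder(close)
print(rsi_up.iloc[-1])
```

Values are NaN until enough observations have accumulated, so downstream code should expect a warm-up period.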

MACD

$MACD = EMA_{12} - EMA_{26}$

$Signal = EMA_9(MACD)$

python

def add_technical_indicators(df):
    """Add common technical indicators."""

    # Moving averages
    for window in [10, 20, 50, 200]:
        df[f'sma_{window}'] = df['close'].rolling(window).mean()
        # adjust=False matches the recursive EMA formula above
        df[f'ema_{window}'] = df['close'].ewm(span=window, adjust=False).mean()

        # Price relative to MA
        df[f'close_to_sma_{window}'] = df['close'] / df[f'sma_{window}'] - 1

    # RSI (simple-moving-average variant; Wilder's original uses exponential smoothing)
    delta = df['close'].diff()
    gain = (delta.where(delta > 0, 0)).rolling(14).mean()
    loss = (-delta.where(delta < 0, 0)).rolling(14).mean()
    rs = gain / loss
    df['rsi_14'] = 100 - (100 / (1 + rs))

    # MACD
    ema_12 = df['close'].ewm(span=12, adjust=False).mean()
    ema_26 = df['close'].ewm(span=26, adjust=False).mean()
    df['macd'] = ema_12 - ema_26
    df['macd_signal'] = df['macd'].ewm(span=9, adjust=False).mean()
    df['macd_hist'] = df['macd'] - df['macd_signal']

    # Bollinger Bands
    df['bb_mid'] = df['close'].rolling(20).mean()
    df['bb_std'] = df['close'].rolling(20).std()
    df['bb_upper'] = df['bb_mid'] + 2 * df['bb_std']
    df['bb_lower'] = df['bb_mid'] - 2 * df['bb_std']
    df['bb_position'] = (df['close'] - df['bb_lower']) / (df['bb_upper'] - df['bb_lower'])

    # ATR (Average True Range)
    high_low = df['high'] - df['low']
    high_close = abs(df['high'] - df['close'].shift())
    low_close = abs(df['low'] - df['close'].shift())
    tr = pd.concat([high_low, high_close, low_close], axis=1).max(axis=1)
    df['atr_14'] = tr.rolling(14).mean()

    return df

Microstructure Features

Order book and trade data reveal information about supply and demand.

Order Book Imbalance

$OBI = \frac{V_{bid} - V_{ask}}{V_{bid} + V_{ask}}$

Trade Imbalance

$TI = \frac{V_{buy} - V_{sell}}{V_{buy} + V_{sell}}$
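
Both imbalance measures are straightforward to compute once order book and trade data are available. The sketch below assumes a hypothetical schema: top-of-book `bid_volume` and `ask_volume` series, and a trades table indexed by timestamp with `size` and `side` (`'buy'` or `'sell'`) columns. Real feeds will use different field names, and the trade side often has to be inferred.

```python
import pandas as pd

def order_book_imbalance(bid_volume, ask_volume):
    """OBI in [-1, 1]; NaN where both sides of the book are empty."""
    total = bid_volume + ask_volume
    return (bid_volume - ask_volume) / total.where(total != 0)

def trade_imbalance(trades, freq="1min"):
    """TI per time bucket from signed trade sizes; NaN for buckets without trades."""
    signed = trades["size"].where(trades["side"] == "buy", -trades["size"])
    net = signed.resample(freq).sum()
    gross = trades["size"].resample(freq).sum()
    return net / gross.where(gross != 0)

# Hypothetical snapshots and trades, for illustration only
book = pd.DataFrame({"bid_volume": [300.0, 100.0], "ask_volume": [100.0, 100.0]})
obi = order_book_imbalance(book["bid_volume"], book["ask_volume"])

trades = pd.DataFrame(
    {"size": [10.0, 30.0, 20.0], "side": ["buy", "sell", "buy"]},
    index=pd.to_datetime(
        ["2024-01-02 09:30:05", "2024-01-02 09:30:40", "2024-01-02 09:31:10"]
    ),
)
ti = trade_imbalance(trades)
print(obi.tolist(), ti.tolist())
```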

VWAP Deviation

$VWAP = \frac{\sum_i P_i \cdot V_i}{\sum_i V_i}$

$VWAP\_Dev = \frac{P - VWAP}{VWAP}$

python

def calculate_microstructure_features(df, trades_df=None):
    """Calculate microstructure-based features."""

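    # VWAP approximated from bars: the close stands in for the traded price,
    # and the sums accumulate from the first row of df (not reset per session)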
    df['vwap'] = (df['close'] * df['volume']).cumsum() / df['volume'].cumsum()
    df['vwap_dev'] = df['close'] / df['vwap'] - 1

    # Volume features
    df['volume_sma_20'] = df['volume'].rolling(20).mean()
    df['relative_volume'] = df['volume'] / df['volume_sma_20']

    # Price-volume correlation
    df['pv_corr_20'] = df['returns'].rolling(20).corr(df['volume'].pct_change())

    # On-Balance Volume
    df['obv'] = (np.sign(df['close'].diff()) * df['volume']).cumsum()
    df['obv_ema'] = df['obv'].ewm(span=20).mean()

    # Money Flow Index
    typical_price = (df['high'] + df['low'] + df['close']) / 3
    money_flow = typical_price * df['volume']

    positive_flow = money_flow.where(typical_price > typical_price.shift(), 0).rolling(14).sum()
    negative_flow = money_flow.where(typical_price < typical_price.shift(), 0).rolling(14).sum()

    df['mfi'] = 100 - (100 / (1 + positive_flow / negative_flow))

    return df

Alternative Data Features

Sentiment Features

python

def calculate_sentiment_features(df, sentiment_df):
    """
    Merge sentiment data with price data.

    sentiment_df should have: date, sentiment_score, sentiment_volume
    """
    # Merge on date
    df = df.merge(sentiment_df, on='date', how='left')

    # Forward fill missing sentiment
    df['sentiment_score'] = df['sentiment_score'].ffill()

    # Sentiment momentum
    df['sentiment_ma_7'] = df['sentiment_score'].rolling(7).mean()
    df['sentiment_change'] = df['sentiment_score'].diff()

    # Sentiment-return divergence
    df['sent_ret_corr'] = df['sentiment_score'].rolling(20).corr(df['returns'])

    return df
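
One caveat with a same-day merge: if the sentiment score for a date includes text published after the market close, joining it to that day's bar leaks future information. A conservative option is to lag sentiment before merging. The sketch below assumes daily data with a `date` column in both frames; the function name `merge_lagged_sentiment` and the one-day default lag are assumptions for this example.

```python
import pandas as pd

def merge_lagged_sentiment(df, sentiment_df, lag_days=1):
    """Attach the most recent sentiment that is at least `lag_days` old."""
    prices = df.copy()
    prices["date"] = pd.to_datetime(prices["date"])
    prices = prices.sort_values("date")

    sent = sentiment_df.copy()
    # Shift sentiment dates forward so day D's score is first usable on D + lag_days
    sent["date"] = pd.to_datetime(sent["date"]) + pd.Timedelta(days=lag_days)
    sent = sent.sort_values("date")

    # Backward as-of join: each bar gets the latest available (lagged) score,
    # which also carries scores across weekends and holidays
    return pd.merge_asof(prices, sent, on="date", direction="backward")

# Hypothetical data, for illustration only
price_demo = pd.DataFrame(
    {"date": ["2024-01-02", "2024-01-03", "2024-01-04"], "close": [100.0, 101.0, 99.5]}
)
sentiment_demo = pd.DataFrame(
    {"date": ["2024-01-01", "2024-01-02", "2024-01-03"],
     "sentiment_score": [0.1, 0.5, -0.2]}
)
merged = merge_lagged_sentiment(price_demo, sentiment_demo)
print(merged)
```

In this example the bar for 2024-01-03 receives the score from 2024-01-02, not its own day's score.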

Time Features

Markets show recurring calendar patterns, such as day-of-week and month-end effects.

python

def add_time_features(df):
    """Add calendar and time-based features."""

    df['date'] = pd.to_datetime(df['date'])

    # Day of week (0=Monday)
    df['day_of_week'] = df['date'].dt.dayofweek

    # Month
    df['month'] = df['date'].dt.month

    # Quarter end effect
    df['is_quarter_end'] = df['date'].dt.is_quarter_end.astype(int)

    # Days to month end
    df['days_to_month_end'] = (df['date'] + pd.offsets.MonthEnd(0) - df['date']).dt.days

    # Cyclical encoding (for neural networks); the divisor 5 assumes a
    # Monday-Friday trading week, so use 7 for markets that trade every day
    df['day_sin'] = np.sin(2 * np.pi * df['day_of_week'] / 5)
    df['day_cos'] = np.cos(2 * np.pi * df['day_of_week'] / 5)
    df['month_sin'] = np.sin(2 * np.pi * df['month'] / 12)
    df['month_cos'] = np.cos(2 * np.pi * df['month'] / 12)

    return df

Feature Selection

Not all features are useful. Use these techniques to select the best:

Correlation Analysis

python

def analyze_feature_correlation(df, target_col, threshold=0.8):
    """
    Analyze features for correlation with target and multicollinearity.
    """
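    # Compute the correlation matrix once, over numeric columns only
    corr = df.corr(numeric_only=True)

    # Correlation with target, ranked by absolute strength (target itself excluded)
    target_corr = corr[target_col].drop(target_col).sort_values(
        key=lambda s: s.abs(), ascending=False)
    print("Top features correlated with target:")
    print(target_corr.head(20))

    # Absolute feature-to-feature correlations (target excluded)
    corr_matrix = corr.drop(index=target_col, columns=target_col).abs()

    # Find highly correlated features (upper triangle avoids duplicate pairs)
    upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))
    high_corr = [(col, row, corr_matrix.loc[row, col])
                 for col in upper.columns
                 for row in upper.index
                 if upper.loc[row, col] > threshold]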

    print(f"\nHighly correlated feature pairs (>{threshold}):")
    for col, row, corr in sorted(high_corr, key=lambda x: -x[2]):
        print(f"  {col} - {row}: {corr:.3f}")

    return target_corr, high_corr
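
Once highly correlated pairs are identified, a simple way to act on them is to drop one feature from each pair. The sketch below keeps whichever feature appears first in column order, which is an arbitrary rule chosen for this example; in practice you may prefer to keep the one more correlated with the target. The function name `drop_correlated_features` is hypothetical.

```python
import numpy as np
import pandas as pd

def drop_correlated_features(X, threshold=0.8):
    """Drop every feature whose |correlation| with an earlier column exceeds threshold."""
    corr = X.corr().abs()
    # Keep only the strict upper triangle so each pair is considered once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return X.drop(columns=to_drop), to_drop

# Synthetic data: 'b' is almost a multiple of 'a', 'c' is independent
rng = np.random.default_rng(0)
a = rng.normal(size=200)
X_demo = pd.DataFrame({
    "a": a,
    "b": 2 * a + 0.01 * rng.normal(size=200),
    "c": rng.normal(size=200),
})
X_reduced, dropped = drop_correlated_features(X_demo)
print(dropped, list(X_reduced.columns))
```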

Feature Importance

python

from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

def calculate_feature_importance(X, y):
    """Calculate feature importance using Random Forest."""

    # Train model
    rf = RandomForestClassifier(n_estimators=100, random_state=42)
    rf.fit(X, y)

    # Built-in importance
    importance_df = pd.DataFrame({
        'feature': X.columns,
        'importance': rf.feature_importances_
    }).sort_values('importance', ascending=False)

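    # Permutation importance (less biased toward high-cardinality features than
    # the built-in measure). Note: it is computed on the training data here, so
    # it shows what the fitted model relies on, not what generalizes out of sample.
    perm_importance = permutation_importance(rf, X, y, n_repeats=10, random_state=42)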

    perm_importance_df = pd.DataFrame({
        'feature': X.columns,
        'importance_mean': perm_importance.importances_mean,
        'importance_std': perm_importance.importances_std
    }).sort_values('importance_mean', ascending=False)

    return importance_df, perm_importance_df
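
Importance measured on the same data the model was trained on can reward features the model merely memorized. A variant worth considering is to fit on an earlier portion of the data and measure permutation importance on a later, held-out portion. The sketch below assumes rows are already in chronological order; the function name `holdout_permutation_importance` and the 80/20 split are assumptions for this example.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

def holdout_permutation_importance(X, y, test_fraction=0.2, n_repeats=10):
    """Fit on the earliest rows, score permutation importance on the latest rows."""
    split = int(len(X) * (1 - test_fraction))
    X_train, X_test = X.iloc[:split], X.iloc[split:]
    y_train, y_test = y.iloc[:split], y.iloc[split:]

    rf = RandomForestClassifier(n_estimators=100, random_state=42)
    rf.fit(X_train, y_train)

    result = permutation_importance(
        rf, X_test, y_test, n_repeats=n_repeats, random_state=42
    )
    return pd.DataFrame({
        "feature": X.columns,
        "importance_mean": result.importances_mean,
        "importance_std": result.importances_std,
    }).sort_values("importance_mean", ascending=False)

# Synthetic check: only 'signal' determines the label, 'noise' is irrelevant
rng = np.random.default_rng(0)
X_demo = pd.DataFrame({
    "noise": rng.normal(size=500),
    "signal": rng.normal(size=500),
})
y_demo = (X_demo["signal"] > 0).astype(int)
ranking = holdout_permutation_importance(X_demo, y_demo)
print(ranking)
```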

Complete Feature Engineering Pipeline

python

def build_feature_matrix(df):
    """
    Complete feature engineering pipeline.
    """
    df = df.copy()

    # Price features
    df = calculate_returns(df)
    df = calculate_volatility(df)

    # Technical indicators
    df = add_technical_indicators(df)

    # Microstructure
    df = calculate_microstructure_features(df)

    # Time features
    df = add_time_features(df)

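    # Target variable (next day return direction).
    # The last row has no next-day return; keep its target as NaN so that
    # dropna() removes it instead of silently labeling it 0.
    next_return = df['returns'].shift(-1)
    df['target'] = (next_return > 0).astype(float).where(next_return.notna())

    # Drop rows with NaN (indicator warm-up rows and the unlabeled last row)
    df = df.dropna()
    df['target'] = df['target'].astype(int)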

    # Feature columns
    feature_cols = [col for col in df.columns
                   if col not in ['date', 'open', 'high', 'low', 'close',
                                 'volume', 'target', 'returns']]

    return df, feature_cols


# Usage (price_data: DataFrame with date, open, high, low, close, volume columns)
df, features = build_feature_matrix(price_data)
X = df[features]
y = df['target']

print(f"Feature matrix shape: {X.shape}")
print(f"Features: {features}")
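
One way to sanity-check a pipeline like this for look-ahead bias is to compute features on the full history and again on a truncated history, then compare the overlapping rows. If a value before the cutoff changes when later rows are removed, that feature depends on future data. The sketch below is a minimal version of that idea; the helper `has_lookahead` and the two toy feature functions are hypothetical. The target column is expected to fail this check, since it looks ahead by design, so it should be excluded.

```python
import numpy as np
import pandas as pd

def has_lookahead(feature_fn, df, cutoff):
    """True if any feature value in the first `cutoff` rows depends on later rows."""
    full = feature_fn(df.copy()).iloc[:cutoff]
    truncated = feature_fn(df.iloc[:cutoff].copy())
    same = np.allclose(full.to_numpy(), truncated.to_numpy(), equal_nan=True)
    return not same

def trailing_ma(df):
    # Uses only the current and previous rows
    return pd.DataFrame({"ma": df["close"].rolling(3).mean()})

def centered_ma(df):
    # Uses one row from the future: a deliberate look-ahead bug
    return pd.DataFrame({"ma": df["close"].rolling(3, center=True).mean()})

# Hypothetical prices, for illustration only
demo = pd.DataFrame({"close": [100.0, 101.0, 103.0, 102.0, 104.0,
                               107.0, 106.0, 108.0, 110.0, 109.0]})

print(has_lookahead(trailing_ma, demo, cutoff=6))
print(has_lookahead(centered_ma, demo, cutoff=6))
```

The trailing average passes the check, while the centered average is flagged because its last pre-cutoff value needs the following row.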

Key Takeaways

Returns and volatility are fundamental - always include them
Technical indicators capture classical patterns
Microstructure features reveal supply/demand dynamics
Time features capture seasonality
Remove multicollinearity to improve model stability
Use trailing (backward-looking) rolling windows so each feature depends only on past data, which avoids look-ahead bias
Feature importance guides selection

Good features often matter more than complex models. Spend time on feature engineering!
