
Neural Networks from Scratch: The Complete Mathematical Guide

Understanding the mathematics behind neural networks, from perceptrons to backpropagation. Build your intuition with derivations and code.


Neural networks are the backbone of modern machine learning. In this comprehensive guide, we'll build our understanding from the ground up, starting with the mathematics and implementing everything in Python.

Table of Contents

  1. The Perceptron
  2. Activation Functions
  3. Forward Propagation
  4. Loss Functions
  5. Backpropagation
  6. Implementation

The Perceptron

The perceptron is the fundamental building block of neural networks. It takes multiple inputs, applies weights, adds a bias, and produces an output through an activation function.

Mathematical Definition

For a single perceptron with n inputs:

z = \sum_{i=1}^{n} w_i x_i + b = \mathbf{w}^T \mathbf{x} + b

Where:

  • \mathbf{x} = [x_1, x_2, ..., x_n]^T is the input vector
  • \mathbf{w} = [w_1, w_2, ..., w_n]^T is the weight vector
  • b is the bias term
  • z is the pre-activation (weighted sum)

The output is then:

a = \sigma(z)

Where \sigma is the activation function.
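To make this concrete, here is a minimal NumPy sketch of a single perceptron with a sigmoid activation; the weights, inputs, and bias are arbitrary example values, not taken from any dataset.

python
import numpy as np

w = np.array([0.5, -0.3, 0.8])   # example weight vector
x = np.array([1.0, 2.0, 0.5])    # example input vector
b = 0.1                          # example bias

z = np.dot(w, x) + b             # pre-activation: w^T x + b
a = 1 / (1 + np.exp(-z))         # sigmoid activation
print(z, a)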


Activation Functions

Activation functions introduce non-linearity into the network, enabling it to learn complex patterns.

Sigmoid Function

\sigma(z) = \frac{1}{1 + e^{-z}}

Properties:

  • Output range: (0, 1)
  • Derivative: \sigma'(z) = \sigma(z)(1 - \sigma(z))
  • Use case: Binary classification output layer

ReLU (Rectified Linear Unit)

\text{ReLU}(z) = \max(0, z)

Properties:

  • Output range: [0, \infty)
  • Derivative: \text{ReLU}'(z) = \begin{cases} 1 & \text{if } z > 0 \\ 0 & \text{otherwise} \end{cases}
  • Use case: Hidden layers (most common)

Tanh

\tanh(z) = \frac{e^z - e^{-z}}{e^z + e^{-z}}

Properties:

  • Output range: (-1, 1)
  • Derivative: \tanh'(z) = 1 - \tanh^2(z)
  • Use case: Hidden layers, LSTM gates

Softmax

For a vector \mathbf{z} with K elements:

\text{softmax}(z_i) = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}

Properties:

  • Output range: (0, 1), sums to 1
  • Use case: Multi-class classification output layer
python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def relu(z):
    return np.maximum(0, z)

def tanh(z):
    return np.tanh(z)

def softmax(z):
    exp_z = np.exp(z - np.max(z))  # Numerical stability
    return exp_z / exp_z.sum(axis=-1, keepdims=True)
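Backpropagation will also need the derivatives listed above. Here is a minimal sketch of them; the *_derivative names are our own additions, not part of the snippet above.

python
def sigmoid_derivative(z):
    s = sigmoid(z)
    return s * (1 - s)

def relu_derivative(z):
    return (z > 0).astype(float)

def tanh_derivative(z):
    return 1 - np.tanh(z) ** 2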

Forward Propagation

Forward propagation computes the output of the network layer by layer.

Layer-by-Layer Computation

For layer l:

\mathbf{Z}^{[l]} = \mathbf{W}^{[l]} \mathbf{A}^{[l-1]} + \mathbf{b}^{[l]}

\mathbf{A}^{[l]} = g^{[l]}(\mathbf{Z}^{[l]})

Where:

  • \mathbf{W}^{[l]} is the weight matrix of shape (n^{[l]}, n^{[l-1]})
  • \mathbf{b}^{[l]} is the bias vector of shape (n^{[l]}, 1)
  • g^{[l]} is the activation function for layer l
  • \mathbf{A}^{[0]} = \mathbf{X} (input)

Matrix Dimensions

For a batch of m training examples:

  • Input: \mathbf{X} \in \mathbb{R}^{n^{[0]} \times m}
  • Weights: \mathbf{W}^{[l]} \in \mathbb{R}^{n^{[l]} \times n^{[l-1]}}
  • Bias: \mathbf{b}^{[l]} \in \mathbb{R}^{n^{[l]} \times 1} (broadcast)
  • Output: \mathbf{A}^{[l]} \in \mathbb{R}^{n^{[l]} \times m}
python
def forward_propagation(X, parameters):
    """
    Compute forward pass for L-layer neural network.

    Parameters:
    X -- input data of shape (n_features, m_examples)
    parameters -- dict containing W1, b1, W2, b2, ..., WL, bL

    Returns:
    AL -- output of the last layer
    caches -- list of caches for backpropagation
    """
    caches = []
    A = X
    L = len(parameters) // 2

    # Hidden layers with ReLU
    for l in range(1, L):
        A_prev = A
        Z = np.dot(parameters[f'W{l}'], A_prev) + parameters[f'b{l}']
        A = relu(Z)
        caches.append((A_prev, Z))

    # Output layer with sigmoid
    ZL = np.dot(parameters[f'W{L}'], A) + parameters[f'b{L}']
    AL = sigmoid(ZL)
    caches.append((A, ZL))

    return AL, caches
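To see the dimension rules in action, here is a small shape check with made-up layer sizes (n^{[0]} = 2, n^{[1]} = 4, n^{[2]} = 1) and a batch of 5 examples; the scaled-random initialization here is for illustration only.

python
np.random.seed(0)
layer_dims = [2, 4, 1]           # hypothetical layer sizes
parameters = {}
for l in range(1, len(layer_dims)):
    parameters[f'W{l}'] = np.random.randn(layer_dims[l], layer_dims[l-1]) * 0.01
    parameters[f'b{l}'] = np.zeros((layer_dims[l], 1))

X = np.random.randn(2, 5)        # 5 training examples
AL, caches = forward_propagation(X, parameters)
print(AL.shape)                  # (1, 5): n[L] x m, as expected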

Loss Functions

Loss functions measure how well our predictions match the true values.

Binary Cross-Entropy Loss

For binary classification:

\mathcal{L}(\hat{y}, y) = -[y \log(\hat{y}) + (1-y) \log(1-\hat{y})]

Cost over m examples:

J = -\frac{1}{m} \sum_{i=1}^{m} [y^{(i)} \log(\hat{y}^{(i)}) + (1-y^{(i)}) \log(1-\hat{y}^{(i)})]

Mean Squared Error

For regression:

\mathcal{L}(\hat{y}, y) = \frac{1}{2}(\hat{y} - y)^2

J = \frac{1}{2m} \sum_{i=1}^{m} (\hat{y}^{(i)} - y^{(i)})^2

Categorical Cross-Entropy

For multi-class classification:

\mathcal{L}(\hat{\mathbf{y}}, \mathbf{y}) = -\sum_{k=1}^{K} y_k \log(\hat{y}_k)

python
def binary_cross_entropy(AL, Y):
    """
    Compute binary cross-entropy loss.

    Parameters:
    AL -- probability predictions, shape (1, m)
    Y -- true labels, shape (1, m)

    Returns:
    cost -- cross-entropy cost
    """
    m = Y.shape[1]
    cost = -np.sum(Y * np.log(AL + 1e-8) + (1 - Y) * np.log(1 - AL + 1e-8)) / m
    return cost
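The other two losses follow the same pattern. A brief sketch (the function names are our own), assuming AL and Y follow the shape conventions above and Y is one-hot encoded in the categorical case:

python
def mean_squared_error(AL, Y):
    # 1/(2m) times the sum of squared errors, matching the cost J above
    m = Y.shape[1]
    return np.sum((AL - Y) ** 2) / (2 * m)

def categorical_cross_entropy(AL, Y):
    # AL, Y of shape (K, m); Y is one-hot encoded
    m = Y.shape[1]
    return -np.sum(Y * np.log(AL + 1e-8)) / m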

Backpropagation

Backpropagation computes gradients using the chain rule, allowing us to update weights.

The Chain Rule

For a composite function J = J(a(z(w))):

\frac{\partial J}{\partial w} = \frac{\partial J}{\partial a} \cdot \frac{\partial a}{\partial z} \cdot \frac{\partial z}{\partial w}

Gradient Computation

Output layer gradients:

dZ^{[L]} = A^{[L]} - Y
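This compact form assumes a sigmoid output layer paired with binary cross-entropy (the same pairing used in the code throughout this guide); the sigmoid derivative cancels against the loss derivative:

dZ^{[L]} = \frac{\partial \mathcal{L}}{\partial A^{[L]}} \cdot \frac{\partial A^{[L]}}{\partial Z^{[L]}} = \left( -\frac{Y}{A^{[L]}} + \frac{1-Y}{1-A^{[L]}} \right) \cdot A^{[L]}(1 - A^{[L]}) = A^{[L]} - Y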

Hidden layer gradients:

dZ^{[l]} = dA^{[l]} * g'^{[l]}(Z^{[l]})

Where * denotes element-wise multiplication.

Weight and bias gradients:

dW^{[l]} = \frac{1}{m} dZ^{[l]} \cdot (A^{[l-1]})^T

db^{[l]} = \frac{1}{m} \sum_{i=1}^{m} dZ^{[l](i)}

Propagate to previous layer:

dA^{[l-1]} = (W^{[l]})^T \cdot dZ^{[l]}

python
def backward_propagation(AL, Y, caches, parameters):
    """
    Implement backpropagation for L-layer neural network.

    Parameters:
    AL -- output of forward propagation
    Y -- true labels
    caches -- list of caches from forward propagation
    parameters -- dict containing weights and biases

    Returns:
    grads -- dictionary with gradients
    """
    grads = {}
    L = len(caches)
    m = AL.shape[1]
    Y = Y.reshape(AL.shape)

    # Output layer
    dAL = -(np.divide(Y, AL + 1e-8) - np.divide(1 - Y, 1 - AL + 1e-8))
    A_prev, Z = caches[L-1]
    dZ = dAL * sigmoid(Z) * (1 - sigmoid(Z))
    grads[f'dW{L}'] = np.dot(dZ, A_prev.T) / m
    grads[f'db{L}'] = np.sum(dZ, axis=1, keepdims=True) / m
    dA_prev = np.dot(parameters[f'W{L}'].T, dZ)

    # Hidden layers
    for l in reversed(range(1, L)):
        A_prev, Z = caches[l-1]
        dZ = dA_prev * (Z > 0)  # ReLU derivative
        grads[f'dW{l}'] = np.dot(dZ, A_prev.T) / m
        grads[f'db{l}'] = np.sum(dZ, axis=1, keepdims=True) / m
        dA_prev = np.dot(parameters[f'W{l}'].T, dZ)

    return grads
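A useful sanity check for these gradients is a centered finite-difference comparison. The sketch below (the gradient_check name and epsilon value are our own choices) perturbs a single weight entry and compares the numerical slope with the analytic gradient.

python
def gradient_check(X, Y, parameters, grads, eps=1e-7):
    # Compare analytic dW1[0, 0] against a centered finite difference.
    # Checking one entry keeps the sketch short; a full check loops over all parameters.
    W1 = parameters['W1']

    W1[0, 0] += eps
    cost_plus = binary_cross_entropy(forward_propagation(X, parameters)[0], Y)
    W1[0, 0] -= 2 * eps
    cost_minus = binary_cross_entropy(forward_propagation(X, parameters)[0], Y)
    W1[0, 0] += eps  # restore the original weight

    numeric = (cost_plus - cost_minus) / (2 * eps)
    analytic = grads['dW1'][0, 0]
    print(f"numeric: {numeric:.6e}, analytic: {analytic:.6e}")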

Implementation

Let's put it all together with a complete neural network implementation.

python
import numpy as np

class NeuralNetwork:
    def __init__(self, layer_dims):
        """
        Initialize neural network.

        Parameters:
        layer_dims -- list of layer dimensions [n_input, n_hidden1, ..., n_output]
        """
        self.parameters = {}
        self.L = len(layer_dims) - 1

        for l in range(1, self.L + 1):
            # He initialization
            self.parameters[f'W{l}'] = np.random.randn(
                layer_dims[l], layer_dims[l-1]
            ) * np.sqrt(2 / layer_dims[l-1])
            self.parameters[f'b{l}'] = np.zeros((layer_dims[l], 1))

    def forward(self, X):
        self.caches = []
        A = X

        for l in range(1, self.L):
            A_prev = A
            Z = np.dot(self.parameters[f'W{l}'], A_prev) + self.parameters[f'b{l}']
            A = np.maximum(0, Z)  # ReLU
            self.caches.append((A_prev, Z))

        # Output layer
        ZL = np.dot(self.parameters[f'W{self.L}'], A) + self.parameters[f'b{self.L}']
        AL = 1 / (1 + np.exp(-ZL))  # Sigmoid
        self.caches.append((A, ZL))

        return AL

    def backward(self, AL, Y):
        grads = {}
        m = AL.shape[1]

        # Output layer
        dZ = AL - Y
        A_prev, _ = self.caches[self.L - 1]
        grads[f'dW{self.L}'] = np.dot(dZ, A_prev.T) / m
        grads[f'db{self.L}'] = np.sum(dZ, axis=1, keepdims=True) / m
        dA = np.dot(self.parameters[f'W{self.L}'].T, dZ)

        # Hidden layers
        for l in reversed(range(1, self.L)):
            A_prev, Z = self.caches[l - 1]
            dZ = dA * (Z > 0)
            grads[f'dW{l}'] = np.dot(dZ, A_prev.T) / m
            grads[f'db{l}'] = np.sum(dZ, axis=1, keepdims=True) / m
            dA = np.dot(self.parameters[f'W{l}'].T, dZ)

        return grads

    def update(self, grads, learning_rate):
        for l in range(1, self.L + 1):
            self.parameters[f'W{l}'] -= learning_rate * grads[f'dW{l}']
            self.parameters[f'b{l}'] -= learning_rate * grads[f'db{l}']

    def train(self, X, Y, epochs, learning_rate=0.01):
        costs = []

        for epoch in range(epochs):
            AL = self.forward(X)
            cost = -np.mean(Y * np.log(AL + 1e-8) + (1 - Y) * np.log(1 - AL + 1e-8))
            grads = self.backward(AL, Y)
            self.update(grads, learning_rate)

            if epoch % 100 == 0:
                costs.append(cost)
                print(f"Epoch {epoch}: Cost = {cost:.4f}")

        return costs

# Example usage
if __name__ == "__main__":
    np.random.seed(42)

    # Create sample data
    X = np.random.randn(2, 1000)
    Y = ((X[0] ** 2 + X[1] ** 2) < 1).astype(float).reshape(1, -1)

    # Train network
    nn = NeuralNetwork([2, 16, 8, 1])
    costs = nn.train(X, Y, epochs=1000, learning_rate=0.1)
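Once training finishes, prediction is just a forward pass followed by a 0.5 threshold. The predict helper and accuracy computation below are our own additions, not part of the class above, and assume the nn, X, and Y from the example usage:

python
def predict(nn, X, threshold=0.5):
    # Forward pass, then threshold the sigmoid output to get class labels
    probs = nn.forward(X)
    return (probs >= threshold).astype(float)

# Training-set accuracy for the example above
preds = predict(nn, X)
accuracy = np.mean(preds == Y)
print(f"Training accuracy: {accuracy:.3f}")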

Key Takeaways

  1. Perceptrons are the basic units - weighted sum followed by activation
  2. Activation functions introduce non-linearity (ReLU for hidden, sigmoid/softmax for output)
  3. Forward propagation computes predictions layer by layer
  4. Loss functions measure prediction quality (cross-entropy for classification, MSE for regression)
  5. Backpropagation computes gradients using the chain rule
  6. He initialization prevents vanishing/exploding gradients

Understanding these fundamentals is crucial before moving to frameworks like TensorFlow or PyTorch. The math doesn't change - only the implementation details!

January 15, 2025 · 25 min read
#Neural Networks #Mathematics #Deep Learning #Python

Written by

TheMLTrader

Quantitative researcher and ML engineer with 10+ years of experience in algorithmic trading. Specializing in deep learning for financial markets.

Ready to Apply These Concepts?

Our courses provide hands-on implementation of these ML concepts with real trading data and production-ready code.