
Neural Networks from Scratch: The Complete Mathematical Guide

Understanding the mathematics behind neural networks, from perceptrons to backpropagation. Build your intuition with derivations and code.


Neural networks are the backbone of modern machine learning. In this comprehensive guide, we'll build our understanding from the ground up, starting with the mathematics and implementing everything in Python.

Table of Contents

  1. The Perceptron
  2. Activation Functions
  3. Forward Propagation
  4. Loss Functions
  5. Backpropagation
  6. Implementation

The Perceptron

The perceptron is the fundamental building block of neural networks. It takes multiple inputs, applies weights, adds a bias, and produces an output through an activation function.

Mathematical Definition

For a single perceptron with n inputs:

z = \sum_{i=1}^{n} w_i x_i + b = \mathbf{w}^T \mathbf{x} + b

Where:

  • \mathbf{x} = [x_1, x_2, ..., x_n]^T is the input vector
  • \mathbf{w} = [w_1, w_2, ..., w_n]^T is the weight vector
  • b is the bias term
  • z is the pre-activation (weighted sum)

The output is then:

a = \sigma(z)

Where \sigma is the activation function.
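To make this concrete, here is a minimal NumPy sketch of a single perceptron with a sigmoid activation; the weights, inputs, and bias are arbitrary example values, not taken from any dataset.

python
import numpy as np

w = np.array([0.5, -0.3, 0.8])   # example weight vector
x = np.array([1.0, 2.0, 0.5])    # example input vector
b = 0.1                          # example bias

z = np.dot(w, x) + b             # pre-activation: w^T x + b
a = 1 / (1 + np.exp(-z))         # sigmoid activation
print(z, a)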


Activation Functions

Activation functions introduce non-linearity into the network, enabling it to learn complex patterns.

Sigmoid Function

\sigma(z) = \frac{1}{1 + e^{-z}}

Properties:

  • Output range: (0, 1)
  • Derivative: \sigma'(z) = \sigma(z)(1 - \sigma(z))
  • Use case: Binary classification output layer

ReLU (Rectified Linear Unit)

\text{ReLU}(z) = \max(0, z)

Properties:

  • Output range: [0, \infty)
  • Derivative: \text{ReLU}'(z) = \begin{cases} 1 & \text{if } z > 0 \\ 0 & \text{otherwise} \end{cases}
  • Use case: Hidden layers (most common)

Tanh

\tanh(z) = \frac{e^z - e^{-z}}{e^z + e^{-z}}

Properties:

  • Output range: (-1, 1)
  • Derivative: \tanh'(z) = 1 - \tanh^2(z)
  • Use case: Hidden layers, LSTM gates

Softmax

For a vector \mathbf{z} with K elements:

\text{softmax}(z_i) = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}

Properties:

  • Output range: (0, 1), sums to 1
  • Use case: Multi-class classification output layer
python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def relu(z):
    return np.maximum(0, z)

def tanh(z):
    return np.tanh(z)

def softmax(z):
    exp_z = np.exp(z - np.max(z))  # Numerical stability
    return exp_z / exp_z.sum(axis=-1, keepdims=True)
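Backpropagation will also need the derivatives listed above. Here is a minimal sketch of them; the *_derivative names are our own additions, not part of the snippet above.

python
def sigmoid_derivative(z):
    s = sigmoid(z)
    return s * (1 - s)

def relu_derivative(z):
    return (z > 0).astype(float)

def tanh_derivative(z):
    return 1 - np.tanh(z) ** 2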

Forward Propagation

Forward propagation computes the output of the network layer by layer.

Layer-by-Layer Computation

For layer l:

\mathbf{Z}^{[l]} = \mathbf{W}^{[l]} \mathbf{A}^{[l-1]} + \mathbf{b}^{[l]}

\mathbf{A}^{[l]} = g^{[l]}(\mathbf{Z}^{[l]})

Where:

  • \mathbf{W}^{[l]} is the weight matrix of shape (n^{[l]}, n^{[l-1]})
  • \mathbf{b}^{[l]} is the bias vector of shape (n^{[l]}, 1)
  • g^{[l]} is the activation function for layer l
  • \mathbf{A}^{[0]} = \mathbf{X} (input)

Matrix Dimensions

For a batch of m training examples:

  • Input: \mathbf{X} \in \mathbb{R}^{n^{[0]} \times m}
  • Weights: \mathbf{W}^{[l]} \in \mathbb{R}^{n^{[l]} \times n^{[l-1]}}
  • Bias: \mathbf{b}^{[l]} \in \mathbb{R}^{n^{[l]} \times 1} (broadcast)
  • Output: \mathbf{A}^{[l]} \in \mathbb{R}^{n^{[l]} \times m}
python
def forward_propagation(X, parameters):
    """
    Compute forward pass for L-layer neural network.

    Parameters:
    X -- input data of shape (n_features, m_examples)
    parameters -- dict containing W1, b1, W2, b2, ..., WL, bL

    Returns:
    AL -- output of the last layer
    caches -- list of caches for backpropagation
    """
    caches = []
    A = X
    L = len(parameters) // 2

    # Hidden layers with ReLU
    for l in range(1, L):
        A_prev = A
        Z = np.dot(parameters[f'W{l}'], A_prev) + parameters[f'b{l}']
        A = relu(Z)
        caches.append((A_prev, Z))

    # Output layer with sigmoid
    ZL = np.dot(parameters[f'W{L}'], A) + parameters[f'b{L}']
    AL = sigmoid(ZL)
    caches.append((A, ZL))

    return AL, caches
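To see the dimension rules in action, here is a small shape check with made-up layer sizes (n^{[0]} = 2, n^{[1]} = 4, n^{[2]} = 1) and a batch of 5 examples; the scaled-random initialization here is for illustration only.

python
np.random.seed(0)
layer_dims = [2, 4, 1]           # hypothetical layer sizes
parameters = {}
for l in range(1, len(layer_dims)):
    parameters[f'W{l}'] = np.random.randn(layer_dims[l], layer_dims[l-1]) * 0.01
    parameters[f'b{l}'] = np.zeros((layer_dims[l], 1))

X = np.random.randn(2, 5)        # 5 training examples
AL, caches = forward_propagation(X, parameters)
print(AL.shape)                  # (1, 5): n[L] x m, as expected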

Loss Functions

Loss functions measure how well our predictions match the true values.

Binary Cross-Entropy Loss

For binary classification:

\mathcal{L}(\hat{y}, y) = -[y \log(\hat{y}) + (1-y) \log(1-\hat{y})]

Cost over m examples:

J = -\frac{1}{m} \sum_{i=1}^{m} [y^{(i)} \log(\hat{y}^{(i)}) + (1-y^{(i)}) \log(1-\hat{y}^{(i)})]

Mean Squared Error

For regression:

\mathcal{L}(\hat{y}, y) = \frac{1}{2}(\hat{y} - y)^2

J = \frac{1}{2m} \sum_{i=1}^{m} (\hat{y}^{(i)} - y^{(i)})^2

Categorical Cross-Entropy

For multi-class classification:

\mathcal{L}(\hat{\mathbf{y}}, \mathbf{y}) = -\sum_{k=1}^{K} y_k \log(\hat{y}_k)

python
def binary_cross_entropy(AL, Y):
    """
    Compute binary cross-entropy loss.

    Parameters:
    AL -- probability predictions, shape (1, m)
    Y -- true labels, shape (1, m)

    Returns:
    cost -- cross-entropy cost
    """
    m = Y.shape[1]
    cost = -np.sum(Y * np.log(AL + 1e-8) + (1 - Y) * np.log(1 - AL + 1e-8)) / m
    return cost
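The other two losses follow the same pattern. A brief sketch (the function names are our own), assuming AL and Y follow the shape conventions above and Y is one-hot encoded in the categorical case:

python
def mean_squared_error(AL, Y):
    # 1/(2m) times the sum of squared errors, matching the cost J above
    m = Y.shape[1]
    return np.sum((AL - Y) ** 2) / (2 * m)

def categorical_cross_entropy(AL, Y):
    # AL, Y of shape (K, m); Y is one-hot encoded
    m = Y.shape[1]
    return -np.sum(Y * np.log(AL + 1e-8)) / m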

Backpropagation

Backpropagation computes gradients using the chain rule, allowing us to update weights.

The Chain Rule

For a composite function J = J(a(z(w))):

\frac{\partial J}{\partial w} = \frac{\partial J}{\partial a} \cdot \frac{\partial a}{\partial z} \cdot \frac{\partial z}{\partial w}

Gradient Computation

Output layer gradients:

dZ^{[L]} = A^{[L]} - Y
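This compact form assumes a sigmoid output layer paired with binary cross-entropy (the same pairing used in the code throughout this guide); the sigmoid derivative cancels against the loss derivative:

dZ^{[L]} = \frac{\partial \mathcal{L}}{\partial A^{[L]}} \cdot \frac{\partial A^{[L]}}{\partial Z^{[L]}} = \left( -\frac{Y}{A^{[L]}} + \frac{1-Y}{1-A^{[L]}} \right) \cdot A^{[L]}(1 - A^{[L]}) = A^{[L]} - Y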

Hidden layer gradients:

dZ^{[l]} = dA^{[l]} * g'^{[l]}(Z^{[l]})

Where * denotes element-wise multiplication.

Weight and bias gradients:

dW^{[l]} = \frac{1}{m} dZ^{[l]} \cdot (A^{[l-1]})^T

db^{[l]} = \frac{1}{m} \sum_{i=1}^{m} dZ^{[l](i)}

Propagate to previous layer:

dA^{[l-1]} = (W^{[l]})^T \cdot dZ^{[l]}

python
def backward_propagation(AL, Y, caches, parameters):
    """
    Implement backpropagation for L-layer neural network.

    Parameters:
    AL -- output of forward propagation
    Y -- true labels
    caches -- list of caches from forward propagation
    parameters -- dict containing weights and biases

    Returns:
    grads -- dictionary with gradients
    """
    grads = {}
    L = len(caches)
    m = AL.shape[1]
    Y = Y.reshape(AL.shape)

    # Output layer
    dAL = -(np.divide(Y, AL + 1e-8) - np.divide(1 - Y, 1 - AL + 1e-8))
    A_prev, Z = caches[L-1]
    dZ = dAL * sigmoid(Z) * (1 - sigmoid(Z))
    grads[f'dW{L}'] = np.dot(dZ, A_prev.T) / m
    grads[f'db{L}'] = np.sum(dZ, axis=1, keepdims=True) / m
    dA_prev = np.dot(parameters[f'W{L}'].T, dZ)

    # Hidden layers
    for l in reversed(range(1, L)):
        A_prev, Z = caches[l-1]
        dZ = dA_prev * (Z > 0)  # ReLU derivative
        grads[f'dW{l}'] = np.dot(dZ, A_prev.T) / m
        grads[f'db{l}'] = np.sum(dZ, axis=1, keepdims=True) / m
        dA_prev = np.dot(parameters[f'W{l}'].T, dZ)

    return grads
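A useful sanity check for these gradients is a centered finite-difference comparison. The sketch below (the gradient_check name and epsilon value are our own choices) perturbs a single weight entry and compares the numerical slope with the analytic gradient.

python
def gradient_check(X, Y, parameters, grads, eps=1e-7):
    # Compare analytic dW1[0, 0] against a centered finite difference.
    # Checking one entry keeps the sketch short; a full check loops over all parameters.
    W1 = parameters['W1']

    W1[0, 0] += eps
    cost_plus = binary_cross_entropy(forward_propagation(X, parameters)[0], Y)
    W1[0, 0] -= 2 * eps
    cost_minus = binary_cross_entropy(forward_propagation(X, parameters)[0], Y)
    W1[0, 0] += eps  # restore the original weight

    numeric = (cost_plus - cost_minus) / (2 * eps)
    analytic = grads['dW1'][0, 0]
    print(f"numeric: {numeric:.6e}, analytic: {analytic:.6e}")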

Implementation

Let's put it all together with a complete neural network implementation.

python
import numpy as np

class NeuralNetwork:
    def __init__(self, layer_dims):
        """
        Initialize neural network.

        Parameters:
        layer_dims -- list of layer dimensions [n_input, n_hidden1, ..., n_output]
        """
        self.parameters = {}
        self.L = len(layer_dims) - 1

        for l in range(1, self.L + 1):
            # He initialization
            self.parameters[f'W{l}'] = np.random.randn(
                layer_dims[l], layer_dims[l-1]
            ) * np.sqrt(2 / layer_dims[l-1])
            self.parameters[f'b{l}'] = np.zeros((layer_dims[l], 1))

    def forward(self, X):
        self.caches = []
        A = X

        for l in range(1, self.L):
            A_prev = A
            Z = np.dot(self.parameters[f'W{l}'], A_prev) + self.parameters[f'b{l}']
            A = np.maximum(0, Z)  # ReLU
            self.caches.append((A_prev, Z))

        # Output layer
        ZL = np.dot(self.parameters[f'W{self.L}'], A) + self.parameters[f'b{self.L}']
        AL = 1 / (1 + np.exp(-ZL))  # Sigmoid
        self.caches.append((A, ZL))

        return AL

    def backward(self, AL, Y):
        grads = {}
        m = AL.shape[1]

        # Output layer
        dZ = AL - Y
        A_prev, _ = self.caches[self.L - 1]
        grads[f'dW{self.L}'] = np.dot(dZ, A_prev.T) / m
        grads[f'db{self.L}'] = np.sum(dZ, axis=1, keepdims=True) / m
        dA = np.dot(self.parameters[f'W{self.L}'].T, dZ)

        # Hidden layers
        for l in reversed(range(1, self.L)):
            A_prev, Z = self.caches[l - 1]
            dZ = dA * (Z > 0)
            grads[f'dW{l}'] = np.dot(dZ, A_prev.T) / m
            grads[f'db{l}'] = np.sum(dZ, axis=1, keepdims=True) / m
            dA = np.dot(self.parameters[f'W{l}'].T, dZ)

        return grads

    def update(self, grads, learning_rate):
        for l in range(1, self.L + 1):
            self.parameters[f'W{l}'] -= learning_rate * grads[f'dW{l}']
            self.parameters[f'b{l}'] -= learning_rate * grads[f'db{l}']

    def train(self, X, Y, epochs, learning_rate=0.01):
        costs = []

        for epoch in range(epochs):
            AL = self.forward(X)
            cost = -np.mean(Y * np.log(AL + 1e-8) + (1 - Y) * np.log(1 - AL + 1e-8))
            grads = self.backward(AL, Y)
            self.update(grads, learning_rate)

            if epoch % 100 == 0:
                costs.append(cost)
                print(f"Epoch {epoch}: Cost = {cost:.4f}")

        return costs

# Example usage
if __name__ == "__main__":
    np.random.seed(42)

    # Create sample data
    X = np.random.randn(2, 1000)
    Y = ((X[0] ** 2 + X[1] ** 2) < 1).astype(float).reshape(1, -1)

    # Train network
    nn = NeuralNetwork([2, 16, 8, 1])
    costs = nn.train(X, Y, epochs=1000, learning_rate=0.1)
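Once training finishes, prediction is just a forward pass followed by a 0.5 threshold. The predict helper and accuracy computation below are our own additions, not part of the class above, and assume the nn, X, and Y from the example usage:

python
def predict(nn, X, threshold=0.5):
    # Forward pass, then threshold the sigmoid output to get class labels
    probs = nn.forward(X)
    return (probs >= threshold).astype(float)

# Training-set accuracy for the example above
preds = predict(nn, X)
accuracy = np.mean(preds == Y)
print(f"Training accuracy: {accuracy:.3f}")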

Key Takeaways

  1. Perceptrons are the basic units - weighted sum followed by activation
  2. Activation functions introduce non-linearity (ReLU for hidden, sigmoid/softmax for output)
  3. Forward propagation computes predictions layer by layer
  4. Loss functions measure prediction quality (cross-entropy for classification, MSE for regression)
  5. Backpropagation computes gradients using the chain rule
  6. He initialization prevents vanishing/exploding gradients

Understanding these fundamentals is crucial before moving to frameworks like TensorFlow or PyTorch. The math doesn't change - only the implementation details!

January 15, 2025 · 25 min read
#Neural Networks #Mathematics #Deep Learning #Python

Written by

TheMLTrader

Quantitative researcher and ML engineer with 10+ years of experience in algorithmic trading. Specializing in deep learning for financial markets.

Ready to Apply These Concepts?

Our courses provide hands-on implementation of these ML concepts with real trading data and production-ready code.