Neural Networks from Scratch: The Complete Mathematical Guide
Neural networks are the backbone of modern machine learning. In this comprehensive guide, we'll build our understanding from the ground up, starting with the mathematics and implementing everything in Python.
Table of Contents
- The Perceptron
- Activation Functions
- Forward Propagation
- Loss Functions
- Backpropagation
- Implementation
The Perceptron
The perceptron is the fundamental building block of neural networks. It takes multiple inputs, applies weights, adds a bias, and produces an output through an activation function.
Mathematical Definition
For a single perceptron with $n$ inputs:
$$z = \mathbf{w}^\top \mathbf{x} + b = \sum_{i=1}^{n} w_i x_i + b$$
Where:
- $\mathbf{x} \in \mathbb{R}^n$ is the input vector
- $\mathbf{w} \in \mathbb{R}^n$ is the weight vector
- $b$ is the bias term
- $z$ is the pre-activation (weighted sum)
The output is then:
$$a = g(z)$$
Where $g$ is the activation function.
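To make this concrete, here is a minimal sketch that evaluates one perceptron on a small made-up input; the numbers and the choice of sigmoid as $g$ are purely illustrative.
```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Illustrative values (not from the article): three inputs, arbitrary weights and bias
x = np.array([0.5, -1.0, 2.0])   # input vector
w = np.array([0.2, 0.4, -0.1])   # weight vector
b = 0.3                          # bias term

z = np.dot(w, x) + b             # pre-activation: 0.1 - 0.4 - 0.2 + 0.3 = -0.2
a = sigmoid(z)                   # output through the activation function, ~0.45
print(z, a)
```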
Activation Functions
Activation functions introduce non-linearity into the network, enabling it to learn complex patterns.
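Without an activation function, stacking layers adds no expressive power: a composition of linear maps is still a single linear map. A quick sketch of this fact, with arbitrary illustrative sizes and values:
```python
import numpy as np

rng = np.random.default_rng(0)

# Two weight matrices applied with no activation in between
W1 = rng.standard_normal((4, 3))
W2 = rng.standard_normal((2, 4))
x = rng.standard_normal(3)

two_layers_no_activation = W2 @ (W1 @ x)
single_equivalent_layer = (W2 @ W1) @ x
print(np.allclose(two_layers_no_activation, single_equivalent_layer))  # True
```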
Sigmoid Function
$$\sigma(z) = \frac{1}{1 + e^{-z}}$$
Properties:
- Output range: $(0, 1)$
- Derivative: $\sigma'(z) = \sigma(z)\,(1 - \sigma(z))$
- Use case: Binary classification output layer
ReLU (Rectified Linear Unit)
$$\mathrm{ReLU}(z) = \max(0, z)$$
Properties:
- Output range: $[0, \infty)$
- Derivative: $1$ if $z > 0$, otherwise $0$
- Use case: Hidden layers (most common)
Tanh
$$\tanh(z) = \frac{e^{z} - e^{-z}}{e^{z} + e^{-z}}$$
Properties:
- Output range: $(-1, 1)$
- Derivative: $1 - \tanh^2(z)$
- Use case: Hidden layers, LSTM gates
Softmax
For a vector $\mathbf{z}$ with $K$ elements:
$$\mathrm{softmax}(z_i) = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}$$
Properties:
- Output range: $(0, 1)$, sums to 1
- Use case: Multi-class classification output layer
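The forward versions of these functions are implemented just below. Because backpropagation also needs the derivatives listed above, here is a companion sketch of them as well; these helper names are my own additions, not part of the article's code.
```python
import numpy as np

def sigmoid_derivative(z):
    s = 1 / (1 + np.exp(-z))
    return s * (1 - s)            # sigma(z) * (1 - sigma(z))

def relu_derivative(z):
    return (z > 0).astype(float)  # 1 where z > 0, else 0

def tanh_derivative(z):
    return 1 - np.tanh(z) ** 2    # 1 - tanh^2(z)
```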
```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def relu(z):
    return np.maximum(0, z)

def tanh(z):
    return np.tanh(z)

def softmax(z):
    exp_z = np.exp(z - np.max(z))  # Numerical stability
    return exp_z / exp_z.sum(axis=-1, keepdims=True)
```
Forward Propagation
Forward propagation computes the output of the network layer by layer.
Layer-by-Layer Computation
For layer $l$:
$$Z^{[l]} = W^{[l]} A^{[l-1]} + b^{[l]}, \qquad A^{[l]} = g^{[l]}\!\left(Z^{[l]}\right)$$
Where:
- $W^{[l]}$ is the weight matrix of shape $(n^{[l]}, n^{[l-1]})$
- $b^{[l]}$ is the bias vector of shape $(n^{[l]}, 1)$
- $g^{[l]}$ is the activation function for layer $l$
- $A^{[0]} = X$ (input)
Matrix Dimensions
For a batch of $m$ training examples:
- Input: $X$ of shape $(n^{[0]}, m)$
- Weights: $W^{[l]}$ of shape $(n^{[l]}, n^{[l-1]})$
- Bias: $b^{[l]}$ of shape $(n^{[l]}, 1)$ (broadcast across the batch)
- Output: $A^{[l]}$ of shape $(n^{[l]}, m)$
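To make these dimensions concrete, here is a quick shape check for a toy 3 → 4 → 1 network on a batch of 5 examples (the sizes are arbitrary); the full forward pass implementation follows.
```python
import numpy as np

n0, n1, n2, m = 3, 4, 1, 5                  # illustrative layer sizes and batch size
X = np.random.randn(n0, m)                  # (3, 5)
W1, b1 = np.random.randn(n1, n0), np.zeros((n1, 1))
W2, b2 = np.random.randn(n2, n1), np.zeros((n2, 1))

A1 = np.maximum(0, W1 @ X + b1)             # ReLU hidden layer -> (4, 5)
A2 = 1 / (1 + np.exp(-(W2 @ A1 + b2)))      # sigmoid output    -> (1, 5)
print(A1.shape, A2.shape)                   # (4, 5) (1, 5)
```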
```python
def forward_propagation(X, parameters):
    """
    Compute forward pass for L-layer neural network.

    Parameters:
    X -- input data of shape (n_features, m_examples)
    parameters -- dict containing W1, b1, W2, b2, ..., WL, bL

    Returns:
    AL -- output of the last layer
    caches -- list of caches for backpropagation
    """
    caches = []
    A = X
    L = len(parameters) // 2

    # Hidden layers with ReLU
    for l in range(1, L):
        A_prev = A
        Z = np.dot(parameters[f'W{l}'], A_prev) + parameters[f'b{l}']
        A = relu(Z)
        caches.append((A_prev, Z))

    # Output layer with sigmoid
    ZL = np.dot(parameters[f'W{L}'], A) + parameters[f'b{L}']
    AL = sigmoid(ZL)
    caches.append((A, ZL))

    return AL, caches
```
Loss Functions
Loss functions measure how well our predictions match the true values.
Binary Cross-Entropy Loss
For binary classification:
$$\mathcal{L}(\hat{y}, y) = -\left[\, y \log \hat{y} + (1 - y) \log (1 - \hat{y}) \,\right]$$
Cost over $m$ examples:
$$J = \frac{1}{m} \sum_{i=1}^{m} \mathcal{L}\!\left(\hat{y}^{(i)}, y^{(i)}\right)$$
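As a quick sanity check with made-up numbers: a confident correct prediction costs little, while a confident wrong one is penalized heavily.
```python
import numpy as np

y = 1.0  # true label
print(-(y * np.log(0.9) + (1 - y) * np.log(1 - 0.9)))  # ~0.105: predicted 0.9, correct
print(-(y * np.log(0.1) + (1 - y) * np.log(1 - 0.1)))  # ~2.303: predicted 0.1, wrong
```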
Mean Squared Error
For regression:
$$J = \frac{1}{m} \sum_{i=1}^{m} \left(\hat{y}^{(i)} - y^{(i)}\right)^2$$
Categorical Cross-Entropy
For multi-class classification with $K$ classes:
$$J = -\frac{1}{m} \sum_{i=1}^{m} \sum_{k=1}^{K} y_k^{(i)} \log \hat{y}_k^{(i)}$$
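The binary cross-entropy cost is implemented below. For completeness, here is a sketch of the other two losses under the same shape conventions; these helpers are illustrative additions, not part of the article's code.
```python
import numpy as np

def mean_squared_error(AL, Y):
    """MSE for regression. AL, Y: predictions and targets of shape (1, m)."""
    m = Y.shape[1]
    return np.sum((AL - Y) ** 2) / m

def categorical_cross_entropy(AL, Y):
    """Categorical cross-entropy. AL: softmax outputs, Y: one-hot labels, both (K, m)."""
    m = Y.shape[1]
    return -np.sum(Y * np.log(AL + 1e-8)) / m
```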
```python
def binary_cross_entropy(AL, Y):
    """
    Compute binary cross-entropy loss.

    Parameters:
    AL -- probability predictions, shape (1, m)
    Y -- true labels, shape (1, m)

    Returns:
    cost -- cross-entropy cost
    """
    m = Y.shape[1]
    cost = -np.sum(Y * np.log(AL + 1e-8) + (1 - Y) * np.log(1 - AL + 1e-8)) / m
    return cost
```
Backpropagation
Backpropagation computes gradients using the chain rule, allowing us to update weights.
The Chain Rule
For a composite function $f(g(x))$:
$$\frac{df}{dx} = \frac{df}{dg} \cdot \frac{dg}{dx}$$
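As a concrete instance (my own worked example), take a single sigmoid neuron $a = \sigma(wx + b)$ and differentiate its output with respect to the weight:
$$\frac{\partial a}{\partial w} = \frac{\partial a}{\partial z} \cdot \frac{\partial z}{\partial w} = \sigma(z)\bigl(1 - \sigma(z)\bigr) \cdot x, \qquad z = wx + b$$
Backpropagation applies exactly this pattern layer by layer, reusing the inner derivatives as it moves backward through the network.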
Gradient Computation
Output layer gradients (sigmoid output with binary cross-entropy):
$$dA^{[L]} = -\left(\frac{Y}{A^{[L]}} - \frac{1 - Y}{1 - A^{[L]}}\right), \qquad dZ^{[L]} = dA^{[L]} \odot \sigma'\!\left(Z^{[L]}\right) = A^{[L]} - Y$$
Hidden layer gradients:
$$dZ^{[l]} = dA^{[l]} \odot g^{[l]\prime}\!\left(Z^{[l]}\right)$$
Where $\odot$ denotes element-wise multiplication.
Weight and bias gradients:
$$dW^{[l]} = \frac{1}{m}\, dZ^{[l]} \left(A^{[l-1]}\right)^{\top}, \qquad db^{[l]} = \frac{1}{m} \sum_{i=1}^{m} dZ^{[l](i)}$$
Propagate to previous layer:
$$dA^{[l-1]} = \left(W^{[l]}\right)^{\top} dZ^{[l]}$$
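A standard way to verify these formulas is numerical gradient checking: nudge one parameter, measure the change in the cost, and compare with the analytic gradient returned by the backpropagation routine below. A minimal sketch, assuming the forward_propagation and binary_cross_entropy functions defined earlier:
```python
def numerical_gradient(X, Y, parameters, name, i, j, eps=1e-5):
    """Finite-difference estimate of d(cost)/d(parameters[name][i, j])."""
    original = parameters[name][i, j]

    parameters[name][i, j] = original + eps
    cost_plus = binary_cross_entropy(forward_propagation(X, parameters)[0], Y)

    parameters[name][i, j] = original - eps
    cost_minus = binary_cross_entropy(forward_propagation(X, parameters)[0], Y)

    parameters[name][i, j] = original  # restore the parameter
    return (cost_plus - cost_minus) / (2 * eps)

# For a correct implementation, this estimate should closely match
# grads[f'dW{l}'][i, j] from backward_propagation (typically to within ~1e-6).
```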
```python
def backward_propagation(AL, Y, caches, parameters):
    """
    Implement backpropagation for L-layer neural network.

    Parameters:
    AL -- output of forward propagation
    Y -- true labels
    caches -- list of caches from forward propagation
    parameters -- dict containing weights and biases

    Returns:
    grads -- dictionary with gradients
    """
    grads = {}
    L = len(caches)
    m = AL.shape[1]
    Y = Y.reshape(AL.shape)

    # Output layer
    dAL = -(np.divide(Y, AL + 1e-8) - np.divide(1 - Y, 1 - AL + 1e-8))
    A_prev, Z = caches[L-1]
    dZ = dAL * sigmoid(Z) * (1 - sigmoid(Z))
    grads[f'dW{L}'] = np.dot(dZ, A_prev.T) / m
    grads[f'db{L}'] = np.sum(dZ, axis=1, keepdims=True) / m
    dA_prev = np.dot(parameters[f'W{L}'].T, dZ)

    # Hidden layers
    for l in reversed(range(1, L)):
        A_prev, Z = caches[l-1]
        dZ = dA_prev * (Z > 0)  # ReLU derivative
        grads[f'dW{l}'] = np.dot(dZ, A_prev.T) / m
        grads[f'db{l}'] = np.sum(dZ, axis=1, keepdims=True) / m
        dA_prev = np.dot(parameters[f'W{l}'].T, dZ)

    return grads
```
Implementation
Let's put it all together with a complete neural network implementation.
```python
import numpy as np

class NeuralNetwork:
    def __init__(self, layer_dims):
        """
        Initialize neural network.

        Parameters:
        layer_dims -- list of layer dimensions [n_input, n_hidden1, ..., n_output]
        """
        self.parameters = {}
        self.L = len(layer_dims) - 1

        for l in range(1, self.L + 1):
            # He initialization
            self.parameters[f'W{l}'] = np.random.randn(
                layer_dims[l], layer_dims[l-1]
            ) * np.sqrt(2 / layer_dims[l-1])
            self.parameters[f'b{l}'] = np.zeros((layer_dims[l], 1))

    def forward(self, X):
        self.caches = []
        A = X

        for l in range(1, self.L):
            A_prev = A
            Z = np.dot(self.parameters[f'W{l}'], A_prev) + self.parameters[f'b{l}']
            A = np.maximum(0, Z)  # ReLU
            self.caches.append((A_prev, Z))

        # Output layer
        ZL = np.dot(self.parameters[f'W{self.L}'], A) + self.parameters[f'b{self.L}']
        AL = 1 / (1 + np.exp(-ZL))  # Sigmoid
        self.caches.append((A, ZL))

        return AL

    def backward(self, AL, Y):
        grads = {}
        m = AL.shape[1]

        # Output layer: combined sigmoid + binary cross-entropy gradient
        dZ = AL - Y
        A_prev, _ = self.caches[self.L - 1]
        grads[f'dW{self.L}'] = np.dot(dZ, A_prev.T) / m
        grads[f'db{self.L}'] = np.sum(dZ, axis=1, keepdims=True) / m
        dA = np.dot(self.parameters[f'W{self.L}'].T, dZ)

        # Hidden layers
        for l in reversed(range(1, self.L)):
            A_prev, Z = self.caches[l - 1]
            dZ = dA * (Z > 0)  # ReLU derivative
            grads[f'dW{l}'] = np.dot(dZ, A_prev.T) / m
            grads[f'db{l}'] = np.sum(dZ, axis=1, keepdims=True) / m
            dA = np.dot(self.parameters[f'W{l}'].T, dZ)

        return grads

    def update(self, grads, learning_rate):
        for l in range(1, self.L + 1):
            self.parameters[f'W{l}'] -= learning_rate * grads[f'dW{l}']
            self.parameters[f'b{l}'] -= learning_rate * grads[f'db{l}']

    def train(self, X, Y, epochs, learning_rate=0.01):
        costs = []

        for epoch in range(epochs):
            AL = self.forward(X)
            cost = -np.mean(Y * np.log(AL + 1e-8) + (1 - Y) * np.log(1 - AL + 1e-8))
            grads = self.backward(AL, Y)
            self.update(grads, learning_rate)

            if epoch % 100 == 0:
                costs.append(cost)
                print(f"Epoch {epoch}: Cost = {cost:.4f}")

        return costs

# Example usage
if __name__ == "__main__":
    np.random.seed(42)

    # Create sample data: label is 1 for points inside the unit circle
    X = np.random.randn(2, 1000)
    Y = ((X[0] ** 2 + X[1] ** 2) < 1).astype(float).reshape(1, -1)

    # Train network
    nn = NeuralNetwork([2, 16, 8, 1])
    costs = nn.train(X, Y, epochs=1000, learning_rate=0.1)
```
Key Takeaways
- Perceptrons are the basic units - weighted sum followed by activation
- Activation functions introduce non-linearity (ReLU for hidden, sigmoid/softmax for output)
- Forward propagation computes predictions layer by layer
- Loss functions measure prediction quality (cross-entropy for classification, MSE for regression)
- Backpropagation computes gradients using the chain rule
- He initialization prevents vanishing/exploding gradients
Understanding these fundamentals is crucial before moving to frameworks like TensorFlow or PyTorch. The math doesn't change - only the implementation details!