Building a Neural Network from Scratch: MNIST classifier with just Python
This is post 2 in my quest to learn and apply ML. The previous one was a simple visualizer here.
I was trying to understand neural networks but had a hard time grasping how they actually learn. So I decided
to build one using raw Python so I could intuitively get what they do.
We are making an MNIST classifier that can recognize handwritten digits.
No TensorFlow, no PyTorch - just Python and a tiny bit of NumPy for loading the data.
The MNIST dataset is the "Hello World" of machine learning. It contains 70,000 images of handwritten digits
(0-9), each 28x28 pixels in size.
Our goal is to build a neural network that can look at these images and correctly identify the digits.
In this post, I will go through each function in the code and explain what I did. The full code can be found here.

Part 1: Setting Up the Data
First things first - we need our data. We'll use scikit-learn here to fetch and split the dataset (I know, I said no libraries, but I'm not cracked enough to do this without libs, yet ;)
Imports and getting data
import numpy as np
import random
import matplotlib.pyplot as plt
import math
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split
print("Fetching the MNIST dataset...")
mnist = fetch_openml('mnist_784', version=1, as_frame=False)
X, y = mnist.data, mnist.target
print("MNIST dataset loaded")
Each image is flattened into a 784-dimensional vector (28*28 = 784 pixels). We split our data into three parts (a quick sketch of the split follows the list):
- Training set (70%): This is what our network learns from
- Validation set (10%): We use this to tune our hyperparameters
- Test set (20%): The final exam - we only use this at the very end
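Here's a minimal sketch of how that 70/10/20 split can be done with train_test_split - the exact call in the repo may differ, and the random_state is just an illustrative choice:
    # First carve off 30%, then divide that 30% into 10% validation and 20% test
    X_train, X_temp, y_train, y_temp = train_test_split(
        X, y, test_size=0.3, random_state=42)
    X_val, X_test, y_val, y_test = train_test_split(
        X_temp, y_temp, test_size=2/3, random_state=42)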
Part 2: One-Hot Encoding - Why and How
Here's where things get interesting. We need to convert our number labels (0-9) into a format that makes sense for neural networks. Let's say we have the digit 5. We could represent it in three ways:
Option 1: Just use the number
Hard to actually make this work. If our network predicts 4 when it should be 5, the error is 1. If it predicts 9, the error is 4. This creates a false hierarchy where 4 is "closer" to 5 than 9 is, which isn't true for digit recognition - all wrong answers are equally wrong!
Option 2: Binary Encoding
We could use binary: 0 = 0000, 1 = 0001, ..., 9 = 1001. But this has the same problem in a different way. To go from 7 (0111) to 8 (1000), you need to flip ALL bits, but 6 (0110) to 7 (0111) is just one flip. Again, we're creating artificial relationships.
Option 3: One-Hot Encoding
Here's our winner. We represent each digit as a vector of 10 elements, where only one element is 1 and the rest are 0. So 5 becomes [0,0,0,0,0,1,0,0,0,0]. This is perfect because:
- Every digit is equally different from every other digit (always two "flips" to switch numbers)
- No implied relationships between digits
- Error calculations become straightforward
One hot encoding
def one_hot_encoding(labels, num_classes=10):
    # Convert numeric labels to one-hot encoded vectors
    encoded = np.zeros((len(labels), num_classes))
    for i, label in enumerate(labels):
        encoded[i][int(label)] = 1
    return encoded

# Convert all our labels
y_train_encoded = one_hot_encoding(y_train)
y_val_encoded = one_hot_encoding(y_val)
y_test_encoded = one_hot_encoding(y_test)
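As a quick sanity check, encoding the label '5' should put the 1 at index 5 (the comment shows the expected output):
    print(one_hot_encoding(['5'])[0])
    # Expected: [0. 0. 0. 0. 0. 1. 0. 0. 0. 0.]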
Part 3: Building the Neural Network
Now for the fun part - building our network from scratch! I thought it would be neat to start with 2 hidden layers:
- Input layer: 784 neurons (one for each pixel)
- First hidden layer: 112 neurons
- Second hidden layer: 16 neurons
- Output layer: 10 neurons (one for each digit)
The Building Blocks
1. Initialization (__init__)
The network starts life with random weights and biases. Each connection between neurons gets a random weight between -0.5 and 0.5. Why? We want to start with small random values to break symmetry and allow the network to learn different patterns.
Initializing the network with weights and biases
class My_Neural_Network:
    def __init__(self, layer_sizes):
        self.sizes = layer_sizes
        self.num_layers = len(layer_sizes)
        self.weights = []
        self.biases = []
        for i in range(len(self.sizes) - 1):
            layer_weights = []
            for destination_neuron in range(self.sizes[i + 1]):
                neuron_weights = []
                for source_neuron in range(self.sizes[i]):
                    weight = random.random() - 0.5
                    neuron_weights.append(weight)
                layer_weights.append(neuron_weights)
            self.weights.append(layer_weights)
            # Start every neuron in the next layer with bias = 1
            layer_biases = [1.0 for _ in range(self.sizes[i + 1])]
            self.biases.append(layer_biases)
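With the architecture above, creating the network is a one-liner:
    # 784 inputs -> 112 -> 16 hidden -> 10 outputs
    network = My_Neural_Network([784, 112, 16, 10])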
2. Activation Function (sigmoid)
We use the sigmoid function to squash any input into a value between 0 and 1. It's like a smooth step function:
- Very negative inputs → output close to 0
- Very positive inputs → output close to 1
- Inputs around 0 → gradual transition
Defining our activation function
def sigmoid(self, x):
    # Handle extreme values to prevent overflow in math.exp
    if x < -700:
        return 0.0
    elif x > 700:
        return 1.0
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_derivative(self, x):
    # Derivative of sigmoid at x: sigmoid(x) * (1 - sigmoid(x))
    return self.sigmoid(x) * (1 - self.sigmoid(x))
Why do we check for ±700? The sigmoid function involves e^(-x), and for very negative x that exponent becomes huge - Python's math.exp() overflows for arguments above roughly 709 and raises an OverflowError. Since sigmoid(700) is already extremely close to 1 (and sigmoid(-700) to 0), we can safely clamp anything beyond these bounds.
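A tiny illustration of the failure mode we're guarding against (not part of the network code; the values are arbitrary, just far past the limit):
    import math

    math.exp(-800)   # underflows quietly to 0.0
    math.exp(800)    # raises OverflowError: math range error
    # Without the clamp, sigmoid(-800) would call math.exp(800) and crash.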
3. Forward Propagation (feedforward)
This is how our network makes predictions. Data flows through the layers like this:
Defining how our neural network spits an output for a given input
def feedforward(self, inputs):
    """
    Push data through the network to get predictions. Returns:
    1. Activations - neuron outputs at each layer
    2. Weighted sums - the values before sigmoid activation (needed for backprop)
    """
    current_values = inputs
    all_activations = [inputs]   # Store all layer outputs
    all_weighted_sums = []       # Store all pre-activation values

    # Process each layer
    for layer_weights, layer_biases in zip(self.weights, self.biases):
        # Calculate weighted sums for each neuron in the current layer
        weighted_sums = []
        for weights, bias in zip(layer_weights, layer_biases):
            # For each neuron:
            # 1. Multiply each input by its weight
            # 2. Sum all weighted inputs
            # 3. Add the bias
            total = sum(w * a for w, a in zip(weights, current_values))
            weighted_sums.append(total + bias)

        # Store pre-activation values for backprop
        all_weighted_sums.append(weighted_sums)

        # Apply sigmoid activation to get neuron outputs
        current_values = [self.sigmoid(z) for z in weighted_sums]
        all_activations.append(current_values)

    return all_activations, all_weighted_sums
With a given input vector, here's what happens in the forward pass:
- We start with our input layer
- For each subsequent layer:
  - Each neuron calculates its weighted sum (z = w₁a₁ + w₂a₂ + ... + b)
  - We store these weighted sums (we'll need them for backpropagation)
  - We apply the sigmoid function to each weighted sum to get the neuron outputs
  - These outputs become the inputs for the next layer
- We keep track of ALL values because backpropagation needs them:
  - all_activations: the output of each layer (after sigmoid)
  - all_weighted_sums: the values before sigmoid (needed for derivatives)
The zip(layer_weights, layer_biases) lets us process corresponding weights and biases together. For each neuron, we compute its output in two steps (a small worked example follows the list):
- Weighted sum: Multiply each input by its weight and sum (plus bias)
- Activation: Pass this sum through sigmoid to get output between 0 and 1
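To make those two steps concrete, here's a single neuron with toy numbers (two inputs instead of 784, purely for illustration; it reuses the math import from earlier):
    inputs  = [1.0, 0.5]    # activations coming from the previous layer
    weights = [0.2, -0.4]   # this neuron's weights for those inputs
    bias    = 1.0

    z = sum(w * a for w, a in zip(weights, inputs)) + bias   # 0.2 - 0.2 + 1.0 = 1.0
    output = 1.0 / (1.0 + math.exp(-z))                      # sigmoid(1.0) ≈ 0.731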
4. Backpropagation
This is where the magic happens. Backpropagation is how our network learns from its mistakes. Here's the complete implementation:
Backpropagation - teaching our NN to learn
def backpropagate(self, training_example, label):
    """
    1. Calculate the network's error
    2. Determine how each weight contributed to that error
    3. Return gradients for weights and biases
    """
    # First, do a forward pass to get all intermediate values
    activations, weighted_sums = self.feedforward(training_example)

    # Initialize gradients as zero
    weight_gradients = [[[0.0 for _ in range(len(w_row))]
                         for w_row in layer]
                        for layer in self.weights]
    bias_gradients = [[0.0 for _ in range(len(b))]
                      for b in self.biases]

    # Calculate error at the output layer
    # delta = (actual - target) * sigmoid_derivative(weighted sum)
    output_error = [a - t for a, t in zip(activations[-1], label)]
    layer_error = [err * self.sigmoid_derivative(z)
                   for err, z in zip(output_error, weighted_sums[-1])]

    # Save gradients for the output layer
    bias_gradients[-1] = layer_error
    for i, error in enumerate(layer_error):
        for j, activation in enumerate(activations[-2]):
            weight_gradients[-1][i][j] = error * activation

    # Backpropagate the error through earlier layers
    for layer in range(2, self.num_layers):
        layer_error_new = []
        # For each neuron in the current layer
        for j in range(len(self.weights[-layer])):
            # Calculate its error from the next layer's error
            error = 0.0
            for k in range(len(layer_error)):
                error += self.weights[-layer + 1][k][j] * layer_error[k]
            # Use the stored pre-activation value for the derivative
            error *= self.sigmoid_derivative(weighted_sums[-layer][j])
            layer_error_new.append(error)
        layer_error = layer_error_new

        # Save gradients
        bias_gradients[-layer] = layer_error
        for i, error in enumerate(layer_error):
            for j, activation in enumerate(activations[-layer - 1]):
                layer_index = len(self.weights) - layer
                weight_gradients[layer_index][i][j] = error * activation

    return weight_gradients, bias_gradients
Let's walk through this carefully. For a detailed, mathematical understanding of backpropagation, I can't recommend Matt Mazur's post enough. Brilliant piece of writing.
Step 1: Output Layer Error
For the output layer, calculating the error is reasonably easy:
- Error = (actual output - desired output) * sigmoid_derivative(weighted sum). Since sigmoid'(z) = sigmoid(z) * (1 - sigmoid(z)), that derivative is just output * (1 - output).
- We multiply by the sigmoid derivative to account for how sensitive the sigmoid output is to small changes in its input. See the link above for the step-by-step reasoning behind this equation.
- This tells us how much each output neuron contributed to the error.
Step 2: Hidden Layer Error
For hidden layers, it's more of the same. We need to:
- Take each neuron in the current layer
- Look at ALL connections to the next layer
- Sum up (next layer's error * connection weight) for each connection
- Multiply by sigmoid_derivative of the current neuron's output
This is the chain rule in action: we're calculating how much each hidden neuron contributed to the final error by looking at all paths from that neuron to the output.
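Here's that sum with toy numbers - one hidden neuron that feeds two output neurons (just an illustration, not part of the class):
    next_layer_errors = [0.10, -0.05]   # deltas of the two neurons in the next layer
    outgoing_weights  = [0.30, 0.80]    # weights on the connections to those neurons
    hidden_output     = 0.6             # this neuron's activation a = sigmoid(z)

    error = sum(w * e for w, e in zip(outgoing_weights, next_layer_errors))
    error *= hidden_output * (1 - hidden_output)   # sigmoid'(z) = a * (1 - a)
    # error = (0.3*0.10 + 0.8*(-0.05)) * 0.6 * 0.4 = -0.01 * 0.24 = -0.0024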
Step 3: Weight Gradients
For each weight, its gradient is:
gradient = (error of receiving neuron) * (activation of sending neuron)
This makes intuitive sense. If the receiving neuron had a large error, we want to change its weights more. If the sending neuron had a large activation, it contributed more to the error. If either value is small, the weight didn't contribute much to the error.
Step 4: Using the Gradients
These gradients tell us how to adjust each weight to reduce the error (a tiny numeric example follows the list):
- Positive gradient → increasing this weight would increase the error → decrease the weight
- Negative gradient → increasing this weight would decrease the error → increase the weight
- The learning rate determines how big these adjustments are. It's one of the hyperparameters we set before training.
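Putting Steps 3 and 4 together for a single weight (toy numbers again; the learning rate here is a made-up value, and the real code also averages over the mini-batch):
    learning_rate      = 0.5       # hypothetical value, just for illustration
    receiving_error    = -0.0024   # delta of the neuron this weight feeds into
    sending_activation = 0.9       # output of the neuron this weight comes from

    gradient = receiving_error * sending_activation   # -0.00216
    weight   = 0.25
    weight  -= learning_rate * gradient               # 0.25 + 0.00108 = 0.25108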
5. Training
The training process ties everything together. One thing to note is that MNIST is a solved problem now, with the best solutions achieving >99% accuracy. Through countless experiments, researchers have discovered optimal architectures and hyperparameters. Our implementation ignores these guidelines for the sake of learning, and consequently achieves a more modest accuracy.
Let's look at the three main approaches to training neural networks:
- Stochastic Gradient Descent (SGD): Updates weights after every single training example. Very fast per update but produces noisy learning. Can zigzag toward the solution due to frequent updates and is sensitive to outliers. Good for online learning where data comes in real-time.
- Full Batch Gradient Descent: Processes the entire training dataset before making any weight updates. Very stable since it averages over all examples, but requires lots of memory and can only make one update per epoch. Often too computationally expensive for large datasets and more likely to get stuck in local minima.
- Mini-batch Gradient Descent (our approach): Processes a small batch of examples (in our case, 32) before updating. This hits a sweet spot - more stable than SGD but still allows frequent updates, less memory-intensive than full batch.
We use mini-batch gradient descent with a batch size of 32. Here's how the training works:
Training - creating mini-batches -> training on each mini-batch -> predicting on the validation set -> evaluating performance
def update_mini_batch(self, mini_batch, learning_rate):
    # Update network weights using a small batch of shuffled training data
    # Initialize gradient sums as zero
    weight_sums = [[[0.0 for _ in range(len(w_row))]
                    for w_row in layer]
                   for layer in self.weights]
    bias_sums = [[0.0 for _ in range(len(b))]
                 for b in self.biases]

    # Calculate gradients for each training example
    for inputs, label in mini_batch:
        weight_gradients, bias_gradients = self.backpropagate(inputs, label)
        # Add to the running sums
        for layer in range(len(weight_sums)):
            for i in range(len(weight_sums[layer])):
                for j in range(len(weight_sums[layer][i])):
                    weight_sums[layer][i][j] += weight_gradients[layer][i][j]
        for layer in range(len(bias_sums)):
            for i in range(len(bias_sums[layer])):
                bias_sums[layer][i] += bias_gradients[layer][i]

    # Update weights and biases using the average gradients
    batch_size = len(mini_batch)
    for layer in range(len(self.weights)):
        for i in range(len(self.weights[layer])):
            for j in range(len(self.weights[layer][i])):
                self.weights[layer][i][j] -= (learning_rate / batch_size) * weight_sums[layer][i][j]
        for i in range(len(self.biases[layer])):
            self.biases[layer][i] -= (learning_rate / batch_size) * bias_sums[layer][i]

def train(self, training_data, epochs, mini_batch_size, learning_rate, validation_data=None):
    # Train the network on shuffled mini-batches of data
    n = len(training_data)
    for epoch in range(epochs):
        # Shuffle the training data
        random.shuffle(training_data)
        # Create mini-batches
        mini_batches = [
            training_data[k:k + mini_batch_size]
            for k in range(0, n, mini_batch_size)
        ]
        # Train on each mini-batch
        print(f"Epoch {epoch + 1}: Training on {len(mini_batches)} batches")
        for i, mini_batch in enumerate(mini_batches):
            if i % 100 == 0:
                # Show progress every 100 batches
                print(f"  Batch {i}/{len(mini_batches)}")
            self.update_mini_batch(mini_batch, learning_rate)
        # Calculate and show accuracy on the validation set
        if validation_data is not None:
            accuracy = self.evaluate(validation_data)
            print(f"Epoch {epoch + 1}: {accuracy:.2f}% accuracy")

def predict(self, inputs):
    # Get the network's prediction (0-9) for an input
    activations, _ = self.feedforward(inputs)
    output = activations[-1]
    return output.index(max(output))

def evaluate(self, test_data):
    # Calculate accuracy on test data
    correct = 0
    total = len(test_data)
    for inputs, label in test_data:
        prediction = self.predict(inputs)
        # Convert the one-hot encoded label back to a plain digit
        actual = list(label).index(1)
        if prediction == actual:
            correct += 1
    return (correct / total) * 100
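For completeness, here's roughly how everything gets wired together. The architecture, batch size, and epoch count are the ones discussed above; the learning rate of 3.0 and the pixel scaling to [0, 1] are illustrative choices here, so check the repo for the exact values:
    # Pair each (scaled) image with its one-hot label
    training_data   = list(zip(X_train / 255.0, y_train_encoded))
    validation_data = list(zip(X_val / 255.0, y_val_encoded))
    test_data       = list(zip(X_test / 255.0, y_test_encoded))

    network = My_Neural_Network([784, 112, 16, 10])
    network.train(training_data, epochs=10, mini_batch_size=32,
                  learning_rate=3.0, validation_data=validation_data)

    print(f"Test accuracy: {network.evaluate(test_data):.2f}%")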
The Results
After 10 epochs of training, our network achieves around 75% accuracy on the validation set and 76% on the test set. This is decidedly modest compared to modern deep learning models (which hit 99%+), but remember - we built this using only basic Python operations! No optimized matrix operations, no GPU acceleration, no fancy optimizers.
Key Takeaways
Building a neural network from scratch teaches us several important lessons:
- Neural networks aren't magic - they're just lots of simple math operations chained together
- The basic principles (forward pass, backprop, gradient descent) haven't changed since the 1980s
- Modern frameworks abstract away these details, but understanding them helps debug and optimize networks
- Even a simple network can learn meaningful patterns
Want to understand backpropagation in even more detail? Check out this excellent step-by-step guide that helped me grasp the concepts.
The complete code with detailed comments is available in this GitHub repo. Feel free to experiment with different architectures, learning rates, and batch sizes!