Exercise 8 (solution)#

[ ]:

from datasets import load_dataset
import torch
import matplotlib.pyplot as plt

Data Preparation#

The data preparation only uses concepts you already know from previous lectures. We therefore start with clean datasets for training and validation.

[ ]:

data = load_dataset("mnist")
data.set_format("torch")

[ ]:

example = data["test"][0]
print(f"True label: {int(example['label'])}")
fig = plt.imshow(example["image"])

[ ]:

img_size = example["image"].numel()
img_size

[ ]:

# Dividing by 255 maps pixel values to the interval [0, 1]
X_train = data["train"]["image"][:].reshape(-1, img_size).to(torch.float) / 255
X_test = data["test"]["image"][:].reshape(-1, img_size).to(torch.float) / 255
y_train = data["train"]["label"]
y_test = data["test"]["label"]
X_train.shape
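
The reshape-and-scale step can be checked on synthetic data without downloading anything. The sketch below uses a made-up tensor `fake_images` (not part of the exercise) that only mimics the shape and dtype of a batch of 28 x 28 grayscale images.

```python
import torch

# stand-in for a batch of 5 grayscale images with 28 x 28 uint8 pixels
fake_images = torch.randint(0, 256, (5, 28, 28), dtype=torch.uint8)
fake_img_size = fake_images[0].numel()  # 28 * 28 = 784

# flatten each image into one row, convert to float and scale to [0, 1]
X_fake = fake_images.reshape(-1, fake_img_size).to(torch.float) / 255
print(X_fake.shape)  # torch.Size([5, 784])
print(X_fake.min().item() >= 0.0, X_fake.max().item() <= 1.0)
```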

Dimensions of our Neural Network#

[ ]:

# the input dimension
n_in = img_size
# the dimension of our 2 hidden layers
n_hidden = 16
# the dimension of our output layer
n_out = 10

Task 1: How many Parameters?#

The number of trainable parameters is entirely determined by the number of layers and their dimensions.

Write a function called count_params(n_in, n_hidden, n_out) that counts how many parameters will be in our model. Assume that there are 2 hidden layers.

[ ]:

def count_params(n_in, n_hidden, n_out):
    n_weights = n_hidden * (n_in + n_hidden + n_out)
    n_biases = 2 * n_hidden + n_out
    return n_weights + n_biases


count_params(n_in, n_hidden, n_out)
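
To see where the total comes from, it can help to count layer by layer. The helper `count_params_by_layer` below is a hypothetical name used only for this sketch; the dimensions 784, 16 and 10 assume flattened 28 x 28 images, two hidden layers of size 16 and ten classes.

```python
def count_params_by_layer(n_in, n_hidden, n_out):
    """Return (n_weights, n_biases) for each of the three dense layers."""
    layer_dims = [(n_in, n_hidden), (n_hidden, n_hidden), (n_hidden, n_out)]
    # a dense layer with fan_in inputs and fan_out outputs has
    # fan_in * fan_out weights and fan_out biases
    return [(fan_in * fan_out, fan_out) for fan_in, fan_out in layer_dims]


per_layer = count_params_by_layer(784, 16, 10)
for k, (n_w, n_b) in enumerate(per_layer, start=1):
    print(f"layer {k}: {n_w} weights + {n_b} biases")

total = sum(n_w + n_b for n_w, n_b in per_layer)
print(total)  # 12544 + 16 + 256 + 16 + 160 + 10 = 13002
```

Almost all parameters sit in the first layer, because the input dimension is much larger than the hidden dimension.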

Task 2: Set up random start parameters#

We want to draw random start parameters that are distributed uniformly between -0.5 and 0.5.

Since we are going to modify the parameters in-place while training the model, we need a way to freshly generate the start parameters multiple times. We therefore create a function that draws start parameters.

The function takes the following arguments:
- n_in
- n_hidden
- n_out
- seed (give it a default value of 1995 so we all get the same results)

The function returns:
- a list of weight matrices with the correct shapes
- a list of biases with the correct shapes

[ ]:

def create_params(n_in, n_hidden, n_out, seed=1995):
    torch.manual_seed(seed)
    weights = [
        torch.rand((n_hidden, n_in)) - 0.5,
        torch.rand((n_hidden, n_hidden)) - 0.5,
        torch.rand((n_out, n_hidden)) - 0.5,
    ]

    biases = [
        torch.rand(n_hidden) - 0.5,
        torch.rand(n_hidden) - 0.5,
        torch.rand(n_out) - 0.5,
    ]
    return weights, biases


weights, biases = create_params(n_in, n_hidden, n_out)
biases
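
The two ingredients of the start parameters, a fixed seed and the shift by 0.5, can be checked in isolation. This is a minimal sketch; `draw_uniform` is a hypothetical helper and not part of the exercise.

```python
import torch


def draw_uniform(shape, seed=1995):
    """Draw values uniformly from [-0.5, 0.5) after fixing the seed."""
    torch.manual_seed(seed)
    # torch.rand samples from [0, 1); subtracting 0.5 centers it at zero
    return torch.rand(shape) - 0.5


first = draw_uniform((3, 4))
second = draw_uniform((3, 4))
other = draw_uniform((3, 4), seed=0)

print(torch.equal(first, second))  # True: same seed, same values
print(torch.equal(first, other))   # a different seed gives different values
print(first.min().item() >= -0.5, first.max().item() < 0.5)
```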

Task 3: Implement relu and softmax#

Implement a relu function that takes a 1d tensor and applies the relu nonlinearity elementwise.
Implement a softmax function that takes a 1d tensor of logits and returns a 1d tensor of probabilities.
Test your functions on a small tensor.
If you have time, implement other nonlinearities such as sigmoid, tanh, …

[ ]:

def relu(x):
    """Calculate the elementwise relu nonlinearity on x."""
    return torch.clip(x, 0)


def softmax(x):
    """Compute softmax values over x.

    Subtracting the max is optional but improves numerical stability

    """
    e_x = torch.exp(x - torch.max(x))
    return e_x / e_x.sum()

[ ]:

relu(torch.linspace(-1, 1, 5))
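
For the optional part of the task, here is one possible sketch of two further nonlinearities. The function names `sigmoid` and `tanh` are chosen for this example; the tanh version relies on the identity tanh(x) = 2 * sigmoid(2x) - 1.

```python
import torch


def sigmoid(x):
    """Elementwise logistic function 1 / (1 + exp(-x))."""
    return 1 / (1 + torch.exp(-x))


def tanh(x):
    """Elementwise hyperbolic tangent, written via the sigmoid."""
    return 2 * sigmoid(2 * x) - 1


x = torch.linspace(-2, 2, 5)
print(sigmoid(x))
print(tanh(x))
```

Both results can be compared against the built-in `torch.sigmoid` and `torch.tanh`.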

[ ]:

softmax(torch.tensor([-1.0, 3.0, -2.0]))
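
Why subtracting the maximum matters becomes visible with large logits. The sketch below compares a naive version with the stabilized one; `naive_softmax` and `stable_softmax` are names used only for this illustration.

```python
import torch


def naive_softmax(x):
    e_x = torch.exp(x)
    return e_x / e_x.sum()


def stable_softmax(x):
    # shifting by the maximum leaves the result unchanged mathematically
    e_x = torch.exp(x - torch.max(x))
    return e_x / e_x.sum()


big_logits = torch.tensor([1000.0, 1001.0, 1002.0])
print(naive_softmax(big_logits))   # exp overflows to inf, so inf / inf gives nan
print(stable_softmax(big_logits))  # same as the softmax of [-2, -1, 0]
```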

Task 4: Implement the model#

The model should take the following arguments:
- x: A 1d tensor with a flattened image
- weights: The list of weights from task 2
- biases: The list of biases from task 2

It should return a 1d tensor of length n_out that contains probabilities for each category.

Implement a model function
Try it out on the first element of the training data
Try out the batch_model function on the first few rows of the training data

[ ]:

def model(x, weights, biases):
    h1 = relu(weights[0] @ x + biases[0])
    h2 = relu(weights[1] @ h1 + biases[1])
    return softmax(weights[2] @ h2 + biases[2])

[ ]:

model(X_train[0], weights, biases)

In the training process we need a batch_model function that evaluates the model on a batch of data. Writing it is not very instructive, so I give you the function right away.

[ ]:

def batch_model(batch, weights, biases):
    n_out = len(biases[-1])
    out = torch.zeros((len(batch), n_out))
    for i, x in enumerate(batch):
        out[i] = model(x, weights, biases)
    return out
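
The loop over rows is easy to read but slow. As an alternative, the whole batch can be pushed through the network with matrix products. The following is a sketch, not part of the exercise: `batch_model_vectorized`, `model_single` and the small `demo_*` tensors are made up for the comparison.

```python
import torch


def model_single(x, weights, biases):
    """Per-example reference: two relu layers followed by a softmax."""
    h1 = torch.clip(weights[0] @ x + biases[0], 0)
    h2 = torch.clip(weights[1] @ h1 + biases[1], 0)
    logits = weights[2] @ h2 + biases[2]
    e = torch.exp(logits - logits.max())
    return e / e.sum()


def batch_model_vectorized(batch, weights, biases):
    """Evaluate the network on all rows of `batch` with matrix products."""
    # each row of `batch` is one input, so multiply by the transposed weights
    h1 = torch.clip(batch @ weights[0].T + biases[0], 0)
    h2 = torch.clip(h1 @ weights[1].T + biases[1], 0)
    logits = h2 @ weights[2].T + biases[2]
    # row-wise softmax; subtracting the row maximum keeps exp() from overflowing
    e = torch.exp(logits - logits.max(dim=1, keepdim=True).values)
    return e / e.sum(dim=1, keepdim=True)


# small random network (6 inputs, 4 hidden units, 3 outputs) and a batch of 5 rows
torch.manual_seed(0)
demo_shapes = [(4, 6), (4, 4), (3, 4)]
demo_weights = [torch.rand(shape) - 0.5 for shape in demo_shapes]
demo_biases = [torch.rand(shape[0]) - 0.5 for shape in demo_shapes]
demo_batch = torch.rand((5, 6))

looped = torch.stack([model_single(x, demo_weights, demo_biases) for x in demo_batch])
vectorized = batch_model_vectorized(demo_batch, demo_weights, demo_biases)
print(torch.allclose(looped, vectorized, atol=1e-6))
```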

Task 5: Implement loss functions#

Write a function called nll_loss that takes the result of the batch_model and the true labels and returns the average negative log likelihood.
Try it out on the first 100 rows of the training data
Implement an accuracy function that takes the same arguments as the loss function
Try it out on the first 100 rows of the training data

[ ]:

def nll_loss(probs, labels):
    # the small constant guards against log(0); it must be representable in float32
    likelihoods = probs[torch.arange(len(probs)), labels] + 1e-12
    loglikes = torch.log(likelihoods)
    return -loglikes.mean()

[ ]:

probs = batch_model(X_train[:100], weights, biases)
labels = y_train[:100]

nll_loss(probs, labels)
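
The indexing step inside the loss is easier to follow on a tiny example. The values in `toy_probs` and `toy_labels` are made up for illustration.

```python
import torch

toy_probs = torch.tensor([[0.7, 0.2, 0.1],
                          [0.1, 0.8, 0.1]])
toy_labels = torch.tensor([0, 1])

# row i, column toy_labels[i]: the probability assigned to the true class
picked = toy_probs[torch.arange(len(toy_probs)), toy_labels]
print(picked)  # tensor([0.7000, 0.8000])

toy_loss = -torch.log(picked).mean()
print(toy_loss)  # -(ln 0.7 + ln 0.8) / 2, roughly 0.2899
```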

[ ]:

def accuracy(probs, labels):
    y_pred = probs.argmax(axis=1)
    return (y_pred == labels).to(torch.float).mean()

[ ]:

accuracy(probs, labels)
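
On a small made-up example the accuracy can be verified by hand: the predicted class is the column with the largest probability in each row, and the accuracy is the share of rows where it matches the label.

```python
import torch

toy_probs = torch.tensor([[0.7, 0.2, 0.1],
                          [0.1, 0.8, 0.1],
                          [0.3, 0.3, 0.4],
                          [0.6, 0.3, 0.1]])
toy_labels = torch.tensor([0, 1, 0, 0])

toy_pred = toy_probs.argmax(dim=1)
print(toy_pred)  # tensor([0, 1, 2, 0])

# three of the four predictions match the labels
toy_acc = (toy_pred == toy_labels).to(torch.float).mean()
print(toy_acc)  # tensor(0.7500)
```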

Task 6: The training loop#

Create fresh weights and biases
Set requires_grad to True for all tensors in the weights and biases list.
Write a training loop to train your model with SGD and the following hyper-parameters
- n_epochs: 2
- batch_size: 100
- learning_rate: 0.01
If you have time, try the model out on a few images

Important: Do the entire training in just one cell and re-create the start parameters at the beginning of that cell, so each training run starts from the same position.
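
One way to turn the batch_size hyperparameter into mini-batches is to shuffle all row indices once per epoch and cut them into equally sized rows. The sketch below uses small made-up numbers (`n_samples`, `demo_batch_size`) so the result is easy to inspect.

```python
import torch

torch.manual_seed(0)
n_samples, demo_batch_size = 12, 4

# shuffle all row indices, then cut them into rows of demo_batch_size
demo_batches = torch.randperm(n_samples).reshape(-1, demo_batch_size)
print(demo_batches.shape)  # torch.Size([3, 4])

# every index appears in exactly one batch
all_indices = demo_batches.flatten().sort().values
print(torch.equal(all_indices, torch.arange(n_samples)))  # True
```

Note that the reshape only works if the number of samples is a multiple of the batch size; otherwise it raises an error.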

[ ]:

# create fresh random weights and biases

# set requires_grad to True for training

# define the hyperparameters

# loop over epochs

# loop over batches
# evaluate model
# evaluate loss
# backwards

# loop over the parameter lists
# SGD updates for each parameter tensor

# Zero the gradients for the next iteration

[ ]:

# create fresh random weights and biases
weights, biases = create_params(n_in, n_hidden, n_out)

# set requires_grad to True for training
for i in range(3):
    weights[i].requires_grad = True
    biases[i].requires_grad = True

# define the hyperparameters
n_epochs = 2
batch_size = 100
learning_rate = 0.01

# loop over epochs
for _epoch in range(n_epochs):
    batch_indices = torch.randperm(len(X_train)).reshape(-1, batch_size)
    # loop over batches
    for idxs in batch_indices:
        probs = batch_model(X_train[idxs], weights, biases)
        loss = nll_loss(probs, y_train[idxs])
        loss.backward()

        with torch.no_grad():
            for i in range(3):
                # SGD updates for each parameter
                weights[i] -= learning_rate * weights[i].grad
                biases[i] -= learning_rate * biases[i].grad
                # Zero the gradients for the next iteration
                weights[i].grad.zero_()
                biases[i].grad.zero_()

[ ]:

example_idx = 0
with torch.no_grad():
    probs = model(X_test[example_idx], weights, biases)

probs

[ ]:

probs[y_test[example_idx]]
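
The three steps of each iteration (backward pass, parameter update, zeroing the gradients) can be studied in isolation on a one-parameter problem. The following is a minimal sketch that minimizes (w - 3)^2; the names `w`, `v` and `demo_lr` are made up, and the update is written with torch.no_grad, a common way to express an in-place SGD step.

```python
import torch

w = torch.tensor(0.0, requires_grad=True)
demo_lr = 0.1

for _step in range(50):
    loss = (w - 3.0) ** 2
    loss.backward()  # adds d(loss)/dw to w.grad
    with torch.no_grad():
        w -= demo_lr * w.grad  # in-place SGD step, not tracked by autograd
    w.grad.zero_()  # otherwise the next backward() would add to the old gradient

print(w)  # close to the minimizer 3.0

# gradients accumulate across backward() calls unless they are zeroed
v = torch.tensor(1.0, requires_grad=True)
(v ** 2).backward()
(v ** 2).backward()
print(v.grad)  # tensor(4.), twice the gradient 2 * v = 2
```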

Task 7: Diagnostics#

Copy-paste the training loop from the previous task or work in the same cell as before.
After each epoch, evaluate the batch_model on the test data with the current parameters. Use torch.no_grad to disable gradient tracking.
Calculate the accuracy score on the result and print it.

[ ]:

# create fresh random weights and biases
weights, biases = create_params(n_in, n_hidden, n_out)

# set requires_grad to True for training
for i in range(3):
    weights[i].requires_grad = True
    biases[i].requires_grad = True

# define the hyperparameters
n_epochs = 2
batch_size = 100
learning_rate = 0.01

# loop over epochs
for epoch in range(n_epochs):
    batch_indices = torch.randperm(len(X_train)).reshape(-1, batch_size)
    # loop over batches
    for idxs in batch_indices:
        probs = batch_model(X_train[idxs], weights, biases)
        loss = nll_loss(probs, y_train[idxs])
        loss.backward()

        with torch.no_grad():
            for i in range(3):
                # SGD updates for each parameter
                weights[i] -= learning_rate * weights[i].grad
                biases[i] -= learning_rate * biases[i].grad
                # Zero the gradients for the next iteration
                weights[i].grad.zero_()
                biases[i].grad.zero_()

    with torch.no_grad():
        probs = batch_model(X_test, weights, biases)
        acc = accuracy(probs, y_test)
    print(f"Accuracy after epoch {epoch}: {acc}")
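
What torch.no_grad changes during evaluation can be seen on a tiny made-up tensor `p`: the computed values stay the same, but no computation graph is recorded.

```python
import torch

p = torch.tensor([1.0, 2.0], requires_grad=True)

tracked = (p * 2).sum()
with torch.no_grad():
    untracked = (p * 2).sum()

print(tracked.requires_grad)    # True: a graph was recorded for backward()
print(untracked.requires_grad)  # False: no graph is kept
print(torch.equal(tracked.detach(), untracked))  # True: the values agree
```

Since the test set is only used to measure accuracy, skipping the graph saves memory and time without affecting the result.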

Task 8: Training the model#

Tweak the number of epochs, batch size and learning rate until you get an accuracy of at least 90 %.

Copy-paste the code from the previous task or work in the same cell.

[ ]:

# create fresh random weights and biases
weights, biases = create_params(n_in, n_hidden, n_out)

# set requires_grad to True for training
for i in range(3):
    weights[i].requires_grad = True
    biases[i].requires_grad = True

# define the hyperparameters
n_epochs = 5
batch_size = 25
learning_rate = 0.1

# loop over epochs
for epoch in range(n_epochs):
    batch_indices = torch.randperm(len(X_train)).reshape(-1, batch_size)
    # loop over batches
    for idxs in batch_indices:
        probs = batch_model(X_train[idxs], weights, biases)
        loss = nll_loss(probs, y_train[idxs])
        loss.backward()

        with torch.no_grad():
            for i in range(3):
                # SGD updates for each parameter
                weights[i] -= learning_rate * weights[i].grad
                biases[i] -= learning_rate * biases[i].grad
                # Zero the gradients for the next iteration
                weights[i].grad.zero_()
                biases[i].grad.zero_()

    with torch.no_grad():
        probs = batch_model(X_test, weights, biases)
        acc = accuracy(probs, y_test)
    print(f"Accuracy after epoch {epoch}: {acc}")
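
One plausible reason why these settings train better is simply the number of parameter updates, in addition to the larger learning rate. The sketch below counts the updates for the settings of Task 7 and Task 8, assuming the 60,000 images of the MNIST training split.

```python
n_train = 60_000  # size of the MNIST training split

settings = {
    "Task 7": {"n_epochs": 2, "batch_size": 100},
    "Task 8": {"n_epochs": 5, "batch_size": 25},
}

n_updates = {}
for name, s in settings.items():
    # one SGD update per mini-batch
    updates_per_epoch = n_train // s["batch_size"]
    n_updates[name] = s["n_epochs"] * updates_per_epoch
    print(f"{name}: {updates_per_epoch} updates per epoch, {n_updates[name]} in total")
```

With these numbers the Task 8 run performs ten times as many updates as the Task 7 run.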