
Exercise 5 (solution)

[ ]:

import torch

Task 1: Download and inspect a DatasetDict

Download the dair-ai/emotion dataset
Find out how many rows and columns it has
Find the cache directory and convince yourself that you would know how to delete the dataset to free up space
Create a pandas DataFrame containing the training split of the data

[ ]:

from datasets import load_dataset

emotions = load_dataset("dair-ai/emotion", name="split")
emotions

[ ]:

emotions.shape

[ ]:

emotions.cache_files

[ ]:

emotions.set_format(type="pandas")
emotions["train"][:]

Task 2: DatasetDict.map

Use DatasetDict.map to add a new variable called label_name to the dataset. The translation is as follows: sadness (0), joy (1), love (2), anger (3), fear (4), surprise (5).

Test the function on one row before you call map

[ ]:

emotions.set_format(None)


def label_to_name(row):
    translation = {
        0: "sadness",
        1: "joy",
        2: "love",
        3: "anger",
        4: "fear",
        5: "surprise",
    }
    row["label_name"] = translation[row["label"]]
    return row


label_to_name(emotions["train"][0])

[ ]:

emotions_with_name = emotions.map(label_to_name)
emotions_with_name.column_names

Task 3: Batched map

Rewrite the function from the previous task such that it works if you set batched=True in map, i.e. if you do emotions.map(my_func, batched=True)

Use the strategies you have just learned to find out how this works. Don’t try out random things!

[ ]:

def batched_label_to_name(batch):
    translation = {
        0: "sadness",
        1: "joy",
        2: "love",
        3: "anger",
        4: "fear",
        5: "surprise",
    }
    batch["label_name"] = [translation[label] for label in batch["label"]]
    return batch


batched_label_to_name(emotions["train"][:5])

[ ]:

emotions_with_name_batch = emotions.map(batched_label_to_name, batched=True)
emotions_with_name_batch.column_names
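
The reason the batched function loops over batch["label"] is the data layout: with batched=True, map passes a dict of column lists instead of a dict of scalars. The following is a minimal sketch of that layout in plain Python, without the datasets library. The two rows are made up for illustration, and rows_to_batch is a hypothetical helper, not part of any library.

```python
TRANSLATION = {
    0: "sadness",
    1: "joy",
    2: "love",
    3: "anger",
    4: "fear",
    5: "surprise",
}

# Made-up rows for illustration; each row is a dict of scalars.
rows = [
    {"text": "i feel fine", "label": 1},
    {"text": "i am furious", "label": 3},
]


def rows_to_batch(rows):
    """Turn a list of row dicts into one dict of column lists."""
    return {key: [row[key] for row in rows] for key in rows[0]}


def batched_label_to_name(batch):
    # Copy the dict so the input batch is left untouched.
    batch = dict(batch)
    batch["label_name"] = [TRANSLATION[label] for label in batch["label"]]
    return batch


batch = rows_to_batch(rows)
result = batched_label_to_name(batch)
print(batch["label"])        # [1, 3]
print(result["label_name"])  # ['joy', 'anger']
```

The new column must be a list with one entry per row in the batch, which is why a list comprehension replaces the single dictionary lookup from Task 2.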

Task 4: Write Tokenizers

Write a function called character_tokenizer that takes a string and returns a list of tokens. Use all characters of the Latin alphabet and distinguish between lowercase and uppercase characters. Don’t forget punctuation. Encode the following text:

[ ]:

text = "Programming isn't about what you know; it's about what you can figure out."

[ ]:

def character_tokenizer(text):
    characters = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ,.;-_?!' "
    token_dict = {character: pos for pos, character in enumerate(characters)}
    out = [token_dict[character] for character in text]
    return out


character_tokenizer(text)[:5]

Even this simple example shows you that a lot can go wrong when coding your own tokenizer. Always use pre-trained or pre-implemented tokenizers in practice!
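
One concrete failure mode: the tokenizer above raises a KeyError for any character outside its vocabulary, such as a digit or an accented letter. A possible workaround, sketched here under the assumption that one extra id may be reserved for unseen characters, is to fall back to that id. The names safe_character_tokenizer, decode and UNKNOWN_ID are hypothetical and only serve this illustration.

```python
import string

# Same vocabulary as above: Latin letters, some punctuation and the space.
CHARACTERS = string.ascii_letters + ",.;-_?!' "
TOKEN_DICT = {character: pos for pos, character in enumerate(CHARACTERS)}

# Assumption: the first unused id is reserved for unknown characters.
UNKNOWN_ID = len(CHARACTERS)


def safe_character_tokenizer(text):
    """Encode text character by character; unseen characters get UNKNOWN_ID."""
    return [TOKEN_DICT.get(character, UNKNOWN_ID) for character in text]


def decode(token_ids):
    """Map ids back to characters; UNKNOWN_ID becomes the replacement character."""
    return "".join(
        CHARACTERS[i] if i < len(CHARACTERS) else "\ufffd" for i in token_ids
    )


text = "Programming isn't about what you know; it's about what you can figure out."
tokens = safe_character_tokenizer(text)
print(tokens[:5])              # [41, 17, 14, 6, 17]
print(decode(tokens) == text)  # True
print(safe_character_tokenizer("café"))  # [2, 0, 5, 61]
```

The round trip only recovers the original text when every character is in the vocabulary; unknown characters are lost, which is one of the trade-offs that pre-trained tokenizers handle more carefully.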

Task 5: Use a pretrained tokenizer

Use a pre-trained tokenizer for the "distilbert-base-uncased" model to encode the text from above. Decode each token so you can see how words were split into tokens.

[ ]:

from transformers import AutoTokenizer

model_name = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
example_tokens = tokenizer.encode(text)
example_tokens

[ ]:

for token in example_tokens:
    print(tokenizer.decode(token))

Now wrap the tokenizer into a function called tokenize and tokenize the entire dataset using DatasetDict.map.

For the tokenizer, the settings should be:

padding=True
truncation=True

For map, the settings should be:

batched=True
batch_size=None

Hint: if you write the function correctly, the following should work:

tokenize(emotions["train"][:3])

[ ]:

def tokenize(batch):
    return tokenizer(batch["text"], padding=True, truncation=True)


tokenize(emotions["train"][:3])

[ ]:

emotions_encoded = emotions.map(tokenize, batched=True, batch_size=None)
emotions_encoded

Important

Setting batched=True and batch_size=None means that all tweets in a split are processed in one batch. This is very important: if the dataset were processed in multiple batches, each batch might be padded to a different length (the number of tokens in the longest tweet of that batch).
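
The effect of the batch size on padding can be sketched in plain Python, without the tokenizer. The token ids below are made up, pad_batch is a hypothetical helper that mimics padding to the longest sequence in a batch, and the pad id of 0 is an assumption for this illustration.

```python
def pad_batch(token_lists, pad_id=0):
    """Pad every sequence to the length of the longest one in the batch."""
    max_len = max(len(tokens) for tokens in token_lists)
    return [tokens + [pad_id] * (max_len - len(tokens)) for tokens in token_lists]


# Made-up token ids standing in for four tokenized tweets.
tweets = [[101, 7, 102], [101, 7, 8, 9, 102], [101, 102], [101, 5, 6, 102]]

# One batch: every tweet is padded to the same length.
one_batch = pad_batch(tweets)

# Two batches of two tweets: each batch is padded to its own maximum.
two_batches = pad_batch(tweets[:2]) + pad_batch(tweets[2:])

print(sorted({len(row) for row in one_batch}))    # [5]
print(sorted({len(row) for row in two_batches}))  # [4, 5]
```

With two batches the rows no longer share one length, so they could not be stacked into a single rectangular tensor later on.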

Task 6: Redo numpy exercises in torch

The following is a subset of the exercises you did in the second lecture using numpy. Repeat them using torch tensors instead of numpy arrays. This is mainly to show how similar numpy and pytorch are.

Create the following tensors:

A three-dimensional tensor of shape (3, 3, 4) containing zeros
A two-dimensional tensor with 4 rows and 3 columns that is equivalent to the list [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6], [0.7, 0.8, 0.9], [1.0, 1.1, 1.2]]. Do not just type in the numbers.
Select the bottom left 2 x 2 tensor from the tensor you just created

[ ]:

a = torch.zeros((3, 3, 4))
a

[ ]:

b = torch.linspace(0.1, 1.2, 12).reshape(4, 3)
b

[ ]:

b[-2:, :2]

Now do the following calculations with tensors

Do a matrix multiplication of the two tensors x and y
Do an elementwise multiplication of the tensors x and y
Do an elementwise addition of x and z
Do an elementwise addition of x and z.reshape(-1, 1)
Sum the two rows in x

[ ]:

x = torch.tensor([[0.5, 1.5], [2.5, 3.5]])
y = torch.diag(torch.tensor([2.0, 3.0]))
z = torch.tensor([2.0, 3.0])

[ ]:

x.matmul(y)

[ ]:

x * y

[ ]:

x + z

[ ]:

x + z.reshape(-1, 1)

[ ]:

torch.exp(z)

[ ]:

x.sum(axis=0)

Task 7: Differences between torch and numpy

The following exercises show a few differences between torch and numpy.

Do a matrix multiplication of the tensors u and v
Check the device of the tensor u
Explicitly set the device to ‘cpu’

[ ]:

u = torch.ones(2, 2)
v = torch.tensor([[1, 2], [3, 4]])

[ ]:

u @ v.to(torch.float)
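
The cast to float is needed because of how torch handles dtypes. As a rough sketch, assuming a recent PyTorch version with default settings: torch.ones produces float32, torch.tensor applied to Python integers produces int64, and matrix multiplication does not promote mixed dtypes, whereas numpy would silently upcast the integers. The flag mixed_ok exists only for this illustration.

```python
import torch

u = torch.ones(2, 2)                # default dtype: torch.float32
v = torch.tensor([[1, 2], [3, 4]])  # inferred dtype: torch.int64

try:
    u @ v  # float32 @ int64: torch refuses instead of upcasting
    mixed_ok = True
except RuntimeError:
    mixed_ok = False

# Casting v to the dtype of u makes the multiplication well defined.
product = u @ v.to(u.dtype)

print(u.dtype, v.dtype)
print(mixed_ok)
print(product)
```

Casting with v.to(u.dtype) instead of a hard-coded torch.float keeps the two operands consistent even if u is later created with a different floating point type.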

[ ]:

u.device

[ ]:

u.to("cpu")