Exercise 6 (solution)
[ ]:
import numpy as np
from datasets import load_dataset
from transformers import AutoTokenizer
import torch
from transformers import logging
logging.set_verbosity_error()
Task 1: Masked calculations in numpy
Calculate the mean over all valid entries in a and b.
For a, valid directly shows which entries are valid. For b, valid defines which rows are valid.
[ ]:
a = np.arange(4)
b = np.arange(12).reshape(4, 3)
valid = np.array([1, 0, 1, 0]).astype(bool)
[ ]:
masked_a = np.ma.array(a, mask=~valid)
masked_a
[ ]:
masked_a.mean()
[ ]:
valid_mat = valid.reshape(-1, 1).repeat(3, axis=1)
masked_b = np.ma.array(b, mask=~valid_mat)
masked_b
[ ]:
masked_b.mean()
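As a cross-check on the masked arrays above, the same means can be computed with plain boolean indexing. This is a minimal sketch that reuses a, b and valid from the task; the names mean_a and mean_b are introduced here for illustration only.

```python
import numpy as np

a = np.arange(4)
b = np.arange(12).reshape(4, 3)
valid = np.array([1, 0, 1, 0]).astype(bool)

# Boolean indexing keeps only the valid entries of a ...
mean_a = a[valid].mean()
# ... and only the valid rows of b (the mask indexes the first axis).
mean_b = b[valid].mean()

# The masked-array route from the task should give the same numbers.
masked_a = np.ma.array(a, mask=~valid)
valid_mat = valid.reshape(-1, 1).repeat(3, axis=1)
masked_b = np.ma.array(b, mask=~valid_mat)

print(mean_a, mean_b)  # 1.0 4.0
```

Boolean indexing is shorter when a single overall mean is all that is needed. The masked-array form keeps the original shape, which matters once the mean has to be taken along one axis only.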
Task 4: Now the same using map
1. Write a function called extract_states that takes batch and model as arguments and extracts and averages the last hidden states. This really just means copy-pasting all the steps we did in the last two tasks into one function and saving the result in the batch.
2. Test the function on your practice batch.
3. Apply the function to the encoded emotions dataset using .map with the following settings:
   - batched=True
   - batch_size=1000
   - fn_kwargs={"model": model}
Note: Step 3 will take a while, so let it run while I show the solution and discuss the next steps. If you want, you can use num_proc=... to run this step on more than one core. If so, set it to the number of physical cores in your computer.
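To make the three settings concrete, here is a rough mental model of what a batched map does with them. The helper map_in_batches and the toy function add_length are hypothetical and written in plain Python; this is not how the datasets library implements .map. It only mirrors the behaviour the exercise relies on: the function receives a dict of column lists holding at most batch_size rows, gets the entries of fn_kwargs as extra keyword arguments, and returns a dict whose columns end up in the result.

```python
def map_in_batches(columns, fn, batch_size, fn_kwargs=None):
    """Hypothetical stand-in for a batched map (not the datasets implementation)."""
    fn_kwargs = fn_kwargs or {}
    n_rows = len(next(iter(columns.values())))
    result = {}
    for start in range(0, n_rows, batch_size):
        # Each batch is a dict mapping column names to lists of values.
        batch = {name: values[start:start + batch_size]
                 for name, values in columns.items()}
        out = fn(batch, **fn_kwargs)
        # Collect the returned columns, batch by batch.
        for name, values in out.items():
            result.setdefault(name, []).extend(values)
    return result


def add_length(batch, scale):
    # Toy batch function: adds one new column computed from an existing one.
    batch["length"] = [scale * len(ids) for ids in batch["input_ids"]]
    return batch


toy = {"input_ids": [[101, 7, 102], [101, 102], [101, 5, 6, 102]]}
mapped = map_in_batches(toy, add_length, batch_size=2, fn_kwargs={"scale": 1})
print(mapped["length"])  # [3, 2, 4]
```

In the exercise, model plays the role that scale plays here: an extra argument that stays the same for every batch.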
[ ]:
def extract_states(batch, model):
    pass
[ ]:
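def extract_states(batch, model):  # noqa: F811
    # Convert the tokenized batch (lists of ints) to tensors.
    input_ids = torch.tensor(batch["input_ids"])
    attention_mask = torch.tensor(batch["attention_mask"])
    # Forward pass without gradient tracking; only the activations are needed.
    with torch.no_grad():
        output = model(input_ids=input_ids, attention_mask=attention_mask)
    lhs = output.last_hidden_state.cpu().numpy()
    # Expand the (batch, tokens) mask to (batch, tokens, hidden) to match lhs.
    valid = np.array(batch["attention_mask"]).astype(bool)
    batch_size, n_tokens, hidden_dim = lhs.shape
    valid = valid.reshape(batch_size, n_tokens, 1).repeat(hidden_dim, axis=-1)
    # Average over the token axis, ignoring padding positions.
    masked_mean = np.ma.array(lhs, mask=~valid).mean(axis=1).data
    batch["hidden_state"] = masked_mean
    return batch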
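The averaging step in the function above can be checked on toy data without a model. The sketch below uses made-up arrays in place of the real last hidden states and attention mask, and compares the masked-array mean with an alternative based on broadcasting. The alternative is not part of the original solution; it is shown only to make clear what the mask does.

```python
import numpy as np

# Toy stand-ins: 2 sequences, 3 tokens each, hidden size 2.
lhs = np.arange(12, dtype=float).reshape(2, 3, 2)
attention_mask = np.array([[1, 1, 0],
                           [1, 0, 0]])

# Masked-array version, following the same steps as extract_states.
valid = attention_mask.astype(bool)
valid_3d = valid.reshape(2, 3, 1).repeat(2, axis=-1)
mean_ma = np.ma.array(lhs, mask=~valid_3d).mean(axis=1).data

# Broadcasting version: zero out padding, then divide by the token count.
weights = attention_mask[:, :, None]  # shape (2, 3, 1)
mean_bc = (lhs * weights).sum(axis=1) / weights.sum(axis=1)

print(mean_bc)  # rows: [1. 2.] and [6. 7.]
```

Both versions average only over the tokens where the attention mask is 1. The broadcasting version divides by the number of valid tokens, so it assumes every sequence has at least one.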
[ ]:
extract_states(batch, model)
[ ]:
last_states = emotions_encoded.map(
    extract_states,
    batched=True,
    batch_size=1000,
    fn_kwargs={"model": model},
)