Exercise 10#
In this exercise we learn about simple RNNs as well as encoder-decoder RNNs. We only implement some components from scratch in numpy and leave out the training. By now, it would not be hard for you to implement this in torch and add the training.
[ ]:
import numpy as np
from scipy.special import softmax
from dataclasses import dataclass
Task 1: Tokenization and embeddings#
Implement a simple character level tokenization and embedding algorithm. In contrast to what we did in an earlier lecture, we want to minimize the vocab size to just the characters that are present in a given text.
Write a function called `get_vocabulary(text)` that returns a sorted list of all characters that occur in the text.
Write a function called `tokenize(text, vocabulary)` that takes a text and a list of characters and returns a list of ints.
Write a function called `embed(tokens, vocab_size)` that returns a numpy array of shape (n_tokens, vocab_size) where each row is a one-hot vector corresponding to a token.
Call all the functions to create `in_embeddings` for our text.
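One possible sketch of the three helpers is shown here. It assumes tokens are plain Python ints and that the one-hot matrix is built with NumPy fancy indexing; it is meant as an illustration, not as the reference solution.

```python
import numpy as np


def get_vocabulary(text):
    """Return the sorted list of distinct characters in the text."""
    return sorted(set(text))


def tokenize(text, vocabulary):
    """Map each character to its position in the vocabulary."""
    token_dict = {char: pos for pos, char in enumerate(vocabulary)}
    return [token_dict[char] for char in text]


def embed(tokens, vocab_size):
    """Return an (n_tokens, vocab_size) array with one one-hot row per token."""
    out = np.zeros((len(tokens), vocab_size))
    out[np.arange(len(tokens)), tokens] = 1
    return out


text = "hello"
vocabulary = get_vocabulary(text)  # ['e', 'h', 'l', 'o']
vocab_size = len(vocabulary)
tokens = tokenize(text, vocabulary)  # [1, 0, 2, 2, 3]
in_embeddings = embed(tokens, vocab_size)  # shape (5, 4)
```

Building the lookup dictionary once keeps `tokenize` linear in the length of the text, instead of calling `vocabulary.index` for every character.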
[ ]:
text = "hello"
[ ]:
[ ]:
[ ]:
Task 2: A Params class#
Define a `dataclass` called `Params` that has the three attributes `w_xh`, `w_hh`, `w_hy`.
Create an instance of `Params` with weight matrices that have the correct shapes and are filled with uniform random values between -1 and 1.
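A minimal sketch of such a class follows. The shapes are an assumption: each matrix is taken to multiply a column vector from the left, so `w_xh` is (n_hidden, n_in), `w_hh` is (n_hidden, n_hidden) and `w_hy` is (n_out, n_hidden). The value `vocab_size = 4` and the seed are also assumptions made so the block runs on its own.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class Params:
    w_xh: np.ndarray  # input -> hidden, shape (n_hidden, n_in)
    w_hh: np.ndarray  # hidden -> hidden, shape (n_hidden, n_hidden)
    w_hy: np.ndarray  # hidden -> output, shape (n_out, n_hidden)


vocab_size = 4  # assumption: the vocabulary of "hello" is ['e', 'h', 'l', 'o']
n_in = vocab_size
n_out = vocab_size
n_hidden = 3

rng = np.random.default_rng(0)  # seed chosen arbitrarily for reproducibility
p = Params(
    w_xh=rng.uniform(-1, 1, size=(n_hidden, n_in)),
    w_hh=rng.uniform(-1, 1, size=(n_hidden, n_hidden)),
    w_hy=rng.uniform(-1, 1, size=(n_out, n_hidden)),
)
```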
[ ]:
n_in = vocab_size
n_out = vocab_size
n_hidden = 3
[ ]:
[ ]:
Task 3: Implement a Vanilla RNN (for Language Modelling)#
Implement a function called `model_step(x, h, p)` where `x` is a one-hot vector, `h` is a vector that holds the internal state of the RNN and `p` is an instance of `Params`.
Implement a function called `model(embeddings, p)` that calls `model_step` internally and produces an array of logits. The output array has shape (len(embeddings) - 1, vocab_size). The function does roughly the following steps:
Initialize h to a vector of zeros
Call `model_step` in a loop
Collect all y in a list
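The steps above could be sketched as follows. Several details are assumptions rather than part of the task description: the hidden activation is `tanh`, there are no bias terms, `model_step` returns the pair `(y, h_new)`, and the last embedding is skipped because it has no successor to predict.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class Params:
    w_xh: np.ndarray  # (n_hidden, n_in)
    w_hh: np.ndarray  # (n_hidden, n_hidden)
    w_hy: np.ndarray  # (n_out, n_hidden)


def model_step(x, h, p):
    """Run one RNN step and return (logits, new hidden state)."""
    h_new = np.tanh(p.w_xh @ x + p.w_hh @ h)  # assumed activation: tanh
    y = p.w_hy @ h_new
    return y, h_new


def model(embeddings, p):
    """Return logits of shape (len(embeddings) - 1, vocab_size)."""
    h = np.zeros(p.w_hh.shape[0])
    ys = []
    for x in embeddings[:-1]:  # the last token has no next token to predict
        y, h = model_step(x, h, p)
        ys.append(y)
    return np.array(ys)


rng = np.random.default_rng(0)
p = Params(
    w_xh=rng.uniform(-1, 1, size=(3, 4)),
    w_hh=rng.uniform(-1, 1, size=(3, 3)),
    w_hy=rng.uniform(-1, 1, size=(4, 3)),
)
embeddings = np.eye(4)[[1, 0, 2, 2, 3]]  # one-hot rows for "hello"
logits = model(embeddings, p)  # shape (4, 4)
```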
[ ]:
[ ]:
[ ]:
[ ]:
Task 4: Implement loss function#
Create a list called `targets` that contains the target token for each output, i.e. the tokenized version of "ello".
Write a function called `cross_entropy_loss(logits, targets)`. This is basically the same function you wrote in lecture 8. The steps are roughly:
Take the softmax over the last axis
Use the indexing trick to get likelihoods
Return the negative mean of the log likelihoods
We are not using the loss function for training; I just want to make sure you understand what the loss function for language modelling is.
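A sketch of the loss along the lines of the steps above. The "indexing trick" is assumed to mean picking, for each row, the probability at the target index via NumPy integer-array indexing. The `targets` list assumes the vocabulary `['e', 'h', 'l', 'o']`, and the all-zero logits are only a sanity check: uniform predictions over four classes should give a loss of log(4).

```python
import numpy as np
from scipy.special import softmax


def cross_entropy_loss(logits, targets):
    """Mean negative log likelihood of the target tokens."""
    probs = softmax(logits, axis=-1)
    # Indexing trick: select probs[i, targets[i]] for every row i.
    likelihoods = probs[np.arange(len(targets)), targets]
    return -np.mean(np.log(likelihoods))


targets = [0, 2, 2, 3]  # "ello" under the assumed vocabulary ['e', 'h', 'l', 'o']
uniform_logits = np.zeros((4, 4))
loss = cross_entropy_loss(uniform_logits, targets)  # equals log(4)
```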
[ ]:
[ ]:
[ ]:
Task 5: Implement a text-to-text model and use optimal weights#
In this task I give you trained weights for the model. Those weights should enable the model to correctly return "ello" when prompted with "hello".
The only thing you need to do is:
Write a function called `s2s_model(text, p, vocabulary)` that takes text and returns text. Inside, you have to do the following steps:
Tokenize the text
Embed the text
Use the model to get logits
Get predicted tokens from the logits
Translate the tokens into text
[ ]:
w_xh_opt = np.array(
[
[-13.8, 0.6, 2.7, 0.1],
[4.7, -20.9, 1.6, 0.1],
[1.6, 6.9, 10.9, 0.0],
]
)
w_hh_opt = np.array(
[
[-2.1, -5.9, 7.2],
[-5.9, -4.2, 0.8],
[6.0, 7.5, 2.8],
]
)
w_hy_opt = np.array(
[[-0.6, -24.2, -0.7], [3.4, 8.8, -12.0], [-12.5, 12.2, 9.0], [10.0, 3.2, 3.7]]
)
p_opt = Params(
w_xh=w_xh_opt,
w_hh=w_hh_opt,
w_hy=w_hy_opt,
)
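Putting the pieces together, a text-to-text function could look like the sketch below. It repeats the helpers and the trained weights so that it runs on its own. The `tanh` activation, the absence of biases, and the use of `argmax` to pick the predicted token are assumptions; under them, the weights given above should map "hello" to "ello".

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class Params:
    w_xh: np.ndarray
    w_hh: np.ndarray
    w_hy: np.ndarray


def get_vocabulary(text):
    return sorted(set(text))


def tokenize(text, vocabulary):
    token_dict = {char: pos for pos, char in enumerate(vocabulary)}
    return [token_dict[char] for char in text]


def embed(tokens, vocab_size):
    out = np.zeros((len(tokens), vocab_size))
    out[np.arange(len(tokens)), tokens] = 1
    return out


def model_step(x, h, p):
    h_new = np.tanh(p.w_xh @ x + p.w_hh @ h)  # assumed activation: tanh
    return p.w_hy @ h_new, h_new


def model(embeddings, p):
    h = np.zeros(p.w_hh.shape[0])
    ys = []
    for x in embeddings[:-1]:
        y, h = model_step(x, h, p)
        ys.append(y)
    return np.array(ys)


def s2s_model(text, p, vocabulary):
    """Tokenize, embed, run the RNN, and decode the most likely next characters."""
    tokens = tokenize(text, vocabulary)
    embeddings = embed(tokens, len(vocabulary))
    logits = model(embeddings, p)
    predicted = logits.argmax(axis=-1)  # greedy choice per position
    return "".join(vocabulary[t] for t in predicted)


p_opt = Params(
    w_xh=np.array([[-13.8, 0.6, 2.7, 0.1], [4.7, -20.9, 1.6, 0.1], [1.6, 6.9, 10.9, 0.0]]),
    w_hh=np.array([[-2.1, -5.9, 7.2], [-5.9, -4.2, 0.8], [6.0, 7.5, 2.8]]),
    w_hy=np.array(
        [[-0.6, -24.2, -0.7], [3.4, 8.8, -12.0], [-12.5, 12.2, 9.0], [10.0, 3.2, 3.7]]
    ),
)

text = "hello"
prediction = s2s_model(text, p_opt, get_vocabulary(text))
print(prediction)  # "ello" under the assumptions stated above
```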
[ ]:
Switching to word level embedding for machine translation#
To learn about encoder-decoder RNNs we switch from character-level tokenization to word-level tokenization. Moreover, we add a start and end token.
Since you already know how to write tokenizers, here is the code:
[ ]:
in_text = "Hello World"
out_text = "Hallo Welt"


def get_vocabulary(text):
    """Get a minimal vocabulary to tokenize the text."""
    text = text.lower().split()
    words = sorted(set(text)) + ["<SOS>", "<EOS>"]
    return words


def tokenize(text, vocabulary):
    """Tokenize the text, given the vocabulary."""
    text = ["<SOS>"] + text.lower().split() + ["<EOS>"]
    token_dict = {character: pos for pos, character in enumerate(vocabulary)}
    out = [token_dict[character] for character in text]
    return out


def embed(tokens, vocab_size):
    """Create input embeddings for each token."""
    out = np.zeros((len(tokens), vocab_size))
    out[np.arange(len(out)), tokens] = 1
    return out


in_vocabulary = get_vocabulary(in_text)
print("Input vocabulary:", in_vocabulary)
in_vocab_size = len(in_vocabulary)
in_tokens = tokenize(in_text, in_vocabulary)
print("Input tokens:", in_tokens)
in_embeddings = embed(in_tokens, in_vocab_size)
out_vocabulary = get_vocabulary(out_text)
print("Output vocabulary:", out_vocabulary)
out_vocab_size = len(out_vocabulary)
out_tokens = tokenize(out_text, out_vocabulary)
print("Output tokens:", out_tokens)
target_size = len(out_tokens)
print("Target size:", target_size)
n_in = in_vocab_size
n_out = out_vocab_size
n_hidden = 4
Moreover, you get code for two parameter classes you can use in your model:
[ ]:
@dataclass
class EncoderParams:
    w_xh: np.ndarray
    w_hh: np.ndarray


@dataclass
class DecoderParams:
    w_ss: np.ndarray
    w_ys: np.ndarray
    w_sy: np.ndarray


np.random.seed(1234)
p_enc = EncoderParams(
w_xh=np.random.uniform(size=(n_hidden, n_in)),
w_hh=np.random.uniform(size=(n_hidden, n_hidden)),
)
p_dec = DecoderParams(
w_ss=np.random.uniform(size=(n_hidden, n_hidden)),
w_ys=np.random.uniform(size=(n_hidden, n_out)),
w_sy=np.random.uniform(size=(n_out, n_hidden)),
)
p_enc
Task 6: Implement encode and decode steps (for Machine Translation)#
Write a function called `encode_step(x, h, p_enc)`.
Write a function called `decode_step(s, y_prev, p_dec)`.
The two functions together will play the same role as the `model_step` in the simple RNN.
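A possible shape for the two steps is sketched here. The update rules are assumptions inferred from the parameter names and shapes: the encoder computes `tanh(w_xh @ x + w_hh @ h)`, the decoder computes `tanh(w_ss @ s + w_ys @ y_prev)` and then a softmax over `w_sy @ s_new`. Returning the pair `(y, s_new)` is likewise just one convenient convention.

```python
from dataclasses import dataclass

import numpy as np
from scipy.special import softmax


@dataclass
class EncoderParams:
    w_xh: np.ndarray  # (n_hidden, n_in)
    w_hh: np.ndarray  # (n_hidden, n_hidden)


@dataclass
class DecoderParams:
    w_ss: np.ndarray  # (n_hidden, n_hidden)
    w_ys: np.ndarray  # (n_hidden, n_out)
    w_sy: np.ndarray  # (n_out, n_hidden)


def encode_step(x, h, p_enc):
    """Update the encoder state from the previous state and one input embedding."""
    return np.tanh(p_enc.w_xh @ x + p_enc.w_hh @ h)


def decode_step(s, y_prev, p_dec):
    """Update the decoder state and return (output probabilities, new state)."""
    s_new = np.tanh(p_dec.w_ss @ s + p_dec.w_ys @ y_prev)
    y = softmax(p_dec.w_sy @ s_new)
    return y, s_new


n_in, n_out, n_hidden = 4, 4, 4
rng = np.random.default_rng(0)  # arbitrary seed, for illustration only
p_enc = EncoderParams(
    w_xh=rng.uniform(size=(n_hidden, n_in)),
    w_hh=rng.uniform(size=(n_hidden, n_hidden)),
)
p_dec = DecoderParams(
    w_ss=rng.uniform(size=(n_hidden, n_hidden)),
    w_ys=rng.uniform(size=(n_hidden, n_out)),
    w_sy=rng.uniform(size=(n_out, n_hidden)),
)

h = encode_step(np.eye(n_in)[0], np.zeros(n_hidden), p_enc)
y, s = decode_step(h, np.eye(n_out)[2], p_dec)
```

Unlike `model_step`, the encoder step produces no output; only the decoder step maps its state to a distribution over the output vocabulary.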
[ ]:
[ ]:
Task 7: Implement the encoder-decoder model#
Write a function called `model(in_embeddings, target_size, p_enc, p_dec)`. The function has the following steps:
Initialize h as a vector of zeros
Call the encode step in a loop to produce a final encoder state (h)
Rename h to s
Initialize `y_prev` to the embedding of the `<SOS>` token in the output vocabulary
Call the decode step in a loop and collect the ys in a list
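The steps above can be sketched as follows. Beyond the assumed update rules (tanh states, softmax outputs), three further assumptions are labeled in the code: `<SOS>` is taken to be the second-to-last entry of the output vocabulary, the decoder runs for exactly `target_size` steps, and each output distribution is fed back unchanged as the next `y_prev`. The exercise may intend different choices.

```python
from dataclasses import dataclass

import numpy as np
from scipy.special import softmax


@dataclass
class EncoderParams:
    w_xh: np.ndarray
    w_hh: np.ndarray


@dataclass
class DecoderParams:
    w_ss: np.ndarray
    w_ys: np.ndarray
    w_sy: np.ndarray


def encode_step(x, h, p_enc):
    return np.tanh(p_enc.w_xh @ x + p_enc.w_hh @ h)


def decode_step(s, y_prev, p_dec):
    s_new = np.tanh(p_dec.w_ss @ s + p_dec.w_ys @ y_prev)
    return softmax(p_dec.w_sy @ s_new), s_new


def model(in_embeddings, target_size, p_enc, p_dec):
    """Encode the input sequence, then decode target_size output distributions."""
    n_hidden = p_enc.w_hh.shape[0]
    n_out = p_dec.w_sy.shape[0]

    h = np.zeros(n_hidden)
    for x in in_embeddings:
        h = encode_step(x, h, p_enc)

    s = h  # the final encoder state becomes the initial decoder state
    y_prev = np.zeros(n_out)
    y_prev[n_out - 2] = 1  # assumption: <SOS> is the second-to-last vocabulary entry

    ys = []
    for _ in range(target_size):  # assumption: one output per target token
        y_prev, s = decode_step(s, y_prev, p_dec)  # assumption: feed y back as is
        ys.append(y_prev)
    return np.array(ys)


rng = np.random.default_rng(0)  # arbitrary seed, for illustration only
p_enc = EncoderParams(w_xh=rng.uniform(size=(4, 4)), w_hh=rng.uniform(size=(4, 4)))
p_dec = DecoderParams(
    w_ss=rng.uniform(size=(4, 4)),
    w_ys=rng.uniform(size=(4, 4)),
    w_sy=rng.uniform(size=(4, 4)),
)
in_embeddings = np.eye(4)[[2, 0, 1, 3]]  # <SOS> hello world <EOS>
ys = model(in_embeddings, 4, p_enc, p_dec)  # shape (4, 4)
```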
[ ]:
Task 8: Implement the encoder-decoder text-to-text model#
Implement a function called `s2s_model(in_text, p_enc, p_dec, in_vocabulary, out_vocabulary, target_size)`. This is similar to the function you wrote above, but this time the input vocabulary and output vocabulary differ.
[ ]:
[ ]:
w_xh_opt = np.array(
[
[1.1, 0.2, -0.4, 0.5],
[0.7, -0.5, -0.6, -7.9],
[0.4, 3.9, 0.3, -0.2],
[1.2, 0.6, 0.1, 2.7],
]
)
w_hh_opt = np.array(
[
[-0.6, -0.8, 0.4, -0.1],
[-1.6, 2.7, -3.7, -3.8],
[-1.2, 1.3, 1.0, 0.2],
[-0.0, -0.6, 0.6, 0.9],
]
)
w_ss_opt = np.array(
[
[10.6, -0.5, -5.2, 0.7],
[-6.7, 1.9, -5.0, 1.8],
[10.9, -7.5, -6.2, 4.5],
[1.5, 0.4, 4.1, -4.4],
]
)
w_sy_opt = np.array(
[
[3.0, -10.6, -22.2, 4.2],
[4.4, 15.4, 9.6, -7.0],
[4.8, -4.0, 12.2, -6.0],
[-16.4, -10.1, -18.8, 14.1],
]
)
w_ys_opt = np.array(
[
[0.5, -3.0, 11.9, 4.2],
[3.0, 0.8, 2.0, 1.0],
[3.8, -1.1, 16.1, 8.3],
[-3.2, -0.1, -7.4, -0.7],
]
)
p_enc_opt = EncoderParams(
w_xh=w_xh_opt,
w_hh=w_hh_opt,
)
p_dec_opt = DecoderParams(
w_ss=w_ss_opt,
w_ys=w_ys_opt,
w_sy=w_sy_opt,
)
[ ]: