{ "cells": [ { "cell_type": "markdown", "id": "406156cd", "metadata": {}, "source": [ "# Exercise 10 (solution)\n", "\n", "In this exercise we learn about simple RNNs as well as encoder-decoder RNNs. We only implement some components from scratch in numpy and leave out the training. By now, it would not be hard for you to implement this in `torch` and add the training." ] }, { "cell_type": "code", "execution_count": null, "id": "0a52a118", "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "from scipy.special import softmax\n", "from dataclasses import dataclass" ] }, { "cell_type": "markdown", "id": "9eec7b0d", "metadata": {}, "source": [ "## Task 1: Tokenization and embeddings\n", "\n", "Implement a simple **character level** tokenization and embedding algorithm. In contrast to what we did in an earlier lecture, we want to minimize the vocab size to just the characters that are present in a given text.\n", "\n", "1. Write a function called `get_vocabulary(text)` that returns a sorted list of all characters that occur in the text\n", "2. Write a function called `tokenize(text, vocabulary)` that takes a text and list of characters and returns a list of ints. \n", "3. Write a function called `embed(tokens, vocab_size)` that returns a numpy array of shape (n_tokens, vocab_size) where each row is a one-hot vector corresponding to a token\n", "4. 
Call all the functions to create `in_embeddings` for our text." ] }, { "cell_type": "code", "execution_count": null, "id": "ec6ae98a", "metadata": {}, "outputs": [], "source": [ "text = \"hello\"" ] }, { "cell_type": "code", "execution_count": null, "id": "95d47cb3", "metadata": {}, "outputs": [], "source": [ "def get_vocabulary(text):\n", " \"\"\"Get a minimal character level vocabulary to tokenize the text.\"\"\"\n", " text = text.lower()\n", " characters = sorted(set(text))\n", " return characters\n", "\n", "\n", "vocabulary = get_vocabulary(text)\n", "vocab_size = len(vocabulary)\n", "vocabulary" ] }, { "cell_type": "code", "execution_count": null, "id": "3e01a33e", "metadata": {}, "outputs": [], "source": [ "def tokenize(text, vocabulary):\n", " \"\"\"Tokenize the text, given the vocabulary.\"\"\"\n", " text = text.lower()\n", " token_dict = {character: pos for pos, character in enumerate(vocabulary)}\n", " out = [token_dict[character] for character in text]\n", " return out\n", "\n", "\n", "tokens = tokenize(text, vocabulary)\n", "tokens" ] }, { "cell_type": "code", "execution_count": null, "id": "1466dbb7", "metadata": {}, "outputs": [], "source": [ "def embed(tokens, vocab_size):\n", " \"\"\"Create input embeddings for each token.\"\"\"\n", " out = np.zeros((len(tokens), vocab_size))\n", " out[np.arange(len(out)), tokens] = 1\n", " return out\n", "\n", "\n", "in_embeddings = embed(tokens, vocab_size)\n", "in_embeddings" ] }, { "cell_type": "markdown", "id": "63aef846", "metadata": {}, "source": [ "## Task 2: A Params class\n", "\n", "1. Define a `dataclass` called `Params` that has the three attributes `w_xh`, `w_hh`, `w_hy`\n", "2. Create an instance of `Params` with weight matrices that have the correct shapes and are filled with uniform random values between -1 and 1. 
" ] }, { "cell_type": "code", "execution_count": null, "id": "798874f6", "metadata": {}, "outputs": [], "source": [ "n_in = vocab_size\n", "n_out = vocab_size\n", "n_hidden = 3" ] }, { "cell_type": "code", "execution_count": null, "id": "21550574", "metadata": {}, "outputs": [], "source": [ "@dataclass\n", "class Params:\n", " w_xh: np.ndarray\n", " w_hh: np.ndarray\n", " w_hy: np.ndarray" ] }, { "cell_type": "code", "execution_count": null, "id": "ad24f925", "metadata": {}, "outputs": [], "source": [ "np.random.seed(12345)\n", "\n", "p = Params(\n", " w_xh=np.random.uniform(size=(n_hidden, n_in)),\n", " w_hh=np.random.uniform(size=(n_hidden, n_hidden)),\n", " w_hy=np.random.uniform(size=(n_out, n_hidden)),\n", ")\n", "p" ] }, { "cell_type": "markdown", "id": "9677845f", "metadata": {}, "source": [ "## Task 3: Implement a Vanilla RNN (for Language Modelling)\n", "\n", "1. Implement a function called `model_step(x, h, p)` where `x` is a one-hot vector, `h` is a vector that holds the internal state of the RNN and `p` is an instance of `Params`\n", "2. Implement a function called `model(embeddings, p)` that calles the `model_step` internally and produces an array of logits. The output array has shape (len(embeddings) -1, vocab_size). 
The function does roughly the following steps:\n", "    - Initialize h to a vector of zeros\n", "    - Call `model_step` in a loop\n", "    - Collect all y in a list" ] }, { "cell_type": "code", "execution_count": null, "id": "469a06b9", "metadata": {}, "outputs": [], "source": [ "def model_step(x, h, p):\n", "    h = np.tanh(p.w_xh @ x + p.w_hh @ h)\n", "    y = p.w_hy @ h\n", "    return h, y" ] }, { "cell_type": "code", "execution_count": null, "id": "aa766a9e", "metadata": {}, "outputs": [], "source": [ "def model(embeddings, p):\n", "    \"\"\"Model that takes input_embeddings and produces logits.\"\"\"\n", "    h = np.zeros(len(p.w_hh))\n", "    out = []\n", "    for x in embeddings[:-1]:\n", "        h, y = model_step(x, h, p)\n", "        out.append(y)\n", "    return np.array(out)" ] }, { "cell_type": "code", "execution_count": null, "id": "73f28d3b", "metadata": {}, "outputs": [], "source": [ "logits = model(in_embeddings, p)\n", "logits.shape" ] }, { "cell_type": "code", "execution_count": null, "id": "e609a037", "metadata": {}, "outputs": [], "source": [ "softmax(logits, axis=1).round(1)" ] }, { "cell_type": "markdown", "id": "d63b92fb", "metadata": {}, "source": [ "## Task 4: Implement loss function\n", "\n", "1. Create a list called `targets` that contains the target token for each output, i.e. the tokenized version of `\"ello\"`\n", "2. Write a function called `cross_entropy_loss(logits, targets)`. This is basically the same function you wrote in lecture 8. The steps are roughly:\n", "    - Take the softmax over the last axis\n", "    - Use the indexing trick to get likelihoods\n", "    - Return the negative mean of the log likelihoods\n", "\n", "We are not using the loss function for training; I just want to make sure you understand what the loss function for language modelling is. 
" ] }, { "cell_type": "code", "execution_count": null, "id": "43387ec6", "metadata": {}, "outputs": [], "source": [ "targets = tokens[1:]\n", "targets" ] }, { "cell_type": "code", "execution_count": null, "id": "cabe78b1", "metadata": {}, "outputs": [], "source": [ "def cross_entropy_loss(logits, targets):\n", " probs = softmax(logits, axis=1)\n", " likelihoods = probs[np.arange(len(targets)), targets]\n", " return -np.log(likelihoods + 1e-50).mean()" ] }, { "cell_type": "code", "execution_count": null, "id": "0b26cbdc", "metadata": {}, "outputs": [], "source": [ "cross_entropy_loss(logits, targets)" ] }, { "cell_type": "markdown", "id": "5eec6a24", "metadata": {}, "source": [ "## Task 5: Implement a text-to-text model and use optimal weights\n", "\n", "In this task I give you trained weights for the model. Those weights should enable the model to correctly return `\"ello\"` when prompted with `\"hello\"`\n", "\n", "The only think you need to do is:\n", "\n", "1. Write a function called `s2s_model(text, p, vocabulary)` that takes text and returns text. 
Inside, you have to do the following steps:\n", " - tokenize the text\n", " - embed the text\n", " - use the model to get logits\n", " - Get predicted tokens from the logits\n", " - Translate the tokens into text" ] }, { "cell_type": "code", "execution_count": null, "id": "a040dacb", "metadata": {}, "outputs": [], "source": [ "w_xh_opt = np.array(\n", " [\n", " [-13.8, 0.6, 2.7, 0.1],\n", " [4.7, -20.9, 1.6, 0.1],\n", " [1.6, 6.9, 10.9, 0.0],\n", " ]\n", ")\n", "\n", "w_hh_opt = np.array(\n", " [\n", " [-2.1, -5.9, 7.2],\n", " [-5.9, -4.2, 0.8],\n", " [6.0, 7.5, 2.8],\n", " ]\n", ")\n", "\n", "w_hy_opt = np.array(\n", " [[-0.6, -24.2, -0.7], [3.4, 8.8, -12.0], [-12.5, 12.2, 9.0], [10.0, 3.2, 3.7]]\n", ")\n", "\n", "p_opt = Params(\n", " w_xh=w_xh_opt,\n", " w_hh=w_hh_opt,\n", " w_hy=w_hy_opt,\n", ")" ] }, { "cell_type": "code", "execution_count": null, "id": "93c16412", "metadata": {}, "outputs": [], "source": [ "def s2s_model(text, p, vocabulary):\n", " \"\"\"Model that takes text and returns text.\"\"\"\n", " vocab_size = len(vocabulary)\n", " tokens = tokenize(text, vocabulary)\n", " input_embeddings = embed(tokens, vocab_size)\n", " logits = model(input_embeddings, p)\n", " predictions = np.argmax(logits, axis=1)\n", " return \"\".join(vocabulary[pred] for pred in predictions)\n", "\n", "\n", "s2s_model(text, p_opt, vocabulary)" ] }, { "cell_type": "markdown", "id": "ce9dbf0b", "metadata": {}, "source": [ "## Switching to word level embedding for machine translation\n", "\n", "To learn about encoder-decoder RNNs we switch from character-level tokenization to word-level tokenization. Moreover, we add a start and end token. 
\n", "\n", "Since you already know how to write tokenizers, here is the code:" ] }, { "cell_type": "code", "execution_count": null, "id": "61b08402", "metadata": {}, "outputs": [], "source": [ "in_text = \"Hello World\"\n", "out_text = \"Hallo Welt\"\n", "\n", "\n", "def get_vocabulary(text):\n", " \"\"\"Get a minimal vocabulary to tokenize the text.\"\"\"\n", " text = text.lower().split()\n", " words = sorted(set(text)) + [\"<start>\", \"<end>\"]\n", " return words\n", "\n", "\n", "def tokenize(text, vocabulary):\n", " \"\"\"Tokenize the text, given the vocabulary.\"\"\"\n", " text = [\"<start>\"] + text.lower().split() + [\"<end>\"]\n", " token_dict = {word: pos for pos, word in enumerate(vocabulary)}\n", " out = [token_dict[word] for word in text]\n", " return out\n", "\n", "\n", "def embed(tokens, vocab_size):\n", " \"\"\"Create input embeddings for each token.\"\"\"\n", " out = np.zeros((len(tokens), vocab_size))\n", " out[np.arange(len(out)), tokens] = 1\n", " return out\n", "\n", "\n", "in_vocabulary = get_vocabulary(in_text)\n", "print(\"Input vocabulary:\", in_vocabulary)\n", "in_vocab_size = len(in_vocabulary)\n", "in_tokens = tokenize(in_text, in_vocabulary)\n", "print(\"Input tokens:\", in_tokens)\n", "in_embeddings = embed(in_tokens, in_vocab_size)\n", "\n", "out_vocabulary = get_vocabulary(out_text)\n", "print(\"Output vocabulary:\", out_vocabulary)\n", "out_vocab_size = len(out_vocabulary)\n", "out_tokens = tokenize(out_text, out_vocabulary)\n", "print(\"Output tokens:\", out_tokens)\n", "target_size = len(out_tokens)\n", "print(\"Target size:\", target_size)\n", "\n", "\n", "n_in = in_vocab_size\n", "n_out = out_vocab_size\n", "n_hidden = 4" ] }, { "cell_type": "markdown", "id": "9e9a3fa9", "metadata": {}, "source": [ "In addition, here are two parameter classes you can use in your model:" ] }, { "cell_type": "code", "execution_count": null, "id": "c17bd2f8", "metadata": {}, "outputs": [], "source": [ "@dataclass\n", "class EncoderParams:\n", " 
w_xh: np.ndarray\n", " w_hh: np.ndarray\n", "\n", "\n", "@dataclass\n", "class DecoderParams:\n", " w_ss: np.ndarray\n", " w_ys: np.ndarray\n", " w_sy: np.ndarray\n", "\n", "\n", "np.random.seed(1234)\n", "\n", "p_enc = EncoderParams(\n", " w_xh=np.random.uniform(size=(n_hidden, n_in)),\n", " w_hh=np.random.uniform(size=(n_hidden, n_hidden)),\n", ")\n", "\n", "p_dec = DecoderParams(\n", " w_ss=np.random.uniform(size=(n_hidden, n_hidden)),\n", " w_ys=np.random.uniform(size=(n_hidden, n_out)),\n", " w_sy=np.random.uniform(size=(n_out, n_hidden)),\n", ")\n", "\n", "p_enc" ] }, { "cell_type": "markdown", "id": "663f8aa5", "metadata": {}, "source": [ "## Task 6: Implement encode and decode steps (for Machine Translation)\n", "\n", "1. Write a function called `encode_step(x, h, p_enc)`\n", "2. Write a function called `decode_step(s, y_prev, p_dec)`\n", "\n", "Together, the two functions play the same role as `model_step` in the simple RNN." ] }, { "cell_type": "code", "execution_count": null, "id": "5f5ce202", "metadata": {}, "outputs": [], "source": [ "def encode_step(x, h, p_enc):\n", " h = np.tanh(p_enc.w_xh @ x + p_enc.w_hh @ h)\n", " return h" ] }, { "cell_type": "code", "execution_count": null, "id": "b4c2427c", "metadata": {}, "outputs": [], "source": [ "def decode_step(s, y_prev, p_dec):\n", " s = np.tanh(p_dec.w_ss @ s + p_dec.w_ys @ y_prev)\n", " y = p_dec.w_sy @ s\n", " return s, y" ] }, { "cell_type": "markdown", "id": "42e74f87", "metadata": {}, "source": [ "## Task 7: Implement the encoder-decoder model\n", "\n", "1. Write a function called `model(in_embeddings, target_size, p_enc, p_dec)`. 
The function has the following steps:\n", "    - Initialize h as a vector of zeros\n", "    - Call the encode step in a loop to produce a final encoder state (h)\n", "    - Rename h to s\n", "    - Initialize `y_prev` to the embedding of the start token in the output vocabulary\n", "    - Call the decode step `target_size` times, feeding each `y` back in as the next `y_prev`\n", "    - Collect the ys in a list" ] }, { "cell_type": "code", "execution_count": null, "id": "34528c89", "metadata": {}, "outputs": [], "source": [ "def model(in_embeddings, target_size, p_enc, p_dec):\n", "    h = np.zeros(len(p_enc.w_hh))\n", "    for x in in_embeddings:\n", "        h = encode_step(x, h, p_enc)\n", "\n", "    s = h\n", "    y_prev = np.zeros(p_dec.w_ys.shape[1])\n", "    # The start token is the second-to-last entry of the output vocabulary.\n", "    y_prev[-2] = 1\n", "    out = []\n", "    for _ in range(target_size):\n", "        s, y = decode_step(s, y_prev, p_dec)\n", "        out.append(y)\n", "        y_prev = y\n", "\n", "    return np.array(out)" ] }, { "cell_type": "markdown", "id": "6cc440a7", "metadata": {}, "source": [ "## Task 8: Implement the encoder-decoder text-to-text model\n", "\n", "1. Implement a function called `s2s_model(in_text, p_enc, p_dec, in_vocabulary, out_vocabulary, target_size)`. This is similar to the function you wrote above, but this time the input vocabulary and output vocabulary differ. 
" ] }, { "cell_type": "code", "execution_count": null, "id": "7b563179", "metadata": {}, "outputs": [], "source": [ "def s2s_model(\n", " in_text,\n", " p_enc,\n", " p_dec,\n", " in_vocabulary=in_vocabulary,\n", " out_vocabulary=out_vocabulary,\n", " target_size=target_size,\n", "):\n", " \"\"\"Model that takes text and returns text.\"\"\"\n", " in_vocab_size = len(in_vocabulary)\n", " in_tokens = tokenize(in_text, in_vocabulary)\n", " in_embeddings = embed(in_tokens, in_vocab_size)\n", "\n", " logits = model(in_embeddings, target_size, p_enc, p_dec)\n", "\n", " predictions = np.argmax(logits, axis=1)\n", " return \" \".join(out_vocabulary[pred] for pred in predictions)" ] }, { "cell_type": "code", "execution_count": null, "id": "1a3c4dbe", "metadata": {}, "outputs": [], "source": [ "w_xh_opt = np.array(\n", " [\n", " [1.1, 0.2, -0.4, 0.5],\n", " [0.7, -0.5, -0.6, -7.9],\n", " [0.4, 3.9, 0.3, -0.2],\n", " [1.2, 0.6, 0.1, 2.7],\n", " ]\n", ")\n", "\n", "w_hh_opt = np.array(\n", " [\n", " [-0.6, -0.8, 0.4, -0.1],\n", " [-1.6, 2.7, -3.7, -3.8],\n", " [-1.2, 1.3, 1.0, 0.2],\n", " [-0.0, -0.6, 0.6, 0.9],\n", " ]\n", ")\n", "\n", "w_ss_opt = np.array(\n", " [\n", " [10.6, -0.5, -5.2, 0.7],\n", " [-6.7, 1.9, -5.0, 1.8],\n", " [10.9, -7.5, -6.2, 4.5],\n", " [1.5, 0.4, 4.1, -4.4],\n", " ]\n", ")\n", "\n", "w_sy_opt = np.array(\n", " [\n", " [3.0, -10.6, -22.2, 4.2],\n", " [4.4, 15.4, 9.6, -7.0],\n", " [4.8, -4.0, 12.2, -6.0],\n", " [-16.4, -10.1, -18.8, 14.1],\n", " ]\n", ")\n", "\n", "\n", "w_ys_opt = np.array(\n", " [\n", " [0.5, -3.0, 11.9, 4.2],\n", " [3.0, 0.8, 2.0, 1.0],\n", " [3.8, -1.1, 16.1, 8.3],\n", " [-3.2, -0.1, -7.4, -0.7],\n", " ]\n", ")\n", "\n", "p_enc_opt = EncoderParams(\n", " w_xh=w_xh_opt,\n", " w_hh=w_hh_opt,\n", ")\n", "\n", "p_dec_opt = DecoderParams(\n", " w_ss=w_ss_opt,\n", " w_ys=w_ys_opt,\n", " w_sy=w_sy_opt,\n", ")" ] }, { "cell_type": "code", "execution_count": null, "id": "dcabda85", "metadata": {}, "outputs": [], "source": [ 
"s2s_model(in_text, p_enc_opt, p_dec_opt)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.3" } }, "nbformat": 4, "nbformat_minor": 5 }