{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "406156cd",
   "metadata": {},
   "source": [
    "# Exercise 10 (solution)\n",
    "\n",
    "In this exercise we learn about simple RNNs as well as encoder-decoder RNNs. We only implement some components from scratch in numpy and leave out the training. By now, it would not be hard for you to implement this in `torch` and add the training."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "0a52a118",
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "from scipy.special import softmax\n",
    "from dataclasses import dataclass"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9eec7b0d",
   "metadata": {},
   "source": [
    "## Task 1: Tokenization and embeddings\n",
    "\n",
    "Implement a simple **character level** tokenization and embedding algorithm. In contrast to what we did in an earlier lecture, we want to minimize the vocab size to just the characters that are present in a given text.\n",
    "\n",
    "1. Write a function called `get_vocabulary(text)` that returns a sorted list of all characters that occur in the text\n",
    "2. Write a function called `tokenize(text, vocabulary)` that takes a text and list of characters and returns a list of ints. \n",
    "3. Write a function called `embed(tokens, vocab_size)` that returns a numpy array of shape (n_tokens, vocab_size) where each row is a one-hot vector corresponding to a token\n",
    "4. Call all the functions and to create `in_embeddings` for our text"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ec6ae98a",
   "metadata": {},
   "outputs": [],
   "source": [
    "text = \"hello\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "95d47cb3",
   "metadata": {},
   "outputs": [],
   "source": [
    "def get_vocabulary(text):\n",
    "    \"\"\"Get a minimal character level vocabulary to tokenize the text.\"\"\"\n",
    "    text = text.lower()\n",
    "    characters = sorted(set(text))\n",
    "    return characters\n",
    "\n",
    "\n",
    "vocabulary = get_vocabulary(text)\n",
    "vocab_size = len(vocabulary)\n",
    "vocabulary"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3e01a33e",
   "metadata": {},
   "outputs": [],
   "source": [
    "def tokenize(text, vocabulary):\n",
    "    \"\"\"Tokenize the text, given the vocabulary.\"\"\"\n",
    "    text = text.lower()\n",
    "    token_dict = {character: pos for pos, character in enumerate(vocabulary)}\n",
    "    out = [token_dict[character] for character in text]\n",
    "    return out\n",
    "\n",
    "\n",
    "tokens = tokenize(text, vocabulary)\n",
    "tokens"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1466dbb7",
   "metadata": {},
   "outputs": [],
   "source": [
    "def embed(tokens, vocab_size):\n",
    "    \"\"\"Create input embeddings for each token.\"\"\"\n",
    "    out = np.zeros((len(tokens), vocab_size))\n",
    "    out[np.arange(len(out)), tokens] = 1\n",
    "    return out\n",
    "\n",
    "\n",
    "in_embeddings = embed(tokens, vocab_size)\n",
    "in_embeddings"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "63aef846",
   "metadata": {},
   "source": [
    "## Task 2: A Params class\n",
    "\n",
    "1. Define a `dataclass` called `Params` that has the three attributes `w_xh`, `w_hh`, `w_hy`\n",
    "2. Create an instance of `Params` with weight matrices that have the correct shapes and are filled with uniform random values between -1 and 1. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "798874f6",
   "metadata": {},
   "outputs": [],
   "source": [
    "n_in = vocab_size\n",
    "n_out = vocab_size\n",
    "n_hidden = 3"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "21550574",
   "metadata": {},
   "outputs": [],
   "source": [
    "@dataclass\n",
    "class Params:\n",
    "    w_xh: np.ndarray\n",
    "    w_hh: np.ndarray\n",
    "    w_hy: np.ndarray"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ad24f925",
   "metadata": {},
   "outputs": [],
   "source": [
    "np.random.seed(12345)\n",
    "\n",
    "p = Params(\n",
    "    w_xh=np.random.uniform(size=(n_hidden, n_in)),\n",
    "    w_hh=np.random.uniform(size=(n_hidden, n_hidden)),\n",
    "    w_hy=np.random.uniform(size=(n_out, n_hidden)),\n",
    ")\n",
    "p"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9677845f",
   "metadata": {},
   "source": [
    "## Task 3: Implement a Vanilla RNN (for Language Modelling)\n",
    "\n",
    "1. Implement a function called `model_step(x, h, p)` where `x` is a one-hot vector, `h` is a vector that holds the internal state of the RNN and `p` is an instance of `Params`\n",
    "2. Implement a function called `model(embeddings, p)` that calles the `model_step` internally and produces an array of logits. The output array has shape (len(embeddings) -1, vocab_size). The function does roughly the following steps:\n",
    "    - Initialize h to a vector of zeros\n",
    "    - call `model` in a loop\n",
    "    - Collect all y in a list"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "469a06b9",
   "metadata": {},
   "outputs": [],
   "source": [
    "def model_step(x, h, p):\n",
    "    h = np.tanh(p.w_xh @ x + p.w_hh @ h)\n",
    "    y = p.w_hy @ h\n",
    "    return h, y"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "aa766a9e",
   "metadata": {},
   "outputs": [],
   "source": [
    "def model(embeddings, p):\n",
    "    \"\"\"Model that takes input_embeddings and produces logits.\"\"\"\n",
    "    h = np.zeros(len(p.w_hh))\n",
    "    out = []\n",
    "    for x in embeddings[:-1]:\n",
    "        h, y = model_step(x, h, p)\n",
    "        out.append(y)\n",
    "    return np.array(out)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "73f28d3b",
   "metadata": {},
   "outputs": [],
   "source": [
    "logits = model(in_embeddings, p)\n",
    "logits.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e609a037",
   "metadata": {},
   "outputs": [],
   "source": [
    "softmax(logits, axis=1).round(1)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d63b92fb",
   "metadata": {},
   "source": [
    "## Task 4: Implement loss function\n",
    "\n",
    "1. Create a list called `targets` that contains the target token for each output. I.e. the tokenized version of `\"ello\"`\n",
    "2. Write a function called `cross_entropy_loss(logits, targets)`. This is basically the same function you wrote in lecture 8. The steps are roughly:\n",
    "    - Take the softmax over the last axis\n",
    "    - Use the indexing trick to get likelihoods\n",
    "    - Return the negative mean of the log likelihoods\n",
    "\n",
    "We are not using the loss function for training, I just want to make sure you understand what is the loss function for language modelling. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "43387ec6",
   "metadata": {},
   "outputs": [],
   "source": [
    "targets = tokens[1:]\n",
    "targets"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cabe78b1",
   "metadata": {},
   "outputs": [],
   "source": [
    "def cross_entropy_loss(logits, targets):\n",
    "    probs = softmax(logits, axis=1)\n",
    "    likelihoods = probs[np.arange(len(targets)), targets]\n",
    "    return -np.log(likelihoods + 1e-50).mean()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "0b26cbdc",
   "metadata": {},
   "outputs": [],
   "source": [
    "cross_entropy_loss(logits, targets)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5eec6a24",
   "metadata": {},
   "source": [
    "## Task 5: Implement a text-to-text model and use optimal weights\n",
    "\n",
    "In this task I give you trained weights for the model. Those weights should enable the model to correctly return `\"ello\"` when prompted with `\"hello\"`\n",
    "\n",
    "The only think you need to do is:\n",
    "\n",
    "1. Write a function called `s2s_model(text, p, vocabulary)` that takes text and returns text. Inside, you have to do the following steps:\n",
    "    - tokenize the text\n",
    "    - embed the text\n",
    "    - use the model to get logits\n",
    "    - Get predicted tokens from the logits\n",
    "    - Translate the tokens into text"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a040dacb",
   "metadata": {},
   "outputs": [],
   "source": [
    "w_xh_opt = np.array(\n",
    "    [\n",
    "        [-13.8, 0.6, 2.7, 0.1],\n",
    "        [4.7, -20.9, 1.6, 0.1],\n",
    "        [1.6, 6.9, 10.9, 0.0],\n",
    "    ]\n",
    ")\n",
    "\n",
    "w_hh_opt = np.array(\n",
    "    [\n",
    "        [-2.1, -5.9, 7.2],\n",
    "        [-5.9, -4.2, 0.8],\n",
    "        [6.0, 7.5, 2.8],\n",
    "    ]\n",
    ")\n",
    "\n",
    "w_hy_opt = np.array(\n",
    "    [[-0.6, -24.2, -0.7], [3.4, 8.8, -12.0], [-12.5, 12.2, 9.0], [10.0, 3.2, 3.7]]\n",
    ")\n",
    "\n",
    "p_opt = Params(\n",
    "    w_xh=w_xh_opt,\n",
    "    w_hh=w_hh_opt,\n",
    "    w_hy=w_hy_opt,\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "93c16412",
   "metadata": {},
   "outputs": [],
   "source": [
    "def s2s_model(text, p, vocabulary):\n",
    "    \"\"\"Model that takes text and returns text.\"\"\"\n",
    "    vocab_size = len(vocabulary)\n",
    "    tokens = tokenize(text, vocabulary)\n",
    "    input_embeddings = embed(tokens, vocab_size)\n",
    "    logits = model(input_embeddings, p)\n",
    "    predictions = np.argmax(logits, axis=1)\n",
    "    return \"\".join(vocabulary[pred] for pred in predictions)\n",
    "\n",
    "\n",
    "s2s_model(text, p_opt, vocabulary)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ce9dbf0b",
   "metadata": {},
   "source": [
    "## Switching to word level embedding for machine translation\n",
    "\n",
    "To learn about encoder-decoder RNNs we switch from character-level tokenization to word-level tokenization. Moreover, we add a start and end token. \n",
    "\n",
    "Since you already know how to write tokenizers, here is the code:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "61b08402",
   "metadata": {},
   "outputs": [],
   "source": [
    "in_text = \"Hello World\"\n",
    "out_text = \"Hallo Welt\"\n",
    "\n",
    "\n",
    "def get_vocabulary(text):\n",
    "    \"\"\"Get a minimal vocabulary to tokenize the text.\"\"\"\n",
    "    text = text.lower().split()\n",
    "    words = sorted(set(text)) + [\"<SOS>\", \"<EOS>\"]\n",
    "    return words\n",
    "\n",
    "\n",
    "def tokenize(text, vocabulary):\n",
    "    \"\"\"Tokenize the text, given the vocabulary.\"\"\"\n",
    "    text = [\"<SOS>\"] + text.lower().split() + [\"<EOS>\"]\n",
    "    token_dict = {character: pos for pos, character in enumerate(vocabulary)}\n",
    "    out = [token_dict[character] for character in text]\n",
    "    return out\n",
    "\n",
    "\n",
    "def embed(tokens, vocab_size):\n",
    "    \"\"\"Create input embeddings for each token.\"\"\"\n",
    "    out = np.zeros((len(tokens), vocab_size))\n",
    "    out[np.arange(len(out)), tokens] = 1\n",
    "    return out\n",
    "\n",
    "\n",
    "in_vocabulary = get_vocabulary(in_text)\n",
    "print(\"Input vocabulary:\", in_vocabulary)\n",
    "in_vocab_size = len(in_vocabulary)\n",
    "in_tokens = tokenize(in_text, in_vocabulary)\n",
    "print(\"Input tokens:\", in_tokens)\n",
    "in_embeddings = embed(in_tokens, in_vocab_size)\n",
    "\n",
    "out_vocabulary = get_vocabulary(out_text)\n",
    "print(\"Output vocabulary:\", out_vocabulary)\n",
    "out_vocab_size = len(out_vocabulary)\n",
    "out_tokens = tokenize(out_text, out_vocabulary)\n",
    "print(\"Output tokens:\", out_tokens)\n",
    "target_size = len(out_tokens)\n",
    "print(\"Target size:\", target_size)\n",
    "\n",
    "\n",
    "n_in = in_vocab_size\n",
    "n_out = out_vocab_size\n",
    "n_hidden = 4"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9e9a3fa9",
   "metadata": {},
   "source": [
    "Moreover, you get code for two classes of Parameters you can use in your model"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c17bd2f8",
   "metadata": {},
   "outputs": [],
   "source": [
    "@dataclass\n",
    "class EncoderParams:\n",
    "    w_xh: np.ndarray\n",
    "    w_hh: np.ndarray\n",
    "\n",
    "\n",
    "@dataclass\n",
    "class DecoderParams:\n",
    "    w_ss: np.ndarray\n",
    "    w_ys: np.ndarray\n",
    "    w_sy: np.ndarray\n",
    "\n",
    "\n",
    "np.random.seed(1234)\n",
    "\n",
    "p_enc = EncoderParams(\n",
    "    w_xh=np.random.uniform(size=(n_hidden, n_in)),\n",
    "    w_hh=np.random.uniform(size=(n_hidden, n_hidden)),\n",
    ")\n",
    "\n",
    "p_dec = DecoderParams(\n",
    "    w_ss=np.random.uniform(size=(n_hidden, n_hidden)),\n",
    "    w_ys=np.random.uniform(size=(n_hidden, n_out)),\n",
    "    w_sy=np.random.uniform(size=(n_out, n_hidden)),\n",
    ")\n",
    "\n",
    "p_enc"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "663f8aa5",
   "metadata": {},
   "source": [
    "## Task 6: Implement encode and decode steps (for Machine Translation)\n",
    "\n",
    "1. Write a function called `encode_step(x, h, p_enc)\n",
    "2. Write a function called `decode_step(s, y_prev, p_dec)\n",
    "\n",
    "The two functions together will play the same role as the `model_step` in the simple RNN"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "5f5ce202",
   "metadata": {},
   "outputs": [],
   "source": [
    "def encode_step(x, h, p_enc):\n",
    "    h = np.tanh(p_enc.w_xh @ x + p_enc.w_hh @ h)\n",
    "    return h"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b4c2427c",
   "metadata": {},
   "outputs": [],
   "source": [
    "def decode_step(s, y_prev, p_dec):\n",
    "    s = np.tanh(p_dec.w_ss @ s + p_dec.w_ys @ y_prev)\n",
    "    y = p_dec.w_sy @ s\n",
    "    return s, y"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "42e74f87",
   "metadata": {},
   "source": [
    "## Task 7: Implement the encoder-decoder model\n",
    "\n",
    "1. Write a function called `model(in_embeddings, target_size, p_enc, p_dec)`. The function has the following steps:\n",
    "    - Initialize h as a vector of zeros\n",
    "    - call the encode step in a loop to produce a final encoder state (h)\n",
    "    - Rename h to s\n",
    "    - Initialize `y_prev` to the embedding of the `<SOS>` token in the output vocabulary\n",
    "    - Collect the ys in a list"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "34528c89",
   "metadata": {},
   "outputs": [],
   "source": [
    "def model(in_embeddings, target_size, p_enc, p_dec):\n",
    "    h = np.zeros(len(p_enc.w_hh))\n",
    "    for x in in_embeddings:\n",
    "        h = encode_step(x, h, p_enc)\n",
    "\n",
    "    s = h\n",
    "    y_prev = np.zeros(p_dec.w_ys.shape[1])\n",
    "    y_prev[-2] = 1\n",
    "    out = []\n",
    "    for _ in range(target_size):\n",
    "        s, y = decode_step(s, y_prev, p_dec)\n",
    "        out.append(y)\n",
    "        y_prev = y\n",
    "\n",
    "    return np.array(out)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6cc440a7",
   "metadata": {},
   "source": [
    "## Task 8: Implement the encoder-decoder text-to-text model\n",
    "\n",
    "1. Implement a function called `s2s_model(in_text, p_enc, p_dec, in_vocabulary, out_vocabulary, target_size). This is similar to the function you wrote above, but this time the input vocabulary and output vocabulary differ. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7b563179",
   "metadata": {},
   "outputs": [],
   "source": [
    "def s2s_model(\n",
    "    in_text,\n",
    "    p_enc,\n",
    "    p_dec,\n",
    "    in_vocabulary=in_vocabulary,\n",
    "    out_vocabulary=out_vocabulary,\n",
    "    target_size=target_size,\n",
    "):\n",
    "    \"\"\"Model that takes text and returns text.\"\"\"\n",
    "    in_vocab_size = len(in_vocabulary)\n",
    "    in_tokens = tokenize(in_text, in_vocabulary)\n",
    "    in_embeddings = embed(in_tokens, in_vocab_size)\n",
    "\n",
    "    logits = model(in_embeddings, target_size, p_enc, p_dec)\n",
    "\n",
    "    predictions = np.argmax(logits, axis=1)\n",
    "    return \" \".join(out_vocabulary[pred] for pred in predictions)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1a3c4dbe",
   "metadata": {},
   "outputs": [],
   "source": [
    "w_xh_opt = np.array(\n",
    "    [\n",
    "        [1.1, 0.2, -0.4, 0.5],\n",
    "        [0.7, -0.5, -0.6, -7.9],\n",
    "        [0.4, 3.9, 0.3, -0.2],\n",
    "        [1.2, 0.6, 0.1, 2.7],\n",
    "    ]\n",
    ")\n",
    "\n",
    "w_hh_opt = np.array(\n",
    "    [\n",
    "        [-0.6, -0.8, 0.4, -0.1],\n",
    "        [-1.6, 2.7, -3.7, -3.8],\n",
    "        [-1.2, 1.3, 1.0, 0.2],\n",
    "        [-0.0, -0.6, 0.6, 0.9],\n",
    "    ]\n",
    ")\n",
    "\n",
    "w_ss_opt = np.array(\n",
    "    [\n",
    "        [10.6, -0.5, -5.2, 0.7],\n",
    "        [-6.7, 1.9, -5.0, 1.8],\n",
    "        [10.9, -7.5, -6.2, 4.5],\n",
    "        [1.5, 0.4, 4.1, -4.4],\n",
    "    ]\n",
    ")\n",
    "\n",
    "w_sy_opt = np.array(\n",
    "    [\n",
    "        [3.0, -10.6, -22.2, 4.2],\n",
    "        [4.4, 15.4, 9.6, -7.0],\n",
    "        [4.8, -4.0, 12.2, -6.0],\n",
    "        [-16.4, -10.1, -18.8, 14.1],\n",
    "    ]\n",
    ")\n",
    "\n",
    "\n",
    "w_ys_opt = np.array(\n",
    "    [\n",
    "        [0.5, -3.0, 11.9, 4.2],\n",
    "        [3.0, 0.8, 2.0, 1.0],\n",
    "        [3.8, -1.1, 16.1, 8.3],\n",
    "        [-3.2, -0.1, -7.4, -0.7],\n",
    "    ]\n",
    ")\n",
    "\n",
    "p_enc_opt = EncoderParams(\n",
    "    w_xh=w_xh_opt,\n",
    "    w_hh=w_hh_opt,\n",
    ")\n",
    "\n",
    "p_dec_opt = DecoderParams(\n",
    "    w_ss=w_ss_opt,\n",
    "    w_ys=w_ys_opt,\n",
    "    w_sy=w_sy_opt,\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "dcabda85",
   "metadata": {},
   "outputs": [],
   "source": [
    "s2s_model(in_text, p_enc_opt, p_dec_opt)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}