{ "cells": [ { "cell_type": "markdown", "id": "1c0b1aca", "metadata": {}, "source": [ "# Exercise 6" ] }, { "cell_type": "code", "execution_count": null, "id": "942faf96", "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "from datasets import load_dataset\n", "from transformers import AutoTokenizer\n", "import torch\n", "from transformers import logging\n", "\n", "logging.set_verbosity_error()" ] }, { "cell_type": "markdown", "id": "f8d66298", "metadata": {}, "source": [ "## Task 1: Masked calculations in numpy\n", "\n", "Calculate the mean over all valid entries in `a` and `b`. \n", "\n", "For `a`, `valid` directly shows which entries are valid. For `b`, `valid` defines which rows are valid. " ] }, { "cell_type": "code", "execution_count": null, "id": "f83314e8", "metadata": {}, "outputs": [], "source": [ "a = np.arange(4)\n", "b = np.arange(12).reshape(4, 3)\n", "valid = np.array([1, 0, 1, 0]).astype(bool)" ] }, { "cell_type": "code", "execution_count": null, "id": "aea7ec40", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "1901a977", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "f1c98338", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "81972673", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "e1005101", "metadata": {}, "source": [ "## Task 2: Extract last hidden states\n", "\n", "1. Create a model instance using `AutoModel`\n", "2. Define a batch of the tokenized data that helps you prototype a function that can be used with `DatasetDict.map` (in batched mode)\n", "3. Extract the `input_ids` and `attention_mask` from the batch and convert them to `torch.tensor`s. \n", "4. Use the tensors to extract the last hidden states of the model\n", "5. 
Convert the last hidden states to a numpy array" ] }, { "cell_type": "code", "execution_count": null, "id": "569a508f", "metadata": {}, "outputs": [], "source": [ "emotions = load_dataset(\"dair-ai/emotion\", name=\"split\")\n", "model_name = \"distilbert-base-uncased\"\n", "tokenizer = AutoTokenizer.from_pretrained(model_name)\n", "\n", "\n", "def tokenize(batch):\n", "    return tokenizer(batch[\"text\"], padding=True, truncation=True)\n", "\n", "\n", "emotions_encoded = emotions.map(tokenize, batched=True, batch_size=None)\n", "emotions_encoded" ] }, { "cell_type": "code", "execution_count": null, "id": "e90bde7e", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "1d5556a2", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "4a04c763", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "0cf3cec3", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "4a95f6c4", "metadata": {}, "source": [ "## Task 3: Average last hidden states\n", "\n", "Take the mean over the middle (token) dimension of the last hidden states, counting only the positions where the attention mask is 1. " ] }, { "cell_type": "code", "execution_count": null, "id": "e5b0943d", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "109a92aa", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "4cbd31d3", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "247111ce", "metadata": {}, "source": [ "## Task 4: Now the same using map\n", "\n", "1. Write a function called `extract_lhs` that takes `batch` and `model` as arguments and extracts and averages the last hidden states. 
This just means combining the steps from the last two tasks into one function and saving the result in the batch. \n", "2. Test the function on your practice batch\n", "3. Apply the function to the encoded emotions dataset using `.map` with the following settings:\n", "    - `batched=True`\n", "    - `batch_size=1000`\n", "    - `fn_kwargs={\"model\": model}`\n", "\n", "**Note**: Step 3 will take a while; let it run while I show the solution and discuss the next steps. If you want, you can use `num_proc=...` to run this step on more than one core. If so, set it to the number of physical cores in your computer. " ] }, { "cell_type": "code", "execution_count": null, "id": "7aedacbd", "metadata": {}, "outputs": [], "source": [ "def extract_lhs(batch, model):\n", "    pass" ] }, { "cell_type": "code", "execution_count": null, "id": "aa546731", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "a81400de", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "b9a19c2e", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "0a617232", "metadata": {}, "source": [ "## Task 5: Use the last hidden states in sklearn\n", "\n", "1. Extract the arrays `X_train`, `X_test`, `y_train` and `y_test`\n", "2. Run a logistic regression using sklearn (at default values)\n", "3. 
Calculate the accuracy score" ] }, { "cell_type": "code", "execution_count": null, "id": "dd11b836", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "c3a2b152", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.3" } }, "nbformat": 4, "nbformat_minor": 5 }