{ "cells": [ { "cell_type": "markdown", "id": "78037bc1", "metadata": {}, "source": [ "# Exercises 3 (solution)\n", "\n", "This notebook is inspired by Chapter 1 of the book **\"Natural Language Processing with Transformers: Building Language Applications with Hugging Face\"** by Tunstall, von Werra, and Wolf.\n", "\n", "You will get first experience in using pretrained models for specific tasks, using the *pipeline* API from the Hugging Face Transformes library, which allows you to do inference at a very high level of abstraction.\n", "\n", "In each exercise, you will first define a *pipeline* for a specific task, using a pretrained model from the Hugging Face models library, and apply the pipeline on text snippets from the course webpage.\n", "\n", "## Resources:\n", "\n", "- [pipeline documentation](https://huggingface.co/docs/transformers/main_classes/pipelines)\n", "- [Hugging Face models library](https://huggingface.co/models)" ] }, { "cell_type": "markdown", "id": "916ef806", "metadata": {}, "source": [ "## The input text" ] }, { "cell_type": "code", "execution_count": null, "id": "fd1b7db5", "metadata": {}, "outputs": [], "source": [ "from transformers import pipeline\n", "import pandas as pd\n", "\n", "pd.set_option(\"display.max_rows\", 100)\n", "\n", "text = \"\"\"Last year has shown multiple breakthroughs in deep learning, bringing large language\n", "models to the mainstream. OpenAI's ChatGPT, Microsoft's new Bing Search and GitHub\n", "Copilot, and Deep Mind's AlphaCode are the most prominent. While they still have many\n", "flaws, they show a potential to transform many sectors of the economy, replace some\n", "workers and make other vastly more productive.\n", "\n", "NLP also has an immense potential to change research in economics. Most economists use\n", "small and expensive structured datasets. NLP offers a way to work with novel data sources\n", "that often can be scraped for free from the web. Examples are classifying speeches along\n", "the political spectrum, classifying tweets to measure opinions, extracting concepts\n", "mentioned in free-form survey replies, or translating questionnaires or datasets into\n", "different languages.\n", "\n", "This class is an introduction to deep learning and NLP for economists. Starting from\n", "zero, the first half of the course focuses on learning the practical skills needed to\n", "incorporate NLP into empirical workflows. We will use Huggingface's transformers library\n", "and only work with pre-trained models for this. The second half of the class zooms in\n", "and focuses on understanding what language models are, how they differ, and how they are\n", "trained. We will write some purely didactical code in numpy and implement a few simple\n", "models in PyTorch. The main focus of the second half is to build enough understanding to\n", "work effectively with pre-trained models. It is beyond our scope and computational\n", "resources to actually train large models.\"\"\"\n", "\n", "paragraphs = text.split(\"\\n\\n\")" ] }, { "cell_type": "markdown", "id": "83ae75dc", "metadata": {}, "source": [ "## Task 1 - Text classification\n", "\n", "1. Define a classifier pipeline using the ` \"distilbert-base-uncased-finetuned-sst-2-english\"` model.\n", "\n", "2. Apply the pipeline to the entire text.\n", "\n", "3. Apply the pipeline separately on each paragraph of the input text to extract sentiments.\n", "\n", "4. Convert the output of the previous task to a pandas DataFrame" ] }, { "cell_type": "code", "execution_count": null, "id": "fbef014e", "metadata": {}, "outputs": [], "source": [ "classifier = pipeline(\n", " \"text-classification\", model=\"distilbert-base-uncased-finetuned-sst-2-english\"\n", ")" ] }, { "cell_type": "code", "execution_count": null, "id": "e7f35b91", "metadata": {}, "outputs": [], "source": [ "classifier(text)" ] }, { "cell_type": "code", "execution_count": null, "id": "31049358", "metadata": {}, "outputs": [], "source": [ "sentiments = classifier(paragraphs)\n", "sentiments" ] }, { "cell_type": "code", "execution_count": null, "id": "15edb509", "metadata": {}, "outputs": [], "source": [ "pd.DataFrame(sentiments)" ] }, { "cell_type": "markdown", "id": "91c10f7e", "metadata": {}, "source": [ "## Task 2 - Named entity recognition\n", "1. Define a named entity recognition (ner) pipeline using the `\"dslim/bert-base-NER-uncased\"` model. \n", "\n", "2. Apply the pipeline to `text` and convert the result to a DataFrame" ] }, { "cell_type": "code", "execution_count": null, "id": "87c73749", "metadata": {}, "outputs": [], "source": [ "ner_tagger = pipeline(\n", " \"ner\",\n", " model=\"dslim/bert-base-NER-uncased\",\n", ")\n", "pd.DataFrame(ner_tagger(text))" ] }, { "cell_type": "markdown", "id": "24a91f4c", "metadata": {}, "source": [ "## Task 3 - Aggregation strategies\n", "\n", "Use the same model as before, but try out different aggregation strategies. Which one gives you the best results?" ] }, { "cell_type": "markdown", "id": "da170769", "metadata": {}, "source": [ "**Note**: Below we show a solution where you apply all strategies in a loop and combine the result in one DataFrame. Other solutions are completely ok. " ] }, { "cell_type": "code", "execution_count": null, "id": "6754e5f2", "metadata": { "scrolled": true }, "outputs": [], "source": [ "ner_results = []\n", "for strategy in [\"none\", \"simple\", \"first\", \"average\", \"max\"]:\n", " tagger = pipeline(\n", " \"ner\",\n", " model=\"dslim/bert-base-NER-uncased\",\n", " aggregation_strategy=strategy,\n", " )\n", " df = pd.DataFrame(tagger(text))\n", " df[\"strategy\"] = strategy\n", " ner_results.append(df)\n", "\n", "pd.concat(ner_results).set_index(\"strategy\")" ] }, { "cell_type": "markdown", "id": "f511deaa", "metadata": {}, "source": [ "## Task 4 - Clearing the cache\n", "\n", "Use `huggingface-cli` to clear the model cache" ] }, { "cell_type": "markdown", "id": "3fb7d64e", "metadata": {}, "source": [ "## Task 5: Question answering\n", "\n", "1. Define a question answering pipeline using the `\"deepset/roberta-base-squad2\"` model.\n", "\n", "2. Come up with a few questions one might ask about the course logistics. \n", "\n", "3. Apply the pipeline to get answers to your questions.\n", "\n", "**Do not trust any answer without double-checking it!**" ] }, { "cell_type": "code", "execution_count": null, "id": "e15d2cfa", "metadata": {}, "outputs": [], "source": [ "logistics_text = \"\"\"Logistics\n", "\n", "The following page contains everything you need to know to take the class for credit. Please read it carefully before approaching us with questions.\n", "\n", "Communication\n", "\n", "All communication related to the course should be via zulip.\n", "\n", "If you have not signed up for the `bonn-econ-teaching` zulip workspace yet, use this [link](https://bonn-econ-teaching.zulipchat.com/join/wuyru5foek3s3vb6tdkjubwc/) to do so. You will then automatically be subscribed to the `dl-intro` stream.\n", "\n", "If you are already signed up from a previous course, please subscribe to the `dl-intro` stream.\n", "\n", "Questions that do not contain any private information or sensitive data should be asked in a public stream such that everyone benefits from the answer.\n", "\n", "We will announce relevant information about the lectures and course logistics on zulip. If you do not sign up or check your messages, you might miss something important!\n", "\n", "Office hours\n", "\n", "If you have a question or problem that is better discussed in person, we can do so\n", "right after the lecture.\n", "\n", "\n", "Distribution of materials\n", "\n", "All materials can be downloaded from the [course webpage](https://dl-intro.readthedocs.io).\n", "\n", "Grading\n", "\n", "The grade is determined entirely by a final project. The final project consists of a practical application of deep learning methods (you can choose the topic) and written answers to questions that we will distribute in due time.\n", "\n", "In the written answers you can achieve up to 25 points. In the application you can achieve up to 75 points. The points are added and translate into grades as follows:\n", "\n", "| Grade | Minimum number of points |\n", "| ----- | ------------------------ |\n", "| 1.0 | 98 |\n", "| 1.3 | 93 |\n", "| 1.7 | 87 |\n", "| 2.0 | 82 |\n", "| 2.3 | 77 |\n", "| 2.7 | 71 |\n", "| 3.0 | 66 |\n", "| 3.3 | 61 |\n", "| 3.7 | 55 |\n", "| 4.0 | 50 |\n", "\n", "How do we grade the application\n", "\n", "The grade will be a holistic assessment of your project. We look at the following criteria:\n", "\n", "- Is there a `README.md` that tells me how the repository is structured and how I can run the project?\n", "- Does the code run?\n", "- Is the code readable and well documented?\n", "- How challenging is the task?\n", "- How complete and correct is the solution?\n", "\n", "A final project will be about as much work as two or three of the exercise notebooks you will see in class.\n", "\n", "How do we grade the written answers\n", "\n", "We are looking for precise answers to the questions within the specified word limits. Answers that are too long will not get full points. Answers that contain the correct answer but also wrong or redundant information will not achieve full points.\n", "\n", "How will the project be submitted\n", "\n", "Final projects are submitted in a GitHub repository created by GitHub classroom. Follow this [invitation](https://classroom.github.com/a/R1vgPUT1) to create the repository which you can then clone to your computer.\n", "\n", "If you do not know yet what git is and how it works, don't worry. You will learn this during the lecture.\n", "\n", "\n", "How should the code be structured\n", "\n", "You can choose whether you want to work in jupyter notebooks, `.py` files or a mix thereof. You are also free to use a workflow system such as [pytask](https://github.com/pytask-dev/pytask) or [kedro](https://github.com/kedro-org/kedro). We think that in practice you should always use such a workflow system, but we do not have the time to teach them in this class.\n", "\n", "The only strict requirement is that there is a simple way to run your entire project and produce all results you need, either by running one notebook (that potentially imports functions from many files) or by running one command (such as pytask) from the command line.\n", "\n", "If there is no simple way of producing all results or it is not documented in the README, we will deduct points.\n", "\n", "Where to put the written answers\n", "\n", "The written answers to the questions have to be in the `README.md` file of your repository.\n", "\n", "Which libraries can you use\n", "\n", "You can use any python library you want. If you need packages that are not installed in the course environment, your project needs to contain an `environment.yml` file with all packages you use.\n", "\n", "If you use libraries that are not part of the course environment and do not provide an environment file, we will deduct points.\n", "\n", "\n", "Deadlines\n", "\n", "- Between April 3 and April 10 you must register for the class via basis. Otherwise, you cannot take the class for credit! It happens to multiple students every year and the examination office does not make exceptions. Please also look at the [official dates](https://www.vwlpamt.uni-bonn.de/pruefungsamt-en/pdf/summer-semester-2023/time-table-summer-semester-2023) from the examination office for details.\n", "- The topic of of your final project needs to be submitted until Tuesday, July 11 by writing it down at the top of the README.md file of your repository. This is to make sure that you have a topic and to make sure that you have created your repository and know how to push materials to it.\n", "- The deadline for the final project is Sunday, September 10, 23:00. The projects we download via GitHub classroom will not contain any changes you push after the deadline.\n", "\n", "Tentative Lecture Plan (this will change!)\n", "\n", "- Lecture 1: Overview, logistics and installation\n", "- Lecture 2: Python, Jupyter, Git and Markdown basics\n", "- Lecture 3: Intro to huggingface ecosystem and different NLP tasks\n", "- Lecture 4: Classification with sklearn\n", "- Lecture 5: Tokenization\n", "- Lecture 6: Text classification via feature extraction\n", "- Lecture 7: Text classification via fine-tuning\n", "- Lecture 8: Feedforward neural networks from scratch\n", "- Lecture 9: (Pre-)training neural networks\n", "- Lecture 10: RNNs, attention and transformers\n", "- Lecture 11: Model architectures and loss functions\n", "- Lecture 12: Final Projects tips / Bonus lecture\n", "\"\"\"" ] }, { "cell_type": "code", "execution_count": null, "id": "9ec94c05", "metadata": {}, "outputs": [], "source": [ "reader = pipeline(\"question-answering\", model=\"deepset/roberta-base-squad2\")\n", "\n", "questions = [\n", " \"What is the deadline for the final project?\",\n", " \"Which python library can I use for the final project?\",\n", " \"How is the grade determined?\",\n", " \"What is the topic of lecture 3?\",\n", " \"How should we contact you?\",\n", "]\n", "answers = pd.DataFrame(reader(question=questions, context=logistics_text))\n", "answers[\"question\"] = questions\n", "\n", "answers" ] }, { "cell_type": "markdown", "id": "71027bb6", "metadata": {}, "source": [ "## Task 6: Summarization\n", "1. Define a text summarizing pipeline using the `\"sshleifer/distilbart-cnn-6-6\"` model. \n", "2. Apply the pipeline to `text`\n", "3. Play around with the keyword arguments `min_length` and `max_length` until you get something you like." ] }, { "cell_type": "code", "execution_count": null, "id": "44aa8562", "metadata": {}, "outputs": [], "source": [ "summarizer = pipeline(\"summarization\", model=\"sshleifer/distilbart-cnn-6-6\")\n", "\n", "summarizer(\n", " text,\n", " min_length=100,\n", " max_length=200,\n", " clean_up_tokenization_spaces=True,\n", ")" ] }, { "cell_type": "markdown", "id": "51c25ef4", "metadata": {}, "source": [ "## Task 7: Summarization by paragraph\n", "\n", "1. Apply the pipeline from the previous task to each paragraph of `text`.\n", "2. Play around with `min_length` and `max_length` until you are satisfied with the result.\n", "3. Combine the results back into one string" ] }, { "cell_type": "code", "execution_count": null, "id": "f91f9eb6", "metadata": {}, "outputs": [], "source": [ "summaries = summarizer(\n", " paragraphs, min_length=40, max_length=60, clean_up_tokenization_spaces=True\n", ")\n", "texts = [entry[\"summary_text\"] for entry in summaries]\n", "print(\"\\n\\n\".join(texts))" ] }, { "cell_type": "markdown", "id": "ab5c5b61", "metadata": {}, "source": [ "## Task 8: Translation\n", "\n", "1. Go to [huggingface](https://huggingface.co/models) and search for a model that can translate a text from english to your favorite language.\n", "2. Define a pipeline to do the translation\n", "3. Apply the pipeline on the example input text to translate the content to German." ] }, { "cell_type": "code", "execution_count": null, "id": "365c3c3e", "metadata": {}, "outputs": [], "source": [ "translator = pipeline(\"translation_en_to_de\", model=\"Helsinki-NLP/opus-mt-en-es\")\n", "\n", "outputs = translator(text, clean_up_tokenization_spaces=True)\n", "outputs[0][\"translation_text\"]" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.3" } }, "nbformat": 4, "nbformat_minor": 5 }