{ "cells": [ { "cell_type": "markdown", "id": "998b266e", "metadata": {}, "source": [ "# Exercise 4" ] }, { "cell_type": "code", "execution_count": null, "id": "19c9b1f0", "metadata": {}, "outputs": [], "source": [ "from sklearn.datasets import load_digits\n", "import pandas as pd\n", "import seaborn as sns" ] }, { "cell_type": "markdown", "id": "7eba2cc2", "metadata": {}, "source": [ "## Note on import statements\n", "\n", "- In all real projects, all import statements should be in the first cell of a notebook\n", "- It is part of this exercise that you learn how to import what you need from sklearn\n", "- Therefore, in this exercise notebook you will see imports in many places" ] }, { "cell_type": "markdown", "id": "da8f909d", "metadata": {}, "source": [ "## Task 1: Load and inspect the dataset\n", "\n", "In this task you will load the digits dataset from `sklearn.datasets`, using scikit-learn's `load_digits` function, which will return a dictionary-like `Bunch` object. \n", "\n", "The goal of this warm-up task is that you use your Python knowledge to inspect the object you get from `load_digits`. You do not need to google.\n", "\n", "\n", "1. List the keys of the object\n", "2. Look at some of the entries and understand their format (e.g. using `type()` and `.shape`)\n", "3. 
Look at the description inside digits and find all the terms mentioned on the terminology slide" ] }, { "cell_type": "code", "execution_count": null, "id": "f5f091ee", "metadata": {}, "outputs": [], "source": [ "digits = load_digits()" ] }, { "cell_type": "code", "execution_count": null, "id": "4496e8e9", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "b0c66be4", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "c8bee711", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "12cce699", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "1c80915c", "metadata": {}, "source": [ "## Task 2: Data splitting\n", "\n", "Split the data and assign the splits to the variables `X_train`, `X_test`, `y_train`, `y_test`. Set a `random_state` of your choice. Split such that the training sets contain 75 percent of the data. Confirm that by looking at the shapes of the resulting arrays. " ] }, { "cell_type": "code", "execution_count": null, "id": "54a43501", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "d143a599", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "4aa48bef", "metadata": {}, "source": [ "## Task 3: Logistic Regression\n", "\n", "1. Run a logistic regression without regularization and with intercept\n", "2. 
Use the fitted model to create predictions on the test dataset" ] }, { "cell_type": "code", "execution_count": null, "id": "4813bad1", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "e9a3e1ec", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "3348f46d", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "92281238", "metadata": {}, "source": [ "## Task 4: Assess model quality\n", "\n", "1. Calculate the accuracy score\n", "2. Calculate the F1 score\n", "3. Convert the `\"target_names\"` to a `string` data type\n", "4. Create a classification report\n", "5. Calculate a confusion matrix\n", "6. Plot the confusion matrix using seaborn's [heatmap function](https://seaborn.pydata.org/generated/seaborn.heatmap.html) (Optional)" ] }, { "cell_type": "code", "execution_count": null, "id": "d1d09c59", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "e7fc5cf3", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "b801c5e3", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "e05ed5db", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "058d209a", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "742fed3a", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "7f2076df", "metadata": {}, "source": [ "## Task 5: Logit fitting with penalty\n", "\n", "1. Run a logistic regression with an \"l2\" penalty. Set the penalty parameter C = $1 / \\lambda$ to 1. \n", "2. You will get a warning. You have two options to solve it:\n", " 1. Find a good explanation of why it is acceptable to ignore this warning. 
Relate this to the differences between machine learning and econometrics\n", " 2. Change the settings so you don't get the warning" ] }, { "cell_type": "code", "execution_count": null, "id": "9014354d", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "fe53cb99", "metadata": {}, "source": [] }, { "cell_type": "markdown", "id": "b8d1b5ee", "metadata": {}, "source": [ "## Task 6: Understanding decision trees and random forests in group work\n", "\n", "Read the following two sections of the Python Data Science Handbook:\n", "\n", "- [Decision trees](https://jakevdp.github.io/PythonDataScienceHandbook/05.08-random-forests.html#Motivating-Random-Forests:-Decision-Trees)\n", "- [Random forests](https://jakevdp.github.io/PythonDataScienceHandbook/05.08-random-forests.html#Ensembles-of-Estimators:-Random-Forests)\n", "\n", "Discuss decision trees and random forests with your neighbor or in groups of up to 5 people. Make sure everyone understands the basic idea and no one gets hung up on small technicalities. \n", "\n", "After everyone has a good understanding of the two methods, go through the basic steps (import, create model instance, fit, evaluate score) for a decision tree and a random forest." 
] }, { "cell_type": "code", "execution_count": null, "id": "567dc9f1", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "d5c7f7c8", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "1fbf38af", "metadata": {}, "source": [ "## Task 7: K-fold Cross Validation\n", "\n", "Perform a five-fold cross-validation for a model of your choice on the training dataset." ] }, { "cell_type": "code", "execution_count": null, "id": "88ec1908", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "68e2fd8d", "metadata": {}, "source": [ "## Task 8: Hyperparameter tuning\n", "\n", "Tune the hyperparameters of one of the methods used above using a grid search with cross-validation." ] }, { "cell_type": "code", "execution_count": null, "id": "c32673de", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "7b6a1b86", "metadata": { "scrolled": true }, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "386d871a", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "ade30d29", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.3" } }, "nbformat": 4, "nbformat_minor": 5 }