{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "998b266e",
   "metadata": {},
   "source": [
    "# Exercise 4 (solution)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "19c9b1f0",
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.datasets import load_digits\n",
    "import pandas as pd\n",
    "import seaborn as sns"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7eba2cc2",
   "metadata": {},
   "source": [
    "## Note on import statements\n",
    "\n",
    "- In all real projects, all import statements should be in the first cell of a notebook\n",
    "- It is part of this exercise that you learn how to import what you need from sklearn\n",
    "- Therefore, in this exercise notebooks you will see imports in many places"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "da8f909d",
   "metadata": {},
   "source": [
    "## Task 1: Load and inspect the dataset\n",
    "\n",
    "In this task you will load the digits dataset from `sklearn.datasets`, using scikit-learn's `load_digits` function, which will return a dictionary-like `Bunch` object. \n",
    "\n",
    "The goal of this warmp-up task is that you use your Python knowledge to inspect the object you get from `load_digits`. You do not need to google.\n",
    "\n",
    "\n",
    "1. List the keys of the object\n",
    "2. Look some of the entries and understand their format (e.g. using `type()` and `.shape`\n",
    "3. Look at the description inside digits and find all the terms mentioned on the terminology slide"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f5f091ee",
   "metadata": {},
   "outputs": [],
   "source": [
    "digits = load_digits()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "4496e8e9",
   "metadata": {},
   "outputs": [],
   "source": [
    "digits.keys()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b0c66be4",
   "metadata": {},
   "outputs": [],
   "source": [
    "type(digits[\"data\"])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c8bee711",
   "metadata": {},
   "outputs": [],
   "source": [
    "digits[\"data\"].shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "12cce699",
   "metadata": {},
   "outputs": [],
   "source": [
    "print(digits[\"DESCR\"])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1c80915c",
   "metadata": {},
   "source": [
    "## Task 2: Data splitting\n",
    "\n",
    "Split the data and assign the splits to the variables `X_train`, `X_test`, `y_train`, `y_test`. Set a `random_state` of your choice. Split such that the training sets contain 75 percent of the data. Confirm that by looking at the shapes of the resulting arrays. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "54a43501",
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.model_selection import train_test_split\n",
    "\n",
    "X_train, X_test, y_train, y_test = train_test_split(\n",
    "    digits[\"data\"],\n",
    "    digits[\"target\"],\n",
    "    random_state=1234,\n",
    "    test_size=0.25,\n",
    ")\n",
    "X_train.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d143a599",
   "metadata": {},
   "outputs": [],
   "source": [
    "X_test.shape"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4aa48bef",
   "metadata": {},
   "source": [
    "## Task 3: Logistic Regression\n",
    "\n",
    "1. Run a logistic regression without regularization and with intercept\n",
    "2. Use the fitted model to create predictions on the test dataset"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "4813bad1",
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.linear_model import LogisticRegression\n",
    "\n",
    "model = LogisticRegression(\n",
    "    fit_intercept=True,\n",
    "    penalty=None,\n",
    ")\n",
    "model.fit(X_train, y_train)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e9a3e1ec",
   "metadata": {},
   "outputs": [],
   "source": [
    "y_pred = model.predict(X_test)\n",
    "y_pred"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3348f46d",
   "metadata": {},
   "outputs": [],
   "source": [
    "model.score(X_test, y_test)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "92281238",
   "metadata": {},
   "source": [
    "## Task 4: Assess model quality\n",
    "\n",
    "1. Calculate the accurracy score\n",
    "2. Calculate the f1 score\n",
    "3. Convert the `\"target_names\"` to a `string` data type\n",
    "4. Create a classification report\n",
    "5. Calculate a confusion_matrix\n",
    "6. Plot the confusion matrix using seaborns [heatmap function](https://seaborn.pydata.org/generated/seaborn.heatmap.html) (Optional)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d1d09c59",
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.metrics import accuracy_score\n",
    "\n",
    "accuracy_score(y_test, y_pred)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e7fc5cf3",
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.metrics import f1_score\n",
    "\n",
    "f1_score(y_test, y_pred, average=None)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b801c5e3",
   "metadata": {},
   "outputs": [],
   "source": [
    "digits[\"target_names\"] = digits[\"target_names\"].astype(str)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e05ed5db",
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.metrics import classification_report\n",
    "\n",
    "report = classification_report(\n",
    "    y_test,\n",
    "    y_pred,\n",
    "    target_names=digits[\"target_names\"],\n",
    ")\n",
    "print(report)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "058d209a",
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.metrics import confusion_matrix\n",
    "\n",
    "confusion = confusion_matrix(y_test, y_pred, normalize=\"true\")\n",
    "confusion = pd.DataFrame(\n",
    "    confusion, columns=digits[\"target_names\"], index=digits[\"target_names\"]\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "742fed3a",
   "metadata": {},
   "outputs": [],
   "source": [
    "\n",
    "sns.heatmap(\n",
    "    confusion.round(3),\n",
    "    cmap=sns.color_palette(\"Blues\", as_cmap=True),\n",
    "    annot=True,\n",
    ")\n",
    "sns.set(rc={\"figure.figsize\": (12, 8.27)})"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7f2076df",
   "metadata": {},
   "source": [
    "## Task 5: Logit fitting with penalty\n",
    "\n",
    "1. Run a logistic regression with an \"l2\" penalty. Set the penalty parametr C = $1 / \\lambda$ to 1. \n",
    "2. You will get a warning. You have two options to solve it:\n",
    "    1. Find a good explanation of why it is acceptable to ignore this warning. Relate this to the differences between machine learning and econometrics\n",
    "    2. Change the settings so you don't get the warning"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9014354d",
   "metadata": {},
   "outputs": [],
   "source": [
    "logit = LogisticRegression(fit_intercept=True, max_iter=4500, C=1)\n",
    "logit.fit(X_train, y_train)\n",
    "logit.score(X_test, y_test)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "fe53cb99",
   "metadata": {},
   "source": [
    "In econometrics it would be a huge problem if a numerical optimization terminates without convergence due to reaching max iterations. This is so, because we have no way of knowing whether that introduces a huge bias in our parameters. In supervised machine learning, we can try it out. It can even be the case that fewer iterations work better than more because of avoiding overfitting. "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b8d1b5ee",
   "metadata": {},
   "source": [
    "## Task 6: Understanding decision trees and random forrests in group work\n",
    "\n",
    "Read the following two sections of the Python Data Science Handbook\n",
    "\n",
    "- [Decision trees](https://jakevdp.github.io/PythonDataScienceHandbook/05.08-random-forests.html#Motivating-Random-Forests:-Decision-Trees)\n",
    "- [Random forrests](https://jakevdp.github.io/PythonDataScienceHandbook/05.08-random-forests.html#Ensembles-of-Estimators:-Random-Forests)\n",
    "\n",
    "Discuss decision trees and random forrests with your neighbor or in groups of up to 5 people. Make sure, everyone understands the basic idea and no-one gets hung-up on small technicalities. \n",
    "\n",
    "After everyone has a good understanding of the two methods, go through the basic steps (import, create model instance, fit, evaluate score) for a decision tree and a random forrest."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "567dc9f1",
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.tree import DecisionTreeClassifier\n",
    "\n",
    "tree = DecisionTreeClassifier()\n",
    "tree.fit(X_train, y_train)\n",
    "tree.score(X_test, y_test)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d5c7f7c8",
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.ensemble import RandomForestClassifier\n",
    "\n",
    "forest = RandomForestClassifier()\n",
    "forest.fit(X_train, y_train)\n",
    "forest.score(X_test, y_test)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1fbf38af",
   "metadata": {},
   "source": [
    "## Task 7: K-fold Cross Validation\n",
    "\n",
    "Do a five fold cross validation for a model of your choice on the training dataset"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "88ec1908",
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.model_selection import cross_val_score\n",
    "\n",
    "scores = cross_val_score(logit, X_train, y_train, cv=5)\n",
    "scores"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "68e2fd8d",
   "metadata": {},
   "source": [
    "## Task 8: Hyperparameter tuning\n",
    "\n",
    "Tune the hyperparameters of one of the methods used above using a grid search with cross validation"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c32673de",
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.model_selection import GridSearchCV\n",
    "\n",
    "param_grid = {\n",
    "    \"penalty\": [\"l2\", \"l1\"],\n",
    "    \"max_iter\": [100, 2000],\n",
    "    \"C\": [0.01, 0.1, 100],\n",
    "}\n",
    "\n",
    "grid = GridSearchCV(\n",
    "    LogisticRegression(\n",
    "        fit_intercept=True,\n",
    "        penalty=\"l2\",\n",
    "    ),\n",
    "    param_grid,\n",
    "    cv=7,\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7b6a1b86",
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "grid.fit(X_train, y_train)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "386d871a",
   "metadata": {},
   "outputs": [],
   "source": [
    "grid.best_params_"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ade30d29",
   "metadata": {},
   "outputs": [],
   "source": [
    "grid.best_estimator_.score(X_test, y_test)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}