
Exercise 4#

[ ]:

from sklearn.datasets import load_digits
import pandas as pd
import seaborn as sns

Note on import statements#

In real projects, all import statements should be in the first cell of a notebook.
Part of this exercise is learning how to import what you need from sklearn.
Therefore, in these exercise notebooks you will see imports in many places.

Task 1: Load and inspect the dataset#

In this task you will load the digits dataset with scikit-learn’s load_digits function from sklearn.datasets, which returns a dictionary-like Bunch object.

The goal of this warm-up task is that you use your Python knowledge to inspect the object you get from load_digits. You should not need to search online.

List the keys of the object
Look at some of the entries and understand their format (e.g. using type() and .shape)
Look at the description inside digits and find all the terms mentioned on the terminology slide
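The inspection steps above could look roughly like the following sketch. The choice of which entries to inspect ("data", "target", "images") and the 300-character preview of the description are arbitrary picks for illustration.

```python
from sklearn.datasets import load_digits

digits = load_digits()

# A Bunch behaves like a dictionary, so keys() lists its entries.
print(list(digits.keys()))

# Check the type and shape of a few array-valued entries.
for key in ("data", "target", "images"):
    entry = digits[key]
    print(key, type(entry), entry.shape)

# The description is a plain string; print only the beginning of it.
print(digits.DESCR[:300])
```

Entries can be accessed either with dictionary syntax (digits["data"]) or as attributes (digits.data).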

[ ]:

digits = load_digits()

[ ]:

[ ]:

[ ]:

[ ]:

Task 2: Data splitting#

Split the data and assign the splits to the variables X_train, X_test, y_train, y_test. Set a random_state of your choice. Split such that the training sets contain 75 percent of the data. Confirm that by looking at the shapes of the resulting arrays.
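One possible way to do the split is sketched below with train_test_split from sklearn.model_selection. The value random_state=0 is an arbitrary choice; any integer makes the split reproducible.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

digits = load_digits()

# train_size=0.75 puts 75 percent of the rows into the training split.
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, train_size=0.75, random_state=0
)

# Confirm the proportions by comparing the shapes.
print(X_train.shape, X_test.shape)
print(y_train.shape, y_test.shape)
print(len(X_train) / len(digits.data))
```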

[ ]:

[ ]:

Task 3: Logistic Regression#

Run a logistic regression without regularization and with intercept
Use the fitted model to create predictions on the test dataset

[ ]:

[ ]:

[ ]:

Task 4: Assess model quality#

Calculate the accuracy score
Calculate the f1 score
Convert the "target_names" to a string data type
Create a classification report
Calculate a confusion_matrix
Plot the confusion matrix using seaborn’s heatmap function (Optional)
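The steps in this list could be sketched as follows. To keep the block self-contained it refits a logistic regression with default settings as a stand-in for the model from Task 3; random_state=0, max_iter=5000 and average="macro" are illustrative choices. An averaging method has to be chosen for the f1 score because the problem has more than two classes.

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    accuracy_score,
    classification_report,
    confusion_matrix,
    f1_score,
)
from sklearn.model_selection import train_test_split

digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, train_size=0.75, random_state=0
)

# Stand-in classifier; any fitted model with a predict method works here.
model = LogisticRegression(max_iter=5000).fit(X_train, y_train)
y_pred = model.predict(X_test)

acc = accuracy_score(y_test, y_pred)
# Multiclass targets need an explicit averaging strategy.
f1 = f1_score(y_test, y_pred, average="macro")

# target_names holds integers; the report expects strings.
target_names = digits.target_names.astype(str)
report = classification_report(y_test, y_pred, target_names=target_names)

# Rows are true labels, columns are predicted labels.
cm = confusion_matrix(y_test, y_pred)

print(acc, f1)
print(report)
print(cm)
```

For the optional plot, passing the matrix to seaborn as sns.heatmap(cm, annot=True, fmt="d") writes the counts into the cells.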

[ ]:

[ ]:

[ ]:

[ ]:

[ ]:

[ ]:

Task 5: Logit fitting with penalty#

Run a logistic regression with an “l2” penalty. Set the penalty parameter C = \(1 / \lambda\) to 1.
You will get a warning. You have two options to solve it:
1. Find a good explanation of why it is acceptable to ignore this warning. Relate this to the differences between machine learning and econometrics
2. Change the settings so you don’t get the warning
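For the second option, a sketch under the assumption that the warning is the solver's ConvergenceWarning: standardizing the features and allowing more iterations are two common adjustments. The pipeline, max_iter=1000 and random_state=0 are illustrative choices, not the only way to change the settings.

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, train_size=0.75, random_state=0
)

# "l2" is the default penalty of LogisticRegression, so only C is set.
# Scaling the pixel values and raising max_iter help the solver converge.
penalized = make_pipeline(
    StandardScaler(),
    LogisticRegression(C=1.0, max_iter=1000),
)
penalized.fit(X_train, y_train)

test_accuracy = penalized.score(X_test, y_test)
print(test_accuracy)
```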

[ ]:

Task 6: Understanding decision trees and random forests in group work#

Read the following two sections of the Python Data Science Handbook

Discuss decision trees and random forests with your neighbor or in groups of up to 5 people. Make sure everyone understands the basic idea and no one gets hung up on small technicalities.

After everyone has a good understanding of the two methods, go through the basic steps (import, create model instance, fit, evaluate score) for a decision tree and a random forest.
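The four basic steps could be sketched like this for both models. All hyperparameters are left at their defaults except n_estimators=100 and random_state=0, which are arbitrary choices made for reproducibility.

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, train_size=0.75, random_state=0
)

# Create one instance per model.
models = {
    "decision tree": DecisionTreeClassifier(random_state=0),
    "random forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

# Fit each model and evaluate its accuracy on the test data.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)
    print(name, scores[name])
```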

[ ]:

[ ]:

Task 7: K-fold Cross Validation#

Do a five-fold cross-validation for a model of your choice on the training dataset.
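A minimal sketch using cross_val_score, with a decision tree standing in for the model of your choice. Only the training data is passed in, so the test data stays untouched.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, train_size=0.75, random_state=0
)

# cv=5 splits the training data into five folds; each fold is used
# once for validation while the model is fitted on the other four.
model = DecisionTreeClassifier(random_state=0)
cv_scores = cross_val_score(model, X_train, y_train, cv=5)

print(cv_scores)
print(cv_scores.mean(), cv_scores.std())
```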

[ ]:

Task 8: Hyperparameter tuning#

Tune the hyperparameters of one of the methods used above using a grid search with cross validation
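One way to set this up is sketched below with GridSearchCV and a decision tree. The grid values for max_depth and min_samples_leaf are hypothetical starting points chosen to keep the search small, not recommended settings.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, train_size=0.75, random_state=0
)

# Hypothetical grid: every combination of these values is tried.
param_grid = {
    "max_depth": [5, 10, None],
    "min_samples_leaf": [1, 5],
}

# Each combination is evaluated with five-fold cross-validation
# on the training data only.
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0), param_grid, cv=5
)
search.fit(X_train, y_train)

print(search.best_params_)
print(search.best_score_)

# By default the best model is refitted on all training data,
# so it can be scored directly on the test data.
print(search.score(X_test, y_test))
```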

[ ]:

[ ]:

[ ]:

[ ]: