
Exercise 4 (solution)#

[ ]:
from sklearn.datasets import load_digits
import pandas as pd
import seaborn as sns

Note on import statements#

  • In all real projects, all import statements should be in the first cell of a notebook

  • It is part of this exercise that you learn how to import what you need from sklearn

  • Therefore, in these exercise notebooks you will see imports in many places

Task 1: Load and inspect the dataset#

In this task you will load the digits dataset from sklearn.datasets, using scikit-learn’s load_digits function, which will return a dictionary-like Bunch object.

The goal of this warm-up task is to use your Python knowledge to inspect the object you get from load_digits. You do not need to google.

  1. List the keys of the object

  2. Look at some of the entries and understand their format (e.g. using type() and .shape)

  3. Look at the description inside digits and find all the terms mentioned on the terminology slide

[ ]:
digits = load_digits()
[ ]:
digits.keys()
[ ]:
type(digits["data"])
[ ]:
digits["data"].shape
[ ]:
print(digits["DESCR"])
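One way to carry out step 2 for every entry at once is to loop over the keys and print each value's type and, where it has one, its shape. This is a minimal sketch; the helper name `describe_entry` is only for illustration and is not part of scikit-learn.

```python
from sklearn.datasets import load_digits

digits = load_digits()


def describe_entry(value):
    """Return the type name and, if the value has one, its shape."""
    # Hypothetical helper for this sketch; strings and None have no .shape.
    shape = getattr(value, "shape", None)
    return type(value).__name__, shape


for key in digits.keys():
    print(key, *describe_entry(digits[key]))

# A Bunch also supports attribute access, so both spellings give the same array.
assert digits.data is digits["data"]
```

Entries without a `.shape` attribute, such as the description string, are reported with `None` as their shape.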

Task 2: Data splitting#

Split the data and assign the splits to the variables X_train, X_test, y_train, y_test. Set a random_state of your choice. Split such that the training sets contain 75 percent of the data. Confirm that by looking at the shapes of the resulting arrays.

[ ]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    digits["data"],
    digits["target"],
    random_state=1234,
    test_size=0.25,
)
X_train.shape
[ ]:
X_test.shape
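To confirm the split proportions numerically rather than by reading the shapes, the array lengths can be turned into shares of the full dataset. A minimal sketch that repeats the split from above so it runs on its own:

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits["data"],
    digits["target"],
    random_state=1234,
    test_size=0.25,
)

n_total = len(digits["data"])
train_share = len(X_train) / n_total
test_share = len(X_test) / n_total

# The shares are only approximately 0.75 and 0.25 because the number of
# samples is not divisible by four.
print(f"train: {train_share:.3f}, test: {test_share:.3f}")
```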

Task 3: Logistic Regression#

  1. Run a logistic regression without regularization and with intercept

  2. Use the fitted model to create predictions on the test dataset

[ ]:
from sklearn.linear_model import LogisticRegression

model = LogisticRegression(
    fit_intercept=True,
    penalty=None,
)
model.fit(X_train, y_train)
[ ]:
y_pred = model.predict(X_test)
y_pred
[ ]:
model.score(X_test, y_test)

Task 4: Assess model quality#

  1. Calculate the accuracy score

  2. Calculate the f1 score

  3. Convert the "target_names" to a string data type

  4. Create a classification report

  5. Calculate a confusion_matrix

  6. Plot the confusion matrix using seaborn's heatmap function (optional)

[ ]:
from sklearn.metrics import accuracy_score

accuracy_score(y_test, y_pred)
[ ]:
from sklearn.metrics import f1_score

f1_score(y_test, y_pred, average=None)
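With `average=None`, `f1_score` returns one value per class. Each value is the harmonic mean of that class's precision and recall, and `average="macro"` is the unweighted mean of the per-class values. The sketch below checks this on a small made-up label vector (the labels are illustrative and not taken from the digits data).

```python
import numpy as np
from sklearn.metrics import f1_score, precision_score, recall_score

# Made-up labels for three classes, only for illustration.
y_true = np.array([0, 0, 1, 1, 2, 2])
y_hat = np.array([0, 1, 1, 1, 2, 0])

precision = precision_score(y_true, y_hat, average=None)
recall = recall_score(y_true, y_hat, average=None)

# F1 is the harmonic mean of precision and recall, computed per class.
f1_by_hand = 2 * precision * recall / (precision + recall)
f1_per_class = f1_score(y_true, y_hat, average=None)
f1_macro = f1_score(y_true, y_hat, average="macro")

print(f1_per_class, f1_macro)
```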
[ ]:
digits["target_names"] = digits["target_names"].astype(str)
[ ]:
from sklearn.metrics import classification_report

report = classification_report(
    y_test,
    y_pred,
    target_names=digits["target_names"],
)
print(report)
[ ]:
from sklearn.metrics import confusion_matrix

confusion = confusion_matrix(y_test, y_pred, normalize="true")
confusion = pd.DataFrame(
    confusion, columns=digits["target_names"], index=digits["target_names"]
)
[ ]:

# Set the figure size before plotting so that it applies to this heatmap.
sns.set(rc={"figure.figsize": (12, 8.27)})
sns.heatmap(
    confusion.round(3),
    cmap=sns.color_palette("Blues", as_cmap=True),
    annot=True,
)
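The `normalize="true"` argument used above divides each row of the confusion matrix by the number of samples whose true label is that row's class. Each row therefore sums to one, and the diagonal holds the per-class recall. A small sketch with made-up labels (not the digits data):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up labels for three classes, only for illustration.
y_true = np.array([0, 0, 1, 1, 2, 2])
y_hat = np.array([0, 1, 1, 1, 2, 0])

# Rows are true labels, columns are predicted labels.
counts = confusion_matrix(y_true, y_hat)
shares = confusion_matrix(y_true, y_hat, normalize="true")

# Dividing each row of the counts by its row total reproduces the shares.
shares_by_hand = counts / counts.sum(axis=1, keepdims=True)

print(counts)
print(shares)
```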

Task 5: Logit fitting with penalty#

  1. Run a logistic regression with an “l2” penalty. Set the penalty parameter C = \(1 / \lambda\) to 1.

  2. You will get a warning. You have two options to solve it:

    1. Find a good explanation of why it is acceptable to ignore this warning. Relate this to the differences between machine learning and econometrics

    2. Change the settings so you don’t get the warning

[ ]:
logit = LogisticRegression(fit_intercept=True, max_iter=4500, C=1)
logit.fit(X_train, y_train)
logit.score(X_test, y_test)

In econometrics, it would be a serious problem if a numerical optimization stopped without converging because it reached the maximum number of iterations, since we would have no way of knowing whether this introduces a large bias in our parameter estimates. In supervised machine learning, we can simply try it out: what matters is predictive performance on held-out data, which we can measure directly. Fewer iterations can even work better than more, because stopping early can reduce overfitting.
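The "try it out" argument can be made concrete by fitting the same model with different iteration limits and comparing the scores on held-out data. This is a rough sketch; the `max_iter` values are arbitrary choices for illustration, and the resulting scores depend on the split and the scikit-learn version.

```python
import warnings

from sklearn.datasets import load_digits
from sklearn.exceptions import ConvergenceWarning
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits["data"], digits["target"], random_state=1234, test_size=0.25
)

scores_by_max_iter = {}
for max_iter in [20, 100, 1000]:
    with warnings.catch_warnings():
        # Silence the convergence warning on purpose; the model is judged
        # by its score on held-out data instead.
        warnings.simplefilter("ignore", category=ConvergenceWarning)
        candidate = LogisticRegression(C=1, max_iter=max_iter)
        candidate.fit(X_train, y_train)
    scores_by_max_iter[max_iter] = candidate.score(X_test, y_test)

print(scores_by_max_iter)
```

If the scores barely differ across iteration limits, the missing convergence has little practical effect on predictions for this dataset.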

Task 6: Understanding decision trees and random forests in group work#

Read the following two sections of the Python Data Science Handbook

Discuss decision trees and random forests with your neighbor or in groups of up to 5 people. Make sure everyone understands the basic idea and no one gets hung up on small technicalities.

After everyone has a good understanding of the two methods, go through the basic steps (import, create model instance, fit, evaluate score) for a decision tree and a random forest.

[ ]:
from sklearn.tree import DecisionTreeClassifier

tree = DecisionTreeClassifier()
tree.fit(X_train, y_train)
tree.score(X_test, y_test)
[ ]:
from sklearn.ensemble import RandomForestClassifier

forest = RandomForestClassifier()
forest.fit(X_train, y_train)
forest.score(X_test, y_test)

Task 7: K-fold Cross Validation#

Do a five-fold cross-validation for a model of your choice on the training dataset.

[ ]:
from sklearn.model_selection import cross_val_score

scores = cross_val_score(logit, X_train, y_train, cv=5)
scores
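To see what `cross_val_score` does for `cv=5`, the folds can be written out by hand. For a classifier, an integer `cv` uses stratified folds without shuffling, so a loop over `StratifiedKFold(n_splits=5)` should reproduce the same scores. The sketch uses a decision tree with a fixed `random_state` (an arbitrary choice to keep the fits fast and deterministic).

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import load_digits
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits["data"], digits["target"], random_state=1234, test_size=0.25
)

estimator = DecisionTreeClassifier(random_state=0)

manual_scores = []
for fit_idx, val_idx in StratifiedKFold(n_splits=5).split(X_train, y_train):
    # Fit a fresh copy on four folds and score it on the held-out fold.
    fold_model = clone(estimator)
    fold_model.fit(X_train[fit_idx], y_train[fit_idx])
    manual_scores.append(fold_model.score(X_train[val_idx], y_train[val_idx]))

manual_scores = np.array(manual_scores)
library_scores = cross_val_score(estimator, X_train, y_train, cv=5)

print(manual_scores)
print(library_scores)
```

Note that the test set is never touched here; all five validation folds come from the training data.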

Task 8: Hyperparameter tuning#

Tune the hyperparameters of one of the methods used above using a grid search with cross-validation.

[ ]:
from sklearn.model_selection import GridSearchCV

param_grid = {
    "penalty": ["l2", "l1"],
    "max_iter": [100, 2000],
    "C": [0.01, 0.1, 100],
}

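grid = GridSearchCV(
    LogisticRegression(
        fit_intercept=True,
        penalty="l2",
        # The default "lbfgs" solver does not support the "l1" penalty in the
        # grid, so those fits would fail; "saga" supports both "l1" and "l2".
        solver="saga",
    ),
    param_grid,
    cv=7,
)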
[ ]:
grid.fit(X_train, y_train)
[ ]:
grid.best_params_
[ ]:
grid.best_estimator_.score(X_test, y_test)
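Beyond `best_params_`, a fitted `GridSearchCV` stores the cross-validated score of every parameter combination in `cv_results_`, which is convenient to view as a DataFrame. The sketch below uses a small decision-tree grid, chosen here only because it fits quickly; the grid values are illustrative.

```python
import pandas as pd
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits["data"], digits["target"], random_state=1234, test_size=0.25
)

tree_grid = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    {"max_depth": [3, 10, None], "min_samples_leaf": [1, 5]},
    cv=5,
)
tree_grid.fit(X_train, y_train)

# One row per parameter combination, sorted so the best rank comes first.
results = pd.DataFrame(tree_grid.cv_results_)[
    ["params", "mean_test_score", "std_test_score", "rank_test_score"]
].sort_values("rank_test_score")
print(results)
```

Looking at the whole table shows how close the runner-up combinations are to the best one, which `best_params_` alone does not reveal.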