9. Classification and Regression Metrics#

9.1. Lecture Learning Objectives#

  • Explain why accuracy is not always the best metric in ML.

  • Explain components of a confusion matrix.

  • Define precision, recall, and f1-score and use them to evaluate different classifiers.

  • Identify whether there is class imbalance and whether you need to deal with it.

  • Explain class_weight and use it to deal with data imbalance.

  • Appropriately select a scoring metric given a regression problem.

  • Interpret and communicate the meanings of different scoring metrics on regression problems: MSE, RMSE, \(R^2\), and MAPE.

  • Apply different scoring functions with cross_validate, GridSearchCV and RandomizedSearchCV.

9.2. Five Minute Recap/ Lightning Questions#

  • What is the difference between a business and a statistical question?

  • Should we ever question our clients’ requests?

  • What is an important feature?

  • What are some types of feature selection methods?

9.2.1. Some lingering questions#

  • How can we measure our model’s success besides using accuracy or \(R^2\)?

  • How should we interpret our model score if we have data where there is a lot of one class and very few of another?

9.3. Introducing Evaluation Metrics#

Up until this point, we have been scoring our models the same way every time. We’ve been using the percentage of correctly predicted examples for classification problems and the \(R^2\) metric for regression problems. Let’s discuss how we need to expand our horizons and why it’s important to evaluate our models in other ways.

To help explain why accuracy isn’t always the most beneficial option, we are going back to the creditcard data set from the first class.

import pandas as pd
from sklearn.model_selection import train_test_split


cc_df = pd.read_csv('data/creditcard_sample.csv', encoding='latin-1')
train_df, test_df = train_test_split(cc_df, test_size=0.3, random_state=111)
train_df
Time V1 V2 V3 V4 V5 V6 V7 V8 V9 ... V21 V22 V23 V24 V25 V26 V27 V28 Amount Class
2210 139995.0 0.000822 0.176378 -0.081084 -2.240657 0.266328 -1.458596 0.658240 -0.340358 -1.124072 ... 0.574194 1.741723 -0.110379 0.053146 -0.692897 -0.207781 0.460053 0.307173 15.00 0
98478 139199.0 1.898426 -0.544627 0.021055 0.233999 -0.690212 0.343812 -0.976358 0.241278 0.957517 ... 0.118648 0.439855 0.323290 0.749224 -0.580108 0.317277 -0.005703 -0.034896 23.36 0
75264 147031.0 1.852468 -0.216744 -1.956124 0.360745 0.415657 -0.577488 0.229426 -0.215398 0.913203 ... -0.198389 -0.526080 0.093325 0.322035 -0.030224 -0.113123 -0.022952 0.000988 109.54 0
66130 50102.0 -0.999481 0.849393 -0.556091 0.259464 2.298113 3.728162 -0.258322 1.353233 -0.503258 ... -0.082967 -0.136016 0.092160 1.009201 0.216844 -0.236471 0.201575 0.101621 20.24 0
82331 41819.0 -0.417792 1.027810 1.560763 -0.029187 -0.076807 -0.904689 0.688554 -0.056332 -0.369867 ... -0.229592 -0.609212 -0.019424 0.356282 -0.198697 0.072055 0.264011 0.120743 2.69 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
105747 86420.0 1.539769 -0.710190 -0.779133 0.972778 0.521677 1.992379 -0.538152 0.592431 0.530753 ... -0.020365 -0.203199 0.323143 -0.793579 -0.611899 -0.926726 0.073134 -0.018315 147.80 0
102486 113038.0 -0.509300 1.128383 -0.876960 -0.568208 0.819440 -0.749178 0.903256 0.068764 0.068195 ... -0.391476 -0.860542 0.061769 0.387231 -0.334076 0.101585 0.085727 -0.194219 44.99 0
4820 142604.0 1.906919 -0.398941 0.275837 1.736308 -0.710844 0.682936 -1.180614 0.443751 0.047498 ... -0.022269 -0.163610 0.499126 0.731827 -1.088328 2.005337 -0.153967 -0.061703 3.75 0
10196 139585.0 2.106285 -0.102411 -1.815538 0.256847 0.340938 -1.002490 0.373141 -0.314247 0.541619 ... -0.060222 -0.047904 0.124192 0.771908 0.144864 0.645126 -0.117185 -0.074093 5.41 0
77652 148922.0 2.157147 -1.138329 -0.775495 -0.887122 -1.019818 -0.489387 -1.024161 -0.069089 0.329227 ... 0.282963 0.802273 0.037861 -0.642100 -0.101534 -0.046669 -0.001974 -0.052120 39.99 0

85504 rows × 31 columns

train_df.shape
(85504, 31)

We can see that this is quite a large dataset!

train_df.describe(include='all')
Time V1 V2 V3 V4 V5 V6 V7 V8 V9 ... V21 V22 V23 V24 V25 V26 V27 V28 Amount Class
count 85504.000000 85504.000000 85504.000000 85504.000000 85504.000000 85504.000000 85504.000000 85504.000000 85504.000000 85504.000000 ... 85504.000000 85504.000000 85504.000000 85504.000000 85504.000000 85504.000000 85504.000000 85504.000000 85504.000000 85504.000000
mean 111177.965218 0.069621 -0.035379 -0.296593 -0.028028 0.086975 -0.030399 0.025395 -0.017036 0.042680 ... 0.017353 0.052381 0.012560 -0.009845 -0.048255 -0.014074 -0.003047 -0.002209 92.724942 0.004128
std 48027.531032 2.108440 1.780371 1.631892 1.466457 1.452847 1.354052 1.361786 1.258107 1.131435 ... 0.776954 0.755315 0.670336 0.607638 0.548000 0.481118 0.414234 0.369266 271.297276 0.064121
min 406.000000 -56.407510 -72.715728 -33.680984 -5.683171 -40.427726 -26.160506 -43.557242 -73.216718 -13.434066 ... -34.830382 -10.933144 -36.666000 -2.824849 -8.696627 -2.534330 -9.895244 -8.656570 0.000000 0.000000
25% 50814.000000 -0.886089 -0.634044 -1.228706 -0.871992 -0.622997 -0.801849 -0.550769 -0.234941 -0.616671 ... -0.225345 -0.524692 -0.160006 -0.365718 -0.375934 -0.331664 -0.074373 -0.058973 5.990000 0.000000
50% 133031.500000 0.064451 0.027790 -0.206322 -0.099292 0.060853 -0.300730 0.076727 0.001596 0.003678 ... -0.008602 0.074564 0.002990 0.027268 -0.062231 -0.061101 -0.003718 -0.003411 22.660000 0.000000
75% 148203.000000 1.832261 0.796311 0.767406 0.635543 0.735001 0.374897 0.632747 0.310501 0.658517 ... 0.215080 0.622089 0.177875 0.458784 0.317849 0.230836 0.088166 0.076868 80.000000 0.000000
max 172788.000000 2.451888 22.057729 4.187811 16.715537 34.801666 23.917837 44.054461 19.587773 9.234623 ... 27.202839 10.503090 20.803344 3.979637 7.519589 3.155327 10.507884 33.847808 19656.530000 1.000000

8 rows × 31 columns

We see that the columns are all scaled and numerical.

You don’t need to worry about this now. The original columns have already been transformed (for confidentiality, and conveniently for us), so there are no categorical features.

Let’s separate X and y for train and test splits.

X_train_big, y_train_big = train_df.drop(columns=["Class"]), train_df["Class"]
X_test, y_test = test_df.drop(columns=["Class"]), test_df["Class"]

We are going to be talking about evaluation metrics and it’s easier to do so if we use an explicit validation set instead of using cross-validation.

Our data is large enough so it shouldn’t be a problem.

X_train, X_valid, y_train, y_valid = train_test_split(
    X_train_big,  y_train_big, test_size=0.3, random_state=123)

9.3.1. Baseline#

Just like any predictive question, we start our analysis by building a simple DummyClassifier model as our baseline.

from sklearn.dummy import DummyClassifier

dummy = DummyClassifier(strategy="most_frequent")
dummy.fit(X_train, y_train)
dummy.score(X_train, y_train)
0.9958564458998864
dummy.score(X_valid, y_valid)
0.9959067519101824

Almost 100% accuracy? This is supposed to be a baseline model! How is it getting such high accuracy? Should we just deploy this DummyClassifier model for fraud detection?

Not so fast… If we look at the distribution of fraudulent labels to non-fraudulent labels, we can see there is an imbalance in the classes.

train_df["Class"].value_counts(normalize=True)
Class
0    0.995872
1    0.004128
Name: proportion, dtype: float64

Here the 0 class is a Non fraud transaction, and the 1 class is a Fraud transaction. We can see here that there are MANY Non fraud transactions and only a tiny handful of Fraud transactions. So, what would be a good accuracy here? 99.9%? 99.99%?

Let’s see if a logistic regression model would get a higher score than the Dummy model.

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

pipe = make_pipeline(
    StandardScaler(),
    LogisticRegression(random_state=123)
)

pd.DataFrame(cross_validate(pipe, X_train, y_train, return_train_score=True)).mean()
fit_time       0.172872
score_time     0.002629
test_score     0.998830
train_score    0.998993
dtype: float64

This seems slightly better than the DummyClassifier, but the question is: can it really identify fraudulent transactions? The “Fraud” class is the class that we want to spot, the class we are interested in. Let’s look at some new metrics that can help us assess how well our model is doing overall and not just on the majority class label.

9.4. Classification Metrics and tools#

9.4.1. What is “positive” and “negative”?#

There are two kinds of binary classification problems:

  • Distinguishing between two classes

  • Spotting a specific class (fraud transaction, spam, disease)

In the case of spotting problems, the thing that we are interested in spotting is considered “positive” (not related to how a logistic regression model internally defines a “positive” and “negative” class). In our example, we want to spot fraudulent transactions, so fraudulent is the “positive” class.

9.4.2. Confusion Matrix#

A confusion matrix is a table that visualizes the performance of an algorithm. It shows the possible labels and how many of each label the model predicts correctly and incorrectly.

Here we first fit the model and make predictions, and then show its confusion matrix.

from sklearn.metrics import confusion_matrix


pipe.fit(X_train, y_train)
predictions = pipe.predict(X_valid)
cm = confusion_matrix(y_valid, predictions)
cm
array([[25541,     6],
       [   25,    80]])

9.4.2.1. Confusion Matrix components#

|                  | predict negative    | predict positive    |
|------------------|---------------------|---------------------|
| negative example | True negative (TN)  | False positive (FP) |
| positive example | False negative (FN) | True positive (TP)  |

Remember the Fraud is considered “positive” in this case and Non fraud is considered “negative”.

The 4 quadrants of the confusion matrix can be explained as follows. These positions will change depending on what values we deem as the positive label.

  • True negative (TN): Examples that are negatively labelled that the model correctly predicts. This is in the top left quadrant.

  • False positive (FP): Examples that are negatively labelled that the model incorrectly predicts as positive. This is in the top right quadrant.

  • False negative (FN): Examples that are positively labelled that the model incorrectly predicts as negative. This is in the bottom left quadrant.

  • True positive (TP): Examples that are positively labelled that the model correctly predicted as positive. This is in the bottom right quadrant.

Instead of looking just at the numbers and remembering what each category represents, we can use the ConfusionMatrixDisplay class (the plot_confusion_matrix function in earlier versions of sklearn) to visualize how well our model is doing classifying each target class.

We can use the classes_ attribute to see which position each label takes so we can give them more descriptive labels in our plot.

pipe.classes_
array([0, 1])
from sklearn.metrics import ConfusionMatrixDisplay
import matplotlib.pyplot as plt


plt.rc('font', size=12) # bigger font sizes
ConfusionMatrixDisplay(cm, display_labels=["Non fraud", "Fraud"]).plot()
<sklearn.metrics._plot.confusion_matrix.ConfusionMatrixDisplay at 0x179d97e50>
../_images/lecture9_33_1.png

In fact, we don’t even need to manually create the confusion matrix before plotting it, but can instead build it straight from the fitted estimator. In this case, we only need to fit the model before visualizing it (the predictions are done automatically on the validation/test data set that we pass in). This results in a 2 by 2 matrix with the labels Non fraud and Fraud on each axis.

pipe.fit(X_train, y_train)  # We already did this above, but adding it here for clarity

ConfusionMatrixDisplay.from_estimator(
    pipe,
    X_valid,
    y_valid,
    display_labels=["Non fraud", "Fraud"],
);
../_images/lecture9_35_0.png

Looking at the plotting arguments:

  • Similar to other sklearn functions, we pass the model/pipeline followed by the feature table and then the target values.

  • display_labels will show more descriptive labels. Without this argument, it would simply show the classes we have in the data (0, 1).

  • values_format will determine how the numbers are displayed. Specifying d avoids scientific notation for large numbers (not needed in this example).

  • cmap is the colour argument! The default is viridis but other values such as Blues, Purples, RdPu or other colour schemes from here are also possible.

9.4.3. Accuracy is only part of the story…#

We have been using .score to assess our models, which returns accuracy by default for classification models. We just saw that accuracy can be misleading when we have a class imbalance, so maybe there are other metrics that are more suitable in these cases?

Note that the metrics we are going to discuss will only help us assess our model. Further into this lecture we’ll talk about a few ways to address the class imbalance problem as well.

To understand the metrics we are going to talk about next, we will need our values for the four different quadrants in the confusion matrix. We are going to split up the values in the matrix into four separate variables

  • TN for the True Negatives

  • FP for the False Positives

  • FN for the False Negatives

  • TP for the True Positives

# confusion_matrix lays out rows as true classes and columns as predicted classes,
# so flattening row by row gives TN, FP, FN, TP in that order
TN, FP, FN, TP = cm.flatten()

Now let’s look at the first metric, “Recall”

9.4.4. Recall#

“Among all positive examples, how many did the model identify?”

Recall is the ability of the classifier to find all the positive samples. You can think of this as “What was the model’s recall/hit rate out of all the truly positive observations”. The denominator in the equation below is all the truly positive values.

\[ \text{recall} = \frac{\text{Number of correctly identified positives}}{\text{Total number of true positives}} = \frac{TP}{TP + FN} \]

In binary classification, recall is sometimes used more generally for either the positive or negative class: recall of the positive class is also known as “sensitivity” and recall of the negative class is “specificity”, which are terms you might recognize from your statistics classes. In machine learning we almost always refer to “sensitivity” when we just say “recall”.

Since Fraud is our positive label, we see the correctly identified labels in the bottom right quadrant and the ones that we missed in the bottom left quadrant.

(figure: confusion matrix with the bottom row, TP and FN, highlighted)

So here we take our true positives and we divide by all the positive labels in our validation set (the predictions the model incorrectly labelled as negative (the false negatives) as well as those correctly labelled as positive).

print('True Positives:', TP)
print('False Negatives:', FN)
True Positives: 80
False Negatives: 25
recall = TP / (TP + FN)
recall.round(3)
0.762

9.4.5. Precision#

“Among the positive examples you identified, how many were actually positive?”

Precision is the ability of the classifier to avoid putting a positive label on a negative observation. You can think of this as “How precise are the model’s predictions?”. The denominator in the equation below is all the predicted positive values.

\[ \text{precision} = \frac{\text{Number of correctly identified positives}}{\text{Total number of predicted positives}} = \frac{TP}{TP + FP} \]

With Fraud as our positive label, we see the correctly identified Fraudulent cases in the bottom right quadrant and the labels we incorrectly labelled as Frauds in the top right.

(figure: confusion matrix with the right column, TP and FP, highlighted)

So here we take our true positives and we divide by all the positive labels that our model predicted.

print('True Positives:', TP)
print('False Positives:', FP)
True Positives: 80
False Positives: 6
precision = TP / (TP + FP)
precision.round(3)
0.93

Of course, we’d like to have both high precision and high recall, but the right balance depends on our domain and on which type of error is more important to avoid. For credit card fraud detection, recall is really important (catching frauds) and precision is less important (reducing false positives), since there will likely be a manual review process in place to look closer at the predicted frauds and prevent false accusations, whereas there are likely too many observations to manually review all the potentially missed frauds.

9.4.6. Visualization of precision and recall#

In case you find the concepts above hard to follow or remember, I am including this schematic as a visual aid.

(figure: schematic illustration of precision and recall)

Source: https://en.wikipedia.org/wiki/Precision_and_recall

9.4.7. f1 score#

Sometimes we need a single score to maximize, e.g., when doing hyperparameter tuning via RandomizedSearchCV. Accuracy is often not the ideal choice, and we might care about both precision and recall. One way of combining these two into a single score is to average them. However, in machine learning we usually use a different way of averaging these metrics, called the “harmonic mean”. The advantage of this is that it penalizes the model more for performing poorly on either precision or recall, whereas if we just took the common arithmetic mean, the model could compensate, e.g., for a low recall with a high precision and still get a high overall score.
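As a quick numeric illustration (with made-up numbers, not from our model): a classifier with precision 0.9 but recall 0.1 looks mediocre under the arithmetic mean but clearly poor under the harmonic mean.

p, r = 0.9, 0.1
print("arithmetic mean:", (p + r) / 2)        # 0.5
print("harmonic mean:", 2 * p * r / (p + r))  # 0.18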

The harmonic mean of the precision and recall is called the f1 score:

\[ \text{f1} = 2 * \frac{\text{precision} * \text{recall}}{\text{precision} + \text{recall}} \]
print('Precision:', precision.round(4))
print('Recall:', recall.round(4))
Precision: 0.9302
Recall: 0.7619
f1_score = (2 * precision * recall) / (precision + recall)
f1_score.round(3)
0.838

We could calculate all these evaluation metrics by hand using the formulas we have covered so far:

data = {}
data["accuracy"] = [(TP + TN) / (TN + FP + FN + TP)]
data["precision"] = [ TP / (TP + FP)] 
data["recall"] = [TP / (TP + FN)] 
data["f1 score"] = [(2 * precision * recall) / (precision + recall)] 

measures_df = pd.DataFrame(data)
measures_df
accuracy precision recall f1 score
0 0.998792 0.930233 0.761905 0.837696

… or we can use scikit-learn which has functions for these metrics.

Here we are importing accuracy_score, precision_score, recall_score, f1_score from sklearn.metrics

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score


pred_cv =  pipe.predict(X_valid) 

data["accuracy"].append(accuracy_score(y_valid, pred_cv))
data["precision"].append(precision_score(y_valid, pred_cv))
data["recall"].append(recall_score(y_valid, pred_cv))
data["f1 score"].append(f1_score(y_valid, pred_cv))

pd.DataFrame(data, index=['by-hand', 'sklearn'])
accuracy precision recall f1 score
by-hand 0.998792 0.930233 0.761905 0.837696
sklearn 0.998792 0.930233 0.761905 0.837696

And you can see the scores match.

We can even go one step further and “observe” all the scores at once using a classification report.

9.4.8. Classification report#

Similar to how a confusion matrix shows the false and true negative and positive labels, a classification report shows us an assortment of metrics. However, unlike the confusion matrix, we can’t flatten it or extract the individual results; we only see what is printed as the output.

We can import classification_report from sklearn.metrics

from sklearn.metrics import classification_report

In our function, we specify the true labels, followed by the predictions our model made.

The argument target_names, gives more descriptive labels similar to what display_labels did when plotting the confusion matrix.

print(
    classification_report(
        y_valid,
        pipe.predict(X_valid),
        target_names=["non fraud", "Fraud"]
    )
)
              precision    recall  f1-score   support

   non fraud       1.00      1.00      1.00     25547
       Fraud       0.93      0.76      0.84       105

    accuracy                           1.00     25652
   macro avg       0.96      0.88      0.92     25652
weighted avg       1.00      1.00      1.00     25652

Note that what you consider “positive” (Fraud in our case) is important when calculating precision, recall, and f1-score. If you flip what is considered positive or negative, we’ll end up with different True Positive, False Positive, True Negatives and False Negatives, and hence different precision, recall, and f1-scores. The support column just shows the number of examples in each class.
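For example, the individual metric functions take a pos_label argument (a minimal sketch, assuming y_valid and pred_cv from the cells above are still in scope); setting pos_label=0 reproduces the “non fraud” row of the report.

from sklearn.metrics import precision_score, recall_score

# Treat "non fraud" (label 0) as the positive class instead of "Fraud"
print('Precision:', precision_score(y_valid, pred_cv, pos_label=0).round(3))
print('Recall:', recall_score(y_valid, pred_cv, pos_label=0).round(3))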

You might be wondering about the two lines at the end of this report, so let’s cover that next.

9.4.8.1. Macro average vs weighted average#

These rows show averages of the per-class values for each of the metrics.

  • Macro average gives equal importance to all classes irrespective of the number of observations (support) in each class.

  • Weighted average weighs the average by the number of observations (support) in each class.

Which one is relevant depends on whether you think each class should have the same weight or each sample should have the same weight. These metrics are often useful when predicting multiple classes, which we will briefly discuss later on.
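We can check this with the average argument of the individual metric functions (a small sketch, reusing y_valid and pred_cv from earlier): average="macro" is the plain mean of the per-class scores, while average="weighted" weighs each class by its support.

from sklearn.metrics import recall_score

# Unweighted mean of the per-class recalls (the "macro avg" row)
print(recall_score(y_valid, pred_cv, average="macro").round(2))
# Mean weighted by the number of examples in each class (the "weighted avg" row)
print(recall_score(y_valid, pred_cv, average="weighted").round(2))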

In addition to this lecture, my wonderful colleague Varada Kolhatkar has made a cheat sheet for these metrics available in a larger size here.


9.4.9. Imbalanced datasets#

A class imbalance typically refers to having many more examples of one class than another in one’s training set. We’ve seen this in our fraud dataset where our class target column had many more non-fraud than fraud examples. Real-world data is often imbalanced and can be seen in scenarios such as:

  • Ad clicking data (Only around ~0.01% of ads are clicked.)

  • Spam classification datasets.

y_train.value_counts(normalize=True)
Class
0    0.995856
1    0.004144
Name: proportion, dtype: float64

9.4.9.1. Addressing class imbalance#

A very important question to ask yourself: “Why do I have a class imbalance?”

  • Is it because of my data collection methods?

    • If it’s the data collection, then that means you need to rethink how you have collected the data and whether you can recollect it to balance the classes. (Note: it might be dangerous to just go out and collect new observations of the least common class, since these would be collected after the original data; if the data changed over time, these newly collected observations will differ from the old ones not because of their class, but because of the date they were collected.)

  • Is it because one class is much rarer than the other?

    • If it’s because one is rarer than the other in the true data distribution, you need to think about which type of error is more important to the stakeholders and prioritize how you train the model and how you assess its performance.

9.4.9.2. Handling imbalance#

Can we change the model itself so that it considers the errors that are important to us?

There are two common approaches to this:

  1. Changing the training procedure

  2. Changing the data (not in this course)

    • Undersampling

    • Oversampling

9.4.9.3. Changing the training procedure: class_weight#

If you look for example, in the documentation for the SVM classifier, or Logistic Regression we see class_weight as a parameter.


How can this help us work with class imbalances?

The default class_weight is 1 for all classes, which means that all classes are equally important. By setting the class weight to another value, we can say that errors on one class are more important than errors on another class; when the final error score is computed, that class’s errors will have more weight and contribute more to the total.

Let’s see an example.

First, let’s build a model where we keep the class_weights as the default.

lr_default= LogisticRegression(random_state=12, max_iter=1000)
lr_default.fit(X_train,y_train);
ConfusionMatrixDisplay.from_estimator(
    lr_default, X_valid, y_valid,
    display_labels=["Non fraud", "Fraud"],
);
../_images/lecture9_76_0.png

Now let’s rebuild our pipeline, this time using the class_weight argument and setting it as class_weight={1:100}. This is roughly equivalent to saying “repeat every positive example 100x in the training set”, but repeating data would slow down the code, whereas this doesn’t since it just weights the errors on class 1 100x more than the errors on class 0. In the context of our data, we are saying that a false negative is 100x more problematic than a false positive.

lr_100 = LogisticRegression(random_state=12, max_iter=1000, class_weight={1:100})
lr_100.fit(X_train,y_train);
ConfusionMatrixDisplay.from_estimator(
    lr_100,
    X_valid,
    y_valid,
    display_labels=["Non fraud", "Fraud"],
);
../_images/lecture9_79_0.png

Notice that we now have reduced false negatives and predicted more true positives this time. But, as a consequence, we pay a price since now we are also increasing false positives.

We can also set class_weight="balanced". This sets the weights automatically so that the classes are “equal”, by adjusting the weights to be inversely proportional to the class frequencies in the input data. So if there is 10x less of one class in the data, its errors will be weighted 10x more in the computation of the final score.
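If you are curious what weights "balanced" actually produces for our data, sklearn exposes the computation directly (a small sketch using compute_class_weight; the exact numbers depend on the class counts in y_train):

import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Each class gets weight n_samples / (n_classes * class_count),
# so the rare "Fraud" class receives a much larger weight
compute_class_weight("balanced", classes=np.unique(y_train), y=y_train)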

lr_balanced = LogisticRegression(random_state=12, max_iter=1000, class_weight="balanced")
lr_balanced.fit(X_train,y_train);
ConfusionMatrixDisplay.from_estimator(
    lr_balanced,
    X_valid,
    y_valid,
    display_labels=["Non fraud", "Fraud"],
);
../_images/lecture9_82_0.png

Again, we have reduced the number of false negatives and increased the number of true positives but we have many more false positives now! Overall, we can say that our weight adjustments are making the model more likely to make the prediction “fraud” on a sample.

9.4.9.4. Are we doing better with class_weight="balanced"?#

Let’s compare some metrics and find out.

lr_default.score(X_valid, y_valid)
0.9988305005457664
lr_balanced.score(X_valid, y_valid)
0.9676048651177296

Changing the class weight will generally reduce accuracy. The original model was trying to maximize accuracy. Now you’re telling it to do something different. But we know now that accuracy isn’t the only metric that matters. Let’s explain why this happens.

Since there are so many more negative examples than positive ones, false positives affect accuracy much more than false negatives, so precision matters a lot more than recall in the accuracy calculation. The default model therefore trades off a lot of recall for a bit of precision, whereas with the balanced class weights we are paying a “fee” in precision (and accuracy) for a greater recall value.
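To make the trade-off concrete, we can compare the two models directly (a sketch, reusing the lr_default and lr_balanced models fitted above):

from sklearn.metrics import precision_score, recall_score

# Precision drops and recall rises when we move to balanced class weights
for name, model in [("default", lr_default), ("balanced", lr_balanced)]:
    preds = model.predict(X_valid)
    print(name,
          "precision:", precision_score(y_valid, preds).round(3),
          "recall:", recall_score(y_valid, preds).round(3))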

9.5. Let’s Practice#

(confusion matrix diagram with the classes “Guard” and “Forward”)

Use the diagram above to answer the questions.

1. How many examples did the model of this matrix correctly label as “Guard”?
2. If Forward is the positive label, how many false-positive values are there?
3. How many examples does the model incorrectly predict?
4. What is the recall of the confusion matrix assuming that Forward is the positive label?
5. What is the precision of the confusion matrix assuming that Forward is the positive label?
6. What is the f1 score assuming that Forward is the positive label?

True or False:

7. In spam classification, false positives are often more damaging than false negatives (assume “positive” means the email is spam, “negative” means it’s not).
8. In medical diagnosis, high recall is often more important than high precision.
9. The weighted average in the classification report gives equal importance to all classes.
10. Setting class_weight={1:100} will make each example of the second class label be counted 100 times.

9.6. Regression Metrics#

For this part, since we need to use data that corresponds to a regression problem, we are bringing back our California housing dataset.

We want to predict the median house value for different locations.

housing_df = pd.read_csv("data/housing.csv")
train_df, test_df = train_test_split(housing_df, test_size=0.1, random_state=123)


train_df = train_df.assign(rooms_per_household = train_df["total_rooms"]/train_df["households"],
                           bedrooms_per_household = train_df["total_bedrooms"]/train_df["households"],
                           population_per_household = train_df["population"]/train_df["households"])
                        
test_df = test_df.assign(rooms_per_household = test_df["total_rooms"]/test_df["households"],
                         bedrooms_per_household = test_df["total_bedrooms"]/test_df["households"],
                         population_per_household = test_df["population"]/test_df["households"])
                         
train_df = train_df.drop(columns=['total_rooms', 'total_bedrooms', 'population'])  
test_df = test_df.drop(columns=['total_rooms', 'total_bedrooms', 'population']) 
X_train = train_df.drop(columns=["median_house_value"])
y_train = train_df["median_house_value"]
X_test = test_df.drop(columns=["median_house_value"])
y_test = test_df["median_house_value"]

numeric_features = [ "longitude", "latitude",
                     "housing_median_age",
                     "households", "median_income",
                     "rooms_per_household",
                     "bedrooms_per_household",
                     "population_per_household"]
                     
categorical_features = ["ocean_proximity"]

X_train.head()
longitude latitude housing_median_age households median_income ocean_proximity rooms_per_household bedrooms_per_household population_per_household
6051 -117.75 34.04 22.0 602.0 3.1250 INLAND 4.897010 1.056478 4.318937
20113 -119.57 37.94 17.0 20.0 3.4861 INLAND 17.300000 6.500000 2.550000
14289 -117.13 32.74 46.0 708.0 2.6604 NEAR OCEAN 4.738701 1.084746 2.057910
13665 -117.31 34.02 18.0 285.0 5.2139 INLAND 5.733333 0.961404 3.154386
14471 -117.23 32.88 18.0 1458.0 1.8580 NEAR OCEAN 3.817558 1.004801 4.323045

We are going to bring in our previous pipelines and fit our model.

from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import make_column_transformer
from sklearn.neighbors import KNeighborsRegressor

numeric_transformer = Pipeline(
    steps=[("imputer", SimpleImputer(strategy="median")), 
           ("scaler", StandardScaler())]
)

categorical_transformer = Pipeline(
    steps=[("imputer", SimpleImputer(strategy="constant", fill_value="missing")),
           ("onehot", OneHotEncoder(handle_unknown="ignore"))]
)

preprocessor = make_column_transformer(
(numeric_transformer, numeric_features),
        (categorical_transformer, categorical_features), 
    remainder='passthrough')

pipe = make_pipeline(preprocessor, KNeighborsRegressor())
pipe.fit(X_train, y_train);

As you know, since we aren’t doing classification anymore, we can’t just check for equality between the predictions and the true values.

predicted_y = pipe.predict(X_train) 
predicted_y
array([111740., 117380., 187700., ..., 271420., 265180.,  60860.])
y_train.values
array([113600., 137500., 170100., ..., 286200., 412500.,  59300.])
(predicted_y == y_train).mean()  # "Accuracy"
0.01232773471145564

The predicted values will rarely be exactly the same as the real ones. Instead, we need a score that reflects how right/wrong each prediction is or how close we are to the actual numeric value.

We are going to discuss 4 different ones lightly but, if you want to see more regression metrics in detail, you can refer to the sklearn documentation.

9.6.1. Mean squared error (MSE)#

Mean squared error is a common measure of error. It is the same as taking the residual sum of squares (RSS, which we saw in linear regression) and dividing it by the number of samples to get an average error per sample.

\[MSE = \frac{1}{\text{total samples}} \displaystyle\sum_{i=1}^{\text{total samples}} (\text{true}_i - {\text{predicted}_i})^2\]
\[MSE = \frac{1}{n} \displaystyle\sum_{i=1}^{n} (y_i - {f(x_i)})^2\]

We calculate this by taking the difference between the predicted and actual value for every example, squaring it, summing these values over all examples, and dividing by the number of examples. The higher the MSE, the worse the model performs.

((y_train - predicted_y)**2).mean()
2570054492.048064

Perfect predictions would have MSE = 0 (no error in any predictions).

We can use mean_squared_error from sklearn.metrics again instead of calculating this ourselves.

from sklearn.metrics import mean_squared_error 
mean_squared_error(y_train, predicted_y)
2570054492.048064

9.6.1.1. The disadvantages#

If we look at the MSE value, it’s huge. Having a mean squared error of 2.5 billion certainly sounds like a lot, but is it bad? How do we know how big a “good” error is?

Unlike classification, in regression our target has units. In this case, our target column is the median house value, which is in dollars. That means that the mean squared error is in dollars\(^2\). It is a benefit that our error has units, however these particular units are not that helpful (what is a squared dollar?). Having problem-specific units can also make it hard to compare between models and to develop an intuition for what is a good value, since the score depends on the scale of the targets. If we were working in cents instead of dollars, our MSE would be 10,000 times (\(100^2\)) higher!
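We can verify this scale-dependence directly (a small sketch, reusing y_train and predicted_y): converting the targets from dollars to cents multiplies the MSE by \(100^2\).

mse_dollars = mean_squared_error(y_train, predicted_y)
mse_cents = mean_squared_error(y_train * 100, predicted_y * 100)
mse_cents / mse_dollars  # 10,000 = 100**2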

9.6.2. Root mean squared error (RMSE)#

The MSE we had before was in \(dollars^2\), so an intuitive way to make this more interpretable would be to take the square root of the value to get the units in dollars. This is a more relatable metric and it is called the root mean squared error, or RMSE.

This is the square root of \(MSE\).

\[RMSE = \sqrt{MSE}\]
\[RMSE = \sqrt{\frac{1}{\text{total samples}} \displaystyle\sum_{i=1}^{\text{total samples}} (\text{true}_i - {\text{predicted}_i})^2}\]
\[RMSE = \sqrt{\frac{1}{n} \displaystyle\sum_{i=1}^{n} (y_i - {f(x_i)})^2}\]
mean_squared_error(y_train, predicted_y)
2570054492.048064
import numpy as np

np.sqrt(mean_squared_error(y_train, predicted_y))
50695.704867849156
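Depending on your sklearn version you can also ask for the RMSE directly instead of taking the square root yourself (a sketch; squared=False works in older releases, while sklearn >= 1.4 provides a dedicated root_mean_squared_error function instead):

# Same value as np.sqrt(mean_squared_error(...)); on newer sklearn versions use
# from sklearn.metrics import root_mean_squared_error
mean_squared_error(y_train, predicted_y, squared=False)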

This now has the units in dollars. Instead of 2 billion dollars squared, our error measurement is around $50,000. This is interpretable for a single prediction, but how would it work to report an RMSE for an entire dataset?

Let’s plot the predicted vs the true housing prices here.

df = pd.DataFrame(y_train).assign(predicted = predicted_y).rename(columns = {'median_house_value': 'true'})
plt.scatter(y_train, predicted_y, alpha=0.3, s = 5)
grid = np.linspace(y_train.min(), y_train.max(), 1000)
plt.plot(grid, grid, '--k');
plt.xticks(fontsize= 12);
plt.yticks(fontsize= 12);
plt.xlabel("true price", fontsize=14);
plt.ylabel("predicted price", fontsize=14);
../_images/lecture9_112_0.png

When we plot our predictions versus the examples’ actual value, we can see cases where our prediction is way off. Points under the line \(y=x\) means we’re under-predicting price, points over the line means we’re over-predicting price.

Question: Is an RMSE of $30,000 acceptable?

  • For a house worth $600k, it seems reasonable! That’s a 5% error.

  • For a house worth $60k, that is terrible. It’s a 50% error.

RMSE is in absolute units and does not account for the original value of the prediction. So how can we adjust to this?

…Enter MAPE!

9.6.3. Mean Absolute Percent Error (MAPE)#

Instead of computing the absolute error, we can calculate a percentage error for each example. Now the errors are both positive (predict too high) and negative (predict too low).

percent_errors = (predicted_y - y_train)/y_train * 100.
percent_errors.head()
6051     -1.637324
20113   -14.632727
14289    10.346855
13665     6.713070
14471   -10.965854
Name: median_house_value, dtype: float64

We can look at the absolute percent error which now shows us how far off we were independent of direction.

np.abs(percent_errors).head()
6051      1.637324
20113    14.632727
14289    10.346855
13665     6.713070
14471    10.965854
Name: median_house_value, dtype: float64

And like MSE, we can take the average over all the examples.

np.abs(percent_errors).mean()
18.192997502985218

This is called Mean Absolute Percent Error (MAPE). The value is quite interpretable. We can see that on average, we have around 18% error in our predicted median housing valuation.
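sklearn also provides this metric (a small sketch; note that mean_absolute_percentage_error returns a fraction such as 0.18 rather than a percentage such as 18):

from sklearn.metrics import mean_absolute_percentage_error

# Multiply by 100 to match the percentage we computed by hand above
mean_absolute_percentage_error(y_train, predicted_y) * 100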

However, it is worth pointing out that MAPE also has drawbacks, most notably that it doesn’t work well with non-positive values and that it is biased towards low forecasts, which makes it unsuitable for predictive models where large errors are expected (for more details, see https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8279135/). MAPE is still commonly used because of its ease of interpretation, and there are variations of MAPE that address these shortcomings, most notably symmetric MAPE (SMAPE).

9.6.4. \(R^2\) (R squared)#

We’ve seen this before! This is the score that sklearn uses by default when you call .score() so we’ve already seen \(R^2\) in our regression problems. You can read about it here but we are going to just give you the quick notes.

Intuition: \(R^2\) is one minus the residual sum of squares divided by the total sum of squares, \(R^2 = 1 - \frac{RSS}{TSS}\). In other words, it is the proportion of the variation in the target feature that the model is able to explain using the variation in the input features.

  • 1 = Perfect score, all the variation in the target variable can be explained by the model applied to the input features.

  • 0 = None of the variation in the target variable can be explained by the model applied to the input features. There is no predictive value in the model, as we would achieve the same result by constantly predicting the mean of the data.

  • < 0 = The model is performing worse than constantly predicting the mean of the data.

We can use the default scoring from .score() or we can calculate \(R^2\) using r2_score from sklearn.metrics

from sklearn.metrics import r2_score
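To connect the score to the intuition above, here is a minimal by-hand check (a sketch, reusing y_train and predicted_y from earlier), computing \(1 - \frac{RSS}{TSS}\) and comparing it to r2_score:

rss = ((y_train - predicted_y) ** 2).sum()     # residual sum of squares
tss = ((y_train - y_train.mean()) ** 2).sum()  # total sum of squares
print(1 - rss / tss)
print(r2_score(y_train, predicted_y))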

\(R^2\) is a great default to use for reporting the performance of regression models, and if you need something that is easier to interpret (such as a percentage) or an error with units, you can opt for one of the other metrics above (there are more notes on \(R^2\) versus MAPE in the PubMed article linked in the MAPE section).

Note that we can swap the order of the two arguments in the calculation of MSE, but not in \(R^2\).

print(mean_squared_error(y_train, predicted_y))
print(mean_squared_error(predicted_y, y_train))
2570054492.048064
2570054492.048064
print(r2_score(y_train, predicted_y))
print(r2_score(predicted_y, y_train))
0.8059396097446094
0.742915970464153

9.7. Let’s Practice#

1. Which measurement will have units which are the square values of the target column units?
2. For which of the following is it possible to have negative values?
3. Which measurement is expressed as a percentage?
4. Calculate the MSE from the values given below.

| Observation | True Value | Predicted Value |
|-------------|------------|-----------------|
| 0           | 4          | 5               |
| 1           | 12         | 10              |
| 2           | 6          | 9               |
| 3           | 9          | 8               |
| 4           | 3          | 3               |

True or False:

5. We can still use recall and precision for regression problems but now we have other measurements we can use as well.
6. A lower RMSE value indicates a better model.
7. In regression problems, calculating \(R^2\) using r2_score() and .score() (with default values) will produce the same results.

9.8. Passing Different Scoring Methods#

We now know about all these metrics; how do we implement them? We are lucky because it’s relatively easy and can be applied to both classification and regression problems.

Let’s start with regression and our regression measurements. This means bringing back our California housing dataset.

X_train.head()
longitude latitude housing_median_age households median_income ocean_proximity rooms_per_household bedrooms_per_household population_per_household
6051 -117.75 34.04 22.0 602.0 3.1250 INLAND 4.897010 1.056478 4.318937
20113 -119.57 37.94 17.0 20.0 3.4861 INLAND 17.300000 6.500000 2.550000
14289 -117.13 32.74 46.0 708.0 2.6604 NEAR OCEAN 4.738701 1.084746 2.057910
13665 -117.31 34.02 18.0 285.0 5.2139 INLAND 5.733333 0.961404 3.154386
14471 -117.23 32.88 18.0 1458.0 1.8580 NEAR OCEAN 3.817558 1.004801 4.323045

And our pipelines.

This time we are using \(k\)-nn.

numeric_transformer = Pipeline(
    steps=[("imputer", SimpleImputer(strategy="median")), 
           ("scaler", StandardScaler())]
)

categorical_transformer = Pipeline(
    steps=[("imputer", SimpleImputer(strategy="constant", fill_value="missing")),
           ("onehot", OneHotEncoder(handle_unknown="ignore"))]
)

preprocessor = make_column_transformer(
(numeric_transformer, numeric_features),
        (categorical_transformer, categorical_features), 
    remainder='passthrough')

pipe_regression = make_pipeline(preprocessor, KNeighborsRegressor())

9.8.1. Cross-validation#

Normally after building our pipelines, we would now either do cross-validation or hyperparameter tuning but let’s start with the cross_validate() function.

All the possible metrics that the scoring argument of cross_validate() accepts are available here in the sklearn documentation. Directly from the docs:

All scorer objects follow the convention that higher return values are better than lower return values. Thus metrics which measure the distance between the model and the data, like metrics.mean_squared_error, are available as neg_mean_squared_error which return the negated value of the metric.

So if we wanted the RMSE measure, we would specify neg_root_mean_squared_error and the negated value of the metric will be returned in our dataframe.
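If you don’t want to dig through the documentation, sklearn can also list the valid scoring strings for you (a sketch; get_scorer_names is available in sklearn >= 1.0):

from sklearn.metrics import get_scorer_names

# A few of the negated error metrics available as scoring strings
[name for name in get_scorer_names() if name.startswith("neg_")][:5]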

pd.DataFrame(
    cross_validate(
        pipe_regression,
        X_train,
        y_train, 
    )
)
fit_time score_time test_score
0 0.037876 0.223662 0.695818
1 0.028238 0.203475 0.707483
2 0.029148 0.211091 0.713788
3 0.028359 0.218618 0.686938
4 0.028985 0.177136 0.724608
pd.DataFrame(
    cross_validate(
        pipe_regression,
        X_train,
        y_train, 
        scoring = 'neg_root_mean_squared_error'
    )
)
fit_time score_time test_score
0 0.037514 0.225316 -62462.584290
1 0.028159 0.202306 -63437.715015
2 0.027739 0.209377 -62613.202523
3 0.028325 0.214590 -64204.295214
4 0.027873 0.177630 -59217.838633

Now our cross-validation returns negated RMSE values (in dollars) instead of the default \(R^2\) scores.

We can also return multiple scoring measures together by making a dictionary and then specifying the dictionary in the scoring argument.

scoring={
    "neg_mse": "neg_mean_squared_error",    
    "neg_rmse": "neg_root_mean_squared_error",    
    "mape_score": 'neg_mean_absolute_percentage_error',
    "r2": "r2",
}
pd.DataFrame(
    cross_validate(
        pipe_regression,
        X_train,
        y_train,
        scoring=scoring
    )
)
fit_time score_time test_neg_mse test_neg_rmse test_mape_score test_r2
0 0.036642 0.220047 -3.901574e+09 -62462.584290 -0.227097 0.695818
1 0.027709 0.204257 -4.024344e+09 -63437.715015 -0.227546 0.707483
2 0.028423 0.214962 -3.920413e+09 -62613.202523 -0.222369 0.713788
3 0.027470 0.213464 -4.122192e+09 -64204.295214 -0.230167 0.686938
4 0.027981 0.179939 -3.506752e+09 -59217.838633 -0.210335 0.724608

If we set return_train_score=True we would return a validation and training score for each measurement!

pd.DataFrame(
    cross_validate(
        pipe_regression,
        X_train,
        y_train,
        scoring=scoring,
        return_train_score=True
    )
)
fit_time score_time test_neg_mse train_neg_mse test_neg_rmse train_neg_rmse test_mape_score train_mape_score test_r2 train_r2
0 0.030596 0.213980 -3.901574e+09 -2.646129e+09 -62462.584290 -51440.540539 -0.227097 -0.184210 0.695818 0.801659
1 0.028104 0.199698 -4.024344e+09 -2.627996e+09 -63437.715015 -51263.979666 -0.227546 -0.184691 0.707483 0.799575
2 0.027226 0.204043 -3.920413e+09 -2.678975e+09 -62613.202523 -51758.817852 -0.222369 -0.186750 0.713788 0.795944
3 0.027180 0.206620 -4.122192e+09 -2.636180e+09 -64204.295214 -51343.743586 -0.230167 -0.185108 0.686938 0.801232
4 0.027144 0.173949 -3.506752e+09 -2.239671e+09 -59217.838633 -47325.157312 -0.210335 -0.169510 0.724608 0.832498

9.8.2. What about hyperparameter tuning?#

We can do exactly the same thing we saw above with cross_validate() but instead with GridSearchCV and RandomizedSearchCV.

from sklearn.model_selection import GridSearchCV


param_grid = {"kneighborsregressor__n_neighbors": [2, 5, 50, 100]}

grid_search = GridSearchCV(
    pipe_regression,
    param_grid,
    cv=5, 
    return_train_score=True,
    n_jobs=-1, 
    scoring='neg_mean_absolute_percentage_error'
);
grid_search.fit(X_train, y_train);
grid_search.best_params_
{'kneighborsregressor__n_neighbors': 5}
grid_search.best_score_
-0.2235027119616972

If we used another scoring metric, we might end up with a different result for the best hyperparameter.

# 'max_error' is a metric we haven't talked about and it is not that useful,
# I just use it here to show that the choice of metric can influence the returned best hyperparameters.

grid_search = GridSearchCV(
    pipe_regression,
    param_grid,
    cv=5, 
    return_train_score=True,
    n_jobs=-1, 
    scoring='max_error'
);
grid_search.fit(X_train, y_train);
grid_search.best_params_
{'kneighborsregressor__n_neighbors': 100}
grid_search.best_score_
-373468.55

9.8.3. … and with Classification?#

Let’s bring back our credit card data set and build our pipeline.

train_df, test_df = train_test_split(cc_df, test_size=0.3, random_state=111)

X_train, y_train = train_df.drop(columns=["Class"]), train_df["Class"]
X_test, y_test = test_df.drop(columns=["Class"]), test_df["Class"]

We can use class_weight='balanced' in our classifier…

from sklearn.tree import DecisionTreeClassifier


dt_model = DecisionTreeClassifier(random_state=123, class_weight='balanced')
import scipy

param_grid = {"max_depth": scipy.stats.randint(low=1, high=100)}

… and tune our model for the thing we care about.

In this case, we are specifying the f1 score.

from sklearn.model_selection import RandomizedSearchCV
grid_search = RandomizedSearchCV(
    dt_model,
    param_grid,
    cv=3,
    return_train_score=True,
    verbose=2,
    n_jobs=-1,
    n_iter = 6,
    scoring='f1',
    random_state=2080
)
grid_search.fit(X_train, y_train);
Fitting 3 folds for each of 6 candidates, totalling 18 fits
[CV] END .......................................max_depth=69; total time=   0.0s
[CV] END .......................................max_depth=69; total time=   0.0s
[CV] END .......................................max_depth=69; total time=   0.0s
[CV] END .......................................max_depth=12; total time=   0.0s
[CV] END .......................................max_depth=12; total time=   0.0s
[CV] END .......................................max_depth=12; total time=   0.0s
[CV] END .......................................max_depth=65; total time=   0.0s
[CV] END .......................................max_depth=65; total time=   0.0s
[CV] END .......................................max_depth=43; total time=   0.0s
[CV] END .......................................max_depth=43; total time=   0.0s
[CV] END .......................................max_depth=43; total time=   0.0s
[CV] END .......................................max_depth=65; total time=   0.0s
[CV] END ........................................max_depth=4; total time=   0.0s
[CV] END ........................................max_depth=4; total time=   0.0s
[CV] END ........................................max_depth=4; total time=   0.0s
[CV] END .......................................max_depth=62; total time=   0.0s
[CV] END .......................................max_depth=62; total time=   0.0s
[CV] END .......................................max_depth=62; total time=   0.0s
grid_search.best_params_
{'max_depth': 69}

This returns the max_depth value that results in the highest f1 score, not the max_depth with the highest accuracy.

# Validation performance
grid_search.best_score_
0.7877624963028689
# Test performance
grid_search.score(X_test, y_test)
0.7698113207547169

Let’s look at our recall score to compare to the next section.

recall_score(y_test, grid_search.predict(X_test))
0.7338129496402878
ConfusionMatrixDisplay.from_estimator(grid_search, X_test, y_test);
../_images/lecture9_161_0.png

If we now tune hyperparameters based on recall instead, you will see that we select a different value for max_depth and that our recall score is higher with this value.

grid_search = RandomizedSearchCV(
    dt_model,
    param_grid,
    cv=3,
    return_train_score=True,
    verbose=2,
    n_jobs=-1,
    n_iter = 6,
    scoring='recall',
    random_state=2080
)
grid_search.fit(X_train, y_train);
Fitting 3 folds for each of 6 candidates, totalling 18 fits
[CV] END .......................................max_depth=69; total time=   2.3s
[CV] END .......................................max_depth=12; total time=   2.4s
[CV] END .......................................max_depth=65; total time=   2.5s
[CV] END .......................................max_depth=12; total time=   2.5s
[CV] END .......................................max_depth=69; total time=   2.7s
[CV] END .......................................max_depth=69; total time=   2.7s
[CV] END .......................................max_depth=12; total time=   2.7s
[CV] END .......................................max_depth=65; total time=   2.7s
[CV] END ........................................max_depth=4; total time=   1.6s
[CV] END ........................................max_depth=4; total time=   1.6s
[CV] END ........................................max_depth=4; total time=   1.8s
[CV] END .......................................max_depth=65; total time=   2.5s
[CV] END .......................................max_depth=43; total time=   2.5s
[CV] END .......................................max_depth=62; total time=   2.1s
[CV] END .......................................max_depth=43; total time=   2.5s
[CV] END .......................................max_depth=43; total time=   2.5s
[CV] END .......................................max_depth=62; total time=   2.0s
[CV] END .......................................max_depth=62; total time=   1.8s
grid_search.best_params_
{'max_depth': 4}

This returns the max_depth value that results in the highest recall, not the max_depth with the highest accuracy.

# Validation performance
grid_search.best_score_
0.8839152059491043
# Test performance
grid_search.score(X_test, y_test)
0.841726618705036

As you can see above, our recall score is now higher (remember that the default scoring method of the search object changes to the metric used during hyperparameter optimization). If we look at our f1 score, we can see that it is worse than before, as expected: when we optimize on recall alone, we only try to catch as many of the true positives as possible and don’t care that we are incorrectly classifying many negatives as positives, which leads to lower precision and a lower f1 score.

f1_score(y_test, grid_search.predict(X_test))
0.1984732824427481

In the confusion matrix, we have many more values in the top right quadrant because there is no penalty for incorrectly classifying observations here when just using recall.

ConfusionMatrixDisplay.from_estimator(grid_search, X_test, y_test);
[Figure: confusion matrix for the recall-optimized model on the test set]
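
If you prefer raw counts to the plot, confusion_matrix from sklearn.metrics returns the same information as an array; a minimal sketch:

from sklearn.metrics import confusion_matrix

# Rows are true classes, columns are predicted classes, so the
# top-right entry counts the false positives discussed above.
confusion_matrix(y_test, grid_search.predict(X_test))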

9.9. Let’s Practice#

True or False:

  1. We are limited to the scoring measures offered by sklearn.

  2. If we specify the scoring method in GridSearchCV and RandomizedSearchCV, best_params_ will return the parameters with the best score on the specified measure.

9.10. Let’s Practice - Coding#

Let’s bring back the Pokémon dataset that we saw previously.

This time let's try to predict whether or not a Pokémon has legendary status based on its other attributes.

pk_df = pd.read_csv('data/pokemon.csv')

train_df, test_df = train_test_split(pk_df, test_size=0.3, random_state=1)

X_train_big = train_df.drop(columns=['legendary'])
y_train_big = train_df['legendary']
X_test = test_df.drop(columns=['legendary'])
y_test = test_df['legendary']

X_train, X_valid, y_train, y_valid = train_test_split(
    X_train_big, 
    y_train_big, 
    test_size=0.3, 
    random_state=123
)

print(y_train.value_counts())
X_train
legendary
0    359
1     33
Name: count, dtype: int64
name deck_no attack defense sp_attack sp_defense speed capture_rt total_bs type gen
124 Electabuzz 125 83 57 95 85 105 45 490 electric 1
11 Butterfree 12 45 50 90 80 70 45 395 bug 1
77 Rapidash 78 100 70 80 80 105 60 500 fire 1
405 Budew 406 30 35 50 70 55 255 280 grass 4
799 Necrozma 800 107 101 127 89 79 3 600 psychic 7
... ... ... ... ... ... ... ... ... ... ... ...
33 Nidoking 34 102 77 85 75 85 45 505 poison 1
458 Snover 459 62 50 62 60 40 120 334 grass 4
234 Smeargle 235 20 35 20 45 75 45 250 normal 2
287 Vigoroth 288 80 80 55 55 90 120 440 normal 3
561 Yamask 562 30 85 55 65 30 190 303 ghost 5

392 rows × 11 columns

# create a numeric transformer
num_pipe = make_pipeline(
    SimpleImputer(strategy='median'),
    StandardScaler()
)

num_cols = ['attack','defense','speed']

# create a categorical transformer
cat_pipe = make_pipeline(
    SimpleImputer(strategy='constant'),
    OneHotEncoder(handle_unknown='ignore')
)

cat_cols = ['type']

# make column transformer
preprocessor = make_column_transformer(
    (num_pipe, num_cols),
    (cat_pipe, cat_cols),
    remainder='drop'
)

# the final pipeline
from sklearn.svm import SVC
main_pipe = make_pipeline(
    preprocessor,
    SVC()
)

main_pipe
Pipeline(steps=[('columntransformer',
                 ColumnTransformer(transformers=[('pipeline-1',
                                                  Pipeline(steps=[('simpleimputer',
                                                                   SimpleImputer(strategy='median')),
                                                                  ('standardscaler',
                                                                   StandardScaler())]),
                                                  ['attack', 'defense',
                                                   'speed']),
                                                 ('pipeline-2',
                                                  Pipeline(steps=[('simpleimputer',
                                                                   SimpleImputer(strategy='constant')),
                                                                  ('onehotencoder',
                                                                   OneHotEncoder(handle_unknown='ignore'))]),
                                                  ['type'])])),
                ('svc', SVC())])

Let's do cross-validation and look at not just the accuracy score, but precision, recall, and the f1 score as well.

  1. Build a pipeline containing the column transformer and an SVC model and set class_weight="balanced" in the SVM classifier.

  2. Perform cross-validation using cross_validate on the training split with the scoring measures accuracy, precision, recall and f1. Save the results in a dataframe.

Solutions

1.

from sklearn.svm import SVC


num_pipe = make_pipeline(
    SimpleImputer(),
    StandardScaler()
)

cat_pipe = make_pipeline(
    SimpleImputer(strategy='constant'),
    OneHotEncoder(handle_unknown='ignore')
)

# X_train.select_dtypes('number').columns would select every numeric column;
# here we keep only a few of them
num_cols = [
    'capture_rt',
    'total_bs',
    'gen'
]
cat_cols = ['type']

preprocessing = make_column_transformer(
    (num_pipe, num_cols),
    (cat_pipe, cat_cols)
)

main_pipe = make_pipeline(
    preprocessing,
    SVC(class_weight='balanced')
)
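
As a side note on what class_weight='balanced' does here: sklearn weights each class by n_samples / (n_classes * class count). A minimal sketch using the class counts printed earlier (359 non-legendary vs. 33 legendary):

import numpy as np

# class_weight='balanced' assigns each class the weight
# n_samples / (n_classes * count), so the rare legendary class
# gets a much larger weight than the majority class.
counts = np.array([359, 33])
weights = counts.sum() / (2 * counts)
print(weights)  # approximately [0.55, 5.94]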

2.

pd.DataFrame(
    cross_validate(
        main_pipe,
        X_valid,
        y_valid,
        scoring=['accuracy', 'precision', 'recall', 'f1']
    )
)
fit_time score_time test_accuracy test_precision test_recall test_f1
0 0.008283 0.006674 0.941176 0.666667 0.666667 0.666667
1 0.006897 0.006086 0.941176 0.666667 0.666667 0.666667
2 0.006335 0.005881 0.911765 0.500000 0.666667 0.571429
3 0.006494 0.006122 0.939394 0.500000 0.500000 0.500000
4 0.006614 0.005894 0.909091 0.500000 0.333333 0.400000

9.11. What We’ve Learned Today#

  • The components of a confusion matrix.

  • How to calculate precision, recall, and f1-score.

  • How to implement the class_weight argument.

  • Some of the different scoring metrics used in assessing regression problems: MSE, RMSE, \(R^2\), and MAPE.

  • How to apply different scoring functions with cross_validate, GridSearchCV and RandomizedSearchCV (see the short recap sketch below).
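
As a compact recap of that last point, here is a minimal, self-contained sketch on synthetic data (the dataset and parameter values are made up for illustration, not taken from the lecture):

from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, cross_validate
from sklearn.tree import DecisionTreeClassifier

# A small imbalanced toy problem (roughly 90% negative, 10% positive).
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=123)

# Several metrics at once with cross_validate.
scores = cross_validate(
    DecisionTreeClassifier(random_state=123),
    X, y,
    scoring=['accuracy', 'precision', 'recall', 'f1']
)

# A single, non-default metric with GridSearchCV; best_params_ then refers
# to the candidate with the best cross-validated recall.
search = GridSearchCV(
    DecisionTreeClassifier(random_state=123),
    param_grid={'max_depth': [3, 5, 10]},
    scoring='recall'
)
search.fit(X, y)
search.best_params_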