2. Splitting and Cross-validation

2.1. Lecture Learning Objectives

  • Explain the concept of generalization.

  • Split a dataset into train and test sets using the train_test_split function.

  • Explain the difference between train, validation, test, and “deployment” data.

  • Identify the difference between training error, validation error, and test error.

  • Explain cross-validation and use cross_val_score() and cross_validate() to calculate cross-validation error.

  • Explain overfitting, underfitting, and the fundamental tradeoff.

  • State the golden rule and identify the scenarios when it’s violated.

2.2. Five Minute Recap / Lightning Questions

  • What is an example of machine learning?

  • Which type of machine learning does not have labels?

  • What is an example of a Regression problem?

  • In a dataframe, what is an observation?

  • What is the first node of a Decision Tree called?

  • Where/who determines the parameter values?

  • What library will we be using for machine learning?

Let’s start by creating a decision tree model as we did in lecture 1.

from sklearn.datasets import make_blobs
from sklearn.tree import DecisionTreeClassifier
from plot_classifier import plot_classifier
import matplotlib.pyplot as plt


X, y = make_blobs(200, centers=2, cluster_std=2, random_state=5)
plt.scatter(X[:, 0], X[:, 1], c=y)
dt = DecisionTreeClassifier()
dt.fit(X, y)
plot_classifier(X, y, dt, show_data=False)
<AxesSubplot:>
../_images/lecture2_5_1.png ../_images/lecture2_5_2.png
from sklearn.tree import DecisionTreeClassifier
from plot_classifier import plot_classifier

dt = DecisionTreeClassifier()
dt.fit(X, y)
plot_classifier(X, y, dt)
<AxesSubplot:>
../_images/lecture2_6_2.png
import pandas as pd


voting_df = pd.read_csv(
    'data/cities_USA.csv',
    index_col=0
).query(
    'lon > -140'  # Remove alaska for easier plotting
)
voting_df
lon lat vote
1 -80.162475 25.692104 blue
2 -80.214360 25.944083 blue
3 -80.094133 26.234314 blue
4 -80.248086 26.291902 blue
5 -81.789963 26.348035 blue
... ... ... ...
393 -101.115061 47.640333 red
394 -119.036728 47.926446 blue
395 -102.142732 47.955970 red
396 -97.460476 48.225094 red
397 -96.551116 48.591592 blue

397 rows × 3 columns

from sklearn.tree import DecisionTreeClassifier


# feature table
X = voting_df.drop(columns='vote')
# the target variable
y = voting_df['vote']

# initialize the model with a decision tree of max_depth 2
depth = 2
model = DecisionTreeClassifier(max_depth=depth)
# training the model
model.fit(X, y)

# Score the model's overall accuracy on the training data
model.score(X, y)
0.8261964735516373
# Predict on new data
model.predict(pd.DataFrame({'lon': [-75], 'lat': [56]}))
array(['blue'], dtype=object)

2.2.1. Some lingering questions

  1. Is model performance on the training data a good indication of how it will perform on new data?

  2. Can we just use a large max_depth for every supervised learning problem and get super high accuracy?

  3. If not, how do we choose a value of max_depth (or other hyperparameters)?

We will be answering these questions in this lecture.

2.3. Generalization

2.3.1. Visualizing model complexity using decision boundaries

In the last lecture, we saw that we could visualize the splitting of decision trees using either a tree diagram or a plot of decision boundaries.

from sklearn.tree import plot_tree
import matplotlib.pyplot as plt


plot_tree(
    model,
    feature_names=X.columns,
    class_names=y.unique(),
    impurity=False,
    ax=plt.subplots(figsize=(12, 7))[1]
);
../_images/lecture2_12_0.png

Plotting the decision boundaries, we see the two levels of splits specified by our max_depth=2 hyperparameter.

from plot_classifier import plot_classifier


plot_classifier(X, y, model, ticks=True)
plt.xlabel('lon', fontsize=14)
plt.ylabel('lat', fontsize=14)
plt.title("Decision tree with depth = %d" % (depth), fontsize=18);
../_images/lecture2_14_1.png

Ok, now let’s see what happens to our score and decision boundaries when we change our maximum tree depth.

depth = 3
model = DecisionTreeClassifier(max_depth=depth)
model.fit(X, y)
model.score(X, y)
0.8740554156171285
plot_tree(
    model,
    feature_names=X.columns,
    class_names=y.unique(),
    impurity=False,
    ax=plt.subplots(figsize=(18, 8))[1]
);
../_images/lecture2_17_0.png

The decision boundaries are now created by three levels of splits (for each observation we ask up to three questions to find out how to classify it).

Our score here has increased from 83% to 87%.

When we graph it, we can now see more boundaries, meaning that the model has become more specific to our training data.

plot_classifier(X, y, model, ticks=True)
plt.xlabel('lon', fontsize=14)
plt.ylabel('lat', fontsize=14)
plt.title("Decision tree with depth = %d" % (depth), fontsize=18);
../_images/lecture2_19_1.png

What happens if we give the model an unlimited max_depth?

model = DecisionTreeClassifier()
model.fit(X, y);
model.get_depth()
8
plot_tree(
    model,
    feature_names=X.columns,
    class_names=y.unique(),
    impurity=False,
    ax=plt.subplots(figsize=(24, 10))[1]
);
../_images/lecture2_22_0.png
plot_classifier(X, y, model, ticks=True)
plt.xlabel('lon', fontsize=14)
plt.ylabel('lat', fontsize=14)
plt.title('Decision tree with unlimited max depth', fontsize=18);
../_images/lecture2_23_1.png
model.score(X, y)
1.0

We can see that with this model we are perfectly fitting every observation in the training data. That must be great, right? Not so fast. In this toy example, we have evaluated the model on the same data we used to train it, but how would it perform on a city that did not exist in the training data? We can use the cities from Alaska that we excluded initially to find out:

alaska = pd.read_csv(
    'data/cities_USA.csv',
    index_col=0
).query(
    'lon <= -140'  # Keep only the Alaska cities we excluded earlier
)
alaska
lon lat vote
398 -166.519855 53.887114 red
399 -163.733617 67.665859 red
400 -145.423115 68.077395 red
model.predict(alaska[['lon', 'lat']])
array(['blue', 'blue', 'blue'], dtype=object)

Terrible! The model incorrectly predicted “blue” for all three cities. This illustrates that a model that is very specific to the training data might not generalize well to unseen data, which is our eventual goal, since we will not have the true target labels upfront (if we did, we would not need ML in the first place). For our decision tree model, we see that the training score increases as we increase max_depth. By creating a more complex tree (higher max_depth), we can fit all the peculiarities of our data and eventually reach 100% accuracy.

2.3.2. The Fundamental goal of machine learning

Goal: to generalize beyond what we see in the training examples.
We are only given a sample of the data and do not have the full distribution.
Using the training data, we want to come up with a reasonable model that will perform well on some unseen examples.

At the end of the day, we want to deploy models that make reasonable predictions on unseen data.

2.3.3. Generalizing to unseen data


The point is that we want the learner to generalize beyond what it sees in the training set and predict labels for new examples that share some characteristics with, but are not identical to, the training data.

2.3.4. Training score versus Generalization score (or Error)

So would we expect a model with a really high accuracy on the training data to perform equally well on unseen examples? Probably not.

Hopefully our training data is representative of the unseen data, but it will not be identical. In general, it is difficult to build a model with close to perfect accuracy, so when this happens on the training data, it might mean that we have made the model too specific to the training set.

Given a model in machine learning, people usually talk about two kinds of accuracies (scores):

  1. Accuracy on the training data

  2. Accuracy on the entire distribution of data (including all the unseen data)

We are interested in the score on the entire distribution because at the end of the day we want our model to perform well on unseen examples.

But the problem is that we do not have access to the distribution and only the limited training data that is given to us.

So, what do we do? Is our only option to use the training data score and hope for the best?

2.4. Splitting

Although we don’t have access to the unseen data, we can withhold some data from the model at training to create unseen data that can be used for evaluation. By splitting our data this way, we can approximate generalization accuracy.

  • Keep a randomly selected portion of our data aside; we call this the test data.

  • Fit (train) a model on the training portion only.

  • Score (assess) the trained model on the set-aside test data to get a sense of how well the model generalizes.

2.4.1. Simple train and test split

  • First, the data needs to be shuffled.

  • Then, we split the rows of the data into 2 sections -> train and test.

  • The lock and key icon on the test set symbolizes that we don’t want to touch the test data until the very end (more on this soon).
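
Conceptually, this amounts to shuffling the row order and then slicing the rows into two pieces. Below is a minimal sketch of that idea with numpy, applied to the voting data from above (the 80/20 split is just an assumption for illustration); in practice we use scikit-learn's train_test_split, shown next.

import numpy as np


# Shuffle the rows with a fixed seed so the result is reproducible
rng = np.random.default_rng(123)
shuffled_df = voting_df.iloc[rng.permutation(len(voting_df))]

# Slice the shuffled rows: first 80% -> train, remaining 20% -> test (lock it away)
n_train = int(0.8 * len(shuffled_df))
train_portion = shuffled_df.iloc[:n_train]
test_portion = shuffled_df.iloc[n_train:]
train_portion.shape, test_portion.shape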

2.4.2. How do we do this?

In our trusty scikit-learn package, we have a function for that!

  • train_test_split

Let’s try it out using a similar yet slightly different dataset. Here we still have latitude and longitude coordinates, but this time our target variable is whether the city with these coordinates lies in Canada or the USA.

2.4.2.1. First way

cities_df = pd.read_csv('data/canada_usa_cities.csv')

cities_df
longitude latitude country
0 -130.0437 55.9773 USA
1 -134.4197 58.3019 USA
2 -123.0780 48.9854 USA
3 -122.7436 48.9881 USA
4 -122.2691 48.9951 USA
... ... ... ...
204 -72.7218 45.3990 Canada
205 -66.6458 45.9664 Canada
206 -79.2506 42.9931 Canada
207 -72.9406 45.6275 Canada
208 -79.4608 46.3092 Canada

209 rows × 3 columns

import altair as alt


chart_votes = alt.Chart(cities_df).mark_circle(size=20, opacity=0.6).encode(
    alt.X('longitude:Q', scale=alt.Scale(domain=[-140, -50])),
    alt.Y('latitude:Q', scale=alt.Scale(domain=[25, 60])),
    alt.Color('country:N', scale=alt.Scale(domain=['Canada', 'USA'],
                                           range=['red', 'blue'])))
chart_votes
X = cities_df.drop(columns=["country"])
X
longitude latitude
0 -130.0437 55.9773
1 -134.4197 58.3019
2 -123.0780 48.9854
3 -122.7436 48.9881
4 -122.2691 48.9951
... ... ...
204 -72.7218 45.3990
205 -66.6458 45.9664
206 -79.2506 42.9931
207 -72.9406 45.6275
208 -79.4608 46.3092

209 rows × 2 columns

y = cities_df["country"]
y
0         USA
1         USA
2         USA
3         USA
4         USA
        ...  
204    Canada
205    Canada
206    Canada
207    Canada
208    Canada
Name: country, Length: 209, dtype: object
from sklearn.model_selection import train_test_split


# Split the dataset into 80% train and 20% test 
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)
# Print shapes of the training and testing data (you can ignore the code here)
pd.DataFrame({
    "Variable name": ["X", "y", "X_train", "y_train", "X_test", "y_test"],
    "Shape": [X.shape, y.shape,
              X_train.shape, y_train.shape,
              X_test.shape, y_test.shape]
})
Variable name Shape
0 X (209, 2)
1 y (209,)
2 X_train (167, 2)
3 y_train (167,)
4 X_test (42, 2)
5 y_test (42,)

2.4.2.2. Second way

Instead of splitting our X and y objects separately, we can split the whole dataframe into train and test splits first.

The earlier we split the data, the better, so this alternative is generally preferred.

cities_df
longitude latitude country
0 -130.0437 55.9773 USA
1 -134.4197 58.3019 USA
2 -123.0780 48.9854 USA
3 -122.7436 48.9881 USA
4 -122.2691 48.9951 USA
... ... ... ...
204 -72.7218 45.3990 Canada
205 -66.6458 45.9664 Canada
206 -79.2506 42.9931 Canada
207 -72.9406 45.6275 Canada
208 -79.4608 46.3092 Canada

209 rows × 3 columns

train_df, test_df = train_test_split(cities_df, test_size=0.2, random_state=123)

X_train = train_df.drop(columns=["country"])
y_train = train_df["country"]

X_test = test_df.drop(columns=["country"])
y_test = test_df["country"]

train_df
longitude latitude country
160 -76.4813 44.2307 Canada
127 -81.2496 42.9837 Canada
169 -66.0580 45.2788 Canada
188 -73.2533 45.3057 Canada
187 -67.9245 47.1652 Canada
... ... ... ...
17 -76.3305 44.1255 USA
98 -74.7287 45.0184 Canada
66 -121.4944 38.5816 USA
126 -79.5656 43.6436 Canada
109 -66.9195 44.8938 Canada

167 rows × 3 columns

2.4.3. Applications with Splitting

Let’s fit a decision tree model with unlimited depth on the training data and look at the tree structure.

model = DecisionTreeClassifier()
model.fit(X_train, y_train)

plot_tree(
    model,
    feature_names=X.columns,
    class_names=y.unique(),
    impurity=False,
    ax=plt.subplots(figsize=(24, 10))[1]
);
../_images/lecture2_45_0.png

Let’s see how the model scores on the training data and the test data.

print("Train score: " + str(round(model.score(X_train, y_train), 2)))
print("Test score: " + str(round(model.score(X_test, y_test), 2)))
Train score: 1.0
Test score: 0.74

For this tree, the training score is 1.0 and the test score is only 0.74.

The model does not perform quite as well on data that it has not seen, i.e. it is very specific to the training data and does not generalize well.

To see this visually, let’s look at the training and testing data with the decision boundaries made by the model.

model.fit(X_train, y_train);

plt.figure(figsize=(16, 8))
plt.suptitle('Decision tree boundaries', fontsize=20)
plt.subplot(1, 2, 1);
plt.title("Training data", fontsize=16);
plot_classifier(X_train, y_train, model, ticks=True, ax=plt.gca(), lims=(-140,-50,25,60))
plt.xlabel("longitude")
plt.ylabel("latitude")

plt.subplot(1, 2, 2);
plt.title("Test data", fontsize=16);
plot_classifier(X_test, y_test, model, ticks=True, ax=plt.gca(), lims=(-140,-50,25,60))
plt.xlabel("longitude")
plt.ylabel("latitude");
../_images/lecture2_49_1.png

On the left and the right, we have the same decision boundaries, but different data is shown.

The model gets 100 percent accuracy on the training data, and for that to happen it ends up being extremely specific.

The model became overly complicated on the training data, and this does not generalize well to the test data.

In the plot on the right, we can see some red triangles in the blue area. Those are the model’s mistakes, which explains the lower test accuracy.

2.4.4. Parameters in .train_test_split()

  • test_size - test split size (0-1)

  • train_size - train split size (0-1) (you only need to specify one of these)

  • random_state - keeps the split randomization consistent between runs

train_df, test_df = train_test_split(cities_df, test_size = 0.2, random_state = 123)
train_df, test_df = train_test_split(cities_df, train_size = 0.8, random_state = 123)
train_df, test_df = train_test_split(cities_df, test_size = 0.2, train_size = 0.8, random_state = 123)

There is no hard and fast rule on what split sizes we should use. Some common splits are 90/10, 80/20, and 70/30 (train/test).

In the above example, we used an 80/20 split.

But there is a trade-off:

  • More training data -> more information for our model.

  • More test data -> a better assessment of our model.
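
To get a concrete feel for this trade-off, here is a small illustrative loop (not from the original lecture) showing how the row counts shift on the cities data as we change test_size:

for size in [0.1, 0.2, 0.3]:
    tr_df, te_df = train_test_split(cities_df, test_size=size, random_state=123)
    print(f"test_size={size}: {tr_df.shape[0]} rows to learn from, {te_df.shape[0]} rows to assess with")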

Now let’s look at the random_state argument:

train_test_split shuffles the data before splitting, and the random_state argument controls this shuffling; without it set, the data will be split in a different way each time we run the code.

train_df, test_df = train_test_split(cities_df, test_size = 0.2)
train_df
longitude latitude country
127 -81.2496 42.9837 Canada
112 -114.0626 51.0534 Canada
194 -122.7979 49.0235 Canada
47 -86.1584 39.7683 USA
18 -75.4864 44.6943 USA
... ... ... ...
74 -75.9774 36.8530 USA
189 -112.8516 49.6943 Canada
61 -87.9225 43.0350 USA
40 -121.8906 37.3362 USA
154 -119.4983 49.8893 Canada

167 rows × 3 columns

We set random_state to make our code reproducible: if we run the code again with the same random_state, it will produce the same split.

train_df_rs5, test_df_rs5 = train_test_split(cities_df, test_size = 0.2, random_state = 5)
train_df_rs5
longitude latitude country
39 -96.7969 32.7763 USA
55 -97.5171 35.4730 USA
40 -121.8906 37.3362 USA
54 -83.0466 42.3316 USA
48 -122.3301 47.6038 USA
... ... ... ...
144 -104.6173 50.4488 Canada
118 -79.7599 43.6858 Canada
189 -112.8516 49.6943 Canada
206 -79.2506 42.9931 Canada
99 -72.0968 45.0072 Canada

167 rows × 3 columns

train_df_rs7, test_df_rs7 = train_test_split(cities_df, test_size = 0.2, random_state = 7)
train_df_rs7
longitude latitude country
128 -118.7148 50.4165 Canada
195 -122.7454 53.9129 Canada
99 -72.0968 45.0072 Canada
97 -75.5192 44.7184 Canada
14 -78.8784 42.8867 USA
... ... ... ...
151 -71.8890 45.4033 Canada
67 -84.3902 33.7491 USA
25 -67.9353 47.1575 USA
196 -110.6790 50.0430 Canada
175 -82.4066 42.9746 Canada

167 rows × 3 columns

2.5. Validation data for hyperparameter optimization

We have seen that leaving hyperparameters (such as max_depth) unconstrained lets the model become very specific to the training data. How can we come up with a good way to find which hyperparameter values produce the model that generalizes best (this process is called hyperparameter optimization)?

It’s a good idea to have a separate set of data for tuning the hyperparameters of a model, one that is not the test set. Enter the validation set.

So we actually want to split our dataset into 3 splits: train, validation, and test.

../_images/train-valid-test-split.png

Note: There isn’t a good consensus on the terminology of what is validation and what is test data.

We use:

  • training data: The data used when fitting to find the best model parameters.

  • validation data: Data where we have access to the target values, but unlike the training data, we only use this for hyperparameter tuning and model assessment; we don’t pass these into fit. This evaluation is still biased as skill on the validation dataset is incorporated into the model hyperparameter configuration.

  • test data (also called holdout data): Data where we have access to the target values, but in this case, unlike training and validation data, we neither use it in training nor hyperparameter optimization and only use it once to evaluate the performance of the best performing model on the validation set. We lock it in a “vault” until we’re ready to evaluate and it gives a more unbiased performance score than the validation score since the model configuration has not been tweaked to perform well on the test data in any way.
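
One simple way to create such a three-way split is to call train_test_split twice: once to lock away the test set, and once to carve a validation set out of what remains. Here is a rough sketch on the cities data, assuming we want roughly 70% train, 15% validation, and 15% test (these proportions are just an example):

# First split: set aside ~15% of the rows as the test set and lock it away
train_valid_df, test_df = train_test_split(cities_df, test_size=0.15, random_state=123)

# Second split: take ~15% of the original data (0.15 / 0.85 of the remainder) as the validation set
train_df, valid_df = train_test_split(train_valid_df, test_size=0.15 / 0.85, random_state=123)

train_df.shape[0], valid_df.shape[0], test_df.shape[0]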

2.6. “Deployment” data

What’s the point of making models?

We want to predict something which we do not know the answer to, so we do not have the target values and we only have the features.

After we build and finalize a model, we deploy it, and then the model is used with data in the wild.

We will use deployment data to refer to data where we do not have access to the target values.

Deployment score is the thing we really care about; it is the performance of the final model on the data in production.

We use validation and test scores as proxies for the deployment score, and we hope they are similar.

So, if our model does well on the validation and test data, we hope it will do well on deployment data.
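
For example, once we have a final fitted model, using it on deployment data just means calling predict on new feature values; there is no score step because we never see the true labels. (The coordinates below are made up purely for illustration.)

# Hypothetical "deployment" data: features only, no country column
deploy_X = pd.DataFrame({"longitude": [-123.1, -79.4], "latitude": [49.3, 43.7]})
model.predict(deploy_X)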

2.7. Let’s Practice

1. When is the most optimal time to split our data?
2. Why do we split our data?
3. Fill in the table below:

datasets      .fit()    .score()    .predict()
Train          ✔️
Validation
Test
Deployment

2.8. Cross-validation

Problems with having a single train-validation split:

  • Only using a portion of your full data set for training/validation (data is our most precious resource!!!)

  • If your dataset is small you might end up with a tiny training/validation set

  • Might be unlucky with your splits such that they don’t represent your data well (shuffling the data, as is done in train_test_split(), is not immune to bad luck!); see the small experiment below
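
As a quick illustration of that last point (this experiment is not in the original lecture), scoring the same model on a few different single train-validation splits of our training data gives noticeably different answers:

for seed in [1, 2, 3]:
    X_tr, X_valid, y_tr, y_valid = train_test_split(X_train, y_train, test_size=0.25, random_state=seed)
    score = DecisionTreeClassifier(max_depth=4).fit(X_tr, y_tr).score(X_valid, y_valid)
    print(f"random_state={seed}: validation score = {score:.2f}")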

There must be a better way!

There is! The answer to our problem is called…

Cross-validation (CV), or 𝑘-fold cross-validation. In cross-validation, we perform several training-validation splits, then train and evaluate our model across all of them to find the hyperparameters that perform best in general.

../_images/cross-validation.png
  • We still have the test set here at the bottom locked away that we will not touch until the end.

  • But, we split the training data into \(k\) folds (\(k>2\), often \(k=10\)). In the graphic above \(k=4\).

  • Each “fold” gets a turn at being the validation set.

  • Each round produces a score, so 𝑘-fold cross-validation produces 𝑘 scores in total. We usually average over the 𝑘 results.

  • Note that cross-validation doesn’t shuffle the data; the shuffling was already done in train_test_split.

  • We can now get a more “robust” estimate of performance on unseen data, since we can see the variation in the scores across folds.
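
To make the mechanics concrete, here is a minimal sketch of the loop that 𝑘-fold cross-validation performs, written out by hand with scikit-learn's KFold on the X_train and y_train created above (using k=4 to match the graphic). The cross_val_score and cross_validate functions introduced in the next section do all of this for us.

from sklearn.model_selection import KFold
import numpy as np


kf = KFold(n_splits=4)  # KFold does not shuffle by default
fold_scores = []
for train_idx, valid_idx in kf.split(X_train):
    X_tr, X_valid = X_train.iloc[train_idx], X_train.iloc[valid_idx]
    y_tr, y_valid = y_train.iloc[train_idx], y_train.iloc[valid_idx]
    fold_model = DecisionTreeClassifier(max_depth=4)
    fold_model.fit(X_tr, y_tr)                              # fit on the other k-1 folds
    fold_scores.append(fold_model.score(X_valid, y_valid))  # score on the held-out fold

fold_scores, np.mean(fold_scores)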

Note that for the same reason as we perform CV with the validation dataset, we might also perform it with the test data set. This is referred to as nested CV, as it includes one outer CV loop for splitting the test data from the validation + training data, and one inner CV loop for splitting the validation data from the training data. This can be a good idea in practice, especially if your sample dataset is small, but it is not as commonly used as CV for only the validation-training split.
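
The lecture does not include code for nested CV, but a rough sketch (assuming the full X and y from the cities data and a small grid of max_depth values) could look like the following; each outer fold plays the role of the test set while the inner cross-validation does the hyperparameter tuning:

from sklearn.model_selection import KFold, cross_val_score
import numpy as np


outer_cv = KFold(n_splits=5, shuffle=True, random_state=123)
outer_scores = []
for train_idx, test_idx in outer_cv.split(X):
    # Outer loop: this fold's held-out portion acts as the test set
    X_tr, X_te = X.iloc[train_idx], X.iloc[test_idx]
    y_tr, y_te = y.iloc[train_idx], y.iloc[test_idx]

    # Inner loop: choose max_depth by cross-validation on the outer training portion only
    depths = list(range(1, 11))
    inner_means = [
        cross_val_score(DecisionTreeClassifier(max_depth=d), X_tr, y_tr, cv=5).mean()
        for d in depths
    ]
    best_depth = depths[int(np.argmax(inner_means))]

    # Refit with the chosen hyperparameter and evaluate on the outer held-out fold
    best_model = DecisionTreeClassifier(max_depth=best_depth).fit(X_tr, y_tr)
    outer_scores.append(best_model.score(X_te, y_te))

np.mean(outer_scores)  # a generalization estimate that accounts for the tuning itself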

2.9. Cross-validation using sklearn

There are 2 ways we can do cross-validation with sklearn:

  • .cross_val_score()

  • .cross_validate()

Before doing cross-validation, we still need to split our data into a training set and a test set and to separate the features from the targets. So, using the X and y from our Canada/USA cities data, we split them into train and test portions.

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=123)

2.9.1. cross_val_score

from sklearn.model_selection import cross_val_score


model = DecisionTreeClassifier(max_depth=4)
cv_score = cross_val_score(model, X_train, y_train, cv=5)
cv_score
array([0.76470588, 0.82352941, 0.78787879, 0.78787879, 0.84848485])

Once we’ve imported cross_val_score, we can create our model and pass the model, the feature object, and the target object as arguments.

  • cv determines the cross-validation splitting strategy, i.e. how many “folds” there are.

  • For each fold, the model is fitted on the training portion and scored on the validation portion.

  • The output of cross_val_score() is the validation score for each fold.

cv_score.mean()
0.8024955436720143
cv_score = cross_val_score(model, X_train, y_train, cv=10)
cv_score
array([0.76470588, 0.82352941, 0.70588235, 0.94117647, 0.82352941,
       0.82352941, 0.70588235, 0.9375    , 0.9375    , 0.9375    ])
cv_score.mean()
0.8400735294117647

2.9.2. cross_validate

  • Similar to cross_val_score but more informative.

  • Lets us access training and validation scores using the parameter return_train_score.

  • Note: in the output dictionary, test_score actually refers to the validation score (and score_time to the time spent scoring on the validation fold)

from sklearn.model_selection import cross_validate
scores = cross_validate(model, X_train, y_train, cv=10, return_train_score=True)
scores
{'fit_time': array([0.00183988, 0.00165224, 0.00149655, 0.00149798, 0.00132823,
        0.00136709, 0.0014019 , 0.00135303, 0.00134039, 0.00130725]),
 'score_time': array([0.00142384, 0.0012486 , 0.00102472, 0.00109506, 0.00099587,
        0.00089931, 0.00087047, 0.00085831, 0.00090098, 0.00084519]),
 'test_score': array([0.76470588, 0.82352941, 0.70588235, 0.94117647, 0.82352941,
        0.82352941, 0.70588235, 0.9375    , 0.9375    , 0.9375    ]),
 'train_score': array([0.91333333, 0.90666667, 0.90666667, 0.9       , 0.90666667,
        0.91333333, 0.92      , 0.90066225, 0.90066225, 0.90066225])}

We can wrap this in a dataframe for easier reading.

scores_df = pd.DataFrame(cross_validate(model, X_train, y_train, cv=10, return_train_score=True))
scores_df
fit_time score_time test_score train_score
0 0.001702 0.001002 0.764706 0.913333
1 0.001385 0.001002 0.823529 0.906667
2 0.001461 0.000923 0.705882 0.906667
3 0.001358 0.000802 0.941176 0.900000
4 0.001331 0.000853 0.823529 0.906667
5 0.001251 0.000797 0.823529 0.913333
6 0.001260 0.000930 0.705882 0.920000
7 0.001318 0.000879 0.937500 0.900662
8 0.001376 0.001303 0.937500 0.900662
9 0.002209 0.001251 0.937500 0.900662
scores_df.mean()
fit_time       0.001465
score_time     0.000974
test_score     0.840074
train_score    0.906865
dtype: float64
scores_df.std()
fit_time       0.000291
score_time     0.000175
test_score     0.094993
train_score    0.006822
dtype: float64

2.10. Our typical supervised learning setup is as follows:

  1. Given training data with X and y.

  2. We split our data into X_train, y_train, X_test, y_test.

  3. Hyperparameter optimization using cross-validation on X_train and y_train.

  4. We assess the best model using X_test and y_test.

  5. The test score tells us how well our model generalizes.

  6. If the test score is reasonable, we deploy the model.

2.11. Let’s Practice

1. We carry out cross-validation to avoid reusing the same validation set again and again. Let’s say you do 10-fold cross-validation on 1000 examples. For each fold, how many examples do you train on?
2. With 10-fold cross-validation, you split 1000 examples into 10-folds. For each fold, when you are done, you add up the accuracies from each fold and divide by what?

True/False:

3. 𝑘-fold cross-validation calls fit 𝑘 times and predict 𝑘 times.

2.12. Overfitting and Underfitting

2.12.1. Types of scores

We’ve talked about the different types of splits, now we are going to talk about their scores.

  • Training score: The score that our model gets on the same data that it was trained on. (seen data - training data)

  • Validation score: The mean validation score from cross-validation.

  • Test score: This is the score from the data that we locked away.

2.12.2. Overfitting

  • Overfitting occurs when our model is overly specific to the particular training data, and it often leads to poor results on new data.

  • The training score is high but the validation score is much lower.

  • The gap between the train and validation scores is large.

  • It’s common to have a bit of overfitting (only a bit!)

  • Overfitting is more severe when the training data is small or when the model’s complexity is high.

model = DecisionTreeClassifier()
scores = cross_validate(model, X_train, y_train, cv=10, return_train_score=True)

pd.DataFrame(scores)
fit_time score_time test_score train_score
0 0.001723 0.001085 0.647059 1.0
1 0.001647 0.001349 0.764706 1.0
2 0.002690 0.001627 0.764706 1.0
3 0.001665 0.000936 0.941176 1.0
4 0.001338 0.000885 0.882353 1.0
5 0.001374 0.000853 0.823529 1.0
6 0.001434 0.001036 0.705882 1.0
7 0.002044 0.001131 0.812500 1.0
8 0.001461 0.000934 0.937500 1.0
9 0.001683 0.001016 0.812500 1.0
print("Train score: " + str(round(scores["train_score"].mean(), 2)))
print("Validation score: " + str(round(scores["test_score"].mean(), 2)))
Train score: 1.0
Validation score: 0.81
model.fit(X_train, y_train);
plot_classifier(X_train, y_train, model);
plt.title("Decision tree with no max_depth");
/home/joel/miniconda3/envs/bait/lib/python3.9/site-packages/sklearn/base.py:450: UserWarning: X does not have valid feature names, but DecisionTreeClassifier was fitted with feature names
  warnings.warn(
../_images/lecture2_91_1.png

2.12.3. Underfitting

  • Underfitting is somewhat the opposite of overfitting in the sense that it occurs when the model is not complex enough.

  • Underfitting is when our model is too simple (e.g., a DecisionTreeClassifier with max_depth=1).

  • The model doesn’t capture the patterns in the training data and the training score is not that high.

  • Both train and validation scores are low and the gap between train and validation scores is low as well.

model = DecisionTreeClassifier(max_depth=1)

scores = cross_validate(model, X_train, y_train, cv=10, return_train_score=True)
print("Train score: " + str(round(scores["train_score"].mean(), 2)))
print("Validation score: " + str(round(scores["test_score"].mean(), 2)))
Train score: 0.83
Validation score: 0.81

Standard question to ask ourselves: Which of these scenarios am I in?

2.12.4. How can we figure this out?

  • If the training and validation scores are very far apart → more likely overfitting.

    • Try decreasing model complexity.

  • If the training and validation scores are very close together → more likely underfitting.

    • Try increasing model complexity.
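
As a quick illustrative check (not from the lecture), we can compute this gap directly from a cross_validate result; the 0.05 threshold below is an arbitrary rule of thumb, chosen only for the example:

scores = cross_validate(
    DecisionTreeClassifier(max_depth=10), X_train, y_train, cv=10, return_train_score=True
)
gap = scores["train_score"].mean() - scores["test_score"].mean()
if gap > 0.05:
    print(f"Gap = {gap:.2f}: train >> validation, likely overfitting -> try reducing complexity")
else:
    print(f"Gap = {gap:.2f}: scores are close; if both are low, likely underfitting -> try more complexity")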

2.13. The “Fundamental Tradeoff” of Supervised Learning

As model complexity increases:

\(\text{Training score}\) ↑ and the gap (\(\text{Training score} − \text{Validation score}\)) tends to ↑ as well

If our model is too simple (underfitting) then we won’t really learn any “specific patterns” of the training set.

BUT

If our model is too complex then we will learn unreliable patterns that get every single training example correct, and there will be a large gap between training error and validation error.

The trade-off is that there is tension between these two goals.

When we underfit less, we overfit more.

How do we know how much overfitting is too much and how much is not enough?

2.13.1. How to pick a model that would generalize better?

We can create a loop that tries different values of a hyperparameter to see which gives the best results. Let’s try this with max_depth values from 1 to 19.

results_dict = {"depth": list(), "mean_train_score": list(), "mean_cv_score": list()}

for depth in range(1, 20):
    model = DecisionTreeClassifier(max_depth=depth)
    scores = cross_validate(model, X_train, y_train, cv=10, return_train_score=True)
    results_dict["depth"].append(depth)
    results_dict["mean_cv_score"].append(scores["test_score"].mean())
    results_dict["mean_train_score"].append(scores["train_score"].mean())

results_df = pd.DataFrame(results_dict)
results_df
depth mean_train_score mean_cv_score
0 1 0.834349 0.809926
1 2 0.844989 0.804044
2 3 0.862967 0.804412
3 4 0.906865 0.840074
4 5 0.918848 0.845956
5 6 0.930817 0.815074
6 7 0.954115 0.827941
7 8 0.972733 0.821324
8 9 0.980049 0.827574
9 10 0.994013 0.803309
10 11 0.996675 0.803309
11 12 1.000000 0.815074
12 13 1.000000 0.809191
13 14 1.000000 0.809191
14 15 1.000000 0.803309
15 16 1.000000 0.815074
16 17 1.000000 0.803309
17 18 1.000000 0.809191
18 19 1.000000 0.809191
source = results_df.melt(id_vars=['depth'] , 
                              value_vars=['mean_train_score', 'mean_cv_score'], 
                              var_name='score_type', value_name='accuracy')
chart1 = alt.Chart(source).mark_line().encode(
    alt.X('depth:Q', axis=alt.Axis(title="Tree Depth")),
    alt.Y('accuracy:Q'),
    alt.Color('score_type:N', scale=alt.Scale(domain=['mean_train_score', 'mean_cv_score'],
                                           range=['teal', 'gold'])))
chart1
  • As we increase the depth (increase our complexity), our training score increases.

  • As we increase the depth, we overfit more, and the gap between the train score and validation score also increases… except…

  • There is a sweet spot where the validation score is at its highest before it starts dropping as the model overfits, while the gap between the train and validation scores is still fairly small.

  • In the plot, this is around max_depth = 5.

  • Commonly, we look at the cross-validation score and pick the hyperparameter value with the highest cross-validation score.

results_df.sort_values('mean_cv_score', ascending=False)
depth mean_train_score mean_cv_score
4 5 0.918848 0.845956
3 4 0.906865 0.840074
6 7 0.954115 0.827941
8 9 0.980049 0.827574
7 8 0.972733 0.821324
15 16 1.000000 0.815074
11 12 1.000000 0.815074
5 6 0.930817 0.815074
0 1 0.834349 0.809926
12 13 1.000000 0.809191
17 18 1.000000 0.809191
13 14 1.000000 0.809191
18 19 1.000000 0.809191
2 3 0.862967 0.804412
1 2 0.844989 0.804044
10 11 0.996675 0.803309
14 15 1.000000 0.803309
16 17 1.000000 0.803309
9 10 0.994013 0.803309

Now that we know the best value to use for max_depth, we can build a new classifier setting max_depth=5, train it and now (only now) do we score our model on the test set.

model = DecisionTreeClassifier(max_depth=5)
model.fit(X_train, y_train);
print("Score on test set: " + str(round(model.score(X_test, y_test), 2)))
Score on test set: 0.81
  • Is the test error comparable with the cross-validation error?

  • Do we feel confident that this model would give a similar performance when deployed?

2.14. The Golden Rule

  • Even though we care the most about the test error, THE TEST DATA CANNOT INFLUENCE THE TRAINING PHASE IN ANY WAY.

  • We have to be very careful not to violate it while developing our ML pipeline.

  • Why? When this happens, the test data influences our training, the test data is no longer truly unseen, and so the test score will be too optimistic.

  • Even experts end up breaking it sometimes, which leads to misleading results and a lack of generalization on real data.

    • https://www.theregister.com/2019/07/03/nature_study_earthquakes/

    • https://www.technologyreview.com/2015/06/04/72951/why-and-how-baidu-cheated-an-artificial-intelligence-test/

How do we avoid this?

The most important thing is that when splitting the data, we lock away the test set and keep it separate from the training data.

Forget it exists temporarily - kinda like forgetting where you put your passport until you need to travel.

To summarize, the workflow we generally follow is:

  • Splitting: Before doing anything, split the data X and y into X_train, X_test, y_train, y_test or train_df and test_df using train_test_split.

  • Select the best model using cross-validation: Use cross_validate with return_train_score = True so that we can get access to training scores in each fold. (If we want to plot train vs validation error plots, for instance.)

  • Scoring on test data: Finally, score on the test data with the chosen hyperparameters to examine the generalization performance.

2.15. Let’s Practice

Overfitting or Underfitting:

1. If our train accuracy is much higher than our test accuracy.
2. If our train accuracy and our test accuracy are both low and relatively similar in value.
3. If our model is using a Decision Tree Classifier for a classification problem with no limit on max_depth.

True or False:

4. In supervised learning, the training score is always higher than the validation score.
5. The fundamental tradeoff of ML states that as training score goes up, validation score goes down.
6. More “complicated” models are more likely to overfit than “simple” ones.
7. If our training score is extremely high, that means we’re overfitting.

2.16. Let’s Practice - Coding

Below is some starter code that creates your feature table and target column from the data from the bball.csv dataset (in the data folder).

bball_df = pd.read_csv('data/bball.csv')
bball_df = bball_df[(bball_df['position'] =='G') | (bball_df['position'] =='F')]

# Define X and y
X = bball_df.loc[:, ['height', 'weight', 'salary']]
y = bball_df['position']
  1. Split the dataset into 4 objects: X_train, X_test, y_train, y_test. Make the test set 0.2 (or the train set 0.8) and make sure to use random_state=7.

  2. Build a decision tree model with max_depth=5.

  3. Cross-validate using cross_validate() on the objects X_train and y_train specifying the model and making sure to use 10 fold cross-validation and return_train_score=True.

  4. Convert the scores into a dataframe and save it in an object named scores_df.

  5. Calculate the mean scores from cross validation.

  6. Is your model overfitting or underfitting?

# 1.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=7)
# 2.
dt5 = DecisionTreeClassifier(max_depth=5)
# 3.
scores = cross_validate(dt5, X_train, y_train, cv=10, return_train_score=True)
# 4.
scores_df = pd.DataFrame(scores)
# 5.
scores_df.mean()
fit_time       0.001511
score_time     0.000904
test_score     0.836833
train_score    0.957826
dtype: float64
  6. It seems to be overfitting slightly, since it performs notably better on the training data than in cross-validation.

2.17. What We’ve Learned Today

  • The concept of generalization.

  • How to split a dataset into train and test sets using the train_test_split function.

  • The difference between train, validation, test, and “deployment” data.

  • The difference between training error, validation error, and test error.

  • Cross-validation, and how to use cross_val_score() and cross_validate() to calculate cross-validation error.

  • Overfitting, underfitting, and the fundamental tradeoff.

  • The golden rule and the scenarios in which it’s violated.