11. Multi-Class Classification (Optional)#

11.1. Lecture Learning Objectives#

Explain components of a confusion matrix with respect to multi-class classification.
Define precision, recall, and f1-score for multi-class classification.
Carry out multi-class classification using OVR and OVO strategies.

11.2. Five Minute Recap/ Lightning Questions#

What metric is calculated using the equation \(\frac{TP}{TP + FP}\)?
What function can be used to find the calculated values of precision, recall, and f1?
What function do we use to identify the number of false positives, false negatives and correctly identified positive and negative values?
What argument/parameter is important to use when we make our own scorer where lower values are better?
What regression metric will give funky units?

11.2.1. Some lingering questions#

What happens if we have data where there is a lot of one class and very few of another?
How do we measure precision and recall, and what do our confusion matrices look like now?

11.3. Multi-class classification#

Often we will come across problems where there are more than two classes to predict.
We call these multi-class problems.

Some algorithms can natively support multi-class classification, for example:
- Decision Trees
- \(K\)-nn
- Naive Bayes

[Figure: a Decision Tree Classifier used to classify 3 labels, shown with its graph]

[Figure: an example of KNN on multi-class data]
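
As a minimal sketch of what "natively support" means, we can fit a Decision Tree and a KNN model directly on a 3-class target with no wrapper. The wine dataset, the `random_state` values and `n_neighbors=5` are assumptions made for illustration; they are not the data behind the original figures.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Wine data: 3 classes (0, 1, 2) and 13 numeric features
X, y = datasets.load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=2021)

# Both models accept the 3-class target as-is
tree = DecisionTreeClassifier(random_state=2021).fit(X_train, y_train)
knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)

# The labels each model learned, and a few predictions from each
print(tree.classes_, knn.classes_)
print(tree.predict(X_test[:5]), knn.predict(X_test[:5]))
```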

Other models, like SVMs and Logistic Regression, are binary classifiers in their basic form and don’t natively support multi-class classification.

Instead, there are two common strategies to help us:

One-vs-rest
One-vs-one

11.3.1. One-vs-Rest#

(also known as one-vs-all)

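It’s a common default in sklearn, e.g., LinearSVC uses it (SVC uses one-vs-one internally, and recent versions of LogisticRegression fit a single multinomial model by default).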
Turns \(k\)-class classification into \(k\) binary classification problems.
Builds \(k\) binary classifiers; for each classifier, the class is fitted against all the other classes.
For \(k\) classes, that means we need \(k\) models in total, e.g.:
- blue vs (red & orange)
- red vs (blue & orange)
- orange vs (blue & red)
We use all models to make a prediction, and then choose the category with the highest prediction/probability/confidence.
You can do this yourself for any binary classifier using OneVsRestClassifier.

Here we are importing OneVsRestClassifier from sklearn.multiclass

from sklearn.multiclass import OneVsRestClassifier

We are going to use a wine dataset that has 3 different classes: 0, 1, 2 (maybe red, white and rose?).

import pandas as pd
from sklearn import datasets
from sklearn.model_selection import train_test_split

data = datasets.load_wine()
X = pd.DataFrame(data['data'], columns=data["feature_names"])
X = X[['alcohol', 'malic_acid']]
y = data['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=2021)
X_train.head()

	alcohol	malic_acid
36	13.28	1.64
77	11.84	2.89
131	12.88	2.99
159	13.48	1.67
4	13.24	2.59

pd.DataFrame(y_train).value_counts()

1    54
0    40
2    39
dtype: int64

from sklearn.linear_model import LogisticRegression

ovr = OneVsRestClassifier(LogisticRegression(max_iter=100000))
ovr.fit(X_train, y_train)
ovr.score(X_train, y_train)

0.7819548872180451
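
To make the one-vs-rest idea concrete, here is a rough sketch that builds the binary models by hand and compares the result with the wrapper. The variable names (`binary_models`, `scores`, `manual_pred`) are made up for this illustration, and using `decision_function` as the "confidence" is an assumption about how to rank the binary models.

```python
import numpy as np
import pandas as pd
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier

data = datasets.load_wine()
X = pd.DataFrame(data["data"], columns=data["feature_names"])[["alcohol", "malic_acid"]]
y = data["target"]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=2021)

classes = np.unique(y_train)

# One binary model per class: "this class" (1) vs "everything else" (0)
binary_models = [
    LogisticRegression(max_iter=100000).fit(X_train, (y_train == c).astype(int))
    for c in classes
]

# Every model scores every row; the class whose model is most confident wins
scores = np.column_stack([m.decision_function(X_train) for m in binary_models])
manual_pred = classes[scores.argmax(axis=1)]

# Compare with the sklearn wrapper
ovr = OneVsRestClassifier(LogisticRegression(max_iter=100000)).fit(X_train, y_train)
agreement = np.mean(manual_pred == ovr.predict(X_train))
print(len(ovr.estimators_), agreement)
```

The wrapper stores its fitted binary models in `estimators_`, so with 3 classes we expect 3 of them.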

from plot_classifier import plot_classifier

plot_classifier(X_train, y_train, ovr);


11.3.2. One-vs-One#

One-vs-One fits a model to all pairs of categories.
If there are 3 classes (“blue”, “red”, “orange”), we fit a model on:
- blue vs red
- blue vs orange
- red vs orange
So we have 3 models in this case, or in general \(\frac{k(k-1)}{2}\) models for \(k\) classes.
For 100 classes, we fit 4950 models!
All models are used during prediction and the classification with the most “votes” is predicted.
Computationally expensive, but can be good for models that scale poorly with data, because each model in OvO only uses a fraction of the dataset.
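
The model counts above can be checked with a few lines of plain Python. The helper names `n_ovr_models` and `n_ovo_models` are hypothetical and exist only for this sketch.

```python
from itertools import combinations


def n_ovr_models(k: int) -> int:
    """One-vs-rest: one binary model per class."""
    return k


def n_ovo_models(k: int) -> int:
    """One-vs-one: one binary model per unordered pair of classes."""
    return k * (k - 1) // 2


colours = ["blue", "red", "orange"]

# Every pair of classes that a one-vs-one model would be trained on
pairs = list(combinations(colours, 2))
print(pairs)

# 3 classes and 100 classes, as in the text
print(n_ovr_models(3), n_ovo_models(3))
print(n_ovr_models(100), n_ovo_models(100))
```

Listing the pairs with `combinations` gives the same three matchups as the bullets above, and the formula gives 4950 for 100 classes.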

from sklearn.multiclass import OneVsOneClassifier

ovo = OneVsOneClassifier(LogisticRegression(max_iter=100000))
ovo.fit(X_train, y_train)
ovo.score(X_train, y_train)

0.8120300751879699

plot_classifier(X_train, y_train, ovo);


11.4. Multi-class measurements#

Similar to how we can use different classification metrics for binary classification, we can do so with multi-class too.

Let’s look at this using all of the features in this wine dataset.

data = datasets.load_wine()
X = pd.DataFrame(data['data'], columns=data["feature_names"])
y = data['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=2021)
X_train.head()

	alcohol	malic_acid	ash	alcalinity_of_ash	magnesium	total_phenols	flavanoids	nonflavanoid_phenols	proanthocyanins	color_intensity	hue	od280/od315_of_diluted_wines	proline
36	13.28	1.64	2.84	15.5	110.0	2.60	2.68	0.34	1.36	4.60	1.09	2.78	880.0
77	11.84	2.89	2.23	18.0	112.0	1.72	1.32	0.43	0.95	2.65	0.96	2.52	500.0
131	12.88	2.99	2.40	20.0	104.0	1.30	1.22	0.24	0.83	5.40	0.74	1.42	530.0
159	13.48	1.67	2.64	22.5	89.0	2.60	1.10	0.52	2.29	11.75	0.57	1.78	620.0
4	13.24	2.59	2.87	21.0	118.0	2.80	2.69	0.39	1.82	4.32	1.04	2.93	735.0

X_train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 133 entries, 36 to 116
Data columns (total 13 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   alcohol                       133 non-null    float64
 1   malic_acid                    133 non-null    float64
 2   ash                           133 non-null    float64
 3   alcalinity_of_ash             133 non-null    float64
 4   magnesium                     133 non-null    float64
 5   total_phenols                 133 non-null    float64
 6   flavanoids                    133 non-null    float64
 7   nonflavanoid_phenols          133 non-null    float64
 8   proanthocyanins               133 non-null    float64
 9   color_intensity               133 non-null    float64
 10  hue                           133 non-null    float64
 11  od280/od315_of_diluted_wines  133 non-null    float64
 12  proline                       133 non-null    float64
dtypes: float64(13)
memory usage: 14.5 KB

Since our data here isn’t missing any values and it’s all numeric, we can make a pipeline with just StandardScaler() and a model. We are going to use LogisticRegression.

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

pipe = make_pipeline(
    StandardScaler(),
    LogisticRegression()
)
pipe.fit(X_train, y_train);

predictions = pipe.predict(X_test)
pipe.score(X_test, y_test)

0.9333333333333333

We can predict on our test set and see that we get an accuracy of 93%.

But what does this mean for our metrics?

11.4.1. Multiclass confusion metrics#

We can still create confusion matrices, but now they are larger than a 2 X 2 grid.

We have 3 classes for this data, so our confusion matrix is 3 X 3.

from sklearn.metrics import ConfusionMatrixDisplay


ConfusionMatrixDisplay.from_estimator(
    pipe,
    X_test,
    y_test
);

We see that we can still compute a confusion matrix for problems with more than 2 labels in the target column.

The diagonal values are the correctly labelled wines and the rest are the errors.

Here we can see the model mistakenly predicted:

1 wine of true class 1 as class 0,
1 wine of true class 1 as class 2, and
1 wine of true class 2 as class 1.
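
If we want the numbers rather than the plot, `confusion_matrix` returns the same grid as an array. The sketch below rebuilds the pipeline so that it runs on its own; the exact counts may differ slightly between sklearn versions, so no output is shown.

```python
import numpy as np
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = datasets.load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=2021)

pipe = make_pipeline(StandardScaler(), LogisticRegression())
pipe.fit(X_train, y_train)
predictions = pipe.predict(X_test)

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_test, predictions)
print(cm)

# Correct predictions sit on the diagonal; everything else is an error
n_correct = np.trace(cm)
n_errors = cm.sum() - n_correct
print(n_correct, n_errors, n_correct / cm.sum())
```

The last number printed is the diagonal total divided by the number of test examples, which is the accuracy.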

11.4.2. Multiclass classification report#

Precision, recall, etc. don’t apply directly, but as we said before, if we specify one class as our “positive” label and consider the rest to be negative, then we can calculate them for each class.

from sklearn.metrics import classification_report

print(classification_report(y_test, predictions, digits=4))

              precision    recall  f1-score   support

           0     0.9500    1.0000    0.9744        19
           1     0.9375    0.8824    0.9091        17
           2     0.8889    0.8889    0.8889         9

    accuracy                         0.9333        45
   macro avg     0.9255    0.9237    0.9241        45
weighted avg     0.9331    0.9333    0.9326        45

If class 0 is our positive class then our precision, recall and f1-scores are 0.95, 1.00, 0.9744 respectively.

If class 1 is our positive class then now the precision, recall and f1-scores are 0.9375, 0.8824, 0.9091.

And finally, if class 2 is our positive class then the precision, recall and f1-scores are 0.8889, 0.8889, 0.8889.

Again the support column on the right shows the number of examples of each wine class.
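
The report can be reproduced by hand from a confusion matrix. The matrix below is an assumption: it is reconstructed from the support column and the three errors described in the confusion matrix section, not copied from the notebook output.

```python
import numpy as np

# Reconstructed confusion matrix (rows = true class, columns = predicted class)
cm = np.array([
    [19, 0, 0],
    [1, 15, 1],
    [0, 1, 8],
])

tp = np.diag(cm)
precision = tp / cm.sum(axis=0)  # TP / (TP + FP): divide by column totals
recall = tp / cm.sum(axis=1)     # TP / (TP + FN): divide by row totals
f1 = 2 * precision * recall / (precision + recall)
support = cm.sum(axis=1)         # number of true examples per class

# "macro avg" is a plain mean; "weighted avg" weights each class by its support
macro_precision = precision.mean()
weighted_precision = np.average(precision, weights=support)
accuracy = tp.sum() / cm.sum()

print(np.round(precision, 4), np.round(recall, 4), np.round(f1, 4))
print(round(macro_precision, 4), round(weighted_precision, 4), round(accuracy, 4))
```

Each class takes a turn as the "positive" label: its column total counts everything predicted as that class, and its row total counts everything that truly is that class.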

11.5. Multi-class coefficients#

Let’s look at the coefficients with this multi-class problem. (Ignore the max_iter for now. You can look into it here if you like)

pipe.named_steps['logisticregression'].coef_

array([[ 0.65224456,  0.15226257,  0.39849154, -0.65779374,  0.14809695,
         0.40339704,  0.64196157, -0.17781306, -0.08015819,  0.14511694,
         0.10843067,  0.57387779,  1.04929389],
       [-0.98152274, -0.46358731, -0.72783023,  0.39233462, -0.37572384,
        -0.12216444,  0.26882969,  0.28402296,  0.57145264, -0.7940844 ,
         0.64433443,  0.0136787 , -1.02136195],
       [ 0.32927817,  0.31132474,  0.32933868,  0.26545912,  0.22762689,
        -0.2812326 , -0.91079126, -0.1062099 , -0.49129445,  0.64896746,
        -0.75276509, -0.58755649, -0.02793194]])

pipe.named_steps['logisticregression'].coef_.shape

(3, 13)

What is going on here?

Well, now we have one coefficient per feature per class.

The interpretation is that these coefficients contribute to the prediction of a certain class.

The specific interpretation depends on the way the logistic regression is implementing multi-class (e.g., OVR or a single multinomial model).
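
One way to make the coefficient array easier to read is to label its rows with the classes and its columns with the feature names. This is only a sketch; the `coefs` table and the "class 0" style row labels are our own naming, not something sklearn produces.

```python
import pandas as pd
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

data = datasets.load_wine()
X = pd.DataFrame(data["data"], columns=data["feature_names"])
y = data["target"]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=2021)

pipe = make_pipeline(StandardScaler(), LogisticRegression())
pipe.fit(X_train, y_train)
model = pipe.named_steps["logisticregression"]

# One row per class, one column per feature
coefs = pd.DataFrame(
    model.coef_,
    index=[f"class {c}" for c in model.classes_],
    columns=X_train.columns,
)
print(coefs.T.round(2))

# For each class, the feature with the largest positive coefficient
print(coefs.idxmax(axis=1))
```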

11.6. Multi-class and `predict_proba`#

If we look at the output of predict_proba, we also see that there is a probability for each class and that each row adds up to 1, as we would expect (total probability = 1).

pipe.predict_proba(X_test)[:5]

array([[9.93054789e-01, 6.37825640e-03, 5.66954156e-04],
       [8.88767485e-04, 9.91007367e-01, 8.10386504e-03],
       [9.99631873e-01, 2.80846843e-04, 8.72802725e-05],
       [4.31126532e-04, 3.74655335e-04, 9.99194218e-01],
       [1.36846522e-03, 9.97552882e-01, 1.07865245e-03]])
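
We can check both claims in code: that each row sums to 1, and that the predicted class is the one with the highest probability. This sketch rebuilds the pipeline so it runs on its own; `row_sums` and `most_likely` are names chosen for this example.

```python
import numpy as np
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = datasets.load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=2021)

pipe = make_pipeline(StandardScaler(), LogisticRegression())
pipe.fit(X_train, y_train)

# One column of probabilities per class, in the order given by pipe.classes_
proba = pipe.predict_proba(X_test)
row_sums = proba.sum(axis=1)

# The class with the highest probability in each row
most_likely = pipe.classes_[proba.argmax(axis=1)]

print(proba.shape)
print(np.allclose(row_sums, 1.0))
print(np.array_equal(most_likely, pipe.predict(X_test)))
```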

11.7. Let’s Practice#

1. Which wrapper is more computationally expensive?
2. Name a model that can handle multi-class problems without any issues or needing any additional strategies.
3. If I have 6 classes, how many models will be built if I use the One-vs-Rest strategy?
4. If I have 6 classes, how many models will be built if I use the One-vs-One strategy?

Use the diagram below to answer the next few questions:

5. How many examples did the model correctly predict?
6. How many examples were incorrectly labelled as G?
7. How many F-C labels were in the data?

True or False:

8. Decision Trees use coefficients for multi-class data.
9. Using 1 target label as the positive class will make all other target labels negative.

Solutions!

1. One-vs-One
2. Decision Trees, K-nn
3. 6
4. \(6(5)/2=15\)
5. 52
6. 3
7. 6
8. False
9. True

11.8. Pandas Profiler#

EDA secret! (Careful to only use this on your training split though -> Golden Rule!)
Quickly generate summaries of dataframes, including dtypes, stats, visuals, etc.
Pandas profiling is not part of base Pandas.
If using conda, install with: pip install pandas-profiling

from pandas_profiling import ProfileReport

df = pd.read_csv('data/housing.csv')

profile = ProfileReport(df)
profile

11.9. What We’ve Learned Today#

How to carry out multi-class classification.
How to utilize pandas profiling for EDA.

BAIT 509
Business Applications of Machine Learning
