{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# BAIT 509: Business Applications of Machine Learning\n", "## Lecture 5 - Logistic regression, naive Bayes and some case studies\n", "Tomas Beuzen, 20th January 2020" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Lecture outline\n", "- [0. Recap (5 mins)](#0)\n", "- [1. Lecture learning objectives](#1)\n", "- [2. No free lunch theorem (5 mins)](#2)\n", "- [3. Logistic regression (40 mins)](#3)\n", "- [--- Break --- (10 mins)](#break)\n", "- [4. Naive Bayes (30 mins)](#4)\n", "- [5. Logistic Regression vs Naive Bayes (5 mins)](#5)\n", "- [6. True/false questions (5 mins)](#6)\n", "- [7. Summary questions to ponder](#7)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Announcements" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- Assignment 1 due **tonight at 11:59pm**.\n", "- Assignment 2 will be released tomorrow morning and will be due next **Monday (27th Jan) at 11:59pm**." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 0. Recap (5 mins) \n", "\n", "- The problems of having numeric data on different scales\n", "- Scaling numeric data (normalization, standardization)\n", "- How to properly implement scaling in a ML workflow (the golden rule)\n", "- Encoding categorical data (label encoding, one hot encoding)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 1. Lecture learning objectives \n", "\n", "- Introduce and use logistic regression\n", "- Introduce and use naive Bayes" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 2. 
No free lunch theorem (5 mins) " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- There's an important theorem in ML called the \"[No free lunch theorem](https://www.kdnuggets.com/2019/09/no-free-lunch-data-science.html)\"\n", "- Essentially, this states that no single ML model is best for all problems\n", "- In practice, we usually try lots of different models (more on that next lecture)\n", "- However, some models have proved to be generally more effective for particular problems\n", "- The two that we will look at today are:\n", " - Logistic regression\n", " - Naive Bayes" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 3. Logistic Regression (40 mins) " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 3.1 A conceptual introduction" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- We all know about linear regression\n", "- We use linear regression for predicting a continuous response variable with the form:\n", "$$Y = \\beta_0+\\beta_1X$$\n", "- But can we use linear regression for classification?\n", "- Yes we can, and we call this **LOGISTIC REGRESSION**\n", "- In logistic regression we model the __*probabilities*__ of the different outcomes\n", "- Let's take a look at some toy data classifying tumor state based on tumor size\n", "- Our goal in logistic regression is to model the __*probability*__ that a tumor of a particular size is *Malignant*" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- We want to model probabilities\n", "- So, the first thing we can do is encode our class labels as 0 (benign) and 1 (malignant) to make them numeric, ranging from 0 to 1\n", "\n", " " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- We can then fit a least-squares linear regression model as normal...\n", "- (By the way, R encodes categorical data 
automatically when fitting a linear model using lm(), Python does not)\n", "\n", " " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- Okay, so we've now fit a line to our data. How can we use it to predict a class, benign (0) or malignant (1)?\n", "- Well, we can simply use the threshold P(malignant) = 0.5 as our \"decision boundary\"\n", "- If P(malignant) < 0.5, call the tumor benign\n", "- If P(malignant) > 0.5, call the tumor malignant\n", "\n", " " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- But there are a few problems with our simple least squares regression approach..." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Problem 1: Probabilities < 0 or > 1 don't make sense\n", "- Our predictions of P(malignant) are not confined between 0 and 1\n", "- The actual equation of the above line is\n", "P(malignant) = -0.2 + 0.13*tumor\\_size\n", "\n", "\n", "- So for a very small tumor size, we can have P(malignant) < 0\n", "- And for a very large tumor size, we could predict P(malignant) > 1, which doesn't make sense" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Problem 2: Our model is not robust to new data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- What happens if we get some new data like the following?\n", "- We fit a line to the data, which is now quite different from our previous line\n", "\n", " " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- If we try to use our 0.5 threshold to split the data again, our classifier now makes mistakes\n", "- Our classifier is not very robust to changes in the data\n", "- We could change the threshold now, to say 0.35, but that would be impractical: we can't change the model every time we see new data!\n", "- Intuitively in this case, we want to be \"more sure\" that those very large tumors are malignant than we are that those tumors around 5-10cm are malignant\n", "\n", " " ] }, { "cell_type": "markdown", "metadata": {}, 
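"source": [ "- As a quick aside, Problem 1 is easy to demonstrate in code: the sketch below (with made-up tumor sizes, not the data plotted above) fits ordinary least squares to 0/1 labels and happily predicts \"probabilities\" below 0 and above 1 for extreme tumor sizes" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "from sklearn.linear_model import LinearRegression\n", "\n", "# Made-up tumor sizes (cm): 0 = benign, 1 = malignant\n", "X = np.array([[1], [2], [3], [6], [7], [8]])\n", "y = np.array([0, 0, 0, 1, 1, 1])\n", "ols = LinearRegression().fit(X, y)\n", "# Predictions for a tiny and a very large tumor fall outside [0, 1]\n", "ols.predict([[0.1], [20]])" ] }, { "cell_type": "markdown", "metadata": {},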
"source": [ "### Problem 3: The law of diminishing returns" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- In the linear regression above, a particular change in tumor size always produces the same change in P(malignant) (we have a constant slope)\n", "- e.g., a change in tumor size of 5cm produces a change in $P(\\textrm{malignant})$ of ~0.25\n", "- More naturally, we generally expect there to be \"diminishing returns\"\n", "- If P(malignant) is already large or small, we expect that a bigger change in **Tumor size** is required to result in a significant change in P(malignant)\n", "- Analogy: a spring. If we apply a constant force to stretch a spring, it expands a lot at first, but the more it stretches, the less that same force will stretch it further" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### The solution" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- To address the problems above, we need:\n", " 1. our predictions to be bounded between 0 and 1, and\n", " 2. 
predictions to change rapidly around 0.5 (the threshold) and more slowly away from 0.5 \n", "- In logistic regression these requirements are addressed by the logistic function\n", "$$P(x)=\\frac{1}{1+e^{-(\\beta_0+\\beta_1x)}}$$\n", "- This function **squashes** the output of linear regression into an \"S-shape\" between 0 and 1.\n", "\n", " " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- So if we use logistic regression for our toy problem above, we get the following plot, which addresses all of our previous problems!\n", "- We still use 0.5 as a threshold in a logistic model (this is the norm in practice, but it can be changed)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Summary\n", "- Logistic regression is an extension of linear regression to classification problems\n", "- In binary classification problems, logistic regression predicts the probability of *one* label of the response variable (you can think of it as a reference class) and uses a threshold to make a classification\n", "- The probability of the non-reference label is simply $1-P(\\textrm{reference class})$\n", "- Logistic regression can be used for multi-class problems (more on that in a later lecture)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 3.2 A bit of maths" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Calculating the coefficients in logistic regression\n", "- In least squares regression, we have nice formulae to help us calculate the coefficients $\\beta_i$\n", "- Unfortunately, there are no closed-form solutions for the coefficients $\\beta_i$ in logistic regression\n", "- Instead, calculating the coefficients $\\beta_i$ is an optimization problem solved via maximum likelihood (not discussed in this course, but [here is a great video on the topic](https://www.youtube.com/watch?v=BfKanl1aSG0) and [here is a more complicated 
proof](https://czep.net/stat/mlelr.pdf))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Interpreting the coefficients in logistic regression\n", "- The logistic function written above can be reformulated as follows:\n", "$$P(x)=\\frac{1}{1+e^{-(\\beta_0+\\beta_1x)}}$$\n", "\n", "$$\\log\\left(\\frac{P(x)}{1-P(x)}\\right)=\\beta_0+\\beta_1x$$\n", "\n", "- $\\frac{P(x)}{1-P(x)}$ is known as the \"odds\"\n", "- We hear about odds all the time, especially in betting/gambling situations\n", "- If a particular sports team has a probability of winning of 0.8, then the odds are $\\frac{0.8}{1-0.8}=4$, sometimes read as 4:1, i.e., the team has 4 times as much chance of winning as it does of losing\n", "- The point I want to make here is that, in contrast to least squares regression, a one-unit increase in a feature $x$ does not cause a $\\beta_i$ increase in $P(x)$\n", "- Instead, a one-unit change in a feature value multiplies the odds by $e^{\\beta_i}$ (proof [here](https://christophm.github.io/interpretable-ml-book/logistic.html#interpretation-1))\n", "- Put simply:\n", " - Negative coefficients decrease the probability of the response as a feature increases\n", " - Positive coefficients increase the probability of the response as a feature increases" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 3.3 Cities example" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- Let's check out a simple example of logistic regression on our now-familiar cities dataset" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Load in the necessary packages\n", "import numpy as np\n", "import pandas as pd\n", "import altair as alt\n", "from sklearn.model_selection import train_test_split\n", "import warnings\n", "from sklearn.exceptions import DataConversionWarning\n", "warnings.filterwarnings(action='ignore', category=DataConversionWarning)\n", 
"warnings.filterwarnings(action='ignore', category=FutureWarning)\n", "import sys\n", "sys.path.append('code/')\n", "from model_plotting import plot_model" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- The code below loads the data csv and splits it into training and testing sets" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Load the data\n", "df = pd.read_csv('data/cities_USA.csv', index_col=0)\n", "X = df.drop(columns=['vote'])\n", "y = df[['vote']]\n", "# We will always split the data as our first step from now on!\n", "X_train, X_test, y_train, y_test = train_test_split(X,\n", " y,\n", " test_size=0.2,\n", " random_state=123)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- We won't be needing a validation set here or any cross-validation functions because we are not \"tuning\" any hyperparameters\n", "- There are actually a few hyperparameters we can tune in the LogisticRegression classifier, but they usually don't impact the model much and so often aren't tuned" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.linear_model import LogisticRegression\n", "model = LogisticRegression().fit(X_train, y_train)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- Let's plot our model to see how it's behaving" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "
\n", "" ], "text/plain": [ "alt.LayerChart(...)" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "plot_model(X_train, y_train, model)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- Let's find the error rate of the model on the test data\n", "- It's not too bad at all for such a simple, quick model!" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Error rate = 0.22\n" ] } ], "source": [ "print(f\"Error rate = {1 - model.score(X_test, y_test):.2f}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- Remember that logistic regression predicts probabilities\n", "- So we can also get a nice map of predicted probabilities from our model\n", "- This map looks just like our logistic function:\n", " - Probabilities are around 0.5 at the decision boundary\n", " - They increase/decrease rapidly away from the boundary" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "
\n", "" ], "text/plain": [ "alt.LayerChart(...)" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "plot_model(X_train, y_train, model, predict_proba=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- With our cities dataset we've created a logistic regression model with two features\n", "- So we have 3 coefficients:\n", " 1. $\\beta_0$ = the intercept\n", " 2. $\\beta_1$ = coefficient for lon\n", " 3. $\\beta_2$ = coefficient for lat\n", "- Looking at the plot above, do you expect the coefficients to be positive or negative?\n", "- To answer this, we have to find out what the reference response of our model is" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array(['red'], dtype=object)" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "model.predict(np.atleast_2d([-100, 35]))" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([[0.32795903, 0.67204097]])" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "model.predict_proba(np.atleast_2d([-100, 35]))" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array(['blue', 'red'], dtype=object)" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "model.classes_" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- The reference response is the second class label, so our model is predicting the probability of **red** (that's why my plot above is plotting $P(red)$)\n", "- (the probability of **blue** is simply $1 - P(red)$)\n", "- So here we expect both coefficients to be **negative**!\n", "- Because as we increase lon or lat, the probability of **red** decreases\n", "- We can check our answer and access our model coefficients using model.coef_" ] 
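}, { "cell_type": "markdown", "metadata": {}, "source": [ "- As a side note, the probability a logistic model outputs is just the logistic function applied to the linear score $\\beta_0+\\beta_1x_1+\\beta_2x_2$, and a one-unit feature increase multiplies the odds by $e^{\\beta}$\n", "- Here is a minimal sketch of that relationship with made-up coefficients (hypothetical numbers, not our fitted cities model):" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "\n", "def sigmoid(z):\n", "    # squash any real number into (0, 1)\n", "    return 1 / (1 + np.exp(-z))\n", "\n", "# Made-up coefficients and feature values (not the fitted model)\n", "beta_0, beta_lon, beta_lat = 0.5, -0.65, -0.77\n", "lon, lat = 0.2, -1.0\n", "p = sigmoid(beta_0 + beta_lon * lon + beta_lat * lat)  # P(reference class)\n", "odds = p / (1 - p)\n", "# Increasing lon by one unit multiplies the odds by e**beta_lon\n", "p_new = sigmoid(beta_0 + beta_lon * (lon + 1) + beta_lat * lat)\n", "odds_new = p_new / (1 - p_new)\n", "odds_new / odds, np.exp(beta_lon)" ]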
}, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ " Lon Lat\n", "0 -0.049315 -0.135467" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.DataFrame(model.coef_, columns=['Lon', 'Lat'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- Note that we didn't scale our data, so it's difficult to compare the magnitudes of these coefficients directly\n", "- The code below re-fits the model with scaled data\n", "- We see that our coefficients are fairly similar here (which makes sense, as our decision boundary in the plot above is a diagonal line)\n", "- Lat has a bit more influence over the response than Lon (which also makes sense, because our diagonal decision boundary is slightly more horizontal than it is vertical)" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/plain": [ " Lon Lat\n", "0 -0.65335 -0.768348" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from sklearn.preprocessing import StandardScaler\n", "scaler = StandardScaler()\n", "X_train = scaler.fit_transform(X_train)\n", "X_test = scaler.transform(X_test)\n", "model = LogisticRegression().fit(X_train, y_train)\n", "pd.DataFrame(model.coef_, columns=['Lon', 'Lat'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 3.4 A real-world case study: predicting NFL field goals" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- In the data folder I've included some data on field goals from the NFL\n", "- This data is a record of field goal attempts in the 2008 NFL season, compiled by the University of Florida and available [here](http://users.stat.ufl.edu/~winner/data/)\n", "- Let's load in the data and check it out" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
[table display omitted: HTML output garbled in source]\n", "1039 rows × 7 columns
\n", "