\n",
"\n",
"\n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
"

\n",
"

"
],
"text/plain": [
" X y\n",
"0 URGENT!! As a valued network customer you have... spam\n",
"1 Lol you are always so convincing. non spam\n",
"2 Sauder has interesting courses. non spam\n",
"3 URGENT! You have won a 1 week FREE membership ... spam\n",
"4 Had your mobile 11 months or more? U R entitle... spam\n",
"5 Sauder has been interesting so far. non spam"
]
},
"execution_count": 1,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import pandas as pd\n",
"\n",
"\n",
"df = pd.DataFrame({'X': [\n",
" \"URGENT!! As a valued network customer you have been selected to receive a £900 prize reward!\",\n",
" \"Lol you are always so convincing.\",\n",
" \"Sauder has interesting courses.\",\n",
" \"URGENT! You have won a 1 week FREE membership in our £100000 prize Jackpot!\",\n",
" \"Had your mobile 11 months or more? U R entitled to Update to the latest colour mobiles with camera for Free!\",\n",
" \"Sauder has been interesting so far.\"],\n",
" 'y': [\"spam\", \"non spam\", \"non spam\", \"spam\", \"spam\", \"non spam\"]})\n",
"df"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We know that we need to encode categorical data and transform it to numeric data to use it with machine learning since categorical columns throw an error when we try to fit our model.\n",
"\n",
"This sounds like a job for `CountVectorizer()` since we have words that need to be converted into numerical features! \n",
"\n",
"Here we are going to set `max_features=2`\n",
"to create a toy example that is easy to follow in our manual calculations.\n",
"We are also setting `stop_words='english'`\n",
"so we are getting meaningful words as features and not commonly used words\n",
"such as \"and\", \"or\", etc\n",
"(these are referred to as \"stop words\")."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"X | y | |
---|---|---|

0 | URGENT!! As a valued network customer you have... | spam |

1 | Lol you are always so convincing. | non spam |

2 | Sauder has interesting courses. | non spam |

3 | URGENT! You have won a 1 week FREE membership ... | spam |

4 | Had your mobile 11 months or more? U R entitle... | spam |

5 | Sauder has been interesting so far. | non spam |

\n",
"\n",
"\n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
"

\n",
"

"
],
"text/plain": [
" sauder urgent target\n",
"URGENT!! As a valued network customer you have ... 0 1 spam\n",
"Lol you are always so convincing. 0 0 non spam\n",
"Sauder has interesting courses. 1 0 non spam\n",
"URGENT! You have won a 1 week FREE membership i... 0 1 spam\n",
"Had your mobile 11 months or more? U R entitled... 0 0 spam\n",
"Sauder has been interesting so far. 1 0 non spam"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from sklearn.feature_extraction.text import CountVectorizer\n",
"\n",
"\n",
"# Transform the data with the count vectorizer\n",
"count_vect = CountVectorizer(max_features=2, stop_words='english')\n",
"data = count_vect.fit_transform(df['X']).toarray() # Returns a sparse matric which we convert to an array\n",
"\n",
"# Put together a df with the results\n",
"train_bow_df = pd.DataFrame(data, columns=count_vect.get_feature_names_out(), index=df['X'].tolist())\n",
"train_bow_df['target'] = df['y'].tolist() # tolist() needed since indices are different\n",
"train_bow_df"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Suppose we are given 2 new text messages and we want to find the targets for these examples, how do we do it using naive Bayes?\n",
"\n",
"First, let's get a numeric representation of our made-up new text messages. "
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"sauder | urgent | target | |
---|---|---|---|

URGENT!! As a valued network customer you have been selected to receive a £900 prize reward! | 0 | 1 | spam |

Lol you are always so convincing. | 0 | 0 | non spam |

Sauder has interesting courses. | 1 | 0 | non spam |

URGENT! You have won a 1 week FREE membership in our £100000 prize Jackpot! | 0 | 1 | spam |

Had your mobile 11 months or more? U R entitled to Update to the latest colour mobiles with camera for Free! | 0 | 0 | spam |

Sauder has been interesting so far. | 1 | 0 | non spam |

\n",
"\n",
"\n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
"

\n",
"

"
],
"text/plain": [
" sauder urgent\n",
"URGENT! Free!! 0 1\n",
"I like Sauder 1 0"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"test_texts = [\"URGENT! Free!!\", \"I like Sauder\"]\n",
"data = count_vect.transform(test_texts).toarray()\n",
"test_bow_df = pd.DataFrame(data, columns=count_vect.get_feature_names_out(), index=test_texts)\n",
"test_bow_df"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's look at the text: \"**URGENT! Free!!**\" and ask the question: \"Is this message **spam**?\""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"What we want to use the Naive Bayes algorithm for\n",
"is figuring out the probability that a text message is either spam or not spam **given that** it contains the words \"URGENT! Free!!\", which can be expressed like so (the `|` means \"given that\"/\"condition upon\"):\n",
"\n",
"$$P(\\textrm{spam}|\\textrm{\"URGENT! Free!!\"})$$\n",
"\n",
"$$ \\text{and} $$\n",
"\n",
"$$P(\\textrm{non spam}|\\textrm{\"URGENT! Free!!\"})$$\n",
"\n",
"Once we have calculated these probabilities,\n",
"we compare the probabilities for spam and non-spam and classify each text according to the largest probability.\n",
"\n",
"So what does the calculation for the probabilities look like?\n",
"Naive Bayes relies on Bayes Theorem to compute these numbers,\n",
"and it looks like this (the posterior probability corresponds to our equations above):"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
" \n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Substituting into Bayes rule for the \"Urgent Free\" message, we get the following for computating the probability that this message is spam:\n",
"\n",
"$$\\frac{P(\\text{sauder} = 0, \\text{urgent} = 1 |\\textrm{spam})*P(\\textrm{spam})}{P(\\text{sauder} = 0, \\text{urgent} = 1 )}$$\n",
"\n",
"And a similar equation for whether it is not spam\n",
"\n",
"$$\\frac{P(\\text{sauder} = 0, \\text{urgent} = 1 |\\textrm{non-spam})*P(\\textrm{non-spam})}{P(\\text{sauder} = 0, \\text{urgent} = 1 )}$$"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Since our question is \"Is this message spam\" we want to compare this probabilities\n",
"and answer \"yes\" if the spam-probability is larger than the non-spam probability.\n",
"In other words,\n",
"the message is spam if the following condition is true:\n",
"\n",
"\n",
"$$\\frac{P(\\text{sauder} = 0, \\text{urgent} = 1 |\\textrm{spam})*P(\\textrm{spam})}{P(\\text{sauder} = 0, \\text{urgent} = 1 )} > \\frac{P(\\text{sauder} = 0, \\text{urgent} = 1 |\\textrm{non-spam})*P(\\textrm{non-spam})}{P(\\text{sauder} = 0, \\text{urgent} = 1 )}$$"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now, there are two reasons naive Bayes is so easy:\n",
"1. We can cancel out the denominator which leads us to this: \n",
"\n",
"$$P(\\text{sauder} = 0, \\text{urgent} = 1 |\\textrm{spam})*P(\\textrm{spam})> P(\\text{sauder} = 0, \\text{urgent} = 1 |\\textrm{non-spam})*P(\\textrm{non-spam})$$\n",
"\n",
"2. We can simplify the numerator via the Naive Bayes approximation."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Naive Bayes' approximation\n",
"\n",
"The reason for the name \"Naive\" is that word order does not matter in the calculation\n",
"and that we can assume each feature (word) is conditionally independent\n",
"(assume that all features in $X$ are mutually independent, conditional on the target class).\n",
"It might sound too simplistic to not care about word order and grammatical rules,\n",
"but this has shown to work well enough in practice and dramatically simplifies our calculation."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"$$\n",
"P(\\text{sauder} = 0, \\text{urgent} = 1 \\mid \\text{spam}) = P(\\text{sauder} = 0 \\mid \\text{spam}) * P(\\text{urgent} = 1 \\mid \\text{spam})\n",
"$$\n",
"\n",
"\n",
"\n",
"If we don't assume independence, it's a much bigger probability space, so there won't be enough training examples to learn. We'd need examples for every combination of words occurring together. \n",
"\n",
"\n",
"Now we just need to calculate the probabilities for each word individually from the training data!"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Estimating $P(\\text{spam} \\mid \\text{message})$ (The left side of our equation)\n",
"\n",
"$$P(\\text{sauder} = 0 \\mid \\text{spam}) * P(\\text{urgent} = 1 \\mid \\text{spam})*P(\\textrm{spam}) $$ \n",
"\n",
"We need the following: "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"1. Prior probability:\n",
" - $P(\\text{spam})$\n",
"2. Conditional probabilities:\n",
" - $P(\\text{sauder} = 0 \\mid \\text{spam})$\n",
" - $P(\\text{urgent} = 1 \\mid \\text{spam})$\n",
" \n",
"Let's remind ourselves of what our data looks like:"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"sauder | urgent | |
---|---|---|

URGENT! Free!! | 0 | 1 |

I like Sauder | 1 | 0 |

\n",
"\n",
"\n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
"

\n",
"

"
],
"text/plain": [
" sauder urgent target\n",
"URGENT!! As a valued network customer you have ... 0 1 spam\n",
"Lol you are always so convincing. 0 0 non spam\n",
"Sauder has interesting courses. 1 0 non spam\n",
"URGENT! You have won a 1 week FREE membership i... 0 1 spam\n",
"Had your mobile 11 months or more? U R entitled... 0 0 spam\n",
"Sauder has been interesting so far. 1 0 non spam"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"train_bow_df"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"- Prior probability (what proportion of messages are spam, aka what we would guess with no information about the words in the message)\n",
" - $P(\\text{spam}) = 3/6$\n",
" \n",
"- Conditional probabilities\n",
" - $P(\\text{sauder} = 0 \\mid \\text{spam}) = 3/3$ \n",
" - Given target is spam, how often is \"sauder\"= 0?\n",
" - $P(\\text{urgent} = 1 \\mid \\text{spam}) = 2/3$"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now we have everything we need to do our calculations!"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"$$P(\\text{sauder} = 0 \\mid \\text{spam}) * P(\\text{urgent} = 1 \\mid \\text{spam})*P(\\textrm{spam}) = \\frac{3}{3} * \\frac{2}{3} * \\frac{3}{6}$$"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0.3333333333333333"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"sauder0_spam = 3/3\n",
"urgent1_spam = 2/3\n",
"spam_prior = 3/6\n",
"spam_prob = sauder0_spam * urgent1_spam * spam_prior\n",
"spam_prob"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Remember that we simplified away the denominator,\n",
"so the number above doesn't correspond to an actual probability,\n",
"but we can still use it to compare the estimation of spam versus non-spam.\n",
"\n",
"Ok, So we've done our left side! Now we have to do the right!"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Estimating $P(\\text{non spam} \\mid \\text{message})$ (The right side of our equation)\n",
"\n",
"$$P(\\text{sauder} = 0 \\mid \\text{non-spam}) * P(\\text{urgent} = 1 \\mid \\text{non-spam})*P(\\textrm{non-spam}) $$ \n",
"\n",
"Now we need the following:\n",
"\n",
"1. Prior probability:\n",
" - $P(\\text{non spam})$ \n",
"2. Conditional probabilities: \n",
" - $P(\\text{sauder} = 0 \\mid \\text{non spam})$\n",
" - $P(\\text{urgent} = 1 \\mid \\text{non spam})$\n",
"\n",
"Again we use the data to calculate these probabilities. "
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"sauder | urgent | target | |
---|---|---|---|

URGENT!! As a valued network customer you have been selected to receive a £900 prize reward! | 0 | 1 | spam |

Lol you are always so convincing. | 0 | 0 | non spam |

Sauder has interesting courses. | 1 | 0 | non spam |

URGENT! You have won a 1 week FREE membership in our £100000 prize Jackpot! | 0 | 1 | spam |

Had your mobile 11 months or more? U R entitled to Update to the latest colour mobiles with camera for Free! | 0 | 0 | spam |

Sauder has been interesting so far. | 1 | 0 | non spam |

\n",
"\n",
"\n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
"

\n",
"

"
],
"text/plain": [
" sauder urgent target\n",
"URGENT!! As a valued network customer you have ... 0 1 spam\n",
"Lol you are always so convincing. 0 0 non spam\n",
"Sauder has interesting courses. 1 0 non spam\n",
"URGENT! You have won a 1 week FREE membership i... 0 1 spam\n",
"Had your mobile 11 months or more? U R entitled... 0 0 spam\n",
"Sauder has been interesting so far. 1 0 non spam"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"train_bow_df"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"- Prior probability \n",
" - $P(\\text{non spam}) = 3/6$\n",
"- Conditional probabilities \n",
" - $P(\\text{sauder} = 0 \\mid \\text{non spam}) =1/3$\n",
" - Given the target is non spam, how ofter is \"sauder\"=0?\n",
" - $P(\\text{urgent} = 1 \\mid \\text{non spam}) = 0/3$"
]
},
{
"cell_type": "markdown",
"metadata": {
"tags": []
},
"source": [
"Time for our calculation:\n",
"\n",
"$$P(\\text{sauder} = 0 \\mid \\text{non-spam}) * P(\\text{urgent} = 1 \\mid \\text{non-spam})*P(\\textrm{non-spam}) = \\frac{1}{3} * \\frac{0}{3} * \\frac{3}{6}$$"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0.0"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"non_spam_prior = 3/6\n",
"sauder0_non_spam = 1/3\n",
"urgent1_non_spam = 0/3\n",
"non_spam_prob = non_spam_prior * sauder0_non_spam * urgent1_non_spam\n",
"non_spam_prob"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"So for the question: \"Is the text 'Urgent!! Free!' spam\", our initial equation: \n",
"\n",
"$$P(\\text{sauder} = 0, \\text{urgent} = 1 |\\textrm{spam})*P(\\textrm{spam})> P(\\text{sauder} = 0, \\text{urgent} = 1 |\\textrm{non-spam})*P(\\textrm{non-spam})$$\n",
"\n",
"has been calculated to \n",
"\n",
"0.33333... > 0.0\n",
"\n",
"Which is True (0.333 is bigger than 0), which means that the answer for this text is \"Yes it to be classified as spam\"."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now let's see how we can do this in sklearn and see if the results are the same."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Naive Bayes classifier in sklearn\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's split up our data into our features and targets:"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [],
"source": [
"X_train = train_bow_df.drop(columns='target')\n",
"y_train = train_bow_df['target']"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here I am selecting the first row of our test set which was the **URGENT! Free!!** text. "
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"sauder | urgent | target | |
---|---|---|---|

URGENT!! As a valued network customer you have been selected to receive a £900 prize reward! | 0 | 1 | spam |

Lol you are always so convincing. | 0 | 0 | non spam |

Sauder has interesting courses. | 1 | 0 | non spam |

URGENT! You have won a 1 week FREE membership in our £100000 prize Jackpot! | 0 | 1 | spam |

Had your mobile 11 months or more? U R entitled to Update to the latest colour mobiles with camera for Free! | 0 | 0 | spam |

Sauder has been interesting so far. | 1 | 0 | non spam |

\n",
"\n",
"\n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
"

\n",
"

"
],
"text/plain": [
" sauder urgent\n",
"URGENT! Free!! 0 1"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"test_bow_df.iloc[[0]]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The main Naive Bayes classifier in sklearn is called `MultinomialNB` and exists in the `naive_bayes` module.\n",
"Here we use it to predict the class label of our test text-message."
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array(['spam'], dtype='sauder | urgent | |
---|---|---|

URGENT! Free!! | 0 | 1 |

non spam | spam | |
---|---|---|

0 | 0.25 | 0.75 |

\n",
"\n",
"\n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
"

\n",
"

"
],
"text/plain": [
" text target\n",
"0 YOU THERE, PACHIRISU PUNK, PREPARE TO BE DESTR... 0\n",
"1 Face absolutely flattened against the glass of... 0\n",
"2 Bruhhhh I screamed when she said that 😭 MY HEA... 0\n",
"3 Granting warrants to \"authorise police to ente... 0\n",
"4 Ang lala hahaha I woke up to a deluge of death... 0\n",
"... ... ...\n",
"3995 As it seems to be fairly contagious, I'm think... 1\n",
"3996 #BoundBrookFire Firefighters from several diff... 1\n",
"3997 It is turning out to be a very violent storm a... 1\n",
"3998 A raging fire in Bound Brook, New Jersey, on S... 1\n",
"3999 Hazardous eruption a possibility after Philipp... 1\n",
"\n",
"[4000 rows x 2 columns]"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"tweets_df = pd.read_csv(\"data/tweets_mod.csv\")\n",
"tweets_df"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's split it into our training and test sets as well as our features and target objects. "
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"text | target | |
---|---|---|

0 | YOU THERE, PACHIRISU PUNK, PREPARE TO BE DESTR... | 0 |

1 | Face absolutely flattened against the glass of... | 0 |

2 | Bruhhhh I screamed when she said that 😭 MY HEA... | 0 |

3 | Granting warrants to \"authorise police to ente... | 0 |

4 | Ang lala hahaha I woke up to a deluge of death... | 0 |

... | ... | ... |

3995 | As it seems to be fairly contagious, I'm think... | 1 |

3996 | #BoundBrookFire Firefighters from several diff... | 1 |

3997 | It is turning out to be a very violent storm a... | 1 |

3998 | A raging fire in Bound Brook, New Jersey, on S... | 1 |

3999 | Hazardous eruption a possibility after Philipp... | 1 |

4000 rows × 2 columns

\n", "\n",
"\n",
"\n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
"

\n",
"

"
],
"text/plain": [
" text target\n",
"1420 How low have you sunk Alice, just clickbait fo... 0\n",
"1638 Watching this tonight as I was working yesterd... 0\n",
"616 January 14, 2020 at about 08:30 am, personnel ... 0\n",
"184 Next oil spill you drone strike the CEO's neig... 0\n",
"2075 Another 6.0 aftershock has hit Puerto Rico aft... 1\n",
"... ... ...\n",
"1122 Aftershock comics. We prefer working in partne... 0\n",
"1346 Two platforms collide to do good, how awesome ... 0\n",
"3454 More than 23,000 people have been evacuated an... 1\n",
"3437 I’m traumatised 😭 1\n",
"3582 A volcano near the Philippine capital is spewi... 1\n",
"\n",
"[3200 rows x 2 columns]"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from sklearn.model_selection import train_test_split\n",
"\n",
"train_df, test_df = train_test_split(tweets_df, test_size=0.2, random_state=123)\n",
"X_train, y_train = train_df[\"text\"], train_df[\"target\"]\n",
"X_test, y_test = test_df[\"text\"], test_df[\"target\"]\n",
"train_df"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Next, we make a pipeline and cross-validate!"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"text | target | |
---|---|---|

1420 | How low have you sunk Alice, just clickbait fo... | 0 |

1638 | Watching this tonight as I was working yesterd... | 0 |

616 | January 14, 2020 at about 08:30 am, personnel ... | 0 |

184 | Next oil spill you drone strike the CEO's neig... | 0 |

2075 | Another 6.0 aftershock has hit Puerto Rico aft... | 1 |

... | ... | ... |

1122 | Aftershock comics. We prefer working in partne... | 0 |

1346 | Two platforms collide to do good, how awesome ... | 0 |

3454 | More than 23,000 people have been evacuated an... | 1 |

3437 | I’m traumatised 😭 | 1 |

3582 | A volcano near the Philippine capital is spewi... | 1 |

3200 rows × 2 columns

\n", "\n",
"\n",
"\n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
"

\n",
"

"
],
"text/plain": [
" fit_time score_time test_score train_score\n",
"0 0.057878 0.010434 0.796875 0.948438\n",
"1 0.050771 0.009572 0.801562 0.948438\n",
"2 0.047406 0.009335 0.801562 0.946875\n",
"3 0.045360 0.008559 0.837500 0.945703\n",
"4 0.045094 0.008602 0.814063 0.944531"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from sklearn.pipeline import make_pipeline\n",
"from sklearn.model_selection import cross_validate\n",
"\n",
"\n",
"pipe_nb = make_pipeline(\n",
" CountVectorizer(),\n",
" MultinomialNB()\n",
")\n",
"scores = cross_validate(pipe_nb, X_train, y_train, return_train_score=True)\n",
"pd.DataFrame(scores)"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"fit_time 0.049302\n",
"score_time 0.009300\n",
"test_score 0.810312\n",
"train_score 0.946797\n",
"dtype: float64"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.DataFrame(scores).mean()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"It looks like we are overfitting to the training data\n",
"and would be advised to try tune/optimize our hyperparameters\n",
"(such as the amount of noise added)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Let's Practice\n",
"\n",
"Using naive Bayes by hand, what class would naive Bayes predict for the second test text message: \"I like Sauder\"?"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"fit_time | score_time | test_score | train_score | |
---|---|---|---|---|

0 | 0.057878 | 0.010434 | 0.796875 | 0.948438 |

1 | 0.050771 | 0.009572 | 0.801562 | 0.948438 |

2 | 0.047406 | 0.009335 | 0.801562 | 0.946875 |

3 | 0.045360 | 0.008559 | 0.837500 | 0.945703 |

4 | 0.045094 | 0.008602 | 0.814063 | 0.944531 |

sauder | urgent | target | |
---|---|---|---|

URGENT!! As a valued network customer you have been selected to receive a £900 prize reward! | 0 | 1 | spam |

Lol you are always so convincing. | 0 | 0 | non spam |

Sauder has interesting courses. | 1 | 0 | non spam |

URGENT! You have won a 1 week FREE membership in our £100000 prize Jackpot! | 0 | 1 | spam |

0 | 0 | spam | |

Sauder has been interesting so far. | 1 | 0 | non spam |

\n",
"\n",
"\n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
"

\n",
"

"
],
"text/plain": [
" sauder urgent\n",
"I like Sauder 1 0"
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"test_bow_df.iloc[[1]]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's set up some of the steps together: \n",
"\n",
"**spam side**\n",
"\n",
"1. Prior probability: \n",
" - $P(\\text{spam}) = $ \n",
"2. Conditional probabilities: \n",
" - $P(\\text{sauder} = 1 \\mid \\text{spam}) = $\n",
" - $P(\\text{urgent} = 0 \\mid \\text{spam}) = $\n",
"3. $P(\\textrm{spam}|\\text{sauder} = 1, \\text{urgent} = 0) = $\n",
"\n",
"\n",
"**non spam side** \n",
"\n",
"1. Prior probability: \n",
" - $P(\\text{non spam}) = $ \n",
"2. Conditional probabilities: \n",
" - $P(\\text{sauder} = 1 \\mid \\text{non spam}) = $ \n",
" - $P(\\text{urgent} = 0 \\mid \\text{non spam}) = $ \n",
"3. $P(\\textrm{non spam}|\\text{sauder} = 1, \\text{urgent} = 0) =$ \n",
"\n",
"\n",
"**Final Class** \n",
"\n",
"Which class's probability is greater?\n",
"\n",
"---"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"```{admonition} Solutions!\n",
":class: dropdown\n",
"\n",
"Spam:\n",
"\n",
"1.$3/6$ \n",
"2. $0/3$ and $1/3$ \n",
"3.$\\frac{0}{3} * \\frac{1}{3} *\\frac{3}{6} = 0$ \n",
"\n",
"Non-spam:\n",
"\n",
"1. $3/6$ \n",
"2. $2/3$ and $3/3$ \n",
"3. $\\frac{2}{3} * \\frac{3}{3} *\\frac{3}{6} = 1/3$ \n",
"\n",
"1/3 > 0 so the message is not classified as spam \n",
"\n",
"```"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Automated Hyperparameter Optimization\n",
"\n",
"So far we've seen quite a few different hyperparameters for different models:\n",
"\n",
"- `max_depth` and `min_samples_split` for decision trees. \n",
"- `n_neighbors` and `weights` for K-Nearest Neighbours.\n",
"- `gamma` and `C` for SVMs with RBF.\n",
"- `alpha` for NaiveBayes.\n",
"- We have also seen hyperparameters for our transformations like `strategy` for our `SimpleImputer()`. \n",
"\n",
"We have seen how important these are and that they can optimize your model,\n",
"but we haven't seen an effective way to optimize them;\n",
"so far we have only used primitive for loops.\n",
"Picking reasonable hyperparameters is important as it helps avoid underfit or overfit models."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### The problem with hyperparameters\n",
"\n",
"- We may have a lot of them.\n",
"- Nobody knows exactly how to choose them, there is no single function/formula to apply.\n",
"- May interact with each other in unexpected ways.\n",
"- The best settings depend on the specific data/problem.\n",
"- Can take a long time to execute."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### How to pick hyperparameters"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"- Manual hyperparameter optimization (What we've done so far)\n",
" - We may have some intuition about what might work. \n",
" - It takes a lot of work. \n",
" \n",
"**OR...**\n",
"\n",
"- **Automated hyperparameter optimization** (hyperparameter tuning)\n",
" - Reduce human effort. \n",
" - Less prone to error. \n",
" - Data-driven approaches may be effective. \n",
" - It may be hard to incorporate intuition. \n",
" - Overfitting on the validation set. "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Automated hyperparameter optimization\n",
"\n",
"- Exhaustive grid search: [`sklearn.model_selection.GridSearchCV`](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html)\n",
"\n",
"- Randomized hyperparameter optimization: `sklearn.model_selection.RandomizedSearchCV` \n",
"\n",
"The \"CV\" stands for cross-validation; these methods have built-in cross-validation"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Let's Apply it\n",
"\n",
"Let's bring back the cities dataset we worked with in previous lectures. "
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"sauder | urgent | |
---|---|---|

I like Sauder | 1 | 0 |

\n",
"\n",
"\n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
"

\n",
"

"
],
"text/plain": [
" longitude latitude\n",
"160 -76.4813 44.2307\n",
"127 -81.2496 42.9837\n",
"169 -66.0580 45.2788\n",
"188 -73.2533 45.3057\n",
"187 -67.9245 47.1652\n",
".. ... ...\n",
"17 -76.3305 44.1255\n",
"98 -74.7287 45.0184\n",
"66 -121.4944 38.5816\n",
"126 -79.5656 43.6436\n",
"109 -66.9195 44.8938\n",
"\n",
"[167 rows x 2 columns]"
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"cities_df = pd.read_csv(\"data/canada_usa_cities.csv\")\n",
"train_df, test_df = train_test_split(cities_df, test_size=0.2, random_state=123)\n",
"X_train, y_train = train_df.drop(columns=['country']), train_df['country']\n",
"X_test, y_test = test_df.drop(columns=['country']), test_df['country']\n",
"X_train"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"160 Canada\n",
"127 Canada\n",
"169 Canada\n",
"188 Canada\n",
"187 Canada\n",
" ... \n",
"17 USA\n",
"98 Canada\n",
"66 USA\n",
"126 Canada\n",
"109 Canada\n",
"Name: country, Length: 167, dtype: object"
]
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"y_train"
]
},
{
"cell_type": "markdown",
"metadata": {
"tags": []
},
"source": [
"## Exhaustive grid search - Trying ALL the options"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We need to first decide on our model and which hyperparameters we want to tune. \n",
"Let's use an SVC classifier as an example here.\n",
"Next, we built a dictionary called `param_grid` and we specify the values we wish to look over for the hyperparameter. "
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [],
"source": [
"param_grid = {\"gamma\": [0.1, 1.0, 10, 100]}"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Then we pass our model to the `GridSearchCV` object."
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.svm import SVC\n",
"from sklearn.model_selection import GridSearchCV\n",
"\n",
"\n",
"grid_search = GridSearchCV(SVC(), param_grid, verbose=2)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Assigning `verbose=2` tells `GridSearchCV` to print some output while it's running.\n",
"To actually execute the grid search,\n",
"we need to call `fit` on the training data.\n",
"Remember that CV is built in,\n",
"so we don't need to worry about that."
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Fitting 5 folds for each of 4 candidates, totalling 20 fits\n",
"[CV] END ..........................................gamma=0.1; total time= 0.0s\n",
"[CV] END ..........................................gamma=0.1; total time= 0.0s\n",
"[CV] END ..........................................gamma=0.1; total time= 0.0s\n",
"[CV] END ..........................................gamma=0.1; total time= 0.0s\n",
"[CV] END ..........................................gamma=0.1; total time= 0.0s\n",
"[CV] END ..........................................gamma=1.0; total time= 0.0s\n",
"[CV] END ..........................................gamma=1.0; total time= 0.0s\n",
"[CV] END ..........................................gamma=1.0; total time= 0.0s\n",
"[CV] END ..........................................gamma=1.0; total time= 0.0s\n",
"[CV] END ..........................................gamma=1.0; total time= 0.0s\n",
"[CV] END ...........................................gamma=10; total time= 0.0s\n",
"[CV] END ...........................................gamma=10; total time= 0.0s\n",
"[CV] END ...........................................gamma=10; total time= 0.0s\n",
"[CV] END ...........................................gamma=10; total time= 0.0s\n",
"[CV] END ...........................................gamma=10; total time= 0.0s\n",
"[CV] END ..........................................gamma=100; total time= 0.0s\n",
"[CV] END ..........................................gamma=100; total time= 0.0s\n",
"[CV] END ..........................................gamma=100; total time= 0.0s\n",
"[CV] END ..........................................gamma=100; total time= 0.0s\n",
"[CV] END ..........................................gamma=100; total time= 0.0s\n"
]
},
{
"data": {
"text/plain": [
"GridSearchCV(estimator=SVC(), param_grid={'gamma': [0.1, 1.0, 10, 100]},\n",
" verbose=2)"
]
},
"execution_count": 22,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"grid_search.fit(X_train, y_train)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The nice thing about this is we can do this for multiple hyperparameters simultaneously as well."
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [],
"source": [
"param_grid = {\n",
" \"gamma\": [0.1, 1.0, 10, 100],\n",
" \"C\": [0.1, 1.0, 10, 100]\n",
"}\n",
"\n",
"# Setting n_jobs=-1 means to use all the CPU cores instead of just 1 (the default)\n",
"# This allows us to speed up the computation by performing tasks in parallel\n",
"grid_search = GridSearchCV(SVC(), param_grid, cv=3, verbose=2, n_jobs=-1)"
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Fitting 3 folds for each of 16 candidates, totalling 48 fits\n",
"[CV] END ...................................C=0.1, gamma=1.0; total time= 0.0s\n",
"[CV] END ...................................C=0.1, gamma=0.1; total time= 0.0s\n",
"[CV] END ....................................C=0.1, gamma=10; total time= 0.0s\n",
"[CV] END ...................................C=0.1, gamma=100; total time= 0.0s[CV] END ...................................C=0.1, gamma=100; total time= 0.0s\n",
"\n",
"[CV] END ...................................C=0.1, gamma=1.0; total time= 0.0s\n",
"[CV] END ...................................C=1.0, gamma=0.1; total time= 0.0s\n",
"[CV] END ...................................C=0.1, gamma=100; total time= 0.0s\n",
"[CV] END ...................................C=1.0, gamma=0.1; total time= 0.0s\n",
"[CV] END ...................................C=1.0, gamma=0.1; total time= 0.0s\n",
"[CV] END ...................................C=0.1, gamma=0.1; total time= 0.0s\n",
"[CV] END ...................................C=1.0, gamma=1.0; total time= 0.0s\n",
"[CV] END ...................................C=1.0, gamma=1.0; total time= 0.0s\n",
"[CV] END ...................................C=1.0, gamma=1.0; total time= 0.0s\n",
"[CV] END ....................................C=1.0, gamma=10; total time= 0.0s\n",
"[CV] END ....................................C=1.0, gamma=10; total time= 0.0s\n",
"[CV] END ....................................C=1.0, gamma=10; total time= 0.0s\n",
"[CV] END ...................................C=1.0, gamma=100; total time= 0.0s\n",
"[CV] END ...................................C=1.0, gamma=100; total time= 0.0s\n",
"[CV] END ...................................C=0.1, gamma=0.1; total time= 0.0s\n",
"[CV] END ...................................C=1.0, gamma=100; total time= 0.0s\n",
"[CV] END ....................................C=10, gamma=0.1; total time= 0.0s\n",
"[CV] END ....................................C=10, gamma=1.0; total time= 0.0s\n",
"[CV] END ....................................C=10, gamma=0.1; total time= 0.0s\n",
"[CV] END ....................................C=10, gamma=1.0; total time= 0.0s\n",
"[CV] END ....................................C=10, gamma=0.1; total time= 0.0s\n",
"[CV] END ....................................C=10, gamma=1.0; total time= 0.0s\n",
"[CV] END .....................................C=10, gamma=10; total time= 0.0s\n",
"[CV] END .....................................C=10, gamma=10; total time= 0.0s\n",
"[CV] END ....................................C=0.1, gamma=10; total time= 0.0s\n",
"[CV] END .....................................C=10, gamma=10; total time= 0.0s\n",
"[CV] END ....................................C=10, gamma=100; total time= 0.0s\n",
"[CV] END ....................................C=10, gamma=100; total time= 0.0s\n",
"[CV] END ....................................C=10, gamma=100; total time= 0.0s\n",
"[CV] END ...................................C=100, gamma=0.1; total time= 0.0s\n",
"[CV] END ...................................C=100, gamma=0.1; total time= 0.0s\n",
"[CV] END ...................................C=0.1, gamma=1.0; total time= 0.0s\n",
"[CV] END ...................................C=100, gamma=1.0; total time= 0.0s\n",
"[CV] END ...................................C=100, gamma=0.1; total time= 0.0s\n",
"[CV] END ...................................C=100, gamma=1.0; total time= 0.0s\n",
"[CV] END ...................................C=100, gamma=1.0; total time= 0.0s\n",
"[CV] END ....................................C=100, gamma=10; total time= 0.0s\n",
"[CV] END ....................................C=100, gamma=10; total time= 0.0s\n",
"[CV] END ....................................C=100, gamma=10; total time= 0.0s\n",
"[CV] END ...................................C=100, gamma=100; total time= 0.0s\n",
"[CV] END ...................................C=100, gamma=100; total time= 0.0s\n",
"[CV] END ...................................C=100, gamma=100; total time= 0.0s\n",
"[CV] END ....................................C=0.1, gamma=10; total time= 0.0s\n"
]
},
{
"data": {
"text/plain": [
"GridSearchCV(cv=3, estimator=SVC(), n_jobs=-1,\n",
" param_grid={'C': [0.1, 1.0, 10, 100],\n",
" 'gamma': [0.1, 1.0, 10, 100]},\n",
" verbose=2)"
]
},
"execution_count": 24,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"grid_search.fit(X_train, y_train)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The grid in `GridSearchCV` stands for the way that it’s checking the hyperparameters. \n",
"\n",
"Since there 4 options for each, grid search is checking every value in each hyperparameter to one another. \n",
"\n",
"That means it’s checking 4 x 4 = 16 different combinations of hyperparameter values for the model. \n",
"\n",
"In `GridSearchCV` we can specify the number of folds of cross-validation with the argument `cv`. \n",
"\n",
"Since we are specifying `cv=6` that means that fit is called a total of 48 times (16 different combinations x 3 cross-validation folds)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Implement hyperparameter tuning with Pipelines"
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.impute import SimpleImputer\n",
"from sklearn.preprocessing import StandardScaler\n",
"\n",
"\n",
"pipe = make_pipeline(\n",
" SimpleImputer(strategy=\"median\"),\n",
" StandardScaler(),\n",
" SVC()\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"After specifying the steps in a pipeline, a user must specify a set of values for each hyperparameter in `param_grid` as we did before but this time we specify the name of the step followed by two underscores `__` and the name of the hyperparameter.\n",
"\n",
"\n",
"This is because the pipeline would not know which hyperparameter goes with each step. Does `gamma` correspond to the hyperparameter in `SimpleImputer()` or `StandardScaler()`?\n",
"\n",
"This now gives the pipeline clear instructions on which hyperparameters correspond with which step. "
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {},
"outputs": [],
"source": [
"param_grid = {\n",
" \"svc__gamma\": [0.1, 1.0, 10, 100],\n",
" \"svc__C\": [0.1, 1.0, 10, 100]\n",
"}"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"When using `make_pipeline()` remember that the function names the steps by default the lower case name of each transformation or model. "
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Pipeline(steps=[('simpleimputer', SimpleImputer(strategy='median')),\n",
" ('standardscaler', StandardScaler()), ('svc', SVC())])"
]
},
"execution_count": 27,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pipe"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now when we initiate `GridSearchCV`, we set the first argument to the pipeline name instead of the model name this time. "
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Fitting 3 folds for each of 16 candidates, totalling 48 fits\n",
"[CV] END .........................svc__C=0.1, svc__gamma=0.1; total time= 0.0s\n",
"[CV] END .........................svc__C=0.1, svc__gamma=0.1; total time= 0.0s\n",
"[CV] END .........................svc__C=0.1, svc__gamma=1.0; total time= 0.0s\n",
"[CV] END .........................svc__C=0.1, svc__gamma=0.1; total time= 0.0s\n",
"[CV] END .........................svc__C=0.1, svc__gamma=1.0; total time= 0.0s\n",
"[CV] END ..........................svc__C=0.1, svc__gamma=10; total time= 0.0s\n",
"[CV] END ..........................svc__C=0.1, svc__gamma=10; total time= 0.0s\n",
"[CV] END ..........................svc__C=0.1, svc__gamma=10; total time= 0.0s\n",
"[CV] END .........................svc__C=1.0, svc__gamma=0.1; total time= 0.0s\n",
"[CV] END .........................svc__C=0.1, svc__gamma=100; total time= 0.0s\n",
"[CV] END .........................svc__C=0.1, svc__gamma=100; total time= 0.0s\n",
"[CV] END .........................svc__C=0.1, svc__gamma=100; total time= 0.0s\n",
"[CV] END .........................svc__C=0.1, svc__gamma=1.0; total time= 0.0s\n",
"[CV] END .........................svc__C=1.0, svc__gamma=1.0; total time= 0.0s\n",
"[CV] END .........................svc__C=1.0, svc__gamma=0.1; total time= 0.0s\n",
"[CV] END ..........................svc__C=1.0, svc__gamma=10; total time= 0.0s\n",
"[CV] END ..........................svc__C=1.0, svc__gamma=10; total time= 0.0s\n",
"[CV] END .........................svc__C=1.0, svc__gamma=1.0; total time= 0.0s\n",
"[CV] END ..........................svc__C=10, svc__gamma=0.1; total time= 0.0s\n",
"[CV] END ..........................svc__C=10, svc__gamma=1.0; total time= 0.0s[CV] END .........................svc__C=1.0, svc__gamma=100; total time= 0.0s\n",
"\n",
"[CV] END ..........................svc__C=10, svc__gamma=0.1; total time= 0.0s\n",
"[CV] END .........................svc__C=1.0, svc__gamma=100; total time= 0.0s\n",
"[CV] END ..........................svc__C=1.0, svc__gamma=10; total time= 0.0s\n",
"[CV] END .........................svc__C=1.0, svc__gamma=0.1; total time= 0.0s\n",
"[CV] END .........................svc__C=1.0, svc__gamma=1.0; total time= 0.0s\n",
"[CV] END ..........................svc__C=10, svc__gamma=0.1; total time= 0.0s\n",
"[CV] END .........................svc__C=1.0, svc__gamma=100; total time= 0.0s\n",
"[CV] END ..........................svc__C=10, svc__gamma=1.0; total time= 0.0s\n",
"[CV] END ..........................svc__C=10, svc__gamma=1.0; total time= 0.0s\n",
"[CV] END ...........................svc__C=10, svc__gamma=10; total time= 0.0s\n",
"[CV] END .........................svc__C=100, svc__gamma=0.1; total time= 0.0s[CV] END ..........................svc__C=10, svc__gamma=100; total time= 0.0s\n",
"\n",
"[CV] END .........................svc__C=100, svc__gamma=0.1; total time= 0.0s\n",
"[CV] END ...........................svc__C=10, svc__gamma=10; total time= 0.0s\n",
"[CV] END .........................svc__C=100, svc__gamma=1.0; total time= 0.0s\n",
"[CV] END ..........................svc__C=100, svc__gamma=10; total time= 0.0s\n",
"[CV] END ...........................svc__C=10, svc__gamma=10; total time= 0.0s\n",
"[CV] END ..........................svc__C=100, svc__gamma=10; total time= 0.0s\n",
"[CV] END .........................svc__C=100, svc__gamma=0.1; total time= 0.0s\n",
"[CV] END ..........................svc__C=10, svc__gamma=100; total time= 0.0s\n",
"[CV] END .........................svc__C=100, svc__gamma=1.0; total time= 0.0s\n",
"[CV] END ..........................svc__C=10, svc__gamma=100; total time= 0.0s\n",
"[CV] END .........................svc__C=100, svc__gamma=1.0; total time= 0.0s\n",
"[CV] END .........................svc__C=100, svc__gamma=100; total time= 0.0s\n",
"[CV] END ..........................svc__C=100, svc__gamma=10; total time= 0.0s\n",
"[CV] END .........................svc__C=100, svc__gamma=100; total time= 0.0s\n",
"[CV] END .........................svc__C=100, svc__gamma=100; total time= 0.0s\n"
]
}
],
"source": [
"grid_search = GridSearchCV(pipe, param_grid, cv=3, return_train_score=True, verbose=2, n_jobs=-1)\n",
"grid_search.fit(X_train, y_train);"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Looking a bit closer these are the steps being performed with `GridSearchCV`. \n",
"\n",
"\n",
"```\n",
"for gamma in [0.1, 1.0, 10, 100]:\n",
" for C in [0.1, 1.0, 10, 100]:\n",
" for fold in folds:\n",
" fit in training portion with the given C and gamma\n",
" score on validation portion\n",
" compute average score\n",
" pick hyperparameters with the best score\n",
"```"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Why a grid? "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Instead of going through all the combinations of hyperparamers in a grid,\n",
"you might think it could be more efficient to optimize one hyperparameter at a time,\n",
"and then use the best values together.\n",
"However,\n",
"since the interactions between hyperparameters can be unpredictable,\n",
"we are not guaranteed to arrive at the best combinatorial result if we look at a single parameter at a time.\n",
"As an example,\n",
"let's have a look at the grid below:\n",
"\n",
"\n",
"\n",
"If we fix `C` with a value of 1 and loop over the values of 1, 10 and 100 for `gamma`.\n",
"This results in `100` having the best score with 0.82. \n",
"\n",
"Next, we fix `gamma` at `100` since that was what we found was the most optimal when `C` was equal to 1. \n",
"When we loop over the values of 1, 10 and 100 for `C` we get the most optimal value to be 10. \n",
"So naturally, we would pick the values `100` for `gamma` and `10` for `C`. \n",
"\n",
"HOWEVER - if we had performed every possible combination, we would have seen that the optimal values would have actually been `10` for both `gamma` and `C`. \n",
"The same thing is shown if we did it the other way around, first fixing `gamma` at a value of 1 and then looping over all possible values of `C`. \n",
"This time the most optimal combination is `gamma` equal to 1 and `C` equal to 100 which is again not the optimal value of 10 for each. \n",
"\n",
"These combinatorial effects is why it is so important not to fix either of the hyperparameters since it won’t necessarily help you find the most optimal values. "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Now what?\n",
"\n",
"How do we know what the best hyperparameter values are after fitting?\n",
"\n",
"We can extract the best hyperparameter values with `.best_params_` and their corresponding score with `.best_score_`."
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'svc__C': 10, 'svc__gamma': 1.0}"
]
},
"execution_count": 29,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"grid_search.best_params_"
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0.8327922077922079"
]
},
"execution_count": 30,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"grid_search.best_score_"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can extract the optimal classifier inside with `.best_estimator_`.\n",
"This has already been fully fitted on with all the data and not just a portion from cross-validation so all we need to do is score! \n",
"Instead of extracting and saving the estimator in two steps,\n",
"we can use the `.score` method of the grid search object itself:"
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0.8502994011976048"
]
},
"execution_count": 31,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"grid_search.score(X_train, y_train)"
]
},
{
"cell_type": "code",
"execution_count": 32,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0.8333333333333334"
]
},
"execution_count": 32,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"grid_search.score(X_test, y_test)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The same can be done for `.predict()` as well, either using the saved model or using the `grid_search` object directly. "
]
},
{
"cell_type": "code",
"execution_count": 33,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array(['Canada', 'Canada', 'Canada', 'Canada', 'Canada', 'Canada',\n",
" 'Canada', 'Canada', 'Canada', 'USA', 'USA', 'Canada', 'Canada',\n",
" 'Canada', 'Canada', 'USA', 'Canada', 'USA', 'Canada', 'Canada',\n",
" 'Canada', 'Canada', 'Canada', 'Canada', 'Canada', 'Canada',\n",
" 'Canada', 'Canada', 'Canada', 'Canada', 'Canada', 'USA', 'Canada',\n",
" 'Canada', 'Canada', 'Canada', 'Canada', 'USA', 'USA', 'Canada',\n",
" 'Canada', 'Canada'], dtype=object)"
]
},
"execution_count": 33,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"grid_search.predict(X_test)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If we wanted to see all the fit combinations we could use `.cv_results_`."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Notice any problems? \n",
"\n",
"This seems pretty nice and obeys the golden rule.\n",
"However one issue is the execution time. \n",
"\n",
"Think about how much time it would take if we had 5 hyperparameters each with 10 different values.\n",
"That would mean we would be needing to call `cross_validate()` 100,000 times!\n",
"Exhaustive grid search may become infeasible fairly quickly.\n",
"\n",
"**Enter randomized hyperparameter search!**"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Randomized hyperparameter optimization"
]
},
{
"cell_type": "code",
"execution_count": 34,
"metadata": {},
"outputs": [],
"source": [
"param_grid = {\n",
" \"svc__gamma\": [0.1, 1.0, 10, 100],\n",
" \"svc__C\": [0.1, 1.0, 10, 100]\n",
"}"
]
},
{
"cell_type": "code",
"execution_count": 35,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Fitting 3 folds for each of 5 candidates, totalling 15 fits\n",
"[CV] END .........................svc__C=1.0, svc__gamma=100; total time= 0.0s\n",
"[CV] END .........................svc__C=1.0, svc__gamma=100; total time= 0.0s\n",
"[CV] END .........................svc__C=1.0, svc__gamma=100; total time= 0.0s\n",
"[CV] END .........................svc__C=100, svc__gamma=0.1; total time= 0.0s\n",
"[CV] END .........................svc__C=100, svc__gamma=100; total time= 0.0s\n",
"[CV] END .........................svc__C=100, svc__gamma=100; total time= 0.0s\n",
"[CV] END .........................svc__C=100, svc__gamma=100; total time= 0.0s\n",
"[CV] END .........................svc__C=100, svc__gamma=0.1; total time= 0.0s\n",
"[CV] END .........................svc__C=100, svc__gamma=0.1; total time= 0.0s\n",
"[CV] END .........................svc__C=0.1, svc__gamma=100; total time= 0.0s\n",
"[CV] END .........................svc__C=0.1, svc__gamma=100; total time= 0.0s\n",
"[CV] END .........................svc__C=0.1, svc__gamma=100; total time= 0.0s\n",
"[CV] END ..........................svc__C=10, svc__gamma=0.1; total time= 0.0s\n",
"[CV] END ..........................svc__C=10, svc__gamma=0.1; total time= 0.0s\n",
"[CV] END ..........................svc__C=10, svc__gamma=0.1; total time= 0.0s\n"
]
}
],
"source": [
"from sklearn.model_selection import RandomizedSearchCV\n",
"\n",
"\n",
"random_search = RandomizedSearchCV(pipe, param_grid, cv=3, verbose=2, n_jobs=-1, n_iter=5)\n",
"random_search.fit(X_train, y_train);"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Notice that we use the same arguments in `RandomizedSearchCV()` as in `GridSearchCV()` however with 1 new addition - `n_iter`. \n",
"This argument gives us more control and lets us restrict how many hyperparameter candidate values are searched over. \n",
"\n",
"`GridSearchCV()` conducts `cross_validate()` on every single possible combination of the hyperparameters specified in `param_grid`. \n",
"Now we can change and control that using `n_iter` which will pick a random subset containing the specified number of combinations.\n",
"\n",
"The last time when we used exhaustive grid search, we had 36 fits (4 x 4 x 3). \n",
"This time we see only 15 fits (5 x 3 instead of 16 x 3)! "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**How differently does exhaustive and random search score?** "
]
},
{
"cell_type": "code",
"execution_count": 36,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0.8333333333333334"
]
},
"execution_count": 36,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"grid_search.score(X_test, y_test)"
]
},
{
"cell_type": "code",
"execution_count": 37,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0.8095238095238095"
]
},
"execution_count": 37,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"random_search.score(X_test, y_test)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Although we could theoretically run a chance of missing the optimal combination of parameters when we are randomly picking,\n",
"randomized grid search does in practice produce scores that are very similar to an exhaustive search."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## The problem with hyperparameter tuning - overfitting the validation set\n",
"\n",
"Since we are repeating cross-validation over and over again, it’s not necessarily unseen data anymore.\n",
"\n",
"This may produce overly optimistic results. \n",
"\n",
"If our dataset is small and if our validation set is hit too many times, we suffer from **optimization bias** or **overfitting the validation set**. "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Example: overfitting the validation set\n",
"Attribution: [Mark Scmidt](https://www.cs.ubc.ca/~schmidtm/)\n",
"\n",
"This exercise helps explain the concept of overfitting on the validation set.\n",
"\n",
"Consider a multiple-choice (a,b,c,d) \"test\" with 10 questions:\n",
"\n",
"- If you choose answers randomly, the expected grade is 25% (no bias).\n",
"- If you fill out two tests randomly and pick the best, the expected grade is 33%.\n",
" - overfitting ~8%.\n",
"- If you take the best among 10 random tests, the expected grade is ~47%.\n",
"- If you take the best among 100, the expected grade is ~62%.\n",
"- If you take the best among 1000, the expected grade is ~73%.\n",
" - You have so many \"chances\" that you expect to do well.\n",
" \n",
"**But on a single new test, the \"random choice\" accuracy is still 25%.**"
]
},
{
"cell_type": "code",
"execution_count": 38,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"The expected grade among the best of 1 tests is : 0.25\n",
"The expected grade among the best of 2 tests is : 0.32\n",
"The expected grade among the best of 10 tests is : 0.47\n",
"The expected grade among the best of 100 tests is : 0.62\n",
"The expected grade among the best of 1000 tests is : 0.73\n"
]
}
],
"source": [
"import numpy as np\n",
"# Code attributed to Rodolfo Lourenzutti \n",
"\n",
"number_tests = [1, 2, 10, 100, 1000]\n",
"for ntests in number_tests:\n",
" y = np.zeros(10000)\n",
" for i in range(10000):\n",
" y[i] = np.max(np.random.binomial(10.0, 0.25, ntests))\n",
" print(\n",
" \"The expected grade among the best of %d tests is : %0.2f\"\n",
" % (ntests, np.mean(y) / 10.0)\n",
" )"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If we instead used a 100-question test then:"
]
},
{
"cell_type": "code",
"execution_count": 39,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"The expected grade among the best of 1 tests is : 0.25\n",
"The expected grade among the best of 2 tests is : 0.27\n",
"The expected grade among the best of 10 tests is : 0.32\n",
"The expected grade among the best of 100 tests is : 0.36\n",
"The expected grade among the best of 1000 tests is : 0.40\n"
]
}
],
"source": [
"# Code attributed to Rodolfo Lourenzutti \n",
"\n",
"number_tests = [1, 2, 10, 100, 1000]\n",
"for ntests in number_tests:\n",
" y = np.zeros(10000)\n",
" for i in range(10000):\n",
" y[i] = np.max(np.random.binomial(100.0, 0.25, ntests))\n",
" print(\n",
" \"The expected grade among the best of %d tests is : %0.2f\"\n",
" % (ntests, np.mean(y) / 100.0)\n",
" )"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The optimization bias **grows with the number of things we try**. \n",
"But, optimization bias **shrinks quickly with the number of examples**. \n",
"But it’s still non-zero and growing if you over-use your validation set! \n",
"\n",
"\n",
"Essentially our odds of doing well on a multiple-choice exam (if we are guessing) increases the more times we can repeat and randomly take the exam again (selecting the best of many random tests).\n",
"Because we have so many chances you’ll eventually do well and perhaps this is not representative of your knowledge (remember you are randomly guessing) \n",
"\n",
"The same occurs with selecting hyperparameters. \n",
"The more hyperparameters values and combinations we try, the more likely we will randomly get a better scoring model by chance and not because the model represents the data well. \n",
"This overfitting can be decreased somewhat by increasing the number of questions or in our case, the number of examples we have. \n",
"\n",
"TLDR: If your test score is lower than your validation score, it may be because did so much hyperparameter optimization that you got lucky. "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Alleviate validation data overfitting during the hyperparameter search\n",
"\n",
"If you find yourself in the situation of a big difference between your validation and your test score,\n",
"and you suspect that this is due to hyperparameter overfitting,\n",
"there are a few thing your could try:\n",
"\n",
"### Collect more data\n",
"\n",
"Overfitting happens because you only see a bit of data and you learn patterns that are overly specific to your sample.\n",
"Or because you got \"lucky\" with your validation data split which made it easier to predict and get a high score on.\n",
"If you had larger training and validation data,\n",
"then the notion of \"overly specific\" or \"fortunate split\" would be less likely to apply.\n",
"\n",
"### Manually adjust\n",
"\n",
"If your test score is much lower than your cross-validation score,\n",
"You could choose simpler models/hyperparameter combinations manually\n",
"or by selecting the top nth percentile model instead of the best one.\n",
"You could also use the test set a couple of times; it's not the end of the world \n",
"but you need to communicate this clearly when you report the results.\n",
"\n",
"### Refined the hyperparameter tuning procedure\n",
"\n",
"Both GridSearchCV and RandomizedSearchCV do each trial independently.\n",
"What if you could learn from your experience, e.g. learn that max_depth=3 and then avoid using it in future hyperparameter combinations?\n",
"That could save time because you wouldn't try combinations involving max_depth=3 in the future.\n",
"\n",
"There are specific python libraries dedicated to more efficient and generalizable hyperparameter searches.\n",
"In short, these use machine learning to predict what hyperparameters will be good.\n",
"Machine learning on machine learning!\n",
"Examples of such libraries include scikit-optimize, hyperopt, and hyperband.\n",
"The central theme among these is to use infomation from previous hyperparameter combinations\n",
"to influence the choice of future hyperparameters to try.\n",
"Commonly this is done through methods such as \n",
"\"Bayesian optimization\" and \"Gradient Descent\".\n",
"We will not cover this in detail as part of this course."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Let's Practice\n",
"\n",
"1\\. Which method will attempt to find the optimal hyperparameter for the data by searching every combination possible of hyperparameter values given? \n",
"2\\. Which method gives you fine-grained control over the amount of time spent searching? \n",
"3\\. If I want to search for the most optimal hyperparameter values among 3 different hyperparameters each with 3 different values how many trials of cross-validation would be needed? \n",
"\n",
"$x= [1,2,3]$ \n",
"$y= [4,5,6]$ \n",
"$z= [7,8,9]$ \n",
" \n",
"\n",
"**True or False** \n",
"\n",
"4\\. A Larger `n_iter` will take longer but will search over more hyperparameter values. \n",
"5\\. Automated hyperparameter optimization can only be used for multiple hyperparameters. "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"```{admonition} Solutions!\n",
":class: dropdown\n",
"\n",
"1. Exhaustive Grid Search (`GridSearchCV`)\n",
"2. Randomized Grid Search (`RandomizedSearchCV`)\n",
"3. $3 * 3 * 3 = 27$ (* the how many splits you have in your CV)\n",
"4. True\n",
"5. False\n",
"```"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Let's Practice - Coding \n",
"\n",
"We are going to practice grid search using our basketball dataset that we have seen before. "
]
},
{
"cell_type": "code",
"execution_count": 40,
"metadata": {},
"outputs": [],
"source": [
"# Loading in the data\n",
"from sklearn.neighbors import KNeighborsClassifier\n",
"\n",
"\n",
"bball_df = pd.read_csv('data/bball.csv')\n",
"bball_df = bball_df[(bball_df['position'] =='G') | (bball_df['position'] =='F')]\n",
"\n",
"# Define X and y\n",
"X = bball_df.loc[:, ['height', 'weight', 'salary']]\n",
"y = bball_df['position']\n",
"\n",
"# Split the dataset\n",
"X_train, X_test, y_train, y_test = train_test_split(\n",
" X, y, test_size=0.2, random_state=7)\n",
"\n",
"bb_pipe = make_pipeline(\n",
" SimpleImputer(strategy=\"median\"),\n",
" StandardScaler(),\n",
" KNeighborsClassifier()\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"1. Using the pipeline `bb_pipe` provided, create a parameter grid to search over `param_grid`. Search over the values 1, 5, 10, 20, 30, 40, and 50 for the hyperparameter `n_neighbors` and 'uniform' and 'distance' for the hyperparameter `weights` (make sure to name them appropriately). \n",
"2. Set up a `GridSearchCV` to hyperparameter tune using cross-validation with 3 folds. Make sure to specify the arguments `verbose=2` and `n_jobs=-1`.\n",
"3. Train/fit your grid search object on the training data to execute the search.\n",
"4. Find the best hyperparameter values. Make sure to print these results.\n",
"5. Lastly, score your model on the test set."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Solutions**\n",
"\n",
"1\\."
]
},
{
"cell_type": "code",
"execution_count": 41,
"metadata": {
"tags": [
"hide-cell"
]
},
"outputs": [
{
"data": {
"text/plain": [
"Pipeline(steps=[('simpleimputer', SimpleImputer(strategy='median')),\n",
" ('standardscaler', StandardScaler()),\n",
" ('kneighborsclassifier', KNeighborsClassifier())])"
]
},
"execution_count": 41,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Check the names of each step.\n",
"bb_pipe"
]
},
{
"cell_type": "code",
"execution_count": 42,
"metadata": {
"tags": [
"hide-cell"
]
},
"outputs": [],
"source": [
"param_grid = {\n",
" \"kneighborsclassifier__n_neighbors\": [1, 5, 10, 20, 30, 40, 50],\n",
" \"kneighborsclassifier__weights\": ['uniform', 'distance']\n",
"}"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"2\\."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": [
"hide-cell"
]
},
"outputs": [],
"source": [
"gsearch = GridSearchCV(bb_pipe, param_grid, cv=3, verbose=2, n_jobs=-1)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"3\\."
]
},
{
"cell_type": "code",
"execution_count": 43,
"metadata": {
"tags": [
"hide-cell"
]
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Fitting 3 folds for each of 14 candidates, totalling 42 fits\n",
"[CV] END kneighborsclassifier__n_neighbors=1, kneighborsclassifier__weights=uniform; total time= 0.0s\n",
"[CV] END kneighborsclassifier__n_neighbors=1, kneighborsclassifier__weights=uniform; total time= 0.0s\n",
"[CV] END kneighborsclassifier__n_neighbors=1, kneighborsclassifier__weights=distance; total time= 0.0s\n",
"[CV] END kneighborsclassifier__n_neighbors=1, kneighborsclassifier__weights=uniform; total time= 0.0s\n",
"[CV] END kneighborsclassifier__n_neighbors=5, kneighborsclassifier__weights=uniform; total time= 0.0s\n",
"[CV] END kneighborsclassifier__n_neighbors=5, kneighborsclassifier__weights=uniform; total time= 0.0s\n",
"[CV] END kneighborsclassifier__n_neighbors=5, kneighborsclassifier__weights=distance; total time= 0.0s\n",
"[CV] END kneighborsclassifier__n_neighbors=5, kneighborsclassifier__weights=uniform; total time= 0.0s\n",
"[CV] END kneighborsclassifier__n_neighbors=1, kneighborsclassifier__weights=distance; total time= 0.0s\n",
"[CV] END kneighborsclassifier__n_neighbors=5, kneighborsclassifier__weights=distance; total time= 0.0s\n",
"[CV] END kneighborsclassifier__n_neighbors=5, kneighborsclassifier__weights=distance; total time= 0.0s\n",
"[CV] END kneighborsclassifier__n_neighbors=1, kneighborsclassifier__weights=distance; total time= 0.0s\n",
"[CV] END kneighborsclassifier__n_neighbors=10, kneighborsclassifier__weights=distance; total time= 0.0s\n",
"[CV] END kneighborsclassifier__n_neighbors=10, kneighborsclassifier__weights=uniform; total time= 0.0s\n",
"[CV] END kneighborsclassifier__n_neighbors=20, kneighborsclassifier__weights=uniform; total time= 0.0s\n",
"[CV] END kneighborsclassifier__n_neighbors=10, kneighborsclassifier__weights=distance; total time= 0.0s\n",
"[CV] END kneighborsclassifier__n_neighbors=20, kneighborsclassifier__weights=uniform; total time= 0.0s\n",
"[CV] END kneighborsclassifier__n_neighbors=10, kneighborsclassifier__weights=uniform; total time= 0.0s\n",
"[CV] END kneighborsclassifier__n_neighbors=20, kneighborsclassifier__weights=distance; total time= 0.0s\n",
"[CV] END kneighborsclassifier__n_neighbors=30, kneighborsclassifier__weights=uniform; total time= 0.0s\n",
"[CV] END kneighborsclassifier__n_neighbors=10, kneighborsclassifier__weights=uniform; total time= 0.0s\n",
"[CV] END kneighborsclassifier__n_neighbors=10, kneighborsclassifier__weights=distance; total time= 0.0s\n",
"[CV] END kneighborsclassifier__n_neighbors=20, kneighborsclassifier__weights=distance; total time= 0.0s\n",
"[CV] END kneighborsclassifier__n_neighbors=20, kneighborsclassifier__weights=uniform; total time= 0.0s\n",
"[CV] END kneighborsclassifier__n_neighbors=20, kneighborsclassifier__weights=distance; total time= 0.0s\n",
"[CV] END kneighborsclassifier__n_neighbors=30, kneighborsclassifier__weights=uniform; total time= 0.0s\n",
"[CV] END kneighborsclassifier__n_neighbors=40, kneighborsclassifier__weights=distance; total time= 0.0s\n",
"[CV] END kneighborsclassifier__n_neighbors=40, kneighborsclassifier__weights=uniform; total time= 0.0s\n",
"[CV] END kneighborsclassifier__n_neighbors=40, kneighborsclassifier__weights=uniform; total time= 0.0s\n",
"[CV] END kneighborsclassifier__n_neighbors=40, kneighborsclassifier__weights=distance; total time= 0.0s\n",
"[CV] END kneighborsclassifier__n_neighbors=30, kneighborsclassifier__weights=uniform; total time= 0.0s\n",
"[CV] END kneighborsclassifier__n_neighbors=30, kneighborsclassifier__weights=distance; total time= 0.0s\n",
"[CV] END kneighborsclassifier__n_neighbors=40, kneighborsclassifier__weights=distance; total time= 0.0s\n",
"[CV] END kneighborsclassifier__n_neighbors=50, kneighborsclassifier__weights=distance; total time= 0.0s\n",
"[CV] END kneighborsclassifier__n_neighbors=50, kneighborsclassifier__weights=uniform; total time= 0.0s\n",
"[CV] END kneighborsclassifier__n_neighbors=30, kneighborsclassifier__weights=distance; total time= 0.0s\n",
"[CV] END kneighborsclassifier__n_neighbors=50, kneighborsclassifier__weights=uniform; total time= 0.0s\n",
"[CV] END kneighborsclassifier__n_neighbors=50, kneighborsclassifier__weights=uniform; total time= 0.0s\n",
"[CV] END kneighborsclassifier__n_neighbors=30, kneighborsclassifier__weights=distance; total time= 0.0s\n",
"[CV] END kneighborsclassifier__n_neighbors=50, kneighborsclassifier__weights=distance; total time= 0.0s\n",
"[CV] END kneighborsclassifier__n_neighbors=50, kneighborsclassifier__weights=distance; total time= 0.0s\n",
"[CV] END kneighborsclassifier__n_neighbors=40, kneighborsclassifier__weights=uniform; total time= 0.0s\n"
]
}
],
"source": [
"gsearch.fit(X_train, y_train);"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"4\\."
]
},
{
"cell_type": "code",
"execution_count": 44,
"metadata": {
"tags": [
"hide-cell"
]
},
"outputs": [
{
"data": {
"text/plain": [
"{'kneighborsclassifier__n_neighbors': 50,\n",
" 'kneighborsclassifier__weights': 'uniform'}"
]
},
"execution_count": 44,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"gsearch.best_params_"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"5\\."
]
},
{
"cell_type": "code",
"execution_count": 45,
"metadata": {
"tags": [
"hide-cell"
]
},
"outputs": [
{
"data": {
"text/plain": [
"0.9354838709677419"
]
},
"execution_count": 45,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"gsearch.score(X_test, y_test)"
]
},
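{
"cell_type": "markdown",
"metadata": {},
"source": [
"As an optional extra, the same search could be run with `RandomizedSearchCV`. This is a minimal sketch reusing `bb_pipe` and `param_grid` from above; with `n_iter=5` it samples only 5 of the 14 possible combinations, trading thoroughness for speed."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.model_selection import RandomizedSearchCV\n",
"\n",
"# Sample 5 of the 14 combinations instead of trying all of them\n",
"rsearch = RandomizedSearchCV(bb_pipe, param_grid, n_iter=5, cv=3,\n",
"                             random_state=123, n_jobs=-1)\n",
"rsearch.fit(X_train, y_train)\n",
"rsearch.best_params_"
]
},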
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## What We've Learned Today\n",
"\n",
"- How to predict by using naive Bayes.\n",
"- How to use `scikit-learn`'s `MultinomialNB`.\n",
"- What `predict_proba` is. \n",
"- Why we need smoothing in naive Bayes.\n",
"- How to carry out hyperparameter optimization using `sklearn`'s `GridSearchCV` and `RandomizedSearchCV`."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python [conda env:bait]",
"language": "python",
"name": "conda-env-bait-py"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.9"
},
"toc": {
"base_numbering": 1,
"nav_menu": {},
"number_sections": true,
"sideBar": true,
"skip_h1_title": false,
"title_cell": "Table of Contents",
"title_sidebar": "Contents",
"toc_cell": false,
"toc_position": {
"height": "calc(100% - 180px)",
"left": "10px",
"top": "150px",
"width": "274.188px"
},
"toc_section_display": true,
"toc_window_display": true
}
},
"nbformat": 4,
"nbformat_minor": 4
}