{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Preprocessing Categorical Features and Column Transformer" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Lecture Learning Objectives \n", "\n", "- Identify when it's appropriate to apply ordinal encoding vs one-hot encoding.\n", "- Explain strategies to deal with categorical variables with too many categories.\n", "- Explain the `handle_unknown=\"ignore\"` hyperparameter of `scikit-learn`'s `OneHotEncoder`.\n", "- Use `scikit-learn`'s `ColumnTransformer` to apply transformers such as `MinMaxScaler` and `OneHotEncoder` to numeric and categorical features simultaneously.\n", "- Use `ColumnTransformer` to build all our transformations together into one object and use it with `scikit-learn` pipelines.\n", "- Explain why text data needs a different treatment than categorical variables.\n", "- Use `scikit-learn`'s `CountVectorizer` to encode text data.\n", "- Explain the different hyperparameters of `CountVectorizer`." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Five Minute Recap / Lightning Questions \n", "\n", "- Where does most of the work happen in $k$-NN: `fit` or `predict`?\n", "- What are the two hyperparameters we looked at for Support Vector Machines with an RBF kernel? \n", "- What is the range of values after normalization? \n", "- Imputation handles missing values by removing which of the following: the column, the row, or neither?\n", "- Pipelines help us not violate what?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Some lingering questions\n", "\n", "- What about categorical features??! How do we use them in our model!?\n", "- How do we combine everything?!\n", "- What about data with text? 
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Introducing Categorical Feature Preprocessing" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's bring back our California housing dataset that we explored last class. \n", "Remember we engineered some of the features in the data." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
| | longitude | latitude | housing_median_age | households | median_income | median_house_value | ocean_proximity | rooms_per_household | bedrooms_per_household | population_per_household |
|---|---|---|---|---|---|---|---|---|---|---|
| 6051 | -117.75 | 34.04 | 22.0 | 602.0 | 3.1250 | 113600.0 | INLAND | 4.897010 | 1.056478 | 4.318937 |
| 20113 | -119.57 | 37.94 | 17.0 | 20.0 | 3.4861 | 137500.0 | INLAND | 17.300000 | 6.500000 | 2.550000 |
| 14289 | -117.13 | 32.74 | 46.0 | 708.0 | 2.6604 | 170100.0 | NEAR OCEAN | 4.738701 | 1.084746 | 2.057910 |
| 13665 | -117.31 | 34.02 | 18.0 | 285.0 | 5.2139 | 129300.0 | INLAND | 5.733333 | 0.961404 | 3.154386 |
| 14471 | -117.23 | 32.88 | 18.0 | 1458.0 | 1.8580 | 205000.0 | NEAR OCEAN | 3.817558 | 1.004801 | 4.323045 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 7763 | -118.10 | 33.91 | 36.0 | 130.0 | 3.6389 | 167600.0 | <1H OCEAN | 5.584615 | NaN | 3.769231 |
| 15377 | -117.24 | 33.37 | 14.0 | 779.0 | 4.5391 | 180900.0 | <1H OCEAN | 6.016688 | 1.017972 | 3.127086 |
| 17730 | -121.76 | 37.33 | 5.0 | 697.0 | 5.6306 | 286200.0 | <1H OCEAN | 5.958393 | 1.031564 | 3.493544 |
| 15725 | -122.44 | 37.78 | 44.0 | 326.0 | 3.8750 | 412500.0 | NEAR BAY | 4.739264 | 1.024540 | 1.720859 |
| 19966 | -119.08 | 36.21 | 20.0 | 348.0 | 2.5156 | 59300.0 | INLAND | 5.491379 | 1.117816 | 3.566092 |
18576 rows × 10 columns
| | rating |
|---|---|
| 0 | Good |
| 1 | Bad |
| 2 | Good |
| 3 | Good |
| 4 | Bad |
| 5 | Neutral |
| 6 | Good |
| 7 | Good |
| 8 | Neutral |
| 9 | Neutral |
| 10 | Neutral |
| 11 | Good |
| 12 | Bad |
| 13 | Good |
| | rating | rating_enc |
|---|---|---|
| 0 | Good | 1 |
| 1 | Bad | 0 |
| 2 | Good | 1 |
| 3 | Good | 1 |
| 4 | Bad | 0 |
| 5 | Neutral | 2 |
| 6 | Good | 1 |
| 7 | Good | 1 |
| 8 | Neutral | 2 |
| 9 | Neutral | 2 |
| 10 | Neutral | 2 |
| 11 | Good | 1 |
| 12 | Bad | 0 |
| 13 | Good | 1 |
| | rating | rating_enc |
|---|---|---|
| 0 | Good | 2 |
| 1 | Bad | 0 |
| 2 | Good | 2 |
| 3 | Good | 2 |
| 4 | Bad | 0 |
| 5 | Neutral | 1 |
| 6 | Good | 2 |
| 7 | Good | 2 |
| 8 | Neutral | 1 |
| 9 | Neutral | 1 |
| 10 | Neutral | 1 |
| 11 | Good | 2 |
| 12 | Bad | 0 |
| 13 | Good | 2 |
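
The two `rating_enc` tables above differ only in how the categories are ordered. One way to get both results is sketched below with `OrdinalEncoder`. The toy frame `X_toy` and the variable names are assumptions made for illustration, not the notebook's original code.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy column mirroring the first six rows of the rating table
X_toy = pd.DataFrame(
    {"rating": ["Good", "Bad", "Good", "Good", "Bad", "Neutral"]}, dtype=object
)

# Default behaviour: categories are sorted alphabetically,
# so Bad=0, Good=1, Neutral=2 (the order carries no meaning)
default_enc = OrdinalEncoder(dtype=int)
default_codes = default_enc.fit_transform(X_toy[["rating"]]).ravel()

# Explicit order Bad < Neutral < Good, so Bad=0, Neutral=1, Good=2
ordered_enc = OrdinalEncoder(categories=[["Bad", "Neutral", "Good"]], dtype=int)
ordered_codes = ordered_enc.fit_transform(X_toy[["rating"]]).ravel()

print(default_codes)
print(ordered_codes)
```

Passing the order through `categories` is what makes the integer codes respect the natural ranking of the ratings.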
| | language |
|---|---|
| 0 | English |
| 1 | Vietnamese |
| 2 | English |
| 3 | Mandarin |
| 4 | English |
| 5 | English |
| 6 | Mandarin |
| 7 | English |
| 8 | Vietnamese |
| 9 | Mandarin |
| 10 | French |
| 11 | Spanish |
| 12 | Mandarin |
| 13 | Hindi |
| | language | language_enc |
|---|---|---|
| 0 | English | 0 |
| 1 | Vietnamese | 5 |
| 2 | English | 0 |
| 3 | Mandarin | 3 |
| 4 | English | 0 |
| 5 | English | 0 |
| 6 | Mandarin | 3 |
| 7 | English | 0 |
| 8 | Vietnamese | 5 |
| 9 | Mandarin | 3 |
| 10 | French | 1 |
| 11 | Spanish | 4 |
| 12 | Mandarin | 3 |
| 13 | Hindi | 2 |
| | language_English | language_French | language_Hindi | language_Mandarin | language_Spanish | language_Vietnamese |
|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 1 |
| 2 | 1 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 1 | 0 | 0 |
| 4 | 1 | 0 | 0 | 0 | 0 | 0 |
| 5 | 1 | 0 | 0 | 0 | 0 | 0 |
| 6 | 0 | 0 | 0 | 1 | 0 | 0 |
| 7 | 1 | 0 | 0 | 0 | 0 | 0 |
| 8 | 0 | 0 | 0 | 0 | 0 | 1 |
| 9 | 0 | 0 | 0 | 1 | 0 | 0 |
| 10 | 0 | 1 | 0 | 0 | 0 | 0 |
| 11 | 0 | 0 | 0 | 0 | 1 | 0 |
| 12 | 0 | 0 | 0 | 1 | 0 | 0 |
| 13 | 0 | 0 | 0 | 1 | 0 | 0 |
| | ocean_proximity_<1H OCEAN | ocean_proximity_INLAND | ocean_proximity_ISLAND | ocean_proximity_NEAR BAY | ocean_proximity_NEAR OCEAN |
|---|---|---|---|---|---|
| 6051 | 0 | 1 | 0 | 0 | 0 |
| 20113 | 0 | 1 | 0 | 0 | 0 |
| 14289 | 0 | 0 | 0 | 0 | 1 |
| 13665 | 0 | 1 | 0 | 0 | 0 |
| 14471 | 0 | 0 | 0 | 0 | 1 |
| | province_BC | province_MB | province_ON |
|---|---|---|---|
| 0 | 1 | 0 | 0 |
| 1 | 0 | 0 | 1 |
| 2 | 0 | 1 | 0 |
| | province_BC | province_MB | province_ON |
|---|---|---|---|
| 0 | 1 | 0 | 0 |
| 1 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 |
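
The all-zero rows in the table above are what `handle_unknown="ignore"` produces for categories the encoder never saw during `fit`. The sketch below reproduces that pattern. The unseen provinces `"QC"` and `"AB"` in `test_df` are assumptions, since the original test values are not shown here.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

train_df = pd.DataFrame({"province": ["BC", "ON", "MB"]}, dtype=object)
# Hypothetical test data: "QC" and "AB" never appear in train_df
test_df = pd.DataFrame({"province": ["BC", "QC", "AB"]}, dtype=object)

# With the default handle_unknown="error", transform raises on unseen categories
strict = OneHotEncoder().fit(train_df)
try:
    strict.transform(test_df)
    raised = False
except ValueError:
    raised = True

# With handle_unknown="ignore", unseen categories become a row of zeros
lenient = OneHotEncoder(handle_unknown="ignore", dtype=int).fit(train_df)
test_enc = lenient.transform(test_df).toarray()

print(raised)
print(test_enc)
```

Whether silently encoding unknown values as all zeros is acceptable depends on the problem, so it is worth deciding deliberately rather than by default.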
| | province_AB | province_BC | province_MB | province_NB | province_NL | province_NS | province_NT | province_NU | province_ON | province_PE | province_QC | province_SK | province_YT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| | Caffeine |
|---|---|
| 0 | No |
| 1 | Yes |
| 2 | Yes |
| 3 | No |
| 4 | Yes |
| 5 | No |
| 6 | No |
| 7 | No |
| 8 | Yes |
| 9 | No |
| 10 | Yes |
| 11 | Yes |
| 12 | No |
| 13 | Yes |
| | Caffeine_No | Caffeine_Yes |
|---|---|---|
| 0 | 1 | 0 |
| 1 | 0 | 1 |
| 2 | 0 | 1 |
| 3 | 1 | 0 |
| 4 | 0 | 1 |
| 5 | 1 | 0 |
| 6 | 1 | 0 |
| 7 | 1 | 0 |
| 8 | 0 | 1 |
| 9 | 1 | 0 |
| 10 | 0 | 1 |
| 11 | 0 | 1 |
| 12 | 1 | 0 |
| 13 | 0 | 1 |
| | Caffeine_Yes |
|---|---|
| 0 | 0 |
| 1 | 1 |
| 2 | 1 |
| 3 | 0 |
| 4 | 1 |
| 5 | 0 |
| 6 | 0 |
| 7 | 0 |
| 8 | 1 |
| 9 | 0 |
| 10 | 1 |
| 11 | 1 |
| 12 | 0 |
| 13 | 1 |
| | longitude | latitude | housing_median_age | households | median_income | ocean_proximity | rooms_per_household | bedrooms_per_household | population_per_household |
|---|---|---|---|---|---|---|---|---|---|
| 6051 | -117.75 | 34.04 | 22.0 | 602.0 | 3.1250 | INLAND | 4.897010 | 1.056478 | 4.318937 |
| 20113 | -119.57 | 37.94 | 17.0 | 20.0 | 3.4861 | INLAND | 17.300000 | 6.500000 | 2.550000 |
| 14289 | -117.13 | 32.74 | 46.0 | 708.0 | 2.6604 | NEAR OCEAN | 4.738701 | 1.084746 | 2.057910 |
| 13665 | -117.31 | 34.02 | 18.0 | 285.0 | 5.2139 | INLAND | 5.733333 | 0.961404 | 3.154386 |
| 14471 | -117.23 | 32.88 | 18.0 | 1458.0 | 1.8580 | NEAR OCEAN | 3.817558 | 1.004801 | 4.323045 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 7763 | -118.10 | 33.91 | 36.0 | 130.0 | 3.6389 | <1H OCEAN | 5.584615 | NaN | 3.769231 |
| 15377 | -117.24 | 33.37 | 14.0 | 779.0 | 4.5391 | <1H OCEAN | 6.016688 | 1.017972 | 3.127086 |
| 17730 | -121.76 | 37.33 | 5.0 | 697.0 | 5.6306 | <1H OCEAN | 5.958393 | 1.031564 | 3.493544 |
| 15725 | -122.44 | 37.78 | 44.0 | 326.0 | 3.8750 | NEAR BAY | 4.739264 | 1.024540 | 1.720859 |
| 19966 | -119.08 | 36.21 | 20.0 | 348.0 | 2.5156 | INLAND | 5.491379 | 1.117816 | 3.566092 |
18576 rows × 9 columns
ColumnTransformer(remainder='passthrough',
                  transformers=[('numeric',
                                 Pipeline(steps=[('imputer',
                                                  SimpleImputer(strategy='median')),
                                                 ('scaler', StandardScaler())]),
                                 Index(['longitude', 'latitude', 'housing_median_age', 'households',
       'median_income', 'rooms_per_household', 'bedrooms_per_household',
       'population_per_household'],
      dtype='object')),
                                ('categorical',
                                 Pipeline(steps=[('imputer',
                                                  SimpleImputer(fill_value='missing',
                                                                strategy='constant')),
                                                 ('onehot',
                                                  OneHotEncoder(handle_unknown='ignore'))]),
                                 Index(['ocean_proximity'], dtype='object'))])
| | longitude | latitude | housing_median_age | households | median_income | rooms_per_household | bedrooms_per_household | population_per_household | ocean_proximity_<1H OCEAN | ocean_proximity_INLAND | ocean_proximity_ISLAND | ocean_proximity_NEAR BAY | ocean_proximity_NEAR OCEAN |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.908140 | -0.743917 | -0.526078 | 0.266135 | -0.389736 | -0.210591 | -0.083813 | 0.126398 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 1 | -0.002057 | 1.083123 | -0.923283 | -1.253312 | -0.198924 | 4.726412 | 11.166631 | -0.050132 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 2 | 1.218207 | -1.352930 | 1.380504 | 0.542873 | -0.635239 | -0.273606 | -0.025391 | -0.099240 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 3 | 1.128188 | -0.753286 | -0.843842 | -0.561467 | 0.714077 | 0.122307 | -0.280310 | 0.010183 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 4 | 1.168196 | -1.287344 | -0.843842 | 2.500924 | -1.059242 | -0.640266 | -0.190617 | 0.126808 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 18571 | 0.733102 | -0.804818 | 0.586095 | -0.966131 | -0.118182 | 0.063110 | -0.099558 | 0.071541 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 18572 | 1.163195 | -1.057793 | -1.161606 | 0.728235 | 0.357500 | 0.235096 | -0.163397 | 0.007458 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 18573 | -1.097293 | 0.797355 | -1.876574 | 0.514155 | 0.934269 | 0.211892 | -0.135305 | 0.044029 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 18574 | -1.437367 | 1.008167 | 1.221622 | -0.454427 | 0.006578 | -0.273382 | -0.149822 | -0.132875 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 18575 | 0.242996 | 0.272667 | -0.684960 | -0.396991 | -0.711754 | 0.025998 | 0.042957 | 0.051269 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
18576 rows × 13 columns
| | fit_time | score_time | test_score | train_score |
|---|---|---|---|---|
| 0 | 0.032777 | 0.223269 | 0.695818 | 0.801659 |
| 1 | 0.030454 | 0.213943 | 0.707483 | 0.799575 |
| 2 | 0.030485 | 0.226894 | 0.713788 | 0.795944 |
| 3 | 0.027297 | 0.211561 | 0.686938 | 0.801232 |
| 4 | 0.027026 | 0.174867 | 0.724608 | 0.832498 |
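
A table of cross-validation results like the one above is typically obtained by putting the preprocessor and the model in a single pipeline and passing it to `cross_validate`. The sketch below uses synthetic data, so its scores will not match the numbers above. The data-generating code and column choices are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_validate
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the housing data (illustration only)
rng = np.random.default_rng(0)
n = 200
X = pd.DataFrame({
    "median_income": rng.uniform(1, 8, n),
    "housing_median_age": rng.uniform(1, 50, n),
    "ocean_proximity": pd.Series(
        rng.choice(["INLAND", "NEAR BAY", "NEAR OCEAN"], n), dtype=object
    ),
})
y = 50_000 * X["median_income"] + rng.normal(0, 10_000, n)

preprocessor = ColumnTransformer(
    transformers=[
        ("numeric",
         Pipeline([("imputer", SimpleImputer(strategy="median")),
                   ("scaler", StandardScaler())]),
         ["median_income", "housing_median_age"]),
        ("categorical",
         Pipeline([("imputer", SimpleImputer(strategy="constant", fill_value="missing")),
                   ("onehot", OneHotEncoder(handle_unknown="ignore"))]),
         ["ocean_proximity"]),
    ],
    remainder="passthrough",
)

# Preprocessing is re-fit inside each fold, so no validation data leaks into it
pipe = Pipeline([("preprocessor", preprocessor), ("reg", KNeighborsRegressor())])
scores = pd.DataFrame(cross_validate(pipe, X, y, cv=5, return_train_score=True))
print(scores)
```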
Pipeline(steps=[('preprocessor',
                 ColumnTransformer(remainder='passthrough',
                                   transformers=[('numeric',
                                                  Pipeline(steps=[('imputer',
                                                                   SimpleImputer(strategy='median')),
                                                                  ('scaler',
                                                                   StandardScaler())]),
                                                  Index(['longitude', 'latitude', 'housing_median_age', 'households',
       'median_income', 'rooms_per_household', 'bedrooms_per_household',
       'population_per_household'],
      dtype='object')),
                                                 ('categorical',
                                                  Pipeline(steps=[('imputer',
                                                                   SimpleImputer(fill_value='missing',
                                                                                 strategy='constant')),
                                                                  ('onehot',
                                                                   OneHotEncoder(handle_unknown='ignore'))]),
                                                  Index(['ocean_proximity'], dtype='object'))])),
                ('reg', KNeighborsRegressor())])
Pipeline(steps=[('columntransformer',
                 ColumnTransformer(remainder='passthrough',
                                   transformers=[('pipeline-1',
                                                  Pipeline(steps=[('simpleimputer',
                                                                   SimpleImputer(strategy='median')),
                                                                  ('standardscaler',
                                                                   StandardScaler())]),
                                                  Index(['longitude', 'latitude', 'housing_median_age', 'households',
       'median_income', 'rooms_per_household', 'bedrooms_per_household',
       'population_per_household'],
      dtype='object')),
                                                 ('pipeline-2',
                                                  Pipeline(steps=[('simpleimputer',
                                                                   SimpleImputer(fill_value='missing',
                                                                                 strategy='constant')),
                                                                  ('onehotencoder',
                                                                   OneHotEncoder())]),
                                                  Index(['ocean_proximity'], dtype='object'))])),
                ('kneighborsregressor', KNeighborsRegressor())])
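
The auto-generated names in the representation above (`pipeline-1`, `simpleimputer`, `columntransformer`, and so on) are what the `make_pipeline` and `make_column_transformer` helpers produce when no names are supplied. A minimal sketch, assuming two hypothetical feature lists, shows where those names come from:

```python
from sklearn.compose import make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical column lists; the real notebook selects them from the data frame
numeric_features = ["median_income", "housing_median_age"]
categorical_features = ["ocean_proximity"]

# Each tuple is (transformer, columns); names are generated automatically
preprocessor = make_column_transformer(
    (make_pipeline(SimpleImputer(strategy="median"), StandardScaler()),
     numeric_features),
    (make_pipeline(SimpleImputer(strategy="constant", fill_value="missing"),
                   OneHotEncoder()),
     categorical_features),
    remainder="passthrough",
)
pipe = make_pipeline(preprocessor, KNeighborsRegressor())

step_names = [name for name, _ in pipe.steps]
transformer_names = [name for name, _, _ in preprocessor.transformers]
print(step_names)
print(transformer_names)
```

The helpers trade explicit names for brevity. The generated names are the lowercased class names, with a numeric suffix when the same class appears more than once.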
| | 08002986030 | 100000 | 11 | 900 | all | always | are | around | as | been | ... | update | urgent | usf | valued | vettam | week | with | won | you | your |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| URGENT!! As a valued network customer you have been selected to receive a £900 prize reward! | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | ... | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
| Lol you are always so convincing. | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| Nah I don't think he goes to usf, he lives around here though | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| URGENT! You have won a 1 week FREE membership in our £100000 prize Jackpot! | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 0 |
| Had your mobile 11 months or more? U R entitled to Update to the latest colour mobiles with camera for Free! Call The Mobile Update Co FREE on 08002986030 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 2 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 |
| As per your request 'Melle Melle (Oru Minnaminunginte Nurungu Vettam)' has been set as your callertune for all Callers. Press *9 to copy your friends Callertune | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 2 | 1 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 3 |
6 rows × 72 columns
Pipeline(steps=[('countvectorizer', CountVectorizer()), ('svc', SVC())])
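
The pipeline printed above chains `CountVectorizer` and `SVC`, so raw text can be passed straight to `fit` and `predict`. A minimal sketch follows. The spam/ham labels in `y_text` and the new message are assumptions for illustration, since the labels are not shown in the table.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

X_text = [
    "URGENT!! As a valued network customer you have been selected to receive a £900 prize reward!",
    "Lol you are always so convincing.",
    "Nah I don't think he goes to usf, he lives around here though",
    "URGENT! You have won a 1 week FREE membership in our £100000 prize Jackpot!",
]
# Hypothetical labels assigned for this sketch
y_text = ["spam", "ham", "ham", "spam"]

# The vectorizer is fit only on the text passed to fit(),
# then reused unchanged when predicting on new messages
text_pipe = make_pipeline(CountVectorizer(), SVC())
text_pipe.fit(X_text, y_text)

preds = text_pipe.predict(["URGENT! You have won a prize"])
print(preds)
```

Note that `CountVectorizer` takes a one-dimensional collection of strings as input, unlike the column-wise transformers used earlier, which expect a two-dimensional table.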