What: Learning Outcomes
Course Objectives
Describe fundamental machine learning concepts such as: supervised and unsupervised learning, regression and classification, overfitting, training/validation/testing error, parameters and hyperparameters, and the golden rule.
Broadly explain how common machine learning algorithms work, including: naïve Bayes, k-nearest neighbors, decision trees, support vector machines, and logistic regression.
Identify when and why to apply data pre-processing techniques such as scaling and one-hot encoding.
Use Python and the scikit-learn package to develop an end-to-end supervised machine learning pipeline.
Apply and interpret machine learning methods to carry out supervised learning projects and to address business objectives.
Lecture 1
Explain motivation to study machine learning.
Differentiate between supervised and unsupervised learning.
Differentiate between classification and regression problems.
Explain machine learning terminology such as features, targets, training, and error.
Explain the .fit() and .predict() paradigm and use the .score() method of ML models.
Broadly describe how decision trees make predictions.
Use DecisionTreeClassifier() and DecisionTreeRegressor() to build decision trees using scikit-learn (see the sketch after this list).
Explain the difference between parameters and hyperparameters.
Explain how decision boundaries change with max_depth.
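A minimal sketch of the fit/predict/score paradigm with a decision tree; the synthetic dataset and the max_depth value are illustrative assumptions, not course materials:

```python
# Toy illustration of .fit() / .predict() / .score() with a decision tree.
# The dataset is synthetic; max_depth=3 is an arbitrary choice for the example.
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=100, n_features=4, random_state=42)

model = DecisionTreeClassifier(max_depth=3)  # max_depth is a hyperparameter (set by us)
model.fit(X, y)                              # learning the splits sets the parameters
print(model.predict(X[:5]))                  # predicted labels for five examples
print(model.score(X, y))                     # accuracy on the given data
```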
Lecture 2
Explain the concept of generalization.
Split a dataset into train and test sets using the train_test_split function.
Explain the difference between train, validation, test, and “deployment” data.
Identify the difference between training error, validation error, and test error.
Explain cross-validation and use cross_val_score() and cross_validate() to calculate cross-validation error (see the sketch after this list).
Explain overfitting, underfitting, and the fundamental tradeoff.
State the golden rule and identify the scenarios when it’s violated.
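A minimal sketch of splitting and cross-validating on synthetic data; the dataset, the model, and cv=5 are assumptions for illustration:

```python
# Toy illustration: hold out a test set, then cross-validate on the training set.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, cross_validate, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, random_state=0)

# Per the golden rule, the test set is touched only once, at the very end.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = DecisionTreeClassifier(max_depth=3)
print(cross_val_score(model, X_train, y_train, cv=5).mean())        # mean of 5 validation scores
print(cross_validate(model, X_train, y_train, cv=5)["test_score"])  # scores plus timing info
```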
Lecture 3
Use DummyClassifier and DummyRegressor as baselines for machine learning problems (see the sketch after this list).
Explain the notion of similarity-based algorithms.
Broadly describe how KNNs use distances.
Discuss the effect of using a small/large value of the hyperparameter \(K\) when using the KNN algorithm.
Explain the general idea of SVMs with an RBF kernel.
Describe the problem of the curse of dimensionality.
Broadly describe the relation of the gamma and C hyperparameters to the fundamental tradeoff.
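A minimal sketch comparing a dummy baseline with the similarity-based models above; the synthetic dataset and hyperparameter values (n_neighbors, gamma, C) are illustrative assumptions:

```python
# Toy comparison: dummy baseline vs. KNN vs. RBF SVM on synthetic data.
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, random_state=0)

baseline = DummyClassifier(strategy="most_frequent").fit(X, y)  # ignores the features
knn = KNeighborsClassifier(n_neighbors=5).fit(X, y)             # small K -> more complex boundary
svm = SVC(kernel="rbf", gamma=0.1, C=1.0).fit(X, y)             # gamma and C steer the tradeoff
print(baseline.score(X, y), knn.score(X, y), svm.score(X, y))
```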
Lecture 4
Identify when to implement feature transformations such as imputation and scaling.
Describe the difference between normalizing and standardizing and be able to use scikit-learn’s MinMaxScaler() and StandardScaler() to pre-process numeric features.
Apply sklearn.pipeline.Pipeline to build a machine learning pipeline (see the sketch after this list).
Use sklearn for applying numerical feature transformations to the data.
Discuss the golden rule in the context of feature transformations.
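A minimal sketch of a pipeline that chains imputation and scaling with a model; the synthetic data and the choice of steps are illustrative assumptions:

```python
# Toy pipeline: impute, scale, then classify. Cross-validation fits the
# transformers on each training fold only, respecting the golden rule.
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=200, random_state=0)

pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),  # a no-op here; handles missing values
    ("scaler", StandardScaler()),                   # MinMaxScaler() would normalize instead
    ("clf", KNeighborsClassifier()),
])
print(cross_val_score(pipe, X, y, cv=5).mean())
```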
Lectures 5 and 6
Identify when it’s appropriate to apply ordinal encoding vs one-hot encoding.
Explain strategies to deal with categorical variables with too many categories.
Explain the handle_unknown="ignore" hyperparameter of scikit-learn’s OneHotEncoder.
Use scikit-learn’s ColumnTransformer to apply transformations such as MinMaxScaler and OneHotEncoder to numeric and categorical features simultaneously (see the sketch after this list).
Use ColumnTransformer to build all our transformations together into one object and use it with scikit-learn pipelines.
Explain why text data needs a different treatment than categorical variables.
Use scikit-learn’s CountVectorizer to encode text data.
Explain different hyperparameters of CountVectorizer.
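A minimal sketch of ColumnTransformer and CountVectorizer; the DataFrame, column names, and categories are made up for illustration:

```python
# Toy ColumnTransformer: scale the numeric column, one-hot encode the
# categorical one, then fit a model on the combined features.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

df = pd.DataFrame({
    "age": [25, 32, 47, 51],
    "city": ["Vancouver", "Toronto", "Vancouver", "Montreal"],
    "bought": [0, 1, 1, 0],
})
X, y = df[["age", "city"]], df["bought"]

preprocessor = ColumnTransformer([
    ("num", MinMaxScaler(), ["age"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["city"]),  # unseen categories -> all zeros
])
pipe = Pipeline([("prep", preprocessor), ("clf", LogisticRegression())])
pipe.fit(X, y)
print(pipe.predict(X))

# Text gets CountVectorizer instead; it expects a 1-D sequence of strings.
bow = CountVectorizer().fit_transform(["good movie", "bad movie", "good good"])
print(bow.toarray())
```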
Lecture 7
Explain the general intuition behind linear models.
Explain the fit and predict paradigm of linear models.
Use scikit-learn’s LogisticRegression classifier (see the sketch after this list).
Use fit, predict, and predict_proba.
Use coef_ to interpret the model weights.
Explain the advantages and limitations of linear classifiers.
Apply a scikit-learn regression model (e.g., Ridge) to regression problems.
Relate the Ridge hyperparameter alpha to the LogisticRegression hyperparameter C.
Compare logistic regression with naive Bayes.
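A minimal sketch of these linear models on synthetic data; the C and alpha values are arbitrary illustrations:

```python
# Toy linear models: probabilities and weights from LogisticRegression,
# and Ridge for regression. Smaller C / larger alpha = more regularization.
from sklearn.datasets import make_classification, make_regression
from sklearn.linear_model import LogisticRegression, Ridge

X, y = make_classification(n_samples=100, n_features=4, random_state=0)
lr = LogisticRegression(C=1.0).fit(X, y)
print(lr.predict(X[:3]))        # hard class predictions
print(lr.predict_proba(X[:3]))  # predicted probabilities per class
print(lr.coef_)                 # one learned weight per feature

Xr, yr = make_regression(n_samples=100, n_features=4, random_state=0)
ridge = Ridge(alpha=1.0).fit(Xr, yr)
print(ridge.coef_)
```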
Lecture 8
In the context of supervised learning, form statistical questions from business questions/objectives.
Understand the different forms in which your client may expect you to communicate results.
Explain the general concept of feature selection.
Discuss and compare different feature selection methods at a high level.
Use sklearn’s implementation of recursive feature elimination (RFE); see the sketch after this list.
Implement the forward search algorithm.
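A minimal sketch of RFE on synthetic data; the base estimator and n_features_to_select are illustrative choices:

```python
# Toy RFE: repeatedly fit the model and drop the weakest feature(s).
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=10, n_informative=3, random_state=0)

rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=3)
rfe.fit(X, y)
print(rfe.support_)  # boolean mask of the kept features
print(rfe.ranking_)  # 1 = kept; larger numbers were eliminated earlier
```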
Lecture 9
Explain why accuracy is not always the best metric in ML.
Explain components of a confusion matrix.
Define precision, recall, and f1-score and use them to evaluate different classifiers.
Identify whether there is class imbalance and whether you need to deal with it.
Explain class_weight and use it to deal with data imbalance.
Appropriately select a scoring metric given a regression problem.
Interpret and communicate the meanings of different scoring metrics on regression problems: MSE, RMSE, \(R^2\), and MAPE.
Apply different scoring functions with cross_validate, GridSearchCV, and RandomizedSearchCV (see the sketch after this list).
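A minimal sketch of class_weight and non-default scoring; the imbalanced synthetic dataset and the metric choices are assumptions for illustration:

```python
# Toy imbalanced problem: weight the minority class and score with f1.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_validate

X, y = make_classification(n_samples=300, weights=[0.9], random_state=0)  # ~90/10 class split

model = LogisticRegression(class_weight="balanced", max_iter=1000)
results = cross_validate(model, X, y, scoring=["accuracy", "f1", "recall"])
print(results["test_f1"].mean())

grid = GridSearchCV(model, {"C": [0.1, 1, 10]}, scoring="f1")  # tune for f1, not accuracy
grid.fit(X, y)
print(grid.best_params_)
```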
Lecture 10
Explain ethical considerations in data science, relating to multiple phases of machine learning pipelines.
Be able to analyze a confusion matrix and think about how different scoring metrics affect diverse stakeholders.
Explain components of a confusion matrix with respect to multi-class classification.
Define precision, recall, and f1-score with multi-class classification.
Carry out multi-class classification using one-vs-rest (OVR) and one-vs-one (OVO) strategies (see the sketch after this list).
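A minimal sketch of multi-class evaluation with OVR and OVO wrappers on synthetic three-class data; the setup is illustrative, not course material:

```python
# Toy 3-class problem: OVR fits one model per class, OVO one per class pair.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier

X, y = make_classification(n_samples=300, n_classes=3, n_informative=4, random_state=0)

ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, y)
ovo = OneVsOneClassifier(LogisticRegression(max_iter=1000)).fit(X, y)

print(confusion_matrix(y, ovr.predict(X)))       # 3x3: rows = true class, cols = predicted
print(classification_report(y, ovo.predict(X)))  # per-class precision, recall, f1
```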