Machine learning encompasses a wide range of statistical methods, including linear

and logistic regression, which we learned about earlier in the specialization.

These methods can be used to describe associations, search for

patterns in a dataset, or make predictions about outcomes,

also called target or response variables, from a set of explanatory variables.

Also called predictors, features, or inputs.

Pattern detection, or describing associations among variables without

a specific outcome variable, is referred to as unsupervised learning.

When the goal is prediction of the value of a response variable based on a number

of predictors, then we are using a supervised learning approach.

Unlike the hypothesis testing approach you've been describing in the previous

courses in the specialization, we typically do not go into a machine

learning application with specific hypotheses in mind.

Instead we learn from the data.

We use a subset of observations from our dataset, which we call the training set,

to learn about the data.

And then test the statistical model we get from the training data set

on a different set of observations which we call the test set.

We split the data this way because the model that is fit using a machine learning

approach is going to fit best in the data set on which it was developed.

But it might not perform as well when we try to test it on a different set of

observations.

If the model doesn't work for a different data set then it's not of much use

even if it fits well in the training data set.

When we apply our statistical model from the training data set to the test data

set, we're interested primarily in the accuracy of our statistical model.

Accuracy can be assessed by what is called the test error rate, which is a measure of

the extent to which a model correctly classifies observations into categories.

Or correctly estimates the value of a different variable of interest for

each observation in the test data set.

The goal then is to identify a model that minimizes the test error rate.

That is, a model that accurately reflects true population associations or patterns.