Model evaluation tells us how our model performs in the real world.
In the previous module,
we talked about in-sample evaluation.
In-sample evaluation tells us how well our model fits the data already given to train it.
It does not give us an estimate of how well the trained model can predict new data.
The solution is to split our data up: we use the in-sample data, or training data, to train the model. The rest of the data, called test data, is used as out-of-sample data.
This data is then used to approximate how the model performs in the real world.
Separating data into training and testing sets is an important part of model evaluation.
We use the test data to get an idea of how our model will perform in the real world.
When we split a dataset,
usually the larger portion of data is used for
training and a smaller part is used for testing.
For example, we can use 70 percent of the data for training.
We then use 30 percent for testing.
We use the training set to build a model and discover predictive relationships.
We then use a testing set to evaluate model performance.
When we have completed testing our model,
we should use all the data to train the model.
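As a rough sketch of that whole workflow, here is one way it might look in Python, using a small synthetic dataset and a linear regression purely for illustration; the names x_data and y_data are placeholders, not the course's own code:

    import numpy as np
    from sklearn.linear_model import LinearRegression

    np.random.seed(0)
    x_data = np.random.rand(100, 3)   # 100 samples, 3 predictor variables (synthetic)
    y_data = np.random.rand(100)      # 100 target values (synthetic)

    # Shuffle the row indices, then take 70 percent for training, 30 for testing.
    indices = np.random.permutation(len(x_data))
    split = int(0.7 * len(x_data))
    x_train, y_train = x_data[indices[:split]], y_data[indices[:split]]
    x_test, y_test = x_data[indices[split:]], y_data[indices[split:]]

    # Build the model on the training set, then evaluate it on the testing set.
    lr = LinearRegression()
    lr.fit(x_train, y_train)
    print("Test R^2:", lr.score(x_test, y_test))

    # Once testing is complete, retrain on all of the data.
    lr.fit(x_data, y_data)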
A popular function in the scikit-learn package for splitting datasets is the train_test_split function.
This function randomly splits a dataset into training and testing subsets.
From the example code snippet,
this function is imported from sklearn.model_selection; older versions of scikit-learn provided it in sklearn.cross_validation, which has since been removed.
The input parameter y_data is the target variable; in the car appraisal example, it would be the price. The parameter x_data is the list of predictor variables; in this case, it would be all the other variables in the car dataset that we are using to try to predict the price.
The output is four arrays: x_train and y_train, the subsets for training, and x_test and y_test, the subsets for testing.
In this case, the test_size parameter is the proportion of the data assigned to the testing set; here, it is 30 percent. The random_state parameter is a seed for the random split, so the split is reproducible.
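In code, a minimal sketch of that call might look like this, again using the placeholder x_data and y_data arrays from above:

    from sklearn.model_selection import train_test_split

    x_train, x_test, y_train, y_test = train_test_split(
        x_data, y_data,
        test_size=0.3,   # 30 percent of the samples go to the testing set
        random_state=0,  # seed the random split so it is reproducible
    )
    print(x_train.shape, x_test.shape)  # (70, 3) and (30, 3) for 100 samples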
Generalization error is a measure of how well our model does at predicting previously unseen data.
The error we obtain using our testing data is an approximation of this error.
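To make that concrete, here is a small sketch that approximates the generalization error with the mean squared error on the test set; MSE is one common choice of error measure, and the split arrays come from the sketch above:

    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import mean_squared_error

    # Fit on the training subset only, then measure the error on the test subset.
    lr = LinearRegression().fit(x_train, y_train)
    yhat_test = lr.predict(x_test)
    print("Estimated generalization error (test MSE):",
          mean_squared_error(y_test, yhat_test))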
This figure shows the distribution of the actual values (in red) compared to the predicted values from a linear regression (in blue), generated using the training data.
We see the distributions are somewhat similar.
If we generate the same plot using the test data,
we see the distributions are relatively different.
The difference is due to generalization error and reflects what we would see in the real world.
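One possible way to reproduce a plot like this, assuming seaborn is available and lr is the model fit on the training subset in the sketch above; the styling details here are guesses, not the course's own figure code:

    import matplotlib.pyplot as plt
    import seaborn as sns

    # Actual values in red, predicted values in blue, on the training data.
    ax = sns.kdeplot(y_train, color="r", label="Actual values")
    sns.kdeplot(lr.predict(x_train), color="b", label="Predicted values", ax=ax)
    ax.set_title("Training data: actual vs. predicted distributions")
    ax.legend()
    plt.show()

    # Swapping in y_test and lr.predict(x_test) typically shows distributions
    # that differ more, which is the generalization error described above.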
Using a lot of data for training gives us an accurate estimate of how well our model will perform in the real world, but because less data is left over for testing, the precision of that estimate will be low.
Let's clarify this with an example.
The center of this bull's eye represents the correct generalization error.
Let's say we take a random sample of the data using
90 percent of the data for training and 10 percent for testing.
The first time we run the experiment, we get a good estimate of the generalization error.
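A small sketch of this idea: repeating a 90/10 split with different random seeds gives noticeably different estimates of the test error, which is the spread the bull's-eye analogy describes; the data here is the synthetic x_data and y_data again, purely for illustration:

    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import mean_squared_error

    # Each seed produces a different 90/10 split, and a different error estimate.
    for seed in range(4):
        xtr, xte, ytr, yte = train_test_split(
            x_data, y_data, test_size=0.1, random_state=seed)
        model = LinearRegression().fit(xtr, ytr)
        mse = mean_squared_error(yte, model.predict(xte))
        print(f"split {seed}: test MSE = {mse:.4f}")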