0:00

This lecture's about two things.

First it's about predicting with regression and using

multiple covariates, but more importantly it's about exploring a

data set and trying to identify which predictors are

the most important to include in our prediction model.

0:13

So we're again going to be using the wages data to try to

predict the wages of a group of men from the Mid-Atlantic.

This data set is available in the ISLR package, which you can find here.

It comes from the book An Introduction to Statistical Learning.

0:28

So the first thing that we do is that we load

the data set in, and so here is the ISLR data set.

We're also loading the ggplot2 package for some

exploratory analysis, and we're loading the caret package.

We're doing prediction, and we're going to

subset the data set for exploration purposes to

just the part of the data set that isn't the variable we're trying to predict.

So we select out the log wage variable, which is

the variable that we're going to be predicting in this analysis.

Then we can look at a summary of the data set and

we can already see some features. For example, as we saw before,

this data set includes only males,

and everyone in it is from the Mid-Atlantic region.

1:09

So to do a little bit more exploration, we first

need to split the data into training and test sets.

We do that with the createDataPartition function,

which subsets the data into a training set and a test set.

Again, we're going to do all of our exploration on the training set,

because when we're building models we're not going to use any of the test set.

We're only going to apply the model to it once.
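
As a rough illustration of the splitting step, here is a minimal sketch in Python rather than the lecture's R. It is not caret's createDataPartition; it is a plain random split, and the function name, the 70% training fraction, the seed, and the rows are all made up for illustration.

```python
import random

def train_test_split(rows, train_fraction=0.7, seed=1):
    """Randomly assign rows to a training set and a test set (hypothetical helper)."""
    rng = random.Random(seed)              # fixed seed so the split is reproducible
    indices = list(range(len(rows)))
    rng.shuffle(indices)
    n_train = round(train_fraction * len(rows))
    train_idx = sorted(indices[:n_train])  # keep the original row order within each set
    test_idx = sorted(indices[n_train:])
    return [rows[i] for i in train_idx], [rows[i] for i in test_idx]

# Made-up rows standing in for the wage data: (age, wage)
rows = [(20 + i, 50.0 + 2.0 * i) for i in range(10)]
training, testing = train_test_split(rows)
print(len(training), len(testing))
```

Every row ends up in exactly one of the two sets, so anything learned during exploration on the training rows cannot leak in from the test rows.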

1:30

So the first thing that you can do is a feature plot.

This feature plot shows a little bit

about how the variables are related to each other.

Sometimes this plot is useful and sometimes it isn't.

In this particular plot it's a little bit hard

to see because everything is so squished together but if

you make it on your own, by using this function

you'll see that it's a little bit easier to see.

So, for example, for the job class variable you can see

that this group appears to be a little bit higher

than that group in terms of the outcome.

And you can see that, at least for this age variable,

there is some relationship to the outcome.

Again there seems to be an outlying group, two separate groups here, which gives

us some indication that we might be able to use that variable to predict.

So the first thing that we can do is plot the variables versus wage.

So this is plotting the wage variable versus the age

variable, and you can see that there appears to be

some kind of trend, which we saw in the previous

lectures. But there is also this set of points that's

separated out up here, and the idea is that this

set of points might be something

that we can predict, but we would have to figure

out which variable explains that chunk.

2:46

So, one way that you can do that is to

plot one predictor, age, versus the outcome, wage,

and then color the points by another variable, in this case job class.

And you can see, for example, that most of the points up

here are blue instead of pink, which

means they come from the information group.

So this gives us some indication that the

job class variable might be able to explain at least

some fraction of the variability in that group at the top of the plot.

3:15

You can also color it by education.

So here I'm just using qplot again.

I'm plotting age versus wage,

and I'm telling it to color the points by education.

And, again, I'm only using the training set, because we

only look at the training set while developing our model.

And you can see that the advanced degree also explains

a lot of the variation up here in the top group.

And so some combination of education and job class might explain that top group.
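
A numeric counterpart to coloring the points by a second variable is to average the outcome within each level of that variable. The sketch below is in Python rather than the lecture's R, the rows are made up, and "Other" is only a placeholder label for the job class that is not information.

```python
from collections import defaultdict
from statistics import mean

def mean_by_group(rows, group, value):
    """Average `value` within each level of the categorical column `group`."""
    buckets = defaultdict(list)
    for row in rows:
        buckets[row[group]].append(row[value])
    return {level: mean(values) for level, values in buckets.items()}

# Made-up training rows, not the real wage data.
training = [
    {"age": 30, "jobclass": "Other", "wage": 80.0},
    {"age": 45, "jobclass": "Information", "wage": 120.0},
    {"age": 50, "jobclass": "Other", "wage": 100.0},
    {"age": 38, "jobclass": "Information", "wage": 260.0},
]
group_means = mean_by_group(training, "jobclass", "wage")
print(group_means)
```

A large gap between the group averages is the same hint the colored plot gives: the grouping variable may be worth including as a predictor.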

3:56

So the next thing that we can do is fit a linear model with multiple variables in it.

The idea here is, again, we're just

fitting lines, but now we're fitting more than one line.

So we have an intercept term, which is

just the baseline level of wage, and then

we might have a relationship with the age of the person,

and then a relationship with which job class they're in.

One way that we typically do that is by fitting an indicator variable.

An indicator variable is a variable

that's denoted like this in mathematical notation.

It just says: if the job class for the ith

person is equal to information, this variable is equal to one.

If the job class for the ith person is not equal to information,

this variable is equal to zero. So its coefficient represents the difference in

wages between the people with job class

equal to information and those with job class

not equal to information, when you fix all the other variables in the regression model.

You can also do this for education. It's a little

bit more complicated, though, because there are multiple education levels.

So we create an indicator variable for each of the different education

levels, and here this is the sum of four indicator variables.

Each variable is equal to one if the education

for person i is equal to level k, and zero otherwise.
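
Written out, the model described here would look something like the following. The symbols b_0, b_1, b_2, gamma_k and e_i are my own choice of notation and may not match the slide; the bold 1 denotes an indicator that is one when its condition holds and zero otherwise.

```latex
\[
\text{wage}_i = b_0 + b_1\,\text{age}_i
  + b_2\,\mathbf{1}(\text{jobclass}_i = \text{Information})
  + \sum_{k=1}^{4} \gamma_k\,\mathbf{1}(\text{education}_i = \text{level } k)
  + e_i
\]
```

Here b_0 is the baseline wage, b_1 is the slope for age, b_2 is the wage difference for the information job class with everything else held fixed, and each gamma_k is the shift for education level k.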

5:24

So then we can fit the model just like we did before,

using the train function in the caret package.

We again use wage as the outcome, then a tilde, and the formula

on the right is what's going to be used to predict the variable on the left.

So we use age, job class, and education.

Job class and education are both factor variables in R, so by

default it creates

these indicator variables like I've shown here,

and it takes them into account automatically when it fits the model.

Again we're fitting the model on the training set,

and we can look at the final model.

You can see now that the final model has

ten predictors even though we only put three variable

names into the formula. The reason is that

this variable right here actually receives more than

one predictor in the data set because of the

way that we had to create these indicator functions.
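
To make the expansion from variable names to indicator columns concrete, here is a minimal sketch in Python rather than the lecture's R. It is not what caret's train does internally; it builds the indicator columns by hand and fits ordinary least squares through the normal equations. All function names, the education labels "A", "B", "C", the "Other" job class label, and the data are made up, and the first level of each factor is assumed to be the baseline.

```python
def indicator_columns(values, levels):
    """One 0/1 column per level except the first, which acts as the baseline."""
    return [[1.0 if v == level else 0.0 for level in levels[1:]] for v in values]

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def fit_linear_model(design, y):
    """Least squares via the normal equations (X'X) beta = X'y."""
    p = len(design[0])
    xtx = [[sum(row[i] * row[j] for row in design) for j in range(p)] for i in range(p)]
    xty = [sum(row[i] * yi for row, yi in zip(design, y)) for i in range(p)]
    return solve(xtx, xty)

# Made-up training data: three variable names (age, jobclass, education).
ages = [25, 40, 55, 30, 45, 60, 35, 50]
jobclass = ["Other", "Information"] * 4
education = ["A", "B", "C", "A", "B", "C", "C", "A"]

job_cols = indicator_columns(jobclass, ["Other", "Information"])
edu_cols = indicator_columns(education, ["A", "B", "C"])
design = [[1.0, float(a)] + j + e for a, j, e in zip(ages, job_cols, edu_cols)]

# Wages generated without noise from known coefficients, so the fit should recover them.
true_beta = [20.0, 1.5, 10.0, 5.0, 12.0]  # intercept, age, Information, B, C
wage = [sum(b * x for b, x in zip(true_beta, row)) for row in design]

beta = fit_linear_model(design, wage)
print(len(design[0]))  # 5 columns, although only three variable names went in
print([round(b, 6) for b in beta])
```

The three-level education factor contributes two columns and the two-level job class contributes one, which is the same mechanism that makes a fitted model report more terms than the formula has names.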

6:19

So then we can look at some diagnostic plots.

This is very typical when you're building these regression models.

The idea here is you can plot the fitted

values, that is, the predictions from our model on

the training set, versus the residuals, the amount

of variation that's left over after you fit your model.

What you'd like to see is that

this line is centered at zero on this axis, because the residuals are the

difference between our model predictions and the

actual values that we're trying to predict.

And here you can see there are still a couple of outliers

up here that have been labeled for you in this plot.
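
The residual calculation behind this plot can be sketched as follows, in Python rather than the lecture's R. The numbers are made up, and the rule for flagging outliers (more than five times the median absolute residual) is an arbitrary choice of mine, not the rule the plot uses to label points.

```python
from statistics import median

# Made-up observed wages and fitted values for six training rows.
observed = [70.0, 95.0, 110.0, 82.0, 250.0, 101.0]
fitted = [75.0, 98.0, 105.0, 88.0, 120.0, 99.0]

# Residual = observed value minus the model's fitted value.
residuals = [y - f for y, f in zip(observed, fitted)]

# Flag rows whose residual is far from zero relative to the typical residual size.
typical = median(abs(r) for r in residuals)
outliers = [i for i, r in enumerate(residuals) if abs(r) > 5 * typical]

print(residuals)
print(outliers)
```

Rows flagged this way are the ones worth inspecting for a predictor that the model does not yet include.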

6:55

And so those points might be ones that we

want to explore a little bit further to

see if we can identify any other predictors in

our data set that might be able to explain them.

7:07

The other thing that we can do is color by variables not used in the model.

So for example here I'm again plotting the

fitted values from our model versus the residuals.

And again, we'd like to see these lying on the zero

line, because the residuals are the difference between our fitted values and our real values.

What we can do on this plot is color

the points by different variables, in this case, for example, race.

And you can see that it seems like some of these

outliers up here may be explained by the race variable in the

data set. So this is another exploratory technique: plotting the fitted values

versus the residuals, then coloring the points

by different variables to identify potential trends.

7:47

Another thing that can be really useful

is plotting the residuals versus the index.

And what do I mean by the index?

So the data set comes in a set of rows that you got in a particular order.

And so the index is just which row of the data set you're looking at.

8:04

And the y axis here is the residuals.

You can see that the high residuals seem

to be happening down here at the right end, at the highest row numbers.

And you can also see a trend with respect to row numbers.

Whenever you can see a trend or an outlier like that with respect to the

row numbers, it suggests that there's a

variable missing from your model, because you're

plotting the residuals here, that is, the

difference between the true values and the fitted values,

and that shouldn't have any relationship to the order

in which the rows appear in your data set.

What you typically discover when you see

a trend like this, or outliers like this at one end of

the plot, is that there's a relationship with respect to time, or

age, or some other continuous variable that the rows are ordered by.
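
One rough way to check for a trend against row order without a plot is to correlate the residuals with the row index. This is a sketch in Python rather than the lecture's R; the helper name and both residual vectors are made up, and a correlation only picks up a roughly linear trend, so it is a complement to the plot rather than a replacement.

```python
def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = sum((x - mean_x) ** 2 for x in xs) ** 0.5
    sd_y = sum((y - mean_y) ** 2 for y in ys) ** 0.5
    return cov / (sd_x * sd_y)

index = list(range(8))  # row numbers of the training set

# Made-up residuals: one set unrelated to row order, one drifting upward with it.
residuals_no_trend = [1.0, -1.0, -1.0, 1.0, 1.0, -1.0, -1.0, 1.0]
residuals_trend = [-4.0, -3.0, -1.5, -0.5, 0.5, 1.5, 3.0, 4.0]

r_no_trend = pearson(index, residuals_no_trend)
r_trend = pearson(index, residuals_trend)
print(r_no_trend, r_trend)
```

A correlation near zero is what a well-specified model should give; a value far from zero suggests the rows are ordered by something the model is missing.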

8:56

So the other thing that you can do is plot the wage variable.

So this is the wage variable in the test

set versus the predicted values in the test set.

So ideally these two things would be very close to each other.

Ideally you'd have essentially a straight line on the 45

degree line where wage was exactly equal to our predictions.

Of course that isn't how it always works out.

And then in the test set, you can explore

and try to identify trends that you might have missed.

So, for example, here we're looking at the year

that the data was collected in the test set,

as a way of exploring how our model might have broken down.

Now something to keep in mind is that if you do

this sort of exploration in the test set, you can't then

go back and update your model in the training set, because

that would be using the test set to rebuild your predictors.

This is more like a post-mortem on your analysis or a

way to try to determine whether your analysis worked or not.
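
The lecture compares test-set wages and predictions visually. A single-number summary of the same comparison is the root mean squared error, which is my addition here rather than something shown in the lecture. The sketch is in Python rather than R, and the values are made up.

```python
from math import sqrt

def rmse(observed, predicted):
    """Root mean squared error between observed values and predictions."""
    squared_errors = [(y - p) ** 2 for y, p in zip(observed, predicted)]
    return sqrt(sum(squared_errors) / len(squared_errors))

# Made-up test-set wages and the model's predictions for the same rows.
test_wage = [80.0, 100.0, 120.0, 90.0]
test_pred = [83.0, 96.0, 120.0, 90.0]

test_error = rmse(test_wage, test_pred)
print(test_error)
```

If every prediction sat exactly on the 45 degree line the error would be zero, and, as with the plot, this number is computed once at the end rather than used to tune the model.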

9:53

If you want to use all of the covariates in your model building, one

thing that you can do is again use the train function.

You pass it an outcome and then a tilde, and if you put

a dot here instead of a set of variables separated by plus signs,

it says to predict with all of the variables in the data set.

So this is the model fit with all of the variables,

and this is the wage variable versus the predictions here.

You can actually see that it does a little bit

better when you include all of the variables in the data set.

This is a reasonable default if you don't want to

try to do some sort of model selection in advance.

10:30

So linear regression is often useful in combination with other models.

It's quite a simple model in the

sense that it always fits lines through the data.

And so it can capture a lot of variability

if the relationship between the predictors and the outcome is linear.

If it's not linear, then you can often miss

things, and it may work better blended with other models.

We'll talk about model blending later in the class.

Exploratory data analysis can be very useful with regression

models because, with plots like the ones we made of residuals

colored by different features, you can

try to identify patterns in the data set.