0:00

So far in this unit, we have learned how to fit multiple linear

regression models and how to interpret the results coming

out of a multiple linear regression model.

We've also talked about inference using the multiple linear regression.

And lastly, what we are going to do now is to go through

the conditions required for the multiple

linear regression model to be valid.

These conditions are that we need

linear relationships between our numerical explanatory

variables and our response variable, our residuals need to be nearly normally

distributed, we want constant variability of

residuals, and we also want independence

of residuals, which basically speaks to

independence of the observations in our sample.

0:41

First, linear relationships between numerical explanatory

variables and our response variable y.

We're mentioning numerical here because it doesn't make sense to ask

for a linear relationship between a

categorical variable and another numerical variable.

So each numerical explanatory variable needs to

be linearly related to the response variable.

We check this condition using residuals plots,

that is, the residuals versus the explanatory variable.

We're looking for a random scatter around zero.

And note that we're using the residuals plot, instead of a scatter

plot of the response variable versus the explanatory, because the residuals plot

allows for considering the other variables that are also in the model

and not just the bivariate relationship between a given x and our y.

As illustrative examples, we're once again going to use

the cognitive scores data set from the previous videos.

And we had decided that the final model is going to have the mom's high

school status, mom's IQ, and mom's work

status as the explanatory variables in the model.

Note that the only numerical variable in our model is mom's IQ score,

so that's the variable we're going to be focusing on for the linearity condition.
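
A rough sketch of this check in R might look like the following. The course data are not loaded here; instead a simulated stand-in is generated so the code runs on its own, and the names cognitive, cog_final, kid_score, mom_hs, mom_iq and mom_work are assumptions made for illustration.

```r
# Simulated stand-in for the cognitive scores data (all values are made up).
set.seed(1)
n <- 400
cognitive <- data.frame(
  mom_hs   = factor(sample(c("no", "yes"), n, replace = TRUE)),
  mom_iq   = rnorm(n, mean = 100, sd = 15),
  mom_work = factor(sample(c("no", "yes"), n, replace = TRUE))
)
cognitive$kid_score <- 20 + 5 * (cognitive$mom_hs == "yes") +
  0.6 * cognitive$mom_iq + 2 * (cognitive$mom_work == "yes") +
  rnorm(n, sd = 18)

# Model with the three explanatory variables named in the lecture.
cog_final <- lm(kid_score ~ mom_hs + mom_iq + mom_work, data = cognitive)

# Residuals versus the only numerical explanatory variable.
plot(cog_final$residuals ~ cognitive$mom_iq,
     xlab = "Mom's IQ", ylab = "Residuals")
abline(h = 0, lty = 2)  # reference line at zero
```

A random scatter around the dashed line, with no curve or trend, is what we would hope to see.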

2:41

The next condition is nearly normal residuals with mean zero.

Remember that some residuals will be positive

and some are going to be negative.

On a residuals plot we look for a random scatter of residuals around zero.

This translates to a nearly normal

distribution of residuals centered at zero.

And we can check this using a histogram or a normal probability plot.

So, once again, using R, we can make a histogram of

our residuals that are stored in the object for the regression model.

And we can also make a normal probability plot

using the functions qqnorm for the plot, and qqline for

the guidance line that we're going to use to

see if the points actually align on a straight line.

This is what our plots look like.

We are seeing a little bit of a skew in the residuals.

However, the skew doesn't look too bad.

And looking at the normal probability plot as well, except for

at the tail areas, we're not seeing huge deviations from the line.

So I think we can say that this condition seems to be fairly satisfied.
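
The histogram and the normal probability plot described above can be sketched in R roughly as follows. The model here is fit to simulated data so that the code is self-contained; the names cognitive and cog_final and the column names are assumptions, not the course's actual objects.

```r
# Simulated stand-in data (made-up values) and the fitted model.
set.seed(2)
n <- 400
cognitive <- data.frame(
  mom_hs   = factor(sample(c("no", "yes"), n, replace = TRUE)),
  mom_iq   = rnorm(n, mean = 100, sd = 15),
  mom_work = factor(sample(c("no", "yes"), n, replace = TRUE))
)
cognitive$kid_score <- 20 + 0.6 * cognitive$mom_iq + rnorm(n, sd = 18)
cog_final <- lm(kid_score ~ mom_hs + mom_iq + mom_work, data = cognitive)

# Histogram of the residuals stored in the model object.
hist(cog_final$residuals, main = "Histogram of residuals", xlab = "Residuals")

# Normal probability plot, with a guidance line to judge straightness.
qqnorm(cog_final$residuals)
qqline(cog_final$residuals)
```

Points that stay close to the guidance line, apart from perhaps the tails, suggest nearly normal residuals.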

The next condition is constant variability of residuals.

We want our residuals to be equally variable for

low and high values of the predicted response variable.

So we check the residuals plot of residuals versus

the predicted values, that's e versus y hat.

And note that we're using residuals versus predicted, instead of residuals versus x,

because it allows for considering the entire

model with all explanatory variables at once.

We want our residuals to be randomly scattered

in a band with a constant width around zero.

So in other words, we want to see nothing that resembles a fan shape.

It is also worthwhile to view the absolute value of residuals versus

the predicted values to identify any unusual observations easily.

As usual, we can easily create both of these plots in R.

Here for example, we have our residuals on our y axis, and

on the x axis we have what R calls the fitted values.

What this basically means is our predicted values, or in other words our y hats.

And we can also calculate the absolute values of these

residuals and plot that against the fitted values as well.

So here's what our plots look like.

The first plot is a residuals versus fitted plot.

We don't see a fan shape here.

It appears that the variability of the

residuals stays constant as the

fitted, or predicted, values change, so

the constant variability condition appears to be met.

The absolute value of residuals plot can be

thought of simply as the first plot folded in half.

So if we were to see a fan shape in the first plot,

we would see a triangle in the absolute value of residuals versus fitted plot.

That doesn't seem to be the case, so it seems like this condition is met as well.
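
Both plots for the constant variability check can be sketched in R along these lines. Again the data are simulated so the code runs by itself, and the object and column names (cognitive, cog_final, kid_score and so on) are assumed for illustration.

```r
# Simulated stand-in data (made-up values) and the fitted model.
set.seed(3)
n <- 400
cognitive <- data.frame(
  mom_hs   = factor(sample(c("no", "yes"), n, replace = TRUE)),
  mom_iq   = rnorm(n, mean = 100, sd = 15),
  mom_work = factor(sample(c("no", "yes"), n, replace = TRUE))
)
cognitive$kid_score <- 20 + 0.6 * cognitive$mom_iq + rnorm(n, sd = 18)
cog_final <- lm(kid_score ~ mom_hs + mom_iq + mom_work, data = cognitive)

# Residuals versus fitted (predicted) values: look for a constant-width band.
plot(cog_final$residuals ~ cog_final$fitted.values,
     xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)

# Absolute residuals versus fitted values: a fan shape would show as a triangle.
plot(abs(cog_final$residuals) ~ cog_final$fitted.values,
     xlab = "Fitted values", ylab = "|Residuals|")
```

Plotting against the fitted values rather than a single x uses the whole model at once, which is the reason given above for preferring this plot.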

Lastly, independent residuals, and note that

independent residuals basically means independent observations.

If we have any time series structure, or if

we're suspecting that there may be any time series structure

in our data set, we can check for independent residuals

using the residuals versus the order of data collection plot.

If, on the other hand, that is not a consideration, we don't really have another

diagnostic graph that we can use to check whether

the residuals are independent.

Instead, we want to go back to first principles

and think about how the data are sampled.

We've talked numerous times in this course

about what independence of observations means and what

we need in terms of the sampling of the data to obtain independent observations.

So let's quickly take a look to see if this

order of data collection plot looks wonky in any way.

For that, we simply plot our residuals, and

we don't even have to specify anything for our

x-axis, because R will basically plot them in

the order that they appear in our data set.

And the order of data collection plot where we have the residuals on the y-axis,

and the order of data collection on the x-axis, does not show any patterns.

If there were some non-independent structure, we would see

these residuals increasing or decreasing but we don't see any

such pattern, so it appears that any sort of

time series structure is not a consideration for this dataset.
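
The order of data collection plot can be sketched in R as below, assuming the rows of the data frame are stored in the order in which they were collected. The data are simulated and the names cognitive and cog_final are hypothetical stand-ins for the course's objects.

```r
# Simulated stand-in data (made-up values); row order plays the role of
# the order of data collection.
set.seed(4)
n <- 400
cognitive <- data.frame(
  mom_hs   = factor(sample(c("no", "yes"), n, replace = TRUE)),
  mom_iq   = rnorm(n, mean = 100, sd = 15),
  mom_work = factor(sample(c("no", "yes"), n, replace = TRUE))
)
cognitive$kid_score <- 20 + 0.6 * cognitive$mom_iq + rnorm(n, sd = 18)
cog_final <- lm(kid_score ~ mom_hs + mom_iq + mom_work, data = cognitive)

# With no x supplied, R plots the residuals against their index (1, 2, ..., n).
plot(cog_final$residuals, xlab = "Order of data collection", ylab = "Residuals")
abline(h = 0, lty = 2)
```

This plot is only informative when the row order really reflects collection order; if the data have been sorted or shuffled, it says nothing about independence.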