0:07

Before we start adding more explanatory variables to our regression model,

there are some assumptions that we make for the linear regression model.

There are four major assumptions for

linear regression analysis that we can test for.

They are the assumption of Normality, Linearity,

Homoscedasticity, and Independence.

In addition, we have to contend with the possibility of Multicollinearity,

which occurs when explanatory variables are highly correlated with each other,

and with observations, called Outliers, that look different from the other observations.

0:38

We need to investigate whether or not these assumptions are met,

because serious violations of these assumptions can lead to distorted

regression coefficients and significance tests, and/or a weaker analysis.

The assumption of Normality means that we assume that the residuals from our

Linear Regression Model, which are the deviations of each observation's observed

score on the response variable from its predicted score, are normally distributed.

This means if you were to plot the residuals in a histogram,

it would make a nice bell shaped curve.

If the residuals are not normally distributed,

then the regression model may be misspecified.
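A quick way to check this assumption is to look at a histogram of the residuals or run a normality test on them. Here is a minimal sketch on simulated data; the variable names, the simulated model, and the choice of the Shapiro-Wilk test are illustrative, not from the lecture:

```python
# Sketch: checking residual normality (simulated data for illustration).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 200)
y = 2.0 + 0.5 * x + rng.normal(0, 1, 200)  # true model with normal errors

# Fit a simple linear regression by least squares.
slope, intercept = np.polyfit(x, y, 1)
residuals = y - (intercept + slope * x)

# Shapiro-Wilk test: a small p-value suggests the residuals are not normal.
w_stat, p_value = stats.shapiro(residuals)
print(f"W = {w_stat:.3f}, p = {p_value:.3f}")
```

Plotting `residuals` in a histogram should then show the bell shape described above.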

1:35

The assumption of Homoscedasticity means that if you plot the model residuals,

they should look like they have the same spread, that is, the same amount of variability,

no matter where on the X axis you are looking.

So if you look at the two scatter plots here, you can see that the points in the one on

the left look like they have about the same spread along the entire X axis.

This indicates that the homoscedasticity assumption has been met.

On the other hand, if you look at the scatter plot on the right, you can see

that the spread of the residual values increases as you move along the X axis.

This suggests that we have heteroscedasticity, meaning that

the spread of the points is not the same at all levels of the explanatory variable.

If you remember, the model residuals are estimates of the deviation of

each observation's predicted score on the response variable,

from their observed score.

The bigger the residual, the greater the error in prediction.

In this example,

the spread gets wider as the level of the explanatory variable increases.

This indicates that your regression model does not predict the response

at higher values of the explanatory variable

as well as it does at lower values of the explanatory variable.
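One rough way to check this numerically, rather than just by eye, is to compare the spread of the residuals at low versus high values of the explanatory variable. A sketch with simulated heteroscedastic data; the names and the use of Levene's test are my own choices, not from the lecture:

```python
# Sketch: comparing residual spread in the lower vs upper half of X.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 300)
# Heteroscedastic errors: the noise grows as x increases.
y = 1.0 + 2.0 * x + rng.normal(0, 0.2 + 0.3 * x, 300)

slope, intercept = np.polyfit(x, y, 1)
residuals = y - (intercept + slope * x)

low = residuals[x < np.median(x)]
high = residuals[x >= np.median(x)]

# Levene's test: a small p-value suggests unequal spread (heteroscedasticity).
stat, p_value = stats.levene(low, high)
print(f"spread(low x) = {low.std():.2f}, spread(high x) = {high.std():.2f}, p = {p_value:.4f}")
```

In this simulated example the spread in the upper half is clearly larger, matching the fan shape in the right-hand scatter plot described above.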

The assumption of Independence of observations

means that the observations in your data set are not correlated with each other.

When we assess the assumption of independence, we are talking about whether

the observations in a data set are correlated with each other.

This is different from the correlation between variables within an observation.

Violation of this assumption usually occurs if your data are nested or

clustered, or if you have repeated measures data, which is generated by

research designs in which the same observation is assessed repeatedly,

often over time.

For example, babies that are born pre-term might be weighed twice a week to

monitor their rate of growth in a study of two different feeding programs,

to compare growth rates in the two groups.

Because weight is measured on each baby multiple times,

the observations, which in this case are the repeated measures for

each baby, are going to be correlated because they're coming from the same baby.

Data generated from a repeated measures study

may also be called longitudinal data.

3:34

Another example would be a study of children's mathematical

aptitude in elementary schools.

The aptitude scores for each child in the classroom may be recorded only once.

But it is possible that the aptitude scores for

children in the same class may be more similar to each other

than the scores from children in different classrooms.

This might be because all the children in each classroom have the same teacher,

and each classroom has a different teacher.

The fact that children in a classroom have the same teacher

might lead to their scores being correlated.

This is an example of what we call Hierarchical, Nested, or Clustered data.

In this case, children are nested within classrooms.

Both of these examples are likely to result in data that violate the assumption

of independence.

Of all the assumptions, violation of the assumption of independence is the most

serious, and most likely to have a negative impact on parameter estimation.

It is also one of the most difficult assumptions to fix.

4:28

Unlike the assumptions of Normality, Linearity, and Homoscedasticity,

the assumption of Independence cannot be fixed by transforming variables, or

otherwise modifying the variables in your analysis or excluding observations.

This is because it is typically the structure of the data itself

that results in violation of this assumption.

So it's important to understand the process, or

study design, that generated your data.

If that process produces data that are hierarchically structured or

clustered, or that have correlated observations,

then the best solution may be to use an alternative regression method

that can take into account the lack of independence in your data.

Most of the methods are simply extensions of the linear regression model.

So having a good understanding of linear regression will make it easier to

understand and apply these alternative statistical methods, that can account for

lack of independence among observations.
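To see what correlated observations look like in the classroom example, here is a sketch that simulates children nested in classrooms and computes the intraclass correlation (ICC) from a one-way ANOVA decomposition. All numbers, names, and the choice of ICC as the diagnostic are illustrative assumptions:

```python
# Sketch: simulated "children nested in classrooms" data.
import numpy as np

rng = np.random.default_rng(2)
n_classes, n_per_class = 30, 20

# Each classroom shares one teacher effect, so scores within a class move together.
class_effect = rng.normal(0.0, 2.0, n_classes)
scores = class_effect[:, None] + rng.normal(0.0, 1.0, (n_classes, n_per_class))

# One-way ANOVA decomposition: between-class vs within-class mean squares.
msb = n_per_class * scores.mean(axis=1).var(ddof=1)
msw = scores.var(axis=1, ddof=1).mean()

# Intraclass correlation: the share of variance due to classrooms.
icc = (msb - msw) / (msb + (n_per_class - 1) * msw)
print(f"ICC = {icc:.2f}")  # well above the 0 that independence would imply
```

An ICC near zero is what the independence assumption expects; a large ICC like this one is the signal that a clustered-data method such as a multilevel model is needed.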

Although not one of the big four assumptions,

Outliers and Multicollinearity can affect your analyses in undesirable ways.

So you have to do some investigating to see if either one or both are present.

Outliers are observations that have unusual or

extreme values relative to the other observations.

In regression analysis, Outliers can have an unusually large influence on

the estimation of the line of best fit.

A few outlying observations, or even just one outlying observation

can affect your linear regression assumptions or change your results,

specifically in the estimation of the line of best fit.

The analysis will try to fit the outliers.

As a result, the estimated regression line will not fit the rest of the data as well as it

should, increasing the prediction error for the majority of the observations.

You can often identify outliers by just looking at a scatter plot.

In this scatter plot, there are two observations that appear to be different

or unusual, compared to the other observations.

This observation here is definitely an outlier.

It's far from all the other observations, and it's nowhere near the regression line.

This single observation could definitely have an impact

on your regression assumptions, and on your results.

If this is the case, then something needs to be done with it.

The other observation here also looks like it could be an outlier,

because it is far away from the values of the other observations.

But it still fits along the regression line.

It may have an extreme value, but

it shows the same linear association as the rest of the observations.

This means that including this observation in the analysis

will not have an impact on your results, so it should be retained in your analysis.

Histograms and box plots can also be used to identify univariate and

bivariate outliers.
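Besides eyeballing plots, a common numeric screen for univariate outliers is the 1.5 × IQR rule that box plots use. A small sketch with made-up values:

```python
# Sketch: flagging univariate outliers with the 1.5 * IQR rule (invented data).
import numpy as np

values = np.array([21.0, 23.5, 22.1, 24.8, 23.0, 21.9, 22.7, 95.0])  # one extreme value

q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = values[(values < lower) | (values > upper)]
print(outliers)  # only the 95.0 is flagged
```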

So the question is, what do you do with outliers?

Just getting rid of them might not be the answer.

Here's a decision flow chart that can help you decide what to do with outliers.

The first thing to do is check to see whether the observation changes

whether or not your regression assumptions are met.

7:14

If it has little effect, then you can try running your analysis with and

without the observation.

If it turns out the outliers are the result of a data recording or

data management problem, you can see if it's something you can fix.

For example, if your variable is age, and one of the observations has a value of

212, then you know it's a data recording problem.

If you can go back to the original source of the data, and find perhaps that

the actual age value is 21, then you can recode the age to 21 and go ahead

with your analysis, but if you can't, then you need to recode the value as missing.
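That recoding step might look like this in code; the cutoff of 120 and the variable names are illustrative assumptions:

```python
# Sketch: set an impossible age to missing (NaN) when it can't be recovered.
import numpy as np

ages = np.array([34.0, 58.0, 212.0, 27.0])  # 212 is clearly a recording error

# If the source shows the real value (say 21), fix it there instead;
# otherwise recode anything impossible to missing.
corrected = np.where(ages > 120, np.nan, ages)
print(corrected)  # [34. 58. nan 27.]
```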

Sometimes, analysts will recode an unusual value like this to the maximum value for

the range of the other observations.

So if the rest of the observations ranged in age from 18 to 75,

then the value of 212 would be recoded to 75.

However, doing so means that you have to assume that the observation

is an older person, when in fact you really don't know for sure if that's true.

If there is no reason to believe the assumption is correct,

then this solution is not a good one.

So if you've ruled out a data cleaning or data management problem, the next step is

to see if you can figure out whether the observation is part of your population, or

whether it comes from a different population.

Sometimes that's easy to figure out.

So if your population of interest is adults aged 18 to 65,

and your observation's age is 78,

then you know that they are not part of your target population.

And you can exclude that observation from your sample.

If you can't rule out that your observation is from a population other

than your target population, then it's not so easy to just get rid of it.

You don't want to exclude the observation if it's part of your target population,

just because it's extreme.

If you eliminate observations that are part of your target population, then your

results may no longer be a reflection of what's going on in the target population.

That is, generalizing your results to the target population, becomes risky.

9:08

If it turns out that you don't have a good reason to exclude the observation,

you can consider transforming the variable.

For example, by taking the log or square root of the variable.

Transformation can help bring the extreme values closer to the sample observations.
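For instance, a log transformation compresses large values much more than small ones, pulling an extreme observation back toward the rest of the sample. A sketch with invented numbers:

```python
# Sketch: how a log transform pulls an extreme value in (illustrative data).
import numpy as np

income = np.array([30_000.0, 45_000.0, 52_000.0, 61_000.0, 2_000_000.0])

log_income = np.log(income)
# On the raw scale the outlier is ~38x the median value;
# on the log scale the gap shrinks dramatically.
print(income.max() / np.median(income))
print(log_income.max() / np.median(log_income))
```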

Multicollinearity means that you have explanatory variables that are highly

correlated with each other.

For example, say you have three explanatory variables, X1, X2, and X3.

9:47

In this example, the Pearson correlation coefficients

for the associations between these three variables would be relatively small.

What's important here,

is that each explanatory variable has a lot of unique variability,

that can contribute to explaining the variability in your response variable.

So each explanatory variable has the possibility of being

independently associated with the response variable,

after taking into account the variability from the other variables.

So it's a lot easier to determine the contribution

of each variable to explaining the response variable, after adjusting or

controlling for the other explanatory variables.

On the other hand,

if the explanatory variables are highly correlated with each other, as illustrated

in this Venn diagram to the right, they have very little unique variability.

So having another two explanatory variables, X2 and X3, in your regression

model isn't going to help you explain any more of the variability in

the response variable than you could with just the one explanatory variable.

In this example, the Pearson Correlation Coefficients,

for the associations between these three variables, would be very high.
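One standard way to quantify this is the variance inflation factor (VIF): regress each explanatory variable on the others and compute 1 / (1 − R²). This sketch simulates three highly correlated variables like the X1, X2, X3 in the example; the data and the rule-of-thumb threshold of 10 are illustrative:

```python
# Sketch: correlation matrix and VIF for simulated collinear predictors.
import numpy as np

rng = np.random.default_rng(3)
n = 500
x1 = rng.normal(0, 1, n)
x2 = 0.95 * x1 + rng.normal(0, 0.1, n)  # nearly a copy of x1
x3 = 0.95 * x1 + rng.normal(0, 0.1, n)

X = np.column_stack([x1, x2, x3])
print(np.round(np.corrcoef(X, rowvar=False), 2))  # off-diagonal values near 1

# VIF for x1: regress x1 on x2 and x3 (with intercept), then VIF = 1 / (1 - R^2).
others = np.column_stack([np.ones(n), x2, x3])
coef, *_ = np.linalg.lstsq(others, x1, rcond=None)
resid = x1 - others @ coef
r2 = 1 - resid.var() / x1.var()
vif = 1 / (1 - r2)
print(f"VIF(x1) = {vif:.1f}")  # values above ~10 are commonly taken as serious
```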

Multicollinearity can really mess up your parameter estimates.

The estimates can be really unstable.

For example, you might see that a variable that you would expect to be highly

associated with your response variable, is not significant.

Or, you might get a negative regression coefficient for

a variable that by itself is positively associated with the response variable.

Or, taking just one explanatory variable out of the regression model, drastically

changes the parameter estimates for the other explanatory variables.

In a nutshell, your regression model is doing nothing to help

you predict your response variable if you have multicollinearity.

So what do you do if you have highly correlated explanatory variables?

The simplest approach is to choose just one.

This is a good option if there's no reason to believe that excluding the

variables will have an impact on the fit of your regression model.

Or, you could aggregate or otherwise combine the variables,

to create a single variable.