0:00

In this video we'll discuss linear regression, which is perhaps the most widely used predictive model.

Why are we interested in linear regression models? There are at least three reasons. First, linear regression models are easy to interpret. Second, the model is not too complex and is relatively concise. Finally, even if we are interested in more complex models, linear regression can still serve as a useful baseline.

0:24

For those of you who have taken a statistics class before, linear regression is likely not new. I caution here that using linear regression as a predictive model is somewhat different from the linear regression covered in most high school and college-level statistics classes. In predictive modeling, there is a strong emphasis on prediction, which is somewhat different from classical statistics. We will also use linear regression as a context to discuss important issues in predictive modeling.

To make our discussion concrete, I would like to start with an example using some sample data. This dataset contains 314 homes listed for sale in Boulder, Colorado during July 2014. The original dataset has many columns; however, I will only use a few of them to illustrate the concepts.

1:11

Here is a list of the data columns. I would like to use the dataset to understand what factors determine the list prices of the houses for sale. Therefore, list price is the target variable, and all other variables are predictor variables. Most of the variables are continuous, with the exception of home type, parking type, and ZIP. It is worth pointing out that even though ZIP takes numerical values, it should be treated as a categorical variable, because the numbers in ZIP codes have no quantitative meaning.
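To treat a numeric-looking column like ZIP as categorical, a common approach is one-hot encoding: each distinct ZIP becomes its own 0/1 indicator. Here is a minimal pure-Python sketch (the ZIP codes shown are made up for illustration; the video does not list the actual values):

```python
def one_hot(values):
    """One-hot encode a categorical column: each distinct value
    becomes a 0/1 indicator, so a model cannot treat the raw
    numbers (e.g. ZIP codes) as if they had quantitative meaning."""
    categories = sorted(set(values))
    encoded = [[1 if v == c else 0 for c in categories] for v in values]
    return encoded, categories

# Hypothetical ZIP values for four homes:
zips = ["80301", "80302", "80301", "80305"]
encoded, cats = one_hot(zips)
print(cats)        # ['80301', '80302', '80305']
print(encoded[0])  # [1, 0, 0]
```

In practice a library such as pandas would do this for you, but the idea is the same: the model sees three separate indicator columns rather than one misleading numeric column.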

1:51

Is this the right dataset to use? Obviously, the answer depends on our purpose. Our purpose here is to understand what factors determine the list prices of homes for sale, and this dataset seems to be the relevant one to look at. Of course, we could supplement the dataset with additional data such as historical sales, crime rates, and school districts. From my own experience, these additional data fields would be very helpful for our analysis, and they are available if we are willing to spend the time or money to collect them. I choose not to include them here, but you should note that as a limitation of our discussion.

This is a scatter plot showing the relationship between square footage and list price for all homes in our dataset. As we can see, there seems to be some positive association between the two: as square footage increases, list price increases. This positive association is quite intuitive; larger homes cost more.

Linear regression can help us understand this relationship better. In linear regression, we would like to find the line that best fits the scatter plot. Here, I show a couple of alternatives.

3:10

The fitted line takes the form y hat = b0 + b1 * x. Here, y hat is the predicted value of the target variable and x is the value of the predictor variable. b0 is called the intercept and b1 is called the slope. For given b0 and b1, the value of y hat changes as the value of x changes: when x equals 0, y hat equals b0, and when the value of x increases by 1 unit, the value of y hat increases by b1 units.
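The prediction equation above can be sketched as a one-line function; the two assertions check the intercept and slope interpretations just described (the coefficient values here are arbitrary, not from the housing data):

```python
def predict(x, b0, b1):
    """Predicted value y-hat = b0 + b1 * x for simple linear regression."""
    return b0 + b1 * x

# When x = 0, y-hat equals the intercept b0:
assert predict(0, 2.0, 0.5) == 2.0
# Increasing x by one unit increases y-hat by exactly the slope b1:
assert predict(11, 2.0, 0.5) - predict(10, 2.0, 0.5) == 0.5
```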

3:38

Linear regression gives us a way to find, in some sense, the best fitted line. For the dataset we have, we obtain that b0 is about -125 and b1 is 0.43. The red line in the graph shows the fitted line.

Note that b0 is the value of the line when the square footage is 0; it is the value on the y-axis where the line intersects it. This explains why b0 is called the intercept. The value of b1 is 0.43: for each one-unit increase in square footage, the predicted value of y increases by 0.43. Recall that the list price is in thousands of dollars. Therefore, for each additional square foot, the predicted list price increases by about $430.
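The "best fitted line" mentioned above is found by ordinary least squares, which for one predictor has a well-known closed form. Here is a small sketch using toy data that lies exactly on a line (not the actual Boulder housing data, which is not reproduced in the video):

```python
def fit_line(xs, ys):
    """Ordinary least squares for one predictor, via the closed form
    b1 = cov(x, y) / var(x) and b0 = mean(y) - b1 * mean(x)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    b1 = sxy / sxx
    b0 = my - b1 * mx
    return b0, b1

# Toy data lying exactly on y = 10 + 2x:
b0, b1 = fit_line([1, 2, 3, 4], [12, 14, 16, 18])
print(b0, b1)  # 10.0 2.0
```

On real data such as the housing example, the points do not lie exactly on a line, and the same formula produces the compromise line (b0 about -125, b1 about 0.43 in the video).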

Also note that the fitted line from linear regression does not perfectly explain the relationship between square footage and list price. Indeed, on the scatter plot, most, if not all, points lie either above or below the line, meaning that the predicted list price is either below or above the list price in the dataset. The part of an observed list price that is not explained by the fitted value is called the residual, which is given by the observed value minus the fitted value.

4:59

Let's look at the scatter plot again. The big red point is an observation in the dataset, which corresponds to a pair of values of square footage and list price. The big blue point on the line shows the predicted value of the fitted line for the same square footage. Obviously, this predicted value is quite far off. How far off it is from the observed value can be measured by the residual, which is the difference between the observed value and the fitted value.

Note that if a point is above the line, the observed value is higher than the predicted value and the residual is a positive number. However, if the point is below the line, the observed value is smaller than the predicted value and the residual is a negative number.

For the big red point, the observed list price is 6,499. The predicted value can be calculated using the estimated coefficients b0 and b1 and the square footage, which is 5,588. The predicted value is about 2,260, and the residual is the difference between the two, which is about 4,238.
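The residual calculation for this point is just arithmetic on the prediction equation. Note that with the rounded coefficients quoted in the video (b0 of about -125, b1 of 0.43), the results come out slightly different from the 2,260 and 4,238 on the slide, which were presumably computed from the unrounded estimates:

```python
# Residual = observed value - fitted value, for the highlighted home.
# Coefficients are the rounded values quoted in the video; the slide's
# figures (about 2,260 and 4,238) presumably use unrounded estimates,
# so these results differ slightly.
b0, b1 = -125.0, 0.43
sqft = 5588        # square footage of the highlighted home
observed = 6499    # observed list price, in thousands of dollars

fitted = b0 + b1 * sqft
residual = observed - fitted
print(round(fitted, 2))    # 2277.84
print(round(residual, 2))  # 4221.16
```

Because this point lies well above the line, its residual is a large positive number, exactly as described above.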

Is our fitted line a good fit to the data? One way to assess the accuracy of our fitted line is to see whether it explains the relationship between the two variables well. If all data points lie exactly on the regression line, the line perfectly explains the relationship. In most cases, however, the points scatter around the regression line. When the data points are close to the line, the line does a better job of explaining the relationship. To capture this measure of accuracy numerically, we use r squared, which can be interpreted as the percentage of the variation in y that is explained by changes in x.

6:45

r squared is also called the coefficient of determination and takes values between 0 and 1. The bigger the r squared, the better the model fit, because more variation in y is explained by changes in x.
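The two extremes just described (all points on the line versus a line that explains nothing) follow directly from the standard formula, R-squared = 1 - SS_res / SS_tot. A small sketch with toy values:

```python
def r_squared(ys, fitted):
    """R^2 = 1 - SS_res / SS_tot: the fraction of the variation in y
    that is explained by the fitted values."""
    my = sum(ys) / len(ys)
    ss_tot = sum((y - my) ** 2 for y in ys)          # total variation in y
    ss_res = sum((y - f) ** 2 for y, f in zip(ys, fitted))  # unexplained part
    return 1 - ss_res / ss_tot

ys = [12.0, 14.0, 16.0, 18.0]
# A perfect fit (fitted values equal the observations) gives R^2 = 1:
print(r_squared(ys, ys))          # 1.0
# Fitted values stuck at the mean explain nothing, giving R^2 = 0:
print(r_squared(ys, [15.0] * 4))  # 0.0
```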

6:58

For our linear regression model, r squared is about 0.64. In other words, 64% of the variation in list prices can be explained by square footage. This model fit is considered quite good. Intuitively, this is not too hard to understand: square footage is perhaps one of the most important factors when people assess the value of a house.

Another question is whether the coefficient estimates are reliable. This question can be answered using p-values, which tell us whether the coefficient estimates are statistically significant. In other words, p-values tell us how reliable the coefficient estimates are: smaller p-values imply stronger statistical significance, so coefficient estimates with smaller p-values are considered more reliable.

For our model, the p-value for b0 is 0.0266 and that for b1 is close to 0. This shows that both coefficient estimates are statistically significant. We typically use a cutoff of 0.05 for p-values: any coefficient estimate with a p-value less than 0.05 is considered statistically significant.
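Most statistical libraries report these p-values automatically. As one example, assuming SciPy is installed, `scipy.stats.linregress` returns the p-value for the slope alongside the fitted coefficients; the toy data below (not the housing data) has a clear linear trend plus noise:

```python
from scipy import stats

# Toy data with a clear linear trend (roughly y = 2x) plus small noise:
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2.1, 3.9, 6.2, 8.1, 9.8, 12.2, 13.9, 16.1]

result = stats.linregress(x, y)
print(round(result.slope, 2))          # 2.0
# Apply the usual 0.05 cutoff from the discussion above:
print(result.pvalue < 0.05)            # True
```

Because the trend here is so strong, the slope's p-value is far below 0.05, so the slope estimate is statistically significant.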

In classical statistics, we make a number of assumptions in linear regression. I choose not to discuss them in detail here. I will only comment that we are less concerned with violations of these classical assumptions in predictive modeling. However, we need to be careful when interpreting our results if one or more of these assumptions are violated.