As with any technique we encounter, there are conditions associated with linear regression as well. These are linearity, nearly normal residuals, and constant variability. In this video we're going to go through these conditions one by one, and look at diagnostic tools that we can use to assess whether or not the conditions have been met. First, linearity, which says that the relationship between the explanatory and the response variable should be linear. This makes sense, because we're using a linear model to predict the response variable from the explanatory variable. There are indeed methods for fitting a model to non-linear relationships. However, those are beyond the scope of this course, so we're going to be sticking with linear models only. In order to check if the linearity condition has been met, we can use a scatter plot of the data or a residuals plot. Here we have a set of three plots, showing three relationships. On top we have the scatter plots of y versus x, and at the bottom we have the residuals plots. We're going to talk about what a residuals plot is in the next slide, and go through how to decipher it. So for now let's stick to the scatter plots and see which of these display a linear relationship and which ones do not. The first plot seems to display a pretty linear relationship. In the second plot, we can see a bit of a bend. The third plot is harder to judge, because there's a lot of scatter in the data. However, even with a lot of scatter, there isn't an obvious non-linear pattern. So, what is a residuals plot? Since previously we looked at the residuals of Rhode Island and the District of Columbia, for the relationship between poverty and high school graduation rate in the U.S., we'll stick with those to show how we build a residuals plot.
In Rhode Island, the observed high school graduation rate is 81% and the observed rate of poverty is 10.3%. Using our linear model, we can calculate what the predicted poverty rate would be for Rhode Island. For that we simply plug 81% into our linear model, and we see that the model predicts 14.46% for the poverty rate in Rhode Island. The difference between the observed and the predicted rates is the residual, and that comes out to be negative 4.16%. This is the value that's shown in the residuals plot. The x-axis of the residuals plot here is once again high school graduation rate, so, again, we have 81% for Rhode Island. And on the y-axis we have the residuals. Since Rhode Island has a negative residual, the point associated with Rhode Island appears below the zero line in the residuals plot and is 4.16% away from the zero line. For DC, the observed high school graduation rate is 86% and the observed rate of poverty is 16.8%. Using the same linear model and plugging in 86%, we can calculate the predicted poverty rate for DC, and we see the model predicts a poverty rate of 11.36%. In this case, the residual can once again be calculated as the observed value minus the predicted value, and that comes out to be positive 5.44%. And that's the same value that we're seeing on the residuals plot, where on the x-axis we have 86% for DC, and on the y-axis we have the associated residual. The ideal residual would be zero, because that would mean that the data point falls exactly on the regression line, and that there is no difference between the predicted and observed values for that particular data point. In practice, this is unlikely to happen, but we like small residuals and we want our residuals in the residuals plot to be randomly scattered around zero.
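To make the arithmetic concrete, here is a minimal Python sketch of the two residual calculations just described. It takes the predicted poverty rates exactly as quoted in the lecture rather than recomputing them, since the model's coefficients are not stated in this section. The dictionary layout and the names `observations` and `residual` are only illustrative choices.

```python
# Observed values and model predictions as quoted in the lecture (all in percent).
# The predictions are copied from the lecture, not recomputed from the model.
observations = {
    "Rhode Island": {"hs_grad": 81.0, "poverty_observed": 10.3, "poverty_predicted": 14.46},
    "DC": {"hs_grad": 86.0, "poverty_observed": 16.8, "poverty_predicted": 11.36},
}


def residual(observed, predicted):
    """Residual = observed value minus predicted value."""
    return observed - predicted


# One residual per state, rounded to two decimals to hide floating-point noise.
residuals = {
    name: round(residual(row["poverty_observed"], row["poverty_predicted"]), 2)
    for name, row in observations.items()
}

for name, value in residuals.items():
    # A negative residual sits below the zero line, a positive one above it.
    side = "below" if value < 0 else "above"
    x_position = observations[name]["hs_grad"]
    print(f"{name}: x = {x_position}%, residual = {value:+.2f} ({side} the zero line)")
```

Each printed pair (graduation rate, residual) is one point on the residuals plot.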
There are going to be some that are positive and some that are negative, because that corresponds to some points falling above the regression line, and other points falling below the regression line. And we want them to have absolutely no pattern, because what we want is for the linear model to capture all of the pattern in the data, and anything that's left over to be simply random scatter. So just like we look for a straight line in the scatter plot to check for the linearity condition, in the residuals plot we look for a random scatter around zero. The next condition is nearly normal residuals, which says that residuals should be nearly normally distributed, centered at zero. This condition may not be satisfied if there are unusual observations that don't follow the trend of the rest of the data. We can check this condition using a histogram or a normal probability plot of the residuals. Here, the histogram shows a somewhat symmetric distribution. It is indeed centered at zero, and the normal probability plot shows that there are some values in the upper tail that steer away from normality. But that's only a few observations. The last condition is constant variability, which says that the variability of points around the least squares line should be roughly constant. This implies that the variability of residuals around the zero line should be roughly constant as well. This condition is also called homoscedasticity, and we can check it using a residuals plot. On the scatter plot, we can see that as the x value varies, the variability of the data does not change a whole lot; the points seem to be captured within this constant-width grey band around the regression line. And when we look at the residuals plot, we can also confirm that the variability of the residuals, that is, how far they are from zero, does not vary with the value of the explanatory variable. Checking regression diagnostics is somewhat of an art.
And it takes lots of practice to be able to tell when a condition has or has not been met. Let's play around with the following applet to get some of that practice. Let's start with an example where things actually work well. Here we have a linear trend between our explanatory and our response variable. We can see a completely random scatter in the residuals plot. The histogram of the residuals is centered at zero, and the shape of the distribution looks fairly symmetric. And the normal probability plot, with almost all of the dots aligned on the straight line, also indicates that the distribution of the residuals is nearly normal. Let's take a look at another example. Once again, a linear trend, except this time the direction has changed, so we have a downward trend between our response and our explanatory variables. Once again, a completely random scatter in the residuals plot, and a fairly symmetric distribution in the histogram of the residuals. And the normal probability plot looks pretty good as well. So what do these look like when the conditions have not been met? What if we actually have a curved relationship between our response and our explanatory variable? In this case we can definitely see that the residuals plot is no longer displaying a random scatter around zero. The histogram of the residuals shows a right skew, and that same right skew is shown on the normal probability plot as well. So in this case, would it be appropriate to fit a linear model to predict y from x? Definitely not. Let's take a look at another example. Once again, we have a curved relationship. It's not as extreme a curve, and it might actually be somewhat difficult to tell from the scatter plot if we didn't have the grey band around it. But the residuals plot highlights very well for us that the relationship is not linear, because the residuals do not show a random scatter around zero.
The histogram of the residuals shows a distribution centered at zero, but the distribution doesn't look very normal. And we can see that the normal probability plot also shows that a lot of the points in the tails steer away from normality. So these were two examples where the linearity condition has not been met. What if the constant variability condition has not been met? This is usually when we have what we call fan-shaped data. We can see that when the value of the explanatory variable is low, the variability of the response variable is low as well. However, as x increases, the data fan out such that the response variable becomes more and more variable. This yields what we call a fan-shaped residuals plot, where we can clearly see that as x increases, the variability of the residuals increases as well. The histogram of the residuals looks fairly symmetric, and it's centered at zero. But looking at the normal probability plot, we can see that we're actually steering quite a bit away from normality. I hope that you will play around with this applet a little bit more to get practice working with situations where the conditions have been met and have not been met. The more you see these plots, the easier it's going to get for you to tell whether a condition has been met or not.
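If you'd like to experiment outside the applet, the three situations above (linear, curved, fan-shaped) can be imitated with a short Python sketch. Everything in it is an assumption made for illustration: the simulated data are made up, the helper names are hypothetical, and the two numeric summaries (`middle_mean` and `spread_ratio`) are only crude stand-ins for what your eye does when it looks at a residuals plot. They are not a replacement for looking at the plots themselves.

```python
import random
import statistics


def fit_line(x, y):
    """Least squares intercept and slope for y = b0 + b1 * x."""
    x_bar, y_bar = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def residuals(x, y):
    """Observed minus predicted value for each point."""
    b0, b1 = fit_line(x, y)
    return [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]


def middle_mean(x, res):
    """Mean residual over the middle third of the x range (far from 0 hints at a curve)."""
    lo, hi = min(x), max(x)
    a, b = lo + (hi - lo) / 3, hi - (hi - lo) / 3
    return statistics.fmean([r for xi, r in zip(x, res) if a <= xi <= b])


def spread_ratio(x, res):
    """SD of residuals in the upper half of x divided by SD in the lower half."""
    ordered = [r for _, r in sorted(zip(x, res))]
    half = len(ordered) // 2
    return statistics.stdev(ordered[half:]) / statistics.stdev(ordered[:half])


def normal_plot_points(res):
    """(theoretical quantile, sorted residual) pairs for a normal probability plot.

    Uses the (i + 0.5) / n plotting positions, which is one common convention.
    """
    n = len(res)
    normal = statistics.NormalDist(0, statistics.stdev(res))
    return [(normal.inv_cdf((i + 0.5) / n), r) for i, r in enumerate(sorted(res))]


# Made-up data: same x values, three different relationships with y.
rng = random.Random(1)
x = [rng.uniform(0, 10) for _ in range(200)]
datasets = {
    "linear": [2 + 3 * xi + rng.gauss(0, 2) for xi in x],
    "curved": [2 + 0.5 * xi ** 2 + rng.gauss(0, 2) for xi in x],
    "fan-shaped": [2 + 3 * xi + rng.gauss(0, 0.2 + 0.5 * xi) for xi in x],
}

report = {}
for name, y in datasets.items():
    res = residuals(x, y)
    report[name] = {
        "middle_mean": middle_mean(x, res),
        "spread_ratio": spread_ratio(x, res),
        "normal_plot": normal_plot_points(res),
    }
    print(f"{name:>10}: middle-third mean residual = {report[name]['middle_mean']:.2f}, "
          f"upper/lower spread ratio = {report[name]['spread_ratio']:.2f}")
```

Plotting `res` against `x` gives the residuals plot, and plotting the pairs in `normal_plot` gives the normal probability plot. Under these assumptions you would expect the curved data to show a middle-third mean well away from zero, and the fan-shaped data to show a spread ratio well above one.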