At this point, we've learned how to test a multiple linear regression model and how to evaluate the fit of the model based on the significance of the regression coefficients, their confidence intervals, and the R-square, which is the amount of variability in the response variable that is explained by our explanatory variables. However, we should further evaluate our regression models for evidence of misspecification. Specification is the process of developing a regression model. If a model is correctly specified, then the residuals, or error terms, are not correlated with the explanatory variables. If the data fail to meet the regression assumptions, or if our model is missing important explanatory variables, then we have model specification error. We perform regression diagnostics to try to understand the cause of the misspecification so that we can try to address it.

We can assess violations of the assumptions of linear regression analysis by examining the model residuals. That is, we can take a closer look at the e in our regression formula, which is the error, or residual, estimate. There are many regression diagnostic procedures to choose from. In this course, we will focus on examining residual plots in order to visually evaluate specification error.

First, let's add another centered explanatory variable, internetuserate, to our regression equation. Internet use can be considered an indicator of a country's level of modernization. Here is the regression equation for this model and the SAS code. As usual, we are using the GLM procedure to test our regression model, but after the PROC GLM code we type PLOTS, then, in parentheses, UNPACK, then an equal sign and ALL, followed by a semicolon. PLOTS requests that the diagnostic plots be printed out in addition to the fit statistics for the regression model. UNPACK asks SAS to unpack the plots into a separate graph for each plot.
Finally, =all requests all of the regression diagnostic plots that are available for the GLM procedure. In the next line of code, we specify the model the same way we usually do, and after the /solution option we ask for 95% confidence intervals with the clparm option. Then, we add some additional code to request that a dataset be produced containing the residuals from the regression equation. The option residual=res asks for a column with the unstandardized residuals, and the student= option asks for a column with the standardized (studentized) residuals. Finally, the out=results option names the output dataset, which includes the original data along with these new residual columns. Then we end with a semicolon and run; to run the code. The information in the dataset produced by the output statement allows us to generate additional regression diagnostic plots.

Here are the results. We haven't yet discussed the interpretation of the intercept in detail. The intercept is the value of the response variable when all of the explanatory variables are equal to zero. Because we centered our two explanatory variables, so that the mean for each variable was equal to zero, the intercept is the female employment rate at the mean of urban rate and Internet use rate. So the female employment rate, when urban rate and Internet use rate are at their means, is about 44 out of every 100 women.

The results also show that the coefficients for the linear and quadratic urban rate variables remain significant after adjusting for Internet use rate. Internet use rate is also statistically significant. The positive regression coefficient indicates that countries with a higher rate of Internet use tend to have a higher female employment rate. Each observation has an estimated response value, also referred to as the predicted or fitted response value, based on this equation. But we know that this equation does not estimate the observed response value for each observation perfectly.
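Putting the pieces described above together, the PROC GLM code looks roughly like this. This is a sketch: the dataset name (gapminder) and the centered variable names (urbanrate_c, internetuserate_c), as well as the stdres column name for the studentized residuals, are assumptions, not taken verbatim from the course materials:

```
* Sketch of the model described in the lecture;
* femaleemployrate regressed on centered urban rate (linear and quadratic) and centered internet use rate;
proc glm data=gapminder plots(unpack)=all;
  model femaleemployrate = urbanrate_c urbanrate_c*urbanrate_c internetuserate_c / solution clparm;
  output out=results residual=res student=stdres;  * save raw and studentized residuals;
run;
```

The quadratic term is written here as the crossed term urbanrate_c*urbanrate_c; a precomputed squared variable in a DATA step would work equally well.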
In fact, urban rate and Internet use rate together explain only about 18% of the variability in female employment rate, so there's clearly some error in estimating the response value with this model. The residual is the difference between the expected, or predicted, female employment rate and the actual observed female employment rate for each country. We can take a look at this residual variability, which not only helps us to see how large the residuals are, but also allows us to see whether our regression assumptions are met and whether there are any outlying observations that might be unduly influencing the estimation of the regression coefficients.

First, we can use a Q-Q plot to evaluate the assumption that the residuals from our regression model are normally distributed. A Q-Q plot plots the quantiles of the residuals that we would theoretically see if the residuals followed a normal distribution against the quantiles of the residuals estimated from our regression model. What we are looking for is whether the points follow a straight line, which would mean that the model's estimated residuals are what we would expect if the residuals were normally distributed. If we scroll down to the Q-Q plot, we can see that the residuals generally follow a straight line, but deviate somewhat at the lower and higher quantiles. This indicates that our residuals do not follow a perfectly normal distribution. This could mean that the curvilinear association we observed in our scatter plot may not be fully estimated by the quadratic urban rate term. There may be other explanatory variables that we might consider including in our model that could improve estimation of the observed curvilinearity.
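Because the output statement saved the residuals to a dataset, we could also generate a Q-Q plot ourselves rather than relying only on the plots PROC GLM prints. A minimal sketch, assuming the residual column is named res and the output dataset is named results as above:

```
* Q-Q plot of the saved residuals against a normal distribution;
* mu=est sigma=est estimate the reference normal's mean and SD from the data;
proc univariate data=results;
  var res;
  qqplot res / normal(mu=est sigma=est);
run;
```

If the residuals were perfectly normal, the points would fall along the reference line; deviations at the tails, like those we saw above, show up as points bending away from the line.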