Okay. Welcome back. In this special topic, we're going to be talking about whether we should use survey weights when fitting regression models. We've talked about a lot of different regression models at this point, and one thing we haven't discussed is whether or not we should use the weights associated with a given probability sample when fitting these different types of regression models. So, this particular special topic is going to talk about what to think about with respect to using survey weights when estimating regression models. Okay. So, let's review what's meant by survey weights, and this goes back to when we talked about understanding and visualizing data and the notion of a probability sample. Recall that survey weights may be available in data sets that are collected from complex probability samples, and that these weights account for, at the very least, unequal probabilities of selection into the sample for different cases. So, cases with a lower probability of being selected into the sample will get a higher weight in the overall analysis. In theory, these weights are designed to enable unbiased estimation of selected parameters for a finite target population. So, recall that we have a finite target population of interest, and we use these weights to make inference about that finite target population; the weights enable unbiased estimation, meaning that, on average, our estimates will be equal to the true finite population parameters of interest. Now, consider the case where an analyst wants to fit a regression model to a given dependent variable collected from a probability sample. We know from looking at the data that there are weights in that data set that represent, again, at the very least, different probabilities of selection. The question is, do we need to account for those weights when actually fitting the regression model and making inference? That's what we're going to be discussing in this lecture.
So, should we use these weights in estimation? Using survey weights to fit regression models, first of all, ensures that the estimated regression parameters will be unbiased with respect to the sample design. So, taking the sample design into account, we ensure that, on average, the estimates of our parameters across repeated samples will be equal to the true population regression parameters. However, those survey weights do not protect analysts from poor model specification. As an analyst, you still have to do a good job of specifying the mean structure of the model and the variance-covariance structure of the model. The weights aren't going to protect you from leaving out important predictors or misspecifying nonlinear relationships, some of the issues that we've talked about that go into specifying good models. So, suppose that an analyst fits a poorly specified model but still uses the survey weights in estimation and computes weighted estimates of the regression parameters. What's the result of this kind of analysis? Well, first of all, the analyst is going to produce unbiased estimates of the regression parameters with respect to the sample design, so that's a good thing. On the other hand, the analyst has misspecified the model for the finite population of inference, and that's a bad thing. So, in this situation, the analyst is going to come up with good, unbiased estimates of a really bad population model. In practice, we want to try to turn both of these indicators to good when fitting models to survey data. We want unbiased estimates of the regression parameters, and we want to do our best to specify a correct model for the finite population of inference. So, let's look at some examples of what we mean by this. First of all, consider the case where we have weights in the analysis and we use those weights for estimation, but we've done a poor job of specifying the model.
So, in this example data set that you see plotted here, the size of each point represents the survey weight for that case. The larger the circle or bubble, the more weight that particular case gets. Now, this scatter plot shows that there is a non-linear relationship between X and Y. You see that for smaller values of X, the value of Y dips down to around zero to five. Then, as the values of X get larger, the value of Y increases up to a point, but then the relationship between X and Y more or less flattens out. So, it seems like there's a non-linear relationship, and it's the cases with larger values on X and Y that tend to get more weight. If we just plotted the data, it definitely seems like there's a nonlinear relationship between X and Y. But the analyst has misspecified the model, and they're fitting a model that assumes a linear relationship. Now, the analyst uses the weights in estimation because they've heard that accounting for those survey weights when fitting linear regression models will give you unbiased estimates of the regression parameters with respect to the sample design. So we get unbiased estimates of what these regression parameters look like for the overall population. In this case, you see that when we fit that straight-line model, the regression fit is drawn to the points that have higher weight. The straight line goes through all the points with the highest weight, but we're still doing a really poor job of modeling the observations at the lower end of the distribution on both X and Y. So, we have an unbiased estimate of what the best model assuming a linear relationship between X and Y would look like in this particular population. We have unbiased estimates of the regression parameters, but we don't have a well-specified model, and we're missing that important feature of the relationship here.
So, suppose that we instead ignore the weights in estimation and still misspecify the model. If the analyst goes back and refits the model, ignoring the weights and misspecifying the model, now you see that the straight-line regression fit is drawn further toward the points at the lower end of the distribution. The line is no longer drawn toward the points that have the highest weight in the overall analysis. So the line shifts downward a little bit relative to the model where we estimated the regression parameters using the weights. What this gives us is a biased estimate of the relationship, because we're not using the weights in estimation. We're not getting an unbiased estimate of the finite population regression parameters; we're getting a biased estimate. And that biased estimate is based on a poorly specified model, so it's kind of the worst of both worlds in this particular case. We're ignoring the weights, so we no longer get unbiased estimates, and we're misspecifying the model, so we're missing that non-linear relationship. Even if that straight-line relationship did hold in the population, we're not getting the right estimate of that straight-line relationship for the values of interest. So again, note that this is not the situation we want to be in, in practice. The fitted regression line is drawn toward the low-weight points, and we really want to take steps to avoid this situation because, again, we're getting the worst of both inferential worlds in this case. So, let's consider an alternative where the analyst actually does a good job of specifying the model. In the regression model, they allow the relationship of X with Y to be non-linear. They might include X squared as a predictor variable to model that curvilinear relationship, but they ignore the survey weights in the actual analysis.
In this particular case, we see that the well-specified model provides a good fit to the observed data. You see that the fitted regression model accurately reflects the relationship between X and Y, with a little bit of error. In this model-based approach, if the model is correctly specified, we may not ultimately need the weights to capture this relationship in the population, and that's the key point here. If we do a good job of specifying that relationship, after looking at our data and at scatter plots, we may not ultimately need to use the weights to describe what's going on in the population. We can capture this relationship between X and Y with a well-specified model, as you can see here in the fit. So, what happens if instead we use the weights in estimation and again do a good job of specifying the model? The analyst uses the weights to estimate the model and correctly specifies the model, so it's kind of the best of both inferential worlds. We get an unbiased estimate of that relationship from a well-specified model. Again, you can see that the model fits well. But if you look at these two different slides, the fit is essentially the same. The fit of the model allowing for the curvilinear relationship doesn't look any different whether we're using the weights or not, so we would arrive at the same conclusions. The drawback is that if you've done a good job of specifying the model describing this relationship and you use those weights in estimation, you can inflate the standard errors of your regression parameter estimates. Remember, when testing hypotheses about regression parameters, we generally take an estimate, divide it by its standard error, and refer that ratio to a t-distribution. That's a common hypothesis testing technique.
If we use the weights in estimation and the model's been well-specified, that standard error can get inflated unnecessarily just because we're using those variable weights in the estimation, and that increases the sampling variance of our estimates. So, in this case, we probably would not want to use the weights, because we've done a good job of specifying the model, and we don't want to affect our inferences by having estimates that are unnecessarily variable from using those nuisance weights. Again, we've captured the relationship by doing a good job of specifying the model. To summarize what we talked about on these previous slides with these different pictures, here are some recommendations for practice when you find survey weights in a given data set. If survey weights are available for a probability sample and you wish to fit a regression model, first and foremost, do the best that you can to specify the model correctly. Think about your subject matter knowledge, think about the predictor variables that you have available, look at the data, look at some scatter plots, and make sure that you're capturing relationships to the best of your knowledge in specifying the regression function for a given dependent variable. Second, fit the model with and without using the survey weights. Statistical software, like Python and other packages, makes it very easy to fit these models very quickly with and without the survey weights, so you can examine the sensitivity of your results and your inferences to the use of survey weights for estimation. Again, in theory, we're using those weights to come up with population estimates of these regression parameters, so that's important. But there is that drawback where, if we've done a nice job of capturing relationships, using the weights in estimation can inflate our standard errors.
So we need to be careful about this and compare the estimates with and without using the survey weights. Now, if the estimated coefficients remain similar when following these two approaches, but the weighted estimates have larger standard errors, it's likely that your model has been specified correctly, at least to some extent, and the weights may ultimately be unnecessary. What the weights might be doing is only inflating the sampling variance of your estimates. So, this is where comparing the estimated coefficients relative to the changes in the standard errors under these two approaches becomes very important. In the other case, if the estimated coefficients change substantially, the model may have been misspecified. In this case, where we're not sure we've specified the model correctly, the coefficients changing substantially means that the weights carry information about the relationship that you're trying to model in the population of interest. In this setting, the weighted estimates should be reported to ensure that they're at least unbiased with respect to the sample design. If we're not sure that we have the model correct, we still want to report what the population estimates of those regression parameters would be, so that we can make inferences about the finite population even if the model has not been specified correctly. So, if we've left out a predictor, or maybe we weren't able to measure some important predictors, we should at least report the weighted population estimates so that we get a good, unbiased estimate of our potentially misspecified model. There's a very famous quote in statistics, attributed to the statistician George Box: "All models are wrong, but some are useful." Whether we're making predictions or making inference about relationships between different variables, in practice it's difficult to specify every model correctly.
The question that really comes up here is, when do we ever know what the true model is? Exactly which predictor variables should be included? Exactly which functional relationships between variables should we be modeling? Which covariance structure should we be modeling? Do we ever really know what the true model is in our population of interest, especially when we have multiple predictors? It's very difficult to do this in practice, and this is where regression modeling can really become an art: trying to identify the best predictors for a given dependent variable. So, in that setting, again, modern statistical software like Python makes it very easy to fit both unweighted and weighted models if survey weights are available, so you can examine the sensitivity of your inferences to using the weights in the actual analysis and follow the recommendations that we talked about on the previous slide. There are also formal tests that exist for comparing weighted and unweighted estimates when using survey weights to fit regression models, and you can find out a lot more about those formal tests in the deep-dive readings for this week.