In Module 4, we'll talk about imputing for missing items. This is almost always an issue in surveys. We'll look at reasons for imputation first. Throughout the module, we'll talk about what thinking to use to do the imputing and what software is available to actually do it.

To visit this problem in general, there are at least two kinds of missingness. One is completely missing: no data at all in a case. Now, when does that occur? It can occur for at least two reasons. One is that we didn't sample a unit in the first place, so it's completely missing. Or we sampled it and it didn't respond at all. Also completely missing. How do we handle that? What we do is assign weights to the sample cases. We take the sample and project it up to the full universe. We are indirectly doing imputations through this weighting.

Now, software will have different standards for how it codes missing values, so I've listed some of them here. The default code in R is NA for a missing value. In SAS, .a to .z are used, and just a plain dot (.) is a missing value. Dot underscore (._) is also a missing value in SAS. Stata is similar: .a to .z and just a plain dot (.) are missing values.

Also, particular surveys may use certain codes to distinguish types of missingness. In some surveys it's important to put down a reason, essentially, for why it's missing, so you may see different codes on a single item being used. Ninety-nine is a popular one to denote missingness. Sometimes you'll see minus nine or minus eight as a code for missing. One thing that you want to be sure of when you get a data set from somebody else is what special coding they are using for missingness. You don't want to analyze a ninety-nine or a minus nine as if it's a real data point, when really it's just the survey's code for not being there.

Now, how do we go about handling these cases? One way is called complete case analysis. What that means is, if a case is missing on any variable, you just completely delete it. You treat it as not in the data set.
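To make the coding issue and complete case analysis concrete, here is a minimal sketch in plain Python. The data, the variable names, and the set of missing codes (99, -9, -8) are all made up for illustration; in a real survey you'd take the codes from the codebook, and you'd likely do this in R, SAS, Stata, or a data frame library instead.

```python
# Hypothetical sentinel codes. Always check the survey's codebook:
# 99 could be a legitimate value for some variables.
MISSING_CODES = {99, -9, -8}


def recode_missing(value, missing_codes=MISSING_CODES):
    """Return None when value is a missing-data code, else the value itself."""
    return None if value in missing_codes else value


def complete_cases(records):
    """Complete case analysis: keep only records with no missing variable."""
    return [r for r in records if all(v is not None for v in r.values())]


# Made-up toy data: age and a 1-5 satisfaction score.
raw = [
    {"age": 34, "satisfaction": 4},
    {"age": 99, "satisfaction": 5},   # 99 is the code for missing age
    {"age": 51, "satisfaction": -9},  # -9 is the code for missing satisfaction
    {"age": 27, "satisfaction": 2},
]

# Recode the sentinel values to None before doing any analysis.
clean = [{k: recode_missing(v) for k, v in r.items()} for r in raw]

# Complete case analysis drops any record with a None in it.
kept = complete_cases(clean)

# Treating 99 as a real age distorts the mean.
naive_mean_age = sum(r["age"] for r in raw) / len(raw)
observed_ages = [r["age"] for r in clean if r["age"] is not None]
recoded_mean_age = sum(observed_ages) / len(observed_ages)

print("mean age, codes treated as real data:", naive_mean_age)
print("mean age, codes recoded to missing:  ", round(recoded_mean_age, 2))
print("complete cases kept:", len(kept), "of", len(raw))
```

In this toy file, half the records are deleted even though each deleted record is missing only one of its two items.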
That seems extreme. Less extreme would be available case analysis. For example, if you're running a regression of y on a couple of x's, you just use the cases that are complete on those variables, regardless of whether they're complete on the other variables. That would allow you to use more cases than complete case analysis. But still, it seems bad. I mean, you're throwing away data on cases that are partially complete.

Another way, which we'll talk about here, is to just fill in those blanks, those holes, by imputation. That way you get to use all your cases in every analysis, if you impute for every missing value. It certainly builds up the sample size available to do analysis. There are implications of that, of course. That's not real data, so you ought to do something that accounts for the fact that it's not real data.

There are a number of problems with complete case analysis. One is that if the units with missing data differ systematically from the completely observed cases, you could have biased estimates. If, say, men and women differ systematically on your y values, and there are a lot more missing cases for men, then if you just throw them out, the distribution in your sample between men and women is not going to look like what's in the population. That means that when you do a combined analysis on the whole data set, even with weights, you could have biased estimates, so we'd like to avoid that.

Another problem with complete case analysis is that if you've got many variables included in some model that you're trying to fit, there may be very few complete cases. You'd be discarding a lot just for the sake of a simple analysis. That's bad.

Another thing to be aware of in complete case analysis is that you're not really ignoring those dropped cases. When you drop them out, there's an implied imputation there. For many analyses, like estimating means and totals, what you're doing is implicitly imputing those missing cases by the average of the complete cases.
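Here is a small sketch of those points in plain Python, using made-up toy data where men have more missing y values than women. The variable names and values are assumptions for illustration only, and the mean is unweighted to keep it short. It compares complete case and available case counts, shows how the sex distribution shifts among respondents, and checks that dropping cases gives the same mean as imputing the observed-case average.

```python
# Made-up toy data: None marks a missing value.
data = [
    {"sex": "M", "y": 10.0, "x": 1.0},
    {"sex": "M", "y": None, "x": 2.0},
    {"sex": "M", "y": None, "x": 3.0},
    {"sex": "M", "y": None, "x": None},
    {"sex": "F", "y": 4.0, "x": 1.0},
    {"sex": "F", "y": 6.0, "x": None},
    {"sex": "F", "y": 5.0, "x": 2.0},
    {"sex": "F", "y": None, "x": 3.0},
]


def complete_cases(rows):
    """Cases with no missing value on any variable."""
    return [r for r in rows if all(v is not None for v in r.values())]


def available_cases(rows, variables):
    """Cases complete on just the variables needed for one analysis."""
    return [r for r in rows if all(r[v] is not None for v in variables)]


def share_of_men(rows):
    return sum(1 for r in rows if r["sex"] == "M") / len(rows)


cc = complete_cases(data)             # usable for any analysis
ac_y = available_cases(data, ["y"])   # usable for estimating the mean of y

# The sex distribution among y-respondents no longer matches the sample.
share_men_sample = share_of_men(data)
share_men_resp = share_of_men(ac_y)

# Mean of y from the cases that have y.
mean_ac = sum(r["y"] for r in ac_y) / len(ac_y)

# Implicit imputation: fill every missing y with that same mean.
filled = [r["y"] if r["y"] is not None else mean_ac for r in data]
mean_filled = sum(filled) / len(filled)

print("complete cases:", len(cc), " available cases for y:", len(ac_y))
print("share of men in sample:", share_men_sample, " among y-respondents:", share_men_resp)
print("mean of observed y:", mean_ac, " mean after mean-imputation:", mean_filled)
```

The two means agree, which is the sense in which dropping cases imputes the average of the observed cases for everyone who is missing.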
Now, that may be poor. That may be a poor imputation, so we'd like to think of better ways of doing that.

Now it's good to go back to the missing data mechanisms that Rubin and Little defined, the ones that we've seen earlier in a previous video. One is missing completely at random, MCAR: every unit has got the same probability of appearing in the sample. You can apply that down at the item level: every item has got the same probability of being filled in. More realistic, probably, is missing at random, MAR. That means that after you account for some covariates, you may be able to make a sensible imputation for the missing cases.

The worst is nonignorable nonresponse. That means that the probability of appearing or not appearing depends on covariates and, critically, on the variables you're trying to analyze, the y's. So this is bad. We may have covariates for both the complete and the missing cases, but we're not going to have y's for the missing ones. We don't observe y's for the cases that have got missing data.

Generally, MAR is the best that we hope for. Accounting for as many covariates as seems reasonable, we hope, will give us a way of imputing intelligently for those missing cases, just as that MAR thinking, we hoped, would give us a way of adjusting weights for nonresponse in a way that would produce approximately unbiased estimates. We'll fill in the details on this imputation in coming videos.
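As a rough illustration of the three mechanisms, here is a small simulation sketch in plain Python. Everything in it is an assumption made up for the example: the population model, the single 0/1 covariate x, the response probabilities, and the simple within-group adjustment, which stands in for "accounting for covariates" and is not the specific imputation method covered later in the module.

```python
import random

rng = random.Random(2024)  # fixed seed so the sketch is reproducible

# Made-up sample: x is a 0/1 covariate known for every unit,
# y is the survey variable, higher on average when x == 1.
N = 20_000
units = []
for _ in range(N):
    x = 1 if rng.random() < 0.5 else 0
    y = 10 + 5 * x + rng.gauss(0, 1)
    units.append((x, y))


def avg(values):
    values = list(values)
    return sum(values) / len(values)


true_mean = avg(y for _, y in units)


def respondents(prob):
    """Keep each unit with response probability prob(x, y)."""
    return [(x, y) for x, y in units if rng.random() < prob(x, y)]


def adjusted_mean(resp):
    """Respondent mean of y within each x group, weighted by the
    full-sample share of that group (x is known for everyone)."""
    total = 0.0
    for g in (0, 1):
        share = sum(1 for x, _ in units if x == g) / N
        total += share * avg(y for x, y in resp if x == g)
    return total


# MCAR: same response probability for every unit.
mcar = respondents(lambda x, y: 0.6)
# MAR: response depends only on the observed covariate x.
mar = respondents(lambda x, y: 0.4 if x == 1 else 0.8)
# Nonignorable: response depends on y itself, even within an x group.
nmar = respondents(lambda x, y: 0.8 if y < 10 + 5 * x else 0.3)

bias = {}
for name, resp in [("MCAR", mcar), ("MAR", mar), ("nonignorable", nmar)]:
    unadjusted = avg(y for _, y in resp) - true_mean
    adjusted = adjusted_mean(resp) - true_mean
    bias[name] = (unadjusted, adjusted)
    print(f"{name:13s} bias unadjusted: {unadjusted:+.2f}  adjusted for x: {adjusted:+.2f}")
```

Under these assumptions, the unadjusted respondent mean is roughly unbiased only under MCAR, adjusting for x repairs the MAR case, and in the nonignorable case the estimate stays biased even after adjusting for x.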