Hello, everyone. Welcome back. Now that we've talked conceptually about some measures of processing quality that we can compute for designed data, we're going to look at some examples of actually computing those measures of processing quality within the R software. Back to the big picture again: we've now talked about thinking about quality with respect to validity, and computing measures of quality with respect to the origin of the data. Now we're going from actually having a data value recorded, to processing those recorded data values and producing an edited data set, and thinking about measures of quality associated with that processing step. We're going to see some examples of computing those quality metrics associated with the processing step on the measurement dimension side. Here's the example: one type of processing quality metric that we can look at with designed data is inter-coder agreement. Suppose that two human coders have been asked to code 10 open-ended survey responses for a survey variable that captures people's occupations. People could say whatever their occupation was, in an open-ended manner; they could say as much or as little as they wanted to. These human coders have been asked to code these responses into reasonable categories that can be used for analysis. In this particular application, we're looking to create a new variable in the survey data set that represents whether these are blue-collar occupations, where a one means yes and a zero means no. That's the variable we ultimately want to code. The binary indicators produced by coder 1, you can see here, were 1, 0, 1, 1, 0, 0, 0, 1, 0, 0. Those were their codes.
But then, looking at the exact same 10 open-ended responses, coder 2 produced 1, 0 (both in agreement), but then 0 (a disagreement), then 1 and 0 (both in agreement), but then a 1 in disagreement, and finally, for the 10th case, they disagree, where coder 1 coded 0 and coder 2 coded 1. The question is, how do we quantify this level of inter-coder agreement, and is it acceptable? For the analysis, we're going to look at a cross-tabulation of the codes from the two coders, and then we're going to compute the Kappa statistic as a measure of inter-coder agreement and test it for significance. Here's the R code that we would use for the Kappa analysis, and we're going to turn back to RStudio here in a second, but we start with cross-tabulation. It's pretty easy to do this: we just use the table function and indicate the two vectors that we wish to cross-tabulate, so I'm using the c function to enter the raw data here. More generally, those could be two different variables within a DataFrame object. Either way this works, and it creates a simple cross table where we look at how much agreement there was and how much disagreement there was. Looking at that cross tab, we see that there's 70 percent agreement: for 70 percent of these cases, the two coders agreed in terms of the coding. I got that by looking at the counts of agreement, 4 cases where both coders coded 0 and 3 cases where both coded 1, and dividing by the total number of cases coded, which works out to be 0.7 in this case. That just gives us a descriptive sense of the agreement; we can compute the Kappa statistic as a more formal measure of this inter-coder agreement. There are many different ways to do this within R. One particular contributed package that contains a quick function for calculating the Kappa statistic is the fmsb package.
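To make this concrete, here is a base-R sketch of the calculation just described. Coder 1's vector is exactly as read out above; coder 2's vector is one ordering consistent with the agreements and disagreements described; and the hand computation of Kappa is added here purely for illustration (the lecture uses a contributed package for Kappa instead):

```r
# Binary blue-collar codes from the two coders (coder 2's ordering is
# reconstructed from the agreements described in the lecture)
coder1 <- c(1, 0, 1, 1, 0, 0, 0, 1, 0, 0)
coder2 <- c(1, 0, 0, 1, 0, 1, 0, 1, 0, 1)

# Cross-tabulate the two sets of codes
tab <- table(coder1, coder2)
print(tab)

# Percent agreement: counts on the diagonal over total cases coded
p_o <- sum(diag(tab)) / sum(tab)   # 0.7

# Chance-expected agreement, from the marginal proportions
p_e <- sum(rowSums(tab) * colSums(tab)) / sum(tab)^2   # 0.5

# Cohen's Kappa: agreement beyond chance, rescaled
kappa <- (p_o - p_e) / (1 - p_e)   # 0.4
```

With 7 of 10 cases in agreement and 0.5 agreement expected by chance from the marginals, Kappa works out to 0.4, matching the package output discussed below.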
We can load that particular package and then run that Kappa.test function, again referring to the exact same two vectors of produced codes from the two human coders. Let's switch to RStudio and take a look at how to do this. Now, for the example here, you've been provided with an R Markdown file. This is a nice file format that allows you to more formally share your results with colleagues and add your own text within the body of the overall analysis that you're doing, so that you can ultimately produce a document that is very transparent and open about the results you're generating and the code you're using. Up here at the top of this R Markdown file, I see a very standard header section, which introduces the title of this overall file; I'm calling this the between-coder variance example. Then, within an R Markdown file, you can type whatever you would like: you can add whatever text you would like to introduce your analysis or interpret your analysis, and you can also add hyperlinks to different parts of the text. To add a hyperlink, just put the section of text that you want to hyperlink in square brackets, and then, immediately following in parentheses, include the website that you wish to link to for that particular text. I've just added some introductory text here about R Markdown in RStudio. Then I said we're going to use the coder data set here to illustrate the estimation of between-coder variance in a binary indicator. The first step we're going to take is to load the coder.csv data set, which is available online, into an R object that we're calling coder data. Now notice here within the R Markdown file, we have this general text that I was typing, but then you have these chunks of R code. You see these three backticks here followed by an r in curly brackets; whatever follows that is R code that will be executed when you run the R Markdown file.
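As a sketch, the R Markdown elements just described (the title header, a hyperlink in square brackets followed by parentheses, and a fenced R chunk) look like the following; the title, link text, URL, and file path here are placeholders rather than the exact contents of the course file:

````
---
title: "Between-Coder Variance Example"
output: html_document
---

We use the [coder data set](https://example.com/coder.csv) to illustrate
the estimation of between-coder variance in a binary indicator.

```{r}
# Read coder.csv; header = TRUE treats the first row as variable names
coder.data <- read.csv("coder.csv", header = TRUE)
```
````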
The end of that chunk of R code is marked by three backticks once again. I just have one R function in this chunk, which says to read the CSV file, available at this website, and recognize variable names in the first row; that's what header = TRUE means here. We're creating a new DataFrame object called coder data in this chunk of R code within the R Markdown file. Another cool thing about R Markdown files is that you can run these chunks one at a time rather than processing the entire file. Within the chunk, I can just highlight that code and click on this Play button, which stands for run the current chunk. If I do this, it reads my data set into a new object called coder data. Then I can process that coder DataFrame further. Now, what we're going to do moving forward is load the contributed lme4 package in R, which enables users to fit logistic regression models using the glmer function. We're going to do that when we get back to estimation of this between-coder variance. First, though, we want to look at the overall Kappa statistic; we'll come back to the rest in a second. Once we load that fmsb library, we can run this Kappa.test function, so let's take a look at the output that would be generated. Here are the results of that particular Kappa.test function. It estimates Cohen's Kappa statistic and tests the null hypothesis that the extent of the agreement is the same as random. That's our null hypothesis here: any agreement between these two coders is essentially random, or Kappa equals 0. This function echoes the data that we input. Then we get an overall z-statistic to perform our hypothesis test, and the function also computes a p-value for testing the null hypothesis as to whether or not the agreement was random. This function also produces a 95% confidence interval for the Kappa statistic. The actual estimated Kappa is 0.4. The question is, is that large agreement or small agreement between these two coders?
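For reference, the Kappa.test call that produces this output can be sketched as follows, assuming fmsb has been installed; coder 2's vector is reconstructed from the agreements described earlier, and the exact printed layout may vary by package version:

```r
# Load the contributed fmsb package (install.packages("fmsb") if needed)
library(fmsb)

coder1 <- c(1, 0, 1, 1, 0, 0, 0, 1, 0, 0)
coder2 <- c(1, 0, 0, 1, 0, 1, 0, 1, 0, 1)

# Estimate Cohen's Kappa, test H0: Kappa = 0 (agreement no better than
# chance), and report a qualitative judgment of the strength of agreement
Kappa.test(coder1, coder2)
```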
A very cool thing about this particular function, and why we picked it, is that it also produces a judgment of that estimate; it helps you to interpret what the estimate means. Specifically, the function calls this level of agreement fair agreement. Previous authors have proposed an ascending scale of agreement depending on the value of the Kappa statistic, and 0.4 would only be called fair agreement on that scale. Our Kappa estimate is 0.4, and the p-value is 0.1, which would generally mean we fail to reject the null hypothesis that agreement is occurring at random. The qualitative label for the strength of the agreement in this case is fair, which means that we could be doing a better job between these two coders and we need to actually resolve this level of disagreement. Now, notice that the 95% confidence interval is pretty wide in this case, and that's because our sample size is only 10; there's a lot of uncertainty in this estimate of Kappa. With a larger data set, with more values where you're looking at agreement between two coders, that confidence interval is going to be narrower and you will have more power to detect a Kappa statistic as potentially being significant. But at least in this case, we don't have enough evidence against the null hypothesis. If that p-value were below 0.05, we'd have stronger evidence of non-random agreement. But as it is, if the coders are disagreeing three out of ten times, we don't really have strong evidence that agreement is consistent, and we have to go back and resolve some of those discrepancies. Disagreements are occurring a bit more frequently than we would expect with a larger Kappa. A solution would be to resolve these discrepancies with a third independent coder, possibly an expert in these measurements, who would discuss any of the discrepancies with the two original coders and try to arrive at a final coding for all ten of these cases before we actually do the analysis.
Because if there's this level of disagreement, we want to resolve those disagreements before we produce a final edited data set for analysis, and Kappa is an important processing quality metric at this stage. Again, keep in mind that this was a very small sample size, mainly for illustration; in most surveys, many more cases would be coded. We want to see the strength of agreement rise above that fair level when using this particular function. We've looked at Kappa; now, as another example, let's look at between-coder variance more generally as another processing quality metric for designed data. Suppose that you have data from 15 different coders, and each of them has been assigned a random subsample of open-ended occupation responses. Again, they've been charged with coding these open-ended occupation responses as being blue collar or not. So we read in that coder data, as you can see in the R Markdown file, from the online site. What we want to do is estimate between-coder variance in the probability of coding the type of occupation as blue collar. Do different coders tend to produce blue-collar codes at a higher rate or a lower rate than other coders? Is there a lot of variability between the coders in terms of how often they identify blue-collar occupations, given that they've each been assigned random subsamples? Then we want to test that variance between coders for significance. In other words, is a model including these random coder effects on these binary indicators a better fit than a model excluding the random coder effects? The way we do this is to fit a multilevel logistic regression model; logistic because our dependent variable is binary here: did they code it as blue collar (1) or not (0)? That's why we're using a different type of model. We fit these multilevel logistic regression models using the glmer function, what some people call the "glimmer" function, within R.
Then we also fit that same model without the random coder effects using the more standard glm function, which doesn't include the random effects. You can see the difference in the code here. In the glmer function, blue collar is our dependent variable, the binary indicator. Then, after a tilde, we have these random coderID effects: in the data set here, coder data, we have that coderID variable, and we include random effects of coderID. The DataFrame object is called coder data, and the family for the dependent variable is binomial; by default, that means the dependent variable is binary in nature and we're fitting a logistic regression model. Now, for the model without the random coder effects, we simply use glm, a different function without random effects. The dependent variable is still blue collar; we include a one to say that this model only includes an intercept without any random coder effects; same data set, same family, because we have a binary dependent variable. Let's flip back to RStudio here. In the next chunk of R code, we do have to load the lme4 package first, because that's where the glmer function lives for fitting multilevel logistic regression models. I can do that in this chunk of R code where it says require(lme4), provided that I've already installed the lme4 package. Again, you can install packages over in the lower-right pane here by clicking Install and then typing the name of the package that you want to install, where it says Packages. I'm not going to do that in this case, because I've already installed the lme4 package; I'm just going to load it by running this chunk of R code so that the glmer function is available. Now, let's fit both of the models that we talked about in the slides: the first model, and then M2, which excludes the random coder effects, to see whether or not the between-coder variance is significant. I'm going to fit those models, creating these objects called M and M2.
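Pulled together, the two model fits just described can be sketched as follows. The variable names bluecollar and coderID and the data-frame name coder.data are taken from the description above and may differ slightly from the actual course file; a small simulated stand-in for the coder data is included here so the chunk runs on its own:

```r
# lme4 provides glmer() for multilevel (mixed-effects) logistic regression
library(lme4)

# Simulated stand-in for the coder data set: 15 coders, each coding a
# random subsample of 40 responses, with coder-specific coding rates
set.seed(123)
coder.data <- data.frame(coderID = factor(rep(1:15, each = 40)))
rates <- plogis(rnorm(15, mean = 0, sd = 1))  # coder-specific P(blue collar)
coder.data$bluecollar <- rbinom(nrow(coder.data), 1, rates[coder.data$coderID])

# Model with random coder intercepts: does the probability of a
# blue-collar code vary across coders?
m <- glmer(bluecollar ~ (1 | coderID), data = coder.data, family = binomial)

# The same intercept-only model without the random coder effects
m2 <- glm(bluecollar ~ 1, data = coder.data, family = binomial)
```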
Then notice, down below that, I'm going to write some custom code. I'm declaring a function here; this is a custom function within R that I created myself, which takes two objects as input. Notice I am using the general function command here, and then, within parentheses, the two arguments that I want to input, and then, after the curly brackets, that's where you can supply a series of R functions, R operations, to execute something using those two objects. What I'm doing is taking the log-likelihoods of those two model fit objects, calculating their difference, multiplying that by negative 2, and then calculating a chi-square p-value based on that difference in the negative 2 log-likelihood. This is what's known as a likelihood ratio test. Because I'm testing a variance component, I multiply the p-value of that chi-square test by 0.5. It's a technicality here, but because we're testing that between-coder variance, we want to test whether it's greater than zero, and setting that between-coder variance component to zero puts it at the edge of the values it can possibly take on. So there's a minor modification of the p-value here, where you have to multiply it by 0.5, but this gives you the theoretically correct likelihood ratio test for these two fitted models. Then notice, after I declare this function within the curly brackets, I simply run that new function that I created, called LRT, and my two input objects are my model fit objects from up above, M and M2. This will automatically perform that likelihood ratio test of the null hypothesis that the between-coder variance is equal to zero. Let's run this entire chunk here and see what we get as a result. We're going to fit the two competing models, compare them using that likelihood ratio test, and see whether that between-coder variance is adding something significant to the overall model. You can see it went fast there in the output. The two models were fit.
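The custom function just described can be sketched like this; the name LRT matches the lecture, while the exact body is a reconstruction from the description (difference in log-likelihoods times negative 2, one degree of freedom for the single variance component, and the 0.5 multiplier because the null value sits on the boundary of the parameter space):

```r
# Likelihood ratio test for a single variance component.
# m1 is the larger model (with random coder effects), m2 the smaller one.
LRT <- function(m1, m2) {
  # -2 times the difference in log-likelihoods of the two fitted models
  stat <- -2 * (as.numeric(logLik(m2)) - as.numeric(logLik(m1)))
  # Chi-square p-value with 1 df, halved because the null value
  # (variance = 0) is on the boundary of the parameter space
  pval <- 0.5 * pchisq(stat, df = 1, lower.tail = FALSE)
  c(chisq = stat, p.value = pval)
}

# Applied to the two fitted model objects from the previous chunk:
# LRT(m, m2)
```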
Then, after the two models were fit, we look at the result of that likelihood ratio test. Here's the p-value. You can see that that p-value is tiny; it's 3.2 times 10 to the negative 15th, a very small value. That means we would strongly reject the null hypothesis that the between-coder variance is equal to 0. When you run these chunks of code and generate output, you'll also see that same output in the R Markdown script up above here. As you can see in the text that I added to the R Markdown file, a tiny p-value suggests a significant difference in the fits of these two models, where the larger model that included the random coder effects has a significantly better fit. We would reject the null hypothesis that we can exclude those random coder effects and that their variance is zero. It definitely seems like there's between-coder variance in terms of how often they code blue collar. If that p-value were larger than, say, 0.05, we could conclude that between-coder variance is not a significant concern. But in this particular application, with these data, it does seem like the 15 different coders are in fact coding blue collar at different rates, and we want to look into that a little bit further. We'll talk about that more in the next class on maximizing data quality. But with designed data, where this coding is often necessary to create the edited data file, it's very important to look at variability in that process, especially if human beings are involved in the coding, to see if additional training or something else might be needed to reduce that between-coder variance and really make sure that everybody's on the same page in terms of how they're coding these open-ended outcomes. In the slides again, there's just some more detail here about that function.
Conducting the likelihood ratio test, we apply that function, we see the small p-value, we reject the null hypothesis, and again, you can see the R Markdown file from today for additional details. We would also like you to use these R Markdown files as templates if you'd like to write your own: add your own text and run your own R code so that you can, again, share these with colleagues. Just to illustrate that idea, after we've run all this code, I'm going to knit this R Markdown file, the entire file, into a nice-looking HTML document that, like I said, you could potentially share with colleagues. I go up to the top here, click on this Knit pull-down menu, and say Knit to HTML. What R is going to do is run that entire R Markdown file and then produce this HTML file. You can see the hyperlinks that I had in my text. You can see the text that I wrote, and all the R code is set aside in these gray boxes where you can see the actual code. You can see the output that was generated by running all of this R code; you can see, again, the text explaining the output, and the actual output generated from fitting the models or doing whatever your analyses are. You see that big chunk of likelihood ratio test code, and then you see the text at the end interpreting the results. The nice thing about this is that you could post it on web pages, or you could knit your R Markdown file into a PDF document and share it with your colleagues. It makes it very easy to reproduce R code, reproduce results, and be transparent about what it is that you're doing, and it makes it easy to perform these quality checks, especially if you're working on teams. We've now seen some examples of computing these processing quality metrics for designed data. Next, we're going to turn to a discussion of measuring processing quality for gathered data, and we'll see some additional examples of computing processing quality metrics for gathered data as well. Thank you.