So, in this section we'll look at two more approaches to comparing proportions between two populations. One is called the Chi-square approach; it is mathematically equivalent to the two-sample z-test we just looked at, but it's expandable and allows for comparing proportions between more than two populations, something we'll investigate in the next lecture set. The other is something called Fisher's exact test, which is nice for testing for differences in smaller samples with binary outcomes. Upon completion of this lecture section you will be able to: explain that the Chi-square test for comparing proportions between two populations gives the exact same results as the two-sample z-test; explain the general principle of the Chi-square approach; interpret the results from an exact test for comparing proportions between two populations called Fisher's exact test; and explain the general principle of Fisher's exact test and name situations where Fisher's exact test is preferable to the Chi-square and two-sample z-test approaches. The general principles here, at their very basic level, will be identical to the principles we've used in all other hypothesis tests. The mechanics are slightly different, but I want you to focus on the big picture: again, how we approach the science of hypothesis testing conceptually.
So again, let's go to our go-to example: response to therapy among a sample of a thousand HIV-positive individuals from a clinical population, where 25 percent of those who presented with CD4 counts of less than 250 when they started therapy responded to the therapy, versus 16 percent of those who had CD4 counts greater than or equal to 250 at the time of starting therapy. We've done this both with confidence intervals and, in the previous section, with a p-value; we know this result is statistically significant: the resulting confidence intervals for our measures of association did not include their respective null values, and the p-value from the two-sample z-test we did in the previous section was less than 0.01 (0.0003). So, let's talk about doing something now called the Chi-square test. This is something you'll see in the literature more often than the two-sample z-test, and again I'll note that the reason why is that it can be expanded to compare more than two populations in one test, whereas with a two-sample z-test we're limited to two samples from two populations. In the specific case of two proportions being compared across two populations, the results from the Chi-square test and the two-sample z-test are identical; both depend on the same central limit theorem based result. The approach is exactly the same as all other hypothesis tests we've seen, and any more we'll see in the class, but with different inputs. To start, we specify the two competing hypotheses, the null and the alternative. We assume the null to be the truth, compute how far the sample-based estimate is from what is expected under the null, translate this distance into a p-value, and make a decision. So, the only thing that's going to change here is how we actually measure that standardized distance.
So again, here are the two competing hypotheses expressed in a multitude of ways, like we did before; the underlying null is that the underlying proportions responding at the population level are the same between the two CD4 count populations. Here's what we observed in the study in a two-by-two table. We've looked at this many times, but the four cells here show the number of persons who responded and did not respond in each of the two samples we're comparing: the group with CD4 counts of less than 250, and the group with counts greater than or equal to 250. So, how the Chi-square test works is that we're going to create another two-by-two table, one that doesn't show what we saw in our study but instead shows the cell counts, the number of responders in each of the two groups, that we would expect to see if the null hypothesis were true. The way this works is as follows: if we assume the null hypothesis is true, that the underlying proportions of responders are the same in both populations we're comparing, then we pool the data across the two samples and estimate this common population-level proportion as the total percentage of responders in our entire sample, not broken out by CD4 count group. In our entire sample, 206 of the 1,000 persons responded across both groups, for a proportion responding of 20.6 percent, and conversely a proportion not responding of one minus 20.6 percent, or 79.4 percent. What we're going to do to fill in the table with the expected values is keep the row totals what they were and keep the column totals what they were, but to fill in the respective cell counts, at least to get the first one, we'll take that pooled proportion of responders under the null hypothesis, 20.6 percent, and multiply it by the total number of persons in each group.
So, in the CD4 count less than 250 group there are 503 persons, and under the null we expect 20.6 percent of those to respond; taking 20.6 percent of 503, we get 103.6 expected responders. This is a theoretical count, so it doesn't have to be an integer, whereas in reality you would need an integer number of persons. We do this for the second group: under the null, we expect 20.6 percent of the 497 persons in this group to respond, or 102.4 responders. Then, among the nonresponders for each of the two groups, we take the proportion who didn't respond in the overall sample and apply it to the numbers in each of the two groups. So, we get these four numbers that characterize what the two-by-two table would look like, or what we'd expect it to look like, under the null hypothesis. Now what I have here is my two-by-two table with the observed counts and then, in parentheses, the expected counts. You can see we saw 127 responders in the CD4 count less than 250 group; that's what we observed, but under the null hypothesis we'd expect to have seen only 103.6, or about 104, responders. So, we have a discrepancy between what we observed and what we would expect under the null for that first cell count. What we're going to do now is add up the discrepancies between what we observed and what we would expect in each of these four cells, standardizing them by a measure of the potential variation in these observed counts from study to study of the same size. It's business as usual; it's just that how we characterize the standardized distance is mechanically different from what we've done previously. This is the formula for the Chi-square measure of distance: for each cell, we take the difference between the observed count and the expected count, square that, and divide it by the expected value, which serves as the variability-based scaling (the standard-error analogue) for this squared distance; then we sum over the four cells.
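To make the arithmetic concrete, here is a small sketch in Python (my own illustration, not part of the lecture) that builds the expected cell counts from the pooled proportion and sums the (observed − expected)² / expected distances for the HIV therapy table:

```python
# Observed two-by-two table from the lecture's HIV therapy example
observed = {
    ("<250", "resp"): 127, ("<250", "no"): 376,
    (">=250", "resp"): 79, (">=250", "no"): 418,
}

n = 1000
row_totals = {"<250": 503, ">=250": 497}       # group sizes
col_props = {"resp": 206 / n, "no": 794 / n}   # pooled proportions under the null

# Expected count for each cell: group total times pooled proportion
expected = {(g, r): row_totals[g] * col_props[r]
            for g in row_totals for r in col_props}

# Chi-square distance: sum of (observed - expected)^2 / expected
chi_sq = sum((observed[c] - expected[c]) ** 2 / expected[c] for c in observed)

print(round(expected[("<250", "resp")], 1))  # 103.6 expected responders
print(round(chi_sq, 2))                      # 13.37, as in the lecture
```

This reproduces both the 103.6 expected responders in the first cell and the overall distance measure of 13.37 discussed below.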
So, what we're adding up here are the standardized distances, accounting for sampling variability, between what we observed and what we'd expect in each of the four cells, and we get some distance measure. If you do out the math, and I'm not expecting you to do this by hand or even to interpret this number per se, because it's not easily interpretable like something we compare to a normal or t-curve, this number, 13.37, measures the cumulative standardized discrepancy between what we observed and what we'd expect under the null. In order to get a p-value and turn this into a statement about whether what we got is relatively likely to have occurred under the null by chance, or is very different, we compare this to what's called a Chi-square distribution. So, to get the resulting p-value from the Chi-square test for this example, we go to a Chi-square distribution; we do it in the context of the computer. The Chi-square distribution looks like this: a right-skewed distribution. What we do is find our value of 13.37 under this distribution (I'm not drawing this to scale) and look at the proportion of observations that are as likely or less likely than what we observed. Because squaring turns large discrepancies in either direction into large positive values, even though it's a two-sided test we'd only be looking at values greater than or equal to 13.37. In any case, that's how we do it; we let the computer handle that. This comes from a Chi-square distribution with what's called one degree of freedom, and I'll explain that in a minute. So, the resulting p-value from the Chi-square test for this example is 0.00026. Notice that that's almost exactly equal to the p-value we got from the two-sample z-test, 0.00032, and certainly we'd make the same decision either way. It turns out that the Chi-square test and the two-sample z-test for comparing proportions are actually mathematically equivalent.
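For reference, here is a sketch of how a computer gets that tail area, assuming the scipy library is available (this is my illustration, not the lecture's own code):

```python
from scipy.stats import chi2, chi2_contingency

# Upper-tail area beyond the observed distance on a chi-square
# distribution with one degree of freedom
p_value = chi2.sf(13.37, df=1)  # survival function = upper-tail area

# scipy can also build the expected table and statistic directly;
# correction=False turns off the Yates continuity correction so the
# statistic matches the hand-summed (observed - expected)^2 / expected
stat, p, df, expected = chi2_contingency([[127, 376], [79, 418]],
                                         correction=False)

print(round(p_value, 5))  # ~0.00026, as in the lecture
```

Only the upper tail is used because, after squaring, discrepancies in either direction land in the right tail of the chi-square curve.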
A Chi-square distribution with one degree of freedom, which describes the sampling behavior of this distance measure based on the observed cell counts in the two-by-two table, is what we'd get if we took the sampling distribution of our standardized difference in proportions, which we know from the central limit theorem is a standard normal distribution (because we standardized by subtracting the expected mean value and dividing by the standard error, it has mean zero and standard deviation one), and squared all the values under that standard normal curve. So, all of this says that if we were to take our distance measure computed the old-fashioned way from the two-sample z-test, taking the difference of proportions we observed and dividing by its standard error, and square that, we would get the Chi-square distance we just got. As such, the tests are mathematically equivalent and give the exact same p-values. In the example, the resulting p-values I reported were slightly different because of rounding on my part in the number of decimal places for the distance measure. But generally speaking, they are equivalent and will yield the same result. While the z-test is easier to do by hand, the Chi-square test is most often used because, again, it can be expanded to compare proportions between more than two populations in one test. So, frequently you'll see a p-value in a publication comparing proportions between two groups, and they'll say all p-values for binary comparisons are from Chi-square tests. You might ask, "Well, what about this one degree of freedom?" Technically, I should put a little one here to indicate we have one degree of freedom. Where does that come from? What does that mean? Well, think about this now: suppose I have a two-by-two table that looks like this.
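As a sanity check on the equivalence claim, a few lines suffice to square the two-sample z statistic (computed with the pooled proportion in the standard error, as in the previous section) and recover the Chi-square distance; this is my own sketch:

```python
import math

# Two-sample z statistic for the same HIV therapy data
p1, n1 = 127 / 503, 503          # CD4 < 250 group
p2, n2 = 79 / 497, 497           # CD4 >= 250 group
p_pool = (127 + 79) / (n1 + n2)  # pooled proportion, 0.206

# Standard error under the null, using the pooled proportion
se = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
z = (p1 - p2) / se

# Squaring the z statistic reproduces the chi-square distance
print(round(z, 2))       # ~3.66
print(round(z ** 2, 2))  # ~13.37
```

So the 13.37 from the cell-by-cell calculation and the square of the z statistic are the same number, up to rounding.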
So, with my data, I had 503 persons in the CD4 count less than 250 group and 497 remaining in the group with counts greater than or equal to 250; 206 of the total 1,000 responded, whereas 794 did not. If I asked you now to fill in the cells in this table for me, you wouldn't be able to do it without additional information, because there are many ways I could arrange the values in this two-by-two table such that the addition works out: the values in each row add up to the row totals, and the values in each column add up to their column totals. For example, this first cell count could theoretically be anywhere from zero to 206. So, as things stand, the cells of our table are random. But as soon as I put in a number here, for example the 127 we observed, the other three cells are fixed; they're no longer random after I know that first number, because they have to be whatever adds to 127 to give the column total of 206, whatever adds to 127 to give the row total of 503, and then whatever adds to those values to give 794, et cetera. So, of the four cells in my two-by-two table for a comparison of two proportions, only one is random, only one varies freely, and that's where the idea of one degree of freedom comes from. We'll see in the next lecture set, when we're comparing more than two proportions, that the number of degrees of freedom increases. So, another approach for getting a p-value when comparing proportions between two populations is called Fisher's exact test. This can be used regardless of sample size. Theoretically, both the two-sample z-test and the Chi-square test require "large sample sizes." I don't want you to worry about what that is operationally, but when in doubt, and you have a computer in front of you, you can run both the Chi-square test and Fisher's exact test and compare the results.
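The degrees-of-freedom bookkeeping can be sketched in a few lines: once the margins are fixed, specifying a single cell pins down the other three (my own illustration of the point, not lecture code):

```python
# Margins of the two-by-two table are fixed in advance
row1_total, row2_total = 503, 497  # CD4 < 250 and >= 250 group sizes
col1_total, col2_total = 206, 794  # responders and non-responders

# One freely varying cell determines the whole table
a = 127                  # responders in the CD4 < 250 group (the free cell)
b = row1_total - a       # non-responders, CD4 < 250 group
c = col1_total - a       # responders, CD4 >= 250 group
d = row2_total - c       # non-responders, CD4 >= 250 group

print(a, b, c, d)  # 127 376 79 418 -- the observed table, reconstructed
```

Only `a` was supplied; the other three cells follow by subtraction, which is exactly the "one degree of freedom."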
Generally speaking, the p-value from Fisher's exact test will be very similar in value to that from the two-sample z or Chi-square test, except possibly, again, in a small-sample situation. The reason this has only recently become more frequently used, even with larger samples, is that it's computationally intensive; but in this day and age computing power is no longer an issue, and it can be applied to any sample size. Let me just try to give you a sense of how it works, because I think this is another nice way of conceptualizing how we do hypothesis testing. Again, here are the observed results from our study: of 503 people in the first group, 127 responded; of 497 in the second group, 79 responded. There were a total of 1,000 people, and 206 of those 1,000 responded across the two groups. You can think of Fisher's exact test as doing this: we take 1,000 marbles and put them in a jar; they represent the 1,000 study participants. There are 206 red marbles representing the 206 people who responded to the treatment, and 794 blue marbles representing the 794 who did not. Put these in a jar and shake it up so that the distribution of red and blue marbles is the same throughout the entire jar. This jar of marbles now represents the null hypothesis: essentially, if you split the jar into two groups, those with CD4 counts less than 250 and those with counts greater than or equal to 250, the proportion of responders should be about the same in each. This jar represents, or simulates, the null. The p-value for Fisher's exact test is based on the probability that, if I draw 503 marbles to represent the 503 people in the CD4 count less than 250 sample, I get exactly 127 that are red, the number of responders observed in that group, and 376 that are blue. So, again, it calculates the probability of my study results, or something less likely, when the null hypothesis is true.
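The marble-jar picture corresponds to the hypergeometric distribution, and the exact test sums its tail probabilities. A hedged sketch, assuming the scipy library (this is my illustration; the lecture leaves the computation to the computer):

```python
from scipy.stats import fisher_exact, hypergeom

# Jar of 1000 marbles: 206 red (responders), 794 blue. Under the null,
# the number of red marbles among 503 drawn is hypergeometric.
prob_observed = hypergeom.pmf(127, 1000, 206, 503)  # P(exactly 127 red)

# Fisher's exact p-value sums the probabilities of all tables as
# likely or less likely than the one observed
odds_ratio, p = fisher_exact([[127, 376], [79, 418]])

print(round(p, 4))  # ~0.0003, as reported in the lecture
```

`hypergeom.pmf` gives the chance of the single observed table; `fisher_exact` aggregates over all tables at least as extreme to produce the p-value.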
Certainly, we don't need a jar of marbles to do this; the computer can figure out the exact value. But I think it's a nice way to think again about what we do with hypothesis testing: start by assuming the null hypothesis, and then compute the likelihood of our study results under that assumption. Again, the only reason I point out how Fisher's exact test works, and I don't expect you to be able to do it based on what I've shown you, is the philosophy behind it: it's a good reminder of the general concept of hypothesis testing, which is to start by assuming the null and then figure out how likely your results, or something even less likely, would be to have occurred under that null. So, in the CD4 count groupings and response to therapy study we've looked at, we saw before that the p-value from the two-sample z-test is 0.0003 when rounded, and from the Chi-square test is also 0.0003; technically these are exactly equal, although we showed that because of rounding they came out slightly different. The p-value for Fisher's exact test is also 0.0003 to four decimal places. So, whether we did the Chi-square or Fisher's exact test, we get the same decision about rejecting the null and, here, the same p-value; all three p-values agree exactly. In smaller studies, and I'll show you an example of one, the p-values may differ slightly between the Chi-square and Fisher's exact tests. Consider the maternal-infant HIV transmission study we've looked at. Here are the results we've seen so many times. When we look at the respective p-values for the association, we know by heart now that all results greatly favor the AZT group. They're statistically significant; the resulting p-value from the two-sample z-test, or the equivalent Chi-square test, is 0.0001. If we do Fisher's exact test, we get the same p-value, so no difference.
There's no difference in the decision we make or in the resulting likelihood of our results under the null hypothesis assumption. So, let's look at a slightly smaller study to compare the resulting p-values from the Chi-square and z-tests versus Fisher's exact test. Here's a study of 65 pregnant women, all classified as having a high risk of pregnancy-induced hypertension, who were recruited to participate in a study of the effects of aspirin on hypertension. These women were randomized to receive either 100 milligrams of aspirin daily or a placebo during the third trimester of pregnancy. Here are the results. With only 65 women in the study: of the 34 who received aspirin, four had hypertension, or about 12 percent; of the 31 who got the placebo, 11 experienced pregnancy-related hypertension, or 35 percent. So, a large observed difference, but again we don't have many women in this study. If we look at all three estimates of association, in the direction of aspirin compared to placebo the difference was very large: we observed a risk difference of -24 percent and a relative risk of 0.33, quite a reduction. All results agree in terms of significance and direction, and the result is statistically significant, but you can see there's a lot of uncertainty in the confidence intervals: the risk difference could be anywhere from about a 44 percent reduction to only a 4 percent reduction on the absolute scale, for example. If we compute the p-values here, the two-sample z-test or Chi-square approach gives a p-value of 0.0234, while Fisher's exact test gives 0.0378. So, again, we make the same decision about whether to reject the null, but in this case the p-values do not sync up exactly. So, you can think of situations, and sometimes people are sticklers about this even though, as in this example, there's no consequence to which test we use since we make the same decision.
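For the aspirin example, the two p-values can be reproduced with scipy, assuming the table is 4 of 34 with hypertension on aspirin and 11 of 31 on placebo (consistent with the 65 total and the reported 12 and 35 percent); again, a sketch of my own rather than the lecture's code:

```python
from scipy.stats import chi2_contingency, fisher_exact

# Rows: aspirin, placebo; columns: hypertension, no hypertension
table = [[4, 30], [11, 20]]

# correction=False matches the plain (uncorrected) chi-square statistic
chi_stat, chi_p, _, _ = chi2_contingency(table, correction=False)
_, fisher_p = fisher_exact(table)

print(round(chi_p, 4))     # ~0.0234
print(round(fisher_p, 4))  # ~0.0378
```

With only 65 subjects the two p-values visibly diverge, which is exactly the small-sample behavior being discussed.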
But you could imagine situations in smaller samples where the p-value from the Chi-square test, which is based on central limit theorem arguments that may not quite hold up, comes in at less than 0.05, while Fisher's exact test gives a p-value a little greater than 0.05. So, the result is significant with one test and not the other. Again, if you're right around 0.05, that nearness is maybe the bigger-picture result, but technically speaking, you should use the p-value from Fisher's exact test in such a situation. So, in summary: the two-sample z-test provides a method for getting a p-value for testing two competing hypotheses about the true proportions of a binary outcome between two populations, as we saw in the last section. We then introduced the Chi-square test, and what we saw is that the approach is conceptually the same, mechanically slightly more cumbersome, but again a computer will handle that, and the two-sample z-test and the Chi-square test give exactly the same result. However, the Chi-square test is usually what is referred to in the literature. Fisher's exact test is a computer-based test; really, any test we do these days is computer-based, but while we can certainly do the two-sample z-test, and theoretically the Chi-square test, by hand, we won't do Fisher's exact by hand. Its results usually align with the other two tests, but the resulting p-values may differ slightly in smaller samples. Any of the three is generally appropriate for comparing proportions between two populations, with the slight caveat that in smaller samples we may want to look at the results from both the Chi-square and Fisher's exact tests. But regardless of which test is used, the p-value is interpreted in the same way: it's the probability of getting a study result as extreme or more extreme than what you did, under the assumption that the null hypothesis is true.