Learn fundamental concepts in data analysis and statistical inference, focusing on one and two independent samples.

From the course by Johns Hopkins University

Mathematical Biostatistics Boot Camp 2

34 ratings

From the lesson

Discrete Data Settings

In this module, we'll discuss testing in discrete data settings. This includes the famous Fisher's exact test, as well as the many forms of tests for contingency table data. You'll learn the famous (observed minus expected) squared over expected formula that is broadly applicable.

- Brian Caffo, PhD, Professor, Biostatistics

Bloomberg School of Public Health

So this is from Rice's book, Mathematical Statistics and Data Analysis. First of all, I have no affiliation with him and never met Rice, so it's easier for me to say this: I love this book. I think it's wonderful, this Mathematical Statistics and Data Analysis book, so if you are looking for a book recommendation, I like that one. I'm happy to stipulate any conflict of interest in recommending it, but I really do like it; I read it all the time. At any rate, in this book he has this interesting example where he had a bunch of words taken from some novels. Two of them were known to be Jane Austen novels, and one was in question as to whether it was written by that author; let's say it was found later.

There may be other ways you would want to analyze this data, but we want to use it as an example for the chi-squared test. So don't think too hard about whether this is specifically how you would analyze this data, because I doubt it's what you would arrive at immediately. But it's not unreasonable, by the way.

So imagine, let's say, book three. I'm just spitballing here. Imagine book three is the book where you don't know whether or not it's from the same author, in this case Jane Austen.

And you want to test whether the distribution of these words is equivalent across the three books, where you sampled so many words from each book. Okay? So that's the setting, and let's see if we can figure out some expected cell counts to do a chi-squared test. Our null hypothesis is that the distribution of these words is the same for every book, and the alternative is that at least two books differ. So in this case we have a multinomial for every column that we're interested in, and we want to test equivalence of those multinomial probabilities across the columns.

Look at the row margins: disregarding book, the word "a" should appear roughly 434 out of every 1,017 words. Okay? So our (1,1) cell was 147, but in that book 375 words of this type were sampled, and we multiply 375 by the proportion we would expect for that word if book were irrelevant, which is 434 over 1,017. That's how you get the expected cell count. Follow through and take the observed counts.

Now, "a" appeared in book two 186 times. How many would we expect out of 440 sampled words? From our estimate ignoring book, that would be 434 over 1,017 times 440, and you would compare that to 186 to see how much the observed count deviates from the expected count. Then you move on to the next word, "an". Disregarding books, "an" was seen 62 times out of 1,017 words, so we take 375 times 62 divided by 1,017 and compare that to 25, to see if the number of "an"s in book one differs from what we'd expect, and so on. Going through all the calculations, the sum of (observed minus expected) squared over expected is 12.27. The degrees of freedom are (6 minus 1) times (3 minus 1), which in this case works out to be 10.
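The expected-count arithmetic above can be sketched in a few lines (the lecture uses R; this Python snippet is just an illustration). The column totals 375 and 440 and the row totals 434 and 62 come from the slide; the third book's total is inferred as 1,017 minus the other two, and the full table is not reproduced here.

```python
# Expected cell counts for a test of equal word distributions across books.
# Under the null hypothesis, E_ij = (row total) * (column total) / (grand total).

n = 1017                                                 # total words sampled
col_totals = {"book1": 375, "book2": 440, "book3": 202}  # 202 = 1017 - 375 - 440
row_totals = {"a": 434, "an": 62}                        # word counts, disregarding book

def expected(word, book):
    """Expected count of `word` in `book` if the word distribution
    is the same in every book."""
    return row_totals[word] * col_totals[book] / n

print(round(expected("a", "book1"), 2))   # about 160, compared to the observed 147
print(round(expected("a", "book2"), 2))   # about 188, compared to the observed 186
print(round(expected("an", "book1"), 2))  # about 23, compared to the observed 25
```

Each expected count is the sample size for that book times the overall proportion of that word, exactly the calculation walked through above.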

And now at this point, I would ask you, as an exercise, to figure out what the chi-squared probability is.
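If you want to check your answer, the chi-squared upper-tail probability has a simple closed form when the degrees of freedom are even: for df = 2k, P(X > x) = exp(-x/2) times the sum over i < k of (x/2)^i / i!. A standard-library Python sketch (again, the lecture itself works in R):

```python
import math

def chi2_upper_tail(x, df):
    """P(X > x) for a chi-squared random variable with EVEN df,
    via the closed-form Poisson-sum identity (df = 2k)."""
    if df % 2 != 0:
        raise ValueError("this closed form only covers even df")
    k = df // 2
    half = x / 2
    return math.exp(-half) * sum(half**i / math.factorial(i) for i in range(k))

# The statistic from the word-count table: 12.27 on 10 degrees of freedom.
p = chi2_upper_tail(12.27, 10)
print(round(p, 3))  # roughly 0.27
```

A p-value in that range gives no real evidence against equal word distributions across the books.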

Now, I should make some comments about sampling assumptions and the reasonableness of modelling the words this way.

We're making assumptions about how the experiment was conducted and that the model we're applying is relevant. In this case, you could probably say the sampling assumptions aren't so bad: assume there are lots of words in the book,

and we're only sampling 300 or so of these very generic words, so modelling that process as multinomial doesn't seem so bad, at least to me.

But you always want to think about that: how accurately do the assumptions of what I'm doing, the chi-squared test, reflect the way the experiment was conducted? Anyone can of course just take a table, click some buttons, and get a p-value. But that p-value is supposed to represent a probability, and that probability is supposed to quantify the randomness in the experiment. That randomness is only quantified through the statistical model, so our results are only meaningful insofar as the model is a meaningful reflection of reality.

Okay. So here is kind of a funny one. I got this from Agresti's book; both of these books have lots of wonderful discussions of contingency tables.

So in this case, they were rating couples: a husband's and a wife's ratings of sexual fun. N was never, F was fairly often, V was very often, and A was almost always.

And so let's say they sampled 91 couples and then cross-classified them. Let's talk about this:

in this case, a logical first question to ask would be, are the ratings independent of one another? Is the wife's rating independent of the husband's rating, and so on. And you can make all the relevant jokes about the experiment at home right now while we move on.

Okay, so at this point we're going to test the null hypothesis that the row variable, the husband's rating, and the column variable, the wife's rating, are independent, versus the alternative that they're not independent. So let's talk about what happens under independence. The probability that a husband rated N and the wife rated A would factor into the probability that the husband rated N times the probability that the wife rated A.

Okay? So again, we don't have these probabilities; we're going to have to estimate them. For the chi-squared test, we estimate them under the null hypothesis and compare them to the observed cell counts. So let me just do one of them.

The row probability is 19 over 91, because if you disregard the wife's ratings, the husband rated N 19 times out of 91. And the wife rated N 12 times out of 91, if you disregard the husband's ratings.

And so our expected count under independence would be the probability of that specific cell, 19 over 91 times 12 over 91; we can multiply these because of the independence assumption under the null hypothesis. Then we multiply that by 91, the total number of couples, to get 2.51, which we compare to the observed count of 7. When you do that throughout, you find the expected counts follow this formula right here: the row total times the column total, divided by n rather than n squared, because we multiplied back by n. That E_ij applies to every cell, and the logic is clear as to where the formula comes from. The degrees of freedom are again (rows minus 1) times (columns minus 1). And this time I'm going to do even less and let you calculate the statistic and compare it to the chi-squared distribution.
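That one cell, and the general formula E_ij = (row total times column total) / n, can be sketched like so (a Python illustration using the husband-N and wife-N margins quoted above; the rest of the 4 by 4 table is not reproduced on this slide):

```python
# Expected count under independence for one cell of the 91-couple table.
# P(husband = N) is estimated by 19/91, P(wife = N) by 12/91; under the null
# of independence the cell probability is their product, so the expected
# count is n * (19/91) * (12/91) = 19 * 12 / 91.

n = 91
husband_N_total = 19   # row total: husbands answering N
wife_N_total = 12      # column total: wives answering N

expected_NN = husband_N_total * wife_N_total / n
print(round(expected_NN, 2))  # 2.51, compared to the observed count of 7
```

Note how the two 91s in the cell probability cancel against the multiplication by n, which is why the formula divides by n rather than n squared.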

I would note, and I find this fascinating so I keep repeating it, that in all of the cases we covered, you could execute the chi-squared test using this formula, E_ij = n_i+ times n_+j over n, and you wind up in every case with the identical chi-squared statistic. I suggest you try it: go back to some of these other examples and calculate the chi-squared statistic that way.

At any rate, that's why in textbooks the chi-squared statistic is often only presented as this formula right here.

But again, even though the statistic stays the same, the interpretation of the results depends on the design of the experiment, which margins are fixed by the design, and so on. I think that's an important caveat. Nonetheless, in terms of actually doing this stuff, if you have to program it, for example,

you can always just use this simple formula and not spend too much time thinking about what's fixed and what's not while you're doing the calculations. Of course you want to think about that while you're doing the interpretation, but while you're doing the calculations you can just use this formula right here, which is extremely convenient.

Oh, and apparently I lied when I said I'm going to let you do this on your own, because here on the next slide I do it. So, I define a 4 by 4 matrix x, and you just use chisq.test(x); that will give it to you. Again, remember the continuity correction, so if you do it by hand you probably won't get exactly the same numbers.

If you do this formula, (observed minus expected) squared over expected, you get around 17. The degrees of freedom are 3 squared, or 9, and the p-value works out to be just under 5%.

So, I should say a couple of caveats. One is that the chi-squared approximation is an asymptotic approximation using the central limit theorem. Now, it may not be clear how the central limit theorem is kicking in here, but it is, so you have to worry about whether or not it's a good approximation. But fortunately for you, in a slide or two we're going to talk about how you can do exact Monte Carlo finite-sample approximations, so you should get excited about that now. Often in textbooks you'll see discussion of whether the cell counts are large enough to use a large-sample approximation. My recommendation is just to always use the small-sample one: in chisq.test you can set simulate.p.value = TRUE and get the exact small-sample test,

which then gets rid of the need to do that. And for the test of independence, you can do it without

too much computing; you'd have to have a pretty big table to not be able to do it. For more elaborate chi-squared tests, where you can't do the exact ones, I have an R package called exactLoglinTest, which handles kind of crazier distributions,

but it's maybe not the most trivial R package to use. And then there's the software called StatXact, which does quite a few exact versions of contingency table tests. Anyway, all the chi-squared tests, in terms of comparing the statistic to the chi-squared distribution, are asymptotic tests; they rely on the central limit theorem. You can use these rules that say the cell counts have to be so large, but in reality, as with all asymptotic approximations, you're putting your faith in the idea that the asymptotics have kicked in on your behalf.

But checking that the cell counts are large is a way to give yourself some hope that that's true.

I had one other point I wanted to make. I think I made the point that if we had used this last chi-squared independence formula for all the tests, we would have gotten the same chi-squared statistic in every case. And in every case the degrees of freedom are always (rows minus 1) times (columns minus 1).

Oh, last thing; now I remember what I wanted to say. Where in the world does this (observed minus expected) squared over expected statistic come from? It actually comes from the Poisson distribution.

It turns out that the E's are the expected cell counts, and for a Poisson random variable the expected value is also the variance. So you can think of each O as a Poisson count: subtract its mean, divide by the standard deviation, and square all of that, and you wind up with this statistic, (O minus E) squared over E.

So you might think of each element of the sum, (O minus E) squared over E, as being like a little squared z-statistic. Then you're adding up a bunch of squared z-statistics, and a z-statistic squared is a chi-squared with 1 degree of freedom, so when you add them up you get a chi-squared. Now, because we're estimating components of the expected counts, we lose degrees of freedom that way, and if you want a careful accounting of how the asymptotics work, you have to account for it. But that's where this formula comes from: it comes from the Poisson distribution, and you can really think of it as a bunch of squared z-statistics. Of course the asymptotics are a little more delicate than that, but honestly not that much more delicate.
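A quick simulation makes the point concrete: if O is Poisson with mean E, then (O - E)/sqrt(E) is roughly standard normal for moderate E, so (O - E)^2/E behaves like a squared z-statistic with expectation near 1. A standard-library Python sketch (Knuth's Poisson sampler is my choice here, not anything from the lecture):

```python
import math
import random

def poisson_draw(lam, rng):
    """Knuth's algorithm for a Poisson(lam) draw (fine for moderate lam)."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1

rng = random.Random(42)
E = 20.0  # the Poisson mean plays the role of the expected cell count

# Average of (O - E)^2 / E over many draws: should be close to 1,
# because for a Poisson the variance equals the mean.
draws = [poisson_draw(E, rng) for _ in range(200_000)]
avg = sum((o - E) ** 2 / E for o in draws) / len(draws)
print(round(avg, 2))  # close to 1
```

Each term of the chi-squared sum therefore contributes about 1 under the null, which is why the statistic is compared to a chi-squared distribution whose mean equals its degrees of freedom.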

Okay, so just to rehash: these equal-distribution tests, like we did for the word counts, yield the same thing as the independence tests; they're all the same test under similar sampling assumptions. Whether your model is binomial or multinomial with the row totals fixed; binomial or multinomial with the column totals fixed; the total sample size fixed, where you assume a multinomial and test independence; or nothing fixed, where you use a Poisson model and test some form of row versus column additivity: all of these wind up with exactly the same chi-squared statistic, the same p-value, and the same reject-or-not result.

And if this bothers you, I tried throughout the lecture to describe this on numerous occasions, but if it's still gnawing at you,

what I would say is, this is really common in statistics. Mathematically equivalent results are applied in different settings; they have different interpretations, but the actual statistics, the mathematical results, are equivalent. That's how I like to rationalize it to myself: all of these things are coincidentally the same.

So, some final comments on the asymptotics. The chi-squared result is an asymptotic test, so it requires that something go to infinity. In the multinomial case, the overall sample size has to go to infinity; if you have multinomial columns, then all of the column totals have to go to infinity. And there are various strategies for checking whether you're close enough for the asymptotics to be reasonable,

but we'll talk about an exact test that gets rid of the need to think about that. The degrees of freedom are always (rows minus 1) times (columns minus 1). And what we'll talk about now is how generalizations of Fisher's exact test can be used,

or this other thing, continuity corrections, can be used to make the asymptotic approximations accurate even for relatively small sample sizes.

So let me show you how you can actually use Monte Carlo to calculate an exact p-value for contingency tables. Imagine we had the individual data points, not just the contingency table. For the first couple it was NN, for the second couple it was NN, for the third couple NN, for the fourth couple NN, and so on; and then here's a couple that was FN, and so on. I've clearly sorted them in some way, but this is the raw data. If you were to take this data and create the counts of the number of NN's, FN's, VN's, AN's, FF's, AF's, and so on, you would get exactly the contingency table

from a couple of slides ago. And here's the interesting fact. This is the husbands' data and the wives' data. Think about it: if they were independent, then the matching of husband-wife pairs would be irrelevant; whether you line each answer up with the correct spouse or not shouldn't matter. So what you can do is take either the wife

row here, or the husband row (either one; both would be unnecessary), and just permute it.

And then what you get is a realization of the contingency table generated under the assumption that the particular pairing within a couple is irrelevant; in other words, that husbands and wives are independent. But notice that if we were to permute that data

and reconstruct the contingency table, we would still have the same number of husbands answering N, husbands answering F, wives answering N, wives answering F, wives answering V, wives answering A, and so on. In other words, this procedure constrains the margins but permutes the interior of the table, which is exactly what Fisher's exact test did. Right? And this is exactly the Monte Carlo version of Fisher's exact test, just generalized to more possible outcome values for the row and column variables. So you permute, recalculate the contingency table, and calculate the chi-squared statistic for each permutation, and the percentage of times it's larger than the observed value is a so-called exact p-value. And in R it's pretty easy to do this, because chisq.test(x, simulate.p.value = TRUE) does this exact Monte Carlo
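The permutation scheme can be sketched end to end in a few lines (the lecture does this in R via chisq.test with simulate.p.value = TRUE; this Python version is only an illustration, and the paired ratings below are made up, not the 91 couples from the slide):

```python
import random
from collections import Counter

def chi2_stat(pairs, levels):
    """Sum of (observed - expected)^2 / expected over the cross-table."""
    n = len(pairs)
    obs = Counter(pairs)
    row = Counter(h for h, _ in pairs)  # husbands' margin
    col = Counter(w for _, w in pairs)  # wives' margin
    stat = 0.0
    for r in levels:
        for c in levels:
            e = row[r] * col[c] / n
            if e > 0:
                stat += (obs[(r, c)] - e) ** 2 / e
    return stat

def permutation_p_value(pairs, levels, n_sims=2000, seed=0):
    """Permute one margin (the wives' answers), which preserves both margins
    of the table, and compare the observed statistic to the permuted ones."""
    rng = random.Random(seed)
    observed = chi2_stat(pairs, levels)
    husbands = [h for h, _ in pairs]
    wives = [w for _, w in pairs]
    exceed = 0
    for _ in range(n_sims):
        rng.shuffle(wives)
        if chi2_stat(list(zip(husbands, wives)), levels) >= observed:
            exceed += 1
    return exceed / n_sims

# Hypothetical paired ratings, generated independently, for illustration only.
levels = ["N", "F", "V", "A"]
rng = random.Random(1)
pairs = [(rng.choice(levels), rng.choice(levels)) for _ in range(80)]
print(permutation_p_value(pairs, levels))  # a p-value in [0, 1]
```

Because the data here are generated with the spouses' answers independent, the p-value should typically be unremarkable; feed in strongly associated pairs and it drops toward zero.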

version of the test, which is really, really neat. So this is just a generalization of Fisher's exact test.

And yeah, it's a very nifty little result with a lot of intuition to it. Right? It makes sense to think that, under the null hypothesis of independence, we should be able to permute which specific pair it was, recalculate the contingency table, and get roughly the same discrepancy between the observed and expected counts.

Notice I said to use the chi-squared statistic. So we're using the chi-squared statistic, but we have an exact small-sample p-value. This p-value is valid regardless of the size of the data, though of course it then tends to be a little bit conservative. So it's using the chi-squared statistic, but it's not using the central limit theorem to compare your statistic to the chi-squared distribution. I would also say there are other choices for the test statistic: because we're calculating our null distribution through this permutation process, we could use whatever statistic we want right here, and the chi-squared statistic is not necessarily a bad choice. Anyway, this is an interesting way to get an exact p-value for contingency table tests, where you're interested in looking at things like independence between the rows and the columns.
