Hi, my name is Brian Caffo, and this is Mathematical Biostatistics Boot Camp 2, lecture 10, on case-control data. In this lecture, we're going to briefly talk about case-control methods. We'll talk about an instance where, using retrospective case-control data and a so-called rare disease assumption, we can estimate prospective odds ratios. And then, because this is focused a lot on the odds ratio, I thought I'd talk a little bit about exact inference for the odds ratio.
Okay, so let's talk about retrospective, or case-referent, sampling. Again, this is a deep subject, and we're only going to scratch the surface of it. So imagine we wanted to study lung cancer, and we had some cases and controls, and we ascertained whether or not they were smokers. Conceptually, there are two ways we could collect this data. One is, we could follow a bunch of people over time; some of them would smoke and some of them wouldn't, and then we could see who developed lung cancer. That's very hard. I think you can all see that that experiment is basically impossible. A much easier experiment would be to go to hospital records and find a bunch of people who were cases, who had lung cancer. In this case, we found 709 of them. And then we also found 709 controls that were at some level comparable. And then we retrospectively determined whether or not they were smokers. Now, in this case, the 709 is fixed, and it's whether or not they were a smoker that has the ability to vary. I should also say that the most common way to do case-control methods would be, for every case, to try to very closely match a control, so that for every case there's a specific matched control. But in this case, we're not doing that.
Let's say we had a group of case hospital records and a group of control patients, and we figured out a reasonable strategy for getting control patients. Now these 709s are fixed, so what we want to ascertain is who is a smoker and who is not, and whether or not the cases had a greater proportion of smokers, and to make prospective conclusions from this retrospective data. So, in terms of probability, we cannot estimate the probability of being a case given that you're a smoker directly from the data, but we can estimate the probability of being a smoker given that you're a case. So we want to work within that kind of probability rubric.
What is interesting is that we can estimate an odds ratio. The odds ratio that we want to estimate is the odds of being a case given that you're a smoker, relative to the odds of being a case given that you're a non-smoker. So we want the odds of developing lung cancer given that you smoked, compared to the odds of developing lung cancer given that you didn't smoke. Well, it turns out that that odds ratio is exactly equal to the odds of being a smoker given that you're a case, relative to the odds of being a smoker given that you're a control. So the bottom one we can estimate; the top one we cannot.
So here I just directly go through the calculations. We start with the odds of being a case given that you're a smoker, divided by the odds of being a case given that you're a non-smoker, the odds ratio of interest. Let me just replace case and non-case with C and C bar, and smoker and non-smoker with S and S bar. And here I just churn through the calculations. You can go through these three steps to make sure that you agree.
And look, this works out to be the probability of being a case and a smoker, times the probability of being a non-case and a non-smoker, divided by the probability of being a case and a non-smoker times the probability of being a non-case and a smoker. So it's sort of like a probability cross-product ratio: the probability of caseness and smokerness times the probability of being a non-case and a non-smoker, divided by the off-diagonal probabilities. Now, I say this actually proves the result, and I think it does, because you can just see that if you were to exchange the words case and smoker up at the top, nothing changes when we get down to the bottom line here, because the probability of C and S is the same as the probability of S and C. So I think you can tell that the two are exactly equal, or, if you want to be very particular, you can keep working and get to the other odds ratio. But to me, this proves the result from the previous page.
It also reminds you that these are probability statements. But we estimate those probabilities and odds ratios from data, and of course the sample odds ratio is the cross-product ratio, n11 times n22 divided by n12 times n21. And the sample odds ratio is invariant to transposing the rows and the columns. So our estimator has this kind of invariance property, which we would hope for. It would be weird if we said that the two population odds ratios were equal, but the sample estimates were not equal depending on which one you were treating as the outcome and which one you were treating as the predictor. So that's nice. By the way, the sample odds ratio is also unchanged if a row or a column is multiplied by a constant. And then the last thing, and this is what we'll talk about next: the odds ratio turns out to be related to the relative risk.
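The three-step calculation described above can be written out as follows. This is a reconstruction from the spoken description, not a copy of the slide; it uses C and \bar{C} for case and non-case and S and \bar{S} for smoker and non-smoker, as defined in the lecture.

```latex
\begin{align*}
\frac{P(C \mid S)/P(\bar{C} \mid S)}{P(C \mid \bar{S})/P(\bar{C} \mid \bar{S})}
  &= \frac{\dfrac{P(C, S)}{P(S)} \Big/ \dfrac{P(\bar{C}, S)}{P(S)}}
          {\dfrac{P(C, \bar{S})}{P(\bar{S})} \Big/ \dfrac{P(\bar{C}, \bar{S})}{P(\bar{S})}} \\
  &= \frac{P(C, S)\, P(\bar{C}, \bar{S})}{P(C, \bar{S})\, P(\bar{C}, S)} \\
  &= \frac{P(S \mid C)/P(\bar{S} \mid C)}{P(S \mid \bar{C})/P(\bar{S} \mid \bar{C})}
\end{align*}
```

The middle line is symmetric in the roles of C and S, which is why exchanging "case" and "smoker" changes nothing. The sample version is the cross-product ratio $n_{11} n_{22} / (n_{12} n_{21})$, which is likewise unchanged when the table is transposed.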
So if you want odds ratios, we just demonstrated that the odds ratio works out really well, and you can reverse the conditioning when talking about the odds ratio. But now we'll talk specifically about the relative risk, which is what people often want to estimate, and how it relates to the odds ratio. Okay, so the odds ratio is here: the probability of smoker given that you're a case divided by the probability of non-smoker given that you're a case, and so on; you can read this top line. Then we can reverse the odds ratio, using the argument from the other page. So now we have the probability of case given smoker divided by the probability of non-case given smoker, all divided by the probability of case given non-smoker divided by the probability of non-case given non-smoker. Then in the next line, everything is just multiplied out; denominators are raised up to numerators, and so on. And then look at this first term here: probability of case given smoker divided by probability of case given non-smoker. That's the relative risk. If you wanted to know who develops lung cancer, comparing those who smoked to those who didn't, that's the relative risk, the ratio of the two probabilities. And then that's multiplied by these other terms, but I wanted to refer to them with respect to case status. So I just 1 minus [INAUDIBLE] to the probabilities. And what you can see is that if this ratio that we're multiplying the relative risk by is about 1, then the odds ratio approximates the relative risk. And that is often the case: these two numbers, 1 minus this number and 1 minus that number, are similar enough if in fact the disease is very rare, in other words, if regardless of whether or not you smoke, the probability that you'd get this disease, let's say lung cancer, is quite small.
If that's the case, this so-called rare disease assumption, then this ratio will be about 1, and the odds ratio will approximate the relative risk. That's what people mean when they talk about the rare disease assumption: they use the retrospectively collected data, along with the odds ratio, to approximate the relative risk. It's so common that often people don't even really talk about what they're doing; they just do it. It's so common in the epi literature that it's generally not described in, say, an American Journal of Epidemiology article.
So now let me just make the small point that the disease has to be rare among both the exposed and the non-exposed, not just rare overall. Here's a simple example; Chuck Rohde reminded me of this at one point. We have exposure, yes or no, and disease, yes or no, with cell counts 9, 1, 1, and 999. And let's just assume that this is cross-sectional data, so all the margins and everything are estimable. The estimated probability of disease is about 1%. The odds ratio works out to be almost 9,000, and the relative risk works out to be about 900, so clearly the odds ratio is not approximating the relative risk, and, like I said, because of the sampling I'm assuming, both are directly estimable from the data. What happens in this case is that disease is rare overall, 10 out of 1,010, but it's not rare among the exposed. Among the exposed, you actually had 9 times as many people having the disease as not. At any rate, if you look at the equation, it's clear that both P of C given S bar and P of C given S have to be small in order for the rare disease assumption to apply. And that's the real criterion.
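The numbers in this example can be checked with a short Python sketch. It assumes the table is laid out with exposure in the rows and disease in the columns (exposed: 9 diseased, 1 not; unexposed: 1 diseased, 999 not), which is the layout that reproduces the quoted figures; the function and variable names are made up for illustration.

```python
def odds_ratio(a, b, c, d):
    """Cross-product ratio for the 2x2 table [[a, b], [c, d]]."""
    return (a * d) / (b * c)


def relative_risk(a, b, c, d):
    """Risk of disease in row 1 divided by risk of disease in row 2."""
    return (a / (a + b)) / (c / (c + d))


# Assumed layout: rows = exposed / unexposed, columns = disease / no disease.
a, b, c, d = 9, 1, 1, 999

prevalence = (a + c) / (a + b + c + d)   # overall P(disease): 10 / 1010
risk_exposed = a / (a + b)               # P(disease | exposed)
risk_unexposed = c / (c + d)             # P(disease | unexposed)

or_hat = odds_ratio(a, b, c, d)
rr_hat = relative_risk(a, b, c, d)

# The factor that multiplies the relative risk to give the odds ratio:
# (1 - P(disease | unexposed)) / (1 - P(disease | exposed)).
factor = (1 - risk_unexposed) / (1 - risk_exposed)

print(f"overall prevalence  = {prevalence:.4f}")
print(f"risk among exposed  = {risk_exposed:.3f}")
print(f"odds ratio          = {or_hat:.0f}")
print(f"relative risk       = {rr_hat:.0f}")
print(f"RR * factor         = {rr_hat * factor:.0f}")
```

The disease is rare overall (about 1%) but the risk among the exposed is 0.9, so the multiplying factor is close to 10 rather than close to 1, and the odds ratio overshoots the relative risk by that factor.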
This is just a numerical illustration, in a hypothetical circumstance where we can estimate all the probabilities, and we can show that the two aren't approximately equal to each other.
So let's just recap about the odds ratio. An odds ratio of 1 implies no association. An odds ratio greater than 1 is a positive association. An odds ratio less than 1 is a negative association. For retrospective case-control studies, odds ratios can be interpreted prospectively. For diseases that are rare among both the exposed and the unexposed, the odds ratio approximates the relative risk. And the delta method standard error is the square root of the sum of 1 over each of the cell counts. Oh, and just to remind you, that's the standard error for the log odds ratio, not the standard error for the odds ratio.
So let's just go through our example. Here we have our lung cancer cases and controls, smokers yes or no. Our odds ratio works out to be 3. The standard error for the log odds ratio works out to be 0.26. If we want a confidence interval, it's log of 3 plus or minus 2 standard errors; we get 0.59 to 1.61. We would check whether or not 0 is in that interval. If we exponentiate it, then we would check whether or not 1 is in the interval. In this case, if we exponentiate it, we get 1.8 to 5.0, so 1 is not in the interval, and our estimated odds of lung cancer for smokers is 3 times the odds for non-smokers.
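The interval calculation can be sketched in Python as below. The transcript does not state the four cell counts, so the ones used here (688 smokers and 21 non-smokers among the cases, 650 and 59 among the controls) are an assumption: they sum to 709 per group and give an odds ratio of about 3 and a standard error of about 0.26, matching the quoted values. The function name and the choice of z = 1.96 are also illustrative.

```python
import math


def odds_ratio_wald_ci(n11, n12, n21, n22, z=1.96):
    """Sample odds ratio with a delta-method (Wald) interval.

    The interval is built on the log odds ratio scale and then
    exponentiated back to the odds ratio scale.
    """
    or_hat = (n11 * n22) / (n12 * n21)
    # Standard error of the LOG odds ratio, not of the odds ratio itself.
    se_log = math.sqrt(1 / n11 + 1 / n12 + 1 / n21 + 1 / n22)
    log_lo = math.log(or_hat) - z * se_log
    log_hi = math.log(or_hat) + z * se_log
    return or_hat, se_log, (log_lo, log_hi), (math.exp(log_lo), math.exp(log_hi))


# ASSUMED counts (not given in the transcript):
# rows = cases / controls, columns = smoker / non-smoker.
or_hat, se_log, log_ci, or_ci = odds_ratio_wald_ci(688, 21, 650, 59)

print(f"odds ratio        = {or_hat:.2f}")
print(f"SE(log OR)        = {se_log:.2f}")
print(f"CI for log OR     = ({log_ci[0]:.2f}, {log_ci[1]:.2f})")
print(f"CI for odds ratio = ({or_ci[0]:.1f}, {or_ci[1]:.1f})")
```

Because this uses the unrounded odds ratio and standard error, the log-scale endpoints differ slightly in the second decimal from the 0.59 to 1.61 quoted in the lecture, which come from the rounded values 3 and 0.26. The conclusion is the same: 0 is outside the log-scale interval and 1 is outside the exponentiated one.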