Learn fundamental concepts in data analysis and statistical inference, focusing on one and two independent samples.


From the course by Johns Hopkins University

Mathematical Biostatistics Boot Camp 2

34 ratings


From the lesson

Techniques

This module is a bit of a hodgepodge of important techniques. It includes methods for discrete matched-pairs data as well as some classical non-parametric methods.

- Brian Caffo, PhD, Professor, Biostatistics

Bloomberg School of Public Health

Okay, so let's talk about exact inference for odds ratios. This is the last thing I'll talk about in this lecture.

Let's let X be the number of smokers for the cases, and Y be the number of smokers for the controls.

And remember, in this case X and Y are the random quantities, because we're thinking of case-referent sampling. The margins of 709 are fixed, and we're going to assume that both X and Y are, say, binomial.

We want to calculate an exact confidence interval for the odds ratio, not an approximate one; the usual formula, the square root of the sum of one over the cell counts, is only approximate. I'll show you that you have to eliminate a nuisance parameter, and I'll show you how to do that.

So let's define the logit function as the log of the odds: logit(p) = log(p / (1 - p)). Notice that differences in logits are log odds ratios: logit(p1) minus logit(p2) is the log odds ratio for p1 to p2. As an example, let's define the logit of the probability of being a smoker given that you're a case as delta.

And by the way, this implies that the probability that you're a smoker given that you're a case is e to the delta over one plus e to the delta; you get that by inverting the logit function.

The logit of the probability of being a smoker given that you're a control, let's call that delta plus theta. It's written relative to delta, but because we're not constraining theta, it can still be any number.

Then the probability of being a smoker given that you're a control works out to be e to the delta plus theta divided by one plus e to the delta plus theta.

And of course theta is the log odds ratio, because if we subtract the two logits, the delta cancels out and we get theta. So theta is the log odds ratio.

Delta, this other parameter, is the so-called nuisance parameter. We don't care about it. What we care about is theta, the log odds ratio relating smoking to case status.
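This parameterization is easy to check numerically. Here is a small Python sketch (the numbers for delta and theta are hypothetical, chosen just for illustration) showing that inverting the logit recovers the probabilities, and that subtracting the two logits cancels delta and leaves theta, the log odds ratio:

```python
import math

def logit(p):
    """Log odds: logit(p) = log(p / (1 - p))."""
    return math.log(p / (1 - p))

def expit(x):
    """Inverse logit: expit(x) = e^x / (1 + e^x)."""
    return math.exp(x) / (1 + math.exp(x))

# Hypothetical values: delta is the case logit, theta the log odds ratio.
delta, theta = -0.5, 1.2
p_case = expit(delta)             # P(smoker | case)
p_control = expit(delta + theta)  # P(smoker | control)

# Subtracting the logits cancels delta and recovers theta exactly.
log_or = logit(p_control) - logit(p_case)
print(round(log_or, 10))  # 1.2
```

The nuisance parameter delta shifts both probabilities but drops out of the difference, which is why only theta matters for the odds ratio.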

Okay, so let's keep working on the model here. We assume X is binomial with n1 trials; we already stipulated that its logit is delta, so the success probability is e to the delta over 1 plus e to the delta. Then Y is binomial with n2 trials and success probability e to the delta plus theta divided by 1 plus e to the delta plus theta.

So then the probability that capital X takes on realized value little x is this binomial probability, and you can work with it to get to this formula right here: n1 choose x, et cetera.

That's the probability X takes on realized value little x. Then I'm going to look at the probability that Y takes on realized value z minus x, and you'll hopefully see why in a minute. That's just plugging directly into the binomial formula; again, I have z minus x right here, instead of a particular value, say, little y.

Now, the probability that X plus Y takes on realized value z is a little bit harder to calculate, because X and Y are not identically distributed. If they were identically distributed, then X would be the sum of a bunch of Bernoulli trials, Y would be the sum of a bunch of Bernoulli trials, and X plus Y would be the sum of a bunch of iid Bernoulli trials. Here they're still both sums of Bernoulli trials, but not with the same success probability.

So here's what we can do. Suppose we decompose z into u and z minus u, with the u part going to X and the z minus u part going to Y. That joint probability is the product right here: the probability X is u times the probability Y is z minus u. So the probability that X plus Y takes on the value z is the sum over all possible values of u; in other words, over all the different ways we could allocate u of the z successes to X and the remaining z minus u to Y. So that's a quick little formula you can do.
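That allocation argument is just a discrete convolution, and it is short to code. A minimal Python sketch (the sample sizes and probabilities below are hypothetical) that sums the product of the two binomial probabilities over every split of z:

```python
from math import comb

def binom_pmf(k, n, p):
    """Binomial probability P(K = k) for K ~ Binomial(n, p)."""
    if k < 0 or k > n:
        return 0.0
    return comb(n, k) * p**k * (1 - p)**(n - k)

def pmf_of_sum(z, n1, p1, n2, p2):
    """P(X + Y = z) for independent X ~ Bin(n1, p1), Y ~ Bin(n2, p2),
    summing over the ways of allocating u successes to X and z - u to Y."""
    return sum(binom_pmf(u, n1, p1) * binom_pmf(z - u, n2, p2)
               for u in range(0, z + 1))

# Sanity check with hypothetical values: probabilities over all z sum to 1.
n1, p1, n2, p2 = 5, 0.3, 7, 0.6
total = sum(pmf_of_sum(z, n1, p1, n2, p2) for z in range(n1 + n2 + 1))
print(round(total, 10))  # 1.0
```

When p1 equals p2 the convolution collapses to a single Binomial(n1 + n2, p) probability, which is exactly the identically-distributed case described above.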

Okay, now we're going to get to the point. Let's look at the probability that X takes on a particular value x given that the sum X plus Y takes on a particular value z. I'm just going to plug in these three lines from above, right? The numerator is the probability X equals x times the probability Y equals z minus x.

Just to elaborate on that point: the probability that X takes on value x and X plus Y takes on value z is a joint probability, and because we're stipulating that X is x, it's the same as the probability that X is x and Y takes on the value z minus little x, which we can factor into the product by independence. That's the numerator right here, and for the denominator I'm just plugging in the probability that X plus Y takes on the value z.

Okay, so then you put it all together; hopefully you can follow the mathematics.

And again, this is very similar to our development of Fisher's exact test. The only difference is that now we haven't assumed the null hypothesis to be true, and so what we have here depends on theta, the log odds ratio.

But notice it doesn't depend on delta, right? We've gotten rid of delta. That's the idea of conditioning away the nuisance parameter: conditioning on X plus Y eliminates delta. And we now have a distribution for just one of our two variables, because once I've conditioned on X plus Y, if I know X, then I know Y.
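The resulting conditional distribution can be written down directly. In a Python sketch (cell counts are hypothetical; with the lecture's parameterization, where the case logit is delta and the control logit is delta plus theta, the theta weight attaches to the control count z - x, since the delta terms cancel after conditioning):

```python
from math import comb, exp

def noncentral_hypergeom_pmf(x, z, n1, n2, theta):
    """P(X = x | X + Y = z). Conditioning on z cancels delta; the only
    parameter left is theta, the log odds ratio. Here theta multiplies
    the control count z - k, matching the lecture's parameterization."""
    lo, hi = max(0, z - n2), min(z, n1)   # support of x given the margins
    if x < lo or x > hi:
        return 0.0
    w = lambda k: comb(n1, k) * comb(n2, z - k) * exp(theta * (z - k))
    return w(x) / sum(w(k) for k in range(lo, hi + 1))

# At theta = 0 this reduces to the ordinary hypergeometric distribution,
# the one that underlies Fisher's exact test.
n1, n2, z, x = 6, 8, 5, 2     # hypothetical counts
p0 = noncentral_hypergeom_pmf(x, z, n1, n2, 0.0)
hypergeom = comb(n1, x) * comb(n2, z - x) / comb(n1 + n2, z)
print(abs(p0 - hypergeom) < 1e-12)  # True
```

Setting theta to zero recovers the hypergeometric probabilities exactly, which is the connection to Fisher's exact test mentioned next.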

So you can use this distribution to calculate an exact hypothesis test for theta equal to some theta nought other than 0. The specific case theta nought equal to 0 results in Fisher's exact test and the ordinary hypergeometric distribution.

And then you can invert these tests to yield exact confidence intervals for the odds ratio. That is exactly what R does: if you run fisher.test, it gives you a confidence interval for the odds ratio by doing exactly this procedure, inverting this distribution, which is called the noncentral hypergeometric distribution. We're not going to go through any calculations with it, because as you can tell, at this point it's gotten rather involved. But I did just want to show everyone

where these exact odds ratio calculations come from. They basically come from this formulation of the problem as a noncentral hypergeometric distribution. What I'm hoping you got from today's lecture was a little bit of information about the odds ratio, about some of its more general-purpose uses, for example in case-control studies, and also a little bit about where some of the more complex formulas for performing inference on the odds ratio come from.
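To make the test-inversion idea concrete, here is a hedged Python sketch of the procedure, not R's actual implementation of fisher.test. The 2x2 counts are hypothetical, and the bisection finds the theta at which a one-sided exact p-value hits the level alpha; that theta is one endpoint of the exact interval (the other endpoint comes from the lower tail), and exponentiating converts it to an odds ratio endpoint:

```python
from math import comb, exp

def conditional_pmf(x, z, n1, n2, theta):
    """Noncentral hypergeometric pmf P(X = x | X + Y = z); theta is the
    log odds ratio, attached to the control count z - k as in the lecture."""
    lo, hi = max(0, z - n2), min(z, n1)
    if x < lo or x > hi:
        return 0.0
    w = lambda k: comb(n1, k) * comb(n2, z - k) * exp(theta * (z - k))
    return w(x) / sum(w(k) for k in range(lo, hi + 1))

def upper_tail_p(x_obs, z, n1, n2, theta):
    """One-sided exact p-value P(X >= x_obs | X + Y = z) under theta."""
    return sum(conditional_pmf(k, z, n1, n2, theta)
               for k in range(x_obs, min(z, n1) + 1))

def exact_endpoint(x_obs, z, n1, n2, alpha=0.025):
    """Bisect for the theta where the one-sided p-value equals alpha.
    With this parameterization the p-value decreases as theta grows,
    so a simple bisection over a wide bracket converges."""
    lo, hi = -20.0, 20.0
    for _ in range(100):
        mid = (lo + hi) / 2
        if upper_tail_p(x_obs, z, n1, n2, mid) > alpha:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

# Hypothetical margins: n1 = 8 cases, n2 = 9 controls, z = 6 total smokers,
# x_obs = 4 smokers among the cases.
theta_star = exact_endpoint(4, 6, 8, 9)
print(round(upper_tail_p(4, 6, 8, 9, theta_star), 4))  # 0.025
```

Repeating the search with the lower tail at alpha = 0.025 gives the other endpoint, and together they form an exact 95% interval for theta, hence for the odds ratio after exponentiating; this is the inversion that fisher.test performs internally.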
