A conceptual and interpretive public health approach to some of the most commonly used methods from basic statistics.


From the course by Johns Hopkins University

Statistical Reasoning for Public Health 1: Estimation, Inference, & Interpretation

136 ratings

Johns Hopkins University


From the lesson

Module 4B: Making Group Comparisons: The Hypothesis Testing Approach

Module 4B extends the hypothesis tests for two-population comparisons to "omnibus" tests for comparing means, proportions, or incidence rates between more than two populations with one test.

- John McGready, PhD, MS, Associate Scientist, Biostatistics

Bloomberg School of Public Health

All right, in this section we're going to look at the issue of sample size computations for studies comparing two or more means. We want the study to have a certain level of power to detect a difference of interest.

So upon completion of this lecture section, you will be able to describe the relationship between power and sample size with regard to the size of the minimum detectable difference in means between two groups.

Describe the relationship between power and sample size with regard to the standard deviations of individual values in the groups being compared. And understand the impact of designing studies to have equal versus unequal group sizes on the total sample size necessary to achieve a certain power.

So let's look at the example that we used in section eight to motivate the idea of power. Suppose we have data on oral contraceptives and blood pressure in a sample of women aged 35 to 39. So recall the data: we had 29 women, eight of whom were currently using oral contraceptives at the time of the study, versus 21 who were not. And we had the sample mean blood pressures and the sample standard deviations.

So we think this research shows evidence, potentially, of an interesting association. But of course, the issue was that the small sample sizes led to a large margin of error and low power to detect a difference of interest. So we want to build on this and design a bigger study, but we want this larger study to have ample power to detect an association of interest, should it really exist, in the population of 35 to 39 year old women with regard to oral contraceptive use and blood pressure. So what we want to do going forward is design a study, and we want to determine the sample sizes needed to detect about a five millimeter increase in blood pressure in oral contraceptive users relative to those women not using oral contraceptives. And we want to have this with 80% power at our standard rejection level of 0.05. Using that pilot data, we estimate that the standard deviations of blood pressures are 15.3 millimeters of mercury and 18.2 millimeters of mercury in the oral contraceptive and non-oral contraceptive users, respectively.

So here, we have a desired power in mind, and we want to find the sample size necessary to achieve a power of 80% to detect a population difference in blood pressure of five or more millimeters of mercury between the two groups.

So we can find the necessary sample sizes for this study if we know in advance the alpha level of the test, which is easy: it's going to be 0.05. We need specific values for the true underlying means in the two groups being compared; really, the important thing is to know the difference of interest between these two means, and this usually represents the minimum scientific difference of interest.

We also have to estimate the standard deviations of the blood pressure measurements in both groups being compared. And then we have to know our desired level of power. And to start, we'll use a power of 80%. So where does this idea of a minimal detectable difference come from, and how do we estimate the population SDs?

Well, the minimal detectable difference is something that the researcher would consider the minimum difference to be scientifically interesting. For example, in this blood pressure study, it could be the case that the average blood pressure difference in the population between oral contraceptive users and non-users is on the order of 1 or 2 millimeters of mercury. But as researchers, we might not see that as being clinically useful or relevant, or a very strong finding. So this minimal detectable difference has to come from our knowledge of what would make for an interesting difference at the population level. And then where do these estimated population-level SDs come from? Well, again, researcher knowledge and experience make for good educated guesses, but hopefully there are other studies out there, maybe a pilot study, for example, that we have or are privy to, as in this case, and we can use that as the starting point.

So let's fill in the blanks from the pilot study data we have on blood pressure and oral contraceptives. We know we want the alpha level of the test to be 0.05, but that's not a function of the pilot study. If we're shooting for a minimal detectable difference of 5 millimeters of mercury, we can posit means for the two groups that are similar to the means we saw in our pilot study but have a difference of 5 mm of mercury. So I'm just going to say 132 for the oral contraceptive users and 127 for the non-users. What's really important here is not the value of the individual means, but the difference we want to see. Then we have to have the estimates of the standard deviations, which we have from the study: 15.3 for the oral contraceptive users and 18.2 for those women not using oral contraceptives. And then we have to know the power we desire to detect the difference of at least 5 millimeters of mercury.

We'll start with 80%, and then we'll look at raising it to 90%. So given this information, how can the necessary sample size be computed? Well, you can certainly use statistical software such as Stata, SPSS, or SAS. There are also some free online sample size calculators; if you just do a Google search, you'll get some hits. A favorite of mine is one that you can actually download and put on your computer, and it's pretty intuitive and user friendly, from Dupont and Plummer, statisticians at Vanderbilt University. Theoretically, we could also do this by hand. It's a little cumbersome, but just for those of you who are interested, in the last lecture section of this set, Section 13, I'll show an example of doing it by hand. But only for those who are interested; it's optional.
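As one more option alongside the software mentioned above, the standard normal-approximation formula for comparing two means reproduces the numbers in this section. A minimal sketch in Python (scipy assumed available; dedicated software like the Dupont and Plummer program may differ slightly in its rounding conventions):

```python
import math
from scipy.stats import norm

def n_per_group(diff, sd1, sd2, alpha=0.05, power=0.80):
    """Normal-approximation sample size per group for a two-sided,
    two-sample comparison of means with equal allocation."""
    z = norm.ppf(1 - alpha / 2) + norm.ppf(power)
    return math.ceil(z**2 * (sd1**2 + sd2**2) / diff**2)

# Pilot-study inputs: 5 mm Hg difference, SDs of 15.3 and 18.2 mm Hg
print(n_per_group(5, 15.3, 18.2))  # → 178 per group, 356 total
```

Here `n_per_group` is just an illustrative helper name; the formula is the usual two-z-quantile expression for the required n per group.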

And for the first approach, let's assume we want equal numbers of women in each group. In the original clinical sample, only about a third of the women were using oral contraceptives, a little less than a third.

So let's suppose for our study design, instead of taking one random sample from the clinical population of 35 to 39-year-old women and then classifying each woman as currently taking oral contraceptives or not, our approach would actually require taking two samples of women separately. We'd first classify the women in the clinic by whether they're on oral contraceptives or not, and then take equal numbers of women from each of the two groups.

So if we do this, and we run the inputs we collected on the previous slide through statistical software, it turns out we would need 178 women in each of the two groups, for a total of 356 women, to have 80% power to detect a difference as large as or larger than 5 millimeters of mercury.

Even though our working hypothesis, based on the sample results and what we're thinking clinically, is that oral contraceptive users will have higher blood pressures, this actually covers us in situations where the oral contraceptive users had lower blood pressures, on the order of 5 mm Hg lower or more, than the non-users. So this is covering us in both directions. The margin of error if we had 178 women in each group, just to give some sense of the related precision of our estimated mean difference, is plus or minus 3.6 millimeters of mercury.

Suppose we found, well, you know what, clinically speaking, a difference of five is larger than necessary; we might be interested as researchers if the difference is on the order of only four millimeters of mercury. So if we rerun the numbers but make our minimum detectable difference smaller, what do you think's going to happen to the sample size? We're making the difference harder to see, in some sense. Well, as you may suspect, when we do this, the number we'll need in each group is larger than when our difference was five millimeters of mercury: we'll need 278 women in each of the two groups.

That compares to the 178 per group we needed when we had the minimal detectable difference of five millimeters of mercury. Suppose we said, well, let's just play around and see what the impact is of changing the minimal detectable difference in the other direction; let's make it larger, and therefore easier to see, if you will. How many women will we need in each group? If we run the numbers, we need 124 women in each group. Substantially fewer.

That's more than 50 fewer in each group than for the minimal detectable difference of five millimeters of mercury. So playing around with this minimal detectable difference will influence the numbers we need in each group.
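The effect of shrinking or growing the minimal detectable difference can be sketched with the same normal-approximation formula used throughout this section (a sketch; the `n_per_group` helper name is illustrative, not from the lecture's software):

```python
import math
from scipy.stats import norm

def n_per_group(diff, sd1, sd2, alpha=0.05, power=0.80):
    # Normal-approximation n per group, two-sided test, equal allocation
    z = norm.ppf(1 - alpha / 2) + norm.ppf(power)
    return math.ceil(z**2 * (sd1**2 + sd2**2) / diff**2)

# Vary the minimal detectable difference, holding SDs, alpha, power fixed
for diff in (4, 5, 6):
    print(diff, n_per_group(diff, 15.3, 18.2))
# → 4 mm Hg needs 278 per group, 5 needs 178, 6 needs 124
```

Note the nonlinearity: because the difference enters the formula squared in the denominator, halving the detectable difference roughly quadruples the required n.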

So if a researcher were writing up a grant proposal, he or she might include a table like the following. Usually it is not acceptable to just present one computation; you want to actually vary some of the parameters or inputs and show what happens to the necessary sample size. So it might be something like this: you might make a table where you look at the relationship between the necessary sample size to get 80% power and various detectable differences. Here I'll do four, five, and six millimeters of mercury.

And then, because our standard deviation estimates are just estimates, in this case from a small study, there's going to be some variability in those. So we may play around with scenarios where the true standard deviation in each group is lower than what we observed in the sample, and where it is larger than what we observed in the sample. This gives a sensitivity analysis that shows we've thought about potential scenarios and are not just wedded to one exact scenario. And it gives the funding agency a sense of the sample sizes needed for each, and then they can make a decision about whether the study is robust enough and how much they're willing to fund. When you look at this table, I want you to look at the impact in two directions. First, for any given set of standard deviation assumptions, look at what happens across the table as we increase the minimal detectable difference of interest. As we showed before, the larger that becomes, the easier it is to see, and the necessary numbers in each group decrease. If you go down each column for a fixed minimal detectable difference, what you can see is what happens as the variability estimates of the blood pressure measurements in each of the two groups increase.

There's more person-to-person variability, which will increase our standard error, so the necessary sample sizes increase with that.

So if I were writing to a funding agency, I would present a table like this and then ask for something like funding for 300 women in each group. And that would pretty much cover all of the situations. We might be a little low in this one scenario with the smallest detectable difference and the greatest variability. But this table would show that with 300 women in each group, I could pretty much find any differences with 80% power, at least, under the scenarios that I've played out here.

Suppose the funding agency reviewed the grant application but said: we're interested in you doing the study with 90% power, and we'd like to see the same computations redone for 90% power. We could do that, and if you compare all the values in this table, where with higher power we want to be more sure of rejecting when we should, you'll see that all the corresponding values for the combinations of detectable difference and variability estimates increase relative to what they were with 80% power. That makes sense, because we're putting a higher onus on our ability to pick up a difference, and we're going to need more information to get the smaller standard errors needed to increase our likelihood of detecting the difference under these alternative hypothesis scenarios.
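The jump in required n when moving from 80% to 90% power can be sketched with the normal-approximation formula (the 90%-power figure below comes from this approximation, not from the lecture's slides, so treat it as illustrative):

```python
import math
from scipy.stats import norm

def n_per_group(diff, sd1, sd2, alpha=0.05, power=0.80):
    # Normal-approximation n per group, two-sided test, equal allocation
    z = norm.ppf(1 - alpha / 2) + norm.ppf(power)
    return math.ceil(z**2 * (sd1**2 + sd2**2) / diff**2)

# Same 5 mm Hg detectable difference and pilot SDs, two power levels
n80 = n_per_group(5, 15.3, 18.2, power=0.80)
n90 = n_per_group(5, 15.3, 18.2, power=0.90)
print(n80, n90)  # → 178 238: every cell of the table grows at 90% power
```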

So in other words, instead of taking separate samples from women who are currently using oral contraceptives and those who aren't, and assuming the sizes will be equal like we did in the last scenario, this approach would involve taking one sample and then classifying the women after they've been selected to be in the study. So in the original small study, 8 of the 29 women were currently using oral contraceptives; that was 28% of the sample. So for purposes of designing this study, let's use 30% and assume that when we take this overall sample, 30% of the women will be using oral contraceptives, and the other 70% will not. And so we will have to design a study that recognizes the uneven sample sizes we expect.

And this can be easily done with statistical software; you can add in a piece about the relative frequency of participants in each of the two groups being compared. So let's run the numbers on this, going with our original goal of detecting a mean difference as large as or larger than five millimeters of mercury in either direction.

We need 119 women in the oral contraceptive group and 274 in the non-oral contraceptive group, for a total of 393 women. So notice that this total sample size is larger than when the study was designed to have equal sample sizes.

Why is that? Why do you think that is? Well, given the necessary information, what these computations are really doing is what we did before: solving for the margin of error, through the standard error, necessary to have the power of interest. Remember, the standard error for a difference in means comparing two groups is the square root of the standard deviation of the values in the first group squared over the sample size of the first group, plus the standard deviation of the values in the second group squared over the sample size of the second group.

When one of these samples is smaller than the other, the rate-limiting factor in how big the standard error is going to be is a function of the smaller sample size. And so we're going to need more women, or more subjects total, to overcome the fact that one of the sample sizes is smaller than the other. If the sample sizes were equally split, we would need fewer subjects to get the same standard error. So that's why, when we go for an imbalanced sample-size design, we need more total subjects than if we were to assume equal samples in each group. That's something to think about in study design: whether you are able to control how the sampling is done and can purposely choose equal numbers in each sample, or you are constrained to classify subjects by their group membership after you sample the entire group, in which case there may be an imbalance in the numbers in each group. So in this situation, if we had changed the minimal detectable difference to 4 millimeters of mercury, and this is the situation where we expect 30% of the subjects to be using oral contraceptives and the remaining 70% not to, we'd need 186 women in the oral contraceptive group and 428 in the group not using oral contraceptives, for a total of 614. So this is notably larger, again, than when we did this assuming equal sample sizes. And of course, because our minimal detectable difference is smaller in this scenario, four versus five, we're going to need more subjects than in the situation where we had a detectable difference of five.
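The standard-error reasoning above can be turned into a formula for unequal allocation: with allocation ratio k = n2/n1, the standard-error expression gives n1 = z²(s1² + s2²/k)/diff² and n2 = k·n1. A sketch in Python (`n_unequal` is an illustrative helper name; this normal approximation lands close to, but not exactly on, the 119 and 274 quoted from the lecture's software, whose rounding conventions differ):

```python
import math
from scipy.stats import norm

def n_unequal(diff, sd1, sd2, ratio, alpha=0.05, power=0.80):
    """Group sizes when group 2 is `ratio` times the size of group 1,
    via the normal approximation for a two-sided comparison of means."""
    z = norm.ppf(1 - alpha / 2) + norm.ppf(power)
    n1 = z**2 * (sd1**2 + sd2**2 / ratio) / diff**2
    return math.ceil(n1), math.ceil(ratio * n1)

# 30% OC users vs 70% non-users → allocation ratio 7/3
n_oc, n_non = n_unequal(5, 15.3, 18.2, ratio=7 / 3)
print(n_oc, n_non, n_oc + n_non)
# → 119 276 395 (the lecture's software reports 119 and 274 for a
#   393 total; the small gap is rounding/approximation convention)
```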

If we actually did this computation for the larger minimal detectable difference of 6 millimeters of mercury, assuming that 30% of the women sampled would be in the oral contraceptive group and the remaining would be in the non-oral contraceptive group, we'd need 83 women in the oral contraceptive group and 191 in the non-oral contraceptive group, for a total of 274 women in our study. And you can create a table similar to what we had done before for this unbalanced sample-size design, showing the number needed in one of the groups and then putting a footnote that the number needed in the other group would be a proportion of the first group.

So you might say, well, this is great. But suppose we are interested in designing a study to compare means between more than two groups; these computations only allow us to do two groups at a time. For example, suppose you wish to compare the average length of stay for preventable diabetes hospitalizations across three insurance groups (government, private, and uninsured) for diabetes patients in the State of Maryland in 2013, and you plan to sample equal numbers from each of these groups.

And based on data from another state that's done a similar study, you have the following estimates, and you want to see, in part, how the results in Maryland compare to other states. You have an estimate of the mean length of stay for diabetes patients on government insurance of 4.2 days. For private insurance, your estimated mean length of stay is 3.1 days. And in the uninsured group, it's 2.5 days. The estimated standard deviations of the individual length-of-stay values in these three groups are similar, at four days for each, so we can start with that.

So, how can a study be designed with 80% power to detect differences between the three groups? Well, one possibility is to do the sample size computation for each unique two-group comparison, and then take the maximum number necessary across the three computations. That way, you'll be covered with a minimum of 80% power for all three comparisons. So let me just show you this based on the software: the sample size needed to have 80% power for the government versus private comparison.

For the government versus uninsured comparison, this is the greatest anticipated difference, so it's the easiest to see.

We only need 87 subjects in each group. But for the private versus no insurance comparison, the difference in anticipated mean length of stay was much smaller than in the other two comparisons. So this is going to be the driving factor in the overall sample size computation: to see the difference between those two groups based on the mean estimates we have, we would actually need 698 persons in each group, substantially larger than the other two comparisons. But if we're really interested in being able to detect differences of this magnitude, if they exist, with 80% power for each of these three comparisons, then the conservative thing to do would be to do a study where we take 698 people from each of the three groups.

Now that certainly means that for some of the comparisons, the first two, our power will be greater than 80%, because we have more subjects than we need to see those differences at 80% power. But we'll be covered at a minimum of 80% for each of the three group comparisons, so if that was important to us, this is what we'd have to do.
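The take-the-maximum strategy can be sketched with the same normal-approximation formula, using the lecture's mean and SD estimates (a sketch; it happens to reproduce the 87 and 698 quoted above):

```python
import math
from itertools import combinations
from scipy.stats import norm

def n_per_group(diff, sd1, sd2, alpha=0.05, power=0.80):
    # Normal-approximation n per group, two-sided test, equal allocation
    z = norm.ppf(1 - alpha / 2) + norm.ppf(power)
    return math.ceil(z**2 * (sd1**2 + sd2**2) / diff**2)

# Estimated mean lengths of stay (days) and common SD from the lecture
means = {"government": 4.2, "private": 3.1, "uninsured": 2.5}
sd = 4.0

# n per group for each unique two-group comparison
sizes = {}
for g1, g2 in combinations(means, 2):
    sizes[(g1, g2)] = n_per_group(abs(means[g1] - means[g2]), sd, sd)

print(sizes)
print(max(sizes.values()))  # → 698: the conservative per-group n
```

Taking the maximum across the pairwise computations guarantees at least 80% power for every comparison, with extra power for the easier ones.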

So in summary, when designing a study to compare means from two or more populations, a researcher must have some estimate of the mean and standard deviation of the values in each population. The sample size necessary to achieve a desired power to detect a minimum detectable difference is a function of that difference, the variability of the individual values in each group (the standard deviations), and the desired power.

As you've seen, I've laid out an example of how to present the sample size computation portion of a methods section for a grant. It's prudent to show the necessary computations under a couple of different scenarios, both for the anticipated minimal detectable difference of interest and for the estimated standard deviations in the groups being compared. You can then use that to come up with a sample size for each of the two groups being compared, or three or more groups, that would cover all of the scenarios played out in the table. And the funding agency can make a decision about whether it wants to cover all the scenarios you've suggested or only some of them, based on what it thinks is important in terms of the minimum detectable difference.

In the next section, we'll show how to do the same thing, but for comparing binary outcomes or proportions between two or more populations. So what I hope you take home from this lecture section, as well as the next one, is the role that the detectable difference of interest, the variability of the data (for continuous data), and the desired power have in determining the sample size necessary to achieve a study with the desired level of power, and what each of these can do to either increase or decrease the necessary sample size.
