A conceptual and interpretive public health approach to some of the most commonly used methods from basic statistics.


From the course by Johns Hopkins University

Statistical Reasoning for Public Health 1: Estimation, Inference, & Interpretation

138 ratings


From the lesson

Module 4B: Making Group Comparisons: The Hypothesis Testing Approach

Module 4B extends the hypothesis tests for two-population comparisons to "omnibus" tests for comparing means, proportions, or incidence rates between more than two populations with one test.

- John McGready, PhD, MS, Associate Scientist, Biostatistics

Bloomberg School of Public Health

So in this lecture section, we're going to extend the ideas that we developed in lectures nine and ten with regard to hypothesis testing. And we're going to look at situations where we can compare parameters, whether they be population-level means, proportions, or incidence rates, between more than two populations, using data from more than two samples, in one test.

So, in this first section we're going to look at the situation where we want to compare means of a continuous outcome between more than two populations. It's an extension of the two-sample unpaired t-test and is called Analysis of Variance, frequently referred to by its nickname, ANOVA.

So you may say, well, if you're comparing means, why is the algorithm called Analysis of Variance, since variance refers to variability? Well, let's think about it for a minute. Let's go back to the t-test, which is a specific case of analysis of variance when we only have two groups. If you think about what we do when we compute our distance metric, we look at the distance between the two group means in the numerator, which is a measure of how much the group means vary between the two groups we're looking at. And we divide it by the expected variability of this difference estimate around zero under the null hypothesis. So in some sense, we're comparing variability to variability. And the analysis of variance for more than two groups just extends that idea, and creates a similar distance metric that unifies it across multiple groups.
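To make that connection concrete, here is a quick sketch, with hypothetical numbers rather than data from the lecture, showing that for two groups the one-way ANOVA F-statistic works out to exactly the square of the pooled two-sample t-statistic:

```python
from math import sqrt

# Hypothetical data for two groups (not from the lecture)
g1 = [1.0, 2.0, 3.0]
g2 = [4.0, 5.0, 6.0]

def mean(x):
    return sum(x) / len(x)

m1, m2 = mean(g1), mean(g2)
n1, n2 = len(g1), len(g2)

# Pooled (equal-variance) two-sample t-statistic
ss1 = sum((v - m1) ** 2 for v in g1)
ss2 = sum((v - m2) ** 2 for v in g2)
sp2 = (ss1 + ss2) / (n1 + n2 - 2)              # pooled variance
t = (m2 - m1) / sqrt(sp2 * (1 / n1 + 1 / n2))

# One-way ANOVA F-statistic for the same two groups
grand = mean(g1 + g2)
ss_between = n1 * (m1 - grand) ** 2 + n2 * (m2 - grand) ** 2
ms_between = ss_between / (2 - 1)              # numerator df = k - 1 = 1
ms_within = (ss1 + ss2) / (n1 + n2 - 2)        # denominator df = N - k
F = ms_between / ms_within

print(t ** 2, F)  # the two quantities agree: F = t^2
```

So with two groups the two procedures are the same test, just on different scales.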

So in this lecture section you will learn to interpret a p-value from a hypothesis test for any mean differences between more than two populations. And this method for getting the p-value is, as I said before, called the Analysis of Variance, or by its nickname, ANOVA.

So let me give you this first example from a study done in the late 1970s, where researchers looked at the relationship between smoking and measures of pulmonary health, including mid-expiratory flow. The researchers recruited study subjects and classified them into one of six smoking categories.

They had non-smokers; passive smokers, who were those exposed to secondhand smoke; non-inhaling smokers; light smokers; moderate smokers; and heavy smokers.

And to start, the researchers were interested in whether there were any statistically significant differences in pulmonary outcomes, such as FEV1, mid-expiratory flow, et cetera, between the six underlying groups.

If they wanted to compare, for example, mid-expiratory flow, which is measured on a continuum, across these six groups, and they only knew about the two-sample comparisons we have done thus far, they would need to do lots of two-sample t-tests, one for each possible two-group comparison. And if you enumerate the number of possible unique two-group comparisons from the six, there would be 15 unique comparisons: non-smokers to passive smokers, non-smokers to non-inhaling smokers, and so on. So that would be labor intensive, and it wouldn't give a unified picture of what was going on.
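That count of 15 is just "six choose two", the number of unordered pairs of groups; a quick check in Python:

```python
from math import comb

# Number of unique two-group comparisons among 6 smoking groups
n_groups = 6
n_pairwise = comb(n_groups, 2)   # 6! / (2! * 4!)
print(n_pairwise)                # 15 separate two-sample t-tests
```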

So there's another method that we can use to extend the two-sample t-test to compare means between more than two populations. And this is called Analysis of Variance, sometimes called ANOVA, or one-way ANOVA. The "one-way" indicates that we only have one predictor or grouping factor; in this case, it's smoking. Sometimes you'll see something called two-way ANOVA, which allows for two grouping factors to compare across, so for example, smoking and sex of the person. And we'll look at some examples of two-way ANOVA in the second term of this course. But for now, as we've done thus far, we're only looking at one predictor or one grouping factor, which in this example is smoking.

The general idea behind ANOVA for comparing means for k populations, and I'll just generically say where k is a number greater than two, is that the null hypothesis is that all of the population-level means for the k groups are equal. We could phrase this, though it would be a little harder to do succinctly, in terms of all possible unique mean differences: the overall null is that the underlying population means are all equal and any two-way difference is zero. But the standard way is to state it as one lump null. And then the alternative hypothesis is that at least one population mean is different from at least one other population mean. So if we fail to reject this, we're making the conclusion that our results are not unlikely if these data came from populations with the same underlying means. But if we do reject, then we're only making the conclusion that at least one population mean is different from at least one other. And we don't actually get information about which means are statistically different from each other or what the magnitudes of the differences are.

So let's go back to the smoking and mid-expiratory flow example. We're going to focus on the mid-expiratory flow, although they measured other pulmonary outcomes as well. From a pool of over 5,200 potential participants, a random sample of 200 men and 200 women was drawn from each smoking group. So they enrolled a bunch of people to be potential participants.

And then they had these people self-classify into one of the six smoking groups based on their smoking habits. And then they randomly sampled 200 men and 200 women from each smoking group, except for the non-inhalers, because there were so few amongst the 5,200; so they took 50 men and 50 women from that group. And then they took pulmonary measurements on each of the subjects, including this mid-expiratory flow, or FEF, of 25 to 75%.

And then they had characteristics on each of them, including their age, their height, and then their pulmonary measures, FVC, FEV1, and then what we're focusing on to illustrate this: the mid-expiratory flow.

And I'll just represent this data to make it a little easier to see, because we'll only focus on the FEF 25 to 75%, or the mid-expiratory flow. Now you can see here, just exploratorily, there seems to be, at least in the samples, somewhat of a dose response: the greater the degree of smoking, so as we go from non-smokers to heavy smokers, the lower, as expected, the mid-expiratory flow in liters per second. So at least in these sample data, not only does it look like there are potentially differences between smoking groups, but it's almost of a dose-response nature.

So, if we wanted to test whether one or any of these differences were in fact statistically significant, and we wanted to account for the uncertainty in the sample estimates before we make a strong conclusion about FEF being related to smoking level, we can do an analysis of variance. The null for the analysis of variance is that the mean FEF for all six smoking groups is the same. And the alternative is that at least two of the six groups have different means.

So, if you do this test you get a p-value of less than 0.01. So what is the conclusion here? Well, at our standard 5% level, this suggests that if these six samples had come from populations with the same mean FEF values, then the chances of getting these study results, or results even less likely, are very small: less than one in a hundred. So our conclusion at the 5% level would be to reject the null, and conclude that at least some of the smoking group means are statistically different from others.

So you might think, well, do I have to go back and actually find out where the differences are? Well, that's one possibility. We could actually do a t-test for each comparison to see where the statistically significant differences are. My take on this is that the p-value, coupled with that decreasing mean as a function of increasing smoking level (so decreasing pulmonary function with increasing smoking level, coupled with that statistically significant result), gives an overall picture that there is a statistically significant association between reduced pulmonary health and greater smoking. What the authors did is they actually used the ANOVA approach for each of their pulmonary outcome comparisons.

They also made specific comparisons to look at where the biggest differences were, group to group. So here's what they say about this: when we looked at the extent to which smoke exposure is related to graded abnormality, we found that non-smokers in smoke-free working environments have the highest scores in the spirometric tests. So that includes FEF and some of the other measures they took. Passive smokers, smokers who do not inhale, and light smokers scored similarly and significantly lower than non-smokers. And then heavy smokers scored the lowest. So here they give a p-value from an ANOVA for each of the tests they did, FEV1, FVC, the mid-expiratory flow, and the FEF 85% to 95%. The results were statistically significant, with p-values less than 0.005. But they also did some post-ANOVA analyses to look at where the differences were most notable, statistically speaking. And they found essentially three clusters: the non-smokers did the best; followed by the cluster that includes passive smokers, smokers who do not inhale, and light smokers; and then medium and heavy smokers scored the lowest.

But the overall picture and the overall conclusion they give is: "We conclude that chronic exposure to tobacco smoke in the work environment is deleterious to the nonsmoker and significantly reduces small airways function."

So in general they found that smoking was bad, but they also highlighted the role of secondhand smoke and its impact on people exposed to it.

Here's another example of more than two populations being compared. This again is for pulmonary outcomes, but we can use ANOVA for other outcomes as well. This was a study on pulmonary outcomes done at three medical centers, and it included a total of 60 patients with coronary artery disease from Johns Hopkins, Rancho Los Amigos Medical Center, and the St. Louis University School of Medicine. The purpose of the study was to investigate the effects of carbon monoxide exposure on these patients.

And prior to analyzing the carbon monoxide effects data, researchers wanted to get a sense of how the respiratory health of these patients compared across the three medical centers. Ostensibly their life would be easier if the respiratory health was comparable prior to the exposure of the carbon monoxide.

I was able to get my hands on these data, and here are some box plots of the FEV1 measurements prior to exposure to carbon monoxide for the patients at these three medical centers. You can see there were only 60 patients to begin with, so these samples are small: Johns Hopkins with 21 patients, Rancho Los Amigos with 16 patients, and then St. Louis University Medical Center with 23 patients.

And our eyes can be playing tricks on us because of scaling but it does seem that at least visually speaking there are some differences in the distributions of these FEV1 measures.

But again, these are based on small samples of data, so this could just be because of sample variation.

So, what the researchers did was an analysis of variance, testing the null that these mean baseline FEV1s are equivalent at the patient population level, versus the alternative that at least one mean was different from at least one other mean. And their p-value for this is 0.052. Wow, that's crazy, isn't it? It's so close to being statistically significant, but technically it's not: it is not less than 0.05. So this result is, at the 0.05 level, not statistically significant.

However, something we should think about here, and we'll discuss in more detail in the next set of lectures, is the small sample sizes. One of the reasons we may not have found a statistically significant difference is because we didn't have the ability to detect one. So it's sort of the jury's out as to whether we failed to reject the null because there are really no differences at the patient population level, or because we couldn't see them. So, if you're strictly going by the letter of the law with a cutoff here, in considering whether you need to factor in these baseline differences when actually looking at the results after exposing the patients to carbon monoxide, you could play naive and say, well, the p-value was greater than 0.05, the results were not statistically significant, so we don't have to adjust or account for them when looking at our results after the exposure. But I would advise, in a situation like this with such small sample sizes, that you would want to look at the resulting FEV1 measurements after exposure to carbon monoxide both on their own and then, as we learn how to do in the second term, adjusting for the starting FEV1 measurements.

So how does ANOVA work? Well, it's the same approach conceptually. We assume the null hypothesis is true, that all means are equal for the populations being compared by our samples. What the method then does is compute a measure of discrepancy between what was observed in the samples and what was expected under the null. This is very much in line with what we've done with every hypothesis test thus far. This measure of discrepancy is sometimes called the F-statistic. And we won't show how to directly compute this, but you could think of it as an extension of the two-sample t-test statistic that allows for comparing differences between multiple samples.
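Although the lecture doesn't derive the F-statistic, a minimal sketch of the computation, using made-up numbers rather than the study data, shows the "variability compared to variability" idea directly:

```python
def f_statistic(groups):
    """One-way ANOVA F-statistic: ratio of between-group to
    within-group mean squares."""
    k = len(groups)                        # number of groups
    n_total = sum(len(g) for g in groups)  # total sample size
    grand_mean = sum(sum(g) for g in groups) / n_total

    # Between-group sum of squares: how much the group means vary
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2
                     for g in groups)
    # Within-group sum of squares: spread around each group's own mean
    ss_within = sum(sum((v - sum(g) / len(g)) ** 2 for v in g)
                    for g in groups)

    ms_between = ss_between / (k - 1)       # numerator df = k - 1
    ms_within = ss_within / (n_total - k)   # denominator df = N - k
    return ms_between / ms_within

# Hypothetical toy data for three groups
F = f_statistic([[1, 2, 3], [2, 3, 4], [6, 7, 8]])
print(F)  # a large F: the group means vary a lot relative to the within-group spread
```

A big F means the group means vary more than we'd expect from the within-group variability alone; an F near 1 is consistent with the null.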

And then this measure of discrepancy that we get for our samples is compared to the distribution of such measures under random sampling variability when the null hypothesis is true. So we basically again look at where our result falls relative to what we could have expected to get just by sampling variability when the null is true, and then figure out whether we're part of the majority or an outlier. And the thing that ultimately tells us where we fall in the pack is the p-value. It tells us the chances of being as far or farther away than we are under the null.

So for this example with the FEV1 at the medical centers, the F-statistic is 3.12. I don't know what that means off the top of my head, and these F-statistics do not have an easily interpretable distance metric like number of standard errors. So it's very hard to look at the F-statistic alone and make a decision about whether it's statistically significant or not; to get a p-value for these things you would need to go to a computer or an F-distribution table. The F-distribution is a high-maintenance distribution.

It has two sets of degrees of freedom: the numerator and the denominator. In these ANOVA comparisons with k groups, the numerator degrees of freedom is the number of groups we have minus 1, so for our example it's 3 minus 1, or 2; and the denominator degrees of freedom is the total sample size minus the number of groups we have. So, since we had a total of 60 individuals across three groups, the denominator degrees of freedom is 57. The only reason I point this out is that a lot of times you'll see in papers that people report the value of the F-statistic and then tell you which distribution to look it up on. That's actually, for my purposes, sort of ancillary information, because it doesn't tell me what I want to know, which is how unlikely are my results, or their results, under the null. But this gets converted: these F-distributions, kind of like the chi-square, are skewed, and we can find where 3.12 occurs on the distribution, figure out what percentage of observations are as far or farther away, as likely or as unlikely, and convert that to a p-value. And that's the p-value they got: 0.052. So this is really the result you want to see, not the interim steps.
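Since that p-value is just the chance, under the null, of an F as large or larger than the observed 3.12, we can also approximate it by simulation, repeatedly drawing three samples of sizes 21, 16, and 23 from one common population. This is a sketch under an assumed normal model; the lecture's 0.052 comes from the F(2, 57) distribution itself:

```python
import random

def f_statistic(groups):
    """One-way ANOVA F-statistic (between / within mean squares)."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2
                     for g in groups)
    ss_within = sum(sum((v - sum(g) / len(g)) ** 2 for v in g)
                    for g in groups)
    return (ss_between / (k - 1)) / (ss_within / (n_total - k))

random.seed(1)
sizes = [21, 16, 23]       # Hopkins, Rancho Los Amigos, St. Louis
observed_f = 3.12
n_sims = 20000

# Under the null, all three samples come from the same population;
# simulate that and count how often the F-statistic is >= 3.12.
count = 0
for _ in range(n_sims):
    groups = [[random.gauss(0, 1) for _ in range(n)] for n in sizes]
    if f_statistic(groups) >= observed_f:
        count += 1

p_approx = count / n_sims
print(round(p_approx, 3))   # close to the reported 0.052
```

This makes the logic of the lookup visible: the F-table is just a shortcut for this "where does 3.12 fall in the null distribution" question.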

And again, the idea is exactly the same: this F-statistic is nothing more than a distance measure. It's just not easy to interpret in its own right.

So let's look at one more example of where we have ANOVA; frequently it is used just to give some understanding of the characteristics that differ between the subgroups being compared. This is the academic physician salary study we've looked at before, where the goal was to compare the mean salary between male and female academic physicians.

And one of the things they wanted to account for in this study was other differences between males and females that may have been related to salaries as well. So in this table here, and this is a very common type of table to see in a study like this, they present what they call bi-variable associations between salary and measured characteristics, and they look at the average salary in different subgroups of the entire sample. So, for example, they look at the salary differences on average between groups ranked by their National Institutes of Health funding into four levels. They present the mean salary for each group and they give confidence intervals, but then they present an overall p-value testing the null that the mean salaries are not different between any of those four groups. And this comes from ANOVA. I'll just show you two examples from this table, but this is very common and a nice presentation: they give the means for each of the four groups they're comparing, they give the confidence intervals so we can eyeball where the largest differences are after accounting for sampling variability, and then they summarize the comparison with this overall p-value.

And so this gives a heads up that there is some association between the current institution's National Institutes of Health funding and the salaries paid by the institution. If this characteristic is also related to being female, i.e., the distribution of females is different between institutions with different rankings, then the authors are going to want to adjust for that in their analysis when they compare males and females. And we'll again get into adjustment in the second term. They also looked at how salaries differed by institutional region; they gave means and confidence intervals, and you can see this comparison was not statistically significant. So it's very common to see the results from ANOVA presented in this way, for looking at the relationship between an outcome of interest and multiple different predictor variables, one at a time. So, in summary, ANOVA is just an extension of the two-sample t-test that allows us to compare means between more than two populations with a single test.

It only gives us a p-value; there's no confidence interval or measure of association we can present alongside it.
