So in this lecture section we'll talk about how to numerically and visually compare distributions of continuous data between two or more samples, as estimates of the comparison at the population level. Upon completion of this lecture section you will be able to suggest graphical approaches to comparing distributions of continuous data between two or more samples, and explain why a difference in sample means can be used to quantify, in a single number summary, differences in the distributions of continuous data between samples. Frequently in public health, medicine, science, etc., researchers and practitioners are interested in comparing two or more populations via data collected on samples from these populations. Such comparisons can be used to investigate questions such as: How does weight change differ between those who are on a low fat diet compared to those on a low carbohydrate diet? How do salaries differ between males and females? How do cholesterol levels differ across weight groups? And I'm sure you can come up with questions of interest that you have. And while these comparisons can be done visually, it's also useful to have a numerical summary. One of the primary reasons is that, as we'll see later in the course, we can estimate the uncertainty in numerical summary estimates based on imperfect samples from larger populations, and put bounds on a range of possibilities for the unknown true value that we can only estimate. These ranges are sometimes called confidence intervals. And so whittling down the comparison to a single number summary will allow us to create a confidence interval for this summary measure. So theoretically this numerical summary measure could be many things. We could take the difference in medians between two sample distributions. We could look at the ratio of means. We could compare the 95th percentiles via their difference.
We could take the ratio of standard deviations, and you could come up with other things we could do as well. However, what is commonly used, for reasons that we'll elaborate on shortly in the course, is a difference in sample means. And this is what I was alluding to in the last slide. But also, when comparing sample distributions, this can be a reasonable measure of the overall difference between these distributions, as an estimate of the underlying difference between the population distributions. While it only gets at a difference in one measure of center and doesn't numerically compare all aspects of the distributions, we'll see that it can actually quantify sort of an overall shift in the values up or down in one sample compared to the other. So to start, let's look at a data set we have yet to use but will use throughout the rest of the course. These are data on 236 Nepalese children at 12 months old. 124 of the children sampled are male, and 112 are female. And here are side by side box plots of the weight distributions for these two sex groups. And we can see that if we look at the distribution of weights in kilograms for male and female children, males tend to weigh more than females. We can see the shift in the median, with a larger median for males compared to females. And also the box shifts up, meaning that the respective 25th and 75th percentiles for males are larger than those for females. But otherwise, we can still see there's a lot of crossover in these distributions. It's certainly not that all males weigh more than all females; it's just that their weights tend to be higher. And we can see that the variability in these weight values is similar in the two sex groups. So this side by side box plot is a really nice way to visually compare some key aspects of these distributions and get a sense of how they compare both visually and numerically.
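To make concrete what a side by side box plot compares, here is a minimal sketch that computes the numbers behind a basic box plot for two groups. The weights below are hypothetical values made up purely for illustration; they are not the Nepal data, and the function name `box_summary` is likewise just an assumption for this example. It uses the "inclusive" quartile method from Python's standard library, which is only one of several quartile conventions, and it ignores the outlier rules that plotting software usually applies to the whiskers.

```python
from statistics import quantiles

# Hypothetical weights in kilograms, made up for illustration only (NOT the Nepal data).
group_a = [6.0, 6.5, 7.0, 7.5, 8.0, 8.5, 9.0]
group_b = [5.5, 6.0, 6.5, 7.0, 7.5, 8.0, 8.5]


def box_summary(values):
    """Return the five numbers a basic box plot is drawn from."""
    # n=4 splits the data into quarters: 25th percentile, median, 75th percentile.
    q1, median, q3 = quantiles(values, n=4, method="inclusive")
    return {"min": min(values), "q1": q1, "median": median, "q3": q3, "max": max(values)}


summary_a = box_summary(group_a)
summary_b = box_summary(group_b)

# Print the two summaries next to each other, the way the boxes sit side by side.
for name in ("min", "q1", "median", "q3", "max"):
    print(f"{name:>6}: group A = {summary_a[name]:.2f}, group B = {summary_b[name]:.2f}")

# The shift in the boxes (median and quartiles) and the similarity in spread (IQR).
median_shift = summary_a["median"] - summary_b["median"]
iqr_a = summary_a["q3"] - summary_a["q1"]
iqr_b = summary_b["q3"] - summary_b["q1"]
print(f"median shift = {median_shift:.2f}, IQR A = {iqr_a:.2f}, IQR B = {iqr_b:.2f}")
```

In this made-up example every summary value for group A sits above the corresponding value for group B while the two interquartile ranges match, which is the same pattern described for the boxes above: a shift in location with similar variability.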
Not as easy, in my opinion, to assess in a visual sense is when we have side by side or, in this case, stacked histograms. Because there's more detail in each histogram when it comes to comparing the distributions, it's a little harder to see and make generalizations. But nevertheless, here are the weight distribution histograms for the 236 children, split out into the 124 males and 112 females. And maybe if you look carefully at this, you can see that the mass, or the bars, in the distribution for males shift over a bit relative to where they are for females. And that would give some sense that males tend to weigh more on average than females. But it's perhaps not as clear as when we can see the median differences in the box plots. Nevertheless, we're going to summarize each of these samples by their sample means. And we can see that the mean weight for males is 7.4 kilograms, and the mean weight for females is 6.7 kilograms. So now we have numerical evidence that the males on average weigh more than females. We already saw that their medians were higher, and their averages track as well. If I put vertical bolded lines on the respective histograms to represent the position of the mean for each sex group, you can sort of see that the difference in means between the two sexes, the increase in the mean for the males compared to females, captures how the whole distribution shifts over for the males. The distribution of these values has shifted over compared to the females by about that much, the mean difference. So the difference in means can be used to quantify the shift in mass, if you will, of similarly shaped distributions like these at least. So if we were to compute this numerically and reduce it to a single number, we could take the difference in average weights for males compared to females.
So that difference is 7.4 kilograms minus 6.7 kilograms, for a difference of 0.7 kilograms. To interpret this we can say that, on average, male children weigh more than female children by 0.7 kilograms. Certainly the direction of comparison is arbitrary; we could've just as easily compared the females to the males and taken the difference female minus male. In which case we'd end up with the same absolute difference of 0.7 kilograms, but the direction would be different. It would be negative 0.7 when we compare females to males, because females weigh less on average. So we could say that, on average, female children weigh less than male children by 0.7 kilograms. Which is essentially the same as stating, like we did before, that on average male children weigh more than female children by 0.7 kilograms. So it doesn't matter in which direction the mean difference is computed. It's just important to know what the direction is, because if we don't know whether it's the difference of males compared to females or females compared to males, we won't be able to make the proper interpretation of the difference. So again, just to show what I was getting at before, this difference in means roughly captures the shift in the mass of values in the histogram for males relative to females. So this single summary number doesn't compare all aspects of the distributions, but it shows the shift in center. And when we have similarly shaped distributions, that tells us a lot about the shift in the distributions. Let's look at another example. Let's look at our length of stay data, but we're going to split it out by age at first claim in the year 2011 for all subjects in the Heritage Health Plan who had in-patient stays of at least one day in the year 2011. So roughly a quarter, or 3,769, of the sampled persons who were admitted to the hospital were less than or equal to 40 years old at the time of their first admission in the year, whereas the remaining 9,000 plus were greater than 40 years old.
And so here are the histograms showing the distributions of the length of stay values for those who were older than 40 at their first admission in 2011 and for those who were 40 or younger. And I would argue that now, with the skewed data, it's difficult to see visually what's going on here when we compare stacked histograms. Although if you look carefully, you can see that the tail perhaps shifts over a bit for those who are older than 40 relative to those who are 40 or younger. Certainly if I had presented side by side box plots, we would have been able to compare these more easily visually. But nevertheless, I'm going to take the difference in sample mean length of stay between these two groups. And I'll do it in the direction of taking the mean for those who are older than 40 minus the mean for those who are 40 or younger. And the means for each group were 4.9 days for the older group and 2.7 days for the younger group, respectively. So this is a mean difference of 2.2 days. So this really nails down, with a numerical summary measure, that those who were older when they were admitted to the hospital had an average length of stay 2.2 days greater than those who were younger. And that really gives some context to what I was trying to see visually in the shift of mass for these two distributions. The distribution has shifted up on average by 2.2 days for the older persons from where it was for the younger. What are we going to do when we have more than two groups whose means we want to compare? Well, the general practice is to designate one of the multiple groups as our reference, and compute the differences for each of the other groups compared to that same reference. So let me show you an example of this. This is an article we'll look at throughout the course, because it gives some useful examples, in a nice context, of techniques we'll be using in the course.
But this was a study of gender differences (they call it gender, but they mean biological sex) in the salaries of physician researchers. This was published in JAMA in 2012, and it dealt with US academic physicians. And they were interested in comparing the average salary between female and male practitioners. But they noted that they'd ultimately have to make this comparison beyond a simple mean difference, because there are potentially multiple things that differ between male and female physicians that could also be related to salary. And so they wanted to adjust out those factors before making a head to head comparison. Just to show you what they showed at first, the mean salary within the cohort for women was $167,669 US compared to $200,433 for men. So a substantial difference of over $30,000. They adjusted for a bunch of other things that included academic rank, leadership position, publication numbers, etc. And we'll certainly show how to do this adjustment in the second term of this course. But this difference still persisted, albeit lower in value. The average difference after leveling the playing field between men and women in terms of these other things was $13,400 US per year, which is very sizable. So as part of this paper, they wanted to demonstrate that there were other factors associated with salary. Then they also go on to show in the paper that some of these factors were associated with gender, hence making the case that they would need to adjust for them before drawing their conclusions about sex based differences in salary. So one of the things they looked at was regional differences in salary. And then they looked at whether the sex distribution of the physicians differed by region as well. So they have four regions of the US: the West, the Midwest, the South, and the Northeast. And here they present the mean salary for physicians from each of these four regions.
So if we wanted to quantify differences in these means, what we could do is designate one of the four regions as our reference region, and then report the mean differences for the other regions compared to the same reference. Designating the reference region is arbitrary, just like the direction of comparison when we have two groups, because essentially we're designating a reference there as well. So for example, if we make the West the reference region, then the first of the three mean differences we could report is the mean difference between the Midwest and the West, which if you do the math comes to a difference of $4,416. So physicians in the Midwest made on average $4,416 more than physicians in the West. If we take the difference between mean salaries for physicians in the South and physicians in the West, it's negative $35, indicating that physicians in the South made slightly less on average than physicians in the West. And if we do the same thing, comparing the mean salaries for the Northeast physicians versus the same reference of the West, the difference is negative $2,322. Which indicates that on average physicians in the Northeast made $2,322 less per year than physicians in the West. >> Mr. John, well what if I wanted to compare physicians' salaries from other regions? For example, salaries in the South to salaries in the Midwest? Do I have to rerun the analysis, changing my reference group to the Midwest? >> And I would say to you, no, no you don't. You have all the information you need here to do this. What do I mean by that? What do we actually have about these respective regions? Well, suppose we only had these differences; for example, suppose we hadn't seen the previous table with the actual means. We have the difference in means between physicians in the South and physicians in the reference group, the West. And we also have the difference in means between physicians in the Midwest and physicians in the West.
And if we subtract the second of these differences from the first and do the math, we get x bar South minus x bar West, minus the quantity x bar Midwest minus x bar West, which is x bar South minus x bar West minus x bar Midwest plus x bar West. So the reference cancels, and we're left with the difference between the Southern region and the Midwest region, x bar South minus x bar Midwest. So we could take these differences from the same reference: the difference between Southern and Western physicians was negative $35. And then subtract the difference in means between Midwestern physicians and physicians in the same reference, the West, which was $4,416, to get a difference in average salaries between physicians in the South and physicians in the Midwest of negative $4,451. So again, the reference is arbitrary, but you have to respect the reference in order to properly interpret the mean difference. But you can also compute other differences of interest besides those for each non-reference region compared to the same reference. So in summary, the distributions of continuous data can be compared between samples in many ways. Some key approaches include visual comparisons, such as side by side box plots, and numerical comparisons, mainly the mean difference between any two samples. And we showed with our examples why the mean difference at least gets at some reasonable measure of the shift in the mass of the distributions.
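The cancellation of the reference group can be checked with a few lines of arithmetic. The sketch below uses only the three differences versus the West quoted in this lecture (not the regional means themselves); the dictionary and function names are assumptions introduced for illustration.

```python
# Mean salary differences versus the West reference, as quoted in the lecture (US dollars).
# The reference group's difference from itself is 0 by definition.
diff_vs_west = {"West": 0, "Midwest": 4416, "South": -35, "Northeast": -2322}


def mean_difference(group, comparison, diffs=diff_vs_west):
    """Return mean(group) - mean(comparison) using differences that share one reference.

    (mean_group - mean_ref) - (mean_comparison - mean_ref) = mean_group - mean_comparison,
    so the reference mean cancels and never has to be known.
    """
    return diffs[group] - diffs[comparison]


south_vs_midwest = mean_difference("South", "Midwest")
midwest_vs_south = mean_difference("Midwest", "South")

print("South minus Midwest:", south_vs_midwest)
print("Midwest minus South:", midwest_vs_south)
```

Reversing the direction of comparison flips the sign but not the size of the difference, which mirrors the earlier point about male versus female weights.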