So in this section, we'll talk about the relationship between sample estimate values like means and standard deviations, the shape of sample data distributions, and the size of the sample. Upon completion of this lecture section, you will be able to understand that a random sample taken from a larger population will imperfectly mimic the characteristics of the larger population; understand that the distribution of values in a random sample should reflect the distribution of the values in the population from which the sample was taken; and understand and explain that increasing sample size does not systematically decrease the value of sample summary statistic estimates, nor does it systematically increase them. For example, we cannot predict a priori, without seeing the results, whether an estimated mean or standard deviation will increase or decrease in value when we move from a smaller to a larger sample from the same population. Also, begin to understand, and just have an appreciation for, something that we'll really work on later in the course: that while increasing sample size does not systematically decrease or increase sample summary statistic estimates, the estimates become less variable across samples of larger sizes, if we were taking multiple samples of the same size. So, let me try to illustrate what I'm getting at in those learning objectives with some examples. Let's talk about what happens if we take increasingly larger samples from some population, and what we would expect to happen to the sample-based estimates and their shape. Here's the histogram of the original data on 113 men, taken from a larger clinical population. I've put this histogram on a finer scale than I said would be appropriate for just these data, because I'm going to compare it to histograms of larger random samples from the same population, and I want the histograms to be comparable across all sets of samples. So I've made the bin width very narrow here. 
So, here's my single sample. We can see it's roughly symmetric and bell-shaped by the histogram. The mean and the median are similar in value, and the standard deviation of these 113 values is 12.9 millimeters of mercury. Now, what happens, and I can do this with the computer because I created this theoretical population, if I add 100 more randomly sampled men from the same population to my dataset? Now I actually have 213 systolic blood pressure measurements. Well, you can see the sample mean is slightly less than the 123.6 it was when we only had 113 measurements, the sample standard deviation is slightly larger, and the median is the same. If we look at the distribution here, it's still symmetric and bell-shaped, but if you compare it to the previous histogram, it's a little more fleshed out, a little more detailed. Also, the peaks tend to get lower, because some of the proportion of observations shifts into bins that had no observations in the previous, smaller dataset. Finally, if I add 887 more randomly selected persons to that same original sample of 113 men, let's see what happens. The mean here is slightly lower again than it was in the original sample, the estimated standard deviation is slightly larger, and the median stays the same. But the picture gets even more fleshed out. We get to see more detail about that roughly symmetric, bell-shaped data, ostensibly because all of these samples are taken from a population where we think, given these empirical results, that the data are roughly symmetric and bell-shaped. We just get a clearer picture of that, but it was consistently echoed in our three samples regardless of size. Let's look at the box plots of the individual systolic blood pressure values from these three samples. 
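This nested-sample experiment is easy to reproduce in simulation. Here is a minimal sketch in Python, assuming a hypothetical normal population with mean 123 and standard deviation 13 millimeters of mercury; those parameters are chosen only to mirror the lecture's estimates, since the actual clinical data aren't reproduced here.

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical "population" of systolic blood pressures (mmHg);
# parameters mirror the lecture's estimates, not real study data.
population = rng.normal(loc=123, scale=13, size=100_000)

# Draw one sample of 1,000 and nest the smaller samples inside it,
# just as the lecture adds 100 and then 887 more men to the original 113.
big = rng.choice(population, size=1000, replace=False)
for n in (113, 213, 1000):
    sub = big[:n]
    print(f"n={n:5d}  mean={sub.mean():6.1f}  "
          f"sd={sub.std(ddof=1):5.1f}  median={np.median(sub):6.1f}")
```

Each row is an imperfect estimate of the same population mean, standard deviation, and median; rerunning with a different seed changes the direction in which each estimate moves as n grows.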
We can see that for the most part, these line up pretty nicely. The medians are very similar, the 25th percentiles are very similar, and the variability, both across the entire range and in the interquartile range, is very similar. We start to get some outliers in the larger dataset, but in general, the three pictures given by the box plots look very similar in terms of the center, shape, and spread of those distributions. So, let me just show you empirically what would happen if I did this process a couple more times, and for now I'm just going to keep my eye on the sample mean values. Suppose I did another run, where I took three random samples of size 113, 213, and 1,000 respectively. There's no connection between the samples; I just took them in succession, randomly. If I look at the sample means for these three samples across the increasing sample sizes, we can see that for this first run, we started with a mean of 124.4 millimeters of mercury in the smallest sample; it goes down to 123.5 in the sample of 213, and then down further to 122.6 in the sample of 1,000. So, if you just saw this, and ignored what we saw before, your first inclination might be that as sample size increases, the sample mean estimate tends to decrease in value. However, if we look at a second run where I did the same thing, we can see that the mean for the smallest sample is 121.7 millimeters of mercury, which goes up to 123.4 in the next random sample of 213. If you look across these five runs, where I took three successive random samples of 113, 213, and 1,000 persons, you'll see there's no consistent pattern as to whether the mean estimate increases or decreases with respect to increasing sample size. Perhaps more important, because this is something that people get confused about all the time, and I'll come back and explain where the confusion comes from at later points in the course. 
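These repeated runs can also be sketched in code. This is an illustration only, again assuming a hypothetical normal population (mean 123 mmHg, SD 13 mmHg) standing in for the clinical data: five independent runs, each drawing fresh samples of the three sizes and recording the sample means.

```python
import numpy as np

rng = np.random.default_rng(5)
# Hypothetical normal population standing in for the clinical data
# (mean 123 mmHg, SD 13 mmHg -- assumed values, not the real dataset).
population = rng.normal(loc=123, scale=13, size=100_000)

# Five independent runs; each draws fresh samples of three sizes.
all_means = []
for run in range(1, 6):
    means = [float(rng.choice(population, size=n, replace=False).mean())
             for n in (113, 213, 1000)]
    all_means.append(means)
    print(f"run {run}: n=113 -> {means[0]:6.1f}, "
          f"n=213 -> {means[1]:6.1f}, n=1000 -> {means[2]:6.1f}")
```

With a different seed the directions of the changes differ, which is the point: the mean estimate is not predictably pushed up or down by a larger n.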
But more important to note is that there's no systematic link between the estimated standard deviation and the sample size. In this first run, the standard deviation of the 113 measurements was 12.9 millimeters of mercury; then I took a random sample of 213 measurements, and the standard deviation went up a little bit to 13.1 millimeters of mercury; then I took a sample of 1,000 where the standard deviation was 12.4. So, no consistent pattern there. In the next run, it started at 14.4 in the sample of 113, went down to 11.6 in the sample of 213, then jumped back up to 13.3 in the sample of 1,000. So, if you look across these five runs, you won't see any consistent pattern between the magnitude of the standard deviation estimate and the sample size. It does not increase or decrease systematically. That's actually important, because even though there's variation in these numbers, because they're based on differing random subsets of the same population, they are all, regardless of the sample size, estimating the same single underlying quantity: the population standard deviation, sometimes represented by sigma, which is a single static value. So, these are all estimating the same number imperfectly, and there will be variation from sample to sample, in samples of the same size and across samples of different sizes. The only reason I bring this up, and I do want to emphasize it here, is that you may say, "this is pretty obvious based on what you're showing me, John." But later in the course we're going to talk about a different type of variation that does depend on sample size. Frequently, people get the two mixed up and incorrectly ascribe the properties of one to the other, and they get the notion in their head that the sample variability will systematically decrease with increasing sample size. What I'm here to show you empirically is that that is not the case. So, let's look at another example where we have heavily skewed data. 
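The same kind of simulation works for the standard deviation. A minimal sketch, once more assuming a hypothetical normal population with sigma = 13 in place of the real data: every one of the fifteen printed values is an imperfect estimate of that same single sigma.

```python
import numpy as np

rng = np.random.default_rng(7)
# Hypothetical stand-in population (assumed mean 123, SD 13 mmHg;
# not the actual clinical data).
population = rng.normal(loc=123, scale=13, size=100_000)

# Five runs of three sample sizes; every estimate targets sigma = 13.
all_sds = []
for run in range(1, 6):
    sds = [float(rng.choice(population, size=n, replace=False).std(ddof=1))
           for n in (113, 213, 1000)]
    all_sds.append(sds)
    print(f"run {run}: " + "  ".join(f"{s:4.1f}" for s in sds))
```

The estimates wander around 13 in both directions; larger samples do not make the standard deviation estimate systematically bigger or smaller.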
Let's look at the length-of-stay claims data from Heritage Health, for patients with an inpatient stay of at least one day in 2011. I have information on a very large sample of 12,928 claims from that single year. The mean of the sample was 4.3 days, the standard deviation was 4.9, and the median, as we saw before, was substantially less than the mean at two days. But let's suppose I took a random sample of 200 patients from this larger group; we'll consider, for the moment, those 12,928 claims to be our working population. Well, if I take that smaller sample and do a histogram of its values, even though this is less detailed and cruder than the histogram of all 12,928 values, we see strong evidence of that right-skewed data distribution, just like we did with the larger sample. The mean length of stay for this sample is 3.8 days, the standard deviation of these values is 4.2 days, and the median is two. So, although these values differ somewhat from what we saw in the large sample of 12,928, we still see that the mean is larger than the median of the sample by a fair amount. If we increase our sample to 1,000 patients, we get a more detailed picture of what we saw in the sample of 200 patients, but the distribution is still right-skewed and continues to exhibit that right skew, as it should: if the population of values from which we're drawing is indeed right-skewed, we want to infer that from any sample, no matter the size. Here the sample mean is 4.3 days and the median again is 2.0 days, so we again see numerical evidence of that right skew in addition to the visual we have here. Then finally, here is the histogram, which we've seen before, of all 12,928 persons in these data. It's just a more fleshed out, more detailed picture of what we saw in the samples of 200 and 1,000 respectively: a heavily right-skewed distribution whose mean is substantially larger than its median. 
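Since the Heritage Health claims themselves aren't reproduced here, the same behavior can be illustrated with a stand-in: an exponential distribution with mean 4.3 days, which is right-skewed by construction. This is an assumption for illustration only, not the claims data.

```python
import numpy as np

rng = np.random.default_rng(3)
# Hypothetical right-skewed "population" of lengths of stay (days).
# The exponential with mean 4.3 is a stand-in, not the claims data.
population = rng.exponential(scale=4.3, size=12_928)

results = {}
for n in (200, 1000, 12_928):
    sample = rng.choice(population, size=n, replace=False)
    results[n] = (float(sample.mean()), float(np.median(sample)))
    print(f"n={n:6d}  mean={results[n][0]:4.1f}  "
          f"median={results[n][1]:4.1f}")
```

At every sample size the mean exceeds the median, which is the numerical signature of right skew the lecture points out; the skew is inherited by the sample no matter how small it is.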
Again, here are some box plots based on the distributions of lengths of stay from these three samples. If you look at these, they're not perfectly equivalent to each other. You can see that the boxes tend to be similar; in the largest sample, the 75th percentile is slightly smaller than in the previous two groups, and the large sample has more outliers. But generally speaking, all three capture that heavy right skew, and the median and 25th percentile values are certainly similar as well. So again, if our underlying population is heavily right-skewed, we should pick that up in our sample if it's random, regardless of how large the sample is. Let's look at some runs, though, where I again take multiple sets of random samples of different sizes and compare their sample means. In the first run, I take a sample of 200, a sample of 500, which I hadn't looked at before, and a sample of 1,000, and look at the sample means across these three samples with increasing sample size. It goes from 4.05 days up to 4.19 days and then down to 4.15 days in the largest sample of the three. If I do this again, my sample of size 200 has a larger mean of 4.97 days, which shifts down to 4.21 days in the sample of 500, and then down further to 4.12 days in the sample of 1,000. So, if you only looked at this one, you might conclude that the sample mean systematically decreases with increasing sample size. Certainly, if you look at the next one, you see a different story. So again, there's no way to predict, if I take a bigger sample, whether I will get a larger or smaller estimated sample mean. All of these values, regardless of which sample I drew and the size of the sample, are estimating the same single underlying quantity: the population mean, mu. One thing I'll just point out now, and we'll certainly spend more time on this later in the course. 
So, I just want to put this in your heads and have you think about it as we come to it later: if you look across the sample mean estimates among the samples of size 200, and then look at the variability in these means across the samples, the variability tends to be smaller the larger the samples are. So, we can't predict whether the values go up or down as a function of sample size, but the sample mean estimates tend to be closer in value across samples of larger size. Again, we'll come back and explore that in some detail shortly in the course. Focusing now on the standard deviations from these runs, and I won't go through this in total detail, you can see in the first situation, where I had samples of 200, 500, and 1,000 taken successively, the sample standard deviation estimates increase with increasing sample size. In other situations, like the second run, they tend to decrease. Then there's some mix of going up and going down in the other ones. So again, there's no consistent pattern; we cannot predict whether our estimate will increase or decrease in value with an increasing sample size. All 15 of these values are estimating the same underlying single quantity, the population standard deviation, which quantifies the variability of individual data values in the population from which these samples were taken. So, in summary: the distribution of sample values of continuous data should imperfectly mimic the distribution of the values in the population from which the sample was taken. If the data in the population are roughly symmetric and bell-shaped, we'd expect any random sample to reflect that via its histogram, regardless of size. If the data in the population are right-skewed, we'd expect, and hope, any representative sample to capture that right skew, regardless of the sample size. We would not expect the distribution to systematically change, say from right-skewed to more symmetric, the larger the sample. 
All sample distributions, regardless of the sample size, are estimating the same underlying unknown population distribution. With regard to the distribution of sample values and increasing sample size: again, increasing the sample size will not systematically alter the shape of the distribution, but it will result in a more filled out, or detailed, distribution. It will not systematically alter the values of the sample statistics. The sample statistic estimates will vary from random sample to random sample, but will not systematically get larger or smaller with increasing sample size. Something I noted, and I don't want you to worry about it too much yet because we will explore it in more detail, is that increasing sample size will increase the precision of the summary statistics as estimates of the unknown population true values. In other words, if we look at the estimated means across multiple samples of the same size, the larger that common sample size, the less variability there will be in the sample mean estimates across samples. We will really focus on that in lecture set six and beyond in the course. So, I just wanted to plant that seed in your head.
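That precision idea can be previewed with the same kind of simulation. This is a sketch under an assumption, a hypothetical normal population rather than the actual data: draw many repeated samples at each size and compare how spread out the resulting sample means are.

```python
import numpy as np

rng = np.random.default_rng(11)
# Hypothetical normal population (assumed mean 123, SD 13 mmHg).
population = rng.normal(loc=123, scale=13, size=100_000)

# 500 repeated samples at each size; individual sample means still
# bounce around, but they cluster more tightly as n grows.
spread = {}
for n in (200, 1000):
    means = [float(rng.choice(population, size=n, replace=False).mean())
             for _ in range(500)]
    spread[n] = float(np.std(means))
    print(f"n={n:4d}  SD of 500 sample means = {spread[n]:.2f}")
```

The spread of the means at n = 1,000 comes out noticeably smaller than at n = 200, even though no single mean estimate moved predictably up or down. That sample-size-dependent variability of the estimate itself, as opposed to the variability of the individual data values, is exactly what lecture set six takes up.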