Okay. Let's look at some additional examples to build on what we've done in lecture set two. I want to remind you that regardless of how large our sample is, the distribution of observations in a random sample taken from a larger population should imperfectly mimic the shape of that population's distribution, and when we increase the sample size, we only get finer resolution as to what the underlying distribution looks like.
So, I'm going to show you here a random sample of 200 days taken from the period 1974-1988 in the US city of Philadelphia, and what we have for each of those days is the temperature on that day in degrees Fahrenheit. Here's the distribution of these 200 temperatures taken randomly from 200 days in this period. I'm going to rescale this to have narrow bars because we're going to compare this visual display to larger samples, and so that the distributions are comparable visually, I wanted to scale everything the same way.
So, let's look at this distribution. The average temperature across these 200 days was 52.9 degrees Fahrenheit, slightly larger than the median at 50.6 degrees Fahrenheit. There's a lot of variability in day-to-day temperatures. Remember, this is a random sample taken over a 15-year period, so it covers all seasons. So, there would be a fair amount of variability, as opposed to if we were only looking at days selected from a particular month or a particular season.
Sometimes the eye can trick you, but I sort of see a couple of things going on here. I sort of see that there are two peaks in these data, perhaps representing colder seasons and warmer seasons. Aside from those two peaks, it's otherwise somewhat uniformly distributed, and then there tends to be a little bit of a right tail and perhaps a little bit of a left tail on this distribution. Let's look at what happens when we take a random sample of 500 days.
We start to get a more fleshed-out picture, but we still see some visual evidence of, or at least I see, though you may not, a similar thing with these two peaks, perhaps because of the two different seasons being represented. Now I start to see that there's still a little bit of a right tail, but there's more of an emergence of a left tail. The extreme values in the sample tend to be on the colder side of things. The average temperature for these 500 days was 54.8 degrees Fahrenheit. That came in slightly less than the median at 55.9, likely because of the influence of these lower values on the mean. Again, there's also a lot of variation in the sample of 500 daily temperatures, likely because they were collected across all seasons.
Now let's look at the entire dataset of all 5,471 days in the period of 1974 to 1988. You can think of this as perhaps the population distribution, or certainly a much larger sample, that we were pulling those other samples from. Now that we get a really fleshed-out picture, we see less of the peaks that I felt I was seeing. We see a little bit of a peak here, perhaps in the colder months, and a little bit of a peak here in the warmer months. Certainly that left skew is really pronounced now, although it has little impact on the mean relative to the median, in that the mean is only slightly less, again as it was in the sample of 500, and we still see a similar degree of variation in these temperatures because, again, they're coming from all seasons of the year.
So, let me just repeat what I did with some of our other examples in the lecture set. I repeated the process of taking successive random samples of 200 temperatures, 500 temperatures, and 1,000 temperatures from this 15-year period. I did this five times, where each time I randomly took one sample of each size, and I tracked the mean temperature reading for each of those samples. So, on my first run, the mean temperature for the sample of 200 was 56.1 degrees Fahrenheit.
When I went to the sample of 500, it went down slightly to 54.3 degrees Fahrenheit. Then for the sample of 1,000 it was 53.9 degrees Fahrenheit. So, if you only saw this result, you might conclude that the sample means tend to decrease with increasing sample size, but if you look at some of these other runs, you'd see opposite or inconsistent results. For example, here in run two, we started out with a mean of 52.3 degrees in the sample of 200, it increases to 53.2 degrees in the sample of 500, and it goes up to 55 degrees in the sample of 1,000. So again, this is just to show that we cannot systematically predict whether an estimate will go up, go down, or stay the same in value when we compare random samples of different sizes. Again, the big picture is that all 15 values that I've got here, regardless of the run they're from or the sample size, are estimating the same underlying singular value: the true population mean temperature in the population of all days over all years in Philadelphia.
Similarly, if we look at the standard deviations of the temperature values in these samples of size 200, 500, and 1,000, across several sets of randomly taken samples, again there will be no clear consistent pattern. In this first run, the standard deviation of the 200 measurements in my first random sample of 200 days was 18.1 degrees Fahrenheit, I had the same sample standard deviation in the random sample of 500 I took, and then for the sample of 1,000 there was a decrease to 17.9. If we look at other runs, we can see that, for example, on run number five, the sample standard deviation of the random sample of 200 temperatures is 17.4 degrees Fahrenheit, in the corresponding random sample of 500 it goes up to 18.1 degrees Fahrenheit, and then it comes back down slightly for the corresponding sample of 1,000.
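The repeated-sampling exercise described here can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the Philadelphia temperatures are not reproduced, so `make_synthetic_population` (a hypothetical helper) builds a stand-in population by mixing a "cold season" and a "warm season" normal distribution, and the numbers it prints will not match the ones quoted in the lecture.

```python
import random
import statistics

def make_synthetic_population(n_days=5471, seed=0):
    """Build a stand-in population of daily temperatures (degrees F).

    Assumption: this is NOT the Philadelphia data. It mixes a cold-season
    and a warm-season normal distribution purely for illustration.
    """
    rng = random.Random(seed)
    population = []
    for _ in range(n_days):
        if rng.random() < 0.5:
            population.append(rng.gauss(38.0, 10.0))  # cold-season day
        else:
            population.append(rng.gauss(72.0, 9.0))   # warm-season day
    return population

def run_sampling(population, sizes=(200, 500, 1000), n_runs=5, seed=1):
    """For each run, draw one random sample of each size (without
    replacement) and record its mean, median, and standard deviation."""
    rng = random.Random(seed)
    results = []
    for run in range(1, n_runs + 1):
        for n in sizes:
            sample = rng.sample(population, n)
            results.append({
                "run": run,
                "n": n,
                "mean": statistics.mean(sample),
                "median": statistics.median(sample),
                "sd": statistics.stdev(sample),
            })
    return results

population = make_synthetic_population()
results = run_sampling(population)
for r in results:
    print(f"run {r['run']}  n={r['n']:>4}  mean={r['mean']:.1f}  "
          f"median={r['median']:.1f}  sd={r['sd']:.1f}")
```

Reading down the printed table run by run gives the same kind of comparison as in the lecture: 15 sample means, 15 medians, and 15 standard deviations, each set estimating a single population value.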
So again, there's no systematic way to predict how the estimate will behave from sample to sample of differing sizes. Sometimes it will increase, sometimes it will decrease, sometimes it will stay the same. But all 15 of these values should be similar, hopefully, because they're all estimating the same underlying singular value: the true standard deviation of temperatures in the population from which the samples were taken.
Just to show you that this certainly holds for other statistics as well, and I won't go through this in detail, here are the 15 sample medians from the respective samples I took across those five runs, where for each run I took a single sample of 200, a single sample of 500, and a single sample of 1,000. Again, you'll see no consistent pattern in whether the estimate increases, decreases, or stays the same when we increase the sample size. These 15 numbers are also estimating one underlying population-level quantity, the population median.
One of the other pieces of data we have in this rich dataset on Philadelphia days is the daily death count on each of these 5,471 days, sampled from all days between 1974 and 1988. The mean number of deaths across these 5,000-plus days on any given day is 46.7 deaths, and the median is 47. We have a slight right skew here, and the variability from day to day is 8.4 deaths. We may want to ask, "How do the death counts and their distributions compare, or how do they vary, by the temperature of the day?" One way to do that would be to look at box plots of the death counts classified by which temperature quartile the days fall into. That would be one way to examine this empirically. So, what I did with the temperature data we looked at before is I took all 5,400-plus temperature values across all these days and used the computer to split them into quartiles.
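As a rough sketch of how the computer might do this split, the Python below finds the 25th, 50th, and 75th percentile cut points of the temperatures and groups each day's death count by the temperature quartile that day falls into. The function names are hypothetical, and the eight days of values are made up so the grouping is easy to check by hand; they are not the Philadelphia data.

```python
import bisect
import statistics
from collections import defaultdict

def temperature_quartile(temp, cuts):
    """Return 1-4: the quartile a temperature falls into, given the
    25th, 50th, and 75th percentile cut points. A value equal to a
    cut point is placed in the lower quartile."""
    return bisect.bisect_left(cuts, temp) + 1

def deaths_by_temperature_quartile(temps, deaths):
    """Group daily death counts by the temperature quartile of each day."""
    cuts = statistics.quantiles(temps, n=4)  # [25th, 50th, 75th] percentiles
    groups = defaultdict(list)
    for temp, count in zip(temps, deaths):
        groups[temperature_quartile(temp, cuts)].append(count)
    return cuts, dict(groups)

# Toy, made-up values for eight days (NOT the Philadelphia data).
toy_temps = [20, 30, 40, 50, 60, 70, 80, 90]    # degrees Fahrenheit
toy_deaths = [52, 50, 48, 48, 46, 46, 45, 47]   # deaths per day

cuts, groups = deaths_by_temperature_quartile(toy_temps, toy_deaths)
for q in sorted(groups):
    counts = groups[q]
    print(f"quartile {q}: n={len(counts)}  "
          f"median={statistics.median(counts)}  mean={statistics.mean(counts)}")
```

With the real data, each group would hold roughly a quarter of the 5,471 days, and the per-quartile lists are what a box plot of deaths by temperature quartile would summarize.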
So, the first quartile includes days whose temperature values were less than or equal to the 25th percentile, such that the remaining 75 percent of values were greater. The second quartile is the values between the 25th percentile and the 50th percentile, the third quartile is the values between the 50th and 75th percentiles, and the fourth quartile is the values greater than the 75th percentile for temperature.
Here's a plot of the death distributions across these four temperature quartiles. If we track the median, we can see that it is highest on the colder days, the coldest set of days. As we move into the second quartile, the distribution shifts down a bit, both the median and the respective 75th and 25th percentiles. Then as we get to the third quartile, it takes another shift: the middle box shifts down and then stays about the same as we progress into the fourth quartile. So, there tend to be more deaths on colder days. The variability in the death counts, just crudely looking at this picture, seems to be similar across quartiles, and the largest numbers of deaths across these four quartiles occurred on days that were coldest and days that were warmest. There are a couple of outliers on the warmest days. I was surprised, because on the hottest days I expected the median to shift up again, since there are heat-related deaths every summer in the United States. But according to these data it did not, when compared to the previous temperature quartile.
So, if we wanted to quantify the differences in those distributions with single-number summaries, one of the things we could do is designate one of the temperature quartiles as our reference, and then take the difference in average deaths between each of the remaining three quartiles and that reference. For example, if I made quartile one my reference, then I could compute three mean differences, starting with quartile two compared to quartile one.
That would be 47.4 minus 50.8. Similarly, I could do the same thing for quartile three compared to quartile one. I won't write it out, but I could do the same thing for quartile four compared to quartile one. Again, if we had these three mean differences, and the four group means, there would be no need to have somebody re-present them with a different reference to get other comparisons, because we could take the difference in differences. For example, we could get the difference in average death counts on days with temperatures in the third quartile compared to days with temperatures in the second quartile, because the reference would cancel out. So hopefully these additional examples were helpful and you enjoyed them, and congratulations on wrapping up lecture set two.
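As a written note on the reference-quartile comparison, the arithmetic can be laid out as follows. The notation $\bar{x}_k$ for the mean daily death count in temperature quartile $k$ is an assumption introduced here, not something used in the lecture. Only the quartile one and quartile two means were quoted, so the second line is left in symbols.

```latex
\begin{align*}
\bar{x}_2 - \bar{x}_1 &= 47.4 - 50.8 = -3.4 \\
(\bar{x}_3 - \bar{x}_1) - (\bar{x}_2 - \bar{x}_1) &= \bar{x}_3 - \bar{x}_2
\end{align*}
```

The second line is the "difference in differences": the reference mean $\bar{x}_1$ appears once with each sign, so it cancels and leaves the direct comparison of quartile three to quartile two.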