In this next set of lectures, we'll look at ways of summarizing continuous data measures, for example, blood pressures, healthcare costs in dollars, or hospital length of stay in days. To start, in this first lecture section we'll talk about some useful numerical summary statistics for summarizing a sample of continuous data. Upon completion of this lecture set, you will be able to compute a sample mean and a sample standard deviation, and interpret the estimated mean, standard deviation, median, and various percentiles computed on a sample of continuous data measures. Some of the ways we can measure and describe continuous data numerically include measures of the center of the data, sometimes called measures of central tendency; these include the mean of the dataset and the median, sometimes called the 50th percentile. We would also like to measure the variability in the individual data points in our sample, and one commonly used numerical summary measure that we'll define is the sample standard deviation. Then there are measures of location other than the center: other percentiles besides the median, or 50th percentile, such as the 25th percentile or the 75th percentile. We'll define all of these measures and look at examples of them applied to some datasets, but let's start with something you've probably heard of and probably computed at some point in your life: the average. We'll call it here the sample mean; it could also be called the sample average or arithmetic mean. Given a set of n data points, the rubric for computing the sample mean is to add up all the points in the sample and divide by the sample size. We'll frequently represent the sample size by the letter n. So let's look at an example: the mean of a small dataset consisting of systolic blood pressures.
So I've got a sample here, and the only reason I chose a small sample is so we can compute things by hand. Generally, we'll let the computer do the work for us, but it's nice to go through the exercise once just to get a sense of where the values come from. We have five systolic blood pressures measured on five persons: 120, 80, and 90 millimeters of mercury, 110 millimeters of mercury, and 95 millimeters of mercury. We can generically represent these values with some letter; I'll choose the letter x and tie the subscript to their arbitrary order in the sample. So the first value x_1 is 120 millimeters of mercury, the second value x_2 is 80, et cetera. The sample mean is computed by adding up, in this case, the five data values and dividing by five. In statistics notation, the sample mean is frequently represented by the letter x with a line over it. So, for example, if we call these five data points x_1 through x_5, we continue to use the letter x to represent our sample mean: the letter x with a bar over it, which we pronounce "x bar." In our dataset, the sample mean is just the sum of these five values divided by five; in this case, it turns out to be 99 millimeters of mercury. Here's a generic way to represent the formula in math notation, and I'm bringing this up because it will be helpful in understanding another quantity we'll look at. I think it speaks to the power of representing things mathematically, because the notation can carry a lot of information when used properly. So here's the general formulation for adding up all the values in our dataset and dividing by n: the numerator involves the capital Greek letter sigma, which in mathematics generally stands for a sum, and the way to read something like this is that it's indexed from one to n.
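This recipe, summing x_1 through x_n and then dividing by n, can be sketched in a couple of lines of Python (just an illustration of the arithmetic described above):

```python
bps = [120, 80, 90, 110, 95]  # the five systolic blood pressures (mm Hg)
x_bar = sum(bps) / len(bps)   # sum x_1 through x_n, then divide by n
print(x_bar)  # 99.0
```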
So this says: from i equals one to n, sum up the values x sub i; that is, take the first value x_1 in your data, plus x_2, plus x_3, up through the nth value. It's just a generic way to say, "Add up all the observations." Then we take that sum of all the observations and divide by the sample size, which is how we get our sample mean. The sample mean, generally speaking, is also sometimes called the sample average or the sample arithmetic mean, but why do we put the adjective "sample" on it? Well, we want to distinguish it from the population mean. Ostensibly, all samples are imperfect representations of a larger, unobservable population, and so our sample mean is our best estimate of an unknown, unknowable value of interest, sometimes called mu, the population mean, which we estimate by x bar. The notation we use for values we can't observe directly is often a Greek letter: the population mean is denoted mu, and it can only be estimated through our sample mean. Sample means, especially in smaller samples, tend to be sensitive to extreme values: a change in the value of one data point can make a substantial change in the value of the sample mean, especially in smaller samples. The sample mean tends to be more robust as a statistic against outliers or extreme values when we have larger samples. Let's talk about another measure of center that focuses on something slightly different, called the median. The median is essentially the middle value in an ordered set of continuous data measures. It's also called the 50th percentile, and we'll define percentile formally in a minute, but that means it is the value such that half the dataset is less than or equal to it, and the remaining 50 percent is greater than this median.
So, for five blood pressures, if we line them up from smallest to largest, the median value is just the middle of these five values, and it's 95 millimeters of mercury for these data. The sample median is not sensitive to the influence of extreme sample values, unlike the sample mean. To give you an example of what I mean by sensitivity to extreme values, suppose that in our sample of five systolic blood pressure measures we find out that the value 120 was recorded incorrectly and the actual value is 200 millimeters of mercury, a very high systolic blood pressure, of course. The sample median would still remain at 95, because it's only affected by the relative position of the values, and that doesn't change even though the largest value increased substantially. However, the sample mean would jump substantially from the 99 millimeters of mercury we had before to 115 millimeters of mercury, because of the influence of that one point on the mean. Now, with only five data points in the sample, each point certainly has a large amount of influence on the mean; if this were a sample of 100 data points, the influence would be a lot less substantial. How could I find the middle value for the median when the sample size is an even number of values? Because there's no single clear middle value in that case, the rubric is to find the two middle values and take their average. So if we added a sixth systolic blood pressure value to the sample we already had, another high measurement at 125, so that we now have six values, the way to find the median would be to take the two innermost values, 95 and 110, add them, and divide by two. The median for this sample would be 102.5. Generally speaking, we're going to let the computer do these calculations for us, although certainly if we have a sample of five, it's relatively easy to do by hand.
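The two cases just described, a single middle value when n is odd and the average of the two middle values when n is even, can be sketched as follows (an illustrative helper, using the blood pressure values from the example):

```python
def sample_median(values):
    """Middle value of the ordered sample; with an even number
    of values, the average of the two middle values."""
    s = sorted(values)
    n = len(s)
    mid = n // 2
    if n % 2 == 1:
        return s[mid]
    return (s[mid - 1] + s[mid]) / 2

print(sample_median([120, 80, 90, 110, 95]))       # 95
print(sample_median([120, 80, 90, 110, 95, 125]))  # 102.5
```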
But if we had a sample of a thousand numbers, while we all have the mathematical chops to compute the mean and median of a thousand numbers, it would take a long time and be a tedious endeavor, and that's why computers have made our lives much better in the world of statistics. So if we want to describe how variable the measurements in our sample are, how could we do that numerically? Well, the most commonly used measure, and you'll see it time and time again reported in journals, and it will be of utility to us throughout the course, is something called the sample standard deviation. I'm going to start by defining something called the sample variance and then show that the sample standard deviation is just the square root of this thing called the variance. Here's what the variance looks like: here's our summation sign again. For all n observations in our sample, we take the difference between each observation and the sample mean, square it, and add that up across all the differences. So we take the difference between the first observation and the sample mean and square it, then the difference between the second observation and the sample mean and square that, all the way up to the last, or nth, observation. What we get is the sum, or cumulative squared distance, between all of our observations and the sample mean. Then we average that. For the moment, just think of this as being divided by n, not n minus one; I'll explain where the n minus one comes in later in this lecture set. But essentially, what we get with the variance is the average squared distance that any single observation falls from the sample mean. The further any single observation falls on average, in squared terms, the more variable the data. Then to get the sample standard deviation, we take the square root of this quantity called the variance.
So you can think of the sample standard deviation essentially as a measure of variability: it measures how far, on average, any single sample observation is from the sample mean. The further on average any single observation is from the sample mean, the more variable the values are around that sample mean. Let's compute it once by hand just to show the operations; again, the math here you're all capable of, but we'll generally leave this to the computer so we can work on harder things like interpreting it and using it in other situations. We had the same five systolic blood pressures as before, and the sample mean, we said, was 99 millimeters of mercury. To get the sample standard deviation, we first sum up the squared differences between each observation and this mean of 99. So, for example, 120 minus 99 is 21, and we square that. For the next observation, 80 minus 99, the difference is negative 19. Now you can maybe start to see why we square these things first: if we added up the differences unsquared, we'd be adding together positive and negative differences, and you can show that the sum would always equal zero. So we wouldn't be able to quantify the variability if we didn't square these differences before adding. If we add these up, the cumulative total squared distance of these five measures from that sample mean of 99 is 1,020 millimeters of mercury squared. We take that and divide it by the sample size less one (again, just think of this as averaging), and we get an average squared distance of 255 millimeters of mercury squared. Taking the square root of that to get the standard deviation, it turns out to be 15.97, or approximately 16, millimeters of mercury. So on average, any one of these five data points falls plus or minus 16 millimeters of mercury from the mean of this sample.
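The hand calculation just walked through, summing the squared differences from the mean, dividing by n minus one, and taking the square root, can be sketched in a few lines (an illustrative helper reproducing the 15.97 result):

```python
import math

def sample_sd(values):
    """Sample standard deviation: square root of the summed squared
    distances from the sample mean, divided by n - 1."""
    n = len(values)
    x_bar = sum(values) / n
    total_sq_dist = sum((x - x_bar) ** 2 for x in values)  # 1,020 for our data
    return math.sqrt(total_sq_dist / (n - 1))

bps = [120, 80, 90, 110, 95]
print(round(sample_sd(bps), 2))  # 15.97
```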
So a couple of things to note about the standard deviation. First of all, the more variability there is in the sample of data, the larger the value of s; as we said before, what s measures is the variability, or spread, of the individual sample values around the sample mean. It can only equal zero if there's no variability, that is, if all n sample observations have the same value. The units of the sample standard deviation are the same as the units of the data measurements in the sample, for example, millimeters of mercury. The standard deviation can also be abbreviated SD or sd, but we'll generally use s to represent the sample standard deviation for our purposes. S squared, the sample variance, is the best estimate of the variance of all values in an underlying population we can't directly observe, and hence s is the best estimate of the population standard deviation. We will represent this unknowable quantity, which measures the variation of values in the population from which we took the sample, by the Greek letter sigma. So we want sigma, but we can only estimate it via s. I just want to talk very briefly about why we don't directly average, but divide by n minus one instead of n. Really, this has very little influence on the results in larger samples, but it could make a difference in smaller samples. Here's the reason: what we'd really like to know is the average distance of our sample points from the true population mean mu (that's a poorly drawn mu, but that's what we want to know). But we don't know mu; we only have a sample, and we can only estimate it through x bar. So we replace mu with x bar, but x bar doesn't depend on all the points in the population like mu does; it only depends on the points we have in our sample. It can be shown mathematically that the squared distance of our points from x bar is systematically smaller, slightly smaller, than it would be if we replaced x bar with the true population mean.
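That systematic underestimate can be illustrated with a quick simulation sketch. Here I draw many small samples from a hypothetical normal population with a known variance of 100 (invented values, just for illustration), and compare the average of the two candidate variance estimates:

```python
import random

random.seed(1)  # arbitrary seed so the sketch is reproducible

# Hypothetical population: normal with mean 0 and standard deviation 10,
# so the true population variance is 100.
n = 5
divide_by_n, divide_by_n_minus_1 = [], []
for _ in range(20000):
    sample = [random.gauss(0, 10) for _ in range(n)]
    m = sum(sample) / n
    ss = sum((x - m) ** 2 for x in sample)  # squared distances from x bar
    divide_by_n.append(ss / n)
    divide_by_n_minus_1.append(ss / (n - 1))

# Averaged over many samples, dividing by n lands systematically
# below the true variance of 100; dividing by n - 1 does not.
print(sum(divide_by_n) / len(divide_by_n))
print(sum(divide_by_n_minus_1) / len(divide_by_n_minus_1))
```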
So in order to correct for that when we compute the variance and standard deviation, we divide by a number slightly smaller than the sample size to get a slightly larger value than if we divided by n alone. We're just correcting slightly for something that can be shown with rather complex mathematics: the numerator underestimates what we'd really like to know by a slight amount. Again, the impact of this on the estimated standard deviation is minimal, especially in larger samples. Certainly, we would want to relegate these computations to the computer, especially when we have large samples. So if I had 113 blood pressure measurements and I wanted to compute the mean and standard deviation, here are the first 50 values; I'm only showing this to illustrate that these data are large relative to what we did before, and we need two more slides to show the rest of the sample of 113. While we could compute these summary measures by hand, it would be quite cumbersome, so we're going to let the computer tell us what these are. The sample mean for these data is 123.6 millimeters of mercury; that's an estimate of the true underlying population mean. The sample standard deviation is 12.9 millimeters of mercury; that's an estimate of the underlying population variability in blood pressures among all persons in this population. The sample median is 123 millimeters of mercury. Something else that will help us quantify characteristics of a distribution of continuous data are percentiles. We've already seen an example of a percentile, the sample median, so let's talk about defining these both in situations where our samples have all unique values and where we have some repeated values. Let's first look at the case with all unique values.
In general, if our sample values are unique, the pth sample percentile is that value in the sample such that p percent of the sample values are less than or equal to this value and 100 minus p percent are greater than this value. That's why the median is the 50th percentile: 50 percent of the values are less than or equal to the median, and the remaining 50 percent are greater than the median. The 25th percentile, for example, would be the value that is greater than or equal to 25 percent of the sample data points and less than the remaining 75 percent of the values. These could be found by hand, by lining up our values from smallest to largest and picking these off, but again, it's much easier and more effective to have them done by the computer. If not all sample values are unique, in other words some are repeated, then the pth sample percentile is still that value in the sample such that p percent of the sample values are less than or equal to it and 100 minus p percent are greater. We'll show an example of this, but sometimes we can have data where a single value is repeated a large number of times, and so multiple percentiles can take that same value. So, let's talk about percentiles in the systolic blood pressure dataset we looked at, taken from a random sample of 113 adult men from a larger clinical population. The 10th percentile for these 113 measurements is 107 millimeters of mercury, meaning that approximately 10 percent of the men in the sample have systolic blood pressures less than or equal to 107, and 100 minus 10 percent, or 90 percent, have systolic blood pressures greater than this. I'm generally going to say "greater than or equal to" here; it sounds kind of counterintuitive to have "or equal to" on both ends, but this covers the situations where we have repeated data in our samples. So, 90 percent of the men have systolic blood pressures greater than or equal to 107 millimeters of mercury.
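The definition above, the smallest sample value with at least p percent of the sample at or below it, can be sketched with a simple "nearest-rank" helper (an illustration of one common convention, not the only one):

```python
import math

def sample_percentile(values, p):
    """pth sample percentile by the simple 'nearest-rank' rule:
    the smallest sample value such that at least p percent of
    the sample is less than or equal to it."""
    s = sorted(values)
    k = max(1, math.ceil(p / 100 * len(s)))  # rank of the percentile value
    return s[k - 1]

bps = [120, 80, 90, 110, 95]
print(sample_percentile(bps, 50))  # 95, the sample median
print(sample_percentile(bps, 25))  # 90
```

Note that statistical software typically interpolates between adjacent ordered values, so its reported percentiles (including the even-n median rule described earlier) can differ slightly from this nearest-rank version.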
The 75th percentile for these 113 blood pressure measurements is 132 millimeters of mercury, meaning that approximately 75 percent of the men in the sample have systolic blood pressures less than or equal to 132 millimeters of mercury, and 25 percent have systolic blood pressures greater than or equal to 132 millimeters of mercury. Here are some other percentiles from the sample as well, including the 2.5th percentile, the value such that only 2.5 percent of the sample is less than or equal to it and the remaining 97.5 percent are greater; that's 100.7 millimeters of mercury. We have other examples here too, like the 25th percentile and the 97.5th percentile. Let's look at one more example. We're going to see different results, because in these data we have a lot of repeated values. These are length-of-stay claims at Heritage Health Plan for all enrollees who had an inpatient hospital stay of at least one day in the year 2011. There are 12,928 claims, and I'm clearly not going to show each individual value in these slides, because that would take a lot of PowerPoint, but I'm showing the first 50 measurements to give you a sense of the flavor of these data. You'll notice that there are a lot of repeated values: a lot of persons in the sample ended up having a one-day, the minimum, inpatient length of stay, and others had two days, which was repeated as well. So there are a lot of repeated values in these data. Let's talk about the summary statistics of center and spread first. The mean for these data is 4.3 days; contrast that with the median, which is only two days, and maybe you can start thinking about why that may be (we'll investigate this later in another section). The sample standard deviation is 4.9 days. So, here are some of the sample percentiles for the 12,928 claims.
Remember, to be in this dataset, you had to have an inpatient stay of at least one day, so the minimum value in the dataset is one. It turns out that the 2.5th percentile is also one: 2.5 percent of the data values are less than or equal to one, and the remaining 97.5 percent are greater than or equal to one. Because there are so many repeated ones in this dataset, one is also the 25th percentile, so that tells us right off the bat that more than 25 percent of the persons in the sample had a one-day length of stay in 2011: 25 percent of the data values are less than or equal to one, and the remaining 75 percent are greater than or equal to it. Then we see the median is actually two days. So now we can tell that somewhere between 25 percent and just under 50 percent of the sample had the value one, because the percentiles jump to two by the time we reach the median. But notice that the difference between the median and the 2.5th percentile is only one day. Contrast that with the difference between this median of two days and the 97.5th percentile, the value on the other end, the complement (not quite the mirror image) of the 2.5th percentile, the value such that 97.5 percent of the data values are less than or equal to it and 2.5 percent are greater: that's 20 days. This distance, in the second half of the dataset, is a lot larger than the spread in the first half. So just think about that, and when we start looking at visual displays, we will see how this plays out.
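The pattern just described, with many tied values at the low end and a long right tail, can be sketched with a small hypothetical dataset (invented values, not the actual Heritage claims), showing repeated values producing equal percentiles and a mean pulled above the median:

```python
import statistics

# Hypothetical length-of-stay sample: many tied one-day stays
# plus a long right tail (n = 100 invented values).
los = [1] * 40 + [2] * 25 + [3] * 15 + [5] * 10 + [10] * 7 + [30] * 3

s = sorted(los)
print(s[2], s[24])             # 1 1 -- the 2.5th and 25th percentile positions both land on 1
print(statistics.median(los))  # 2.0 -- the median jumps to two days
print(statistics.mean(los))    # 3.45 -- pulled above the median by the long tail
```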
So in general, summary measures that can be computed for a sample of continuous data include the mean, the standard deviation, the median or 50th percentile, and other percentiles, and these sample-based estimates are the best estimates of unknown underlying population quantities. For example, x bar, our sample mean, is our best estimate, based on the data we have, of the underlying population mean, and s, our sample standard deviation, is the best estimate of the population standard deviation. Soon, about halfway through the course, we'll start talking about how to address the uncertainty in these estimates as they relate to the unknown quantities they're estimating. Certainly the sample mean is an imperfect estimate of the population mean, because it's based only on the values in our sample and not all values in the population. In the next section, we'll continue with how to look at and investigate continuous data by introducing some visual summary measures as well.