So, in this section, we'll actually do something that we did not do in the first term. It's not just a matter of re-expressing something we did in the first term as regression; we'll look at the situation where we compare means across groups and estimate group differences, but our grouping factor will be allowed to remain continuous. So, in this section, hopefully, you'll get a sense of why treating a continuous predictor as continuous, as opposed to making it binary or categorical, can be beneficial: we can gain something from doing that in an analysis. We'll use what's called a scatterplot. We already saw an example where one wasn't useful, when x only took two values, but when we have a continuous measure a scatterplot may be a useful tool to assess whether an outcome-predictor relationship is reasonably described by a line; in other words, whether the mean of the outcome tracks with the continuous predictor in a linear fashion. We'll learn to interpret the estimated slope and intercept from a simple linear regression model with a continuous predictor x_1. So, let's go back to our anthropometric data. Again, we already looked at sex differences in arm circumference using simple linear regression in a previous section. Now, let's look at the relationship between average arm circumference and height in the sample of 150 Nepali children 0-12 months old. So, here are some statistics on the whole data set. The mean arm circumference for the entire sample is 12.4 centimeters, and it ranges from 7.3 to 15.6 centimeters. The mean height in the sample is 61.6 centimeters, and the range is 40.9 to 73.3 centimeters. So, for the first approach: we've got height measured as a continuous predictor here.
But if we wanted to apply regression in the sense of how we did in previous sections, and treat this as a two-sample comparison with a binary predictor, we could dichotomize height at the median and compute the mean difference in arm circumference for taller children compared to shorter children from their respective means, or of course we can do that as a simple linear regression. So, suppose we were to model the mean arm circumference as a function of height dichotomized at the median, with a predictor x_1 which takes on the value 1 for children above the median height, and 0 for children who are less than or equal to the median height. This is the following equation: the average arm circumference is equal to 11.7 plus 1.4 x_1. So, the 11.7 here is the estimated mean arm circumference for the reference group, the group coded zero on x_1, which is the group that is less than or equal to the median height. So, their arm circumference on average is 11.7 centimeters. For the group who are greater than the median, we would take that 11.7 and add 1.4. So, 1.4 is the mean difference in arm circumference for the group whose height is greater than the median compared to the group whose height is less than or equal to the median height. So, it's just a mean difference between two groups. So, the potential advantage of taking the continuous height and dichotomizing it into being above or below the median is that we know how to do it. It gives a single summary measure, a sample mean difference, for quantifying the arm circumference-height association. But the potential disadvantage of this approach is that it throws away a lot of information in the height data that was originally measured as continuous, so we get a much cruder representation by taking something that was initially continuous and dichotomizing it. It only allows for a single comparison between these two crudely defined height categories, above or below the median.
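To make the link between the two-group comparison and the regression concrete, here is a minimal sketch. It assumes the line is fit by ordinary least squares (the lecture has not named the fitting method), and the heights and arm circumferences are hypothetical illustration values, not the Nepali data. The helper `fit_line` is also just an illustrative name.

```python
from statistics import mean, median

def fit_line(x, y):
    """Least-squares intercept and slope for a single predictor (assumed method)."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope

# Hypothetical heights (cm) and arm circumferences (cm); NOT the lecture's data.
heights = [52.0, 55.5, 58.0, 60.0, 62.5, 64.0, 66.5, 70.0]
arm_circ = [10.8, 11.5, 11.9, 12.0, 12.6, 12.9, 13.4, 13.8]

# Dichotomize at the median: x_1 = 1 if above the median height, else 0.
cut = median(heights)
x1 = [1 if h > cut else 0 for h in heights]

intercept, slope = fit_line(x1, arm_circ)

# Plain group means, computed without any regression.
mean_short = mean([a for a, g in zip(arm_circ, x1) if g == 0])
mean_tall = mean([a for a, g in zip(arm_circ, x1) if g == 1])

print("intercept:", intercept, "reference-group mean:", mean_short)
print("slope:", slope, "mean difference:", mean_tall - mean_short)
```

With a 0/1 predictor, the fitted intercept matches the mean of the group coded 0 and the fitted slope matches the difference between the two group means, which is the same reading given to 11.7 and 1.4 above.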
Another approach that is a little more robust, that gets a little more use out of the height data, is to categorize height into four categories by quartile and compare the mean arm circumference via mean differences across the height quartiles. So, even though these quartiles are ordinal, we can treat them as categorical, and we don't force any structure on the nature of the relationship as the quartiles increase. We don't force the mean to always increase or decrease, as we would by sticking quartile in as a single x taking on the value one, two, three, or four. So, what I did here is make quartile one the reference, and the intercept here is 10.9. That's the mean arm circumference for height quartile one, and then these coefficients are the respective differences between quartiles two through four and the reference group. The 1.76 means that the group in the second height quartile has an average arm circumference 1.76 centimeters greater than the reference. The 2.21 means that those in the third height quartile, Q3, have arm circumferences 2.21 centimeters greater on average than those in the first quartile, the reference again, and then this last one is the difference in averages between those in the fourth quartile and those in the same reference, the first. So, the potential advantage of that approach is, again, that we know how to do it. We can make four groups and we can estimate mean differences between each of the three groups and the reference, using the simple linear regression approach or not. We can just do it having the four means. But the potential disadvantages are that it still throws away a lot of information in the height data that was originally measured as continuous.
It also requires multiple summary measures: six sample mean differences, one for each unique combination of height categories, if we were to do every possible comparison of two groups to quantify the arm circumference-height relationship. Each of those means only uses the data in that particular quartile. So, even though we started with 150 observations, there are only roughly between 30 and 40 in each of the height quartiles. So, each of those four means for arm circumference is based on a smaller grouping of the data and hence will be less precise. And this does not exploit the structure we see in the previous boxplots: as height increases, so does arm circumference. If we look at these boxplots, you may argue, and I'd agree, that perhaps the increase in arm circumference on average from quartile one to quartile two is greater than the subsequent increases between quartiles, but these four quartiles are a crude categorization of a continuous measure. So, perhaps if we did quintiles or deciles we'd see more of a linear nature, or not. So, in that boxplot presentation, we have some empirical evidence that as height increases, arm circumference tends to increase. Let's explore this a little further and think about treating height as continuous when estimating the arm circumference-height relationship. So, linear regression using a continuous predictor is a potential option that allows us to associate a continuous outcome with a continuous predictor via a line, and the line estimates the mean value of the outcome, in this case arm circumference, for each value of the continuous predictor, height. This approach makes a lot of sense when our predictor is also continuous, but only if a line reasonably describes the outcome-predictor relationship.
So, we saw some empirical evidence of a dose response, in other words, increasing height is associated with increasing arm circumference, when we had those four quartiles of height and the boxplots of arm circumference within each. But there was some possible concern about the relationship not being completely linear, with a bigger jump at the earlier heights as opposed to the later heights, though that may have been an artifact of the crude categorization. So, what I'm showing here is a useful visual display for assessing the nature of the association between two continuous variables; this graphic is called a scatterplot. There are 150 points on this plot, one for each child, and for each child his or her height is plotted on the horizontal or x-axis and his or her corresponding arm circumference is plotted on the y-axis. So, it looks to me, subjectively of course, but at first glance, like a line would not be a terrible fit to this association to begin with. So, I'm going to assume for the moment we can do this, I'm going to do it, and then we're going to interpret the results. So, a regression line can be estimated via the computer, of the form y hat equals some intercept estimate plus some slope estimate times x_1, where y hat is the average arm circumference and x_1 is the height in centimeters. So, what this equation does, for any given value of x_1, any given value of height, is give us back an estimated mean arm circumference for a group of children all of that same height. So, for these data on 150 Nepalese children less than 12 months old, the estimated regression line is y hat equals 2.7 plus 0.16 x_1. So, in general, we see that as height x_1 increases, average arm circumference increases as well, because that slope is positive. So, here's the scatterplot again with this resulting regression line superimposed on the graph.
So, that looks like a reasonably good fit to me. Maybe, because there are so few children down below 50 centimeters, it looks like we may tend to overestimate things there, but again we don't have much data there to work with. In the region where there's more data and more variability at any given height, this tends to fit quite well in my opinion; it strikes a nice balance down the middle of the values, and again, we're estimating the mean for any given height. So, for example, what does this equation say is the estimated mean arm circumference for children 60 centimeters tall? Well, we can find that by plugging in 60 as the x value in the equation and cranking out the result, and when we do that we get an estimate, y hat, of 12.3. That 12.3 corresponds to the point on the line with the x value equal to 60. Notice, though, that most points don't fall directly on the line. As I've said before, we are estimating the mean arm circumference of all children 60 centimeters tall, but the observed arm circumferences for individual children who are at or around 60 centimeters tall will vary about this estimated mean. So, if we look at the shaded portion here, you can see that there are several points, several observations on children whose height is 60 centimeters, and this point on the graphic estimates the mean arm circumference for those children. So, recall, for these data on 150 Nepalese children less than 12 months old, the estimated regression line is y hat equals 2.7 plus 0.16 x_1. So, this slope, beta one hat: how do we interpret this? Well, again, this is just the average change in arm circumference for a one-unit increase in height. In other words, it's the mean difference in arm circumference for two groups who differ by one unit in x_1, that is, by one centimeter in height, and it compares the taller group to the shorter group.
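The plug-in calculation for children 60 centimeters tall, described above, can be sketched in a few lines. The intercept and slope are the lecture's estimates; the function name `estimated_mean_arm_circ` is just an illustrative choice.

```python
INTERCEPT = 2.7  # estimated intercept from the lecture (cm)
SLOPE = 0.16     # estimated slope (cm of arm circumference per cm of height)

def estimated_mean_arm_circ(height_cm):
    """Estimated MEAN arm circumference for a group of children of this height."""
    return INTERCEPT + SLOPE * height_cm

y_hat_60 = estimated_mean_arm_circ(60)
print(round(y_hat_60, 1))  # 12.3
```

The result is an estimate of the group mean at 60 centimeters, not a prediction that every child of that height has an arm circumference of 12.3 centimeters.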
So, this result estimates that the mean difference in arm circumference per one-centimeter difference in height is 0.16 centimeters, with taller children having greater arm circumference. This is constant across the entire height range in the sample. Anywhere along this line the slope is 0.16. So, if I were to look at the average difference in arm circumference for any one-unit difference in height, whether it be 51 versus 50 or 66 versus 65, that difference is that constant slope of beta one hat equals 0.16. So, what is the estimated mean difference in arm circumference for children 60 centimeters tall versus 59 centimeters tall? Children 45 centimeters tall versus children 44 centimeters tall? Children 72 centimeters tall versus children 71 centimeters tall, et cetera? Each of these pairs of groups differs by one centimeter in height, and that one slope estimate quantifies the difference in arm circumference between any two such groups. So, the answer is the same for all of the above, and for any other comparison within that range of height values that differs by one unit. The answer is 0.16 centimeters. Now, you might say, well, it's not that interesting to always compare two groups of children who differ by one centimeter in height. I want to look at greater differences in height and see what that begets in terms of differences in average arm circumference. So, to do this, we could write out the estimated equation and what it gives us when height is equal to 70 centimeters and what it gives us when height is equal to 60. So, when height is equal to 70 centimeters, our estimated mean arm circumference is that intercept of 2.7 plus 0.16 times 70. When we're looking at height 60 centimeters, our estimated mean value of arm circumference, y hat when x_1 is equal to 60, is 2.7 plus 0.16 times 60.
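The one-centimeter comparisons listed above, and the two plug-in estimates at 70 and 60 centimeters, can be checked with a short sketch. It uses only the lecture's estimated line; the function and variable names are illustrative.

```python
INTERCEPT = 2.7  # estimated intercept from the lecture (cm)
SLOPE = 0.16     # estimated slope (cm of arm circumference per cm of height)

def estimated_mean_arm_circ(height_cm):
    """Estimated mean arm circumference for children of the given height."""
    return INTERCEPT + SLOPE * height_cm

# One-centimeter comparisons named in the lecture: taller minus shorter.
pairs = [(60, 59), (45, 44), (72, 71)]
one_unit_diffs = [
    estimated_mean_arm_circ(taller) - estimated_mean_arm_circ(shorter)
    for taller, shorter in pairs
]
print([round(d, 2) for d in one_unit_diffs])  # [0.16, 0.16, 0.16]

# A ten-centimeter comparison: 70 cm versus 60 cm.
diff_70_vs_60 = estimated_mean_arm_circ(70) - estimated_mean_arm_circ(60)
print(round(diff_70_vs_60, 2))  # 1.6
```

Every one-centimeter comparison returns the slope, and the ten-centimeter comparison returns ten times the slope, because the intercept appears in both estimates and drops out of the subtraction.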
If we take the difference of those, the intercepts cancel and we get 0.16 times (70 minus 60), or 0.16 times 10, or 1.6 centimeters. So, we have a 10-unit difference in our x value. The slope quantifies the difference in y hat per one-unit difference in x, and that's 0.16. So, over 10 units, that's 0.16, the original slope, times the difference in x values, which is 10, and that gives us the cumulative difference you can see on this graphic over here: this difference is 10 times beta one hat, which equals 1.6 centimeters. So, this slope is a very powerful number, because under the assumption of linearity, if our predictor is continuous, this single number encapsulates all the information about mean differences in our outcome for any comparison we can make for differences in our predictor, whether it be a one-unit difference or a multi-unit difference. Let me ask you this. What is the estimated mean difference in arm circumference for children 90 centimeters tall versus children 89 centimeters tall, or children 34 centimeters tall versus children 33 centimeters tall? Well, your first knee-jerk response would be what I would say too, which is 0.16 centimeters, because that's the average difference in arm circumference for any two groups of children who differ by one centimeter in height. But this is actually a trick question. The range of observed heights in the sample is between 40.9 centimeters and 73.3 centimeters. So, these regression results only apply to the relationship between arm circumference and height over this observed height range. They only quantify this association for children in this height range. All right. So, recall again, for these data on 150 Nepalese children less than 12 months old, the estimated regression line is y hat equals 2.7 plus 0.16 x_1. So now, let's try to figure out what the interpretation of the estimated intercept is. The estimated intercept, beta nought hat, is equal to 2.7 centimeters. What is the interpretation of this?
Well, generally speaking, it is the estimate y hat, the average arm circumference, when x_1 equals zero. So, in other words, the mean arm circumference for children zero centimeters tall. So, this doesn't make a lot of sense. We can't actually have children who are zero centimeters tall, but we still have an estimated arm circumference for this non-existent group. Well, as we noted before, estimates of the mean arm circumference only apply to the observed height range, so roughly 40 to 73 centimeters. So, the intercept actually estimates the mean arm circumference for a height group outside the range of child heights in the sample. We don't have any children who were zero centimeters tall. So, this intercept is meaningless scientifically. It doesn't apply to any of our data. In fact, it doesn't apply to any data, because we can't have children zero centimeters tall. Frequently this is the case when x_1 is a continuous predictor: the intercept on its own is scientifically meaningless. But we still need this intercept to specify the full equation of the line, and then make estimates of the mean arm circumference for groups of children with heights in the sample range. So, even though on its own it doesn't tell us anything, we still need this intercept, the starting point for the line, to give us the proper end-result estimates. So, what I'm showing you here is our estimated regression line, the one based on our data, and here are three more lines that all have the same slope but different intercepts. You can see none of them fit the data as well as the one we got for our data, and that's because they're shifted up or down, because they have different intercepts. So, the intercept is absolutely necessary, even if it's an estimate outside the range of our x values, to position this line on the graphic and give us valid values: not just the difference in arm circumference between any two groups who differ by one centimeter in height, but also the specific average arm circumference for any group given their height, which we can only estimate with the proper intercept. So, even though the intercept doesn't always have independent meaning, it is absolutely critical to forming your understanding of the relationship of interest.
Let's look at another example, systolic blood pressure and age. These are data from, again, the NHANES study, the 2013-2014 wave. What we have here is data on 10,000-plus observations, persons 0-80 years old, and for about 7,100 of these we have systolic blood pressure, because those are the persons in the range 8-80 years old. So, we don't have blood pressure measurements on children less than eight years old. Here's a scatterplot of the data. For each of the roughly 7,000 persons, I've plotted their individual blood pressure and their individual age, with age on the x-axis and systolic blood pressure on the y-axis, but there are so many points that it looks like just one big jumbled cluster. We can see that there might be an upward shift, that blood pressure may increase on average with age, but it's really hard to tell with this much data on a picture. So, I'm going to use a pre-regression tool here, a flexible exploratory tool called a running mean smoother, to help figure out what the general nature of the association between mean systolic blood pressure and age is. What this tool does, and you could do this in any computer program, is fit a flexible function to the graph here. What it does is break the x-axis, the age range, down into small windows around specific ages.
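One way a running mean smoother of this kind could be sketched is shown here. The lecture does not specify the window width or the weighting scheme, so the `half_width` parameter and the triangular weights (largest at the target age, smaller toward the window edges) are assumptions, and the ages and blood pressures are hypothetical values rather than NHANES data.

```python
def running_mean_smoother(ages, values, half_width=10.0):
    """Weighted mean of `values` in a window around each observed age.

    Assumed weighting: weight 1 at the target age, decreasing linearly
    with distance, and still positive at the edge of the window.
    """
    smoothed = []
    for target in sorted(set(ages)):
        weighted_sum = 0.0
        weight_total = 0.0
        for age, value in zip(ages, values):
            distance = abs(age - target)
            if distance <= half_width:
                weight = 1.0 - distance / (half_width + 1.0)
                weighted_sum += weight * value
                weight_total += weight
        # The target age itself is always in the window, so weight_total > 0.
        smoothed.append((target, weighted_sum / weight_total))
    return smoothed

# Hypothetical ages (years) and systolic blood pressures (mmHg); NOT NHANES data.
ages = [20, 30, 40, 50, 60]
sbp = [110.0, 112.0, 120.0, 124.0, 130.0]

curve = running_mean_smoother(ages, sbp, half_width=10.0)
for age, smoothed_mean in curve:
    print(age, round(smoothed_mean, 2))
```

Connecting the `(age, smoothed_mean)` pairs gives a curve that imposes no particular shape on the relationship, which is what makes it useful for judging whether a straight line is a reasonable starting point.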
So, if we wanted to estimate the mean at 40, what it might do is use only the data at 40 and close to 40, take the mean systolic blood pressure of those values, and plot it at the point 40, and it would do this giving less weight to the values that are around 40 as compared to the values for persons who are actually age 40. Then it would move this window across the entire age range. It would do the same thing at 41 and the same thing at 42, estimating a weighted mean in each small window of ages, and then it connects all of these. So, there's no assumption of any shape for the relationship between blood pressure and age here. We're just connecting the dots of means estimated in small areas, mean blood pressures estimated in small windows of age. We can see that, well, it's not a perfect line, but it reasonably approximates a linear function. I think it would be reasonable to start by estimating a line here. What we're going to do here is estimate, with the computer, a line of the form y hat equals beta nought hat plus beta one hat times x_1, where x_1 is age in years and y hat is the average systolic blood pressure in millimeters of mercury for a group of individuals all of the same age, given by the predictor value x_1. For these data, the estimated regression line is y hat equals 99.52 plus 0.48 times x_1. So, this slope of 0.48 aligns with what we've tended to see in the pictures, that as age increases, blood pressure tends to increase. What this estimates is the average change in systolic blood pressure for a one-unit increase in age, and age is in years here, so that would be a one-year increase in age. Another way to phrase this is that beta one hat is the mean difference in systolic blood pressure for two groups of persons who differ by one unit, that is, one year, in age, older compared to younger. So, this result estimates the mean difference in systolic blood pressure.
For a one-year difference in age, that mean difference is 0.48 millimeters of mercury, with older individuals having greater average systolic blood pressure. So, for these data, the estimated regression line is y hat equals 99.52 plus 0.48 x_1, as we saw before. Here, the estimated intercept is beta nought hat equals 99.52. So, technically speaking, this is the estimate y hat, the estimated mean systolic blood pressure, when the predictor x_1 is zero. In other words, the estimated mean systolic blood pressure for persons zero years old, or newborns. Because the age range of persons for which systolic blood pressure was measured is 8-80 years, this estimated mean alone is not applicable to these data. However, for similar reasons to those we showed with the previous example, the intercept estimate is necessary to specify the complete relationship here. So, we don't want to get rid of it, even if we can't use it on its own to specify an estimated blood pressure. So, here is the estimated regression line superimposed over the scatterplot of systolic blood pressure values against age. Visually, it looks like a pretty good fit. If you take advanced courses in regression, you can actually evaluate how well this fits and try alternative models that don't force total linearity across the range. Unfortunately, we won't be able to get into that in this course. But nevertheless, it passes the visual test pretty well in terms of appropriateness of fit. So, in summary, simple linear regression is a method for estimating the relationship between the mean value of an outcome y and a predictor x_1 via a linear equation. When x_1 is a continuous variable, the estimated slope for x_1, beta one hat, has a mean difference interpretation, as it always does. It is the mean difference in y, the difference in y hat values, for two groups who differ by one unit in x_1: the change in the mean of y per unit change in x_1.
This can be estimated for any data. We can get the computer to give us an estimate, but treating x_1 as continuous in this way is only appropriate if there's some evidence in the data, usually from a visual assessment, that the relationship between the mean of the outcome and x_1 is relatively well described by a line. In other words, that it's relatively linear. The estimated intercept, beta nought hat, is the estimated mean of y, the estimated y hat value, when x_1 equals zero. This is often not a scientifically relevant quantity when we have a continuous x_1, but it is still necessary to specify the complete linear relationship. So, in subsequent sections, we'll look at how we address the uncertainty in our estimates by putting confidence limits on these regression quantities, and also at formally testing, via hypothesis testing, whether the association is statistically significant.
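As a recap, the points in this summary can be pulled together in one small sketch using the blood pressure line from this section. The intercept, slope, and the 8-80 year age range come from the lecture; the function name and the choice to raise an error outside the observed range are illustrative assumptions, one possible way to guard against applying the line where there were no data.

```python
INTERCEPT = 99.52              # estimated intercept (mmHg), from the lecture
SLOPE = 0.48                   # estimated slope (mmHg per year of age)
AGE_MIN, AGE_MAX = 8.0, 80.0   # ages for which blood pressure was measured

def estimated_mean_sbp(age_years):
    """Estimated mean systolic blood pressure for a group of this age.

    Refuses ages outside the observed range, since the fitted line
    only describes the association where data were collected.
    """
    if not AGE_MIN <= age_years <= AGE_MAX:
        raise ValueError("age is outside the observed range of 8 to 80 years")
    return INTERCEPT + SLOPE * age_years

# The slope is the mean difference per one-year difference in age.
one_year_diff = estimated_mean_sbp(41) - estimated_mean_sbp(40)
print(round(one_year_diff, 2))  # 0.48

# The intercept is still needed to get an actual estimated mean.
print(round(estimated_mean_sbp(40), 2))  # 118.72
```

Calling `estimated_mean_sbp(0)` raises an error in this sketch, which mirrors the point that the intercept is needed to position the line but is not itself an estimate for anyone in the data.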