So, in this lecture section we'll follow up on what we've done in the previous sections and talk about how we can use the results of a regression to get some measure of the strength of the linear association we've estimated; in other words, some measure of how well the mean of y tracks with x. The slope of a regression line estimates the magnitude and direction of the relationship between y and x1, and encapsulates how much y differs on average with differences in x1. The slope estimate and its standard error, as we showed in the last section, can be used to address the uncertainty in this estimate with regard to the true magnitude and direction of the association in the population from which the sample was taken. But neither the slope nor the intercept imparts any information about how well, if you will, the regression line fits the data in the sample; they give no indication of how close the points get to the estimated regression line. So we can't determine how well x predicts the mean of y by looking at just the slope itself. A lot of times people confuse larger slopes (larger in absolute value, that is) with stronger relationships. But the value of a regression slope depends on the units of both y and x1, when x1 is continuous and has units. For example, in the arm circumference and age example, we used data on 150 Nepalese children less than 12 months old. I presented the results where arm circumference was measured in each child in centimeters and age was in years, and the resulting estimated regression equation was y hat, the estimated mean arm circumference, equal to an intercept of 2.7 centimeters plus a slope of 0.16 times the predictor x1. If you think about it, since arm circumference was measured in centimeters and age in years, this 0.16 is in centimeters per year of age: we estimate the mean arm circumference increases by 0.16 centimeters per one-year increase in age.
But if I instead used age measured in months, the units of the slope would change to centimeters per month of age, and the value would differ because it's a different scaling of the same association. The slope would be a lot smaller in value, 0.013, because 0.013 centimeters per month of age corresponds to 0.16 centimeters per year of age. I could instead have measured arm circumference in inches and kept age in years; in that case the slope would be in units of inches per year of age, and we'd get a different numerical value than the previous two. All three of these slopes quantify the exact same relationship; the relationship hasn't changed, just how we've expressed it, via the choice of units. Finally, if we measured arm circumference in inches and age in months, we'd get a fourth value for the slope. Again, all of the above regressions, with four different estimates of the slope, quantify the exact same relationship; the only reason the slopes differ across the four is the choice of units for both the outcome and the predictor. So we can arbitrarily increase or decrease the value of the slope at will by changing our units, and as such the size of the slope is totally dependent on units and doesn't measure anything about the underlying strength of the relationship. Because it's affected by the choice of units for both y and x, the absolute magnitude of the slope gives no information about the strength of a linear relationship. So another quantity that can be estimated via linear regression is something called the coefficient of determination, also called R squared. This is a number that ranges from zero to one, with larger values indicating closer fits of the data points to the regression line.
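To make the unit-dependence concrete, here is a minimal sketch in Python. The ages and arm circumferences are made up for illustration, not the actual Nepalese sample; the point is that rescaling age from years to months divides the least-squares slope by exactly 12, while the relationship itself is unchanged.

```python
# Sketch: the slope's numerical value depends on units.
# Data below are hypothetical, not the real sample from the lecture.
def ols_slope(x, y):
    """Least-squares slope of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    return sxy / sxx

age_years = [0.2, 0.4, 0.5, 0.7, 0.9]       # hypothetical ages
arm_cm    = [11.8, 12.1, 12.3, 12.5, 12.9]  # hypothetical arm circumferences

b_cm_per_year  = ols_slope(age_years, arm_cm)
b_cm_per_month = ols_slope([a * 12 for a in age_years], arm_cm)

# Same association, different scaling: the per-month slope is 1/12 as large.
print(round(b_cm_per_year / b_cm_per_month, 6))  # 12.0
```

The same cancellation happens for any linear rescaling of x or y, which is exactly why slope size alone says nothing about strength of association.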
So R squared measures the strength of association by comparing the variability of points around the regression line, the variation of individual values around their regression-estimated means, to the variability of the y-values around the same mean for everyone, which ignores the information in the predictor x1. So let's flash back to term one and suppose we were to measure the overall variability in arm circumference, ignoring height. What we would do is take the 150 children in our sample and, for each child, take the difference between his or her arm circumference and the single overall mean for all 150 children, which was roughly 12.4 centimeters in this sample, square it, and sum across the 150 children. Visually, even though I've plotted height on the x-axis here, the point I'm trying to illustrate is that what we're estimating is the variability of these arm circumference measurements around a flat-line mean, the same mean for each child regardless of height. That would be the total original variability in arm circumference ignoring height, and if we wanted to turn that into a standard deviation, we'd take the square root of the sum of squared differences divided by the sample size less one. Now, what we want to concern ourselves with is how much variability we have around the mean value estimates for groups with the same x1 value, when we allow the mean to differ depending on the value of x1; in other words, when we fit a regression line that predicts different estimated mean arm circumferences for different heights. The variation of the observed values around the regression line is the sum of the squared residuals, and it looks like this: for each point, we take the child's arm circumference and subtract the estimated mean for children with the same height.
So the mean here changes depending on the height of the person; now we're taking height into account when we talk about mean arm circumference. We can estimate the standard deviation, the average variation of individual points around the regression line, by taking the sum of squared distances of each point from the line, dividing by the sample size less two, and taking the square root; you can again think of this as roughly averaging that squared distance and taking the square root. This standard deviation of the individual values around their regression-predicted values is sometimes written as s of y given x1, where the pipe means "given the value of x1." The total residual variability, before we've averaged it, the total sum of the squared distances between each individual point and his or her regression-predicted arm circumference mean, quantifies the variability in arm circumference, or y, not explained by x1. These discrepancies are differences between individual values and their estimated means; this is the part of the variability in arm circumference that's not explained by a person's height. The smaller this quantity is compared to the overall variation in arm circumference ignoring height, the stronger the relationship between y and x1, because smaller residual variability means the points are closer to the regression line. R squared will be calculated by the computer, not by hand, but for reference I'm providing the formula to better illustrate the concept that I just showed visually. Start with the proportion of the overall variability in y not explained by taking into account x, in other words, not explained by the linear regression equation. In our example, the proportion of the overall variability in arm circumference not explained by the predictor, height, is given by the following.
The denominator of this proportion is the total overall variability in y, in arm circumference, ignoring x1, and the numerator is the leftover variability in y, in our case arm circumference, after taking into account x1, after taking into account height. So this is the proportion of the overall variability, ignoring x1, that is not explained by taking x1 into account. Conversely, the proportion of the variability in our outcome, for example the proportion of the variability in arm circumference, explained by taking height, x1, into account is one minus this previous proportion. If this is the proportion not explained, then one minus this proportion is the proportion of the original variability that has been explained by taking x1 into account when making a statement about the mean of y, and that is our quantity R squared. So again, R squared quantifies the proportion of variability explained by using the information in x1 via a linear regression. This value of R squared can be anywhere from zero to one. If there is no reduction in variability, then the numerator equals the denominator; in other words, the leftover variability from the regression is the same as the original overall variability in the data ignoring x1, that ratio equals one, and R squared is one minus one, or zero. Conversely, if all the points lined up exactly on the regression line, everyone's value equal to their predicted mean, there would be no variation left in the outcome y values after taking x1 into account, the numerator would equal zero, and R squared would be one minus zero, or one. In real, stochastic data you'll essentially never have R squared values of exactly zero or one, but the closer R squared is to one, the stronger the linear relationship; in other words, the closer the individual y values are to the respective mean values predicted by x1 from the regression equation.
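The arithmetic behind R squared can be sketched in a few lines of Python. The heights and arm circumferences below are hypothetical, not the real sample; the formulas are just the sums of squares described above, plus the residual standard deviation s of y given x1.

```python
# Sketch of the R-squared arithmetic on a small made-up sample.
def fit_line(x, y):
    """Least-squares intercept and slope of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b1 = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
          / sum((xi - mx) ** 2 for xi in x))
    b0 = my - b1 * mx
    return b0, b1

x = [50.0, 55.0, 60.0, 65.0, 70.0]   # hypothetical heights (cm)
y = [11.5, 12.0, 12.6, 12.9, 13.5]   # hypothetical arm circumferences (cm)

b0, b1 = fit_line(x, y)
my = sum(y) / len(y)

# Denominator: total variability in y, ignoring x (flat-line mean).
ss_total = sum((yi - my) ** 2 for yi in y)
# Numerator: leftover variability around the regression line.
ss_residual = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))

r_squared = 1 - ss_residual / ss_total
s_y_given_x = (ss_residual / (len(y) - 2)) ** 0.5  # SD of points around the line

print(round(r_squared, 3))  # prints 0.992 for this made-up, nearly linear sample
```

Because this toy sample is nearly linear, almost all of the variability is explained; real data, like the arm circumference sample in the lecture, sit much further from the line.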
So let's look at our example for arm circumference and height. The R squared for this regression model, and again I get this from the computer, is 0.46, or 46 percent. This means that 46 percent of the original variation in arm circumference is explained by taking into account children's height, about half the variability. So there's certainly information there, but that also means, on the flip side, that an estimated 54 percent of the variability in arm circumference is not explained by a child's height. So if we wanted to predict arm circumference for an individual child whose height we could measure but whose arm circumference, for some reason, we could not, and we used the mean for all children with the same height, that mean would not necessarily predict very well for any individual child, because height only explains some percentage of the overall variability; there's still a fair amount of variability left over in the individual values around their predicted mean. However, we may want to do better in terms of prediction, and what we'll see is that we can increase the number of predictors in the model by invoking multiple linear regression, which we'll get to soon in the course, to see if we can explain some of that additional variation not explained by height by adding other factors into our multivariable equation. We might take into account the child's sex, the arm circumference of the mother, etc., and we may get a better prediction or explain more variability in arm circumference by taking into account multiple factors. You may recall the regression we looked at for systolic blood pressure and age, using NHANES data, a positive association. The R squared for this was 0.34, or 34 percent.
So you can see in this picture there's the flat line, which is the overall mean systolic blood pressure of all observations ignoring age, and then this slanted line is the actual regression line. What we're saying is that the points get closer, on average, to the regression line than they do to that flat line, and the regression explains about 34 percent of the overall variation in these individual values, the original variation around that flat line. So again, we have a situation where there is definitely a trend, mean systolic blood pressure tends to increase with age, but if you were to try to predict somebody's blood pressure without measuring it, based on their age, using the mean for people of the same age, you still wouldn't necessarily get a very precise prediction, because there's still a large amount of variation unexplained by age, a large amount of variability in the individual systolic blood pressures within any given age group. Additionally, this means an estimated 66 percent of the variability in systolic blood pressure is not explained by age. Some of this unexplained variability may be explained by factors other than age, and in the next set of units we'll look at expanding this model to include other predictors that may influence systolic blood pressure above and beyond age, and explain some of that variability that was not explained by age. We can also compute R squared when we have dichotomous or categorical predictors. Not uncommonly, though, when our predictor is dichotomous, like sex, or categorical, like ethnicity, there tends to be minimal variation explained by the categories; there tends to still be a fair amount of variability in individual values within each of the x levels.
So in this example, when we regressed arm circumference on sex, we saw that females had an average arm circumference smaller than that of males; it turns out, and I didn't show you the inference, that this was not statistically significant. Nevertheless, the R squared for this model is 0.002, or 0.2 percent: sex explains less than one percent of the variability in arm circumference. So knowing a child's sex does not allow one to make a very accurate prediction about an individual's arm circumference; there's a lot of variation around the mean arm circumferences for males and females. This means that an estimated 99.8 percent of the variability in arm circumference is not explained by sex. Obviously, there may be other factors, like height, that we've seen explain a lot more of the variability in this outcome. There are a couple of important things to keep in mind about this measure, R squared. As with all other estimates, R squared is based on a sample of data, but it's frequently reported without any recognition of sampling variability. It's usually reported as is, with no confidence interval, namely because it's difficult to get a confidence interval for it and the approximations are not very good. This is something important to note. Also, a low R squared is not necessarily bad. Many outcomes cannot and will not have their variability fully, or close to fully, explained by any one single predictor, or even multiple predictors; there's sometimes a lot of inherent or natural variability in measures that can't be explained by other factors. The higher the R squared value, though, the better the predictor x1 predicts y for individuals in the sample or population, as the individual y values vary less about their means: the closer the points are to the line.
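To see why a dichotomous predictor often explains little variability, here is a sketch with a made-up sample. With a 0/1 predictor, the regression-predicted means are just the two group means, so R squared compares within-group variability to overall variability; the numbers below are hypothetical, not the lecture's data.

```python
# Sketch: R-squared with a dichotomous predictor (made-up values).
# With sex coded 0/1, the regression predicts each group's mean,
# so the residuals are deviations from the group means.
female = [0, 0, 0, 1, 1, 1]
arm    = [12.6, 12.1, 12.8, 12.0, 12.7, 12.3]  # hypothetical circumferences (cm)

mean_all = sum(arm) / len(arm)
mean0 = sum(a for s, a in zip(female, arm) if s == 0) / female.count(0)
mean1 = sum(a for s, a in zip(female, arm) if s == 1) / female.count(1)

ss_total = sum((a - mean_all) ** 2 for a in arm)
ss_resid = sum((a - (mean1 if s else mean0)) ** 2
               for s, a in zip(female, arm))

r_squared = 1 - ss_resid / ss_total
print(round(r_squared, 3))  # small: most variability sits within each sex
```

The group means barely differ relative to the spread within each group, so the ratio of leftover to total variability stays near one and R squared stays near zero, mirroring the 0.002 result in the lecture.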
However, there may be important overall associations between the mean of y and x1 even when there's still a lot of individual variability in the y values about their means as estimated by x1. In the SBP and age example, the systolic blood pressure and age example, age explained an estimated 34 percent of the variability in systolic blood pressure. The association was statistically significant, showing that average systolic blood pressure is larger for older persons. However, for any given age, there's still substantial variation in systolic blood pressure across individuals; but we can see that on a group basis there's a trend. Let me now talk about R squared's companion statistic, which sounds like it's easy to compute from R squared, and it is, but there's a catch: it's called r. Although R squared is usually represented with a capital R, the corresponding correlation is usually represented with a lowercase r. So another value that measures the strength of a linear relationship, but also includes information about the direction of the relationship, is the correlation coefficient r. R squared is always positive regardless of the direction of the relationship, but r is the, if you will, properly signed square root of R squared: the sign of r corresponds to the sign of the slope. So this measure gives us, in one number, not only the strength of a relationship but also its direction. In the arm circumference and height example, the R squared was 0.46, and the square root of that is 0.68. These correlations are always larger numerically than the R squared values, because we're taking square roots of numbers less than one, and those square roots are larger than the numbers we start with. So the correlation coefficient in the arm circumference and height example is 0.68. For systolic blood pressure and age, the R squared was 0.34.
The square root of that, the correlation coefficient, is 0.58, and the association was positive, so we'd report the positive square root of 0.34. For arm circumference and female sex, when we coded sex as one for females, the R squared was 0.002. It would be the same if we coded males as one instead, but the sign of the slope would change. When we coded females as one and males as zero, the slope was negative, so the correlation coefficient here is the negative square root of 0.002: negative 0.04. So it tells us not only that the relationship is not very strong, but also that when female is coded as one for the predictor, it's a negative association. So let's bring back slopes and compare and contrast them with both R squared and r. The slope estimates the magnitude and direction of the relationship between our outcome y and our predictor x1: it estimates the mean difference in y for two groups who differ by one unit in x1. The slope will change if the units change for y and/or x1. Because of this, larger slopes in absolute magnitude are not indicative of a stronger linear association, nor are smaller slopes indicative of a weaker one; the size of the slope depends on the units used. R squared, conversely, measures the strength of the linear association, and r measures the strength and direction. Neither R squared nor r measures the magnitude, and it turns out neither changes with changes in units: they are invariant to the choice of units for outcome and predictor, unlike the slope. If you have r, you can compute R squared simply by squaring it. If you have R squared, you can almost compute r.
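A small sketch of the "properly signed square root" idea, using the R squared values from the lecture. The slope values passed in below are placeholders whose only role is to carry the sign; they are not the lecture's actual slope estimates.

```python
# Sketch: recover r from R-squared plus the sign of the slope.
def correlation_from_r2(r_squared, slope):
    """Signed square root: r has the same sign as the slope."""
    r = r_squared ** 0.5
    return r if slope >= 0 else -r

# Arm circumference vs. height: R-squared 0.46, positive slope.
print(round(correlation_from_r2(0.46, 1.0), 2))     # 0.68
# Arm circumference vs. female sex: R-squared 0.002, negative slope.
print(round(correlation_from_r2(0.002, -1.0), 2))   # -0.04
```

Note that squaring r always recovers R squared, but the reverse direction needs the extra piece of information, the sign, which is why R squared alone cannot tell you the direction of an association.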
You can take the square root of R squared, but you need to know the direction of the relationship being quantified, and having the slope and knowing whether it's positive or negative gives you that. Another utility of r versus R squared is that if I want to summarize relationships in a data set concisely, in terms of both the strength and the direction of the associations, r is the fuller piece of information in the sense that it includes the direction. I could present a table like this, where I list all my variables of interest, maybe age, weight, height, arm circumference, and sex coded one for females, both down the rows and across the tops of the columns, and wherever two variables meet in this matrix, I show their correlation coefficient. So if I wanted the correlation between weight and age, I see that it's positive, 0.77; if I wanted the R squared, I could square that. You can see the correlation between arm circumference and height, for example, is 0.68. The diagonal is equal to one because each variable is perfectly correlated with itself. Presenting things in this format gives a quick snapshot of the nature of the relationships between pairs of variables of interest, both the direction and the strength. So in summary, R squared measures the strength of the linear association modeled by the regression by comparing the variability of points, in other words, of individual y values, around their regression-predicted means to the variability of the y values ignoring x1. The correlation coefficient r is the properly signed square root of R squared, and hence provides information about the direction of the association estimated by the regression, above and beyond what it provides about the strength. These are measures that differ from, and convey different information than, the slope estimate from a linear regression.
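A correlation matrix like the one described can be sketched with stdlib Python. The variable names echo the lecture's table, but the measurements below are hypothetical, not the real data set.

```python
# Sketch: a pairwise correlation matrix from hypothetical measurements.
def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

data = {
    "age":      [0.2, 0.4, 0.5, 0.7, 0.9],      # hypothetical values
    "weight":   [4.1, 5.0, 5.4, 6.2, 7.0],
    "arm_circ": [11.8, 12.1, 12.3, 12.5, 12.9],
}
names = list(data)
for row in names:
    cells = [f"{pearson_r(data[row], data[col]):5.2f}" for col in names]
    print(f"{row:>8} " + " ".join(cells))
```

Each diagonal entry prints as 1.00 because every variable is perfectly correlated with itself, and the matrix is symmetric, so only one triangle is really needed, exactly the structure of the table in the lecture.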