So, let's look at the situation. Let's start by looking at linear regression where our outcome is continuous, and we're estimating the mean of the outcome for different values of our predictor x_1. We'll call it simple linear regression to indicate that we only have one predictor. Let's start with examples where the predictor is binary, so that ostensibly we're estimating two means like we did in the first term, and estimating the difference in means, but now in a regression framework. So, after viewing this section, I'd like you to understand that linear regression provides a framework for estimating means and mean differences in a linear equation context, and to be able to interpret the estimated slope and intercept from a simple linear regression model with a binary predictor. For linear regression, the equation is relatively straightforward. The regression models the mean value of a continuous outcome, which we'll call y, as a function of our predictor x_1. So, ostensibly what we're estimating is an equation that gives the estimated mean of y. We'll represent that here with y bar; the bar over the y, like we used when representing estimated means in the first term, indicates a mean, and it is modeled as a linear function of our predictor x_1: the intercept plus the slope times the value of x_1. As noted in the previous section, x_1, our predictor, can be binary, nominal categorical (in which case it will be represented by more than one x), or continuous. We'll take on those latter two situations in the next two lecture sections. I want to point this out: as with everything else we have done thus far, we will be using data from a sample, an imperfect sample from a larger population or process. So, we will only be able to estimate everything based on the sample. To indicate that we have estimates, we put hats over our intercepts and slopes to indicate that they are estimates based on the data. 
Frequently you'll see that even though the left-hand side here is the mean of y, which we previously represented with a bar, it is replaced with a hat in the written version of the equation, just to notate all the components the same way. So, we're estimating the mean of y as an estimated intercept plus an estimated slope from our sample data times our predictor value. For any given value of x_1, we can estimate the mean of y via the equation y hat equals beta naught hat plus beta one hat times x_1: we just plug our value of x_1 into this equation, and we'll get an estimated value of the mean of y. The slope compares the mean value of y for two groups who differ by one unit of x_1. Since we're estimating means of y, this slope is interpretable as a difference in means between two groups who differ by one unit of x_1. So, let's look at some examples when our predictor is binary to start parsing and interpreting this in a substantive context. An example we looked at in term one, which we're going to reframe as a regression problem, uses data on anthropometric measures from a random sample of 150 Nepalese children less than a year old, zero to 12 months old. The question we might have is: what is the relationship between average arm circumference and sex of the child? In these data, the overall mean arm circumference is 12.4 centimeters, and it ranges from 7.3 centimeters to 15.6 centimeters. That 12.4 is the mean for everyone; we're curious as to whether there are different underlying means for males and females. The data are roughly half and half, with a slightly higher percentage of female children: 51 percent female versus 49 percent male. So, we could look at the boxplot for these data. The sample means: the estimated mean arm circumference for males is 12.5 centimeters, and the estimated mean for females is 12.37 centimeters. You can see from the boxplot that this lines up with what we see. 
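To make the plug-in step concrete, here's a minimal sketch in Python. The coefficient values below are placeholders I made up for illustration, not estimates from any fitted model; the point is only that the estimated mean of y for any value of x_1 is the intercept plus the slope times x_1.

```python
def estimated_mean_y(b0_hat, b1_hat, x1):
    """Estimated mean of y from a simple linear regression:
    y-hat = b0-hat + b1-hat * x1."""
    return b0_hat + b1_hat * x1

# With a binary predictor (x1 = 0 or 1), the equation can only return two
# values: the mean for the x1 = 0 group (the intercept alone) and the mean
# for the x1 = 1 group (intercept + slope). Placeholder coefficients:
b0_hat, b1_hat = 10.0, 2.5

print(estimated_mean_y(b0_hat, b1_hat, 0))  # mean for the x1 = 0 group: 10.0
print(estimated_mean_y(b0_hat, b1_hat, 1))  # mean for the x1 = 1 group: 12.5
```

So with a binary x_1, "plugging in" only ever means plugging in a zero or a one.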
There's a lot of crossover in values, but the values for males are shifted up slightly compared to the values for females, hence the slightly higher mean value. We know from term one that if we wanted to estimate the difference in means between these two sexes, we'd simply take 12.5 and subtract 12.37, or do it in the opposite direction if we wanted to compare females to males. Here is what's called a scatterplot of these data. This will be a useful tool when our predictor is continuous, but it really doesn't enlighten us very much when our predictor is binary. What we have here are individual points, the individual values of arm circumference for each child in the data; each dot represents a single child, for males and for females. We certainly got more detail when we saw these data displayed as a boxplot, but when our x is continuous, showing these points in a two-dimensional graphic will be useful. So, here our outcome y is arm circumference, which is a continuous measure, and we're estimating the mean of y. But x_1 is not continuous; it's binary, male or female. So, how are we going to handle that as an x in regression, this binary value? Well, what we'll do is create a binary x that takes on the value of one for one of the two sex groups and zero for the other. I'm going to arbitrarily make it take on a value of one for female children and a value of zero for male children. So, what we're now going to do is estimate an intercept and slope for an equation that gives us the estimated mean arm circumference as a linear function of this binary sex indicator. Note that, as fancy as it seems to estimate a linear equation, this equation is only estimating two values: the mean arm circumference for male children, and the mean arm circumference for female children. 
So again, we're going to get an estimated slope and intercept such that when we're looking at female children, the estimated mean arm circumference for that group is the value of the intercept plus the value of the slope, because x_1 takes on the value of one for females. For male children, x_1 takes on a value of zero, so the slope won't enter into their estimated mean; it will simply be the intercept. So, this slope, beta one hat, estimates the mean difference in arm circumference for female children compared to male children. It is the difference in the estimated mean of y for a one-unit difference in x_1, as slopes are in general. But here the only possible difference in x_1 is a one-unit difference: the difference between x_1 equals one and x_1 equals zero. The resulting equation estimated by a computer (we'll talk more about the estimation process later in this lecture set) is this: the estimated mean arm circumference, given a child's value of x_1, is 12.5 plus negative 0.13 times x_1. Let's start by interpreting that slope. This is the estimated mean difference in arm circumference for female children compared to male children; in other words, female children have arm circumferences on the order of 0.13 centimeters less, on average, than males. The intercept value is 12.5: this is the estimated mean arm circumference for male children, in other words, children whose value of x_1 is zero, for whom the estimated mean is the intercept alone. Then if I take the sum of the intercept plus the slope, that is, the mean for males when x_1 equals zero plus the difference in means for x_1 equals one compared to x_1 equals zero, the resulting sum is 12.37. So males have a mean arm circumference of 12.5 centimeters, and female arm circumference on average is 0.13 centimeters less: when we add negative 0.13 to 12.5, we get 12.37 centimeters, the mean arm circumference for female children. 
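To see that the fitted intercept and slope really are just the male mean and the female-minus-male difference, here's a short sketch using NumPy least squares. The four data values are toy numbers I constructed so the group means match the 12.5 and 12.37 reported in the lecture; they are not the actual Nepalese sample.

```python
import numpy as np

# Toy arm-circumference values (cm), chosen so the group means match the
# lecture: males (x1 = 0) average 12.5, females (x1 = 1) average 12.37.
y = np.array([12.40, 12.60, 12.30, 12.44])
x1 = np.array([0, 0, 1, 1])  # 1 = female, 0 = male (the arbitrary coding used above)

# Fit y-hat = b0 + b1 * x1 by ordinary least squares.
X = np.column_stack([np.ones_like(x1, dtype=float), x1])
(b0_hat, b1_hat), *_ = np.linalg.lstsq(X, y, rcond=None)

print(round(b0_hat, 2))           # 12.5  -> intercept = mean for males (x1 = 0)
print(round(b1_hat, 2))           # -0.13 -> slope = female mean minus male mean
print(round(b0_hat + b1_hat, 2))  # 12.37 -> estimated mean for females (x1 = 1)
```

With a single binary predictor, least squares always reproduces the two group means exactly, which is why this "fancy" line is just the two-means comparison from term one in disguise.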
So, again, this is a fancy linear equation in which ultimately we're only estimating two values, a mean for males and a mean for females. But it is a line, and we only need two points to get a line, so what does the resulting regression line look like in this sample? Well, if we were to put it on a graphic such that our x-axis has a zero for males and a one for females, the y-intercept of this line, the value at x_1 equals zero (slightly off the axis here), is the mean for males, 12.5, and the value at x_1 equals one is the mean for females, 12.37. The slope of this line is very hard to see, but there is a slight negative slope, and it is negative 0.13. So x here is not measured on a continuum, but we only need two points to get a line, and I've shown all pieces of that line here. The coding for x_1 was arbitrary. In this example, it was coded as a one for female children and a zero for male children. The variable x_1 could instead have easily been coded as a one for male children and a zero for female children, and we could have fit a regression line that generally looks the same; but I'm going to put a little star next to my intercept and slope estimates here to indicate that they may be different values than what we saw when x_1 was coded one for females and zero for males. Let me ask you this: can you figure out, based on the previous regression results and the information we have, what the values of the intercept and slope would be if we reversed the coding and made one males and zero females, based on the same data? See if you can figure this out; I will show how to do this in detail in the Additional Examples section as well. Let's look at another example we used in the first term and bring it back: comparing the mean length of stay between patients in our Kaggle dataset from the Heritage Health System. 
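If you'd like to experiment with the recoding question yourself, here's a small sketch of the general pattern. I've deliberately used made-up group values (not the arm-circumference numbers) so the exercise above stays an exercise; the code fits the same toy data under both 0/1 codings and compares the results.

```python
import numpy as np

# Made-up outcome values for two groups, A and B (illustrative only).
y = np.array([5.0, 7.0, 10.0, 12.0])
group_is_b = np.array([0, 0, 1, 1])  # coding 1: x1 = 1 for group B
group_is_a = 1 - group_is_b          # coding 2: x1 = 1 for group A

def fit_simple(y, x1):
    """OLS fit of y-hat = b0 + b1 * x1; returns (b0_hat, b1_hat)."""
    X = np.column_stack([np.ones_like(x1, dtype=float), x1])
    coefs, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coefs

b0_b, b1_b = fit_simple(y, group_is_b)  # intercept = mean of A, slope = B minus A
b0_a, b1_a = fit_simple(y, group_is_a)  # intercept = mean of B, slope = A minus B

print(round(b0_b, 2), round(b1_b, 2))  # 6.0  5.0
print(round(b0_a, 2), round(b1_a, 2))  # 11.0 -5.0
# Reversing the 0/1 coding flips the sign of the slope, and the new intercept
# equals the old intercept plus the old slope.
```

The fitted means for the two groups are identical under either coding; only which group plays the role of "reference" changes.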
We'll look at the mean length of stay for all who had been hospitalized in the year 2007, comparing the mean for older patients, those who were greater than 40 years old, to the mean for those who were less than or equal to 40 years old. I'll show these means out to two decimal places: the estimated mean for the subgroup less than or equal to 40 years old was 2.74 days, as compared to 4.87 days for the group older than 40 years, so a large difference in average length of stay between these two groups. To put this in a regression context, we can use an equation of the form y hat equals b naught hat plus b one hat times x_1, where y hat is the estimated mean length of stay and x_1 equals one for persons greater than 40 years old and zero for persons less than or equal to 40 years old. The result for these data, coming from the computer, is that our estimated mean length of stay is equal to the intercept, 2.74, plus 2.13 times our predictor x_1. So again, x_1 is one for older persons, and for that group the estimated mean is 2.74 days plus 2.13 days times one. When x_1 equals zero, for the younger group, the mean is simply y hat equals 2.74: we'd take the slope of 2.13 but multiply it by zero, so it does not show up in the equation for those in the reference group, the persons with value zero, those less than or equal to 40. So this intercept is simply the mean length of stay for that reference group, persons less than or equal to 40 years old. The slope of our predictor when coded this way is 2.13 days, indicating that patients older than 40 have lengths of stay 2.13 days longer, on average, than the group less than or equal to 40; this is the mean difference in length of stay for persons greater than 40 years old compared to persons less than or equal to 40 years old. Of course, the sum of these two things, 2.74 days plus 2.13 days, equals 4.87 days, which is the mean length of stay for persons greater than 40 years old. 
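Plugging both values of the binary predictor into the fitted equation reported above (intercept 2.74, slope 2.13) recovers the two group means, which a quick sketch can confirm:

```python
# Fitted equation from the lecture: y-hat = 2.74 + 2.13 * x1,
# where x1 = 1 for patients older than 40 and 0 otherwise.
b0_hat, b1_hat = 2.74, 2.13

los_younger = b0_hat + b1_hat * 0  # reference group: <= 40 years old
los_older = b0_hat + b1_hat * 1    # comparison group: > 40 years old

print(los_younger)          # 2.74 days: the intercept alone
print(round(los_older, 2))  # 4.87 days: intercept plus slope
```

Again, the intercept is the reference-group mean and the slope is the mean difference, so intercept plus slope gives back the older group's mean of 4.87 days.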
As with the arm circumference data, we could have coded these x's in the opposite way, making the reference group the group that was greater than 40 years old, with x_1 taking a value of one for persons less than or equal to 40 years old. We'd get different values of the intercept and slope, but the overall story would remain the same. So, in general, simple linear regression is a method for estimating the relationship between the mean value of a continuous outcome y and a predictor x_1 via a linear equation of the form y hat, which represents the estimated mean of y, equals an estimated intercept plus an estimated slope times x_1, our predictor. When x_1 is binary, the slope b one hat estimates the mean difference in y for the group with x_1 equals one compared to the group with x_1 equals zero, and the intercept b naught hat is the estimated mean of y for the group with x_1 equals zero. There are two more sections where we'll revisit this equation, when our predictor is multi-categorical and when our predictor is continuous, and the same themes will emerge in those sections as well.