So, now let's change it up a little bit and replace the binary predictor we had in the previous section with a categorical predictor. We'll see the ideas are an extension of what we did in the last section. So here, we'll look at simple linear regression with a categorical predictor. Again, I want you to understand, and we'll build on this understanding throughout the sections, that linear regression provides a framework for estimating means and mean differences. Here, I'd like you to be able to interpret the estimated slopes and intercept from the simple regression model when we have a nominal categorical predictor. Sometimes regression scenarios include predictors that are not continuous, not binary, but multi-categorical and nominal. Some examples include a subject's race or ethnicity, for example, white, African-American, Hispanic, Asian, or some other self-identified category; or a person's city of residence, Baltimore, Chicago, Tokyo, or Madrid for example, as in a multi-center study with four sites. So, how can this type of situation be handled in a regression framework? Namely, how can we set up our x, our predictor, to model these categories? Let's look at an example here, an article published in 2012 in the Journal of the American Medical Association, now called simply JAMA. Data were collected on 800 US academic physicians, including their yearly salary, and the purpose of this article was to look at sex-based differences in salaries, adjusting for other characteristics that may differ between male and female academic physicians and may also be related to sex.
So, potential confounders; we'll pick up on this again when we get into the realm of multiple regression. One of the pieces of information the authors collected, a potential predictor of salary that may also be related to sex, is the geographical region of the United States where the academic job is located. There were four regions used by the authors of this article: the West, Northeast, South, and Midwest. In trying to understand whether region was a potential confounder, the authors wanted to see if it was associated with the salaries they were modeling as their outcome in the regression including sex. So we want to ask: do average salaries differ by geographic region, and if so, what is the magnitude of these differences? Could this analysis be done by a linear regression relating salaries to region? How can a predictor that takes on four categories be represented in a regression model? One approach is to arbitrarily give each region a numerical value on a continuum: make X1 equal to one for physicians from the West, two for physicians from the Midwest, three for the South, and four for the Northeast, for example. Then we can fit a simple linear regression where the estimated mean salary y hat is a linear function of this x value, an ordered numerical version taking the values one, two, three, or four. This is actually a terrible idea. The coding here is completely arbitrary, as it always is with categorical or binary predictors; we could just as easily have assigned X1 equal to one for physicians from the Midwest, two for physicians from the South, and so on. In fact, there are several permutations of ways we can code this as one, two, three, or four for the four regions.
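A quick sketch can make the problem concrete. The salary numbers below are made up purely for illustration (they are not the JAMA data); the point is only that the least-squares slope changes when we permute the arbitrary region codes:

```python
# Sketch: why arbitrary ordinal coding of regions is a problem.
# The salaries here are hypothetical illustrative numbers, not the article's data.
regions = ["West", "Midwest", "South", "Northeast"]
salaries = {"West": 194000, "Midwest": 199000, "South": 194000, "Northeast": 192000}

def slope(coding):
    """Least-squares slope of salary on the numeric region code."""
    xs = [coding[r] for r in regions]
    ys = [salaries[r] for r in regions]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx

coding_a = {"West": 1, "Midwest": 2, "South": 3, "Northeast": 4}
coding_b = {"West": 1, "South": 2, "Midwest": 3, "Northeast": 4}  # South/Midwest swapped
print(slope(coding_a), slope(coding_b))  # different slopes from the same data
```

Same four group salaries, two equally "reasonable" codings, two different slopes: there is no single interpretable beta one hat.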
It turns out the value of beta one hat will depend on the arbitrary coding, and the overall results and statements about the means for these four regions will not be equivalent across coding schemas. So, we want something that's robust and will give us the same answers for the four regions regardless of how we code our predictor, and coding it arbitrarily as one, two, three, or four and using that as a single predictor will not work; it will give us different answers depending on the coding schema. Further, this coding puts a pretty strict structure on the mean salary differences between regions: it assumes they are incremental. In the coding schema we set out here, the difference between average salaries for physicians in the South and physicians in the West is a two-unit difference in our x value, so that difference is forced to be twice as large as the difference between physicians in the Midwest, where X1 equals two, and the West, where X1 equals one, a one-unit difference. So, not only would we get a different answer if we recoded this, but regardless of the coding, it assumes the salaries increase or decrease incrementally with an increasing value of region, which is completely arbitrary depending on who coded it. It doesn't work at all to take this approach. Approach two, however, is robust and allows us to get the same overall answers regardless of how we do the coding. This preferred approach is to designate one region, one of the four in this case, as a reference region. For example, we'll make the West our reference region, and then make binary indicators, or Xs, for each of the three other regions. So, one way to do this would be, again, to make the West the reference region, and then make indicators for whether the physician is from the Midwest, South, or Northeast.
So, create three variables: X1 will take on a value of one for physicians from the Midwest and zero if they're from any of the other three regions, X2 will take on a value of one if the physician is from the South and zero if they're from any other region, and X3 will take on a value of one if they're from the Northeast and zero if they're from any of the other regions. We'll show that this approach does not force a structure on the y-x relationship that depends on the coding, and that the overall results will be equivalent regardless of what reference group is used for the comparisons. So, we're going to fit a regression model that looks like this: y hat equals beta naught hat plus beta one hat X1 plus beta two hat X2 plus beta three hat X3. This looks like a complicated linear regression, but we're only estimating four values of y hat. Each of the three slopes, beta one hat, beta two hat, and beta three hat, estimates the mean salary difference between the region that has the corresponding x value of one and the reference region, the western states, and the intercept beta naught hat is the estimated mean salary for physicians from the West, our reference group. So, for example, for physicians from the Midwest, the group whose value of X1 equals one and whose values of X2 and X3 equal zero, the model predicts the estimated mean of y, the mean salary, equal to the intercept plus beta one hat times the value of X1 for this group, which is one, plus beta two hat times the value of X2 for this group, which is zero, plus beta three hat times the value of X3 for this group, which is also zero. In other words, when the dust settles, the estimated mean salary for physicians in the Midwest is beta naught hat, the intercept, plus the slope beta one hat. For physicians in the reference group, the West, the values of all three Xs are zero.
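Here's a minimal sketch of how those predictions work mechanically. The coefficient values below are placeholders, not the article's estimates; the point is that each group's predicted mean is the intercept plus at most one slope:

```python
# Sketch: predicted mean salary under the reference/indicator coding.
# b0..b3 are placeholder coefficients for illustration, not published estimates.
b0, b1, b2, b3 = 200000, 5000, -1000, -3000

# (x1, x2, x3) = indicators for Midwest, South, Northeast; West is the reference.
coding = {
    "West":      (0, 0, 0),
    "Midwest":   (1, 0, 0),
    "South":     (0, 1, 0),
    "Northeast": (0, 0, 1),
}

def y_hat(region):
    x1, x2, x3 = coding[region]
    return b0 + b1 * x1 + b2 * x2 + b3 * x3

# The intercept is the reference-group mean; each slope is added on top of it.
assert y_hat("West") == b0
assert y_hat("Midwest") == b0 + b1
```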
The model estimates that the mean salary for that group is just the intercept, because all values of x are zero, so the three slopes disappear. So, for the Midwest, the estimated mean is equal to the intercept plus the slope beta one hat, and for the western region, the average salary is equal to the intercept. Beta naught hat is the starting point, and beta one hat is what we add to that starting point to get the average salary for physicians in the Midwest. So, when we take the difference of these two, we get beta one hat: beta one hat is the difference in mean salaries for physicians in the Midwest compared to physicians in the West. You can show similarly that beta two hat is the mean difference in salaries for physicians in the South versus the same reference group, the West, and finally, that beta three hat is the mean difference in salaries for physicians in the Northeast compared to the same reference group of physicians in the West. So, here's the resulting regression equation based on the results presented by the authors in the article. These salaries are in US dollars. This says that for the reference group, the West, the average salary is $194,474. The slope for the Midwest, 4,412, is the difference in average salaries between those in the Midwest and those in the reference group, the West. So those in the Midwest make, on average, over four thousand dollars more than physicians in the West. For the South, beta two hat is equal to negative 35, indicating that physicians in the South make, on average, $35 less per annum than physicians in the reference group, the West. Finally, the slope for the Northeast is negative 2,322, indicating that physicians in the Northeast make, on average, $2,322 less than those in the West. So you might say to me, "John, well, there's three comparisons there.
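Using the coefficients reported in the article, we can recover all four estimated group means directly, a small sketch:

```python
# The fitted equation from the article (US dollars): intercept = West mean,
# slopes = mean differences from the West reference group.
b0, b_mw, b_s, b_ne = 194474, 4412, -35, -2322

means = {
    "West":      b0,        # the intercept alone
    "Midwest":   b0 + b_mw,
    "South":     b0 + b_s,
    "Northeast": b0 + b_ne,
}
print(means)
```

Each mean is just the intercept plus that region's slope, which is why the long-looking equation is only estimating four numbers.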
Each of the three non-reference regions to the same reference: Midwest to West, South to West, Northeast to West. But what if I wanted the mean difference in salaries for physicians in the Northeast compared to physicians in the Midwest? Would I have to rerun the regression and recode the Xs so that my reference group is the Midwest?" Well, you could do that, but you don't need to. Recall that for physicians in the Northeast, where X1 equals zero, X2 equals zero, but X3 equals one, the estimated salary is y hat equals the starting $194,474, the other slopes drop out, and we add negative 2,322. For the Midwest, the group where X1 equals one and the remaining Xs are zero, the estimated mean salary is the same starting $194,474 plus the slope for the Midwest, positive 4,412. So, we take the difference: average salary for the Northeast minus average salary for the Midwest. The intercepts cancel, and we're left with the slope for the Northeast, negative 2,322, minus the slope for the Midwest, 4,412. If you think about it analytically in terms of the slopes, we have the difference between the Northeast and the reference of the West, and we subtract the slope for the Midwest, which is the difference between the Midwest and the same reference of the West; the reference part cancels, and when we subtract the slopes we're left with the average difference between those in the Northeast and those in the Midwest.
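That slope arithmetic can be sketched in one line; the intercepts never enter, so no refitting is needed:

```python
# Northeast vs Midwest without rerunning the regression: both slopes are
# differences from the same West reference, so the reference cancels and
# the comparison is just the difference of the two slopes.
b_mw, b_ne = 4412, -2322           # Midwest and Northeast slopes from the article
ne_vs_mw = b_ne - b_mw
print(ne_vs_mw)                    # -6734: Northeast averages $6,734 less than Midwest
```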
So the point is, regardless of how I code these things, if I use a schema where I designate one group as the reference and give each remaining group its own indicator, then while the coding is arbitrary and could change, not only can I get any of the differences of interest regardless of what I choose for the reference group, but you can show that if you changed the coding and had a different reference group, with X1, X2, and X3 assigned to different regions, scrambled or rearranged, you would still ultimately get the same four mean estimates and the same corresponding mean differences between each of the groupings. Let's look at another example: data from the National Health and Nutrition Examination Survey, also called NHANES, a large probability-based sampling survey done in the United States every several years. This is based on the wave from 2013-2014. The data include 10,000-plus observations on persons 0-80 years old, but systolic blood pressure is only measured on people eight and older, which reduces the sample to 7,172 blood pressure measurements, the ages for which it can be measured comparably. So we have over 7,000 systolic blood pressure measurements on persons 8-80 years old, and what we want to see is how estimated average blood pressures differ, if at all, between ethnicity categories. Again, just as in the previous section, we're going to estimate the magnitude of the differences; we have not done anything with significance yet. In this section we lay out the ideas first and how to interpret the results, and then we'll get into things like confidence intervals and p-values. So we have five ethnicity categories; people identify with one of the five: Mexican-American, Hispanic, non-Hispanic white, non-Hispanic black, and, for those who don't identify with those four categories, other ethnicity.
So we've got five groups here, and the drill is the same as before: we'll make one of the five groups the reference and then code individual Xs for each of the remaining four groups. I chose to make Mexican-Americans the reference; you could choose differently and rerun the regression. Then I made four indicators, X1 through X4, to indicate being of Hispanic origin up through identifying as other ethnicity, respectively. So, we're going to fit this regression model, and again it looks complicated, a long equation with one intercept and four slopes. But keep your eyes on the prize: this is only estimating five means for five different groups. The intercept estimates the mean systolic blood pressure when all Xs are zero, which is for the reference group, Mexican-Americans, and each of the slopes estimates the mean difference in systolic blood pressure between whatever group is coded as one for that particular X and the reference group of Mexican-Americans. So again, X1 is coded as one for those who identified as Hispanic and zero otherwise; all other ethnicities have a zero, so whatever this value is will be the estimated mean difference in systolic blood pressure between those who identified as Hispanic and the reference group, those who identified as Mexican-American. The resulting model we get from the computer gives us the following: we can see that across the board the group with the lowest average systolic blood pressure is the reference group, the Mexican-Americans, because the slopes, the differences in average blood pressure for the other four groups compared to that same reference, are all positive.
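The indicator construction itself is purely mechanical. Here's a sketch of building the four Xs from an ethnicity label, with the first listed category serving as the reference (labels abbreviated for illustration):

```python
# Sketch: building p - 1 indicator columns from a categorical variable.
# The first listed category (Mexican-American) is the reference group.
categories = ["Mexican-American", "Hispanic", "Non-Hispanic white",
              "Non-Hispanic black", "Other"]
reference = categories[0]

def dummies(value):
    """Return the (x1, x2, x3, x4) indicator tuple for one observation."""
    return tuple(int(value == c) for c in categories[1:])

assert dummies("Mexican-American") == (0, 0, 0, 0)    # reference: all zeros
assert dummies("Non-Hispanic black") == (0, 0, 1, 0)  # x3 = 1
```

With p = 5 groups we get p - 1 = 4 indicator columns, matching the four slopes in the model.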
So, some questions; we will review the answers in the additional exercises in more detail, but I'll go through and give the answers here so that you can think about them and see if you come up with the same thing. What is the difference in mean systolic blood pressure between non-Hispanic blacks and Mexican-Americans? The one difficulty is that you have to go back and keep a crib sheet of what the coding is, so you know which X refers to which ethnicity. In this case, X3 was a one for people who identified as non-Hispanic black, and zero if they identified with any of the other four groups. So this slope, 4.4, is the estimated mean difference in systolic blood pressure between those who identified as non-Hispanic black and those in the reference group, Mexican-American; the answer is the single slope, 4.4 millimeters of mercury. What if we wanted the difference in mean systolic blood pressure for those identifying as non-Hispanic black, the same group, but instead of comparing to the reference of Mexican-Americans, we compare to non-Hispanic whites? That group is coded by X2: a one if non-Hispanic white and zero otherwise. Using the same logic we showed in the previous example, this is not a difference between one group and the reference; it's the difference between two of the coded groups, but we can represent it as the difference in their respective differences from the same reference group. So it would be the slope for non-Hispanic blacks minus the slope for non-Hispanic whites, 4.4 millimeters of mercury minus 3.4, or one millimeter of mercury, and we'll review that in more detail in the additional exercises. What is the mean difference between Hispanics and non-Hispanic whites? Well, again going back to our cheat sheet, X1 takes on the value one for Hispanics and zero for the other four groups.
So the difference between Hispanics and Mexican-Americans in average blood pressure is 1.3, and the difference between non-Hispanic whites and the same reference group is 3.4. So again, we can take the difference in the slopes: 1.3 minus 3.4, which is negative 2.1 millimeters of mercury. So, Hispanics on average have systolic blood pressures 2.1 millimeters of mercury lower than non-Hispanic whites. So again, just to reiterate what we did in this section and the previous one: simple linear regression is a method for estimating the relationship between the mean of an outcome y and a predictor X1, or, in the case of multi-categorical predictors, more than one X (X1, X2, etc.), via a linear equation. This can be done when X is nominal-categorical; it can also be done when X is ordinal-categorical, and we'll get to that, and why we might want to do it, later. The same idea holds: designate one category as the reference group and make separate binary Xs for all other categories. So if we have p groups, we'll need p minus one Xs. When we had four regions we needed three Xs, and when we had five ethnicities we needed four Xs. So at this point you're saying, "John, you promised I'd get to ask questions in this lecture, and I'm asking now: why are we doing this? All we've done thus far is reframe something we did in the first term, which was estimating group means and mean group differences, when we had two groups and when we had more than two groups, and maybe we've made it more complicated by putting it in an equation framework. So, what's the benefit of this?" I'm glad you asked, and you should be asking that question, because I agree with you: thus far it doesn't look like we're doing anything new; we're just reframing what we knew from before.
I have two answers. First, in the next section I'm going to show you that we're no longer limited to predictors that are binary or categorical in nature; we can actually allow our predictor of interest to be continuous under the right set of circumstances, given the relationships we observe in our data. So, under the right assumptions, where there is evidence of a linear association between the mean of our outcome and a predictor, we can exploit that relationship and gain something from it, and I'll explain more of that in the next section; that allows us to do something new that we hadn't done before, where our predictor is continuous. Secondly, and perhaps more importantly, we're going to see in the latter part of this course that these models can be expanded to include multiple predictors at once. So we'll be able to get potentially better estimates of the mean of an outcome by taking into account more than one factor at a time. In the first term we could say, "Hey, how does systolic blood pressure depend on sex? How does it depend on age, if age has been categorized? How does it depend on some other measure that's been categorized?" But never could we say, "How does it depend on these three things taken together, where some of them are continuous?" So, we will be able to expand these models to take into account multiple predictors at once, which will be very powerful both for better estimating outcomes and for making adjusted estimates in the presence of confounding.