Here is another question about confounding. I start with a simple question about an association between an explanatory and a response variable: is the incidence of coronary heart disease greater among men who drink coffee than among men who don't? My response variable is coronary heart disease, and my explanatory variable is a history of coffee drinking. If I find a significant association in my data, I will also want to evaluate whether there are other variables that might confound, or explain, the relationship. It strikes me that some people who drink lots of coffee do so while also smoking cigarettes. So I would like to evaluate whether smoking is a confounder in the relationship between coffee drinking and coronary heart disease.

>> This relationship between coffee drinking and coronary heart disease is the one that we're testing. But we also want to partial out, or remove, smoking from that association. We want to see if the relationship between coffee drinking and coronary heart disease is still significant after we account for smoking. Here's a Venn diagram that illustrates how our multivariate models will handle this question. Coronary heart disease is our response variable, and coffee drinking is our explanatory variable. Our basic question is: do we know something about the presence or absence of coronary heart disease by knowing the level of coffee drinking in our sample? But we may also know or believe that smoking is related to both coronary heart disease and coffee drinking. So we want to include smoking in our model as a possible confounder in the relationship between coffee drinking and coronary heart disease. Other terms used to describe a potentially confounding variable in a statistical model include control variable, covariate, third variable, and lurking variable. When we're looking at the association between coffee drinking and coronary heart disease, the overlap you see in the Venn diagram is what we're testing.
We're asking: is that overlap significant? Are the variables significantly associated? When we add the potential confounder of smoking, we're asking: are coffee drinking and coronary heart disease still significantly associated after we partial out the portion of their overlap that can be accounted for by smoking? Because smoking is associated with both coronary heart disease and coffee drinking, there's a part of the association between coffee drinking and coronary heart disease that can be accounted for by smoking, the highlighted area in the Venn diagram. What we want to do is mathematically partial that out. When we run multivariate models, we're partialing out the portion of the association between the explanatory and response variables that can be accounted for by that overlap with the third variable.

>> For this course we will only be discussing two types of multivariate models: multiple regression, where our response variable is quantitative, and logistic regression, where our response variable is binary.

>> That is, a two-level categorical variable.

>> The question of when a third, fourth, or fifth variable in our multivariate model is a confounder is strategically important. If the variable is a confounder, meaning that once we include it in the statistical model the association of interest is no longer statistically significant, then we can conclude that our original variables had no real relationship. Testing for confounding variables with multivariate models is vital for identifying true, statistically significant associations, that is, real relationships between variables in our research.

>> If we had run a model with maternal age and birth order predicting Down syndrome, birth order would have been significantly associated with Down syndrome in that first model. Once we added maternal age to the model as a potential confounder, the association between birth order and Down syndrome would no longer be significant.
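The idea of partialing out a third variable can be sketched numerically with a partial correlation. Here is a minimal Python sketch on simulated data (all variable names and numbers are made up for illustration, not real study data): smoking drives both coffee drinking and heart disease risk, so coffee and heart disease are correlated overall, but the correlation that remains after controlling for smoking is near zero.

```python
import math
import random

def corr(a, b):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    sab = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    sa = math.sqrt(sum((x - ma) ** 2 for x in a))
    sb = math.sqrt(sum((y - mb) ** 2 for y in b))
    return sab / (sa * sb)

random.seed(0)
n = 2000
# Simulated data: smoking is the common cause of both variables.
smoking = [random.gauss(0, 1) for _ in range(n)]
coffee = [s + random.gauss(0, 1) for s in smoking]   # smoking drives coffee drinking
chd = [s + random.gauss(0, 1) for s in smoking]      # smoking drives heart disease risk

r_xy = corr(coffee, chd)       # raw association: coffee vs. heart disease
r_xz = corr(coffee, smoking)
r_yz = corr(chd, smoking)

# Partial correlation of coffee and heart disease, controlling for smoking.
partial = (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz**2) * (1 - r_yz**2))

print(f"raw correlation:     {r_xy:.2f}")   # clearly positive
print(f"partial correlation: {partial:.2f}")  # near zero once smoking is removed
```

In this simulated setup the raw association is entirely accounted for by smoking, which is exactly the scenario a confounder check is designed to catch. A multiple regression accomplishes the same partialing-out for several variables at once.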
>> So we now know that we will use multiple regression to evaluate multiple explanatory variables and/or potential confounders when predicting a quantitative response variable. So how does a linear regression analysis work? Let's start with a simple example.

>> We impose our causal model on observational data by selecting our explanatory and response variables, denoted by X and Y and placed on the x- and y-axes of a bivariate graph. Let's return to a graph that we made in Course 2, Data Analysis Tools, using the GapMinder dataset. Here we visualize the association between a country's Internet use rate and the percent of its population that lives in an urban setting in a scatter plot. We place the explanatory variable on the x-axis and the response variable on the y-axis. So our research question is: is the rate of urbanization associated with the rate of people who use the Internet? We also ran a Pearson correlation, which, as covered in the second course, Data Analysis Tools, is used to test the association between two quantitative variables. And we find a pretty strong, positive, and significant linear association between these two variables, with a Pearson correlation coefficient of 0.61.

>> In order to test this model, our first goal is to determine the equation of the best-fit line, the line we drew in our graph that shows the best linear fit between our two variables of interest. As you may recall from high school algebra, the equation of a line is usually written as Y = mX + b, where X and Y are the variables on those respective axes, m is the slope of the line, and b is the Y-intercept, the point where the line crosses the y-axis. In our model, we know that Internet use rate is our Y, or response, variable, and urban rate is our X, or explanatory, variable. So we need to determine our slope and our intercept in order to define this best-fitting line.
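The least-squares slope and intercept have closed-form solutions: m = Σ(x−x̄)(y−ȳ) / Σ(x−x̄)², and b = ȳ − m·x̄. A minimal Python sketch, using a handful of made-up (urban rate, Internet use rate) pairs rather than the actual GapMinder data:

```python
import math

# Hypothetical (urban rate %, Internet use rate %) pairs -- illustrative only,
# not the real GapMinder values.
urban = [20, 35, 50, 65, 80, 90]
internet = [10, 25, 30, 55, 70, 85]

n = len(urban)
mx, my = sum(urban) / n, sum(internet) / n

# Sums of squares and cross-products around the means.
sxy = sum((x - mx) * (y - my) for x, y in zip(urban, internet))
sxx = sum((x - mx) ** 2 for x in urban)
syy = sum((y - my) ** 2 for y in internet)

m = sxy / sxx                      # slope of the least-squares best-fit line
b = my - m * mx                    # intercept: the line passes through (x-bar, y-bar)
r = sxy / math.sqrt(sxx * syy)     # Pearson correlation coefficient

print(f"internet = {m:.2f} * urban + {b:.2f}  (r = {r:.2f})")
```

Statistical software produces these same m and b estimates (plus standard errors and p-values); the point here is only that the best-fit line is fully determined by the means, variances, and covariance of the two variables.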