From the bivariate graphing examples that we've covered, we filled in the left side of our graphing decisions flow chart. Each example showed a situation in which our response variable was categorical. Let's talk now about the right side of our flow chart, when the response variable is quantitative.

We'll now change our research question, using an example from the GapMinder data set. Here, we're interested in the association between the percent of the population living in urban settings within each country and the country's rate of Internet use, that is, the percent of people with access to the World Wide Web. Below, you can see a full description of these variables from the GapMinder Codebook. For this research question, both the response and explanatory variables are quantitative. A bar chart would not work here. The graph of choice would be a scatterplot.

A scatterplot, by definition, is a graph of plotted points that shows the relationship between two quantitative variables. In a scatterplot, each observation is plotted according to its values on the explanatory and response variables. This scatterplot shows a sample of 11 observations according to the relationship between height and weight. In the lower left-hand side of the graph, we see plotted individuals with relatively low height and weight. In the upper right-hand portion, we see individuals with relatively high height and weight.

Returning to the GapMinder data set, let's see how we can use SAS to examine the relationship between the percent of the population living in urban settings and the rate of Internet use. Since we're using a different data set, we'll begin with a new program. It begins with the standard LIBNAME statement. Then we're going to call in the GapMinder data set at the beginning of the data step. We end the data step by sorting according to the unique identifier. In this case, the unique identifier is country. Next come the PROC UNIVARIATE and VAR statements.
These statements let us examine the central tendency and spread, or variability, of both urban rate and Internet use rate. Then the program ends with a RUN statement. We can see that for urban rate, the mean percent of the population living in urban settings is about 57%. The standard deviation is about 24%, suggesting that there's quite a bit of variability from country to country in terms of the population living in urban settings. For Internet use rate, on average, about 35.6% of the population across these countries has access to the World Wide Web. Again, with a standard deviation of 27.8, there seems to be quite a bit of variability from country to country.

But is there a relationship between these two variables? We can explore this question visually with a scatterplot. SAS produces scatterplots with the GPLOT procedure. The code is PROC GPLOT; PLOT internetuserate*urbanrate; where internetuserate is the quantitative response variable and urbanrate is the quantitative explanatory variable. The two variable names are joined by an asterisk, and the PLOT statement ends with a semicolon.

To characterize the relationship that we see in this scatterplot, it can be helpful to draw a line of best fit through the observations as a way of trying to determine how the dots line up. That is, do they seem to line up in a positive or negative direction, or with a positive or negative slope? An increasing slope, as we have here between urban rate and Internet use rate, indicates that the relationship is positive. That is, an increase in one of the variables seems to be associated with an increase in the other.

Here's another example from GapMinder, exploring the relationship between income per person in each country and Internet use rate. Again, if considering a linear pattern, the relationship seems to be positive. That is, higher income is associated with higher Internet use rate. The strength of the relationship in a scatterplot is determined by how closely the data points follow the form.
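For reference, the program walked through above might look roughly like the following. This is a sketch, not the course's exact file: the library name `mydata`, the placeholder path, the data set name `gapminder`, and the working data set name `new` are all assumptions. Only the variable names `urbanrate` and `internetuserate` and the `PROC GPLOT; PLOT` syntax come from the lecture. The `SYMBOL1` statement is an optional addition that asks SAS/GRAPH to overlay a fitted regression line, which is one way to get the line of best fit mentioned above.

```sas
/* Hypothetical library name and placeholder path: point this at the
   folder that holds your own copy of the GapMinder data set. */
libname mydata "/path/to/course/data" access=readonly;

/* Data step: read the GapMinder data into a working data set.
   The data set names here are assumed, not taken from the lecture. */
data new;
  set mydata.gapminder;
run;

/* Sort by the unique identifier, which is country. */
proc sort data=new;
  by country;
run;

/* Center and spread of both quantitative variables. */
proc univariate data=new;
  var urbanrate internetuserate;
run;

/* Optional: plot dots and overlay a fitted linear regression line. */
symbol1 value=dot interpol=rl;

/* Scatterplot: response (vertical axis) * explanatory (horizontal axis). */
proc gplot data=new;
  plot internetuserate*urbanrate;
run;
quit;
```

Swapping `urbanrate` for the income variable in the PLOT statement would give the second scatterplot discussed above.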
In this scatterplot, the data points follow the linear pattern quite closely. This is an example of a very strong relationship. In this other scatterplot, the points also follow the linear pattern, but much less closely. Therefore, we can say that this is a weaker relationship.

The form of the relationship is its general shape. When identifying the form, we try to find the simplest way to describe the shape of the scatterplot. There are many possible forms. As we saw, a positive, or increasing, relationship means that an increase in one of the variables is associated with an increase in the other. A negative, or decreasing, relationship means that an increase in one of the variables is associated with a decrease in the other, as shown in this central scatterplot. Not all relationships can be classified as either positive or negative. Further, if you can't plausibly put a line through the dots, if the dots are just an amorphous cloud of specks on the graph, then there may be no relationship.

>> For various reasons, the scatterplot is sometimes limited in its ability to allow us to evaluate a relationship visually.

>> Here's a scatterplot for income per person by rate of HIV among 15 to 49 year olds. Since most countries have a low HIV rate per 100 people, the dots on this scatterplot seem to clump in the lower left-hand corner of the graph. So to try to get a better sense of whether or not there is a relationship between these two variables, we would try to categorize, or group, the explanatory variable, income. If we run PROC UNIVARIATE for the variable income per person, we can use the quantiles table to determine the 25th, 50th, and 75th percentiles, or quartiles. These will allow us to divide the countries into four ordered groups according to income per person. We need to add the appropriate data management syntax to the program within the data step in order to create these categories. This new variable will be called income group.
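The grouping step just described could be sketched as follows. The lecture reads the three cutoffs off the PROC UNIVARIATE quantiles table and types them into the data step. Because those cutoff values are not given here, this sketch swaps in a different route: it has PROC MEANS write the quartiles to a small data set and reads them back in. The names `new`, `cutoffs`, `q1`, `q2`, `q3`, and `incomeperperson` are assumptions, and a frequency check on the new variable is included at the end.

```sas
/* Write the 25th, 50th, and 75th percentiles of income per person
   to a one-row data set (variable name incomeperperson is assumed). */
proc means data=new noprint;
  var incomeperperson;
  output out=cutoffs(drop=_type_ _freq_) p25=q1 median=q2 p75=q3;
run;

/* Data management: create four ordered income groups. */
data new(drop=q1 q2 q3);
  if _n_ = 1 then set cutoffs;   /* q1-q3 are retained on every row */
  set new;
  if incomeperperson = . then incomegroup = .;        /* keep missing as missing */
  else if incomeperperson <= q1 then incomegroup = 1; /* lowest 25% */
  else if incomeperperson <= q2 then incomegroup = 2;
  else if incomeperperson <= q3 then incomegroup = 3;
  else incomegroup = 4;                               /* highest 25% */
run;

/* Check the distribution of the new variable. */
proc freq data=new;
  tables incomegroup;
run;
```

If you prefer to follow the lecture literally, replace `q1`, `q2`, and `q3` in the IF/ELSE chain with the numbers from your own quantiles table and drop the PROC MEANS step.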
PROC FREQ and TABLES statements can be added so we can examine the distribution of this new variable. After the program has been saved and run, we can see the distribution for income group. The four ordered groups we created show that there are 47 countries in the lowest income group, the lowest 25%. There are 48 countries in the next 25%, 49 in the next, and 47 countries in the highest 25% in terms of income.

>> With this new categorical explanatory variable, we're now ready to create the last type of bivariate graph, that is, the categorical to quantitative bar chart.

>> The code we will use is PROC GCHART; VBAR incomegroup / DISCRETE TYPE=MEAN SUMVAR=hivrate; where incomegroup is the categorical explanatory variable and hivrate is the quantitative response variable. The VBAR statement ends with a semicolon. In this bar chart, while we can see clear differences in HIV rate based on income per person within countries, the relationship does not seem to be linear. Although we might have expected a negative linear relationship, that is, increases in HIV rate with decreases in income group, you can see in this graph that income group two falls outside of this pattern.

We've worked through each type of bivariate, or two-variable, graph, highlighting when and how each should be used to visualize the relationship. Now let's just very briefly summarize. When visualizing a categorical to categorical relationship, we use a bar chart with explanatory categories on the x-axis and the proportion of our response variable on the y-axis. When visualizing a categorical to quantitative relationship, we use a bar chart with explanatory categories on the x-axis and the mean of our response variable on the y-axis. When visualizing a quantitative to quantitative relationship, we use a scatterplot, in which each observation is displayed according to the values of the explanatory and response variables.
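As a concrete reminder of the categorical to quantitative case, the bar chart code read out above can be sketched like this. It assumes a working data set, here given the hypothetical name `new`, that already contains the `incomegroup` and `hivrate` variables. The VBAR options are the ones named in the lecture.

```sas
/* Bar chart of mean HIV rate within each income group.
   DISCRETE  : treat each value of incomegroup as its own bar
   TYPE=MEAN : bar height is a mean rather than a count
   SUMVAR=   : the quantitative response variable to average */
proc gchart data=new;
  vbar incomegroup / discrete type=mean sumvar=hivrate;
run;
quit;
```

Without DISCRETE, SAS/GRAPH treats a numeric chart variable as continuous and bins it into midpoint ranges, so the bars would not necessarily line up with the four groups.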
>> Use these basic guidelines, as well as the graphing decisions flow chart, to visualize the relationships between your own variables of interest.