Welcome back, people. In addition to measuring [INAUDIBLE] differences between groups of people, we are sometimes interested in measuring associations between variables. For example, is there a relationship between the price charged for a product and the number of units sold in a given market? Does advertising affect overall sales in one way or another? In this lesson, we'll introduce two techniques for measuring association between variables, Chi-Square and correlation analysis. Which one you use depends on the scale of the variables. On the one hand, you use a Chi-Square test mostly for nominal data, and on the other hand, you use correlation analysis mostly for variables measured on interval scales.
Let's review the Chi-Square test first. The example we're going to look at here is going to determine if there is a relationship between brand preference and where someone lives. Usually this type of data is cross-tabulated, and we use that table to compute the Chi-Square test. The null hypothesis in this case is that there is no relationship between the variables. Suppose that you want to see whether where someone lives has an impact on which brand they prefer. Here, I'm going to consider a country like the US, where there are three main regions: West Coast, Midwest and East Coast. And you have three brands: Brand A, Brand B and Brand C. What you can see is that each cell represents how many people like a given brand in a given region. For example, you can see that 100 people on the West Coast responded that Brand A was their favorite, 400 people said that Brand B was their favorite, and 250 people responded that Brand C was their favorite. So overall, 750 people were surveyed on the West Coast. And you can see that we obtain the same kind of data for the Midwest and for the East Coast. From this data, we need to compute the test statistic for the Chi-Square test.
This test is based on this formula. On the left-hand side of the equation there is the Greek letter chi, squared, and on the right-hand side we have a sigma representing the sum, over i from 1 to R and over j from 1 to C, of Oij minus Eij, squared, divided by Eij. R is the number of categories in the row variable, and C is the number of categories in the column variable. In our case, R and C are each equal to three, because we have three rows and three columns. Oij is the observed number in cell ij, and that's for example 100 for Brand A on the West Coast. Then Eij is the expected number in cell ij. What does that mean here? It means the number that we should expect in that cell if brand preference did not depend on region at all. Eij is computed as XAi times XBj divided by the total number of respondents, where XAi is the number of elements in category i of the first variable, and XBj is the number of elements in category j of the second variable. So for example, Eij for the West Coast and Brand A should be equal to 750 times 700 divided by 2,400, that is, the West Coast total times the Brand A total divided by the overall total. Applying the formula for Eij, you should obtain about 218 for cell (1, 1), that is, the cell for West Coast and Brand A. However, notice that we actually observed 100 people, so we compute the difference between Oij and Eij, which is reported in this table for every cell. Then we can compute the two other quantities displayed in the formula, the squared difference and the squared difference divided by Eij, to complete the test statistic. At the end, we sum over all the cells to get the Chi-Square statistic, which gives us about 504.
Now, the next question is, is this 504 high or low? Or said differently, in statistical terms, does this Chi-Square value allow us to make an inference as to whether we should reject the null hypothesis or fail to reject it? Here again, the null hypothesis is that there is no association between where you live and the brand that you prefer. We need to look up the theoretical value of Chi-Square that we want to compare against in our test. For that you can look at a Chi-Square table, which is easily found online or in any statistics book.
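To make the computation concrete, here is a minimal sketch in plain Python of the expected counts and the sum of (Oij - Eij) squared over Eij. Only the West Coast row (100, 400, 250) comes from the lecture. The Midwest and East Coast rows are hypothetical placeholders, chosen only so that the Brand A total is 700 and the overall total is 2,400 (itself an assumption that reproduces an expected count near 218 for West Coast and Brand A). The function names are also made up for illustration, and the statistic this prints is for the illustrative table, so it will not match the 504 from the lecture.

```python
# Observed counts: rows are regions, columns are Brand A, Brand B, Brand C.
# Only the West Coast row is taken from the lecture; the other two rows
# are HYPOTHETICAL placeholders used to make the example runnable.
observed = [
    [100, 400, 250],  # West Coast (from the lecture)
    [250, 300, 250],  # Midwest (hypothetical)
    [350, 200, 300],  # East Coast (hypothetical)
]


def expected_counts(table):
    """Return E_ij = (row total i * column total j) / grand total."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]


def chi_square_statistic(table):
    """Return the sum over all cells of (O_ij - E_ij)^2 / E_ij."""
    expected = expected_counts(table)
    return sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )


expected = expected_counts(observed)
statistic = chi_square_statistic(observed)
print(f"Expected count, West Coast / Brand A: {expected[0][0]:.2f}")
print(f"Chi-square statistic for this illustrative table: {statistic:.2f}")
```

With these numbers, the expected count for West Coast and Brand A is 750 times 700 divided by 2,400, which is 218.75, far above the 100 people actually observed in that cell.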
To do that, we need some information. One piece is the degrees of freedom of the Chi-Square test, which in this case is (R-1)(C-1). R and C are both equal to 3, so that means (3-1)(3-1), which equals 2 times 2, which is 4. We use a 95% confidence level, which means that we need to look for a tail area of 1 minus 0.95, which is 0.05. When you look this up in the table, you find that the theoretical value is about 9.5. And so, what we found is that the test statistic, 504, is much higher than the theoretical value. This means that we can reject the null hypothesis; that is, there is an association in the data between where people live and the brands that they like. In other words, there is enough statistical support in the data to conclude that location and brand preference are in fact associated. Again, you don't need to remember this formula. You just need to understand the meaning of the formula, and to know that such a test can be easily found in any statistical software, or even in Excel.
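If you'd rather not read the value off a table, the lookup can be sketched in plain Python. This is only an illustration under stated assumptions: the helper names are made up, and the closed-form tail probability used here is valid only when the degrees of freedom are a positive even integer, which happens to be the case for our 4 degrees of freedom. The 504 is the approximate statistic quoted in the lecture.

```python
import math


def chi2_upper_tail_even_df(x, df):
    """Upper-tail probability P(X > x) for a chi-square variable.

    Uses the closed form exp(-x/2) * sum_{k=0}^{df/2 - 1} (x/2)^k / k!,
    which holds only when df is a positive even integer.
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("this sketch only handles positive even df")
    half = x / 2.0
    terms = sum(half ** k / math.factorial(k) for k in range(df // 2))
    return math.exp(-half) * terms


def critical_value(alpha, df, lo=0.0, hi=1000.0, tol=1e-9):
    """Find x whose upper-tail area equals alpha, by bisection."""
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if chi2_upper_tail_even_df(mid, df) > alpha:
            lo = mid  # tail area still too large, so move right
        else:
            hi = mid
    return (lo + hi) / 2.0


rows, cols = 3, 3
df = (rows - 1) * (cols - 1)   # (R-1)(C-1) = 4
alpha = 1 - 0.95               # tail area for a 95% confidence level
crit = critical_value(alpha, df)
statistic = 504                # approximate value quoted in the lecture
reject_null = statistic > crit
print(f"df = {df}, critical value = {crit:.2f}, reject null: {reject_null}")
```

In practice you would let statistical software do all of this in one call. For example, SciPy's `scipy.stats.chi2_contingency` takes the cross-tabulated counts and returns the statistic, the p-value, the degrees of freedom and the expected counts.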