>> Wouldn't it be nice if it were possible not only to describe the relationship between two categorical variables, but also to assess objectively whether one variable really influences the other, that is, whether they're independent? I think that would be very useful. And the good news is, the method you can use for this analysis is not very complex. It's called the chi-square test for independence. In this video, I will explain how it works. Let's look at this table. It represents counts based on a random sample of still life paintings from three art periods. The variables are the period in which the painting was created and the type of object that was painted. And the question for which this table was created was whether different objects were being painted in different art periods. Or, stated differently, whether the type of object and the art period were independent. The null hypothesis for such a question is that the variables are independent, and the alternative hypothesis is that they are not. It's noteworthy that there's no way to formulate a one-sided hypothesis here. If the variables were independent, we could calculate the joint frequencies by multiplying the marginal frequencies and then dividing by the overall sample size. Let's do that here. So if these expected frequencies deviate a lot from the observed frequencies, that points to a discrepancy between reality and what we would expect under the null hypothesis of independence. >> For each cell, let's calculate the difference between the observed and expected frequencies. In the table with residuals, we see that there are positive as well as negative residuals in each row and in each column. This has to be the case, because the row and column totals are the same for the observed and expected counts.
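The expected-count calculation described above can be sketched in a few lines of Python. The 3x3 table below uses made-up counts, since the actual painting data from the video isn't reproduced here:

```python
# Hypothetical 3x3 contingency table (rows: art periods, cols: object types).
# These counts are invented for illustration, not the video's data.
observed = [
    [20, 10, 10],
    [10, 20, 10],
    [10, 10, 20],
]

n = sum(map(sum, observed))                        # overall sample size
row_totals = [sum(row) for row in observed]        # marginal counts per row
col_totals = [sum(col) for col in zip(*observed)]  # marginal counts per column

# Under independence: expected[i][j] = row_total[i] * col_total[j] / n
expected = [[r * c / n for c in col_totals] for r in row_totals]

# Residuals: observed minus expected, cell by cell
residuals = [[o - e for o, e in zip(orow, erow)]
             for orow, erow in zip(observed, expected)]
```

Because the expected counts are built from the same marginals as the observed counts, each row and column of `residuals` sums to zero, which is why positive and negative residuals must appear together.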
So now we need a mechanism to turn the information about observed and expected counts into a statistic that can be used to judge whether the residuals, on the whole, are so unlikely that the null hypothesis should be rejected. This equation provides what we need. It calculates the chi-square statistic for a contingency table. According to this equation, for every cell i in the contingency table we calculate the observed minus the expected count, square the result, divide that by the expected count, and then sum these values over all cells. Let's calculate the chi-square statistic for this particular case: calculating observed minus expected per cell, squaring that value, dividing by the expected count, and then summing all the values. This results in a value of 14.5. The chi-square statistic follows a chi-square distribution. It's a distribution with only one parameter: the degrees of freedom. That parameter completely determines the shape of the distribution. Here you see the chi-square distribution with 1, 2, 4, and 8 degrees of freedom. The distribution is always positive, which makes sense if you think about it: the formula for the chi-square statistic can never produce negative values, because the term in the numerator is squared. You can also see that with a higher value for the degrees of freedom, the distribution becomes less skewed, but also wider, and it moves to the right. In fact, the degrees of freedom parameter equals the mean of the distribution. So how do we select the right degrees of freedom for our case? In a table with r rows and c columns, there are r minus 1 times c minus 1 degrees of freedom. In our table with three rows and three columns, r minus 1 is 2, and c minus 1 is 2, so we have four degrees of freedom. We can use a probability table or software to find the p-value associated with 14.5 for a chi-square distribution with four degrees of freedom. It turns out to be 0.006.
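The statistic and degrees of freedom can be sketched directly from that recipe. Again the table is hypothetical, so the statistic comes out differently from the 14.5 in the video:

```python
# Hypothetical 3x3 table (invented counts, not the painting data).
observed = [
    [20, 10, 10],
    [10, 20, 10],
    [10, 10, 20],
]

n = sum(map(sum, observed))
row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
expected = [[r * c / n for c in col_totals] for r in row_totals]

# chi^2 = sum over all cells of (observed - expected)^2 / expected
chi2 = sum((o - e) ** 2 / e
           for orow, erow in zip(observed, expected)
           for o, e in zip(orow, erow))

# degrees of freedom: (r - 1) * (c - 1)
dof = (len(observed) - 1) * (len(observed[0]) - 1)
```

For this table the statistic works out to 15 with 4 degrees of freedom. To get the p-value in software rather than from a probability table, `scipy.stats.chi2.sf(chi2, dof)` gives the upper-tail probability.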
This is very small, so we reject the null hypothesis of independence. Painters did paint different objects in their still life paintings during the periods we studied. As with other parametric tests, the chi-square statistic is better described by the chi-square distribution as the sample size, that is, the total number of cases in the cross table, increases. The minimum sample-size requirement, however, is stated not in terms of the table's total, but at the cell level: the sample size should be such that the expected count in each cell is at least five. I hope you took the following away from this video. Possible dependence between two categorical variables can be assessed by first summarizing the joint counts in a contingency table, and then using the marginal counts to calculate the expected count for each combination of the two variables. The expected joint counts calculated in this way are what you would expect if the two categorical variables were independent. The total relative difference between the observed and expected counts forms the chi-square statistic via this equation. You find the p-value for the chi-square statistic using the chi-square probability distribution. It has only one parameter, the degrees of freedom. For a contingency table with r rows and c columns, the degrees of freedom parameter equals r minus 1 times c minus 1. The chi-square distribution is always positive, and the degrees of freedom parameter equals the mean of the distribution. The sampling distribution of the chi-square statistic gets closer to the chi-square distribution as the sample size increases, and the approximation is good when each expected cell count is at least five.
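The at-least-five rule of thumb can be checked directly from the marginals before running the test. A minimal sketch (the function name is my own, not from the video):

```python
# Rule of thumb: the chi-square approximation is reliable when every
# expected cell count (row total * column total / n) is at least 5.
def expected_counts_ok(observed, minimum=5):
    n = sum(map(sum, observed))
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    return all(r * c / n >= minimum
               for r in row_totals for c in col_totals)

print(expected_counts_ok([[20, 10], [10, 20]]))  # True: every expected count is 15
print(expected_counts_ok([[1, 1], [1, 10]]))     # False: smallest expected count is below 1
```

When the check fails, common remedies are collecting more data or merging sparse categories so that every expected count reaches five.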