Now that we have seen what association analysis is, let's go over the association analysis process in more detail. After this video, you will be able to define the terms support and confidence, describe the steps in association analysis, and explain how association rules are formed from item sets. Let's review the steps in association analysis. They are: create the item sets, then identify the frequent item sets, and finally, generate the rules. We will continue with this example dataset. There are five transactions in the dataset, each with a set of items purchased together. The goal is to come up with rules describing associations between items. The first step is to create item sets. Item sets of different sizes need to be created. We will color code the items so that each one is easier to pick out from the transactions table. We start out with just 1-item sets, that is, sets with just one item. The left table is the dataset of transactions. The right table contains the 1-item sets that can be created from this dataset. As each item set is created, we also need to keep track of the frequency at which this item set occurs in the dataset. This is referred to as the support for the item set and is calculated by dividing the number of times the item set occurs in the dataset by the total number of transactions. This is what is in the Support column in the right table. For example, eggs, the last item in the right table, occurs just once in the dataset, in transaction two, so its support is 1/5, or one fifth. The item set with diaper occurs in all transactions, so its support is 5/5, or 1. The support for each item set will be used to identify frequent item sets in the next step. Specifically, the support is used to prune, or remove, item sets that do not occur frequently.
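As a minimal sketch of the support calculation described above, the snippet below counts how often each 1-item set occurs and divides by the number of transactions. The transactions table from the slide is not part of this transcript, so the dataset here is an assumed reconstruction chosen to match the counts mentioned in the video; the item coke and the exact contents of each transaction are hypothetical, as is the helper name support.

```python
from fractions import Fraction

# Assumed transactions: reconstructed to agree with the counts stated in the
# video (eggs only in transaction 2, diaper in all five). "coke" is hypothetical.
transactions = [
    {"bread", "milk", "diaper"},
    {"bread", "diaper", "beer", "eggs"},
    {"milk", "diaper", "beer", "coke"},
    {"bread", "milk", "diaper", "beer"},
    {"bread", "milk", "diaper", "coke"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return Fraction(hits, len(transactions))

# Support for every 1-item set that can be created from the dataset
items = sorted(set().union(*transactions))
one_item_support = {item: support({item}, transactions) for item in items}

for item, s in one_item_support.items():
    print(f"{item}: {s}")
```

Exact fractions are used instead of floating point so that a value like 3/5 compares cleanly against a threshold of 3/5.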
Suppose the minimum support threshold is set to 3/5. Looking at the 1-item sets table, we can remove any item set with a support of less than 3/5. These item sets are highlighted in pink. They will be removed before the 2-item sets are created. The final 1-item sets are then the item sets with bread, milk, beer and diaper. For the 2-item sets, we only consider items that were in the 1-item sets that were not pruned. The 2-item sets are shown in the right table. We, again, need to keep track of the support for these item sets, just as we did with the 1-item sets. For example, for the last item set, with beer and diaper, we see, by looking at the left table, that beer and diaper occur together three times, in transactions two, three and four. So its support is 3/5. Again, we need to prune item sets with low support, the ones highlighted in pink in the 2-item sets table. Those would be the item set with bread and beer and the item set with milk and beer. The remaining 2-item sets and 1-item sets are then used to create the 3-item sets. Let's look now at creating 3-item sets. The only 3-item set that has a support value greater than or equal to the minimum support is the one shown in the right table, namely the item set with bread, milk and diaper. The second step in association analysis is to identify the frequent item sets. But note that the process that we just described for creating item sets already identifies frequent item sets. A frequent item set is one whose support is greater than or equal to the minimum support. So by keeping track of the support of each item set as it is being created and removing item sets with low support, we are already identifying frequent item sets. For our example, the frequent 1-, 2- and 3-item sets are shown here. Now that we have identified the frequent item sets, the last step is to generate the rules to capture associations that we see in the data. Let's first define some terms we'll need to discuss association rules.
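Before turning to those terms, the create-and-prune process walked through above can be sketched in code. This is a simplified level-wise search, not a full implementation of any particular algorithm: at each size it tries every combination of the items that survived the previous level and keeps those meeting the minimum support. The dataset is an assumed reconstruction (the slide tables are not in the transcript, and coke is a hypothetical item), and the function names are illustrative.

```python
from fractions import Fraction
from itertools import combinations

# Assumed transactions, chosen to match the counts stated in the video.
transactions = [
    {"bread", "milk", "diaper"},
    {"bread", "diaper", "beer", "eggs"},
    {"milk", "diaper", "beer", "coke"},
    {"bread", "milk", "diaper", "beer"},
    {"bread", "milk", "diaper", "coke"},
]
MIN_SUPPORT = Fraction(3, 5)

def support(itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    hits = sum(1 for t in transactions if itemset <= t)
    return Fraction(hits, len(transactions))

def frequent_itemsets(min_support):
    """Build k-item sets only from items that survived pruning at size k-1."""
    items = sorted(set().union(*transactions))
    frequent = {}  # maps frozenset of items -> support
    k = 1
    while items:
        level = {}
        for combo in combinations(items, k):
            candidate = frozenset(combo)
            s = support(candidate)
            if s >= min_support:  # prune item sets with low support
                level[candidate] = s
        if not level:
            break
        frequent.update(level)
        # Only items still present in a surviving item set are used next
        items = sorted(set().union(*level))
        k += 1
    return frequent

freq = frequent_itemsets(MIN_SUPPORT)
for itemset, s in sorted(freq.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), s)
```

On this assumed dataset the search keeps four 1-item sets, four 2-item sets, and the single 3-item set with bread, milk and diaper, which agrees with the walkthrough.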
The format of an association rule is shown at the top. It's written as X arrow Y and is read as if X, then Y. The X part is called the antecedent and the Y part is called the consequent of the rule. X and Y are item sets. An important term in rule generation is the rule confidence. This is defined as the support for X and Y together divided by the support for X only. So rule confidence calculates the frequency of instances to which the rule applies. Recall that the support for X is the frequency of item set X and is defined as the number of transactions containing the items in X divided by the total number of transactions. The rule confidence measures how frequently items in Y appear in the transactions that contain X. In other words, the confidence measures the reliability of the rule by determining how often "if X, then Y" is found to be true in the data. How is rule confidence used in rule generation? Association rules are generated from the frequent item sets created from the data. Each item in an item set can be used as a part of the antecedent or consequent of the rule. And you can have many ways to combine items to form the antecedent and consequent. So if we just simply generate rules from each frequent item set, we would end up with lots and lots of rules. Each item set with k items can generate 2 to the power k, minus 2 rules. That's a lot of rules. And the majority of those rules would not be found in the data. This is where rule confidence comes in. We can use rule confidence to constrain the number of rules to keep. Specifically, a minimum confidence threshold is set, and only rules with confidence greater than or equal to the minimum confidence are considered significant, and only those will be kept. Let's look at how this works with our example dataset. Recall that only one 3-item set was created from the transactions. That 3-item set contains the items bread, milk and diaper, as shown at the top.
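To make the rule count concrete, here is a small sketch that enumerates every way of splitting an item set into a non-empty antecedent and a non-empty consequent. The function name candidate_rules is hypothetical. For the 3-item set with bread, milk and diaper, it produces 2 to the power 3, minus 2, that is, 6 candidate rules.

```python
from itertools import combinations

def candidate_rules(itemset):
    """All (antecedent, consequent) splits where both sides are non-empty."""
    items = sorted(itemset)
    rules = []
    # Antecedent sizes run from 1 to k-1 so the consequent is never empty
    for size in range(1, len(items)):
        for antecedent in combinations(items, size):
            consequent = tuple(i for i in items if i not in antecedent)
            rules.append((antecedent, consequent))
    return rules

rules = candidate_rules({"bread", "milk", "diaper"})
for x, y in rules:
    print(f"if {', '.join(x)} then {', '.join(y)}")
print(len(rules))  # 2**3 - 2 = 6
```

The minus 2 accounts for the two splits that are excluded: the one with an empty antecedent and the one with an empty consequent.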
With this 3-item set, let's see how we can generate rules from it and determine which rules to keep and which ones to prune. Let's set the minimum confidence to 0.95. And here again is the definition for confidence. For the candidate rule if bread and milk, then diaper, we can calculate its confidence as follows. The support for both antecedent and consequent is the number of times we see bread, milk and diaper together in the data, divided by the total number of transactions. Items bread, milk and diaper appear together in transactions 1, 4 and 5, so the support is 3/5. The support for just the antecedent is the number of times we see bread and milk together divided by the total number of transactions. Items bread and milk appear together also in transactions 1, 4 and 5, so the support is 3/5. The confidence of this rule is then 3/5 divided by 3/5, which is 1, or 100%. This means that the rule is correct 100% of the time. Every time bread and milk are bought together, diaper is bought as well. For the candidate rule if bread and diaper, then milk, we calculate its confidence the same way. The support for bread, diaper and milk is 3/5 as before. Items bread and diaper appear together in transactions 1, 2, 4 and 5. So the support for the item set with bread and diaper is 4/5. Then the confidence of this rule is 3/5 divided by 4/5, which is 0.75, or 75%. Since the minimum confidence is 0.95, or 95%, the first rule is kept and the second rule is removed from consideration. There are several algorithms for association analysis. Each uses a different set of methods to make frequent item set creation and rule generation efficient. The more popular algorithms are Apriori, FP Growth and Eclat. As a summary, we just looked at the steps in association analysis in more detail. We saw how item sets can be created from a dataset, how frequent item sets can be identified, and how association rules can be created from frequent item sets and pruned using rule confidence.
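The confidence calculation and pruning from the example can be sketched as follows. As before, the transactions are an assumed reconstruction that matches the counts stated in the video (the slide tables are not in the transcript, and coke is a hypothetical item), and the function names are illustrative. This is plain enumeration over one item set, not an implementation of Apriori, FP Growth or Eclat.

```python
from fractions import Fraction
from itertools import combinations

# Assumed transactions, chosen to match the counts stated in the video.
transactions = [
    {"bread", "milk", "diaper"},
    {"bread", "diaper", "beer", "eggs"},
    {"milk", "diaper", "beer", "coke"},
    {"bread", "milk", "diaper", "beer"},
    {"bread", "milk", "diaper", "coke"},
]
MIN_CONFIDENCE = Fraction(95, 100)

def support(itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return Fraction(hits, len(transactions))

def confidence(antecedent, consequent):
    """Support for X and Y together divided by the support for X only."""
    return support(set(antecedent) | set(consequent)) / support(antecedent)

# Generate every candidate rule from the frequent 3-item set, keep the confident ones
itemset = ("bread", "diaper", "milk")
kept = []
for size in range(1, len(itemset)):
    for antecedent in combinations(itemset, size):
        consequent = tuple(i for i in itemset if i not in antecedent)
        c = confidence(antecedent, consequent)
        status = "keep" if c >= MIN_CONFIDENCE else "prune"
        print(f"if {', '.join(antecedent)} then {', '.join(consequent)}: {c} ({status})")
        if c >= MIN_CONFIDENCE:
            kept.append((antecedent, consequent, c))
```

The two rules worked through in the video come out as 1 and 3/4, so only if bread and milk, then diaper passes the 0.95 threshold. The confidences of the other four candidate rules depend on the assumed dataset and are not stated in the video.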