In this lesson we're going to discuss similarity assessment, because many methods are based on assessing the similarity between features and between objects. Objects are pretty much made up of different features. For example, the shape of the eyes, the color of the hair and the surface of the face are all different features.
Similarity is the opposite of distance, so whether we measure a similarity or a distance is really equivalent. High similarity is the same as low distance, and low similarity is the same as high distance. Normalization is usually applied before calculating similarities so that the variables are on the same scale. For example, we can have all variables within [0, 1] or within [-1, 1]. The method used to calculate the distance may be different depending on whether you calculate it between numeric variables, nominal variables, ordinal variables or mixed types of variables.
If we look at the distance between numeric variables, we see here the most used distance for numeric variables, which is the Euclidean distance, and its formula is here. Each point has several coordinates: 1, 2, up to p in this case. You calculate the difference between the two points for each coordinate, one by one. You square the differences, add them up and calculate the square root. This is an example here with two points. We have xi1 and xi2 here, and we have xj1 and xj2 here. So, to calculate the distance, I would calculate (xi1 - xj1) squared plus (xi2 - xj2) squared, and then calculate the square root. The distance between i and j is simply the length of this line. That is the distance measure we use every day when we measure distances in miles, kilometers or meters, for example. Actually, we can use any distance function.
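The Euclidean distance and the normalization step described above can be sketched in a few lines of Python. This is only an illustration: the function names `min_max_normalize` and `euclidean_distance` are hypothetical, and min-max rescaling into [0, 1] is assumed as the normalization, since the lecture does not say which one is used.

```python
import math


def min_max_normalize(values):
    """Rescale a list of numbers into [0, 1] (assumed normalization)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: a constant variable is mapped to all zeros.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def euclidean_distance(x_i, x_j):
    """Square the coordinate differences, add them, take the square root."""
    if len(x_i) != len(x_j):
        raise ValueError("both points need the same number of coordinates")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x_i, x_j)))


# Two points with two coordinates each, as in the example on the slide.
point_i = [0.0, 0.0]
point_j = [3.0, 4.0]
print(euclidean_distance(point_i, point_j))  # 5.0
```

Normalizing each variable first keeps a variable with a large range from dominating the sum of squared differences.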
And there are many more than the Euclidean distance. A distance is simply a mathematical function that has the following properties. The distance between two points is positive or 0, and it is 0 when the two points are the same. The distance between i and j is the same as the distance between j and i. And then there is the triangle inequality, which says that the distance between i and j is less than or equal to the distance between i and k plus the distance between k and j. This is the property we use when we say that the straight line is the shortest path between two points. We can also use other measures, like the correlation coefficient, or any other measure of similarity or dissimilarity. The distance measures dissimilarity.
Next, the distance between nominal variables. We start with binary variables, which may have two values, for example 0 and 1. This table here, similar to a confusion matrix, compares objects i and j. a counts the variables that match in the sense that they have 1 as a value in both objects. b counts the variables that have 1 in i and 0 in j. c counts the other type of mismatch, 0 in i and 1 in j. And d is again a match, between a 0 and a 0. In total there are p variables. What we do next depends on whether the binary variable is symmetric. Symmetric means that the 1 and the 0 have the same importance. For example, if it's gender coded as 0 and 1, it doesn't matter which is which, so it is called a symmetric variable. In this case the distance counts how many 0-1 or 1-0 mismatches we have, which is b + c, divided by the total number of variables, a + b + c + d, which is the same as p. If the binary variables are asymmetric, we use a coefficient called the Jaccard coefficient. In this case the similarity is 1 - (b + c) / (a + b + c). So in the distance for asymmetric variables we don't count d, which is the matches between a 0 and a 0.
That is because, when the variable is asymmetric, a 1 is what matters. For example, in a diagnosis, 1 means the patient has the disease and 0 means the patient doesn't have the disease, and in this case we consider the 0 as not as important as the 1. So we do not count d in the denominator. My distance would be (b + c) / (a + b + c), and my similarity would be 1 minus that distance, which is also a / (a + b + c). So we count only the matches between a 1 and a 1 as similarity. For non-binary nominal variables, we would calculate the distance between i and j as the number of mismatches divided by the number of variables.
Now, ordinal variables can be processed by mapping each value into a number. For example, say my ordinal variable has values a, b, c, and a is less than b, which is less than c. That's what we said about ordinal variables: each value can be compared to the other ones, knowing whether it's greater or smaller. So a will be mapped to 1, b to 2, etc. Once it is mapped to a numeric variable, you calculate the distance as you would for a numeric variable.
Now, the distance between objects with mixed types of variables. A particular object will have different features. Some may be numeric, some may be ordinal, some may be categorical, that is, nominal. In that case you apply the distance measure according to the type of each feature in the objects, and you add up these distances. So the final distance is calculated over all the features of the objects. In that way you can calculate the similarity between any two objects. Thank you for your attention.
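The distances for binary and nominal variables discussed in this lesson can be sketched in Python as follows. The function names are hypothetical and chosen only for illustration, and returning a distance of 0 when neither object has any 1 is an assumption, since the lecture does not cover that case.

```python
def binary_counts(obj_i, obj_j):
    """Return (a, b, c, d): the 1-1, 1-0, 0-1 and 0-0 counts for i and j."""
    if len(obj_i) != len(obj_j):
        raise ValueError("both objects need the same number of variables")
    a = b = c = d = 0
    for v_i, v_j in zip(obj_i, obj_j):
        if v_i not in (0, 1) or v_j not in (0, 1):
            raise ValueError("binary variables must be 0 or 1")
        if v_i == 1 and v_j == 1:
            a += 1
        elif v_i == 1 and v_j == 0:
            b += 1
        elif v_i == 0 and v_j == 1:
            c += 1
        else:
            d += 1
    return a, b, c, d


def symmetric_distance(obj_i, obj_j):
    """Mismatches over all p = a + b + c + d variables."""
    a, b, c, d = binary_counts(obj_i, obj_j)
    return (b + c) / (a + b + c + d)


def asymmetric_distance(obj_i, obj_j):
    """Mismatches over a + b + c, so the 0-0 matches (d) are ignored."""
    a, b, c, _ = binary_counts(obj_i, obj_j)
    if a + b + c == 0:
        return 0.0  # Assumption: no 1 in either object means distance 0.
    return (b + c) / (a + b + c)


def jaccard_similarity(obj_i, obj_j):
    """1 minus the asymmetric distance, which equals a / (a + b + c)."""
    return 1.0 - asymmetric_distance(obj_i, obj_j)


def nominal_distance(obj_i, obj_j):
    """Number of mismatches divided by the number of variables."""
    if len(obj_i) != len(obj_j):
        raise ValueError("both objects need the same number of variables")
    mismatches = sum(1 for v_i, v_j in zip(obj_i, obj_j) if v_i != v_j)
    return mismatches / len(obj_i)


i = [1, 1, 0, 0]
j = [1, 0, 1, 0]
print(binary_counts(i, j))       # (1, 1, 1, 1)
print(symmetric_distance(i, j))  # 0.5
```

With the same two objects, the asymmetric distance is 2/3 instead of 1/2, because the single 0-0 match no longer counts toward the denominator.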