Let's answer the first question first, about structural small worlds. We'll use the symbol L to denote the median shortest path length between node pairs. So, if I have a node pair i, j and the shortest path distance between them is dij, I look at all the node pairs and their corresponding shortest path distances, histogram them, and take L as the median. And I want L to be small. Of course, L will trivially be small if there are only a few nodes, so really I want to say L is small relative to the number of nodes, say n. More precisely, I want L to grow like a slow function of the number of nodes. How slow? For example, a logarithmic function: then even exponential growth in n, after taking the log, becomes just linear. So what we now want to do is reverse engineer (we'll comment on the utility and pitfalls of reverse engineering the topology or functionality of a network toward the end of the lecture) and build an explanatory model for this observation about social networks: that L is often a slow function of the number of nodes. But whatever explanatory model we have, it cannot just be a tree growing outward like this, because we know that violates another observation made empirically in social networks: there is a lot of transitivity, quantified for example by the clustering coefficient of a graph, which measures the degree of triad closure. We're going to use C to denote this metric, the clustering coefficient. I just want to highlight that C, the clustering coefficient of a graph, has nothing to do with the cluster density p that we talked about in the last lecture on the contagion model. Okay? The clustering coefficient of a graph is not the cluster density of a subset of nodes. So what is the clustering coefficient?
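To make the definition of L concrete, here is a minimal sketch in pure Python (the function names and the adjacency-dict representation are my own, not from the lecture): compute hop distances by breadth-first search, collect the distance of every node pair, and take the median.

```python
# Sketch: L = median shortest-path length over all node pairs.
# Helper names (bfs_distances, median_path_length) are illustrative only.
from collections import deque
from statistics import median

def bfs_distances(adj, src):
    """Hop distance from src to every reachable node, via BFS."""
    dist = {src: 0}
    q = deque([src])
    while q:
        u = q.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return dist

def median_path_length(adj):
    """Median of d(i, j) over all unordered node pairs (connected graph)."""
    nodes = list(adj)
    d = []
    for i, u in enumerate(nodes):
        dist = bfs_distances(adj, u)
        d.extend(dist[v] for v in nodes[i + 1:])
    return median(d)

# A 6-node ring: pair distances are 1, 2, or 3, and the median is 2.
ring = {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}
print(median_path_length(ring))  # prints 2
```

The question in the lecture is then how this quantity grows as you add nodes to the graph while keeping the generative rule fixed.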
Remember, we're trying to quantify the following observation: in many social networks, if a, b, c forms a connected triple like this, then there's a good chance that it also closes the triangle. Okay, that's why it's called triad closure. So, we want to look at the number of triangles versus the number of connected triples. And we want this metric to be normalized, so it always lies between zero and one for all graphs. In particular, if the graph is a triangle, we want C to be one. In order to accomplish this normalization, we're going to divide the number of connected triples by three. Why? Because of the different ways we count connected triples versus triangles. If there is a link here, then the triangle exists; we call that triangle a, b, c, and we count it only once. But when we count the number of connected triples, we count the links a-b and b-c: when those two links coexist, there is a connected triple, whether or not there's a link a-c. If there is one, we'll also have a triangle; if not, we don't. But either way, this is one connected triple. And then we can also take the links a-c and c-b: that is another connected triple. And the link b-a with the link a-c: that's another connected triple. So, when we count connected triples, we count each triangle's node set basically three times, so we divide by three to take that over-counting into account. And therefore C is now guaranteed to lie between zero and one. For just a triangle, it will be one. For just a connected triple, which is a line here, it will be zero. Now the question is, for a bigger, general graph, what will C look like? We want C to be big and L to be small for our explanatory model. So, here's another graph, slightly bigger: four nodes and four links. Let's see what the clustering coefficient, quantifying triad closure, is for this graph. Well, for this graph, the number C is actually easy to compute.
Alright, C equals the number of triangles over the number of connected triples divided by three. How many triangles are there? There's only one triangle: nodes one, two, three. What about connected triples here? Well, let's see how many there are. Two, one, three: that's a connected triple. Okay? One, two, three: that's another connected triple. One, three, two: that's another. One, two, four: that's another. And three, two, four: that's another. So, there are actually five connected triples. Of course, some of them we basically counted more than once, hence the normalization by three. And that means the clustering coefficient of this graph is 1 / (5/3) = three-fifths, which is a pretty big number considering that C must lie between zero and one. Okay? And that satisfies our intuition, because we can see that this graph is reasonably triad-closed. Alright. So, what kind of model can give us what we want: big C, small L? First attempt: let's think about a random graph, also called a Poisson random graph, also called the Erdos-Renyi model of a graph. This model is actually heavily studied and has uses in a variety of subjects, but we will not talk about random graphs beyond this slide, because for what we want to do in this course, a random graph is neither an accurate depiction of reality nor a useful design tool. But let's still look at this failed attempt to explain small worlds using random graphs. What is a random graph? Given a set of n nodes, look at all the possible links; each pops up with probability p. Flip a loaded coin with probability p of heads: if it shows heads, the link exists; otherwise it doesn't. So this is a probabilistic object. And let's look at what L and C would be if you indeed constructed such a graph.
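The counting just described can be checked mechanically. Below is a sketch (pure Python; the `transitivity` helper name and dict-of-sets graph encoding are my own) that counts, for each center node b, every pair of its neighbors as one connected triple, and checks whether that pair is itself linked. Counting a triangle once per center counts it three times, which is exactly the divide-by-three normalization from the lecture.

```python
# Sketch: clustering coefficient C = triangles / (connected triples / 3).
from itertools import combinations

def transitivity(adj):
    triples = 0      # paths a-b-c centered at b, one per unordered pair {a, c}
    triangles3 = 0   # each triangle counted 3 times (once per center node)
    for b in adj:
        for a, c in combinations(sorted(adj[b]), 2):
            triples += 1
            if c in adj[a]:
                triangles3 += 1
    return triangles3 / triples   # equals 3 * triangles / triples

# The 4-node, 4-link example: triangle {1, 2, 3} plus node 4 hanging off node 2.
g = {1: {2, 3}, 2: {1, 3, 4}, 3: {1, 2}, 4: {2}}
print(transitivity(g))  # prints 0.6, i.e. three-fifths
```

One triangle counted three times over five connected triples gives 3/5, matching the hand count on the slide.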
Now L is actually small, as we can intuitively see, because between any node pair there is a chance p that they are directly connected. But the clustering coefficient C is also small. Okay? When we look at C, we're now talking about expected C and expected L, because this is a probabilistic object. And without going into the details, it suffices to show the intuition that C for our random graph is the following: small c over n minus one. n is the number of nodes; small c is the expected degree of a node. Why is that the expression for the expected clustering coefficient, big C, of a random graph? Well, because the probability that two nodes are connected is c/(n-1) in a random graph. Okay? c is how many connections you have, n-1 is how many nodes are out there other than yourself, and the ratio is the probability of the two nodes being connected. This holds independent of whether these two nodes are already indirectly connected through another node or not; that's not how the graph was constructed. So the chance that these two nodes are connected is exactly c/(n-1). So, suppose we look at a graph of, say, a million people, each with 1,000 friends. That's a lot already, much bigger than Dunbar's number. Still, the clustering coefficient is only on the order of ten to the minus three, which is way too small. We know C will be much bigger than that, orders of magnitude bigger, in a real social network. So, a random graph can explain small L, but not big C. What about the extreme on the other end? A regular graph, say, a regular ring graph. What's a regular ring graph? Let's draw a ring here. Okay? It's just a line with its two ends connected, and there are nodes living on it. In this case, there are eight nodes, so n equals eight.
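This claim is easy to check by simulation. The sketch below (my own helper names; a fixed seed for reproducibility) builds an Erdos-Renyi graph by flipping a loaded coin for each possible link, then measures the clustering coefficient and compares it against c/(n-1). For modest n the measured value hovers right around the formula.

```python
# Sketch: expected clustering coefficient of a random graph is c/(n-1).
import random
from itertools import combinations

random.seed(0)

def er_graph(n, p):
    """Erdos-Renyi: each of the n(n-1)/2 possible links appears with prob. p."""
    adj = {i: set() for i in range(n)}
    for i, j in combinations(range(n), 2):
        if random.random() < p:
            adj[i].add(j)
            adj[j].add(i)
    return adj

def transitivity(adj):
    """C = triangles / (connected triples / 3), as defined in the lecture."""
    triples = triangles3 = 0
    for b in adj:
        for a, c in combinations(sorted(adj[b]), 2):
            triples += 1
            triangles3 += c in adj[a]
    return triangles3 / triples

n, c = 400, 20            # expected degree c = p * (n - 1)
p = c / (n - 1)
C = transitivity(er_graph(n, p))
print(C, c / (n - 1))     # measured C is close to c/(n-1), about 0.05 here
```

Scaling this up mentally: with a million nodes and expected degree 1,000, the same ratio is on the order of 10^-3, far below what real social networks show.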
Just as a random graph is parametrized by the number p, the probability of a link appearing or not, a regular ring graph is parametrized by a number c, a small c. Now, this small c is not the expected degree we just used; here it is the actual, deterministic degree. Okay? It's the number of neighbors each node has, the degree of each node. Given that this is a regular graph, you can rotate it however you want and it still looks the same; this is the degree for all the nodes. Now, in this particular ring graph, c is two, because each node has two neighbors. The degree is two for everyone. I can add a few more links. Let's keep c an even number. Okay? I can also connect each node to its two-hop neighbors, so there's a direct link going out over there now. Okay? Similarly, I'll add all of these. Now c is no longer two; c is four, because every node actually has four neighbors. Pick this node: one neighbor, two, three, four. Okay, don't worry too much about exactly the way the lines are drawn. You can easily count: one, two, three, four. This node has four neighbors, and since the graph is symmetric and regular, every node has four neighbors: degree four, eight nodes. Of course, you can make c even bigger. If we insist that c be even, then with eight nodes the largest an even c can get is six. In fact, because there is an even number of nodes here, each node also has one diametrically opposite node; adding that link too would give a full mesh network, where every node is connected to every other node. Well, so in this case, what would C be and what would L be? First of all, C is now much better. Okay? Let's compute the clustering coefficient C: the number of triangles and the number of connected triples. It turns out the number of connected triples is easy to compute.
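The construction just described can be written down directly. This sketch (the `ring_graph` name is mine) links each of n nodes on a ring to its c/2 nearest neighbors on each side, for even c:

```python
# Sketch: a regular ring graph with n nodes, each of deterministic degree c.
def ring_graph(n, c):
    """Each node links to its c/2 nearest neighbors clockwise and the
    c/2 nearest counter-clockwise."""
    assert c % 2 == 0, "this construction assumes an even degree c"
    adj = {i: set() for i in range(n)}
    for i in range(n):
        for k in range(1, c // 2 + 1):
            adj[i].add((i + k) % n)
            adj[i].add((i - k) % n)
    return adj

g = ring_graph(8, 4)
print(all(len(nbrs) == 4 for nbrs in g.values()))  # prints True: degree 4 everywhere
```

With n = 8 and c = 6 the same code gives the largest even degree possible here: every node reaches all others except its diametric opposite.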
If you just look at each node, because the graph is symmetric, we can count triples and triangles centered around each individual node. The number of connected triples centered at a node is easy to count: it's just c choose two. You've got c neighbors, and when you choose two of them, you form a connected triple. It might be a triangle, it might not be a triangle, but it is a connected triple; so the number of connected triples centered at a node is just (c choose 2). Of course, don't forget you have to divide by three for normalization. Now, what about the number of triangles centered at a given node? Let's think about that. Well, to form a triangle, you need to go up one step, then go up another step, and then you have to come back; otherwise, you don't close it and you don't have a triangle. So you have to take two steps out and then be able to come back in one shot. How many such choices are there? Well, in order to be able to come back in one hop, you can only go out as far as c/2; farther than that, you won't be able to come back. And out of those c/2, you choose two, because you'll be making two steps. So the number of triangles is (c/2 choose 2) and the number of connected triples is (c choose 2). What is this expression? It is [(1/2)(c/2)(c/2 - 1)] divided by [(1/2)c(c - 1)/3], which simplifies to 3(c - 2) / (4(c - 1)). This is the clustering coefficient, quantifying triad closure, on a regular ring graph. You can see that the number of nodes doesn't come into play here, because the graph is entirely symmetric; only c, the deterministic degree of each node, matters. Well, for a sanity check, let's see what happens if c is two. We know that if c is two, we're just talking about a plain ring: there are no triangles, only connected triples. Indeed, when c is two, this expression becomes zero, as you can see. Okay?
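As a check on the derivation, we can brute-force count triangles and triples on actual ring graphs and compare against 3(c - 2) / (4(c - 1)). This sketch reuses the ring construction and the triple-counting idea from earlier in the lecture (helper names are mine):

```python
# Sketch: verify C = 3(c-2)/(4(c-1)) on regular ring graphs by direct count.
from itertools import combinations

def ring_graph(n, c):
    adj = {i: set() for i in range(n)}
    for i in range(n):
        for k in range(1, c // 2 + 1):
            adj[i].add((i + k) % n)
            adj[i].add((i - k) % n)
    return adj

def transitivity(adj):
    """C = triangles / (connected triples / 3)."""
    triples = triangles3 = 0
    for b in adj:
        for a, other in combinations(sorted(adj[b]), 2):
            triples += 1
            triangles3 += other in adj[a]
    return triangles3 / triples

for c in (4, 6, 8):
    measured = transitivity(ring_graph(100, c))
    formula = 3 * (c - 2) / (4 * (c - 1))
    print(c, measured, formula)  # the two values agree for each c
```

Note the measured value is independent of n (as long as n is comfortably larger than c), exactly as the symmetry argument predicts.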
So triad closure doesn't exist; the clustering coefficient, big C, is zero. What if c is four now, like this graph? Actually, the way we drew it, it's hard to see intuitively what the clustering coefficient should be. There's another way to draw the same graph, which you can easily verify for yourself, like this. These are the nodes. And now you can clearly see the clustering coefficient should be one-half, because had I also joined these dotted lines, which don't actually exist here, I would have closed all the triads. But they don't exist; only these parts exist. So, intuitively, this redrawing of the same picture shows that the clustering coefficient should be one-half. And indeed, if you plug c equals four into the formula, what you get is a clustering coefficient of one over two. Okay. Now, what if this small c becomes really big? When c goes to infinity, the expression simplifies to three over four. In other words, on a regular ring graph the clustering coefficient can only go up to three-quarters, but that's already very, very big. Compared to, say, ten to the minus three for our Poisson random graph, this is way bigger. And indeed, in practice c would be at least four, and therefore the clustering coefficient is between one-half and three-fourths. This is a small dynamic range, but it is of the right magnitude. Good. We got C checked off, okay? We got a large C on the regular ring graph, which is not too surprising, because with a big enough c, you quickly start closing the triangles. Now the question is, what about L? That's where the problem is with using a regular ring graph to explain small worlds: it doesn't have a small L. Indeed, if you want to go from this node to that node, you actually have to hop at least through here and here. But remember, this is only with eight nodes.
When you have many more nodes, when n is big on the regular ring, then no matter how much you grow c, over quite a large range of degrees you still need to hop slowly, locally. You do not have a single long-range link anywhere in this graph; it has only short-range links. Hm, okay? And with only short-range links, you cannot possibly have a small L. So, here's the problem. A random graph has a small L (very good) but also a small C (not good). A regular graph, say a ring graph, has a large clustering coefficient C (very good) but also a large L (not so good). We want a large C, like the regular graph, but a small L, like the random graph. So, maybe a hybrid. How about we combine these two ideas: start with a regular graph and then add some random links? That's precisely the Watts-Strogatz model that explains small worlds.
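The hybrid idea can be sketched in a few lines. For simplicity this version adds random shortcut links on top of the ring rather than rewiring existing links (a shortcut-addition variant in the spirit of Watts-Strogatz; the helper names and the fixed seed are my own). Since the ring links are all kept, clustering stays high, while even a handful of long-range shortcuts collapses the average path length:

```python
# Sketch: ring graph + random long-range shortcuts = small L, large C.
import random
from collections import deque

random.seed(1)

def ring_graph(n, c):
    adj = {i: set() for i in range(n)}
    for i in range(n):
        for k in range(1, c // 2 + 1):
            adj[i].add((i + k) % n)
            adj[i].add((i - k) % n)
    return adj

def add_shortcuts(adj, p):
    """For each node, with probability p, add one link to a random node."""
    nodes = list(adj)
    for i in nodes:
        if random.random() < p:
            j = random.choice(nodes)
            if j != i:
                adj[i].add(j)
                adj[j].add(i)
    return adj

def avg_path_length(adj):
    """Average hop distance over all ordered node pairs, via BFS."""
    total = count = 0
    for s in adj:
        dist, q = {s: 0}, deque([s])
        while q:
            u = q.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    q.append(v)
        total += sum(dist.values())
        count += len(dist) - 1
    return total / count

n, c = 200, 6
L_ring = avg_path_length(ring_graph(n, c))
L_ws = avg_path_length(add_shortcuts(ring_graph(n, c), 0.2))
print(L_ring, L_ws)  # the shortcuts shrink the average path length
```

Adding links can only shorten shortest paths, so the shortcut graph's L is strictly smaller than the ring's here, while its triad closure remains essentially that of the underlying ring.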