Okay, so we're back, and we're going to talk now a little bit about estimating these sub-graph generation models, or SUGMs. Again, we're looking at sets of models that allow us to deal with link dependencies and allow us to have different types of sub-graphs formed. We want to capture those features, and so we want to see how we might estimate them statistically. So again, just to remind you of the form of the model: links form, triangles form. You could allow, maybe, links between blue nodes, links between yellow nodes, links between blues and yellows, triangles involving the different kinds. So there could be different probabilities for each one of these things, but each type of sub-graph pops up with some probability. These things might intersect and overlap. We observe a resulting network, and then we're trying to infer what these probabilities were. Are triangles really being formed independently of links, and so forth? So we're trying to estimate those probabilities. Okay, so there are two different approaches one might use to do this. One is that in sparse graphs these incidental interactions are going to be rare, and direct estimation turns out to be valid and consistent. In the paper we also provide an algorithm that corrects for small numbers and gives estimates for non-sparse graphs. So there are algorithms you can use to go through and try to estimate directly how many links there would have been, how many triangles there would have been, and figure out how many of those were generated incidentally, in terms of overlapping. I'm not going to go through the algorithm here; I'll refer you to the paper. But let's just talk about this sparseness. The idea is that incidentals are generated by combinations of sub-graphs that overlap.
Sparsity, basically, is a condition that I'll refer you to the paper for in its full glory. It says that different types of sub-graphs are rare enough that they're not going to interact much, so incidentals can't grow too quickly. It relies on conditions where we go through and calculate explicitly the combinatorics of different possible sub-graphs generating each other, and then bound those, and we end up with a fairly tight bound. In terms of an intuitive example, if we just go with our links and triangles, then it's enough that the probability of a link is smaller than order n^(-1/2) and the probability of a triangle is smaller than order n^(-3/2). Basically, what does that tell you? That's the situation where the typical node is involved in fewer than root-n links and fewer than root-n triangles. If you're looking at a fairly large social network, that kind of condition is fairly weak; it's going to be fairly easy to satisfy. You know, the number of friends I have on Facebook is actually fairly tiny, and it's not anywhere near the number of nodes out there, not even root n. So root n can be a fairly good bound here. Again, that limits the number of incidentals. Now, if we went through this network and wanted to estimate things, to get estimates of what those probabilities were, let's start by trying to estimate triangles. Okay, if you go through and count triangles, there were actually nine that were originally generated directly and then one incidental. If we didn't see how they were generated, we would estimate that there were ten generated on these 42 nodes. So we're trying to figure out the relative likelihood of triangles forming. Well, how many possible triangles are there? There are n nodes, 42 here, and we're choosing triples, so you can do 42 choose 3.
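As a quick sanity check on the intuition above, here is a small sketch (parameter values are illustrative, not from the lecture's data): with a link probability below n^(-1/2) and a triangle probability below n^(-3/2), the expected number of links and triangles per node stays below root n.

```python
import math

def per_node_counts(n, p_link, p_tri):
    # expected links per node: p_link times (n - 1) possible partners
    # expected triangles per node: p_tri times C(n-1, 2) possible pairs
    return p_link * (n - 1), p_tri * math.comb(n - 1, 2)

n = 10_000
p_link = n ** -0.5 / 2   # strictly inside the n^(-1/2) bound
p_tri = n ** -1.5 / 2    # strictly inside the n^(-3/2) bound
links, tris = per_node_counts(n, p_link, p_tri)
print(links, tris, math.sqrt(n))
```

Both per-node counts come out well under root n, which is what keeps incidentally generated sub-graphs rare.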
If I did my calculation correctly, that's 11,480. So what's our estimate? Our estimate is going to be the ten triangles we observed out of the 11,480 possible. And so what we end up with is that about one in a thousand of the triangles that could be there are there. That gives us an estimate of the probability of triangles, okay? And in fact, if we had been able to see what nature really did, the count should have been nine instead of ten. So the idea with sparsity is that we're not off by far in terms of our estimate. Estimation of links: what we can do now is look at how many links there are that are not in triangles. So we've got some number of links that were formed not as part of triangles. You can go through and count those; you're going to end up with 23, if I counted correctly. So you end up with 23 links not in triangles. And how many links could there have been? Well, how many pairs of nodes are there? There's n choose 2, which is 861 in this case, minus the 28 pairs that appeared in the triangles, which we don't count. So there are 833 possible links that aren't in triangles. We saw 23 of them, so the estimate is 23 over 833, about a 3% probability of any link forming. Okay? So that would be an estimate for this network. We ended up estimating the relative frequency of triangles and the relative frequency of links, and that gives us estimates for what those parameters are. And then we can test whether they're different from zero, or whether they fall in some range. We could test for homophily if we had different colored nodes, and so forth, so we can enrich that, okay? So what's the theorem in the paper? Take a sequence of sparse SUGMs, where sparsity has to be defined very carefully, in a way that puts certain bounds on the relative likelihoods of the true parameters compared to the number of nodes.
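The arithmetic in the example above is simple enough to lay out directly; this sketch just reproduces the counts from the lecture's 42-node example.

```python
import math

n = 42

# triangles: 9 generated directly plus 1 incidental were observed
observed_triangles = 10
possible_triangles = math.comb(n, 3)                  # 42 choose 3 = 11,480
p_tri_hat = observed_triangles / possible_triangles   # roughly one in a thousand

# links: count only links not in triangles, against pairs not in triangles
links_not_in_triangles = 23
pairs_in_triangles = 28
possible_links = math.comb(n, 2) - pairs_in_triangles  # 861 - 28 = 833
p_link_hat = links_not_in_triangles / possible_links   # about 3%

print(possible_triangles, p_tri_hat, possible_links, p_link_hat)
```

These two frequencies are exactly the direct estimates discussed above: observed sub-graphs divided by possible sub-graphs of each type.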
Then take the empirical frequency counts that we get by just looking at whatever the statistic is: figure out how many you actually observed and how many there could have been. The theorem is that when you do this, in this very simple way, the ratio of the estimate to the true parameter will go to one. So we get consistent estimation; we're going to zero in on the true parameters. In particular, you can also show that a normalized difference of these things is going to be a zero-mean normal random variable. So you also know something about the errors: there's a version of a central limit theorem that holds here. These estimators are consistent, and they have nice normal distributions for the differences between your estimator and the true parameters. If you look at what this D matrix looks like, it gives you the rates of convergence, and all the details are there in the paper. But the idea is that these things are very easy to estimate. You're basically just doing binomial counts: how many links were there, how many could there have been; how many triangles were there, how many could there have been. Really simple estimation techniques, and yet they give us the same kinds of information that we would otherwise try to back out of an exponential random graph family, with accurate estimation and trivially easy techniques. Okay? So let's try to see why we might be interested in these things. What's the need for these kinds of models? Why don't we just do straight block models; why bring triangles into the picture, why bring in something else? What we're going to do is go back to the Indian village data, estimate a model, use it to generate networks, and then in particular try to see how well the models that we end up fitting recreate the networks. Okay?
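The consistency claim can be illustrated with a minimal simulation. Each SUGM parameter estimate is a binomial frequency (observed sub-graphs over possible sub-graphs), so the ratio of estimate to truth concentrates around one as the number of possible sub-graphs grows. The parameter scale below is illustrative, chosen to respect the n^(-3/2) sparsity bound for triangles; this is a sketch of the statistical idea, not the paper's procedure.

```python
import math
import random

rng = random.Random(0)

def frequency_estimate(possible, p_true):
    # simulate independent formation of each possible sub-graph, then count
    observed = sum(rng.random() < p_true for _ in range(possible))
    return observed / possible

ratios = {}
for n in (40, 120):
    possible_triangles = math.comb(n, 3)
    p_true = n ** -1.5 / 2          # sparse triangle probability (illustrative)
    ratios[n] = frequency_estimate(possible_triangles, p_true) / p_true
print(ratios)
```

With the larger n the ratio of estimated to true parameter sits close to one, which is the content of the consistency theorem; the normalized error is approximately normal, per the central limit result mentioned above.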
So we just saw there are easy ways to estimate these models, including things like triangles. Is it really necessary to include the triangles? Suppose instead we worked with a block model and didn't bring triangles into the picture. Okay? So what we're going to do is estimate the SUGM based on covariates, allowing for triangles. We can then compare that to what would have happened if we just did a block model accounting for covariates. Does it do better? How well do the two do at recreating the actual networks, and does the inclusion of triangles really help us in some substantial way? Okay, so how are we going to do this? We're going to do a very simple version of categorizing things: pairs of nodes will either be similar or different, and we'll allow different probabilities of links between two similar nodes and between two different nodes. We'll say two households are in the same class if they have the same caste and the GPS distance between their homes is less than the median distance between homes, okay? So going into a village, we look at two households and ask: are they linked or not? Are they of the same caste? Is the GPS distance greater or less than the median distance? When we're looking at two households, we say they're similar if they have the same caste and less than the median distance between homes; otherwise we say they're different. So if they're either of different castes or at greater distance than the median, we put them in the different category. We're just going to make it a really simple model where we keep track of pairs of nodes and allow two probabilities: one for pairs that are similar and one for pairs that are different. And similar here means they're similar on both dimensions, caste and GPS location. Okay? So now what we can do is fit a block model, where we allow two different probabilities.
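The classification rule just described is easy to express in code. The household records below are hypothetical placeholders, but the rule is the one from the lecture: two households are "similar" if and only if they share the same caste and their GPS distance is below the median pairwise distance.

```python
from itertools import combinations
from statistics import median

# hypothetical household data: caste label and planar coordinates
households = {
    1: {"caste": "A", "xy": (0.0, 0.0)},
    2: {"caste": "A", "xy": (0.1, 0.0)},
    3: {"caste": "B", "xy": (0.0, 2.0)},
    4: {"caste": "B", "xy": (3.0, 2.0)},
}

def dist(a, b):
    (x1, y1), (x2, y2) = a, b
    return ((x1 - x2) ** 2 + (y1 - y2) ** 2) ** 0.5

pair_dist = {(i, j): dist(households[i]["xy"], households[j]["xy"])
             for i, j in combinations(households, 2)}
med = median(pair_dist.values())

def similar(i, j):
    same_caste = households[i]["caste"] == households[j]["caste"]
    close = pair_dist[(min(i, j), max(i, j))] < med
    return same_caste and close
```

Households 1 and 2 (same caste, nearby) come out similar; 3 and 4 (same caste but far apart) and 1 and 3 (different castes) come out different, matching the "both dimensions" requirement.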
The probability of a link if both nodes are in the same category, similar to each other, and the probability if they're different. And then we also fit the sub-graph generation model, where what we add in is triangles, and we allow triangles to have two different probabilities: a probability for triangles where all the people are similar, and a probability for triangles where some of the people involved are different from each other. Okay? So we'll fit the block model and fit this sub-graph generation model. Both of these are very easy to fit here, right? The block model is a special case of a SUGM where we just look at links. So we can just count up links, count up triangles, count up whether they're same or different. We're going to have four different counts, and that gives us estimates of all these things, okay? The block model just looks at links, ignoring whether they're in triangles or not; the sub-graph generation model keeps track of triangles separately from links and estimates them that way. Okay, so that's the basic estimation technique. Step one: estimate the block model, the probability of a link if the pair is same or different. For the sub-graph generation model we do the same thing, but we add in the triangle counts. And then, once we have these, the nice thing about these kinds of models is that we can generate networks back very easily. How do we generate a network? Once we have these probabilities, we can just take the set of nodes, pick pairs, flip coins, and put in links with the same-or-different probability depending on whether the pair is similar or different, and generate a network. For the SUGM, we randomly pick triangles, randomly pick links, put them in with these probabilities, and then generate networks. Okay? So we randomly generate these networks, and then we try to see whether or not they recreate the actual, original observations.
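The generation step can be sketched directly. This is a hedged illustration, not the paper's exact procedure: the `similar` function and the probability values below are toy placeholders, and triangles and links are placed independently with their similar/different probabilities, exactly as described above.

```python
import random
from itertools import combinations

def generate_sugm(nodes, similar, p_link, p_tri, rng):
    """p_link / p_tri map True (all similar) or False to a probability."""
    edges = set()
    # place triangles: each triple forms with the relevant probability
    for i, j, k in combinations(nodes, 3):
        all_similar = similar(i, j) and similar(i, k) and similar(j, k)
        if rng.random() < p_tri[all_similar]:
            edges |= {frozenset((i, j)), frozenset((i, k)), frozenset((j, k))}
    # place independent links, pair by pair
    for i, j in combinations(nodes, 2):
        if rng.random() < p_link[similar(i, j)]:
            edges.add(frozenset((i, j)))
    return edges

rng = random.Random(1)
nodes = range(30)
similar = lambda i, j: (i < 15) == (j < 15)   # toy similarity: two groups
edges = generate_sugm(nodes, similar,
                      p_link={True: 0.05, False: 0.01},
                      p_tri={True: 0.002, False: 0.0002},
                      rng=rng)
```

The block model is just the special case with both triangle probabilities set to zero, which is why comparing the two generated networks isolates what the triangles contribute.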
Okay, so here's what we get. Here's the data: how many triangles are there? How many links are there that are not in triangles? We'll call those unsupported links, links that aren't part of a triangle. And then we also have the average degree, how many isolates there are, and so forth. The red here in the first column is the actual data: 161 unsupported links, 39 triangles, an average degree of 2.3, and 55 isolated nodes. Then we can go through, and here's the block model. The block model does very well at fitting average degree; well, it should, because what it's estimating is just link probabilities. The SUGMs don't do quite as well on average degree, because they're also putting in triangles, and so the number of links is slightly off. But the number of triangles and the number of unsupported links actually match much better. In particular, the SUGM does better at matching the number of triangles and unsupported links, not surprisingly, because it's counting those things directly, and that's the way we're doing the estimation. The sub-graph generation models are not doing as well on isolates. So this last column here is a sub-graph generation model where you also allow for isolates, and if you do that, then you end up doing better on isolates while still doing as well on links and unsupported links. Okay. So far, nothing too surprising. What's interesting is to go down here and look at a series of other features of the graph that we weren't trying to fit directly, okay? Clustering: how well do we do at clustering? Well, the data actually has about a 10% clustering rate. The block model, not surprisingly, where all links are independent, doesn't pick that up.
One explanation people often give is: okay, you can get clustering in block models, because if people are in the same group, then they're more likely to have a higher density of links, and you'll pick that up. And here you do see that people are much more likely to form links if they're of the same caste and close in GPS distance, but that still doesn't come close to generating the clustering you need. Whereas once you put in the triangles, you come back and you hit it, and actually, if you pull out the isolates, you hit it right on the money. So you can match the clustering much better by allowing links to be dependent in a triangular form. Then you can also ask: what's the fraction of nodes in the giant component? And we see that, again, you do better at fitting the fraction of nodes in the giant component. The first eigenvalue of the adjacency matrix, that's how rapidly things expand: again, the sub-graph models do better than just the block model, and in particular, once you get the isolates in, you're coming pretty close. The second eigenvalue of the stochasticized matrix is a measure of homophily. Actually, once you throw the isolates into the sub-graph generation model, this moves away a little bit, but it's still fairly good. Average path length: you're going to come closer with the sub-graph generation model. So these models aren't perfect, but what they're doing is allowing you to get a little closer on some of the features. And let's actually take a look at some graphs here. So what do we have here? This is a graph of the degree distribution. The truth, the actual data, is this gray line.
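The two fit diagnostics discussed above, clustering and the giant-component fraction, are easy to compute on a generated network. This is a small self-contained sketch on a plain adjacency dict; the toy graph at the bottom is illustrative.

```python
from itertools import combinations

def clustering(adj):
    # global clustering: closed connected triples over all connected triples
    closed = total = 0
    for v, nbrs in adj.items():
        for a, b in combinations(sorted(nbrs), 2):
            total += 1
            closed += b in adj[a]
    return closed / total if total else 0.0

def giant_component_fraction(adj):
    # fraction of nodes in the largest connected component, via DFS
    seen, best = set(), 0
    for start in adj:
        if start in seen:
            continue
        stack, comp = [start], set()
        while stack:
            v = stack.pop()
            if v in comp:
                continue
            comp.add(v)
            stack.extend(adj[v] - comp)
        seen |= comp
        best = max(best, len(comp))
    return best / len(adj)

# toy graph: a triangle (1,2,3), a pendant node 4, and an isolate 5
adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}, 5: set()}
print(clustering(adj), giant_component_fraction(adj))
```

Running these on networks generated from the block model versus the SUGM is exactly the comparison in the lecture: the block model's clustering comes out near zero, while the triangle-generating model can match the data's clustering.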
The sub-graph generation model is this blue line, and the block model is the red one here. What you begin to see is that even though the sub-graph generation model didn't get the average degree exactly right, it actually matched the distribution of degrees much more accurately than the block model did. So it missed the first moment, but it was better at catching the whole spread and at matching the tails of the distribution than the block model was. So it actually did better in some dimensions of degree than the other model. When you get to clustering, well, the block model's just way off. So here's the clustering coefficient, and here's its CDF value. When you look across the distribution of clustering, the block model basically says there shouldn't be any clustering at all. The actual data has quite a bit of clustering, and the sub-graph generation model picks it up much better because it allows for these triangles. But then you can look at other kinds of things. Here's another picture of the average clustering. There are two different types of sub-graph generation models here, with different types of corrections done, so the algorithm is applying a finite-sample correction. And you can also do the block models just looking at links only, or allowing for different types of covariates, and so forth. Basically, the block models are still going to miss the clustering, no matter whether you incorporate different covariates or multiple covariates. Similarly for the eigenvalues: you get different pictures, and the block models end up not doing as well. When you look at the second eigenvalue, these are actually incredibly segregated villages, so the second eigenvalue is quite high.
And in fact, the SUGMs pick up that they're quite high, while the block models end up having a full spread on these things, even though the actual data is all very close to one in terms of level. So here are the actual data numbers, all down here, and the sub-graph generation models are picking that up while the block models aren't. So what do we have? We have better fits on these different features. You can look at the fraction in the giant component: neither of these does that well, but if you throw in the isolates, then you actually pick things up a lot better. Okay, so, what do we get? Dependencies are really important to pick up in social networks. Why? Well, that's the whole nature of social. I mean, what does social mean? It means that we're interacting, generally, more than two at a time, and that's going to generate dependencies. We need tractable models to capture and test these. Exponential random graph models are a nice form, a rich family with a long tradition in statistics, but they're not always accurately estimable. If we take a step back from those and either work in a statistic space or work with generating these sub-graphs directly, we get easy and consistent estimation; these things are nice to work with. So there are ways of dealing with these dependencies directly, and then we can go ahead and test things. We'll see some tests in the lectures coming up; in lecture 4.9 and others we'll begin to see how we can test different theories using these kinds of models, and do that explicitly. Okay, so: statistical models give us a medium to go through and estimate these things. Often we're going to need to put these models in context. Where do these dependencies come from? Are they friends of friends? Is there some social enforcement? So on top of these models, we do need some theory, some social theory or some economic theory,
to guide our thinking: we don't want to just randomly put down different types of counts into our exponential random graph models; we want to have some idea of exactly what we should be testing for. And again, when we get to lecture 4.9, we'll start with some particular theories and go ahead and test those directly using these kinds of models. Strengths of random networks: we can generate large networks with well-identified properties, and we can mimic real networks, at least in some characteristics, depending on the kinds of models we put down. So the beauty of these models is that now we can tie specific properties to specific processes, whether it's sub-graph generation, preferential attachment, meeting friends of friends, or the small-worlds processes we looked at. We get some idea of why these different kinds of things are happening in terms of the process that's there. Some of the weaknesses of these models: we're still missing the why. Why this particular model and not some other one? There we need to put some social theory or economic theory behind it. That would also allow us to say something about relevance. Are we really worried about these things, and why? Why do we worry about segregation? Because it might cause inequality, or might lead to certain kinds of outcomes. So we have to have some idea of what welfare is involved. One other thing is that a lot of the stylized facts we've talked about: small worlds, power laws, clustering, and so forth.
These are stylized facts, but often they've been established just by saying, okay, here's another network that happens to have this property. The laws are basically something that we've seen in a series of different observations, but we need something that can systematically look across and say: is something really statistically there? Do we see it consistently across different types of observations? Well, we can have some theory, and luckily now we have models that are beginning to address this kind of thing and allowing us to answer these kinds of questions.