With all that groundwork done, let's look at a bigger example. It does not illustrate the scalability challenge, which we'll come back to briefly in a moment, but it is slightly bigger than four nodes: now we're talking about eight nodes, numbered one through eight, representing eight webpages interconnected by directional hyperlinks. If we just stare at this graph and ask which node is most important, we might say, by in-degree count, node four is most important: its in-degree is four, bigger than any other node's. But if you think along the lines of Google's PageRank, which implicitly assumes a certain navigation behavior, you see that node four basically gives away its entire importance score to a single node downstream: node three. And three does not give it back to four; it spreads it further, to nodes two and eight. So this gives you pause. Maybe node three's importance score under the PageRank calculation will be higher than node four's. And indeed, we'll see that's the case. Intuitively, nodes two, three, and four should be highly ranked, and that is what we will see. Which nodes will have the lowest importance scores? Those nodes that are not pointed to by many important nodes. We can see that nodes six and seven will probably have low importance scores, and indeed that's the intuition we'll confirm in a short while. To make this a little more efficient, I've written out the H matrix already, based on the topology of this graph. We can then construct H hat, but H hat is the same as H, because there are no dangling nodes. Then we can construct the Google matrix G with the randomization parameter, or rather the non-randomization parameter, theta being 85%: fifteen percent of the time, we'll be doing random hopping among the pages. You can write down that G matrix and then start the iteration.
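The construction just described can be sketched in code. A caveat: the transcript only tells us a few features of the graph (node four has in-degree four and a single out-link to node three, which links to nodes two and eight), so the edge list below is a hypothetical one consistent with that description, not necessarily the textbook's exact graph.

```python
import numpy as np

# Hypothetical edge list consistent with the description; the textbook's
# exact graph may differ. Node 4 has in-degree 4 and a single out-link
# to node 3, which in turn links out to nodes 2 and 8.
edges = {1: [4], 2: [1], 3: [2, 8], 4: [3],
         5: [4], 6: [4], 7: [4], 8: [2]}
N = 8
theta = 0.85  # follow a hyperlink 85% of the time

# H: row-stochastic hyperlink matrix (row i spreads 1/out-degree to each target)
H = np.zeros((N, N))
for i, outs in edges.items():
    for j in outs:
        H[i - 1, j - 1] = 1.0 / len(outs)

# H_hat would replace any all-zero (dangling) row with 1/N; here none exist,
# so H_hat equals H, just as in the lecture's example.
H_hat = H.copy()
assert (H_hat.sum(axis=1) > 0).all()  # no dangling nodes

# Google matrix: 85% follow links, 15% random hop to any of the N pages
G = theta * H_hat + (1 - theta) / N * np.ones((N, N))
print(np.allclose(G.sum(axis=1), 1.0))  # G is row-stochastic -> True
```

The key point is that G stays row-stochastic, which is what makes the iteration in the next step converge to a probability vector.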
The iteration is pi transpose equals pi transpose times G, from one iteration to the next. The initialization shown here gives every node one-eighth, exactly evenly distributed. It actually doesn't matter what initialization is used; it would change the convergence speed a little, but not the end result. The first iteration produces this vector. I'm skipping the intermediate steps; you can see them in the textbook. Eventually, it converges to something like this. Actually, this is only the sixth iteration, but it's getting quite close to the final answer, which is shown here: the equilibrium pi vector for the eight nodes. I have already ordered and normalized them, so what you see are the importance scores, normalized and sorted in descending order. Node three has the highest importance score and is therefore ranked number one, followed by node two, then node four. Node three is by far the most important. A score of 0.2 sounds like a small number, but remember that pi is normalized so the entries add up to one; 0.2 across eight nodes is actually quite impressive. Nodes two and four are about the same. And nodes six and seven are by far the smallest, at 0.06 and 0.04. Then again, Google's PageRank returns only the ranking; it does not show you the actual scores, so the scores are just an intermediate step. This confirms our earlier intuition: node three is perhaps the most important, even though node four has a higher in-degree. That's because the PageRank calculation models navigation behavior: if a lot of people come to a web page but then all have to leave through a single downstream page, then that downstream page receives a hundred percent of the importance score spreading out of node four.
Now, you can disagree with that navigation model, but given the dominance of Google's search engine, it's hard to argue that it does not satisfy users' needs, unless you can provide an even more useful ranking order. There are many more things we could talk about; we'll save them for the advanced material, but I want to briefly mention that in lecture one we looked at distributed power control, DPC, for cellular wireless networks. In fact, when we generalize PageRank, the famous algorithm we just showed, a little bit (you'll see exactly what I mean in the advanced material part of the lecture), you will see a striking parallel between PageRank and DPC. PageRank is used by Google for billions of searches each day, and DPC is used in all 3G cellphones, invented by a number of companies, though most of that work eventually funneled through Qualcomm. These are two vastly different industries: DPC was invented in the early 90s, PageRank in the mid to late 90s. But mathematically, they have a very interesting parallel. DPC deals with a network of interfering nodes, whose topology is represented by a matrix G, where the Gij's are the direct and interference channel gains. PageRank is about a graph of web pages, hyperlinked and directed, and there's another matrix, with the same symbol but a very different definition, the Google matrix, that helps you rank these webpages. The functionalities are vastly different, but it turns out the generalized version of PageRank we just saw has exactly the same mathematical formula as DPC. If you are curious about that, please come to the advanced material part of the lecture. Now, PageRank can be understood from quite a few different angles. We talked about the eigenvector angle. We talked about the iterative computation angle, modeling the hopping or navigation behavior of a user, that is, a random walk on a graph.
There's also an economic growth model angle, which is actually what led to this equivalence. But the biggest challenge for Google in using PageRank is the challenge of scale: N is too big. Fortunately, H is sparse, and the other matrices in G are rank-one matrices. So you add two rank-one matrices to a sparse matrix, and you get a very well-structured matrix G that helps a lot in scaling up the computation. Notice that this computation is done on a central server; DPC is done in a distributed way, hence the name distributed power control. So there's a big contrast between the two. But then again, in DPC we're talking about tens of nodes, whereas here we're talking about millions of nodes relevant to a search, if not the entire set of webpages, 40 to 60 billion out there. There are quite a few tricks you can use to improve the scalability of this centralized computation. One is numerical linear algebra: there are many ways to decompose the matrix H so that you speed up the matrix multiplication. There are also a few more tricks. For example, it's only the order that matters, so you don't have to wait for the importance scores themselves to converge precisely. Certain web pages can be grouped together. The dangling nodes can be treated separately. You can also build a hierarchy: group certain clusters of nodes together, run an approximation, and then distribute each cluster's importance score among the nodes within it. In a homework problem you will look at this cluster-based hierarchical approximation. So whether it's speed-up methods or approximation tricks, there are many approaches to make this algorithm run fast for large N. Now, finally: Google also plays a game with companies doing SEO, search engine optimization. As soon as Google became popular, around 1998, SEO popped up, around 1999 to 2000, and since then it's been over a decade of constant back-and-forth struggle between the two.
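The structural point above, that G is a sparse matrix plus rank-one terms, means one iteration can be computed without ever materializing the dense N-by-N matrix G. Here is a minimal sketch of that idea in pure Python over an adjacency-list representation; the four-page graph at the bottom is hypothetical, chosen only to exercise the dangling-node case.

```python
import numpy as np

def sparse_pagerank_step(pi, out_links, theta, N):
    """One update pi^T <- pi^T G without forming the dense N x N matrix G.

    Since G = theta * H_hat + (1 - theta)/N * ones, and sum(pi) = 1,
    pi G = theta * (pi H) + (theta * dangling_mass + (1 - theta)) / N.
    Only the sparse hyperlinks are ever touched.
    """
    nxt = np.zeros(N)
    dangling = 0.0
    for i, outs in out_links.items():
        if outs:
            share = theta * pi[i] / len(outs)
            for j in outs:
                nxt[j] += share
        else:
            dangling += pi[i]  # a dangling row of H_hat spreads 1/N everywhere
    nxt += (theta * dangling + (1 - theta)) / N
    return nxt

# Hypothetical 4-page example; page 3 is dangling (no out-links)
out_links = {0: [1, 2], 1: [2], 2: [0], 3: []}
N, theta = 4, 0.85
pi = np.full(N, 1.0 / N)
for _ in range(200):
    pi = sparse_pagerank_step(pi, out_links, theta, N)
print(np.round(pi, 3))  # converged scores, summing to one
```

The cost per iteration is proportional to the number of hyperlinks rather than N squared, which is the main reason this centralized computation scales at all.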
So SEO, in a nutshell, tries to increase your webpage's ranking by leveraging the PageRank algorithm. For example, an SEO company may say: here's a truly important webpage I've constructed; if you pay me, I will provide pointers to your webpages, even though your webpages may not deserve a high rank. Basically, it tries to change the structure of the H matrix. But Google is not sitting idle; it reacts too. Two recent reactions came in early 2011 and May 2012, when Google announced changes in the way it runs the actual detailed version of its PageRank methods, to counter SEO's artificial lift of certain web pages' importance scores. So there are a lot of interesting stories that we won't have time to go into. Let's quickly summarize what we have seen so far in this lecture. The key concept is that we can view the collection of hyperlinked web pages as a network: a graph that we represent by matrices such as H, H hat, and G. There is also a functionality, navigation, that we implicitly model through the way we construct these matrices. You can disagree with those models, but so far Google's search results have proven to be quite useful. And this is a general theme: the connectivity pattern provides a hint about node importance. We'll come back to this in lecture four with three additional importance metrics suitable for other purposes. So PageRank, this famous algorithm we just saw constructed, uniquely defines and then efficiently computes a consistent set of importance scores based on these ideas. And one of the angles we saw and emphasized is the dominant eigenvector angle: pi star transpose equals pi star transpose times the Google matrix G.
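The dominant eigenvector angle from the summary can be checked directly: any row-stochastic Google matrix has dominant eigenvalue one, and the PageRank vector is the corresponding left eigenvector. A small sketch, using a hypothetical three-page matrix:

```python
import numpy as np

# Hypothetical 3-page hyperlink matrix and its Google matrix
H = np.array([[0.0, 1.0, 0.0],
              [0.5, 0.0, 0.5],
              [1.0, 0.0, 0.0]])
G = 0.85 * H + 0.15 / 3 * np.ones((3, 3))

# Left eigenvectors of G are right eigenvectors of G transpose
vals, vecs = np.linalg.eig(G.T)
k = np.argmax(np.real(vals))       # dominant eigenvalue, which equals 1
pi = np.real(vecs[:, k])
pi = pi / pi.sum()                 # normalize so the scores sum to one

print(round(float(np.real(vals[k])), 6))  # -> 1.0
print(np.allclose(pi @ G, pi))            # pi* G = pi*  -> True
```

This is the same vector the power iteration converges to; solving the eigenproblem directly is only feasible for small examples, which is why the iterative view matters at web scale.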