This lecture is about link analysis for web search. In particular, we're going to look at how to do link analysis and how to use the results to improve search. The main topic of this lecture is ranking algorithms for web search. In the previous lecture we talked about how to create an index; now that we have an index, we want to see how we can improve the ranking of pages on the web.

Standard IR models can also be applied here. In fact, they are important building blocks for supporting web search, but they aren't sufficient, mainly for the following reasons. First, on the web we tend to have very different information needs. For example, people might search for a particular web page, such as an entry page, and this is different from traditional library search, where people are primarily interested in collecting literature information. This kind of query is often called a navigational query: its purpose is to navigate to a particular page. For such queries, we might benefit from using link information. Second, documents carry additional information: because pages are in web formats, there are many other clues, such as the layout, the title, or, again, link information. This provides an opportunity to use extra context information about a document to improve scoring. Finally, information quality varies a lot, which means we have to consider many factors in the ranking algorithm. Doing so gives us a more robust way to rank pages, making it harder for a spammer to manipulate any single signal to improve the ranking of a page.

As a result, people have made a number of major extensions to the ranking algorithms. One line of work is to exploit links to improve scoring, and that's the main topic of this lecture. People have also proposed algorithms to exploit implicit feedback information in the form of clickthroughs; that, of course, falls in the category of feedback techniques, and machine learning is often used there. In general, web search ranking algorithms are based on machine learning methods that combine all kinds of features. Many of these features are based on standard retrieval models such as BM25, which we talked about before, applied to score different parts of documents or to provide additional content-matching signals, but link information is also very useful, as it provides additional scoring signals.
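To make this concrete, here is a minimal sketch, in Python, of what such a learned combination of features might look like. The feature names, values, and weights below are all made-up assumptions for illustration (the anchor-text field anticipates the discussion that follows), not anything prescribed in the lecture.

```python
# A minimal sketch (not the lecture's actual system) of combining
# content-matching features with a link-based feature via a linear
# learning-to-rank model.  All feature names, values, and weights
# here are made-up assumptions for illustration.

def score_page(features, weights):
    """Linearly combine ranking features; higher scores rank higher."""
    return sum(weights[name] * value for name, value in features.items())

# Weights would normally be learned from training data (e.g., clickthroughs).
weights = {"bm25_title": 1.5, "bm25_body": 1.0, "bm25_anchor": 2.0, "link_score": 100.0}

# BM25 applied separately to different parts of a document, plus a
# link-based authority score of the kind discussed later in this lecture.
features = {"bm25_title": 2.1, "bm25_body": 5.3, "bm25_anchor": 1.7, "link_score": 0.004}

print(score_page(features, weights))
```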
So let's look at links on the web in more detail. This is a snapshot of some part of the web. We can see many links connecting the different pages. In the center, there is a description attached to a link pointing to the document on the right side. This description text is called anchor text. If you think about this text, it's actually quite useful, because it provides extra description of the page it points to. For example, if someone wants to bookmark Amazon.com's front page, the person might write "the biggest online bookstore" and then link to Amazon. The description is often very similar to what a user would type in the query box when looking for such a page, and that's why it's very useful for matching pages. Suppose someone types a query like "online bookstore" or "biggest online bookstore": the query would match this anchor text, and that provides evidence for matching the page being pointed to, that is, the Amazon entry page. So matching anchor text that describes a link to a page provides good evidence for the relevance of the page being pointed to. Anchor text is thus very useful.

If you look at the bottom part of this picture, you can also see some patterns among the links, and these patterns might indicate the utility of a document. For example, on the right side you see a page that has received many in-links: many other pages are pointing to it, which suggests this page is quite useful. On the left side you can see another page that points to many other pages; this is a directory page that allows you to reach a lot of other pages. We can call the first kind an authority page and the second kind a hub page. This means link information can help in two ways. One is to provide extra text for matching. The other is to provide additional scores for a web page that characterize how likely the page is to be a hub and how likely it is to be an authority.

People have, of course, proposed ideas to leverage this link information. Google's PageRank, which was a main technique they used in the early days, is a good example: it is an algorithm to capture page popularity, basically to score authority. The intuition is that links are like citations in the literature. One page pointing to another page is very similar to one paper citing another paper. So if a page is cited often, we can assume the page is more useful in general. That's a very good intuition, and PageRank essentially takes advantage of it in a principled approach. Intuitively, PageRank is doing citation counting, or in-link counting; it improves this simple idea in two ways.

First, it considers indirect citations. That means you don't just look at how many in-links you have; you also look at the pages that are pointing to you. If those pages themselves have a lot of in-links, that means a lot, and in some sense you get credit from that. But if the pages pointing to you are not themselves pointed to by other pages, so they don't have many in-links, then you don't get as much credit. That's the idea behind counting indirect citations. You can also understand this idea by looking at research papers again. If you're cited by, say, ten papers, and those ten papers are just workshop papers or papers that are not very influential, then although you have ten in-links, that's not as good as being cited by ten papers that have themselves attracted a lot of citations. So we would like to consider indirect links, and PageRank does that.

Second, it smooths the counts with pseudo citations: we assume every page has a non-zero pseudo citation count. Essentially, we imagine many virtual links connecting all the pages together, so that every page gets some pseudo citations from every other page. The reason for doing this is that it allows the problem to be solved elegantly with linear algebra techniques. So perhaps the best way to understand PageRank is to think of it as computing the probability that a random surfer visits each web page.
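To make the random-surfer view concrete, here is a minimal power-iteration sketch in Python. The damping factor of 0.85 and the tiny example graph are illustrative assumptions, since the lecture only describes the intuition.

```python
# A minimal power-iteration sketch of PageRank as a random-surfer model.
# The damping factor (0.85, a common choice) and the toy graph are
# illustrative assumptions, not values given in the lecture.

def pagerank(links, damping=0.85, iters=50):
    """links: dict mapping each page to the list of pages it links to."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}  # start the surfer uniformly at random
    for _ in range(iters):
        # With probability (1 - damping) the surfer teleports to a random page;
        # these virtual links are the "pseudo citations" every page receives.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for p in pages:
            out = links[p] or pages  # a dangling page effectively links everywhere
            share = damping * rank[p] / len(out)
            for q in out:
                new_rank[q] += share  # credit flows along links, so pages cited
                                      # by well-cited pages gain indirect credit
        rank = new_rank
    return rank

# A tiny example web: C is pointed to by both A and B.
web = {"A": ["C"], "B": ["C"], "C": ["A"]}
print(pagerank(web))
```

Running this on the toy graph gives C the highest score, since it receives the most direct and indirect credit, matching the citation-counting intuition described above.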