So, now we're going to write a set of applications, and the code for this is in pagerank.zip. It's a simple web page crawler, then a simple web page indexer, and then we're going to visualize the resulting network using a visualization tool called d3.js. In a search engine, there are three basic things that we do. First, we have a process that's usually done sort of when the computers are bored: they crawl the web by retrieving a page, pulling out all the links, keeping an input queue of links, going through those links one at a time, marking off the ones we've got, picking the next one, and on and on and on. That's the front-end process, spidering or crawling. Then, once you have the data, you do what's called index building, where you look at the links between the pages to get a sense of which are the most centrally located and which are the most respected pages, where respect is defined by who points to whom. Then, we actually search through it. In this case we won't really search it, we'll visualize the index when we're done.

So, a web crawler is a program that browses the web in some automated manner. The idea is that Google and other search engines, including the one you're going to run, don't actually want the web; they want a copy of the web, and then they can do data mining within their own copy of the web. It's just so much more efficient than having to go out and look at the web every time, so you copy it all. The crawler just slowly but surely crawls and gets as good a copy of the web as it can. Like I said, its goal is to retrieve a page, pull out all the links, add the links to the queue, then pull the next one off and do it again, and again, and again, and save all the text of those pages into storage. In our case, that storage will be a database; in Google's case, it's literally thousands or hundreds of thousands of servers, but for us we'll just do this in a database.

Now, web crawling is a bit of a science. We're going to be really simple: we're just going to try to get to the point where we've crawled every page we can find once. That's what this application is going to do. But in the real world, you have to pick and choose how often to revisit pages and which pages are more valuable. Real search engines tend to revisit pages more often if they consider those pages more valuable, but they also don't want to revisit them too often, because Google could crush your website and make it so that your users can't use your website, because Google is hitting you so hard. There's also, in the world of web crawling, a file called robots.txt. It's a simple file that a search engine downloads when it sees a domain or a URL for the first time, and it tells the search engine where to look and where not to look. You can take a look at py4e.com and its robots.txt, and see what my website is telling all the spiders about where to go look and where the good stuff is.

So, at some point you've built this, you have your own storage, and it's time to build an index. The idea is to figure out which pages are better than other pages. You certainly start by looking at all the words in the pages, splitting the text into words with Python, etc. But the other thing we're going to do is look at the links between the pages and use those links as a way to ascribe value. So, here's the process that we're going to run. There are a couple of different things, and the code for all of this is sitting here in pagerank.zip.
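To make that crawl loop concrete, here is a minimal sketch in Python of the retrieve-a-page, pull-out-the-links, add-to-the-queue cycle described above. It is not the actual spider.py from pagerank.zip; the starting URL, the in-memory queue, and the ten-page limit are just assumptions for a small demo.

    # A minimal sketch of the crawl loop, assuming an in-memory queue
    # instead of the database the real spider uses.
    import urllib.request
    from urllib.parse import urljoin
    from html.parser import HTMLParser

    class LinkCollector(HTMLParser):
        """Collect the href of every anchor tag on a page."""
        def __init__(self):
            super().__init__()
            self.links = []
        def handle_starttag(self, tag, attrs):
            if tag == 'a':
                for name, value in attrs:
                    if name == 'href' and value:
                        self.links.append(value)

    queue = ['https://www.dr-chuck.com/']   # starting URL (an assumption)
    seen = set()
    pages = {}                              # url -> page text ("storage")

    while queue and len(pages) < 10:        # stop after 10 pages for the demo
        url = queue.pop(0)
        if url in seen:
            continue
        seen.add(url)
        try:
            html = urllib.request.urlopen(url).read().decode(errors='ignore')
        except Exception:
            continue                        # skip pages that fail to retrieve
        pages[url] = html                   # save the text of the page
        parser = LinkCollector()
        parser.feed(html)
        for href in parser.links:
            queue.append(urljoin(url, href))  # resolve relative links, add to queue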
The way it works is that it actually only spiders a single website. You can spider dr-chuck.com, or you can actually spider Wikipedia. It's kind of interesting, but it takes a little longer before the links start to point back to one another on Wikipedia. Still, Wikipedia is not a bad place to start if you want to run something long, because at least Wikipedia doesn't get mad at you for using it too much. So, like all of these sort of data mining things, the crawling works from a list of URLs in the database. Some of the URLs have data, some do not, and it randomly picks one of the unretrieved URLs, goes and grabs that URL, parses it, puts the data in for that URL, and then also reads through it to see if there are more links. So, in this database, there are a few pages that are retrieved and lots of pages yet to retrieve. Then it goes back and says, oh, let's randomly pick another unretrieved page. Go get that one, pull that in, put the text for that one in, but then look at all the links and add those links to our stored list. If you watch this, even if you do just one or two documents at a time, you'll be like, "Wow, that was a lot of links," and then you grab another page and there are 20 links, or 60 links, or 100 links. You're not Google, so you don't have the whole internet, and what you find is that as you touch any part of the internet, the number of links explodes and you end up with far more links than you've retrieved. But if you're Google, after a year you've seen it all once, and your data gets more dense. That's why in this program we stay with one website. So eventually, you get some of those links filled in and have more than one set of pointers.

The other thing in here is that we keep track of which pages point to which pages, the little arrows. Each page gets a number inside this database, like a primary key, and we keep track of those inbound and outbound links and use them to compute the Page Rank. That is, the more inbound links you have from sites that themselves have a good number of inbound links, the better we like that site; that's a better site. The Page Rank algorithm is a thing that reads through this data and then writes the data back, and it takes a number of passes through all of the data for the Page Rank values to converge. These are numbers that converge toward the goodness of each page, and you can run it as many times as you want. The Page Rank runs really quickly; the spidering runs really slowly because it's got to talk to the network and pull these things back, and that's why the spider is restartable. The Page Rank is all just talking to data inside that database, so it's super fast. If you want to reset the values to the initial value of the Page Rank algorithm, you can do that, and it just sets them all back to the initial value, which I think is one. They all start with a goodness of one, and then some of these end up with goodnesses of five or 0.01, and the more you run it, the more the data converges. These data items tend to converge after a while: the first few times they jump around a bunch, and then later they jump around less and less. Then, at any point in time as you run this ranking application, you can pull the data out and dump it to look at the Page Rank values; this particular page, for example, has a Page Rank value of one.
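To show what "converging toward the goodness of each page" looks like, here is a simplified Page Rank sketch over a tiny made-up link graph held in memory, rather than the database the real ranking program reads. The four page names, the ten passes, and the 0.85 damping blend are assumptions for illustration, not necessarily the exact formula in the zip file.

    # A simplified Page Rank sketch: each pass spreads a page's rank
    # across its outbound links, and the values settle down over passes.
    links = {                       # page -> pages it links to (made-up example)
        'A': ['B', 'C'],
        'B': ['C'],
        'C': ['A'],
        'D': ['C'],
    }
    ranks = {page: 1.0 for page in links}   # every page starts with a goodness of 1

    for iteration in range(10):             # more passes -> values converge more
        new_ranks = {page: 0.0 for page in links}
        for page, outbound in links.items():
            share = ranks[page] / len(outbound)
            for target in outbound:
                new_ranks[target] += share
        # Blend each page's new rank with a small baseline, a simplified
        # form of the damping step in the classic algorithm
        ranks = {page: 0.15 + 0.85 * new_ranks[page] for page in links}

    print(ranks)   # pages with more (and better) inbound links end up higher

If you print the ranks after each pass, you see the same behavior described above: the numbers jump around a lot in the first few passes and then change less and less.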
Looking at this dump, this one has probably just had spreset run, because all the pages have the same Page Rank. After you've run the ranking, when you run spdump you will see these numbers start to change. This stuff is all in the README file that's sitting in the zip file when you unzip it. So, spdump just reads the stuff and prints it out, and then spjson also reads through all the stuff that's in there, takes the best 20 or so pages by Page Rank, and dumps them into a JavaScript file. Then there is some HTML and d3.js, which is a visualization that produces this pretty picture, and the bigger dots are the ones with the better Page Rank. You can grab this and move all the stuff around, and it's nice and fun and exciting. So, we visualize, right? Again, we have a multi-step process: a slow, restartable process, then a sort of fast data analysis and cleanup process, and then a final output process that pulls stuff out of there. It's another one of these multi-step data mining processes. The last thing that we're going to talk about is visualizing mail data. We're going to go from mbox-short, to mbox, to a super-gigantic mbox. That's what we're going to do next.
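As a concrete picture of the spjson-style output step described above, here is a hedged sketch that takes the top-ranked pages and writes them out as JavaScript data a d3.js page could load. The file name spider.js, the variable name spiderJson, and the JSON shape are assumptions, not necessarily what the actual program produces.

    # A sketch of the output step: pick the best-ranked pages and wrap
    # them as JavaScript data for a d3.js visualization page.
    import json

    ranks = {'https://www.dr-chuck.com/': 2.7,    # made-up example values
             'https://www.py4e.com/': 1.9,
             'https://www.si.umich.edu/': 0.4}

    top = sorted(ranks.items(), key=lambda item: item[1], reverse=True)[:20]
    nodes = [{'url': url, 'rank': rank} for url, rank in top]

    with open('spider.js', 'w') as handle:
        # Wrap the JSON in a variable assignment so an HTML page can load it
        # with a plain <script> tag
        handle.write('spiderJson = ' + json.dumps({'nodes': nodes}, indent=2) + ';\n')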