[MUSIC] So this next set of videos talks about graph processing in cloud computing. This is a big new area of big data and of the use of large cloud clusters. There's a huge amount of data to be processed, and there's a huge set of different problems that would benefit mankind if we could solve them. And so graph processing has become very popular, and it has lots of business and commercial implications, but also lots of social and scientific implications, and we're going to be looking at how cloud computing at the moment is supporting that sort of graph processing.

So we should define what we mean by graph processing. There are lots of different ways to describe it, but essentially you're looking at some storage system and processing system. You can have a graph database with all the information in it. It's going to provide some way of retrieving the data, and typically a graph database is going to provide index-free adjacency: it's going to have adjacency elements and represent them with pointers in some form or other. The nodes are going to represent entities: people, businesses, accounts. The properties are going to be pertinent information that relates to the nodes, and edges interconnect nodes to nodes, or nodes to properties, and they represent the relationships between the two. There are all sorts of different formulations. The difficulty is that the databases that represent graphs tend to be very, very large and very sparse, and they're difficult to process in a straightforward, vectorized way. So new schemes need to be developed for them.

We can talk about the relationship between the two: clearly, if you have a graph, you can represent it as an adjacency list. You can ask, well, is that just a relational database table? And we can talk about the differences between them, and scaling is certainly one of the big issues. So a graph database is a bunch of associative data sets: you look up one item and you retrieve a different item from it, like a node and its connection to an edge. And we can look at the sorts of systems this applies to. For example, most of your object-oriented applications consist of objects that talk to each other. So you have objects, and you have interconnects, and those are represented by edges with properties on them, and you can ask, well, what is the structure of an object or its application, and represent it as a graph. Typically when you're looking at that type of thing, as I said, with sparse data, you don't often employ join operators to take the intersection of two graphs and see what the result is. You're more exploring a particular graph and looking for particular patterns within the graph.

Relational databases, on the other hand: there you perform the same operation on large numbers of data elements, and you often do the same types of computation on all of them. You can define relations over the data, and so they fit very nicely into table operations. An entity type is typically represented by a table; the rows are instances of the entity, the columns represent values attributed to those instances, and rows in one table can be related to rows in another table if you have some sort of unique key expressing the relationship of one table to the other. So all of that is very, very vectorized, very straightforward, and you're not dealing with databases that are as huge and sparse.
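To make the contrast concrete, here is a minimal sketch, in plain Python rather than any particular product's API, of the two styles: a graph store where each node holds direct pointers to its neighbours (index-free adjacency), versus the relational style, where relating things means matching keys across tables. All the names here are illustrative.

```python
# A minimal sketch of index-free adjacency: each node object holds
# direct references to its neighbours, so traversal needs no index lookup.
class Node:
    def __init__(self, name, **properties):
        self.name = name
        self.properties = properties   # pertinent information about the entity
        self.neighbours = []           # direct pointers to adjacent nodes

    def connect(self, other):
        self.neighbours.append(other)  # an edge, stored as a pointer

alice = Node("Alice", kind="person")
acme = Node("Acme", kind="business")
alice.connect(acme)

# Traversal follows pointers directly -- no join, no key lookup:
for n in alice.neighbours:
    print(alice.name, "->", n.name)

# The relational style stores the same fact as rows, and relating them
# means matching keys across tables (which is what a join does):
nodes = {1: "Alice", 2: "Acme"}          # entity table: key -> value
edges = [(1, 2)]                         # relationship table of keys
print([nodes[dst] for (src, dst) in edges if src == 1])
```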
A graph database example, for comparison: you might want to take the whole universe and stick it into a nearest-neighbour description and allow people to explore the universe. In which case, you would represent a star and another star, and the distance between them as an edge, and you would have billions and billions of nodes and edges and so on. Allowing people to explore that would be very exciting, but it would also require a lot of interesting computation.

So looking at the ways to represent a graph, typically it's an adjacency matrix. You have all the nodes, and you have how they're related: the edges that relate them to other vertices. So in this graph, A is joined by an edge to B, G, and H. Now, you could just list AB, AG, and so on; or you could map A to a list of B, G, H; or you could have just a tuple of a vertex, an edge, and another vertex, and then have A is connected to B, A is connected to G, A is connected to H. There are a lot of different ways to represent the adjacency structure, but that's what a graph is. On the other hand, in the relational database, a table typically represents some entity or some relationship, and you have tuples, or rows, and attributes, or columns, and you can look at the contents of the table in those two ways. Typically the rows represent instances, and the attributes describe what those instances are, or might describe the keys. So that's what a table is: nice and regular, and it can be dealt with by a sort of simple vectorization.

So looking at graph computation, a lot of people say that when you address this, when you're writing algorithms for graph computation, what you need to do is think like a vertex. What is a vertex? What can the vertex do? How can it perform its operations? The reason is that a vertex is a sort of locality: it's interconnected with its neighbours by edges, so it's your focus of computation. There are two basic operations you can do at a vertex. One is that you can aggregate information from your neighbours, and that gives you a whole way of doing computation about your local neighbourhood. The other is that you can diffuse your local information to the neighbours. Diffusion is the second approach, and this is just the diagram representation of it: you can take each node in turn, and it can either diffuse data from its local vertex to other vertices, or it can gather all the data from the surrounding nodes into some properties that the local node holds. Most graph algorithms are built up this way. And you should view these networks, these graphs, as very, very large. These localities can stretch over variable distances: the interconnections can reach right across the graph, or they can be very, very short. How you lay out the graph obviously determines that, but there's a fair degree of flexibility in how it's all organized, which leads to processing and data representation problems.

So, an example of a graph problem, a nice simple one. Suppose you have a very, very large graph, and what you would like to do is return all sets of vertices that form a triangle of edges. So you would like to find edges A,B and A,C, and then find an edge from C to B. And if you can find that, then that's the pattern you're looking for, and you return the result. That would require a lot of computation. It's not a straightforward table-to-table operation; you want specific types of graph algorithms and operations in order to be able to solve it in an efficient manner.
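That triangle query is easy to state in code. Here is a minimal in-memory sketch (the toy graph and the function names are mine, and a real system would have to run this over a distributed graph rather than a Python dictionary): for each pair of neighbours B and C of a vertex A, check whether the closing edge from C to B exists.

```python
# A minimal in-memory triangle finder: for every vertex A with edges A-B
# and A-C, report the triple if the closing edge C-B is also present.
from itertools import combinations

graph = {                       # adjacency sets for a small undirected graph
    "A": {"B", "C", "G", "H"},
    "B": {"A", "C"},
    "C": {"A", "B"},
    "G": {"A"},
    "H": {"A"},
}

def triangles(adj):
    seen = set()
    for a, neighbours in adj.items():
        for b, c in combinations(sorted(neighbours), 2):
            if c in adj.get(b, set()):          # does the closing edge exist?
                tri = frozenset((a, b, c))
                if tri not in seen:             # report each triangle once
                    seen.add(tri)
                    yield tuple(sorted(tri))

print(list(triangles(graph)))   # -> [('A', 'B', 'C')]
```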
Well, we can move on to some actual practical processing of graphs, and a lot of that is used in social networking these days. The first example that everyone gives about graph processing is PageRank. That's about web-based computing and how people view the web. What it does is count how many incoming arcs each node has, and how large those numbers are, so that you find the most popular nodes, the most popular vertices, in the graph and rank-order them (there's a small sketch of the iteration at the end of this passage). That algorithm is one of the ones that Google made its name with. What it does is allow users to do searches on all the data in the web, the whole web. It reduces the data in the whole web to sets of terms and indices, and then, for a particular search, you try to get the best matches for the search terms in terms of what's on the web. But the problem is, if you just provide a list, it doesn't help. So you try to rank what is provided to the customer, and PageRank allows you to do that. The PageRank algorithm, by counting the incoming links to a website, allows you to rank which are the most important pages on the web relating to your search terms, and present them in a satisfactory manner to the user.

Well, there are all sorts of other applications. The typical, traditional one is to look at transportation routes. Of course, those get more and more dense as time goes on in the world. But what you're looking for, very often, is the shortest path: how to get from A to B using either the smallest number of links, or the shortest distance if the edges have distances attached to them. They could have times attached to them too. So you're looking for the shortest path, and Dijkstra's algorithm is a good example of an approach to deal with that. And again, it's not very amenable to computing with a table, but it works very well with its own type of algorithm, its own type of processing.

Another one would be citation relationships. Nowadays, all the faculty, when they publish papers, get cited, and those citations go into indices, and we have databases of citations now. You could try to connect all those groups together, so you could say, well, over here is the cloud group, and over here, let's say, is the grid group. How do they interconnect? Do they interconnect? Where is the interconnection, and who are the people interconnecting the two groups? So you can have very, very complicated graph algorithms working on how people interact on a networking basis.

And this leads you to, if you like, the real plum at the moment: social networking. Trying to understand how social networks work, trying to understand how to categorize, classify, and organize your social networks. That's really what the likes of Twitter and Facebook are trying to do. They're using clustering techniques, and lots of other techniques, to do this. They would like to find everybody who needs a television, everybody who wants to buy a car; or they would like to find what social group this person is in, whether he's a leader in that social group, or whether she is the real conduit for information amongst all her friends. All of this type of computation requires a great deal of processing, and it's all graph-based, with very sparse connections, and very interesting features.
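Since PageRank keeps coming up, here is a minimal single-machine sketch of it in the vertex-centric style described earlier. This is the standard damped power-iteration formulation, not any particular system's code: each round, every vertex diffuses its current rank along its out-edges, and every vertex aggregates what flows in from its neighbours.

```python
# A minimal single-machine PageRank sketch: each round, every vertex
# diffuses its rank equally along its out-edges, and every vertex
# aggregates the incoming shares (with the usual damping factor d).
def pagerank(out_edges, d=0.85, iterations=50):
    vertices = list(out_edges)
    n = len(vertices)
    rank = {v: 1.0 / n for v in vertices}           # start uniform
    for _ in range(iterations):
        incoming = {v: 0.0 for v in vertices}
        for v, targets in out_edges.items():
            if targets:                             # diffuse rank to neighbours
                share = rank[v] / len(targets)
                for t in targets:
                    incoming[t] += share
        rank = {v: (1 - d) / n + d * incoming[v]    # aggregate step
                for v in vertices}
    return rank

# Tiny four-page web; C has the most incoming links and ranks highest.
web = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
for page, r in sorted(pagerank(web).items(), key=lambda kv: -kv[1]):
    print(page, round(r, 3))
```

The real systems distribute exactly this diffuse-and-aggregate loop across a cluster, which is where the difficulties we discuss next come from.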
So this type of processing doesn't easily map into what we've been discussing so far. We've looked at MapReduce, and MapReduce tends to work well if you know there's a large number of points that you want to compute over uniformly, but not if you have an enormous number of points scattered over lots and lots of different databases, lots of different file systems. That can get complicated. Now, that doesn't mean you can't do it in MapReduce, and we'll be talking a little bit about that, but graph processing really deserves its own way of handling this sort of data.

So what are the scales of graphs in current graph processing? You're looking at billions of edges and tens of gigabytes at the current time, and we can just go through some examples here to illustrate. DBLP is a citation index for computer science, and it has about 0.3 million vertices and about one million edges, and its size is about 10 MB. You can download it if you want to experiment with it. That's an example of a citation index; certainly I use it, it's very popular. AS-Skitter is another system, an internet topology graph, with 1.7 million vertices and 11 million edges, at about 142 megabytes. Going down the list, there are all sorts of different web examples, but let's concentrate on Twitter, because Twitter is very popular, everybody uses it and knows what it is. There, what you're doing is sending tweets; you have a handle on the tweets that you're sending, and people form these networks by forwarding their tweets and so on. Looking at that, you're looking at something like 42 million vertices and 1,470 million edges, and that's about 24 gigabytes. Not something you could easily download and play with, and it's beginning to represent a real problem trying to compute with all of that data. If you go to the Yahoo Web graph, and this is back in 2002, the Web as crawled then had 1,413 million vertices and something like 6 billion edges, and it represents about 120 gigabytes. And the Web has grown a lot since then.

But it's the scale that's causing the problem. If you took any of these problems and made them small enough, you could easily do them, and you could probably use any technique to do them. But as they get bigger and bigger and bigger, you have to use distributed computing, because they won't fit in one machine, they won't fit in one huge memory. And then you have the distributed computing problem of actually working on them. And if you get them large enough and they're being updated, then you have an update issue about how you actually process things. You would like to look at the current state, but perhaps the data keeps coming in, updating the graph, and you would like to make sense of that graph as it changes. So there's plenty of room for research, plenty of room for interest, and this whole area is very live.

So what's the future scale of what we're looking at? We can take some naive examples, but they illustrate the problem. For a social network, you could expect a billion people to be online, connecting together, talking to each other, sharing information. With a billion vertices, that's about 100 billion edges interconnecting them, and the adjacency list, remember, is the way one could represent that as a table. That would be a 2.92-terabyte adjacency list. You would find it difficult right now to actually hold that in memory. It could just about be done in today's memory systems, but it would be expensive; it's not a cheap environment at all. And that represents about a seventh of the world's population, assuming that youngsters don't necessarily go on social networks and that older folk may not necessarily go on them either.
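As an aside, the adjacency-list sizes quoted here and below are consistent with a simple back-of-envelope model, which is my assumption rather than anything stated in the lecture: each undirected edge stored in both endpoints' lists at roughly 16 bytes per entry (say an 8-byte neighbour ID plus 8 bytes of pointer or property overhead).

```python
# Back-of-envelope adjacency-list sizes, assuming (my assumption) that each
# undirected edge is stored in both endpoints' lists at ~16 bytes per entry.
BYTES_PER_ENTRY = 16

for name, edges in [("social network", 100e9),
                    ("web",            1e12),
                    ("brain",          100e12)]:
    size_bytes = edges * 2 * BYTES_PER_ENTRY
    print(f"{name:15s} {size_bytes / 2**40:8.2f} TiB")

# social network     2.91 TiB   (the ~2.92 TB quoted above)
# web               29.10 TiB   (the ~29.5 TB quoted below)
# brain           2910.38 TiB   (i.e. ~2.84 PiB, the brain-scale figure below)
```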
You could expect that one billion perhaps to get much larger, maybe five or six billion at some point, and then it would overwhelm our current computers. If we go on to something like the Web, looking at all Web systems, you're really getting a little bit out on the edge there, because multiple websites get created for and by each person. So you're looking at a huge difference between that and just social networks. Here we're looking at 50 billion vertices and a trillion edges, and a 29.5-terabyte adjacency list, which is a challenge. You buy disks in, what, 4 terabytes? So you probably can't fit that on any one machine anyway; you'd probably have to have a large number of machines to deal with it. And so, if you're going to do some analysis like PageRank over that, so that you can answer queries, you're going to need the sort of clusters that Google has in order to provide the response. And, of course, if you want it fast, you're going to do even more work at getting it distributed.

But even that just touches on what is possible. Take the brain projects that are going on, both in Europe and here, trying to produce a map of what's inside any one particular person's brain in terms of neurons. It's thought that neurons represent the way that brains actually compose thoughts, and therefore understanding that connectivity is vital to understanding how people think, and to a lot of medicine, too. So if you go to the brain scale, you see that we far exceed the Web: we're into 100 billion vertices and 100 trillion edges, and now into 2.84 petabytes of adjacency list, far exceeding even some of our supercomputers now. Supercomputers might be up to, say, three petabytes for some of the more advanced ones, but some of the others in use today would be less than that. So you're really stretching the limits of what can actually be computed, and this is just one human brain. There are pathways, chemical pathways, inside the brain that you could also map and include, and that would increase this further. And then we have scientists working on bees, for example, wanting to relate behaviour both to neurons and to the pathways through the brain, and perhaps to some other interesting behavioural topics in biology. And there you're really exceeding any possible computation that could be done in the next 10 or 20 years.

So the scaling here is really rather dramatic, and it underlines the point that we need better graph processing; and in fact, graph processing has a lot of uses. So the next set of videos is all about what sorts of computations we can do with this graph processing. And as we go, we'll be discovering algorithms and computation structures, looking at them in terms of cloud computing and how cloud computing can help. [MUSIC]