So in this video, we're going to talk about graph analytics within the context of this big data specialization. From the previous courses, you know about the three important V's of big data. The three well-known V's are volume, velocity, and variety. We will also talk about a lesser-known V, which is called valence. What we want to talk about is what impact these things have on graph data.
So for volume, let's take a dataset like the road network of the United States. Well, that's a pretty large graph. When we say volume, we mean that the size of the graph is much larger than what you might have in the memory of a reasonable computer or real computing infrastructure. Now, we will see what impact the size of the graph has on analytic operations.
What do we mean by velocity when it comes to graphs? Well, think of Facebook again. These little graphs are updates. You write a post, then like somebody else's post, and make a comment. That's a bunch of updates that comes and adds to your graph. Then ten minutes later, you do something similar, and that also comes and adds to the graph. Then your friend does the same thing, and it adds to your graph. So as time goes by, you are sending more edges to your graph. And the speed at which you are doing this, at least for something like Facebook, can be really, really high. This is what is called streaming edges into graphs. And there can be multiple streams for various reasons.
What do we mean by variety? For graphs, it means that the graph is collecting data from various places, and all these different places are giving different kinds of information to the graph. So in the end, the graph has more non-uniform and complex information, potentially coming from multiple sources. That's what we mean by variety when we refer to graphs. That picture there, by the way, shows different kinds of protein interactions. The next one, the less-known one, is valence.
Now, if you remember your chemistry, this comes from valence electrons, which are the electrons in an atom that are used for bonding. The other electrons are called core electrons. So the idea is that if you increase the valence of a graph, you increase the connectedness of the graph. How? We will see.
Now, graph size clearly impacts analytics. Why? First, it takes more space. But more importantly, it increases the algorithmic complexity of any operation that you want to do on the graph. We'll see an example of that, but what happens as a result is that the data-to-analysis time becomes high. So I put in some data, and I want to do this analysis. But there is so much data that my analysis takes way longer than it should.
Let's give a simple example, an example we have seen before. Remember, we had this little graph from our biological example where we were asking to find a simple path between Alzheimer's Disease and Colorectal Cancer. And in this case, the result is obvious. Now, let's pause and ask. There are two nodes that I mentioned, in this case, my Colorectal Cancer and Alzheimer's Disease nodes. And we are asking, is there a simple path connecting them? This is called a decision problem. I give you some data, and I'm asking, does such a simple path exist or not? But this is actually a very hard decision problem. And the computer scientists will tell you that this is a very complicated problem because it has a very high complexity.
Let's ask another question. Now I want to count: how many simple paths exist between these two nodes? Indeed, it is another hard computing problem. And if you really want to know, the size of the result in the worst case is exponential in the number of nodes. So if we increase the number of nodes and edges, that is, if we increase the size of the graph, such a seemingly simple question can take a very, very, very long time.
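To make the growth in result size concrete, here is a minimal Python sketch that counts simple paths between two nodes by brute-force depth-first search. It is only an illustration under stated assumptions, not the method used in this course: the function names `count_simple_paths` and `complete_graph` are hypothetical, and the complete graph (every node connected to every other node) is chosen just because it behaves like a worst case.

```python
def count_simple_paths(adj, source, target):
    """Count paths from source to target that never repeat a node.

    adj is assumed to be a dict mapping each node to a list of neighbors.
    This is brute-force enumeration, so it is only usable on tiny graphs.
    """
    visited = {source}

    def dfs(node):
        if node == target:
            return 1
        total = 0
        for nxt in adj[node]:
            if nxt not in visited:
                visited.add(nxt)      # extend the current path
                total += dfs(nxt)
                visited.remove(nxt)   # backtrack
        return total

    return dfs(source)


def complete_graph(n):
    """Hypothetical helper: n nodes, each adjacent to all the others."""
    return {u: [v for v in range(n) if v != u] for u in range(n)}


# Count simple paths between the first and last node as the graph grows.
for n in range(3, 9):
    print(n, count_simple_paths(complete_graph(n), 0, n - 1))
```

On these small inputs the count goes 2, 5, 16, 65, 326, 1957 as the graph grows from 3 to 8 nodes, so adding a single node multiplies the number of paths several times over.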
It becomes almost practically impossible to compute it for a really large graph if we have no other supporting information. That's the worst case. But when we say algorithmic complexity increases, that's what we mean.
Let's talk about velocity, and I said our favorite example is Facebook. So we are adding a bunch of updates, which means we are adding a bunch of edges. We are streaming the edges into the data, and we want to compute a metric. We want to see what the shortest distance is between person a and person b, or item a and item b. Or I want to know about communities: Facebook has communities, and Twitter has communities, like we saw. How many people are out there in these communities, and how many such communities are there, like Facebook groups? Now, if you want to compute this metric, and you get these edges very fast, it is very difficult to know when you have the answer. Because you are going to get an increasing number of edges in the system, and you keep computing this metric that you want to find the answer for, and it will turn out that your continuous stream does not fit in memory. Your memory is limited compared to the number of edges, or edge updates, you are streaming into the system. So that's what happens when you have high-velocity information. Very soon, your memory runs out, and you want to compute your answer right now from the data that you have.
Okay, let's look at variety, also known as heterogeneity. There are two aspects of heterogeneity. One, we have already mentioned: graph data is often created through integration, like we saw in the case of the biology example. And therefore, the variety comes because the nature of the data is very different. Also, they may not all be the same kind of data. For example, the data may come from a relational database, or it may come from an XML database. It may come from another graph. It may come from a document.
It may even come from complex things like social networks, citation networks between papers or patents, interaction networks, or web entities, which are connected through links. And it may come from human knowledge that has been represented as graphs through ontologies. So across all of these graphs, the nodes and the edges do not mean the same thing. And somehow in there, you need to capture what it means to have an edge, because that will determine what you can do with the edge.
A simple example, in an ontology, is something that says a is a b, and b is a c, so a is a c. The "a is a c" is an inference that you make, given the other two relationships. What would be an example? My pet is a dog, and the dog is a mammal, therefore, my pet is a mammal. You want to do inferences for some edges like "is a". Now, you need to know this. You do not do this with the biology example where you are looking at genes and proteins, because that operation does not make sense when you have genes and proteins. So therefore, every graph may have different semantics. And what happens with variety is that the number of sub-semantics and the number of valid operations that you can do changes and becomes more complex.
Now, valence, I said, is about connectedness. It is also about interdependency among data. So if I have a higher valence, it means I have more data elements that are more strongly related, and these relationships can be exploited. In most cases, the part where valence becomes important is that it increases over time, which means parts of the graph become denser, and the average distance between node pairs decreases. Let me show you. Here is my Gmail. I have plotted my Gmail graphs from 2006 to about two months back. When I first started using it, I had these users, a very few users, and they were not really related. Now with time, more and more people started communicating with me through Gmail.
And more and more of these people were also talking amongst themselves and copying me and responding to me. By the end, you would see that you can find dense groups within my Gmail, because the information and the connectedness between people have evolved and become more dense over time.
This is the phenomenon of valence, and this is very important to study because you want to study things like, what parts of the graph have become more dense? And why have they become more dense? Maybe something was going on. Maybe there was an event that brought these people together, and you want to analyze that and find that out from your graph analytics. That's why you want to understand the effect of valence. You also want to understand what to do if the graph becomes very, very dense in a place, so that finding a path through that dense space becomes very hard. You will see in a later video that when this happens, the computer system that is trying to process these graphs in a parallel and distributed way has to do something special to handle this increasing density in parts of the graph.
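As a rough illustration of what increasing valence looks like in numbers, the following Python sketch compares two snapshots of a small, made-up email graph. It measures density (the fraction of possible edges that are present) and the average shortest-path distance between connected node pairs. Everything here is an assumption made for illustration: the node names, the two edge lists, and the helper functions `density` and `average_distance` are hypothetical and do not come from the Gmail data shown in the video.

```python
from collections import deque


def density(nodes, edges):
    """Fraction of all possible undirected edges that are present."""
    n = len(nodes)
    return 2 * len(edges) / (n * (n - 1)) if n > 1 else 0.0


def average_distance(nodes, edges):
    """Mean shortest-path length (in hops) over all connected node pairs."""
    adj = {u: set() for u in nodes}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    total = pairs = 0
    for start in nodes:
        # Breadth-first search gives hop distances from `start`.
        dist = {start: 0}
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs if pairs else float("inf")


# Hypothetical snapshots: first everyone only writes to "me",
# later some of the contacts also start writing to each other.
nodes = ["me", "a", "b", "c", "d"]
early = [("me", "a"), ("me", "b"), ("me", "c"), ("me", "d")]
later = early + [("a", "b"), ("b", "c"), ("c", "d")]

for label, edges in [("early", early), ("later", later)]:
    print(label, density(nodes, edges), average_distance(nodes, edges))
```

In this toy case the density rises from 0.4 to 0.7 while the average distance drops from 1.6 to 1.3, which is the pattern described above: the graph gets denser and node pairs get closer together over time.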