Network analysis is one of my favorite topics in the whole field of data analysis, and I'm excited to share this path to network analytics with you. We will start with network descriptive statistics. They are very different from traditional methods: we do not summarize a network with an average the way traditional statistics does. Instead, we talk about networks in terms of connectivity, and connectivity determines how nodes can reach each other in the network. Because nodes are connected, it is natural to describe a network in terms of walks between different nodes.

Let's start with walks, trails and paths. A walk in a graph is a sequence of nodes and lines, starting and ending with nodes, in which each node is incident with the lines following and preceding it in the sequence. Remember the definition of incidence? The length of a walk is the number of occurrences of lines in it. When we determine how long a walk is, we count lines, not nodes, and a line that is used twice is counted twice.

Trails and paths are a little bit different. A trail is a walk in which all of the lines are distinct, even though some nodes may be included more than once. That's why I said we count lines: in a trail, several different lines can pass through the same node. A path is a walk in which all nodes and all lines are distinct. We do not repeat the same node and we do not repeat the same line. The length of a path is again the number of lines in it.

The next concept that logically flows from walks and paths is distance. The distance between two nodes is the length of the shortest path between them. We also call a shortest path a geodesic. Here in this picture, we have multiple paths from node 1 to node 8 or vice versa. Some of them are as long as seven links, but the distance between node 8 and node 1 is one.
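To make the definitions of walk, trail, path, and distance concrete, here is a minimal Python sketch. The picture from the lecture is not reproduced here, so the eight-node ring below is an assumption chosen only because it gives node 1 and node 8 both a direct link and a seven-link path. The function names `classify`, `walk_length`, and `distance` are hypothetical helpers, not part of any library.

```python
from collections import deque

# Assumed example graph (not the lecture's picture): a ring of eight nodes.
edges = [(1, 2), (2, 3), (3, 4), (4, 5), (5, 6), (6, 7), (7, 8), (8, 1)]


def build_adjacency(edge_list):
    """Build an undirected adjacency mapping: node -> set of neighbours."""
    adjacency = {}
    for u, v in edge_list:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)
    return adjacency


def classify(adjacency, nodes):
    """Label a node sequence as 'path', 'trail', 'walk', or 'not a walk'."""
    steps = list(zip(nodes, nodes[1:]))
    # Every consecutive pair must be joined by a line, otherwise it is no walk.
    if any(v not in adjacency.get(u, set()) for u, v in steps):
        return "not a walk"
    lines = [frozenset(step) for step in steps]
    if len(set(lines)) < len(lines):
        return "walk"   # at least one line is repeated
    if len(set(nodes)) < len(nodes):
        return "trail"  # lines are distinct, but a node is repeated
    return "path"       # all nodes and all lines are distinct


def walk_length(nodes):
    """Length counts lines, not nodes."""
    return len(nodes) - 1


def distance(adjacency, source, target):
    """Length of a shortest path (geodesic) via breadth-first search.

    Returns None when the target cannot be reached from the source.
    """
    seen = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            return seen[node]
        for neighbour in adjacency.get(node, ()):
            if neighbour not in seen:
                seen[neighbour] = seen[node] + 1
                queue.append(neighbour)
    return None


adjacency = build_adjacency(edges)
long_way = [1, 2, 3, 4, 5, 6, 7, 8]
print(classify(adjacency, long_way), walk_length(long_way))
print(classify(adjacency, [1, 2, 1]))
print(classify(adjacency, [1, 2, 3, 4, 5, 6, 7, 8, 1]))
print(distance(adjacency, 1, 8))
```

In this assumed graph the long way round from node 1 to node 8 is a path of length seven, going back and forth over the same line is only a walk, the full loop is a trail because node 1 appears twice, and the breadth-first search reports a distance of one.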
For node 8 and node 1, that shortest path is the direct link between them.

Another important concept used all the time in network analysis is the degree of a node. The degree of a node, denoted d(ni), is the number of lines incident with it. In a directed graph, incident lines may come into a node or emanate from it, so we distinguish between in-degree and out-degree. The in-degree of a node is the number of nodes that are adjacent to it, that is, the number of arcs terminating at it. The out-degree is the number of nodes adjacent from it, that is, the number of arcs originating from it.

The discussion of degree leads to other important network concepts, the most important of which is network density. The density of a graph is the proportion of all possible lines that are actually present in the graph. The density, which we denote by Delta, is calculated as the number of ties that are present divided by the number of all possible ties for the given number of nodes. Network density can be calculated for the whole graph or for parts of the graph, the subgraphs. Let's define those.

If all lines are present, then all nodes are adjacent, and the graph is said to be complete. But much of social network analysis involves the study of smaller pieces of a network, particularly those that arise from using graph-theoretic ideas to split up a graph. We therefore define the notion of a subgraph. A graph Gs is a subgraph of G if the set of nodes in Gs is a subset of the nodes in G and the set of lines in Gs is a subset of the lines in G. There are a variety of kinds of subgraphs. We have plain subgraphs; node-generated subgraphs, where we select the nodes that we want and keep all the lines between them; and line-generated subgraphs, where we first select the lines and then pick up the nodes incident with them.
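Degree, density, and node-generated subgraphs can be sketched in a few lines of Python. The four-node directed graph below is invented for illustration, and the helper names are hypothetical. The sketch assumes a simple graph (no loops, no repeated arcs), so the number of possible ties among g nodes is g(g-1) when ties are directed and g(g-1)/2 when they are undirected.

```python
def degree_counts(nodes, arcs):
    """Return (in_degree, out_degree) dictionaries for a directed graph."""
    in_degree = {n: 0 for n in nodes}
    out_degree = {n: 0 for n in nodes}
    for source, target in arcs:
        out_degree[source] += 1  # arc originates from `source`
        in_degree[target] += 1   # arc terminates at `target`
    return in_degree, out_degree


def density(n_nodes, n_lines, directed=False):
    """Ties present divided by ties possible for the given number of nodes."""
    if n_nodes < 2:
        return 0.0  # no tie is possible; returning 0.0 is a convention chosen here
    possible = n_nodes * (n_nodes - 1)
    if not directed:
        possible //= 2
    return n_lines / possible


def node_generated_subgraph(chosen_nodes, arcs):
    """Keep every arc whose two endpoints are both among the chosen nodes."""
    chosen = set(chosen_nodes)
    return [(u, v) for u, v in arcs if u in chosen and v in chosen]


# Assumed example data, not taken from the lecture.
nodes = ["a", "b", "c", "d"]
arcs = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "a"), ("d", "a")]

in_degree, out_degree = degree_counts(nodes, arcs)
whole_density = density(len(nodes), len(arcs), directed=True)

sub_nodes = ["a", "b", "c"]
sub_arcs = node_generated_subgraph(sub_nodes, arcs)
sub_density = density(len(sub_nodes), len(sub_arcs), directed=True)

print(in_degree, out_degree)
print(whole_density, sub_density)
```

Here the whole graph has 5 of 12 possible arcs, while the subgraph generated by a, b, and c has 4 of 6, so a dense pocket can sit inside a sparser network.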
The density of a subgraph is calculated the same way, except that the denominator is the number of ties that are possible within the subgraph.

The next important concepts are network reachability and connectivity. Walks, trails, and paths lead us to some very important graph-theoretic questions, for example: are two nodes reachable? If there is a path between nodes ni and nj, then ni and nj are said to be reachable. A graph is connected if there is a path between every pair of nodes in the graph, even if that path goes through other nodes. A and C do not need to be connected directly for us to say that the graph is connected. It is sufficient that they are connected through another node, node B.

We also single out what are called components. The maximal connected subgraphs of a graph are called components. Maximal means that no further node can be added while keeping the subgraph connected, so a single node not connected to anyone else in the network is still a component.

We have already defined geodesic and distance: a shortest path between two nodes is referred to as a geodesic, and its length is the distance between node i and node j.

We have more concepts to talk about. The first one is a node's eccentricity. The eccentricity of a node is the largest geodesic distance between that node and any other node. In other words, the eccentricity of a node ni in a connected graph is equal to the maximum distance from ni to any other node in the network. Imagine that node a can reach node b and node c, that reaching b takes two links and reaching c takes three, and that no node is farther from a than c. The eccentricity of a would then be three.

We also define the diameter of a graph. The diameter of a connected graph is the length of the longest geodesic between any pair of nodes or, equivalently, the largest nodal eccentricity.
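Reachability, components, eccentricity, and diameter all follow from shortest-path distances, so one breadth-first search is enough to sketch them. The small graph below is an assumption built to mirror the spoken example (a reaches b in two links and c in three), with an extra intermediate node x and an isolated node z added for illustration. The function names are hypothetical, and eccentricity is computed within a node's own component, since the definition assumes a connected graph.

```python
from collections import deque


def bfs_distances(adjacency, source):
    """Geodesic distance from `source` to every node reachable from it."""
    distances = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbour in adjacency[node]:
            if neighbour not in distances:
                distances[neighbour] = distances[node] + 1
                queue.append(neighbour)
    return distances


def components(adjacency):
    """Maximal connected subgraphs, returned as a list of node sets."""
    seen = set()
    found = []
    for node in adjacency:
        if node not in seen:
            component = set(bfs_distances(adjacency, node))
            seen |= component
            found.append(component)
    return found


def eccentricity(adjacency, node):
    """Largest geodesic distance from `node` to any node it can reach."""
    return max(bfs_distances(adjacency, node).values())


def diameter(adjacency, component):
    """Largest nodal eccentricity among the nodes of one component."""
    return max(eccentricity(adjacency, node) for node in component)


# Assumed example: the chain a - x - b - c, plus the isolated node z.
adjacency = {
    "a": {"x"},
    "x": {"a", "b"},
    "b": {"x", "c"},
    "c": {"b"},
    "z": set(),
}

parts = components(adjacency)
main_part = max(parts, key=len)
print(parts)
print(eccentricity(adjacency, "a"), diameter(adjacency, main_part))
```

In this assumed graph node z forms a component on its own, the eccentricity of a is three, and the diameter of the larger component is also three because a and c are its most distant pair.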
To find the diameter, we simply find the node with the largest eccentricity, and that eccentricity is the diameter of the graph. If the largest number of links separating any two nodes is eight, the diameter of our graph is eight. Remember that this number of links has to be measured along a geodesic, the shortest path. The diameter of a subgraph is calculated the same way, but on the subgraph, just a part of the network.

The connectivity of a graph is a function of whether the graph remains connected when nodes and lines are deleted. It's entirely possible, especially in longitudinal contexts, that nodes that were present at the first data collection point are no longer present at the second. We lose a node. Would the graph remain connected if we removed that node?

Here we define more important concepts. A cutpoint is the first one. A cutpoint is a node whose removal from the graph increases the number of components. In other words, if we remove the node, does the graph break up? Remember the Medici picture we saw in our first lecture, and how the Medici were connecting all those people together. When we removed the Medici, we ended up with several components: the network became disconnected, so the Medici were a cutpoint.

We also define the concept of a bridge. It is similar to a cutpoint, except that it applies to links. A bridge is an edge whose removal from a graph increases the number of components; in a connected graph, removing it splits the graph into two components. An edge cutset is a collection of edges whose removal disconnects the graph. Sometimes a graph stays connected if you remove one tie, but becomes disconnected if you remove a second one. The cutset in that case includes two ties. A local bridge of degree k is an edge whose removal causes the distance between the endpoints of the edge to be at least k. That's the formal definition.

Then we also define node connectivity and line connectivity.
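The definitions of cutpoint and bridge translate directly into a brute-force check: delete one node or one edge, count the components again, and see whether the count went up. The sketch below does exactly that in Python. The five-node graph is an invented example with a node called "hub" playing a connecting role; it is not the Medici data from the lecture, and the function names are hypothetical. Faster depth-first-search algorithms exist for large graphs, but the direct version stays closest to the definitions.

```python
def count_components(nodes, edges):
    """Count the connected components of an undirected graph."""
    adjacency = {n: set() for n in nodes}
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)
    seen = set()
    count = 0
    for start in nodes:
        if start in seen:
            continue
        count += 1  # a new, not yet visited component starts here
        seen.add(start)
        stack = [start]
        while stack:
            node = stack.pop()
            for neighbour in adjacency[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
    return count


def cutpoints(nodes, edges):
    """Nodes whose removal increases the number of components."""
    base = count_components(nodes, edges)
    found = []
    for removed in nodes:
        rest_nodes = [n for n in nodes if n != removed]
        rest_edges = [(u, v) for u, v in edges if removed not in (u, v)]
        if count_components(rest_nodes, rest_edges) > base:
            found.append(removed)
    return found


def bridges(nodes, edges):
    """Edges whose removal increases the number of components."""
    base = count_components(nodes, edges)
    return [
        edge
        for i, edge in enumerate(edges)
        if count_components(nodes, edges[:i] + edges[i + 1:]) > base
    ]


# Assumed example: a triangle hub-a-b with a tail hub-c-d hanging off it.
nodes = ["hub", "a", "b", "c", "d"]
edges = [("hub", "a"), ("hub", "b"), ("a", "b"), ("hub", "c"), ("c", "d")]

print(cutpoints(nodes, edges))
print(bridges(nodes, edges))
```

In this assumed graph the cutpoints are hub and c, and the bridges are the two ties on the tail. The three ties of the triangle are not bridges, because removing any one of them still leaves a way around.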
The node connectivity of a graph is the minimum number, k, of nodes that must be removed to make the graph disconnected. If removing just one node, the Medici, is enough to disconnect the graph, then the node connectivity is one. The line connectivity is the same thing, but with respect to lines. It is the minimum number, l, of lines that must be removed in order to disconnect the graph or leave a trivial graph.

I have provided you with lots of definitions and, guess what, many more are to come. In networks, we measure position. The local, or node-level, measures of position are degree and centrality. We also measure position at the subgraph, or community, level, and those measures are cliques and components. And we measure it at the network, or global, level, with summaries of structural information such as density and centralization.

Notice that these network-theoretic concepts are quite complex, and they are very important for network analysis. But I find that it is difficult, if not impossible, to study networks on theory alone. Therefore, before we talk any more about theory, let's work with network data. I'm very excited to introduce the first case study, the story of a high turnover. This is a case study for which we collected data from a real-life company. It will become the foundation for many other concepts that we will learn in this course.