In previous modules in this course, we've talked a lot about statistical graphics and different data sources, we've talked about multivariate datasets, how to explore different methods there, looking at things like clustering and classification. And in this lecture, we want to talk about a different data structures specifically hierarchical data and we want to learn how to apply methods of hierarchical data analysis. So hierarchies by definition are data repository in which cases are related to sub-cases, really we can think of this as imposing an ordering in which cases are parents or ancestors of other cases. You can think of this like your file system on your computer, you have the C drive. In the C drive, you save a bunch of different files. Inside of those files, you may save others. For example, on your C drive, you may have a folder for movies, inside of movies, you may break this down into genres such as action movies, comedies, and those things. And inside of those, you might have files for all the different movies that are comedies all the different movies that are action and so forth. And we can represent this as a tree, you can think of this like a family tree too. If this is you, this is your father and mother, you got your grandfather, grandmother and so forth and you can think about what this tree looks like. So there's a lot of data structures that a lot of datasets that have this hierarchical structure. We can think of this for genome data, we could think of this in a workplace environment you have the boss, you have the next level of employees and so forth down the tree. So we want to think about what is special and unique about these type of datasets and what methods can we apply to these. Now, the main way that hierarchies are represented as often as trees these are directed, they're acyclic so meaning they have no cycles and we typically visualize these using two main schemes, either node-link diagrams which are the most common or space-filling diagrams which look a lot like the mosaic plots that we talked about previously in the multivariate statistics module. So typically, for the node-link diagrams for hierarchical structures, we might use something like a rooted tree. So a hierarchical structure is again just an acyclic graph and this graph might be used to represent this type of hierarchy so we're using this tree metaphor. So, we have our initial level zero, so if we think back to this as our example of the C drive, this might be our C drive, here this might be our movie folder, here maybe we have a music folder as well and inside of our movie folder maybe we've just saved three different movies. So we save movie one, two, and three. And inside of music, maybe we're starting to think about some different genres. So we have our first folder for genre one, and then we have some music related to genre one here, so song A and song B. So, in this rooted tree diagram utilizes what we call nodes and links and these nodes or vertices are placed along horizontal lines according to their level and so we try to separate these levels by a fixed distance and we're trying to make this so the width of this drawing is as small as possible, so we can fit as much as possible onto the screen. So, for these structures like I said we can think about your file maker, your C drive, we can think about how we can create these trees for this and you can even see this tree structure in your file structure as well, where we have perhaps the node and then we can see the sub nodes and then the children as we go through the tree and this is essentially a top-down approach where we're taking the width of the fan out is going to use this horizontal real estate up really quickly. So we want to think about very carefully how to try to organize this information. Another problem is a tree could grow very long in one branch. So for example, we could keep expanding and just one branch and get very deep very quickly, depending on how things work out and so essentially then you can wind up with a whole bunch of white space here in the middle, leaving a lot of screen real estate empty. So these are some problems with the node-link top-down approach, so people have also tried thinking about how we can create things like space trees, where we try to overcome some of these issues in node-link diagrams. So Plaisant et al introduced the space tree where we have dynamic rescaling of branches to best fit the available screen space. So trying to think about Preview icons to summarize the branch topology and so here we can see little previews trying to show us what tree structures might be under there and trying to think about how we can better keep things compact, while allowing people to see and understand what elements might be line beneath these different hierarchical branches. Then people have also thought about, "Well, in 2D, we wind up taking up all this space, can we reorganize things perhaps in 3D." We've seen lots of different circular layouts and things. In 3D, we can think about this this a cone tree, so we add a third dimension now and essentially we have our root node here and we can lay out our children in circular patterns around this cone and then as we go down, we can continue this structure. So siblings live in, one of the 2D planes. So there's been a whole lot of different methods created. If we're trying to think about how we can draw and visualize this hierarchical data structure to learn about different groups and elements. Now, the thing though is not all data has a hierarchy imposed on it, but we can also think about using some different data exploration techniques to hierarchically cluster our data. So in previous lectures, we learned about k-means for doing unsupervised learning for clustering and organizing data. So we also want to think about is, if I'm just given a dataset can I also think about how I might be able to organize it in some hierarchical fashion to be able to use these visualizations for a different dataset and again with space trees, we don't have to constrain on top-down approach. With space trees, we're trying to fill up as much of the screen as possible, so remember when we first started we talked about our node-link diagrams and we're trying to keep the levels all filled out and have the same distance between them, but people can think about again these radial layouts where we don't have to constrain our geometry to this top-down approach, but we can apply a hyperbolic transformation. Now, the distance between parent and child decreases as you move further from the center, so the trick now is that the children are going to a wedge from the center, but we can follow our tree down so we can find the children, but notice now the heights might change. So interpreting this diagram may be a little bit harder than the original node-link diagram we've talked about. So these are the trade-offs between thinking about different ways we can organize the different information still showing the tree structure, but trying to maximize the amount of space that are being seen. Now, the shortcomings for node-link diagrams, it becomes difficult to encode more variables of data cases. So, position on the screen forces us to choose a point for the node. So we lose two of virtual visual variables in X and Y for position, we can try to encode different information in the children using things like shape, size, color but all those might clash with the basic node-link structure. So we have to think about what our trade offs are, how much data can we encode, because these elements here, remember if this is one of our files that this is your song number one, this song has its own properties too, this has a length of how long that song is, it may have an artist associated with it. Now, of course, you might use the artist as a type of hierarchy in your diagram, but you may have all different properties, you might want to encode in a node and so thinking about how we can do this, thinking now about how we can organize this and what the trade-offs are becomes really critical. So in the next set of modules, we're going to talk about how we can take any general data and do some hierarchical analysis on it for exploration, as well as talking about different methods for visualizing these hierarchical data sources. Thank you.