Data repositories in which cases are related to subcases are identified as hierarchical. This course covers the representation schemes of hierarchies and algorithms that enable analysis of hierarchical data, as well as provides opportunities to apply several methods of analysis.

Associate Professor at Arizona State University in the School of Computing, Informatics & Decision Systems Engineering and Director of the Center for Accelerating Operational Efficiency School of Computing, Informatics & Decision Systems Engineering

K. Selcuk Candan

Professor of Computer Science and Engineering Director of ASU’s Center for Assured and Scalable Data Engineering (CASCADE)

In previous modules in this course,

we've talked a lot about statistical graphics and different data sources,

we've talked about multivariate datasets,

how to explore different methods there,

looking at things like clustering and classification.

And in this lecture,

we want to talk about a different data structures

specifically hierarchical data and we want to

learn how to apply methods of hierarchical data analysis.

So hierarchies by definition are data repository in which cases are related to sub-cases,

really we can think of this as imposing an ordering in

which cases are parents or ancestors of other cases.

You can think of this like your file system on your computer,

you have the C drive.

In the C drive, you save a bunch of different files.

Inside of those files, you may save others.

For example, on your C drive,

you may have a folder for movies, inside of movies,

you may break this down into genres such as action movies,

comedies, and those things.

And inside of those,

you might have files for all the different movies that are comedies

all the different movies that are action and so forth.

And we can represent this as a tree,

you can think of this like a family tree too.

If this is you,

this is your father and mother, you got your grandfather,

grandmother and so forth and you can think about what this tree looks like.

So there's a lot of data structures that a lot

of datasets that have this hierarchical structure.

We can think of this for genome data,

we could think of this in a workplace environment you have the boss,

you have the next level of employees and so forth down the tree.

So we want to think about what is special and unique

about these type of datasets and what methods can we apply to these.

Now, the main way that hierarchies are represented as often as trees these are directed,

they're acyclic so meaning they have no cycles and we

typically visualize these using two main schemes,

either node-link diagrams which are the most common or space-filling diagrams which look

a lot like the mosaic plots that we talked

about previously in the multivariate statistics module.

So typically, for the node-link diagrams for hierarchical structures,

we might use something like a rooted tree.

So a hierarchical structure is again just an acyclic graph

and this graph might be used to

represent this type of hierarchy so we're using this tree metaphor.

So, we have our initial level zero,

so if we think back to this as our example of the C drive,

this might be our C drive,

here this might be our movie folder,

here maybe we have a music folder as well and

inside of our movie folder maybe we've just saved three different movies.

So we save movie one, two, and three.

And inside of music,

maybe we're starting to think about some different genres.

So we have our first folder for genre one,

and then we have some music related to genre one here,

so song A and song B.

So, in this rooted tree diagram utilizes what we call nodes and links

and these nodes or vertices are placed along

horizontal lines according to their level and so we try to separate these levels

by a fixed distance and we're

trying to make this so the width of this drawing is as small as possible,

so we can fit as much as possible onto the screen.

So, for these structures like I said we can think about your file maker, your C drive,

we can think about how we can create these trees for this and you can even see

this tree structure in your file structure as well,

where we have perhaps the node and then we can see the sub nodes

and then the children as we go through the tree and this is

essentially a top-down approach where we're taking

the width of the fan out is going to use this horizontal real estate up really quickly.

So we want to think about very carefully how to try to organize this information.

Another problem is a tree could grow very long in one branch.

So for example, we could keep expanding

and just one branch and get very deep very quickly,

depending on how things work out and so essentially

then you can wind up with a whole bunch of white space here in the middle,

leaving a lot of screen real estate empty.

So these are some problems with the node-link top-down approach,

so people have also tried thinking about how we can create things like space trees,

where we try to overcome some of these issues in node-link diagrams.

So Plaisant et al introduced the space tree where we have

dynamic rescaling of branches to best fit the available screen space.

So trying to think about Preview icons to

summarize the branch topology and so here we can see

little previews trying to show us what tree structures might be under

there and trying to think about how we can better keep things compact,

while allowing people to see and understand what elements might

be line beneath these different hierarchical branches.

Then people have also thought about, "Well, in 2D,

we wind up taking up all this space,

can we reorganize things perhaps in 3D."

We've seen lots of different circular layouts and things.

In 3D, we can think about this this a cone tree,

so we add a third dimension now and essentially we have our root node here and we can lay

out our children in circular patterns around this cone and then as we go down,

we can continue this structure.

So siblings live in,

one of the 2D planes.

So there's been a whole lot of different methods created.

If we're trying to think about how we can draw and visualize

this hierarchical data structure to learn about different groups and elements.

Now, the thing though is not all data has a hierarchy imposed on it,

but we can also think about using

some different data exploration techniques to hierarchically cluster our data.

So in previous lectures,

we learned about k-means for doing

unsupervised learning for clustering and organizing data.

So we also want to think about is,

if I'm just given a dataset can I also think about

how I might be able to organize it in some hierarchical fashion to be

able to use these visualizations for a different dataset and again with space trees,

we don't have to constrain on top-down approach.

With space trees, we're trying to fill up as much of the screen as possible,

so remember when we first started we talked about our node-link diagrams and

we're trying to keep the levels all filled out and have the same distance between them,

but people can think about

again these radial layouts

where we don't have to constrain our geometry to this top-down approach,

but we can apply a hyperbolic transformation.

Now, the distance between parent and child decreases as you move further from the center,

so the trick now is that the children are going to a wedge from the center,

but we can follow our tree down so we can find the children,

but notice now the heights might change.

So interpreting this diagram may be a little bit

harder than the original node-link diagram we've talked about.

So these are the trade-offs between thinking about different ways we can

organize the different information still showing the tree structure,

but trying to maximize the amount of space that are being seen.

Now, the shortcomings for node-link diagrams,

it becomes difficult to encode more variables of data cases.

So, position on the screen forces us to choose a point for the node.

So we lose two of virtual visual variables in X and Y for position,

we can try to encode different information in the children using things like shape,

size, color but all those might clash with the basic node-link structure.

So we have to think about what our trade offs are,

how much data can we encode,

because these elements here,

remember if this is one of our files that this is your song number one,

this song has its own properties too,

this has a length of how long that song is,

it may have an artist associated with it.

Now, of course, you might use the artist as a type of hierarchy in your diagram,

but you may have all different properties,

you might want to encode in a node and so thinking about how we can do this,

thinking now about how we can organize this and what

the trade-offs are becomes really critical.

So in the next set of modules,

we're going to talk about how we can take

any general data and do some hierarchical analysis on it for exploration,

as well as talking about different methods for

visualizing these hierarchical data sources. Thank you.

Explora nuestro catálogo

Inscríbete de manera gratuita y obtén recomendaciones personalizadas, actualizaciones y ofertas.