The primary topics in this part of the specialization are: greedy algorithms (scheduling, minimum spanning trees, clustering, Huffman codes) and dynamic programming (knapsack, sequence alignment, optimal search trees).

Loading...

Del curso dictado por Stanford University

Greedy Algorithms, Minimum Spanning Trees, and Dynamic Programming

646 calificaciones

The primary topics in this part of the specialization are: greedy algorithms (scheduling, minimum spanning trees, clustering, Huffman codes) and dynamic programming (knapsack, sequence alignment, optimal search trees).

De la lección

Week 2

Kruskal's MST algorithm and applications to clustering; advanced union-find (optional).

- Tim RoughgardenProfessor

Computer Science

So I don't blame you if you're feeling a little bit restless.

You've now watched three videos about this alternative approach

to the union find data structure based on lazy unions, and

frankly we don't have many changeable things to show for it.

We already had a perfectly fine implementation of this data structure

based on eager unions that gave us constant time fines, and yeah,

union could be linear in the worst case,

but it was logarithmic on average over sequence of unions.

So now we have this other implementation.

Both of our operations are requiring logarithmic time.

It's true it's a worst case bound, and it's true union by rank is pretty cool.

But still, the bottom line up to this point is not that compelling.

But I've still got another trick up my sleeve.

Now just watch what happens once we employ a second optimization known as

path compression.

So the optimization is very natural.

I suspect many a serious programmer would come up with this on their

own if they were tasked with implementing union find with union by rank.

So to motivate this optimization let's think about what would be our worst

nightmare as someone maintaining such a data structure and hoping for

good performance.

Well, remember the running time of a find is proportional to the number of parent

pointers you have to traverse.

That is disproportional to the depth of the object at which it's invoked.

So, what's the worst case find look like?

Well it's going to be the leaf.

And moreover, it's going to be a leaf,

which is furthest away from its corresponding roots.

Now, on the one hand we're using union by rank, so

we know that this depth can't be too bad.

It's going to be at most big 0 of log n.

However, there will be example there will,

in general, be leafs that are theta of log n hops away from their root.

And, for all we know, we're just going to get this endless sequence of find

operations, where every single one is invoked on

a leaf that's a log n number of hops away from from its root.

So for example, in the pink tree that is shown in the upper right of the slide

maybe someone keeps searching for the object,

keeps invoking find from the object one over and over and over again.

And then we're going to be suffering a log number of steps with every single

operation.

But if you think about it,

it's totally pointless to keep re-doing the work of previous finds.

To keep retraversing the same parent pointers over and over again.

So for example, in the pink tree that I've shown in the upper right,

imagine that find is invoked from object one.

So then we traverse the three parent pointers.

We discover the ancestors four and six before terminating at seven.

And we discover that seven is the leader or

the root corresponding to the object one.

Now remember, we do not care that four and

six are ancestors of one, that is uninteresting.

We only visited them to discover what we actually cared about,

that seven is the leader, the root corresponding to one.

Well, but let's just cache that information now that we've computed it.

So, this is now basically reconstructing the eager union find implementation.

Let's just rewire one's parent corner to point straight to seven.

We don't need to recompute foreign six in some later find operation.

And more generally, after we've invoked FIND from object one, we may as well

rewire four's parent pointer as well, to point directly to its leader, seven.

So that then is path compression.

When FIND is invoked from some node, x, and you traverse parent pointers from x

up to its root, call it r for every object that you visit along

this path from x to r, you rewire the parent corner to point directly to r.

So r doesnt have it's parent pointer changed, it still points to itself.

The penultimate object on this path doesn't have its parent pointer changed.

It already was pointing to the root r, but

anything below the root and its immediate descendents on this path from x

will have its parent point updated and it'll be updated to point directly to r.

Now we couldn't get away with this if the tree had to be binary.

But it doesn't have to be binary.

We couldn't get away with this if we had to satisfy some other constraint like

a search tree property, but we don't.

So nothing's stopping us from just caching this information

about who everyone's leaders are.

because that's really the only information we're responsible for

exporting from this union find data structure.

So pictorially, if you like thinking about the trees, in effect,

what path compression does is make the trees shallower and more bushy.

So in our example in the upper right, it rips out one and

four and makes them immediate descendants of the root seven.

Prefer to think in terms of the array representation and remember in the array

representation, the array index for i is recording the parent of the object i.

So, in our pink tree, five, six, and seven all point to seven.

They all have parent seven.

Whereas, two and three all have parent five.

Four has the parent six and one has the parent four.

And after applying path compression following a find invoked at the object

one, one and four are redirected to have parent seven.

So what are the pros and cons of the path compression optimization?

Well, the cons side is quite minor.

Really we're just doing a little multi-tasking on find and

that introduces a small constant factor overhead.

We're already doing work proportional to the number of hops.

On the path from x to it's root and we are still just doing constant work for

each node on that path.

We're doing an extra constant work for

each node after the fact to rewire the parent pointer to point to the root.

The pro should be obvious this is going to speed up all subsequent finds operations.

You are really making sure you don't redo redundant work.

You don't traverse parent pointers over and over and over again.

So what's clear is that FINDs will speed up.

What's really not clear is whether they'll speed up by a lot.

So this one going to affect the performance of the FIND operation by say

a factor of two, where we get something fundamentally better

than the logger than performance we were stuck with without path compression.

So, let me spend a slide on the subtle point of how ranks interact with the path

compression optimization.

So, the plan is we're going to manipulate rank fields with path

compression in exactly the same way as we did when we didn't have path compression.

So how did we manipulate them previously?

Well at the birth of the union find data structure everybody has rank zero.

Ranks never change except possibly when you do a union.

When you do a union you look at the roots of the two objects whose groups you

have to union.

You look at their ranks.

If one of them has strictly bigger rank that becomes

the root of your new merged tree.

The root with the smaller rank is installed as a child underneath.

If you union two roots that have exactly the same rank,

you pick arbitrarily which of them is the new root, and you bump up its rank by one.

That is how we manipulated ranks before,

that is exactly how we're going to manipulate them now.

So in particular, when we apply path compression,

when we rewire parent pointers following a find, we do not touch anybody's ranks.

We leave them all exactly the same.

So again, ranks only change in one specific case in a very specific way.

Namely, during a union when we merge two trees that have exactly the same rank.

And when we do that kind of union, that kind of merge,

whichever of the old roots winds up being the new root, we bump up its rank by one.

That is the only modification we ever make to any ranks.

Now, it might bother you that we don't recompute the ranks when we apply

path compression.

And in a way I sort of hope it does bother you because by not recomputing ranks, by

just manipulating them exactly as before, we're actually losing the semantics

of what ranks meant in the previous video, so the key invariant we had

that ranks exactly represented worst case search time to that object is now lost.

The easiest way to see what I'm talking about probably through an example, so

let me redraw the same tree we had on the previous slide, both before and

after we apply path compression.

Following a find invoked at the object one.

So what I've done here is, in the top tree, before the path compression's been

applied, I've labeled ranks just as they were in the previous video.

So for each node in the top tree, its rank is equal to the longest path.

From a leaf to that note.

Then I applied path compression from he object one

resulting in the bushier more shallow tree on the bottom.

And as I said I did not touch anybody's ranks when I applied

this path compression.

I didn't do any recomplications.

And you'll see now we've lost the semantics for the ranks.

Just to give a simple example there are now leaves that don't have rank zero,

they have ranks strictly bigger than zero.

So what we can say is that for every node the rank is an upper bound.

Not necessarily exactly the same as, but

at least as big as the longest path from a leaf to that node.

But because of the shortcuts that we've installed we no longer have the equality

that we had before.

There is good news however.

We can definitely salvage a lot of the technology that we

developed in last video for the union by rank analysis and

apply it fruitfully here to the case with path compression.

In particular, any statement that we made last video

that talks only about the ranks of objects and not about their parent pointers per se

that will be as true with path compression as without.

Why?

Well, think about making two copies of a union find data structure.

Exactly the same set of objects and they'll run them in parallel on exactly

the same sequence of union and find operations.

So, by definition,

we manipulate the ranks of objects in exactly the same way in the two copies.

It doesn't matter if there's path compression or not.

So at all moments in time, every object will have exactly the same rank in

one copy as it does in the other copy.

Path compression or no, it doesn't matter.

Exactly the same ranks.

What's going to be different in the two copies are the parent pointers.

They're, in some sense, going to be further along.

They're going to point further up into the tree in the copy with path compression.

But no matter.

Any statement that's purely about ranks, it's going to be equally well true with or

without path compression.

So in particular, remember the ranking lemma that we proved in the last video.

That says there's at most n/2 to the r objects of rank r.

That was true without path compression.

It's going to be true with path compression.

And we're going to use it in the forthcoming analysis.

Another fact which is still true with path compression,

in fact in some sense it's only more true with path compression, is that whenever

you traverse a parent corner you will get to a node with strictly larger rank.

Remember in the last video we said, as you go up toward the root,

you're going to see a strictly increasing sequence of ranks.

Here each pointer with path compression represents a sequence

of pointers without path compression.

So certainly you will still only be increasing in a rank every time you

traverse a pointer.

Let's now quantify the benefit of path compression by proving new performance

guarantees on the operations and the union find data structure.

And frankly, the running time bounds are going to blow away the logarithmic bound

that we had when we were only doing union by rank without path compression.

We're going to do the analysis in two steps, and

it's going to be historically accurate in the sense that I'm first going to show you

what theorem by Hopcroft-Ullman, which gives an excellent but

not quite optimal bound on the performance of this union find data structure.

And then we're going to build on the Hopcroft-Ullman ideas

to prove even better bound on the performance on this data structure.

So here is the performance guarantee that Hopcroft and

Ullman proved back in 1973, about 40 years ago.

They said,

consider a union find data structure implemented as we've been discussing.

Lazy unions, union by rank, path compression.

Consider an arbitrary sequence of m, find, and union operations, and

suppose the data structure contains n objects.

Then the total work that the data structure does to process this entire

sequence of m operations is m times log star of n.

What is log* of n, you ask?

Well, it's the iterated logarithm operator.

So, my analogy, remember when we first demystified the logarithm way back at

the beginning of part one, we notice the log base two of a number is well,

you just type that number in your calculator,

you count the number of times you divide by two until the result drops below one.

So, log star, you type the number n into your calculator and

you count the number of times you need to hit log before the result drops below one.

So it totally boggles the mind just how slowly this log star function grows.

In fact, let me give you a quick quiz just to make sure that you appreciate

the glacial growth of the log star function.

So the correct answer is B.

It is five.

And you should probably spend a couple seconds just reflecting on how absurd

this is.

Right? So you take this ridiculous number 2

to the 65500.

Remember the number of atoms in the known universe estimated is way smaller

than this.

Which is 10 to the 80.

So, you take this ridiculous number, and you apply the log star function.

You get back 5.

Why do you get back 5?

Well, take the logarithm once.

What do you get back?

65,536, also known as 2 to the 16, so you take log again.

You get back 16, also known as two to the four.

You take log again.

You get back four, also known as two to the two.

You take log again.

You get back a 2, and then one more time gets you down to 1.

So, you can think of log star as the inverse of the tower function,

where the tower function takes, think of it as a positive integer, t, and

it just raises two to the two to the two to the two to the two t times.

And this, in particular, shows that log star, despite its glacial growth,

is not bounded by a constant.

This function is not.

Big o of one.

For any constant c you can think of I can in principle exhibit a large enough n so

that log star of my n is bigger than your c.

Namely I can just set n to be two to the two to the two to the so on c times.

Log star of that will be at least c.

So that's the Hobcroft-Owen performance guarantee that

on average over a sequence of m union of five operations you spend

this only ridiculously small log star n amount per operation.

We're going to prove it in this next video.

Coursera brinda acceso universal a la mejor educación del mundo, al asociarse con las mejores universidades y organizaciones, para ofrecer cursos en línea.