The primary topics in this part of the specialization are: greedy algorithms (scheduling, minimum spanning trees, clustering, Huffman codes) and dynamic programming (knapsack, sequence alignment, optimal search trees).

Loading...

Del curso dictado por Stanford University

Greedy Algorithms, Minimum Spanning Trees, and Dynamic Programming

739 calificaciones

The primary topics in this part of the specialization are: greedy algorithms (scheduling, minimum spanning trees, clustering, Huffman codes) and dynamic programming (knapsack, sequence alignment, optimal search trees).

De la lección

Week 2

Kruskal's MST algorithm and applications to clustering; advanced union-find (optional).

- Tim RoughgardenProfessor

Computer Science

This video will prove the correctness of our greedy algorithm for clustering.

We'll show that it maximizes the spacing over all possible K clusterings.

You might have hoped that we could deduce the correctness of this greedy algorithm

for clustering immediately from our correctness proofs for various greedy

minimum spanning tree algorithms. Unfortunately that doesn't seem to be the

case. In the minimum cost spanning tree

problem, we're focusing on minimizing the sum of the edge cost.

Here we're looking at different objective, maximizing the spacings.

We do need to do a proof from scratch. That's said, you know, the arguments

we'll use should look familiar to you not just from the sort of exchange type

arguments when we prove the cut property, but also it might remind you even more,

going back further, to our greedy algorithms for scheduling.

So let's now set up the notation for the proof.

As usual, we're going to look at the output of r algorithm.

It achieves some objective function value, some spacing.

We're going to look at an arbitrary competitor.

Some other proposed scheduling. We're going to show that we're at least

as good, our spacing is, at least, as large.

So specifically, we'll denote the clusters in the output of r algorithm by

C1 up to CK. Our clustering has some spacing, some

distance between the near, closest pair of separated points.

Call it capital S. We're going to denote our competitor,

some alternative K clustering by C-Hat one of the C-Hat K, what is it that we're

tryin to show? We want to show that this arbitrary other

clustering has spacing no larger than R's, if we can show that, then because

this clustering was arbitrary it means the greedy clustering has spacing as

large as any other, so it's maximizing the spacing, that's what we want to

proof. But differently we want to exhibit a pair

of points separated by this cluster and C one-half to C1K, such that the distance

between those separated points is S or smaller.

So, let me just quickly depose of a trivial case.

If the C hats are the same as the C's, possibly up to a renaming, then of course

exactly the same pairs of points are separated into each of the clustering, so

that the spacing is exactly the same. So that's not a case we have to worry

about. The interesting case, then, is when the c

hats differ fundamentally from the cs, when they're not merely a permutation of

the clusters in the greedy clustering. And the maneuver we're going to do here

is similar in spirit to what we did in our scheduling correctness proof.

Way back in our scheduling correctness proof, we argued that any schedule that

differs from the greedy one, suffers from, in some sense, a local flaw.

We identified an adjacent pair of jobs that was, in some sense, out of order

with respect to the greedy ordering. The analog here is, we're going to argue

that, for any clustering which is not merely a permutation of the greedy

clustering. There has to be a pair of points which is

classified differently in the c hats relative to the c's.

By differently, I mean they're clustered together in the greedy clustering.

These points, p and q, belong to the same cluster, c sub i.

Yet, in this alternative clustering, which is not just the permutation of the

greedy clustering. They're placed in different clusters.

One, maybe p and c hat i, and q and some other c hat j.

So I want to now split the proof into an easy case and a tricky case.

To explain why the easy case is easy lets, lets observe a property that this

greedy clustering algorithm has. Now the algorithm's philosophy is that

the squeaky wheel should get the grease. That is, the separated pair of points

that are closest to each other are the ones that should get merged.

So for this reason, because it's always the closest separated pair that get

merged, if you look at the sequence of point pairs that get merged together,

that determine the spacing in each subsequent iteration, the distances

between these sort of worst separated points is only going up over time.

At the beginning of the algorithm, the closest pair of points in the entire

point set are the ones that get directly merged.

Then those are out of the picture, and now that some further away pair of

points are separated, it determines the spacing,

then they get merged. Once they've been coalesced, then there

is still some further away pair of points, which is now the smallest

separated. They get merged, and so on.

So if you look at the sequence of distances between the pairs of points

that are directly merged by the greedy algorithm, that is only going up over

time. And this sequence culminates with the

final spacing S of the greedy algorithm. At some sense, the spacing of the output

of the greedy algorithm is the distance between the point period that would get

merged if we ran the greedy algorithm one more in moderation but unfortunately

we're not allowed to do that. Okay?

So the point is, for every pair of points directly merged by the greedy algorithm,

they're always a distance at most S away from each other.

So the easy case, then, is when this pair of points, pq,

which, on the one hand, lie in a common greedy structure,

but on the other hand, in different clusters with c hats.

If they were, at some point, not merely in the same cluster, but actually

directly merged by the greedy algorithm. If, at some iteration, they determined

the spacing, and were picked by the greedy algorithm to have their, clusters

merged. Then we just argued that the distance

between p and q is no more than the space in capital s of the greedy clustering.

And since p and q lie in different clusters of the c hats.

It's separated by the C hats and therefore they upper bound the spacing of

the C hats. Maybe there's some even closer separated

pair by the C hats. But the very least P and Q are separated

so they upper bound the spacing of the C hat clustering.

So that's what we wanted to prove. We wanted to show that this alternative

spacing didn't have better spacing than our greedy spacing.

It had to be at most as big. It had to be at most capital S.

So in this easy case, when P and Q are directly merged by the greedy algorithm,

we're done. So the tricky case is when P and Q are

only indirectly merged, and you may be wondering at the moment, what does that

mean? How did two people wind up in the same

cluster if they weren't, at some point, directly merged?

So let's draw a picture and see how that can happen.

So the issue is that two points P and Q might wind up in a common greeting

cluster, not because the greedy algorithm ever

explicitly considered that point pair, but rather because of a path or cascade

of direct mergers of other point pairs. Imagine, for example, that at some

iteration of the greedy algorithm the point P was considered explicitly along

with the point A1, where here A1 is meant to be different than Q.

So that's a direct merger, and P and A1 wind up in the same cluster. Their

clusters are merged. Maybe the same thing happened to the

point Q at some point A sub L which is different than P.

Sooner or later maybe, you know, at some other time, some totally unrelated pair

of points A2 and A3 are directly merged and then at some point A1 and A2 are

considered by the greedy algorithm. Algorithm, because the other closest pair

of separated points, and, they get merged.

And so on. So the edges in this picture are meant to

indicate direct mergers, pairs of points that are explicitly fused because they

determine the spacing of some point of the greedy iteration.

But at the end of the day the greedy clustering is going to have the results

of all of these mergings. So in case you're feeling confused, let

me just point out that we really saw this exact same exact thing going on when we

were talking about minimum spanning trees in Kruskal's rhythm.

So, at an intermediate point in Kruskal's rhythm, after it's added some edges, but

before it's constructed a spanning tree. As we discussed, the intermediate state

is a bunch of different connected components.

And there are vertices that Have an edge chosen between them.

They, of course, are going to be in the same kinetic component.

But then again, a kenetic component could have long paths in it.

So you could have vertices that are in the same kinetic component in an

intermediate state of Kruskal's Algorithm, despite the fact that we've

haven't chosen an edge directly between them.

There's rather, a path of chosen edges between them.

It's exactly the same thing going on here.

Now, what we have going for us is that, if a pair of points, as discussed, was

directly merged, we know they're close. The distance between them is, at most,

this spacing, capital S. We really don't know anything, frankly,

about the distance between pairs of vertices that were not directly merged.

They just, sort of, accidentally wound up in a common cluster.

But this turns out to be good enough. This is actually sufficient to argue that

this competitor clustering with the c-hat has spacing no more than s?

No better than ours. Let's see why.

So given that P and Q are in a common greedy cluster it must mean there was a

path of direct mergers that forced them to be in the same cluster.

So let's let the intermediate points involved in that path denoted A1 of two

AL. So here's the part of the proof where we

basically reduce the tricky case to the easy case.

So we've got this pair of points, PQ. Now, remember, not, not only are they in

a common greedy cluster. But they're in different clusters in our

competitor in the c hats. So the point p is in some cluster.

Call it c hat i. And Q is in something else.

In particular, it's not in c hat i. Now, imagine you go on a hike.

You start at the point p, and you hike along this path.

You traverse these direct mergers toward q.

Now, you're starting inside c hat I, and you end up outside.

So at some point on your hike, you will traverse the boundary.

You will, for the first time, escape from c hat I, and wind up in some other

cluster. So that has to happen.

And let's call ai and ai+1 the consecutive pair of points at which you

go from inside this cluster to outside this cluster.

And now we're back in the easy case. Now we're dealing with a separated pair

that would directly merge by the greedy algorithm.

Remember that we set up this path to be a path of direct mergers so in particular,

AJ and AJ + one were direct mergers, therefore their distance is at most S.

And again, by virtue of being direct mergers, their distance is at most the

spacing of the greedy clustering and yet as a separated point by the C hats.

It's also an upper bound on the spacing of the C hats.

This means the spacing S of our greedy clustering is as good as the competitor.

Is the competitor was arbitrary or optimal.

That completes the proof.

Coursera brinda acceso universal a la mejor educación del mundo, al asociarse con las mejores universidades y organizaciones, para ofrecer cursos en línea.