In this video we'll establish the correctness of Huffman's algorithm, meaning that the greedy algorithm always computes the prefix-free binary code that minimizes the average encoding length. Let me also remind you of our expression for the average encoding length in terms of a tree, capital T. We'll be working with this expression quite a bit in this proof. So we average over the symbols: we sum over the symbols i in the alphabet, capital sigma, and we weight the various terms by the frequencies, which are given as part of the input. And remember that the encoding length that we're producing for a given symbol i is exactly the depth of the corresponding leaf in the tree, capital T.

So I quite like the proof of Huffman's theorem. It's a cool proof, and it will give us an opportunity to revisit the themes that we've been studying in proving the correctness of various greedy algorithms. At a high level, we're going to proceed by induction, induction on the size n of the alphabet sigma. So there's a vague parallel you could draw with how we proved, say, Dijkstra's algorithm correct back in part one, where by induction we showed that each successive iteration of the algorithm is correct, assuming that the previous ones were. But in the inductive step, we're going to use an exchange argument. So to justify each of our mergers, we're going to argue that any optimal solution can be massaged into one that looks more like ours without making it any worse. And that's how we argue that each of our individual mergers is, in fact, a reasonable idea, a good idea.

So, more precisely, we're going to go by induction on the size n of the alphabet. I'm going to assume, to make the problem non-trivial, that the alphabet size is at least 2. As in any proof by induction, we begin with a base case, and then we do the inductive step, wherein we get to assume the inductive hypothesis. So the base case is when we just have a two-letter alphabet. If you go back to Huffman's algorithm, you will see that it does the obvious thing in the base case: it just outputs a tree that encodes one of the symbols with the bit zero and the other symbol with the bit one. And that's the best you can do. You need at least one bit to encode every symbol, and this tree uses exactly one bit for each symbol, so Huffman's algorithm is optimal in that trivial special case.

So, for the inductive step, we focus on an arbitrary problem instance where the alphabet size is at least three. Now, of course, whenever you're doing a proof by induction, the thing you've really got going for you is the inductive hypothesis. You assume that the assertion that you're trying to prove, in this case the correctness of Huffman's algorithm, is true for all smaller values of n. That is, we're going to assume that if we invoke the algorithm on any smaller input, as of course we do in this recursive algorithm, the algorithm returns the correct solution to that smaller sub-problem. So to understand how we're going to pass from the inductive hypothesis, this assumption that we're correct on all smaller inputs, to the inductive step, the assertion that we're correct on the current input, we need to take a closer look at how the original input, with its alphabet sigma, relates to the smaller sub-problem, with the smaller alphabet sigma prime, with the two letters fused into one, which is the one that we solve correctly, by assumption, recursively. So recall our notation from the pseudocode for Huffman's algorithm.
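Just so we have it in front of us, here is that expression written out in symbols. The shorthand L(T) for the average encoding length of a tree T is chosen here just for convenience; it's simply the quantity described above.

```latex
% Average encoding length of the prefix-free code represented by tree T,
% where p_i is the given frequency of symbol i and depth_T(i) is the
% depth of the leaf labeled i in T (equivalently, the length of i's codeword).
L(T) \;=\; \sum_{i \in \Sigma} p_i \cdot \mathrm{depth}_T(i)
```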
What the algorithm does is take the two symbols with the smallest frequencies, and let's call those two symbols a and b. And it replaces those two symbols with a single symbol ab, sort of a meta-symbol representing the presence of either a or b. We also discussed in the quiz how the sensible frequency for this new meta-symbol ab is the sum of the frequencies of a and b. Last video, when we were developing our intuition for what a greedy algorithm based on successive bottom-up mergers might look like, we noticed that when you merge two symbols a and b, what you're really doing is committing to outputting a tree at the end of the day, at the end of your algorithm, in which the symbols a and b appear as siblings, in which they have exactly the same parent.

Therefore, there's an exact correspondence, a one-to-one correspondence, between, on the one hand, trees whose leaves are labeled with the symbols in sigma prime, so there is no leaf labeled a and no leaf labeled b, there's instead one labeled ab, and, on the other hand, trees for the original alphabet sigma in which the symbols a and b are siblings, in which they have exactly the same parent. Given a tree as shown in the left picture, that is, given a tree T prime whose leaves are labeled according to sigma prime, you can, and this is in fact exactly what we do in Huffman's algorithm, split the leaf with the meta-symbol ab: make it an internal node and give it two leaves with labels a and b. That produces a tree of the form on the right. Conversely, given a tree of the form on the right, that is, a tree whose leaves are labeled according to sigma and in which a and b happen to show up as siblings, you can produce a tree for sigma prime just by contracting a and b together, so sucking up a and b into their parent and labeling the parent ab. So you can go back and forth between these two types of trees: on the one hand, arbitrary trees for sigma prime, and on the other, trees of a special kind for sigma, trees in which a and b happen to be siblings. This subset of trees for sigma is important enough to be given its own notation. So let me denote by capital X, sub ab, the trees with leaves labeled according to sigma in which it just so happens that a and b appear as siblings. So that'll be some trees for sigma, but not all of them, only the ones in which a and b are siblings.

So this correspondence between solutions to the smaller sub-problem and solutions of a particular form for the original problem has an important property. Namely, it preserves the objective function value, it preserves the average encoding length. Okay, that's not quite true, but it's almost true, and it's good enough for our purposes: it preserves the average encoding length up to a fixed constant. So let me show you the computation which demonstrates that point. Consider any pair of matched trees, T prime and T. And by matched, I just mean T prime is any old tree whose leaves are labeled according to sigma prime, and T is the tree you would obtain if you split the leaf with the meta-symbol ab in the usual way: you replace it with an internal node with two children labeled a and b. So that's the corresponding tree capital T. So take any such matched pair, and let's look at the difference between their average encoding lengths. Now, remember the average encoding length of a tree is just a sum over the symbols of the relevant alphabet.
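To make the bookkeeping explicit before the cancellations, here is the difference we're about to compute, in the same notation as above (with the frequency of the meta-symbol defined as p_a + p_b, as in the quiz):

```latex
% Difference of average encoding lengths for a matched pair (T, T'),
% where T' has leaves labeled by \Sigma' and T is obtained from T'
% by splitting the leaf ab into an internal node with children a and b.
L(T) - L(T')
  \;=\; \sum_{i \in \Sigma} p_i \cdot \mathrm{depth}_T(i)
  \;-\; \sum_{i \in \Sigma'} p_i \cdot \mathrm{depth}_{T'}(i)
```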
And what we've got going for us here is that sigma and sigma prime are almost exactly the same, right? The only difference is that sigma prime has the meta-symbol ab, whereas sigma has the individual symbols a and b. Furthermore, the two trees T and T prime are almost exactly the same, the only difference being that T prime has this leaf with the meta-symbol ab, whereas T has two corresponding nodes, one level down, with the labels a and b. So when we take the difference of these two sums, everything cancels out, except that the tree T contributes two summands, one for the leaf with label a and one for the leaf with label b, and the tree T prime contributes, with a minus sign, one summand, the one corresponding to the leaf with label ab. So when the dust clears and we cancel out all the terms, what we're left with is the term for the leaf a, namely the frequency of a times the depth of the leaf a in the tree T, a similar term for the leaf with label b in the tree T, and then, with a minus sign, the frequency of the meta-symbol ab times its depth in the tree T prime.

But we are certainly not done with our simplifications. There is a strong relationship between the frequencies that we're staring at, and a strong relationship between these various depths. Let's begin with the frequencies. How did we define the frequency of the meta-symbol ab? Well, remember our quiz: it made sense to define it as the sum of the frequencies of a and b. What about the depths? Well, the symbol ab is at whatever depth it is in the tree T prime; say it's at depth 10, something like that. Remember, T is obtained from T prime simply by splitting this leaf ab and giving it two new children with labels a and b. So if the meta-symbol ab was at depth ten in T prime, then the leaves a and b are going to be at depth 11 in T. So the depth is simply one more than whatever it was previously in the tree T prime.

So these relationships are going to result in a second wave of cancellations. Just to make that crystal clear, let's call the depth of ab in T prime d. So d is what I was calling ten earlier, and a and b are both at depth d plus 1 in the tree T. So the first term becomes the frequency of a times the depth d plus 1. The second term becomes the frequency of b times the depth d plus 1. And the third term becomes the sum of the frequencies of a and b times the depth d. And when the dust settles yet again, we are left with a constant, the frequency of a plus the frequency of b, the sum of the two frequencies.

And one thing I want you to really understand is that the difference between these two average encoding lengths is just some constant; it does not depend on which trees we started with. So if we choose a perfectly balanced tree and we take this difference, we get some constant like 0.1. If we choose some totally different pair of trees that are really lopsided and we take this difference, we still get 0.1, exactly the same thing. So this fulfills the promise I gave you earlier, that not only do we have this natural correspondence between, on the one hand, trees labeled according to sigma prime, and, on the other, trees of a certain type labeled according to sigma, namely those in which a and b appear as siblings, but the correspondence preserves average encoding length. Okay, it doesn't quite preserve the average encoding length, but it preserves it up to a universal constant, and that's going to be good enough for our purposes, as we'll see in a sec.
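To recap the whole calculation in one line (this is just the computation described above, nothing new):

```latex
% With d = depth_{T'}(ab), the split puts a and b at depth d + 1 in T,
% and the meta-symbol's frequency is p_a + p_b, so everything else cancels:
L(T) - L(T')
  \;=\; p_a (d+1) + p_b (d+1) - (p_a + p_b)\, d
  \;=\; p_a + p_b
```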