So this set of lectures will be our final application of the greedy algorithm design paradigm. It's going to be an application in compression. Specifically, I'll show you a greedy algorithm for constructing a certain kind of prefix-free binary codes known as Huffman codes. We're going to spend this video just setting the stage.

So let's begin by defining a binary code. A binary code is just a way to write down symbols from some general alphabet in a manner that computers can understand. That is, it's just a function mapping each symbol from an alphabet, capital sigma, to a binary string, a sequence of zeroes and ones. This alphabet capital sigma could be any number of things, but, as a simple example, you could imagine it's the letters a through z, say in lowercase, plus maybe the space character and some punctuation, so perhaps a set of size 32 overall. And if you have 32 symbols you need to encode in binary, an obvious way to do it is to note that there happen to be exactly 32 different binary strings of length five, so why not just use one of those for each of your symbols. This would be a fixed-length code, in the sense that we're using exactly the same number of bits, namely five, to encode each of the symbols of our alphabet. This is pretty similar to what's going on with ASCII codes.

And of course, it's a mantra of this class to ask, when can we do better than the obvious solution? So in this context, when can we do better than fixed-length codes? Sometimes we can, in the important case when some symbols are much more likely to appear than others. In that case, we can encode information using fewer bits by deploying variable-length codes. And this is, in fact, a very practical idea. Variable-length codes are used all the time in practice. One example is in the encoding of MP3 audio files. If you look up the standard for MP3 encoding, there's an initial phase in which you do analog-to-digital conversion. But then, once you're in the digital domain, you apply Huffman codes, which is what I'm going to teach you in these videos, to compress the length of the files even further. And of course, as you know, compression, especially lossless compression like Huffman codes, is a good thing. You want to download a file, you want it to happen as fast as possible, so you want to make the file as small as possible.

So a new issue arises when you pass from fixed-length codes to variable-length codes. Let me illustrate that with a simple example. Suppose our alphabet sigma is just four characters: A, B, C, D. The obvious fixed-length encoding of these characters would just be 00, 01, 10, and 11. Well, suppose you wanted to use fewer bits and tried a variable-length encoding; an obvious idea would be to get away with only one bit for a couple of these characters. So, suppose instead of using a double zero for A, we just use a single zero, and instead of using a double one for D, we just use a single one. That's strictly fewer bits, so it seems like it can only be better. But now, here's the question. Suppose someone handed you an encoded transmission consisting of the digits 001. What would have been the original sequence of symbols that led to that encoded version?

All right, so the answer is D: there is not enough information to know what 001 was supposed to be an encoding of. The reason is that, having passed to a variable-length encoding, there is now ambiguity.
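To make that ambiguity concrete, here's a minimal Python sketch (my own illustration, not part of the lecture) that enumerates every possible decoding of the string 001 under the code A = 0, B = 01, C = 10, D = 1. The function name all_decodings is just a placeholder for this illustration.

```python
# Non-prefix-free code from the example above.
CODE = {"A": "0", "B": "01", "C": "10", "D": "1"}

def all_decodings(bits, code=CODE):
    """Return every symbol sequence whose encoding is exactly `bits`."""
    if bits == "":
        return [""]
    results = []
    for symbol, word in code.items():
        if bits.startswith(word):
            for rest in all_decodings(bits[len(word):], code):
                results.append(symbol + rest)
    return results

print(all_decodings("001"))  # ['AAD', 'AB'] -- two valid decodings, so 001 is ambiguous
```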
There is more than one sequence of original symbols that could have led, under this encoding, to the output 001. Specifically, the quiz answers A and C would both lead to 001. The letter A would give you a 0 and the letter B would give you 01, so AB would give you 001. On the other hand, AAD would also give you 001. So there's no way to know which was meant.

Contrast this with fixed-length encoding. If you're given a sequence of bits with a fixed-length code, of course you know where one symbol ends and the next one begins. For example, if every symbol were encoded with 5 bits, you would just read 5 bits, you would know what symbol it was, you would read the next 5 bits, and so on. With variable-length codes, without further precautions, it's unclear where one symbol ends and the next one begins. So that's an additional issue we have to take care of with variable-length codes.

To solve this problem, we're going to insist that our variable-length codes are prefix free. What this means is that when we encode a bunch of symbols, we're going to make sure that for each pair of distinct symbols i and j from the original alphabet sigma, the corresponding encodings have the property that neither one is a prefix of the other. Going back to the previous slide, you'll see that that example was not prefix free. For example, 0, the encoding for A, was a prefix of 01, the encoding for B, and that led to ambiguity. Similarly, 1, the encoding for D, was a prefix of 10, the encoding for C, and that also leads to ambiguity. If no encoding is a prefix of any other, and we'll develop this more shortly, then there's no ambiguity: there's a unique way to decode, to reconstruct what the original sequence of symbols was, given just the zeroes and ones (there's a small decoding sketch below that makes this concrete).

Lest you think this is too strong a property, certainly interesting and useful variable-length codes exist that satisfy the prefix-free property. Here's one simple example, again just to encode the letters A, B, C, D. We can get away with encoding the symbol A using just a single bit, a 0. Now, of course, to be prefix free, it had better be the case that our encodings of B, C, and D all start with the bit 1; otherwise we're not prefix free. But we can get away with that, so let's encode B with a one and then a zero. And now both C and D had better start neither with 0 nor with 10; that is, they'd better start with 11. So let's just encode C using 110 and D using 111. That's a variable-length code, with the number of bits varying between one and three, but it is prefix free.

And again, the reason we might want to use a variable-length encoding is to take advantage of non-uniform frequencies of symbols from a given alphabet. Let me show you a concrete example of the benefits you can get from these kinds of codes on the next slide. Let's continue with just our four-symbol alphabet, A, B, C, and D, and let's suppose we have good statistics in our application domain about exactly how frequent each of these symbols is. In particular, let's assume we know that A is by far the most likely symbol: say 60% of the symbols are going to be As, whereas 25% are Bs, 10% are Cs, and 5% are Ds. So why would you know these statistics? Well, in some domains you're just going to have a lot of expertise. In genomics, you're going to know the usual frequencies of As, Cs, Gs, and Ts.
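Here is the decoding sketch mentioned above: a small, illustrative Python decoder (again my own, not from the lecture) for the prefix-free code A = 0, B = 10, C = 110, D = 111. Because no codeword is a prefix of another, scanning left to right and emitting a symbol as soon as the bits read so far match a codeword can never go wrong.

```python
PREFIX_FREE_CODE = {"A": "0", "B": "10", "C": "110", "D": "111"}

def decode(bits, code=PREFIX_FREE_CODE):
    """Decode a bit string left to right; prefix-freeness guarantees that
    the first codeword match is the only possible one."""
    codeword_to_symbol = {word: symbol for symbol, word in code.items()}
    symbols, current = [], ""
    for bit in bits:
        current += bit
        if current in codeword_to_symbol:   # completed a codeword
            symbols.append(codeword_to_symbol[current])
            current = ""
    if current:
        raise ValueError("leftover bits -- not a valid encoding")
    return "".join(symbols)

print(decode("010110111"))  # 'ABCD': splits uniquely as 0 | 10 | 110 | 111
```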
For something like an MP3 file, you can literally just take an intermediate version of the file, after you've done the analog-to-digital transformation, and count the number of occurrences of each of the symbols. Then you know the exact frequencies, and you're good to go.

So let's compare the performance of the obvious fixed-length code, where we use 2 bits for each of the 4 characters, with that of the variable-length, prefix-free code from the previous slide. And we're going to measure the performance of these codes by looking at, on average, how many bits are needed to encode a character, where the average is over the frequencies of the four different symbols. For the fixed-length encoding, of course, it's two bits per symbol; we don't even need the average, since whatever the symbol is, it uses exactly two bits. So what about the variable-length encoding shown on the right in pink? How many bits, on average, given these frequencies of the different symbols, are needed to encode a character of the alphabet sigma?

Okay, so the correct answer is the second one: on average, 1.55 bits per character. So what's the computation? Well, 60% of the time it's going to use only 1 bit, and that's where the big savings comes from. 1 bit is all that's needed whenever we see an A, and most of the characters are As. We don't do too badly when we see a B either, which is 25% of the time; we're only using 2 bits for each B. Now it is true that for Cs and Ds we're paying a price, having to use 3 bits for each of those, but there aren't very many: only 10% of the time is it a C, and 5% of the time is it a D. And if you take the average weighted by the symbol frequencies, 0.6 times 1 plus 0.25 times 2 plus 0.1 times 3 plus 0.05 times 3, you get 1.55.

So this example draws our attention to a very neat algorithmic opportunity. Namely, given an alphabet and frequencies of the symbols, which in general are not uniform, we now know that the obvious solution, fixed-length codes, need not be optimal. We can improve upon them using variable-length prefix-free codes. So the computational problem you'd want to solve is: which one is the best? How do we get optimal compression? Which variable-length code gives us the minimum average encoding length of a symbol from this alphabet? Huffman codes are the solution to that problem, and we'll start developing them in the next video.
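As a sanity check on the 1.55 figure, here's a short Python sketch (not part of the lecture) that computes the average bits per symbol for both codes under the assumed frequencies.

```python
# Frequencies and codes from the slide's example.
frequencies   = {"A": 0.60, "B": 0.25, "C": 0.10, "D": 0.05}
fixed_code    = {"A": "00", "B": "01", "C": "10", "D": "11"}
variable_code = {"A": "0",  "B": "10", "C": "110", "D": "111"}

def average_bits_per_symbol(code, freqs):
    """Expected codeword length, weighted by the symbol frequencies."""
    return sum(freqs[s] * len(code[s]) for s in freqs)

print(average_bits_per_symbol(fixed_code, frequencies))     # 2.0
print(average_bits_per_symbol(variable_code, frequencies))  # 0.6*1 + 0.25*2 + 0.1*3 + 0.05*3
                                                            # = 1.55 (up to floating-point rounding)
```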