So in this video we'll take a peek under the hood of hash functions. And I'll
discuss some of the high-level principles by which they're implemented. So let's
briefly review the raison d'etre of a hash table. So the purpose in life for a hash
table is to support super-fast lookups. So maybe you're keeping track of the
transactions that happened on your website yesterday. Maybe you're keeping track of
your employees; maybe you're keeping track of IP addresses in an Internet router.
Maybe you're keeping track of chess configurations in a chess-playing
program. Whatever. The point is, you want to be able to insert stuff into a hash
table, and later remember whether something's there or whether something's
not there. So the implementations we'll discuss will generally also support
deletions. But that's pretty much it. It's a very restricted set of operations. But
the hash table is going to execute them very, very well. So, basically in constant
time, again subject to some fine print, which we'll discuss a little bit in this
video but then more deeply in a separate optional video. So, the two caveats are,
first of all, the hash table had better be properly implemented; it's actually pretty
easy to screw up a hash table, in particular the hash function. We'll talk a bit about
that in a few minutes. And then, also, the data should, in some sense, be
non-pathological, and that we will discuss more deeply in a separate video.
Alright, so let me give you an initial glimpse of some of the magic that's
happening under the hood in hash functions. So, first let me say exactly
what the setup is. The first step is to identify all the things that you might
want to be storing. So, in other words, the universe of your application. So this
would be something like all possible IP addresses, of which there are 2^32.
All possible names you might encounter, perhaps with a maximum of, say,
30 characters. All possible configurations of a chessboard, and so on. And one thing
I hope you can appreciate from these examples is that, in many cases, this
universe is really big. So the number of IP addresses is, quote unquote, only 2^32.
The number of all names, you're probably talking more like 26^30.
All chessboard configurations I don't even wanna think about. And what you
wanna accomplish is, you wanna maintain an evolving subset of this universe U. So
maybe you wanna keep track of all the IP addresses you've seen on your website in
the last 24 hours. You wanna keep track of the phone numbers of all of your friends.
You wanna keep track of the chessboard configurations that you've explored in the
past three seconds, whatever. And again I hope what is clear from the applications
we've been discussing, is that the set S is usually of a reasonable size. It's
something you could store in main memory. You know, maybe it's tens of
thousands of IP addresses. Maybe it's a few hundred names of your various
friends. Maybe it's in the millions of chessboard
configurations, but still way, way, way smaller than the size of the universe. So
without data structures, you'd have to resort to other unsatisfactory solutions
to maintaining this set. So the first thing you could try, as we discussed in
the previous video, would be just have an array with one position for every
imaginable thing you might want to store in your set. So this is the solution
that's going to work well if all of your friends happen to have names that are
integers between 1 and 10,000, but doesn't scale when the universe size
becomes really big, as in most of these applications. So the good news is, of
course, that an array gives you fast random access, so you can access any
position in constant time. So, if you have an array-based solution indexed by all the
elements of the universe, you can do constant-time insert, delete and look-up.
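To make that concrete, here's a minimal sketch of the array-based idea in Python, assuming for illustration that the universe is just the integers 0 through some bound:

```python
# A direct-address table: one array slot per element of the universe.
# Insert, delete, and lookup are all constant time, but the space is
# proportional to the size of the universe, not the size of the set stored.

class DirectAddressTable:
    def __init__(self, universe_size):
        self.slots = [None] * universe_size   # one slot per possible key

    def insert(self, key, value=True):
        self.slots[key] = value               # O(1)

    def lookup(self, key):
        return self.slots[key]                # O(1); None means "not present"

    def delete(self, key):
        self.slots[key] = None                # O(1)

table = DirectAddressTable(10_000)            # fine for keys in 0..9999
table.insert(42, "Alice's phone number")
# DirectAddressTable(2**32) is where the space requirement becomes the problem.
```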
The bad news is, the space requirement is proportional to the size of the
universe. And forget about merely being unsatisfactory; that's just literally
infeasible in many of the applications in which you'd use hash
tables. Now of course, to get the memory proportional to the size of the set of stuff
that you're storing, an easy solution would just be to use a list. You know, say
a doubly-linked list. Something like that. Now with a list-based solution the good
news is, your memory is certainly proportional to the size of the set that
you're storing, and independent of the size of the universe from which these
elements are drawn. The bad news is that to figure out whether something is, or is
not, in a list you generally have to traverse through most of that list. And
that's gonna take up time proportional to the length of the list. So, really, the
question we're faced with in implementing a hash table is, can we get the best of both
worlds of these two naive solutions? On the one hand, we want to have the constant
time operations enjoyed by the array-based solution. But on the other hand, we wanna
have the linear space in the size of the set that we're storing that we get in the
list-based solution. So to get the best of both worlds, we are going to use an array-
based solution. But the array will not be big. It will not have size proportional
to the universe. The array will only have size, you know, roughly the same as the
set that we're storing, so somewhere in the ballpark of the cardinality of S. So
the first thing we do is we decide on how big we want our array to be. So that
length is gonna be called n. We're gonna have an array of length n. And n is gonna
be in the ballpark of the size of S. It's gonna depend on a few things exactly how
n compares to the size of S, but for now think of n as like double the size of S. We're gonna be
calling each entry of the array a bucket, so there's n buckets, and then, the size
of S is about 50 percent of the number of buckets, let's say. So one objection you
might legitimately raise at this point is: you know, I thought I said the set S was
dynamic. Right? Stuff can be added, stuff can be deleted. So the size
isn't always the same; it can fluctuate over time. So what does it mean to define
an array which is roughly the same length as this changing set? So for
simplicity, for the purposes of this video, to focus on the key points, I am going to
assume that, while S itself can be changing,
the size of S doesn't fluctuate too much. So there are additional bells and whistles
you can add to a hash table implementation, and they're all quite
natural. I think most of you could probably figure them out on your own, to
deal with the fact that S might be changing sizes. So for example, you can
just keep track of how many elements are in your hash table. And when it exceeds a
certain threshold, so when it's too big relative to the size of your array,
you just double the array. And then you reinsert all of the elements into this new
doubled array. Similarly, if the set shrinks, you can have tricks
for shrinking the array dynamically as well. So I'm not gonna discuss these bells
and whistles for resizing your hash table dynamically. They are, of course, important
for a real implementation, and they are part of the implementations in the
standard programming libraries. But I view those as sort of a second-order point
in the implementation of a hash table, and I wanna focus on the first-order points
in this video. So, summarizing, think of the set S. There are insertions and
deletions we have to accommodate. But, you know, S is gonna be roughly the same size.
And the number of buckets will be, you know, within a constant factor of the size
of the set. All right, so there we have our array with totally reasonable space, space
proportional to the size of the set that we are storing. And now what we want is
some way of translating between the things that we care about, say our
friends' names or whatever the elements in the universe are, and the positions in this
array. So the object responsible for that translation, from keys drawn from this universe to
positions in this array, is called a hash function. So formally, a hash function
takes as input a key. So this is gonna be an IP address or the name of somebody or a
chessboard configuration or whatever. And it's going to spit out a position in this
array. So I'm gonna label the array entries from 0 to n-1 for this lecture.
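Just to make the interface concrete, here's a toy hash function in Python; the particular arithmetic is made up purely for illustration and says nothing yet about what a good hash function looks like:

```python
def toy_hash(key, n):
    """Map a string key to a bucket in {0, 1, ..., n-1}.

    Toy illustration only: treat the key's bytes as digits of a big
    number, then take the remainder modulo the number of buckets.
    """
    h = 0
    for byte in key.encode("utf-8"):
        h = 256 * h + byte
    return h % n

n = 1000                       # number of buckets
print(toy_hash("Alice", n))    # some bucket between 0 and n-1
print(toy_hash("Bob", n))      # ditto; different keys may well collide
```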
Obviously at the moment this is super underspecified. There's a zillion
functions you could choose. Which one you use, we'll talk about that, but for now
there's just gonna be some hash function mapping from elements of the universe to
buckets, to positions in this array. Now, as far as the semantics of this hash
function, what the hash function is doing, it's telling us in which position we
should store a given key from the universe. So, if we have some new friend
named Alice, and we feed Alice through the hash function and it
gives us a 17, it says we should store Alice's phone number in position
17 of the array. If we have some crazy chessboard configuration, we feed it
into the hash function and it spits out 172, it says we should remember this chessboard
configuration in the 172nd bucket of this array. So again, given x, which is some
key from this universe, we invoke a hash function to get a position in this array,
to get a bucket. And then that is where we try to store this x and any associated
data with it. So that's the high-level idea of how you implement a hash
table, but we're quite far from done. And in particular there is a serious issue
that we're going to have to deal with, that's fundamental to implementing hash
tables, and that's the notion of a collision. So probably many of you may
have already noticed that this problem might occur, which is: well, what happens if
we're storing our friends' phone numbers, and you know Alice shows up and we ask our
hash function where to store Alice's phone number, and it says oh bucket number
17. And then our friend Bob shows up, and we ask our hash function where to
store Bob's phone number, and what if the hash function also says bucket number
17 for Bob? What do we put in bucket 17? Do we put Alice
there, do we put Bob there, do we put them both there? How do we deal with these
so-called collisions? So, the next quiz is meant to get you thinking about
collisions, and in some sense, how truly unavoidable they really are.
All right. So the correct answer to this question is the first answer,
believe it or not. All you need is 23 people in a room before you're equally
likely to have two people with the same birthday as not. So if you're looking
to skim a little money off of your non-mathematical friends, this is one way
you can do it. Go to cocktail parties with about 40 people and place bets with people
that there are two people in the room with the same birthday. So if you have 367
people, well there's only 366 distinct birthdays, I'm counting February 29th
here as one of them. So by the pigeonhole principle, certainly the
probability is 100% by the time you get to 367. Now, by the time you're at 57,
you're already at 99%. So you already have overwhelming probability of having a
duplicate birthday with 57 people. So of course, with 184 you're gonna be almost at
100%: 99.99, who knows, some large number of 9's. And at 23, you're at 50%. So many
people find it quite counter-intuitive that you only need 23 people before a
duplicate birthday is more likely than not. And so this is a quite famous example and
it sometimes goes by the birthday paradox. Calling it a paradox is sort of a
misnomer. A paradox, you know, often suggests some kind of logical
inconsistency. There's no logical inconsistency here. It's just that
people's brains are not really wired to have this intuition, for whatever reason.
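If you want to check these numbers yourself, the chance that p people all have distinct birthdays among 366 equally likely days is (366/366)(365/366)...((366-p+1)/366), so a few lines of code (just a quick illustrative sketch) settle it:

```python
def collision_probability(people, days=366):
    """Probability that at least two people share a birthday, assuming
    each of `days` possible birthdays is equally likely."""
    prob_all_distinct = 1.0
    for i in range(people):
        prob_all_distinct *= (days - i) / days
    return 1 - prob_all_distinct

for p in (23, 57, 184, 367):
    print(p, collision_probability(p))
# roughly 0.51 for 23 people, 0.99 for 57, essentially 1 for 184,
# and exactly 1 for 367 (the pigeonhole case).
```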
So, but it's really just math. You can work out the math, and you can
just solve it. So, more generally, the principle behind the birthday paradox is
the following. So suppose you have a calendar, perhaps on some different
planet, which has k days, where everybody's equally likely to have each of
the k days as their birthday. Then it's about the square root of k people that you
need in a room before you're equally likely to have a duplicate, or not have a
duplicate. Okay, and the reason that you get the square root effect is because, if
you think about it, there's a quadratic number of pairs of people in the room, so
that's a quadratic number of opportunities to have a duplicate.
Right? So, each pair of people could be a duplicate, there's a quadratic number of
pairs. And so, that's why, once the number of pairs starts reaching about the number
of different days, you're likely to see a duplicate around
that point. So you might be wondering why I'm telling you about the birthday paradox
in the middle of a lecture about hashing, but really it's quite relevant. So imagine
for example you defined a hash function in the following way. Now to be clear, this
is not a practical hash function, but just for the purposes of discussion, imagine
you have a hash function which randomly assigns every single key to a uniformly
chosen bucket, so each of the n buckets is equally likely, each with probability 1/n. Then what
the birthday paradox says is, even for a very small dataset, you are already gonna
have a pair of things colliding. All right. So if you have n buckets, so
maybe your n is like 10,000, all you need is roughly 100 elements in your data set,
and despite the fact that the table is only going to be one percent full, you're
already going to see a collision, okay? So 99 percent of them are empty, but you're
going to have one bucket that has two, so that's sort of annoying. So the birthday
paradox says, you start getting collisions with the hash function, even with the
really tiny data sets. So in this sense, if you're going to have hash tables,
you've got to deal with collisions. There's going to be a fair number of them,
and you need some method for resolving them So, collisions are a fact of life
when you're talking about hashing. Where again, by collision, what I mean is two
different keys. So two different elements x and y from the universe that hash to the
same bucket, Who have the same hash value, So in general we can think of a hash
function as doing a compression of sorts. So we have a huge universe U and we have
this very modest-size array A with only n buckets, where n we're thinking
of as being much, much, much smaller than the size of U. So, of course, this hash function has
to map various elements of U to the same bucket. So what are we gonna do about it?
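Before getting to the fixes, here's a quick simulation of that point, using a purely random assignment of keys to 10,000 buckets as a stand-in for an idealized hash function (again, just an illustrative sketch):

```python
import random

def keys_until_first_collision(num_buckets=10_000, trials=1_000):
    """Average number of uniformly random keys inserted before two of
    them land in the same bucket."""
    total = 0
    for _ in range(trials):
        seen = set()
        while True:
            bucket = random.randrange(num_buckets)
            total += 1
            if bucket in seen:
                break
            seen.add(bucket)
    return total / trials

# Typically prints something in the low hundreds, on the order of
# sqrt(10,000) = 100 keys, even though 99% of the buckets are still empty.
print(keys_until_first_collision())
```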
How are we going to resolve these collisions? Well, there's two different
solutions which are both quite prevalent in practice. So solution number one is
called chaining, or sometimes you'll also see it called separate chaining. And this
is a very natural solution; it's also the one that's relatively easy to analyze
mathematically. What you do is, for elements that hash to the same bucket, you
just revert to the list-based solution that we talked about in a previous slide.
So, each of the n buckets will not necessarily contain merely 0 or 1
element; it will contain a list with, in principle, an unbounded number of elements.
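For concreteness, here's a minimal sketch of a chaining hash table in Python, folding in the array-doubling trick mentioned a few minutes ago; the use of Python's built-in hash and the 0.75 load-factor threshold are just illustrative choices:

```python
class ChainedHashTable:
    def __init__(self, num_buckets=8):
        self.buckets = [[] for _ in range(num_buckets)]   # one list ("chain") per bucket
        self.size = 0

    def _bucket(self, key):
        return self.buckets[hash(key) % len(self.buckets)]

    def insert(self, key):
        if key not in self._bucket(key):
            self._bucket(key).append(key)
            self.size += 1
            if self.size > 0.75 * len(self.buckets):      # "too full": double the array
                self._resize(2 * len(self.buckets))

    def lookup(self, key):
        return key in self._bucket(key)      # only scan this key's chain

    def delete(self, key):
        chain = self._bucket(key)
        if key in chain:
            chain.remove(key)
            self.size -= 1

    def _resize(self, new_num_buckets):
        old_keys = [k for chain in self.buckets for k in chain]
        self.buckets = [[] for _ in range(new_num_buckets)]
        for k in old_keys:                   # re-insert everything under the new size
            self._bucket(k).append(k)

table = ChainedHashTable()
for name in ("Alice", "Bob", "Carol", "Daniel"):
    table.insert(name)
print(table.lookup("Bob"), table.lookup("Eve"))   # True False
```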
Okay, so when we use chaining, it's quite straightforward to figure out how
to implement all of the hash table operations, namely insert, delete and
look-up: you just hash something to the appropriate bucket and then you just do
insert, delete or look-up, as appropriate, in the list that's in that bucket. So just
to make clear that everything type-checks: here h(x), this is the bucket
for x; that's what's specified by the hash function. And then in the h(x) position
of this array A, in the h(x) bucket, is where we find the linked list that is
going to contain x. So just to give a cartoon example, if you had, say, four
buckets: maybe, you know, the first bucket has exactly one record, corresponding to
Alice; maybe the second bucket just has a null pointer, no one's been inserted in the
second bucket. And then in the third bucket we have, let's say, both Bob as well as
Daniel. And then maybe in the fourth bucket we have Carol. Okay, so because we
have a collision between Bob and Daniel, both map to the third bucket, and we
resolve that just by having a linked list, with Bob and Daniel in some order. So the
second solution which is trickier to talk about mathematically but still quite
important practically is called open addressing. And the principle in open
addressing is you're not going to use any space for pointers. You're not gonna
have lists. So you're only gonna have one object per bucket of the array. So a natural
question is, what happens if, you know, you try to insert Daniel, and you
invoke the hash function on Daniel, and it takes you to a bucket that already
contains Bob? That means there's no room for Daniel. So what you're going to do is
you're gonna probe the hash table in some other position. So a hash function is
now gonna be replaced by a hash sequence, where the hash function tells you
the first bucket to try to insert Daniel; failing that, a second bucket in which to
try to insert Daniel; failing that, a third bucket to try to insert Daniel; and
so on. And you just keep trying till you find an open position somewhere in the