So up to now we've done inverted indexes that just treat strings like strings, but the real inverted indexes, if what we're searching is natural language, we've got to do some stuff that, or we can take advantage of some of the realities of natural language to make our indexes smaller and more efficient. So what we're going to do is just like we did before, first we're going to build by hand using SQL, and then we're later going to just do it with Postgres, and of course Postgres will be easy and the main reason to do it by hand is to better understand it. So the whole idea is that we're going to take advantage of what natural language is and the fact that the queries are supposed to have meaning, right? When's the last time you went to Google and typed in the, and, but, if, who, right? Those words we have in language, and they're important for the structure of our sentences, but they're not important for the meaning of the sentences. Because we use and and but and who in so many sentences, but they don't contribute to the meaning, they contribute to the structure of them. And so it turns out those words are, in natural language search queries, effectively useless. So why would we index them? And the term for those words is stop words, I don't know why they're stop words, it's not like that we stop. It'd be better to call them ignore words, for all we care. So we just do ignore words and we'd say here are the words to ignore, but we technically call them stop words and that's that. So stop words is one natural language optimization. The other optimization we're going to do is just make everything lowercase. So stop words reduces the size and again, generally, when someone types SQL uppercase or sql lowercase, they generally mean the same thing. And so teacher with the first letter capital T or teacher with not the first letter capital T, the meaning is the same, whether we start or end it. If you're looking for teacher stuff or teaching stuff, you don't care about capitalization. So we're going to lowercase it all and then we're going to make stop words. So if we take a look, for example, at this one query I've got sitting here, this SELECT DISTINCT id, doc FROM docs, and I'm searching for, Search for Lemons and Neons, you can see the uselessness of the stop words. And so it found things that had the word and in it but had nothing to do with Lemons or Neons. So there's sort of stop words and meaning words. And so we did this without stop words and we got a bunch of false positives in a way because the meaning of the resulting returned row and the meaning of the query didn't match at all. And the other thing we're going to do is what's called stemming. So the idea of stemming is where you have a whole series of words, so car and cars. Do you really want to separately index car and cars or auto and automobile? And so these, they're stems or lexemes, words that basically, what is the real word so we don't separately index. So it both makes our index smaller, but it also means that when we do it to the query and we apply the same thing to the query where we take automobile and auto and make it be the same stem, we're actually making it so it matches more documents because of the meaning. And so stemming is words with equivalent meanings. So a word and its stem, you map all the equivalent words down to the same stem and then that also reduces things. So, what we're going to do in this one is we are going to build our own manual index. In that manual index we are going to add a table of stop words, and then we're going to add a table of stems that map words in our text documents to their stems. We're going to have to map that both for the creation of the index phase and for the query phase. And then when we're all said and done, we're going to make a much better inverted index. So we're going to make an index that still has things like SQL is in three documents 1, 2, and 3, but case has been removed and there are no stop words. So how would we build an inverted index that maps everything to lowercase, handles stop words, and handles stems? And like before, in this dot dot dot part, I'm going to show you that in another demo. So I'll show you the whole dot dot dot. We'll walk through it all the way, we'll see everything, and that will get us to the point. I'm just showing you where it's going to go. The stemming and the stop words very much depend on which language and so you'll notice in this next part when we're going to do it all with PostgreSQL, it's really important to say which language it is. Now, this little next SELECT that just shows you which possible languages you have in SELECT cfgname FROM pg_ts_config, I think ts stands for text search in this. This shows in a default installation with Postgres, the number of languages that are precomputed, have precomputed stemming rules, etc., etc., etc., etc. Stop words and stemming rules and many other things, right? We're doing a very, very, very simple implementation but, like always, Postgres already has our back covered and has a bunch of things in there for us. But it does mean as you create data models you've got to remember what language it is because you might have to create different indexes for different information in different languages. So that's what we're doing is we're switching to like if it's natural language and we know what's language it's in, we can build much more efficient indexes. [MUSIC]