Now ahead is one of the most critical modules to pay attention to. Even if you're a SQL guru, here we're going to look back at the evolution of modern databases and end with how the technologies behind BigQuery addressed some of the architectural limitations that prevented true petabyte-scale data analysis. We'll then discuss a core database concept called normalization and end with some pretty cool data structures, like having nested records inside of a table. Now, let's start with the database evolution journey first. Okay, let's dive right in. For a lot of these concepts, we're going to step away from the SQL syntax that we've learned before and think more about database theory and the evolution of databases over time. The example I'm going to walk through is the IRS data example that we've been using throughout this course. Let's take a look at this schema and the way that it's actually storing these expenses. So we have over 20 expenses that are recorded for each of these charities. They could have legal expenses on the form they're filing for 2015, accounting expenses, lobbying expenses, royalties, travel expenses, grants. All of these data fields need to be stored in some kind of structured row-and-column format, right? So let's take a look at how we've actually done it, "we" being the IRS, and then ingesting that data into BigQuery. That form data is actually pivoted, in a way: instead of being a list of expenses, it's an increasingly wide table of all these different fields, and if you query the irs_990_2015 table, this is exactly what you would see. For each record, you're going to have one EIN and a whole list of all the fields that have been filed on that 990 return, with all the different field names. And these are some of the examples for the expenses: you can see legal fees, accounting fees, fees for lobbying, office expenses.
Now, here's the problem with structuring the data this way. What if I told you that for 2016, there are ten new types of legal expenses that I want you to add to IRS Form 990 for charities to fill out? What would that mean for your data architects? So if the PDF form that charities have to fill out gains 10 more fields, no problem for them. But once you get that new form in, what about all the old data that you have? What about the schema that you already have? It's going to require fundamental changes, right? You can technically go wider and create new columns, but think of the downsides: any time you're doing unions against the past, you're going to have tables with mismatched columns. And that's the least of your worries. The biggest problem is, when does it end? Maybe some years lose expense columns, so you're shortening the schema, and then maybe some years have many more columns, so you're just widening it. Constant changes to the schema are bad design, and we need to find a better way to cope with a flexible number of fields we want to store. And the answer is potentially not having them stored as fields at all. So, as we covered, that design isn't scalable in either direction. You don't want to make continuous changes to your schema over time; you want to be flexible. So in the traditional relational database world, let's talk about what one of the potential solutions was. What you can do is actually break apart your expenses into a completely separate table. So instead of storing everything in one massive table, what I call the mother of all tables, you can begin to break those fields out as rows in another table. Tables are much more flexible vertically, by adding and appending new rows, than they are horizontally, by continuously adding and removing fields. That's no fun. So let's cover this example.
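To make the schema-churn problem concrete, here is a minimal sketch using SQLite in Python as a stand-in for any relational store. The table and column names are illustrative, not the real IRS 990 schema: it shows how every new form year forces `ALTER TABLE` widening, which then breaks naive unions across years.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# 2015: the "mother of all tables" -- one wide row per charity (columns illustrative).
cur.execute("""
    CREATE TABLE irs_990_2015 (
        ein TEXT PRIMARY KEY,
        legal_fees REAL,
        accounting_fees REAL,
        lobbying_fees REAL
    )
""")

# 2016: new expense types appear on the form. With the wide design, the only
# option is widening the schema, column by column, every single year.
cur.execute("CREATE TABLE irs_990_2016 AS SELECT * FROM irs_990_2015 WHERE 0")
for new_col in ["litigation_fees", "patent_fees"]:  # hypothetical new fields
    cur.execute(f"ALTER TABLE irs_990_2016 ADD COLUMN {new_col} REAL")

# The two years now have mismatched columns, so a plain UNION ALL across
# years no longer lines up.
cols_2015 = [row[1] for row in cur.execute("PRAGMA table_info(irs_990_2015)")]
cols_2016 = [row[1] for row in cur.execute("PRAGMA table_info(irs_990_2016)")]
missing_from_2015 = set(cols_2016) - set(cols_2015)
print(missing_from_2015)  # the columns the 2015 table never had
```

The point is not the SQLite syntax; it's that the schema itself has become a moving target, which is exactly the design smell the next section fixes.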
So here we've broken it apart into three separate tables instead of one massive table that has all the different field values. You have the organizational details in one, and that's the item in blue, and then in green you have the historical transactions. And you can actually look up those expense codes from your expense code lookup table, which basically says expense code 1 is lobbying, 2 is legal, 3 is insurance, and maybe 50 could be travel or something like that. You can have as many items in this lookup table as you want, and the historical transactions can just look up against it. You might be wondering, why have an expense code at all? This gets to another point of database theory, which is: try to have one fact in one place. If you end up renaming the word "lobbying" to "lobbying services", and you had that value repeated tens of thousands of times in the transactions table because you didn't have it broken out into a lookup table, that's ripe for error as well. So generally you just want to have as much separation as you can, but again, we'll talk about how far is too far. And this entire process that we've covered is called normalization. Normalization, in the relational database world, and this is apart from BigQuery, apart from any specific technology, this is just database theory, is breaking apart all of your massive, centralized, consolidated data tables into separate but related tables. And this is the foundation of the relational database model, right? So what happens if we go too far? That's when we talk about normalization versus denormalization: breaking apart, but then potentially coming back together, right? Now let's talk about some of those trade-offs. So what's the benefit? We covered this, right?
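The three-table split above can be sketched in a few lines, again with SQLite as a generic relational stand-in and with illustrative table names (`org_details`, `expense_codes`, `transactions`). The key payoff to notice is "one fact in one place": renaming an expense type touches exactly one row in the lookup table, no matter how many transactions reference it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Normalized design: three separate but related tables (names illustrative).
cur.executescript("""
    CREATE TABLE org_details   (ein TEXT PRIMARY KEY, name TEXT);
    CREATE TABLE expense_codes (code INTEGER PRIMARY KEY, expense_type TEXT);
    CREATE TABLE transactions  (ein TEXT, code INTEGER, amount REAL);
""")

cur.execute("INSERT INTO org_details VALUES ('12-3456789', 'New York Association Inc')")
cur.executemany("INSERT INTO expense_codes VALUES (?, ?)",
                [(1, "lobbying"), (2, "legal"), (3, "insurance")])
cur.executemany("INSERT INTO transactions VALUES (?, ?, ?)",
                [("12-3456789", 1, 5000.0), ("12-3456789", 2, 1200.0)])

# One fact in one place: the rename is a single-row UPDATE, even if
# thousands of transactions point at code 1.
cur.execute("UPDATE expense_codes SET expense_type = 'lobbying services' WHERE code = 1")
renamed = cur.execute(
    "SELECT expense_type FROM expense_codes WHERE code = 1").fetchone()[0]
print(renamed)
```

Had the expense name been repeated inline in `transactions`, that same rename would be a mass update across every matching row, which is exactly the error-prone situation the lookup table avoids.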
So the organizational details table can now grow: if you have new charities joining every year, or charities dropping out every year, boom, that's fine, it just grows as much as you want. And in the transactions table you have expense codes, or any kind of code, it might not even necessarily be an expense, and you can look up against an ever-changing list of expense codes that you see there. So you have much more flexibility, and you don't need to grow horizontally and change that fundamental schema, which you absolutely want to avoid at all costs. What's the downside? The drawback is if you want to do a consolidated query, like "give me the lobbying expenses for the New York Association company." Now, in order to tie those insights back together, as you've learned how to do in SQL with those joins and unions, merging the data together, joins are now a necessity as part of the relational database model, right? And for performance reasons, joining data across multiple different tables and then doing that lookup is a big performance hit. But again, if you're relating database tables together across different and disparate data sources, you're going to have to use joins; there's no other way out. And for BigQuery, we're going to cover a particular answer to this relational database problem. Okay, so we now have data pulled from three different sources. Not ideal when it comes to performance, but for a squeaky-clean database architecture design, this is what we want in theory. So what can we do? We can actually say: hey, you mentioned that having a separate lookup table for individual expenses, one that just maps an arbitrary code to the expense type, is a good idea. Well, I don't want to do that. I don't want to have a third table. I'm going to delete that third table, and instead I'm just going to have the expense code spelled out for what it is.
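Here is what that consolidated query looks like once the data is normalized, a small runnable sketch (SQLite again, same illustrative table names as before). Answering one business question, "lobbying expenses for New York Association Inc", now requires joining all three tables back together:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
    CREATE TABLE org_details   (ein TEXT PRIMARY KEY, name TEXT);
    CREATE TABLE expense_codes (code INTEGER PRIMARY KEY, expense_type TEXT);
    CREATE TABLE transactions  (ein TEXT, code INTEGER, amount REAL);
    INSERT INTO org_details   VALUES ('12-3456789', 'New York Association Inc');
    INSERT INTO expense_codes VALUES (1, 'lobbying');
    INSERT INTO transactions  VALUES ('12-3456789', 1, 5000.0);
""")

# A single question now fans out across three tables: two joins just to
# recover what one wide row used to hold.
row = cur.execute("""
    SELECT o.name, e.expense_type, t.amount
    FROM transactions  AS t
    JOIN org_details   AS o ON o.ein  = t.ein
    JOIN expense_codes AS e ON e.code = t.code
    WHERE o.name = 'New York Association Inc'
      AND e.expense_type = 'lobbying'
""").fetchone()
print(row)
```

Dropping the lookup table, as described above, removes one of those two joins at the cost of repeating the expense name on every transaction row: the classic normalization-versus-performance trade.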
And what you've done here is traded in a little bit of redundancy, potentially duplicative fields, which is potentially bad in this case, right? You could have repeating expense codes like lobbying, lobbying, lobbying. What happens if you wanted to change "lobbying" to "lobbying services"? Again, you'd have to change that for every row. But it's in favor of a performance boon, because you're no longer joining across three tables; you're joining across just those two. Okay, here's another issue: growing a relational database at scale vertically does have its limits. Traditionally, if you're trying to scan across millions or billions of these charities with a SQL scan, that's very hard and computationally expensive when you have things on the order of billions. Having one massive table, think of this like a spreadsheet with a massive vertical scroll bar because you're trying to capture all that information in one singular table, does have its benefit: it means that all facts are in one place. But if you're talking on the billion-row level, it's not exactly efficient to begin sorting and scanning through these massive vertical tables. So in the relational database realm, again, not BigQuery, and if you're a SQL guru you've probably heard of these before, indexes, these pre-sorted indexes, which are kind of like a separate artifact, were created for common queries. Say you wanted to get the rank order, alphabetically, of all these companies, and you can imagine this at scale, let's say there are 10 billion records here, or however many, and that's a very, very common query. The index optimizes read performance because you can actually read from it much, much faster, because it's pre-sorted.
Hey, we're getting a lot of folks that want to do this normal ranking of these company names, so we're going to create this index so we can actually reference it and have that be much faster. So you're trading read performance, read meaning "SELECT company_name FROM ...", at the expense of write performance. If you had a changing list of company names, you'd have to update that in two different places now. So if New York Association Inc was no more, you'd have to delete that from the organizational details table, and your index would have to be updated as well. So there's a trade-off there. For BigQuery, both the concept of scaling out into multiple different tables and the idea of building these indexes to handle massive amounts of rows and tables are fundamentally handled a completely different way. And we'll talk about that next.
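The read-versus-write trade can be seen directly in a traditional relational engine. This sketch uses SQLite's `EXPLAIN QUERY PLAN` (the table, index name, and row count are all illustrative) to show the alphabetical-ranking query picking up the pre-sorted index once it exists, and it's worth remembering that every subsequent write must now maintain both the table and that index:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE org_details (ein TEXT PRIMARY KEY, name TEXT)")
cur.executemany("INSERT INTO org_details VALUES (?, ?)",
                [(str(i), f"Charity {i:07d}") for i in range(10_000)])

# Without an index, ORDER BY name forces a scan plus a sort of the whole table.
plan_before = cur.execute(
    "EXPLAIN QUERY PLAN SELECT name FROM org_details ORDER BY name").fetchall()

# The index is effectively a pre-sorted copy of the name column: reads get faster...
cur.execute("CREATE INDEX idx_org_name ON org_details(name)")
plan_after = cur.execute(
    "EXPLAIN QUERY PLAN SELECT name FROM org_details ORDER BY name").fetchall()

# ...but every write now maintains two structures: the table AND the index.
cur.execute("DELETE FROM org_details WHERE name = 'Charity 0000042'")

print(plan_before)
print(plan_after)
```

The same query goes from a full sort to a walk of the pre-sorted index, which is the read-performance win; the extra bookkeeping on every insert, update, and delete is the write-performance price.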