Welcome to the big data demo using BigQuery on Google Cloud Platform. Here we're going to show the serverless scaling features of BigQuery, how it scales up automatically behind the scenes, without your intervention, to query large datasets. We're going to query 10 billion rows of Wikipedia data.

First things first, we're going to follow the demo scripts. All of the demos and the code that I'll be running are inside the demos folder in our public repository. First up, we're going to copy the query to the clipboard and navigate to BigQuery. Inside the Google Cloud Platform console I already have BigQuery open, but if you need to navigate to it, open the Navigation menu and scroll all the way down; under Big Data you'll find BigQuery. To promote things so you don't have to continuously scroll and search, I just pin them by starring them. I'm often using AI Platform Notebooks for machine learning work, Cloud Composer for data engineering work, and BigQuery for data analysis work.

Once you're inside of BigQuery, we paste our query into the query editor window. Notice that where you're getting datasets from, where your data actually lives, is under Resources. One of the very popular public datasets available is Wikipedia. It's one of many: you can also get airline data for flights, Reddit data, geographic data, and the Wikipedia benchmark for very large datasets.

If you were given a script, one of my favorite hotkeys is to hold down the Command key on a Mac (on Windows, I believe it's the Control or Windows key). That highlights all of the tables inside your query and turns them into clickable links. If you click on one, you jump straight to that table's schema. It's a great way to iterate between the columns, the details, and the preview of the data on one side and the query results on the other. Again, that's just a cool hotkey. All of the shortcuts I mention are available if you open up that module as well.

Ten billion rows, is it really 10 billion rows? The fastest way to find out is the Details tab. Here we go: 10,600,000,000 rows of Wikipedia data. What type of data are we talking about? What are we actually going to be querying here? The schema is not too wide: the year, month, and day, the Wikimedia project, the language it's in, the title of the Wikipedia page, and how many views it has. It just has a lot of rows.

What are we going to do? What type of operation are we going to run? You can see that when we run our query, it's going to go through 10 billion rows, which is about 415 gigabytes of data. Let's see how fast it does that. It's not only going to return columns, it's going to do a calculation. It's basically saying: give me the language the Wikipedia page was written in, give me the title of that page, and give me the total number of views, for every article where the name Google appears somewhere in the title. It has to be a capital G, because LIKE is case-sensitive; I'll show you how to ignore that in just a second with a function. Of course, any time you're doing aggregations you need the GROUP BY, and we want the pages that have Google somewhere in the title ordered by view count first, which I'm assuming is just going to be a page called Google.
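If you want to follow along without the demo repository, here's a sketch of the kind of query we're running. The exact table and column names are my recollection of the public wikipedia_benchmark dataset, so treat the identifiers as assumptions and check them under Resources in your own project:

```sql
-- Approximate version of the demo query (table/column names assumed
-- from the public wikipedia_benchmark dataset; verify under Resources).
SELECT
  language,
  title,
  SUM(views) AS views
FROM
  `bigquery-samples.wikipedia_benchmark.Wiki10B`
WHERE
  title LIKE '%Google%'  -- case-sensitive match on a capital G
GROUP BY
  language, title
ORDER BY
  views DESC;
```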
But let's go ahead and run this. How long does it take to process about 400 gigabytes? While it's running, remember: you're not a DBA, you're not managing indexes or anything like that, you just have your SQL query. It's not even our dataset; we're just using somebody else's dataset. You can see how long it's been running for. When I recorded this video, we got 10 seconds, 415 gigabytes processed, and here's your insight. It's already reached out and searched the 10 billion pages of Wikipedia data stored here. LIKE is a rather expensive operation: it not only has to look at the column, it has to look inside every string value and find whether Google appears anywhere within it. The percentage-sign wildcard means any characters before and any characters after, and then it sums up the total views. It did that pretty quickly. In total, there are about 214,000 pages with Google somewhere in the name. The most popular page is the English page for Google, the Spanish page for Google shortly after that, and then Google Earth, Google Maps, and Chrome as well.

Now, of course, if you wanted to make this not case-sensitive, one thing you could do is wrap the title in UPPER and uppercase the search term as well, so you're matching like against like. If you're doing wildcard matching with LIKE, it's a good idea to use UPPER or LOWER on both sides, or if you're experienced with regular expressions, you can use those as well.

That's 10 billion, and the really cool thing behind the scenes is on the Execution details tab: you can see how it actually did this. It took you, the human, 10 seconds of just watching it. Behind the scenes, if you stacked up all the work those computers did and ran it serially on a single machine, it would take about two hours and 38 minutes. That's the beauty of the distributed parallel processing that happened behind the scenes. You don't even have to care about how many virtual machines were spun up to do this work, but in aggregate they did almost three hours of work automatically, and they shared a lot of data between themselves as well. You can see the process of going from those 10 billion records, through the aggregations, all the way down to outputting the result that you see there.

That's cool, 10 billion. Let's see if we can do 100 billion. Let's see if we have a dataset; I think it's literally just adding another zero to the table name. Why not go bigger? Again, if you want to get back to that dataset for more information, I can use the same hotkey trick to jump to the schema and details. Cool, we've got a really big dataset: about six terabytes and a lot of records. Same principle, same expensive operation that has to go into every single field. How long do you think it's going to take to go through 100 billion records, open up every single title, and check whether the string Google appears somewhere in that title? Once it has those matches, it has to bring them together from across the other workers and sum them all up. The virtual machines have to communicate with each other when they're doing aggregations; that's where the shuffling step comes into play. And look at how much data it's going to process. In less than a minute, just over 30 seconds, we went through 4.1 terabytes of data and it gave us the result there.
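Here's a sketch of that case-insensitive variant, pointed at the "add another zero" table. Again, the Wiki100B table name is my assumption from the public wikipedia_benchmark dataset, so confirm it under Resources:

```sql
-- Case-insensitive variant against the 100-billion-row table
-- (the Wiki100B table name is assumed; confirm it under Resources).
SELECT
  language,
  title,
  SUM(views) AS views
FROM
  `bigquery-samples.wikipedia_benchmark.Wiki100B`
WHERE
  UPPER(title) LIKE '%GOOGLE%'  -- uppercase both sides of the comparison
  -- or, with a regular expression:
  -- REGEXP_CONTAINS(title, r'(?i)google')
GROUP BY
  language, title
ORDER BY
  views DESC;
```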
You can see almost a full day of computing if you had to do all of that on a single machine, and it doesn't even tell you how many machines were there behind the scenes. That slot time is a phenomenally interesting metric that shows you the scale: you waited 31 seconds, and behind the scenes, machines you don't even have to manage burned through essentially 24 hours of compute, boom, just like that. When you don't need them anymore, you're obviously not paying for those machines; you're just paying for the bytes of data that were processed. That's the demo of BigQuery at scale.
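If you want to pull that slot time and bytes-processed metric yourself instead of reading it off the Execution details tab, one option is BigQuery's INFORMATION_SCHEMA jobs views. A minimal sketch, assuming your jobs ran in the US multi-region and you have permission to read the project-level jobs view:

```sql
-- Recent queries in this project, with wall time, slot time, and bytes scanned.
-- Assumes the `region-us` qualifier; change it to wherever your jobs run.
SELECT
  job_id,
  TIMESTAMP_DIFF(end_time, start_time, SECOND) AS wall_time_seconds,
  total_slot_ms / 1000 AS slot_time_seconds,
  total_bytes_processed / POW(10, 9) AS gb_processed
FROM
  `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE
  job_type = 'QUERY'
  AND creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
ORDER BY
  creation_time DESC
LIMIT 10;
```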