Alright, in this demo I am going to show you how to use AWS Glue to transform CSV files into Parquet files; in other words, to transform a file that uses a row-based format into a file that uses a column-based format. Then I'm going to run a Glue crawler against the Parquet data set to create a table in the Glue Data Catalog, and I am going to run the same query against both tables. You're going to see the difference between the query that runs against the CSV data set and the query that runs against the Parquet data set.

So if we go to Athena, we have the Athena table that had been created pointing to the CSV file, the table created by the crawler demo you saw last week. If we click here on those three dots and click Generate Create Table DDL, it will run a query against Athena that shows you the CREATE TABLE command, so you can see that it is using a SerDe for CSV, right, and that it points to the location s3://tlc-raf-demo, which is where I have that CSV file.

Now I want to run a Glue job. So let's go to AWS Glue and find Jobs here on the sidebar. I am going to add a new job, and this job will be called csv-to-parquet. I need to use an IAM role, and I already have an IAM role previously created here, called AWSGlueServiceRole-GlueJobs. It is an IAM role like the other one I showed you, but this one is also allowed to put objects, because what this Glue job will be doing is getting objects from bucket A and putting objects into bucket B. So this role is allowed to get objects and put objects, and it is also allowed to use KMS keys if needed. In this case I'm not doing encryption, so I don't need the role to interact with the Key Management Service, KMS, but it is there just in case.

The job is going to run a Spark job, and here I can choose the Spark version, the Glue version. And here is the most interesting part: I want Glue to propose a PySpark script for me based on what I click in this wizard, and this is where the script will be stored. Here you can see the monitoring options: you can enable job metrics, continuous logging, and the Spark UI, because under the hood what will be happening is the execution of a PySpark script on an EMR cluster. It is fully serverless, so you don't see that cluster, but you do have access to the logging in a CloudWatch log stream that you choose, or in an S3 bucket that you choose. Here I'm going to click Next.

Now I should choose a data source, which in this case is a table in the Glue Data Catalog. It is the only table I have so far, the table created by the crawler you saw last week; it points to s3://tlc-raf-demo and the classification is CSV. Then I should choose a transformation type, in this case Change schema, and I need to choose a data target. I will ask the Glue job to create a new table as the data target; the data store is Amazon S3 and the format is Parquet. This is where the job will do the data transformation from whatever it gets, which is CSV, to the format I choose, which is Parquet. Here I choose my target path: I am going to save the output of this Glue job under the parquet_files prefix in my S3 bucket. This is also where I could do some transformation on the content of the files themselves: I can reorder the fields in the output files, and I can remove some fields as well.
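All of these wizard choices, the job name, the IAM role, the Glue version, where the script is stored, and the monitoring options, end up as a job definition, and you could also create it through the API instead of the console. Here is a minimal boto3 sketch of such a definition; the role name, Glue version, and script location are illustrative assumptions, not necessarily the exact values used in this demo.

```python
import boto3

glue = boto3.client("glue")

# Illustrative job definition roughly mirroring the console wizard choices.
glue.create_job(
    Name="csv-to-parquet",
    Role="AWSGlueServiceRole-GlueJobs",  # assumed role name; needs s3:GetObject and s3:PutObject
    GlueVersion="2.0",                   # assumed; determines the Spark version under the hood
    Command={
        "Name": "glueetl",               # a Spark (PySpark) ETL job
        "ScriptLocation": "s3://my-glue-assets/scripts/csv-to-parquet.py",  # placeholder path
        "PythonVersion": "3",
    },
    DefaultArguments={
        "--enable-metrics": "",                        # job metrics
        "--enable-continuous-cloudwatch-log": "true",  # continuous logging
        "--enable-spark-ui": "true",                   # Spark UI monitoring
        "--spark-event-logs-path": "s3://my-glue-assets/spark-logs/",  # placeholder path
    },
)
```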
I will leave the field mapping as it is, because the main objective here is to have a one-to-one conversion from CSV to Parquet, and then to run the same Athena query against two different tables pointing to two different data sets containing the very same data. Save job and edit script.

Now notice something interesting here: this is a PySpark script, and as you can see, you don't need to have PySpark knowledge to operate it, at least at a very basic level. If you click here, you can reorder those elements, and there is a what-you-see-is-what-you-get kind of graphical user interface. If you click here on the transform named ResolveChoice, it highlights the part of the PySpark code that will be doing that specific action for you. So it is applying the mapping, it is resolving choices, and it is dropping null fields. It is storing to the path s3://tlc-raf-demo/parquet_files, and you can see that this is the actual line doing the write: glueContext.write_dynamic_frame.from_options. A DynamicFrame is a PySpark-related concept, but we don't need to understand PySpark to operate this at this basic level, which is just a CSV-to-Parquet transformation. And here are the connection options, with format set to parquet, because that's what we chose. So: save, and run the job. Now we can grab a coffee, sit in our chair and wait a little bit while the job runs.

You can see the Glue job running when we click here, expand it and look at the history: the job is in the Running state. As I said, the logs and the error logs are available in CloudWatch Logs log streams, so if we click here on the logs or on the error logs, you will see the logging information for this job in CloudWatch Logs. This job is spinning up a Hadoop cluster under the hood, so it takes a couple of minutes; the log group does not exist yet because this is the first time I'm running this job. It is asynchronous, it is creating the Hadoop job right now, and it should take around five to six, maybe seven minutes to complete.

After two minutes and nine seconds, it was actually quicker than I thought: the job succeeded, and I can see that from the run status here. Now, if I click here on logs and error logs, let me click on the logs for you to see: I am taken to the CloudWatch Logs console, and here I have all the logging information about that specific job run. If you scroll up, you can see there are things that actually relate to a Hadoop cluster; if you're familiar with Hadoop, you will recognize the type of information available in these logs.

But the most interesting thing I want to show you is this: if I go to my S3 bucket, the bucket I am using for my data set in this example, and click on tlc-raf-demo, there is now a new folder in that bucket, a new prefix, which is parquet_files. So what this job did was basically take this CSV file here, 733 megabytes, and store it as columnar files under this prefix. You can see that it split the data into multiple chunks, which is a good practice for Athena queries and data lakes, because it allows the data set to be fetched in parallel by multiple download threads. And you can see that the total size is about 150 megabytes.
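Before moving on, it helps to see the overall shape of that generated script in one place. The sketch below is a condensed, illustrative version, not the exact code Glue produced: the database and table names and the column mappings are placeholders, and the real generated script lists every field from the crawled table with the types the crawler inferred.

```python
import sys
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from awsglue.transforms import ApplyMapping, ResolveChoice, DropNullFields

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glueContext = GlueContext(SparkContext())
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Data source: the table the first crawler created in the Glue Data Catalog
# (database and table names are placeholders).
datasource0 = glueContext.create_dynamic_frame.from_catalog(
    database="tlc", table_name="tlc_raf_demo"
)

# The transforms highlighted in the console editor: apply the field mapping,
# resolve ambiguous column types, and drop null fields.
applymapping1 = ApplyMapping.apply(
    frame=datasource0,
    mappings=[
        ("fare_amount", "double", "fare_amount", "double"),
        ("tip_amount", "double", "tip_amount", "double"),
        # ...one tuple per column in the real script
    ],
)
resolvechoice2 = ResolveChoice.apply(frame=applymapping1, choice="make_struct")
dropnullfields3 = DropNullFields.apply(frame=resolvechoice2)

# Data target: write the DynamicFrame out as Parquet under the parquet_files prefix.
glueContext.write_dynamic_frame.from_options(
    frame=dropnullfields3,
    connection_type="s3",
    connection_options={"path": "s3://tlc-raf-demo/parquet_files"},
    format="parquet",
)

job.commit()
```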
So this specific data set went from roughly 700 megabytes of CSV down to around 150 megabytes of Parquet, which is about five times less. In a storage service where you pay per size, like Amazon S3, you can see how that alone would impact your budget: it's five times smaller. But the most interesting part comes when we want to run queries.

Before running queries, though, we need to crawl that new data set, parquet_files, and have a Glue crawler create another table in the Glue Data Catalog. So let's quickly do this: Crawlers. I could reuse the same crawler I used last week, but I'm going to add a new one. This second crawler will be called TLC-Crawler-Parquet, because it is a crawler that will crawl the Parquet files, infer the schema, and create the Glue table in the Glue Data Catalog. The path is s3://tlc-raf-demo/parquet_files, so I am instructing the crawler to go to this bucket, under the parquet_files prefix, which is exactly where my Parquet files are located. Next; no other data store. Choose an existing IAM role, the crawler role I had created in the previous demo. Next; run on demand. Same database, TLC. Next, Finish. TLC-Crawler-Parquet, Run crawler.

The crawler is actually faster than the job, because the crawler just does a GetObject, a partial GetObject, reading the first hundreds or couple of thousand lines, and creates the schema, the Glue table in the Glue Data Catalog, which can be understood as an Athena table. Here it is starting; if we refresh, it doesn't take too long, and the crawler is already stopping: as we can see, it took one minute to run and it added one table to my Glue Data Catalog. Let's go to the Glue tables. We should have two tables here: the parquet_files one and the tlc-raf-demo one. This one points to CSV, and this one points to Parquet. If we click on the Parquet table, we can see the table properties, and we can see that it is using the Parquet SerDe. Remember the story where I said that if you try to play a DVD in a VHS player you will have errors, because you're not using the proper SerDe? Here we have a data set that is in Parquet and we are using the Parquet SerDe, right? So we should not have errors when we go to Athena and query this table.

Now let's go to Athena and query the table. These are the two tables. For the first query, I already have the query on my clipboard, so I just paste it: select the average of fare_amount, which is one of the fields in this data set, and also the average of tip_amount. Remember, this is New York City TLC, Taxi and Limousine Commission, trip record data, so we have fields like passenger count, fare amount, and tip amount. We run this query against the CSV table, and here we have it: for this specific data set, which covers June 2018, the average fare amount was $13.25 and the average tip amount was $1.90. That is roughly 15 percent, basically the 15 to 18 percent that people in New York City usually tip taxi drivers.

Now let's run the very same query against the Parquet data set. I just replace the table with the table called parquet_files. Run query; the query is running, and you can already see that it ran faster, right? And it gives the same results. Now, if we go here in Athena and click on History, we can see that the query made against the CSV data set scanned 800, almost 900 megabytes, as you can see here, and it took 2.21 seconds. The query made against the Parquet data set ran faster, and, here is the most important thing, it scanned only about 20 megabytes. That's because Athena only had to read the data for the specific columns used in the query.
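By the way, the numbers shown in the History tab, runtime and data scanned, are also exposed through the Athena API, so the same before-and-after comparison can be scripted. Here is a minimal boto3 sketch; the database name, the table names, and the results bucket are illustrative placeholders rather than the exact names from this demo.

```python
import time
import boto3

athena = boto3.client("athena")

QUERY = "SELECT avg(fare_amount) AS avg_fare, avg(tip_amount) AS avg_tip FROM {table}"

# Run the same aggregate query against the CSV-backed and the Parquet-backed table.
for table in ("tlc_raf_demo", "parquet_files"):
    execution_id = athena.start_query_execution(
        QueryString=QUERY.format(table=table),
        QueryExecutionContext={"Database": "tlc"},
        ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},  # placeholder bucket
    )["QueryExecutionId"]

    # Poll until the query finishes, then read its execution statistics.
    while True:
        execution = athena.get_query_execution(QueryExecutionId=execution_id)["QueryExecution"]
        if execution["Status"]["State"] in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(1)

    stats = execution["Statistics"]
    print(table,
          "engine time (ms):", stats.get("EngineExecutionTimeInMillis"),
          "data scanned (bytes):", stats.get("DataScannedInBytes"))
```

On the CSV table this scans the whole file; on the Parquet table Athena reads only the column chunks for fare_amount and tip_amount, which is exactly the difference we just saw in the console.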
So that's it. That is how using Glue jobs to transform your files from row-based formats like CSV into column-based formats like Parquet is very interesting and very important for your data lake architectures. See you in the next one, and thank you for watching.