Ingesting raw data into a data lake is one thing, but actually making it useful is a different task altogether, and it will account for probably 80 percent of the work. Data gets ingested in its raw format, and as you've learned, that data may or may not be formatted correctly for your use case. This is where data prep comes in. Prepping your data to be used by tools and services for analytics is a big job. You'll likely spend a majority of your time writing scripts and devising solutions for making the data that you've already collected ready to be analyzed. It's even common to process the same data multiple times. You could use a dataset to generate reports for one use case, but then use that same dataset to train a machine learning model in a totally different use case. It's unlikely that these two use cases would require the data to be in the same format, which means processing the same data multiple times to fit each use case.

A data lake can be seen as a central repository for data. Because of this, you cannot assume that you will be the only one using that data over time. Remember the idea of future-proofing? It applies here as well. Even if you don't have a use case for some of your data yet, you may want to draw insights from it in the future, and even if you're the only one using the data now, others might find a use for it later. Therefore, leaving data in its raw format is a great idea. Others can prep the data the way that they see fit, starting from the raw source.

Since it's likely you'll need to process the data in your data lake at least once, you want to treat the original copy of the data as immutable, meaning that it cannot be changed once it's been uploaded. You can make copies of the data, but the original data that was ingested remains untouched. You can use S3 lifecycle policies to move raw data to more cost-effective storage tiers as it becomes more infrequently accessed over time, and you can use S3 features like Object Lock to ensure that no one can modify the raw data assets; we'll sketch an example of a lifecycle rule in a moment. Any changes to the data are made and then saved into another S3 location used specifically for prepped data. The analytical tools then read from the prepped data location, not from the raw source location. Leaving the raw source untouched allows you to format it in different ways depending on who is using the data.

There are multiple types of data prep that you might create processes for. The first is shaping. Examples of shaping data are selecting only the fields you need for analysis, combining fields or files into one, or otherwise transforming and aggregating data with regard to its shape or schema. For example, you might find that the data is originally ingested across multiple files, but in order to analyze it efficiently, it makes sense to join the data across those files into one main file. Another type of data prep is blending. This is taking datasets that were originally ingested using different schemas or formats, and then changing or tweaking them so that the schema or format matches in order to analyze the data together. Finally, there's data prep that falls under the cleaning category. This is when you fill in missing values in a dataset, resolve data conflicts, or normalize the data to common definitions. With data lakes, you'll likely be doing all of these types of data prep at one time or another.
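To make the lifecycle idea concrete, here is a minimal sketch of what setting a rule like that might look like with boto3. The bucket name and the raw/ prefix are hypothetical, and the transition windows are just placeholders; your tiering choices would depend on your own access patterns.

```python
import boto3

s3 = boto3.client("s3")

# Transition objects under the (hypothetical) raw/ prefix to cheaper storage classes as they age.
s3.put_bucket_lifecycle_configuration(
    Bucket="my-data-lake",  # hypothetical bucket name
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "tier-raw-zone",
                "Filter": {"Prefix": "raw/"},
                "Status": "Enabled",
                "Transitions": [
                    {"Days": 90, "StorageClass": "STANDARD_IA"},
                    {"Days": 365, "StorageClass": "GLACIER"},
                ],
            }
        ]
    },
)
```

Note that S3 Object Lock is configured separately from lifecycle rules and generally has to be enabled when the bucket is created.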
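To give you a rough feel for what shaping, blending, and cleaning look like in practice, here is a hedged PySpark sketch. The bucket, prefixes, column names, and the legacy feed are all hypothetical; the point is only to show the three categories side by side, with the output landing in a separate prepped location while the raw zone stays untouched.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("prep-orders").getOrCreate()

# Shaping: keep only the fields needed for analysis (paths and columns are hypothetical).
orders = (spark.read.json("s3://my-data-lake/raw/orders/")
          .select("order_id", "customer_id", "amount", "order_date"))

# Blending: a legacy feed uses a different schema, so rename and cast until it matches.
legacy = (spark.read.csv("s3://my-data-lake/raw/legacy-orders/", header=True)
          .withColumnRenamed("OrderID", "order_id")
          .withColumnRenamed("CustomerID", "customer_id")
          .withColumn("amount", F.col("Total").cast("double"))
          .withColumnRenamed("Date", "order_date")
          .select("order_id", "customer_id", "amount", "order_date"))
combined = orders.unionByName(legacy)

# Cleaning: fill missing values and drop conflicting duplicate records.
cleaned = combined.fillna({"amount": 0.0}).dropDuplicates(["order_id"])

# Write to the prepped zone; the raw objects are never modified.
cleaned.write.mode("overwrite").parquet("s3://my-data-lake/prepped/orders/")
```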
With the scale of data that we are talking about here, using automation to prep data is key. Knowing that, what tools exist for data prep in AWS that can be automated? Well, you can use AWS Glue jobs to automatically run scripts that transform data, either based on triggers or on a schedule, or you can use AWS Lambda functions that get triggered as data is uploaded to your data lake. Just know that spending a lot of time setting up the processes for data prep is normal, and it will account for a large percentage of the work involved with setting up your data lake. As a best practice, create transformation processes that can be automated whenever possible. Lambda is best used for transforming real-time data, since you can trigger Lambda functions to run as data comes in, while Glue jobs are best suited for processing data in batches.

AWS Glue has three main components: the Glue Data Catalog, the crawlers and classifiers, and Glue jobs. Since we have already covered the Data Catalog and the crawlers and classifiers in a previous lesson, let's focus on Glue jobs. A job is the business logic that performs the ETL work in AWS Glue. When you author a job, you provide details about data sources, targets, and some other information, and the result is a generated Apache Spark (PySpark) script. Then when you run the job, Glue runs the script that extracts the data from the sources, transforms it, and loads it into the targets.

This diagram shows the process of creating a job, with steps labeled 1 through 6; we're going to go over each one. First, you must choose a data source for your job. The tables that represent your data source must already be defined in your data catalog, and the data source is most likely the raw data before transformation. Next, you choose a data target. This is where the transformed data will be loaded. You can either designate a table from the data catalog to be the target, or you can have the job create a new table when it runs. These tables can point to data stores like a DynamoDB table, an S3 bucket, or a database that requires a JDBC connection.

Once the source and target are defined, you provide customized configurations for your job, and then a PySpark script gets generated for you automatically. Examples of items you can configure for your job include the type of job environment to run in: you could run your job as an Apache Spark ETL script, a Spark streaming script, or a Python shell script, depending on your use case. You can choose to let Glue generate a script for you, or you can provide an S3 location with a script that you have already written for the job to use. Another configuration you provide is the transformation type: change schema, which allows you to change the schema of the source data and create a new target dataset, or find matching records. You also configure logging and monitoring requirements.

After you provide all of your configurations, AWS Glue generates a PySpark script, and you can then edit the script to add things like transforms or whatever other PySpark code you want. Glue provides a set of built-in transforms that you can use to process your data, and you can call these transforms from your ETL script. Some examples of the built-in transforms Glue provides are dropping null fields, filtering records, joining datasets, mapping fields from source to target, and more.
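To ground that, here is a minimal sketch of what a Glue ETL script using a few of those built-in transforms might look like. The database, table, column names, and S3 path are hypothetical, and a script that Glue generates for you will look somewhat different, but the overall shape (read from the catalog, apply transforms, write to a target) is the same.

```python
import sys
from awsglue.transforms import ApplyMapping, DropNullFields, Filter
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

# Standard Glue job boilerplate: resolve the job name and set up the Glue and Spark contexts.
args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Source: a table a crawler already registered in the Data Catalog (names are hypothetical).
source = glue_context.create_dynamic_frame.from_catalog(
    database="raw_db", table_name="orders_raw"
)

# Built-in transforms: map source fields to target fields, drop null fields, filter bad records.
mapped = ApplyMapping.apply(
    frame=source,
    mappings=[
        ("order_id", "string", "order_id", "string"),
        ("amount", "string", "amount", "double"),
    ],
)
no_nulls = DropNullFields.apply(frame=mapped)
valid = Filter.apply(
    frame=no_nulls,
    f=lambda row: row["amount"] is not None and row["amount"] > 0,
)

# Target: write the prepared data to the prepped zone of the lake as Parquet.
glue_context.write_dynamic_frame.from_options(
    frame=valid,
    connection_type="s3",
    connection_options={"path": "s3://my-data-lake/prepped/orders/"},
    format="parquet",
)
job.commit()
```

ApplyMapping, DropNullFields, and Filter are three of the built-in transforms mentioned above; the full list is linked in the class notes.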
Check out the class notes for a link to view all of the built-in transforms that are available. If you are a PySpark expert, remember that you can always provide your own script to a job by uploading it to S3 and then pointing the job to that script. If you are not an expert, you can use the generated script that the job provides as a starting point, which you can then modify to fit your specific use case. After you have edited the job to your liking, you determine when you want it to run and configure a trigger or set up a schedule for it; there's a sketch of a scheduled trigger at the end of this section. Finally, when all of this is done, you will have a script. In the AWS Glue console, that script is represented as code that you can both read and edit. Glue jobs give beginners a great starting point for working with PySpark for the first time, and the script runs on what is essentially a managed Apache Spark environment. There's a lot happening under the hood that we didn't discuss in this video. Check out the class notes for more.
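As one last hedged sketch, here is roughly what configuring a scheduled trigger for a job could look like with boto3. The trigger name, job name, and cron expression are hypothetical, and you can set up the same thing from the Glue console instead.

```python
import boto3

glue = boto3.client("glue")

# Run the (hypothetical) "prep-orders" job every night at 2 AM UTC.
glue.create_trigger(
    Name="nightly-prep-orders",
    Type="SCHEDULED",
    Schedule="cron(0 2 * * ? *)",
    Actions=[{"JobName": "prep-orders"}],
    StartOnCreation=True,
)
```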