Let's look at some of the challenges that a data engineer faces. As a data engineer, you will typically encounter a few problems when building data pipelines. You might find it difficult to access the data that you need. You might find that the data, even after you can access it, doesn't have the quality required by the analytics or machine learning model. Even if the data quality is there, you might find that the transformations require computational resources that aren't available to you. Finally, you might run into challenges around query performance: being able to run all of the queries and all of the transformations that you need with the computational resources that you actually have.

Let's take the first challenge. This is the challenge of consolidating different datasets and different data formats, and managing access to your data at scale. For example, say you want to compute a customer acquisition cost: how much does it cost, in terms of marketing, promotions, and discounts, to acquire a customer? That data might be scattered across a variety of marketing products, customer relationship management software, and so on, and finding a tool that can analyze all of this data can be difficult because it comes from different organizations, it comes from different tools, it has different schemas, and maybe some of that data is not even structured. So to answer something as essential to your business as how much acquiring a new customer costs, so that you can figure out what kind of discounts to offer to keep customers from leaving, you can't have your data sitting in these kinds of silos.

So what makes data access so difficult? Primarily, it's the fact that data in many businesses is siloed department by department, and each department creates its own transactional systems to support its own business processes. For example, you might have operational systems that correspond to your stores, a different operational system maintained by the product warehouses that manage your inventory, and then a marketing department that manages all the promotions. Now, suppose you need to do an analytics query such as: give me all the in-store promotions for recent orders and their inventory levels. Given such a query, you need to know how to combine data from the stores, data about the promotions, and data on inventory levels. Because these are all stored in separate systems, and usually some of these systems will have restricted access, building an analytics system that uses all three of these datasets to answer an ad hoc query like this can be quite difficult.

The second challenge is that cleaning, formatting, and getting the data ready for insights requires that you build extract-transform-load, or ETL, pipelines. ETL pipelines are usually necessary to ensure data accuracy and quality. The cleaned and transformed data are typically stored not in a data lake, but in a data warehouse. A data warehouse, just like a data lake, is a consolidated place, but this time the data being stored is all easily joinable and queryable. Unlike a data lake, where the data is in a raw format, in a data warehouse the data is stored in a way that makes it very efficient to query.
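To make that ad hoc query concrete, here is a minimal sketch of what answering it could look like once the stores, promotions, and inventory data have been consolidated into one queryable place. The table and column names (orders, promotions, inventory, order_ts, promo_code, product_id) are invented for illustration, not taken from any particular system.

```python
# Hypothetical sketch: answering "in-store promotions for recent orders,
# plus their inventory levels" once the three siloed datasets have been
# consolidated. All column names here are made up for the example.
import pandas as pd

def recent_in_store_promotions(orders: pd.DataFrame,
                               promotions: pd.DataFrame,
                               inventory: pd.DataFrame,
                               since: pd.Timestamp) -> pd.DataFrame:
    """Combine store orders, marketing promotions, and inventory levels."""
    recent = orders[orders["order_ts"] >= since]          # from the store systems
    promoted = recent.merge(promotions, on="promo_code")  # from the marketing system
    return promoted.merge(inventory, on="product_id")     # from the inventory system

# Example usage, with the three datasets already loaded into DataFrames:
# result = recent_in_store_promotions(orders, promotions, inventory,
#                                     since=pd.Timestamp("2024-06-01"))
```

The joins themselves are trivial; the hard part, as described above, is getting all three datasets out of their silos and into a form where a join like this is even possible.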
Because data becomes useful only after you clean it up, you should assume that any raw data you collect from source systems has to be cleaned and transformed. And if you're going to be transforming it, you might as well transform it into a format that makes it efficient to query. In other words, ETL the data and store it in a data warehouse.

So let's say you're a retailer and you have to consolidate data from multiple source systems. Think about what the use case is. If the use case is to get the best-performing in-store promotions in France, then you need to get data from the stores and data about the promotions, and then maybe you figure out that the store data is missing some information. Maybe some of the transactions are in cash, and for those cash transactions there's no information on who the customer is. Or some transactions might be spread over multiple receipts, and you need to combine them because they come from the same customer. Or perhaps the transaction timestamps are stored in local time, whereas your stores are spread all across the globe, and so before you can do anything, you need to convert everything into UTC. Similarly, the promotions may not be stored in the transaction database at all. They might be just a text file that somebody uploads to the website, containing a list of codes that the web application uses to apply discounts. It can be extremely difficult to run a query like getting the best-performing in-store promotions, because the data, as we've seen, has so many unique problems. Whenever you have data like this, you need to take the raw data and transform it into a form with which you can actually carry out the necessary analysis. It's obviously best if you can do this cleanup and consolidation just once, and store the resulting clean data to make further analysis easy, and that is the point of a data warehouse.

If you need to do so much consolidation and cleanup, a common problem that arises is: where do I carry out this compute? The availability of computational resources can itself be a challenge. Why? On an on-premises system, data engineers need to manage server and cluster capacity and make sure that enough capacity exists to carry out the ETL job by the time you need it to be finished. The problem is that the compute needed by any specific ETL job is not constant over time. Very often, it varies week to week, depending on factors like holidays and promotional sales. This means that when traffic is low, you're wasting money because you have computers out there doing nothing, and when traffic is high, those computers are so busy that your jobs take way too long.

Once your data is in your data warehouse, you need to optimize the queries that your users are running to make the most efficient use of your compute resources. This is difficult if you're managing an on-premises data analytics cluster, because you will be responsible for choosing a query engine, installing the query engine software and keeping it up to date, as well as provisioning more servers for additional capacity. Isn't there a better way to manage server overhead, so that we as a business can focus on insights?
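Coming back to the retailer example, here is a rough sketch of the kind of cleanup an ETL step might perform before loading store transactions into the warehouse: converting local timestamps to UTC, flagging cash transactions with no customer ID, rolling up transactions split across receipts, and reading promotion codes from a plain text file. The column names, the store-to-timezone lookup, and the promotions file format are all assumptions made for illustration, not a real system's schema.

```python
# Hypothetical sketch of ETL cleanup for store transactions.
# Column names (txn_ts_local, store_id, customer_id, visit_id, amount)
# and the timezone lookup are invented for the example.
import pandas as pd

# Hypothetical mapping from store to its local timezone.
STORE_TIMEZONES = {"paris_01": "Europe/Paris", "nyc_05": "America/New_York"}

def clean_transactions(raw: pd.DataFrame) -> pd.DataFrame:
    df = raw.copy()

    # Convert each store's local timestamps to UTC so stores across the
    # globe can be compared on a single timeline.
    df["txn_ts_utc"] = [
        pd.Timestamp(ts).tz_localize(STORE_TIMEZONES[store]).tz_convert("UTC")
        for ts, store in zip(df["txn_ts_local"], df["store_id"])
    ]

    # Cash transactions may have no customer ID; keep them but flag them.
    df["has_customer"] = df["customer_id"].notna()

    # Roll up transactions split across multiple receipts from one visit.
    return (df.groupby(["store_id", "customer_id", "visit_id"], dropna=False)
              .agg(total_amount=("amount", "sum"),
                   first_txn_utc=("txn_ts_utc", "min"),
                   has_customer=("has_customer", "any"))
              .reset_index())

def load_promo_codes(path: str) -> set[str]:
    # The promotions aren't in a database at all, just a text file of
    # discount codes, one per line.
    with open(path) as f:
        return {line.strip() for line in f if line.strip()}
```

A job like this would typically run on a schedule, and as discussed above, the compute it needs varies with traffic, which is exactly why provisioning fixed on-premises capacity for it is so awkward.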