Hello, in this lesson you will learn some important aspects of data storage: namely, data models, file formats, and compression. We are not going to dive deeply into these topics, but rather explore what trade-offs are available and how they affect your data applications.

In previous lessons, you have learned about HDFS, a cluster file system designed to store large files. However, files are kept there for a purpose. Later on, you will learn how to use MapReduce and Spark to make use of the data and to write scalable computations. Today, we are going to focus on something in between.

If you think about what you have learned about HDFS already, you will notice that there is a mismatch between the terms used to define business tasks and the terms used to describe what HDFS is. For example, imagine yourself running a real-time bidding platform. Imagine that your business task is to compute a click-through rate, that is, the ratio of clicks to impressions for every ad. The task is formulated in domain-specific terms such as clicks and impressions, which are just abstractions of your data. Meanwhile, HDFS requires you to treat your data as bytes stored in files, and provides you the means to read and write these bytes. So there must be something that bridges bytes and clicks. Data modeling and data management are concerned with these issues.

As I mentioned earlier, data modeling and data management are broad topics. We are not going to cover them in detail throughout this course. However, the decisions you make on how you treat your data have far-reaching implications for the correctness, performance, flexibility, and maintainability of your applications. In this lesson, you will deal with some basics, just to give you a sense of what trade-offs are there.

Let's start with data modeling. A data model is the way that you think about your data elements: what they are, what domain they come from, how different elements relate to each other, what they are composed of, and similar questions. Probably you have heard about relational databases and the relational data model. The term stems from relational algebra, where a data set is a set, called a table, of tuples, called rows, and every tuple is composed of simple values such as numbers or strings. The position of a value within a tuple is a column, and the column defines the value's semantics.

Getting back to our RTB example: your platform keeps a log of impressions, events when an ad is presented to a user, and clicks, events when a user clicks on the ad. Every impression and every click is characterized by a timestamp, a user ID, the displayed ad ID, and maybe some other data. You can think of this log as a growing table with columns for the event type (impression or click), timestamp, user ID, ad ID, and so on. That's how you can represent, or model, your data in a relational way.

Another example. Imagine you are running an IMDb clone, and your business task is to maintain knowledge of how movies, actors, directors, producers, titles, and so on are related to each other. You may want to represent this data in a slightly different fashion, using a graph data model. A graph consists of vertices and edges. You can use vertices to represent entities: movies, actors, directors, titles, and so on. And you can use edges to represent relations between entities. For example, an edge from the Keanu Reeves vertex to The Matrix vertex may encode the fact that Keanu was acting in the movie.
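To make the relational view of the RTB log a bit more concrete, here is a minimal sketch in Python; the field names and sample values are made up purely for illustration:

```python
from collections import Counter

# Each event is one "row"; the tuple positions (columns) carry the meaning:
# (event_type, timestamp, user_id, ad_id)
events = [
    ("impression", 1633036800, "u1", "ad42"),
    ("impression", 1633036805, "u2", "ad42"),
    ("click",      1633036807, "u2", "ad42"),
    ("impression", 1633036810, "u3", "ad7"),
]

# Click-through rate per ad: clicks divided by impressions.
impressions = Counter(ad for etype, _, _, ad in events if etype == "impression")
clicks = Counter(ad for etype, _, _, ad in events if etype == "click")

ctr = {ad: clicks[ad] / n for ad, n in impressions.items()}
print(ctr)  # {'ad42': 0.5, 'ad7': 0.0}
```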
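The graph model can be sketched just as easily: vertices are entities, and edges are labeled relations between them. Again, the data here is only illustrative:

```python
# Vertices are entities; each edge is a (relation, target) pair.
graph = {
    "Keanu Reeves": [("acted_in", "The Matrix"), ("acted_in", "John Wick")],
    "The Matrix":   [("directed_by", "Lana Wachowski")],
}

# Follow edges from a vertex to answer "which movies did Keanu act in?"
movies = [target for relation, target in graph["Keanu Reeves"] if relation == "acted_in"]
print(movies)  # ['The Matrix', 'John Wick']
```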
There are other ways to model your data, and there are ways to convert data between different models. The data model defines the way you structure your data, and hence makes some things easier to express than others. Throughout this lesson, we will stick with the relational data model, as it is the most popular one, especially for analytics.

A couple of words about unstructured data. Technically speaking, there is no such thing as unstructured data. At the most basic level, your data is always structured, at least as a sequence of bytes. The question is: is your data structured enough for your task? For example, it is quite easy to parse web server logs and compute the total traffic through a website; the logs are structured enough for this task, irrespective of the particular storage format. To compare, consider using video data to count the daily number of visitors in a store. Videos are structured as a sequence of frames, where each frame is an ordered set of pixels, and every pixel is just a triple of RGB color intensities. However, this structure is useless if you are willing to count people in the video. The hard job here is to do the image recognition and bring the appropriate structure to the data, so that solving the counting problem becomes easier. Quite often, the term unstructured data is used to denote the complexity of bringing data into a useful form for a particular application.

Okay, now on to data management. We are interested in a particular aspect of it in this lesson: how to store and how to organize your data physically. In Hadoop, this is a matter of the file format, or storage format; the terms are used interchangeably. There are many ways you can lay out data, and different choices lead to different trade-offs in application complexity, and thus affect performance and correctness. If you have ever heard about CSV, XLS, or JSON, those are examples of file formats. Their function is to define how to transform raw bytes into in-memory data structures and vice versa. This process is called deserialization, the reverse of serialization, which converts data structures into raw bytes.

There are many formats available. They differ in their space efficiency, encoding and decoding speeds, supported data types, splittable or monolithic structure, and extensibility for the future. Let's take a more detailed look.

Space efficiency. Different formats use different coding schemes, which directly affect the consumed disk space. The most efficient ones can use an order of magnitude less disk space than the others. Consuming less disk space, in turn, cuts your storage costs down.

Encoding and decoding speeds. Of course, space savings come at the expense of extra computation required to operate on the data, and therefore increase timings. Also, sometimes a poorly chosen storage format just adds the extra work of converting between similar representations. A notable example here is storing numbers in textual form. Parsing integers requires sophisticated code with loops and conditionals, or lookup tables. At scale, parsing imposes significant overhead and turns into wasted CPU time.

Supported data types. Some formats are capable of preserving type information during serialization and deserialization, while others expect the user to serialize and deserialize data to basic data types, usually strings. Some formats are strict at write time and enforce constraints on the data, while others are not. In the latter case, once again, it is the user's responsibility to validate the data and check constraints.
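To see what serialization and deserialization look like in code, and how much a textual encoding of numbers can cost, here is a minimal sketch using Python's standard json and struct modules; the numbers are arbitrary:

```python
import json
import struct

values = list(range(1000, 2000))  # a thousand integers

# Textual serialization: numbers become digit characters that must be parsed back.
as_text = json.dumps(values).encode("utf-8")
back_from_text = json.loads(as_text)          # deserialization: bytes -> data structures

# Binary serialization: each integer is stored as a fixed 4-byte value.
as_binary = struct.pack(f"{len(values)}i", *values)
back_from_binary = list(struct.unpack(f"{len(values)}i", as_binary))

assert back_from_text == back_from_binary == values
print(len(as_text), len(as_binary))  # the textual form takes roughly 1.5x more bytes here
```

Besides being smaller here, the binary form also skips the digit-by-digit parsing mentioned above.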
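And a small illustration of the supported data types point: after a CSV round trip, every field comes back as a string, so converting and validating it is the user's job, while a type-preserving textual format such as JSON returns the original integer:

```python
import csv
import io
import json

row = [1633036807, "u2", "ad42"]

# CSV keeps no type information: after a round trip every field is a string,
# and turning the timestamp back into an integer is up to the user.
buf = io.StringIO()
csv.writer(buf, lineterminator="\n").writerow(row)
buf.seek(0)
csv_row = next(csv.reader(buf))
print(csv_row)   # ['1633036807', 'u2', 'ad42']

# JSON distinguishes numbers from strings, so the timestamp survives as an int.
json_row = json.loads(json.dumps(row))
print(json_row)  # [1633036807, 'u2', 'ad42']
```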
Splittable or monolithic structure. This property allows you to extract a subset of data without reading the entire file. We typically expect data to be splittable; this follows from our data model. The question is how exactly to implement the file splitting, and how to do it efficiently. Think of compression or encryption as notable counterexamples.

Extensibility. Once you deploy your first data application, you start to face compatibility issues. For example, you should carefully think about whether the existing code will break or continue working when you add just one more field to your data. Some formats tolerate schema changes easily, while others do not.

To conclude, deciding on a data model and storage format has far-reaching implications for your application's performance, correctness, computation complexity, and resource usage. In the next videos, we will take a closer look at formats commonly used in the Hadoop world. See you.
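One last sketch, on the splittability point from above: a plain newline-delimited text file is splittable because a reader can start at an arbitrary byte offset and resynchronize at the next newline, which is roughly how line-oriented input splitting works in Hadoop-style systems. The file name and offsets below are hypothetical:

```python
def read_split(path, start, end):
    """Read the lines assigned to the byte range that starts at `start` and ends near `end`."""
    records = []
    with open(path, "rb") as f:
        f.seek(start)
        if start > 0:
            f.readline()  # skip the (possibly partial) first line; the previous split reads it
        while f.tell() <= end:
            line = f.readline()  # the line straddling the end boundary is read here,
            if not line:         # and the next split skips it, so each line is read once
                break
            records.append(line.rstrip(b"\n"))
    return records

# Hypothetical usage: two workers, each reading half of the same file.
# import os
# size = os.path.getsize("events.log")
# first_half  = read_split("events.log", 0, size // 2)
# second_half = read_split("events.log", size // 2, size)
```

A gzip-compressed file, in contrast, cannot be entered at an arbitrary byte offset like this, which is why compression was mentioned as a counterexample.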