Hello and welcome to Lesson 2 of Introduction to Data, Signal, and Image Analysis with MATLAB. This is the first section of the lesson on how we can use MATLAB for data analysis. Before we get into any analysis algorithms, we will learn how datasets are loaded into and represented in MATLAB. By the end of this lesson, you will understand how to load different types of datasets into MATLAB and inspect the values within a dataset.
In general, datasets are represented using one or more N-dimensional arrays. Those arrays could be one-dimensional vectors, or they could be two-, three-, four-, or in general N-dimensional arrays. The arrays could contain floating-point numbers, integers, or character strings holding important information that we would like to analyze. Most often, our data are numerical and can be organized into a two-dimensional matrix. Each index along one dimension, for example the columns of the matrix, corresponds to a different type of measurement, or feature, of a process. Each index along the other dimension, for example the rows of the matrix, corresponds to a different instance or sample of the process we are studying. Thus, one row contains a measurement of each feature for a single instance of the process.
To help us understand this, let's consider an illustrative example dataset. Oftentimes, we have a dataset stored in a spreadsheet format, such as a Microsoft Excel spreadsheet. I have just such a dataset right here. Let's first take a look at the spreadsheet, and then see how we can load it into MATLAB. This spreadsheet contains a dataset available in MATLAB called the Auto MPG Dataset. This is a portion of the dataset made available by the University of California, Irvine's Machine Learning Repository website.
In this dataset, we can see a total of 392 rows, each corresponding to features measured for an individual model of car in the United States in the 1970s and '80s. In the columns for each car, we have a number of features, including the number of cylinders in the engine, the size of the engine, the engine horsepower, the vehicle weight, acceleration, and model year, and finally, the fuel efficiency of the vehicle in miles per gallon.
Let's take a moment to think about this as a generic dataset of some general process and forget that it has anything to do with cars. How many different instances of the process do I have in this dataset? I can see I have 392 rows of data, so I have 392 instances. For each instance of this process, how many different features are there in the dataset? Well, I may be tempted to say eight because I have eight columns in my spreadsheet, but the first column just lists the instance number. That's not really a feature of the process. So we have a total of seven different features for each instance in our dataset. Our dataset is 2D, and we should be able to represent it in MATLAB as a 392 by seven matrix of numbers.
So let's go to MATLAB and see how we can load it. We've got MATLAB Online open here. Loading data from Microsoft Excel spreadsheets is so common that MATLAB has built-in functions to import such datasets. But we first need to bring the file into our current folder on MATLAB Online. This is as easy as drag and drop. Let's use the xlsread function to load the data from our spreadsheet into MATLAB. We assign the loaded data to the variable mpg_in. In our workspace, we can see that this is a 392 by eight matrix of double-precision numbers. But we know our dataset should be 392 by seven. So what gives? Let's take a look at the first row and compare that to our spreadsheet. We can see the numbers are identical to the numbers in the first data row of the spreadsheet.
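The loading and inspection steps just described can be sketched as follows. This is a minimal illustration and not the exact commands from the lesson: the spreadsheet's file name is not given, so the sketch first writes a small stand-in file with writecell. The file name, column headers, and values are hypothetical toy data, not the Auto MPG spreadsheet.

```matlab
% Build a tiny stand-in spreadsheet so this sketch runs on its own.
% The file name, headers, and values are hypothetical toy data.
toyFile = [tempname '.xlsx'];
toyCells = {'instance', 'cylinders', 'weight', 'mpg'; ...
            1, 8, 3500, 18; ...
            2, 4, 2200, 30; ...
            3, 6, 2900, 22};
writecell(toyCells, toyFile);

% Read the numeric block of the spreadsheet into a matrix.
mpg_in = xlsread(toyFile);

% Inspect the size, then the first data row, to compare with the file.
dataSize = size(mpg_in);
firstRow = mpg_in(1, :);
disp(dataSize)
disp(firstRow)

% Remove the temporary file.
delete(toyFile)
```

With the real spreadsheet, the same size check is what reveals the 392 by eight shape. Note that the MATLAB documentation for recent releases marks xlsread as not recommended and points to readmatrix, readtable, and readcell as alternatives.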
But remember that the first column just corresponds to the instance number in our dataset; it's not really a feature. However, MATLAB's xlsread just sees a matrix of numbers. It does not reason about the meaning of the first column of numbers to understand that the first column is not necessary. So it includes it in the output. We can remove the first column like this: mpg = mpg_in(:, 2:8). So in mpg we're keeping all rows and columns two through eight of mpg_in. Generally, when we load data from a spreadsheet, we need to be aware of which columns of data we need and which ones we do not, and remove the unnecessary columns accordingly. Now we have a variable mpg that contains the 392 by seven matrix of features we want to load from the Excel file.
Comparing the data loaded into MATLAB and the contents of the spreadsheet again, we can see that xlsread has ignored a certain part of the spreadsheet. Do you see it? xlsread has ignored the first row of the spreadsheet, which contains the strings describing the contents of the columns. By default, xlsread looks for numeric data and ignores character data. However, sometimes the data we care about does contain strings. If we go back to MATLAB Online and type help xlsread, we can see how to load non-numeric data. Reading this, we can see the first output of the function is the numeric data, the second output is the text data, and the third output contains all the raw, unprocessed information from the spreadsheet.
Let's look at that. We want the third output from xlsread. We can see that mpg_raw is a 393 by eight matrix of data type cell. The first row of the matrix contains the character strings from the first row of the spreadsheet. The second and remaining rows contain the numerical data from the following rows of the spreadsheet. But when we print these data to the screen in MATLAB, they are surrounded by curly brackets. What's up with that?
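Before turning to the curly brackets, the column removal and the three xlsread outputs described above can be sketched as follows. As an assumption for illustration, the sketch writes its own small stand-in spreadsheet with hypothetical headers and toy values, so the column range is written as 2:end here where the lesson uses 2:8.

```matlab
% Hypothetical stand-in spreadsheet: one header row and two data rows.
toyFile = [tempname '.xlsx'];
writecell({'instance', 'cylinders', 'mpg'; ...
           1, 8, 18; ...
           2, 4, 30}, toyFile);

% First output: numeric data. Second: text. Third: raw cell array.
[mpg_in, mpg_txt, mpg_raw] = xlsread(toyFile);

% Keep all rows, drop the instance-number column.
% (With the eight-column lesson data this is mpg_in(:, 2:8).)
mpg = mpg_in(:, 2:end);
disp(size(mpg))

% The raw output keeps the header row, so it has one more row
% than the numeric output and is stored as a cell array.
disp(class(mpg_raw))
disp(size(mpg_raw))
disp(mpg_raw(1, :))

delete(toyFile)
```

Requesting only the third output can also be written as [~, ~, mpg_raw] = xlsread(toyFile), which discards the first two outputs.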
This has to do with the cell array data type, which is used in MATLAB when we have a matrix containing different data types. Here we have a matrix that has character strings in the first row and numeric data in the remaining ones. We'll see more about cell arrays in the next section. The take-away point from this exercise is that if we have a dataset stored in a spreadsheet, we can load it into MATLAB using xlsread and obtain the numeric data, the character data, or both.
That's the programmatic way to load in a dataset, but there's another way we can cheat to load the data into MATLAB. We can go to our spreadsheet and highlight the data that we want. We can click "Copy" to copy the contents to our clipboard. Then we go to MATLAB Online and assign it to a variable using the paste command, like this. We type an opening square bracket, as if we were about to type in a matrix of numbers, and then we paste in the contents of the spreadsheet. Then we type a closing square bracket to close the matrix and hit Enter to execute the command. Now we have mpg_paste in our workspace, and it is identical to our variable mpg, which we created programmatically using the xlsread function.
Finally, we can observe that there are quite a few file types that MATLAB supports loading. If we go to the Home tab, we see a button that says Import Data. In this window we can see a number of different file types that MATLAB can load, including several spreadsheet formats.
In summary, MATLAB provides many straightforward ways to load datasets from different sources for further analysis. Now that we understand how to get data into MATLAB from common dataset file types, next we will take a closer look at how data are represented once they are loaded into MATLAB. This will help us understand how to use MATLAB to analyze the datasets.
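The copy-and-paste approach described above can be sketched as follows. The numbers between the brackets stand in for whatever would be pasted from the clipboard; they are hypothetical toy values, and the second matrix stands in for the result of the programmatic load.

```matlab
% Opening bracket, pasted rows (toy values here), closing bracket.
% Whitespace separates columns and each new line starts a new row.
mpg_paste = [
8   3500   18
4   2200   30
6   2900   22
];

% Stand-in for the matrix obtained programmatically.
mpg = [8 3500 18; 4 2200 30; 6 2900 22];

% Check that the pasted and programmatic versions match.
sameData = isequal(mpg_paste, mpg);
disp(sameData)
```

This works for purely numeric selections; a selection that includes header text would not form a valid numeric matrix.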
What we will see is that MATLAB has a versatile set of data types that let you represent datasets composed of different kinds of data, such as numbers or strings, with easy-to-use arrays. Let's go to MATLAB and take a look at one of the built-in datasets for an example. MATLAB provides us with Fisher's Iris Dataset. This dataset was originally reported in a paper by biologist Ronald Fisher in the 1930s, and was made available on the University of California, Irvine's Machine Learning Repository website.
Let's load the dataset. This dataset contains measurements of flower petal and sepal dimensions from 150 iris flowers from three different species of iris. The data is distributed across two different arrays. We can see in our workspace that species is a cell array that contains 150 cells. Let's take a look at what that contains. When we type species at the MATLAB command prompt, MATLAB displays every one of the cells in order and shows the contents of each cell within curly brackets.
Cell arrays are different from normal MATLAB matrices. If we were to attempt to store a list of strings as the rows of an ordinary MATLAB matrix, those strings would all be required to have the same length. In a cell array, however, a list of strings can be stored without requiring the string in each cell to have the same length. We can access the first cell by indexing into the species array with parentheses, like we would for any array. What MATLAB returns is a single cell corresponding to the first entry in the species cell array. If we want to get access to the contents of the first cell, the way to index into the contents of a cell array is with curly brackets rather than parentheses. When we type species with curly brackets and a colon at the command prompt, instead of printing each cell in curly brackets in the command window, MATLAB returns the string within each cell.
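The difference between parentheses and curly brackets can be sketched as follows. This assumes the fisheriris sample data is available, which requires the Statistics and Machine Learning Toolbox; the variable names firstCell, firstName, and names are hypothetical and chosen only for this illustration.

```matlab
% Load the built-in iris data: creates meas and species.
load fisheriris

% Parentheses return a cell (a 1-by-1 cell array).
firstCell = species(1);
disp(class(firstCell))

% Curly brackets return the contents of the cell (a character vector).
firstName = species{1};
disp(class(firstName))

% A cell array can hold strings of different lengths.
names = {'setosa'; 'versicolor'; 'virginica'};
disp(cellfun(@length, names))

% By contrast, stacking them in an ordinary character matrix,
% e.g. ['setosa'; 'versicolor'], raises an error because the
% rows do not have the same length.
```

Typing species{:} at the prompt returns the contents of every cell in turn, one result per cell.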
When we type species with curly brackets and index 51, MATLAB returns the string that is in cell 51 of the species cell array. The dataset also includes a 150 by 4 matrix called meas, which is shorthand for measurements. Let's see what's in that matrix. We can see we have a matrix of floating-point numbers recorded to one decimal place. Each column contains a different size measurement, in centimeters, taken from an iris flower. There's a different column for the sepal length and width and the petal length and width.
However, as data analysts, what the numbers mean does not necessarily matter. We can think of the data in these four columns as features 1 through 4. In general, if we had m total features, we would have m columns in our matrix. The 150 rows each represent an independent sample of some process for which these four features were measured. Here the process that we are sampling is the size of a flower, and the independent samples are different individual flowers, but again the application is not terribly important. In general, if we had n independent samples of a process, we would have n rows in our measurements matrix.
Oftentimes we need multiple variables in MATLAB to represent a complete dataset so that we can accommodate multiple data types. We would store numeric data in a matrix so that numeric data analysis methods can easily operate on that variable. If we have some categorical or character string data, as in the species cell array here, we would store that in a separate variable. Whatever the type of data we're working with, MATLAB makes it straightforward to store the data in easy-to-use arrays.
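A minimal sketch of how the two variables fit together, again assuming the fisheriris sample data from the Statistics and Machine Learning Toolbox is available. The names nSamples, nFeatures, isSetosa, and setosaMeas are hypothetical, and the row selection with strcmp is one possible way to combine the variables, not a step shown in the lesson.

```matlab
% Load the built-in iris data: creates meas and species.
load fisheriris

% n samples (rows) by m features (columns).
[nSamples, nFeatures] = size(meas);
fprintf('%d samples, %d features\n', nSamples, nFeatures);

% One row holds all features for a single sample.
sample51 = meas(51, :);
label51 = species{51};
disp(sample51)
disp(label51)

% The cell array of labels can select rows of the numeric matrix:
% a logical index marks the samples whose label is 'setosa'.
isSetosa = strcmp(species, 'setosa');
setosaMeas = meas(isSetosa, :);
disp(size(setosaMeas))
```

Because row i of meas and cell i of species describe the same flower, keeping the two variables in the same row order is what ties the numeric and string parts of the dataset together.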