So, good data. What characteristics does good data have? First, there is sufficient quantity of it for your algorithms. There are test questions on this material on the exam. Second, it represents the features that we care about. Unstructured data: there's lots and lots of unstructured data. Think of all the text messages and tweets and videos, all this unstructured data that's out there that could be used for an analytics project. It's time-consuming but necessary to structure your unstructured data. Don't leave it unstructured; you want to structure it. Only we human beings can look at that data, understand the problem that we're trying to solve, and organize that unstructured data into a structure. You may have to try it two or three or four or 20 or 100 times before you get it right and before you start to get meaningful results out of your analytics problem.

Missing data: features missing at random is the ideal case, because you can compute a replacement value, such as an average, from the other feature vectors. That's a fair way to deal with missing data values. A high number of missing values for a specific feature is more difficult to replace and requires additional analysis to synthesize a replacement. Some options to try are computing a mean or a median value. For many regression models you can just put a zero in for missing data and it won't throw off your results, because zero times whatever the theta value is, for example in linear regression, contributes nothing; the term goes to zero and doesn't affect the hypothesis or the outcome values, the y values. You can also interpolate: if you've got a missing value sitting between a value of ten and a value of five, interpolating to 7.5 for that missing value in the middle might be a reasonable approach.

Outliers are also sometimes called anomalies. They have something unusual about them, and there are two basic categories.
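Before moving on to outliers, here is a minimal sketch of the two missing-data strategies just described, mean imputation and neighbor interpolation. It uses plain Python lists with `None` marking a missing feature value; the function names are just illustrative, not from the lecture.

```python
def mean_impute(values):
    """Replace missing entries with the mean of the observed entries."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]

def interpolate_impute(values):
    """Replace an interior missing entry with the average of its neighbors."""
    result = list(values)
    for i, v in enumerate(result):
        if v is None and 0 < i < len(result) - 1:
            left, right = result[i - 1], result[i + 1]
            if left is not None and right is not None:
                result[i] = (left + right) / 2
    return result

# The lecture's example: a gap between 10 and 5 becomes 7.5.
print(interpolate_impute([10, None, 5]))  # [10, 7.5, 5]
print(mean_impute([2, None, 4]))          # [2, 3.0, 4]
```

Which strategy is fairest depends on the model; as noted above, some regression models tolerate a plain zero because the missing term simply contributes nothing to the hypothesis.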
Either they really are anomalies, something went wrong with them, and we call those manifest outliers; they require repairing or removal. So, those two data points way out on the right there are manifest outliers, anomalies, and we want to get rid of them, just delete them from the dataset. Or they may be novelties, a new kind of example that you haven't seen before. So again, it really takes a human being to look at the data and make a determination on whether these outliers, these anomalies, are relevant or not, and only we can do that today. 100 years from now that might not be true, but where we are today, that's a [inaudible], take a look at those.

There's a technique called Exploratory Data Analysis, a term coined by John Tukey in his 1977 book of the same name, Exploratory Data Analysis. I'm not going to go into it in any depth, but I want to point out that these are statistical techniques you can go out and apply in your problems to help you understand your data. The authors of two of the books that I read pointed out that, as a general rule of thumb, if you see a large difference between the mean and the median values in your data, there may be a problem, and again human beings need to dig in and take a look at that.

So, here's an example and the definitions we're going to use in our class and on the final exam. Here's a big dataset; imagine this is 50 petabytes of data. Okay, and we want to extract data from this huge dataset and feed it into a machine-learning algorithm for analysis over here. There are two steps to this. One is the extraction that I referred to, where we apply various techniques to extract good data. These are the features that we care about, the features that we want to look at, the features that we want to feed into a machine learning algorithm or algorithms, okay?
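The mean-versus-median rule of thumb mentioned above is easy to check with a few lines of code. The datasets here are made up for illustration; the point is that one manifest outlier drags the mean far from the median while the median barely moves.

```python
def mean_median_gap(values):
    """Return (mean, median) so a human can eyeball the difference."""
    n = len(values)
    mean = sum(values) / n
    s = sorted(values)
    mid = n // 2
    median = s[mid] if n % 2 else (s[mid - 1] + s[mid]) / 2
    return mean, median

well_behaved = [10, 11, 12, 13, 14]
with_outlier = [10, 11, 12, 13, 1000]   # one manifest outlier

print(mean_median_gap(well_behaved))  # (12.0, 12)  -- mean and median agree
print(mean_median_gap(with_outlier))  # (209.2, 12) -- large gap: dig in
```

There is no universal threshold for "large"; as the lecture says, a human being has to dig in and decide whether the gap signals a real problem.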
So, we want to pull those out, and there's other stuff out there: there might be noise, there might be lots of data that we don't care about, and then there's a bunch of irrelevant data out there that isn't relevant at all to the problem we're trying to solve. We definitely don't want to sample from that. But we have this notion of obtaining and extracting the features that we care about and pulling them out, and we're going to call that good data. This is what we care about. Then, the good data is cleaned and prepared. The anomalies are dealt with, the outliers are dealt with, the variance and [inaudible] are dealt with, if there are any; it's cleaned and prepared. I'm going to call that smart data. This is properly prepared data. Now we've got a really good dataset that we feed into a machine learning algorithm, and hopefully we will extract hidden insights from it, because that's what we're really after. It's an enormous quantity of data, and we're trying to figure out if there's anything in there that's interesting, anything that will help us gain insight into the operational efficiencies of a factory, or predict where hydrocarbons are, et cetera, et cetera. But this is the process.
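The two-step flow described above, extract the good data, then clean it into smart data, can be sketched as a tiny pipeline. The record fields, the relevant-feature list, and the outlier threshold here are all hypothetical assumptions chosen to make the example concrete.

```python
RELEVANT = ["temperature", "pressure"]  # the features we care about (assumed)

def extract(records):
    """Step 1: pull only the relevant features out of the big, mixed dataset."""
    return [{k: r.get(k) for k in RELEVANT} for r in records]

def clean(rows):
    """Step 2: deal with missing values and manifest outliers (here: drop them)."""
    good = []
    for r in rows:
        if any(v is None for v in r.values()):
            continue  # missing feature: drop this row (could impute instead)
        if r["temperature"] > 500:
            continue  # manifest outlier (illustrative threshold): remove it
        good.append(r)
    return good

raw = [
    {"temperature": 70, "pressure": 30, "noise": "irrelevant stuff"},
    {"temperature": None, "pressure": 29},   # missing feature
    {"temperature": 9000, "pressure": 31},   # manifest outlier
]
smart = clean(extract(raw))  # "smart data", ready for a learning algorithm
print(smart)  # [{'temperature': 70, 'pressure': 30}]
```

Dropping rows is only one choice; in practice, as discussed earlier, a human decides whether to repair, impute, or remove each anomaly.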