In this module, we start from the idea that we should be using Big Data and ask: what are the challenges of Big Data? We'll explore one aspect of it. There are many other aspects, and we could spend a lot of time on them. We're going to talk about something called the curse of dimensionality. If some of you have seen Monk, it's a curse, but it's also a gift. If you haven't seen Monk, you should go and watch it to understand how something can be a curse and a gift at the same time. In this module, we're going to talk about four things. First, why does this issue of more data come up? It comes up because of the different types of data, the way they're organized, and the multiple channels of communication: we may be getting inputs from many different channels and trying to coordinate them. That's part of the data complexity issue. Second, does increasing complexity give benefits? We already saw this in the car price example, where we said there could in the future be a variable, the condition monitoring of the car, which is available to you. Can you use it? How can you use it? How much extra value does it add when you're pricing a car? Third, why do we call it a curse? We're going to talk a bit about that and about some very simple ways of taking care of the curse. Finally, we're going to ask, "If our objective is to extract meaning from data like this, are there some tools, some methods, which will help?" We will talk about very simple methods like a scatter plot, and about more complicated methods. I'm going to explain only one of those, but I'll at least name a few more, which you can go and read about in the references to this module. You might have seen this chart somewhere.
Somebody will already have told you that what has changed with Big Data is the velocity with which it comes in, the volume in which it comes in, the variety of data available, and also its veracity: with fake news and the like, some data you can trust, some has been verified, and some has been triple-verified. So we know that data comes in many different forms, and this is one way of thinking about it. But there are more precise ways of looking at it. Data comes as numbers. You've seen numbers already; you've seen statistics. You have seen weights, and maybe distributions. So we have seen numeric data of all kinds. Data also comes in what we'd call an ordered fashion, where the classes are not equidistant: you could have small, medium, large. You could have low, medium, and high income buckets. You could have how relevant the data is: not very relevant, more relevant, and most relevant. So there is an ordering. There is another kind of data, which is symbolic, where there is no order: red, green, and blue. There is a state, a country, a region; basically, these are ways in which you can just label things. So these are labels. What we have seen so far is different types of data. Data also differs, as you will have seen in your first course, in how it is stored. The common way is to store it as a table, with rows and the various features in columns. In fact, all the data we have explored so far is of that kind: it is tabular. Or it could be a basket, like a shopping basket, a market basket, or a list of keywords. Now, what's the big difference? A basket is high-dimensional and highly sparse compared to a table. A table has fewer dimensions and is very dense compared to a basket of items that you purchased. We will see that in an example in one of the modules.
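
To make the dense-versus-sparse point concrete, here is a minimal sketch in Python. The baskets, the item names, and the small car table are made-up toy data, and the helper names `to_indicator_rows` and `density` are my own, not from any library. The idea is only to show that a basket needs one column per item anyone could buy, so most cells end up zero, whereas every cell of a table is filled in.

```python
# Hypothetical toy data: three shopping baskets (assumed for illustration).
baskets = [
    {"eggs", "milk"},
    {"soap", "cereal", "milk"},
    {"bread"},
]

# One dimension per distinct item that appears in any basket.
catalog = sorted(set().union(*baskets))


def to_indicator_rows(baskets, catalog):
    """Turn each basket into a 0/1 row with one column per catalog item."""
    return [[1 if item in basket else 0 for item in catalog] for basket in baskets]


def density(rows):
    """Fraction of cells that are non-zero."""
    total_cells = sum(len(row) for row in rows)
    nonzero_cells = sum(1 for row in rows for value in row if value != 0)
    return nonzero_cells / total_cells


basket_rows = to_indicator_rows(baskets, catalog)

# Hypothetical dense table: year, mileage, number of owners for three cars.
table_rows = [
    [2015, 42000, 3],
    [2018, 18000, 1],
    [2012, 90000, 4],
]

print("basket dimensions:", len(catalog), "density:", density(basket_rows))
print("table dimensions:", len(table_rows[0]), "density:", density(table_rows))
```

Even with only five items the basket data is mostly zeros. With a real store catalog of thousands of items and only a handful in each basket, the number of dimensions grows and the density drops much further.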
Another kind of data is simply whatever you've got in your bag: three eggs, four cereals, two soap bars, five milks. We could do the same thing with a document and ask: does this document say "you win" five times, "immediately" three times, and "lucky" 10 times? Then you know it's spam, compared to a document which says you are requested to be at such and such a place for such and such a meeting. There you've got "meeting" mentioned once, "you" mentioned once, "place" mentioned once. So the Bag of Words is meaningful depending on what it contains. So we could organize data in different ways: you have different types of data and different ways of organizing it. The Bag of Words is considered high-dimensional and sparse data. Moving on: my colleague who shared some of these slides with me uses the word modality here; I'm using the word structure. Data could be structured, in the sense that it could be in fixed columns. It could be numeric, it could be symbolic. Or data could be unstructured: it could be speech, like what I'm doing; it could be ticker symbols; it could be a text document. So just look at the complexity now: you have types of data, how they're organized, and how they're structured. That creates a variety of data which we can use to do the same classification or prediction. So the question for us is: is more better? Obviously, we would like to believe it is. But we run into two problems. The first problem is distance. If your predictor variables are categorical, how do you model distances? It becomes a problem because for categorical variables distance is not well-defined. The other problem is that, on top of this, you add text. So I have numerical variables, I have categorical variables, and I have text. How do I even model something with all of that? Obviously, I have to find a way of extracting the features from the text and putting them into a nice modelable format so that I can then predict.
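
One simple way to put text into a "modelable format" is the Bag of Words itself: count how often each word occurs and lay the counts out as a vector. Here is a minimal sketch using only the Python standard library. The two example messages are made up for illustration, and `bag_of_words` and `to_vector` are hypothetical helper names of my own; this is one possible way to extract features from text, not necessarily the one used later in the course.

```python
import re
from collections import Counter


def bag_of_words(text):
    """Lowercase the text, split it into words, and count each word."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def to_vector(bag, vocabulary):
    """Lay the counts out in a fixed column order; missing words count as 0."""
    return [bag[word] for word in vocabulary]


# Hypothetical messages (assumed for illustration).
spam = "You win! You win immediately. Lucky you, lucky day, lucky draw."
meeting = "You are requested to be at the place for the meeting."

spam_bag = bag_of_words(spam)
meeting_bag = bag_of_words(meeting)

# One column per distinct word seen in either document.
vocabulary = sorted(set(spam_bag) | set(meeting_bag))

print(vocabulary)
print(to_vector(spam_bag, vocabulary))
print(to_vector(meeting_bag, vocabulary))
```

Each document becomes a row of numbers with one column per word, so it can sit next to the numerical variables in a table. Notice that the vocabulary grows with every new word, which is exactly why this representation is high-dimensional and sparse.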
We're asking whether more is better. Here is a case study shared with me by the same person, who is now one of the top machine-learning experts with one of the big companies in India. He said he saw this example in his first job. You see a graph here with a line in black and a line in red, and basically, higher is better. The bank wanted to predict whether calling a customer would help in improving collections on a loan. Using just the numerical variables gave some accuracy. What my friend, whose name is Sailesh Kumar, found was this: suppose I record the call center conversation, look at the notes that the call center representative was making, extract information from the text, and add it to the numerical variables. Then my prediction accuracy goes up. So the idea is that more is better, but how much more? That's the question we're trying to ask.