So this time the data will come from this website. It's customer data, so go to this link and download the file called custdata.tsv. Not a CSV, a TSV: that's a tab-separated values file. I'm assuming at this point you've downloaded the file, so now we can go back to R. Okay, we're first going to load our library for the visualization, which is ggplot2. And now we're going to load our customer data, again using the read.table function. We can say file.choose(), which lets us choose the file from the hard drive. header is TRUE, and then sep, the separator, is where we indicate how the data values are separated. If it's a CSV file, then it's comma separated. This is a TSV file, so it's tab separated, and the tab is represented with \t. We run this, we get the dialog box, we pick the TSV file, open it, and now it's been loaded, so you can see it here. It's got 1,000 observations, and this is basically showing the data file that we just loaded. Let's go ahead and plot it, and you'll see how easy it is to plot something in R. Okay, so we want to use ggplot on custdata. custdata is the data frame that we just loaded, so that's what we're referring to. Then you say +, and what comes after it is the kind of visualization you want to do. We're going to use geom_histogram, so we're going to create a histogram, and we'll indicate the axis. age is one of the variables, you can see it here, so on the x-axis we want age. And for now, let's just go with that and see what happens. Essentially this will just give us the age distribution. There are some other options here, but for now, let's go ahead with this much. So if you run that, you see the visualization right here, and what we've got is the distribution of age. Here, age ranges from 0 to 150, and since this is a histogram, it just provides a count of people in each age group, using the data that we have, which is 1,000 observations.
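To make the loading and plotting steps concrete, here is a minimal sketch in R. It assumes ggplot2 is installed. Because custdata.tsv itself isn't reproduced here, the sketch writes a tiny made-up tab-separated file to a temporary path and reads that back; the rows and the tsv_path and sample_rows names are hypothetical stand-ins, not the real customer data.

```r
library(ggplot2)

# Hypothetical stand-in for custdata.tsv: a few made-up rows written to a temp file.
tsv_path <- tempfile(fileext = ".tsv")
sample_rows <- data.frame(
  age = c(23, 35, 47, 52, 61, 0),
  income = c(18000, 42000, 55000, 61000, 30000, 0),
  marital.stat = c("Never Married", "Married", "Married",
                   "Divorced/Separated", "Widowed", "Never Married")
)
write.table(sample_rows, tsv_path, sep = "\t", row.names = FALSE, quote = FALSE)

# header = TRUE: the first line holds column names; sep = "\t": values are tab separated.
# In the walkthrough, file = file.choose() opens a dialog instead of using a fixed path.
custdata <- read.table(tsv_path, header = TRUE, sep = "\t")

# Histogram of age: the data frame goes to ggplot(), the geom is added with +.
age_hist <- ggplot(custdata) + geom_histogram(aes(x = age))
print(age_hist)
```

With the real file, you would replace tsv_path with file.choose() (or the path where you saved custdata.tsv) and skip the write.table step.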
Now, we may want to customize this a little bit, so we can use something called binwidth. And as you start typing, you can see that there is help provided here. It says this is the width of the bins. The default is to use bins that cover the range of the data, so it's recommended that you override it. Let's go with 5. What happens here is we're essentially saying how wide each bin should be, okay? Let's actually try changing it to something different, let's say 10. And expand this a little bit, and so this is where you are. And actually, you can go back, all right? So the nature of the distribution is not really going to change, but you can see that the resolution has changed. A binwidth of five means a smaller bin, and that's what we had before, which means that things will be more spread out. But if we have a binwidth of 10, that means each bin holds an age range of ten years, right? So more people will fit into each bin, but you'll have lower resolution, and that's what's happening. So again, you can play around with this to see what you get. The overall distribution is not going to change; a smaller binwidth just gives you slightly better resolution. Okay, let's also try plotting categorical data. What we just plotted is numerical data; age is numerical. But we also have categorical data, where every value falls into one of a few categories. Marital status is one, so let's try plotting that. Again, we're working with the same customer data, plus the geom, and this time we're going to use bars with geom_bar. A bar chart is good for categorical data, and you'll see in a moment why. For the axis here, we're going to use marital.stat, and it's running. All right, so now you can see that we got these four bars, and each one represents a different category: Divorced/Separated, Married, Never Married, and Widowed. And for each bar you have a count.
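The binwidth comparison and the bar chart can be sketched as follows. This is a rough illustration assuming ggplot2 is installed; the custdata frame below is randomly generated placeholder data, not the real file, and the hist_5, hist_10, marital_bars, and marital_counts names are just for this example.

```r
library(ggplot2)

# Hypothetical placeholder data standing in for custdata.
set.seed(1)
custdata <- data.frame(
  age = round(runif(200, min = 18, max = 90)),
  marital.stat = sample(
    c("Divorced/Separated", "Married", "Never Married", "Widowed"),
    200, replace = TRUE
  )
)

# Same data, two bin widths: narrower bins give finer resolution,
# wider bins put more people into each bar.
hist_5  <- ggplot(custdata) + geom_histogram(aes(x = age), binwidth = 5)
hist_10 <- ggplot(custdata) + geom_histogram(aes(x = age), binwidth = 10)

# Bar chart for a categorical column: one bar per category, height = row count.
marital_bars <- ggplot(custdata) + geom_bar(aes(x = marital.stat))

print(hist_5)
print(hist_10)
print(marital_bars)

# table() gives the same per-category counts that geom_bar draws.
marital_counts <- table(custdata$marital.stat)
print(marital_counts)
```

Printing marital_counts is a quick way to check the bar heights as plain numbers.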
So this is a bar chart, it's very simple to plot, and it's useful for categorical data. Again, we are able to do this very easily. Now finally, let's do a little bit more than visualization and see if we can do some analysis, so we'll do a very simple analysis. We'll do correlation, all right? Let's see if there's a correlation between age and income. There is an age value here, and then there is an income value here, okay? It's very easy to do correlation: just use cor, which is a function. You want the correlation between two variables, or columns, of this dataset. Okay, so what is the dataset? The dataset is custdata. And we want age, so custdata$age means the age variable within the custdata data frame, or the age column within the custdata table, and then custdata$income. And so we get this value. This is the correlation coefficient, right? It's not very high; it's positive, but it's very, very low. Now, this may be a little problematic, because if you think about it, some of the data is not actually good. For instance here, there's income at 0, and if we sort it here, you can even see an income of -8,700. It's not clear whether that's an error of some sort, or all these incomes being 0. Perhaps those aren't the best values. There's also age being 0, and, let's see, age being some other odd values. So there seems to be some problem with this data, and maybe what we need to do is remove or ignore the data values that are not giving us what we would expect. What we'll do is create a subset of this, which we can call custdata2, and it's going to be a subset of custdata, okay? subset is a function that allows us to take a part, or a subset, of an existing dataset. The dataset here is custdata, so we'll take a subset of it. To do the subset we'll need to provide some condition, and we can have multiple conditions.
So we want to say that any custdata$age that's greater than 0, that's good, and custdata$age < 100, that's good too. This is assuming that we are only looking at customers who are more than 0 years of age and less than 100 years, right? You can change this if you like. And custdata$income should be greater than zero, right? So again, this will eliminate a lot of those values that are negative or zero. Maybe they're right, maybe they're wrong, but it seems okay to ignore them. You run it. Now if you look in the Environment here, let's go back here, you'll see that custdata2 is a new dataset that's been created. It has 910 observations. Remember, custdata2 is a subset of custdata, so it has eliminated all those cases that do not meet the conditions we specified. Okay? It has eliminated 90 observations that failed at least one of these conditions, so now we are working with 910. We can use the up arrow to bring up some of the previous commands, and now we can run that correlation on custdata2 instead of custdata. Run it, and look at this now. Now we have a negative correlation. It's not strong, but the sign is reversed. What this means is, in this case we can't really say much, because the value is very small. But if this were a large enough value with a negative sign, we could say that as age goes up, income actually goes down. Now, in reality what's probably happening is that as age goes up, up to some limit, income goes up, and as age goes up beyond that limit, income goes down. So we can imagine that in retirement, income is perhaps less than what one had before retirement age. That would require further analysis, but at least you can now see how easy it is to do this kind of correlation analysis, right?
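The filter-then-correlate steps can be sketched like this. It's a small illustration using base R only; the custdata frame here is made up, with a few deliberately suspicious rows (age 0, age 120, income 0, a negative income), so the numbers it produces say nothing about the real customer data. The cor_all and cor_clean names are hypothetical.

```r
# Hypothetical data with a few deliberately bad rows, standing in for custdata.
custdata <- data.frame(
  age    = c(0, 25, 38, 47, 59, 66, 120, 33),
  income = c(15000, 22000, 48000, 60000, 52000, 0, 40000, -8700)
)

# Correlation on everything, bad rows included.
cor_all <- cor(custdata$age, custdata$income)

# Keep only rows that satisfy all three conditions at once (& means "and").
custdata2 <- subset(
  custdata,
  custdata$age > 0 & custdata$age < 100 & custdata$income > 0
)

# Correlation again, on the cleaned subset.
cor_clean <- cor(custdata2$age, custdata2$income)

print(nrow(custdata))
print(nrow(custdata2))
print(cor_all)
print(cor_clean)
```

Comparing nrow(custdata) with nrow(custdata2) shows how many rows the conditions removed, which is the same check as looking at the observation counts in the Environment pane.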
So again, this is very, very simple and very easy to do in R, and we've already looked through these kinds of examples. What we saw in R is using basic mathematical operations and using conditions. We saw how to load or import data from a CSV file, a TSV file, or any kind of separated-values file. We saw how we can visualize the data using a histogram and a bar chart, and we saw how we can do correlation analysis. So that's the very basic introduction to R.