So how many different ways are there of visualizing data?

As many ways as you want,

and people keep finding new ways.

We've already seen histograms and distributions,

and they go into a topic called density estimation,

which we will not go into, but you know what they are.

Correlation, we said okay,

a two-by-two analysis gives you

a number which is between minus one and one,

and tells you the strength and

the direction of relationships, we already saw that.
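
As a reminder of how that number works, here is a minimal sketch of the Pearson correlation coefficient in plain Python. The function name `pearson_r` and the toy lists are illustrative assumptions, not something from the lecture slides. The sign gives the direction of the relationship, and the magnitude (at most one) gives its strength.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation of two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (covariance up to a constant factor).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (variances up to the same factor).
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Toy data: y rises with x in one case and falls with x in the other.
rising = pearson_r([1, 2, 3, 4], [2, 4, 6, 8])
falling = pearson_r([1, 2, 3, 4], [8, 6, 4, 2])
print(rising, falling)
```

A perfectly rising straight line gives a value at the top of the range, and a perfectly falling one gives a value at the bottom.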

Today we're going to talk about

two things, a scatter plot,

this innocent thing called a scatter plot,

which is usually very very informative,

and a more complicated

way of looking at data called principal

components analysis which allows

you to reduce the number of features.

There are three advanced techniques,

I will talk to you a little bit more

about them towards the end of this module,

it's something that you may like to pursue on your own.

So let's take scatter plots.

This, if you have not

seen it before or even if you've seen it before,

it's called the Iris dataset.

It's the dataset on three subspecies of Iris.

This dataset was introduced originally by

Fisher in a 1936 paper,

and it's also called the Anderson Iris

dataset because Anderson collected

this data on iris flowers

from three species, and I can't say them very well,

one is called Setosa,

the other one is called Versicolor

and the third one is called Virginica.

Now, the interesting thing

is two of them were collected from

the same pasture and picked on the same day and

measured at the same time by

the same person which is very fascinating.

You've got 50 samples of each of them,

and the measurements were done on four dimensions.

the sepal length and the sepal width,

and the petal length and the petal width.

I have to apologize, I keep messing up the two,

when I wanted to say petal I say

sepal, that's going to happen.

So we have four variables on

which each flower was measured.

So we have 150 observations, okay?

How does it look? Let's say I do a scatter plot.

The first scatter plot is

not as meaningful. What does it show?

It shows the sepal length versus the sepal width,

but you already see there are

two clusters in it if you stare at it, right?

So even without knowing what it is, you can start saying,

"hey, there are essentially two classes of iris in the data".

The second scatter plot is

the petal length versus the petal width.

Clearly you say "wow,

there are two clusters in it",

and you suspect that

maybe the data is telling you something.

So basically from four dimensions,

I've reduced it to two and we've plotted

the data, and we can do this two-by-two in many ways,

and there seem to be two clusters emerging.

We say "hey, wait a minute,

there are three types of species out here".

So when you start labeling them

and I hope you can see the color on this,

you see the blue one in the bottom is one species,

the red one in the middle is another species,

and the gray one on the top is a third species.

If you didn't know which species is which,

you wouldn't see three clusters,

but now you can almost see that it's cluster one,

cluster two, cluster three.
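
A colored plot like this one could be sketched roughly as follows. This is only an assumption about tooling: it uses scikit-learn's bundled copy of the Iris data (`load_iris`) and matplotlib, which the lecture does not name, and the helper names `group_by_label` and `plot_petals_by_species` are made up for illustration. The colors will be matplotlib's defaults, not necessarily the blue, red, and gray on the slide.

```python
from collections import defaultdict


def group_by_label(points, labels):
    """Split (x, y) points into one list per label."""
    groups = defaultdict(list)
    for point, label in zip(points, labels):
        groups[label].append(point)
    return dict(groups)


def plot_petals_by_species():
    """Draw petal length vs. petal width, one color per species."""
    # Imported here because both are third-party packages.
    import matplotlib.pyplot as plt
    from sklearn.datasets import load_iris

    iris = load_iris()
    # Columns 2 and 3 hold petal length and petal width.
    points = [(row[2], row[3]) for row in iris.data]
    names = [iris.target_names[t] for t in iris.target]
    # One scatter call per species, so each gets its own color.
    for species, pts in group_by_label(points, names).items():
        xs, ys = zip(*pts)
        plt.scatter(xs, ys, label=species)
    plt.xlabel(iris.feature_names[2])
    plt.ylabel(iris.feature_names[3])
    plt.legend()
    plt.show()
```

Calling `plot_petals_by_species()` should open the labeled plot. Dropping the grouping step and plotting every point in a single color gives the unlabeled view, where only two clusters stand out.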

So this basically tells you

that if you didn't know

the labels you would think there are two clusters,

but if you knew the labels,

you would know there are three clusters.

So that's the difference

between knowing what the object is,

which is supervised learning,

and not knowing what the object is,

which is unsupervised learning.

So, in some sense we will use this

as a very small example to tell

you how labeling the object

improves how well you can cluster,

but again it's a scatter plot, it's useful.

Here is the same thing,

and remember I have four variables,

so I can choose them in six different ways,

four times three divided by one times two,

four choose two, and you can

see each of these scatter plots,

and you can also see that no scatter plot by

itself is telling you

that there are three clusters,

unless you put color on them.
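
The "four choose two" count can be checked by simply listing the pairs. This is a small sketch using Python's standard library. The feature names are the four measurements described above, written out as plain strings for illustration.

```python
from itertools import combinations

features = ["sepal length", "sepal width", "petal length", "petal width"]

# Every unordered pair of features is one possible scatter plot.
pairs = list(combinations(features, 2))

for x_name, y_name in pairs:
    print(f"{x_name} vs {y_name}")

print(len(pairs), "scatter plots")
```

Each pair is counted once, so "sepal length vs sepal width" and "sepal width vs sepal length" are treated as the same plot with the axes swapped.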

These kinds of plots are useful,

they are a way of visualizing,

but they have limitations.

First, clearly they are a powerful visualization tool and I

would definitely suggest we use them.

Second, the problem is

they're limited to two or three dimensions.

Maybe in the next module I'll show you

a three-dimensional dataset where we can peek

at it from different angles and start seeing clusters.

Three, I don't think anybody

can peek in four dimensions easily,

I haven't heard of anybody, but I had a friend

once who talked of five dimensions.

So obviously that is a limitation.

There is another important limitation,

that we can only think of

the sepal width and

length and the petal width and length,

but we can't easily think of combining these features,

our mind doesn't work that way,

but let us say maybe there is a function

which is a combination of these features,

a new way of looking at them,

which can reduce the dimensions.

Obviously that is another limitation.

So that means the scatter plot is limited to

the features you've already been given

as its dimensions.

Another limitation is say we have hundreds of features,

say you have 100 features, right?

So you're taking body measurements and

there is a wonderful dataset on

how many measurements you can take of a body,

and let's say we took 100 of them.

How many scatter plots is that?

100 times 99 divided by two, 100 choose two.

So you'd have to sit there looking at 4,950 scatter plots,

would you see anything in the data?

Not possible. So what is the solution?

I think the solution is to use a lens and project

this high-dimensional data into

a lower-dimensional space or

maybe onto this table, right?

So from three dimensions, you

can then project it into two dimensions,

or you have the four-dimensional Iris data

which we are going to see,

and I want to use that and

project it into two dimensions.
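
To make "projecting" concrete, here is a minimal sketch in plain Python. It is not principal components analysis: the directions here are picked by hand, purely as an assumption for illustration, whereas a method like PCA would choose them from the data. The names `dot`, `project`, `table`, and `axes_4d` are hypothetical, and the flower values are illustrative.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def project(point, axes):
    """Coordinates of `point` along each direction in `axes`."""
    return tuple(dot(point, axis) for axis in axes)


# "The table": the flat plane spanned by the x and y directions.
table = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
shadow = project((2.0, 3.0, 5.0), table)  # the height is discarded
print(shadow)

# Four measurements down to two numbers, using two hand-picked
# unit-length directions: one mixes the sepal columns, one the petal columns.
s = 0.5 ** 0.5
axes_4d = [(s, s, 0.0, 0.0), (0.0, 0.0, s, s)]
flower = (5.1, 3.5, 1.4, 0.2)  # illustrative values for one flower
print(project(flower, axes_4d))
```

Each new coordinate is a combination of the original features, which is exactly the kind of view a plain scatter plot of two raw measurements cannot give you.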