[MUSIC]
Are ice cream sellers evil?
Probably not, well, at least not all of them.
But I can totally imagine a situation where the price of the ice cream goes
up whenever the temperature outside goes up.
And if that's indeed the case, we can see a plot like this.
Here on the x-axis we have temperature, and
on the y-axis we have the price of the ice cream.
For each data point, on some particular day, we measured the temperature,
asked some ice cream seller about his price, and
plotted this point on the two-dimensional plane.
So we can see that these two variables are strongly correlated here and
related to each other.
Can we exploit this closeness, this relatedness,
of these two random variables?
Well, we may say that these two variables are so
related that you can use one to measure the other.
For example, say you want to know the temperature outside, and
you forgot your thermometer and also your smartphone.
You can ask your closest ice cream dealer for his price and
compute the temperature from that.
Which basically means that these two numbers are so
related that you don't have to use two.
You may as well use just one of them, and compute the other from the first one.
Or to put it a little bit differently, you can draw a line which
goes through your data and is kind of aligned with your data.
Then you can project each data point you have onto this line.
And this way, instead of two numbers to describe each data point,
you can now use one: the position on this line.
And you will not lose much this way.
So how much information do you lose when you project points?
Each blue data point is projected onto the corresponding orange one,
and if you look at the lengths of these projections, you see they are not large.
So you keep most of the information in your data set by projecting onto this line.
And now, instead of this two-dimensional data,
you can use a one-dimensional projection.
So you can use just the position on
this line as your description of the data point.
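To make this projection idea concrete, here is a minimal sketch in Python with NumPy; the made-up data, the chosen direction, and all variable names are illustrative assumptions, not values from the lecture's plot.

```python
import numpy as np

# Made-up 2-D data: temperature vs. ice cream price (illustrative only)
rng = np.random.default_rng(0)
temperature = rng.uniform(10, 35, size=100)
price = 0.1 * temperature + 1.0 + rng.normal(0, 0.2, size=100)
points = np.column_stack([temperature, price])   # the blue points, shape (100, 2)

# Center the data and pick a unit direction for the line
mean = points.mean(axis=0)
direction = np.array([1.0, 0.1])
direction /= np.linalg.norm(direction)

# One number per point: its position along the line (the 1-D description)
positions = (points - mean) @ direction          # shape (100,)

# The orange points: the projections back in 2-D
projections = positions[:, None] * direction + mean

# Lengths of the blue-to-orange segments: what we lose by projecting
residuals = np.linalg.norm(points - projections, axis=1)
print(residuals.mean())                          # small if the line fits well
```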
And it's just another way to say that these two random variables are so
connected that you don't have to use two.
You may as well use just one to describe both of them, and
this is exactly the idea of dimensionality reduction.
So you have two-dimensional data and you project it into one dimension,
trying to keep as much information as possible.
And one of the most popular ways to do it is called
principal component analysis, or PCA for short.
And PCA tries to find the best possible linear transformation,
which projects your two-dimensional data into 1D.
Or more generally, your multidimensional data into lower dimensions,
while keeping as much information as possible.
So PCA is cool.
It gives you an optimal solution to this kind of problem.
It has an analytical solution, so
you can just write down the formula for the solution of the PCA problem, and
this analytical formula is very fast to implement.
So if you give me 10,000 dimensional points,
I can return you back the same points projected to ten dimensions, for
example, while keeping most of the information.
And I can do it in milliseconds, so it's really fast.
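As a sketch of what that analytical solution can look like, here is PCA via the singular value decomposition in NumPy; the data sizes and variable names are illustrative assumptions.

```python
import numpy as np

def pca_project(X, k):
    """Project the rows of X onto the top-k principal components."""
    X_centered = X - X.mean(axis=0)
    # The SVD gives the principal directions in closed form, no iterative fitting
    _, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
    return X_centered @ Vt[:k].T                 # shape (n_samples, k)

X = np.random.randn(1000, 10000)                 # e.g. 10,000-dimensional points
Z = pca_project(X, 10)                           # the same points in 10 dimensions
print(Z.shape)                                   # (1000, 10)
```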
But sometimes people are still not happy with this plain PCA,
and they try to formulate PCA in probabilistic terms. Why?
Well, formulating your usual problem in probabilistic terms
may give you some benefits, like being able to handle missing data, for example.
So in the original paper that proposed this probabilistic version of PCA,
they projected some multidimensional data into two dimensions,
so you can plot this data on a two-dimensional plane.
And then they tried to obscure some of the data, so
they introduced missing values into it.
They threw away some parts of the features, and
then they projected this data set with missing values again.
And you can see that these two projections don't differ that much,
which means that we don't lose that much information by throwing
away some parts of the data, which is really cool, right?
We were able to treat these missing values, and
the solution doesn't change much when we introduce them.
So we're really robust to missing values.
By the way, the paper where they proposed this probabilistic principal
component analysis is really good.
So check it out if you have time and
if you want to know more details about this model.
So let's try to derive the main ideas behind this probabilistic
principal component analysis in the following few slides.
So first of all,
it's natural to call this low-dimensional representation of your data,
so in this example, the one-dimensional position of each orange data point,
a latent variable.
Because it's something you don't know,
you don't observe it directly, and yet it causes your data somehow.
So the position of your orange data point on the line, this ti,
influences where the data point will end up on the two-dimensional plane.
So it influences the position of the observed point, right?
So it's natural to introduce a latent variable model where you have ti,
which causes xi.
And you have to define some prior for ti, and
why not just set it to a standard normal?
This will just mean that your projections,
your low-dimensional projections,
will be somewhere around 0 and will have variance around 1.
And why not?
It's a nice property to have.
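Written out as a formula, this choice of prior is simply

p(t_i) = \mathcal{N}(t_i \mid 0, I)

so each latent position is drawn from a standard normal distribution.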
Now we have to define the likelihood, so
the probability of x given ti. And how are x and ti connected?
So how is this one-dimensional data connected to the two-dimensional data?
Well, if you look at an orange point,
the two-dimensional projection of x,
then it equals some vector times the position on
this one-dimensional line, plus some shift vector.
So we can linearly transform from this one-dimensional line
to two-dimensional space and get these orange projected points.
Or more generally, we can multiply ti by some matrix W and
then add some bias vector b, and we'll get our orange projections, xi.
And this W and b will be our parameters,
which we aim to learn from data.
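As a small sketch of the generative story so far, before we worry about recovering the blue points, here is this latent-to-observed mapping in NumPy; the values of W and b and the dimensions are made up for illustration, not learned from any data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Prior: one latent position ti ~ N(0, 1) per data point
t = rng.standard_normal(100)                     # shape (100,)

# Made-up parameters of the linear mapping (in practice, learned from data)
W = np.array([[2.0], [0.5]])                     # maps the 1-D latent to 2-D
b = np.array([20.0, 3.0])                        # shift (bias) vector

# The orange projections: xi = W ti + b, all lying on one line in 2-D
x = t[:, None] @ W.T + b                         # shape (100, 2)
print(x[:3])
```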
Okay, but these are the orange points, right?
How can we recover the blue points, the original data?