Now that we know what Gaussian processes are,

let's see how they can be applied to machine learning.

So you have some points x1,

x2, and so on, up to xn.

You know the values for them.

Those are f(x1), f(x2), and so on, up to f(xn).

And normally what you'd like to do is to

predict the value of the function at a new point x.

For Gaussian processes, we will go in a different direction.

That is, we will try to predict the full posterior

over f(x) given all our data points.

So we would like to estimate the probability of f(x), the

prediction at the new point, given all the previous points.

This will allow us to compute the mean for example and

also to compute the confidence intervals at each point.

This will allow us to estimate the uncertainty of our predictions.

So how do we predict this?

Here we have our desired quantity:

the probability of the prediction given our points is equal to the ratio of

the joint probability over all points, including the

new one, to the joint probability over our data points.
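
Written out as a formula, this is just the definition of conditional probability. The notation below is my own shorthand for what was said, not something taken from the slides:

```latex
p\bigl(f(x) \mid f(x_1), \dots, f(x_n)\bigr)
  = \frac{p\bigl(f(x),\, f(x_1), \dots, f(x_n)\bigr)}
         {p\bigl(f(x_1), \dots, f(x_n)\bigr)}
```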

Both of those are normal distributions.

The one in the numerator is normal with mean zero and covariance matrix C tilde,

and in the denominator we have a normal with mean zero

and covariance matrix C.

The covariance matrix C has the following form.

On the diagonal we'll have k(0).

This is the variance of the random process, and in the off-diagonal elements we'll have

the covariance, that is, the kernel function

evaluated at the difference between the two corresponding points.

The matrix C tilde would look like this.

It would be a matrix which has four blocks.

The first block is k(0).

That is the covariance of f(x) with itself.

In the lower right part we'll have

the covariance matrix C, corresponding to the covariance of

the data points, and we will also have the covariance between

the new point and the data points that we had before.

So this would be the vector k.

Its first element would be k(x - x1), the new point minus the old point

x1, the second position will have k(x - x2), and so on.
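
To make the block structure concrete, here is a minimal NumPy sketch that builds C, the vector k, and C tilde. The squared-exponential kernel, its parameters, and the helper name `build_covariances` are assumptions chosen for illustration; the lecture only requires that the kernel depends on the difference between the points.

```python
import numpy as np

def kernel(d, variance=1.0, length_scale=1.0):
    """Stationary kernel as a function of the difference d = x - x'.

    Squared-exponential form, assumed here for illustration only.
    """
    return variance * np.exp(-0.5 * (d / length_scale) ** 2)

def build_covariances(x_train, x_new):
    """Return C, k and the four-block matrix C tilde (hypothetical helper)."""
    x_train = np.asarray(x_train, dtype=float)
    # C[i, j] = k(x_i - x_j); the diagonal is k(0), the prior variance.
    C = kernel(x_train[:, None] - x_train[None, :])
    # k[i] = k(x_new - x_i): covariance of the new point with each data point.
    k = kernel(x_new - x_train)
    # C tilde = [[k(0), k^T], [k, C]], with f(x) placed first.
    k0 = np.array([[kernel(0.0)]])
    C_tilde = np.block([[k0, k[None, :]],
                        [k[:, None], C]])
    return C, k, C_tilde

x_train = np.array([0.0, 1.0, 2.5])
C, k, C_tilde = build_covariances(x_train, x_new=1.5)
print(C_tilde.shape)  # (4, 4)
```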

And so finally, we'll have a normal distribution for f(x)

with mean mu and variance sigma squared.

Do you remember why we have the normal distribution as the posterior?

Well actually, this happens since the ratio of these two Gaussian densities again has a Gaussian form.

A Gaussian has a parabola in the exponent, so when we divide two Gaussians,

we subtract one parabola from the other, and the result is again a parabola.

So the posterior again will be normal.

It has some mean and some variance.
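
The parabola argument can be sketched in symbols. Here y stands for f(x) and the bold f for the vector of observed values; this shorthand is mine, introduced only for this sketch:

```latex
\log p(y \mid \mathbf{f})
  = -\tfrac{1}{2}
    \begin{pmatrix} y \\ \mathbf{f} \end{pmatrix}^{\!\top}
    \tilde{C}^{-1}
    \begin{pmatrix} y \\ \mathbf{f} \end{pmatrix}
    + \tfrac{1}{2}\, \mathbf{f}^{\top} C^{-1} \mathbf{f}
    + \text{const}
```

The second term and the constant do not depend on y, and the first term is a quadratic polynomial in y, so the density is proportional to exp(-(y - mu)^2 / (2 sigma^2)) for some mu and sigma^2.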

One can derive the formulas for them, and they look as follows.

The mean would be k transposed times C inverse times f, where f is the vector of the observed values,

that is, f(x1), f(x2), and so on.

The variance would be k(0),

the initial variance, minus k transposed times C inverse times k, and

this term shows us

how much the variance decreased after we observed the data points.
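
These two formulas translate almost line by line into NumPy. The sketch below assumes a squared-exponential kernel and adds a tiny diagonal jitter for numerical stability; both are my choices for illustration, as are the function names and the toy data.

```python
import numpy as np

def kernel(d, variance=1.0, length_scale=1.0):
    # Squared-exponential kernel of the difference d (assumed for illustration).
    return variance * np.exp(-0.5 * (d / length_scale) ** 2)

def gp_posterior(x_train, f_train, x_new, jitter=1e-10):
    """Posterior mean and variance of f(x_new) under a zero-mean GP prior."""
    x_train = np.asarray(x_train, dtype=float)
    f_train = np.asarray(f_train, dtype=float)
    C = kernel(x_train[:, None] - x_train[None, :])
    C = C + jitter * np.eye(len(x_train))  # small jitter, an added assumption
    k = kernel(x_new - x_train)
    # Solve linear systems rather than forming the inverse of C explicitly.
    mean = k @ np.linalg.solve(C, f_train)         # k^T C^{-1} f
    var = kernel(0.0) - k @ np.linalg.solve(C, k)  # k(0) - k^T C^{-1} k
    return mean, var

x_train = np.array([0.0, 1.0, 2.5])   # toy data, made up for the example
f_train = np.array([0.5, -0.3, 0.8])
mean, var = gp_posterior(x_train, f_train, x_new=1.5)
print(mean, var)
```

Using `np.linalg.solve` instead of an explicit inverse is a common numerical choice; for larger problems a Cholesky factorization of C would usually be preferred.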

So this is what the posterior distribution would look like.

So notice here that the variance at the data points is zero.

Since we observed their values,

we can say for sure that the value at x1, for example, is just f(x1).

And as we move away from the points,

the variance starts increasing, and when we are really far from the data points,

the process simply goes back to the stationary prior.

So the mean will be zero and the variance would be equal to the initial variance.

This is k(0).
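
You can check this behavior numerically. The following sketch evaluates the posterior at a data point, between data points, and far away from them. The kernel, the jitter, and the toy data are assumptions made for the example.

```python
import numpy as np

def kernel(d, variance=1.0, length_scale=1.0):
    # Squared-exponential kernel (assumed); kernel(0) is the prior variance.
    return variance * np.exp(-0.5 * (d / length_scale) ** 2)

x_train = np.array([0.0, 1.0, 2.5])   # toy data, made up for the example
f_train = np.array([0.5, -0.3, 0.8])
C = kernel(x_train[:, None] - x_train[None, :]) + 1e-10 * np.eye(3)

def posterior(x_new):
    k = kernel(x_new - x_train)
    mean = k @ np.linalg.solve(C, f_train)
    var = kernel(0.0) - k @ np.linalg.solve(C, k)
    return mean, var

# At x = 1.0 (a data point) the variance is essentially zero and the mean
# equals the observed value; at x = 100.0 the mean is zero and the
# variance is back at k(0).
for x_new in [1.0, 1.5, 4.0, 100.0]:
    mean, var = posterior(x_new)
    print(f"x = {x_new:6.1f}  mean = {mean: .4f}  variance = {var:.4f}")
```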

So we should actually preprocess our data to make it look stationary, like the prior.

We want our predictions to be stationary when we go away from the data points.

So we would expect that the mean value is

zero and the variance is k(0).

To make this true,

we should remove the trend and seasonality, and also subtract the mean and normalize.

And so after training the Gaussian process,

we'll have a model of the transformed data.

You should also remember to invert all those transformations

when you predict at a new point.

That is, you have to denormalize,

add the mean back,

add the trend and seasonality back, and then output your prediction.
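
Here is a rough sketch of such a pipeline and its inverse. It assumes a linear trend, a known integer period for the seasonality, and integer time stamps; none of these specifics come from the lecture, and the function names are hypothetical.

```python
import numpy as np

def fit_preprocessing(t, y, period):
    """Estimate trend, seasonal means, mean and scale (hypothetical helper)."""
    t = np.asarray(t)
    y = np.asarray(y, dtype=float)
    slope, intercept = np.polyfit(t, y, deg=1)       # linear trend (assumed)
    detrended = y - (slope * t + intercept)
    # Seasonal component: average of the detrended values at each phase.
    seasonal = np.array([detrended[t % period == p].mean()
                         for p in range(period)])
    residual = detrended - seasonal[t % period]
    return {"slope": slope, "intercept": intercept, "seasonal": seasonal,
            "period": period, "mean": residual.mean(), "std": residual.std()}

def forward(t, y, p):
    """Remove trend and seasonality, subtract the mean, normalize."""
    t = np.asarray(t)
    r = y - (p["slope"] * t + p["intercept"]) - p["seasonal"][t % p["period"]]
    return (r - p["mean"]) / p["std"]

def inverse(t, z, p):
    """Undo the steps in reverse order: denormalize, add mean, seasonality, trend."""
    t = np.asarray(t)
    r = z * p["std"] + p["mean"]
    return r + p["seasonal"][t % p["period"]] + p["slope"] * t + p["intercept"]

# Synthetic series, made up for the example: trend + period-6 season + noise.
rng = np.random.default_rng(0)
t = np.arange(24)
y = 0.5 * t + 2.0 * np.sin(2 * np.pi * t / 6) + 0.1 * rng.standard_normal(24)

params = fit_preprocessing(t, y, period=6)
z = forward(t, y, params)        # this is what the Gaussian process would see
y_back = inverse(t, z, params)   # apply the same inverse to GP predictions
print(np.allclose(y_back, y))
```

In practice you would fit the Gaussian process to `z`, predict at a new time stamp, and pass that prediction through `inverse` with the same time stamp.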