0:03

In the previous module, we derived the general form of the Expectation-Maximization algorithm. In this module, we'll consider a few concrete latent variable models and discuss the details of how to apply the Expectation-Maximization algorithm to them. To start with, let's revisit the Gaussian Mixture Model and see how the algorithm we derived from intuitive considerations fits into this general framework of EM.

Recall that in the Gaussian Mixture Model, we have a dataset of points and we naturally want to model them with a mixture of Gaussians. So the density of each data point is a weighted sum of several different Gaussian densities. And we have three types of parameters: the weights, the Gaussian locations, and the Gaussian covariance matrices.
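For concreteness, the mixture density described above can be sketched in code (a minimal illustration for one-dimensional data; the function name and array layout are assumptions, not from the lecture):

```python
import numpy as np

def gmm_density(x, pi, mu, sigma2):
    """Mixture density p(x) = sum_c pi_c * N(x | mu_c, sigma2_c) for 1-D data."""
    x = np.asarray(x, dtype=float)[:, None]                # shape (n, 1)
    norm = 1.0 / np.sqrt(2.0 * np.pi * sigma2)             # Gaussian normalizers, shape (C,)
    comp = norm * np.exp(-(x - mu) ** 2 / (2.0 * sigma2))  # component densities, shape (n, C)
    return comp @ pi                                       # weighted sum over components
```

With a single standard-normal component, the density at zero is 1/sqrt(2*pi), the peak of the standard Gaussian.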

On the E-step of the Expectation-Maximization algorithm, if we apply it to this Gaussian Mixture Model, we have to set q to be the posterior distribution of the latent variable t, given the data and the parameters. It turns out that we did exactly the same thing in the Gaussian Mixture Model, just from intuitive considerations: this E-step is the same as the first substep of each iteration of the algorithm we derived for the Gaussian Mixture Model.
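The E-step just described can be sketched as follows (same illustrative 1-D setup and hypothetical names as before; by Bayes' rule, the posterior is the joint p(x_i, t_i = c) normalized over c):

```python
import numpy as np

def e_step(x, pi, mu, sigma2):
    """Posterior q(t_i = c) = p(t_i = c | x_i, theta) for each point and component."""
    x = np.asarray(x, dtype=float)[:, None]                # shape (n, 1)
    norm = 1.0 / np.sqrt(2.0 * np.pi * sigma2)
    comp = norm * np.exp(-(x - mu) ** 2 / (2.0 * sigma2))  # p(x_i | t_i = c)
    joint = comp * pi                                      # p(x_i, t_i = c)
    return joint / joint.sum(axis=1, keepdims=True)        # normalize over components
```

Each row of the result sums to one: it is a proper distribution over the Gaussians for each data point.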

Okay, so what about the M-step? Well, on the M-step of the EM algorithm, we have to maximize the expected value of the joint log-likelihood with respect to the parameters, while fixing q to be the posterior from the previous iteration.

And in the Gaussian Mixture Model, we decided to use formulas like this: for example, the mean of the first Gaussian is a weighted average of the data points, with the weights being the posterior probabilities, or q in the notation of the EM algorithm, divided by some normalization. How are these two connected? Well, let's use the blackboard to derive the connection, and to prove that the maximization of this expected value of the logarithm, which the EM algorithm asks us to do, gives the same formulas we intuitively derived for the GMM.

Let's look at how we can apply the general form of the Expectation-Maximization algorithm to the Gaussian Mixture Model. On the M-step of expectation maximization applied to the GMM, the Gaussian Mixture Model, we would like to maximize the following expression with respect to the parameters: the sum over the objects in the dataset of the expected value, with respect to the variational distribution q which we found on the E-step, of the logarithm of the joint distribution, the logarithm of the probability, or density, of x_i and t_i given the parameters.

Let's look closer at this expression in the case of the Gaussian Mixture Model. This thing equals the sum over objects again, and then we write the expected value by definition, so the expected value is just a sum over all values of the latent variable. If we have three Gaussians, it will be a sum from one to three of the weights, the probabilities from the variational distribution q, times the log joint likelihood. This logarithm of the joint likelihood actually equals the log of the density of x_i given t_i, which is Gaussian, times the prior. So we have the normalization of this Gaussian times an exponent; and let's say that we have one-dimensional data, so in this case we have a one-dimensional Gaussian, which looks like this: the exponent of minus x_i minus mu_c, the center of cluster number c, of Gaussian number c, divided by two times the variance of Gaussian number c. And then the prior: the probability that t_i equals c, which is pi_c. It's also a parameter, but we won't optimize over it for now.

Let's rewrite this expression further. This will be, again, a sum over objects, then a sum over the values of the latent variable, of the weights times the logarithm of pi_c divided by Z, the normalization constant, plus the logarithm of the exponent of minus something, which is just minus that something. So it's minus x_i minus mu_c, oh I'm sorry, I forgot the square here, squared, divided by two times the variance.
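Written out, the expression we've built up on the blackboard is (reconstructed from the spoken derivation; q_{ic} denotes the E-step posterior q(t_i = c), and the inner sum runs over the three Gaussians):

```latex
\sum_{i=1}^{n} \mathbb{E}_{q(t_i)}\big[\log p(x_i, t_i \mid \theta)\big]
= \sum_{i=1}^{n} \sum_{c=1}^{3} q_{ic}
  \left[ \log\frac{\pi_c}{\sqrt{2\pi\sigma_c^2}}
       - \frac{(x_i - \mu_c)^2}{2\sigma_c^2} \right]
```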

And let's optimize this expression, for example, with respect to mu_1. The gradient of this whole function with respect to mu_1 will be: first of all, it's a sum over objects, as always, so we can push the gradient inside the sum. Then we have the sum over the Gaussians, but note that the only term that depends on mu_1 is the first one. All other terms, corresponding to c equals two or c equals three, have zero gradient with respect to mu_1. So we don't have any summation with respect to c here; instead, we assume that t_i equals one, because otherwise the gradient will be zero and we don't care. Then we take the gradient of this expression with respect to mu_1, whose first part is zero, because pi_c doesn't depend on mu_1, and the normalization constant of the normal distribution, of the Gaussian distribution, also doesn't depend on the location, on the mean value.

So this will be zero minus, well, here we have this expression. First of all, we'll have the same denominator, two times the variance, which is a constant with respect to mu_1, and then times the gradient of the numerator, which is two times x_i minus mu_c, times minus one; it's the gradient of a composite function, basically the chain rule. And we want to set this gradient to zero to find the optimal value of mu_1.

These twos, the two in the numerator and the two in the denominator, cancel. And it should be mu_1 here and sigma_1, because we assume that c equals one. Finally, we can multiply this whole expression by the variance of the first Gaussian, which we actually don't know; but since it's not zero, we can multiply by it, and so it will vanish.

And this way we'll finally have the following expression: this gradient, multiplied by sigma_1 squared, equals the sum over objects of the weights times two terms. The first term was minus x_i, and minus times minus gives plus, so it will be plus x_i; and minus the second term, the sum over objects of the weights times mu_1. So we're basically decomposing this into two separate sums, and the whole thing equals zero.

You can note that here, mu_1 doesn't depend on i, so we can factor it out of the sum by putting brackets like this. And finally, it's a linear equation in mu_1, so we can write that mu_1 equals a ratio: this term, the sum over objects of the weights times x_i, divided by some normalization, basically the sum over all objects of the weights.
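The update just derived can be sketched in code (hypothetical names; `q` is the matrix of E-step responsibilities, with one row per point and one column per Gaussian):

```python
import numpy as np

def update_means(x, q):
    """mu_c = sum_i q_ic * x_i / sum_i q_ic: the q-weighted average of the points."""
    x = np.asarray(x, dtype=float)
    return (q * x[:, None]).sum(axis=0) / q.sum(axis=0)
```

Each mean is the responsibility-weighted average of the data points softly assigned to that Gaussian; with hard (0/1) responsibilities this reduces to a plain per-cluster average.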

Â 10:18

And you can also solve this maximization problem for the prior weights pi. It's a little bit trickier, because you have to take a constraint into consideration: you have to make sure that each pi_c is non-negative and that they all sum up to one, so that they make a proper prior distribution, so pi_1 plus pi_2 plus pi_3 equals one. It's a little tricky to take this constraint into consideration while maximizing, but you can do it, and you will get the following expression: pi_c is basically just the fraction of the points that were assigned to Gaussian number c. And since our assignments are soft, we'll have the fraction of the weights assigned to it: the sum of all weights corresponding to Gaussian number c, divided by the normalization, which is just the number of objects.
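This weight update can be sketched the same way (hypothetical names; `q` is again the matrix of responsibilities):

```python
import numpy as np

def update_weights(q):
    """pi_c = (total responsibility assigned to Gaussian c) / (number of points)."""
    return q.sum(axis=0) / q.shape[0]
```

The resulting weights are non-negative and sum to one automatically, because each row of `q` does.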

And so we can see that the formulas you get when you apply the general form of the Expectation-Maximization algorithm to the Gaussian Mixture Model basically coincide with what we got when we applied intuitive considerations to derive these formulas.

To summarize: the Expectation-Maximization algorithm applied to the Gaussian Mixture Model, if you use the general form of expectation maximization, uses exactly the same formulas as we derived in the first module by just trying to do something reasonable. In this case, we proved that this intuitive way of thinking about the Gaussian Mixture Model is just a special case of the EM algorithm. In the next video, we will apply EM to some other problems.
