0:03

In previous videos, we derived the update formulas for the E-step. Now, let us move on to the M-step. On the M-step, what we want to do is maximize the expected logarithm of the joint distribution. So, let's write it down: the expected value over q(θ) and q(Z) of the logarithm of the joint distribution p(θ, Z, W), and we want to maximize it with respect to our matrix Φ.

Before we write down the exact value of the logarithm, let's see which terms are constant, so that we can cross out the terms that do not depend on Φ. Actually, there are many of them here. The first term does not depend on Φ, so it is constant; and in the remaining part, the only term that depends on Φ is this one, so we can cross the other terms out too.

Alright, so let's write down this formula. It is an expectation with respect to q(θ) and q(Z) of the following function: the sum over all documents, the sum over all words in each document, n from 1 to N_d (the number of words in document d), and the sum over all topics, t from 1 to T, of the indicator that topic t occurs at this position, times the logarithm of φ_{t, w_dn}, plus some constant. And we are trying to maximize this with respect to the matrix Φ.
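Putting the spoken formula into notation, with [·] denoting the indicator function:

```latex
\mathbb{E}_{q(\theta)\,q(Z)}\sum_{d=1}^{D}\sum_{n=1}^{N_d}\sum_{t=1}^{T}
[z_{dn}=t]\,\log \varphi_{t,\,w_{dn}} + \mathrm{const}
\ \to\ \max_{\Phi}
```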

We should also satisfy two constraints. The first one is that the cells of Φ are probabilities, so they should be non-negative: φ_{tw} ≥ 0 for any topic and any word in the vocabulary. And since each row of Φ is a probability distribution, it should sum to one along the words: if we sum φ_{tw} over w from 1 to the vocabulary size, we should get one for any topic. The first constraint is actually already satisfied, since Φ appears under the logarithm, so we only need to enforce the second constraint.
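The two constraints, written out (W denotes the vocabulary size):

```latex
\varphi_{tw} \ge 0 \quad \forall\, t, w;
\qquad
\sum_{w=1}^{W} \varphi_{tw} = 1 \quad \forall\, t
```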

To do this, let's use the Lagrangian. The Lagrangian here equals the original function plus the sum over all the constraints, each multiplied by a Lagrange multiplier. So the Lagrangian L equals the expected value over q(θ) and q(Z) of the sum over d from 1 to D, the sum over n from 1 to N_d, and the sum over topics of the indicator [z_dn = t] times the logarithm of φ_{t, w_dn}, plus the Lagrange multiplier terms. Those are the sum over all the constraints (we have T of them, one per topic) of the multiplier λ_t times the corresponding constraint, that is, the sum over all words of φ_{tw}, minus one.
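The Lagrangian just described, written out:

```latex
\mathcal{L} =
\mathbb{E}_{q(\theta)\,q(Z)}\Big[\sum_{d=1}^{D}\sum_{n=1}^{N_d}\sum_{t=1}^{T}
[z_{dn}=t]\,\log \varphi_{t,\,w_{dn}}\Big]
+ \sum_{t=1}^{T}\lambda_t\Big(\sum_{w=1}^{W}\varphi_{tw}-1\Big)
```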

Alright, so we can take the expectation under the summation and get a bit shorter formula: the sum over documents, the sum over words, the sum over topics. The expectation of the indicator [z_dn = t] is exactly γ_dn at position t, so we can write it down as γ_dn with the index t at the top: γ_dn^t. So the first part becomes γ_dn^t times the logarithm of φ_{t, w_dn}, plus the constraint terms: λ_t, for t from 1 to the number of topics, times the sum over words of φ_{tw}, minus one.
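After taking the expectation inside, the Lagrangian becomes:

```latex
\mathcal{L} =
\sum_{d=1}^{D}\sum_{n=1}^{N_d}\sum_{t=1}^{T}
\gamma_{dn}^{t}\,\log \varphi_{t,\,w_{dn}}
+ \sum_{t=1}^{T}\lambda_t\Big(\sum_{w=1}^{W}\varphi_{tw}-1\Big)
```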

Alright, the usual way to do such things is to compute the derivative of the Lagrangian with respect to the variables. So let's derive the partial derivative of the Lagrangian with respect to some φ_{tw}. We can push the derivative under the summations, and here we get the following form: the sum over d and the sum over n of γ_dn^t times one over φ_{t, w_dn} (the sum over topics disappears, because only the term with our fixed topic t survives). We also have to ensure that the word index matches the index we are taking the derivative with respect to, so we multiply by the indicator that this word matches: [w_dn = w]. And finally, we have to compute the derivative with respect to the multiplier terms; only one term survives, and that is λ_t. We then set this expression equal to zero.
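Setting the derivative to zero:

```latex
\frac{\partial \mathcal{L}}{\partial \varphi_{tw}} =
\sum_{d=1}^{D}\sum_{n=1}^{N_d}
\gamma_{dn}^{t}\,\frac{[w_{dn}=w]}{\varphi_{tw}}
+ \lambda_t = 0
```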

From this, we can derive the value of φ_{tw}. Wherever the indicator is nonzero, w_dn equals w, so the φ_{t, w_dn} in the denominator is just φ_{tw}. Solving, φ_{tw} equals a fraction: in the numerator, the sum over d and the sum over n of γ_dn^t times the indicator that the word w occurs at position n in document d, over minus λ_t.

What we can do next is sum this up over all possible values of w. On the left-hand side (let me write it down here) we get the sum over w of φ_{tw}, and this equals one by our constraint. So that term equals one. Alright, let me write it down carefully. This means that the minus λ_t here actually equals the numerator summed with respect to all values of w. So minus λ_t equals the sum, over all words, all documents, and all positions, of γ_dn^t times the indicator that the word at the given position is the one we need.
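Written out; and since the indicator sums to one over the vocabulary, the expression collapses to a plain sum of the responsibilities:

```latex
-\lambda_t =
\sum_{w=1}^{W}\sum_{d=1}^{D}\sum_{n=1}^{N_d}
\gamma_{dn}^{t}\,[w_{dn}=w]
= \sum_{d=1}^{D}\sum_{n=1}^{N_d} \gamma_{dn}^{t}
```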

Alright, now we know the value of λ_t, so we can plug it into this formula and get the final result. That is, φ_{tw} equals the sum over d and n of γ_dn^t times the indicator [w_dn = w], over the same thing summed up with respect to all possible words in our vocabulary: the sum over w′, d, and n of γ_dn^t times the indicator [w_dn = w′]. And here is the update formula for Φ.
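As a sketch, this Φ update can be implemented directly from the γ values. The names here (`m_step_phi`, `docs`, `gamma`) and the data layout are illustrative assumptions, not from the lecture:

```python
import numpy as np

def m_step_phi(docs, gamma, n_topics, vocab_size):
    """M-step update: phi[t, w] proportional to the sum over d, n
    of gamma_dn^t * [w_dn == w], normalized so each row sums to one.

    docs[d]  : list of word ids for document d
    gamma[d] : (N_d x n_topics) array of topic responsibilities
    """
    phi = np.zeros((n_topics, vocab_size))
    for d, words in enumerate(docs):
        for n, w in enumerate(words):
            phi[:, w] += gamma[d][n]          # add gamma_dn^t to column w
    phi /= phi.sum(axis=1, keepdims=True)     # enforce sum_w phi[t, w] = 1
    return phi

# Tiny made-up example: 2 documents, 2 topics, vocabulary of 3 words
docs = [[0, 1, 1], [2, 0]]
gamma = [np.array([[1.0, 0.0], [0.5, 0.5], [0.5, 0.5]]),
         np.array([[0.0, 1.0], [1.0, 0.0]])]
phi = m_step_phi(docs, gamma, n_topics=2, vocab_size=3)
# phi[0] -> [2/3, 1/3, 0], phi[1] -> [0, 1/2, 1/2]
```

Note that a topic receiving zero total responsibility would make the normalizer zero; in practice a small smoothing constant is often added before normalizing.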

So, let's see our algorithm again. On the E-step, we iterate the updates of q(θ) and q(Z) until convergence, and on the M-step, we update Φ using this formula. So now we are ready to train our model.

However, we should do one more thing. Besides training, we need to know how to predict on new data. For example, you have a new document in your corpus, and you want to predict the values of Z (the markup of the words: for each word we assign a topic) and also θ, the global distribution of topics in this document. Let's do it here.

Alright, so to do this, we want to approximate the posterior probability p of θ_{d*} (let me write the index as d*, where d* is the index of our new document) and z_{d*}, the topic assignments for the new document, given our training data and also the matrix Φ that we found with the EM algorithm. To approximate it, let's do the mean field approximation again: we try to find it in the factorized form q(θ_{d*}) q(z_{d*}), and we minimize the KL divergence between these two with respect to q(θ_{d*}) and q(z_{d*}). So here is our formula for prediction: we can do the mean field approximation with the same formulas that we derived for the E-step. So, we know how to train the model and we also know how to predict from it.
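The prediction objective, written out (conditioning on the words of the new document and the learned Φ):

```latex
\min_{q(\theta_{d^*}),\,q(z_{d^*})}\
\mathrm{KL}\Big(q(\theta_{d^*})\,q(z_{d^*})\ \Big\|\
p\big(\theta_{d^*}, z_{d^*} \mid w_{d^*}, \Phi\big)\Big)
```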

And in the next video, you will see some extensions: how we can modify the model so that it has some desired properties.
