In the previous videos, we derived the update formulas for the E-step.

Now, let us move on to the M step.

So, on the M-step, what we want to do is to maximize

the expected logarithm of the joint distribution.

Alright, so let's write it down.

The expected value over Q of theta and Q of Z of

the logarithm of the joint distribution P of theta,

Z, and W. And we want to maximize it with respect to our matrix, Phi.

And before we write down the exact value for the logarithm,

let's see which values are constant.

So, we have to cross out the terms that do not depend on Phi.

Actually, there are many of them here.

So, the first term does not depend on Phi, so it is a

constant, and also here the only term that depends on Phi is this one,

so you can cross this term out too.

Alright, so let's write down this formula.

This would be an expectation with respect to Q of theta,

Q of Z of this function.

So, it should be the sum over all documents,

sum over all words in this document.

So, it would be N from one to Nd,

the number of words of the document.

Sum over all topics,

T from one to capital T,

the indicator that topic T occurs at this position.

And finally, the logarithm of Phi T, W D N, plus some constant.

And we're trying to maximize this with respect to the matrix Phi.
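
In symbols, the objective can be sketched roughly as follows. The lower-case symbols and the square brackets for the indicator are an assumed notation for what is written on the slide, not a copy of it:

```latex
\max_{\Phi} \;
\mathbb{E}_{q(\Theta)\, q(Z)}
\sum_{d=1}^{D} \sum_{n=1}^{N_d} \sum_{t=1}^{T}
[z_{dn} = t] \, \log \phi_{t,\, w_{dn}} \; + \; \text{const}
```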

We should also satisfy two constraints.

The first one is that the cells of Phi are probabilities, so they should be non-negative.

So, Phi T W should be non-negative

for any topic and for any word in the vocabulary.

And also, we should ensure that, since it is

a probability distribution, it sums to one along the W's.

So, if we sum up over W's from one to the vocabulary size,

Phi T.W should be equal to one for any topic.

The first constraint is actually already satisfied since we have Phi under the logarithm.

So, we can focus only on the second constraint.

To do this, let's use the Lagrangian.

So, the Lagrangian here would be equal to

the full function plus the sum over all constraints,

each of the constraints multiplied by its Lagrange multiplier.

So, the Lagrangian L would be equal to the expected value over Q of theta and Q of Z of the

sum over D from one to capital D,

sum over N from one to number of words.

And finally, sum over topics,

the indicator again, times the logarithm of

Phi T, W D N,

plus the Lagrange multiplier terms.

Those would be a sum over all constraints.

We have T of them, capital T of them.

The multiplier Lambda T,

times the constraint, so it would be the sum over all words of Phi T W,

minus one.

All right,

so we can move the expectation under the summation and get a slightly shorter formula.

So, it will be sum over documents,

sum over words, sum over topics.

The expectation of the indicator that Z D N equals T is actually Gamma D N at position T. So,

we can write it down as Gamma D N, and let me write down the index T here at the top.

So, it would be times the logarithm of Phi

T, W D N, plus the constraints.

The sum over T from one to the number of topics of Lambda T times

the sum over the words of

Phi T W, minus one.
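
Written out, the Lagrangian could look like the sketch below. Here V for the vocabulary size is a symbol assumed for illustration, and gamma with the upper index t stands for the expectation of the indicator that z_dn equals t:

```latex
L = \sum_{d=1}^{D} \sum_{n=1}^{N_d} \sum_{t=1}^{T}
    \gamma_{dn}^{t} \, \log \phi_{t,\, w_{dn}}
  + \sum_{t=1}^{T} \lambda_t \left( \sum_{w=1}^{V} \phi_{tw} - 1 \right),
\qquad
\gamma_{dn}^{t} = \mathbb{E}_{q(Z)} [z_{dn} = t]
```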

All right, so the usual way to do

such things is to compute the derivative of the Lagrangian with respect to the variables.

So, let's try to derive the partial derivative of

the Lagrangian with respect to some Phi T W.

So, you can put the derivative under the summations, and here we will get the following form.

So, it would be sum over D,

sum over N, and from the sum over topics only the term with our topic T remains,

Gamma D N at position T, times one over Phi T, W D N.

And also, we have to ensure

that this index matches the index that we take the derivative with respect to.

So, we'll have to multiply it by the indicator that this word matches this word.

So, it would be the indicator that W D N equals W. And finally,

we have to compute the derivative of these terms, so it would be plus,

we will have only one term,

that is Lambda T, and that is it.

So, we want this thing to be equal to zero.
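
As a sketch in symbols, using the same assumed notation as before: for a fixed topic t and word w, only the terms with that topic and with w_dn equal to w survive the differentiation, so the stationarity condition reads roughly:

```latex
\frac{\partial L}{\partial \phi_{tw}}
= \sum_{d=1}^{D} \sum_{n=1}^{N_d}
  \gamma_{dn}^{t} \, [w_{dn} = w] \, \frac{1}{\phi_{tw}}
  + \lambda_t
= 0
```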

So, from this we can derive the value of Phi T.W.

In this case, W D N actually equals W. So,

we can say that this term

equals Phi T W,

unless [inaudible]. So, Phi T W

equals this summation in the numerator,

sum over D, sum over N of Gamma D N T times

the indicator that the word W occurs at position N in document D,

over minus Lambda T. So,

what we can do next is sum it up over

all possible values of W. The thing on the left,

let me write it down here.

So, we add the sum with respect to W here,

and the sum with respect to W here,

and this thing equals one from our constraint.

So, this term equals one.

All right, let me write it down carefully.

So, this means that the Lambda here,

the minus Lambda here,

actually equals the numerator summed with respect to all values of W. So,

minus Lambda T actually equals the sum,

let me write down just the letters,

with respect to all words, all documents,

and all positions of Gamma D N at position T times

the indicator that the word at the given position is the one that we need.

Alright, now that we know the value of Lambda,

we can plug it into this formula and get the final result.

That is, Phi T W equals the sum with respect to D and N of

Gamma D N T times the indicator that

W D N equals W, over the same thing

summed up with respect to all possible words in our vocabulary.

So, the sum over W prime, D, and N of

Gamma D N T times the indicator that

W D N equals W prime.

And here is the update formula for Phi.
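
Collecting the steps, the multiplier and the resulting update can be sketched as below. The notation is the same assumed one as in the spoken derivation, with V as an illustrative symbol for the vocabulary size. Note that the topic t is fixed on both sides, so there is no summation over topics:

```latex
-\lambda_t
= \sum_{w'=1}^{V} \sum_{d=1}^{D} \sum_{n=1}^{N_d}
  \gamma_{dn}^{t} \, [w_{dn} = w'],
\qquad
\phi_{tw}
= \frac{\sum_{d=1}^{D} \sum_{n=1}^{N_d} \gamma_{dn}^{t} \, [w_{dn} = w]}
       {\sum_{w'=1}^{V} \sum_{d=1}^{D} \sum_{n=1}^{N_d}
        \gamma_{dn}^{t} \, [w_{dn} = w']}
```

Since each position holds exactly one word, the denominator is simply the total of Gamma D N T over all documents and positions.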

So, let's see our algorithm again.

On the E-step, we [inaudible] update theta and Z until convergence.

And on the M-step,

we update Phi using the following formula.
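
To make the M-step concrete, here is a minimal sketch in plain Python. It is not code from the course: the function name `m_step`, the nested-list layout of `docs` and `gamma`, and the uniform fallback for a topic with zero total weight are all assumptions made for illustration.

```python
def m_step(docs, gamma, num_topics, vocab_size):
    """Sketch of the Phi update (hypothetical helper, not from the lecture).

    docs[d][n]     -- index of the word at position n of document d
    gamma[d][n][t] -- E-step estimate of q(z_dn = t)
    Returns phi, where phi[t][w] is the probability of word w in topic t.
    """
    # Numerator: for each topic t and word w, add up gamma[d][n][t]
    # over all positions (d, n) where the word w occurs.
    counts = [[0.0] * vocab_size for _ in range(num_topics)]
    for doc, doc_gamma in zip(docs, gamma):
        for word, word_gamma in zip(doc, doc_gamma):
            for t in range(num_topics):
                counts[t][word] += word_gamma[t]

    # Denominator: the same sum taken over all words, so each row sums to one.
    phi = []
    for row in counts:
        total = sum(row)
        if total > 0.0:
            phi.append([c / total for c in row])
        else:
            # Assumption: a topic with no weight falls back to uniform.
            phi.append([1.0 / vocab_size] * vocab_size)
    return phi


# Tiny made-up example: 2 documents, 2 topics, vocabulary of 3 words.
docs = [[0, 1, 1], [2, 0]]
gamma = [
    [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]],
    [[0.0, 1.0], [1.0, 0.0]],
]
phi = m_step(docs, gamma, num_topics=2, vocab_size=3)
print(phi)
```

The nested loops are kept deliberately plain so that each line maps to one summation in the formula; a vectorized version would be the natural choice for a real corpus.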

So, now, we are ready to train our model.

However, we should do one more thing.

Besides training, we need to know how to predict new values.

For example, you have a new document in your book and you

want to predict the values of Z.

Those are the markup of the words.

So, for each word we want to assign a topic. And also theta:

we want to find the global distribution of topics in this document.

Let's do it here.

All right, so to do this,

we want to approximate the probability P of theta D star,

let me write it down with the index D star,

where D star is the index of our new document,

and Z D star,

the topic assignments for the new document,

given our training data and also the matrix Phi that we found with the EM algorithm.

So, to approximate it, let's do the mean field approximation again.

So, we try to find it in the form of Q of theta D star times

Q of Z D star.

And so, we will try to minimize the KL divergence between these two terms.

So, here's the KL divergence, and we minimize it with respect

to Q of theta D star and Q of Z

D star.
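
In symbols, the prediction problem could be sketched as follows. Conditioning on the words of the new document, written here as w with index d star, is an assumed reading of what is on the slide:

```latex
\min_{q(\theta_{d^*}),\; q(z_{d^*})}
\mathrm{KL}\!\left(
  q(\theta_{d^*}) \, q(z_{d^*})
  \;\middle\|\;
  p(\theta_{d^*}, z_{d^*} \mid w_{d^*}, \Phi)
\right)
```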

So, here is our formula for prediction.

We can do the mean field approximation with the same formulas that we derived for the E-step.

So, we know how to train the model and we also know how to predict from it.

And in the next video,

you will see some extensions.

Those are how we can modify the model so that we have some desired properties.