0:03

Let's discuss how we can apply Markov chain Monte Carlo to train the Latent Dirichlet Allocation model, or LDA for short.

We already discussed the LDA model in the previous week, on variational inference, and derived some complicated formulas to do the training. But now, let's see how Markov chain Monte Carlo can be applied to the same problem, and what the pros and cons of this approach are.

Recall that LDA is a model for finding topics in a corpus of documents. The idea of LDA is that each document has a distribution over topics. For example, one document's distribution can be 80% about cats and 20% about dogs. Then, each topic is a distribution over words. For example, the topic about cats can, in principle, generate any word, but it is much more likely to generate words like "cat" or "meow" than others.

And so, when we look at a particular document, for example, "cat meowed on the dog," we can imagine that this document was generated as follows. First of all, we decided on which distribution over topics to use; here, it's 80 for cats and 20 for dogs. Then, for each word, we must decide on a topic. For our first word, we may flip a biased coin, which lands on cats with probability 80%. So for our first word, we happened to draw the topic cats. Now we have to generate the first word itself: we know the topic, and we generate the word from it. We could generate any word from our vocabulary, but we happened to draw the word "cat" as the first one. Then we repeat: we again sample a topic for the second word, and let's say it again happened to be cats, and we generated the word "meow." And then for our last word, let's say we happened to draw the topic dogs and generate the word "dog."

Notice that for this model, the order of words doesn't matter; it's a bag-of-words approach. But it is still more than flexible enough to discover meaningful topics in the documents.
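This generative story can be sketched in a few lines of Python. The topic proportions and word distributions below are toy numbers made up for illustration, not anything fitted to data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy numbers for illustration only -- not fitted values.
theta_d = {"cats": 0.8, "dogs": 0.2}             # document's distribution over topics
phi = {                                          # each topic's distribution over words
    "cats": {"cat": 0.5, "meow": 0.4, "dog": 0.1},
    "dogs": {"cat": 0.1, "meow": 0.1, "dog": 0.8},
}

def generate_document(n_words):
    """For each word: first draw its topic from theta_d, then a word from that topic."""
    words = []
    for _ in range(n_words):
        topic = rng.choice(list(theta_d), p=list(theta_d.values()))
        dist = phi[topic]
        words.append(rng.choice(list(dist), p=list(dist.values())))
    return words

print(generate_document(4))
```

Note that the word order plays no role here: each word is drawn independently given the document's topic proportions, which is exactly the bag-of-words assumption.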

To put it a little more mathematically, each document has a distribution over topics, Theta_d; so our document d has a distribution Theta_d, which is 80 and 20 in this case. And each word has a topic, which we don't know; it's a latent variable. For words from 1 to n, Z_i is the topic of the i-th word in document d, and these topics are generated from the distribution Theta_d. Then each word itself is generated from the appropriate topic: if the first topic is about cats, we generate word one from the distribution for the topic cats. This distribution is called Phi, so Phi_cats is the distribution over words for the topic cats. And thus we generate the words themselves, which we observe and use to train our model, to find all the necessary latent variables and parameters.
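In symbols, the generative process just described can be written as follows; the Dirichlet prior on Theta_d with parameter alpha is the standard choice (it appears explicitly later in the lecture), and the notation here is the usual convention rather than something introduced on the slide:

```latex
\theta_d \sim \mathrm{Dirichlet}(\alpha), \qquad
z_i \mid \theta_d \sim \mathrm{Categorical}(\theta_d), \qquad
w_i \mid z_i, \Phi \sim \mathrm{Categorical}(\varphi_{z_i})
```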

In the variational inference week, we discussed how to use the expectation maximization algorithm to train this LDA model. On the E-step, we tried to find the posterior over the latent variables Z and Theta, but it was too hard to do analytically, so we decided to approximate it: we looked for an approximation of the posterior in the family of factorized distributions, which is the idea of variational inference. Then, on the M-step, we maximized the expected value of the logarithm of the joint probability with respect to this approximate posterior distribution. And we spent quite some time deriving the necessary analytical formulas for the E and M steps of this EM algorithm for LDA.

To summarize the model we have for LDA: we have known data W, which is the words in our documents; we have unknown parameters Phi, the distribution over words for each topic; and we have unknown latent variables Z and Theta. For each word, Z gives us the topic of that word, and for each document, Theta gives us the distribution over topics for that document. It makes sense that we decided Phi would be a parameter while Z and Theta would be latent variables, because the number of variables Z and Theta depends on the amount of training data we have: for each document, we have to add one new Theta and a set of new Zs. So it makes sense to make them latent variables, so that our parameterization does not grow with the training data size. Phi, on the other hand, has a fixed size: the number of topics times the number of words in the dictionary. So we treated Phi as parameters in our variational inference approach to LDA.

But let's see how we can deal with this model if we go fully Bayesian, that is, if we treat Phi as a latent variable itself. Now we don't have any parameters at all: we don't want to estimate any numbers, but rather to marginalize out all the latent variables and make predictions. We will apply Markov chain Monte Carlo to this fully Bayesian version of LDA. We could also have applied Markov chain Monte Carlo inside the EM algorithm, but let's just use this fully Bayesian model as an illustration.

If we want to apply Markov chain Monte Carlo, we just need to sample all latent variables from the posterior distribution, for example, by using Gibbs sampling. We start with some initialization of all our latent variables, and then, on the first iteration, we sample the first coordinate of the latent variable Phi from its conditional distribution, conditioned on the initialized values. Then we do the same for the second coordinate, conditioning on the dimension we just generated in the previous sub-step and on the values we still have from initialization. We repeat this for every coordinate to obtain the full Phi, and then do the same for Theta and Z. So for Theta, we generate its i-th coordinate by sampling from the conditional distribution of Theta_i, given all the Phis we have already generated, all the Thetas we have already generated, and the initial values of the remaining Thetas and of Z. And the same for Z.

This is the usual Gibbs sampling, and it's really easy: you can implement it in something like five lines of Python, nothing complicated here. We run a few hundred iterations of Gibbs sampling, wait until convergence, and then obtain a sample from the desired posterior distribution.
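A minimal sketch of such a Gibbs sampler for the fully Bayesian LDA model might look as follows (a bit more than five lines, but each step is simple). The tiny corpus and the hyperparameter values `alpha` and `beta` are made up for illustration; each sweep resamples Z, Theta, and Phi from their exact conditional distributions, which are categorical and Dirichlet thanks to conjugacy.

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny toy corpus: documents as lists of word ids from a vocabulary of size V.
docs = [[0, 1, 0, 2], [2, 2, 1, 2]]
K, V = 2, 3                  # number of topics, vocabulary size
alpha, beta = 1.0, 1.0       # Dirichlet hyperparameters (illustrative values)

# Initialization: a random topic for every word, then random theta and phi.
z = [rng.integers(K, size=len(doc)) for doc in docs]
theta = rng.dirichlet(np.full(K, alpha), size=len(docs))
phi = rng.dirichlet(np.full(V, beta), size=K)

for _ in range(200):                       # Gibbs sweeps
    # Resample each z given theta and phi: p(z=k) is prop. to theta[d,k]*phi[k,w].
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            p = theta[d] * phi[:, w]
            z[d][i] = rng.choice(K, p=p / p.sum())
    # Resample theta_d given z: Dirichlet(alpha + topic counts in document d).
    for d, doc in enumerate(docs):
        counts = np.bincount(z[d], minlength=K)
        theta[d] = rng.dirichlet(alpha + counts)
    # Resample phi_k given z and w: Dirichlet(beta + counts of each word in topic k).
    for k in range(K):
        counts = np.zeros(V)
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                if z[d][i] == k:
                    counts[w] += 1
        phi[k] = rng.dirichlet(beta + counts)

print(np.round(phi, 2))    # one posterior sample of the topic-word distributions
```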

Once we have at least one sample from this posterior, we can just use it as an estimate of our parameters. So one sample from the posterior over Phi, for example, can be used as the trained value of Phi. Or you can average a few Phis if you want, or make predictions by averaging over a few Phis. You can use the samples to do almost anything with your model. It's really easy to do, but the problem is that it will not be very efficient: compared to variational inference, this approach will not be very fast. So let's see what we can do to make it faster in this particular case.

The idea here is to look closely at the model and try to marginalize something out. Gibbs sampling is basically trying to integrate something: we are substituting an expected value, which is an integral, with Monte Carlo samples. And if we can integrate out some parts analytically, and use Monte Carlo only for the rest, then the whole scheme will probably be more efficient. So let's see if we can integrate out some parts of the model analytically.

This is the model. We have a prior distribution on Theta, where Theta is just the probabilities of the topics for each document: for example, Theta for the first document may be 80 and 20, 80 for cats and 20 for dogs. The prior on Theta is a Dirichlet distribution, which is natural; it's the standard prior for this kind of latent variable, a vector of probabilities for a categorical variable. Then we have the conditional distribution of Z given Theta, which is just Theta itself: the topic of some word will be, for example, cats with probability Theta_cats.

Let's see what we can do about this model. First of all, notice that these two distributions are conjugate, which basically means that we can find the posterior over Theta analytically. And it's really easy: you can look up the formulas on Wikipedia, or derive them yourself; nothing complicated here, because the distributions are conjugate.
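Concretely, Dirichlet-categorical conjugacy gives the posterior in closed form; here n_k counts how many of the words in the document were assigned to topic k (standard notation, not from the slide):

```latex
p(\theta \mid z) \;\propto\; \Big(\prod_{i=1}^{n} \theta_{z_i}\Big)\,
\mathrm{Dirichlet}(\theta \mid \alpha)
\;\;\Longrightarrow\;\;
\theta \mid z \sim \mathrm{Dirichlet}(\alpha_1 + n_1, \dots, \alpha_K + n_K),
\qquad n_k = \#\{\, i : z_i = k \,\}
```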

Then we may try to marginalize out Theta, that is, find the marginal distribution of Z. At first glance this looks complicated, since it has an integral inside, but it turns out that since we have the posterior over Theta, we can represent it in this way. This formula comes just from the definition of conditional probability: if you multiply both sides by the probability of Theta given Z, you get the usual definition of conditional probability. And on the right-hand side, you know all the terms. You know the likelihood, P of Z given Theta, which is just Theta itself; you know the prior, which is just a Dirichlet distribution; and you know the posterior, because these distributions are conjugate and you can compute it. So these things are easy.
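Written out, the trick is the following identity, which is just the definition of conditional probability rearranged, and which holds for any value of Theta:

```latex
p(z) \;=\; \frac{p(z, \theta)}{p(\theta \mid z)}
\;=\; \frac{p(z \mid \theta)\, p(\theta)}{p(\theta \mid z)}
```

All three factors on the right-hand side are known analytically thanks to conjugacy, so no integration is needed.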

Now, we can do the same thing with Phi and W. Notice that the prior distribution on Phi is conjugate to the conditional distribution of W, and thus you can easily obtain the posterior distribution on Phi. In the same way, you can marginalize out Phi: you can find W given Z by using this formula, without any integration, because you know the posterior analytically. And finally, you can multiply P of Z by P of W given Z to obtain an unnormalized version of the posterior on Z given your data. Now all that's left is, for example, to sample from this posterior distribution.

Notice that we don't actually know the posterior exactly, because we don't know the normalization constant, and it's really expensive to compute. But that doesn't matter if we use Gibbs sampling: we can sample Z anyway. And this gives us a very efficient scheme, which is called collapsed Gibbs sampling, because we collapsed some of the latent variables and marginalized them out analytically. This approach allows you to sample only the Zs, and thus makes the overall procedure more efficient.
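A minimal sketch of collapsed Gibbs sampling for LDA, using the standard count-based update in which each word's topic is resampled with Theta and Phi integrated out. The corpus and the hyperparameters `alpha` and `beta` are toy values for illustration; `ndk`, `nkw`, and `nk` are the usual count tables (topic counts per document, word counts per topic, and total counts per topic).

```python
import numpy as np

rng = np.random.default_rng(0)

docs = [[0, 1, 0, 2], [2, 2, 1, 2]]   # toy corpus: word ids, vocabulary size V
K, V = 2, 3
alpha, beta = 1.0, 1.0                # illustrative hyperparameters

# Initialize topics randomly and build the count tables.
z = [rng.integers(K, size=len(doc)) for doc in docs]
ndk = np.zeros((len(docs), K))        # topic counts per document
nkw = np.zeros((K, V))                # word counts per topic
nk = np.zeros(K)                      # total word counts per topic
for d, doc in enumerate(docs):
    for i, w in enumerate(doc):
        k = z[d][i]
        ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1

for _ in range(200):                  # collapsed Gibbs sweeps: only z is sampled
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k = z[d][i]               # remove word i from the counts ...
            ndk[d, k] -= 1; nkw[k, w] -= 1; nk[k] -= 1
            # ... and resample its topic with theta and phi integrated out:
            p = (ndk[d] + alpha) * (nkw[:, w] + beta) / (nk + V * beta)
            k = rng.choice(K, p=p / p.sum())
            z[d][i] = k
            ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1

# Point estimate of phi from this sample: the posterior mean of phi given z.
phi_hat = (nkw + beta) / (nk[:, None] + V * beta)
print(np.round(phi_hat, 2))
```

Compared to the non-collapsed sampler, each sweep here touches only the topic assignments and a few count tables, which is why the collapsed scheme tends to mix faster and cost less per iteration.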

And you may ask: if I have samples only from the posterior on Z, how can I find the values of Phi, for example, which is something I care about? Well, because you know all the posterior distributions, you can obtain these kinds of things easily from the samples from the posterior on Z. The posterior probability of Phi given W, for example, which you may be interested in, can be computed by marginalizing Z out of the joint posterior of Phi and Z given W. And this is an expected value with respect to the posterior on Z given W, so you can obtain it from samples: you can sample a few Zs from this posterior distribution and then average to get an estimate of the posterior distribution over Phi, and so get the actual values of Phi, which are the probabilities of each word for each topic.
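In formulas, this averaging looks as follows; the closed-form posterior mean inside the average uses the count notation n_{k,w} (the number of times word w is assigned to topic k in sample s) and the vocabulary size V, which are conventions assumed here rather than taken from the slide:

```latex
p(\varphi \mid W) = \mathbb{E}_{p(Z \mid W)}\big[\, p(\varphi \mid Z, W) \,\big],
\qquad
\hat{\varphi}_{k,w} = \frac{1}{S}\sum_{s=1}^{S}
\frac{n^{(s)}_{k,w} + \beta}{n^{(s)}_{k} + V\beta}
```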

If you want to know more details about this kind of collapsed Gibbs sampling for LDA, look into the original paper that introduced it; it's really cool. Check out the additional reading material if you're interested. But the bottom line here is that if you run collapsed Gibbs sampling for LDA, it will be even a little bit more efficient than variational inference.

Here on the plot, you can see basically time on the X-axis, or the number of operations performed, and some measure of error on the Y-axis; the lower, the better. And you can see that Gibbs sampling performs a little bit better than variational inference, and also better than something called EP, or Expectation Propagation, which we will not talk about in this course.

The bottom line here is that we were able to obtain a really efficient scheme for the LDA model without the trouble of variational inference. We didn't have to derive a large number of complicated formulas: we just applied the usual scheme of Gibbs sampling, which wasn't very efficient, and then marginalized out some parts, which was also really easy. And we finally got an efficient scheme for LDA.

One more thing: generally, Markov chain Monte Carlo is not well suited for mini-batches, and so for large data, because it has to process the whole dataset on each iteration. But sometimes it works. People do sometimes use Markov chain Monte Carlo with mini-batches, and sometimes it fails miserably; but in this case, the authors just used mini-batches, and it works quite nicely here.
