0:21

So this is a slide that you have seen before; here, we show how we define the mixture model for text clustering and what the likelihood function looks like.

And we can also compute the maximum likelihood estimate,

to estimate the parameters.

In this lecture, we're going to talk more about how exactly we compute the maximum likelihood estimate.

As in most cases, the EM algorithm can be used to solve this problem for mixture models.

So here's the detail of the EM algorithm for document clustering.

Now, if you have understood how the EM algorithm works for topic models like PLSA, then this will look very similar. We just need to adapt it a little bit to this new mixture model.

So as you may recall, the EM algorithm starts with an initialization of all the parameters.

So this is the same as what happened before for topic models.

1:28

And then we're going to repeat until the likelihood converges, and in each iteration we'll do an E-step and an M-step. In the E-step, we're going to infer which distribution has been used to generate each document.

So we introduce a hidden variable Z_d for each document, and this variable can take a value in the range 1 through k, representing the k different distributions.

1:59

More specifically, we're going to apply Bayes' rule to infer which distribution is more likely to have generated this document; that is, to compute the posterior probability of each distribution given the document.

2:17

And we know it's proportional to the probability of selecting this distribution, p(Z_d = i), times the probability of generating the whole document from that distribution, which is the product of the word probabilities for this document, as you see here.

Now, as always, it's useful to remember the normalizer, or the constraint on this probability.

So in this case, we know the constraint on this probability in the E-step is that the probabilities of Z_d = i must sum to 1 over all i. That's because the document must have been generated from precisely one of these k topics, so the probabilities of being generated from each of them should sum to 1.

And if you know this constraint, then you can easily compute this distribution

as long as you know what it is proportional to.

So once you compute this product that you see here, then you simply normalize

3:31

these probabilities, to make them sum to 1 over all the topics.

So that's the E-step. After the E-step, we know which distribution is more likely to have generated each document d, and which is unlikely.
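The E-step described above can be sketched in a few lines of Python. This is a minimal illustration; the names `e_step`, `doc_counts`, `pi`, and `word_probs` are my own choices, not notation from the slides.

```python
def e_step(doc_counts, pi, word_probs):
    """Posterior p(Z_d = i | d) for one document, via Bayes' rule.

    doc_counts: dict mapping word -> count in the document
    pi:         list of k prior probabilities p(theta_i)
    word_probs: list of k dicts, word -> p(w | theta_i)
    """
    k = len(pi)
    # Unnormalized posterior: p(theta_i) * prod_w p(w | theta_i)^c(w, d)
    scores = []
    for i in range(k):
        s = pi[i]
        for w, c in doc_counts.items():
            s *= word_probs[i][w] ** c
        scores.append(s)
    total = sum(scores)  # the normalizer: posteriors must sum to 1
    return [s / total for s in scores]
```

Because of the sum-to-1 constraint, we never need the absolute probabilities, only what each posterior is proportional to.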

3:45

And then in the M-step, we're going to re-estimate all the parameters based on the inferred z values, or the inferred knowledge about which distribution has been used to generate which document.

So the re-estimation involves two kinds of parameters. One is p(theta_i), the probability of selecting a particular distribution.

Before we observe anything,

we don't have any knowledge about which cluster is more likely.

But after we have observed these documents, we can collect the evidence to infer which cluster is more likely.

And so this is proportional to the sum, over all the documents, of the probability that Z_d is equal to i.

4:54

Now the other kind of parameters are the probabilities of words in each

distribution, in each cluster.

And this is very similar to the case of PLSA; here we just collect the counts of words in documents that are inferred to have been generated from a particular topic theta_i.

This allows us to estimate how many words have actually been generated from theta_i.

And then we'll normalize these counts into probabilities so that the probabilities of all the words sum to 1.
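The M-step re-estimation of both kinds of parameters can be sketched as follows; again, a minimal illustration with assumed names, not the lecture's exact notation.

```python
def m_step(docs, posteriors, vocab, k):
    """Re-estimate p(theta_i) and p(w | theta_i) from E-step posteriors.

    docs:       list of dicts, word -> count
    posteriors: posteriors[d][i] = p(Z_d = i | doc d) from the E-step
    vocab:      set of all words
    """
    n = len(docs)
    # p(theta_i): average the posteriors over all documents
    pi = [sum(posteriors[d][i] for d in range(n)) / n for i in range(k)]
    # p(w | theta_i): fractional counts, then normalize so each
    # distribution sums to 1 over the whole vocabulary
    word_probs = []
    for i in range(k):
        counts = {w: sum(docs[d].get(w, 0) * posteriors[d][i]
                         for d in range(n)) for w in vocab}
        total = sum(counts.values())
        word_probs.append({w: c / total for w, c in counts.items()})
    return pi, word_probs
```

Note how each constraint mentioned in the lecture shows up as a normalizer: the division by `n` for the priors, and the division by `total` for each word distribution.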

Note that it's very important to understand these constraints, as they are precisely the normalizers in all these formulas.

And it's also important to know what each distribution is over.

5:54

For example, the probability of theta_i is a distribution over all the k topics; that's why these k probabilities sum to 1. Whereas the probability of a word given theta_i is a distribution over all the words, so there are many probabilities, and they have to sum to 1.

So now, let's take a look at a simple example with two clusters. For the two clusters, I've assumed some initial values for the two word distributions. And let's assume we randomly initialize the probability of selecting each cluster as 0.5, so they are equally likely.

And then let's consider one document that you have seen here.

There are two occurrences of "text" and two occurrences of "mining". So there are four words in total, and "medical" and "health" did not occur in this document.

So let's think about the hidden variable.

6:50

Now, for each document we must use a hidden variable. Before, in PLSA, we used one hidden variable for each word, because a word is the output of the mixture model. In our case, the output of the mixture model, or the observation from the mixture model, is a document, not a word. So now we have one hidden variable attached to the whole document.

Now that hidden variable must tell us which distribution has been used to

generate the document.

So it's going to take two values, one and two to indicate the two topics.

7:25

So now, how do we infer which distribution has been used to generate d? Well, we use Bayes' rule, so it looks like this.

In order for the first topic theta 1 to generate a document,

two things must happen.

First, theta sub 1 must have been selected.

So it's given by p of theta 1.

Second, it must also have generated the four words in the document, namely two occurrences of "text" and two occurrences of "mining".

And that's why you see the numerator has the product of the probability of

selecting theta 1 and the probability of generating the document from theta 1.

So the denominator is just the sum of the two possibilities of generating this document.

And you can plug in the numerical values to verify indeed in this case,

the document is more likely to be generated from theta 1,

much more likely than from theta 2.

So once we have this probability,

we can easily compute the probability of Z equals 2, given this document.

How?

Well, we can use the constraint.

That's going to be 1 minus 100 over 101.
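To see the computation end to end, here is a small numeric sketch with made-up initial values; the slide's actual numbers, which lead to the 100 over 101 result, are not reproduced here.

```python
# Hypothetical initial values (not the slide's actual numbers):
p_theta = [0.5, 0.5]                   # p(theta_1), p(theta_2)
p_w = [{"text": 0.4, "mining": 0.4},   # theta_1 favors text/mining
       {"text": 0.1, "mining": 0.1}]   # theta_2 does not
doc = {"text": 2, "mining": 2}

# Numerator for each topic: p(theta_i) * p(text|theta_i)^2 * p(mining|theta_i)^2
num1 = p_theta[0] * p_w[0]["text"] ** 2 * p_w[0]["mining"] ** 2
num2 = p_theta[1] * p_w[1]["text"] ** 2 * p_w[1]["mining"] ** 2

post1 = num1 / (num1 + num2)   # Bayes' rule
post2 = 1 - post1              # the constraint: posteriors sum to 1
```

As in the lecture, once one posterior is computed, the other follows for free from the sum-to-1 constraint.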

So now it's important that you note that in such a computation there

is a potential problem of underflow.

That's because, if you look at the original numerator and denominator, they involve the computation of a product of many small probabilities.

Imagine a document with many words: the product is going to be a very small value, and that can cause the problem of underflow.

So to solve the problem, we can use a normalizer. Here you see that we take an average of these two word distributions to compute an average distribution, called theta bar.

9:24

And this average distribution is comparable to each of these distributions in terms of magnitude.

So we can then divide the numerator and

the denominator both by this normalizer.

So basically this normalizes the probability of generating

this document by using this average word distribution.

So you can see the normalizer is here.

And since we have used exactly the same normalizer for the numerator and the denominator, the overall value of this expression is not changed. But by doing this normalization, we make the numerator and the denominator more manageable, in that neither value is going to be very small.

And thus we can avoid the underflow problem.
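A sketch of this normalization trick, assuming the same illustrative data structures as before (a list of word distributions and a dict of word counts):

```python
def posterior_with_normalizer(doc, pi, word_probs):
    """Divide numerator and denominator by the probability of the
    document under the average distribution theta_bar, so that
    intermediate values stay in a manageable range."""
    k = len(pi)
    # Average word distribution theta_bar over all k clusters
    vocab = set().union(*word_probs)
    avg = {w: sum(wp.get(w, 0.0) for wp in word_probs) / k for w in vocab}
    scores = []
    for i in range(k):
        s = pi[i]
        for w, c in doc.items():
            # Each ratio is of order 1, so the product does not underflow
            s *= (word_probs[i][w] / avg[w]) ** c
        scores.append(s)
    total = sum(scores)
    return [s / total for s in scores]
```

Since numerator and denominator are divided by the same quantity, the resulting posterior is identical to the direct computation.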

10:24

Sometimes we also take the logarithm of the product to convert it into a sum of log probabilities. This can help preserve precision as well, but in this case we cannot simply use a logarithm to solve the problem, because there is a sum in the denominator. This kind of normalizer, however, can be effective for solving the problem.

So it's a technique that is sometimes useful in other situations as well.

Now let's look at the M-Step.

So from the E-step we get our estimate of which distribution is more likely to have generated each document d. And you can see d1 is more likely generated from the first topic, whereas d2 is more likely from the second topic, etc.

Now, let's think about what we need to compute in the M-step. Well, basically we need to re-estimate all the parameters.

First, look at p of theta 1 and p of theta 2.

How do we estimate that?

Intuitively, you can just pool together these Z_d posterior probabilities from the E-step.

So if all of these documents say, well they're more likely from theta 1,

then we intuitively would give a higher probability to theta 1.

In this case, we can just take an average of these probabilities that you see here, and we obtain 0.6 for theta 1. So theta 1 is more likely than theta 2, and you can see the probability of theta 2 would naturally be 0.4.
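That averaging step is simple enough to show directly; the posterior values below are made up so that they produce the same 0.6 average mentioned in the lecture:

```python
# Hypothetical E-step posteriors p(Z_d = 1 | d) for five documents
# (made-up numbers, not the slide's):
z1 = [0.9, 0.8, 0.2, 0.7, 0.4]

p_theta1 = sum(z1) / len(z1)   # average of the posteriors
p_theta2 = 1 - p_theta1        # the two priors must sum to 1
```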

What about the word probabilities?

Well, we do the same, and the intuition is the same.

So we're going to see,

in order to estimate the probabilities of words in theta 1,

we're going to look at which documents have been generated from theta 1.

And we're going to pool together the words in those documents and normalize them.

So this is basically what I just said.

12:20

More specifically, we're going to use, for example, all the counts of "text" in these documents for estimating the probability of "text" given theta 1. But we're not going to use the raw counts, or total counts. Instead, we discount them by the probability that each document has been generated from theta 1.

This gives us fractional counts, and these counts are then normalized in order to get the probabilities.

Now, how do we normalize them? Well, the probabilities of these words must sum to 1.
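Here is a small numeric sketch of the fractional counts; the documents and posteriors are hypothetical, not the slide's values:

```python
# Fractional counts for theta_1: discount each document's raw word
# counts by p(Z_d = 1 | d) from the E-step (hypothetical numbers).
docs = [{"text": 2, "mining": 2},      # d1: mostly from theta_1
        {"medical": 2, "health": 2}]   # d2: mostly from theta_2
post1 = [0.99, 0.05]                   # p(Z_d = 1 | d) for d1, d2

vocab = ["text", "mining", "medical", "health"]
counts = {w: sum(d.get(w, 0) * p for d, p in zip(docs, post1))
          for w in vocab}
total = sum(counts.values())
p_w_theta1 = {w: c / total for w, c in counts.items()}  # sums to 1
```

The normalization by `total` is exactly the sum-to-1 constraint on the word distribution for theta 1.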

So, to summarize our discussion of generative models for clustering: we showed that a slight variation of a topic model can be used for clustering documents. This also shows the power of generative models in general.

By changing the generation assumption and changing the model slightly we can achieve

different goals, and we can capture different patterns and types of data.

So in this case, each cluster is represented by a unigram language model, a word distribution, and that is similar to a topic model.

So here you can see the word distribution actually generates a term cluster

as a by-product.

A document is generated by first choosing a unigram language model, and then generating all the words in the document using just that single language model. And this is, again, very different from a topic model, where we generate the words in a document using multiple unigram language models.

14:16

Now, if we want to partition documents into disjoint clusters, we can just force each document into the cluster corresponding to the word distribution that's most likely to have generated the document.

We've also shown that the EM algorithm can be used to compute the maximum likelihood estimate.

And in this case, we need to use a special normalization technique to avoid underflow.
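Putting the whole lecture together, a compact end-to-end sketch of this EM clustering procedure might look like the following. Everything here (the function name, the 0.5 smoothing constant, the document-based initialization) is an illustrative assumption rather than the lecture's exact recipe:

```python
import math

def em_cluster(docs, k, iters=50):
    """EM for the one-distribution-per-document mixture model.
    docs: list of dicts mapping word -> count."""
    vocab = sorted({w for d in docs for w in d})
    pi = [1.0 / k] * k
    # Initialize each cluster's word distribution from one document,
    # smoothed so every word has nonzero probability
    wp = []
    for i in range(k):
        seed = docs[i * len(docs) // k]
        raw = {w: seed.get(w, 0) + 0.5 for w in vocab}
        z = sum(raw.values())
        wp.append({w: v / z for w, v in raw.items()})
    for _ in range(iters):
        # E-step: posterior p(Z_d = i | d) by Bayes' rule
        post = []
        for d in docs:
            s = [pi[i] * math.prod(wp[i][w] ** c for w, c in d.items())
                 for i in range(k)]
            t = sum(s)
            post.append([x / t for x in s])
        # M-step: re-estimate p(theta_i) and p(w | theta_i)
        pi = [sum(p[i] for p in post) / len(docs) for i in range(k)]
        for i in range(k):
            counts = {w: sum(d.get(w, 0) * p[i] for d, p in zip(docs, post))
                      for w in vocab}
            t = sum(counts.values())
            wp[i] = {w: c / t for w, c in counts.items()}
    # Hard clustering: assign each document to its most likely cluster
    return [max(range(k), key=lambda i: p[i]) for p in post]
```

The final line implements the hard partitioning mentioned above: each document is forced into the cluster whose distribution most likely generated it.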
