0:22

So this is a slide that you have seen earlier,

where we discussed the problems with using a term as a topic.

So, to solve these problems intuitively we need to use

more words to describe the topic.

And this will address the problem of lack of expressive power.

When we have more words that we can use to describe the topic,

that we can describe complicated topics.

To address the second problem we need to introduce weights on words.

This is what allows you to distinguish subtle differences in topics, and

to introduce semantically related words in a fuzzy manner.

Finally, to solve the problem of word ambiguity, we need to split

ambiguous word, so that we can disambiguate its topic.

1:15

It turns out that all these can be done by using a probabilistic topic model.

And that's why we're going to spend a lot of lectures to talk about this topic.

So the basic idea here is that,

improve the replantation of topic as one distribution.

So what you see now is the older replantation.

Where we replanted each topic, it was just one word, or one term, or one phrase.

But now we're going to use a word distribution to describe the topic.

So here you see that for sports.

We're going to use the word distribution over

theoretical speaking all the words in our vocabulary.

1:54

So for example, the high probability words here are sports,

game, basketball, football, play, star, etc.

These are sports related terms.

And of course it would also give a non-zero probability to some other word

like Trouble which might be related to sports in general,

not so much related to topic.

2:48

You can also see, as a very special case, if the probability of the mass

is concentrated in entirely on just one word, it's sports.

And this basically degenerates to the symbol foundation

of a topic was just one word.

3:04

But as a distribution, this topic of representation can,

in general, involve many words to describe a topic and

can model several differences in semantics of a topic.

Similarly we can model Travel and Science with their respective distributions.

In the distribution for Travel we see top words like attraction, trip, flight etc.

3:31

Whereas in Science we see scientist, spaceship, telescope, or

genomics, and, you know, science related terms.

Now that doesn't mean sports related terms

will necessarily have zero probabilities for science.

In general we can imagine all of these words we have now zero probabilities.

It's just that for a particular topic in some words we have very,

very small probabilities.

3:58

Now you can also see there are some words that are shared by these topics.

When I say shared it just means even with some probability threshold,

you can still see one word occurring much more topics.

In this case I mark them in black.

So you can see travel, for example, occurred in all the three topics here, but

with different probabilities.

It has the highest probability for the Travel topic, 0.05.

But with much smaller probabilities for Sports and Science, which makes sense.

And similarly, you can see a Star also occurred in Sports and

Science with reasonably high probabilities.

Because they might be actually related to the two topics.

So with this replantation it addresses the three problems that I mentioned earlier.

First, it now uses multiple words to describe a topic.

So it allows us to describe a fairly complicated topics.

Second, it assigns weights to terms.

So now we can model several differences of semantics.

And you can bring in related words together to model a topic.

Third, because we have probabilities for the same word in different topics,

we can disintegrate the sense of word.

In the text to decode it's underlying topic,

to address all these three problems with this new way of representing a topic.

So now of course our problem definition has been refined just slightly.

The slight is very similar to what you've seen before except we have

added refinement for what our topic is.

Now each topic is word distribution, and for each word distribution we know

that all the probabilities should sum to one with all the words in the vocabulary.

So you see a constraint here.

And we still have another constraint on the topic coverage, namely pis.

So all the Pi sub ij's must sum to one for the same document.

5:59

So how do we solve this problem?

Well, let's look at this problem as a computation problem.

So we clearly specify it's input and

output and illustrate it here on this side.

Input of course is our text data.

C is our collection but we also generally assume we know the number of topics, k.

Or we hypothesize a number and then try to bind k topics,

even though we don't know the exact topics that exist in the collection.

And V is the vocabulary that has a set of words that determines what

units would be treated as the basic units for analysis.

In most cases we'll use words as the basis for analysis.

And that means each word is a unique.

7:18

Now of course there may be many different ways of solving this problem.

In theory, you can write the [INAUDIBLE] program to solve this problem,

but here we're going to introduce

a general way of solving this problem called a generative model.

And this is, in fact, a very general idea and

it's a principle way of using statistical modeling to solve text mining problems.

And here I dimmed the picture that you have seen before

in order to show the generation process.

So the idea of this approach is actually to first design a model for our data.

So we design a probabilistic model to model how the data are generated.

Of course, this is based on our assumption.

The actual data aren't necessarily generating this way.

So that gave us a probability distribution of the data

that you are seeing on this slide.

Given a particular model and parameters that are denoted by lambda.

So this template of actually consists of

all the parameters that we're interested in.

And these parameters in general will control the behavior of

the probability risk model.

Meaning that if you set these parameters with different values and

it will give some data points higher probabilities than others.

Now in this case of course, for our text mining problem or

more precisely topic mining problem we have the following plans.

First of all we have theta i's which is a word distribution snd then we have

a set of pis for each document.

And since we have n documents, so we have n sets of pis, and each set the pi up.

The pi values will sum to one.

So this is to say that we first would pretend we already

have these word distributions and the coverage numbers.

And then we can see how we can generate data by using such distributions.

So how do we model the data in this way?

And we assume that the data are actual symbols

drawn from such a model that depends on these parameters.

Now one interesting question here is to

9:32

think about how many parameters are there in total?

Now obviously we can already see n multiplied by K parameters.

For pi's.

We also see k theta i's.

But each theta i is actually a set of probability values, right?

It's a distribution of words.

So I leave this as an exercise for

you to figure out exactly how many parameters there are here.

Now once we set up the model then we can fit the model to our data.

Meaning that we can estimate the parameters or

infer the parameters based on the data.

In other words we would like to adjust these parameter values.

Until we give our data set the maximum probability.

I just said, depending on the parameter values,

some data points will have higher probabilities than others.

What we're interested in, here,

is what parameter values will give our data set the highest probability?

So I also illustrate the problem with a picture that you see here.

On the X axis I just illustrate lambda, the parameters,

as a one dimensional variable.

It's oversimplification, obviously, but it suffices to show the idea.

And the Y axis shows the probability of the data, observe.

This probability obviously depends on this setting of lambda.

So that's why it varies as you change the value of lambda.

What we're interested here is to find the lambda star.

11:10

So this would be, then, our estimate of the parameters.

And these parameters,

note that are precisely what we hoped to discover from text data.

So we'd treat these parameters as actually the outcome or

the output of the data mining algorithm.

So this is the general idea of using

a generative model for text mining.

First, we design a model with some parameter values to fit

the data as well as we can.

After we have fit the data, we will recover some parameter value.

We will use the specific parameter value And

those would be the output of the algorithm.

And we'll treat those as actually the discovered knowledge from text data.

By varying the model of course we can discover different knowledge.

So to summarize, we introduced a new way of representing topic,

namely representing as word distribution and this has the advantage of using

multiple words to describe a complicated topic.It also allow us to assign

weights on words so we have more than several variations of semantics.

We talked about the task of topic mining, and answers.

When we define a topic as distribution.

So the importer is a clashing of text articles and a number of topics and

a vocabulary set and the output is a set of topics.

Each is a word distribution and

also the coverage of all the topics in each document.

And these are formally represented by theta i's and pi i's.

And we have two constraints here for these parameters.

The first is the constraints on the worded distributions.

In each worded distribution the probability of all the words

must sum to 1, all the words in the vocabulary.

The second constraint is on the topic coverage in each document.

A document is not allowed to recover a topic outside of the set of topics that

we are discovering.

So, the coverage of each of these k topics would sum to one for a document.

We also introduce a general idea of using a generative model for text mining.

And the idea here is, first we're design a model to model the generation of data.

We simply assume that they are generative in this way.

And inside the model we embed some parameters that we're interested in

denoted by lambda.