which we're going to denote h, given the observed values y, which are my data
instances. This means that if you tell me the values that you observe, then the fact that something may or may not have been observed doesn't carry any additional information.
And this is a little bit of a tricky notion, so let's try and give an example.
Imagine that a patient comes into the doctor's office, and the doctor chooses what set of tests to perform. For example, the doctor chooses to perform or not perform, say, a chest X-ray.
The doctor probably didn't choose to perform a chest X-ray because the patient didn't come in with a deep cough or some other symptom that suggested tuberculosis or pneumonia, and therefore the test wasn't performed.
So the observation, or lack thereof, of a chest X-ray, the fact that a chest X-ray doesn't exist in my patient record, is probably an indication that the patient didn't have tuberculosis or pneumonia.
So these are not independent. In that model the missing at random assumption does not hold, because the observability pattern tells me something about the disease, which is the unobserved variable that I care about. On the other hand, suppose I have in my medical record things like the primary complaint that the patient came in with, for example, a broken leg.
Then, at that point, given that the primary complaint was a broken leg, I already know that the patient likely didn't have tuberculosis or pneumonia. And therefore, given that observed feature, the observed variable which is the primary complaint, the observability pattern no longer gives me any information about the variables that I didn't observe. And so that is the difference between a scenario that is missing at random and a scenario that isn't missing at random.
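To make the doctor example a bit more concrete, here is a minimal sketch with entirely made-up numbers of a tiny chain model: disease D influences the primary complaint C, and the doctor's decision O to order a chest X-ray depends only on C. Without conditioning on anything, the observability pattern O carries information about D; once we condition on the observed complaint C, it no longer does, which is exactly the missing at random condition.

```python
# A minimal sketch (hypothetical numbers) of the doctor example:
# D = has TB/pneumonia, C = primary complaint is a cough (observed),
# O = a chest x-ray was ordered (the observability pattern).
import itertools

p_D = {0: 0.95, 1: 0.05}                      # disease prior
p_C_given_D = {0: {0: 0.9, 1: 0.1},           # P(cough | no disease)
               1: {0: 0.2, 1: 0.8}}           # P(cough | disease)
p_O_given_C = {0: {0: 0.95, 1: 0.05},         # x-ray rarely ordered without a cough
               1: {0: 0.3, 1: 0.7}}           # usually ordered with a cough

joint = {(d, c, o): p_D[d] * p_C_given_D[d][c] * p_O_given_C[c][o]
         for d, c, o in itertools.product([0, 1], repeat=3)}

def marginal(keep):
    """Marginal over the variables named in `keep` (subset of 'dco')."""
    out = {}
    for (d, c, o), p in joint.items():
        key = tuple(v for v, k in zip((d, c, o), "dco") if k in keep)
        out[key] = out.get(key, 0.0) + p
    return out

# P(O=1) vs P(O=1 | D=1): different, so observability tells us about D (not MAR).
p_O1 = marginal("o")[(1,)]
p_O1_given_D1 = marginal("do")[(1, 1)] / marginal("d")[(1,)]
print(p_O1, p_O1_given_D1)

# Given the observed complaint C, O is independent of D (MAR holds given C):
p_O1_given_C1 = marginal("co")[(1, 1)] / marginal("c")[(1,)]
p_O1_given_C1_D1 = marginal("dco")[(1, 1, 1)] / marginal("dc")[(1, 1)]
print(p_O1_given_C1, p_O1_given_C1_D1)        # equal: O depends on D only through C
```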
For the purposes of our discussion we're going to make the missing at random assumption from here on.
What's the next complication with the case of incomplete data?
It turns out that the likelihood can have multiple global maxima.
So, intuitively, that's almost obvious. Because if you have a hidden variable that has two values, zero and one, the values zero and one don't mean anything.
We could rename them one and zero and just invert everything, and it would basically give us an exactly equivalent model to the one with zero and one, because the names don't mean anything.
And so that immediately means that I have a reflection of my likelihood function that occurs when I rename the values of the hidden variable.
And it turns out that this is not something that happens just in this case; when we have multiple hidden variables the problem only becomes worse, because the number of global maxima becomes exponentially large in the number of hidden variables.
And so now we have a function with exponentially many reflections of itself. It turns out that this can also occur when you have missing data, not just with hidden variables.
So even if all I have are data where only some occurrences of a variable are missing their values, even that can give me multiple local and global maxima.
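Here is a minimal sketch, with made-up parameter values, of the relabeling argument for a single hidden binary variable X with an observed child Y: swapping the names of X's two values gives a different parameter vector that assigns exactly the same likelihood to the observed data. With several hidden binary variables, each one can be relabeled independently, which is where the exponentially many equivalent global maxima come from.

```python
# A minimal sketch of the relabeling symmetry for a hidden X -> observed Y.
# Numbers are made up; only Y is ever observed.
import math

def log_lik(theta_x1, theta_y1_x0, theta_y1_x1, ys):
    """Incomplete-data log-likelihood when X is never observed."""
    ll = 0.0
    for y in ys:
        p_y1 = (1 - theta_x1) * theta_y1_x0 + theta_x1 * theta_y1_x1
        ll += math.log(p_y1 if y == 1 else 1 - p_y1)
    return ll

ys = [1, 0, 1, 1, 0, 1]                 # observed Y values only
theta = (0.3, 0.2, 0.9)                 # some parameter setting
swapped = (1 - 0.3, 0.9, 0.2)           # rename x0 <-> x1 everywhere

print(log_lik(*theta, ys), log_lik(*swapped, ys))   # identical values
```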
So, to understand that in a little more depth, let's go back to the comparison between the likelihood in the complete data case and the likelihood in the incomplete data case. So here is a simple model where I have two variables, X and Y, with X being a parent of Y.
And I have three instances. If we just go ahead and write down the complete-data likelihood, it turns out to have the following beautiful form, which we've already seen before: the product of probabilities for the three instances, where we've omitted writing the parameters for clarity. And that's going to be equal to the probability of the first instance, x0 y0, given the parameters, times the probabilities of the second and third instances.
And the point is that this ends up being a nice, decomposable function of the parameters, in terms of a product, which, if we take the log, ends up being a sum.
The likelihood decomposes by variables, and it decomposes within CPDs.
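Here is a minimal sketch of that decomposition for the X -> Y model, using hypothetical instances and parameter values: the complete-data log-likelihood splits into one sum of terms for the P(X) CPD and another for the P(Y given X) CPD, so each CPD's parameters could be estimated separately.

```python
# A minimal sketch of the complete-data likelihood for X -> Y with three
# hypothetical, fully observed instances (x, y).
import math

theta_x = {0: 0.6, 1: 0.4}                       # P(X)
theta_y_given_x = {0: {0: 0.7, 1: 0.3},          # P(Y | X=0)
                   1: {0: 0.2, 1: 0.8}}          # P(Y | X=1)

data = [(0, 0), (0, 1), (1, 0)]                  # fully observed (x, y) pairs

log_lik = sum(math.log(theta_x[x]) + math.log(theta_y_given_x[x][y])
              for x, y in data)

# The same value, regrouped per CPD: the log-likelihood decomposes into
# separate terms for the P(X) CPD and the P(Y | X) CPD.
log_lik_x = sum(math.log(theta_x[x]) for x, _ in data)
log_lik_y = sum(math.log(theta_y_given_x[x][y]) for x, y in data)
print(log_lik, log_lik_x + log_lik_y)            # identical
```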
What about the incomplete data case? Let's make our life a little bit more complicated: whereas before we had these complete instances, now notice that both of these instances have an incomplete observation regarding the variable X.
And now let's write down the likelihood function in this case.
Well, the likelihood function is now the probability of y0, which is the first data instance, times the probability of x0, y1, which is the second data instance, times another probability of y0. Since P(y0) appears twice, we've squared this term over here.
And the probability of y0 is the sum over x of the probability of x, y0; that is, you have to consider both possible ways of completing the data, for the different values of x: x0 and x1.
And so if we unravel this expression inside the parentheses, it ends up looking like this: theta x0 times theta y0 given x0, plus theta x1 times theta y0 given x1.
And the important observation about this expression is that it is not a product of parameters in the model, which means we cannot take its log and have it decompose over the parameters, because the log of a summation doesn't decompose.
And so that means that our nice decomposition properties of the likelihood function have disappeared in the case of incomplete data.
It does not decompose by variables: notice that we have a theta for the X variable sitting in the same expression as an entry from the P(Y given X) CPD. It does not decompose within CPDs. And even computing this likelihood function actually requires that we do a sum-product computation, so it requires effectively a form of probabilistic inference.
probabilistic inference. So what does that imply, both of these
properties that we talked about in the previous slides?
What does that imply about the likelihood function?
Before, our likelihood function had the form of these gray lines over here; so, for example, like this. This is a likelihood function for a complete data scenario.
When we have a case of incomplete data, we're effectively summing up the probability of all possible completions of the unobserved variables, and so the overall likelihood function ends up being a summation of likelihood functions that correspond to the different ways that I had to complete the data. So this ends up being a sum, with this as one such summand.
So the likelihood function ends up being a sum of these nice concave, well, log-concave, likelihood functions, but the point is that when you add them all up, it doesn't look so nice at all.
It ends up having multiple modes, and it's very much harder to deal with.
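As a quick sanity check of that summation view, here is a minimal sketch, with made-up parameter values, verifying on the same three-instance example that the incomplete-data likelihood equals the sum of the complete-data likelihoods over all four ways of filling in the two missing values of X.

```python
# A minimal sketch checking the summation-over-completions view: with X
# missing in instances 1 and 3, there are 2 * 2 = 4 completions, and the
# incomplete-data likelihood is the sum of the 4 complete-data likelihoods.
import itertools, math

theta_x = {0: 0.6, 1: 0.4}
theta_y_given_x = {0: {0: 0.7, 1: 0.3},
                   1: {0: 0.2, 1: 0.8}}

def complete_lik(xs, ys):
    """Complete-data likelihood of fully specified (x, y) instances."""
    return math.prod(theta_x[x] * theta_y_given_x[x][y] for x, y in zip(xs, ys))

ys = [0, 1, 0]                                   # observed Y values
# Sum the complete-data likelihood over all completions of the two missing X's
# (the second instance has X = x0 observed).
lik_sum = sum(complete_lik([x1, 0, x3], ys)
              for x1, x3 in itertools.product([0, 1], repeat=2))

# Direct incomplete-data likelihood: P(y0)^2 * P(x0, y1).
p_y0 = sum(theta_x[x] * theta_y_given_x[x][0] for x in (0, 1))
lik_direct = p_y0 ** 2 * theta_x[0] * theta_y_given_x[0][1]
print(lik_sum, lik_direct)                       # identical
```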
The second problem that we have, in addition to multimodality, is the fact that the parameters start being correlated with each other.
So if you remember, when we were doing the case of complete data, we had the likelihood function being composed as a product of little likelihoods for the different parameters. What happens when we have an incomplete data scenario?
data scenario? So, when you look at this, you can see,
for example, that when X is not observed. So, when X is not observed.