0:00

We defined the notion of Markov chain Monte Carlo sampling as a general paradigm for generating samples from a distribution for which it is otherwise difficult, perhaps intractable, to generate samples directly. Markov chains give us a general way of approaching this problem, but the framework leaves open the question of where the Markov chain comes from.

That is, how do we design a Markov chain that has the desired stationary distribution? We talked about the Gibbs chain as a general solution to this problem in the context of graphical models, but we also mentioned that the Gibbs chain has limitations in terms of its convergence rates for certain types of graphical models.

So what happens if we have a graphical model for which the Gibbs chain doesn't have good convergence properties? How do we design a Markov chain for that? Well, it turns out that there's a general class of methods called Metropolis-Hastings, and the Metropolis-Hastings algorithm gives us a general approach for designing a Markov chain with a desired stationary distribution.

In fact, for a given stationary distribution, we can construct a whole family of different Markov chains that explore the space differently, and then try to select among those, or construct one among those, that has good convergence properties for our distribution. The key insight behind the Metropolis-Hastings method is the notion of a reversible chain.

So what does it mean for a chain to be reversible? Imagine that we have a chain with a particular stationary distribution π, and we're going to consider two different experiments. In the first experiment, we pick a point in the state space according to the probability distribution π; say we land on this one. Then we pick a random edge emanating from the state that we picked, according to the transition model T defined by the chain. So say we pick this edge; that's the first experiment.

The second experiment is exactly the same thing; we're basically doing this experiment twice. So again we pick a state, and once again we pick an outgoing transition from that state. A chain is reversible if, when I do this experiment, the probability of traversing this edge, this red edge over here, is exactly the same as the probability of traversing this green edge in the other direction. That is, if picking i and then traversing to j occurs with the same probability as picking j and then traversing to i.

Written in mathematical notation, this tells us that π(x), where x is the red state, times the transition probability along the red edge, is equal to the probability of x', which was my green experiment, times the probability of transitioning in the opposite direction. Now, it's very easy to construct Markov chains that are not reversible, so this is by no means a universal property of Markov chains. But it turns out that reversible Markov chains have elegant properties that allow them to be constructed so as to give rise to particular stationary distributions. And specifically, we can show, and it's not a very difficult proof, we'll show it in a moment, the following property of this equation, which is called the detailed balance equation.
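In symbols, for a chain with transition model T and distribution π, the detailed balance equation reads:

```latex
\pi(x)\, T(x \to x') \;=\; \pi(x')\, T(x' \to x) \qquad \text{for all } x, x'.
```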

Â 3:44

If this equation holds and T is a regular Markov chain, then T has a unique stationary distribution, which we know from the fact that it's regular; but furthermore, that stationary distribution is π. Let's go ahead and prove this.

And it turns out to be actually a very simple, one-line proof. Here, we take the two sides of the detailed balance equation, and we put one of them on this side and the other one on that side. Because the two sides are equal, if we sum up over x, then the two summations are also equal. And now we notice, by looking at the right-hand side, that π(x') doesn't depend on the summation variable x. So we can move π(x') out of the summation, and this can now be rewritten as π(x') times the sum over x of T(x' → x). Okay? And this summation over here, because T is a probability distribution over the edges that are outgoing from x', is necessarily equal to one.
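Written out, the one-line derivation is:

```latex
\sum_x \pi(x)\, T(x \to x')
  \;=\; \sum_x \pi(x')\, T(x' \to x)
  \;=\; \pi(x') \sum_x T(x' \to x)
  \;=\; \pi(x').
```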

And so, rewriting what we just proved in this one-line derivation: we've shown that the sum over x of π(x) T(x → x') is equal to π(x'), which is exactly the definition of the stationary distribution. So this is a one-line proof that a chain that is reversible relative to a particular distribution π necessarily has π as a stationary distribution. Now, why is this a useful transformation, between the definition of the stationary distribution π in terms of the equation down at the bottom and this alternative characterization? Because the definition down at the bottom involves a summation over what is, in general, an exponentially large space, whereas this one is, as we'll see in a minute, much more constructive, in that it allows us to construct T so as to match π. And that is exactly the idea behind the definition of the Metropolis-Hastings chain.

So how does the Metropolis-Hastings chain work? It starts out by saying, well, we want to move around broadly in the state space. And so we're going to have a distribution Q, which is kind of like a transition model; this is my proposal distribution Q. Q is going to roam freely over the state space. Unlike the Gibbs chain, it's not necessarily going to take only little local steps from one state to a state that is very nearby; it can look very far away and go into very distant parts of the state space. Now, that by itself, of course, is just going to give me a stationary distribution that has absolutely no relationship to the stationary distribution π that I care about, and so what I'm going to have is a critic. The critic is the guy who says, whoa, wait a minute, you can't go there right now, because that's not going to give you the right stationary distribution. So what does the critic do? The critic listens to the proposal that was made by the proposal distribution Q. So it says, oh, you want to go from x to x'? Well, what is the probability with which I'm going to let you do that? That's the acceptance probability, and that's a binary random variable for each x and x': what is the probability that we accept that proposal?

So that gives rise to the following process. I haven't told you how to pick Q or A yet, but let's just understand the process. At each state x, we sample a potential successor state x' from my proposal distribution Q; so we have a proposal to go from x to x', and I've denoted this proposal with a line here. And now the acceptor says: do I accept that? If the proposal is accepted, then I actually make that transition, and if the proposal is rejected, I just stay where I am.
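As a sketch in code, one step of this propose-then-accept process might look as follows; here `propose` and `accept_prob` are assumed callables standing in for Q and A:

```python
import random

def mh_step(x, propose, accept_prob, rng=random):
    """One Metropolis-Hastings transition from state x.

    propose(x) samples a candidate x' from the proposal distribution Q(x -> .);
    accept_prob(x, x_new) returns the acceptance probability A(x -> x').
    """
    x_new = propose(x)                        # the proposer suggests a move
    if rng.random() < accept_prob(x, x_new):
        return x_new                          # the critic accepts: make the move
    return x                                  # the critic rejects: stay where we are
```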

So what transition model does that give rise to? Because, ultimately, this just gives us a transition model from x to x'. There are two cases. Case number one is when x' is not x, that is, x' is a state other than x. In this case, the only chance that I have of moving to x' is if I proposed to move to x' and that proposal was accepted. So I need to multiply these two probabilities, and that gives me this expression for the transition. What about the transition from x to x? Well, there are two different cases: either I propose a move to x and it was accepted, or x also gets the benefit of all of the transitions that were proposed but rejected, where the move wasn't accepted. And so what we have here is Q(x → x), which we assume is always accepted; that's one of the conventions in this model. And then we sum up, over all possible other transitions, the probability that they were proposed times the probability that they were rejected, which is one minus the probability that they were accepted.
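Written out, the two cases of the resulting transition model are:

```latex
T(x \to x') = Q(x \to x')\, A(x \to x') \quad (x' \neq x), \qquad
T(x \to x) = Q(x \to x) + \sum_{x' \neq x} Q(x \to x')\,\bigl(1 - A(x \to x')\bigr).
```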

That gives rise to a transition model. So this is a general definition of a Markov chain, but as stated, there is no reason to expect it to have a particular stationary distribution. And so now this is where we bring back the intuition from the detailed balance equation, and we're going to use detailed balance to construct an acceptance probability that has the property that we want. So here is the detailed balance equation rewritten, and we're going to assume in this particular example that x is not equal to x', because the case x = x' is trivial: the equality π(x) T(x → x) = π(x) T(x → x) always holds. So let's look at the case where x is not equal to x'.

10:50

Our goal is to construct A such that detailed balance holds. The setup is that I'm given a proposal distribution Q, and now I'm going to pick A so as to make detailed balance hold. And if detailed balance holds for this chain and π, then that is going to guarantee that the process overall has the correct stationary distribution.

Â And, now, let's go ahead and just define this as a constraint on the acceptance

Â probability. So we have acceptance probabilities A

Â occurring on both sides of the equation. So I'm going to divide, move both As to

Â one side and move everything else to the other side.

Â And so I get a constraint on the ratios between the acceptance probability from x

Â to x prime, and the acceptance probability from x prime to x,

Â and that has to be equal to this ratio over here.

And I'm going to assume, without loss of generality, that this ratio is equal to some value ρ that is less than one. Notice that we can set this up in any way that we want: we can consider either the ratio from x to x' or from x' to x as the numerator; I picked this one. So now we need to pick the values of the two acceptance probabilities, for x to x' and for x' to x, so as to make this equality hold, and one simple way to do that is to take the smaller of the two, which we assumed was the numerator, and make A(x → x') equal to ρ and A(x' → x) equal to one, and that immediately gives me that this equality holds.

Now, of course, we could have picked these probabilities otherwise. We could have picked, for example, A(x → x') to be half of ρ and A(x' → x) to be one half, but notice what that would do: it would just reduce the probability that our proposals are accepted, which would make the chain move half as quickly, because you'd end up staying at the same state twice as often as you would if the acceptance probability were higher. So these are in fact the highest values that we could possibly pick while still making this ratio constraint hold.
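Putting the two cases together gives the familiar Metropolis-Hastings acceptance rule, A(x → x') = min(1, π(x') Q(x' → x) / (π(x) Q(x → x'))). A minimal sketch, where `pi` is an (unnormalized) density and `q(a, b)` is the proposal probability of b given a:

```python
def mh_acceptance(pi, q, x, x_new):
    """Metropolis-Hastings acceptance probability
    A(x -> x') = min(1, pi(x') q(x' -> x) / (pi(x) q(x -> x'))).
    pi may be unnormalized: the normalizing constant cancels in the ratio."""
    return min(1.0, (pi(x_new) * q(x_new, x)) / (pi(x) * q(x, x_new)))
```

Note that π only ever appears as a ratio here, which is exactly why an intractable partition function is not a problem.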

Â 13:46

And so, when you actually put it together as a general formula for the acceptance probability, it gives us this expression, which, if you look at it, you will see gives the right values both for x to x' and for x' to x. So that solves the problem of picking A given Q. And the question remains: what about Q? How do I pick Q?

Well, that turns out to be the $64 million question. Picking Q is an art, and it's not an easy choice in many applications. But let's think about some conditions that Q must satisfy before we think about how to pick one.

One condition on Q is derived simply by looking at the form of the expression for A. We can see that it involves a ratio, and if you have a ratio, then one condition is that you'd better not have a denominator that's equal to zero. And so one of the things we must guarantee regarding Q, in order for A to be a valid set of acceptance probabilities, is that Q itself must be reversible in the following sense: if the numerator is greater than zero, then the denominator is greater than zero, and vice versa. That is, if you can propose a step from x to x', you'd better be able to propose a step from x' to x with nonzero probability, which guarantees that this ratio is always well-defined. Now, the problem with these constraints on Q is that they actually induce a tension in the design of Q.
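In a small discrete state space, the support condition just described can be checked directly; a sketch, where `q(a, b)` is the proposal probability of b from a:

```python
def q_support_is_reversible(q, states):
    """Check the validity condition on a proposal distribution Q:
    Q(x -> x') > 0 if and only if Q(x' -> x) > 0, which guarantees that
    the acceptance ratio's denominator is nonzero whenever its numerator is."""
    return all((q(x, y) > 0) == (q(y, x) > 0)
               for x in states for y in states if x != y)
```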

On the one hand, you'd like Q to be very broad. So if you were at this state over here, you'd like Q to go really far away from the current state, because the further away you can go, the faster you move around in the state space. You don't want to stay local; you don't want to make the same mistake that the Gibbs chain does by staying very near the current assignment. You want to move around quickly, and that improves mixing.

On the other hand, look at the implications that this formula has when the proposal distribution is very broad, so that you have one state with a low value of π and another state with a very high value of π, which is what you get when you move from a hilly part of the space to a low part of the space: you have very big differences in the stationary probabilities of the two states. So what I'm drawing here is the height of π, and this is my state space x. If I move around very far, I'm liable to get into situations where π of one state x might be considerably larger than π of x'. And if you think about what that implies for the acceptance probability, the acceptance probability can get very, very low, which in turn implies slow mixing again, because you might be taking very global steps, but you're taking very few of them. And so this is a real tension in the design of proposal distributions Q.

Let's now look at an example of a Markov chain where the proposal distribution can make a very big difference. This is a problem that you wouldn't necessarily immediately think of as a graphical model, but it can be formulated as one: it's a matching problem, and we'll see it in other settings as well.

Here we have a bunch of red dots, as we see over here, and a bunch of blue dots, and we want to match the red dots to the blue dots, and there's some sort of preference function. So these edges over here, which I'm going to mark in green, are annotated with some kind of weights that tell us how happy this red dot is to be matched with this blue dot. And this happens a lot, for example, in various correspondence problems where we're trying to match sensor readings to objects, for example in a tracking application. It also happens in image problems where you want to match features in one image to features in another, so it's a very common problem. One formulation of the matching problem as a graphical model is that we have a variable X_i for each red dot. So the red dots are going to be the variables, the blue dots are going to represent values, and we're going to have X_i = j if red dot i is matched to blue dot j.

That can now be written as a probability distribution over this space. So we can consider the probability of an assignment of values to variables, and we're going to say, first of all, that it has to be a matching. If we don't have that every red dot is matched to a different blue dot, that's a probability-zero assignment; this would be the case if you had two red dots matched to the same blue dot. Otherwise, the probability is going to be an exponential of the sum of the qualities of the matches between the red dots and the blue dots. So we're summing up the quality of the edges that participate in the matching. For example, if this is my matching over here, where for each red dot we've assigned a blue dot as its value, then the overall quality is the total sum of the qualities of the matched edges, and that can be exponentiated to define a probability distribution.

We're going to represent these qualities visually in this example as distances: we assume, for example, that this red dot prefers to be matched to this blue dot more than it prefers to be matched to this further-away blue dot. Now, one could imagine running Gibbs sampling on this model by taking a variable, say one of the red dots, and picking a new value for it, that is, a new blue dot that it wasn't matched to. This is going to change things very, very slowly, because most of the blue dots are going to be taken by a different red dot. And so those assignments, where a red dot is matched to a previously taken blue dot, are going to have very low probability, zero probability in fact, and so those moves are going to be impossible. So the only thing I can do is make the red dot match one of the unmatched blue dots, and that means it's going to take a very long time for things to change.

Here is a different way of doing this, which is one way of constructing a proposal distribution and using Metropolis-Hastings. It's called an augmenting path proposal distribution. So what that does is the following: let's assume that this is my current matching, and I'm going to define the following proposal.

I'm going to pick one of the variables, X_i, say this one, and I'm going to pick another value for it, pretending that everything is available. So I'm going to ignore the fact that these other dots are already matched; say I pick this one. I'm picking it probabilistically, so I could have picked a different one; I could have picked the same one that I was matched to before; I could have picked a further-away one. Let's say I pick this one. Well, now this red guy over here, poor guy, is unmatched, so he has to pick a new partner; say he picks this one. Now this other red guy is unmatched, and say he picks this one. That leaves me with this last red guy, and say he picks this one. Notice that now I have a new legal matching: I've closed the loop and defined what's called an alternating cycle, where basically I can switch all of the black edges to green edges, and that gives me a new legal matching. And that's my proposal. With a little bit of numerical calculation, one can figure out the acceptance probability for this proposal distribution, and it turns out to be actually a really good proposal distribution. So let me show you some examples on the next slide.

These are the results on this model if I just apply Gibbs sampling. You can see over here four chains; the four colors represent the four chains. On the x-axis is the number of steps, and on the y-axis is some statistic that I'm tracking in order to gauge whether the chain is mixing. As it happens, it's the probability that the first red dot is matched to the first blue dot, but it doesn't matter; any statistic would have illustrated the point. And you can see that there are a lot of long-range fluctuations in the chain, that is, the probabilities change over time. The probabilities, by the way, are computed over windows of size 500, from the samples in that window. And you can see that the chain takes a long time to move from one region of the space to another. The proposal distribution that I just showed you, by contrast, is totally different.
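The augmenting-path proposal that produced these results can be sketched roughly as follows. This is a simplified version, under the assumption that displaced red dots repick uniformly at random, rather than according to the edge weights as one would in practice:

```python
import random

def augmenting_path_proposal(matching, rng=random):
    """Propose a new perfect matching by rotating an alternating cycle.

    matching[i] = blue dot currently assigned to red dot i.
    One red dot is freed and repicks a blue dot pretending all are
    available; any displaced red dot repicks in turn, until the cycle
    closes on the originally freed blue dot, yielding a legal matching."""
    n = len(matching)
    new = list(matching)
    owner = {b: r for r, b in enumerate(new)}   # blue dot -> red dot
    i = rng.randrange(n)                        # red dot to rematch
    freed = new[i]                              # its old partner is now free
    del owner[freed]
    while True:
        j = rng.randrange(n)                    # pick any blue dot
        new[i] = j
        if j == freed:                          # cycle closed: legal again
            owner[j] = i
            return new
        i, owner[j] = owner[j], i               # displace that blue dot's owner
```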

You can see that the probabilities are almost totally flat, all four chains are almost the same, and there is very little change across windows over time.

And there's an even better proposal distribution, which I didn't show you, that modifies the augmenting path by just a little bit and gives you effectively perfect mixing across the entire run. If we look at the other metrics that we defined for evaluating mixing, we can compare the estimates of different statistics across two chains; so, for example, this might be the red chain and this might be the green chain. What we see here are the estimates that you get for different kinds of probabilities, for example the probability that this dot is matched to that dot. And you can see that for the Gibbs chains, the estimates you get are very noisy, and the different chains give you divergent estimates. The first of the Metropolis-Hastings proposals gives you points that are almost on the diagonal, and here, with the second, the points are effectively exactly on the diagonal: perfect mixing.

But to summarize: Metropolis-Hastings is a very general framework for building Markov chains that are designed to have a particular stationary distribution. It requires that you come up with a proposal distribution, and that proposal distribution needs to have certain properties in order to be valid. But once you give me such a proposal distribution, the detailed balance equation, which is this nice simple equation, tells me how I can construct the acceptance distribution so as to enforce that we get the right stationary distribution.

So this is great in some ways, because it gives us huge flexibility in designing proposal distributions that explore the space quickly and move around to far-away parts of the space. But as we saw, picking the proposal distribution is actually an important design point, and it makes a big difference to performance. And there isn't really a science of how to do this effectively, so this is something that is really an art and requires a nontrivial amount of thought and engineering.
