We defined the notion of a general Markov chain. We talked about how certain conditions guarantee that if we sample from the chain, we converge to a stationary distribution. We then described how, once we generate samples from the Markov chain, we can use them to estimate properties of that stationary distribution, like various statistics we might care about. But the question of where the Markov chain might come from was left very vague. It turns out that for graphical models, people have constructed a general-purpose sampler that can be used to sample from any graphical model, and that class of chains is called the Gibbs chain. So now we're going to talk about the Gibbs chain and how to use it in the context of graphical models. So, our target distribution, the one from which we would like to sample, is the good old Gibbs distribution, P_Phi of X1 up to Xn. And just as a reminder, P_Phi can equally well represent a Gibbs distribution that we got from a product of undirected factors, a Markov-network-style distribution. Or it could just as well represent the set of factors of a Bayesian network, reduced by evidence. And so that is the factor set Phi, and everything that I'm going to say from now on is agnostic to where those factors Phi came from, whether a directed or an undirected model. So now our Markov chain state space is the set of complete assignments, little x, to the set of variables big X. Note that we're sampling over an exponentially large state space: the space of all possible assignments to the random variables in a graphical model. So those little circles that we saw when we were talking about Markov chains, where we were going from one state to another, each of those is now a complete assignment. It turns out that the Gibbs chain is actually a very simple and easy-to-understand Markov chain. Here's what it does, assuming that I have a starting state, little x.
What I'm going to do is take each variable in turn, using some arbitrary ordering, and sample a new value for that variable given the values of all the rest. So let me introduce this notation: x minus i is an assignment to all of x1 up to xn except xi. So for example, if we had three random variables X1, X2, and X3, we would sample X1 given X2 and X3, then X2 given X1 and X3, and then X3 given X1 and X2. Right, so one at a time. Okay, so let me draw that out in order to make that clear. Here's my current assignment; say the variables are X1, X2, X3, and my initial assignment is (0, 0, 0). So I'm going to start by sampling X1 from the probability of X1 given X2 = 0 and X3 = 0, and let's say I end up with 1; the other two haven't changed yet. I now sample from P of X2 given X1 equal to the new value 1 and X3 = 0, and say that stays 0. And then I sample X3 from the probability of X3 given X1 = 1 and X2 = 0, and say it changes to the value 1. So now my new assignment is (1, 0, 1), which gets instantiated as my next state x'. That is a single step of the Gibbs chain: it goes and potentially modifies (it doesn't have to modify, because you might end up staying in the same assignment) every single variable, resampled in its turn. And when all of them are done, that's my new state. So let's do this example in the context of a real graphical model. Here is the good old student Bayesian network, and I'm assuming that I have observed values for two of the variables: L = l0 and S = s1.
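The sweep just described, resample each variable in turn from its conditional given everything else, can be sketched in a few lines of Python. This is a toy sketch in which the conditional is computed by brute force from a hypothetical full joint table over three binary variables; the whole point of what follows in the lecture is that in a graphical model we never need the full joint, but for three variables it makes the mechanics concrete:

```python
import random

# Hypothetical unnormalized joint over three binary variables, as a dict
# mapping assignments (x1, x2, x3) -> nonnegative weight (values illustrative).
joint = {
    (0, 0, 0): 1.0, (0, 0, 1): 2.0, (0, 1, 0): 1.0, (0, 1, 1): 4.0,
    (1, 0, 0): 3.0, (1, 0, 1): 1.0, (1, 1, 0): 2.0, (1, 1, 1): 1.0,
}

def sample_conditional(i, x, rng):
    """Sample X_i given the current values of all the other variables."""
    # Unnormalized weight for each candidate value of X_i, others held fixed.
    weights = []
    for v in (0, 1):
        cand = list(x)
        cand[i] = v
        weights.append(joint[tuple(cand)])
    z = sum(weights)              # normalizer: a sum over X_i's values only
    r = rng.random() * z
    return 0 if r < weights[0] else 1

def gibbs_sweep(x, rng):
    """One Gibbs step: resample each variable in turn, updating in place."""
    for i in range(len(x)):
        x[i] = sample_conditional(i, x, rng)
    return x

rng = random.Random(0)
x = [0, 0, 0]                     # arbitrary starting assignment
for _ in range(5):                # five steps of the Gibbs chain
    x = gibbs_sweep(x, rng)
```

Note that each variable is resampled against the already-updated values of the variables before it in the ordering, exactly as in the (0,0,0) to (1,0,1) walkthrough above.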
And so now I have a product of reduced factors that represent the evidence I received, and only three variables are getting sampled, because the other two are fixed to their observed values. So now I am sampling from P tilde Phi of D, I, G given little s1 and l0. And what I'm going to do in this sampling process is start out with some arbitrary assignment to these three variables D, I, and G. So, for example, since it's arbitrary, d0, i0, g1 (it couldn't be g0, because G only takes the values g1, g2, g3). And now I'm going to go ahead and sample each of these in turn. So I'm going to initially sample D from the probability of D given i0, g1, l0, and s1 (and we'll talk about how that's done in just a moment), and that might give me d1. Now I'll continue and sample I, from the probability of I given d1, g1, l0, s1, and say I end up with i1. Finally I sample G, from the probability of G given d1, i1 (that's where I am right now), l0, and s1, and say I get g3 for the new value. So my new assignment is d1, i1, g3. So how do I do this in a tractable way? You might say: you're dealing with this intractable distribution, and you've just moved the problem from the original problem to computing these conditional probabilities. Why is that any easier? Well, it turns out it is. So let's look at the sampling step over here, where I sample xi from this conditional distribution, and let's break that down into its constituent definitions. This conditional distribution is a ratio
between the joint distribution of the value xi together with the stuff on the right-hand side of the conditioning bar, divided by the stuff on the right-hand side of the conditioning bar alone. Now, importantly, because each of these is an unnormalized measure divided by the partition function, the partition function cancels: if I had a 1/Z here in the numerator and the denominator, the 1/Z cancels, and I end up with a ratio of two unnormalized measures. The point being that the numerator is now a complete assignment, and therefore simply a product of factors. So let's look at that in the context of a concrete example and see how that might help us. Now we're doing an example that's a Markov network, in order to demonstrate that this is equally applicable to both. This is our good old misconception network, where we have four factors phi1, phi2, phi3, and phi4. And let's imagine that we're trying to compute P_Phi of A given B, C, and D. What we have here is P tilde Phi of A, B, C, D divided by the sum over A' of P tilde Phi of A', B, C, D. (I'm using A' to distinguish it from the A in the numerator, just to avoid ambiguity.) Now, since each of these unnormalized measures is a product of the factors, I'm going to expand the product of factors in the numerator and the denominator separately. So over here I end up with phi1(A,B) times phi2(B,C) times phi3(C,D) times phi4(A,D), and I end up with exactly the same expression in the denominator, except that I sum out over A'. And what do we see by looking at this expression? Well, some things immediately jump out: phi2(B,C) appears in both the numerator and the denominator, and therefore I can cancel it, and similarly phi3(C,D). And so what I end up with here is a product of these two factors, phi1(A,B) and phi4(A,D), which are the two factors that involve A.
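Written out symbolically, the cancellation just described for the misconception network is (with tilde-P denoting the unnormalized measure):

```latex
P_\Phi(A \mid b, c, d)
  = \frac{\tilde{P}_\Phi(A, b, c, d)}{\sum_{a'} \tilde{P}_\Phi(a', b, c, d)}
  = \frac{\phi_1(A, b)\,\phi_2(b, c)\,\phi_3(c, d)\,\phi_4(A, d)}
         {\sum_{a'} \phi_1(a', b)\,\phi_2(b, c)\,\phi_3(c, d)\,\phi_4(a', d)}
  = \frac{\phi_1(A, b)\,\phi_4(A, d)}{\sum_{a'} \phi_1(a', b)\,\phi_4(a', d)}
```

The 1/Z cancels in the first ratio because it appears in both numerator and denominator, and phi2(b,c) and phi3(c,d) cancel in the last step because neither depends on A', so they factor out of the sum.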
Well, that's the numerator; what about the denominator? The denominator is just our good old normalizing constant. So really what we have here is that this is proportional to phi1 of A, little b, times phi4 of A, little d. So this is a factor over A that I can compute by multiplying two factors, singleton factors over A in this case. I multiply them together and renormalize, and I get a distribution over A from which I can sample, and we already talked about how to sample from a discrete distribution. So what is the computational cost of this algorithm? The sampling step for Xi given all of the others is simply this expression over here, which is proportional to a product of factors. And not just any old product of factors, but only the factors that involve Xi, that is, the factors that have Xi in their scope. Which means that I only care about Xi and its neighbors in the graph, which, for a large graph with tens of thousands of nodes, can be a considerable savings, okay? So only Xi and its neighbors. So, do we like the Gibbs chain? We've certainly defined it. Does it have the properties that we would like it to have? Specifically, is it a regular chain, so that we have guaranteed convergence to a unique stationary distribution? The answer is, unfortunately, not always. And to see that, we're going to return to our old enemy, the exclusive-or network, which, as I've said in other examples, is the counterexample to pretty much everything. So let's look at the exclusive-or network. Here we have Y, which is the exclusive or of X1 and X2, and let's imagine that I observe Y = 1, and so these two entries disappear. And now I'm going to try to sample X1 and X2, and if I were so lucky, each of the two possible states for X1 and X2 would occur with probability one half. So let's imagine that I start out with this one.
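Here is a minimal sketch of that local computation for resampling A in the misconception network. The factor tables are made-up illustrative values; the point is that only phi1 and phi4, the factors with A in their scope, are ever touched, and c never enters at all because phi2(b,c) and phi3(c,d) cancel:

```python
import random

# Hypothetical factor tables over binary variables (values are illustrative).
phi1 = {(0, 0): 30.0, (0, 1): 5.0, (1, 0): 1.0, (1, 1): 10.0}    # phi1(A, B)
phi4 = {(0, 0): 100.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 100.0}  # phi4(A, D)

def sample_A_given_rest(b, d, rng):
    """Sample A from P(A | b, c, d): multiply only the factors mentioning A,
    renormalize over A, then draw from the resulting discrete distribution."""
    weights = [phi1[(a, b)] * phi4[(a, d)] for a in (0, 1)]
    z = sum(weights)                 # local normalizer: sum over values of A
    r = rng.random() * z
    return 0 if r < weights[0] else 1

rng = random.Random(1)
a = sample_A_given_rest(b=0, d=0, rng=rng)
```

With these particular numbers, the weight for A = 0 given b = 0, d = 0 is 30 * 100 = 3000 versus 1 * 1 = 1 for A = 1, so the sample is almost always 0: a peaked conditional, which foreshadows the mixing problems discussed next.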
And I'm going to now sample X1 given X2 and Y. So my initial state is (0, 1) for X1 and X2, and of course Y = 1 is my evidence. And now I'm going to sample X1 given X2 and Y. That is a deterministic dependence, and the only possible value I can get is 0 for X1. Well, am I going to have any better luck with X2? Nope. I'm going to end up with exactly the same (0, 1) that I started out with. And I can do the exact same thing again, and you know what? It'll work the same way every single time, and I will always end up with (0, 1). Conversely, if I start out with (1, 0), I will never leave (1, 0). This is a classical example of a non-mixing chain. On the other hand, if all factors are positive, that is, if there are no zero entries in any of the factors, then you can show that the Gibbs chain is regular. However, it's important to realize that regular doesn't mean good. I can make this chain regular by adding a tiny probability epsilon that Y is not the exact exclusive or of X1 and X2. Now the chain is regular, because all possibilities have nonzero probability, but it's still going to mix extremely slowly for small values of epsilon. So remember, regularity is a very weak condition: it says that if you sample long enough, you will eventually converge, but it doesn't mean that will happen in a reasonable time frame. And one such example of a very slow-to-mix chain actually occurs in practical applications. So here's one example, from our image segmentation domain. Imagine that we're trying to resample, say, the class of this superpixel here, and that we have some fairly strong potentials telling us that the class of this superpixel should be very highly correlated with the class of its adjacent superpixels, because their appearance is very similar to each other.
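The stuck behavior is easy to verify directly. Below is a small sketch of Gibbs sampling on the XOR network with Y = 1 observed: the XOR constraint is deterministic, so the conditional for each of X1, X2 puts all its probability mass on the complement of the other variable's current value, and the state never changes no matter how long we run:

```python
import random

def sample_given_other(other):
    """Sample X_i from P(X_i | X_other, Y = 1) under Y = X1 XOR X2.
    The constraint is deterministic: only v = 1 - other has nonzero weight."""
    weights = [1.0 if (v ^ other) == 1 else 0.0 for v in (0, 1)]
    z = sum(weights)
    r = random.random() * z
    return 0 if r < weights[0] else 1

x1, x2 = 0, 1                 # start in the state (0, 1)
for _ in range(1000):         # many Gibbs steps
    x1 = sample_given_other(x2)
    x2 = sample_given_other(x1)

# The chain stays at (0, 1) forever; (1, 0), which has probability 1/2
# under the true posterior, is never visited.
```

The same code started from (1, 0) would stay at (1, 0), so estimates from either run are badly wrong despite the chain running for a thousand steps.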
That is, I have a pairwise Markov network with some very strong potentials that try to enforce consistency between adjacent superpixels. So if it turns out that, for whatever reason, my initial assignment was actually wrong (for example, a common error mode for segmentation models is that road is labeled as water or sky, because the appearances are actually kind of similar, a grayish sort of appearance), then say I happen to label all of these superpixels as, say, water, and now I use Gibbs sampling. None of them is likely to move away from its current assignment, because each of the superpixels is strongly constrained by the pairwise potentials with its neighbors. And so this notion of slow-to-mix Markov chains is not just a theoretical construct that occurs with exclusive-or; it actually happens in practice a fair amount. Now, to summarize: what Gibbs sampling does is take the very challenging problem of inference in a Gibbs distribution and convert it to a sequence of easy sampling steps, where each sampling step samples one variable given all of the others. This method has the advantage that it's probably the simplest Markov chain for probabilistic graphical models, and each of these sampling steps is very efficient to compute. The cost is that these chains are very slow to mix when the probabilities are peaked and the variables are very highly correlated. And it's only applicable in cases where we can sample from a product of factors. Now, we can always do that in a discrete graphical model, if our factors aren't too large. But if we have, for example, a very densely connected Markov network, or one with continuous distributions where the product of factors doesn't have a nice parametric form that we can sample from, then Gibbs sampling might not be applicable.
And so there are non-trivial limitations to the applicability of Gibbs sampling, and even when it applies, it can actually be very slow to mix. We often want to do something more clever, and we'll talk about more clever algorithms later on.