In the last video, we explored model uncertainty using posterior probabilities of models and Bayesian model averaging based on BIC. In the example with the kids' cognitive scores, we had four predictor variables, leading to 2 to the 4, or 16, models. With modern computers, it is easy to enumerate all possible models for BMA. However, what if you have 10,000, 1 million, or more predictor variables? In this video, we'll present some of the ideas behind stochastic methods of implementing Bayesian model averaging. We will describe how Markov chain Monte Carlo can be used to explore this space of models and estimate quantities of interest. Rather than a population of values of kids' cognitive scores, let's consider the pseudo-population of possible models that we obtained from the all-possible-subsets regression in the last video. Here, we see each of the 16 models, and I've arranged them by their model size, the number of predictors plus the intercept, on the x-axis, and the posterior probability on the y-axis. Knowing the probabilities of each model, I could take a sample from this population, say, using the sample function in R. I would want to sample models with probability equal to their posterior probability, or proportional to the marginal likelihood times the prior probability. Once I have a sample, I could estimate the probability of a model using the frequentist definition of probability. That is, I count the number of times model M occurs among the capital-I samples and divide by I. This is the relative frequency of the event that model M occurred in my sample. I can estimate other probabilities by their relative frequency as well, such as the probability that variable j is included in the model. These estimated probabilities can be used in model selection or BMA instead of the exact expressions, though this will add Monte Carlo variation to the answer. What if there are too many models to enumerate?
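The sampling-and-relative-frequency idea can be sketched in a few lines. The video mentions R's sample function; the sketch below uses Python with NumPy instead, and the posterior probabilities are made-up stand-ins rather than the actual values from the kids' scores example:

```python
import numpy as np

rng = np.random.default_rng(42)

# With p = 4 predictors there are 2^4 = 16 models; encode each model as a
# 0/1 inclusion vector over the predictors.
p = 4
models = np.array([[(m >> j) & 1 for j in range(p)] for m in range(2**p)])

# Stand-in posterior probabilities (made up, not the values from the example).
post_prob = rng.dirichlet(np.ones(2**p))

# Draw capital-I models, each with probability equal to its posterior prob.
I = 10_000
idx = rng.choice(2**p, size=I, replace=True, p=post_prob)

# Relative-frequency estimate of each model's posterior probability.
freq = np.bincount(idx, minlength=2**p) / I

# Relative frequency with which each variable j appears in the sampled
# models: an estimate of the marginal inclusion probabilities.
incl_prob = models[idx].mean(axis=0)
```

With 10,000 draws, the Monte Carlo error of each relative frequency is on the order of the square root of p(1 - p)/I, well under a percent here.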
If I can't enumerate all the models, I can't calculate the posterior model probabilities and use those for sampling. Fortunately, there's another way, based on Markov chain Monte Carlo. Let's walk through a basic Markov chain Monte Carlo algorithm. We're going to start with an initial model, M(0). In principle, this can be any model. Then for iterations indexed by i, starting with 1 and going up to capital I, we will do the following steps. First, randomly pick another model; this will be our proposed model. Next, we compare it to our current model, M(i-1), by looking at the posterior odds of the proposed model to the current model. Note that we do not actually need to know the posterior probabilities exactly, because the odds can be expressed as the product of the Bayes factor for comparing the proposed model to the current model, times the prior odds of the proposed model to the current model. In other words, the term that required summing over all 2 to the p possible models to compute the posterior probabilities drops out, so all we need is to be able to evaluate Bayes factors and prior odds. Let's call that ratio R. If R, the posterior odds, is greater than 1, the proposed model has higher probability than our current model, and we accept it as model M(i). If R is less than 1, we'll still want to accept it some of the time, so that we can explore the population of models. To decide, we flip a biased coin with probability R of heads. If it comes up heads, then we set M(i) to the proposed model. If tails, we stay at our current model and set M(i) to M(i-1), indicating that on this iteration we stayed put. Last, we increment i by 1 and repeat until we've carried out capital-I iterations of the algorithm. The samples of models are not independent; that is, whether we accept a new model depends on where we are currently.
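The steps above amount to a Metropolis sampler over the model space. Below is a minimal Python sketch under some stated assumptions: the 16 models get made-up unnormalized weights standing in for marginal likelihood times prior, so the ratio R can be evaluated for any pair of models without ever computing the normalizing sum over all 2 to the p models:

```python
import numpy as np

rng = np.random.default_rng(0)
p = 4
n_models = 2**p

# Made-up unnormalized posterior weights (marginal likelihood times prior).
# The posterior odds of two models is the ratio of their weights, so the
# normalizing sum over all 2^p models is never needed.
weights = np.arange(1.0, n_models + 1)

def posterior_odds(proposed, current):
    # In practice: Bayes factor [proposed : current] times prior odds.
    return weights[proposed] / weights[current]

I = 50_000
current = 0                        # M(0): any starting model will do
visits = np.zeros(n_models)
for _ in range(I):
    proposed = rng.integers(n_models)   # propose a model uniformly at random
    R = posterior_odds(proposed, current)
    if R >= 1 or rng.random() < R:      # accept with probability min(1, R)
        current = proposed
    visits[current] += 1                # record the chain's current state

# Ergodic relative frequencies: converge to the posterior model probabilities.
freq = visits / I
```

Because the uniform proposal is symmetric, accepting with probability min(1, R) is all that is needed, and freq converges to weights divided by their sum.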
However, if we run this long enough, then the relative frequencies of the models, an ergodic average, will converge to the posterior probabilities over models, and we can use these samples from the MCMC for Bayesian model averaging or selection. In my simple example, I proposed models uniformly at random, i.e., all models were equally likely to be proposed. This can be pretty inefficient if there are lots of models with low probability. We can come up with other ways to propose models. For example, we can look at neighboring models of our current model, by either flipping a coin to add one predictor that is currently not in the model, or randomly picking one of the current predictors to drop from the model. This forms a random walk across neighboring models. We could also propose to swap a current predictor with one that is currently not in the model. This has the potential to take bigger jumps in the space of models. Other possible moves can be designed to help move around the space of models. However, we have to be careful to adjust for any potential bias, due to how we propose new models, to ensure that our relative frequencies converge to the posterior distribution. We can do this by multiplying R by a correction factor, the ratio of the probabilities of proposing the reverse and forward moves. Let's look at an example with the kids' cognitive scores. We start with the same plot of model dimension and posterior probabilities, but now all of the points for potential models are shown as open triangles, as placeholders, as if we did not know their probabilities. In the simulation, you will see the current model as a solid orange triangle. A dashed line will connect it to the proposed model, which will appear in blue. If we accept the proposal, then it will change to orange. Once we visit a model, it will be shown as a solid gray triangle. Let's go. Our first model is actually the model with the highest probability. You can see that we are proposing neighboring models, which have lower probability, but we do not accept all of these proposals.
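The neighboring-model and swap proposals can be written as small helpers acting on a 0/1 inclusion vector. This is an illustrative Python sketch; the function names are mine, not from the video:

```python
import numpy as np

rng = np.random.default_rng(1)
p = 4  # number of candidate predictors

def propose_neighbor(gamma):
    # Flip one randomly chosen coordinate of the 0/1 inclusion vector:
    # this adds a predictor not in the model or drops one that is.
    new = gamma.copy()
    j = rng.integers(p)
    new[j] = 1 - new[j]
    return new

def propose_swap(gamma):
    # Swap one included predictor for one excluded predictor, which keeps
    # the model size fixed but can jump farther in model space.
    inc = np.flatnonzero(gamma == 1)
    exc = np.flatnonzero(gamma == 0)
    if len(inc) == 0 or len(exc) == 0:
        return propose_neighbor(gamma)  # nothing to swap; fall back to a flip
    new = gamma.copy()
    new[rng.choice(inc)] = 0
    new[rng.choice(exc)] = 1
    return new
```

The single flip and the pure swap are each symmetric (the reverse move is proposed with the same probability as the forward move), so R needs no correction for them; the fallback at the empty or full model breaks that symmetry, which is exactly the kind of bias the proposal-ratio correction to R is meant to fix.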
As this runs, you can see that it's hard to move from the highest posterior probability model to the bottom, but the chain does move around, visiting most of the models in the first 100 iterations. To sum up, we've explored a simple Markov chain Monte Carlo algorithm for exploring the space of models. Except for plotting, we did not actually need to know the model probabilities to run our sampler. This makes it easy to implement in problems where we cannot enumerate the models to obtain exact posterior model probabilities. The Monte Carlo frequencies of the sampled models can be used in BMA in place of the normalized marginal likelihoods times priors, allowing us to carry out Bayesian inference even when we cannot enumerate all models. Our MCMC was designed to provide estimates of probabilities that converge to the true posterior probabilities. There is quite a bit of art and science to designing a good Markov chain algorithm to explore the space of models. In theory, if we run any MCMC sampler long enough, we will visit all the models. However, we do not necessarily have the time to do so, and in large problems, having an efficient MCMC becomes more important. There are also other stochastic search algorithms that try to find the models with the highest posterior probability, or that sample models without replacement. These can be more efficient for model selection in some cases, but may not provide unbiased estimates of model probabilities or other quantities in large problems. In the next video, we will explore alternative prior distributions as part of prior sensitivity and give some examples using Markov chain Monte Carlo.