In this video, we'll talk about decision making under model uncertainty. In particular, we'll discuss different methods for selecting models for posterior inference and prediction. We will continue with the US crime data example to illustrate selection and prediction using the R package BAS. For Bayesian model choice, we start with the full model, including all predictors. Model uncertainty, or in this case variable selection uncertainty, arises when we believe that some of the variables may be unrelated to the response, which corresponds to setting their regression coefficients to exactly zero. We specify a prior distribution that reflects our uncertainty about which variables are important, and we then update it with the data, resulting in a posterior distribution over all models, and over the coefficients and variances within each model. Now that we have this posterior distribution, what do we do with it? Since this is our posterior, we can summarize it in many ways; but is selecting a single model from this posterior and using it for all future inference really necessary? What are your objectives for inference in this case?

Let's talk about some of the decisions regarding which models to use. First, there is the BMA model. Yes, that is a single model, but it is better described as a hierarchical model, composed of many simpler models as building blocks. It represents the full posterior uncertainty after seeing the data. For prediction, the posterior predictive mean, which is the posterior-probability-weighted average of the predictions from each submodel, is best under squared error loss. Now, if your objective is instead to learn which model is most likely to have generated the data, using a zero-one loss, then the highest probability model is optimal. For the US crime data, that model includes eight of the 15 predictors. However, this model has a posterior probability just under 0.02, and there are many other models with comparable posterior probabilities.
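The averaging and selection rules just described are simple to state in code. The course itself uses the R package BAS, but here is a minimal conceptual sketch in Python, using made-up posterior probabilities and predictions (not the actual US crime posterior): BMA prediction is the posterior-probability-weighted average of each submodel's prediction, and the highest probability model (HPM) is just the argmax over model probabilities.

```python
# Conceptual sketch of BMA prediction and HPM selection.
# The model probabilities and submodel predictions below are
# illustrative placeholders, not the actual US crime results.

# posterior probabilities of three sampled models (sum to 1)
post_prob = [0.5, 0.3, 0.2]
# each submodel's prediction for one new observation
preds = [10.0, 12.0, 11.0]

# BMA posterior predictive mean: weighted average of submodel
# predictions, optimal under squared error loss
bma_pred = sum(p * yhat for p, yhat in zip(post_prob, preds))

# highest probability model: optimal under zero-one loss on model choice
hpm_index = max(range(len(post_prob)), key=lambda k: post_prob[k])

print(round(bma_pred, 3))  # 10.8  (= 0.5*10 + 0.3*12 + 0.2*11)
print(hpm_index)           # 0: the first model has the largest probability
```

In practice the weighted average runs over the (possibly enormous) set of sampled models rather than three, but the decision-theoretic logic is exactly this.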
So, while this might be the highest probability model, we're still pretty unsure about whether it is best. Another model that is frequently reported is what is called the median probability model. This is the model that includes all predictors whose marginal inclusion probabilities are greater than 0.5. If the variables are all uncorrelated, then this is the same as the highest posterior probability model. For a sequence of nested models, such as polynomial regression with increasing powers, the median probability model is the best single model for prediction, even though it may include more variables than the highest probability model. In the general case of correlated predictors and non-nested models, it often still does well. If the correlations among the variables increase, however, it may miss important variables, as correlation tends to dilute the inclusion probabilities of related variables. For the crime data, the median probability model includes all the predictors from the highest probability model except for time.

Now, if you really have to select a single model and your objective is prediction, then the best choice is to find the model whose predictions are closest to those given by BMA. "Closest" could be based on squared error loss for prediction, or on any other loss function. Unfortunately, there is no nice closed-form expression for this model, but we can still calculate the loss for each of our sampled models to try to identify it. Using squared error loss, we find that the best predictive model, the one whose predictions are closest to BMA, includes all the predictors in the median probability model plus three additional variables. These are the indicator of being a southern state, police expenditure in the previous year, and the number of males per 1,000 females in the state. We can obtain fitted values and predictions for each of these approaches using functions that are similar to those for lm.
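The median probability model (MPM) and the best predictive model (BPM) can also be sketched directly from sampled models and their posterior probabilities. This is again an illustrative Python toy, with hypothetical models over three predictors rather than the crime data: marginal inclusion probabilities come from summing the probabilities of the models that contain each predictor, and the BPM is the sampled model whose prediction is closest to the BMA prediction under squared error.

```python
# Conceptual sketch of the median probability model and the best
# predictive model; all probabilities and predictions are made up.

# four sampled models over predictors A, B, C, with posterior probabilities
models = [{"A", "B"}, {"A"}, {"A", "B", "C"}, {"B"}]
post_prob = [0.4, 0.3, 0.2, 0.1]
predictors = ["A", "B", "C"]

# marginal inclusion probability of each predictor: total posterior
# probability of the models that include it
incl = {x: sum(p for m, p in zip(models, post_prob) if x in m)
        for x in predictors}

# MPM keeps every predictor with inclusion probability above one half
mpm = {x for x, p in incl.items() if p > 0.5}

# BPM: the sampled model whose prediction (at one illustrative new
# point) is closest to the BMA prediction under squared error loss
preds = [10.0, 9.0, 10.5, 8.0]
bma = sum(p * yhat for p, yhat in zip(post_prob, preds))
bpm_index = min(range(len(models)), key=lambda k: (preds[k] - bma) ** 2)

print(sorted(mpm))  # ['A', 'B']: A has inclusion 0.9, B has 0.7, C only 0.2
```

If I recall the BAS interface correctly, these choices correspond to the `estimator` argument of its `predict` method, with values `"BMA"`, `"HPM"`, `"MPM"`, and `"BPM"`, but check the package documentation for the exact call.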
In the following plot, we compare the predictions for the 47 states using the different approaches. We can see that the correlation among them is extremely high. As expected, the single best predictive model has the highest correlation with BMA, with a correlation of 0.994. However, the highest posterior probability model and the median probability model are nearly as good.

We've compared predictions from three ways of selecting a single model, the highest posterior probability model, the median probability model, and the best predictive model, as well as from model averaging. Selecting a model should ideally be based on a decision process that takes into account the purpose of the analysis and the costs and losses associated with selection. In this case, there is still substantial uncertainty about which model is best, and model averaging would be preferable for reporting predictions and credible intervals. Model selection can be very sensitive to the choice of prior distribution over models. Where possible, you can use expert opinion that says some variables must always be included, to reduce uncertainty. This corresponds to giving a variable a prior probability of one for being included, or no uncertainty, and it is easy to accommodate with the Bayesian approach. We have used a prior distribution that says all models are equally likely. This corresponds to a prior probability of 0.5 that each coefficient is non-zero, which suggests a priori that we expect half of the variables to be included. If the number of predictors is large, this may lead to larger models under the posterior distribution. Rather than using a fixed probability, we might place a prior distribution on the probability that a variable is included, that is, that its beta is non-zero. Under beta prior distributions, this leads to what are called beta-binomial distributions on the model size, and often simpler models. We've used the Zellner-Siow prior, which is a mixture of Zellner g-priors.
Like the hyper-g/n prior we described in the last video, this has some nice theoretical properties and has been shown to work well in a range of problems. Of course, there is no single prior that is best overall, and if you do have prior information, you should include it. If you expect that there should be many predictors related to Y, but that each has a small effect, alternative priors may be better. Also, think critically about whether model selection is really important. If you believe that all the variables should be relevant but are worried about overfitting, there are alternative priors that avoid putting probability on coefficients being exactly zero, but still prevent overfitting by shrinking coefficients toward their prior means. Examples include the Bayesian lasso and the Bayesian horseshoe. There are other forms of model uncertainty that you may want to consider, such as linearity in the relationship between the predictors and the response, uncertainty about the presence of outliers, and uncertainty about the distribution of the response. These forms of uncertainty can be incorporated by expanding the model and priors, similar to what we have covered here. Multiple regression is one of the most widely used statistical methods; however, this is just the tip of the iceberg of what you can do with Bayesian methods.