In the last video we provided a Bayesian reference analysis for kids' cognitive scores. We found that several credible intervals contain zero, suggesting that we could potentially simplify the model. Next we'll discuss model selection, which is the science and art of picking variables for multiple regression models, but this time from a Bayesian perspective. In the inferential statistics course, you compared models using p values and adjusted R squared. Here we're going to talk about Bayesian model selection using the Bayesian information criterion, or BIC. There are many other Bayesian criteria that you could stumble upon as well, but this tends to be one of the most popular. Later, we'll talk about some of the other Bayesian criteria based on Bayes factors.

What is BIC? The Bayesian information criterion is formally defined as -2 times the log likelihood, evaluated at the maximum likelihood estimates, plus the log of the sample size times the number of parameters in the model. The model with the smallest BIC is preferable. For regression models, we can re-express BIC, up to a constant, as n times the log of one minus R squared, plus log n times the number of parameters. We can always increase R squared by adding another variable to the model, and by adding enough terms we can even achieve a perfect fit to the data. A large R squared reduces the first term in the expression for BIC by improving the goodness of fit, but may result in overfitting the data. BIC addresses this by adding a penalty term for the number of parameters, including the intercept, in the model. This provides a tradeoff between the goodness of fit on the left side and the model complexity represented by the term on the right.

Let's start with backwards elimination using BIC. We begin with the full model, the model with all possible predictors. We drop one variable at a time and record the BIC for each of the smaller models. Then we pick the model with the smallest BIC.
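As a rough sketch of the definition above, here is one way BIC could be computed for a Gaussian linear model. This assumes NumPy; the function name is our own, and we count only the regression coefficients (including the intercept) as parameters — software packages differ on whether the error variance is also counted, which shifts every model's BIC by the same amount and so does not change which model wins.

```python
import numpy as np

def bic(y, X):
    """BIC = -2 * log-likelihood at the ML estimates + log(n) * (number of parameters).

    X is the design matrix with an intercept column; the Gaussian log-likelihood
    is evaluated at the OLS (= maximum likelihood) estimates.
    """
    n, p = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)   # OLS coefficients
    resid = y - X @ beta
    sigma2 = resid @ resid / n                     # ML estimate of the error variance
    loglik = -0.5 * n * (np.log(2 * np.pi * sigma2) + 1)
    return -2 * loglik + np.log(n) * p             # penalty: log(n) per parameter
```

The smaller of two BIC values points to the preferred model.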
We repeat this until none of the smaller models yields a decrease in BIC. Let's walk through an example using the kids' cognitive scores, where we predict the kid's score from mom's high school status, mom's IQ score, whether or not the mom worked during the first three years of the kid's life, and mom's age. The BIC for the full model is 2541. At the first step, we try removing each of the variables one at a time. For example, removing age gives a BIC of 2535. This is a decrease from the full model, which suggests that dropping age improves our model. Similarly, dropping work or high school status also leads to a smaller BIC than the full model, but not as small as dropping age. We can also try removing mom's IQ, but that gives a much higher BIC, so the IQ variable must be very important for predicting the kid's cognitive score. So at the first step we drop age and pick the model that predicts kid_score from high school status, IQ, and the mother's work status.

Next we move to the second step, where we once again try removing each of the remaining variables, one at a time, and we see that dropping work leads to the smallest BIC. At step three, we try dropping IQ or high school status, but dropping either of them increases BIC over the previous stage. Therefore, our final model is the one that predicts the kid's cognitive test score from mom's high school status and mom's IQ. This model does not include mom's work status, which was in the best adjusted R squared model we found earlier. We would expect the two approaches to give similar, but not necessarily identical, models, because the decision criteria are different. When using BIC to select a model, it's common to report parameter estimates based on the reference prior we used previously.
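The backward-elimination loop just described can be sketched as follows. This is a minimal illustration, not the course's own code: the data here are synthetic stand-ins for the kids' cognitive-score dataset (the column names hs, iq, work, age and the coefficients are invented for the example), and the bic helper uses the Gaussian log-likelihood at the OLS estimates.

```python
import numpy as np

def bic(y, X):
    """-2 * Gaussian log-likelihood at the OLS/ML estimates + log(n) * #parameters."""
    n, p = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / n
    loglik = -0.5 * n * (np.log(2 * np.pi * sigma2) + 1)
    return -2 * loglik + np.log(n) * p

def backward_eliminate(y, cols):
    """Start from the full model; at each step drop the single variable whose
    removal gives the smallest BIC, stopping when no removal decreases BIC."""
    def design(names):
        return np.column_stack([np.ones(len(y))] + [cols[c] for c in names])
    current = list(cols)
    best = bic(y, design(current))
    improved = True
    while improved and current:
        trials = {c: bic(y, design([k for k in current if k != c])) for c in current}
        drop = min(trials, key=trials.get)
        improved = trials[drop] < best
        if improved:
            best = trials[drop]
            current.remove(drop)
    return current, best

# Illustrative synthetic data: only hs and iq actually drive the response.
rng = np.random.default_rng(42)
n = 434
cols = {
    "hs":   rng.binomial(1, 0.8, n).astype(float),
    "iq":   rng.normal(100, 15, n),
    "work": rng.binomial(1, 0.7, n).astype(float),
    "age":  rng.normal(23, 3, n),
}
score = 26 + 12 * cols["hs"] + 0.6 * cols["iq"] + rng.normal(0, 18, n)
selected, final_bic = backward_eliminate(score, cols)
```

With data like these, the loop tends to keep the strong predictors (hs and iq) and discard the noise variables, mirroring the steps in the lecture.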
The estimates under the reference posterior distribution can be obtained from the OLS estimates in R. If you compare these results to the estimates under the full model, the credible interval for IQ is the same; however, the interval for high school status is shifted slightly to the right. All of the credible intervals exclude zero, suggesting that we have found a parsimonious model.

BIC is one example of a criterion based on a penalized likelihood. Other popular criteria can be obtained by using different penalties, such as AIC, the Akaike information criterion, or adjusted R squared. BIC tends to select parsimonious models, while AIC and adjusted R squared may include terms that are not statistically significant but may do better for prediction. Other approaches to Bayesian model selection are based on selecting the model with the highest posterior probability, or, if prediction is important, using decision theory to pick the model with the smallest expected prediction error. In addition to goodness of fit and parsimony, loss functions that include the cost of collecting the variables needed for prediction may be an important consideration.

Finally, what should you do if there are two competing models with very similar values of the model selection criterion, such as BIC? Do we still pick the winner and ignore our uncertainty about the best model, or can we use the ensemble of models for posterior inference? We'll explore this in the next video.
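The penalized-likelihood family mentioned above can be written as -2 times the log likelihood plus k times the number of parameters, where k = 2 gives AIC and k = log(n) gives BIC. A small sketch of this, again for the Gaussian linear model (function names are our own):

```python
import numpy as np

def gaussian_loglik(y, X):
    """Gaussian log-likelihood of a linear model, evaluated at the OLS/ML estimates."""
    n = len(y)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / n
    return -0.5 * n * (np.log(2 * np.pi * sigma2) + 1)

def info_criterion(y, X, penalty):
    """Penalized-likelihood criterion: -2 * loglik + penalty * (number of parameters).

    penalty = 2 gives AIC; penalty = log(n) gives BIC. Whenever n > e^2 (about 7.4),
    log(n) > 2, so BIC charges more per extra parameter and favors smaller models.
    """
    return -2 * gaussian_loglik(y, X) + penalty * X.shape[1]
```

Comparing the two on the same pair of nested models, the BIC gap between the larger and smaller model always exceeds the AIC gap by exactly log(n) minus 2 per extra parameter, which is why BIC is the more parsimonious criterion.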