So far, we've talked about Bayesian model selection and Bayesian model averaging, using BIC to determine Bayes factors. Today, we will introduce a new conjugate prior distribution called Zellner's g prior. As we will see, this leads to simple expressions for Bayes factors in terms of summary statistics from ordinary least squares. We will talk about choosing the parameter g in the prior and then conduct a sensitivity analysis using the kids' cognitive score data that we used in earlier videos.
We're going to write the sampling model for multiple regression a little bit differently today. In the mean function, I'm going to subtract the sample mean from each of the predictors. Now, if I set x equal to its sample mean, my mean for the response is the intercept, and the OLS estimate of the intercept is the sample mean, y bar. This way, the intercept has the same meaning across all of the possible models, which will ease putting a prior distribution on the intercept.
We need to specify the prior distribution for all of the betas. Zellner proposed a simple, informative, conjugate normal prior for the betas conditional on sigma squared. This has an informative prior mean b0, but the variances and covariances are a constant, g, times the variances and covariances from OLS. SXX is the matrix of sums of squared deviations and cross-products for X, and its inverse provides the variances and covariances for OLS, up to the factor sigma squared. If you prefer, you can just think of this in the one-parameter case, where it is just the sum of squared deviations for a single x. This provides a data-dependent variance-covariance matrix that has the same shape as in the likelihood. The parameter g scales the prior variances relative to the variances from OLS. The advantage of this prior is that it reduces prior elicitation down to two components: the prior mean b0 and the scalar g. Like all conjugate priors, Zellner's g prior leads to simple updating rules. However, in this case, they're even simpler than for most conjugate priors.
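Since the slides are not part of this transcript, here is one way to write out the centered model, the g prior, and the updating rules just described. The notation (alpha for the intercept, p predictors, beta hat for the OLS estimate) is an assumption chosen for illustration, not necessarily the notation on the slides.

```latex
% Centered sampling model: alpha is the mean response at x = \bar{x}
y_i = \alpha + \sum_{j=1}^{p} \beta_j \,(x_{ij} - \bar{x}_j) + \epsilon_i,
\qquad \epsilon_i \sim N(0, \sigma^2)

% Zellner's g prior: prior covariance is g times the OLS covariance
\beta \mid \sigma^2 \sim N\!\left(b_0,\; g\,\sigma^2\, S_{xx}^{-1}\right),
\qquad S_{xx} = \sum_{i=1}^{n} (x_i - \bar{x})(x_i - \bar{x})^{\top}

% Conjugate updating given sigma^2 and g
\beta \mid y, \sigma^2, g \sim
N\!\left(\frac{g}{1+g}\,\hat{\beta} + \frac{1}{1+g}\, b_0,\;
         \frac{g}{1+g}\,\sigma^2\, S_{xx}^{-1}\right)
```

In the one-predictor case, S_xx is just the sum of squared deviations of x, so the prior variance of beta is g times sigma squared divided by that sum of squares.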
The posterior mean is written as a linear combination of the OLS estimate, beta hat, and the prior mean. Since g over one plus g is less than one, the g prior shrinks the OLS estimates towards the prior mean. As g goes to infinity, we recover the OLS estimates, as in the reference prior. Similarly, the posterior variances are shrunken versions of the OLS variances, and the posterior distribution of beta, given sigma squared and g, is normal.
Because of this simplicity, Zellner's g prior has been widely used in Bayesian model selection and model averaging. One of the most popular versions uses the g prior for all coefficients except the intercept and takes the prior mean b0 to be 0. We combine this with the reference prior for the intercept and sigma squared, which is okay, as we're not testing any hypotheses about the intercept. The Bayes factor for comparing model m to the null model is a simple function of the sample size n, the number of predictors in the model, p sub m, the R squared of the model, and the parameter g. Using the Bayes factors, we can compare any two models using the posterior odds. This can be used to find the posterior probabilities under enumeration of all models, or to sample models using MCMC, as in the previous video.
Of course, how do we pick g? Well, you might be tempted to pick an extremely large value of g, so that your prior distribution is not very informative. Perhaps surprisingly, this leads to the Bartlett-Lindley paradox. If I take the limit of the Bayes factor as g goes to infinity, the result is that the Bayes factor goes to 0. This is overwhelming evidence against model m, and the null model will have probability one in the posterior distribution, regardless of the data. This is not really what we want.
Another problem arises if we use just any arbitrary fixed value of g. Suppose you have an imaginary sequence of examples.
In this sequence, the R squared is getting closer and closer to one for a fixed sample size and model dimension. Well, you would expect that a large R squared would suggest that the alternative model should be supported. Yet, with a fixed value of g, the Bayes factor stays bounded and does not provide overwhelming support for the alternative. This has been called the information paradox. This is troubling, as R squared going to one is the same as t squared going to infinity, which is overwhelming evidence against the null model. In this case, the Bayesian and the frequentist would come to different conclusions.
These should provide a warning that picking arbitrary values of g may have some unintended consequences for posterior inference. However, there are some solutions that appear to lead to reasonable results in small and large samples, supported by everything from empirical results with real data to theory, and that provide a resolution to these two paradoxes. In the cases below, the prior distribution depends on n. Now, that might be troubling, but remember that the sum of squares term in the prior is likely getting larger with n. So this serves to balance the growth of that term, and in the limit the prior converges to one that depends on the variance of x in the population.
The unit information prior takes g = n, so that the information in the prior is worth a single observation. This is the same as saying n over g = 1. Taking g = n, though, ignores our uncertainty in the choice of g. Since we don't know g a priori, we could use a prior distribution where the expected value of n over g is 1. One such example is the Zellner-Siow Cauchy prior. This is obtained by putting a gamma prior on n over g. A third example, the hyper-g/n prior, puts a beta distribution on 1/(1 + g/n). The name hyper-g/n comes from the fact that the Bayes factor can be expressed in terms of a special function, the hypergeometric function. Let's look at these priors and a couple of others for the kids' cognitive scores example.
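Before turning to the data, the two paradoxes can be checked numerically. The sketch below assumes the usual closed form for the Bayes factor of a model against the null under the g prior with b0 = 0 and reference priors on the intercept and sigma squared, BF = (1 + g)^((n - 1 - p)/2) / (1 + g(1 - R^2))^((n - 1)/2). The function name and the values of n, p, R squared, and g are hypothetical and chosen only for illustration; they are not from the kids' data.

```python
import math


def log_bayes_factor(n, p, r2, g):
    """Log Bayes factor of a model with p predictors versus the null model.

    Assumed closed form under the g prior with prior mean 0:
    BF = (1 + g)^((n - 1 - p) / 2) / (1 + g * (1 - R^2))^((n - 1) / 2)
    """
    return (0.5 * (n - 1 - p) * math.log1p(g)
            - 0.5 * (n - 1) * math.log1p(g * (1.0 - r2)))


n, p = 100, 2  # hypothetical sample size and number of predictors

# Bartlett-Lindley paradox: with the data held fixed (R^2 = 0.3),
# the log Bayes factor keeps falling as g grows, favoring the null.
bartlett = [log_bayes_factor(n, p, 0.3, g) for g in (1e6, 1e12, 1e30)]

# Information paradox: with g fixed, the log Bayes factor is capped at
# 0.5 * (n - 1 - p) * log(1 + g), however close R^2 gets to one.
fixed_g = 5.0
cap = 0.5 * (n - 1 - p) * math.log1p(fixed_g)
info = [log_bayes_factor(n, p, r2, fixed_g) for r2 in (0.9, 0.99, 0.999999)]

# Unit information choice g = n: for the same R^2, the evidence for the
# model grows as the sample size grows.
unit_info = [log_bayes_factor(m, p, 0.3, m) for m in (50, 500, 5000)]

print("large g:", bartlett)
print("R^2 -> 1:", info, "cap:", cap)
print("g = n:", unit_info)
```

Working on the log scale avoids overflow for large n or g. Mixture priors such as Zellner-Siow or hyper-g/n need an integral over g, which this sketch does not attempt.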
For each prior, I found the posterior probabilities that each of the four predictor variables would be included in the model. This is the posterior probability that each beta is not equal to zero. Each bar plot corresponds to one of the four predictor variables: mom's high school status, mom's IQ, whether the mom worked during the first three years, and mom's age. From left to right in each bar plot, we have the posterior inclusion probabilities under BIC; the g prior with g equal to n; the Zellner-Siow Cauchy prior; the hyper-g/n prior; the empirical Bayes method, where we estimate g; and AIC, where the AIC values are converted to posterior probabilities like we did with BIC.
All the methods agree that we should include mom's IQ. Mom's high school status has a probability of being included that is over 0.5 across all the methods, with BIC and the g prior being the most conservative. Also, all approaches agree that the probabilities that work or age should be included are less than 0.5, with AIC being the least conservative and BIC and the g prior being the most conservative. AIC is designed to find good predictive models, and including a predictor whose coefficient is really 0 does not impact predictions, although a simpler model may be true. In this case, the methods all agree loosely on what the most important variables are, but differ in the magnitudes of the posterior inclusion probabilities. The Zellner-Siow Cauchy prior and the hyper-g/n prior provide results that are between the two extremes of BIC and AIC, and overall perform well in a range of problems.
To recap, we've introduced Zellner's g prior, a conjugate prior distribution for Bayesian model selection and model averaging. We've discussed some of the problems with choosing g, where g's that are too large or too small lead to behavior that is inconsistent with what one might expect. To resolve Bartlett's paradox and the information paradox, we recommend g equals n, or placing a prior distribution on g over n.
We compared posterior inclusion probabilities using a range of priors. Overall, the priors agree on the most important variables. In the next videos, we will put everything together with a larger example. And we'll talk more about decision making and what to report in the presence of model uncertainty.
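As a companion to the recap, here is a minimal sketch of how posterior inclusion probabilities like the ones compared above could be computed by enumerating all models. It assumes the closed-form g-prior Bayes factor against the null model, a fixed g, and equal prior probability for every model. The predictor labels and the R squared values are made up for illustration; they are not the results from the kids' cognitive score data.

```python
import math


def log_bayes_factor(n, p, r2, g):
    """Log Bayes factor versus the null model (assumed g-prior closed form)."""
    return (0.5 * (n - 1 - p) * math.log1p(g)
            - 0.5 * (n - 1) * math.log1p(g * (1.0 - r2)))


def inclusion_probabilities(r2_by_model, n, g):
    """Posterior inclusion probability of each predictor.

    r2_by_model maps a frozenset of predictor names to that model's R^2 and
    must list every model being compared, including the null (empty) model.
    A uniform prior over the listed models is assumed.
    """
    log_w = {m: log_bayes_factor(n, len(m), r2, g)
             for m, r2 in r2_by_model.items()}
    top = max(log_w.values())  # subtract the max before exponentiating
    weights = {m: math.exp(v - top) for m, v in log_w.items()}
    total = sum(weights.values())
    posterior = {m: w / total for m, w in weights.items()}
    predictors = sorted(set().union(*r2_by_model))
    return {x: sum(pr for m, pr in posterior.items() if x in m)
            for x in predictors}


# Hypothetical R^2 values for all four subsets of two predictors.
r2_by_model = {
    frozenset(): 0.0,
    frozenset({"iq"}): 0.20,
    frozenset({"age"}): 0.01,
    frozenset({"iq", "age"}): 0.205,
}

probs = inclusion_probabilities(r2_by_model, n=100, g=100)  # g = n
print(probs)
```

With these made-up inputs, the predictor that explains most of the variance gets an inclusion probability near one, while the weak predictor stays well below 0.5. Swapping in a different value of g is a quick way to repeat the sensitivity analysis on a small scale.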