Let's consider Poisson data. For example, think about chocolate chip cookies. In mass produced chocolate chip cookies, they make a large amount of dough. They mix in a large number of chips, mix it up really well and then chunk out individual cookies. In this process the number of chips per cookie approximately falls a Poisson distribution. If we were to assume that chips have no volume then this would be exactly a Poisson practice and follow exactly a Poisson distribution. But in practice chips are not that big and so they follow approximately a Poisson distribution for number of chips per cookie. So if we model the number of chips per cookie Y sub i as Poisson between lambda, we can have a likelihood for a bunch of observations y given lambda. To be lambda to the sum of the y sub is times e to the minus m lambda over the product of y sub i factorial. For this we need lambda to be positive valued. Now our parameter has to be positive valued. What type of prior should we put on lambda? It would be convenient if we could use a conjugate prior. So we can ask, what distribution looks like lambda to the something e to the minus something lambda? That distribution is the gamma distribution. So gamma prior for lambda is going to look like lambda with being a gamma prior with parameters alpha and beta, f of lambda i beta to the alpha over gamma of alpha times lambda to the alpha minus 1 e to the minus beta lambda for lambda greater than 0. Plus the posterior f of lambda given y will be proportional to the likelihood f of y given lambda times the prior f of lambda. And this is proportional to lambda to sum of the y sub is times e to the minus n times lambda times lambda to alpha minus one times e to the minus beta times lambda. Under the proportionality we only need to focus on the terms that contain lambda, so we can drop these normalizing constants. This then simplifies to lambda to the alpha plus sum of y sub i minus 1 e to the minus theta plus n lambda. Thus we can see the posterior Is a gamma distribution with parameters alpha plus sum of the y sub is and beta plus Z. Recall that the mean of the gamma Of a gamma random variable is it's first parameter over it's second parameter under this parameterization. Do note that there are a lot of different parameterizations of gamma random variables out there and they may have different parameterizations of the means as a result. So in this case, we can see the posterior mean. Posterior mean is going to be alpha plus sum of the y sub is over beta plus n. We can then rewrite this as beta over beta plus n times alpha over beta plus n over beta plus n times some of the sub is over n. So we can see that in this case as well, our posterior mean is a weighted average of the prior mean and the data mean. The weights add up to 1 with the n here this is clearly the sample size for the data. And so the effective size for the prior in this case is beta. We can now ask how would we choose our parameters, our hyper parameters alpha and beta? For trying to specify a prior for this and we're thinking about something concrete but cookies, how would we chose values for alpha and beta in order to be able to do this problem and get a posterior after we observed data. How would we chose a particular prior, a particular alpha and beta we need to specify values for? In the case of something concrete like chocolate chip cookies, we may have some ideas about what is in our prior and how to specify information for a prior? What types of strategies would we use to put that information in, then collect data and update it to get a posterior. Let me present two strategies. Under the first strategy, we want to include information into our prior, based on our personal knowledge. We can start by thinking about what is the prior mean? Prior mean in this case is alpha over beta. And so we can think, what do we think our prior mean is for the number of chips per cookie? What do we think the number of chips per cookie on average is? That would be our prior mean. The strategy now involves we have to specify a second thing because there are two forevers. So if we specify two things we can solve for alpha and beta. One possible second thing we could specify is our prior standard deviation or prior variance. How sure are we about our information in our prior mean? If we think, the prior mean is that there are 12 chips per cookie. Do we think there's a standard deviation of 3? A standard deviation of 4, of 6? How sure are we? We can specify a prior standard deviation and that would be square root of alpha over beta. Given these two quantities, we can solve for alpha and beta and that will give us our gamma prior. Another way of specifying our confidence or uncertainty in our prior is to think about the effective sample size. As we see here, the effective sample size is beta. And so we can specify how many units of information, we think we have in our prior. How sure we are in that sense versus how confident we will be when we have n more data points? So we can specify a value for beta, and we can specify a prior mean and solve for alpha. This would be a way, two different ways to specify an alpha and beta and to get a prior to move forward. Another approach is that we can represent ignorance with a vague prior. In Bayesian statistics, a vague prior refers to one that's relatively flat across much of the space. In this case, we can think about some small epsilon. So epsilon is some small number that is strictly positive. And then we can have a gamma prior with parameters, epsilon and epsilon. As long as these are both strictly positive, this is a proper prior, it's a proper distribution. In this case then, the mean will be epsilon over epsilon which is 1, but its variants will be huge, essentially 1 over epsilon. If epsilon's small, there will be very big variances, so this will be a very diffuse prior across the whole space. We think about what the posterior mean looks like under this prior. The posterior mean would be epsilon plus the sum of the y sub is over epsilon plus n. If epsilon is really small then this is approximately the sum of the y sub is over n, just the data mean. And thus the posterior will be largely driven by the data and the prior will have very little influence.