[MUSIC] In lesson one, we defined a statistical model. As a mathematical structure used to imitate or approximate the data generating process. It incorporates uncertainty and variability using the theory of probability. A model could be very simple, involving only one variable. For example, suppose our data consists of the heights of n=15 adult men. So we have the heights of n=15 men. Clearly it would be very expensive or even impossible to collect the genetic information. That fully explains the variability in these men's heights. We only have the height measurements available to us. To account for the variability, we might assume that the men's heights follow a normal distribution. So we could write the model like this, yi will represent the height for person i, i will be our index. This will be equal to a constant, a number mu which will represents the mean for all men plus epsilon I. This is the individual error term for individual i. We're going to assume that epsilon i comes from a normal distribution with mean zero and variance sigma squared. We are also going to assume that these epsilons are independent and identically distributed from this normal distribution. This is also for i equal to 1 up to n which will be 15 in our case. Equivalently we could write this model directly for the yi themselves. So each yi comes from a normal distribution independent and identically distributed with the normal distribution. With mean mu and variance sigma squared. This specifies a probability distribution and a model for the data. If we know the values of mu and sigma. It also suggests how we might generate more fake data that behaves similarly to our original data set. A model can be as simple as the one right here or as complicated and sophisticated as we need to capture the behavior of the data. So far, this model is the same for frequentists and Bayesians. As you may recall from the previous course. The frequentist approach to fitting this model right here. Would be to consider mu and sigma to be fixed but unknown constants, and then we would estimate them. To calculate our uncertainty in those estimates. A frequentist approach would consider how much the estimates of mu and sigma might change. If we were to repeat the sampling process and obtain another sample of 15 men, over, and over. The Bayesian approach, the one we're going to take in this class. Tackles our uncertainty in mu and sigma squared with probability directly. By treating them as random variables with their own probability distributions. These are often called priors, and they complete a Bayesian model. In the rest of this segment, we're going to review three key components of Bayesian models. That were used extensively in the previous course The three primary components of Bayesian models that we often work with are the likelihood, the prior and the posterior. The likelihood is the probabilistic model for the data. It describes how, given the unknown parameters, the data might be generated. It can be written like this, the probability of y, the theta. Given theta, we're going to call unknown parameter theta right here. Also, in this expression, you might recognize this from the previous class, as describing a probability distribution. It might be, for example, the density of the distribution, if y were normal. Or y were discrete, this might actually represent the probability itself. We're just going to use for our notation a generic p right here. The prior, the next step, is the probability distribution that characterizes our uncertainty with the parameter theta. We're going to write it as p of theta. It's not the same distribution as this one. We're just using this notation p to represent the probability distribution of theta. By specifying a likelihood and a prior. We now have a joint probability model for both the knowns, the data, and the unknowns, the parameters. We can see this by using the chain rule of probability. If we wanted the joint distribution of both the data and the parameters theta. Using the chain rule of probability, we could start with the distribution of theta. And multiply that by the probability or the distribution of y given theta. That gives us an expression for the joint distribution. However if we're going to make inferences about data and we already know the values of y. We don't need the joint distribution, what we need is the posterior distribution. The posterior distribution is the distribution of theta conditional on y, theta given y. We can obtain this expression right here using the laws of conditional probability and specifically using Bayes' theorem. If we start with the laws of conditional probability. The conditional distribution of theta given y will be the joint distribution of theta and y. The same as this one right here divided by the marginal distribution of y by itself. How do we get the marginal distribution of y? We start with the joint distribution like we have on top, And now, we're going to integrate out or marginalize over the values of theta. Finally, to make this look like the Bayes theorem that we're familiar with. You'll notice that the joint distribution can be written as the product of the prior and the likelihood. I'm going to start with the likelihood. because that's how we usually write Bayes' theorem. Again we have the same thing in the denominator here. But we're going to integrate over the values of theta. [COUGH] These integrals right here are replaced by summations if we know that theta is a discrete random variable. The marginal distribution right here of y in the denominator is another important piece. Which you may use if you do more advanced Bayesian modeling. The posterior distribution is our primary tool for achieving the statistical modeling objectives from lesson one. [MUSIC]