Let's talk about probability distributions of two random variables. Here are some examples. Maybe your random variables are the daily temperature and snowfall. Maybe your random variables are the cloud cover and the gasoline price for the day. Or, if you're into neuroscience, maybe they are the stimulus you show and the response you measure, the response of the neuron. Some of these pairs are independent and some are not. For example, cloud cover and gasoline price are probably independent: the gasoline price probably isn't related to the daily cloud cover. Temperature and snowfall, however, would not be independent, because if snow is falling, you know something about the temperature. Stimulus and response can be independent if the stimulus has no effect on the response, or they can be dependent if the stimulus influences the response, or if you can look at the response and make a good guess about the nature of the stimulus.

Let's be a little more general now and just call our random variables x and y. There are a number of different kinds of distributions involving two random variables, so let's go through and make a list.

First, we have the joint distribution. This is the probability of x and y: the probability, or probability density, that the temperature is 20 degrees Fahrenheit and the snowfall is 1.5 inches. The key word is "and". This is the probability of both of these events being true.

We also have what are called marginal distributions. The marginal distributions are the probability distributions over a single variable, p(x) and p(y). In our first example, one marginal distribution would be the probability density of the temperature, regardless of what the snowfall was doing. The other marginal distribution would be the probability density of the snowfall, regardless of what the temperature was doing.

Lastly, we have conditional distributions, p(x|y) and p(y|x), where the bar is the "given", or "conditioned upon", symbol. With a conditional distribution you might ask: what is the probability that the temperature is 5 degrees Fahrenheit, given that the snowfall was 3 inches today? You can also go the other way around: what is the probability that the snowfall was 5 inches today, given that the temperature was 22 degrees Fahrenheit?

Just to keep things straight: the joint distribution is a function of x and y, and a distribution over x and y. A marginal distribution is a function of just x or just y, and a distribution over just x or just y. A conditional distribution is a function of x and y, but it is a distribution over just x in the case of p(x|y), or over just y in the case of p(y|x).

So what actually integrates to 1? Every probability distribution integrates to 1 when you integrate it with respect to the random variable over which it is a distribution. To integrate the joint distribution to 1, we integrate over both x and y. To integrate a marginal distribution to 1, we integrate over just the one variable that remains. And to integrate a conditional distribution to 1, we integrate with respect to the variable that is not being conditioned on: to integrate p(x|y) to 1, we integrate over x, and likewise with p(y|x), we integrate over y.

The definition of independence is that two random variables x and y are independent if and only if the joint distribution is equal to the product of the marginal distributions: p(x,y) = p(x)p(y).
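To make these definitions concrete, here is a minimal sketch in Python. It assumes a small made-up discrete joint distribution (the numbers are purely illustrative), so sums play the role of the integrals discussed above:

```python
import numpy as np

# A small, made-up discrete joint distribution p(x, y):
# rows index values of x, columns index values of y.
joint = np.array([
    [0.10, 0.05, 0.05],
    [0.05, 0.20, 0.15],
    [0.05, 0.15, 0.20],
])
assert np.isclose(joint.sum(), 1.0)  # the joint sums (integrates) to 1

# Marginals: sum the joint over the other variable.
p_x = joint.sum(axis=1)  # p(x) = sum over y of p(x, y)
p_y = joint.sum(axis=0)  # p(y) = sum over x of p(x, y)
assert np.isclose(p_x.sum(), 1.0) and np.isclose(p_y.sum(), 1.0)

# Conditionals: divide the joint by the marginal of the conditioning variable.
p_x_given_y = joint / p_y[np.newaxis, :]  # p(x|y): column j is p(x | y = j)
p_y_given_x = joint / p_x[:, np.newaxis]  # p(y|x): row i is p(y | x = i)

# Each column of p(x|y) sums to 1 over x; each row of p(y|x) sums to 1 over y.
assert np.allclose(p_x_given_y.sum(axis=0), 1.0)
assert np.allclose(p_y_given_x.sum(axis=1), 1.0)

# Independence check: x and y are independent iff p(x, y) = p(x) p(y) everywhere.
print("independent?", np.allclose(joint, np.outer(p_x, p_y)))  # False for this joint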
So if they're independent, you can just multiply the distributions over your individual random variables together to get the joint distribution. This is equivalent to saying that p(x|y) is just equal to p(x), so knowing y doesn't have any influence on x, and that p(y|x) is just equal to the marginal p(y). These are our official definitions of independence of two random variables.

Next, there are a couple of rules about probability distributions over two random variables that will be very helpful. As we go through these, think about where they have come up before in the lectures.

The first rule is the chain rule. The chain rule says that the joint distribution p(x,y) is equal to p(x|y)p(y): the conditional distribution of x given y, multiplied by the marginal distribution p(y). It is also equal to p(y|x)p(x). The chain rule is useful because it tells you how to split a joint distribution into a conditional distribution and a marginal distribution.

The second rule is the marginalization rule, which tells us how to calculate a marginal distribution from a joint distribution. It says that p(x), the probability distribution over x, is equal to the integral of p(x|y)p(y) with respect to y. So what is this? It is just the expected value, with respect to y, of the conditional distribution of x given y, which makes intuitive sense. And by the chain rule, it is also equal to the integral of p(x,y) dy. You can find the marginal distribution p(y) in the same way.

Last, but certainly not least, we have Bayes' rule, and it pops right out of the chain rule. To derive it, we start with the chain rule, p(x,y) = p(x|y)p(y), which is also equal to p(y|x)p(x). Dividing both sides by p(y) gives Bayes' rule: p(x|y) = p(y|x)p(x) / p(y). This is very important, because it tells us how to write one conditional distribution in terms of the other conditional distribution, as well as the marginal distributions. For example, if x is your stimulus and y is your response, you might want to know the probability that a certain stimulus was presented given the response, and Bayes' rule lets you express that in terms of the probability of the response given the stimulus. It tells you how to go back and forth between your two variables. In the first couple of weeks, this was very important when we were working out how to recover the nonlinearity in linear/nonlinear Poisson spiking systems.

Because Bayes' rule is often applied to the problem of learning about a random variable from an observation, together with some prior knowledge, each of the terms here has a separate name. p(x) is called the prior over x: it captures your prior knowledge of what you thought the variable x was likely to be. The conditional distribution p(x|y), once you have seen an observation, is called the posterior: it is your new distribution over x given that you have observed y. So maybe you change your guess about what the stimulus was after you observe a response. p(y|x) is called the likelihood, specifically the likelihood of x: the likelihood of a variable x is the probability of y given x. So the likelihood of a stimulus would be the probability that that stimulus generated the response you measured. And finally, the denominator p(y) is called the evidence.
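Here is a quick numerical check of all three rules, self-contained and using the same illustrative joint distribution as the sketch above; the last check previews the point about the evidence made next:

```python
import numpy as np

# Same made-up discrete joint p(x, y) as before (rows = x, columns = y).
joint = np.array([
    [0.10, 0.05, 0.05],
    [0.05, 0.20, 0.15],
    [0.05, 0.15, 0.20],
])
p_x = joint.sum(axis=1)                   # marginal p(x)
p_y = joint.sum(axis=0)                   # marginal p(y)
p_x_given_y = joint / p_y                 # p(x|y), columns indexed by y
p_y_given_x = joint / p_x[:, np.newaxis]  # p(y|x), rows indexed by x

# Chain rule: p(x, y) = p(x|y) p(y) = p(y|x) p(x).
assert np.allclose(joint, p_x_given_y * p_y)
assert np.allclose(joint, p_y_given_x * p_x[:, np.newaxis])

# Marginalization: p(x) = sum over y of p(x|y) p(y),
# i.e. the expected value of p(x|y) under p(y).
assert np.allclose(p_x, (p_x_given_y * p_y).sum(axis=1))

# Bayes' rule: p(x|y) = p(y|x) p(x) / p(y).
bayes = p_y_given_x * p_x[:, np.newaxis] / p_y
assert np.allclose(bayes, p_x_given_y)

# The evidence p(y) as a normalizer: summing the numerator p(y|x) p(x)
# over x recovers p(y), so normalizing each column of the numerator
# gives the posterior p(x|y) without ever writing p(y) down explicitly.
numerator = p_y_given_x * p_x[:, np.newaxis]
assert np.allclose(numerator.sum(axis=0), p_y)
print("chain rule, marginalization, and Bayes' rule all check out")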
The evidence is actually very rarely important on its own, because it only comes into play as a normalization factor for the posterior p(x|y). What that means is that if you know the expression for the numerator, p(y|x)p(x), you can find p(y) just by forcing your posterior to integrate to 1. So these are the three big rules in probability theory that will get you far in neuroscience and in life. And that is all for now; that is all you need to know.