So, that's the binomial distribution. Let's talk about the most famous, and probably the most handy, of all distributions: the so-called normal, or Gaussian, distribution. The term Gaussian comes from the great mathematician Gauss. And it's kind of interesting to note that Gauss didn't invent the normal distribution. The invention of the normal distribution is kind of a debated topic. For example, Bernoulli had used something not unlike the Gaussian distribution as a probabilistic inequality, without formalizing it as a density. If you're interested in this, the book by Stephen Stigler on the history of statistics has a nice summary of exactly where, when, and with whom the Gaussian distribution originated. But it's clear that Gauss was instrumental in its early development and use. So, a random variable is said to follow a normal or Gaussian distribution with parameters mu and sigma squared if its density looks like this: (two pi sigma squared) to the minus one-half, times e to the negative (x minus mu) squared over two sigma squared. This density looks like a bell and is centered at mu, and sigma squared controls how flat or peaked it is. It turns out that mu is exactly the mean of this distribution and sigma squared is exactly its variance. So, you only need two parameters, a shift parameter and a scale parameter, to characterize a normal distribution. We might write x, then this little squiggle, then N(mu, sigma squared) as shorthand for saying that a random variable follows a normal distribution with mean mu and variance sigma squared. And, in fact, one instance of the normal distribution is sort of the root instance from which all the others are derived, and that's the one where mu equals zero and sigma equals one. We call that the standard normal distribution.
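To make the formula concrete, here's a minimal Python sketch, not part of the lecture itself, that evaluates the density directly and compares it against the standard library's NormalDist; the evaluation points and the mu and sigma values are arbitrary choices.

```python
import math
from statistics import NormalDist

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Density (2*pi*sigma^2)^(-1/2) * exp(-(x - mu)^2 / (2*sigma^2))."""
    return (2 * math.pi * sigma**2) ** -0.5 * math.exp(-(x - mu)**2 / (2 * sigma**2))

# Standard normal at x = 0: 1/sqrt(2*pi), about 0.3989
print(normal_pdf(0.0))

# Matches the library's implementation for a nonstandard normal, too
print(abs(normal_pdf(1.2, mu=3, sigma=2) - NormalDist(3, 2).pdf(1.2)) < 1e-12)  # True
```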
It's centered at zero, its variance is one, and all other normal distributions are simple shifts and rescalings of the standard normal distribution. But then again, you could pick a different root, maybe mu equal to five and sigma equal to two; you could still get every other normal distribution from that one by shifting and scaling appropriately, but it wouldn't be as convenient. This is the most convenient way to define a sort of root of the normal distribution. The standard normal density is so common that we often reserve a Greek letter for it: the lowercase phi for the standard normal density, and the uppercase Phi for the standard normal distribution function. And standard normal random variables are often labeled with a z. You sometimes even hear introductory statistics textbooks refer to them as z-variables or z-scores, or something like that, because this notation has become so common. Here's the normal distribution. It looks like a bell; that's how it gets its name, the bell-shaped curve. Here, I've drawn reference lines at one, two, and three standard deviations, with negatives below the mean and positives above. Again, because this is a standard normal distribution, the mean is zero, so one represents one standard deviation away from the mean, two is two standard deviations away from the mean, and three is three standard deviations away from the mean.
Instead of thinking of these numbers as just z values, if we think about them in the units of the original data, as one, two, and three standard deviations from the mean, it doesn't matter whether we're talking about a standard normal or a nonstandard normal; they all follow the same rules. About 68 percent of the distribution lies within one standard deviation of the mean, about 95 percent lies within two standard deviations, i.e., between minus two and plus two, and almost all of the distribution, about 99 percent of it, lies within three standard deviations. We can get from a nonstandard normal to a standard normal very easily. If x is normal with mean mu and variance sigma squared, then z equal to x minus mu over sigma is, in fact, standard normal. Given the information from this class, you can check immediately that z has the right mean and variance. If you take the expected value of z, you get the expected value of x minus mu, divided by sigma. You can pull the sigma out, and then you have the expected value of x minus mu, which is the expected value of x minus the expected value of mu. Mu is not random, so its expected value is just mu, and mu is defined as the expected value of x, so the whole thing is zero. The same goes for the variance. If you take the variance of z, you get the variance of x minus mu, divided by sigma. If we pull the sigma out of the variance, it becomes sigma squared, and we have the variance of x minus mu. We learned a rule about variances: shifting a random variable by a constant, in this case subtracting mu, doesn't change the variance at all. So, we get the variance of x divided by sigma squared, and the variance of x is sigma squared, so we get sigma squared divided by sigma squared, which is one.
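These facts are easy to check numerically. Here's a minimal Python sketch using only the standard library; the mu equal to five, sigma equal to two example at the end is an arbitrary choice.

```python
from statistics import NormalDist

z = NormalDist()  # standard normal: mean 0, standard deviation 1
for k in (1, 2, 3):
    # P(-k < Z < k): about 0.68, 0.95, and 0.997
    print(k, round(z.cdf(k) - z.cdf(-k), 4))

# Standardizing: if X ~ N(mu, sigma^2), then P(X <= x0) = P(Z <= (x0 - mu)/sigma)
x = NormalDist(mu=5, sigma=2)
print(abs(x.cdf(7.3) - z.cdf((7.3 - 5) / 2)) < 1e-9)  # True
```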
So, at the bare minimum, we can check that z has mean zero and variance one. By the way, there was nothing intrinsic to the normal distribution in that calculation. So, we've also just learned an interesting fact: take any random variable, subtract off its population mean, and divide by its standard deviation, and the result is a random variable that has mean zero and variance one. In this case, in addition, if x happens to be normal, then z also happens to be normal. Similarly, we can take this equation, z equals x minus mu over sigma, multiply by sigma, and then add mu, and get that if we take a standard normal z, scale it by sigma, and then shift it by mu, we wind up with a nonstandard normal. The top calculation takes a nonstandard normal and converts it into a standard normal; the bottom equation starts with a standard normal and converts it into a nonstandard normal. Another interesting fact is that the nonstandard normal density can be obtained just by plugging into the standard normal density. If you take the standard normal density phi and, instead of plugging in z, you plug in x minus mu over sigma, and then divide the whole thing by sigma, that is exactly the nonstandard normal density. And this is a way to generate densities, just as an interesting aside. Here, mu is a shift parameter: all mu does is shift the distribution to the left or the right, just like whenever you subtract a constant from the argument of a mathematical function, it moves the function left or right. And then, sigma is a scale factor. Basically, whenever you take some density, and I guess it works for any density, but it makes the most sense with a density that has mean zero and variance one.
Then, you create a new family where you plug in x minus mu over sigma and divide the density by sigma, and you wind up with a new family of densities that now have mean mu and variance sigma squared. So, this is an interesting way of taking a root density with mean zero and variance one and creating a whole family of densities with mean mu and variance sigma squared; these are usually called location-scale families. At any rate, in this case we are only interested in the normal distribution, and this formula right here is exactly how you can go from the standard normal density and use it to create a nonstandard normal density by plugging into its formula. Let's talk about some basic facts about the normal distribution that you should memorize. About 68%, 95%, and 99 percent of the normal density lies within one, two, and three standard deviations of the mean, respectively, and it's symmetric about mu. So, for example, take one standard deviation: about 34%, one-half of 68 percent, lies within one standard deviation above the mean, and about 34 percent lies within one standard deviation below the mean. Each of these numbers splits equally above versus below the mean. Then, there are certain quantiles of the normal distribution that are common to have memorized. So, -1.28, -1.645, -1.96, and -2.33 are the 10th, 5th, 2.5th, and 1st percentiles of the standard normal distribution. And again, by symmetry, we just flip it around: if -1.28 is the 10th percentile, then 1.28 has to be the 90th percentile. So, by symmetry, 1.28, 1.645, 1.96, and 2.33 are the 90th, 95th, 97.5th, and 99th percentiles of the standard normal distribution. One in particular that you really need to memorize is 1.96.
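These memorized values are easy to verify; here's a quick sketch with Python's standard library (playing the role of the quantile lookups the lecture does in R).

```python
from statistics import NormalDist

z = NormalDist()
# The percentiles worth memorizing, rounded to three places
for p in (0.10, 0.05, 0.025, 0.01):
    print(p, round(z.inv_cdf(p), 3))
# prints -1.282, -1.645, -1.96, and -2.326 (the lecture rounds the last to -2.33)

# And, by symmetry, about 95% of the mass lies between -1.96 and +1.96
print(round(z.cdf(1.96) - z.cdf(-1.96), 4))  # 0.95
```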
The reason it's useful is that if you take -1.96 and +1.96, the probability of lying outside that range, below -1.96 or above +1.96, is five percent: 2.5 percent below it and 2.5 percent above it. So, the probability of lying between -1.96 and +1.96 is 95%. At any rate, it's used to do things like create confidence intervals and other entities that are very useful in statistics, and people have stuck with 95 percent as a reasonable benchmark for confidence intervals. Five percent is a reasonable cutoff for a statistical test, and if you're doing a two-sided test, you need to account for both sides, so you use 1.96. And then, the other fact is that 1.96 is close enough to two that we just round up. So, a lot of times, for things like confidence intervals, you might hear people say, we'll just add and subtract two standard errors; they're getting that two from this 1.96 right here. So anyway, that one in particular you should memorize, but you should probably just memorize all of them. Let's go through some simple examples. We'll go through two, and you should be able to do lots of these after that. So, let's take an example: what's the 95th percentile of a normal distribution with mean mu and variance sigma squared? Recall, what do we want to solve for if we want a percentile? We want the point x naught such that the probability that a random variable x from that distribution is less than or equal to x naught is 95 percent, or 0.95. Okay. Now, it's kind of hard to work with nonstandard normals, so: the probability that x is less than or equal to x naught is 0.95. Why don't we subtract mu from both sides of this inequality and divide both sides by sigma? On the left-hand side of this inequality, x minus mu over sigma, well, that's just a z random variable, now, a standard normal random variable.
So, the probability that x is less than or equal to x naught is the same as the probability that a standard normal is less than or equal to x naught minus mu over sigma, and we want that to be 0.95. Well, if you go back to my previous slide, the 95th percentile of the standard normal is 1.645. So, we just need this number, x naught minus mu over sigma, to be equal to 1.645 to make this equation work. So, let's just set it equal to 1.645 and solve for x naught, and we get x naught equals mu plus sigma times 1.645. Now, you could ask lots of questions with specific values of mu and sigma, but you'll wind up with the same exact calculation. Here, we used 1.645 because we wanted the 95th percentile, but in general, x naught is going to be equal to mu plus sigma times z naught, where z naught is the appropriate standard normal quantile that you want. And then, you can get them very easily. The other thing I would mention is that you should be able to do these calculations more than anything so that you've internalized what quantiles of distributions are, how to go back and forth between standard and nonstandard normals, and the ideas of location-scale densities and that sort of thing. In reality and practice, it's pretty easy to get these quantiles, because, for example, in R you would just type qnorm(0.95) and give it a mean and a standard deviation. Or, if you did qnorm(0.95) without a mean and a standard deviation, it'll return 1.645 and you can do the remainder of the calculation yourself, but even that's a little obnoxious, so you can just plug in a mu and a sigma. So, these calculations aren't so necessary from a practical point of view; even very rudimentary calculators will give you normal quantiles, nonstandard normal quantiles. The hope is that you'll understand the probability manipulations, and that you'll understand what a quantile means.
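Here's the same percentile calculation as a Python sketch; NormalDist.inv_cdf plays the role of R's qnorm, and the mu equal to 10, sigma equal to 2 values are made up for illustration.

```python
from statistics import NormalDist

mu, sigma = 10, 2  # hypothetical values, just for illustration
z95 = NormalDist().inv_cdf(0.95)  # standard normal 95th percentile, about 1.645
x0 = mu + sigma * z95             # the hand calculation from the lecture

# Agrees with asking the nonstandard normal for its 95th percentile directly
print(abs(x0 - NormalDist(mu, sigma).inv_cdf(0.95)) < 1e-9)  # True
```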
You'll understand what the goals of these problems are, and how to go back and forth between the standard and nonstandard normal. That's what we're going for here. It's clear, I think everyone agrees, that you can very easily just look these things up without having to bother with any of these calculations. Let's go with another easy calculation. What's the probability that a normal mu, sigma squared random variable is more than two standard deviations above the mean? In other words, we want to know the probability that x is greater than mu plus two sigma. Again, do the same trick, where we subtract off mu and divide by sigma on both sides, and we get that the answer is the probability that a standard normal is bigger than two, which is about 2.5%. And so you can see the rule here. If you want to know the probability that a random variable is bigger than any specific number, or smaller than any specific number, or between any two numbers, take those numbers and convert them into standard deviations from the mean. That can, of course, be fractional; it could be 1.12 standard deviations from the mean or whatever. The way you do that is by subtracting off mu and dividing by sigma, which converts the calculation into a standard normal calculation. So, suppose you wanted to know the probability that a random variable is bigger than, say, 3.1, just to pick a random, complicated-sounding number. Let's suppose you're talking about the height of a kid and you want to know the probability of being taller than 3.1 feet. What you would need is the population mean mu and the standard deviation sigma: take 3.1, subtract off mu, and divide by sigma. Now, you've converted that quantity 3.1, which is in feet, to standard deviation units, and you can do the remainder of the calculation using the standard normal.
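Here's that calculation as a Python sketch; the population mean and standard deviation for the height example are made-up numbers, since the lecture leaves them unspecified.

```python
from statistics import NormalDist

z = NormalDist()
# P(Z > 2): more than two standard deviations above the mean
print(round(1 - z.cdf(2), 4))  # 0.0228, i.e. roughly 2.5%

# The height example, with hypothetical population values
mu, sigma = 2.8, 0.25  # made-up mean and SD of kids' heights, in feet
print(round(1 - z.cdf((3.1 - mu) / sigma), 4))  # P(height > 3.1 ft)
```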
So, I would hope that you familiarize yourself with these calculations. And I recognize that, in a sense, they're kind of ridiculous to do by hand because you can get them from the computer so quickly, and we'll give you the R code that you need to do these calculations very quickly on the computer. But I think it's actually worth doing them by hand, just to get used to working with densities and to what these calculations refer to. So, let me just catalog some properties of the normal distribution; a lot is known about the normal distribution. I'll outline some of the simpler stuff, and some of the later points we probably won't get to in this class, but I thought I'd at least mention them. At any rate, the normal distribution is symmetric and peaked about its mean, which means that the population mean, the median, and the mode are all equal, right at that peak. A constant times a normally distributed random variable is also normally distributed, and you can tell me what happens to the mean and the variance: if x is a normal random variable, what distribution does a times x have? I'll tell you that it's normal; what are the resulting mean and variance? It turns out that sums of normally distributed random variables are again normally distributed, and this is true regardless of the dependence structure of the data, provided the random variables are jointly normally distributed. It's important that they are jointly normally distributed: they could be independent, they could not be independent, but they need to be jointly normal. Sums, or any linear function, of jointly normal random variables turn out to be normally distributed, and again, you can calculate the mean and the variance. Sample means of normally distributed random variables are again normally distributed.
Again, this is true regardless of whether they're jointly normal and possibly dependent, or simply a bunch of independent normal random variables; this is true of sample means. However, let me just jump to point seven. It also turns out that if you have independent, identically distributed observations, then properly normalized sample means will have a distribution that looks like a Gaussian distribution, not exactly but pretty much, regardless of the underlying distribution that the data comes from. Take as an example a die roll: the distribution of a single die roll doesn't look very Gaussian; it looks like a uniform distribution on the numbers one to six. Now, take a die, roll it ten times, take the average, and then repeat that process over and over again, and think about the distribution of this average of die rolls. Well, it turns out it'll look quite Gaussian. It'll look very normal. At any rate, that's the rule: sample means, properly normalized, with some conditions that we're probably going to gloss over, will limit to a normal distribution. And that's how the normal distribution became the sort of Swiss army knife of distributions: pretty much anything you can relate back to a mean of independent things tends to look normal-ish in distribution. And mathematically, formally, if the observations are independent and identically distributed and you normalize the mean in the correct way, then you get exactly the standard normal distribution in the limit. That is an incredibly useful result, and a very historically important one, called the central limit theorem. So, let's see, back to point five. If you take a standard normal and square it, you wind up with something that's called a chi-squared distribution; you might have heard of that before.
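The die-roll experiment described above is easy to simulate; here's a minimal sketch, where ten rolls per average and 100,000 repetitions are arbitrary choices.

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible

def avg_of_rolls(n=10):
    # Average of n fair six-sided die rolls
    return sum(random.randint(1, 6) for _ in range(n)) / n

means = [avg_of_rolls() for _ in range(100_000)]
m = sum(means) / len(means)
sd = (sum((x - m) ** 2 for x in means) / (len(means) - 1)) ** 0.5

# A single roll has mean 3.5 and variance 35/12, so the average of ten rolls
# should have mean near 3.5 and SD near sqrt(35/120), about 0.54; a histogram
# of `means` would look quite bell-shaped.
print(round(m, 2), round(sd, 2))
```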
And if you take a standard or a nonstandard normally distributed random variable and exponentiate it, take e to the x where x is normal, then you wind up with something that's log-normal. Log-normal is a bit of a pain in terms of its name. Log-normal means: take the log of a log-normal random variable and it becomes normal. It doesn't mean the log of a normal random variable. It's a little annoying, right? And you can't log a normal random variable, by the way, because there's a nonzero probability that it's negative, and you can't take the log of a negative number. The name makes it sound like a log-normal is the log of a normal; it's not. Log-normal means: take my log and then I'm normal. Okay. Let's talk about ML properties associated with normal random variables. If you have a bunch of IID normal mu, sigma squared random variables, and let's assume the variance is known, so we can set it aside for the moment, then the likelihood associated with mu is written right here. You just take the product of the likelihoods for each of the individual observations, and so you wind up with the product of (two pi sigma squared) to the minus one-half times e to the minus (xi minus mu) squared over two sigma squared. If you move that product into the exponent, you get minus the summation from i equals one to n of (xi minus mu) squared over two sigma squared. Remember, we're assuming that the variance is known, so the (two pi sigma squared) to the minus n over two factor that you would have gotten, we can just throw out, because the likelihood doesn't care about factors of proportionality that don't depend on mu, and mu is the parameter we're interested in. By the way, this little symbol right here, the proportional-to symbol, is what I mean by that: it means I've dropped out multiplicative factors that are not related to mu. And I'll try to use that symbol carefully where it's contextually obvious which variable I'm considering important.
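For reference, the spoken derivation so far, written out (with sigma squared treated as known):

```latex
\mathcal{L}(\mu)
  = \prod_{i=1}^{n} (2\pi\sigma^2)^{-1/2}
    \, e^{-(x_i - \mu)^2 / (2\sigma^2)}
  \propto e^{-\sum_{i=1}^{n} (x_i - \mu)^2 / (2\sigma^2)}
```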
Okay, so, let's just expand out this square, and you get minus the summation of xi squared over two sigma squared, plus mu times the summation of xi over sigma squared, minus n mu squared over two sigma squared. Now, this first term, negative summation of xi squared over two sigma squared, again, doesn't depend on mu, so we can just throw it out: it's e to that power times e to the latter two powers, so that first part is a multiplicative factor that we can just chuck. Then the other thing here is that it's a little annoying to write summation xi, so why don't we write that as n x bar? Because if you take x bar, the sample average, and multiply it by n, you get the sum. Okay, so the likelihood works out to be e to the quantity mu n x bar over sigma squared minus n mu squared over two sigma squared. So, that's the likelihood. Let's ask ourselves: what's the ML estimate of mu when sigma squared is known? Well, as we almost always do, since the likelihood is kind of annoying to work with, why don't we work with the log-likelihood? We take the log from the previous page, and we get mu n x bar over sigma squared minus n mu squared over two sigma squared. If you differentiate this with respect to mu, you wind up with an equation that is clearly solved by mu equal to x bar, and so what it tells us is that x bar is the ML estimate of mu. So, if your data are normally distributed, your estimate of the population mean is the sample mean. That makes a lot of sense. We would hope that the result would work out that way. But also notice that because this calculation didn't depend on sigma, this is also the ML estimate when sigma is unknown; it's not just the ML estimate when sigma is known. So, we know what our ML estimate of mu is. Let me just tell you what the ML estimate of sigma squared is: it works out to be the summation of (xi minus x bar) squared, divided by n.
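The claim that x bar maximizes the likelihood is easy to check numerically; here's a sketch with made-up data and a made-up known sigma, comparing the log-likelihood at the sample mean against nearby candidate values of mu.

```python
data = [2.1, 3.4, 1.9, 4.0, 2.8]  # hypothetical observations
sigma = 1.0                        # assume the variance is known

def log_lik(mu):
    # Log-likelihood for mu, up to additive constants that don't involve mu
    return sum(-(x - mu) ** 2 / (2 * sigma ** 2) for x in data)

xbar = sum(data) / len(data)
candidates = [xbar + d for d in (-0.5, -0.1, 0.0, 0.1, 0.5)]
print(max(candidates, key=log_lik) == xbar)  # True: x bar wins
```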
And you might recognize this estimate of sigma squared as the sample variance, but instead of our standard trick of dividing by n - 1, we're now dividing by n, which is a little frustrating: there's a kind of mixed message here, in that the maximum likelihood estimate of sigma squared is the so-called biased estimate of the variance rather than the unbiased one, where you divide by n - 1. Now, notice that as n increases, this becomes irrelevant. The factor that distinguishes the two estimates is (n - 1)/n, and that factor goes to one as n gets larger and larger. So, I've had several colleagues tell me that they would actually just prefer this maximum likelihood estimate, and their argument is something along the lines of: the n - 1 estimate is unbiased, but this one has a lower variance. What they mean is this. The biased version of the sample variance is only a function of random variables, so it is itself a random variable, and as a random variable it has a mean and a variance. The fact that its mean is not exactly sigma squared means that it's biased. But it has a variance, and its variance is slightly smaller than the variance of the unbiased version of the sample variance. And so, this is an example of something that pops up all the time in statistics: you can trade bias versus variance. In this case, one variance estimate is slightly biased but has a lower variance; the other is unbiased, but the variance estimate itself has a larger variance. It's very frequent in statistics that you have this kind of trade-off: as you increase the bias, you tend to decrease the variance, and vice versa. So, the other thing I wanted to mention is that here, we've kind of separated out inference for mu from inference for sigma. If you wanted to do full likelihood inference, then you would have exactly a bivariate likelihood, a likelihood that depends on both mu and sigma. And it's a little bit difficult to visualize, but it is just a surface, right?
Where you have mu on one axis, sigma on another axis, and the likelihood on the vertical axis, it would just be a likelihood surface instead of a likelihood function. And it's a little bit hard to visualize these kind of 3D-looking things, so there are methods for getting rid of sigma and looking at just the likelihood associated with mu, and for getting rid of mu and looking at the likelihood for just sigma; later on, we'll discuss those methods. But for the time being, it's not terribly important. What I would hope you remember is that if you assume your data are normally distributed, then we gave you the likelihood for mu, assuming sigma is known. We calculated that the ML estimate of mu was, in fact, x bar, and that the ML estimate of sigma squared was pretty much the sample variance, off by a little bit from the standard sample variance, but pretty much the sample variance. And then, the ML estimate of sigma, not sigma squared but sigma, is just the square root of the ML estimate of sigma squared. Well, that's the end of our whirlwind tour of probably the two most important distributions. There are some other ones that we'll cover later. Next lecture, we're going to travel to a place called Asymptopia. Everything's much nicer in Asymptopia, and so I think you'll quite like it there.