Hi, my name is Brian Caffo and this is Mathematical Biostatistics Boot Camp: Lecture nine, on confidence intervals. In this lecture, we're going to talk about confidence intervals, mostly in the setting where we assume that our data come from a Gaussian distribution. So we'll talk about confidence intervals for variances, we'll talk about Gosset's t distribution, we'll use Gosset's t distribution to create confidence intervals for means, and we'll touch on the subject of profile likelihoods. In the last lecture, we talked a little bit about the Central Limit Theorem, and we used the Central Limit Theorem to create a confidence interval; I think in that example we created a confidence interval for a binomial proportion. Now we'll discuss the creation of better confidence intervals for small samples using Gosset's t distribution, that is, for small samples where we're willing to treat the data as if they're continuous. Gosset's t distribution is often called Student's t distribution, and we'll explain why in a little bit. To discuss the t distribution, we first have to go through what the chi-squared distribution is, so we'll develop that first. At any rate, what you'll hopefully have noticed is that whenever we create confidence intervals, there seems to be some kind of prevailing logic that we use. Basically, we create a probability statement, and then we, in a sense, manipulate the probability statement to generate an interval. Well, this strategy is codified here. Basically, we create a pivot, that is, a statistic whose distribution doesn't depend on the parameter of interest.
So, for example, with the Central Limit Theorem: if you take a sample mean, subtract off the population mean that you're interested in, and divide by the standard error, that statistic clearly depends on the parameter of interest. But the distribution of that statistic, at least in the limit, doesn't depend on the parameter that you're interested in. Then, after we've created that pivot, we solve the probability statement that the pivot lies between bounds for the parameter. And so that's the general strategy we'll go through. You don't have to really understand the strategy at a very general level, but just in case you're wondering why it always seems like we're generating confidence intervals using basically exactly the same technique, it's because we're employing a strategy like this. So let's talk about the chi-squared distribution. Remember that S^2 is the notation we have been using for the sample variance. And let's further assume that the data that comprise the sample variance are all IID normal with mean mu and variance sigma^2. Well then, (n - 1) times the sample variance divided by sigma^2 is a random variable that follows what we call a chi-squared distribution. The chi-squared distribution has an index, something that differentiates between different kinds of chi-squared distributions, and we call that index the degrees of freedom. So this statement right here will be read: the normalized sample variance follows a chi-squared distribution with n - 1 degrees of freedom. The chi-squared distribution is a skewed distribution, and of course, since the sample variance has to be positive, it has support between zero and infinity. The mean of the chi-squared distribution is its degrees of freedom, and we can see that very directly, because we recall the sample variance is an unbiased estimator; that's why we divide by n - 1 instead of n.
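To make the pivot idea concrete, here's a small simulation sketch in Python (standard library only; the means, standard deviation, and sample size are arbitrary choices for illustration, not from the lecture). It shows that the standardized sample mean has the same distribution whichever population mean generated the data, which is exactly what makes it usable as a pivot:

```python
import random
import statistics

random.seed(7)

def pivot_draws(mu, sigma=10.0, n=30, reps=4000):
    # Repeated draws of the pivot (xbar - mu) / (s / sqrt(n))
    # computed from normal(mu, sigma) samples of size n.
    out = []
    for _ in range(reps):
        x = [random.gauss(mu, sigma) for _ in range(n)]
        xbar = statistics.mean(x)
        s = statistics.stdev(x)
        out.append((xbar - mu) / (s / n ** 0.5))
    return out

# Two very different population means...
a = pivot_draws(mu=0.0)
b = pivot_draws(mu=50.0)

# ...but the pivot's spread is essentially the same either way (close to 1,
# since the pivot is approximately standard normal; exactly, it follows a
# t distribution, which we get to later in the lecture).
print(statistics.stdev(a), statistics.stdev(b))
```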
So if you look at this equation, when you take the expected value: the expected value of S^2 is sigma^2, so you can see that in the expected value of (n - 1) S^2 / sigma^2, the sigma^2 will cancel out and you'll get the degrees of freedom, n - 1, as the expected value. The variance of the chi-squared, by the way, is twice the degrees of freedom. As an aside, we're not actually going to spend a lot of time doing this, but you can use this idea to create a confidence interval for the variance. So imagine I were to draw a chi-squared density, and chi-squared_{n-1, alpha} is the alpha quantile from that distribution. Then take, say, alpha over two; let's take alpha to be 0.05, for example. Take the 2.5th percentile and the 97.5th percentile from the chi-squared distribution and look at the probability that this chi-squared random variable, (n - 1) S^2 / sigma^2, is between those two quantiles. Well, that has to be 1 - alpha, just by the definition of those being the 2.5th and 97.5th percentiles of the chi-squared distribution. So this equality holds: 1 - alpha equals this probability. So this statistic, (n - 1) S^2 / sigma^2, is our pivot. Let's solve for the parameter that we're interested in, sigma^2. When you do that, keep track of your inequalities, being sure to flip them if you invert everything, and you wind up with the following probability statement: there's a 1 - alpha probability that the random interval from (n - 1) S^2 divided by the upper quantile to (n - 1) S^2 divided by the lower quantile contains sigma^2. So we call this interval, (n - 1) S^2 divided by the two quantiles, a confidence interval for sigma^2. And because the probability that the random interval contains the parameter it's estimating is 1 - alpha, we call it a 100 times (1 - alpha) percent confidence interval.
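Written out, the pivot manipulation just described looks like this, using chi-squared_{n-1, alpha} for the alpha quantile of the chi-squared distribution with n - 1 degrees of freedom:

```latex
1 - \alpha
= P\left( \chi^2_{n-1,\,\alpha/2} \le \frac{(n-1)S^2}{\sigma^2} \le \chi^2_{n-1,\,1-\alpha/2} \right)
= P\left( \frac{(n-1)S^2}{\chi^2_{n-1,\,1-\alpha/2}} \le \sigma^2 \le \frac{(n-1)S^2}{\chi^2_{n-1,\,\alpha/2}} \right)
```

Note that inverting everything flips the inequalities, which is why the upper quantile ends up in the denominator of the lower endpoint and vice versa.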
So, as an example, alpha might be 0.05, and you would then wind up with a 95% confidence interval for the parameter sigma^2. Now, we should talk a little bit about what this confidence interval means. It's the interval that's random in the paradigm that we're thinking about here; the interval is random, and the parameter sigma^2 is fixed. So when you actually collect data and you form this confidence interval, it either contains sigma^2, which you don't know, or it doesn't. There's no probability associated with that statement anymore; it's either one or zero, it either contains sigma^2 or not. So what's the actual interpretation of a confidence interval? Well, if you take an intro stat class, they make a lot of hay out of this point. They basically say, okay, the confidence interval is a procedure such that, if you were to repeatedly do the experiment and form confidence intervals, 95% of the confidence intervals (say, if you're creating 95% confidence intervals) would contain the parameter that you're interested in. And you could, as an example, do this in R. You could generate normal data, say from a normal(0, sigma^2) distribution; you could formulate this confidence interval from the sample variance; and you could check whether or not that interval contained the sigma^2 that you used for simulation. You can repeat that process over and over and over again, and you will find that about 95% of the intervals that you get, if you construct 95% confidence intervals, will contain the sigma^2 that you used for simulation. That's the logic behind confidence intervals. They're notoriously a little hard to interpret if you go for this sort of hardball interpretation. A much weaker interpretation of the confidence interval, one that's a little less specific, is that you get two numbers out.
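The lecture describes doing that coverage simulation in R; here is a minimal sketch of the same idea in Python, using only the standard library. One assumption to flag: since there's no qchisq-style quantile function in the standard library, the chi-squared quantiles here are approximated by simulating the chi-squared distribution directly (as a sum of squared standard normals), so the numbers are illustrative rather than exact:

```python
import random
import statistics

random.seed(42)
n, sigma2, alpha = 20, 4.0, 0.05   # illustrative choices, not from the lecture
df = n - 1

# Approximate the chi-squared(df) quantiles empirically; in R this is
# just qchisq(c(alpha/2, 1 - alpha/2), df).
draws = sorted(sum(random.gauss(0, 1) ** 2 for _ in range(df))
               for _ in range(20000))
q_lo = draws[int(alpha / 2 * len(draws))]
q_hi = draws[int((1 - alpha / 2) * len(draws))]

# Repeat the experiment many times, forming the variance interval each time
# and checking whether it contains the true sigma^2 used for simulation.
covered = 0
reps = 2000
for _ in range(reps):
    x = [random.gauss(0, sigma2 ** 0.5) for _ in range(n)]
    s2 = statistics.variance(x)        # sample variance, divides by n - 1
    lower = df * s2 / q_hi
    upper = df * s2 / q_lo
    if lower <= sigma2 <= upper:
        covered += 1

print(covered / reps)  # should be near 0.95
```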
And these two numbers are an interval estimate of the parameter that you want to estimate, but the interval estimate incorporates uncertainty. So let's go through a couple of comments about this interval. One thing is that this interval is not terribly robust to departures from normality. So if your data are not normal, then this confidence interval tends to not be that great. Also, if you want a confidence interval for the standard deviation instead of the variance, you can just take the square root of the endpoints of the interval. The probability statement, that 1 - alpha equals the probability that the random interval contains sigma^2: well, you can still say that 1 - alpha is equal to the probability that the interval formed by the square roots of the endpoints contains sigma, and you haven't mathematically changed anything. So if you want an interval for sigma, you just square-root the endpoints. Now, you might be wondering: if this heavily requires normality, do we have any other solutions if we want a confidence interval for the variance? It turns out the answer is yes, in several ways, but bootstrapping is the way that I prefer. We're not going to talk about bootstrapping in today's lecture, though. So today we're only going to talk about this confidence interval, when you happen to be willing to stomach the assumption that your data are exactly Gaussian and you're willing to live with the consequence that the interval you obtain is not going to be terribly robust to departures from that assumption. The other thing I wanted to mention, which is kind of a nifty little point, is this: suppose you wanted to create a likelihood for sigma, where the underlying data are Gaussian with mean mu and variance sigma^2. That's hard, because you have two parameters. The likelihood is a bivariate function, right? It has mu on one axis, sigma on the other axis, and then the likelihood on the vertical axis.
So there's a little trick you can use to create what I would call a marginal likelihood for sigma^2. It turns out, and we're not going to cover the mathematics behind this, that if you take (n - 1) S^2 and don't divide by sigma^2, well, first of all, that can't be chi-squared. Let me just reason through that real quick. It can't be chi-squared because the chi-squared density doesn't have any units, right? S^2 has whatever units the original data has, squared. Say the data are in inches; then S^2 has inches-squared units. You haven't divided by anything that's in inches squared, so (n - 1) S^2 has inches-squared units, and it can't follow a distribution that's unit-less like the chi-squared distribution. That's one of the reasons you have to remember to divide by sigma^2: to get rid of the units and obtain the chi-squared distribution. So suppose we don't divide by sigma^2. Then you end up with a so-called gamma distribution, and the gamma is indexed by two parameters, its shape parameter and its scale parameter. In this case, the shape parameter is (n - 1) / 2 and the scale parameter is 2 sigma^2. Either way, what you have is data: a single number, (n - 1) S^2, and if you're willing to assume the data points that comprise that number are Gaussian, then you can take the gamma density, plug in the data, view it as a function of the parameter, and plot a likelihood function. So I'll go through an example of doing this. In our organolead manufacturing workers example that we've looked at before, there was an average total brain volume of 1,150 cubic centimeters with a standard deviation of 105.977. Let's assume normality of the underlying measurements, which is not the case, but let's do it anyway, and let's calculate a confidence interval for the population variation in total brain volume.
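As a quick sanity check on that gamma fact, you can simulate (n - 1) S^2 from Gaussian data and compare its average to the gamma mean, shape times scale, which works out to (n - 1) / 2 times 2 sigma^2, or (n - 1) sigma^2. Here's a small sketch in Python (standard library only; the sample size and variance are just illustrative choices):

```python
import random
import statistics

random.seed(1)
n, sigma2 = 10, 9.0     # illustrative values

# Simulate many realizations of (n - 1) * S^2 from normal(0, sigma2) data.
vals = []
for _ in range(5000):
    x = [random.gauss(0, sigma2 ** 0.5) for _ in range(n)]
    vals.append((n - 1) * statistics.variance(x))

# Gamma mean is shape * scale = ((n - 1) / 2) * (2 * sigma2) = (n - 1) * sigma2,
# which is 81 here; the simulated average should land close to that.
shape, scale = (n - 1) / 2, 2 * sigma2
print(statistics.mean(vals))
```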
I give the R code here. I was given the standard deviation, so our variance is 105.977^2. Our n in this case is 513. We want a 95% confidence interval, so our alpha is 0.05. The quantiles that we want we can just grab with the qchisq function; this call right here grabs the two quantiles. And then our interval is just (n - 1) S^2 divided by the quantiles. That puts it out from bigger to smaller, and I want it from smaller to bigger, so I use the rev function to reverse it. I think if I had just input my quantiles in the reverse order, I would have been okay too. And then here, I just take the square root of that interval to get an interval for the standard deviation, and we get that the interval is about 100 to 113. So this interval, 100 to 113, is created in a way such that, if the assumptions of the interval are correct, namely that the underlying data are IID normal with a fixed variance sigma^2 and a fixed mean mu, then if the procedure were repeated over and over again, 95% of the intervals we obtain would contain the true standard deviation that we're trying to estimate. Let's actually plot the likelihood as well, using this likelihood trick that I gave. So sigmaVals is the sequence of sigma values I want to plot over. And actually, I don't have to guess the range, because I just created this confidence interval on the previous page that went from 100 to 113, so let's, for good measure, go from 90 to 120. And I want to plot 1,000 points; in R, you have to be pretty specific about the range that you want to plot and how many points go into your plot. Then I give you the code here for evaluating the gamma likelihood. It says, basically: plug in the data, (n - 1) S^2, and remember the likelihood views that as fixed. The shape doesn't involve anything other than things we know, (n - 1) / 2.
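In case you want to follow along without R, here's a Python version of that calculation, using the standard deviation 105.977 and n = 513 from the lecture. One caveat: the Python standard library has no chi-squared quantile function, so this sketch substitutes the Wilson-Hilferty normal approximation for qchisq, which is very accurate at 512 degrees of freedom:

```python
from statistics import NormalDist

s, n, alpha = 105.977, 513, 0.05
df = n - 1

def qchisq_approx(p, df):
    # Wilson-Hilferty approximation to the chi-squared quantile function;
    # in R this would simply be qchisq(p, df).
    z = NormalDist().inv_cdf(p)
    return df * (1 - 2 / (9 * df) + z * (2 / (9 * df)) ** 0.5) ** 3

q_lo = qchisq_approx(alpha / 2, df)
q_hi = qchisq_approx(1 - alpha / 2, df)

# Interval for sigma^2: upper quantile gives the lower endpoint, and vice versa.
ci_var = (df * s ** 2 / q_hi, df * s ** 2 / q_lo)
# Square-root the endpoints to get an interval for sigma itself.
ci_sd = tuple(v ** 0.5 for v in ci_var)
print(ci_sd)  # roughly 100 to 113, matching the lecture
```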
And then the scale is the part that varies, 2 sigma^2. Here we evaluate it over all the sigmaVals that I assigned in the previous line, so this will evaluate the likelihood over 1,000 points and return a vector of length 1,000. I want to normalize my likelihood, and I'll do that, at least approximately, by taking this vector and dividing by its maximum value. And then I'll plot it; type = "l" means plot it as a line instead of as a bunch of points, and then these two lines commands add the one-eighth and one-sixteenth reference lines. On the next page, you actually see the marginal likelihood for sigma. So that's a whirlwind tour of confidence intervals and likelihoods for variances when you're willing to assume your data are exactly Gaussian. I hesitate to say this, but those slides aren't exactly terribly useful material; you won't find a lot of people plotting marginal likelihoods for sigma. I just gave it to you because it's kind of a nifty little result. And, to be honest, you don't see the Gaussian confidence intervals for variances as much; people would tend to do bootstrapping these days instead, or some other more robust technique. So this material is neat, but the primary thing it did was actually introduce the chi-squared distribution. Next, we're going to talk about something that's incredibly useful, probably one of the single most used distributions and techniques in all of
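If you want to reproduce the numbers behind that marginal likelihood plot without R, here's a sketch in Python using only the standard library (math.lgamma stands in for R's dgamma; the grid of 1,000 sigma values from 90 to 120 matches the lecture). One nice check, which follows from the gamma setup rather than being stated in the lecture: the likelihood should peak at the observed standard deviation, 105.977:

```python
import math

s, n = 105.977, 513                  # sd and sample size from the lecture
x = (n - 1) * s ** 2                 # the observed (n - 1) S^2, held fixed
shape = (n - 1) / 2                  # gamma shape: only things we know

def gamma_loglik(sigma):
    # Log gamma density of x with shape (n - 1)/2 and scale 2 * sigma^2,
    # viewed as a function of sigma.
    scale = 2 * sigma ** 2
    return ((shape - 1) * math.log(x) - x / scale
            - math.lgamma(shape) - shape * math.log(scale))

sigma_vals = [90 + i * 30 / 999 for i in range(1000)]   # grid from 90 to 120
loglik = [gamma_loglik(v) for v in sigma_vals]

# Normalize by the maximum so the likelihood tops out at 1, then find the peak.
peak = max(loglik)
likelihood = [math.exp(v - peak) for v in loglik]
best = sigma_vals[likelihood.index(max(likelihood))]
print(best)   # lands on the grid point nearest the observed sd
```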