Okay, so you would get exactly the distribution of the sample median of ten

die rolls. And if you wanted to know what the sampling distribution of the median of twenty die rolls is, well, you'd have to roll the die twenty times, get a sample median, and repeat that process over and over again; that would do it for you. Okay, so now we know, if we can actually

sample from the population distribution over and over and over again, how we would

get the sampling distribution of a statistic.

But when confronted with real data, we can't roll the die.

Right. We don't know what the population

distribution is, so we can't do it. But what we can do is roll a die where we've put on every side the number associated with an observed data point. Then we're not drawing from the population distribution; we're drawing from the empirical distribution. Okay, so if we had ten data points and we want to know what the distribution of the sample median of ten observations is, well, we can't draw from the population distribution, but what we can do is draw samples of size ten from the distribution defined by the data we observed, and look at what the distribution of the sample median is for those.
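That "die relabeled with your data" picture can be sketched in a few lines. Here's a minimal Python sketch with made-up observations (the lecture's own code is in R and appears later); drawing with replacement from the observed values is exactly a draw from the empirical distribution.

```python
import random
import statistics

random.seed(1)

# Ten hypothetical observed data points, one per side of our "data die".
observed = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3]

# Roll the data die ten times: sampling with replacement from the observed
# values is a draw of size ten from the empirical distribution.
resample = random.choices(observed, k=len(observed))

# One draw from the (empirical) sampling distribution of the median.
print(statistics.median(resample))
```

Repeating those last two lines over and over is what builds up the sampling distribution of the median.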

And that is exactly what the bootstrap does in practice, via resampling. It basically says, well, we know exactly what we would do if we actually knew what the population distribution was. Why don't we just do that, using the empirical distribution instead, and see how that works. It's sort of a really nifty idea.

So again, let's just take our 630 measurements of grey matter volume from

workers at a lead manufacturing plant. The median grey matter volume is

about 589 cubic centimeters. And we want a confidence interval for the

median of these measurements. How do we do that?

So here's our bootstrap procedure for calculating a confidence interval for the median of a data set of N observations, where we know nothing about the sampling distribution of medians of N observations.

So, we would sample N observations with replacement from the observed data,

resulting in one simulated complete data set.

We would take the median of this simulated complete data set.

That would give us one bootstrap resample, and one bootstrap resampled sample median.

Then we would repeat that step B times, let's say, resulting in B simulated medians of N observations, those N observations having been drawn with replacement from the collection of observed data. Now these medians are, let's say, approximately draws from the sampling distribution of the median of N observations. They're exactly draws from the sampling distribution of the median of N observations from the distribution of the observed data, but we're going to say that's approximately equal to the sampling distribution of the median of N observations drawn from the population distribution.

That's the leap of faith we're making: that this bootstrap process approximates what we would get if, instead of drawing from the observed data, we were drawing from the actual population distribution. And we could take these B sample medians and draw a histogram of them, and then say, if we wanted a 95% confidence interval, why not take the 2.5th and 97.5th percentiles and call that a confidence interval for the median.

That's exactly a so-called bootstrap percentile confidence interval.
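The procedure just described, resample with replacement, take the median, repeat B times, then take the 2.5th and 97.5th percentiles, can be sketched as follows. The lecture's actual code is in R; this is an equivalent Python sketch, and the data here are simulated stand-ins for the grey matter measurements, not the real values.

```python
import random
import statistics

random.seed(42)

# Hypothetical data standing in for the 630 grey matter volume measurements.
data = [random.gauss(589, 10) for _ in range(630)]

B = 1000        # number of bootstrap resamples
n = len(data)

# Steps 1-2, repeated B times: resample n observations with replacement
# from the observed data and take the median of each simulated data set.
medians = [statistics.median(random.choices(data, k=n)) for _ in range(B)]

# Bootstrap estimate of the standard error of the sample median.
se = statistics.stdev(medians)

# Percentile interval: the 2.5th and 97.5th percentiles of the B medians.
medians.sort()
lower = medians[int(0.025 * B)]
upper = medians[int(0.975 * B)]
print(f"bootstrap SE ~ {se:.2f}, 95% percentile CI ~ ({lower:.1f}, {upper:.1f})")
```

With the real data, this is the calculation that produces the interval of roughly 582 to 595 quoted below.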

So it's hard to describe, and I know I'm butchering it, and if I were Efron I'd be

doing a much better job of this, but unfortunately you have me and not Efron. It's difficult to describe, for me at least.

On the next page I'm showing you the R code for doing this. I've even neatened up the R code a little bit, so it's probably a little longer than it needs to be; you could do this in about four lines.

So here B is my number of bootstrap re-samples.

I said let's just do it a thousand times but, y'know, you wanna set this number B

to be big enough so you don't have to worry about the error in your Monte Carlo

re-sampling. You don't want the number of times you've rolled the die to be a factor in what you're doing. So here I did 1,000, but, you know, crank it up until you're tired of waiting. There is a science to how you pick B, but we're not going to talk about it in this class. So N is the number of observations that I have. Okay.

Then resamples, this code right here, just draws with replacement from the collection of N observations; it draws B complete data sets of size N from that distribution. The replace = TRUE means that we're sampling with replacement. And then these resamples, I dump them all into a matrix, so that every row is a complete data set.

So there's B rows and N columns. And then I go through every row and calculate the median in this next line. That's then B medians, where each median was obtained from a resample of N observations from the observed data. And then if you take the standard deviation of these medians, that is a bootstrap estimate of the standard deviation of the sampling distribution of the median.

If you take the quantiles, the 2.5th and 97.5th quantile, you get 582 to 595.

That is a bootstrap confidence interval for the median of grey matter volumes, computed in the nonparametric way. And it's always informative in the

bootstrap to plot a histogram of your re-sampled, in this case, medians.

Okay so in here is my histogram of my resampled medians.

And then the 2.5th and 97.5th quantiles of my bootstrap resampled medians are drawn

here in dashed lines. 95% of my resampled medians lie between these two lines, and so we're going to call that a bootstrap confidence interval.

Now, I'm going to give you some notes on the bootstrap. So, for both the bootstrap and the jackknife, today's lecture is really just a teaser. As you can probably guess from my

description, they're sufficiently difficult techniques to where, you know,

you don't want to take these lectures and view them as enough knowledge to just run

out and use them willy nilly. I just wanted to give you a teaser so that

if you hear the terms you know what people are talking about.

So the bootstrap, the one that I described today, is non-parametric. It makes very few assumptions about the population distribution.

And the kind of theoretical arguments proving the validity of the bootstrap tend to rely on large samples, so there's a question about when and how you can apply

it, but I find it to be a very handy tool in general.

The confidence interval method that I gave you, these percentile confidence intervals, they're not very good. You can improve on bootstrap confidence intervals by correcting the endpoints of the intervals. And the bootstrap interval I would recommend is the so-called BCa confidence interval; the bootstrap package in R will calculate these for you directly if you like. That's what I mean when I say "better" here: better percentile bootstrap confidence intervals correct for bias. And then, there's lots and lots of

variations on the bootstrap procedure. There's parametric bootstrapping; there's bootstrapping for time series, where you have to do something different. There's all sorts of different ways to

think about the bootstrap, and data resampling in general.

And there's the book An Introduction to the Bootstrap, by Efron and Tibshirani. For anyone who's taken this class and absorbed the material, it is at a level that you should be able to understand. It's beautifully written.

It's a wonderful treatment of the subject, and then, in addition to this, there is

lots and lots of other books on the topic of the bootstrap.

Probably too many good ones to name. Some of them, unbelievably theoretical,

and other ones, quite accessible. I think this Efron and Tibshirani book

strikes a very nice balance between, you know, giving you the why-things-work and the how-to-do-things combination. And it also covers the jackknife and other

data resampling procedures. The last thing I wanted to mention is, I

gave you the exact code that you could use to generate for yourself the bootstrap

sampling distribution. You could, of course, use the bootstrap package in R, which takes about as many lines of code in this case as programming it up yourself, and on this last slide I go through actually using the bootstrap package. But the nice thing about the bootstrap

package is that it will actually give you this bias corrected interval.

In this case you can see that the bias corrected interval is nearly identical to

the percentile interval so it didn't make a big difference.
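The lecture gets this bias-corrected interval from R's bootstrap package; if you happen to work in Python, scipy's scipy.stats.bootstrap will compute a BCa interval directly. A sketch with simulated stand-in data, not the lecture's actual measurements:

```python
import numpy as np
from scipy.stats import bootstrap

rng = np.random.default_rng(0)

# Hypothetical data standing in for the 630 grey matter volume measurements.
data = rng.normal(loc=589, scale=10, size=630)

# method="BCa" requests the bias-corrected-and-accelerated interval,
# the analogue of what R's bootstrap/boot machinery reports.
res = bootstrap((data,), np.median, confidence_level=0.95,
                n_resamples=1000, method="BCa", random_state=rng)

print(res.confidence_interval)  # (low, high) interval for the median
```

As on the slide, with well-behaved data the BCa endpoints usually land very close to the plain percentile endpoints.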

But you can, of course, come up with instances where the bias corrected

interval is a little bit better. So that's the end of today's lecture.

That was a teaser on the idea of bootstrap resampling and a little bit on the use of the jackknife. You know, I hope this inspired you to go learn a little bit more about these tools; they are among the wide class of tools that became available as modern computing came about.

And there's the idea of being able to use our data, especially when we have large data sets, to use the data more fully, and to come up with things like sampling distributions instead of using mathematics and assumptions and that sort of thing. It was a neat idea brought about by the computational revolution, and it's a very nifty technique.

Well, next time will be our last lecture, and I look forward to talking about it with you.