So, in the previous two sections, we empirically estimated the sampling distributions of sample means, proportions, and incidence rates via computer simulation, where we were able to take multiple random samples from a theoretical, computer-based population.

That was just to illustrate some of the common properties of

these distributions that we'll formalize in this section.

But in real life, we generally won't take multiple samples from a population; we'll take only one sample from each population of interest.

So, if we're going to use characteristics of the sampling distribution to help us extrapolate from our sample to the population, we'll need to be able to estimate those characteristics from a single sample of data. Let's start talking about how to do this in this section.

So, upon completion of this lecture section, you will be able to: explain the Central Limit Theorem, sometimes called the CLT for short, with regard to the properties of theoretical sampling distributions; estimate the variability of the sampling distribution for sample means and for sample proportions using the results from a single random sample; and begin to appreciate how an estimated sampling distribution can allow us to incorporate the uncertainty in an estimate into the story of what's going on with the unknown truth at the population level. In other words, starting with our estimate and adding in some uncertainty to potentially create a range of plausible values for the unknown truth.
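As a preview of that last idea, here is a minimal sketch in Python. Both numbers are made-up placeholders, and the "estimate plus or minus two estimated standard errors" rule is only a rough convention that later lectures will justify:

```python
# Rough range of plausible values for the unknown population truth:
# take the sample estimate and add/subtract two estimated standard errors.
# (Both numbers below are hypothetical placeholders.)
estimate = 123.6   # e.g., a sample mean
se_hat = 0.9       # estimated standard error of that mean

low = estimate - 2 * se_hat
high = estimate + 2 * se_hat
print(f"plausible range: {low:.1f} to {high:.1f}")
```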

So, let's talk a little bit about "real-life research".

So again, in the previous sections,

we showed the results of computer simulations to

illustrate some general properties of sampling distributions.

But in real life research, generally,

only one sample can be taken from each population under study.

So how can we use the results of the single sample we have to estimate that, if you will, behind-the-scenes theoretical sampling distribution of the sample statistic we're calculating from our single sample of data? And if we can characterize this distribution, how can we use it to help us?

So let's talk about some generalities we saw from the simulations.

So, regardless of what type of data we were summarizing, whether continuous, binary, or time-to-event, with the appropriate sample statistics (means for continuous data, proportions for binary data, incidence rates for time-to-event data), the resulting estimated sampling distribution from simulation was generally symmetric.

In other words, when we looked at a histogram of the sample estimates across multiple random samples of the same size, the distribution was generally symmetric and approximately normal, regardless of the size of the sample each statistic was based on.

Generally, these estimated sampling distributions were centered at the true value of the population-level quantity being estimated; that is, the average of the 5,000 sample-based estimates came out to be that true value.

So for example, the average of the 5,000 sample mean estimates

of the underlying population mean was in fact the underlying population mean.

Across all these distributions, we saw that the variability in the sample statistics from sample to sample systematically decreased the larger the sample each estimate was based upon.

So, there's a mathematical theorem that

generalizes these properties called the Central Limit Theorem,

and many times this will be referred to by its initials, the CLT.

Basically, the Central Limit Theorem states that the theoretical sampling distribution of a sample statistic, the distribution we would get were we to take an infinite number of random samples of the same size and plot the statistics across those samples, will be approximately normal.

The average of our estimates will be the true population-level value being estimated, and the variability in these estimates will be a function of both the variation in the individual values in the population (the standard deviation of the individual population values) and the size of the sample each statistic is based on.

This variability in sample statistics across

multiple samples of the same size is called the standard error of the statistic.
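These properties can be seen in a small simulation. The sketch below is illustrative only: the population (a skewed, exponential-like set of values), the sample size, and the number of replicates are all arbitrary choices, with a finite number of replicates standing in for the theorem's "infinite" number of samples:

```python
import random
import statistics

random.seed(1)

# Hypothetical skewed population of individual values (exponential-like,
# so decidedly non-normal); all choices here are arbitrary for illustration.
population = [random.expovariate(1 / 50) for _ in range(50_000)]

n = 100            # size of each random sample
replicates = 2000  # stands in for the theorem's "infinite" number of samples

# One sample mean per replicate: an empirical sampling distribution.
sample_means = [statistics.mean(random.sample(population, n))
                for _ in range(replicates)]

# Its center is (approximately) the true population mean...
print(statistics.mean(population), statistics.mean(sample_means))

# ...and its spread, the standard error, is (approximately) sigma / sqrt(n).
sigma = statistics.pstdev(population)
print(sigma / n ** 0.5, statistics.stdev(sample_means))
```

A histogram of `sample_means` would also look approximately normal even though the individual population values are skewed, which is the theorem's claim about the shape of the distribution.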

So basically, the Central Limit Theorem or CLT says: "Look, I can tell you what would happen if you were to take an infinite number of samples of the same size from a single population, compute the sample statistic on each of those samples, and then plot a histogram of your estimates."

It says that if you were to do this, taking multiple random samples of the same size from the same population and looking at the distribution of the sample estimates across these samples, then a histogram of the estimates with a smooth curve drawn over it would be approximately normal, centered at the true value of whatever each statistic was estimating.

So, if we had sample means, the distribution would be centered at the true mean. If we had sample proportions, the distribution of sample proportions would be centered at the true proportion.

And the variability in these values will be a function of the variability of the individual values in the population and the size of the sample each estimate in the histogram is based upon.

So for example, suppose we were looking at samples of size n taken from a population with mean mu and standard deviation sigma, and we plotted a histogram of multiple sample means based on random samples of the same size n. This histogram would consist entirely of x-bars, each based on a sample of size n. Across an infinite number of x-bars, the average of those estimates would be the true population mean. And the variability in these x-bars from sample to sample would be a function of the variability of the individual values in the population and the size of each sample.

I'm going to put this out here now, and we'll formalize it in the next set of lectures, but this variability in the x-bars is theoretically equal to the standard deviation of the individual measurements in the population divided by the square root of the sample size.

Again, we'll formally review this in lecture set seven, but this quantity, the variability in the sample means across samples of the same size, is called the standard error. So the standard error of the sample mean based on samples of size n is given by the true variability of the individual values in the population divided by the square root of the size of the sample each x-bar is based on: standard error of x-bar = sigma / sqrt(n).

So we kind of have a conundrum here, right? We only have one sample. We can estimate the sample mean, but in order to quantify the potential variation in sample means across samples of the same size, we need to know the population standard deviation.

I can't think of many situations where we wouldn't know the population mean but would somehow know the population standard deviation.

Luckily, we have a quick fix to help us move forward. We don't know sigma, but what we're going to see is that when estimating from a single sample, we can substitute our sample-based estimate s for sigma and get an estimated standard error for sample means based on samples of the size we have.
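In code, that substitution looks like the following sketch. The measurement values here are made up for illustration and are not the blood-pressure data discussed in this course:

```python
import statistics

# Hypothetical single sample of measurements (made-up values).
sample = [118, 131, 125, 109, 140, 122, 127, 115, 133, 120]

n = len(sample)
x_bar = statistics.mean(sample)  # sample mean: our estimate of mu
s = statistics.stdev(sample)     # sample SD: our stand-in for unknown sigma

# Estimated standard error of the sample mean: s / sqrt(n).
se_hat = s / n ** 0.5
print(f"x-bar = {x_bar:.1f}, estimated SE = {se_hat:.2f}")
```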

So for example, recall we had a single sample of 113 men

where the mean blood pressure in these 113 men was 123.6 millimeters of mercury,