So in this next set of lectures,

we're going to define the idea of sampling distributions of

sample statistics and something that measures the variation

in the sample statistic across multiple samples of

the same size from the same population,

something called standard errors.

So in this first section,

I'm just going to take a moment to define the idea of

the sampling distribution of a sample statistic and lay out a roadmap for what we'll do in

the subsequent lecture sections to make more sense of

this title: sampling distributions and standard errors.

So let's first define the sampling distribution of a sample statistic.

So, thus far in the course,

we have summary measures for single data samples, like the sample mean,

the sample proportion, and the sample incidence rate,

and we also have measures of association that compare two samples:

differences in means, risk differences,

relative risks, and incidence rate ratios.

We have discussed how

these aforementioned sample estimates are not necessarily the population truth;

in fact, we won't ever know whether they are, because we don't know the population truth.

But these sample estimates are our best estimates

of these unknown truths based on the data we have at

hand from our imperfect sample or samples from a population or populations.

So ultimately, in addition to getting an estimate and correctly interpreting it,

it is important to recognize the potential uncertainty in

a sample-based estimate as it relates to the unknown truth that it estimates.

This uncertainty is sometimes called sampling variability.

If we understand that sample-based estimates vary across

random samples of the same size from the same population,

this will give us a framework for coupling our estimate with some measure of

uncertainty and putting those two things

together to make a statement about the unknown truth.

So this set of lectures,

lecture set six, involves defining, characterizing,

and estimating the theoretical sampling distribution of a sample statistic.

For example, the sampling distribution of

a sample mean or the sampling distribution of a sample proportion.

As we move through the course,

we will extend this concept to include

sampling distributions for comparison measures like mean differences,

risk differences, relative risks etc.

Ultimately, what this sampling distribution will

allow for is the estimation of an interval

describing a range of plausible values for the unknown truth that we can only estimate.

We can use the results from our single sample, as we have

seen, to estimate this unknown truth, and then add

uncertainty bounds to create this interval of

plausible values for the unknown truth, called a confidence interval.

So to start, let's define the concept of a sampling distribution.

The sampling distribution of a sample statistic is a theoretical distribution that

describes all possible values of

a sample statistic based on all random samples of the same size,

taken from the same population.

The variability of the sample statistic values characterized by

the sampling distribution is a measure of sampling variability;

we will ultimately call this the standard error of our sample statistic.

So let's think about

what is meant by uncertainty in sample-based estimates, also called sampling variability.

Well, you may remember earlier in

the course when we were defining continuous data measures,

we looked at the weight distribution for a sample of

236 one-year-old children from Nepal, and we computed the sample mean for that entire group.

So, think about this.

If you were doing research in Nepal,

you might take another sample of

236 children and get a slightly different estimate

of the mean weight of one-year-old Nepali children than my sample of 236 did.

If you can imagine this study being repeated

an infinite number of times, with an infinite number of researchers

each taking a random sample of 236 one-year-olds from

Nepal and computing a mean weight for their sample,

then the theoretical sampling distribution of mean weights based on

random samples of 236 Nepali children who are 12 months old would be given

by a histogram that includes all of the infinite sample mean estimates from samples of 236.

So for example, my mean was 7 kilograms,

the second researcher's mean was 6.8 kilograms,

the third researcher's mean, on the 236 children in

his or her sample, was 7.4 kilograms,

and so on; in the theoretical conception, anyway,

we would be looking at an infinite number of means, each based on a sample of 236 children.

If we were to plot a histogram of these infinite means and look at

the distribution of the sample mean estimates from sample to sample,

this would be the theoretical sampling distribution

of a sample mean based on 236 Nepalese children.

So this is a theoretical quantity.

We would not likely do our study more than once, and

we certainly wouldn't do it an infinite number of times,

but the idea of a sampling distribution is a histogram where

each point in the histogram, each value,

is a sample mean based on,

in this case, in this example,

based on a random sample of N equals 236.

So this distribution shows the range of values,

their distribution, and the variability in

these values across the different random samples.
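To make this thought experiment concrete, here is a minimal simulation sketch. The population below is invented for illustration (weights drawn from a normal distribution with mean 7.1 kg and standard deviation 1.2 kg); the actual Nepal data are not reproduced here, and 2,000 samples stand in for the theoretical "infinite" number.

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch is reproducible

def sample_mean(n):
    """One researcher's study: draw n weights (kg) and return their mean."""
    # Hypothetical population: Normal(mean = 7.1 kg, sd = 1.2 kg).
    return statistics.mean(random.gauss(7.1, 1.2) for _ in range(n))

# Many researchers each take a random sample of 236 one-year-olds.
means = [sample_mean(236) for _ in range(2000)]

# A histogram of `means` would approximate the sampling distribution;
# the spread of these means is the (empirical) standard error.
print(round(statistics.mean(means), 2))   # centers near the population mean
print(round(statistics.stdev(means), 3))  # much smaller than the SD of 1.2
```

Plotting a histogram of `means` would display exactly the kind of sampling distribution this passage describes.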

Now suppose I look at the sampling distribution of the proportion of

Marylanders who have been vaccinated for the flu in the current cycle,

and let's suppose I was basing this on samples of size 500.

So I, researcher one, take a sample of 500, and

maybe 36% of the sample has been vaccinated for the flu.

You, researcher number two,

take another random sample of 500 Marylanders, and maybe 41% have been vaccinated,

and this goes on infinitely: an infinite number of researchers all take samples of

500 and compute a single summary measure

on their respective samples, the sample proportion.

If we were to do a histogram of the sample proportion values across

these infinite random samples of each of size 500,

this would give us the sampling distribution

of the sample proportion based on random samples of 500.

So each point in this histogram is a sample proportion from

one random sample of N equals 500.
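A parallel sketch for this proportion example, again with an invented population truth (38% vaccinated) standing in for the unknowable real value:

```python
import random
import statistics

random.seed(2)
TRUE_P = 0.38  # invented "population truth"; unknown in real research

def sample_proportion(n):
    """One researcher's study: sample n people, return the fraction vaccinated."""
    return sum(random.random() < TRUE_P for _ in range(n)) / n

# An infinite number of researchers is impossible; 2,000 stands in for "many".
props = [sample_proportion(500) for _ in range(2000)]

print(round(statistics.mean(props), 2))   # centers near TRUE_P
print(round(statistics.stdev(props), 3))  # empirical standard error
```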

So again, this is a theoretical idea, because no researcher will

likely take more than one sample for a given study, and certainly not an infinite number.

So again, the sampling distribution like I just said,

is a theoretical entity,

it cannot be observed directly or exactly specified.

In real life research,

only one sample from each population under study will be taken,

and even if we wanted to take multiple samples,

it would be impossible to take an infinite number.

So the remaining sections in this overarching lecture six will

serve to further demonstrate and define sampling distributions by

detailing the results of some computer simulations where we do sample from

a theoretical population multiple times and

look at the distribution of sample statistics across the samples.

By doing this, we'll empirically show some consistent properties of

sampling distributions regardless of the sample statistic we're creating.

A mean for continuous data,

a proportion for binary data,

and an incidence rate for time-to-event data.
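As a preview of those simulations, the same exercise works for an incidence rate. The event times below are simulated from an exponential distribution with an invented true rate of 0.05 events per person-year; none of these values come from the lecture.

```python
import random
import statistics

random.seed(4)
TRUE_RATE = 0.05  # invented events per person-year, for illustration only

def sample_rate(n):
    """Follow n subjects to their event; rate = events / total person-time."""
    person_time = sum(random.expovariate(TRUE_RATE) for _ in range(n))
    return n / person_time

# Sample incidence rates across 2,000 simulated studies of 300 subjects each.
rates = [sample_rate(300) for _ in range(2000)]

print(round(statistics.mean(rates), 3))  # clusters near TRUE_RATE
```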

We'll unveil a mathematical property that will allow

for the generalization of these properties shown in the simulations,

and then we'll use that property to

demonstrate how to estimate characteristics of

a sampling distribution for a sample statistic based on the results of one random sample.

So, even though it's a theoretical quantity measuring

the distribution of statistics across an infinite number of random samples,

we'll have some tools to estimate characteristics of

this distribution based on the results of a single sample.
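One such tool can be sketched now: for a sample mean, the standard error is commonly estimated from a single sample as the sample standard deviation s divided by the square root of n. The data below are simulated stand-ins, not real measurements.

```python
import math
import random
import statistics

random.seed(3)

# One (simulated) sample of 236 weights; stands in for a real study's data.
sample = [random.gauss(7.1, 1.2) for _ in range(236)]

s = statistics.stdev(sample)          # sample standard deviation
se_hat = s / math.sqrt(len(sample))   # estimated standard error of the mean

print(round(se_hat, 3))  # close to the true SE here, 1.2 / sqrt(236)
```

The point is that se_hat is computed from one sample alone, yet it estimates the spread of the theoretical sampling distribution across all possible samples of the same size.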