A conceptual and interpretive public health approach to some of the most commonly used methods from basic statistics.

From the course by Johns Hopkins University

Statistical Reasoning for Public Health 1: Estimation, Inference, & Interpretation

188 ratings


From the lesson

Module 3A: Sampling Variability and Confidence Intervals

Understanding sampling variability is the key to defining the uncertainty in any given sample-based estimate from a single study. In this module, sampling variability is explicitly defined and explored through simulations. The resulting patterns from these simulations will give rise to a mathematical result that is the underpinning of all statistical interval estimation and inference: the central limit theorem. This result will be used to create 95% confidence intervals for population means, proportions and rates from the results of a single random sample.

- John McGready, PhD, MS, Associate Scientist, Biostatistics

Bloomberg School of Public Health

So in this section we will show the results

from simulations to empirically demonstrate the concept of a theoretical

sampling distribution of both the sample proportion for binary data

and a sample incidence rate for time to event data.

We'll look at the behavior of these estimates across multiple

random samples of the same size from the same underlying theoretical population.

What you'll see is that the results will look remarkably similar

to what we showed with regards to distributions of

sample means across multiple random samples in the last section.

Okay, in this lecture section, we're going to basically repeat what

we did in section B but instead of focusing on means

from continuous data, we'll look at proportion summaries of binary

data and sample incidence rate summaries of time to event data.

So upon completion of this lecture,

hopefully you'll be able to describe the sampling distribution of the sample

proportion and the sample incidence rate in terms of their composition.

Comment on or list some characteristics of the sampling

distributions for proportions and incidence rates, which we'll

demonstrate empirically through simulation including the general shape

of the distributions, the center of the distributions,

the variability of the distributions, and the relationship to

the size of the samples each statistic is based upon.

Hopefully you can also comment on

the similarities between this lecture section's results

and the results for sampling distributions of sample means from the previous section.

So the first thing that we're going to look at is a

behind-the-scenes computer population of

Baltimore City residents that I created using census data.

From the most recent census data, I knew the percentage of residents in

poverty, defined as being below the poverty line, and I've created

that population in the computer, and what I'm going to do is repeatedly

sample from it to look at the behavior of the sample proportion

across multiple samples.

But first, let's just try and get a sense what's going on with our data.

These data at the individual level are yes no data.

Zero one data.

A one if the resident sampled is living

under the poverty threshold, and a zero if not.

So if we took a sample of 50 persons for

example, this is what our individual data would look like.

There'd be a certain percentage of people who are not in poverty.

A certain percentage who are, and in this case, the percentage of people in the

sample who are living in poverty below

that threshold defined for the U.S. is 26%.

If we go over here, we're looking at another random sample, but it's larger.

You can see the distribution of the individual values is still such,

and it will always be such, regardless of sample size,

that they either take on a value of zero for not living in

poverty, or one if the person does live in poverty, that is, is below that limit.

In this sample situation, 19% of the residents, out of

100 sampled, were living in poverty in the year 2011.

So what can we ascertain?

Well, we know the data we're looking at, at the population level

can only be yes or no data, because of its binary nature.

We have some sense of the order of magnitude of the percent of residents

living in poverty: somewhere around 20

something percent, maybe slightly less, maybe above.

In any case, it's rather high, at least based on these two samples.

So let's go on to sort of

investigate the sampling behavior of a sample statistic.

So what I've done here is I've actually repeatedly taken random

samples of size 50, 50 residents at a time.

I've taken 1,000 samples, each with 50 people in them,

computed 1,000 sample proportions, and have

plotted those 1,000 sample proportions in this histogram.

And the reason there's gaps in the histogram is, there's only certain values

that the proportion can take on when you only have a denominator of 50.

So just to reiterate, we have

1,000 p hats from 1,000 samples,

each with 50 people.

So, 1,000 summary measures, each summarizing 50 people.

That's what we see here, we see this distribution.

And it's interesting, you can see that it is

somewhat symmetric and the bulk of the observations are concentrated around here.

And then the proportion of the other values across

these multiple samples dies off as we get further from that middle.
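
The simulation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code used in the lecture: it draws independent 0/1 values from an idealized infinite population rather than from the census-based computer population, it uses the true proportion of 0.229 that is reported later in this lecture, and the seed and the name `sample_proportion` are arbitrary choices made here.

```python
import random
from collections import Counter

TRUE_P = 0.229       # population proportion in poverty, as reported later in this lecture
N_PER_SAMPLE = 50    # residents per sample
N_SAMPLES = 1000     # number of repeated samples

def sample_proportion(n, p, rng):
    """Draw n independent 0/1 values with P(1) = p and return their mean (p hat)."""
    return sum(1 if rng.random() < p else 0 for _ in range(n)) / n

rng = random.Random(2011)  # arbitrary seed, only so the run is reproducible
p_hats = [sample_proportion(N_PER_SAMPLE, TRUE_P, rng) for _ in range(N_SAMPLES)]

# With a denominator of 50, every p hat is a multiple of 1/50 = 0.02,
# which is why a histogram of these values has gaps.
counts = Counter(p_hats)
for value in sorted(counts):
    # Crude text histogram: one '#' per 5 samples that produced this p hat.
    print(f"{value:.2f} {'#' * (counts[value] // 5)}")
```

The text histogram it prints should be roughly symmetric, with most of the 1,000 p hats near the middle and fewer further out, which is the pattern described for the slide.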

Let's look at doing the same thing, but where we've upped the sample size.

Now we've got 1,000 p hats.

But now, each one summarizes a sample of 150 residents.

So there's 1,000

proportions in this histogram, each based on 150 persons.

And we can again see that this is roughly

symmetric and bell-shaped, with the bulk of the estimates

concentrated in this area and then some outside of

that, but a lesser proportion across those 1,000 samples.

Finally, here's the estimated sampling distribution

for sample proportions from random samples

of 500.

In this picture, again, we are showing not individual-level, person-level data,

but the distribution of p-hats, each computed on a sample of 500 people.

So, let's put these all in one graphic like we did before, and

I think you can see this will probably look somewhat familiar to you

from the previous situations. These are the

box plots of those sample proportion estimates side by side,

just expressed as box plots instead of histograms, and you can

see here's the distribution of estimates where each estimate is based on 50 people.

Here's the distribution where each estimate is based

on 150 people.

And here's the distribution where each p hat is based on 500 people, and you can

see that the variability in the estimates gets

smaller the more information goes into each estimate.

Again, common sense kicks in: more

information leads to a more stable estimate. Or,

as a sort of mathematical explanation: the more

information each proportion is based on, the less

influence each individual point in the sample has on that summary measure,

and the more

stable, the less variable, that estimate is across samples.

Also, you can see again that it's not perfect,

because there's a slight shift here, but the centers

of these distributions are roughly lined up at about

the same place as we saw before with sample means.

So, this picture just reiterates some of the properties

we examined before when we were dealing with continuous

data, and shows that they, at least in this

scenario, exist for summaries of binary data as well.

So the truth, at least based on the most

recent census results, is that the proportion of people living in

poverty in Baltimore City in the year 2011 was a

disappointingly large 0.229, or 22.9%, nearly a quarter of the city's residents.

So let's look at the results from

our estimated sampling distributions and summarize them numerically.

Well, we'll notice that the means of those 1,000 sample proportions,

whether they were based on samples of size 50, 150 or 500, are very similar in value.

The mean of the 1,000 sample proportions, each

based on samples of 50 residents, was 23.2%.

Slightly higher than that truth at 22.9%, but certainly very close.

When we moved

into samples in our simulation with 150

observations each, the mean was actually 22.9%.

The mean of those 1,000 sample proportion estimates.

And similarly, when each p hat was based on a sample of 500 persons,

the mean of all the

1,000 estimated p hats was the true proportion in

the population.

And here's numerical evidence of what we saw visually in that

the variation in the estimated proportions decreases with an increase in

sample size.
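
A numerical summary of this kind can be reproduced approximately with a short Python sketch. It makes the same simplifying assumptions as any idealized simulation: independent 0/1 draws with the true proportion 0.229 from this lecture, an arbitrary seed, and a helper name (`simulate_p_hats`) invented here for illustration. The exact values will differ from the lecture's because the samples are different.

```python
import random
import statistics

TRUE_P = 0.229     # true proportion reported in this lecture
N_SAMPLES = 1000   # repeated samples per sample size

def simulate_p_hats(n, p, n_samples, rng):
    """Return n_samples sample proportions, each computed from n independent 0/1 draws."""
    return [sum(rng.random() < p for _ in range(n)) / n for _ in range(n_samples)]

rng = random.Random(0)  # arbitrary seed for reproducibility
summary = {}
for n in (50, 150, 500):
    p_hats = simulate_p_hats(n, TRUE_P, N_SAMPLES, rng)
    # Mean and standard deviation of the 1,000 p hats for this sample size.
    summary[n] = (statistics.mean(p_hats), statistics.stdev(p_hats))
    print(f"n={n:3d}  mean of p hats={summary[n][0]:.3f}  SD of p hats={summary[n][1]:.3f}")
```

The means should all land close to 0.229, while the standard deviation of the p hats shrinks as the sample size grows from 50 to 150 to 500.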

Let's look at another situation now, where we're dealing with

time-to-event data, and remember this deals with two components:

the yes or no, did the event occur or was the observation censored,

and the time at which the event occurred or the person was censored.

And this is a population of

substance abusers who enrolled in rehabilitation.

So again,

I have this population on my computer. It's large.

I'm going to take smaller samples of it to see what's going on.

So, this first sample I took was of 50 people.

The only way to visually summarize what's going

on for time-to-event data, as we've noted before,

the graphic that captures the two

dimensionality, is the Kaplan-Meier curve estimate.

And this is the Kaplan-Meier curve estimate.

And we can see that there is a fair amount of relapse.

This is time to relapse.

So this tracks the proportion of subjects in our sample

who had still not relapsed by the follow-up time on the horizontal axis.

And what we got was an incidence rate estimate of

108.6 events per 100 person years based on these 50 observations.
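
As a reminder of how such an estimate is put together, here is a minimal Python sketch of the usual incidence rate calculation: the number of events divided by the total follow-up time, scaled to 100 person-years. The five-person data set and the function name are made up for illustration and are not the lecture's data; 365.25 days per year is also an assumption made here.

```python
def incidence_rate_per_100_person_years(follow_up_days, relapsed):
    """Events per 100 person-years: 100 * (number of events) / (total follow-up in years)."""
    events = sum(relapsed)
    person_years = sum(follow_up_days) / 365.25  # assumed days-per-year conversion
    return 100 * events / person_years

# Hypothetical sample: follow-up time in days, and 1 = relapsed, 0 = censored.
follow_up_days = [30, 120, 400, 75, 365]
relapsed = [1, 1, 0, 1, 0]

rate = incidence_rate_per_100_person_years(follow_up_days, relapsed)
print(f"{rate:.1f} relapses per 100 person-years")
```

Both censored and uncensored subjects contribute follow-up time to the denominator, but only those who relapsed count in the numerator.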

Next door here, what we have, sample B, was actually a sample of 250 people.

And you can see that the Kaplan-Meier curve estimate looks similar.

You can see that at the end, when we

flat line, it doesn't look like there are any more events.

We're dealing with about 20-something percent of the sample

who had still not experienced the event as of, say, 400 follow-up

days, and that looks similar to what we saw in the sample of 50.

The reason this curve goes on longer in the

sample of 250 is we must have picked up

one or more observations with a longer follow-up

or censoring time than we got in the first sample.

And the incidence rate estimate here is larger than we

got from sample A: it's 129.1 events per 100 person years.

But,

looking at either of these samples, we get some

sense that there's a fair amount of relapse in these

data within the first year, and we get some

ballpark estimates of where the true rate of relapse is.

Now let's look at actually sampling multiple times from this population and

computing multiple sample incidence rates, just like

we did the means and proportions before.

So let's say I

took 1,000 samples, each with 50 people in them.

And I computed 1,000 incidence rates.

So what we have here in the plot, in this histogram, is the distribution

of 1,000 incidence rate estimates across 1,000

samples,

each containing 50 people.

And you can see that there's a fair amount

of variation in these estimates but they tend to center

somewhere around here but of course not all estimates

are equal and there's a fair amount of variation here.

Let's do this again, but let's take samples each with 100 persons.

So now we got 1,000 incidence rate estimates,

each based on a sample of n equals 100.

So we have 1,000 IRs based on 1,000 samples, each containing 100 persons.

And finally let's do this with samples

that each contain 250 people from this population.

Here we have 1000 incidence rates.

So, I think you saw it here, and now we'll verify this all in one graphic.

It's the same old story, basically. What do we see?

Well, we see that in the distributions of

the incidence rates there are some outliers, and you

can probably see this in the histograms,

but on the whole, these are roughly symmetric.

We saw that in the histograms and we can reinforce that in the box plots here.

They're roughly symmetric, and the variation

in these estimates decreases the larger the sample each estimate is based upon.

And the centers of these distributions, the medians in this case,

but, because the distributions are symmetric, also the averages,

are very similar even though the variability is different.

So we get the same old results we've seen a couple of times now.

The variation in estimates decreases the larger the sample each estimate is based upon.

The shape of the distribution of estimates, the estimated sampling distribution of the IRs,

is approximately normal, regardless of the sample size each was based on.

And the average IR hat is similar regardless of the sample size.

The variation decreases, but the average stays the same, regardless of the sample size

each IR hat is based upon. This is the same story we've seen with

sample means, and with sample proportions. Are you getting a theme here?

So just to show you, now I'll give you the truth.

The data set I had that served as our population was a

large data set of data on substance abusers in rehab, and their

time to relapse. And the actual incidence rate in the

population was 125.8 relapses per 100 person years of follow up.

So now let's see what we've got here.

Well, you can see the mean of the 1,000 estimates was close. When we had

1,000 incidence rate estimates based on samples of size 50,

it was 128,

so slightly larger than that truth. When we had samples of size 100, it was

126.9, and when we had samples of 250, it was 126.2.

Some of the discrepancy we see here in

these estimates of center is just simulation error: if

we had taken a different set of samples of

size 50, we would've gotten a slightly different estimate.

And some of it has to do with the fact that in the

samples of size 50, we're still seeing the influence of outliers to some degree.

And we'll address that as we get into using sampling distributions

to create what we call confidence intervals.

But on the whole the averages are relatively similar and close to

that truth in the, in the population from which the data comes.

The variation in these estimates, though, as we saw visually,

decreases the larger the sample each one is based on.
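
The incidence rate simulation can be imitated with a small Python sketch. The lecture's population is a real data set that is not available here, so this sketch swaps in a hypothetical population: relapse times follow an exponential distribution with the true rate of 125.8 per 100 person-years quoted above, and everyone still relapse-free at 1.5 years is censored. The exponential model, the 1.5-year cutoff, the seed, and the function name are all assumptions made for illustration.

```python
import random
import statistics

TRUE_RATE = 1.258     # relapses per person-year, i.e. 125.8 per 100 person-years (from the lecture)
MAX_FOLLOW_UP = 1.5   # years until censoring; hypothetical, not from the lecture
N_SAMPLES = 1000      # repeated samples per sample size

def sample_incidence_rate(n, rng):
    """Simulate n subjects and return the estimated relapses per 100 person-years."""
    events = 0
    person_years = 0.0
    for _ in range(n):
        time_to_relapse = rng.expovariate(TRUE_RATE)  # assumed exponential relapse times
        if time_to_relapse <= MAX_FOLLOW_UP:
            events += 1                      # relapse observed during follow-up
            person_years += time_to_relapse
        else:
            person_years += MAX_FOLLOW_UP    # censored: contributes time but no event
    return 100 * events / person_years

rng = random.Random(1)  # arbitrary seed for reproducibility
ir_summary = {}
for n in (50, 100, 250):
    rates = [sample_incidence_rate(n, rng) for _ in range(N_SAMPLES)]
    ir_summary[n] = (statistics.mean(rates), statistics.stdev(rates))
    print(f"n={n:3d}  mean IR={ir_summary[n][0]:6.1f}  SD of IRs={ir_summary[n][1]:5.1f}")
```

Under these assumptions the mean of the 1,000 estimates should stay near 125.8 for every sample size, while the spread of the estimates narrows as the samples get larger.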

So to summarize.

Theoretical sampling distributions for sample proportions and

incidence rates across random samples of the same size,

from the same population, can be estimated via computer simulation.

Simulation is a useful tool for helping explore the properties of

sampling distributions, and some properties

observed with the two examples in

this lecture and the previous lecture, which will be generalized, include,

well, I'm not even going to write it out, you can write it yourself.

So if you want to summarize here, it's probably embedded in your

brain now, 'cause you've seen it so many times: the

variation in estimates decreases the larger the sample they're based on.

There is variation in estimates from sample to sample, but on average the

estimates equal or are close to the truth that they're trying to estimate.

And the shape of the distribution

of these estimates tends to be normal or approximately

normal regardless of the distribution of the

individual-level data that these estimates summarize.

So ultimately estimating the characteristics

of a sampling distribution will

be done using the results from a single random sample.

We can't do these simulations. If we knew the population

values we wouldn't worry about estimates from imperfect

subsets, right? We would just go with the truth.

But we're never privy to the truth,

so we can't do the simulations in real life.

Furthermore, we're only ever going to take one random sample from each of the

populations we wish to study, generally speaking, in research.

So,

ultimately what we'll want to be able to do is

estimate these characteristics of a sampling distribution using the results

from a single random sample from a population.

In the next lecture section, section D,

these properties that have been demonstrated empirically

via the simulations in this lecture section and lecture section B will be generalized.

We'll sort of show or give a mathematical theorem that says

what you saw happening, those patterns that you saw in each of these examples

will happen in most situations.
