So, in this section, we're going to look at some examples of the sampling distributions of

sample proportions summarizing binary data

and sample incidence rates summarizing time-to-event data.

Pay special attention to this because you'll see that the results we get

here are going to be very similar to what we saw when we did means in the last section.

So, upon completion of this section,

you will be able to describe the sampling distribution of

a sample proportion and a sample incidence rate in terms of their composition,

and comment on some of the characteristics of

the sampling distributions for these quantities that we'll demonstrate empirically,

including the general shape of the sampling distribution,

the center of the distributions,

and the relationship between the variability of these distributions and the size of

each sample that the statistics in the distribution are based upon.

You'll also be able to comment on the similarities between

this lecture section's results and the results for

sampling distributions of sample means from the previous section.

So, let's look at an example of taking samples of Baltimore City residents

to estimate the proportion of residents who live below the poverty line.

Let's suppose we took a random sample of 50 persons,

and 24% of the sample live below the poverty line.

Well, certainly the p-hat tells us everything we need to know about the sample result,

but if we drew a histogram of the data,

it would look like this.

The proportion who had values of zero,

who didn't live in poverty, was 76%, versus 24% who were yeses, or ones.

Suppose we took another sample of 150 Baltimore City residents,

a separate random sample,

we'd expect to get an estimate similar to what we got in our sample of 50,

but there'd be more data in the sample.

And indeed we get a p-hat of 25.3%,

and a similar histogram with two bars, one rising to 74.7%,

and the other, for the yeses or the ones, at 25.3%.

But let's now suppose we were doing this as a class,

we had a large class,

we were interested in poverty in Baltimore City, and each of

the 5,000 students in this class went out,

took a random sample of 50 Baltimore City residents,

and computed the proportion in their sample who were living below the poverty line.

I collected all 5,000 p-hats each based on 50 persons and

plotted these 5,000 p-hats in a histogram and that's what I'm showing you here.

So, you can see there are some gaps in the histogram,

and if you think about it, there are only 50 people in each sample.

So, the proportions are in increments of two percent.

In other words, an increase in

one person will yield an increase in the proportion of two percent,

which is why there are some gaps in this histogram.

Because each proportion has to be a multiple of two percent.

But you can see that the distribution of

the sample proportions still looks approximately bell-shaped and symmetric.
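
As a rough illustration of the classroom exercise just described, the following Python sketch simulates it. It assumes a population in which 25% of persons live below the poverty line (the value this lecture uses for its theoretical population). The function name `simulate_p_hats`, the seed, and the text histogram are illustrative choices, not the code used to produce the lecture's figures.

```python
import random
from collections import Counter

def simulate_p_hats(n, p=0.25, n_samples=5000, seed=1):
    """Return n_samples sample proportions, each from a random sample of size n.

    p is the assumed true population proportion (an assumption for illustration).
    """
    rng = random.Random(seed)
    p_hats = []
    for _ in range(n_samples):
        # Each person is a 1 (below the poverty line) with probability p, else 0.
        ones = sum(rng.random() < p for _ in range(n))
        p_hats.append(ones / n)
    return p_hats

p_hats_50 = simulate_p_hats(n=50)

# With n = 50, every p-hat is k/50 for a whole number k, i.e. a multiple of 2%.
counts = Counter(round(ph * 50) for ph in p_hats_50)

# Crude text histogram: one row per attainable value of p-hat.
for k in sorted(counts):
    print(f"{k / 50:5.0%} {'#' * (counts[k] // 25)}")
```

Each printed row corresponds to one attainable value of p-hat; no sample proportion can fall between two rows.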

Now, let's suppose we did this again, and each of the 5,000

students took a random sample of 150 Baltimore City residents.

I collected these 5,000 p-hats,

now each based on 150 persons,

and I plotted all 5,000 p-hats in one histogram.

So again, each point in this histogram is

a sample proportion summarizing

the proportion of persons in poverty in a sample of 150 people.

Just as I noted before with the right-skewed length-of-stay values,

now think about the original data we get in each of the samples,

which we're summarizing with a proportion.

It's binary; it only takes on one of two values.

There's no real distribution shape to talk of,

just the height of two bars,

and yet when we summarize this on each sample via proportion,

and then look at the distribution of the proportions

across multiple samples of the same size,

it is symmetric and bell-shaped.

This is drawn on the same scale as the prior distribution.

So, if we look at them back-to-back,

we can see that the variation in these 5,000 proportions each based on

150 people is less than the variation in the proportions based on 50 people.

Finally, if I did this one more time and sent

the class out and each of the 5,000 students took

a random sample of

500 Baltimore City residents and turned in not the individual measurements in the sample,

but just the p-hat summarizing the proportion of

the 500 persons in the sample who live below the poverty line,

and I got 5,000 p-hats each based on 500 people.

If I compiled those and presented a histogram of those 5,000 p-hats,

it would look like this.

You can see it's again symmetric and bell-shaped,

and even narrower and less variable than the previous two histograms I showed you.

So, just to get these in a side-by-side comparison,

here are the box plots of

these estimated sampling distributions each based on 5,000 proportions.

This first box plot shows the distribution of the

5,000 proportions each based on 50 people.

The one in the middle shows the distribution of the 5,000 proportions,

each based on 150 people,

and the last one shows the distribution of the

5,000 proportions each based on 500 people.

With a little shift here, generally speaking,

the centers of these distributions are comparable if not totally equivalent;

the medians are similar.

But again the more information that each statistic is based upon,

in other words the larger the sample the proportion is based on,

the less variability there is in proportions across the samples.

So, I created a theoretical population in my computer,

which had a proportion of 25%,

which is the latest estimate based on census data from the City as well.

So, I had a theoretical population of thousands and thousands and thousands of persons,

and each had a value of one or zero depending on whether they were in poverty or not,

and the proportion across these thousands and thousands and thousands of

individual observations I had for my population was 25%.

Here are the means of the 5,000 sample proportion estimates I

get, respectively, for each sample size. For proportions based on samples of size 50,

the mean of those 5,000 sample proportions was 25%.

So, the average of my estimates was

the true population proportion even when each was based only on 50 people,

and it stayed there:

the average was similarly 25%

if I average all 5,000 sample proportions based on 150 people each,

or if I average all 5,000 sample proportions based on 500 persons each.

But we can see and we saw visually that these sample proportion

estimates varied about the mean of the 5,000 sample proportions,

and the estimated standard deviation of these proportion estimates is given here

numerically, and this corroborates what we saw in the pictures:

the larger the sample that each proportion is based on,

the less variable the proportions are from sample to sample.
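
The pattern in these summaries can be reproduced with a short simulation. The sketch below is not the lecture's own code: it assumes a true proportion of 25%, uses an arbitrary seed, and the helper name `sampling_summary` is made up for illustration. For comparison it also prints sqrt(p(1 - p)/n), the standard result for the standard deviation of a sample proportion, which this section has not introduced.

```python
import math
import random
import statistics

def sampling_summary(n, p=0.25, n_samples=5000, seed=2):
    """Mean and SD of n_samples simulated sample proportions of size n."""
    rng = random.Random(seed)
    p_hats = [
        sum(rng.random() < p for _ in range(n)) / n
        for _ in range(n_samples)
    ]
    return statistics.mean(p_hats), statistics.stdev(p_hats)

summary = {}
for n in (50, 150, 500):
    mean_p, sd_p = sampling_summary(n)
    theory_sd = math.sqrt(0.25 * 0.75 / n)  # sqrt(p(1 - p)/n) with p = 0.25
    summary[n] = (mean_p, sd_p)
    print(f"n={n:3d}  mean={mean_p:.3f}  sd={sd_p:.4f}  theory sd={theory_sd:.4f}")
```

In a run of this sketch, the three means should all sit near 0.25 while the standard deviations shrink as n grows, mirroring the numbers described above.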

Let's do another example but looking at incidence rates.

So, the University of Massachusetts Research Unit IMPACT Study

includes data on time to relapse

for persons discharged from

alcohol and drug rehabilitation facilities. The

data available on each participant include their follow-up time,

which is their time from discharge to relapse or

censoring, and whether the participant relapsed or was censored.

So, via simulation from these data,

I'll use these data on hundreds of persons to

stand in for the infinite population, the theoretical population,

and I will investigate the sampling behavior of

a sample incidence rate using multiple samples of size 50,

100, and 250, respectively, taken from the data set.

Just to get us started,

let me just show you some information about

two randomly selected samples of different sizes just to try and

understand what the data look like in the population through the lens of these samples.

So, my first sample had 50 randomly selected people.

Here's a Kaplan-Meier curve for the percentage of

persons who have not relapsed by a certain time,

so who survived beyond that time without relapsing.

You can see the drop-off is pretty steep in the first year following discharge,

roughly 365 days, and then it tends to stabilize a bit more.

In this sample of 50,

there were 134.9 relapses per 100 person-years.

That's notably larger than some of the other rates we've seen.
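
For reference, a sample incidence rate is the number of events divided by the total follow-up time. The following Python sketch shows the arithmetic on a tiny made-up data set. The follow-up times and relapse indicators are invented for illustration and are not from the study, and the function name and the 365.25-day year are assumptions.

```python
def incidence_rate_per_100_person_years(follow_up_days, relapsed):
    """Events per 100 person-years: 100 * (number of events) / (total person-years)."""
    events = sum(relapsed)                        # censored participants add 0 events
    person_years = sum(follow_up_days) / 365.25   # everyone contributes follow-up time
    return 100 * events / person_years

# Hypothetical sample of 5 participants (NOT study data).
follow_up_days = [90, 400, 30, 731, 210]  # days from discharge to relapse or censoring
relapsed = [1, 0, 1, 0, 1]                # 1 = relapsed, 0 = censored

rate = incidence_rate_per_100_person_years(follow_up_days, relapsed)
print(f"{rate:.1f} relapses per 100 person-years")
```

Here 3 relapses over 1,461 person-days (4 person-years) give 75 relapses per 100 person-years.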

If I look at the random sample of 250 persons here,

we get a similar Kaplan-Meier curve and also

a similar estimated incidence rate of 125 relapses per 100 person-years.

So, what I'm going to do now is show the results of

the estimated incidence rates across multiple random samples of the same size.

So I took 5,000 samples each with 50 people from this population of rehab discharges,

and this histogram shows the distribution of 5,000 sample incidence rates

each based on a random sample of 50 persons.

You can see just like we saw with means and with proportions,

the distribution of the summary statistic across random samples

of the same size is roughly symmetric and bell-shaped.

I'll do this again but with each of my samples having 100 persons instead of 50.

So again, now I have 5,000 incidence rates each

computed on a random sample of 100 persons discharged from rehab,

and this distribution of 5,000 incidence rate values also looks symmetric

and bell-shaped but the variability is less than that in the previous picture,

and I'll show you box plots in a minute.

Then finally, I did the same thing when I took

random samples of 250 discharges at a time,

computed the incidence rate of relapse for each sample of 250,

I did this 5,000 times.

Again, got 5,000 sample incidence rates and I'm plotting them in this histogram here.

So, there are 5,000 points in this histogram,

each an incidence rate based on a single sample of 250 persons.

So, if I look at these distributions side by side, again,

I see, similar to what I saw with proportions and means, that

the median value of

the estimated sample incidence rates is similar whether the rates are based on 50,

100, or 250 persons,

but the variability in the incidence rates between

samples decreases the larger the sample each statistic is based upon.

So now, let's compare characteristics of

our distributions of estimated incidence rates to

the underlying truth based on the theoretical population I

created from the results of that University of Massachusetts study.

The true population incidence rate of relapse is 125.8 relapses per 100 person-years.

So, that's the truth that we were estimating

multiple times via our sample incidence rates based on samples of different sizes.

The average of my 5,000 estimates based on samples of size

50 is 128.6 relapses per 100 person-years.

So not exactly equal to the underlying truth but I will say close in value.

The average of the 5,000 incidence rates each estimated from a sample of

100 persons is 126.6 relapses per 100 person-years,

and the average of the 5,000 incidence rates each based on

250 persons is 126.2 relapses per 100 person-years.

So, these averages across the three estimated sampling distributions are not

equal in value, and none of them exactly equals the truth, but they all get close.

Theoretically, if we were to take an infinite number of random samples,

then the mean of the infinite number of estimated incidence rates should be that truth.

You can see that the standard deviation or variability, though,

in these incidence rates around their respective mean of all 5,000

decreases the larger the sample each incidence rate is based upon.
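
The same experiment can be sketched in Python. Because the study data are not reproduced here, the sketch builds a synthetic stand-in population: relapse times are drawn from an exponential distribution and censoring times from a uniform distribution, both of which are assumptions made only for illustration, as are the population size of 600, the seed, and the function names. Samples are drawn with replacement so the fixed data set can stand in for a much larger population; the lecture does not say how its samples were drawn.

```python
import random
import statistics

rng = random.Random(3)

def make_population(size=600, relapse_rate_per_year=1.25):
    """Synthetic (follow_up_years, relapsed) records; NOT the study data."""
    population = []
    for _ in range(size):
        relapse_time = rng.expovariate(relapse_rate_per_year)  # years to relapse
        censor_time = rng.uniform(0.5, 3.0)                    # years until follow-up ends
        follow_up = min(relapse_time, censor_time)
        population.append((follow_up, int(relapse_time <= censor_time)))
    return population

def rate_per_100_py(records):
    """Relapses per 100 person-years for a list of (follow_up_years, relapsed)."""
    events = sum(relapsed for _, relapsed in records)
    person_years = sum(follow_up for follow_up, _ in records)
    return 100 * events / person_years

population = make_population()
true_rate = rate_per_100_py(population)
print(f"population rate: {true_rate:.1f} per 100 person-years")

results = {}
for n in (50, 100, 250):
    # 5,000 samples of size n, drawn with replacement from the stand-in population.
    rates = [rate_per_100_py(rng.choices(population, k=n)) for _ in range(5000)]
    results[n] = (statistics.mean(rates), statistics.stdev(rates))
    print(f"n={n:3d}  mean={results[n][0]:.1f}  sd={results[n][1]:.1f}")
```

The printed means should land close to the population rate for every sample size, while the standard deviations shrink as the samples get larger, which is the behavior described above.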

So again, theoretical sampling distributions

for sample proportions and incidence rates across

random samples of the same size from

the same population can be estimated via computer simulation.

Simulation is a useful tool for helping

explore the properties of the sampling distributions,

and the properties we saw here are the same properties we saw with means.

Namely, that the variation in sample estimates,