So, in this section,

we're going to look at some examples for the Sampling Distribution of Sample Proportions

summarizing binary data and Sample Incidence Rates summarizing time to event data.

Pay special attention to this because you'll see that the results we get here,

are going to be very similar to what we saw when we did means in the last section.

So, upon completion of this section,

you will be able to describe the sampling distribution of

a sample proportion and a sample incidence rate in terms of their composition.

You will be able to comment on some of the characteristics of

the sampling distributions for these quantities that we'll demonstrate empirically,

including the general shape of the sampling distribution,

the center of the distributions,

and the relationship between the variability of these distributions,

and the size of each sample that the statistics in the distribution are based upon.

You'll also be able to comment on the similarities between

this lecture section's results and the results for

sampling distributions of sample means from the previous section.

So let's look at an example of taking samples of Baltimore City residents

to estimate the proportion of residents who live below the poverty line.

Let's suppose we took a random sample of

50 persons and 24 percent of the sample live below the poverty line.

Well, certainly the p-hat tells us everything we need to know about the sample result,

but if we did a histogram of it,

it will look like this.

The proportion who had values of zero, or didn't live in

poverty, was 76 percent versus 24 percent who were yeses or ones.

Suppose we took another sample of 150 Baltimore City residents,

a separate random sample,

we'd expect to get a similar estimate to what we got in our sample of 50,

but there'd be more data in the sample.

But again, we get a p-hat of 25.3 percent,

and a similar histogram with two bars,

one rising to 74.7 percent,

and the other, for the yeses or the ones, at 25.3 percent.

But let's now suppose we were doing this as a class,

we had a large class,

we were interested in poverty in Baltimore City, and

each of the 5,000 students in this class

went out, took a random sample of 50 Baltimore City residents,

and computed the proportion in their sample who were living below the poverty line.

I collected all 5,000 p-hats,

each based on 50 persons,

and plotted these 5,000 p-hats in a histogram,

and that's what I'm showing you here.

As you can see,

there are some gaps in the histogram, and if you think about it,

there's only 50 people in the sample,

so the proportions are in increments of 2 percent.

In other words, an increase in

one person will yield an increase in the proportion of 2 percent,

which is why there are some gaps in

this histogram because each proportion has to be a multiple of 2 percent.
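To make the 2 percent increments concrete, here is a minimal Python sketch (not part of the lecture) that lists the values a sample proportion can take when the sample has 50 people. The variable names are just illustrative.

```python
# Possible values of p-hat for a sample of n = 50 people: k / n for k = 0, 1, ..., 50.
n = 50
possible_p_hats = [k / n for k in range(n + 1)]

# Adjacent values differ by 1/50 = 0.02, which is why the histogram has gaps.
step = possible_p_hats[1] - possible_p_hats[0]
print(len(possible_p_hats), step)
```

With 150 or 500 people per sample the step shrinks to 1/150 or 1/500, so the gaps become much harder to see.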

But you can see that the distribution of

the sample proportions still looks approximately bell-shaped and symmetric.

Now, let's suppose we did this again, and each of the 5,000

students took a random sample of 150 Baltimore City residents.

I collected these 5,000 p-hats now each based on

150 persons and I plotted all 5,000 p-hats in one histogram.

So again, each point in this histogram is

a sample proportion summarizing

the proportion of persons in poverty in a sample of 150 people.

Just as I noted before with the right-skewed data for length of stay values,

now think about the original data

we get in each of the samples where we're summarizing the proportion.

It's binary; it only takes on one of two values.

There's no real distribution shape to talk of, just the height of two bars,

and yet when we summarize this on each sample via proportion,

and then look at the distribution of the proportions

across multiple samples of the same size,

it's symmetric and bell-shaped.

This is drawn on the same scale as the prior distribution,

so if we look at them back-to-back,

we can see that the variation in these 5,000 proportions each based on

150 people is less than the variation in the proportions based on 50 people.

Finally, if I did this one more time,

and sent the class out, and each of the 5,000 students took

a random sample of 500 Baltimore City residents,

and turned in not the individual measurements in the sample,

but just the p-hat summarizing the proportion of

the 500 persons in the sample who live below the poverty line,

and I got 5,000 p-hats each based on 500 people.

If I compiled those and presented a histogram of those 5,000 p-hats,

it would look like this.

You can see it's again symmetric and bell-shaped,

and even narrower and less variable than the previous two histograms I showed you.

So, just to get these in a side-by-side comparison,

here are the box plots of

these estimated sampling distributions each based on 5,000 proportions.

This first box plot shows the distribution of the

5,000 proportions each based on 50 people.

The one in the middle shows the distribution of the

5,000 proportions each based on 150 people.

The last one shows the distribution of the 5,000 proportions each based on 500 people.

With a little shift here, generally speaking,

the center of these distributions is comparable if not totally equivalent,

the medians are similar,

but again the more information that each statistic is based upon,

in other words the larger the sample the proportion is based on,

the less variability there is in proportions across the samples.

So, I created a theoretical population in my computer,

which had a proportion of 25 percent,

which is the latest estimate based on census data from the city as well.

So, I had a theoretical population of thousands and thousands and thousands of

persons and each had a value of one or

zero depending on whether they were in poverty or not,

and the proportion across these thousands and thousands and thousands of

individual observations I had for my population was 25 percent.

Here are the means of the 5,000 sample proportion estimates I get for each sample size.

For proportions based on samples of size 50,

the mean of those 5,000 sample proportions was 25 percent.

So, the average of my estimates was

the true population proportion even when each was based only on 50 people.

It stayed there, and the average was similarly 25 percent

if I average all 5,000 sample proportions based on 150 people each,

or if I average all 5,000 proportions based on 500 persons each.

But we can see and we saw visually that these

sample proportion estimates varied about the mean of the 5,000

sample proportions and the estimated standard deviation of these proportion estimates

is given here numerically, and this corroborates what we saw in the pictures:

the larger the sample that each proportion is based on,

the less variable the proportions are from sample to sample.
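The simulation just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code used for the lecture: it draws from an effectively infinite population via binomial counts (the lecture used a large simulated population of ones and zeros), and the seed and variable names are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # arbitrary seed, only for reproducibility

true_p = 0.25          # population proportion from the lecture
n_samples = 5000       # number of simulated samples per sample size
sample_sizes = [50, 150, 500]

results = {}
for n in sample_sizes:
    # Each binomial draw counts the ones ("yes") in one random sample of size n.
    # Dividing by n turns that count into the sample's p-hat.
    p_hats = rng.binomial(n, true_p, size=n_samples) / n
    results[n] = (p_hats.mean(), p_hats.std(ddof=1))

for n, (mean_p, sd_p) in results.items():
    print(f"n={n}: mean of p-hats = {mean_p:.3f}, SD of p-hats = {sd_p:.4f}")
```

The means should all land near 25 percent, while the standard deviations shrink as the sample size grows, which is the pattern shown in the histograms and box plots.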

Let's do another example but looking at incidence rates.

So the University of Massachusetts Research Unit IMPACT Study

includes data on time to relapse

for persons discharged from alcohol and drug rehabilitation facilities.

The data available on each participant include their follow-up time,

which is their time from discharge to relapse or censoring,

and whether or not the participant relapsed or was censored.

So, via simulation,

I'll use these data on hundreds of persons to

stand in for the infinite population, the theoretical population,

and I will investigate the sampling behavior of

a sample incidence rate using multiple samples of size 50,

100, and 250 respectively,

taken from the data set.

Just to get us started let me just show you

some information about two randomly selected samples of different sizes

just to try and understand what the data look

like in the population through the lens of these samples.

So, my first sample had 50 randomly selected people.

Here's a Kaplan-Meier curve for the percentage of

persons who have not relapsed by a certain time,

so who survive beyond that time without relapsing.

You could see the drop off is pretty steep in the first year following discharge,

roughly 365 days, and then it tends to stabilize a bit more.

In this sample of 50,

there were 134.9 relapses per 100 person years.

That's notably larger than some of the other rates we've seen.

If I look at the random sample of 250 persons here,

we get a similar Kaplan-Meier curve and also

similar estimated incidence rate of 125 relapses per 100 person years.
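As a reminder of how a rate like this is computed from follow-up data, here is a minimal Python sketch. The five participants below are made up for illustration and are not from the study, the function name is hypothetical, and converting days to years with 365.25 days per year is an assumption.

```python
# Hypothetical follow-up data for five participants (NOT from the study):
# days from discharge to relapse or censoring, and whether a relapse was observed.
follow_up_days = [120, 365, 45, 730, 200]
relapsed = [1, 0, 1, 0, 1]  # 1 = relapse observed, 0 = censored

def incidence_rate_per_100_person_years(days, events):
    """Number of events divided by total follow-up time, scaled to 100 person-years."""
    person_years = sum(days) / 365.25  # assumed days-per-year conversion
    return 100 * sum(events) / person_years

rate = incidence_rate_per_100_person_years(follow_up_days, relapsed)
print(f"{rate:.1f} relapses per 100 person-years")
```

Censored participants add follow-up time to the denominator but no events to the numerator.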

So, what I'm going to do now is show the results of

the estimated incidence rates across multiple random samples of the same size.

So, I took 5,000 samples each with 50 people from this population of rehab discharges,

and this histogram shows the distribution of 5,000 sample incidence rates

each based on a random sample of 50 persons.

You can see just like we saw with means and with proportions,

the distribution of the summary statistic across random samples

of the same size is roughly symmetric and bell-shaped.

We do this again,

but with each of my samples having 100 persons instead of 50.

So again, now I have 5,000 incidence rates each computed on a random sample of

100 persons discharged from rehab and

this distribution of 5,000 incidence rate values also looks symmetric and bell-shaped,

but the variability is less than that in

the previous picture and I'll show you box plots in a minute.

Then finally, I did the same thing and I took random samples of 250 discharges at a time,

computed the incidence rate of relapse for each sample of 250,

I did this 5,000 times,

again got 5,000 sample incidence rates and I'm plotting them in this histogram here.

So there's 5,000 points in this histogram,

each is an incidence rate based on a single sample of 250 persons.

So if I look at these distributions side by side, again,

I see similar to what I saw with proportions and means that the median value of

the estimated sample incidence rates is similar whether

the rates are based on 50, 100, or 250 persons,

but the variability in the incidence rates between samples decreases,

the larger the sample each statistic is based upon.

So now, let's compare characteristics of

our distributions of estimated incidence rates to

the underlying truth based on the theoretical population I

created based on the results of that University of Massachusetts study.

The true population incidence rate of relapse is 125.8 relapses per 100 person years.

So, that's the truth that we were estimating

multiple times via our sample incidence rates based on samples of different sizes.

The average of my 5,000 estimates based on samples of size

50 is 128.6 relapses per 100 person years,

so not exactly equal to the underlying truth,

but I'm going to say close in value.

The average of the 5,000 incidence rates,

each estimated from a sample of 100 persons is

126.6 relapses per 100 person years and the average of the

5,000 incidence rates each based on 250 persons is 126.2 relapses per 100 person years.

So, the averages across

the three estimated sampling distributions are not equal to one another in

value, and none of them exactly equals the truth,

but they all get close.

Theoretically, if we were to take an infinite number of random samples then

the mean of the infinite number of estimated incidence rates should be that truth.

You can see that the standard deviation or variability though in

these incidence rates around their respective mean of all 5,000,

decreases the larger the sample each incidence rate is based upon.
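The same kind of experiment can be sketched in Python. Everything about the population below is an assumption made for illustration: it is a simulated stand-in with exponential relapse times and censoring at three years, not the study data, and samples are drawn with replacement so that the data set can stand in for a much larger population.

```python
import numpy as np

rng = np.random.default_rng(seed=1)  # arbitrary seed, only for reproducibility

# Hypothetical stand-in population (NOT the study data).
pop_size = 600
time_to_relapse = rng.exponential(scale=1 / 1.25, size=pop_size)  # in years
censor_time = 3.0                                                 # assumed end of follow-up
follow_up = np.minimum(time_to_relapse, censor_time)
event = (time_to_relapse <= censor_time).astype(int)  # 1 = relapse, 0 = censored

# "True" rate in this stand-in population, per 100 person-years.
true_rate = 100 * event.sum() / follow_up.sum()

summary = {}
for n in [50, 100, 250]:
    # 5,000 samples of size n, drawn with replacement (one row per sample).
    idx = rng.integers(0, pop_size, size=(5000, n))
    rates = 100 * event[idx].sum(axis=1) / follow_up[idx].sum(axis=1)
    summary[n] = (rates.mean(), rates.std(ddof=1))

print(f"population rate: {true_rate:.1f} per 100 person-years")
for n, (mean_rate, sd_rate) in summary.items():
    print(f"n={n}: mean rate = {mean_rate:.1f}, SD = {sd_rate:.1f}")
```

In a run like this the means should sit close to the population rate without matching it exactly, and the standard deviations should fall as the sample size increases.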

So again, theoretical sampling distributions

for sample proportions and incidence rates across

random samples of the same size from

the same population can be estimated via computer simulation.

Simulation is a useful tool for helping

explore the properties of these sampling distributions,

and the properties we saw

here are the same properties we saw with means.

Namely, that the variation in sample estimates