0:42
So, the drill is going to be very similar to what we did with means here
because we laid out the logic in lecture six and the beginning of lecture seven A.
So let's
just jump in head first.
So recall the Kaggle data where we looked at the response to therapy
in a random sample of 1000 HIV
positive patients from a citywide clinical population.
And you may recall, out of the thousand subjects in this sample, 206 responded,
giving us a sample-based estimate of 20.6%, or 0.206, responding
in this sample as our best estimate for the population.
So how are we going to actually create an interval statement about the unknown true
proportion of people who
would respond in this citywide clinical population?
Well, we've got an estimate here, it's our best
estimate based on the information we have, this p hat.
And what we're going to do is exactly the
same drill: to create a 95% confidence interval,
we're going to add and subtract two standard errors.
And again, we're going to have to estimate the standard error of
the sample proportion from the single sample of data we have.
So you may recall that the formula for the standard error of a sample proportion is a
function of our estimated proportion itself and the sample size.
It would be the square root of the proportion who respond
in this sample, times the proportion that did not, divided by the size
of the sample. So with these data, the estimated standard
error looks like this. It's the square root of
0.206 times 1 minus 0.206 (so 79.4%,
or 0.794, of the sample did not respond), divided by
the sample size of a thousand. If you do
the math, this is approximately equal to 1.3%
or 0.013. So to do our computations
to get the confidence interval, we take this sample proportion
estimate, the p hat of 0.206, and add and subtract two standard
errors that we just computed of 0.013 or 1.3%.
So if you do the math on this, you get a confidence interval of 0.18
to 0.232. We could round this, but that's what you get if you do the
math directly. So I would probably
present this as 0.18 to 0.23, or
18% to 23.2%. So now, we've quantified
the rate of response in this population, both by our
best estimate of 20.6%, or 0.206, and now we've
given a range of possibilities for the true response
rate: 18% to 23.2%. This plus or minus two standard error
piece for a proportion is frequently called its margin of error.
And many of you have probably heard this phrase in
the news when the results from a poll are being reported.
For example, "this poll was conducted with a margin of error of plus or minus 3%."
And what they're telling you is the piece that you would add and subtract
to their estimate, to get a confidence
interval for the true proportion that's being estimated.
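The arithmetic for the HIV response example can be sketched in a few lines of Python. This is only an illustration of the "estimate plus or minus two estimated standard errors" drill described above; the function name `proportion_ci` and its `multiplier` argument are made up for this sketch and are not from the lecture.

```python
import math

def proportion_ci(successes, n, multiplier=2.0):
    """Return (p_hat, se, lower, upper) for a sample proportion.

    Hypothetical helper: uses the lecture's rule of adding and
    subtracting `multiplier` estimated standard errors.
    """
    p_hat = successes / n                      # sample proportion
    se = math.sqrt(p_hat * (1 - p_hat) / n)    # estimated standard error
    margin = multiplier * se                   # the "margin of error"
    return p_hat, se, p_hat - margin, p_hat + margin

# 206 responders out of 1,000 patients, as in the example above
p_hat, se, lower, upper = proportion_ci(206, 1000)
print(f"p_hat = {p_hat:.3f}, SE = {se:.3f}")
print(f"95% CI: {lower:.3f} to {upper:.3f}")
```

The margin of error here is just `multiplier * se`, which is the piece a poll would report as "plus or minus" some percentage.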
4:54
So, we wanted
to quantify the uncertainty in this. We
could estimate the standard error for this
overall proportion by taking the square
root of the sample proportion, 0.15,
times 1 minus that sample proportion, or
0.85, divided by the sample size. And this is, with
rounding, approximately equal to 0.019 or 1.9%.
So we could actually estimate the confidence interval for this population of
women who were HIV infected and pregnant. And it's a mixed population,
some were treated with AZT and some weren't.
6:23
Here's our example of colorectal screening,
and remember, from the results section,
we've sort of mentioned this before, and now we'll bring it in.
This is the study where they
actually looked at an automated intervention with
stepped increases in support to increase the uptake of colorectal cancer screening.
And they found that with increased intensity, with stepped increases in
support, they saw increased response or uptake of colorectal screening.
And they reported in each group,
the proportion who actually got screened within
two years after the study started, and
a 95% confidence interval for that proportion.
So they did it for the usual care group,
the one that was ostensibly given standard of care.
For the automated care group, the estimated proportion
screened for colorectal cancer was about half, 50.8%,
with a 95% confidence interval of 47.3% to 54.4%.
And you can go through this and look and see the estimates that
we reported in the section on binary
outcomes, now coupled with their confidence intervals.
7:36
And here are the proportions in each of the four groups.
And now
let's just focus on the usual care group.
So this would sort of describe what would happen if people went with
the status quo. If there were no changes in how
colorectal cancer screening was marketed to people, this is
what we'd expect to see in terms of people getting screened.
It would be about a quarter, slightly over
a quarter, of the population would get screened.
And our estimate of that is 26.3%,
based on the sample we have at hand.
8:07
So this describes what we'd expect if no changes were
made, if everyone was given the usual standard of care.
But of course, this is an imperfect estimate, because
it's only based on a subsample of about 1,166 persons.
So, let's put confidence limits on this to get a sense
of how much response we can expect at the population level.
So,
I'll leave this for the review exercises to actually formally do.
But if you actually do the routine, p hat plus or minus 2
estimated standard errors of p hat, the confidence interval
for the proportion who would get screened is between
0.237 and 0.289. So roughly 0.24
to 0.29, 24% to 29%.
So, this tells us that on the whole, we would expect somewhere around a quarter
of the population to get screened for colorectal cancer if
we continued to give the usual standard of care.
9:48
Well,
the drill is the same in terms of taking
our estimate and adding and subtracting two standard errors.
So here's the example from the Mayo clinic
and the primary biliary cirrhosis randomized clinical trial.
And if we wanted to just get a sense of the burden of
death in the entire population from which the sample was
taken, mixed between those who got treated with DPCA and those who got a placebo,
we can think of this as a mixed population where some
people got treatment and some didn't. Of course, more
interesting will be the comparison between those who got treated
and those who didn't, but this will just get us started.
The overall incidence rate of death in these data was
125 deaths per 1,715 years of follow-up, or an
incidence rate of 0.073 deaths per person-year of follow-up time,
and we showed how to compute that in lecture five.
11:25
So, in order to quantify the uncertainty, to put
an interval statement on the true incidence rate in this
population, this mixed population of some who got treated
and some who didn't, we take our estimated incidence rate
and add and subtract two estimated standard errors,
by that same logic we worked out in lecture seven A and in lecture
six regarding creating an interval of possible
values for an unknown truth using the
logic from the central limit theorem. So to do this, we take our
estimated incidence rate in these data, 0.073, and add and
subtract two standard errors to ultimately
get a confidence interval that goes from 0.06 deaths per person-year to
0.086 deaths per person-year. So of course, just like
the estimated rates can be re-scaled to different reference time periods, so can
the confidence interval. So if we scaled this up to per 1,000
years of follow-up time, the sample estimate
expressed on this scale would be 73 deaths per 1,000
person-years. And the confidence interval
would be 60 to 86 deaths per 1,000 person-years.
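Here is a minimal Python sketch of the incidence rate interval for the primary biliary cirrhosis data, including the rescaling to a different reference time period. It assumes the standard error is estimated as the square root of the number of events divided by the total follow-up time, as in this lecture; the function name `incidence_rate_ci` and its `scale` argument are hypothetical names chosen for illustration.

```python
import math

def incidence_rate_ci(events, person_time, multiplier=2.0, scale=1.0):
    """Return (rate, lower, upper), each multiplied by `scale`.

    Hypothetical helper: SE is sqrt(events) / person_time, and the
    interval is the rate plus or minus `multiplier` standard errors.
    """
    rate = events / person_time
    se = math.sqrt(events) / person_time
    lower = rate - multiplier * se
    upper = rate + multiplier * se
    # Rescaling the estimate and both limits by the same factor
    # changes the reference time period, not the information.
    return rate * scale, lower * scale, upper * scale

# 125 deaths over 1,715 person-years of follow-up
rate, lower, upper = incidence_rate_ci(125, 1715)
print(f"{rate:.3f} deaths per person-year ({lower:.3f} to {upper:.3f})")

# Same result expressed per 1,000 person-years
per_1000 = incidence_rate_ci(125, 1715, scale=1000)
print("per 1,000 person-years: {:.0f} ({:.0f} to {:.0f})".format(*per_1000))
```

Because the estimate and both limits are multiplied by the same constant, the interval keeps the same relative width on any time scale.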
So let's look at one more example for incidence rates, the maternal vitamin
supplementation and infant mortality data. This
is among the sample of Nepali children.
So among the entire sample of 10,295 Nepali children,
there were 644 deaths for 1,627,725 days of follow-up
time, for an estimated incidence rate, as we presented in lecture five, of
644 deaths per 1,627,725 days, or 0.0004 deaths per day.
13:23
So if we were to actually estimate the standard error of this incidence rate,
we take the square root of the number of deaths, the square root of 644,
and then divide by that total follow up time of over 1.6 million days.
And we get a standard error of 0.000016
deaths per day. So in order to create a confidence
interval for this mortality rate, the incidence rate of mortality in
the six months following birth in this population of Nepali children,
we'd take our estimate from the sample, 0.0004 deaths per day, and
add and subtract two standard errors, 2 times 0.000016 deaths per day.
And we get a confidence
interval of 0.00037 deaths per day, up to 0.00043 deaths per day.
And of course, we could present these on
a different scale, per year, per 100 years, etc.
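With rates this small it is easy to lose a zero, so a quick Python check of the Nepali infant mortality numbers may help. This sketch works from the raw counts rather than the rounded 0.0004 and 0.000016, so the printed limits need not match the rounded figures quoted in the lecture exactly. The conversion to child-years assumes 365.25 days per year, which is my assumption for illustration and not something stated in the lecture.

```python
import math

deaths = 644
follow_up_days = 1_627_725

rate = deaths / follow_up_days            # deaths per day
se = math.sqrt(deaths) / follow_up_days   # estimated standard error
lower = rate - 2 * se
upper = rate + 2 * se

print(f"rate  = {rate:.6f} deaths per day")
print(f"SE    = {se:.6f} deaths per day")
print(f"95% CI: {lower:.6f} to {upper:.6f} deaths per day")

# Rescale to deaths per 1,000 child-years.
# Assumption: 365.25 days per year (not from the lecture).
factor = 365.25 * 1000
print(f"per 1,000 child-years: {rate * factor:.1f} "
      f"({lower * factor:.1f} to {upper * factor:.1f})")
```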
14:35
So, in summary, what we've done in this section is more of the same.
To get a confidence interval for either
a proportion or an incidence rate, we've taken our
best estimate from a sample, and added and subtracted two
estimated standard errors to get a 95% confidence interval.
So this is 95% CI.
For proportions, the estimated standard error
is a function of the proportion itself and the
sample size. For an incidence rate,
the standard error is estimated by
taking the square root of the number of
events total in the sample, divided
by the total follow-up time. And if
we wanted to create other levels of interval, whether it be,
say, a 99% confidence interval or a 90%, we could alter
this formula slightly. For the 99% confidence
interval, we'd add and subtract 2.58 estimated standard errors.
And for the 90%, we could add and subtract 1.65 estimated standard errors.
For this course and in most of your
research life, you will exclusively use 95% confidence
intervals, but I just wanted to remind you
that in theory, that level is arbitrary and we
could present any level of confidence based on what we know about
the relationship between area under the
normal curve and number of standard errors.
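The multipliers for other confidence levels come straight from the standard normal curve, and Python's standard library can look them up. This is a small sketch using `statistics.NormalDist`; the helper name `normal_multiplier` is made up for illustration. For 95% it gives roughly 1.96, which the lecture rounds to 2, and for 90% it gives roughly 1.645, which the lecture rounds to 1.65.

```python
from statistics import NormalDist

def normal_multiplier(confidence):
    """Number of standard errors to add and subtract for a given
    two-sided confidence level, e.g. 0.95 (hypothetical helper)."""
    tail = (1 - confidence) / 2            # area left in each tail
    return NormalDist().inv_cdf(1 - tail)  # standard normal quantile

for level in (0.90, 0.95, 0.99):
    print(f"{level:.0%} interval: +/- {normal_multiplier(level):.3f} standard errors")
```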