A practical and example filled tour of simple and multiple regression techniques (linear, logistic, and Cox PH) for estimation, adjustment and prediction.


From the course by Johns Hopkins University

Statistical Reasoning for Public Health 2: Regression Methods



From the lesson

Module 3A: Multiple Regression Methods

This module extends linear and logistic methods to allow for the inclusion of multiple predictors in a single regression model.

- John McGready, PhD, MS, Associate Scientist, Biostatistics

Bloomberg School of Public Health

Greetings, and welcome.

It's Lecture Set 7.

In this set of lectures,

we'll do a parallel treatment of logistic regression, like we did for

linear regression, for estimation, adjustment, and basic prediction.

So, in the first section,

we'll look at some examples of multiple logistic regression.

We'll also get back to doing the same thing in Section C, where we'll look at

examples from the published literature.

So hopefully, after viewing this section, you'll have some sense of how to interpret

the estimates from multiple logistic regression models in a substantive or

scientific context.

and to compare the results from simple and

multiple logistic regression models to assess potential confounding.

So let's first look at the data from a random sample of 192 Nepali children

between a year and three years old, 12 to 36 months old.

We already looked at the relationship between breast feeding and

age in this group.

We looked initially at the relationship between breast feeding and sex

in the group that included children between 0 and 12 months.

But we're going to restrict our analysis now to this group that's a year or older,

up to three years.

So in this sample,

70% of the children were being breast fed at the time of the study.

The sample is 48% female, and

we're going to have a sex variable that's coded as 1 for males and 0 for females.

So here's the result, the unadjusted association, in the sample.

It says the log odds of being breast fed is a function of

this intercept, 1.12, plus 0.02 times the sex of the child,

where 1 is for males and 0 is for females.

So this equation is only estimating two log odds:

one for females, which is the reference group, and one for males.

This slope estimate of 0.02 is the log odds ratio of being breast fed for

males compared to females.

If we exponentiate this, we'll get an odds ratio estimate of approximately 1.02.

So, in the sample, males have a slightly higher odds.

We could get a confidence interval for this by taking the estimate plus or

minus 2 times the standard error which I haven't given you in this slide.

But the idea is exactly the same as we were doing before.

And then this intercept of 0.83 estimates the log odds of being breast fed for

female children in the sample.

And we could translate that, exponentiate it to get the odds for

female children, and estimate the probability or proportion from that.
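To make that translation concrete, here is a minimal Python sketch using the rounded numbers quoted for this simple model: the intercept of 0.83 and the slope of 0.02. The variable and function names are ours, not from the lecture, and the results are only as precise as the rounded inputs.

```python
import math

# Rounded estimates quoted in the lecture for the unadjusted model.
intercept = 0.83   # log odds of being breast fed for females (sex = 0)
slope_sex = 0.02   # log odds ratio, males compared to females


def odds_to_probability(odds):
    """Convert odds to a probability: odds / (1 + odds)."""
    return odds / (1.0 + odds)


odds_ratio = math.exp(slope_sex)              # odds ratio, males vs females
odds_female = math.exp(intercept)             # odds for the reference group
odds_male = math.exp(intercept + slope_sex)   # odds for males
p_female = odds_to_probability(odds_female)
p_male = odds_to_probability(odds_male)

print(f"odds ratio (male vs female): {odds_ratio:.2f}")
print(f"female: odds {odds_female:.2f}, probability {p_female:.2f}")
print(f"male:   odds {odds_male:.2f}, probability {p_male:.2f}")
```

Both estimated probabilities land close to the 70% overall proportion reported for the sample, which is what we'd expect when the two sexes barely differ.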

Now, let's bring in age of the child.

We saw before a negative association, not surprisingly between the age of

the child and breastfeeding status in this age group.

Let's put sex and age together and see what the results look like.

So here is the result which is the regression equation that includes sex as

a predictor.

1 for males and 0 for females, and age in months as a continuous predictor.

We'd already visually assessed that with a lowess plot and

saw that, at least in

the unadjusted case, it made sense to model it as a linear function.

So, the estimated association, the log odds ratio for males compared to females, is 0.27.

If we were to put a confidence interval on that, we could do that by

taking the estimate plus or minus 2 times its estimated standard error.

We're so familiar with doing that now that I'll just jump to the results here.

Even though this is a multiple regression model,

the approach is exactly the same for

confidence intervals as we've seen for slopes from simple regression models.

So, the slope or the log odds ratio of males to females, it was positive.

But, after accounting for sampling variability, the confidence interval

includes 0, and the resulting p-value for testing the null that the slope is 0,

or that the resulting adjusted odds ratio is 1, is 0.48.

So after accounting for the age of the child,

we don't see a statistically significant association between the

log odds of being breast fed and the sex of the child.

However, the slope for age is statistically significant.

All possibilities in the confidence interval are negative and the p-value for

testing the null, that there's no association between breast feeding and

age after accounting for sex is small.

So let's parse these estimates a little bit further,

make sure we detail what they mean.

So, the slope estimate for sex is beta 1 hat equals 0.27.

And it's still an estimated log odds ratio of breast feeding for

male children to female children, but

it's removed any difference in the age distributions between those.

We've adjusted for age.

So, this number compares male children to female children

of the same age.

And this is called the age adjusted association between breast feeding

and sex.

So, the log odds ratio was 0.27.

We know to get the odds ratio we would just exponentiate that.

And that turns out to be 1.3.

So, male children in the sample have 30% greater odds of

being breast fed than female children in the sample of the same age.

However, when we account for

the uncertainty in the estimate and

see if it holds at the population level, we get what we already saw on the log odds scale:

the confidence interval included 0.

We exponentiate those endpoints to get a confidence interval for the

population level association on the odds ratio scale.

The resulting confidence interval goes from 0.61 to 2.82 and

includes the null value of 1.
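Here is a short Python sketch of that recipe: build the interval on the log odds scale, then exponentiate the endpoints. The slope of 0.27 is from the lecture, but the standard error was not given on the slide, so the value 0.385 is a hypothetical one, chosen only because it roughly reproduces the reported interval. The function name is ours.

```python
import math


def odds_ratio_ci(log_or, se, multiplier=2.0):
    """Return (odds ratio, lower, upper).

    The interval is built on the log odds scale as
    estimate +/- multiplier * se, and the endpoints are then exponentiated.
    """
    lower = log_or - multiplier * se
    upper = log_or + multiplier * se
    return math.exp(log_or), math.exp(lower), math.exp(upper)


# 0.27 is the age adjusted slope for sex quoted in the lecture.
# ASSUMPTION: se = 0.385 is hypothetical (not from the lecture); it was picked
# so the result lands near the reported interval of 0.61 to 2.82.
or_hat, ci_lower, ci_upper = odds_ratio_ci(0.27, 0.385)
includes_null = ci_lower < 1.0 < ci_upper

print(f"odds ratio {or_hat:.2f}, interval {ci_lower:.2f} to {ci_upper:.2f}")
print("interval includes the null value of 1:", includes_null)
```

The interval on the odds ratio scale is not symmetric around the estimate, because the symmetry only holds on the log scale where it was computed.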

So after accounting for sampling

variability, sex is not associated with breast feeding after accounting for age.

And it wasn't associated even when we ignored age in the simple model.

We'll look at the estimate for age, the slope is negative 0.24.

This is still an estimated log odds ratio of breast feeding for

children who differ by one month in age, older to younger.

But now we've removed any differences in the sex distributions across

the age groups.

So this compares children who are of the same sex, males to males who differ by

one month in age or females to females who differ by one month in age.

And this is called the sex adjusted association between

breast feeding and age.

So the resulting odds ratio estimate here, if we exponentiate this, is 0.79.

This suggests that a one month difference in age is associated with

a 21% reduction in the odds of being breast fed, for

older compared to younger, among children of the same sex.

And then there's the 95% confidence interval for the population level, sex adjusted odds ratio.

If we just exponentiate those endpoints of the confidence interval for

the slope, it goes from 0.73 to 0.84.

After accounting for sampling variability,

there's clear evidence of a population level association.

And while we estimate a 21% reduction,

this could be anywhere up to a 27% reduction.

Or as quote, unquote, small as a 16% reduction in the odds per month of age.
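The arithmetic behind those percentages can be sketched in a few lines of Python. All the numbers are the ones quoted above; only the variable and function names are ours.

```python
import math

slope_age = -0.24              # sex adjusted log odds ratio per month of age
ci_low, ci_high = 0.73, 0.84   # reported 95% interval on the odds ratio scale


def percent_reduction(odds_ratio):
    """Percent reduction in the odds implied by an odds ratio below 1."""
    return (1.0 - odds_ratio) * 100.0


or_age = math.exp(slope_age)

print(f"odds ratio per month of age: {or_age:.2f}")
print(f"estimated reduction: {percent_reduction(or_age):.0f}%")
print(f"largest plausible reduction: {percent_reduction(ci_low):.0f}%")
print(f"smallest plausible reduction: {percent_reduction(ci_high):.0f}%")
```

Note that the lower endpoint of the odds ratio interval corresponds to the larger reduction, and the upper endpoint to the smaller one.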

So, how could we present the findings from our simple and

adjusted models together?

Well, as we've seen with linear regression, in research articles frequently a single

table of unadjusted and adjusted associations will be presented.

So if we were just putting the results here, we could do something like this.

We could put the unadjusted column first, and

look at the way I've handled sex: here I've said male and female.

And for the unadjusted odds ratio for

females, I put a 1; that's a way of indicating that it's the reference group.

There are no confidence limits, et cetera.

This is what the other levels are being compared to.

Of course, there's only one other level of sex, which is male.

And this presents the unadjusted

odds ratio and confidence interval.

And then this does the same thing for age, the unadjusted.

And then this column here shows the results from

the model that includes both sex and age.

Well, let's just take a look at this for a minute.

Certainly the estimated relationship between sex and

breast feeding changed from an odds ratio of 1.02 to 1.3 for

males compared to females, after adjusting for age.

So it appears that within the sample there might have been a slight discordance in

the age distributions between males and females that was

masking some of the association, but

of course, this association is not statistically significant, and

the confidence intervals overlap a fair amount.

So, I would say that, in general, qualitatively, there was no

overall confounding of the sex relationship after we accounted for age.

Similarly, if we look at the age association,

it's identical to what it was when we ignored sex,

with the same confidence intervals.

So, it's pretty clear that the

original association we saw between breast feeding and age of the child

was not attributable at all to any sex differences between the age groups.

One more thing that you'll sometimes see in papers, and

this looks crazy: when you exponentiate that intercept of 7.2, it is on the order

of 1,333, and that's the baseline odds.

This is the group odds when age is zero

and when we are looking at females, the reference group.

We don't have any newborns in this sample;

remember, we started at 12 months.

So, this expresses sort of the starting odds, wherever you start,

when we make these comparisons.

And even though we don't describe a single group,

it's worth noting that if this is the starting odds,

then the estimated probability of being breast fed that we're working off of

as a reference on the log odds scale is very close to 1.

Remember, we just take the odds over the odds,

plus 1 to get the estimated probability.

So even though there isn't a group in our sample that's newborn and female,

because we don't start with children till they're 12 months old,

this tells us that

the general idea is we're starting from a very high point and coming down from that.

And remember the overall proportion of people being breast fed in

this sample was on the order of 70%.

So by the time we actually get into estimates with our x

values that are relevant to our data set,

that will reduce it from this high starting point of

a really large odds to begin with.
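A small Python sketch can show how the estimate comes down from that starting point. It assumes the intercept of 7.2 belongs to the model with sex and age, and it uses the rounded slopes quoted earlier (0.27 for sex, negative 0.24 per month of age), so the predicted probabilities are rough illustrations rather than the fitted values from the software. The function name is ours.

```python
import math

# Rounded estimates as quoted in the lecture (model with sex and age).
INTERCEPT = 7.2     # log odds for a female child at age 0, an extrapolation
SLOPE_SEX = 0.27    # male (1) compared to female (0)
SLOPE_AGE = -0.24   # per month of age


def predicted_probability(sex, age_months):
    """Estimated probability of being breast fed: odds / (1 + odds)."""
    log_odds = INTERCEPT + SLOPE_SEX * sex + SLOPE_AGE * age_months
    odds = math.exp(log_odds)
    return odds / (1.0 + odds)


# Age 0 is outside the data; 12 to 36 months is the range actually sampled.
for age in (0, 12, 24, 36):
    print(f"female, {age:2d} months: {predicted_probability(0, age):.3f}")
```

The age 0 row reproduces the "very close to 1" starting point, and the rows within the 12 to 36 month range show the probability falling as age increases.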

So, there are some additional predictors in

these data that we could look at.

One is maternal parity.

I actually put it into five groups for these data.

For about 17% of the sample, the child we're looking at now is the mother's first child;

she had no previous children prior to this one.

Another 16% had one previous child.

Another 14% had two previous children before the one we're looking at and

analyzing the breast feeding status of.

Another 15% had three previous children.

And over a third of the sample, 38%, had four or more previous children.

Also of interest, and it could very well be related to parity,

although it doesn't necessarily have to be, is the maternal age.

In the sample, the mean age of the mothers of

these children, is 27.7 years with a range of 17 to 43 years.

So, I'm just actually going to present the results from several models

side-by-side to have us take a look at what's going on.

So, here is the unadjusted column.

This shows the associations between sex and age we already talked about.

Let's look at what's going on with maternal parity.

It looks like, relative to the reference,

which is no previous children, children whose mothers had

at least one prior child have lower odds of being breast fed

than those whose mothers had no previous children.

This doesn't take into account any other characteristics,

as it's the unadjusted association.

But you'll see these confidence intervals are all very wide and

include the null value of one.

And in fact we can test

the overall construct, whether there are any differences in the odds

of breast feeding among children for any comparison of these parity groups.

And the nice thing about this overall p-value is it also tests, behind

the scenes, the differences between these groups.

It doesn't just test the comparisons to the reference, which is

all that we see with the odds ratios.

And so, the null is that there's no association between breast feeding

and parity.

There are no differences in the odds across any of the parity groups, and

at the unadjusted level, we would fail to reject that.

We didn't find anything here with mother's age either.

An increase in mother's age is associated with a slight decrease in

the odds of being breast fed, but it's not statistically significant.

So, if we go forth to the adjusted associations,

I'm just going to let you look at these a little bit, but

you'll see that very little changes about the story with age.

We certainly didn't see a change at all when we adjusted for sex.

This model here additionally adjusts for maternal parity.

And this final model here includes male

sex, maternal parity, and maternal age.

And you can see that the age association is robust.

It stays pretty much the same regardless of the other things in the model.

Similarly, the story with sex remains, in

that there's no statistically significant association or

difference between the odds of being breast fed for males and females. Whether

we don't consider these other factors, in the unadjusted sense, or

we consider and adjust for age,

or add in maternal parity, or add in mother's age on top of that,

the story is pretty much the same across these adjusted estimates.

Similarly, if we look at maternal parity,

I won't walk you through these, but when we include it after including sex and

age, it doesn't become a statistically significant predictor.

And in fact, the odds ratio estimates look slightly different

than they did for some of the groups.

But it doesn't appear that qualitatively there's much of a difference here.

And certainly, statistically speaking, there's not.

And in the model that includes everything, sex, age of the child,

maternal parity, and maternal age,

maternal parity is not a significant predictor either.

Nor is maternal age.

It doesn't become so after adjusting for these other things.

So, on the whole I think the big story here is

that the only predictor that holds up at all,

and consistently so, is the age of the child.

Among these candidates, there don't appear to be differences related to sex,

maternal parity, or mother's age,

with or without considering each other in adjusted models.

Let's just focus on diagramming this Model 4 for a minute.

This just comes from a multiple logistic regression with a bunch of xs.

And so if you actually looked at the model,

I'm just going to write the counterpart to each of these.

This is the baseline odds.

This is the exponentiated intercept.

So on the log odds scale,

the original regression scale,

this intercept would be the natural log of 7,071, which is 9.1.

And if we actually took the logs of each of these odds ratios,

I'll just put them in.

So that's for, sorry, that's for parity.

For maternal age, it's negative 0.017.

And for age, it's negative 0.26, and then for sex

it's 0.21, the only positive coefficient we have here.

So, this is just it on the log scale, and so the equation that gave us these estimates

was the log odds of being breast fed equals 9.1 plus

negative 0.017 times mother's age in years,

et cetera, et cetera.

And these estimates come from the exponentiated coefficients.

And all of these confidence intervals were first done by the computer on

the log odds ratio scale by taking the estimate for each and adding and

subtracting 2 estimated standard errors given by the computer.

And then exponentiating those end points to get the confidence interval.
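The back and forth between the two scales can be sketched in Python using the three Model 4 slopes read out above. The intercept and the parity slopes are left out because their values are not all stated here, and the dictionary labels are ours.

```python
import math

# Model 4 slopes on the log odds scale, as read out in the lecture.
slopes = {
    "mother's age (per year)": -0.017,
    "child's age (per month)": -0.26,
    "male sex (vs female)": 0.21,
}

# Exponentiating each slope gives the adjusted odds ratio shown in the table.
odds_ratios = {name: math.exp(b) for name, b in slopes.items()}

# Taking the natural log of each odds ratio recovers the slope.
recovered = {name: math.log(value) for name, value in odds_ratios.items()}

for name in slopes:
    print(f"{name}: slope {slopes[name]:+.3f}, odds ratio {odds_ratios[name]:.3f}")
```

A negative slope always maps to an odds ratio below 1 and a positive slope to an odds ratio above 1, which is why sex is the only odds ratio above 1 among these three.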

So, let's look at our data from the 2009-2010 NHANES to give another example of

multiple logistic regression and comparing the results from several models.

We initially looked at HDL cholesterol levels and whether they

predicted obesity in the population from which the sample was taken.

The sample is 6,400 US residents, 16 to 80 years old,

so that's a population of 16 to 80 year old US residents.

And we saw that the HDL levels, the average was 52.4 milligrams per deciliter.

There was substantial variability in the sample, and

15% of the sample was obese by BMI.

So, some other potential predictors we might want to look at include sex,

the age in years of the person, and their marital status.

We're trying to get a demographic and

physiological overview of predictors of obesity using these data.

So, just some things to consider, just to let you know.

In this sample 49% of the sample was female, 51% was male.

The average age in the sample was 46.3 years ranging from 16

to 80 as we talked about before.

And, there were actually six different categories of marital status.

Married, which a little over half the sample identified as.

Another 9% were widowed.

Another 11% had been divorced.

3% were separated.

18% had never been married.

And the remaining 7% classified themselves as in

a relationship where they were cohabiting or living with a partner.

So, I just want to show you some things I looked at before moving forward with this.

I wanted to get a sense of what the obesity,

age relationship looked like because that was measured on a continuum, in years.

And here are the results from a lowess plot that

shows the unadjusted association.

And you can see it doesn't appear that this is well described by a line.

In fact, if I fit a line and assumed a linear relationship,

the best fitting line for this picture would probably miss

the story because it would probably be close to flat.

So, there are a couple of things we'll talk about

in a later lecture on how we might handle this, but

there are procedures you can use to fit a changing slope.

A slope that changes at different points over the relationship to allow for

the association to be non-linear.

But another catch-all method to kind of deal with this is just to

create groups of age and model that as a categorical predictor.

So that's what I chose to do, I broke age into quartiles for these data.

First group was from the minimum to the 25th percentile,

second group from the 25th percentile to the 50th, etc.
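As a rough illustration of that grouping step, here is a Python sketch that cuts ages at the quartiles and then builds the three 0/1 indicators a regression would use, with the youngest group as the reference. The ages are made up for the example and are not the NHANES data, and the function names are ours.

```python
import statistics

# ASSUMPTION: hypothetical ages in years, for illustration only (not NHANES).
ages = [16, 19, 23, 27, 31, 35, 40, 44, 47, 52, 58, 61, 66, 71, 75, 80]

# Three cut points: the 25th, 50th, and 75th percentiles.
q1, q2, q3 = statistics.quantiles(ages, n=4)


def age_group(age):
    """Return 1 to 4 for the quartile group an age falls into."""
    if age < q1:
        return 1
    if age < q2:
        return 2
    if age < q3:
        return 3
    return 4


def indicators(age):
    """Three 0/1 indicators for groups 2, 3, and 4; group 1 is the reference."""
    group = age_group(age)
    return [int(group == k) for k in (2, 3, 4)]


for age in (18, 33, 50, 70):
    print(age, "-> group", age_group(age), "indicators", indicators(age))
```

A person in the reference group gets all zeros, so each of the three slopes in the model compares one older group to the youngest group.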

So, let's look at the results from some regression models actually putting this

all together.

So I'm first going to focus on the unadjusted associations with

each of these things.

So this is the result we got before with HDL,

the odds ratio of being obese for

two groups who differ by one milligram per deciliter.

I'm going to take it to three decimal places because otherwise

the confidence interval endpoints would round to the estimate, and

it would have looked a little confusing.

So, the odds ratio here was 0.967, so

a relative decrease in the odds of being obese of 3.3%

per one milligram per deciliter difference in cholesterol level.
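A per unit odds ratio this close to 1 is easier to appreciate over a larger difference. The Python sketch below converts 0.967 into the percent decrease per milligram per deciliter, and then, assuming the relationship is linear on the log odds scale as the model does, into the odds ratio for a 10 milligram per deciliter difference. The 10 unit comparison is our illustration, not a number from the lecture.

```python
import math

or_per_mg = 0.967   # unadjusted odds ratio per 1 mg/dL of HDL, from the lecture


def odds_ratio_for_difference(or_per_unit, difference):
    """Odds ratio for two groups differing by `difference` units.

    Assumes linearity on the log odds scale: multiply the log odds ratio
    by the difference, then exponentiate.
    """
    return math.exp(math.log(or_per_unit) * difference)


pct_per_mg = (1.0 - or_per_mg) * 100.0
or_10 = odds_ratio_for_difference(or_per_mg, 10)   # illustrative 10 mg/dL gap

print(f"decrease in odds per 1 mg/dL: {pct_per_mg:.1f}%")
print(f"odds ratio for a 10 mg/dL difference: {or_10:.3f}")
```

The reduction over 10 units is less than 10 times 3.3% because the per unit odds ratios multiply rather than add.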

And this, at the unadjusted level, as we saw before, was statistically significant,

and the confidence interval for the odds ratio does not include one.

If we compared males to females, no other factors considered, males had

75% higher odds of being obese compared to females in the sample.

And after we accounted for sampling variability, this

was statistically significant.

Age category was,

on the whole, associated with obesity; this is the p-value testing the null

that all the associations, all the category comparisons,

are equal to 0 on the log scale, or all the odds ratios are equal to 1.

In other words, the null of no association between obesity and

age, when age is modeled as four categories,

would be rejected, and we can see that, for the most part,

increasing age is associated with increased risk of being obese,

relative to the reference group of being less than 30 years old.

But it's pretty similar comparison for the second quartile,

the group 30 to 46 years old and the group 46 to 62 years old.

1.79 and 1.82, respectively with very similar confidence intervals.

And then the estimate shifts down a bit.

It's still greater than one, statistically significant for

the group, that is greater than or equal to 62 years.

But is a smaller estimated odds ratio when

compared to the same reference as the other two.

If we look at marital status, interestingly enough, it doesn't appear

that in the unadjusted sense there is any association between marital status and obesity.

The null is not rejected and the p-value is 0.69.

But you can see that the widowed, divorced, and

separated categories all have higher estimated odds, but the

differences are not statistically significant when compared to the married group.

Whereas the never married group has almost equivalent odds of

being obese compared to the married group.

And the living together group has slightly lower odds in the sample.

But after accounting for

sampling variability, it's not statistically significant.

So let's look at Model 2.

Model 2 is the multiple model that includes HDL, sex, and age as predictors.

And let's see what the results are,

compare the adjusted associations from this model to the previous unadjusted.

So, we can see that the association with HDL is still

such that increased HDL is associated with decreased odds of being obese.

It's of similar magnitude to the unadjusted estimate and

still statistically significant.

So, it doesn't look like that association we initially saw was greatly explained

by sex or age differences in the HDL group, so very little confounding here.

Interestingly enough though, the sex comparison increases and

the confidence interval shifts up relative to what it was when we ignored HDL and age.

And it looks like if we were to compare males to females of the same age and

HDL levels, the males would have 2.6 times the odds of being obese.

And that's statistically significant, and the confidence interval is different.

It shifts up compared to what it was in the unadjusted.

So, it looks like some of the male female relationship was being dampened

because of behind the scenes relationships between obesity,

sex, HDL, and potentially age.

If you look at the age association,

it's still statistically significant on the whole, and

there is some movement in the estimates compared to the unadjusted counterparts.

But qualitatively speaking, it's the same ordering of associations, and

they're all statistically significant.

So, really the biggest story about confounding was

with the sex association.

It appeared to shift upwards and

differ statistically after adjustment when compared with the unadjusted.

In Model 3, just to look at the sensitivity to bringing in marital status,

I added that to the model that we just looked at.

It had very little impact on the first two, relative to

what they were when we only included HDL, sex, and age.

The age category comparisons attenuated a bit after accounting additionally for

marital status.

But qualitatively the same ordering exists.

And again, the results are statistically significant, indicating that

older age,

beyond 30, is generally associated with increased odds relative to less than 30.

And if you look carefully at the marital status estimates and

the confidence intervals, the lack of significance

remained; the overall p-value for testing this association was on the order of 0.5.

That tests the null that marital status is not associated with

obesity after accounting for HDL, sex, and age.

And if you look at the estimates here, some of them change a little bit.

The confidence intervals are all very wide.

But I think the general story is that

these models show that HDL, sex, and age are significant predictors of obesity,

and generally, with the exception of sex,

did not impact each other's association with the outcome of obesity.

And after additionally accounting for

marital status, marital status doesn't add anything statistically.

And doesn't change our results for the other three.

So again, we could look at this model, for example, the one with age.

And we could figure out what the coefficients were,

not that we want to work backwards.

But this does come from a logistic regression model that

started off in the form: log odds of obesity

equals some intercept estimate.

And then we'd have our three slopes for

the three indicators of age group, because this is being treated as categorical.

This would be our age, for example.

And then we'd have something for being male versus female.

And then we'd have something for HDL.

And it really doesn't matter what you name these coefficients and

xs; just know that there are three predictors:

HDL, sex, and age.

And then, if we wanted to get the corresponding coefficients and

the intercept, we could take the log of the estimates given here.

Not that we want to do this backwards,

but generally, this is where these results came from.

And the confidence intervals came from behind the scenes:

the computer estimated the regression slopes, and then a standard error, and

added and subtracted 2 standard errors from these estimates, business as usual,

and then exponentiated those to get the endpoints on the odds ratio scale.

So, in summary logistic regression is a tool to allow us to look at

binary outcomes as a function of multiple predictors at once.

And we've already defined, in Lecture 2, simple logistic regression.

And we talked about how our slopes are interpretable as log odds ratios, and how

they can be exponentiated to become odds ratio estimates,

and how we can create confidence intervals for the slope and exponentiate those results

to get confidence intervals for the odds ratio of interest.

But we can just extend that with multiple regression:

we can look at the impact of potentially multiple predictors on the binary outcome.

We can get estimates of the adjusted association between each predictor and

the outcome, adjusted for the other variables in the model.

And then, we compare these adjusted estimates and

their confidence intervals to the unadjusted estimates to get some sense of

the degree of confounding if there is any.
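One simple way to put a number on that comparison is the relative change in the odds ratio after adjustment. The Python sketch below does this for the two sex comparisons discussed in this section, using the odds ratios quoted in the lecture. The lecture itself judges confounding qualitatively and by looking at the confidence intervals, so treat this as an arithmetic aid only; the function name and labels are ours.

```python
# Odds ratios quoted in the lecture: (unadjusted, adjusted).
comparisons = {
    "breast feeding, male vs female (adjusted for age)": (1.02, 1.30),
    "obesity, male vs female (adjusted for HDL and age)": (1.75, 2.6),
}


def relative_change(unadjusted, adjusted):
    """Percent change in the odds ratio going from unadjusted to adjusted."""
    return (adjusted / unadjusted - 1.0) * 100.0


changes = {
    label: relative_change(unadj, adj)
    for label, (unadj, adj) in comparisons.items()
}

for label, change in changes.items():
    print(f"{label}: {change:+.1f}% change after adjustment")
```

A sizable change in the estimate is only part of the picture. For breast feeding, the wide, overlapping confidence intervals meant the shift was not read as meaningful confounding, whereas for obesity the whole interval moved up.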

And we can also decide which of these predictors add information about

the outcome, versus which do not by looking at the statistical significance.

So, it's just a very nice logical extension of what we did with

simple logistic regression, and the ideas in terms of the interpretation parallel

exactly what we do with multiple linear regression.

This is really just another form of the same thing.

It just so happens that the scale on which the estimates are is different.

So, in the next section we'll talk a little bit about

the basics of using the multiple regression models to compare the odds for

groups who differ by more than one predictor,

and to estimate probabilities or proportions of

persons having the outcome given their x values.

And then in the third section, we'll look at some examples of logistic regression

from the published public health and medical literature.
