1:02

In addition, if your data is strictly positive, then having additive Gaussian errors presents a problem, because that allows for positive mass on negative values.

Now, that may not be a problem, and I often say that it's not a problem to have strictly positive data where the errors look Gaussian and there's just very low probability down toward the negative values. That's often okay.

But, on the other hand, if the normal distribution is putting a lot of probability on negative values even though you know your response has to be positive, then that's problematic for your model.

1:45

On the other hand, you might try a transformation. A common transformation when your outcome has to be strictly positive is to take the natural log of it.

The natural log, in my mind, is perhaps the most interpretable transformation possible. Putting that one to the side, there are lots of other transformations available to try to make our data more normal.

For example, for binomial data, people often take a so-called arcsine square root transformation. That often destroys a large amount of the interpretability of our model coefficients, which is a real problem.

2:36

There's another reason to perhaps approach generalized linear models rather than transforming the outcome or approximating things with a linear model: it's just nice and pleasant to have a model on the scale on which the data was recorded, without a transformation, one that really honors the known assumptions about the data. So, if we have binary data, a model that really honors the fact that the data is binary, and doesn't require us to transform it, has a lot of pleasantness to it and makes a lot of internal sense.

3:20

And then I mentioned here on this last point that the natural log transformation, which is probably the most common transformation, isn't applicable for negative or zero values. Now, there are some fixes for that, but those then harm some of the nice properties of that transformation, some of the really nice interpretable properties.

So, generalized linear models come from a 1972 paper by Nelder and Wedderburn, a kind of famous paper that, if you're a PhD statistician, you've almost certainly read.

A generalized linear model has three components.

First of all, the distribution that describes the randomness has to come

from a particular family of distributions called an exponential family.

This is a large family of distributions that includes things like the normal, the binomial, and the Poisson.

4:14

So that's the random component, right? The exponential family is the random component.

Then the systematic component is the so-called linear predictor. That's the part that we're modeling. And we've done this very much so in linear models already: the random component was the errors, and the systematic component was the linear component with the covariates and coefficients. Then we need some way to connect the two, and the link function connects an important quantity from the exponential family distribution to the linear predictor. So there are three things we need. We need the distribution, which is going to be an exponential family for a generalized linear model. We need the systematic component, which you can think of as the linear predictor, the set of regression variables and coefficients. And then we need a link function that links the two together.

Okay, so let's try an example.

5:17

And we'll go to our familiar example which is linear models.

The subject that we've been covering the entire class up to this point.

So, in this case we are assuming that our Yi is normal with mean mu i, and this happens to be an exponential family distribution. Then we're going to define the linear predictor, this eta i, okay, to be the collection of covariates, the x's, times their coefficients, beta, okay?

And in this case,

our link function is just going to be the identity link function.

It just says that the mu from the normal distribution is exactly this sum of covariates times coefficients.

And so this just leads to the same model that we've been talking about all along.

We could just write it again as Yi equal to the mu component, which is the summation of xi beta, plus the error component.

So, we've simply written out the linear model we've talked about in a different way. Instead of saying the error is normally distributed, we say the Y is normally distributed, which is a consequence of our other specification.

We specify a linear predictor kind of separately.

Okay. And then we just connect

the mean from the normal distribution to the linear predictor.
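To make the three components concrete, here is a minimal Python sketch of the normal/identity case; the numbers are made up for illustration and are not from the lecture:

```python
# Hypothetical coefficients and covariates for one observation i
beta = [1.0, 2.0]   # intercept and slope (made-up values)
x_i = [1.0, 3.0]    # covariate vector for observation i

# Systematic component: the linear predictor eta_i = sum_k x_ik * beta_k
eta_i = sum(x * b for x, b in zip(x_i, beta))

# Identity link: the mean of the normal distribution IS the linear predictor
mu_i = eta_i

print(mu_i)  # 7.0, the same mean the ordinary linear model would give
```

With a Gaussian error tacked on, Yi = mu_i + epsilon_i recovers exactly the linear model written out the usual way.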

And you might think this seems like a crazy thing to do when just writing it out as an additive sum with additive errors seems like such an easy thing to do. But as we move over to different settings, like the Poisson and the binomial settings, it'll be quite a bit more apparent why we're doing this.

So let's take, in my mind, perhaps the most useful variation of generalized linear models: logistic regression.

So in this setting we have data that are zero-one, so binary. And it doesn't make a lot of sense to assume they come from a normal distribution.

So the natural, really the only, distribution available to us for coin flips, for zero-one outcomes, is the so-called Bernoulli distribution. So we're going to assume that our outcome Ys follow a Bernoulli distribution, where the probability of a head, the so-called expected value of Yi, is mu i.

Okay, so we're modeling our data as if it's a bunch of coin flips. Only the probability of a head may change from flip to flip, and it's not necessarily 0.5. Okay, so the probability is given by this parameter mu sub i.

The linear predictor is still the same. It's just the sum of the covariates times the coefficients.

Now the link function in this case, the most famous and most common one, is the so-called logit link function, or the log odds.

So in this case the way in which we're going to get from the mean, the probability of a head, to the linear predictor is to take the log of the odds. The odds are the probability over one minus the probability, so in this case we have written it as mu over one minus mu.

We're going to take the natural logarithm of it.

8:57

So notice, we're transforming the mean of the distribution.

We're not transforming the Ys themselves, okay, right?

That's a big distinction, and that's the neat part of generalized linear models.

What are we transforming? We're assuming that, since we're modeling our data as coin flips, each coin has a probability of getting a head, and that probability, if we transform it in a specific way, relates to our covariates and coefficients, okay?

So we can go backwards from the log odds

to get back to the mean itself, okay?

And so the inverse logit, I guess I often call it the expit, though I don't know how standard that is,

Is e to the eta i over 1 plus e to the eta i, and that gets you back to mu, okay?

So going forward, you take log of mu over 1 minus mu, that gives us eta.

If we take e to the eta over 1 plus e to the eta, that gets us back to mu.

And by the way, 1 minus mu, the probability of a tail, is 1 over 1 plus e to the eta.
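Those two directions, the logit going forward and the expit coming back, can be sketched in a few lines of Python; the value 0.8 is just an illustrative probability:

```python
import math

def logit(mu):
    """Link function: the log odds, log(mu / (1 - mu))."""
    return math.log(mu / (1 - mu))

def expit(eta):
    """Inverse logit ('expit'): e^eta / (1 + e^eta)."""
    return math.exp(eta) / (1 + math.exp(eta))

mu = 0.8                          # a hypothetical probability of a head
eta = logit(mu)                   # forward: mean -> linear predictor
print(round(expit(eta), 10))      # 0.8 -- the round trip recovers mu
print(round(1 - expit(eta), 10))  # 0.2 -- P(tail) = 1 / (1 + e^eta)
```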

10:06

So we could write out our likelihood as the Bernoulli likelihood, right there like this.

And I think you can see, then, that it's through this likelihood, like we talked about in our statistical inference class, that we're going to optimize; we're going to maximize that likelihood to obtain our parameter estimates.
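As a toy illustration of maximizing the Bernoulli likelihood, here is a sketch with hypothetical coin-flip data and an intercept-only model, where the maximum should land on the sample proportion of heads; a crude grid search stands in for the real fitting algorithm:

```python
import math

# Hypothetical coin-flip data: 7 heads out of 10 (made up for illustration)
y = [1, 0, 1, 1, 0, 1, 1, 0, 1, 1]

def log_likelihood(mu, y):
    # Bernoulli log-likelihood: sum of yi*log(mu) + (1 - yi)*log(1 - mu)
    return sum(yi * math.log(mu) + (1 - yi) * math.log(1 - mu) for yi in y)

# Crude maximization over a fine grid of candidate probabilities
grid = [k / 1000 for k in range(1, 1000)]
mu_hat = max(grid, key=lambda mu: log_likelihood(mu, y))

print(mu_hat)  # 0.7, the sample proportion of heads
```

Real GLM software maximizes this same likelihood, but with the full covariate structure and an iterative algorithm rather than a grid.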

[NOISE] Okay, let's go through another example: Poisson regression, or, as I like to say it, "Poisson" regression. I know I'm not pronouncing it correctly, but I like to say it that way.

So assume that Yi is Poisson(mu i), where mu i is the expected value of each of the Poisson random variables. In this case, mu has to be larger than zero.

So the Poisson is extremely useful for modeling count data; that's really what it's for, modeling counts.

So if you have a bunch of positive counts that are unbounded, right?

So not like binomial counts where they're bounded by the number of coin flips

we take, Poisson counts are unbounded, and so it's a very useful model.

Suppose you want to count the number of people that show up at a bus stop. You don't have an upper limit on that, or sure, there is some upper limit, but you don't really know what it is, so you might want to model that as Poisson.

Our linear predictor is, again, the same as it was in every case: it's just the sum of the covariates times their coefficients.

11:36

The link function in this case is the log link.

The most common link function for the Poisson case is the log, the log link.

And remember, we go from the mean,

mu, to the linear predictor, eta, by taking the log of the mean.

Okay, so again, we're not logging the data; we're logging the mean of the distribution that the data is assumed to come from.

And then remember the inverse of the natural logarithm is e to that thing.

So we can go backwards from eta back to mu by taking e to the eta.

So, by doing that, we can simply write out what our likelihood is, and again, the way GLMs work is that they obtain the parameter estimates by maximizing the likelihood.
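Here is a similar toy sketch for the Poisson case, again with made-up counts and an intercept-only model, where maximizing on the eta scale and back-transforming through the inverse log link should land near the sample mean:

```python
import math

# Hypothetical counts (e.g., arrivals at a bus stop), made up for illustration
y = [3, 1, 4, 2, 5, 3]

def log_likelihood(eta, y):
    # Poisson log-likelihood (dropping the constant log(y!) terms),
    # parameterized on the linear-predictor scale via mu = e^eta
    mu = math.exp(eta)
    return sum(yi * math.log(mu) - mu for yi in y)

# Crude maximization over a grid of eta values
grid = [k / 1000 for k in range(1, 3000)]
eta_hat = max(grid, key=lambda eta: log_likelihood(eta, y))
mu_hat = math.exp(eta_hat)  # back-transform: the inverse of the log link

print(round(mu_hat, 2))  # 3.0, the sample mean of y
```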

12:24

So, I give some technical facts here, and basically we're just saying that the likelihood simplifies quite a bit in all these cases because of the particular link functions that we've shown. But I want to point out this final point here, which says that maximizing the likelihood amounts to solving an equation, not unlike least squares.

So think way back to our initial lectures on least squares.

We found our estimates by minimizing the sum of the squared

vertical distances between the fitted line and the outcome.

Well, if you wanted to minimize that function in a sort of automated way, you might take a derivative; then that function will no longer be squared, the two would come down. And then, to find the root of that derivative, you just solve a linear equation.

Well, in generalized linear models, by maximizing the likelihood, you get a very similar equation that you want to set equal to zero and solve, and that gives you your estimates; I give it right here. And it's basically very similar to the linear model case, only there's a set of weights and a variance in the denominator that doesn't go away like it does in the least squares case.

So again this is not for this class,

it's just if you're interested in some of the details of the fitting.

13:58

Basically the point of this slide is to say that it's very similar

to what's going on in least squares,

just how we get to that point is a little bit more circuitous.

For most people, in most settings, this is all going to be very transparent to you. You're going to mostly concern yourself with the interpretation of your generalized linear model; you're not going to concern yourself too much with the specifics of how it was fit.

14:45

However, for the Bernoulli case, the variance of a coin flip is p*(1-p),

and in the notation we're given here it's mu(1- mu).

But remember, our mu depends on i, so what we're saying is,

the variance actually depends on which observation you're looking at.

Unlike the linear model case, where the variance is constant across i.

Same thing in the Poisson case. The variance of a Poisson is its mean, so in this case the Poisson has a variance that differs by i.

This is a modeling assumption that you can check, right?

So if you have Poisson data, let's say you have several Poisson observations at the same level of the covariates, so the mean should be the same.

Then the variance of those should be roughly equal to the mean.

If your data doesn't exhibit that, then that's a problem.
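That check can be sketched directly. The two data sets below are invented to show one case where the variance is roughly the mean, as the Poisson assumption requires, and one where it is far larger (overdispersion):

```python
# Sample mean and (unbiased) sample variance, written out by hand
def mean(xs):
    return sum(xs) / len(xs)

def variance(xs):
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

# Made-up replicated counts at one covariate level
plausible = [1, 3, 2, 5, 4, 3, 0, 6]          # variance roughly equal to the mean
overdispersed = [0, 12, 1, 9, 0, 14, 1, 11]   # variance far above the mean

for label, ys in [("plausible", plausible), ("overdispersed", overdispersed)]:
    print(label, "mean:", round(mean(ys), 2), "variance:", round(variance(ys), 2))
```

If the variance badly exceeds the mean, as in the second data set, that is the situation where the quasi-Poisson option mentioned below becomes attractive.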

So this is an important practical consideration in generalized linear models: the modeling assumptions often put a restriction on the relationship between the mean and the variance, and that relationship may not hold in your specific data set. So what can you do?

Well, there is a way to address this, by having a more flexible variance model, even though you lose some of the assumptions of generalized linear models. These are all standard options; you'll see them as so-called "quasi-" options in the family, in the distribution.

We're going to go through lots of examples, but the point is, if you look in R, you'll see that you can fit a Poisson model with its glm function, but then you'll see there's another option called quasi-Poisson. The same thing with the binomial: you'll see an option where you'd fit a binomial model, but then you'll see another option, quasi-binomial.

What that's referring to is a slightly more flexible variance model, in case your data doesn't adhere to the GLM variance structure.

17:18

So, just some odds and ends about the fitting before we go through the specific cases. We're going to do the Poisson case and the binomial case separately; we're not going to go through a full treatment of GLMs, just Poisson and binomial.

But these equations have to be solved iteratively. So, unlike the linear model, where you can just do straight linear algebra to find solutions, GLMs actually have to be optimized, which means sometimes the program fails. For example, if you have a lot of zeros in a binary regression on zero-one outcomes, these things can happen.

17:54

But other than that, I think most of the analysis should be pretty familiar to us.

If we want to get our predicted response, we're just going to take our estimated coefficients, beta hat, multiply them by our regressors, and that will give us our predicted response.

Now, notice this is going to be on the logit scale if you're doing, for example, logistic regression, or the log scale if you're doing Poisson regression, so you'll have to convert it back to the natural scale if you want it to be on the same scale as the original data.

So if you're modeling coin flips, and you get your regression coefficients out and come up with a predicted response, it will be on the logit scale, and if you want it back on the scale of the coin flips, between zero and one, you're going to need to take an inverse logit.
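As a sketch of that conversion, with hypothetical coefficients (these numbers are not from any real fit):

```python
import math

# Made-up fitted logistic-regression coefficients: intercept and one slope,
# both living on the logit scale
beta_hat = [-1.5, 0.8]

def predict_probability(x):
    # Linear predictor on the logit scale: eta = b0 + b1 * x
    eta = beta_hat[0] + beta_hat[1] * x
    # Inverse logit converts back to a probability in (0, 1)
    return math.exp(eta) / (1 + math.exp(eta))

for x in [0.0, 2.0, 4.0]:
    p = predict_probability(x)
    assert 0 < p < 1  # always a valid probability after the inverse logit
    print(x, round(p, 3))
```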

And again, we're going to go through a lot of examples; I just want to outline these facts before we do the examples.

The coefficients are interpreted very similarly to the way that our coefficients were interpreted in linear regression.

They are the change in the expected response per unit change in the regressor, holding the other regressors constant. The only difference is that now this interpretation is done on the scale of the linear predictor.

So in the binomial case it's on the scale of the logit, in the Poisson case it's on the scale of the log mean, and so on.

So it's a slightly more complicated interpretation, but, again, we're gaining the benefit of modeling our data naturally on its own scale, and we haven't had to transform the outcome at all.
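A small numerical check of that interpretation, with made-up logistic-regression coefficients:

```python
# In logistic regression, a one-unit change in the regressor changes the
# *logit* of the mean by exactly the coefficient, holding the others fixed.
beta_hat = [-1.5, 0.8]  # hypothetical intercept and slope

def logit_mean(x):
    # The linear predictor IS the logit of mu under the logit link
    return beta_hat[0] + beta_hat[1] * x

x = 2.0
# The difference equals the slope, 0.8, up to floating point
print(logit_mean(x + 1) - logit_mean(x))
```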

19:37

As for the inference, we also lose the nice collection of closed-form normal inferences that we'd get otherwise; we don't get t-distributions anymore.

But largely, this is transparent.

There's a body of mathematics where statisticians and mathematicians have figured out what the right distributions are to compare the coefficients against, and from the output of your GLM tool you can get things like p-values. The coefficients are going to be hypothesis tested and interpreted in the same way as in linear regression; it's just that the background machinery is a little bit harder.

One thing I would say though is all of these results are based on asymptotics

which means that they require larger sample sizes.

So if you have a GLM setting with a very small sample size, you should be a bit wary of these inferences.

20:40

And so many of the ideas from linear models can be brought over to GLMs. So this was just a whirlwind overview of GLMs. Now, for the next two lectures, let's dig into the two most important cases: binomial and Bernoulli regression via logistic regression, and Poisson regression.

We're going to spend a lot of time with those, and then, if you want further material on GLMs, there are some more advanced classes that you can take.

Okay, well, thank you for attending this lecture, and I look forward to seeing you in the next one, where we're going to cover logistic regression for binary outcomes.