0:00

I want to finish this module with a look at a different form of regression.

So, regression actually takes many different forms.

We saw a linear regression, we talked about a multiple regression.

Now, I'm going to briefly introduce you to what is termed, logistic regression.

So, it's useful to know about logistic regression, because it's appropriate for

a certain type of problem.

So, linear regression,

the type of regression that we have been discussing so far,

is appropriate when the outcome variable, Y, is continuous.

So, we saw the price of a diamond, we saw the fuel economy of a vehicle.

We saw the amount of time it takes to do a job.

Those are all examples of continuous variables.

But not every variable you're going to come across, in a business setting,

is going to be continuous.

In fact, some of them are discrete.

So examples of discrete variables that you might well come across?

Well, within the marketing context, there's a classic one.

It's did that consumer buy my product?

Either yes or no?

So that's a two-level outcome.

Yes or no,

purchase or don't purchase, and I might well be interested in modeling such a variable,

for example, as a function of the age of the consumer,

their sex, their income, etc.

So, sometimes we find ourselves wanting to model a categorical, or discrete, outcome.

And in this particular case, I would say dichotomous:

it can take on one of two values, a dichotomous outcome.

Another example of a dichotomous outcome, it's a pretty brutal one.

But it's certainly out there if you run drug trials, medical experiments.

Is the patient alive after five years?

Yes or no?

I might like to model that as a function of the severity of the illness,

the drugs that the patient was able to take for the illness, etc.

So that's another example of a dichotomous outcome variable that we'd like to model.

And within the internet or web-based world,

those dichotomous outcome variables are very common.

So, one of them might be, you go onto a website, and immediately

a page pops up which says something like,

would you give us your email address, please?

And as a consumer, you're either going to say yes or no to that.

And as the person who runs the website,

I might be very interested in understanding what the drivers are

2:24

as to whether or not you chose to give me your email address,

to sign up for some email newsletter, for example.

Another place where these dichotomous outcome variables are used

in web-based businesses all the time is conversion.

Conversion could be understood as, did you buy my product?

Yes or no?

You were on the web page,

you went through the shopping cart, did you actually get to the checkout?

Yes or no?

And so, these dichotomous values,

these dichotomous variables are absolutely commonplace in business processes.

They are not continuous, and we're going to need a slightly different methodology

if we're going to create a realistic model for such outcomes.

And here's the methodology.

It's logistic regression.

3:24

So logistic regression is used to estimate the probability

that a Bernoulli random variable is a success.

The probability that a consumer buys my product.

But we will estimate that probability as a function of a set of predictive variables,

as a function of a set of X's.

So, it is a regression setup.

X is trying to tell us something about Y; we believe there is an association.

But the model is formulated in terms of the probability of a success.

So we might say, and here is the example I'm going to work with.

4:01

How does the probability that a website is compromised,

vary as a function of the number of plugins that the website has installed?

So, if you have a website, you'd like it to be functional for your consumers.

So if it's nice and highly functional, then it should be engaging.

People stay on the site.

That's usually viewed as a good outcome.

But the more functionality you want to offer the user,

the more plug-ins you typically have to have going on your website.

And the unfortunate truth is that the more functionality you provide,

the more plug-ins that you put in your website,

and that might be a shopping cart, for example, as a plug-in,

it might be a blog, all sorts of plug-ins,

the more of those you have associated with your website,

unfortunately, the higher the probability of compromise is, and that's not good.

So, we're thinking of our outcome variable:

is this website compromised?

Yes or no?

And we'll think of the predictive variable as the number of plugins that the site

has installed.

5:00

Now, we went out and collected some data here.

And we have looked at a large number of websites.

We've counted the number of plugins that they've got.

And we've looked to see how many of them are compromised.

So in this particular example, we've got 100 websites with no plugins.

And of those websites, 16 are compromised and 84 are not compromised.

That gives you a compromise percentage or proportion of 16%, or .16.

So, those are the numbers in the second column of this table.

For example, picking out another one, websites that had five plugins.

I looked at 100 of those, 55 were compromised, 45 were not compromised.

By the time we get up to 10 plugins on the website,

88 of those sites were compromised, 12 were not compromised.

We've got an 88% compromise rate.
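As a small worked check of the arithmetic, the three rows read out above can be turned into proportions. This is only a sketch: the full table is on the slide and is not reproduced in the transcript, so only the quoted rows appear here, and the names `counts` and `proportion_compromised` are made up for illustration.

```python
# Rows quoted in the lecture: (plugins, sites compromised, sites not compromised).
# The remaining rows of the slide's table are not in the transcript, so they are omitted.
counts = [(0, 16, 84), (5, 55, 45), (10, 88, 12)]


def proportion_compromised(compromised, not_compromised):
    """Share of the examined sites that were compromised."""
    return compromised / (compromised + not_compromised)


proportions = {
    plugins: proportion_compromised(bad, good) for plugins, bad, good in counts
}

for plugins, share in proportions.items():
    print(f"{plugins:>2} plugins: proportion compromised = {share:.2f}")
```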

So, remember our outcome here is, was the website compromised?

Yes or no?

Now, you might take data that looks like this.

And, when I say data, I'm going to look at the proportion compromised as a function

of the number of plug-ins.

If I choose to plot it like this,

6:12

I can put a line through the data.

There's nothing that stops me running a linear regression.

But what I'm saying now is that it's not totally appropriate here.

And the thing to realize is that the outcome that I'm modeling,

the probability of compromise, has to lie between zero and one.

Probabilities must lie between zero and one.

Proportions must lie between zero and one.

So, if I put a line through data that looks like this,

you can see that something odd is going to happen,

especially if I extrapolate the line.

So first of all, the line doesn't fit the data too well.

But absolutely, if I extrapolate,

if I took an extrapolation out to 12 plugins, for example,

it's going to predict a proportion compromised, or

a probability of compromise, greater than one.

It's nonsensical. So the underlying issue here

is that my outcome has to lie within the range [0, 1], and unfortunately,

my line doesn't respect that range.
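The extrapolation problem can be reproduced with a few lines of Python. This is a sketch under a loud assumption: the line is fitted by ordinary least squares to only the three rows quoted in the lecture (0, 5 and 10 plugins), not to the full table behind the slide, so the slope and intercept will not match the plotted line exactly. The function name `linear_prediction` is hypothetical.

```python
# Proportions compromised for the three rows quoted in the lecture.
xs = [0, 5, 10]           # number of plugins
ys = [0.16, 0.55, 0.88]   # proportion of sites compromised

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n

# Ordinary least squares for a straight line y = intercept + slope * x.
slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sum(
    (x - x_bar) ** 2 for x in xs
)
intercept = y_bar - slope * x_bar


def linear_prediction(plugins):
    """Straight-line prediction; nothing keeps it inside [0, 1]."""
    return intercept + slope * plugins


# Extrapolating to 12 plugins gives a "probability" above one.
print(f"predicted proportion at 12 plugins: {linear_prediction(12):.3f}")
```

Even on this reduced data set the fitted line crosses one before 12 plugins, which is the behavior described above.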

So what am I going to do?

So, this slide shows you that the linear regression isn't necessarily

a smart thing to do.

There are alternatives and the alternative that I would typically use,

would be a logistic regression.

Let me show you what a logistic regression looks like.

So, a logistic regression actually fits on a transformed scale.

And what I'm showing you here

is the fit back-transformed to the original scale of the data.
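For reference, the transformed scale is usually the log-odds (logit) scale. The notation below is not used in the lecture and is introduced here only for illustration: p(x) is the probability of compromise for a site with x plugins, and β0 and β1 are the intercept and slope estimated from the data. The model is linear on the logit scale, and back-transforming gives the curve on the original probability scale.

```latex
\log\frac{p(x)}{1 - p(x)} = \beta_0 + \beta_1 x
\quad\Longleftrightarrow\quad
p(x) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x)}}
```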

And so I'm not really going to dig into the details here.

The most important point that I'm making for you

is that, if you're looking at dichotomous outcome data like live and die, buy and

don't buy,

you might find a logistic regression model much, much more appropriate.

If we were to fit a logistic model for this data, which I've done here,

it provides a different sort of fit.

This sort of fit, I would often term an s-shaped curve.

In fact, it's a logistic curve to be more precise, and hence,

we call it the logistic regression.

But it has some very good features associated with it.

And the main feature is,

it's never going above one and it can never go beneath zero.

So, it provides a more suitable model when you're trying to predict outcomes

that are things like probabilities and

proportions, which should live between zero and one.
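The bounded, s-shaped behavior comes from the logistic function itself. The short sketch below evaluates it over a range of inputs; the name `logistic` and the chosen input values are illustrative only and do not come from the lecture.

```python
import math


def logistic(eta):
    """Logistic function: maps a real number to a value between 0 and 1."""
    if eta >= 0:
        return 1.0 / (1.0 + math.exp(-eta))
    # For negative inputs use the algebraically equivalent form exp(eta) / (1 + exp(eta)),
    # which avoids computing exp of a large positive number.
    z = math.exp(eta)
    return z / (1.0 + z)


inputs = [-30, -5, -1, 0, 1, 5, 30]
values = [logistic(eta) for eta in inputs]

for eta, value in zip(inputs, values):
    print(f"logistic({eta:>3}) = {value:.6f}")
```

The outputs rise steadily from near zero to near one and flatten at both ends, so however far the input is pushed, the prediction stays in the range a probability has to respect.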

So, here's the fit of the logistic regression model, and

once you have got that fit,

you can see how you can use it for prediction.

If I have, for example, four plugins and

I want to predict the probability of site compromise.

I just take four, and I go up to the curve and I read off the value.

And that's what this regression methodology will give me.
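Reading a value off the curve can also be done numerically. The sketch below fits a one-predictor logistic regression by maximum likelihood (Newton's method) and then predicts at four plugins. Two assumptions need flagging: it uses only the three rows quoted in the lecture rather than the full table, so its prediction will differ somewhat from the curve on the slide, and the function names `fit_logistic` and `predicted_probability` are hypothetical. In practice a statistics package would do the fitting.

```python
import math

# Rows quoted in the lecture: plugins, sites compromised, sites examined.
plugins = [0, 5, 10]
compromised = [16, 55, 88]
examined = [100, 100, 100]


def logistic(eta):
    return 1.0 / (1.0 + math.exp(-eta))


def fit_logistic(xs, successes, trials, iterations=25):
    """Maximum-likelihood fit of P(success) = logistic(b0 + b1 * x) via Newton's method."""
    b0, b1 = 0.0, 0.0
    for _ in range(iterations):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, s, n in zip(xs, successes, trials):
            p = logistic(b0 + b1 * x)
            resid = s - n * p            # observed minus expected successes
            w = n * p * (1.0 - p)        # binomial variance weight
            g0 += resid                  # gradient of the log-likelihood
            g1 += x * resid
            h00 += w                     # information matrix entries
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2x2 system (information matrix) * delta = gradient.
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1


b0, b1 = fit_logistic(plugins, compromised, examined)


def predicted_probability(n_plugins):
    """Fitted probability of compromise for a site with n_plugins plugins."""
    return logistic(b0 + b1 * n_plugins)


print(f"P(compromised | 4 plugins)  = {predicted_probability(4):.2f}")
print(f"P(compromised | 12 plugins) = {predicted_probability(12):.2f}")
```

Unlike the straight line, the fitted curve still returns a value below one when it is extrapolated to 12 plugins.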

It's a prediction methodology that's more suitable for these dichotomous outcomes.