0:05

Just want to give you a little bit of a thought exercise.

I've picked a couple of companies here that deal a lot in consumer data.

Apple or any online business really, Netflix and then Whole Foods.

One brick and mortar company in the bunch.

But think for a second about the types of decisions consumers make

with these businesses, and frame them in terms of choices that consumers make.

So for example, at Whole Foods,

it might be am I going to buy a particular brand on a given shopping trip?

Yes or no.

Well, from the company's standpoint,

it might be helpful to know which of those brands are going to be popular.

Which ones are people going to buy on different trips?

Is there seasonality associated

with their products when they're making their ordering decisions?

Am I going to come to Whole Foods when I need groceries?

Yes or no?

I could go to one of the other grocery stores that's available to me.

So what is it about the people who choose to shop at Whole Foods

that makes them more likely to go there when they're making their shopping trips?

With Netflix, do I retain service this month?

Yes or no?

Do I choose to watch the recommended series, yes or no?

Do I choose a larger plan this month?

Do I add on the DVD service this month, yes or no?

Similar types of decisions you can imagine consumers making,

whether it's Apple, or Amazon, or any other business.

So there are a lot of customer choices that are driving these businesses,

again, highlighting the importance of understanding the right way for

us to be analyzing this choice data.

1:55

And the reason that I wanted to talk up front about distributional assumptions

is that, as I said, we're used to using the normal distribution.

Well, what we're really going to be changing is

that distributional assumption.

When it comes to binary choices, we're not going to be using the normal distribution.

We're going to assume that a customer's choice between a yes or

no outcome follows a Bernoulli distribution.

And there are only two values allowed under a Bernoulli distribution, 1 or 0.

Yes or no.

And the only parameter that's associated with the Bernoulli distribution

is the probability p.

So with probability p, you get a 1.

With probability 1 - p, you get a 0.

Again, framed differently, with probability p, there's a yes outcome.

With probability 1 - p, there's a no outcome.

Now, we can calculate the mean and

the variance associated with the Bernoulli distribution and we've done that here.

2:58

So for the expected value under a Bernoulli distribution, we take

the outcomes, 1 and 0, and

the probabilities associated with those outcomes.

That weighted sum, our best guess, our expectation,

is the mean of the Bernoulli distribution, so the probability p.

We can also calculate the variance under the Bernoulli distribution.
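As a quick check, the mean and variance can be verified numerically. This is a minimal Python sketch; the value p = 0.3 is just an arbitrary illustration:

```python
import random

def bernoulli_mean_variance(p):
    """Mean and variance of a Bernoulli(p) distribution.
    E[Y] = 1*p + 0*(1-p) = p; Var(Y) = p - p^2 = p*(1 - p)."""
    mean = p
    variance = p * (1 - p)
    return mean, variance

# Compare against a large simulated sample of 0/1 draws.
random.seed(42)
p = 0.3
draws = [1 if random.random() < p else 0 for _ in range(100_000)]
sample_mean = sum(draws) / len(draws)
sample_var = sum((y - sample_mean) ** 2 for y in draws) / len(draws)
```

The sample mean and variance land within simulation error of p and p(1 - p).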

So when it comes to writing out the likelihood of a single observation from

the Bernoulli distribution, this is the form that it takes on.

Now, notice, it's the probability p raised to the power of y,

times (1 - p) raised to the power of 1 - y.

Now it looks a little bit foreign, but

let's break it down based on the values that y can take on.

Suppose we observe a 1, all right, y equals 1.

Well, p raised to the power of y means I have a value of p.

(1 - p) raised to the power of 1 - y,

so raised to the power of 0, that term is going to go away.

So the likelihood for a single draw from a Bernoulli distribution,

if I observe a 1, y equals 1, the likelihood is p.

All right, well, what if I observe y equals 0?

If y equals 0, it's p raised to the power of y, so p raised to the 0.

Well, that term equals 1, so that essentially goes away, and

then I'm left with a likelihood of (1 - p) raised to the power of 1 - 0.

So when I observe a 1, the likelihood is p; when I observe a 0,

the likelihood is 1 - p.

Now again,

that's just mapping onto the two values that we had talked about earlier.

And then the product says, let's multiply that

likelihood over all the data points that we observe.
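That single-observation likelihood and its product over the data can be written directly in Python. A minimal sketch; the data values and p = 0.6 are made up for illustration:

```python
import math

def bernoulli_likelihood(y, p):
    """Likelihood of one Bernoulli draw: p**y * (1 - p)**(1 - y).
    Evaluates to p when y == 1 and to 1 - p when y == 0."""
    return p ** y * (1 - p) ** (1 - y)

def total_likelihood(data, p):
    """Product of the single-observation likelihoods over all data points."""
    return math.prod(bernoulli_likelihood(y, p) for y in data)

# Example: three yes outcomes and one no outcome, evaluated at p = 0.6.
data = [1, 1, 0, 1]
likelihood = total_likelihood(data, 0.6)
```

Because each factor is p or 1 - p, the product here is just 0.6 × 0.6 × 0.4 × 0.6.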

All right, now how do we go about bringing covariates, or

marketing activity, into this?

Recall, when we looked at linear regression,

what we said was the outcome y follows a normal distribution with mean mu.

All right, and we said mu was a function of marketing activity.

Well, what we're going to do here is say,

my outcome is a function of the parameter p.

Well, my probability p is going to be a function of marketing activity.

We're just going to change the form in which that marketing activity affects

the probability p.

All right, so we talked about this piece already, said outcomes follow

Bernoulli distributions, and we can write out the likelihood function.

When we bring in marketing activity, we're going to change that a little bit, and

say that the probabilities p, well,

they're going to be a function of the marketing activity.

All right, so we're going to look at an example for customer acquisition.

Well, marketing actions are going to affect the acquisition probability.

So the acquisition probability may be affected by, did I send you an email?

Did I send you a coupon?

6:32

All right, so two different models that are commonly used.

One is the Logit Model,

and you can see here, this is the functional form that we're going to use.

So the probability is e raised to the power

of (X transpose beta), divided by 1 + e raised to the power of (X transpose beta).

One thing to keep in mind, we're talking about a probability,

p is always going to be a value between 0 and 1.

This X transpose beta term, well, that's actually our regression equation.

Our regression equation previously looked like an intercept

beta 0, plus coefficient beta 1 times X1, plus coefficient beta 2 times X2.

And however many coefficients we had, that's our regression term.

So every time you see that X transpose beta,

just plug in your regression equation, because that's all we're doing.

So think of this as rescaling your regression equation.

That regression equation can take on values negative and positive.

We've gotta somehow make that into a probability, bounded between 0 and 1.

So the exponential e raised to that power divided by 1 + e raised to that

power guarantees that it's going to be between 0 and 1.

That's the logit model.
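The logit rescaling is one line of code. A minimal sketch; the intercept, coefficients, and covariate values below are all hypothetical:

```python
import math

def logit_probability(x_beta):
    """Logit transform: exp(x'beta) / (1 + exp(x'beta)).
    Maps any real-valued regression score into a probability in (0, 1)."""
    return math.exp(x_beta) / (1 + math.exp(x_beta))

# Plug the regression equation straight in; these numbers are made up.
beta0, beta1, beta2 = -1.0, 0.8, 0.5   # hypothetical intercept and coefficients
x1, x2 = 1.0, 2.0                      # e.g., email sent = 1, two coupons
p = logit_probability(beta0 + beta1 * x1 + beta2 * x2)
```

However large or negative the regression score gets, the output stays strictly between 0 and 1.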

Another model that we could use is referred to as the probit model,

where we plug the regression equation we have into the normal CDF.

8:09

And that's going to give us our probability between 0 and 1.

For the most part,

you're going to get very similar predictions between these two approaches,

with the exception of when we get far out to the tails of the distribution.
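The probit transform can be written with the standard normal CDF, which the Python standard library exposes through the error function. A minimal sketch, with the logit formula from earlier included so the two can be compared (the index values 0.5 and 4.0 are arbitrary):

```python
import math

def probit_probability(x_beta):
    """Probit transform: standard normal CDF at x'beta,
    using Phi(z) = 0.5 * (1 + erf(z / sqrt(2)))."""
    return 0.5 * (1 + math.erf(x_beta / math.sqrt(2)))

def logit_probability(x_beta):
    """Logit transform, repeated here for side-by-side comparison."""
    return math.exp(x_beta) / (1 + math.exp(x_beta))

# Both map the real line into (0, 1) and agree at 0, but the scales differ,
# and out in the tails the probit probability approaches 1 faster.
near = (logit_probability(0.5), probit_probability(0.5))
far = (logit_probability(4.0), probit_probability(4.0))
```

Note the two curves are on different scales at the same index value, which is why fitted logit and probit coefficients differ even when the predicted probabilities are close.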

Just to give you a sense, this is going to be consistent with economic theory,

random utility theory,

where you choose the option that provides you the highest utility.

So utility's going to be comprised of two components.

X transpose beta, that's our deterministic component.

That's the place where the marketing activity comes in.

And then the random component.

Well depending on what assumptions we make about the distribution that that random

component comes from, we're either going to end up with the logit model or

the probit model.

All right, so we have the logit model on one side,

we've got the probit model on the other,

just different ways of translating that utility into a probability.
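The random-utility story can be checked by simulation: choose "yes" when the deterministic utility plus a random error beats zero. With logistic errors (the difference of two Gumbel draws) the simulated choice share matches the logit formula; with standard normal errors it matches the normal CDF. A minimal sketch, where x_beta = 0.8 is a made-up deterministic utility:

```python
import math
import random

def simulate_choice_share(x_beta, error_draw, n=200_000):
    """Fraction of simulated consumers for whom deterministic utility
    plus a random error exceeds zero (i.e., who choose 'yes')."""
    return sum(x_beta + error_draw() > 0 for _ in range(n)) / n

def logistic_draw():
    # log(u / (1 - u)) of one uniform draw is a standard logistic draw,
    # which is the distribution of a difference of two Gumbel errors.
    u = random.random()
    return math.log(u / (1 - u))

random.seed(7)
x_beta = 0.8  # hypothetical deterministic utility from the regression equation

# Logistic errors reproduce the logit model...
logit_share = simulate_choice_share(x_beta, logistic_draw)
logit_pred = math.exp(x_beta) / (1 + math.exp(x_beta))

# ...and standard normal errors reproduce the probit model.
probit_share = simulate_choice_share(x_beta, lambda: random.gauss(0, 1))
probit_pred = 0.5 * (1 + math.erf(x_beta / math.sqrt(2)))
```

In both cases the simulated share lands within simulation error of the closed-form probability, which is exactly the sense in which the two models are just different error assumptions on the same utility.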