0:05

The first thing we would want to do with this kind of data is create a scatter plot of the children's heights by the parents' heights.

Here, I use ggplot.

There are numerous failings in this plot.

One of the primary failings is that these points are over-plotted.

There are lots of different paired parent-child (x, y) values at each one of these specific points.

Here I give a better plot, where the size of the point represents the number of parent-child combinations at that particular (x, y) location.
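A sketch of how such a plot might be built (this is not the slide's exact code; a small made-up stand-in for the Galton data is used here, and ggplot2 is assumed):

```r
library(ggplot2)

# Small made-up stand-in for the Galton parent-child heights (inches)
galton <- data.frame(
  parent = c(68, 68, 68, 70, 70, 72),
  child  = c(66, 66, 67, 69, 69, 71)
)

# Count how many parent-child pairs sit at each (parent, child) location
freqData <- as.data.frame(table(galton$child, galton$parent))
names(freqData) <- c("child", "parent", "freq")
freqData$child  <- as.numeric(as.character(freqData$child))
freqData$parent <- as.numeric(as.character(freqData$parent))

# Map both point size and color to the frequency:
# lighter for high counts, darker blue for low counts
p <- ggplot(subset(freqData, freq > 0), aes(x = parent, y = child)) +
  geom_point(aes(size = freq, colour = freq)) +
  scale_colour_gradient(low = "darkblue", high = "lightblue") +
  xlab("parent height (inches)") +
  ylab("child height (inches)")
print(p)  # render the plot
```

The key step is tabulating the data first, so that each distinct (parent, child) location becomes one point whose frequency drives its size and color.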

0:44

Here, a very light color represents a frequency near 40, and a relatively darker, more bluish color represents a frequency toward ten or down to one at specific locations.

So the size of the point and the color of the point represent the frequency of parent-child combinations at that specific (x, y) pair, and this is a much better plot, because it doesn't lose that information.

1:18

I would say there is another failing of this plot: I don't put the units, inches, on either of the two axes, and I think it's good practice to have the units on the axes. So that's a little bit of a failing of this plot.

If you want to see the code, it's in the R markdown file.

1:38

Now let's suppose we want to explain the children's heights using the parents' heights, and let's assume we want to do it with a line.

Well, to make things easy for right now, let's force the line to go through the origin.

2:14

In order to find the best line, all we have to find is the slope.

Well, here's how we could potentially do that.

We would want to find the slope, beta, that minimizes the sum of the squared distances between the observed data points, the Yi, and the fitted points on the line, the Xi beta. We square each distance and add them up, and this is directly analogous to finding the least squares mean that we did just a couple of slides ago.

So this is exactly using the origin as a pivot point and picking the line that minimizes the sum of the squared vertical distances between the points and the line.
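To make that criterion concrete, here is a minimal sketch in base R, with made-up data (the slide's data aren't reproduced here): the slope that minimizes the sum of squared vertical distances has the closed form sum(x*y) / sum(x^2).

```r
# Made-up illustrative data (not the Galton data)
set.seed(1)
x <- rnorm(100)
y <- 0.6 * x + rnorm(100, sd = 0.5)

# Sum of squared vertical distances for a candidate slope beta
sse <- function(beta) sum((y - beta * x)^2)

# Calculus gives the minimizer in closed form: sum(x*y) / sum(x^2)
beta_hat <- sum(x * y) / sum(x^2)

# Nearby slopes should do no better than the closed-form minimizer
sse(beta_hat) <= sse(beta_hat + 0.01)  # TRUE
sse(beta_hat) <= sse(beta_hat - 0.01)  # TRUE
```

Because the criterion is a convex quadratic in beta, this closed-form slope is the unique global minimizer.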

So we're going to use RStudio's manipulate function to experiment with this and see if we can find that line.

Now, there is a point to regression through the origin: it's useful for explaining things, because we only have one parameter, the slope, rather than two parameters, the slope and the intercept. But it's generally bad practice to force regression lines through the point (0, 0).

So an easy way around this is to subtract the mean from the parents' heights and the mean from the children's heights, so that the (0, 0) point is right in the middle of the data, and that will make this solution a little bit more palatable.

And we'll discuss later on how this relates to real regression, where you fit both the slope and an intercept.

Let me just show a picture to illustrate some of these concepts.

So here, I have a scatter plot where I have some data on the y-axis and some data on the x-axis, and I want to use my x variable to predict my y variable.

So this is my x-axis and this is my y-axis.

4:08

My red crosshairs are my data points.

Regression through the origin, the way that we're doing it, takes the point (0, 0) and treats it as a pivot point.

It then tries to find the best line going through

the points, where the best line is as follows.

Take this line here. For each observed point, we calculate the vertical distance: that height is Yi, this length is Xi, and that point on the line is Xi beta. So this distance right here is Yi minus Xi beta, and if we square it, we get the squared distance.

So what regression through the origin is trying to do is take all these vertical distances between the fitted line and the observed heights, square all those distances, and add them all up, so that each one contributes to the error it's calculating, and then it tries to find the best slope.

Because remember, this line is defined by a simple equation, y equals x beta; when you have a line going through the origin, you only need one parameter, the slope.

5:48

That line's not going to be very good.

If you look, the line really should hit somewhere along here.

So regression through the origin kind of doesn't make sense.

What we're doing by centering the data is setting the origin to be right in the middle of the data, so that point is now (0, 0) after subtracting off the means. So it's basically reorienting the axes so that the (0, 0) point lies right in the middle of the data.

Now it seems a little more reasonable to find the regression line that goes through the data where we just consider a slope. So regression through the origin seems to make a little bit more sense if you subtract the means, and we'll find out later that this yields a slope equivalent to the solution if we were to fit both the intercept and the slope.
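That equivalence can be checked directly in base R. This sketch uses simulated data (an assumption, since the lecture uses the Galton data): the slope from regression through the origin on mean-centered variables matches the slope from fitting both an intercept and a slope.

```r
# Made-up "parent" and "child" heights, for illustration only
set.seed(2)
x <- rnorm(200, mean = 68, sd = 2)
y <- 0.65 * x + rnorm(200, sd = 1)

# Center both variables, then regress through the origin ("- 1" drops the intercept)
fit_centered <- lm(I(y - mean(y)) ~ I(x - mean(x)) - 1)

# Fit the usual model with both slope and intercept
fit_full <- lm(y ~ x)

# The two slope estimates agree
coef(fit_centered)
coef(fit_full)[2]
```

The centered fit has no intercept to estimate, yet its single slope coefficient is identical to the slope of the full model.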

6:47

So let's try to do this with RStudio's manipulate function, and I think, because this is one of the central themes of the next several lectures, we're going to go over these points over and over again.

They're very fundamental.

7:01

So, I won't show running the code, because you can grab it from the R markdown file.

And I'm hoping at this point in the specialization you'll be familiar with

running R code.

But here's my plot, and over it I have a specific value of the regression line. The axis labels may get downsampled a little in the video compression, but what you can see is that the center of the children's heights here is at zero, because I have subtracted the mean, and the center of the parents' heights is at zero. So I have a pivot point right in the middle at (0, 0).

I also want to point out that up here I give the slope, beta, equal to 0.6, and the mean squared error, where the mean squared error is again based on the y data points, the children's heights, minus the x data points, the parents' heights, multiplied by this factor, 0.6.

So it's taking each one of these points. Let's take this one, or let's take one that's more centered, this one right here. It's calculating the vertical distance between the child's height and the parent's height multiplied by the slope, squaring it, and adding them all up.

But here again, because there are multiple points at any (x, y) combination, you can think of multiplying each squared distance by the appropriate number of points with that specific combination, because of the overplotting.
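That frequency-weighted criterion can be sketched in base R (with made-up centered data and frequencies, not the Galton values): each squared distance is multiplied by the count of pairs at that location, and the weighted minimizer again has a closed form.

```r
# Made-up centered (x, y) locations and the number of pairs at each
x    <- c(-2, -1, 0, 1, 2)
y    <- c(-1.4, -0.7, 0.1, 0.6, 1.3)
freq <- c(1, 5, 40, 5, 1)

# Frequency-weighted mean squared error for a candidate slope
wmse <- function(beta) sum(freq * (y - beta * x)^2) / sum(freq)

# Weighted closed-form minimizer: sum(freq*x*y) / sum(freq*x^2)
beta_hat <- sum(freq * x * y) / sum(freq * x^2)

# Nearby slopes do no better than the weighted minimizer
wmse(beta_hat) <= wmse(beta_hat + 0.01)  # TRUE
wmse(beta_hat) <= wmse(beta_hat - 0.01)  # TRUE
```

Weighting by frequency is exactly equivalent to listing each overplotted pair individually and using the unweighted criterion.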

8:41

At the point (0, 0), our line is going to pivot around that point, because we're only fitting lines that have to go through the origin, since we're only considering the slope.

Okay.

So let's do it and see how our mean squared error changes as a function of our slope.

0.6 doesn't look so bad, let's move it to one that's not so good.

Okay, the mean squared error has gotten a lot worse, right? So, it went from 5.004, say, at 0.68.

Now let's move it back from up there at 5.8, and you can see the mean squared error getting lower.

9:20

Right?

As you get down to a slope of 0.6 something, it looks like it's doing pretty well. 0.64 has about 5.000, and then 0.62 has 5.002, so it's gone up. So, it looks like about 0.64.

What's interesting is that a slope of one is not good. There's the slope of one: that's saying, if you want to predict the child's height, just use the parent's height. Apparently, we have to multiply by a factor to get a better prediction of the child's height than the parent's height by itself. We have to multiply by a factor of about 0.6 in this case; that's what the slope is doing for us.

10:09

Here, I give the manipulate code. And again, all of this code is in the R markdown file, so don't retype it from the slides; grab the actual text from the R markdown file.
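A sketch of what that manipulate code might look like (this is not the slide's exact code; manipulate only runs inside RStudio, the slider range is an assumption, and the galton data set is assumed to come from the UsingR package):

```r
library(manipulate)  # manipulate only works inside RStudio
library(UsingR)      # assumed source of the galton data set
data(galton)

# Mean squared error for a candidate slope, regression through the origin
mseForSlope <- function(beta, x, y) mean((y - beta * x)^2)

myPlot <- function(beta) {
  y <- galton$child - mean(galton$child)
  x <- galton$parent - mean(galton$parent)
  plot(x, y,
       xlab = "parent height, centered (inches)",
       ylab = "child height, centered (inches)")
  abline(0, beta, lwd = 2)  # candidate line pivoting through (0, 0)
  title(paste("beta =", beta, " mse =", round(mseForSlope(beta, x, y), 3)))
}

# Drag the slider and watch the mean squared error respond to the slope
manipulate(myPlot(beta), beta = slider(0.3, 1.2, step = 0.02))
```

Each slider move redraws the centered scatter plot with the new candidate line and reports its mean squared error in the title.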

10:26

Now that we've used manipulate to find the slope, I'm going to show how you can do this very quickly in R. The function lm fits the linear model. Here I have the code where I use lm.

My outcome, my y value, is the centered children's heights; I've subtracted off the mean. And here are the centered parents' heights, where I've subtracted off the mean for the parents. This minus 1 says get rid of the intercept, because we're talking about regression through the origin,

11:06

where we forget about a y-intercept. Then I tell lm that the data set I want to fit is galton, and it gives me the slope, 0.646.
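A sketch of that lm call (assuming the galton data set, with columns child and parent, comes from the UsingR package):

```r
library(UsingR)  # assumed source of the galton data set
data(galton)

# Centered child heights regressed on centered parent heights;
# the "- 1" removes the intercept, forcing the line through the origin
fit <- lm(I(child - mean(child)) ~ I(parent - mean(parent)) - 1, data = galton)
coef(fit)  # slope near 0.646 on the Galton data
```

The I() wrapper tells the formula interface to center the variables arithmetically rather than interpreting the minus signs as formula operators.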

So we're going to talk about this a lot.

We'll go through this in great detail throughout the rest of the class.

11:24

Here, in the final slide of this lecture, I just give the fitted line. Now, because of how we've forced the model, it has to go through the point (0, 0) for the centered children's heights and the centered parents' heights. So on the original scale, it has to go through the point at the mean of the parents' heights and the mean of the children's heights. So I've reshifted the plot so that (0, 0) is back at the original point.

11:53

And so here is our line, with a slope of 0.646. And what we're going to do in subsequent lectures is talk about how we get these estimates, the motivation behind them, and all the things we can do with this fitted line. We're going to spend maybe the next several lectures talking about this.