Okay, let's go through yet a third derivation of least squares, and in this one we're going to demonstrate why least squares is thought of as a sort of adjustment mechanism. So I'm going to write the objective out a little bit differently, as the squared norm of y minus x one beta one, minus, all the way down to, x p beta p, where each of the x's is a vector. The question is: when I estimate the beta one coefficient, in what sense is it adjusted for the presence of all the other variables in the model? Before we begin, let me define my residual function for two vectors as e of a and b equals a minus b times the inner product of a and b over the inner product of b with itself. The subtracted term is merely the prediction of a from b based on regression through the origin, and e of a and b is the residual after subtracting that prediction off. Okay. So let's take this function, call it asterisk, and fix beta two up to beta p as if they were known. Then the vector y minus x two beta two, minus, up to, x p beta p can be considered a single outcome, and x one beta one can be thought of as the predictor. So I've simply rewritten asterisk as a regression through the origin with a single predictor. We know that asterisk has to be larger than or equal to what we get if we plug in the optimal beta one, where that optimal beta one depends on beta two up to beta p. The beta one that satisfies this criterion would need to be the inner product of y minus x two beta two, minus, up to, x p beta p, with x one, all over the inner product of x one with itself. The bookkeeping is going to get a little bit thick here, but before plugging it in, notice this is equal to the inner product of y and x one over the inner product of x one with itself, minus the inner product of x two and x one over the inner product of x one with itself times beta two, minus, all the way up to, the inner product of x p and x one over the inner product of x one with itself times beta p. Okay, now if I plug this into beta one, I'd like you to churn through the calculations.
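The residual function defined above is simple enough to write down directly. Here is a minimal numpy sketch (the name `e` and the random test vectors are just for illustration): it computes a minus b times the inner product of a and b over the inner product of b with itself, and we can check that the residual is orthogonal to b, as a projection residual should be.

```python
import numpy as np

def e(a, b):
    """Residual of a after regression through the origin on b:
    e(a, b) = a - b * <a, b> / <b, b>."""
    return a - b * (a @ b) / (b @ b)

# The residual should be orthogonal to b.
rng = np.random.default_rng(0)
a, b = rng.normal(size=10), rng.normal(size=10)
print(abs(e(a, b) @ b) < 1e-10)  # True
```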
What you get is the squared norm of e of y and x one, minus e of x two and x one times beta two, minus, all the way up to, e of x p and x one times beta p. This has to have gotten smaller, because we've plugged in the optimal beta one holding the other coefficients fixed. So now we're back at the exact same form of equation with one fewer coefficient: instead of p coefficients we have p minus one, having gotten rid of beta one. And instead of y, we have the residual from regressing x one out of y; instead of x two, the residual from regressing x one out of x two; and so on. In essence, we've made the objective smaller by taking the linear association with x one out of every other predictor and out of the outcome. So we can repeat this process, because notice, this is just the same exact starting equation with one fewer regressor. Our vectors now look a little bit weirder, because they're the output of this residual function, but ostensibly we're back at the same thing. So we know we can get smaller if we repeat the exact same argument. This is going to be greater than or equal to what we get if I now hold beta three up to beta p fixed, plug in the optimal beta two, and take residuals again. Then I would get the squared norm of e of, e of y and x one, and e of x two and x one, minus e of, e of x three and x one, and e of x two and x one, times beta three, minus, all the way up to, e of, e of x p and x one, and e of x two and x one, times beta p. That is, we're now regressing e of x two and x one out of every term. So we've gotten rid of this regressor and its coefficient by regressing it out of every term. And you can see, as you iterate through this process until you get to the beta p variable, the beta p estimate is then a regression through the origin with just that variable and the outcome, where we've iteratively regressed out the linear association of every other regressor.
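To make the bookkeeping concrete, here's a small numerical check (a numpy sketch with simulated random data; nothing here is special to this example): residualizing x one out of y and out of the remaining columns, then refitting with p minus one coefficients, reproduces beta two through beta p from the full fit exactly.

```python
import numpy as np

def e(a, b):
    """Residual of a after regression through the origin on b."""
    return a - b * (a @ b) / (b @ b)

rng = np.random.default_rng(0)
n, p = 50, 4
X = rng.normal(size=(n, p))
y = rng.normal(size=n)

# Full least squares fit with all p coefficients.
beta_full, *_ = np.linalg.lstsq(X, y, rcond=None)

# Residualize x1 out of y and out of x2..xp, then refit with p - 1 coefficients.
x1 = X[:, 0]
y_t = e(y, x1)
X_t = np.column_stack([e(X[:, j], x1) for j in range(1, p)])
beta_red, *_ = np.linalg.lstsq(X_t, y_t, rcond=None)

# The reduced fit recovers beta2..betap from the full fit.
print(np.allclose(beta_red, beta_full[1:]))  # True
```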
So we regressed x one out of everything. Then we took those residuals and regressed the residual of x two out of all the other residuals. Then we took the residual of x three and regressed it out of everything remaining, and so on, until we get to the beta p estimate, which is just the remaining regression through the origin. So you can see this would actually be an easy way to do linear regression, where all you needed to know how to do was this residual function. No matrix inversion required. So that's kind of a neat result. But a couple of other things come to mind. The first is that we did this in an arbitrary order: we just happened to work towards the last coefficient, but we could have worked towards the first coefficient, or done it in any order. So you can see that it doesn't matter which order we take the residuals in. All that matters is that we're iteratively taking residuals. And I find this process really helps me understand in what sense linear regression is adjusting for these other variables: it's taking the linear association with all the other variables out of everything else. And I should note that thinking this way is not restricted to separating the model one vector at a time. Suppose, for example, I had the squared norm of y minus x one beta one minus x two beta two, where now x one is n by p one and x two is n by p two. So I've broken my design matrix into two parts, x one and x two, and beta one is a p one by one vector and beta two is a p two by one vector. So if I wanted, for example, to hold beta one fixed, then this term, y minus x one beta one, is a single outcome, and I know what my solution for beta two would have to be. My beta two hat, as it depends on beta one, would simply have to be x two transpose x two inverse times x two transpose.
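The iterative scheme just described really does compute a least squares coefficient with nothing but the residual function. Here's a sketch in numpy (the function name `beta_last` and the simulated data are made up for illustration): we sweep each regressor in turn out of the outcome and out of all the later regressors, and the coefficient of the last column falls out as a single regression through the origin.

```python
import numpy as np

def e(a, b):
    """Residual of a after regression through the origin on b."""
    return a - b * (a @ b) / (b @ b)

def beta_last(y, X):
    """Last coefficient of the least squares fit of y on X, obtained by
    iteratively residualizing each earlier regressor out of y and out of
    all later regressors. No matrix inversion is used."""
    cols = [X[:, j].astype(float).copy() for j in range(X.shape[1])]
    r = y.astype(float).copy()
    for k in range(len(cols) - 1):
        xk = cols[k]
        r = e(r, xk)                      # sweep xk out of the outcome
        for j in range(k + 1, len(cols)):
            cols[j] = e(cols[j], xk)      # sweep xk out of later regressors
    xp = cols[-1]
    return (r @ xp) / (xp @ xp)           # regression through the origin

rng = np.random.default_rng(1)
X = rng.normal(size=(40, 3))
y = rng.normal(size=40)
b_full, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.isclose(beta_last(y, X), b_full[-1]))  # True
```

Since the order is arbitrary, any coefficient can be obtained this way by putting its column last.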
Then times my outcome, which, because x one and beta one are being held fixed, works out to be x two transpose x two inverse x two transpose times y, minus x two transpose x two inverse x two transpose x one times beta one. Okay, let me see if I can get some more space down here. The first term is the coefficient vector I would get if I only regressed y on x two. And the second is the collection of coefficients I would get if I regressed every single column of x one, treated as an outcome, on x two as the predictor. When I plug this estimate for beta two back in, what do I get? I get y minus the hat matrix for x two, that is, x two times x two transpose x two inverse times x two transpose, applied to y. I'm going to write this out in a more convenient form as I minus x two times x two transpose x two inverse x two transpose, all times y, and then minus I minus x two times x two transpose x two inverse x two transpose, times x one, times beta one. Okay. So what is this? The first piece is the residual of y having regressed out x two, and of course this objective is no larger than the original one. And the second piece is the matrix of residuals from regressing x two out of every column of x one. So one way to think about our estimate for beta one is: first get rid of all of the effect associated with x two, out of both y and x one, and then perform the regression with just those sets of residuals. And again, to me this really helps me understand in what sense regression is doing adjustment. Notice this is the same argument as above; we're just doing it with matrices now. Okay, so I find this way of thinking, even though it's a little confusing and you would never actually program a computer to fit least squares this way, a very useful way to think about linear regression and what it's accomplishing. And of course there's nothing special about holding beta one fixed first; we could have held beta two fixed first.
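The matrix version can be checked numerically too. Below is a numpy sketch with simulated data: we form the residual-maker I minus x two times x two transpose x two inverse x two transpose, apply it to both y and the columns of x one, regress the residualized y on the residualized x one, and compare with the beta one block of the full regression.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p1, p2 = 60, 2, 3
X1 = rng.normal(size=(n, p1))
X2 = rng.normal(size=(n, p2))
y = rng.normal(size=n)

# Residual-maker for X2: I - X2 (X2' X2)^{-1} X2'.
M2 = np.eye(n) - X2 @ np.linalg.solve(X2.T @ X2, X2.T)

# Regress the residualized y on the residualized columns of X1.
beta1_resid, *_ = np.linalg.lstsq(M2 @ X1, M2 @ y, rcond=None)

# Compare with the first p1 coefficients of the full regression.
beta_full, *_ = np.linalg.lstsq(np.column_stack([X1, X2]), y, rcond=None)
print(np.allclose(beta1_resid, beta_full[:p1]))  # True
```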
And so what we see is that every coefficient in least squares is obtained this way, by having regressed all the other regressors out of both y and the predictors associated with that coefficient.
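That closing statement can be verified coefficient by coefficient. Here's a short numpy sketch with simulated data: for each k, we regress all the other columns out of both y and x k, and recover the k-th least squares coefficient as a regression through the origin on the residuals.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 50, 4
X = rng.normal(size=(n, p))
y = rng.normal(size=n)
beta_full, *_ = np.linalg.lstsq(X, y, rcond=None)

for k in range(p):
    others = np.delete(X, k, axis=1)
    # Residual-maker for the other regressors.
    M = np.eye(n) - others @ np.linalg.solve(others.T @ others, others.T)
    xk_r, y_r = M @ X[:, k], M @ y
    bk = (y_r @ xk_r) / (xk_r @ xk_r)   # regression through the origin
    assert np.isclose(bk, beta_full[k])
print("all coefficients match")
```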