As a last aside before we get to our coordinate descent algorithm for lasso, let's first derive our coordinate descent algorithm for just least squares regression, using these normalized features that we've just described. So here's our least squares objective, which is just minimizing our residual sum of squares. And when we derive our coordinate descent algorithm, remember that we're going to look at a specific variable, wj, hold all other variables fixed, and optimize just with respect to wj. So what we're going to do is take the partial with respect to wj, and then think about setting that equal to zero to find the optimal value for wj, fixing everything else. So this is our statement of just our 1D optimization, coordinate by coordinate. Okay, so let's write our partial with respect to wj of our residual sum of squares. We've derived this in various forms multiple times in this course, so I'm not gonna re-derive it again. I'm just writing it here, and then we're gonna manipulate it into the form we need in this specific case. And I wanna highlight, before I move on, that this underbar notation was introduced when we were talking about our normalized features. These are our normalized features; that's what the underbar represents. So everything I'm doing here is in the context of normalized features. Okay, so we can rewrite this objective. What I wanna do is specifically pull out this wj term. That's the term I'm solving for; everything else is assumed fixed. So when I'm thinking about my coordinate descent algorithm, I know the values of all the other w's. I guess I should mention what this notation means: this minus j means all wk for k not equal to j. So that's my shorthand for saying I'm taking my w vector and just removing that jth element, wj. Okay, I'll remove the circle so that I don't cover it up too much.
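As a concrete sketch of what the underbar (normalized-feature) notation means in code: each feature column is rescaled so that the sum of its squared values over the observations equals one. This snippet is my own illustration, not from the lecture; the names `H` and `normalize_features` are assumptions.

```python
import numpy as np

def normalize_features(H):
    """Scale each feature column h_j so that sum_i h_bar_j(x_i)^2 = 1.

    H is an N x D matrix of feature values. Returns the normalized
    matrix (the "h bar" features) and the per-column norms, which
    are needed later to un-normalize learned weights.
    """
    norms = np.linalg.norm(H, axis=0)  # sqrt(sum_i h_j(x_i)^2) per column
    return H / norms, norms
```

Dividing by the column norm is what makes the sum-of-squares term in the derivation below collapse to one.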
Okay, so I'm assuming that w minus j, those values, are known at any given iteration of my algorithm, so I wanna separate out wj, which is my variable, and solve for it in terms of these other quantities. So let's do that. I have minus two, sum i equals one to N, h bar j of xi. And I'm gonna rewrite this as yi minus the sum over k not equal to j of wk h bar k of xi, so that's everything except for wj, and I'm gonna write the wj term, this specific j that I'm looking at here, separately: wj h bar j of xi. And then I'm just gonna multiply through. So what I'm gonna get is minus two, sum i equals one to N, h bar j of xi times yi minus sum over k not equal to j of wk h bar k of xi. That's the first term. And then this minus sign and this minus sign are gonna turn into a plus two. Well, wj doesn't depend on i, so that comes out of the sum, and then I have a sum i equals one to N, and I have two hj's: this h bar j here multiplying this h bar j here, so h bar j of xi squared. My features are normalized, so let me just switch to another color here, let's do purple. What is the sum of the squared values of my normalized features? Well, by definition of normalization, this is equal to one. And the other thing I'm gonna do is define this term here. I'm gonna use this triangle-equals notation, meaning I'm defining this term, and I'm gonna call it rho j. And the last thing I wanna mention is this term here. I'm looking at the sum over all features not equal to j of wk times my feature k. Well, that's my predicted value of the ith observation if I excluded the jth feature from the model. So this is my prediction without including the jth feature. Okay, let's just rewrite what we've done. The result is we get minus two rho j plus two wj times one.
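Putting those pieces together, rho j is the inner product of the normalized jth feature with the residual formed when feature j is left out of the prediction. A minimal numpy sketch of that quantity (the function and argument names are my own, assuming normalized features `H_bar`):

```python
import numpy as np

def rho(j, H_bar, y, w):
    """rho_j = sum_i h_bar_j(x_i) * (y_i - prediction_without_j_i),

    where prediction_without_j uses all current weights except w_j.
    H_bar: N x D matrix of normalized features; y: length-N targets.
    """
    # Full prediction minus feature j's contribution.
    prediction_without_j = H_bar @ w - H_bar[:, j] * w[j]
    return H_bar[:, j] @ (y - prediction_without_j)
```

Subtracting `H_bar[:, j] * w[j]` from the full prediction is just a cheap way of computing the sum over k not equal to j.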
So this is my partial with respect to wj of my residual sum of squares. And what I'm gonna do, like I said, is set this partial equal to zero and solve for wj. Remember, when we set it equal to zero, we start calling things w hat. So these twos cancel out, and very straightforwardly I get w hat j equals rho j. Okay, so now I've derived my coordinate descent algorithm, and I'm showing it here. What we see is we just initialize our w hat to zero, and while not converged, we cycle through each one of our coordinates, round robin. We compute this rho j term for a given coordinate j, and then we simply set w hat j equal to rho j. So let's discuss this rho j term and give a little intuition behind why we're setting w hat j equal to rho j. Well, rho j is a measure of the correlation between our feature j and the residual between the predicted value, where remember that prediction was formed excluding the jth feature from the model, and the true observation. So we're looking at the difference between our prediction and the truth when j is not included in the model, and then we're seeing how correlated that is with the jth feature. And if that correlation is really high, if they tend to agree, then what that means is that the jth feature is important to the model: it can be added into the model to reduce the residuals, or equivalently, improve our predictions. So if these things tend to agree, rho j is gonna be really large, high correlation, and we're gonna set w hat j large; we're gonna put weight strongly on this feature being in the model. But in contrast, if they're uncorrelated, if they don't tend to agree across observations, or if the residuals themselves are already very small, because the predictions are already very good,
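The whole algorithm described above, initialize w hat at zero, sweep round robin over the coordinates, and set each w hat j to rho j until nothing changes much, might be sketched like this. The tolerance and sweep cap are my own choices, not from the lecture:

```python
import numpy as np

def coordinate_descent_least_squares(H_bar, y, tol=1e-10, max_sweeps=1000):
    """Coordinate descent for least squares with unit-norm feature columns.

    Because each column of H_bar satisfies sum_i h_bar_j(x_i)^2 = 1,
    the coordinate-wise minimizer is simply w_hat_j = rho_j.
    """
    w = np.zeros(H_bar.shape[1])
    for _ in range(max_sweeps):
        max_change = 0.0
        for j in range(H_bar.shape[1]):
            # Residual with feature j's contribution removed.
            residual_without_j = y - (H_bar @ w - H_bar[:, j] * w[j])
            rho_j = H_bar[:, j] @ residual_without_j
            max_change = max(max_change, abs(rho_j - w[j]))
            w[j] = rho_j  # the 1D update: w_hat_j = rho_j
        if max_change < tol:
            break
    return w
```

Since the residual sum of squares is smooth and convex, these round-robin updates converge to the ordinary least squares solution.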
Then what that's saying is that rho j is gonna be small, and as a result we're not gonna put much weight on this feature in the model.