[MUSIC] But instead of gradient descent or subgradient descent, let's take this as an opportunity to teach a really important alternative optimization procedure called coordinate descent. So, let's just have a little aside on the coordinate descent algorithm, and then we're gonna describe how to apply coordinate descent to solving our lasso objective. So, our goal here is to minimize some function g. This is the same objective that we have whether we're talking about our closed-form solution, gradient descent, or this coordinate descent algorithm. But let me just be very explicit: we're saying we wanna minimize over all possible w some g(w), where here we're assuming g(w) is a function of multiple variables. Let's call it g(w0, w1, ..., wD). So this w I'm trying to write in some bold font here. And often, minimizing over a large set of variables can be a very challenging problem. But in contrast, it's often possible to optimize just a single dimension, keeping all of the other dimensions fixed. So it's easy to optimize each coordinate when keeping the others fixed, because that turns into just a 1D optimization problem. And that's the motivation behind coordinate descent, where the coordinate descent algorithm is really intuitive. We're just gonna start by initializing our w vector, which we'll denote w hat, equal to 0. Or you could use some smarter initialization, if you have it. And then, while the algorithm's not converged, we're just gonna pick a coordinate j out of 0, 1, all the way to D. And we're gonna set w hat j equal to the value that minimizes g. I'm gonna write this as an omega, so I'm gonna search over all possible values of omega that minimize g(w hat 0, ..., w hat j-1, omega, w hat j+1, ..., w hat D). Here, all the values with a hat on them are values from previous iterations of this coordinate descent algorithm.
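The loop just described can be sketched in a few lines of code. This is a minimal sketch, not the course's own implementation: the quadratic objective g(w) = 0.5 w^T A w - b^T w, the `num_passes` cap, and all names here are illustrative assumptions, chosen because each 1D minimization over omega then has a simple closed form.

```python
import numpy as np

def coordinate_descent(A, b, num_passes=100):
    """Minimize g(w) = 0.5 w^T A w - b^T w one coordinate at a time.

    For each coordinate j, set w[j] to the exact 1D minimizer of g
    with all other coordinates held fixed -- note there is no stepsize.
    A is assumed symmetric positive definite, so g is strongly convex.
    """
    D = len(b)
    w = np.zeros(D)                 # initialize w hat = 0
    for _ in range(num_passes):     # stand-in for "while not converged"
        for j in range(D):          # round-robin over coordinates
            # dg/dw_j = A[j,:] @ w - b[j]; solve for w[j], others fixed
            residual = b[j] - A[j] @ w + A[j, j] * w[j]
            w[j] = residual / A[j, j]
    return w

# Example: a 2D strongly convex quadratic
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 1.0])
w_hat = coordinate_descent(A, b)
```

On this strongly convex example the iterates settle at the joint minimizer, which you can check against solving the system A w = b directly.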
And in our objective here, we're just minimizing over this jth coordinate. Sorry, these j's are pretty sloppy here. Okay, so whatever omega value minimizes this joint objective, plugging omega in, is gonna be what we set w hat j equal to at this iteration, and then we're gonna keep cycling through until convergence. So, let's look at a little illustration of this just in two dimensions. So, here it would just be w0, w1. And we're gonna pick a dimension, so let's say we choose to optimize the w1 dimension, keeping w0 fixed. So I'm just gonna look at this slice, and what's the minimum along this slice? Well, the minimum occurs here; if I look at the contours, this is the minimum along this dimension. And then what I'm gonna do is, in this case, just cycle through my coordinates. I'm next gonna keep w1 fixed and optimize over w0, and here's the minimum; I'm gonna drop down to the minimum, which is here, and then I'm gonna optimize over w1, holding w0 fixed to this value. And I'm gonna keep taking these steps, and if I look at the path of coordinate descent, it's gonna look like the following, where I get these axis-aligned moves. So how are we gonna pick which coordinate to optimize next? Well, there are a couple of choices. One is to just choose coordinates randomly. This is called either random or stochastic coordinate descent. Another option is round robin, where I just cycle through 0, 1, 2, 3, all the way up to D, and then cycle through again. And there are other choices you can make as well. And I wanna make a few comments about coordinate descent. The first is, it's really important to note that there's no stepsize that we're using in this algorithm. That's in contrast to gradient descent, where we have to choose our stepsize or stepsize schedule. And hopefully, in the practice that you've had, you've seen the importance of carefully choosing that stepsize. So it's nice not having to choose that parameter.
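The two coordinate-picking schedules just mentioned can be contrasted in a tiny sketch. The names and the dimension D = 4 are illustrative assumptions, not from the lecture:

```python
import numpy as np

rng = np.random.default_rng(0)
D = 4

# Round robin: cycle 0, 1, ..., D-1, then cycle through again
round_robin = [step % D for step in range(2 * D)]

# Random (stochastic): draw a coordinate uniformly at each step
stochastic = [int(rng.integers(D)) for _ in range(2 * D)]
```

Either schedule plugs into the same inner update; only the order in which coordinates get visited changes.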
And this algorithm is really useful in many, many problems. And you can show that this coordinate descent algorithm converges for certain objective functions. One example is if the objective is strongly convex, and we talked about strongly convex functions before. Another example is the lasso objective, which is very important, and that's why we're talking about using coordinate descent for lasso. We're not gonna go through the proof of convergence in this case, but know that it does indeed converge. [MUSIC]