Welcome back to the notebook. In this section we're going to discuss hyperparameter tuning. As a quick recap of hyperparameters versus parameters: hyperparameters are the parts of our model that we as users actually tune ourselves, versus parameters, which are learned by the model from the data. Now, how do we come up with hyperparameter values that optimize model performance? The way we do this is hyperparameter tuning, and that's going to involve using cross-validation, so multiple train-test splits, as we just did, in order to determine which hyperparameters are most likely to generalize well to an outside sample. Like we saw in lecture, we had that curve of complexity versus error, and we want to find the sweet spot of that curve. So we'll test many different hyperparameter values to see which one leads to the right level of complexity to minimize our error on a holdout set. Generally speaking, tuning those hyperparameters will increase or decrease the level of complexity of your model.

A quick introduction to a function that we're going to use here: np.geomspace. With geomspace, if we say 1 through 1,000 and set the number of values to 4, every value in between will be a constant multiple of the prior value: 10 is 10 times 1, 100 is 10 times 10, and 1,000 is 10 times 100. You can think of np.geomspace(1, 27, 4) the same way, and you see that each value is 3 times the prior one. Here we're going to create 10 values starting with 1e-9, that is 1 times 10 to the power of negative 9, up until 1 itself. So we see 10 to the negative 9, negative 8, negative 7, and so on, until we get to 0.01, 0.1, and then just 1.

Now, we have not yet introduced lasso, and all you need to know here is what changing this alpha value within lasso does. We're going to use this Lasso model and initiate it the same way we do with linear regression, passing in the alpha argument for each one of the alphas that we defined before. The higher the alpha, the less complex your model is; the lower the alpha, the more complex it is, and the closer it is to regular linear regression. This will become even clearer as we go through the outputs of the actual coefficients for each one of the features in our dataset.

We're starting off with scores equal to an empty list and coefficients equal to an empty list. What we want to do is see which of our different alpha values leads to the highest score on our holdout set. That lets us loop through each one of these hyperparameters, each one being an alpha, and see how much we need to limit the complexity of our model. So for every alpha, we initiate our lasso model and then create our estimator using Pipeline. Now, we'll get into this later as well, but it's very important whenever you're doing lasso or ridge regression, and we'll show you ridge regression in a bit, that you scale your data first. You always have to scale your data, otherwise the model will not work optimally. So we scale it, then we run lasso regression, similar to the pipeline that we saw above.
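Here is a minimal sketch of that loop, assuming X (the feature matrix) and y (the target) are already loaded earlier in the notebook; the fold count and step names are illustrative, not necessarily the exact ones used in the course notebook.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import r2_score

# geomspace: each value is a constant multiple of the previous one
np.geomspace(1, 1000, 4)   # array([   1.,   10.,  100., 1000.])
np.geomspace(1, 27, 4)     # array([ 1.,  3.,  9., 27.])

# Ten alphas from 1e-9 up to 1
alphas = np.geomspace(1e-9, 1, 10)

# X and y are assumed to be defined earlier in the notebook
scores = []
coefs = []
for alpha in alphas:
    las = Lasso(alpha=alpha)
    # Scale first, then fit lasso -- scaling is needed for lasso/ridge to work well
    estimator = Pipeline([
        ("scaler", StandardScaler()),
        ("lasso_regression", las),
    ])
    # Out-of-fold predictions, scored against the true targets
    predictions = cross_val_predict(estimator, X, y, cv=3)
    scores.append(r2_score(y, predictions))
    # Refit on the full data just to inspect the coefficients at this alpha
    estimator.fit(X, y)
    coefs.append(estimator.named_steps["lasso_regression"].coef_)

list(zip(alphas, scores))
```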
Then we get our predictions using cross_val_predict. We pass in our estimator, our X, our y, and how we want to do the cross-validation. Then we get our R² score and append it to our scores, and we do that for each one of our alphas. So I'm going to run this, and then we can look at what the score was for each one of our alphas. This is the most complex of the models, and this is the least complex, and we see a pretty flat curve in regards to how much lowering complexity is actually helping on our holdout set.

Just to make clear how we're actually making the model more or less complex: if we look at the coefficients for very small alphas, those that reduce complexity much less, you see that our coefficients are all different nonzero values. Whereas if I do alpha equals 1, which is a much higher alpha value, so much less complexity, we end up zeroing out many of our coefficients. Each of these coefficients relates to a feature, and we have essentially removed those features from our predictions. I'm going to plot this so we can see the trade-off between higher complexity and our error, and we see that it's fairly level throughout. So we probably don't need much regularization while we're just working with plain lasso and the standard scaler. We're going to see in a second, as we end up with many more features using something like polynomial features, how we'll need to actually reduce the complexity and probably remove many of those interaction terms or squared values.

That's what we're doing in the next exercise: we're going to add polynomial features to this pipeline and then rerun the cross-validation with the polynomial features added. Now, pipelines apply their steps in order from first to last, and since we're adding another step to our pipeline, let's think about the order in which it would make sense to add the polynomial features, so that we put that step in the appropriate place in the pipeline. If you think about it, there's a little bit of an argument here. On one side, if you standardize first, then you will end up with negative and positive values for features that may have all been positive, therefore changing all of your interaction terms. Also, you're going to bring them down to a much smaller scale, so rather than the first value times the second value both being above 1 and therefore making the product even larger, if you scale them down to values that are maybe 0.5 and 2, you're actually reducing that product. That's one side of the argument for why we would want to do polynomial features first and then scale. On top of that, scaling second also ensures that every feature ends up on the same scale at the end. On the other hand, you may want the interactions between the scaled data, so that you can see, in terms of how much each value varies from its mean, what the interaction or the square of those values will be. But we're going to stick here with polynomial features first, which is generally regarded as best practice.

So we initiate our PolynomialFeatures object, we start off with an empty list of scores as we did before, we have our list of alphas, and then for each one of those alphas we do the same thing we did before when initiating our lasso object.
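Sketched below is how that pipeline might look with polynomial features added as the first step, before scaling. Again X and y are assumed from earlier in the notebook, and the alpha grid and max_iter value here are illustrative rather than the notebook's exact numbers.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import r2_score

pf = PolynomialFeatures(degree=2)   # adds squared and interaction terms

scores = []
alphas = np.geomspace(1e-9, 1, 10)  # illustrative grid; the notebook's exact values may differ
for alpha in alphas:
    # max_iter raised so the solver has enough iterations to get close to the optimum (see below)
    las = Lasso(alpha=alpha, max_iter=100000)
    estimator = Pipeline([
        ("make_higher_degree", pf),    # polynomial features first...
        ("scaler", StandardScaler()),  # ...then scale, so all terms end up on the same scale
        ("lasso_regression", las),
    ])
    predictions = cross_val_predict(estimator, X, y, cv=3)
    scores.append(r2_score(y, predictions))

list(zip(alphas, scores))
```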
Now, I do want to point out this max_iter argument, which we didn't discuss before. The idea is that lasso regression is actually fit with gradient descent, which we haven't introduced yet. To get to the optimal value, it's essentially running what you can think of as a for loop in the back end, getting closer and closer to the correct value in small steps, and if you don't allow enough iterations, and the default here is not enough iterations, then it won't get to that optimal value. That's why we have to set the max iterations higher. We then create our estimator object, which starts with the polynomial features, then scales the data, and then passes it through to our lasso regression. We come up with predictions by running cross_val_predict, passing in our estimator, our X, and our y, and then we get our new score and append each one of these scores. Now, running this will take about 30 seconds. Also note that you will probably get a warning, given the max iterations we have here, that it hasn't converged, but we tested this on the back end to ensure that the numbers are fairly close to the actual optimal values. So I'm going to take a break here, and we'll come back as soon as this is done running.

Okay, now our functions have run and we can look at what the scores were for each of our alpha values. We see that starting off with very small alphas, so very high complexity, we may not be generalizing well. With a somewhat higher alpha, we are generalizing better, and then as that alpha gets even higher, we may have passed our peak. Looking at this graph, we would say that at around 0.01 we probably got the optimal value of the hyperparameter that will generalize well to new data coming in.

With that, we can use that hyperparameter to train our actual model. We set our best estimator equal to this pipeline with PolynomialFeatures of degree 2, then scaling, and then lasso using the alpha that we found was best, 0.01, and we can fit that. We see what the score was for the best estimator using the built-in score method, which for a regressor is by default the R² score. When we look at that, we have to keep in mind that we are training and scoring on the same dataset. Generally speaking, as usual, we would want a holdout set, but this is just to show how much of the variation we were actually able to explain given the model that we used. Then when we look at the coefficients, because we used lasso regression and added this alpha penalty, we actually removed many of these features, as you see a lot of them essentially zeroed out. So I'm going to stop here, and when we come back, we'll move on to ridge regression to see how we can run through the same process, and then how we can ultimately string this all together into, again, a succinct couple of lines of code. All right, I'll see you there.
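And here is a sketch of refitting with the chosen hyperparameter, alpha = 0.01, the value that scored best above, using the same imports and the same X and y as the sketches above. As noted, the score here is measured on the training data itself.

```python
# Refit the full pipeline at the best alpha found by cross-validation
best_estimator = Pipeline([
    ("make_higher_degree", PolynomialFeatures(degree=2)),
    ("scaler", StandardScaler()),
    ("lasso_regression", Lasso(alpha=0.01, max_iter=100000)),
])
best_estimator.fit(X, y)

# Built-in .score() for a regressor returns R^2 -- here on the same data we trained on
print(best_estimator.score(X, y))

# Many coefficients are driven exactly to zero by the lasso penalty
print(best_estimator.named_steps["lasso_regression"].coef_)
```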