Welcome back to our notebook. Here we're going to briefly introduce ridge regression. Now again, we haven't gone through the math of ridge and lasso regression yet. All we need to know for now is that these are two different ways of reducing the complexity of the original linear regression model; we'll get deeper into the math in lecture. As before, before passing our data through our new model, here ridge regression, we're going to create our PolynomialFeatures object, define the alphas that we're going to search through, and initialize an empty scores list that we'll append to as we get the score for each of our ridge regressions given the different hyperparameters we pass through. Again, our goal is to find the optimal hyperparameter so that the model generalizes well to new data. We don't want it to be too complex, but we also don't want a minimal amount of complexity. We want that just-right region, and we find it by doing cross-validation and asking, for each one of these alphas, whether reducing complexity a ton or just a bit strikes the right balance to minimize the error on our holdout sets. So for each alpha in our alphas, we run our ridge regression, and for each one we pass in that alpha as a hyperparameter. We also pass in a max iteration value to ensure that it will converge, as we discussed. We then initiate our estimator object, again using a pipeline of polynomial_features, the StandardScaler, and ultimately the ridge_regression we defined above. Then we get our predictions using cross_val_predict on each of our holdout sets, compute the r2_score for each, and append it onto our scores. Then we plot the alphas versus the scores, to see how the r2_score rises or falls as we increase or decrease our complexity through these alphas.
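The loop just described might be sketched roughly as follows. This is a sketch, not the notebook's exact code: the notebook uses the Boston housing data, whereas here `X` and `y` are a synthetic stand-in, and the alpha grid and `cv=3` are assumptions.

```python
# Rough sketch of the alpha search described above. The notebook uses
# the Boston housing data; X and y here are a synthetic stand-in, and
# the alpha grid and cv=3 are assumptions.
import numpy as np
import matplotlib
matplotlib.use("Agg")  # render without a display
import matplotlib.pyplot as plt
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score
from sklearn.model_selection import cross_val_predict
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)

alphas = np.geomspace(0.01, 100, 20)  # hyperparameters to search through
scores = []

for alpha in alphas:
    ridge = Ridge(alpha=alpha, max_iter=100_000)  # max_iter helps ensure convergence
    estimator = Pipeline([
        ("polynomial_features", PolynomialFeatures(degree=2)),
        ("standard_scaler", StandardScaler()),
        ("ridge_regression", ridge),
    ])
    # each row's prediction comes from the fold where that row was held out
    preds = cross_val_predict(estimator, X, y, cv=3)
    scores.append(r2_score(y, preds))

plt.semilogx(alphas, scores)
plt.xlabel("alpha")
plt.ylabel("holdout $R^2$")
```

Plotting `scores` against `alphas` on a log axis gives the complexity-versus-generalization curve discussed next.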
Recall that a lower alpha means more complexity. So all the way over here on the left, with a much lower alpha, we're not getting as high of an r2_score, but as we increase the alpha and reduce the complexity, we hit an optimal point around 0.75. That would be the optimal hyperparameter; anything to the right of it is probably not complex enough. So we have that just-right point at around 0.75. The conclusion we can draw from this curve, with its upward trajectory and then its downward slope, is that using an alpha value to reduce complexity helps: an alpha of zero would be essentially the same as plain linear regression, and we can see that as we reduce complexity slightly, we're actually able to improve how well our model generalizes. Reducing complexity, even for a simple model such as linear regression, goes a long way toward optimizing how well we'll perform on our holdout sets. Now I want to go over how we can look at some feature importances, that is, at interpretability: what is important? Whenever we want to look at interpretability for something like linear regression, we need to ensure that all of our features are on the same scale. Think about one feature taking values between zero and five and another taking values between 10,000 and 100,000, when we're trying to predict median value. A one-unit change in the 0-5 feature will probably have a large effect, whereas a one-unit change in a feature ranging from 10,000 to 100,000 will have very little effect on the overall median value of our households. So what we want to do is bring them all down to the same scale, so that each coefficient measures the effect of a one-standard-deviation change: as a feature varies in accordance with the variation built into it, how much will that affect our median household value? Now they're all on the same scale.
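A toy illustration of that point about scale, with made-up data (none of this is from the notebook): the raw coefficient on a feature depends on its units, while coefficients fit on standardized features measure the effect of a one-standard-deviation change and so are comparable across features.

```python
# Toy, made-up data illustrating why features must share a scale before
# we compare coefficients for interpretability.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
small = rng.uniform(0, 5, 500)             # feature on a 0-5 scale
large = rng.uniform(10_000, 100_000, 500)  # feature on a 10,000-100,000 scale
X = np.column_stack([small, large])
y = 2.0 * small + 0.0001 * large + rng.normal(0, 0.1, 500)

raw = LinearRegression().fit(X, y)
std = LinearRegression().fit(StandardScaler().fit_transform(X), y)

# Raw coefficients differ by orders of magnitude purely because of units...
print(raw.coef_)  # roughly [2.0, 0.0001]
# ...while standardized coefficients land on a comparable footing.
print(std.coef_)
```

Here both features contribute similar amounts of signal, which the standardized coefficients reveal and the raw ones hide.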
The larger that coefficient is, the more important that feature can be considered in regard to predicting the median household value. The way we're going to do that is, again, to first standardize the data. We're then going to fit on the entire dataset. We're using the whole training set because we don't care as much about prediction here, but rather about interpretability: we want to bring in all of our training data and then see which of our features have the largest coefficients. In order to do this, we're going to run the code that we see here, and I'll explain it clearly as we go through it over the next couple of cells. So we have our best hyperparameter with lasso, with alpha equal to 0.01. We create our pipeline, where we first create our polynomial features, so we're going to have a ton of new features as we square each of our features and create all of the interaction terms between them. Then we scale everything, so it's all on the same scale, and then we fit on X and y. Again, we don't care as much about prediction; we've already found that 0.01 is going to be our optimal value in regard to how well we'll perform on a holdout set. Now we want to see what the actual interpretability of that output will be, so we're fitting on the entire training set to get the coefficients fit across as much data as possible. We'll look at the score here, though that won't be as important, and then we'll look at each of the feature importances using the code we see above. So, step by step: when we run best_estimator here, let's pull this all out. I promised you earlier that we would take the actual names from the pipeline, and we will now make use of each of the names we passed through, rather than just passing through, say, our PolynomialFeatures or StandardScaler or whatever object we put in the pipeline.
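The fit just described might look like the following sketch. Again, `X` and `y` are stand-in data here rather than the notebook's Boston housing training set, and the step names are the ones the transcript uses.

```python
# Sketch of fitting the final lasso pipeline on all of the training data
# for interpretability, using the alpha=0.01 found in the search.
# X and y are stand-in data; the notebook fits on its training set.
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

X, y = make_regression(n_samples=200, n_features=4, noise=5.0, random_state=0)

best_estimator = Pipeline([
    ("polynomial_features", PolynomialFeatures(degree=2)),  # squares + interactions
    ("standard_scaler", StandardScaler()),                  # same scale for every feature
    ("lasso_regression", Lasso(alpha=0.01, max_iter=100_000)),
])
best_estimator.fit(X, y)
print(best_estimator.score(X, y))  # training R^2; not the focus here
```

Because the goal is interpretability rather than held-out prediction, fitting on everything available is fine at this stage.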
If we want to access a portion of the pipeline, we have to take our best_estimator, which is equal to the pipeline we fit before, and call .named_steps on it. That .named_steps gives us access to a dictionary through which we can look at different attributes of each of these subsets of our pipeline. Each of those subsets could be the polynomial features or the StandardScaler, or, if we want the coefficients, we get those from the lasso here. So we do best_estimator.named_steps, pull out the polynomial features, and call this function called get_feature_names, which is going to give us X1, X2, X3, those values squared, and so on, so that we'll be able to look back and see which of our features actually ended up having the highest feature importance. You see here I'm pulling out all the names, where we have CRIM, then CRIM squared, then CRIM times ZN, and so on and so forth. This allows us to see each of the new features we created using the polynomial features. Then we also want to pull out the coefficients that our model learned. I'm going to copy this: again, we use that .named_steps, and we say that we want to pull from the lasso_regression. The lasso_regression is the portion of our pipeline that contains the coefficients attribute, and we see a coefficient for each of the values we just saw laid out with our polynomial features. We're going to zip those two together so that each feature is perfectly aligned with its coefficient; that's this zip function. Then we pass it into a DataFrame and call that DataFrame df_importances. So what does our df_importances look like now? Its first column is going to be each of our feature names, and its second column is going to be the value of that coefficient.
We're then going to sort our values so we can see the largest negative and largest positive coefficients, in regard to which variation affected our predicted outcome the most. We see that the interaction between the number of rooms and the tax rate actually has a very negative effect, and we can pull out the boston_description we defined earlier to look up each of these column names if we're curious. So we had RAD times PTRATIO, where RAD is the index of accessibility to radial highways and PTRATIO is the pupil-teacher ratio by town. It's a little bit confusing to think that that would have such a large effect. We may want to test removing it, as we have so many more features at this point (we're now working with 104 features), or we can say maybe that effect is real and dive deeper in. The other thing we can look at here is rooms squared: as the number of rooms increases, that's going to have more and more weight in how we predict the outcome variable we're looking at. That closes out this section on looking at feature importances and getting a quick peek at ridge regression. In the next section, we will go over GridSearchCV, which will allow us to do this for loop through all of these different hyperparameters in a much more succinct fashion.