Welcome to part 2 of our notebook. Here we're going to discuss adding in a preprocessing step, as well as introducing the Pipeline and cross_val_predict, which we'll see will consolidate a lot of the code that we've been writing out. Similar to before, we're going to start off with just an empty list for scores. We're going to initiate our linear regression object as lr, and then we're also going to add on this preprocessing step of scaling our data: that's going to be s = StandardScaler. We're then going to run that same for-loop using kf.split and end up with an x_train, x_test, y_train, and y_test set.

We're then going to take our x_train and fit and transform it using our StandardScaler. That comes up with the mean and standard deviation, subtracts the mean, divides by the standard deviation, and produces this new x_train_s. We've now scaled our data. We're then going to pass in our scaled data, along with the original outcome variable for the training set only, and fit our linear regression model on the training set. We then have to transform our test set, getting it to the same scale using what we learned from the StandardScaler on the training set. So we just run s.transform on our test set, subtracting the training mean and dividing by the training standard deviation to get x_test_s. We can then pass that into the linear regression model that's fitted to our training set and come up with predictions on the test set. We can then use those predictions to check the r2_score between the actual values and our predictions, and just as we did before, we'll keep appending the new scores onto our scores list.

Then we see here that our scores are exactly the same as before, and that's because this is vanilla linear regression without any regularization. Regularization is a term we'll introduce later on; the idea is that lasso and ridge, which we talked about, will help you prevent overfitting, and those do need you to scale your data. For regular linear regression, scaling won't actually affect performance.

As we add on more and more preprocessing steps, this can become pretty cumbersome: you keep adding on your fit transform, then the next fit transform if you have another preprocessing step, and then, when you want to run it on the test set, you have to transform through each one of those. Luckily, sklearn recognizes that this is common within the machine learning workflow, and they have introduced the pipeline functionality. Pipelines allow you to chain together multiple operators on your data, as long as each of them has a fit method. On top of that, every step leading up to the last step has to have both fit and transform, so that the output of one step can be the input of the next. You can chain together more than two steps; you can chain together 10 steps, as long as each step before the last has fit and transform and the last step has fit.

We're going to reinitiate our StandardScaler and our linear regression, and then we're going to introduce our Pipeline here. Recall we imported Pipeline earlier from sklearn.pipeline, and we're going to set the variable estimator equal to our pipeline. Our pipeline's going to have two steps: first it will scale our data, then it will pass that through to a linear regression. This will allow us to bypass those steps we saw above of fit transform, as well as transform, when we actually want to test. A sketch of both versions is shown below.
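As a reference, here is a minimal sketch of that loop and of the pipeline construction. It assumes X and y are NumPy arrays holding the features and outcome from part 1, and that kf is the KFold object defined there; the step names "scaler" and "regression" are just illustrative labels.

```python
from sklearn.model_selection import KFold
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.pipeline import Pipeline

kf = KFold(n_splits=3, shuffle=True)  # assumed to already exist from part 1

# Manual version: scale inside the loop, fold by fold
scores = []
lr = LinearRegression()
s = StandardScaler()

for train_index, test_index in kf.split(X):
    x_train, x_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

    x_train_s = s.fit_transform(x_train)  # learn mean/std on the training fold and scale it
    lr.fit(x_train_s, y_train)

    x_test_s = s.transform(x_test)        # reuse the *training* mean/std on the test fold
    y_pred = lr.predict(x_test_s)
    scores.append(r2_score(y_test, y_pred))

# Pipeline version: chain the scaler and the regression into one estimator
estimator = Pipeline([
    ("scaler", StandardScaler()),        # every step before the last needs fit and transform
    ("regression", LinearRegression()),  # the final step only needs fit
])
```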
To show you what that looks like, we're going to quickly run the code. We now have the estimator object created, and, similar to what we do with linear regression, we can just pass in our x_train and our y_train. We have now fit that pipeline object to x_train and y_train. What we've done there is we first scaled our x_train data and then ran the regression, so the model is fit to the scaled version of the data. Now that the estimator is fit, we can call estimator.predict, just as we would with linear regression, pass in our x_test, and it will produce actual predictions. That's the way that pipelines work.

Now, rather than writing that for-loop, if we want to get the prediction for each one of our holdout sets in our K-folds, we can use the function cross_val_predict. To see how that works, first let's reintroduce our K-fold object, which specifies that we want three splits and that we want them shuffled rather than taken in some type of ordering. We're going to pass that K-fold object into cross_val_predict, which is why we reintroduced it. We're going to say: for my estimator, use the pipeline we created above. For X, pass in the full, unsplit values, because cross_val_predict will do the splitting for you; the same goes for y. Then for cv, we say how we want it to fold. You can pass in the K-fold object we specified, which ensures not only that you get three splits but also that they're shuffled, so it's a specific type of split. You could also just pass the number 3 for cv, and that will create three splits, but they may not be shuffled; passing the object lets you specify exactly how you want it to split.

Then when I run it, it will output predictions, given that it trains on two-thirds and then predicts the other third. After it has predicted all three of those thirds, it will have predicted every single value in our dataset, but only using models trained on different subsets of that dataset. That's how cross_val_predict works: we have a training set which is two-thirds and a holdout set which is one-third, so we predict that one-third, and then we use a different two-thirds to predict another one-third.

Now we have our predictions. I'm going to run the length of our predictions, and you see that it's the same length as our actual dataframe. Then we're going to check the r2_score of our predictions against the original y, and we see that it's almost identical to the scores we had above when we called out every single fold one at a time. That's how cross_val_predict works; a short sketch follows below.

What's important to note is that cross_val_predict did not actually fit the model at any step along the way. It gives you all the outputs, but it essentially came up with three different models: one trained on the first two-thirds, one trained on the next two-thirds, and so on. Well, actually, our estimator is fitted here, but only because we fit it back up above; otherwise I wouldn't be able to run estimator.predict. Just be aware that after you run cross_val_predict on its own, the estimator has never been fitted.

That closes out this section. After this, we will come back and talk about hyperparameter tuning. Remember that we learn our parameters, but we have to choose our hyperparameters.
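Here is a minimal sketch of those two steps, assuming estimator is the pipeline built above, X and y are the full, unsplit arrays, and x_train / x_test / y_train come from the last fold of the earlier loop.

```python
from sklearn.model_selection import KFold, cross_val_predict
from sklearn.metrics import r2_score

# The pipeline behaves like a single model: fit scales x_train and then fits the regression;
# predict scales x_test with the training statistics and then predicts.
estimator.fit(x_train, y_train)
y_pred = estimator.predict(x_test)

# cross_val_predict does the splitting itself: for each fold it trains a clone of the
# pipeline on the other two-thirds and predicts the held-out third.
kf = KFold(n_splits=3, shuffle=True)
predictions = cross_val_predict(estimator, X, y, cv=kf)

print(len(predictions) == len(X))  # one out-of-fold prediction per row
print(r2_score(y, predictions))    # compare against the per-fold scores above
```

Because cross_val_predict works on internal clones, the estimator you pass in is left untouched; it is only fitted here because we called estimator.fit ourselves a few lines earlier.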
Hopefully, now that we've introduced cross_val_predict, you can start thinking about how we're going to change our hyperparameters: those arguments that we're allowed to adjust, which will actually change the output of our model. They're not learned; we set them. So how do we choose among all of the different hyperparameter values that may be available? I'll see you there.