Welcome to our third notebook here in the second course, on cross-validation. In this notebook, we will discuss how to chain multiple data processing steps together using the Pipeline functionality, which will allow you to speed up a lot of your machine learning workflow. We'll talk about using the KFold object to split data into multiple folds as we saw in lecture, and then we'll learn how to perform cross-validation using sklearn's cross_val_predict, as well as GridSearchCV, to see how well our model performs on each one of these folds.

Now, as usual, we're going to bring in many different libraries. Let's go down to the second line here on sklearn. From model_selection, we're going to bring in the objects we just mentioned, KFold and cross_val_predict. From linear_model, we're not just going to bring in LinearRegression, but we're also going to introduce Lasso and Ridge regression. We're going to talk about these in more detail during lecture, but note that these are just linear regression plus a penalty that helps ensure you don't overfit that linear regression. Then from metrics, we're bringing in r2_score, which is just R squared, and from pipeline, we're bringing in the Pipeline object.

Now, our dataset this time is going to be saved in a pickle file. This pickle file is actually a dictionary. Pickle allows us to save Python objects and retrieve them easily. So we're going to open up this pickle file that we have here. If we look at boston.keys(), we see this is a dictionary with the keys dataframe and description, and we're going to pull out the pandas DataFrame specifically, call that boston_data, and then we're also going to pull out the Boston description, to separate these two out into their own objects. Then we have the pandas DataFrame, and we can look at the first five rows and see that the median value is what we're trying to predict for the housing, and we also have all these different features to help us predict that median value.

Now we know, given our discussion and the subject of this current notebook, that our goal is going to be: how can we predict future values when we only have the data available to us in this dataset? So what we're going to want to do is use KFold to separate the data into three different folds, three different train and test sets, and we want to think about how we're going to do this in Python code.

So to code this up, the first thing that we want to do is separate out our x and y variables: our features x, and our target variable y. So x is just going to be equal to boston_data with .drop to remove the outcome variable, and then y is just going to be equal to the outcome variable. Then, as we've done so far with sklearn, we're going to want to initiate an object. So we're going to initiate our KFold object before using it, and we're going to pass certain arguments into that initiated object. When we say shuffle equals True: if you recall, when we were looking at a DataFrame of, let's say, 100 values, we could take the first 10 as our test set and leave out the next 90, then take the next 10, and so on. So we'd start off with indices zero to nine, then 10 to 19, and each one of those groups would be a different test set. Or we can shuffle and choose a random 10 to be our test set, the rest being our training set, and then take another random 10. This will become clear as we actually look at the indices that we're pulling out in the next couple of cells.
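To make this concrete, here is a minimal sketch of the setup just described. The pickle file name, the dictionary key casing, and the target column name ('MEDV') are placeholders for illustration, and the random state is just an arbitrary fixed value.

```python
import pickle

import pandas as pd
from sklearn.model_selection import KFold, cross_val_predict
from sklearn.linear_model import LinearRegression, Lasso, Ridge
from sklearn.metrics import r2_score
from sklearn.pipeline import Pipeline

# Load the pickled dictionary; the file name and key names here are
# assumptions for illustration.
with open('boston_housing.pickle', 'rb') as f:
    boston = pickle.load(f)

boston_data = boston['dataframe']          # pandas DataFrame of features + target
boston_description = boston['description'] # text description of the dataset

# Separate features (x) and target (y); 'MEDV' stands in for the
# median-value column we want to predict.
x = boston_data.drop('MEDV', axis=1)
y = boston_data['MEDV']

# Initiate the KFold object: three folds, shuffled, with a fixed random state.
kf = KFold(n_splits=3, shuffle=True, random_state=42)
```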
So we're going to shuffle, so the folds aren't just taken in order, and then we set our random state, and then the other important argument that we need to pass is the number of splits: how many times do we want to split up our DataFrame? We're going to have three folds, meaning three training sets and three test sets, where none of the test sets are going to overlap. If you recall from lecture, the training sets can overlap, but the test sets have to be exclusive, so that we are looking at a different test set every single time.

Let's actually look at what kf.split does for us. That's going to give us a generator object. That generator object you can think of as a list, a list where each value is a tuple. The first part of that tuple is all the indices that we want to set as our train index, and the second part of that tuple is going to be the test index. So it's going to be three tuples, each one with a train index, giving indices taken from our DataFrame, taken from our x, so the values stay within its size, and then our test index, which is again just the indices specifying the test set.

We have our train index; that's just going to be a list of numbers, and that list of numbers is going to be bounded by, let's just look at x.shape, the 506 rows we have. So values between zero and 505, all the way up until the end. We're going to look at the first 10 values for each one of our train and test splits, and then we're also going to get the length of our train index and the length of our test index. Now, think about what the length of our train and test indices should be. If we're splitting into three, then we know that our test indices should be around one-third of the length of our entire dataset. I'm going to run this, and we see that the test index size, around 169, is roughly one-third of 506, and it's going to be about the same for each one of these folds. One of them is off by one, just because 506 doesn't divide cleanly by 3. But we see that each one of these test index lengths is going to be around one-third, and the remaining two-thirds is going to be the size of our training index: all the values that we're eventually going to train our model on. We also see the actual indices, just the first 10, here for the train index. Remember, the train indices can overlap, and you should see some overlap; here, for example, you can see the same index, 2, showing up in more than one training index. But the test indices will not have any overlap. These are each going to be unique, and these are going to be different holdout sets. We're going to train on this training set and then test on this test set, then train on the next training set and test on its test set, and so on.

Now, we want to get the scores for each one of our train and test splits. We're going to start off with a blank list for scores. We're going to initiate our linear regression object. Now we have our predictor, it's lr, and we're going to say: for train index and test index in our split. Again, the split was defined earlier, and we saw what that output will look like. That'll be a train index and a test index; we'll run through that, and then we'll get to the next tuple of train and test indices. We're going to set our x_train, x_test, y_train, and y_test to the following outputs.
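A short sketch of inspecting those splits, reusing the x and kf objects from the setup sketch above, might look like this:

```python
# kf and x as defined in the setup sketch above.
for train_index, test_index in kf.split(x):
    # First 10 indices of each split, just to see what kf.split produces.
    print("Train (first 10):", train_index[:10])
    print("Test  (first 10):", test_index[:10])
    # Test length should be roughly one-third of the rows; train the other two-thirds.
    print("Train length:", len(train_index), "Test length:", len(test_index))
```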
It's going to be x, so our original x with all of our rows, but we're only going to take the rows for our train index, and we're going to take all of our columns from x. We're then going to set x_test to the rows at our test indices. So now we've set x_train and x_test. Then y_train is just going to be y at the matching train index, the same one we used for x_train, and then we're going to set y_test to the rows of y at the test index, to match up with the x_test we just defined.

We're then going to fit our model to our training set. Again, this is going to be a for loop, so it's going to do this three times. Using our first training set, we're going to do lr.fit on x_train and y_train. We're going to come up with our predictions on x_test. Again, we fit on our training data and then we see what the predictions would be, assuming we didn't have the labels for x_test; that gives us our prediction. Then we can ask, for that prediction, how well did we do according to the actual values that we held out in our test set? So we get the r2_score of y_test.values and y_pred. r2_score we brought in earlier; recall that was one of the metrics that we pulled in. We set that equal to our score and then we append it onto our original list that was empty at first. We're going to do this for three different splits.

We'll run this code, and you see that it outputs three different results, one for each of those train and test sets, and this also makes clear how you can end up with fairly different values depending on what your test set is. This highlights the importance of doing multiple folds, and then eventually, if you're doing cross-validation, you would end up averaging these all together. Now, in the next section, we will continue with this, adding on scaling and then going into cross-validation predictions using the cross_val_predict functionality. I'll see you there.
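Putting it all together, a sketch of that scoring loop, again reusing x, y, and kf from the earlier sketches, could look something like this:

```python
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

scores = []                      # start with an empty list of scores
lr = LinearRegression()          # our predictor

# x, y, and kf as defined in the setup sketch above.
for train_index, test_index in kf.split(x):
    # Select the training and test rows for this fold.
    x_train, x_test = x.iloc[train_index, :], x.iloc[test_index, :]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]

    # Fit on the training fold, then predict on the held-out test fold.
    lr.fit(x_train, y_train)
    y_pred = lr.predict(x_test)

    # Score the predictions against the held-out labels and save the result.
    score = r2_score(y_test.values, y_pred)
    scores.append(score)

print(scores)   # three R^2 values, one per fold
```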