In this lab, we're going to build and evaluate a simple linear regression model to predict Apple closing stock prices using scikit-learn and BigQuery. Our objectives are to load data from BigQuery into a Pandas DataFrame, build a linear regression model in scikit-learn, and do all of this inside of an AI Platform notebook. Okay, let's get started by booting up an AI Platform notebook. To do so, we'll use the navigation menu on the left and scroll all the way to the bottom. In the Artificial Intelligence section, you should see AI Platform. Go there and then click on Notebooks. Then click on New Instance and choose TensorFlow 1.14. We could actually get away with less by choosing, for example, this Python environment, but this is suitable. We'll leave all the defaults as they are and click on Create. Now, this will take a few minutes to boot up. When it's done, you'll get the option to open up the JupyterLab notebook environment. Here our instance is ready, so we open JupyterLab. Now, rather than creating a new notebook, we're actually going to work out of an existing notebook that we have on Git. So we open up a terminal and clone that repository. Once the repository is cloned, you'll see all of its contents here on the left in training-data-analyst. If we click on training-data-analyst, we can navigate to courses, then AI for Finance, go straight to the solution section, and open up this notebook here. In this first cell, we make use of so-called magic functions. Magic functions allow you to execute system commands in notebook cells. In this first cell, we're basically running a small batch file. The first statement creates a dataset in BigQuery called ai4f. The second statement loads a CSV file containing 10 years' worth of Apple stock data, stored in Google Cloud Storage, into a BigQuery table we call AAPL10Y. This table contains the data that we're going to build our regression model on. So let's go ahead and execute this cell.
In the second cell, we import all the libraries we need, namely scikit-learn and pandas. Let's go ahead and execute that. In AI Platform notebooks, there's a BigQuery magic function. To see the full documentation on this function, go ahead and execute this cell. To see the documentation for any magic function, simply enter the magic function followed by a question mark. The way the BigQuery magic function works is you enter your query, and it saves the output of the query to a Pandas DataFrame, which we call df. This here is a relatively long, complicated query that will give us all of the features we need for our machine learning model. Our model will be really simple, in that it only has two features as input. The first feature will be the previous day's closing price, which makes intuitive sense. The second feature will be something we call a three-day trend. The three-day trend variable for any given row looks at the previous four days of closing prices. If the closing price on a day is greater than the closing price on the previous day, we assign that day a plus one; otherwise, we assign it a minus one. If the majority of the past three days consists of plus ones, the three-day trend is set to plus one; otherwise, it is set to minus one. We get the data we need by making heavy use of the LAG function in SQL. We go into more depth on the LAG function in the BigQuery Machine Learning lab; for now, let's take this query as is. Let's go ahead and execute it. The output of our query should be stored in the Pandas DataFrame that we call df. Let's go ahead and use the head command to see the first five rows of the DataFrame. Perfect. You have the column close, which is what we're trying to predict, the previous day's close, and the three-day trend. We have all the data we'll need to build a regression model in scikit-learn. First, let's visualize some of the data.
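To make the three-day trend logic concrete, here is a minimal pandas sketch of the same idea the lab's SQL LAG query implements. The column names `close`, `day_prev_close`, and `trend_3_day` are assumed from the notebook's output; the helper function itself is mine, not part of the lab.

```python
import pandas as pd

def add_three_day_trend(df):
    """Sketch of the lab's LAG logic in pandas (column names assumed).

    Each day gets +1 if its close beats the previous day's close,
    else -1. The three-day trend for a row is +1 if the majority of
    the previous three daily signs are +1, else -1.
    """
    # +1/-1 sign for each day-over-day change (first row lacks history).
    sign = (df["close"] > df["close"].shift(1)).map({True: 1, False: -1})
    # Sum the signs of the previous three days; a positive sum means
    # the majority were +1. The first three rows lack full history.
    df["trend_3_day"] = (sign.shift(1) + sign.shift(2) + sign.shift(3)).apply(
        lambda s: 1 if s > 0 else -1
    )
    df["day_prev_close"] = df["close"].shift(1)
    return df
```

In practice you would drop the first few rows, which don't have enough history for the trend to be meaningful.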
Since we basically have a time series, the closing Apple stock value as a function of the date, let's go ahead and plot that. Here we see a steady increase. Next, let's try to get some intuition for this trend_3_day variable by overlaying it on the time series above. This series of commands looks at a subset of the full closing-price time series, from June to July of 2018. The first plot command draws the closing price as a dotted line. The second plot command shows positive three-day trends as blue dots, while the third plot command shows negative three-day trends as red dots. Let's go ahead and execute this cell. As expected, when the close values are visually trending downwards, we see a lot of red dots; when the trend is upwards, we see more blue dots. In this next cell, we check the size of our dataset. Okay. Now we're at the point where we're ready to build a linear regression model in scikit-learn. First, in this cell, we specify the features we'll be using, which are the previous day's closing value and the three-day trend. We also specify what we're actually trying to predict, which is the current day's close value. We also break our dataset up into training and testing sets. As you've learned, it's never a good idea to gauge how good your model is using the same data that you trained it on. Here, we somewhat arbitrarily select the first 2,000 rows for training and send the rest to our testing dataset. Go ahead and execute this. In the next cell, we initialize a linear regression model object in scikit-learn. We don't want an intercept in this case, so we set that parameter to false. Next, we go ahead and fit the model on our training data. Since we don't have a lot of data, the training will be pretty fast. Finally, we make predictions on the testing data; we'll use this output to gauge how good our model is.
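The split-fit-predict steps above can be sketched as follows. This is a simplified stand-in for the notebook's cells, with the feature and target column names assumed from the lab; the function wrapper and its arguments are mine.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

def train_close_model(df, n_train=2000):
    """Sketch of the lab's modeling cells (column names assumed).

    Takes the first n_train rows for training, fits a linear regression
    with no intercept, and returns the model plus the held-out test set.
    """
    features = ["day_prev_close", "trend_3_day"]
    target = "close"

    # Time-ordered split: earliest rows train, the remainder tests.
    X_train, y_train = df[features][:n_train], df[target][:n_train]
    X_test, y_test = df[features][n_train:], df[target][n_train:]

    # fit_intercept=False mirrors the lab: the previous close already
    # carries the price level, so no constant offset is needed.
    model = LinearRegression(fit_intercept=False)
    model.fit(X_train, y_train)
    return model, X_test, y_test
```

Note that a time-ordered split (rather than a random shuffle) avoids training on days that come after the days we test on.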
To gauge how good our model is, we'll use two metrics: the root mean squared error and the variance score. The root mean squared error attempts to gauge, on average, how far off your predictions are from the truth; the lower, the better. The variance score measures how correlated your predictions are with the truth; the closer to one, the better. Now, getting good values for these metrics doesn't necessarily mean your model is useful. It all depends on the context. In this case, would you be okay with being off by around three points for your closing-value prediction? As a sanity check, whenever you build a regression model, it's good to plot the predictions against the truth, like we do in this cell here. Ideally, all the points would lie on a straight line. As you can see, we have some deviation, so the model is not perfect. When building a machine learning model, it's always good to have a simple baseline to compare your model's performance against. For our problem, what if we just took the previous day's close value and made that our prediction? Let's calculate the mean squared error for this kind of model. We execute the cell and see that the error is actually lower in this case than when we also include the three-day trend variable. This tells us that the three-day trend variable really has no predictive power. Try to add some new features to the model we built to get that root mean squared error as low as possible. Perhaps you can try a five-day trend. What if you included two days' worth of previous values? Get creative.
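The evaluation described above can be sketched as a small helper: compute the model's RMSE and variance score, and compare against the naive baseline that simply predicts the previous day's close. The function and its argument names are mine; scikit-learn's `r2_score` is used for the variance score, which is a common reading of that metric but an assumption here.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

def evaluate_against_baseline(y_true, y_pred, prev_close):
    """Sketch of the lab's evaluation cells.

    Returns the model RMSE, the baseline RMSE (predicting yesterday's
    close), and the variance score of the model's predictions.
    """
    rmse_model = np.sqrt(mean_squared_error(y_true, y_pred))
    # Baseline: use the previous day's close as the prediction.
    rmse_baseline = np.sqrt(mean_squared_error(y_true, prev_close))
    variance = r2_score(y_true, y_pred)
    return rmse_model, rmse_baseline, variance
```

If `rmse_baseline` comes out lower than `rmse_model`, the extra features are not adding predictive power, which is exactly the conclusion the lab reaches for the three-day trend.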