In this video, let's apply some of the skills that we've learned about Poisson regression to analyze some real data in R. And along the way, we'll perform some data cleaning tasks, which are always really important for data scientists to have some sense of. So the data set that we'll look at is a data set of bicycle counts. During several months of 2017, data were collected on the number of cyclists crossing a few bridges from either Queens or Brooklyn into Manhattan, and we'll look at those cyclist counts along with some weather data. The data frame that we'll look at will have each date and the day of the week. It will include a high temperature measurement and a low temperature measurement, and also the amount of precipitation on that day, measured in inches. And then it has a total cyclist count for each of four different bridges: the Brooklyn Bridge, the Manhattan Bridge, the Williamsburg Bridge, and the Queensboro Bridge. All of these counts are over an exposure of a 24-hour period. We also have the total cyclist count over all of these bridges combined in its own column. Our goal for this analysis will be to come up with a Poisson regression model that uses the weather data to explain or predict the total number of cyclists on the Manhattan Bridge on any given day during the period in question. So we'll focus our attention on the Manhattan Bridge, but of course I would encourage you to do similar analyses for the other bridges included in this data set. First, let's load the data into R. Of course, there are many ways to do this; we'll use the RCurl package and the getURL function. So I've loaded the RCurl package here, and just for display purposes I've used the paste function so that the URL, which is rather long, doesn't run off the screen; instead I'm able to wrap it so that you can see the entire URL here.
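A minimal sketch of this loading step might look like the following. The URL shown is a placeholder, since the actual dataset location isn't reproduced here:

```r
# A sketch of the loading step; the URL below is a placeholder, not the
# actual location of the data set used in the video.
library(RCurl)

# paste() lets us wrap a long URL across lines for display purposes.
url <- paste("https://example.com/data/",
             "nyc-bicycle-counts-2017.csv", sep = "")

# Fetch the file contents and read them into a data frame called bike.
bike <- read.csv(text = getURL(url))
```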
And so the rest of this code reads the data in; we're calling the data frame bike. A few things we should check: we'll look into whether there are any NA recordings. Summing over is.na tells us the number of TRUE values, so it answers the question: are there any NAs? If the sum is one or more, then there is an NA somewhere and we should investigate it. But if the sum is zero, that tells us there are no NAs, and the zero here reflects that there are no NA recordings. Now, that of course doesn't mean that the data have no missing values; we have to explore the way that the data were recorded. Sometimes missing values are recorded as a string of nines, for example, so you might have to go a little bit further in exploring for missing values. In this case I won't do that, because I know that there are no other missing values. If we take a look at the first 10 rows of our data set, we'll notice something interesting. If you look at the precipitation variable, we'll notice some numerical entries, but also this T entry. It turns out that if you read some of the documentation about these data, the T stands for trace precipitation, and that's great to know, but it creates some potential problems for our analysis that we have to deal with. We could treat precipitation as a numerical predictor in our model, except we'd have to do something with the T. What we'll choose to do is just treat trace precipitation as zero, and then treat precipitation as a numerical predictor. Another option might be to bin precipitation into, say, quarter-inch bins; trace precipitation would then naturally fall within the zero-to-one-quarter-inch bin, and you could treat precipitation as a factor. But we'll go with the continuous route.
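The NA check and the first look at the data described above could be sketched like this, assuming the data frame is called bike:

```r
# Count NA entries across the entire data frame; a sum of 0 means
# no values were recorded as NA.
sum(is.na(bike))

# Look at the first 10 rows; note the "T" (trace) entries in the
# precipitation column.
head(bike, 10)
```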
So there are many different ways to turn the T into a 0, and we'll use functions from the tidyverse package, namely the mutate function and the recode function. It's worth noting that right now the precip variable is stored as a factor, so what we'll do first is treat the T as a level of the factor and just recode it to be 0. Then, in the next cell, we'll convert the precip variable from a factor to a numerical variable. So this code here just changes the level from T to 0, and if we again print out the first 10 lines of the data frame, we see that all the Ts are now being treated as 0. So we know that we have to change precip from a factor to a numerical variable, but let's see what we need to do with any others. We'll look at the class of each one of our variables, and we see that the things that should be numeric are numeric: all of our counts are numeric, that's good, and the temperature measurements are numeric, that's great. Day, which is the day of the week (Monday, Tuesday, Wednesday, etc.), is stored as a factor, and it should be, so that's great. The only two things jumping out at us are, first, what we already know: the precipitation variable should be treated as numeric, so we'll fix that. And second, while it might not be super important for our analysis, it's nice to learn how to change our date variable from a factor into an actual date in R. So let's work on those things. Again, we're using the tidyverse and some functions within it to first change the date. We're going to mutate the date column of the data frame, and to store it properly we have to take it from a factor to a character first, and then use the as_date function so that R can recognize the format of the date. The format of the dates, if you look up in the data frame, is month/day, so this bit here just tells R that m/d is the format the dates are written in.
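A sketch of the recode step and the class check, assuming the precipitation column is named precip:

```r
library(tidyverse)

# Recode the factor level "T" (trace precipitation) to "0"; precip is
# still stored as a factor after this step.
bike <- bike %>% mutate(precip = recode(precip, "T" = "0"))

# Check how each column is currently stored (numeric, factor, etc.).
sapply(bike, class)
```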
Now, there's a curveball here: without a year, R will treat the year as the current year, so it will give us month/day/2020 in this case. But the data were recorded in 2017, so a fix for this is the line here. The ymd function allows us to manipulate dates a little more easily, and we are basically subtracting three years off of our dates. To use the ymd function, we need the lubridate library. That allows us, when we print out a summary, to see that dates are now stored as dates, and that's great. And the precipitation variable: we wanted to change that from a factor into a numerical variable. The way we do that is to first change it to a character string, so we take the levels and convert those to character strings, and then from there we make them numerical values. We mutate our data frame and store this new manipulation as precipitation. Now we can just confirm our code from above, and we see that date is now stored as a date, we have our day factor, and everything else is numeric. In particular, our precipitation variable went from a factor to numeric, that's great. Now, I print out a summary here just for us to look and see if anything jumps out as odd. Nothing seems too odd here, but here are some examples of what might be. All of these columns are counts of the number of cyclists going across a bridge, and so if, say, the minimum value were zero, that might be a red flag: it's probably very unlikely that zero cyclists went across the Brooklyn Bridge, and that might be an indication that zero was coded as a missing value. We would have to investigate that and then figure out what to do with our missing values. But for our counts we don't see any zeros, which in this case would probably be implausible, so things seem pretty reasonable there.
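The date and precipitation conversions described above could be sketched as follows. The column names (Date, precip) are assumptions about how the data frame is labeled:

```r
library(tidyverse)
library(lubridate)

bike <- bike %>%
  # factor -> character -> Date; "%m/%d" matches the month/day format,
  # and R fills in the current year (2020 here) when none is given ...
  mutate(Date = as_date(as.character(Date), format = "%m/%d")) %>%
  # ... so subtract three years to land back in 2017.
  mutate(Date = Date - years(3)) %>%
  # factor -> character -> numeric for precipitation.
  mutate(precip = as.numeric(as.character(precip)))

# Confirm the new storage types and scan for anything odd.
summary(bike)
```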
Again, if the max values were a string of nines, that could indicate that 9999 is being used as a missing-value code, but we don't see that, so that's great. All of the other values (precipitation measurements, temperature measurements, etc.) look relatively reasonable. All right, now that we've done some cleaning, I think we're ready to start the modeling process. The first thing that we'll do is split our full data set into a training set and a testing set. Of course, we will fit the model on the training set, and then we might look at some metrics on how it predicts on covariate classes in the testing set. We'll do an 80/20 split; there are more sophisticated versions of cross validation, which you might learn about in another course. Now, once we've randomly split the data into a training and test set, which I've called df_train and df_test (df for data frame), we'll fit a Poisson regression model. We do that with the glm function, glm for generalized linear models. This glm function will look similar to what we did with binomial regression, and similar in some respects to the lm function for standard linear regression. We always have a formula, which is the response, tilde, and then our predictors; in this case, we're using the weather predictors and then the day of the week as a factor. We should tell R what data frame we're looking into, so here we're doing this on the training set, and then for Poisson regression we need to specify the family as poisson, which will use the log link function that we've learned about in earlier lessons. Printing a summary, we see a table similar to binomial regression, and we should notice that we have our estimates, produced by maximum likelihood estimation using an iterative procedure, our standard error estimates, and z-tests based on the asymptotic theory of maximum likelihood.
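A sketch of the split and the model fit. The seed, the model object name bike_glm, and the response and predictor column names are all assumptions made here for illustration:

```r
# 80/20 train/test split by random row indices (seed chosen only so
# the sketch is reproducible).
set.seed(1)
n <- nrow(bike)
train_rows <- sample(1:n, size = floor(0.8 * n))
df_train <- bike[train_rows, ]
df_test  <- bike[-train_rows, ]

# Poisson regression with the default log link; the column names in
# the formula are assumptions about the data frame.
bike_glm <- glm(Manhattan.Bridge ~ High.Temp + Low.Temp + precip + Day,
                data = df_train, family = poisson)
summary(bike_glm)
```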
Now, at this point I think it might be important to learn how to interpret some of the estimates on this real data set, so we'll go ahead and do that with the precipitation variable. If precipitation is increased by one inch, so if we get an extra inch of rain on day two as opposed to day one, we would expect the mean number of bikes across the Manhattan Bridge to be multiplied by about 0.51. In other words, we would expect somewhere around half the number of bikes with one inch of rain versus zero inches, or two inches versus one inch. We can easily verify that interpretation here. In this cell, what I've done is create a new covariate class based on just the second row of the test set. Remember, we left some values out of the model fit, so I'm taking the second row of all the ones that we left out, then taking the fifth column, which is the precipitation variable, and adding one to it. It turned out that in the second row the precipitation was zero, so in the new data the precipitation will be one. If we print everything out here, it will be the second row of the test set with all the variables the same, except precipitation went from 0 to 1. Now I'm constructing two different predicted values from the Poisson regression model that we fit above. The first one is just on the original test set: the second row, taking all those covariates, plugging them into the model, and getting a predicted value. predict2 uses the new data that we just created, basically the same as the second row of the test set except that we have increased precipitation by one unit, that is, by one inch, and predict2 reflects that. What we should notice is that if we multiply predict1 by e raised to the precipitation coefficient, we should get the same thing as predict2, and that's reflected down here.
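This verification step could be sketched as follows, assuming the fitted model from above is stored in an object called bike_glm and that column 5 holds precip (both assumptions of this sketch):

```r
# Second row of the test set, with one inch added to precipitation.
new_data <- df_test[2, ]
new_data[1, 5] <- new_data[1, 5] + 1

# Predicted mean counts on the response (count) scale.
predict1 <- predict(bike_glm, newdata = df_test[2, ], type = "response")
predict2 <- predict(bike_glm, newdata = new_data, type = "response")

# Multiplying predict1 by exp(beta_precip) should reproduce predict2.
predict1 * exp(coef(bike_glm)["precip"])
predict2
```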
So e raised to the precipitation estimate, times the original prediction, gives us this value, almost 2,200, and predict2 gives us the same exact value. We're just exponentiating the precipitation coefficient in the model above and multiplying by the first prediction, and that gives us the same thing as the second prediction. That really verifies the interpretation: if you increase precipitation by one unit, then you're basically cutting the prediction in half, multiplying it by about 0.5. Now, this kind of calculation can be really helpful in understanding some of the relationships between precipitation and the number of cyclists that cross the Manhattan Bridge. But if you think about it, a one-inch increase in precipitation is kind of a lot, and so it might be helpful to be a little more fine-grained in our interpretation. We might want to know what happens if we increase precipitation by one standard deviation, which in the training set turns out to be about 0.38 inches. We can easily adjust our interpretation to account for not just a one-unit increase, but a one-standard-deviation increase. One way to do that is to create a new data frame where we change some of our original variables to scaled variables. We're using the mutate_at function, which will mutate these three columns using the scale function. The scale function will subtract the mean off of each one of the variables (it'll take the mean of the precipitation, for example, and subtract that from each value) and then also divide by the standard deviation. So we get scaled values of each of these three predictors in our new data frame, and here I print that the standard deviation of precipitation is roughly 0.38 inches. Now I'm going to use this new data frame to fit the generalized linear model with the same response.
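The scaling step could be sketched like this; the three column names are assumptions about the data frame:

```r
library(tidyverse)

# Standardize the three weather predictors: subtract each column's
# mean and divide by its standard deviation.
bike_scaled <- bike %>%
  mutate_at(c("High.Temp", "Low.Temp", "precip"),
            ~ as.numeric(scale(.)))

# Standard deviation of precipitation in the training set,
# roughly 0.38 inches.
sd(df_train$precip)
```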
The predictors are the same, except that they are now standardized, and we see what we get out. Again, assuming that the model is correct, if precipitation is increased by one standard deviation, about 0.38 inches, we would expect the mean number of bikes across the Manhattan Bridge to be multiplied by around 0.78, adjusting for temperatures and day of the week. So this allows us to be a little more fine-grained: instead of adjusting up by one inch, we can now adjust by something a little smaller. All right, so far we've seen how to clean some of our data, how to fit the Poisson regression model, and how to do a bit of interpretation, and some of that interpretation required us to say "assuming the model is correct." We have a particular interpretation, but we really haven't studied whether the model is actually correct; we'll do that in future lessons. But one really simple assessment that we can do here is to plot the predicted values from the model against the true values, and we could do that either on the training set or the test set. What we should hope for, if we plot predicted versus true values, is values lining up along the line y = x. If your predictions were perfect, every point in the following plot would lie exactly on the line y = x; if they're not perfect but really good, they would be tightly clumped around the line y = x; and if they're bad, they might be scattered quite far off. So here's some code to generate the following plot, which does just that, predicted values against true values, and I've plotted two different lines. The first line we'll talk about is this light gray line, which is the line y = x. Again, if the predictions perfectly matched the true values, all the dots would line up on the gray line, and that clearly doesn't happen here. Now, no model is perfect.
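The predicted-versus-true plot could be sketched as follows; the model object name bike_glm and the response column name are assumptions carried over from the earlier fit:

```r
# Predicted vs. true counts on the training set.
pred_train <- predict(bike_glm, newdata = df_train, type = "response")
true_train <- df_train$Manhattan.Bridge

plot(true_train, pred_train,
     xlab = "True daily count", ylab = "Predicted daily count")
abline(a = 0, b = 1, col = "gray")                 # the line y = x
abline(lm(pred_train ~ true_train), col = "blue")  # fitted line
```

The same code with df_test in place of df_train produces the test-set version of the plot.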
We might like to see values lying very tightly on that gray line, and in this case it doesn't look super tight; it looks like there's a decent bit of variability. The second line I've plotted, just to illustrate that maybe these predictions are not so great, is the regression line, a fitted line through these points. We notice that the fitted line looks relatively different from the line y = x, and that could be some indication that our predictions are a bit off in some systematic way. We also have a decent bit of variability in our predictions. All of this suggests that maybe we could do a bit better: maybe there's another model that could reduce this variability and get a bit closer to the ideal line, which would be y = x. Now, I've done the same plot here for the test set, and you see something similar. Again, there's a difference between the regression line fit through these values and the line y = x, you still see some variability, and you see some systematic over-prediction. All of that suggests that we could do a bit better with our model. So, in future videos, we'll take a look at how to be a little more formal in our assessment of how good this model is, and then we'll talk about alternatives to try to make the models better.