Welcome back to our notebook on stationarity. Here we're going to pick up and review what we discussed in the last video by looking over the SMS datasets we had before, SMS one and SMS two, plotting their run sequences as a quick reminder. Recall that neither was stationary: the first because of its changing variance, and the second because of its high autocorrelation and trend. The first thing we're going to do is split each series into chunks, so we call np.split to get ten chunks per series; the first set of chunks references dataset one, the one with the changing variance, and the second set of chunks corresponds to that second graph. We see that the variance is fairly low for the first three chunks, then jumps up for the next three or four, and then comes back down. So clearly there are large discrepancies in the variance: the series is not homoscedastic, it's heteroskedastic, it has different variances across chunks. The mean, on the other hand, stays fairly constant around zero. For the second dataset it's the opposite: the variance stays fairly constant, aside from one jump, but the mean goes from 2 to 10 to 16 and continues to increase throughout, as we see in the plot as well. We then plot the histograms. For the first dataset we see a large peak in the middle, with most of the values toward the center but really long tails on either side, due to that small section where the series was heteroskedastic and jumped around. For the second one we see something pretty far from a normal distribution: a bunch of high values, some in the middle, and a roughly uniform spread throughout, aside from the one peak coming from the chunk between 30 and 40.
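As a sketch of that chunking step: assuming a 1-D NumPy array whose length divides evenly into ten, np.split gives us the chunks, and we can inspect the per-chunk mean and variance. The variable names and synthetic data here are illustrative, not the notebook's actual SMS arrays.

```python
import numpy as np

# Synthetic stand-in for one of the notebook's series (100 points).
rng = np.random.default_rng(0)
series = rng.normal(loc=0.0, scale=1.0, size=100)

# Split into 10 equal chunks and inspect the mean and variance of each.
chunks = np.split(series, 10)
means = [chunk.mean() for chunk in chunks]
variances = [chunk.var() for chunk in chunks]

# For a stationary series, both lists should stay roughly constant across
# chunks; large swings in the variances suggest heteroskedasticity, and a
# drifting mean suggests trend.
```
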
The next thing we're going to do is run our ADF tests. Recall that the ADF (Augmented Dickey-Fuller) test tells us whether there's a unit root, in other words whether that lag-one autocorrelation is 1 or greater. The null hypothesis is that the series is non-stationary, and if we reject that null, then we're working with a stationary series. For the first dataset, the one with the changing variance, at a 5% threshold we actually end up rejecting the null and calling it stationary, even though we've already determined this heteroskedastic series is probably not stationary. But as mentioned earlier, the way this test works, it's really looking for that autocorrelation, so using the ADFuller test to judge whether a series is heteroskedastic may be a bit misleading. For the second series, the one with high autocorrelation, we do fail to reject the null, and the null again is that the series is non-stationary, which is right: that series is very clearly not stationary. Now in Section 3 we're going to talk about how we can transform non-stationary series into stationary ones, and that's going to be key later on when we develop models that rely on working with stationary series. We'll start by pulling up the ADFuller test for the trend-and-seasonality series we had before; here's the plot of those values, where there's both trend and seasonality. If we run the ADFuller test on the original series, it clearly fails: this is a very non-stationary series, and that's clear from the p-value being very high. We're then going to use seasonal_decompose, which we saw in our last notebook.
That leaves us with the trend, the seasonality, and the leftover residuals. If we plot each of these, we see the linear trend pulled out and the seasonality, and both of those, if you look at them, are very predictable in the direction they're going; we can plot the trend and seasonality separately, and then we have the residuals. The idea is that, using the additive model, we have subtracted both the seasonality and the trend out of our original series to end up with the residuals, and those residuals look like random values. Now, if we were to take the ADFuller test of just the residuals, one thing to keep in mind, as we recall from the last video: estimating the trend and seasonality requires neighboring values, so the decomposition can't compute them for the first few and last few observations, and those residuals come back as null. So we check the residuals only for the values where we actually have numbers. Running the ADFuller test on those leftover residuals, now that we've removed both the trend and the seasonality, we see that we have a stationary series. The next thing we're going to do is work with the heteroskedastic dataset, the one we purposely built. Again, as mentioned, the ADFuller test will not do a great job here of failing to reject the null, of flagging that this is actually not a stationary series, because it doesn't capture heteroskedasticity the way it captures trend and autocorrelation. But even with that in mind, we can take a log transformation to shrink down those large fluctuations, so that the variance doesn't differ quite as much between the first 50 or so values and the last 50 values.
Now, one important note: you can't take the log of a negative value, so we're going to add 38, which ensures all the values are positive. And if you think about the modeling side, adding 38 is not a big deal: subtracting 38 back out later, if we were to predict future values, would be very easy to do. So now we have our new values, all greater than 0, and otherwise the series looks the same as what we had above. We then take the log of those values. Before the transform, the values swing from around positive 60 all the way down to zero, a huge spread, whereas in the quieter stretch they sit in a narrow band, say around 38 to 40. We plot the logged series out, and because the spikes are still large relative to one another, it still looks like there's a problem with heteroskedasticity, but it does do a bit better: the values are now tight between 3 and 3.8, with one jump down to around 1, which is not nearly as bad as jumping from 60 down to 0 as before. And again we run the ADFuller test, and we get an even lower p-value, so we can reject the null even more easily. Still, I'd say the test won't do a strong job of detecting that changing variance; heteroskedasticity is something you'll usually have to spot by plotting the actual run sequence, and a good transformation to keep in mind when you have it is the log transformation. Now, one more thing we can do is remove autocorrelation with differencing. Remember the lagged dataset, where we purposely built in autocorrelation so that each value was equal to the previous one plus some random noise. We can transform this into a stationary dataset by subtracting the lag-one series from the values: we take the values up through the second-to-last and subtract them from the values from the second one onward.
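A hedged sketch of that shift-then-log idea. The notebook adds a fixed 38 because that happens to make its particular data positive; here, on a synthetic stand-in, we compute the shift from the data instead, which is the more general version of the same trick.

```python
import numpy as np

# Synthetic heteroskedastic stand-in: a low-variance half followed by a
# high-variance half.
rng = np.random.default_rng(7)
quiet = rng.normal(loc=0.0, scale=1.0, size=50)
noisy = rng.normal(loc=0.0, scale=15.0, size=50)
series = np.concatenate([quiet, noisy])

# Shift so every value is at least 1 (the notebook's "+ 38" plays this
# role), then take the log to compress the large swings.
offset = 1.0 - series.min()
shifted = series + offset
logged = np.log(shifted)

# The transform is easy to invert later when mapping predictions back
# to the original scale.
recovered = np.exp(logged) - offset
```
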
To make it perfectly clear what we're working with, let's look at the first three values of each series. Here are the first three values of the original, and for the lagged version, taking values one through the end, we're just starting at the next value. So the original runs through to the end except for the last value, and the shifted series starts at the second value, so we're essentially subtracting this value, the -1.4, from the next one to get our differenced values. And given the way we built that lagged data, subtracting one value from the next should leave essentially white noise, just the error term we added on. The table here also makes that clear: we have the original and then the shifted data, values one through the end. Then we can plot the run sequence of the differences; on the x-axis we use the time values minus the last one, because differencing drops one value, and we see that the differenced series is essentially white noise. We can run the new ADF test, see that it has a very low p-value, and therefore once again reject the null. Finally, in exercise 3, we're going to go through these same steps using the SMS datasets we had. Recall that the first one had the heteroskedasticity problem, so we're going to take its log in order to reduce the difference in variance between one part of the series and another. For dataset 2, we saw the trend going up, and we'll plot that out: this is the original trending series, and we saw there was high autocorrelation. But if we take the difference, running this below, we see that we now essentially have white noise.
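The differencing step itself can be sketched in a few lines. Here we build a lag-one autocorrelated series the same way the transcript describes (each value equals the previous one plus noise), then difference it; by construction, the difference recovers the noise.

```python
import numpy as np

# Build a random walk: walk[t] = walk[t-1] + noise[t].
rng = np.random.default_rng(3)
noise = rng.normal(size=100)
walk = np.cumsum(noise)

# Lag-1 differencing: subtract values[:-1] from values[1:].
# (np.diff(walk) is the equivalent one-liner.)
differenced = walk[1:] - walk[:-1]

# Because the walk was built by accumulating noise, the differenced
# series is exactly the noise terms from index 1 onward, i.e. white noise.
```
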
Here in the top plot we have, first, the original dataset and then the log transformation overlaid; the log transformation is the much less wiggly line on top of the very wiggly line that is the original dataset, so you can see how it reduces the variance. Then we can take the chunks again and look at the mean and variance of each, this time using the log and differenced data, after our transformations. The first series, the log-transformed one, originally had the problem of big changes in variance; now those changes are not nearly as large as they were before, and each chunk has a fairly similar variance. The differenced data originally had the problem of very different means; now those means are all around 0 and all fairly similar. They don't have to be around 0 specifically, they just all have to be around the same value, so that the mean stays the same across the chunks. We can also look at the histograms. For the log-transformed data there's still a big spike and a couple of outlying values, so it may not perfectly show a normal distribution, but those outliers also make it hard to tell: there's really a bit of a spread among the values between 3 and 3.5 that all got condensed into one bin. For the differenced data we see a bit more of a normal distribution. And then finally we can run the ADFuller test. Again, the first series really had a variance problem, so it probably would have passed the ADFuller test even before the transformation.
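The exercise's verification step can be sketched end to end: apply both transforms, then re-check the per-chunk statistics. The data and names here are synthetic stand-ins, not the notebook's actual SMS datasets.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(5)
hetero = np.exp(rng.normal(size=90))                   # positive, for the log
trended = pd.Series(np.cumsum(rng.normal(size=91)))    # random-walk trend

logged = np.log(hetero)                                # variance-stabilizing
differenced = (trended - trended.shift(1)).dropna()    # removes the trend

# Re-check the chunk statistics on the transformed series.
log_chunks = np.split(logged, 9)
diff_chunks = np.split(differenced.to_numpy(), 9)

log_variances = [c.var() for c in log_chunks]
diff_means = [c.mean() for c in diff_chunks]
# After the transforms, the chunk variances (log series) and chunk means
# (differenced series) should be roughly similar across chunks.
```
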
But we see here that it does still pass the ADFuller test. For the second one, the non-differenced data definitely would not have passed the ADFuller test, as we saw, but now we see that the differenced data definitely does. So that closes out our video, our lecture, and our notebook on stationarity. In this exercise, we covered what it means for a time series to be stationary: no trend, no changes in mean, no changes in variance, and an autocorrelation structure that stays the same throughout. We identified some common ways to check for stationarity: looking at the run sequence, looking at the histograms, and finally running those ADFuller tests. And then we went through some different transformations; in the lecture, we'll get further into these transformations and discuss them in a bit more depth. All right, I'll see you there.