Welcome back to Practical Time Series Analysis. In this lecture, we review the partial autocorrelation concept; in effect, we try to understand just how it's calculated. We've seen that for an AR(p) process, the PACF can be a really good tool for telling us the order of the process: we look at the PACF and determine where the spikes essentially die down into noise. We'd like to know just what is being measured, however. After this lecture, you'll be able to partial out a variable in a regression sense, we'll apply that idea to time series, and you'll be able to describe to a friend or colleague what the PACF measures.

There's a very nice example, available in several textbooks and also in some of our R packages, having to do with body fat. Measuring body fat directly is a pretty expensive and laborious process, involving people getting into big vats of water and looking at their displacement. It would be nice if there were a simple, fast, cheap, non-invasive way to get the same kind of measurement. What this data set explores is whether measuring a few things with essentially a caliper and a tape measure (triceps skinfold thickness, thigh circumference, and mid-arm circumference) would serve as good proxies, and whether we could build a good regression model for body fat based on these simple, easy-to-measure variables. If you look at the data set, the results look rather promising.

You may have access to these data somewhere else, but I'm going to show you that you can get them through the isdals library. We'll just bring the body fat data set into play. I like to always attach, so I can call the variables directly. And in order to run the pairs command and look at one-on-one plots, we'll put the variables into a matrix and then run pairs.
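Here's a minimal sketch of those steps, assuming the isdals package is installed and that the columns carry the names Fat, Triceps, Thigh, and Midarm:

```r
# Load the body fat data and look at all pairwise scatterplots.
library(isdals)   # install.packages("isdals") if needed
data(bodyfat)
attach(bodyfat)   # so we can refer to Fat, Triceps, Thigh, Midarm directly

# One-on-one plots of every pair of variables
pairs(cbind(Fat, Triceps, Thigh, Midarm))
```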
You can see that triceps is really a decent predictor of fat, and so is thigh. The thing that's interesting, or annoying from a regression point of view, is that triceps and thigh are themselves very strongly correlated. That can produce problems in a regression model: your coefficients may be hard to estimate, and interval estimates may become wider, leading to a loss of statistical power. There are reasons why we would like not to have too many correlated variables in one regression model. We're not going to explore multicollinearity in any kind of systematic way. What we are going to do is confirm numerically that, yes, the correlations between fat and triceps and between fat and thigh are both pretty high, and so is the correlation between thigh and triceps.

Our job right now is to measure the correlation of fat and triceps, for instance, after we control for thigh; you'll hear people say "partialling out." After we partial out thigh, what we're going to do is look at residuals. If that seems unmotivated, think about it like this: we'll take fat and predict it using thigh. We're trying to find the linear component of thigh in fat, speaking loosely. If we look at the residuals, we've extracted the linear predictive power of thigh on fat. We'll do the same thing with thigh and triceps, essentially subtracting off the linear relationship. After that's removed, we then see how fat and triceps are correlated, and we call that the partial correlation of fat and triceps.

To operationalize this is really quite simple: lm is the command that will give us the linear model, and if you're comfortable with R, you're probably comfortable nesting commands. We'll do a linear regression of fat on thigh, and we'll interrogate that model with the predict command to get our hat values. Things with a hat on them are things being estimated, so fat.hat is how fat is estimated in the linear model using thigh, and triceps.hat is the corresponding thing for triceps. Once we're done with that, we'll subtract off the fitted values, which is to say we'll look at the residuals, and we see that the partial correlation of fat and triceps, after thigh has been partialled out, is around 17%.

If you're lazy, or as I'd like to say, efficient, there's a library that will do this for you; it's a very popular thing to do. If you load the ppcor library, there's a command there called pcor. Again, put your variables in a matrix and run pcor, and you'll get the customary table. You can see, to many significant digits, that we've calculated the exact same quantity.

Now, from a time series point of view, when you have an AR(p) model, you'd probably like to partial out more variables than just one. In this example, we stay with our body fat model and show how to partial out a couple of variables. It's really the same process. Build a model predicting fat from thigh and mid-arm, and then subtract off the linear component; we're taking the linear predictor of fat on thigh and mid-arm and essentially getting rid of that linear contribution. Do the same thing with triceps. Then we take a correlation. Does it surprise you that the partial correlation, in this case, is higher? Both computations are sketched below.
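First, the single-variable version, partialling out thigh alone. This is a sketch using the same column names as before, and the second half assumes the ppcor package is installed:

```r
# Partial out Thigh: extract the linear component of Thigh from
# both Fat and Triceps, then correlate the residuals.
fat.hat     <- predict(lm(Fat ~ Thigh))
triceps.hat <- predict(lm(Triceps ~ Thigh))
cor(Fat - fat.hat, Triceps - triceps.hat)   # roughly 0.17

# The efficient route: ppcor computes the same table in one call.
library(ppcor)
pcor(cbind(Fat, Triceps, Thigh))
```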
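And here is the two-variable version, partialling out thigh and mid-arm together; the .hat2 names are just for illustration:

```r
# Partial out both Thigh and Midarm before correlating Fat and Triceps.
fat.hat2     <- predict(lm(Fat ~ Thigh + Midarm))
triceps.hat2 <- predict(lm(Triceps ~ Thigh + Midarm))
cor(Fat - fat.hat2, Triceps - triceps.hat2)   # higher than before
```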
Now, how does this work when we're dealing with a time series? We have, let's say, stochastic variables X_t through X_{t+h}. We'll try to find the effect of X_t on X_{t+h}, all the way to the right, after we control for, or partial out, the intervening random variables. We're going to use what I think is a very natural notation: X-hat_{t+h} will be the value predicted at position t+h using the several random variables preceding it. We won't include X_t in the model; we'll go from X_{t+1} all the way through X_{t+h-1}, in other words, the variables in the middle. The subscripts on our beta coefficients really just tell you how far away you are from the thing you're predicting: beta_1 is one step away from X-hat_{t+h}, beta_2 is two steps away, and so on:

X-hat_{t+h} = beta_1 X_{t+h-1} + beta_2 X_{t+h-2} + ... + beta_{h-1} X_{t+1}

Interestingly, due to stationarity, we can find a relationship for X-hat_t using the very same variables. We use the same coefficients, but look at which coefficients go with which variables now. Again, the subscript on the beta tells you how far away you are from the thing you're trying to predict: beta_1 is one step away, beta_2 is two steps away, and so on:

X-hat_t = beta_1 X_{t+1} + beta_2 X_{t+2} + ... + beta_{h-1} X_{t+h-1}

We're going to suppress some details on just how this is done for a time series, rather than for a stochastic process as you see here, so we won't worry about how the estimation is carried out. But at this point, I think you can see what we're about to do. We've got a predictor for X_{t+h}, and we've got a predictor for X_t. What we'll do is partial out the intervening random variables in the middle, by looking at the residuals and then finding a correlation: the value at lag h is the correlation between X_{t+h} - X-hat_{t+h} and X_t - X-hat_t. We're removing the linear effects of all the terms in between. That's how your partial autocorrelation plot is obtained: by getting rid of the linear effects of the terms between two random variables at a certain lag, or a certain distance apart.

At this point, especially in a simple linear regression context, you should feel very comfortable partialling out a variable, and you should now know, and be able to explain to a friend, just what it is the PACF is measuring.
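To close the loop, here is a minimal sketch, not from the lecture itself, that checks this residual interpretation against R's built-in pacf on a simulated AR(2) series. The two numbers should agree closely, up to small differences in how the estimation is done:

```r
# Simulate a long AR(2) series, then compute the lag-2 PACF two ways.
set.seed(1)
x <- arima.sim(model = list(ar = c(0.6, 0.3)), n = 5000)
n <- length(x)

ahead  <- x[3:n]         # X_{t+2}
middle <- x[2:(n - 1)]   # X_{t+1}, the intervening variable
behind <- x[1:(n - 2)]   # X_t

# Partial out the middle term from both ends, then correlate residuals
cor(resid(lm(ahead ~ middle)), resid(lm(behind ~ middle)))

# Compare with the built-in estimate at lag 2
pacf(x, plot = FALSE)$acf[2]
```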