[MUSIC] So we've just finished our deep dive into the formal definition of the three different sources of error that we have. But now what we're gonna do is turn to another optional video, possibly even more technical than the one we just completed, to derive why specifically these are the three sources of error, and why they appear as sigma squared plus bias squared plus variance. Okay, so let's start by recalling our definition of expected prediction error, which was the expectation over training data sets of our generalization error. And here I'm using just a shorthand notation, train, instead of training set, just to save a little bit of space. I don't mean choo-choo trains, I mean training data sets. Okay, so let's plug in the formal definition of our generalization error. And remember that our generalization error was our expectation over all possible input and output pairs, x, y pairs, of our loss. And that's what is written here on the second line. Then let's remember that we talked about specifying things specifically at a target xt, and under an assumption of using a loss function of squared error. And again, we're gonna use this to form all of our derivations. So when we make these two assumptions, the expected prediction error at xt simplifies to the following, where there's no longer an expectation over x because we're fixing our point in the input space to be xt. And our expectation over y becomes an expectation over yt, because we're only interested in the observations that appear for an input at xt. The other thing we've done in this equation is we've plugged in our specific definition of our loss function as squared error loss. So, for the remainder of this video, we're gonna start with this equation and derive why we get this specific form: sigma squared plus bias squared plus variance.
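Written out, the simplification just described (using the lecture's notation, with train denoting the training data set) is:

```latex
% Expected prediction error: expectation over training sets of generalization error.
\text{EPE} = \mathbb{E}_{\text{train}}\Big[\mathbb{E}_{x,y}\big[L\big(y,\, f_{\hat{w}(\text{train})}(x)\big)\big]\Big]

% Fixing the input at x_t and taking L to be squared error:
\text{EPE}(x_t) = \mathbb{E}_{\text{train},\, y_t}\Big[\big(y_t - f_{\hat{w}(\text{train})}(x_t)\big)^2\Big]
```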
So this is the definition of expected prediction error at xt that we had on the previous slide, under our assumption of squared error loss. What we can do is rewrite this equation as follows, where we've simply added and subtracted the true function, the true relationship between x and y, specifically at xt. And because we've simply added and subtracted the same quantity, nothing in this equation has changed as a result. But this allows us to complete our derivation. In particular, let me just switch colors quickly so that I can do a little aside here, one that's gonna be useful for following what I'm going through. For this little aside: if we take the expectation of some quantity (a + b) squared, then what I'm gonna get is the expectation of a squared plus 2ab plus b squared. Which is equal to the expectation of a squared plus, sorry, this is getting sloppy here, let me just rewrite this little term, plus two times the expectation of ab, plus the expectation of b squared. And this is simply using the linearity of expectation after I've gone through and expanded the square of a plus b. Okay, and in our case, I'll just write this here as a mapping: this is gonna be our a term, and this here is gonna be our b term. So the next line I'm writing uses this little identity, defining the first term as a and the second term as b. Now let me switch to the blue color, and in this case let me do one more thing which I think will be helpful. I'm going to define some shorthand, which I'll write in one other color. Just to be very clear here, for shorthand: yt I'm just gonna write as y, f sub w true I'm just gonna write as f, and f sub w hat of our training data I'm just gonna write as f hat. Okay, this will save me a lot of writing and you a lot of watching.
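In symbols, the add-and-subtract step, the aside, and the shorthand being introduced are:

```latex
% Add and subtract the true function at x_t; nothing changes:
y_t - f_{\hat{w}}(x_t)
  = \underbrace{\big(y_t - f_{w(\text{true})}(x_t)\big)}_{a}
  + \underbrace{\big(f_{w(\text{true})}(x_t) - f_{\hat{w}}(x_t)\big)}_{b}

% The aside: expand the square, then use linearity of expectation.
\mathbb{E}\big[(a+b)^2\big] = \mathbb{E}[a^2] + 2\,\mathbb{E}[ab] + \mathbb{E}[b^2]

% Shorthand for the rest of the derivation:
% y := y_t, \qquad f := f_{w(\text{true})}(x_t), \qquad \hat{f} := f_{\hat{w}(\text{train})}(x_t)
```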
Okay, so now that we've set the stage for this derivation, let's rewrite this term here. So we get the expectation over our training data set and our observation, and remember I'm writing yt just as y, and I'm going to get the first term squared. So I get y minus f, squared; that's my a squared term, this first term here. Then I'm gonna get two times the expectation of a times b, and let me again specify what the expectation is over: the expectation is over the training data set and the observation y. And when I multiply a times b, I get y minus f times f minus f hat. And then the final term is the expectation over my training set and the observation y of b squared, which is f minus f hat, squared. Okay, so now let's simplify this a bit. Does anything in this first term depend on my training set? Well, y is not a function of the training data, and f is not a function of the training data, that's the true function. So this expectation over the training set is not relevant for this first term here. And when I think about the expectation over y, well, what is this? This is the difference between my observation and the true function. And that, specifically, is epsilon. So this term here is epsilon squared. And epsilon has zero mean, so if I take the expectation of epsilon squared, that's just my variance from the world. That's sigma squared. Okay, so this first term results in sigma squared. Now let's look at this second term. You know what, I'm going to write this a little bit differently to make it very clear here. So I'll just say that this first term here is sigma squared, by definition. Okay, now let's look at this second term. And again, what is y minus f? Well, y minus f is this epsilon noise term, and our noise is completely independent of f or f hat. And so what that means is, if you take the expectation, I think I have some room to do it here.
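The three-term expansion just read out, with the first term simplified as described:

```latex
\text{EPE}(x_t)
  = \mathbb{E}_{\text{train},\,y}\big[(y-f)^2\big]
  + 2\,\mathbb{E}_{\text{train},\,y}\big[(y-f)(f-\hat{f})\big]
  + \mathbb{E}_{\text{train},\,y}\big[(f-\hat{f})^2\big]

% First term: neither y nor f depends on the training set, and
% y - f = \varepsilon, which has zero mean, so
\mathbb{E}\big[(y-f)^2\big] = \mathbb{E}\big[\varepsilon^2\big] = \sigma^2
```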
If I take the expectation of a times b, where a and b are independent random variables, then the expectation of a times b is equal to the expectation of a times the expectation of b. So, this is another little aside. And so what I'll get here is that this term is the expectation of epsilon times the expectation of f minus f hat. And what's the expectation of epsilon, my noise? It's zero. Remember, we said that again and again: we're assuming that epsilon is zero-mean noise, since any nonzero mean can be incorporated into f. This term is zero, so the result of this whole thing is going to be zero. We can ignore that second term. Now let's look at this last term, and this term, for this slide, I'm simply gonna call the mean squared error. This little equals sign with a triangle on top means something that I'm defining here. I'm defining this to be equal to something called the mean squared error, let me write that out in case you want to look it up later: mean squared error of f hat. Now that I've gone through and done that, I can say that the result of all this derivation is that I get a quantity sigma squared plus mean squared error of f hat. But so far we've said a million times that my expected prediction error at xt is sigma squared plus bias squared plus variance. So on the next slide, what we're gonna do is show how our mean squared error is exactly equal to bias squared plus variance. What I've done is I've started this slide by writing mean squared error of, remember, on the previous slide we were calling this f hat, that was our shorthand notation. And so mean squared error of f hat, according to the definition on the previous slide, is looking at the expectation of f minus f hat, squared. And I guess here I can mention, when I take this expectation over training data and my observation y: does the observation y appear anywhere in f minus f hat? No, so I can get rid of that y there.
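The two steps just described, written out:

```latex
% Second term: \varepsilon = y - f is independent of f - \hat{f}, and E[\varepsilon] = 0:
\mathbb{E}\big[\varepsilon\,(f-\hat{f})\big]
  = \mathbb{E}[\varepsilon]\;\mathbb{E}\big[f-\hat{f}\big] = 0

% Third term: define the mean squared error of \hat{f} at x_t
% (y does not appear, so the expectation over y drops):
\text{MSE}(\hat{f}) \triangleq \mathbb{E}_{\text{train}}\big[(f-\hat{f})^2\big]

% Putting the three terms together:
\text{EPE}(x_t) = \sigma^2 + \text{MSE}(\hat{f})
```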
If I look at this, I'm repeating it here on this next slide, where I have the expectation over my training data of my true function, which on the last slide I had been denoting simply as f, and the estimated function, which, let me be clear, inside this square I'd been denoting as f hat. And both of these quantities are evaluated specifically at xt. Again, let's go through expanding this, where in this case, when we rewrite this quantity in a way that's gonna be useful for this derivation, we're gonna add and subtract f sub w bar. And what is f sub w bar? Remember that it was the green dashed line in all those bias-variance plots. f sub w bar is the average over all possible training data sets, where for each training data set I get a specific fitted function, and I average all those fitted functions over those different training data sets. That's what results in f sub w bar. It's my average fit, for my specific model, averaging over my training data sets. And so for simplicity here, I'm gonna refer to f sub w bar as f bar. This is new notation on this slide, I guess I'll call it out again just to be clear, new shorthand notation, and this is just going to make things easier to write in these derivations here. Using that same trick of taking the expectation of (a plus b) squared, expanding the square, and then passing the expectation through, I'm going to do the same thing here. A new definition of a and b, but the same idea again. I'm gonna get the expectation over my training set of my first term squared, so I'm gonna get f minus f bar, squared; and then I'm gonna get two times the expectation over my training set of a times b, so that's gonna be f minus f bar times b, which is f bar minus f hat; and then the final term is the expectation of b squared, which in this case is f bar minus f hat, squared, and again this expectation is over the training sets.
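Writing out this second add-and-subtract expansion:

```latex
% Average fit over training data sets (the green dashed line):
\bar{f} := f_{\bar{w}}(x_t) = \mathbb{E}_{\text{train}}\big[\hat{f}\big]

% Add and subtract \bar{f}, then apply the same (a+b)^2 identity
% with a = f - \bar{f} and b = \bar{f} - \hat{f}:
\text{MSE}(\hat{f})
  = \mathbb{E}_{\text{train}}\big[(f-\bar{f})^2\big]
  + 2\,\mathbb{E}_{\text{train}}\big[(f-\bar{f})(\bar{f}-\hat{f})\big]
  + \mathbb{E}_{\text{train}}\big[(\bar{f}-\hat{f})^2\big]
```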
Now let's go through and talk about what each of these quantities is. The first thing is, let's just remember the definition of f bar, formally: it was my expectation over training data sets of f hat, my fitted function on a specific training data set. I've already taken the expectation over the training set here. f is the true relationship; f has nothing to do with the training data. This is a number. And this is the mean of a random variable, and it no longer has to do with the training data set either, since I've averaged over training data sets. So here there's really no expectation over training data sets; nothing is random in terms of the training data set for this first quantity. This first quantity is really simply f minus f bar, squared, and what is that? That's the difference between the true function and my average, my expected fit, specifically at xt, but squared. That is bias squared. That's by definition. So by definition, this is equal to bias squared of f hat. Okay. Now let's look at this second term, here and here. Again, f minus f bar, just like before, is not a function of the training data. So this is just like a scalar; it can come out of the expectation. So for this second term I can rewrite this as f minus f bar, well, let's keep the two there, times the expectation over my training data of f bar minus f hat. Okay. And now let's rewrite this term, and just pass the expectation through. And the first thing is, again, f bar is not a function of the training data, so the result of that is just f bar. And then I'm gonna get minus the expectation over my training data of f hat. So, what is this? This is the definition of f bar. This is taking my specific fit on a specific training data set at xt, and taking the expectation over all training data sets. That's exactly the definition of what f bar is, that average fit. So, this term here is equal to zero. Again, by definition.
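In symbols, the two observations just made:

```latex
% First term: f and \bar{f} are both deterministic, so the expectation drops:
\mathbb{E}_{\text{train}}\big[(f-\bar{f})^2\big]
  = (f-\bar{f})^2 = \text{bias}^2(\hat{f})

% Cross term: (f-\bar{f}) comes out of the expectation like a scalar, and
\mathbb{E}_{\text{train}}\big[\bar{f}-\hat{f}\big]
  = \bar{f} - \mathbb{E}_{\text{train}}\big[\hat{f}\big]
  = \bar{f} - \bar{f} = 0
```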
So, what we end up seeing is that this whole second term is gonna disappear, because we have some quantity times zero. Okay, that just leaves one more quantity to analyze, and that's this term here, where what I have is an expectation over a function minus its mean, squared. So, let me just write this in words. Note that I can equivalently write this as f hat minus f bar, squared; I hope it's clear that the negative sign there doesn't matter, since it gets squared, so they're exactly equivalent. And so what is this? This is a random function at xt, which is just a random variable, and this is its mean. And the definition of taking the expectation of some random variable minus its mean, squared, that's the definition of variance. So, this term is the variance of f hat. Okay, so now we can make our concluding statement about this mean squared error, where what we see is that the first term was equal to bias squared of f hat, the second term was zero, and this third term was variance of f hat. So, what we've shown on this slide is that mean squared error of f hat is equal to bias squared of f hat plus variance of f hat. And that's exactly what we were hoping to show, because now we can talk about putting it all together. What we see is that our expected prediction error at xt we derived to be equal to sigma squared plus mean squared error. And then we derived the fact that mean squared error is equal to bias squared plus variance. So we get the end result that our expected prediction error at xt is sigma squared plus bias squared plus variance, and this represents our three sources of error. And we've now completed our formal derivation of this. [MUSIC]
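If you'd like to sanity-check the decomposition numerically, here is a minimal Monte Carlo sketch in Python. The true function, noise level, model class, and target point below are illustrative choices of ours, not anything from the lecture; the check is that the empirical MSE of f hat at xt matches bias squared plus variance.

```python
import numpy as np

# Monte Carlo check of: MSE(f_hat) at x_t = bias^2 + variance,
# and hence EPE(x_t) = sigma^2 + bias^2 + variance.
# f_true, sigma, x_t, the polynomial degree, and sizes are all
# illustrative assumptions for this sketch.

rng = np.random.default_rng(0)

def f_true(x):
    return np.sin(2 * np.pi * x)          # true relationship f(x)

sigma = 0.3                               # noise std dev; sigma^2 is irreducible error
x_t = 0.5                                 # the target input x_t
n, degree, n_datasets = 30, 3, 5000       # training size, model complexity, # datasets

preds = np.empty(n_datasets)
for i in range(n_datasets):
    x = rng.uniform(0.0, 1.0, n)          # a fresh training data set each round
    y = f_true(x) + rng.normal(0.0, sigma, n)
    coefs = np.polyfit(x, y, degree)      # fitted function f_hat for this training set
    preds[i] = np.polyval(coefs, x_t)     # f_hat evaluated at x_t

f_bar = preds.mean()                                  # average fit at x_t
bias_sq = (f_true(x_t) - f_bar) ** 2                  # bias^2
variance = preds.var()                                # variance of f_hat at x_t
mse = np.mean((f_true(x_t) - preds) ** 2)             # MSE of f_hat at x_t

print("MSE               :", mse)
print("bias^2 + variance :", bias_sq + variance)
print("EPE estimate      :", sigma**2 + bias_sq + variance)
```

Note that MSE equaling bias squared plus variance here is an exact algebraic identity for the sample (it is the same complete-the-square argument as in the derivation), so the two printed quantities agree up to floating-point error, not just up to Monte Carlo error.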