So by now we've gone through notions of the three sources of error at a level that's needed to be a practitioner in machine learning. But we know that some of you are interested in more of the technical underpinnings of these ideas, both in terms of mathematical formalisms and statistical understanding. And so what we've done is created this optional video that provides a much more technical definition of the three sources of error for those of you that are interested in this material. But we wanna highlight that this is completely optional, because this will be taught at a more technical level than what we're assuming in the rest of the specialization. So we mentioned that the training set is just a random sample of some N observations, in this case some N houses that were sold and recorded. But what if N other houses had been sold and recorded? How would our performance change? So for example, here in this picture we're showing one set of N observations that are used for training data, those are the blue circles, and we fit some quadratic function through this data. And here we show some other set of N observations, and we see that we get a different fit. And to assess the performance of each one of these fits, we can think about looking at generalization error. So in the first case we might get one generalization error of this specific fit, w hat 1, and in the second case we would get some different evaluation of generalization error, let's call it generalization error of w hat 2. But one thing that we might be interested in is, how do we perform on average for a training data set of N observations? Because imagine I'm trying to develop a tool that's gonna be used by real estate agents to form these types of predictions. Well, I'd like to design my tool, package it up and send it out there, and then a real estate agent might come in and have some set of observations of house sales from their neighborhood that they're using to make their predictions.
So that might be different than another real estate agent's. And what I'd like to know is, for a given amount of data, some training set of size N, how well should I expect the performance of this model to be, regardless of what specific training dataset I'm looking at? So in these cases what we'd like to do is average our performance over all possible fits that we might get. What I mean by that is all possible training data sets that might have appeared, and the resulting fits on those data sets. So formally, we're gonna define this thing called expected prediction error, which is the expected value of our generalization error over different training data sets. So very specifically, for a given training data set, we get parameters that are fit to that data set. So I'll call that w hat of training set. And then for that estimated model, I can evaluate my generalization error, and what the expected prediction error is doing is taking a weighted average over all possible training sets that I might have seen, where for each one I get a different set of estimated parameters, and thus a different notion of the generalization error. And to start analyzing this quantity of prediction error, let's specifically look at some target input xt, which might be a house with 2,640 square feet. And let's also take our loss function to be squared error. So in this case we're talking specifically about a target point xt. What we can do later, after we do the analysis specifically for xt, is think about averaging this over all possible xt's, over all square feet. But in some cases we might actually be interested in one region of our input space in particular. And then when we talk about using squared error in particular, this is gonna allow our analysis to follow through really nicely, as we're gonna show not in this video, but in our next even more in-depth video, which is also optional.
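To make this concrete, here's a minimal Monte Carlo sketch of expected prediction error at a target xt. Everything in it is hypothetical, not from the lecture: an assumed linear "true" function, an assumed noise level, and a simple linear fit via NumPy's polyfit standing in for the training procedure.

```python
import numpy as np

# Hypothetical setup: known true function plus Gaussian noise.
rng = np.random.default_rng(0)
f_true = lambda x: 1.0 + 2.0 * x      # stand-in for f sub w true
sigma = 0.5                           # noise standard deviation
xt = 0.7                              # target input
N, n_sets = 30, 2000                  # training-set size, number of simulated sets

errors = []
for _ in range(n_sets):
    x = rng.uniform(0.0, 1.0, N)
    y = f_true(x) + rng.normal(0.0, sigma, N)   # one training set drawn from the world
    w_hat = np.polyfit(x, y, deg=1)             # parameters fit to this specific set
    y_t = f_true(xt) + rng.normal(0.0, sigma)   # fresh noisy observation at xt
    errors.append((y_t - np.polyval(w_hat, xt)) ** 2)

# Averaging squared error at xt over many training sets approximates
# the expected prediction error at xt.
expected_prediction_error = float(np.mean(errors))
```

Because the fitted family here matches the true linear form, the result comes out close to sigma squared plus a small variance contribution, which previews the decomposition discussed next.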
But under these assumptions of looking specifically at xt and taking squared error as our measure of loss, you can show that the average prediction error at xt is simply the sum of three terms, which we're gonna go through: sigma squared, plus bias squared, plus variance. So these terms are yet to be defined, and this is what we're gonna walk through in this video in a much more formal way than we did in the previous set of slides. So let's start by talking about this first term, sigma squared, and what this is gonna represent is the noise we talked about in the earlier videos. So in particular, remember that we're saying that there's some true relationship between square feet and house value. That's just a relationship that exists out there in the world, and it's captured by f sub w true. But of course that doesn't fully capture how we think about the value of a house; there are other factors at play. And so all those other factors out there in the world are captured by our noise term, which here we write as just an additive term, plus epsilon. So epsilon is our noise, and we said that this noise term has zero mean, because if not, we could just shove that other component into f sub w true. But if we just make the assumption that epsilon has zero mean, then we can start talking about the spread of noise you're likely to see at any point in the input space. And that spread is called the variance. So we denote it by sigma squared, and sigma squared is the variance of this noise epsilon. And as we talked about before, this noise is just noise that's out there in the world; we have no control over it, no matter how complicated and interesting a model we specify, or what algorithm we use for fitting that model. We can't do anything about the fact that we're using x for our prediction; there's just inherently some noise in how our observations are generated in the world. So for this reason, this is called our irreducible error.
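A small sketch can show why sigma squared is irreducible: even if we could predict with the true function itself, the average squared error at xt would still be the noise variance. The true function, noise level, and target point below are hypothetical choices just to make the simulation run.

```python
import numpy as np

# Hypothetical world: true function plus zero-mean Gaussian noise.
rng = np.random.default_rng(1)
f_true = lambda x: 1.0 + 2.0 * x
sigma, xt, n_obs = 0.5, 0.7, 20000

# Many noisy observations at the same target input xt.
y_t = f_true(xt) + rng.normal(0.0, sigma, n_obs)

# Squared error of the PERFECT predictor f_true at xt: this is the floor
# no model choice can go below.
irreducible = float(np.mean((y_t - f_true(xt)) ** 2))  # close to sigma**2 = 0.25
```

No matter how the fitting procedure changes, this term stays; only the bias and variance terms respond to modeling choices.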
Because it's noise that we can't reduce through any choices that we have control over. So now let's talk about this second term, bias squared. And remember that when we talked about bias, this was a notion of how well our model could on average fit the true relationship between x and y. But now let's go through this at a much more formal level. In particular, let's just remember that there's some relationship between square feet and house value in our case, which is represented by this orange line. And then from this true world we get some data set that defines a training set, which are these blue circles. And using this training data we estimate our model parameters. Well, if we had gotten some other set of N points, we would have fit some other function. Now, I can look over all possible data sets of size N that I might have gotten, where this blue shaded region here represents the distribution over x and y, so how likely it is to get different combinations of x and y. And let's say I draw N points from this joint distribution over x and y, and over all possible values I look at an estimated function. So for example, here are the two estimated functions from the previous slide, for those example data sets that I showed. But of course there's a whole continuum of estimated functions that I get for different training sets of size N. Then when I average these estimated functions, these specific fits, over all my possible training data sets, what I get is my average fit. So now let's talk about this a little bit more formally. We had already presented this in our previous video, this f sub w bar. But now let's define it. This is the expectation of a specific fit on a specific training data set, or let me rephrase that, the fit I get on a specific training data set, averaged over all possible training data sets of size N that I might get. So that is the formal definition of this f sub w bar, what we have been calling our average fit.
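The average fit f sub w bar can be approximated numerically by fitting many simulated training sets and averaging the resulting predictions at xt. The quadratic true function, noise level, and sample sizes below are hypothetical stand-ins, not values from the lecture.

```python
import numpy as np

# Hypothetical quadratic "true world" plus Gaussian noise.
rng = np.random.default_rng(2)
f_true = lambda x: 1.0 + 2.0 * x - 1.5 * x ** 2
sigma, N, n_sets, xt = 0.4, 25, 3000, 0.6

fits_at_xt = []
for _ in range(n_sets):
    x = rng.uniform(0.0, 1.0, N)
    y = f_true(x) + rng.normal(0.0, sigma, N)
    w_hat = np.polyfit(x, y, deg=2)          # quadratic fit on this training set
    fits_at_xt.append(np.polyval(w_hat, xt)) # that specific fit evaluated at xt

# Averaging the dataset-specific fits approximates f sub w bar at xt.
f_w_bar_at_xt = float(np.mean(fits_at_xt))

# Bias at xt: true relationship at xt minus the average fit at xt.
bias_at_xt = f_true(xt) - f_w_bar_at_xt
```

Because the quadratic model family contains the true function here, the bias comes out near zero; refitting with `deg=0`, say, would show a clearly nonzero bias at most target points.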
And what we're talking about when we're talking about bias is comparing this average fit to the true relationship. And here, remember again, we're focusing specifically on some target xt. And so the bias at xt is the difference between the true relationship between xt and y, so between a given square footage and the house value, whatever the true relationship is between that input and the observation, versus this average relationship estimated over all possible training data sets. So that is the formal notion of bias at xt, and let's just remember that when it comes in as our error term, we're looking at bias squared. So that's the second term. Now let's turn to this third term, which is variance. And let's go through this definition where again we're interested in this average fit f sub w bar, this green dashed line. But that really isn't the quantity of interest; it's just gonna be used in our definition here. The thing that we're really interested in is, over all possible fits we might see, how much do they deviate from this expected fit? So thinking again specifically at our target xt, how much variation is there in the training-dataset-specific fits across all training datasets we might see? And that's this variance term, and now, again, let's define it very formally. Well, let me first state what variance is in general. So the variance of some random variable is simply the expected value of that random variable minus its mean, squared. So in this context, when we're looking at the variability of these functions at xt, we're taking the expectation, and our random quantity is our estimated function for a specific training data set, evaluated at xt. And then what's the mean of that random quantity? The mean is this average fit, this f sub w bar. So we're looking at the difference between the fit on a specific training dataset and what I expect on average over all possible training datasets.
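The variance term can be estimated the same way, as the spread of dataset-specific fits around their own average at xt. Again the setup below is hypothetical: an assumed linear true function with Gaussian noise, and a linear fit standing in for the learning procedure.

```python
import numpy as np

# Hypothetical world and fitting procedure.
rng = np.random.default_rng(3)
f_true = lambda x: 1.0 + 2.0 * x
sigma, N, n_sets, xt = 0.5, 30, 3000, 0.7

fits_at_xt = []
for _ in range(n_sets):
    x = rng.uniform(0.0, 1.0, N)
    y = f_true(x) + rng.normal(0.0, sigma, N)
    fits_at_xt.append(np.polyval(np.polyfit(x, y, deg=1), xt))
fits_at_xt = np.array(fits_at_xt)

# Variance at xt: expected squared deviation of a dataset-specific fit
# from the average fit f_w_bar, both evaluated at xt.
variance_at_xt = float(np.mean((fits_at_xt - fits_at_xt.mean()) ** 2))
```

Note this is much smaller than sigma squared here: with N = 30 points per training set, a low-complexity fit doesn't swing around much from one dataset to another.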
I look at that quantity squared, and what is my expectation taken over? Sorry, let me just mention that this quantity, when I take the square, represents a notion of how much deviation a specific fit has from the expected fit at xt. And then when I think about what the expectation is taken over, it's taken over all possible training data sets of size N. So that's my variance term. And when we think intuitively about why it makes sense that we have the sum of these three terms in this specific form, well, what we're saying is variance is telling us how much the specific function that I'm using for prediction can vary. I'm just gonna use one of these functions for prediction. I get a training dataset that gives me an f sub w hat, and I'm using that for prediction. Well, how much can that deviate from my expected fit over all datasets I might have seen? So again, going back to our analogy: I'm a real estate agent, I grab my dataset, I fit a specific function to that training data, and I wanna know, well, how wild of a fit could this be relative to what I might have seen on average over all possible datasets that all these other realtors are using out there? And of course, if the function can vary dramatically from one realtor to another looking at different data sets, that can be a source of error in our predictions. But another source of error, which the bias is capturing, is that over all these possible datasets, all these possible realtors, if this average function just can never capture anything close to the true relationship between square feet and house value, then we can't hope to get good predictions either, and that's what our bias is capturing. And why are we looking at bias squared? Well, that's putting it on an equal footing with the variance term, because remember, bias was just the difference between the true value and our expected value, but the variance term is looking at these types of quantities squared.
So that's intuitively why we get bias squared. And then finally, what's our third source of error? Well, let's say I have no variance in my estimator, or always very low variance, and the model happens to be a very good fit, so neither of these things are sources of error; I'm doing basically magically perfectly on my modeling side. Well, still, inherently there's noise in the data. There are things that just trying to form predictions from square feet alone can't capture. And so that's where the irreducible error, this sigma squared, is coming through. And so intuitively, this is why our prediction error is the sum of these three different terms that now we've defined much more formally.
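As a closing sanity check, the whole decomposition can be verified numerically: the Monte Carlo average of squared prediction error at xt should match sigma squared plus bias squared plus variance. The setup is hypothetical and deliberately misspecified (a linear fit to a sinusoidal truth) so that the bias term is clearly nonzero.

```python
import numpy as np

# Hypothetical world: sinusoidal truth, Gaussian noise, linear fits.
rng = np.random.default_rng(4)
f_true = lambda x: np.sin(2.0 * np.pi * x)
sigma, N, n_sets, xt = 0.3, 20, 5000, 0.25

preds, errors = [], []
for _ in range(n_sets):
    x = rng.uniform(0.0, 1.0, N)
    y = f_true(x) + rng.normal(0.0, sigma, N)
    pred = np.polyval(np.polyfit(x, y, deg=1), xt)  # this dataset's fit at xt
    preds.append(pred)
    y_t = f_true(xt) + rng.normal(0.0, sigma)       # fresh observation at xt
    errors.append((y_t - pred) ** 2)

preds = np.array(preds)
bias = float(f_true(xt)) - preds.mean()   # truth minus average fit
variance = preds.var()                    # spread of fits around their mean

# The three-term decomposition versus the directly simulated error.
decomposed = sigma ** 2 + bias ** 2 + variance
mc_error = float(np.mean(errors))
```

Up to Monte Carlo noise, `decomposed` and `mc_error` agree, and the misspecified linear family makes bias squared the dominant term here.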