Clearly, just because you've done everything right and trained a model doesn't necessarily mean it's going to work. Hopefully, that's obvious. When we define many of these concepts, we do so in a hypothetical or optimal situation. You can create a nicely labeled training dataset and then train your machine learning model, but as you can probably expect, that doesn't always work out the way it's supposed to. There are many issues that can come up. But first, we have to understand how to evaluate and test the model that we trained.

As we now know, machine learning models operate by first getting a rough set of weights that fit the training patterns in a general way, and then progressively learning towards a set of weights that fit the training data well. But if learning goes too far down this path, we can reach a set of weights that fits the idiosyncrasies of the training data so perfectly that the model will only work well for the training data, and not for the test set or any other data. So if it's allowed to learn for too long, the model will become more prone to overfitting. In order to get a good fit, we must stop just before the point where the error on the validation dataset starts increasing. At that point, the model is said to have good skill on the training dataset as well as on our unseen validation dataset. Going past that point is not useful, and that's what we call overfitting. Again, overfitting happens when the model fits more than just the useful signal in the features of a dataset and also begins to memorize the random fluctuations, anomalies, and noise that might be present in the training dataset. That will harm the performance of the system on any new data.

Another thing to keep in mind about overfitting relates to the choice of algorithm for your dataset or solution. A very complex model, for example a very deep neural network that can accommodate a high number of feature weights and uses a high number of predictor variables, will frequently experience overfitting, especially on small datasets. As a result, these very complex models perform really well on training datasets, but only because they've simply memorized the noise together with the signal in the training data, and then they don't really work well on new datasets.

Now, if there's a term called overfitting, you can correctly assume there's also a term called underfitting. A statistical model or machine learning algorithm suffers from underfitting when it is unable to adequately capture the underlying structure of the data. In other words, an underfit model is unable to obtain a good fit to the trends in the data. We'll be able to see this through poor performance on the training data. As you can probably expect, an underfit model will also perform poorly on any new data, and it's overall undesirable. Most of the time, underfitting happens when the model or algorithm is too simple to fit the more complex trends in the training data. Since the solution is relatively simple in most cases (try a more complex model or algorithm), underfitting typically gets less time than overfitting in discussions about debugging. Appropriate fitting is the goal, and one of the major challenges that machine learning practitioners often spend a lot of time on in practice is how to tweak hyperparameters and algorithmic design choices in order to hit the sweet spot of appropriate fitting.
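To make both failure modes concrete, here's a minimal sketch, assuming NumPy and scikit-learn are available (the dataset and the choice of degrees are purely illustrative, not from the lesson itself). It fits polynomials of three different degrees to the same small, noisy dataset: degree 1 is too simple to capture the trend, while degree 15 has enough capacity to memorize the noise.

```python
# Sketch: underfitting vs. overfitting with polynomial regression.
# Assumes NumPy and scikit-learn are installed; all values are illustrative.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)

# Small, noisy dataset: a cubic trend plus Gaussian noise.
x_train = rng.uniform(-3, 3, size=30)
y_train = x_train**3 - 2 * x_train + rng.normal(scale=3.0, size=30)
x_val = rng.uniform(-3, 3, size=30)
y_val = x_val**3 - 2 * x_val + rng.normal(scale=3.0, size=30)

for degree in (1, 3, 15):  # too simple, about right, too complex
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(x_train.reshape(-1, 1), y_train)
    train_mse = mean_squared_error(y_train, model.predict(x_train.reshape(-1, 1)))
    val_mse = mean_squared_error(y_val, model.predict(x_val.reshape(-1, 1)))
    print(f"degree {degree:2d}: train MSE {train_mse:8.3f}, val MSE {val_mse:8.3f}")
```

With enough parameters, the degree-15 model can drive the training error far below the noise floor while its validation error balloons, which is exactly the memorization behavior described above, while the degree-1 line stays poor on both sets.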
So how do we go about achieving this balance between the opposing forces of overfitting and underfitting? First, we need a tool to monitor the learning process. Specifically, we need a learning curve. Learning curves are one of the most common tools that machine learning practitioners use to monitor algorithms that learn incrementally from training data. They are essentially plots of model performance over time, as the model incrementally learns. We can use learning curves to diagnose problems such as underfitting or overfitting, as well as to sanity-check or debug our code and implementations.

In machine learning, since we typically use the concept of loss as an indication of how well a model is doing, a common type of learning curve plots the model's loss over time, or as the model is exposed to more and more training data. Typically, the right combination of model architecture and hyperparameters will lead to large decreases in loss initially, which get smaller over time as the model gets closer and closer to the optimal weights. With some loss functions, though, we can never observe exactly zero loss. Usually, a small loss, such as one close to zero, indicates that the training dataset was learned close to perfectly and few if any mistakes were made.

Now, we've just seen learning curves plotted for the training data, but we can also evaluate and plot these for validation sets that are not part of the training data. We will soon see that plotting both curves can be very useful for debugging. Note that the training curve is much more jittery than the validation curve. This is because as we incrementally expose the model to examples from the training set and update the model parameters (remember our discussion of the gradient descent algorithm earlier), we compute a loss for each incremental batch of examples as part of the gradient descent process. This loss, since it is only computed on a batch of examples, is an estimate of the loss over the entire training set. But it comes for free and does not take extra time to compute, so we usually visualize it in our learning curve. Computing the loss over an entire large training set often takes a lot of time, so we can't afford to do it very frequently. Since we don't get the validation loss for free during training, we have to compute it periodically, for example after every sweep through the training data, which is called an epoch. Fortunately, the validation dataset is usually smaller than the training dataset, so it takes less time.

Now that we're familiar with the concept of learning curves, let's see how we can perform some debugging with these in practice. Let's start with underfitting. The loss curve for an underfit model usually looks something like this, where both the training and the validation loss curves don't really decrease very much, and instead stay relatively flat at a high loss value. This indicates that the model is unable to learn from the training dataset to reduce its loss. Even if you just see a flat training loss curve, without necessarily plotting the validation loss curve alongside it, you can pretty much suspect underfitting is happening. Sometimes the curves may also show oscillating, noisy values, but these will still be centered around a high loss value without any significant downward trend.
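Before we move on to overfitting, here's a rough illustration of how these two curves are typically collected in the first place. This is a hedged sketch in plain NumPy rather than any particular library's API: a mini-batch gradient descent loop on a linear model that records the essentially free per-batch training loss at every step, and the more expensive validation loss only once per epoch.

```python
# Sketch: recording learning curves during mini-batch gradient descent.
# Plain NumPy linear regression; all names and constants are illustrative.
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 5))
true_w = np.array([1.5, -2.0, 0.5, 0.0, 3.0])
y = X @ true_w + rng.normal(scale=0.1, size=1000)

X_train, y_train = X[:800], y[:800]
X_val, y_val = X[800:], y[800:]

w = np.zeros(5)
lr, batch_size, epochs = 0.05, 32, 20
train_curve, val_curve = [], []  # per-batch vs. per-epoch losses

for epoch in range(epochs):
    order = rng.permutation(len(X_train))
    for start in range(0, len(X_train), batch_size):
        idx = order[start:start + batch_size]
        Xb, yb = X_train[idx], y_train[idx]
        err = Xb @ w - yb
        train_curve.append(np.mean(err**2))    # free: computed on the batch anyway
        w -= lr * (2 / len(idx)) * Xb.T @ err  # gradient descent update
    # Validation loss: computed once per sweep (epoch), since it costs a full pass.
    val_curve.append(np.mean((X_val @ w - y_val) ** 2))

print(f"final batch loss {train_curve[-1]:.4f}, final val loss {val_curve[-1]:.4f}")
```

Because each entry in train_curve comes from a different 32-example batch, that curve will be jittery; each entry in val_curve averages over the whole held-out set, so it traces a much smoother line, which is exactly the contrast described above.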
Now let's look at overfitting. The loss curve for an overfit model usually looks something like this: the training loss continues to decrease over time, but the validation loss decreases to a point and then all of a sudden begins to increase again. Why is this happening? Well, if we think back to our discussion of overfitting, this behavior starts to make sense. The model is fitting extremely well to the training data, and so the training loss is decreasing. But the generalization performance of that model on a validation set, which it hasn't seen before, starts to get worse. So basically, the model is learning to fit random fluctuations and noise in the training data, or memorizing the training data outright, and it continues to decrease the training loss, as you would expect. But this happens at the expense of the validation loss, which, once overfitting sets in, starts to increase again. So diagnosing overfitting requires inspecting both the training and the validation curves together.

A good fit is our goal when training machine learning models. It occurs at the sweet spot where the model is neither underfitting nor overfitting. When we have a good fit, the training and validation losses will both decrease quickly at first and then more slowly over time, until they reach a point of stability, or in other words, they converge. People also refer to this as reaching a plateau. Ideally, the training and validation loss curves will plateau at similar values, but you'll often see a small gap between the two, where the training curve converges to a lower loss value than the validation curve. This gap between the training loss and the validation loss is referred to as the generalization gap. Intuitively, we can expect this to happen because the model is directly optimizing to perform well on the training set. Now remember, we also mentioned earlier that continuing to train a model that's at a good-fit point can turn it into an overfit model. In terms of loss curves, we can think of this as the training loss continuing to go down gradually while the validation loss starts to go in the opposite direction and increase, the classic sign of overfitting.

In addition to plotting learning curves of the loss to monitor and debug the model training process, machine learning practitioners often plot learning curves for other metrics as well. A common second type of learning curve to visualize during training is a plot of the final performance metric, for example accuracy. This is useful for getting a sense of actual model performance: lower loss usually corresponds to better performance, but the loss alone doesn't tell us whether the accuracy is at a level we're happy with or not. Performance curves, such as accuracy curves, can also provide additional information to complement our analysis when we diagnose issues from loss curves. Now we've seen how to interpret learning curves to monitor the learning process and diagnose problems, including the most common problems of overfitting and underfitting, but we also need to know how to address these problems once we find them. We'll talk more about how to handle these different scenarios next.
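As a small preview, one remedy falls directly out of the loss-curve picture: early stopping, where we watch the validation loss each epoch and halt once it has stopped improving for a while. Here's a minimal sketch; train_one_epoch and validation_loss are hypothetical stand-ins for whatever your training framework provides, and the patience value is an arbitrary illustrative choice.

```python
# Sketch: early stopping on the validation loss curve.
# train_one_epoch() and validation_loss() are hypothetical stand-ins for
# whatever your training framework provides.
import copy

def early_stopping_train(model, train_one_epoch, validation_loss,
                         max_epochs=100, patience=5):
    best_val = float("inf")
    best_model = None
    epochs_without_improvement = 0
    for epoch in range(max_epochs):
        train_one_epoch(model)                 # training loss keeps falling...
        val = validation_loss(model)           # ...but we watch this curve instead
        if val < best_val:
            best_val = val
            best_model = copy.deepcopy(model)  # snapshot at the best point so far
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
        if epochs_without_improvement >= patience:
            # Validation loss has turned around: the classic overfitting sign.
            break
    return best_model, best_val
```

Returning the snapshot taken at the validation minimum, rather than the final weights, is what keeps the model near the sweet spot even though training ran a few epochs past it.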