Previously, you've seen confusion matrices as a method for evaluating classification models. In this video, you'll learn how to construct a confusion matrix from which you'll derive four key evaluation metrics: recall, fallout, precision, and accuracy. Next, you'll see why it's necessary to consider trade-offs among these metrics when evaluating classification model performance. Finally, you'll work with Receiver Operating Characteristic curves to systematically evaluate the trade-off between recall and fallout.

Let's get started with a simple example of binary classification. Recall that a binary classification model uses a decision boundary to separate two classes of data, in this case, positives and negatives. Although in general the decision boundary does not have to be a straight line, this video focuses on this simple case to illustrate the concepts. Data points on one side of the decision boundary are predicted to be positive, and data points on the other side are predicted to be negative. There are 12 data points in the positive class, and this model correctly identifies nine of them. These are the true positives, which you'll see abbreviated as TP. The remaining three positives, which were classified as negatives, are the false negatives. Similarly, there are 12 data points in the negative class. The 10 that were correctly classified are the true negatives. Finally, the two negatives incorrectly classified as positive are the false positives.

A common way to evaluate these four quantities together is with a confusion matrix. The true values are grouped by row, and the predicted values are grouped by column. Each quadrant in the confusion matrix represents the corresponding number of true positives, true negatives, false positives, and false negatives. These four quantities form the basis of all the classification evaluation methods in this video. The total number of actual positives equals the sum of the true positives and false negatives. Similarly, the total number of actual negatives equals false positives plus true negatives. The total number of predicted positives equals true positives plus false positives, and the total number of predicted negatives equals false negatives plus true negatives.

The four key metrics mentioned earlier in this video, recall, fallout, precision, and accuracy, can all be calculated from here. Let's take a closer look at each of them individually before considering the trade-offs among them. Recall tells you how much of a given class is correctly identified by a model. Here, that is the ratio of the data correctly classified as positive to the total positive data. Fallout tells you how many false alarms the model generates for a given class. For binary classification, this is the ratio of data incorrectly classified as positive to the total negative data. Precision represents the fraction of classifications into a given class that were correct, which is the ratio of correctly classified positive data to the total data classified as positive. Accuracy is the overall rate at which a model correctly classifies data. It is the ratio of all correct predictions to the total number of predictions, in other words, how often the model got its predictions right overall. If the classification is perfect, the accuracy, precision, and recall will be one, and the fallout will be zero. Simple, right? Unfortunately, most of the time you will not be able to achieve this kind of ideal. Instead, you'll have to consider trade-offs in the context of what the model needs to do for its intended application.
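To make these definitions concrete, here is a minimal sketch in Python that computes the four metrics directly from the confusion-matrix counts of this example. The variable names tp, fn, tn, and fp are just shorthand for the quantities described above.

```python
# Confusion-matrix counts from the example: 12 actual positives, 12 actual negatives.
tp, fn = 9, 3    # 9 positives correctly identified, 3 missed
tn, fp = 10, 2   # 10 negatives correctly identified, 2 false alarms

# Recall: fraction of the actual positives the model correctly identifies.
recall = tp / (tp + fn)                       # 9 / 12 = 0.75

# Fallout: fraction of the actual negatives incorrectly flagged as positive.
fallout = fp / (fp + tn)                      # 2 / 12 ≈ 0.17

# Precision: fraction of the predicted positives that really are positive.
precision = tp / (tp + fp)                    # 9 / 11 ≈ 0.82

# Accuracy: fraction of all predictions that are correct.
accuracy = (tp + tn) / (tp + tn + fp + fn)    # 19 / 24 ≈ 0.79

print(f"recall={recall:.2f}, fallout={fallout:.2f}, "
      f"precision={precision:.2f}, accuracy={accuracy:.2f}")
```

These are exactly the percentages you'll see discussed next.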
To see this, let's return to the example. The model has an accuracy of 79 percent, a recall of 75 percent, a precision of 82 percent, and a fallout of 17 percent. Would this be acceptable performance for a real-life situation? In a situation like cancer detection, a false positive may cause temporary anxiety for a patient, but a false negative, in other words, a missed positive, could have potentially deadly results. Here, you may want to maximize recall to ensure you don't miss any positives. You have to be careful though. By changing the decision boundary, you can make recall 100 percent. However, accuracy will decrease to 71 percent, and worse yet, the new false positives will drop precision down to 63 percent and increase the fallout to 58 percent.

As another example, imagine you're training an image classifier to identify overripe fruit. Your initial goal might be to reduce the number of fruit incorrectly identified as overripe. In this case, reducing fallout will lower the number of false positives generated by your model. Again though, be careful. If you change the decision boundary to make fallout zero percent, the accuracy will be 71 percent. However, this will increase false negatives so that your recall drops down to 42 percent. In other words, your model would let 58 percent of bad fruit through.

In real-life scenarios, there will rarely be a decision boundary that perfectly separates the positives and negatives. You need to consider the inherent trade-offs to build an acceptable model. Based on what you've seen so far, it may be tempting to assume that while the other metrics may pull the model in contradictory directions, aiming for high accuracy will strike a good balance among them all. This works sometimes, but yet again, you have to be careful. If your data has a class imbalance, for example, a large majority of the data is negative, then a model that predicts nearly every data point, if not every one, as negative will have high accuracy, and even high precision and low fallout. However, the recall will suffer. What about when you have a large majority of positives? In this case, a model that predicts almost all data, if not all of it, as positive will again have high accuracy. What happens to the other metrics?

At this point, you may correctly suspect that it is a good idea to always look at multiple metrics when evaluating a model, for example, both recall and fallout. An effective way to evaluate a model using both recall and fallout is to construct the ROC curve and determine the area under the curve, or AUC. ROC stands for Receiver Operating Characteristic, a term that comes from radar engineering, where it was first developed and used. The ROC curve represents recall and fallout for a model as functions of a threshold parameter that varies from 0 to 1, moving the decision boundary through the range of data. The area under the curve also ranges from 0 to 1, with better models being closer to one. Now let's see what's behind all of this.

First, let's talk about the threshold parameter. Many binary classification models assign each data point a confidence score, or probability of being positive, ranging from 0 to 1, which can be plotted as contours. As you have seen, the decision threshold can vary throughout this range, affecting both the recall and fallout metrics as it changes. Vary the threshold from 0 to 1, and calculate the recall and fallout at each step. These calculated recall and fallout values are used to construct the ROC curve. Notice that ROC curves always start from the upper right corner.
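Before looking at why, here is a minimal sketch of such a threshold sweep in Python, assuming NumPy is available. The labels and scores below are hypothetical stand-ins for a model's output, not the data from this example; the point is only to show how recall and fallout are recomputed at each threshold, and how plotting recall against fallout over the sweep traces the ROC curve.

```python
import numpy as np

# Hypothetical ground-truth labels (1 = positive, 0 = negative) and model
# confidence scores for the positive class.
y_true  = np.array([1, 1, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0])
y_score = np.array([0.95, 0.90, 0.85, 0.80, 0.70, 0.65,
                    0.55, 0.45, 0.40, 0.30, 0.20, 0.10])

recalls, fallouts = [], []
for threshold in np.linspace(0.0, 1.0, 101):
    # Everything with a score at or above the threshold is predicted positive.
    y_pred = (y_score >= threshold).astype(int)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    tn = np.sum((y_pred == 0) & (y_true == 0))
    recalls.append(tp / (tp + fn))     # recall at this threshold
    fallouts.append(fp / (fp + tn))    # fallout at this threshold

# The (fallout, recall) pairs are the points of the ROC curve.
```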
The curve starts in the upper right corner because, for a threshold of zero, all data will be classified as positive, so both recall and fallout will be one. As the threshold value progresses through the data, the recall and fallout will change accordingly until the threshold reaches one, where all data will be classified as negative, making both recall and fallout zero. In practice, the ROC curve is often constructed with finer granularity. In general, different models will have different ROC curves and therefore different areas under them. An AUC value close to one indicates that the model can achieve high recall while still maintaining low fallout. As the area under the curve decreases, the amount of fallout necessary to achieve high recall increases. At an area of 0.5, the model is essentially making random predictions. Below this, the model is actually better at predicting incorrectly than correctly. The ROC curve is useful for helping you determine where to set a decision boundary by visualizing how recall and fallout are related.

Let's summarize. In this video, you learned how to construct a confusion matrix and derive key metrics for evaluating classification models. You saw how evaluating a classification model requires considering trade-offs, and how to work with Receiver Operating Characteristic curves to systematically evaluate the trade-off between two metrics, recall and fallout.
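As a final, optional check on the AUC interpretation above, here is a minimal sketch, assuming NumPy and scikit-learn are available and using synthetic labels and scores: a model whose scores are unrelated to the labels lands near an AUC of 0.5, while a model that separates the classes well gets close to one.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

# 1,000 synthetic labels: half positive, half negative.
y_true = np.array([1] * 500 + [0] * 500)

# A "random" model: scores have nothing to do with the labels, so AUC hovers near 0.5.
random_scores = rng.random(1000)

# A "good" model: positives tend to get higher scores than negatives, so AUC is near 1.
good_scores = np.where(y_true == 1,
                       rng.normal(0.8, 0.1, 1000),
                       rng.normal(0.2, 0.1, 1000))

print("random model AUC:", round(roc_auc_score(y_true, random_scores), 3))  # roughly 0.5
print("good model AUC:  ", round(roc_auc_score(y_true, good_scores), 3))    # close to 1.0
```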