Welcome to the first module of the image classification course. Here you'll try your hand at classifying images with traditional machine learning techniques that you've learned before, and look at what their limitations are when working with image datasets. We'll start with a brief introduction, covering the MNIST image dataset that you'll be using for part of this course. Then we'll tackle an image classification problem with a linear model in TensorFlow. After that, we'll move on to tackling the same problem using a deep neural network. Lastly, we'll close out with a discussion and application of dropout, a regularization technique for neural networks that helps prevent them from overfitting, or memorizing, our training dataset.

So, in this module you'll learn how image data is represented as floating point numbers that can then be flattened. Then you'll compare functions for modeling confidence in image classification, with a focus on Softmax. Then you'll train and evaluate a linear model for image classification using TensorFlow. After that, as you might guess, you'll do the same thing, except with a deep neural network. Lastly, you'll understand how to actually apply dropout as a regularization technique for deep neural networks.

So, here's our problem statement: you're running a local post office, and you need to quickly recognize handwritten digits on envelopes in order to route them to the correct mailing addresses and ZIP codes. Now, to create our models, we'll use a common dataset in computer vision: MNIST. MNIST is a dataset of handwritten black-and-white digits. It was created by mixing and modifying two of the original datasets from the National Institute of Standards and Technology; that's where you get the M in the name from: Modified NIST. The original datasets of handwritten digits came from US Census Bureau employees and American high-school students. They were mixed together and split to create 60,000 labeled training images of handwritten digits and an additional 10,000 for testing.

Each image is 28 pixels by 28 pixels, which is really small compared to the images your camera probably takes, and it represents a single handwritten digit from zero to nine. You'll also have the correct label of the image for your model to train and learn from. Additionally, the images are in grayscale; they're not colored beyond simple black, white, and gray. So it's a single channel for our depth. Remember those three channels we talked about before: red, green, and blue. Here we're just going to have one channel.

If we want to convert this 2-D image into a single-dimensional vector, what can we do? Well, we can flatten the image. That takes each row of pixel data and lines them up along a single row, end to end. So, a question for you: if we flatten a 28 by 28 image into a single long array of data, how long would that array be? Now, if you said 784, that's exactly right. Unstacking each of the 28 rows and lining them up end to end, with each row having 28 columns, gives us that 28 times 28 value, or 784 total elements in our array. Now, remember that the computer doesn't see images like we do. All the color intensities of each pixel are simply floating point numbers. So the actual array will look more like this, and that's going to be the input for our model. Our output is going to be the class of the image: whether it's a zero, or an eight, or a five, what have you.
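To make that concrete, here's a minimal sketch in NumPy of flattening one image; a randomly generated array stands in for a real MNIST digit here, purely for illustration:

```python
import numpy as np

# A stand-in for one MNIST digit: a 28x28 grid of grayscale
# pixel intensities, stored as floating point numbers.
image = np.random.uniform(0.0, 1.0, size=(28, 28)).astype(np.float32)

# Flatten the 2-D grid by laying the 28 rows end to end.
flattened = image.reshape(-1)

print(image.shape)      # (28, 28)
print(flattened.shape)  # (784,) -- 28 * 28 = 784 elements
```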
So, how many possible classes do you think the model is going to be choosing, or predicting, from for its output? If you said 10, that's exactly correct. We've got the digits from zero to nine, which is 10 possible output classes. Given that last answer, what type of task do you think we'll be performing? Is it binary classification? Linear regression? Multiclass classification? Or using a neural network? The correct answer is multiclass classification, simply because we have more than two output classes.

Now, that last example you saw before was very clearly a zero. What about these twos? If you saw these in isolation, would you be 100 percent certain that they were twos? We can also set up the architecture to output a probability for each handwritten digit you feed into it, so you can see the model's confidence in each classification prediction it makes. Because this is a classification task, we've chosen that the output of the model will be a 10-dimensional vector of numbers. To make the output more interpretable, we use a function that will render those numbers as probabilities.

We actually introduced this function back in the first specialization, so I want to test your memory: which function takes a vector of floats and converts the numbers to probabilities? Now, if you said the sigmoid function, no; you missed the fact that we have multiple classes. A sigmoid will take a logit and convert it into a probability, but that only works if you have just two classes. Here we've got 10 numbers, and the total of these probabilities has to equal one, which means the classes are exclusive. One of the most common ways, and the correct answer here, is to use a Softmax function.

What the Softmax function does is exponentiate its inputs and then normalize them. The exponentiation means that one more unit of evidence increases the weight given to any hypothesis multiplicatively. Conversely, having one less unit of evidence means a hypothesis gets a fraction of its earlier weight. Basically, it makes the high values higher and the low values lower, but keeps the relative order the same. In addition, as you can see here, the Softmax function normalizes the weights so they all add up to one when summed together, and this forms a valid probability distribution. To get more intuition about the Softmax function, check out the link in the resources section to Michael Nielsen's book, complete with an interactive visualization.

So, let's look at an example of Softmax. Here we take the input image that you see on the left, and after classification, the Softmax function works its magic on the outputs. The resulting vector on the right clearly shows which of its elements is the largest, or the max, but it retains the original relative order of the rest of the values, hence the soft. In this case, the max is the class six, which is the prediction, with a 70 percent chance. Now, keep in mind the model could be wrong in its classification, as it was in this case, where the input image was actually labeled as a five but the model predicted a six. That's likely because of that little handwritten loop at the bottom of the image.
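As a sketch of what Softmax computes, here's a minimal NumPy implementation; the logit values for the 10 digit classes are made up for illustration:

```python
import numpy as np

def softmax(logits):
    """Exponentiate the inputs, then normalize so the outputs sum to 1."""
    # Subtracting the max before exponentiating is a standard trick
    # for numerical stability; it doesn't change the result.
    exps = np.exp(logits - np.max(logits))
    return exps / np.sum(exps)

# Hypothetical logits for the 10 digit classes (0 through 9).
logits = np.array([1.2, 0.3, 0.5, 2.0, 0.1, 3.1, 4.0, 0.2, 1.5, 0.4])
probs = softmax(logits)

print(probs.sum())     # ~1.0 -- a valid probability distribution
print(probs.argmax())  # 6 -- the class with the highest probability
```

Note that the largest logit still maps to the largest probability, and the rest keep their relative order; the exponentiation just stretches the gap between them.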
Optimization requires that we have a number representing the quality of our solution so far. In machine learning, we call that the loss function. Because this is a classification task, we'll use the same loss function that we used for classification in the last specialization, which is cross-entropy. To avoid the numerical issues that come with taking the log of really, really tiny numbers, we'll use an optimized TensorFlow function called softmax cross-entropy with logits, version two.
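As a sketch of how that function is called, assuming the TensorFlow 1.x API this course uses, and with made-up logits and a one-hot label purely for illustration:

```python
import tensorflow as tf

# Hypothetical logits for one image, and its one-hot label
# (here the true class is 5).
logits = tf.constant([[1.2, 0.3, 0.5, 2.0, 0.1, 3.1, 4.0, 0.2, 1.5, 0.4]])
labels = tf.one_hot([5], depth=10)

# TensorFlow applies softmax and cross-entropy together in one
# numerically stable operation, so we never take the log of a
# tiny probability ourselves.
loss = tf.nn.softmax_cross_entropy_with_logits_v2(labels=labels,
                                                  logits=logits)

with tf.Session() as sess:
    print(sess.run(loss))  # one cross-entropy value per example
```

Passing the raw logits, rather than probabilities we computed with our own Softmax, is what lets TensorFlow do the combined computation safely.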