0:00

In the last video, you learned about the softmax activation function. In this video, you'll deepen your understanding of softmax classification, and also learn how to train a model that uses a softmax layer.

Recall our earlier example where the output layer computes z[L] as follows. So we have four classes, C = 4, so z[L] is a (4,1) dimensional vector, and we said we compute t, which is a temporary variable that performs element-wise exponentiation. And then finally, if the activation function for your output layer, g[L], is the softmax activation function, then your output is this: it's basically taking the temporary variable t and normalizing it to sum to 1. So this then becomes a[L]. So you notice that in the z vector the biggest element was 5, and the biggest probability ends up being this first probability.
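That computation can be sketched in plain Python; the specific z values here (with a largest entry of 5, as in the example) are illustrative:

```python
import math

def softmax(z):
    """Exponentiate element-wise, then normalize so the outputs sum to 1."""
    t = [math.exp(v) for v in z]      # temporary variable t = e^z
    total = sum(t)
    return [v / total for v in t]     # a[L] = t / sum(t)

z = [5.0, 2.0, -1.0, 3.0]             # z[L] for C = 4 classes (illustrative)
a = softmax(z)                        # probabilities summing to 1; the biggest
                                      # z value (5) gets the biggest probability
```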

The name softmax comes from contrasting it with what's called a hard max, which would have taken the vector z and mapped it to this vector. So the hard max function looks at the elements of z and just puts a 1 in the position of the biggest element of z, and then 0s everywhere else. And so this is a very "hard" max, where the biggest element gets an output of 1 and everything else gets an output of 0. Whereas in contrast, softmax is a more gentle mapping from z to these probabilities. So I'm not sure this is a great name, but at least that was the intuition behind why we call it softmax, in contrast to the hard max.
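For contrast, a hard max could be sketched like this (the z values are illustrative):

```python
def hardmax(z):
    """Put a 1 in the position of the biggest element of z, 0s everywhere else."""
    i_max = z.index(max(z))
    return [1 if i == i_max else 0 for i in range(len(z))]

hard = hardmax([5.0, 2.0, -1.0, 3.0])   # -> [1, 0, 0, 0]
```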

1:43

And one thing I didn't really show, but had alluded to, is that softmax regression, or the softmax activation function, generalizes the logistic activation function to C classes rather than just two classes. And it turns out that if C = 2, then softmax with C = 2 essentially reduces to logistic regression.

And I'm not going to prove this in this video, but the rough outline for the proof is that if C = 2 and you apply softmax, then the output layer a[L] will output two numbers, so maybe it outputs 0.842 and 0.158. And these two numbers always have to sum to 1, so they're actually redundant: you don't really need to bother computing both of them, you just need to compute one of them. And it turns out that the way you end up computing that one number reduces to the way that logistic regression computes its single output. So that wasn't much of a proof, but the takeaway is that softmax regression is a generalization of logistic regression to more than two classes.
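That reduction can also be checked numerically: with C = 2, the first softmax output is exactly the logistic (sigmoid) function applied to the difference of the two logits. A small sketch, with illustrative z values:

```python
import math

def softmax_first(z1, z2):
    """First of the two (redundant) softmax outputs when C = 2."""
    t1, t2 = math.exp(z1), math.exp(z2)
    return t1 / (t1 + t2)

def sigmoid(x):
    """The logistic activation function."""
    return 1.0 / (1.0 + math.exp(-x))

# e^z1 / (e^z1 + e^z2) = 1 / (1 + e^-(z1 - z2)) = sigmoid(z1 - z2),
# so the two-class softmax collapses to a single logistic output.
p = softmax_first(1.7, 0.0)
q = sigmoid(1.7 - 0.0)
```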

Now let's look at how you would actually train a neural network with a softmax output layer. So in particular, let's define the loss function you use to train your neural network. Let's take an example in your training set where the target output, the ground truth label, is [0, 1, 0, 0].

So, from the example in the previous video, this means that this is an image of a cat, because it falls into class 1. And now let's say that your neural network is currently outputting y hat, a vector of probabilities that sum to 1: [0.3, 0.2, 0.1, 0.4], and you can check that this sums to 1, and this is going to be a[L]. So the neural network's not doing very well in this example, because this is actually a cat and it assigned only a 20% chance that this is a cat.

3:52

So what's the loss function you would want to use to train this neural network? In softmax classification, the loss we typically use is the negative sum from j = 1 through 4, and it's really a sum from 1 to C in the general case, we're just going to use 4 here, of yj log y hat j.

So let's look at our single example above to better understand what happens. Notice that in this example, y1 = y3 = y4 = 0, and only y2 = 1. So if you look at this summation, all of the terms with zero values of yj are equal to 0, and the only term you're left with is -y2 log y hat 2, because as we sum over the indices j, all the terms end up 0 except when j is equal to 2. And because y2 = 1, this is just -log y hat 2.

So what this means is that if your learning algorithm is trying to make this small, because you use gradient descent to try to reduce the loss on your training set, then the only way to make the loss small is to make -log y hat 2 small, and the only way to do that is to make y hat 2 as big as possible.
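On the single example above, this loss can be sketched directly (y = [0, 1, 0, 0] and y hat = [0.3, 0.2, 0.1, 0.4]):

```python
import math

def cross_entropy_loss(y, y_hat):
    """L(y_hat, y) = -sum_j y_j * log(y_hat_j)."""
    return -sum(yj * math.log(yhj) for yj, yhj in zip(y, y_hat))

y = [0, 1, 0, 0]                      # ground truth: this image is a cat
y_hat = [0.3, 0.2, 0.1, 0.4]          # the network's current output
L = cross_entropy_loss(y, y_hat)      # only the j = 2 term survives: -log(0.2)
```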

5:18

And these are probabilities, so they can never be bigger than 1. But this kind of makes sense, because if x for this example is the picture of a cat, then you want that output probability to be as big as possible. So more generally, what this loss function does is look at whatever the ground truth class is in your training set, and try to make the corresponding probability of that class as high as possible.

If you're familiar with maximum likelihood estimation in statistics, this turns out to be a form of maximum likelihood estimation. But if you don't know what that means, don't worry about it; the intuition we just talked about will suffice.

5:54

Now, this is the loss on a single training example. How about the cost J on the entire training set? So, the cost of a setting of the parameters, that is, of all the weights and biases, you define as pretty much what you'd guess: the sum over your entire training set of the loss of your learning algorithm's predictions on your training examples. And so what you do is use gradient descent in order to try to minimize this cost.
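As a sketch, the cost over m training examples can be computed as the average of the per-example losses (the 1/m scaling is the usual convention; the second example's labels here are illustrative):

```python
import math

def cross_entropy_loss(y, y_hat):
    """L(y_hat, y) = -sum_j y_j * log(y_hat_j)."""
    return -sum(yj * math.log(yhj) for yj, yhj in zip(y, y_hat))

def cost(examples):
    """J = (1/m) * sum of the losses over all m (y, y_hat) pairs."""
    m = len(examples)
    return sum(cross_entropy_loss(y, y_hat) for y, y_hat in examples) / m

examples = [
    ([0, 1, 0, 0], [0.3, 0.2, 0.1, 0.4]),   # the cat example from above
    ([0, 0, 1, 0], [0.1, 0.1, 0.7, 0.1]),   # an illustrative second example
]
J = cost(examples)
```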

Finally, one more implementation detail. Notice that because C = 4, y is a (4,1) vector and y hat is also a (4,1) vector.

6:34

So if you're using a vectorized implementation, the matrix capital Y is going to be y(1), y(2), through y(m), stacked horizontally. And so, for example, if the example up here is your first training example, then the first column of this matrix Y will be [0, 1, 0, 0]; then maybe the second example is a dog, maybe the third example is none of the above, and so on. And then this matrix Y will end up being a (4, m) dimensional matrix. And similarly, Y hat will be y hat(1) stacked up horizontally through y hat(m), so this first column is actually y hat(1).

7:19

So y hat(1), the output on the first training example, will be [0.3, 0.2, 0.1, 0.4], and so on. And Y hat itself will also be a (4, m) dimensional matrix.
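That horizontal stacking can be sketched in plain Python, representing the (4, m) matrix as a nested list of rows (the second and third example labels are illustrative):

```python
def stack_columns(columns):
    """Stack m length-C label vectors side by side into a C x m nested list."""
    C, m = len(columns[0]), len(columns)
    return [[columns[i][j] for i in range(m)] for j in range(C)]

y1 = [0, 1, 0, 0]                 # first training example: a cat
y2 = [0, 0, 1, 0]                 # illustrative: a dog
y3 = [1, 0, 0, 0]                 # illustrative: none of the above
Y = stack_columns([y1, y2, y3])   # 4 rows, m = 3 columns
first_column = [row[0] for row in Y]   # recovers y1 = [0, 1, 0, 0]
```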

Finally, let's take a look at how you'd implement gradient descent when you have a softmax output layer. So this output layer will compute z[L], which is (C,1), in our example (4,1), and then you apply the softmax activation function to get a[L], or y hat.

7:53

And then that in turn allows you to compute the loss. So we've talked about how to implement the forward propagation step of a neural network to get these outputs and to compute the loss. How about the backpropagation step, or gradient descent?

It turns out that the key step, or the key equation you need to initialize backprop, is this expression: the derivative of the loss with respect to z at the final layer, dz[L], turns out to be y hat, the (4,1) vector, minus y, the (4,1) vector. So you notice that all of these are going to be (4,1) vectors when you have 4 classes, and (C,1) in the more general case.

8:34

And so, going by our usual definition of dz, this is the partial derivative of the cost function with respect to z[L]. If you're an expert in calculus, you can try to derive this yourself, but using this formula will also just work fine if you have a need to implement this from scratch. With this, you can then compute dz[L] and then start off the backprop process to compute all the derivatives you need throughout your neural network.
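The claim that dz[L] = y hat - y can be sanity-checked numerically, comparing it against finite differences of the loss with respect to each component of z (a sketch; the z values are illustrative):

```python
import math

def softmax(z):
    t = [math.exp(v) for v in z]
    s = sum(t)
    return [v / s for v in t]

def loss_from_z(z, y):
    """Cross-entropy loss of softmax(z) against the one-hot label y."""
    a = softmax(z)
    return -sum(yj * math.log(aj) for yj, aj in zip(y, a))

z = [5.0, 2.0, -1.0, 3.0]         # illustrative z[L]
y = [0, 1, 0, 0]                  # ground truth label

# Analytic gradient: dz[L] = y_hat - y
analytic = [aj - yj for aj, yj in zip(softmax(z), y)]

# Numerical gradient by central differences
eps = 1e-6
numeric = []
for j in range(len(z)):
    z_plus, z_minus = z[:], z[:]
    z_plus[j] += eps
    z_minus[j] -= eps
    numeric.append((loss_from_z(z_plus, y) - loss_from_z(z_minus, y)) / (2 * eps))
```

The two gradients should agree to several decimal places, which is exactly the kind of gradient check you'd run if implementing this from scratch.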

But it turns out that in this week's programming exercise, we'll start to use one of the deep learning programming frameworks, and for those programming frameworks, it usually turns out you just need to focus on getting the forward prop right. And so long as you specify the forward prop pass in the programming framework, the framework will figure out how to do backprop, how to do the backward pass, for you.

9:27

So this expression is worth keeping in mind in case you ever need to implement softmax regression, or softmax classification, from scratch. Although you won't actually need it in this week's programming exercise, because the programming framework you use will take care of this derivative computation for you.

So that's it for softmax classification. With it, you can now implement learning algorithms to categorize inputs into not just one of two classes, but one of C different classes. Next, I want to show you some of the deep learning programming frameworks, which can make you much more efficient at implementing deep learning algorithms. Let's go on to the next video to discuss that.
