0:00

In this video, we're going to look at the softmax output function.

This is a way of forcing the outputs of a neural network to sum to one so they can

represent a probability distribution across discrete mutually exclusive

alternatives. Before we get back to the issue of how we

learn feature vectors to represent words, we're gonna have one more digression, this

time it's a technical diversion. So far I've talked about using a squared error
measure for training a neural net, and for linear neurons that's a sensible thing to

do. But the squared error measure has some

drawbacks. If, for example, the desired output is

one, so you have a target of one, and the actual output of a neuron is one

billionth, then there's almost no gradient to allow a logistic unit to change.

It's way out on a plateau where the slope is almost exactly horizontal.

And so, it will take a very, very long time to change its weights, even though

it's making almost as big an error as it's possible to make.
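As a minimal sketch of that plateau (the helper name `squared_error_grad` is just for illustration, not from the lecture), the gradient of the squared error with respect to a logistic unit's total input is (Y - T) times Y times (1 - Y), which is vanishingly small when the output Y is near zero:

```python
def squared_error_grad(y, t):
    # For E = 0.5*(t - y)^2 with a logistic output y = sigma(z):
    # dE/dy = (y - t), and dy/dz = y*(1 - y), so by the chain rule
    # dE/dz = (y - t) * y * (1 - y)
    return (y - t) * y * (1 - y)

y, t = 1e-9, 1.0                  # output near zero, target is one
g = squared_error_grad(y, t)
print(g)                          # about -1e-9: almost no gradient,
                                  # despite nearly the maximum possible error
```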

Also, if we're trying to assign probabilities to mutually exclusive class

labels, we know that the output should sum to one.

Any answer in which we say the probability that it's an A is three quarters

and the probability that it's a B is also three quarters is just a crazy answer.

And we ought to tell the network that information, we shouldn't deprive it of

the knowledge that these are mutually exclusive answers.

So the question is, is there a different cost function that will work better?

Is there a way of telling it that these are mutually exclusive, and then using

an appropriate cost function? The answer, of course, is that there is.

What we need to do is force the outputs of the neural net to represent a probability

distribution across discrete alternatives, if that's what we plan to use them for.

The way we do this is by using something called a softmax.

It's a kind of soft continuous version of the maximum function.

So the way the units in a softmax group work is that they each receive some total

input they've accumulated from the layer below.

That's Zi for the i-th unit, and that's called the logit.

And then they give an output Yi that doesn't just depend on their own Zi.

It depends on the Zs accumulated by their rivals as well.

So we say that the output of the i-th neuron is E to the Zi divided by the sum

of that same quantity for all the different neurons in the softmax group.

And because the bottom line of that equation is the sum of the top line over

all possibilities, we know that when you add over all possibilities you'll get one.

That is, the sum of all the Yi's must come to one.

What's more, the Yi's have to lie between zero and one.

So we force the Yi to represent a probability distribution over mutually

exclusive alternatives just by using that softmax equation.
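Here's a minimal sketch of that softmax equation in Python (subtracting the maximum logit before exponentiating is a standard numerical-stability trick, an addition of mine rather than something from the lecture; it doesn't change the result):

```python
import math

def softmax(z):
    # Yi = exp(Zi) / sum_j exp(Zj); subtract max(z) for numerical stability
    m = max(z)
    exps = [math.exp(zi - m) for zi in z]
    s = sum(exps)
    return [e / s for e in exps]

y = softmax([2.0, 1.0, 0.1])
print(y)            # each output lies strictly between 0 and 1
print(sum(y))       # and they sum to exactly 1
```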

The softmax equation has a nice simple derivative.

If you ask how Yi changes as you change Zi, that obviously involves

all the other Zs, because Yi itself depends on all the

other Zs. And it turns out that you get a nice

simple form, just like you do for the logistic unit: the derivative of the

output with respect to the input, for an individual neuron in a softmax group, is

just Yi times one minus Yi. It's not totally trivial to derive that.

If you try differentiating the equation above, you must remember that Zi also turns

up in that normalization term on the bottom row.

It's very easy to forget those terms and get the wrong answer.
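A quick way to avoid that mistake is to check the analytic derivatives against finite differences. This sketch (my own check, not from the lecture) confirms both the diagonal terms Yi(1 - Yi) and the cross terms -Yi·Yj that come from the normalization:

```python
import math

def softmax(z):
    # Yi = exp(Zi) / sum_j exp(Zj), with max subtracted for stability
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    s = sum(exps)
    return [e / s for e in exps]

z = [0.5, -1.2, 2.0]
y = softmax(z)
eps = 1e-6

for i in range(len(z)):
    for j in range(len(z)):
        zp = list(z)
        zp[j] += eps                      # nudge one logit
        numeric = (softmax(zp)[i] - y[i]) / eps
        # diagonal: dYi/dZi = Yi*(1 - Yi); off-diagonal: dYi/dZj = -Yi*Yj
        analytic = y[i] * (1 - y[i]) if i == j else -y[i] * y[j]
        assert abs(numeric - analytic) < 1e-4
print("softmax Jacobian check passed")
```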

4:12

Now the question is, if we're using a softmax group for the outputs, what's the

right cost function? And the answer, as usual, is that the most

appropriate cost function is the negative log probability of the correct answer.

That is, we want to maximize the log probability of getting the answer right.

So if one of the target values is a one and the remaining ones are zero, then we

simply sum over all possible answers. We put zeros in front of all the wrong

answers. And we put one in front of the right

answer and that gets us the negative log probability of the correct answer, as you

can see in the equation. That's called the cross entropy cost

function. It has a nice property that it has a very

big gradient when the target value is one and the output is almost zero.

You can see that by considering a couple of cases.

5:17

So a value of one in a million is much better than a value of one in a billion,

even though the two differ by less than a millionth.

So when you increase the output value by less than one millionth,

the value of C improves by a lot. That means there's a very, very steep

gradient for C. One way of seeing why a value of one in a

million is much better than a value of one in a billion, if the correct answer is one,

is this: if you believed the one in a million, you'd be willing to take a bet at odds

of a million to one, and you'd lose a million dollars.

If you thought the answer was one in a billion, you'd lose a billion dollars

making the same bet. So we get a nice property.
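To put numbers on those two cases (my own illustration of the lecture's point): the cross-entropy cost for the correct answer is just minus the log of the probability assigned to it, and moving that probability from one in a billion to one in a million drops the cost a lot even though the probability itself barely changed:

```python
import math

def cost(y):
    # cross-entropy cost when the correct class gets probability y
    return -math.log(y)

print(cost(1e-9))   # about 20.7
print(cost(1e-6))   # about 13.8
# Raising the output by less than one millionth cuts the cost by about 6.9,
# so the gradient of C with respect to the output is enormous near zero.
```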

6:01

That cost function C has a very steep derivative when the answer is very wrong,

and that exactly balances the fact that the rate at which the output Y changes as

you change the input Z is very flat when the answer is very wrong.

And when you multiply the two together to get the derivative of cross entropy with

respect to the logit going into output unit i,

you use the chain rule. So that derivative is how fast the cost function changes as

you change the output of the unit times how fast the output of the unit changes as

you change Zi. And notice we need to add up across all

the units j, because when you change Zi, the outputs of all the different units change.

The result is just the actual output minus the target output.

And you can see that when the actual and target outputs are very different, that

has a slope of one or minus one. And the slope is never bigger than one or

minus one. But the slope never gets small until the

two things are pretty much the same. In other words, you're getting pretty much

the right answer.
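That result, that the derivative of the cross entropy with respect to each logit is just the actual output minus the target output, can be checked numerically. This sketch (my own verification, not from the lecture) compares a finite-difference derivative against Yj - Tj for a one-hot target:

```python
import math

def softmax(z):
    # Yi = exp(Zi) / sum_j exp(Zj), with max subtracted for stability
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    s = sum(exps)
    return [e / s for e in exps]

def cross_entropy(y, t):
    # C = -sum_j Tj * log(Yj)
    return -sum(tj * math.log(yj) for tj, yj in zip(t, y))

z = [0.3, -0.8, 1.5]
t = [0.0, 1.0, 0.0]           # one-hot target: class 1 is correct
y = softmax(z)
eps = 1e-6

for j in range(len(z)):
    zp = list(z)
    zp[j] += eps              # nudge one logit
    numeric = (cross_entropy(softmax(zp), t) - cross_entropy(y, t)) / eps
    # the chain rule over all outputs collapses to dC/dZj = Yj - Tj
    assert abs(numeric - (y[j] - t[j])) < 1e-4
print("dC/dZ = Y - T confirmed numerically")
```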