0:54

So, if f corresponds to maybe some probability, and you're trying to find the most probable set of xs, then you would probably want to find the maximum of f. However, if f were some cost function, and x1 through xn are some parameters of that cost function, then you would be trying to find the minimum.

1:35

It's hard to draw very high-dimensional functions. But what we can do is draw a function of two arguments, and then claim that all the geometry and all the math we're going to talk about holds for functions of more than two arguments. So let's say we have f(x1, x2), and let's say it looks something like this: here is x1, here is x2, and here we draw f.

Â 2:36

Now let's pretend that F is a probability function.

Â A probability density and we are trying to find the most probable X1 and X2.

Â And that corresponds to the X2 and X1 where F is maximized so

Â on this picture you can see the maximum right there, so that's the maximum.

Â And when you see the whole picture like this,

Â it's actually fairly easy to figure out where the maximum is.

3:25

However, in practice you usually can't see the whole picture. What you can do is sample the function at different pairs of xs: I can provide the function with an x1 and an x2, and see what the value of the function is at that point. And if I could do this with every x1 and x2, it would be very easy to see what the maximum of the function was; I would just choose the x1 and x2 that yielded the greatest value of f. However, this is often infeasible, especially if we have a very high-dimensional space of inputs.

So what does one do? One possibility would be to choose a bunch of random pairs of x1 and x2, and see which one yielded the highest value of f. So maybe we would choose an x1, x2 that yielded that f right there; maybe one that yielded that f; maybe one over there, maybe one there, maybe one there. And at the end of the day, once we have chosen a finite number of random x1, x2 pairs, we would just ask: okay, which of those pairs yielded the highest value of f? And we'll say that's the maximum. However, that's actually not going to work very well, and you can probably see why based on this drawing. When you're just randomly sampling, you're using no strategy to figure out where the maximum might be. You're just choosing randomly, so can we do something a little bit better?
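To make the random-sampling idea concrete, here is a minimal sketch in Python. The particular function f, its peak at (1, -2), and the sampling range are all made-up stand-ins for whatever function you are actually optimizing.

```python
import random

# Hypothetical example function of two arguments, with its maximum
# value of 3 at (x1, x2) = (1, -2). Stands in for whatever f you care about.
def f(x1, x2):
    return -(x1 - 1) ** 2 - (x2 + 2) ** 2 + 3

random.seed(0)  # fixed seed so the run is repeatable

# Choose a bunch of random (x1, x2) pairs...
samples = [(random.uniform(-5, 5), random.uniform(-5, 5)) for _ in range(200)]

# ...and declare the pair that yielded the highest value of f the "maximum".
best = max(samples, key=lambda pair: f(*pair))
```

With 200 samples in a small 2-D square this may land near the peak by luck, but the number of samples needed to cover the space grows exponentially with the number of arguments n, which is exactly why this strategy breaks down.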

Instead, we're going to use a method called gradient ascent. The way gradient ascent works is that you pick a starting point, and then instead of jumping to just any other x1, x2, you move to an x1, x2 pair that is in your little neighborhood. So you only move to points that are nearby. This is very similar to what you would do if you were in a foggy city, or in a city at night where you couldn't see very well, and you were tasked with finding the highest point of the city. You obviously can't randomly jump around to take samples at different points of the city; instead, you can only move locally. All right, so now that you're constrained to move locally, to only be able to move to locations very nearby to where you currently are, what should you do? Well, if I were trying to find the highest point in the city, I would always walk in the steepest direction.

6:36

And then I would keep going and keep going until I got to the top. And how would I know when I got to the top? Well, every direction would be equally steep, and the steepness would be zero, because it would be completely flat in every direction. So hopefully that makes sense intuitively, and in fact that is all gradient ascent does.

7:21

So why is it called gradient ascent? Well, it has to do with what kind of calculation we do to find the steepest direction. So let's introduce the notion of the gradient: our symbol for it is ∇, a little upside-down triangle.

7:48

If we take the gradient of f, that is just ∇f = (∂f/∂x1, ∂f/∂x2, and so forth, all the way to ∂f/∂xn). So the gradient is a vector, each of whose entries is the partial derivative of f with respect to one of f's arguments. So if f has n arguments as inputs, then the gradient is an n-dimensional vector.
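In code, you can approximate this vector of partial derivatives numerically by nudging one argument at a time. This is a finite-difference sketch, and the example f is made up for illustration:

```python
def f(x):
    # Hypothetical example: f(x1, x2) = -(x1 - 1)^2 - (x2 + 2)^2 + 3.
    x1, x2 = x
    return -(x1 - 1) ** 2 - (x2 + 2) ** 2 + 3

def gradient(f, x, h=1e-6):
    """Approximate (df/dx1, ..., df/dxn) by central differences."""
    grad = []
    for i in range(len(x)):
        x_plus = list(x)   # copy with the i-th argument nudged up...
        x_minus = list(x)  # ...and a copy with it nudged down
        x_plus[i] += h
        x_minus[i] -= h
        grad.append((f(x_plus) - f(x_minus)) / (2 * h))
    return grad
```

For this f the exact gradient is (-2(x1 - 1), -2(x2 + 2)), so at (0, 0) the approximation should come out very close to (2, -4).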

10:49

What about the next step? Well, we calculate the gradient at (x1, x2)_1, and then our next step leads us to where we were previously, plus alpha times the gradient: (x1, x2)_2 = (x1, x2)_1 + α∇f.

11:16

Now, if you're doing this right, the gradient might change as you move to different locations, and so this is the algorithm: at each point you take a step of size alpha in the direction of the gradient, which is the steepest direction. You can keep doing this over and over again. And then, when do you stop?

12:07

The magnitude of the gradient is equal to the steepness of the slope: if the gradient is very large, then you have a very steep slope. And if the gradient is 0, then you have a 0 slope. And what does a 0 slope mean? Well, that means that you've reached a maximum. So that's how gradient ascent works.
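Putting the pieces together, a minimal gradient ascent loop might look like the sketch below. The example function, its hand-computed gradient, the step size alpha, and the stopping tolerance are all illustrative choices, not part of the lecture:

```python
def grad_f(x1, x2):
    # Hand-computed gradient of the example f(x1, x2) = -(x1-1)^2 - (x2+2)^2 + 3,
    # which has its maximum at (1, -2).
    return (-2 * (x1 - 1), -2 * (x2 + 2))

def gradient_ascent(x1, x2, alpha=0.1, tol=1e-10, max_steps=10_000):
    for _ in range(max_steps):
        g1, g2 = grad_f(x1, x2)
        # Stop when the gradient is (essentially) zero: flat in every direction.
        if g1 * g1 + g2 * g2 < tol:
            break
        # Take a step of size alpha in the direction of the gradient.
        x1 = x1 + alpha * g1
        x2 = x2 + alpha * g2
    return x1, x2
```

In practice alpha needs some care: too small and convergence crawls, too large and the steps overshoot the peak and oscillate.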

12:39

You pick a starting point and then you follow the gradient to the top. Unfortunately, this can have a few problems, as in the picture we have drawn, for example. Which peak you reach will depend on your starting location: if we started here, we would probably reach that peak; however, if we start somewhere over here, we will reach this peak, which is a local maximum. So, if your goal is to find the global maximum of your function using gradient ascent, it's usually a good idea to run the algorithm a few times from random starting locations. Each time you run the algorithm, you will get some final set of inputs x1, x2, all the way to xn, that yielded a local maximum, or that caused the gradient ascent algorithm to stop running. So for each time you run the algorithm, you'll have a set of these x1 through xn's. And then all you do is pick the set of x1 through xn that yielded the highest value of f. So that's it; that's the gradient ascent optimization method.
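Here is a sketch of that restart strategy on a made-up function with two peaks: a local maximum near (-2, 0) and the global maximum near (2, 0). For repeatability the "random" starts are a fixed spread of points standing in for random restarts, the gradient is approximated numerically, and all the specific numbers are illustrative:

```python
import math

def f(x1, x2):
    # Made-up bimodal function: local peak near (-2, 0) with height ~1,
    # global peak near (2, 0) with height ~2.
    return (math.exp(-((x1 + 2) ** 2 + x2 ** 2))
            + 2 * math.exp(-((x1 - 2) ** 2 + x2 ** 2)))

def num_grad(x1, x2, h=1e-6):
    # Finite-difference approximation of the gradient.
    return ((f(x1 + h, x2) - f(x1 - h, x2)) / (2 * h),
            (f(x1, x2 + h) - f(x1, x2 - h)) / (2 * h))

def ascend(x1, x2, alpha=0.2, steps=2000):
    # Plain gradient ascent with a fixed number of steps.
    for _ in range(steps):
        g1, g2 = num_grad(x1, x2)
        x1, x2 = x1 + alpha * g1, x2 + alpha * g2
    return x1, x2

# Run gradient ascent from several starting locations, then keep
# whichever endpoint yielded the highest value of f.
starts = [(-3.5, 0.5), (-1.0, 0.5), (0.5, 0.5), (3.0, 0.5)]
results = [ascend(*s) for s in starts]
best = max(results, key=lambda p: f(*p))
```

Starts on the left converge to the local peak near (-2, 0); starts on the right find the global one. Keeping the best endpoint across restarts is what rescues gradient ascent from getting stuck on a local maximum.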

If you're trying to do gradient descent, so maybe trying to find the minimum of a cost function, for example, then you're just walking down the hill. You do the same thing as in gradient ascent, but you go in the direction opposite of the gradient, and again, you keep taking steps until the gradient reaches 0.
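In code, the only change from the ascent loop is the sign of the step. Again the cost function and its minimum at (1, -2) are made-up examples:

```python
def cost(x1, x2):
    # Hypothetical cost function with its minimum at (x1, x2) = (1, -2).
    return (x1 - 1) ** 2 + (x2 + 2) ** 2

def grad_cost(x1, x2):
    # Hand-computed gradient of the cost.
    return (2 * (x1 - 1), 2 * (x2 + 2))

def gradient_descent(x1, x2, alpha=0.1, tol=1e-10, max_steps=10_000):
    for _ in range(max_steps):
        g1, g2 = grad_cost(x1, x2)
        if g1 * g1 + g2 * g2 < tol:  # gradient ~ 0: flat, so stop
            break
        # Walk downhill: step AGAINST the gradient.
        x1 = x1 - alpha * g1
        x2 = x2 - alpha * g2
    return x1, x2
```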

And again, we've gone through our examples in two dimensions, so with two arguments for our function f, but the same general idea holds for higher dimensions.

14:38

And when the gradient is equal to zero, that means we are at a maximum or a minimum, and that's it. This is just one of the many optimization methods there are for optimizing a function of many arguments. And in many cases it works pretty well: if you're trying to find the ideal network weights for a feedforward neural network that you're using for classification, for example, gradient ascent does a pretty good job. However, keep in mind that it is not the be-all and end-all solution; there are still many varieties of functions which are not maximized particularly well with gradient ascent.

15:21

However, one good check to see whether gradient ascent is working is to pick a bunch of random starting locations. If they all converge to the same maximum, that suggests that f is actually fairly well behaved and may have a global maximum that is relatively easy to find. All right, that's it for gradient ascent and descent. Thanks for watching.
