Now, what happens if, instead of having say five or six discrete actions,

you now have to deal with a continuous action space?

So, previously you had two options,

you steer the bike to the left or to the right.

Now, you have to provide a particular value between say,

minus pi over two and plus pi over two,

which is the exact amount of steering you want to apply.

This is especially important in domains like, say, robot control.

In this case, you have all the joints or all the limbs of your robot.

And you can move them via motors,

and motors are controlled via voltage.

So, you can apply any voltage you want within some particular range.

Now, this actually means that

you can no longer solve this problem by simply having one neuron per possible action,

and using a softmax nonlinearity.

So, how do you use, say, a neural network to predict

not a discrete variable, but a real-valued variable?

One way you can do this is,

you can simply solve a regression problem.

It's probably the most obvious one.

For example, if you use default scikit-learn or Keras regression models,

they usually minimize the mean squared error.

If you remember the Bayesian methods course,

minimizing squared error is

actually the same thing as maximizing the logarithm of the likelihood,

in the case where you predict a normal distribution over the action space.

So basically, it means that the probability of taking an action is

a normal distribution with mean value given by your neural network,

and a standard deviation of one.

Again, in this case, you can simply fit your model using the existing classification or,

in this case, regression algorithm,

which for a neural network is another modification of backpropagation.

And you can do this repeatedly until your model converges to an optimal result.
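
As a rough sketch of the fitting step for continuous actions: the code below uses a linear least-squares model as a stand-in for the neural network, and the names `fit_mean_policy` and `sample_action`, the synthetic data, and the clipping to the steering range are all assumptions made for illustration.

```python
import numpy as np

def fit_mean_policy(elite_states, elite_actions):
    """Least-squares fit of a linear mean action: mu(s) = [s, 1] @ w.

    Minimizing squared error here plays the role of the regression step;
    a neural network trained with MSE would replace this in practice.
    """
    X = np.hstack([elite_states, np.ones((len(elite_states), 1))])
    w, *_ = np.linalg.lstsq(X, elite_actions, rcond=None)
    return w

def sample_action(w, state, rng, low=-np.pi / 2, high=np.pi / 2):
    """Sample from N(mu(state), 1), then clip to the allowed range (an assumption)."""
    mu = np.append(state, 1.0) @ w
    return float(np.clip(rng.normal(mu, 1.0), low, high))

# Synthetic "elite" data: actions are an exact linear function of the state.
rng = np.random.default_rng(0)
states = rng.normal(size=(200, 3))
true_w = np.array([0.5, -0.3, 0.1, 0.2])
actions = np.hstack([states, np.ones((200, 1))]) @ true_w

w = fit_mean_policy(states, actions)
a = sample_action(w, states[0], rng)
```

In a full loop you would alternate: sample sessions with `sample_action`, keep the elite ones, and refit on their state-action pairs.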

In our practical assignments,

you can modify just two lines of code in your assignment,

and you'll be able to solve a different problem which requires a real-valued output.

We'll discuss the details of

these particular changes later in the practical assignment itself.

So of course, it's not all theory.

Sometimes, to get algorithms to work really well,

you'll have to employ some kind of dirty hacks and practical heuristics.

For the cross-entropy method, there are several kinds of those heuristics.

One family of heuristics is aimed at

reducing the number of samples it takes to train.

In the cross-entropy method, this problem is especially dire.

You have to play 100 sessions,

and you only use some fraction,

say 25 percent of them,

and it gets even smaller if you use larger sample sizes.

This is of course terribly bad,

and this is probably the worst case of sample inefficiency we have.
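
To see how much data gets thrown away, here is a minimal sketch of elite selection by reward percentile. The function name `select_elites` and the toy rewards are hypothetical; the 100 sessions and the 25 percent kept follow the numbers mentioned above.

```python
import numpy as np

def select_elites(sessions, rewards, percentile=75):
    """Keep only sessions whose total reward is at or above the given percentile."""
    threshold = np.percentile(rewards, percentile)
    return [s for s, r in zip(sessions, rewards) if r >= threshold]

# Toy data: 100 sessions with distinct rewards 0..99.
rewards = np.arange(100)
sessions = [f"session_{i}" for i in range(100)]

elites = select_elites(sessions, rewards)
kept_fraction = len(elites) / len(sessions)
```

With these toy rewards, three quarters of the sessions you paid to sample never reach the training step.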

So, the cross-entropy method relies on you being able to give it a lot of samples.

This is true for virtual reality, for games,

for computer models of robots,

but it's not true if you want an actual robotic car to steer on actual streets.

So instead, you can try some hacks to get it to run more smoothly.

For example, you can re-use the samples from several past iterations.

So, you don't have to sample 1000 or 100 sessions.

You can sample say 20 sessions,

and use 80 sessions leftover from the previous iterations.

This of course makes the training slightly less theoretically nice,

but it tends to somewhat work, from time to time.
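
One way this reuse could look in code is a fixed-size buffer that keeps the most recent sessions. The class name `SessionBuffer` and the placeholder session tuples are assumptions for illustration; the 20 fresh and 80 leftover sessions follow the numbers above.

```python
from collections import deque

class SessionBuffer:
    """Keeps the most recent sessions; older ones drop out automatically."""

    def __init__(self, max_sessions=100):
        self.sessions = deque(maxlen=max_sessions)

    def add(self, new_sessions):
        self.sessions.extend(new_sessions)

    def contents(self):
        return list(self.sessions)

buffer = SessionBuffer(max_sessions=100)
for iteration in range(6):
    # Placeholder sessions tagged with the iteration that produced them.
    fresh = [(iteration, i) for i in range(20)]
    buffer.add(fresh)

# Elite selection would now run over 20 fresh + 80 leftover sessions.
pool = buffer.contents()
```

The leftover sessions were generated by older versions of the policy, which is the price paid for sampling less.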

Now, another problem with the cross-entropy method

is that it tends to sometimes fall into local optima.

So, you have a neural network that has a weird structure,

so that the gradients sometimes explode.

And once they explode, there is a small chance that your network will end up in

a situation where some action has a probability of almost zero.

Now, in the usual supervised learning setup, this is not so bad.

Well, in the worst case you will get NaN (not a number) everywhere,

but usually what you have,

is you'll have your network trained to fix this error.

In reinforcement learning, the problem is much worse.

Because if you do have a probability of zero,

this means that your agent explicitly avoids

taking some particular action in some particular state.

This is bad, especially if this action

was the optimal one that you have not yet discovered.

So, since you never take this action,

you'll never get the samples where this action happens in the elite session.

So, you're stuck in this sub-optimal policy.

How can you improve this?

Well, of course there are many ways,

but one way to do so is to simply regularize your network.

You can try to not only minimize the cross-entropy on elite sessions,

but also as a regularizer,

slightly increase the entropy of the output distribution.

So, as we all know, entropy gets smallest

when the agent is absolutely certain about one action,

and takes this action all the time.

So, the probability of one for this one action,

and zero for all the other actions.

The highest value of entropy is achieved for uniform distribution.

Now, this means that if you regularize,

the higher your entropy gets the better.

It means that your agent will be biased against completely giving up on actions.

So, if some action gets a near-zero probability,

the probability will eventually get slightly higher,

following the gradient of the entropy.

We'll cover this in more detail in the reading section.
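
As a minimal sketch of such a regularized objective, computed here with NumPy on already-predicted probabilities: the function names, the coefficient `beta`, and the small `eps` added inside the logarithm are assumptions for illustration, not values from the lecture.

```python
import numpy as np

def policy_entropy(probs, eps=1e-8):
    """Mean entropy of the action distributions, one row per state."""
    return -np.mean(np.sum(probs * np.log(probs + eps), axis=1))

def regularized_loss(probs, elite_actions, beta=0.01, eps=1e-8):
    """Cross-entropy on elite actions minus beta times the policy entropy.

    Minimizing this fits the elite actions while rewarding higher entropy.
    """
    n = len(elite_actions)
    chosen = probs[np.arange(n), elite_actions]
    cross_entropy = -np.mean(np.log(chosen + eps))
    return cross_entropy - beta * policy_entropy(probs, eps)

uniform = np.full((4, 3), 1 / 3)                # maximally uncertain policy
peaked = np.array([[0.98, 0.01, 0.01]] * 4)     # nearly deterministic policy
elite_actions = np.zeros(4, dtype=int)          # elite sessions always took action 0

h_uniform = policy_entropy(uniform)
h_peaked = policy_entropy(peaked)
loss = regularized_loss(peaked, elite_actions, beta=0.1)
```

The uniform policy has the higher entropy, so a positive `beta` pulls the trained policy slightly away from the nearly deterministic one.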

Now finally, since we are living in the modern world and

even your smartphone has more than one processor core, basically this means that,

whenever you have a parallelizable algorithm,

you can get it to run 100 times or even 1,000 times as fast,

depending on how many servers you have.

For the cross-entropy method, it's very simple.

You have this phase where you sample 1,000 sessions.

You can sample them all in parallel.

Of course sometimes, it requires that you buy

1,000 separate kind of environment emulators if it's something physical.

But for videogames for example,

it's very easy to parallelize.
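
A rough sketch of sampling sessions in parallel with Python's standard library. Here `play_session` is a hypothetical stand-in for one episode, and a thread pool is used only to keep the example short; for a CPU-bound simulator you would more likely use separate processes or machines.

```python
import random
from concurrent.futures import ThreadPoolExecutor

def play_session(seed, n_steps=50):
    """Hypothetical stand-in for one episode: returns (actions, total_reward)."""
    rng = random.Random(seed)
    actions = [rng.choice([0, 1]) for _ in range(n_steps)]
    return actions, sum(actions)

def sample_sessions(n_sessions, n_workers=8):
    """Run independent sessions concurrently; results come back in seed order."""
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        return list(pool.map(play_session, range(n_sessions)))

sessions = sample_sessions(1000)
```

This works because sessions do not depend on each other: each one only needs a copy of the current policy and its own environment.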

Finally, there is a very neat option here.

Sometimes you want to experiment with

the neural network architecture for the cross-entropy method.

And for some cases,

if you don't want to only rely on your current observation,

you can use a recurrent neural network based architecture to make your agents kind

of use a memory to store

whatever useful information they have seen in the previous observations.

This is of course slightly more complicated than it sounds in this one sentence.

So, we'll cover this in much more detail near the end of the course.

So, I hope the cross-entropy method is

slightly less opaque to you now. Now let's get to practice.