Now, the problem is that in this difficult sentence,
any improvement the agent makes actually matters much more
than any improvement in this Jon Snow example.
But the REINFORCE algorithm,
the policy gradient formula we've just derived, kind of says the opposite.
In this case you would multiply the gradient of your simple sentences
by the score you get, which is plus 100,
and the gradient of your more complicated sentences
by whatever the agent gets there, say 20.
This is not the kind of behavior you want your agent to exhibit.
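To see the imbalance concretely, here is a minimal sketch (PyTorch assumed, all variable names and numbers hypothetical) of how plain REINFORCE weights each sample's gradient by its raw reward:

```python
# Minimal sketch: plain REINFORCE scales the log-probability gradient
# by the raw reward, so the easy sentence dominates the update.
import torch

# log-probabilities of the sampled translations under the current policy
log_p_easy = torch.tensor(-2.0, requires_grad=True)   # the "Jon Snow" sentence
log_p_hard = torch.tensor(-2.0, requires_grad=True)   # the difficult sentence

reward_easy = 100.0   # near-perfect score on the simple sentence
reward_hard = 20.0    # modest score on the hard sentence

# REINFORCE surrogate loss: -reward * log pi(action)
loss = -(reward_easy * log_p_easy + reward_hard * log_p_hard)
loss.backward()

# The easy sentence's gradient is 5x larger, even though progress on the
# hard sentence is where improvement actually matters.
print(log_p_easy.grad, log_p_hard.grad)  # tensor(-100.) tensor(-20.)
```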
On the contrary, you want to encourage your agent not just for
doing things that are good by themselves,
or good simply because the task happened to be easy this time.
You want to reward the things that are good in comparison to how your
agent usually performs here.
So if your agent on average performs very poorly on these sentences,
say it usually gets a reward of 10 out of 100,
and now it has just gotten a reward of, say, 30,
this is a very good improvement.
You have to capitalize on it.
You have to actually make sure that the agent learns this and
learns to repeat it more often.
And if it translates Jon Snow perfectly, just like during the previous 100 iterations,
it's not a big deal, even though it gets a perfect score.
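Here is a small sketch, using the numbers from the example above, of comparing each reward against a baseline of the agent's usual performance (the baseline here is simply assumed to be the average historical reward):

```python
# Hypothetical numbers from the example above; the baseline is the agent's
# usual (average) reward on each kind of sentence.
baseline_hard = 10.0    # the agent usually scores about 10/100 on the hard sentence
baseline_easy = 100.0   # it already translates "Jon Snow" perfectly

reward_hard = 30.0      # today's surprisingly good attempt
reward_easy = 100.0     # yet another routine, perfect translation

advantage_hard = reward_hard - baseline_hard   # +20: a real improvement, reinforce it
advantage_easy = reward_easy - baseline_easy   #   0: nothing new, no extra push

print(advantage_hard, advantage_easy)  # 20.0 0.0
```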
Now, this basically translates into having a baseline in the reinforcement learning algorithm.
The idea here is that you want to weight the gradient not by the Q function,
as written in this formula, but by something called the advantage.
The advantage is how well your algorithm performs compared to what it usually does:
its advantage over its usual performance.
And this leads us to a bit more math here.
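Written out, the substitution being described looks roughly like this (a sketch consistent with the REINFORCE gradient mentioned above; b denotes the baseline, which is not specified further at this point in the lecture):

\nabla_\theta J \approx \mathbb{E}\left[\nabla_\theta \log \pi_\theta(a \mid s)\, A(s, a)\right],
\qquad A(s, a) = Q(s, a) - b(s),

so the log-probability gradient is weighted by how much better the action did than the baseline, rather than by the raw return.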