Welcome back to our reinforcement learning course on Coursera. By the end of last week, you learned one huge concept: if you know the value function or the action value function, then you can find the optimal policy very easily. We also found at least one way to infer this value or action value function by means of dynamic programming. This is all nice and well, but now we're going to see how this transfers to practical problems and find out all the limitations that come with it.

What I mean by practical problems is basically any kind of problem that arises in the wild, where you don't have access to all the state and action transition probabilities and so on. In the worst case, as I have probably shown you already, you have all the access you want to the agent, you can implement anything there, but the environment is mostly a black box that responds to your actions in some way that is hard to model. Think of the environment as your Atari game, or maybe a robotic car, or an automatic pancake flipper, for that matter. Even in the trivial case of Go, and please don't blame me for calling it trivial, you don't know anything about how, for example, your opponent is going to react. You can model it to some degree, but you're never given accurate predictions of what is going to happen on the next turn.

So the first problem that arises here, and it's actually a huge problem, is that you no longer have access to the state transition probability distribution or to the reward function. You can sample states and rewards from the environment, but you don't know the exact probabilities of them occurring. In this case, of course, you cannot simply compute the expectation of the action values over possible outcomes, and this prevents you from both training your value function and using it to act optimally.

So, what are you going to do to approach this problem? Is there any trick of the trade from machine learning that you use when you don't know a probability distribution? Oh yeah, you kind of learn it. This is what you do in machine learning when you have an unknown dependency hidden in the data and a lot of samples to train a model on. Could we train another network that would, for example, take your Breakout game screen and predict the probability of the next state? It would kind of, sort of, technically work, but the problem here is that it's usually much harder to learn how the environment works than to find an optimal policy in it. In Breakout, this transition function is actually an image-to-image problem, for which you would probably have to use a fully convolutional network that takes an image and predicts the next image, which is super complicated compared to simply picking an action. In a more everyday problem, if you're trying to decide whether or not you want a cup of coffee, you're not required to figure out how the coffee machine works. You can, but that's a lot of extra work that you don't actually need.

Instead, what you want to do is design a new algorithm that gets rid of this probability distribution and gets by using only samples from the environment. So, let's add a bit more formulas, a bit more detail, to this problem. With your usual value iteration, written out below, there are two missing links, two pieces that you cannot compute explicitly. First, you cannot compute the maximum over all possible actions. To do so, you would have to actually see the rewards for all actions.
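To make those two missing pieces concrete, here is the value iteration update from last week written in the standard notation (not copied verbatim from the slides); the max over actions and the expectation over next states are exactly the two things a black-box environment will not let us compute:

$$ V_{i+1}(s) \;=\; \max_{a} \sum_{s'} P(s' \mid s, a)\,\bigl[\, r(s, a, s') + \gamma\, V_i(s') \,\bigr] $$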
And in the model-free setting, in the black box setting, evaluating that maximum would take at least one attempt for each action. To figure out whether your robot should take action A or action B, whether it should, for example, jump forward or just make a single step forward, it would have to do both things and then see which of them gives a better reward plus value of the next state. This is kind of impossible, because in real life, once you've taken some particular action, there is no undoing it; you cannot go back in time.

Then there is the problem with the expectation. Here you actually have to average over all possible outcomes, all possible next states. And this is another thing you cannot do directly, because in real life you're only going to see one outcome, one possible result. If you're playing a slot machine and you pull the lever, you are only going to see one outcome, not the whole set of outcomes with their respective probabilities. Otherwise, you would be much better off at any kind of gambling.

So what happens here is that you have a lot of expectations and maximizations that you cannot take exactly. Let's find out what we actually can do, how we can approximate them and approach this problem. The usual model-free setting requires that you train not from all the states and actions but from trajectories. A trajectory is basically a history of your agent playing a game. It can be a full history from the initial state to the last state: you begin playing Breakout, you take a few actions, you break a few bricks, and then you lose or win or whatever, depending on your agent. Or it may be a partial session, so you began but haven't finished yet. This trajectory is basically a sequence of states, actions, and rewards: first state, first action, first reward, second state, second action, second reward, and so on.

Of course, you can sample those trajectories as many times as you need, and you will need plenty of them. But in many cases, and practical applications are most important here, each trajectory is a particular expense on your side. If you're training a robotic car, you would actually have to consider the expenses in gasoline, and maybe the amount of time you spend, to get just one session of driving, five minutes down a street. Now, if we're talking about, say, Atari games, it's a little bit cheaper, because you no longer need to spend money directly, you just need to spend compute resources, which do convert into money. Those costs are different for each environment, but they are usually non-zero, so you have to take them into consideration. The other issue, again, is that we are not able to see all the possible outcomes: we only see one outcome, we only try one action at a time. So, to cover all the possible outcomes, you have to sample many trajectories and average over the different actions and outcomes in them. That can be quite costly.
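To make the trajectory format concrete, here is a minimal sketch of collecting one such trajectory, assuming a classic Gym-style environment with reset() and step() and some policy given as a callable; the names here are illustrative, not from the lecture.

```python
def sample_trajectory(env, policy, max_steps=1000):
    """Roll out one (possibly partial) session and record it as a list
    of (state, action, reward) tuples."""
    trajectory = []
    state = env.reset()
    for _ in range(max_steps):
        action = policy(state)                              # we can only try one action...
        next_state, reward, done, info = env.step(action)   # ...and see one sampled outcome
        trajectory.append((state, action, reward))
        if done:                                            # the session ended
            break
        state = next_state                                  # otherwise keep playing
    return trajectory
```

Collecting many such trajectories and averaging the observed returns over them is exactly the sampling-based substitute for the expectations we cannot compute.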
So, once you have those trajectories, we have to somehow use them to train our algorithm. And the first question, a question for you by the way, is: which kind of value function would you prefer to train? If you only have trajectories and no probability distribution, would it be better to have a perfect state value function, or an action value function of state and action? Which is better? Well, right. As you probably remember, and if not, you might have guessed using your common sense: if you have a state value function, then to find an optimal policy you actually have to average with probabilities that come from the environment. You have to compute the expectation of this value function over all possible next states, and you don't get those probabilities unless you explicitly approximate them. On the contrary, if you already have a perfect Q-function, an action value function, you just pick the action with the highest action value and you're golden: that's your optimal policy. So, the first decision here is that, unless we're trying something very specific and exotic, we are better off learning a Q-function than a V-function, and even an imperfect Q-function will do.

Now, to keep it strict and formal, let's recap what the action value is and how it is defined. The definition from the last lecture is that the action value is the expected discounted return, the reward plus gamma times the next reward plus and so on, that you would get if you start from state S and take action A, where both S and A are function arguments here, and then, from the next state onwards, follow your policy. If this policy is the optimal policy, this gives you Q star; if it's your own policy, it will be Q pi, as far as the notation goes.

The good part about the Q-function is that knowing it by definition gives you access to an optimal policy, which is deterministic here. And the Q-function itself is very easy to express in terms of the V-function. That formula, with only one expectation over the next state left in it, gives you a way to estimate the Q-function, and if you unroll the V term as an expectation of action values over the policy, you get a recurrent formula for the Q-function. This should all be familiar information for you, since you've gone through the last week.
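For reference, here are that definition and the recurrence written out in the standard notation (the usual textbook form, not copied verbatim from the slides):

$$ Q^{\pi}(s, a) \;=\; \mathbb{E}\bigl[\, r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \dots \,\bigm|\, s_t = s,\ a_t = a \,\bigr] \;=\; \mathbb{E}_{s'}\bigl[\, r(s, a, s') + \gamma\, V^{\pi}(s') \,\bigr] $$

Unrolling $V^{\pi}(s') = \mathbb{E}_{a' \sim \pi}\, Q^{\pi}(s', a')$ gives the recurrent form, and the optimal policy is read off directly:

$$ Q^{\pi}(s, a) \;=\; \mathbb{E}_{s'}\bigl[\, r(s, a, s') + \gamma\, \mathbb{E}_{a' \sim \pi}\, Q^{\pi}(s', a') \,\bigr], \qquad \pi^{*}(s) = \arg\max_{a} Q^{*}(s, a). $$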