Welcome to Week Four of the Reinforcement Learning Course. This week we will talk about model-free approximate methods, which are used in a wide variety of practical tasks. However, the most widely known practical application of reinforcement learning techniques nowadays is playing video games. This is in part because video games are perfect small-sized and oversimplified reflections of real life, but also in part because current methods are not efficient enough for application to real-life problems. There is still much to be done in the field of reinforcement learning. But today's topic, approximation methods, is ubiquitous in any reinforcement learning application, whatever the target domain is, whether it is video game playing, web ranking, or anything else. Let me start with a quick reminder of the update equation for the SARSA algorithm, the method you learned in previous lectures. You have probably seen this update in a slightly different form, but in fact this is precisely the same thing: a moving average of the current Q-value and the value of the target highlighted in blue on this slide. That is, we move our current estimate a little bit towards its slightly more refined approximation. What is important here is that in tabular methods, each Q(s, a) can be seen as a separate parameter, which is set during the learning process. We update this parameter, and after we finish this single update, no other values in our table have changed, only this Q(s, a). In fact, in tabular methods we have more parameters than we have states. That is because for each state we have as many parameters as there are possible actions in that state. Such a number of parameters should sound scary. Consider, for example, a simple Atari game. The number of states in this game is equal to the number of all possible combinations of pixel colors on the screen. And this number, although finite, is enormous.
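Written out in the standard notation (with α as the learning rate; the slide's exact symbols may differ slightly), the SARSA update just mentioned is:

```latex
Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha\big[\underbrace{r_t + \gamma\, Q(s_{t+1}, a_{t+1})}_{\text{target}} - Q(s_t, a_t)\big]
```

Rearranging the right-hand side gives \(Q \leftarrow (1-\alpha)\,Q + \alpha \cdot \text{target}\), which is exactly the moving-average reading described above.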
It is in fact much, much larger than the number of atoms in the universe. Consider also a very basic environment, cart-pole, where one needs to balance a pole on a cart, as its name suggests. In this environment the state includes continuous components, namely the combination of the position of the cart and the pole's angle. That means we would need a table of infinite size just to start applying tabular SARSA or the tabular [inaudible] method to this problem. The problem here is not only about storage memory. It is also about the data needed to appropriately fill such a table, and about the time needed to collect and process that data. Thus tabular methods sound utopian when applied to an arbitrary task of interest. To learn in such environments, we should know how to approximate, how to generalize to unseen but close states, given some finite amount of experience. And this is precisely what we are going to talk about. One straightforward idea is to reduce the number of parameters as much as possible, say, to much, much less than the number of states. Ideally, we would want to make the number of parameters independent of the number of states. That is, we want to approximate the value function or action-value function using some parametric model, V hat or Q hat, whose parameters W are highlighted in red on the slide. This is a straightforward idea that, hypothetically, could allow us to learn in environments with an infinite or combinatorial number of states. Cool, but nothing comes without a price. Now we have an interesting coupling between parameters and states: any of the model parameters usually affects estimates in all possible states. This fact is worth keeping in mind, because it may be a source of many problems. For now, let's concentrate on how to learn these value and action-value functions. So we want to be able to compute value and action-value functions given a raw state or state-action pair. This formulation sounds very similar to what is done in supervised learning.
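As a minimal sketch of this idea (the feature map, the weight layout, and the name `q_hat` are illustrative assumptions, not the lecture's model), a linear approximator can represent Q-values for all states of a cart-pole-like environment with a single small weight vector:

```python
import numpy as np

def q_hat(state, action, w, n_actions=2):
    """Linear action-value approximation: Q_hat(s, a; w) = w . phi(s, a).

    phi(s, a) places the state features into the slot for action `a`,
    so one weight vector w covers every state-action pair.
    """
    state = np.asarray(state, dtype=float)
    features = np.zeros(len(state) * n_actions)
    features[action * len(state):(action + 1) * len(state)] = state
    return float(w @ features)

# A cart-pole-like state: (cart position, cart velocity, pole angle, pole angular velocity).
w = np.zeros(8)  # 4 features x 2 actions: far fewer parameters than states
print(q_hat([0.0, 0.1, 0.05, -0.1], action=1, w=w))  # 0.0 before any learning
```

Note the coupling mentioned above: changing a single component of `w` shifts the estimate for every state sharing that feature slot, which is what makes generalization (and its pitfalls) possible.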
In supervised learning, we have an input X and want to predict some output Y. What we do is construct a loss function that penalizes mispredictions of Y, and optimize this loss over a training set, that is, over pre-collected pairs of X and Y. However, it is unclear how to transfer this supervised learning paradigm to reinforcement learning. We have not yet specified any loss function. And in reinforcement learning, we have no pre-collected pairs of states and true values. We have no training data, because we don't have the outcomes Y. And until an agent steps into the environment, we don't even have inputs: we don't actually know what states are possible in the environment. We actually have nothing at all. So let's dig a little bit deeper into this issue and develop two different approaches for applying supervised learning to reinforcement learning. In fact, in addition to dynamic programming, there are two core approaches in reinforcement learning which we will make use of and which allow our agent to learn model-free. The first, and the more intuitive to understand, is Monte Carlo learning, and the second is temporal difference learning. But before we jump into Monte Carlo learning, let's write down the definitions of the value and action-value functions. Recall that the value function is an expectation of the return, conditioned on the current state, and the action-value function is an expectation of the return, conditioned on the current state and action. These expectations are our goals in a perfect world. If we knew these goals for every state and action, we would be done. We could use any supervised learning method to approximate these goals, and once we had built such approximations, we could use any of the reinforcement learning algorithms you already know, such as SARSA, expected SARSA, or Q-learning. That is, given perfect approximate functions, the reinforcement learning task doesn't differ from the tabular setting.
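In symbols, the two definitions just recalled are:

```latex
V^{\pi}(s) = \mathbb{E}_{\pi}\!\left[\, G_t \mid S_t = s \,\right],
\qquad
Q^{\pi}(s, a) = \mathbb{E}_{\pi}\!\left[\, G_t \mid S_t = s,\ A_t = a \,\right],
\quad\text{where}\quad
G_t = \sum_{k=0}^{\infty} \gamma^{k}\, r_{t+k}.
```

These expectations are the ideal regression targets; everything that follows is about approximating them.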
But again, we are not given these goals, and nobody knows them beforehand. Well, if nobody knows the truth, an approximation is inevitable. We can approximate these targets, which are written as expectations. The most straightforward way to approximate an expectation is to replace it with its sample-based estimate. In the case of the value function, we could replace our goal with the sample return G sub t. Let us denote the goals as G of S and G of S and A for convenience. This sample return is just a sample of the cumulative discounted reward gained by a policy pi from state S. In precisely the same way, we can proceed with the action-value function: just replace the expectation with its sample-based estimate, that is, the cumulative discounted reward obtained by following policy pi after making action A in state S. Note that both of these sample-based estimates of the true goals are simply numbers, which are known completely at the very end of the episode. I say numbers to emphasize that they do not depend on the parameters of our approximation model; they don't depend on the parameters W. This assertion might seem obvious, but it is very important for the further discussion. So, that was super easy. We have now determined the training data for the supervised learning task: we have inputs, states and actions, and we have targets, the sample-based estimates of our goals. So can we start the learning process without worry? Well, both yes and no. Yes, because the supervised learning task is set up properly, and no, because it is not the best solution available. With Monte Carlo approximations, we would need too many samples to learn in an environment which lasts more than several hundred time steps. This is a direct consequence of the large variance inherent to Monte Carlo methods. Intuitively, the large variance is due to the sum of very many random variables, that is, the states, rewards, and actions at each time step in the environment.
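These Monte Carlo targets can be computed in one backward pass over a finished episode; a minimal sketch (the function name is an illustrative assumption):

```python
def monte_carlo_returns(rewards, gamma=0.99):
    """Sample returns G_t for every step of one finished episode.

    G_t = r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ...
    Computed backwards in a single pass. These targets are plain
    numbers: they depend on no model parameters, but they are
    available only once the episode has ended.
    """
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    return returns[::-1]

print(monte_carlo_returns([1.0, 1.0, 1.0], gamma=0.5))  # [1.75, 1.5, 1.0]
```

Each returned number is one sample of the expectation defining V or Q, which is why a long episode sums many random rewards and yields a high-variance estimate.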
All these variables are random, either because of stochasticity in the environment or due to stochasticity of the policy. Such slow learning might be inappropriate for many applications. Can we do better? Can we learn faster? Well, yes we can, and we can achieve this with the use of temporal difference learning. In theory, the question of which method is better for learning, temporal difference or Monte Carlo, is an open one. However, in practice, temporal difference methods are a lot more sample efficient than Monte Carlo methods. So let us now see how to approximate in the temporal difference setting. First, let's rewrite the definitions of the value and action-value functions in a convenient way: we unroll one iteration and decompose the return into a sum of the reward and gamma times the value of the next state. Just as we have done previously, we want to approximate these expectations with their sample-based estimates. But the last term, the value of the next state, is unknown to us. What can we do? Again, if we are to approximate, we can approximate this value too and use our current parametric estimate of the value of the next state. Note that we approximate three times: we approximate the value function with a model with parameters W, we approximate our target, the expectation, with a sample-based estimate, and third, within this sample-based estimate, we approximate the value of the next state with our model of the value function. Because our model is not a perfect one, these targets, unlike Monte Carlo targets, are biased; that is, they are not equal to the true targets in expectation. However, given a sufficient amount of experience, temporal difference methods will converge to the same estimates as Monte Carlo methods. The good news is that temporal difference targets allow the agent to learn as it plays the game: we don't need to wait until the very end of the episode to update our model.
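The one-step target just described can be sketched as follows (the helper names `td_target` and `v_hat` are illustrative assumptions, as is the linear value model used in the demo):

```python
def td_target(reward, next_state, done, v_hat, w, gamma=0.99):
    """One-step TD target: r + gamma * V_hat(s'; w).

    Bootstraps from the current parametric estimate instead of waiting
    for the episode to end; `done` zeroes the bootstrap term at terminal
    states. Unlike a Monte Carlo return, this target is a function of
    the parameters w through V_hat.
    """
    bootstrap = 0.0 if done else v_hat(next_state, w)
    return reward + gamma * bootstrap

# Illustrative linear value model (an assumption, not the lecture's model):
v_hat = lambda s, w: float(sum(wi * si for wi, si in zip(w, s)))
print(td_target(1.0, [0.5, 0.5], done=False, v_hat=v_hat, w=[1.0, 1.0], gamma=0.9))  # 1.9
```

Only one reward and one next state are random here, which is exactly where the variance reduction relative to Monte Carlo targets comes from.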
They also have a very small variance, because they depend only on the stochasticity in the reward and the next state, and not on the stochasticity of all the rewards and states until the end of the game. These methods are also free of the stochasticity of exploratory actions made after the immediate next state. All this makes temporal difference methods very attractive. But again, nothing comes without a price; we are going to uncover some difficulties with temporal difference methods in a couple of slides. For now, please note that the targets G in temporal difference methods are not numbers: they are functions of the parameters W, and it would be better to denote this dependence explicitly in the notation of the goals. The reason for not doing so will become clear in a couple of slides, stay tuned. Up to now, we have presented a way to approximate both the value and action-value functions, but you might have noticed that the idea is very similar for both of them. So, to keep the presentation simple, I'm going to talk only about the action-value function, that is, the Q-function, and I hope that you can figure out how to deal with the value function by analogy, if you need to. To recap, we have talked about what the inputs and outputs of our supervised learning problem could be, but we have not yet discussed what loss is appropriate for such a problem of approximating value functions. Let's talk about the loss now. As you may notice, rewards are real numbers, and thus returns, and of course expected returns, are also real numbers. For this reason, it is natural to impose a regression loss for learning the Q-function. There are plenty of losses for regression tasks available, among them mean squared error, mean absolute error, and the Huber loss, all of which could, and sometimes should, be applied to a reinforcement learning problem. But for the sake of clarity of exposition, let's focus now on the loss most popular in the reinforcement learning literature, that is, mean squared error.
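As a minimal sketch (the function name is an illustrative assumption), the mean squared error between the sample-based goals and our parametric estimates is simply:

```python
import numpy as np

def mse_loss(targets, estimates):
    """Mean squared error between regression targets (the sample-based
    goals g) and our parametric estimates Q_hat(s, a; w)."""
    targets = np.asarray(targets, dtype=float)
    estimates = np.asarray(estimates, dtype=float)
    return float(np.mean((targets - estimates) ** 2))

print(mse_loss([1.0, 2.0], [0.0, 0.0]))  # 2.5
```

Swapping in a mean absolute or Huber loss would only change the last expression; the overall setup stays the same.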
The mean squared discrepancy between the targets, also known as goals, and our estimates: that is our loss function. But why should we average discrepancies across all possible combinations of states and actions? We may be more interested in accurate estimation of the Q-function in some states and be almost indifferent to the Q-values of other states. For example, we don't care much about rarely visited, very bad states; it is sufficient for us to know that they are bad. This is a reasonable idea, because the estimate for each state and action can be improved only after experiencing that state and making that action, and nobody wants an agent to experience very bad states and make very bad actions over and over again just to increase the precision of bad action estimates. To express this idea, we multiply the discrepancies by the corresponding weights of importance, rho. This weight can be thought of as the fraction of time the policy spends in state S and makes action A in that state. We can also think about these weights, these fractions of time, as probabilities of visiting state S and making action A under policy pi. Note that we cannot compute this loss exactly using only finite experience: we would need to sum over all possible states and actions, but we might not know which states exist in the environment at all, and it is also not clear how to compute these weights of importance even for the states actually visited and the actions actually made by the policy. Also note that this loss formulation is in fact an expectation of the squared discrepancy, if we assume that states and actions are distributed according to the distribution rho. Well, if we cannot compute the loss, we can approximate it in a sample-based fashion, that is, sample from rho. You see, reinforcement learning is full of approximations. But how do we sample from rho? With such a definition of the weights of importance, this is in fact an easy task.
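Putting the pieces together (using g for the goals and standard notation, which may differ slightly from the slide's), the weighted loss and its sample-based approximation read:

```latex
\mathcal{L}(w)
= \sum_{s,a} \rho(s,a)\,\big(g(s,a) - \hat{Q}(s,a;w)\big)^2
= \mathbb{E}_{(s,a)\sim\rho}\Big[\big(g(s,a) - \hat{Q}(s,a;w)\big)^2\Big]
\approx \frac{1}{N}\sum_{i=1}^{N}\big(g_i - \hat{Q}(s_i,a_i;w)\big)^2,
```

where the N pairs \((s_i, a_i)\) are sampled from \(\rho\), which is what the next paragraph shows how to do.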
Samples of states and actions from the distribution rho can be obtained simply by collecting the states visited and the actions made by the policy pi in the environment. When we are not allowed to make actions in the environment, that is, if we learn off-policy, we can grab samples S and A from the experience generated by a behavior policy, hoping that the two distributions, the rho of the behavior policy and the rho of our policy pi, are close enough to provide a meaningful approximation. Let me remind you what off- and on-policy learning are. Off-policy is a setup where there are two policies: one that makes actions in the environment, called the behavior policy, and one that we are interested in and over which we have control, the target policy. The target policy is updated and improved to become closer and closer to the optimal policy. The behavior policy is out of our control; it may change over time, or it may not. We only require the behavior policy to have non-zero probabilities of making the actions that have non-zero probabilities under our target policy. This assumption is required to learn well in the off-policy scenario; otherwise there will be blind spots in our data, and we can't guarantee anything about the target policy at all. On-policy learning is a slightly easier task, because in this setup the target policy is the same as the behavior policy, so we are in control both of acting in the environment and of changing the policy. Through these changes, we can influence the data we are collecting and explore the areas of the environment which are of interest to us, not to somebody else. If an algorithm can learn off-policy, it can also learn on-policy, but not the other way around.
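A minimal sketch of the two roles (the ε-greedy choice for the behavior policy is an illustrative assumption, not the lecture's prescription): the behavior policy keeps a non-zero probability on every action, which is exactly the coverage requirement above, while the target policy is the greedy one we actually improve:

```python
import random

def behavior_policy(q_values, epsilon=0.1):
    """Epsilon-greedy behavior policy: with probability epsilon pick a
    random action, otherwise act greedily. Every action keeps non-zero
    probability, satisfying the off-policy coverage condition."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

def target_policy(q_values):
    """Greedy target policy: the one we update towards optimality."""
    return max(range(len(q_values)), key=lambda a: q_values[a])
```

With epsilon set to zero the two policies coincide and we are back in the on-policy setting, which is one way to see why off-policy is the more general case.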
Off-policy and on-policy may seem similar to online and offline learning. However, in reinforcement learning the words online and offline are more frequently used to refer to updating the policy during the episode, that is, online, or updating the policy only when the episode ends, that is, offline. So temporal difference methods can be described as online methods, while Monte Carlo methods can be described as offline methods.