[MUSIC] Welcome to the last week of our reinforcement learning course. Today, we're going to finally tackle one issue we've been carefully avoiding, only covering at a minimal level in weeks one through five of the course. Namely, the exploration versus exploitation problem, as the title kind of suggests.

We've already learned at least, well, three major ways of how we can build reinforcement learning algorithms. There are the black box methods like the crossentropy method, evolution strategies, or any other method that basically treats the reinforcement learning problem as black-box optimization or something similar. Then we tackled a few major algorithms in the so-called value-based family, namely Q-learning, SARSA, and value iteration. Actually, many more than those, I'd say. We then learned how to approximate those values with, well, neural networks or any other approximators. And we finally learned how to perform direct optimization on the policy as a distribution in the previous week, with REINFORCE and similar algorithms.

All those algorithms have one common problem: they only learn by trying actions and seeing which of them work better. So even in Q-learning, if you only ever pick the same action, you're never going to learn anything. Of course, we've already learned some made-up heuristics for how to speed up this process, like epsilon-greedy exploration. And today, we're going to, well, study a few more strategies, which are much more efficient and much better studied, theoretically. So, exploration versus exploitation. Welcome to this week.

The next thing we're going to do is simplify our problem a little bit, or quite a bit actually. Instead of starting with the usual Markov decision process we've been dealing with from, like, week two to week five, this time we're only going to consider a single-step decision process, also known as a multi-armed bandit. The difference here is that your agent only sees one first observation, or state. He picks an action. He gets some feedback, and then the whole session is bound to terminate. A new session starts over, and the next observation the agent sees in the new session is completely independent of his actions in the previous one. This is, of course, a grossly oversimplified view of any practical problem. But for starters, it is very useful in situations where you have to cut some corners in your formulation. And besides, it's much simpler to describe advanced exploration in this setting; the formulas get much shorter. We'll of course get back to the full Markov decision process later in this week. But for now, and until midway through this section, consider everything we say to be about this single-step decision process.

As I have mentioned to you already, those single-step decision processes usually go by the name of multi-armed bandits. The story behind this name is quite simple and weird at the same time. The original multi-armed bandit is a situation like this. Picture yourself in the middle of a gaming club, a gambling club actually. And before you stand six slot machines, like in this picture. Each of those slot machines has a different set of rules describing how they pay you and how much money they take. Basically, what's the outcome, or what distribution of rewards do they give you? Your objective is to find the slot machine which gives you the highest expected reward. Of course, there might be other formulations, depending on what you want to achieve from this gambling club. But for now, let's consider the expected reward problem. Now, one could pick just one machine and stick to it.
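To make the epsilon-greedy heuristic mentioned above concrete, here is a minimal Python sketch, not from the lecture itself: six "slot machines" with made-up payout probabilities, a running-mean estimate of each machine's expected reward, and an agent that explores randomly with probability epsilon and otherwise exploits its current best guess.

    import numpy as np

    # Minimal epsilon-greedy on a six-armed bandit (payout probabilities are made up).
    np.random.seed(0)
    true_payout_probs = np.array([0.02, 0.05, 0.03, 0.08, 0.01, 0.04])
    n_actions = len(true_payout_probs)

    q_estimates = np.zeros(n_actions)    # running estimate of expected reward per machine
    action_counts = np.zeros(n_actions)
    epsilon = 0.1

    for step in range(10_000):
        if np.random.rand() < epsilon:
            action = np.random.randint(n_actions)     # explore: try a random machine
        else:
            action = int(np.argmax(q_estimates))      # exploit: current best-looking machine
        reward = float(np.random.rand() < true_payout_probs[action])  # 1 if it "pays out"
        action_counts[action] += 1
        # incremental mean update: Q <- Q + (r - Q) / N
        q_estimates[action] += (reward - q_estimates[action]) / action_counts[action]

    print("estimated expected rewards:", q_estimates.round(3))
    print("best machine found:", int(np.argmax(q_estimates)))

With epsilon around 0.1, roughly one pull in ten is spent on a random machine no matter how much we already know; that crude trade-off is exactly what the better-studied strategies of this week try to improve on.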
Because, well, God knows what the other machines bring. But a rational person, if he has a lot of money, could try to explore some of those machines, or all of those machines first, give them a few tries, and only then converge to the one that works better. Of course, you are highly recommended not to apply this strategy to an actual gambling situation, because in any real casino all those slot machines are going to be stacked against you; otherwise the owners of the casino would get no profits. But in many real situations outside of gambling, it's often possible to reduce the problem to something similar to the multi-armed bandit.

Now imagine you are not trying to gamble, but instead you are solving an optimal banner ad placement problem. You have a user, who in this case is mapped to the gambling club, and for this user you have ten potential banners, each of which he can either click or not click. In this case, if you show the user a particular banner, this maps to you going to a particular slot machine and, well, inserting a dollar into it and seeing what happens. A dollar here is basically your time and electricity, or the opportunity you have lost or could have gained. Then the user is either going to click, which is similar to a slot machine giving you a jackpot or, whatever, giving you some money, or not click, which is the same as, well, you wasting your money on the slot machine. And you have to find the policy of showing a particular user some set of banners, so that he has the highest probability of clicking on them, or so that it brings you the highest profit in expectation, of course.

Of course, the multi-armed bandit formulation is limited. It completely neglects the effects of the agent's actions on his next observations. You wouldn't get far if you tried to play Atari or make a self-driving car with this multi-armed bandit framework. It turns out, however, that many practical problems can be solved by bandits with almost no loss. We just went through the banner ad example, where the state was the user you are trying to get to click on your ads, your action was the pick of a particular banner to show this user, and the reward was the expected revenue, or just the revenue, for the particular case where you show the user this banner. It's the click rate multiplied by the amount of money you are paid per click.

With that, you can also try to solve the recommendation system problem. Say you're an online store where users buy stuff, and you also suggest they buy more stuff that they might also be interested in. In this case, your state is again a user, or whatever entity buys stuff from you, maybe it's a robot, and the action is to recommend this user a particular item. In this case, what do you think a suitable kind of feedback would be? Yes, if it's an online store where you try to make as much money as possible, it makes sense to measure feedback as basically the amount of profit you make from the items you have recommended, if users have bought them. So again, you can simply compute this as an expectation. If it's a different thing, say a free site like YouTube, then you could instead consider measuring some kind of user satisfaction, user retention, or any other complicated metric of how satisfied the user is with your service. In the case of, again, YouTube, this could be the fraction of users that liked the recommendation or that have watched the whole video from start to end. Of course, in any practical application, you can come up with more efficient ways of measuring this.
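To illustrate the "click rate multiplied by the amount you are paid per click" reward, here is a tiny Python sketch with made-up numbers; the three banners, their estimated click rates, and the per-click payments are all hypothetical.

    import numpy as np

    # Banner ad placement as a bandit: each action is "show banner i" (made-up numbers).
    estimated_click_rates = np.array([0.011, 0.034, 0.020])  # estimated from past impressions
    pay_per_click = np.array([0.50, 0.20, 0.35])             # dollars you are paid per click

    expected_revenue = estimated_click_rates * pay_per_click  # expected reward of each action
    best_banner = int(np.argmax(expected_revenue))
    print("expected revenue per banner:", expected_revenue.round(4))
    print("greedy choice: show banner", best_banner)

The exploration question is then the same as with the slot machines: the click-rate estimates only get better for the banners you actually show, so a purely greedy choice can lock you onto a mediocre banner forever.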
But this is not the subject of our present lecture, so let's go on. A few less obvious examples of such problems are, for example, information retrieval. This too can be solved by bandits. In this case, your state is the user's query, and your action is to show the user a particular set of, well, search engine responses for this query. The feedback is whether the user was satisfied with the search, whether he has found what he was looking for. And again, it's not that easy to measure the user's satisfaction without having [INAUDIBLE]. Of course, these are all huge problems to tackle in one lecture, so let's just go on with it. [MUSIC]
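For the information retrieval example, a minimal sketch of the same idea might look like the following Python code; it is purely illustrative, assuming three candidate rankings per query and a binary satisfaction signal, and keeps per-query statistics so that the choice of ranking is epsilon-greedy in each state.

    from collections import defaultdict
    import random

    # Information retrieval as a bandit: state = query, action = which ranking to show,
    # reward = a satisfaction signal (hypothetical, e.g. 1 if the user found what he wanted).
    N_RANKINGS = 3
    satisfaction_sum = defaultdict(lambda: [0.0] * N_RANKINGS)  # query -> reward sum per ranking
    shown_count = defaultdict(lambda: [0] * N_RANKINGS)         # query -> times each ranking shown

    def choose_ranking(query, epsilon=0.1):
        if random.random() < epsilon:
            return random.randrange(N_RANKINGS)                 # explore
        means = [s / c if c else 0.0
                 for s, c in zip(satisfaction_sum[query], shown_count[query])]
        return max(range(N_RANKINGS), key=lambda a: means[a])   # exploit best-looking ranking

    def record_feedback(query, ranking, satisfied):
        shown_count[query][ranking] += 1
        satisfaction_sum[query][ranking] += float(satisfied)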