The general idea of how reinforcement learning operates is that there is some kind of decision process, and we've already seen such a decision process in the banner ads example. In this case, you have a website, which is your agent, something that can take actions. This website repeatedly observes users who browse particular pages, and it can take an action. Here, its action is to pick a banner and display it when the page renders. Finally, the agent gets feedback whenever it shows a banner to a user. The feedback, in this case, is binary: either the user clicks or he doesn't. Of course, there are a number of similar problems. For example, you can replace banner ad placement with movie recommendations, or you can even consider medical treatment. In that case, a patient goes to a doctor; the doctor is your agent, who observes the patient's symptoms and applies a treatment, and then gets feedback on how well the patient felt after the treatment, or whether he survived or not. Now, let's consider a practice problem. Say you have an online shop, and in this shop you're selling not movies but, let's say, books. You have a user base that's mostly registered, so they have entered some personal data about themselves. What you want is to sell them as many books as possible, and you want them to be satisfied with the books they buy. To achieve this, you want to develop a system that recommends books to your users. So, if you're trying to optimize both your revenue and your users' happiness, how would you define the observation that your agent gets? What would you include there? What kind of actions can your agent take? And finally, what kind of feedback will it receive, and what exactly do you want to optimize?
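The observe, act, and receive-feedback loop described above can be sketched in a few lines of Python. This is a minimal toy, assuming a hypothetical `BannerEnvironment` whose per-banner click probabilities are made up and hidden from the agent; `Observation`, `run_episode`, and the feature names are illustrative placeholders, not part of any library.

```python
import random
from dataclasses import dataclass


@dataclass
class Observation:
    # Hypothetical user features visible to the website when a page loads.
    age: int
    interest: str


class BannerEnvironment:
    """Toy stand-in for the site's users; click probabilities are assumed."""

    def __init__(self, click_probs, seed=0):
        self.click_probs = click_probs  # one hidden click probability per banner
        self.rng = random.Random(seed)

    def observe(self):
        # A new user opens a page; these features are random placeholders.
        return Observation(age=self.rng.randint(18, 70),
                           interest=self.rng.choice(["sport", "tech", "books"]))

    def feedback(self, obs, action):
        # Binary feedback: 1 if the user clicks the shown banner, else 0.
        return int(self.rng.random() < self.click_probs[action])


def run_episode(env, policy, n_steps):
    """Repeat observe -> act -> feedback n_steps times; return total clicks."""
    total_clicks = 0
    for _ in range(n_steps):
        obs = env.observe()
        action = policy(obs)  # index of the banner to display
        total_clicks += env.feedback(obs, action)
    return total_clicks


def random_policy(obs):
    # Baseline agent: ignores the observation and picks one of 3 banners.
    return random.randrange(3)


if __name__ == "__main__":
    env = BannerEnvironment(click_probs=[0.02, 0.05, 0.01])
    print(run_episode(env, random_policy, n_steps=1000))
```

For the book shop, the same skeleton applies: `Observation` would hold the user's profile and purchase history, each action would be a book index, and `feedback` could return a like/dislike or the revenue from a purchase.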
Of course, this is more or less obvious, but there are a lot of possible choices that are all correct. For example, you can define user features in a great many ways. Generally, you want to include the user's interests, his age, his gender, and maybe the books he has bought previously and whether he liked them or not. You also want to include any other information, like his social network account, if you have access to one. For actions, you have books: each action corresponds to recommending a particular book to your customer whenever he visits your site. And the feedback is whether the user liked this book, or maybe how much revenue you got from it, depending on your greed for money. This particular problem is usually referred to as the multi-armed bandit problem. The name originates from gambling: you can imagine yourself not assigning the optimal banner to each user, but gambling in a casino. In this case, showing each banner is like pulling the lever of a slot machine. You want to find the slot machine that brings you the highest reward, or, in the banner case, the banner that gives you the highest probability of the user clicking and, in turn, bringing you money. The same goes for recommendations, except that instead of showing banners to earn money explicitly, you recommend movies to the user so that he watches them and becomes happy, and the overall happiness of humanity grows. Of course, this problem should not be taken literally as a gambling problem, because in any casino you can be more or less certain from the outset that every slot machine is stacked against you. So please don't try reinforcement learning there. Now, the only issue is that this formulation is incomplete in many cases.
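To make the bandit framing concrete, here is a minimal sketch of a Bernoulli multi-armed bandit in which each arm is a banner (or a book) and the reward is a click. The lecture does not prescribe a strategy at this point; the sketch uses epsilon-greedy, one common and simple choice, purely for illustration. The function name `epsilon_greedy_bandit`, the `epsilon` parameter, and the click probabilities are assumptions.

```python
import random


def epsilon_greedy_bandit(true_probs, n_steps, epsilon=0.1, seed=0):
    """Play a Bernoulli multi-armed bandit with an epsilon-greedy agent.

    true_probs: hypothetical reward (click) probability of each arm; the agent
    never reads these directly, they are only used to simulate feedback.
    Returns (estimated value per arm, pull count per arm, total reward).
    """
    rng = random.Random(seed)
    n_arms = len(true_probs)
    counts = [0] * n_arms
    values = [0.0] * n_arms  # running mean reward of each arm
    total_reward = 0
    for _ in range(n_steps):
        if rng.random() < epsilon:
            arm = rng.randrange(n_arms)  # explore: pull a random lever
        else:
            arm = max(range(n_arms), key=lambda a: values[a])  # exploit
        reward = int(rng.random() < true_probs[arm])  # click or no click
        counts[arm] += 1
        values[arm] += (reward - values[arm]) / counts[arm]  # incremental mean
        total_reward += reward
    return values, counts, total_reward


if __name__ == "__main__":
    values, counts, total = epsilon_greedy_bandit([0.02, 0.05, 0.01], n_steps=10_000)
    print("estimates:", [round(v, 3) for v in values])
    print("pulls:", counts, "total clicks:", total)
```

A higher `epsilon` makes the agent try other arms more often; with `epsilon=0` it always pulls the arm that currently looks best, which can lock it onto a poor arm early.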
Remember the clickbait problem: if you show users banners that are explicitly designed to be clicked often but that leave the user dissatisfied after he has clicked on them, you'll probably see high click-through rates initially, but then you'll lose your whole user base because they're frustrated by your advertising campaign. So there is one missing link here, and it's about how your agent can affect the environment that supplies it with observations. In online advertising, you have this user base (represented here by the planet Earth), and whenever you take a particular action, you may affect your user base, either by turning some user into your fan and seeing him more often, or by scaring him off with aggressive advertising, losing this user, and getting less revenue because of that.
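This missing link can be illustrated with a deliberately simple expected-value model in which the chosen action changes the future environment. All numbers and the `expected_clicks` helper are hypothetical: the clickbait banner is assumed to get more clicks per impression but to drive some users away for good, while the honest banner is assumed to get fewer clicks and lose nobody.

```python
def expected_clicks(click_prob, churn_prob, n_users, n_rounds):
    """Cumulative expected clicks when the same banner type is shown every round.

    Toy model (all parameters hypothetical): each active user sees one banner
    per round, clicks with probability click_prob, and afterwards leaves the
    site permanently with probability churn_prob.
    Returns a list whose t-th entry is the expected total after t + 1 rounds.
    """
    users = float(n_users)
    total = 0.0
    history = []
    for _ in range(n_rounds):
        total += users * click_prob
        users *= 1.0 - churn_prob  # the action shrinks the future user base
        history.append(total)
    return history


if __name__ == "__main__":
    clickbait = expected_clicks(click_prob=0.10, churn_prob=0.05, n_users=1000, n_rounds=100)
    honest = expected_clicks(click_prob=0.04, churn_prob=0.0, n_users=1000, n_rounds=100)
    print("after 1 round:   ", clickbait[0], honest[0])
    print("after 100 rounds:", clickbait[-1], honest[-1])
```

With these made-up numbers the clickbait banner is ahead in the first round, but the honest one overtakes it after a few dozen rounds, because clickbait keeps shrinking the user base that future observations come from. A plain bandit formulation, which treats every round as independent of past actions, cannot capture this effect.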