For now, we shall stick to the banner ads example. It's the first one and arguably the simplest to understand. One more nice feature is that it's one of those rare problems where you get to compute the gradient of your money, the expected revenue you're going to get over the next month, with respect to the weights of a neural network, and optimize by gradient ascent, following the gradient of your money. How cool is that? So, okay. Banner ads. From now on, let's imagine that you're not a data scientist but an executive officer, say a CEO. You're the CEO of a startup which hosts a site that gets most of its revenue from banner ads. Of course, you have a huge daily active user count by now, and what you want to do is optimize: you want to squeeze as much dirty money from those users as you can. So what you're going to do as an executive officer is hire some data scientists, or pay some data scientists, to do that job for you. And it just turns out that two guys are willing to do the job. The first guy says, "Hello CEO, I've just invented a super mega method. It uses deep learning, blockchain, all those fancy words, to learn an optimal policy of banner placement, and it solves the problem, whatever it is." Then comes the second one and says, "Wow, this thing is total bullshit. Instead, you have this method which is much better. It uses, whatever, another fancy word, one buzzword, two buzzwords, three, four." And it does solve the bandit problem as well, although in a different way. So you have those two black-box methods, and being an executive officer, you basically don't know anything about black boxes. Instead, you have to somehow measure the efficiency, the business value, of those methods. So what are you going to do? How do you pick one of those, method A or method B? Well, yes. Right.
The obvious thing to do is to measure money: give each of those methods, say, five or ten percent of your users, let the rest run on your classic advertisement method, and see what happens after, say, a month of advertisement. Your task is obviously to get as much money as you can, but there are a few dilemmas to solve. For example, imagine that method A gives you a profit of, say, one million dollars each month. It starts giving it from day zero, and stays there. While method B starts by giving you almost no profit, but eventually creeps upwards and gets better than method A in, say, half a year. Or maybe method A does creep up as well, but not at this pace. This brings you to special cases where your decision might depend on whether you plan to stay in business for the next month, or the next year, or eternity. And on the theoretical side, this brings you to the notion of regret. Regret is basically how much money your algorithm wasted, or, put differently, how much more money you could have earned if your algorithm had used the optimal policy from the get-go. So now let's plot this eta, the regret value, as a function of time. At a given point in time, we want to sum up all the differences in reward between the optimal policy and your policy, from time step zero up to that time step. So, eta of 10 would be the sum over time steps zero, one, two, three, and so on, until you reach step 10. Inclusive or exclusive, it doesn't quite change what we're going to see. Since those differences are all non-negative and you're adding them up, the function is going to grow, or at least it's going to either grow or stay where it is if you've converged. And the curves are going to look something like this, actually. Let's begin with the blue curve. What you can see is that the regret, the y-axis value, starts very good for the blue curve. It's below everything else at iterations like 200, 400, and so on.
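The running sum just described can be sketched in a few lines of Python. This is a toy illustration, not anything from the actual experiment: `optimal_rewards` and `policy_rewards` are hypothetical per-step reward sequences.

```python
def cumulative_regret(optimal_rewards, policy_rewards):
    """Eta(t): running sum of per-step reward gaps from step 0 up to step t."""
    eta, total = [], 0.0
    for r_star, r in zip(optimal_rewards, policy_rewards):
        total += r_star - r  # gap is non-negative if r_star is truly optimal
        eta.append(total)
    return eta

# Toy example: the optimal policy earns 1.0 every step; ours earns less
# at first, then learns and matches it, so the regret curve flattens.
optimal = [1.0] * 6
ours = [0.6, 0.7, 0.9, 1.0, 1.0, 1.0]
print(cumulative_regret(optimal, ours))
```

Once the per-step gap hits zero, the curve stops growing, which is exactly the "horizontal line" behavior discussed next.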
But eventually, it exceeds that of any other curve and keeps on growing. What happened is, basically, that this strategy, the blue one, failed to converge to the optimal policy. This means that if the final policy is not optimal, then the difference in reward between your policy and the optimal policy is going to stay positive, and it's going to stack up, and grow, and grow over time, until, as the time step t goes to infinity, your regret also goes to infinity. So this is the theoretically bad case, where your policy is not optimal at any given moment of time. Now let's find out what's going to happen if the policy does get to the optimum. The red curve here, the second one, is quite different in how it behaves. It starts much worse: it accumulates a lot of regret early on, because it explores. It has to mix in suboptimal actions to get a better look at what the action space is like, how the rewards are distributed over all those actions. But eventually, it converges. In this case, you can see that it converges either asymptotically to an almost horizontal line, or to an exactly horizontal line. What this means is that it has finally found an optimal strategy after some point in time, and so the regret is basically a constant, it's fixed, after some number of iterations. This is, theoretically again, the best outcome: there's a guarantee that you're eventually going to get to what you want. Of course, if you get there faster, then your policy is even better. And there can be a middle ground, where your regret starts not that bad but then grows, just not linearly. Its growth rate decreases, approaching zero in the limit, which means the regret grows, say, logarithmically, while the blue curve grows linearly.
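To make those curve shapes concrete, here is a small self-contained simulation sketch. Everything in it is made up for illustration: a toy two-armed Bernoulli bandit with hypothetical arm probabilities 0.3 and 0.7, where a uniformly random policy never converges and racks up linear regret (like the blue curve), while an epsilon-greedy policy with decaying exploration bends toward horizontal (like the red and middle curves).

```python
import random

def regret_curves(steps=2000, probs=(0.3, 0.7), seed=0):
    """Expected cumulative regret of two toy strategies on a 2-armed bandit.

    'random' keeps picking arms uniformly forever, so its regret grows
    linearly; 'eps_greedy' explores with probability (t + 1) ** -0.5,
    less and less over time, so its regret growth slows down.
    """
    best = max(probs)
    curves = {}
    for name in ("random", "eps_greedy"):
        rng = random.Random(seed)
        counts, sums = [0, 0], [0.0, 0.0]
        regret, curve = 0.0, []
        for t in range(steps):
            if name == "random" or rng.random() < (t + 1) ** -0.5:
                arm = rng.randrange(2)  # explore: pick a uniformly random arm
            else:
                # exploit: pick the arm with the best empirical mean reward
                means = [sums[i] / counts[i] if counts[i] else 0.0
                         for i in range(2)]
                arm = max(range(2), key=lambda i: means[i])
            reward = 1.0 if rng.random() < probs[arm] else 0.0
            counts[arm] += 1
            sums[arm] += reward
            regret += best - probs[arm]  # expected (not sampled) per-step gap
            curve.append(regret)
        curves[name] = curve
    return curves

curves = regret_curves()
print("uniformly random, final regret:", round(curves["random"][-1], 1))
print("epsilon-greedy,   final regret:", round(curves["eps_greedy"][-1], 1))
```

The random strategy loses about 0.2 in expectation on every step forever, while the epsilon-greedy curve flattens once it locks onto the better arm; plotting the two lists against the step index reproduces the linear-versus-converging picture from the lecture.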
And this, again, can only be derived by looking at some toy tasks and particular properties of your algorithm that can be exploited in a mathematical derivation. In practice, however, we're going to see something much more rough around the edges, much less curvy, and you'll see a step-like ascending trajectory in an empirical plot that you obtain by actually showing your banners to users. But let's get further.