In this video we will cover Boosting and Gradient Boosted Trees. In order to understand Gradient Boosted Trees, you need to understand Boosting first. Boosting is also known as Stagewise Additive Modeling. The idea is, in contrast to Random Forest, where all the decision trees are created in parallel, that in Boosting one model is trained after the other, each based on the outputs of the previous models. Again, this is an ensemble of weak models. So let's consider h_i(x) to be a weak model, where i is the index of the model. Then you can define a strong learner F(x) as the sum of all the weak models, weighted by gamma_i: F(x) = sum_i gamma_i * h_i(x).

Here is another notation which is maybe a bit more intuitive. If you want to define the model F_m(x), it is equal to F_{m-1}(x) plus gamma_m times h_m(x): F_m(x) = F_{m-1}(x) + gamma_m * h_m(x), where F_{m-1}(x) is the previous model and gamma_m * h_m(x) is the weighted current model. Here gamma_m is a learned weight, and don't be scared: it looks a bit complicated, but it's relatively easy to understand how we derive gamma_m. We are taking the arg min: gamma_m = arg min_gamma sum_n L(y_n, F_{m-1}(x_n) + gamma * h_m(x_n)). If you have never seen that notation, arg min means that we are optimizing, or in this case minimizing, a function with respect to gamma.

Then we have to understand what L means. L is basically the sum of squared errors, L = 1/(2N) * sum_n (y_n - y_hat_n)^2. That's a measure of how good or how bad we are doing; the differences y_n - y_hat_n are called residuals. So we take the correct value y_n, we subtract the predicted value, we square it, we add it up over all our training examples, and we normalize. We will see in a moment why we have 1/(2N) and not 1/N: it simply makes taking the derivative easier.

So let's actually take the derivative of that function. Suppose for a moment the prediction is linear, y_hat_n = sum_j w_j * x_nj. We take the partial derivative with respect to w_j, and it looks like this: dL/dw_j = -1/N * sum_n (y_n - y_hat_n) * x_nj. Let's have a look at each individual component. The reason we can just move the derivative into the sum is the rule that the partial derivative of a sum is the sum of the partial derivatives.
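To make the stagewise update and the learned weight concrete, here is a minimal numpy sketch with made-up toy numbers for y, the previous model F_{m-1}, and the new weak learner h_m. For squared-error loss, the arg min over gamma even has a closed form (set dL/dgamma = 0 and solve):

```python
import numpy as np

# Hypothetical toy values: true targets, previous-model predictions, weak-learner outputs.
y = np.array([3.0, -0.5, 2.0, 7.0])
F_prev = np.array([2.0, 0.0, 1.0, 5.0])   # F_{m-1}(x_n) for each training example
h = np.array([1.5, -1.0, 0.5, 3.0])       # h_m(x_n), the new weak learner

def loss(gamma):
    """L(gamma) = 1/(2N) * sum_n (y_n - (F_{m-1}(x_n) + gamma * h_m(x_n)))^2"""
    return np.mean((y - (F_prev + gamma * h)) ** 2) / 2

# Setting dL/dgamma = 0 gives gamma* = sum_n r_n h_n / sum_n h_n^2,
# where r = y - F_{m-1} are the residuals of the previous model.
residuals = y - F_prev
gamma_star = residuals @ h / (h @ h)

# The stagewise additive update: F_m = F_{m-1} + gamma* * h_m
F_new = F_prev + gamma_star * h

print(gamma_star)
```

With these numbers the fitted weight does at least as well as simply adding the weak learner unweighted (gamma = 1), which is the whole point of learning gamma_m per stage.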
Then we pull the exponent 2 to the front, because of the rule that the derivative of x to the power of n equals n times x to the power of n - 1; together with the factor 1/2 this is exactly what cancels. The last thing that we have to understand is why we put this x_nj here. x_nj is the term of the weak learner that multiplies w_j, and since we are taking the partial derivative of L with respect to w_j, the chain rule says that we need to pull out x_nj as the derivative of the inner term, whereas all terms that do not contain w_j become constants and drop out. So those are the rules which we are using here; the second rule is the chain rule for partial differentiation. So now we have learned what L is: we take the sum over all training examples, and inside we have y_n, which is the real value, and the prediction y_hat based on x. That is the term that we minimize over gamma to obtain an optimal value for gamma_m.

So now let's have a look at how gradient boosted trees work. Let's consider our training data set. We start with a weak learner h_0, and we just take the mean of y as the prediction: F_0(x) = mean(y). Of course, this is not a very good model; therefore, we have error terms: epsilon_1 = y - F_0(x). The output is basically a new training data set. Those values are called residuals; a residual is basically the error of the previous model. So now we define a new model F_1(x) based on F_0(x) plus a new weak learner h_1, which we train on the residuals. And we continue this step: we compute the error of the newly obtained model, epsilon_2 = y - F_1(x), and again we use that to train a weak learner, h_2, and we just continue until convergence.

Gradient boosted trees often outperform Random Forest, but they are computationally more expensive because of their sequential nature. Remember, in Random Forest you can in theory train all the decision trees in parallel on the samples which you have obtained by bootstrapping.
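The loop just described (start from the mean, fit a weak learner to the residuals, add it to the model, repeat) can be sketched in plain numpy. This is a hypothetical toy example: the weak learner is a depth-1 regression tree (a stump), and instead of learning gamma_m per stage we use a small fixed weight of 0.1 for simplicity:

```python
import numpy as np

def fit_stump(x, r):
    """Weak learner h_m: a depth-1 regression tree (stump) fit to residuals r."""
    best = None
    for t in np.unique(x):
        left, right = r[x <= t], r[x > t]
        if len(left) == 0 or len(right) == 0:
            continue
        pred_l, pred_r = left.mean(), right.mean()
        sse = ((left - pred_l) ** 2).sum() + ((right - pred_r) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, t, pred_l, pred_r)
    _, t, pl, pr = best
    return lambda z: np.where(z <= t, pl, pr)

# Made-up 1-D regression data
rng = np.random.default_rng(0)
x = np.linspace(0, 6, 80)
y = np.sin(x) + rng.normal(0, 0.1, x.size)

# F_0(x): just the mean of y
F = np.full_like(y, y.mean())
for m in range(50):
    residuals = y - F        # epsilon_m = y - F_{m-1}(x)
    h = fit_stump(x, residuals)
    F = F + 0.1 * h(x)       # stagewise additive update with fixed gamma = 0.1

mse_start = np.mean((y - y.mean()) ** 2)
mse_final = np.mean((y - F) ** 2)
print(mse_start, mse_final)
```

Each stage is trained on the residuals of the model so far, which is exactly the sequential dependence that makes boosting hard to parallelize.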
But here, each iteration depends on the residuals of the previous iteration; therefore, it's hard to parallelize. But there's hope: there is a specific implementation called XGBoost, which is very fast, and it has a slightly different implementation which makes it even more resistant to overfitting. So in summary, Boosting is different from bootstrapping because we are taking the errors into account while we are building up the model. In particular, we create decision trees which are specifically dedicated to the training data where our previous model is doing a bad job. And how strongly each weak learner, each addition to the model per iteration, contributes is controlled by an additional learned weight, gamma_m.
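In practice you would reach for a library rather than hand-rolling the loop. As a quick sketch (the data here is made up, and scikit-learn is assumed to be installed), this is what fitting a gradient boosted tree model looks like with scikit-learn's GradientBoostingRegressor; XGBoost's XGBRegressor exposes a very similar fit/predict interface:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Hypothetical toy data
rng = np.random.default_rng(42)
X = rng.uniform(0, 6, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.1, 200)

# n_estimators = number of boosting stages m;
# learning_rate scales each stage's contribution (the role gamma_m plays above)
model = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1, max_depth=2)
model.fit(X, y)

mse = np.mean((model.predict(X) - y) ** 2)
print(mse)
```

Note that this measures training error only; in practice you would evaluate on held-out data to check for overfitting.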