Previously, you have seen how to effectively deal with missing values within the basic matrix factorization approach. Now you can do something even more interesting: include user and item biases in the model. Let me briefly remind you what those biases are. If, for example, a user has assigned a rating of 3 to some movie, can you tell whether the movie is actually not so good? Well, if all other ratings provided by this user are higher, then probably yes. But what if the user almost never rates movies with five stars, waiting for something truly exceptional? Such a demanding user may have a different rating pattern, shifted towards lower ratings. A similar intuition works for items as well. For example, popular items are likely to receive higher ratings. You can conclude that the same rating value may carry different information, depending on various biases. In fact, most of the signal is contained within the user and item biases, and taking them into account may improve the quality of recommender systems.

One of the greatest benefits of the factorization approach is the ability to flexibly modify the way the utility function is defined while staying in the same computational framework. As you remember, in the SVD model you estimated the missing values manually and then used them as a substitution for the unknowns. But now, within your matrix factorization approach, you can directly add all the needed biases to your utility function and let the model figure out on its own what those biases actually are. In other words, in this new model you assume that the user biases b_i and the item biases b_j are unknown, and they become new model parameters along with the factors p and q. The global average mu is pre-computed in order to make the effect of the other biases more pronounced. The new optimization objective has a few more variables but is still very intuitive: there are more parameters to estimate, but this doesn't change the general optimization procedure.
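To make the model concrete, here is a minimal NumPy sketch of the biased utility function r_hat = mu + b_i + b_j + p_i.q_j, the regularized squared-error objective over the observed ratings, and plain SGD updates for all parameters. The tiny rating set, dimensions, and hyperparameters below are made up for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, k = 4, 5, 2

# Hypothetical observed ratings as (user, item, rating) triples;
# everything else in the rating matrix is missing.
ratings = [(0, 1, 4.0), (0, 3, 2.0), (1, 1, 5.0), (2, 0, 3.0), (3, 4, 1.0)]

mu = np.mean([r for _, _, r in ratings])      # pre-computed global average
b_u = np.zeros(n_users)                       # user biases b_i (learned)
b_i = np.zeros(n_items)                       # item biases b_j (learned)
P = 0.1 * rng.standard_normal((n_users, k))   # user factors p
Q = 0.1 * rng.standard_normal((n_items, k))   # item factors q

def predict(u, i):
    """Biased utility: mu + user bias + item bias + factor dot product."""
    return mu + b_u[u] + b_i[i] + P[u] @ Q[i]

def objective(lam=0.05):
    """Regularized squared error over the observed ratings only."""
    err = sum((r - predict(u, i)) ** 2 for u, i, r in ratings)
    reg = lam * (np.sum(b_u**2) + np.sum(b_i**2) + np.sum(P**2) + np.sum(Q**2))
    return err + reg

def sgd_epoch(lr=0.02, lam=0.05):
    """One SGD pass; the two bias updates are the only new pieces vs. plain MF."""
    for u, i, r in ratings:
        e = r - predict(u, i)                 # prediction error
        b_u[u] += lr * (e - lam * b_u[u])
        b_i[i] += lr * (e - lam * b_i[i])
        p_old = P[u].copy()
        P[u] += lr * (e * Q[i] - lam * P[u])
        Q[i] += lr * (e * p_old - lam * Q[i])
```

Running `sgd_epoch` repeatedly drives the objective down, with the biases absorbing each user's and item's systematic shift from the global average before the factors model the residual interactions.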
As in the previous case, you can use approximate partial derivatives to make the computations more efficient, employing stochastic gradient descent to optimize the objective. After differentiating with respect to the different types of parameters, you obtain a new system of update equations. As you can see, the only difference is that two additional terms, related to the user and item biases, are added.

Although this algorithm is not implemented in Spark out of the box, there are a few open-source projects devoted to it. There is also an efficient parallel implementation of the algorithm called Hogwild!. This matrix factorization approach became popular after it was published by Simon Funk when he participated in the Netflix Prize competition. Because of that, this algorithm is sometimes also called Funk SVD. I hope you won't be confused by the name and can easily explain the main differences between this approach and the actual SVD method. I also hope that you now have enough confidence to make your own exploration of the field of matrix factorization algorithms. It might be a good exercise to check your understanding, for example, by looking into the famous SVD++ and timeSVD models.

Now, as you have seen, there is one aspect common to all the previously discussed models: all of them minimize the squared error. In other words, these algorithms are designed to optimize the RMSE metric, which is suitable for the rating prediction task. However, more often you are interested not in the specific rating values, but in a good list of top-N recommendations, and for this task metrics like NDCG and MAP are more appropriate. On the other hand, it turns out that the optimum in terms of an error-based metric like RMSE doesn't necessarily correspond to the optimum in terms of the ranking-based metrics. Of course, as you have previously seen, it is possible to simply sort items by their predicted ratings. However, this is not what you initially optimized for.
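The mismatch between error-based and ranking-based metrics is easy to see on a toy example (all numbers below are made up): one model can achieve a lower RMSE than another while ordering the items worse.

```python
import math

def rmse(pred, truth):
    """Root mean squared error between predicted and true ratings."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, truth)) / len(truth))

def inversions(pred, truth):
    """Count item pairs whose predicted order contradicts the true order."""
    n = len(truth)
    return sum(1 for a in range(n) for b in range(n)
               if truth[a] > truth[b] and pred[a] <= pred[b])

truth   = [5.0, 3.0]   # item 0 should be ranked above item 1
model_a = [3.8, 4.0]   # small per-rating errors, but the wrong order
model_b = [2.0, 1.0]   # large per-rating errors, but the correct order
```

Here `model_a` wins on RMSE yet produces an inverted top-N list, while `model_b` loses on RMSE yet ranks the items perfectly.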
You can actually divide all those loss functions into three different groups. The first one is called pointwise, and you are already familiar with it. The second one is pairwise. Can you spot the main difference between this loss function and the pointwise loss? In the pairwise case, the function inside the summation operates solely on the scores predicted by the model, while the actual values are only used to enforce the ordering. The main goal of this approach is to minimize the number of inversions, that is, cases when the predicted order of two items doesn't correspond to their actual order. Finally, the listwise loss operates over lists or sets of items. This one is especially suitable for optimizing ranking metrics such as MAP and NDCG. Both pairwise and listwise approaches are members of an important family of methods called learning to rank.

However, a better optimization objective in terms of the top-N recommendation task comes at the cost of higher computational complexity. Depending on the actual model, one may have to use additional computational tricks and simplifications in order to avoid exhaustive sorts and deal with the non-smooth nature of the metrics, which is why pointwise methods are still popular.

To summarize, you now know how to extend basic matrix factorization and include user and item biases in it. You also know about the different types of optimization objectives and can explain the key differences between them.
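As a final aside, the pairwise idea can be sketched in plain Python with a logistic pairwise loss (the scores and ratings below are made-up toy values). Note how the actual ratings only decide which pairs to compare, while the loss itself operates on the predicted scores.

```python
import math

# Hypothetical predicted scores and actual ratings for one user's items.
predicted = {"a": 0.9, "b": 0.4, "c": 0.7}
actual    = {"a": 5.0, "b": 3.0, "c": 4.0}

def pairwise_loss(pred, truth):
    """Sum -log sigmoid(s_u - s_v) over pairs the truth orders as u > v.
    Only the ordering of `truth` is used; the loss sees predicted scores."""
    items = list(truth)
    return sum(math.log1p(math.exp(-(pred[u] - pred[v])))
               for u in items for v in items if truth[u] > truth[v])

def inversions(pred, truth):
    """Count pairs whose predicted order contradicts the actual order."""
    items = list(truth)
    return sum(1 for u in items for v in items
               if truth[u] > truth[v] and pred[u] <= pred[v])
```

Minimizing this loss pushes the score of each preferred item above the score of each less-preferred one, which directly reduces the number of inversions in the resulting top-N list.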