A few Recommendations for a Data Scientist who wants to get started in Recommender Systems
As a Data Scientist, you are expected to be able to build all sorts of data products: sometimes simple yet highly valuable business trends extracted through data querying and cleansing, and sometimes more sophisticated Machine Learning algorithms for prediction, classification, or even recommendation. However, the cold start in a specific topic may be tough, especially for Data Scientists with no experience in a domain-specific problem. Thinking for the very first time about which metric you need, what the right feature engineering is, how to choose a baseline, and so on, may not be an easy task.
In this article, we answer questions related to the “What” and “How” in Recommender Systems, based on a real request from a colleague at BBVA who was interested in developing a complete RS product from scratch.
What type of data do you have? And how large is it?
Most likely, you will be dealing with implicit feedback datasets, in the form of clicks, views, listens, purchases, etc. Let’s face it, the ratings era is over! Therefore, you will be trying to predict the preference of a user for a product, not the rating she would have given to it. Also, get a glimpse of the size of your problem: how many users and items are in your system, and how many interactions among them? In online services, users typically number in the hundreds of thousands to hundreds of millions, and items in the thousands to millions. On average, users interact with tens of different products, so the density of the rating matrix is in the order of 0.1-5%.
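Getting that glimpse takes only a few lines of pandas. The sketch below is a minimal example that assumes a hypothetical interactions.csv log with user_id and item_id columns; it counts users, items, and unique user-item pairs, and derives the density of the matrix.

```python
import pandas as pd

# Hypothetical implicit-feedback log: one row per (user, item) interaction.
interactions = pd.read_csv("interactions.csv")  # columns: user_id, item_id, ...

n_users = interactions["user_id"].nunique()
n_items = interactions["item_id"].nunique()
n_pairs = len(interactions.drop_duplicates(["user_id", "item_id"]))

# Density of the user-item matrix: observed pairs over all possible pairs.
density = n_pairs / (n_users * n_items)
print(f"{n_users} users, {n_items} items, {n_pairs} interactions "
      f"(density: {density:.2%})")
```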
But not all RS are that big and sparse! For instance, at BBVA we have built RS for financial products which, depending on how you phrase the problem, have a product catalog ranging from tens to a hundred products. In this scenario, casting the recommendation problem as a multi-class classification problem will probably make a lot of sense as a first step. Similarly, when Spotify personalizes its landing page for you (the different lists that appear arranged in shelves), it does so by using a RS based on Multi-armed Bandits.
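As an illustration of that multi-class framing, here is a minimal scikit-learn sketch. It assumes a hypothetical customers.csv table with numeric engineered features and a next_product label (the product the customer acquired next); the predicted class probabilities double as recommendation scores over the small catalog.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Hypothetical table: one row per customer, numeric features + next product acquired.
data = pd.read_csv("customers.csv")
X = data.drop(columns=["next_product"])
y = data["next_product"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Any multi-class classifier works here; class probabilities act as
# recommendation scores, so you can rank the whole catalog per customer.
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)

scores = clf.predict_proba(X_test)                        # (n_customers, n_products)
top_3 = clf.classes_[scores.argsort(axis=1)[:, ::-1][:, :3]]  # top-3 products per customer
```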
Choose an offline metric
Do not use traditional classification metrics like the AUC-ROC: in RS we do not have complete knowledge of the negative samples, so precision-oriented metrics are ill-posed. And never, ever use RMSE with implicit feedback datasets (take a look at this excellent talk to understand why). Instead, focus on top-k ranking metrics such as MAP and NDCG (take a look at this post to see how frequently each metric was used at the 2017 RecSys Conference). Also, try to measure other important behaviors of the engine aside from relevance: diversity, novelty, popularity, or coverage.
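These ranking metrics are simple enough to implement yourself. Below is a minimal, single-user sketch of AP@k (which averages into MAP) and binary-relevance NDCG@k; recommended is the ranked list produced by your engine and relevant is the set of held-out positives for that user.

```python
import numpy as np

def average_precision_at_k(recommended, relevant, k=10):
    """AP@k for one user: rewards placing relevant items near the top."""
    hits, score = 0, 0.0
    for rank, item in enumerate(recommended[:k], start=1):
        if item in relevant:
            hits += 1
            score += hits / rank
    return score / min(len(relevant), k) if relevant else 0.0

def ndcg_at_k(recommended, relevant, k=10):
    """Binary-relevance NDCG@k for one user."""
    dcg = sum(1.0 / np.log2(rank + 1)
              for rank, item in enumerate(recommended[:k], start=1)
              if item in relevant)
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / np.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0

# Toy example: items 1 and 7 are the held-out positives for this user.
print(average_precision_at_k([7, 3, 1, 9], {1, 7}, k=4))  # 0.833...
print(ndcg_at_k([7, 3, 1, 9], {1, 7}, k=4))
```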
Define a baseline and test it with public datasets
A simple popularity-based algorithm might be harder to beat than you think: at the end of the day, something is popular because people like it. Also, nearest-neighbor methods perform surprisingly well most of the time. Association rules are another good starting point, and there are scalable implementations out there.
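A popularity baseline fits in a dozen lines. The sketch below reuses the hypothetical interactions.csv log from before and recommends the globally most popular items that the user has not interacted with yet.

```python
import pandas as pd

interactions = pd.read_csv("interactions.csv")  # columns: user_id, item_id

# Items sorted by global interaction count (most popular first).
popularity = interactions["item_id"].value_counts().index.tolist()
# Items each user has already interacted with.
seen = interactions.groupby("user_id")["item_id"].apply(set)

def recommend_popular(user_id, k=10):
    already_seen = seen.get(user_id, set())
    return [item for item in popularity if item not in already_seen][:k]

print(recommend_popular(user_id=42, k=10))
```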
Matrix Factorization algorithms are also a popular choice; see for instance Spark’s Alternating Least Squares implementation. In addition, pair- and list-wise methods (like Bayesian Personalized Ranking, aka BPR) enjoy great acceptance within the RS community. However, several research articles from the last couple of years suggest that pairwise methods may perform worse than pointwise estimations on top-k metrics. As explained here, BPR optimizes the ROC curve, which does not necessarily imply improvements in ranking-aware metrics such as MAP or NDCG. In our experience, BPR-like methods tend to recommend popular items less frequently than their pointwise counterparts (which might actually be a good thing), concomitant with a drop in ranking-aware metrics. Speaking of how biased different algorithms are towards popular recommendations, you might want to have a look at our recent work, which shows how to effectively recommend items from the medium and long tail of the item catalog without jeopardizing the relevance of the recommendation.
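For reference, this is roughly what an implicit-feedback run looks like with Spark’s ALS implementation; the toy events DataFrame and the hyperparameter values are placeholders you would replace and tune on your own data.

```python
from pyspark.sql import SparkSession
from pyspark.ml.recommendation import ALS

spark = SparkSession.builder.appName("implicit-als").getOrCreate()

# Implicit-feedback events; "count" could be clicks, plays, purchases, etc.
events = spark.createDataFrame(
    [(0, 10, 3.0), (0, 11, 1.0), (1, 10, 5.0), (2, 12, 2.0)],
    ["user", "item", "count"],
)

# implicitPrefs=True switches ALS to the confidence-weighted formulation
# for implicit feedback; alpha scales confidence with the interaction count.
als = ALS(rank=32, maxIter=10, regParam=0.1, implicitPrefs=True, alpha=40.0,
          userCol="user", itemCol="item", ratingCol="count",
          coldStartStrategy="drop")
model = als.fit(events)

# Top-5 items per user (scores are preference estimates, not ratings).
model.recommendForAllUsers(5).show(truncate=False)
```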
Incorporate your business into the model
This is the most important part and, sometimes, the most forgotten one. Think hard about the specifics of your business. For instance, incorporating side information (such as product categories, user profiles, and so on) will boost the performance of your model. Is there any seasonality in your data, or any trend in the way ratings are collected? Try to make use of the graph of users and products, it hides a lot of information!
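One way to bring in that side information is a hybrid model. The sketch below uses the LightFM library as one example, with made-up users, items, and product-category features standing in for your real catalog; the point is simply that content features and interactions are learned jointly.

```python
from lightfm import LightFM
from lightfm.data import Dataset

# Build the interaction matrix together with item side information
# (e.g., product categories) so the model can exploit content features.
dataset = Dataset()
dataset.fit(users=["u1", "u2"], items=["i1", "i2", "i3"],
            item_features=["cat:loans", "cat:insurance"])

interactions, _ = dataset.build_interactions([("u1", "i1"), ("u2", "i3")])
item_features = dataset.build_item_features(
    [("i1", ["cat:loans"]), ("i2", ["cat:loans"]), ("i3", ["cat:insurance"])]
)

# WARP loss optimizes for top-k ranking; item_features injects the side info.
model = LightFM(no_components=32, loss="warp")
model.fit(interactions, item_features=item_features, epochs=20, num_threads=2)
```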
A word of caution when using Neural Networks
Although Deep Learning has shown incredible success in several domains, the improvements observed in RS are still limited. Indeed, the main advantage of DL applied to RS lies in the incorporation of side information, not so much in the modeling of ratings itself. In that regard, it may become quite handy if you combine it with your domain expertise! Regarding pure rating modeling, user-based Autoencoders provide a more compact learning algorithm (fewer parameters) than traditional Matrix Factorization, and slightly improve metric performance when combined with Denoising techniques. On the other hand, Recurrent Neural Networks have great potential in session-based RS, as well as in predicting the next action. However, a recent study indicates that they are far from being fully exploited (they perform similarly to k-NN methods in several cases).
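To make the autoencoder idea concrete, here is a minimal PyTorch sketch of a denoising, user-based autoencoder: each user’s binary interaction vector is corrupted with dropout, encoded, and reconstructed. The layer sizes and the random batch are placeholders, not a tuned model.

```python
import torch
import torch.nn as nn

class DenoisingUserAutoencoder(nn.Module):
    """Encode each user's (sparse) interaction vector and reconstruct preferences."""

    def __init__(self, n_items, n_hidden=256, dropout=0.5):
        super().__init__()
        self.corrupt = nn.Dropout(dropout)   # denoising: randomly drop observed interactions
        self.encoder = nn.Linear(n_items, n_hidden)
        self.decoder = nn.Linear(n_hidden, n_items)

    def forward(self, x):
        h = torch.tanh(self.encoder(self.corrupt(x)))
        return self.decoder(h)               # scores for every item in the catalog

n_items = 5000
model = DenoisingUserAutoencoder(n_items)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.BCEWithLogitsLoss()

# One training step on a random binary interaction batch (placeholder data).
batch = (torch.rand(64, n_items) < 0.01).float()
optimizer.zero_grad()
loss = loss_fn(model(batch), batch)
loss.backward()
optimizer.step()
```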
There are many other topics to be covered here: for instance, the problems with selection bias and how counterfactual analysis might help prevent it, how to run A/B tests effectively, or how to incorporate the temporal dimension into your system. But we hope the tips above will give you a good starting point for running your first Recommender System!