Sunday, September 3, 2017

"Learning to ranking" with xCLiMF python implementation

At Globo.com we try to optimize our personalized recommendations by providing better content in a ranked list. We have a hybrid system that can automatically ensemble collaborative filtering, content-based, and non-personalized algorithms.

But none of our implemented algorithms is intended to rank. Our best collaborative filtering algorithm, based on the paper Collaborative Filtering for Implicit Feedback Datasets, doesn't optimize a ranking metric like MRR or NDCG.
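
As a quick refresher, MRR (mean reciprocal rank) averages, over all users, the reciprocal of the position at which the first relevant item appears in that user's ranked list. A minimal sketch (the function name is mine):

    def mean_reciprocal_rank(ranked_lists, relevant_sets):
        """MRR: average over users of 1 / rank of the first relevant item."""
        total = 0.0
        for ranked, relevant in zip(ranked_lists, relevant_sets):
            for position, item in enumerate(ranked, start=1):
                if item in relevant:
                    total += 1.0 / position
                    break
        return total / len(ranked_lists)

    # first user hits at position 2, second at position 1: (0.5 + 1.0) / 2 = 0.75
    print(mean_reciprocal_rank([[3, 7, 1], [9, 2]], [{7}, {9, 2}]))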

Even though we get good results from those algorithms by choosing the hyperparameters that maximize MRR, we thought we could do even better with an algorithm that directly optimizes a ranking metric. Our first hypothesis is xCLiMF.

xCLiMF is a latent factor collaborative filtering variant which optimizes a lower bound of the smoothed reciprocal rank of "relevant" items in the ranked recommendation list. It is an evolution of the CLiMF algorithm that can handle multiple levels of relevance instead of only unary user preferences.
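
To give a rough idea of what is being optimized (a sketch in my own notation, not the paper's exact formulation): each graded rating is mapped to a relevance probability, and the per-user objective is a sigmoid-smoothed lower bound of the expected reciprocal rank over that user's relevant items, minus an L2 regularization term. Something like:

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def user_objective(U_i, V, ratings, max_rating, lam=0.001):
        """Sketch of the smoothed ERR lower bound for one user.
        ratings: {item_index: rating}, U_i: user factors, V: item factor matrix."""
        items = np.array(list(ratings.keys()))
        # map graded ratings to relevance probabilities, as in the ERR metric
        r = (2.0 ** np.array(list(ratings.values()), dtype=float) - 1.0) / (2.0 ** max_rating)
        f = V[items].dot(U_i)  # predicted scores f_ij = <U_i, V_j>
        obj = 0.0
        for j in range(len(items)):
            # item j is relevant and is ranked above the user's other relevant items
            obj += r[j] * (np.log(sigmoid(f[j]))
                           + np.sum(np.log(1.0 - r * sigmoid(f - f[j]))))
        # only the user factors are regularized in this per-user sketch
        return obj - lam / 2.0 * U_i.dot(U_i)

    # toy usage: 5 latent factors, 10 items, one user who rated items 1, 4 and 7 on a 1-5 scale
    rng = np.random.RandomState(0)
    U_i, V = rng.normal(scale=0.1, size=5), rng.normal(scale=0.1, size=(10, 5))
    print(user_objective(U_i, V, {1: 5, 4: 3, 7: 1}, max_rating=5))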

Implementation references


Since xCLiMF is very similar to CLiMF, I implemented xCLiMF based on this Python CLiMF repository ( https://github.com/gamboviol/climf ). During the process I found an error and submitted a pull request ( https://github.com/gamboviol/climf/pull/2 ).

I also found a C++ xCLiMF implementation by another Brazilian developer ( https://github.com/gpoesia/xclimf ), but that solution has two problems: the first is a bug reported here ( https://github.com/gpoesia/xclimf/issues/1 ), and the second is related to performance, since it updates the user and item vectors separately, as two loops around another two chained loops.

My implementation


So my implementation also has some differences from the original paper:

  • The paper doesn't mention any kind of prior normalization, but it's always good practice for the input data of machine learning algorithms. So I normalized by dividing all ratings by the maximum rating in the dataset (see the first sketch after this list). Using the original ratings led to some math errors:
    • The exponential function of large values causes a math range error;
    • The exponential function of very large negative numbers returns 0, leading to division-by-zero errors;
    • The log function of negative numbers, caused by large exponentials, causes a math domain error.
  • The experimentation setup in the paper uses N random ratings per user for training and another N for cross-validation. Even though I implemented it this way, I also added an option to select the top 2N ratings per user and distribute them randomly between training and cross-validation (second sketch below). To me, this last protocol better reflects the kind of problem xCLiMF is going to solve.
  • The original paper describes the algorithm as two chained loops. At first I implemented it as described, but then I updated the code to a more vectorized solution, which gave a 54% performance improvement (third sketch below). I kept the chained-loop solution for comparison purposes ( https://github.com/timotta/xclimf/blob/master/tests/looping_update.py )
  • For the objective function, the paper says it's not necessary to divide by the number of users, because the value is only used to check whether the gradient ascent is still increasing after many iterations. But I made this division anyway, for interpretability purposes.
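
To illustrate the normalization point from the first item above, a minimal sketch (names are mine, not from the repository): dividing every rating by the dataset maximum keeps the scores fed into the exponentials small, which is what avoids the overflow, underflow and log/division errors listed there.

    def normalize_ratings(ratings):
        """Scale all ratings into (0, 1] by dividing by the dataset maximum."""
        max_rating = max(r for items in ratings.values() for r in items.values())
        return {user: {item: r / max_rating for item, r in items.items()}
                for user, items in ratings.items()}

    # With raw ratings, exp() of a big score overflows, exp() of a very negative score
    # underflows to 0, and both end up later as log(0) or division by zero.
    ratings = {"u1": {"i1": 5.0, "i2": 2.0}, "u2": {"i1": 4.0}}
    print(normalize_ratings(ratings))  # {'u1': {'i1': 1.0, 'i2': 0.4}, 'u2': {'i1': 0.8}}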

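A rough sketch of the two experimentation protocols (function names are mine): the paper's protocol samples N random ratings per user for training and another N for cross-validation, while the alternative takes each user's top-2N ratings and distributes them randomly between the two sets.

    import random

    def split_paper_protocol(user_ratings, n, seed=42):
        """N random ratings for training and another N random ratings for cross-validation."""
        items = list(user_ratings.items())
        random.Random(seed).shuffle(items)
        return dict(items[:n]), dict(items[n:2 * n])

    def split_top_ratings_protocol(user_ratings, n, seed=42):
        """Top-2N ratings per user, randomly distributed between training and cross-validation."""
        top = sorted(user_ratings.items(), key=lambda kv: kv[1], reverse=True)[:2 * n]
        random.Random(seed).shuffle(top)
        return dict(top[:n]), dict(top[n:])

    ratings = {"i%d" % i: r for i, r in enumerate([5, 4, 4, 3, 2, 2, 1, 1])}
    print(split_paper_protocol(ratings, n=2))
    print(split_top_ratings_protocol(ratings, n=2))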

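And to show the spirit of the vectorization change mentioned in the third item (a generic example, not the actual xCLiMF gradient): the pairwise score differences that appear in the inner sums can be computed with NumPy broadcasting instead of two nested Python loops.

    import numpy as np

    def pairwise_diffs_loops(f):
        """Chained loops: d[j, k] = f[k] - f[j]."""
        n = len(f)
        d = np.zeros((n, n))
        for j in range(n):
            for k in range(n):
                d[j, k] = f[k] - f[j]
        return d

    def pairwise_diffs_vectorized(f):
        """Same matrix computed with broadcasting, no Python-level loops."""
        return f[np.newaxis, :] - f[:, np.newaxis]

    f = np.array([0.3, -1.2, 0.7])
    assert np.allclose(pairwise_diffs_loops(f), pairwise_diffs_vectorized(f))
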
Results


Using the experimentation protocol of N random ratings per user for training and cross-validation, applied to the MovieLens 1M dataset, I got an MRR varying from 0.08 to 0.09 at the 50th iteration, using the hyperparameters described in the paper.


Using the top-ratings experimentation protocol (the top-2N split described above) on the same dataset, I got an MRR varying from 0.2 to 0.26. I tried both experimentation protocols with the Alternating Least Squares implementation in Spark, and the maximum MRR I got was 0.01.

The sequence of experiments, and how to run them, can be seen at https://github.com/timotta/xclimf

Conclusions and future


Now we have a working xCLiMF implementation in Python, but it is not feasible to train it with Globo.com's dataset. Running it on the MovieLens 1M dataset, which has 6k users, took 4 minutes. With the MovieLens 20M dataset, which has 130k users, it took more than one hour. Globo.com's datasets are much bigger than that: counting only logged-in users, our dataset of user preferences on shows has more than 4 million users, so it would take more than 40 hours to run.

That's why my next step is implementing this same algorithm on Spark. Since the user-by-user update step is independent, it is possible to scale it by running in parallel on our Hadoop cluster.
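
As a first idea of how that could look (a sketch only, with a toy per-user update standing in for the real xCLiMF gradient, so names and details are placeholders): the item matrix V is broadcast to the executors, each user is mapped to its updated vector plus its contribution to the item gradients, and the driver aggregates those contributions before the next iteration.

    import numpy as np
    from pyspark import SparkContext

    def user_step(user_id, rated_items, ratings, U_i, V, lr=0.01):
        """Toy per-user update (placeholder for the real xCLiMF gradient): returns the
        new user vector and this user's contribution to the item gradients."""
        errs = ratings - V[rated_items].dot(U_i)
        new_U_i = U_i + lr * errs.dot(V[rated_items])
        item_grads = {j: lr * e * U_i for j, e in zip(rated_items, errs)}
        return user_id, new_U_i, item_grads

    sc = SparkContext(appName="xclimf-sketch")
    V = np.random.normal(scale=0.01, size=(100, 10))  # item factors
    users = [(u, np.array([1, 5, 42]), np.array([1.0, 0.6, 0.2]),
              np.random.normal(scale=0.01, size=10)) for u in range(1000)]

    V_bc = sc.broadcast(V)  # ship the item factors to every executor
    results = (sc.parallelize(users)
                 .map(lambda x: user_step(x[0], x[1], x[2], x[3], V_bc.value))
                 .collect())
    # the driver would then merge all item_grads into V and broadcast it again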

I appreciate any help, corrections, suggestions or criticism.

References


CLiMF: Learning to Maximize Reciprocal Rank with Collaborative Less-is-More Filtering
Yue Shi, Martha Larson, Alexandros Karatzoglou, Nuria Oliver, Linas Baltrunas, Alan Hanjalic
ACM RecSys 2012

xCLiMF: Optimizing Expected Reciprocal Rank for Data with Multiple Levels of Relevance
Yue Shi, Alexandros Karatzoglou, Linas Baltrunas, Martha Larson, Alan Hanjalic
ACM RecSys 2013

Collaborative Filtering for Implicit Feedback Datasets
Yifan Hu, Yehuda Koren, Chris Volinsky
IEEE ICDM 2008