A Survey of Collaborative Filtering Algorithms for Binary, Positive-only Data

Koen Verstrepen* (Froomle, Antwerp, Belgium), [email protected]
Kanishka Bhaduri† (Apple, Inc., Cupertino, USA), [email protected]
Boris Cule (University of Antwerp, Antwerp, Belgium), [email protected]
Bart Goethals (Froomle, University of Antwerp, Antwerp, Belgium), [email protected]

ABSTRACT
Traditional collaborative filtering assumes the availability of explicit ratings of users for items. However, in many cases these ratings are not available and only binary, positive-only data is available. Binary, positive-only data is typically associated with implicit feedback such as items bought, videos watched, ads clicked on, etc. However, it can also be the result of explicit feedback such as likes on social networking sites. Because binary, positive-only data contains no negative information, it needs to be treated differently than rating data. As a result of the growing relevance of this problem setting, the number of publications in this field increases rapidly. In this survey, we provide an overview of the existing work from an innovative perspective that allows us to emphasize surprising commonalities and key differences.

1. INTRODUCTION
Increasingly, people are overwhelmed by an abundance of choice. Via the Internet, everybody has access to a wide variety of news, opinions, (encyclopedic) information, social contacts, books, music, videos, pictures, products, jobs, houses, and many other items, from all over the world. However, from the perspective of a particular person, the vast majority of items is irrelevant, and the few relevant items are difficult to find because they are buried under a large pile of irrelevant ones. There exist, for example, lots of books that one would enjoy reading, if only one could identify them. Moreover, not only do people fail to find relevant existing items; niche items fail to be created because it is anticipated that the target audience will not be able to find them under the pile of irrelevant items. Certain books, for example, are never written because writers anticipate they will not be able to reach a sufficiently large portion of their target audience, although the audience exists. Recommender systems contribute to overcoming these difficulties by connecting individuals with items relevant to them. A good book recommender system, for example, would typically recommend three previously unknown books that the user would enjoy reading, and that are sufficiently different from each other. Studying recommender systems specifically, and the connection between individuals and relevant items in general, is the subject of recommendation research. But the scope of recommendation research goes beyond connecting users with items. Recommender systems can, for example, also connect genes with diseases, biological targets with drug compounds, words with documents, tags with photos, etc.

*This work was done while Koen Verstrepen was working at the University of Antwerp.
†This work was done while Kanishka Bhaduri was working at Netflix, Inc.

1.1 Collaborative Filtering
Collaborative filtering is a principal problem in recommendation research. In the most abstract sense, collaborative filtering is the problem of weighting missing edges in a bipartite graph.

The concrete version of this problem that got the most attention until recently is rating prediction. In rating prediction, one set of nodes in the bipartite graph represents users, the other set of nodes represents items, an edge with weight r between user u and item i expresses that u has given i the rating r, and the task is to predict the missing ratings. Since rating prediction is a mature domain, multiple overviews exist [23; 47; 1; 59]. Recently, the attention for rating prediction diminished for multiple reasons. First, collecting rating data is relatively expensive in the sense that it requires a non-negligible effort from the users. Second, user ratings do not correlate as well with user behavior as one would expect. Users tend to give high ratings to items they think they should consume, for example a famous book by Dostoyevsky. However, they would rather read Superman comic books, which they rate much lower. Finally, in many applications, predicting ratings is not the final goal, and the predicted ratings are only used to find the most relevant items for every user. Consequently, high ratings need to be accurate, whereas the exact value of low ratings is irrelevant. However, in rating prediction high and low ratings are equally important.

Today, attention is increasingly shifting towards collaborative filtering with binary, positive-only data. In this version, edges are unweighted, an edge between user u and item i expresses that user u has given positive feedback about item i, and the task is to attach to every missing edge between a user u and an item i a score that indicates the suitability of recommending i to u. Binary, positive-only data is typically associated with implicit feedback such as items bought, videos watched, songs listened to, books lent from a library, ads clicked on, etc. However, it can also be the result of explicit feedback, such as likes on social networking sites. As a result of the growing relevance of this problem setting, the number of publications in this field increases rapidly. In this survey, we provide an overview of the existing work on collaborative filtering with binary, positive-only data from an innovative perspective that allows us to emphasize surprising commonalities and key differences. To enhance the readability, we sometimes omit the specification 'binary, positive-only' and use the abbreviated term 'collaborative filtering'.

Besides the bipartite graph, five types of extra information can be available. First, there can be item content or item metadata. In the case of books, for example, the content is the full text of the book and the metadata can include the writer, the publisher, the year it was published, etc. Methods that exclusively use this kind of information are typically classified as content based. Methods that combine this kind of information with a collaborative filtering method are typically classified as hybrid. Second, there can be user metadata such as gender, age, location, etc. Third, users can be connected with each other in an extra, unipartite graph. A typical example is a social network between the users. Fourth, an analogous graph can exist for the items. Finally, there can be contextual information such as location, date, time, intent, company, device, etc. Exploiting information besides the bipartite graph is out of the scope of this survey. Comprehensive discussions on exploiting information outside the user-item matrix have been presented [53; 72].

1.2 Relation to Other Domains
To emphasize the unique aspects of collaborative filtering, we highlight the commonalities and differences with two related data science problems: classification and association rule mining.

First, collaborative filtering is equivalent to jointly solving many one-class classification problems, in which every one-class classification problem corresponds to one of the items. In the classification problem that corresponds to item i, i serves as the class, all other items serve as the features, the users that have i as a known preference serve as labeled examples and the other users serve as unlabeled examples. Amazon.com, for example, has more than 200 million items in its catalog; hence, solving the collaborative filtering problem for Amazon.com is equivalent to jointly solving more than 200 million one-class classification problems, which obviously requires a distinct approach. However, collaborative filtering is more than efficiently solving many one-class classification problems. Because they are tightly linked, jointly solving all classification problems allows for sharing information between them. The individual classification problems share most of their features; and while i serves as the class in one of the classification problems, it serves as a feature in all other classification problems.

Second, association rule mining also assumes bipartite, unweighted data and can therefore be applied to datasets used for collaborative filtering. Furthermore, recommending item i to user u can be considered as the application of the association rule I(u) → i, with I(u) the itemset containing the known preferences of u. However, the goals of association rule mining and collaborative filtering are different. If a rule I(u) → i is crucial for recommending i to u, but irrelevant on the rest of the data, giving the rule a high score is desirable for collaborative filtering, but typically not for association rule mining.

1.3 Outline
After the Preliminaries (Sec. 2), we introduce our framework (Sec. 3) and review the state of the art along the three dimensions of our framework: Factorization Models (Sec. 4), Deviation Functions (Sec. 5), and Minimization Algorithms (Sec. 6). Finally, we discuss the usability of methods for rating prediction (Sec. 7) and conclude (Sec. 8).

2. PRELIMINARIES
We introduced collaborative filtering as the problem of weighting missing edges in a bipartite graph. Typically, however, this bipartite graph is represented by its adjacency matrix, which is called the preference matrix.

Let U be a set of users and I a set of items. We are given a preference matrix with training data R ∈ {0, 1}^(|U|×|I|). R_ui = 1 indicates that there is a known preference of user u ∈ U for item i ∈ I. R_ui = 0 indicates that there is no such information. Notice that the absence of information means that either there exists no preference or there exists a preference but it is not known.

Collaborative filtering methods compute for every user-item pair (u, i) a recommendation score s(u, i) that indicates the suitability of recommending i to u. Typically, the user-item pairs are (partially) sorted by their recommendation scores. We define the matrix S ∈ R^(|U|×|I|) as S_ui = s(u, i). Furthermore, c(x) gives the count of x, meaning

    c(x) = Σ_{i∈I} R_xi   if x ∈ U,
    c(x) = Σ_{u∈U} R_ux   if x ∈ I.

Although we conveniently call the elements of U users and the elements of I items, these sets can contain any type of object. In the case of online social networks, for example, both sets contain the people that participate in the social network, i.e., U = I, and R_ui = 1 if there exists a friendship link between persons u and i. In image tagging/annotation problems, U contains images, I contains words, and R_ui = 1 if image u was tagged with word i. In chemogenomics, an early stage in the drug discovery process, U contains active drug compounds, I contains biological targets, and R_ui = 1 if there is a strong interaction between compound u and biological target i.

Typically, datasets for collaborative filtering are extremely sparse, which makes it a challenging problem. The sparsity S, computed as

    S = 1 − (Σ_{(u,i)∈U×I} R_ui) / (|U| · |I|),   (1)

typically ranges from 0.98 to 0.999 and is visualized in Figure 1. This means that a score must be computed for approximately 99% of the (u, i)-pairs based on only 1% of the (u, i)-pairs. This is undeniably a challenging task.

3. FRAMEWORK
In the most general sense, every method for collaborative filtering is defined as a function F that computes the recommendation scores S based on the data R: S = F(R). Since different methods originate from different intuitions about the problem, they are typically explained from very different perspectives.
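As a concrete illustration of the preliminaries above, the preference matrix R, the count function c(x), and the sparsity of Equation 1 can be sketched in plain Python. This is a minimal sketch on toy data; the matrix values and helper names (`c_user`, `c_item`) are illustrative and not from the survey.

```python
# Toy binary, positive-only preference matrix R:
# R[u][i] = 1: known preference of user u for item i; 0: no information.
R = [
    [1, 0, 0, 1, 0],
    [0, 0, 1, 0, 0],
    [1, 1, 0, 0, 0],
]
num_users, num_items = len(R), len(R[0])

def c_user(u):
    """c(u): number of known preferences of user u (row sum of R)."""
    return sum(R[u])

def c_item(i):
    """c(i): number of users with a known preference for item i (column sum)."""
    return sum(R[u][i] for u in range(num_users))

# Sparsity (Equation 1): fraction of user-item pairs without a known preference.
sparsity = 1 - sum(map(sum, R)) / (num_users * num_items)

print(c_user(0), c_item(0), round(sparsity, 3))
```

On real collaborative filtering datasets the same computation yields sparsities of 0.98 and above, which is why R is normally stored as a sparse matrix rather than as dense lists.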

Also the maximum margin based deviation function in Equation ?? cannot be solved with ALS because it contains the hinge loss. Rennie and Srebro propose a conjugate gradients method for minimizing this function [Rennie and Srebro 2005]. However, this method suffers from similar problems as SGD, related to the high number of terms in the loss function. Therefore, Pan and Scholz [Pan and Scholz 2009] propose a bagging approach. The essence of their bagging approach is that they do not explicitly weight every user-item pair for which R_ui = 0, but sample from all these pairs instead. They create multiple samples, and compute multiple different pairs (S^(1,1), S^(1,2)) corresponding to their samples. These computations are also performed with the conjugate gradients method. They are, however, much less intensive, since they only consider a small sample of the many user-item pairs for which R_ui = 0. The different corresponding factor matrices are finally merged by simply taking their average.

For solving Equation ??, Kabbur et al. propose to subsequently apply SGD to a number of reduced datasets that each contain all known preferences and a different random sample of the missing preferences, although an ALS-like solution seems to be within reach. When the deviation function is reconstruction based and S^(1) = R, i.e., an item-based neighborhood model is used, the user factors are fixed by definition, and the problem resembles a single ALS step. However, when constraints such as non-negativity are imposed, things get more complicated [Ning and Karypis 2011]. Therefore, Ning and Karypis adopt cyclic coordinate descent and soft thresholding [Friedman et al. 2010].

When a deviation function contains a triple summation, it contains even more terms, and the minimization with SGD becomes even more expensive. Moreover, most of the terms have only a small influence on the parameters that need to be learned, and considering them is therefore largely wasted effort. However, in some cases no other solution has been proposed yet (Eq. 22, 26). To mitigate this problem, Rendle and Freudenthaler propose to sample the terms in the deviation function not uniformly but proportionally to their impact on the parameters [Rendle and Freudenthaler 2014]. Weston et al. mitigate this problem by, for every known preference i, sampling non-preferred items j until they encounter one for which S_uj + 1 > S_ui, i.e., one that violates the hinge-loss approximation of the ground truth ranking. Consequently, their updates will be significant [Weston et al. 2013b]. Also the deviation function in Equation 24 contains a triple summation. However, Takács and Tikk [Takács and Tikk 2012] are able to propose an ALS algorithm and therefore do not suffer from the triple summation like an SGD-based algorithm would. Finally, in line with his maximum likelihood approach, Hofmann proposes to minimize the deviation function in Equation ?? by means of a regularized expectation maximization (EM) algorithm [Hofmann 2004].
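The rejection-sampling idea attributed above to Weston et al. can be sketched as follows. This is a toy illustration, not the authors' implementation: the score matrix, the data, and the function name are stand-ins, and only the sampling rule (keep drawing non-preferred items until one violates the hinge margin S_uj + 1 > S_ui) comes from the text.

```python
import random

def sample_violating_negative(S, preferred, u, i, max_tries=1000, rng=random):
    """For known preference i of user u, sample non-preferred items j until
    one violates the hinge margin, i.e. S[u][j] + 1 > S[u][i].
    Such a j yields a significant SGD update; return None if none is found."""
    num_items = len(S[u])
    candidates = [j for j in range(num_items) if j not in preferred[u]]
    for _ in range(max_tries):
        j = rng.choice(candidates)
        if S[u][j] + 1 > S[u][i]:  # hinge-loss approximation violated
            return j
    return None  # u's known preference is already well separated

random.seed(7)                       # deterministic toy run
S = [[3.0, 0.5, 2.5, 0.1]]           # toy scores of one user for four items
preferred = {0: {0}}                 # user 0 has item 0 as known preference
j = sample_violating_negative(S, preferred, 0, 0)
print(j)  # item 2 is the only violator: 2.5 + 1 > 3.0
```

The number of draws needed before a violator is found also carries information: the harder it is to find one, the better ranked the known preference already is, which WARP-style losses exploit to scale the update.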

6.2. Convex Minimization Aiolli [Aiolli 2014] proposed the convex deviation function in Equation 28 and indicates that it can be solved with any algorithm for convex optimization. Also the analytically solvable deviation functions are convex. Moreover, minimiz- ing them is equivalent to computing all the similarities involved in the model. Most works assume a brute force computation1:24 of the similarities. However, Verstrepen and K. Verstrepen et al. 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 i" i" 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 Goethals [Verstrepen and Goethals 2016], recently proposed two methods that(1,2)" are eas- 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 S D" 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ily0 0 0 an0 0 0 order0 0 0 0 0 of0 0 magnitude faster than the brute force computation. 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 u" u" 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 … 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 |I| 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 S" S(1,1)" 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 =" 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 |U| 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1:240 0 K. Verstrepen et al. 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 REFERENCES … … 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 ACM0 0 0 0 Computing0 0 0 0 0 0 0 Surveys,0 0 Vol. 1, No. 1, Article 1, PublicationFabio Aiolli. date: 2013.January Efficient 2015. top-n recommendation for very large scale binary rated datasets. 
In Proceedings 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 of the 7th ACM conference on RecommenderD" systems. ACM, 273–280. 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 Fabio Aiolli. 2014.|I| Convex AUC optimization for top-N recommendation with implicit feedback. In Proceed- 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ings of the 8th ACM Conference on Recommender systems. ACM, 293–296. 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 C.M. Bishop. 2006. and . Springer, New York, NY. 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 Figure 2: Matrix Factorization with 2 factor matrices 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 Evangelia Christakopoulou and George Karypis. 2014. Hoslim: Higher-order sparse linear method for top-n 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 … 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 (Eq. 3).recommender systems. In Advances in Knowledge Discovery and . Springer, 38–49. 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 REFERENCES0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 Mukund Deshpande and George Karypis. 2004. Item-based top-n recommendation algorithms. ACM Trans- 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 Fabio0 0 Aiolli. 2013. Efficient top-n recommendationactions on Information for very large Systems scale binary(TOIS) rated22, 1 datasets.(2004), 143–177. In Proceedings of the 7th ACM conference on Recommender systems. ACM, 273–280. Jerome Friedman, Trevor Hastie, and Rob Tibshirani. 2010. Regularization paths for generalized linear Fabio Aiolli. 2014. 
Convex AUCto optimizationemphasizemodels forvia that top-N coordinate therecommendation descent.matrixJournal of with recommendation implicitof statistical feedback. software In Proceed- scores33, 1 (2010),S 1. Figure 1: Typical sparsity of training data matrixings of theR 8thfor ACM Conference on Recommender systems. ACM, 293–296. is dependentEric Gaussier on and these Cyril Goutte. parameters, 2005. Relation we betweenwill write PLSA it and as NMFS(θ and). implications. In Proceedings binary, positive-only collaborative filtering. C.M. Bishop. 2006. Pattern Recognition and Machine Learning. Springer, New York, NY. The modelof thein 28th Figure annual international2, for example, ACM SIGIR contains conference ( on+ Research) D and development in information Evangelia Christakopoulou and Georgeretrieval Karypis.. ACM, 2014. 601–602. Hoslim: Higher-order sparse linear method|U| for|I| top-n· recommender systems. Inparameters.AdvancesPrem Gopalan, in Knowledge Computing J Hofman, Discovery and all and D the Blei.Data parameters 2015.Mining Scalable. Springer, recommendation in 38–49. a factoriza- with hierarchical poisson factor- Mukund Deshpande and Georgetion Karypis. modelization. 2004. is In done Item-basedProceedings by minimizing top-n of the recommendation Thirti-first the deviation Conference algorithms. AnnualbetweenACM Trans- Conference the on Uncertainty in Artificial different perspectives. In the literature on recommenderactions on sys-Information Systems (TOIS) 22, 1 (2004), 143–177. data RIntelligenceand the (UAI-15). parameters AUAI Pressθ.. This deviation is measured tems in general and collaborative filtering specifically,Jerome Friedman, two Trevor Hastie,Thomas and Rob Hofmann. Tibshirani. 1999. 2010. Probabilistic Regularization latent pathssemantic for indexing. 
generalized In Proceedings linear of the 22nd annual interna- dominant perspectives have emerged: the modelmodels based via per- coordinate descent.by theJournaldeviationtional of ACM statistical SIGIR function software conference33, on 1( (2010),Researchθ, R). 1. and Formally development we in compute . ACM, 50–57. ∗ D spective and the memory-based perspective. Unfortunately,Eric Gaussier and Cyril Goutte.the 2005.optimalThomas Relation Hofmann.values between 2004.θ PLSAof Latent the and semantic parametersNMF and models implications. forθ collaborativeas In Proceedings filtering. ACM Transactions on Informa- of the 28th annual international ACMtion Systems SIGIR conference (TOIS) 22, on 1 (2004), Research 89–115. and development in information these two are often described as two fundamentallyretrieval separate. ACM, 601–602. Yifan Hu, Yehuda Koren,∗ and Chris Volinsky.R 2008. Collaborative filtering for implicit feedback datasets. In Prem Gopalan, J Hofman, and D Blei. 2015. Scalableθ recommendation= arg min with(θ, hierarchical) . poisson factor- classes of methods, instead of merely two perspectives on the Data Mining, 2008. ICDM’08.θ EighthD IEEE International Conference on. IEEE, 263–272. ization. In Proceedings of theDietmar Thirti-first Jannach, Conference Markus Annual Zanker, Conference Alexander on Felfernig, Uncertainty and in Gerhard Artificial Friedrich. 2010. Recommender sys- same class of methods [47; 23]. As a result, the comparisonIntelligence (UAI-15). AUAI Press. Many deviationtems: an introduction functions. Cambridge exist,and University every Press. deviation function between methods often remains superficial. Thomas Hofmann. 1999. ProbabilisticChristopher latent semantic Johnson. indexing. 2014. Logistic In Proceedings Matrix Factorization of the 22nd annual for Implicit interna- Feedback Data. 
In Workshop on Dis- We, however, will explain many different collaborativetional ACMfil- SIGIR conferencemathematically on Researchtributed Machine andexpresses development Learning in anda information different Matrix Computations retrieval interpretation. ACM, at the 50–57. Twenty-eighth of the Annual Conference on Neural Thomas Hofmann. 2004. Latentconcept semanticInformationdeviation models for Processing. collaborative Systems filtering. (NIPS)ACM. Transactions on Informa- tering methods from one and the same perspective.tion Systems As (TOIS) 22, 1 (2004), 89–115. F Third,Santosh efficiently Kabbur and computing George Karypis. the 2014. parameters NLMF: NonLinear that minimize Matrix Factorization Methods for Top-N such we facilitate the comparison and classificationYifan Hu, of Yehuda these Koren, and Chris Volinsky.Recommender 2008. Collaborative Systems. In Data filtering Mining for implicit Workshop feedback (ICDMW), datasets. 2014 In IEEE International Conference on. methods. Our perspective is a matrix factorizationData Mining, frame- 2008. ICDM’08.a deviation EighthIEEE, IEEE function167–174. International is often Conference non on trivial. IEEE, 263–272. because the major- work in which every method consists of threeDietmar fundamental Jannach, Markus Zanker,ity ofSantosh Alexander deviation Kabbur, Felfernig, Xiafunctions Ning, and and Gerhard is George non-convex Friedrich. Karypis. 2010. 2013. inRecommender Fism: the parametersfactored sys- item similarity models for top-n rec- F tems: an introduction. Cambridgeof the factorizationommender University systems. Press. model. In Proceedings In that of the case, 19th ACM minimization SIGKDD international al- conference on Knowledge building blocks: a factorization model of theChristopher recommenda- Johnson. 2014. Logistic Matrixdiscovery Factorization and data mining for Implicit. ACM, Feedback 659–667. Data. 
In Workshop on Dis- tion scores S, a deviation function that measurestributed the devia- Machine Learninggorithms andNoam Matrix Koenigstein, can Computations only Nir compute at Nice, the Twenty-eighth Ulrich parameters Paquet, Annual and Nirthat Conference Schleyen. correspond on 2012. Neural The to Xbox recommender system. In tion between the data R and the recommendationInformation scores ProcessingS, Systemsa local (NIPS)Proceedings minimum.. of the The sixth ACM initialization conference on ofRecommender the parameters systems. ACM, in 281–284. Santosh Kabbur and Georgethe Karypis.Yehuda factorization 2014. Koren NLMF: and model Robert NonLinear Bell. and Matrix 2011. the Advances chosenFactorization in hyperparameters collaborative Methods for filtering. Top-N of In Recommender systems hand- and a minimization procedure that tries to findRecommender the model Systems. In Data Miningbook. Springer, Workshop 145–186. (ICDMW), 2014 IEEE International Conference on. parameters that minimize the deviation function.IEEE, 167–174. the minimizationYehuda Koren, Robert algorithm Bell, and determine Chris Volinsky. which 2009. localMatrix minimum factorization techniques for recommender First, the factorization model computesSantosh the matrix Kabbur, ofXia Ning, andwill George besystems. found. Karypis.Computer If2013. the Fism:8 value (2009), factored of 30–37. the item deviation similarity models function for top-n in rec- this ommender systems. In ProceedingslocalWeiyang minimum of Lin, the 19thSergio is ACM not A Alvarez, SIGKDD much and internationalhigher Carolina than Ruiz. conference 2002. that Efficient ofon Knowledge the adaptive-support global association rule mining recommendation scores S as a link function l ofdiscovery a sum ofandT data mining. ACM,for 659–667. recommender systems. Data mining and knowledge discovery 6, 1 (2002), 83–105. minimum, it is considered a good minimum. 
An intuitively terms in which a term t is the product of Ft factorNoam Koenigstein, matrices: Nir Nice, UlrichGuilherme Paquet, Vale and Menezes,Nir Schleyen. Jussara 2012. M The Almeida, Xbox Fabianorecommender Belem,´ system. Marcos In Andre´ Gonc¸alves, An´ısio Lacerda, Proceedings of the sixthappealing ACM conferenceEdleno deviation on Silva Recommender De Moura, function systems Gisele is L. ACM, Pappa, worthless 281–284. Adriano if Veloso, there and exists Nivio Ziviani. no 2010. Demand-driven tag   Yehuda Koren and Robert Bell. 2011. Advances in collaborative filtering. In Recommender systems hand- T Ft algorithmrecommendation. that can efficiently In Machine Learning compute and Knowledgeparameters Discovery that in cor- Databases. Springer, 402–417. X Y (t,f) book. Springer, 145–186. S = l S . (2) Andriy Mnih and Ruslan Salakhutdinov. 2007. Probabilistic matrix factorization. In Advances in neural   Yehuda Koren, Robert Bell,respond and Chrisinformation toVolinsky. a good 2009.processing local Matrixminimum systems factorization. 1257–1264. of techniques this deviation for recommender function. t=1 f=1 systems. Computer 8 (2009),Finally, 30–37.Bamshad we Mobasher,assume a Honghua basic usageDai, Tao scenario Luo, and Miki inwhich Nakagawa. model 2001. pa-Effective personalization based on Weiyang Lin, Sergio A Alvarez,rameters and Carolinaassociation are Ruiz. recomputed rule 2002. discovery Efficient periodically from adaptive-support web usage (typically, data. association In Proceedings arule few mining timesof the 3rd international workshop on For many methods, l is the identity function, Tfor recommender= 1 and systems. Data miningWeb information and knowledge and datadiscovery management6, 1 (2002),. ACM, 83–105. 9–15. a day). Computing the recommendation scores for a user F1 = 2. 
In this case, the factorization is givenGuilherme by: Vale Menezes, Jussara M Almeida, Fabiano Belem,´ Marcos Andre´ Gonc¸alves, An´ısio Lacerda, Edleno Silva De Moura,based Gisele L on Pappa, the Adriano model, Veloso, and and extracting Nivio Ziviani. the 2010. items Demand-driven correspond- tag recommendation. In . Springer, 402–417. S = S(1,1)S(1,2). (3) Machineing to Learning the top- and KnowledgeN scores, Discovery is assumedACM in Databases Computing to be Surveys, performed Vol. 1, No. in1, Articlereal 1, Publication date: January 2015. Andriy Mnih and Ruslan Salakhutdinov. 2007. Probabilistic matrix factorization. In Advances in neural time (1,1) |U|×D information(1,2) processing systems.. 1257–1264. Specific aspects of scenarios in which models need Because of their dimensions, S R Bamshadand S Mobasher, Honghuato Dai, be Tao recomputed Luo, and Miki in Nakagawa. real time, 2001. areEffective out personalization of the scope based of on this D×|I| ∈ ∈ R are often called the user-factor matrixassociation and item- rule discovery from web usage data. In Proceedings of the 3rd international workshop on Web information and datasurvey. management. ACM, 9–15. factor matrix, respectively. Figure 2 visualizes Equation 3. In this overview, we first survey existing models in Section 4. More complex models are often more realistic, but generally Next, we compare the existing deviation functions in Sec- ACM Computing Surveys, Vol. 1, No. 1, Article 1, Publication date: January 2015. contain more parameters which increases the risk of overfit- tion 5. Afterwards, we discuss the minimization algorithms ting. that are used for fitting the model parameters to the devia- Second, the number of terms T , the number of factor ma- tion functions in Section 6. Finally, we discuss the applica- trices Ft and the dimensions of the factor matrices are an 1 bility of rating-based methods in Section 7 and conclude in integral part of the model, independent of the data R . 
... Section 8.

Every entry in the factor matrices, however, is a parameter that needs to be computed based on the data R. We collectively denote these parameters as θ. Whenever we want to make this dependence explicit, we write S(θ) instead of S.¹

4. FACTORIZATION MODELS

Equation 2 gives a general description of all models for collaborative filtering. In this section, we discuss how the specific collaborative filtering models map to this equation.

4.1 Basic Models

A statistically well founded method is probabilistic latent semantic analysis (pLSA) by Hofmann, which is centered around the so called aspect model [19]. Hofmann models the probability p(i|u) that a user u will prefer an item i as the mixture of D probability distributions induced by the hidden variables d:

  p(i|u) = Σ_{d=1}^{D} p(i, d|u).

Furthermore, by assuming u and i conditionally independent given d, he obtains:

  p(i|u) = Σ_{d=1}^{D} p(i|d) · p(d|u).

This model corresponds to a basic two-factor matrix factorization:

  S = S^(1,1) S^(1,2)
  S_ui = p(i|u)
  S^(1,1)_ud = p(d|u)                              (4)
  S^(1,2)_di = p(i|d),

in which the D·|U| parameters in S^(1,1) and the D·|I| parameters in S^(1,2) are a priori unknown and need to be computed based on the data R. An appealing property of this model is the probabilistic interpretation of both the parameters and the recommendation scores. Fully in line with the probabilistic foundation, the parameters are constrained as:

  S^(1,1)_ud ≥ 0
  S^(1,2)_di ≥ 0
  Σ_{i∈I} p(i|d) = 1                               (5)
  Σ_{d=1}^{D} p(d|u) = 1,

expressing that both factor matrices are non-negative and that all row sums of S^(1,1) and S^(1,2) must be equal to 1 since they represent probabilities.

Latent dirichlet allocation (LDA) is a more rigorous statistical model, which puts Dirichlet priors on the parameters p(d|u) [6; 19]. However, for collaborative filtering these priors are integrated out and the resulting model for computing recommendation scores is again a simple two-factor factorization model.

The aspect model also has a geometric interpretation. In the training data R, every user is profiled as a binary vector in an |I|-dimensional space in which every dimension corresponds to an item. Analogously, every item is profiled as a binary vector in a |U|-dimensional space in which every dimension corresponds to a user. Now, in the factorized representation, every hidden variable d represents a dimension of a D-dimensional space. Therefore, the matrix factorization S = S^(1,1) S^(1,2) implies a transformation of both user- and item-vectors to the same D-dimensional space. A row vector S^(1,1)_u· ∈ R^(1×D) is the representation of user u in this D-dimensional space, and a column vector S^(1,2)_·i ∈ R^(D×1) is the representation of item i in this D-dimensional space. Figure 2 visualizes the user-vector of user u and the item-vector of item i. Now, as a result of the factorization model, a recommendation score S_ui is computed as the dotproduct of the user-vector of u with the item-vector of i:

  S_ui = Σ_{d=1}^{D} S^(1,1)_ud S^(1,2)_di
       = ||S^(1,1)_u·|| · ||S^(1,2)_·i|| · cos(φ_ui)      (6)
       = ||S^(1,2)_·i|| · cos(φ_ui),

with φ_ui the angle between the two vectors, and ||S^(1,1)_u·|| = 1 a probabilistic constraint of the model (Eq. 5). Therefore, an item i will be recommended if its vector norm ||S^(1,2)_·i|| is large and φ_ui, the angle of its vector with the user vector, is small. From this geometric interpretation we learn that the recommendation scores computed with the model in Equation 4 contain both a personalized factor cos(φ_ui) and a non-personalized, popularity based factor ||S^(1,2)_·i||.

Many other authors adopted this two-factor model. However, they abandoned its probabilistic foundation by removing the constraints on the parameters (Eq. 5) [21; 36; 37; 55; 73; 45; 52; 61; 16; 12]:

  S = S^(1,1) S^(1,2).                             (7)

Yet another interpretation of this two-factor model is that every hidden variable d represents a cluster containing both users and items. A large value S^(1,1)_ud means that user u has a large degree of membership in cluster d. Similarly, a large value S^(1,2)_di means that item i has a large degree of membership in cluster d. As such, pLSA can be interpreted as a soft clustering model. According to this interpretation, item i will be recommended to a user u if they have high degrees of membership in the same clusters.

Although much less common, hard clustering models for collaborative filtering also exist. Hofmann and Puzicha [20] proposed the model

  p(i|u, e(u) = e, d(i) = d) = p(i) · φ(e, d),     (8)

in which e(u) indicates the cluster user u belongs to and d(i) indicates the cluster item i belongs to. Furthermore, φ(e, d) is the association value between the user-cluster e and the item-cluster d. This cluster association factor increases or decreases the probability that u likes i relative to the independence model p(i|u) = p(i). As opposed to the aspect model, this model assumes E user-clusters only containing users and D item-clusters only containing items. Furthermore, a user or item belongs to exactly one cluster. The factorization model corresponding to this approach is:

  S = S^(1,1) S^(1,2) S^(1,3) S^(1,4)
  S_ui = p(i|u)
  S^(1,1)_ue = I(e(u) = e)
  S^(1,2)_ed = φ(e, d)
  S^(1,3)_di = I(d(i) = d)
  S^(1,4)_ij = p(i) · I(i = j),

in which I(true) = 1, I(false) = 0, and E and D are hyperparameters. The |U|·E parameters in S^(1,1), the E·D parameters in S^(1,2), the D·|I| parameters in S^(1,3) and the |I| parameters in S^(1,4) need to be computed based on the data.

¹High level statistics of the data might be taken into consideration to choose the model. There is, however, no clear, direct dependence on the data.
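The basic two-factor model of Equations 4-7, and its geometric reading in Equation 6, can be sketched numerically. The following is a minimal illustration, not from the survey itself; the toy dimensions, random factor matrices, and variable names are ours:

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, D = 4, 6, 3

# Factor matrices S1 (|U| x D) and S2 (D x |I|), cf. Equation 7.
S1 = rng.random((n_users, D))
S2 = rng.random((D, n_items))

# pLSA-style probabilistic constraints (Eq. 5): non-negative rows that sum to 1.
S1 = S1 / S1.sum(axis=1, keepdims=True)   # rows of S1 play the role of p(d|u)
S2 = S2 / S2.sum(axis=1, keepdims=True)   # rows of S2 play the role of p(i|d)

# Recommendation scores (Eq. 4): S_ui = sum_d S1[u, d] * S2[d, i].
S = S1 @ S2

# Geometric decomposition (Eq. 6): S_ui = ||S1_u|| * ||S2_i|| * cos(phi_ui).
u, i = 0, 2
norm_u = np.linalg.norm(S1[u])
norm_i = np.linalg.norm(S2[:, i])
cos_phi = S1[u] @ S2[:, i] / (norm_u * norm_i)
assert np.isclose(S[u, i], norm_u * norm_i * cos_phi)

# With the constraints of Eq. 5, every row of S is a distribution over items.
assert np.allclose(S.sum(axis=1), 1.0)
```

The last assertion follows because row-stochastic factors yield a row-stochastic product, which is exactly the probabilistic interpretation that the unconstrained model of Equation 7 gives up.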
Ungar and Foster [63] proposed a similar hard clustering method.

4.2 Explicitly Biased Models

In the above models, the recommendation score S_ui of an item i for a user u is the product of a personalized factor with an item-bias factor related to item-popularity. In Equation 6 the personalized factor is cos(φ_ui), and the bias factor is ||S^(1,2)_·i||. In Equation 8 the personalized factor is φ(e, d), and the bias factor is p(i). Other authors [28; 40; 24] proposed to model item- and user-biases as explicit terms instead of implicit factors. This results in the following factorization model:

  S = σ( S^(1,1) + S^(2,1) + S^(3,1) S^(3,2) )
  S^(1,1)_ui = b_u                                 (9)
  S^(2,1)_ui = b_i,

with σ the sigmoid link-function, and S^(1,1), S^(2,1) ∈ R^(|U|×|I|) the user- and item-bias matrices, in which all columns of S^(1,1) are identical and also all rows of S^(2,1) are identical. The |U| parameters in S^(1,1), the |I| parameters in S^(2,1), the |U|·D parameters in S^(3,1), and the D·|I| parameters in S^(3,2) need to be computed based on the data. D is a hyperparameter of the model. The goal of explicitly modeling the bias terms is to make the interaction term S^(3,1)S^(3,2) a pure personalization term. Although bias terms are commonly used for collaborative filtering with rating data, only a few works on collaborative filtering with binary, positive-only data use them.

4.3 Basic Neighborhood Models

Multiple authors proposed special cases of the basic two-factor factorization in Equation 7.

4.3.1 Item-based

In a first special case, S^(1,1) = R [10; 54; 45; 2; 34]. In this case, the factorization model is given by

  S = R S^(1,2).                                   (10)

Consequently, a user u is profiled by an |I|-dimensional binary vector R_u· and an item is profiled by an |I|-dimensional real valued vector S^(1,2)_·i. This model is often interpreted as an item-based neighborhood model because the recommendation score S_ui of item i for user u is computed as

  S_ui = R_u· · S^(1,2)_·i = Σ_{j∈I} R_uj · S^(1,2)_ji = Σ_{j∈I(u)} S^(1,2)_ji,

with I(u) the known preferences of user u, and a parameter S^(1,2)_ji typically interpreted as the similarity between items j and i, i.e., S^(1,2)_ji = sim(j, i). Consequently, S^(1,2) is often called the item-similarity matrix. The SLIM method [34] adopts this model and additionally imposes the constraints

  S^(1,2)_ji ≥ 0,  S^(1,2)_ii = 0.                 (11)

The non-negativity is imposed to enhance interpretability. The zero-diagonal is imposed to avoid finding a trivial solution for the parameters in which every item is maximally similar to itself and has zero similarity with any other item.

This model is rooted in the intuition that good recommendations are similar to the known preferences of the user. Although they do not present it as such, Gori et al. [18] proposed ItemRank, a method that is based on an interesting extension of this intuition: good recommendations are similar to other good recommendations, with a bias towards the known preferences of the user. The factorization model corresponding to ItemRank is based on PageRank [35] and given by:

  S = α · S S^(1,2) + (1 − α) · R.                 (12)

Because of the recursive nature of this model, S needs to be computed iteratively [18; 35].

4.3.2 User-based

Other authors proposed the symmetric counterpart of Equation 10 in which S^(1,2) = R [48; 2; 3]. In this case, the factorization model is given by

  S = S^(1,1) R,                                   (13)

which is often interpreted as a user-based neighborhood model because the recommendation score S_ui of item i for user u is computed as

  S_ui = S^(1,1)_u· · R_·i = Σ_{v∈U} S^(1,1)_uv · R_vi,

in which the parameter S^(1,1)_uv is interpreted as the similarity between users u and v, i.e., S^(1,1)_uv = sim(u, v). S^(1,1) is often called the user-similarity matrix. The intuition behind this model is that good recommendations are preferred by similar users. Furthermore, Aiolli [2; 3] foresees the possibility to rescale the scores such that popular items get less importance. This changes the factorization model to

  S = S^(1,1) R S^(1,3),
  S^(1,3)_ji = I(j = i) · c(i)^(−(1−β)),

in which S^(1,3) is a diagonal rescaling matrix that rescales the item scores according to the item popularity c(i), and β ∈ [0, 1] a hyperparameter. Additionally, Aiolli [3] imposes the constraint that S^(1,1) needs to be row normalized:

  ||S^(1,1)_u·|| = 1.

To the best of our knowledge, there exists no user-based counterpart for the ItemRank model of Equation 12.

4.3.3 Explicitly Biased

Furthermore, it is, obviously, possible to add explicit bias terms to neighborhood models. Wang et al. [68] proposed an item-based neighborhood model with one bias term:

  S = S^(1,1) + R S^(2,2),  S^(1,1)_ui = b_i,

and an analogous user-based model:

  S = S^(1,1) + S^(2,1) R,  S^(1,1)_ui = b_i.

4.3.4 Unified

Moreover, Verstrepen and Goethals [66] showed that the item- and user-based neighborhood models are two incomplete instances of the same general neighborhood model. Consequently, they propose KUNN, a complete instance of this general neighborhood model. The model they propose can be written as a weighted sum of a user- and an item-based model, in which the weights depend on the user u and the item i for which a recommendation score is computed:

  S = ( S^(1,1) R ) S^(1,3) + S^(2,1) ( R S^(2,3) )
  S^(1,3)_ij = I(i = j) · c(i)^(−1/2)              (14)
  S^(2,1)_uv = I(u = v) · c(u)^(−1/2).

Finally, note that the above matrix factorization based descriptions of nearest neighbors methods imply that matrix factorization methods and neighborhood methods are not two separate approaches, but two perspectives on the same approach.

4.4 Factored Similarity Neighborhood Models

The item-similarity matrix S^(1,2) in Equation 10 contains |I|² parameters, which is in practice often a very large number. Consequently, it can happen that the training data is not sufficient to accurately compute this many parameters. Furthermore, one often precomputes the item-similarity matrix S^(1,2) and performs the dotproducts R_u· · S^(1,2)_·i to compute the recommendation scores S_ui in real time. In this case, the dotproduct is between two |I|-dimensional vectors, which is often prohibitive in real time, high traffic applications. One solution is to factorize the similarity matrix², which leads to the following factorization model:

  S = R S^(1,2) S^(1,2)T,

in which every row of S^(1,2) ∈ R^(|I|×D) represents a D-dimensional item-profile vector, with D a hyperparameter [71; 8]. In this case the item-similarity matrix is equal to

  S^(1,2) S^(1,2)T,

which means that the similarity sim(i, j) between two items i and j is defined as the dotproduct of their respective profile vectors

  S^(1,2)_i· · ( S^(1,2)_j· )T.

This model only contains |I|·D parameters instead of |I|², which is much fewer since typically D ≪ |I|. Furthermore, by first precomputing the item vectors S^(1,2) and then precomputing the D-dimensional user-profile vectors given by RS^(1,2), the real time computation of a score S_ui encompasses a dotproduct between a D-dimensional user-profile vector and a D-dimensional item-profile vector. Since D ≪ |I|, this dotproduct is much less expensive and can be more easily performed in real time. Furthermore, there is no trivial solution for the parameters of this model, as is the case for the non factored item-similarity model in Equation 10. Consequently, it is never required to impose the constraints from Equation 11. To avoid scaling problems when fitting parameters, Weston et al. [71] augment this model with a diagonal normalization factor matrix:

  S = S^(1,1) R S^(1,3) S^(1,3)T                   (15)
  S^(1,1)_uv = I(u = v) · c(u)^(−1/2).             (16)

A limitation of this model is that it implies that the similarity matrix is symmetric. This might hurt the model's accuracy in certain applications such as recommending tags for images. For an image of the Eiffel tower that is already tagged Eiffel tower, for example, the tag Paris is a reasonable recommendation. However, for an image of the Louvre already tagged Paris, the tag Eiffel tower is a bad recommendation. Paterek solved this problem for rating data by representing every item by two separate D-dimensional vectors [41]. One vector represents the item if it serves as evidence for computing recommendations, the other vector represents the item if it serves as a candidate recommendation. In this way, they can model also asymmetric similarities. This idea is not restricted to rating data, and for binary, positive-only data, it was adopted by Steck [58]:

  S = S^(1,1) R S^(1,3) S^(1,4)                    (17)
  S^(1,1)_uv = I(u = v) · c(u)^(−1/2).             (18)

Kabbur et al. also adopted this idea. However, similar to Equation 9, they also add bias terms:

  S = S^(1,1) + S^(2,1) + S^(3,1) R S^(3,3) S^(3,4)
  S^(1,1)_ui = b_u
  S^(2,1)_ui = b_i                                 (19)
  S^(3,1)_uv = I(u = v) · c(u)^(−β/2),

with S^(3,3) ∈ R^(|I|×D) the matrix of item profiles when they serve as evidence, S^(3,4) ∈ R^(D×|I|) the matrix of item profiles when they serve as candidates, and β ∈ [0, 1] and D hyperparameters.

²Another solution is to enforce sparsity on the similarity matrix by means of the deviation function. This is discussed in Section 5.
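The item-based neighborhood scoring of Equation 10 and the factored-similarity speedup discussed above can be sketched in a few lines. This is our own illustrative example: the toy feedback matrix is made up, and cosine similarity is just one common way to fill the item-similarity matrix:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy binary, positive-only feedback matrix R (|U| x |I|).
R = np.array([[1, 1, 0, 0, 1],
              [1, 0, 1, 0, 0],
              [0, 1, 0, 1, 1],
              [1, 1, 1, 0, 0]], dtype=float)

# --- Item-based model (Eq. 10), here instantiated with cosine similarity.
norms = np.linalg.norm(R, axis=0)
sim = (R.T @ R) / np.outer(norms, norms)
np.fill_diagonal(sim, 0.0)     # zero diagonal, as in the SLIM constraints (Eq. 11)
S_item = R @ sim               # S_ui sums similarities of i to u's known preferences

u, i = 0, 2
known = np.flatnonzero(R[u])   # I(u), the known preferences of user u
assert np.isclose(S_item[u, i], sim[known, i].sum())

# --- Factored similarity (Section 4.4): replace sim by W @ W.T, W of size |I| x D.
D = 2
W = rng.random((R.shape[1], D))
# Precomputing the D-dimensional user profiles R @ W turns the online score into
# a cheap D-dimensional dotproduct; associativity guarantees identical scores.
user_profiles = R @ W
S_factored = user_profiles @ W.T
assert np.allclose(S_factored, R @ (W @ W.T))
```

The final assertion is the precomputation argument of Section 4.4 in code: (R W) Wᵀ and R (W Wᵀ) give the same scores, but the former never materializes the |I| × |I| similarity matrix.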
4.5 Higher Order Neighborhood Models

The nearest neighbors methods discussed up to this point only consider pairwise interactions sim(j, i) and/or sim(u, v) and aggregate all the relevant ones for computing recommendations. Several authors [10; 31; 64; 50; 30; 33; 7] have proposed to incorporate also higher order interactions sim(J, i) and/or sim(u, V) with J ⊂ I and V ⊂ U. Also in this case we can distinguish item-based approaches from user-based approaches.

4.5.1 Item-based

For the item-based approach, most authors [10; 31; 64; 30; 7] propose to replace the user-item matrix R in the pairwise model of Equation 10 by the user-itemset matrix X:

  S = X S^(1,2)
  X_uJ = Π_{j∈J} R_uj,                             (20)

with S^(1,2) ∈ R^(D×|I|) the itemset-item similarity matrix and X ∈ {0,1}^(|U|×D) the user-itemset matrix, in which D ≤ 2^|I| is the result of an itemset selection procedure. The HOSLIM method [7] adopts this model and additionally imposes the constraints in Equation 11.

The case in which D = 2^|I|, and S^(1,2) is dense, is intractable. Tractable methods either limit D ≪ 2^|I| or impose sparsity on S^(1,2) via the deviation function. While we discuss the latter in Section 5.11, there are multiple ways to do the former. Deshpande and Karypis [10] limit the number of itemsets by limiting the size of J. Alternatively, Christakopoulou and Karypis [7] only consider itemsets J that were preferred by more than a minimal number of users. van Leeuwen and Puspitaningrum, on the other hand, limit the number of higher order itemsets by using an itemset selection algorithm based on the minimal description length principle [64]. Finally, Menezes et al. claim that it is in certain applications possible to compute all higher order interactions if one computes all higher order interactions on demand instead of in advance [31]. However, delaying the computation does not reduce its exponential complexity. Only if a large portion of the users requires recommendations on a very infrequent basis, computations for these users can be spread over a very long period of time and their approach might be feasible.

An alternative model for incorporating higher order interactions between items consists of finding the best association rule for making recommendations [50; 33]. This corresponds to the matrix factorization model

  S = X ⊗ S^(1,2),
  X_uJ = Π_{j∈J} R_uj,

with

  (A ⊗ B)_xy = max_{i=1...m} A_xi B_iy,

in which A ∈ R^(n×m) and B ∈ R^(m×k). We did not find a convincing motivation for either of the two aggregation strategies. Moreover, multiple authors report that their attempt to incorporate higher order interactions heavily increased computational costs and did not significantly improve the results [10; 64].

4.5.2 User-based

Incorporating higher order interactions between users can be achieved by replacing the user-item matrix in Equation 13 by the userset-item matrix Y [30; 60]:

  S = S^(1,1) Y
  Y_Vi = Π_{v∈V} R_vi,                             (21)

with Y ∈ {0,1}^(D×|I|) and S^(1,1) ∈ R^(|U|×D) the user-userset similarity matrix, in which D ≤ 2^|U| is the result of a userset selection procedure [30; 60].

4.6 Multi-profile Models

When multiple users, e.g., members of the same family, share a single account, or when a user has multiple distinct tastes, the above matrix factorization models can be too limited because they aggregate all the distinct tastes of an account into one vector [67; 70; 26]. Therefore, Weston et al. [70] propose MaxMF, in which they model every account with multiple vectors instead of just one. Then, for every candidate recommendation, their model chooses the vector that maximizes the score of the candidate:

  S_ui = max_{p=1...P} ( S^(1,1,p) S^(1,2) )_ui.

Kabbur and Karypis [26] argue that this approach worsens the performance for accounts with homogeneous taste or a low number of known preferences and therefore propose NLMFi, an adapted version that combines a global account profile with multiple taste-specific account profiles:

  S_ui = ( S^(1,1) S^(1,2) )_ui + max_{p=1...P} ( S^(2,1,p) S^(2,2) )_ui.

Alternatively, they also propose NLMFs, a version in which S^(2,2) = S^(1,2), i.e., the item profiles are shared between the global and the taste-specific terms:

  S_ui = ( S^(1,1) S^(1,2) )_ui + max_{p=1...P} ( S^(2,1,p) S^(1,2) )_ui.

An important downside of the above two models is that P, the number of distinct account-profiles, is a hyperparameter that is the same for every account and cannot be too large (typically 2 or 3) to avoid an explosion of the computational cost and the number of parameters. DAMIB-Cover [67], on the other hand, starts from the item-based model in Equation 10, and efficiently considers P = 2^c(u) different profiles for every account u. Specifically, every profile corresponds to a subset S^(p) of the known preferences of u, I(u). This results in the factorization model

  S_ui = max_{p=1...2^c(u)} ( R S^(1,2,p) S^(1,3) )_ui,
  S^(1,2,p)_jk = I(j = k) · |S^(p) ∩ {j}| / |S^(p)|^β,

with S^(1,2,p) a diagonal matrix that selects and rescales the known preferences of u that correspond to the subset S^(p) ⊆ I(u), and β ∈ [0, 1] a hyperparameter.

5. DEVIATION FUNCTIONS

The factorization models described in Section 4 contain many parameters, i.e., the entries in the a priori unknown factor matrices, which we collectively denote as θ. These parameters need to be computed based on the training data R. Computing all the parameters in a factorization model is done by minimizing the deviation between the training data R and the parameters θ of the factorization model. This deviation is measured by a deviation function D(θ, R). Many deviation functions exist, and every deviation function mathematically expresses a different interpretation of the concept deviation.

The majority of deviation functions is non-convex in the parameters θ. Consequently, minimization algorithms can only compute parameters that correspond to a local minimum. The initialization of the parameters in the factorization model and the chosen hyperparameters of the minimization algorithm determine which local minimum will be found. If the value of the deviation function in this local minimum is not much higher than that of the global minimum, it is considered a good minimum.

5.1 Probabilistic Scores-based

Hofmann [19] proposes to compute the optimal parameters θ* of the model as those that maximize the loglikelihood of the model, i.e., log p(R|θ), the logprobability of the known preferences given the model:

  θ* = arg max_θ Σ_{u∈U} Σ_{i∈I} R_ui log p(R_ui|θ).      (22)
Furthermore, he models p(R_ui = 1|θ) with S_ui, i.e., he interprets the scores S as probabilities. This gives

  θ* = arg max_θ Σ_{u∈U} Σ_{i∈I} R_ui log S_ui,

which is equivalent to minimizing the deviation function

  D(θ, R) = − Σ_{u∈U} Σ_{i∈I} R_ui log S_ui.       (23)

Recall that the parameters θ are not directly visible in the right hand side of this equation, but that the scores S_ui are computed based on the factorization model which contains the parameters θ, i.e., we use the short notation S_ui instead of the full notation S_ui(θ). This deviation function was also adopted by Blei et al. in their approach called latent dirichlet allocation (LDA) [6].

Furthermore, note that Hofmann only maximizes the logprobability of the observed feedback and ignores the missing preferences. This is equivalent to the assumption that there is no information in the missing preferences, which implicitly corresponds to the assumption that feedback is missing at random (MAR) [56]. Clearly, this is not a realistic assumption, since negative feedback is missing by definition, which is obviously non random. Moreover, the number of items in collaborative filtering problems is typically very large, and only a very small subset of them will be preferred by a user. Consequently, the probability that a missing preference is actually not preferred is high. Hence, in reality, the feedback is missing not at random (MNAR), and a good deviation function needs to account for this [56].

One approach is to assume, for the purposes of defining the deviation function, that all missing preferences are not preferred. This assumption is called all missing are negative (AMAN) [37]. Under this assumption, the parameters that maximize the loglikelihood of the model are computed as

  θ* = arg max_θ Σ_{u∈U} Σ_{i∈I} log p(R_ui|θ).

For binary, positive-only data, one can model p(R_ui|θ) as S_ui^(R_ui) · (1 − S_ui)^(1−R_ui). In this case, the parameters are computed as

  θ* = arg max_θ Σ_{u∈U} Σ_{i∈I} ( R_ui log S_ui + (1 − R_ui) log(1 − S_ui) ).

While the AMAN assumption is more realistic than the MAR assumption, it adopts a conceptually flawed missing data model. Specifically, it assumes that all missing preferences are not preferred, which contradicts the goal of collaborative filtering: to find the missing preferences that are actually preferred. A better missing data model still assumes that all missing preferences are not preferred. However, it attaches a lower confidence to the assumption that a missing preference is not preferred, and a higher confidence to the assumption that an observed preference is indeed preferred. One possible way to apply this missing data model was proposed by Steck [56]. Although his original approach is more general, we give a specific simplified version for binary, positive-only data:

  θ* = arg max_θ Σ_{u∈U} Σ_{i∈I} ( R_ui log S_ui + α · (1 − R_ui) log(1 − S_ui) ),     (24)

in which the hyperparameter α < 1 attaches a lower importance to the contributions that correspond to R_ui = 0. Johnson [24] proposed a very similar computation, but does not motivate why he deviates from the theoretically well founded version of Steck. Furthermore, Steck adds regularization terms to avoid overfitting and finally proposes the deviation function

  D(θ, R) = Σ_{u∈U} Σ_{i∈I} ( − R_ui log S_ui − α · (1 − R_ui) · log(1 − S_ui) + λ · α · ( ||θ_u||²_F + ||θ_i||²_F ) ),

with ||·||_F the Frobenius norm, λ a regularization hyperparameter, and θ_u, θ_i the vectors that group the model parameters related to user u and item i, respectively.
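The probabilistic deviation functions above are straightforward to write down. The sketch below is ours (toy data, made-up function names); it contrasts the MAR-style deviation of Equation 23, which ignores the missing entries, with the α-weighted objective of Equation 24 written as a deviation:

```python
import numpy as np

rng = np.random.default_rng(4)
R = (rng.random((5, 8)) < 0.3).astype(float)       # toy binary feedback matrix
S = np.clip(rng.random((5, 8)), 1e-6, 1 - 1e-6)    # toy scores, interpreted as p(R_ui=1)

def deviation_mar(S, R):
    """Eq. 23: only the known preferences contribute (MAR-style)."""
    return -(R * np.log(S)).sum()

def deviation_steck(S, R, alpha):
    """Negative objective of Eq. 24: missing entries also contribute,
    but with a lower importance alpha < 1."""
    return -(R * np.log(S) + alpha * (1 - R) * np.log(1 - S)).sum()

# With alpha = 0, the missing preferences are ignored and Eq. 24 reduces to Eq. 23.
assert np.isclose(deviation_steck(S, R, alpha=0.0), deviation_mar(S, R))
# With alpha > 0, the extra terms are non-negative, so the deviation only grows.
assert deviation_steck(S, R, alpha=0.5) >= deviation_mar(S, R)
```

Setting alpha between these extremes is exactly the compromise between the MAR and AMAN assumptions described in the text.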
5.2 Basic Squared Error-based

Most deviation functions, however, abandon the interpretation that the scores S are probabilities. In this case, one can choose to model p(R_ui|θ) with a normal distribution N(R_ui|S_ui, σ). By additionally adopting the AMAN assumption, the optimal parameters are computed as the ones that maximize the loglikelihood log p(R|θ):

  θ* = arg max_θ Σ_{u∈U} Σ_{i∈I} log N(R_ui|S_ui, σ),

which is equivalent to

  θ* = arg max_θ − Σ_{u∈U} Σ_{i∈I} (R_ui − S_ui)²
     = arg min_θ Σ_{u∈U} Σ_{i∈I} (R_ui − S_ui)².

The Eckart-Young theorem [13] states that the scores matrix S that results from these parameters θ* is the same as the one found by singular value decomposition (SVD) with the same dimensions of the training data matrix R. As such, the theorem relates the above approach of minimizing the squared error between R and S to latent semantic analysis (LSA) [9] and the SVD based collaborative filtering methods [8; 49].

Alternatively, it is possible to compute the optimal parameters as those that maximize the logposterior log p(θ|R), which relates to the loglikelihood log p(R|θ) as

  p(θ|R) ∝ p(R|θ) · p(θ).

When p(θ), the prior distribution of the parameters, is chosen to be a zero-mean, spherical normal distribution, maximizing the logposterior is equivalent to minimizing the deviation function [32]

  D(θ, R) = Σ_{u∈U} Σ_{i∈I} ( (R_ui − S_ui)² + λ_u · ||θ_u||²_F + λ_i · ||θ_i||²_F ),

with λ_u, λ_i regularization hyperparameters. Hence, maximizing the logposterior instead of the loglikelihood is equivalent to adding a regularization term. This deviation function is adopted by the FISM and NLMF methods [27; 26]. In an alternative interpretation of this deviation function, S is a factorized approximation of R, and the deviation function minimizes the squared error of the approximation. The regularization with the Frobenius norm is added to avoid overfitting. For the SLIM and HOSLIM methods, an alternative regularization term is proposed:

  D(θ, R) = Σ_{u∈U} Σ_{i∈I} (R_ui − S_ui)² + Σ_{t=1}^{T} Σ_{f=1}^{F} ( λ_R ||S^(t,f)||²_F + λ_1 ||S^(t,f)||_1 ),

with ||·||_1 the l1-norm. Whereas the role of the Frobenius norm is to avoid overfitting, the role of the l1-norm is to introduce sparsity. The combined use of both norms is called elastic net regularization, which is known to implicitly group correlated items [34]. The sparsity induced by the l1-norm regularization lowers the memory required for storing the model and allows to speed up the computation of recommendations by means of the sparse dotproduct. Even more sparsity can be obtained by fixing a majority of the parameters to 0, based on a simple feature selection method. Ning and Karypis [34] empirically show that this significantly reduces runtimes, while only slightly reducing the accuracy.

5.3 Weighted Squared Error-based

These squared error based deviation functions can also be adapted to diverge from the AMAN assumption to a missing data model that attaches lower confidence to the missing preferences [21; 37; 56]:

  D(θ, R) = Σ_{u∈U} Σ_{i∈I} W_ui (R_ui − S_ui)² + Σ_{t=1}^{T} Σ_{f=1}^{F} λ_tf ||S^(t,f)||²_F,      (25)

in which W ∈ R^(|U|×|I|) assigns weights to the values in R. The higher W_ui, the higher the confidence about R_ui. Hu et al. [21] provide two potential definitions of W:

  W_ui = 1 + β R_ui,
  W_ui = 1 + α log(1 + R_ui/ε),

with α, β, ε hyperparameters. Clearly, this method is not limited to binary data, but works on positive-only data in general. We, however, only consider its use for binary, positive-only data. Equivalently, Steck [56] proposed:

  W_ui = R_ui + (1 − R_ui) · α,

with α < 1. Additionally, Steck [57] pointed out that a preference is more likely to show up in the training data if the item is more popular. To compensate for this bias, Steck proposes to weight the known preferences non-uniformly:

  W_ui = R_ui · ( C / c(i) )^β + (1 − R_ui) · α,

with C a constant, |R| the number of non-zeros in the training data R, and β ∈ [0, 1] a hyperparameter. Analogously, a preference is more likely to be missing in the training data if the item is less popular. To compensate for this bias, Steck proposes to weight the missing preferences non-uniformly:

  W_ui = R_ui + (1 − R_ui) · C · c(i)^β,           (26)

with C a constant. Steck proposed these two weighting strategies as alternatives to each other. However, we believe that they can be combined since they are the application of the same idea to the known and missing preferences respectively.

Although they provide less motivation, Pan et al. [37] arrive at similar weighting schemes. They propose W_ui = 1 if R_ui = 1 and give three possibilities for the case R_ui = 0:

  W_ui = δ,                                        (27)
  W_ui = α · c(u),                                 (28)
  W_ui = α · ( |U| − c(i) ),                       (29)

with δ ∈ [0, 1] a uniform hyperparameter and α a hyperparameter such that W_ui ≤ 1 for all pairs (u, i) for which R_ui = 0. In the first case, all missing preferences get the same weight. In the second case, a missing preference is more negative if the user already has many preferences. In the third case, a missing preference is less negative if the item is popular. Interestingly, the third weighting scheme is orthogonal to the one of Steck in Equation 26. Additionally, Pan et al. [37] propose a deviation function with an alternative regularization:

  D(θ, R) = Σ_{u∈U} Σ_{i∈I} W_ui ( (R_ui − S_ui)² + λ ( ||θ_u||²_F + ||θ_i||²_F ) ).      (30)

Yao et al. [73] adopt a more complex missing data model that has a hyperparameter p, which indicates the overall likelihood that a missing preference is preferred. This translates into the deviation function:

  D(θ, R) = Σ_{u∈U} Σ_{i∈I} R_ui W_ui (1 − S_ui)²
          + Σ_{u∈U} Σ_{i∈I} (1 − R_ui) W_ui (p − S_ui)²
          + Σ_{t=1}^{T} Σ_{f=1}^{F} λ_tf ||S^(t,f)||²_F.      (31)

The special case with p = 0 reduces this deviation function to the one in Equation 25. An even more complex missing data model was proposed by Sindhwani et al. [55].
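The weighted squared-error deviation of Equation 25, with the first weighting scheme of Hu et al. (W = 1 + βR), can be evaluated directly for any candidate factorization. A minimal sketch with made-up toy data and our own function names:

```python
import numpy as np

rng = np.random.default_rng(2)
R = (rng.random((4, 6)) < 0.3).astype(float)   # toy binary feedback matrix
D = 3
U = rng.normal(scale=0.1, size=(4, D))         # user factor matrix (part of theta)
V = rng.normal(scale=0.1, size=(D, 6))         # item factor matrix (part of theta)

def deviation_weighted(R, U, V, beta, lam):
    """Weighted squared error (Eq. 25) with Hu et al.'s weights W = 1 + beta*R:
    every (u, i) pair contributes (AMAN-style), but known preferences
    get a higher confidence weight."""
    S = U @ V
    W = 1.0 + beta * R
    err = (W * (R - S) ** 2).sum()
    reg = lam * ((U ** 2).sum() + (V ** 2).sum())
    return err + reg

base = deviation_weighted(R, U, V, beta=40.0, lam=0.01)
# beta = 0 recovers the unweighted squared error, which can only be smaller,
# since the extra weighted terms are non-negative.
assert deviation_weighted(R, U, V, beta=0.0, lam=0.01) <= base
```

A fitting procedure (for instance alternating least squares) would then minimize this function over U and V; here we only evaluate it.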

The corresponding deviation function is

  D(θ, R) = Σ_{u∈U} Σ_{i∈I} R_ui W_ui (1 − S_ui)²
          + Σ_{u∈U} Σ_{i∈I} (1 − R_ui) W_ui (P_ui − S_ui)²
          + Σ_{t=1}^{T} Σ_{f=1}^{F} λ_tf ||S^(t,f)||²_F
          − λ_H Σ_{u∈U} Σ_{i∈I} (1 − R_ui) H(P_ui),      (32)

with P and W additional parameters of the model that need to be computed based on the data R, together with all other parameters. The last term contains the entropy function H and serves as a regularizer for P. Furthermore, they define the constraint

  1 / ( |U||I| − |R| ) · Σ_{u∈U} Σ_{i∈I: R_ui=0} P_ui = p,

which expresses that the average probability that a missing value is actually one must be equal to the hyperparameter p. To reduce the computational cost, they fix P_ui = 0 for most (u, i)-pairs and randomly choose a few (u, i)-pairs for which P_ui is computed based on the data. It seems that this simplification completely offsets the modeling flexibility that was obtained by introducing P. Additionally, they simplify W as the one-dimensional matrix factorization

  W_ui = V_u · V_i.

A conceptual inconsistency of this deviation function is that although the recommendation score is given by S_ui, P_ui could also be used. Hence, there exist two parameters for the same concept, which is, at best, ambiguous.

5.4 Maximum Margin-based

A disadvantage of the squared error-based deviation functions is their symmetry. For example, if R_ui = 1 and S_ui = 0, (R_ui − S_ui)² = 1. This is desirable behavior, because we want to penalize the model for predicting that a preference is not preferred. However, if S_ui = 2, we obtain the same penalty: (R_ui − S_ui)² = 1. This, on the other hand, is not desirable, because we do not want to penalize the model for predicting that a preference will definitely be preferred.

A maximum margin-based deviation function does not suffer from this problem [36]:

  D(θ, R) = Σ_{u∈U} Σ_{i∈I} W_ui · h( R̃_ui · S_ui ) + λ · ||S||_Σ,      (33)

with ||·||_Σ the trace norm, λ a regularization hyperparameter, h( R̃_ui · S_ui ) a smooth hinge loss given by Figure 3 [46], W given by one of the Equations 27-29, and the matrix R̃ defined as

  R̃_ui = 1 if R_ui = 1,
  R̃_ui = −1 if R_ui = 0.

This deviation function incorporates the confidence about the training data by means of W, and the missing knowledge about the degree of preference by means of the hinge loss h( R̃_ui · S_ui ). Since the degree of a preference R_ui = 1 is considered unknown, a value S_ui > 1 is not penalized if R_ui = 1.

Figure 3: Shown are the loss function values h(z) (left) and the gradients dh(z)/dz (right) for the Hinge and Smooth Hinge. Note that the gradients are identical outside the region z ∈ (0, 1) [46].

5.5 Overall Ranking Error-based

The scores S computed by a collaborative filtering method are used to personally rank all items for every user. Therefore, one can argue that it is more natural to directly optimize the ranking of the items instead of their individual scores. Rendle et al. [45] aim to maximize the area under the ROC curve (AUC), which is given by:

  AUC = 1/|U| · Σ_{u∈U} 1 / ( c(u) · ( |I| − c(u) ) ) · Σ_{R_ui=1} Σ_{R_uj=0} I(S_ui > S_uj).      (34)

If the AUC is higher, the pairwise rankings induced by the model S are more in line with the observed data R. However, because I(S_ui > S_uj) is non-differentiable, it is impossible to actually compute the parameters that (locally) maximize the AUC. Their solution is a deviation function, called the Bayesian Personalized Ranking (BPR)-criterium, which is a differentiable approximation of the negative AUC from which constant factors have been removed and to which a regularization term has been added:

  D(θ, R) = − Σ_{u∈U} Σ_{R_ui=1} Σ_{R_uj=0} log σ(S_ui − S_uj) + Σ_{t=1}^{T} Σ_{f=1}^{F} λ_tf ||S^(t,f)||²_F,      (35)

with σ(·) the sigmoid function and λ_tf regularization hyperparameters. Since this approach explicitly accounts for the missing data, it corresponds to the MNAR assumption.

Pan and Chen claim that it is beneficial to relax the BPR deviation function by Rendle et al. to account for noise in the data [38; 39]. Specifically, they propose CoFiSet [38], which allows certain violations S_ui < S_uj when R_ui > R_uj:

  D(θ, R) = − Σ_{u∈U} Σ_{I⊆I(u)} Σ_{J⊆I∖I(u)} log σ( ( Σ_{i∈I} S_ui ) / |I| − ( Σ_{j∈J} S_uj ) / |J| ) + Γ(θ),      (36)

X Σ is a complicated non-differentiable function for tive to outliers) and the Logistic (which “rewards” large ∥which∥ it is not easy to find the subdifrential. Finding good output values). We use the Smooth Hinge and the corre- descent directions for (4) is not easy. On the other hand, the sponding objective for our experiments in Section 4. with Γ(θ) a regularization term that slightly deviates from they optimize a lower bound instead. After also adding reg- the one proposed by Rendle et al. for no clear reason. Alter- ularization terms, their final deviation function is given by natively, they propose GBPR [39], which relaxes the BPR deviation function in a different way: X X  (θ, R) = Rui log σ(Sui) X X X D − (θ, R) = Γ(θ) u∈U i∈I D u∈U R =1 R =0 X  ui uj + log (1 Ruj σ(Suj Sui)) P ! − − Sgi j∈I 2 g∈Gu,i log σ α + (1 α) Sui Suj , (37) T F − · u,i − · − X X (t,f) 2 |G | + λtf S (41) || ||F t=1 f=1 with u,i the union of u with a random subset of g G { } { ∈ u Rgi = 1 , and α a hyperparameter. U\{ }| } Furthermore, Kabbur et al. [27] also aim to maximize AUC with λ a regularization constant and σ( ) the sigmoid func- with their method FISMauc. However, they propose to use tion. A disadvantage of this deviation· function is that it a different differentiable approximation of AUC: ignores all missing data, i.e., it corresponds to the MAR assumption. X X X 2 (θ, R) = ((Rui Ruj ) (Sui Suj )) An alternative for MRR is mean average precision (MAP): D − − − u∈U Rui=1 Ruj =0 T F 1 X 1 X 1 X X (t,f) 2 MAP + λtf S F . (38) = || || c(u) r>(Sui Su·) t=1 f=1 |U| u∈U R)ui=1 | X I(Suj Sui), The same model without regularization was proposed by ≥ T¨oscher and Jahrer [62]: Ruj =1

(θ, R) = which still emphasizes the top of the ranking, but less ex- D X X X 2 tremely than MRR. However, MAP is also non-smooth, pre- ((Rui Ruj ) (Sui Suj )) . (39) − − − venting its direct optimization. For MAP, too, Shi et al. u∈U R =1 R =0 ui uj derived a differentiable version, called TFMAP [51]: A similar deviation function was proposed by Tak´acsand Tikk [61]: X 1 X X (θ, R) = σ (Sui) σ (Suj Sui) D − c(u) − X X X u∈U R) =1 Ruj =1 (θ, R) = Rui w(j) ui D u∈U i∈I j∈I T F X X (t,f) 2 2 + λ S F , (42) ((Sui Suj ) (Rui Ruj )) , (40) · || || · − − − t=1 f=1 with w( ) a user-defined item weighting function. The sim- · plest choice is w(j) = 1 for all j. An alternative proposed with λ a regularization hyperparameter. Besides, the for- by Tak´acsand Tikk is w(j) = c(j). Another important mulation of TFMAP in Equation 42 is a simplified version difference is that this deviation function also minimizes the of the original, concieved for multi-context data, which is score-difference between the known preferences mutually. out of the scope of this survey. Finally, it is remarkable that both T¨oscher and Jahrer, and Dhanjal et al. proposed yet another alternative [12]. They Tak´acs and Tikk, explicitly do not add a regularization term, start from the definition of AUC in Equation 34, approx- whereas most other authors find that the regularization term imate the indicator function I(Sui Suj ) by the squared 2 − is important for their model’s performance. hinge loss (max (0, 1 + Suj Sui)) and emphasize the de- viation at the top of the ranking− by means of the hyperbolic 5.6 Top of Ranking Error-based tangent function tanh( ): · Very often, only the N highest ranked items are shown to users. Therefore, Shi et al. [52] propose to improve the top (θ, R) = of the ranking at the cost of worsening the overall ranking. D Specifically, they propose the CLiMF method, which aims ! X X X 2 mean reciprocal rank (MRR) tanh (max (0, 1 + Suj Sui)) . 
(43) to minimize the instead of the − AUC. The MRR is defined as u∈U Rui=1 Ruj=0 −1 1 X   MRR = r> max Sui Su· , Rui=1 | 5.7 k-Order Statistic-based u∈U |U| On the one hand, the deviation functions in Equations 35-40 in which r>(a b) gives the rank of a among all numbers in b try to minimize the overall rank of the known preferences. when ordered| in descending order. Unfortunately, the non- On the other hand, the deviation functions in Equations smoothness of r>() and max makes the direct optimization 41 and 43 try to push one or a few known preferences as of MRR unfeasible. Hence, Shi et al. derive a differentiable high as possible to the top of the item-ranking. Weston et version of MRR. Yet, optimizing it is intractable. Therefore, al. [71] propose to minimize a trade-off between the above two extremes: for user u, he does not use the formulation in Equation 34,   but uses the equivalent X X r>(Sui Sui Rui = 1 ) w | { | } c(u) 1 u∈U R =1 AUC u = ui c(u) ( c(u)) X I(Suj + 1 Sui) " · |I| − ! # ≥ , (44) c(u) + 1 X · r>(Suj Suk Ruk = 0 ) ( + 1)c(u) rui , Ruj =0 | { | } · |I| − 2 − Rui=1 with w( ) a function that weights the importance of the with rui the rank of item i in the recommendation list of known preference· as a function of their predicted rank among user u. Second, nDCGu is defined as all known preferences. This weighting function is user-defined and determines the trade-off between the two extremes, i.e., DCG X 1 nDCGu = , DCG = , minimizing the mean rank of the known preferences and DCGopt log(rui + 1) R =1 minimizing the maximal rank of the known preferences. Be- ui cause this deviation function is non-differentiable, Weston with DCGopt the DCG of the optimal ranking. In both et al. propose the differentiable approximation cases, Steck proposes to relate the rank rui with the recom- mendation score Rui by means of a link function   X X r>(Sui Sui Rui = 1 ) n  o (θ, R) = w | { | } ˆ D c(u) rui = max 1, 1 cdf (Sui) , (47) u∈U Rui=1 |I| · −
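The ranking criteria above are typically optimized by stochastic sampling of pairwise terms. As a concrete illustration, the following is a minimal numpy sketch of a single SGD update on the BPR criterion of Equation 35, for a basic two-factor model S = U Vᵀ; the function name, learning rate, and regularization values are illustrative, not from [45].

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def bpr_step(U, V, u, i, j, lr=0.05, reg=0.01):
    """One SGD update on the BPR criterion for S = U @ V.T:
    push S_ui above S_uj for a sampled triple with R_ui = 1, R_uj = 0."""
    x = U[u] @ (V[i] - V[j])        # current margin S_ui - S_uj
    g = sigmoid(-x)                 # weight from the gradient of -log sigmoid(x)
    # compute all gradients from the old values, then apply them
    du = g * (V[i] - V[j]) - reg * U[u]
    di = g * U[u] - reg * V[i]
    dj = -g * U[u] - reg * V[j]
    U[u] += lr * du
    V[i] += lr * di
    V[j] += lr * dj
```

In training, triples (u, i, j) with Rui = 1 and Ruj = 0 are sampled repeatedly and this update is applied until a convergence criterion is met; each step increases the margin Sui − Suj for the sampled pair.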

5.7 k-Order Statistic-based

On the one hand, the deviation functions in Equations 35-40 try to minimize the overall rank of the known preferences. On the other hand, the deviation functions in Equations 41 and 43 try to push one or a few known preferences as high as possible to the top of the item-ranking. Weston et al. [71] propose to minimize a trade-off between these two extremes:

  D(θ, R) = Σ_{u∈U} Σ_{Rui=1} w( r>(Sui | {Sui | Rui = 1}) / c(u) ) · Σ_{Ruj=0} I(Suj + 1 ≥ Sui),   (44)

with w(·) a function that weights the importance of the known preferences as a function of their predicted rank among all known preferences. This weighting function is user-defined and determines the trade-off between the two extremes, i.e., minimizing the mean rank of the known preferences and minimizing the maximal rank of the known preferences. Because this deviation function is non-differentiable, Weston et al. propose the differentiable approximation

  D(θ, R) = Σ_{u∈U} Σ_{Rui=1} w( r>(Sui | {Sui | Rui = 1}) / c(u) ) · Σ_{Ruj=0} max(0, 1 + Suj − Sui) / ( N^{−1}(|I| − c(u)) ),   (45)

where they replaced the indicator function by the hinge loss and approximated the rank with N^{−1}(|I| − c(u)), in which N is the number of items k that were randomly sampled until Suk + 1 > Sui. Weston et al. [69] provide a justification for this approximation. Furthermore, they use the simple weighting function

  w( r>(Sui | {Sui | Rui = 1}) / c(u) ) = 1 if r>(Sui | S) = k, with S ⊆ {Sui | Rui = 1} a random subset of size K, and 0 otherwise,

i.e., from the set S that contains K randomly sampled known preferences, ranked by their predicted score, only the item at rank k is selected to contribute to the training error. When k is set low, the top of the ranking will be optimized at the cost of a worse mean rank. The opposite will hold when k is set high. The regularization is not done by adding a regularization term, but by forcing the norm of the factor matrices to be below a maximum, which is a hyperparameter.

Alternatively, Weston et al. also propose a simplified deviation function:

  D(θ, R) = Σ_{u∈U} Σ_{Rui=1} w( r>(Sui | {Sui | Rui = 1}) / c(u) ) · Σ_{Ruj=0} max(0, 1 + Suj − Sui).   (46)

5.8 Rank Link Function-based

The ranking-based deviation functions discussed so far are all tailor-made differentiable approximations, with respect to the recommendation scores, of a certain ranking quality measure, like the AUC, the MRR or the k-th order statistic. Steck [58] proposes a more general approach that is applicable to any ranking quality measure that is differentiable with respect to the rankings. He demonstrates his method on two ranking quality measures: AUC and nDCG. For AUCu, the AUC for user u, he does not use the formulation in Equation 34, but the equivalent

  AUCu = ( (|I| + 1)·c(u) − (c(u) + 1)·c(u)/2 − Σ_{Rui=1} rui ) / ( c(u) · (|I| − c(u)) ),

with rui the rank of item i in the recommendation list of user u. Second, nDCGu is defined as

  nDCGu = DCG / DCGopt,  DCG = Σ_{Rui=1} 1 / log(rui + 1),

with DCGopt the DCG of the optimal ranking. In both cases, Steck proposes to relate the rank rui to the recommendation score Sui by means of a link function

  rui = max{ 1, |I| · (1 − cdf(Ŝui)) },   (47)

with Ŝui = (Sui − µu)/stdu the normalized recommendation score, in which µu and stdu are the mean and standard deviation of the recommendation scores for user u, and cdf the cumulative distribution of the normalized scores. However, to know cdf, he needs to assume a distribution for the normalized recommendation scores. He motivates that a zero-mean normal distribution of the recommendation scores is a reasonable assumption. Consequently, cdf(Ŝui) = probit(Ŝui). Furthermore, he proposes to approximate the probit function with the computationally more efficient sigmoid function, or the even more efficient function

  rankSE(Ŝui) = [1 − ([1 − Ŝui]+)²]+,

with [x]+ = max{0, x}. In his pointwise approach, Steck uses Equation 47 to compute the ranks based on the recommendation scores. In his listwise approach, on the other hand, he finds the actual rank of a recommendation score by sorting the recommendation scores for every user, and uses Equation 47 only to compute the gradient of the rank with respect to the recommendation score during the minimization procedure. After adding regularization terms, he proposes the deviation function

  D(θ, R) = Σ_{u∈U} Σ_{Rui=1} ( L(Sui) + λ · (‖θu‖²_F + ‖θi‖²_F) + γ · Σ_{j∈I} S²uj ),   (48)

with θu, θi the vectors that group all model parameters related to user u and item i, respectively, λ, γ regularization hyperparameters, and L(Sui) equal to −AUCu or −nDCGu, which are a function of Sui via rui. The last regularization term is introduced to enforce an approximately normal distribution on the scores.
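The link function of Equation 47 can be sketched in a few lines of numpy. The sigmoid-for-probit substitution follows Steck's suggestion; the 1.702 scaling constant (which makes sigmoid(1.702x) track the probit) and the function names are our assumptions for illustration.

```python
import numpy as np

def approximate_ranks(S):
    """Rank link function of Equation 47: map scores to approximate ranks,
    with a scaled sigmoid standing in for the probit cdf (an assumption)."""
    S_hat = (S - S.mean(axis=1, keepdims=True)) / S.std(axis=1, keepdims=True)
    cdf = 1.0 / (1.0 + np.exp(-1.702 * S_hat))
    return np.maximum(1.0, S.shape[1] * (1.0 - cdf))

def exact_ranks(S):
    """Actual 1-based descending ranks, found by sorting (the listwise approach)."""
    order = np.argsort(-S, axis=1)
    ranks = np.empty(S.shape, dtype=float)
    ranks[np.arange(S.shape[0])[:, None], order] = np.arange(1, S.shape[1] + 1)
    return ranks
```

Because Equation 47 is monotonically decreasing in the score, the approximate ranks order the items exactly like the true ranks; how close the rank values themselves are depends on how well the normality assumption holds.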

5.9 Posterior KL-divergence-based

In our framework, the optimal parameters θ* are computed as θ* = arg min_θ D(θ, R). However, we can consider this a special case of

  θ* = ψ( arg min_φ D(θ(φ), R) ),

in which ψ is chosen to be the identity function and the parameters θ are identical to the parameters φ, i.e., θ(φ) = φ. Now, some authors [28; 40; 16] propose to choose ψ() and φ differently.

Specifically, they model every parameter θj of the factorization model as a random variable with a parameterized posterior probability density function p(θj | φj). Hence, finding the variables φ corresponds to finding the posterior distributions of the model parameters θ. Because it is intractable to find the true posterior distribution p(θ | R) of the parameters, they settle for a mean-field approximation q(θ | φ), in which all variables are assumed to be independent.

Then, they define ψ as ψ(φ*) = E_{q(θ|φ*)}[θ], i.e., the point-estimate of the parameters θ equals their expected value under the mean-field approximation of their posterior distributions. Note that θj* = E_{q(θj|φj*)}[θj] because of the independence assumption.

If all parameters θj are assumed to be normally distributed with mean µj and variance σj² [28; 40], φj = (µj, σj) and θj* = E_{q(θj|φj*)}[θj] = µj. If, on the other hand, all parameters θj are assumed to be gamma distributed with shape αj and rate βj [16], φj = (αj, βj) and θj* = E_{q(θj|φj*)}[θj] = αj/βj.

Furthermore, prior distributions are defined for all parameters θ. Typically, when this approach is adopted, the underlying assumptions are represented as a graphical model [5].

The parameters φ, and therefore the corresponding mean-field approximations of the posterior distributions of the parameters θ, can be inferred by defining the deviation function as the KL-divergence of the real (unknown) posterior distribution of the parameters, p(θ | R), from the modeled posterior distribution of the parameters, q(θ | φ):

  D(θ(φ), R) = D_KL( q(θ | φ) ‖ p(θ | R) ),

which can be solved despite the fact that p(θ | R) is unknown [25]. This approach goes by the name variational inference [25].

A nonparametric version of this approach also considers D, the number of latent dimensions in the simple two-factor factorization model of Equation 7, as a parameter that depends on the data R, instead of a hyperparameter, as most other methods do [17].

Note that certain solutions for latent Dirichlet allocation [6] also use variational inference techniques. However, in this case, variational inference is part of the (variational) expectation-maximization algorithm for computing the parameters that optimize the negative log-likelihood of the model parameters, which serves as the deviation function. This is different from the methods discussed in this section, where the KL-divergence between the real and the approximate posterior is the one and only deviation function.

5.10 Convex

An intuitively appealing but non-convex deviation function is worthless if there exists no algorithm that can efficiently compute parameters that correspond to a good local minimum. Convex deviation functions, on the other hand, have only one global minimum that can be computed with one of the well-studied convex optimization algorithms. For this reason, it is worthwhile to pursue convex deviation functions.

Aiolli [3] proposes a convex deviation function based on the AUC (Eq. 34). For every individual user u ∈ U, Aiolli starts from AUCu, the AUC for u:

  AUCu = (1/(c(u) · (|I| − c(u)))) Σ_{Rui=1} Σ_{Ruj=0} I(Sui > Suj).

Next, he proposes a lower bound on AUCu:

  AUCu ≥ (1/(c(u) · (|I| − c(u)))) Σ_{Rui=1} Σ_{Ruj=0} (Sui − Suj)/2,

and interprets it as a weighted sum of margins (Sui − Suj)/2 between any known preference and any absent feedback, in which every margin gets the same weight 1/(c(u)·(|I|−c(u))). Hence, maximizing this lower bound on the AUC corresponds to maximizing the sum of margins between any known preference and any absent feedback in which every margin has the same weight. A problem with maximizing this sum is that very high margins on pairs that are easily ranked correctly can hide poor (negative) margins on pairs that are difficult to rank correctly. Aiolli proposes to replace the uniform weights with a weighting scheme that emphasizes the difficult pairs, such that the total margin is the worst possible case, i.e., the lowest possible sum of weighted margins. Specifically, he proposes to solve for every user u the joint optimization problem

  θ* = arg max_θ min_{αu*} Σ_{Rui=1} Σ_{Ruj=0} αui αuj (Sui − Suj),

where for every user u, it holds that Σ_{Rui=1} αui = 1 and Σ_{Ruj=0} αuj = 1. To avoid overfitting of α, he adds two regularization terms:

  θ* = arg max_θ min_{αu*} ( Σ_{Rui=1} Σ_{Ruj=0} αui αuj (Sui − Suj) + λp Σ_{Rui=1} αui² + λn Σ_{Rui=0} αui² ),

with λp, λn regularization hyperparameters. The model parameters θ, on the other hand, are regularized by normalization constraints on the factor matrices. Solving the above maximization for every user is equivalent to minimizing the deviation function

  D(θ, R) = Σ_{u∈U} max_{αu*} ( Σ_{Rui=1} Σ_{Ruj=0} αui αuj (Suj − Sui) − λp Σ_{Rui=1} αui² − λn Σ_{Rui=0} αui² ).   (49)

5.11 Analytically Solvable

Some deviation functions are not only convex, but also analytically solvable. This means that the parameters that minimize these deviation functions can be exactly computed from a formula and that no numerical optimization algorithm is required.

Traditionally, methods that adopt these deviation functions have been inappropriately called neighborhood- or memory-based methods. First, although these methods adopt neighborhood-based factorization models, there are also neighborhood-based methods that adopt non-convex deviation functions, such as SLIM [34] and BPR-kNN [45], which were, amongst others, discussed in Section 4. Second, a memory-based implementation of these methods, in which the necessary parameters of the factorization model are not precomputed but computed in real time when they are required, is conceptually possible, yet practically intractable in most cases. Instead, a model-based implementation of these methods, in which the factorization model is precomputed, is the best choice for the majority of applications.

5.11.1 Basic Neighborhood-based

A first set of analytically solvable deviation functions is tailored to the item-based neighborhood factorization models of Equation 10:

  S = R S^(1,2).

As explained in Section 4, the factor matrix S^(1,2) can be interpreted as an item-similarity matrix. Consequently, these deviation functions compute every parameter in S^(1,2) as

  S^(1,2)_ji = sim(j, i),

with sim(j, i) the similarity between items j and i according to some analytically computable similarity function. This is equivalent to

  S^(1,2)_ji − sim(j, i) = 0,

which is true for all (j, i)-pairs if and only if

  Σ_{j∈I} Σ_{i∈I} ( S^(1,2)_ji − sim(j, i) )² = 0.

Hence, computing the factor matrix S^(1,2) corresponds to minimizing the deviation function

  D(θ, R) = Σ_{j∈I} Σ_{i∈I} ( S^(1,2)_ji − sim(j, i) )².

In this case, the deviation function mathematically expresses the interpretation that sim(j, i) is a good predictor for preferring i if j is also preferred. The key property that determines the analytical solvability of this deviation function is the absence of products of parameters. The non-convex deviation functions in Section 6.1, on the other hand, do contain products of parameters, which are contained in the term Sui. Consequently, they are harder to solve, but allow richer parameter interactions.

A typical choice for sim(j, i) is the cosine similarity [10]. The cosine similarity between two items j and i is given by

  cos(j, i) = ( Σ_{v∈U} Rvj Rvi ) / √( c(i) · c(j) ).   (50)

Another similarity measure is the conditional probability similarity measure [10], which is, for two items i and j, given by

  condProb(j, i) = (1/c(j)) Σ_{v∈U} Rvi Rvj.   (51)

Deshpande and Karypis also proposed an adapted version:

  condProb*(j, i) = Σ_{v∈U} (Rvi Rvj) / ( c(i) · c(j)^α · c(v) ),   (52)

in which α ∈ [0, 1] is a hyperparameter. They introduced the factor 1/c(j)^α to avoid the recommendation of overly frequent items, and the factor 1/c(v) to reduce the weight of i and j co-occurring in the preferences of v if v has more preferences. Other similarity measures were proposed by Aiolli [2]:

  sim(j, i) = ( Σ_{v∈U} (Rvi Rvj) / ( c(j)^α · c(i)^(1−α) ) )^q,

with α, q hyperparameters; by Gori et al. [18]:

  sim(j, i) = ( Σ_{v∈U} Rvj Rvi ) / ( Σ_{k∈I} Σ_{v∈U} Rvj Rvk );

and by Wang et al. [68]:

  sim(j, i) = log( 1 + α · ( Σ_{v∈U} Rvj Rvi ) / ( c(j) c(i) ) ),

with α ∈ R+_0 a hyperparameter. Furthermore, Huang et al. show that sim(j, i) can also be chosen from a number of similarity measures that are typically associated with link prediction [22]. Similarly, Bellogin et al. show that typical scoring functions used in information retrieval can also be used for sim(j, i) [4].

It is common practice to introduce sparsity in S^(1,2) by defining

  sim(j, i) = sim′(j, i) · |KNN(j) ∩ {i}|,   (53)

with sim′(j, i) one of the similarity functions defined by Equations 50-52, KNN(j) the set of items l that correspond to the k highest values sim′(j, l), and k a hyperparameter. Motivated by a qualitative examination of their results, Sigurbjörnsson and Van Zwol [54] proposed additional adaptations:

  sim(j, i) = s(j) · d(i) · r(j, i) · sim′(j, i) · |KNN(j) ∩ {i}|,

with

  s(j) = ks / ( ks + |ks − log c(j)| ),   (54)
  d(i) = kd / ( kd + |kd − log c(i)| ),   (55)
  r(j, i) = kr / ( kr + (r − 1) ),   (56)

in which i is the r-th most similar item to j, and ks, kd and kr are hyperparameters. Finally, Deshpande and Karypis [10] propose to normalize sim(j, i) as

  sim(j, i) = sim″(j, i) / Σ_{l∈I\{j}} sim″(j, l),

with sim″(j, i) defined using Equation 53. Alternatively, Aiolli [2] proposes the normalization

  sim(j, i) = sim″(j, i) / ( Σ_{l∈I\{i}} sim″(l, i) )^{2(1−β)},

with β a hyperparameter.

A second set of analytically solvable deviation functions is tailored to the user-based neighborhood factorization model of Equation 13:

  S = S^(1,1) R.

In this case, the factor matrix S^(1,1) can be interpreted as a user-similarity matrix. Consequently, these deviation functions compute every parameter in S^(1,1) as

  S^(1,1)_uv = sim(u, v),

with sim(u, v) the similarity between users u and v according to some analytically computable similarity function. In the same way as for the item-based case, computing the factor matrix S^(1,1) corresponds to minimizing the deviation function

  D(θ, R) = Σ_{u∈U} Σ_{v∈U} ( S^(1,1)_uv − sim(u, v) )².

In this case, the deviation function mathematically expresses the interpretation that users u and v, for which sim(u, v) is high, prefer the same items.

Sarwar et al. [48] propose

  sim(u, v) = |KNN(u) ∩ {v}|,

with KNN(u) the set of users w that have the k highest cosine similarities cos(u, w) with user u, and k a hyperparameter. In this case, the cosine similarity is defined as

  cos(u, v) = ( Σ_{j∈I} Ruj Rvj ) / √( c(u) · c(v) ).   (57)

Alternatively, Aiolli [2] proposes

  sim(u, v) = ( Σ_{j∈I} (Ruj Rvj) / ( c(u)^α · c(v)^(1−α) ) )^q,

with α, q hyperparameters, and Wang et al. [68] propose

  sim(u, v) = log( 1 + α · ( Σ_{j∈I} Ruj Rvj ) / ( c(u) c(v) ) ),

with α a hyperparameter.

The deviation function for the unified neighborhood-based factorization model in Equation 14 is given by

  D(θ, R) = Σ_{u∈U} Σ_{v∈U} ( S^(1,1)_uv − sim(u, v) )² + Σ_{j∈I} Σ_{i∈I} ( S^(2,3)_ji − sim(j, i) )².

However, sim(j, i) and sim(u, v) cannot be chosen arbitrarily. Verstrepen and Goethals [67] show that they need to satisfy certain constraints in order to render a well-founded unification. Consequently, they propose KUNN, which corresponds to the following similarity definitions that satisfy the necessary constraints:

  sim(u, v) = Σ_{j∈I} (Ruj Rvj) / √( c(u) · c(v) · c(j) ),
  sim(i, j) = Σ_{v∈U} (Rvi Rvj) / √( c(i) · c(j) · c(v) ).

5.11.2 Higher Order Neighborhood-based

A fourth set of analytically solvable deviation functions is tailored to the higher order itemset-based neighborhood factorization model of Equation 20:

  S = X S^(1,2).

In this case, the deviation function is given by

  D(θ, R) = Σ_{j∈𝒮} Σ_{i∈I} ( S^(1,2)_ji − sim(j, i) )²,

with 𝒮 ⊆ 2^I the selected itemsets considered in the factorization model. Deshpande and Karypis [10] propose to define sim(j, i) similarly as for the pairwise interactions (Eq. 53). Alternatively, others [33; 48] proposed

  sim(j, i) = sim′(j, i) · max(0, c(j ∪ {i}) − f),

with f a hyperparameter. Lin et al. [30] proposed yet another alternative:

  sim(j, i) = sim′(j, i) · |KNN_c(i) ∩ {j}| · max(0, condProb(j, i) − c),

with KNN_c(i) the set of items l that correspond to the k highest values c(i, l), k a hyperparameter, condProb the conditional probability according to Equation 51, and c a hyperparameter. Furthermore, they define

  sim′(j, i) = ( Σ_{v∈U} Xvi Xvj ) / c(j).   (58)

A fifth and final set of analytically solvable deviation functions is tailored to the higher order userset-based neighborhood factorization model of Equation 21:

  S = S^(1,1) Y.

In this case, the deviation function is given by

  D(θ, R) = Σ_{v∈𝒮} Σ_{u∈U} ( S^(1,1)_uv − sim(v, u) )²,

with 𝒮 ⊆ 2^U the selected usersets considered in the factorization model. Lin et al. [30] proposed to define

  sim′(v, u) = ( Σ_{j∈I} Yuj Yvj ) / c(v).   (59)

Alternatively, Symeonidis et al. [60] propose

  sim(v, u) = ( ( Σ_{j∈I} Yuj Yvj ) / c(v) ) · |v| · |KNN(u)cp ∩ {v}|,

with KNN(u)cp the set of usersets w that correspond to the k highest values ( Σ_{j∈I} Yuj Ywj ) / c(w).

6. MINIMIZATION ALGORITHMS

Efficiently computing the parameters that minimize a deviation function is often non-trivial. Furthermore, there is a big difference between minimizing convex and non-convex deviation functions.

6.1 Non-convex Minimization

The two most popular families of minimization algorithms for non-convex deviation functions of collaborative filtering algorithms are gradient descent and alternating least squares. We discuss both in this section. Furthermore, we also briefly discuss a few other interesting approaches.

6.1.1 Gradient Descent

For deviation functions that assume preferences are missing at random, and consequently consider only the known preferences [52], gradient descent (GD) is generally the numerical optimization algorithm of choice. In GD, the parameters θ are randomly initialized. Then, they are iteratively updated in the direction that reduces D(θ, R):

  θ^(k+1) = θ^k − η ∇D(θ, R),

with η a hyperparameter called the learning rate. The update step is larger if the absolute value of the gradient ∇D(θ, R) is larger.
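For a concrete picture, the following minimal numpy sketch contrasts a full-gradient step with a single-term stochastic step, for a weighted squared error deviation function in the style of Equation 25 on an illustrative two-factor model S = U Vᵀ; the function names and step sizes are ours, not from any of the cited works.

```python
import numpy as np

def full_gradient_step(R, W, U, V, lr):
    """One full GD step on D = sum_{u,i} W_ui * (R_ui - U_u . V_i)^2."""
    E = W * (R - U @ V.T)              # weighted residuals
    U_new = U + lr * 2 * (E @ V)       # descend along -dD/dU = 2 * E @ V
    V_new = V + lr * 2 * (E.T @ U)
    return U_new, V_new

def sgd_step(R, W, U, V, u, i, lr):
    """Stochastic update for the single term W_ui * (R_ui - U_u . V_i)^2."""
    e = W[u, i] * (R[u, i] - U[u] @ V[i])
    # tuple assignment evaluates both right-hand sides from the old values
    U[u], V[i] = U[u] + lr * 2 * e * V[i], V[i] + lr * 2 * e * U[u]
    return U, V
```

The full step touches every (u, i) term per iteration, while the stochastic step touches one sampled term, which is exactly why the SGD variant discussed next converges faster per unit of work.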
A version of GD that converges faster is stochastic gradient descent (SGD). SGD uses the fact that

  ∇D(θ, R) = Σ_{t=1}^{T} ∇D_t(S, R),

with T the number of terms in D(S, R). Now, instead of computing ∇D(θ, R) in every iteration, only one term t is randomly sampled (with replacement) and the parameters θ are updated as

  θ^(k+1) = θ^k − η ∇D_t(S, R).

Typically, a convergence criterion of choice is only reached after every term t has been sampled multiple times on average. However, when the deviation function assumes the missing feedback is missing not at random, the summation over the known preferences, Σ_{u∈U} Σ_{Rui=1}, is replaced by a summation over all user-item pairs, Σ_{u∈U} Σ_{i∈I}, and SGD needs to visit approximately 1000 times more terms. This makes the algorithm less attractive for deviation functions that assume the missing feedback is missing not at random.

To mitigate the large number of terms in the gradient, several authors propose to sample the terms in the gradient not uniformly, but proportionally to their impact on the parameters [75; 44; 76]. These approaches have not only been proven to speed up convergence, but also to improve the quality of the resulting parameters. Weston et al., on the other hand, sample for every known preference i a number of non-preferred items j until they encounter one for which Suj + 1 > Sui, i.e., one that violates the hinge-loss approximation of the ground truth ranking. In this way, they ensure that every update significantly changes the parameters θ [71].

Additionally, due to recent advances in distributed computing infrastructure, parallel and distributed approaches have been proposed to speed up the convergence of SGD-style computations. One of the classic works is the HogWild algorithm by Recht et al. [43]. In that work, the authors assume that the rating matrix is highly sparse and hence, for any two sampled ratings, their SGD updates will likely be non-conflicting (independent), since any two such updates are unlikely to share either the user or the item vectors. As a result, HogWild drops the synchronization requirements and lets each thread or processor update a random rating value. In the worst case of a conflict, there will be contention in writing out the results. The authors prove convergence of the algorithm under a simple assumption of rating matrix sparsity.

Gemulla et al. [15] present a distributed SGD algorithm (DSGD) in which the main assumption is that some blocks of the rating matrix are mutually independent and hence their variables can be updated in parallel. DSGD uniformly grids the rating matrix R into sub-matrices or blocks, such that they are independent with respect to the rows and columns. This method generates several configurations of the original rating matrix R, which are then fed to the SGD algorithm in sequence. In the actual algorithm, there are two nested for-loops. The outer loop selects a configuration C of the independent blocks and sends it to the SGD scheduler. The latter simply spawns as many parallel threads as there are independent blocks in each configuration and applies SGD to each of them in parallel. Once the results are back, the next configuration is loaded to each processor after consolidating the results of the last iteration. This process continues until all the configurations are computed.

Zhuang et al. [78] argue that both these methods suffer from two problems. The first is the issue of locking of threads, in which a thread that is executing a slower block needs to finish before the next round of assignments can begin. Second, since the elements of the rating matrix are accessed at random, it may result in high cache-miss rates and performance degradation. To solve both these problems, the authors propose a shared-memory parallel algorithm called Fast Parallel SGD (FPSGD). There are two main features of the algorithm. To overcome the locking issue, the authors suggest continuous execution of SGD blocks by the scheduler. The only two conditions a block needs to satisfy are (1) it has to be a free block, and (2) its number of past updates has to be the smallest among all the free blocks (with ties broken randomly). The second condition is necessary because otherwise not all blocks would get the same chance of being updated. For dealing with memory discontinuity, the authors suggest updating each block in a fixed order of users and items. The randomness of the algorithm is introduced by the selection of each block for the next update. Since FPSGD always selects the next block in a deterministic fashion, one simple strategy to select the next block in a random fashion is to split the original rating matrix R into many sub-blocks (more than the number of available threads). Since many blocks will have the same update number, randomization is needed to select the next block for update. The authors call this method partial randomization. The results of experiments on a variety of datasets show that FPSGD achieves a lower RMSE in a far smaller number of iterations.

A natural strategy to optimize the update rules in matrix factorization (MF) is to do a block-wise update and let each block be handled by a separate processor/computing unit. This is the strategy proposed by Yin et al. [74]. The authors first show that most of the loss functions used in MF are decomposable, in the sense that they are sums over individual loss terms. Hence, they can easily be parallelized. Given A = WH as the model, the proposal is to split W and H into W^(I) and H^(J), such that the total loss can be written as the sum of the losses over indices I and J. Then the task is to develop a block-wise partition rule in which blocks are updated independently when updating a factor matrix (by fixing the other factor matrix). Each block can be treated as one update unit. There are several ways in which the blocks can be updated. A straightforward way is to update all blocks of H and then update all blocks of W. This approach can be referred to as concurrent block updates, since each update happens concurrently. An alternate strategy is to update some blocks of H and W alternately. This method, called the frequent block-wise update, has the advantage of faster convergence. This happens because updated block information is used in each alternating sequence. The remainder of the paper presents a recipe for implementing these algorithms in MapReduce.
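The interchangeable-block idea behind DSGD-style scheduling can be sketched in a few lines; the helper name and the diagonal rotation rule are our illustrative simplification of the configurations described by Gemulla et al.

```python
def dsgd_configurations(n_blocks):
    """Block configurations for DSGD-style scheduling: in configuration s,
    parallel worker b processes user-block b and item-block (b + s) mod n_blocks,
    so no two blocks run in parallel that share a row-range or column-range."""
    return [[(b, (b + s) % n_blocks) for b in range(n_blocks)]
            for s in range(n_blocks)]
```

Cycling through all n configurations touches every (user-block, item-block) cell of the gridded rating matrix exactly once, while each configuration's blocks can be updated concurrently without conflicts.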

Take for example the deviation function from Equation 25:

D(\theta, R) = \sum_{u \in U} \sum_{i \in I} W_{ui} (R_{ui} - S_{ui})^2 + \sum_{t=1}^{T} \sum_{f=1}^{F} \lambda_{tf} \|S^{(t,f)}\|_F^2,

combined with the basic two-factor factorization model from Equation 7:

S = S^{(1,1)} S^{(1,2)}.

Like most deviation functions, this deviation function is non-convex in the parameters θ and therefore has multiple local optima. However, if one temporarily fixes the parameters in S^(1,1), it becomes convex in S^(1,2), and we can analytically find updated values for S^(1,2) that minimize this convex function and are therefore guaranteed to reduce D(θ, R). Subsequently, one can temporarily fix the parameters in S^(1,2) and in the same way compute updated values for S^(1,1) that are also guaranteed to reduce D(θ, R). One can keep alternating between fixing S^(1,1) and S^(1,2) until a convergence criterion of choice is met. Hu et al. [21], Pan et al. [37] and Pan and Scholz [36] give detailed descriptions of different ALS variations. The version by Hu et al. contains optimizations for the case in which missing preferences are uniformly weighted. Pan and Scholz [36] describe optimizations that apply to a wider range of weighting schemes. Finally, Pilászy et al. propose to further speed up the computation by only approximately solving each convex ALS step [42].

Additionally, ALS has the advantages that it does not require the tuning of a learning rate, that it can benefit from linear algebra packages such as Intel MKL, and that it needs relatively few iterations to converge. Furthermore, when the basic two-factor factorization of Equation 7 is used, every row of S^(1,1) and every column of S^(1,2) can be updated independently of all other rows or columns, respectively, which makes it fairly easy to massively parallelize the computation of the factor matrices [77].

6.1.3 Bagging

The maximum margin based deviation function in Equation 33 cannot be solved with ALS because it contains the hinge loss. Rennie and Srebro propose a conjugate gradients method for minimizing this function [46]. However, this method suffers from similar problems as SGD, related to the high number of terms in the loss function. Therefore, Pan and Scholz [36] propose a bagging approach. The essence of their bagging approach is that they do not explicitly weight every user-item pair for which Rui = 0, but sample from all these pairs instead. They create multiple samples and compute multiple different solutions S̃ corresponding to their samples. These computations are also performed with the conjugate gradients method. They are, however, much less intensive, since they only consider a small sample of the many user-item pairs for which Rui = 0. The different solutions S̃ are finally aggregated by simply taking their average.

6.2 Convex Minimization

Aiolli [3] proposed the convex deviation function in Equation 49 and indicates that it can be solved with any algorithm for convex optimization.

The analytically solvable deviation functions are also convex. Moreover, minimizing them is equivalent to computing all the similarities involved in the model. Most works assume a brute-force computation of the similarities. However, Verstrepen [65] recently proposed two methods that are an order of magnitude faster than the brute-force computation.

7. RATING BASED METHODS

Interest in collaborative filtering on binary, positive-only data only recently increased. The majority of existing collaborative filtering research assumes rating data. In this case, the feedback of user u about item i, Rui, is an integer between Bl and Bh, with Bl and Bh the most negative and positive feedback, respectively. The most typical example of such data was provided in the context of the Netflix Prize, with Bl = 1 and Bh = 5.

Technically, our case of binary, positive-only data is just a special case of rating data with Bl = Bh = 1. However, collaborative filtering methods for rating data are in general built on the implicit assumption that Bl < Bh, i.e., that both positive and negative feedback is available. Since this negative feedback is not available in our problem setting, it is not surprising that, in general, methods for rating data generate poor or even nonsensical results [21; 37; 56].

k-NN methods for rating data, for example, often use the Pearson correlation coefficient as a similarity measure. The Pearson correlation coefficient between users u and v is given by

pcc(u, v) = \frac{\sum_{j: R_{uj}, R_{vj} > 0} (R_{uj} - \bar{R}_u)(R_{vj} - \bar{R}_v)}{\sqrt{\sum_{j: R_{uj}, R_{vj} > 0} (R_{uj} - \bar{R}_u)^2} \sqrt{\sum_{j: R_{uj}, R_{vj} > 0} (R_{vj} - \bar{R}_v)^2}},

with R̄u and R̄v the average rating of u and v, respectively. In our setting with binary, positive-only data, however, Ruj and Rvj are by definition always one or zero. Consequently, R̄u and R̄v are always one. Therefore, the Pearson correlation is always zero or undefined (zero divided by zero), making it a useless similarity measure for binary, positive-only data. Even if we would hack it by omitting the mean-centering terms −R̄u and −R̄v, it would still be useless, since it would then always be equal to either one or zero.

Furthermore, when computing the score of user u for item i, user(item)-based k-NN methods for rating data typically find the k users (items) that are most similar to u (i) and that have rated i (have been rated by u) [11; 23]. On binary, positive-only data, this approach results in the nonsensical result that Sui = 1 for every (u, i)-pair.
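The degeneracy of the Pearson correlation coefficient on binary, positive-only data is easy to verify numerically. The snippet below is our own illustration (not code from any cited work): on genuine rating vectors it returns a meaningful similarity, but on binary vectors the mean-centered terms vanish and the result is 0/0:

```python
import math

def pearson(ru, rv):
    """Pearson correlation over co-rated items; None when undefined (0/0)."""
    co = [j for j in range(len(ru)) if ru[j] > 0 and rv[j] > 0]
    if not co:
        return None
    mu = sum(ru[j] for j in co) / len(co)
    mv = sum(rv[j] for j in co) / len(co)
    num = sum((ru[j] - mu) * (rv[j] - mv) for j in co)
    den = (math.sqrt(sum((ru[j] - mu) ** 2 for j in co))
           * math.sqrt(sum((rv[j] - mv) ** 2 for j in co)))
    return None if den == 0 else num / den

print(pearson([5, 3, 0, 1], [4, 2, 0, 2]))  # a meaningful value on ratings
# On binary, positive-only data every co-rated entry equals 1, so both means
# are 1, numerator and denominator are both 0, and the measure is undefined.
print(pearson([1, 1, 0, 1], [1, 0, 1, 1]))  # None
```

Dropping the mean-centering terms would turn the numerator and the denominator into identical sums over the co-rated items, so the measure would then always equal one (or be undefined when there are no co-rated items), matching the observation above.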
The matrix factorization methods for rating data are in general also not applicable to binary, positive-only data. Take for example a basic loss function for matrix factorization on rating data:

\min_{S^{(1)}, S^{(2)}} \sum_{R_{ui} > 0} \left( \left( R_{ui} - S^{(1)}_{u\cdot} S^{(2)}_{\cdot i} \right)^2 + \lambda \left( \|S^{(1)}_{u\cdot}\|_F^2 + \|S^{(2)}_{\cdot i}\|_F^2 \right) \right),

which for binary, positive-only data simplifies to

\min_{S^{(1)}, S^{(2)}} \sum_{R_{ui} = 1} \left( \left( 1 - S^{(1)}_{u\cdot} S^{(2)}_{\cdot i} \right)^2 + \lambda \left( \|S^{(1)}_{u\cdot}\|_F^2 + \|S^{(2)}_{\cdot i}\|_F^2 \right) \right).

The squared error term of this loss function is minimized when the rows and columns of S^(1) and S^(2), respectively, are all the same unit vector. This is obviously a nonsensical solution.

The matrix factorization method for rating data that uses singular value decomposition to factorize R also considers the entries where Rui = 0 and does not suffer from the above problem [49; 8]. Although this method does not produce nonsensical results, its performance has been shown to be inferior to that of methods specialized for binary, positive-only data [34; 56; 57].

In summary, although we cannot exclude the possibility that there exists a method for rating data that does perform well on binary, positive-only data, in general this is clearly not the case.

8. CONCLUSIONS

We have presented a comprehensive survey of collaborative filtering methods for binary, positive-only data. Its backbone is an innovative unified matrix factorization perspective on collaborative filtering methods, including those that are typically not associated with matrix factorization models, such as nearest neighbors methods and association rule mining-based methods. From this perspective, a collaborative filtering algorithm consists of three building blocks: a matrix factorization model, a deviation function, and a numerical minimization algorithm. By comparing methods along these three dimensions, we were able to highlight surprising commonalities and key differences.

An interesting direction for future work is to survey certain aspects that were not included in the scope of this survey. Examples are surveying the different strategies to deal with cold-start problems that are applicable to binary, positive-only data, and comparing the applicability of models and deviation functions for recomputation of models in real time upon receiving novel feedback.

9. REFERENCES

[4] A. Bellogin, J. Wang, and P. Castells. Text retrieval methods for item ranking in collaborative filtering. In P. Clough, C. Foley, C. Gurrin, G. Jones, W. Kraaij, H. Lee, and V. Mudoch, editors, Advances in Information Retrieval, volume 6611, pages 301–306. Springer Berlin Heidelberg, 2011.

[5] C. M. Bishop. Pattern recognition. Machine Learning, 128:1–58, 2006.

[6] D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent dirichlet allocation. The Journal of Machine Learning Research, 3:993–1022, 2003.

[7] E. Christakopoulou and G. Karypis. Hoslim: Higher-order sparse linear method for top-n recommender systems. In Advances in Knowledge Discovery and Data Mining, pages 38–49. Springer, 2014.

[8] P. Cremonesi, Y. Koren, and R. Turrin. Performance of recommender algorithms on top-n recommendation tasks. In Proc. of the fourth ACM Conf. on Recommender Systems, pages 39–46. ACM, 2010.

[9] S. C. Deerwester, S. T. Dumais, T. K. Landauer, G. W. Furnas, and R. A. Harshman. Indexing by latent semantic analysis. Journal of the American Society of Information Science, 41(6):391–407, 1990.

[10] M. Deshpande and G. Karypis. Item-based top-n recommendation algorithms. ACM Transactions on Information Systems (TOIS), 22(1):143–177, 2004.

[11] C. Desrosiers and G. Karypis. A comprehensive survey of neighborhood-based recommendation methods. In Recommender Systems Handbook, pages 107–144. Springer, 2011.

[12] C. Dhanjal, R. Gaudel, and S. Clémençon. Collaborative filtering with localised ranking. In Proc. of the 29th AAAI Conf. on Artificial Intelligence, 2015.

[13] C. Eckart and G. Young. The approximation of one matrix by another of lower rank. Psychometrika, 1(3):211–218, 1936.

[14] J. Friedman, T. Hastie, and R. Tibshirani. Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33(1):1, 2010.

[15] R. Gemulla, E. Nijkamp, P. J. Haas, and Y. Sismanis. Large-scale matrix factorization with distributed stochastic gradient descent. In Proc. of the 17th ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining, pages 69–77, 2011.
[1] G. Adomavicius and A. Tuzhilin. Toward the next generation of recommender systems: a survey of the state-of-the-art and possible extensions. IEEE Transactions on Knowledge and Data Engineering, 17(6):734–749, 2005.

[2] F. Aiolli. Efficient top-n recommendation for very large scale binary rated datasets. In Proc. of the 7th ACM Conf. on Recommender Systems, pages 273–280. ACM, 2013.

[3] F. Aiolli. Convex AUC optimization for top-n recommendation with implicit feedback. In Proc. of the 8th ACM Conf. on Recommender Systems, pages 293–296. ACM, 2014.

[16] P. Gopalan, J. Hofman, and D. Blei. Scalable recommendation with hierarchical poisson factorization. In Proc. of the 31st Annual Conf. on Uncertainty in Artificial Intelligence (UAI-15). AUAI Press, 2015.

[17] P. Gopalan, F. J. Ruiz, R. Ranganath, and D. M. Blei. Bayesian nonparametric poisson factorization for recommendation systems. Artificial Intelligence and Statistics (AISTATS), 33:275–283, 2014.

[18] M. Gori, A. Pucci, V. Roma, and I. Siena. Itemrank: A random-walk based scoring algorithm for recommender engines. In Proc. of the 20th Int. Joint Conf. on Artificial Intelligence, volume 7, pages 2766–2771, 2007.

[19] T. Hofmann. Latent semantic models for collaborative filtering. ACM Transactions on Information Systems (TOIS), 22(1):89–115, 2004.

[20] T. Hofmann and J. Puzicha. Latent class models for collaborative filtering. In Proc. of the 16th Int. Joint Conf. on Artificial Intelligence, pages 688–693, 1999.

[21] Y. Hu, Y. Koren, and C. Volinsky. Collaborative filtering for implicit feedback datasets. In Proc. of the Eighth IEEE Int. Conf. on Data Mining, pages 263–272. IEEE, 2008.

[22] Z. Huang, X. Li, and H. Chen. Link prediction approach to collaborative filtering. In Proc. of the 5th ACM/IEEE-CS Joint Conf. on Digital Libraries, pages 141–142. ACM, 2005.

[23] D. Jannach, M. Zanker, A. Felfernig, and G. Friedrich. Recommender Systems: An Introduction. Cambridge University Press, 2010.

[24] C. Johnson. Logistic matrix factorization for implicit feedback data. In Workshop on Distributed Machine Learning and Matrix Computations at the Twenty-eighth Annual Conf. on Neural Information Processing Systems (NIPS), 2014.

[25] M. I. Jordan, Z. Ghahramani, T. S. Jaakkola, and L. K. Saul. An introduction to variational methods for graphical models. Machine Learning, 37(2):183–233, 1999.

[26] S. Kabbur and G. Karypis. Nlmf: Nonlinear matrix factorization methods for top-n recommender systems. In Workshop Proc. of the IEEE Int. Conf. on Data Mining, pages 167–174. IEEE, 2014.

[27] S. Kabbur, X. Ning, and G. Karypis. Fism: Factored item similarity models for top-n recommender systems. In Proc. of the 19th ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining, pages 659–667. ACM, 2013.

[28] N. Koenigstein, N. Nice, U. Paquet, and N. Schleyen. The xbox recommender system. In Proc. of the sixth ACM Conf. on Recommender Systems, pages 281–284. ACM, 2012.

[29] Y. Koren, R. Bell, and C. Volinsky. Matrix factorization techniques for recommender systems. Computer, 42(8):30–37, 2009.

[30] W. Lin, S. A. Alvarez, and C. Ruiz. Efficient adaptive-support association rule mining for recommender systems. Data Mining and Knowledge Discovery, 6(1):83–105, 2002.

[31] G. V. Menezes, J. M. Almeida, F. Belém, M. A. Gonçalves, A. Lacerda, E. S. De Moura, G. L. Pappa, A. Veloso, and N. Ziviani. Demand-driven tag recommendation. In Machine Learning and Knowledge Discovery in Databases, pages 402–417. Springer, 2010.

[32] A. Mnih and R. Salakhutdinov. Probabilistic matrix factorization. In Advances in Neural Information Processing Systems, pages 1257–1264, 2007.

[33] B. Mobasher, H. Dai, T. Luo, and M. Nakagawa. Effective personalization based on association rule discovery from web usage data. In Proc. of the 3rd Int. Workshop on Web Information and Data Management, pages 9–15. ACM, 2001.

[34] X. Ning and G. Karypis. Slim: Sparse linear methods for top-n recommender systems. In Proc. of the 11th IEEE Int. Conf. on Data Mining, pages 497–506. IEEE, 2011.

[35] L. Page, S. Brin, R. Motwani, and T. Winograd. The PageRank citation ranking: Bringing order to the web. 1999.

[36] R. Pan and M. Scholz. Mind the gaps: Weighting the unknown in large-scale one-class collaborative filtering. In Proc. of the 15th ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining, pages 667–676. ACM, 2009.

[37] R. Pan, Y. Zhou, B. Cao, N. N. Liu, R. Lukose, M. Scholz, and Q. Yang. One-class collaborative filtering. In Proc. of the Eighth IEEE Int. Conf. on Data Mining, pages 502–511. IEEE, 2008.

[38] W. Pan and L. Chen. Cofiset: Collaborative filtering via learning pairwise preferences over item-sets. In Proc. of the 13th SIAM Int. Conf. on Data Mining, pages 180–188, 2013.

[39] W. Pan and L. Chen. Gbpr: Group preference based bayesian personalized ranking for one-class collaborative filtering. In Proc. of the Twenty-Third Int. Joint Conf. on Artificial Intelligence, pages 2691–2697. AAAI Press, 2013.

[40] U. Paquet and N. Koenigstein. One-class collaborative filtering with random graphs. In Proc. of the 22nd Int. Conf. on WWW, pages 999–1008, 2013.

[41] A. Paterek. Improving regularized singular value decomposition for collaborative filtering. In Proc. of KDD Cup and Workshop, volume 2007, pages 5–8, 2007.

[42] I. Pilászy, D. Zibriczky, and D. Tikk. Fast als-based matrix factorization for explicit and implicit feedback datasets. In Proc. of the fourth ACM Conf. on Recommender Systems, pages 71–78. ACM, 2010.

[43] B. Recht, C. Re, S. Wright, and F. Niu. Hogwild: A lock-free approach to parallelizing stochastic gradient descent. In Advances in Neural Information Processing Systems 24, pages 693–701. 2011.

[44] S. Rendle and C. Freudenthaler. Improving pairwise learning for item recommendation from implicit feedback. In Proc. of the 7th ACM Int. Conf. on Web Search and Data Mining, pages 273–282. ACM, 2014.

[45] S. Rendle, C. Freudenthaler, Z. Gantner, and L. Schmidt-Thieme. Bpr: Bayesian personalized ranking from implicit feedback. In Proc. of the Twenty-Fifth Conf. on Uncertainty in Artificial Intelligence, pages 452–461. AUAI Press, 2009.

[46] J. D. Rennie and N. Srebro. Fast maximum margin matrix factorization for collaborative prediction. In Proc. of the 22nd Int. Conf. on Machine Learning, pages 713–719. ACM, 2005.

[47] F. Ricci, L. Rokach, B. Shapira, and P. B. Kantor. Recommender Systems Handbook. Springer, Boston, MA, 2011.

[48] B. Sarwar, G. Karypis, J. Konstan, and J. Riedl. Analysis of recommendation algorithms for e-commerce. In Proc. of the 2nd ACM Conf. on Electronic Commerce, pages 158–167. ACM, 2000.

[49] B. Sarwar, G. Karypis, J. Konstan, and J. Riedl. Application of dimensionality reduction in recommender system – a case study. Technical report, DTIC Document, 2000.

[50] B. Sarwar, G. Karypis, J. Konstan, and J. Riedl. Item-based collaborative filtering recommendation algorithms. In Proc. of the 10th Int. Conf. on WWW, pages 285–295. ACM, 2001.

[51] Y. Shi, A. Karatzoglou, L. Baltrunas, M. Larson, A. Hanjalic, and N. Oliver. Tfmap: Optimizing map for top-n context-aware recommendation. In Proc. of the 35th Int. ACM SIGIR Conf. on Research and Development in Information Retrieval, pages 155–164. ACM, 2012.

[52] Y. Shi, A. Karatzoglou, L. Baltrunas, M. Larson, N. Oliver, and A. Hanjalic. Climf: Learning to maximize reciprocal rank with collaborative less-is-more filtering. In Proc. of the sixth ACM Conf. on Recommender Systems, pages 139–146. ACM, 2012.

[53] Y. Shi, M. Larson, and A. Hanjalic. Collaborative filtering beyond the user-item matrix: A survey of the state of the art and future challenges. ACM Computing Surveys (CSUR), 47(1):3, 2014.

[54] B. Sigurbjörnsson and R. Van Zwol. Flickr tag recommendation based on collective knowledge. In Proc. of the 17th Int. Conf. on WWW, pages 327–336. ACM, 2008.

[55] V. Sindhwani, S. S. Bucak, J. Hu, and A. Mojsilovic. One-class matrix completion with low-density factorizations. In Proc. of the 10th IEEE Int. Conf. on Data Mining, pages 1055–1060. IEEE, 2010.

[56] H. Steck. Training and testing of recommender systems on data missing not at random. In Proc. of the 16th ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining, pages 713–722. ACM, 2010.

[57] H. Steck. Item popularity and recommendation accuracy. In Proc. of the fifth ACM Conf. on Recommender Systems, pages 125–132. ACM, 2011.

[58] H. Steck. Gaussian ranking by matrix factorization. In Proc. of the 9th ACM Conf. on Recommender Systems, pages 115–122. ACM, 2015.

[59] X. Su and T. M. Khoshgoftaar. A survey of collaborative filtering techniques. Advances in Artificial Intelligence, 2009:4, 2009.

[60] P. Symeonidis, A. Nanopoulos, A. N. Papadopoulos, and Y. Manolopoulos. Nearest-biclusters collaborative filtering based on constant and coherent values. Inf. Retr., 11(1):51–75, 2008.

[61] G. Takács and D. Tikk. Alternating least squares for personalized ranking. In Proc. of the sixth ACM Conf. on Recommender Systems, pages 83–90. ACM, 2012.

[62] A. Töscher and M. Jahrer. Collaborative filtering ensemble for ranking. Journal of Machine Learning Research W&CP: Proc. of KDD Cup 2011, 18:61–74, 2012.

[63] L. H. Ungar and D. P. Foster. Clustering methods for collaborative filtering. In AAAI Workshop on Recommendation Systems, volume 1, pages 114–129, 1998.

[64] M. van Leeuwen and D. Puspitaningrum. Improving tag recommendation using few associations. In Advances in Intelligent Data Analysis XI, pages 184–194. Springer, 2012.

[65] K. Verstrepen. Collaborative Filtering with Binary, Positive-Only Data. PhD thesis, University of Antwerp, 2015.

[66] K. Verstrepen and B. Goethals. Unifying nearest neighbors collaborative filtering. In Proc. of the 8th ACM Conf. on Recommender Systems, pages 177–184. ACM, 2014.

[67] K. Verstrepen and B. Goethals. Top-n recommendation for shared accounts. In Proc. of the 9th ACM Conf. on Recommender Systems, pages 59–66. ACM, 2015.

[68] J. Wang, A. P. De Vries, and M. J. Reinders. A user-item relevance model for log-based collaborative filtering. In Advances in Information Retrieval, pages 37–48. Springer, 2006.

[69] J. Weston, S. Bengio, and N. Usunier. Wsabie: Scaling up to large vocabulary image annotation. In Proc. of the 22nd Int. Joint Conf. on Artificial Intelligence, volume 11, pages 2764–2770, 2011.

[70] J. Weston, R. J. Weiss, and H. Yee. Nonlinear latent factorization by embedding multiple user interests. In Proc. of the 7th ACM Conf. on Recommender Systems, pages 65–68. ACM, 2013.

[71] J. Weston, H. Yee, and R. J. Weiss. Learning to rank recommendations with the k-order statistic loss. In Proc. of the 7th ACM Conf. on Recommender Systems, pages 245–248. ACM, 2013.

[72] X. Yang, Y. Guo, Y. Liu, and H. Steck. A survey of collaborative filtering based social recommender systems. Computer Communications, 41:1–10, 2014.

[73] Y. Yao, H. Tong, G. Yan, F. Xu, X. Zhang, B. K. Szymanski, and J. Lu. Dual-regularized one-class collaborative filtering. In Proc. of the 23rd ACM Int. Conf. on Information and Knowledge Management, pages 759–768. ACM, 2014.

[74] J. Yin, L. Gao, and Z. M. Zhang. Scalable nonnegative matrix factorization with block-wise updates. In Machine Learning and Knowledge Discovery in Databases: European Conf., ECML PKDD 2014, Proc., Part III, pages 337–352. 2014.

[75] W. Zhang, T. Chen, J. Wang, and Y. Yu. Optimizing top-n collaborative filtering via dynamic negative item sampling. In Proc. of the 36th Int. ACM SIGIR Conf. on Research and Development in Information Retrieval, pages 785–788. ACM, 2013.

[76] H. Zhong, W. Pan, C. Xu, Z. Yin, and Z. Ming. Adaptive pairwise preference learning for collaborative recommendation with implicit feedbacks. In Proc. of the 23rd ACM Int. Conf. on Information and Knowledge Management, pages 1999–2002. ACM, 2014.

[77] Y. Zhou, D. Wilkinson, R. Schreiber, and R. Pan. Large-scale parallel collaborative filtering for the netflix prize. In Algorithmic Aspects in Information and Management, pages 337–348. Springer, 2008.

[78] Y. Zhuang, W.-S. Chin, Y.-C. Juan, and C.-J. Lin. A fast parallel sgd for matrix factorization in shared memory systems. In Proc. of the 7th ACM Conf. on Recommender Systems, pages 249–256, 2013.