Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence (IJCAI-16)

A Relaxed Ranking-Based Factor Model for Recommender System from Implicit Feedback

Huayu Li†, Richang Hong‡, Defu Lian§, Zhiang Wu¶, Meng Wang‡ and Yong Ge†
†UNC Charlotte, {hli38, yong.ge}@uncc.edu
‡Hefei University of Technology, {hongrc, wangmeng}@hfut.edu.cn
§University of Electronic Science and Technology of China, [email protected]
¶Nanjing University of Finance and Economics, [email protected]

Abstract

Implicit feedback based recommendation has recently become an important task with the accumulated user-item interaction data. However, it is very challenging to produce recommendations from implicit feedback due to the sparseness of data and the lack of negative feedback/ratings. Although various factor models have been proposed to tackle this problem, they either focus on rating prediction, which may lead to inaccurate top-K recommendations, or depend on the sampling of negative feedback, which often results in bias. To this end, we propose a Relaxed Ranking-based Factor Model, RRFM, which relaxes pairwise ranking into an SVM-like task, where positive and negative feedback are separated by soft boundaries, and their non-separable property is employed to capture the characteristics of unobserved data. A smooth and scalable algorithm is developed to solve the group- and instance-level optimization and parameter estimation. Extensive experiments based on real-world datasets demonstrate the effectiveness and advantage of our approach.

1 Introduction

Recommender systems have become an important feature for recommending relevant items to relevant users in many online communities, e.g., Amazon, Netflix, and Foursquare. Some online systems allow users to provide an explicit rating for an item to express how much they like it. A higher (or lower) rating indicates that the user likes (or dislikes) the item more. Nevertheless, many recommender systems only have users' implicit feedback, such as browsing activity, purchasing history, watching history, click behavior and check-in information. As implicit feedback becomes more and more prevalent, this type of recommender system has attracted many researchers' attention [Joachims et al., 2005; Lim et al., 2015]. However, implicit feedback based recommender systems suffer from many challenges. For example, the sparseness of the observed data (i.e., only a small percentage of user-item pairs have implicit feedback) increases the difficulty of learning users' exact tastes on items. Also, different from explicit feedback, only positive preference is observed in implicit feedback. In other words, we have no prior knowledge about which items users dislike. This has been a thorny issue for the learning task.

In the literature, some related work has been proposed to take advantage of implicit feedback for item recommendation. For example, [Hu et al., 2008] regards a user's preference for an item as a binary value, where a user's preference for an observed item and an unobserved one is viewed as one and zero, respectively. It then fits these predefined ratings with vastly varying confidence levels in a matrix factorization framework. Although it assumes that a user prefers the observed items to the unobserved ones, the quadratic loss in the formulation weakens their intrinsic ranking order. There is no guarantee that higher accuracy in rating prediction will result in better ranking effectiveness [N. and Qiang, 2008]. For instance, suppose the true ratings for two items are {1, 0.5}. The predicted ratings {0.4, 0.6} and {1.6, 0.6} have the same prediction accuracy, but they lead to totally different ranking orders of the items.

To this end, Rendle formulates users' consuming behavior as a pairwise ranking problem, i.e., users are much more interested in their consumed items than in unconsumed ones [Rendle et al., 2009]. Due to the large number of such pairs, it only samples some negative items for the learning procedure. However, there are two limitations. First, the pairwise ranking increases the number of comparisons. Second, although Rendle [Rendle and Freudenthaler, 2014] improves the sampling scheme by oversampling the top ranked items, the sampling technique itself easily leads to bias. It is likely that a sampled negative item is already ranked below the positive one, and as a result it contributes nothing to the optimization.

To address these issues, we propose to relax the ranking model in [Rendle et al., 2009] to eliminate the pairwise ranking. Specifically, the positive and negative feedback are separated by the positive and negative boundaries. In fact, the unobserved implicit feedback is often a mixture of negative and missed positive data, so a slack variable is introduced to capture this characteristic, which allows some negative and positive feedback to be non-separable. Furthermore, instead of sampling, a smooth and scalable algorithm is designed to learn the model's parameters based on group- and instance-level optimization. The proposed algorithm takes all the unobserved items into account for optimization, and as a result addresses the bias caused by the sampling technique in [Rendle et al., 2009;

Rendle and Freudenthaler, 2014]. Finally, the proposed model is evaluated against many state-of-the-art baseline models with different validation metrics on three real-world data sets. The experimental results demonstrate the superiority of our model for tackling implicit feedback based recommendation.

2 Preliminaries

The recommendation task addressed in this paper is defined as: given the consumption behaviors of N users over M items, we aim at recommending to each user the top-K new items that he might be interested in but has never consumed before. Matrix factorization based models assume that U ∈ R^{K×N} and V ∈ R^{K×M} are the user and item latent feature matrices, with column vectors U_i and V_j representing the K-dimensional user-specific and item-specific feature vectors of user i and item j, respectively. The predicted preference (rating) of user i for item j, denoted as R̂_ij, is approximated by:

    R̂_ij = U_i^T V_j.    (1)

In implicit feedback datasets, only positive feedback is observed. As we lack substantial evidence on which items users dislike, their preference for an unobserved item is regarded as a mixture of negative and missing values. Along this line, [Rendle et al., 2009] assumes that each user prefers the observed items over the unobserved ones. Let us denote M_i^o as the set of items that user i has consumed and M_i^u as the remaining items that he has never consumed. Then for user i, the ranking of the user's preference for an observed item j over an unobserved item k is given by:

    U_i^T V_j > U_i^T V_k,   ∀j ∈ M_i^o ∧ ∀k ∈ M_i^u.    (2)

For convenience we will henceforth refer to i as user, j as observed item, and k as unobserved item unless stated otherwise. Eq.(2) models the correlation of the user's preference for each pair of an observed item and an unobserved one. It is actually maximizing the Area Under the ROC Curve (AUC) for matrix factorization. The optimization problem for pairwise ranking is formulated as follows:

    max_{U,V} Σ_{i=1}^N Σ_{j∈M_i^o} Σ_{k∈M_i^u} log σ(U_i^T V_j − U_i^T V_k) − (λ_U/2)‖U‖_F^2 − Σ_{i=1}^N Σ_{j∈M_i^o} Σ_{k∈M_i^u} ((λ_{V1}/2)‖V_j‖_F^2 + (λ_{V2}/2)‖V_k‖_F^2),

where σ(x) is the sigmoid function, i.e., σ(x) = 1/(1 + e^{−x}); ‖·‖_F is the Frobenius norm; and λ_U, λ_{V1} and λ_{V2} are regularization constants. In the optimization process, sampling negative items is adopted to avoid comparing with all the unobserved items for each individual user. The optimal solution, as a result, can be obtained by Stochastic Gradient Descent (SGD).

3 Methodologies

3.1 The Relaxed Model

In implicit feedback systems, a user's preference for an observed item is usually regarded as a positive rating, and there is no negative rating. Most research argues that a user's preference for an observed item is supposed to be larger than that for any unobserved one [Rendle et al., 2009; Hu et al., 2008], which indicates the presence of a ranking between positive and negative ratings. However, the pairwise ranking of a user's preference for the observed items over the unobserved ones is quite inefficient, especially as the user's historical data grows. To address this issue, we propose to relax the pairwise ranking. We treat a user's preference for an item as a point, where his positive (or negative) rating is viewed as a positive (or negative) point. Our goal is to make all positive points reside above all negative points. Inspired by the soft margin idea of SVM, we separate these two types of points by two different boundaries. In other words, the user's preferences for the observed items and the unobserved ones are separated by the boundaries. Specifically, the user's positive rating is located on or above a boundary represented by a numeric value r_+. On the other hand, his negative rating resides on or below another boundary. This not only improves the efficiency of comparisons, but also preserves the ranking information. Along this line, we relax Eq.(2) for ∀j ∈ M_i^o ∧ ∀k ∈ M_i^u as follows:

    U_i^T V_j ≥ r_+,
    U_i^T V_k ≤ r_− + ξ_ik,    (3)

where ξ_ik is the slack variable for user i on the unobserved item k. Similar to [Hu et al., 2008], r_+ and r_− are set to one and zero, respectively. Even though there may be many unobserved items for a user, it does not necessarily indicate that he dislikes them; he may simply be unaware of them. In other words, some of the unobserved items might be those the user is interested in, while others are actually those he dislikes. To capture this characteristic, the slack variable ξ_ik in Eq.(3) is introduced to allow a mixture of negative and positive feedback in the unobserved data. Hence, we apply the following constraint on ξ_ik:

    r_− + ξ_ik ≥ r_+  ⇒  ξ_ik ≥ r_+ − r_−.    (4)

It is worth noting that for each user i, the pairwise ranking in Eq.(2) needs |M_i^o|·|M_i^u| comparisons, while our relaxed ranking in Eq.(3) only requires M = |M_i^o| + |M_i^u| comparisons. Specifically, when |M_i^o| = M/2, pairwise ranking needs M^2/4 comparisons while relaxed ranking only requires M comparisons.

However, Eq.(3) is a "hard" version of the optimization problem due to its strict constraints. Thus, we make the constraints "soft" by introducing a plus function to penalize the violated constraints. The optimization problem with the softened constraints is formulated as follows:

    min_{U,V,ξ} Σ_{i=1}^N [ Σ_{j∈M_i^o} (−U_i^T V_j + r_+)_+ + Σ_{k∈M_i^u} (U_i^T V_k − ξ_ik − r_−)_+ ] + ‖Θ‖,    (5)

where (·)_+ is the plus function [Zhu et al., 2003], i.e., (x)_+ = max(x, 0), and ‖Θ‖ is the regularization term given by:

    ‖Θ‖ = (λ_U/2)‖U‖_F^2 + (λ_V/2)‖V‖_F^2 + λ_ξ Σ_i ‖ξ_i‖_1,

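As an illustration (not part of the paper), the per-user data term of Eq.(5) can be sketched on made-up toy data; the last two lines also show the comparison-count saving of the relaxed ranking in Eq.(3) over the pairwise ranking in Eq.(2). All dimensions, factors, and the item split below are hypothetical.

```python
import numpy as np

# Hypothetical toy setting for one user i.
rng = np.random.default_rng(0)
K, M = 4, 6                          # latent dimension, number of items
U_i = rng.normal(size=K)             # user i's latent factor
V = rng.normal(size=(M, K))          # item latent factors (rows are items)
observed = [0, 2]                    # M_i^o: items user i consumed
unobserved = [1, 3, 4, 5]            # M_i^u: the remaining items
r_plus, r_minus = 1.0, 0.0           # boundaries, as in the paper
xi = np.full(M, r_plus - r_minus)    # slack variables, constrained >= r_+ - r_-

def plus(x):
    """Plus function (x)_+ = max(x, 0) that softens the hard constraints."""
    return np.maximum(x, 0.0)

scores = V @ U_i
# Eq.(5) data terms: positives pushed above r_+, negatives below r_- + xi.
loss = (plus(-scores[observed] + r_plus).sum()
        + plus(scores[unobserved] - xi[unobserved] - r_minus).sum())

# Comparison counts: pairwise ranking vs. the relaxed ranking.
pairwise_cmp = len(observed) * len(unobserved)   # |M_i^o| * |M_i^u|
relaxed_cmp = len(observed) + len(unobserved)    # |M_i^o| + |M_i^u| = M
```

With |M_i^o| = 2 and |M_i^u| = 4, the relaxed formulation needs 6 constraint evaluations instead of 8 pairwise comparisons; the gap widens quadratically as M grows.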
where λ_U, λ_V and λ_ξ are the regularization constants, and ‖ξ_i‖_1 is the ℓ1 norm of the vector ξ_i. More importantly, one reason for imposing such a vector norm on ξ is that users are usually interested in only a small percentage of the unobserved items. It is worth noting that the formulation in Eq.(5) is different from a pure SVM scheme: (1) we relax the pairwise ranking problem, which is adaptable to any pairwise ranking based application; (2) a novel and scalable optimization is designed for the objective function (see the next section); (3) the slack variable with the ℓ1 norm is introduced to capture the characteristics of the unobserved data (i.e., the mixture of positive and negative data).

3.2 Optimization

The bottleneck of the optimization problem in Eq.(5) is the calculation of each user's ratings on all the unobserved items, which results in at least O(MNK) time complexity. In particular, as the number of items increases, it becomes more and more inefficient. As sampling negative items possibly leads to bias, in this paper we propose a smooth and scalable optimization algorithm that takes all the unobserved items into consideration.

First, we solve the problem with a group-based optimization. The entire item set is randomly divided into G groups. (We adopt the random grouping technique in the experiments for the sake of efficient computation.) Instead of requiring the rating on each unobserved item to reside on or below the negative boundary, we softly require the average rating of the unobserved items in each group to lie on or below this boundary. Suppose G_h is the set of all items which belong to group h, and G_h^i is the set of unobserved items of user i in group h; then we have the following inequality:

    (1/|G_h^i|) Σ_{k∈G_h^i} U_i^T V_k ≤ r_− + ξ_ih,

where ξ_ih is the slack variable for user i on group h. Let us define the following variables:

    V_h^s ≡ Σ_{j∈G_h} V_j,   Ṽ_ih^s ≡ Σ_{j∈M_i^o ∧ h(j)=h} V_j,   V̄_h^i ≡ (1/|G_h^i|)(V_h^s − Ṽ_ih^s),

where h(j) is the group index of item j. In addition, the plus function is not twice differentiable; it can be smoothly approximated by the integral of a smooth approximation of the sigmoid function [Lee and Mangasarian, 1999; Chen and Mangasarian, 1993] as follows:

    (x)_+ ≈ p(x, α) = x + (1/α) log(1 + exp(−αx)).

In the experiments, we find this approximation performs slightly better than the logistic function. Hence, the objective function in Eq.(5) is modified as follows:

    L = argmin_{U,V,ξ} Σ_{i=1}^N [ Σ_{j∈M_i^o} p(−U_i^T V_j + r_+, α) + Σ_{h=1}^G p(U_i^T V̄_h^i − ξ_ih − r_−, α) ] + ‖Θ‖,    (6)

where the slack variable has the constraint ξ_ih ≥ r_+ − r_−. We exploit a gradient descent based optimization procedure to obtain the optimal solutions for U, V and ξ in the above problem. The gradients are given by:

    ∂L/∂U_i = − Σ_{j∈M_i^o} p'_ij(U,V,ξ) V_j + Σ_{h=1}^G p̄'_ih(U,V,ξ) V̄_h^i + λ_U U_i,    (7)

    ∂L/∂V_j = − Σ_{i∈N_j^o} p'_ij(U,V,ξ) U_i + Σ_{i∈N_j^u} p̄'_{ih(j)}(U,V,ξ) U_i / |G_{h(j)}^i| + λ_V V_j,    (8)

    ∂L/∂ξ_ih = − p̄'_ih(U,V,ξ) + λ_ξ,    (9)

where N_j^o is the set of users who have consumed item j and N_j^u is the set of the remaining users who have never consumed it. One caveat is that, as the constraint placed on ξ guarantees it to be larger than 0 (i.e., its sign stays positive), we can simply drop the absolute value sign to obtain its gradient. Also, p'_ij(U,V,ξ) and p̄'_ih(U,V,ξ) are defined as follows:

    p'_ij(U,V,ξ) = 1 − 1/(1 + exp(α(−U_i^T V_j + r_+))),
    p̄'_ih(U,V,ξ) = 1 − 1/(1 + exp(α(U_i^T V̄_h^i − ξ_ih − r_−))).

To efficiently calculate the gradient of the latent factor V, we rewrite Eq.(8) as follows (see the next section):

    ∂L/∂V_j = − Σ_{i∈N_j^o} p'_ij(U,V,ξ) U_i − Σ_{i∈N_j^o} Ṽ_{h(j)i}(U,V,ξ) + Ṽ_{h(j)}^s(U,V,ξ) + λ_V V_j,    (10)

where Ṽ_hi(U,V,ξ) and Ṽ_h^s(U,V,ξ) are defined as follows:

    Ṽ_hi(U,V,ξ) = (1/|G_h^i|) p̄'_ih(U,V,ξ) U_i,   Ṽ_h^s(U,V,ξ) = Σ_{i=1}^N Ṽ_hi(U,V,ξ).

The details of the algorithm are shown in Figure 1, where an adaptive learning rate is used to improve the convergence rate [Orr, 1999].

Second, we sample a set of unobserved items for each user to refine the optimization granularity, which allows us to optimize the problem at both the group and instance levels. Suppose for user i we sample a set of unconsumed items, denoted as A, and then select those items whose predicted ratings are larger than their group's mean rating to learn the parameters. A selected item l for user i satisfies the following rule:

    U_i^T V_l > U_i^T V̄_{h(l)}^i.    (11)

Hence the objective function in Eq.(6) is refined as:

    L^(new) = L + Σ_{i=1}^N Σ_{q∈A_i} p(U_i^T V_q − ξ_{ih(q)} − r_−, α),

where A_i is the set of sampled items that satisfy Eq.(11) for user i. Therefore, the gradients of U, V and ξ can be obtained similarly to the above inference, given by:

    ∂L^(new)/∂U_i = ∂L/∂U_i + Σ_{q∈A_i} p̄'_{ih(q)}(U,V,ξ) V_q,
    ∂L^(new)/∂V_j = ∂L/∂V_j + Σ_{i∈A_j} p̄'_{ih(j)}(U,V,ξ) U_i,
    ∂L^(new)/∂ξ_ih = ∂L/∂ξ_ih − Σ_{q∈A_ih} p̄'_ih(U,V,ξ),

where p̄'_{ih(q)} is evaluated at the instance term's argument (U_i^T V_q − ξ_{ih(q)} − r_−), A_j is the set of users for whom item j is selected, and A_ih is the set of selected items of user i that belong to group h.

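The smoothed plus function p(x, α) and the derivative form used in the gradient expressions of Section 3.2 can be checked numerically. A minimal sketch (the grid of test points and tolerances are ours, not from the paper):

```python
import numpy as np

def smooth_plus(x, alpha):
    """Smooth approximation of (x)_+: p(x, a) = x + (1/a) * log(1 + exp(-a*x))."""
    # np.logaddexp(0, -a*x) computes log(1 + exp(-a*x)) stably for large |a*x|.
    return x + np.logaddexp(0.0, -alpha * x) / alpha

def smooth_plus_grad(x, alpha):
    """Its derivative, p'(x) = 1 - 1/(1 + exp(a*x)), the form used in the gradients."""
    return 1.0 - 1.0 / (1.0 + np.exp(alpha * x))

xs = np.linspace(-3.0, 3.0, 601)
# The gap to the exact plus function is largest at x = 0, where it equals
# log(2)/alpha, so the approximation tightens as alpha grows.
gap_a1 = np.abs(smooth_plus(xs, 1.0) - np.maximum(xs, 0.0)).max()
gap_a5 = np.abs(smooth_plus(xs, 5.0) - np.maximum(xs, 0.0)).max()

# Finite-difference check of the closed-form gradient at one point.
eps = 1e-6
fd = (smooth_plus(0.3 + eps, 5.0) - smooth_plus(0.3 - eps, 5.0)) / (2.0 * eps)
```

The paper's experimental setting α = 5 thus bounds the approximation error of each softened constraint by log(2)/5 ≈ 0.139.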
3.3 Complexity Analysis

In each iteration, updating U, V and ξ dominates the time complexity of solving the optimization problem in Eq.(6). To update U_i, the first term of Eq.(7) requires O(n_i K) time, where n_i is the number of items that user i has consumed. V_h^s is independent of i and can be pre-computed. Hence, computing Ṽ_ih^s costs O(n_ih K), where n_ih is the number of items that user i has consumed in group h, and n_i = Σ_h n_ih. The second term of updating U_i then needs O(n_i K + GK) time. Consequently, the computation of U_i is performed in O(n_i K + GK). This procedure is performed over N users, so the total time is O(nK + NGK), where n ≡ Σ_i n_i.

In the procedure of optimizing V_j, the search direction depends on V^(old) obtained in the last iteration. The first term in Eq.(10) costs O(n_j K), where n_j is the number of users who have consumed item j. Similarly to the above, V̄_h^i in Eq.(10) can be pre-calculated. On the other hand, Ṽ_hi(U, V^(old), ξ) is independent of j and can also be pre-computed. Ṽ_h^s(U, V^(old), ξ), being the sum of Ṽ_hi(U, V^(old), ξ) over all users, is then pre-stored in memory. Hence, the second term in Eq.(10) also costs O(n_j K) time. Thus, updating V_j needs O(n_j K) running time, and the total cost over M items is O(nK), where n = Σ_j n_j.

Similarly to the update of the user latent matrix U, each slack variable ξ_ih needs only O(K) time due to the pre-computation of Ṽ_ih^s. In total, updating ξ over N users and G groups costs O(NGK).

In the instance-level optimization, by an analysis similar to the above, each iteration needs O(NAK) running time in the worst case. In summary, each iteration of our algorithm costs O(nK + N(A+G)K) time, where n is the number of observed entries in the user-item matrix and N(A+G)/n is small. Therefore, the time complexity can be approximated by O(nK). In other words, the time complexity of each iteration of the optimization is linear in the number of observed user-item pairs.

Figure 1: Algorithm of the Relaxed Ranking-based Factor Model

Input: observed user-item pairs; regularization constants λ_U, λ_V and λ_ξ; dimension K of the latent space; group number G; stop criteria τ and maxIter; learning rate η
Output: U^(k), V^(k), ξ^(k)
1  Randomly initialize U^(0) and V^(0); k ← 1; ε ← ∞;
2  Randomly initialize ξ^(0) and ensure ξ_ih^(0) ≥ r_+ − r_−;
3  Randomly divide the items into G equal groups;
4  while k ≤ maxIter && ε > τ do
5    Uniformly sample a set A of unobserved items for each user i and select those items which satisfy Eq.(11) to join the learning of the latent factors;
6    for j = 1 to M do
7      V_j^(k+1) ← V_j^(k) − η ∂L^(new)(U^(k), V^(k), ξ^(k))/∂V_j;
8    for i = 1 to N do
9      U_i^(k+1) ← U_i^(k) − η ∂L^(new)(U^(k), V^(k+1), ξ^(k))/∂U_i;
10   for (i, h) = (1, 1) to (N, G) do
11     ξ_ih^(k+1) ← max(r_+ − r_−, ξ_ih^(k) − η ∂L^(new)(U^(k+1), V^(k+1), ξ^(k))/∂ξ_ih);
12   δ ← L^(new)(U^(k), V^(k), ξ^(k)) − L^(new)(U^(k+1), V^(k+1), ξ^(k+1));
13   Update η: η ← 1.05η if δ > 0, 0.5η otherwise;
14   ε ← |δ| / |L^(new)(U^(k), V^(k), ξ^(k))|;
15   k ← k + 1
16 return U^(k), V^(k), ξ^(k)

4 Experimental Results

4.1 Datasets
We use three datasets to evaluate the performance of our proposed model. The first two are check-in data collected from Gowalla and Foursquare. Each check-in record includes a user ID, a location ID, a check-in frequency and the timestamp of the user's first check-in at this location. As we only know users' check-in actions, these two datasets are implicit feedback datasets. The third dataset is the movie data of MovieLens, where each user provides explicit ratings, 1 to 5 stars, for some movies. As our task focuses on implicit feedback, similar to [Rendle et al., 2009], we remove the rating scores from the dataset. We use "consuming" to denote a user's checking-in or watching behavior in these three datasets. We remove those users who have consumed fewer than 10 items. The detailed statistics of the datasets are reported in Table 1.

As recommender systems target recommending new items to users, we split the training and testing data as follows. In the check-in data, for each user, we first sort the records according to the check-in timestamp, then select the earliest 80% to train the model and use the remaining 20% for testing. In the MovieLens dataset, we randomly select 80% of the data for training and use the rest for testing, due to the absence of timestamps.

4.2 Parameter Settings
In the experiments, the regularization constants λ_U, λ_V and λ_ξ are set to 0.01. The dimension K of the latent space is set to 10, the initial learning rate η is 0.001, and the parameter α is set to 5. We divide the items into 100 groups for all three datasets. The sample size A for the check-in data and MovieLens is set to 150 and 400, respectively.

4.3 Evaluation Metrics
We quantitatively evaluate model performance in terms of top-K recommendation performance, i.e., Precision@K and Recall@K (where K is different from the latent space dimension), and ranking performance, i.e., MAP and AUC [Li et al., 2015]. Formally, their definitions are:

    P@K = (1/N) Σ_{i=1}^N |S_i(K) ∩ T_i| / K,   R@K = (1/N) Σ_{i=1}^N |S_i(K) ∩ T_i| / |T_i|,

    MAP = (1/N) Σ_{i=1}^N ( Σ_{j=1}^{M̃_i} p(j) × rel(j) ) / |T_i|,   AUC = (1/N) Σ_{i=1}^N ( Σ_{(j,k)∈E(i)} I(R̂_ij > R̂_ik) ) / |E(i)|,

where S_i(K) is the set of top-K new items recommended to user i, excluding the items seen in training, and T_i is the set of items consumed by user i in testing. M̃_i is the number of returned items in the list of user i, p(j) is the precision of the cut-off ranked list from 1 to j, and rel(j) is an indicator that equals 1 if the j-th item is in the testing set and 0 otherwise. E(i) is defined as E(i) := {(j, k) | j ∈ T_i ∧ k ∉ (T_i ∪ T_i^t)}, where T_i^t is the set of items consumed by user i in training. I(·) is an indicator function that equals 1 if its argument is true and 0 otherwise.

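A minimal per-user sketch of these metrics (variable names such as S_K, T and train are ours, and averaging over users is omitted for brevity):

```python
# Per-user Precision@K, Recall@K and AUC, with made-up variable names:
# S_K: top-K recommended new items, T: test items, train: items seen in
# training, scores: predicted ratings for all items (index = item ID).
def precision_recall_at_k(S_K, T, K):
    hits = len(set(S_K) & set(T))
    return hits / K, hits / len(T)

def auc_one_user(scores, T, train):
    """Fraction of pairs in E(i) = {(j, k): j in T, k never consumed} ranked correctly."""
    seen = set(T) | set(train)
    negatives = [k for k in range(len(scores)) if k not in seen]
    pairs = [(j, k) for j in T for k in negatives]
    correct = sum(1 for j, k in pairs if scores[j] > scores[k])
    return correct / len(pairs)

# Toy example: 5 items; the user consumed item 4 in training and item 0 in testing.
scores = [0.9, 0.2, 0.8, 0.1, 0.99]
p_at_2, r_at_2 = precision_recall_at_k([0, 2], [0], K=2)
auc = auc_one_user(scores, T=[0], train=[4])
```

Here the top-2 list [0, 2] hits the single test item, and the test item outscores every never-consumed item, so the AUC for this user is perfect.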
Table 1: Statistics of Data Sets.

    Dataset      #User    #Item     #Records     Sparsity
    Gowalla      52,216   98,351    2,577,336    0.0399%
    Foursquare   74,343   198,161   3,501,608    0.0238%
    MovieLens    6,040    3,681     1,000,179    4.4986%

Table 2: Performance comparison in terms of the MAP and AUC metrics. The results are reported in percentage (%).

    Gowalla Dataset
    Metric   RRFM     WRMF     ABPR     UBPR     PMF
    MAP      3.793    2.634    3.169    3.080    1.357
    AUC      95.673   88.748   94.732   94.642   62.005
    Foursquare Dataset
    MAP      2.302    1.147    1.999    1.913    0.024
    AUC      95.814   84.866   94.726   94.554   61.211
    MovieLens Dataset
    MAP      21.108   20.528   19.659   19.512   10.159
    AUC      93.795   91.743   92.417   92.372   85.685

4.4 Baseline Methods

To comprehensively demonstrate the effectiveness of our proposed model, named RRFM, we compare it with the following popular recommendation models.

UBPR [Rendle et al., 2009] models the pairwise ranking for each pair of observed and unobserved items, and employs SGD with uniform sampling for optimization.

ABPR [Rendle and Freudenthaler, 2014] is similar to UBPR, but it over-samples the top ranked negative items based on a context-dependent sampling scheme.

WRMF [Hu et al., 2008] treats a user's rating as a binary value and fits the ratings on the observed and unobserved items with different confidence values.

PMF [Salakhutdinov and Mnih, 2007] regards the rating as the dot product of user-specific and item-specific feature factors. A logistic function is used as the rating conversion method to avoid the data's bias.

4.5 Performance Comparison

Ranking Performance Comparison. The ranking performance in terms of MAP and AUC of our proposed model and the baseline models is reported in Table 2. We summarize the following observations.

First, the method that only models the observed feedback (i.e., PMF) performs much worse than those taking both observed and unobserved data into consideration (i.e., the other four models). Specifically, on the Gowalla data, RRFM achieves 95.673% in terms of AUC while PMF only obtains 62.005%; likewise, RRFM reaches 3.793% in terms of MAP while PMF yields 1.357%. Different from explicit feedback, it is difficult to infer users' negative preferences from implicit feedback, so an explicit feedback based model struggles to achieve good performance on implicit feedback. The result indicates that the unobserved data also provides meaningful information and helps to improve ranking performance.

Second, the ranking-based methods (i.e., UBPR, ABPR and RRFM) are generally better than the rating prediction-based models (i.e., WRMF and PMF), particularly in terms of the AUC metric. Ranking-based methods model the ranking order of a user's positive ratings over negative ratings, which actually maximizes the AUC metric, while rating prediction-based models focus on correctly predicting the ratings. Although WRMF regards a user's positive and negative ratings as one and zero respectively, there is no guarantee of the correctness of their ranking order due to its rating prediction based loss. This explains why it performs poorly in AUC.

Third, our proposed model is superior to the others. For example, on Foursquare, RRFM outperforms WRMF by 100.7% and 12.9% in terms of MAP and AUC, respectively, and achieves a 15.2% improvement over ABPR in MAP. This happens for two reasons. First, we separate the positive and negative ratings by soft boundaries, where some of them are non-separable. This captures the user's actual consuming behavior, where he does not consume an item possibly due to dislike or unawareness. Second, we optimize the model at group and instance granularity, where the group-based optimization moves the parameters towards the minimum of the cost function and the instance-based optimization reinforces the approximation to the exact optimization problem in Eq.(5). In addition, WRMF behaves differently in MAP due to the much smaller number of items and much denser user-item entries in MovieLens compared with the check-in data.

Last, ABPR performs slightly better than UBPR. Instead of uniformly sampling negative items, ABPR samples from an adaptive distribution. It exploits the tailed characteristics of items (i.e., over-sampling the top ranked negative items) to speed up the convergence of SGD. But as the number of iterations increases, UBPR approximates the true optimal solution just as ABPR does, which is why it performs similarly to ABPR.

Top-K Recommendation Performance Comparison. The top-K performance of our proposed model versus the baseline methods on the three datasets is plotted in Figure 2.

[Figure 2: Performance comparison in terms of precision and recall on the Gowalla, Foursquare and MovieLens datasets. Panels: (a) Precision@K on Gowalla; (b) Recall@K on Gowalla; (c) Precision@K on Foursquare; (d) Recall@K on Foursquare; (e) Precision@K on MovieLens; (f) Recall@K on MovieLens.]

Based on the observations, we summarize as follows.

First, RRFM outperforms all baseline methods. In particular, it shows a significant improvement over PMF, which is attributable to the involvement of the unobserved data in the modeling. Its superior performance over ABPR and UBPR stems from the following reasons: (1) the relaxed model allows a user's preference for some unobserved items to mix with positive preference, based on the assumption that the unobserved data might be actually negative or missed positive values; (2) it optimizes the ranking problem over the entire set of unobserved items with a smooth approximation, which addresses the bias incurred by ABPR and UBPR. Furthermore, RRFM obtains much better performance than WRMF. Although WRMF also considers the ranking of positive ratings over negative ones, it models the recommendation problem by fitting the binary ratings, which cannot guarantee their actual ranking order; hence, it does not perform as well as RRFM. On MovieLens, the improvement of RRFM over WRMF is small because the denser observed data helps WRMF approximate the true optimal point.

Second, ABPR and UBPR are superior to WRMF on the check-in data, while they are slightly worse than WRMF on the MovieLens data. ABPR and UBPR actually optimize the AUC metric, which evaluates the ranking of a user's preference for the testing items over the remaining unobserved items. To make a user's positive ratings larger than most of the negative ones, they might sacrifice some top-K performance; it is a tradeoff between performance in AUC and in top-K. Hence, although they obtain very strong performance in terms of AUC, there is no guarantee that they also perform well in top-K recommendation, which is why they perform poorly in terms of Precision@K and Recall@K on the MovieLens dataset.

Last, ABPR, UBPR and PMF perform consistently with the ranking performance, where ABPR is better than UBPR, and PMF performs the worst among all models. Although the way of sampling negative items has been improved, ABPR can hardly make significant improvements due to the limitation of the sampling technique. PMF's poor result further demonstrates the importance of modeling all the data.

5 Related Work

In this section we briefly review some existing work on recommender systems. From the point of view of the optimization method, we group them into two categories.

The first category is the ranking-based models [Park et al., 2015; Pan and Chen, 2013]. The ranking-based models optimize different ranking metrics [Weston et al., 2013; Yang et al., 2011; Liu et al., 2015; 2014]. The first common type is based on AUC [Dhanjal et al., 2015]. Specifically, [Rendle et al., 2009] proposed to model the pairwise ranking of a user's preference for an observed item over an unobserved item, where the ranking is approximated by a logistic function. Parameter estimation was obtained by SGD with bootstrap sampling. Later, [Rendle and Freudenthaler, 2014] improved the convergence rate of such SGD by oversampling the top ranked negative items. In contrast, [Balakrishnan and Chopra, 2012; Taylor et al., 2008; Weimer et al., 2008] proposed to optimize the Normalized Discounted Cumulative Gain (NDCG), and [Shi et al., 2012a] optimized the Mean Average Precision (MAP). On the other hand, [Shi et al., 2013; 2012b] proposed to optimize the Mean Reciprocal Rank (MRR). Most of these either only model the observed data or rely heavily on sampling negative items, which differs from our work.

The second category is the rating prediction-based models [Salakhutdinov and Mnih, 2007; Koren et al., 2009; Pan et al., 2008; Takács and Tikk, 2012]. A rating prediction-based model minimizes the squared error loss between the observed rating and the estimated rating. [Salakhutdinov and Mnih, 2007] proposed to approximate the observed rating by the dot product of user and item latent factors, where the user and item latent factors are drawn from a Gaussian distribution. [Hu et al., 2008] treated the rating as a binary value, where the user's preferences for observed and unobserved items were regarded as one and zero, respectively; it then used a large weight to fit the observed ratings and a small weight to fit the unobserved ones. Similar to this idea, [Pan et al., 2008] formulated implicit feedback based item recommendation as a one-class problem, and proposed to sample some negative items to improve the efficiency. Although these two methods model both positive and negative ratings, they spend too much effort on rating prediction.

6 Conclusion

In this paper, we propose a relaxed ranking-based algorithm for item recommendation with implicit feedback, and design a smooth and scalable optimization method for the model's parameter estimation. The relaxed model not only avoids the pairwise ranking by separating the positive and negative feedback with soft boundaries, but also exploits the non-separable property of negative and positive feedback to capture the characteristics of the unobserved data. To evaluate our proposed model, we conduct extensive experiments with many baseline methods and evaluation metrics on real-world data sets. The experimental results have shown our model's effectiveness.

Acknowledgments

This work is partially supported by the National 973 Program of China under grant 2014CB347600, NIH (1R21AA023975-01), NSFC (71571093, 71372188, 61572032), National Center for International Joint Research on E-Business Information Processing (2013B01035), and National Key Technologies R&D Program of China (2013BAH16F03).

References

[Balakrishnan and Chopra, 2012] Suhrid Balakrishnan and Sumit Chopra. Collaborative ranking. In the fifth ACM international conference on Web search and data mining (WSDM), pages 143–152, 2012.

[Chen and Mangasarian, 1993] Chunhui Chen and O. L. Mangasarian. Smoothing methods for convex inequalities and linear complementarity problems. Mathematical Programming, 71:51–69, 1993.

[Dhanjal et al., 2015] Charanpal Dhanjal, Romaric Gaudel, and Stéphan Clémençon. Collaborative filtering with localised ranking. In Twenty-Ninth AAAI Conference on Artificial Intelligence, pages 2554–2560, 2015.

[Hu et al., 2008] Yifan Hu, Yehuda Koren, and Chris Volinsky. Collaborative filtering for implicit feedback datasets. In IEEE International Conference on Data Mining (ICDM), pages 263–272, 2008.

[Joachims et al., 2005] Thorsten Joachims, Laura Granka, Bing Pan, Helene Hembrooke, and Geri Gay. Accurately interpreting clickthrough data as implicit feedback. In SIGIR, pages 154–161, 2005.

[Koren et al., 2009] Yehuda Koren, Robert Bell, and Chris Volinsky. Matrix factorization techniques for recommender systems. Computer, 42(8):30–37, August 2009.

[Lee and Mangasarian, 1999] Yuh-Jye Lee and O. L. Mangasarian. SSVM: A smooth support vector machine for classification. Technical report, Computational Optimization and Applications, 1999.

[Li et al., 2015] Huayu Li, Richang Hong, Shiai Zhu, and Yong Ge. Point-of-interest recommender systems: A separate-space perspective. In IEEE International Conference on Data Mining (ICDM), pages 231–240, 2015.

[Lim et al., 2015] Daryl Lim, Julian McAuley, and Gert Lanckriet. Top-N recommendation with missing implicit feedback. In Proceedings of the 9th ACM Conference on Recommender Systems, pages 309–312, 2015.

[Liu et al., 2014] Siyuan Liu, Shuhui Wang, Feida Zhu, Jinbo Zhang, and Ramayya Krishnan. Hydra: Large-scale social identity linkage via heterogeneous behavior modeling. In SIGMOD, pages 51–62, 2014.

[Liu et al., 2015] Siyuan Liu, Shuhui Wang, and Feida Zhu. Structured learning from heterogeneous behavior for social identity linkage. IEEE Trans. Knowl. Data Eng., 27(7):2005–2019, 2015.

[N. and Qiang, 2008] Nathan N. Liu and Qiang Yang. EigenRank: A ranking-oriented approach to collaborative filtering. In Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 83–90, 2008.

[Orr, 1999] Genevieve Orr. Momentum and learning rate adaptation, 1999.

[Pan and Chen, 2013] Weike Pan and Li Chen. GBPR: Group preference based Bayesian personalized ranking for one-class collaborative filtering. In IJCAI, pages 2691–2697, 2013.

[Pan et al., 2008] Rong Pan, Yunhong Zhou, Bin Cao, Nathan N. Liu, Rajan Lukose, Martin Scholz, and Qiang Yang. One-class collaborative filtering. In ICDM, pages 502–511, 2008.

[Park et al., 2015] Dohyung Park, Joe Neeman, Jin Zhang, Sujay Sanghavi, and Inderjit S. Dhillon. Preference completion: Large-scale collaborative ranking from pairwise comparisons. In ICML, pages 1907–1916, 2015.

[Rendle and Freudenthaler, 2014] Steffen Rendle and Christoph Freudenthaler. Improving pairwise learning for item recommendation from implicit feedback. In the 7th ACM international conference on Web search and data mining (WSDM), pages 273–282, 2014.

[Rendle et al., 2009] Steffen Rendle, Christoph Freudenthaler, Zeno Gantner, and Lars Schmidt-Thieme. BPR: Bayesian personalized ranking from implicit feedback. In the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence (UAI), pages 452–461, 2009.

[Salakhutdinov and Mnih, 2007] Ruslan Salakhutdinov and Andriy Mnih. Probabilistic matrix factorization. In NIPS, pages 1257–1264, 2007.

[Shi et al., 2012a] Yue Shi, Alexandros Karatzoglou, Linas Baltrunas, Martha A. Larson, Alan Hanjalic, and Nuria Oliver. TFMAP: Optimizing MAP for top-N context-aware recommendation. In SIGIR, pages 155–164, 2012.

[Shi et al., 2012b] Yue Shi, Alexandros Karatzoglou, Linas Baltrunas, Martha Larson, Nuria Oliver, and Alan Hanjalic. CLiMF: Learning to maximize reciprocal rank with collaborative less-is-more filtering. In RecSys, pages 139–146, 2012.

[Shi et al., 2013] Yue Shi, Alexandros Karatzoglou, Linas Baltrunas, Martha Larson, Nuria Oliver, and Alan Hanjalic. CLiMF: Collaborative less-is-more filtering. In IJCAI, 2013.

[Takács and Tikk, 2012] Gábor Takács and Domonkos Tikk. Alternating least squares for personalized ranking. In RecSys, pages 83–90, 2012.

[Taylor et al., 2008] Michael Taylor, John Guiver, Stephen Robertson, and Tom Minka. SoftRank: Optimizing non-smooth rank metrics, 2008.

[Weimer et al., 2008] Markus Weimer, Alexandros Karatzoglou, Quoc V. Le, and Alex J. Smola. CoFiRank: Maximum margin matrix factorization for collaborative ranking. In Advances in Neural Information Processing Systems 20, pages 1593–1600. Curran Associates, Inc., 2008.

[Weston et al., 2013] Jason Weston, Hector Yee, and Ron J. Weiss. Learning to rank recommendations with the k-order statistic loss. In RecSys, pages 245–248, 2013.

[Yang et al., 2011] Shuang-Hong Yang, Bo Long, Alexander J. Smola, Hongyuan Zha, and Zhaohui Zheng. Collaborative competitive filtering: Learning recommender using context of user choice. In SIGIR, pages 295–304, 2011.

[Zhu et al., 2003] Ji Zhu, Saharon Rosset, Trevor Hastie, and Rob Tibshirani. 1-norm support vector machines. In NIPS 16. MIT Press, 2003.
