Extreme Learning to Rank Via Low Rank Assumption

Extreme Learning to Rank via Low Rank Assumption Minhao Cheng 1 Ian Davidson 1 Cho-Jui Hsieh 1 2 Abstract Negahban et al., 2012; Wauthier et al., 2013). We consider the setting where we wish to perform However, in many modern applications, a single global ranking for hundreds of thousands of users which ranking is not sufficient to represent the variety of individual is common in recommender systems and web users preferences. For example, in movie ranking systems, it search ranking. Learning a single ranking func- is preferable to learn an individual ranking function for each tion is unlikely to capture the variability across user since users’ tastes can vary largely. The issue also arises all users while learning a ranking function for in many other applications such as product recommendation, each person is time-consuming and requires large and personalized web search ranking. amounts of data from each user. To address this situation, we propose a Factorization RankSVM Motivated by these real scenarios, we consider the prob- algorithm which learns a series of k basic rank- lem of learning hundreds of thousands of ranking functions ing functions and then constructs for each user jointly, one for each user. Our target problem is different a local ranking function that is a combination of from collaborative ranking and BPR (Rendle et al., 2009; them. We develop a fast algorithm to reduce the Wu et al., 2017a) since they only aim to recover ranking time complexity of gradient descent solver by ex- over existing items without using item features, while our ploiting the low-rank structure, and the resulting goal is to obtain the ranking functions (taking item features algorithm is much faster than existing methods. as input) that can generalize to unseen items. This is also Furthermore, we prove that the generalization er- different from existing work on learning multiple ranking ror of the proposed method can be significantly functions (i.e. (Qian et al., 2014)) because in that setting better than training individual RankSVMs. Fi- they are learning only several ranking functions. Here we nally, we present some interesting patterns in the focus on problems where the number of ranking functions T principal ranking functions learned by our algo- is very large (e.g.,100K) but the amount of data to learn each rithms. ranking function is limited. The naive extensions of learning to rank algorithms fail since the training time grows dramat- ically, and also due to the over-fitting problem because of 1. Introduction insufficient number of pairs for training. Learning a ranking function based on pairwise compar- To resolve this dilemma, we propose the Factorization isons has been studied extensively in recent years, with RankSVM model for learning multiple ranking functions many successful applications in building search engines and jointly. The main idea is to assume the T ranking functions other information retrieval tasks. Given a set of training in- can be represented by a dictionary of k ranking functions with k T . In the linear RankSVM case, this assumption stances with features x1; :::; xn and pairwise comparisons, the goal is to find the optimal decision function f(·) such implies a low-rank structure when we stack all the T linear hyper-planes together into a matrix. By exploiting this low that f(xi) > f(xj) if i is preferred over j. This is usu- ally referred to as a learning-to-rank problem, and several rank structure, our algorithm can be efficient for both time algorithms have been proposed, including RankSVM (Her- and sample complexity. brich et al., 1999), gradient boosting decision tree (Li et al., Our contributions can be summarized as follows: 2007), and many others (Cao et al., 2007; Yue et al., 2007; • We propose the Factorization RankSVM model for 1Department of Computer Science, University of Califor- learning a large number of different ranking functions nia Davis, USA. 2Department of Statistics, University of Cal- on different sets of data simultaneously. By exploiting ifornia Davis, USA . Correspondence to: Minhao Cheng the low-rank structure, we show that the gradient can be <[email protected]>. calculated very efficiently, and the resulting algorithm Proceedings of the 35 th International Conference on Machine can scale to problems with large number of tasks. Learning, Stockholm, Sweden, PMLR 80, 2018. Copyright 2018 • We derive the generalization error bound of our model, by the author(s). Extreme Learning to Rank via Low Rank Assumption showing that by training all the T tasks jointly, the sam- is different from BPR. We consider problems given both ple complexity is much better than training individual pairwise comparisons and “explicit” item features, and the rankSVMs under the low rank assumption. goal is to learn the personalized ranking “functions” that • We conduct experiments on real world datasets, show- can generalize to unseen items as long as we know their fea- ing that the algorithm achieves higher accuracy and tures. In comparison, the BPR does not take item features faster training time compared with state-of-the-art into account, and the goal of BPR is to recover the ranking methods. This is a critical result as it shows the low among existing items. Also, the ranking cannot generalize rank ranking conjecture that underlies our research to unseen items. Moreover, BPR considers implicit (0/1) does occur. feedback instead of explicit feedback. • We further visualize the basic ranking functions Collaborative Ranking is another line of research that in- learned by our algorithm, which has some interesting corporates ranking loss in collaborative filtering. (Park et al., and meaningful patterns. 2015; Weimer et al., 2007; Wu et al., 2017a) combines the 2. Related Work ranking loss with matrix completion model, and (Yun et al., 2014) also uses a low-rank model with ranking loss given Learning to rank. Given a set of pairwise comparisons a binary observed matrix. However, similar to matrix com- between instances and the feature vectors associated with pletion and BPR, these collaborative ranking approaches do each instance, the goal of learning to rank is to discover not use the item features. So they are not able to predict the the ranking function. There are three main categories of preferences for unseen items. Also in this category, (Bar- learning to rank algorithms: pointwise (Li et al., 2007), jasteh et al., 2015) uses a trace norm to constraint ranking listwise (Cao et al., 2007; Yue et al., 2007), and pairwise function, which is similar with our idea. However, they use methods (Herbrich et al., 2000; Cao et al., 2006). In this implicit feedback which will lose certain information. paper, we mainly focus on pairwise methods, which process Multi-task Learning: has been extensively studied, espe- a pair of documents or entries at a time. Among all the cially in computer vision application. To model the shared pairwise methods, rankSVM is a very popular one, so we information across tasks, a low-rank structure is widely choose it as our basic model. assumed (Chen et al., 2012; 2009). (Hwang et al., 2011; We focus on the problem of solving T learning-to-rank Su et al., 2015) takes the attributes correlation as low-rank problems jointly when the problems share the same feature embeddings to learn SVM. However, our approach of learn- space and there is some hidden correlation between tasks. ing basic ranking functions has not been discussed in the Obviously, we could apply existing learning-to-rank algo- literature. rithms to solve each task independently, but this approach A summary of the differences between our algorithm with has several major drawbacks, as will be discussed in next others are showed in Table 1. section. Collaborative filtering and matrix factorization Low- 3. Problem Setting rank approximation has been widely used in matrix com- Our goal is to learn multiple ranking functions together, one pletion and collaborative filtering (Koren et al., 2009), and for each user. Assume there are in total T ranking functions there are several extensions for matrix completion (Weimer to be learned (each one can be viewed as a task), and we et al., 2007). However, these methods cannot be applied in are given pairwise comparisons for these ranking functions our setting, since our predictions are based on item features, among n items with features x ; x ;:::; x 2 Rd. For and the corresponding items may not even appear in the 1 2 n each task i, the pairwise comparisons are denoted as Ω = training data. To conduct prediction based on item features, i f(j; k)g, where (j; k) 2 Ω means task i compares item j the inductive matrix factorization model has been recently i with k, and y 2 f+1; −1g is the observed outcome. For proposed in (Jain & Dhillon, 2013; Xu et al., 2013), and fac- ijk convenience, we use Ω to denote the union of all Ω . Given torization machine (Rendle, 2010) also uses a similar model. i these pairwise comparisons, we aim to learn a set of linear However, this model only allows input to be user-item rat- ranking functions w ; w ;:::; w 2 Rd such that ings, not the pairwise comparisons used in our problem. 1 2 T T In the experiments, we observe that even if the rating data sign(wi (xj −xk)) ≈ yijk; 8(j; k) 2 Ωi; 8i = 1;:::;T is available, our model still outperforms inductive matrix completion significantly. The only assumption we make for these T ranking tasks is that the items involved in each task share the same fea- Bayesian Personalized Ranking (Rendle et al., 2009) ture space with d features. Note that our algorithm allows proposes Bayesian Personalized Ranking(BPR) method to each task has non-overlapping items—in that case we can solve personalized ranking task.

Load more