
Large-scale RankSVM

Tzu-Ming Kuo∗ Ching-Pei Lee† Chih-Jen Lin‡

∗Department of Computer Science, National Taiwan University. [email protected]
†Department of Computer Science, National Taiwan University. [email protected]
‡Department of Computer Science, National Taiwan University. [email protected]

Abstract

Learning to rank is an important task for recommendation systems, online advertisement and web search. Among those learning to rank methods, rankSVM is a widely used model. Both linear and nonlinear (kernel) rankSVM have been extensively studied, but the lengthy training time of kernel rankSVM remains a challenging issue. In this paper, after discussing difficulties of training kernel rankSVM, we propose an efficient method to handle these problems. The idea is to reduce the number of variables from quadratic to linear with respect to the number of training instances, and to efficiently evaluate the pairwise losses. Our setting is applicable to a variety of loss functions. Further, general optimization methods can be easily applied to solve the reformulated problem. Implementation issues are also carefully considered. Experiments show that our method is faster than state-of-the-art methods for training kernel rankSVM.

Keywords: Kernel method, Learning to rank, Support vector machines, Large-margin method

1 Introduction

Being heavily applied in recommendation systems, online advertisements and web search in recent years, learning to rank gains more and more importance. Among existing approaches for learning to rank, rankSVM [7] is a commonly used method extended from the popular support vector machine (SVM) [2, 6] for data classification.

In SVM literature, it is known that linear (i.e., data are not mapped to a different space) and kernel SVMs are suitable for different scenarios, where linear SVM is more efficient, but the more costly kernel SVM may give higher accuracy.¹ The same situation may also occur in rankSVM because it can be viewed as a special case of SVM; see more details in Section 2.1. Because of the lower training cost, linear rankSVM has been extensively studied and efficient algorithms have been proposed [11, 23, 5, 1, 14]. However, for some tasks the feature set is not rich enough, so nonlinear methods may be needed. Therefore, it is important to develop efficient training methods for large kernel rankSVM.

¹See, for example, [27] for detailed discussion.

Assume we are given a set of training label-query-instance tuples (y_i, q_i, x_i), y_i ∈ R, q_i ∈ S ⊂ Z, x_i ∈ R^n, i = 1, ..., l, where S is the set of queries. By defining the set of preference pairs as

(1.1)   P ≡ {(i, j) | q_i = q_j, y_i > y_j}   with   p ≡ |P|,

rankSVM [10] solves

(1.2)   min_{w,ξ}   (1/2) wᵀw + C Σ_{(i,j)∈P} ξ_{i,j}
        subject to  wᵀ(φ(x_i) − φ(x_j)) ≥ 1 − ξ_{i,j},
                    ξ_{i,j} ≥ 0,   ∀(i, j) ∈ P,

where C > 0 is the regularization parameter and φ is a function mapping data to a higher dimensional space. The loss term ξ_{i,j} in (1.2) is called L1 loss. If it is replaced by ξ_{i,j}², we have L2 loss. The idea behind rankSVM is to learn w such that

        wᵀφ(x_i) > wᵀφ(x_j),   if (i, j) ∈ P.

A challenge in training rankSVM is to handle the possibly large number of preference pairs, because p can be as large as O(l²).

In contrast to linear rankSVM, which can directly minimize over a finite vector w, the difficulty of solving (1.2) lies in the high and possibly infinite dimensionality of w after the data mapping. Existing studies have proposed different methods to solve (1.2) through kernel techniques. The work [7] viewed (1.2) as a special case of SVM, so standard training methods that solve the SVM dual problem can be applied. However, the dual problem of p variables can become very large if p = O(l²). Joachims [11] reformulated rankSVM as a 1-slack structural SVM problem and considered a cutting-plane method. Although [11] only experiments with linear rankSVM, this approach is applicable to kernel rankSVM. However, cutting-plane methods may suffer from slow convergence. The work [26], inspired by a generalized representer theorem [22], represents w as a linear combination of the training instances. It then reformulates and modifies (1.2) to a linear programming problem. However, the resulting problem is still large because of O(p) variables as well as constraints.

In this paper, a method for efficiently training nonlinear rankSVM is proposed. Our approach solves (1.2) by following [26] to represent w as a linear combination of mapped training instances. In contrast to the linear programming problem in [26], we rigorously obtain an l-variable optimization problem that is equivalent to (1.2). A method for efficiently computing the loss term and its sub-gradient or gradient without going through all the p pairs is then introduced by modifying the order-statistic trees technique from linear rankSVM [1, 14]. Our approach allows the flexible use of various unconstrained optimization methods for training kernel rankSVM efficiently. We present an implementation that is experimentally faster than state-of-the-art methods.

This paper is organized as follows. In Section 2, we discuss previous studies on kernel rankSVM in detail. By noticing their shortcomings, an efficient method is then proposed in Section 3. An implementation with comprehensive comparisons to existing works is described in Section 4. Section 5 concludes this paper.
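To make the notation concrete, the following sketch enumerates the preference pairs of (1.1) from arrays of labels and query identifiers. It is only an illustration of the definition (the function name and the NumPy-based setting are ours, not part of the proposed implementation); it also makes visible how p grows quadratically with the number of instances per query.

```python
import numpy as np

def preference_pairs(y, q):
    """Enumerate P = {(i, j) | q_i = q_j, y_i > y_j} of (1.1).

    y: relevance labels, shape (l,); q: query ids, shape (l,).
    Returns a list of index pairs; p = len(pairs) can be O(l^2).
    """
    pairs = []
    l = len(y)
    for i in range(l):
        for j in range(l):
            if q[i] == q[j] and y[i] > y[j]:
                pairs.append((i, j))
    return pairs

# Toy example: one query with labels 2, 1, 0 gives the pairs (0,1), (0,2), (1,2).
y = np.array([2, 1, 0])
q = np.array([1, 1, 1])
print(preference_pairs(y, q))
```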

2 Existing Methods

We introduce three existing methods for training nonlinear rankSVM. The first treats rankSVM as an SVM problem. The second solves rankSVM under a structural SVM framework. The last approach is inspired by a generalized representer theorem and optimizes an alternative linear programming problem.

2.1 SVM Approach  Given a set of label-instance pairs (y_i, x_i), x_i ∈ R^n, y_i ∈ {1, −1}, i = 1, ..., l, SVM solves the following optimization problem.

(2.3)   min_{w,ξ̄}   (1/2) wᵀw + C Σ_{i=1}^{l} ξ̄_i
        subject to  y_i wᵀφ(x_i) ≥ 1 − ξ̄_i,
                    ξ̄_i ≥ 0,   i = 1, ..., l.

Clearly, if we define

        y_{i,j} ≡ 1   and   φ_{i,j} ≡ φ(x_i) − φ(x_j),   ∀(i, j) ∈ P,

then (1.2) is in the form of (2.3) with p instances [7]. One can then apply any SVM solver to solve (1.2).

To handle the high dimensionality of φ(x_i) and w, it is common to solve the dual problem of (2.3) with the help of kernel tricks [6]. To adopt the same technique for rankSVM, the dual problem of (1.2) is as follows.

(2.4)   min_α   (1/2) αᵀQ̂α − eᵀα
        subject to  0 ≤ α_{i,j} ≤ C,   ∀(i, j) ∈ P,

where α ∈ R^p is indexed by pairs in P, e ∈ R^p is a vector of ones, and

(2.5)   Q̂_{(i,j),(u,v)} = φ_{i,j}ᵀφ_{u,v},   ∀(i, j), (u, v) ∈ P

is a p by p symmetric matrix. For example, [10] solves (2.4) using the SVM package SVMlight [9].

Problem (2.4) is a large quadratic programming problem because the number of variables can be up to O(l²). From the primal-dual relationship, optimal w and α satisfy

(2.6)   w ≡ Σ_{(i,j)∈P} α_{i,j} φ_{i,j}.

Notice that because φ(x) may be infinite dimensional, (2.5) is calculated by the kernel function K(·, ·):

        Q̂_{(i,j),(u,v)} = K(x_i, x_u) + K(x_j, x_v) − K(x_i, x_v) − K(x_j, x_u),

where

        K(x_i, x_j) = φ(x_i)ᵀφ(x_j).

Although directly computing the matrix Q̂ requires O(l⁴) kernel evaluations, we can take the following special structure of Q̂ to save the cost.

(2.7)   Q̂ = A Q Aᵀ,

where Q ∈ R^{l×l} with Q_{i,j} = K(x_i, x_j), and A ∈ R^{p×l} is defined as follows:

                         column i        column j
        A ≡  row (i,j): ( 0 ⋯ 0   +1   0 ⋯ 0   −1   0 ⋯ 0 ).

That is, if (i, j) ∈ P, then the corresponding row of A has its i-th entry equal to 1, its j-th entry equal to −1, and all other entries equal to zero. Hence, computing Q̂ requires O(l²) kernel evaluations. However, the difficulty of having O(l²) variables remains, so solving (2.4) has not been a viable approach for large kernel rankSVM.
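The structure in (2.7) is easy to exploit in practice. The sketch below (our own illustration with an RBF kernel; not code from any of the packages discussed here) forms Q with O(l²) kernel evaluations, builds A from the pair list, and obtains Q̂ = A Q Aᵀ without any further kernel evaluation. For readability A is stored densely; in an actual implementation it would be kept sparse, since each row has only two non-zeros.

```python
import numpy as np
from scipy.spatial.distance import cdist

def kernel_matrix(X, gamma):
    # Q_{i,j} = K(x_i, x_j); assumed RBF form exp(-gamma * ||x_i - x_j||).
    return np.exp(-gamma * cdist(X, X))

def pair_matrix(pairs, l):
    # A in (2.7): one row per preference pair, +1 at column i and -1 at column j.
    A = np.zeros((len(pairs), l))
    for r, (i, j) in enumerate(pairs):
        A[r, i] = 1.0
        A[r, j] = -1.0
    return A

def Q_hat(X, pairs, gamma):
    Q = kernel_matrix(X, gamma)           # O(l^2) kernel evaluations
    A = pair_matrix(pairs, len(X))
    return A @ Q @ A.T                    # p x p, so affordable only for small p
```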

2.2 Structural SVM Approach  To avoid the difficulty of O(l²) variables in the dual problem, Joachims showed that (1.2) is equivalent to the following 1-slack structural SVM problem [11].

(2.8)   min_{w,ξ}   (1/2) wᵀw + Cξ
        subject to  wᵀ Σ_{(i,j)∈P} c_{i,j} φ_{i,j} ≥ Σ_{(i,j)∈P} c_{i,j} − ξ,
                    c_{i,j} ∈ {0, 1},   ∀(i, j) ∈ P.

He then considered a cutting-plane method to solve the dual problem of (2.8). Because of the 2^p constraints in (2.8), the corresponding dual problem contains the same number of variables. While optimizing (2.8) seems to be more difficult at first glance, it is shown that the optimal dual solution is sparse with a small number of non-zero elements, and this number is independent of p. At each iteration, their cutting-plane method optimizes a sub-problem of the dual problem of (2.8). The sub-problem consists of the variables in a small working set, which is empty at the beginning of the optimization procedure. After the sub-problem is solved, the variable corresponding to the most violated constraint of (2.8) is added to the working set. This approach thus avoids unnecessary computations involving variables that are zero at the optimum [12]. An efficient algorithm is also proposed to evaluate ξ and decide which variable is added to the working set at each iteration. Although [11] only discussed the case when kernels are not used, the method can be easily extended to kernel rankSVM. Based on the structural SVM package SVMstruct [24, 12], a package SVMrank that can solve both linear and kernel rankSVM by a cutting-plane method is released.²

²http://www.cs.cornell.edu/People/tj/svm_light/svm_rank.html. Note that, as stated on the website, the method for computing the loss term in this package has a higher complexity than the algorithm proposed in [11].

However, empirical examinations show that in the linear case, solving rankSVM under the structural SVM framework using a cutting-plane method converges more slowly than other state-of-the-art methods that directly solve (1.2) [14].

2.3 Reformulation from Representer Theorem  To address the difficulty of optimizing the dual problem with O(l²) variables, [26] considered solving the primal problem. Although the number of variables may be infinite after the data mapping, they applied a generalized representer theorem [22] to show that the optimal w is a linear combination of the training data with coefficients β:³

(2.9)   w ≡ Σ_{i=1}^{l} β_i φ(x_i).

Therefore, wᵀφ_{i,j} can be computed by

        wᵀφ_{i,j} = Σ_{m=1}^{l} β_m (K(x_i, x_m) − K(x_j, x_m)) = (Qβ)_i − (Qβ)_j.

To reduce the number of nonzero β_i, [26] considered a regularization term eᵀβ with β being nonnegative, and obtained the following linear programming problem.

(2.10)  min_{β,ξ}   eᵀβ + C eᵀξ
        subject to  (Qβ)_i − (Qβ)_j ≥ 1 − ξ_{i,j},
                    ξ_{i,j} ≥ 0,   ∀(i, j) ∈ P,
                    β_i ≥ 0,   i = 1, ..., l.

A package RV-SVM⁴ is released, but this approach has the following problems. First, in the representer theorem the coefficients β are unconstrained. To enhance sparsity, they added the nonnegativity constraints on β. Thus the setting does not coincide with the theorem being used. Second, they have the regularization term eᵀβ, which, after using (2.9), is equivalent to neither ‖w‖₁ (L1 regularization) nor wᵀw/2 (L2 regularization). Therefore, after solving (2.10), we cannot use (2.9) to obtain the optimal w of the original problem. This situation undermines the interpretability of (2.10). Third, without noticing (2.7), they claimed in [26] that the number of kernel evaluations required by solving the dual problem (2.4) is O(l⁴), while theirs only requires O(l³). However, as we discussed earlier in (2.7) and from (2.10), if Q can be stored, both approaches require only O(l²) kernel evaluations. Finally, the number of variables ξ in the linear programming problem can still be as large as O(l²). Solving a linear programming problem with this number of variables is expensive, so in the experiments conducted in [26], the data size is less than 1,000 instances.

³A similar idea, using the original representer theorem [13] for L2-regularized L2-loss kernel rankSVM, is briefly mentioned in [5]. However, their focus was linear rankSVM.
⁴https://sites.google.com/site/postechdm/research/implementation/rv-svm.
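The representation (2.9) is also the basis of the method proposed in Section 3, so it is worth seeing how predictions are evaluated under it. The short sketch below (our illustration; the variable names are ours) computes wᵀφ(x_i) for all instances as Qβ and a pairwise margin wᵀφ_{i,j} as (Qβ)_i − (Qβ)_j, without ever forming the mapping φ explicitly.

```python
import numpy as np

def scores(Q, beta):
    # Under (2.9), w^T phi(x_i) = sum_m beta_m K(x_i, x_m) = (Q beta)_i.
    return Q @ beta

def pairwise_margin(Q, beta, i, j):
    # w^T phi_{i,j} = (Q beta)_i - (Q beta)_j.
    s = scores(Q, beta)
    return s[i] - s[j]
```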

3 Methods and Technical Solutions

An advantage of the approach in [26] is that the formulation of w in (2.9) involves only l variables. This is much smaller than the O(l²) variables in the dual rankSVM problem (2.4). However, the O(l²) number of constraints in (2.10) still causes a large linear programming problem. To derive efficient optimization algorithms, in this section we rewrite (1.2) in the following form:

(3.11)  min_w   (1/2) wᵀw + C Σ_{(i,j)∈P} max(0, 1 − wᵀφ_{i,j}),

where the first term is the regularization term, while the second is the loss term. We then incorporate two techniques.
1. Let w involve l variables as in (2.9).
2. Apply efficient techniques to calculate the loss term. In particular, some past developments for linear rankSVM are employed.

3.1 Regularization Term  If w is represented by (2.9), then (1.2) can be written as

(3.12)  min_{β∈R^l}   (1/2) βᵀQβ + C Σ_{(i,j)∈P} max(0, 1 − (Qβ)_i + (Qβ)_j)

and

(3.13)  min_{β∈R^l}   (1/2) βᵀQβ + C Σ_{(i,j)∈P} max(0, 1 − (Qβ)_i + (Qβ)_j)²,

respectively for L1 and L2 losses. A difference between the two problems is that (3.13) is differentiable while (3.12) is not. The small number of l variables is superior to the O(l²) variables in the dual rankSVM problem (2.4). Interestingly, this advantage does not occur for standard SVM. We explain the subtle difference below.

In SVM, the use of formulations like (3.12)-(3.13) has been considered in many places such as [19, 17, 4], where the derivation mainly follows the representer theorem. Take L1 loss as an example. The SVM optimization problem analogous to (3.12) is

(3.14)  min_{β̄}   (1/2) β̄ᵀQ̄β̄ + C Σ_{i=1}^{l} max(0, 1 − (Q̄β̄)_i),

where Q̄_{i,j} = y_i y_j K(x_i, x_j); notice that in SVM problems, y_i ∈ {−1, 1}, ∀i. If ᾱ is the variable of the dual SVM, then both ᾱ and β̄ have l components. However, ᾱ is nonnegative while β̄ is unconstrained. It is proved in [4] that if Q̄ is positive definite, then the optimum is unique and satisfies

        ᾱ_i = y_i β̄_i,   ∀i.

Therefore, (3.14) and the SVM dual are strongly related. For SVM, because (3.14) does not possess significant advantages, most existing packages solve the dual problem. The situation for rankSVM is completely different.

The work [17] derives (3.14) without using the representer theorem. Instead, they directly consider (3.14) and investigate the connection to the SVM dual problem via optimization theory. Following the same setting, the problem to be considered for rankSVM is

(3.15)  min_{β̂∈R^p}   (1/2) β̂ᵀQ̂β̂ + C Σ_{(i,j)∈P} max(0, 1 − (Q̂β̂)_{i,j}).

In Appendix A, we prove that any optimal β̂ leads to an optimal

(3.16)  w = Σ_{(i,j)∈P} β̂_{i,j} φ_{i,j}

of (1.2). This form is the same as (2.6), but a crucial difference is that β̂ is unconstrained. Therefore, we can define

        β ≡ Aᵀβ̂

to simplify (3.15) to the equivalent form (3.12). This discussion explains why we are able to reduce the number of variables from the O(l²) of β̂ to the l of β. From Appendix A, any optimal solution of the dual rankSVM is also optimal for (3.15). Thus, we can say that (3.15) provides a richer set of possible values to construct the optimal w.⁵ Therefore, the simplification from β̂ to β becomes possible.

⁵Note that the optimal w is unique because wᵀw/2 is strictly convex.

We mentioned that for standard SVM, if Q̄ in (3.14) is positive definite, the dual SVM and (3.14) have the same unique solution. For rankSVM, this property requires that Q̂ be positive definite. However, from (2.7) and the fact that each row of (AQ)Aᵀ is a linear combination of Aᵀ's rows,

        rank(Q̂) ≤ rank(Aᵀ) ≤ min(p, l) ≤ l.

Therefore, Q̂ tends to be only positive semi-definite because usually p ≠ l. This result indicates that the connection between (3.15) and the rankSVM dual (2.4) is weaker than that between (3.14) and the SVM dual.
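As a baseline for the reformulation above, the sketch below evaluates the L2-loss objective of (3.13) directly, looping over all pairs in P. It is a reference implementation of ours (useful for checking faster routines); its O(p) loss loop is exactly the cost that Section 3.2 removes.

```python
import numpy as np

def objective_l2(Q, beta, pairs, C):
    """Objective of (3.13): 0.5 * beta^T Q beta + C * sum of squared hinge losses."""
    Qb = Q @ beta
    reg = 0.5 * beta @ Qb
    loss = 0.0
    for (i, j) in pairs:              # O(p) pairs; p can be O(l^2)
        violation = 1.0 - Qb[i] + Qb[j]
        if violation > 0:
            loss += violation ** 2
    return reg + C * loss
```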
3.2 Loss Term  After reformulating to (3.12) or (3.13), the number of variables is reduced to l. However, to calculate the loss term, we still have the summation over p preference pairs. The same difficulty occurs for linear rankSVM, so past works have proposed efficient algorithms. Among them, a recently developed method using order-statistic trees [1, 14] is the fastest, with only O(l log l) cost to compute this summation. For other values commonly used in optimization methods, such as the sub-gradient (if L1 loss is used), the gradient, or Hessian-vector products (if L2 loss is adopted), the cost is also O(l log l). We observe that the loss term of (3.12) is similar to that of linear rankSVM, which is of the form

        Σ_{(i,j)∈P} max(0, 1 − wᵀ(x_i − x_j)).

Thus we can easily see that methods for summing up the p pairs in linear rankSVM are also applicable to nonlinear rankSVM. We give an illustration by showing the calculation of the loss term in (3.12). By defining

        SV(β) ≡ {(i, j) | (i, j) ∈ P, 1 − (Qβ)_i + (Qβ)_j > 0},

the loss term can be written as

        Σ_{(i,j)∈SV(β)} (1 − (Qβ)_i + (Qβ)_j) = Σ_{i=1}^{l} (l_i^+(β) − l_i^−(β))(Qβ)_i + Σ_{i=1}^{l} l_i^−(β),

where

(3.17)  l_i^+(β) ≡ |{j | (j, i) ∈ SV(β)}|,
(3.18)  l_i^−(β) ≡ |{j | (i, j) ∈ SV(β)}|.

We can then use order-statistic trees to compute l_i^+(β) and l_i^−(β) in O(l log l) time; see details in [1, 14]. The calculation of the sub-gradient (for L1 loss), and of the gradient as well as Hessian-vector products (for L2 loss), is similar. We give details in Appendix B.

The major difference between the loss term of linear and nonlinear rankSVM is that the computation of wᵀx_i only costs O(n), while obtaining (Qβ)_i requires l kernel evaluations, which usually amount to O(ln) time. However, the O(ln) cost can be reduced to O(l) if Q is maintained throughout the optimization algorithm.

An important advantage of our method is that it is very general. Almost all unconstrained optimization methods can be applied.
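The identity above is easy to verify numerically. The following sketch computes l_i^+(β) and l_i^−(β) with a direct O(p) pass over SV(β) and checks that the rearranged summation equals the plain pairwise one. The order-statistic tree method of [1, 14] produces the same counts in O(l log l); this sketch (ours) only mirrors the definitions (3.17)-(3.18).

```python
import numpy as np

def loss_counts(Qb, pairs):
    """Direct O(p) computation of l_i^+(beta) and l_i^-(beta) in (3.17)-(3.18)."""
    l = len(Qb)
    l_plus = np.zeros(l, dtype=int)
    l_minus = np.zeros(l, dtype=int)
    for (i, j) in pairs:
        if 1.0 - Qb[i] + Qb[j] > 0:      # (i, j) is in SV(beta)
            l_minus[i] += 1
            l_plus[j] += 1
    return l_plus, l_minus

def l1_loss_via_counts(Qb, pairs):
    l_plus, l_minus = loss_counts(Qb, pairs)
    return ((l_plus - l_minus) * Qb).sum() + l_minus.sum()

# Check against the plain pairwise sum on a toy query.
rng = np.random.default_rng(0)
Qb = rng.normal(size=6)                       # plays the role of Q beta
y = np.array([2, 2, 1, 1, 0, 0])
pairs = [(i, j) for i in range(6) for j in range(6) if y[i] > y[j]]
direct = sum(max(0.0, 1.0 - Qb[i] + Qb[j]) for (i, j) in pairs)
assert np.isclose(direct, l1_loss_via_counts(Qb, pairs))
```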
3.3 Implementation Issues and Discussion  Every time we evaluate the function value of (3.12), we need to conduct kernel evaluations. However, because Q is fixed regardless of the value of β, it can be calculated and stored if space permits. Unfortunately, for large problems, storing the whole Q may not be possible because Q is a dense matrix. The same situation has occurred in SVM training, where the main solution is the decomposition method [9, 20, 3]. This method needs few kernel columns at each iteration and allocates the available memory to store recently used columns (called the kernel cache). The viability of decomposition methods relies on the O(l) cost per iteration if the needed kernel columns are in the cache, and O(ln) if some kernel columns must be calculated. If similar decomposition methods are applied to rankSVM, the same property may not hold because calculating the sum of loss terms may become dominant by taking O(l log l) cost. Alternatively, because (3.12) and (3.13) are unconstrained, we can apply any general optimization method. Usually such methods must calculate the product between Q and β. We can split Q into several blocks and store a fixed portion in memory; the other blocks are computed when needed. Each block of Q can be efficiently obtained if the data set is dense and optimized numerical linear algebra subprograms (e.g., optimized BLAS [25]) are employed.

Our method may also provide an efficient alternative to train linear rankSVM when l ≪ n. Note that (3.12)-(3.13) involve a vector variable β of size l. In contrast, most existing methods train linear rankSVM by optimizing w, which has n variables. Therefore, if l ≪ n, using (3.12)-(3.13) may be superior to (3.11) because of the smaller number of variables. Because kernels are not used,

        Q = XXᵀ,

where X ≡ [x_1, ..., x_l]ᵀ. We can then easily conduct some operations without storing Q. For example, Qβ can be calculated by

        Qβ = X(Xᵀβ).

In linear rankSVM, using only a subset of the pairs in P may reduce the training time significantly while slightly trading off performance [18]. Interestingly, this approach may not be useful for kernel rankSVM. For linear rankSVM, the dominant cost is to evaluate the summation over all preference pairs, so reducing the number of pairs can significantly reduce the cost. In contrast, the bottleneck for kernel rankSVM is calculating Qβ. The O(l²) or even O(l²n) cost is much larger than the O(l log l) cost of the loss term. Therefore, kernel rankSVM shares the same bottleneck with kernel SVM and support vector regression (SVR) [2, 6]. In this regard, kernel rankSVM may not be more expensive than kernel SVR, which is also applicable to learning to rank.⁶ Contrarily, training linear rankSVM may cost more than linear SVR, which does not require the summation over all pairs.

⁶This is called the pointwise approach because it approximates the relevance level of each individual training point.
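For the linear-kernel special case just described, the matrix-free product is straightforward. The sketch below (ours) computes Qβ as X(Xᵀβ) so that the l×l matrix Q is never formed; on small data one can verify it against the explicit product (X Xᵀ)β.

```python
import numpy as np

def Qbeta_linear(X, beta):
    # With the linear kernel, Q = X X^T, so Q beta = X (X^T beta).
    # Cost is O(ln) and the l x l matrix Q is never stored.
    return X @ (X.T @ beta)
```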
4 Empirical Evaluation

We first implement a truncated Newton method to minimize (3.13). Next, we compare this implementation with the existing methods described in Section 2.

4.1 An Implementation of the Proposed Method  We consider (3.13), which is derived from L2-regularized L2-loss rankSVM. An advantage of using (3.13) rather than (3.12) is that (3.13) is differentiable. We then apply a type of truncated Newton method called the trust region Newton method (TRON) [15] to minimize (3.13). Past uses of TRON include logistic regression and linear SVM [16], and linear rankSVM [14].

At the t-th iteration with iterate βᵗ, TRON finds a truncated Newton step vᵗ by approximately solving the linear system

(4.19)  ∇²f(βᵗ) vᵗ = −∇f(βᵗ),

where we use f(β) to denote the objective function of (3.13). Note that f(β) is not twice differentiable, so following [19] we consider a generalized Hessian; see (B.3) in Appendix B for details. To approximately solve (4.19), TRON applies conjugate gradient (CG) methods that conduct a sequence of Hessian-vector products. We have discussed in Section 3.2 and Appendix B the efficient calculation of Hessian-vector products and function/gradient values.

As a truncated Newton method, TRON confines vᵗ to be within a region that we trust. Then βᵗ is updated by

        βᵗ⁺¹ = βᵗ + vᵗ   if the function value sufficiently decreases,
        βᵗ⁺¹ = βᵗ        otherwise.

If we fail to update βᵗ, the trust region is resized so that we search for vᵗ in a smaller region around βᵗ. In contrast, if the function value sufficiently decreases, we enlarge the trust region. We omit other details of TRON because it is now a common optimization method for linear classification. More details can be found in, for example, [16].

4.2 Data Sets and Evaluation Criteria  We consider data from LETOR 4.0 [21]. We take three sets: MQ2007, MQ2008 and MQ2007-list. Each set consists of five folds, and each fold contains its own training, validation and testing data. We take the first fold of each set. Because MQ2007-list is too large for some methods in the comparison, we sub-sample queries that contain 5% of the instances. Note that the feature values are already in the range [0, 1]; therefore, no scaling is conducted. The details of the data sets are listed in Table 1.

Table 1: The details of the first fold of each data set. l is the number of training instances. n is the number of features. k is the number of relevance levels. |S| is the number of queries. p is the number of preference pairs. MQ2007-list 5% is sub-sampled from MQ2007-list.

Data set        | l       | n  | k     | |S|   | p
MQ2007          | 42,158  | 46 | 3     | 1,017 | 246,015
MQ2008          | 9,360   | 46 | 3     | 471   | 52,325
MQ2007-list     | 743,790 | 46 | 1,268 | 1,017 | 285,943,893
MQ2007-list 5%  | 37,251  | 46 | 1,128 | 52    | 14,090,798

To evaluate the prediction performance, we first consider pairwise accuracy because it is directly related to the loss term of rankSVM:

        Pairwise Accuracy ≡ |{(i, j) ∈ P : wᵀx_i > wᵀx_j}| / p.

We also consider normalized discounted cumulative gain (NDCG), which is often used in information retrieval [8]. We follow LETOR 4.0 to use mean NDCG, defined below. Assume k is a pre-specified positive integer, π is an ideal ordering such that

        y_{π(1)} ≥ y_{π(2)} ≥ ... ≥ y_{π(l_q)},   ∀q ∈ S,

where l_q is the number of instances in query q, and π̄ is the ordering being evaluated. Then

        NDCG@m ≡ I_m⁻¹ Σ_{i=1}^{m} (2^{y_{π̄(i)}} − 1) · d(i)   and   Mean NDCG ≡ (Σ_{m=1}^{l_q} NDCG@m) / l_q,

where

        I_m ≡ Σ_{i=1}^{m} (2^{y_{π(i)}} − 1) · d(i)   and   d(i) ≡ 1 / log₂(max(2, i)).

After obtaining the mean NDCG of each query, we take the average.
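The mean NDCG above is simple to compute once the predicted and ideal orderings are fixed. The following sketch is our own reading of the definition for a single query (not LETOR's official evaluation script); queries whose ideal gain I_m is zero are given NDCG@m = 0 here, which is one possible convention.

```python
import numpy as np

def mean_ndcg(y_true, scores):
    """Mean NDCG of one query: the average of NDCG@m over m = 1..l_q."""
    y_true = np.asarray(y_true, dtype=float)
    scores = np.asarray(scores, dtype=float)
    d = 1.0 / np.log2(np.maximum(2, np.arange(1, len(y_true) + 1)))
    gain_pred = (2.0 ** y_true[np.argsort(-scores)] - 1.0) * d   # ordering being evaluated
    gain_ideal = (2.0 ** np.sort(y_true)[::-1] - 1.0) * d        # ideal ordering
    dcg = np.cumsum(gain_pred)      # DCG@m for every m
    idcg = np.cumsum(gain_ideal)    # I_m for every m
    ndcg = np.divide(dcg, idcg, out=np.zeros_like(dcg), where=idcg > 0)
    return ndcg.mean()

# Example: relevance labels of one query and the scores w^T x_i of a model.
print(mean_ndcg([2, 1, 0, 1], [0.3, 0.9, 0.1, 0.2]))
```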

4.3 Settings for Experiments  We compare the proposed method with the following methods.
• SVMlight: discussed in Section 2.1; it solves the dual problem of rankSVM.
• SVMrank: this method, discussed in Section 2.2, solves an equivalent 1-slack structural SVM problem.
• RV-SVM: it implements the method discussed in Section 2.3 by using CPLEX⁷ to solve the linear programming problem. Note that CPLEX supports parallel computing, but the extra memory requirement is beyond our machine's capacity. Hence we run CPLEX using only one thread.

⁷http://www-01.ibm.com/software/commerce/optimization/cplex-optimizer/

We consider the Radial Basis Function (RBF) kernel

        K(x_i, x_j) = exp(−γ‖x_i − x_j‖)

and conduct the experiments on a 64-bit machine with an Intel Xeon 2.0GHz (E5504) processor, 1 MB cache and 32GB memory.

For parameter selection, we must decide C and γ, where C is the regularization parameter in (1.2) and γ is the kernel parameter of the RBF kernel. Because the compared approaches solve problems with different loss or regularization terms, parameter selection is conducted for each method and each evaluation criterion. We search on a grid of (C, γ) values, where C ∈ {2⁻⁵, 2⁻⁴, ..., 2⁵} and γ ∈ {2⁻⁶, 2⁻⁵, ..., 2⁻¹}, and choose the one that gives the best prediction performance on the validation set. The best parameter (C, γ) for each method and criterion is listed in the supplementary materials. Note that RV-SVM failed to run on MQ2007 and MQ2007-list because the constraint matrix is too large to fit in memory; thus no best parameters are presented for it.

We mentioned that kernel evaluations are a bottleneck in training rankSVM. To reduce repeated kernel evaluations, our implementation stores the full kernel matrix Q, SVMlight caches the part of its kernel matrix that can fit in memory, and SVMrank stores the full kernel matrix of the dual of the sub-problem of (2.8). As for RV-SVM, although it does not store Q, it only needs 2lp kernel evaluations during the training procedure for computing the coefficient of each variable in the constraints of (2.10).

[Figure 1: Experimental results on MQ2007. (a) Pairwise accuracy versus training time (sec., log scale); (b) Mean NDCG versus training time. Curves shown for SVMlight, TRON (proposed), and SVMrank.]

4.4 Comparison Results  We record the prediction performance at every iteration of TRON and SVMrank, and at every 50 iterations of RV-SVM and SVMlight. Figures 1-3 present the relation between the training time and the test performance.

The compared methods solve optimization problems with different regularization and loss terms, so their final pairwise accuracy or NDCG may be slightly different. Instead, the goal of the comparison is to check, for each method, how fast its optimization procedure converges. From Figures 1-3, it is clear that the proposed method very quickly achieves the performance of the final optimal solution. Because the training time is log-scaled, it is much faster than the others. The package SVMrank comes second, although in the figures its performance has not yet stabilized at the end. For SVMlight, the performance is unstable because the optimization problem has not been accurately solved; for smaller data sets, we did observe convergence to the final performance if enough running time is spent. For RV-SVM, we have mentioned that it can handle only the smaller problem MQ2008.

[Figure 2: Experimental results on MQ2008. (a) Pairwise accuracy versus training time (sec., log scale); (b) Mean NDCG versus training time. Curves shown for SVMlight, TRON (proposed), RV-SVM, and SVMrank.]

[Figure 3: Experimental results on MQ2007-list: pairwise accuracy versus training time (sec., log scale) for SVMlight, TRON (proposed), and SVMrank. Mean NDCG is not available for MQ2007-list because of the overflow of 2^{y_{π̄(i)}} caused by the large k.]

5 Conclusions

In this paper, we propose a method for efficiently training rankSVM. The number of variables is reduced from quadratic to linear in the number of training instances. Efficient methods to handle the large number of preference pairs are incorporated. Empirical comparisons show that the training time is reduced by orders of magnitude when compared to state-of-the-art methods. Although only one loss function is used in our implementation for the experiments because of the lack of space, our method is applicable to a variety of loss functions and different optimization methods. The proposed approach makes kernel rankSVM a practically feasible model. The programs used for our experiments are available at http://www.csie.ntu.edu.tw/~cjlin/papers/ranksvm/kernel.tar.gz and a package based on our study will be released soon.

References

[1] A. Airola, T. Pahikkala, and T. Salakoski. Training linear ranking SVMs in linearithmic time using red-black trees. Pattern Recognition Letters, 32(9):1328-1336, 2011.
[2] B. E. Boser, I. Guyon, and V. Vapnik. A training algorithm for optimal margin classifiers. In COLT, 1992.
[3] C.-C. Chang and C.-J. Lin. LIBSVM: A library for support vector machines. ACM TIST, 2:27:1-27:27, 2011.
[4] O. Chapelle. Training a support vector machine in the primal. Neural Computation, 19(5):1155-1178, 2007.
[5] O. Chapelle and S. S. Keerthi. Efficient algorithms for ranking with SVMs. Information Retrieval, 13(3):201-215, 2010.
[6] C. Cortes and V. Vapnik. Support-vector network. Machine Learning, 20:273-297, 1995.
[7] R. Herbrich, T. Graepel, and K. Obermayer. Large margin rank boundaries for ordinal regression. In P. J. Bartlett, B. Schölkopf, D. Schuurmans, and A. J. Smola, editors, Advances in Large Margin Classifiers, pages 115-132. MIT Press, 2000.
[8] K. Järvelin and J. Kekäläinen. Cumulated gain-based evaluation of IR techniques. ACM Transactions on Information Systems, 20(4):422-446, 2002.
[9] T. Joachims. Making large-scale SVM learning practical. In Advances in Kernel Methods - Support Vector Learning. MIT Press, 1998.
[10] T. Joachims. Optimizing search engines using clickthrough data. In ACM KDD, 2002.
[11] T. Joachims. Training linear SVMs in linear time. In ACM KDD, 2006.
[12] T. Joachims, T. Finley, and C.-N. J. Yu. Cutting-plane training of structural SVMs. Machine Learning, 77, 2009.
[13] G. S. Kimeldorf and G. Wahba. A correspondence between Bayesian estimation on stochastic processes and smoothing by splines. The Annals of Mathematical Statistics, pages 495-502, 1970.
[14] C.-P. Lee and C.-J. Lin. Large-scale linear rankSVM. Technical report, National Taiwan University, 2013.
[15] C.-J. Lin and J. J. Moré. Newton's method for large-scale bound constrained problems. SIAM Journal on Optimization, 9:1100-1127, 1999.
[16] C.-J. Lin, R. C. Weng, and S. S. Keerthi. Trust region Newton method for large-scale logistic regression. JMLR, 9:627-650, 2008.
[17] K.-M. Lin and C.-J. Lin. A study on reduced support vector machines. IEEE TNN, 14(6):1449-1559, 2003.
[18] K.-Y. Lin. Data selection techniques for large-scale rankSVM. Master's thesis, National Taiwan University, 2010.
[19] O. L. Mangasarian. A finite Newton method for classification. Optimization Methods and Software, 17(5):913-929, 2002.
[20] J. C. Platt. Fast training of support vector machines using sequential minimal optimization. In B. Schölkopf, C. J. C. Burges, and A. J. Smola, editors, Advances in Kernel Methods - Support Vector Learning, Cambridge, MA, 1998. MIT Press.
[21] T. Qin, T.-Y. Liu, J. Xu, and H. Li. LETOR: A benchmark collection for research on learning to rank for information retrieval. Information Retrieval, 13(4):346-374, 2010.
[22] B. Schölkopf, R. Herbrich, and A. J. Smola. A generalized representer theorem. In COLT, 2001.
[23] D. Sculley. Large scale learning to rank. In NIPS 2009 Workshop on Advances in Ranking, 2009.
[24] I. Tsochantaridis, T. Joachims, T. Hofmann, and Y. Altun. Large margin methods for structured and interdependent output variables. JMLR, 2005.
[25] R. C. Whaley, A. Petitet, and J. J. Dongarra. Automatically tuned linear algebra software and the ATLAS project. Technical report, Department of Computer Sciences, University of Tennessee, 2000.
[26] H. Yu, J. Kim, Y. Kim, S. Hwang, and Y. H. Lee. An efficient method for learning nonlinear ranking SVM functions. Information Sciences, 209:37-48, 2012.
[27] G.-X. Yuan, C.-H. Ho, and C.-J. Lin. Recent advances of large-scale linear classification. PIEEE, 100(9):2584-2603, 2012.

A Optimality of w Defined in (3.16)

Assume α* and β̂* are optimal for (2.4) and (3.15), respectively. By the strong duality of rankSVM and the feasibility of α* for the unconstrained problem (3.15), we have

        optimal value of (1.2)
        = (1/2)(α*)ᵀQ̂α* + C Σ_{(i,j)∈P} max(0, 1 − (Q̂α*)_{i,j})
        ≥ (1/2)(β̂*)ᵀQ̂β̂* + C Σ_{(i,j)∈P} max(0, 1 − (Q̂β̂*)_{i,j})
        ≥ optimal value of (1.2).

The last inequality is from the fact that any w constructed by (3.16) is feasible for (1.2). Thus, the above inequalities are in fact equalities, so the proof is complete. Further, the above equation implies that any dual optimal solution α* is also optimal for (3.15).

B Calculation of First and Second Order Information of (3.12) and (3.13)

Assume Q_i is the i-th column of Q. One sub-gradient of (3.12)'s objective function is

        Qβ − C Σ_{(i,j)∈SV(β)} (Q_i − Q_j)
        = Qβ + C Σ_{i=1}^{l} ( Σ_{j:(j,i)∈SV(β)} Q_i − Σ_{j:(i,j)∈SV(β)} Q_i )
        = Qβ + C Σ_{i=1}^{l} (l_i^+(β) − l_i^−(β)) Q_i.

For (3.13), which uses the L2 loss, the computation is slightly more complicated. We define

        p_β ≡ |SV(β)|,

and let A_β ∈ R^{p_β×l} include A's rows corresponding to SV(β). That is, if (i, j) ∈ SV(β), then the (i, j)th row of A is selected. The objective function of (3.13) can then be written as

        f(β) ≡ (1/2) βᵀQβ + C (A_β Qβ − e_β)ᵀ(A_β Qβ − e_β)
(B.1)        = (1/2) βᵀQβ + C βᵀQ(A_βᵀA_β Qβ − 2A_βᵀe_β) + C p_β,

where e_β ∈ R^{p_β} is a vector of ones. Its gradient is

        Qβ − 2C Σ_{(i,j)∈P} (Q_i − Q_j) max(0, 1 − (Qβ)_i + (Qβ)_j)
(B.2)   = Qβ + 2C Q(A_βᵀA_β Qβ − A_βᵀe_β).

Because f(β) is not twice differentiable, ∇²f(β) does not exist. We follow [19] and [16] to define a generalized Hessian matrix

        ∇²f(β) ≡ Q + 2C Q A_βᵀA_β Q.

Because this matrix may be too large to be stored, some optimization methods employ Hessian-free techniques so that only Hessian-vector products are needed. For any vector v ∈ R^l, we have

(B.3)   ∇²f(β) v = Qv + 2C Q A_βᵀA_β Qv.

We notice that (B.1)-(B.3) share some common terms, which can be computed by the same method. For A_βᵀe_β and p_β needed in (B.1) and (B.2),

        A_βᵀe_β = ( l_1^−(β) − l_1^+(β), ..., l_l^−(β) − l_l^+(β) )ᵀ   and   p_β = Σ_{i=1}^{l} l_i^−(β),

where l_i^+(β) and l_i^−(β) are defined in (3.17) and (3.18).

Next we calculate A_βᵀA_β Qβ and A_βᵀA_β Qv. From

        (A_βᵀA_β)_{i,j} = Σ_s (A_βᵀ)_{i,s}(A_β)_{s,j} = Σ_s (A_β)_{s,i}(A_β)_{s,j},

and the fact that each row of A_β only contains two non-zero elements, we have

        (A_βᵀA_β)_{i,j} = l_i^+(β) + l_i^−(β)   if i = j,
                          −1                     if i ≠ j and (i, j) or (j, i) ∈ SV(β),
                          0                      otherwise.

Consequently,

        (A_βᵀA_β Qv)_i = Σ_{j=1}^{l} (A_βᵀA_β)_{i,j} (Qv)_j
                        = (l_i^+(β) + l_i^−(β))(Qv)_i − Σ_{j:(j,i) or (i,j)∈SV(β)} (Qv)_j.

Finally we have

        Q A_βᵀA_β Qv = Q [ (l_1^+(β) + l_1^−(β))(Qv)_1 − γ_1^+(β, v) − γ_1^−(β, v) ; ... ; (l_l^+(β) + l_l^−(β))(Qv)_l − γ_l^+(β, v) − γ_l^−(β, v) ],

where

        γ_i^+(β, v) ≡ Σ_{j:(j,i)∈SV(β)} (Qv)_j,
        γ_i^−(β, v) ≡ Σ_{j:(i,j)∈SV(β)} (Qv)_j.

It has been shown in [14] that γ_i^+(β, v) and γ_i^−(β, v) can be calculated by the same order-statistic tree technique used for l_i^+(β) and l_i^−(β).
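To make (B.3) concrete, the sketch below assembles the generalized Hessian-vector product with a direct O(p) pass over SV(β). It is our own reference implementation of the formulas above (A_β is never formed explicitly); the actual solver instead obtains l_i^±(β) and γ_i^±(β, v) in O(l log l) with order-statistic trees [14].

```python
import numpy as np

def hessian_vector_product(Q, beta, v, pairs, C):
    """Generalized Hessian-vector product (B.3): Qv + 2C Q A_beta^T A_beta Q v."""
    Qb = Q @ beta
    Qv = Q @ v
    l = len(beta)
    counts = np.zeros(l)   # l_i^+(beta) + l_i^-(beta)
    gamma = np.zeros(l)    # gamma_i^+(beta, v) + gamma_i^-(beta, v)
    for (i, j) in pairs:
        if 1.0 - Qb[i] + Qb[j] > 0:     # (i, j) in SV(beta)
            counts[i] += 1
            counts[j] += 1
            gamma[i] += Qv[j]
            gamma[j] += Qv[i]
    # Componentwise assembly of A_beta^T A_beta Q v from the counts above.
    AtAQv = counts * Qv - gamma
    return Qv + 2.0 * C * (Q @ AtAQv)
```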