
Large-scale RankSVM

Tzu-Ming Kuo∗ Ching-Pei Lee† Chih-Jen Lin‡

∗Department of Computer Science, National Taiwan University. [email protected]
†Department of Computer Science, National Taiwan University. [email protected]
‡Department of Computer Science, National Taiwan University. [email protected]

Abstract

Learning to rank is an important task for recommendation systems, online advertisement and web search. Among those learning to rank methods, rankSVM is a widely used model. Both linear and nonlinear (kernel) rankSVM have been extensively studied, but the lengthy training time of kernel rankSVM remains a challenging issue. In this paper, after discussing difficulties of training kernel rankSVM, we propose an efficient method to handle these problems. The idea is to reduce the number of variables from quadratic to linear with respect to the number of training instances, and to efficiently evaluate the pairwise losses. Our setting is applicable to a variety of loss functions. Further, general optimization methods can be easily applied to solve the reformulated problem. Implementation issues are also carefully considered. Experiments show that our method is faster than state-of-the-art methods for training kernel rankSVM.

Keywords: Kernel method, Learning to rank, Support vector machines, Large-margin method

1 Introduction

Being heavily applied in recommendation systems, online advertisements and web search in recent years, learning to rank gains more and more importance. Among existing approaches for learning to rank, rankSVM [7] is a commonly used method extended from the popular support vector machine (SVM) [2, 6] for data classification.

In SVM literature, it is known that linear (i.e., data are not mapped to a different space) and kernel SVMs are suitable for different scenarios, where linear SVM is more efficient, but the more costly kernel SVM may give higher accuracy.¹ The same situation may also occur in rankSVM because it can be viewed as a special case of SVM; see more details in Section 2.1. Because of the lower training cost, linear rankSVM has been extensively studied and efficient algorithms have been proposed [11, 23, 5, 1, 14]. However, for some tasks the feature set is not rich enough, so nonlinear methods may be needed. Therefore, it is important to develop efficient training methods for large kernel rankSVM.

¹See, for example, [27] for detailed discussion.

Assume we are given a set of training label-query-instance tuples (y_i, q_i, x_i), y_i ∈ R, q_i ∈ S ⊂ Z, x_i ∈ R^n, i = 1, ..., l, where S is the set of queries. By defining the set of preference pairs as

(1.1)   P ≡ {(i, j) | q_i = q_j, y_i > y_j}   with   p ≡ |P|,

rankSVM [10] solves

(1.2)   min_{w,ξ}   (1/2) wᵀw + C Σ_{(i,j)∈P} ξ_{i,j}
        subject to  wᵀ(φ(x_i) − φ(x_j)) ≥ 1 − ξ_{i,j},
                    ξ_{i,j} ≥ 0,   ∀(i, j) ∈ P,

where C > 0 is the regularization parameter and φ is a function mapping data to a higher dimensional space. The loss term ξ_{i,j} in (1.2) is called L1 loss. If it is replaced by ξ_{i,j}², we have L2 loss. The idea behind rankSVM is to learn w such that

        wᵀφ(x_i) > wᵀφ(x_j),   if (i, j) ∈ P.

A challenge in training rankSVM is to handle the possibly large number of preference pairs, because p can be as large as O(l²).

In contrast to linear rankSVM, which can directly minimize over a finite vector w, the difficulty of solving (1.2) lies in the high and possibly infinite dimensionality of w after the data mapping. Existing studies have proposed different methods to solve (1.2) through kernel techniques. The work [7] viewed (1.2) as a special case of SVM, so standard training methods that solve the SVM dual problem can be applied. However, the dual problem of p variables can become very large if p = O(l²). Joachims [11] reformulated rankSVM as a 1-slack structural SVM problem and considered a cutting-plane method. Although [11] only experiments with linear rankSVM, this approach is applicable to kernel rankSVM. However, cutting-plane methods may suffer from slow convergence. The work [26], inspired by a generalized representer theorem [22], represents w as a linear combination of the training instances. It then reformulates and modifies (1.2) to a linear programming problem. However, the resulting problem is still large because of O(p) variables as well as constraints.

In this paper, a method for efficiently training nonlinear rankSVM is proposed. Our approach solves (1.2) by following [26] to represent w as a linear combination of mapped training instances. In contrast to the linear programming problem in [26], we rigorously obtain an l-variable optimization problem that is equivalent to (1.2). A method for efficiently computing the loss term and its sub-gradient or gradient without going through all the p pairs is then introduced by modifying the order-statistic trees technique from linear rankSVM [1, 14]. Our approach allows the flexible use of various unconstrained optimization methods for training kernel rankSVM efficiently. We present an implementation that is experimentally faster than state-of-the-art methods.

This paper is organized as follows. In Section 2, we discuss previous studies on kernel rankSVM in detail. By noticing their shortcomings, an efficient method is then proposed in Section 3. An implementation with comprehensive comparisons to existing works is described in Section 4. Section 5 concludes this paper.
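To make the notation concrete, the following sketch enumerates the preference pairs of (1.1) from arrays of labels and query identifiers. It is only an illustration of the definition (the function name and the NumPy-based setting are ours, not part of the proposed implementation); it also makes visible how p grows quadratically with the number of instances per query.

```python
import numpy as np

def preference_pairs(y, q):
    """Enumerate P = {(i, j) | q_i = q_j, y_i > y_j} of (1.1).

    y: relevance labels, shape (l,); q: query ids, shape (l,).
    Returns a list of index pairs; p = len(pairs) can be O(l^2).
    """
    pairs = []
    l = len(y)
    for i in range(l):
        for j in range(l):
            if q[i] == q[j] and y[i] > y[j]:
                pairs.append((i, j))
    return pairs

# Toy example: one query with labels 2, 1, 0 gives the pairs (0,1), (0,2), (1,2).
y = np.array([2, 1, 0])
q = np.array([1, 1, 1])
print(preference_pairs(y, q))
```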

2 Existing Methods

We introduce three existing methods for training nonlinear rankSVM. The first treats rankSVM as an SVM problem. The second solves rankSVM under a structural SVM framework. The last approach is inspired by a generalized representer theorem and optimizes an alternative linear programming problem.

2.1 SVM Approach  Given a set of label-instance pairs (y_i, x_i), x_i ∈ R^n, y_i ∈ {1, −1}, i = 1, ..., l, SVM solves the following optimization problem.

(2.3)   min_{w,ξ̄}   (1/2) wᵀw + C Σ_{i=1}^{l} ξ̄_i
        subject to  y_i wᵀφ(x_i) ≥ 1 − ξ̄_i,
                    ξ̄_i ≥ 0,   i = 1, ..., l.

Clearly, if we define

        y_{i,j} ≡ 1   and   φ_{i,j} ≡ φ(x_i) − φ(x_j),   ∀(i, j) ∈ P,

then (1.2) is in the form of (2.3) with p instances [7]. One can then apply any SVM solver to solve (1.2).

To handle the high dimensionality of φ(x_i) and w, it is common to solve the dual problem of (2.3) with the help of kernel tricks [6]. To adopt the same technique for rankSVM, the dual problem of (1.2) is as follows.

(2.4)   min_α   (1/2) αᵀQ̂α − eᵀα
        subject to  0 ≤ α_{i,j} ≤ C,   ∀(i, j) ∈ P,

where α ∈ R^p is indexed by pairs in P, e ∈ R^p is a vector of ones, and

(2.5)   Q̂_{(i,j),(u,v)} = φ_{i,j}ᵀφ_{u,v},   ∀(i, j), (u, v) ∈ P

is a p by p symmetric matrix. For example, [10] solves (2.4) using the SVM package SVMlight [9].

Problem (2.4) is a large quadratic programming problem because the number of variables can be up to O(l²). From the primal-dual relationship, optimal w and α satisfy

(2.6)   w ≡ Σ_{(i,j)∈P} α_{i,j} φ_{i,j}.

Notice that because φ(x) may be infinite dimensional, (2.5) is calculated by the kernel function K(·, ·):

        Q̂_{(i,j),(u,v)} = K(x_i, x_u) + K(x_j, x_v) − K(x_i, x_v) − K(x_j, x_u),

where

        K(x_i, x_j) = φ(x_i)ᵀφ(x_j).

Although directly computing the matrix Q̂ requires O(l⁴) kernel evaluations, we can take the following special structure of Q̂ to save the cost.

(2.7)   Q̂ = A Q Aᵀ,

where Q ∈ R^{l×l} with Q_{i,j} = K(x_i, x_j), and A ∈ R^{p×l} is defined as follows:

                         column i        column j
        A ≡  row (i,j): ( 0 ⋯ 0   +1   0 ⋯ 0   −1   0 ⋯ 0 ).

That is, if (i, j) ∈ P, then the corresponding row of A has its i-th entry equal to 1, its j-th entry equal to −1, and all other entries equal to zero. Hence, computing Q̂ requires O(l²) kernel evaluations. However, the difficulty of having O(l²) variables remains, so solving (2.4) has not been a viable approach for large kernel rankSVM.
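The structure in (2.7) is easy to exploit in practice. The sketch below (our own illustration with an RBF kernel; not code from any of the packages discussed here) forms Q with O(l²) kernel evaluations, builds A from the pair list, and obtains Q̂ = A Q Aᵀ without any further kernel evaluation. For readability A is stored densely; in an actual implementation it would be kept sparse, since each row has only two non-zeros.

```python
import numpy as np
from scipy.spatial.distance import cdist

def kernel_matrix(X, gamma):
    # Q_{i,j} = K(x_i, x_j); assumed RBF form exp(-gamma * ||x_i - x_j||).
    return np.exp(-gamma * cdist(X, X))

def pair_matrix(pairs, l):
    # A in (2.7): one row per preference pair, +1 at column i and -1 at column j.
    A = np.zeros((len(pairs), l))
    for r, (i, j) in enumerate(pairs):
        A[r, i] = 1.0
        A[r, j] = -1.0
    return A

def Q_hat(X, pairs, gamma):
    Q = kernel_matrix(X, gamma)           # O(l^2) kernel evaluations
    A = pair_matrix(pairs, len(X))
    return A @ Q @ A.T                    # p x p, so affordable only for small p
```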

2.2 Structural SVM Approach  To avoid the difficulty of O(l²) variables in the dual problem, Joachims showed that (1.2) is equivalent to the following 1-slack structural SVM problem [11].

(2.8)   min_{w,ξ}   (1/2) wᵀw + Cξ
        subject to  wᵀ Σ_{(i,j)∈P} c_{i,j} φ_{i,j} ≥ Σ_{(i,j)∈P} c_{i,j} − ξ,
                    c_{i,j} ∈ {0, 1},   ∀(i, j) ∈ P.

He then considered a cutting-plane method to solve the dual problem of (2.8). Because of the 2^p constraints in (2.8), the corresponding dual problem contains the same number of variables. While optimizing (2.8) seems to be more difficult at first glance, it is shown that the optimal dual solution is sparse with a small number of non-zero elements, and this number is independent of p. At each iteration, their cutting-plane method optimizes a sub-problem of the dual problem of (2.8). The sub-problem consists of the variables in a small working set, which is empty at the beginning of the optimization procedure. After the sub-problem is solved, the variable corresponding to the most violated constraint of (2.8) is added to the working set. This approach thus avoids unnecessary computations involving variables that are zero at the optimum [12]. An efficient algorithm is also proposed to evaluate ξ and decide which variable is added to the working set at each iteration. Although [11] only discussed the case when kernels are not used, the method can be easily extended to kernel rankSVM. Based on the structural SVM package SVMstruct [24, 12], a package SVMrank that can solve both linear and kernel rankSVM by a cutting-plane method is released.²

²http://www.cs.cornell.edu/People/tj/svm_light/svm_rank.html. Note that, as stated on the website, the method for computing the loss term in this package has a higher complexity than the algorithm proposed in [11].

However, empirical examinations show that in the linear case, solving rankSVM under the structural SVM framework using a cutting-plane method converges more slowly than other state-of-the-art methods that directly solve (1.2) [14].

2.3 Reformulation from Representer Theorem  To address the difficulty of optimizing the dual problem with O(l²) variables, [26] considered solving the primal problem. Although the number of variables may be infinite after the data mapping, they applied a generalized representer theorem [22] to show that the optimal w is a linear combination of the training data with coefficients β:³

(2.9)   w ≡ Σ_{i=1}^{l} β_i φ(x_i).

Therefore, wᵀφ_{i,j} can be computed by

        wᵀφ_{i,j} = Σ_{m=1}^{l} β_m (K(x_i, x_m) − K(x_j, x_m)) = (Qβ)_i − (Qβ)_j.

To reduce the number of nonzero β_i, [26] considered a regularization term eᵀβ with β being nonnegative, and obtained the following linear programming problem.

(2.10)  min_{β,ξ}   eᵀβ + C eᵀξ
        subject to  (Qβ)_i − (Qβ)_j ≥ 1 − ξ_{i,j},
                    ξ_{i,j} ≥ 0,   ∀(i, j) ∈ P,
                    β_i ≥ 0,   i = 1, ..., l.

A package RV-SVM⁴ is released, but this approach has the following problems. First, in the representer theorem the coefficients β are unconstrained. To enhance sparsity, they added the nonnegativity constraints on β. Thus the setting does not coincide with the theorem being used. Second, they have the regularization term eᵀβ, which, after using (2.9), is equivalent to neither ‖w‖₁ (L1 regularization) nor wᵀw/2 (L2 regularization). Therefore, after solving (2.10), we cannot use (2.9) to obtain the optimal w of the original problem. This situation undermines the interpretability of (2.10). Third, without noticing (2.7), they claimed in [26] that the number of kernel evaluations required by solving the dual problem (2.4) is O(l⁴), while theirs only requires O(l³). However, as we discussed earlier in (2.7) and from (2.10), if Q can be stored, both approaches require only O(l²) kernel evaluations. Finally, the number of variables ξ in the linear programming problem can still be as large as O(l²). Solving a linear programming problem with this number of variables is expensive, so in the experiments conducted in [26], the data size is less than 1,000 instances.

³A similar idea, using the original representer theorem [13] for L2-regularized L2-loss kernel rankSVM, is briefly mentioned in [5]. However, their focus was linear rankSVM.
⁴https://sites.google.com/site/postechdm/research/implementation/rv-svm.
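The representation (2.9) is also the basis of the method proposed in Section 3, so it is worth seeing how predictions are evaluated under it. The short sketch below (our illustration; the variable names are ours) computes wᵀφ(x_i) for all instances as Qβ and a pairwise margin wᵀφ_{i,j} as (Qβ)_i − (Qβ)_j, without ever forming the mapping φ explicitly.

```python
import numpy as np

def scores(Q, beta):
    # Under (2.9), w^T phi(x_i) = sum_m beta_m K(x_i, x_m) = (Q beta)_i.
    return Q @ beta

def pairwise_margin(Q, beta, i, j):
    # w^T phi_{i,j} = (Q beta)_i - (Q beta)_j.
    s = scores(Q, beta)
    return s[i] - s[j]
```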

3 Methods and Technical Solutions

An advantage of the approach in [26] is that the formulation of w in (2.9) involves only l variables. This is much smaller than the O(l²) variables in the dual rankSVM problem (2.4). However, the O(l²) number of constraints in (2.10) still causes a large linear programming problem. To derive efficient optimization algorithms, in this section we rewrite (1.2) in the following form:

(3.11)  min_w   (1/2) wᵀw + C Σ_{(i,j)∈P} max(0, 1 − wᵀφ_{i,j}),

where the first term is the regularization term, while the second is the loss term. We then incorporate two techniques.
1. Let w involve l variables as in (2.9).
2. Apply efficient techniques to calculate the loss term. In particular, some past developments for linear rankSVM are employed.

3.1 Regularization Term  If w is represented by (2.9), then (1.2) can be written as

(3.12)  min_{β∈R^l}   (1/2) βᵀQβ + C Σ_{(i,j)∈P} max(0, 1 − (Qβ)_i + (Qβ)_j)

and

(3.13)  min_{β∈R^l}   (1/2) βᵀQβ + C Σ_{(i,j)∈P} max(0, 1 − (Qβ)_i + (Qβ)_j)²,

respectively for L1 and L2 losses. A difference between the two problems is that (3.13) is differentiable while (3.12) is not. The small number of l variables is superior to the O(l²) variables in the dual rankSVM problem (2.4). Interestingly, this advantage does not occur for standard SVM. We explain the subtle difference below.

In SVM, the use of formulations like (3.12)-(3.13) has been considered in many places such as [19, 17, 4], where the derivation mainly follows the representer theorem. Take L1 loss as an example. The SVM optimization problem analogous to (3.12) is

(3.14)  min_{β̄}   (1/2) β̄ᵀQ̄β̄ + C Σ_{i=1}^{l} max(0, 1 − (Q̄β̄)_i),

where Q̄_{i,j} = y_i y_j K(x_i, x_j); notice that in SVM problems, y_i ∈ {−1, 1}, ∀i. If ᾱ is the variable of the dual SVM, then both ᾱ and β̄ have l components. However, ᾱ is nonnegative while β̄ is unconstrained. It is proved in [4] that if Q̄ is positive definite, then the optimum is unique and satisfies

        ᾱ_i = y_i β̄_i,   ∀i.

Therefore, (3.14) and the SVM dual are strongly related. For SVM, because (3.14) does not possess significant advantages, most existing packages solve the dual problem. The situation for rankSVM is completely different.

The work [17] derives (3.14) without using the representer theorem. Instead, they directly consider (3.14) and investigate the connection to the SVM dual problem via optimization theory. Following the same setting, the problem to be considered for rankSVM is

(3.15)  min_{β̂∈R^p}   (1/2) β̂ᵀQ̂β̂ + C Σ_{(i,j)∈P} max(0, 1 − (Q̂β̂)_{i,j}).

In Appendix A, we prove that any optimal β̂ leads to an optimal

(3.16)  w = Σ_{(i,j)∈P} β̂_{i,j} φ_{i,j}

of (1.2). This form is the same as (2.6), but a crucial difference is that β̂ is unconstrained. Therefore, we can define

        β ≡ Aᵀβ̂

to simplify (3.15) to the equivalent form (3.12). This discussion explains why we are able to reduce the number of variables from the O(l²) of β̂ to the l of β. From Appendix A, any optimal solution of the dual rankSVM is also optimal for (3.15). Thus, we can say that (3.15) provides a richer set of possible values to construct the optimal w.⁵ Therefore, the simplification from β̂ to β becomes possible.

⁵Note that the optimal w is unique because wᵀw/2 is strictly convex.

We mentioned that for standard SVM, if Q̄ in (3.14) is positive definite, the dual SVM and (3.14) have the same unique solution. For rankSVM, this property requires that Q̂ be positive definite. However, from (2.7) and the fact that each row of (AQ)Aᵀ is a linear combination of Aᵀ's rows,

        rank(Q̂) ≤ rank(Aᵀ) ≤ min(p, l) ≤ l.

Therefore, Q̂ tends to be only positive semi-definite because usually p ≠ l. This result indicates that the connection between (3.15) and the rankSVM dual (2.4) is weaker than that between (3.14) and the SVM dual.
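As a baseline for the reformulation above, the sketch below evaluates the L2-loss objective of (3.13) directly, looping over all pairs in P. It is a reference implementation of ours (useful for checking faster routines); its O(p) loss loop is exactly the cost that Section 3.2 removes.

```python
import numpy as np

def objective_l2(Q, beta, pairs, C):
    """Objective of (3.13): 0.5 * beta^T Q beta + C * sum of squared hinge losses."""
    Qb = Q @ beta
    reg = 0.5 * beta @ Qb
    loss = 0.0
    for (i, j) in pairs:              # O(p) pairs; p can be O(l^2)
        violation = 1.0 - Qb[i] + Qb[j]
        if violation > 0:
            loss += violation ** 2
    return reg + C * loss
```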
3.2 Loss Term  After reformulating to (3.12) or (3.13), the number of variables is reduced to l. However, to calculate the loss term, we still have the summation over p preference pairs. The same difficulty occurs for linear rankSVM, so past works have proposed efficient algorithms. Among them, a recently developed method using order-statistic trees [1, 14] is the fastest, with only O(l log l) cost to compute this summation. For other values commonly used in optimization methods, such as the sub-gradient (if L1 loss is used), the gradient, or Hessian-vector products (if L2 loss is adopted), the cost is also O(l log l). We observe that the loss term of (3.12) is similar to that of linear rankSVM, which is of the form

        Σ_{(i,j)∈P} max(0, 1 − wᵀ(x_i − x_j)).

Thus we can easily see that methods for summing up the p pairs in linear rankSVM are also applicable to nonlinear rankSVM. We give an illustration by showing the calculation of the loss term in (3.12). By defining

        SV(β) ≡ {(i, j) | (i, j) ∈ P, 1 − (Qβ)_i + (Qβ)_j > 0},

the loss term can be written as

        Σ_{(i,j)∈SV(β)} (1 − (Qβ)_i + (Qβ)_j) = Σ_{i=1}^{l} (l_i^+(β) − l_i^−(β))(Qβ)_i + Σ_{i=1}^{l} l_i^−(β),

where

(3.17)  l_i^+(β) ≡ |{j | (j, i) ∈ SV(β)}|,
(3.18)  l_i^−(β) ≡ |{j | (i, j) ∈ SV(β)}|.

We can then use order-statistic trees to compute l_i^+(β) and l_i^−(β) in O(l log l) time; see details in [1, 14]. The calculation of the sub-gradient (for L1 loss), and of the gradient as well as Hessian-vector products (for L2 loss), is similar. We give details in Appendix B.

The major difference between the loss term of linear and nonlinear rankSVM is that the computation of wᵀx_i only costs O(n), while obtaining (Qβ)_i requires l kernel evaluations, which usually amount to O(ln) time. However, the O(ln) cost can be reduced to O(l) if Q is maintained throughout the optimization algorithm.

An important advantage of our method is that it is very general. Almost all unconstrained optimization methods can be applied.
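The identity above is easy to verify numerically. The following sketch computes l_i^+(β) and l_i^−(β) with a direct O(p) pass over SV(β) and checks that the rearranged summation equals the plain pairwise one. The order-statistic tree method of [1, 14] produces the same counts in O(l log l); this sketch (ours) only mirrors the definitions (3.17)-(3.18).

```python
import numpy as np

def loss_counts(Qb, pairs):
    """Direct O(p) computation of l_i^+(beta) and l_i^-(beta) in (3.17)-(3.18)."""
    l = len(Qb)
    l_plus = np.zeros(l, dtype=int)
    l_minus = np.zeros(l, dtype=int)
    for (i, j) in pairs:
        if 1.0 - Qb[i] + Qb[j] > 0:      # (i, j) is in SV(beta)
            l_minus[i] += 1
            l_plus[j] += 1
    return l_plus, l_minus

def l1_loss_via_counts(Qb, pairs):
    l_plus, l_minus = loss_counts(Qb, pairs)
    return ((l_plus - l_minus) * Qb).sum() + l_minus.sum()

# Check against the plain pairwise sum on a toy query.
rng = np.random.default_rng(0)
Qb = rng.normal(size=6)                       # plays the role of Q beta
y = np.array([2, 2, 1, 1, 0, 0])
pairs = [(i, j) for i in range(6) for j in range(6) if y[i] > y[j]]
direct = sum(max(0.0, 1.0 - Qb[i] + Qb[j]) for (i, j) in pairs)
assert np.isclose(direct, l1_loss_via_counts(Qb, pairs))
```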
3.3 Implementation Issues and Discussion  Every time we evaluate the function value of (3.12), we need to conduct kernel evaluations. However, because Q is fixed regardless of the value of β, it can be calculated and stored if space permits. Unfortunately, for large problems, storing the whole Q may not be possible because Q is a dense matrix. The same situation has occurred in SVM training, where the main solution is the decomposition method [9, 20, 3]. This method needs few kernel columns at each iteration and allocates the available memory to store recently used columns (called the kernel cache). The viability of decomposition methods relies on the O(l) cost per iteration if the needed kernel columns are in the cache, and O(ln) if some kernel columns must be calculated. If similar decomposition methods are applied to rankSVM, the same property may not hold because calculating the sum of loss terms may become dominant by taking O(l log l) cost. Alternatively, because (3.12) and (3.13) are unconstrained, we can apply any general optimization method. Usually such methods must calculate the product between Q and β. We can split Q into several blocks and store a fixed portion in memory; the other blocks are computed when needed. Each block of Q can be efficiently obtained if the data set is dense and optimized numerical linear algebra subprograms (e.g., optimized BLAS [25]) are employed.

Our method may also provide an efficient alternative to train linear rankSVM when l ≪ n. Note that (3.12)-(3.13) involve a vector variable β of size l. In contrast, most existing methods train linear rankSVM by optimizing w, which has n variables. Therefore, if l ≪ n, using (3.12)-(3.13) may be superior to (3.11) because of the smaller number of variables. Because kernels are not used,

        Q = XXᵀ,

where X ≡ [x_1, ..., x_l]ᵀ. We can then easily conduct some operations without storing Q. For example, Qβ can be calculated by

        Qβ = X(Xᵀβ).

In linear rankSVM, using only a subset of the pairs in P may reduce the training time significantly while slightly trading off performance [18]. Interestingly, this approach may not be useful for kernel rankSVM. For linear rankSVM, the dominant cost is to evaluate the summation over all preference pairs, so reducing the number of pairs can significantly reduce the cost. In contrast, the bottleneck for kernel rankSVM is calculating Qβ. The O(l²) or even O(l²n) cost is much larger than the O(l log l) cost of the loss term. Therefore, kernel rankSVM shares the same bottleneck with kernel SVM and support vector regression (SVR) [2, 6]. In this regard, kernel rankSVM may not be more expensive than kernel SVR, which is also applicable to learning to rank.⁶ Contrarily, training linear rankSVM may cost more than linear SVR, which does not require the summation over all pairs.

⁶This is called the pointwise approach because it approximates the relevance level of each individual training point.
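For the linear-kernel special case just described, the matrix-free product is straightforward. The sketch below (ours) computes Qβ as X(Xᵀβ) so that the l×l matrix Q is never formed; on small data one can verify it against the explicit product (X Xᵀ)β.

```python
import numpy as np

def Qbeta_linear(X, beta):
    # With the linear kernel, Q = X X^T, so Q beta = X (X^T beta).
    # Cost is O(ln) and the l x l matrix Q is never stored.
    return X @ (X.T @ beta)
```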
4 Empirical Evaluation

We first implement a truncated Newton method to minimize (3.13). Next, we compare this implementation with the existing methods described in Section 2.

4.1 An Implementation of the Proposed Method  We consider (3.13), which is derived from L2-regularized L2-loss rankSVM. An advantage of using (3.13) rather than (3.12) is that (3.13) is differentiable. We then apply a type of truncated Newton method called the trust region Newton method (TRON) [15] to minimize (3.13). Past uses of TRON include logistic regression and linear SVM [16], and linear rankSVM [14].

At the t-th iteration with iterate βᵗ, TRON finds a truncated Newton step vᵗ by approximately solving the linear system

(4.19)  ∇²f(βᵗ) vᵗ = −∇f(βᵗ),

where we use f(β) to denote the objective function of (3.13). Note that f(β) is not twice differentiable, so following [19] we consider a generalized Hessian; see (B.3) in Appendix B for details. To approximately solve (4.19), TRON applies conjugate gradient (CG) methods that conduct a sequence of Hessian-vector products. We have discussed in Section 3.2 and Appendix B the efficient calculation of Hessian-vector products and function/gradient values.

As a truncated Newton method, TRON confines vᵗ to be within a region that we trust. Then βᵗ is updated by

        βᵗ⁺¹ = βᵗ + vᵗ   if the function value sufficiently decreases,
        βᵗ⁺¹ = βᵗ        otherwise.

If we fail to update βᵗ, the trust region is resized so that we search for vᵗ in a smaller region around βᵗ. In contrast, if the function value sufficiently decreases, we enlarge the trust region. We omit other details of TRON because it is now a common optimization method for linear classification. More details can be found in, for example, [16].

4.2 Data Sets and Evaluation Criteria  We consider data from LETOR 4.0 [21]. We take three sets: MQ2007, MQ2008 and MQ2007-list. Each set consists of five folds, and each fold contains its own training, validation and testing data. We take the first fold of each set. Because MQ2007-list is too large for some methods in the comparison, we sub-sample queries that contain 5% of the instances. Note that the feature values are already in the range [0, 1]; therefore, no scaling is conducted. The details of the data sets are listed in Table 1.

Table 1: The details of the first fold of each data set. l is the number of training instances. n is the number of features. k is the number of relevance levels. |S| is the number of queries. p is the number of preference pairs. MQ2007-list 5% is sub-sampled from MQ2007-list.

Data set        | l       | n  | k     | |S|   | p
MQ2007          | 42,158  | 46 | 3     | 1,017 | 246,015
MQ2008          | 9,360   | 46 | 3     | 471   | 52,325
MQ2007-list     | 743,790 | 46 | 1,268 | 1,017 | 285,943,893
MQ2007-list 5%  | 37,251  | 46 | 1,128 | 52    | 14,090,798

To evaluate the prediction performance, we first consider pairwise accuracy because it is directly related to the loss term of rankSVM:

        Pairwise Accuracy ≡ |{(i, j) ∈ P : wᵀx_i > wᵀx_j}| / p.

We also consider normalized discounted cumulative gain (NDCG), which is often used in information retrieval [8]. We follow LETOR 4.0 to use mean NDCG, defined below. Assume k is a pre-specified positive integer, π is an ideal ordering such that

        y_{π(1)} ≥ y_{π(2)} ≥ ... ≥ y_{π(l_q)},   ∀q ∈ S,

where l_q is the number of instances in query q, and π̄ is the ordering being evaluated. Then

        NDCG@m ≡ I_m⁻¹ Σ_{i=1}^{m} (2^{y_{π̄(i)}} − 1) · d(i)   and   Mean NDCG ≡ (Σ_{m=1}^{l_q} NDCG@m) / l_q,

where

        I_m ≡ Σ_{i=1}^{m} (2^{y_{π(i)}} − 1) · d(i)   and   d(i) ≡ 1 / log₂(max(2, i)).

After obtaining the mean NDCG of each query, we take the average.
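The mean NDCG above is simple to compute once the predicted and ideal orderings are fixed. The following sketch is our own reading of the definition for a single query (not LETOR's official evaluation script); queries whose ideal gain I_m is zero are given NDCG@m = 0 here, which is one possible convention.

```python
import numpy as np

def mean_ndcg(y_true, scores):
    """Mean NDCG of one query: the average of NDCG@m over m = 1..l_q."""
    y_true = np.asarray(y_true, dtype=float)
    scores = np.asarray(scores, dtype=float)
    d = 1.0 / np.log2(np.maximum(2, np.arange(1, len(y_true) + 1)))
    gain_pred = (2.0 ** y_true[np.argsort(-scores)] - 1.0) * d   # ordering being evaluated
    gain_ideal = (2.0 ** np.sort(y_true)[::-1] - 1.0) * d        # ideal ordering
    dcg = np.cumsum(gain_pred)      # DCG@m for every m
    idcg = np.cumsum(gain_ideal)    # I_m for every m
    ndcg = np.divide(dcg, idcg, out=np.zeros_like(dcg), where=idcg > 0)
    return ndcg.mean()

# Example: relevance labels of one query and the scores w^T x_i of a model.
print(mean_ndcg([2, 1, 0, 1], [0.3, 0.9, 0.1, 0.2]))
```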

4.3 Settings for Experiments  We compare the proposed method with the following methods.
• SVMlight: discussed in Section 2.1; it solves the dual problem of rankSVM.
• SVMrank: this method, discussed in Section 2.2, solves an equivalent 1-slack structural SVM problem.
• RV-SVM: it implements the method discussed in Section 2.3 by using CPLEX⁷ to solve the linear programming problem. Note that CPLEX supports parallel computing, but the extra memory requirement is beyond our machine's capacity. Hence we run CPLEX using only one thread.

⁷http://www-01.ibm.com/software/commerce/optimization/cplex-optimizer/

We consider the Radial Basis Function (RBF) kernel

        K(x_i, x_j) = exp(−γ‖x_i − x_j‖)

and conduct the experiments on a 64-bit machine with an Intel Xeon 2.0GHz (E5504) processor, 1 MB cache and 32GB memory.

For parameter selection, we must decide C and γ, where C is the regularization parameter in (1.2) and γ is the kernel parameter of the RBF kernel. Because the compared approaches solve problems with different loss or regularization terms, parameter selection is conducted for each method and each evaluation criterion. We search on a grid of (C, γ) values, where C ∈ {2⁻⁵, 2⁻⁴, ..., 2⁵} and γ ∈ {2⁻⁶, 2⁻⁵, ..., 2⁻¹}, and choose the one that gives the best prediction performance on the validation set. The best parameter (C, γ) for each method and criterion is listed in the supplementary materials. Note that RV-SVM failed to run on MQ2007 and MQ2007-list because the constraint matrix is too large to fit in memory; thus no best parameters are presented for it.

We mentioned that kernel evaluations are a bottleneck in training rankSVM. To reduce repeated kernel evaluations, our implementation stores the full kernel matrix Q, SVMlight caches the part of its kernel matrix that can fit in memory, and SVMrank stores the full kernel matrix of the dual of the sub-problem of (2.8). As for RV-SVM, although it does not store Q, it only needs 2lp kernel evaluations during the training procedure for computing the coefficient of each variable in the constraints of (2.10).

[Figure 1: Experimental results on MQ2007. (a) Pairwise accuracy versus training time (sec., log scale); (b) Mean NDCG versus training time. Curves shown for SVMlight, TRON (proposed), and SVMrank.]

4.4 Comparison Results  We record the prediction performance at every iteration of TRON and SVMrank, and at every 50 iterations of RV-SVM and SVMlight. Figures 1-3 present the relation between the training time and the test performance.

The compared methods solve optimization problems with different regularization and loss terms, so their final pairwise accuracy or NDCG may be slightly different. Instead, the goal of the comparison is to check, for each method, how fast its optimization procedure converges. From Figures 1-3, it is clear that the proposed method very quickly achieves the performance of the final optimal solution. Because the training time is log-scaled, it is much faster than the others. The package SVMrank comes second, although in the figures its performance has not yet stabilized at the end. For SVMlight, the performance is unstable because the optimization problem has not been accurately solved; for smaller data sets, we did observe convergence to the final performance if enough running time is spent. For RV-SVM, we have mentioned that it can handle only the smaller problem MQ2008.

[Figure 2: Experimental results on MQ2008. (a) Pairwise accuracy versus training time (sec., log scale); (b) Mean NDCG versus training time. Curves shown for SVMlight, TRON (proposed), RV-SVM, and SVMrank.]

[Figure 3: Experimental results on MQ2007-list: pairwise accuracy versus training time (sec., log scale) for SVMlight, TRON (proposed), and SVMrank. Mean NDCG is not available for MQ2007-list because of the overflow of 2^{y_{π̄(i)}} caused by the large k.]

5 Conclusions

In this paper, we propose a method for efficiently training rankSVM. The number of variables is reduced from quadratic to linear in the number of training instances. Efficient methods to handle the large number of preference pairs are incorporated. Empirical comparisons show that the training time is reduced by orders of magnitude when compared to state-of-the-art methods. Although only one loss function is used in our implementation for the experiments because of the lack of space, our method is applicable to a variety of loss functions and different optimization methods. The proposed approach makes kernel rankSVM a practically feasible model. The programs used for our experiments are available at http://www.csie.ntu.edu.tw/~cjlin/papers/ranksvm/kernel.tar.gz and a package based on our study will be released soon.

References

[1] A. Airola, T. Pahikkala, and T. Salakoski. Training linear ranking SVMs in linearithmic time using red-black trees. Pattern Recognition Letters, 32(9):1328-1336, 2011.
[2] B. E. Boser, I. Guyon, and V. Vapnik. A training algorithm for optimal margin classifiers. In COLT, 1992.
[3] C.-C. Chang and C.-J. Lin. LIBSVM: A library for support vector machines. ACM TIST, 2:27:1-27:27, 2011.
[4] O. Chapelle. Training a support vector machine in the primal. Neural Computation, 19(5):1155-1178, 2007.
[5] O. Chapelle and S. S. Keerthi. Efficient algorithms for ranking with SVMs. Information Retrieval, 13(3):201-215, 2010.
[6] C. Cortes and V. Vapnik. Support-vector network. Machine Learning, 20:273-297, 1995.
[7] R. Herbrich, T. Graepel, and K. Obermayer. Large margin rank boundaries for ordinal regression. In P. J. Bartlett, B. Schölkopf, D. Schuurmans, and A. J. Smola, editors, Advances in Large Margin Classifiers, pages 115-132. MIT Press, 2000.
[8] K. Järvelin and J. Kekäläinen. Cumulated gain-based evaluation of IR techniques. ACM Transactions on Information Systems, 20(4):422-446, 2002.
[9] T. Joachims. Making large-scale SVM learning practical. In Advances in Kernel Methods - Support Vector Learning. MIT Press, 1998.
[10] T. Joachims. Optimizing search engines using clickthrough data. In ACM KDD, 2002.
[11] T. Joachims. Training linear SVMs in linear time. In ACM KDD, 2006.
[12] T. Joachims, T. Finley, and C.-N. J. Yu. Cutting-plane training of structural SVMs. Machine Learning, 77, 2009.
[13] G. S. Kimeldorf and G. Wahba. A correspondence between Bayesian estimation on stochastic processes and smoothing by splines. The Annals of Mathematical Statistics, pages 495-502, 1970.
[14] C.-P. Lee and C.-J. Lin. Large-scale linear rankSVM. Technical report, National Taiwan University, 2013.
[15] C.-J. Lin and J. J. Moré. Newton's method for large-scale bound constrained problems. SIAM Journal on Optimization, 9:1100-1127, 1999.
[16] C.-J. Lin, R. C. Weng, and S. S. Keerthi. Trust region Newton method for large-scale logistic regression. JMLR, 9:627-650, 2008.
[17] K.-M. Lin and C.-J. Lin. A study on reduced support vector machines. IEEE TNN, 14(6):1449-1559, 2003.
[18] K.-Y. Lin. Data selection techniques for large-scale rankSVM. Master's thesis, National Taiwan University, 2010.
[19] O. L. Mangasarian. A finite Newton method for classification. Optimization Methods and Software, 17(5):913-929, 2002.
[20] J. C. Platt. Fast training of support vector machines using sequential minimal optimization. In B. Schölkopf, C. J. C. Burges, and A. J. Smola, editors, Advances in Kernel Methods - Support Vector Learning, Cambridge, MA, 1998. MIT Press.
[21] T. Qin, T.-Y. Liu, J. Xu, and H. Li. LETOR: A benchmark collection for research on learning to rank for information retrieval. Information Retrieval, 13(4):346-374, 2010.
[22] B. Schölkopf, R. Herbrich, and A. J. Smola. A generalized representer theorem. In COLT, 2001.
[23] D. Sculley. Large scale learning to rank. In NIPS 2009 Workshop on Advances in Ranking, 2009.
[24] I. Tsochantaridis, T. Joachims, T. Hofmann, and Y. Altun. Large margin methods for structured and interdependent output variables. JMLR, 2005.
[25] R. C. Whaley, A. Petitet, and J. J. Dongarra. Automatically tuned linear algebra software and the ATLAS project. Technical report, Department of Computer Sciences, University of Tennessee, 2000.
[26] H. Yu, J. Kim, Y. Kim, S. Hwang, and Y. H. Lee. An efficient method for learning nonlinear ranking SVM functions. Information Sciences, 209:37-48, 2012.
[27] G.-X. Yuan, C.-H. Ho, and C.-J. Lin. Recent advances of large-scale linear classification. PIEEE, 100(9):2584-2603, 2012.

A Optimality of w Defined in (3.16)

Assume α* and β̂* are optimal for (2.4) and (3.15), respectively. By the strong duality of rankSVM and the feasibility of α* for the unconstrained problem (3.15), we have

        optimal value of (1.2)
        = (1/2)(α*)ᵀQ̂α* + C Σ_{(i,j)∈P} max(0, 1 − (Q̂α*)_{i,j})
        ≥ (1/2)(β̂*)ᵀQ̂β̂* + C Σ_{(i,j)∈P} max(0, 1 − (Q̂β̂*)_{i,j})
        ≥ optimal value of (1.2).

The last inequality is from the fact that any w constructed by (3.16) is feasible for (1.2). Thus, the above inequalities are in fact equalities, so the proof is complete. Further, the above equation implies that any dual optimal solution α* is also optimal for (3.15).

B Calculation of First and Second Order Information of (3.12) and (3.13)

Assume Q_i is the i-th column of Q. One sub-gradient of (3.12)'s objective function is

        Qβ − C Σ_{(i,j)∈SV(β)} (Q_i − Q_j)
        = Qβ + C Σ_{i=1}^{l} ( Σ_{j:(j,i)∈SV(β)} Q_i − Σ_{j:(i,j)∈SV(β)} Q_i )
        = Qβ + C Σ_{i=1}^{l} (l_i^+(β) − l_i^−(β)) Q_i.

For (3.13), which uses the L2 loss, the computation is slightly more complicated. We define

        p_β ≡ |SV(β)|,

and let A_β ∈ R^{p_β×l} include A's rows corresponding to SV(β). That is, if (i, j) ∈ SV(β), then the (i, j)th row of A is selected. The objective function of (3.13) can then be written as

        f(β) ≡ (1/2) βᵀQβ + C (A_β Qβ − e_β)ᵀ(A_β Qβ − e_β)
(B.1)        = (1/2) βᵀQβ + C βᵀQ(A_βᵀA_β Qβ − 2A_βᵀe_β) + C p_β,

where e_β ∈ R^{p_β} is a vector of ones. Its gradient is

        Qβ − 2C Σ_{(i,j)∈P} (Q_i − Q_j) max(0, 1 − (Qβ)_i + (Qβ)_j)
(B.2)   = Qβ + 2C Q(A_βᵀA_β Qβ − A_βᵀe_β).

Because f(β) is not twice differentiable, ∇²f(β) does not exist. We follow [19] and [16] to define a generalized Hessian matrix

        ∇²f(β) ≡ Q + 2C Q A_βᵀA_β Q.

Because this matrix may be too large to be stored, some optimization methods employ Hessian-free techniques so that only Hessian-vector products are needed. For any vector v ∈ R^l, we have

(B.3)   ∇²f(β) v = Qv + 2C Q A_βᵀA_β Qv.

We notice that (B.1)-(B.3) share some common terms, which can be computed by the same method. For A_βᵀe_β and p_β needed in (B.1) and (B.2),

        A_βᵀe_β = ( l_1^−(β) − l_1^+(β), ..., l_l^−(β) − l_l^+(β) )ᵀ   and   p_β = Σ_{i=1}^{l} l_i^−(β),

where l_i^+(β) and l_i^−(β) are defined in (3.17) and (3.18).

Next we calculate A_βᵀA_β Qβ and A_βᵀA_β Qv. From

        (A_βᵀA_β)_{i,j} = Σ_s (A_βᵀ)_{i,s}(A_β)_{s,j} = Σ_s (A_β)_{s,i}(A_β)_{s,j},

and the fact that each row of A_β only contains two non-zero elements, we have

        (A_βᵀA_β)_{i,j} = l_i^+(β) + l_i^−(β)   if i = j,
                          −1                     if i ≠ j and (i, j) or (j, i) ∈ SV(β),
                          0                      otherwise.

Consequently,

        (A_βᵀA_β Qv)_i = Σ_{j=1}^{l} (A_βᵀA_β)_{i,j} (Qv)_j
                        = (l_i^+(β) + l_i^−(β))(Qv)_i − Σ_{j:(j,i) or (i,j)∈SV(β)} (Qv)_j.

Finally we have

        Q A_βᵀA_β Qv = Q [ (l_1^+(β) + l_1^−(β))(Qv)_1 − γ_1^+(β, v) − γ_1^−(β, v) ; ... ; (l_l^+(β) + l_l^−(β))(Qv)_l − γ_l^+(β, v) − γ_l^−(β, v) ],

where

        γ_i^+(β, v) ≡ Σ_{j:(j,i)∈SV(β)} (Qv)_j,
        γ_i^−(β, v) ≡ Σ_{j:(i,j)∈SV(β)} (Qv)_j.

It has been shown in [14] that γ_i^+(β, v) and γ_i^−(β, v) can be calculated by the same order-statistic tree technique used for l_i^+(β) and l_i^−(β).
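To make (B.3) concrete, the sketch below assembles the generalized Hessian-vector product with a direct O(p) pass over SV(β). It is our own reference implementation of the formulas above (A_β is never formed explicitly); the actual solver instead obtains l_i^±(β) and γ_i^±(β, v) in O(l log l) with order-statistic trees [14].

```python
import numpy as np

def hessian_vector_product(Q, beta, v, pairs, C):
    """Generalized Hessian-vector product (B.3): Qv + 2C Q A_beta^T A_beta Q v."""
    Qb = Q @ beta
    Qv = Q @ v
    l = len(beta)
    counts = np.zeros(l)   # l_i^+(beta) + l_i^-(beta)
    gamma = np.zeros(l)    # gamma_i^+(beta, v) + gamma_i^-(beta, v)
    for (i, j) in pairs:
        if 1.0 - Qb[i] + Qb[j] > 0:     # (i, j) in SV(beta)
            counts[i] += 1
            counts[j] += 1
            gamma[i] += Qv[j]
            gamma[j] += Qv[i]
    # Componentwise assembly of A_beta^T A_beta Q v from the counts above.
    AtAQv = counts * Qv - gamma
    return Qv + 2.0 * C * (Q @ AtAQv)
```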