
Statistical Analysis of Bayes Optimal Subset Ranking

David Cossock, Yahoo Inc., Santa Clara, CA, USA ([email protected])
Tong Zhang, Rutgers University, NJ, USA ([email protected])

Abstract—The ranking problem has become increasingly important in modern applications of statistical methods in automated decision making systems. In particular, we consider a formulation of the statistical ranking problem which we call subset ranking, and focus on the DCG (discounted cumulated gain) criterion that measures the quality of items near the top of the rank-list. Similar to error minimization for binary classification, direct optimization of natural ranking criteria such as DCG leads to a non-convex optimization problem that can be NP-hard. Therefore a computationally more tractable approach is needed. We present bounds that relate the approximate optimization of DCG to the approximate minimization of certain regression errors. These bounds justify the use of convex learning formulations for solving the subset ranking problem. The resulting estimation methods are not conventional, in that we focus on the estimation quality in the top-portion of the rank-list. We further investigate the asymptotic statistical behavior of these formulations. Under appropriate conditions, the consistency of the estimation schemes with respect to the DCG metric can be derived.

(Partially supported by NSF grant DMS-0706805. Part of the work was done when the second author was at Yahoo Inc.)

I. INTRODUCTION

We consider the general ranking problem, where a computer system is required to rank a set of items based on a given input. In such applications, the system often needs to present only a few top ranked items to the user. Therefore the quality of the system output is determined by the performance near the top of its rank-list.

Ranking is especially important in electronic commerce and many internet applications, where personalization and information based decision making are critical to the success of such businesses. The decision making process can often be posed as a problem of selecting top candidates from a set of potential alternatives, leading to a conditional ranking problem. For example, in a recommender system, the computer is asked to choose a few items a user is most likely to buy based on the user's profile and buying history. The selected items will then be presented to the user as recommendations. Another important example that affects millions of people everyday is the internet search problem, where the user presents a query to the search engine, and the search engine then selects a few web-pages that are most relevant to the query from the whole web. The quality of a search engine is largely determined by the top-ranked results it can display on the first page. Internet search is the main motivation of this theoretical study, although the model presented here can be useful for many other applications. For example, another ranking problem is ad placement in a web-page (either a search result or some content page) according to revenue-generating potential.

Although there has been much theoretical investigation of the ranking problem in recent years, many authors have only considered the global ranking problem, where a single ranking function is used to order a fixed set of items. However, in web-search, a different ranking of web-pages is needed for each different query. That is, we want to find a ranking function that is conditioned on (or local to) the query. We may look at the problem from an equivalent point of view: in web-search, instead of considering many different ranking functions of web-pages (one for each query), we may consider a single ranking function that depends on the (query,web-page) pair. Given a query q, we only need to rank the subset of all possible (query,web-page) pairs restricted to query q. Moreover, we may have a pre-processing step to filter out documents unlikely to be relevant to the query q, so that the subset of (query,web-page) pairs to be ranked is further reduced. We call this framework subset ranking. A formal mathematical definition will be given in Section III.

For web-search (and many other ranking problems), we are only interested in the quality of the top choices; the evaluation of the system output is different from that under many traditional error metrics such as classification error. In this setting, a useful figure of merit should focus on the top portion of the rank-list. This characteristic of ranking problems has not been carefully explored in earlier studies (except for a recent paper [24], which also touched on this issue). The purpose of this paper is to develop some theoretical results for converting a ranking problem into a convex optimization problem that can be efficiently solved. The resulting formulation focuses on the quality of the top ranked results. The theory can be regarded as an extension of the related theory for convex risk minimization formulations for classification, which has drawn much attention recently in the statistical learning literature [3], [18], [26], [27], [30], [31].

Due to our motivation from web-search, this paper focuses on ranking problems that measure quality only in the top portion of the rank-list. However, it is important to note that in some other applications, global ranking criteria such as Spearman's rank correlation and Kendall's τ metric are used. Such global metrics have been investigated in a number of papers in recent years.

We organize the paper as follows. Section II discusses earlier work on global ranking and pair-wise ranking. Section III introduces the subset ranking problem. We define two ranking metrics: one is the DCG measure which we focus on in this paper, and the other is a measure that counts the number of correctly ranked pairs. The latter has been studied recently by several authors in the context of pair-wise preference learning. The order induced by the regression function is shown to be optimal with respect to both metrics. Section IV introduces some basic estimation methods for ranking. Although this paper focuses on the least squares regression based formulation, we also briefly discuss other possible loss functions approximating the optimal order. The later sections develop bounds and consistency arguments for a faster convergence rate to optimal ranking than is possible with naive (uniform) regression. Section V contains the main theoretical results of this paper, where we show that the approximate minimization of certain regression errors leads to the approximate optimization of the ranking metrics defined earlier. This implies that asymptotically the non-convex ranking problem can be solved using regression methods that are convex. Section VI presents the regression learning formulation derived from the theoretical results in Section V. Similar methods are currently used to optimize Yahoo's production search engine. Section VII studies the generalization and asymptotic behavior of regression learning, where we focus on the L2-regularization approach. Together with earlier theoretical results, we can establish the consistency of regression based ranking under appropriate conditions.

II. RANKING AND PAIR-WISE PREFERENCE LEARNING

In the standard (global) ranking problem, a set of items is ranked relative to each other according to a single criterion. The goal is to learn a ranking (or linear ordering) for all items from a small set of training items (with partially defined preference relations among them), so that the remaining items can be ranked according to the same criterion. However, as we have explained in the introduction, this paper considers subset ranking, where only a subset of items is ranked according to an input q (representing the query in the web-search example). Since our motivation is document retrieval, we will follow the convention of using q to represent a query and p to represent pages to be retrieved. Further discussion on ranking in the document retrieval domain and some mathematical formulations can be found in [17], [21], although we use different notations here.

In the context of document retrieval, we may consider ranking as a prediction problem. The traditional prediction problem in statistical machine learning assumes that we observe an input vector q ∈ Q, so as to predict an unobserved output p ∈ P. However, in a ranking problem, if we assume P = {1, . . . , m} contains m possible values, then instead of predicting a value in P, we predict a permutation of P that gives an optimal ordering of P. That is, if we denote by P! the set of permutations of P, then the goal is to predict an output in P!. There are two fundamental issues: first, how to measure the quality of a ranking; second, how to learn a good ranking procedure from historical data.

At first glance, it may seem that we can simply cast the ranking problem as an ordinary prediction problem where the output space becomes P!. However, the number of permutations in P! is m!, which can be extremely large even for small m. Therefore it is not practical to solve the ranking problem directly without imposing certain structure on the search space. Moreover, in practice, given a training point q ∈ Q, we are generally not given an optimal permutation in P! as the observed output. Instead, we may observe another form of output from which an optimal ranking can be inferred, but which may also contain extra information. For example, in web-search, we may observe the relevance score or click-through-rate of a web-page with respect to a query, obtained either from human editorial judgment or from search-logs.
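The point about the size of P! can be made concrete: m! explodes even for modest subset sizes, which is why a scoring function is used instead of predicting a permutation directly. A minimal illustration (the specific values of m are our own choice):

```python
import math

# The output space P! for a subset of m items has m! elements, so
# casting ranking as classification over permutations is intractable
# even for modest m; ranking by a scoring function sidesteps this.
for m in (5, 10, 20):
    print(m, math.factorial(m))
```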

The training procedure should take advantage of such information.

A common method of obtaining an optimal permutation in P! is via a scoring function which maps a pair (q, p) in Q × P to a real-valued number r(q, p). For each q, the predicted permutation in P! induced by this scoring function is defined as the ordering of p ∈ P sorted with non-increasing value r(q, p). This is the method we will focus on in this paper. In subset ranking, we consider an input space X = Q × P, and the goal is to learn a scoring function r(q, p) on X, so that the items (q, p) can be ranked using this function.

Although the ranking problem has received considerable interest in machine learning due to many important applications in information processing systems, the problem has not been extensively studied in the traditional statistical literature. A relevant statistical model is ordinal regression [20]. In this model, we are still interested in predicting a single output. We consider the input space X = Q × P, and for each x = (q, p), we observe an output value y ∈ Y. Moreover, we assume that the values in Y = {1, . . . , L} are ordered, and the cumulative probability P(y ≤ j|x) (j = 1, . . . , L) has the form γ(P(y ≤ j|x)) = θ_j + g_β(x). In this model, both γ(·) and g_β(·) have known functional forms, and θ and β are model parameters.

Note that the ordinal regression model induces a stochastic preference relationship on the input space X. Consider two samples (x_1, y_1) and (x_2, y_2) on X × Y. We say x_1 ≺ x_2 if and only if y_1 < y_2. This leads to a classification problem that takes a pair of inputs x_1 and x_2 and tries to predict whether x_1 ≺ x_2 or not (that is, whether the corresponding outputs satisfy y_1 < y_2 or not). In this formulation, the optimal prediction rule to minimize classification error is induced by the ordering of g_β(x) on X, because if g_β(x_1) < g_β(x_2), then P(y_1 < y_2 | x_1, x_2) > 0.5 (based on the ordinal regression model), which is consistent with the Bayes rule. Motivated by this observation, an SVM ranking method is proposed in [15]. The idea is to reformulate ordinal regression as a model to learn a preference relation on the input space X, which can be learned using pair-wise classification. Given the parameter β̂ learned from training data, the scoring function is simply r(q, p) = g_β̂(x), where x = (q, p).

The pair-wise preference learning model has become the major ranking topic in the machine learning literature. For example, in addition to SVM, a similar method based on AdaBoost is proposed in [12]. The idea was also used in optimizing the Microsoft web-search system [6].

A number of researchers have proposed theoretical analysis of ranking based on the pair-wise ranking model. The criterion is to minimize the error of pair-wise preference prediction when we draw two pairs x_1 and x_2 randomly from the input space X. That is, given a scoring function g : X → R, the ranking loss is:

  E_{(X_1,Y_1)} E_{(X_2,Y_2)} [I(Y_1 < Y_2) I(g(X_1) ≥ g(X_2)) + I(Y_1 > Y_2) I(g(X_1) ≤ g(X_2))]   (1)
  = E_{X_1,X_2} [P(Y_1 < Y_2 | X_1, X_2) I(g(X_1) ≥ g(X_2)) + P(Y_1 > Y_2 | X_1, X_2) I(g(X_1) ≤ g(X_2))],

where I(·) denotes the indicator function. For binary output y = 0, 1, it is known that this metric is equivalent to the AUC measure (area under the ROC curve) for binary classifiers up to a scaling, and it is closely related to the Mann-Whitney-Wilcoxon statistic [14]. In the machine learning literature, theoretical analysis has focused mainly on this ranking criterion (for example, see [1], [2], [8], [23]).

The pair-wise preference learning model has some limitations. First, although the criterion in (1) measures the global pair-wise ranking quality, it is not the best metric to evaluate some practical ranking systems such as those in web-search. In such applications, a system does not need to rank all data-pairs, but only a subset of them each time. Furthermore, the top-most positions are of primary importance. Another issue with the pair-wise preference learning model is that the scoring function is usually learned by minimizing a convex relaxation of the pair-wise classification error, similar to large margin classification. However, if the preference relationship is noisy (that is, not all preference relations can be satisfied by a single ranking order), then an important question that should be addressed is whether such a learning algorithm leads to a Bayes optimal ranking function in the large sample limit. Unfortunately, for general risk minimization formulations, the problem is quite complex and difficult if the decision rule is induced by a single-variable scoring function of the form r(x). The regression formulations considered in this paper have simpler (although less general) forms, which can be more easily analyzed.

The problem of Bayes optimality in the pair-wise learning model was partially investigated in [8], but with a decision rule of a general form r(x_1, x_2): we predict x_1 ≺ x_2 if r(x_1, x_2) < 0. To our knowledge, this method is not widely used in practice because a naive application can lead to contradictions: we may predict r(x_1, x_2) < 0, r(x_2, x_3) < 0, and r(x_3, x_1) < 0. Therefore in order to use such a method effectively for ranking, there needs to be a mechanism to resolve such
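To make criterion (1) concrete, here is a small sketch of its empirical (plug-in) version on a finite sample; the function name and the array representation of the scores g(x_i) and grades y_i are our own illustration, not notation from the paper:

```python
def pairwise_ranking_loss(scores, grades):
    """Empirical version of the pair-wise criterion (1): a pair (i, j)
    with grades[i] < grades[j] contributes an error when the scoring
    function does not rank j strictly above i (ties count as errors)."""
    n, errors, pairs = len(scores), 0, 0
    for i in range(n):
        for j in range(n):
            if grades[i] < grades[j]:
                pairs += 1
                if scores[i] >= scores[j]:
                    errors += 1
    return errors / pairs if pairs else 0.0

# A score ordering that agrees with the grades has zero loss;
# the reversed ordering misranks every preference pair.
print(pairwise_ranking_loss([0.1, 0.5, 0.9], [0, 1, 2]))  # 0.0
print(pairwise_ranking_loss([0.9, 0.5, 0.1], [0, 1, 2]))  # 1.0
```

For binary grades this quantity is exactly one minus the AUC of the scorer, which is the equivalence mentioned above.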

contradictions (see [9]). For example, one possibility is to define a scoring function f(x) = Σ_{x'} r(x, x'), and rank the data accordingly. Another possibility is to use a sorting method (such as quick-sort) directly with the comparison function given by r(x_1, x_2). In order to show that such contradiction resolution methods are well behaved asymptotically, it is necessary to analyze the corresponding error. Although useful, such analysis is beyond the scope of this paper.

It is worth mentioning that there are more elaborate ideas in the preference learning framework that can address some limitations of global pair-wise ranking. One proposal that involves only partially defined preference relations has been considered in [25]. Their methods allow optimization of a convex surrogate that can be more tightly coupled with the goals of ranking, while still enabling fast optimization. The methods are directly motivated by considering the noiseless situation where all preference relationships can be fully satisfied. If the preference relationships cannot be fully satisfied, the behavior of their methods is not well understood. In contrast, the regression formulations considered in this paper are easier to analyze. Our results imply that the methods we propose are well behaved even when the ranking problem contains noise (that is, when we cannot find a ranking function that preserves all preference relationships).

III. SUBSET RANKING MODEL

The global pair-wise preference learning model in Section II has some limitations. In this paper, we shall describe a model more relevant to some practical ranking systems such as web-search.

A. Problem definition

The problem of web-search is to rank web-pages (denoted by p) based on a query q. As explained in the introduction, one may consider this problem as learning a global ranking function that orders (query,web-page) pairs. However, given a query q, we only need to rank the subset of all (query,web-page) pairs that are consistent with the query q. Our goal is to rank the (query,web-page) pairs according to how relevant the web-page is to the query. As mentioned in Section II, this is achieved by taking (q, p) as input, and estimating a scoring function r(q, p), so that a more relevant pair gets a higher score.

In practical applications, the standard input to a machine learning algorithm is usually not the original input, but a feature vector created from the input. Therefore from now on, we denote the space of observable feature vectors as X. In web-search, feature vectors are constructed from (q, p). For each x ∈ X, we also observe a non-negative real-valued variable y that measures the quality of x (this corresponds to the relevance score of the (query,web-page) pair represented by x). Denote by S the set of all finite subsets of X, which may possibly contain elements that are redundant (this happens when two (query,web-page) pairs map to the same feature vector). Note that sets with repeated membership are sometimes referred to as multisets in the literature. We do not differentiate set and multiset here, as this detail is not essential in the paper. In subset ranking, each instance is an ordered finite subset S = {x_1, . . . , x_m} ∈ S (i.e., the set of (query,web-page) pairs consistent with the given query q in the web-search example). Note that the actual order of the items in the set is of no importance; the numerical subscripts are for notational purposes only, so that permutations can be more conveniently defined.

In our model, we randomly draw an ordered subset S = {x_1, . . . , x_m} ∈ S consisting of feature vectors x_j in X; at the same time, we are given a set of real-valued grades {y_j} = {y_1, . . . , y_m} such that for each j, y_j corresponds to x_j (representing its relevance score). Whether the size m of the set should be a random variable has no importance in our analysis. In this paper we assume that it is fixed for simplicity.

Based on the observed subset S = {x_1, . . . , x_m}, the system is required to output an ordering (ranking) of the items in the set. Using our notation, this ordering can be represented as a permutation J = [j_1, . . . , j_m] of [1, . . . , m]. Our goal is to produce a permutation such that y_{j_i} is in decreasing order for i = 1, . . . , m. A quality criterion is used to evaluate the system-produced ranking of subset S based on the observed grades {y_j}. Our goal is to maximize the expected value of this quality criterion over random draws of S and {y_j} from an underlying distribution D. A formal definition is given below, where for notational simplicity, we state it with fixed m.

Definition 1: In subset ranking, we are given a set of items X. Let S = X^m be the set of (ordered) finite subsets of X of cardinality m. Let Q(J, {y_j}) be a real-valued quality function, where we denote by J a permutation of [1, . . . , m], and by {y_j} = [y_1, . . . , y_m] ∈ R^m a vector of m grades. A ranking function r(S) maps an ordered subset S ∈ S to a permutation J. Assume that S = {x_1, . . . , x_m} and {y_j} are drawn randomly from an underlying distribution D, which is invariant under any simultaneous permutation of (x_j, y_j) for j = 1, . . . , m (that is, the ordering of S is of no importance). Our goal is to find a ranking function r(S)

to maximize the expected subset ranking quality

  E_{(S,{y_j})∼D} Q(r(S), {y_j}).

The machine learning problem is to estimate r(S) from a set of n training pairs {(S, {y_j})} that are independently drawn from D.

In practical applications, each available position i can be associated with a weight c_i that measures the importance of that position. Now, given the grades y_j (j = 1, . . . , m), a very natural measure of the quality of the rank-list J = [j_1, . . . , j_m] is the following weighted sum:

  DCG(J, [y_j]) = Σ_{i=1}^{m} c_i y_{j_i}.

We assume that {c_i} is a pre-defined sequence of non-increasing non-negative discount factors that may or may not depend on S. This metric, described in [16] as DCG (discounted cumulated gain), is one of the preferred metrics used in the evaluation of internet search systems, including the production systems of Yahoo and of Microsoft [6]. In this context, a typical choice of c_i is to set c_i = 1/log(1 + i) when i ≤ k and c_i = 0 when i > k, for some k. One may also use other choices, such as letting c_i be the probability of a user viewing (or clicking) the result at position i.

Although introduced in the IR context and applied to web-search, the DCG criterion can be useful for many other ranking applications such as recommender systems. By choosing a decaying sequence of c_i, this measure naturally focuses on the quality of the top portion of the rank-list. This is in contrast with the pair-wise error criterion in (1), which does not distinguish the top portion of the rank-list from the bottom portion.

The global pair-wise preference learning metric (1) can also be adapted to the subset ranking setting. We may consider the following weighted total of correctly ranked pairs minus incorrectly ranked pairs:

  T(J, [y_j]) = (2/(m(m−1))) Σ_{i=1}^{m−1} Σ_{i'=i+1}^{m} (y_{j_i} − y_{j_{i'}}).

If the output label y_i takes binary values, and the subset S = X is global (we may assume that it is finite), then this metric is equivalent to (1). Although we pay special attention to the DCG metric, we shall also include some analysis of the T criterion for completeness.

For the DCG criterion, the goal is to train a ranking function r that can take a subset S ∈ S as input, and produce an output permutation J = r(S) such that the expected DCG is as large as possible:

  DCG(r) = E_S DCG(r, S),   (2)

where

  DCG(r, S) = Σ_{i=1}^{m} c_i E_{y_{j_i}|(x_{j_i},S)} y_{j_i}.   (3)

Note that in the above equation, we let r(S) = [j_1, . . . , j_m]. We also assume that x_{j_i} is the j_i-th item in S, as in Definition 1. The notation E_{y_{j_i}|(x_{j_i},S)} y_{j_i} denotes the expectation of the score y_{j_i} corresponding to the item x_{j_i} ∈ S. We condition on x_{j_i} instead of the index j_i in order to emphasize that the ordering information in S is not important (see Definition 1). That is, y_{j_i} depends on x_{j_i} rather than on the index j_i itself. Conditioned on the value x_{j_i}, the distribution of y_{j_i} is not affected when we reorder S.

Similar to (2) and (3), we can define the following quantities:

  T(r) = E_S T(r, S),   (4)

where, if we let r(S) = [j_1, . . . , j_m], then

  T(r, S) = (2/(m(m−1))) Σ_{i=1}^{m−1} Σ_{i'=i+1}^{m} (E_{y_{j_i}|(x_{j_i},S)} y_{j_i} − E_{y_{j_{i'}}|(x_{j_{i'}},S)} y_{j_{i'}}).   (5)

Similar to the concept of the Bayes classifier in classification, we can define the Bayes ranking function that optimizes the DCG and T measures. Based on the conditional formulations in (3) and (5), we have the following result:

Theorem 1: Given a set S ∈ S, for each x_j ∈ S, we define the Bayes-scoring function as

  f_B(x_j, S) = E_{y_j|(x_j,S)} y_j.

An optimal Bayes ranking function r_B(S) that maximizes (5) returns a rank list J = [j_1, . . . , j_m] such that f_B(x_{j_i}, S) is in descending order: f_B(x_{j_1}, S) ≥ f_B(x_{j_2}, S) ≥ · · · ≥ f_B(x_{j_m}, S). An optimal Bayes ranking function r'_B(S) that maximizes (3) returns a rank list J = [j_1, . . . , j_m] such that c_k > c_{k'} implies f_B(x_{j_k}, S) ≥ f_B(x_{j_{k'}}, S).

Proof: Without confusion, we use the same notation J = [j_1, . . . , j_m] to denote the rank list from an optimal Bayes ranking function r_B(S) for (5), or from r'_B(S) for (3). Given any k, k' ∈ {1, . . . , m}, we consider a modified ranking function that returns a rank list J' derived from J by switching j_k and j_{k'}. That is, we define J' = [j'_1, . . . , j'_m], where j'_i = j_i when i ≠ k, k', and j'_k = j_{k'}, j'_{k'} = j_k. Due to the optimality of the Bayes ranking function, we have T(J, S) ≥ T(J', S) for (5) and DCG(J, S) ≥ DCG(J', S) for (3).
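Theorem 1's claim for the DCG criterion can be checked numerically on a toy subset: with the discount c_i = 1/log(1 + i) mentioned above, and with the grades standing in for the Bayes scores f_B, sorting in descending order attains the maximum DCG over all m! permutations. A small sketch (function and variable names are ours):

```python
import math
from itertools import permutations

def dcg(grades_in_rank_order, k=10):
    # DCG(J, [y_j]) = sum_i c_i * y_{j_i}, with c_i = 1/log(1+i)
    # for i <= k and c_i = 0 for i > k, as in the text.
    return sum(y / math.log(1 + i)
               for i, y in enumerate(grades_in_rank_order[:k], start=1))

# The grades play the role of the Bayes scores f_B(x_j, S).
scores = [0.2, 0.9, 0.4, 0.7]
brute_force = max(dcg([scores[j] for j in perm])
                  for perm in permutations(range(len(scores))))
descending = dcg(sorted(scores, reverse=True))
assert abs(brute_force - descending) < 1e-12
```

The agreement is an instance of the rearrangement inequality: pairing the largest scores with the largest (non-increasing) discounts maximizes the weighted sum.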

We consider the T-criterion first. In order to show that f_B(x_{j_k}, S) is in descending order, we only need to show that f_B(x_{j_{k+1}}, S) ≤ f_B(x_{j_k}, S) for k = 1, . . . , m−1. Consider k' = k+1; it is easy to check that T(J', S) − T(J, S) = 4(f_B(x_{j_{k+1}}, S) − f_B(x_{j_k}, S))/(m(m−1)). Therefore T(J', S) ≤ T(J, S) implies that f_B(x_{j_{k+1}}, S) ≤ f_B(x_{j_k}, S). This proves the first claim of the theorem. Note that the reverse is also true, because all ranking orders with f_B(x_{j_k}, S) in descending order have the same T-value.

Now we consider the DCG-criterion. Given any k, k', we have DCG(J', S) − DCG(J, S) = (c_k − c_{k'})(f_B(x_{j_{k'}}, S) − f_B(x_{j_k}, S)). Now c_k > c_{k'} and DCG(J', S) ≤ DCG(J, S) imply that f_B(x_{j_k}, S) ≥ f_B(x_{j_{k'}}, S). This proves the second claim. Note that the reverse is also true, because all such ranking orders have the same DCG-value.

The result indicates that the optimal ranking can be induced by a single-variable scoring function of the form f(x, S): X × S → R. In this paper, f(x, S) is meaningful only on the subset of X × S with x ∈ S. However, for notational simplicity, we consider the domain of f(x, S) to be X × S, where we simply let f(x, S) = 0 when x ∉ S. We use the specific function form f(x, S) to emphasize that the order of S is of no importance. In fact, in practical applications, we usually consider a function f(x, S) that is independent of the ordering of S. That is, if S' contains the same elements as S in a different order, then f(x, S') = f(x, S).

Due to the dependency of the conditional probability of y on S, the optimal scoring function also depends on S. Therefore a complete solution of the subset ranking problem can be difficult when m is large. In practice, in order to remove the set dependency, we may consider an encoding of S into a set-dependent feature vector g(S), and then incorporate this information into the feature vector x. If we are able to find such salient features, then the augmented feature vector contains all necessary information of S. Given the augmented x ∈ S, we may then assume that y_j is conditionally independent of S. That is, f_B(x_j, S) = E_{y_j|(x_j,S)} y_j = E_{y_j|x_j} y_j = f_B(x_j), which removes the dependence on S. Note that this formulation explicitly allows predictive features which are functions of the result set S, in addition to those which depend exclusively on x_j.

One may also consider the combinatorial problem of finding a global set-independent scoring function f(x_j) that approximately maximizes the DCG (i.e., approximately preserves the ranking order of f_B(x_j, S)). In the general case, this problem is computationally difficult. To see this, we may consider for simplicity that X is finite; moreover, each subset only contains two elements, and one is preferred over the other (deterministically). Now in the subset learning model, such a preference relationship x ≺ x' between two elements x, x' ∈ X can be denoted by a directed edge from x to x'. In this setting, finding a global scoring function that approximates the optimal set-dependent Bayes scoring rule is equivalent to finding a maximum subgraph that is acyclic. In general, this problem is known to be NP-hard as well as APX-hard [11]: the class APX consists of problems having an approximation to within a factor 1 + c of the optimum for some c. If any APX-hard problem admits a polynomial time approximation scheme, then P=NP.

B. Web-search example

Web-search is a concrete example of subset ranking. In this application, a user submits a query q, and expects the search engine to return a rank-list of web-pages {p_j} such that a more relevant page is placed before a less relevant page. In a typical internet search engine, the system takes a query and uses a simple ranking formula for initial filtering, which limits the set of web-pages to an initial pool {p_j} of size m (e.g., m = 100000). After this initial ranking, the system utilizes a more complicated second-stage ranking process, which re-orders the pool. This critical stage is the focus of this paper. This step generates a feature vector x_j for each page p_j in the initial pool, using information about the query q, the page p_j, query/page relationships, or the result pool itself. The feature vector can encode various types of information, such as the query length, characteristics of the pool, the number of query terms that match in the title or body of p_j, the web linkage of p_j, etc. The set of all possible feature vectors x_j is X. The ranking algorithm only observes a list of feature vectors {x_1, . . . , x_m} with each x_j ∈ X. A human editor is presented with a pair (q, p_j) and assigns a score s_j on a scale, e.g., 1-5 (least relevant to highly relevant). The corresponding target value y_j is defined as a transformation of s_j, which maps the grade into the interval [0, 1]. (For example, the formula (2^{s_j} − 1)/(2^5 − 1) is used in [6]. Yahoo uses a different transformation based on empirical user surveys.) Another possible choice of y_j is to normalize it by multiplying each y_j by a factor such that the optimal DCG is no more than one.

IV. RISK MINIMIZATION BASED ESTIMATION METHODS

From the previous section, we know that the optimal scoring function is the conditional expectation of the

6 grades y. We investigate some basic estimation methods where the function space F¯ contains a subset of func- for conditional expectation learning. tions {f(S¯): X m → Rm} of the form

A. Relation to multi-category classification

The subset ranking problem is a generalization of multi-category classification. In the latter case, we observe an input $x_0$, and are interested in classifying it into one of $m$ classes. Let the output value be $k \in \{1, \ldots, m\}$. We encode the input $x_0$ into $m$ feature vectors $\{x_1, \ldots, x_m\}$, where $x_i = [0, \ldots, 0, x_0, 0, \ldots, 0]$ with the $i$-th component being $x_0$ and the other components being zeros. We then encode the output $k$ into $m$ values $\{y_j\}$ such that $y_k = 1$ and $y_j = 0$ for $j \neq k$. In this setting, we try to find a scoring function $f$ such that $f(x_k) > f(x_j)$ for $j \neq k$. Consider the DCG criterion with $c_1 = 1$ and $c_j = 0$ when $j > 1$. Then the classification accuracy is given by the corresponding DCG.

Given any multi-category classification algorithm, one may use it to solve the subset ranking problem (with non-negative $y_j$ values) as follows. Consider a sample $S = [x_1, \ldots, x_m]$ as input, and a set of outputs $\{y_j\}$. We randomly draw $k$ from $1$ to $m$ according to the distribution $y_k / \sum_j y_j$. We then form another sample with weight $\sum_j y_j$, which has the vector $\bar{S} = [x_1, \ldots, x_m]$ (where order is important) as input, and label $y_0 = k \in \{1, \ldots, m\}$ as output. This changes the problem formulation into multi-category classification. Since the conditional expectation can be expressed as
$$ \mathbf{E}_{y_k|(x_k,S)}\, y_k = P(y_0 = k \mid \bar{S})\; \mathbf{E}_{\{y_j\}|S} \sum_j y_j, $$
the order induced by the scoring function $\mathbf{E}_{y_k|(x_k,S)}\, y_k$ is the same as that induced by $P(y_0 = k \mid \bar{S})$. Therefore a multi-category classification solver that estimates conditional probabilities can be used to solve the subset ranking problem.

In particular, if we consider a risk minimization based multi-category classification solver for the $m$-class problem [27], [30] of the following form:
$$ \hat{f} = \arg\min_{f \in F} \sum_{i=1}^n \Phi(f(X_i), Y_i), $$
where $(X_i, Y_i)$ are training points with $Y_i \in \{1, \ldots, m\}$, $F$ is a vector function class that takes values in $\mathbb{R}^m$, and $\Phi$ is some risk functional, then for ranking with training points $(\bar{S}_i, \{y_{i,1}, \ldots, y_{i,m}\})$ and $\bar{S}_i = [x_{i,1}, \ldots, x_{i,m}]$, the corresponding learning method becomes
$$ \hat{f} = \arg\min_{f \in \bar{F}} \sum_{i=1}^n \sum_{j=1}^m y_{i,j}\, \Phi(f(\bar{S}_i), j), $$
where $\bar{F}$ consists of vector functions
$$ f(\bar{S}) = [f(x_1, S), \ldots, f(x_m, S)], $$
and $S = \{x_1, \ldots, x_m\}$ is the unordered set. A concrete example is maximum entropy (multi-category logistic regression), which employs the loss
$$ \Phi(f(\bar{S}), j) = -f(x_j, S) + \ln \sum_{k=1}^m e^{f(x_k, S)}. $$

B. Regression based learning

Since in ranking problems $y_{i,j}$ can take values other than $0$ or $1$, we can have more general formulations than multi-category classification. In particular, we may consider variations of the following regression based learning method to train a scoring function in $F \subset \{X \times S \to \mathbb{R}\}$:
$$ \hat{f} = \arg\min_{f \in F} \sum_{i=1}^n \sum_{j=1}^m \phi(f(x_{i,j}, S_i), y_{i,j}), \qquad S_i = \{x_{i,1}, \ldots, x_{i,m}\} \in \mathcal{S}, \qquad (6) $$
where we assume that
$$ \phi(a, b) = \phi_0(a) + \phi_1(a)\, b + \phi_2(b). $$
The estimation formulation is decoupled for each element $x_{i,j}$ in a subset $S_i$, which makes the problem easier to solve. In this method, each training point $((x_{i,j}, S_i), y_{i,j})$ is treated as a single sample (for $i = 1, \ldots, n$ and $j = 1, \ldots, m$). The population version of the risk function is
$$ \mathbf{E}_S \sum_{x \in S} \left[ \phi_0(f(x, S)) + \phi_1(f(x, S))\, \mathbf{E}_{y|(x,S)}\, y + \mathbf{E}_{y|(x,S)}\, \phi_2(y) \right]. $$
This implies that the optimal population solution is a function that minimizes
$$ \phi_0(f(x, S)) + \phi_1(f(x, S))\, \mathbf{E}_{y|(x,S)}\, y, $$
which is a function of $\mathbf{E}_{y|(x,S)}\, y$. Therefore the estimation method in (6) leads to an estimator of the conditional expectation with a reasonable choice of $\phi_0(\cdot)$ and $\phi_1(\cdot)$. A simple example is the least squares method, where we pick $\phi_0(a) = a^2$, $\phi_1(a) = -2a$ and $\phi_2(b) = b^2$. That is, the learning method (6) becomes least squares estimation:
$$ \hat{f} = \arg\min_{f \in F} \sum_{i=1}^n \sum_{j=1}^m (f(x_{i,j}, S_i) - y_{i,j})^2. \qquad (7) $$
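To make the decoupled structure of (7) concrete, here is a minimal sketch, not taken from the paper: with a linear scoring function, the subset structure plays no role in the fit itself, and (7) reduces to ordinary least squares over all $n \cdot m$ item-level points. The synthetic features, weights, and grades are hypothetical.

```python
import numpy as np

# Illustration of the decoupled least squares formulation (7): every pair
# ((x_{i,j}, S_i), y_{i,j}) is an independent training point, so with a
# linear scoring function f(x, S) = w^T x the fit is ordinary least squares
# over all n*m flattened rows. Features, weights, and grades are hypothetical.
rng = np.random.default_rng(0)
n, m, d = 50, 5, 3                               # subsets, items per subset, features
X = rng.normal(size=(n, m, d))                   # x_{i,j}
w_true = np.array([1.0, -0.5, 0.25])
Y = X @ w_true + 0.1 * rng.normal(size=(n, m))   # grades y_{i,j}

# Solve min_w sum_{i,j} (w^T x_{i,j} - y_{i,j})^2 after flattening subsets.
w_hat, *_ = np.linalg.lstsq(X.reshape(-1, d), Y.reshape(-1), rcond=None)

# The fitted scores induce the rank-list of each subset S_i by sorting
# f(x_{i,j}, S_i) in descending order.
ranking_of_S0 = np.argsort(-(X[0] @ w_hat))
```

Only the induced ordering of each subset matters for ranking, so any monotone transformation of the fitted scores would produce the same rank-list.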

This method, and some essential variations which we will introduce later, will be the focus of our analysis. It was shown in [7] that the only loss function with the conditional expectation as the minimizer (for an arbitrary conditional distribution of $y$) is least squares. However, for practical purposes, we only need to estimate a monotone transformation of the conditional expectation. For this purpose, we can use additional loss functions of the form (6). In particular, let $\phi_0(a)$ be an arbitrary convex function such that $\phi_0'(a)$ is a monotone increasing function of $a$; then we may simply take the function $\phi(a, b) = \phi_0(a) - ab$ in (6). The optimal population solution is uniquely determined by $\phi_0'(f(x, S)) = \mathbf{E}_{y|(x,S)}\, y$. A simple example is $\phi_0(a) = a^4/4$, for which the population optimal solution is $f(x, S) = (\mathbf{E}_{y|(x,S)}\, y)^{1/3}$. Clearly such a transformation does not affect the ranking. Moreover, in many ranking problems, the value of $y$ is bounded.

It is known that additional loss functions can be used for computing the conditional expectation. As a simple example, if we assume that $y \in [0, 1]$, then the following modified least squares can be used:
$$ \hat{f} = \arg\min_{f \in F} \sum_{i=1}^n \sum_{j=1}^m \left[ (1 - y_{i,j}) \max(0, f(x_{i,j}, S_i))^2 + y_{i,j} \max(0, 1 - f(x_{i,j}, S_i))^2 \right]. \qquad (8) $$
One may replace this with other loss functions used for binary classification that estimate conditional probability, such as those discussed in [31]. Although such general formulations might be interesting for certain applications, their advantages over the simpler least squares loss of (7) are not completely certain, and they are more complicated to deal with. Therefore we will not consider such general formulations in this paper, but rather focus on adapting the least squares method in (7) to ranking problems. As we shall see, non-trivial modifications of (7) are necessary to optimize system performance near the top of the rank-list.

C. Pair-wise preference learning

A popular idea in the recent machine learning literature is to pose the ranking problem as a pair-wise preference relationship learning problem (see Section II). Using this idea, the scoring function for subset ranking can be trained by the following method:
$$ \hat{f} = \arg\min_{f \in F} \sum_{i=1}^n \sum_{(j,j') \in E_i} \phi(f(x_{i,j}, S_i), f(x_{i,j'}, S_i); y_{i,j}, y_{i,j'}), \qquad (9) $$
where each $E_i = \{(j, j') : y_{i,j} < y_{i,j'}\}$ is a subset of $\{1, \ldots, m\} \times \{1, \ldots, m\}$. For example, we may use a non-increasing monotone function $\phi_0$ and let $\phi(a_1, a_2; b_1, b_2) = \phi_0((a_2 - a_1) - (b_2 - b_1))$ or $\phi(a_1, a_2; b_1, b_2) = (b_2 - b_1)\, \phi_0(a_2 - a_1)$. Example loss functions include the SVM loss $\phi_0(x) = \max(0, 1 - x)$ and the AdaBoost loss $\phi_0(x) = \exp(-x)$ (see [12], [15], [24]). The approach works well if the ranking problem is noise-free (that is, $y_{i,j}$ is deterministic). However, one difficulty with this approach is that if $y_{i,j}$ is stochastic, then the corresponding population estimator from (9) may not be Bayes optimal, unless a more complicated scheme such as [8] is used. It would be interesting to investigate the error of such an approach, but the analysis is beyond the scope of this paper.

One argument used by the advocates of the pair-wise learning formulation is that we do not have to learn an absolute grade judgment (or its expectation), but rather only the relative judgment that one item is better than another. In essence, this means that for each subset $S$, if we shift each judgment by a constant, the ranking is not affected. If invariance with respect to a set-dependent judgment shift is a desirable property, then it can be incorporated into the regression based model [28]. For example, we may introduce an explicit set-dependent shift feature (which is rank-preserving) into (6):
$$ \hat{f} = \arg\min_{f \in F} \sum_{i=1}^n \min_{b_i \in \mathbb{R}} \sum_{j=1}^m \phi(f(x_{i,j}, S_i) + b_i, y_{i,j}). $$
In particular, for least squares, we have the following method:
$$ \hat{f} = \arg\min_{f \in F} \sum_{i=1}^n \min_{b_i \in \mathbb{R}} \sum_{j=1}^m (f(x_{i,j}, S_i) + b_i - y_{i,j})^2. \qquad (10) $$
More generally, we may also introduce more sophisticated set-dependent features (such as set-dependent scaling factors) and even hierarchical set-dependent models into the regression formulation. This general approach can remove many limitations of the standard regression method when compared to the pair-wise approach.

V. CONVEX SURROGATE BOUNDS

The subset ranking problem defined in Section III is combinatorial in nature, and is very difficult to solve directly. Since the optimal Bayes ranking rule is given by the conditional expectation, in Section IV we discussed various formulations to estimate the conditional expectation; in particular, we are interested in least squares regression based methods. In this context, a natural question to ask is: if a scoring function approximately minimizes the regression error, how well can it optimize

ranking metrics such as DCG or T. This section provides some theoretical results that relate the optimization of the ranking metrics defined in Section III to the minimization of regression errors. This allows us to design appropriate convex learning formulations that improve on the simple least squares methods in (7) and (10).

Definition 2: A scoring function $f(x, S)$ maps each $x \in S$ to a real valued score. It induces a ranking function $r_f$, which ranks the elements $\{x_j\}$ of $S$ in descending order of $f(x_j)$.

We are interested in bounding the DCG performance of $r_f$ compared with that of $f_B$. This can be regarded as an extension of Theorem 1 that motivates regression based learning.

Theorem 2: Let $f(x, S)$ be a real-valued scoring function, which induces a ranking function $r_f$. Consider a pair $p, q \in [1, \infty]$ such that $1/p + 1/q = 1$. We have the following relationship for each $S = \{x_1, \ldots, x_m\}$:
$$ \mathrm{DCG}(r_B, S) - \mathrm{DCG}(r_f, S) \le 2 \left( \sum_{i=1}^m c_i^p \right)^{1/p} \left( \sum_{j=1}^m |f(x_j, S) - f_B(x_j, S)|^q \right)^{1/q}. $$

Proof: Let $S = \{x_1, \ldots, x_m\}$, $r_f(S) = J = [j_1, \ldots, j_m]$, and $r_B(S) = J_B = [j_1^*, \ldots, j_m^*]$. We have
$$ \begin{aligned} \mathrm{DCG}(r_f, S) &= \sum_{i=1}^m c_i f_B(x_{j_i}, S) \\ &= \sum_{i=1}^m c_i f(x_{j_i}, S) + \sum_{i=1}^m c_i \left( f_B(x_{j_i}, S) - f(x_{j_i}, S) \right) \\ &\ge \sum_{i=1}^m c_i f(x_{j_i^*}, S) + \sum_{i=1}^m c_i \left( f_B(x_{j_i}, S) - f(x_{j_i}, S) \right) \\ &= \sum_{i=1}^m c_i f_B(x_{j_i^*}, S) + \sum_{i=1}^m c_i \left( f(x_{j_i^*}, S) - f_B(x_{j_i^*}, S) \right) + \sum_{i=1}^m c_i \left( f_B(x_{j_i}, S) - f(x_{j_i}, S) \right) \\ &\ge \mathrm{DCG}(r_B, S) - \sum_{i=1}^m c_i \left( f(x_{j_i}, S) - f_B(x_{j_i}, S) \right)_+ - \sum_{i=1}^m c_i \left( f_B(x_{j_i^*}, S) - f(x_{j_i^*}, S) \right)_+ \\ &\ge \mathrm{DCG}(r_B, S) - 2 \left( \sum_{i=1}^m c_i^p \right)^{1/p} \left( \sum_{j=1}^m |f(x_j, S) - f_B(x_j, S)|^q \right)^{1/q}, \end{aligned} $$
where we used the notation $(z)_+ = \max(0, z)$. The first inequality in the above derivation is a direct consequence of the definition of $J = r_f(S)$, which implies that $J = [j_1, \ldots, j_m]$ achieves the maximum value of $\sum_{i=1}^m c_i f(x_{j_i}, S)$ among all possible permutations $[j_1, \ldots, j_m]$ of $[1, \ldots, m]$ (see the proof of Theorem 1). The last inequality is due to Hölder's inequality.

The above theorem shows that the DCG criterion can be bounded through regression error. Although the theorem applies to any pair of $p$ and $q$ such that $1/p + 1/q = 1$, the most useful case is $p = q = 2$: in this case, the problem of minimizing $\sum_{j=1}^m (f(x_j, S) - f_B(x_j, S))^2$ can be directly attacked using the least squares regression in (7). If the regression error goes to zero, then the resulting ranking converges to the optimal DCG. Similarly, we can show the following result for the T criterion.

Theorem 3: Let $f(x, S)$ be a real-valued scoring function, which induces a ranking function $r_f$. We have the following relationship for each $S = \{x_1, \ldots, x_m\}$:
$$ \mathrm{T}(r_B', S) - \mathrm{T}(r_f, S) \le \frac{4}{\sqrt{m}} \left( \sum_{j=1}^m (f(x_j, S) - f_B(x_j, S))^2 \right)^{1/2}, $$
where $r_B'$ is an optimal Bayes ranking function for the T criterion, as characterized in Theorem 1.

Proof: Let $S = \{x_1, \ldots, x_m\}$, $r_f(S) = J = [j_1, \ldots, j_m]$, and $r_B'(S) = J_B = [j_1^*, \ldots, j_m^*]$. We have
$$ \begin{aligned} \mathrm{T}(r_f, S) &= \frac{2}{m(m-1)} \sum_{i=1}^{m-1} \sum_{i'=i+1}^m \left( f_B(x_{j_i}, S) - f_B(x_{j_{i'}}, S) \right) \\ &= \frac{2}{m(m-1)} \sum_{i=1}^{m-1} \sum_{i'=i+1}^m \left( f(x_{j_i}, S) - f(x_{j_{i'}}, S) \right) - \frac{2}{m(m-1)} \sum_{i=1}^{m-1} \sum_{i'=i+1}^m \left[ \left( f(x_{j_i}, S) - f_B(x_{j_i}, S) \right) - \left( f(x_{j_{i'}}, S) - f_B(x_{j_{i'}}, S) \right) \right] \\ &\ge \frac{2}{m(m-1)} \sum_{i=1}^{m-1} \sum_{i'=i+1}^m \left( f(x_{j_i}, S) - f(x_{j_{i'}}, S) \right) - \frac{2}{m(m-1)} \sum_{i=1}^{m-1} \sum_{i'=i+1}^m \left[ |f(x_{j_i}, S) - f_B(x_{j_i}, S)| + |f(x_{j_{i'}}, S) - f_B(x_{j_{i'}}, S)| \right]. \end{aligned} $$

By simplifying the last term, we have
$$ \begin{aligned} \mathrm{T}(r_f, S) &\ge \frac{2}{m(m-1)} \sum_{i=1}^{m-1} \sum_{i'=i+1}^m \left( f(x_{j_i}, S) - f(x_{j_{i'}}, S) \right) - \frac{2}{m} \sum_{i=1}^m |f(x_i, S) - f_B(x_i, S)| \\ &\ge \frac{2}{m(m-1)} \sum_{i=1}^{m-1} \sum_{i'=i+1}^m \left( f(x_{j_i^*}, S) - f(x_{j_{i'}^*}, S) \right) - \frac{2}{m} \sum_{i=1}^m |f(x_i, S) - f_B(x_i, S)| \\ &= \frac{2}{m(m-1)} \sum_{i=1}^{m-1} \sum_{i'=i+1}^m \left( f_B(x_{j_i^*}, S) - f_B(x_{j_{i'}^*}, S) \right) - \frac{2}{m(m-1)} \sum_{i=1}^{m-1} \sum_{i'=i+1}^m \left[ \left( f_B(x_{j_i^*}, S) - f(x_{j_i^*}, S) \right) - \left( f_B(x_{j_{i'}^*}, S) - f(x_{j_{i'}^*}, S) \right) \right] - \frac{2}{m} \sum_{i=1}^m |f(x_i, S) - f_B(x_i, S)| \\ &\ge \frac{2}{m(m-1)} \sum_{i=1}^{m-1} \sum_{i'=i+1}^m \left( f_B(x_{j_i^*}, S) - f_B(x_{j_{i'}^*}, S) \right) - \frac{2}{m(m-1)} \sum_{i=1}^{m-1} \sum_{i'=i+1}^m \left[ |f_B(x_{j_i^*}, S) - f(x_{j_i^*}, S)| + |f_B(x_{j_{i'}^*}, S) - f(x_{j_{i'}^*}, S)| \right] - \frac{2}{m} \sum_{i=1}^m |f(x_i, S) - f_B(x_i, S)| \\ &\ge \mathrm{T}(r_B', S) - \frac{4}{m} \sum_{i=1}^m |f(x_i, S) - f_B(x_i, S)|. \end{aligned} $$
The first inequality is a simplification of the last inequality in the previous chain of derivations. The second inequality in the above derivation is a direct consequence of the definition of $J = r_f(S)$, which implies that $J = [j_1, \ldots, j_m]$ achieves the maximum value of $\mathrm{T}(J, [f(x_i, S)]) = \frac{2}{m(m-1)} \sum_{i=1}^{m-1} \sum_{i'=i+1}^m (f(x_{j_i}, S) - f(x_{j_{i'}}, S))$ among all possible permutations of $[1, \ldots, m]$ (see the proof of Theorem 1). Now, Jensen's inequality implies that
$$ \frac{4}{m} \sum_{i=1}^m |f(x_i, S) - f_B(x_i, S)| \le \frac{4}{\sqrt{m}} \left( \sum_{i=1}^m (f(x_i, S) - f_B(x_i, S))^2 \right)^{1/2}. $$
Combining this estimate with the previous inequality, we obtain the desired bound.

The above approximation bounds imply that least squares regression can be used to learn the optimal ranking functions: the approximation error converges to zero when $f$ converges to $f_B$ in $L_2$. In general, however, requiring $f$ to converge to $f_B$ in $L_2$ is not necessary. More importantly, in real applications we are often only interested in the top portion of the rank-list, and our bounds should reflect this practical consideration. Assume that the coefficients $c_i$ in the DCG criterion decay fast, so that $\sum_i c_i$ is bounded (independent of $m$). In this case, we may pick $p = 1$ and $q = \infty$ in Theorem 2. If $\sup_j |f(x_j, S) - f_B(x_j, S)|$ is small, then we obtain a better bound than the least squares error bound with $p = q = 2$, which depends on $m$.

However, we cannot ensure that $\sup_j |f(x_j, S) - f_B(x_j, S)|$ is small using the simple least squares estimation in (7). Therefore in the following we develop a more refined bound for the DCG metric, which will then be used to motivate practical learning methods that improve on the simple least squares method.

Theorem 4: Let $f(x, S)$ be a real-valued scoring function, which induces a ranking function $r_f$. Given $S = \{x_1, \ldots, x_m\}$, let the optimal ranking order be $J_B = [j_1^*, \ldots, j_m^*]$, where $f_B(x_{j_i^*})$ is arranged in non-increasing order. Assume that $c_i = 0$ for all $i > k$. Then we have the following relationship for all $\gamma \in (0, 1)$, $p, q \ge 1$ such that $1/p + 1/q = 1$, $u > 0$, and every subset $K \subset \{1, \ldots, m\}$ that contains $j_1^*, \ldots, j_k^*$:
$$ \mathrm{DCG}(r_B, S) - \mathrm{DCG}(r_f, S) \le C_p(\gamma, u) \left( \sum_{j \in K} |f(x_j, S) - f_B(x_j, S)|^q + u \sup_{j \notin K} \left( f(x_j, S) - f_B'(x_j, S) \right)_+^q \right)^{1/q}, $$
where $(z)_+ = \max(z, 0)$, $M = f_B(x_{j_k^*}, S)$,
$$ C_p(\gamma, u) = \frac{2}{1 - \gamma} \left( \sum_{i=1}^k c_i^p + u^{-p/q} \left( \sum_{i=1}^k c_i \right)^p \right)^{1/p}, \qquad f_B'(x_j, S) = f_B(x_j, S) + \gamma \left( M - f_B(x_j, S) \right)_+. $$
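Before turning to the proof, the constant in Theorem 4 is easy to compute directly. The helper below is a sketch (its name is ours, not the paper's); it makes the trade-off in $u$ visible: a larger $u$ shrinks the $u^{-p/q}$ term of $C_p(\gamma, u)$, at the price of a heavier $\sup$ term inside the bound itself.

```python
# Constant C_p(gamma, u) from Theorem 4, assuming c_i = 0 for i > k and
# 1/p + 1/q = 1 with p > 1. The helper name is ours, not the paper's.
def theorem4_constant(c, gamma, u, p):
    """Return 2/(1-gamma) * (sum c_i^p + u^{-p/q} * (sum c_i)^p)^{1/p}."""
    q = p / (p - 1.0)
    body = sum(ci ** p for ci in c) + u ** (-p / q) * sum(c) ** p
    return 2.0 / (1.0 - gamma) * body ** (1.0 / p)
```

With $p = q = 2$ this is the constant $C_2(\gamma, u)$ used in the specialization (11) later in this section.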

Proof: Let $S = \{x_1, \ldots, x_m\}$, $r_f(S) = J = [j_1, \ldots, j_m]$, and $r_B(S) = J_B = [j_1^*, \ldots, j_m^*]$. Since $(f_B(x_{j_i^*}, S) - M)_+$ is in descending order of $i$, $[j_i^*]$ achieves the maximum of $\sum_{i=1}^m c_i (f_B(x_{j_i}, S) - M)_+$ among all possible permutations $[j_i]$ of $[1, \ldots, m]$. Therefore, using $c_i = 0$ ($i > k$), we have
$$ \sum_{i=1}^m c_i \left( (f_B(x_{j_i^*}, S) - M) - (f_B(x_{j_i}, S) - M)_+ \right) = \sum_{i=1}^m c_i \left( (f_B(x_{j_i^*}, S) - M)_+ - (f_B(x_{j_i}, S) - M)_+ \right) \ge 0. $$
This implies that
$$ \begin{aligned} \mathrm{DCG}(r_B, S) - \mathrm{DCG}(r_f, S) &= \sum_{i=1}^m c_i \left( (f_B(x_{j_i^*}, S) - M) - (f_B(x_{j_i}, S) - M) \right) \\ &= \sum_{i=1}^m c_i \left( (f_B(x_{j_i^*}, S) - M) - (f_B(x_{j_i}, S) - M)_+ \right) + \sum_{i=1}^m c_i \left( M - f_B(x_{j_i}, S) \right)_+ \\ &\le \frac{1}{1 - \gamma} \left[ \sum_{i=1}^m c_i \left( (f_B(x_{j_i^*}, S) - M) - (f_B(x_{j_i}, S) - M)_+ \right) + (1 - \gamma) \sum_{i=1}^m c_i \left( M - f_B(x_{j_i}, S) \right)_+ \right] \\ &= \frac{1}{1 - \gamma} \left[ \sum_{i=1}^m c_i \left( (f_B(x_{j_i^*}, S) - M) - (f_B(x_{j_i}, S) - M) \right) - \gamma \sum_{i=1}^m c_i \left( M - f_B(x_{j_i}, S) \right)_+ \right]. \end{aligned} $$
By using the definition of $f_B'$ to simplify the last inequality, we obtain:
$$ \begin{aligned} \mathrm{DCG}(r_B, S) - \mathrm{DCG}(r_f, S) &\le \frac{1}{1 - \gamma} \sum_{i=1}^m c_i \left( (f_B(x_{j_i^*}, S) - M) - (f_B'(x_{j_i}, S) - M) \right) \\ &\le \frac{1}{1 - \gamma} \left[ \sum_{i=1}^m c_i \left( f_B(x_{j_i^*}, S) - f(x_{j_i^*}, S) \right) - \sum_{i=1}^m c_i \left( f_B'(x_{j_i}, S) - f(x_{j_i}, S) \right) \right] \\ &\le \frac{1}{1 - \gamma} \left[ \sum_{i=1}^k c_i \left( f_B(x_{j_i^*}, S) - f(x_{j_i^*}, S) \right)_+ + \sum_{i=1}^k c_i \left( f(x_{j_i}, S) - f_B'(x_{j_i}, S) \right)_+ \right] \\ &\le \frac{1}{1 - \gamma} \left[ \sum_{i=1}^k c_i \left( f_B(x_{j_i^*}, S) - f(x_{j_i^*}, S) \right)_+ + \sum_{i=1}^k c_i \left( f(x_{j_i}, S) - f_B'(x_{j_i}, S) \right)_+ I(j_i \in K) + \left( \sum_{i=1}^k c_i \right) \sup_{j \notin K} \left( f(x_j, S) - f_B'(x_j, S) \right)_+ \right] \\ &\le \frac{1}{1 - \gamma} \left[ \left( \sum_{i=1}^k c_i^p \right)^{1/p} \left( \left( \sum_{j \in K} \left( f_B(x_j, S) - f(x_j, S) \right)_+^q \right)^{1/q} + \left( \sum_{j \in K} \left( f(x_j, S) - f_B'(x_j, S) \right)_+^q \right)^{1/q} \right) + \left( \sum_{i=1}^k c_i \right) \sup_{j \notin K} \left( f(x_j, S) - f_B'(x_j, S) \right)_+ \right] \\ &\le \frac{1}{1 - \gamma} \left[ 2 \left( \sum_{i=1}^k c_i^p \right)^{1/p} \left( \sum_{j \in K} |f_B(x_j, S) - f(x_j, S)|^q \right)^{1/q} + \left( \sum_{i=1}^k c_i \right) \sup_{j \notin K} \left( f(x_j, S) - f_B'(x_j, S) \right)_+ \right]. \end{aligned} $$
In the above derivation, the second inequality uses the fact that $J = [j_i]$ achieves the maximum value of $\sum_{i=1}^m c_i f(x_{j_i}, S)$ among all possible permutations $[j_i]$ of $[1, \ldots, m]$. Hölder's inequality has been applied to obtain the second to last inequality. The last inequality uses the fact that $f_B'(x_j, S) \ge f_B(x_j, S)$, which implies $(f(x_j, S) - f_B'(x_j, S))_+^q \le (f(x_j, S) - f_B(x_j, S))_+^q$. From the last inequality, we can apply Hölder's inequality once more to obtain the desired bound.
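The bounds in this section lend themselves to quick numerical sanity checks. Below is a sketch for the $p = q = 2$ case of Theorem 2 on a tiny example; the helper names, grades, and scores are ours (hypothetical), not from the paper.

```python
import math

# Sanity check of Theorem 2 with p = q = 2:
#   DCG(r_B,S) - DCG(r_f,S)
#     <= 2 (sum_i c_i^2)^{1/2} (sum_j (f(x_j,S) - f_B(x_j,S))^2)^{1/2}.
# The grades and scores below are hypothetical.
def induced_dcg(c, scores, fB):
    """DCG (measured against f_B) of the ranking sorted by descending score."""
    order = sorted(range(len(fB)), key=lambda j: -scores[j])
    return sum(ci * fB[j] for ci, j in zip(c, order))

c = [1.0, 1.0 / math.log2(3.0), 0.5]
fB = [3.0, 1.0, 0.0]     # conditional expectations f_B(x_j, S)
f = [2.0, 2.5, 0.5]      # an imperfect scoring function: swaps the top two

gap = induced_dcg(c, fB, fB) - induced_dcg(c, f, fB)
bound = 2.0 * math.sqrt(sum(ci * ci for ci in c)) * \
        math.sqrt(sum((a - b) ** 2 for a, b in zip(f, fB)))
```

Here the gap is strictly positive (the scoring function inverts the top two items) yet stays below the regression-error bound, as the theorem guarantees.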

If $f_B(x_j, S) \ge 0$ for all $x_j \in S$, then $f_B'(x_j, S) \ge \gamma f_B(x_{j_k^*}, S) = \gamma M$. Therefore, in this case we have (with $p = q = 2$):
$$ \mathrm{DCG}(r_B, S) - \mathrm{DCG}(r_f, S) \le C_2(\gamma, u) \left( \sum_{j \in K} |f(x_j, S) - f_B(x_j, S)|^2 + u \sup_{j \notin K} \left( f(x_j, S) - \gamma M \right)_+^2 \right)^{1/2}. \qquad (11) $$
Intuitively, the bound says the following: if we can find a set $K$ such that we are certain that any item $j \notin K$ is irrelevant, then we only need to estimate the quality scores of the items $j \in K$ reliably (using least squares regression), while making sure that the score $f(x_j, S)$ of any irrelevant item $j \notin K$ is no more than $\gamma M$ (which is a much easier task than estimating the expected quality score). If $|K| \ll m$, then the bound is a significant improvement over the standard least squares bound in Theorem 2, because the right hand side depends only on the sum of least squares errors over the $|K|$ error terms instead of $m$ error terms. Note that in standard least squares, we try to estimate the quality scores uniformly well. Although idealized, this assumption is quite reasonable for web-search, because given a query $q$, most pages can be reliably determined to be irrelevant. The remaining pages, which we are uncertain about, form the set $K$ that is much smaller than the total number of pages $m$. This bound motivates an importance weighted regression formulation, which we consider in Section VI. In practical implementations, we do not have to set $\gamma M$ as in the theorem, but rather regard it as a tuning parameter.

The bound in Theorem 4 can still be refined. However, the resulting inequalities become more complicated, so we will not include such bounds in this paper. Similar to Theorem 4, such refined bounds show that we do not have to estimate the conditional expectation uniformly well. We present a simple example as an illustration.

Proposition 1: Consider $m = 3$ and $S = \{x_1, x_2, x_3\}$. Let $c_1 = 2$, $c_2 = 1$, $c_3 = 0$, and $f_B(x_1, S) = 1$, $f_B(x_2, S) = f_B(x_3, S) = 0$. Let $f(x, S)$ be a real-valued scoring function, which induces a ranking function $r_f$. Then
$$ \mathrm{DCG}(r_B, S) - \mathrm{DCG}(r_f, S) \le 2 |f(x_1, S) - f_B(x_1, S)| + |f(x_2, S) - f_B(x_2, S)| + |f(x_3, S) - f_B(x_3, S)|. $$
The coefficients on the right hand side cannot be improved.

Proof: Note that $f$ is suboptimal only when either $f(x_2, S) \ge f(x_1, S)$ or $f(x_3, S) \ge f(x_1, S)$. This gives the following bound:
$$ \begin{aligned} \mathrm{DCG}(r_B, S) - \mathrm{DCG}(r_f, S) &\le I(f(x_2, S) \ge f(x_1, S)) + I(f(x_3, S) \ge f(x_1, S)) \\ &\le I(|f(x_2, S) - f_B(x_2, S)| + |f(x_1, S) - f_B(x_1, S)| \ge 1) + I(|f(x_3, S) - f_B(x_3, S)| + |f(x_1, S) - f_B(x_1, S)| \ge 1) \\ &\le \left[ |f(x_2, S) - f_B(x_2, S)| + |f(x_1, S) - f_B(x_1, S)| \right] + \left[ |f(x_3, S) - f_B(x_3, S)| + |f(x_1, S) - f_B(x_1, S)| \right] \\ &= 2 |f(x_1, S) - f_B(x_1, S)| + |f(x_2, S) - f_B(x_2, S)| + |f(x_3, S) - f_B(x_3, S)|. \end{aligned} $$
In the above, $I(\cdot)$ is the set indicator function. To see that the coefficients cannot be improved, we simply note that the bound is tight when either $f(x_1, S) = f(x_2, S) = f(x_3, S) = 0$, or $f(x_1, S) = f(x_2, S) = 1$ and $f(x_3, S) = 0$, or $f(x_2, S) = 0$ and $f(x_1, S) = f(x_3, S) = 1$.

The proposition does not make the same assumption as in Theorem 4, where we assume that there are many irrelevant items that can be reliably identified. The proposition implies that even in more general situations, we should not weight all errors equally. In this example, getting $x_1$ right is more important than getting $x_2$ or $x_3$ right. Conceptually, Theorem 4 and Proposition 1 show the following:

• Since we are interested in the top portion of the rank-list, we only need to estimate the top rated items accurately, while preventing the bottom items from being over-estimated (their conditional expectations do not have to be estimated accurately).
• For ranking purposes, some points are more important than others. Therefore we should bias our learning method to produce more accurate conditional expectation estimates at the more important points.

VI. IMPORTANCE WEIGHTED REGRESSION

The key message from the analysis in Section V is that we do not have to estimate the conditional expectations equally well for all items. In particular, since we are interested in the top portion of the rank-list, Theorem 4 implies that we need to estimate the top portion more accurately than the bottom portion. Motivated by this analysis, we consider a regression based training method that solves the DCG optimization problem but weights different points differently according to their importance. We shall not discuss the implementation details of modeling the function $f(x, S)$, which is beyond the scope of this paper.
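Proposition 1 can also be stress-tested directly. The following randomized sketch (the score generator is a hypothetical test harness of ours) verifies that the stated coefficients always dominate the DCG gap:

```python
import random

# Randomized check of Proposition 1: with c = (2, 1, 0) and
# f_B = (1, 0, 0), any scoring f = (f1, f2, f3) satisfies
#   DCG(r_B,S) - DCG(r_f,S) <= 2|f1 - 1| + |f2| + |f3|.
# The random score generator is a hypothetical test harness.
c = [2.0, 1.0, 0.0]
fB = [1.0, 0.0, 0.0]

def induced_dcg(scores):
    order = sorted(range(3), key=lambda j: -scores[j])
    return sum(ci * fB[j] for ci, j in zip(c, order))

optimal = induced_dcg(fB)       # = 2, attained by ranking x_1 first
random.seed(1)
worst_slack = min(
    (2.0 * abs(f[0] - 1.0) + abs(f[1]) + abs(f[2])) - (optimal - induced_dcg(f))
    for f in ([random.uniform(-2.0, 2.0) for _ in range(3)] for _ in range(2000))
)
```

A non-negative `worst_slack` over many random scorings is consistent with the proposition; the tightness cases listed in the proof show the slack can actually reach zero.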

Let $F$ be a function space that contains functions $X \times S \to \mathbb{R}$. We draw $n$ sets $S_1, \ldots, S_n$ randomly, where $S_i = \{x_{i,1}, \ldots, x_{i,m}\}$, with the corresponding grades $\{y_{i,j}\} = \{y_{i,1}, \ldots, y_{i,m}\}$ (in this notation, $i$ is fixed and the set $\{y_{i,j}\}$ ranges over $j = 1, \ldots, m$). Based on Theorem 2, the simple least squares regression (7) can be applied. However, this direct regression method is not adequate for many practical problems, such as web-search, in which there are many items to rank (that is, $m$ is large) but only the top ranked pages are important. This is because the method pays equal attention to relevant and irrelevant pages. In reality, one should pay more attention to the top-ranked (relevant) pages. The grades of lower ranked pages do not need to be estimated accurately, as long as we do not over-estimate them so that these pages appear in the top ranked positions.

The above mentioned intuition can be captured by Theorem 4 and Proposition 1, which motivate the following alternative training method:
$$ \hat{f} = \arg\min_{f \in F} \frac{1}{n} \sum_{i=1}^n L(f, S_i, \{y_{i,j}\}), \qquad (12) $$
where for $S = \{x_1, \ldots, x_m\}$, with the corresponding $\{y_j\}$, we have the following importance weighted regression loss as in equation (11):
$$ L(f, S, \{y_j\}) = \sum_{j=1}^m w(x_j, S) \left( f(x_j, S) - y_j \right)^2 + u \sup_j w'(x_j, S) \left( f(x_j, S) - \delta(S) \right)_+^2, \qquad (13) $$
where $u$ is a non-negative parameter. Throughout this section, an abstract notion of a weight function $w(x_j, S)$ is employed; this may equivalently represent a re-weighting of either the training sample probability distribution or the loss function. The number $\delta(S)$ is a set-dependent threshold that corresponds to $\gamma M$ in (11). However, since $\gamma M$ is not known, in practice $\delta(S)$ may be considered a tuning parameter, and a good choice may be determined with heuristics or cross-validation.

A variation of this method is used to optimize the production system of Yahoo's internet search engine. The detailed implementation and parameter choices cannot be disclosed here (some aspects of the implementation were covered in [10]), and they are also irrelevant for the purpose of this paper. However, in the following we shall briefly explain the intuition behind (13) using Theorem 4, together with some practical considerations.

It is useful to mention that the distribution of web search relevance is heavily skewed, in the sense that relevance is a "rare event". In fact, empirically we observe that the distribution of a regression-based scoring function $f$ approximately fits a probability model with exponential decay of the form $P(f > t) < \exp(-at + b)$. The significance is that the improved rate of convergence to the optimal DCG over naive uniform regression is due to both the relative importance of top ranked documents in the DCG cost function and the exponentially small probability of the relevant documents. In order to boost the presence of relevant documents, which are of maximal importance to DCG, the reader should bear in mind that importance can be used as a component in the weighting scheme $w(x_j, S)$.

The weight function $w(x_j, S)$ in (13) is chosen so that it focuses only on the most important examples (the weight is set to zero for pages that we know are irrelevant). This part of the formulation corresponds to the first part of the bound in Theorem 4 (in that case, we choose $w(x_j, S)$ to be one for the top part of the examples, with index set $K$, and zero otherwise). The usefulness of non-uniform weighting is also demonstrated by Proposition 1. The specific choice of the weight function requires various engineering considerations that are not important for the purpose of this paper. In general, if there are many items with similar grades, then it is beneficial to give each of the similar items a smaller weight.

In the second part of (13), we choose $w'(x_j, S)$ so that it focuses on the examples not covered by $w(x_j, S)$. In particular, it only covers those data points $x_j$ that are low-ranked with high confidence. We choose $\delta(S)$ to be a small threshold that can be regarded as a lower bound on $\gamma M$ in (11). An important observation is that although $m$ is often very large, the number of points for which $w(x_j, S)$ is nonzero is often small. Moreover, $(f(x_j, S) - \delta(S))_+$ is nonzero only when $f(x_j, S) \ge \delta(S)$, and in practice the number of such points is usually small (that is, most irrelevant pages will be predicted as irrelevant). Therefore the formulation completely ignores those low-ranked data points with $f(x_j, S) \le \delta(S)$. This makes the learning procedure computationally efficient even when $m$ is large. The analogy here is support vector machines, where only the support vectors are useful in the learning formulation, and one can completely ignore samples corresponding to non support vectors.

In the practical implementation of (13), we can use an iterative refinement scheme, in which we start with a small number of samples included in the first part of (13), and then put the low-ranked points into the second part of (13) only when their ranking scores exceed $\delta(S)$. In fact, one may also put these points into the first part of (13), so that the second part always has zero value (which makes the implementation simpler). In this sense,

the formulation in (13) suggests a selective sampling scheme, in which we pay special attention to important and highly ranked data points, while completely ignoring most of the low ranked data points. In this regard, with an appropriately chosen $w(x, S)$, the second part of (13) can be completely ignored.

The empirical risk minimization method in (12) approximately minimizes the following criterion:
$$ Q(f) = \mathbf{E}_S\, L(f, S), \qquad (14) $$
where
$$ L(f, S) = \mathbf{E}_{\{y_j\}|S}\, L(f, S, \{y_j\}) = \sum_{j=1}^m w(x_j, S)\, \mathbf{E}_{y_j|(x_j,S)} \left( f(x_j, S) - y_j \right)^2 + u \sup_j w'(x_j, S) \left( f(x_j, S) - \delta(S) \right)_+^2. $$
The following theorem shows that under appropriate assumptions, approximate minimization of (14) leads to approximate optimization of DCG. For clarity, the assumptions are idealized. For example, in practice the condition $\delta(S) \le \gamma f_B(x_{j_k^*}, S)$ may be violated for some $S$; however, as long as it holds for most $S$, the consequence of the theorem remains approximately valid. In this regard, the theorem itself should only be considered a formal justification of (12) under idealized assumptions that specify good parameter choices in (12). The method itself may still yield good performance when some of the assumptions fail.

Theorem 5: Assume that $c_i = 0$ for all $i > k$, and assume that the following conditions hold for each $S = \{x_1, \ldots, x_m\}$:
• Let the optimal ranking order be $J_B = [j_1^*, \ldots, j_m^*]$, where $f_B(x_{j_i^*})$ is arranged in non-increasing order.
• For all $x_j \in S$, $f_B(x_j, S) \ge 0$.
• There exists $\gamma \in [0, 1)$ such that $\delta(S) \le \gamma f_B(x_{j_k^*}, S)$, where $\delta(S)$ is the appropriately chosen set-dependent threshold in (13).
• For all $x_j$ with $f_B(x_j, S) > \delta(S)$, we have $w(x_j, S) \ge 1$.
• Let $w'(x_j, S) = I(w(x_j, S) < 1)$.

Then the following results hold:
• A function $f_*$ minimizes (14) if $f_*(x_j, S) = f_B(x_j, S)$ when $w(x_j, S) > 0$ and $f_*(x_j, S) \le \delta(S)$ otherwise.
• For all $f$, let $r_f$ be the induced ranking function, and let $r_B$ be the optimal Bayes ranking function. Then
$$ \mathrm{DCG}(r_B) - \mathrm{DCG}(r_f) \le C(\gamma, u) \left( Q(f) - Q(f_*) \right)^{1/2}. $$

Proof: Note that if $f_B(x_j, S) > \delta(S)$, then $w(x_j, S) \ge 1$ and $w'(x_j, S) = 0$. Therefore the minimizer $f_*(x_j, S)$ should minimize $\mathbf{E}_{y_j|(x_j,S)} (f(x_j, S) - y_j)^2$, which is achieved at $f_*(x_j, S) = f_B(x_j, S)$. If $f_B(x_j, S) \le \delta(S)$, then there are two cases:
• If $w(x_j, S) > 0$, then $f_*(x_j, S)$ should minimize $\mathbf{E}_{y_j|(x_j,S)} (f(x_j, S) - y_j)^2$, achieved at $f_*(x_j, S) = f_B(x_j, S)$.
• If $w(x_j, S) = 0$, then $f_*(x_j, S)$ should minimize $(f(x_j, S) - \delta(S))_+^2$, achieved at any $f_*(x_j, S) \le \delta(S)$.

This proves the first claim. For each $S$, denote by $K$ the set of $x_j$ such that $w'(x_j, S) = 0$. The second claim follows from the following derivation:
$$ \begin{aligned} Q(f) - Q(f_*) &= \mathbf{E}_S \left( L(f, S) - L(f_*, S) \right) \\ &= \mathbf{E}_S \left[ \sum_{j=1}^m w(x_j, S) \left( f(x_j, S) - f_B(x_j, S) \right)^2 + u \sup_j w'(x_j, S) \left( f(x_j, S) - \delta(S) \right)_+^2 \right] \\ &\ge \mathbf{E}_S \left[ \sum_{j \in K} \left( f_B(x_j, S) - f(x_j, S) \right)^2 + u \sup_{j \notin K} \left( f(x_j, S) - \delta(S) \right)_+^2 \right] \\ &\ge \mathbf{E}_S \left[ \sum_{j \in K} \left( f_B(x_j, S) - f(x_j, S) \right)^2 + u \sup_{j \notin K} \left( f(x_j, S) - f_B'(x_j, S) \right)_+^2 \right] \\ &\ge \mathbf{E}_S \left( \mathrm{DCG}(r_B, S) - \mathrm{DCG}(r_f, S) \right)^2 C(\gamma, u)^{-2} \\ &\ge \left( \mathrm{DCG}(r_B) - \mathrm{DCG}(r_f) \right)^2 C(\gamma, u)^{-2}. \end{aligned} $$
Note that the second inequality follows from $f_B'(x_j, S) \ge \gamma f_B(x_{j_k^*}, S) \ge \delta(S)$, and the third inequality follows from Theorem 4.

VII. ASYMPTOTIC ANALYSIS

In this section, we analyze the asymptotic statistical performance of (12). The analysis depends on the underlying function class $F$. In the literature, one often employs a linear function class with an appropriate regularization condition, such as $L_1$ or $L_2$ regularization of the linear weight coefficients. Yahoo's machine learning ranking system employs the gradient boosting method described in [13], which is closely related to $L_1$ regularization, analyzed in [4], [18], [19]. Although the

consistency of boosting for the standard least squares regression is known (for example, see [5], [32]), such analysis does not deal with the situation in which $m$ is large, and it is thus not suitable for analyzing the ranking problem considered in this paper.

In this section, we consider a linear function class with $L_2$ regularization, which is closely related to kernel methods. We employ a relatively simple stability analysis which is suitable for $L_2$ regularization. Our result does not depend on $m$ explicitly, which is important for large scale ranking problems such as web-search. Although similar results can be obtained for $L_1$ regularization or gradient boosting, the analysis would become much more complicated.

For $L_2$ regularization, we consider a feature map $\psi : X \times S \to H$, where $H$ is a vector space. We denote by $w^T v$ the $L_2$ inner product of $w$ and $v$ in $H$. The function class $F$ considered here is of the following form:
$$ \{ \beta^T \psi(x, S) ;\ \beta \in H,\ \beta^T \beta \le A^2 \} \subset \{ X \times S \to \mathbb{R} \}, \qquad (15) $$
where the complexity is controlled by the $L_2$ regularization $\beta^T \beta \le A^2$ of the weight vector. We use $(S_i = \{x_{i,1}, \ldots, x_{i,m}\}, \{y_{i,j}\})$ to indicate a sample point indexed by $i$. As before, we let $\{y_{i,j}\} = \{y_{i,1}, \ldots, y_{i,m}\}$. Note that for each sample $i$, we do not need to assume that the $y_{i,j}$ are independently generated for different $j$. Using (15), the importance weighted regression in (12) becomes the following regularized empirical risk minimization method:
$$ \begin{aligned} f_{\hat{\beta}}(x, S) &= \hat{\beta}^T \psi(x, S), \\ \hat{\beta} &= \arg\min_{\beta \in H} \left[ \frac{1}{n} \sum_{i=1}^n L(\beta, S_i, \{y_{i,j}\}) + \lambda \beta^T \beta \right], \qquad (16) \\ L(\beta, S, \{y_j\}) &= \sum_{j=1}^m w(x_j, S) \left( \beta^T \psi(x_j, S) - y_j \right)^2 + u \sup_j w'(x_j, S) \left( \beta^T \psi(x_j, S) - \delta(S) \right)_+^2. \end{aligned} $$
In this method, we replace the hard regularization in (15), with tuning parameter $A$, by soft regularization with tuning parameter $\lambda$, which is computationally more convenient.

The following result is an expected generalization bound for the $L_2$-regularized empirical risk minimization method (16), which uses the stability analysis in [29]. The bound compares the performance of the finite sample statistical estimator to that of the optimal function in the underlying class; such a bound is generally referred to as an oracle inequality in the literature. The proof is in Appendix A.

Theorem 6: Let $M = \sup_{x,S} \|\psi(x, S)\|_2$ and $W = \sup_S \left[ \sum_{x_j \in S} w(x_j, S) + u \sup_{x_j \in S} w'(x_j, S) \right]$. Let $f_{\hat{\beta}}$ be the estimator defined in (16). Then we have
$$ \mathbf{E}_{\{S_i, \{y_{i,j}\}\}_{i=1}^n}\, Q(f_{\hat{\beta}}) \le \left( 1 + \frac{W M^2}{\sqrt{2 \lambda n}} \right)^2 \inf_{\beta \in H} \left[ Q(f_\beta) + \lambda \beta^T \beta \right]. $$

We have paid special attention to the properties of (16). In particular, the quantity $W$ is usually much smaller than $m$, which is large for web-search applications. The point we would like to emphasize here is that even though the number $m$ is large, the estimation complexity is only affected by the top portion of the rank-list. If the estimation of the lowest ranked items is relatively easy (as is generally the case), then the learning complexity does not depend on the majority of items near the bottom of the rank-list.

We can combine Theorem 5 and Theorem 6, giving the following bound:

Theorem 7: Suppose the conditions in Theorem 5 and Theorem 6 hold, with $f_*$ minimizing (14). Let $\hat{f} = f_{\hat{\beta}}$. Then we have
$$ \mathbf{E}_{\{S_i, \{y_{i,j}\}\}_{i=1}^n}\, \mathrm{DCG}(r_{\hat{f}}) \ge \mathrm{DCG}(r_B) - C(\gamma, u) \left[ \left( 1 + \frac{W M^2}{\sqrt{2 \lambda n}} \right)^2 \inf_{\beta \in H} \left( Q(f_\beta) + \lambda \beta^T \beta \right) - Q(f_*) \right]^{1/2}. $$

Proof: From Theorem 5, we obtain
$$ \mathrm{DCG}(r_B) - \mathbf{E}_{\{S_i, \{y_{i,j}\}\}_{i=1}^n}\, \mathrm{DCG}(r_{\hat{f}}) \le C(\gamma, u)\, \mathbf{E}_{\{S_i, \{y_{i,j}\}\}_{i=1}^n} \left( Q(\hat{f}) - Q(f_*) \right)^{1/2} \le C(\gamma, u) \left[ \mathbf{E}_{\{S_i, \{y_{i,j}\}\}_{i=1}^n}\, Q(f_{\hat{\beta}}) - Q(f_*) \right]^{1/2}. $$
The second inequality is a consequence of Jensen's inequality. Now, applying Theorem 6 yields the desired bound.

The theorem implies that if $Q(f_*) = \inf_{\beta \in H} Q(f_\beta)$, then as $n \to \infty$ we can let $\lambda \to 0$ and $\lambda n \to \infty$, so that the second term on the right hand side vanishes in the large sample limit. Therefore, asymptotically, we can achieve the optimal DCG score. This implies the consistency of regression based learning methods for the DCG criterion. Moreover, the rate of convergence does not depend on $m$, but rather on the relatively small quantities $W$ and $M$.

VIII. CONCLUSION

Ranking problems have many important real world applications. Although various formulations have been investigated in the literature, most theoretical results are concerned with global ranking using the pair-wise AUC

criterion. Motivated by applications such as web-search, we introduced the subset ranking problem, and focused on the DCG criterion that measures the quality of the top-ranked items.

We derived bounds that relate the optimization of DCG scores to the minimization of convex regression errors. In our analysis, it is essential to weight samples differently according to their importance. These bounds are used to motivate modifications of least squares regression methods that focus on the top portion of the rank-list. In addition to conceptual advantages, these methods have significant computational advantages over standard regression methods, because only a small number of items contribute to the solution; this means that they are computationally efficient to solve. The implementation of these methods can be achieved through appropriate selective sampling procedures. Moreover, we showed that the expected generalization performance of the system does not depend on $m$; instead, it only depends on the estimation quality of the top ranked items. Again, this is important for many practical applications.

The results obtained here are closely related to the theoretical analysis of solving classification methods using convex optimization formulations. Our theoretical results show that the regression approach provides a solid basis for solving the subset ranking problem. The practical value of such methods is also significant: in Yahoo's case, substantial improvement of DCG has been achieved after the deployment of a machine learning based ranking system.

Although the DCG criterion is difficult to optimize directly, it is a natural metric for ranking. The investigation of convex surrogate formulations provides a systematic approach to developing efficient machine learning methods for solving this difficult problem. This paper shows that with appropriate features, importance weighted regression methods can produce the optimal scoring function in the large sample limit. However, the regression methods proposed in this paper may not necessarily be optimal algorithms for learning ranking functions. Other methods, such as the pair-wise preference learning of Section II, can also be effective. It will be interesting to investigate such alternative formulations using a similar analysis.

APPENDIX

We shall introduce the following notation: let $Z_n = \{(S_i, \{y_{i,j}\}) : i = 1, \ldots, n\}$. Let $\hat{\beta}(Z_n)$ be the solution of (16), and let $\hat{\beta}(Z_{n+1})$ be the solution using training data $Z_{n+1}$:
$$ \hat{\beta}(Z_{n+1}) = \arg\min_{\beta \in H} \left[ \frac{1}{n} \sum_{i=1}^{n+1} L(\beta, S_i, \{y_{i,j}\}) + \lambda \beta^T \beta \right]. $$
We have the following stability lemma from [29], which can be stated in our notation as:

Lemma 1: The following inequality holds:
$$ \left\| \hat{\beta}(Z_n) - \hat{\beta}(Z_{n+1}) \right\|_2 \le \frac{1}{2 \lambda n} \left\| \frac{\partial}{\partial \beta} L(\hat{\beta}(Z_{n+1}), S_{n+1}, \{y_{n+1,j}\}) \right\|_2, $$
where $\frac{\partial}{\partial \beta} L(\beta, S, \{y_j\})$ denotes a subgradient of $L$ with respect to $\beta$.

Note that from simple subgradient algebra in [22], we know that a subgradient of $\sup_j L_j(\beta)$ for convex functions $L_j(\beta)$ can be written as $\sum_j \alpha_j\, \partial L_j(\beta)/\partial \beta$, where $\sum_j \alpha_j \le 1$ and $\alpha_j \ge 0$. Therefore we can find $\alpha_j \ge 0$ with $\sum_j \alpha_j \le 1$ such that
$$ \begin{aligned} &\left\| \frac{\partial}{\partial \beta} L(\hat{\beta}(Z_{n+1}), S_{n+1}, \{y_{n+1,j}\}) \right\|_2 \\ &\quad = 2 \left\| \sum_{j=1}^m w(x_{n+1,j}, S_{n+1}) \left( \hat{\beta}(Z_{n+1})^T \psi(x_{n+1,j}, S_{n+1}) - y_{n+1,j} \right) \psi(x_{n+1,j}, S_{n+1}) + u \sum_{j=1}^m \alpha_j\, w'(x_{n+1,j}, S_{n+1}) \left( \hat{\beta}(Z_{n+1})^T \psi(x_{n+1,j}, S_{n+1}) - \delta(S_{n+1}) \right)_+ \psi(x_{n+1,j}, S_{n+1}) \right\|_2 \\ &\quad \le 2\, L(\hat{\beta}(Z_{n+1}), S_{n+1}, \{y_{n+1,j}\})^{1/2} \left( \sum_{j=1}^m \left( w(x_{n+1,j}, S_{n+1}) + u\, \alpha_j\, w'(x_{n+1,j}, S_{n+1}) \right) \left\| \psi(x_{n+1,j}, S_{n+1}) \right\|_2^2 \right)^{1/2} \\ &\quad \le 2\, L(\hat{\beta}(Z_{n+1}), S_{n+1}, \{y_{n+1,j}\})^{1/2} \left( \sum_{j=1}^m w(x_{n+1,j}, S_{n+1}) + u \sup_j w'(x_{n+1,j}, S_{n+1}) \right)^{1/2} M, \end{aligned} $$
where the first inequality in the derivation is a direct application of the Cauchy–Schwartz inequality. Now, applying Lemma 1 with $\delta\beta = \hat{\beta}(Z_n) - \hat{\beta}(Z_{n+1})$, we can derive the following chain of inequalities. The first inequality uses the following form of Jensen's inequality:
$$ \sum_i \rho_i (a_i + b_i)^2 \le \left[ \left( \sum_i \rho_i a_i^2 \right)^{1/2} + \left( \sum_i \rho_i b_i^2 \right)^{1/2} \right]^2; $$
the third inequality is due to $(a + b)^2 \le (1 + s) a^2 + (1 + s^{-1}) b^2$ (where $s > 0$); the fourth inequality uses

16 Lemma 1. loss with respect to the augmented training data Zn+1 ˆ over β ∈ H. Now, in order to obtain the desired bound, L(β(Zn),Sn+1, {yn+1,j}) we simply take expectation with respect to Zn+1 on both ˆ =L(β(Zn+1) + δβ, Sn+1, {yn+1,j}) sides. h ˆ 1/2 ≤ L(β(Zn+1),Sn+1, {yn+1,j}) REFERENCES  m X T 2 [1] S. Agarwal, T. Graepel, R. Herbrich, S. Har-Peled, and D. Roth. +  w(xn+1,j,Sn+1)|δβ ψ(xn+1,j,Sn+1)| Generalization bounds for the area under the ROC curve. Journal j=1 of Machine Learning Research, 6:393–425, 2005. [2] S. Agarwal and D. Roth. Learnability of bipartite ranking 1/2#2 0 T 2 functions. In Proceedings of the 18th Annual Conference on +u sup w (xj,S)|δβ ψ(xn+1,j,Sn+1)| Learning Theory, 2005. j [3] P. Bartlett, M. Jordan, and J. McAuliffe. Convexity, classification, ˆ 1/2 and risk bounds. Journal of the American Statistical Association, ≤[L(β(Zn+1),Sn+1, {yn+1,j}) 101(473):138–156, 2006. + W (S )1/2Mkδβk ]2 [4] G. Blanchard, G. Lugosi, and N. Vayatis. On the rate of con- n+1 2 vergence of regularized boosting classifiers. Journal of Machine ˆ ≤(1 + s)L(β(Zn+1),Sn+1, {yn+1,j}) Learning Research, 4:861–894, 2003. −1 2 2 [5] P. Buhlmann.¨ Boosting for high-dimensional linear models. + (1 + s )W (Sn+1)kδβk2M Annals of Statistics, 34:559–583, 2006. ˆ −1 [6] C. Burges, T. Shaked, E. Renshaw, A. Lazier, M. Deeds, ≤(1 + s)L(β(Zn+1),Sn+1, {yn+1,j}) + (1 + s ) N. Hamilton, and G. Hullender. Learning to rank using gradient M 2 ∂ 2 descent. In ICML’05, 2005. ˆ [7] A. Caponnetto. A note on the role of squared loss in regression. · W (Sn+1) 2 2 L(β(Zn+1),Sn+1, {yn+1,j}) 4λ n ∂β 2 Technical report, CBCL, Massachusetts Institute of Technology, ˆ 2005. ≤L(β(Zn+1),Sn+1, {yn+1,j}) [8] S. Clemencon, G. Lugosi, and N. Vayatis. Ranking and scoring  4  using empirical risk minimization. In COLT’05, 2005. −1 2 M · (1 + s) + (1 + s )W (Sn+1) , [9] W. W. Cohen, R. E. Schapire, and Y. Singer. Learning to order 2λ2n2 things. JAIR, 10:243–270, 1999. [10] D. Cossock. 
Method and apparatus for machine learning a where we define document relevance function. US patent 7197497, 2007. X 0 [11] P. Crescenzi and V. Kann. A compendium of np W (S) = w(xj,S) + u sup w (xj,S). optimization problems. Technical Report SI/RR- xj ∈S xj ∈S 95/02, Dipartimento di Scienze dell’Informazione, Universit di Roma ”La Sapienza”, 1995. updated at By optimizing over s, we obtain http://www.nada.kth.se/∼viggo/wwwcompendium. [12] Y. Freund, R. Iyer, R. E. Schapire, and Y. Singer. An efficient ˆ L(β(Zn),Sn+1, {yn+1,j}) boosting algorithm for combining preferences. JMLR, 4:933–969, 2003.  2 2 W (Sn+1)M ˆ [13] J. Friedman. Greedy function approximation: A gradient boosting ≤ 1 + √ L(β(Zn+1),Sn+1, {yn+1,j}). 2λn machine. The Annals of Statistics, 29:1189–1232, 2001. [14] J. Hanley and B. McNeil. The meaning and use of the Area under (i) a Receiver Operating Characetristic (ROC) curve. Radiology, Now denote by Zn+1 the training data obtained from pages 29–36, 1982. Zn+1 by removing the i-th datum (Si, {yi,j}), and let [15] R. Herbrich, T. Graepel, and K. Obermayer. Large margin ˆ (i) rank boundaries for ordinal regression. In B. S. A. Smola, β(Zn+1) be the solution of (16) with Zn replaced by (i) P. Bartlett and D. Schuurmans, editors, Advances in Large Margin Zn+1, then we have: Classifiers, pages 115–132. MIT Press, 2000. [16] K. Jarvelin and J. Kekalainen. IR evaluation methods for n+1 retrieving highly relevant documents. In SIGIR’00, pages 41– X ˆ (i) L(β(Zn+1),Si, {yi,j}) 48, 2000. i=1 [17] T. Joachims. Optimizing search engines using clickthrough data. In Proceedings of the ACM Conference on Knowledge Discovery  2 2 n+1 WM X ˆ and Data Mining (KDD), 2002. ≤ 1 + √ L(β(Zn+1),Si, {yi,j}) [18] G. Lugosi and N. Vayatis. On the Bayes-risk consistency of 2λn i=1 regularized boosting methods. The Annals of Statistics, 32:30– 55, 2004. with discussion.  2 2 "n+1 WM X [19] S. Mannor, R. Meir, and T. Zhang. 
Greedy algorithms for ≤ 1 + √ inf L(β, Si, {yi,j}) 2λn β∈H classification - consistency, convergence rates, and adaptivity. i=1 Journal of Machine Learning Research, 4:713–741, 2003. +λ(n + 1)βT β . [20] P. McCullagh and J. A. Nelder. Generalized linear models. Chapman & Hall, London, 1989. The second inequality is due to the definition of [21] F. Radlinski and T. Joachims. Query chains: Learning to rank ˆ from implicit feedback. In Proceedings of the ACM Conference β(Zn+1), which minimizes the regularized empirical on Knowledge Discovery and Data Mining (KDD), 2005.

17 [22] R. T. Rockafellar. Convex analysis. Princeton University Press, Princeton, NJ, 1970. [23] S. Rosset. via the AUC. In ICML’04, 2004. [24] C. Rudin. Ranking with a p-norm push. In COLT 06, 2006. [25] S. Shalev-Shwartz and Y. Singer. Efficient learning of label ranking by soft projections onto polyhedra. Journal of Machine Learning Research, 7:1567–1599, 2006. [26] I. Steinwart. Support vector machines are universally consistent. J. Complexity, 18:768–791, 2002. [27] A. Tewari and P. Bartlett. On the consistency of multiclass classification methods. In COLT, 2005. [28] Z. Zha, Z. Zheng, H. Fu, and G. Sun. Incorporating query difference for learning retrieval functions in information retrieval. In SIGIR, 2006. [29] T. Zhang. Leave-one-out bounds for kernel methods. Neural Computation, 15:1397–1437, 2003. [30] T. Zhang. Statistical analysis of some multi-category large margin classification methods. Journal of Machine Learning Research, 5:1225–1251, 2004. [31] T. Zhang. Statistical behavior and consistency of classification methods based on convex risk minimization. The Annals of Statistics, 32:56–85, 2004. with discussion. [32] T. Zhang and B. Yu. Boosting with early stopping: Convergence and consistency. The Annals of Statistics, 33:1538–1579, 2005.

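As a concrete supplement to the appendix, the stability inequality of Lemma 1 can be checked numerically in the smooth special case u = 0, where the importance-weighted loss reduces to weighted ridge regression and both minimizers have closed forms. The sketch below is only an illustration: the sizes, data, and variable names are hypothetical and not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: n+1 subsets S_i, each with m items and d-dimensional features.
n, m, d = 20, 5, 3
lam = 0.1  # regularization parameter lambda in (16)

psi = rng.normal(size=(n + 1, m, d))        # feature vectors psi(x_{i,j}, S_i)
y = rng.normal(size=(n + 1, m))             # target grades y_{i,j}
w = rng.uniform(0.1, 1.0, size=(n + 1, m))  # importance weights w(x_{i,j}, S_i)

def solve(idx):
    """Closed-form minimizer of (1/n) sum_{i in idx} L(beta, S_i, {y_ij}) + lam * beta^T beta,
    where L(beta, S, {y_j}) = sum_j w_j (beta^T psi_j - y_j)^2 (the u = 0 case)."""
    A = lam * np.eye(d)
    b = np.zeros(d)
    for i in idx:
        A += psi[i].T @ (w[i][:, None] * psi[i]) / n
        b += psi[i].T @ (w[i] * y[i]) / n
    return np.linalg.solve(A, b)

beta_n = solve(range(n))        # estimator trained on Z_n
beta_n1 = solve(range(n + 1))   # estimator trained on Z_{n+1}

# Gradient of L(., S_{n+1}, {y_{n+1,j}}) evaluated at beta_n1.
resid = psi[n] @ beta_n1 - y[n]
grad = 2 * psi[n].T @ (w[n] * resid)

lhs = np.linalg.norm(beta_n - beta_n1)
rhs = np.linalg.norm(grad) / (2 * lam * n)
print(lhs <= rhs + 1e-12)  # Lemma 1 bound holds; prints True
```

The bound follows here from the 2*lam strong convexity contributed by the regularizer alone, so the check passes with substantial slack on random data; adding the non-smooth u-term would require subgradients but the same inequality applies.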