The Lovász-Bregman Divergence and connections to rank aggregation, clustering, and web ranking

Rishabh Iyer, Dept. of Electrical Engineering, University of Washington, Seattle, WA-98175, USA
Jeff Bilmes, Dept. of Electrical Engineering, University of Washington, Seattle, WA-98175, USA

Abstract

We extend the recently introduced theory of Lovász-Bregman (LB) divergences [19] in several ways. We show that they represent a distortion between a "score" and an "ordering", thus providing a new view of rank aggregation and order based clustering with interesting connections to web ranking. We show how the LB divergences have a number of properties akin to many permutation based metrics, and in fact have as special cases forms very similar to the Kendall-τ. We also show how the LB divergences subsume a number of commonly used ranking measures in information retrieval, like NDCG [22] and AUC [35]. Unlike the traditional permutation based metrics, however, the LB divergences naturally capture a notion of "confidence" in the orderings, thus providing a new representation for applications involving aggregating scores as opposed to just orderings. We show how a number of recently used web ranking models are forms of Lovász-Bregman rank aggregation, and also observe that a natural form of the Mallows model using the LB divergence has been used as a conditional ranking model for the "Learning to Rank" problem.

1 Introduction

The Bregman divergence first appeared in the context of relaxation techniques in convex programming [5], and has found numerous applications as a general framework in clustering [3], proximal minimization [7], and others. Many of these applications are due to the nice properties of the Bregman divergence, and the fact that they are parameterized by a single convex function. They also generalize a large class of divergences between vectors.

In this paper, we investigate a specific class of Bregman divergences, parameterized via the Lovász extension of a submodular function. Submodular functions are a special class of discrete functions with interesting properties. Let V refer to a finite ground set {1, 2, ..., |V|}. A set function f : 2^V → R is submodular if ∀S, T ⊆ V, f(S) + f(T) ≥ f(S ∪ T) + f(S ∩ T). Submodular functions have attractive properties that make their exact or approximate optimization efficient and often practical [17, 21]. They also naturally arise in many problems in machine learning, computer vision, economics, operations research, etc. A link between convexity and submodularity is seen via the Lovász extension ([13, 29]) of the submodular function. While submodular functions are a growing phenomenon in machine learning, recently there has been an increasing set of applications for the Lovász extension. In particular, recent work [1, 2] has shown nice connections between the Lovász extension and structured sparsity inducing norms.

This work is concerned with yet another application of the Lovász extension, in the form of the Lovász-Bregman divergence. This was first introduced in Iyer & Bilmes [19], in the context of clustering ranked vectors. We extend our work in several ways, mainly theoretically, by showing a number of connections to the permutation based metrics, to rank aggregation, to rank based clustering and to the "Learning to Rank" problem in web ranking.
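To make the submodularity inequality above concrete, here is a small self-contained check (purely illustrative and not part of the original paper) that the graph cut function on a toy graph satisfies f(S) + f(T) ≥ f(S ∪ T) + f(S ∩ T) for all pairs of subsets; the graph, weights, and function names are arbitrary choices of ours.

```python
# Illustrative only: brute-force check of submodularity for a small cut function.
from itertools import chain, combinations

V = [0, 1, 2, 3]
w = {(0, 1): 1.0, (1, 2): 2.0, (2, 3): 0.5, (0, 3): 1.5}  # undirected edge weights

def cut(S):
    """f(S) = total weight of edges crossing between S and V \\ S."""
    S = set(S)
    return sum(wij for (i, j), wij in w.items() if (i in S) != (j in S))

def subsets(ground):
    return chain.from_iterable(combinations(ground, r) for r in range(len(ground) + 1))

# Verify f(S) + f(T) >= f(S | T) + f(S & T) for every pair of subsets.
ok = all(
    cut(S) + cut(T) >= cut(set(S) | set(T)) + cut(set(S) & set(T)) - 1e-9
    for S in subsets(V) for T in subsets(V)
)
print("cut function is submodular on this toy graph:", ok)  # True
```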
1.1 Motivation

The problems of rank aggregation and rank based clustering are ubiquitous in machine learning, information retrieval, and social choice theory. Below is a partial list of some of these applications.

Meta Web Search: We are given a collection of search engines, each providing a ranking or a score vector, and the task is to aggregate these to generate a combined result [27].

Learning to Rank: The "Learning to rank" problem, which is a fundamental problem in machine learning, involves constructing a ranking model from training data. This problem has gained significant interest in web ranking and information retrieval [28].

Voter or Rank Clustering: This is an important problem in social choice theory, where each voter provides a ranking or assigns a score to every item.

A natural choice in such problems is the Kendall τ metric [23]:

d_T(σ, π) = Σ_{i,j: i<j} I(σ^{-1}π(i) > σ^{-1}π(j))    (1)
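As a small illustration (ours, not part of the original paper), the sketch below computes the Kendall τ distance of Eqn. (1) by counting the pairs on which two permutations disagree; the function name and representation are our own choices.

```python
# A minimal sketch of the Kendall tau distance in Eqn. (1).
# Permutations are given as lists: sigma[k] is the item placed at rank k.

def kendall_tau(sigma, pi):
    """Number of item pairs on which the two rankings disagree."""
    n = len(sigma)
    # position of each item under sigma
    sigma_inv = {item: rank for rank, item in enumerate(sigma)}
    dist = 0
    for i in range(n):
        for j in range(i + 1, n):
            # pi ranks pi[i] above pi[j]; count a disagreement if sigma reverses them
            if sigma_inv[pi[i]] > sigma_inv[pi[j]]:
                dist += 1
    return dist

print(kendall_tau([1, 2, 3], [3, 2, 1]))  # 3: all pairs disagree
```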

Table 1: Examples of the LB divergences. I(.) is the Indicator fn.

As we shall see in the next section, this divergence has a number of properties akin to the standard permutation based distance metrics. Since a large class of submodular functions satisfy the above property (of having all possible extreme points), the Lovász-Bregman divergence forms a large class of divergences.

The case when y is not totally ordered can be handled similarly [20].

2.4 Lovász-Bregman Divergence Examples

Below is a partial list of some instances of the Lovász-Bregman divergence. We shall see that a number of these are closely related to many standard permutation based metrics. Table 1 considers several other examples of LB divergences.

Cut function and symmetric submodular functions: A fundamental submodular function, which is also symmetric, is the graph cut function. This is f(X) = Σ_{i∈X} Σ_{j∈V\X} d_ij. The Lovász extension of f is f̂(x) = Σ_{i,j} d_ij (x_i − x_j)_+ [2]. The LB divergence corresponding to f̂ then has a nice form:

d_f̂(x||σ) = Σ_{i,j: i<j} d_{σ(i)σ(j)} |x_{σ(i)} − x_{σ(j)}| I(σ_x^{-1}σ(i) > σ_x^{-1}σ(j))    (10)

With d_ij = 1 for all i, j, this is a score-weighted analogue of the Kendall τ, which in this notation is d_T(σ_x, σ) = Σ_{i,j: i<j} I(σ_x^{-1}σ(i) > σ_x^{-1}σ(j)).

Another natural class of examples comes from functions of the cardinality, f(X) = g(|X|) for a concave function g; this form induces an interesting class of Lovász-Bregman divergences. In this case h^f_{σ_x}(σ_x(i)) = g(i) − g(i − 1). Define δg(i) = g(i) − g(i − 1); then:

d_f̂(x||σ) = Σ_{i=1}^n x[σ_x(i)] δg(i) − Σ_{i=1}^n x[σ(i)] δg(i).    (11)

Notice that we can start with any δg such that δg(1) ≥ δg(2) ≥ ··· ≥ δg(n), and through this we can obtain the corresponding function g. Consider a specific example, with δg(i) = n − i. Then d_f̂(x||σ) = Σ_{i=1}^n x[σ(i)]·i − Σ_{i=1}^n x[σ_x(i)]·i = ⟨x, σ^{-1} − σ_x^{-1}⟩. This expression looks similar to Spearman's footrule (Eqn. (2)), except for being additionally weighted by x.

We can also extend this in several ways. For example, consider a restriction to the top m elements (m < n). Define f(X) = min{g(|X|), g(m)}. Then it is not hard to verify that:

d_f̂(x||σ) = Σ_{i=1}^m x[σ_x(i)] δg(i) − Σ_{i=1}^m x[σ(i)] δg(i).    (12)

A specific example is f(X) = min{|X|, m}, where

d_f̂(x||σ) = Σ_{i=1}^m x[σ_x(i)] − Σ_{i=1}^m x[σ(i)].    (13)

2.5 Lovász-Bregman as ranking measures

Given a set G of relevant items and a set B of irrelevant items, consider the pairwise ranking measure

(1/(|G||B|)) Σ_{g∈G, b∈B} I(σ^{-1}(g) > σ^{-1}(b)),    (15)

which counts the fraction of relevant-irrelevant pairs placed in the wrong order by σ, a quantity closely tied to the AUC [35]. This can be seen as an instance of the LB divergence corresponding to the cut function by choosing d_ij = 1/(|G||B|), ∀i, j, x_g = 1, ∀g ∈ G and x_b = 0, ∀b ∈ B.
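As an illustration (ours, not the authors' code), the sketch below evaluates the two example divergences above, the cardinality-based form of Eqn. (11) and the cut-based form of Eqn. (10), on small score vectors; the variable names and the particular choice of δg are our own.

```python
# Illustrative sketches of the LB divergence examples above.
import numpy as np

def lb_cardinality(x, sigma, delta_g):
    """Eqn. (11): d(x||sigma) = sum_i x[sigma_x(i)] dg(i) - sum_i x[sigma(i)] dg(i).
    sigma lists item indices in ranked order; sigma_x sorts x in decreasing order."""
    x = np.asarray(x, dtype=float)
    sigma_x = np.argsort(-x)                     # ordering induced by the scores
    return float(np.dot(x[sigma_x], delta_g) - np.dot(x[np.asarray(sigma)], delta_g))

def lb_cut(x, sigma, d):
    """Eqn. (10) with pair weights d[i][j]: penalize pairs that sigma orders one way
    but the scores x order the other way, weighted by the score gap."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    rank_x = np.empty(n, dtype=int)
    rank_x[np.argsort(-x)] = np.arange(n)        # position of each item under sigma_x
    total = 0.0
    for a in range(n):
        for b in range(a + 1, n):
            i, j = sigma[a], sigma[b]            # sigma places item i above item j
            if rank_x[i] > rank_x[j]:            # the scores reverse this pair
                total += d[i][j] * abs(x[i] - x[j])
    return total

x = [0.9, 0.1, 0.5]
sigma = [2, 0, 1]                                 # a candidate ordering (best first)
delta_g = np.array([2.0, 1.0, 0.0])               # dg(i) = n - i for a concave g(|X|)
d = np.ones((3, 3))
print(lb_cardinality(x, sigma, delta_g), lb_cut(x, sigma, d))
```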
3 Lovász-Bregman Properties

In this section, we shall analyze some interesting properties of the LB divergences. While many of these properties show strong similarities with permutation based metrics, the Lovász-Bregman divergence enjoys some unique properties, thereby providing novel insight into the problem of combining and clustering ordered vectors.

Non-negativity and convexity: The LB divergence is a divergence, in that ∀x, σ, d_f̂(x||σ) ≥ 0. Additionally, if the submodular polyhedron of f has all possible extreme points, d_f̂(x||σ) = 0 iff σ = σ_x. Also, the Lovász-Bregman divergence d_f̂(x||σ) is convex in x for a given σ.

Equivalence Classes: The LB divergences of submodular functions which differ only in a modular term are equal. Hence, for a submodular function f and a modular function m, the divergences induced by f and f + m coincide.

Corollary 3.1.1. Given a submodular function f such that f(X) = g(|X|) for some function g, then d_f̂(x||σ) = d_f̂(τx||τσ) for any permutation τ.

This follows directly from Eqn. (11) and observing that the extreme points of the corresponding polyhedron are reorderings of each other. In other words, in these cases the submodular polyhedron forms a permutahedron. This property is true even for sums of such functions and therefore for many of the special cases which we have considered.

Dependence on the values and not just the orderings: We shall here analyze one key property of the LB divergence that is not present in other permutation based divergences. Consider the problem of combining rankings where, given a collection of scores x^1, ···, x^n, we want to come up with a joint ranking. An extreme case of this is where for some x all the elements are the same. In this case x expresses no preference in the joint ranking. Indeed it is easy to verify that for such an x, d_f̂(x||σ) = 0, ∀σ. Now given an x where all the elements are almost equal (but not exactly equal), even though this vector is totally ordered, it expresses a very low confidence in its ordering. We would expect for such an x, d_f̂(x||σ) to be small for every σ. Indeed we have the following result:

[Figure 1 panels: (a) LB2D, (b) LB3D1, (c) LB3D2, (d) KT2D, (e) KT3D1, (f) KT3D2.]

Figure 1: A visualization of d_f̂(x||σ) (left three) and d_T(σ_x, σ) (right three). The figure shows a visualization in 2D, and two views in 3D for each, with σ as {1, 2} and {1, 2, 3} and x ∈ [0, 1]^2 and [0, 1]^3 respectively.
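A rough sketch (ours, not the original figure code) of how a 2-D panel like those in Figure 1 can be produced: for two items and the cut function with d_ij = 1, the divergence to the ordering "item 1 above item 2" reduces to max(0, x_2 − x_1); the file name and plotting details are arbitrary.

```python
# Illustrative only: a 2-D picture in the spirit of Figure 1 (cut-based LB divergence).
import numpy as np
import matplotlib.pyplot as plt

xs = np.linspace(0, 1, 101)
X1, X2 = np.meshgrid(xs, xs)
# For two items with sigma = (item 1 above item 2), Eqn. (10) reduces to max(0, x2 - x1):
D = np.maximum(0.0, X2 - X1)

plt.imshow(D, origin="lower", extent=[0, 1, 0, 1])
plt.xlabel("x1"); plt.ylabel("x2"); plt.colorbar(label="d(x || sigma)")
plt.title("Cut-based LB divergence to sigma = (1, 2)")
plt.savefig("lb_2d.png")
```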

Theorem 3.2. Given a monotone submodular function f and any permutation σ,

d_f̂(x||σ) ≤ nε (max_j f(j) − min_j f(j | V\j)) ≤ nε max_j f(j),

where ε = max_{i,j} |x_i − x_j| and f(j|A) = f(A ∪ j) − f(A).

The above theorem implies that if the vector x is such that all its elements are almost equal, then ε is small and the LB divergence is also proportionately small. This bound can be improved in certain cases. For example, for the cut function with f(X) = |X||V\X|, we have that d_f̂(x||σ) ≤ ε d_T(σ_x, σ) ≤ ε n(n − 1)/2, where d_T is the Kendall τ.

Priority for higher rankings: We show yet another nice property of the LB divergence with respect to a natural priority in rankings. This property has to do intrinsically with the submodularity of the generator function. We have the following lemma, which demonstrates this:

Lemma 3.1. Given permutations σ, π such that P(σ) and P(π) share a face (say S_k^σ ≠ S_k^π) and x ∈ P(π), then d_f̂(x||σ) = (x_k − x_{k+1})(f(σ_x(k) | S_{k−1}^σ) − f(σ_x(k) | S_k^σ)).

This result directly follows from the definitions. Now consider the class of submodular functions f such that ∀j, k ∉ X, j ≠ k, f(j|S) − f(j|S ∪ k) is monotone decreasing as a function of S. An example of such a submodular function is again f(X) = g(|X|), for a concave function g. Then it is clear from the above lemma that d_f̂(x||σ) will be larger for smaller k. In other words, if π and σ differ in the starting of the ranking, the divergence is more than if π and σ differ somewhere towards the end of the ranking. This kind of weighting is more prominent for the class of functions which depend on the cardinality, i.e., f(X) = g(|X|). Recall that many of our special cases belong to this class. Then we have that d_f̂(x||σ) = Σ_{i=1}^n {x[σ_x(i)] − x[σ(i)]} δg(i). Now since δg(1) ≥ δg(2) ≥ ··· ≥ δg(n), it then follows that if σ_x and σ differ in the start of the ranking, they are penalized more.

Extensions to partial orderings and top m-lists: So far we considered notions of distance between a score x and a complete permutation σ. Often we may not be interested in a distance to a total ordering σ, but just a distance to, say, a top-m list [26] or a partial ordering between elements [38, 8]. The LB divergence also has a nice interpretation for both of these. In particular, in the context of top m lists, we can use Eqn. (12). This exactly corresponds to the divergence between different or possibly overlapping sets of m objects. Moreover, if we are simply interested in the top m elements without the orderings, we have Eqn. (13). A special case of this is when we may just be interested in the top value. Another interesting instance is that of partial orderings, where we do not care about the total ordering. For example, in web ranking we often just care about the relevant and irrelevant documents and that the relevant ones should be placed above the irrelevant ones. We can then define a distance d_f̂(x||P), where P refers to a partial ordering, by using the cut based Lovász-Bregman divergence (Eqn. (10)) and defining the graph to have edges corresponding to the partial ordering. For example, if we are interested in a partial order 1 > 2, 3 > 2 in the elements {1, 2, 3, 4}, we can define d_{1,2} = d_{3,2} = 1 with the rest d_ij = 0 in Eqn. (10); a small sketch of this construction follows below. Defined in this way, the LB divergence then measures the distortion between a vector x and the partial ordering 1 > 2, 3 > 2. In all these cases, we see that the extensions to partial rankings are natural in our framework, without needing to significantly change the expressions to admit these generalizations.
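To make the partial-order construction concrete, here is a small sketch based on our own reading of the construction above (not code from the paper): encoding each required pair a ≻ b as an edge weight d_{a,b} = 1, the cut-based divergence of Eqn. (10) reduces to summing, over the required pairs, the score gap by which x violates the pair. The function and variable names are ours.

```python
# Our reading of d(x || P) for a partial order P via the cut-based LB divergence:
# each required pair (a, b), meaning "a should be ranked above b", contributes
# max(0, x[b] - x[a]), the margin by which the scores violate that pair.

def partial_order_divergence(x, required_pairs):
    return sum(max(0.0, x[b] - x[a]) for (a, b) in required_pairs)

# Partial order 1 > 2 and 3 > 2 over elements {1, 2, 3, 4} (0-indexed here as 0..3).
pairs = [(0, 1), (2, 1)]
print(partial_order_divergence([0.9, 0.2, 0.7, 0.5], pairs))  # 0.0: both pairs respected
print(partial_order_divergence([0.2, 0.9, 0.7, 0.5], pairs))  # 0.7 + 0.2, both pairs violated
```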
Lovász-Mallows model: In this section, we extend the notion of the Mallows model to the LB divergence. We first define the Mallows model for the LB divergence:

p(x|θ, σ) = exp(−θ d_f̂(x||σ)) / Z(θ, σ),   θ ≥ 0.    (16)

For this distribution to be a valid probability distribution, we assume the domain D of x to be a bounded set (say, for example, [0, 1]^n). We also assume that the domain is symmetric over permutations (i.e., for all σ ∈ Σ, if x ∈ D, then xσ ∈ D). Unlike the standard Mallows model, however, this is defined over scores (or valuations) as opposed to permutations.

Given the class of LB divergences defining a probability distribution over such a symmetric set (i.e., the divergences are invariant over relabelings), it follows that Z(θ, σ) = Z(θ). The reason for this is:

Z(θ, σ) = ∫_x exp(−θ d_f̂(x||σ)) dx = ∫_x exp(−θ d_f̂(xσ^{-1}||σ_0)) dx = ∫_{x'} exp(−θ d_f̂(x'||σ_0)) dx' = Z(θ),

where σ_0 = {1, 2, ···, n}. We can also define an extended Mallows model for combining rankings, analogous to [27]. Unlike the Mallows model, however, this is a model over permutations given a collection of vectors X = {x_1, ···, x_n} and parameters Θ = {θ_1, ···, θ_n}:

p(σ|Θ, X) = exp(−Σ_{i=1}^n θ_i d_f̂(x_i||σ)) / Z(Θ, X)    (17)

This model can be used to combine rankings using the LB divergences, in a manner akin to Cranking [27]. This extended Lovász-Mallows model also admits an interesting Bayesian interpretation, thereby providing a generative view of this model:

p(σ|Θ, X) ∝ p(σ) Π_{i=1}^n p(x_i|σ, θ_i).    (18)

Again this directly follows from the fact that in this case, in the Lovász-Mallows model, the normalizing constants (which are independent of σ) cancel out. We shall actually see some very interesting connections between this conditional model and web ranking.
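As a toy illustration of combining rankings with the extended model of Eqn. (17) (our own sketch, not the authors' code), one can brute-force the most probable permutation for a handful of items by scoring every permutation with the unnormalized exponent; we reuse the cardinality-based divergence of Eqn. (11) with δg(i) = n − i, and all names and data are our own.

```python
# Toy brute-force MAP under the extended Lovász-Mallows model of Eqn. (17);
# only feasible for a handful of items, and purely illustrative.
from itertools import permutations
import numpy as np

def lb_cardinality(x, sigma, delta_g):
    """Eqn. (11) with a fixed decreasing weight vector delta_g."""
    x = np.asarray(x, dtype=float)
    sigma_x = np.argsort(-x)
    return float(np.dot(x[sigma_x], delta_g) - np.dot(x[list(sigma)], delta_g))

def map_permutation(scores, thetas):
    """argmax_sigma exp(-sum_i theta_i d(x_i || sigma)) over all permutations."""
    n = len(scores[0])
    delta_g = np.arange(n - 1, -1, -1, dtype=float)   # dg(i) = n - i
    best, best_cost = None, np.inf
    for sigma in permutations(range(n)):
        cost = sum(t * lb_cardinality(x, sigma, delta_g) for x, t in zip(scores, thetas))
        if cost < best_cost:
            best, best_cost = sigma, cost
    return best

scores = [[0.9, 0.1, 0.5], [0.8, 0.3, 0.4], [0.2, 0.25, 0.22]]  # three voters' scores
print(map_permutation(scores, thetas=[1.0, 1.0, 1.0]))          # e.g. (0, 2, 1)
```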
4 Applications

Rank Aggregation: As argued above, the LB divergence is a natural model for the problem of combining scores, where both the ordering and the valuations are provided. If we ignore the values, but just consider the rankings, this then becomes rank aggregation. A natural choice in such problems is the Kendall τ distance [27, 26, 31]. On the other hand, if we consider only the values without explicitly modeling the orderings, then this becomes an incarnation of boosting [16]. The Lovász-Bregman divergence tries to combine both aspects of this problem – by combining orderings using a permutation based divergence, while simultaneously using the additional information of the confidence in the orderings provided by the valuations. We can then pose this problem as:

σ ∈ argmin_{σ' ∈ Σ} Σ_{i=1}^n d_f̂(x^i||σ')    (19)

The above notion of the representative ordering (also known as the mean ordering) is very common in many applications [3] and has also been used in the context of combining rankings [31, 27, 26]. Unfortunately, this problem in the context of the permutation based metrics was shown to be NP hard [4]. Surprisingly, for the LB divergence this problem is easy (and has a closed form). In particular, the representative permutation is exactly the ordering corresponding to the arithmetic mean of the elements in X.

Lemma 4.1. [19] Given a submodular function f, the Lovász-Bregman representative (Eqn. (19)) is σ = σ_μ, where μ = (1/n) Σ_{i=1}^n x_i.

This result builds on the known result for Bregman divergences [3]. This seems somewhat surprising at first. Notice, however, that the arithmetic mean uses additional information about the scores and its confidence, as opposed to just the orderings. In this context, the result then seems reasonable, since we would expect that the representatives be closely related to the ordering of the arithmetic mean of the objects. We shall also see that this notion has in fact been ubiquitously but unintentionally used in the web ranking and information retrieval communities.

We illustrate the utility of the Lovász-Bregman rank aggregation through the following argument. Assume that a particular vector x is uninformative about the true ordering (i.e., the values of x are almost equal). Then with the LB divergence and any permutation π, d_f̂(x||π) ≈ 0, and hence this vector will not contribute to the mean ordering. Instead, if we use a permutation based metric, it will ignore the values but consider only the permutation. As a result, the mean ordering tends to consider such vectors x which are uninformative about the true ordering. As an example, consider a set of scores: X = {(1.9, 2), (1.8, 2), (1.95, 2), (2, 1), (2.5, 1.2)}. The representative of this collection as seen by a permutation based metric would be the permutation {1, 2}, though the former three vectors have very low confidence. The arithmetic mean of these vectors is however (2.03, 1.64), and the Lovász-Bregman representative would be {2, 1}.

The arithmetic mean also provides a notion of confidence of the population. In particular, if the total variation [2] of the arithmetic mean is small, it implies that the population is not confident about its ordering, while if the variation is high, it provides a certificate of a homogeneous population. Figure 1 provides a visualization of the Lovász-Bregman divergence using the cut function and the Kendall τ metric, visualized in 2 and 3 dimensions respectively. We see the similarity between the two divergences and at the same time, the dependence on the "scores" in the Lovász-Bregman case.

Learning to Rank: We investigate a specific instance of the rank aggregation problem with reference to the problem of "learning to rank." A large class of algorithms have been proposed for this problem – see [28] for a survey. A specific class of algorithms for this problem have focused on maximum margin learning using ranking based loss functions (see [38, 8] and references therein). While we have seen that the ranking based losses themselves are instances of the LB divergence, the feature functions are also closely related.

In particular, given a query q, we denote a feature vector corresponding to document i ∈ {1, 2, ···, n} as x_i ∈ R^d, where each element of x_i denotes a quality of document i based on a particular indicator or feature. Denote X = {x_1, ···, x_n}. We assume we have d feature functions (one might be, for example, a match with the title, another might be PageRank, etc.). Denote x_i^j as the score of the jth feature corresponding to document i and x^j ∈ R^n as the score vector corresponding to feature j over all the documents. In other words, x^j = (x_1^j, x_2^j, ···, x_n^j). One possible choice of feature function is:

φ(X, σ) = Σ_{j=1}^d w_j d_f̂(x^j||σ)    (20)

for a weight vector w ∈ R^d. Given a particular weight vector w, the inference problem then is to find the permutation σ which minimizes φ(X, σ). Thanks to Lemma 4.1, the permutation σ is exactly the ordering of the vector Σ_{j=1}^d w_j x^j. It is not hard to see that this exactly corresponds to ordering the scores w^T x_i for i ∈ {1, 2, ···, n}. Interestingly, many of the feature functions used in [38, 8] are forms closely related to Eqn. (20). In fact, the motivation to define these feature functions is exactly that the inference problem for a given set of weights w be solved by simply ordering the scores w^T x_i for every i ∈ {1, 2, ···, n} [8]. We see that through Eqn. (20), we have a large class of possible feature functions for this problem.
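The inference step just described reduces to a sort; the snippet below (ours, with made-up feature values and weights) shows the two equivalent views: ordering the documents by Σ_j w_j x^j, or equivalently by the per-document scores w^T x_i.

```python
# Illustrative only: inference for Eqn. (20) by sorting, with made-up data.
import numpy as np

rng = np.random.default_rng(0)
n_docs, n_feats = 5, 3
X = rng.random((n_docs, n_feats))      # X[i, j] = score of feature j for document i
w = np.array([0.5, 0.2, 0.3])          # hypothetical learned feature weights

# View 1: aggregate the feature score vectors, sum_j w_j x^j, then order documents.
agg = X @ w
order_from_agg = np.argsort(-agg)

# View 2: order documents by w^T x_i directly; identical result.
order_from_docs = np.argsort(-np.array([w @ X[i] for i in range(n_docs)]))

assert np.array_equal(order_from_agg, order_from_docs)
print(order_from_agg)                  # the phi-minimizing permutation (best first)
```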
We also point out a connection between the learning to rank problem and the Lovász-Mallows model. In particular, recent work [12] defined a conditional probability model over permutations as:

p(σ|w, X) = exp(w^T φ(X, σ)) / Z.    (21)

This conditional model is then exactly the extended Lovász-Mallows model of Eqn. (17) when φ is defined as in Eqn. (20). The conditional models used in [12] are in fact closely related to this, and correspondingly Eqn. (21) offers a large class of conditional ranking models for the learning to rank problem.

Clustering: A natural generalization of rank aggregation is the problem of clustering. In this context, we assume a heterogeneous model, where the data is represented as mixtures of ranking models, with each mixture representing a homogeneous population. It is natural to define a clustering objective in such scenarios. Assume a set of representatives Σ = {σ_1, ···, σ_k} and a set of clusters C = {C_1, C_2, ···, C_k}. The clustering objective is then: min_{C,Σ} Σ_{j=1}^k Σ_{i: x_i ∈ C_j} d_f̂(x_i||σ_j). As shown in [19], a simple k-means style algorithm finds a local minimum of the above objective; a small sketch of one such iteration appears below. Moreover, due to the simplicity of obtaining the means in this case, this algorithm is extremely scalable and practical.
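A minimal sketch (ours, not the authors' implementation) of the k-means style alternation just mentioned: assign each score vector to the representative with smallest LB divergence, then reset each representative to the ordering of its cluster's arithmetic mean (Lemma 4.1). The divergence, initialization, and data below are illustrative assumptions.

```python
# Illustrative Lloyd-style clustering of score vectors under an LB divergence.
import numpy as np

def lb_cardinality(x, sigma, delta_g):
    """Eqn. (11) with decreasing weights delta_g; sigma is an ordering (best first)."""
    x = np.asarray(x, dtype=float)
    return float(np.dot(np.sort(x)[::-1], delta_g) - np.dot(x[np.asarray(sigma)], delta_g))

def lb_cluster(data, k, iters=20, seed=0):
    data = np.asarray(data, dtype=float)
    n, d = data.shape
    delta_g = np.arange(d - 1, -1, -1, dtype=float)          # dg(i) = d - i
    rng = np.random.default_rng(seed)
    reps = [np.argsort(-data[i]) for i in rng.choice(n, k, replace=False)]
    for _ in range(iters):
        # Assignment step: nearest representative under the LB divergence.
        assign = np.array([min(range(k), key=lambda j: lb_cardinality(x, reps[j], delta_g))
                           for x in data])
        # Update step: representative = ordering of the cluster's arithmetic mean.
        for j in range(k):
            members = data[assign == j]
            if len(members):
                reps[j] = np.argsort(-members.mean(axis=0))
    return assign, reps

data = [[0.9, 0.5, 0.1], [0.8, 0.6, 0.2], [0.1, 0.4, 0.9], [0.2, 0.3, 0.8]]
assign, reps = lb_cluster(data, k=2)
print(assign, [tuple(r) for r in reps])
```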

5 Discussion

To our knowledge, this work is the first to introduce the notion of "score based divergences" in preference and ranking based learning. Many of the results in this paper are due to some interesting properties of the Lovász extension and Bregman divergences. This also provides interesting connections between web ranking and the permutation based metrics. This idea is mildly related to the work of [36], where they use the Choquet integral (of which the Lovász extension is a special case) for preference learning. Unlike our paper, however, they do not focus on the divergences formed by the integral. Finally, it will be interesting to use these ideas in real world applications involving rank aggregation, clustering, and learning to rank.

Acknowledgments: We thank Matthai Phillipose, Stefanie Jegelka and the melodi-lab submodular group at UWEE for discussions, and the anonymous reviewers for very useful reviews. This material is based upon work supported by the National Science Foundation under Grant No. (IIS-1162606), and is also supported by a Google, a Microsoft, and an Intel research award.

References

[1] F. Bach. Structured sparsity-inducing norms through submodular functions. NIPS, 2010.

[2] F. Bach. Learning with Submodular Functions: A Convex Optimization Perspective. Arxiv, 2011.

[3] A. Banerjee, S. Merugu, I. S. Dhillon, and J. Ghosh. Clustering with Bregman divergences. JMLR, 6:1705–1749, 2005.

[4] J. Bartholdi, C. Tovey, and M. Trick. Voting schemes for which it can be difficult to tell who won the election. Social Choice and Welfare, 6(2):157–165, 1989.

[5] L. Bregman. The relaxation method of finding the common point of convex sets and its application to the solution of problems in convex programming. USSR Comput. Math and Math Physics, 7, 1967.

[6] L. Busse, P. Orbanz, and J. Buhmann. Cluster analysis of heterogeneous rank data. In ICML, volume 227, pages 113–120, 2007.

[7] Y. Censor and S. Zenios. Parallel Optimization: Theory, Algorithms, and Applications. Oxford University Press, USA, 1997.

[8] S. Chakrabarti, R. Khanna, U. Sawant, and C. Bhattacharyya. Structured learning for non-smooth ranking losses. In SIGKDD, pages 88–96. ACM, 2008.

[9] G. Choquet. Theory of capacities. Annales de l'institut Fourier, volume 5, page 87, 1953.

[10] D. Critchlow. Metric methods for analyzing partially ranked data. Lecture Notes in Statistics No. 34, Springer-Verlag, Berlin, 1985.

[11] W. H. Cunningham. Decomposition of submodular functions. Combinatorica, 3(1):53–68, 1983.

[12] A. Dubey, J. Machchhar, C. Bhattacharyya, and S. Chakrabarti. Conditional models for non-smooth ranking loss functions. In ICDM, pages 129–138, 2009.

[13] J. Edmonds. Submodular functions, matroids and certain polyhedra. Combinatorial Structures and their Applications, 1970.

[14] M. Fligner and J. Verducci. Distance based ranking models. Journal of the Royal Statistical Society, Series B (Methodological), pages 359–369, 1986.

[15] M. Fligner and J. Verducci. Multistage ranking models. Journal of the American Statistical Association, 83(403):892–901, 1988.

[16] Y. Freund, R. Iyer, R. E. Schapire, and Y. Singer. An efficient boosting algorithm for combining preferences. JMLR, 4:933–969, 2003.

[17] S. Fujishige. Submodular Functions and Optimization, volume 58. Elsevier Science, 2005.

[18] G. Gordon. Regret bounds for prediction problems. In COLT, pages 29–40. ACM, 1999.

[19] R. Iyer and J. Bilmes. The submodular Bregman and Lovász-Bregman divergences with applications. In NIPS, 2012.

[20] R. Iyer and J. Bilmes. The Lovász-Bregman divergence and connections to rank aggregation, clustering and web ranking: Extended version of UAI paper. 2013.

[21] R. Iyer, S. Jegelka, and J. Bilmes. Fast semidifferential based submodular function optimization. In ICML, 2013.

[22] K. Järvelin and J. Kekäläinen. IR evaluation methods for retrieving highly relevant documents. In SIGIR, pages 41–48. ACM, 2000.

[23] M. Kendall. A new measure of rank correlation. Biometrika, 30(1/2):81–93, 1938.

[24] K. Kirchhoff et al. Combining articulatory and acoustic information for speech recognition in noisy and reverberant environments. In ICSLP, volume 98, pages 891–894. Citeseer, 1998.

[25] K. Kiwiel. Proximal minimization methods with generalized Bregman functions. SIAM Journal on Control and Optimization, 35(4):1142–1168, 1997.

[26] A. Klementiev, D. Roth, and K. Small. Unsupervised rank aggregation with distance-based models. In ICML, 2008.

[27] G. Lebanon and J. Lafferty. Cranking: Combining rankings using conditional probability models on permutations. In ICML, 2002.

[28] T.-Y. Liu. Learning to rank for information retrieval. Foundations and Trends in Information Retrieval, 3(3):225–331, 2009.

[29] L. Lovász. Submodular functions and convexity. Mathematical Programming, 1983.

[30] C. Mallows. Non-null ranking models. I. Biometrika, 44(1/2):114–130, 1957.

[31] M. Meilă, K. Phadnis, A. Patterson, and J. Bilmes. Consensus ranking under the exponential model. In UAI, 2007.

[32] T. Murphy and D. Martin. Mixtures of distance-based models for ranking data. Computational Statistics & Data Analysis, 41(3):645–655, 2003.

[33] R. Rockafellar. Convex Analysis, volume 28. Princeton University Press, 1970.

[34] A.-V. I. Rosti, N. F. Ayan, B. Xiang, S. Matsoukas, R. Schwartz, and B. Dorr. Combining outputs from multiple machine translation systems. In NAACL-HLT, 2007.

[35] K. A. Spackman. Signal detection theory: Valuable tools for evaluating inductive learning. In Proceedings of the Sixth International Workshop on Machine Learning, 1989.

[36] A. F. Tehrani, W. Cheng, and E. Hüllermeier. Preference learning using the Choquet integral: The case of multipartite ranking. IEEE Transactions on Fuzzy Systems, 2012.

[37] M. Telgarsky and S. Dasgupta. Agglomerative Bregman clustering. In ICML, 2012.

[38] Y. Yue, T. Finley, F. Radlinski, and T. Joachims. A support vector method for optimizing average precision. In SIGIR. ACM, 2007.