The Lovász-Bregman Divergence and connections to rank aggregation, clustering, and web ranking

Rishabh Iyer, Dept. of Electrical Engineering, University of Washington, Seattle, WA-98175, USA
Jeff Bilmes, Dept. of Electrical Engineering, University of Washington, Seattle, WA-98175, USA

Abstract

We extend the recently introduced theory of Lovász-Bregman (LB) divergences [19] in several ways. We show that they represent a distortion between a "score" and an "ordering", thus providing a new view of rank aggregation and order based clustering with interesting connections to web ranking. We show how the LB divergences have a number of properties akin to many permutation based metrics, and in fact have as special cases forms very similar to the Kendall-τ. We also show how the LB divergences subsume a number of commonly used ranking measures in information retrieval, like NDCG [22] and AUC [35]. Unlike the traditional permutation based metrics, however, the LB divergences naturally capture a notion of "confidence" in the orderings, thus providing a new representation for applications involving aggregating scores as opposed to just orderings. We show how a number of recently used web ranking models are forms of Lovász-Bregman rank aggregation, and also observe that a natural form of the Mallows model using the LB divergence has been used as a conditional ranking model for the "Learning to Rank" problem.

1 Introduction

The Bregman divergence first appeared in the context of relaxation techniques in convex programming [5], and has found numerous applications as a general framework in clustering [3], proximal minimization [7], and others. Many of these applications are due to the nice properties of the Bregman divergence, and the fact that they are parameterized by a single convex function. They also generalize a large class of divergences between vectors.

In this paper, we investigate a specific class of Bregman divergences, parameterized via the Lovász extension of a submodular function. Submodular functions are a special class of discrete functions with interesting properties. Let V refer to a finite ground set {1, 2, ..., |V|}. A set function f : 2^V → R is submodular if ∀S, T ⊆ V, f(S) + f(T) ≥ f(S ∪ T) + f(S ∩ T). Submodular functions have attractive properties that make their exact or approximate optimization efficient and often practical [17, 21]. They also naturally arise in many problems in machine learning, computer vision, economics, operations research, etc. A link between convexity and submodularity is seen via the Lovász extension ([13, 29]) of the submodular function. While submodular functions are a growing phenomenon in machine learning, recently there has been an increasing set of applications for the Lovász extension. In particular, recent work [1, 2] has shown nice connections between the Lovász extension and structured sparsity inducing norms.

This work is concerned with yet another application of the Lovász extension, in the form of the Lovász-Bregman divergence. This was first introduced in Iyer & Bilmes [19], in the context of clustering ranked vectors. We extend our work in several ways, mainly theoretically, by showing a number of connections to the permutation based metrics, to rank aggregation, to rank based clustering and to the "Learning to Rank" problem in web ranking.
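To make the submodularity inequality above concrete, here is a small self-contained check (purely illustrative and not part of the original paper) that the graph cut function on a toy graph satisfies f(S) + f(T) ≥ f(S ∪ T) + f(S ∩ T) for all pairs of subsets; the graph, weights, and function names are arbitrary choices of ours.

```python
# Illustrative only: brute-force check of submodularity for a small cut function.
from itertools import chain, combinations

V = [0, 1, 2, 3]
w = {(0, 1): 1.0, (1, 2): 2.0, (2, 3): 0.5, (0, 3): 1.5}  # undirected edge weights

def cut(S):
    """f(S) = total weight of edges crossing between S and V \\ S."""
    S = set(S)
    return sum(wij for (i, j), wij in w.items() if (i in S) != (j in S))

def subsets(ground):
    return chain.from_iterable(combinations(ground, r) for r in range(len(ground) + 1))

# Verify f(S) + f(T) >= f(S | T) + f(S & T) for every pair of subsets.
ok = all(
    cut(S) + cut(T) >= cut(set(S) | set(T)) + cut(set(S) & set(T)) - 1e-9
    for S in subsets(V) for T in subsets(V)
)
print("cut function is submodular on this toy graph:", ok)  # True
```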
1.1 Motivation

The problems of rank aggregation and rank based clustering are ubiquitous in machine learning, information retrieval, and social choice theory. Below is a partial list of some of these applications.

Meta Web Search: We are given a collection of search engines, each providing a ranking or a score vector, and the task is to aggregate these to generate a combined result [27].

Learning to Rank: The "Learning to rank" problem, which is a fundamental problem in machine learning, involves constructing a ranking model from training data. This problem has gained significant interest in web ranking and information retrieval [28].

Voter or Rank Clustering: This is an important problem in social choice theory, where each voter provides a ranking or assigns a score to every item.

A natural choice in such problems is the Kendall τ metric [23]:

d_T(σ, π) = Σ_{i,j: i<j} I(σ^{-1}π(i) > σ^{-1}π(j))    (1)
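As a small illustration (ours, not part of the original paper), the sketch below computes the Kendall τ distance of Eqn. (1) by counting the pairs on which two permutations disagree; the function name and representation are our own choices.

```python
# A minimal sketch of the Kendall tau distance in Eqn. (1).
# Permutations are given as lists: sigma[k] is the item placed at rank k.

def kendall_tau(sigma, pi):
    """Number of item pairs on which the two rankings disagree."""
    n = len(sigma)
    # position of each item under sigma
    sigma_inv = {item: rank for rank, item in enumerate(sigma)}
    dist = 0
    for i in range(n):
        for j in range(i + 1, n):
            # pi ranks pi[i] above pi[j]; count a disagreement if sigma reverses them
            if sigma_inv[pi[i]] > sigma_inv[pi[j]]:
                dist += 1
    return dist

print(kendall_tau([1, 2, 3], [3, 2, 1]))  # 3: all pairs disagree
```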

Table 1: Examples of the LB divergences. I(.) is the Indicator fn.

As we shall see in the next section, this divergence has a number of properties akin to the standard permutation based distance metrics. Since a large class of submodular functions satisfy the above property (of having all possible extreme points), the Lovász-Bregman divergence forms a large class of divergences.

The case when y is not totally ordered can be handled similarly [20].

2.4 Lovász-Bregman Divergence Examples

Below is a partial list of some instances of the Lovász-Bregman divergence. We shall see that a number of these are closely related to many standard permutation based metrics. Table 1 considers several other examples of LB divergences.

Cut function and symmetric submodular functions: A fundamental submodular function, which is also symmetric, is the graph cut function. This is f(X) = Σ_{i∈X} Σ_{j∈V\X} d_ij. The Lovász extension of f is f̂(x) = Σ_{i,j} d_ij (x_i − x_j)_+ [2]. The LB divergence corresponding to f̂ then has a nice form:

d_f̂(x||σ) = Σ_{i,j: i<j} d_{σ(i)σ(j)} |x_{σ(i)} − x_{σ(j)}| I(σ_x^{-1}σ(i) > σ_x^{-1}σ(j))    (10)

With d_ij = 1 for all i, j, this is a score-weighted analogue of the Kendall τ, which in this notation is d_T(σ_x, σ) = Σ_{i,j: i<j} I(σ_x^{-1}σ(i) > σ_x^{-1}σ(j)).

Another natural class of examples comes from functions of the cardinality, f(X) = g(|X|) for a concave function g; this form induces an interesting class of Lovász-Bregman divergences. In this case h^f_{σ_x}(σ_x(i)) = g(i) − g(i − 1). Define δg(i) = g(i) − g(i − 1); then:

d_f̂(x||σ) = Σ_{i=1}^n x[σ_x(i)] δg(i) − Σ_{i=1}^n x[σ(i)] δg(i).    (11)

Notice that we can start with any δg such that δg(1) ≥ δg(2) ≥ ··· ≥ δg(n), and through this we can obtain the corresponding function g. Consider a specific example, with δg(i) = n − i. Then d_f̂(x||σ) = Σ_{i=1}^n x[σ(i)]·i − Σ_{i=1}^n x[σ_x(i)]·i = ⟨x, σ^{-1} − σ_x^{-1}⟩. This expression looks similar to Spearman's footrule (Eqn. (2)), except for being additionally weighted by x.

We can also extend this in several ways. For example, consider a restriction to the top m elements (m < n). Define f(X) = min{g(|X|), g(m)}. Then it is not hard to verify that:

d_f̂(x||σ) = Σ_{i=1}^m x[σ_x(i)] δg(i) − Σ_{i=1}^m x[σ(i)] δg(i).    (12)

A specific example is f(X) = min{|X|, m}, where

d_f̂(x||σ) = Σ_{i=1}^m x[σ_x(i)] − Σ_{i=1}^m x[σ(i)].    (13)

2.5 Lovász-Bregman as ranking measures

Given a set G of relevant items and a set B of irrelevant items, consider the pairwise ranking measure

(1/(|G||B|)) Σ_{g∈G, b∈B} I(σ^{-1}(g) > σ^{-1}(b)),    (15)

which counts the fraction of relevant-irrelevant pairs placed in the wrong order by σ, a quantity closely tied to the AUC [35]. This can be seen as an instance of the LB divergence corresponding to the cut function by choosing d_ij = 1/(|G||B|), ∀i, j, x_g = 1, ∀g ∈ G and x_b = 0, ∀b ∈ B.
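As an illustration (ours, not the authors' code), the sketch below evaluates the two example divergences above, the cardinality-based form of Eqn. (11) and the cut-based form of Eqn. (10), on small score vectors; the variable names and the particular choice of δg are our own.

```python
# Illustrative sketches of the LB divergence examples above.
import numpy as np

def lb_cardinality(x, sigma, delta_g):
    """Eqn. (11): d(x||sigma) = sum_i x[sigma_x(i)] dg(i) - sum_i x[sigma(i)] dg(i).
    sigma lists item indices in ranked order; sigma_x sorts x in decreasing order."""
    x = np.asarray(x, dtype=float)
    sigma_x = np.argsort(-x)                     # ordering induced by the scores
    return float(np.dot(x[sigma_x], delta_g) - np.dot(x[np.asarray(sigma)], delta_g))

def lb_cut(x, sigma, d):
    """Eqn. (10) with pair weights d[i][j]: penalize pairs that sigma orders one way
    but the scores x order the other way, weighted by the score gap."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    rank_x = np.empty(n, dtype=int)
    rank_x[np.argsort(-x)] = np.arange(n)        # position of each item under sigma_x
    total = 0.0
    for a in range(n):
        for b in range(a + 1, n):
            i, j = sigma[a], sigma[b]            # sigma places item i above item j
            if rank_x[i] > rank_x[j]:            # the scores reverse this pair
                total += d[i][j] * abs(x[i] - x[j])
    return total

x = [0.9, 0.1, 0.5]
sigma = [2, 0, 1]                                 # a candidate ordering (best first)
delta_g = np.array([2.0, 1.0, 0.0])               # dg(i) = n - i for a concave g(|X|)
d = np.ones((3, 3))
print(lb_cardinality(x, sigma, delta_g), lb_cut(x, sigma, d))
```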
3 Lovász-Bregman Properties

In this section, we shall analyze some interesting properties of the LB divergences. While many of these properties show strong similarities with permutation based metrics, the Lovász-Bregman divergence enjoys some unique properties, thereby providing novel insight into the problem of combining and clustering ordered vectors.

Non-negativity and convexity: The LB divergence is a divergence, in that ∀x, σ, d_f̂(x||σ) ≥ 0. Additionally, if the submodular polyhedron of f has all possible extreme points, d_f̂(x||σ) = 0 iff σ = σ_x. Also, the Lovász-Bregman divergence d_f̂(x||σ) is convex in x for a given σ.

Equivalence Classes: The LB divergences of submodular functions which differ only in a modular term are equal. Hence, for a submodular function f and a modular function m, the divergences induced by f and f + m coincide.

Corollary 3.1.1. Given a submodular function f such that f(X) = g(|X|) for some function g, then d_f̂(x||σ) = d_f̂(τx||τσ) for any permutation τ.

This follows directly from Eqn. (11) and observing that the extreme points of the corresponding polyhedron are reorderings of each other. In other words, in these cases the submodular polyhedron forms a permutahedron. This property is true even for sums of such functions and therefore for many of the special cases which we have considered.

Dependence on the values and not just the orderings: We shall here analyze one key property of the LB divergence that is not present in other permutation based divergences. Consider the problem of combining rankings where, given a collection of scores x^1, ···, x^n, we want to come up with a joint ranking. An extreme case of this is where for some x all the elements are the same. In this case x expresses no preference in the joint ranking. Indeed it is easy to verify that for such an x, d_f̂(x||σ) = 0, ∀σ. Now given an x where all the elements are almost equal (but not exactly equal), even though this vector is totally ordered, it expresses a very low confidence in its ordering. We would expect for such an x, d_f̂(x||σ) to be small for every σ. Indeed we have the following result:

[Figure 1 panels: (a) LB2D, (b) LB3D1, (c) LB3D2, (d) KT2D, (e) KT3D1, (f) KT3D2.]

Figure 1: A visualization of d_f̂(x||σ) (left three) and d_T(σ_x, σ) (right three). The figure shows a visualization in 2D, and two views in 3D for each, with σ as {1, 2} and {1, 2, 3} and x ∈ [0, 1]^2 and [0, 1]^3 respectively.
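A rough sketch (ours, not the original figure code) of how a 2-D panel like those in Figure 1 can be produced: for two items and the cut function with d_ij = 1, the divergence to the ordering "item 1 above item 2" reduces to max(0, x_2 − x_1); the file name and plotting details are arbitrary.

```python
# Illustrative only: a 2-D picture in the spirit of Figure 1 (cut-based LB divergence).
import numpy as np
import matplotlib.pyplot as plt

xs = np.linspace(0, 1, 101)
X1, X2 = np.meshgrid(xs, xs)
# For two items with sigma = (item 1 above item 2), Eqn. (10) reduces to max(0, x2 - x1):
D = np.maximum(0.0, X2 - X1)

plt.imshow(D, origin="lower", extent=[0, 1, 0, 1])
plt.xlabel("x1"); plt.ylabel("x2"); plt.colorbar(label="d(x || sigma)")
plt.title("Cut-based LB divergence to sigma = (1, 2)")
plt.savefig("lb_2d.png")
```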

Theorem 3.2. Given a monotone submodular function f and any permutation σ,

d_f̂(x||σ) ≤ nε (max_j f(j) − min_j f(j | V\j)) ≤ nε max_j f(j),

where ε = max_{i,j} |x_i − x_j| and f(j|A) = f(A ∪ j) − f(A).

The above theorem implies that if the vector x is such that all its elements are almost equal, then ε is small and the LB divergence is also proportionately small. This bound can be improved in certain cases. For example, for the cut function with f(X) = |X||V\X|, we have that d_f̂(x||σ) ≤ ε d_T(σ_x, σ) ≤ ε n(n − 1)/2, where d_T is the Kendall τ.

Priority for higher rankings: We show yet another nice property of the LB divergence with respect to a natural priority in rankings. This property has to do intrinsically with the submodularity of the generator function. We have the following lemma, which demonstrates this:

Lemma 3.1. Given permutations σ, π such that P(σ) and P(π) share a face (say S_k^σ ≠ S_k^π) and x ∈ P(π), then d_f̂(x||σ) = (x_k − x_{k+1})(f(σ_x(k) | S_{k−1}^σ) − f(σ_x(k) | S_k^σ)).

This result directly follows from the definitions. Now consider the class of submodular functions f such that ∀j, k ∉ X, j ≠ k, f(j|S) − f(j|S ∪ k) is monotone decreasing as a function of S. An example of such a submodular function is again f(X) = g(|X|), for a concave function g. Then it is clear from the above lemma that d_f̂(x||σ) will be larger for smaller k. In other words, if π and σ differ in the starting of the ranking, the divergence is more than if π and σ differ somewhere towards the end of the ranking. This kind of weighting is more prominent for the class of functions which depend on the cardinality, i.e., f(X) = g(|X|). Recall that many of our special cases belong to this class. Then we have that d_f̂(x||σ) = Σ_{i=1}^n {x[σ_x(i)] − x[σ(i)]} δg(i). Now since δg(1) ≥ δg(2) ≥ ··· ≥ δg(n), it then follows that if σ_x and σ differ in the start of the ranking, they are penalized more.

Extensions to partial orderings and top m-lists: So far we considered notions of distance between a score x and a complete permutation σ. Often we may not be interested in a distance to a total ordering σ, but just a distance to, say, a top-m list [26] or a partial ordering between elements [38, 8]. The LB divergence also has a nice interpretation for both of these. In particular, in the context of top m lists, we can use Eqn. (12). This exactly corresponds to the divergence between different or possibly overlapping sets of m objects. Moreover, if we are simply interested in the top m elements without the orderings, we have Eqn. (13). A special case of this is when we may just be interested in the top value. Another interesting instance is that of partial orderings, where we do not care about the total ordering. For example, in web ranking we often just care about the relevant and irrelevant documents and that the relevant ones should be placed above the irrelevant ones. We can then define a distance d_f̂(x||P), where P refers to a partial ordering, by using the cut based Lovász-Bregman divergence (Eqn. (10)) and defining the graph to have edges corresponding to the partial ordering. For example, if we are interested in a partial order 1 > 2, 3 > 2 in the elements {1, 2, 3, 4}, we can define d_{1,2} = d_{3,2} = 1 with the rest d_ij = 0 in Eqn. (10); a small sketch of this construction follows below. Defined in this way, the LB divergence then measures the distortion between a vector x and the partial ordering 1 > 2, 3 > 2. In all these cases, we see that the extensions to partial rankings are natural in our framework, without needing to significantly change the expressions to admit these generalizations.
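To make the partial-order construction concrete, here is a small sketch based on our own reading of the construction above (not code from the paper): encoding each required pair a ≻ b as an edge weight d_{a,b} = 1, the cut-based divergence of Eqn. (10) reduces to summing, over the required pairs, the score gap by which x violates the pair. The function and variable names are ours.

```python
# Our reading of d(x || P) for a partial order P via the cut-based LB divergence:
# each required pair (a, b), meaning "a should be ranked above b", contributes
# max(0, x[b] - x[a]), the margin by which the scores violate that pair.

def partial_order_divergence(x, required_pairs):
    return sum(max(0.0, x[b] - x[a]) for (a, b) in required_pairs)

# Partial order 1 > 2 and 3 > 2 over elements {1, 2, 3, 4} (0-indexed here as 0..3).
pairs = [(0, 1), (2, 1)]
print(partial_order_divergence([0.9, 0.2, 0.7, 0.5], pairs))  # 0.0: both pairs respected
print(partial_order_divergence([0.2, 0.9, 0.7, 0.5], pairs))  # 0.7 + 0.2, both pairs violated
```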
Lovász-Mallows model: In this section, we extend the notion of the Mallows model to the LB divergence. We first define the Mallows model for the LB divergence:

p(x|θ, σ) = exp(−θ d_f̂(x||σ)) / Z(θ, σ),   θ ≥ 0.    (16)

For this distribution to be a valid probability distribution, we assume the domain D of x to be a bounded set (say, for example, [0, 1]^n). We also assume that the domain is symmetric over permutations (i.e., for all σ ∈ Σ, if x ∈ D, then xσ ∈ D). Unlike the standard Mallows model, however, this is defined over scores (or valuations) as opposed to permutations.

Given the class of LB divergences defining a probability distribution over such a symmetric set (i.e., the divergences are invariant over relabelings), it follows that Z(θ, σ) = Z(θ). The reason for this is:

Z(θ, σ) = ∫_x exp(−θ d_f̂(x||σ)) dx = ∫_x exp(−θ d_f̂(xσ^{-1}||σ_0)) dx = ∫_{x'} exp(−θ d_f̂(x'||σ_0)) dx' = Z(θ),

where σ_0 = {1, 2, ···, n}. We can also define an extended Mallows model for combining rankings, analogous to [27]. Unlike the Mallows model, however, this is a model over permutations given a collection of vectors X = {x_1, ···, x_n} and parameters Θ = {θ_1, ···, θ_n}:

p(σ|Θ, X) = exp(−Σ_{i=1}^n θ_i d_f̂(x_i||σ)) / Z(Θ, X)    (17)

This model can be used to combine rankings using the LB divergences, in a manner akin to Cranking [27]. This extended Lovász-Mallows model also admits an interesting Bayesian interpretation, thereby providing a generative view of this model:

p(σ|Θ, X) ∝ p(σ) Π_{i=1}^n p(x_i|σ, θ_i).    (18)

Again this directly follows from the fact that in this case, in the Lovász-Mallows model, the normalizing constants (which are independent of σ) cancel out. We shall actually see some very interesting connections between this conditional model and web ranking.
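As a toy illustration of combining rankings with the extended model of Eqn. (17) (our own sketch, not the authors' code), one can brute-force the most probable permutation for a handful of items by scoring every permutation with the unnormalized exponent; we reuse the cardinality-based divergence of Eqn. (11) with δg(i) = n − i, and all names and data are our own.

```python
# Toy brute-force MAP under the extended Lovász-Mallows model of Eqn. (17);
# only feasible for a handful of items, and purely illustrative.
from itertools import permutations
import numpy as np

def lb_cardinality(x, sigma, delta_g):
    """Eqn. (11) with a fixed decreasing weight vector delta_g."""
    x = np.asarray(x, dtype=float)
    sigma_x = np.argsort(-x)
    return float(np.dot(x[sigma_x], delta_g) - np.dot(x[list(sigma)], delta_g))

def map_permutation(scores, thetas):
    """argmax_sigma exp(-sum_i theta_i d(x_i || sigma)) over all permutations."""
    n = len(scores[0])
    delta_g = np.arange(n - 1, -1, -1, dtype=float)   # dg(i) = n - i
    best, best_cost = None, np.inf
    for sigma in permutations(range(n)):
        cost = sum(t * lb_cardinality(x, sigma, delta_g) for x, t in zip(scores, thetas))
        if cost < best_cost:
            best, best_cost = sigma, cost
    return best

scores = [[0.9, 0.1, 0.5], [0.8, 0.3, 0.4], [0.2, 0.25, 0.22]]  # three voters' scores
print(map_permutation(scores, thetas=[1.0, 1.0, 1.0]))          # e.g. (0, 2, 1)
```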
4 Applications

Rank Aggregation: As argued above, the LB divergence is a natural model for the problem of combining scores, where both the ordering and the valuations are provided. If we ignore the values, but just consider the rankings, this then becomes rank aggregation. A natural choice in such problems is the Kendall τ distance [27, 26, 31]. On the other hand, if we consider only the values without explicitly modeling the orderings, then this becomes an incarnation of boosting [16]. The Lovász-Bregman divergence tries to combine both aspects of this problem – by combining orderings using a permutation based divergence, while simultaneously using the additional information of the confidence in the orderings provided by the valuations. We can then pose this problem as:

σ ∈ argmin_{σ' ∈ Σ} Σ_{i=1}^n d_f̂(x^i||σ')    (19)

The above notion of the representative ordering (also known as the mean ordering) is very common in many applications [3] and has also been used in the context of combining rankings [31, 27, 26]. Unfortunately, this problem in the context of the permutation based metrics was shown to be NP hard [4]. Surprisingly, for the LB divergence this problem is easy (and has a closed form). In particular, the representative permutation is exactly the ordering corresponding to the arithmetic mean of the elements in X.

Lemma 4.1. [19] Given a submodular function f, the Lovász-Bregman representative (Eqn. (19)) is σ = σ_μ, where μ = (1/n) Σ_{i=1}^n x_i.

This result builds on the known result for Bregman divergences [3]. This seems somewhat surprising at first. Notice, however, that the arithmetic mean uses additional information about the scores and its confidence, as opposed to just the orderings. In this context, the result then seems reasonable, since we would expect that the representatives be closely related to the ordering of the arithmetic mean of the objects. We shall also see that this notion has in fact been ubiquitously but unintentionally used in the web ranking and information retrieval communities.

We illustrate the utility of the Lovász-Bregman rank aggregation through the following argument. Assume that a particular vector x is uninformative about the true ordering (i.e., the values of x are almost equal). Then with the LB divergence and any permutation π, d_f̂(x||π) ≈ 0, and hence this vector will not contribute to the mean ordering. Instead, if we use a permutation based metric, it will ignore the values but consider only the permutation. As a result, the mean ordering tends to consider such vectors x which are uninformative about the true ordering. As an example, consider a set of scores: X = {(1.9, 2), (1.8, 2), (1.95, 2), (2, 1), (2.5, 1.2)}. The representative of this collection as seen by a permutation based metric would be the permutation {1, 2}, though the former three vectors have very low confidence. The arithmetic mean of these vectors is however (2.03, 1.64), and the Lovász-Bregman representative would be {2, 1}.

The arithmetic mean also provides a notion of confidence of the population. In particular, if the total variation [2] of the arithmetic mean is small, it implies that the population is not confident about its ordering, while if the variation is high, it provides a certificate of a homogeneous population. Figure 1 provides a visualization of the Lovász-Bregman divergence using the cut function and the Kendall τ metric, visualized in 2 and 3 dimensions respectively. We see the similarity between the two divergences and at the same time, the dependence on the "scores" in the Lovász-Bregman case.

Learning to Rank: We investigate a specific instance of the rank aggregation problem with reference to the problem of "learning to rank." A large class of algorithms have been proposed for this problem – see [28] for a survey. A specific class of algorithms for this problem have focused on maximum margin learning using ranking based loss functions (see [38, 8] and references therein). While we have seen that the ranking based losses themselves are instances of the LB divergence, the feature functions are also closely related.

In particular, given a query q, we denote a feature vector corresponding to document i ∈ {1, 2, ···, n} as x_i ∈ R^d, where each element of x_i denotes a quality of document i based on a particular indicator or feature. Denote X = {x_1, ···, x_n}. We assume we have d feature functions (one might be, for example, a match with the title, another might be PageRank, etc.). Denote x_i^j as the score of the jth feature corresponding to document i and x^j ∈ R^n as the score vector corresponding to feature j over all the documents. In other words, x^j = (x_1^j, x_2^j, ···, x_n^j). One possible choice of feature function is:

φ(X, σ) = Σ_{j=1}^d w_j d_f̂(x^j||σ)    (20)

for a weight vector w ∈ R^d. Given a particular weight vector w, the inference problem then is to find the permutation σ which minimizes φ(X, σ). Thanks to Lemma 4.1, the permutation σ is exactly the ordering of the vector Σ_{j=1}^d w_j x^j. It is not hard to see that this exactly corresponds to ordering the scores w^T x_i for i ∈ {1, 2, ···, n}. Interestingly, many of the feature functions used in [38, 8] are forms closely related to Eqn. (20). In fact, the motivation to define these feature functions is exactly that the inference problem for a given set of weights w be solved by simply ordering the scores w^T x_i for every i ∈ {1, 2, ···, n} [8]. We see that through Eqn. (20), we have a large class of possible feature functions for this problem.
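The inference step just described reduces to a sort; the snippet below (ours, with made-up feature values and weights) shows the two equivalent views: ordering the documents by Σ_j w_j x^j, or equivalently by the per-document scores w^T x_i.

```python
# Illustrative only: inference for Eqn. (20) by sorting, with made-up data.
import numpy as np

rng = np.random.default_rng(0)
n_docs, n_feats = 5, 3
X = rng.random((n_docs, n_feats))      # X[i, j] = score of feature j for document i
w = np.array([0.5, 0.2, 0.3])          # hypothetical learned feature weights

# View 1: aggregate the feature score vectors, sum_j w_j x^j, then order documents.
agg = X @ w
order_from_agg = np.argsort(-agg)

# View 2: order documents by w^T x_i directly; identical result.
order_from_docs = np.argsort(-np.array([w @ X[i] for i in range(n_docs)]))

assert np.array_equal(order_from_agg, order_from_docs)
print(order_from_agg)                  # the phi-minimizing permutation (best first)
```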
We also point out a connection between the learning to rank problem and the Lovász-Mallows model. In particular, recent work [12] defined a conditional probability model over permutations as:

p(σ|w, X) = exp(w^T φ(X, σ)) / Z.    (21)

This conditional model is then exactly the extended Lovász-Mallows model of Eqn. (17) when φ is defined as in Eqn. (20). The conditional models used in [12] are in fact closely related to this, and correspondingly Eqn. (21) offers a large class of conditional ranking models for the learning to rank problem.

Clustering: A natural generalization of rank aggregation is the problem of clustering. In this context, we assume a heterogeneous model, where the data is represented as mixtures of ranking models, with each mixture representing a homogeneous population. It is natural to define a clustering objective in such scenarios. Assume a set of representatives Σ = {σ_1, ···, σ_k} and a set of clusters C = {C_1, C_2, ···, C_k}. The clustering objective is then: min_{C,Σ} Σ_{j=1}^k Σ_{i: x_i ∈ C_j} d_f̂(x_i||σ_j). As shown in [19], a simple k-means style algorithm finds a local minimum of the above objective; a small sketch of one such iteration appears below. Moreover, due to the simplicity of obtaining the means in this case, this algorithm is extremely scalable and practical.
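A minimal sketch (ours, not the authors' implementation) of the k-means style alternation just mentioned: assign each score vector to the representative with smallest LB divergence, then reset each representative to the ordering of its cluster's arithmetic mean (Lemma 4.1). The divergence, initialization, and data below are illustrative assumptions.

```python
# Illustrative Lloyd-style clustering of score vectors under an LB divergence.
import numpy as np

def lb_cardinality(x, sigma, delta_g):
    """Eqn. (11) with decreasing weights delta_g; sigma is an ordering (best first)."""
    x = np.asarray(x, dtype=float)
    return float(np.dot(np.sort(x)[::-1], delta_g) - np.dot(x[np.asarray(sigma)], delta_g))

def lb_cluster(data, k, iters=20, seed=0):
    data = np.asarray(data, dtype=float)
    n, d = data.shape
    delta_g = np.arange(d - 1, -1, -1, dtype=float)          # dg(i) = d - i
    rng = np.random.default_rng(seed)
    reps = [np.argsort(-data[i]) for i in rng.choice(n, k, replace=False)]
    for _ in range(iters):
        # Assignment step: nearest representative under the LB divergence.
        assign = np.array([min(range(k), key=lambda j: lb_cardinality(x, reps[j], delta_g))
                           for x in data])
        # Update step: representative = ordering of the cluster's arithmetic mean.
        for j in range(k):
            members = data[assign == j]
            if len(members):
                reps[j] = np.argsort(-members.mean(axis=0))
    return assign, reps

data = [[0.9, 0.5, 0.1], [0.8, 0.6, 0.2], [0.1, 0.4, 0.9], [0.2, 0.3, 0.8]]
assign, reps = lb_cluster(data, k=2)
print(assign, [tuple(r) for r in reps])
```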

5 Discussion

To our knowledge, this work is the first to introduce the notion of "score based divergences" in preference and ranking based learning. Many of the results in this paper are due to some interesting properties of the Lovász extension and Bregman divergences. This also provides interesting connections between web ranking and the permutation based metrics. This idea is mildly related to the work of [36], where they use the Choquet integral (of which the Lovász extension is a special case) for preference learning. Unlike our paper, however, they do not focus on the divergences formed by the integral. Finally, it will be interesting to use these ideas in real world applications involving rank aggregation, clustering, and learning to rank.

Acknowledgments: We thank Matthai Phillipose, Stefanie Jegelka and the melodi-lab submodular group at UWEE for discussions, and the anonymous reviewers for very useful reviews. This material is based upon work supported by the National Science Foundation under Grant No. (IIS-1162606), and is also supported by a Google, a Microsoft, and an Intel research award.

References

[1] F. Bach. Structured sparsity-inducing norms through submodular functions. NIPS, 2010.

[2] F. Bach. Learning with Submodular Functions: A Convex Optimization Perspective. Arxiv, 2011.

[3] A. Banerjee, S. Merugu, I. S. Dhillon, and J. Ghosh. Clustering with Bregman divergences. JMLR, 6:1705–1749, 2005.

[4] J. Bartholdi, C. Tovey, and M. Trick. Voting schemes for which it can be difficult to tell who won the election. Social Choice and Welfare, 6(2):157–165, 1989.

[5] L. Bregman. The relaxation method of finding the common point of convex sets and its application to the solution of problems in convex programming. USSR Comput. Math and Math Physics, 7, 1967.

[6] L. Busse, P. Orbanz, and J. Buhmann. Cluster analysis of heterogeneous rank data. In ICML, volume 227, pages 113–120, 2007.

[7] Y. Censor and S. Zenios. Parallel Optimization: Theory, Algorithms, and Applications. Oxford University Press, USA, 1997.

[8] S. Chakrabarti, R. Khanna, U. Sawant, and C. Bhattacharyya. Structured learning for non-smooth ranking losses. In SIGKDD, pages 88–96. ACM, 2008.

[9] G. Choquet. Theory of capacities. Annales de l'institut Fourier, volume 5, page 87, 1953.

[10] D. Critchlow. Metric methods for analyzing partially ranked data. Lecture Notes in Statistics No. 34, Springer-Verlag, Berlin, 1985.

[11] W. H. Cunningham. Decomposition of submodular functions. Combinatorica, 3(1):53–68, 1983.

[12] A. Dubey, J. Machchhar, C. Bhattacharyya, and S. Chakrabarti. Conditional models for non-smooth ranking loss functions. In ICDM, pages 129–138, 2009.

[13] J. Edmonds. Submodular functions, matroids and certain polyhedra. Combinatorial Structures and their Applications, 1970.

[14] M. Fligner and J. Verducci. Distance based ranking models. Journal of the Royal Statistical Society, Series B (Methodological), pages 359–369, 1986.

[15] M. Fligner and J. Verducci. Multistage ranking models. Journal of the American Statistical Association, 83(403):892–901, 1988.

[16] Y. Freund, R. Iyer, R. E. Schapire, and Y. Singer. An efficient boosting algorithm for combining preferences. JMLR, 4:933–969, 2003.

[17] S. Fujishige. Submodular Functions and Optimization, volume 58. Elsevier Science, 2005.

[18] G. Gordon. Regret bounds for prediction problems. In COLT, pages 29–40. ACM, 1999.

[19] R. Iyer and J. Bilmes. The submodular Bregman and Lovász-Bregman divergences with applications. In NIPS, 2012.

[20] R. Iyer and J. Bilmes. The Lovász-Bregman divergence and connections to rank aggregation, clustering and web ranking: Extended version of UAI paper. 2013.

[21] R. Iyer, S. Jegelka, and J. Bilmes. Fast semidifferential based submodular function optimization. In ICML, 2013.

[22] K. Järvelin and J. Kekäläinen. IR evaluation methods for retrieving highly relevant documents. In SIGIR, pages 41–48. ACM, 2000.

[23] M. Kendall. A new measure of rank correlation. Biometrika, 30(1/2):81–93, 1938.

[24] K. Kirchhoff et al. Combining articulatory and acoustic information for speech recognition in noisy and reverberant environments. In ICSLP, volume 98, pages 891–894. Citeseer, 1998.

[25] K. Kiwiel. Proximal minimization methods with generalized Bregman functions. SIAM Journal on Control and Optimization, 35(4):1142–1168, 1997.

[26] A. Klementiev, D. Roth, and K. Small. Unsupervised rank aggregation with distance-based models. In ICML, 2008.

[27] G. Lebanon and J. Lafferty. Cranking: Combining rankings using conditional probability models on permutations. In ICML, 2002.

[28] T.-Y. Liu. Learning to rank for information retrieval. Foundations and Trends in Information Retrieval, 3(3):225–331, 2009.

[29] L. Lovász. Submodular functions and convexity. Mathematical Programming, 1983.

[30] C. Mallows. Non-null ranking models. I. Biometrika, 44(1/2):114–130, 1957.

[31] M. Meilă, K. Phadnis, A. Patterson, and J. Bilmes. Consensus ranking under the exponential model. In UAI, 2007.

[32] T. Murphy and D. Martin. Mixtures of distance-based models for ranking data. Computational Statistics & Data Analysis, 41(3):645–655, 2003.

[33] R. Rockafellar. Convex Analysis, volume 28. Princeton University Press, 1970.

[34] A.-V. I. Rosti, N. F. Ayan, B. Xiang, S. Matsoukas, R. Schwartz, and B. Dorr. Combining outputs from multiple machine translation systems. In NAACL-HLT, 2007.

[35] K. A. Spackman. Signal detection theory: Valuable tools for evaluating inductive learning. In Proceedings of the Sixth International Workshop on Machine Learning, 1989.

[36] A. F. Tehrani, W. Cheng, and E. Hüllermeier. Preference learning using the Choquet integral: The case of multipartite ranking. IEEE Transactions on Fuzzy Systems, 2012.

[37] M. Telgarsky and S. Dasgupta. Agglomerative Bregman clustering. In ICML, 2012.

[38] Y. Yue, T. Finley, F. Radlinski, and T. Joachims. A support vector method for optimizing average precision. In SIGIR. ACM, 2007.