
Foundations of Machine Learning: Ranking

Mehryar Mohri
Courant Institute and Google Research
[email protected]

Motivation

Very large data sets:
• too large to display or process.
• limited resources, need priorities.
• ranking more desirable than classification.
Applications:
• search engines, information extraction.
• decision making, auctions, fraud detection.
Can we learn to predict rankings accurately?

Related Problem

Rank aggregation: given n candidates and k voters, each giving a ranking of the candidates, find an ordering as close as possible to these.
• closeness measured in number of pairwise misrankings.
• problem NP-hard even for k = 4 (Dwork et al., 2001).

This Talk

• Score-based ranking
• Preference-based ranking

Score-Based Setting

Single stage: learning algorithm
• receives labeled sample of pairwise preferences;
• returns scoring function h : U → R.
Drawbacks:
• h induces a linear ordering for the full set U.
• does not match a query-based scenario.
Advantages:
• efficient algorithms.
• good theory: VC bounds, margin bounds, stability bounds (FISS 03, RCMS 05, AN 05, AGHHR 05, CMR 07).

Score-Based Ranking

Training data: sample of i.i.d. labeled pairs drawn from U × U according to some distribution D,

S = \big((x_1, x'_1, y_1), \ldots, (x_m, x'_m, y_m)\big) \in (U \times U \times \{-1, 0, +1\})^m,

with y_i = +1 if x'_i >_{pref} x_i, y_i = 0 if x'_i =_{pref} x_i or no information, and y_i = −1 if x_i >_{pref} x'_i.

Problem: find hypothesis h : U → R in H with small generalization error

R(h) = \Pr_{(x, x') \sim D}\Big[\big(f(x, x') \neq 0\big) \wedge \big(f(x, x')\,(h(x') - h(x)) \le 0\big)\Big],

where f(x, x') denotes the target label of the pair (x, x').

Notes

Empirical error:

\hat{R}(h) = \frac{1}{m} \sum_{i=1}^{m} 1_{(y_i \neq 0) \wedge (y_i (h(x'_i) - h(x_i)) \le 0)}.

The relation x R x' ⟺ f(x, x') = 1 may be non-transitive (it need not even be anti-symmetric).
Problem different from classification.
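To make the empirical pairwise error concrete, here is a minimal Python sketch (an illustration added here, not part of the original slides); the names score_fn and pairs are assumed for the example.

import numpy as np

def empirical_pairwise_error(score_fn, pairs):
    """Empirical ranking error over labeled pairs.

    pairs: iterable of (x, x_prime, y) with y in {-1, 0, +1};
    y = +1 means x_prime is preferred to x.
    A pair counts as misranked when y != 0 and
    y * (score_fn(x_prime) - score_fn(x)) <= 0.
    """
    errors, total = 0, 0
    for x, x_prime, y in pairs:
        total += 1
        if y != 0 and y * (score_fn(x_prime) - score_fn(x)) <= 0:
            errors += 1
    return errors / total

# usage sketch: a linear scorer on toy 2-d points
w = np.array([1.0, -0.5])
score = lambda x: float(w @ x)
toy_pairs = [(np.array([0.0, 1.0]), np.array([1.0, 0.0]), +1)]
print(empirical_pairwise_error(score, toy_pairs))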

Distributional Assumptions

Distribution over points: m points (literature).
• labels for pairs.
• squared number of examples O(m²).
• dependency issue.
Distribution over pairs: m pairs.
• label for each pair received.
• independence assumption.
• same (linear) number of examples.

Confidence Margin in Ranking

Labels assumed to be in {−1, +1}.
Empirical margin loss for ranking: for ρ > 0,

\hat{R}_\rho(h) = \frac{1}{m} \sum_{i=1}^{m} \Phi_\rho\big(y_i\,(h(x'_i) - h(x_i))\big),

\hat{R}_\rho(h) \le \frac{1}{m} \sum_{i=1}^{m} 1_{y_i [h(x'_i) - h(x_i)] \le \rho}.
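For reference, one standard choice of the margin loss function Φ_ρ consistent with the inequality above (an assumed definition following the usual ρ-margin loss; it is not spelled out on the slide) is

\Phi_\rho(u) = \min\big(1, \max(0, 1 - u/\rho)\big),

so that \Phi_\rho(u) = 1 for u \le 0 and \Phi_\rho(u) = 0 for u \ge \rho.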

Marginal Rademacher Complexities

Distributions:
• D_1: marginal distribution with respect to the first element of the pairs;
• D_2: marginal distribution with respect to the second element of the pairs.

Samples:

S_1 = ((x_1, y_1), \ldots, (x_m, y_m)),  S_2 = ((x'_1, y_1), \ldots, (x'_m, y_m)).

Marginal Rademacher complexities:

\mathfrak{R}_m^{D_1}(H) = E\big[\hat{\mathfrak{R}}_{S_1}(H)\big], \qquad \mathfrak{R}_m^{D_2}(H) = E\big[\hat{\mathfrak{R}}_{S_2}(H)\big].

Ranking Margin Bound (Boyd, Cortes, MM, and Radovanovic, 2012; MM, Rostamizadeh, and Talwalkar, 2012)

Theorem: let H be a family of real-valued functions. Fix ρ > 0; then, for any δ > 0, with probability at least 1 − δ over the choice of a sample of size m, the following holds for all h ∈ H:

R(h) \le \hat{R}_\rho(h) + \frac{2}{\rho}\Big(\mathfrak{R}_m^{D_1}(H) + \mathfrak{R}_m^{D_2}(H)\Big) + \sqrt{\frac{\log \frac{1}{\delta}}{2m}}.

Ranking with SVMs (see for example (Joachims, 2002))

Optimization problem: application of SVMs.

\min_{w, \xi} \; \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{m} \xi_i

subject to:  y_i\, w \cdot \big(\Phi(x'_i) - \Phi(x_i)\big) \ge 1 - \xi_i,\quad \xi_i \ge 0,\; i \in [1, m].

Decision function: h : x \mapsto w \cdot \Phi(x) + b.

Notes

The algorithm coincides with SVMs using feature mapping

\Psi : (x, x') \mapsto \Psi(x, x') = \Phi(x') - \Phi(x).

Can be used with kernels:

K'\big((x_i, x'_i), (x_j, x'_j)\big) = \Psi(x_i, x'_i) \cdot \Psi(x_j, x'_j)
 = K(x_i, x_j) + K(x'_i, x'_j) - K(x'_i, x_j) - K(x_i, x'_j).

Algorithm directly based on the margin bound.
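As an illustration of this reduction, the following minimal sketch (not from the slides) trains a linear SVM on explicit difference vectors Φ(x') − Φ(x); the use of scikit-learn's LinearSVC and the identity feature map Φ(x) = x are choices made only for the example, in place of the kernel formulation above.

import numpy as np
from sklearn.svm import LinearSVC

def fit_rank_svm(X, X_prime, y, C=1.0):
    """Pairwise ranking with a linear SVM.

    X, X_prime: arrays of shape (m, d), the two elements of each pair.
    y: array of shape (m,) with values in {-1, +1}
       (+1 means the second element is preferred).
    Trains a classifier on the difference vectors Phi(x') - Phi(x),
    here with the identity feature map Phi(x) = x.
    """
    diffs = X_prime - X
    clf = LinearSVC(C=C, fit_intercept=False)  # no bias: b cancels in h(x') - h(x)
    clf.fit(diffs, y)
    w = clf.coef_.ravel()
    return lambda x: float(w @ x)  # scoring function h(x) = w . x

# usage sketch on toy data
rng = np.random.default_rng(0)
X, Xp = rng.normal(size=(50, 3)), rng.normal(size=(50, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = np.sign(Xp @ w_true - X @ w_true).astype(int)
h = fit_rank_svm(X, Xp, y)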

Boosting for Ranking

Use a weak ranking algorithm to create a stronger ranking algorithm.
Ensemble method: combine base rankers returned by the weak ranking algorithm.
Finding simple, relatively accurate base rankers is often not hard.
How should base rankers be combined?

CD RankBoost (Freund et al., 2003; Rudin et al., 2005)

H ⊆ {0, 1}^X. For s ∈ {−1, 0, +1}, ε_t^s(h) = \Pr_{(x, x') \sim D_t}\big[\mathrm{sgn}\big(f(x, x')(h(x') - h(x))\big) = s\big], so that ε_t^+ + ε_t^- + ε_t^0 = 1.

RankBoost(S = ((x_1, x'_1, y_1), ..., (x_m, x'_m, y_m)))
1  for i ← 1 to m do
2      D_1(x_i, x'_i) ← 1/m
3  for t ← 1 to T do
4      h_t ← base ranker in H with smallest ε_t^- − ε_t^+ = −E_{i∼D_t}[y_i (h_t(x'_i) − h_t(x_i))]
5      α_t ← (1/2) log(ε_t^+ / ε_t^-)
6      Z_t ← ε_t^0 + 2[ε_t^+ ε_t^-]^{1/2}    ▷ normalization factor
7      for i ← 1 to m do
8          D_{t+1}(x_i, x'_i) ← D_t(x_i, x'_i) exp(−α_t y_i [h_t(x'_i) − h_t(x_i)]) / Z_t
9  f_T ← Σ_{t=1}^{T} α_t h_t
10 return f_T
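The boosting loop above can be sketched in a few lines of Python (an illustration, not the authors' implementation); the use of {0,1}-valued base rankers supplied by the caller and the variable names are assumptions of the example.

import numpy as np

def rankboost(X, X_prime, y, base_rankers, T=50, eps=1e-12):
    """RankBoost sketch over labeled pairs (x_i, x'_i, y_i), y_i in {-1, +1}.

    base_rankers: list of callables h(x) -> {0, 1}.
    Returns a scoring function f(x) = sum_t alpha_t h_t(x).
    """
    m = len(y)
    D = np.full(m, 1.0 / m)                  # distribution over pairs
    alphas, chosen = [], []
    for _ in range(T):
        # pick the base ranker with largest E_{i~D}[ y_i (h(x'_i) - h(x_i)) ]
        best_h, best_edge, best_diff = None, -np.inf, None
        for h in base_rankers:
            diff = np.array([h(xp) - h(x) for x, xp in zip(X, X_prime)])
            edge = float(np.sum(D * y * diff))   # eps_plus - eps_minus
            if edge > best_edge:
                best_h, best_edge, best_diff = h, edge, diff
        eps_plus = float(np.sum(D * (y * best_diff > 0)))
        eps_minus = float(np.sum(D * (y * best_diff < 0)))
        alpha = 0.5 * np.log((eps_plus + eps) / (eps_minus + eps))
        # reweight pairs and normalize (division by the sum plays the role of Z_t)
        D = D * np.exp(-alpha * y * best_diff)
        D /= D.sum()
        alphas.append(alpha)
        chosen.append(best_h)
    return lambda x: sum(a * h(x) for a, h in zip(alphas, chosen))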

Notes

Distributions D_t over pairs of sample points:
• originally uniform.
• at each round, the weight of a misclassified example is increased.
• observation:

D_{t+1}(x, x') = \frac{e^{-y \sum_{s=1}^{t} \alpha_s [h_s(x') - h_s(x)]}}{|S| \prod_{s=1}^{t} Z_s}, \quad \text{since}

D_{t+1}(x, x') = \frac{D_t(x, x')\, e^{-\alpha_t y [h_t(x') - h_t(x)]}}{Z_t} = \frac{\frac{1}{|S|}\, e^{-y \sum_{s=1}^{t} \alpha_s [h_s(x') - h_s(x)]}}{\prod_{s=1}^{t} Z_s}.

Weight assigned to base classifier h_t: α_t directly depends on the accuracy of h_t at round t.

Coordinate Descent RankBoost

Objective Function: convex and differentiable.

F(\bar{\alpha}) = \sum_{(x, x', y) \in S} e^{-y\,[f_T(x') - f_T(x)]} = \sum_{(x, x', y) \in S} \exp\Big(-y \sum_{t=1}^{T} \alpha_t\,[h_t(x') - h_t(x)]\Big), \quad \text{where } f_T = \sum_{t=1}^{T} \alpha_t h_t.

[Figure: the exponential function e^{-x} upper-bounds the 0-1 pairwise loss.]

• Direction: unit vector e_t with

e_t = \mathrm{argmin}_t \; \frac{dF(\bar{\alpha} + \eta\, e_t)}{d\eta}\Big|_{\eta = 0}.

• Since F(\bar{\alpha} + \eta\, e_t) = \sum_{(x, x', y) \in S} e^{-y \sum_{s=1}^{T} \alpha_s [h_s(x') - h_s(x)]}\; e^{-\eta\, y [h_t(x') - h_t(x)]},

\frac{dF(\bar{\alpha} + \eta\, e_t)}{d\eta}\Big|_{\eta = 0}
= -\sum_{(x, x', y) \in S} y\,[h_t(x') - h_t(x)] \exp\Big(-y \sum_{s=1}^{T} \alpha_s [h_s(x') - h_s(x)]\Big)
= -\sum_{(x, x', y) \in S} y\,[h_t(x') - h_t(x)]\, D_{T+1}(x, x')\; m \prod_{s=1}^{T} Z_s
= -\big[\epsilon_t^+ - \epsilon_t^-\big]\, m \prod_{s=1}^{T} Z_s.

Thus, the direction corresponds to the base classifier selected by the algorithm.

• Step size: obtained via

\frac{dF(\bar{\alpha} + \eta\, e_t)}{d\eta} = 0
⟺ -\sum_{(x, x', y) \in S} y\,[h_t(x') - h_t(x)] \exp\Big(-y \sum_{s=1}^{T} \alpha_s [h_s(x') - h_s(x)]\Big)\, e^{-\eta\, y [h_t(x') - h_t(x)]} = 0
⟺ -\sum_{(x, x', y) \in S} y\,[h_t(x') - h_t(x)]\, D_{T+1}(x, x')\; m \prod_{s=1}^{T} Z_s\; e^{-\eta\, y [h_t(x') - h_t(x)]} = 0
⟺ -\sum_{(x, x', y) \in S} y\,[h_t(x') - h_t(x)]\, D_{T+1}(x, x')\, e^{-\eta\, y [h_t(x') - h_t(x)]} = 0
⟺ -\big[\epsilon_t^+ e^{-\eta} - \epsilon_t^- e^{\eta}\big] = 0
⟺ \eta = \frac{1}{2} \log \frac{\epsilon_t^+}{\epsilon_t^-}.

Thus, step size matches base classifier weight used in algorithm.

Bipartite Ranking

Training data:
• sample of negative points drawn according to D_-:
S_- = (x_1, \ldots, x_m) ⊆ U.
• sample of positive points drawn according to D_+:
S_+ = (x'_1, \ldots, x'_{m'}) ⊆ U.
Problem: find hypothesis h : U → R in H with small generalization error

R_D(h) = \Pr_{x \sim D_-,\; x' \sim D_+}\big[h(x') \le h(x)\big].

Notes

Connection between AdaBoost and RankBoost (Cortes & MM, 04; Rudin et al., 05):
• if a constant base ranker is used.
• relationship between objective functions.
More efficient algorithm in this special case (Freund et al., 2003).
Bipartite ranking results typically reported in terms of the AUC.

AdaBoost and CD RankBoost

Objective functions: comparison.

F_{Ada}(\bar{\alpha}) = \sum_{x_i \in S_- \cup S_+} \exp(-y_i f(x_i))
 = \sum_{x_i \in S_-} \exp(+f(x_i)) + \sum_{x_i \in S_+} \exp(-f(x_i)) = F_-(\bar{\alpha}) + F_+(\bar{\alpha}).

F_{Rank}(\bar{\alpha}) = \sum_{(i, j) \in S_- \times S_+} \exp\big(-[f(x_j) - f(x_i)]\big)
 = \sum_{(i, j) \in S_- \times S_+} \exp(+f(x_i)) \exp(-f(x_j)) = F_-(\bar{\alpha})\, F_+(\bar{\alpha}).

AdaBoost and CD RankBoost (Rudin et al., 2005)

Property: AdaBoost (non-separable case).
• constant base learner h = 1 gives equal contribution of positive and negative points (in the limit).
• consequence: AdaBoost asymptotically achieves the optimum of the CD RankBoost objective.

Observations: if F_+(\bar{\alpha}) = F_-(\bar{\alpha}), then

d(F_{Rank}) = F_+\, d(F_-) + F_-\, d(F_+) = F_+\big[d(F_-) + d(F_+)\big] = F_+\, d(F_{Ada}).

Bipartite RankBoost - Efficiency

Decomposition of the distribution: for (x, x') ∈ (S_-, S_+),

D(x, x') = D_-(x)\, D_+(x').

Thus,

D_{t+1}(x, x') = \frac{D_t(x, x')\, e^{-\alpha_t [h_t(x') - h_t(x)]}}{Z_t}

= \frac{D_{t,-}(x)\, e^{+\alpha_t h_t(x)}}{Z_{t,-}} \cdot \frac{D_{t,+}(x')\, e^{-\alpha_t h_t(x')}}{Z_{t,+}},

with Z_{t,-} = \sum_{x \in S_-} D_{t,-}(x)\, e^{+\alpha_t h_t(x)} and Z_{t,+} = \sum_{x' \in S_+} D_{t,+}(x')\, e^{-\alpha_t h_t(x')}.
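This factorization is what makes the bipartite case efficient: only the two marginal distributions need to be maintained. A brief sketch (illustrative only) of one round of the update:

import numpy as np

def bipartite_update(D_neg, D_pos, h_neg, h_pos, alpha):
    """One round of the bipartite RankBoost weight update (sketch).

    D_neg, D_pos: current marginal distributions over S_- and S_+.
    h_neg, h_pos: base-ranker values h_t(x) on S_- and S_+.
    The pair distribution D_t(x, x') = D_neg(x) * D_pos(x') is never
    materialized: updating the two marginals costs O(|S_-| + |S_+|)
    instead of O(|S_-| * |S_+|).
    """
    D_neg = D_neg * np.exp(+alpha * h_neg)
    D_pos = D_pos * np.exp(-alpha * h_pos)
    return D_neg / D_neg.sum(), D_pos / D_pos.sum()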

ROC Curve (Egan, 1975)

Definition: the receiver operating characteristic (ROC) curve is a plot of the true positive rate (TP) vs. false positive rate (FP).
• TP: % positive points correctly labeled positive.
• FP: % negative points incorrectly labeled positive.

[Figure: ROC curve, true positive rate vs. false positive rate, traced by sweeping a threshold over the sorted scores h(x).]

Area under the ROC Curve (AUC) (Hanley and McNeil, 1982)

Definition: the AUC is the area under the ROC curve. Measure of ranking quality.

[Figure: ROC curve with the area under the curve (AUC) shaded; axes are true positive rate vs. false positive rate.]

Equivalently,

AUC(h) = \frac{1}{m m'} \sum_{i=1}^{m} \sum_{j=1}^{m'} 1_{h(x'_j) > h(x_i)} = \Pr_{x \sim \hat{D}_-,\; x' \sim \hat{D}_+}\big[h(x') > h(x)\big] = 1 - \hat{R}(h).
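A small sketch (added here for illustration) computing the AUC directly from this pairwise definition:

import numpy as np

def auc(scores_neg, scores_pos):
    """AUC as the fraction of correctly ordered (negative, positive) pairs.

    scores_neg: scores h(x) on negative points, shape (m,).
    scores_pos: scores h(x') on positive points, shape (m_prime,).
    Ties count as misrankings here, matching the strict inequality
    1_{h(x') > h(x)} in the definition above.
    """
    neg = np.asarray(scores_neg)[:, None]    # shape (m, 1)
    pos = np.asarray(scores_pos)[None, :]    # shape (1, m_prime)
    return float(np.mean(pos > neg))         # average over all m * m_prime pairs

# usage sketch
print(auc([0.1, 0.4, 0.35], [0.8, 0.7, 0.2]))  # 7 of 9 pairs correctly ordered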

This Talk

• Score-based ranking
• Preference-based ranking

Preference-Based Setting

Definitions:
• U: universe, full set of objects.
• V: finite query subset to rank, V ⊆ U.
• σ*: target ranking for V (random variable).
Two stages: can be viewed as a reduction.
• learn preference function h : U × U → [0, 1].
• given V, use h to determine a ranking of V.
Running time: measured in terms of the number of calls to h.

Preference-Based Ranking Problem

Training data: pairs (V, σ*) sampled i.i.d. according to D:

(V_1, σ*_1), (V_2, σ*_2), \ldots, (V_m, σ*_m),  V_i ⊆ U,

subsets ranked by different labelers. Learn preference function h : U × U → [0, 1] (a classifier).
Problem: for any query set V ⊆ U, use h to return a ranking σ_{h,V} close to the target σ* with small average error

R(h, σ*) = E_{(V, σ*) \sim D}\big[L(σ_{h,V}, σ*)\big].

Preference Function

h(u, v) close to 1 when u preferred to v, close to 0 otherwise. For the analysis, h(u, v) ∈ {0, 1}.
Assumed pairwise consistent: h(u, v) + h(v, u) = 1.
May be non-transitive, e.g., we may have h(u, v) = h(v, w) = h(w, u) = 1.
Output of a classifier or 'black box'.

Loss Functions

(for a fixed (V, σ*))

Preference loss:

L(h, \sigma^*) = \frac{2}{n(n-1)} \sum_{u \neq v} h(u, v)\, \sigma^*(v, u).

Ranking loss:

L(\sigma, \sigma^*) = \frac{2}{n(n-1)} \sum_{u \neq v} \sigma(u, v)\, \sigma^*(v, u).
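Both losses are instances of the same pairwise disagreement; a small sketch (illustrative only), where the first argument is h or σ and both arguments are represented as 0/1-valued functions of ordered pairs:

def pairwise_loss(first, target, V):
    """Average pairwise disagreement 2/(n(n-1)) * sum_{u != v} first(u, v) * target(v, u).

    first: preference function h or ranking sigma, as a 0/1-valued function of (u, v).
    target: target ranking sigma_star, also 0/1-valued.
    """
    n = len(V)
    total = sum(first(u, v) * target(v, u) for u in V for v in V if u != v)
    return 2.0 * total / (n * (n - 1))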

(Weak) Regret

Preference regret:

R_{class}(h) = E_{V, \sigma^*}\big[L(h_{|V}, \sigma^*)\big] - E_V\Big[\min_{\tilde{h}} E_{\sigma^*|V}\big[L(\tilde{h}, \sigma^*)\big]\Big].

Ranking regret:

R_{rank}(A) = E_{V, \sigma^*, s}\big[L(A_s(V), \sigma^*)\big] - E_V\Big[\min_{\tilde{\sigma} \in S(V)} E_{\sigma^*|V}\big[L(\tilde{\sigma}, \sigma^*)\big]\Big].

Deterministic Algorithm (Balcan et al., 07)

Stage one: standard classification. Learn preference function h : U × U → [0, 1].
Stage two: sort-by-degree using the comparison function h.
• sort by number of points ranked below.
• quadratic time complexity O(n²).
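A minimal sketch of the sort-by-degree stage (my illustration under the slide's assumptions), where pref(u, v) ∈ {0, 1} is the learned preference function:

def sort_by_degree(V, pref):
    """Rank V by the number of elements each point beats under pref.

    pref(u, v) in {0, 1}, with pref(u, v) = 1 meaning u is preferred to v.
    Requires all O(n^2) pairwise comparisons.
    """
    degree = {u: sum(pref(u, v) for v in V if v is not u) for u in V}
    return sorted(V, key=lambda u: degree[u], reverse=True)  # most-preferred first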

Randomized Algorithm (Ailon & MM, 08)

Stage one: standard classification. Learn preference function h : U × U → [0, 1].
Stage two: randomized QuickSort (Hoare, 61) using h as the comparison function.
• comparison function non-transitive, unlike in the textbook description.
• but time complexity shown to be O(n log n) in general.

Randomized QS

[Figure: one step of randomized QuickSort with a random pivot u; elements v with h(v, u) = 1 go to the left recursion, elements with h(u, v) = 1 go to the right recursion.]
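A short sketch of this stage (illustrative, not the authors' code), with pref(u, v) ∈ {0, 1} the learned, possibly non-transitive preference function:

import random

def quicksort_rank(V, pref, rng=random):
    """Randomized QuickSort driven by a (possibly non-transitive) preference.

    pref(u, v) = 1 means u is preferred to v; returns V ordered from
    most preferred to least preferred, using O(n log n) expected calls.
    """
    items = list(V)
    if len(items) <= 1:
        return items
    pivot = rng.choice(items)
    left = [v for v in items if v is not pivot and pref(v, pivot) == 1]
    right = [v for v in items if v is not pivot and pref(pivot, v) == 1]
    return quicksort_rank(left, pref, rng) + [pivot] + quicksort_rank(right, pref, rng)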

Deterministic Algo. - Bipartite Case (V = V_+ ∪ V_-) (Balcan et al., 07)

Bounds: for the deterministic sort-by-degree algorithm:
• expected loss:

E_{V, \sigma^*}\big[L(A(V), \sigma^*)\big] \le 2\, E_{V, \sigma^*}\big[L(h, \sigma^*)\big].

• regret:

R_{rank}(A(V)) \le 2\, R_{class}(h).

Time complexity: quadratic in |V|.

Randomized Algo. - Bipartite Case

(V = V_+ ∪ V_-) (Ailon & MM, 08)

Bounds: for randomized QuickSort (Hoare, 61):
• expected loss (equality):

E_{V, \sigma^*, s}\big[L(Q^h_s(V), \sigma^*)\big] = E_{V, \sigma^*}\big[L(h, \sigma^*)\big].

• regret:

R_{rank}(Q^h_s(\cdot)) \le R_{class}(h).

Time complexity:
• full set: O(n log n).
• top k: O(n + k log k).

Proof Ideas

QuickSort decomposition:

p_{uv} + \sum_{w \notin \{u, v\}} p_{uvw}\, \frac{1}{3}\big[h(u, w)\, h(w, v) + h(v, w)\, h(w, u)\big] = 1.

Bipartite property:

\sigma^*(u, v) + \sigma^*(v, w) + \sigma^*(w, u) = \sigma^*(v, u) + \sigma^*(w, v) + \sigma^*(u, w).

Lower Bound

Theorem: for any deterministic algorithm A, there is a bipartite distribution for which

R_{rank}(A) \ge 2\, R_{class}(h).

• thus, the factor of 2 is best possible in the deterministic case.
• randomization is necessary for a better bound.
Proof: take the simple case U = V = {u, v, w} and assume that h induces a cycle: h(u, v) = h(v, w) = h(w, u) = 1.
• up to symmetry, A returns u, v, w or w, v, u.

Lower Bound

If A returns u, v, w, then choose σ* as: u, v negative, w positive.
If A returns w, v, u, then choose σ* as: w, v negative, u positive.
In either case,

L[h, \sigma^*] = \frac{1}{3}, \qquad L[A, \sigma^*] = \frac{2}{3}.

Guarantees - General Case

Loss bound for QuickSort:

E_{V, \sigma^*, s}\big[L(Q^h_s(V), \sigma^*)\big] \le 2\, E_{V, \sigma^*}\big[L(h, \sigma^*)\big].

Comparison with the optimal ranking (see (CSS 99)):

E_s\big[L(Q^h_s(V), \sigma_{optimal})\big] \le 2\, L(h, \sigma_{optimal}),
E_s\big[L(h, Q^h_s(V))\big] \le 3\, L(h, \sigma_{optimal}),

where \sigma_{optimal} = \mathrm{argmin}_\sigma\, L(h, \sigma).

Weight Function

Generalization:

\sigma_\omega(u, v) = \sigma(u, v)\, \omega(\sigma(u), \sigma(v)).

Properties needed for all previous results to hold:
• symmetry: ω(i, j) = ω(j, i) for all i, j.
• monotonicity: ω(i, j), ω(j, k) ≤ ω(i, k) for i < j < k.

Weight Function - Examples

Kemeny: ω(i, j) = 1 for all i, j.
Top-k: ω(i, j) = 1 if i ≤ k or j ≤ k; 0 otherwise.
Bipartite: ω(i, j) = 1 if i ≤ k and j > k; 0 otherwise.

k-partite: can be defined similarly.

(Strong) Regret Definitions

Ranking regret:

rank(A)= E [L(As(V ), ⇥ )] min E [L(˜ V , ⇥ )]. R V,⇥ ,s ˜ V,⇥ | Preference regret:

R_{class}(h) = E_{V, \sigma^*}\big[L(h_{|V}, \sigma^*)\big] - \min_{\tilde{h}} E_{V, \sigma^*}\big[L(\tilde{h}_{|V}, \sigma^*)\big].

All previous regret results hold if, for all V_1, V_2 ⊇ {u, v},

E_{\sigma^*|V_1}\big[\sigma^*(u, v)\big] = E_{\sigma^*|V_2}\big[\sigma^*(u, v)\big]

for all u, v (pairwise independence on irrelevant alternatives).

References

• Agarwal, S., Graepel, T., Herbrich, R., Har-Peled, S., & Roth, D. (2005). Generalization bounds for the area under the ROC curve. JMLR 6, 393–425.
• Agarwal, S., and Niyogi, P. (2005). Stability and generalization of bipartite ranking algorithms. COLT (pp. 32–47).
• Nir Ailon and Mehryar Mohri. An efficient reduction of ranking to classification. In Proceedings of COLT 2008. Helsinki, Finland, July 2008. Omnipress.
• Balcan, M.-F., Bansal, N., Beygelzimer, A., Coppersmith, D., Langford, J., and Sorkin, G. B. (2007). Robust reductions from ranking to classification. In Proceedings of COLT (pp. 604–619). Springer.
• Stephen Boyd, Corinna Cortes, Mehryar Mohri, and Ana Radovanovic (2012). Accuracy at the top. In NIPS 2012.
• Cohen, W. W., Schapire, R. E., and Singer, Y. (1999). Learning to order things. J. Artif. Intell. Res. (JAIR), 10, 243–270.
• Cossock, D., and Zhang, T. (2006). Subset ranking using regression. COLT (pp. 605–619).


• Corinna Cortes and Mehryar Mohri. AUC Optimization vs. Error Rate Minimization. In Advances in Neural Information Processing Systems (NIPS 2003), 2004. MIT Press.
• Cortes, C., Mohri, M., and Rastogi, A. (2007). An Alternative Ranking Problem for Search Engines. Proceedings of WEA 2007 (pp. 1–21). Rome, Italy: Springer.
• Corinna Cortes and Mehryar Mohri. Confidence Intervals for the Area under the ROC Curve. In Advances in Neural Information Processing Systems (NIPS 2004), 2005. MIT Press.
• Crammer, K., and Singer, Y. (2001). Pranking with ranking. Proceedings of NIPS 2001, December 3-8, 2001, Vancouver, British Columbia, Canada (pp. 641–647). MIT Press.
• Cynthia Dwork, Ravi Kumar, Moni Naor, and D. Sivakumar. Rank Aggregation Methods for the Web. WWW 10, 2001. ACM Press.
• J. P. Egan. Signal Detection Theory and ROC Analysis. Academic Press, 1975.
• Yoav Freund, Raj Iyer, Robert E. Schapire, and Yoram Singer. An efficient boosting algorithm for combining preferences. JMLR 4:933-969, 2003.


• J. A. Hanley and B. J. McNeil. The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology, 1982.
• Ralf Herbrich, Thore Graepel, and Klaus Obermayer. Large margin rank boundaries for ordinal regression. Advances in Large Margin Classifiers, pages 115–132, 2000.
• Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar. Foundations of Machine Learning. The MIT Press, 2012.
• Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd. The PageRank citation ranking: Bringing order to the Web. Stanford Digital Library Technologies Project, 1998.
• Lehmann, E. L. (1975). Nonparametrics: Statistical methods based on ranks. San Francisco, California: Holden-Day.
• Cynthia Rudin, Corinna Cortes, Mehryar Mohri, and Robert E. Schapire. Margin-Based Ranking Meets Boosting in the Middle. In Proceedings of The 18th Annual Conference on Computational Learning Theory (COLT 2005), pages 63-78, 2005.
• Thorsten Joachims. Optimizing search engines using clickthrough data. Proceedings of the 8th ACM SIGKDD, pages 133-142, 2002.