Foundations of Machine Learning: Ranking

Mehryar Mohri
Courant Institute and Google Research
[email protected]

Motivation

Very large data sets:
  • too large to display or process.
  • limited resources, need priorities.
  • ranking more desirable than classification.
Applications:
  • search engines, information extraction.
  • decision making, auctions, fraud detection.
Can we learn to predict ranking accurately?

Related Problem

Rank aggregation: given n candidates and k voters, each giving a ranking of the candidates, find an ordering as close as possible to these.
  • closeness measured in the number of pairwise misrankings.
  • problem NP-hard even for k = 4 (Dwork et al., 2001).

This Talk

  • Score-based ranking
  • Preference-based ranking

Score-Based Setting

Single stage: the learning algorithm
  • receives a labeled sample of pairwise preferences;
  • returns a scoring function h: U → R.
Drawbacks:
  • h induces a linear ordering of the full set U.
  • does not match a query-based scenario.
Advantages:
  • efficient algorithms.
  • good theory: VC bounds, margin bounds, stability bounds (FISS 03, RCMS 05, AN 05, AGHHR 05, CMR 07).

Score-Based Ranking

Training data: sample of i.i.d. labeled pairs drawn from U × U according to some distribution D,

    S = ((x_1, x'_1, y_1), ..., (x_m, x'_m, y_m)) ∈ (U × U × {−1, 0, +1})^m,

with
    y_i = +1 if x'_i >_pref x_i,
    y_i =  0 if x'_i =_pref x_i or no information,
    y_i = −1 if x'_i <_pref x_i.

Problem: find a hypothesis h: U → R in H with small generalization error

    R(h) = Pr_{(x, x') ∼ D}[ (f(x, x') ≠ 0) ∧ (f(x, x') (h(x') − h(x)) ≤ 0) ].

Notes

Empirical error:

    R̂(h) = (1/m) Σ_{i=1}^m 1_{(y_i ≠ 0) ∧ (y_i (h(x'_i) − h(x_i)) ≤ 0)}.

The relation x R x' ⇔ f(x, x') = 1 may be non-transitive (it need not even be anti-symmetric).
The problem is different from classification.

Distributional Assumptions

Distribution over points: m points (literature).
  • labels for pairs.
  • squared number of examples O(m^2).
  • dependency issue.
Distribution over pairs: m pairs.
  • label for each pair received.
  • independence assumption.
  • same (linear) number of examples.

Confidence Margin in Ranking

Labels assumed to be in {+1, −1}.
Empirical margin loss for ranking: for ρ > 0,

    R̂_ρ(h) = (1/m) Σ_{i=1}^m Φ_ρ( y_i (h(x'_i) − h(x_i)) ),

which is upper-bounded by

    R̂_ρ(h) ≤ (1/m) Σ_{i=1}^m 1_{y_i (h(x'_i) − h(x_i)) ≤ ρ}.

Marginal Rademacher Complexities

Distributions:
  • D_1: marginal distribution with respect to the first element of the pairs;
  • D_2: marginal distribution with respect to the second element of the pairs.
Samples:

    S_1 = ((x_1, y_1), ..., (x_m, y_m)),   S_2 = ((x'_1, y_1), ..., (x'_m, y_m)).

Marginal Rademacher complexities:

    R_m^{D_1}(H) = E[R̂_{S_1}(H)],   R_m^{D_2}(H) = E[R̂_{S_2}(H)].

Ranking Margin Bound
(Boyd, Cortes, MM, and Radovanovic, 2012; MM, Rostamizadeh, and Talwalkar, 2012)

Theorem: let H be a family of real-valued functions and fix ρ > 0. Then, for any δ > 0, with probability at least 1 − δ over the choice of a sample of size m, the following holds for all h ∈ H:

    R(h) ≤ R̂_ρ(h) + (2/ρ) ( R_m^{D_1}(H) + R_m^{D_2}(H) ) + sqrt( log(1/δ) / (2m) ).
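To make the pairwise quantities above concrete, here is a minimal Python sketch (an addition, not part of the original slides) that computes the empirical pairwise error R̂(h) and the indicator upper bound on the ρ-margin loss for a scoring function h on a sample of labeled pairs; the linear scorer and the randomly generated pairs are illustrative placeholders.

```python
import numpy as np

def empirical_pairwise_error(h, pairs):
    """Empirical pairwise ranking error: fraction of pairs that are
    labeled (y != 0) and misranked, i.e. y * (h(x2) - h(x1)) <= 0."""
    m = len(pairs)
    misranked = sum(1 for (x1, x2, y) in pairs
                    if y != 0 and y * (h(x2) - h(x1)) <= 0)
    return misranked / m

def empirical_margin_loss_bound(h, pairs, rho):
    """Indicator upper bound on the rho-margin loss: fraction of pairs whose
    confidence margin y * (h(x2) - h(x1)) is at most rho."""
    m = len(pairs)
    small_margin = sum(1 for (x1, x2, y) in pairs
                       if y * (h(x2) - h(x1)) <= rho)
    return small_margin / m

# Illustrative usage with a linear scorer and randomly generated pairs.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w = rng.normal(size=5)
    h = lambda x: float(np.dot(w, x))          # scoring function h: U -> R
    pairs = [(rng.normal(size=5), rng.normal(size=5), int(rng.choice([-1, 1])))
             for _ in range(200)]              # labeled pairs (x, x', y)
    print("empirical pairwise error:", empirical_pairwise_error(h, pairs))
    print("rho-margin loss bound (rho=0.5):",
          empirical_margin_loss_bound(h, pairs, 0.5))
```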
Ranking with SVMs
see for example (Joachims, 2002)

Optimization problem: application of SVMs.

    min_{w, ξ} (1/2) ||w||^2 + C Σ_{i=1}^m ξ_i
    subject to: y_i w · (Φ(x'_i) − Φ(x_i)) ≥ 1 − ξ_i
                ξ_i ≥ 0, ∀ i ∈ [1, m].

Decision function: h: x ↦ w · Φ(x) + b.

Notes

The algorithm coincides with SVMs using the feature mapping

    (x, x') ↦ Ψ(x, x') = Φ(x') − Φ(x).

Can be used with kernels:

    K'((x_i, x'_i), (x_j, x'_j)) = Ψ(x_i, x'_i) · Ψ(x_j, x'_j)
                                 = K(x_i, x_j) + K(x'_i, x'_j) − K(x'_i, x_j) − K(x_i, x'_j).

Algorithm directly based on the margin bound.

Boosting for Ranking

Use a weak ranking algorithm to create a stronger ranking algorithm.
Ensemble method: combine base rankers returned by the weak ranking algorithm.
Finding simple, relatively accurate base rankers is often not hard.
How should base rankers be combined?

CD RankBoost
(Freund et al., 2003; Rudin et al., 2005)

Base rankers H ⊆ {0, 1}^X. For s ∈ {−1, 0, +1},

    ε_t^s(h) = Pr_{(x, x') ∼ D_t}[ sgn(f(x, x') (h(x') − h(x))) = s ],   with ε_t^+ + ε_t^0 + ε_t^− = 1.

RankBoost(S = ((x_1, x'_1, y_1), ..., (x_m, x'_m, y_m)))
 1  for i ← 1 to m do
 2      D_1(x_i, x'_i) ← 1/m
 3  for t ← 1 to T do
 4      h_t ← base ranker in H with smallest ε_t^− − ε_t^+ = −E_{i ∼ D_t}[ y_i (h_t(x'_i) − h_t(x_i)) ]
 5      α_t ← (1/2) log(ε_t^+ / ε_t^−)
 6      Z_t ← ε_t^0 + 2 [ε_t^+ ε_t^−]^{1/2}        ▷ normalization factor
 7      for i ← 1 to m do
 8          D_{t+1}(x_i, x'_i) ← D_t(x_i, x'_i) exp(−α_t y_i (h_t(x'_i) − h_t(x_i))) / Z_t
 9  φ_T ← Σ_{t=1}^T α_t h_t
10  return φ_T

Notes

Distributions D_t over pairs of sample points:
  • originally uniform.
  • at each round, the weight of a misranked pair is increased.
  • observation: D_{t+1}(x, x') = e^{−y (φ_t(x') − φ_t(x))} / (|S| Π_{s=1}^t Z_s), since

        D_{t+1}(x, x') = D_t(x, x') e^{−y α_t (h_t(x') − h_t(x))} / Z_t
                       = (1/|S|) e^{−y Σ_{s=1}^t α_s (h_s(x') − h_s(x))} / Π_{s=1}^t Z_s.

Weight assigned to base classifier h_t: α_t directly depends on the accuracy of h_t at round t.

Coordinate Descent RankBoost

Objective function: convex and differentiable,

    F(α) = Σ_{(x, x', y) ∈ S} e^{−y (φ_T(x') − φ_T(x))} = Σ_{(x, x', y) ∈ S} exp( −y Σ_{t=1}^T α_t (h_t(x') − h_t(x)) ).

[Plot on the slide: u ↦ e^{−u} upper-bounds the 0-1 pairwise loss.]

  • Direction: unit vector e_t with

        e_t = argmin_t dF(α + η e_t)/dη |_{η=0}.

  • Since F(α + η e_t) = Σ_{(x, x', y) ∈ S} e^{−y Σ_{s=1}^T α_s (h_s(x') − h_s(x))} e^{−y η (h_t(x') − h_t(x))},

        dF(α + η e_t)/dη |_{η=0} = −Σ_{(x, x', y) ∈ S} y (h_t(x') − h_t(x)) exp( −y Σ_{s=1}^T α_s (h_s(x') − h_s(x)) )
                                 = −Σ_{(x, x', y) ∈ S} y (h_t(x') − h_t(x)) D_{T+1}(x, x') ( m Π_{s=1}^T Z_s )
                                 = −[ ε_t^+ − ε_t^− ] m Π_{s=1}^T Z_s.

    Thus, the direction corresponds to the base classifier selected by the algorithm.

  • Step size: obtained via dF(α + η e_t)/dη = 0:

           −Σ_{(x, x', y) ∈ S} y (h_t(x') − h_t(x)) exp( −y Σ_{s=1}^T α_s (h_s(x') − h_s(x)) ) e^{−y (h_t(x') − h_t(x)) η} = 0
        ⇔ −Σ_{(x, x', y) ∈ S} y (h_t(x') − h_t(x)) D_{T+1}(x, x') ( m Π_{s=1}^T Z_s ) e^{−y (h_t(x') − h_t(x)) η} = 0
        ⇔ −Σ_{(x, x', y) ∈ S} y (h_t(x') − h_t(x)) D_{T+1}(x, x') e^{−y (h_t(x') − h_t(x)) η} = 0
        ⇔ −[ ε_t^+ e^{−η} − ε_t^− e^{η} ] = 0
        ⇔ η = (1/2) log( ε_t^+ / ε_t^− ).

    Thus, the step size matches the base classifier weight used in the algorithm.

Bipartite Ranking

Training data:
  • sample of negative points drawn according to D_−: S_− = (x_1, ..., x_m) ∈ U^m.
  • sample of positive points drawn according to D_+: S_+ = (x'_1, ..., x'_{m'}) ∈ U^{m'}.

Problem: find a hypothesis h: U → R in H with small generalization error

    R_D(h) = Pr_{x ∼ D_−, x' ∼ D_+}[ h(x') < h(x) ].
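As an illustration of this bipartite error (an addition, not part of the original slides), the following Python sketch computes its empirical counterpart over all pairs in S_− × S_+; in the absence of ties this is exactly 1 − AUC, the quantity mentioned in the notes that follow. The score values in the usage example are placeholders.

```python
import numpy as np

def empirical_bipartite_error(scores_neg, scores_pos):
    """Fraction of pairs (x, x') in S_- x S_+ with h(x') < h(x),
    i.e. the empirical counterpart of R_D(h). With no ties this is 1 - AUC."""
    neg = np.asarray(scores_neg, dtype=float)[:, None]   # shape (m, 1)
    pos = np.asarray(scores_pos, dtype=float)[None, :]   # shape (1, m')
    return float(np.mean(pos < neg))

# Illustrative usage with made-up scores h(x) on S_- and h(x') on S_+.
if __name__ == "__main__":
    scores_neg = [0.1, 0.4, 0.2]        # scores of negative points
    scores_pos = [0.9, 0.3, 0.8, 0.7]   # scores of positive points
    err = empirical_bipartite_error(scores_neg, scores_pos)
    print("pairwise misranking rate:", err)    # 1/12 for these scores
    print("AUC (no ties here):", 1.0 - err)
```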
Notes

Connection between AdaBoost and RankBoost (Cortes & MM, 04; Rudin et al., 05):
  • if constant base ranker used.
  • relationship between objective functions.
More efficient algorithm in this special case (Freund et al., 2003).
Bipartite ranking results typically reported in terms of AUC.

AdaBoost and CD RankBoost

Objective functions: comparison.

    F_Ada(α) = Σ_{x_i ∈ S_− ∪ S_+} exp( −y_i f(x_i) )
             = Σ_{x_i ∈ S_−} exp( +f(x_i) ) + Σ_{x_i ∈ S_+} exp( −f(x_i) )
             = F_−(α) + F_+(α).

    F_Rank(α) = Σ_{(i, j) ∈ S_− × S_+} exp( −( f(x'_j) − f(x_i) ) )
              = Σ_{(i, j) ∈ S_− × S_+} exp( +f(x_i) ) exp( −f(x'_j) )
              = F_−(α) F_+(α).

AdaBoost and CD RankBoost
(Rudin et al., 2005)

Property: AdaBoost (non-separable case).
  • constant base learner h = 1 ⇒ equal contribution of positive and negative points (in the limit).
  • consequence: AdaBoost asymptotically achieves optimum of CD RankBoost objective.

Observations: if F_+(α) = F_−(α), then

    d(F_Rank) = F_+ d(F_−) + F_− d(F_+) = F_+ [ d(F_−) + d(F_+) ] = F_+ d(F_Ada).

Bipartite RankBoost - Efficiency

Decomposition of distribution: for (x, x') ∈ (S_−, S_+),

    D(x, x') = D_−(x) D_+(x').

Thus,

    D_{t+1}(x, x') = D_t(x, x') e^{−α_t (h_t(x') − h_t(x))} / Z_t
                   = ( D_{t,−}(x) e^{α_t h_t(x)} / Z_{t,−} ) ( D_{t,+}(x') e^{−α_t h_t(x')} / Z_{t,+} ),

    with Z_{t,−} = Σ_{x ∈ S_−} D_{t,−}(x) e^{α_t h_t(x)} and Z_{t,+} = Σ_{x' ∈ S_+} D_{t,+}(x') e^{−α_t h_t(x')}.
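This factored form is what makes the bipartite case efficient: only the two marginals D_{t,−} and D_{t,+} need to be maintained, i.e. m + m' weights rather than m · m' pair weights. Below is a minimal Python sketch of one such update using the notation above (an illustration, not code from the slides; the concrete values are placeholders).

```python
import numpy as np

def bipartite_rankboost_update(d_neg, d_pos, h_neg, h_pos, alpha):
    """One round of the factored RankBoost distribution update.

    d_neg, d_pos : current marginals D_{t,-} over S_- and D_{t,+} over S_+.
    h_neg, h_pos : base ranker values h_t(x) on S_- and h_t(x') on S_+.
    alpha        : base ranker weight alpha_t.

    The pair distribution is never stored: it is recovered as
    D_{t+1}(x, x') = D_{t+1,-}(x) * D_{t+1,+}(x')."""
    new_neg = d_neg * np.exp(+alpha * np.asarray(h_neg))     # e^{+alpha_t h_t(x)} on negatives
    new_pos = d_pos * np.exp(-alpha * np.asarray(h_pos))     # e^{-alpha_t h_t(x')} on positives
    return new_neg / new_neg.sum(), new_pos / new_pos.sum()  # normalize by Z_{t,-}, Z_{t,+}

# Illustrative usage: uniform initial marginals, made-up base ranker values.
if __name__ == "__main__":
    d_neg = np.full(3, 1 / 3)                    # D_{1,-} over 3 negative points
    d_pos = np.full(4, 1 / 4)                    # D_{1,+} over 4 positive points
    h_neg = np.array([1.0, 0.0, 1.0])            # h_t on S_-
    h_pos = np.array([1.0, 1.0, 0.0, 1.0])       # h_t on S_+
    d_neg, d_pos = bipartite_rankboost_update(d_neg, d_pos, h_neg, h_pos, alpha=0.5)
    print("D_{2,-}:", d_neg)
    print("D_{2,+}:", d_pos)
```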
