Foundations of Machine Learning: Ranking
Foundations of Machine Learning: Ranking
Mehryar Mohri
Courant Institute and Google Research
[email protected]

Motivation
Very large data sets:
• too large to display or process.
• limited resources, need priorities.
• ranking more desirable than classification.
Applications:
• search engines, information extraction.
• decision making, auctions, fraud detection.
Can we learn to predict rankings accurately?

Related Problem
Rank aggregation: given n candidates and k voters, each giving a ranking of the candidates, find an ordering as close as possible to these.
• closeness measured in the number of pairwise misrankings.
• problem NP-hard even for k = 4 (Dwork et al., 2001).

This Talk
• Score-based ranking
• Preference-based ranking

Score-Based Setting
Single stage: the learning algorithm
• receives a labeled sample of pairwise preferences; returns a scoring function h: U → ℝ.
Drawbacks:
• h induces a linear ordering of the full set U.
• does not match a query-based scenario.
Advantages:
• efficient algorithms.
• good theory: VC bounds, margin bounds, stability bounds (FISS 03, RCMS 05, AN 05, AGHHR 05, CMR 07).

Score-Based Ranking
Training data: sample of i.i.d. labeled pairs drawn from U × U according to some distribution D,
S = ((x_1, x_1′, y_1), ..., (x_m, x_m′, y_m)) ∈ (U × U × {−1, 0, +1})^m,
with
y_i = +1 if x_i′ >_pref x_i,
y_i = 0 if x_i′ =_pref x_i or no information,
y_i = −1 if x_i′ <_pref x_i.
Problem: find a hypothesis h: U → ℝ in H with small generalization error
R(h) = Pr_{(x,x′)∼D}[(f(x,x′) ≠ 0) ∧ (f(x,x′)(h(x′) − h(x)) ≤ 0)].

Notes
Empirical error:
R̂(h) = (1/m) Σ_{i=1}^m 1_{(y_i ≠ 0) ∧ (y_i (h(x_i′) − h(x_i)) ≤ 0)}.
The relation x R x′ ⇔ f(x, x′) = 1 may be non-transitive (it need not even be anti-symmetric).
The problem is different from classification.

Distributional Assumptions
Distribution over points: m points (literature).
• labels for pairs.
• squared number of examples O(m²).
• dependency issue.
Distribution over pairs: m pairs.
• label for each pair received.
• independence assumption.
• same (linear) number of examples.

Confidence Margin in Ranking
Labels assumed to be in {+1, −1}.
Empirical margin loss for ranking: for ρ > 0,
R̂_ρ(h) = (1/m) Σ_{i=1}^m Φ_ρ(y_i (h(x_i′) − h(x_i))),
where Φ_ρ is the ρ-margin loss function; it satisfies
R̂_ρ(h) ≤ (1/m) Σ_{i=1}^m 1_{y_i [h(x_i′) − h(x_i)] ≤ ρ}.

Marginal Rademacher Complexities
Distributions:
• D_1: marginal distribution with respect to the first element of the pairs;
• D_2: marginal distribution with respect to the second element of the pairs.
Samples:
S_1 = ((x_1, y_1), ..., (x_m, y_m)),
S_2 = ((x_1′, y_1), ..., (x_m′, y_m)).
Marginal Rademacher complexities:
R_m^{D_1}(H) = E[R̂_{S_1}(H)],  R_m^{D_2}(H) = E[R̂_{S_2}(H)].

Ranking Margin Bound
(Boyd, Cortes, MM, and Radovanovich, 2012; MM, Rostamizadeh, and Talwalkar, 2012)
Theorem: let H be a family of real-valued functions. Fix ρ > 0; then, for any δ > 0, with probability at least 1 − δ over the choice of a sample of size m, the following holds for all h ∈ H:
R(h) ≤ R̂_ρ(h) + (2/ρ) (R_m^{D_1}(H) + R_m^{D_2}(H)) + √(log(1/δ) / (2m)).
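The empirical quantities above are straightforward to compute. Below is a minimal Python sketch, assuming NumPy, a user-supplied scoring function h, and toy one-dimensional points; the names and data are illustrative assumptions, not part of the original slides.

```python
import numpy as np

def empirical_pairwise_error(h, pairs):
    """Empirical pairwise misranking error R̂(h) over labeled pairs.

    pairs: iterable of (x, x_prime, y) with y in {-1, 0, +1};
    h: scoring function mapping a point to a real number.
    Pairs with y == 0 contribute no loss, matching the definition above.
    """
    losses = [
        1.0 if y != 0 and y * (h(xp) - h(x)) <= 0 else 0.0
        for x, xp, y in pairs
    ]
    return float(np.mean(losses))

def empirical_margin_loss(h, pairs, rho):
    """Upper bound on the rho-margin loss: fraction of pairs whose
    margin y * (h(x') - h(x)) is at most rho (labels in {-1, +1})."""
    margins = np.array([y * (h(xp) - h(x)) for x, xp, y in pairs])
    return float(np.mean(margins <= rho))

if __name__ == "__main__":
    h = lambda x: 2.0 * x                      # assumed scoring function
    sample = [(0.1, 0.9, +1), (0.7, 0.2, -1), (0.5, 0.5, 0)]
    print(empirical_pairwise_error(h, sample))  # 0.0: no misranked pair
    labeled = [p for p in sample if p[2] != 0]  # margin loss assumes y in {-1, +1}
    print(empirical_margin_loss(h, labeled, rho=0.5))
```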
Ranking with SVMs
See for example (Joachims, 2002).
Optimization problem: application of SVMs,
min_{w,ξ} (1/2)‖w‖² + C Σ_{i=1}^m ξ_i
subject to: y_i w · (Φ(x_i′) − Φ(x_i)) ≥ 1 − ξ_i, ξ_i ≥ 0, ∀i ∈ [1, m].
Decision function: h: x ↦ w · Φ(x) + b.

Notes
The algorithm coincides with SVMs using the feature mapping
(x, x′) ↦ Ψ(x, x′) = Φ(x′) − Φ(x).
Can be used with kernels:
K′((x_i, x_i′), (x_j, x_j′)) = Ψ(x_i, x_i′) · Ψ(x_j, x_j′)
= K(x_i, x_j) + K(x_i′, x_j′) − K(x_i′, x_j) − K(x_i, x_j′).
Algorithm directly based on the margin bound.

Boosting for Ranking
Use a weak ranking algorithm to create a stronger ranking algorithm.
Ensemble method: combine base rankers returned by the weak ranking algorithm.
Finding simple, relatively accurate base rankers is often not hard.
How should base rankers be combined?

CD RankBoost
(Freund et al., 2003; Rudin et al., 2005)
H ⊆ {0, 1}^X. ε_t^0 + ε_t^+ + ε_t^− = 1, with ε_t^s(h) = Pr_{(x,x′)∼D_t}[sgn(f(x,x′)(h(x′) − h(x))) = s].

RankBoost(S = ((x_1, x_1′, y_1), ..., (x_m, x_m′, y_m)))
1  for i ← 1 to m do
2      D_1(x_i, x_i′) ← 1/m
3  for t ← 1 to T do
4      h_t ← base ranker in H with smallest ε_t^− − ε_t^+ = −E_{i∼D_t}[y_i (h_t(x_i′) − h_t(x_i))]
5      α_t ← (1/2) log(ε_t^+ / ε_t^−)
6      Z_t ← ε_t^0 + 2 [ε_t^+ ε_t^−]^{1/2}    (normalization factor)
7      for i ← 1 to m do
8          D_{t+1}(x_i, x_i′) ← D_t(x_i, x_i′) exp(−α_t y_i (h_t(x_i′) − h_t(x_i))) / Z_t
9  φ_T ← Σ_{t=1}^T α_t h_t
10 return φ_T

Notes
Distributions D_t over pairs of sample points:
• originally uniform.
• at each round, the weight of a misclassified example is increased.
• observation: D_{t+1}(x, x′) = exp(−y [φ_t(x′) − φ_t(x)]) / (|S| Π_{s=1}^t Z_s), since
D_{t+1}(x, x′) = D_t(x, x′) e^{−y α_t [h_t(x′) − h_t(x)]} / Z_t = (1/|S|) e^{−y Σ_{s=1}^t α_s [h_s(x′) − h_s(x)]} / Π_{s=1}^t Z_s.
Weight assigned to base classifier h_t: α_t directly depends on the accuracy of h_t at round t.

Coordinate Descent RankBoost
Objective function: convex and differentiable,
F(α) = Σ_{(x,x′,y)∈S} e^{−y [φ_T(x′) − φ_T(x)]} = Σ_{(x,x′,y)∈S} exp(−y Σ_{t=1}^T α_t [h_t(x′) − h_t(x)]).
[Figure: the exponential e^{−x} upper-bounds the 0-1 pairwise loss.]

• Direction: unit vector e_t with
e_t = argmin_t dF(α + η e_t)/dη |_{η=0}.
• Since F(α + η e_t) = Σ_{(x,x′,y)∈S} e^{−y Σ_{s=1}^T α_s [h_s(x′) − h_s(x)]} e^{−y η [h_t(x′) − h_t(x)]},
dF(α + η e_t)/dη |_{η=0} = −Σ_{(x,x′,y)∈S} y [h_t(x′) − h_t(x)] exp(−y Σ_{s=1}^T α_s [h_s(x′) − h_s(x)])
= −Σ_{(x,x′,y)∈S} y [h_t(x′) − h_t(x)] D_{T+1}(x, x′) m Π_{s=1}^T Z_s
= −[ε_t^+ − ε_t^−] m Π_{s=1}^T Z_s.
Thus, the direction corresponds to the base classifier selected by the algorithm.

• Step size: obtained via dF(α + η e_t)/dη = 0
⇔ −Σ_{(x,x′,y)∈S} y [h_t(x′) − h_t(x)] exp(−y Σ_{s=1}^T α_s [h_s(x′) − h_s(x)]) e^{−y [h_t(x′) − h_t(x)] η} = 0
⇔ −Σ_{(x,x′,y)∈S} y [h_t(x′) − h_t(x)] D_{T+1}(x, x′) m Π_{s=1}^T Z_s e^{−y [h_t(x′) − h_t(x)] η} = 0
⇔ −Σ_{(x,x′,y)∈S} y [h_t(x′) − h_t(x)] D_{T+1}(x, x′) e^{−y [h_t(x′) − h_t(x)] η} = 0
⇔ −[ε_t^+ e^{−η} − ε_t^− e^{η}] = 0
⇔ η = (1/2) log(ε_t^+ / ε_t^−).
Thus, the step size matches the base classifier weight used in the algorithm.
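To make the distribution update and the choice of α_t concrete, here is a minimal Python sketch of CD RankBoost restricted to one-dimensional threshold base rankers h_θ(x) = 1[x > θ]; the base-ranker family, the thresholds grid, and the smoothing constant eps are illustrative assumptions, not part of the algorithm statement above.

```python
import numpy as np

def rankboost(pairs, thresholds, T=20, eps=1e-10):
    """Minimal CD RankBoost sketch with threshold base rankers on 1-d points.

    pairs: list of (x, x_prime, y) with y in {-1, +1}.
    Returns a list of (alpha_t, theta_t) defining phi_T = sum_t alpha_t * h_t.
    """
    m = len(pairs)
    x = np.array([p[0] for p in pairs])
    xp = np.array([p[1] for p in pairs])
    y = np.array([p[2] for p in pairs], dtype=float)
    D = np.full(m, 1.0 / m)                       # D_1: uniform over pairs
    ensemble = []
    for _ in range(T):
        # Pick the base ranker with smallest eps_minus - eps_plus,
        # i.e. largest weighted margin E_D[y (h(x') - h(x))].
        best = None
        for theta in thresholds:
            diff = (xp > theta).astype(float) - (x > theta).astype(float)
            s = y * diff                          # +1 correct, -1 misranked, 0 abstains
            e_plus = D[s > 0].sum()
            e_minus = D[s < 0].sum()
            if best is None or e_plus - e_minus > best[0]:
                best = (e_plus - e_minus, theta, diff, e_plus, e_minus)
        _, theta_t, diff_t, e_plus, e_minus = best
        alpha_t = 0.5 * np.log((e_plus + eps) / (e_minus + eps))  # smoothed alpha_t
        ensemble.append((alpha_t, theta_t))
        # D_{t+1} proportional to D_t * exp(-alpha_t y (h_t(x') - h_t(x))).
        D = D * np.exp(-alpha_t * y * diff_t)
        D /= D.sum()                              # division by Z_t
    return ensemble

def score(ensemble, x):
    """phi_T(x) = sum_t alpha_t * 1[x > theta_t]."""
    return sum(a * float(x > th) for a, th in ensemble)

# Toy usage: preferences consistent with "larger value is better".
pairs = [(0.1, 0.8, +1), (0.9, 0.3, -1), (0.2, 0.6, +1)]
model = rankboost(pairs, thresholds=np.linspace(0.0, 1.0, 11), T=5)
```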
Bipartite Ranking
Training data:
• sample of negative points drawn according to D_−: S_− = (x_1, ..., x_m), with each x_i ∈ U.
• sample of positive points drawn according to D_+: S_+ = (x_1′, ..., x_m′), with each x_i′ ∈ U.
Problem: find a hypothesis h: U → ℝ in H with small generalization error
R_D(h) = Pr_{x∼D_−, x′∼D_+}[h(x′) < h(x)].

Notes
Connection between AdaBoost and RankBoost (Cortes & MM, 04; Rudin et al., 05):
• if a constant base ranker is used.
• relationship between objective functions.
More efficient algorithm in this special case (Freund et al., 2003).
Bipartite ranking results are typically reported in terms of the AUC.

AdaBoost and CD RankBoost
Objective functions: comparison.
F_Ada(α) = Σ_{x_i ∈ S_− ∪ S_+} exp(−y_i f(x_i))
= Σ_{x_i ∈ S_−} exp(+f(x_i)) + Σ_{x_i ∈ S_+} exp(−f(x_i))
= F_−(α) + F_+(α).
F_Rank(α) = Σ_{(i,j) ∈ S_− × S_+} exp(−[f(x_j′) − f(x_i)])
= (Σ_{x_i ∈ S_−} exp(+f(x_i))) (Σ_{x_j′ ∈ S_+} exp(−f(x_j′)))
= F_−(α) F_+(α).

AdaBoost and CD RankBoost
(Rudin et al., 2005)
Property: AdaBoost (non-separable case).
• constant base learner h = 1 ⇒ equal contribution of positive and negative points (in the limit).
• consequence: AdaBoost asymptotically achieves the optimum of the CD RankBoost objective.
Observation: if F_+(α) = F_−(α), then
d(F_Rank) = F_+ d(F_−) + F_− d(F_+) = F_+ (d(F_−) + d(F_+)) = F_+ d(F_Ada).

Bipartite RankBoost - Efficiency
Decomposition of the distribution: for (x, x′) ∈ (S_−, S_+),
D(x, x′) = D_−(x) D_+(x′).
Thus,
D_{t+1}(x, x′) = D_t(x, x′) e^{−α_t [h_t(x′) − h_t(x)]} / Z_t
= (D_{t,−}(x) e^{α_t h_t(x)} / Z_{t,−}) (D_{t,+}(x′) e^{−α_t h_t(x′)} / Z_{t,+}),
with Z_{t,−} = Σ_{x ∈ S_−} D_{t,−}(x) e^{α_t h_t(x)} and Z_{t,+} = Σ_{x′ ∈ S_+} D_{t,+}(x′) e^{−α_t h_t(x′)}.
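The factored form can be maintained directly on the two marginal distributions, which is what makes the bipartite case efficient. A minimal sketch of one such update, assuming NumPy arrays for D_{t,−}, D_{t,+} and for the base-ranker values on each sample (illustrative names, not from the slides):

```python
import numpy as np

def bipartite_update(D_minus, D_plus, h_minus, h_plus, alpha):
    """One factored distribution update for bipartite RankBoost.

    D_minus, D_plus: current marginal distributions over S_- and S_+;
    h_minus, h_plus: base-ranker values h_t(x) on the negative and positive points;
    alpha: the weight alpha_t chosen at this round.
    The pair distribution D_{t+1}(x, x') is the outer product of the returned
    marginals and never needs to be materialized, so one round costs
    O(|S_-| + |S_+|) instead of O(|S_-| * |S_+|).
    """
    new_minus = D_minus * np.exp(alpha * h_minus)    # e^{+alpha h_t(x)} on negatives
    new_minus /= new_minus.sum()                     # division by Z_{t,-}
    new_plus = D_plus * np.exp(-alpha * h_plus)      # e^{-alpha h_t(x')} on positives
    new_plus /= new_plus.sum()                       # division by Z_{t,+}
    return new_minus, new_plus
```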