Introduction to Machine Learning Lecture 16

Mehryar Mohri
Courant Institute and Google Research
[email protected]

Ranking

Motivation

Very large data sets:
• too large to display or process.
• limited resources, need priorities.
• ranking more desirable than classification.
Applications:
• search engines, information extraction.
• decision making, auctions, fraud detection.
Can we learn to predict ranking accurately?

Score-Based Setting

Single stage: learning algorithm
• receives a labeled sample of pairwise preferences;
• returns a scoring function h: U → R.
Drawbacks:
• h induces a linear ordering of the full set U.
• does not match a query-based scenario.
Advantages:
• efficient algorithms.
• good theory: VC bounds, margin bounds, stability bounds (FISS 03, RCMS 05, AN 05, AGHHR 05, CMR 07).

Preference-Based Setting

Definitions:
• U: universe, full set of objects.
• V: finite query subset to rank, V ⊆ U.
• τ*: target ranking for V (random variable).
Two stages (can be viewed as a reduction):
• learn a preference function h: U × U → [0, 1].
• given V, use h to determine a ranking σ of V.
Running time: measured in terms of the number of calls to h.

Related Problem

Rank aggregation: given n candidates and k voters, each giving a ranking of the candidates, find an ordering as close as possible to these.
• closeness measured in the number of pairwise misrankings.
• problem NP-hard even for k = 4 (Dwork et al., 2001).

This Talk

• Score-based ranking
• Preference-based ranking

Score-Based Ranking

Training data: sample of i.i.d. labeled pairs drawn from U × U according to some distribution D,
    S = ((x_1, x_1', y_1), \ldots, (x_m, x_m', y_m)) \in U \times U \times \{-1, 0, +1\},
with
    y_i = +1 if x_i' >_pref x_i,
    y_i = 0 if x_i' =_pref x_i or no information,
    y_i = -1 if x_i' <_pref x_i.
Problem: find a hypothesis h: U → R in H with small generalization error
    R_D(h) = \Pr_{(x, x') \sim D} \big[ f(x, x') \, (h(x') - h(x)) < 0 \big],
where f(x, x') denotes the target pairwise label.

Notes

Empirical error:
    \widehat{R}(h) = \frac{1}{m} \sum_{i=1}^{m} 1_{y_i (h(x_i') - h(x_i)) < 0}.
The relation x R x' ⇔ f(x, x') = 1 may be non-transitive (it need not even be antisymmetric).
Problem different from classification.

Distributional Assumptions

Distribution over points: m points (literature).
• labels for pairs.
• squared number of examples O(m^2).
Distribution over pairs: m pairs.
• label for each pair received.
• independence assumption.
• same (linear) number of examples.

Boosting for Ranking

Use a weak ranking algorithm to create a stronger ranking algorithm.
Ensemble method: combine base rankers returned by the weak ranking algorithm.
Finding simple, relatively accurate base rankers is often not hard.
How should base rankers be combined?

RankBoost (Freund et al., 2003; Rudin et al., 2005)

H ⊆ {0, 1}^X. For s ∈ {−1, 0, +1},
    \epsilon_t^s(h) = \Pr_{(x, x') \sim D_t} \big[ \mathrm{sgn}\big( f(x, x') (h(x') - h(x)) \big) = s \big],
with ε_t^+ + ε_t^0 + ε_t^− = 1.

RankBoost(S = ((x_1, x_1', y_1), ..., (x_m, x_m', y_m)))
 1  for i ← 1 to m do
 2      D_1(x_i, x_i') ← 1/m
 3  for t ← 1 to T do
 4      h_t ← base ranker in H with smallest ε_t^− − ε_t^+ = −E_{i∼D_t}[ y_i (h_t(x_i') − h_t(x_i)) ]
 5      α_t ← (1/2) log(ε_t^+ / ε_t^−)
 6      Z_t ← ε_t^0 + 2 (ε_t^+ ε_t^−)^{1/2}    (normalization factor)
 7      for i ← 1 to m do
 8          D_{t+1}(x_i, x_i') ← D_t(x_i, x_i') exp(−α_t y_i (h_t(x_i') − h_t(x_i))) / Z_t
 9  ϕ_T ← Σ_{t=1}^T α_t h_t
10  return h = sgn(ϕ_T)
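A minimal NumPy sketch of the pseudocode above, assuming stump base rankers h(x) = 1 if x_j > θ else 0 (which lie in {0, 1}^X); the function names, the stump family, and the data layout are illustrative choices, not part of the lecture.

import numpy as np

def rankboost(X1, X2, y, stumps, T=50, eps=1e-12):
    """Pairwise RankBoost sketch.

    X1, X2 : (m, d) arrays with the paired points x_i, x_i'
    y      : (m,) array of pairwise labels in {-1, 0, +1}
    stumps : list of (feature_index, theta) defining h(x) = 1_{x[feature_index] > theta}
    Returns a list of (alpha_t, feature_index, theta) defining phi_T.
    """
    m = len(y)
    D = np.full(m, 1.0 / m)                 # D_1 uniform over the m pairs
    ensemble = []
    for t in range(T):
        # pick the stump with smallest eps_minus - eps_plus,
        # i.e. largest E_{i~D_t}[ y_i (h(x_i') - h(x_i)) ]
        best = None
        for (j, theta) in stumps:
            diff = (X2[:, j] > theta).astype(float) - (X1[:, j] > theta).astype(float)
            edge = np.sum(D * y * diff)      # eps_plus - eps_minus under D_t
            if best is None or edge > best[0]:
                best = (edge, j, theta, diff)
        edge, j, theta, diff = best
        s = y * diff                         # sign of f(x, x') (h(x') - h(x))
        eps_plus, eps_minus, eps_zero = np.sum(D[s > 0]), np.sum(D[s < 0]), np.sum(D[s == 0])
        alpha = 0.5 * np.log((eps_plus + eps) / (eps_minus + eps))
        Z = eps_zero + 2.0 * np.sqrt(eps_plus * eps_minus)
        D = D * np.exp(-alpha * y * diff) / Z        # D_{t+1}
        ensemble.append((alpha, j, theta))
    return ensemble

def phi_T(ensemble, X):
    """Scores phi_T(x) = sum_t alpha_t h_t(x); ranking is obtained by sorting these scores."""
    scores = np.zeros(len(X))
    for alpha, j, theta in ensemble:
        scores += alpha * (X[:, j] > theta)
    return scores

# illustrative usage with random pairs and stumps at feature medians
rng = np.random.default_rng(0)
X1, X2 = rng.normal(size=(100, 3)), rng.normal(size=(100, 3))
y = np.sign(X2[:, 0] - X1[:, 0])             # prefer the point with the larger first coordinate
stumps = [(j, float(np.median(np.r_[X1[:, j], X2[:, j]]))) for j in range(3)]
ensemble = rankboost(X1, X2, y, stumps, T=10)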
Notes

Distributions D_t over pairs of sample points:
• originally uniform.
• at each round, the weight of a misclassified example is increased.
• observation:
    D_{t+1}(x, x') = \frac{e^{-y [\phi_t(x') - \phi_t(x)]}}{|S| \prod_{s=1}^{t} Z_s},
  since
    D_{t+1}(x, x') = \frac{D_t(x, x') \, e^{-y \alpha_t [h_t(x') - h_t(x)]}}{Z_t}
                   = \frac{1}{|S|} \, \frac{e^{-y \sum_{s=1}^{t} \alpha_s [h_s(x') - h_s(x)]}}{\prod_{s=1}^{t} Z_s}.
Weight assigned to base classifier h_t: α_t directly depends on the accuracy of h_t at round t.

Coordinate Descent RankBoost

Objective function: convex and differentiable,
    F(\alpha) = \sum_{(x, x', y) \in S} e^{-y [\phi_T(x') - \phi_T(x)]}
              = \sum_{(x, x', y) \in S} \exp\Big( -y \sum_{t=1}^{T} \alpha_t [h_t(x') - h_t(x)] \Big).
(The exponential e^{-x} upper-bounds the 0-1 pairwise loss.)

• Direction: unit vector e_t with
    e_t = \operatorname{argmin}_t \frac{dF(\alpha + \eta e_t)}{d\eta} \Big|_{\eta = 0}.
• Since F(\alpha + \eta e_t) = \sum_{(x, x', y) \in S} e^{-y \sum_{s=1}^{T} \alpha_s [h_s(x') - h_s(x)]} \, e^{-y \eta [h_t(x') - h_t(x)]},
    \frac{dF(\alpha + \eta e_t)}{d\eta} \Big|_{\eta = 0}
      = -\sum_{(x, x', y) \in S} y [h_t(x') - h_t(x)] \exp\Big( -y \sum_{s=1}^{T} \alpha_s [h_s(x') - h_s(x)] \Big)
      = -\sum_{(x, x', y) \in S} y [h_t(x') - h_t(x)] \, D_{T+1}(x, x') \, m \prod_{s=1}^{T} Z_s
      = -[\epsilon_t^+ - \epsilon_t^-] \, m \prod_{s=1}^{T} Z_s.
Thus, the direction corresponds to the base classifier selected by the algorithm.

• Step size: obtained via
    \frac{dF(\alpha + \eta e_t)}{d\eta} = 0
    ⇔ -\sum_{(x, x', y) \in S} y [h_t(x') - h_t(x)] \exp\Big( -y \sum_{s=1}^{T} \alpha_s [h_s(x') - h_s(x)] \Big) e^{-y [h_t(x') - h_t(x)] \eta} = 0
    ⇔ -\sum_{(x, x', y) \in S} y [h_t(x') - h_t(x)] \, D_{T+1}(x, x') \, m \Big( \prod_{s=1}^{T} Z_s \Big) e^{-y [h_t(x') - h_t(x)] \eta} = 0
    ⇔ -\sum_{(x, x', y) \in S} y [h_t(x') - h_t(x)] \, D_{T+1}(x, x') \, e^{-y [h_t(x') - h_t(x)] \eta} = 0
    ⇔ -[\epsilon_t^+ e^{-\eta} - \epsilon_t^- e^{\eta}] = 0
    ⇔ \eta = \frac{1}{2} \log \frac{\epsilon_t^+}{\epsilon_t^-}.
Thus, the step size matches the base classifier weight used in the algorithm.

L1 Margin Definitions

Definition: the margin of a pair (x, x') with label y ≠ 0 is
    \rho(x, x') = \frac{y (\phi(x') - \phi(x))}{\|\alpha\|_1}
                = \frac{y \sum_{t=1}^{T} \alpha_t [h_t(x') - h_t(x)]}{\|\alpha\|_1}
                = y \, \frac{\alpha \cdot \Delta h(x, x')}{\|\alpha\|_1}.
Definition: the margin of the hypothesis h for a sample S = ((x_1, x_1', y_1), ..., (x_m, x_m', y_m)) is the minimum margin over pairs in S with non-zero labels:
    \rho = \min_{(x, x', y) \in S, \, y \neq 0} \; y \, \frac{\alpha \cdot \Delta h(x, x')}{\|\alpha\|_1}.

Ranking Margin Bound (Cortes and MM, 2011)

Theorem: let H be a family of real-valued functions. Fix ρ > 0; then, for any δ > 0, with probability at least 1 − δ over the choice of a sample of size m, the following holds for all h ∈ H:
    R(h) \leq \widehat{R}_\rho(h) + \frac{2}{\rho} \big( \mathfrak{R}_m^{D_1}(H) + \mathfrak{R}_m^{D_2}(H) \big) + \sqrt{\frac{\log \frac{1}{\delta}}{2m}},
where \mathfrak{R}_m^{D_1}(H) and \mathfrak{R}_m^{D_2}(H) are the Rademacher complexities of H with respect to the marginal distributions of the first and second pair elements.

RankBoost Margin

But RankBoost does not maximize the margin.
Smooth-margin RankBoost (Rudin et al., 2005):
    G(\alpha) = \frac{-\log F(\alpha)}{\|\alpha\|_1}.
Empirical performance not reported.

Ranking with SVMs

See for example (Joachims, 2002).
Optimization problem: application of SVMs,
    \min_{w, \xi} \; \frac{1}{2} \|w\|^2 + C \sum_{i=1}^{m} \xi_i
    subject to: y_i \, w \cdot \big( \Phi(x_i') - \Phi(x_i) \big) \geq 1 - \xi_i, \quad \xi_i \geq 0, \; \forall i \in [1, m].
Decision function:
    h \colon x \mapsto w \cdot \Phi(x) + b.

Notes

The algorithm coincides with SVMs using the feature mapping
    (x, x') \mapsto \Psi(x, x') = \Phi(x') - \Phi(x).
Can be used with kernels.
Algorithm directly based on the margin bound.
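A small sketch of this pairwise reduction, assuming Φ is the identity map and using scikit-learn's standard soft-margin SVM on the difference vectors; the function name and data layout are illustrative, and pairs with zero labels are assumed to have been dropped.

import numpy as np
from sklearn.svm import SVC

def rank_svm(X1, X2, y, C=1.0):
    """Train a soft-margin SVM on Psi(x, x') = Phi(x') - Phi(x) with pairwise labels y.

    X1, X2 : (m, d) arrays of paired points x_i, x_i' (Phi = identity here)
    y      : (m,) array of labels in {-1, +1}; both signs must appear
             (flip some pairs and their labels if necessary)
    Returns a scoring function x -> w . x used to rank points.
    """
    diffs = X2 - X1                              # Psi(x_i, x_i')
    svm = SVC(kernel="linear", C=C).fit(diffs, y)
    w = svm.coef_.ravel()                        # explicit w, available for the linear kernel
    return lambda X: np.asarray(X) @ w           # h(x) = w . Phi(x); the offset b does not affect the ordering

With a nonlinear kernel the same reduction applies, but the per-point score must then be computed from the dual coefficients rather than from an explicit weight vector.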
Bipartite Ranking

Training data:
• sample of negative points drawn according to D_−:
    S_- = (x_1, \ldots, x_m) \in U^m.
• sample of positive points drawn according to D_+:
    S_+ = (x_1', \ldots, x_{m'}') \in U^{m'}.
Problem: find a hypothesis h: U → R in H with small generalization error
    R_D(h) = \Pr_{x \sim D_-, \, x' \sim D_+} \big[ h(x') < h(x) \big].

Notes

More efficient algorithm in this special case (Freund et al., 2003).
Connection between AdaBoost and RankBoost (Cortes & MM, 04; Rudin et al., 05):
• if a constant base ranker is used.
• relationship between objective functions.
Bipartite ranking results typically reported in terms of the AUC.

ROC Curve (Egan, 1975)

Definition: the receiver operating characteristic (ROC) curve is a plot of the true positive rate (TP) vs. the false positive rate (FP).
• TP: percentage of positive points correctly labeled positive.
• FP: percentage of negative points incorrectly labeled positive.
[Figure: ROC curve traced out by sweeping a threshold θ over the sorted scores h(x_i); x-axis: false positive rate, y-axis: true positive rate.]

Area under the ROC Curve (AUC) (Hanley and McNeil, 1982)

Definition: the AUC is the area under the ROC curve. Measure of ranking quality.
[Figure: the AUC shown as the area under the ROC curve.]
Equivalently,
    \mathrm{AUC}(h) = \frac{1}{m m'} \sum_{i=1}^{m'} \sum_{j=1}^{m} 1_{h(x_i') > h(x_j)}
                    = \Pr_{x \sim \widehat{D}_-, \, x' \sim \widehat{D}_+} \big[ h(x') > h(x) \big]
                    = 1 - \widehat{R}(h).
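The pairwise identity above can be checked numerically. A short sketch, assuming scores produced by any scoring function h; the comparison with scikit-learn's roc_auc_score and the convention of counting ties as one half are illustrative additions, not from the slides.

import numpy as np
from sklearn.metrics import roc_auc_score

def pairwise_auc(scores_pos, scores_neg):
    """AUC(h) = (1 / (m m')) sum_i sum_j 1_{h(x_i') > h(x_j)}:
    the fraction of positive-negative pairs ranked correctly
    (ties counted as one half, the usual convention)."""
    pos = np.asarray(scores_pos)[:, None]        # h(x') for positives, shape (m', 1)
    neg = np.asarray(scores_neg)[None, :]        # h(x)  for negatives, shape (1, m)
    return np.mean((pos > neg) + 0.5 * (pos == neg))

# quick check against the ROC-curve definition of the AUC
rng = np.random.default_rng(0)
scores_pos = rng.normal(1.0, 1.0, size=200)      # scores of positive points
scores_neg = rng.normal(0.0, 1.0, size=300)      # scores of negative points
y_true = np.r_[np.ones(200), np.zeros(300)]
y_score = np.r_[scores_pos, scores_neg]
print(pairwise_auc(scores_pos, scores_neg))      # pairwise estimate
print(roc_auc_score(y_true, y_score))            # area under the ROC curve

The two numbers coincide: the pairwise count is the normalized Mann-Whitney statistic, which equals the area under the empirical ROC curve.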
