Introduction to Machine Learning Lecture 16
Introduction to Machine Learning, Lecture 16
Mehryar Mohri
Courant Institute and Google Research
[email protected]

Ranking

Motivation
Very large data sets:
• too large to display or process.
• limited resources, need priorities.
• ranking more desirable than classification.
Applications:
• search engines, information extraction.
• decision making, auctions, fraud detection.
Can we learn to predict rankings accurately?

Score-Based Setting
Single stage: the learning algorithm
• receives a labeled sample of pairwise preferences;
• returns a scoring function h : U → R.
Drawbacks:
• h induces a linear ordering of the full set U.
• does not match a query-based scenario.
Advantages:
• efficient algorithms.
• good theory: VC bounds, margin bounds, stability bounds (FISS 03, RCMS 05, AN 05, AGHHR 05, CMR 07).

Preference-Based Setting
Definitions:
• U: universe, full set of objects.
• V: finite query subset to rank, V ⊆ U.
• τ*: target ranking for V (a random variable).
Two stages (can be viewed as a reduction):
• learn a preference function h : U × U → [0, 1].
• given V, use h to determine a ranking σ of V.
Running time: measured in terms of the number of calls to h.

Related Problem
Rank aggregation: given n candidates and k voters, each giving a ranking of the candidates, find an ordering as close as possible to these rankings.
• closeness measured in the number of pairwise misrankings.
• the problem is NP-hard even for k = 4 (Dwork et al., 2001).

This Talk
• Score-based ranking
• Preference-based ranking

Score-Based Ranking
Training data: a sample of i.i.d. labeled pairs drawn from U × U according to some distribution D,
S = ((x_1, x_1', y_1), ..., (x_m, x_m', y_m)) ∈ (U × U × {−1, 0, +1})^m,
with y_i = +1 if x_i' >_pref x_i, y_i = 0 if x_i' =_pref x_i or no information, and y_i = −1 if x_i' <_pref x_i.
Problem: find a hypothesis h : U → R in H with small generalization error
R_D(h) = Pr_{(x, x') ∼ D} [ f(x, x') (h(x') − h(x)) < 0 ].

Notes
Empirical error:
R̂(h) = (1/m) Σ_{i=1}^m 1_{y_i (h(x_i') − h(x_i)) < 0}.
The relation x R x' ⇔ f(x, x') = 1 may be non-transitive (it need not even be anti-symmetric).
The problem is different from classification.

Distributional Assumptions
Distribution over points: m points (literature).
• labels for pairs.
• squared number of examples, O(m^2).
Distribution over pairs: m pairs.
• a label for each pair is received.
• independence assumption.
• same (linear) number of examples.

Boosting for Ranking
Use a weak ranking algorithm to create a stronger ranking algorithm.
Ensemble method: combine base rankers returned by the weak ranking algorithm.
Finding simple, relatively accurate base rankers is often not hard.
How should base rankers be combined?

CD RankBoost
(Freund et al., 2003; Rudin et al., 2005)
H ⊆ {0, 1}^X. For s ∈ {−1, 0, +1}, define
ε_t^s(h) = Pr_{(x, x') ∼ D_t} [ sgn( f(x, x') (h(x') − h(x)) ) = s ],   with ε_t^+ + ε_t^0 + ε_t^− = 1.

RankBoost(S = ((x_1, x_1', y_1), ..., (x_m, x_m', y_m)))
  for i ← 1 to m do
    D_1(x_i, x_i') ← 1/m
  for t ← 1 to T do
    h_t ← base ranker in H with smallest ε_t^− − ε_t^+ = −E_{i ∼ D_t} [ y_i (h_t(x_i') − h_t(x_i)) ]
    α_t ← (1/2) log(ε_t^+ / ε_t^−)
    Z_t ← ε_t^0 + 2 [ε_t^+ ε_t^−]^{1/2}      ▷ normalization factor
    for i ← 1 to m do
      D_{t+1}(x_i, x_i') ← D_t(x_i, x_i') exp(−α_t y_i (h_t(x_i') − h_t(x_i))) / Z_t
  φ_T ← Σ_{t=1}^T α_t h_t
  return h = sgn(φ_T)

Notes
Distributions D_t over pairs of sample points:
• originally uniform.
• at each round, the weight of a misclassified example is increased.
• observation: D_{t+1}(x, x') = exp(−y [φ_t(x') − φ_t(x)]) / (|S| Π_{s=1}^t Z_s), since
  D_{t+1}(x, x') = D_t(x, x') exp(−y α_t [h_t(x') − h_t(x)]) / Z_t = exp(−y Σ_{s=1}^t α_s [h_s(x') − h_s(x)]) / (|S| Π_{s=1}^t Z_s).
Weight assigned to base classifier h_t: α_t directly depends on the accuracy of h_t at round t.
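A minimal NumPy sketch of the updates above, for intuition only: the function name rankboost, the pair_scores layout (one column per base ranker, holding h_j(x_i') − h_j(x_i)), and the numerical guards are illustrative assumptions, not part of the lecture.

```python
import numpy as np

def rankboost(pair_scores, y, T=100):
    """Sketch of RankBoost on labeled pairs (illustrative, not the lecture's code).

    pair_scores: (m, n) array; entry [i, j] holds h_j(x_i') - h_j(x_i) for base
        ranker h_j on pair i (values in {-1, 0, +1} when h_j maps to {0, 1}).
    y: (m,) array of pairwise labels in {-1, 0, +1}.
    Returns alpha, the accumulated weights over the n base rankers.
    """
    m, n = pair_scores.shape
    D = np.full(m, 1.0 / m)            # D_1: uniform distribution over pairs
    alpha = np.zeros(n)
    tiny = 1e-12                        # guard against log(0) / division by zero
    for _ in range(T):
        # edge_j = eps_j^+ - eps_j^- under D_t, for every candidate base ranker
        edge = D @ (np.sign(y)[:, None] * np.sign(pair_scores))
        j = int(np.argmax(edge))        # smallest eps^- - eps^+  <=>  largest edge
        s = y * pair_scores[:, j]       # y_i (h_j(x_i') - h_j(x_i))
        eps_plus, eps_minus = D[s > 0].sum(), D[s < 0].sum()
        a = 0.5 * np.log((eps_plus + tiny) / (eps_minus + tiny))
        alpha[j] += a
        D *= np.exp(-a * s)             # reweight pairs
        D /= D.sum()                    # normalize (Z_t)
    return alpha

# The final scorer is phi_T(x) = sum_j alpha[j] * h_j(x); points are ranked by this score.
```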
Coordinate Descent RankBoost
Objective function: convex and differentiable,
F(α) = Σ_{(x, x', y) ∈ S} exp(−y [φ_T(x') − φ_T(x)]) = Σ_{(x, x', y) ∈ S} exp(−y Σ_{t=1}^T α_t [h_t(x') − h_t(x)]).
[Plot: u ↦ exp(−u) upper-bounds the 0–1 pairwise loss.]

• Direction: unit vector e_t with
  e_t = argmin_t dF(α + η e_t)/dη |_{η=0}.
• Since F(α + η e_t) = Σ_{(x, x', y) ∈ S} exp(−y Σ_{s=1}^T α_s [h_s(x') − h_s(x)]) exp(−y η [h_t(x') − h_t(x)]),
  dF(α + η e_t)/dη |_{η=0}
    = −Σ_{(x, x', y) ∈ S} y [h_t(x') − h_t(x)] exp(−y Σ_{s=1}^T α_s [h_s(x') − h_s(x)])
    = −Σ_{(x, x', y) ∈ S} y [h_t(x') − h_t(x)] D_{T+1}(x, x') m Π_{s=1}^T Z_s
    = −[ε_t^+ − ε_t^−] m Π_{s=1}^T Z_s.
Thus, the direction corresponds to the base classifier selected by the algorithm.

• Step size: obtained via dF(α + η e_t)/dη = 0:
  ⇔ −Σ_{(x, x', y) ∈ S} y [h_t(x') − h_t(x)] exp(−y Σ_{s=1}^T α_s [h_s(x') − h_s(x)]) exp(−y [h_t(x') − h_t(x)] η) = 0
  ⇔ −Σ_{(x, x', y) ∈ S} y [h_t(x') − h_t(x)] D_{T+1}(x, x') m (Π_{s=1}^T Z_s) exp(−y [h_t(x') − h_t(x)] η) = 0
  ⇔ −Σ_{(x, x', y) ∈ S} y [h_t(x') − h_t(x)] D_{T+1}(x, x') exp(−y [h_t(x') − h_t(x)] η) = 0
  ⇔ −[ε_t^+ e^{−η} − ε_t^− e^{η}] = 0
  ⇔ η = (1/2) log(ε_t^+ / ε_t^−).
Thus, the step size matches the base classifier weight used in the algorithm.

L1 Margin Definitions
Definition: the margin of a pair (x, x') with label y ≠ 0 is
ρ(x, x') = y (φ(x') − φ(x)) / ‖α‖_1 = y Σ_{t=1}^T α_t [h_t(x') − h_t(x)] / Σ_{t=1}^T |α_t| = y (α · Δh(x, x')) / ‖α‖_1,
where Δh(x, x') denotes the vector (h_1(x') − h_1(x), ..., h_T(x') − h_T(x)).
Definition: the margin of the hypothesis h for a sample S = ((x_1, x_1', y_1), ..., (x_m, x_m', y_m)) is the minimum margin over the pairs in S with non-zero labels:
ρ = min_{(x, x', y) ∈ S, y ≠ 0} y (α · Δh(x, x')) / ‖α‖_1.

Ranking Margin Bound
(Cortes and MM, 2011)
Theorem: let H be a family of real-valued functions. Fix ρ > 0; then, for any δ > 0, with probability at least 1 − δ over the choice of a sample of size m, the following holds for all h ∈ H:
R(h) ≤ R̂_ρ(h) + (2/ρ) (ℜ_m^{D_1}(H) + ℜ_m^{D_2}(H)) + √(log(1/δ) / (2m)),
where ℜ_m^{D_1}(H) and ℜ_m^{D_2}(H) are the Rademacher complexities of H with respect to the two marginal distributions.

RankBoost Margin
But RankBoost does not maximize the margin.
Smooth-margin RankBoost (Rudin et al., 2005):
G(α) = −log F(α) / ‖α‖_1.
Empirical performance not reported.

Ranking with SVMs
See for example (Joachims, 2002).
Optimization problem: an application of SVMs,
min_{w, ξ} (1/2) ‖w‖^2 + C Σ_{i=1}^m ξ_i
subject to: y_i w · (Φ(x_i') − Φ(x_i)) ≥ 1 − ξ_i, ξ_i ≥ 0, ∀ i ∈ [1, m].
Decision function: h : x ↦ w · Φ(x) + b.
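A small sketch of this pairwise SVM, assuming the feature vectors Φ(x_i) and Φ(x_i') are available as arrays; rank_svm_fit and the input names are illustrative, and scikit-learn's LinearSVC (squared hinge loss by default) only approximates the hinge-loss objective on the slide.

```python
import numpy as np
from sklearn.svm import LinearSVC

def rank_svm_fit(phi_x, phi_xp, y, C=1.0):
    """Pairwise ranking SVM by reduction to classification on difference vectors.

    phi_x, phi_xp: (m, d) arrays holding Phi(x_i) and Phi(x_i') for each pair.
    y: pairwise labels in {-1, 0, +1}; tied/uninformative pairs (y == 0) are dropped.
    Returns w, used to score a point x as w . Phi(x).
    """
    keep = y != 0
    diffs = phi_xp[keep] - phi_x[keep]          # Phi(x_i') - Phi(x_i)
    labels = y[keep]
    # Linear SVM on the differences; no intercept, since the constraint
    # y_i w.(Phi(x_i') - Phi(x_i)) >= 1 - xi_i contains none (and an offset b
    # would not change the induced ordering anyway).
    clf = LinearSVC(C=C, fit_intercept=False)
    clf.fit(diffs, labels)
    return clf.coef_.ravel()

# Ranking a query set V (illustrative): sort its points by decreasing score w . Phi(x)
# w = rank_svm_fit(phi_x, phi_xp, y)
# order = np.argsort(-(phi_V @ w))
```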
Notes
The algorithm coincides with SVMs using the feature mapping
(x, x') ↦ Ψ(x, x') = Φ(x') − Φ(x).
Can be used with kernels.
The algorithm is directly based on the margin bound.

Bipartite Ranking
Training data:
• a sample of negative points drawn according to D_−,
  S_− = (x_1, ..., x_m) ∈ U^m.
• a sample of positive points drawn according to D_+,
  S_+ = (x_1', ..., x_{m'}') ∈ U^{m'}.
Problem: find a hypothesis h : U → R in H with small generalization error
R_D(h) = Pr_{x ∼ D_−, x' ∼ D_+} [ h(x') < h(x) ].

Notes
A more efficient algorithm exists in this special case (Freund et al., 2003).
Connection between AdaBoost and RankBoost (Cortes & MM, 04; Rudin et al., 05):
• if a constant base ranker is used.
• relationship between the objective functions.
Bipartite ranking results are typically reported in terms of the AUC.

ROC Curve
(Egan, 1975)
Definition: the receiver operating characteristic (ROC) curve is a plot of the true positive rate (TP) vs. the false positive rate (FP).
• TP: percentage of positive points correctly labeled positive.
• FP: percentage of negative points incorrectly labeled positive.
[Figure: ROC curve obtained by sweeping a threshold θ over the sorted scores h(x); axes: false positive rate vs. true positive rate.]

Area under the ROC Curve (AUC)
(Hanley and McNeil, 1982)
Definition: the AUC is the area under the ROC curve; it is a measure of ranking quality.
[Figure: the AUC is the area under the ROC curve.]
Equivalently,
AUC(h) = (1 / (m m')) Σ_{i=1}^{m'} Σ_{j=1}^m 1_{h(x_i') > h(x_j)} = Pr_{x' ∼ D̂_+, x ∼ D̂_−} [ h(x') > h(x) ] = 1 − R̂(h),
where D̂_+ and D̂_− denote the corresponding empirical distributions.
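A short sketch of this pairwise AUC computation; the function auc and the example values are illustrative (a tie-aware equivalent is scikit-learn's roc_auc_score).

```python
import numpy as np

def auc(h_pos, h_neg):
    """AUC as the fraction of (negative, positive) pairs ranked correctly.

    h_pos: scores h(x_1'), ..., h(x_{m'}') of the positive points.
    h_neg: scores h(x_1), ..., h(x_m) of the negative points.
    Ties contribute 0 here, matching the indicator 1_{h(x') > h(x)} above;
    library implementations usually count ties as 1/2.
    """
    h_pos = np.asarray(h_pos, dtype=float)
    h_neg = np.asarray(h_neg, dtype=float)
    wins = (h_pos[:, None] > h_neg[None, :]).sum()   # m' x m pairwise comparisons
    return wins / (h_pos.size * h_neg.size)

# Example: three positives, two negatives, one misranked pair (0.9 vs 1.0)
# auc([2.0, 1.5, 0.9], [0.2, 1.0])  # -> 5/6 ≈ 0.833
```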