Feature Selection & the Shapley-Folkman Theorem
Feature Selection & the Shapley-Folkman Theorem.
Alexandre d'Aspremont, CNRS & D.I., École Normale Supérieure.
With Armin Askari, Laurent El Ghaoui (UC Berkeley) and Quentin Rebjock (EPFL).
Alex d'Aspremont, CIMI, Toulouse, November 2019.

Jobs. Postdoc positions in ML / Optimization, at INRIA / École Normale Supérieure in Paris.

Introduction: feature selection.
Reduce the number of variables while preserving classification performance.
Often improves test performance, especially when samples are scarce.
Helps interpretation.
Classical examples: LASSO, ℓ1-logistic regression, RFE-SVM, ...

Toy example: text classification on the 20 Newsgroups dataset, classifying sci.med versus sci.space.

Space features: 'commercial', 'launches', 'project', 'launched', 'data', 'dryden', 'mining', 'planetary', 'proton', 'missions', 'cost', 'command', 'comet', 'jupiter', 'apollo', 'russian', 'aerospace', 'sun', 'mary', 'payload', 'gravity', ...

Med features: 'med', 'yeast', 'diseases', 'allergic', 'doctors', 'symptoms', 'syndrome', 'diagnosed', 'health', 'drugs', 'therapy', 'candida', 'seizures', 'lyme', 'food', 'brain', 'foods', 'geb', 'pain', 'gordon', 'patient', ...

RNA classification. Find the genes which best discriminate cell type (lung cancer vs. control): 35238 genes, 2695 examples [Lachmann et al., 2018].

[Figure: classification objective (scale ×10^11) versus number of selected features k.]

Best ten genes: MT-CO3, MT-ND4, MT-CYB, RP11-217O12.1, LYZ, EEF1A1, MT-CO1, HBA2, HBB, HBA1.

Applications. Mapping brain activity by fMRI (from the PARIETAL team at INRIA).

fMRI.
Many voxels and very few samples lead to false discoveries. See the Wired article on Bennett et al., "Neural Correlates of Interspecies Perspective Taking in the Post-Mortem Atlantic Salmon: An Argument For Proper Multiple Comparisons Correction", Journal of Serendipitous and Unexpected Results, 2010.

Linear models. Select features from the large weights w.

The LASSO solves
    min_w ‖Xw − y‖₂² + λ‖w‖₁
with linear prediction given by wᵀx.

The linear SVM solves
    min_w ∑ᵢ max{0, 1 − yᵢ wᵀxᵢ} + λ‖w‖₂²
with linear classification rule sign(wᵀx).

In practice:
Relatively high complexity on very large-scale data sets.
Recovery results require uncorrelated features (incoherence, RIP, etc.).
Cheaper featurewise methods (ANOVA, TF-IDF, etc.) have relatively poor performance.

Outline.
Sparse Naive Bayes
The Shapley-Folkman theorem
Duality gap bounds
Numerical performance

Naive Bayes. Predict the label of a test point x ∈ Rᵐ via the rule
    ŷ(x) = argmax_{i ∈ {−1,1}} Prob(Cᵢ | x).
Use Bayes' rule and then the "naive" assumption that features are conditionally independent of each other,
    Prob(x | Cᵢ) = ∏_{j=1}^m Prob(xⱼ | Cᵢ),
leading to
    ŷ(x) = argmax_{i ∈ {−1,1}} { log Prob(Cᵢ) + ∑_{j=1}^m log Prob(xⱼ | Cᵢ) }.        (1)
In (1), we need an explicit model for Prob(xⱼ | Cᵢ).

Multinomial Naive Bayes. In the multinomial model,
    log Prob(x | C±) = xᵀ log θ± + log( (∑_{j=1}^m xⱼ)! / ∏_{j=1}^m xⱼ! ).
Training by maximum likelihood:
    (θ*⁺, θ*⁻) = argmax { f⁺ᵀ log θ⁺ + f⁻ᵀ log θ⁻ : 1ᵀθ⁺ = 1ᵀθ⁻ = 1, θ⁺, θ⁻ ∈ [0,1]ᵐ },
where f± collect the feature counts observed in each class. Linear classification rule from (1): for a given test point x ∈ Rᵐ, we set
    ŷ(x) = sign(v + wᵀx),
where
    w ≜ log θ*⁺ − log θ*⁻  and  v ≜ log Prob(C₊) − log Prob(C₋).
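As a concrete illustration of the multinomial rule above, here is a minimal NumPy sketch: the maximum-likelihood estimates are the normalized class count vectors θ* = f / 1ᵀf, and prediction uses the linear rule ŷ(x) = sign(v + wᵀx). The count vectors, priors and smoothing constant are toy assumptions, not data from the talk.

```python
import numpy as np

# Toy class count vectors f+ and f- for m = 3 features (illustrative values).
f_pos = np.array([90.0, 5.0, 5.0])
f_neg = np.array([10.0, 60.0, 30.0])
prior_pos, prior_neg = 0.5, 0.5

eps = 1e-12  # small constant to avoid log(0) for unseen features (an assumption)
theta_pos = f_pos / f_pos.sum()  # maximum-likelihood estimate theta* = f / sum(f)
theta_neg = f_neg / f_neg.sum()

w = np.log(theta_pos + eps) - np.log(theta_neg + eps)
v = np.log(prior_pos) - np.log(prior_neg)

def predict(x):
    # linear rule y(x) = sign(v + w^T x) from the slides
    return 1 if v + w @ x > 0 else -1

print(predict(np.array([8, 1, 0])))  # feature 0 dominates -> class +1
print(predict(np.array([0, 3, 2])))  # features 1, 2 dominate -> class -1
```

Sparsifying w, as in the next section, amounts to forcing most coordinates of θ*⁺ and θ*⁻ to agree, so that only the selected features contribute to the rule.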
Sparse Naive Bayes. Naive feature selection: make w ≜ log θ*⁺ − log θ*⁻ sparse. Solve

    (θ*⁺, θ*⁻) = argmax  f⁺ᵀ log θ⁺ + f⁻ᵀ log θ⁻
                 subject to  ‖θ⁺ − θ⁻‖₀ ≤ k,  1ᵀθ⁺ = 1ᵀθ⁻ = 1,  θ⁺, θ⁻ ≥ 0        (SMNB)

where k ≥ 0 is a target number of features. Features for which θᵢ⁺ = θᵢ⁻ can be discarded. This is a nonconvex problem. Convex relaxation? Approximation bounds?

Convex relaxation. The dual is very simple.

Sparse Multinomial Naive Bayes [Askari, A., El Ghaoui, 2019]. Let φ(k) be the optimal value of (SMNB). Then φ(k) ≤ ψ(k), where ψ(k) is the optimal value of the following one-dimensional convex optimization problem:

    ψ(k) := C + min_{α ∈ [0,1]} s_k(h(α))        (USMNB)

where C is a constant, s_k(·) is the sum of the top k entries of its vector argument, and for α ∈ (0,1),

    h(α) := f⁺ ∘ log f⁺ + f⁻ ∘ log f⁻ − (f⁺ + f⁻) ∘ log(f⁺ + f⁻) − f⁺ log α − f⁻ log(1 − α).

Solved by bisection, with linear complexity O(n + k log k). Approximation bounds?

The Shapley-Folkman theorem.

Minkowski sum. Given sets X, Y ⊂ Rᵈ, we have
    X + Y = { x + y : x ∈ X, y ∈ Y }
(CGAL User and Reference Manual).

Convex hull. Given subsets Vᵢ ⊂ Rᵈ, we have
    Co(∑ᵢ Vᵢ) = ∑ᵢ Co(Vᵢ).

[Figure: the ℓ_{1/2} ball, the Minkowski average of two and of ten such balls, and the convex hull.]
[Figure: Minkowski sum of the first five digits, obtained by sampling.]

Shapley-Folkman Theorem [Starr, 1969]. Suppose Vᵢ ⊂ Rᵈ, i = 1, ..., n, and
    x ∈ Co(∑_{i=1}^n Vᵢ) = ∑_{i=1}^n Co(Vᵢ),
then
    x ∈ ∑_{i ∉ Sₓ} Vᵢ + ∑_{i ∈ Sₓ} Co(Vᵢ)
where |Sₓ| ≤ d.

Proof sketch. Write x ∈ ∑_{i=1}^n Co(Vᵢ), or
    (x, 1ₙ) = ∑_{i=1}^n ∑_{j=1}^{d+1} λᵢⱼ (vᵢⱼ, eᵢ),  for λ ≥ 0.
Conic Carathéodory then yields a representation with at most n + d nonzero coefficients. Use a pigeonhole argument: every index i needs at least one nonzero λᵢⱼ, so at most d indices can carry more than one, i.e. xᵢ ∈ Co(Vᵢ) for at most d indices and xᵢ ∈ Vᵢ for the rest. The number of nonzero λᵢⱼ controls the gap with the convex hull.

[Figure: the n × (d+1) matrix of coefficients λᵢⱼ; rows with a single nonzero entry give xᵢ ∈ Vᵢ, the at most d remaining rows give xᵢ ∈ Co(Vᵢ).]

Geometric consequences. If the sets Vᵢ ⊂ Rᵈ are uniformly bounded with rad(Vᵢ) ≤ R, then
    d_H( (1/n) ∑_{i=1}^n Vᵢ , Co( (1/n) ∑_{i=1}^n Vᵢ ) ) ≤ R √(min{n, d}) / n,
where rad(V) = inf_{x ∈ V} sup_{y ∈ V} ‖x − y‖. In particular, when d is fixed and n → ∞,
    (1/n) ∑_{i=1}^n Vᵢ → Co( (1/n) ∑_{i=1}^n Vᵢ )
in the Hausdorff metric, with rate O(1/n). This holds for many other nonconvexity measures [Fradelizi et al., 2017].

Duality gap bounds.

Nonconvex optimization. Separable nonconvex problem: solve
    minimize   ∑_{i=1}^n fᵢ(xᵢ)
    subject to Ax ≤ b        (P)
in the variables xᵢ ∈ R^{dᵢ} with d = ∑_{i=1}^n dᵢ, where the fᵢ are lower semicontinuous and A ∈ R^{m×d}.

Take the dual twice to form a convex relaxation:
    minimize   ∑_{i=1}^n fᵢ**(xᵢ)
    subject to Ax ≤ b        (CoP)
in the variables xᵢ ∈ R^{dᵢ}.

Convex envelope. The biconjugate f** satisfies epi(f**) = Co(epi(f)), which means that f**(x) and f(x) match at extreme points of epi(f**). Define the lack of convexity as
    ρ(f) ≜ sup_{x ∈ dom(f)} { f(x) − f**(x) }.

Example: the ℓ₁ norm is the convex envelope of Card(x) on [−1, 1].
[Figure: Card(x) and its convex envelope |x| on [−1, 1].]
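The one-dimensional relaxation (USMNB) above is cheap to evaluate numerically. The sketch below minimizes s_k(h(α)) over α ∈ (0,1) for toy count vectors f⁺, f⁻ (invented for illustration); the additive constant C is omitted, and a fine grid stands in for the bisection used in the talk, which is valid since s_k(h(α)) is convex in α.

```python
import numpy as np

# Toy class count vectors (illustrative assumptions, not the talk's data).
f_pos = np.array([30.0, 1.0, 10.0, 5.0, 2.0])
f_neg = np.array([2.0, 20.0, 10.0, 1.0, 8.0])

def h(alpha):
    # entrywise h(alpha) from the slides; each entry is convex in alpha
    f, g = f_pos, f_neg
    s = f + g
    return (f * np.log(f) + g * np.log(g) - s * np.log(s)
            - f * np.log(alpha) - g * np.log(1.0 - alpha))

def topk_sum(v, k):
    # s_k(v): sum of the k largest entries of v
    return np.sort(v)[-k:].sum()

def relaxation_value(k, grid=10001):
    # grid search over alpha; a convex scalar problem, so bisection
    # (as in the talk) would also work and is faster
    alphas = np.linspace(1e-6, 1.0 - 1e-6, grid)
    return min(topk_sum(h(a), k) for a in alphas)

for k in (1, 3, 5):
    print(k, relaxation_value(k))
```

By the log-sum inequality every entry of h(α) is nonnegative, so the value is nondecreasing in k; the k features with the largest entries of h(α*) are the ones the relaxation keeps.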
Writing the epigraph of problem (P) as in [Lemaréchal and Renaud, 2001],
    G_r ≜ { (r₀, r) ∈ R^{1+m} : ∑_{i=1}^n fᵢ(xᵢ) ≤ r₀, Ax − b ≤ r, x ∈ Rᵈ },
we can write the dual function of (P) as
    Ψ(λ) ≜ inf { r₀ + λᵀr : (r₀, r) ∈ G_r** }
in the variable λ ∈ Rᵐ, where G_r** = Co(G_r) is the closed convex hull of the epigraph G_r.

Because the constraints are affine, (P) and (CoP) have the same dual [Lemaréchal and Renaud, 2001, Th. 2.11], given by
    sup_{λ ≥ 0} Ψ(λ)        (D)
in the variable λ ∈ Rᵐ. Roughly, if G_r** = G_r then there is no duality gap.

Epigraph and duality gap. Define
    Fᵢ = { (fᵢ**(xᵢ), Aᵢxᵢ) : xᵢ ∈ R^{dᵢ} },
where Aᵢ ∈ R^{m×dᵢ} is the i-th block of A. The epigraph G_r** can be written as a Minkowski sum of the Fᵢ:
    G_r** = ∑_{i=1}^n Fᵢ + (0, −b) + R₊^{m+1}.
Applying Shapley-Folkman at any point x ∈ G_r** shows fᵢ**(xᵢ) = fᵢ(xᵢ) for all but at most m + 1 terms in the objective. As n → ∞ with m/n → 0, G_r gets closer to its convex hull G_r** and the duality gap becomes negligible.

Bound on duality gap. A priori bound on the duality gap of
    minimize   ∑_{i=1}^n fᵢ(xᵢ)
    subject to Ax ≤ b,
where A ∈ R^{m×d}.
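To make the duality gap discussion concrete, the toy sketch below compares a separable fixed-charge problem of the form (P) with its convex relaxation (CoP). The cost f, the bound b and the brute-force grid are illustrative assumptions; with a single constraint (m = 1), only a couple of coordinates of the nonconvex solution need to leave the extreme points where f and f** agree, so the gap stays small.

```python
from itertools import product

# Toy separable problem:
#   minimize   sum_i f(x_i)
#   subject to sum_i x_i >= b,  x_i in [0, 1]   (one coupling constraint, m = 1)
# with the fixed-charge cost f(x) = x + 1 for x > 0 and f(0) = 0.
# On [0, 1] its biconjugate is f**(x) = 2x, so rho(f) = sup(f - f**) = 1.
# All numbers are illustrative, not from the talk.

def f(x):
    return x + 1.0 if x > 0 else 0.0

n, b = 3, 1.5
grid = [i / 20 for i in range(21)]  # brute-force grid for the nonconvex (P)

primal = min(sum(f(x) for x in xs)
             for xs in product(grid, repeat=n)
             if sum(xs) >= b - 1e-9)

# Convex relaxation (CoP): minimize 2 * sum_i x_i s.t. sum_i x_i >= b  ->  2b.
relaxed = 2.0 * b

print(primal, relaxed, primal - relaxed)
```

Here the nonconvex optimum pays a fixed charge on each active coordinate (e.g. x = (1, 0.5, 0) with cost 3.5), while the relaxation spreads mass freely (value 3.0), leaving a gap of 0.5, far smaller than the total nonconvexity n·ρ(f) = 3 and consistent with only m + 1 = 2 coordinates sitting off the extreme points of epi(f**).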