Feature Selection & the Shapley-Folkman Theorem
Alexandre d'Aspremont, CNRS & D.I., École Normale Supérieure.
With Armin Askari, Laurent El Ghaoui (UC Berkeley) and Quentin Rebjock (EPFL).
OWOS, June 2020.

Introduction

Feature Selection. Reduce the number of variables while preserving classification performance.
- Often improves test performance, especially when samples are scarce.
- Helps interpretation.
- Classical examples: LASSO, $\ell_1$-logistic regression, RFE-SVM, etc.

Introduction: feature selection

RNA classification. Find genes which best discriminate cell type (lung cancer vs. control). 35,238 genes, 2,695 examples. [Lachmann et al., 2018]

[Figure: objective value ($\times 10^{11}$) versus number of selected features $k$, for $k$ up to 35,000.]

Best ten genes: MT-CO3, MT-ND4, MT-CYB, RP11-217O12.1, LYZ, EEF1A1, MT-CO1, HBA2, HBB, HBA1.

Introduction: feature selection

Applications. Mapping brain activity by fMRI. From the PARIETAL team at INRIA.

fMRI. Many voxels and very few samples lead to false discoveries. Wired article on Bennett et al., "Neural Correlates of Interspecies Perspective Taking in the Post-Mortem Atlantic Salmon: An Argument For Proper Multiple Comparisons Correction", Journal of Serendipitous and Unexpected Results, 2010.

Introduction: linear models

Linear models. Select features from large weights $w$.
- LASSO solves $\min_w \|Xw - y\|_2^2 + \lambda \|w\|_1$, with linear prediction given by $w^\top x$.
- Linear SVM solves $\min_w \sum_i \max\{0, 1 - y_i w^\top x_i\} + \lambda \|w\|_2^2$, with linear classification rule $\mathrm{sign}(w^\top x)$.
- $\ell_1$-logistic regression, etc.

In practice:
- Relatively high complexity on very large-scale data sets.
- Recovery results require uncorrelated features (incoherence, RIP, etc.).
- Cheaper featurewise methods (ANOVA, TF-IDF, etc.) have relatively poor performance.

Outline

- Sparse Naive Bayes
- The Shapley-Folkman theorem
- Duality gap bounds
- Numerical performance

Multinomial Naive Bayes

In the multinomial model,

$\log \mathrm{Prob}(x \mid \mathcal{C}_\pm) = x^\top \log \theta^\pm + \log \frac{\left(\sum_{j=1}^m x_j\right)!}{\prod_{j=1}^m x_j!}.$

Training by maximum likelihood:

$(\theta^+_*, \theta^-_*) = \operatorname{argmax}_{\,1^\top \theta^+ = 1^\top \theta^- = 1,\; \theta^+, \theta^- \in [0,1]^m} \; f^{+\top} \log \theta^+ + f^{-\top} \log \theta^-$

Linear classification rule: for a given test point $x \in \mathbb{R}^m$, set

$\hat{y}(x) = \mathrm{sign}(v + w^\top x),$

where $w \triangleq \log \theta^+_* - \log \theta^-_*$ and $v \triangleq \log \mathrm{Prob}(\mathcal{C}_+) - \log \mathrm{Prob}(\mathcal{C}_-)$.

Sparse Naive Bayes

Naive Feature Selection. Make $w \triangleq \log \theta^+_* - \log \theta^-_*$ sparse. Solve

$(\theta^+_*, \theta^-_*) = \operatorname{argmax} \; f^{+\top} \log \theta^+ + f^{-\top} \log \theta^-$
$\text{subject to} \;\; \|\theta^+ - \theta^-\|_0 \le k, \quad 1^\top \theta^+ = 1^\top \theta^- = 1, \quad \theta^+, \theta^- \ge 0 \qquad \text{(SMNB)}$

where $k \ge 0$ is a target number of features. Features for which $\theta^+_i = \theta^-_i$ can be discarded.

Nonconvex problem. Convex relaxation? Approximation bounds?

Sparse Naive Bayes

Convex Relaxation. The dual is very simple.

Sparse Multinomial Naive Bayes [Askari, A., El Ghaoui, 2019]. Let $\phi(k)$ be the optimal value of (SMNB). Then $\phi(k) \le \psi(k)$, where $\psi(k)$ is the optimal value of the following one-dimensional convex optimization problem

$\psi(k) := C + \min_{\alpha \in [0,1]} s_k(h(\alpha)), \qquad \text{(USMNB)}$

where $C$ is a constant, $s_k(\cdot)$ is the sum of the top $k$ entries of its vector argument, and for $\alpha \in (0,1)$,

$h(\alpha) := f^+ \circ \log f^+ + f^- \circ \log f^- - (f^+ + f^-) \circ \log (f^+ + f^-) - f^+ \log \alpha - f^- \log(1 - \alpha).$

Solved by bisection, with linear complexity $O(n + k \log k)$. Approximation bounds?
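A minimal sketch of this relaxation in Python, assuming f_plus and f_minus are the nonnegative vectors of per-class feature counts (the $f^+$, $f^-$ above) and dropping the additive constant $C$. Each entry of $h(\alpha)$ is convex in $\alpha$, hence so is $s_k(h(\alpha))$, and a bounded scalar minimizer stands in here for the bisection scheme; all names are illustrative, not taken from the authors' code.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def _xlogx(x):
    # Elementwise x * log(x) with the convention 0 * log(0) = 0.
    return np.where(x > 0, x * np.log(np.where(x > 0, x, 1.0)), 0.0)

def _h(alpha, f_plus, f_minus):
    # h(alpha) from the slide above, one entry per feature.
    return (_xlogx(f_plus) + _xlogx(f_minus) - _xlogx(f_plus + f_minus)
            - f_plus * np.log(alpha) - f_minus * np.log(1.0 - alpha))

def usmnb_bound(f_plus, f_minus, k, eps=1e-9):
    """Minimize s_k(h(alpha)) over alpha in (0, 1); returns the optimal
    value (up to the constant C), the minimizer, and a candidate support."""
    def sk(alpha):
        h = _h(alpha, f_plus, f_minus)
        return np.partition(h, -k)[-k:].sum()   # sum of the k largest entries
    res = minimize_scalar(sk, bounds=(eps, 1.0 - eps), method="bounded")
    support = np.argsort(_h(res.x, f_plus, f_minus))[-k:]
    return res.fun, res.x, support
```

A natural way to use this for selection is to keep the $k$ features attaining the top entries of $h(\alpha^\star)$ and refit a dense multinomial naive Bayes model restricted to them.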
Outline

- Sparse Naive Bayes
- The Shapley-Folkman theorem
- Duality gap bounds
- Numerical performance

Shapley-Folkman Theorem

Minkowski sum. Given sets $X, Y \subset \mathbb{R}^d$, we have

$X + Y = \{ x + y : x \in X,\; y \in Y \}$

(CGAL User and Reference Manual)

Convex hull. Given subsets $V_i \subset \mathbb{R}^d$, we have

$\mathrm{Co}\left( \sum_i V_i \right) = \sum_i \mathrm{Co}(V_i)$

[Figure: the $\ell_{1/2}$ ball, the Minkowski average of two and of ten balls, and the convex hull.]
[Figure: Minkowski sum of the five first digits (obtained by sampling).]

Shapley-Folkman Theorem [Starr, 1969]. Suppose $V_i \subset \mathbb{R}^d$, $i = 1, \dots, n$, and

$x \in \mathrm{Co}\left( \sum_{i=1}^n V_i \right) = \sum_{i=1}^n \mathrm{Co}(V_i),$

then

$x \in \sum_{i \in [1,n] \setminus \mathcal{S}_x} V_i + \sum_{i \in \mathcal{S}_x} \mathrm{Co}(V_i)$

where $|\mathcal{S}_x| \le d$.

Proof sketch. Write $x \in \sum_{i=1}^n \mathrm{Co}(V_i)$, or

$\begin{pmatrix} x \\ 1_n \end{pmatrix} = \sum_{i=1}^n \sum_{j=1}^{d+1} \lambda_{ij} \begin{pmatrix} v_{ij} \\ e_i \end{pmatrix}, \quad \text{for } \lambda \ge 0.$

Conic Carathéodory then yields a representation with at most $n + d$ nonzero coefficients. Use a pigeonhole argument: the number of nonzero $\lambda_{ij}$ controls the gap with the convex hull.

[Figure: nonzero pattern of $\lambda_{ij}$, with at most $d$ indices $i$ where $x_i \in \mathrm{Co}(V_i)$ and the remaining ones where $x_i \in V_i$.]

Shapley-Folkman: geometric consequences

Consequences. If the sets $V_i \subset \mathbb{R}^d$ are uniformly bounded with $\mathrm{rad}(V_i) \le R$, then

$d_H\left( \frac{\sum_{i=1}^n V_i}{n},\; \mathrm{Co}\left( \frac{\sum_{i=1}^n V_i}{n} \right) \right) \le R \, \frac{\sqrt{\min\{n, d\}}}{n}$

where $\mathrm{rad}(V) = \inf_{x \in V} \sup_{y \in V} \| x - y \|$.

In particular, when $d$ is fixed and $n \to \infty$,

$\frac{\sum_{i=1}^n V_i}{n} \longrightarrow \mathrm{Co}\left( \frac{\sum_{i=1}^n V_i}{n} \right)$

in the Hausdorff metric, with rate $O(1/n)$. This holds for many other nonconvexity measures [Fradelizi et al., 2017].
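A small numeric sketch of this convergence (an illustration under simple assumptions, not from the slides): take each $V_i$ equal to the three-point set $V = \{(0,0), (1,0), (0,1)\} \subset \mathbb{R}^2$, whose convex hull is a triangle. The Minkowski average of $n$ copies of $V$ is exactly the grid $\{(a/n, b/n) : a + b \le n\}$, and its Hausdorff distance to the hull decays at the $O(1/n)$ rate above.

```python
import numpy as np

def minkowski_average(n):
    # Averaging n picks from V = {(0,0), (1,0), (0,1)} gives exactly
    # the points (a/n, b/n) with a + b <= n.
    return np.array([(a / n, b / n)
                     for a in range(n + 1) for b in range(n + 1 - a)])

# Dense sample of Co(V): the triangle with vertices (0,0), (1,0), (0,1).
ts = np.linspace(0.0, 1.0, 80)
hull = [(x, y) for x in ts for y in ts if x + y <= 1.0]

for n in [1, 2, 4, 8, 16]:
    A = minkowski_average(n)
    # The averages lie inside the hull, so the Hausdorff distance is the
    # largest distance from a hull point to the average set.
    d_H = max(np.linalg.norm(A - p, axis=1).min() for p in hull)
    print(f"n = {n:2d}   d_H ~ {d_H:.4f}")   # roughly halves as n doubles
```

Doubling $n$ roughly halves the printed distance, consistent with the $R\sqrt{\min\{n,d\}}/n$ bound for $d = 2$.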
Outline

- Sparse Naive Bayes
- The Shapley-Folkman theorem
- Duality gap bounds
- Numerical performance

Nonconvex Optimization

Separable nonconvex problem. Solve

$\begin{array}{ll} \text{minimize} & \sum_{i=1}^n f_i(x_i) \\ \text{subject to} & Ax \le b, \end{array} \qquad \text{(P)}$

in the variables $x_i \in \mathbb{R}^{d_i}$ with $d = \sum_{i=1}^n d_i$, where the $f_i$ are lower semicontinuous and $A \in \mathbb{R}^{m \times d}$.

Take the dual twice to form a convex relaxation,

$\begin{array}{ll} \text{minimize} & \sum_{i=1}^n f_i^{**}(x_i) \\ \text{subject to} & Ax \le b, \end{array} \qquad \text{(CoP)}$

in the variables $x_i \in \mathbb{R}^{d_i}$.

Nonconvex Optimization

Convex envelope. The biconjugate $f^{**}$ satisfies $\mathrm{epi}(f^{**}) = \mathrm{Co}(\mathrm{epi}(f))$, which means that $f^{**}(x)$ and $f(x)$ match at extreme points of $\mathrm{epi}(f^{**})$. Define the lack of convexity as

$\rho(f) \triangleq \sup_{x \in \mathrm{dom}(f)} \{ f(x) - f^{**}(x) \}.$

Example. The $\ell_1$ norm is the convex envelope of $\mathrm{Card}(x)$ on $[-1, 1]$.

[Figure: $|x|$ versus $\mathrm{Card}(x)$ on $[-1, 1]$.]

Nonconvex Optimization

Writing the epigraph of problem (P) as in [Lemaréchal and Renaud, 2001],

$\mathcal{G}_r \triangleq \left\{ (r_0, r) \in \mathbb{R}^{1+m} : \sum_{i=1}^n f_i(x_i) \le r_0,\; Ax - b \le r,\; x \in \mathbb{R}^d \right\},$

we can write the dual function of (P) as

$\Psi(\lambda) \triangleq \inf \left\{ r_0 + \lambda^\top r : (r_0, r) \in \mathcal{G}_r^{**} \right\},$

in the variable $\lambda \in \mathbb{R}^m$, where $\mathcal{G}_r^{**} = \mathrm{Co}(\mathcal{G}_r)$ is the closed convex hull of the epigraph $\mathcal{G}_r$.

Affine constraints mean that (P) and (CoP) have the same dual [Lemaréchal and Renaud, 2001, Th. 2.11], given by

$\sup_{\lambda \ge 0} \Psi(\lambda) \qquad \text{(D)}$

in the variable $\lambda \in \mathbb{R}^m$. Roughly speaking, if $\mathcal{G}_r^{**} = \mathcal{G}_r$, there is no duality gap in (P).

Nonconvex Optimization

Epigraph & duality gap. Define

$\mathcal{F}_i = \left\{ (f_i^{**}(x_i), A_i x_i) : x_i \in \mathbb{R}^{d_i} \right\} + \mathbb{R}_+^{m+1}$

where $A_i \in \mathbb{R}^{m \times d_i}$ is the $i$-th block of $A$.

The epigraph $\mathcal{G}_r^{**}$ can be written as a Minkowski sum of the $\mathcal{F}_i$:

$\mathcal{G}_r^{**} = \sum_{i=1}^n \mathcal{F}_i + (0, -b) + \mathbb{R}_+^{m+1}$

Shapley-Folkman applied at $x \in \mathcal{G}_r^{**}$ shows $f_i^{**}(x_i) = f_i(x_i)$ for all but at most $m + 1$ terms in the objective.

As $n \to \infty$ with $m/n \to 0$, $\mathcal{G}_r$ gets closer to its convex hull $\mathcal{G}_r^{**}$ and the duality gap becomes negligible.

Bound on duality gap

A priori bound on the duality gap of

$\begin{array}{ll} \text{minimize} & \sum_{i=1}^n f_i(x_i) \\ \text{subject to} & Ax \le b, \end{array}$

where $A \in \mathbb{R}^{m \times d}$.

Proposition [Aubin and Ekeland, 1976; Ekeland and Temam, 1999] (a priori bounds on the duality gap). Suppose the functions $f_i$ in (P) satisfy Assumption (...). There is a point $x^\star \in \mathbb{R}^d$ at which the primal optimal value of (CoP) is attained, such that

$\underbrace{\sum_{i=1}^n f_i^{**}(x_i^\star)}_{\text{CoP}} \;\le\; \underbrace{\sum_{i=1}^n f_i(\hat{x}_i^\star)}_{\text{P}} \;\le\; \underbrace{\sum_{i=1}^n f_i^{**}(x_i^\star)}_{\text{CoP}} + \underbrace{\sum_{i=1}^{m+1} \rho(f_{[i]})}_{\text{gap}}$

where $\hat{x}^\star$ is an optimal point of (P) and $\rho(f_{[1]}) \ge \rho(f_{[2]}) \ge \dots \ge \rho(f_{[n]})$.

Bound on duality gap

General result. Consider the separable nonconvex problem

$\begin{array}{ll} h_P(u) := \min. & \sum_{i=1}^n f_i(x_i) \\ \text{s.t.} & \sum_{i=1}^n g_i(x_i) \le b + u \end{array} \qquad \text{(P)}$

in the variables $x_i \in \mathbb{R}^{d_i}$, with perturbation parameter $u \in \mathbb{R}^m$.

Proposition [Ekeland and Temam, 1999] (a priori bounds on the duality gap). Suppose the functions $f_i, g_{ji}$ in problem (P) satisfy assumption (...) for $i = 1, \dots, n$, $j = 1, \dots, m$. Let

$\bar{p}_j = (m + 1) \max_i \rho(g_{ji}), \quad \text{for } j = 1, \dots, m,$

then

$h_P^{**}(\bar{p}) \le h_P(\bar{p}) \le h_P^{**}(0) + (m + 1) \max_i \rho(f_i),$

where $h_P^{**}(u)$ is the optimal value of the dual to (P).
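As a toy numerical check of these bounds (an illustrative sketch, not an example from the slides): take $f_i(x) = \sqrt{x}$ on $[0, 1]$, which is concave, so its convex envelope is the chord $f_i^{**}(x) = x$ and $\rho(f_i) = \max_x (\sqrt{x} - x) = 1/4$. With a single coupling constraint $\sum_i x_i \ge b$ ($m = 1$, written as $-1^\top x \le -b$ to match $Ax \le b$), brute force over a grid exhibits a gap of $1/4$, below the bound $\sum_{i=1}^{m+1} \rho(f_{[i]}) = 1/2$, independently of $n$; the gap per term therefore vanishes as $n$ grows.

```python
import itertools
import math

n, b = 4, 2.25                        # n separable terms, one constraint (m = 1)
grid = [i / 20 for i in range(21)]    # brute-force grid on [0, 1], step 0.05

# Nonconvex primal (P): minimize sum_i sqrt(x_i) s.t. sum_i x_i >= b.
h_P = min(sum(map(math.sqrt, x))
          for x in itertools.product(grid, repeat=n)
          if sum(x) >= b - 1e-9)

# Convexified problem (CoP): minimize sum_i x_i s.t. sum_i x_i >= b, value b.
h_CoP = b

gap = h_P - h_CoP                     # 2.5 - 2.25 = 0.25 here
bound = 2 * 0.25                      # sum of the (m + 1) largest rho(f_i)
print(f"P = {h_P:.3f}  CoP = {h_CoP:.3f}  gap = {gap:.3f} <= bound = {bound}")
```

The nonconvex optimum concentrates mass on few coordinates (an extreme point), while the convexified problem spreads it; the gap between the two stays bounded by a quantity that does not depend on $n$, exactly the Shapley-Folkman effect quantified above.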