Feature Selection & the Shapley-Folkman Theorem

Alexandre d'Aspremont, CNRS & D.I., École Normale Supérieure.
With Armin Askari, Laurent El Ghaoui (UC Berkeley) and Quentin Rebjock (EPFL).
OWOS, June 2020.

Introduction

Feature Selection. Reduce the number of variables while preserving classification performance.

- Often improves test performance, especially when samples are scarce.
- Helps interpretation.
- Classical examples: LASSO, $\ell_1$-logistic regression, RFE-SVM, etc.

Introduction: feature selection

RNA classification. Find the genes which best discriminate cell type (lung cancer vs. control): 35,238 genes, 2,695 examples [Lachmann et al., 2018].

[Figure: objective value ($\times 10^{11}$) versus number of selected features $k$.]

Best ten genes: MT-CO3, MT-ND4, MT-CYB, RP11-217O12.1, LYZ, EEF1A1, MT-CO1, HBA2, HBB, HBA1.

Applications. Mapping brain activity by fMRI (from the PARIETAL team at INRIA).

fMRI. Many voxels and very few samples lead to false discoveries. See the Wired article on Bennett et al., "Neural Correlates of Interspecies Perspective Taking in the Post-Mortem Atlantic Salmon: An Argument For Proper Multiple Comparisons Correction", Journal of Serendipitous and Unexpected Results, 2010.

Introduction: linear models

Linear models. Select features from large weights $w$.

- LASSO solves $\min_w \|Xw - y\|_2^2 + \lambda \|w\|_1$, with linear prediction given by $w^\top x$.
- Linear SVM solves $\min_w \sum_i \max\{0, 1 - y_i w^\top x_i\} + \lambda \|w\|_2^2$, with linear classification rule $\mathrm{sign}(w^\top x)$.
- $\ell_1$-logistic regression, etc.

In practice:

- Relatively high complexity on very large-scale data sets.
- Recovery results require uncorrelated features (incoherence, RIP, etc.).
- Cheaper featurewise methods (ANOVA, TF-IDF, etc.) have relatively poor performance.

Outline

- Sparse Naive Bayes
- The Shapley-Folkman theorem
- Duality gap bounds
- Numerical performance

Sparse Naive Bayes

Multinomial Naive Bayes. In the multinomial model,

    $\log \mathrm{Prob}(x \mid C_\pm) = x^\top \log \theta^\pm + \log \frac{(\sum_{j=1}^m x_j)!}{\prod_{j=1}^m x_j!}.$

Training by maximum likelihood:

    $(\theta^+_*, \theta^-_*) = \mathop{\rm argmax}\limits_{\substack{\mathbf{1}^\top\theta^+ = \mathbf{1}^\top\theta^- = 1 \\ \theta^+, \theta^- \in [0,1]^m}} \; f_+^\top \log \theta^+ + f_-^\top \log \theta^-.$

Linear classification rule: for a given test point $x \in \mathbb{R}^m$, set

    $\hat{y}(x) = \mathrm{sign}(v + w^\top x),$

where $w \triangleq \log \theta^+_* - \log \theta^-_*$ and $v \triangleq \log \mathrm{Prob}(C_+) - \log \mathrm{Prob}(C_-)$.

Sparse Naive Bayes

Naive Feature Selection. Make $w \triangleq \log \theta^+_* - \log \theta^-_*$ sparse. Solve

    $(\theta^+_*, \theta^-_*) = \mathop{\rm argmax} \; f_+^\top \log \theta^+ + f_-^\top \log \theta^- \quad \mbox{subject to} \quad \|\theta^+ - \theta^-\|_0 \le k, \;\; \mathbf{1}^\top\theta^+ = \mathbf{1}^\top\theta^- = 1, \;\; \theta^+, \theta^- \ge 0 \qquad \mbox{(SMNB)}$

where $k \ge 0$ is a target number of features. Features for which $\theta_i^+ = \theta_i^-$ can be discarded.

This is a nonconvex problem. Convex relaxation? Approximation bounds?
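Before turning to the relaxation, here is a minimal sketch (my own illustration on synthetic counts, with hypothetical names such as `train_mnb`; not the authors' code) of multinomial Naive Bayes trained by maximum likelihood, where the ML estimates are just normalized class-conditional counts. The final hard-thresholding of $w$ is the simple baseline heuristic, not the (SMNB) program above.

```python
import numpy as np

def train_mnb(X_pos, X_neg, smoothing=1e-10):
    """Maximum-likelihood multinomial Naive Bayes: theta = normalized counts."""
    f_pos = X_pos.sum(axis=0) + smoothing   # f+: per-feature counts in class C+
    f_neg = X_neg.sum(axis=0) + smoothing   # f-: per-feature counts in class C-
    theta_pos = f_pos / f_pos.sum()         # argmax of f+^T log(theta), 1^T theta = 1
    theta_neg = f_neg / f_neg.sum()
    w = np.log(theta_pos) - np.log(theta_neg)           # w = log theta+ - log theta-
    prior_pos = len(X_pos) / (len(X_pos) + len(X_neg))  # class priors from counts
    v = np.log(prior_pos) - np.log(1 - prior_pos)       # v = log P(C+) - log P(C-)
    return v, w

# Classification rule: y(x) = sign(v + w^T x). Features with w_i = 0
# (i.e. theta_i+ = theta_i-) carry no information; keeping the k largest
# |w_i| is the naive thresholding baseline for feature selection.
rng = np.random.default_rng(0)
X_pos = rng.poisson(2.0, size=(100, 20))   # synthetic count data, class C+
X_neg = rng.poisson(1.0, size=(100, 20))   # synthetic count data, class C-
v, w = train_mnb(X_pos, X_neg)
support = np.argsort(-np.abs(w))[:5]       # k = 5 selected features
print("selected features:", support)
```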
Sparse Naive Bayes

Convex Relaxation. The dual is very simple.

Sparse Multinomial Naive Bayes [Askari, A., El Ghaoui, 2019]
Let $\phi(k)$ be the optimal value of (SMNB). Then $\phi(k) \le \psi(k)$, where $\psi(k)$ is the optimal value of the following one-dimensional convex optimization problem

    $\psi(k) := C + \min_{\alpha \in [0,1]} s_k(h(\alpha)), \qquad \mbox{(USMNB)}$

where $C$ is a constant, $s_k(\cdot)$ is the sum of the top $k$ entries of its vector argument, and for $\alpha \in (0,1)$,

    $h(\alpha) := f_+ \circ \log f_+ + f_- \circ \log f_- - (f_+ + f_-) \circ \log(f_+ + f_-) - f_+ \log \alpha - f_- \log(1-\alpha).$

Solved by bisection, with linear complexity $O(n + k \log k)$. Approximation bounds?

The Shapley-Folkman theorem

Minkowski sum. Given sets $X, Y \subset \mathbb{R}^d$, we have

    $X + Y = \{x + y : x \in X, \; y \in Y\}$

(CGAL User and Reference Manual).

Convex hull. Given subsets $V_i \subset \mathbb{R}^d$, we have

    $\mathbf{Co}\Big(\sum_i V_i\Big) = \sum_i \mathbf{Co}(V_i).$

[Figure: the $\ell_{1/2}$ ball, the Minkowski average of two and of ten such balls, and the convex hull.]
[Figure: Minkowski sum of the first five digits, obtained by sampling.]

Shapley-Folkman Theorem [Starr, 1969]
Suppose $V_i \subset \mathbb{R}^d$, $i = 1, \ldots, n$, and

    $x \in \mathbf{Co}\Big(\sum_{i=1}^n V_i\Big) = \sum_{i=1}^n \mathbf{Co}(V_i);$

then

    $x \in \sum_{i \in [1,n] \setminus \mathcal{S}_x} V_i + \sum_{i \in \mathcal{S}_x} \mathbf{Co}(V_i),$

where $|\mathcal{S}_x| \le d$.

Proof sketch. Write $x \in \sum_{i=1}^n \mathbf{Co}(V_i)$, or

    $\begin{pmatrix} x \\ \mathbf{1}_n \end{pmatrix} = \sum_{i=1}^n \sum_{j=1}^{d+1} \lambda_{ij} \begin{pmatrix} v_{ij} \\ e_i \end{pmatrix}, \quad \mbox{for } \lambda \ge 0.$

Conic Carathéodory then yields a representation with at most $n + d$ nonzero coefficients. Since each block $i$ must carry at least one nonzero coefficient, a pigeonhole argument shows that at most $d$ blocks can carry more than one, so $x_i \in V_i$ for all remaining blocks and $x_i \in \mathbf{Co}(V_i)$ only on a set $\mathcal{S}_x$ of at most $d$ indices. The number of nonzero $\lambda_{ij}$ controls the gap with the convex hull.

Shapley-Folkman: geometric consequences

If the sets $V_i \subset \mathbb{R}^d$ are uniformly bounded with $\mathrm{rad}(V_i) \le R$, then

    $d_H\Big(\frac{\sum_{i=1}^n V_i}{n}, \; \mathbf{Co}\Big(\frac{\sum_{i=1}^n V_i}{n}\Big)\Big) \;\le\; R \, \frac{\sqrt{\min\{n, d\}}}{n},$

where $\mathrm{rad}(V) = \inf_{x \in V} \sup_{y \in V} \|x - y\|$.

In particular, when $d$ is fixed and $n \to \infty$,

    $\frac{\sum_{i=1}^n V_i}{n} \;\to\; \mathbf{Co}\Big(\frac{\sum_{i=1}^n V_i}{n}\Big)$

in the Hausdorff metric, with rate $O(1/n)$. This holds for many other nonconvexity measures [Fradelizi et al., 2017].
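The $O(1/n)$ flattening is easy to check numerically. Below is a small sketch (my own demo, not from the slides) in dimension $d = 1$, averaging $n$ copies of the nonconvex set $V = \{0, 1\}$, whose convex hull is $[0, 1]$: the Hausdorff distance decays like $1/(2n)$, within the $R\sqrt{\min\{n,d\}}/n$ bound above.

```python
import numpy as np

V = np.array([0.0, 1.0])   # nonconvex two-point set, Co(V) = [0, 1]
R = 1.0                    # rad(V) = inf_x sup_y |x - y| = 1 for this set

for n in [1, 2, 5, 10, 50, 100]:
    # Build the Minkowski sum V + ... + V (n copies), then rescale by 1/n.
    pts = np.array([0.0])
    for _ in range(n):
        pts = np.unique((pts[:, None] + V[None, :]).ravel())
    avg = pts / n          # the Minkowski average: the grid {0, 1/n, ..., 1}
    # Hausdorff distance to Co = [0, 1]: half the largest gap between
    # consecutive points (the endpoints 0 and 1 always belong to the average).
    d_H = np.max(np.diff(np.sort(avg))) / 2
    print(f"n = {n:3d}   d_H = {d_H:.4f}   bound R*sqrt(min(n,1))/n = {R/n:.4f}")
```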
Duality gap bounds

Nonconvex Optimization. Separable nonconvex problem: solve

    $\mbox{minimize} \;\; \sum_{i=1}^n f_i(x_i) \quad \mbox{subject to} \;\; Ax \le b \qquad \mbox{(P)}$

in the variables $x_i \in \mathbb{R}^{d_i}$, with $d = \sum_{i=1}^n d_i$, where the $f_i$ are lower semicontinuous and $A \in \mathbb{R}^{m \times d}$.

Take the dual twice to form a convex relaxation:

    $\mbox{minimize} \;\; \sum_{i=1}^n f_i^{**}(x_i) \quad \mbox{subject to} \;\; Ax \le b \qquad \mbox{(CoP)}$

in the variables $x_i \in \mathbb{R}^{d_i}$.

Convex envelope. The biconjugate $f^{**}$ satisfies $\mathrm{epi}(f^{**}) = \mathbf{Co}(\mathrm{epi}(f))$, which means that $f^{**}(x)$ and $f(x)$ match at extreme points of $\mathrm{epi}(f^{**})$. Define the lack of convexity as

    $\rho(f) \triangleq \sup_{x \in \mathrm{dom}(f)} \{f(x) - f^{**}(x)\}.$

Example: the $\ell_1$ norm is the convex envelope of $\mathbf{Card}(x)$ on $[-1, 1]$.

Writing the epigraph of problem (P) as in [Lemaréchal and Renaud, 2001],

    $\mathcal{G}_r \triangleq \Big\{ (r_0, r) \in \mathbb{R}^{1+m} : \sum_{i=1}^n f_i(x_i) \le r_0, \; Ax - b \le r, \; x \in \mathbb{R}^d \Big\},$

we can write the dual function of (P) as

    $\Psi(\lambda) \triangleq \inf \big\{ r_0 + \lambda^\top r : (r_0, r) \in \mathcal{G}_r^{**} \big\}$

in the variable $\lambda \in \mathbb{R}^m$, where $\mathcal{G}_r^{**} = \mathbf{Co}(\mathcal{G}_r)$ is the closed convex hull of the epigraph $\mathcal{G}_r$. Because the constraints are affine, (P) and (CoP) have the same dual [Lemaréchal and Renaud, 2001, Th. 2.11], given by

    $\sup_{\lambda \ge 0} \Psi(\lambda) \qquad \mbox{(D)}$

in the variable $\lambda \in \mathbb{R}^m$. Roughly speaking, if $\mathcal{G}_r^{**} = \mathcal{G}_r$, there is no duality gap in (P).

Epigraph & duality gap. Define

    $\mathcal{F}_i = \big\{ (f_i^{**}(x_i), A_i x_i) : x_i \in \mathbb{R}^{d_i} \big\} + \mathbb{R}_+^{m+1},$

where $A_i \in \mathbb{R}^{m \times d_i}$ is the $i$th block of $A$. The epigraph $\mathcal{G}_r^{**}$ can then be written as a Minkowski sum of the $\mathcal{F}_i$:

    $\mathcal{G}_r^{**} = \sum_{i=1}^n \mathcal{F}_i + (0, -b) + \mathbb{R}_+^{m+1}.$

Applying Shapley-Folkman at $x \in \mathcal{G}_r^{**}$ shows that $f_i^{**}(x_i) = f_i(x_i)$ for all but at most $m + 1$ terms in the objective. As $n \to \infty$ with $m/n \to 0$, $\mathcal{G}_r$ gets closer to its convex hull $\mathcal{G}_r^{**}$ and the duality gap becomes negligible.

Bound on duality gap. A priori bound on the duality gap of

    $\mbox{minimize} \;\; \sum_{i=1}^n f_i(x_i) \quad \mbox{subject to} \;\; Ax \le b,$

where $A \in \mathbb{R}^{m \times d}$.

Proposition [Aubin and Ekeland, 1976; Ekeland and Temam, 1999] (a priori bounds on the duality gap)
Suppose the functions $f_i$ in (P) satisfy Assumption (...). There is a point $x^\star \in \mathbb{R}^d$ at which the primal optimal value of (CoP) is attained, such that

    $\underbrace{\sum_{i=1}^n f_i^{**}(x_i^\star)}_{\rm CoP} \;\le\; \underbrace{\sum_{i=1}^n f_i(\hat{x}_i^\star)}_{\rm P} \;\le\; \underbrace{\sum_{i=1}^n f_i^{**}(x_i^\star)}_{\rm CoP} + \underbrace{\sum_{i=1}^{m+1} \rho(f_{[i]})}_{\rm gap},$

where $\hat{x}^\star$ is an optimal point of (P) and $\rho(f_{[1]}) \ge \rho(f_{[2]}) \ge \cdots \ge \rho(f_{[n]})$.

General result. Consider the separable nonconvex problem

    $h_P(u) := \min\Big\{ \sum_{i=1}^n f_i(x_i) : \sum_{i=1}^n g_i(x_i) \le b + u \Big\} \qquad \mbox{(P)}$

in the variables $x_i \in \mathbb{R}^{d_i}$, with perturbation parameter $u \in \mathbb{R}^m$.

Proposition [Ekeland and Temam, 1999] (a priori bounds on the duality gap)
Suppose the functions $f_i, g_{ji}$ in problem (P) satisfy assumption (...) for $i = 1, \ldots, n$ and $j = 1, \ldots, m$. Let

    $\bar{p}_j = (m+1) \max_i \rho(g_{ji}), \quad \mbox{for } j = 1, \ldots, m;$

then

    $h_P^{**}(\bar{p}) \;\le\; h_P(\bar{p}) \;\le\; h_P^{**}(0) + (m+1) \max_i \rho(f_i),$

where $h_P^{**}(u)$ is the optimal value of the dual to (P).
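As a sanity check on the first proposition, here is a toy instance (my own example with hypothetical values; not from the slides) with $m = 1$ linear constraint: minimize fixed-charge costs $f_i(x) = \mathbf{1}\{x > 0\}$ over $x_i \in [0, 1]$ subject to $\sum_i x_i \ge s$. The convex envelope on $[0, 1]$ is $f_i^{**}(x) = x$, so $\rho(f_i) = 1$, and the gap between (P) and (CoP) stays below $(m+1)\max_i \rho(f_i) = 2$ no matter how large $n$ is.

```python
import math

n, s = 50, 10.3   # n separable terms, one coupling constraint: sum x_i >= s
m, rho = 1, 1.0   # one constraint; rho(f_i) = sup_x (1{x>0} - x) = 1 on [0, 1]

cop = s               # (CoP): minimize sum x_i s.t. sum x_i >= s  ->  value s
p = math.ceil(s)      # (P): at least ceil(s) variables must switch on (x_i <= 1)
bound = cop + (m + 1) * rho   # CoP value + sum of the m+1 largest rho(f_i)

print(f"CoP = {cop}, P = {p}, a priori bound = {bound}")
assert p <= n                 # feasibility: enough variables to switch on
assert cop <= p <= bound      # gap = ceil(s) - s < 1, independent of n
```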
