Feature Selection & the Shapley-Folkman Theorem

Alexandre d'Aspremont, CNRS & D.I., École Normale Supérieure.
With Armin Askari, Laurent El Ghaoui (UC Berkeley) and Quentin Rebjock (EPFL).
CIMI, Toulouse, November 2019.

Jobs
Postdoc positions in ML / Optimization, at INRIA / École Normale Supérieure in Paris.

Introduction
Feature selection: reduce the number of variables while preserving classification performance.
Often improves test performance, especially when samples are scarce.
Helps interpretation.
Classical examples: LASSO, ℓ1-logistic regression, RFE-SVM, ...

Introduction: feature selection
Toy example: text classification on 20newsgroups, sci.med versus sci.space.
Space features: 'commercial', 'launches', 'project', 'launched', 'data', 'dryden', 'mining', 'planetary', 'proton', 'missions', 'cost', 'command', 'comet', 'jupiter', 'apollo', 'russian', 'aerospace', 'sun', 'mary', 'payload', 'gravity', ...
Med features: 'med', 'yeast', 'diseases', 'allergic', 'doctors', 'symptoms', 'syndrome', 'diagnosed', 'health', 'drugs', 'therapy', 'candida', 'seizures', 'lyme', 'food', 'brain', 'foods', 'geb', 'pain', 'gordon', 'patient', ...

Introduction: feature selection
RNA classification: find the genes which best discriminate cell type (lung cancer vs. control); 35238 genes, 2695 examples [Lachmann et al., 2018].
[Figure: objective value (×10^11) versus the number of selected features k.]
Best ten genes: MT-CO3, MT-ND4, MT-CYB, RP11-217O12.1, LYZ, EEF1A1, MT-CO1, HBA2, HBB, HBA1.

Introduction: feature selection
Applications: mapping brain activity by fMRI (from the PARIETAL team at INRIA).
fMRI: many voxels and very few samples lead to false discoveries. See the Wired article on Bennett et al., "Neural Correlates of Interspecies Perspective Taking in the Post-Mortem Atlantic Salmon: An Argument For Proper Multiple Comparisons Correction", Journal of Serendipitous and Unexpected Results, 2010.

Introduction: linear models
Linear models select features from the large weights in w.
LASSO solves min_w ||Xw − y||_2^2 + λ||w||_1, with linear prediction given by w^T x.
Linear SVM solves min_w Σ_i max{0, 1 − y_i w^T x_i} + λ||w||_2^2, with linear classification rule sign(w^T x).
In practice:
Relatively high complexity on very large-scale data sets.
Recovery results require uncorrelated features (incoherence, RIP, etc.).
Cheaper featurewise methods (ANOVA, TF-IDF, etc.) have relatively poor performance.

Outline
Sparse Naive Bayes
The Shapley-Folkman theorem
Duality gap bounds
Numerical performance

Naive Bayes
Predict the label of a test point x ∈ R^m via the rule
    ŷ(x) = argmax_{i ∈ {−1,1}} Prob(C_i | x).
Use Bayes' rule together with the "naive" assumption that features are conditionally independent given the class,
    Prob(x | C_i) = Π_{j=1}^m Prob(x_j | C_i),
leading to
    ŷ(x) = argmax_{i ∈ {−1,1}} { log Prob(C_i) + Σ_{j=1}^m log Prob(x_j | C_i) }.        (1)
In (1), we need an explicit model for Prob(x_j | C_i).
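As a minimal illustration of rule (1) (a sketch, not code from the slides), the snippet below scores a test point by adding the class log-prior to the summed per-feature log-likelihoods; the binary feature model and all probability values are invented for the example.

```python
import numpy as np

def naive_bayes_predict(x, log_prior, log_cond):
    """Rule (1): argmax over classes of log Prob(C_i) + sum_j log Prob(x_j | C_i).

    log_prior: dict class -> log prior probability
    log_cond:  dict class -> callable (j, x_j) -> per-feature log-likelihood
    """
    scores = {c: log_prior[c] + sum(log_cond[c](j, xj) for j, xj in enumerate(x))
              for c in log_prior}
    return max(scores, key=scores.get)

# Toy usage with binary features and made-up probabilities p[c][j] = Prob(x_j = 1 | C_c).
log_prior = {+1: np.log(0.5), -1: np.log(0.5)}
p = {+1: np.array([0.8, 0.2, 0.6]), -1: np.array([0.3, 0.7, 0.5])}
log_cond = {c: (lambda j, xj, c=c: np.log(p[c][j] if xj == 1 else 1.0 - p[c][j])) for c in p}
print(naive_bayes_predict([1, 0, 1], log_prior, log_cond))  # -> 1
```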
Multinomial Naive Bayes
In the multinomial model,
    log Prob(x | C_±) = x^T log θ^± + log( (Σ_{j=1}^m x_j)! / Π_{j=1}^m x_j! ).
Training by maximum likelihood:
    (θ*^+, θ*^−) = argmax { f^{+T} log θ^+ + f^{−T} log θ^− : 1^T θ^+ = 1^T θ^− = 1, θ^+, θ^− ∈ [0,1]^m }.
Linear classification rule from (1): for a given test point x ∈ R^m, we set
    ŷ(x) = sign(v + w^T x),
where w := log θ*^+ − log θ*^− and v := log Prob(C_+) − log Prob(C_−).

Sparse Naive Bayes
Naive feature selection: make w := log θ*^+ − log θ*^− sparse. Solve
    maximize    f^{+T} log θ^+ + f^{−T} log θ^−
    subject to  ||θ^+ − θ^−||_0 ≤ k,  1^T θ^+ = 1^T θ^− = 1,  θ^+, θ^− ≥ 0,        (SMNB)
where k ≥ 0 is a target number of features. Features for which θ_i^+ = θ_i^− can be discarded.
Nonconvex problem. Convex relaxation? Approximation bounds?

Sparse Naive Bayes: convex relaxation
The dual is very simple.
Sparse Multinomial Naive Bayes [Askari, d'Aspremont, El Ghaoui, 2019]. Let φ(k) be the optimal value of (SMNB). Then φ(k) ≤ ψ(k), where ψ(k) is the optimal value of the one-dimensional convex optimization problem
    ψ(k) := C + min_{α ∈ [0,1]} s_k(h(α)),        (USMNB)
where C is a constant, s_k(·) is the sum of the top k entries of its vector argument, and, for α ∈ (0,1),
    h(α) := f^+ ∘ log f^+ + f^− ∘ log f^− − (f^+ + f^−) ∘ log(f^+ + f^−) − f^+ log α − f^− log(1−α),
with ∘ and log acting entrywise.
Solved by bisection, linear complexity O(n + k log k). Approximation bounds?

Outline: Sparse Naive Bayes; The Shapley-Folkman theorem; Duality gap bounds; Numerical performance.

Shapley-Folkman Theorem
Minkowski sum: given sets X, Y ⊂ R^d,
    X + Y = { x + y : x ∈ X, y ∈ Y }
(CGAL User and Reference Manual).
Convex hull: given subsets V_i ⊂ R^d,
    Co( Σ_i V_i ) = Σ_i Co(V_i).
[Figures: the ℓ_{1/2} ball, the Minkowski average of two and of ten such balls, and its convex hull; the Minkowski sum of the first five digits, obtained by sampling.]

Shapley-Folkman Theorem [Starr, 1969]
Suppose V_i ⊂ R^d, i = 1, ..., n, and
    x ∈ Co( Σ_{i=1}^n V_i ) = Σ_{i=1}^n Co(V_i).
Then
    x ∈ Σ_{i ∉ S_x} V_i + Σ_{i ∈ S_x} Co(V_i),
where |S_x| ≤ d.

Proof sketch. Write x ∈ Σ_{i=1}^n Co(V_i), that is,
    (x, 1_n) = Σ_{i=1}^n Σ_{j=1}^{d+1} λ_{ij} (v_{ij}, e_i),   for some λ ≥ 0 with v_{ij} ∈ V_i.
Conic Carathéodory then yields a representation with at most n + d nonzero coefficients. A pigeonhole argument shows that at most d indices i can carry more than one nonzero λ_{ij}, so x_i ∈ Co(V_i) for at most d of the blocks and x_i ∈ V_i for the rest: the number of nonzero λ_{ij} controls the gap with the convex hull.

Shapley-Folkman: geometric consequences
If the sets V_i ⊂ R^d are uniformly bounded with rad(V_i) ≤ R, then
    d_H( (1/n) Σ_{i=1}^n V_i , (1/n) Co( Σ_{i=1}^n V_i ) ) ≤ R min{n, √d} / n,
where rad(V) = inf_{x ∈ V} sup_{y ∈ V} ||x − y||.
In particular, when d is fixed and n → ∞,
    (1/n) Σ_{i=1}^n V_i  →  (1/n) Co( Σ_{i=1}^n V_i )
in the Hausdorff metric, at rate O(1/n). The same holds for many other nonconvexity measures [Fradelizi et al., 2017].
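A minimal numerical sketch of this O(1/n) rate (not from the slides): take the toy set V = {0, 1} ⊂ R, whose Minkowski average over n copies is the grid {0, 1/n, ..., 1} while its convex hull is [0, 1].

```python
import numpy as np

# Hausdorff distance between the Minkowski average of n copies of V = {0, 1}
# and its convex hull Co(V) = [0, 1]; the exact value is 1/(2n), consistent
# with the bound R * min{n, sqrt(d)} / n above (here R = 1, d = 1).

def hausdorff_to_hull(n, grid=10001):
    avg = np.arange(n + 1) / n                # Minkowski average: {0, 1/n, ..., 1}
    hull = np.linspace(0.0, 1.0, grid)        # dense sample of Co(V) = [0, 1]
    return np.max(np.min(np.abs(hull[:, None] - avg[None, :]), axis=1))

for n in (1, 2, 5, 10, 50):
    print(n, hausdorff_to_hull(n))            # decays roughly like 1/(2n)
```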
Outline: Sparse Naive Bayes; The Shapley-Folkman theorem; Duality gap bounds; Numerical performance.

Nonconvex Optimization
Separable nonconvex problem. Solve
    minimize    Σ_{i=1}^n f_i(x_i)
    subject to  Ax ≤ b,                                (P)
in the variables x_i ∈ R^{d_i} with d = Σ_{i=1}^n d_i, where the f_i are lower semicontinuous and A ∈ R^{m×d}.
Take the dual twice to form a convex relaxation,
    minimize    Σ_{i=1}^n f_i^{**}(x_i)
    subject to  Ax ≤ b,                                (CoP)
in the variables x_i ∈ R^{d_i}.

Convex envelope. The biconjugate f^{**} satisfies epi(f^{**}) = Co(epi(f)), which means that f^{**}(x) and f(x) match at extreme points of epi(f^{**}). Define the lack of convexity as
    ρ(f) := sup_{x ∈ dom(f)} { f(x) − f^{**}(x) }.
Example: the ℓ1 norm |x| is the convex envelope of Card(x) on [−1, 1].
[Figure: Card(x) and its convex envelope |x| on [−1, 1].]

Writing the epigraph of problem (P) as in [Lemaréchal and Renaud, 2001],
    G_r := { (r_0, r) ∈ R^{1+m} : Σ_{i=1}^n f_i(x_i) ≤ r_0, Ax − b ≤ r, x ∈ R^d },
we can write the dual function of (P) as
    Ψ(λ) := inf { r_0 + λ^T r : (r_0, r) ∈ G_r^{**} },
in the variable λ ∈ R^m, where G_r^{**} = Co(G_r) is the closed convex hull of the epigraph G_r.
Affine constraints mean that (P) and (CoP) have the same dual [Lemaréchal and Renaud, 2001, Th. 2.11], given by
    sup_{λ ≥ 0} Ψ(λ),                                  (D)
in the variable λ ∈ R^m. Roughly, if G_r^{**} = G_r then there is no duality gap.

Epigraph & duality gap. Define
    F_i := { (f_i^{**}(x_i), A_i x_i) : x_i ∈ R^{d_i} },
where A_i ∈ R^{m×d_i} is the i-th block of A. The epigraph G_r^{**} can then be written as a Minkowski sum of the F_i,
    G_r^{**} = Σ_{i=1}^n F_i + (0, −b) + R_+^{m+1}.
Applying Shapley-Folkman at any point x ∈ G_r^{**} shows that f_i^{**}(x_i) = f_i(x_i) for all but at most m + 1 terms in the objective. As n → ∞ with m/n → 0, G_r gets closer to its convex hull G_r^{**} and the duality gap becomes negligible.

Bound on duality gap
A priori bound on the duality gap of
    minimize    Σ_{i=1}^n f_i(x_i)
    subject to  Ax ≤ b,
where A ∈ R^{m×d}.
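Bounds of this type are typically expressed through the lack-of-convexity terms ρ(f_i) defined above. As a small numerical check (a sketch, not from the slides), the snippet below recovers the convex envelope of the scalar cardinality function on [−1, 1] via two discrete Legendre transforms and evaluates ρ(Card) ≈ 1, matching the Card / ℓ1 example.

```python
import numpy as np

# Compute f** for f(x) = Card(x) = 1_{x != 0} on [-1, 1] via two discrete
# Legendre transforms, and the nonconvexity measure rho(f) = sup_x f(x) - f**(x).
# The envelope should be |x| and rho should be (approximately) 1.

x = np.linspace(-1.0, 1.0, 1001)         # primal grid on [-1, 1], includes 0
lam = np.linspace(-5.0, 5.0, 2001)       # dual grid, wide enough for |x| <= 1
f = (x != 0.0).astype(float)             # cardinality of a scalar

f_star = np.max(lam[:, None] * x[None, :] - f[None, :], axis=1)         # f*(lam)
f_bistar = np.max(lam[:, None] * x[None, :] - f_star[:, None], axis=0)  # f**(x)

print("max |f** - |x||:", np.max(np.abs(f_bistar - np.abs(x))))  # ~ 0 (grid error)
print("rho(f)         :", np.max(f - f_bistar))                  # ~ 1
```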