Revisiting Bayesian CMA-ES


BCMA-ES II: revisiting Bayesian CMA-ES

Eric Benhamou (A.I Square Connect and Lamsade, France), David Saltiel (A.I Square Connect and LISIC, France), Beatrice Guez (A.I Square Connect, France), Nicolas Paris (A.I Square Connect, France)

arXiv:1904.01466v2 [cs.LG] 9 Apr 2019

ABSTRACT
This paper revisits the Bayesian CMA-ES and provides the updates for a normal Wishart prior. It emphasizes the difference between a normal Wishart and a normal inverse Wishart prior. After some computation, we prove that the only difference surprisingly lies in the expected covariance. We prove that the expected covariance should be lower in the normal Wishart prior model because of the convexity of the inverse. We present a mixture model that generalizes both the normal Wishart and the normal inverse Wishart model. We finally present various numerical experiments to compare both methods as well as the generalized method.

CCS CONCEPTS
• Mathematics of computing → Probability and statistics;

KEYWORDS
CMA ES, Bayesian, conjugate prior, normal Wishart, normal inverse Wishart, mixture models

ACM Reference Format:
Eric Benhamou, David Saltiel, Beatrice Guez, and Nicolas Paris. 2019. BCMA-ES II: revisiting Bayesian CMA-ES. In Proceedings of A.I Square Working Paper. ACM, New York, NY, USA, 10 pages. https://doi.org/10.1145/nnnnnnn.nnnnnnn

1 INTRODUCTION
Bayesian statistics have revolutionized statistics much as quantum mechanics has revolutionized Newtonian mechanics. Like the latter, the usual frequentist statistics can be seen as a particular asymptotic case of the former. Indeed, the Cox Jaynes theorem ([8]) proves that under the four axiomatic assumptions given by:

• plausibility degrees are represented by real numbers (continuity of method),
• none of the possible data should be ignored (no retention),
• these values follow the usual common sense rules, as stated by the well-known Laplace formula: probability theory is truly common sense represented in calculus (common sense),
• and states of equivalent knowledge should have equivalent degrees of plausibility (consistency),

then there exists a probability measure, defined up to a monotonous function, such that it follows the usual probability calculus and the fundamental rule of Bayes, that is:

P(H, D) = P(H|D) P(D) = P(D|H) P(H)    (1)

where H and D are two members of the implied σ-algebra. The letters are not chosen by chance: H stands for the hypothesis, which can be interpreted as a hypothesis on the parameters, while D stands for data.

The usual frequentist approach states the probability of an observation, P(D), given a certain hypothesis H on the state of the world. However, as equation (1) is completely symmetric, nothing prevents us from changing our point of view and asking the inverse question: given an observation of data D, what is the plausibility of the hypothesis H? The Bayes rule trivially answers this question:

P(H|D) = P(D|H) P(H) / P(D)    (2)

or equivalently,

P(H|D) ∝ P(D|H) P(H)    (3)

In the above equation, P(H) is called the prior probability or simply the prior, while the conditional probability P(H|D) is called the posterior probability or simply the posterior. There are a few remarks to be made. First of all, the prior is not necessarily independent of any knowledge of the experience; on the contrary, a prior is often determined with some knowledge of previous experience in order to make a meaningful choice. Second, prior and posterior are not necessarily related to a chronological order but rather to a logical order. After observing some data D, we revise the plausibility of H. It is interesting to see that the conditional probability P(D|H), considered as a function of H, is indeed a likelihood for H.
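As a minimal illustration of equation (3), and not part of the paper, the following Python sketch computes a posterior over a discrete grid of hypotheses by multiplying a prior by a likelihood and renormalizing; the grid, the flat prior and the Bernoulli data are arbitrary illustrative choices.

import numpy as np

# Discrete grid of hypotheses: candidate values of a success probability theta.
theta = np.linspace(0.01, 0.99, 99)

# Prior P(H): a flat prior over the grid (an arbitrary choice for the example).
prior = np.ones_like(theta) / theta.size

# Data D: 7 successes out of 10 Bernoulli trials.
successes, trials = 7, 10

# Likelihood P(D|H) evaluated at every hypothesis of the grid.
likelihood = theta**successes * (1.0 - theta)**(trials - successes)

# Posterior P(H|D) ∝ P(D|H) P(H), normalized so that it sums to one.
posterior = likelihood * prior
posterior /= posterior.sum()

print("posterior mean of theta:", float(np.sum(theta * posterior)))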
The Cox Jaynes theorem as presented in [18] gives the foundation for Bayesian calculus. Another important result is De Finetti's theorem. Let us recall the definition of infinite exchangeability.

Definition 1.1. (Infinite exchangeability). We say that (x_1, x_2, ...) is an infinitely exchangeable sequence of random variables if, for any n, the joint probability p(x_1, x_2, ..., x_n) is invariant to permutation of the indices. That is, for any permutation π,

p(x_1, x_2, ..., x_n) = p(x_{π(1)}, x_{π(2)}, ..., x_{π(n)})

Equipped with this definition, De Finetti's theorem, as provided below, states that exchangeable observations are conditionally independent relative to some latent variable.

Theorem 1.1. (De Finetti, 1930s). A sequence of random variables (x_1, x_2, ...) is infinitely exchangeable iff, for all n,

p(x_1, x_2, ..., x_n) = ∫ ∏_{i=1}^{n} p(x_i | θ) P(dθ),

for some measure P on θ.

This representation theorem 1.1 justifies the use of priors on parameters since, for exchangeable data, there must exist a parameter θ, a likelihood p(x|θ) and a distribution π on θ. A proof of De Finetti's theorem is for instance given in [23] (section 1.5). We will see that this Bayesian setting gives a powerful framework for revisiting black box optimization, which is introduced below.

2 BLACK BOX OPTIMIZATION
We assume that we have a real-valued p-dimensional function f : R^p → R. We examine the following optimization program:

min_{x ∈ R^p} f(x)    (4)

In contrast to traditional convex optimization theory, we do not assume that f is convex, nor that it is continuous or admits a global minimum. We are interested in the so-called black box optimization (BBO) setting where we only have access to the function f and nothing else. By nothing else, we mean that we cannot, for instance, compute a gradient. A practical way to do optimization in this very general and minimal setting is evolutionary optimization, and in particular the covariance matrix adaptation evolution strategy (CMA-ES) methodology. The CMA-ES [13] is arguably one of the most powerful real-valued derivative-free optimization algorithms, finding many applications in machine learning. It is a state-of-the-art optimizer for continuous black-box functions, as shown by the various benchmarks of the COCO (COmparing Continuous Optimisers) INRIA platform for ill-posed functions. It has led to a large number of papers and articles and we refer the interested reader to [1, 2, 4–6, 11–13, 16, 21] and [24] to cite a few. It has been successfully applied in many unbiased performance comparisons and numerous real-world applications. In particular, in machine learning, it has been used for direct policy search in reinforcement learning and hyper-parameter tuning in supervised learning ([10], [14, 15, 17], and references therein), as well as for hyperparameter optimization of deep neural networks [19].

In a nutshell, the (µ/λ) CMA-ES is an iterative black box optimization algorithm that, in each of its iterations, samples λ candidate solutions from a multivariate normal distribution, evaluates these solutions (sequentially or in parallel), retains the best µ candidates and adjusts the sampling distribution used for the next iteration to give higher probability to good samples. Each iteration can be individually seen as taking an initial guess or prior for the multivariate parameters, namely the mean and the covariance, and, after evaluating the sampled candidates, updating these parameters as a posterior.
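To make this sample–evaluate–select–re-estimate cycle concrete, here is a deliberately simplified (µ/λ) evolution-strategy loop in Python. It is only a sketch of the idea described above, not the actual CMA-ES update (which uses rank-one and rank-µ covariance updates together with step-size control); the sphere objective, the parameter values and the crude covariance re-estimation are illustrative choices.

import numpy as np

def sphere(x):
    # Black-box objective for the demo: we only evaluate it, never differentiate it.
    return float(np.sum(x**2))

def simple_mu_lambda_es(f, dim=5, lam=20, mu=5, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    mean = rng.normal(size=dim)      # initial guess for the mean
    cov = np.eye(dim)                # initial guess for the covariance

    for _ in range(iters):
        # Sample lambda candidate solutions from N(mean, cov).
        candidates = rng.multivariate_normal(mean, cov, size=lam)
        # Evaluate the candidates and retain the mu best ones.
        scores = np.array([f(c) for c in candidates])
        elite = candidates[np.argsort(scores)[:mu]]
        # Re-estimate the sampling distribution from the retained samples
        # (a crude stand-in for the rank-based CMA-ES updates).
        mean = elite.mean(axis=0)
        cov = np.cov(elite, rowvar=False) + 1e-6 * np.eye(dim)
    return mean, f(mean)

best_x, best_value = simple_mu_lambda_es(sphere)
print("best value found:", best_value)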
In the rest of the paper, we prove that the expected covariance in the normal Wishart model should be lower than in the normal inverse Wishart Gaussian prior. We then introduce a new prior given by a mixture of a normal Wishart and a normal inverse Wishart Gaussian prior. Likewise, we derive the update equations. In section 5, we finally give numerical results to compare all these methods.

3 CONJUGATE PRIORS
A key concept in Bayesian statistics is that of conjugate priors, which makes the computation really easy; it is described below.

Definition 3.1. A prior distribution π(θ) is said to be a conjugate prior if the posterior distribution

π(θ|x) ∝ p(x|θ) π(θ)    (5)

remains in the same distribution family as the prior.

At this stage, it is relevant to introduce exponential family distributions, as this higher level of abstraction, which encompasses the multivariate normal, trivially solves the issue of finding conjugate priors. This will be very helpful for inferring conjugate priors for the multivariate Gaussian used in CMA-ES.

Definition 3.2. A distribution is said to belong to the exponential family if it can be written (in its canonical form) as:

p(x|η) = h(x) exp(η · T(x) − A(η)),    (6)

where η is the natural parameter, T(x) is the sufficient statistic, A(η) is the log-partition function and h(x) is the base measure. η and T(x) may be vector-valued. Here a · b denotes the inner product of a and b. The log-partition function is defined by the integral:

A(η) := log ∫_X h(x) exp(η · T(x)) dx    (7)

Also, η ∈ Ω = {η ∈ R^m : A(η) < +∞}, where Ω is the natural parameter space. Moreover, Ω is a convex set and A(·) is a convex function on Ω.

Remark 3.1. Not surprisingly, the normal distribution N(x; µ, Σ) with mean µ ∈ R^d and covariance matrix Σ belongs to the exponential family, but with a different parametrisation.
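As a quick numerical check of Remark 3.1 (an illustration, not taken from the paper), the Python sketch below writes a 2-dimensional Gaussian in the canonical form (6), using the standard correspondence η = (Σ^{-1} µ, −Σ^{-1}/2), T(x) = (x, x xᵀ), h(x) = (2π)^{-d/2} and A(η) = ½ µᵀ Σ^{-1} µ + ½ log|Σ|, and verifies that it matches the usual parametrisation of the density.

import numpy as np
from scipy.stats import multivariate_normal

d = 2
mu = np.array([1.0, -0.5])
Sigma = np.array([[2.0, 0.3],
                  [0.3, 1.0]])
P = np.linalg.inv(Sigma)        # precision matrix Sigma^{-1}

# Natural parameters and log-partition function of N(mu, Sigma) in the form (6).
eta1 = P @ mu                   # paired with the sufficient statistic T1(x) = x
eta2 = -0.5 * P                 # paired with T2(x) = x x^T
A = 0.5 * mu @ P @ mu + 0.5 * np.log(np.linalg.det(Sigma))

def exp_family_pdf(x):
    # h(x) * exp(eta . T(x) - A(eta)) with base measure h(x) = (2*pi)^(-d/2).
    h = (2 * np.pi) ** (-d / 2)
    inner = eta1 @ x + np.sum(eta2 * np.outer(x, x))
    return h * np.exp(inner - A)

x = np.array([0.3, 0.7])
print(exp_family_pdf(x))                        # canonical (natural) parametrisation
print(multivariate_normal(mu, Sigma).pdf(x))    # usual (mean, covariance) parametrisation
# The two printed values agree up to floating point error.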