BCMA-ES II: revisiting Bayesian CMA-ES

Eric Benhamou, A.I Square Connect and Lamsade, France, [email protected]
David Saltiel, A.I Square Connect and LISIC, France, [email protected]
Beatrice Guez, A.I Square Connect, France, [email protected]
Nicolas Paris, A.I Square Connect, France, [email protected]

ABSTRACT
This paper revisits the Bayesian CMA-ES and provides updates for the normal Wishart prior. It emphasizes the difference between a normal Wishart and a normal inverse Wishart prior. After some computation, we prove that the only difference surprisingly lies in the expected covariance. We prove that the expected covariance should be lower in the normal Wishart prior model because of the convexity of the inverse. We present a mixture model that generalizes both the normal Wishart and the normal inverse Wishart model. We finally present various numerical experiments to compare both methods as well as the generalized method.

CCS CONCEPTS
• Mathematics of computing → Probability and statistics;

KEYWORDS
CMA ES, Bayesian, conjugate prior, normal Wishart, normal inverse Wishart, mixture models

ACM Reference Format:
Eric Benhamou, David Saltiel, Beatrice Guez, and Nicolas Paris. 2019. BCMA-ES II: revisiting Bayesian CMA-ES. In Proceedings of A.I Square Working Paper. ACM, New York, NY, USA, 10 pages. https://doi.org/10.1145/nnnnnnn.nnnnnnn

1 INTRODUCTION
Bayesian statistics have revolutionized statistics in the same way that quantum mechanics did for Newtonian mechanics. Just as Newtonian mechanics can be seen as a limiting case of quantum mechanics, the usual frequentist statistics can be seen as a particular asymptotic case of Bayesian statistics. Indeed, the Cox Jaynes theorem ([8]) proves that under the four axiomatic assumptions given by:

• plausibility degrees are represented by real numbers (continuity of method),
• none of the possible data should be ignored (no retention),
• these values follow the usual common sense rules, as stated by the well known Laplace formula: probability theory is truly common sense represented in calculus (common sense),
• and states of equivalent knowledge should have equivalent degrees of plausibility (consistency),

then there exists a probability measure, defined up to a monotonic function, that follows the usual probability calculus and the fundamental rule of Bayes, that is:

$$P(H, D) = P(H \mid D)\, P(D) = P(D \mid H)\, P(H) \qquad (1)$$

where $H$ and $D$ are two members of the implied $\sigma$-algebra. The letters are not chosen by chance: $H$ stands for the hypothesis, which can be interpreted as a hypothesis on the parameters, while $D$ stands for data.

The usual frequentist approach gives the probability $P(D)$ of an observation under a certain hypothesis $H$ on the state of the world. However, as equation (1) is completely symmetric, nothing hinders us from changing our point of view and asking the inverse question: given an observation of some data $D$, what is the plausibility of the hypothesis $H$? The Bayes rule trivially answers this question:

$$P(H \mid D) = \frac{P(D \mid H)\, P(H)}{P(D)} \qquad (2)$$

or equivalently,

$$P(H \mid D) \propto P(D \mid H)\, P(H) \qquad (3)$$

In the above equation, $P(H)$ is called the prior probability or simply the prior, while the conditional probability $P(H \mid D)$ is called the posterior probability or simply the posterior. There are a few remarks to be made. First of all, the prior is not necessarily independent of the knowledge of the experiment; on the contrary, a prior is often determined with some knowledge of previous experience in order to make a meaningful choice. Second, prior and posterior are not necessarily related to a chronological order but rather to a logical order. After observing some data $D$, we revise the plausibility of $H$. It is interesting to see that the conditional probability $P(D \mid H)$, considered as a function of $H$, is indeed a likelihood for $H$.
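To make equations (2) and (3) concrete, the short snippet below performs a toy Bayesian update over two competing hypotheses. It is only an illustration: the prior and likelihood values are made-up numbers and are not taken from the paper.

```python
import numpy as np

# Toy illustration of equations (2)-(3): the posterior over two competing
# hypotheses H1 and H2 is proportional to likelihood times prior.
# The prior and likelihood values are made-up numbers, chosen only for illustration.

prior = np.array([0.5, 0.5])         # P(H1), P(H2)
likelihood = np.array([0.8, 0.3])    # P(D | H1), P(D | H2) for some observed data D

unnormalised = likelihood * prior               # P(D | H) P(H), equation (3)
posterior = unnormalised / unnormalised.sum()   # dividing by P(D) gives equation (2)

print(posterior)  # [0.727..., 0.272...]: the observation D favours H1
```

Normalising by the sum is exactly the division by $P(D)$ in equation (2), since $P(D) = \sum_{H} P(D \mid H) P(H)$.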
The Cox Jaynes theorem, as presented in [18], gives the foundation for Bayesian calculus. Another important result is De Finetti's theorem. Let us recall the definition of infinite exchangeability.

Definition 1.1. (Infinite exchangeability). We say that $(x_1, x_2, \ldots)$ is an infinitely exchangeable sequence of random variables if, for any $n$, the joint probability $p(x_1, x_2, \ldots, x_n)$ is invariant to permutation of the indices. That is, for any permutation $\pi$,

$$p(x_1, x_2, \ldots, x_n) = p(x_{\pi(1)}, x_{\pi(2)}, \ldots, x_{\pi(n)})$$

Equipped with this definition, De Finetti's theorem, as provided below, states that exchangeable observations are conditionally independent relative to some latent variable.

Theorem 1.1. (De Finetti, 1930s). A sequence of random variables $(x_1, x_2, \ldots)$ is infinitely exchangeable iff, for all $n$,

$$p(x_1, x_2, \ldots, x_n) = \int \prod_{i=1}^{n} p(x_i \mid \theta)\, P(d\theta),$$

for some measure $P$ on $\theta$.

This representation theorem 1.1 justifies the use of priors on parameters since, for exchangeable data, there must exist a parameter $\theta$, a likelihood $p(x \mid \theta)$ and a distribution $\pi$ on $\theta$. A proof of De Finetti's theorem is for instance given in [23] (section 1.5). We will see that this Bayesian setting gives a powerful framework for revisiting black box optimization, which is introduced below. In particular, we prove that the expected covariance in the normal Wishart prior should be lower than in the normal inverse Wishart Gaussian prior. We then introduce a new prior given by a mixture of normal Wishart and normal inverse Wishart Gaussian prior. Likewise, we derive the update equations. In section 5, we finally give numerical results to compare all these methods.

2 BLACK BOX OPTIMIZATION
We assume that we have a real-valued $p$-dimensional function $f : \mathbb{R}^p \to \mathbb{R}$. We examine the following optimization program:

$$\min_{x \in \mathbb{R}^p} f(x) \qquad (4)$$

In contrast to traditional convex optimization theory, we do not assume that $f$ is convex, nor that it is continuous or admits a global minimum. We are interested in the so called black box optimization (BBO) settings where we only have access to the function $f$ and nothing else. By nothing else, we mean that we cannot, for instance, compute gradients. A practical way to do optimization in this very general and minimal setting is to do evolutionary optimization and in particular use the covariance matrix adaptation evolution strategy (CMA-ES) methodology. The CMA-ES [13] is arguably one of the most powerful real-valued derivative-free optimization algorithms, finding many applications in machine learning. It is a state-of-the-art optimizer for continuous black-box functions, as shown by the various benchmarks of the COCO (COmparing Continuous Optimisers) INRIA platform for ill-posed functions. It has led to a large number of papers and articles and we refer the interested reader to [1, 2, 4–6, 11–13, 16, 21] and [24] to cite a few. It has been successfully applied in many unbiased performance comparisons and numerous real-world applications. In particular, in machine learning, it has been used for direct policy search in reinforcement learning and hyper-parameter tuning in supervised learning ([10], [14, 15, 17] and references therein), as well as hyperparameter optimization of deep neural networks [19].

In a nutshell, the (µ/λ) CMA-ES is an iterative black box optimization algorithm that, in each of its iterations, samples λ candidate solutions from a multivariate normal distribution, evaluates these solutions (sequentially or in parallel), retains µ candidates and adjusts the sampling distribution used for the next iteration to give higher probability to good samples.
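The following is a minimal sketch of this sample, evaluate, select and adapt loop, written for a simple quadratic test function. It is not the paper's BCMA-ES nor the full CMA-ES: it only re-estimates the mean and covariance from the µ retained candidates and deliberately omits the step-size control, recombination weights and evolution paths of the actual algorithm; the objective, dimension and hyper-parameters are arbitrary choices for illustration.

```python
import numpy as np

# A deliberately simplified sketch of the (mu/lambda) sample-evaluate-select-update
# loop described above. It only re-estimates the mean and covariance from the mu
# retained candidates; the step-size control, recombination weights and evolution
# paths of the full CMA-ES are omitted. All hyper-parameters are illustrative.

def sphere(x):
    # simple convex test function used only for this illustration
    return float(np.sum(x ** 2))

def simple_es(f, dim=5, lam=40, mu=10, iterations=100, seed=0):
    rng = np.random.default_rng(seed)
    mean = rng.normal(size=dim)          # initial guess for the mean
    cov = np.eye(dim)                    # initial guess for the covariance
    for _ in range(iterations):
        # sample lambda candidate solutions from the current multivariate normal
        candidates = rng.multivariate_normal(mean, cov, size=lam, check_valid="ignore")
        # evaluate them and retain the mu best (smallest objective values)
        order = np.argsort([f(c) for c in candidates])
        elites = candidates[order[:mu]]
        # adjust the sampling distribution used for the next iteration
        mean = elites.mean(axis=0)
        cov = np.cov(elites, rowvar=False) + 1e-8 * np.eye(dim)  # jitter keeps cov usable
    return mean

best = simple_es(sphere)
print(sphere(best))  # should be small for this simple convex test function
```

This sketch only illustrates the loop structure that the next paragraph reinterprets in Bayesian terms.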
Each iteration can be individually seen as taking an initial guess or prior for the multivariate parameters, namely the mean and the covariance, and, after evaluating the sampled candidates, updating this prior into a posterior.

3 CONJUGATE PRIORS
A key concept in Bayesian statistics is that of conjugate priors, which makes the computation really easy and is described below.

Definition 3.1. A prior distribution $\pi(\theta)$ is said to be a conjugate prior if the posterior distribution

$$\pi(\theta \mid x) \propto p(x \mid \theta)\, \pi(\theta) \qquad (5)$$

remains in the same distribution family as the prior.

At this stage, it is relevant to introduce exponential family distributions, as this higher level of abstraction, which encompasses the multivariate normal, trivially solves the issue of finding conjugate priors. This will be very helpful for inferring conjugate priors for the multivariate Gaussian used in CMA-ES.

Definition 3.2. A distribution is said to belong to the exponential family if it can be written (in its canonical form) as:

$$p(x \mid \eta) = h(x) \exp\left(\eta \cdot T(x) - A(\eta)\right), \qquad (6)$$

where $\eta$ is the natural parameter, $T(x)$ is the sufficient statistic, $A(\eta)$ is the log-partition function and $h(x)$ is the base measure. $\eta$ and $T(x)$ may be vector-valued. Here $a \cdot b$ denotes the inner product of $a$ and $b$. The log-partition function is defined by the integral:

$$A(\eta) \triangleq \log \int_{\mathcal{X}} h(x) \exp\left(\eta \cdot T(x)\right) dx. \qquad (7)$$

Also, $\eta \in \Omega = \{\eta \in \mathbb{R}^m \mid A(\eta) < +\infty\}$ where $\Omega$ is the natural parameter space. Moreover, $\Omega$ is a convex set and $A(\cdot)$ is a convex function on $\Omega$.

Remark 3.1. Not surprisingly, the normal distribution $\mathcal{N}(x; \mu, \Sigma)$ with mean $\mu \in \mathbb{R}^d$ and covariance matrix $\Sigma$ belongs to the exponential family, but with a different parametrisation.
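As a sanity check of Remark 3.1, the snippet below evaluates the multivariate normal log-density both directly and through the canonical form (6), using the standard natural parametrisation $\eta = (\Sigma^{-1}\mu,\, -\tfrac{1}{2}\Sigma^{-1})$, sufficient statistic $T(x) = (x,\, x x^{\top})$, base measure $h(x) = (2\pi)^{-d/2}$ and log-partition $A(\eta) = \tfrac{1}{2}\mu^{\top}\Sigma^{-1}\mu + \tfrac{1}{2}\log|\Sigma|$. The particular $\mu$, $\Sigma$ and test point are arbitrary choices made for the check, not values from the paper.

```python
import numpy as np
from scipy.stats import multivariate_normal

# Numerical check of Remark 3.1: the multivariate normal N(x; mu, Sigma) can be
# written in the exponential-family canonical form
#   p(x | eta) = h(x) exp(eta . T(x) - A(eta))
# with eta = (Sigma^{-1} mu, -1/2 Sigma^{-1}), T(x) = (x, x x^T),
# h(x) = (2 pi)^{-d/2} and A(eta) = 1/2 mu^T Sigma^{-1} mu + 1/2 log|Sigma|.
# mu, Sigma and x below are arbitrary values used only for the check.

d = 3
rng = np.random.default_rng(0)
mu = rng.normal(size=d)
M = rng.normal(size=(d, d))
Sigma = M @ M.T + d * np.eye(d)   # a symmetric positive definite covariance
x = rng.normal(size=d)            # a test point

# Natural parameters
Lambda = np.linalg.inv(Sigma)     # precision matrix Sigma^{-1}
eta1 = Lambda @ mu
eta2 = -0.5 * Lambda

# Inner product eta . T(x); the matrix part uses the Frobenius inner product
dot = eta1 @ x + np.sum(eta2 * np.outer(x, x))

# Log-partition and log base measure
A = 0.5 * mu @ Lambda @ mu + 0.5 * np.linalg.slogdet(Sigma)[1]
log_h = -0.5 * d * np.log(2 * np.pi)

log_density_expfam = log_h + dot - A
log_density_direct = multivariate_normal(mean=mu, cov=Sigma).logpdf(x)

print(log_density_expfam, log_density_direct)  # the two values agree
assert np.isclose(log_density_expfam, log_density_direct)
```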