Studies in Continuous Black-Box Optimization
Technische Universität München
Lehrstuhl VI: Echtzeitsysteme und Robotik

Studies in Continuous Black-box Optimization

Tom Schaul

Complete reprint of the dissertation approved by the Fakultät für Informatik of the Technische Universität München for the award of the academic degree of Doktor der Naturwissenschaften (Dr. rer. nat.).

Chair: Univ.-Prof. B. Brügge, Ph.D.
Examiners of the dissertation:
1. Univ.-Prof. Dr. H.-J. Bungartz
2. Univ.-Prof. Dr. H. J. Schmidhuber

The dissertation was submitted to the Technische Universität München on 06.06.2011 and accepted by the Fakultät für Informatik on 28.06.2011.

Abstract

Optimization is the research field that studies the design of algorithms for finding the best solutions to problems we humans throw at them. While the whole domain is of great practical utility, the present thesis focuses on the subfield of continuous black-box optimization, presenting a collection of novel, state-of-the-art algorithms for solving problems in that class.

First, we introduce a general-purpose algorithm called Natural Evolution Strategies (NES). In contrast to typical evolutionary algorithms, which search in the vicinity of the fittest individuals in a population, evolution strategies aim at repeating the type of mutations that led to those individuals. We can characterize those mutations by a search distribution. The key idea of NES is to ascend the gradient on the parameters of that distribution towards higher expected fitness (sketched below). We show how plain gradient ascent is destined to fail, and provide a viable alternative that instead follows the natural gradient to adapt the search distribution, which appropriately normalizes the update step with respect to its uncertainty. Being derived from first principles, the NES approach can be extended to all types of search distributions that allow a parametric form, not just the classical multivariate Gaussian one. We derive a number of NES variants for different distributions, and show how they are useful on different problem classes. In addition, we rein in the computational cost, avoiding costly matrix inversions through an incremental change of coordinates. Two additional, novel techniques, importance mixing and adaptation sampling, allow us to automatically tune the learning rate and batch size to the problem, and thereby further reduce the average number of required fitness evaluations. A third technique, restart strategies, provides the algorithm with additional robustness in the presence of multiple local optima or noise.

Second, we introduce a new approach to costly black-box optimization, where fitness evaluations are very expensive. Here, we model the fitness function using state-of-the-art Gaussian process regression, and use the principle of artificial curiosity to direct exploration towards the most informative next evaluation candidate. Both the expected fitness improvement and the expected information gain can be derived explicitly from the Gaussian process model, and our method constructs a front of Pareto-optimal points according to these two criteria (also sketched below). This makes the exploration-exploitation trade-off explicit, and permits maximally informed candidate selection.
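The following is a minimal, illustrative sketch of the NES idea for a separable Gaussian search distribution, in the spirit of the separable variant discussed in Chapter 2: sample candidates, apply rank-based fitness shaping, and follow the natural gradient of the distribution parameters. The function names, the utility formula, and the default hyperparameters here are simplifying assumptions for illustration only; the thesis's own implementations are referenced in Appendix A.

import numpy as np

def nes_step(f, mu, sigma, population=50, eta_mu=1.0, eta_sigma=0.1):
    # One illustrative NES update for a Gaussian N(mu, diag(sigma^2)).
    s = np.random.randn(population, len(mu))     # standard-normal samples
    z = mu + sigma * s                           # candidate solutions
    fitness = np.array([f(x) for x in z])
    # Rank-based fitness shaping: the best candidate gets the largest utility.
    order = np.argsort(np.argsort(-fitness))     # 0 = best candidate
    utilities = np.maximum(0.0, np.log(population / 2.0 + 1.0) - np.log(order + 1.0))
    utilities = utilities / utilities.sum() - 1.0 / population
    # Natural-gradient estimates in local (natural) coordinates.
    grad_mu = utilities @ s                      # gradient w.r.t. the mean
    grad_sigma = utilities @ (s ** 2 - 1.0)      # gradient w.r.t. log(sigma)
    mu = mu + eta_mu * sigma * grad_mu
    sigma = sigma * np.exp(0.5 * eta_sigma * grad_sigma)
    return mu, sigma

# Example: maximize f(x) = -||x||^2 (equivalently, minimize the sphere function).
mu, sigma = np.zeros(10), np.ones(10)
for _ in range(300):
    mu, sigma = nes_step(lambda x: -np.sum(x ** 2), mu, sigma)

Iterating this step should shrink sigma around the optimum while mu approaches it; the learning rates and population size above are arbitrary illustrative defaults, not the tuned values reported in the thesis.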
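Similarly, here is a rough sketch of the curiosity-driven selection step of Chapter 3: fit a Gaussian process to the evaluations collected so far, score candidate points by expected fitness improvement and by an information term, and keep only the Pareto-optimal candidates under those two criteria. The thesis derives the Gaussian process information gain explicitly; this sketch substitutes the simpler predictive entropy, and the squared-exponential kernel, noise level, and helper names (rbf, gp_posterior, pareto_front, curiosity_candidates) are illustrative assumptions.

import numpy as np
from scipy.stats import norm

def rbf(a, b, length=1.0):
    # Squared-exponential kernel between two sets of points (rows).
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / length ** 2)

def gp_posterior(X, y, Xc, noise=1e-6):
    # Posterior mean and variance at candidates Xc (m, d), given data X (n, d), y (n,).
    K = rbf(X, X) + noise * np.eye(len(X))
    Ks = rbf(X, Xc)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    v = np.linalg.solve(L, Ks)
    var = np.diag(rbf(Xc, Xc)) - (v ** 2).sum(axis=0)
    return Ks.T @ alpha, np.maximum(var, 1e-12)

def pareto_front(scores):
    # Indices of rows not dominated by any other row (both criteria maximized).
    keep = []
    for i, s in enumerate(scores):
        if not np.any(np.all(scores >= s, axis=1) & np.any(scores > s, axis=1)):
            keep.append(i)
    return keep

def curiosity_candidates(X, y, Xc):
    # Score candidates by expected improvement and an information term,
    # then keep the Pareto-optimal ones under those two criteria.
    X, y, Xc = np.asarray(X, float), np.asarray(y, float), np.asarray(Xc, float)
    mean, var = gp_posterior(X, y, Xc)
    std = np.sqrt(var)
    z = (mean - y.max()) / std
    ei = (mean - y.max()) * norm.cdf(z) + std * norm.pdf(z)   # expected improvement
    info = 0.5 * np.log(2.0 * np.pi * np.e * var)             # predictive entropy, a proxy
    return Xc[pareto_front(np.column_stack([ei, info]))]

In this sketch the final candidate is then chosen from the returned Pareto set, which is what makes the exploration-exploitation trade-off explicit rather than collapsing it into a single acquisition value.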
We empirically validate our algorithms on a broad set of benchmarks. We also include a study of how continuous black-box optimization can solve challenging neuro-evolution problems, where multi-dimensional recurrent neural networks are trained to play the game of Go. At the same time, we establish to what degree those network architectures can transfer good playing behavior from small to large board sizes. In summary, this dissertation presents a collection of novel algorithms, for the general problem of continuous black-box optimization as well as a number of special cases, each validated empirically.

Dedicated to Joé Schaul
In memory of Adina Mosincat

Table of Contents

Abstract
Table of Contents
List of Tables
List of Figures
List of Algorithms
Preface
    0.1 Thesis Outline
    0.2 Related Publications
    0.3 Notation
Acknowledgments

1 Introduction: Continuous Black-box Optimization
    1.1 Problem Definition
        1.1.1 Continuous Optimization: Real-valued Solution Space
        1.1.2 Black-box Optimization: Unknown Function
        1.1.3 Local Optimization: Find a Nearby Optimum
        1.1.4 Noisy Optimization: Functions Corrupted by Noise
        1.1.5 Multi-objective Optimization
    1.2 Evaluating Optimization Algorithms
    1.3 State-of-the-Art Approaches
        1.3.1 Evolutionary Methods
        1.3.2 Response Surface Methods
    1.4 Impact and Applications
    1.5 Related Problem Domains
    1.6 Open Questions

2 Natural Evolution Strategies
    2.1 The NES Family
        2.1.1 Chapter Outline
    2.2 Search Gradients
        2.2.1 Search Gradients for Gaussian Distributions
        2.2.2 Limitations of Plain Search Gradients
    2.3 Using the Natural Gradient
    2.4 Performance and Robustness Techniques
        2.4.1 Fitness Shaping
        2.4.2 Fitness Baselines
        2.4.3 Importance Mixing
        2.4.4 Adaptation Sampling
        2.4.5 Restart Strategies
    2.5 Techniques for Multinormal Distributions
        2.5.1 Using Exponential Parameterization
        2.5.2 Using Natural Coordinates
        2.5.3 Orthogonal Decomposition of Multinormal Parameter Space
        2.5.4 Connection to CMA-ES
        2.5.5 Elitism
            Use for Multi-objective NES
    2.6 Beyond Multinormal Distributions
        2.6.1 Separable NES
        2.6.2 Rotationally-symmetric Distributions
            Sampling from Radial Distributions
            Computing the Fisher Information Matrix
        2.6.3 Heavy-tailed NES
            Groups of Invariances
            Cauchy Distribution
    2.7 Experiments
        2.7.1 Experimental Setup and Hyperparameters
        2.7.2 Black-box Optimization Benchmarks
        2.7.3 Separable NES
            Separable and Non-separable Benchmarks
            Neuro-evolution
            Lennard-Jones Potentials
        2.7.4 Heavy Tails and Global Optimization
        2.7.5 Results Summary
    2.8 Discussion

3 Curiosity-driven Optimization
    3.1 Artificial Curiosity
        3.1.1 Background
        3.1.2 Artificial Curiosity as a Guide for Optimization
        3.1.3 Formal Framework
    3.2 Exploration/Exploitation Trade-off
    3.3 Curiosity-driven Optimization (General Form)
    3.4 Models of Expected Fitness and Information Gain
        3.4.1 A Good Model Choice: Gaussian Processes
        3.4.2 Derivation of Gaussian Process Information Gain
    3.5 Curiosity-driven Optimization with Gaussian Processes
    3.6 Minimal Asymptotic Requirements
        3.6.1 Reaching Optima at Arbitrary Distance
        3.6.2 Locating Optima with Arbitrary Precision
        3.6.3 Guaranteed to Find Global Optimum
    3.7 Proof-of-concept
    3.8 Discussion

4 A Study in Scalability
    4.1 Scalability in Problem Size
    4.2 Example Domain: the Game of Go
        4.2.1 Rules of the Game
        4.2.2 Challenges of Go
        4.2.3 Simpler Variants of Go
        4.2.4 Computer Opponents
    4.3 Scalable Neural Architectures
        4.3.1 Multi-dimensional RNN
        4.3.2 Multi-dimensional LSTM
        4.3.3 A Custom Architecture for Go
    4.4 Experiments
        4.4.1 Experimental Setup
        4.4.2 Random-weight Networks
        4.4.3 Optimized Networks
        4.4.4 Coevolved Networks
    4.5 Discussion

5 Conclusions
    5.1 Perspectives
    5.2 Closing Words

A Appendix: Implementations in Python
B Appendix: Weighted Mann-Whitney Test
Bibliography
Index

List of Tables

1 Notation and symbols used
2.1 Default parameters
2.2 Summary of enhancing techniques