Computer Experimental Design for Gaussian Process Surrogates

Boya Zhang

Dissertation submitted to the Faculty of the Virginia Polytechnic Institute and State University in partial fulfillment of the requirements for the degree of

Doctor of Philosophy in

Robert B. Gramacy, Chair
Xinwei Deng
David Higdon
Leanna House

August 13, 2020 Blacksburg, Virginia

Keywords: Computer experiment, experimental design, sequential design, Gaussian process surrogates, input-dependent noise

Copyright 2020, Boya Zhang

Computer Experimental Design for Gaussian Process Surrogates

Boya Zhang

(ABSTRACT)

With the rapid development of computing power, computer experiments have gained popularity in various scientific fields, like cosmology, ecology and engineering. However, some computer experiments for complex processes are still computationally demanding. A surrogate model, or emulator, is often employed as a fast substitute for the simulator. Meanwhile, a common challenge in computer experiments and related fields is to efficiently explore the input space using a small number of samples, i.e., the experimental design problem. This dissertation focuses on the design problem under Gaussian process surrogates. The first work demonstrates empirically that space-filling designs disappoint when the model hyperparameterization is unknown and must be estimated from data observed at the chosen design sites. A purely random design is shown to be superior to higher-powered alternatives in many cases. Thereafter, a new family of distance-based designs is proposed and its superior performance is illustrated in both static (one-shot design) and sequential settings. The second contribution is motivated by an agent-based model (ABM) of delta smelt conservation. The ABM is developed to assist in a study of delta smelt life cycles and to understand sensitivities to myriad natural variables and human interventions. However, the input space is high-dimensional, running the simulator is time-consuming, and its outputs change nonlinearly in both mean and variance. A batch sequential design scheme is proposed, generalizing one-at-a-time variance-based active learning, as a means of keeping multi-core cluster nodes fully engaged with expensive runs. The acquisition strategy is carefully engineered to favor selection of replicates which boost statistical and computational efficiencies. Design performance is illustrated on a range of toy examples before embarking on a smelt campaign and downstream high-fidelity input sensitivity analysis.
Computer Experimental Design for Gaussian Process Surrogates

Boya Zhang

(GENERAL AUDIENCE ABSTRACT)

With the rapid development of computing power, computer experiments have gained popularity in various scientific fields, like cosmology, ecology and engineering. However, some computer experiments for complex processes are still computationally demanding. Thus, a statistical model built upon input-output observations, i.e., a so-called surrogate model or emulator, is needed as a fast substitute for the simulator. Design of experiments, i.e., how to select samples from the input space under budget constraints, is also worth studying. This dissertation focuses on the design problem under Gaussian process (GP) surrogates. The first work demonstrates empirically that commonly-used space-filling designs disappoint when the model hyperparameterization is unknown and must be estimated from data observed at the chosen design sites. Thereafter, a new family of distance-based designs is proposed and its superior performance is illustrated in both static (design points are allocated in one shot) and sequential settings (data are sampled sequentially). The second contribution is motivated by a stochastic computer simulator of delta smelt conservation. This simulator is developed to assist in a study of delta smelt life cycles and to understand sensitivities to myriad natural variables and human interventions. However, the input space is high-dimensional, running the simulator is time-consuming, and its outputs change nonlinearly in both mean and variance. An innovative batch sequential design method is proposed, generalizing the one-at-a-time sequential design to a one-batch-at-a-time scheme with the goal of parallel computing. The criterion for subsequent data acquisition is carefully engineered to favor selection of replicates which boost statistical and computational efficiencies. The design performance is illustrated on a range of toy examples before embarking on a smelt simulation campaign and downstream sensitivity analysis of simulator inputs.

Dedication

To my dearest family.

Acknowledgments

This dissertation would not have been possible without the help and support of my advisor, committee, colleagues, friends and family. First, I would like to express my deepest gratitude to my advisor Bobby Gramacy. With his guidance, I have had opportunities to learn about many interesting and challenging research areas, which motivates me to explore more in the future. In the past three years, he gave me persistent help and encouragement in overcoming difficulties. I feel so grateful to have had the chance of working with him. I want to thank my committee members Dr. Xinwei Deng, Dr. Dave Higdon, and Dr. Leanna House. Their valuable advice and thought-provoking questions have broadened my view and pushed me to think deeper. I would also like to thank the statistics department of Virginia Tech for providing various courses on cutting-edge topics and plenty of opportunities in teaching and collaborating. And my biggest thanks to all my friends, especially Ruijin Lu, who saved me from homelessness during my dissertation writing period. Last but not least, I would like to thank my family, in particular, my parents, Jinling Ma and Yongchun Zhang, my grandparents, Qinglan Wu, Meiying Ni and Shimin Zhang, for their unconditional love. Their support gives me the strength to face any difficulties.

Contents

List of Figures x

List of Tables xiii

1 Introduction 1

1.1 Background ...... 1

1.2 Motivation example: Delta Smelt ...... 2

1.3 Overview of this dissertation ...... 3

1.3.1 Distance-distributed design for GP surrogates ...... 3

1.3.2 IMSPE batch sequential design ...... 3

1.3.3 Delta Smelt ...... 4

2 Review of literature 5

2.1 Surrogate modeling ...... 5

2.1.1 Gaussian Process surrogates ...... 6

2.1.2 GP kernels ...... 7

2.1.3 Surrogates for stochastic computer simulators ...... 9

2.1.4 GP with replication ...... 10

2.1.5 Heteroskedastic Gaussian process ...... 11

2.2 Computer experimental design ...... 12

2.2.1 Geometric designs ...... 13

2.2.2 Model-based design ...... 15

2.2.3 Sequential design ...... 16

2.2.4 Bayesian Optimization ...... 18

2.2.5 Batch sequential design ...... 19

3 Distance-distributed Design for Gaussian Process Surrogates 21

3.1 Setup and related work ...... 24

3.1.1 Gaussian Process surrogates ...... 25

3.1.2 Thinking about designs for GPs ...... 25

3.2 Better than random ...... 28

3.2.1 Uniform to beta designs ...... 31

3.2.2 Optimization of shape parameters of betadist design ...... 34

3.3 Hybrid betadist and LHS ...... 39

3.4 Application to sequential design ...... 41

3.4.1 Active Learning MacKay ...... 42

3.4.2 Expected improvement for optimization ...... 44

4 IMSPE batch-sequential design 48

4.1 Batch sequential design ...... 50

4.1.1 A criterion for minimizing variance ...... 50

4.1.2 Batch IMSPE gradient ...... 53

4.1.3 Implementation details and illustration ...... 57

4.2 Hunting for replicates ...... 59

4.2.1 Backtracking via merge ...... 59

4.2.2 Selecting among backtracked batches ...... 61

4.3 Benchmarking examples ...... 62

4.3.1 1d toy example ...... 64

4.3.2 2d toy example ...... 64

4.3.3 Ocean oxygen ...... 68

4.3.4 Assemble-to-order ...... 70

5 Delta smelt 72

5.1 Agent-based model ...... 74

5.2 Pilot study ...... 77

5.3 Big experiment ...... 79

5.3.1 Setup and acquisitions ...... 80

5.3.2 Downstream analysis ...... 83

6 Conclusion 88

6.1 Distance-distributed design for GP surrogates ...... 88

6.2 IMSPE Batch-sequential design ...... 90

6.3 Delta Smelt simulator ...... 93

Bibliography 94

List of Figures

2.1 GP posterior predictive distribution in terms of means, 2.5% and 97.5% quantiles...... 8

2.2 Predictions and associated 95% uncertainty intervals based on GP with nugget parameter...... 10

3.1 logMSEs from design experiment and de-trending surface...... 28

3.2 Standardized logMSE boxplots to thirty gridded θ(t) values for seven comparators using n = 2d+1 over input dimension d ∈ {2, 3, 4, 5, 6}. The comparators are described in the text. Two outlying standardized log MSE values were clipped by the y-axes to enhance boxplot viewing: random (d = 4) at 10.9 and LHS (d = 6) at 17.4...... 30

3.3 Empirical density curves corresponding to random designs in 2d with lowest 50 logMSE(θ) values from 1000 random design realizations. Empirical maximin and Beta(2.5, 4) densities are shown for comparison...... 33

3.4 deRIMSE surface with T = 1000 for n = 16 and d = 2 as estimated by hetGP. Dots show the design sites; lighter (heat) colors correspond to higher deRIMSEs...... 36

3.5 Outcomes of BO of RIMSE surfaces for various choices of n and d. Numbers show location and number of replicates in acquisitions; blue square shows (α̂, β̂); purple and green contours show 5% and 10% from the optimal. .... 38

3.6 2d (black circles) and 1d (red triangles) projections of three d = 3 designs, n = 16...... 41

3.7 RMSPE comparison of initial designs (ninit = 8) as a function of the number of subsequent sequential design iterations via ALM. Each comparator has a pair of lines: those in the left panel indicate mean RMSE; those on the right are the upper 90% quantile...... 43

4.1 Batch IMSPE optimization iterations from initial (blue dots) to final (green crosses) locations. Three optimization epochs are provided by arrows. An overlayed heatmap shows the estimated standard deviation surface r(x). .. 57

4.2 Left: backtracking with merge; gray arrows connect optimal X̃_s with numbers indicating s = 1, ..., M; Right: IMSPE changes over numbers of replicates. Merging steps that are finally taken are shown in blue. Fitted segmented regression lines are overlaid...... 61

4.3 Three selected scatter plots of IMSPE versus number of replicates with best change-point fitted regression lines overlaid. Colors match arrows in Figure 4.2. 63

4.4 The top-left panel shows the initial design observations. Remaining panels display the sequential design process after adding 1, 5, 10, 15 and 20 batches. 65

4.5 The heatmap shows the mean surface f(x). Lighter colors correspond to higher values. Contours of r(x) are overlaid...... 66

4.6 IMSPE design in batches: gray dots are initial design points; gray contours show signal and noise contrast; numbers indicate replicate multiplicity. The last two panels summarize all new points from 6 batches and all design points respectively...... 66

4.7 Results of RMSPE, score, time per iteration in fitting HetGP model, and the aggregate number of unique design locations from 50 MC repetitions. .... 67

4.8 Ocean simulator results in 30 MC repetitions: RMSPE, score, time per batch and the aggregate number of unique design locations n...... 69

4.9 RMSPE and score over design size N from 30 MC repetitions...... 71

5.1 2d heatmap and 1d lineplot slices of predictive mean and variance for selected inputs. The numbers overlaid indicate design locations and numbers of replicates...... 78

5.2 Empirical density of pairwise distances from IMSPE batch and maximin se- quential design for the pilot (left) and full (right) studies...... 80

5.3 Slices for the “full” experiment updating Figure 5.1...... 82

5.4 Sensitivity analysis: main effects (left); first order (middle) and total sensitivity (right) from 100 bootstrap re-samples...... 84

5.5 Sensitivity analysis for the variance process: main effects (left); first order (middle) and total sensitivity (right) from 100 bootstrap re-samples. .... 86

List of Tables

3.1 Pairwise t-test p-value table for (ninit = 8, d = 2) and two settings n = 25 (top table) and n = 70 (bottom). Statistically significant p-values, i.e., below 5%, are in bold...... 46

3.2 Pairwise t-test p-value table for (ninit = 16, d = 3) and two settings n = 50 (top table) and n = 100 (bottom). Statistically significant p-values, i.e., below 5%, are in bold...... 47

3.3 Pairwise t-test p-value table for (ninit = 32, d = 4) and two settings n = 200 (top table) and n = 500 (bottom). Statistically significant p-values, i.e., below 5%, are in bold...... 47

5.1 Delta smelt simulator input variables. The last column shows the settings of the pilot study in Section 5.2. MR abbreviates mortality rate; EPT means eating prey type...... 74

5.2 Augmenting Table 5.1 to show the settings of the “full” experiment...... 81

5.3 Proportion of positive I = T − S indices for mean process...... 85

5.4 Proportion of positive I = T − S indices for variance process...... 87

Chapter 1

Introduction

1.1 Background

Rapid development of computing power has made computer experiments commonplace in various scientific and engineering fields as an alternative to expensive field experiments. However, computer experiments can be expensive in terms of computation or time. Some of them may take hours or even days to get a single evaluation, especially when complex systems or processes are simulated. In this case, surrogate models are usually fitted with available observations, approximating or even replacing the original computer models.

The Gaussian process (GP) has been a widely-used surrogate model for deterministic computer models; see Cressie (1985), Sacks et al. (1989), Santner et al. (2003). Therefore, experimental design methods for GP surrogates are worth studying. Space-filling designs, which spread out the design points across the input space, are common choices when little is known about the underlying surface. Mckay et al. (1979) introduced Latin hypercube sampling (LHS). LHS not only places points uniformly throughout the input space, but also maintains this desirable property under projection. Johnson et al. (1990) proposed the maximin criterion, maximizing the minimum pairwise distance to spread points out.

However, if the true response surface is non-smooth or the desired precision is relatively high, a design with a fixed number of points may not capture the interesting region at one shot.


Sequential design strategies start with a small initial design, then sequentially determine the next sample under the guidance of the current design. There are potential savings in applying sequential sampling methods. Sequential designs are usually based on surrogates, so that corresponding model-based criteria can be calculated to serve different design objectives, such as optimization (Jones et al., 1998) and global fitting (MacKay, 1992).

GPs were originally proposed for interpolating data from deterministic computer experiments. But nowadays, computer experiments involving stochastic processes, like agent-based models, appear more often. The random noise, which is not necessarily constant over the input space, makes the design and modeling of stochastic computer experiments more challenging. In this scenario, replications, i.e., multiple observations at a single design site, are believed to be essential for separating signal from noise (Binois et al., 2018c).

1.2 Motivation example: Delta Smelt

A motivating example of a computer experiment in this dissertation is the Delta Smelt simulator. It is a stochastic individual-based model developed by Rose et al. (2013), which characterizes the population dynamics of fish by simulating the individual life cycles of the smelt living in the San Francisco Estuary. As one of the most highly altered estuarine ecosystems in the world, the San Francisco Estuary needs a better strategy for resource management and restoration. In this process, smelt play a role as one of the most important indicators of environmental condition. However, they have generally been at low abundance since the 1980s and showed an even sharper decrease starting in 2002. The simulation model considers multiple factors thought to contribute to the Delta Smelt decline, including mortality rates at different growth stages, river effects, and prey feed effects. Researchers aim to identify the more influential factors and understand how they interact with each other through this simulator.

1.3 Overview of this dissertation

This section provides an overview of the dissertation. Chapters 3 to 5 cover the main contributions of the dissertation, including distance-distributed design for GP surrogates, IMSPE batch-sequential design for heteroskedastic Gaussian processes, and its application to the delta smelt simulator. Chapter 6 concludes this dissertation with suggestions, methodological ideas and future work.

1.3.1 Distance-distributed design for GP surrogates

All sequential designs need an initial design to start with. Currently, space-filling designs are still common in initial stages. But space-filling designs disappoint when the model hyperparameterization is unknown, because the subsequent sequential designs would reinforce bad hyperparameter estimation from the initial design. In Chapter 3, we expose these inefficiencies and propose a family of new schemes by reverse engineering the qualities of the random designs which give the best estimates of GP lengthscales. Finally, we illustrate how our distance-based designs outperform in both static (one-shot design) and sequential settings.

1.3.2 IMSPE batch sequential design

Chapter 4 is motivated by learning the input–output dynamics of a stochastic and time-consuming agent-based model. Obtaining enough runs to learn those dynamics effectively requires both a nimble modeling strategy and parallel supercomputer evaluation. Recent advances in heteroskedastic Gaussian process (HetGP) surrogate modeling help, but little is known about how to appropriately plan experiments for highly distributed simulator evaluation. An IMSPE batch sequential design method is proposed under a newly developed heteroskedastic GP model to facilitate parallel computing. A backtracking strategy is applied to create replicates, which are beneficial to both modeling and computation. Design and modeling performance is demonstrated on a range of toy examples.

1.3.3 Delta Smelt

Delta smelt are an endangered fish whose fate is intimately linked with water management practice in the Sacramento river delta system, and who more broadly serve as a barometer for environmental health in the San Francisco Bay. Researchers have developed a stochastic, agent-based simulator to virtualize the system, with the goal of assisting in a study of delta smelt life cycles and to understand sensitivities to myriad natural variables and human interventions. However, the input configuration space is high-dimensional, running the simulator is time-consuming, and its noisy outputs change nonlinearly in both mean and variance. These challenges are addressed in Chapter 5 by employing the method proposed in Chapter 4. The influences of input factors on the response surface are compared in a sensitivity analysis thereafter.

Chapter 2

Review of literature

Computer simulation experiments are common in scientific areas. They are employed to mimic systems and processes that are often too expensive, time-consuming or sometimes infeasible to observe. However, some computer experiments for complex processes are still computationally demanding. In this case, a statistical model built upon input-output observations, i.e., a so-called surrogate model or emulator, is needed as a fast substitute for the simulator. Design of experiments, i.e., how to select samples from the input space under budget constraints, determines how much we can learn about the input-output dynamics. As two essential components of computer experiment analysis, surrogate modeling and computer experimental design are reviewed in this chapter.

2.1 Surrogate modeling

The Gaussian process (GP) has been a canonical surrogate model for emulating computer experiments (Cressie, 1985, Sacks et al., 1989, Santner et al., 2003). Due to their nonlinear flexibility, outstanding uncertainty quantification properties and partly analytic calculations, GPs have proved effective in a vast literature.


2.1.1 Gaussian Process surrogates

Let f : ℝ^d → ℝ denote an unknown function, generically, but standing in specifically for a computationally expensive computer model simulation. Let X = {x_1, ..., x_N} denote the chosen d-dimensional design, and let Y = (y_1, ..., y_N)^⊤ collect outputs y_i = f(x_i), for i = 1, ..., N. Then a GP surrogate model can be fitted with X and Y, which can be used in lieu of future expensive evaluations. Here we make the simplifying assumption that the computer model, f, is deterministic.

Putting a GP prior on f amounts to specifying that any finite realization of f, e.g., our N observations Y, has a multivariate normal (MVN) distribution. MVNs are uniquely specified by a mean vector and covariance matrix. It is common in the computer experiments literature to take the mean to be zero, and to specify the covariance structure via scaled inverse Euclidean distances. For example, Y ∼ N_N(0, τ² C_N), where C_N follows

$$C_N^{ij} = c_\theta(\mathbf{x}_i, \mathbf{x}_j) = \exp\left\{ -\sum_{p=1}^{d} \frac{(x_{ip} - x_{jp})^2}{\theta_p} \right\}. \qquad (2.1)$$

Above, τ² is an amplitude hyperparameter, and θ_p is the lengthscale determining the rate of decay of correlation as a function of distance in the p-th dimension of the input space, with θ = (θ_1, ..., θ_d). For a more detailed discussion of GP setups for modeling computer experiments, see, e.g., Santner et al. (2003).
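To fix ideas, the covariance structure in Equation (2.1) can be sketched in a few lines of Python with NumPy (an illustrative sketch, not the implementation used in this dissertation; the design `X` and lengthscales `theta` below are arbitrary examples):

```python
import numpy as np

def gaussian_kernel(X1, X2, theta):
    """Separable Gaussian kernel of Eq. (2.1):
    c_theta(x_i, x_j) = exp(-sum_p (x_ip - x_jp)^2 / theta_p)."""
    X1, X2 = np.asarray(X1, float), np.asarray(X2, float)
    theta = np.asarray(theta, float)
    # per-dimension squared distances, scaled by lengthscales theta_p
    d2 = (((X1[:, None, :] - X2[None, :, :]) ** 2) / theta).sum(axis=-1)
    return np.exp(-d2)

# a small d = 2 design; C_N is symmetric with unit diagonal
X = np.array([[0.1, 0.2], [0.4, 0.9], [0.8, 0.5]])
C = gaussian_kernel(X, X, theta=[0.5, 0.5])
```

Larger θ_p slows the correlation decay in the p-th coordinate, so each input dimension may carry its own effective smoothness.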

Fixing θ and τ², the GP predictive equations at new inputs x, given the data (X, Y), have a convenient closed form derived from simple MVN conditioning identities. The (posterior) predictive distribution for Y(x) | Y is Gaussian with

$$\text{mean} \quad \mu(\mathbf{x} \mid \mathbf{Y}) = \mathbf{c}^\top(\mathbf{x})\, C_N^{-1} \mathbf{Y}, \qquad (2.2)$$
$$\text{and variance} \quad \sigma^2(\mathbf{x} \mid \mathbf{Y}) = \tau^2 \left[ c_\theta(\mathbf{x}, \mathbf{x}) - \mathbf{c}^\top(\mathbf{x})\, C_N^{-1} \mathbf{c}(\mathbf{x}) \right],$$

where c^⊤(x) is the N-vector whose i-th component is c_θ(x, x_i).
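The predictive equations (2.2) are equally compact in code. The sketch below (illustrative Python, fixing θ and τ²; a serious implementation would use Cholesky solves rather than an explicit inverse) also demonstrates the interpolation property at a training site:

```python
import numpy as np

def gp_predict(x, X, Y, theta, tau2):
    """Posterior mean and variance from Eq. (2.2) at a single input x."""
    def k(A, B):
        return np.exp(-((((A[:, None, :] - B[None, :, :]) ** 2) / theta).sum(-1)))
    C = k(X, X)                    # N x N covariance C_N
    c = k(np.atleast_2d(x), X)     # 1 x N cross-covariance c(x)^T
    Ci = np.linalg.inv(C)
    mu = float(c @ Ci @ Y)
    s2 = float(tau2 * (1.0 - c @ Ci @ c.T))   # c_theta(x, x) = 1 here
    return mu, s2

X = np.array([[0.2], [0.5], [0.9]])
Y = np.sin(10 * X[:, 0])
mu, s2 = gp_predict(np.array([0.5]), X, Y, theta=np.array([0.1]), tau2=1.0)
# at a design site the GP interpolates: mu = y_i and s2 = 0 (up to round-off)
```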

Unknown hyperparameters can be inferred by viewing Y ∼ N_N(0, τ² C_N) as a likelihood and maximizing its logarithm numerically. The log likelihood is

$$\ell = \log L = -\frac{N}{2} \log 2\pi - \frac{N}{2} \log \tau^2 - \frac{1}{2} \log |C_N| - \frac{1}{2\tau^2} \mathbf{Y}^\top C_N^{-1} \mathbf{Y}.$$

Setting the derivative with respect to τ² to zero gives the MLE τ̂² = N^{-1} Y^⊤ C_N^{-1} Y in closed form, which may be used to derive a profile/concentrated multivariate Student-t likelihood for θ. In the Bayesian setting, τ² may analytically be integrated out under an inverse-Gamma prior (see, e.g., Gramacy and Apley, 2015). Either way, numerical methods are required to learn appropriate lengthscale settings θ̂.
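A sketch of the resulting concentrated (profile) objective follows, with a crude grid search standing in for the numerical optimizer (illustrative Python; the grid, jitter, and test function are arbitrary choices, not part of the original text):

```python
import numpy as np

def neg_profile_loglik(theta, X, Y):
    """Negative concentrated log-likelihood for theta after plugging in
    tau2_hat = N^{-1} Y^T C_N^{-1} Y; additive constants are dropped."""
    N = len(Y)
    d2 = (((X[:, None, :] - X[None, :, :]) ** 2) / theta).sum(-1)
    C = np.exp(-d2) + 1e-8 * np.eye(N)         # small jitter for stability
    _, logdet = np.linalg.slogdet(C)
    tau2_hat = Y @ np.linalg.solve(C, Y) / N   # closed-form MLE of tau^2
    return 0.5 * (N * np.log(tau2_hat) + logdet)

rng = np.random.default_rng(0)
X = rng.uniform(size=(20, 1))
Y = np.sin(4 * np.pi * X[:, 0])
grid = [0.01, 0.05, 0.1, 0.5, 1.0]
theta_hat = min(grid, key=lambda t: neg_profile_loglik(np.array([t]), X, Y))
```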

The performance of GP surrogates can be visualized in Figure 2.1. The black solid line indicates the true sine function. The red solid line shows the predictive mean of the GP, which is close to the truth and interpolates all of the observations. The famous “sausage”-shaped predictive band is shown as pink shading, i.e., the predictive variance becomes higher as x moves away from X_N.

2.1.2 GP kernels

Gaussian process regression can be regarded as a special case of linear regression. Consider the following setup:

$$y_i = \mathbf{x}_i^\top \beta + \varepsilon_i, \quad i = 1, \ldots, N.$$

The classical linear regression model assumes ε_i iid∼ N(0, δ), where δ is constant. In matrix form, Y = X_N β + ε. If the error terms are not independent but jointly follow an MVN distribution, i.e., ε ∼ N_N(0, C_N), then Y ∼ N_N(X_N β, C_N), which is identical to the distribution under the GP prior.

Figure 2.1: GP posterior predictive distribution in terms of means, 2.5% and 97.5% quantiles.

As mentioned above, in the computer experiments literature the mean vector X_N β is commonly omitted, and the variation of the response surface is explained entirely by C_N. Thus, covariance functions and their properties are fundamental to Gaussian processes.

The covariance matrix is defined by covariance functions c(x_i, x_j), which are also called kernels. The kernel used in the illustration of Section 2.1.1 is called the Gaussian kernel, or double exponential kernel. Specifically, it belongs to the separable/anisotropic Gaussian family, since the lengthscales are parameterized separately in each dimension. If θ_p = θ for p = 1, ..., d, i.e., the correlation decays at the same rate in all dimensions, it becomes an isotropic kernel. Both separable and isotropic kernels are stationary, as they rely only on r = |x_i − x_j|, i.e., c(x_i, x_j) = c(r). Due to its double exponential nature, the Gaussian kernel is infinitely differentiable, which can be inappropriate for some non-smooth response surfaces.

The Matérn family is another common covariance function type, defined as

$$c_\nu(r) = \frac{2^{1-\nu}}{\Gamma(\nu)} \left( \frac{\sqrt{2\nu}\, r}{\theta} \right)^{\nu} K_\nu\!\left( \frac{\sqrt{2\nu}\, r}{\theta} \right),$$

where θ is the lengthscale and K_ν is a modified Bessel function of the second kind. The smoothness of Matérn kernels can be enhanced by increasing the parameter ν. When ν → ∞, the Matérn kernel becomes a member of the Gaussian family:

$$c_\nu(r) \to c_\infty(r) = \exp\left( -\frac{r^2}{2\theta^2} \right).$$
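The Matérn correlation is easy to evaluate with a Bessel-function library. The sketch below assumes the √(2ν) r/θ parameterization written above (conventions vary across references) and relies on SciPy's `kv`; at ν = 1/2 it recovers the exponential kernel exp(−r/θ):

```python
import numpy as np
from scipy.special import gamma, kv

def matern(r, theta, nu):
    """Matern correlation c_nu(r) = 2^(1-nu)/Gamma(nu) * z^nu * K_nu(z),
    with z = sqrt(2*nu) * r / theta."""
    r = np.atleast_1d(np.asarray(r, dtype=float))
    z = np.sqrt(2.0 * nu) * r / theta
    c = np.ones_like(z)            # c_nu(0) = 1 by continuity
    pos = z > 0
    c[pos] = 2.0 ** (1.0 - nu) / gamma(nu) * z[pos] ** nu * kv(nu, z[pos])
    return c

# nu = 1/2: K_{1/2}(z) = sqrt(pi/(2z)) e^{-z}, so c(r) = exp(-r/theta)
val = matern(0.3, theta=1.0, nu=0.5)[0]
```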

2.1.3 Surrogates for stochastic computer simulators

With the development of computing resources, stochastic computer simulators have gained more and more popularity (Kim and Nelson, 2006, Kleijnen and Van Beers, 2005, Yin et al., 2011). Stochastic computer experiments are capable of describing complicated systems with various sources of randomness. The data we collect may not only be noisy but also involve low signal-to-noise ratios or heteroskedastic variance. How to model and design stochastic computer experiments becomes a big challenge for statisticians.

For stochastic f with constant noise we can add a nugget term g to the diagonal of the covariance matrix to define K_N = C_N + Λ_N for Λ_N = g I_N, and take Y_N ∼ N_N(0, τ² K_N). This is equivalent to Y(x) = w(x) + ε, where w(x) ∼ GP with scale τ², i.e., W ∼ N_N(0, τ² C_N), and ε(x) iid∼ N(0, σ²). The predictive distribution is identical to Equations (2.2) except with C_N replaced by K_N. A visualization similar to Figure 2.1 is shown in Figure 2.2. This time, the predictive mean (red solid line) no longer interpolates the observations, but it is still close to the truth with such a limited number of samples.

However, assuming independent and constant noise is not realistic in some applications, and the nugget setting cannot handle input-dependent noise.

Figure 2.2: Predictions and associated 95% uncertainty intervals based on GP with nugget parameter.

Partition-based models can tackle heteroskedastic noise, see Gramacy and Lee (2008), but they do not do well when the signal-to-noise ratio is high, and the fitted mean surface is not necessarily smooth. In terms of separating signal from noise, Ankenman et al. and Yin et al. applied stochastic kriging (SK), which offers approximate methods that exploit large degrees of replication. However, the moment-based estimation they use requires a large amount of replication, which is not practical for computationally expensive simulators.

2.1.4 GP with replication

Replication, i.e., repeated observation at identical inputs, plays an important role in stochastic computer experiments. Replication can not only separate signal from noise, but also holds the potential for computational savings through a Woodbury trick (Harville, 1998). From now on, I use n to denote the number of unique design sites. Let X̄_n = {x̄_1, ..., x̄_n} and Ȳ_n = {ȳ_1, ..., ȳ_n} store the unique input locations and the observations averaged over replicates. Through a deduction similar to Equation (2.2), at N′ testing locations 𝒳, the predictive distribution Y(𝒳) | Y_N is Gaussian with

$$\text{mean} \quad \mu(\mathcal{X} \mid \mathbf{Y}_N) = c(\mathcal{X}, \bar{X}_n)\, K_n^{-1} \bar{\mathbf{Y}}_n, \qquad (2.3)$$
$$\text{and variance} \quad \Sigma(\mathcal{X} \mid \mathbf{Y}_N) = \hat{\tau}^2 \left[ c(\mathcal{X}, \mathcal{X}) - c(\mathcal{X}, \bar{X}_n)\, K_n^{-1}\, c(\mathcal{X}, \bar{X}_n)^\top \right].$$

In the above equations, K_n = C_n + Λ_n = C_n + g A_n^{-1}. C_n is the covariance matrix of the n unique design locations, defined under the same kernel/inverse-distance structure, i.e., C_n^{ij} = c_θ(x̄_i, x̄_j). A_n is a diagonal matrix with A_n^{ii} = a_i, the number of replicates at unique location x̄_i, so that Σ_{i=1}^n a_i = N. As the predictive equations (2.3) indicate, the computational expense of matrix inversion, which is the most time-consuming part of GP inference, is reduced from O(N³) to O(n³) without any approximation.
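A sketch of the replication-aware predictive equations (2.3) in Python (illustrative; the unique sites, replicate counts and hyperparameters below are made-up examples). The key point is that only the n × n matrix K_n = C_n + g A_n^{-1} is inverted, never an N × N one:

```python
import numpy as np

def gp_predict_rep(XX, Xbar, Ybar, a, theta, g, tau2):
    """Predictive mean and covariance of Eq. (2.3) from n unique sites
    Xbar with replicate counts a and replicate-averaged responses Ybar."""
    def k(A, B):
        return np.exp(-((((A[:, None, :] - B[None, :, :]) ** 2) / theta).sum(-1)))
    Kn = k(Xbar, Xbar) + g * np.diag(1.0 / a)   # n x n, so O(n^3) inversion
    c = k(XX, Xbar)
    Ki = np.linalg.inv(Kn)
    mu = c @ Ki @ Ybar
    Sigma = tau2 * (k(XX, XX) - c @ Ki @ c.T)
    return mu, Sigma

Xbar = np.array([[0.1], [0.5], [0.9]])     # n = 3 unique sites
a = np.array([3, 2, 4])                    # N = sum(a) = 9 total runs
Ybar = np.array([0.2, 0.8, 0.1])           # averages over replicates
mu, Sigma = gp_predict_rep(np.array([[0.5]]), Xbar, Ybar, a,
                           theta=np.array([0.2]), g=0.1, tau2=1.0)
```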

2.1.5 Heteroskedastic Gaussian process

Motivated by SK and ideas from the machine learning community, a fully likelihood-based inference framework called heteroskedastic Gaussian process (HetGP) modeling has been proposed for response surfaces with non-constant noise (Binois et al., 2018a). HetGP does not require a large amount of replication. By involving latent variance observations and modeling the noise surface with a second GP, HetGP gives a smooth noise surface over the entire region.

Specifically, they proposed freeing the diagonal elements of Λ_n under a sort of smoothness penalty. Let δ_1, δ_2, ..., δ_n denote latent nuggets, corresponding to the n ≪ N unique design locations. It is also important not to introduce latent δ_i in multitude at identical input locations x̄_i, which would introduce numerical instabilities to the inferential scheme. Place these latent nuggets diagonally in Δ_n and assign to them a structure similar to Y, but now encoding a prior on variances:

$$\Delta_n \sim \mathcal{N}_n\!\left( 0,\ \tau^2_{(\delta)} \left( C_{(\delta)} + g_{(\delta)} A_n^{-1} \right) \right).$$

C_{(δ)} is the covariance matrix of the n unique design locations, which is defined under a similar kernel with hyperparameters for the noise process; g_{(δ)} is a “nugget of nuggets” controlling the smoothness of the λ_i's relative to the δ_i's. Smoothed λ_i-values can be calculated by plugging Δ_n into the GP mean predictive equation (2.3):

$$\Lambda_n = C_{(\delta)}\, K_{(\delta)}^{-1}\, \Delta_n, \quad \text{where} \quad K_{(\delta)} = C_{(\delta)} + g_{(\delta)} A_n^{-1}. \qquad (2.4)$$

Parameters, including θ and τ² for both GPs, i.e., for the mean and the variance, may be estimated by maximizing the joint log likelihood with derivatives via fast library-based methods in time cubic in n. Software is available for R as hetGP on CRAN (Binois et al., 2018a). For implementation convenience, log Δ_n ∼ N_n(0, τ²_{(δ)}(C_{(δ)} + g_{(δ)} A_n^{-1})) is utilized instead.
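The smoothing step in Equation (2.4) is a GP mean prediction applied to the latent nuggets. A minimal sketch follows (illustrative Python; the latent values δ_i and hyperparameters are made up), showing that as g_(δ) → 0 the smoothed λ_i's recover the δ_i's:

```python
import numpy as np

def smooth_nuggets(Xbar, delta, a, theta_d, g_d):
    """Eq. (2.4): Lambda_n = C_(d) K_(d)^{-1} Delta_n with
    K_(d) = C_(d) + g_(d) A_n^{-1}, smoothing the latent nuggets delta."""
    d2 = (((Xbar[:, None, :] - Xbar[None, :, :]) ** 2) / theta_d).sum(-1)
    Cd = np.exp(-d2)                     # noise-process covariance C_(d)
    Kd = Cd + g_d * np.diag(1.0 / a)     # "nugget of nuggets" g_(d)
    return Cd @ np.linalg.solve(Kd, delta)

Xbar = np.array([[0.1], [0.5], [0.9]])
delta = np.array([0.1, 0.5, 0.2])        # latent (unsmoothed) nuggets
a = np.array([2, 3, 4])                  # replicate counts
lam = smooth_nuggets(Xbar, delta, a, theta_d=np.array([1.0]), g_d=1e-8)
```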

2.2 Computer experimental design

As mentioned at the beginning of this chapter, computer experiments can themselves be computationally demanding or time-consuming, which limits the number of runs that can be entertained (Santner et al., 2018). Thus, how to select the input settings at which the computer simulator is run and the corresponding responses are collected becomes important. A computer experimental design problem can be defined as follows. Let f : ℝ^d → ℝ denote an unknown function, generically, but standing in specifically for a computationally expensive computer model simulation. Limited by time or computing resources, we can only run the computer simulation at n sites X = {x_1, ..., x_n} in the d-dimensional input space D, and let Y = (y_1, ..., y_n)^⊤ collect outputs y_i = f(x_i), for i = 1, ..., n. The design problem is all about how to select X. The goal of a design varies among factor screening, emulation, model calibration and optimization.

2.2.1 Geometric designs

When little is known about the response surface, creating a design that fills the space is an intuitive strategy. Geometric criteria are usually used to measure space-filling performance. Here we only consider the design space D = [0, 1]^d. Denote the Euclidean distance between two design points x_i and x_j as d_ij = ‖x_i − x_j‖.

As the most straight-forward space-filling strategy, simple random designs generate design

X from Unif[0, 1]d. Due to the stochastic property, they are not guaranteed to have good space-filling properties, especially when sample size is small. To avoid the great uncertainty of simple random designs, Fang (1980) proposed the uniform design (UD) concept that allo- cates experimental points uniformly scattered on the domain by minimizing the discrepancy between the empirical distribution and uniform distribution density, see Fang et al. (2000).

To uniformly allocate design points over D, short pairwise distances are not desirable. From this perspective, Johnson et al. (1990) proposed the maximin-distance design, which attempts to maximize the smallest pairwise distance dij, i.e.,

X = argmax_{X⊂D} min_{i≠j} d_ij.

Morris and Mitchell (1995) showed that a maximin design can be created by minimizing the φp criterion, φp = [Σ_{k=1}^{K} J_k d_k^{−p}]^{1/p}, where d_k is one of the K unique pairwise distances in a design and J_k is the number of pairs at that distance.
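These geometric criteria are straightforward to compute. The sketch below (Python, assuming numpy and scipy; the dissertation's own tooling is R) evaluates φp over all pairwise distances, together with a crude random search that minimizes it; as p grows, 1/φp approaches the minimum pairwise distance, so minimizing φp approximates maximin.

```python
import numpy as np
from scipy.spatial.distance import pdist

def phi_p(X, p=50):
    # phi_p = [sum over pairs d^{-p}]^{1/p}; summing over every pair
    # directly absorbs the multiplicities J_k of tied distances
    d = pdist(X)  # the n*(n-1)/2 pairwise Euclidean distances
    return np.sum(d ** (-float(p))) ** (1.0 / p)

def maximin_by_phi_p(n, d, p=50, n_draws=2000, seed=0):
    # crude stochastic search: keep the random design minimizing phi_p
    rng = np.random.default_rng(seed)
    return min((rng.random((n, d)) for _ in range(n_draws)),
               key=lambda X: phi_p(X, p))
```

For the 2 × 2 grid of corners plus the centre point, the minimum pairwise distance is √2/2 ≈ 0.707, and 1/φp with a large p recovers it approximately.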

Another perspective (Johnson et al., 1990) on spreading out the points is to make the maximum distance from any point in the input space to its closest design point as small as possible, i.e., minimize the minimax-distance criterion:

X = argmin_{X⊂D} max_{x∈D} min_{xi∈X} ||x − xi||.

This is called the minimax-distance design. Recent work of Mak and Joseph (2018) developed efficient algorithms for generating minimax designs via particle swarm optimization and clustering.

The space-filling property of maximin and minimax designs does not hold in marginal subspaces. Good projection properties are desirable when doing further analysis over the more influential factors. Latin hypercube sampling (LHS) (Mckay et al., 1979) overcomes the poor projection properties that maximin and minimax designs have: any projection of a LHS design is still a space-filling design. This property is essential in cases where some inactive input variables are dropped in follow-up analysis. A secondary design criterion can be applied after LHS, as in maximin-LHS (Morris and Mitchell, 1995).
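A basic LHS can be generated in a few lines. The following is a hypothetical Python sketch of the standard stratify-permute-jitter construction, not any of the cited implementations (in R, packages such as lhs are the usual tools):

```python
import numpy as np

def latin_hypercube(n, d, seed=None):
    # one point per equal-width stratum in every one-dimensional margin:
    # independently permute the n strata per dimension, jitter within strata
    rng = np.random.default_rng(seed)
    perms = np.argsort(rng.random((n, d)), axis=0)  # one permutation per column
    return (perms + rng.random((n, d))) / n
```

By construction, each one-dimensional projection hits every interval [k/n, (k+1)/n) exactly once, which is the uniformity property discussed above.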

Due to the good properties of LHS, several variants have been proposed. Handcock (1991) brought up the idea of cascading LHS, which deploys a 2/3-level LH design centered around each design point of a big LHS, ensuring there are design points close together. Tang (1993) introduced orthogonal array-based LHS to guarantee a more uniform design in the subspace spanned by the effective factors. Owen (1994) showed that orthogonal LHS can further reduce the variance of Monte Carlo integrals. Lin et al. (2010, 2009) contributed to the construction of orthogonal LH designs and solved the sample size limitation in previous algorithms. Qian (2009) developed the nested Latin hypercube design (NLHD), which contains smaller LH designs as subsets, in order to estimate the means of deterministic f at different fidelity levels.

2.2.2 Model-based design

Model-based design methods take advantage of information obtained from a pre-assumed statistical model, i.e., a GP in this context. In Equation (2.3), the uncertainty Σ(X | YN) is a quadratic function of distance to nearby training data locations XN. For this reason, space-filling designs are also good choices under GP surrogate assumptions.

Once the data is collected, a GP surrogate is usually fitted in place of the computationally expensive computer experiment. Assuming the surrogate model is known, design criteria can be built upon it. In information theory, entropy is a general measure of the unpredictability of a state. The entropy of a density p(x) is defined as

H(X) = − ∫_X p(x) log p(x) dx.

Maximum entropy (maxent) design was proposed by Shewry and Wynn (1987) for spatial models. Under the MVN assumption of GPs, maximizing entropy is equivalent to maximizing the determinant of the covariance matrix, |Kn|; see Santner et al. (2003, Chapter 6).
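Under the MVN assumption the maxent criterion is just a log-determinant, so a naive search is easy to sketch. The Python fragment below assumes an isotropic Gaussian kernel with unit scale and a hypothetical lengthscale θ = 0.5; the jitter and candidate-set search are implementation conveniences, not part of the cited method:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def log_det_K(X, theta=0.5, jitter=1e-8):
    # log |K_n| for an isotropic Gaussian kernel with unit scale
    K = np.exp(-squareform(pdist(X, "sqeuclidean")) / theta)
    return np.linalg.slogdet(K + jitter * np.eye(len(X)))[1]

def maxent_search(n, d, theta=0.5, n_draws=500, seed=0):
    # keep the random candidate design with the largest log-determinant
    rng = np.random.default_rng(seed)
    return max((rng.random((n, d)) for _ in range(n_draws)),
               key=lambda X: log_det_K(X, theta))
```

Designs with clumped points make Kn nearly singular and score poorly, so high-scoring designs are spread out, consistent with maxent's space-filling flavor.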

Another model-based design minimizes predictive uncertainty; see Sacks et al. (1989). To control predictive error over the input space, the integrated mean-squared predictive error is employed, defined as

IMSPE[ŷ] = ∫_D E{(ŷ(x) − Y(x))²} dx.

A weight function ω(x) can be incorporated in the integral to give more weight to regions where prediction accuracy is most important.
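IMSPE is rarely available in closed form for arbitrary D, but a Monte Carlo approximation is immediate. Below is a sketch assuming a zero-mean GP with isotropic Gaussian kernel and known hyperparameters (τ² = 1 and a hypothetical θ = 0.5); the uniform reference sample plays the role of ω(x) ≡ 1, and the jitter is a numerical convenience:

```python
import numpy as np

def pred_var(Xtest, X, theta=0.5, jitter=1e-8):
    # GP posterior predictive variance, k(x, x') = exp(-||x - x'||^2 / theta)
    def k(A, B):
        return np.exp(-((A[:, None, :] - B[None, :, :]) ** 2).sum(-1) / theta)
    Kinv = np.linalg.inv(k(X, X) + jitter * np.eye(len(X)))
    kx = k(Xtest, X)
    return 1.0 + jitter - np.einsum("ij,jk,ik->i", kx, Kinv, kx)

def imspe_mc(X, n_ref=4000, seed=0, **kw):
    # Monte Carlo approximation of IMSPE over D = [0, 1]^d, uniform weight
    rng = np.random.default_rng(seed)
    return pred_var(rng.random((n_ref, X.shape[1])), X, **kw).mean()
```

Because GP predictive variance is non-increasing as design points are added, nested designs give ordered IMSPE values, which is a useful sanity check.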

Model-based designs usually require known GP hyperparameters for criterion calculation, but knowing the hyperparameters before collecting data is not realistic. This chicken-or-egg problem can be addressed by first fitting a meta-model to a small model-free design, then sampling new design sites sequentially, i.e., sequential design, which is covered in Section 2.2.3.

2.2.3 Sequential design

Sequential design strategies answer the question of how to select follow-up design points in order to improve global fitting or optimization. By allocating samples sequentially, we avoid fixing the sample size for a target function about which we have no prior information, and can potentially save runs. Sequential design approaches can be further divided into adaptive and non-adaptive designs.

Non-adaptive designs are usually based on geometric criteria. They deploy the space-filling approaches mentioned in Section 2.2.1 and extend them in a sequential manner. For example, a sequential LHS can be selected by optimizing space-filling criteria constrained by one-dimensional distance thresholds (Dam et al., 2005).

Adaptive designs are also called active learning. They use previous samples and metamodels to help select subsequent design sites. As the name indicates, the subsequent samples adapt to the design purpose and the properties of the target f. Considering an initial design X0 = {x1, . . . , xn}, an adaptive sequential design scheme is described in Algorithm 1.

An effective criterion for selecting the new point is the key to active learning. Because of the uncertainty quantification property of GP surrogates, variance-based criteria are widely used to reduce predictive error. So-called active learning MacKay (ALM) (MacKay, 1992) selects the new point by maximizing the predictive variance σ²(x). Extending the idea to the

Algorithm 1 Adaptive sequential design framework

Init: X = X0, Y = Y0
while the limits of time and computation are not exceeded do
    Fit a surrogate model with X and Y
    Based on a chosen design criterion, obtain the next design point xn+1
    Evaluate the unknown function at xn+1 to get yn+1 = f(xn+1)
    Update the surrogate model with the augmented dataset X = (X, xn+1) and Y = (Y, yn+1)
end while
Return: X, Y
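A minimal Python sketch of Algorithm 1, instantiating the design criterion with ALM (maximize predictive variance over a random candidate set) under an assumed, fixed lengthscale; a real implementation would refit the hyperparameters inside the loop:

```python
import numpy as np

def adaptive_design(f, X0, n_total, theta=0.5, n_cand=1000, seed=0):
    # Algorithm 1: fit GP, pick the acquisition winner, evaluate, augment
    rng = np.random.default_rng(seed)
    X = np.array(X0, dtype=float)
    y = np.array([f(x) for x in X])
    k = lambda A, B: np.exp(
        -((A[:, None, :] - B[None, :, :]) ** 2).sum(-1) / theta)
    while len(X) < n_total:
        Kinv = np.linalg.inv(k(X, X) + 1e-8 * np.eye(len(X)))
        cand = rng.random((n_cand, X.shape[1]))             # candidate set
        kx = k(cand, X)
        var = 1.0 - np.einsum("ij,jk,ik->i", kx, Kinv, kx)  # ALM criterion
        x_new = cand[np.argmax(var)]
        X = np.vstack([X, x_new])                           # augment the data
        y = np.append(y, f(x_new))
    return X, y
```

Swapping the `var` line for another acquisition (e.g., an integrated-variance estimate) recovers the other criteria discussed below.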

entire domain, active learning Cohn (ALC) (Seo et al., 2000) aims to minimize the integrated variance ∫_{x∈D} σ²(x) dx. Sacks et al. consider the integrated mean squared predictive error (IMSPE) over the entire design space, in order to improve global fitting. The maximum entropy design criterion (Shewry and Wynn, 1987) can also be applied alone or together with gradient information (Morris et al., 1993) in the sequential design stage. Essentially, maximizing the determinant of the covariance matrix Kn+1 is equivalent to maximizing predictive variance,

because one can show log |Kn+1| = log |Kn| + log σ²(xn+1). Thus, ALM and maxent design are equivalent in this scenario.
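This identity is a Schur-complement fact and is easy to check numerically. The sketch below uses an isotropic Gaussian kernel with an arbitrary lengthscale and a small diagonal jitter (so σ²(xn+1) includes the jittered diagonal entry); the values and seed are purely illustrative:

```python
import numpy as np

def k(A, B, theta=0.5):
    # isotropic Gaussian kernel with unit scale
    return np.exp(-((A[:, None, :] - B[None, :, :]) ** 2).sum(-1) / theta)

rng = np.random.default_rng(0)
X, x_new = rng.random((6, 2)), rng.random((1, 2))
jit = 1e-8
Kn = k(X, X) + jit * np.eye(6)
Kn1 = k(np.vstack([X, x_new]), np.vstack([X, x_new])) + jit * np.eye(7)
kx = k(x_new, X)
s2 = (1.0 + jit) - kx @ np.linalg.solve(Kn, kx.T)  # predictive variance at x_new
lhs = np.linalg.slogdet(Kn1)[1]
rhs = np.linalg.slogdet(Kn)[1] + np.log(s2.item())
# lhs and rhs agree up to floating-point error
```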

For non-stationary GPs, Gramacy and Lee apply treed sequential maximum entropy designs to find a spaced-out candidate set, then select subsequent design points by ALM and ALC. Binois et al. proposed a purely sequential design approach minimizing IMSPE for stochastic simulators with heteroskedastic noise. Gradient-based acquisition functions can also be employed to reduce predictive error (Erickson et al., 2018, Han et al., 2013). If change in the response surface is of interest, having more samples in high-gradient regions is beneficial.

The above variance-based criteria all focus on global fitting via exploration. If the goal is global optimization, sequential design methods should balance exploration with local exploitation. Expected improvement (EI) (Jones et al., 1998, Notz and Lam, 2008) is a commonly used criterion of this kind. Sequential designs with the goal of optimization are reviewed in Section 2.2.4.

2.2.4 Bayesian Optimization

Many optimization problems in machine learning deal with black-box objective functions, i.e., the analytical expression for f is unknown. Evaluation of the function is restricted to sampling at an input and obtaining a possibly noisy response, and each evaluation can be expensive or time-consuming. This kind of problem can be regarded as an adaptive sequential design problem with a global optimization target.

The most famous method is the expected improvement algorithm (Jones et al., 1998, Notz and Lam, 2008). Based on GP posterior predictive equations described by mean µ(x) and standard deviation σ(x), the next point is selected by numerically optimizing EI(x):

EI(x) = (µmin − µ(x)) Φ((µmin − µ(x))/σ(x)) + σ(x) φ((µmin − µ(x))/σ(x)),    (2.5)

where µmin = minx µ(x), and Φ and φ are the standard Gaussian cdf and pdf, respectively. The procedure stops when the computational budget is exhausted or an optimality/precision criterion is met. Recently, EI has also been applied to optimization with heterogeneous noise, including the adaptive sequential kriging optimization approach of Huang et al. (2006), the expected quantile improvement (Picheny et al., 2013a) and the minimum quantile criterion (Picheny et al., 2013b).
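Equation (2.5) is cheap to evaluate pointwise. A Python sketch for minimization, guarding against σ(x) → 0 at training inputs (the guard is an implementation convenience, not part of the cited formula):

```python
import math

def expected_improvement(mu, sigma, mu_min):
    # EI(x) = (mu_min - mu)*Phi(z) + sigma*phi(z), with z = (mu_min - mu)/sigma
    if sigma <= 0.0:
        return max(mu_min - mu, 0.0)  # degenerate case: no predictive spread
    z = (mu_min - mu) / sigma
    Phi = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))        # standard normal cdf
    phi = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)  # standard normal pdf
    return (mu_min - mu) * Phi + sigma * phi
```

EI is largest where either the predicted mean is low (exploitation) or the predictive spread is high (exploration), which is exactly the balance described above.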

For Bayesian optimization, other acquisition functions include gradient, entropy and predictive entropy; see Frazier (2018) for a more comprehensive review. Bayesian optimization techniques have also been developed for optimizing functions with black-box constraints (Gramacy et al., 2016, Picheny et al., 2016).

2.2.5 Batch sequential design

Based on the number of new design points added in each iteration, sequential designs can be divided into pure sequential designs and batch sequential designs. Most sequential design methods adopt single-point selection, since they select the new point by solving an auxiliary optimization problem (Liu et al., 2018). Instead of a single point, batch sequential design methods sample a batch of M > 1 new points in each design step. This naturally favors parallel deployment of subsequent runs, which can greatly improve computational efficiency in expensive computer experiments. However, adding multiple points per iteration is a more challenging task. We cannot simply evaluate and rank candidate sites with the objective function used in pure sequential designs, because doing so would cause points in the new batch to overlap. Adapting the old criteria to optimize M points simultaneously is one solution, but the dimension of the auxiliary optimization problem becomes M times greater than before.

Non-adaptive batch sequential designs are largely based on space-filling criteria (e.g. Duan et al., 2017, Loeppky et al., 2010, Williams et al., 2011). For adaptive batch designs, Ginsbourger et al. extended canonical EI to a multipoint version and assessed the performance of a sequential optimization procedure which maximizes the usual EI at each iteration. Chevalier et al. (2014) provided practical implementations of multipoint kriging-based infill sampling criteria. Gramacy and Lee achieved asynchronous evaluations in parallel by hybridizing treed sequential maximum entropy designs with ALM/ALC designs. Under GP surrogate modeling, Erickson et al. select the new batch by maximizing the expected value of the gradient norm squared; they take a space-filling design as the candidate set to reduce the computational cost of the optimization.

Having less cumulative information and facing a harder optimization problem, batch sequential designs do not necessarily perform better than pure sequential designs. But the merit of this framework is that subsequent evaluations are parallelizable. Batch size M is usually selected based on available computing resources (number of processors).

Chapter 3

Distance-distributed Design for Gaussian Process Surrogates

Computer simulation experiments are widely used in the applied sciences to simulate time-consuming or costly physical, biological, or social dynamics. Depending on the dynamics being simulated, these experiments can themselves be computationally demanding, limiting the number of runs that can be entertained. Design and meta-modeling considerations have spawned a research area at the intersection of spatial modeling, optimization, sensitivity analysis, and calibration. Santner et al. (2003) provide an excellent review.

Gaussian process (GP) surrogates, originally for interpolating data from deterministic computer simulations (Sacks et al., 1989), have percolated to the top of the hierarchy for many meta-modeling purposes. GP surrogates are fundamentally the same as kriging from the spatial statistics literature (Matheron, 1963), but generally applied in higher dimensional (i.e., > 2d) settings. They are preferred for their simple, partially analytic, nonparametric structure. GPs' out-of-sample predictive accuracy and coverage properties are integral to diverse applications such as Bayesian optimization (BO, Jones et al., 1998), calibration (Higdon et al., 2004, Kennedy and O'Hagan, 2001), and input sensitivity analysis (Saltelli et al., 2008). Although there are many variations on GP specification, Chen et al. (2016) nicely summarize how such nuances often have little impact in practice.

On the other hand, Chen et al. cite experimental design as playing an out-sized role. Despite


GPs’ elevation to “canonical” status as surrogates, there has not been quite the same degree of confluence in how to design a computer experiment for the purpose of such modeling. In part this is simply a consequence of different goals emitting different criteria for valuing, and thus selecting, inputs. An exception may be the general agreement that it is sensible, if possible, to proceed sequentially, either one point at a time or in batches. An underlying theme for static (all-at-once) design, or for seeding a sequential design, has been to seek space-fillingness, where the selected inputs are spread out across the study space. For a nice review, see Pronzato and Müller (2011).

There are many ways in which a design might be considered space-filling. Maximin-distance and minimax-distance design (Johnson et al., 1990) are two common approaches based on geometric criteria. A maximin design attempts to make the smallest distance between neighboring points as large as possible; conversely, minimax attempts to minimize the maximum distance. A common variation on maximin is φp (Morris and Mitchell, 1995),

φp = [ Σ_{k=1}^{K} J_k d_k^{−p} ]^{1/p},

where d_k is one of the K unique pairwise distances in a design and J_k is the number of pairs at that distance. In most applications, K = (n choose 2) and all J_k = 1. Designs obtained by minimizing φp are actually maximin for all p, i.e., the smallest distance min_k d_k is maximized. At p → ∞ the equivalence is immediate; φp designs for smaller p have greater spread in the smaller distances (d_(−k)).

Alternatively, one may desire a design that spreads points evenly across the range of each individual input, i.e., where projections on each dimension are still space-filling. Maximin and minimax designs do not produce such an effect; in fact, they can be pathologically bad in this regard. Latin hypercube sampling (LHS, Mckay et al., 1979) can guarantee this one-dimensional uniformity property. For a nice review of LHS and other space-filling designs for computer experiments, see Lin and Tang (2015).

Space-filling designs intuitively work well when prediction accuracy is of primary interest, seeking coverage everywhere one might want to predict. However, it is easy to show [as we do in Section 3.2] that space-filling designs are inefficient for learning GP hyperparameters, discussed in further detail in Section 3.1. It turns out that a random uniform design is actually better than maximin, φp and LHS in that setting, echoing a rule-of-thumb from variogram estimation with lattice data in geostatistics (Zhao and Wall, 2004).

Considering that GP predictive prowess depends upon hyperparameterization, good prediction results must tacitly depend upon fortuitously chosen hyperparameters. If good settings are indeed known, then model-based design represents an attractive alternative to (model-free) space-filling design. Example criteria include maximizing the entropy between prior and posterior (maximum entropy design), minimizing the integrated mean-squared prediction error (IMSPE, Santner et al., 2003, Chapter 6), and Fisher information (Zimmerman, 2006). These lead to nice sequential extensions, alternating between design and learning stages. However, such schemes can suffer when initialized poorly. Seemingly optimal choices of seed design or hyperparameters can lead to pathologically poor performance.

Here we propose a new class of designs that attempts to resolve that chicken-or-egg problem. GP correlation structures are typically built upon scaled pairwise distance calculations, so we hypothesize that certain sets of pairwise distances offer a more favorable basis for estimating those scales: so-called GP lengthscale hyperparameters. The spirit of our study is similar to that of Morris (1991), but we take a more empirical approach and ultimately provide a message that is more upbeat. Quite simply, we observe the empirical distribution of pairwise distances of random designs, which are better than space-filling ones for the purpose of lengthscale estimation. We then parameterize those distributions within the Beta(α, β) family, and propose a numerical optimization scheme to tune (α, β) to design size n and input dimension d. In this way, our methodology can be seen as a more aggressive and constructive variation on Zhao and Wall's study for variograms.

Despite sacrificing positional space-fillingness for relative distance-fillingness in order to target hyperparameter estimation, we show that "betadist" designs still perform favorably in prediction exercises. Inspired by Morris and Mitchell (1995)'s hybridization of LHS and maximin, we propose hybridizing LHS with betadist designs to strike a balance between space- and distance-filling toward even more accurate prediction.

The remainder of the chapter is organized as follows. In Section 3.1 we review GP modeling and design details pertinent to our methodological contribution. Section 3.2 demonstrates how space-filling designs fall short in certain respects, and proposes distance-based remedies based on reverse engineering qualities of the best random designs. Section 3.3 explores hybrids of these betadist designs with LHS. Illustrative examples and empirical comparisons are provided throughout. Section 3.4 provides a comprehensive empirical validation in two disparate sequential design settings, where betadist, LHS hybrids and comparators are used to build initial/seed designs.

3.1 Setup and related work

Here we review essentials as a means of framing our contributions, establishing notation, and connecting to related work on design and modeling for computer experiments.

3.1.1 Gaussian Process surrogates

Let f : R^d → R denote an unknown function, generically, but standing in specifically for a computationally expensive computer model simulation. There is interest in limiting the evaluation of f, so one designs an experimental plan of runs with the aim of fitting a meta-model, e.g., a Gaussian process (GP), which can be used as a surrogate in lieu of future expensive evaluations. Let X = {x1, . . . , xn} denote the chosen d-dimensional design, and

let Y = (y1, . . . , yn)^⊤ collect outputs yi = f(xi), for i = 1, . . . , n.

Here we make the common simplifying assumption that the computer model, f, is deterministic. In this work, we focus on the isotropic Gaussian kernel. The setup, inference and implementation details can be found in Section 2.1.1. Although we assume this structure throughout for simplicity, we see no reason why our proposed methodology (which emphasizes design, not modeling) could not be extended to other correlation families, or to the stochastic (f + ε) setting via additional hyperparameters.

3.1.2 Thinking about designs for GPs

The prediction equations (2.2) suggest a space-filling training design for X since σ2(x), for testing x, is quadratically related to distances to nearby xi locations through k(x). However that tacitly assumes the hyperparameters, particularly the lengthscale θ, are known. Where is a good θ supposed to come from? While we acknowledge that it is sometimes possible to intuit reasonable values or ranges for θ, based on knowledge of the underlying dynamics being modeled, such cases are rare in practice, and useless as a default modus operandi, e.g., in software. Thus our presumption is that θ must be learned from data, which requires a design. Intuitively, a space-filling design is poor for such purposes since its deliberate inability to furnish short distances biases inference toward longer lengthscales.

Sequential design, iterating between design and learning, has been suggested as a remedy. Yet space-filling design is still common in initial stages. For example, Tan (2013) writes "minimax designs are intended to be initial designs for computer experiments, which are almost always sequential in nature". While we agree with the spirit of that statement, we disagree that spreading out the points is the best way to seed this process. The reason is that subsequent sequential selections are usually model-based, e.g., via σ2(x), and thus hyperparameter-sensitive. Note that in sequential application, both IMSPE and maximum entropy-based designs are about predictive variance. The former minimizes integrated variance; the latter maximizes it directly. One must be careful not to introduce a feedback loop where sequential decisions reinforce bad hyperparameters.

One way out of that vicious cycle is to utilize geometric rather than model-based criteria for sequential selection, e.g., with cascading LHSs (Lin et al., 2010). However, if the design goal is not directly prediction-based, such as in BO (Jones et al., 1998), that approach is clearly inefficient. Plus in the BO literature, regularity conditions underlying the theory for convergence (to global optima) insist on fixed hyperparameterization. This is specifically to avoid pathological settings arising from feedback between sequential acquisition and inference calculations (Bull, 2011).

Perhaps our main thesis is that initial design for hyperparameter learning is paramount to obtaining robust (good) behavior in repeated application. While some space-filling designs are better than others in this context, we observe that it is important to be filling in a different sense. Inference for hyperparameters via the likelihood involves pairwise inverse distances xi − xj through Kij. Therefore, it could help to be more filling in that dimension. As we show in Section 3.2, simple random uniform designs are actually better than the typical maximin and LHS alternatives, sometimes substantially so. Intuitively, this is because random designs lead to a less clumpy, more unimodal, distribution of relative distances compared to maximin, for example. [See Figure 3.3 and surrounding discussion.] Based on the outcome of that study, we speculated that having a uniform distribution of such pairwise distances—as opposed to uniform in position—would fare even better.

That intuition turned out to be incorrect. However initial investigations pointed to a promis- ing class of alternatives, targeting a more refined choice of desirable pairwise distance dis- tributions. Although the strategy we propose imminently is novel in the context of design and analysis of computer simulation experiments, it is not without precedent in the spatial statistics literature, where variogram-based inference is, historically, at least as common as likelihood-based methods (see, e.g., Cressie, 1985, Russo, 1984). Out of that literature came the rule-of-thumb that at least thirty pairs of data points should populate certain distance strata. Morris (1991) subsequently revised that number upwards, accounting for spatial correlations which devalue information provided by nearby pairs.

The spirit of our contribution is similar to these works, although we shall make no recommendations about design size. Suggestions along these lines in the computer experiments literature, such as n = 10d (Loeppky et al., 2009), have been met with mixed reviews, never mind that the nuance of arguments behind that particular suggestion is often forgotten. Instead, presuming small fixed (initial) design sizes, we target the search for coordinates with desirable qualities for lengthscale estimation. Our first idea ignores position information entirely, focusing expressly on pairwise distances. We later revise that perspective to hybridize with LHS and acknowledge that a degree of space-fillingness may be desirable when the over-arching modeling goal is oriented toward prediction.

3.2 Better than random

Consider the following simple experiment in the input space [0, 1]^d, for d = 2, 3, 4, 5, 6, taken in turn. For thirty equally spaced "true" lengthscales θ(t) ∈ (0.1, √d], for t = 1, . . . , 30, we generate i = 1, . . . , 1000 designs X(t,i) of size n = 2^(d+1) and simulate Y(t,i) ∼ N(0, Kn). Entries of Kn are calculated as in Equation (2.1) via the rows of X(t,i) and hyperparameters τ² = 1 and θ(t).¹ Several design criteria are discussed shortly. For each (t, i), MLEs θ̂(t,i) are calculated from data (X(t,i), Y(t,i)). Finally, we collect average squared discrepancies between estimated and true lengthscales via logMSEt = log{ Σ_{i=1}^{1000} (θ̂(t,i) − θ(t))² }.
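A single (t, i) cell of this experiment can be sketched as follows, with the concentrated (τ² profiled out) log-likelihood maximized over a θ grid. Details such as the grid resolution and jitter are illustrative choices, not those of the actual study:

```python
import numpy as np

def neg_conc_loglik(theta, X, y, jitter=1e-8):
    # negative concentrated GP log-likelihood (up to constants), tau^2 profiled out
    n = len(y)
    D2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    K = np.exp(-D2 / theta) + jitter * np.eye(n)
    tau2_hat = y @ np.linalg.solve(K, y) / n
    return 0.5 * (n * np.log(tau2_hat) + np.linalg.slogdet(K)[1])

rng = np.random.default_rng(0)
d, n, theta_true = 2, 16, 0.4
X = rng.random((n, d))                                 # one random design
D2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-D2 / theta_true) + 1e-8 * np.eye(n)
y = np.linalg.cholesky(K) @ rng.standard_normal(n)     # Y ~ N(0, K_n)
grid = np.linspace(0.05, np.sqrt(d), 30)               # candidate lengthscales
theta_hat = grid[np.argmin([neg_conc_loglik(t, X, y) for t in grid])]
```

Repeating this over many designs X and taking log mean squared error of θ̂ against θ(t) reproduces the quantity plotted below; with n = 16 any single θ̂ can be far from the truth, which is exactly the variability the experiment averages over.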

[Figure: left panel titled "d=3, n=16" plots log(MSE) versus true theta for comparators maximin, minphi2, lhs, random, unifdist, beta and lhsbeta; right panel titled "De-trending" plots standardized residuals versus true theta.]

Figure 3.1: logMSEs from design experiment and de-trending surface.

As an example of the logMSEs obtained, the left panel of Figure 3.1 shows the (d = 3, n = 16) case. The first thing to notice in that plot is that as θ(t) increases so does logMSEt, for all design methods. Apparently, it is "harder" to accurately estimate lengthscales θ as they become longer. Harder is in quotes because this metric obscures the relative performance of the design methods, although some consistently stand out as worse (maximin/black circles) or better (beta/pink squares or lhsbeta/yellow squares) than others. To level the playing field for subsequent analysis, we calculated standardized residuals using a de-trending surface estimated from all of the dots, taken together. To cope with the outliers we fit a heteroskedastic Student-t GP as described by Chung et al. (2018) and implemented in the hetGP package (Binois and Gramacy, 2018). Section 3.2.2 provides further details on our use of hetGP in this context. Standardized residuals (rt = (logMSEt − µt)/σt, with µt and σt from hetGP) are shown in the right panel of the figure.

¹Here we consider an isotropic kernel, i.e., θ1 = · · · = θd = θ.

Figure 3.2 shows boxplots of these standardized logMSEs, marginalizing over the θ(t)s, for all five experiments d ∈ {2, 3, 4, 5, 6}. The number written on each boxplot resides at the position of the mean of that comparator, and indicates the relative rank of that mean. To help quantify relative comparisons, the final panel provides the outcome of pairwise paired t-tests, with pairing determined by adjacent ranks: best vs. second best, etc. First consider the "Common designs" block, including boxplots of logMSEs for maximin, minphi2 (φ2), LHS and random designs. Although the final panel does not include a p-value for LHS or random vs. maximin when d = 2, 3, because neither is ranked adjacently with maximin, it is quite clear these beat maximin, which consistently beats minphi2. LHS and random, on the other hand, offer quite similar results.

Observe that the four “Common designs” follow a similar ranking for all d ≤ 5. However when d = 6 maximin and minphi2 are better than LHS and random. This happens because maximin’s (and φp’s) pathologies are partly corrected in higher dimension. These designs push sites to the corners of the input hyperrectangle. As dimension grows the diversity of distances between corners increases. This helps MSE, but only coincidentally. Deliberate diversity via unifdist and betadist is still better.

The outcome of this experiment, including just those four common designs as comparators,

[Figure: five panels of standardized log(MSE) boxplots, one per input dimension — "log(MSE) comparison" for (d=2, n=8), (d=3, n=16), (d=4, n=32), (d=5, n=64) and (d=6, n=128) — each divided into "Common designs", "Distance designs" and "Hybrid" blocks over methods maximin, minphi2, lhs, random, unifdist, beta and lhsbeta. The bottom-right panel tabulates lower-tail paired t-test p-values by rank:

rank   2d        3d        4d        5d        6d
2      9.78e-4   0.406     2.58e-4   2.90e-5   5.58e-3
3      1.29e-4   7.92e-8   1.16e-7   4.61e-7   0.213
4      0.388     0.151     7.14e-3   0.134     9.03e-8
5      3.02e-3   0.287     0.415     0.446     6.64e-5
6      2.06e-5   1.08e-7   5.35e-4   3.51e-9   0.261
7      1.08e-5   0.0410    4.72e-6   1.06e-6   0.123 ]
Figure 3.2: Standardized logMSE boxplots over thirty gridded θ(t) values for seven comparators using n = 2^(d+1) over input dimension d ∈ {2, 3, 4, 5, 6}. The comparators are described in the text. Two outlying standardized log MSE values were clipped by the y-axes to enhance boxplot viewing: random (d = 4) at 10.9 and LHS (d = 6) at 17.4. The bottom-right panel provides p-values for lower-tail paired t-tests comparing adjacent performers as ranked by their mean logMSE from best (top) to worst (bottom).

sparked our search for alternatives. It is perhaps surprising that a purely random design is at least as good for hyperparameter estimation as more thoughtful alternatives like maximin and LHS. The following subsections describe our journey towards improved designs, ultimately outlining details behind the other comparators in Figure 3.2.

3.2.1 Uniform to beta designs

Intuitively, random and LHS designs are better than maximin for lengthscale (θ) inference because they result in a less adversarial distribution of pairwise distances. Maximin designs are calculated to ensure there are no small pairwise distances. Consequently, the distance distribution is multimodal: there are many distances near that minimum, with the rest occurring at "lower harmonics" (multiples of that minimal distance). Figure 3.3 offers a visualization. Random and LHS designs do not preclude small relative distances, although the latter does enforce a degree of uniformity in position. Both tend to yield distance distributions which are unimodal. Figure 3.3 demonstrates this for a subset of random designs, which will be discussed in more detail momentarily. The situation is similar for LHS, which we shall revisit in Section 3.3.

Algorithm 2 MC calculation of size n in [0, 1]^d targeting distance distribution F.

Init: Fill X with a random design of size n, i.e., x_i ~ Unif[0, 1]^d iid, i = 1, ..., n.
for s = 1, ..., S do
    Select an index i ∈ {1, ..., n} at random.
    Generate x'_i ~ Unif[0, 1]^d.
    Propose new design X' as X with x_i swapped with x'_i.
    if KSD(X', F) < KSD(X, F) then
        x_i ← x'_i in the i-th row of X, i.e., accept X ← X'.
    end if
end for
Return: n × d design X.

Chapter 3. Distance-distributed Design for Gaussian Process Surrogates 32
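Algorithm 2 is easy to prototype. Below is a minimal sketch in Python (the chapter's own implementation is in R, using a stripped-down `ks.test`; the function names here are ours), shown targeting a uniform distance distribution on (0, √d], as in unifdist:

```python
import math, random

def pairwise_dists(X):
    # sorted pairwise Euclidean distances among the rows of X
    n = len(X)
    return sorted(math.dist(X[i], X[j])
                  for i in range(n) for j in range(i + 1, n))

def ksd(X, cdf):
    # Kolmogorov-Smirnov distance between the empirical CDF of X's
    # pairwise distances and a target CDF
    d = pairwise_dists(X)
    m = len(d)
    return max(max(abs((k + 1) / m - cdf(x)), abs(k / m - cdf(x)))
               for k, x in enumerate(d))

def dist_design(n, d, cdf, S=1000, seed=None):
    # stochastic swap search (Algorithm 2): greedily accept row
    # replacements that reduce KSD against the target distribution
    rng = random.Random(seed)
    X = [[rng.random() for _ in range(d)] for _ in range(n)]
    best = ksd(X, cdf)
    for _ in range(S):
        i = rng.randrange(n)
        old = X[i]
        X[i] = [rng.random() for _ in range(d)]  # propose a new row
        prop = ksd(X, cdf)
        if prop < best:
            best = prop       # accept
        else:
            X[i] = old        # reject: restore the old row
    return X, best
```

For a betadist target, `cdf` would instead be a Beta CDF rescaled to [0, √d] (R's pbeta serves this role).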

The experimental outcomes just described got us thinking about desirable distance distributions for lengthscale estimation. We speculated that it could be advantageous to have a uniform distance design (unifdist), so that all distances were represented—or as many as possible up to the desired design size n. Throughout we presume inputs have been scaled to [0, 1]^d, and restrict the search for lengthscales to θ ∈ (0, √d]. So when we say uniform, or any other distribution, we mean Unif(0, √d].² To calculate a design whose distribution of pairwise distances resembles a reference F, we follow the pseudo-code provided by Algorithm 2, which is based on S stochastic swap proposals that are accepted or rejected via Kolmogorov–Smirnov distances (KSD) against F. In our examples we fix S = 10^5 and utilize a faster, custom implementation of KSD based on isolating the $statistic output of the built-in ks.test function in R. Besides being stochastic, the search is greedy, which means that it only guarantees local convergence as S → ∞. Nevertheless we find that in practice it furnishes empirical pairwise distance distributions close to the target F. There is little benefit in restarting the algorithm to search for a more global optimum.

Unfortunately, our intuition about unifdist designs didn't completely match our results. As summarized along with our earlier RMSE comparison in Figure 3.2, unifdist designs are better than maximin, but worse than LHS and random. This outcome prompted a more careful investigation into why random designs work so well.

Consider the lines in Figure 3.3 labeled “1–50”, representing the empirical density of distances among the random designs whose log MSE was among the fifty best in a large Monte Carlo (MC) exercise. Observe that this density is unimodal, having more small distances than maximin and very few really large distances. The solid red curve in the figure is a Beta(2.5, 4) density scaled to [0, √2] as a representative example of a parametric distribution similar to that of those best random distances.

²Our MLE calculations restrict θ to be greater than the square-root of machine precision, which is near 1e-8 on most machines.

[Figure 3.3 image: empirical density versus pairwise distance for betadist(2.5, 4), the best “1–50” random designs, and maximin.]

Figure 3.3: Empirical density curves corresponding to random designs in 2d with lowest 50 logMSE(θ) values from 1000 random design realizations. Empirical maximin and Beta(2.5, 4) densities are shown for comparison.

Unifdist designs, which are not shown in the figure, target a flat line across the [0, √2] domain. Unifdist outperforms maximin, but not the best (or even the typical) random designs. This suggests that while having more short distances is desirable, having many distances at the extremes—both large and small—may not be helpful on average. As the results in Figure 3.2 show, having Beta-distributed distances, focusing the distribution on mid–low-range pairwise distances, leads to statistically significant improvements over random in all three cases. In fact, these “betadist” designs (being ranked 2 or 1) are the only ones in that figure whose log MSEs are statistically better (see p-values in the lower-right panel) than all other designs of lower rank.

Although Figure 3.3 suggests that a Beta(2.5, 4) is a good target distribution for a betadist design, that was not the specification used to generate all results summarized in Figure 3.2. The best setting of shape parameters, (α̂, β̂) in Beta(α, β), depends on dimension d and design size n, as we explore below. However, it is worth noting that Beta(2.5, 4) does generally perform well because, as we show, the set of decent (α, β) values is relatively big, and does not vary substantially in n and d. But it is not so big that one can choose arbitrarily.
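When targeting Beta-distributed distances with Algorithm 2, the reference F is a Beta(α, β) CDF rescaled to [0, √2] in 2d. A rough sketch of such a target, assuming simple trapezoid-rule integration of the density rather than a proper incomplete-beta routine (which R's pbeta provides):

```python
import math

def beta_cdf(x, a, b, scale, n=400):
    # CDF of a Beta(a, b) random variable rescaled to [0, scale],
    # via trapezoid-rule integration of the density (illustrative only)
    if x <= 0.0:
        return 0.0
    if x >= scale:
        return 1.0
    u = x / scale
    B = math.gamma(a) * math.gamma(b) / math.gamma(a + b)  # Beta function
    h = u / n
    total = 0.0
    for k in range(n + 1):
        t = k * h
        w = 0.5 if k in (0, n) else 1.0  # trapezoid endpoint weights
        total += w * t ** (a - 1) * (1 - t) ** (b - 1)
    return min(1.0, total * h / B)
```

Passing `lambda x: beta_cdf(x, 2.5, 4, math.sqrt(2))` as the target CDF to the swap-search sketch above would then steer a design toward Beta(2.5, 4)-distributed distances.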

3.2.2 Optimization of shape parameters of betadist design

Here we view the choice of betadist parameterization, α̂ and β̂ in Beta(α, β), for particular design size n in input dimension d, as an optimization problem. I.e., we wish to automate the search for betadist_{n,d}(α̂, β̂). Discussion around Figure 3.1 indicates that a degree of de-trending will be required in order not to over-emphasize larger θ settings in the optimization criteria. To address this, we seek (α̂, β̂) = argmin_{α,β} deRIMSE_{n,d}(α, β), where the criterion deRIMSE is defined following a scheme similar to that described around Figure 3.1.

Begin by establishing a regular grid of θ values (θ^(1) = 0.1, ..., θ^(T) = √d), just like in Figure 3.1. Next, generate one pair (α, β) ~ Unif(1, 10)² and use these to create D designs X_n^(i) ~ betadist_{n,d}(α, β), for i = 1, ..., D, following Algorithm 2. Averaging over more random (α, β) will be described momentarily. For each X_n^(i) and each θ^(t), generate random responses Y_n^(t,i) from the GP MVN implied by (X_n^(i), θ^(t)) and estimate θ̂^(t,i) via MLE. Finally, calculate

RMSE^(t) = sqrt( (1/D) Σ_{i=1}^{D} (θ̂^(t,i) − θ^(t))² )

to estimate the accuracy of those MLE calculations for each t = 1, ..., T. Then draw new (α, β) ~ Unif(1, 10)², yielding RMSE^(t,r), repeating the entire scheme above R times, i.e., for r = 1, ..., R. In our empirical work, we chose D = 5 and R = T = 30.

Next, take pairs (θ^(t), {RMSE^(t,r)}_{r=1}^{R}) as T × R observations of the quality of lengthscale estimation – RMSE dynamics – across θ-space and fit a Student-t hetGP to these observations, yielding a surrogate described by mean μ_t ≡ μ(θ^(t)) and σ_t² ≡ σ²(θ^(t)). Now we are ready to define the criterion deRIMSE_{n,d}(α, β) as

deRIMSE(α, β) ≡ (1/T) Σ_{t=1}^{T} (RMSE^(t)(α, β) − μ_t) / σ_t,

where RMSE^(t)(α, β) is calculated just as described above with the specific (not random) settings of (α, β) in question.
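In code, given simulated RMSE values on the θ grid and the fitted trend (μ_t, σ_t) from the Student-t hetGP surrogate, the criterion is just an average of standardized residuals. A sketch (the function name is ours):

```python
def derimse(rmse_t, mu_t, sigma_t):
    # de-trended RIMSE: average standardized RMSE residual over the theta grid,
    # so that large-theta regions do not dominate the optimization criterion
    T = len(rmse_t)
    return sum((r - m) / s for r, m, s in zip(rmse_t, mu_t, sigma_t)) / T
```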

As a warmup experiment toward solving that optimization problem, consider n = 16 and d = 2. We built a size 200 LHS design of (α, β) settings in [1, 10]2 with 5 replicates on each for a total of 1000 evaluations of deRIMSE. The bottom end of that region, α, β ≥ 1, was chosen to limit the search to unimodal beta distributions; the top end of 10 was chosen based on a smaller pilot study. Each deRIMSE evaluation took about 50 seconds, leading to almost 14 hours of total simulation time.

Figure 3.4 shows the design (dots) and fitted surface of deRIMSE values obtained with hetGP, i.e., treating deRIMSE simulation as a stochastic computer experiment and fitting a surrogate to a limited number of evaluations. Outliers are less of a concern when averaging over θ^(t)-values, so there was no need to include Student-t features in this regression. However, accommodating a degree of heteroskedasticity and leveraging replication in the calculations were essential to obtain a good fit in a reasonable amount of time (Binois et al., 2018b). The blue square, at about (α̂, β̂) = (3, 6.5) in the figure, shows where the predictive surface is minimized; the green and purple contours outline regions wherein predicted deRIMSE values are within 5% and 10% of that best setting.

Fourteen hours of simulation in order to choose the characteristics of a random design is rather extreme. However, once done for a particular choice of covariance structure, design


Figure 3.4: deRIMSE surface with T = 1000 for n = 16 and d = 2 as estimated by hetGP. Dots show the design sites; lighter (heat) colors correspond to higher deRIMSEs.

size n and dimension d, it need not be re-done. Still, finding appropriate designs in higher dimension, with more runs to fill out the larger volume, could be computationally daunting. Doubling n, for example, would result in more than double the computational effort.

For a more thrifty approach we turn to BO via EI. The idea is to replace a space-filling evaluation with a sequential design strategy that targets the minimum of the mean of deRIMSE. For a given (n, d)-setting, the setup is as follows. Begin by performing deRIMSE calculations on a maximin design of size twenty, with ten replicates at each setting, and by fitting a hetGP to those realizations, deriving a predictive surface. Then comes the so-called BO acquisition. Based on hetGP posterior predictive equations described by mean μ(x) and standard deviation σ(x), where x = (α, β) in this case, numerically optimize EI(x):

EI(x) = (μ_min − μ(x)) Φ((μ_min − μ(x)) / σ(x)) + σ(x) φ((μ_min − μ(x)) / σ(x)),   (3.1)

where μ_min = min_x μ(x) and Φ and φ are the standard Gaussian cdf and pdf, respectively.
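Equation (3.1) is straightforward to evaluate. A self-contained sketch, using only the standard library (no SciPy), with a conventional σ(x) = 0 fallback that is our addition:

```python
import math

def norm_pdf(z):
    # standard Gaussian density phi(z)
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def norm_cdf(z):
    # standard Gaussian CDF Phi(z), via the error function
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def expected_improvement(mu_x, sigma_x, mu_min):
    # EI for minimization, Eq. (3.1): expected improvement over the
    # current best predicted mean mu_min at a site with mean mu_x
    if sigma_x <= 0.0:
        return max(mu_min - mu_x, 0.0)  # degenerate (noise-free) case
    z = (mu_min - mu_x) / sigma_x
    return (mu_min - mu_x) * norm_cdf(z) + sigma_x * norm_pdf(z)
```

At a candidate whose predicted mean matches μ_min, EI reduces to σ(x)·φ(0), reflecting value from uncertainty alone.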

After (a) solving x* = argmax_x EI(x), which we accomplish using a hybrid of discrete search over replicates and continuous multi-start R–optim-based search with method="L-BFGS-B"; (b) simulating y* = deRIMSE(x*); and (c) incorporating the new data pair into the design and updating the hetGP model fit; the process repeats (back to (a)).
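The (a)–(b)–(c) cycle can be expressed as a generic skeleton, where the callables stand in for EI maximization, deRIMSE simulation, and the hetGP refit (all names here are placeholders of ours):

```python
def bo_loop(acquire, simulate, update, model, budget):
    # (a) acquire: maximize EI under the current surrogate to get x*
    # (b) simulate: run the stochastic experiment at x* to get y*
    # (c) update: augment the data with (x*, y*) and refit the surrogate
    for _ in range(budget):
        x_star = acquire(model)
        y_star = simulate(x_star)
        model = update(model, x_star, y_star)
    return model
```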

For details on EI and BO see Jones et al. (1998) and Chapter 6.3 of Santner et al. (2003). Snoek et al. (2012) offer a somewhat more modern machine learning perspective centered around the use of BO for estimating hyperparameters of deep neural networks. Our use here—to tune a design—is related in spirit but distinct in form. In fact, the setup we propose is fractal. It solves a design problem (for estimating lengthscale) with the solution to another design problem: for function minimization. One could argue that our choice of an initial maximin design for BO is sub-optimal, and we will do just that in Section 3.4.2.

For a set of representative n and d, we allowed our BO scheme to collect an additional 600 deRIMSE simulations. The resulting selections, overlaid with the final predictive mean surface from hetGP, the best value of (α̂, β̂), and 5% and 10% contours, are shown in Figure 3.5. Several noteworthy patterns emerge from the panels in the figure. First, although some of the surfaces appear to be multimodal, or at least to have ridges of low deRIMSE values, there is usually a setting with relatively low (α, β) which works well. Sometimes a larger setting is predicted as optimal; but there is usually an alternate setting, reported as (α̃, β̃) in the figure, which is almost as good (within 5%).

These “near-optimal” (α̃, β̃) were used in our betadist designs, and subsequent boxplots and p-value calculations, in Figure 3.2. They are re-used throughout the remainder of the chapter in our empirical work [Sections 3.4.1–3.4.2], and likewise with the hybrid lhsbeta designs discussed momentarily. Although the computational demands are still sizable even with the more thrifty BO, these designs are “up-front”. Once saved, as we do for the nine

[Figure 3.5 image: nine BO-selection panels; the optimal (α̂, β̂) and near-optimal (α̃, β̃) values from each panel are listed below.]

n = 8, d = 2:    α̂ = 1.62, β̂ = 5.26;   α̃ = 1.5, β̃ = 5
n = 16, d = 2:   α̂ = 2.4, β̂ = 5;       α̃ = 2, β̃ = 4
n = 16, d = 3:   α̂ = 3.02, β̂ = 6.38;   α̃ = 2.5, β̃ = 5
n = 32, d = 3:   α̂ = 3.48, β̂ = 10;     α̃ = 3, β̃ = 5
n = 32, d = 4:   α̂ = 2.44, β̂ = 9.06;   α̃ = 1.5, β̃ = 3.5
n = 64, d = 4:   α̂ = 3.15, β̂ = 6.53;   α̃ = 3, β̃ = 6
n = 64, d = 5:   α̂ = 2.36, β̂ = 5.92;   α̃ = 2, β̃ = 6
n = 128, d = 5:  α̂ = 1.51, β̂ = 3.5;    α̃ = 1, β̃ = 3
n = 128, d = 6:  α̂ = 3.52, β̂ = 8.01;   α̃ = 2, β̃ = 4

Figure 3.5: Outcomes of BO of RIMSE surfaces for various choices of n and d. Numbers show location and number of replicates in acquisitions; blue squares show (α̂, β̂); purple and green contours show 5% and 10% from the optimal.

choices above, no recalculation is required.

3.3 Hybrid betadist and LHS

Having a betadist design, which provides better estimates of hyperparameters like the lengthscale θ, is advantageous only insofar as the resulting surrogate fits, i.e., their predictive equations (2.2), are accurate. Since GP surrogates are inherently spatial predictors, practitioners have long preferred designs which fill the space, so that those sites may serve as nearby anchors to good out-of-sample predictive performance. Betadist designs space-fill less than common alternatives, both quantitatively (i.e., via the maximin criterion) and qualitatively (since they are inherently random). Thus they hold the potential to be inferior as predictive anchors. Yet in our empirical work, we have only been able to demonstrate this negative result (not shown here) when good hyperparameter settings are known. Betadist shines brightest in sequential application [Section 3.4], where the impact of early estimates of hyperparameters can have a substantial effect—exceptionally deleterious in pathological cases—on subsequent design decisions in several common situations.

Still, betadist designs consider only relative distance, completely ignoring position except that the points lie in the study area. Among more-or-less equivalent optimal betadist designs, some may have better positional properties and thus offer better anchoring for prediction without compromising on hyperparameter quality. To explore this possibility we considered a hybrid between betadist and LHS designs. Our “lhsbeta” is similar in spirit to maximin–LHS hybrids where maximin helps avoid second-order aliasing common with LHSs, and LHS helps maximin avoid clumpy marginals. In lhsbeta, we primarily view LHS as helping betadist acquire a degree of positional preference; however, the alternate perspective of preferring LHSs with better relative distances is no less valid.

Our stochastic search strategy for finding lhsbeta designs is coded in Algorithm 3.

Algorithm 3 Hybrid F-dist–LHS via S MC iterations for a design of size n in d dimensions.

Init: Fill X with an LHS of size n in d dimensions.
for s = 1, ..., S do
    Randomly select a pair of design points x_i, x_j.
    Randomly select a dimension k ∈ {1, ..., d}.
    Propose a new design X' by swapping Latin squares L_{i,k} and L_{j,k}, producing new x'_i and x'_j after re-jittering with 2d new uniform random numbers.
    if KSD(X', F) < KSD(X, F) then
        x_i ← x'_i and x_j ← x'_j in the (i, j)-th rows of X, i.e., accept X ← X'.
    end if
end for

Like in Algorithm 2 for betadist, we presume an input space coded to [0, 1]^d. The algorithm is initialized with an LHS X, built in the canonical way (see, e.g., Lin and Tang, 2015) by first choosing d random permutations of {1, ..., n}, saved in an n × d matrix L describing the n selected hypercubes out of the n^d possible partitions of the input space, and then applying jitter in each selected cube. Each subsequent iteration of stochastic search involves randomly proposing to swap pairs of rows and columns of L, effectively swapping the pair of Latin squares without destroying the one-dimensional uniformity property, and then re-jittering that pair of points within their respective squares. That proposal is then accepted or rejected according to KSD measured against a distribution F, which in our applications is Beta(α̃, β̃) from Section 3.2. Since two types of random proposals are being performed simultaneously, compared to Algorithm 2's single random swap, we prefer an S larger by a factor of two in Algorithm 3; S = 10^5 in our empirical work.

Figure 3.6 shows a visual comparison between maximin, betadist and lhsbeta designs so constructed. The plots provide a 2d projection for the case n = 16 and d = 3. Observe that maximin's 1d margins, shown as red triangles at the axes in the left panel, are not uniform. Neither are those in the 2d projection shown as open circles. First-order aliasing is severe in

[Figure 3.6 image: three panels titled maximin, beta (2,5), and lhsbeta (2,5), each on [0, 1]².]

Figure 3.6: 2d (black circles) and 1d (red triangles) projections of three d = 3 designs, n = 16.

both projections. In the middle panel, our betadist design has a similar problem (although perhaps not to the same degree), yet we know that the distribution of pairwise distances in 3d is much better than maximin for the purpose of lengthscale inference. In the right panel the 1d and 2d margins look much better, because the sample is an LHS. Among LHSs, this lhsbeta design has a near-optimal distribution of pairwise distances for this setting (n, d). Figure 3.2 shows that lhsbeta designs are sometimes worse than ordinary betadist designs, but both are consistently better than all of the other comparators in the figure. This is perhaps not surprising because lhsbeta designs are indeed betadist designs, yet selected for an additional feature not relevant for lengthscale information: space-fillingness. As we show in two prediction-based comparisons below, lhsbeta designs are sometimes superior on those tasks.

3.4 Application to sequential design

Here we provide two applications of betadist and lhsbeta as initial designs for a subsequent sequential analysis. In both cases, these distance distribution-based designs are only engaged in a limited way, as a means of seeding the sequential procedure. Subsequent design acquisitions are then off-loaded to other criteria. Still, it is remarkable how profound the effect of this initial choice can be. A poorly chosen initial design of just ninit = 8 points, say, can be detrimental to predictive accuracy at n = 64.

3.4.1 Active Learning MacKay

First consider the so-called active learning MacKay (ALM; MacKay, 1992) method of sequen- tial design for reducing predictive uncertainty. Acquisitions are determined by maximizing the predictive variance σ2(x).
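ALM acquisition is conceptually one line: given the surrogate's predictive-variance function, pick the input where it is largest. A sketch over a finite candidate set (names are ours; in practice σ²(x) is optimized continuously, as described below):

```python
def alm_acquire(candidates, pred_var):
    # ALM: the next design site is the candidate with the largest
    # posterior predictive variance under the current GP surrogate
    return max(candidates, key=pred_var)
```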

The target of our experiment is a function f(x) observed under light noise as Y(x) = f(x) + ε, with ε ~ iid N(0, 0.01²). For f(x) we use the function

f(x) = x₁ exp{−x₁² − x₂²}, with x ∈ [−2, 4]²,

first introduced as an active learning benchmark by Gramacy and Lee (2009). We begin with an initial design of size ninit = 8, and perform 56 additional ALM acquisitions for a total of n = 64 evaluations. Along the way, root mean-squared prediction error (RMSPE) is calculated on noise-free outputs obtained on a regular 100 × 100 testing grid in the input space. For the initial design, we consider random, LHS, 2d optimal (α̃ = 2, β̃ = 5) betadist and lhsbeta designs, and maximin. Unifdist has been dropped from the comparison on the grounds that it is a sub-optimal betadist alternative. In keeping with our earlier experiments, MLE calculations limited to θ ∈ (0, √2] are updated after each sequential design acquisition. To accommodate the noisy evaluations, we augment our covariance with a nugget hyperparameter which is included in the MLE calculation via

jmleGP in the laGP package. An L-BFGS-B scheme is used to solve argmax_x σ²(x) via optim in R. Variance surfaces can be highly multi-modal, having as many maxima as design points, which is what creates the “sausage”-like shape characteristic of the error-bars produced by GP predictive equations. We deployed an n-factor sequential maximin multi-start scheme to avoid inferior local modes of the variance surface. This means that maximin is used to choose the optim initializations, in order to space out starting locations relative to each other and to the existing X_n design locations.

[Figure 3.7 image: mean RMSPE (left) and 90% RMSPE quantile (right) versus design size (10–60) for lhs, betadist, lhsbeta, maximin and random.]

Figure 3.7: RMSPE comparison of initial designs (ninit = 8) as a function of the number of subsequent sequential design iterations via ALM. Each comparator has a pair of lines: those in the left panel indicate mean RMSPE; those on the right are the upper 90% quantile.

Figure 3.7 shows the outcome of this exercise via mean RMSPE (left panel) and upper 90% RMSPE quantile (right) obtained from 1000 MC repetitions of the scheme described above. Several striking observations stand out. Betadist, lhsbeta and random perform about the same, with betadist winning out in the end. However, in early stages lhsbeta is best and random is the worst of the three. Beta-distributed distances (from betadist and lhsbeta) lead to better hyperparameter estimates than random. Yet position of design sites is more important than lengthscale quality when there is little data. After many sequential acquisitions, position is less important—ALM takes care of that—but the final results are still sensitive to the choice of the first ninit = 8 points, even though MLEs θ̂ are recalculated after each selection. Seeding the sequential design, which is often glossed over as an implementation detail, can be crucial to good performance in active learning.

Consequently, betadist, lhsbeta and random vastly outperform LHS and maximin. The trouble with these space-filling seed designs is evident in the 90% quantile, which fails to improve even after many new design sites are added. Too much spread in the initial design results in large θ̂s, which is reinforced by subsequent ALM acquisitions at the boundaries of the input space. The early behavior of maximin is particularly strange: getting worse before better, even in cases where sequential acquisitions lead to decent results. Its 90% quantile is eventually no worse than LHS's—quite poor. The fact that maximin's average RMSPE is nearly as bad suggests that maximin rarely recovers from that poor initial design.
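The sequential-maximin multi-start idea used above, spacing optim initializations away from one another and from the existing design, can be sketched greedily (function and parameter names are ours; this assumes a nonempty existing design):

```python
import math, random

def maximin_starts(existing, m, d, n_cand=500, seed=0):
    # greedily pick m start points from random candidates, each maximizing
    # its minimum distance to existing design sites and prior picks
    rng = random.Random(seed)
    cands = [[rng.random() for _ in range(d)] for _ in range(n_cand)]
    chosen = []
    for _ in range(m):
        ref = existing + chosen  # everything a new start should avoid
        best = max(cands, key=lambda c: min(math.dist(c, r) for r in ref))
        chosen.append(best)
        cands.remove(best)
    return chosen
```

Each returned point would then seed one L-BFGS-B run on the variance surface, reducing the chance that all starts fall into the same local mode.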

3.4.2 Expected improvement for optimization

Here we show that betadist and lhsbeta initial designs are also superior in a BO context similar to that used to find the best α̂ and β̂ settings in Section 3.2.2. Specifically, acquisitions are gathered via EI (3.1) using a random five-start scheme including the location of the best input setting (corresponding to μ_min) from the previous iteration. As a test function, we use the so-called Griewank function

f_d(x) = Σ_{i=1}^{d} x_i² / 4000 − Π_{i=1}^{d} cos(x_i / √i) + 1.

For visualizations and further details, including R implementation, see the Virtual Library of Simulation Experiments: https://www.sfu.ca/~ssurjano/griewank.html.
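For reference, the Griewank function is only a few lines to implement; a Python sketch (the VLSE page above provides R and MATLAB versions):

```python
import math

def griewank(x):
    # Griewank function: a sum-of-squares bowl plus an oscillatory product
    # term; the global minimum is f(0) = 0 in any input dimension
    s = sum(xi * xi for xi in x) / 4000.0
    p = math.prod(math.cos(xi / math.sqrt(i + 1)) for i, xi in enumerate(x))
    return s - p + 1.0
```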

A nice feature of the Griewank is that it is defined for arbitrary input dimension d, and is flexible about the bounds b of the inputs, x ∈ [−b, b]^d. These two settings, b and d, together determine the complexity of the response surface. The global minimum is always at the origin; however, the number of local minima grows quickly as b and d are increased. We utilize these knobs to vary the complexity of the function, in order to span a range of optimization problems. By varying the bounds b in particular, we vary the magnitude of the best lengthscale for the purpose of surrogate modeling, and thereby create a situation where an initial design is key to obtaining good performance in BO.

Our experimental setup is as follows. We consider three (ninit, d)-pairs from Figure 3.5 and track the progress of EI-based BO, measured by the lowest value of the objective found over the sequential design iterations. In each of one thousand MC repetitions, we create initial ninit-sized designs via maximin, random, LHS, betadist and lhsbeta, with subsequent acquisitions handled by EI. In accordance with the theory for convergence of EI-based BO (Bull, 2011), we do not update θ̂ after each EI acquisition, but fix it at the setting obtained immediately after the initial design. This has the benefit of accentuating the effect of the initial design, which suits our illustrative purposes. It is also more computationally efficient, leading to an O(n³) calculation rather than O(n⁴) if MLEs are recalculated regularly. However, the results are not much different under that latter alternative.

To vary the complexity of the underlying optimization problem, and thus the best effective lengthscale for the GP surrogate, we draw b ~ Unif(0, 10) at the start of each MC repetition. In so doing, each of 1000 MC repetitions targets a Griewank function having a different degree of waviness and number of local optima. By holding b fixed for each of the five initial design choices, and subsequent EI-optimizations, we create a setting wherein pairwise t-tests can be used to adjudicate between those comparators. Finally, all calculations were performed with methods built into the laGP package on CRAN. Since we observe f_d(x) without noise, no nugget hyperparameters are required. Not presuming to know the randomly generated scale b, we allow MLE calculations for θ̂ to search in a space that would be appropriate for the largest settings, θ ∈ (0, 10√d], regardless of b.
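The paired comparisons behind the tables below reduce to a t statistic on per-repetition differences. A sketch computing the statistic only (mapping it to a lower-tail p-value requires a Student-t CDF, e.g. R's pt; function names are ours):

```python
import math, statistics

def paired_t_stat(a, b):
    # paired t statistic: mean of the differences over its standard error;
    # strongly negative values favor comparator `a` when lower is better
    d = [x - y for x, y in zip(a, b)]
    n = len(d)
    se = statistics.stdev(d) / math.sqrt(n)
    return statistics.mean(d) / se, n - 1  # statistic, degrees of freedom
```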

n = 25
          maximin   LHS       random    betadist  lhsbeta
maximin   NA        0.95      0.98      > 0.99    > 0.99
LHS       0.048     NA        0.67      > 0.99    > 0.99
random    0.022     0.33      NA        > 0.99    > 0.99
betadist  < 1e-7    2e-5      8e-5      NA        > 0.99
lhsbeta   < 1e-7    < 1e-7    < 1e-7    < 1e-7    NA

n = 70
          maximin   LHS       random    betadist  lhsbeta
maximin   NA        > 0.99    > 0.99    > 0.99    > 0.99
LHS       5e-7      NA        0.89      > 0.99    > 0.99
random    < 1e-7    0.11      NA        > 0.99    > 0.99
betadist  < 1e-7    < 1e-7    2e-6      NA        > 0.99
lhsbeta   < 1e-7    < 1e-7    < 1e-7    < 1e-7    NA

Table 3.1: Pairwise t-test p-value table for (ninit = 8, d = 2) and two settings n = 25 (top table) and n = 70 (bottom). Statistically significant p-values, i.e., below 5%, are in bold.

Table 3.1 summarizes results obtained from the (ninit = 8, d = 2) case in two views: after n = 25 total acquisitions, and then after n = 70. The bolded p-values in the table(s) are below the typical 5% threshold. Observe in both cases that random and LHS designs are consistently better than maximin, but betadist is significantly better than all three. Hybrid lhsbeta outperforms all of the others. In other words, the story here is more or less the same as before. The only substantial difference is that lhsbeta outperforms betadist.

Table 3.2 summarizes results from the (ninit = 16, d = 3) case. In higher dimension, the problem is more challenging, with many more local minima. Both a bigger initial design and a larger run of EI acquisitions are necessary in order to obtain reliable results. At n = 50 the pecking order is similar: maximin, LHS, betadist, lhsbeta—all statistically significant at the 5% level. Random outperforms LHS, but not significantly so at the 5% level.

Finally, Table 3.3 summarizes the (ninit = 32, d = 4) case with n = 200 and n = 500. Except when the randomly chosen b is very small, this setting represents an extremely difficult

n = 50
          maximin   LHS       random    betadist  lhsbeta
maximin   NA        > 0.99    > 0.99    > 0.99    > 0.99
LHS       < 1e-7    NA        0.95      > 0.99    > 0.99
random    < 1e-7    5.3e-2    NA        > 0.99    > 0.99
betadist  < 1e-7    < 1e-7    < 1e-7    NA        > 0.99
lhsbeta   < 1e-7    < 1e-7    < 1e-7    1e-3      NA

n = 100
          maximin   LHS       random    betadist  lhsbeta
maximin   NA        > 0.99    > 0.99    > 0.99    > 0.99
LHS       < 1e-7    NA        > 0.99    > 0.99    > 0.99
random    < 1e-7    3e-3      NA        > 0.99    > 0.99
betadist  < 1e-7    < 1e-7    < 1e-7    NA        > 0.99
lhsbeta   < 1e-7    < 1e-7    < 1e-7    8e-3      NA

Table 3.2: Pairwise t-test p-value table for (ninit = 16, d = 3) and two settings n = 50 (top table) and n = 100 (bottom). Statistically significant p-values, i.e., below 5%, are in bold.

n = 200
          maximin   LHS       random    betadist  lhsbeta
maximin   NA        > 0.99    > 0.99    > 0.99    > 0.99
LHS       < 1e-7    NA        0.43      > 0.99    > 0.99
random    < 1e-7    0.57      NA        > 0.99    > 0.99
betadist  < 1e-7    2e-4      2e-4      NA        > 0.99
lhsbeta   < 1e-7    < 1e-7    < 1e-7    < 1e-7    NA

n = 500
          maximin   LHS       random    betadist  lhsbeta
maximin   NA        > 0.99    > 0.99    > 0.99    > 0.99
LHS       < 1e-7    NA        0.25      > 0.99    > 0.99
random    < 1e-7    0.75      NA        > 0.99    > 0.99
betadist  < 1e-7    8e-3      2e-3      NA        > 0.99
lhsbeta   < 1e-7    < 1e-7    < 1e-7    < 1e-7    NA

Table 3.3: Pairwise t-test p-value table for (ninit = 32, d = 4) and two settings n = 200 (top table) and n = 500 (bottom). Statistically significant p-values, i.e., below 5%, are in bold.

optimization with dozens of local minima. A large number of samples is required to obtain decent global BO results. The story here is very similar to Tables 3.1–3.2.

Chapter 4

IMSPE batch-sequential design

This study is motivated by a stochastic, agent-based simulator of the conservation of delta smelt fish (Rose et al., 2013), with the goal of understanding sensitivities to myriad natural variables and human interventions. Rose et al.'s simulator is slow (typically 4–6 hours for a single run). The input configuration space is large (upwards of 13 dimensions), and the response surface is nonlinear. Separating signal from noise requires a large, costly, highly distributed HPC simulation campaign and pairing with a flexible meta-model. Previous campaigns fixed random number seeds, perhaps to artificially amplify signal. Our initial study with this simulator, described in Section 5.2, suggests that in some low-noise/low-signal parts of the configuration space this shortcut is harmless. However, we observe that the response surface is heteroskedastic, and moreover noise levels can vary nonlinearly. This challenges effective design and meta-modeling – a setting that is increasingly common in simulation experiments, especially those based on agent-based models (Baker et al., 2020).

In similar situations (e.g., Bisset et al., 2009, Fadikar et al., 2018, Farah et al., 2014, Johnson, 2008, Rutter et al., 2019), but perhaps not as extreme in terms of simulator cost, input dimension, and changing variance, researchers have been getting mileage out of methods for surrogate modeling and the design and analysis of computer experiments (Gramacy, 2020, Sacks et al., 1989, Santner et al., 2018). Default, model-free design strategies, such as space-filling options like Latin hypercube sampling (LHS; McKay et al., 1979), are a good starting point but are not reactive/easily refined to target parts of the input space which

require heavier sampling. Model-based designs based on Gaussian process (GP) surrogates fare better, in part because they can be developed sequentially along with learning (e.g., Gramacy and Polson, 2011, Jones et al., 1998, Seo et al., 2000).

Until recently, surrogate modeling and computer experiment design methodology has emphasized deterministic computer evaluations, for example those arising in finite element analysis or solving systems of differential equations. Sequential design with heteroskedastic GP (HetGP) surrogates (Binois et al., 2018a) for stochastic simulations has recently been proposed as a means of dynamically allocating more runs in higher-uncertainty/higher-variance parts of the input space (Binois et al., 2018c). Such schemes are typically applied as one-at-a-time affairs – fit model, optimize acquisition criteria, run simulation, augment data, repeat – which would take too long for delta smelt. We anticipate needing thousands of runs, with several hours per run. That process cannot be fully serial.

Batch-sequential design procedures have been applied with GP surrogates (e.g., Chevalier, 2013, Duan et al., 2017, Erickson et al., 2018, Ginsbourger et al., 2010, Loeppky et al., 2010). These attempt to calculate a group of runs to go at once, say on a multi-core supercomputing node, towards various design goals. Sometimes these are called "multi-points criteria". Quasi-batch schemes, which asynchronously re-order points for an unknown number of future simulations, have also thrived in supercomputing settings (Gramacy and Lee, 2009, Taddy et al., 2009). However, none of these schemes explicitly addresses input-dependent noise like we observe in the delta smelt simulations. Here we propose extending the one-at-a-time method of Binois et al. (2018c) to a batch-sequential setting. Our goal is to design for batches of size 24 to match the number of cores available on nodes of a supercomputing cluster at Virginia Tech. Following Binois et al.'s lead, we develop a novel scheme for encouraging replicates in the batches. Replication is a tried and true technique for separating signal from noise, reducing sufficient statistics for modeling and thus enhancing computational and learning efficiency.

Our flow is as follows. Section 4.1 explains our batch-sequential acquisition strategy through an integrated mean-squared prediction error (IMSPE) criterion and closed-form derivatives for optimization, extending the one-at-a-time process from Binois et al. (2018c). Section 4.2 provides a novel and thrifty post-processing scheme to identify replicates in the new batch. Illustrative examples are provided throughout, and Section 4.3 details a benchmarking exercise against the infeasible one-at-a-time gold standard. Finally, in Section 5 the design method is applied to smelt simulations to effectively collect samples.

4.1 Batch sequential design

For a stochastic simulator with heteroskedastic noise, sampling effort would ideally concentrate on parts of the input space that are harder to model, or where more value can be extracted from noisy simulations. Binois et al. (2018c) proposed IMSPE-based sequential design with that goal in mind. The time-consuming nature of delta smelt simulations means adding one point at a time, i.e., in serial, would be slow and at odds with modern, distributed HPC capabilities. Here we propose extending Binois et al. (2018c) to batches that can fill entire compute nodes at once.

4.1.1 A criterion for minimizing variance

Integrated mean-squared prediction error (IMSPE) measures how well a surrogate model captures the input-output relationship. It is widely used as a data acquisition criterion; see, e.g., Gramacy (2020, Chapters 6 and 10). Let $\check{\sigma}_N^2(x)$ denote the nugget-free predictive variance for any single $x \in D$. IMSPE for a design $X_N$ may be defined as

$$I_N \equiv \mathrm{IMSPE}(X_N) = \int_{x \in D} \check{\sigma}_N^2(x) \, dx = \hat{\tau}^2 \int_{x \in D} \left[ c(x, x) - c(x, X_N) K_N^{-1} c(x, X_N)^\top \right] dx.$$

The integral above has an analytic expression for GP surrogates, in part because of the closed form for $\check{\sigma}_N^2(x)$. Examples involving specialized GP setups in recent literature include Ankenman et al. (2010), Chen et al. (2019), and Leatherman et al. (2017). Similar expressions do not, to our knowledge, exist for other popular surrogates like deep neural networks, say.

Binois et al. (2018c) gives perhaps the most generic and prescriptive expression for GPs, emphasizing replicates at $n \ll N$ unique inputs $\bar{x}_i$ for computational efficiency. Let $K_n$ denote the unique $n \times n$ covariance structure with entries $K_n^{ij} = c(\bar{x}_i, \bar{x}_j) + \delta_{ij} \frac{r(\bar{x}_i)}{a_i}$, where $a_i$ counts the replicates at $\bar{x}_i$. Let $W_n$ be an $n \times n$ matrix with entries comprising integrals of kernel products, $w(\bar{x}_i, \bar{x}_j) = \int_{x \in D} c(\bar{x}_i, x) c(\bar{x}_j, x) \, dx$ for $1 \leq i, j \leq n$, and let $E = \int_{x \in D} c(x, x) \, dx$, which is constant with respect to the design $X_n$. Closed forms are provided in Appendix B of Binois et al. for common kernels. Then $O(n^3)$ calculations yield

$$I_N = \mathbb{E}[c(X, X)] - \mathbb{E}[c(X, X_N) K_N^{-1} c(X, X_N)^\top] = E - \mathrm{tr}(K_n^{-1} W_n). \qquad (4.1)$$
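As a concrete check of the trace identity above, the following Python sketch (the dissertation's implementation is in R; the kernel, design sites, noise levels, and replicate counts here are made up) compares $E - \mathrm{tr}(K_n^{-1} W_n)$ with brute-force numerical integration of the predictive variance:

```python
import numpy as np

def c(a, b, theta=0.1):
    # Gaussian kernel c(x, x') = exp(-(x - x')^2 / theta)
    return np.exp(-(a - b) ** 2 / theta)

xbar = np.array([0.1, 0.35, 0.6, 0.9])   # unique design sites (hypothetical)
r = np.array([0.05, 0.2, 0.1, 0.3])      # noise variances r(xbar_i)
a = np.array([3, 1, 2, 4])               # replicate counts a_i
Kn = c(xbar[:, None], xbar[None, :]) + np.diag(r / a)
Ki = np.linalg.inv(Kn)

xx = (np.arange(5000) + 0.5) / 5000      # midpoint rule on D = [0, 1]
Cx = c(xx[:, None], xbar[None, :])       # cross-covariances c(x, Xn)
E = 1.0                                  # int c(x, x) dx = 1 for this kernel
Wn = (Cx[:, :, None] * Cx[:, None, :]).mean(axis=0)   # w(xbar_i, xbar_j)

imspe_trace = E - np.trace(Ki @ Wn)
# brute force: integrate sigma^2(x) = c(x,x) - c(x,Xn) Kn^{-1} c(x,Xn)^T
sig2 = 1.0 - np.einsum('ij,jk,ik->i', Cx, Ki, Cx)
imspe_brute = sig2.mean()
print(abs(imspe_trace - imspe_brute))    # agreement to machine precision
```

Both expressions use the same quadrature grid, so any discrepancy reflects the algebra rather than the integration.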

Although expressed for an entire design $X_n$, in practice IMSPE is most useful in sequential application where the goal is to choose new runs. Binois et al. provided a tidy expression for solving for the $(n+1)^{\mathrm{st}}$ input $x_{n+1}$ by optimizing $I_{n+1}(\tilde{x})$ over candidates $\tilde{x}$. We extend this to an entire batch of size $M \geq 1$, augmenting $X_N$ or (more compactly) the unique elements $\bar{X}_n$. Let $\tilde{X} = \{\tilde{x}_1, \tilde{x}_2, \ldots, \tilde{x}_M\}^\top$ denote the coordinates of a new batch, and let $I_{N+M}(\tilde{X})$ denote the new IMSPE, which is realized most directly by shoving a row-combined $[X_N; \tilde{X}]$ into Eq. (4.1). That over-simplifies, and flops in $O((N+M)^3)$ could be prohibitive.

Partition inverse equations (Barnett, 1979) can be leveraged for even thriftier evaluation.

Extend the kernel $K$ and its integral $W$ to define new $(n+M) \times (n+M)$ matrices

$$K_{n+M} = \begin{bmatrix} K_n & c(\bar{X}_n, \tilde{X}) \\ c(\bar{X}_n, \tilde{X})^\top & c(\tilde{X}, \tilde{X}) + r(\tilde{X}) \end{bmatrix}, \qquad W_{n+M} = \begin{bmatrix} W_n & w(\bar{X}_n, \tilde{X}) \\ w(\bar{X}_n, \tilde{X})^\top & w(\tilde{X}, \tilde{X}) \end{bmatrix},$$

where $W_n = w(\bar{X}_n, \bar{X}_n)$ and $r(\tilde{X}) = \mathrm{Diag}(r(\tilde{x}_1), \ldots, r(\tilde{x}_M))$ comes from smoothed latent variances following Eq. (2.4) via $c(\tilde{X}, \bar{X}_n)$, so that $r(\tilde{X}) = \tau^2 \Lambda(\tilde{X})$, where

$$\Lambda(\tilde{X}) = K_{(\delta)}(\tilde{X}, \bar{X}_n) (C_{(\delta)} + g_{(\delta)} A_n^{-1})^{-1} \Delta_n. \qquad (4.2)$$

We may fill in the inverse $K_{n+M}^{-1}$ in flops in $O(M^3 + nM^2 + n^2 M)$ as

$$K_{n+M}^{-1} = \begin{bmatrix} K_n^{-1} + g(\tilde{X}) \Sigma(\tilde{X}) g(\tilde{X})^\top & g(\tilde{X}) \\ g(\tilde{X})^\top & \Sigma(\tilde{X})^{-1} \end{bmatrix}, \qquad (4.3)$$

where $g(\tilde{X}) = -K_n^{-1} c(\bar{X}_n, \tilde{X}) \Sigma(\tilde{X})^{-1}$ and $\Sigma(\tilde{X}) = r(\tilde{X}) + c(\tilde{X}, \tilde{X}) - c(\bar{X}_n, \tilde{X})^\top K_n^{-1} c(\bar{X}_n, \tilde{X})$. Multiplying through components of Eq. (4.3) and using properties of traces in Eq. (4.1) leads to

$$\begin{aligned} I_{N+M} &= E - \mathrm{tr}\left( \left( K_n^{-1} + g(\tilde{X}) \Sigma(\tilde{X}) g(\tilde{X})^\top \right) W_n + g(\tilde{X}) w(\bar{X}_n, \tilde{X})^\top \right) \\ &\quad - \mathrm{tr}\left( g(\tilde{X})^\top w(\bar{X}_n, \tilde{X}) + \Sigma(\tilde{X})^{-1} w(\tilde{X}, \tilde{X}) \right) \qquad (4.4) \\ &= I_N - \mathrm{tr}\left( g(\tilde{X}) \Sigma(\tilde{X}) g(\tilde{X})^\top W_n \right) - 2 \, \mathrm{tr}\left( g(\tilde{X}) w(\bar{X}_n, \tilde{X})^\top \right) - \mathrm{tr}\left( \Sigma(\tilde{X})^{-1} w(\tilde{X}, \tilde{X}) \right). \end{aligned}$$

Finding the best $\tilde{X}$ requires only the latter terms above. That is, we seek

$$\tilde{X}^* = \operatorname*{argmin}_{\tilde{X} \in D} I_{N+M} = \operatorname*{argmax}_{\tilde{X} \in D} \; \mathrm{tr}\left( g(\tilde{X}) \Sigma(\tilde{X}) g(\tilde{X})^\top W_n \right) + 2 \, \mathrm{tr}\left( g(\tilde{X}) w(\bar{X}_n, \tilde{X})^\top \right) + \mathrm{tr}\left( \Sigma(\tilde{X})^{-1} w(\tilde{X}, \tilde{X}) \right).$$

In other words, we seek the $\tilde{X}^*$ giving the largest reduction in IMSPE. Evaluation involves flops in the orders quoted above; however, in repeated calls for numerical optimization many of the $O(n)$ quantities can be pre-evaluated, leaving $O(M^3 + nM^2 + n^2 M)$ for each $\tilde{X}$.
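The reduction above can be sanity-checked numerically. The sketch below (Python rather than the dissertation's R; all sites and noise values are illustrative) builds the partitioned-inverse quantities $g(\tilde{X})$ and $\Sigma(\tilde{X})$ for a toy 1d problem with $M = 2$ and confirms that the update matches rebuilding $K_{n+M}$ and $W_{n+M}$ from scratch:

```python
import numpy as np

def c(a, b, theta=0.1):
    return np.exp(-(a - b) ** 2 / theta)

grid = (np.arange(4000) + 0.5) / 4000    # midpoint rule on D = [0, 1]

def wmat(A, B):                          # w(a, b) = int c(a, x) c(b, x) dx
    Ca, Cb = c(grid[:, None], A[None, :]), c(grid[:, None], B[None, :])
    return (Ca[:, :, None] * Cb[:, None, :]).mean(axis=0)

xbar = np.array([0.1, 0.35, 0.6, 0.9])   # existing unique sites
r = np.array([0.05, 0.2, 0.1, 0.3])      # their noise levels
xt = np.array([0.25, 0.75])              # candidate batch, M = 2
rt = np.array([0.15, 0.1])               # smoothed noise at the new sites

Kn = c(xbar[:, None], xbar[None, :]) + np.diag(r)
Ki = np.linalg.inv(Kn)
Wn, WnM, WMM = wmat(xbar, xbar), wmat(xbar, xt), wmat(xt, xt)
IN = 1.0 - np.trace(Ki @ Wn)

cnM = c(xbar[:, None], xt[None, :])      # c(Xn, Xtilde)
Sig = np.diag(rt) + c(xt[:, None], xt[None, :]) - cnM.T @ Ki @ cnM
Si = np.linalg.inv(Sig)
g = -Ki @ cnM @ Si
INM = (IN - np.trace(g @ Sig @ g.T @ Wn)
          - 2 * np.trace(g @ WnM.T) - np.trace(Si @ WMM))

# brute force with the full (n + M)-sized system
X2, r2 = np.concatenate([xbar, xt]), np.concatenate([r, rt])
K2 = c(X2[:, None], X2[None, :]) + np.diag(r2)
INM_direct = 1.0 - np.trace(np.linalg.inv(K2) @ wmat(X2, X2))
print(abs(INM - INM_direct))             # ~ machine precision
```

As expected, adding the two candidate runs also strictly lowers IMSPE relative to $I_N$.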

4.1.2 Batch IMSPE gradient

To facilitate library-based numerical optimization of $I_{N+M}(\tilde{X})$ with respect to $\tilde{X}$, in particular via Eq. (4.4), we furnish closed-form expressions for its gradient. Below, these are framed via partial derivatives for $\tilde{x}_{i(p)}$, the $p^{\mathrm{th}}$ coordinate of the $i^{\mathrm{th}}$ design point in the new batch. Beginning with the chain rule, the gradient of $I_{N+M}$ over $\tilde{x}_{i(p)}$ follows

$$\frac{\partial I_{N+M}}{\partial \tilde{x}_{i(p)}} = -\mathrm{tr}\left( \frac{\partial K_{n+M}^{-1}}{\partial \tilde{x}_{i(p)}} W_{n+M} + K_{n+M}^{-1} \frac{\partial W_{n+M}}{\partial \tilde{x}_{i(p)}} \right). \qquad (4.5)$$

For the component $\partial K_{n+M}^{-1} / \partial \tilde{x}_{i(p)}$, we have

$$\frac{\partial K_{n+M}^{-1}}{\partial \tilde{x}_{i(p)}} = \frac{\partial}{\partial \tilde{x}_{i(p)}} \begin{bmatrix} K_n^{-1} + g(\tilde{X}) \Sigma(\tilde{X}) g(\tilde{X})^\top & g(\tilde{X}) \\ g(\tilde{X})^\top & \Sigma(\tilde{X})^{-1} \end{bmatrix} = \begin{bmatrix} H(\tilde{X}) & Q(\tilde{X}) \\ Q(\tilde{X})^\top & V(\tilde{X}) \end{bmatrix},$$

where

$$\begin{aligned} V(\tilde{X}) &:= -\Sigma(\tilde{X})^{-1} \frac{\partial \Sigma(\tilde{X})}{\partial \tilde{x}_{i(p)}} \Sigma(\tilde{X})^{-1}, \\ Q(\tilde{X}) &:= \frac{\partial g(\tilde{X})}{\partial \tilde{x}_{i(p)}} = -K_n^{-1} \left( c(\bar{X}_n, \tilde{X}) V(\tilde{X}) + \frac{\partial c(\bar{X}_n, \tilde{X})}{\partial \tilde{x}_{i(p)}} \Sigma(\tilde{X})^{-1} \right), \\ H(\tilde{X}) &:= \frac{\partial \, g(\tilde{X}) \Sigma(\tilde{X}) g(\tilde{X})^\top}{\partial \tilde{x}_{i(p)}} = g(\tilde{X}) \frac{\partial \Sigma(\tilde{X})}{\partial \tilde{x}_{i(p)}} g(\tilde{X})^\top + Q(\tilde{X}) \Sigma(\tilde{X}) g(\tilde{X})^\top + \left\{ Q(\tilde{X}) \Sigma(\tilde{X}) g(\tilde{X})^\top \right\}^\top. \end{aligned}$$

The terms in the previous expressions are as follows:

$$\frac{\partial \Sigma(\tilde{X})}{\partial \tilde{x}_{i(p)}} = \frac{\partial c(\tilde{X}, \tilde{X})}{\partial \tilde{x}_{i(p)}} + \frac{\partial r(\tilde{X})}{\partial \tilde{x}_{i(p)}} - \frac{\partial c(\bar{X}_n, \tilde{X})^\top}{\partial \tilde{x}_{i(p)}} K_n^{-1} c(\bar{X}_n, \tilde{X}) - \left\{ \frac{\partial c(\bar{X}_n, \tilde{X})^\top}{\partial \tilde{x}_{i(p)}} K_n^{-1} c(\bar{X}_n, \tilde{X}) \right\}^\top,$$

where $\partial c(\bar{X}_n, \tilde{X}) / \partial \tilde{x}_{i(p)}$ is the $n \times M$ matrix whose only nonzero column is the $i^{\mathrm{th}}$, with entries $\partial c(\tilde{x}_i, \bar{x}_j) / \partial \tilde{x}_{i(p)}$ for $j = 1, \ldots, n$:

$$\frac{\partial c(\bar{X}_n, \tilde{X})}{\partial \tilde{x}_{i(p)}} = \left[ \; 0_{n \times (i-1)} \;\; \frac{\partial c(\tilde{x}_i, \bar{X}_n)}{\partial \tilde{x}_{i(p)}} \;\; 0_{n \times (M-i)} \; \right],$$

and similarly $\partial c(\tilde{X}, \tilde{X}) / \partial \tilde{x}_{i(p)}$ is the $M \times M$ matrix whose only nonzero entries lie in the $i^{\mathrm{th}}$ row and column, with entries $\partial c(\tilde{x}_i, \tilde{x}_j) / \partial \tilde{x}_{i(p)}$ for $j = 1, \ldots, M$.

Then we focus on the expressions related to $\partial W_{n+M} / \partial \tilde{x}_{i(p)}$:

$$\frac{\partial W_{n+M}}{\partial \tilde{x}_{i(p)}} = \frac{\partial}{\partial \tilde{x}_{i(p)}} \begin{bmatrix} W_n & w(\bar{X}_n, \tilde{X}) \\ w(\bar{X}_n, \tilde{X})^\top & w(\tilde{X}, \tilde{X}) \end{bmatrix} = \begin{bmatrix} 0 & S(\tilde{X}) \\ S(\tilde{X})^\top & T(\tilde{X}) \end{bmatrix}.$$

With these quantities and Eq. (4.4), the gradient of $I_{N+M}$ can be expressed as

$$\begin{aligned} -\frac{\partial I_{N+M}}{\partial \tilde{x}_{i(p)}} &= \mathrm{tr}\left( g(\tilde{X}) \frac{\partial \Sigma(\tilde{X})}{\partial \tilde{x}_{i(p)}} g(\tilde{X})^\top W_n \right) + 2 \, \mathrm{tr}\left( Q(\tilde{X}) \Sigma(\tilde{X}) g(\tilde{X})^\top W_n \right) \\ &\quad + 2 \, \mathrm{tr}\left( Q(\tilde{X}) w(\bar{X}_n, \tilde{X})^\top \right) + 2 \, \mathrm{tr}\left( g(\tilde{X}) S(\tilde{X})^\top \right) \qquad (4.6) \\ &\quad + \mathrm{tr}\left( V(\tilde{X}) w(\tilde{X}, \tilde{X}) \right) + \mathrm{tr}\left( \Sigma(\tilde{X})^{-1} T(\tilde{X}) \right). \end{aligned}$$

Now recall that $\Sigma(\tilde{X}) = r(\tilde{X}) + c(\tilde{X}, \tilde{X}) - c(\bar{X}_n, \tilde{X})^\top K_n^{-1} c(\bar{X}_n, \tilde{X})$. Again recursing with the chain rule, first through the diagonal matrix $r(\tilde{X})$ via Eq. (2.4), gives

$$\frac{\partial r(\tilde{X})}{\partial \tilde{x}_{i(p)}} = \frac{\partial \tau^2 \Lambda(\tilde{X})}{\partial \tilde{x}_{i(p)}} = \tau^2 \frac{\partial K_{(\delta)}(\tilde{X}, \bar{X}_n)}{\partial \tilde{x}_{i(p)}} (C_{(\delta)} + g_{(\delta)} A_n^{-1})^{-1} \Delta_n. \qquad (4.7)$$

It is worth observing here how relative noise levels, smoothed through $\Delta_n$ and distance to $\bar{X}_n$, impact the potential value of new design elements $\tilde{X}$. In particular, high-variance $\bar{x}_i$ have low impact unless $a_i$ is also large, in which case there is an attractive force encouraging replication (elements of $\tilde{X}$ nearby $\bar{X}_n$). The last component of $\partial \Sigma(\tilde{X}) / \partial \tilde{x}_{i(p)}$ relies on $\partial c(\bar{X}_n, \tilde{X}) / \partial \tilde{x}_{i(p)}$ through a quadratic form:

$$\frac{\partial}{\partial \tilde{x}_{i(p)}} \, c(\bar{X}_n, \tilde{X})^\top K_n^{-1} c(\bar{X}_n, \tilde{X}) = c(\bar{X}_n, \tilde{X})^\top K_n^{-1} \frac{\partial c(\bar{X}_n, \tilde{X})}{\partial \tilde{x}_{i(p)}} + \left\{ c(\bar{X}_n, \tilde{X})^\top K_n^{-1} \frac{\partial c(\bar{X}_n, \tilde{X})}{\partial \tilde{x}_{i(p)}} \right\}^\top. \qquad (4.8)$$

The structure of this component's derivative reveals how new design elements $\tilde{X}$ repel one another and push away from existing points $\bar{X}_n$. In other words, the forces described in Eqs. (4.7–4.8) trade off, encouraging both spread to space-fill and compression toward replication depending on the noise level $r(\cdot)$.

Finally, for Eq. (4.5) we need $\partial W_{n+M} / \partial \tilde{x}_{i(p)}$. Our earlier expression for $w(x_i, x_j)$ was generic; however, derivatives are required across each of $d$ input dimensions for the gradient, so here we acknowledge a separable kernel structure for completeness. Component $W_{n+M}^{(i,j)}$ follows

$$w(x_i, x_j) = \int_{x \in D} c(x_i, x) c(x_j, x) \, dx = \prod_{k=1}^{d} \int_{x \in [0,1]} c(x_{i(k)}, x) c(x_{j(k)}, x) \, dx = \prod_{k=1}^{d} w_k(x_{i(k)}, x_{j(k)}).$$

When differentiating with respect to $\tilde{x}_{i(p)}$, only the $(n+i)^{\mathrm{th}}$ row/column of $\partial W_{n+M} / \partial \tilde{x}_{i(p)}$ is nonzero. Those entries are

$$\frac{\partial W_{n+M}^{(n+i, j)}}{\partial \tilde{x}_{i(p)}} = \frac{\partial w_p(\tilde{x}_{i(p)}, x_{j(p)})}{\partial \tilde{x}_{i(p)}} \prod_{k=1, k \neq p}^{d} w_k(\tilde{x}_{i(k)}, x_{j(k)}).$$

For a Gaussian kernel, $w_k(\cdot, \cdot)$ is calculated with the error function $\mathrm{erf}(z) = \frac{2}{\sqrt{\pi}} \int_0^z e^{-t^2} \, dt$ as

$$w(x_i, x_j) = \frac{\sqrt{2\pi\theta}}{4} \exp\left( -\frac{(x_i - x_j)^2}{2\theta} \right) \left[ \mathrm{erf}\left( \frac{2 - (x_i + x_j)}{\sqrt{2\theta}} \right) + \mathrm{erf}\left( \frac{x_i + x_j}{\sqrt{2\theta}} \right) \right],$$

for $1 \leq i, j \leq n$, and with derivative

$$\begin{aligned} \frac{\partial w(x, x_i)}{\partial x} = \sqrt{\frac{\pi}{8\theta}} \exp\left( -\frac{(x - x_i)^2}{2\theta} \right) \Bigg[ & (x - x_i) \left\{ \mathrm{erf}\left( \frac{x + x_i - 2}{\sqrt{2\theta}} \right) - \mathrm{erf}\left( \frac{x + x_i}{\sqrt{2\theta}} \right) \right\} \\ & + \sqrt{\frac{2\theta}{\pi}} \left\{ \exp\left( -\frac{(x + x_i)^2}{2\theta} \right) - \exp\left( -\frac{(x + x_i - 2)^2}{2\theta} \right) \right\} \Bigg]. \end{aligned}$$
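The closed forms above are easy to validate numerically. This Python sketch (the lengthscale $\theta$ and test points are arbitrary) checks $w(\cdot,\cdot)$ against midpoint quadrature and its derivative against central finite differences:

```python
import numpy as np
from math import erf, sqrt, pi, exp

theta = 0.2  # arbitrary lengthscale for the 1d Gaussian kernel

def w(xi, xj):            # closed-form int_0^1 c(xi, x) c(xj, x) dx
    s = xi + xj
    return (sqrt(2 * pi * theta) / 4) * exp(-(xi - xj) ** 2 / (2 * theta)) * (
        erf((2 - s) / sqrt(2 * theta)) + erf(s / sqrt(2 * theta)))

def dw(x, xi):            # closed-form derivative of w(x, xi) in x
    s = x + xi
    return sqrt(pi / (8 * theta)) * exp(-(x - xi) ** 2 / (2 * theta)) * (
        (x - xi) * (erf((s - 2) / sqrt(2 * theta)) - erf(s / sqrt(2 * theta)))
        + sqrt(2 * theta / pi) * (exp(-s ** 2 / (2 * theta))
                                  - exp(-(s - 2) ** 2 / (2 * theta))))

grid = (np.arange(200000) + 0.5) / 200000
xi, xj = 0.3, 0.62
w_num = float(np.mean(np.exp(-(xi - grid) ** 2 / theta)
                      * np.exp(-(xj - grid) ** 2 / theta)))
dw_num = (w(xi + 1e-6, xj) - w(xi - 1e-6, xj)) / 2e-6
print(abs(w(xi, xj) - w_num), abs(dw(xi, xj) - dw_num))
```

Both checks agree to well below quadrature/finite-difference error, which is reassuring before wiring these pieces into a gradient-based optimizer.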

It is worth noting that to ensure positive variances, rather than being faithful to Eq. (4.2) we instead model

$$\log \Lambda(\tilde{X}) = K_{(\delta)}(\tilde{X}, \bar{X}_n) (C_{(\delta)} + g_{(\delta)} A_n^{-1})^{-1} \log \Delta_n.$$

Thus $\partial r(\tilde{X}) / \partial \tilde{x}_{i(p)}$ can be derived as

$$\frac{\partial r(\tilde{X})}{\partial \tilde{x}_{i(p)}} = \frac{\partial \tau^2 \Lambda(\tilde{X})}{\partial \tilde{x}_{i(p)}} = \tau^2 \frac{\partial K_{(\delta)}(\tilde{X}, \bar{X}_n)}{\partial \tilde{x}_{i(p)}} (C_{(\delta)} + g_{(\delta)} A_n^{-1})^{-1} \log \Delta_n \times \exp\left( K_{(\delta)}(\tilde{X}, \bar{X}_n) (C_{(\delta)} + g_{(\delta)} A_n^{-1})^{-1} \log \Delta_n \right).$$

4.1.3 Implementation details and illustration

With closed-form IMSPE and gradient in hand, selecting $M$-sized batches of new runs becomes an optimization problem of dimension $Md$ that can be off-loaded to a library. When each dimension is constrained to $[0, 1]$, i.e., assuming coded inputs, we find that the L-BFGS-B algorithm (Byrd et al., 2003) is appropriate, and generally works well even in this high-dimensional setting. Our implementation uses the built-in optim function in R, and is careful to avoid redundant work in evaluating the objective and gradient, which share many common building blocks and subroutines.
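In the same spirit, here is a minimal Python/SciPy stand-in for that optimization (the dissertation uses R's optim; the brute-force IMSPE, the crude noise interpolation, and all constants below are illustrative only):

```python
import numpy as np
from scipy.optimize import minimize

theta, M = 0.05, 3
xbar = np.array([0.1, 0.4, 0.6, 0.9])    # existing sites (hypothetical)
r = np.array([0.02, 0.02, 0.3, 0.3])     # low noise left, high noise right
grid = (np.arange(400) + 0.5) / 400      # quadrature grid on D = [0, 1]

def c(a, b):
    return np.exp(-(np.asarray(a)[:, None]
                    - np.asarray(b)[None, :]) ** 2 / theta)

def imspe(xt):                           # brute-force I_{N+M} on the grid
    X = np.concatenate([xbar, np.asarray(xt)])
    rr = np.concatenate([r, np.interp(xt, xbar, r)])  # toy noise "smoother"
    Ki = np.linalg.inv(c(X, X) + np.diag(rr))
    Cx = c(grid, X)
    return float(np.mean(1.0 - np.einsum('ij,jk,ik->i', Cx, Ki, Cx)))

rng = np.random.default_rng(1)
starts = rng.uniform(size=(5, M))        # multi-start to dodge local minima
fits = [minimize(imspe, s, method='L-BFGS-B', bounds=[(0, 1)] * M)
        for s in starts]
best = min(fits, key=lambda f: f.fun)
print(np.round(np.sort(best.x), 3), round(best.fun, 5))
```

Here the gradient is approximated by finite differences inside SciPy; the dissertation instead supplies the closed-form gradient from Section 4.1.2, which is what makes large $Md$ searches fast.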

Figure 4.1: Batch IMSPE optimization iterations from initial (blue dots) to final (green crosses) locations. Three optimization epochs are indicated by arrows. An overlaid heatmap shows the estimated standard deviation surface $r(x)$.

Figure 4.1 provides an illustrative view of this new capability. We started with a space-filling design $\bar{X}_n$ in $[0, 1]^2$, shown as open circles. The true noise surface, $r(x)$, was derived from a standard bivariate Gaussian density with location $\mu = (0.7, 0.7)$ and scale $\Sigma = 0.02 \cdot I_2$. The heatmap indicates the hetGP-estimated standard deviation surface based on runs gathered at $\bar{X}_n$. The higher-noise region is more yellow. We then set out to calculate coordinates of a new $M = 20$-sized batch $\tilde{X}$ via IMSPE. Search is initialized with an LHS, shown in the figure as blue dots. Arrows originating from those dots show progress of the derivative-based search, broken into three epochs for dramatic effect. Iterating to convergence requires hundreds of objective/gradient evaluations in the $Md = 40$-dimensional search space, but these each take a fraction of a second because there are no large cubic operations. At the terminus of those arrows are green crosses, indicating the final locations of the new batch $\tilde{X}^*$. Observe how some of these spread out relative to one another and to the open circles (mostly in the red, low-noise region), while others (especially near the yellow, high-noise region) are attracted to each other. At least one new replicate was found. Thus the IMSPE criterion strikes a balance between filling the space and creating replicates, which are good for separating signal from noise.

L-BFGS-B only guarantees a local minimum since the IMSPE objective is not convex. In fact, IMSPE surfaces become highly multi-modal as more points are added, with the number of minima growing linearly in $n$, the number of unique existing design elements, even in the $M = 1$ case. Larger batch sizes $M > 1$ exacerbate this still further. There is also a "label-switching problem": swap two elements of the batch and the IMSPE is the same. To avoid seriously inferior local minima in our solutions for $\tilde{X}^*$ we deploy a multi-start scheme, launching multiple L-BFGS-B routines simultaneously from novel sets of space-filling initial $\tilde{X}^{(0)}$, and choosing the best at the end.

4.2 Hunting for replicates

Replication, meaning repeated simulations Y (x) at fixed x, keeps cubic costs down [Eqs. (4.1) and (4.5), reducing from N to n] and plays an integral role in separating signal from noise (Ankenman et al., 2010, Binois et al., 2018b), a win-win for statistical and computational efficiency. Intuitively, replicates become desirable in otherwise poorly sampled high variance regions (Binois et al., 2018c). Unfortunately, a numerical scheme for optimizing IMSPE will never precisely yield replicates because tolerances on iterative convergence cannot be driven identically to zero. Consider again Figure 4.1, focusing now on the two new design points in the yellow region which went to similar final locations along their optimization paths. These look like potential replicates, but their coordinates don’t match.

One possible solution resolving near-replicates into actual ones is to introduce a secondary set of tolerances in the input space, whereby closeness implying "effective replication" can be deduced after the numerical solver finishes. This worked well for Binois et al., in part because of an additional lookahead device (Ginsbourger and Le Riche, 2010) explicitly favoring replication. But for us such tactics are unsatisfying on several fronts: lookahead isn't manageable for $M \gg 1$ sized batches; additional input tolerances are tantamount to imposing a grid; such a scheme doesn't directly utilize IMSPE information; and finally, whereas one-at-a-time acquisition presents more opportunities to make adjustments in real time, our batch setting puts more eggs in one basket. We therefore propose the following post-processing scheme on each batch, which we call "backtracking".

4.2.1 Backtracking via merge

For a new batch of size $M$, the possible number of new replicates ranges from zero to $M$. L-BFGS-B optimization yields $M$ unique coordinate tuples, but some may be very close to one another or to the $n$ existing unique sites. Below we describe a simple greedy scheme for ordering and valuing those $M$ locations as potential "effective replicates". Choosing among those alternatives happens in a second phase, described momentarily in Section 4.2.2.

Begin by recording the IMSPE of the solution $\tilde{X}_M \equiv \tilde{X}^*$ provided by the optimizer: $I_{n+M}(\tilde{X}_M)$. This corresponds to the no-backtrack/no-replicate option. Set iterator $s = 0$ so that $\tilde{X}_{m_s}$ refers to this potential batch with $m_s = M$ unique design elements, and let $d_s = 0$.

Move to the first iteration, $s = 1$. Among the $m_{s-1}$ unique sites in $\tilde{X}_{m_{s-1}}$, find the one which has the smallest minimum distance $d_s$ to other unique elements in $\tilde{X}_{m_{s-1}}$ and existing sites $\bar{X}_n$, with ties broken arbitrarily. Entertain a new batch $\tilde{X}_{m_s}$ by merging the sites involved in that minimum $d_s$-distance pair. If both are members of the previous batch $\tilde{X}_{m_{s-1}}$, then choose a midway value for their new setting(s) in $\tilde{X}_{m_s}$. Otherwise, take the site location from the existing (immovable) unique design element of $\bar{X}_n$. Both imply $m_s = M - s$.

Calculate $I_{n+m_s}(\tilde{X}_{m_s})$. Increment $s \leftarrow s + 1$ and repeat unless $s = M$. Break out of this loop early if both elements of the minimum-distance pair from $\tilde{X}_{m_{s-1}}$ are existing design locations from $\bar{X}_n$, which is only possible for $s \geq 2$. Let $S$ indicate the number of times through the loop, $s = 0, \ldots, S \leq M$, i.e., one plus the number of merges.
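The following Python sketch mimics this loop in 1d under simplifying assumptions (Gaussian kernel, constant noise $r$, and made-up sites); it tracks unique-site replicate counts and records IMSPE after each greedy merge:

```python
import numpy as np

theta, r = 0.05, 0.1                     # assumed kernel scale and noise
grid = (np.arange(500) + 0.5) / 500      # quadrature grid on D = [0, 1]

def imspe(sites, counts):                # E - tr(Kn^{-1} Wn) over unique sites
    X, a = np.asarray(sites, float), np.asarray(counts, float)
    K = np.exp(-(X[:, None] - X[None, :]) ** 2 / theta) + np.diag(r / a)
    Cx = np.exp(-(grid[:, None] - X[None, :]) ** 2 / theta)
    return float(np.mean(1 - np.einsum('ij,jk,ik->i', Cx,
                                       np.linalg.inv(K), Cx)))

def backtrack(new, old):
    """new/old map site -> replicate count; return (m_s, IMSPE) per step."""
    trace = []
    while True:
        sites = sorted(set(new) | set(old))
        counts = [new.get(s, 0) + old.get(s, 0) for s in sites]
        trace.append((len(new), imspe(sites, counts)))
        if not new:
            break
        # closest pair involving at least one movable (new) site
        d, x, y = min((abs(x - y), x, y)
                      for x in new for y in sites if x != y)
        if y in new:                     # both new: merge at the midpoint
            m = (x + y) / 2
            new[m] = new.get(m, 0) + new.pop(x) + new.pop(y)
        else:                            # snap the new runs onto an old site
            old[y] = old.get(y, 0) + new.pop(x)
    return trace

old = {0.2: 3, 0.5: 3, 0.8: 3}           # existing design, 3 replicates each
new = {0.22: 1, 0.53: 1, 0.55: 1, 0.9: 1}  # freshly optimized batch
for s, (m, I) in enumerate(backtrack(dict(new), dict(old))):
    print(f"s={s}  unique new sites={m}  IMSPE={I:.5f}")
```

This toy version always runs the loop to $s = M$; the dissertation's scheme additionally stops early once the closest pair consists of two immovable sites.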

Figure 4.2 provides an illustration; settings of $f(x)$ and $r(x)$ mirror Figure 4.1. The existing design $\bar{X}_n$ has $n = 100$ unique elements, shown as open circles in the left panel. Each run is replicated three times so that $N = 300$. A new batch of size $M = 24$ is sought. Red crosses represent the optimized $\tilde{X}_{m_0} = \tilde{X}_M$ from L-BFGS-B. Numbered arrows mark each backtracking step. Observe that the first two of these (almost on top of one another near the right-hand boundary) involve novel batch elements, whereas all others involve one of the $n$ existing sites. Aesthetically, the first five or so look reasonable, being nearby the high-variance (top-right) region. Replication is essential in high-variance settings.


Figure 4.2: Left: backtracking with merge; gray arrows connect optimal $\tilde{X}_{m_s}$, with numbers indicating $s = 1, \ldots, M$. Right: IMSPE changes over the number of replicates. Merging steps that are finally taken are shown in blue. Fitted segmented regression lines are overlaid.

4.2.2 Selecting among backtracked batches

To quantify and ultimately automate that eyeball judgment, we investigated $I_{n+m_s}(\tilde{X}_{m_s})$ versus $s$, the number of replicates in the new batch. The right panel of Figure 4.2 shows the pattern corresponding to the backtracking steps on the left. Here, the sequence of $I_{n+m_s}(\tilde{X}_{m_s})$ values is mostly flat for $s = 0, \ldots, 3$, then increasing thereafter. We wish to minimize IMSPE, except perhaps preferring exact replicates when IMSPEs may technically differ but are very similar. Aesthetically, that "change point" happens at $s = 7$, where IMSPE jumps into a new and higher regime.

To operationalize that observation we experimented with a number of change-point detection schemes. For example, we tried the tgp (Gramacy, 2007, Gramacy and Taddy, 2010) family of Bayesian treed constant, linear, and GP models. This worked great, but was computational overkill. We also considered placing $d_s$, the minimizing backtracked pairwise distances, rather than $s$-values on the x-axis. Although the behavior with this choice was distinct, it yielded more-or-less equivalent behavior in broad terms.

We ultimately settled on the following custom scheme, recognizing that the left-hand regime was usually constant (i.e., almost flat), and the right-hand regime was generally increasing.¹ To find the point of shift between those two regimes, we fit $S + 1$ two-segment polynomial regression models, with break points $s = 0, \ldots, S$ respectively, with the first (left) regime being of order zero (constant) and the second (right) being of order four. We then chose as the location $\hat{s}$ the one whose two fits provide the lowest in-sample MSE. The optimal pair of polynomial fits is overlaid on the right panel of Figure 4.2, with groups color-coded to match arrows in the left panel.
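A minimal Python version of that selection rule (the dissertation's code is in R; the data below are synthetic) fits a constant left segment and a quartic right segment at each candidate break and keeps the break with the lowest pooled in-sample error:

```python
import numpy as np

def change_point(y):
    """Break s with a constant fit on y[:s], quartic on y[s:], lowest SSE."""
    best = None
    for s in range(1, len(y) - 4):       # quartic needs >= 5 right points
        left, right = y[:s], y[s:]
        sse = np.sum((left - left.mean()) ** 2)
        xs = np.arange(s, len(y))
        coef = np.polyfit(xs, right, 4)
        sse += np.sum((np.polyval(coef, xs) - right) ** 2)
        if best is None or sse < best[1]:
            best = (s, sse)
    return best[0]

# two-regime toy trace: flat IMSPE early, then a growing regime
rng = np.random.default_rng(2)
flat = 0.003 + rng.normal(0, 1e-5, 8)
rise = 0.003 + 2e-4 * np.arange(1, 18) ** 1.5 + rng.normal(0, 1e-5, 17)
print(change_point(np.concatenate([flat, rise])))
```

On a clean two-regime trace (e.g., a step function) this recovers the break exactly; on noisy IMSPE sequences it behaves like the eyeball rule described in the text.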

Figure 4.3 shows four other examples under the same broad settings but with different random initial $n$-sized designs. The situation in the top-left panel matches that of Figure 4.2 and is by far the most common. The top-right panel depicts a setting where zero replicates is best, but the two-regression scheme nevertheless identifies a midway change point, suggesting a bias toward finding at least some replicates. The bottom-left panel indicates an opposite extreme. Note the small range of the IMSPE (y) axis. In such situations, where the right-hand regime has uniformly lower IMSPE than the left-hand one, we take $\hat{s}$ as the choice minimizing IMSPE in the right-hand regime. The bottom-right panel shows the case where no replicates are finally included in the new batch.

4.3 Benchmarking examples

Here we illustrate and evaluate our method on an array of test problems. We have four examples total. Two of them are 1d and 2d synthetic toy problems. The first one mirrors

¹BFGS is a local solver and backtracking is greedy, both contributing to the potential for non-monotonicity.

Figure 4.3: Three selected scatter plots of IMSPE versus number of replicates with best change-point fitted regression lines overlaid. Colors match arrows in Figure 4.2.

the 1d example from Binois et al. (2018c). The other two include a 4d ocean simulator from McKeague et al. (2005) and an 8d "real simulator" from inventory management. Metrics include out-of-sample root mean-squared prediction error (RMSPE), i.e., matching our IMSPE acquisition heuristic, and a proper scoring rule (Gneiting and Raftery, 2007, Eq. (27)) combining mean and uncertainty-quantification accuracy, which for GPs reduces to predictive log likelihood. We also consider computing time and the number of unique design elements, $n$, over total acquisitions $N$. Our gold-standard benchmark is the "pure sequential" ($M = 1$) adaptive lookahead scheme of Binois et al.; however, when relevant we also showcase other special cases. Our goal is not to beat that benchmark. Rather, we aim to be competitive while entertaining $M = 24$-sized batches, representing the number of cores on a single supercomputing node.

4.3.1 1d toy example

This 1d synthetic example was introduced by Binois et al. (2018c) to show how IMSPE acquisitions distribute over the input space in heteroskedastic settings. Here we borrow that setup to illustrate our batch scheme. The underlying true mean function is $f(x) = (6x - 2)^2 \sin(12x - 4)$, and the true noise function is $r(x) = (1.1 + \sin(2\pi x))^2$. Observations are generated as $y \sim f(x) + \epsilon$, where $\epsilon \sim N(0, \sigma^2 = r(x))$. The experiment starts with a maximin–LHS of $n_0 = 12$ locations with a random number of replicates uniform in $\{1, 2, 3\}$, so that the starting size is about $N_0 = 24$. A total of twenty $M = 24$-sized batches are used to augment the design for a total budget of $N = 504$ runs.
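For reference, generating this toy data takes only a few lines; the Python sketch below (the seed and the gridded stand-in for the maximin–LHS are our own choices) produces the mean, noise, and replicated initial design just described:

```python
import numpy as np

f = lambda x: (6 * x - 2) ** 2 * np.sin(12 * x - 4)     # true mean
r = lambda x: (1.1 + np.sin(2 * np.pi * x)) ** 2        # true noise variance

rng = np.random.default_rng(42)
x0 = (np.arange(12) + 0.5) / 12          # n0 = 12 sites (grid stand-in for LHS)
reps = rng.integers(1, 4, size=12)       # 1, 2, or 3 replicates each
X = np.repeat(x0, reps)                  # full design: roughly N0 = 24 runs
y = f(X) + rng.normal(0, np.sqrt(r(X)))  # heteroskedastic observations
print(len(X))
```

Note how the noise standard deviation, not $f$ itself, is what varies most dramatically with $x$, which is exactly what drives replicate placement.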

Panels in Figure 4.4 illustrate this process in six epochs. Open circles indicate observations, with more being added in batches over the epochs. The dashed sine curve indicates the relative noise level $r(x)$ over the input space; vertical segments at the bottom highlight the degree of replication at each unique input. Observe how more runs are added in high-noise regions, and the degree of replication is higher there too. This is strikingly similar to the behavior reported by Binois et al.

4.3.2 2d toy example

Elements of this example have been in play in previous illustrations, including Figures 4.1–4.2. The true mean function $f(x)$ is defined as

$$f(x) = f(x_1, x_2) = 20 \left( \frac{a_1}{\exp(a_1^2 + a_2^2)} + \frac{a_3}{\exp(a_3^2 + a_4^2)} \right),$$

Figure 4.4: The top-left panel shows the initial design observations. Remaining panels display the sequential design process after adding 1, 5, 10, 15 and 20 batches.

where $a_1 = 6x_1 - 4.1$, $a_2 = 6x_2 - 4.1$, $a_3 = 6x_1 - 1.7$, and $a_4 = 6x_2 - 1.7$. The true noise surface, $r(x)$, is a bivariate Gaussian density with location $\mu = (0.7, 0.7)$ and scale $\Sigma = 0.02 \cdot I_2$. Figure 4.5 provides a visual using color for $f(x)$ and contours for $r(x)$. We deliberately made the mean surface have the same signal structure in the bottom-left and top-right regions. However, the top-right region is exposed to high noise intensity while the bottom-left region is almost noise-free, creating two signal-to-noise regimes.
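To make the two-regime construction concrete, here is a small Python transcription of $f$ and $r$ (constants from the text; the probe points below are our own), confirming that the two bumps carry essentially the same signal shifted by $(0.4, 0.4)$:

```python
import numpy as np

def f(x1, x2):                            # two-bump mean surface
    a1, a2 = 6 * x1 - 4.1, 6 * x2 - 4.1
    a3, a4 = 6 * x1 - 1.7, 6 * x2 - 1.7
    return 20 * (a1 / np.exp(a1 ** 2 + a2 ** 2)
                 + a3 / np.exp(a3 ** 2 + a4 ** 2))

def r(x1, x2):                            # Gaussian noise bump at (0.7, 0.7)
    return np.exp(-((x1 - 0.7) ** 2 + (x2 - 0.7) ** 2) / 0.04) \
        / (2 * np.pi * 0.02)

# same signal, very different noise, when shifting by (0.4, 0.4)
print(round(float(f(0.3, 0.25) - f(0.7, 0.65)), 4))
print(round(float(r(0.3, 0.25)), 4), round(float(r(0.7, 0.65)), 4))
```

The near-identical $f$ values paired with orders-of-magnitude different $r$ values are what create the two signal-to-noise regimes described above.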

Design aspects of our experiment(s) were set up as follows. We begin with an $n_0 = 20$-sized maximin–LHS with five replicates upon each for $N_0 = 100$ total simulations. This is followed by ten batches of IMSPE acquisition with backtracking for 240 new runs ($N = 340$ total). Figure 4.6 shows how the first six batches distributed in the input space, with one panel for each. Color is used to track batches over accumulated runs; numbers indicate degrees of


Figure 4.5: The heatmap shows the mean surface $f(x)$. Lighter colors correspond to higher values. Contours of $r(x)$ are overlaid.


Figure 4.6: IMSPE design in batches: gray dots are initial design points; gray contours show signal and noise contrast; numbers indicate replicate multiplicity. The last two panels summarize all new points from six batches and all design points, respectively.

replication. For example, the first batch had two replicates (one at a unique input, one at an existing open circle), whereas the third batch had many more. Observe that as batches progress, more replicates and more unique locations cluster near the noisy top-right region of the input space. The final two panels summarize all (new) points involved in those first six batches, including the initial design.

Figure 4.7: RMSPE, score, time per iteration fitting the hetGP model, and the aggregate number of unique design locations from 50 MC repetitions.

Figure 4.7 offers a comparison to Binois et al. (2018c)'s pure sequential ($M = 1$) strategy in a fifty-repetition MC exercise. Randomization is over the initial maximin–LHS, noise deviates in simulating the response, and novel LHS testing designs of size 500. Additionally, we include a "no backtracking" comparator, omitting the search-for-replicates step(s) described in Section 4.2. For the pure sequential benchmark, we calculate RMSPE and score after every 24 steps to make it comparable to the batch-sequential design methods. In terms of RMSPE, all three methods perform about the same. Under the other three metrics, batch-with-backtracking is consistently better than the non-backtracking version: more replicates, faster hetGP fits due to smaller $n$, and higher score after batch three. The degree of replication yielded by backtracking is even greater than the pure sequential scheme after batch four. Also from batch four, batch IMSPE outperforms pure sequential design on score. Thus, we conclude that our batch-sequential design with backtracking achieves the goal of adding $M = 24$ runs at once, filling out an entire supercomputing node, without noticeably deleterious effects.

4.3.3 Ocean oxygen

The ocean-oxygen simulator models oxygen concentration in a thin water layer deep in the ocean; see McKeague et al. (2005). For details on how we generate simulations here, see Herbei and Berliner (2014) and Gramacy (2020, Section 10.3.4).² The simulator is stochastic and highly heteroskedastic. Visuals are provided by our references above. There are four real-valued inputs: two spatial coordinates (longitude and latitude) and two diffusion coefficients. We consider an MC experiment initialized with an $n_0 = 40$-sized maximin–LHS, with five replicates upon each ($N_0 = 200$). We then add ten $M = 24$-sized batches so that $N = 440$ runs are collected by the end. We cannot easily visualize the results in a 4d space, but the analog of our 2d toy results (Figure 4.6) is provided in Figure 4.8.

²Implementation is provided at https://github.com/herbei/FK_Simulator.

Figure 4.8: Ocean simulator results in 30 MC repetitions: RMSPE, score, time per batch and the aggregate number of unique design locations n.

In terms of out-of-sample RMSPE and score, all methods exhibit similar performance. The purely sequential design method consistently yields more replicates. Thus, it also takes the least time per iteration for updating via hetGP. Our backtracking scheme yields a moderate proportion of replicates with the same performance as measured by RMSPE and score, compared to the version without backtracking. Notice that these metrics do not necessarily improve monotonically over batches. This could be attributed to the unknown "true" mean and noise functions in this real-world simulator setting. Calculation of RMSPE and score is out-of-sample, on novel random testing sets, interjecting an extra degree of stochasticity into these assessments.

4.3.4 Assemble-to-order

The assemble-to-order (ATO) problem (Hong and Nelson, 2006) involves a queuing simulation targeting inventory management scenarios. It was designed to help determine optimal inventory levels for eight different items to maximize profit. Here we simply treat it as a black-box response surface. Although the signal-to-noise ratio is relatively high, ATO simulations are known to be heteroskedastic (Binois et al., 2018b). We utilized the MATLAB implementation described by Xie et al. (2012) through R.matlab (Bengtsson, 2018) in R. Our setup duplicates the MC of Binois et al. (2018c) in thirty replicates, in particular by initializing with an $n_0 = 100$-sized random design in the 8d input space, paired with random degrees of replication $a_i \sim \mathrm{Unif}\{1, \ldots, 10\}$ so that the initial design comprised about $N_0 \approx 500$ runs. Binois et al. then performed about 1500 acquisitions to end at $N = 2000$ total runs. We performed sixty-three $M = 24$-sized batches to obtain about 2012 runs.

Since the 8d inventory input vector must be comprised of integers {0,..., 20}, we slightly modified our method in a manner similar to Binois et al.: inputs are coded to [0, 1] so that IMSPE optimization transpires in an M × [0, 1]^8 space. When backtracking, merged IMSPEs are calculated via rounded X̃_m^int's on the natural scale.

Figure 4.9 shows progress in terms of average RMSPE and score, mimicking the format of the presentation of Binois et al., whose comparators are duplicated in gray in our updated version. There are eight gray variations, representing multiple lookahead horizons (h) and two automated horizon alternatives, with “Adapt” being the gold standard. In terms of RMSPE, our batch method makes progress more slowly at first, but ultimately ends in the middle of the pack of these pure sequential alternatives. In terms of score, we start out the best, but end in the third position. Apparently, our batch scheme is less aggressive on reducing out-of-sample mean-squared error, but better at accurately assessing uncertainty.

[Figure: two rows of panels comparing the batch method against pure sequential alternatives (h = −1 through h = 4, Target, Adapt): RMSPE and score versus N, alongside final RMSPE and final score distributions.]

Figure 4.9: RMSPE and score over design size N from 30 MC repetitions.

In the 30 MC replicates our average number of new replicates per unique site was 1.64 (min 0, max 5), leading to a mean of n = 1610 (min 1606, max 1612). This is a little higher (lower replication) than n = 1086 (min 465, max 1211) reported by Binois et al. for “Adapt”. Again, we conclude that our batch method is competitive despite being faced with many fewer opportunities to re-tune the strategy over acquisition iterations.

Chapter 5

Delta smelt

Delta smelt (Hypomesus transpacificus) are small, slender-bodied fish that live in the Sacramento river delta and estuaries of San Francisco Bay. Their abundance serves as an indicator of environmental health in the bay (Rose et al., 2013). Populations declined in the latter half of the 20th century, and in 1993 they were listed as threatened under the US and California Endangered Species Acts (Fish and Wildlife Service, 1993). Nevertheless, populations continued to decline. Factors that may be contributing include entrainment by large water diversion facilities (primarily for farming), densities of zooplankton food sources, pollution, introduction of non-native species, and changes in physical habitat related to salinity and turbidity (Baxter et al., 2010). Finding the most critical factors influencing decline is important for effective resource and wildlife management and restoration.

Recent studies have applied statistical analysis using myriad data sources and methodologies, in combination with diverse species and ecosystems (e.g., Hamilton and Murphy, 2018, MacNally et al., 2010, Maunder and Deriso, 2011, Thomson et al., 2010). Although informative, research by Kimmerer and Rose (2018) suggests studies like these, based on aggregated indices of abundance, are too phenomenological. They lack fidelity and biological dynamics in the modeling of mechanisms behind the life history of delta smelt.

To better study its complicated life cycle and interface with weather and climate, Rose et al. (2013) developed a stochastic agent-based model (ABM) of the delta smelt population for the upper estuary in the bay, which simulates dynamics under a range of scenarios.


Ultimately, the goal of such computer modeling is to augment and inform statistical models, such as those above, through calibration to real data, sensitivity analysis, and to assist in determining which of several reasonably actionable levers could improve the health of the system and thus populations of delta smelt.

Rose et al.’s simulator is slow (typically 4–6 hours for a single run) and stochastic. The input configuration space is large (upwards of 13 dimensions), and the response surface is nonlinear. Separating signal from noise requires a large, costly, highly distributed HPC simulation campaign paired with a flexible meta-model. Previous campaigns fixed random number seeds, perhaps to artificially amplify signal. Our initial study with this simulator, described in Section 5.2, suggests that in some low-noise/low-signal parts of the configuration space this shortcut is harmless. However, we observe that the response surface is heteroskedastic, and moreover noise levels can vary nonlinearly. This challenges effective design and meta-modeling, a setting that is increasingly common in simulation experiments, especially those based on agent-based models (Baker et al., 2020).

The structure of this chapter is as follows. Section 5.1 provides more details about the delta smelt simulator. Section 5.2 describes a pilot study on a reduced input space, identifying challenges/appropriate modeling elements and motivating a HetGP framework. Finally, in Section 5.3 the batch sequential design method developed in Chapter 4 is applied to smelt simulations (in a larger space), collecting thousands of runs utilizing tens of thousands of core hours across a weeks-long simulation campaign. Those runs are used to conduct a sensitivity analysis to exemplify potential downstream tasks.

5.1 Agent-based model

The delta smelt simulator is described in detail by Rose et al. (2013). Its stochastic agent-based model (ABM) architecture tracks reproduction, growth, mortality and individual movement over the entire life cycle of cohorts of fish, the principal agents. Agents are modeled on a spatial grid representing nearly the entire geographic range of delta smelt in the Sacramento river delta. Daily values of environmental variables, such as water temperature, salinity, and densities of six zooplankton prey types, drive the model. These vary over geographic grid cells according to historical measurements taken from 1995–2005, comprising a ten-year study period. New agents are introduced as yolk-sac larvae into the model. Growth and maturation of feeding agents are determined stochastically based on bioenergetics and zooplankton densities. Mortality/removal of agents can be due to natural causes, starvation, and entrainment in water diversion facilities, again stochastically. Movement of larvae was modeled by particle-tracking (Kimmerer and Nobriga, 2008), while the movement of juveniles and adults was modeled as a function of salinity.

symbol  parameter    description        range           default  pilot study
my      zmorty       yolk-sac larva MR  [0.01, 0.50]    0.035    0.035
ml      zmortl       larval MR          [0.01, 0.08]    0.050    0.050
mp      zmortp       post-larval MR     [0.005, 0.05]   0.030    0.030
mj      zmortj       juvenile MR        [0.001, 0.025]  0.015    [0.005, 0.030]
ma      zmorta       adult MR           [0.001, 0.01]   0.006    0.006
mr      middlemort   river entrain MR   [0.005, 0.05]   0.020    [0, 0.05]
Pl,2    preyk(3,2)   larvae EPT 2       [0.10, 20.0]    0.200    0.200
Pp,2    preyk(4,2)   postlarvae EPT 2   [0.10, 20.0]    0.800    [0.10, 1.84]
Pp,6    preyk(4,6)   postlarvae EPT 6   [0.10, 20.0]    1.500    Pp,2
Pj,3    preyk(5,3)   juveniles EPT 3    [0.10, 20.0]    0.600    [0.1, 1.5]
Pj,6    preyk(5,6)   juveniles EPT 6    [0.10, 20.0]    0.600    Pj,3
Pa,3    preyk(6,3)   adults EPT 3       [0.01, 20.0]    0.070    0.070
Pa,4    preyk(6,4)   adults EPT 4       [0.01, 5.0]     0.070    0.070

Table 5.1: Delta smelt simulator input variables. The last column shows the settings of the pilot study in Section 5.2. MR abbreviates mortality rate; EPT means eating prey type.

The simulator has 13 input configurations. These are listed in Table 5.1 alongside a short description, variable names and ranges, default values, etc. The first set of variables involve mortality by natural causes on yolk-sac larvae, larvae, postlarvae, juvenile and adult life stages. These and other variables are unknown quantities, but sensible ranges can be set by known biology. Default values encode mortality rates declining with life stage, except during the vulnerable larval period. These values are constant within each life stage except yolk-sac larva mortality rate my, which is temperature dependent.

Entrainment mortality, mr, is due to water management and other human-caused factors. It occurs when passive (larvae) or behavioral movement (juveniles and adults) places a super-individual in a grid cell containing a water diversion facility, at which point that entire individual is removed from the simulation; e.g., imagine fish getting caught in the turbines. Human activity also affects the availability of zooplankton food sources, which in turn affects the rate of movement between life stages and indirectly affects mortality. The details are nuanced and not reviewed here. As one example, for juveniles and adults an additional increment of daily mortality rate is added to account for factors that go beyond water movement. Kimmerer and Rose (2018) emphasize how these zooplankton prey variables are particularly worthy of detailed investigation. Types 1–6 comprise Limnoithona tetraspina, calanoid copepodites, other calanoid adults, Acanthocyclops, Eurytemora affinis and Pseudodiaptomus forbesi, respectively. Not all combinations of zooplankton groups are realistically compatible with each life stage. For example, larvae consume only juvenile calanoid copepods and adults of the cyclopoid L. tetraspina. The other four groups were considered too large to be available to larvae based on laboratory analysis.

Simulation mechanics consist of tracking daily movement of agents in hourly epochs based on position and velocity, with potential for movement to nearby grid cells, coupled with “movement” in the configuration space of biological dynamics including reproduction, growth and mortality, which causes populations and cross sections of life stages to flux. At the end of the ten-year period, the simulator records annual adult abundance in each January, the annual number of adults entrained in diversion facilities, and other relevant outputs year on year. A complete output table is provided in Table 2 of Rose et al. (2013).

In this chapter, we focus on the annual finite population geometric growth rate λi ∈ R+, with i indexing years from 1995 to 2004.1 To simplify the objective, we take a geometric mean of all the annual finite population growth rates: λ = (∏_{i=1995}^{2004} λi)^{1/10}. This quantity acts as an indicator of how much the population of delta smelt is influenced by particular input configurations. For example, if the population in 1994 is a0, a simulation may conclude that the population in 2004 is a0λ^{10}. Values of λ > 1 indicate population increase from 1994 to 2004, while λ < 1 indicates decline. It usually takes about six hours for the simulator to traverse the ten-year period, but sometimes as many as ten. In some cases, like when some mortality rate inputs are set close to their upper limits, all agents are removed before 2004, causing early termination and output λ = 0.
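To make the λ aggregation concrete, here is a minimal Python sketch (illustrative only; the simulator itself is Fortran, driven from R):

```python
import math

def overall_growth_rate(annual_lambdas):
    """Geometric mean of the annual growth rates lambda_i, 1995-2004.

    A single lambda_i = 0 (all agents removed) forces lambda = 0.
    """
    n = len(annual_lambdas)
    return math.prod(annual_lambdas) ** (1.0 / n)

# Mild annual decline compounds over the ten-year period; the final
# population relative to the 1994 baseline a0 is a0 * lam**10.
lam = overall_growth_rate([0.9] * 10)   # lam is approximately 0.9
```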

Previous simulation campaigns fixed the random number seed, obviating the need for replication to understand variability. The importance of single factors was estimated by evaluating population changes after structurally eliminating each factor in the simulation(s). For example, Kimmerer and Rose (2018) studied the effect of entrainment mortality and food factors in this way. This is very different from a Saltelli-style/functional analysis of variance (e.g., Gramacy, 2020, Chapter 8.2, Marrel et al., 2009, Oakley and O’Hagan, 2004, Saltelli et al., 2000) favored by the computer surrogate modeling literature. That and other downstream applications require a meta-modeling design strategy in the face of extreme computational demands and stochasticity (assuming un-fixed seeds).

1Hydrodynamic model output is incomplete for 2005.

5.2 Pilot study

We regard the delta smelt simulator as an unknown function f : Rd → R. A meta-model fˆ fit to evaluations (xi, yi ∼ f(xi)), for i = 1,..., N, is known as a surrogate model or emulator (Gramacy, 2020). The idea is that a fast fˆ(x) could be used in lieu of the slow/expensive f(x) for downstream applications like input sensitivity analysis. Although there are many sensible choices, the canonical surrogate is based on Gaussian processes (GPs).

To assist with R-based surrogate modeling, we built a custom R interface to the underlying Fortran library, automating the passing of input configuration files and parsing of outputs through ordinary function I/O. The Rmpi package (Yu, 2002) facilitates cluster-level parallel evaluation for distributed simulation through a message passing interface (MPI) on our Advanced Research Computing (ARC) HPC facility at Virginia Tech.

To test that interface and explore modeling and design options, we ran a limited delta smelt simulation campaign over six input factors under a maximin–LHS of size n = 96 (via lhs; Carnell, 2020) with five replicates for each combination. Juvenile and river entrainment mortalities mj and mr were varied over their ranges with the rest being fixed at their default values from Table 5.1. Post-larval (Pp,2) and juvenile (Pj,3) prey parameters were allowed to vary over their ranges, with the type 6 analogs taking on identical values (Pp,6 = Pp,2 and Pj,6 = Pj,3). Other prey types were fixed to their default settings, making the effective input dimension four. Twenty 24-core VT/ARC cluster nodes were fully occupied in parallel in order to get all N = 480 runs in about six hours.

We fit the simulation data using hetGP, with inputs XN coded to the unit cube [0, 1]^4 and with YN derived from log λi, for i = 1,..., 480, using yi ≡ −6 log 10 in the few cases where λi = 0 was returned. As a window into the fitted response surface, we plotted a selection of 1d and 2d predictive mean/variance slices in Figure 5.1, using defaults

[Figure: three columns of panels (mean surface, variance surface, 1d predictive bands) over mj and mr, with Pj,3 on the vertical axis of the 2d panels and 1d slices at five fixed Pj,3 values.]

Figure 5.1: 2d heatmap and 1d lineplot slices of predictive mean and variance for selected inputs. The numbers overlaid indicate design locations and numbers of replicates.

from Table 5.1 for the fixed variables. The first and second rows correspond to the subspaces (Pj,3 × mj) and (Pj,3 × mr), respectively. Observe in the middle column how noise intensity changes over the 2d input subspace, indicating heteroskedasticity. Both mean and variance surfaces are nonlinear. A similar, higher resolution view is offered by the 1d slices in the final column. The solid curves in the top-right panel are horizontal slices of the top-left panel with Pj,3 fixed at five different values, and analogously on the bottom-right. Predictive 95% intervals are shown as dashed lines. In both views, the width of the dashed predictive bands changes, sometimes drastically, as mj and mr are increased. Clearly mj, in the top-right panel, shows more dramatic and nonlinear mean and variance effects.

5.3 Big experiment

Motivated by the delta smelt ABM, our innovative IMSPE batch sequential design method was developed and tested in Chapter 4. Now we are almost ready to apply it to our motivating application. The plan is to scale up the pilot study of Section 5.2 and vary more quantities in the 13d input space. Time and allocation limits meant that we’d only get one crack at this, so we did one last “sanity check” before embarking on a big batch-sequential simulation campaign. We returned to the 4d pilot study described in Section 5.2, which involved N = 480 runs, and inspected the properties of two new batches, each of size 24. To understand how these 48 inputs, selected via IMSPE and backtracking based on HetGP fits, compare to the original n = 96-sized space-filling design, we plotted empirical densities of the pairwise distances within and between the two sets. See the solid-color-lined densities in the left panel of Figure 5.2. Dashed analogues offer a benchmark via sequential maximin design in two similarly sized batches. These represent an alternative, space-filling default, ignoring the HetGP model fit/IMSPE acquisition criteria.2 Note that there are relatively few pairwise distances involved in just 48 new runs, which would impact the quality of kernel density estimates.

Consider first comparing the solid and dashed green lines, capturing the spread of distances between new and old runs. Observe that the solid-green density is shifted to the left relative to the dashed one. This reveals that IMSPE-selected runs are closer to the existing ones than they would be under a space-filling design. The solid-green density is similarly shifted left compared to the distances in the old space-filling design (solid-black). The situation is a little different for distances within the new batches, shown in red. Here we have a tighter density for IMSPE compared to space-filling, meaning we have fewer short and long distances and more medium ones. We take this as evidence that the HetGP/IMSPE batch scheme is working: spreading points out to a degree, but also focusing on some regions of the input

2Sequential maximin, being model-free, doesn’t require new evaluations of the simulator.

space more than others.

Figure 5.2: Empirical density of pairwise distances from IMSPE batch and maximin sequential design for the pilot (left) and full (right) studies.
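The diagnostic amounts to comparing three empirical distributions: pairwise distances within the old design, within the new batches, and between the two sets. A self-contained Python sketch, with random stand-ins for the actual design matrices:

```python
from itertools import combinations

import numpy as np

rng = np.random.default_rng(0)

def dists_within(X):
    """All pairwise Euclidean distances within one set of points."""
    return np.array([np.linalg.norm(X[i] - X[j])
                     for i, j in combinations(range(len(X)), 2)])

def dists_between(X, Y):
    """All Euclidean distances between two sets of points."""
    return np.array([np.linalg.norm(x - y) for x in X for y in Y])

old = rng.uniform(size=(96, 4))    # stand-in for the n = 96 pilot design
new = rng.uniform(size=(48, 4))    # stand-in for two size-24 batches

d_old = dists_within(old)          # 96 * 95 / 2 = 4560 pairs
d_new = dists_within(new)          # 48 * 47 / 2 = 1128 pairs
d_x = dists_between(new, old)      # 48 * 96 = 4608 pairs
# Kernel density estimates of d_old, d_new and d_x give curves like Figure 5.2's.
```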

5.3.1 Setup and acquisitions

Encouraged by these results, and the simulations in Section 4.3, we turned to the big campaign comprising our “full” analysis. Based on the outcome of the pilot study, known biology, and the design of the delta smelt simulator, our colleagues recommended exploring a ten-dimensional input space on a 7d manifold described in Table 5.2, augmenting Table 5.1 with a new column. We expanded the effective input domain by three dimensions, and slightly adjusted ranges and relationships among the original inputs. Specifically, we extended my and began to vary Pl,2, Pa,3, and Pa,4, with anchored variables Pp,6 = Pp,2 × 1.75 + 0.05,

Pj,3 = Pj,6, and Pa,3 = Pa,4. Inputs ml, mp, and ma remain fixed at their default values.

To explore the 7d input space, we begin with a maximin-LHS of size n0 = 192, each location with five replicates for a total of N0 = 960 initial runs. We aim to double this simulation effort, collecting a total of N = 1920 runs, by adding 40 subsequent batches of size M = 24.

symbol  range           default  pilot study     full study
my      [0.01, 0.50]    0.035    0.035           [0.02, 0.05]
ml      [0.01, 0.08]    0.050    0.050           0.050
mp      [0.005, 0.05]   0.030    0.030           0.030
mj      [0.001, 0.025]  0.015    [0.005, 0.030]  [0.005, 0.030]
ma      [0.001, 0.01]   0.006    0.006           0.006
mr      [0.005, 0.05]   0.020    [0, 0.05]       [0, 0.1]
Pl,2    [0.10, 20.0]    0.200    0.200           [0.1, 0.5]
Pp,2    [0.10, 20.0]    0.800    [0.10, 1.84]    [0.10, 1.84]
Pp,6    [0.10, 20.0]    1.500    Pp,2            1.75Pp,2 + 0.05
Pj,3    [0.10, 20.0]    0.600    [0.1, 1.5]      [0.1, 1.5]
Pj,6    [0.10, 20.0]    0.600    Pj,3            Pj,3
Pa,3    [0.01, 20.0]    0.070    0.070           [0.05, 0.15]
Pa,4    [0.01, 5.0]     0.070    0.070           Pa,3

Table 5.2: Augmenting Table 5.1 to show the settings of the “full” experiment.

This took a total of 44 days, requiring slightly more than one day per batch, including HetGP update times, IMSPE evaluation and backtracking, and any time spent waiting in the queue on the ARC HPC facility at Virginia Tech. Inevitably, some hiccups prevented a fully autonomous scheme. We discovered that, in at least one case, what seemed to be a conservative request of 10 hours of job time per batch (of runs that usually take 4–6 hours) was insufficient. We had to manually re-run those failed simulations, and subsequently upped the request to 14 hours. This bigger demand led to longer queuing times even though the average execution time was on par with previous campaigns.

The right panel of Figure 5.2 shows an analog of the comparison of pairwise distances for this larger campaign. With many more distance pairs, these kernel densities are more stable than in the 4d case on the left. Nonetheless, we observe a similar pattern here in 7d. IMSPE selections tend to be closer to one another and to existing locations than ordinary space-filling ones would be. We take this as an indication that the scheme was acting in a non-trivial way to reduce predictive uncertainty captured by HetGP model fits.

When training the HetGP surrogate we use log yi with yi = λi for nonzero values. Any zeros are replaced with yi = log((1/2) min_{i:λi>0} λi), where i : λi > 0 represents the subset of {1,..., N} indexing positive outputs. This leads to slightly different y-axis scales for visuals, i.e., as compared to Section 5.2. However, a dynamic scheme for handling zeros was necessitated by the dynamic nature of the arrival of λ-values furnished over the batches of sequential acquisition, in particular of ones smaller than those obtained in the pilot study.
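A minimal sketch of this zero-handling transform (function name illustrative; the campaign itself used R):

```python
import math

def log_lambdas(lams):
    """Log-transform simulator outputs, replacing zeros with the log of
    half the smallest positive lambda observed so far (a dynamic floor)."""
    floor = math.log(0.5 * min(l for l in lams if l > 0))
    return [math.log(l) if l > 0 else floor for l in lams]

ys = log_lambdas([1.0, 0.25, 0.0])
# The zero maps below every observed log-lambda: log(0.125) < log(0.25).
```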

[Figure: mean surface, variance surface, and 1d predictive band panels for the full experiment, over mj × Pj,3 (top row) and my × mj (bottom row); overlaid numbers give degrees of replication on batch/IMSPE selections.]

Figure 5.3: Slices for the “full” experiment updating Figure 5.1.

As an example, see Figure 5.3, which augments the mean and standard deviation slice views first provided in Figure 5.1. Here, to reduce clutter, numbers overlaid indicate the degrees of replication on only the batch/IMSPE selections. As before, these are projections over the other five dimensions, so the connection between variance and design multiplicity is weak (we can’t see how uncertainty relates to the other five inputs). Nevertheless, multiplicity in unique runs is generally higher (more 4s–6s) in the yellow regions. The first row of Figure 5.3 coincides with Figure 5.1, showing input pair mj × Pj,3. Observe that, after conditioning on more data, and despite the larger space, predictive bands over mj are narrower, especially at the boundaries. The sudden widening of the blue and black predictive intervals corresponds to the yellow spots in the middle panel. These could be signals, but could also disappear after adding more samples nearby. The second row shows a newly selected pair my × mj, replacing the flat view from Figure 5.1, which is still uninteresting in the “full” setting. A nonlinear variance is evident, being extremely high at mj = 0.018.

5.3.2 Downstream analysis

Slices are certainly not the best way to visualize a high-dimensional response surface. Moreover, there are many possible ways to utilize the information in a fitted surrogate. Our intent here is not to explore that vast space in any systematic way, but rather to illustrate potential. Here we showcase input sensitivity analysis as one possible task downstream of fitting and design. That is, we seek to determine which input variables have the greatest influence on outputs, in this example the growth rate of the fish, and which variables (if any) interact to affect changes in the response. We perform this analysis based exclusively on the N = 1920 runs obtained from the batch sequential design experiment. We could have combined these with the pilot runs, which may have reduced variability in some parts of the input space, but that could potentially introduce interpretive complications.

Sensitivity analysis for GP surrogates (Marrel et al., 2009, Oakley and O’Hagan, 2004) attempts to measure the effect of a subset of inputs on outputs by controlling them while averaging over the complement of inputs (Saltelli et al., 2000). Gramacy (2020), Chapter 8.2, provides a thorough summary and practical implementation in this context. We briefly summarize salient details here for completeness.

Let U(x) = ∏_{k=1}^{m} u_k(x_k) denote a distribution on inputs, indicating relative importance in the range of settings or nearby nominal values. We simply take this to be uniform over the study regions (Table 5.2). So-called main effects, sometimes referred to as a zeroth-order index, are calculated by varying one input variable while integrating out the others under U:

ME(x_j) ≡ E_{U_{−j}}{y | x_j} = ∫∫_{X_{−j}} y P(y | x) u_{−j}(x_1, ..., x_{j−1}, x_{j+1}, ..., x_m) dx_{−j} dy.   (5.1)

Above, P(y | x) = P(Y(x) = y) is the predictive distribution from a surrogate, say via HetGP. One may approximate this double integral via MC with LHSs over U. We used LHSs of size 10000 paired with a common grid over each variable j involved in ME(x_j). See the left panel of Figure 5.4. All inputs show a negative relationship with the response λ, with greater values leading to more dead fish. Apparently, mj and Pj,3 induce higher mean variation in the response than the others.
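The MC approximation of (5.1) reduces to fixing x_j on a grid and averaging the surrogate's predictive mean over samples of the other inputs. A schematic Python sketch, with a stand-in `predict_mean` in place of a fitted HetGP surrogate (plain uniform draws here; the dissertation used LHSs):

```python
import numpy as np

rng = np.random.default_rng(0)

def predict_mean(X):
    # Stand-in for a fitted surrogate's predictive mean; any emulator
    # (e.g., hetGP's predict in R) would slot in here.
    return -2.0 * X[:, 0] - 0.5 * X[:, 1] + 0.1 * X[:, 0] * X[:, 1]

def main_effect(j, grid, m=2, n_mc=10_000):
    """ME(x_j): fix input j at each grid value, average the predictive
    mean over a common MC sample of the remaining inputs."""
    X = rng.uniform(size=(n_mc, m))
    out = []
    for g in grid:
        Xg = X.copy()
        Xg[:, j] = g               # overwrite column j, integrate out the rest
        out.append(predict_mean(Xg).mean())
    return np.array(out)

grid = np.linspace(0.0, 1.0, 5)
me0 = main_effect(0, grid)         # decreasing: larger x_0 hurts the response
```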

Figure 5.4: Sensitivity analysis: main effects (left); first order (middle) and total sensitivity (right) from 100 bootstrap re-samples.

To further quantify the variation that each input factor contributes, we calculated first-order (S) and total (T ) indices. These assume a functional ANOVA decomposition,

m X X f(x1, . . . , xm) = f0 + fj(xj) + fij(xj, xi) + ··· + f1,...,m(x1, . . . , xm), j=1 1≤i

VarUj (EU−j {y | xj}) Sj = , j = 1, . . . , m. VarU (y)

Total sensitivity Tj is the mirror image:

\[
T_j = \frac{E\{\mathrm{Var}(y \mid x_{-j})\}}{\mathrm{Var}(y)}
= 1 - \frac{\mathrm{Var}(E\{y \mid x_{-j}\})}{\mathrm{Var}(y)}.
\]

It considers the proportion of variability that is not explained without xj. The difference between first-order and total sensitivities, i.e., Tj − Sj, may be taken as a measure of variability in y due to the interaction between input j and the other inputs.
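One simple, if not the most efficient, way to estimate both indices is double-loop MC, sketched below in numpy under uniform U. The toy additive function is an assumption for illustration; for it, S_j = T_j exactly (no interactions), with S_1 = T_1 = 1/5.

```python
import numpy as np

def sobol_indices(f, j, m, n_out=2000, n_in=500, rng=None):
    """Double-loop MC estimates of the first-order (S_j) and total
    (T_j) sensitivity indices under uniform U on [0,1]^m."""
    rng = np.random.default_rng(rng)
    var_y = f(rng.uniform(size=(100000, m))).var()   # total variance

    # S_j: variance of the conditional mean E{y | x_j}
    cond_mean = np.empty(n_out)
    for k in range(n_out):
        X = rng.uniform(size=(n_in, m))
        X[:, j] = rng.uniform()            # fix x_j, vary the rest
        cond_mean[k] = f(X).mean()
    S = cond_mean.var() / var_y

    # T_j: expected conditional variance Var{y | x_{-j}}
    cond_var = np.empty(n_out)
    for k in range(n_out):
        X = np.tile(rng.uniform(size=m), (n_in, 1))
        X[:, j] = rng.uniform(size=n_in)   # fix x_{-j}, vary x_j
        cond_var[k] = f(X).var()
    T = cond_var.mean() / var_y
    return S, T

# toy additive stand-in for a surrogate: no interactions, so S_j = T_j
f = lambda X: X[:, 0] + 2.0 * X[:, 1]
S1, T1 = sobol_indices(f, j=0, m=2, rng=0)
```

In practice one would swap in the surrogate's predictive mean for f and use LHSs rather than plain uniform draws, as in the text.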

Calculation of S and T indices is also undertaken by MC via LHS, but the details are omitted here for brevity. We repeated MC calculations of both on 100 bootstrap samples of the original data set, and provide a summary via boxplots in the middle and right panels of Figure 5.4.

These views match the main effects: mj and Pj,3 stand out among all the input variables in both plots.

T − S > 0   my     mj    mr     Pl,2   Pp,2   Pj,3   Pa,3
Mean        0.52   1     0.54   0.68   0.55   1      0.52

Table 5.3: Proportion of positive I = T − S indices for the mean process.

Using those S and T values, we computed I measuring the strength of interactions. The proportion of these I measurements which are positive is provided in Table 5.3, furnishing a so-called median probability model summary by comparison to a baseline of 0.5. Again, mj and Pj,3 flag as highly probable for impacting the response through an interaction with another variable. So they not only influence the response most, but also work interactively.

Besides that pair, the table indicates that only Pl,2 has a substantial impact on λ.
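The median-probability-style summary amounts to counting sign agreement across bootstrap refits. A sketch with made-up bootstrap index estimates (stand-ins for the 100 surrogate refits; the numbers are illustrative, not the smelt results):

```python
import numpy as np

# Hypothetical bootstrap S and T estimates: rows are 100 refits,
# columns are two illustrative inputs (one interacting, one not).
rng = np.random.default_rng(42)
S = np.column_stack([rng.normal(0.20, 0.04, 100), rng.normal(0.05, 0.04, 100)])
T = np.column_stack([rng.normal(0.35, 0.04, 100), rng.normal(0.06, 0.04, 100)])

I = T - S                       # interaction strength per bootstrap sample
prop = (I > 0).mean(axis=0)     # proportion positive, per input
flags = prop > 0.5              # median-probability-model style flag
```

Inputs whose proportion sits well above 0.5, like mj and Pj,3 in Table 5.3, flag as likely interactors.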

HetGP puts a second GP prior on the latent nuggets ∆n. Once ∆n and all the hyperparameters are estimated by maximizing the likelihood, the predictive mean of the noise process, i.e., the smoothed nuggets Λ, can be calculated over any testing data set in the domain of interest. This provides a way to assess the influence of each input variable on the heteroskedastic variance.

Figure 5.5: Sensitivity analysis for the variance process: main effects (left); first order (middle) and total sensitivity (right) from 100 bootstrap re-samples.

Applying the same procedures, sensitivity indices are calculated for the variance process of the HetGP model as well, as shown in Figure 5.5. From the left panel, we can tell that the predictive mean of the variance process is highest when mj is between 0.4 and 0.6. Again, mj and Pj,3 induce the highest variation among all the variables. In HetGP modeling, the GP lengthscale parameters for the noise process are presumed to be smaller than those of the mean process. This makes the noise surface less smooth, so there are many more outliers in the boxplots of first-order and total sensitivity indices. Still, indices for mj and Pj,3 are apparently higher than those for the other variables in both the middle and right panels. Thus, mj and Pj,3 result in the highest variation in both the mean and variance processes. Interestingly, these two variables are both related to the juvenile stage of delta smelt, which may motivate new biological findings.

The proportion of positive I = T − S is shown in Table 5.4. Compared to Table 5.3, all flags here are greater than 0.5. This indicates that more interactions exist for the variance process.

Proportion   my     mj    mr     Pl,2   Pp,2   Pj,3   Pa,3
T − S > 0    0.81   1     0.92   0.91   0.81   0.98   0.78

Table 5.4: Proportion of positive I = T − S indices for the variance process.

Chapter 6

Conclusion

6.1 Distance-distributed design for GP surrogates

In Chapter 3, we described a new scheme for design for surrogate modeling of computer experiments based on pairwise distance distributions. The idea was borne out of the occasionally puzzling behavior of more conventional maximin and LHS designs, especially as deployed as initial designs in a sequential setting. Maximin designs, and to a certain extent LHSs, lead to a highly irregular pairwise distance distribution which all but precludes the estimation of small lengthscales except when the design is very large. By deliberately targeting a simpler family of unimodal distance distributions we have found that it is possible to avoid that puzzling behavior, obtain a more accurate estimate of the lengthscale, and ultimately make better predictions and sequential design decisions. For reproducibility, the code behind our empirical work is provided in an open Git repository on Bitbucket: https://bitbucket.org/gramacylab/betadist.

We proposed an optimization strategy for finding the best distance distributions within the Beta family conditional on the design setting, specified kernel family, design size (n) and input dimension (d). Many potential avenues for further investigation naturally suggest themselves. For simplicity, we limited our study to the isotropic Gaussian family. One could check that similar results hold for other common families like the Matérn. A more ambitious extension would be to separable structures where there is a lengthscale for each input coordinate: θ1, . . . , θd. Obtaining appropriate pairwise distance distributions in each coordinate simultaneously could prove difficult, especially in small-n, large-d settings. However, we speculate that the problem could be effectively reduced to d univariate ones. Considering nugget hyperparameters in the optimization would add yet another layer of complication. In that setting, we may wish to consider replication (i.e., zero-inflated distance distributions) as a means of separating signal from noise (Binois et al., 2018c).
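The random-swap search over designs can be caricatured as follows. The actual criterion from Chapter 3 is not reproduced here; matching the pairwise-distance distribution to the Beta(α, β) target via a Kolmogorov–Smirnov statistic, with distances scaled to [0, 1] by √d, is our stand-in for illustration.

```python
import numpy as np
from scipy import stats

def betadist_design(n, d, a, b, iters=2000, rng=None):
    """Random-swap search for an n-point design in [0,1]^d whose
    pairwise-distance distribution is close to a Beta(a, b) target
    (distances scaled to [0,1] by sqrt(d))."""
    rng = np.random.default_rng(rng)

    def ks(X):  # distance-distribution mismatch to the Beta target
        D = np.sqrt(((X[:, None] - X[None]) ** 2).sum(-1))
        pd = D[np.triu_indices(n, 1)] / np.sqrt(d)
        return stats.kstest(pd, stats.beta(a, b).cdf).statistic

    X = rng.uniform(size=(n, d))
    best = ks(X)
    for _ in range(iters):
        i, j = rng.integers(n), rng.integers(d)
        old = X[i, j]
        X[i, j] = rng.uniform()   # propose a random coordinate move
        new = ks(X)
        if new < best:
            best = new            # keep the improvement
        else:
            X[i, j] = old         # revert
    return X, best

X, fit = betadist_design(20, 2, 2.5, 4.0, iters=300, rng=0)
```

Swapping in a better search (e.g., PSO, as discussed below) or a different mismatch criterion only changes the inner loop.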

Many response surfaces from simulations of industrial systems are exceedingly smooth and slowly varying over the study region of interest. Such knowledge, when available, could translate into an a priori belief about large lengthscales θ, or even a lower bound on θ. In our empirical work, and searches for optimal $\mathrm{betadist}_{n,d}(\alpha, \beta)$ through simulated θ-values, we took a lower bound on θ of effectively zero. However, we see no reason why a different lower bound couldn’t be applied. We speculate that narrowing the range of θ, especially toward the upper end, would result in an organic preference for larger pairwise distances through the search for optimal $(\hat{\alpha}, \hat{\beta})$, and that these designs would perform more similarly to space-filling ones like maximin.

Another family of target distance distributions, i.e., besides the Beta, could prove easier to optimize over, or otherwise lead to better designs. A higher-powered search for designs, besides random swapping, might mitigate the computational burden of finding optimal distance-distributed designs, which becomes problematic when n is large. Some researchers have recently had success with particle swarm optimization (PSO) in design settings, like minimax design (Chen et al., 2015), which might port well to the distance-distribution setting and the lhsbeta hybrid. Perhaps the most important take-home message from this manuscript is that maximin designs can be awful. LHSs are better, because they avoid a multi-modal distance distribution and, simultaneously, a degree of aliasing through their one-dimensional uniformity property. However, we argue that the most important thing is to have a good design for hyperparameter inference, which neither method targets directly. In fact, random design is better than both in this respect, which is perhaps surprising. If the hyperparameters are assumed known, then LHS and maximin are great. It is worth noting that ascribing physical or interpretive meaning to lengthscale hyperparameters can be extremely challenging. Therefore, it is hard to imagine that one could consistently choose appropriate lengthscales without help from automatic procedures like MLE, which, of course, need a design.

6.2 IMSPE Batch-sequential design

Motivated by a computationally intensive stochastic agent-based model simulating the ecosystem and life cycles of delta smelt, an endangered fish, in Chapter 4 we developed a batch sequential design scheme for loading supercomputing nodes with runs in batches. We used a heteroskedastic Gaussian process (HetGP) surrogate model to acknowledge nonlinear dynamics in mean and variance, revealed in a limited pilot study, and extended a variance-based (IMSPE) scheme for sequential design under such models to allow the selection of multiple new runs at once. To facilitate numerical optimization of batch IMSPE we furnished closed-form derivatives and developed a backtracking scheme to determine if any near replicates provided by the solver were actual replicates. Only actual replicates efficiently separate signal from noise and pay computational dividends at the same time.
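The end effect of that backtracking step, converting an optimizer's near-replicates into actual replicates so they pay those dividends, can be sketched as a distance-tolerance merge. The real scheme decides by comparing IMSPE with and without the snap; the fixed tol below is a simplifying assumption.

```python
import numpy as np

def snap_to_replicates(X_new, X_exist, tol=1e-3):
    """Snap any proposed point lying within Euclidean distance `tol`
    of an existing design site onto that site, making it an exact
    replicate; otherwise leave it as a new unique location."""
    X_new = np.atleast_2d(np.asarray(X_new, dtype=float)).copy()
    for i, x in enumerate(X_new):
        d = np.linalg.norm(X_exist - x, axis=1)
        k = d.argmin()
        if d[k] < tol:
            X_new[i] = X_exist[k]   # exact replicate: reuse the site
    return X_new

X_exist = np.array([[0.2, 0.2], [0.8, 0.5]])
X_new = snap_to_replicates([[0.2001, 0.2], [0.5, 0.5]], X_exist, tol=1e-2)
# first proposal becomes a replicate of (0.2, 0.2); second stays unique
```

Exact replicates allow the surrogate to pool responses at a site, which is what drives the statistical and computational savings described in the text.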

Our methods were illustrated and contrasted against previous (pure sequential/one-at-a-time) active learning strategies on several synthetic and real-simulation benchmarks. These allowed us to conclude that our scheme was no worse than previous approaches, while designing batches of runs that could fill out a supercomputing node. We then turned to our motivating delta smelt scenario to undertake a simulation campaign with thousands of runs (on an expanded input domain compared to the pilot study). What would have taken more than 12000 core hours, spanning more than 500 days if run back-to-back (and not counting any queue delays), took us about 44 days (including substantial queuing time).

This order-of-magnitude reduction in compute time, without noticeable drawbacks in modeling efficiency, could have a substantial impact on the modus operandi of conducting stochastic simulation experiments in practice. Widespread university and research lab access to supercomputing facilities is democratizing the application of mathematical modeling of complex physical and biological phenomena. However, strategies for planning those experiments in this unique architectural environment are sorely needed. We think the advances reported here take an important first step.

Simulations in hand, there are many interesting analyses which can be performed downstream. We provided some visuals based on slices and performed an input sensitivity analysis in order to determine which factors have the largest effect on smelt mortality in this particular system. Our choice of IMSPE suits this analysis well because it reduces variance globally, and our Saltelli et al.-style indices emphasize decomposition of variance. Extending Binois et al.’s IMSPE calculation to other downstream tasks has become a cottage industry of late. Examples include sequential learning of active subspaces (Wycoff et al., 2019), and level-set finding and Bayesian optimization (Lyu et al., 2018). Cole et al. (forthcoming) adapt a similar calculation for large-scale local Gaussian process approximation via inducing points. We see no barriers to extending these schemes similarly, to batch analogs of one-at-a-time acquisitions. Calibration to field data (e.g., Kennedy and O’Hagan, 2001), say sampling of actual delta smelt, remains on the frontier of design for surrogate modeling. Baker et al. (2020) identify this as an important area for further research.

There is certainly potential for improvement even within our particular niche. The performance of our scheme relies heavily on local numerical optimization via libraries. Finding global optima for non-convex criteria in high-dimensional spaces is always a challenge. Although we get good results with L-BFGS-B, we also tried particle swarm optimization (PSO; Kennedy and Eberhart, 1995) in several capacities: replacing BFGS wholesale and finding good BFGS starting points. Improvements were consistent but minor in the grand scheme of multiple batches of sequential design. The rgenoud package of Mebane and Sekhon (2011), which hybridizes genetic optimization with gradient-based methods, might be a good alternative. To further increase computational efficiency, one could instead optimize over space-filling candidate design sites and the corresponding numbers of replicates, an approach which has been explored for generalized linear models (Li and Deng, 2018). Our scheme is tailored to fixed, known batch sizes; we illustrated M = 24 because that matched the size of our supercomputing nodes. Other batch sizes work well, but an expanded capability might support unknown batch sizes or on-demand acquisition: whenever a batch of cores is available, the model/design scheme must be ready to furnish runs. This could be accomplished by maintaining a larger M-sized queue of prioritized inputs, say following Gramacy and Lee (2009), which would need to be updated for the HetGP framework.

In this work, for the mean process, we have been employing the Gaussian kernel. Other choices of kernel function, especially those with varying smoothness, may result in different degrees of replication. Also, we have been focusing on stationary kernels. When the mean is non-stationary, the challenge of separating signal from noise would obstruct the batch sequential design method's targeting of high-noise regions. Applying non-stationary kernels in the HetGP context is an interesting direction to explore in the future. Ba et al. (2012) composited two GPs to capture a non-stationary mean, i.e., constructing a non-stationary kernel as a sum of stationary kernels. Davis et al. (2019) recently proposed a Bayesian extension of the composite GP model for non-stationary response with heteroskedastic variance. Another way to do this is to model the mean under a deep GP setup (Damianou and Lawrence, 2013; Duvenaud et al., 2014). By involving multiple layers of latent nodes, which can be seen as compositing multiple stationary kernels, deep GPs are capable of much more flexible responses.

6.3 Delta Smelt simulator

For delta smelt in particular, the current version of the simulator assumes a constant mortality rate. A newly developed version, coming online partway through our campaign, includes an option to allow the mortality rate to depend on individual size and population density. A second simulation campaign, perhaps at the same inputs selected for the first campaign/simulator or under a novel batch-sequential design, could be used to contrast regimes, update sensitivities and perform calibration.

Bibliography

Ankenman, B., Nelson, B. L., and Staum, J. (2010). “Stochastic kriging for simulation metamodeling.” Operations Research, 58, 2, 371–382.

Ba, S., Joseph, V. R., et al. (2012). “Composite Gaussian process models for emulating expensive functions.” The Annals of Applied Statistics, 6, 4, 1838–1860.

Baker, E., Barbillon, P., Fadikar, A., Gramacy, R. B., Herbei, R., Higdon, D., Huang, J., Johnson, L. R., Ma, P., Mondal, A., Pires, B., Sacks, J., and Sokolov, V. (2020). “Stochastic Simulators: An Overview with Opportunities.”

Barnett, S. (1979). Matrix Methods for Engineers and Scientists. McGraw-Hill.

Baxter, R., Breuer, R., Brown, L., Conrad, L., Feyrer, F., Fong, S., Gehrts, K., Grimaldo, L., Herbold, B., Hrodey, P., Mueller-Solger, A., Sommer, T., and Souza, K. (2010). “2010 pelagic organism decline work plan and synthesis of results.” Interagency Ecological Program for the San Francisco Estuary, California Department of Water Resources, Sacramento.

Bengtsson, H. (2018). R.matlab: Read and Write MAT Files and Call MATLAB from Within R. R package version 3.6.2.

Binois, M. and Gramacy, R. B. (2018). hetGP: Heteroskedastic Gaussian Process Modeling and Design under Replication. R package version 1.1.1.

Binois, M., Gramacy, R. B., and Ludkovski, M. (2018a). “Practical Heteroscedastic Gaussian Process Modeling for Large Simulation Experiments.” Journal of Computational and Graphical Statistics, 27, 4, 808–821.

— (2018b). “Practical heteroskedastic Gaussian process modeling for large simulation experiments.” Journal of Computational and Graphical Statistics, 0, ja, 1–41.

Binois, M., Huang, J., Gramacy, R. B., and Ludkovski, M. (2018c). “Replication or exploration? Sequential design for stochastic simulation experiments.” Technometrics, 0, ja, 1–43.

Bisset, K. R., Chen, J., Feng, X., Kumar, V. A., and Marathe, M. V. (2009). “EpiFast: a fast algorithm for large scale realistic epidemic simulations on distributed memory systems.” In Proceedings of the 23rd International Conference on Supercomputing, 430–439.

Bull, A. D. (2011). “Convergence Rates of Efficient Global Optimization Algorithms.” Journal of Machine Learning Research, 12, 2879–2904.

Byrd, R., Lu, P., Nocedal, J., and Zhu, C. (2003). “A Limited Memory Algorithm for Bound Constrained Optimization.” SIAM Journal on Scientific Computing, 16.

Carnell, R. (2020). lhs: Latin Hypercube Samples. R package version 1.0.2.

Chen, H., Loeppky, J. L., Sacks, J., and Welch, W. J. (2016). “Analysis Methods for Computer Experiments: How to Assess and What Counts?” Statistical Science, 31, 1, 40–60.

Chen, J., Mak, S., Joseph, V. R., and Zhang, C. (2019). “Adaptive design for Gaussian process regression under censoring.” arXiv preprint arXiv:1910.05452.

Chen, R.-B., Chang, S.-P., Wang, W., and Wong, H.-C. T. K. (2015). “Minimax optimal designs via particle swarm optimization methods.” Statistics and Computing, 25, 5, 975–988.

Chevalier, C. (2013). “Fast uncertainty reduction strategies relying on Gaussian process models.” Ph.D. thesis.

Chevalier, C., Bect, J., Ginsbourger, D., Vazquez, E., Picheny, V., and Richet, Y. (2014). “Fast Parallel Kriging-Based Stepwise Uncertainty Reduction With Application to the Identification of an Excursion Set.” Technometrics, 56, 4, 455–465.

Chung, M., Binois, M., Gramacy, R. B., Moquin, D. J., Smith, A. P., and Smith, A. M. (2018). “Parameter and Uncertainty Estimation for Dynamical Systems Using Surrogate Stochastic Processes.” arXiv preprint arXiv:1802.00852.

Cressie, N. (1985). “Fitting variogram models by weighted least squares.” Journal of the International Association for Mathematical Geology, 17, 5, 563–586.

Dam, E., Husslage, B., den Hertog, D., and Melissen, H. (2005). “Maximin Latin Hypercube Designs in Two Dimensions.” Operations Research, 55.

Damianou, A. and Lawrence, N. (2013). “Deep Gaussian Processes.” arXiv, abs/1211.0358.

Davis, C. B., Hans, C. M., and Santner, T. J. (2019). “Prediction Using a Bayesian Heteroscedastic Composite Gaussian Process.”

Duan, W., Ankenman, B. E., Sanchez, S. M., and Sanchez, P. J. (2017). “Sliced Full Factorial-Based Latin Hypercube Designs as a Framework for a Batch Sequential Design Algorithm.” Technometrics, 59, 1, 11–22.

Duvenaud, D., Rippel, O., Adams, R., and Ghahramani, Z. (2014). “Avoiding pathologies in very deep networks.” In Artificial Intelligence and Statistics, 202–210.

Erickson, C. B., Ankenman, B. E., Plumlee, M., and Sanchez, S. M. (2018). “Gradient based criteria for sequential design.” In 2018 Winter Simulation Conference (WSC), 467–478.

Fadikar, A., Higdon, D., Chen, J., Lewis, B., Venkatramanan, S., and Marathe, M. (2018). “Calibrating a stochastic, agent-based model using quantile-based emulation.” SIAM/ASA Journal on Uncertainty Quantification, 6, 4, 1685–1706.

Fang, K.-T. (1980). “The Uniform Design: Application of Number-Theoretic Methods in Experimental Design.” Acta Mathematicae Applicatae Sinica, 3, 363–372.

Fang, K.-T., Lin, D. K., Winker, P., and Zhang, Y. (2000). “Uniform Design: Theory and Application.” Technometrics, 42, 3, 237–248.

Farah, M., Birrell, P., Conti, S., and Angelis, D. D. (2014). “Bayesian emulation and calibration of a dynamic epidemic model for A/H1N1 influenza.” Journal of the American Statistical Association, 109, 508, 1398–1411.

Fish and Wildlife Service, I. (1993). “Endangered and Threatened Wildlife and Plants; Determination of Threatened Status for the Delta Smelt.” Federal Register, 58, 42, 12854–12864.

Frazier, P. I. (2018). “A tutorial on Bayesian optimization.” arXiv preprint arXiv:1807.02811.

Ginsbourger, D. and Le Riche, R. (2010). “Towards Gaussian process-based optimization with finite time horizon.” In mODa 9 – Advances in Model-Oriented Design and Analysis, 89–96. Springer.

Ginsbourger, D., Le Riche, R., and Carraro, L. (2010). “Kriging is well-suited to parallelize optimization.” In Computational Intelligence in Expensive Optimization Problems, 131–162. Springer.

Gneiting, T. and Raftery, A. E. (2007). “Strictly Proper Scoring Rules, Prediction, and Estimation.” Journal of the American Statistical Association, 102, 477, 359–378.

Gramacy, R. and Polson, N. (2011). “Particle learning of Gaussian process models for sequential design and optimization.” Journal of Computational and Graphical Statistics, 20, 1, 102–118.

Gramacy, R. B. (2007). “tgp: An R Package for Bayesian Nonstationary, Semiparametric Nonlinear Regression and Design by Treed Gaussian Process Models.” Journal of Statistical Software, 19, 9, 1–46.

— (2020). Surrogates: Gaussian Process Modeling, Design and Optimization for the Applied Sciences. Boca Raton, Florida: Chapman Hall/CRC. http://bobby.gramacy.com/surrogates/.

Gramacy, R. B. and Apley, D. W. (2015). “Local Gaussian Process Approximation For Large Computer Experiments.” Journal of Computational and Graphical Statistics, 24, 2, 561–578. See arXiv:1303.0383.

Gramacy, R. B., Gray, G. A., Digabel, S. L., Lee, H. K. H., Ranjan, P., Wells, G., and Wild, S. M. (2016). “Modeling an Augmented Lagrangian for Blackbox Constrained Optimization.” Technometrics, 58, 1, 1–11.

Gramacy, R. B. and Lee, H. K. H. (2008). “Bayesian Treed Gaussian Process Models With an Application to Computer Modeling.” Journal of the American Statistical Association, 103, 483, 1119–1130.

— (2009). “Adaptive Design and Analysis of Supercomputer Experiments.” Technometrics, 51, 2, 130–145.

Gramacy, R. B. and Taddy, M. (2010). “Categorical Inputs, Sensitivity Analysis, Optimization and Importance Tempering with tgp Version 2, an R Package for Treed Gaussian Process Models.” Journal of Statistical Software, 33, 6, 1–48.

Hamilton, S. and Murphy, D. (2018). “Analysis of Limiting Factors Across the Life Cycle of Delta Smelt (Hypomesus transpacificus).” Environmental Management, 62.

Han, Z.-H., Görtz, S., and Zimmermann, R. (2013). “Improving variable-fidelity surrogate modeling via gradient-enhanced kriging and a generalized hybrid bridge function.” Aerospace Science and Technology, 25, 177–189.

Harville, D. A. (1998). “Matrix algebra from a statistician’s perspective.”

Herbei, R. and Berliner, L. M. (2014). “Estimating ocean circulation: an MCMC approach with approximated likelihoods via the Bernoulli factory.” Journal of the American Statistical Association, 109, 507, 944–954.

Higdon, D., Kennedy, M., Cavendish, J. C., Cafeo, J. A., and Ryne, R. D. (2004). “Combining field data and computer simulations for calibration and prediction.” SIAM Journal on Scientific Computing, 26, 2, 448–466.

Hong, L. and Nelson, B. (2006). “Discrete Optimization via Simulation Using COMPASS.” Operations Research, 54, 1, 115–129.

Huang, D., Allen, T. T., Notz, W. I., and Zeng, N. (2006). “Global optimization of stochastic black-box systems via sequential kriging meta-models.” Journal of Global Optimization, 34, 3, 441–466.

Johnson, L. (2008). “Microcolony and Biofilm Formation as a Survival Strategy for Bacteria.” Journal of Theoretical Biology, 251, 24–34.

Johnson, M., Moore, L., and Ylvisaker, D. (1990). “Minimax and Maximin Distance Designs.” Journal of Statistical Planning and Inference, 26, 131–148.

Jones, D., Schonlau, M., and Welch, W. J. (1998). “Efficient Global Optimization of Expensive Black Box Functions.” Journal of Global Optimization, 13, 455–492.

Kennedy, J. and Eberhart, R. (1995). “Particle swarm optimization.” In Proceedings of ICNN’95 - International Conference on Neural Networks, vol. 4, 1942–1948.

Kennedy, M. C. and O’Hagan, A. (2001). “Bayesian Calibration of Computer Models.” Journal of the Royal Statistical Society, Series B, 63, 425–464.

Kim, S.-H. and Nelson, B. L. (2006). “Selecting the best system.” Handbooks in Operations Research and Management Science, 13, 501–534.

Kimmerer, W. and Rose, K. (2018). “Individual-Based Modeling of Delta Smelt Population Dynamics in the Upper San Francisco Estuary III. Effects of Entrainment Mortality and Changes in Prey.” Transactions of the American Fisheries Society, 147, 223–243.

Kimmerer, W. J. and Nobriga, M. L. (2008). “Investigating Particle Transport and Fate in the Sacramento-San Joaquin Delta Using a Particle Tracking Model.”

Kleijnen, J. P. and Van Beers, W. C. (2005). “Robustness of Kriging when interpolating in random simulation with heterogeneous variances: Some experiments.” European Journal of Operational Research, 165, 3, 826–834.

Leatherman, E. R., Santner, T. J., and Dean, A. M. (2017). “Computer experiment designs for accurate prediction.” Statistics and Computing, 1–13.

Li, Y. and Deng, X. (2018). “EI-Optimal Design: An Efficient Algorithm for Elastic I-optimal Design of Generalized Linear Models.” arXiv preprint arXiv:1801.05861.

Lin, C. D., Bingham, D., Sitter, R. R., and Tang, B. (2010). “A new and flexible method for constructing designs for computer experiments.” The Annals of Statistics, 38.

Lin, C. D., Mukerjee, R., and Tang, B. (2009). “Construction of orthogonal and nearly orthogonal Latin hypercubes.” Biometrika, 96, 1, 243–247.

Lin, C. D. and Tang, B. (2015). Handbook of Design and Analysis of Experiments, chap. Latin Hypercubes and Space-Filling Designs. CRC Press.

Liu, H., Ong, Y., and Cai, J. (2018). “A Survey of Adaptive Sampling for Global Metamodeling in Support of Simulation-based Complex Engineering Design.” Structural and Multidisciplinary Optimization.

Loeppky, J., Moore, L., and Williams, B. (2009). “Batch sequential designs for computer experiments.” Journal of Statistical Planning and Inference, 140, 1452–1464.

Loeppky, J. L., Moore, L. M., and Williams, B. J. (2010). “Batch sequential designs for computer experiments.” Journal of Statistical Planning and Inference, 140, 6, 1452–1464.

Lyu, X., Binois, M., and Ludkovski, M. (2018). “Evaluating Gaussian process metamodels and sequential designs for noisy level set estimation.” arXiv preprint arXiv:1807.06712.

MacKay, D. J. C. (1992). “Information-based Objective Functions for Active Data Selection.” Neural Computation, 4, 4, 589–603.

MacNally, R., Thomson, J., Kimmerer, W., Feyrer, F., Newman, K., Sih, A., Bennett, W., Brown, L., Fleishman, E., Culberson, S., and Castillo, G. (2010). “Analysis of pelagic species decline in the upper San Francisco Estuary using multivariate autoregressive modeling (MAR).” Ecological Applications, 20, 1417–30.

Mak, S. and Joseph, V. R. (2018). “Minimax and Minimax Projection Designs Using Clustering.” Journal of Computational and Graphical Statistics, 27, 1, 166–178.

Handcock, M. S. (1991). “On cascading latin hypercube designs and additive models for experiments.” Communications in Statistics - Theory and Methods, 20, 2, 417–439.

Marrel, A., Iooss, B., Laurent, B., and Roustant, O. (2009). “Calculations of Sobol indices for the Gaussian process metamodel.” Reliability Engineering & System Safety, 94, 3, 742–751.

Matheron, G. (1963). “Principles of Geostatistics.” Economic Geology, 58, 1246–1266.

Maunder, M. and Deriso, R. (2011). “A state-space multistage life cycle model to evaluate population impacts in the presence of density dependence: Illustrated with application to delta smelt (Hypomesus transpacificus).” Canadian Journal of Fisheries and Aquatic Sciences, 68, 1285–1306.

McKay, M., Beckman, R., and Conover, W. (1979). “A Comparison of Three Methods for Selecting Values of Input Variables in the Analysis of Output From a Computer Code.” Technometrics, 21, 239–245.

McKeague, I. W., Nicholls, G., Speer, K., and Herbei, R. (2005). “Statistical inversion of South Atlantic circulation in an abyssal neutral density layer.” Journal of Marine Research, 63, 4, 683–704.

Mebane, W. and Sekhon, J. (2011). “Genetic Optimization Using Derivatives: The rgenoud Package for R.” Journal of Statistical Software, 42, 1–26.

Morris, M., Mitchell, T., and Ylvisaker, D. (1993). “Bayesian Design and Analysis of Computer Experiments: Use of Derivatives in Surface Prediction.” Technometrics, 35.

Morris, M. D. (1991). “On counting the number of data pairs for semivariogram estimation.” Mathematical Geology, 25, 929–943.

Morris, M. D. and Mitchell, T. J. (1995). “Exploratory Designs for Computational Experiments.” Journal of Statistical Planning and Inference, 43, 381–402.

Notz, W. I. and Lam, C. Q. (2008). “Sequential adaptive designs in computer experiments for response surface model fit.”

Oakley, J. and O’Hagan, A. (2004). “Probabilistic sensitivity analysis of complex models: a Bayesian approach.” Journal of the Royal Statistical Society: Series B (Statistical Methodology), 66, 3, 751–769.

Owen, A. B. (1994). “Controlling Correlations in Latin Hypercube Samples.” Journal of the American Statistical Association, 89, 428, 1517–1522.

Picheny, V., Ginsbourger, D., Richet, Y., and Caplin, G. (2013a). “Quantile-based optimization of noisy computer experiments with tunable precision.” Technometrics, 55, 1, 2–13.

Picheny, V., Gramacy, R., Wild, S. M., and Digabel, S. (2016). “Bayesian optimization under mixed constraints with a slack-variable augmented Lagrangian.” In NIPS.

Picheny, V., Wagner, T., and Ginsbourger, D. (2013b). “A benchmark of kriging-based infill criteria for noisy optimization.” Structural and Multidisciplinary Optimization, 48, 3, 607–626.

Pronzato, L. and Müller, W. (2011). “Design of computer experiments: Space filling and beyond.” Statistics and Computing, 1–21.

Qian, P. Z. G. (2009). “Nested Latin hypercube designs.” Biometrika, 96, 4, 957–970.

Rose, K. A., Kimmerer, W. J., Edwards, K. P., and Bennett, W. A. (2013). “Individual-Based Modeling of Delta Smelt Population Dynamics in the Upper San Francisco Estuary: I. Model Description and Baseline Results.” Transactions of the American Fisheries Society, 142, 5, 1238–1259.

Russo, D. (1984). “Design of an Optimal Sampling Network for Estimating the Variogram.” Soil Science Society of America Journal, 48.

Rutter, C. M., Ozik, J., DeYoreo, M., Collier, N., et al. (2019). “Microsimulation model calibration using incremental mixture approximate Bayesian computation.” The Annals of Applied Statistics, 13, 4, 2189–2212.

Sacks, J., Welch, W. J., Mitchell, T. J., and Wynn, H. (1989). “Design and analysis of computer experiments. With comments and a rejoinder by the authors.” Statistical Science, 4.

Saltelli, A., Chan, K., and Scott, M. (2000). Sensitivity Analysis. New York, NY: John Wiley & Sons.

Saltelli, A., Ratto, M., Andres, T., Campolongo, F., Cariboni, J., Gatelli, D., Saisana, M., and Tarantola, S. (2008). Global Sensitivity Analysis: The Primer. John Wiley & Sons.

Santner, T., Williams, B., and Notz, W. (2003). The Design and Analysis of Computer Experiments. Springer.

— (2018). The Design and Analysis of Computer Experiments, Second Edition. New York, NY: Springer-Verlag.

Seo, S., Wallat, M., Graepel, T., and Obermayer, K. (2000). “Gaussian Process Regression: Active Data Selection and Test Point Rejection.” In Proceedings of the International Joint Conference on Neural Networks, vol. III, 241–246. IEEE.

Shewry, M. C. and Wynn, H. P. (1987). “Maximum entropy sampling.” Journal of Applied Statistics, 14, 2, 165–170.

Snoek, J., Larochelle, H., and Adams, R. P. (2012). “Bayesian optimization of machine learning algorithms.” In Neural Information Processing Systems (NIPS).

Taddy, M., Lee, H., Gray, G., and Griffin, J. (2009). “Bayesian guided pattern search for robust local optimization.” Technometrics, 51, 4, 389–401.

Tan, M. (2013). “Minimax Designs for Finite Design Regions.” Technometrics, 55, 346–358.

Tang, B. (1993). “Orthogonal Array-Based Latin Hypercubes.” Journal of the American Statistical Association, 88, 424, 1392–1397.

Thomson, J., Kimmerer, W., Brown, L., Newman, K., Mac Nally, R., Bennett, W., Feyrer, F., and Fleishman, E. (2010). “Bayesian change point analysis of abundance trends for pelagic fishes in the upper San Francisco Estuary.” Ecological Applications, 20, 1431–48.

Williams, B. J., Loeppky, J. L., Moore, L. M., and Macklem, M. S. (2011). “Batch sequential design to achieve predictive maturity with calibrated computer models.” Reliability Engineering & System Safety, 96, 9, 1208–1219.

Wycoff, N., Binois, M., and Wild, S. M. (2019). “Sequential Learning of Active Subspaces.”

Xie, J., Frazier, P. I., Sankaran, S., Marsden, A., and Elmohamed, S. (2012). “Optimization of computationally expensive simulations with Gaussian processes and parameter uncertainty: Application to cardiovascular surgery.” In 2012 50th Annual Allerton Conference on Communication, Control, and Computing (Allerton), 406–413.

Yin, J., Ng, S. H., and Ng, K. M. (2011). “Kriging metamodel with modified nugget-effect: The heteroscedastic variance case.” Computers & Industrial Engineering, 61, 3, 760–777.

Yu, H. (2002). “Rmpi: Parallel Statistical Computing in R.” R News, 2, 2, 10–14.

Zhao, Y. and Wall, M. M. (2004). “Investigating the Use of the Variogram for Lattice Data.” Journal of Computational and Graphical Statistics, 13, 3, 719–738.

Zimmerman, D. (2006). “Optimal network design for spatial prediction, covariance parameter estimation, and empirical prediction.” Environmetrics, 17, 635–652.