Computer Experimental Design for Gaussian Process Surrogates

Boya Zhang

Dissertation submitted to the Faculty of the Virginia Polytechnic Institute and State University in partial fulfillment of the requirements for the degree of

Doctor of Philosophy in

Robert B. Gramacy, Chair
Xinwei Deng
David Higdon
Leanna House

August 13, 2020 Blacksburg, Virginia

Keywords: Computer experiment, experimental design, sequential design, Gaussian process surrogates, input-dependent noise

Copyright 2020, Boya Zhang

Computer Experimental Design for Gaussian Process Surrogates

Boya Zhang

(ABSTRACT)

With the rapid development of computing power, computer experiments have gained popularity in various scientific fields, like cosmology, ecology and engineering. However, some computer experiments for complex processes are still computationally demanding. A surrogate model, or emulator, is often employed as a fast substitute for the simulator. Meanwhile, a common challenge in computer experiments and related fields is to efficiently explore the input space using a small number of samples, i.e., the experimental design problem. This dissertation focuses on the design problem under Gaussian process surrogates. The first work demonstrates empirically that space-filling designs disappoint when the model hyperparameterization is unknown and must be estimated from data observed at the chosen design sites. A purely random design is shown to be superior to higher-powered alternatives in many cases. Thereafter, a new family of distance-based designs is proposed and its superior performance is illustrated in both static (one-shot design) and sequential settings. The second contribution is motivated by an agent-based model (ABM) of delta smelt conservation. The ABM is developed to assist in a study of delta smelt life cycles and to understand sensitivities to myriad natural variables and human interventions. However, the input space is high-dimensional, running the simulator is time-consuming, and its outputs change nonlinearly in both mean and variance. A batch sequential design scheme is proposed, generalizing one-at-a-time variance-based active learning, as a means of keeping multi-core cluster nodes fully engaged with expensive runs. The acquisition strategy is carefully engineered to favor selection of replicates which boost statistical and computational efficiencies. Design performance is illustrated on a range of toy examples before embarking on a smelt campaign and downstream high-fidelity input sensitivity analysis.
Computer Experimental Design for Gaussian Process Surrogates

Boya Zhang

(GENERAL AUDIENCE ABSTRACT)

With the rapid development of computing power, computer experiments have gained popularity in various scientific fields, like cosmology, ecology and engineering. However, some computer experiments for complex processes are still computationally demanding. Thus, a statistical model built upon input-output observations, i.e., a so-called surrogate model or emulator, is needed as a fast substitute for the simulator. Design of experiments, i.e., how to select samples from the input space under budget constraints, is also worth studying. This dissertation focuses on the design problem under Gaussian process (GP) surrogates. The first work demonstrates empirically that commonly-used space-filling designs disappoint when the model hyperparameterization is unknown and must be estimated from data observed at the chosen design sites. Thereafter, a new family of distance-based designs is proposed and its superior performance is illustrated in both static (design points are allocated in one shot) and sequential settings (data are sampled sequentially). The second contribution is motivated by a stochastic computer simulator of delta smelt conservation. This simulator is developed to assist in a study of delta smelt life cycles and to understand sensitivities to myriad natural variables and human interventions. However, the input space is high-dimensional, running the simulator is time-consuming, and its outputs change nonlinearly in both mean and variance. An innovative batch sequential design method is proposed, generalizing the one-at-a-time sequential design to a one-batch-at-a-time scheme with the goal of parallel computing. The criterion for subsequent data acquisition is carefully engineered to favor selection of replicates which boost statistical and computational efficiencies. The design performance is illustrated on a range of toy examples before embarking on a smelt simulation campaign and downstream sensitivity analysis of simulator inputs.

Dedication

To my dearest family.

Acknowledgments

This dissertation would not have been possible without the help and support of my advisor, committee, colleagues, friends and family. First, I would like to express my deepest gratitude to my advisor Bobby Gramacy. With his guidance, I have had opportunities to learn about many interesting and challenging research areas, which motivates me to explore more in the future. In the past three years, he gave me persistent help and encouragement in overcoming difficulties. I feel so grateful to have had the chance of working with him. I want to thank my committee members Dr. Xinwei Deng, Dr. Dave Higdon, and Dr. Leanna House. Their valuable advice and thought-provoking questions have broadened my view and pushed me to think deeper. I would also like to thank the statistics department of Virginia Tech for providing various courses on cutting-edge topics and plenty of opportunities in teaching and collaborating. And my biggest thanks to all my friends, especially Ruijin Lu, who saved me from homelessness during my dissertation writing period. Last but not least, I would like to thank my family, in particular, my parents, Jinling Ma and Yongchun Zhang, my grandparents, Qinglan Wu, Meiying Ni and Shimin Zhang, for their unconditional love. Their support gives me the strength to face any difficulties.

Contents

List of Figures x

List of Tables xiii

1 Introduction 1

1.1 Background ...... 1

1.2 Motivation example: Delta Smelt ...... 2

1.3 Overview of this dissertation ...... 3

1.3.1 Distance-distributed design for GP surrogates ...... 3

1.3.2 IMSPE batch sequential design ...... 3

1.3.3 Delta Smelt ...... 4

2 Review of literature 5

2.1 Surrogate modeling ...... 5

2.1.1 Gaussian Process surrogates ...... 6

2.1.2 GP kernels ...... 7

2.1.3 Surrogates for stochastic computer simulators ...... 9

2.1.4 GP with replication ...... 10

2.1.5 Heteroskedastic Gaussian process ...... 11

2.2 Computer experimental design ...... 12

2.2.1 Geometric designs ...... 13

2.2.2 Model-based design ...... 15

2.2.3 Sequential design ...... 16

2.2.4 Bayesian Optimization ...... 18

2.2.5 Batch sequential design ...... 19

3 Distance-distributed Design for Gaussian Process Surrogates 21

3.1 Setup and related work ...... 24

3.1.1 Gaussian Process surrogates ...... 25

3.1.2 Thinking about designs for GPs ...... 25

3.2 Better than random ...... 28

3.2.1 Uniform to beta designs ...... 31

3.2.2 Optimization of shape parameters of betadist design ...... 34

3.3 Hybrid betadist and LHS ...... 39

3.4 Application to sequential design ...... 41

3.4.1 Active Learning MacKay ...... 42

3.4.2 Expected improvement for optimization ...... 44

4 IMSPE batch-sequential design 48

4.1 Batch sequential design ...... 50

4.1.1 A criterion for minimizing variance ...... 50

4.1.2 Batch IMSPE gradient ...... 53

4.1.3 Implementation details and illustration ...... 57

4.2 Hunting for replicates ...... 59

4.2.1 Backtracking via merge ...... 59

4.2.2 Selecting among backtracked batches ...... 61

4.3 Benchmarking examples ...... 62

4.3.1 1d toy example ...... 64

4.3.2 2d toy example ...... 64

4.3.3 Ocean oxygen ...... 68

4.3.4 Assemble-to-order ...... 70

5 Delta smelt 72

5.1 Agent-based model ...... 74

5.2 Pilot study ...... 77

5.3 Big experiment ...... 79

5.3.1 Setup and acquisitions ...... 80

5.3.2 Downstream analysis ...... 83

6 Conclusion 88

6.1 Distance-distributed design for GP surrogates ...... 88

6.2 IMSPE Batch-sequential design ...... 90

6.3 Delta Smelt simulator ...... 93

Bibliography 94

List of Figures

2.1 GP posterior predictive distribution in terms of means, 2.5% and 97.5% quantiles...... 8

2.2 Predictions and associated 95% uncertainty intervals based on GP with nugget parameter...... 10

3.1 logMSEs from design experiment and de-trending surface...... 28

3.2 Standardized logMSE boxplots to thirty gridded θ(t) values for seven comparators using n = 2d+1 over input dimension d ∈ {2, 3, 4, 5, 6}. The comparators are described in the text. Two outlying standardized log MSE values were clipped by the y-axes to enhance boxplot viewing: random (d = 4) at 10.9 and LHS (d = 6) at 17.4...... 30

3.3 Empirical density curves corresponding to random designs in 2d with lowest 50 logMSE(θ) values from 1000 random design realizations. Empirical maximin and Beta(2.5, 4) densities are shown for comparison...... 33

3.4 deRIMSE surface with T = 1000 for n = 16 and d = 2 as estimated by hetGP. Dots show the design sites; lighter (heat) colors correspond to higher deRIMSEs...... 36

3.5 Outcomes of BO of RIMSE surfaces for various choices of n and d. Numbers show location and number of replicates in acquisitions; blue square shows (α̂, β̂); purple and green contours show 5% and 10% from the optimal. .... 38

3.6 2d (black circles) and 1d (red triangles) projections of three d = 3 designs, n = 16...... 41

3.7 RMSPE comparison of initial designs (ninit = 8) as a function of the number of subsequent sequential design iterations via ALM. Each comparator has a pair of lines: those in the left panel indicate mean RMSE; those on the right are the upper 90% quantile...... 43

4.1 Batch IMSPE optimization iterations from initial (blue dots) to final (green crosses) locations. Three optimization epochs are provided by arrows. An overlayed heatmap shows the estimated standard deviation surface r(x). .. 57

4.2 Left: backtracking with merge; gray arrows connect optimal X̃_s with numbers indicating s = 1, ..., M; Right: IMSPE changes over numbers of replicates. Merging steps that are finally taken are shown in blue. Fitted segmented regression lines are overlaid...... 61

4.3 Three selected scatter plots of IMSPE versus number of replicates with best change-point fitted regression lines overlaid. Colors match arrows in Figure 4.2. 63

4.4 The top-left panel shows the initial design observations. Remaining panels display the sequential design process after adding 1, 5, 10, 15 and 20 batches. 65

4.5 The heatmap shows the mean surface f(x). Lighter colors correspond to higher values. Contours of r(x) are overlaid...... 66

4.6 IMSPE design in batches: gray dots are initial design points; gray contours show signal and noise contrast; numbers indicate replicate multiplicity. The last two panels summarize all new points from 6 batches and all design points respectively...... 66

4.7 Results of RMSPE, score, time per iteration in fitting HetGP model, and the aggregate number of unique design locations from 50 MC repetitions. .... 67

4.8 Ocean simulator results in 30 MC repetitions: RMSPE, score, time per batch and the aggregate number of unique design locations n...... 69

4.9 RMSPE and score over design size N from 30 MC repetitions...... 71

5.1 2d heatmap and 1d lineplot slices of predictive mean and variance for selected inputs. The numbers overlaid indicate design locations and numbers of replicates...... 78

5.2 Empirical density of pairwise distances from IMSPE batch and maximin se- quential design for the pilot (left) and full (right) studies...... 80

5.3 Slices for the “full” experiment updating Figure 5.1...... 82

5.4 Sensitivity analysis: main effects (left); first order (middle) and total sensitivity (right) from 100 bootstrap re-samples...... 84

5.5 Sensitivity analysis for the variance process: main effects (left); first order (middle) and total sensitivity (right) from 100 bootstrap re-samples. .... 86

List of Tables

3.1 Pairwise t-test p-value table for (ninit = 8, d = 2) and two settings n = 25 (top table) and n = 70 (bottom). Statistically significant p-values, i.e., below 5%, are in bold...... 46

3.2 Pairwise t-test p-value table for (ninit = 16, d = 3) and two settings n = 50 (top table) and n = 100 (bottom). Statistically significant p-values, i.e., below 5%, are in bold...... 47

3.3 Pairwise t-test p-value table for (ninit = 32, d = 4) and two settings n = 200 (top table) and n = 500 (bottom). Statistically significant p-values, i.e., below 5%, are in bold...... 47

5.1 Delta smelt simulator input variables. The last column shows the settings of the pilot study in Section 5.2. MR abbreviates mortality rate; EPT means eating prey type...... 74

5.2 Augmenting Table 5.1 to show the settings of the “full” experiment...... 81

5.3 Proportion of positive I = T − S indices for mean process...... 85

5.4 Proportion of positive I = T − S indices for variance process...... 87

Chapter 1

Introduction

1.1 Background

Rapid development of computing power has made computer experiments commonplace in various scientific and engineering fields as an alternative to expensive field experiments. However, computer experiments can be expensive in terms of computation or time. Some of them may take hours or even days to get a single evaluation, especially when complex systems or processes are simulated. In this case, surrogate models are usually fitted with available observations, approximating or even replacing the original computer models.

The Gaussian process (GP) has been a widely-used surrogate model for deterministic computer models; see Cressie (1985), Sacks et al. (1989), Santner et al. (2003). Therefore, experimental design methods for GP surrogates are worth studying. Space-filling designs, which spread out the design points across the input space, are common choices when little is known about the underlying surface. Mckay et al. (1979) introduced Latin hypercube sampling (LHS). LHS not only places points uniformly throughout the input space, but also maintains this desirable property under projection. Johnson et al. (1990) proposed the maximin criterion, maximizing the minimum pairwise distance to spread points out.

However, if the true response surface is non-smooth or the desired precision is relatively high, a design with a fixed number of points may not capture the interesting region at one shot.


Sequential design strategies start with a small initial design, then sequentially determine the next sample under the guidance of the current design. There are potential savings in applying sequential sampling methods. Sequential designs are usually based on surrogates, so that corresponding model-based criteria can be calculated to serve different design objectives, such as optimization (Jones et al., 1998) and global fitting (MacKay, 1992).

GPs were originally proposed for interpolating data from deterministic computer experiments. But nowadays, computer experiments involving stochastic processes, like agent-based models, appear more often. The random noise, which is not necessarily constant over the input space, makes the design and modeling of stochastic computer experiments more challenging. In this scenario, replications, i.e., multiple observations at a single design site, are believed to be essential for separating signal from noise (Binois et al., 2018c).

1.2 Motivation example: Delta Smelt

A motivating example of a computer experiment in this dissertation is the Delta Smelt simulator. It is a stochastic individual-based model developed by Rose et al. (2013), which characterizes the population dynamics of fish by simulating the individual life cycles of the smelt living in the San Francisco Estuary. As one of the most highly altered estuarine ecosystems in the world, the San Francisco Estuary needs a better strategy for resource management and restoration. In this process, smelt play a role as one of the most important indicators of environmental condition. However, they have generally been at low abundance since the 1980s and showed an even sharper decrease starting in 2002. The simulation model considers multiple factors thought to contribute to the Delta Smelt decline, including mortality rates at different growth stages, river effects, and prey feed effects. Researchers aim to identify the more influential factors and understand how they interact with each other through this simulator.

1.3 Overview of this dissertation

This section provides an overview of the dissertation. Chapters 3 to 5 cover the main contributions of the dissertation, including distance-distributed design for GP surrogates, IMSPE batch-sequential design for heteroskedastic Gaussian processes, and its application to the delta smelt simulator. Chapter 6 concludes this dissertation with suggestions, methodological ideas and future work.

1.3.1 Distance-distributed design for GP surrogates

All sequential designs need an initial design to start with. Currently, space-filling designs are still common in initial stages. But space-filling designs disappoint when the model hyperparameterization is unknown, because the subsequent sequential designs would reinforce bad hyperparameter estimation from the initial design. In Chapter 3, we expose these inefficiencies and propose a family of new schemes by reverse engineering the qualities of the random designs which give the best estimates of GP lengthscales. Finally, we illustrate how our distance-based designs outperform in both static (one-shot design) and sequential settings.

1.3.2 IMSPE batch sequential design

Chapter 4 is motivated by learning the input–output dynamics of a stochastic and time-consuming agent-based model. Obtaining enough runs to learn those dynamics effectively requires both a nimble modeling strategy and parallel supercomputer evaluation. Recent advances in heteroskedastic Gaussian process (HetGP) surrogate modeling help, but little is known about how to appropriately plan experiments for highly distributed simulator evaluation. An IMSPE batch sequential design method is proposed under a newly developed heteroskedastic GP model to facilitate parallel computing. A backtracking strategy is applied to create replicates, which are beneficial to both modeling and computation. Design and modeling performance is demonstrated on a range of toy examples.

1.3.3 Delta Smelt

Delta smelt are an endangered fish whose fate is intimately linked with water management practice in the Sacramento river delta system, and who more broadly serve as a barometer for environmental health in the San Francisco Bay. Researchers have developed a stochastic, agent-based simulator to virtualize the system, with the goal of assisting in a study of delta smelt life cycles and to understand sensitivities to myriad natural variables and human interventions. However, the input configuration space is high-dimensional, running the simulator is time-consuming, and its noisy outputs change nonlinearly in both mean and variance. These challenges are addressed in Chapter 5 by employing the method proposed in Chapter 4. The influences of input factors on the response surface are compared in a sensitivity analysis thereafter.

Chapter 2

Review of literature

Computer simulation experiments are common in scientific areas. They are employed to mimic systems and processes that are often too expensive, time-consuming or sometimes infeasible to observe. However, some computer experiments for complex processes are still computationally demanding. In this case, a statistical model built upon input-output observations, i.e., a so-called surrogate model or emulator, is needed as a fast substitute for the simulator. Design of experiments, i.e., how to select samples from the input space under budget constraints, determines how much we can learn about the input-output dynamics. As two essential components of computer experiment analysis, surrogate modeling and computer experimental design are reviewed in this chapter.

2.1 Surrogate modeling

The Gaussian process (GP) has been a canonical surrogate model for emulating computer experiments (Cressie, 1985, Sacks et al., 1989, Santner et al., 2003). Due to their nonlinear flexibility, outstanding uncertainty quantification properties and partly analytic calculations, GPs have proved effective in a vast literature.


2.1.1 Gaussian Process surrogates

Let f : ℝ^d → ℝ denote an unknown function, generically, but standing in specifically for a computationally expensive computer model simulation. Let X = {x_1, ..., x_N} denote the chosen d-dimensional design, and let Y = (y_1, ..., y_N)^⊤ collect outputs y_i = f(x_i), for i = 1, ..., N. Then a GP surrogate model can be fitted with X and Y, which can be used in lieu of future expensive evaluations. Here we make the simplifying assumption that the computer model, f, is deterministic.

Putting a GP prior on f amounts to specifying that any finite realization of f, e.g., our N observations Y, has a multivariate normal (MVN) distribution. MVNs are uniquely specified by a mean vector and covariance matrix. It is common in the computer experiments literature to take the mean to be zero, and to specify the covariance structure via scaled inverse Euclidean distances. For example, Y ∼ N_N(0, τ² C_N), where C_N follows

$$C_N^{ij} = c_\theta(\mathbf{x}_i, \mathbf{x}_j) = \exp\left\{ -\sum_{p=1}^{d} \frac{(x_{ip} - x_{jp})^2}{\theta_p} \right\}. \qquad (2.1)$$

Above, τ² is an amplitude hyperparameter, and θ_p is the lengthscale determining the rate of decay of correlation as a function of distance in the p-th dimension of the input space, with θ = (θ_1, ..., θ_d). For a more detailed discussion of GP setups for modeling computer experiments, see, e.g., Santner et al. (2003).
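To fix ideas, the covariance structure in Equation (2.1) can be sketched in a few lines of Python with NumPy (an illustrative sketch, not the implementation used in this dissertation; the design `X` and lengthscales `theta` below are arbitrary examples):

```python
import numpy as np

def gaussian_kernel(X1, X2, theta):
    """Separable Gaussian kernel of Eq. (2.1):
    c_theta(x_i, x_j) = exp(-sum_p (x_ip - x_jp)^2 / theta_p)."""
    X1, X2 = np.asarray(X1, float), np.asarray(X2, float)
    theta = np.asarray(theta, float)
    # per-dimension squared distances, scaled by lengthscales theta_p
    d2 = (((X1[:, None, :] - X2[None, :, :]) ** 2) / theta).sum(axis=-1)
    return np.exp(-d2)

# a small d = 2 design; C_N is symmetric with unit diagonal
X = np.array([[0.1, 0.2], [0.4, 0.9], [0.8, 0.5]])
C = gaussian_kernel(X, X, theta=[0.5, 0.5])
```

Larger θ_p slows the correlation decay in the p-th coordinate, so each input dimension may carry its own effective smoothness.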

Fixing θ and τ², the GP predictive equations at new inputs x, given the data (X, Y), have a convenient closed form derived from simple MVN conditioning identities. The (posterior) predictive distribution for Y(x) | Y is Gaussian with

$$\text{mean} \quad \mu(\mathbf{x} \mid \mathbf{Y}) = \mathbf{c}^\top(\mathbf{x})\, C_N^{-1} \mathbf{Y}, \qquad (2.2)$$
$$\text{and variance} \quad \sigma^2(\mathbf{x} \mid \mathbf{Y}) = \tau^2 \left[ c_\theta(\mathbf{x}, \mathbf{x}) - \mathbf{c}^\top(\mathbf{x})\, C_N^{-1} \mathbf{c}(\mathbf{x}) \right],$$

where c^⊤(x) is the N-vector whose i-th component is c_θ(x, x_i).
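The predictive equations (2.2) are equally compact in code. The sketch below (illustrative Python, fixing θ and τ²; a serious implementation would use Cholesky solves rather than an explicit inverse) also demonstrates the interpolation property at a training site:

```python
import numpy as np

def gp_predict(x, X, Y, theta, tau2):
    """Posterior mean and variance from Eq. (2.2) at a single input x."""
    def k(A, B):
        return np.exp(-((((A[:, None, :] - B[None, :, :]) ** 2) / theta).sum(-1)))
    C = k(X, X)                    # N x N covariance C_N
    c = k(np.atleast_2d(x), X)     # 1 x N cross-covariance c(x)^T
    Ci = np.linalg.inv(C)
    mu = float(c @ Ci @ Y)
    s2 = float(tau2 * (1.0 - c @ Ci @ c.T))   # c_theta(x, x) = 1 here
    return mu, s2

X = np.array([[0.2], [0.5], [0.9]])
Y = np.sin(10 * X[:, 0])
mu, s2 = gp_predict(np.array([0.5]), X, Y, theta=np.array([0.1]), tau2=1.0)
# at a design site the GP interpolates: mu = y_i and s2 = 0 (up to round-off)
```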

Unknown hyperparameters can be inferred by viewing Y ∼ N_N(0, τ² C_N) as a likelihood and maximizing its logarithm numerically. The log likelihood is

$$\ell = \log L = -\frac{N}{2} \log 2\pi - \frac{N}{2} \log \tau^2 - \frac{1}{2} \log |C_N| - \frac{1}{2\tau^2} \mathbf{Y}^\top C_N^{-1} \mathbf{Y}.$$

Setting the derivative with respect to τ² to zero gives the MLE τ̂² = N^{-1} Y^⊤ C_N^{-1} Y in closed form, which may be used to derive a profile/concentrated multivariate Student-t likelihood for θ. In the Bayesian setting, τ² may analytically be integrated out under an inverse-Gamma prior (see, e.g., Gramacy and Apley, 2015). Either way, numerical methods are required to learn appropriate lengthscale settings θ̂.
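A sketch of the resulting concentrated (profile) objective follows, with a crude grid search standing in for the numerical optimizer (illustrative Python; the grid, jitter, and test function are arbitrary choices, not part of the original text):

```python
import numpy as np

def neg_profile_loglik(theta, X, Y):
    """Negative concentrated log-likelihood for theta after plugging in
    tau2_hat = N^{-1} Y^T C_N^{-1} Y; additive constants are dropped."""
    N = len(Y)
    d2 = (((X[:, None, :] - X[None, :, :]) ** 2) / theta).sum(-1)
    C = np.exp(-d2) + 1e-8 * np.eye(N)         # small jitter for stability
    _, logdet = np.linalg.slogdet(C)
    tau2_hat = Y @ np.linalg.solve(C, Y) / N   # closed-form MLE of tau^2
    return 0.5 * (N * np.log(tau2_hat) + logdet)

rng = np.random.default_rng(0)
X = rng.uniform(size=(20, 1))
Y = np.sin(4 * np.pi * X[:, 0])
grid = [0.01, 0.05, 0.1, 0.5, 1.0]
theta_hat = min(grid, key=lambda t: neg_profile_loglik(np.array([t]), X, Y))
```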

The performance of GP surrogates can be visualized in Figure 2.1. The black solid line indicates the true sine function. The red solid line shows the predictive mean of the GP, which is close to the truth and interpolates all of the observations. The famous “sausage”-shaped predictive band is shown as pink shading, i.e., the predictive variance becomes higher as x moves away from X_N.

2.1.2 GP kernels

Gaussian process regression can be regarded as a special case of linear regression. Consider the following setup:

$$y_i = \mathbf{x}_i^\top \beta + \varepsilon_i, \quad i = 1, \ldots, N.$$

The classical linear regression model assumes ε_i iid∼ N(0, δ), where δ is constant. In matrix form, Y = X_N β + ε. If the error terms are not independent but jointly follow an MVN distribution, i.e., ε ∼ N_N(0, C_N), then Y ∼ N_N(X_N β, C_N), which is identical to the distribution under the GP prior.

Figure 2.1: GP posterior predictive distribution in terms of means, 2.5% and 97.5% quantiles.

As mentioned above, in the computer experiments literature the mean vector X_N β is commonly omitted, and the variation of the response surface is explained entirely by C_N. Thus, covariance functions and their properties are fundamental to Gaussian processes.

The covariance matrix is defined by covariance functions c(x_i, x_j), which are also called kernels. The kernel used in the illustration of Section 2.1.1 is called the Gaussian kernel, or double exponential kernel. Specifically, it belongs to the separable/anisotropic Gaussian family, since the lengthscales are parameterized separately in each dimension. If θ_p = θ for p = 1, ..., d, i.e., the correlation decays at the same rate in all dimensions, it becomes an isotropic kernel. Both separable and isotropic kernels are stationary, as they rely only on r = |x_i − x_j|, i.e., c(x_i, x_j) = c(r). Due to its double exponential nature, the Gaussian kernel is infinitely differentiable, which can be inappropriate for some non-smooth response surfaces.

The Matérn family is another common covariance function type, defined as

$$c_\nu(r) = \frac{2^{1-\nu}}{\Gamma(\nu)} \left( \frac{\sqrt{2\nu}\, r}{\theta} \right)^{\nu} K_\nu\!\left( \frac{\sqrt{2\nu}\, r}{\theta} \right),$$

where θ is the lengthscale and K_ν is a modified Bessel function of the second kind. The smoothness of Matérn kernels can be enhanced by increasing the parameter ν. When ν → ∞, the Matérn kernel becomes a member of the Gaussian family:

$$c_\nu(r) \to c_\infty(r) = \exp\left( -\frac{r^2}{2\theta^2} \right).$$
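The Matérn correlation is easy to evaluate with a Bessel-function library. The sketch below assumes the √(2ν) r/θ parameterization written above (conventions vary across references) and relies on SciPy's `kv`; at ν = 1/2 it recovers the exponential kernel exp(−r/θ):

```python
import numpy as np
from scipy.special import gamma, kv

def matern(r, theta, nu):
    """Matern correlation c_nu(r) = 2^(1-nu)/Gamma(nu) * z^nu * K_nu(z),
    with z = sqrt(2*nu) * r / theta."""
    r = np.atleast_1d(np.asarray(r, dtype=float))
    z = np.sqrt(2.0 * nu) * r / theta
    c = np.ones_like(z)            # c_nu(0) = 1 by continuity
    pos = z > 0
    c[pos] = 2.0 ** (1.0 - nu) / gamma(nu) * z[pos] ** nu * kv(nu, z[pos])
    return c

# nu = 1/2: K_{1/2}(z) = sqrt(pi/(2z)) e^{-z}, so c(r) = exp(-r/theta)
val = matern(0.3, theta=1.0, nu=0.5)[0]
```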

2.1.3 Surrogates for stochastic computer simulators

With the development of computing resources, stochastic computer simulators have gained more and more popularity (Kim and Nelson, 2006, Kleijnen and Van Beers, 2005, Yin et al., 2011). Stochastic computer experiments are capable of describing complicated systems with various sources of randomness. The data we collect may not only be noisy but also involve low signal-to-noise ratios or heteroskedastic variance. How to model and design stochastic computer experiments becomes a big challenge for statisticians.

For stochastic f with constant noise we can add a nugget term g to the diagonal of the covariance matrix to define K_N = C_N + Λ_N for Λ_N = g I_N, and take Y_N ∼ N_N(0, τ² K_N). This is equivalent to Y(x) = w(x) + ε, where w(x) ∼ GP with scale τ², i.e., W ∼ N_N(0, τ² C_N), and ε(x) iid∼ N(0, σ²). The predictive distribution is identical to Equations (2.2) except with C_N replaced by K_N. A visualization similar to Figure 2.1 is shown in Figure 2.2. This time, the predictive mean (red solid line) no longer interpolates the observations, but it is still close to the truth with such a limited number of samples.

However, assuming independent and constant noise is not realistic in some applications, and the nugget setting cannot handle input-dependent noise.

Figure 2.2: Predictions and associated 95% uncertainty intervals based on GP with nugget parameter.

Partition-based models can tackle heteroskedastic noise, see Gramacy and Lee (2008), but they do not do well when the signal-to-noise ratio is high, and the fitted mean surface is not necessarily smooth. In terms of separating signal from noise, Ankenman et al. and Yin et al. applied stochastic kriging (SK), which offers approximate methods that exploit large degrees of replication. However, the moment-based estimation they use requires a large amount of replication, which is not practical for computationally expensive simulators.

2.1.4 GP with replication

Replication, i.e., repeated observation at identical inputs, plays an important role in stochastic computer experiments. Replication can not only separate signal from noise, but also holds the potential for computational savings through a Woodbury trick (Harville, 1998). From now on, I use n to denote the number of unique design sites. Let X̄_n = {x̄_1, ..., x̄_n} and Ȳ_n = {ȳ_1, ..., ȳ_n} store the unique input locations and the observations averaged over replicates. Through a deduction similar to Equation (2.2), at N′ testing locations 𝒳, the predictive distribution Y(𝒳) | Y_N is Gaussian with

$$\text{mean} \quad \mu(\mathcal{X} \mid \mathbf{Y}_N) = c(\mathcal{X}, \bar{X}_n)\, K_n^{-1} \bar{\mathbf{Y}}_n, \qquad (2.3)$$
$$\text{and variance} \quad \Sigma(\mathcal{X} \mid \mathbf{Y}_N) = \hat{\tau}^2 \left[ c(\mathcal{X}, \mathcal{X}) - c(\mathcal{X}, \bar{X}_n)\, K_n^{-1}\, c(\mathcal{X}, \bar{X}_n)^\top \right].$$

In the above equations, K_n = C_n + Λ_n = C_n + g A_n^{-1}. C_n is the covariance matrix of the n unique design locations, defined under the same kernel/inverse-distance structure, i.e., C_n^{ij} = c_θ(x̄_i, x̄_j). A_n is a diagonal matrix with A_n^{ii} = a_i, the number of replicates at unique location x̄_i, so that Σ_{i=1}^n a_i = N. As the predictive equations (2.3) indicate, the computational expense of matrix inversion, which is the most time-consuming part of GP inference, is reduced from O(N³) to O(n³) without any approximation.
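A sketch of the replication-aware predictive equations (2.3) in Python (illustrative; the unique sites, replicate counts and hyperparameters below are made-up examples). The key point is that only the n × n matrix K_n = C_n + g A_n^{-1} is inverted, never an N × N one:

```python
import numpy as np

def gp_predict_rep(XX, Xbar, Ybar, a, theta, g, tau2):
    """Predictive mean and covariance of Eq. (2.3) from n unique sites
    Xbar with replicate counts a and replicate-averaged responses Ybar."""
    def k(A, B):
        return np.exp(-((((A[:, None, :] - B[None, :, :]) ** 2) / theta).sum(-1)))
    Kn = k(Xbar, Xbar) + g * np.diag(1.0 / a)   # n x n, so O(n^3) inversion
    c = k(XX, Xbar)
    Ki = np.linalg.inv(Kn)
    mu = c @ Ki @ Ybar
    Sigma = tau2 * (k(XX, XX) - c @ Ki @ c.T)
    return mu, Sigma

Xbar = np.array([[0.1], [0.5], [0.9]])     # n = 3 unique sites
a = np.array([3, 2, 4])                    # N = sum(a) = 9 total runs
Ybar = np.array([0.2, 0.8, 0.1])           # averages over replicates
mu, Sigma = gp_predict_rep(np.array([[0.5]]), Xbar, Ybar, a,
                           theta=np.array([0.2]), g=0.1, tau2=1.0)
```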

2.1.5 Heteroskedastic Gaussian process

Motivated by SK and ideas from the machine learning community, a fully likelihood-based inference framework called heteroskedastic Gaussian process (HetGP) modeling has been proposed for response surfaces with non-constant noise (Binois et al., 2018a). HetGP does not require a large amount of replication. By involving latent variance observations and modeling the noise surface with a second GP, HetGP gives a smooth noise surface over the entire region.

Specifically, they proposed freeing the diagonal elements of Λ_n under a sort of smoothness penalty. Let δ_1, δ_2, ..., δ_n denote latent nuggets, corresponding to the n ≪ N unique design locations. It is also important not to introduce latent δ_i in multitude at identical input locations x̄_i, which would introduce numerical instabilities to the inferential scheme. Place these latent nuggets diagonally in Δ_n and assign to them a structure similar to Y, but now encoding a prior on variances:

$$\Delta_n \sim \mathcal{N}_n\!\left( 0,\ \tau^2_{(\delta)} \left( C_{(\delta)} + g_{(\delta)} A_n^{-1} \right) \right).$$

C_{(δ)} is the covariance matrix of the n unique design locations, which is defined under a similar kernel with hyperparameters for the noise process; g_{(δ)} is a “nugget of nuggets” controlling the smoothness of the λ_i's relative to the δ_i's. Smoothed λ_i-values can be calculated by plugging Δ_n into the GP mean predictive equation (2.3):

$$\Lambda_n = C_{(\delta)}\, K_{(\delta)}^{-1}\, \Delta_n, \quad \text{where} \quad K_{(\delta)} = C_{(\delta)} + g_{(\delta)} A_n^{-1}. \qquad (2.4)$$

Parameters, including θ and τ² for both GPs, i.e., for the mean and the variance, may be estimated by maximizing the joint log likelihood with derivatives via fast library-based methods in time cubic in n. Software is available for R as hetGP on CRAN (Binois et al., 2018a). For implementation convenience, log Δ_n ∼ N_n(0, τ²_{(δ)}(C_{(δ)} + g_{(δ)} A_n^{-1})) is utilized instead.
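The smoothing step in Equation (2.4) is a GP mean prediction applied to the latent nuggets. A minimal sketch follows (illustrative Python; the latent values δ_i and hyperparameters are made up), showing that as g_(δ) → 0 the smoothed λ_i's recover the δ_i's:

```python
import numpy as np

def smooth_nuggets(Xbar, delta, a, theta_d, g_d):
    """Eq. (2.4): Lambda_n = C_(d) K_(d)^{-1} Delta_n with
    K_(d) = C_(d) + g_(d) A_n^{-1}, smoothing the latent nuggets delta."""
    d2 = (((Xbar[:, None, :] - Xbar[None, :, :]) ** 2) / theta_d).sum(-1)
    Cd = np.exp(-d2)                     # noise-process covariance C_(d)
    Kd = Cd + g_d * np.diag(1.0 / a)     # "nugget of nuggets" g_(d)
    return Cd @ np.linalg.solve(Kd, delta)

Xbar = np.array([[0.1], [0.5], [0.9]])
delta = np.array([0.1, 0.5, 0.2])        # latent (unsmoothed) nuggets
a = np.array([2, 3, 4])                  # replicate counts
lam = smooth_nuggets(Xbar, delta, a, theta_d=np.array([1.0]), g_d=1e-8)
```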

2.2 Computer experimental design

As mentioned at the beginning of this chapter, computer experiments can themselves be computationally demanding or time-consuming, which limits the number of runs that can be entertained (Santner et al., 2018). Thus, how to select the input settings at which the computer simulator is run and the corresponding responses are collected becomes important. A computer experimental design problem can be defined as follows. Let f : ℝ^d → ℝ denote an unknown function, generically, but standing in specifically for a computationally expensive computer model simulation. Limited by time or computing resources, we can only run the computer simulation at n sites X = {x_1, ..., x_n} in the d-dimensional input space D, and let Y = (y_1, ..., y_n)^⊤ collect outputs y_i = f(x_i), for i = 1, ..., n. The design problem is all about how to select X. The goal of a design varies among factor screening, emulation, model calibration and optimization.

2.2.1 Geometric designs

When little is known about the response surface, creating a design that fills the space is an intuitive strategy. Geometric criteria are usually used to measure space-filling performance. Here we only consider the design space D = [0, 1]^d. Denote the Euclidean distance between two design points x_i and x_j as d_ij = ‖x_i − x_j‖.

As the most straight-forward space-filling strategy, simple random designs generate design

X from Unif[0, 1]d. Due to the stochastic property, they are not guaranteed to have good space-filling properties, especially when sample size is small. To avoid the great uncertainty of simple random designs, Fang (1980) proposed the uniform design (UD) concept that allo- cates experimental points uniformly scattered on the domain by minimizing the discrepancy between the empirical distribution and uniform distribution density, see Fang et al. (2000).

To uniformly allocate design points over D, short pairwise distances are not desirable. From this perspective, Johnson et al. (1990) proposed the maximin-distance design, which attempts to maximize the smallest pairwise distance dij, i.e.,

X = argmax_{X⊂D} min_{i≠j} d_ij.

Morris and Mitchell (1995) showed that a maximin design can be created by minimizing the φp criterion, φp = [Σ_{k=1}^{K} J_k d_k^{−p}]^{1/p}, where d_k is one of the K unique pairwise distances in a design and J_k is the number of pairs at that distance.
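These geometric criteria are straightforward to compute. The sketch below (Python, assuming numpy and scipy; the dissertation's own tooling is R) evaluates φp over all pairwise distances, together with a crude random search that minimizes it; as p grows, 1/φp approaches the minimum pairwise distance, so minimizing φp approximates maximin.

```python
import numpy as np
from scipy.spatial.distance import pdist

def phi_p(X, p=50):
    # phi_p = [sum over pairs d^{-p}]^{1/p}; summing over every pair
    # directly absorbs the multiplicities J_k of tied distances
    d = pdist(X)  # the n*(n-1)/2 pairwise Euclidean distances
    return np.sum(d ** (-float(p))) ** (1.0 / p)

def maximin_by_phi_p(n, d, p=50, n_draws=2000, seed=0):
    # crude stochastic search: keep the random design minimizing phi_p
    rng = np.random.default_rng(seed)
    return min((rng.random((n, d)) for _ in range(n_draws)),
               key=lambda X: phi_p(X, p))
```

For the 2 × 2 grid of corners plus the centre point, the minimum pairwise distance is √2/2 ≈ 0.707, and 1/φp with a large p recovers it approximately.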

Another perspective (Johnson et al., 1990) on spreading out the points is to make the maximum distance from any point in the input space to its closest design point as small as possible, i.e., minimize the minimax-distance criterion:

X = argmin_{X⊂D} max_{x∈D} min_{xi∈X} ||x − xi||.

This is called the minimax-distance design. Recent work of Mak and Joseph (2018) developed efficient algorithms for generating minimax designs via particle swarm optimization and clustering.

The space-filling property of maximin and minimax designs does not hold in marginal subspaces. Good projection properties are desirable when doing further analysis over the more influential factors. Latin hypercube sampling (LHS) (Mckay et al., 1979) overcomes the poor projection properties that maximin and minimax designs have: any projection of a LHS design is still a space-filling design. This property is essential in cases where some inactive input variables are dropped in follow-up analysis. A secondary design criterion can be applied after LHS, as in maximin-LHS (Morris and Mitchell, 1995).
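A basic LHS can be generated in a few lines. The following is a hypothetical Python sketch of the standard stratify-permute-jitter construction, not any of the cited implementations (in R, packages such as lhs are the usual tools):

```python
import numpy as np

def latin_hypercube(n, d, seed=None):
    # one point per equal-width stratum in every one-dimensional margin:
    # independently permute the n strata per dimension, jitter within strata
    rng = np.random.default_rng(seed)
    perms = np.argsort(rng.random((n, d)), axis=0)  # one permutation per column
    return (perms + rng.random((n, d))) / n
```

By construction, each one-dimensional projection hits every interval [k/n, (k+1)/n) exactly once, which is the uniformity property discussed above.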

Due to the good properties of LHS, several variants have been proposed. Handcock (1991) brought up the idea of cascading LHS, which deploys a 2/3-level LH design centered around each design point of a big LHS, ensuring there are design points close together. Tang (1993) introduced orthogonal array-based LHS to guarantee a more uniform design in the subspace spanned by the effective factors. Owen (1994) showed that orthogonal LHS can further reduce the variance of Monte Carlo integrals. Lin et al. (2010, 2009) contributed to the construction of orthogonal LH designs and solved the sample size limitation in previous algorithms. Qian (2009) developed the nested Latin hypercube design (NLHD), which contains smaller LH designs as subsets, in order to estimate the means of deterministic f at different fidelity levels.

2.2.2 Model-based design

Model-based design methods take advantage of information obtained from a pre-assumed statistical model, i.e., a GP in this context. In Equation (2.3), the uncertainty Σ(X | YN) is a quadratic function of distance to nearby training data locations XN. For this reason, space-filling designs are also good choices under GP surrogate assumptions.

Once the data is collected, a GP surrogate is usually fitted in place of the computationally expensive computer experiment. Assuming the surrogate model is known, design criteria can be built upon it. In information theory, entropy is a general measure of the unpredictability of a state. The entropy of a density p(x) is defined as

H(X) = − ∫_X p(x) log p(x) dx.

Maximum entropy (maxent) design was proposed by Shewry and Wynn (1987) for spatial models. Under the MVN assumption of GPs, maximizing entropy is equivalent to maximizing the determinant of the covariance matrix, |Kn|; see Santner et al. (2003, Chapter 6).
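Under the MVN assumption the maxent criterion is just a log-determinant, so a naive search is easy to sketch. The Python fragment below assumes an isotropic Gaussian kernel with unit scale and a hypothetical lengthscale θ = 0.5; the jitter and candidate-set search are implementation conveniences, not part of the cited method:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def log_det_K(X, theta=0.5, jitter=1e-8):
    # log |K_n| for an isotropic Gaussian kernel with unit scale
    K = np.exp(-squareform(pdist(X, "sqeuclidean")) / theta)
    return np.linalg.slogdet(K + jitter * np.eye(len(X)))[1]

def maxent_search(n, d, theta=0.5, n_draws=500, seed=0):
    # keep the random candidate design with the largest log-determinant
    rng = np.random.default_rng(seed)
    return max((rng.random((n, d)) for _ in range(n_draws)),
               key=lambda X: log_det_K(X, theta))
```

Designs with clumped points make Kn nearly singular and score poorly, so high-scoring designs are spread out, consistent with maxent's space-filling flavor.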

Another model-based design minimizes predictive uncertainty; see Sacks et al. (1989). To control predictive error over the input space, the integrated mean-squared predictive error is employed, defined as

IMSPE[ŷ] = ∫_D E{(ŷ(x) − Y(x))²} dx.

A weight function ω(x) can be incorporated in the integral to give more weight to regions where prediction accuracy is most important.
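IMSPE is rarely available in closed form for arbitrary D, but a Monte Carlo approximation is immediate. Below is a sketch assuming a zero-mean GP with isotropic Gaussian kernel and known hyperparameters (τ² = 1 and a hypothetical θ = 0.5); the uniform reference sample plays the role of ω(x) ≡ 1, and the jitter is a numerical convenience:

```python
import numpy as np

def pred_var(Xtest, X, theta=0.5, jitter=1e-8):
    # GP posterior predictive variance, k(x, x') = exp(-||x - x'||^2 / theta)
    def k(A, B):
        return np.exp(-((A[:, None, :] - B[None, :, :]) ** 2).sum(-1) / theta)
    Kinv = np.linalg.inv(k(X, X) + jitter * np.eye(len(X)))
    kx = k(Xtest, X)
    return 1.0 + jitter - np.einsum("ij,jk,ik->i", kx, Kinv, kx)

def imspe_mc(X, n_ref=4000, seed=0, **kw):
    # Monte Carlo approximation of IMSPE over D = [0, 1]^d, uniform weight
    rng = np.random.default_rng(seed)
    return pred_var(rng.random((n_ref, X.shape[1])), X, **kw).mean()
```

Because GP predictive variance is non-increasing as design points are added, nested designs give ordered IMSPE values, which is a useful sanity check.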

Model-based designs usually require known GP hyperparameters for criterion calculation, but knowing the hyperparameters before collecting data is not realistic. This chicken-or-egg problem can be addressed by first fitting a meta-model to a small model-free design, then sampling new design sites sequentially, i.e., sequential design, which is covered in Section 2.2.3.

2.2.3 Sequential design

Sequential design strategies answer the question of how to select follow-up design points in order to improve global fitting or optimization. By allocating samples sequentially, we avoid fixing the sample size for a target function about which we have no prior information, and can potentially save runs. Sequential design approaches can be further divided into adaptive and non-adaptive designs.

Non-adaptive designs are usually based on geometric criteria. They deploy the space-filling approaches mentioned in Section 2.2.1 and extend them in a sequential manner. For example, a sequential LHS can be selected by optimizing space-filling criteria constrained by one-dimensional distance thresholds (Dam et al., 2005).

Adaptive designs are also called active learning. They use previous samples and metamodels to help select subsequent design sites. As the name indicates, the subsequent samples adapt to the design purpose and the properties of the target f. Considering an initial design X0 = {x1, . . . , xn}, an adaptive sequential design scheme is described in Algorithm 1.

An effective criterion for selecting the new point is the key to active learning. Because of the uncertainty quantification property of GP surrogates, variance-based criteria are widely used to reduce predictive error. So-called active learning MacKay (ALM) (MacKay, 1992) selects the new point by maximizing the predictive variance σ²(x). Extending the idea to the

Algorithm 1 Adaptive sequential design framework

Init: X = X0, Y = Y0
while the limits of time and computation are not exceeded do
    Fit a surrogate model with X and Y
    Based on a chosen design criterion, obtain the next design point xn+1
    Evaluate the unknown function at xn+1 to get yn+1 = f(xn+1)
    Update the surrogate model with the augmented dataset X = (X, xn+1) and Y = (Y, yn+1)
end while
Return: X, Y
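A minimal Python sketch of Algorithm 1, instantiating the design criterion with ALM (maximize predictive variance over a random candidate set) under an assumed, fixed lengthscale; a real implementation would refit the hyperparameters inside the loop:

```python
import numpy as np

def adaptive_design(f, X0, n_total, theta=0.5, n_cand=1000, seed=0):
    # Algorithm 1: fit GP, pick the acquisition winner, evaluate, augment
    rng = np.random.default_rng(seed)
    X = np.array(X0, dtype=float)
    y = np.array([f(x) for x in X])
    k = lambda A, B: np.exp(
        -((A[:, None, :] - B[None, :, :]) ** 2).sum(-1) / theta)
    while len(X) < n_total:
        Kinv = np.linalg.inv(k(X, X) + 1e-8 * np.eye(len(X)))
        cand = rng.random((n_cand, X.shape[1]))             # candidate set
        kx = k(cand, X)
        var = 1.0 - np.einsum("ij,jk,ik->i", kx, Kinv, kx)  # ALM criterion
        x_new = cand[np.argmax(var)]
        X = np.vstack([X, x_new])                           # augment the data
        y = np.append(y, f(x_new))
    return X, y
```

Swapping the `var` line for another acquisition (e.g., an integrated-variance estimate) recovers the other criteria discussed below.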

entire domain, active learning Cohn (ALC) (Seo et al., 2000) aims to minimize the integrated variance ∫_{x∈D} σ²(x) dx. Sacks et al. consider the integrated mean squared predictive error (IMSPE) over the entire design space, in order to improve global fitting. The maximum entropy design criterion (Shewry and Wynn, 1987) can also be applied alone or together with gradient information (Morris et al., 1993) in the sequential design stage. Essentially, maximizing the determinant of the covariance matrix Kn+1 is equivalent to maximizing predictive variance,

because one can show log |Kn+1| = log |Kn| + log σ²(xn+1). Thus, ALM and maxent design are equivalent in this scenario.
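This identity is a Schur-complement fact and is easy to check numerically. The sketch below uses an isotropic Gaussian kernel with an arbitrary lengthscale and a small diagonal jitter (so σ²(xn+1) includes the jittered diagonal entry); the values and seed are purely illustrative:

```python
import numpy as np

def k(A, B, theta=0.5):
    # isotropic Gaussian kernel with unit scale
    return np.exp(-((A[:, None, :] - B[None, :, :]) ** 2).sum(-1) / theta)

rng = np.random.default_rng(0)
X, x_new = rng.random((6, 2)), rng.random((1, 2))
jit = 1e-8
Kn = k(X, X) + jit * np.eye(6)
Kn1 = k(np.vstack([X, x_new]), np.vstack([X, x_new])) + jit * np.eye(7)
kx = k(x_new, X)
s2 = (1.0 + jit) - kx @ np.linalg.solve(Kn, kx.T)  # predictive variance at x_new
lhs = np.linalg.slogdet(Kn1)[1]
rhs = np.linalg.slogdet(Kn)[1] + np.log(s2.item())
# lhs and rhs agree up to floating-point error
```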

For non-stationary GPs, Gramacy and Lee apply treed sequential maximum entropy designs to find a spaced-out candidate set, then select subsequent design points by ALM and ALC. Binois et al. proposed a purely sequential design approach minimizing IMSPE for stochastic simulators with heteroskedastic noise. Gradient-based acquisition functions can also be employed to reduce predictive error (Erickson et al., 2018, Han et al., 2013). If change in the response surface is of interest, having more samples in high-gradient regions is beneficial.

The above variance-based criteria all focus on global fitting via exploration. If the goal is global optimization, sequential design methods should balance exploration with local exploitation. Expected improvement (EI) (Jones et al., 1998, Notz and Lam, 2008) is a commonly used criterion of this kind. Sequential designs with the goal of optimization are reviewed in Section 2.2.4.

2.2.4 Bayesian Optimization

Many optimization problems in machine learning deal with black-box objective functions, i.e., the analytical expression for f is unknown. Evaluation of the function is restricted to sampling at an input and obtaining a possibly noisy response, and each evaluation can be expensive or time-consuming. This kind of problem can be regarded as an adaptive sequential design problem with a global optimization target.

The most famous method is the expected improvement algorithm (Jones et al., 1998, Notz and Lam, 2008). Based on GP posterior predictive equations described by mean µ(x) and standard deviation σ(x), the next point is selected by numerically optimizing EI(x):

EI(x) = (µmin − µ(x)) Φ((µmin − µ(x))/σ(x)) + σ(x) φ((µmin − µ(x))/σ(x)),    (2.5)

where µmin = minx µ(x), and Φ and φ are the standard Gaussian cdf and pdf, respectively. The procedure stops when the computational budget is exhausted or an optimality/precision criterion is met. Recently, EI has also been applied to optimization with heterogeneous noise, including the adaptive sequential kriging optimization approach of Huang et al. (2006), the expected quantile improvement (Picheny et al., 2013a) and the minimum quantile criterion (Picheny et al., 2013b).
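Equation (2.5) is cheap to evaluate pointwise. A Python sketch for minimization, guarding against σ(x) → 0 at training inputs (the guard is an implementation convenience, not part of the cited formula):

```python
import math

def expected_improvement(mu, sigma, mu_min):
    # EI(x) = (mu_min - mu)*Phi(z) + sigma*phi(z), with z = (mu_min - mu)/sigma
    if sigma <= 0.0:
        return max(mu_min - mu, 0.0)  # degenerate case: no predictive spread
    z = (mu_min - mu) / sigma
    Phi = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))        # standard normal cdf
    phi = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)  # standard normal pdf
    return (mu_min - mu) * Phi + sigma * phi
```

EI is largest where either the predicted mean is low (exploitation) or the predictive spread is high (exploration), which is exactly the balance described above.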

For Bayesian optimization, other acquisition functions include gradient, entropy and predictive entropy; see Frazier (2018) for a more comprehensive review. Bayesian optimization techniques have also been developed for optimizing functions with black-box constraints (Gramacy et al., 2016, Picheny et al., 2016).

2.2.5 Batch sequential design

Based on the number of new design points added in each iteration, sequential designs can be divided into pure sequential designs and batch sequential designs. Most sequential design methods adopt single-point selection, since they select the new point by solving an auxiliary optimization problem (Liu et al., 2018). Instead of a single point, batch sequential design methods sample a batch of M > 1 new points in each design step. This naturally favors parallel deployment of subsequent runs, which can greatly improve computational efficiency in expensive computer experiments. However, adding multiple points per iteration is a more challenging task. We cannot simply evaluate and rank candidate sites with the objective function used in pure sequential designs, because doing so would cause points in the new batch to overlap. Adapting the old criteria to optimize M points simultaneously is one solution, but the dimension of the auxiliary optimization problem becomes M times greater than before.

Non-adaptive batch sequential designs are largely based on space-filling criteria (e.g. Duan et al., 2017, Loeppky et al., 2010, Williams et al., 2011). For adaptive batch designs, Ginsbourger et al. extended canonical EI to a multipoint version and assessed the performance of a sequential optimization procedure which maximizes the usual EI at each iteration. Chevalier et al. (2014) provided practical implementations of multipoint kriging-based infill sampling criteria. Gramacy and Lee achieved asynchronous evaluations in parallel by hybridizing treed sequential maximum entropy designs with ALM/ALC designs. Under GP surrogate modeling, Erickson et al. select the new batch by maximizing the expected value of the gradient norm squared; they take a space-filling design as the candidate set to reduce the computational cost of the optimization.

Having less cumulative information and facing a harder optimization problem, batch sequential designs do not necessarily perform better than pure sequential designs. But the merit of this framework is that subsequent evaluations are parallelizable. Batch size M is usually selected based on available computing resources (number of processors).

Chapter 3

Distance-distributed Design for Gaussian Process Surrogates

Computer simulation experiments are widely used in the applied sciences to simulate time-consuming or costly physical, biological, or social dynamics. Depending on the dynamics being simulated, these experiments can themselves be computationally demanding, limiting the number of runs that can be entertained. Design and meta-modeling considerations have spawned a research area at the intersection of spatial modeling, optimization, sensitivity analysis, and calibration. Santner et al. (2003) provide an excellent review.

Gaussian process (GP) surrogates, originally for interpolating data from deterministic computer simulations (Sacks et al., 1989), have percolated to the top of the hierarchy for many meta-modeling purposes. GP surrogates are fundamentally the same as kriging from the spatial statistics literature (Matheron, 1963), but generally applied in higher dimensional (i.e., > 2d) settings. They are preferred for their simple, partially analytic, nonparametric structure. GPs' out-of-sample predictive accuracy and coverage properties are integral to diverse applications such as Bayesian optimization (BO, Jones et al., 1998), calibration (Higdon et al., 2004, Kennedy and O'Hagan, 2001), and input sensitivity analysis (Saltelli et al., 2008). Although there are many variations on GP specification, Chen et al. (2016) nicely summarize how such nuances often have little impact in practice.

On the other hand, Chen et al. cite experimental design as playing an out-sized role. Despite


GPs’ elevation to “canonical” status as surrogates, there has not been quite the same degree of confluence in how to design a computer experiment for the purpose of such modeling. In part this is simply a consequence of different goals emitting different criteria for valuing, and thus selecting, inputs. An exception may be the general agreement that it is sensible, if possible, to proceed sequentially, either one point at a time or in batches. An underlying theme for static (all-at-once) design, or for seeding a sequential design, has been to seek space-fillingness, where the selected inputs are spread out across the study space. For a nice review, see Pronzato and Müller (2011).

There are many ways in which a design might be considered space-filling. Maximin-distance and minimax-distance design (Johnson et al., 1990) are two common approaches based on geometric criteria. A maximin design attempts to make the smallest distance between neighboring points as large as possible; conversely, minimax attempts to minimize the maximum distance. A common variation on maximin is φp (Morris and Mitchell, 1995),

φp = [ Σ_{k=1}^{K} J_k d_k^{−p} ]^{1/p},

where d_k is one of the K unique pairwise distances in a design and J_k is the number of pairs at that distance. In most applications, K = (n choose 2) and all J_k = 1. Designs obtained by minimizing φp are actually maximin for all p, i.e., the smallest distance min_k d_k is maximized. At p → ∞ the equivalence is immediate; φp designs for smaller p have greater spread in the smaller distances (d_(−k)).

Alternatively, one may desire a design that spreads points evenly across the range of each individual input, i.e., where projections on each dimension are still space-filling. Maximin and minimax designs do not produce such an effect; in fact, they can be pathologically bad in this regard. Latin hypercube sampling (LHS, Mckay et al., 1979) can guarantee this one-dimensional uniformity property. For a nice review of LHS and other space-filling designs for computer experiments, see Lin and Tang (2015).

Space-filling designs intuitively work well when prediction accuracy is of primary interest, seeking coverage everywhere one might want to predict. However, it is easy to show [as we do in Section 3.2] that space-filling designs are inefficient for learning GP hyperparameters, discussed in further detail in Section 3.1. It turns out that a random uniform design is actually better than maximin, φp and LHS in that setting, echoing a rule-of-thumb from variogram estimation with lattice data in geostatistics (Zhao and Wall, 2004).

Considering that GP predictive prowess depends upon hyperparameterization, good prediction results must tacitly depend upon fortuitously chosen hyperparameters. If good settings are indeed known, then model-based design represents an attractive alternative to (model-free) space-filling design. Example criteria include maximizing the entropy between prior and posterior (maximum entropy design), minimizing the integrated mean-squared prediction error (IMSPE, Santner et al., 2003, Chapter 6), and Fisher information (Zimmerman, 2006). These lead to nice sequential extensions, alternating between design and learning stages. However, such schemes can suffer when initialized poorly. Seemingly optimal choices of seed design or hyperparameters can lead to pathologically poor performance.

Here we propose a new class of designs that attempts to resolve that chicken-or-egg problem. GP correlation structures are typically built upon scaled pairwise distance calculations, so we hypothesize that certain sets of pairwise distances offer a more favorable basis for estimating those scales: so-called GP lengthscale hyperparameters. The spirit of our study is similar to that of Morris (1991), but we take a more empirical approach and ultimately provide a message that is more upbeat. Quite simply, we observe the empirical distribution of pairwise distances of random designs, which are better than space-filling ones for the purpose of lengthscale estimation. We then parameterize those distributions within the Beta(α, β) family, and propose a numerical optimization scheme to tune (α, β) to design size n and input dimension d. In this way, our methodology can be seen as a more aggressive and constructive variation on Zhao and Wall's study for variograms.

Despite sacrificing positional space-fillingness for relative distance-fillingness in order to target hyperparameter estimation, we show that "betadist" designs still perform favorably in prediction exercises. Inspired by Morris and Mitchell (1995)'s hybridization of LHS and maximin, we propose hybridizing LHS with betadist designs to strike a balance between space- and distance-filling toward even more accurate prediction.

The remainder of the chapter is organized as follows. In Section 3.1 we review GP modeling and design details pertinent to our methodological contribution. Section 3.2 demonstrates how space-filling designs fall short in certain respects, and proposes distance-based remedies based on reverse engineering qualities of the best random designs. Section 3.3 explores hybrids of these betadist designs with LHS. Illustrative examples and empirical comparisons are provided throughout. Section 3.4 provides a comprehensive empirical validation in two disparate sequential design settings, where betadist, LHS hybrids and comparators are used to build initial/seed designs.

3.1 Setup and related work

Here we review essentials as a means of framing our contributions, establishing notation, and connecting to related work on design and modeling for computer experiments.

3.1.1 Gaussian Process surrogates

Let f : R^d → R denote an unknown function, generically, but standing in specifically for a computationally expensive computer model simulation. There is interest in limiting the evaluation of f, so one designs an experimental plan of runs with the aim of fitting a meta-model, e.g., a Gaussian process (GP), which can be used as a surrogate in lieu of future expensive evaluations. Let X = {x1, . . . , xn} denote the chosen d-dimensional design, and

let Y = (y1, . . . , yn)^⊤ collect outputs yi = f(xi), for i = 1, . . . , n.

Here we make the common simplifying assumption that the computer model, f, is deterministic. In this work, we focus on the isotropic Gaussian kernel. The setup, inference and implementation details can be found in Section 2.1.1. Although we assume this structure throughout for simplicity, we see no reason why our proposed methodology (which emphasizes design, not modeling) could not be extended to other correlation families, or to the stochastic (f + ε) setting via additional hyperparameters.

3.1.2 Thinking about designs for GPs

The prediction equations (2.2) suggest a space-filling training design for X since σ2(x), for testing x, is quadratically related to distances to nearby xi locations through k(x). However that tacitly assumes the hyperparameters, particularly the lengthscale θ, are known. Where is a good θ supposed to come from? While we acknowledge that it is sometimes possible to intuit reasonable values or ranges for θ, based on knowledge of the underlying dynamics being modeled, such cases are rare in practice, and useless as a default modus operandi, e.g., in software. Thus our presumption is that θ must be learned from data, which requires a design. Intuitively, a space-filling design is poor for such purposes since its deliberate inability to furnish short distances biases inference toward longer lengthscales.

Sequential design, iterating between design and learning, has been suggested as a remedy. Yet space-filling design is still common in initial stages. For example, Tan (2013) writes "minimax designs are intended to be initial designs for computer experiments, which are almost always sequential in nature". While we agree with the spirit of that statement, we disagree that spreading out the points is the best way to seed this process. The reason is that subsequent sequential selections are usually model-based, e.g., via σ2(x), and thus hyperparameter-sensitive. Note that in sequential application, both IMSPE and maximum entropy-based designs are about predictive variance. The former minimizes integrated variance; the latter maximizes it directly. One must be careful not to introduce a feedback loop where sequential decisions reinforce bad hyperparameters.

One way out of that vicious cycle is to utilize geometric rather than model-based criteria for sequential selection, e.g., with cascading LHSs (Lin et al., 2010). However, if the design goal is not directly prediction-based, such as in BO (Jones et al., 1998), that approach is clearly inefficient. Plus in the BO literature, regularity conditions underlying the theory for convergence (to global optima) insist on fixed hyperparameterization. This is specifically to avoid pathological settings arising from feedback between sequential acquisition and inference calculations (Bull, 2011).

Perhaps our main thesis is that initial design for hyperparameter learning is paramount to obtaining robust (good) behavior in repeated application. While some space-filling designs are better than others in this context, we observe that it is important to be filling in a different sense. Inference for hyperparameters via the likelihood involves pairwise inverse distances xi − xj through Kij. Therefore, it could help to be more filling in that dimension. As we show in Section 3.2, simple random uniform designs are actually better than the typical maximin and LHS alternatives, sometimes substantially so. Intuitively, this is because random designs lead to a less clumpy, more unimodal, distribution of relative distances compared to maximin, for example. [See Figure 3.3 and surrounding discussion.] Based on the outcome of that study, we speculated that having a uniform distribution of such pairwise distances—as opposed to uniform in position—would fare even better.

That intuition turned out to be incorrect. However initial investigations pointed to a promis- ing class of alternatives, targeting a more refined choice of desirable pairwise distance dis- tributions. Although the strategy we propose imminently is novel in the context of design and analysis of computer simulation experiments, it is not without precedent in the spatial statistics literature, where variogram-based inference is, historically, at least as common as likelihood-based methods (see, e.g., Cressie, 1985, Russo, 1984). Out of that literature came the rule-of-thumb that at least thirty pairs of data points should populate certain distance strata. Morris (1991) subsequently revised that number upwards, accounting for spatial correlations which devalue information provided by nearby pairs.

The spirit of our contribution is similar to these works, although we shall make no recommendations about design size. Suggestions along these lines in the computer experiments literature, such as n = 10d (Loeppky et al., 2009), have been met with mixed reviews, never mind that the nuance of arguments behind that particular suggestion is often forgotten. Instead, presuming small fixed (initial) design sizes, we target the search for coordinates with desirable qualities for lengthscale estimation. Our first idea ignores position information entirely, focusing expressly on pairwise distances. We later revise that perspective to hybridize with LHS and acknowledge that a degree of space-fillingness may be desirable when the over-arching modeling goal is oriented toward prediction.

3.2 Better than random

Consider the following simple experiment in the input space [0, 1]^d, for d = 2, 3, 4, 5, 6, taken in turn. For thirty equally spaced "true" lengthscales θ(t) ∈ (0.1, √d], for t = 1, . . . , 30, we generate i = 1, . . . , 1000 designs X(t,i) of size n = 2^(d+1) and simulate Y(t,i) ∼ N(0, Kn). Entries of Kn are calculated as in Equation (2.1) via the rows of X(t,i) and hyperparameters τ² = 1 and θ(t).¹ Several design criteria are discussed shortly. For each (t, i), MLEs θ̂(t,i) are calculated from data (X(t,i), Y(t,i)). Finally, we collect average squared discrepancies between estimated and true lengthscales via logMSEt = log{ Σ_{i=1}^{1000} (θ̂(t,i) − θ(t))² }.
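A single (t, i) cell of this experiment can be sketched as follows, with the concentrated (τ² profiled out) log-likelihood maximized over a θ grid. Details such as the grid resolution and jitter are illustrative choices, not those of the actual study:

```python
import numpy as np

def neg_conc_loglik(theta, X, y, jitter=1e-8):
    # negative concentrated GP log-likelihood (up to constants), tau^2 profiled out
    n = len(y)
    D2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    K = np.exp(-D2 / theta) + jitter * np.eye(n)
    tau2_hat = y @ np.linalg.solve(K, y) / n
    return 0.5 * (n * np.log(tau2_hat) + np.linalg.slogdet(K)[1])

rng = np.random.default_rng(0)
d, n, theta_true = 2, 16, 0.4
X = rng.random((n, d))                                 # one random design
D2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-D2 / theta_true) + 1e-8 * np.eye(n)
y = np.linalg.cholesky(K) @ rng.standard_normal(n)     # Y ~ N(0, K_n)
grid = np.linspace(0.05, np.sqrt(d), 30)               # candidate lengthscales
theta_hat = grid[np.argmin([neg_conc_loglik(t, X, y) for t in grid])]
```

Repeating this over many designs X and taking log mean squared error of θ̂ against θ(t) reproduces the quantity plotted below; with n = 16 any single θ̂ can be far from the truth, which is exactly the variability the experiment averages over.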

[Figure: left panel titled "d=3, n=16" plots log(MSE) versus true theta for comparators maximin, minphi2, lhs, random, unifdist, beta and lhsbeta; right panel titled "De-trending" plots standardized residuals versus true theta.]

Figure 3.1: logMSEs from design experiment and de-trending surface.

As an example of the logMSEs obtained, the left panel of Figure 3.1 shows the (d = 3, n = 16) case. The first thing to notice in that plot is that as θ(t) increases so does logMSEt, for all design methods. Apparently, it is "harder" to accurately estimate lengthscales θ as they become longer. Harder is in quotes because this metric obscures the relative performance of the design methods, although some consistently stand out as worse (maximin/black circles) or better (beta/pink squares or lhsbeta/yellow squares) than others. To level the playing field for subsequent analysis, we calculated standardized residuals using a de-trending surface estimated from all of the dots, taken together. To cope with the outliers we fit a heteroskedastic Student-t GP as described by Chung et al. (2018) and implemented in the hetGP package (Binois and Gramacy, 2018). Section 3.2.2 provides further details on our use of hetGP in this context. Standardized residuals (rt = (logMSEt − µt)/σt, with µt and σt from hetGP) are shown in the right panel of the figure.

¹Here we consider an isotropic kernel, i.e., θ1 = · · · = θd = θ.

Figure 3.2 shows boxplots of these standardized logMSEs, marginalizing over the θ(t)s, for all five experiments d ∈ {2, 3, 4, 5, 6}. The number written on each boxplot resides at the position of the mean of that comparator, and indicates the relative rank of that mean. To help quantify relative comparisons, the final panel provides the outcome of pairwise paired t-tests, with pairing determined by adjacent ranks: best vs. second best, etc. First consider the "Common designs" block, including boxplots of logMSEs for maximin, minphi2 (φ2), LHS and random designs. Although the final panel does not include a p-value for LHS or random vs. maximin when d = 2, 3, because neither is ranked adjacently with maximin, it is quite clear these beat maximin, which consistently beats minphi2. LHS and random, on the other hand, offer quite similar results.

Observe that the four “Common designs” follow a similar ranking for all d ≤ 5. However when d = 6 maximin and minphi2 are better than LHS and random. This happens because maximin’s (and φp’s) pathologies are partly corrected in higher dimension. These designs push sites to the corners of the input hyperrectangle. As dimension grows the diversity of distances between corners increases. This helps MSE, but only coincidentally. Deliberate diversity via unifdist and betadist is still better.

The outcome of this experiment, including just those four common designs as comparators,

[Figure: five panels of standardized log(MSE) boxplots, one per input dimension — "log(MSE) comparison" for (d=2, n=8), (d=3, n=16), (d=4, n=32), (d=5, n=64) and (d=6, n=128) — each divided into "Common designs", "Distance designs" and "Hybrid" blocks over methods maximin, minphi2, lhs, random, unifdist, beta and lhsbeta. The bottom-right panel tabulates lower-tail paired t-test p-values by rank:

rank   2d        3d        4d        5d        6d
2      9.78e-4   0.406     2.58e-4   2.90e-5   5.58e-3
3      1.29e-4   7.92e-8   1.16e-7   4.61e-7   0.213
4      0.388     0.151     7.14e-3   0.134     9.03e-8
5      3.02e-3   0.287     0.415     0.446     6.64e-5
6      2.06e-5   1.08e-7   5.35e-4   3.51e-9   0.261
7      1.08e-5   0.0410    4.72e-6   1.06e-6   0.123 ]
Figure 3.2: Standardized logMSE boxplots over thirty gridded θ(t) values for seven comparators using n = 2^(d+1) over input dimension d ∈ {2, 3, 4, 5, 6}. The comparators are described in the text. Two outlying standardized log MSE values were clipped by the y-axes to enhance boxplot viewing: random (d = 4) at 10.9 and LHS (d = 6) at 17.4. The bottom-right panel provides p-values for lower-tail paired t-tests comparing adjacent performers as ranked by their mean logMSE from best (top) to worst (bottom).

sparked our search for alternatives. It is perhaps surprising that a purely random design is at least as good for hyperparameter estimation as more thoughtful alternatives like maximin and LHS. The following subsections describe our journey towards improved designs, ultimately outlining details behind the other comparators in Figure 3.2.

3.2.1 Uniform to beta designs

Intuitively, random and LHS designs are better than maximin for lengthscale (θ) inference because they result in a less adversarial distribution of pairwise distances. Maximin designs are calculated to ensure there are no small pairwise distances. Consequently, the distance distribution is multimodal: there are many distances near that minimum, with the rest occurring at "lower harmonics" (multiples of that minimal distance). Figure 3.3 offers a visualization. Random and LHS designs do not preclude small relative distances, although the latter does enforce a degree of uniformity in position. Both tend to yield distance distributions which are unimodal. Figure 3.3 demonstrates this for a subset of random designs, which will be discussed in more detail momentarily. The situation is similar for LHS, which we shall revisit in Section 3.3.

Algorithm 2 MC calculation of size n in [0, 1]^d targeting distance distribution F.

Init: Fill X with a random design of size n, i.e., x_i ~ Unif[0, 1]^d iid, i = 1, ..., n.
for s = 1, ..., S do
    Select an index i ∈ {1, ..., n} at random.
    Generate x'_i ~ Unif[0, 1]^d.
    Propose new design X' as X with x_i swapped with x'_i.
    if KSD(X', F) < KSD(X, F) then
        x_i ← x'_i in the i-th row of X, i.e., accept X ← X'.
    end if
end for
Return: n × d design X.

Chapter 3. Distance-distributed Design for Gaussian Process Surrogates 32
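Algorithm 2 is easy to prototype. Below is a minimal sketch in Python (the chapter's own implementation is in R, using a stripped-down `ks.test`; the function names here are ours), shown targeting a uniform distance distribution on (0, √d], as in unifdist:

```python
import math, random

def pairwise_dists(X):
    # sorted pairwise Euclidean distances among the rows of X
    n = len(X)
    return sorted(math.dist(X[i], X[j])
                  for i in range(n) for j in range(i + 1, n))

def ksd(X, cdf):
    # Kolmogorov-Smirnov distance between the empirical CDF of X's
    # pairwise distances and a target CDF
    d = pairwise_dists(X)
    m = len(d)
    return max(max(abs((k + 1) / m - cdf(x)), abs(k / m - cdf(x)))
               for k, x in enumerate(d))

def dist_design(n, d, cdf, S=1000, seed=None):
    # stochastic swap search (Algorithm 2): greedily accept row
    # replacements that reduce KSD against the target distribution
    rng = random.Random(seed)
    X = [[rng.random() for _ in range(d)] for _ in range(n)]
    best = ksd(X, cdf)
    for _ in range(S):
        i = rng.randrange(n)
        old = X[i]
        X[i] = [rng.random() for _ in range(d)]  # propose a new row
        prop = ksd(X, cdf)
        if prop < best:
            best = prop       # accept
        else:
            X[i] = old        # reject: restore the old row
    return X, best
```

For a betadist target, `cdf` would instead be a Beta CDF rescaled to [0, √d] (R's pbeta serves this role).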

The experimental outcomes just described got us thinking about desirable distance distributions for lengthscale estimation. We speculated that it could be advantageous to have a uniform distance design (unifdist), so that all distances were represented—or as many as possible up to the desired design size n. Throughout we presume inputs have been scaled to [0, 1]^d, and restrict the search for lengthscales to θ ∈ (0, √d]. So when we say uniform, or any other distribution, we mean Unif(0, √d].² To calculate a design whose distribution of pairwise distances resembles a reference F, we follow the pseudo-code provided by Algorithm 2, which is based on S stochastic swap proposals that are accepted or rejected via Kolmogorov–Smirnov distances (KSD) against F. In our examples we fix S = 10^5 and utilize a faster, custom implementation of KSD based on isolating the $statistic output of the built-in ks.test function in R. Besides being stochastic, the search is greedy, which means that it only guarantees local convergence as S → ∞. Nevertheless we find that in practice it furnishes empirical pairwise distance distributions close to the target F. There is little benefit in restarting the algorithm to search for a more global optimum.

Unfortunately, our intuition about unifdist designs didn't completely match our results. As summarized along with our earlier RMSE comparison in Figure 3.2, unifdist designs are better than maximin, but worse than LHS and random. This outcome prompted a more careful investigation into why random designs work so well.

Consider the lines in Figure 3.3 labeled “1–50”, representing the empirical density of distances among the random designs whose log MSE was among the fifty best in a large Monte Carlo (MC) exercise. Observe that this density is unimodal, having more small distances than maximin and very few really large distances. The solid red curve in the figure is a Beta(2.5, 4) density scaled to [0, √2] as a representative example of a parametric distribution similar to that of those best random distances.

²Our MLE calculations restrict θ to be greater than the square-root of machine precision, which is near 1e-8 on most machines.

[Figure 3.3 image: empirical density versus pairwise distance for betadist(2.5, 4), the best “1–50” random designs, and maximin.]

Figure 3.3: Empirical density curves corresponding to random designs in 2d with lowest 50 logMSE(θ) values from 1000 random design realizations. Empirical maximin and Beta(2.5, 4) densities are shown for comparison.

Unifdist designs, which are not shown in the figure, target a flat line across the [0, √2] domain. Unifdist outperforms maximin, but not the best (or even the typical) random designs. This suggests that while having more short distances is desirable, having many distances at the extremes—both large and small—may not be helpful on average. As the results in Figure 3.2 show, having Beta-distributed distances, focusing the distribution on mid–low-range pairwise distances, leads to statistically significant improvements over random in all three cases. In fact, these “betadist” designs (being ranked 2 or 1) are the only ones in that figure whose log MSEs are statistically better (see p-values in the lower-right panel) than all other designs of lower rank.

Although Figure 3.3 suggests that a Beta(2.5, 4) is a good target distribution for a betadist design, that was not the specification used to generate all results summarized in Figure 3.2. The best setting of shape parameters, (α̂, β̂) in Beta(α, β), depends on dimension d and design size n, as we explore below. However, it is worth noting that Beta(2.5, 4) does generally perform well because, as we show, the set of decent (α, β) values is relatively big, and does not vary substantially in n and d. But it is not so big that one can choose arbitrarily.
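When targeting Beta-distributed distances with Algorithm 2, the reference F is a Beta(α, β) CDF rescaled to [0, √2] in 2d. A rough sketch of such a target, assuming simple trapezoid-rule integration of the density rather than a proper incomplete-beta routine (which R's pbeta provides):

```python
import math

def beta_cdf(x, a, b, scale, n=400):
    # CDF of a Beta(a, b) random variable rescaled to [0, scale],
    # via trapezoid-rule integration of the density (illustrative only)
    if x <= 0.0:
        return 0.0
    if x >= scale:
        return 1.0
    u = x / scale
    B = math.gamma(a) * math.gamma(b) / math.gamma(a + b)  # Beta function
    h = u / n
    total = 0.0
    for k in range(n + 1):
        t = k * h
        w = 0.5 if k in (0, n) else 1.0  # trapezoid endpoint weights
        total += w * t ** (a - 1) * (1 - t) ** (b - 1)
    return min(1.0, total * h / B)
```

Passing `lambda x: beta_cdf(x, 2.5, 4, math.sqrt(2))` as the target CDF to the swap-search sketch above would then steer a design toward Beta(2.5, 4)-distributed distances.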

3.2.2 Optimization of shape parameters of betadist design

Here we view the choice of betadist parameterization, α̂ and β̂ in Beta(α, β), for particular design size n in input dimension d, as an optimization problem. I.e., we wish to automate the search for betadist_{n,d}(α̂, β̂). Discussion around Figure 3.1 indicates that a degree of de-trending will be required in order not to over-emphasize larger θ settings in the optimization criteria. To address this, we seek (α̂, β̂) = argmin_{α,β} deRIMSE_{n,d}(α, β), where the criterion deRIMSE is defined following a scheme similar to that described around Figure 3.1.

Begin by establishing a regular grid of θ values (θ^(1) = 0.1, ..., θ^(T) = √d), just like in Figure 3.1. Next, generate one pair (α, β) ~ Unif(1, 10)² and use these to create D designs X_n^(i) ~ betadist_{n,d}(α, β), for i = 1, ..., D, following Algorithm 2. Averaging over more random (α, β) will be described momentarily. For each X_n^(i) and each θ^(t), generate random responses Y_n^(t,i) from the GP MVN implied by (X_n^(i), θ^(t)) and estimate θ̂^(t,i) via MLE. Finally, calculate

RMSE^(t) = sqrt( (1/D) Σ_{i=1}^{D} (θ̂^(t,i) − θ^(t))² )

to estimate the accuracy of those MLE calculations for each t = 1, ..., T. Then draw new (α, β) ~ Unif(1, 10)², yielding RMSE^(t,r), repeating the entire scheme above R times, i.e., for r = 1, ..., R. In our empirical work, we chose D = 5 and R = T = 30.

Next, take pairs (θ^(t), {RMSE^(t,r)}_{r=1}^{R}) as T × R observations of the quality of lengthscale estimation – RMSE dynamics – across θ-space and fit a Student-t hetGP to these observations, yielding a surrogate described by mean μ_t ≡ μ(θ^(t)) and σ_t² ≡ σ²(θ^(t)). Now we are ready to define the criterion deRIMSE_{n,d}(α, β) as

deRIMSE(α, β) ≡ (1/T) Σ_{t=1}^{T} (RMSE^(t)(α, β) − μ_t) / σ_t,

where RMSE^(t)(α, β) is calculated just as described above with the specific (not random) settings of (α, β) in question.
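In code, given simulated RMSE values on the θ grid and the fitted trend (μ_t, σ_t) from the Student-t hetGP surrogate, the criterion is just an average of standardized residuals. A sketch (the function name is ours):

```python
def derimse(rmse_t, mu_t, sigma_t):
    # de-trended RIMSE: average standardized RMSE residual over the theta grid,
    # so that large-theta regions do not dominate the optimization criterion
    T = len(rmse_t)
    return sum((r - m) / s for r, m, s in zip(rmse_t, mu_t, sigma_t)) / T
```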

As a warmup experiment toward solving that optimization problem, consider n = 16 and d = 2. We built a size 200 LHS design of (α, β) settings in [1, 10]2 with 5 replicates on each for a total of 1000 evaluations of deRIMSE. The bottom end of that region, α, β ≥ 1, was chosen to limit the search to unimodal beta distributions; the top end of 10 was chosen based on a smaller pilot study. Each deRIMSE evaluation took about 50 seconds, leading to almost 14 hours of total simulation time.

Figure 3.4 shows the design (dots) and fitted surface of deRIMSE values obtained with hetGP, i.e., treating deRIMSE simulation as a stochastic computer experiment and fitting a surrogate to a limited number of evaluations. Outliers are less of a concern when averaging over θ^(t)-values, so there was no need to include Student-t features in this regression. However, accommodating a degree of heteroskedasticity and leveraging replication in the calculations were essential to obtain a good fit in a reasonable amount of time (Binois et al., 2018b). The blue square, at about (α̂, β̂) = (3, 6.5) in the figure, shows where the predictive surface is minimized; the green and purple contours outline regions wherein predicted deRIMSE values are within 5% and 10% of that best setting.

Fourteen hours of simulation in order to choose the characteristics of a random design is rather extreme. However, once done for a particular choice of covariance structure, design


Figure 3.4: deRIMSE surface with T = 1000 for n = 16 and d = 2 as estimated by hetGP. Dots show the design sites; lighter (heat) colors correspond to higher deRIMSEs.

size n and dimension d, it need not be re-done. Still, finding appropriate designs in higher dimension, with more runs to fill out the larger volume, could be computationally daunting. Doubling n, for example, would result in more than double the computational effort.

For a more thrifty approach we turn to BO via EI. The idea is to replace a space-filling evaluation with a sequential design strategy that targets the minimum of the mean of deRIMSE. For a given (n, d)-setting, the setup is as follows. Begin by performing deRIMSE calculations on a maximin design of size twenty, with ten replicates at each setting, and by fitting a hetGP to those realizations, deriving a predictive surface. Then comes the so-called BO acquisition. Based on hetGP posterior predictive equations described by mean μ(x) and standard deviation σ(x), where x = (α, β) in this case, numerically optimize EI(x):

EI(x) = (μ_min − μ(x)) Φ((μ_min − μ(x)) / σ(x)) + σ(x) φ((μ_min − μ(x)) / σ(x)),   (3.1)

where μ_min = min_x μ(x) and Φ and φ are the standard Gaussian cdf and pdf, respectively.
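Equation (3.1) is straightforward to evaluate. A self-contained sketch, using only the standard library (no SciPy), with a conventional σ(x) = 0 fallback that is our addition:

```python
import math

def norm_pdf(z):
    # standard Gaussian density phi(z)
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def norm_cdf(z):
    # standard Gaussian CDF Phi(z), via the error function
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def expected_improvement(mu_x, sigma_x, mu_min):
    # EI for minimization, Eq. (3.1): expected improvement over the
    # current best predicted mean mu_min at a site with mean mu_x
    if sigma_x <= 0.0:
        return max(mu_min - mu_x, 0.0)  # degenerate (noise-free) case
    z = (mu_min - mu_x) / sigma_x
    return (mu_min - mu_x) * norm_cdf(z) + sigma_x * norm_pdf(z)
```

At a candidate whose predicted mean matches μ_min, EI reduces to σ(x)·φ(0), reflecting value from uncertainty alone.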

After (a) solving x* = argmax_x EI(x), which we accomplish using a hybrid of discrete search over replicates and continuous multi-start R–optim-based search with method="L-BFGS-B"; (b) simulating y* = deRIMSE(x*); and (c) incorporating the new data pair into the design and updating the hetGP model fit; the process repeats (back to (a)).
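The (a)–(b)–(c) cycle can be expressed as a generic skeleton, where the callables stand in for EI maximization, deRIMSE simulation, and the hetGP refit (all names here are placeholders of ours):

```python
def bo_loop(acquire, simulate, update, model, budget):
    # (a) acquire: maximize EI under the current surrogate to get x*
    # (b) simulate: run the stochastic experiment at x* to get y*
    # (c) update: augment the data with (x*, y*) and refit the surrogate
    for _ in range(budget):
        x_star = acquire(model)
        y_star = simulate(x_star)
        model = update(model, x_star, y_star)
    return model
```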

For details on EI and BO see Jones et al. (1998) and Chapter 6.3 of Santner et al. (2003). Snoek et al. (2012) offer a somewhat more modern machine learning perspective centered around the use of BO for estimating hyperparameters of deep neural networks. Our use here—to tune a design—is related in spirit but distinct in form. In fact, the setup we propose is fractal. It solves a design problem (for estimating lengthscale) with the solution to another design problem: for function minimization. One could argue that our choice of an initial maximin design for BO is sub-optimal, and we will do just that in Section 3.4.2.

For a set of representative n and d, we allowed our BO scheme to collect an additional 600 deRIMSE simulations. The resulting selections, overlaid with the final predictive mean surface from hetGP, the best value of (α̂, β̂), and 5% and 10% contours, are shown in Figure 3.5. Several noteworthy patterns emerge from the panels in the figure. First, although some of the surfaces appear to be multimodal, or at least to have ridges of low deRIMSE values, there is usually a setting with relatively low (α, β) which works well. Sometimes a larger setting is predicted as optimal; but there is usually an alternate setting, reported as (α̃, β̃) in the figure, which is almost as good (within 5%).

These “near-optimal” (α̃, β̃) were used in our betadist designs, and subsequent boxplots and p-value calculations, in Figure 3.2. They are re-used throughout the remainder of the chapter in our empirical work [Sections 3.4.1–3.4.2], and likewise with the hybrid lhsbeta designs discussed momentarily. Although the computational demands are still sizable even with the more thrifty BO, these designs are “up-front”. Once saved, as we do for the nine

[Figure 3.5 image: nine BO-selection panels; the optimal (α̂, β̂) and near-optimal (α̃, β̃) values from each panel are listed below.]

n = 8, d = 2:    α̂ = 1.62, β̂ = 5.26;   α̃ = 1.5, β̃ = 5
n = 16, d = 2:   α̂ = 2.4, β̂ = 5;       α̃ = 2, β̃ = 4
n = 16, d = 3:   α̂ = 3.02, β̂ = 6.38;   α̃ = 2.5, β̃ = 5
n = 32, d = 3:   α̂ = 3.48, β̂ = 10;     α̃ = 3, β̃ = 5
n = 32, d = 4:   α̂ = 2.44, β̂ = 9.06;   α̃ = 1.5, β̃ = 3.5
n = 64, d = 4:   α̂ = 3.15, β̂ = 6.53;   α̃ = 3, β̃ = 6
n = 64, d = 5:   α̂ = 2.36, β̂ = 5.92;   α̃ = 2, β̃ = 6
n = 128, d = 5:  α̂ = 1.51, β̂ = 3.5;    α̃ = 1, β̃ = 3
n = 128, d = 6:  α̂ = 3.52, β̂ = 8.01;   α̃ = 2, β̃ = 4

Figure 3.5: Outcomes of BO of RIMSE surfaces for various choices of n and d. Numbers show location and number of replicates in acquisitions; blue squares show (α̂, β̂); purple and green contours show 5% and 10% from the optimal.

choices above, no recalculation is required.

3.3 Hybrid betadist and LHS

Having a betadist design, which provides better estimates of hyperparameters like the lengthscale θ, is advantageous only insofar as the resulting surrogate fits, i.e., their predictive equations (2.2), are accurate. Since GP surrogates are inherently spatial predictors, practitioners have long preferred designs which fill the space, so that those sites may serve as nearby anchors to good out-of-sample predictive performance. Betadist designs space-fill less than common alternatives, both quantitatively (i.e., via the maximin criterion) and qualitatively (since they are inherently random). Thus they hold the potential to be inferior as predictive anchors. Yet in our empirical work, we have only been able to demonstrate this negative result (not shown here) when good hyperparameter settings are known. Betadist shines brightest in sequential application [Section 3.4], where the impact of early estimates of hyperparameters can have a substantial effect—exceptionally deleterious in pathological cases—on subsequent design decisions in several common situations.

Still, betadist designs consider only relative distance, completely ignoring position except that the points lie in the study area. Among more-or-less equivalent optimal betadist designs, some may have better positional properties and thus offer better anchoring for prediction without compromising on hyperparameter quality. To explore this possibility we considered a hybrid between betadist and LHS designs. Our “lhsbeta” is similar in spirit to maximin–LHS hybrids where maximin helps avoid second-order aliasing common with LHSs, and LHS helps maximin avoid clumpy marginals. In lhsbeta, we primarily view LHS as helping betadist acquire a degree of positional preference; however, the alternate perspective of preferring LHSs with better relative distances is no less valid.

Our stochastic search strategy for finding lhsbeta designs is coded in Algorithm 3.

Algorithm 3 Hybrid F-dist–LHS via S MC iterations for a design of size n in d dimensions.

Init: Fill X with an LHS of size n in d dimensions.
for s = 1, ..., S do
    Randomly select a pair of design points x_i, x_j.
    Randomly select a dimension k ∈ {1, ..., d}.
    Propose a new design X' by swapping Latin squares L_{i,k} and L_{j,k}, producing new x'_i and x'_j after re-jittering with 2d new uniform random numbers.
    if KSD(X', F) < KSD(X, F) then
        x_i ← x'_i and x_j ← x'_j in the (i, j)-th rows of X, i.e., accept X ← X'.
    end if
end for

Like in Algorithm 2 for betadist, we presume an input space coded to [0, 1]^d. The algorithm is initialized with an LHS X, built in the canonical way (see, e.g., Lin and Tang, 2015) by first choosing d random permutations of {1, ..., n}, saved in an n × d matrix L describing the n selected hypercubes out of the n^d possible partitions of the input space, and then applying jitter in each selected cube. Each subsequent iteration of stochastic search involves randomly proposing to swap pairs of rows and columns of L, effectively swapping the pair of Latin squares without destroying the one-dimensional uniformity property, and then re-jittering that pair of points within their respective squares. That proposal is then accepted or rejected according to KSD measured against a distribution F, which in our applications is Beta(α̃, β̃) from Section 3.2. Since two types of random proposals are being performed simultaneously, compared to Algorithm 2's single random swap, we prefer an S larger by a factor of two in Algorithm 3; S = 10^5 in our empirical work.

Figure 3.6 shows a visual comparison between maximin, betadist and lhsbeta designs so constructed. The plots provide a 2d projection for the case n = 16 and d = 3. Observe that maximin's 1d margins, shown as red triangles at the axes in the left panel, are not uniform. Neither are those in the 2d projection shown as open circles. First-order aliasing is severe in

[Figure 3.6 image: three panels titled maximin, beta (2,5), and lhsbeta (2,5), each on [0, 1]².]

Figure 3.6: 2d (black circles) and 1d (red triangles) projections of three d = 3 designs, n = 16.

both projections. In the middle panel, our betadist design has a similar problem (although perhaps not to the same degree), yet we know that the distribution of pairwise distances in 3d is much better than maximin for the purpose of lengthscale inference. In the right panel the 1d and 2d margins look much better, because the sample is an LHS. Among LHSs, this lhsbeta design has a near-optimal distribution of pairwise distances for this setting (n, d). Figure 3.2 shows that lhsbeta designs are sometimes worse than ordinary betadist designs, but both are consistently better than all of the other comparators in the figure. This is perhaps not surprising because lhsbeta designs are indeed betadist designs, yet selected for an additional feature not relevant for lengthscale information: space-fillingness. As we show in two prediction-based comparisons below, lhsbeta designs are sometimes superior on those tasks.

3.4 Application to sequential design

Here we provide two applications of betadist and lhsbeta as initial designs for a subsequent sequential analysis. In both cases, these distance distribution-based designs are only engaged in a limited way, as a means of seeding the sequential procedure. Subsequent design acquisitions are then off-loaded to other criteria. Still, it is remarkable how profound the effect of this initial choice can be. A poorly chosen initial design of just ninit = 8 points, say, can be detrimental to predictive accuracy at n = 64.

3.4.1 Active Learning MacKay

First consider the so-called active learning MacKay (ALM; MacKay, 1992) method of sequen- tial design for reducing predictive uncertainty. Acquisitions are determined by maximizing the predictive variance σ2(x).
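ALM acquisition is conceptually one line: given the surrogate's predictive-variance function, pick the input where it is largest. A sketch over a finite candidate set (names are ours; in practice σ²(x) is optimized continuously, as described below):

```python
def alm_acquire(candidates, pred_var):
    # ALM: the next design site is the candidate with the largest
    # posterior predictive variance under the current GP surrogate
    return max(candidates, key=pred_var)
```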

The target of our experiment is a function f(x) observed under light noise as Y(x) = f(x) + ε, with ε ~ iid N(0, 0.01²). For f(x) we use the function

f(x) = x₁ exp{−x₁² − x₂²}, with x ∈ [−2, 4]²,

first introduced as an active learning benchmark by Gramacy and Lee (2009). We begin with an initial design of size ninit = 8, and perform 56 additional ALM acquisitions for a total of n = 64 evaluations. Along the way, root mean-squared prediction error (RMSPE) is calculated on noise-free outputs obtained on a regular 100 × 100 testing grid in the input space. For the initial design, we consider random, LHS, 2d optimal (α̃ = 2, β̃ = 5) betadist and lhsbeta designs, and maximin. Unifdist has been dropped from the comparison on the grounds that it is a sub-optimal betadist alternative. In keeping with our earlier experiments, MLE calculations limited to θ ∈ (0, √2] are updated after each sequential design acquisition. To accommodate the noisy evaluations, we augment our covariance with a nugget hyperparameter which is included in the MLE calculation via

jmleGP in the laGP package. An L-BFGS-B scheme is used to solve argmax_x σ²(x) via optim in R. Variance surfaces can be highly multi-modal, having as many maxima as design points, which is what creates the “sausage”-like shape characteristic of the error-bars produced by GP predictive equations. We deployed an n-factor sequential maximin multi-start scheme to avoid inferior local modes of the variance surface. This means that maximin is used to choose the optim initializations, in order to space out starting locations relative to each other and to the existing X_n design locations.

[Figure 3.7 image: mean RMSPE (left) and 90% RMSPE quantile (right) versus design size (10–60) for lhs, betadist, lhsbeta, maximin and random.]

Figure 3.7: RMSPE comparison of initial designs (ninit = 8) as a function of the number of subsequent sequential design iterations via ALM. Each comparator has a pair of lines: those in the left panel indicate mean RMSPE; those on the right are the upper 90% quantile.

Figure 3.7 shows the outcome of this exercise via mean RMSPE (left panel) and upper 90% RMSPE quantile (right) obtained from 1000 MC repetitions of the scheme described above. Several striking observations stand out. Betadist, lhsbeta and random perform about the same, with betadist winning out in the end. However, in early stages lhsbeta is best and random is the worst of the three. Beta-distributed distances (from betadist and lhsbeta) lead to better hyperparameter estimates than random. Yet position of design sites is more important than lengthscale quality when there is little data. After many sequential acquisitions, position is less important—ALM takes care of that—but the final results are still sensitive to the choice of the first ninit = 8 points, even though MLEs θ̂ are recalculated after each selection. Seeding the sequential design, which is often glossed over as an implementation detail, can be crucial to good performance in active learning.

Consequently, betadist, lhsbeta and random vastly outperform LHS and maximin. The trouble with these space-filling seed designs is evident in the 90% quantile, which fails to improve even after many new design sites are added. Too much spread in the initial design results in large θ̂s, which is reinforced by subsequent ALM acquisitions at the boundaries of the input space. The early behavior of maximin is particularly strange: getting worse before better, even in cases where sequential acquisitions lead to decent results. Its 90% quantile is eventually no worse than LHS's—quite poor. The fact that maximin's average RMSPE is nearly as bad suggests that maximin rarely recovers from that poor initial design.
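The sequential-maximin multi-start idea used above, spacing optim initializations away from one another and from the existing design, can be sketched greedily (function and parameter names are ours; this assumes a nonempty existing design):

```python
import math, random

def maximin_starts(existing, m, d, n_cand=500, seed=0):
    # greedily pick m start points from random candidates, each maximizing
    # its minimum distance to existing design sites and prior picks
    rng = random.Random(seed)
    cands = [[rng.random() for _ in range(d)] for _ in range(n_cand)]
    chosen = []
    for _ in range(m):
        ref = existing + chosen  # everything a new start should avoid
        best = max(cands, key=lambda c: min(math.dist(c, r) for r in ref))
        chosen.append(best)
        cands.remove(best)
    return chosen
```

Each returned point would then seed one L-BFGS-B run on the variance surface, reducing the chance that all starts fall into the same local mode.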

3.4.2 Expected improvement for optimization

Here we show that betadist and lhsbeta initial designs are also superior in a BO context similar to that used to find the best α̂ and β̂ settings in Section 3.2.2. Specifically, acquisitions are gathered via EI (3.1) using a random five-start scheme including the location of the best input setting (corresponding to μ_min) from the previous iteration. As a test function, we use the so-called Griewank function

f_d(x) = Σ_{i=1}^{d} x_i² / 4000 − Π_{i=1}^{d} cos(x_i / √i) + 1.

For visualizations and further details, including R implementation, see the Virtual Library of Simulation Experiments: https://www.sfu.ca/~ssurjano/griewank.html.
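For reference, the Griewank function is only a few lines to implement; a Python sketch (the VLSE page above provides R and MATLAB versions):

```python
import math

def griewank(x):
    # Griewank function: a sum-of-squares bowl plus an oscillatory product
    # term; the global minimum is f(0) = 0 in any input dimension
    s = sum(xi * xi for xi in x) / 4000.0
    p = math.prod(math.cos(xi / math.sqrt(i + 1)) for i, xi in enumerate(x))
    return s - p + 1.0
```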

A nice feature of the Griewank is that it is defined for arbitrary input dimension d, and is flexible about the bounds b of the inputs, x ∈ [−b, b]^d. These two settings, b and d, together determine the complexity of the response surface. The global minimum is always at the origin; however, the number of local minima grows quickly as b and d are increased. We utilize these knobs to vary the complexity of the function, in order to span a range of optimization problems. By varying the bounds b in particular, we vary the magnitude of the best lengthscale for the purpose of surrogate modeling, and thereby create a situation where an initial design is key to obtaining good performance in BO.

Our experimental setup is as follows. We consider three (ninit, d)-pairs from Figure 3.5 and track the progress of EI-based BO, measured by the lowest value of the objective found over the sequential design iterations. In each of one thousand MC repetitions, we create initial ninit-sized designs via maximin, random, LHS, betadist and lhsbeta, with subsequent acquisitions handled by EI. In accordance with the theory for convergence of EI-based BO (Bull, 2011), we do not update θ̂ after each EI acquisition, but fix it at the setting obtained immediately after the initial design. This has the benefit of accentuating the effect of the initial design, which suits our illustrative purposes. It is also more computationally efficient, leading to an O(n³) calculation rather than O(n⁴) if MLEs are recalculated regularly. However, the results are not much different under that latter alternative.

To vary the complexity of the underlying optimization problem, and thus the best effective lengthscale for the GP surrogate, we draw b ~ Unif(0, 10) at the start of each MC repetition. In so doing, each of 1000 MC repetitions targets a Griewank function having a different degree of waviness and number of local optima. By holding b fixed for each of the five initial design choices, and subsequent EI-optimizations, we create a setting wherein pairwise t-tests can be used to adjudicate between those comparators. Finally, all calculations were performed with methods built into the laGP package on CRAN. Since we observe f_d(x) without noise, no nugget hyperparameters are required. Not presuming to know the randomly generated scale b, we allow MLE calculations for θ̂ to search in a space that would be appropriate for the largest settings, θ ∈ (0, 10√d], regardless of b.
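The paired comparisons behind the tables below reduce to a t statistic on per-repetition differences. A sketch computing the statistic only (mapping it to a lower-tail p-value requires a Student-t CDF, e.g. R's pt; function names are ours):

```python
import math, statistics

def paired_t_stat(a, b):
    # paired t statistic: mean of the differences over its standard error;
    # strongly negative values favor comparator `a` when lower is better
    d = [x - y for x, y in zip(a, b)]
    n = len(d)
    se = statistics.stdev(d) / math.sqrt(n)
    return statistics.mean(d) / se, n - 1  # statistic, degrees of freedom
```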

n = 25
          maximin   LHS       random    betadist  lhsbeta
maximin   NA        0.95      0.98      > 0.99    > 0.99
LHS       0.048     NA        0.67      > 0.99    > 0.99
random    0.022     0.33      NA        > 0.99    > 0.99
betadist  < 1e-7    2e-5      8e-5      NA        > 0.99
lhsbeta   < 1e-7    < 1e-7    < 1e-7    < 1e-7    NA

n = 70
          maximin   LHS       random    betadist  lhsbeta
maximin   NA        > 0.99    > 0.99    > 0.99    > 0.99
LHS       5e-7      NA        0.89      > 0.99    > 0.99
random    < 1e-7    0.11      NA        > 0.99    > 0.99
betadist  < 1e-7    < 1e-7    2e-6      NA        > 0.99
lhsbeta   < 1e-7    < 1e-7    < 1e-7    < 1e-7    NA

Table 3.1: Pairwise t-test p-value table for (ninit = 8, d = 2) and two settings n = 25 (top table) and n = 70 (bottom). Statistically significant p-values, i.e., below 5%, are in bold.

Table 3.1 summarizes results obtained from the (ninit = 8, d = 2) case in two views: after n = 25 total acquisitions, and then after n = 70. The bolded p-values in the table(s) are below the typical 5% threshold. Observe in both cases that random and LHS designs are consistently better than maximin, but betadist is significantly better than all three. Hybrid lhsbeta outperforms all of the others. In other words, the story here is more or less the same as before. The only substantial difference is that lhsbeta outperforms betadist.

Table 3.2 summarizes results from the (ninit = 16, d = 3) case. In higher dimension, the problem is more challenging, with many more local minima. Both a bigger initial design and a larger run of EI acquisitions are necessary in order to obtain reliable results. At n = 50 the pecking order is similar: maximin, LHS, betadist, lhsbeta—all statistically significant at the 5% level. Random outperforms LHS, but not significantly so at the 5% level.

Finally, Table 3.3 summarizes the (ninit = 32, d = 4) case with n = 200 and n = 500. Except when the randomly chosen b is very small, this setting represents an extremely difficult

n = 50
          maximin   LHS       random    betadist  lhsbeta
maximin   NA        > 0.99    > 0.99    > 0.99    > 0.99
LHS       < 1e-7    NA        0.95      > 0.99    > 0.99
random    < 1e-7    5.3e-2    NA        > 0.99    > 0.99
betadist  < 1e-7    < 1e-7    < 1e-7    NA        > 0.99
lhsbeta   < 1e-7    < 1e-7    < 1e-7    1e-3      NA

n = 100
          maximin   LHS       random    betadist  lhsbeta
maximin   NA        > 0.99    > 0.99    > 0.99    > 0.99
LHS       < 1e-7    NA        > 0.99    > 0.99    > 0.99
random    < 1e-7    3e-3      NA        > 0.99    > 0.99
betadist  < 1e-7    < 1e-7    < 1e-7    NA        > 0.99
lhsbeta   < 1e-7    < 1e-7    < 1e-7    8e-3      NA

Table 3.2: Pairwise t-test p-value table for (ninit = 16, d = 3) and two settings n = 50 (top table) and n = 100 (bottom). Statistically significant p-values, i.e., below 5%, are in bold.

n = 200
          maximin   LHS       random    betadist  lhsbeta
maximin   NA        > 0.99    > 0.99    > 0.99    > 0.99
LHS       < 1e-7    NA        0.43      > 0.99    > 0.99
random    < 1e-7    0.57      NA        > 0.99    > 0.99
betadist  < 1e-7    2e-4      2e-4      NA        > 0.99
lhsbeta   < 1e-7    < 1e-7    < 1e-7    < 1e-7    NA

n = 500
          maximin   LHS       random    betadist  lhsbeta
maximin   NA        > 0.99    > 0.99    > 0.99    > 0.99
LHS       < 1e-7    NA        0.25      > 0.99    > 0.99
random    < 1e-7    0.75      NA        > 0.99    > 0.99
betadist  < 1e-7    8e-3      2e-3      NA        > 0.99
lhsbeta   < 1e-7    < 1e-7    < 1e-7    < 1e-7    NA

Table 3.3: Pairwise t-test p-value table for (ninit = 32, d = 4) and two settings n = 200 (top table) and n = 500 (bottom). Statistically significant p-values, i.e., below 5%, are in bold.

optimization with dozens of local minima. A large number of samples is required to obtain decent global BO results. The story here is very similar to Tables 3.1–3.2.

Chapter 4

IMSPE batch-sequential design

This study is motivated by a stochastic, agent-based simulator of the conservation of delta smelt fish (Rose et al., 2013), with the goal of understanding sensitivities to myriad natural variables and human interventions. Rose et al.'s simulator is slow (typically 4–6 hours for a single run). The input configuration space is large (upwards of 13 dimensions), and the response surface is nonlinear. Separating signal from noise requires a large, costly, highly distributed HPC simulation campaign and pairing with a flexible meta-model. Previous campaigns fixed random number seeds, perhaps to artificially amplify signal. Our initial study with this simulator, described in Section 5.2, suggests that in some low-noise/low-signal parts of the configuration space this shortcut is harmless. However, we observe that the response surface is heteroskedastic, and moreover noise levels can vary nonlinearly. This challenges effective design and meta-modeling – a setting that is increasingly common in simulation experiments, especially those based on agent-based models (Baker et al., 2020).

In similar situations (e.g., Bisset et al., 2009, Fadikar et al., 2018, Farah et al., 2014, Johnson, 2008, Rutter et al., 2019), but perhaps not as extreme in terms of simulator cost, input dimension, and changing variance, researchers have been getting mileage out of methods for surrogate modeling and the design and analysis of computer experiments (Gramacy, 2020, Sacks et al., 1989, Santner et al., 2018). Default, model-free design strategies, such as space-filling options like Latin hypercube sampling (LHS; McKay et al., 1979), are a good starting point but are not reactive/easily refined to target parts of the input space which

require heavier sampling. Model-based designs based on Gaussian process (GP) surrogates fare better, in part because they can be developed sequentially along with learning (e.g., Gramacy and Polson, 2011, Jones et al., 1998, Seo et al., 2000).

Until recently, surrogate modeling and computer experiment design methodology has emphasized deterministic computer evaluations, for example those arising in finite element analysis or solving systems of differential equations. Sequential design with heteroskedastic GP (HetGP) surrogates (Binois et al., 2018a) for stochastic simulations has recently been proposed as a means of dynamically allocating more runs in higher-uncertainty/higher-variance parts of the input space (Binois et al., 2018c). Such schemes are typically applied as one-at-a-time affairs – fit model, optimize acquisition criteria, run simulation, augment data, repeat – which would take too long for delta smelt. We anticipate needing thousands of runs, with several hours per run. That process cannot be fully serial.

Batch-sequential design procedures have been applied with GP surrogates (e.g., Chevalier, 2013, Duan et al., 2017, Erickson et al., 2018, Ginsbourger et al., 2010, Loeppky et al., 2010). These attempt to calculate a group of runs to go at once, say on a multi-core supercomputing node, towards various design goals. Sometimes these are called "multi-points criteria". Quasi-batch schemes, which asynchronously re-order points for an unknown number of future simulations, have also thrived in supercomputing settings (Gramacy and Lee, 2009, Taddy et al., 2009). However, none of these schemes explicitly addresses input-dependent noise like we observe in the delta smelt simulations. Here we propose extending the one-at-a-time method of Binois et al. (2018c) to a batch-sequential setting. Our goal is to design for batches of size 24 to match the number of cores available on nodes of a supercomputing cluster at Virginia Tech. Following Binois et al.'s lead, we develop a novel scheme for encouraging replicates in the batches. Replication is a tried and true technique for separating signal from noise, reducing sufficient statistics for modeling and thus enhancing computational and learning efficiency.

Our flow is as follows. Section 4.1 explains our batch-sequential acquisition strategy through an integrated mean-squared prediction error (IMSPE) criterion and closed-form derivatives for optimization, extending the one-at-a-time process from Binois et al. (2018c). Section 4.2 provides a novel and thrifty post-processing scheme to identify replicates in the new batch. Illustrative examples are provided throughout, and Section 4.3 details a benchmarking exercise against the infeasible one-at-a-time gold standard. Finally, in Section 5 the design method is applied to smelt simulations to effectively collect samples.

4.1 Batch sequential design

For a stochastic simulator with heteroskedastic noise, sampling effort would ideally concentrate on parts of the input space that are harder to model, or where more value can be extracted from noisy simulations. Binois et al. (2018c) proposed IMSPE-based sequential design with that goal in mind. The time-consuming nature of delta smelt simulations means adding one point at a time, i.e., in serial, would be slow and at odds with modern, distributed HPC capabilities. Here we propose extending Binois et al. (2018c) to batches that can fill entire compute nodes at once.

4.1.1 A criterion for minimizing variance

Integrated mean-squared prediction error (IMSPE) measures how well a surrogate model captures the input-output relationship. It is widely used as a data acquisition criterion; see, e.g., Gramacy (2020, Chapters 6 and 10). Let $\check{\sigma}_N^2(x)$ denote the nugget-free predictive variance for any single $x \in D$. IMSPE for a design $X_N$ may be defined as

$$I_N \equiv \mathrm{IMSPE}(X_N) = \int_{x \in D} \check{\sigma}_N^2(x) \, dx = \hat{\tau}^2 \int_{x \in D} \left[ c(x, x) - c(x, X_N) K_N^{-1} c(x, X_N)^\top \right] dx.$$

The integral above has an analytic expression for GP surrogates, in part because of the closed form for $\check{\sigma}_N^2(x)$. Examples involving specialized GP setups in recent literature include Ankenman et al. (2010), Chen et al. (2019), and Leatherman et al. (2017). Similar expressions do not, to our knowledge, exist for other popular surrogates like deep neural networks, say.

Binois et al. (2018c) gives perhaps the most generic and prescriptive expression for GPs, emphasizing replicates at $n \ll N$ unique inputs $\bar{x}_i$ for computational efficiency. Let $K_n$ denote the unique $n \times n$ covariance structure with entries $K_n^{ij} = c(\bar{x}_i, \bar{x}_j) + \delta_{ij} \frac{r(\bar{x}_i)}{a_i}$, where $a_i$ counts the replicates at $\bar{x}_i$. Let $W_n$ be an $n \times n$ matrix with entries comprising integrals of kernel products, $w(\bar{x}_i, \bar{x}_j) = \int_{x \in D} c(\bar{x}_i, x) c(\bar{x}_j, x) \, dx$ for $1 \leq i, j \leq n$, and let $E = \int_{x \in D} c(x, x) \, dx$, which is constant with respect to the design $X_n$. Closed forms are provided in Appendix B of Binois et al. for common kernels. Then $O(n^3)$ calculations yield

$$I_N = \mathbb{E}[c(X, X)] - \mathbb{E}[c(X, X_N) K_N^{-1} c(X, X_N)^\top] = E - \mathrm{tr}(K_n^{-1} W_n). \qquad (4.1)$$
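As a concrete check of the trace identity above, the following Python sketch (the dissertation's implementation is in R; the kernel, design sites, noise levels, and replicate counts here are made up) compares $E - \mathrm{tr}(K_n^{-1} W_n)$ with brute-force numerical integration of the predictive variance:

```python
import numpy as np

def c(a, b, theta=0.1):
    # Gaussian kernel c(x, x') = exp(-(x - x')^2 / theta)
    return np.exp(-(a - b) ** 2 / theta)

xbar = np.array([0.1, 0.35, 0.6, 0.9])   # unique design sites (hypothetical)
r = np.array([0.05, 0.2, 0.1, 0.3])      # noise variances r(xbar_i)
a = np.array([3, 1, 2, 4])               # replicate counts a_i
Kn = c(xbar[:, None], xbar[None, :]) + np.diag(r / a)
Ki = np.linalg.inv(Kn)

xx = (np.arange(5000) + 0.5) / 5000      # midpoint rule on D = [0, 1]
Cx = c(xx[:, None], xbar[None, :])       # cross-covariances c(x, Xn)
E = 1.0                                  # int c(x, x) dx = 1 for this kernel
Wn = (Cx[:, :, None] * Cx[:, None, :]).mean(axis=0)   # w(xbar_i, xbar_j)

imspe_trace = E - np.trace(Ki @ Wn)
# brute force: integrate sigma^2(x) = c(x,x) - c(x,Xn) Kn^{-1} c(x,Xn)^T
sig2 = 1.0 - np.einsum('ij,jk,ik->i', Cx, Ki, Cx)
imspe_brute = sig2.mean()
print(abs(imspe_trace - imspe_brute))    # agreement to machine precision
```

Both expressions use the same quadrature grid, so any discrepancy reflects the algebra rather than the integration.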

Although expressed for an entire design $X_n$, in practice IMSPE is most useful in sequential application where the goal is to choose new runs. Binois et al. provided a tidy expression for solving for the $(n+1)^{\mathrm{st}}$ input $x_{n+1}$ by optimizing $I_{n+1}(\tilde{x})$ over candidates $\tilde{x}$. We extend this to an entire batch of size $M \geq 1$, augmenting $X_N$ or (more compactly) the unique elements $\bar{X}_n$. Let $\tilde{X} = \{\tilde{x}_1, \tilde{x}_2, \ldots, \tilde{x}_M\}^\top$ denote the coordinates of a new batch, and let $I_{N+M}(\tilde{X})$ denote the new IMSPE, which is realized most directly by shoving a row-combined $[X_N; \tilde{X}]$ into Eq. (4.1). That over-simplifies, and flops in $O((N+M)^3)$ could be prohibitive.

Partition inverse equations (Barnett, 1979) can be leveraged for even thriftier evaluation.

Extend the kernel $K$ and its integral $W$ to define new $(n+M) \times (n+M)$ matrices

$$K_{n+M} = \begin{bmatrix} K_n & c(\bar{X}_n, \tilde{X}) \\ c(\bar{X}_n, \tilde{X})^\top & c(\tilde{X}, \tilde{X}) + r(\tilde{X}) \end{bmatrix}, \qquad W_{n+M} = \begin{bmatrix} W_n & w(\bar{X}_n, \tilde{X}) \\ w(\bar{X}_n, \tilde{X})^\top & w(\tilde{X}, \tilde{X}) \end{bmatrix},$$

where $W_n = w(\bar{X}_n, \bar{X}_n)$ and $r(\tilde{X}) = \mathrm{Diag}(r(\tilde{x}_1), \ldots, r(\tilde{x}_M))$ comes from smoothed latent variances following Eq. (2.4) via $c(\tilde{X}, \bar{X}_n)$, so that $r(\tilde{X}) = \tau^2 \Lambda(\tilde{X})$, where

$$\Lambda(\tilde{X}) = K_{(\delta)}(\tilde{X}, \bar{X}_n) (C_{(\delta)} + g_{(\delta)} A_n^{-1})^{-1} \Delta_n. \qquad (4.2)$$

We may fill in the inverse $K_{n+M}^{-1}$ in flops in $O(M^3 + nM^2 + n^2 M)$ as

$$K_{n+M}^{-1} = \begin{bmatrix} K_n^{-1} + g(\tilde{X}) \Sigma(\tilde{X}) g(\tilde{X})^\top & g(\tilde{X}) \\ g(\tilde{X})^\top & \Sigma(\tilde{X})^{-1} \end{bmatrix}, \qquad (4.3)$$

where $g(\tilde{X}) = -K_n^{-1} c(\bar{X}_n, \tilde{X}) \Sigma(\tilde{X})^{-1}$ and $\Sigma(\tilde{X}) = r(\tilde{X}) + c(\tilde{X}, \tilde{X}) - c(\bar{X}_n, \tilde{X})^\top K_n^{-1} c(\bar{X}_n, \tilde{X})$. Multiplying through components of Eq. (4.3) and using properties of traces in Eq. (4.1) leads to

$$\begin{aligned} I_{N+M} &= E - \mathrm{tr}\left( \left( K_n^{-1} + g(\tilde{X}) \Sigma(\tilde{X}) g(\tilde{X})^\top \right) W_n + g(\tilde{X}) w(\bar{X}_n, \tilde{X})^\top \right) \\ &\quad - \mathrm{tr}\left( g(\tilde{X})^\top w(\bar{X}_n, \tilde{X}) + \Sigma(\tilde{X})^{-1} w(\tilde{X}, \tilde{X}) \right) \qquad (4.4) \\ &= I_N - \mathrm{tr}\left( g(\tilde{X}) \Sigma(\tilde{X}) g(\tilde{X})^\top W_n \right) - 2 \, \mathrm{tr}\left( g(\tilde{X}) w(\bar{X}_n, \tilde{X})^\top \right) - \mathrm{tr}\left( \Sigma(\tilde{X})^{-1} w(\tilde{X}, \tilde{X}) \right). \end{aligned}$$

Finding the best $\tilde{X}$ requires only the latter terms above. That is, we seek

$$\tilde{X}^* = \operatorname*{argmin}_{\tilde{X} \in D} I_{N+M} = \operatorname*{argmax}_{\tilde{X} \in D} \; \mathrm{tr}\left( g(\tilde{X}) \Sigma(\tilde{X}) g(\tilde{X})^\top W_n \right) + 2 \, \mathrm{tr}\left( g(\tilde{X}) w(\bar{X}_n, \tilde{X})^\top \right) + \mathrm{tr}\left( \Sigma(\tilde{X})^{-1} w(\tilde{X}, \tilde{X}) \right).$$

In other words, we seek the $\tilde{X}^*$ giving the largest reduction in IMSPE. Evaluation involves flops in the orders quoted above; however, in repeated calls for numerical optimization many of the $O(n)$ quantities can be pre-evaluated, leaving $O(M^3 + nM^2 + n^2 M)$ for each $\tilde{X}$.
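The reduction above can be sanity-checked numerically. The sketch below (Python rather than the dissertation's R; all sites and noise values are illustrative) builds the partitioned-inverse quantities $g(\tilde{X})$ and $\Sigma(\tilde{X})$ for a toy 1d problem with $M = 2$ and confirms that the update matches rebuilding $K_{n+M}$ and $W_{n+M}$ from scratch:

```python
import numpy as np

def c(a, b, theta=0.1):
    return np.exp(-(a - b) ** 2 / theta)

grid = (np.arange(4000) + 0.5) / 4000    # midpoint rule on D = [0, 1]

def wmat(A, B):                          # w(a, b) = int c(a, x) c(b, x) dx
    Ca, Cb = c(grid[:, None], A[None, :]), c(grid[:, None], B[None, :])
    return (Ca[:, :, None] * Cb[:, None, :]).mean(axis=0)

xbar = np.array([0.1, 0.35, 0.6, 0.9])   # existing unique sites
r = np.array([0.05, 0.2, 0.1, 0.3])      # their noise levels
xt = np.array([0.25, 0.75])              # candidate batch, M = 2
rt = np.array([0.15, 0.1])               # smoothed noise at the new sites

Kn = c(xbar[:, None], xbar[None, :]) + np.diag(r)
Ki = np.linalg.inv(Kn)
Wn, WnM, WMM = wmat(xbar, xbar), wmat(xbar, xt), wmat(xt, xt)
IN = 1.0 - np.trace(Ki @ Wn)

cnM = c(xbar[:, None], xt[None, :])      # c(Xn, Xtilde)
Sig = np.diag(rt) + c(xt[:, None], xt[None, :]) - cnM.T @ Ki @ cnM
Si = np.linalg.inv(Sig)
g = -Ki @ cnM @ Si
INM = (IN - np.trace(g @ Sig @ g.T @ Wn)
          - 2 * np.trace(g @ WnM.T) - np.trace(Si @ WMM))

# brute force with the full (n + M)-sized system
X2, r2 = np.concatenate([xbar, xt]), np.concatenate([r, rt])
K2 = c(X2[:, None], X2[None, :]) + np.diag(r2)
INM_direct = 1.0 - np.trace(np.linalg.inv(K2) @ wmat(X2, X2))
print(abs(INM - INM_direct))             # ~ machine precision
```

As expected, adding the two candidate runs also strictly lowers IMSPE relative to $I_N$.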

4.1.2 Batch IMSPE gradient

To facilitate library-based numerical optimization of $I_{N+M}(\tilde{X})$ with respect to $\tilde{X}$, in particular via Eq. (4.4), we furnish closed-form expressions for its gradient. Below, these are framed via partial derivatives for $\tilde{x}_{i(p)}$, the $p^{\mathrm{th}}$ coordinate of the $i^{\mathrm{th}}$ design point in the new batch. Beginning with the chain rule, the gradient of $I_{N+M}$ over $\tilde{x}_{i(p)}$ follows

$$\frac{\partial I_{N+M}}{\partial \tilde{x}_{i(p)}} = -\mathrm{tr}\left( \frac{\partial K_{n+M}^{-1}}{\partial \tilde{x}_{i(p)}} W_{n+M} + K_{n+M}^{-1} \frac{\partial W_{n+M}}{\partial \tilde{x}_{i(p)}} \right). \qquad (4.5)$$

For the component $\partial K_{n+M}^{-1} / \partial \tilde{x}_{i(p)}$, we have

$$\frac{\partial K_{n+M}^{-1}}{\partial \tilde{x}_{i(p)}} = \frac{\partial}{\partial \tilde{x}_{i(p)}} \begin{bmatrix} K_n^{-1} + g(\tilde{X}) \Sigma(\tilde{X}) g(\tilde{X})^\top & g(\tilde{X}) \\ g(\tilde{X})^\top & \Sigma(\tilde{X})^{-1} \end{bmatrix} = \begin{bmatrix} H(\tilde{X}) & Q(\tilde{X}) \\ Q(\tilde{X})^\top & V(\tilde{X}) \end{bmatrix},$$

where

$$\begin{aligned} V(\tilde{X}) &:= -\Sigma(\tilde{X})^{-1} \frac{\partial \Sigma(\tilde{X})}{\partial \tilde{x}_{i(p)}} \Sigma(\tilde{X})^{-1}, \\ Q(\tilde{X}) &:= \frac{\partial g(\tilde{X})}{\partial \tilde{x}_{i(p)}} = -K_n^{-1} \left( c(\bar{X}_n, \tilde{X}) V(\tilde{X}) + \frac{\partial c(\bar{X}_n, \tilde{X})}{\partial \tilde{x}_{i(p)}} \Sigma(\tilde{X})^{-1} \right), \\ H(\tilde{X}) &:= \frac{\partial \, g(\tilde{X}) \Sigma(\tilde{X}) g(\tilde{X})^\top}{\partial \tilde{x}_{i(p)}} = g(\tilde{X}) \frac{\partial \Sigma(\tilde{X})}{\partial \tilde{x}_{i(p)}} g(\tilde{X})^\top + Q(\tilde{X}) \Sigma(\tilde{X}) g(\tilde{X})^\top + \left\{ Q(\tilde{X}) \Sigma(\tilde{X}) g(\tilde{X})^\top \right\}^\top. \end{aligned}$$

The terms in the previous expressions are as follows:

$$\frac{\partial \Sigma(\tilde{X})}{\partial \tilde{x}_{i(p)}} = \frac{\partial c(\tilde{X}, \tilde{X})}{\partial \tilde{x}_{i(p)}} + \frac{\partial r(\tilde{X})}{\partial \tilde{x}_{i(p)}} - \frac{\partial c(\bar{X}_n, \tilde{X})^\top}{\partial \tilde{x}_{i(p)}} K_n^{-1} c(\bar{X}_n, \tilde{X}) - \left\{ \frac{\partial c(\bar{X}_n, \tilde{X})^\top}{\partial \tilde{x}_{i(p)}} K_n^{-1} c(\bar{X}_n, \tilde{X}) \right\}^\top,$$

where $\partial c(\bar{X}_n, \tilde{X}) / \partial \tilde{x}_{i(p)}$ is the $n \times M$ matrix whose only nonzero column is the $i^{\mathrm{th}}$, with entries $\partial c(\tilde{x}_i, \bar{x}_j) / \partial \tilde{x}_{i(p)}$ for $j = 1, \ldots, n$:

$$\frac{\partial c(\bar{X}_n, \tilde{X})}{\partial \tilde{x}_{i(p)}} = \left[ \; 0_{n \times (i-1)} \;\; \frac{\partial c(\tilde{x}_i, \bar{X}_n)}{\partial \tilde{x}_{i(p)}} \;\; 0_{n \times (M-i)} \; \right],$$

and similarly $\partial c(\tilde{X}, \tilde{X}) / \partial \tilde{x}_{i(p)}$ is the $M \times M$ matrix whose only nonzero entries lie in the $i^{\mathrm{th}}$ row and column, with entries $\partial c(\tilde{x}_i, \tilde{x}_j) / \partial \tilde{x}_{i(p)}$ for $j = 1, \ldots, M$.

Then we focus on the expressions related to $\partial W_{n+M} / \partial \tilde{x}_{i(p)}$:

$$\frac{\partial W_{n+M}}{\partial \tilde{x}_{i(p)}} = \frac{\partial}{\partial \tilde{x}_{i(p)}} \begin{bmatrix} W_n & w(\bar{X}_n, \tilde{X}) \\ w(\bar{X}_n, \tilde{X})^\top & w(\tilde{X}, \tilde{X}) \end{bmatrix} = \begin{bmatrix} 0 & S(\tilde{X}) \\ S(\tilde{X})^\top & T(\tilde{X}) \end{bmatrix}.$$

With these quantities and Eq. (4.4), the gradient of $I_{N+M}$ can be expressed as

$$\begin{aligned} -\frac{\partial I_{N+M}}{\partial \tilde{x}_{i(p)}} &= \mathrm{tr}\left( g(\tilde{X}) \frac{\partial \Sigma(\tilde{X})}{\partial \tilde{x}_{i(p)}} g(\tilde{X})^\top W_n \right) + 2 \, \mathrm{tr}\left( Q(\tilde{X}) \Sigma(\tilde{X}) g(\tilde{X})^\top W_n \right) \\ &\quad + 2 \, \mathrm{tr}\left( Q(\tilde{X}) w(\bar{X}_n, \tilde{X})^\top \right) + 2 \, \mathrm{tr}\left( g(\tilde{X}) S(\tilde{X})^\top \right) \qquad (4.6) \\ &\quad + \mathrm{tr}\left( V(\tilde{X}) w(\tilde{X}, \tilde{X}) \right) + \mathrm{tr}\left( \Sigma(\tilde{X})^{-1} T(\tilde{X}) \right). \end{aligned}$$

Now recall that $\Sigma(\tilde{X}) = r(\tilde{X}) + c(\tilde{X}, \tilde{X}) - c(\bar{X}_n, \tilde{X})^\top K_n^{-1} c(\bar{X}_n, \tilde{X})$. Again recursing with the chain rule, first through the diagonal matrix $r(\tilde{X})$ via Eq. (2.4), gives

$$\frac{\partial r(\tilde{X})}{\partial \tilde{x}_{i(p)}} = \frac{\partial \tau^2 \Lambda(\tilde{X})}{\partial \tilde{x}_{i(p)}} = \tau^2 \frac{\partial K_{(\delta)}(\tilde{X}, \bar{X}_n)}{\partial \tilde{x}_{i(p)}} (C_{(\delta)} + g_{(\delta)} A_n^{-1})^{-1} \Delta_n. \qquad (4.7)$$

It is worth observing here how relative noise levels, smoothed through $\Delta_n$ and distance to $\bar{X}_n$, impact the potential value of new design elements $\tilde{X}$. In particular, high-variance $\bar{x}_i$ have low impact unless $a_i$ is also large, in which case there is an attractive force encouraging replication (elements of $\tilde{X}$ nearby $\bar{X}_n$). The last component of $\partial \Sigma(\tilde{X}) / \partial \tilde{x}_{i(p)}$ relies on $\partial c(\bar{X}_n, \tilde{X}) / \partial \tilde{x}_{i(p)}$ through a quadratic form:

$$\frac{\partial}{\partial \tilde{x}_{i(p)}} \, c(\bar{X}_n, \tilde{X})^\top K_n^{-1} c(\bar{X}_n, \tilde{X}) = c(\bar{X}_n, \tilde{X})^\top K_n^{-1} \frac{\partial c(\bar{X}_n, \tilde{X})}{\partial \tilde{x}_{i(p)}} + \left\{ c(\bar{X}_n, \tilde{X})^\top K_n^{-1} \frac{\partial c(\bar{X}_n, \tilde{X})}{\partial \tilde{x}_{i(p)}} \right\}^\top. \qquad (4.8)$$

The structure of this component's derivative reveals how new design elements $\tilde{X}$ repel one another and push away from existing points $\bar{X}_n$. In other words, the forces described in Eqs. (4.7–4.8) trade off, encouraging both spread to space-fill and compression toward replication depending on the noise level $r(\cdot)$.

Finally, for Eq. (4.5) we need $\partial W_{n+M} / \partial \tilde{x}_{i(p)}$. Our earlier expression for $w(x_i, x_j)$ was generic; however, derivatives are required across each of $d$ input dimensions for the gradient, so here we acknowledge a separable kernel structure for completeness. Component $W_{n+M}^{(i,j)}$ follows

$$w(x_i, x_j) = \int_{x \in D} c(x_i, x) c(x_j, x) \, dx = \prod_{k=1}^{d} \int_{x \in [0,1]} c(x_{i(k)}, x) c(x_{j(k)}, x) \, dx = \prod_{k=1}^{d} w_k(x_{i(k)}, x_{j(k)}).$$

When differentiating with respect to $\tilde{x}_{i(p)}$, only the $(n+i)^{\mathrm{th}}$ row/column of $\partial W_{n+M} / \partial \tilde{x}_{i(p)}$ is nonzero. Those entries are

$$\frac{\partial W_{n+M}^{(n+i, j)}}{\partial \tilde{x}_{i(p)}} = \frac{\partial w_p(\tilde{x}_{i(p)}, x_{j(p)})}{\partial \tilde{x}_{i(p)}} \prod_{k=1, k \neq p}^{d} w_k(\tilde{x}_{i(k)}, x_{j(k)}).$$

For a Gaussian kernel, $w_k(\cdot, \cdot)$ is calculated with the error function $\mathrm{erf}(z) = \frac{2}{\sqrt{\pi}} \int_0^z e^{-t^2} \, dt$ as

$$w(x_i, x_j) = \frac{\sqrt{2\pi\theta}}{4} \exp\left( -\frac{(x_i - x_j)^2}{2\theta} \right) \left[ \mathrm{erf}\left( \frac{2 - (x_i + x_j)}{\sqrt{2\theta}} \right) + \mathrm{erf}\left( \frac{x_i + x_j}{\sqrt{2\theta}} \right) \right],$$

for $1 \leq i, j \leq n$, and with derivative

$$\begin{aligned} \frac{\partial w(x, x_i)}{\partial x} = \sqrt{\frac{\pi}{8\theta}} \exp\left( -\frac{(x - x_i)^2}{2\theta} \right) \Bigg[ & (x - x_i) \left\{ \mathrm{erf}\left( \frac{x + x_i - 2}{\sqrt{2\theta}} \right) - \mathrm{erf}\left( \frac{x + x_i}{\sqrt{2\theta}} \right) \right\} \\ & + \sqrt{\frac{2\theta}{\pi}} \left\{ \exp\left( -\frac{(x + x_i)^2}{2\theta} \right) - \exp\left( -\frac{(x + x_i - 2)^2}{2\theta} \right) \right\} \Bigg]. \end{aligned}$$
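The closed forms above are easy to validate numerically. This Python sketch (the lengthscale $\theta$ and test points are arbitrary) checks $w(\cdot,\cdot)$ against midpoint quadrature and its derivative against central finite differences:

```python
import numpy as np
from math import erf, sqrt, pi, exp

theta = 0.2  # arbitrary lengthscale for the 1d Gaussian kernel

def w(xi, xj):            # closed-form int_0^1 c(xi, x) c(xj, x) dx
    s = xi + xj
    return (sqrt(2 * pi * theta) / 4) * exp(-(xi - xj) ** 2 / (2 * theta)) * (
        erf((2 - s) / sqrt(2 * theta)) + erf(s / sqrt(2 * theta)))

def dw(x, xi):            # closed-form derivative of w(x, xi) in x
    s = x + xi
    return sqrt(pi / (8 * theta)) * exp(-(x - xi) ** 2 / (2 * theta)) * (
        (x - xi) * (erf((s - 2) / sqrt(2 * theta)) - erf(s / sqrt(2 * theta)))
        + sqrt(2 * theta / pi) * (exp(-s ** 2 / (2 * theta))
                                  - exp(-(s - 2) ** 2 / (2 * theta))))

grid = (np.arange(200000) + 0.5) / 200000
xi, xj = 0.3, 0.62
w_num = float(np.mean(np.exp(-(xi - grid) ** 2 / theta)
                      * np.exp(-(xj - grid) ** 2 / theta)))
dw_num = (w(xi + 1e-6, xj) - w(xi - 1e-6, xj)) / 2e-6
print(abs(w(xi, xj) - w_num), abs(dw(xi, xj) - dw_num))
```

Both checks agree to well below quadrature/finite-difference error, which is reassuring before wiring these pieces into a gradient-based optimizer.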

It is worth noting that to ensure positive variances, rather than being faithful to Eq. (4.2) we instead model

$$\log \Lambda(\tilde{X}) = K_{(\delta)}(\tilde{X}, \bar{X}_n) (C_{(\delta)} + g_{(\delta)} A_n^{-1})^{-1} \log \Delta_n.$$

Thus $\partial r(\tilde{X}) / \partial \tilde{x}_{i(p)}$ can be derived as

$$\frac{\partial r(\tilde{X})}{\partial \tilde{x}_{i(p)}} = \frac{\partial \tau^2 \Lambda(\tilde{X})}{\partial \tilde{x}_{i(p)}} = \tau^2 \frac{\partial K_{(\delta)}(\tilde{X}, \bar{X}_n)}{\partial \tilde{x}_{i(p)}} (C_{(\delta)} + g_{(\delta)} A_n^{-1})^{-1} \log \Delta_n \times \exp\left( K_{(\delta)}(\tilde{X}, \bar{X}_n) (C_{(\delta)} + g_{(\delta)} A_n^{-1})^{-1} \log \Delta_n \right).$$

4.1.3 Implementation details and illustration

With closed-form IMSPE and gradient in hand, selecting $M$-sized batches of new runs becomes an optimization problem of dimension $Md$ that can be off-loaded to a library. When each dimension is constrained to $[0, 1]$, i.e., assuming coded inputs, we find that the L-BFGS-B algorithm (Byrd et al., 2003) is appropriate, and generally works well even in this high-dimensional setting. Our implementation uses the built-in optim function in R, and is careful to avoid redundant work in evaluating the objective and gradient, which share many common building blocks and subroutines.
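In the same spirit, here is a minimal Python/SciPy stand-in for that optimization (the dissertation uses R's optim; the brute-force IMSPE, the crude noise interpolation, and all constants below are illustrative only):

```python
import numpy as np
from scipy.optimize import minimize

theta, M = 0.05, 3
xbar = np.array([0.1, 0.4, 0.6, 0.9])    # existing sites (hypothetical)
r = np.array([0.02, 0.02, 0.3, 0.3])     # low noise left, high noise right
grid = (np.arange(400) + 0.5) / 400      # quadrature grid on D = [0, 1]

def c(a, b):
    return np.exp(-(np.asarray(a)[:, None]
                    - np.asarray(b)[None, :]) ** 2 / theta)

def imspe(xt):                           # brute-force I_{N+M} on the grid
    X = np.concatenate([xbar, np.asarray(xt)])
    rr = np.concatenate([r, np.interp(xt, xbar, r)])  # toy noise "smoother"
    Ki = np.linalg.inv(c(X, X) + np.diag(rr))
    Cx = c(grid, X)
    return float(np.mean(1.0 - np.einsum('ij,jk,ik->i', Cx, Ki, Cx)))

rng = np.random.default_rng(1)
starts = rng.uniform(size=(5, M))        # multi-start to dodge local minima
fits = [minimize(imspe, s, method='L-BFGS-B', bounds=[(0, 1)] * M)
        for s in starts]
best = min(fits, key=lambda f: f.fun)
print(np.round(np.sort(best.x), 3), round(best.fun, 5))
```

Here the gradient is approximated by finite differences inside SciPy; the dissertation instead supplies the closed-form gradient from Section 4.1.2, which is what makes large $Md$ searches fast.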

Figure 4.1: Batch IMSPE optimization iterations from initial (blue dots) to final (green crosses) locations. Three optimization epochs are indicated by arrows. An overlaid heatmap shows the estimated standard deviation surface $r(x)$.

Figure 4.1 provides an illustrative view of this new capability. We started with a space-filling design $\bar{X}_n$ in $[0, 1]^2$, shown as open circles. The true noise surface, $r(x)$, was derived from a standard bivariate Gaussian density with location $\mu = (0.7, 0.7)$ and scale $\Sigma = 0.02 \cdot I_2$. The heatmap indicates the hetGP-estimated standard deviation surface based on runs gathered at $\bar{X}_n$. The higher-noise region is more yellow. We then set out to calculate coordinates of a new $M = 20$-sized batch $\tilde{X}$ via IMSPE. Search is initialized with an LHS, shown in the figure as blue dots. Arrows originating from those dots show progress of the derivative-based search, broken into three epochs for dramatic effect. Iterating to convergence requires hundreds of objective/gradient evaluations in the $Md = 40$-dimensional search space, but these each take a fraction of a second because there are no large cubic operations. At the terminus of those arrows are green crosses, indicating the final locations of the new batch $\tilde{X}^*$. Observe how some of these spread out relative to one another and to the open circles (mostly in the red, low-noise region), while others (especially near the yellow, high-noise region) are attracted to each other. At least one new replicate was found. Thus the IMSPE criterion strikes a balance between filling the space and creating replicates, which are good for separating signal from noise.

L-BFGS-B only guarantees a local minimum since the IMSPE objective is not convex. In fact, IMSPE surfaces become highly multi-modal as more points are added, with the number of minima growing linearly in $n$, the number of unique existing design elements, even in the $M = 1$ case. Larger batch sizes $M > 1$ exacerbate this still further. There is also a "label-switching problem": swap two elements of the batch and the IMSPE is the same. To avoid seriously inferior local minima in our solutions for $\tilde{X}^*$ we deploy a multi-start scheme, launching multiple L-BFGS-B routines simultaneously from novel sets of space-filling initial $\tilde{X}^{(0)}$, and choosing the best at the end.

4.2 Hunting for replicates

Replication, meaning repeated simulations Y (x) at fixed x, keeps cubic costs down [Eqs. (4.1) and (4.5), reducing from N to n] and plays an integral role in separating signal from noise (Ankenman et al., 2010, Binois et al., 2018b), a win-win for statistical and computational efficiency. Intuitively, replicates become desirable in otherwise poorly sampled high variance regions (Binois et al., 2018c). Unfortunately, a numerical scheme for optimizing IMSPE will never precisely yield replicates because tolerances on iterative convergence cannot be driven identically to zero. Consider again Figure 4.1, focusing now on the two new design points in the yellow region which went to similar final locations along their optimization paths. These look like potential replicates, but their coordinates don’t match.

One possible solution resolving near-replicates into actual ones is to introduce a secondary set of tolerances in the input space, whereby closeness implying "effective replication" can be deduced after the numerical solver finishes. This worked well for Binois et al., in part because of an additional lookahead device (Ginsbourger and Le Riche, 2010) explicitly favoring replication. But for us such tactics are unsatisfying on several fronts: lookahead isn't manageable for $M \gg 1$ sized batches; additional input tolerances are tantamount to imposing a grid; such a scheme doesn't directly utilize IMSPE information; and finally, whereas one-at-a-time acquisition presents more opportunities to make adjustments in real time, our batch setting puts more eggs in one basket. We therefore propose the following post-processing scheme on each batch, which we call "backtracking".

4.2.1 Backtracking via merge

For a new batch of size $M$, the possible number of new replicates ranges from zero to $M$. L-BFGS-B optimization yields $M$ unique coordinate tuples, but some may be very close to one another or to the $n$ existing unique sites. Below we describe a simple greedy scheme for ordering and valuing those $M$ locations as potential "effective replicates". Choosing among those alternatives happens in a second phase, described momentarily in Section 4.2.2.

Begin by recording the IMSPE of the solution $\tilde{X}_M \equiv \tilde{X}^*$ provided by the optimizer: $I_{n+M}(\tilde{X}_M)$. This corresponds to the no-backtrack/no-replicate option. Set iterator $s = 0$ so that $\tilde{X}_{m_s}$ refers to this potential batch with $m_s = M$ unique design elements, and let $d_s = 0$.

Move to the first iteration, $s = 1$. Among the $m_{s-1}$ unique sites in $\tilde{X}_{m_{s-1}}$, find the one which has the smallest minimum distance $d_s$ to other unique elements in $\tilde{X}_{m_{s-1}}$ and existing sites $\bar{X}_n$, with ties broken arbitrarily. Entertain a new batch $\tilde{X}_{m_s}$ by merging the sites involved in that minimum $d_s$-distance pair. If both are members of the previous batch $\tilde{X}_{m_{s-1}}$, then choose a midway value for their new setting(s) in $\tilde{X}_{m_s}$. Otherwise, take the site location from the existing (immovable) unique design element of $\bar{X}_n$. Both imply $m_s = M - s$.

Calculate $I_{n+m_s}(\tilde{X}_{m_s})$. Increment $s \leftarrow s + 1$ and repeat unless $s = M$. Break out of this loop early if both elements of the minimum-distance pair from $\tilde{X}_{m_{s-1}}$ are existing design locations from $\bar{X}_n$, which is only possible for $s \geq 2$. Let $S$ indicate the number of times through the loop, $s = 0, \ldots, S \leq M$, i.e., one plus the number of merges.
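The following Python sketch mimics this loop in 1d under simplifying assumptions (Gaussian kernel, constant noise $r$, and made-up sites); it tracks unique-site replicate counts and records IMSPE after each greedy merge:

```python
import numpy as np

theta, r = 0.05, 0.1                     # assumed kernel scale and noise
grid = (np.arange(500) + 0.5) / 500      # quadrature grid on D = [0, 1]

def imspe(sites, counts):                # E - tr(Kn^{-1} Wn) over unique sites
    X, a = np.asarray(sites, float), np.asarray(counts, float)
    K = np.exp(-(X[:, None] - X[None, :]) ** 2 / theta) + np.diag(r / a)
    Cx = np.exp(-(grid[:, None] - X[None, :]) ** 2 / theta)
    return float(np.mean(1 - np.einsum('ij,jk,ik->i', Cx,
                                       np.linalg.inv(K), Cx)))

def backtrack(new, old):
    """new/old map site -> replicate count; return (m_s, IMSPE) per step."""
    trace = []
    while True:
        sites = sorted(set(new) | set(old))
        counts = [new.get(s, 0) + old.get(s, 0) for s in sites]
        trace.append((len(new), imspe(sites, counts)))
        if not new:
            break
        # closest pair involving at least one movable (new) site
        d, x, y = min((abs(x - y), x, y)
                      for x in new for y in sites if x != y)
        if y in new:                     # both new: merge at the midpoint
            m = (x + y) / 2
            new[m] = new.get(m, 0) + new.pop(x) + new.pop(y)
        else:                            # snap the new runs onto an old site
            old[y] = old.get(y, 0) + new.pop(x)
    return trace

old = {0.2: 3, 0.5: 3, 0.8: 3}           # existing design, 3 replicates each
new = {0.22: 1, 0.53: 1, 0.55: 1, 0.9: 1}  # freshly optimized batch
for s, (m, I) in enumerate(backtrack(dict(new), dict(old))):
    print(f"s={s}  unique new sites={m}  IMSPE={I:.5f}")
```

This toy version always runs the loop to $s = M$; the dissertation's scheme additionally stops early once the closest pair consists of two immovable sites.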

Figure 4.2 provides an illustration; settings of $f(x)$ and $r(x)$ mirror Figure 4.1. The existing design $\bar{X}_n$ has $n = 100$ unique elements, shown as open circles in the left panel. Each run is replicated three times so that $N = 300$. A new batch of size $M = 24$ is sought. Red crosses represent the optimized $\tilde{X}_{m_0} = \tilde{X}_M$ from L-BFGS-B. Numbered arrows mark each backtracking step. Observe that the first two of these (almost on top of one another near the right-hand boundary) involve novel batch elements, whereas all others involve one of the $n$ existing sites. Aesthetically, the first five or so look reasonable, being nearby the high-variance (top-right) region. Replication is essential in high-variance settings.


Figure 4.2: Left: backtracking with merge; gray arrows connect optimal $\tilde{X}_{m_s}$, with numbers indicating $s = 1, \ldots, M$. Right: IMSPE changes over the number of replicates. Merging steps that are finally taken are shown in blue. Fitted segmented regression lines are overlaid.

4.2.2 Selecting among backtracked batches

To quantify and ultimately automate that eyeball judgment, we investigated $I_{n+m_s}(\tilde{X}_{m_s})$ versus $s$, the number of replicates in the new batch. The right panel of Figure 4.2 shows the pattern corresponding to the backtracking steps on the left. Here, the sequence of $I_{n+m_s}(\tilde{X}_{m_s})$ values is mostly flat for $s = 0, \ldots, 3$, then increasing thereafter. We wish to minimize IMSPE, except perhaps preferring exact replicates when IMSPEs may technically differ but are very similar. Aesthetically, that "change point" happens at $s = 7$, where IMSPE jumps into a new and higher regime.

To operationalize that observation we experimented with a number of change-point detection schemes. For example, we tried the tgp (Gramacy, 2007, Gramacy and Taddy, 2010) family of Bayesian treed constant, linear, and GP models. This worked great, but was computational overkill. We also considered placing $d_s$, the minimizing backtracked pairwise distances, rather than $s$-values on the x-axis. Although the behavior with this choice was distinct, it yielded more-or-less equivalent behavior in broad terms.

We ultimately settled on the following custom scheme, recognizing that the left-hand regime was usually constant (i.e., almost flat), and the right-hand regime was generally increasing.¹ To find the point of shift between those two regimes, we fit $S + 1$ two-segment polynomial regression models, with break points $s = 0, \ldots, S$ respectively, with the first (left) regime being of order zero (constant) and the second (right) being of order four. We then chose as the location $\hat{s}$ the one whose two fits provide the lowest in-sample MSE. The optimal pair of polynomial fits is overlaid on the right panel of Figure 4.2, with groups color-coded to match arrows in the left panel.
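A minimal Python version of that selection rule (the dissertation's code is in R; the data below are synthetic) fits a constant left segment and a quartic right segment at each candidate break and keeps the break with the lowest pooled in-sample error:

```python
import numpy as np

def change_point(y):
    """Break s with a constant fit on y[:s], quartic on y[s:], lowest SSE."""
    best = None
    for s in range(1, len(y) - 4):       # quartic needs >= 5 right points
        left, right = y[:s], y[s:]
        sse = np.sum((left - left.mean()) ** 2)
        xs = np.arange(s, len(y))
        coef = np.polyfit(xs, right, 4)
        sse += np.sum((np.polyval(coef, xs) - right) ** 2)
        if best is None or sse < best[1]:
            best = (s, sse)
    return best[0]

# two-regime toy trace: flat IMSPE early, then a growing regime
rng = np.random.default_rng(2)
flat = 0.003 + rng.normal(0, 1e-5, 8)
rise = 0.003 + 2e-4 * np.arange(1, 18) ** 1.5 + rng.normal(0, 1e-5, 17)
print(change_point(np.concatenate([flat, rise])))
```

On a clean two-regime trace (e.g., a step function) this recovers the break exactly; on noisy IMSPE sequences it behaves like the eyeball rule described in the text.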

Figure 4.3 shows four other examples under the same broad settings but with different random initial $n$-sized designs. The situation in the top-left panel matches that of Figure 4.2 and is by far the most common. The top-right panel depicts a setting where zero replicates is best, but the two-regression scheme nevertheless identifies a midway change point, suggesting a bias toward finding at least some replicates. The bottom-left panel indicates an opposite extreme. Note the small range of the IMSPE (y) axis. In such situations, where the right-hand regime has uniformly lower IMSPE than the left-hand one, we take $\hat{s}$ as the choice minimizing IMSPE in the right-hand regime. The bottom-right panel shows the case where no replicates are finally included in the new batch.

4.3 Benchmarking examples

Here we illustrate and evaluate our method on an array of test problems. We have four examples total. Two of them are 1d and 2d synthetic toy problems. The first one mirrors

¹BFGS is a local solver and backtracking is greedy, both contributing to the potential for non-monotonicity.

Figure 4.3: Three selected scatter plots of IMSPE versus number of replicates with best change-point fitted regression lines overlaid. Colors match arrows in Figure 4.2.

the 1d example from Binois et al. (2018c). The other two include a 4d ocean simulator from McKeague et al. (2005) and an 8d "real simulator" from inventory management. Metrics include out-of-sample root mean-squared prediction error (RMSPE), i.e., matching our IMSPE acquisition heuristic, and a proper scoring rule (Gneiting and Raftery, 2007, Eq. (27)) combining mean and uncertainty-quantification accuracy, which for GPs reduces to predictive log likelihood. We also consider computing time and the number of unique design elements, $n$, over total acquisitions $N$. Our gold-standard benchmark is the "pure sequential" ($M = 1$) adaptive lookahead scheme of Binois et al.; however, when relevant we also showcase other special cases. Our goal is not to beat that benchmark. Rather, we aim to be competitive while entertaining $M = 24$-sized batches, representing the number of cores on a single supercomputing node.

4.3.1 1d toy example

This 1d synthetic example was introduced by Binois et al. (2018c) to show how IMSPE acquisitions distribute over the input space in heteroskedastic settings. Here we borrow that setup to illustrate our batch scheme. The underlying true mean function is $f(x) = (6x - 2)^2 \sin(12x - 4)$, and the true noise function is $r(x) = (1.1 + \sin(2\pi x))^2$. Observations are generated as $y \sim f(x) + \epsilon$, where $\epsilon \sim N(0, \sigma^2 = r(x))$. The experiment starts with a maximin–LHS of $n_0 = 12$ locations with a random number of replicates uniform in $\{1, 2, 3\}$, so that the starting size is about $N_0 = 24$. A total of twenty $M = 24$-sized batches are used to augment the design for a total budget of $N = 504$ runs.
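For reference, generating this toy data takes only a few lines; the Python sketch below (the seed and the gridded stand-in for the maximin–LHS are our own choices) produces the mean, noise, and replicated initial design just described:

```python
import numpy as np

f = lambda x: (6 * x - 2) ** 2 * np.sin(12 * x - 4)     # true mean
r = lambda x: (1.1 + np.sin(2 * np.pi * x)) ** 2        # true noise variance

rng = np.random.default_rng(42)
x0 = (np.arange(12) + 0.5) / 12          # n0 = 12 sites (grid stand-in for LHS)
reps = rng.integers(1, 4, size=12)       # 1, 2, or 3 replicates each
X = np.repeat(x0, reps)                  # full design: roughly N0 = 24 runs
y = f(X) + rng.normal(0, np.sqrt(r(X)))  # heteroskedastic observations
print(len(X))
```

Note how the noise standard deviation, not $f$ itself, is what varies most dramatically with $x$, which is exactly what drives replicate placement.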

Panels in Figure 4.4 illustrate this process in six epochs. Open circles indicate observations, with more being added in batches over the epochs. The dashed sine curve indicates the relative noise level $r(x)$ over the input space; vertical segments at the bottom highlight the degree of replication at each unique input. Observe how more runs are added in high-noise regions, and the degree of replication is higher there too. This is strikingly similar to the behavior reported by Binois et al.

4.3.2 2d toy example

Elements of this example have been in play in previous illustrations, including Figures 4.1–4.2. The true mean function $f(x)$ is defined as

$$f(x) = f(x_1, x_2) = 20 \left( \frac{a_1}{\exp(a_1^2 + a_2^2)} + \frac{a_3}{\exp(a_3^2 + a_4^2)} \right),$$

Figure 4.4: The top-left panel shows the initial design observations. Remaining panels display the sequential design process after adding 1, 5, 10, 15 and 20 batches.

where $a_1 = 6x_1 - 4.1$, $a_2 = 6x_2 - 4.1$, $a_3 = 6x_1 - 1.7$, and $a_4 = 6x_2 - 1.7$. The true noise surface, $r(x)$, is a bivariate Gaussian density with location $\mu = (0.7, 0.7)$ and scale $\Sigma = 0.02 \cdot I_2$. Figure 4.5 provides a visual using color for $f(x)$ and contours for $r(x)$. We deliberately made the mean surface have the same signal structure in the bottom-left and top-right regions. However, the top-right region is exposed to high noise intensity while the bottom-left region is almost noise-free, creating two signal-to-noise regimes.
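To make the two-regime construction concrete, here is a small Python transcription of $f$ and $r$ (constants from the text; the probe points below are our own), confirming that the two bumps carry essentially the same signal shifted by $(0.4, 0.4)$:

```python
import numpy as np

def f(x1, x2):                            # two-bump mean surface
    a1, a2 = 6 * x1 - 4.1, 6 * x2 - 4.1
    a3, a4 = 6 * x1 - 1.7, 6 * x2 - 1.7
    return 20 * (a1 / np.exp(a1 ** 2 + a2 ** 2)
                 + a3 / np.exp(a3 ** 2 + a4 ** 2))

def r(x1, x2):                            # Gaussian noise bump at (0.7, 0.7)
    return np.exp(-((x1 - 0.7) ** 2 + (x2 - 0.7) ** 2) / 0.04) \
        / (2 * np.pi * 0.02)

# same signal, very different noise, when shifting by (0.4, 0.4)
print(round(float(f(0.3, 0.25) - f(0.7, 0.65)), 4))
print(round(float(r(0.3, 0.25)), 4), round(float(r(0.7, 0.65)), 4))
```

The near-identical $f$ values paired with orders-of-magnitude different $r$ values are what create the two signal-to-noise regimes described above.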

Design aspects of our experiment(s) were set up as follows. We begin with an $n_0 = 20$-sized maximin–LHS with five replicates upon each for $N_0 = 100$ total simulations. This is followed by ten batches of IMSPE acquisition with backtracking for 240 new runs ($N = 340$ total). Figure 4.6 shows how the first six batches distributed in the input space, with one panel for each. Color is used to track batches over accumulated runs; numbers indicate degrees of


Figure 4.5: The heatmap shows the mean surface $f(x)$. Lighter colors correspond to higher values. Contours of $r(x)$ are overlaid.


Figure 4.6: IMSPE design in batches: gray dots are initial design points; gray contours show signal and noise contrast; numbers indicate replicate multiplicity. The last two panels summarize all new points from six batches and all design points, respectively.

replication. For example, the first batch had two replicates (one at a unique input, one at an existing open circle), whereas the third batch had many more. Observe that as batches progress, more replicates and more unique locations cluster near the noisy top-right region of the input space. The final two panels summarize all (new) points involved in those first six batches, including the initial design.

Figure 4.7: RMSPE, score, time per iteration fitting the hetGP model, and the aggregate number of unique design locations from 50 MC repetitions.

Figure 4.7 offers a comparison to Binois et al. (2018c)'s pure sequential ($M = 1$) strategy in a fifty-repetition MC exercise. Randomization is over the initial maximin–LHS, noise deviates in simulating the response, and novel LHS testing designs of size 500. Additionally, we include a "no backtracking" comparator, omitting the search-for-replicates step(s) described in Section 4.2. For the pure sequential benchmark, we calculate RMSPE and score after every 24 steps to make it comparable to the batch-sequential design methods. In terms of RMSPE, all three methods perform about the same. Under the other three metrics, batch-with-backtracking is consistently better than the non-backtracking version: more replicates, faster hetGP fits due to smaller $n$, and higher score after batch three. The degree of replication yielded by backtracking is even greater than the pure sequential scheme after batch four. Also from batch four, batch IMSPE outperforms pure sequential design on score. Thus, we conclude that our batch-sequential design with backtracking achieves the goal of adding $M = 24$ runs at once, filling out an entire supercomputing node, without noticeably deleterious effects.

4.3.3 Ocean oxygen

The ocean-oxygen simulator models oxygen concentration in a thin water layer deep in the ocean; see McKeague et al. (2005). For details on how we generate simulations here, see Herbei and Berliner (2014) and Gramacy (2020, Section 10.3.4).² The simulator is stochastic and highly heteroskedastic. Visuals are provided by our references above. There are four real-valued inputs: two spatial coordinates (longitude and latitude) and two diffusion coefficients. We consider an MC experiment initialized with an $n_0 = 40$-sized maximin–LHS, with five replicates upon each ($N_0 = 200$). We then add ten $M = 24$-sized batches so that $N = 440$ runs are collected by the end. We cannot easily visualize the results in a 4d space, but the analog of our 2d toy results (Figure 4.6) is provided in Figure 4.8.

²Implementation is provided at https://github.com/herbei/FK_Simulator.

Figure 4.8: Ocean simulator results in 30 MC repetitions: RMSPE, score, time per batch and the aggregate number of unique design locations n.

In terms of out-of-sample RMSPE and score, all methods exhibit similar performance. The purely sequential design method consistently yields more replicates. Thus, it also takes the least time per iteration for updating via hetGP. Our backtracking scheme yields a moderate proportion of replicates with the same performance as measured by RMSPE and score, compared to the version without backtracking. Notice that these metrics do not necessarily improve monotonically over batches. This could be attributed to the unknown "true" mean and noise functions in this real-world simulator setting. Calculation of RMSPE and score is out-of-sample, on novel random testing sets, interjecting an extra degree of stochasticity into these assessments.

4.3.4 Assemble-to-order

The assemble-to-order (ATO) problem (Hong and Nelson, 2006) involves a queuing simulation targeting inventory management scenarios. It was designed to help determine optimal inventory levels for eight different items to maximize profit. Here we simply treat it as a black-box response surface. Although the signal-to-noise ratio is relatively high, ATO simulations are known to be heteroskedastic (Binois et al., 2018b). We utilized the MATLAB implementation described by Xie et al. (2012) through R.matlab (Bengtsson, 2018) in R. Our setup duplicates the MC of Binois et al. (2018c) in thirty replicates, in particular by initializing with an $n_0 = 100$-sized random design in the 8d input space, paired with random degrees of replication $a_i \sim \mathrm{Unif}\{1, \ldots, 10\}$ so that the initial design comprised about $N_0 \approx 500$ runs. Binois et al. then performed about 1500 acquisitions to end at $N = 2000$ total runs. We performed sixty-three $M = 24$-sized batches to obtain about 2012 runs.

Since the 8d inventory input vector must be comprised of integers {0,..., 20}, we slightly modified our method in a manner similar to Binois et al.: inputs are coded to [0, 1] so that IMSPE optimization transpires in an M × [0, 1]^8 space. When backtracking, merged IMSPEs are calculated via rounded X̃_m^int's on the natural scale.

Figure 4.9 shows progress in terms of average RMSPE and score, mimicking the format of the presentation of Binois et al., whose comparators are duplicated in gray in our updated version. There are eight gray variations, representing multiple lookahead horizons (h) and two automated horizon alternatives, with “Adapt” being the gold standard. In terms of RMSPE, our batch method makes progress more slowly at first, but ultimately ends in the middle of the pack of these pure sequential alternatives. In terms of score, we start out the best, but end in the third position. Apparently, our batch scheme is less aggressive on reducing out-of-sample mean-squared error, but better at accurately assessing uncertainty.

[Figure: two rows of panels comparing the batch method against pure sequential alternatives (h = −1 through h = 4, Target, Adapt): RMSPE and score versus N, alongside final RMSPE and final score distributions.]

Figure 4.9: RMSPE and score over design size N from 30 MC repetitions.

In the 30 MC replicates our average number of new replicates per unique site was 1.64 (min 0, max 5), leading to a mean of n = 1610 (min 1606, max 1612). This is a little higher (lower replication) than n = 1086 (min 465, max 1211) reported by Binois et al. for “Adapt”. Again, we conclude that our batch method is competitive despite being faced with many fewer opportunities to re-tune the strategy over acquisition iterations.

Chapter 5

Delta smelt

Delta smelt (Hypomesus transpacificus) are small, slender-bodied fish that live in the Sacramento river delta and estuaries of San Francisco Bay. Their abundance serves as an indicator of environmental health in the bay (Rose et al., 2013). Populations declined in the latter half of the 20th century, and in 1993 they were listed as threatened under the US and California Endangered Species Acts (Fish and Wildlife Service, 1993). Nevertheless, populations continued to decline. Factors that may be contributing include entrainment by large water diversion facilities (primarily for farming), densities of zooplankton food sources, pollution, introduction of non-native species, and changes in physical habitat related to salinity and turbidity (Baxter et al., 2010). Finding the most critical factors influencing decline is important for effective resource and wildlife management and restoration.

Recent studies have applied statistical analysis using myriad data sources and methodologies, in combination with diverse species and ecosystems (e.g., Hamilton and Murphy, 2018, MacNally et al., 2010, Maunder and Deriso, 2011, Thomson et al., 2010). Although informative, research by Kimmerer and Rose (2018) suggests studies like these, based on aggregated indices of abundance, are too phenomenological. They lack fidelity and biological dynamics in the modeling of mechanisms behind the life history of delta smelt.

To better study its complicated life cycle and interface with weather and climate, Rose et al. (2013) developed a stochastic agent-based model (ABM) of the delta smelt population for the upper estuary in the bay, which simulates dynamics under a range of scenarios.


Ultimately, the goal of such computer modeling is to augment and inform statistical models, such as those above, through calibration to real data, sensitivity analysis, and to assist in determining which of several reasonably actionable levers could improve the health of the system and thus populations of delta smelt.

Rose et al.’s simulator is slow (typically 4–6 hours for a single run) and stochastic. The input configuration space is large (upwards of 13 dimensions), and the response surface is nonlinear. Separating signal from noise requires a large, costly, highly distributed HPC simulation campaign paired with a flexible meta-model. Previous campaigns fixed random number seeds, perhaps to artificially amplify signal. Our initial study with this simulator, described in Section 5.2, suggests that in some low-noise/low-signal parts of the configuration space this shortcut is harmless. However, we observe that the response surface is heteroskedastic, and moreover noise levels can vary nonlinearly. This challenges effective design and meta-modeling, a setting that is increasingly common in simulation experiments, especially those based on agent-based models (Baker et al., 2020).

The structure of this chapter is as follows. Section 5.1 provides more details about the delta smelt simulator. Section 5.2 describes a pilot study on a reduced input space, identifying challenges/appropriate modeling elements and motivating a HetGP framework. Finally, in Section 5.3 the batch sequential design method developed in Chapter 4 is applied to smelt simulations (in a larger space), collecting thousands of runs utilizing tens of thousands of core hours across a weeks-long simulation campaign. Those runs are used to conduct a sensitivity analysis to exemplify potential downstream tasks.

5.1 Agent-based model

The delta smelt simulator is described in detail by Rose et al. (2013). Its stochastic agent-based model (ABM) architecture tracks reproduction, growth, mortality and individual movement over the entire life cycle of cohorts of fish, the principal agents. Agents are modeled on a spatial grid representing nearly the entire geographic range of delta smelt in the Sacramento river delta. Daily values of environmental variables, such as water temperature, salinity, and densities of six zooplankton prey types, drive the model. These vary over geographic grid cells according to historical measurements taken from 1995–2005, comprising a ten-year study period. New agents are introduced as yolk-sac larvae into the model. Growth and maturation of feeding agents are determined stochastically based on bioenergetics and zooplankton densities. Mortality/removal of agents can be due to natural causes, starvation, and entrainment in water diversion facilities, again stochastically. Movement of larvae was modeled by particle-tracking (Kimmerer and Nobriga, 2008), while the movement of juveniles and adults was modeled as a function of salinity.

symbol  parameter    description        range           default  pilot study
my      zmorty       yolk-sac larva MR  [0.01, 0.50]    0.035    0.035
ml      zmortl       larval MR          [0.01, 0.08]    0.050    0.050
mp      zmortp       post-larval MR     [0.005, 0.05]   0.030    0.030
mj      zmortj       juvenile MR        [0.001, 0.025]  0.015    [0.005, 0.030]
ma      zmorta       adult MR           [0.001, 0.01]   0.006    0.006
mr      middlemort   river entrain MR   [0.005, 0.05]   0.020    [0, 0.05]
Pl,2    preyk(3,2)   larvae EPT 2       [0.10, 20.0]    0.200    0.200
Pp,2    preyk(4,2)   postlarvae EPT 2   [0.10, 20.0]    0.800    [0.10, 1.84]
Pp,6    preyk(4,6)   postlarvae EPT 6   [0.10, 20.0]    1.500    Pp,2
Pj,3    preyk(5,3)   juveniles EPT 3    [0.10, 20.0]    0.600    [0.1, 1.5]
Pj,6    preyk(5,6)   juveniles EPT 6    [0.10, 20.0]    0.600    Pj,3
Pa,3    preyk(6,3)   adults EPT 3       [0.01, 20.0]    0.070    0.070
Pa,4    preyk(6,4)   adults EPT 4       [0.01, 5.0]     0.070    0.070

Table 5.1: Delta smelt simulator input variables. The last column shows the settings of the pilot study in Section 5.2. MR abbreviates mortality rate; EPT means eating prey type.

The simulator has 13 input configurations. These are listed in Table 5.1 alongside a short description, variable names and ranges, default values, etc. The first set of variables involve mortality by natural causes on yolk-sac larvae, larvae, postlarvae, juvenile and adult life stages. These and other variables are unknown quantities, but sensible ranges can be set by known biology. Default values encode mortality rates declining with life stage, except during the vulnerable larval period. These values are constant within each life stage except yolk-sac larva mortality rate my, which is temperature dependent.

Entrainment mortality, mr, is due to water management and other human-caused factors. It occurs when passive (larvae) or behavioral movement (juveniles and adults) places a super-individual in a grid cell containing a water diversion facility, at which point that entire individual is removed from the simulation; e.g., imagine fish getting caught in the turbines. Human activity also affects the availability of zooplankton food sources, which in turn affects the rate of movement between life stages and indirectly affects mortality. The details are nuanced and not reviewed here. As one example, for juveniles and adults an additional increment of daily mortality rate is added to account for factors that go beyond water movement. Kimmerer and Rose (2018) emphasize how these zooplankton prey variables are particularly worthy of detailed investigation. Types 1–6 comprise Limnoithona tetraspina, calanoid copepodites, other calanoid adults, Acanthocyclops, Eurytemora affinis and Pseudodiaptomus forbesi, respectively. Not all combinations of zooplankton groups are realistically compatible with each life stage. For example, larvae consume only juvenile calanoid copepods and adults of the cyclopoid L. tetraspina. The other four groups were considered too large to be available to larvae based on laboratory analysis.

Simulation mechanics consist of tracking daily movement of agents in hourly epochs based on position and velocity, with potential for movement to nearby grid cells, coupled with “movement” in the configuration space of biological dynamics including reproduction, growth and mortality, which causes populations and cross sections of life stages to flux. At the end of the ten-year period, the simulator records annual adult abundance in each January, the annual number of adults entrained in diversion facilities, and other relevant outputs year on year. A complete output table is provided in Table 2 of Rose et al. (2013).

In this chapter, we focus on the annual finite population geometric growth rate λi ∈ R+, with i indexing years from 1995 to 2004.1 To simplify the objective, we take a geometric mean of all the annual finite population growth rates: λ = (∏_{i=1995}^{2004} λi)^{1/10}. This quantity acts as an indicator of how much the population of delta smelt is influenced by particular input configurations. For example, if the population in 1994 is a0, a simulation may conclude that the population in 2004 is a0λ^{10}. Values of λ > 1 indicate population increase from 1994 to 2004, while λ < 1 indicates decline. It usually takes about six hours for the simulator to traverse the ten-year period, but sometimes as many as ten. In some cases, like when some mortality rate inputs are set close to their upper limits, all agents are removed before 2004, causing early termination and output λ = 0.
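To make the λ aggregation concrete, here is a minimal Python sketch (illustrative only; the simulator itself is Fortran, driven from R):

```python
import math

def overall_growth_rate(annual_lambdas):
    """Geometric mean of the annual growth rates lambda_i, 1995-2004.

    A single lambda_i = 0 (all agents removed) forces lambda = 0.
    """
    n = len(annual_lambdas)
    return math.prod(annual_lambdas) ** (1.0 / n)

# Mild annual decline compounds over the ten-year period; the final
# population relative to the 1994 baseline a0 is a0 * lam**10.
lam = overall_growth_rate([0.9] * 10)   # lam is approximately 0.9
```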

Previous simulation campaigns fixed the random number seed, obviating the need for replication to understand variability. The importance of single factors was estimated by evaluating population changes after structurally eliminating each factor in the simulation(s). For example, Kimmerer and Rose (2018) studied the effect of entrainment mortality and food factors in this way. This is very different from a Saltelli-style/functional analysis of variance (e.g., Gramacy, 2020, Chapter 8.2, Marrel et al., 2009, Oakley and O’Hagan, 2004, Saltelli et al., 2000) favored by the computer surrogate modeling literature. That and other downstream applications require a meta-modeling design strategy in the face of extreme computational demands and stochasticity (assuming un-fixed seeds).

1Hydrodynamic model output is incomplete for 2005.

5.2 Pilot study

We regard the delta smelt simulator as an unknown function f : Rd → R. A meta-model fˆ fit to evaluations (xi, yi ∼ f(xi)), for i = 1,..., N, is known as a surrogate model or emulator (Gramacy, 2020). The idea is that a fast fˆ(x) could be used in lieu of the slow/expensive f(x) for downstream applications like input sensitivity analysis. Although there are many sensible choices, the canonical surrogate is based on Gaussian processes (GPs).

To assist with R-based surrogate modeling, we built a custom R interface to the underlying Fortran library, automating the passing of input configuration files and parsing of outputs through ordinary function I/O. The Rmpi package (Yu, 2002) facilitates cluster-level parallel evaluation for distributed simulation through a message passing interface (MPI) on our Advanced Research Computing (ARC) HPC facility at Virginia Tech.

To test that interface and explore modeling and design options, we ran a limited delta smelt simulation campaign over six input factors under a maximin–LHS of size n = 96 (via lhs; Carnell, 2020) with five replicates for each combination. Juvenile and river entrainment mortalities mj and mr were varied over their ranges with the rest being fixed at their default values from Table 5.1. Post-larval (Pp,2) and juvenile (Pj,3) prey parameters were allowed to vary over their ranges, with the type 6 analogs taking on identical values (Pp,6 = Pp,2 and Pj,6 = Pj,3). Other prey types were fixed to their default settings, making the effective input dimension four. Twenty 24-core VT/ARC cluster nodes were fully occupied in parallel in order to get all N = 480 runs in about six hours.

We fit the simulation data using hetGP, with inputs XN coded to the unit cube [0, 1]^4 and with YN derived from log λi, for i = 1,..., 480, using yi ≡ −6 log 10 in the few cases where λi = 0 was returned. As a window into the fitted response surface, we plotted a selection of 1d and 2d predictive mean/variance slices in Figure 5.1, using defaults

[Figure: three columns of panels (mean surface, variance surface, 1d predictive bands) over mj and mr, with Pj,3 on the vertical axis of the 2d panels and 1d slices at five fixed Pj,3 values.]

Figure 5.1: 2d heatmap and 1d lineplot slices of predictive mean and variance for selected inputs. The numbers overlaid indicate design locations and numbers of replicates.

from Table 5.1 for the fixed variables. The first and second rows correspond to the subspaces (Pj,3 × mj) and (Pj,3 × mr), respectively. Observe in the middle column how noise intensity changes over the 2d input subspace, indicating heteroskedasticity. Both mean and variance surfaces are nonlinear. A similar, higher resolution view is offered by the 1d slices in the final column. The solid curves in the top-right panel are horizontal slices of the top-left panel with Pj,3 fixed at five different values, and analogously on the bottom-right. Predictive 95% intervals are shown as dashed lines. In both views, the width of the dashed predictive bands changes, sometimes drastically, as mj and mr are increased. Clearly mj, in the top-right panel, shows more dramatic and nonlinear mean and variance effects.

5.3 Big experiment

Motivated by the delta smelt ABM, our innovative IMSPE batch sequential design method was developed and tested in Chapter 4. Now we are almost ready to apply it to our motivating application. The plan is to scale up the pilot study of Section 5.2 and vary more quantities in the 13d input space. Time and allocation limits meant that we’d only get one crack at this, so we did one last “sanity check” before embarking on a big batch-sequential simulation campaign. We returned to the 4d pilot study described in Section 5.2, which involved N = 480 runs, and inspected the properties of two new batches, each of size 24. To understand how these 48 inputs, selected via IMSPE and backtracking based on HetGP fits, compare to the original n = 96-sized space-filling design, we plotted empirical densities of the pairwise distances within and between the two sets. See the solid-color-lined densities in the left panel of Figure 5.2. Dashed analogues offer a benchmark via sequential maximin design in two similarly sized batches. These represent an alternative, space-filling default, ignoring the HetGP model fit/IMSPE acquisition criteria.2 Note that there are relatively few pairwise distances involved in just 48 new runs, which would impact the quality of kernel density estimates.

Consider first comparing the solid and dashed green lines, capturing the spread of distances between new and old runs. Observe that the solid-green density is shifted to the left relative to the dashed one. This reveals that IMSPE-selected runs are closer to the existing ones than they would be under a space-filling design. The solid-green density is similarly shifted left compared to the distances in the old space-filling design (solid-black). The situation is a little different for distances within the new batches, shown in red. Here we have a tighter density for IMSPE compared to space-filling, meaning we have fewer short and long distances and more medium ones. We take this as evidence that the HetGP/IMSPE batch scheme is working: spreading points out to a degree, but also focusing on some regions of the input

2Sequential maximin, being model-free, doesn’t require new evaluations of the simulator.

space more than others.

Figure 5.2: Empirical density of pairwise distances from IMSPE batch and maximin sequential design for the pilot (left) and full (right) studies.
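The diagnostic amounts to comparing three empirical distributions: pairwise distances within the old design, within the new batches, and between the two sets. A self-contained Python sketch, with random stand-ins for the actual design matrices:

```python
from itertools import combinations

import numpy as np

rng = np.random.default_rng(0)

def dists_within(X):
    """All pairwise Euclidean distances within one set of points."""
    return np.array([np.linalg.norm(X[i] - X[j])
                     for i, j in combinations(range(len(X)), 2)])

def dists_between(X, Y):
    """All Euclidean distances between two sets of points."""
    return np.array([np.linalg.norm(x - y) for x in X for y in Y])

old = rng.uniform(size=(96, 4))    # stand-in for the n = 96 pilot design
new = rng.uniform(size=(48, 4))    # stand-in for two size-24 batches

d_old = dists_within(old)          # 96 * 95 / 2 = 4560 pairs
d_new = dists_within(new)          # 48 * 47 / 2 = 1128 pairs
d_x = dists_between(new, old)      # 48 * 96 = 4608 pairs
# Kernel density estimates of d_old, d_new and d_x give curves like Figure 5.2's.
```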

5.3.1 Setup and acquisitions

Encouraged by these results, and the simulations in Section 4.3, we turned to the big campaign comprising our “full” analysis. Based on the outcome of the pilot study, known biology, and the design of the delta smelt simulator, our colleagues recommended exploring a ten-dimensional input space on a 7d manifold described in Table 5.2, augmenting Table 5.1 with a new column. We expanded the effective input domain by three dimensions, and slightly adjusted ranges and relationships among the original inputs. Specifically, we extended my and began to vary Pl,2, Pa,3, and Pa,4, with anchored variables Pp,6 = Pp,2 × 1.75 + 0.05,

Pj,3 = Pj,6, and Pa,3 = Pa,4. Inputs ml, mp, and ma remain fixed at their default values.

To explore the 7d input space, we begin with a maximin-LHS of size n0 = 192, each location with five replicates for a total of N0 = 960 initial runs. We aim to double this simulation effort, collecting a total of N = 1920 runs, by adding 40 subsequent batches of size M = 24.

symbol  range           default  pilot study     full study
my      [0.01, 0.50]    0.035    0.035           [0.02, 0.05]
ml      [0.01, 0.08]    0.050    0.050           0.050
mp      [0.005, 0.05]   0.030    0.030           0.030
mj      [0.001, 0.025]  0.015    [0.005, 0.030]  [0.005, 0.030]
ma      [0.001, 0.01]   0.006    0.006           0.006
mr      [0.005, 0.05]   0.020    [0, 0.05]       [0, 0.1]
Pl,2    [0.10, 20.0]    0.200    0.200           [0.1, 0.5]
Pp,2    [0.10, 20.0]    0.800    [0.10, 1.84]    [0.10, 1.84]
Pp,6    [0.10, 20.0]    1.500    Pp,2            1.75Pp,2 + 0.05
Pj,3    [0.10, 20.0]    0.600    [0.1, 1.5]      [0.1, 1.5]
Pj,6    [0.10, 20.0]    0.600    Pj,3            Pj,3
Pa,3    [0.01, 20.0]    0.070    0.070           [0.05, 0.15]
Pa,4    [0.01, 5.0]     0.070    0.070           Pa,3

Table 5.2: Augmenting Table 5.1 to show the settings of the “full” experiment.

This took a total of 44 days, requiring slightly more than one day per batch, including HetGP update times, IMSPE evaluation and backtracking, and any time spent waiting in the queue on the ARC HPC facility at Virginia Tech. Inevitably, some hiccups prevented a fully autonomous scheme. We discovered that, in at least one case, what seemed to be a conservative request of 10 hours of job time per batch (of runs that usually take 4–6 hours) was insufficient. We had to manually re-run those failed simulations, and subsequently upped the request to 14 hours. This bigger demand led to longer queuing times even though the average execution time was on par with previous campaigns.

The right panel of Figure 5.2 shows an analog of the comparison of pairwise distances for this larger campaign. With many more distance pairs, these kernel densities are more stable than in the 4d case on the left. Nonetheless, we observe a similar pattern here in 7d. IMSPE selections tend to be closer to one another and to existing locations than ordinary space-filling ones would be. We take this as an indication that the scheme was acting in a non-trivial way to reduce predictive uncertainty captured by HetGP model fits.

When training the HetGP surrogate we use log yi with yi = λi for nonzero values. Any zeros are replaced with yi = log((1/2) min_{i:λi>0} λi), where i : λi > 0 represents the subset of {1,..., N} indexing positive outputs. This leads to slightly different y-axis scales for visuals, i.e., as compared to Section 5.2. However, a dynamic scheme for handling zeros was necessitated by the dynamic nature of the arrival of λ-values furnished over the batches of sequential acquisition, in particular of ones smaller than those obtained in the pilot study.
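A minimal sketch of this zero-handling transform (function name illustrative; the campaign itself used R):

```python
import math

def log_lambdas(lams):
    """Log-transform simulator outputs, replacing zeros with the log of
    half the smallest positive lambda observed so far (a dynamic floor)."""
    floor = math.log(0.5 * min(l for l in lams if l > 0))
    return [math.log(l) if l > 0 else floor for l in lams]

ys = log_lambdas([1.0, 0.25, 0.0])
# The zero maps below every observed log-lambda: log(0.125) < log(0.25).
```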

[Figure: mean surface, variance surface, and 1d predictive band panels for the full experiment, over mj × Pj,3 (top row) and my × mj (bottom row); overlaid numbers give degrees of replication on batch/IMSPE selections.]

Figure 5.3: Slices for the “full” experiment updating Figure 5.1.

As an example, see Figure 5.3, which augments the mean and standard deviation slice views first provided in Figure 5.1. Here, to reduce clutter, numbers overlaid indicate the degrees of replication on only the batch/IMSPE selections. As before, these are projections over the other five dimensions, so the connection between variance and design multiplicity is weak (we can’t see how uncertainty relates to the other five inputs). Nevertheless, multiplicity in unique runs is generally higher (more 4s–6s) in the yellow regions. The first row of Figure 5.3 coincides with Figure 5.1, showing input pair mj × Pj,3. Observe that, after conditioning on more data, and despite the larger space, predictive bands over mj are narrower, especially at the boundaries. The sudden widening of the blue and black predictive intervals corresponds to the yellow spots in the middle panel. These could be signals, but could also disappear after adding more samples nearby. The second row shows a newly selected pair my × mj, replacing the flat view from Figure 5.1, which is still uninteresting in the “full” setting. A nonlinear variance is evident, being extremely high at mj = 0.018.

5.3.2 Downstream analysis

Slices are certainly not the best way to visualize a high-dimensional response surface. Moreover, there are many possible ways to utilize the information in a fitted surrogate. Our intent here is not to explore that vast space in any systematic way, but rather to illustrate potential. Here we showcase input sensitivity analysis as one possible task downstream of fitting and design. That is, we seek to determine which input variables have the greatest influence on outputs, in this example the growth rate of the fish, and which variables (if any) interact to affect changes in the response. We perform this analysis based exclusively on the N = 1920 runs obtained from the batch sequential design experiment. We could have combined these with the pilot runs, which may have reduced variability in some parts of the input space, but that could potentially introduce interpretive complications.

Sensitivity analysis for GP surrogates (Marrel et al., 2009, Oakley and O’Hagan, 2004) attempts to measure the effect of a subset of inputs on outputs by controlling them while averaging over the complement of inputs (Saltelli et al., 2000). Gramacy (2020), Chapter 8.2, provides a thorough summary and practical implementation in this context. We briefly summarize salient details here for completeness.

Let U(x) = ∏_{k=1}^{m} u_k(x_k) denote a distribution on inputs, indicating relative importance in the range of settings or nearby nominal values. We simply take this to be uniform over the study regions (Table 5.2). So-called main effects, sometimes referred to as a zeroth-order index, are calculated by varying one input variable while integrating out the others under U:

ME(x_j) ≡ E_{U_{−j}}{y | x_j} = ∫∫_{X_{−j}} y P(y | x) u_{−j}(x_1, ..., x_{j−1}, x_{j+1}, ..., x_m) dx_{−j} dy.   (5.1)

Above, P(y | x) = P(Y(x) = y) is the predictive distribution from a surrogate, say via HetGP. One may approximate this double integral via MC with LHSs over U. We used LHSs of size 10000 paired with a common grid over each variable j involved in ME(x_j). See the left panel of Figure 5.4. All inputs show a negative relationship with the response λ, with greater values leading to more dead fish. Apparently, mj and Pj,3 induce higher mean variation in the response than the others.
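The MC approximation of (5.1) reduces to fixing x_j on a grid and averaging the surrogate's predictive mean over samples of the other inputs. A schematic Python sketch, with a stand-in `predict_mean` in place of a fitted HetGP surrogate (plain uniform draws here; the dissertation used LHSs):

```python
import numpy as np

rng = np.random.default_rng(0)

def predict_mean(X):
    # Stand-in for a fitted surrogate's predictive mean; any emulator
    # (e.g., hetGP's predict in R) would slot in here.
    return -2.0 * X[:, 0] - 0.5 * X[:, 1] + 0.1 * X[:, 0] * X[:, 1]

def main_effect(j, grid, m=2, n_mc=10_000):
    """ME(x_j): fix input j at each grid value, average the predictive
    mean over a common MC sample of the remaining inputs."""
    X = rng.uniform(size=(n_mc, m))
    out = []
    for g in grid:
        Xg = X.copy()
        Xg[:, j] = g               # overwrite column j, integrate out the rest
        out.append(predict_mean(Xg).mean())
    return np.array(out)

grid = np.linspace(0.0, 1.0, 5)
me0 = main_effect(0, grid)         # decreasing: larger x_0 hurts the response
```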

Figure 5.4: Sensitivity analysis: main effects (left); first order (middle) and total sensitivity (right) from 100 bootstrap re-samples.

To further quantify the variation that each input factor contributes, we calculated first-order (S) and total (T ) indices. These assume a functional ANOVA decomposition,

m X X f(x1, . . . , xm) = f0 + fj(xj) + fij(xj, xi) + ··· + f1,...,m(x1, . . . , xm), j=1 1≤i

VarUj (EU−j {y | xj}) Sj = , j = 1, . . . , m. VarU (y)

Total sensitivity Tj is the mirror image:

\[
T_j = \frac{E\{\mathrm{Var}(y \mid x_{-j})\}}{\mathrm{Var}(y)}
= 1 - \frac{\mathrm{Var}(E\{y \mid x_{-j}\})}{\mathrm{Var}(y)}.
\]

It considers the proportion of variability that is not explained without xj. The difference between first-order and total sensitivities, i.e., Tj − Sj, may be taken as a measure of variability in y due to the interaction between input j and the other inputs.
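One simple, if not the most efficient, way to estimate both indices is double-loop MC, sketched below in numpy under uniform U. The toy additive function is an assumption for illustration; for it, S_j = T_j exactly (no interactions), with S_1 = T_1 = 1/5.

```python
import numpy as np

def sobol_indices(f, j, m, n_out=2000, n_in=500, rng=None):
    """Double-loop MC estimates of the first-order (S_j) and total
    (T_j) sensitivity indices under uniform U on [0,1]^m."""
    rng = np.random.default_rng(rng)
    var_y = f(rng.uniform(size=(100000, m))).var()   # total variance

    # S_j: variance of the conditional mean E{y | x_j}
    cond_mean = np.empty(n_out)
    for k in range(n_out):
        X = rng.uniform(size=(n_in, m))
        X[:, j] = rng.uniform()            # fix x_j, vary the rest
        cond_mean[k] = f(X).mean()
    S = cond_mean.var() / var_y

    # T_j: expected conditional variance Var{y | x_{-j}}
    cond_var = np.empty(n_out)
    for k in range(n_out):
        X = np.tile(rng.uniform(size=m), (n_in, 1))
        X[:, j] = rng.uniform(size=n_in)   # fix x_{-j}, vary x_j
        cond_var[k] = f(X).var()
    T = cond_var.mean() / var_y
    return S, T

# toy additive stand-in for a surrogate: no interactions, so S_j = T_j
f = lambda X: X[:, 0] + 2.0 * X[:, 1]
S1, T1 = sobol_indices(f, j=0, m=2, rng=0)
```

In practice one would swap in the surrogate's predictive mean for f and use LHSs rather than plain uniform draws, as in the text.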

Calculation of S and T indices is also undertaken by MC via LHS, but the details are omitted here for brevity. We repeated MC calculations of both on 100 bootstrap samples of the original data set, and provide a summary via boxplots in the middle and right panels of Figure 5.4.

These views match the main effects: mj and Pj,3 stand out among all the input variables in both plots.

T − S > 0   my     mj    mr     Pl,2   Pp,2   Pj,3   Pa,3
Mean        0.52   1     0.54   0.68   0.55   1      0.52

Table 5.3: Proportion of positive I = T − S indices for the mean process.

Using those S and T values, we computed I measuring the strength of interactions. The proportion of these I measurements which are positive is provided in Table 5.3, furnishing a so-called median probability model summary by comparison to a baseline of 0.5. Again, mj and Pj,3 flag as highly probable for impacting the response through an interaction with another variable. So they not only influence the response most, but also work interactively.

Besides that pair, the table indicates that only Pl,2 has a substantial impact on λ.
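The median-probability-style summary amounts to counting sign agreement across bootstrap refits. A sketch with made-up bootstrap index estimates (stand-ins for the 100 surrogate refits; the numbers are illustrative, not the smelt results):

```python
import numpy as np

# Hypothetical bootstrap S and T estimates: rows are 100 refits,
# columns are two illustrative inputs (one interacting, one not).
rng = np.random.default_rng(42)
S = np.column_stack([rng.normal(0.20, 0.04, 100), rng.normal(0.05, 0.04, 100)])
T = np.column_stack([rng.normal(0.35, 0.04, 100), rng.normal(0.06, 0.04, 100)])

I = T - S                       # interaction strength per bootstrap sample
prop = (I > 0).mean(axis=0)     # proportion positive, per input
flags = prop > 0.5              # median-probability-model style flag
```

Inputs whose proportion sits well above 0.5, like mj and Pj,3 in Table 5.3, flag as likely interactors.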

HetGP puts a second GP prior on the latent nuggets ∆n. Once ∆n and all the hyperparameters are estimated by maximizing the likelihood, the predictive mean of the noise process, i.e., the smoothed nuggets Λ, can be calculated over any testing data set in the domain of interest. This provides a way to assess the influence of each input variable on the heteroskedastic variance.

Figure 5.5: Sensitivity analysis for the variance process: main effects (left); first order (middle) and total sensitivity (right) from 100 bootstrap re-samples.

Applying the same procedures, sensitivity indices are calculated for the variance process of the HetGP model as well, as shown in Figure 5.5. From the left panel, we can tell that the predictive mean of the variance process is highest when mj is between 0.4 and 0.6. Again, mj and Pj,3 induce the highest variation among all the variables. In HetGP modeling, the GP lengthscale parameters for the noise process are presumed to be smaller than those of the mean process. This makes the noise surface less smooth, so there are many more outliers in the boxplots of first-order and total sensitivity indices. Still, indices for mj and Pj,3 are apparently higher than those for the other variables in both the middle and right panels. Thus, mj and Pj,3 result in the highest variation in both the mean and variance processes. Interestingly, these two variables are both related to the juvenile stage of delta smelt, which may motivate new biological findings.

The proportion of positive I = T − S is shown in Table 5.4. Compared to Table 5.3, all flags here are greater than 0.5. This indicates that more interactions exist for the variance process.

Proportion   my     mj    mr     Pl,2   Pp,2   Pj,3   Pa,3
T − S > 0    0.81   1     0.92   0.91   0.81   0.98   0.78

Table 5.4: Proportion of positive I = T − S indices for the variance process.

Chapter 6

Conclusion

6.1 Distance-distributed design for GP surrogates

In Chapter 3, we described a new scheme for design for surrogate modeling of computer experiments based on pairwise distance distributions. The idea was borne out of the occasionally puzzling behavior of more conventional maximin and LHS designs, especially as deployed as initial designs in a sequential setting. Maximin designs, and to a certain extent LHSs, lead to a highly irregular pairwise distance distribution which all but precludes the estimation of small lengthscales except when the design is very large. By deliberately targeting a simpler family of unimodal distance distributions we have found that it is possible to avoid that puzzling behavior, obtain a more accurate estimate of the lengthscale, and ultimately make better predictions and sequential design decisions. For reproducibility, the code behind our empirical work is provided in an open Git repository on Bitbucket: https://bitbucket.org/gramacylab/betadist.

We proposed an optimization strategy for finding the best distance distributions within the Beta family conditional on the design setting, specified kernel family, design size (n) and input dimension (d). Many potential avenues for further investigation naturally suggest themselves. For simplicity, we limited our study to the isotropic Gaussian family. One could check that similar results hold for other common families like the Matérn. A more ambitious extension would be to separable structures where there is a lengthscale for each input coordinate: θ1, . . . , θd. Obtaining appropriate pairwise distance distributions in each coordinate simultaneously could prove difficult, especially in small-n, large-d settings. However, we speculate that the problem could be effectively reduced to d univariate ones. Considering nugget hyperparameters in the optimization would add yet another layer of complication. In that setting, we may wish to consider replication (i.e., zero-inflated distance distributions) as a means of separating signal from noise (Binois et al., 2018c).
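The random-swap search over designs can be caricatured as follows. The actual criterion from Chapter 3 is not reproduced here; matching the pairwise-distance distribution to the Beta(α, β) target via a Kolmogorov–Smirnov statistic, with distances scaled to [0, 1] by √d, is our stand-in for illustration.

```python
import numpy as np
from scipy import stats

def betadist_design(n, d, a, b, iters=2000, rng=None):
    """Random-swap search for an n-point design in [0,1]^d whose
    pairwise-distance distribution is close to a Beta(a, b) target
    (distances scaled to [0,1] by sqrt(d))."""
    rng = np.random.default_rng(rng)

    def ks(X):  # distance-distribution mismatch to the Beta target
        D = np.sqrt(((X[:, None] - X[None]) ** 2).sum(-1))
        pd = D[np.triu_indices(n, 1)] / np.sqrt(d)
        return stats.kstest(pd, stats.beta(a, b).cdf).statistic

    X = rng.uniform(size=(n, d))
    best = ks(X)
    for _ in range(iters):
        i, j = rng.integers(n), rng.integers(d)
        old = X[i, j]
        X[i, j] = rng.uniform()   # propose a random coordinate move
        new = ks(X)
        if new < best:
            best = new            # keep the improvement
        else:
            X[i, j] = old         # revert
    return X, best

X, fit = betadist_design(20, 2, 2.5, 4.0, iters=300, rng=0)
```

Swapping in a better search (e.g., PSO, as discussed below) or a different mismatch criterion only changes the inner loop.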

Many response surfaces from simulations of industrial systems are exceedingly smooth and slowly varying over the study region of interest. Such knowledge, when available, could translate into an a priori belief about large lengthscales θ, or even a lower bound on θ. In our empirical work, and searches for optimal $\mathrm{betadist}_{n,d}(\alpha, \beta)$ through simulated θ-values, we took a lower bound on θ of effectively zero. However, we see no reason why a different lower bound couldn’t be applied. We speculate that narrowing the range of θ, especially toward the upper end, would result in an organic preference for larger pairwise distances through the search for optimal $(\hat{\alpha}, \hat{\beta})$, and that these designs would perform more similarly to space-filling ones like maximin.

Another family of target distance distributions, i.e., besides the Beta, could prove easier to optimize over, or otherwise lead to better designs. A higher-powered search for designs, besides random swapping, might mitigate the computational burden of finding optimal distance-distributed designs, which becomes problematic when n is large. Some researchers have recently had success with particle swarm optimization (PSO) in design settings, like minimax design (Chen et al., 2015), which might port well to the distance-distribution setting and the lhsbeta hybrid. Perhaps the most important take-home message from this manuscript is that maximin designs can be awful. LHSs are better, because they avoid a multi-modal distance distribution and, simultaneously, a degree of aliasing through their one-dimensional uniformity property. However, we argue that the most important thing is to have a good design for hyperparameter inference, which neither method targets directly. In fact, random design is better than both in this respect, which is perhaps surprising. If the hyperparameters are assumed known, then LHS and maximin are great. It is worth noting that ascribing physical or interpretive meaning to lengthscale hyperparameters can be extremely challenging. Therefore, it is hard to imagine that one could consistently choose appropriate lengthscales without help from automatic procedures like MLE, which, of course, need a design.

6.2 IMSPE Batch-sequential design

Motivated by a computationally intensive stochastic agent-based model simulating the ecosystem and life cycles of delta smelt, an endangered fish, in Chapter 4 we developed a batch sequential design scheme for loading supercomputing nodes with runs in batches. We used a heteroskedastic Gaussian process (HetGP) surrogate model to acknowledge nonlinear dynamics in mean and variance, revealed in a limited pilot study, and extended a variance-based (IMSPE) scheme for sequential design under such models to allow the selection of multiple new runs at once. To facilitate numerical optimization of batch IMSPE we furnished closed-form derivatives and developed a backtracking scheme to determine if any near replicates provided by the solver were actual replicates. Only actual replicates efficiently separate signal from noise and pay computational dividends at the same time.
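The end effect of that backtracking step, converting an optimizer's near-replicates into actual replicates so they pay those dividends, can be sketched as a distance-tolerance merge. The real scheme decides by comparing IMSPE with and without the snap; the fixed tol below is a simplifying assumption.

```python
import numpy as np

def snap_to_replicates(X_new, X_exist, tol=1e-3):
    """Snap any proposed point lying within Euclidean distance `tol`
    of an existing design site onto that site, making it an exact
    replicate; otherwise leave it as a new unique location."""
    X_new = np.atleast_2d(np.asarray(X_new, dtype=float)).copy()
    for i, x in enumerate(X_new):
        d = np.linalg.norm(X_exist - x, axis=1)
        k = d.argmin()
        if d[k] < tol:
            X_new[i] = X_exist[k]   # exact replicate: reuse the site
    return X_new

X_exist = np.array([[0.2, 0.2], [0.8, 0.5]])
X_new = snap_to_replicates([[0.2001, 0.2], [0.5, 0.5]], X_exist, tol=1e-2)
# first proposal becomes a replicate of (0.2, 0.2); second stays unique
```

Exact replicates allow the surrogate to pool responses at a site, which is what drives the statistical and computational savings described in the text.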

Our methods were illustrated and contrasted against previous (pure sequential/one-at-a-time) active learning strategies on several synthetic and real-simulation benchmarks. These allowed us to conclude that our scheme was no worse than previous approaches, while designing batches of runs that could fill out a supercomputing node. We then turned to our motivating delta smelt scenario to undertake a simulation campaign with thousands of runs (on an expanded input domain compared to the pilot study). What would have taken more than 12000 core hours, spanning more than 500 days if run back-to-back (and not counting any queue delays), took us about 44 days (including substantial queuing time).

This order-of-magnitude reduction in compute time, without noticeable drawbacks in modeling efficiency, could have a substantial impact on the modus operandi of conducting stochastic simulation experiments in practice. Widespread university and research lab access to supercomputing facilities is democratizing the application of mathematical modeling of complex physical and biological phenomena. However, strategies for planning those experiments in this unique architectural environment are sorely needed. We think the advances reported here take an important first step.

Simulations in hand, there are many interesting analyses which can be performed downstream. We provided some visuals based on slices and performed an input sensitivity analysis in order to determine which factors have the largest effect on smelt mortality in this particular system. Our choice of IMSPE suits this analysis well because it reduces variance globally, and our Saltelli et al.-style indices emphasize decomposition of variance. Extending Binois et al.’s IMSPE calculation to other downstream tasks has become a cottage industry of late. Examples include sequential learning of active subspaces (Wycoff et al., 2019), and level-set finding and Bayesian optimization (Lyu et al., 2018). Cole et al. (forthcoming) adapt a similar calculation for large-scale local Gaussian process approximation via inducing points. We see no barriers to extending these schemes similarly, to batch analogs of one-at-a-time acquisitions. Calibration to field data (e.g., Kennedy and O’Hagan, 2001), say sampling of actual delta smelt, remains on the frontier of design for surrogate modeling. Baker et al. (2020) identify this as an important area for further research.

There is certainly potential for improvement even within our particular niche. The performance of our scheme relies heavily on local numerical optimization via libraries. Finding global optima for non-convex criteria in high-dimensional spaces is always a challenge. Although we get good results with L-BFGS-B, we also tried particle swarm optimization (PSO; Kennedy and Eberhart, 1995) in several capacities: replacing BFGS wholesale and finding good BFGS starting points. Improvements were consistent but minor in the grand scheme of multiple batches of sequential design. The rgenoud package of Mebane and Sekhon (2011), which hybridizes genetic optimization with gradient-based methods, might be a good alternative. To further increase computational efficiency, one could instead optimize over space-filling candidate design sites and the corresponding numbers of replicates, an approach which has been explored for generalized linear models (Li and Deng, 2018). Our scheme is tailored to fixed, known batch sizes; we illustrated M = 24 because that matched the size of our supercomputing nodes. Other batch sizes work well, but an expanded capability might support unknown batch sizes or on-demand acquisition: whenever a batch of cores is available, the model/design scheme must be ready to furnish runs. This could be accomplished by maintaining a larger M-sized queue of prioritized inputs, say following Gramacy and Lee (2009), which would need to be updated for the HetGP framework.

In this work, for the mean process, we have been employing the Gaussian kernel. Other choices of kernel function, especially those with varying smoothness, may result in different degrees of replication. Also, we have been focusing on stationary kernels. When the mean is non-stationary, the challenge of separating signal from noise would obstruct the batch sequential design method's targeting of high-noise regions. Applying non-stationary kernels in the HetGP context is an interesting direction to explore in the future. Ba et al. (2012) composited two GPs to capture a non-stationary mean, i.e., constructing a non-stationary kernel as a sum of stationary kernels. Davis et al. (2019) recently proposed a Bayesian extension of the composite GP model for non-stationary response with heteroskedastic variance. Another way to do this is to model the mean under a deep GP setup (Damianou and Lawrence, 2013; Duvenaud et al., 2014). By involving multiple layers of latent nodes, which can be seen as compositing multiple stationary kernels, deep GPs are capable of much more flexible responses.

6.3 Delta Smelt simulator

For delta smelt in particular, the current version of the simulator assumes a constant mortality rate. A newly developed version, coming online partway through our campaign, includes an option to allow the mortality rate to depend on individual size and population density. A second simulation campaign, perhaps at the same inputs selected for the first campaign/simulator or under a novel batch-sequential design, could be used to contrast regimes, update sensitivities and perform calibration.

Bibliography

Ankenman, B., Nelson, B. L., and Staum, J. (2010). “Stochastic kriging for simulation metamodeling.” Operations Research, 58, 2, 371–382.

Ba, S., Joseph, V. R., et al. (2012). “Composite Gaussian process models for emulating expensive functions.” The Annals of Applied Statistics, 6, 4, 1838–1860.

Baker, E., Barbillon, P., Fadikar, A., Gramacy, R. B., Herbei, R., Higdon, D., Huang, J., Johnson, L. R., Ma, P., Mondal, A., Pires, B., Sacks, J., and Sokolov, V. (2020). “Stochastic Simulators: An Overview with Opportunities.”

Barnett, S. (1979). Matrix Methods for Engineers and Scientists. McGraw-Hill.

Baxter, R., Breuer, R., Brown, L., Conrad, L., Feyrer, F., Fong, S., Gehrts, K., Grimaldo, L., Herbold, B., Hrodey, P., Mueller-Solger, A., Sommer, T., and Souza, K. (2010). “2010 pelagic organism decline work plan and synthesis of results.” Interagency Ecological Program for the San Francisco Estuary, California Department of Water Resources, Sacramento.

Bengtsson, H. (2018). R.matlab: Read and Write MAT Files and Call MATLAB from Within R. R package version 3.6.2.

Binois, M. and Gramacy, R. B. (2018). hetGP: Heteroskedastic Gaussian Process Modeling and Design under Replication. R package version 1.1.1.

Binois, M., Gramacy, R. B., and Ludkovski, M. (2018a). “Practical Heteroscedastic Gaussian Process Modeling for Large Simulation Experiments.” Journal of Computational and Graphical Statistics, 27, 4, 808–821.

— (2018b). “Practical heteroskedastic Gaussian process modeling for large simulation experiments.” Journal of Computational and Graphical Statistics, 0, ja, 1–41.

Binois, M., Huang, J., Gramacy, R. B., and Ludkovski, M. (2018c). “Replication or exploration? Sequential design for stochastic simulation experiments.” Technometrics, 0, ja, 1–43.

Bisset, K. R., Chen, J., Feng, X., Kumar, V. A., and Marathe, M. V. (2009). “EpiFast: a fast algorithm for large scale realistic epidemic simulations on distributed memory systems.” In Proceedings of the 23rd International Conference on Supercomputing, 430–439.

Bull, A. D. (2011). “Convergence Rates of Efficient Global Optimization Algorithms.” Journal of Machine Learning Research, 12, 2879–2904.

Byrd, R., Lu, P., Nocedal, J., and Zhu, C. (2003). “A Limited Memory Algorithm for Bound Constrained Optimization.” SIAM Journal on Scientific Computing, 16.

Carnell, R. (2020). lhs: Latin Hypercube Samples. R package version 1.0.2.

Chen, H., Loeppky, J. L., Sacks, J., and Welch, W. J. (2016). “Analysis Methods for Computer Experiments: How to Assess and What Counts?” Statistical Science, 31, 1, 40–60.

Chen, J., Mak, S., Joseph, V. R., and Zhang, C. (2019). “Adaptive design for Gaussian process regression under censoring.” arXiv preprint arXiv:1910.05452.

Chen, R.-B., Chang, S.-P., Wang, W., and Wong, H.-C. T. K. (2015). “Minimax optimal designs via particle swarm optimization methods.” Statistics and Computing, 25, 5, 975–988.

Chevalier, C. (2013). “Fast uncertainty reduction strategies relying on Gaussian process models.” Ph.D. thesis.

Chevalier, C., Bect, J., Ginsbourger, D., Vazquez, E., Picheny, V., and Richet, Y. (2014). “Fast Parallel Kriging-Based Stepwise Uncertainty Reduction With Application to the Identification of an Excursion Set.” Technometrics, 56, 4, 455–465.

Chung, M., Binois, M., Gramacy, R. B., Moquin, D. J., Smith, A. P., and Smith, A. M. (2018). “Parameter and Uncertainty Estimation for Dynamical Systems Using Surrogate Stochastic Processes.” arXiv preprint arXiv:1802.00852.

Cressie, N. (1985). “Fitting variogram models by weighted least squares.” Journal of the International Association for Mathematical Geology, 17, 5, 563–586.

Dam, E., Husslage, B., den Hertog, D., and Melissen, H. (2005). “Maximin Latin Hypercube Designs in Two Dimensions.” Operations Research, 55.

Damianou, A. and Lawrence, N. (2013). “Deep Gaussian Processes.” arXiv, abs/1211.0358.

Davis, C. B., Hans, C. M., and Santner, T. J. (2019). “Prediction Using a Bayesian Heteroscedastic Composite Gaussian Process.”

Duan, W., Ankenman, B. E., Sanchez, S. M., and Sanchez, P. J. (2017). “Sliced Full Factorial-Based Latin Hypercube Designs as a Framework for a Batch Sequential Design Algorithm.” Technometrics, 59, 1, 11–22.

Duvenaud, D., Rippel, O., Adams, R., and Ghahramani, Z. (2014). “Avoiding pathologies in very deep networks.” In Artificial Intelligence and Statistics, 202–210.

Erickson, C. B., Ankenman, B. E., Plumlee, M., and Sanchez, S. M. (2018). “Gradient based criteria for sequential design.” In 2018 Winter Simulation Conference (WSC), 467–478.

Fadikar, A., Higdon, D., Chen, J., Lewis, B., Venkatramanan, S., and Marathe, M. (2018). “Calibrating a stochastic, agent-based model using quantile-based emulation.” SIAM/ASA Journal on Uncertainty Quantification, 6, 4, 1685–1706.

Fang, K.-T. (1980). “The Uniform Design: Application of Number-Theoretic Methods in Experimental Design.” Acta Mathematicae Applicatae Sinica, 3, 363–372.

Fang, K.-T., Lin, D. K., Winker, P., and Zhang, Y. (2000). “Uniform Design: Theory and Application.” Technometrics, 42, 3, 237–248.

Farah, M., Birrell, P., Conti, S., and Angelis, D. D. (2014). “Bayesian emulation and calibration of a dynamic epidemic model for A/H1N1 influenza.” Journal of the American Statistical Association, 109, 508, 1398–1411.

Fish and Wildlife Service, I. (1993). “Endangered and Threatened Wildlife and Plants; Determination of Threatened Status for the Delta Smelt.” Federal Register, 58, 42, 12854–12864.

Frazier, P. I. (2018). “A tutorial on Bayesian optimization.” arXiv preprint arXiv:1807.02811.

Ginsbourger, D. and Le Riche, R. (2010). “Towards Gaussian process-based optimization with finite time horizon.” In mODa 9 – Advances in Model-Oriented Design and Analysis, 89–96. Springer.

Ginsbourger, D., Le Riche, R., and Carraro, L. (2010). “Kriging is well-suited to parallelize optimization.” In Computational Intelligence in Expensive Optimization Problems, 131–162. Springer.

Gneiting, T. and Raftery, A. E. (2007). “Strictly Proper Scoring Rules, Prediction, and Estimation.” Journal of the American Statistical Association, 102, 477, 359–378.

Gramacy, R. and Polson, N. (2011). “Particle learning of Gaussian process models for sequential design and optimization.” Journal of Computational and Graphical Statistics, 20, 1, 102–118.

Gramacy, R. B. (2007). “tgp: An R Package for Bayesian Nonstationary, Semiparametric Nonlinear Regression and Design by Treed Gaussian Process Models.” Journal of Statistical Software, 19, 9, 1–46.

— (2020). Surrogates: Gaussian Process Modeling, Design and Optimization for the Applied Sciences. Boca Raton, Florida: Chapman Hall/CRC. http://bobby.gramacy.com/surrogates/.

Gramacy, R. B. and Apley, D. W. (2015). “Local Gaussian Process Approximation For Large Computer Experiments.” Journal of Computational and Graphical Statistics, 24, 2, 561–578. See arXiv:1303.0383.

Gramacy, R. B., Gray, G. A., Digabel, S. L., Lee, H. K. H., Ranjan, P., Wells, G., and Wild, S. M. (2016). “Modeling an Augmented Lagrangian for Blackbox Constrained Optimization.” Technometrics, 58, 1, 1–11.

Gramacy, R. B. and Lee, H. K. H. (2008). “Bayesian Treed Gaussian Process Models With an Application to Computer Modeling.” Journal of the American Statistical Association, 103, 483, 1119–1130.

— (2009). “Adaptive Design and Analysis of Supercomputer Experiments.” Technometrics, 51, 2, 130–145.

Gramacy, R. B. and Taddy, M. (2010). “Categorical Inputs, Sensitivity Analysis, Optimization and Importance Tempering with tgp Version 2, an R Package for Treed Gaussian Process Models.” Journal of Statistical Software, 33, 6, 1–48.

Hamilton, S. and Murphy, D. (2018). “Analysis of Limiting Factors Across the Life Cycle of Delta Smelt (Hypomesus transpacificus).” Environmental Management, 62.

Han, Z.-H., Görtz, S., and Zimmermann, R. (2013). “Improving variable-fidelity surrogate modeling via gradient-enhanced kriging and a generalized hybrid bridge function.” Aerospace Science and Technology, 25, 177–189.

Harville, D. A. (1998). “Matrix algebra from a statistician’s perspective.”

Herbei, R. and Berliner, L. M. (2014). “Estimating ocean circulation: an MCMC approach with approximated likelihoods via the Bernoulli factory.” Journal of the American Statistical Association, 109, 507, 944–954.

Higdon, D., Kennedy, M., Cavendish, J. C., Cafeo, J. A., and Ryne, R. D. (2004). “Combining field data and computer simulations for calibration and prediction.” SIAM Journal on Scientific Computing, 26, 2, 448–466.

Hong, L. and Nelson, B. (2006). “Discrete Optimization via Simulation Using COMPASS.” Operations Research, 54, 1, 115–129.

Huang, D., Allen, T. T., Notz, W. I., and Zeng, N. (2006). “Global optimization of stochastic black-box systems via sequential kriging meta-models.” Journal of Global Optimization, 34, 3, 441–466.

Johnson, L. (2008). “Microcolony and Biofilm Formation as a Survival Strategy for Bacteria.” Journal of Theoretical Biology, 251, 24–34.

Johnson, M., Moore, L., and Ylvisaker, D. (1990). “Minimax and Maximin Distance Designs.” Journal of Statistical Planning and Inference, 26, 131–148.

Jones, D., Schonlau, M., and Welch, W. J. (1998). “Efficient Global Optimization of Expensive Black Box Functions.” Journal of Global Optimization, 13, 455–492.

Kennedy, J. and Eberhart, R. (1995). “Particle swarm optimization.” In Proceedings of ICNN’95 - International Conference on Neural Networks, vol. 4, 1942–1948.

Kennedy, M. C. and O’Hagan, A. (2001). “Bayesian Calibration of Computer Models.” Journal of the Royal Statistical Society, Series B, 63, 425–464.

Kim, S.-H. and Nelson, B. L. (2006). “Selecting the best system.” Handbooks in Operations Research and Management Science, 13, 501–534.

Kimmerer, W. and Rose, K. (2018). “Individual-Based Modeling of Delta Smelt Population Dynamics in the Upper San Francisco Estuary III. Effects of Entrainment Mortality and Changes in Prey.” Transactions of the American Fisheries Society, 147, 223–243.

Kimmerer, W. J. and Nobriga, M. L. (2008). “Investigating Particle Transport and Fate in the Sacramento-San Joaquin Delta Using a Particle Tracking Model.”

Kleijnen, J. P. and Van Beers, W. C. (2005). “Robustness of Kriging when interpolating in random simulation with heterogeneous variances: Some experiments.” European Journal of Operational Research, 165, 3, 826–834.

Leatherman, E. R., Santner, T. J., and Dean, A. M. (2017). “Computer experiment designs for accurate prediction.” Statistics and Computing, 1–13.

Li, Y. and Deng, X. (2018). “EI-Optimal Design: An Efficient Algorithm for Elastic I-optimal Design of Generalized Linear Models.” arXiv preprint arXiv:1801.05861.

Lin, C. D., Bingham, D., Sitter, R. R., and Tang, B. (2010). “A new and flexible method for constructing designs for computer experiments.” The Annals of Statistics, 38.

Lin, C. D., Mukerjee, R., and Tang, B. (2009). “Construction of orthogonal and nearly orthogonal Latin hypercubes.” Biometrika, 96, 1, 243–247.

Lin, C. D. and Tang, B. (2015). Handbook of Design and Analysis of Experiments, chap. Latin Hypercubes and Space-Filling Designs. CRC Press.

Liu, H., Ong, Y., and Cai, J. (2018). “A Survey of Adaptive Sampling for Global Metamodeling in Support of Simulation-based Complex Engineering Design.” Structural and Multidisciplinary Optimization.

Loeppky, J., Moore, L., and Williams, B. (2009). “Batch sequential designs for computer experiments.” Journal of Statistical Planning and Inference, 140, 1452–1464.

Loeppky, J. L., Moore, L. M., and Williams, B. J. (2010). “Batch sequential designs for computer experiments.” Journal of Statistical Planning and Inference, 140, 6, 1452–1464.

Lyu, X., Binois, M., and Ludkovski, M. (2018). “Evaluating Gaussian process metamodels and sequential designs for noisy level set estimation.” arXiv preprint arXiv:1807.06712.

MacKay, D. J. C. (1992). “Information-based Objective Functions for Active Data Selection.” Neural Computation, 4, 4, 589–603.

MacNally, R., Thomson, J., Kimmerer, W., Feyrer, F., Newman, K., Sih, A., Bennett, W., Brown, L., Fleishman, E., Culberson, S., and Castillo, G. (2010). “Analysis of pelagic species decline in the upper San Francisco Estuary using multivariate autoregressive modeling (MAR).” Ecological Applications, 20, 1417–30.

Mak, S. and Joseph, V. R. (2018). “Minimax and Minimax Projection Designs Using Clustering.” Journal of Computational and Graphical Statistics, 27, 1, 166–178.

Handcock, M. S. (1991). “On cascading latin hypercube designs and additive models for experiments.” Communications in Statistics - Theory and Methods, 20, 2, 417–439.

Marrel, A., Iooss, B., Laurent, B., and Roustant, O. (2009). “Calculations of Sobol indices for the Gaussian process metamodel.” Reliability Engineering & System Safety, 94, 3, 742–751.

Matheron, G. (1963). “Principles of Geostatistics.” Economic Geology, 58, 1246–1266.

Maunder, M. and Deriso, R. (2011). “A state-space multistage life cycle model to evaluate population impacts in the presence of density dependence: Illustrated with application to delta smelt (Hypomesus transpacificus).” Canadian Journal of Fisheries and Aquatic Sciences, 68, 1285–1306.

McKay, M., Beckman, R., and Conover, W. (1979). “A Comparison of Three Methods for Selecting Values of Input Variables in the Analysis of Output From a Computer Code.” Technometrics, 21, 239–245.

McKeague, I. W., Nicholls, G., Speer, K., and Herbei, R. (2005). “Statistical inversion of South Atlantic circulation in an abyssal neutral density layer.” Journal of Marine Research, 63, 4, 683–704.

Mebane, W. and Sekhon, J. (2011). “Genetic Optimization Using Derivatives: The rgenoud Package for R.” Journal of Statistical Software, 42, 1–26.

Morris, M., Mitchell, T., and Ylvisaker, D. (1993). “Bayesian Design and Analysis of Computer Experiments: Use of Derivatives in Surface Prediction.” Technometrics, 35.

Morris, M. D. (1991). “On counting the number of data pairs for semivariogram estimation.” Mathematical Geology, 25, 929–943.

Morris, M. D. and Mitchell, T. J. (1995). “Exploratory Designs for Computational Experiments.” Journal of Statistical Planning and Inference, 43, 381–402.

Notz, W. I. and Lam, C. Q. (2008). “Sequential adaptive designs in computer experiments for response surface model fit.”

Oakley, J. and O’Hagan, A. (2004). “Probabilistic sensitivity analysis of complex models: a Bayesian approach.” Journal of the Royal Statistical Society: Series B (Statistical Methodology), 66, 3, 751–769.

Owen, A. B. (1994). “Controlling Correlations in Latin Hypercube Samples.” Journal of the American Statistical Association, 89, 428, 1517–1522.

Picheny, V., Ginsbourger, D., Richet, Y., and Caplin, G. (2013a). “Quantile-based optimization of noisy computer experiments with tunable precision.” Technometrics, 55, 1, 2–13.

Picheny, V., Gramacy, R., Wild, S. M., and Digabel, S. (2016). “Bayesian optimization under mixed constraints with a slack-variable augmented Lagrangian.” In NIPS.

Picheny, V., Wagner, T., and Ginsbourger, D. (2013b). “A benchmark of kriging-based infill criteria for noisy optimization.” Structural and Multidisciplinary Optimization, 48, 3, 607–626.

Pronzato, L. and Müller, W. (2011). “Design of computer experiments: Space filling and beyond.” Statistics and Computing, 1–21.

Qian, P. Z. G. (2009). “Nested Latin hypercube designs.” Biometrika, 96, 4, 957–970.

Rose, K. A., Kimmerer, W. J., Edwards, K. P., and Bennett, W. A. (2013). “Individual-Based Modeling of Delta Smelt Population Dynamics in the Upper San Francisco Estuary: I. Model Description and Baseline Results.” Transactions of the American Fisheries Society, 142, 5, 1238–1259.

Russo, D. (1984). “Design of an Optimal Sampling Network for Estimating the Variogram.” Soil Science Society of America Journal, 48.

Rutter, C. M., Ozik, J., DeYoreo, M., Collier, N., et al. (2019). “Microsimulation model calibration using incremental mixture approximate Bayesian computation.” The Annals of Applied Statistics, 13, 4, 2189–2212.

Sacks, J., Welch, W. J., Mitchell, T. J., and Wynn, H. (1989). “Design and analysis of computer experiments. With comments and a rejoinder by the authors.” Statistical Science, 4.

Saltelli, A., Chan, K., and Scott, M. (2000). Sensitivity Analysis. New York, NY: John Wiley & Sons.

Saltelli, A., Ratto, M., Andres, T., Campolongo, F., Cariboni, J., Gatelli, D., Saisana, M., and Tarantola, S. (2008). Global Sensitivity Analysis: The Primer. John Wiley & Sons.

Santner, T., Williams, B., and Notz, W. (2003). The Design and Analysis of Computer Experiments. Springer.

— (2018). The Design and Analysis of Computer Experiments, Second Edition. New York, NY: Springer-Verlag.

Seo, S., Wallat, M., Graepel, T., and Obermayer, K. (2000). “Gaussian Process Regression: Active Data Selection and Test Point Rejection.” In Proceedings of the International Joint Conference on Neural Networks, vol. III, 241–246. IEEE.

Shewry, M. C. and Wynn, H. P. (1987). “Maximum entropy sampling.” Journal of Applied Statistics, 14, 2, 165–170.

Snoek, J., Larochelle, H., and Adams, R. P. (2012). “Bayesian optimization of machine learning algorithms.” In Neural Information Processing Systems (NIPS).

Taddy, M., Lee, H., Gray, G., and Griffin, J. (2009). “Bayesian guided pattern search for robust local optimization.” Technometrics, 51, 4, 389–401.

Tan, M. (2013). “Minimax Designs for Finite Design Regions.” Technometrics, 55, 346–358.

Tang, B. (1993). “Orthogonal Array-Based Latin Hypercubes.” Journal of the American Statistical Association, 88, 424, 1392–1397.

Thomson, J., Kimmerer, W., Brown, L., Newman, K., Mac Nally, R., Bennett, W., Feyrer, F., and Fleishman, E. (2010). “Bayesian change point analysis of abundance trends for pelagic fishes in the upper San Francisco Estuary.” Ecological Applications, 20, 1431–48.

Williams, B. J., Loeppky, J. L., Moore, L. M., and Macklem, M. S. (2011). “Batch sequential design to achieve predictive maturity with calibrated computer models.” Reliability Engineering & System Safety, 96, 9, 1208–1219.

Wycoff, N., Binois, M., and Wild, S. M. (2019). “Sequential Learning of Active Subspaces.”

Xie, J., Frazier, P. I., Sankaran, S., Marsden, A., and Elmohamed, S. (2012). “Optimization of computationally expensive simulations with Gaussian processes and parameter uncertainty: Application to cardiovascular surgery.” In 2012 50th Annual Allerton Conference on Communication, Control, and Computing (Allerton), 406–413.

Yin, J., Ng, S. H., and Ng, K. M. (2011). “Kriging metamodel with modified nugget-effect: The heteroscedastic variance case.” Computers & Industrial Engineering, 61, 3, 760–777.

Yu, H. (2002). “Rmpi: Parallel Statistical Computing in R.” R News, 2, 2, 10–14.

Zhao, Y. and Wall, M. M. (2004). “Investigating the Use of the Variogram for Lattice Data.” Journal of Computational and Graphical Statistics, 13, 3, 719–738.

Zimmerman, D. (2006). “Optimal network design for spatial prediction, covariance parameter estimation, and empirical prediction.” Environmetrics, 17, 635–652.