Maximizing acquisition functions for Bayesian optimization
James T. Wilson∗ (Imperial College London), Frank Hutter (University of Freiburg), Marc Peter Deisenroth (Imperial College London and PROWLER.io)
Abstract
Bayesian optimization is a sample-efficient approach to global optimization that relies on theoretically motivated value heuristics (acquisition functions) to guide its search process. Fully maximizing acquisition functions produces the Bayes’ decision rule, but this ideal is difficult to achieve since these functions are frequently non-trivial to optimize. This statement is especially true when evaluating queries in parallel, where acquisition functions are routinely non-convex, high-dimensional, and intractable. We first show that acquisition functions estimated via Monte Carlo integration are consistently amenable to gradient-based optimization. Subsequently, we identify a common family of acquisition functions, including EI and UCB, whose properties not only facilitate but justify use of greedy approaches for their maximization.
1 Introduction
Bayesian optimization (BO) is a powerful framework for tackling complicated global optimization problems [32, 40, 44]. Given a black-box function f : X → Y, BO seeks to identify a maximizer x∗ ∈ arg max_{x∈X} f(x) while simultaneously minimizing incurred costs. Recently, these strategies have demonstrated state-of-the-art results on many important, real-world problems ranging from material sciences [17, 57], to robotics [3, 7], to algorithm tuning and configuration [16, 29, 53, 56].

From a high-level perspective, BO can be understood as the application of Bayesian decision theory to optimization problems [11, 14, 45]. One first specifies a belief over possible explanations for f using a probabilistic surrogate model and then combines this belief with an acquisition function L to convey the expected utility for evaluating a set of queries X. In theory, X is chosen according to Bayes’ decision rule as L’s maximizer by solving an inner optimization problem [19, 42, 59]. In practice, challenges associated with maximizing L greatly impede our ability to live up to this standard. Nevertheless, this inner optimization problem is often treated as a black-box unto itself. Failing to address this challenge leads to a systematic departure from BO’s premise and, consequently, consistent deterioration in achieved performance.

To help reconcile theory and practice, we present two modern perspectives for addressing BO’s inner optimization problem that exploit key aspects of acquisition functions and their estimators. First, we clarify how sample path derivatives can be used to optimize a wide range of acquisition functions estimated via Monte Carlo (MC) integration. Second, we identify a common family of submodular acquisition functions and show that its constituents can generally be expressed in a more computer-friendly form.
These acquisition functions’ properties enable greedy approaches to efficiently maximize them with guaranteed near-optimal results. Finally, we demonstrate through comprehensive experiments that these theoretical contributions directly translate to reliable and, often, substantial performance gains.
∗Correspondence to [email protected]

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

Algorithm 1 BO outer-loop (joint parallelism)
1: Given model M, acquisition L, and data D
2: for t = 1, ..., T do
3:   Fit model M to current data D
4:   Set q = min(q_max, T − t)
5:   Find X ∈ arg max_{X′∈X^q} L(X′)
6:   Evaluate y ← f(X)
7:   Update D ← D ∪ {(x_i, y_i)}_{i=1}^q
8: end for

Algorithm 2 BO outer-loop (greedy parallelism)
1: Given model M, acquisition L, and data D
2: for t = 1, ..., T do
3:   Fit model M to current data D
4:   Set X ← ∅
5:   for j = 1, ..., min(q_max, T − t) do
6:     Find x_j ∈ arg max_{x∈X} L(X ∪ {x})
7:     X ← X ∪ {x_j}
8:   end for
9:   Evaluate y ← f(X)
10:  Update D ← D ∪ {(x_i, y_i)}_{i=1}^q
11: end for

Figure 1: (a) Pseudo-code for standard BO’s “outer-loop” with parallelism q; the inner optimization problem is boxed in red. (b–c) GP-based belief and expected utility (EI), given four initial observations ‘•’. The aim of the inner optimization problem is to find the optimal query (starred). (d) Time to compute 2^14 evaluations of MC q-EI using a GP surrogate for varied observation counts and degrees of parallelism. Runtimes fall off at the final step because q decreases to accommodate evaluation budget T = 1,024.

2 Background
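The greedy loop of Algorithm 2 can be sketched in a few lines. The snippet below is an illustrative implementation over a finite candidate pool, not the paper's code; `acquisition` is a hypothetical callable mapping a set of queries to a scalar value.

```python
import numpy as np

def greedy_batch(acquisition, candidates, q):
    """Greedily assemble a batch of q queries (Algorithm 2, inner loop).

    At each step, append the candidate that maximizes the joint
    acquisition value of the batch-so-far plus that candidate.
    `acquisition` maps a list of candidates to a scalar utility.
    """
    batch = []
    for _ in range(q):
        pool = [c for c in candidates if c not in batch]
        scores = [acquisition(batch + [c]) for c in pool]
        batch.append(pool[int(np.argmax(scores))])
    return batch
```

Each greedy step solves a q = 1 problem conditioned on the queries chosen so far, which is why the approach sidesteps joint optimization over all q points.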
Bayesian optimization relies on both a surrogate model M and an acquisition function L to define a strategy for efficiently maximizing a black-box function f. At each “outer-loop” iteration (Figure 1a), this strategy is used to choose a set of queries X whose evaluation advances the search process. This section reviews related concepts and closes with discussion of the associated inner optimization problem. For an in-depth review of BO, we defer to the recent survey [52].

Without loss of generality, we assume BO strategies evaluate q designs X ∈ R^{q×d} in parallel so that setting q = 1 recovers purely sequential decision-making. We denote available information regarding f as D = {(x_i, y_i)}_{i=1}^n and, for notational convenience, assume noiseless observations y = f(X). Additionally, we refer to L’s parameters (such as an improvement threshold) as ψ and to M’s parameters as ζ. Henceforth, direct reference to these terms will be omitted where possible.

Surrogate models  A surrogate model M provides a probabilistic interpretation of f whereby possible explanations for the function are seen as draws f^k ∼ p(f | D). In some cases, this belief is expressed as an explicit ensemble of sample functions [28, 54, 60]. More commonly however, M dictates the parameters θ of a (joint) distribution over the function’s behavior at a finite set of points X. By first tuning the model’s (hyper)parameters ζ to explain D, a belief is formed as p(y | X, D) = p(y; θ) with θ ← M(X; ζ). Throughout, θ ← M(X; ζ) is used to denote that belief p’s parameters θ are specified by model M evaluated at X. A member of this latter category, the Gaussian process prior (GP) is the most widely used surrogate and induces a multivariate normal belief θ ≜ (µ, Σ) ← M(X; ζ) such that p(y; θ) = N(y; µ, Σ) for any finite set X (see Figure 1b).

Acquisition functions  With few exceptions, acquisition functions amount to integrals defined in terms of a belief p over the unknown outcomes y = {y_1, . . . , y_q} revealed when evaluating a black-box function f at corresponding input locations X = {x_1, . . . , x_q}. This formulation naturally occurs as part of a Bayesian approach whereby the value of querying X is determined by accounting for the utility provided by possible outcomes y^k ∼ p(y | X, D). Denoting the chosen utility function as ℓ, this paradigm leads to acquisition functions defined as expectations

    L(X; D, ψ) = E_y[ℓ(y; ψ)] = ∫ ℓ(y; ψ) p(y | X, D) dy .    (1)

A seeming exception to this rule, non-myopic acquisition functions assign value by further considering how different realizations of D_k ← D ∪ {(x_i, y_i^k)}_{i=1}^q impact our broader understanding of f and usually correspond to more complex, nested integrals. Figure 1c portrays a prototypical acquisition surface and Table 1 exemplifies popular, myopic and non-myopic instances of (1).
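To make Eq. (1) concrete, the sketch below estimates parallel Expected Improvement by sampling joint outcomes y ∼ N(µ, Σ) at the q query points and averaging the utility ℓ(y) = max_k ReLU(y_k − α). This is a minimal illustration using numpy, assuming a GP posterior (µ, Σ) is already available; the function name and defaults are our own.

```python
import numpy as np

def mc_expected_improvement(mu, Sigma, alpha, m=100000, seed=0):
    """Monte Carlo estimate of parallel EI per Eq. (1):
    l(y) = max_k ReLU(y_k - alpha), averaged over m joint samples
    y ~ N(mu, Sigma) at the q query locations."""
    rng = np.random.default_rng(seed)
    y = rng.multivariate_normal(mu, Sigma, size=m)  # (m, q) outcome samples
    improvement = np.maximum(y - alpha, 0.0)        # ReLU(y - alpha), per point
    return improvement.max(axis=1).mean()           # E[max_k ReLU(y_k - alpha)]
```

For q = 1 with µ = 0, Σ = 1, α = 0, this should approach the analytic value φ(0) = 1/√(2π) ≈ 0.3989, which provides a handy sanity check.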
Inner optimization problem  Maximizing acquisition functions plays a crucial role in BO as the process through which abstract machinery (e.g. model M and acquisition function L) yields concrete actions (e.g. decisions regarding sets of queries X). Despite its importance however, this inner optimization problem is often neglected. This lack of emphasis is largely attributable to a greater focus on creating new and improved machinery as well as on applying BO to new types of problems. Moreover, elementary examples of BO facilitate L’s maximization. For example, optimizing a single query x ∈ R^d is usually straightforward when x is low-dimensional and L is myopic.

Abbr. | Acquisition Function                                | Reparameterization                                      | MM
EI    | E_y[max(ReLU(y − α))]                               | E_z[max(ReLU(µ + Lz − α))]                              | Y
PI    | E_y[max(1⁻(y − α))]                                 | E_z[max(σ((µ + Lz − α)/τ))]                             | Y
SR    | E_y[max(y)]                                         | E_z[max(µ + Lz)]                                        | Y
UCB   | E_y[max(µ + √(βπ/2) |γ|)]                           | E_z[max(µ + √(βπ/2) |Lz|)]                              | Y
ES    | E_{y_a}[H(E_{y_b|y_a}[1⁺(y_b − max(y_b))])]         | E_{z_a}[H(E_{z_b}[softmax((µ_{b|a} + L_{b|a} z_b)/τ)])] | N
KG    | E_{y_a}[max(µ_b + Σ_{b,a} Σ_{a,a}⁻¹ (y_a − µ_a))]   | E_{z_a}[max(µ_b + Σ_{b,a} Σ_{a,a}⁻¹ L_a z_a)]           | N

Table 1: Examples of reparameterizable acquisition functions; the final column indicates whether they belong to the MM family (Section 3.2). Glossary: 1^{+/−} denotes the right-/left-continuous Heaviside step function; ReLU and σ the rectified linear and sigmoid nonlinearities, respectively; H the Shannon entropy; α an improvement threshold; τ a temperature parameter; L the Cholesky factor such that LL⊤ = Σ; and γ ∼ N(0, Σ) residuals. Lastly, non-myopic acquisition functions (ES and KG) are assumed to be defined using a discretization. Terms associated with the query set and discretization are respectively denoted via subscripts a and b.

Outside these textbook examples, however, BO’s inner optimization problem becomes qualitatively more difficult to solve. In virtually all cases, acquisition functions are non-convex (frequently due to the non-convexity of plausible explanations for f). Accordingly, increases in input dimensionality d can be prohibitive to efficient query optimization. In the generalized setting with parallelism q ≥ 1, this issue is exacerbated by the additional scaling in q. While this combination of non-convexity and (acquisition) dimensionality is problematic, the routine intractability of both non-myopic and parallel acquisition poses a commensurate challenge.
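The reparameterized forms in Table 1 are straightforward to evaluate numerically. As an illustration (our own sketch, not the paper's code), the parallel UCB row can be estimated by drawing base samples z ∼ N(0, I) and mapping them through µ + √(βπ/2)|Lz|; for q = 1 this recovers the classic µ + √β·σ in expectation, since E|z| = √(2/π).

```python
import numpy as np

def mc_ucb(mu, Sigma, beta, m=200000, seed=0):
    """Reparameterized parallel UCB from Table 1:
    E_z[max_k (mu + sqrt(beta*pi/2) * |L z|)_k] with z ~ N(0, I)
    and L the Cholesky factor of Sigma."""
    rng = np.random.default_rng(seed)
    L = np.linalg.cholesky(Sigma)
    z = rng.standard_normal((m, len(mu)))          # base samples z ~ N(0, I)
    spread = np.sqrt(beta * np.pi / 2.0) * np.abs(z @ L.T)
    return (mu + spread).max(axis=1).mean()
```

Expressed this way, the acquisition is a deterministic, almost-everywhere differentiable function of (µ, Σ), which is precisely what Section 3.1 exploits.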
As is generally true of integrals, the majority of acquisition functions are intractable. Even Gaussian integrals, which are often preferred because they lead to analytic solutions for certain instances of (1), are only tractable in a handful of special cases [13, 18, 20]. To circumvent the lack of closed-form solutions, researchers have proposed a wealth of diverse methods. Approximation strategies [13, 15, 60], which replace a quantity of interest with a more readily computable one, work well in practice but may not converge to the true value. In contrast, bespoke solutions [10, 20, 22] provide (near-)analytic expressions but typically do not scale well with dimensionality.² Lastly, MC methods [27, 47, 53] are highly versatile and generally unbiased, but are often perceived as non-differentiable and, therefore, inefficient for purposes of maximizing L.

Regardless of the method however, the (often drastic) increase in cost when evaluating L’s proxy acts as a barrier to efficient query optimization, and these costs increase over time as shown in Figure 1d. In an effort to address these problems, we now go inside the outer-loop and focus on efficient methods for maximizing acquisition functions.
3 Maximizing acquisition functions
This section presents the technical contributions of this paper, which can be broken down into two complementary topics: 1) gradient-based optimization of acquisition functions that are estimated via Monte Carlo integration, and 2) greedy maximization of “myopic maximal” acquisition functions. Below, we separately discuss each contribution along with its related literature.
3.1 Differentiating Monte Carlo acquisitions
Gradients are one of the most valuable sources of information for optimizing functions. In this section, we detail both the reasons and conditions whereby MC acquisition functions are differentiable and further show that most well-known examples readily satisfy these criteria (see Table 1).
²By near-analytic, we refer to cases where an expression contains terms that cannot be computed exactly but for which high-quality solvers exist (e.g. low-dimensional multivariate normal CDF estimators [20, 21]).
We assume that L is an expectation over a multivariate normal belief p(y | X, D) = N(y; µ, Σ) specified by a GP surrogate M such that (µ, Σ) ← M(X). More generally, we assume that samples can be generated as y^k ∼ p(y | X, D) to form an unbiased MC estimator of an acquisition function L(X) ≈ L_m(X) ≜ (1/m) Σ_{k=1}^{m} ℓ(y^k). Given such an estimator, we are interested in verifying whether

    ∇L(X) ≈ ∇L_m(X) ≜ (1/m) Σ_{k=1}^{m} ∇ℓ(y^k) ,    (2)

where ∇ℓ denotes the gradient of utility function ℓ taken with respect to X. The validity of MC gradient estimator (2) is obscured by the fact that y^k depends on X through generative distribution p and that ∇L_m is the expectation of ℓ’s derivative rather than the derivative of its expectation.

Originally referred to as infinitesimal perturbation analysis [8, 24], the reparameterization trick [37, 50] is the process of differentiating through an MC estimate to its generative distribution p’s parameters and consists of two components: i) reparameterizing samples from p as draws from a simpler base distribution p̂, and ii) interchanging differentiation and integration by taking the expectation over sample path derivatives.
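A sample-path gradient in the spirit of Eq. (2) can be worked out by hand for EI. Under y = µ + Lz, the utility ℓ(y) = max_k ReLU(y_k − α) is piecewise linear, and its gradient w.r.t. µ_i is simply the indicator that coordinate i attains the positive maximum. The sketch below (our own illustration, with differentiation only w.r.t. µ for brevity) averages these per-sample gradients; a finite-difference check under common random numbers confirms the estimator.

```python
import numpy as np

def qei_and_grad_mu(mu, L, alpha, z):
    """MC q-EI and its sample-path gradient w.r.t. mu, per Eq. (2).

    With y = mu + L z and l(y) = max_k ReLU(y_k - alpha),
    dl/dmu_i = 1 if coordinate i attains the positive max, else 0,
    so the MC gradient is the fraction of samples won by each coordinate.
    """
    y = mu + z @ L.T                       # reparameterized outcomes (m, q)
    imp = np.maximum(y - alpha, 0.0)       # per-point improvement
    val = imp.max(axis=1)                  # per-sample utility
    winners = imp.argmax(axis=1)           # which coordinate attains the max
    active = val > 0                       # samples with positive improvement
    grad = np.zeros_like(mu)
    np.add.at(grad, winners[active], 1.0)  # count wins per coordinate
    return val.mean(), grad / len(z)
```

Because the same base samples z are reused for every perturbation of µ, finite differences and the pathwise gradient agree up to the measure-zero set of kink crossings.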
Reparameterization  Reparameterization is a way of interpreting samples that makes their differentiability w.r.t. a generative distribution’s parameters transparent. Often, samples y^k ∼ p(y; θ) can be re-expressed as a deterministic mapping φ : Z × Θ → Y of simpler random variates z^k ∼ p̂(z) [37, 50]. This change of variables helps clarify that, if ℓ is a differentiable function of y = φ(z; θ), then dℓ/dθ = (dℓ/dφ)(dφ/dθ) by the chain rule of (functional) derivatives. If generative distribution p is multivariate normal with parameters θ = (µ, Σ), the corresponding mapping is then φ(z; θ) ≜ µ + Lz, where z ∼ N(0, I) and L is Σ’s Cholesky factor such that LL⊤ = Σ. Rewriting (1) as a Gaussian integral and reparameterizing, we have
    L(X) = ∫_a^b ℓ(y) N(y; µ, Σ) dy = ∫_{a′}^{b′} ℓ(µ + Lz) N(z; 0, I) dz ,    (3)

where each of the q terms c′_i in both a′ and b′ is transformed as c′_i = (c_i − µ_i − Σ_{j<i} L_{i,j} z_j) / L_{i,i}.
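The mapping φ(z; θ) = µ + Lz underlying (3) is easy to sanity-check numerically: pushing base samples z ∼ N(0, I) through it should reproduce the target mean and covariance. The helper below is an illustrative sketch (name and defaults are our own).

```python
import numpy as np

def reparameterize_normal(mu, Sigma, m=200000, seed=0):
    """phi(z; theta) = mu + L z with z ~ N(0, I) and L L^T = Sigma:
    a deterministic, differentiable map sending base samples to N(mu, Sigma)."""
    rng = np.random.default_rng(seed)
    L = np.linalg.cholesky(Sigma)              # lower-triangular factor of Sigma
    z = rng.standard_normal((m, len(mu)))      # base samples z ~ N(0, I)
    return mu + z @ L.T                        # samples from N(mu, Sigma)
```

Since all randomness lives in z, derivatives of any utility of these samples w.r.t. (µ, L) pass straight through the map, which is exactly the property Eq. (3) exploits.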