A Review of Bayesian Optimization

Taking the Human Out of the Loop: A Review of Bayesian Optimization The Harvard community has made this article openly available. Please share how this access benefits you. Your story matters Citation Shahriari, Bobak, Kevin Swersky, Ziyu Wang, Ryan P. Adams, and Nando de Freitas. 2016. “Taking the Human Out of the Loop: A Review of Bayesian Optimization.” Proc. IEEE 104 (1) (January): 148– 175. doi:10.1109/jproc.2015.2494218. Published Version doi:10.1109/JPROC.2015.2494218 Citable link http://nrs.harvard.edu/urn-3:HUL.InstRepos:27769882 Terms of Use This article was downloaded from Harvard University’s DASH repository, and is made available under the terms and conditions applicable to Open Access Policy Articles, as set forth at http:// nrs.harvard.edu/urn-3:HUL.InstRepos:dash.current.terms-of- use#OAP 1 Taking the Human Out of the Loop: A Review of Bayesian Optimization Bobak Shahriari, Kevin Swersky, Ziyu Wang, Ryan P. Adams and Nando de Freitas Abstract—Big data applications are typically associated with and the analytics company that sits between them. The analyt- systems involving large numbers of users, massive complex ics company must develop procedures to automatically design software systems, and large-scale heterogeneous computing and game variants across millions of users; the objective is to storage architectures. The construction of such systems involves many distributed design choices. The end products (e.g., rec- enhance user experience and maximize the content provider’s ommendation systems, medical analysis tools, real-time game revenue. engines, speech recognizers) thus involves many tunable config- The preceding examples highlight the importance of au- uration parameters. These parameters are often specified and tomating design choices. For a nurse scheduling application, hard-coded into the software by various developers or teams. we would like to have a tool that automatically chooses the If optimized jointly, these parameters can result in significant improvements. Bayesian optimization is a powerful tool for 76 CPLEX parameters so as to improve healthcare delivery. the joint optimization of design choices that is gaining great When launching a mobile game, we would like to use the data popularity in recent years. It promises greater automation so as gathered from millions of users in real-time to automatically to increase both product quality and human productivity. This adjust and improve the game. When a data scientist uses a review paper introduces Bayesian optimization, highlights some machine learning library to forecast energy demand, we would of its methodological aspects, and showcases a wide range of applications. like to automate the process of choosing the best forecasting technique and its associated parameters. Any significant advances in automated design can result in I. INTRODUCTION immediate product improvements and innovation in a wide area of domains, including advertising, health-care informat- Design problems are pervasive in scientific and industrial ics, banking, information mining, life sciences, control engi- endeavours: scientists design experiments to gain insights into neering, computing systems, manufacturing, e-commerce, and physical and social phenomena, engineers design machines entertainment. to execute tasks more efficiently, pharmaceutical researchers Bayesian optimization has emerged as a powerful solution design new drugs to fight disease, companies design websites for these varied design problems. In academia, it is impacting to enhance user experience and increase advertising revenue, a wide range of areas, including interactive user-interfaces geologists design exploration strategies to harness natural re- [26], robotics [101], [110], environmental monitoring [106], sources, environmentalists design sensor networks to monitor information extraction [158], combinatorial optimisation [79], ecological systems, and developers design software to drive [159], automatic machine learning [16], [143], [148], [151], computers and electronic devices. All these design problems [72], sensor networks [55], [146], adaptive Monte Carlo [105], are fraught with choices, choices that are often complex and experimental design [11] and reinforcement learning [27]. high-dimensional, with interactions that make them difficult When software engineers develop programs, they are often for individuals to reason about. faced with myriad choices. By making these choices explicit, For example, many organizations routinely use the popular Bayesian optimization can be used to construct optimal pro- mixed integer programming solver IBM ILOG CPLEX1 for grams [74]: that is to say, programs that run faster or compute scheduling and planning. This solver has 76 free parameters, better solutions. Furthermore, since different components of which the designers must tune manually – an overwhelming software are typically integrated to build larger systems, this number to deal with by hand. This search space is too vast framework offers the opportunity to automate integrated prod- for anyone to effectively navigate. ucts consisting of many parametrized software modules. More generally, consider teams in large companies that de- Mathematically, we are considering the problem of finding velop software libraries for other teams to use. These libraries a global maximizer (or minimizer) of an unknown objective have hundreds or thousands of free choices and parameters function f: that interact in complex ways. In fact, the level of complexity is often so high that it becomes impossible to find domain x? = arg max f(x) ; (1) experts capable of tuning these libraries to generate a new x2X product. where X is some design space of interest; in global op- As a second example, consider massive online games in- timization, X is often a compact subset of Rd but the volving the following three parties: content providers, users, Bayesian optimization framework can be applied to more unusual search spaces that involve categorical or conditional 1http://www.ibm.com/software/commerce/optimization/cplex-optimizer/ inputs, or even combinatorial search spaces with multiple 2 categorical inputs. Furthermore, we will assume the black- Algorithm 1 Bayesian optimization box function f has no simple closed form, but can be 1: for n = 1; 2;::: do evaluated at any arbitrary query point x in the domain. This 2: select new xn+1 by optimizing acquisition function α evaluation produces noise-corrupted (stochastic) outputs y 2 R xn+1 = arg max α(x; Dn) such that E[y j f(x)] = f(x). In other words, we can only x observe the function f through unbiased noisy point-wise 3: query objective function to obtain yn+1 observations y. Although this is the minimum requirement 4: augment data Dn+1 = fDn; (xn+1; yn+1)g for Bayesian optimization, when gradients are available, they 5: update statistical model can be incorporated in the algorithm as well; see for example 6: end for Sections 4.2.1 and 5.2.4 of [99]. In this setting, we consider a sequential search algorithm which, at iteration n, selects a location xn+1 at which to query f and observe yn+1. After N budget, is typically computationally intractable. This has led queries, the algorithm makes a final recommendation ¯xN , to the introduction of many myopic heuristics known as which represents the algorithm’s best estimate of the optimizer. acquisition functions, e.g., Thompson sampling, probability In the context of big data applications for instance, the func- of improvement, expected improvement, upper-confidence- tion f can be an object recognition system (e.g., deep neural bounds, and entropy search. These acquisition functions trade network) with tunable parameters x (e.g., architectural choices, off exploration and exploitation; their optima are located where learning rates, etc) with a stochastic observable classification the uncertainty in the surrogate model is large (exploration) accuracy y = f(x) on a particular dataset such as ImageNet. and/or where the model prediction is high (exploitation). Because the Bayesian optimization framework is very data Bayesian optimization algorithms then select the next query efficient, it is particularly useful in situations like these where point by maximizing such acquisition functions. Naturally, evaluations of f are costly, where one does not have access these acquisition functions are often even more multimodal to derivatives with respect to x, and where f is non-convex and difficult to optimize, in terms of querying efficiency, than and multimodal. In these situations, Bayesian optimization is the original black-box function f. Therefore it is critical that able to take advantage of the full information provided by the the acquisition functions be cheap to evaluate or approximate: history of the optimization to make this search efficient. cheap in relation to the expense of evaluating the black-box f. Fundamentally, Bayesian optimization is a sequential Since acquisition functions have analytical forms that are easy model-based approach to solving problem (1). In particular, we to evaluate or at least approximate, it is usually much easier prescribe a prior belief over the possible objective functions to optimize them than the original objective function. and then sequentially refine this model as data are observed via Bayesian posterior updating. The Bayesian posterior represents A. Paper overview our updated beliefs – given data – on the likely objective function we are optimizing. Equipped with this probabilistic model, In this paper, we introduce

A Review of Bayesian Optimization

DSA 5403 Bayesian Statistics: an Advancing Introduction 16 Units – Each Unit a Week’S Work

The Bayesian Approach to Statistics

Posterior Propriety and Admissibility of Hyperpriors in Normal

The Deviance Information Criterion: 12 Years On

Paradoxes and Priors in Bayesian Regression

A Compendium of Conjugate Priors

A Widely Applicable Bayesian Information Criterion

Bayes Theorem and Its Recent Applications

9 Introduction to Hierarchical Models

On the Robustness to Misspecification of Α-Posteriors and Their

Prior Distributions for Variance Parameters in Hierarchical Models

Computational Bayesian Statistics