Bayesian Optimization Meets Bayesian Optimal Stopping

Bayesian Optimization Meets Bayesian Optimal Stopping Zhongxiang Dai 1 Haibin Yu 1 Bryan Kian Hsiang Low 1 Patrick Jaillet 2 Abstract Silver et al., 2016). However, a major difficulty faced by ML Bayesian optimization (BO) is a popular paradigm practitioners is the choice of model hyperparameters which for optimizing the hyperparameters of machine significantly impacts the predictive performance. This calls learning (ML) models due to its sample efficiency. for the need to develop hyperparameter optimization algo- Many ML models require running an iterative rithms that have to be sample-efficient since the training training procedure (e.g., stochastic gradient de- of many modern ML models consumes massive computa- scent). This motivates the question whether in- tional resources. To this end, Bayesian optimization (BO) formation available during the training process is a popular paradigm due to its high sample efficiency and (e.g., validation accuracy after each epoch) can be strong theoretical performance guarantee (Shahriari et al., exploited for improving the epoch efficiency of 2016). In particular, the BO algorithm based on the Gaus- BO algorithms by early-stopping model training sian process-upper confidence bound (GP-UCB) acquisition under hyperparameter settings that will end up function has been shown to achieve no regret asymptotically under-performing and hence eliminating unneces- and perform competitively in practice (Srinivas et al., 2010). sary training epochs. This paper proposes to unify Many ML models require running an iterative training proce- BO (specifically, Gaussian process-upper confi- dure for some number of epochs such as stochastic gradient dence bound (GP-UCB)) with Bayesian optimal descent for neural networks (LeCun et al., 2015) and boost- stopping (BO-BOS) to boost the epoch efficiency ing procedure for gradient boosting machines (Friedman, of BO. To achieve this, while GP-UCB is sample- 2001). During BO, any query of a hyperparameter setting efficient in the number of function evaluations, usually involves training the ML model for a fixed number of BOS complements it with epoch efficiency for epochs. Information typically available during the training each function evaluation by providing a principled process (e.g., validation accuracy after each epoch) is rarely optimal stopping mechanism for early stopping. exploited for improving the epoch efficiency of BO algo- BO-BOS preserves the (asymptotic) no-regret per- rithms, specifically, by early-stopping model training under formance of GP-UCB using our specified choice hyperparameter settings that will end up under-performing, of BOS parameters that is amenable to an ele- hence eliminating unnecessary training epochs. Note that gant interpretation in terms of the exploration- this objective is different from that of standard early stop- exploitation trade-off. We empirically evaluate ping during the training of neural networks, which is used the performance of BO-BOS and demonstrate its to prevent overfitting. generality in hyperparameter optimization of ML models and two other interesting applications. To address this challenging issue, a number of works have been proposed to make BO more epoch-efficient: Freeze- 1. Introduction thaw BO (Swersky et al., 2014) explores a diverse collection of hyperparameter settings in the initial stage by training The state-of-the-art machine learning (ML) models have their ML models with a small number of epochs, and then recently reached an unprecedented level of predictive per- gradually focuses on (exploits) a small number of promising formance in several applications such as image recognition, settings. Despite its promising epoch efficiency, its perfor- complex board games, among others (LeCun et al., 2015; mance is not theoretically guaranteed and its computational 1 1 cost can be excessive. Multi-fidelity BO (Kandasamy et al., Department of Computer Science, National University of 2016; 2017), as well as its extensions using predictive en- Singapore, Republic of Singapore. 2Department of Electrical Engineering and Computer Science, Massachusetts Institute of tropy search (Zhang et al., 2017) and value of information Technology, USA. Correspondence to: Bryan Kian Hsiang Low for fidelity selection (Wu & Frazier, 2017), reduces the <[email protected]>. resource consumption of BO by utilizing low-fidelity func- Proceedings of the 36 th International Conference on Machine 1A closely related counterpart is multi-fidelity active learn- Learning, Long Beach, California, PMLR 97, 2019. Copyright ing (Zhang et al., 2016). 2019 by the author(s). Bayesian Optimization Meets Bayesian Optimal Stopping tions which can be obtained by using a subset of the training objectives of GP-UCB vs. BOS (respectively, objective func- data or by training the ML model for a small number of tion v.s. expected loss), BO-BOS can preserve the trademark epochs. However, in each BO iteration, since the fidelity (asymptotic) no-regret performance of GP-UCB with our (e.g., number of epochs) is determined before function eval- specified choice of BOS parameters that is amenable to an el- uation, it is not influenced by information that is typically egant interpretation in terms of the exploration-exploitation available during the training process (e.g., validation accu- trade-off (Section 4). Though the focus of our work here is racy after each epoch). In addition to BO, attempts have on epoch-efficient BO for hyperparameter tuning, we addi- also been made to improve the epoch efficiency of other tionally evaluate the performance of BO-BOS empirically hyperparameter optimization algorithms: Some heuristic in two other interesting applications to demonstrate its gen- methods (Baker et al., 2017; Domhan et al., 2015; Klein erality: policy search for reinforcement learning, and joint et al., 2017) predict the final training outcome based on hyperparameter tuning and feature selection (Section 5). partially trained learning curves in order to identify hyperparameter settings that are predicted to under-perform and 2. Background and Notations early-stop their model training. Hyperband (Li et al., 2017), which dynamically allocates the computational resource 2.1. Bayesian Optimization (BO) and GP-UCB (e.g., training epochs) through random sampling and elimi- Consider the problem of sequentially maximizing an un- nates under-performing hyperparameter settings by succes- known objective function f : R representing the val- D! sive halving, has been proposed and shown to perform well idation accuracy over a compact input domain Rd of D✓ in practice. Both the learning curve prediction methods and different hyperparameter settings for training an ML model: Hyperband can be combined with BO to further improve In each iteration t =1,...,T, an input query zt , [xt,nt] the epoch efficiency (Domhan et al., 2015; Falkner et al., of a hyperparameter setting (comprising N <n N 0 t 2018; Klein et al., 2017), but their resulting performances training epochs and a vector xt of the other hyperparam- are not theoretically guaranteed. Despite these recent ad- eters) is selected for evaluating the validation accuracy f vances, we still lack an epoch-efficient algorithm that can of the ML model to yield a noisy observed output (valida- incorporate early stopping into BO (i.e., by exploiting infor- tion accuracy) yt , f(zt)+✏ with i.i.d. Gaussian noise mation available during the training process) and yet offer a ✏ (0,σ2) and noise variance σ2. Since every eval- ⇠N theoretical performance guarantee, the design of which is uation of f is costly (Section 1), our goal is to strategi- likely to require a principled decision-making mechanism cally select input queries to approach the global maximizer for determining the optimal stopping time. z⇤ , arg maxz f(z) as rapidly as possible. This can be 2D achieved by minimizing a standard BO objective such as Optimal stopping is a classic research topic in statistics and the simple regret ST , f(z⇤) maxt 1,...,T f(zt).A operations research regarding sequential decision-making − 2{ } BO algorithm is said to guarantee no regret asymptotically problems whose objective is to make the optimal stopping if it satisfies limT ST =0, thus implying that it will decision with a small number of observations (Ferguson, !1 eventually converge to the global maximum. 2006). In Bayesian optimal stopping (BOS) or Bayesian sequential design, the decision between stopping vs. continu- To guarantee no regret, the belief of f is modeled by a ing is made to maximize the expected utility or, equivalently, Gaussian process (GP). Let f(z) z denote a GP, that { } 2D minimize the expected loss (Powell & Ryzhov, 2012). BOS is, every finite subset of f(z) z follows a multivariate { } 2D has found success in application domains such as finance Gaussian distribution (Rasmussen & Williams, 2006). Then, (Longstaff & Schwartz, 2001), clinical design (Brockwell & a GP is fully specified by its prior mean µ(z) and covariance Kadane, 2003; Muller¨ et al., 2007; Wathen & Thall, 2008), k(z, z0) for all z, z0 , which are, respectively, assumed 2D and economics (Davis & Cairns, 2012). The capability of w.l.o.g. to be µ(z)=0and k(z, z0) 1 for notational BOS in providing a principled optimal stopping mechanism simplicity. Given a column vector yT , [yt]t>=1,...,T of makes it a prime candidate for introducing early stopping noisy outputs observed from evaluating f at the selected into BO in a theoretically sound and rigorous way. input queries z ,...,z after T iterations, the GP 1 T 2D posterior belief of f at some input z is a Gaussian with This paper proposes to unify Bayesian optimization (specifi- 2D the following posterior mean µ (z) and variance σ2 (z): cally, GP-UCB) with Bayesian optimal stopping (BO-BOS) T T to boost the epoch efficiency of BO (Section 3). Intu- 2 1 µT (z) , kT (z)>(KT + σ I)− yT , itively, GP-UCB is acclaimed for being sample-efficient 2 2 1 (1) σ (z) k(z, z) k (z)>(K + σ I)− k (z) in the number of function evaluations while BOS can reduce T , − T T T the required number of epochs for each function evalua- where KT , [k(zt, zt0 )]t,t0=1,...,T and kT (z) , tion.

Bayesian Optimization Meets Bayesian Optimal Stopping

Deep Optimal Stopping

Optimal Stopping Time and Its Applications to Economic Models

The Duration of Optimal Stopping Problems

Optimal Stopping Problems: Autonomous Trading Over an Inﬁnite Time Horizon

Optimal Stopping Problems

Time Pressure and Regret in Sequential Search

Time Costs and Search Behavior

J. Appl. Prob. 38(3) 647-658 Pricing Exotic Options

New Developments of the Odds Theorem

Pricing the American Option Using Itô's Formula and Optimal Stopping

Optimal Stopping Under Drift Uncertainty

Risk Averse Stochastic Programming: Time Consistency and Optimal Stopping