Bayesian Optimization Meets Bayesian Optimal Stopping
Zhongxiang Dai 1 Haibin Yu 1 Bryan Kian Hsiang Low 1 Patrick Jaillet 2
Abstract

Bayesian optimization (BO) is a popular paradigm for optimizing the hyperparameters of machine learning (ML) models due to its sample efficiency. Many ML models require running an iterative training procedure (e.g., stochastic gradient descent). This motivates the question of whether information available during the training process (e.g., validation accuracy after each epoch) can be exploited for improving the epoch efficiency of BO algorithms by early-stopping model training under hyperparameter settings that will end up under-performing, hence eliminating unnecessary training epochs. This paper proposes to unify BO (specifically, Gaussian process-upper confidence bound (GP-UCB)) with Bayesian optimal stopping (BO-BOS) to boost the epoch efficiency of BO. To achieve this, while GP-UCB is sample-efficient in the number of function evaluations, BOS complements it with epoch efficiency for each function evaluation by providing a principled optimal stopping mechanism for early stopping. BO-BOS preserves the (asymptotic) no-regret performance of GP-UCB using our specified choice of BOS parameters that is amenable to an elegant interpretation in terms of the exploration-exploitation trade-off. We empirically evaluate the performance of BO-BOS and demonstrate its generality in hyperparameter optimization of ML models and two other interesting applications.

1. Introduction

The state-of-the-art machine learning (ML) models have recently reached an unprecedented level of predictive performance in several applications such as image recognition and complex board games, among others (LeCun et al., 2015; Silver et al., 2016). However, a major difficulty faced by ML practitioners is the choice of model hyperparameters, which significantly impacts the predictive performance. This calls for hyperparameter optimization algorithms that are sample-efficient, since the training of many modern ML models consumes massive computational resources. To this end, Bayesian optimization (BO) is a popular paradigm due to its high sample efficiency and strong theoretical performance guarantee (Shahriari et al., 2016). In particular, the BO algorithm based on the Gaussian process-upper confidence bound (GP-UCB) acquisition function has been shown to achieve no regret asymptotically and to perform competitively in practice (Srinivas et al., 2010).

Many ML models require running an iterative training procedure for some number of epochs, such as stochastic gradient descent for neural networks (LeCun et al., 2015) and the boosting procedure for gradient boosting machines (Friedman, 2001). During BO, any query of a hyperparameter setting usually involves training the ML model for a fixed number of epochs. Information typically available during the training process (e.g., validation accuracy after each epoch) is rarely exploited for improving the epoch efficiency of BO algorithms, specifically, by early-stopping model training under hyperparameter settings that will end up under-performing, hence eliminating unnecessary training epochs. Note that this objective is different from that of standard early stopping during the training of neural networks, which is used to prevent overfitting.

To address this challenging issue, a number of works have been proposed to make BO more epoch-efficient: Freeze-thaw BO (Swersky et al., 2014) explores a diverse collection of hyperparameter settings in the initial stage by training their ML models with a small number of epochs, and then gradually focuses on (exploits) a small number of promising settings. Despite its promising epoch efficiency, its performance is not theoretically guaranteed and its computational cost can be excessive.
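As a toy illustration of the kind of early-stopping decision discussed above, the sketch below stops a training run whose per-epoch validation accuracy trails a reference curve (e.g., the best run seen so far). This is only a heuristic with assumed interfaces (`train_one_epoch`, `reference_curve`, `slack`), not the BOS mechanism this paper develops:

```python
# Heuristic sketch: early-stop a run predicted to under-perform,
# saving the remaining training epochs.

def train_with_early_stopping(train_one_epoch, max_epochs, reference_curve, slack=0.05):
    """train_one_epoch(epoch) -> validation accuracy after that epoch.

    Stops as soon as the run trails `reference_curve` by more than
    `slack`; returns the observed partial learning curve.
    """
    history = []
    for epoch in range(max_epochs):
        acc = train_one_epoch(epoch)
        history.append(acc)
        if acc < reference_curve[epoch] - slack:
            break  # predicted to under-perform: skip remaining epochs
    return history
```

A run matching the reference curve consumes its full epoch budget, while a clearly inferior run is cut off after its first epoch.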
1 Department of Computer Science, National University of Singapore, Republic of Singapore. 2 Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, USA. Correspondence to: Bryan Kian Hsiang Low.

Multi-fidelity BO (Kandasamy et al., 2016; 2017), as well as its extensions using predictive entropy search (Zhang et al., 2017) and value of information for fidelity selection (Wu & Frazier, 2017), reduces the
resource consumption of BO by utilizing low-fidelity function evaluations, which can be obtained by using a subset of the training data or by training the ML model for a small number of epochs. However, in each BO iteration, since the fidelity (e.g., the number of epochs) is determined before the function evaluation, it is not influenced by information that is typically available during the training process (e.g., validation accuracy after each epoch). (A closely related counterpart of multi-fidelity BO is multi-fidelity active learning (Zhang et al., 2016).)

In addition to BO, attempts have also been made to improve the epoch efficiency of other hyperparameter optimization algorithms: some heuristic methods (Baker et al., 2017; Domhan et al., 2015; Klein et al., 2017) predict the final training outcome based on partially trained learning curves in order to identify hyperparameter settings that are predicted to under-perform and early-stop their model training. Hyperband (Li et al., 2017), which dynamically allocates the computational resource (e.g., training epochs) through random sampling and eliminates under-performing hyperparameter settings by successive halving, has been proposed and shown to perform well in practice. Both the learning curve prediction methods and Hyperband can be combined with BO to further improve the epoch efficiency (Domhan et al., 2015; Falkner et al.,

Despite the different objectives of GP-UCB vs. BOS (respectively, objective function vs. expected loss), BO-BOS can preserve the trademark (asymptotic) no-regret performance of GP-UCB with our specified choice of BOS parameters that is amenable to an elegant interpretation in terms of the exploration-exploitation trade-off (Section 4). Though the focus of our work here is on epoch-efficient BO for hyperparameter tuning, we additionally evaluate the performance of BO-BOS empirically in two other interesting applications to demonstrate its generality: policy search for reinforcement learning, and joint hyperparameter tuning and feature selection (Section 5).

Proceedings of the 36th International Conference on Machine Learning, Long Beach, California, PMLR 97, 2019. Copyright 2019 by the author(s).
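The successive halving scheme underlying Hyperband, described above, can be sketched as follows. This is a minimal sketch: the function names, the budget-doubling schedule, and the use of validation accuracy as the score are illustrative assumptions, not the exact procedure of Li et al. (2017):

```python
# Sketch of successive halving: train all surviving configurations for a
# growing epoch budget, keep the top 1/eta fraction, repeat until one remains.

def successive_halving(configs, evaluate, min_epochs=1, eta=2):
    """evaluate(config, epochs) -> validation accuracy after `epochs` epochs."""
    epochs = min_epochs
    while len(configs) > 1:
        scores = {c: evaluate(c, epochs) for c in configs}
        survivors = max(1, len(configs) // eta)
        # keep the best-scoring 1/eta fraction of configurations
        configs = sorted(configs, key=scores.get, reverse=True)[:survivors]
        epochs *= eta  # give survivors a larger budget next round
    return configs[0]
```

Note how the total epoch budget concentrates on promising settings: poor configurations receive only `min_epochs` epochs before being eliminated.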
2. Background and Notations

2.1. Bayesian Optimization (BO) and GP-UCB

Consider the problem of sequentially maximizing an unknown objective function f : D → R representing the validation accuracy over a compact input domain D ⊆ R^d of different hyperparameter settings for training an ML model: In each iteration t = 1, ..., T, an input query z_t = [x_t, n_t] of a hyperparameter setting (comprising N
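A single GP-UCB iteration can be sketched as below: compute the GP posterior over a discretized stand-in for the domain D and query the input maximizing the upper confidence bound mu(x) + sqrt(beta) * sigma(x). The RBF kernel, its lengthscale, the value of beta, and the finite candidate grid are all assumptions for illustration, not choices made by the paper:

```python
# Minimal GP-UCB sketch over a 1-D discretized domain with an RBF kernel
# and (near-)noiseless observations.
import numpy as np

def rbf(a, b, lengthscale=0.2):
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / lengthscale) ** 2)

def gp_ucb_step(X_obs, y_obs, candidates, beta=4.0, noise=1e-6):
    """Return the candidate maximizing mu(x) + sqrt(beta) * sigma(x)."""
    K = rbf(X_obs, X_obs) + noise * np.eye(len(X_obs))
    K_star = rbf(candidates, X_obs)
    alpha = np.linalg.solve(K, y_obs)
    mu = K_star @ alpha                                   # posterior mean
    var = 1.0 - np.sum(K_star * np.linalg.solve(K, K_star.T).T, axis=1)
    ucb = mu + np.sqrt(beta) * np.sqrt(np.maximum(var, 0.0))
    return candidates[int(np.argmax(ucb))]
```

Larger beta weights the posterior standard deviation more heavily (exploration), while smaller beta favors the posterior mean (exploitation); the no-regret guarantee of Srinivas et al. (2010) rests on a particular schedule for beta over iterations.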