Amazon SageMaker Automatic Model Tuning: Scalable Gradient-Free Optimization

Valerio Perrone1, Huibin Shen, Aida Zolic, Iaroslav Shcherbatyi, Amr Ahmed, Tanya Bansal, Michele Donini, Fela Winkelmolen∗, Rodolphe Jenatton∗, Jean Baptiste Faddoul, Barbara Pogorzelska, Miroslav Miladinovic, Krishnaram Kenthapadi, Matthias Seeger, Cédric Archambeau

ABSTRACT
Tuning complex systems is challenging. Machine learning typically requires setting hyperparameters, be they regularization, architecture, or optimization parameters, whose tuning is critical to achieve good predictive performance. To democratize access to machine learning systems, it is essential to automate the tuning. This paper presents Amazon SageMaker Automatic Model Tuning (AMT), a fully managed system for gradient-free optimization at scale. AMT finds the best version of a trained machine learning model by repeatedly evaluating it with different hyperparameter configurations. It leverages either random search or Bayesian optimization to choose the hyperparameter values resulting in the best model, as measured by the metric chosen by the user. AMT can be used with built-in algorithms, custom algorithms, and Amazon SageMaker pre-built containers for machine learning frameworks. We discuss the core functionality, system architecture, our design principles, and lessons learned. We also describe more advanced features of AMT, such as automated early stopping and warm-starting, showing in experiments their benefits to users.

CCS CONCEPTS
• Computing methodologies → Machine learning; • Computer systems organization → Dependable and fault-tolerant systems and networks.

KEYWORDS
AutoML, hyperparameter tuning, scalable systems

1Correspondence to: Valerio Perrone, .
∗Work done while at Amazon Web Services.

1 INTRODUCTION
In modern machine learning, complex statistical models with many free parameters are fit to data by way of automated and highly scalable algorithms. For example, weights of a deep neural network are learned by stochastic gradient descent (SGD), minimizing a loss function over the training data. Unfortunately, some remaining hyperparameters (HPs) cannot be adjusted this way, and their values can significantly affect the prediction quality of the final model. In a neural network, we need to choose the learning rate of the stochastic optimizer, regularization constants, the type of activation functions, and architecture parameters such as the number or the width of the different layers. In Bayesian models, priors need to be specified, while for random forests or gradient boosted decision trees, the number and maximum depth of trees are important HPs.

The problem of hyperparameter tuning can be formulated as the minimization of an objective function f : X → R, where X denotes the space of valid HP configurations and the value f(x) corresponds to the metric we wish to optimize. We assume that x = [x_1, ..., x_d], where x_j ∈ X_j is one of the d hyperparameters, such that X = X_1 × ... × X_d. For example, given some x ∈ X, f(x) may correspond to the held-out error rate of a machine learning model when trained and evaluated using the HPs x. In practice, hyperparameter optimization requires addressing a number of challenges. First, we do not know the analytical form of the function f (i.e., it can only be observed through evaluations), so it is difficult to optimize as we cannot compute its gradients. Second, evaluations of f(x) are often expensive in terms of time and compute (e.g., training a deep neural network on a large dataset), so it is important to identify a good hyperparameter configuration x* with the least number of queries of f. Third, for complex models, the HP configuration space X can have diverse types of attributes, some of which may be integer or categorical. For numerical attributes, search ranges need to be determined. Some attributes in X can even be conditional (e.g., the width of the l-th layer of a neural network is only relevant if the model has at least l layers). Finally, even if f(x) varies smoothly in x, evaluations of f(x) are typically noisy.

We present Amazon SageMaker Automatic Model Tuning (AMT), a fully managed system for gradient-free function optimization at scale.1 The key contributions of our work are as follows:
• Design, architecture and implementation of hyperparameter optimization as a distributed, fault-tolerant, scalable, secure and fully managed service, integrated with Amazon SageMaker (§3).
• Description of the Bayesian optimization algorithm powering AMT, including efficient hyperparameter representation, surrogate Gaussian process model, acquisition functions, and parallel and asynchronous evaluations (§4).
• Overview of advanced features such as log scaling, automated early stopping and warm start (§5).
• Discussion of deployment results as well as challenges encountered and lessons learned (§6).

1Amazon SageMaker is a service that allows easy training and hosting of machine learning models. For details, see https://aws.amazon.com/sagemaker and [32].
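As a concrete illustration of the black-box objective f introduced above, the sketch below (illustrative only, using scikit-learn rather than anything AMT-specific) treats one hyperparameter evaluation as training a model with a candidate configuration x and returning its held-out error:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X_train, X_val, y_train, y_val = train_test_split(
    *load_breast_cancer(return_X_y=True), random_state=0
)

def f(x):
    """Black-box objective: held-out error rate of a model trained with HPs x.
    Each call is expensive and no gradients are available."""
    model = GradientBoostingClassifier(
        learning_rate=x["learning_rate"], max_depth=x["max_depth"], random_state=0
    )
    model.fit(X_train, y_train)
    return 1.0 - model.score(X_val, y_val)

print(f({"learning_rate": 0.1, "max_depth": 3}))
```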
2 PRELIMINARIES
Traditionally, HPs are either hand-tuned by experts in what amounts to a laborious process, or they are selected using brute-force schemes such as grid search or random search. In response to increasing model complexity, a range of more sample-efficient HP optimization techniques have emerged. A comprehensive review of modern hyperparameter optimization is provided in [10]. Here, we focus on work relevant in the context of AMT.

2.1 Model-Free Hyperparameter Optimization
Any HPO method proposes evaluation points x_1, x_2, ..., x_T, such that

    min_{t=1,...,T} f(x_t) ≈ min_{x ∈ X} f(x).

For the simplest methods, the choice of x_t does not depend on earlier observations. In grid search, we fix K values for every HP x_j, then evaluate f on the Cartesian product, so that T = K^d. In random search, we draw x_t independently and uniformly at random. More specifically, each x_{t,j} is drawn uniformly from X_j. For numerical HPs, the distribution may also be uniform in a transformed domain (e.g., log domain). Random search is more effective than grid search when some of the HPs x_j are irrelevant, as it considers a larger number of values of the relevant ones [4]. Both methods are easily parallelizable. Random search should always be considered as a baseline, and is frequently used to initialize more sophisticated HPO methods. Another alternative is picking a set of pseudo-random points known as Sobol sequences [53]. The advantage is that they provide a better coverage of the search space, but they are deterministic.

These simple baselines can be improved upon by making use of earlier observations {(x_{t_1}, f(x_{t_1})) | t_1 < t} in order to plan subsequent evaluations x_{t_2}, t_2 ≥ t. Population-based methods, such as evolutionary or genetic algorithms, are attractive if parallel computing resources are available [16, 48]. In each generation, all configurations from a fixed-size population are evaluated. Then, a new population is created by randomly mutating a certain percentile of the top performing configurations, as well as sampling from a background distribution, and the process repeats with the next generation. Evolutionary algorithms (EAs) are powerful and flexible search strategies, which can work even in search spaces of complex structure. However, EAs can be difficult to configure to the problem at hand, and since fine-grained knowledge about f can be encoded only in a large population, they tend to require a substantial amount of parallel compute resources. In some EAs, low-fidelity approximations of f are employed during early generations [22] in order to reduce computational cost. Multi-fidelity strategies are discussed in more generality below.

2.2 Surrogate Models. Bayesian Optimization
A key idea to improve data efficiency of sampling is to maintain a surrogate model of the function f. At decision step t, this model is fit to previous data D_{<t} = {(x_{t_1}, f(x_{t_1})) | t_1 < t}. In Bayesian optimization (BO), we employ probabilistic models, predicting not only best estimates (posterior means) but also uncertainties (posterior variances) for each x ∈ X: the value of the next x_t could come from exploration (sampling where f is most uncertain) or from exploitation (minimizing our best estimate), and a calibrated probabilistic model can be used to resolve the trade-off between these desiderata optimally [56]. More concretely, we choose x_t as the best point according to an acquisition function A(x | D_{<t}), which is a utility function averaged over the posterior predictive distribution p(f(x) | D_{<t}). The most common surrogate model for BO is based on Gaussian processes (GPs) [47], which not only have simple closed-form expressions for posterior and predictive distributions, but also come with strong theoretical guarantees. Other BO surrogate models include random forests [20] and Bayesian neural networks [54], which can suffer from uncalibrated uncertainty estimates. We give a detailed account of sequential BO with a GP surrogate model in Section 4. Tutorials on BO are provided in [5, 50].

2.3 Early Stopping. Multi-Fidelity Optimization
Modern architectures can come with hundreds of millions of parameters, and even a single training run can take many hours or even days. In such settings, it is common practice to cheaply probe configurations x by training for few epochs or on a subset of the data. While this gives rise to a low-fidelity approximation of f(x), such data can be sufficient to filter out poor configurations early on, so that full training is run only for the most promising ones. To this end, we consider functions f(x, r), r being a resource attribute, where f(x, r_max) = f(x) is the expensive metric of interest, while f(x, r), r < r_max, are cheaper-to-evaluate approximations. Here, r could be the number of training epochs for a deep neural network (DNN), the number of trees for gradient boosting, or the dataset subsampling ratio. When training DNNs, we can evaluate f(x, r), r = 1, 2, ..., by computing the validation metric after every epoch. For gradient boosting implementations supporting incremental training, we can obtain intermediate values f(x, r) as well. In early stopping HPO, the evaluation of x is terminated at a level r if the probability of it scoring worse at r_max than some earlier x′ is predicted high enough. The median rule is a simple instance of this idea [14], while other techniques aim to extrapolate learning curves beyond the current r [8, 26]. Early stopping is particularly well suited to asynchronous parallel execution: whenever a job is stopped, an evaluation can be started for the next configuration x proposed by HPO. An alternative to stopping configurations is to pause and (potentially) resume them later [58].

Besides early stopping HPO, successive halving (SH) [23, 25] and Hyperband [30] are two other basic instances of multi-fidelity techniques. In each round, f(x, r_min) is evaluated for a number n of configurations x sampled at random. Next, f(x, 2 r_min) is run for the top n/2 of configurations, while the bottom half are discarded. This filtering step is repeated until r_max is reached. One drawback of SH and Hyperband is their synchronous nature, which is remedied by ASHA [31]. All these methods make use of SH scheduling of evaluations, yet new configurations are chosen at random. BOHB [9] combines synchronous Hyperband with model-based HPO, and a more recent combination of ASHA with Bayesian optimization is MOBSTER [27]. This method tends to be far more efficient than synchronous counterparts, and can often save up to half of the resources compared to ASHA.
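To make the successive halving scheduling concrete, the following self-contained sketch (plain Python, with a hypothetical evaluate(config, resource) standing in for f(x, r)) keeps the best half of the configurations at each rung and doubles the resource until r_max is reached:

```python
import random

def evaluate(config, resource):
    # Hypothetical stand-in for f(x, r): train `config` with `resource`
    # epochs/trees/data and return a validation loss (lower is better).
    return (config["lr"] - 0.1) ** 2 + 1.0 / resource + random.gauss(0, 0.01)

def successive_halving(configs, r_min=1, r_max=16):
    """Evaluate all configs at r_min, keep the best half, double the
    resource, and repeat until r_max is reached."""
    resource = r_min
    while len(configs) > 1 and resource <= r_max:
        scores = [(evaluate(c, resource), c) for c in configs]
        scores.sort(key=lambda pair: pair[0])           # lower loss is better
        configs = [c for _, c in scores[: max(1, len(scores) // 2)]]
        resource *= 2
    return configs[0]

candidates = [{"lr": random.uniform(0.001, 1.0)} for _ in range(16)]
print(successive_halving(candidates))
```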

3 THE SYSTEM
In this section we give an overview of the system underpinning SageMaker AMT. We lay out design principles, describe the system architecture, and highlight a number of challenges related to running HPO in the cloud.

Figure 1: System architecture of SageMaker AMT. The diagram shows the different components of AMT and how they interact. A number of AWS building blocks are combined to deliver a scalable, robust, secure and fully managed service.

3.1 Design Principles
We present the key requirements underlying the design of SageMaker AMT.
• Easy to use and fully managed: To ensure broad adoption by data scientists and ML developers with varying levels of ML knowledge, AMT should be easy to use, with minimal effort needed for a new user to set up and execute. Further, we would like AMT to be offered as a fully managed service, with a stable API and default configuration settings, so that the implementation complexity is abstracted away from the customer. AMT spares users the pain of provisioning hardware, installing the right software, and downloading the data. It takes care of uploading the models to the users' accounts and providing them with training performance metrics.
• Tightly integrated with other SageMaker components: Considering that model tuning (HPO) is typically performed as part of an ML pipeline involving several other components, AMT should seamlessly operate with other SageMaker components and APIs.
• Scalable: AMT should scale with respect to different data sizes, ML training algorithms, number of HPs, metrics, HPO methods and hardware configurations. Scalability includes a failure-resistant workflow with built-in retry mechanisms to guarantee robustness.
• Cost-effective: AMT should be cost-effective to the customer, in terms of both compute and human costs. We would like to enable the customer to specify a budget and support cost reduction techniques such as early stopping and warm start.
• Available: A pillar of AMT's design is to ensure that it is highly available. This includes success of the synchronous APIs as well as minimizing the failures of the asynchronous workflows underlying the hyperparameter tuning jobs.
• Secure: Security is a top priority for AMT. In the context of HPO, this means not putting customers' data at risk. As we will elaborate below, our system does not store any customer data. In addition, we provide extra security features such as the ability to run workloads in a network-isolated mode (i.e., containers cannot make network calls) and/or in a VPC, addressing security requirements for AMT's users.

3.2 System Architecture
SageMaker AMT provides a distributed, fault-tolerant, scalable, secure and fully-managed service for HPO. One of the key building blocks of the service is the SageMaker Training platform, which executes the training jobs and obtains the value of the objective metric for any candidate hyperparameter chosen by the Hyperparameter Selection Service. In this way, each candidate set of hyperparameters tried is associated to a corresponding SageMaker Training Job in the user's AWS account. Using the SageMaker Training platform allows AMT to scale well. Several training jobs can be run in parallel, use distributed clusters of EC2 instances, and scale HPO to large volumes of data. This includes a failure-resistant workflow with built-in retry mechanisms to guarantee robustness.

SageMaker AMT's backend is built using a fully serverless architecture by means of a number of AWS building blocks: it uses AWS API Gateway, AWS Lambda, AWS DynamoDB, AWS Step Functions and AWS CloudWatch Events in its back-end workflow. AWS API Gateway and AWS Lambda are used to power the API layer which customers use to call the various SageMaker AMT APIs, such as Create/List/Describe/StopHyperparameterTuningJobs. AWS DynamoDB is used as the persistent store to keep all the metadata associated with the job and also track the current state of the job. The overall system architecture is depicted in Figure 1. SageMaker AMT deals only with the metadata for the jobs, and all customer data is handled by the SageMaker Training platform, ensuring that no customer data is stored in the DynamoDB table. AWS CloudWatch Events, AWS Step Functions and AWS Lambda are used in the AMT workflow engine, which is responsible for kicking off the evaluation of hyperparameter configurations from the Hyperparameter Selection Service, starting training jobs, tracking their progress and repeating the process until the stopping criterion is met.

3.3 Challenges for HPO in the cloud
While running at scale poses a number of challenges, AMT is a highly distributed and fault-tolerant system. System resiliency was one of the guiding principles when building AMT. Example failure scenarios are when the BO engine suggests hyperparameters that can run out of memory or when individual training jobs fail due to dependency issues. The AMT workflow engine is designed to be resilient against failures, and has a built-in retry mechanism to guarantee robustness.

AMT runs every evaluation as a separate training job on the SageMaker Training platform. Each training job provides customers with a usable model, logs and metrics persisted in CloudWatch. A training job involves setting up a new cluster of EC2 instances, waiting for the setup to complete, and downloading algorithm images. This introduced an overhead that was pronounced for smaller datasets. To address this, AMT puts in place compute provisioning optimizations to reduce the time of setting up clusters and getting them ready to run the training.
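From the user's side, a tuning job of this kind is typically launched through the SageMaker Python SDK. The sketch below is illustrative only: the estimator, metric name, hyperparameter ranges and S3 paths are placeholders, and the exact SDK surface may differ across versions.

```python
from sagemaker.tuner import HyperparameterTuner, ContinuousParameter, IntegerParameter

# `estimator` is assumed to be a previously configured sagemaker.estimator.Estimator
# (e.g., a built-in algorithm container) that emits a "validation:auc" metric.
tuner = HyperparameterTuner(
    estimator=estimator,
    objective_metric_name="validation:auc",
    objective_type="Maximize",
    hyperparameter_ranges={
        "alpha": ContinuousParameter(0.01, 10.0),
        "max_depth": IntegerParameter(2, 10),
    },
    strategy="Bayesian",          # or "Random"
    max_jobs=50,                  # total hyperparameter evaluations
    max_parallel_jobs=5,          # evaluations run concurrently
    early_stopping_type="Auto",   # let AMT stop unpromising jobs (Section 5.2)
)

# Each evaluation becomes a separate SageMaker Training Job in the user's account.
tuner.fit({"train": "s3://my-bucket/train", "validation": "s3://my-bucket/validation"})
```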
4 THE ALGORITHM
Next, we describe the main components and modelling choices behind the BO algorithm powering SageMaker AMT. We start with the representation of the hyperparameters, followed by the key decisions we made around the choice of the surrogate model and acquisition function. We also describe the way AMT tackles the common use case of exploiting parallel compute resources.

4.1 Input configuration
To tune the HPs of a machine learning model, AMT needs an input description of the space over which the optimization is performed. We denote by H the set of HPs we are working with, such as H = {learning rate, loss function}. For each h ∈ H, we also define B_h, the domain of values that h can possibly take, leading to the global domain of HPs B_H ≜ B_{h_1} × ... × B_{h_|H|}. Each h ∈ H has data type continuous (real-valued), integer, or categorical, where the numerical types come with lower and upper bounds. Integer HPs are handled by working in the continuous space and rounding to the nearest integer, while categorical HPs are one-hot encoded.

It is common among ML practitioners to exploit some prior knowledge to adequately transform the space of some HP, such as applying a log-transform for a regularisation parameter. This point is important as the final performance hinges on this preprocessing. There has been a large body of work dedicated to automatically finding such input transformations (e.g., see [1] and references therein). In AMT this is tackled through input warping and log scaling, which are described in the following sections.

4.2 Gaussian process modelling
Once the input HPs are encoded, AMT builds a model mapping hyperparameter configurations to their predicted performance. We follow a form of GP-based global optimisation [5, 24, 33, 39], meaning that the function f to minimize is assumed to be drawn from a GP with a given mean and covariance function. The GP is a particularly suitable model for an online learning system such as AMT, where past hyperparameter evaluations are iteratively used to pick the next ones. Since observations y collected from f are normalized to mean zero, we can consider a zero-mean function without loss of generality. The choice of the covariance function K_θ, which depends on some Gaussian process hyperparameters (GPHPs) θ, is discussed below.

More formally, given an encoded configuration x, our probabilistic model reads f(x) ∼ GP(0, K_θ) and y | f(x) ∼ N(f(x), σ_0²), where the observation y is modelled as a Gaussian random variable with mean f(x) and variance σ_0²: a standard GP regression setup [47, Chapter 2]. Many choices for the covariance function (or kernel) K_θ : R^d × R^d → R are possible. The Matérn-5/2 kernel with automatic relevance determination (ARD) parametrisation [47, Chapter 4] is advocated in [51, 52, 57], where it is shown that ARD does not lead to overfitting, provided that the GPHPs are properly handled. We follow this choice, which has become a de-facto standard in most BO packages.

GP hyperparameters. Our probabilistic model comes with some GPHPs θ. We highlight two possible options to treat these parameters, both of which are implemented in AMT. A traditional way of determining the GPHPs consists in finding θ that maximises the log marginal likelihood of our probabilistic model [47, Section 5.4]. While this approach, known as empirical Bayes, is efficient and often leads to good results, [51, and follow-up work] rather advocate the full Bayesian treatment of integrating out θ by Markov chain Monte Carlo (MCMC). In our experiments we found the latter approach to be less likely to overfit in the few-observation regime (i.e., early in the BO procedure), but it is also more costly, since GP computations have to be done for every MCMC sample. In AMT, we implement slice sampling, one of the most widely used techniques for GPHPs [34, 37]. In our implementation we use one chain of 300 samples, with 250 samples as burn-in and thinning every 5 samples, resulting in an effective sample size of 10. We fix upper and lower bounds on the GPHPs for numerical stability, and use a random (normalised) direction, as opposed to a coordinate-wise strategy, to go from our multivariate problem (θ ∈ R^k) to the standard univariate formulation of slice sampling. We observed that slice sampling is a better approach to learn the GPHPs compared to empirical Bayes, especially at the beginning of HPO, where the latter is more prone to overfitting due to the small number of observations.

Input warping. It is common practice among ML practitioners to exploit some prior knowledge to adequately transform the space of some HP, such as applying a log-transform for a regularization parameter. We leverage the ideas developed in [52], where the configuration x is transformed entry-wise by applying, for each dimension j ∈ {1, ..., d}, ω_j(x_j) ≜ BetaCDF(x_j, α_j, β_j), where {α_j, β_j}_{j ∈ {1,...,d}} are GPHPs that govern the shape of the transformation. We refer to ω(x) ∈ R^d as the vector resulting from all entry-wise transformations. An alternative, which is the default choice in AMT, is to consider the CDF of the Kumaraswamy distribution, which is more tractable than the CDF of the Beta distribution. A convenient way of handling these additional GPHPs is to overload our definition of the covariance function so that, for any two x, x′, K_θ(x, x′) ≜ K_θ(ω(x), ω(x′)), where {α_j, β_j}_{j ∈ {1,...,d}} are merged within the global vector θ of GPHPs.
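A minimal numpy sketch of the entry-wise Kumaraswamy warping follows; the alpha and beta values here are arbitrary placeholders for the shape parameters that AMT learns as GPHPs.

```python
import numpy as np

def kumaraswamy_warp(x, alpha, beta):
    """Entry-wise input warping via the Kumaraswamy CDF,
    F(x; a, b) = 1 - (1 - x**a)**b  for x in [0, 1].
    `x` holds encoded HPs scaled to the unit interval; `alpha` and `beta`
    are per-dimension shape parameters (learned as GPHPs in AMT)."""
    x = np.clip(x, 0.0, 1.0)
    return 1.0 - (1.0 - x ** alpha) ** beta

# Example: a concave warp (alpha < 1) stretches small values of the first HP,
# mimicking a log-like transformation learned from data.
x = np.array([[0.001, 0.5], [0.1, 0.9]])
print(kumaraswamy_warp(x, alpha=np.array([0.3, 1.0]), beta=np.array([1.0, 1.0])))
```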
4.3 Acquisition functions
Denote the evaluations done up to now by D = {(x^c, y^c) | c ∈ C}. In GP-based Bayesian optimization, an acquisition function is optimized in order to determine the hyperparameter configuration at which f should be evaluated next. Most common acquisition functions A(x) depend on the GP posterior P(f | D) only via its marginal mean μ(x) and variance σ²(x). There is an extensive literature dedicated to the design of acquisition functions. The expected improvement (EI) was introduced by [36] and is probably the most popular acquisition function (notably popularised by the EGO algorithm [24]). EI is the default choice for toolboxes like SMAC [20], Spearmint [51], and AMT. It is defined as A(x | D, θ) = E[max{0, y* − y(x)}], where the expectation is taken with respect to the posterior distribution y(x) ∼ N(μ(x), σ²(x)). Here, y* = min{y^c | c ∈ C} is the best target value observed so far. In the case of EI, the expectation appearing in A(x) is taken with respect to a Gaussian marginal consistent with the Gaussian process posterior and, as such, has a simple closed-form expression. Another basic acquisition function is provided by Thompson sampling, which is optimized by drawing a realization of the GP posterior and searching for its minimum point x [19, 59]. Exact Thompson sampling requires a sample from the joint posterior process, which is intractable. An approximation to this scheme is used in AMT, whereby marginal variables are sampled from N(μ(x), σ²(x)) at locations x drawn from a dense set. The set is obtained through a Sobol sequence generator [53] populating the search space as densely as possible. Rather than using the sampled variables directly, the resulting pseudo-random grid is used as a set of anchor points to initialize the local optimization of the EI. This scales linearly in the number of locations and works well in practice.

Other interesting families of acquisition functions include those built on upper-confidence-bound ideas [55] or information-theoretic criteria [17, 18, 61, 62]. While more sample-efficient, many of these typically come with higher computational cost than the EI, which adds overhead when the tuned model is fast to train (e.g., on a small dataset). For this reason, we chose the EI as a robust acquisition function striking a balance between performance, simplicity and computational time. Alternative acquisition functions to make the EI cost-aware and steer the hyperparameter search towards cheaper configurations are described in [15, 29]. These variants mitigate the cost associated to expensive hyperparameter configurations, such as very large architectures when tuning neural network models.
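For reference, the closed-form EI for minimization can be written down in a few lines. The sketch below assumes μ(x) and σ(x) come from some fitted GP surrogate (any GP library would do) and is not AMT's internal code; the optional xi margin is an illustration, not a parameter of AMT.

```python
import numpy as np
from scipy.stats import norm

def expected_improvement(mu, sigma, y_best, xi=0.0):
    """Closed-form EI for minimization:
    EI(x) = (y_best - mu - xi) * Phi(z) + sigma * phi(z),
    with z = (y_best - mu - xi) / sigma, and EI = 0 where sigma == 0.
    `mu`, `sigma`: GP posterior mean and standard deviation at candidate points;
    `y_best`: best (lowest) objective value observed so far."""
    mu, sigma = np.asarray(mu, dtype=float), np.asarray(sigma, dtype=float)
    improvement = y_best - mu - xi
    with np.errstate(divide="ignore", invalid="ignore"):
        z = improvement / sigma
        ei = improvement * norm.cdf(z) + sigma * norm.pdf(z)
    return np.where(sigma > 0, ei, 0.0)

# Toy usage: EI is highest where the mean is low and/or the uncertainty is large.
print(expected_improvement(mu=[0.30, 0.25, 0.40], sigma=[0.05, 0.01, 0.20], y_best=0.28))
```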
4.4 Parallelism
While the BO procedure presented so far is purely sequential, many users are interested in exploiting parallel resources to speed up their tuning jobs. To address this, AMT lets them specify whether the target function should be evaluated in parallel for different candidates, and with how many parallel evaluations at a time. There are different ways of taking advantage of a distributed computational environment. If we suppose we have access to L threads/machines, we can simply return the top-L candidates as ranked by the acquisition function A. We proceed to the next step only when the L candidates' parallel evaluations are completed. This strategy works well as long as the candidates are diverse enough and their evaluation time is not too heterogeneous, which is unfortunately rarely the case. For these reasons, AMT optimizes the acquisition function so as to induce diversity and adopts an asynchronous strategy. As soon as one of the L evaluations is done, we update the GP with this new configuration and pick the next candidate to fill in the available computation slot (making sure, of course, not to select one of the L − 1 pending candidates). One disadvantage is that this does not take into account the information coming from the fact that we picked the L − 1 pending candidates. To tackle this, asynchronous processing could be based on fantasizing [27, 51].

5 ADVANCED FEATURES
Beyond standard BO, AMT comes with a number of extra features to speed up tuning jobs, saving computational time and potentially finding better hyperparameter configurations. We start by describing log scaling, followed by early stopping and warm-starting.

5.1 Log Scaling
A common property of learning problems is that a linear change in validation performance requires an exponential increase in the learning capacity of the estimator (as defined by the VC dimension in [60]). To illustrate this, Figure 2 shows the relationship between the capacity parameter of an SVM and validation accuracy. As a result, a wide search range is often chosen for hyperparameters controlling model capacity. For instance, a typical choice for the capacity parameter C of a support vector machine is [10^-9, 10^9]. Note that 99% of the volume of this example search space corresponds to values of the hyperparameter C in [10^7, 10^9]. As a result, smaller values of C might be under-explored by BO if applied directly to such a range, as it attempts to evenly explore the search space. To avoid under-exploring smaller values, a log transformation is applied to such model-capacity-related variables. Such a transformation is generally referred to as "log scaling". Search within transformed search spaces can be done automatically by AMT, provided that the user indicates that such a transformation is appropriate for a given hyperparameter via the API. For all the algorithms provided in SageMaker, such as XGBoost and Linear Learner, recommendations are given for which hyperparameters log scaling is appropriate and accelerates the tuning.

Figure 2: Change of validation set score with different values of the capacity parameter of a Support Vector Machine. Note the logarithmic scale of the capacity parameter C.

Compared to the GP input warping, which automatically detects the right scaling as the tuning job progresses, log scaling is applied from the start and comes without extra parameters to estimate. The internal warping can learn a larger family of transformations compared to the ones supported by hyperparameter scaling, and is thus always active by default. However, log scaling is useful when the appropriate input transformation is known in advance. Unlike input warping, it can be used not only with BO but also with random search.
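The under-exploration effect is easy to reproduce, and on the API side the user-facing switch is a per-hyperparameter scaling choice. The snippet below illustrates both; the scaling_type argument follows the SageMaker Python SDK as I understand it, but treat the exact name and values as an assumption to verify against the SDK version in use.

```python
import numpy as np

# Uniform sampling over [1e-9, 1e9]: almost all mass lands above 1e7, so small
# values of C are rarely proposed unless the range is log-transformed.
rng = np.random.default_rng(0)
uniform_c = rng.uniform(1e-9, 1e9, size=100_000)
print((uniform_c > 1e7).mean())        # ~0.99

log_uniform_c = 10 ** rng.uniform(-9, 9, size=100_000)
print((log_uniform_c > 1e7).mean())    # ~0.11

# Declaring the transformation via the SageMaker Python SDK (assumed API surface):
from sagemaker.tuner import ContinuousParameter
c_range = ContinuousParameter(1e-9, 1e9, scaling_type="Logarithmic")
```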
5.2 Early Stopping
Tuning involves evaluating several hyperparameter configurations, which can be costly. Generally, a new hyperparameter configuration x_t may not improve over the previous ones, i.e., f(x_t) ≥ f(x_{t′}) for one or more t′ < t. With early stopping, AMT uses the information of the previously evaluated configurations to predict whether a specific candidate x_t is promising. If not, it stops the evaluation, thus reducing the overall time. This works whenever we can obtain intermediate values f^1(x_t), f^2(x_t), ..., f^n(x_t), with f^r(x_t) representing the value of the objective for the configuration x_t at training iteration r.

To implement early stopping, AMT employs the simple but effective median rule [14] to determine which HP configurations to stop early. If f^r(x_t) is worse than the median of the previously evaluated configurations at the same iteration r, we stop the training. An alternative is to predict future performance via a model and stop poor configurations. In our experiments, the simple median rule performed at least as well as, and often better than, predictions from linear and random forest models. A concern with the median rule is that lower fidelities are not necessarily representative of the final values: as the training proceeds, a seemingly poor HP configuration can eventually improve enough to become the best one. To improve resilience, we only make stopping decisions after a given number of training iterations. As the total number of iterations can vary for different algorithms and use cases, this threshold is determined dynamically based on the duration of the fully completed hyperparameter evaluations. We also considered the additional safeguard of always completing 10 hyperparameter evaluations before activating the median rule. However, this produced only slight improvements in the final objective value at the price of reduced time savings, and was therefore discarded.
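A minimal sketch of the median rule follows (plain Python, not AMT's internal implementation): a running job is stopped at iteration r if its intermediate metric is worse than the median of the completed learning curves at the same iteration, assuming lower is better.

```python
import statistics

def should_stop(partial_curve, completed_curves, min_iterations=5):
    """Median stopping rule (lower objective is better).

    partial_curve    : intermediate metrics f^1(x_t), ..., f^r(x_t) of the running job.
    completed_curves : list of metric curves from fully evaluated configurations.
    min_iterations   : do not stop before this many iterations (resilience safeguard).
    """
    r = len(partial_curve)
    if r < min_iterations:
        return False
    peers = [curve[r - 1] for curve in completed_curves if len(curve) >= r]
    if not peers:
        return False
    return partial_curve[-1] > statistics.median(peers)

history = [[0.9, 0.6, 0.4, 0.35, 0.33], [0.8, 0.5, 0.45, 0.40, 0.39]]
print(should_stop([0.95, 0.85, 0.80, 0.78, 0.77], history))  # True: clearly lagging
```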
5.3 Warm start
A typical ask from users running several related tuning jobs is to be able to build on previous experiments. For example, one may want to gradually increase the number of iterations, change hyperparameter ranges, change which hyperparameters to tune, or even tune the same model on a new dataset. In all these cases, it is desirable to re-use information from previous tuning jobs rather than starting from scratch. Since related tuning tasks are intuitively expected to be similar, this setting lends itself to some form of transfer learning. With warm start, AMT uses the results of previous tuning jobs to inform which combinations of hyperparameters to search over in the new tuning job. This can make the search for the best combination of hyperparameters more efficient.

Speeding up HP tuning with transfer learning is an active line of research. Most of this activity has focused on transferring HP knowledge across different datasets [3, 11, 14, 43, 44, 46, 49]. However, most of this previous work assumes the availability of some meta-data describing in which sense datasets differ from each other [11]. When designing warm start, we found this assumption prohibitively restrictive in practice. Computing meta-data in real-world predictive systems is challenging due to computational overhead or privacy reasons. We thus opted for a light-weight solution, purely based on past hyperparameter evaluations and requiring no access to meta-data. It should be noted that simple warm-starting typically works well when data is stationary, and can otherwise bias the surrogate model towards unpromising configurations. Future improvements could make the solution more robust to "negative transfer" arising from tasks that are not positively correlated.
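On the API side, a warm-started tuning job references one or more parent jobs. The sketch below uses the SageMaker Python SDK's warm-start types as I understand them (IDENTICAL_DATA_AND_ALGORITHM for re-tuning on the same data, TRANSFER_LEARNING when the dataset changes); the tuner arguments are abbreviated and the parent job name is a placeholder.

```python
from sagemaker.tuner import HyperparameterTuner, WarmStartConfig, WarmStartTypes

# Re-use evaluations from a previous tuning job of the same algorithm and data.
warm_start_config = WarmStartConfig(
    warm_start_type=WarmStartTypes.IDENTICAL_DATA_AND_ALGORITHM,
    parents={"parent-tuning-job-name"},   # placeholder job name
)

tuner = HyperparameterTuner(
    estimator=estimator,                  # assumed to be defined as in Section 3
    objective_metric_name="validation:accuracy",
    hyperparameter_ranges=hyperparameter_ranges,
    max_jobs=30,
    max_parallel_jobs=3,
    warm_start_config=warm_start_config,
)
tuner.fit(inputs)                         # new evaluations start from the parent's knowledge
```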
5.4 AMT for AutoML
AMT also serves as a key component of SageMaker Autopilot [7], the AutoML service of Amazon SageMaker. In Autopilot, users can upload a tabular dataset and the service will explore a complex search space, consisting of feature preprocessing, different ML algorithms and their hyperparameter spaces. For more details on how AMT is used in Autopilot we refer the readers to [7]. The application to Autopilot highlights how finding a good single ML model through hyperparameter optimization is still key for real-world applications, including in the context of AutoML. Many techniques, such as ensembling [6], can boost the accuracy by combining different models. However, when the number of models in an ensemble is large, inference time may become unacceptable, hindering practical usage. The application of AMT to Autopilot demonstrates the relevance of HPO in real-world applications, which require simple, robust and deployable models. Further, HPO is not incompatible with ensembling, and can be applied to build better ensembles.

6 USE CASES AND DEPLOYMENT RESULTS
We now consider a number of case studies highlighting the benefits of AMT, with respect to both its core algorithm and advanced functionalities. We first demonstrate the appeal of the BO strategy implemented in AMT over random search, and then turn to the empirical advantages of the advanced features we described in the previous section.

6.1 BO vs random search
A very common use case is tuning regularization hyperparameters. We demonstrate this by tuning the regularization terms alpha and lambda of the XGBoost algorithm on the direct marketing binary classification dataset from UCI to minimize the AUC. Figure 3 compares the performance of random search and BO, as implemented in AMT. Each experiment was replicated with 50 different random seeds, and we report the average and standard deviation. Specifically, from the left and middle plots in Figure 3, we can see that BO suggested better performing hyperparameters than random search. The right plot of Figure 3 shows that BO consistently outperforms random search across all numbers of hyperparameter evaluations.2 Although BO is more competitive in the sequential setting, AMT also offers random search. It is important to note that users find random search particularly useful in the distributed setting, where a large number of configurations are evaluated in parallel, reducing wall-clock time.

2A notebook to run this example on AWS SageMaker: https://github.com/awslabs/amazon-sagemaker-examples/tree/master/hyperparameter_tuning/xgboost_random_log.

Figure 3: Left: Hyperparameters suggested by random search. The x-axis and y-axis are the hyperparameter values for alpha and lambda, and the color of the points represents the quality of the hyperparameters (yellow: better AUC; violet: worse AUC). Middle: Hyperparameters suggested by AMT through BO. Right: Best model score obtained so far (y-axis, where lower is better) as more hyperparameter evaluations are performed (x-axis).

6.2 Log scaling
Figure 3 depicts a run of AMT with log scaling. While BO focuses on a small region of the search space, it still attempts to explore the rest of the search space. This suggests that well performing training jobs are found with low values of alpha. With log scaling, BO is steered towards these values and focuses on the most relevant region of the space. Models with larger learning capacity require more compute resources to perform the training phase (e.g., larger depth of trees in XGBoost leads to exponentially larger training time). Log scaling accelerates the search of good hyperparameter configurations, and also reduces the exploration of costly configurations, which usually correspond to large hyperparameter values.

As the choice of scaling requires in-depth knowledge of the internals of the learning algorithms, recommendations for common algorithms on SageMaker were made available. A lesson learned from the development of log scaling is to carefully consider all possible edge-case inputs to the service. We uncovered a non-trivial failure of AMT where a user would first run a tuning job with a linear scaling of a hyperparameter (e.g., in [0.0, 1.0]), and use it to warm start a new tuning job with log scaling enabled. While the value 0 could have been explored in the parent job, this becomes invalid in the child job. These non-trivial issues can only be uncovered through careful examination of all possible edge cases associated to a new functionality.

6.3 Early stopping
One of the main concerns with early stopping is that it could negatively impact the final objective values. Furthermore, the effect of early stopping is most noticeable on longer training jobs. We consider Amazon SageMaker's built-in linear learner algorithm on the Gdelt (Global Database of Events, Language and Tone) dataset in both single instance and distributed training mode.3 Specifically, we used the Gdelt dataset from multiple years in distributed mode and from a single year in single instance mode. Figure 4 compares a hyperparameter tuning job with and without early stopping. Each experiment was replicated 10 times, and the median of the best model score so far (in terms of absolute loss, lower is better) is shown on the y-axis, while the x-axis represents time. Each tuning job was launched with a budget of 100 hyperparameter configurations to explore. AMT with early stopping not only explores the same number of HP configurations in less time, but yields hyperparameter configurations with similar performance.

3http://www.gdeltproject.org/.

Figure 4: Absolute loss over time with and without early stopping when tuning linear learner on the Gdelt data, for single instance mode (left) and distributed mode (right). In both cases, early stopping achieves a similar loss in less time.

6.4 Warm start
It is common to update models regularly, typically with a different choice of the hyperparameter space, iteration count or dataset. This requires re-tuning the model hyperparameters. AMT's warm start offers a simple approach to learn from previous tuning tasks. We demonstrate this on the problem of building an image classifier and iteratively tuning it by running multiple hyperparameter tuning jobs. We focus on two simple use cases: running two sequential hyperparameter tuning jobs on the same algorithm and dataset, and launching a new tuning job on the same algorithm on an augmented dataset. We train Amazon SageMaker's built-in image classification algorithm on the Caltech-256 dataset, and tune its hyperparameters to maximize validation accuracy. Figure 5 shows the impact of warm starting from previous tuning jobs. Initially, there is a single tuning job. Once complete, we launch a new tuning job by selecting the previous job as the parent. The plot shows that the new tuning job (red dots) quickly detects good hyperparameter configurations thanks to the knowledge from the parent job (black dots). As the optimization progresses, the validation accuracy reaches 0.47, thus improving over 0.33, the best previous metric found by running the tuning job from scratch.

Warm start can also be applied across different datasets. We apply a range of data augmentations, including crop, color, and random transformations (i.e., image rotation, shear, and aspect ratio variations). To launch the last tuning job, we warm start from both previous jobs and run BO for 10 more iterations. The accuracy for the new tuning job (blue dots) improved again compared to the parent jobs (red and black dots), reaching 0.52. As expected, this feature particularly helped users who run their tuning jobs iteratively. Interestingly, we also found this feature to be a practical enabler for users who would like to tune their models for a very large number of evaluations. While this would be prohibitively expensive due to the cubical scaling of GPs, it can be achieved by running tuning jobs in a sequence, each time warm-starting from the previous one (e.g., with 500 HPO evaluations per tuning task).

Figure 5: Validation accuracy (y-axis) over time (x-axis) for the tuned image classifier on the Caltech-256 dataset. The black dots show some iterations of AMT when starting from scratch. Warm start allows you to keep tuning the same algorithm on the same data (red dots) and on a transformed version of the data obtained via data augmentations (blue dots), with the validation accuracy continuously improving.
6.5 Post launch performance
Many customers have adopted AMT as part of their ML workflows since the product launch in December 2017. In particular, four use cases by large enterprises for AMT are currently publicly presented as example use cases.4 Applications include automated bidding to consume or supply electrical energy, click-through rate and conversion rate predictions, malicious actor detection for DNS security, as well as general prototyping of novel ML applications. Note that all the listed applications involve business-critical data for customers of AMT, which highlights the importance of building a tuning service on top of a secure training platform to guarantee secure storage and processing of data. The generality of our container-based tuning service allows us to accommodate a wide variety of learning algorithms, such as custom neural-network-based solutions implemented in TensorFlow, or gradient boosting with decision trees.

An abundance of empirical data is continuously collected to monitor AMT's performance. For instance, API communication was above its service-level agreement for 99.99% of the time in 2020. A number of requested tuning jobs also showcased the scalability of the service. An example includes requests with 5 training jobs in parallel, each running on a cluster of 100 CPU machines of 4 cores and 16 GB of RAM. Other examples are individual training jobs involving clusters of 128 GPU accelerators. The overall service is able to process spikes of many hundreds of tuning jobs with no operational issues.

4https://aws.amazon.com/sagemaker/customers

7 RELATED WORK
Before concluding, we briefly review open-source solutions for gradient-free optimization in this section. Rather than providing an exhaustive list, we aim to give an overview of the tools available publicly. One of the earliest packages for Bayesian optimization using a GP as a surrogate model is Spearmint [51], where several important extensions including multi-task BO [57], input warping [52] and handling of unknown constraints [13] have been introduced. The same strategy has also been implemented in other open-source packages such as BayesianOptimization [38], scikit-optimize5 (easy to use when training scikit-learn algorithms) and Emukit [40]. Unlike using a GP as surrogate model, SMAC [21] uses a random forest, which makes it appealing for high-dimensional and discrete problems. With the growing popularity of deep learning frameworks, BO has also been implemented for all the major deep learning frameworks. BoTorch [2] is the BO implementation built on top of PyTorch [41] and GPyTorch [12], with an emphasis on a modular interface, support for scalable GPs as well as multi-objective BO. In TensorFlow6, there is GPflowOpt [28], which is the BO implementation dependent on GPflow [35]. Finally, HNAS [27] provides asynchronous BO, with and without multi-fidelity optimization, as part of AutoGluon.

5https://github.com/scikit-optimize
6https://www.tensorflow.org/

8 CONCLUSION
We presented SageMaker AMT, a fully-managed service to optimize gradient-free functions in the cloud. Powered by BO, SageMaker AMT brings together the state of the art for hyperparameter optimization and offers it as a highly scalable solution. We gave insights into its design principles, showed how it integrates with other SageMaker components, and shared practice for designing a fault-tolerant system in the cloud. Through a set of real-world use cases, we showed that AMT is an effective tool for HPO. It also offers a set of advanced features, such as automatic early stopping and warm-starting from previous tuning jobs, which demonstrably speed up the search of good hyperparameter configurations. Beyond performance metrics, several ML applications involve optimizing multiple constraints and alternative metrics at the same time, such as maximum memory usage, inference latency or fairness [15, 29, 42, 45]. In the future, AMT could be extended to optimize multiple objectives simultaneously, automatically suggesting hyperparameter configurations that are optimal along several criteria and searching for the Pareto frontier of the multiple objectives.

ACKNOWLEDGMENTS
AMT has contributions from many members of the SageMaker team, notably Ralf Herbrich, Leo Dirac, Michael Brueckner, Viktor Kubinec, Anne Milbert, Choucri Bechir, Enrico Sartoriello, Thibaut Lienart, Furkan Bozkurt, Ilyes Khamlichi, Yifan Su, Adnan Kukuljac, Ugur Adiguzel, Mahdi Heidari.

REFERENCES

[1] J.-A. M. Assael, Z. Wang, and N. de Freitas. Heteroscedastic treed Bayesian optimisation. Technical report, preprint arXiv:1410.7172, 2014.
[2] M. Balandat, B. Karrer, D. R. Jiang, S. Daulton, B. Letham, A. G. Wilson, and E. Bakshy. BoTorch: Programmable Bayesian Optimization in PyTorch. arXiv:1910.06403, Oct. 2019.
[3] R. Bardenet, M. Brendel, B. Kégl, and M. Sebag. Collaborative hyperparameter tuning. In ICML, 2013.
[4] J. Bergstra and Y. Bengio. Random search for hyper-parameter optimization. Journal of Machine Learning Research (JMLR), 13, 2012.
[5] E. Brochu, V. M. Cora, and N. De Freitas. A tutorial on Bayesian optimization of expensive cost functions, with application to active user modeling and hierarchical reinforcement learning. Technical report, preprint arXiv:1012.2599, 2010.
[6] R. Caruana, A. Niculescu-Mizil, G. Crew, and A. Ksikes. Ensemble selection from libraries of models. ICML, 2004.
[7] P. Das, V. Perrone, N. Ivkin, T. Bansal, Z. Karnin, H. Shen, I. Shcherbatyi, Y. Elor, W. Wu, A. Zolic, T. Lienart, A. Tang, A. Ahmed, J. B. Faddoul, R. Jenatton, F. Winkelmolen, P. Gautier, L. Dirac, A. Perunicic, M. Miladinovic, G. Zappella, C. Archambeau, M. Seeger, B. Dutt, and L. Rouesnel. Amazon SageMaker Autopilot: a white box AutoML solution at scale. arXiv:2012.08483, 2020.
[8] T. Domhan, T. Springenberg, and F. Hutter. Extrapolating learning curves of deep neural networks. In ICML AutoML Workshop, 2014.
[9] S. Falkner, A. Klein, and F. Hutter. BOHB: Robust and efficient hyperparameter optimization at scale. In ICML, pages 1436–1445, 2018.
[10] M. Feurer and F. Hutter. Hyperparameter optimization. In F. Hutter, L. Kotthoff, and J. Vanschoren, editors, AutoML: Methods, Systems, Challenges, chapter 1. Springer, 2019.
[11] M. Feurer, T. Springenberg, and F. Hutter. Initializing Bayesian hyperparameter optimization via meta-learning. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, 2015.
[12] J. R. Gardner, G. Pleiss, D. Bindel, K. Q. Weinberger, and A. G. Wilson. GPyTorch: Blackbox matrix-matrix Gaussian process inference with GPU acceleration. In NeurIPS, 2018.
[13] M. A. Gelbart, J. Snoek, and R. P. Adams. Bayesian optimization with unknown constraints. In Proceedings of the Thirtieth Conference on Uncertainty in Artificial Intelligence, pages 250–259, 2014.
[14] D. Golovin, B. Solnik, S. Moitra, G. Kochanski, J. Karro, and D. Sculley. Google Vizier: A service for black-box optimization. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1487–1495, 2017.
[15] G. Guinet, V. Perrone, and C. Archambeau. Pareto-efficient acquisition functions for cost-aware Bayesian optimization. NeurIPS Meta Learning Workshop, 2020.
[16] N. Hansen and A. Ostermeier. Completely derandomized self-adaptation in evolution strategies. Evolutionary Computation, 9(2), 2001.
[17] P. Hennig and C. J. Schuler. Entropy search for information-efficient global optimization. Journal of Machine Learning Research (JMLR), 2012.
[18] J. M. Hernández-Lobato, M. W. Hoffman, and Z. Ghahramani. Predictive entropy search for efficient global optimization of black-box functions. Technical report, preprint arXiv:1406.2541, 2014.
[19] M. Hoffman, B. Shahriari, and N. de Freitas. On correlation and budget constraints in model-based bandit optimization with application to automatic machine learning. In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS), pages 365–374, 2014.
[20] F. Hutter, H. H. Hoos, and K. Leyton-Brown. Sequential model-based optimization for general algorithm configuration. In Proceedings of LION-5, pages 507–523, 2011.
[21] F. Hutter, H. H. Hoos, and K. Leyton-Brown. Sequential model-based optimization for general algorithm configuration. In International Conference on Learning and Intelligent Optimization. Springer, 2011.
[22] M. Jaderberg, V. Dalibard, S. Osindero, W. M. Czarnecki, J. Donahue, A. Razavi, O. Vinyals, T. Green, I. Dunning, K. Simonyan, et al. Population based training of neural networks. Technical report, preprint arXiv:1711.09846, 2017.
[23] K. Jamieson and A. Talwalkar. Non-stochastic best arm identification and hyperparameter optimization. Technical report, preprint arXiv:1502.07943, 2015.
[24] D. R. Jones, M. Schonlau, and W. J. Welch. Efficient global optimization of expensive black-box functions. Journal of Global Optimization, 1998.
[25] Z. Karnin, T. Koren, and O. Somekh. Almost optimal exploration in multi-armed bandits. In ICML, 2013.
[26] A. Klein, S. Falkner, J. T. Springenberg, and F. Hutter. Learning curve prediction with Bayesian neural networks. In International Conference on Learning Representations (ICLR), volume 17, 2017.
[27] A. Klein, L. Tiao, T. Lienart, C. Archambeau, and M. Seeger. Model-based asynchronous hyperparameter and neural architecture search. arXiv preprint arXiv:2003.10865, 2020.
[28] N. Knudde, J. van der Herten, T. Dhaene, and I. Couckuyt. GPflowOpt: A Bayesian optimization library using TensorFlow. arXiv preprint arXiv:1711.03845, 2017.
[29] E. H. Lee, V. Perrone, C. Archambeau, and M. Seeger. Cost-aware Bayesian optimization. In ICML AutoML Workshop, 2020.
[30] L. Li, K. Jamieson, G. DeSalvo, A. Rostamizadeh, and A. Talwalkar. Hyperband: A novel bandit-based approach to hyperparameter optimization. Technical report, preprint arXiv:1603.06560, 2016.
[31] L. Li, K. Jamieson, A. Rostamizadeh, E. Gonina, M. Hardt, B. Recht, and A. Talwalkar. Massively parallel hyperparameter tuning. Technical Report 1810.05934v4 [cs.LG], ArXiv, 2019.
[32] E. Liberty, Z. Karnin, B. Xiang, L. Rouesnel, B. Coskun, R. Nallapati, J. Delgado, A. Sadoughi, Y. Astashonok, P. Das, C. Balioglu, S. Chakravarty, M. Jha, P. Gautier, D. Arpin, T. Januschowski, V. Flunkert, Y. Wang, J. Gasthaus, L. Stella, S. Rangapuram, D. Salinas, S. Schelter, and A. Smola. Elastic machine learning algorithms in Amazon SageMaker. In Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data, 2020.
[33] D. J. Lizotte. Practical Bayesian optimization. PhD thesis, University of Alberta, 2008.
[34] D. J. C. Mackay. Information Theory, Inference and Learning Algorithms. Cambridge University Press, 2003.
[35] A. G. d. G. Matthews, M. van der Wilk, T. Nickson, K. Fujii, A. Boukouvalas, P. León-Villagrá, Z. Ghahramani, and J. Hensman. GPflow: A Gaussian process library using TensorFlow. Journal of Machine Learning Research, 18(40), Apr. 2017.
[36] J. Mockus, V. Tiesis, and A. Zilinskas. The application of Bayesian methods for seeking the extremum. Towards Global Optimization, 1978.
[37] I. Murray and R. P. Adams. Slice sampling covariance hyperparameters of latent Gaussian models. In NeurIPS, pages 1732–1740, 2010.
[38] F. Nogueira. Bayesian optimization: Open source constrained global optimization tool for Python, 2014.
[39] M. A. Osborne, R. Garnett, and S. J. Roberts. Gaussian processes for global optimization. In Proceedings of the 3rd Learning and Intelligent OptimizatioN Conference (LION 3), 2009.
[40] A. Paleyes, M. Pullin, M. Mahsereci, N. Lawrence, and J. González. Emulation of physical processes with Emukit. In Second Workshop on Machine Learning and the Physical Sciences, NeurIPS, 2019.
[41] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer. Automatic differentiation in PyTorch. In NIPS-W, 2017.
[42] V. Perrone, M. Donini, K. Kenthapadi, and C. Archambeau. Fair Bayesian optimization. In ICML AutoML Workshop, 2020.
[43] V. Perrone, R. Jenatton, M. Seeger, and C. Archambeau. Multiple adaptive Bayesian linear regression for scalable Bayesian optimization with warm start. NeurIPS Meta Learning Workshop, 2017.
[44] V. Perrone, R. Jenatton, M. W. Seeger, and C. Archambeau. Scalable hyperparameter transfer learning. NeurIPS, 2018.
[45] V. Perrone, I. Shcherbatyi, R. Jenatton, C. Archambeau, and M. Seeger. Constrained Bayesian optimization with max-value entropy search. In NeurIPS Meta Learning Workshop, 2019.
[46] V. Perrone, H. Shen, M. W. Seeger, C. Archambeau, and R. Jenatton. Learning search spaces for Bayesian optimization: Another view of hyperparameter transfer learning. NeurIPS, 32, 2019.
[47] C. Rasmussen and C. Williams. Gaussian Processes for Machine Learning. MIT Press, 2006.
[48] E. Real, C. Liang, D. So, and Q. Le. Evolving machine learning algorithms from scratch. In ICML, 2020.
[49] D. Salinas, H. Shen, and V. Perrone. A quantile-based approach for hyperparameter transfer learning. ICML, 2020.
[50] B. Shahriari, K. Swersky, Z. Wang, R. P. Adams, and N. de Freitas. Taking the human out of the loop: A review of Bayesian optimization. IEEE, 2016.
[51] J. Snoek, H. Larochelle, and R. P. Adams. Practical Bayesian optimization of machine learning algorithms. In NeurIPS, pages 2960–2968, 2012.
[52] J. Snoek, K. Swersky, R. S. Zemel, and R. P. Adams. Input warping for Bayesian optimization of non-stationary functions. Technical report, preprint arXiv:1402.0929, 2014.
[53] I. M. Sobol. On the distribution of points in a cube and the approximate evaluation of integrals. USSR Computational Mathematics and Mathematical Physics, 7(4), 1967.
[54] J. T. Springenberg, A. Klein, S. Falkner, and F. Hutter. Bayesian optimization with robust Bayesian neural networks. In NeurIPS, pages 4134–4142, 2016.
[55] N. Srinivas, A. Krause, S. Kakade, and M. Seeger. Gaussian process optimization in the bandit setting: No regret and experimental design. ICML, pages 1015–1022, 2010.
[56] N. Srinivas, A. Krause, S. Kakade, and M. Seeger. Information-theoretic regret bounds for Gaussian process optimization in the bandit setting. IEEE Transactions on Information Theory, 58, 2012.
[57] K. Swersky, J. Snoek, and R. P. Adams. Multi-task Bayesian optimization. In NeurIPS, pages 2004–2012, 2013.
[58] K. Swersky, J. Snoek, and R. P. Adams. Freeze-thaw Bayesian optimization. Technical report, preprint arXiv:1406.3896, 2014.
[59] W. R. Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, pages 285–294, 1933.
[60] V. Vapnik. The Nature of Statistical Learning Theory. Springer Science & Business Media, 2013.
[61] J. Villemonteix, E. Vazquez, and E. Walter. An informational approach to the global optimization of expensive-to-evaluate functions. Journal of Global Optimization, 44(4):509–534, 2009.
[62] Z. Wang and S. Jegelka. Max-value entropy search for efficient Bayesian optimization. ICML, 70, 2017.