Amazon SageMaker Automatic Model Tuning: Scalable Gradient-Free Optimization

Valerio Perrone1, Huibin Shen, Aida Zolic, Iaroslav Shcherbatyi, Amr Ahmed, Tanya Bansal, Michele Donini, Fela Winkelmolen∗, Rodolphe Jenatton∗, Jean Baptiste Faddoul, Barbara Pogorzelska, Miroslav Miladinovic, Krishnaram Kenthapadi, Matthias Seeger, Cédric Archambeau

ABSTRACT
Tuning complex systems is challenging. Machine learning typically requires setting hyperparameters, be they regularization, architecture, or optimization parameters, whose tuning is critical to achieve good predictive performance. To democratize access to machine learning systems, it is essential to automate the tuning. This paper presents Amazon SageMaker Automatic Model Tuning (AMT), a fully managed system for gradient-free optimization at scale. AMT finds the best version of a trained machine learning model by repeatedly evaluating it with different hyperparameter configurations. It leverages either random search or Bayesian optimization to choose the hyperparameter values resulting in the best model, as measured by the metric chosen by the user. AMT can be used with built-in algorithms, custom algorithms, and Amazon SageMaker pre-built containers for machine learning frameworks. We discuss the core functionality, system architecture, our design principles, and lessons learned. We also describe more advanced features of AMT, such as automated early stopping and warm-starting, showing in experiments their benefits to users.

CCS CONCEPTS
• Computing methodologies → Machine learning; • Computer systems organization → Dependable and fault-tolerant systems and networks.

KEYWORDS
AutoML, hyperparameter tuning, scalable systems

1Correspondence to: Valerio Perrone, .
∗Work done while at Amazon Web Services.

1 INTRODUCTION
In modern machine learning, complex statistical models with many free parameters are fit to data by way of automated and highly scalable algorithms. For example, weights of a deep neural network are learned by stochastic gradient descent (SGD), minimizing a loss function over the training data. Unfortunately, some remaining hyperparameters (HPs) cannot be adjusted this way, and their values can significantly affect the prediction quality of the final model. In a neural network, we need to choose the learning rate of the stochastic optimizer, regularization constants, the type of activation functions, and architecture parameters such as the number or the width of the different layers. In Bayesian models, priors need to be specified, while for random forests or gradient boosted decision trees, the number and maximum depth of trees are important HPs.

The problem of hyperparameter tuning can be formulated as the minimization of an objective function f : X → R, where X denotes the space of valid HP configurations and the value f(x) corresponds to the metric we wish to optimize. We assume that x = [x_1, ..., x_d], where x_j ∈ X_j is one of the d hyperparameters, such that X = X_1 × ... × X_d. For example, given some x ∈ X, f(x) may correspond to the held-out error rate of a machine learning model when trained and evaluated using the HPs x. In practice, hyperparameter optimization requires addressing a number of challenges. First, we do not know the analytical form of the function f (i.e., it can only be observed through evaluations), so it is difficult to optimize as we cannot compute its gradients. Second, evaluations of f(x) are often expensive in terms of time and compute (e.g., training a deep neural network on a large dataset), so it is important to identify a good hyperparameter configuration x* with the least number of queries of f. Third, for complex models, the HP configuration space X can have diverse types of attributes, some of which may be integer or categorical. For numerical attributes, search ranges need to be determined. Some attributes in X can even be conditional (e.g., the width of the l-th layer of a neural network is only relevant if the model has at least l layers). Finally, even if f(x) varies smoothly in x, evaluations of f(x) are typically noisy.

We present Amazon SageMaker Automatic Model Tuning (AMT), a fully managed system for gradient-free function optimization at scale.1 The key contributions of our work are as follows:
• Design, architecture and implementation of hyperparameter optimization as a distributed, fault-tolerant, scalable, secure and fully managed service, integrated with Amazon SageMaker (§3).
• Description of the Bayesian optimization algorithm powering AMT, including efficient hyperparameter representation, surrogate Gaussian process model, acquisition functions, and parallel and asynchronous evaluations (§4).
• Overview of advanced features such as log scaling, automated early stopping and warm start (§5).
• Discussion of deployment results as well as challenges encountered and lessons learned (§6).

1Amazon SageMaker is a service that allows easy training and hosting of machine learning models. For details, see https://aws.amazon.com/sagemaker and [32].
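As a concrete illustration of the black-box objective f introduced above, the sketch below (illustrative only, using scikit-learn rather than anything AMT-specific) treats one hyperparameter evaluation as training a model with a candidate configuration x and returning its held-out error:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X_train, X_val, y_train, y_val = train_test_split(
    *load_breast_cancer(return_X_y=True), random_state=0
)

def f(x):
    """Black-box objective: held-out error rate of a model trained with HPs x.
    Each call is expensive and no gradients are available."""
    model = GradientBoostingClassifier(
        learning_rate=x["learning_rate"], max_depth=x["max_depth"], random_state=0
    )
    model.fit(X_train, y_train)
    return 1.0 - model.score(X_val, y_val)

print(f({"learning_rate": 0.1, "max_depth": 3}))
```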
2 PRELIMINARIES
Traditionally, HPs are either hand-tuned by experts in what amounts to a laborious process, or they are selected using brute-force schemes such as grid search or random search. In response to increasing model complexity, a range of more sample-efficient HP optimization techniques have emerged. A comprehensive review of modern hyperparameter optimization is provided in [10]. Here, we focus on work relevant in the context of AMT.

2.1 Model-Free Hyperparameter Optimization
Any HPO method proposes evaluation points x_1, x_2, ..., x_T, such that

    min_{t=1,...,T} f(x_t) ≈ min_{x ∈ X} f(x).

For the simplest methods, the choice of x_t does not depend on earlier observations. In grid search, we fix K values for every HP x_j, then evaluate f on the Cartesian product, so that T = K^d. In random search, we draw x_t independently and uniformly at random. More specifically, each x_{t,j} is drawn uniformly from X_j. For numerical HPs, the distribution may also be uniform in a transformed domain (e.g., log domain). Random search is more effective than grid search when some of the HPs x_j are irrelevant, as it considers a larger number of values of the relevant ones [4]. Both methods are easily parallelizable. Random search should always be considered as a baseline, and is frequently used to initialize more sophisticated HPO methods. Another alternative is picking a set of pseudo-random points known as Sobol sequences [53]. The advantage is that they provide a better coverage of the search space, but they are deterministic.

These simple baselines can be improved upon by making use of earlier observations {(x_{t_1}, f(x_{t_1})) | t_1 < t} in order to plan subsequent evaluations x_{t_2}, t_2 ≥ t. Population-based methods, such as evolutionary or genetic algorithms, are attractive if parallel computing resources are available [16, 48]. In each generation, all configurations from a fixed-size population are evaluated. Then, a new population is created by randomly mutating a certain percentile of the top performing configurations, as well as sampling from a background distribution, and the process repeats with the next generation. Evolutionary algorithms (EAs) are powerful and flexible search strategies, which can work even in search spaces of complex structure. However, EAs can be difficult to configure to the problem at hand, and since fine-grained knowledge about f can be encoded only in a large population, they tend to require a substantial amount of parallel compute resources. In some EAs, low-fidelity approximations of f are employed during early generations [22] in order to reduce computational cost. Multi-fidelity strategies are discussed in more generality below.

2.2 Surrogate Models. Bayesian Optimization
A key idea to improve data efficiency of sampling is to maintain a surrogate model of the function f. At decision step t, this model is fit to previous data D_{<t} = {(x_{t_1}, f(x_{t_1})) | t_1 < t}. In Bayesian optimization (BO), we employ probabilistic models, predicting not only best estimates (posterior means) but also uncertainties (posterior variances) for each x ∈ X: the value of the next x_t could come from exploration (sampling where f is most uncertain) or from exploitation (minimizing our best estimate), and a calibrated probabilistic model can be used to resolve the trade-off between these desiderata optimally [56]. More concretely, we choose x_t as the best point according to an acquisition function A(x | D_{<t}), which is a utility function averaged over the posterior predictive distribution p(f(x) | D_{<t}). The most common surrogate model for BO is based on Gaussian processes (GPs) [47], which not only have simple closed-form expressions for posterior and predictive distributions, but also come with strong theoretical guarantees. Other BO surrogate models include random forests [20] and Bayesian neural networks [54], which can suffer from uncalibrated uncertainty estimates. We give a detailed account of sequential BO with a GP surrogate model in Section 4. Tutorials on BO are provided in [5, 50].

2.3 Early Stopping. Multi-Fidelity Optimization
Modern architectures can come with hundreds of millions of parameters, and even a single training run can take many hours or even days. In such settings, it is common practice to cheaply probe configurations x by training for few epochs or on a subset of the data. While this gives rise to a low-fidelity approximation of f(x), such data can be sufficient to filter out poor configurations early on, so that full training is run only for the most promising ones. To this end, we consider functions f(x, r), r being a resource attribute, where f(x, r_max) = f(x) is the expensive metric of interest, while f(x, r), r < r_max, are cheaper-to-evaluate approximations. Here, r could be the number of training epochs for a deep neural network (DNN), the number of trees for gradient boosting, or the dataset subsampling ratio. When training DNNs, we can evaluate f(x, r), r = 1, 2, ..., by computing the validation metric after every epoch. For gradient boosting implementations supporting incremental training, we can obtain intermediate values f(x, r) as well. In early stopping HPO, the evaluation of x is terminated at a level r if the probability of it scoring worse at r_max than some earlier x′ is predicted high enough. The median rule is a simple instance of this idea [14], while other techniques aim to extrapolate learning curves beyond the current r [8, 26]. Early stopping is particularly well suited to asynchronous parallel execution: whenever a job is stopped, an evaluation can be started for the next configuration x proposed by HPO. An alternative to stopping configurations is to pause and (potentially) resume them later [58].

Besides early stopping HPO, successive halving (SH) [23, 25] and Hyperband [30] are two other basic instances of multi-fidelity techniques. In each round, f(x, r_min) is evaluated for a number n of configurations x sampled at random. Next, f(x, 2 r_min) is run for the top n/2 of configurations, while the bottom half are discarded. This filtering step is repeated until r_max is reached. One drawback of SH and Hyperband is their synchronous nature, which is remedied by ASHA [31]. All these methods make use of SH scheduling of evaluations, yet new configurations are chosen at random. BOHB [9] combines synchronous Hyperband with model-based HPO, and a more recent combination of ASHA with Bayesian optimization is MOBSTER [27]. This method tends to be far more efficient than synchronous counterparts, and can often save up to half of the resources compared to ASHA.
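To make the successive halving scheduling concrete, the following self-contained sketch (plain Python, with a hypothetical evaluate(config, resource) standing in for f(x, r)) keeps the best half of the configurations at each rung and doubles the resource until r_max is reached:

```python
import random

def evaluate(config, resource):
    # Hypothetical stand-in for f(x, r): train `config` with `resource`
    # epochs/trees/data and return a validation loss (lower is better).
    return (config["lr"] - 0.1) ** 2 + 1.0 / resource + random.gauss(0, 0.01)

def successive_halving(configs, r_min=1, r_max=16):
    """Evaluate all configs at r_min, keep the best half, double the
    resource, and repeat until r_max is reached."""
    resource = r_min
    while len(configs) > 1 and resource <= r_max:
        scores = [(evaluate(c, resource), c) for c in configs]
        scores.sort(key=lambda pair: pair[0])           # lower loss is better
        configs = [c for _, c in scores[: max(1, len(scores) // 2)]]
        resource *= 2
    return configs[0]

candidates = [{"lr": random.uniform(0.001, 1.0)} for _ in range(16)]
print(successive_halving(candidates))
```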

3 THE SYSTEM
In this section we give an overview of the system underpinning SageMaker AMT. We lay out design principles, describe the system architecture, and highlight a number of challenges related to running HPO in the cloud.

Figure 1: System architecture of SageMaker AMT. The diagram shows the different components of AMT and how they interact. A number of AWS building blocks are combined to deliver a scalable, robust, secure and fully managed service.

3.1 Design Principles
We present the key requirements underlying the design of SageMaker AMT.
• Easy to use and fully managed: To ensure broad adoption by data scientists and ML developers with varying levels of ML knowledge, AMT should be easy to use, with minimal effort needed for a new user to set up and execute. Further, we would like AMT to be offered as a fully managed service, with a stable API and default configuration settings, so that the implementation complexity is abstracted away from the customer. AMT spares users the pain of provisioning hardware, installing the right software, and downloading the data. It takes care of uploading the models to the users' accounts and providing them with training performance metrics.
• Tightly integrated with other SageMaker components: Considering that model tuning (HPO) is typically performed as part of an ML pipeline involving several other components, AMT should seamlessly operate with other SageMaker components and APIs.
• Scalable: AMT should scale with respect to different data sizes, ML training algorithms, number of HPs, metrics, HPO methods and hardware configurations. Scalability includes a failure-resistant workflow with built-in retry mechanisms to guarantee robustness.
• Cost-effective: AMT should be cost-effective to the customer, in terms of both compute and human costs. We would like to enable the customer to specify a budget and support cost reduction techniques such as early stopping and warm start.
• Available: A pillar of AMT's design is to ensure that it is highly available. This includes success of the synchronous APIs as well as minimizing the failures of the asynchronous workflows underlying the hyperparameter tuning jobs.
• Secure: Security is a top priority for AMT. In the context of HPO, this means not putting customers' data at risk. As we will elaborate below, our system does not store any customer data. In addition, we provide extra security features such as the ability to run workloads in a network-isolated mode (i.e., containers cannot make network calls) and/or in a VPC, addressing security requirements for AMT's users.

3.2 System Architecture
SageMaker AMT provides a distributed, fault-tolerant, scalable, secure and fully-managed service for HPO. One of the key building blocks of the service is the SageMaker Training platform, which executes the training jobs and obtains the value of the objective metric for any candidate hyperparameter chosen by the Hyperparameter Selection Service. In this way, each candidate set of hyperparameters tried is associated to a corresponding SageMaker Training Job in the user's AWS account. Using the SageMaker Training platform allows AMT to scale well. Several training jobs can be run in parallel, use distributed clusters of EC2 instances, and scale HPO to large volumes of data. This includes a failure-resistant workflow with built-in retry mechanisms to guarantee robustness.

SageMaker AMT's backend is built using a fully serverless architecture by means of a number of AWS building blocks: it uses AWS API Gateway, AWS Lambda, AWS DynamoDB, AWS Step Functions and AWS CloudWatch Events in its back-end workflow. AWS API Gateway and AWS Lambda are used to power the API layer which customers use to call the various SageMaker AMT APIs, such as Create/List/Describe/StopHyperparameterTuningJobs. AWS DynamoDB is used as the persistent store to keep all the metadata associated with the job and also track the current state of the job. The overall system architecture is depicted in Figure 1. SageMaker AMT deals only with the metadata for the jobs, and all customer data is handled by the SageMaker Training platform, ensuring that no customer data is stored in the DynamoDB table. AWS CloudWatch Events, AWS Step Functions and AWS Lambda are used in the AMT workflow engine, which is responsible for kicking off the evaluation of hyperparameter configurations from the Hyperparameter Selection Service, starting training jobs, tracking their progress and repeating the process until the stopping criterion is met.

3.3 Challenges for HPO in the cloud
While running at scale poses a number of challenges, AMT is a highly distributed and fault-tolerant system. System resiliency was one of the guiding principles when building AMT. Example failure scenarios are when the BO engine suggests hyperparameters that can run out of memory or when individual training jobs fail due to dependency issues. The AMT workflow engine is designed to be resilient against failures, and has a built-in retry mechanism to guarantee robustness.

AMT runs every evaluation as a separate training job on the SageMaker Training platform. Each training job provides customers with a usable model, logs and metrics persisted in CloudWatch. A training job involves setting up a new cluster of EC2 instances, waiting for the setup to complete, and downloading algorithm images. This introduced an overhead that was pronounced for smaller datasets. To address this, AMT puts in place compute provisioning optimizations to reduce the time of setting up clusters and getting them ready to run the training.
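From the user's side, a tuning job of this kind is typically launched through the SageMaker Python SDK. The sketch below is illustrative only: the estimator, metric name, hyperparameter ranges and S3 paths are placeholders, and the exact SDK surface may differ across versions.

```python
from sagemaker.tuner import HyperparameterTuner, ContinuousParameter, IntegerParameter

# `estimator` is assumed to be a previously configured sagemaker.estimator.Estimator
# (e.g., a built-in algorithm container) that emits a "validation:auc" metric.
tuner = HyperparameterTuner(
    estimator=estimator,
    objective_metric_name="validation:auc",
    objective_type="Maximize",
    hyperparameter_ranges={
        "alpha": ContinuousParameter(0.01, 10.0),
        "max_depth": IntegerParameter(2, 10),
    },
    strategy="Bayesian",          # or "Random"
    max_jobs=50,                  # total hyperparameter evaluations
    max_parallel_jobs=5,          # evaluations run concurrently
    early_stopping_type="Auto",   # let AMT stop unpromising jobs (Section 5.2)
)

# Each evaluation becomes a separate SageMaker Training Job in the user's account.
tuner.fit({"train": "s3://my-bucket/train", "validation": "s3://my-bucket/validation"})
```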
4 THE ALGORITHM
Next, we describe the main components and modelling choices behind the BO algorithm powering SageMaker AMT. We start with the representation of the hyperparameters, followed by the key decisions we made around the choice of the surrogate model and acquisition function. We also describe the way AMT tackles the common use case of exploiting parallel compute resources.

4.1 Input configuration
To tune the HPs of a machine learning model, AMT needs an input description of the space over which the optimization is performed. We denote by H the set of HPs we are working with, such as H = {learning rate, loss function}. For each h ∈ H, we also define B_h, the domain of values that h can possibly take, leading to the global domain of HPs B_H ≜ B_{h_1} × ... × B_{h_|H|}. Each h ∈ H has data type continuous (real-valued), integer, or categorical, where the numerical types come with lower and upper bounds. Integer HPs are handled by working in the continuous space and rounding to the nearest integer, while categorical HPs are one-hot encoded.

It is common among ML practitioners to exploit some prior knowledge to adequately transform the space of some HP, such as applying a log-transform for a regularisation parameter. This point is important as the final performance hinges on this preprocessing. There has been a large body of work dedicated to automatically finding such input transformations (e.g., see [1] and references therein). In AMT this is tackled through input warping and log scaling, which are described in the following sections.

4.2 Gaussian process modelling
Once the input HPs are encoded, AMT builds a model mapping hyperparameter configurations to their predicted performance. We follow a form of GP-based global optimisation [5, 24, 33, 39], meaning that the function f to minimize is assumed to be drawn from a GP with a given mean and covariance function. The GP is a particularly suitable model for an online learning system such as AMT, where past hyperparameter evaluations are iteratively used to pick the next ones. Since observations y collected from f are normalized to mean zero, we can consider a zero-mean function without loss of generality. The choice of the covariance function K_θ, which depends on some Gaussian process hyperparameters (GPHPs) θ, is discussed below.

More formally, given an encoded configuration x, our probabilistic model reads f(x) ∼ GP(0, K_θ) and y | f(x) ∼ N(f(x), σ_0²), where the observation y is modelled as a Gaussian random variable with mean f(x) and variance σ_0²: a standard GP regression setup [47, Chapter 2]. Many choices for the covariance function (or kernel) K_θ : R^d × R^d → R are possible. The Matérn-5/2 kernel with automatic relevance determination (ARD) parametrisation [47, Chapter 4] is advocated in [51, 52, 57], where it is shown that ARD does not lead to overfitting, provided that the GPHPs are properly handled. We follow this choice, which has become a de-facto standard in most BO packages.

GP hyperparameters. Our probabilistic model comes with some GPHPs θ. We highlight two possible options to treat these parameters, both of which are implemented in AMT. A traditional way of determining the GPHPs consists in finding θ that maximises the log marginal likelihood of our probabilistic model [47, Section 5.4]. While this approach, known as empirical Bayes, is efficient and often leads to good results, [51, and follow-up work] rather advocate the full Bayesian treatment of integrating out θ by Markov chain Monte Carlo (MCMC). In our experiments we found the latter approach to be less likely to overfit in the few-observation regime (i.e., early in the BO procedure), but it is also more costly, since GP computations have to be done for every MCMC sample. In AMT, we implement slice sampling, one of the most widely used techniques for GPHPs [34, 37]. In our implementation we use one chain of 300 samples, with 250 samples as burn-in and thinning every 5 samples, resulting in an effective sample size of 10. We fix upper and lower bounds on the GPHPs for numerical stability, and use a random (normalised) direction, as opposed to a coordinate-wise strategy, to go from our multivariate problem (θ ∈ R^k) to the standard univariate formulation of slice sampling. We observed that slice sampling is a better approach to learn the GPHPs compared to empirical Bayes, especially at the beginning of HPO, where the latter is more prone to overfitting due to the small number of observations.

Input warping. It is common practice among ML practitioners to exploit some prior knowledge to adequately transform the space of some HP, such as applying a log-transform for a regularization parameter. We leverage the ideas developed in [52], where the configuration x is transformed entry-wise by applying, for each dimension j ∈ {1, ..., d}, ω_j(x_j) ≜ BetaCDF(x_j, α_j, β_j), where {α_j, β_j}_{j ∈ {1,...,d}} are GPHPs that govern the shape of the transformation. We refer to ω(x) ∈ R^d as the vector resulting from all entry-wise transformations. An alternative, which is the default choice in AMT, is to consider the CDF of the Kumaraswamy distribution, which is more tractable than the CDF of the Beta distribution. A convenient way of handling these additional GPHPs is to overload our definition of the covariance function so that, for any two x, x′, K_θ(x, x′) ≜ K_θ(ω(x), ω(x′)), where {α_j, β_j}_{j ∈ {1,...,d}} are merged within the global vector θ of GPHPs.
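A minimal numpy sketch of the entry-wise Kumaraswamy warping follows; the alpha and beta values here are arbitrary placeholders for the shape parameters that AMT learns as GPHPs.

```python
import numpy as np

def kumaraswamy_warp(x, alpha, beta):
    """Entry-wise input warping via the Kumaraswamy CDF,
    F(x; a, b) = 1 - (1 - x**a)**b  for x in [0, 1].
    `x` holds encoded HPs scaled to the unit interval; `alpha` and `beta`
    are per-dimension shape parameters (learned as GPHPs in AMT)."""
    x = np.clip(x, 0.0, 1.0)
    return 1.0 - (1.0 - x ** alpha) ** beta

# Example: a concave warp (alpha < 1) stretches small values of the first HP,
# mimicking a log-like transformation learned from data.
x = np.array([[0.001, 0.5], [0.1, 0.9]])
print(kumaraswamy_warp(x, alpha=np.array([0.3, 1.0]), beta=np.array([1.0, 1.0])))
```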
4.3 Acquisition functions
Denote the evaluations done up to now by D = {(x^c, y^c) | c ∈ C}. In GP-based Bayesian optimization, an acquisition function is optimized in order to determine the hyperparameter configuration at which f should be evaluated next. Most common acquisition functions A(x) depend on the GP posterior P(f | D) only via its marginal mean μ(x) and variance σ²(x). There is an extensive literature dedicated to the design of acquisition functions. The expected improvement (EI) was introduced by [36] and is probably the most popular acquisition function (notably popularised by the EGO algorithm [24]). EI is the default choice for toolboxes like SMAC [20], Spearmint [51], and AMT. It is defined as A(x | D, θ) = E[max{0, y* − y(x)}], where the expectation is taken with respect to the posterior distribution y(x) ∼ N(μ(x), σ²(x)). Here, y* = min{y^c | c ∈ C} is the best target value observed so far. In the case of EI, the expectation appearing in A(x) is taken with respect to a Gaussian marginal consistent with the Gaussian process posterior and, as such, has a simple closed-form expression. Another basic acquisition function is provided by Thompson sampling, which is optimized by drawing a realization of the GP posterior and searching for its minimum point x [19, 59]. Exact Thompson sampling requires a sample from the joint posterior process, which is intractable. An approximation to this scheme is used in AMT, whereby marginal variables are sampled from N(μ(x), σ²(x)) at locations x drawn from a dense set. The set is obtained through a Sobol sequence generator [53] populating the search space as densely as possible. Rather than using the sampled variables directly, the resulting pseudo-random grid is used as a set of anchor points to initialize the local optimization of the EI. This scales linearly in the number of locations and works well in practice.

Other interesting families of acquisition functions include those built on upper-confidence-bound ideas [55] or information-theoretic criteria [17, 18, 61, 62]. While more sample-efficient, many of these typically come with higher computational cost than the EI, which adds overhead when the tuned model is fast to train (e.g., on a small dataset). For this reason, we chose the EI as a robust acquisition function striking a balance between performance, simplicity and computational time. Alternative acquisition functions to make the EI cost-aware and steer the hyperparameter search towards cheaper configurations are described in [15, 29]. These variants mitigate the cost associated to expensive hyperparameter configurations, such as very large architectures when tuning neural network models.
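For reference, the closed-form EI for minimization can be written down in a few lines. The sketch below assumes μ(x) and σ(x) come from some fitted GP surrogate (any GP library would do) and is not AMT's internal code; the optional xi margin is an illustration, not a parameter of AMT.

```python
import numpy as np
from scipy.stats import norm

def expected_improvement(mu, sigma, y_best, xi=0.0):
    """Closed-form EI for minimization:
    EI(x) = (y_best - mu - xi) * Phi(z) + sigma * phi(z),
    with z = (y_best - mu - xi) / sigma, and EI = 0 where sigma == 0.
    `mu`, `sigma`: GP posterior mean and standard deviation at candidate points;
    `y_best`: best (lowest) objective value observed so far."""
    mu, sigma = np.asarray(mu, dtype=float), np.asarray(sigma, dtype=float)
    improvement = y_best - mu - xi
    with np.errstate(divide="ignore", invalid="ignore"):
        z = improvement / sigma
        ei = improvement * norm.cdf(z) + sigma * norm.pdf(z)
    return np.where(sigma > 0, ei, 0.0)

# Toy usage: EI is highest where the mean is low and/or the uncertainty is large.
print(expected_improvement(mu=[0.30, 0.25, 0.40], sigma=[0.05, 0.01, 0.20], y_best=0.28))
```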
4.4 Parallelism
While the BO procedure presented so far is purely sequential, many users are interested in exploiting parallel resources to speed up their tuning jobs. To address this, AMT lets them specify whether the target function should be evaluated in parallel for different candidates, and with how many parallel evaluations at a time. There are different ways of taking advantage of a distributed computational environment. If we suppose we have access to L threads/machines, we can simply return the top-L candidates as ranked by the acquisition function A. We proceed to the next step only when the L candidates' parallel evaluations are completed. This strategy works well as long as the candidates are diverse enough and their evaluation time is not too heterogeneous, which is unfortunately rarely the case. For these reasons, AMT optimizes the acquisition function so as to induce diversity and adopts an asynchronous strategy. As soon as one of the L evaluations is done, we update the GP with this new configuration and pick the next candidate to fill in the available computation slot (making sure, of course, not to select one of the L − 1 pending candidates). One disadvantage is that this does not take into account the information coming from the fact that we picked the L − 1 pending candidates. To tackle this, asynchronous processing could be based on fantasizing [27, 51].

5 ADVANCED FEATURES
Beyond standard BO, AMT comes with a number of extra features to speed up tuning jobs, saving computational time and potentially finding better hyperparameter configurations. We start by describing log scaling, followed by early stopping and warm-starting.

5.1 Log Scaling
A common property of learning problems is that a linear change in validation performance requires an exponential increase in the learning capacity of the estimator (as defined by the VC dimension in [60]). To illustrate this, Figure 2 shows the relationship between the capacity parameter of an SVM and validation accuracy. As a result, a wide search range is often chosen for hyperparameters controlling model capacity. For instance, a typical choice for the capacity parameter C of a support vector machine is [10^-9, 10^9]. Note that 99% of the volume of this example search space corresponds to values of the hyperparameter C in [10^7, 10^9]. As a result, smaller values of C might be under-explored by BO if applied directly to such a range, as it attempts to evenly explore the search space. To avoid under-exploring smaller values, a log transformation is applied to such model-capacity-related variables. Such a transformation is generally referred to as "log scaling". Search within transformed search spaces can be done automatically by AMT, provided that the user indicates that such a transformation is appropriate for a given hyperparameter via the API. For all the algorithms provided in SageMaker, such as XGBoost and Linear Learner, recommendations are given for which hyperparameters log scaling is appropriate and accelerates the tuning.

Figure 2: Change of validation set score with different values of the capacity parameter of a Support Vector Machine. Note the logarithmic scale of the capacity parameter C.

Compared to the GP input warping, which automatically detects the right scaling as the tuning job progresses, log scaling is applied from the start and comes without extra parameters to estimate. The internal warping can learn a larger family of transformations compared to the ones supported by hyperparameter scaling, and is thus always active by default. However, log scaling is useful when the appropriate input transformation is known in advance. Unlike input warping, it can be used not only with BO but also with random search.
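The under-exploration effect is easy to reproduce, and on the API side the user-facing switch is a per-hyperparameter scaling choice. The snippet below illustrates both; the scaling_type argument follows the SageMaker Python SDK as I understand it, but treat the exact name and values as an assumption to verify against the SDK version in use.

```python
import numpy as np

# Uniform sampling over [1e-9, 1e9]: almost all mass lands above 1e7, so small
# values of C are rarely proposed unless the range is log-transformed.
rng = np.random.default_rng(0)
uniform_c = rng.uniform(1e-9, 1e9, size=100_000)
print((uniform_c > 1e7).mean())        # ~0.99

log_uniform_c = 10 ** rng.uniform(-9, 9, size=100_000)
print((log_uniform_c > 1e7).mean())    # ~0.11

# Declaring the transformation via the SageMaker Python SDK (assumed API surface):
from sagemaker.tuner import ContinuousParameter
c_range = ContinuousParameter(1e-9, 1e9, scaling_type="Logarithmic")
```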
5.2 Early Stopping
Tuning involves evaluating several hyperparameter configurations, which can be costly. Generally, a new hyperparameter configuration x_t may not improve over the previous ones, i.e., f(x_t) ≥ f(x_{t′}) for one or more t′ < t. With early stopping, AMT uses the information of the previously evaluated configurations to predict whether a specific candidate x_t is promising. If not, it stops the evaluation, thus reducing the overall time. This works whenever we can obtain intermediate values f^1(x_t), f^2(x_t), ..., f^n(x_t), with f^r(x_t) representing the value of the objective for the configuration x_t at training iteration r.

To implement early stopping, AMT employs the simple but effective median rule [14] to determine which HP configurations to stop early. If f^r(x_t) is worse than the median of the previously evaluated configurations at the same iteration r, we stop the training. An alternative is to predict future performance via a model and stop poor configurations. In our experiments, the simple median rule performed at least as well as, and often better than, predictions from linear and random forest models. A concern with the median rule is that lower fidelities are not necessarily representative of the final values: as the training proceeds, a seemingly poor HP configuration can eventually improve enough to become the best one. To improve resilience, we only make stopping decisions after a given number of training iterations. As the total number of iterations can vary for different algorithms and use cases, this threshold is determined dynamically based on the duration of the fully completed hyperparameter evaluations. We also considered the additional safeguard of always completing 10 hyperparameter evaluations before activating the median rule. However, this produced only slight improvements in the final objective value at the price of reduced time savings, and was therefore discarded.
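A minimal sketch of the median rule follows (plain Python, not AMT's internal implementation): a running job is stopped at iteration r if its intermediate metric is worse than the median of the completed learning curves at the same iteration, assuming lower is better.

```python
import statistics

def should_stop(partial_curve, completed_curves, min_iterations=5):
    """Median stopping rule (lower objective is better).

    partial_curve    : intermediate metrics f^1(x_t), ..., f^r(x_t) of the running job.
    completed_curves : list of metric curves from fully evaluated configurations.
    min_iterations   : do not stop before this many iterations (resilience safeguard).
    """
    r = len(partial_curve)
    if r < min_iterations:
        return False
    peers = [curve[r - 1] for curve in completed_curves if len(curve) >= r]
    if not peers:
        return False
    return partial_curve[-1] > statistics.median(peers)

history = [[0.9, 0.6, 0.4, 0.35, 0.33], [0.8, 0.5, 0.45, 0.40, 0.39]]
print(should_stop([0.95, 0.85, 0.80, 0.78, 0.77], history))  # True: clearly lagging
```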
5.3 Warm start
A typical ask from users running several related tuning jobs is to be able to build on previous experiments. For example, one may want to gradually increase the number of iterations, change hyperparameter ranges, change which hyperparameters to tune, or even tune the same model on a new dataset. In all these cases, it is desirable to re-use information from previous tuning jobs rather than starting from scratch. Since related tuning tasks are intuitively expected to be similar, this setting lends itself to some form of transfer learning. With warm start, AMT uses the results of previous tuning jobs to inform which combinations of hyperparameters to search over in the new tuning job. This can make the search for the best combination of hyperparameters more efficient.

Speeding up HP tuning with transfer learning is an active line of research. Most of this activity has focused on transferring HP knowledge across different datasets [3, 11, 14, 43, 44, 46, 49]. However, most of this previous work assumes the availability of some meta-data describing in which sense datasets differ from each other [11]. When designing warm start, we found this assumption prohibitively restrictive in practice. Computing meta-data in real-world predictive systems is challenging due to computational overhead or privacy reasons. We thus opted for a light-weight solution, purely based on past hyperparameter evaluations and requiring no access to meta-data. It should be noted that simple warm-starting typically works well when data is stationary, and can otherwise bias the surrogate model towards unpromising configurations. Future improvements could make the solution more robust to "negative transfer" arising from tasks that are not positively correlated.
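On the API side, a warm-started tuning job references one or more parent jobs. The sketch below uses the SageMaker Python SDK's warm-start types as I understand them (IDENTICAL_DATA_AND_ALGORITHM for re-tuning on the same data, TRANSFER_LEARNING when the dataset changes); the tuner arguments are abbreviated and the parent job name is a placeholder.

```python
from sagemaker.tuner import HyperparameterTuner, WarmStartConfig, WarmStartTypes

# Re-use evaluations from a previous tuning job of the same algorithm and data.
warm_start_config = WarmStartConfig(
    warm_start_type=WarmStartTypes.IDENTICAL_DATA_AND_ALGORITHM,
    parents={"parent-tuning-job-name"},   # placeholder job name
)

tuner = HyperparameterTuner(
    estimator=estimator,                  # assumed to be defined as in Section 3
    objective_metric_name="validation:accuracy",
    hyperparameter_ranges=hyperparameter_ranges,
    max_jobs=30,
    max_parallel_jobs=3,
    warm_start_config=warm_start_config,
)
tuner.fit(inputs)                         # new evaluations start from the parent's knowledge
```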
5.4 AMT for AutoML
AMT also serves as a key component of SageMaker Autopilot [7], the AutoML service of Amazon SageMaker. In Autopilot, users can upload a tabular dataset and the service will explore a complex search space, consisting of feature preprocessing, different ML algorithms and their hyperparameter spaces. For more details on how AMT is used in Autopilot we refer the readers to [7]. The application to Autopilot highlights how finding a good single ML model through hyperparameter optimization is still key for real-world applications, including in the context of AutoML. Many techniques, such as ensembling [6], can boost the accuracy by combining different models. However, when the number of models in an ensemble is large, inference time may become unacceptable, hindering practical usage. The application of AMT to Autopilot demonstrates the relevance of HPO in real-world applications, which require simple, robust and deployable models. Further, HPO is not incompatible with ensembling, and can be applied to build better ensembles.

6 USE CASES AND DEPLOYMENT RESULTS
We now consider a number of case studies highlighting the benefits of AMT, with respect to both its core algorithm and advanced functionalities. We first demonstrate the appeal of the BO strategy implemented in AMT over random search, and then turn to the empirical advantages of the advanced features we described in the previous section.

6.1 BO vs random search
A very common use case is tuning regularization hyperparameters. We demonstrate this by tuning the regularization terms alpha and lambda of the XGBoost algorithm on the direct marketing binary classification dataset from UCI to minimize the AUC. Figure 3 compares the performance of random search and BO, as implemented in AMT. Each experiment was replicated with 50 different random seeds, and we report the average and standard deviation. Specifically, from the left and middle plots in Figure 3, we can see that BO suggested better performing hyperparameters than random search. The right plot of Figure 3 shows that BO consistently outperforms random search across all numbers of hyperparameter evaluations.2 Although BO is more competitive in the sequential setting, AMT also offers random search. It is important to note that users find random search particularly useful in the distributed setting, where a large number of configurations are evaluated in parallel, reducing wall-clock time.

2A notebook to run this example on AWS SageMaker: https://github.com/awslabs/amazon-sagemaker-examples/tree/master/hyperparameter_tuning/xgboost_random_log.

Figure 3: Left: Hyperparameters suggested by random search. The x-axis and y-axis are the hyperparameter values for alpha and lambda, and the color of the points represents the quality of the hyperparameters (yellow: better AUC; violet: worse AUC). Middle: Hyperparameters suggested by AMT through BO. Right: Best model score obtained so far (y-axis, where lower is better) as more hyperparameter evaluations are performed (x-axis).

6.2 Log scaling
Figure 3 depicts a run of AMT with log scaling. While BO focuses on a small region of the search space, it still attempts to explore the rest of the search space. This suggests that well performing training jobs are found with low values of alpha. With log scaling, BO is steered towards these values and focuses on the most relevant region of the space. Models with larger learning capacity require more compute resources to perform the training phase (e.g., larger depth of trees in XGBoost leads to exponentially larger training time). Log scaling accelerates the search of good hyperparameter configurations, and also reduces the exploration of costly configurations, which usually correspond to large hyperparameter values.

As the choice of scaling requires in-depth knowledge of the internals of the learning algorithms, recommendations for common algorithms on SageMaker were made available. A lesson learned from the development of log scaling is to carefully consider all possible edge-case inputs to the service. We uncovered a non-trivial failure of AMT where a user would first run a tuning job with a linear scaling of a hyperparameter (e.g., in [0.0, 1.0]), and use it to warm start a new tuning job with log scaling enabled. While the value 0 could have been explored in the parent job, this becomes invalid in the child job. These non-trivial issues can only be uncovered through careful examination of all possible edge cases associated to a new functionality.

6.3 Early stopping
One of the main concerns with early stopping is that it could negatively impact the final objective values. Furthermore, the effect of early stopping is most noticeable on longer training jobs. We consider Amazon SageMaker's built-in linear learner algorithm on the Gdelt (Global Database of Events, Language and Tone) dataset in both single instance and distributed training mode.3 Specifically, we used the Gdelt dataset from multiple years in distributed mode and from a single year in single instance mode. Figure 4 compares a hyperparameter tuning job with and without early stopping. Each experiment was replicated 10 times, and the median of the best model score so far (in terms of absolute loss, lower is better) is shown on the y-axis, while the x-axis represents time. Each tuning job was launched with a budget of 100 hyperparameter configurations to explore. AMT with early stopping not only explores the same number of HP configurations in less time, but yields hyperparameter configurations with similar performance.

3http://www.gdeltproject.org/.

Figure 4: Absolute loss over time with and without early stopping when tuning linear learner on the Gdelt data, for single instance mode (left) and distributed mode (right). In both cases, early stopping achieves a similar loss in less time.

6.4 Warm start
It is common to update models regularly, typically with a different choice of the hyperparameter space, iteration count or dataset. This requires re-tuning the model hyperparameters. AMT's warm start offers a simple approach to learn from previous tuning tasks. We demonstrate this on the problem of building an image classifier and iteratively tuning it by running multiple hyperparameter tuning jobs. We focus on two simple use cases: running two sequential hyperparameter tuning jobs on the same algorithm and dataset, and launching a new tuning job on the same algorithm on an augmented dataset. We train Amazon SageMaker's built-in image classification algorithm on the Caltech-256 dataset, and tune its hyperparameters to maximize validation accuracy. Figure 5 shows the impact of warm starting from previous tuning jobs. Initially, there is a single tuning job. Once complete, we launch a new tuning job by selecting the previous job as the parent. The plot shows that the new tuning job (red dots) quickly detects good hyperparameter configurations thanks to the knowledge from the parent job (black dots). As the optimization progresses, the validation accuracy reaches 0.47, thus improving over 0.33, the best previous metric found by running the tuning job from scratch.

Warm start can also be applied across different datasets. We apply a range of data augmentations, including crop, color, and random transformations (i.e., image rotation, shear, and aspect ratio variations). To launch the last tuning job, we warm start from both previous jobs and run BO for 10 more iterations. The accuracy for the new tuning job (blue dots) improved again compared to the parent jobs (red and black dots), reaching 0.52. As expected, this feature particularly helped users who run their tuning jobs iteratively. Interestingly, we also found this feature to be a practical enabler for users who would like to tune their models for a very large number of evaluations. While this would be prohibitively expensive due to the cubical scaling of GPs, it can be achieved by running tuning jobs in a sequence, each time warm-starting from the previous one (e.g., with 500 HPO evaluations per tuning task).

Figure 5: Validation accuracy (y-axis) over time (x-axis) for the tuned image classifier on the Caltech-256 dataset. The black dots show some iterations of AMT when starting from scratch. Warm start allows you to keep tuning the same algorithm on the same data (red dots) and on a transformed version of the data obtained via data augmentations (blue dots), with the validation accuracy continuously improving.
6.5 Post launch performance
Many customers have adopted AMT as part of their ML workflows since the product launch in December 2017. In particular, four use cases by large enterprises for AMT are currently publicly presented as example use cases.4 Applications include automated bidding to consume or supply electrical energy, click-through rate and conversion rate predictions, malicious actor detection for DNS security, as well as general prototyping of novel ML applications. Note that all the listed applications involve business-critical data for customers of AMT, which highlights the importance of building a tuning service on top of a secure training platform to guarantee secure storage and processing of data. The generality of our container-based tuning service allows us to accommodate a wide variety of learning algorithms, such as custom neural-network-based solutions implemented in TensorFlow, or gradient boosting with decision trees.

An abundance of empirical data is continuously collected to monitor AMT's performance. For instance, API communication was above its service-level agreement for 99.99% of the time in 2020. A number of requested tuning jobs also showcased the scalability of the service. An example includes requests with 5 training jobs in parallel, each running on a cluster of 100 CPU machines of 4 cores and 16 GB of RAM. Other examples are individual training jobs involving clusters of 128 GPU accelerators. The overall service is able to process spikes of many hundreds of tuning jobs with no operational issues.

4https://aws.amazon.com/sagemaker/customers

7 RELATED WORK
Before concluding, we briefly review open-source solutions for gradient-free optimization in this section. Rather than providing an exhaustive list, we aim to give an overview of the tools available publicly. One of the earliest packages for Bayesian optimization using a GP as a surrogate model is Spearmint [51], where several important extensions including multi-task BO [57], input warping [52] and handling of unknown constraints [13] have been introduced. The same strategy has also been implemented in other open-source packages such as BayesianOptimization [38], scikit-optimize5 (easy to use when training scikit-learn algorithms) and Emukit [40]. Unlike using a GP as surrogate model, SMAC [21] uses a random forest, which makes it appealing for high-dimensional and discrete problems. With the growing popularity of deep learning frameworks, BO has also been implemented for all the major deep learning frameworks. BoTorch [2] is the BO implementation built on top of PyTorch [41] and GPyTorch [12], with an emphasis on a modular interface, support for scalable GPs as well as multi-objective BO. In TensorFlow6, there is GPflowOpt [28], which is the BO implementation dependent on GPflow [35]. Finally, HNAS [27] provides asynchronous BO, with and without multi-fidelity optimization, as part of AutoGluon.

5https://github.com/scikit-optimize
6https://www.tensorflow.org/

8 CONCLUSION
We presented SageMaker AMT, a fully-managed service to optimize gradient-free functions in the cloud. Powered by BO, SageMaker AMT brings together the state of the art for hyperparameter optimization and offers it as a highly scalable solution. We gave insights into its design principles, showed how it integrates with other SageMaker components, and shared practice for designing a fault-tolerant system in the cloud. Through a set of real-world use cases, we showed that AMT is an effective tool for HPO. It also offers a set of advanced features, such as automatic early stopping and warm-starting from previous tuning jobs, which demonstrably speed up the search of good hyperparameter configurations. Beyond performance metrics, several ML applications involve optimizing multiple constraints and alternative metrics at the same time, such as maximum memory usage, inference latency or fairness [15, 29, 42, 45]. In the future, AMT could be extended to optimize multiple objectives simultaneously, automatically suggesting hyperparameter configurations that are optimal along several criteria and searching for the Pareto frontier of the multiple objectives.

ACKNOWLEDGMENTS
AMT has contributions from many members of the SageMaker team, notably Ralf Herbrich, Leo Dirac, Michael Brueckner, Viktor Kubinec, Anne Milbert, Choucri Bechir, Enrico Sartoriello, Thibaut Lienart, Furkan Bozkurt, Ilyes Khamlichi, Yifan Su, Adnan Kukuljac, Ugur Adiguzel, Mahdi Heidari.

REFERENCES

[1] J.-A. M. Assael, Z. Wang, and N. de Freitas. Heteroscedastic treed Bayesian optimisation. Technical report, preprint arXiv:1410.7172, 2014.
[2] M. Balandat, B. Karrer, D. R. Jiang, S. Daulton, B. Letham, A. G. Wilson, and E. Bakshy. BoTorch: Programmable Bayesian Optimization in PyTorch. arXiv:1910.06403, Oct. 2019.
[3] R. Bardenet, M. Brendel, B. Kégl, and M. Sebag. Collaborative hyperparameter tuning. In ICML, 2013.
[4] J. Bergstra and Y. Bengio. Random search for hyper-parameter optimization. Journal of Machine Learning Research (JMLR), 13, 2012.
[5] E. Brochu, V. M. Cora, and N. De Freitas. A tutorial on Bayesian optimization of expensive cost functions, with application to active user modeling and hierarchical reinforcement learning. Technical report, preprint arXiv:1012.2599, 2010.
[6] R. Caruana, A. Niculescu-Mizil, G. Crew, and A. Ksikes. Ensemble selection from libraries of models. ICML, 2004.
[7] P. Das, V. Perrone, N. Ivkin, T. Bansal, Z. Karnin, H. Shen, I. Shcherbatyi, Y. Elor, W. Wu, A. Zolic, T. Lienart, A. Tang, A. Ahmed, J. B. Faddoul, R. Jenatton, F. Winkelmolen, P. Gautier, L. Dirac, A. Perunicic, M. Miladinovic, G. Zappella, C. Archambeau, M. Seeger, B. Dutt, and L. Rouesnel. Amazon SageMaker Autopilot: a white box AutoML solution at scale. arXiv:2012.08483, 2020.
[8] T. Domhan, T. Springenberg, and F. Hutter. Extrapolating learning curves of deep neural networks. In ICML AutoML Workshop, 2014.
[9] S. Falkner, A. Klein, and F. Hutter. BOHB: Robust and efficient hyperparameter optimization at scale. In ICML, pages 1436–1445, 2018.
[10] M. Feurer and F. Hutter. Hyperparameter optimization. In F. Hutter, L. Kotthoff, and J. Vanschoren, editors, AutoML: Methods, Systems, Challenges, chapter 1. Springer, 2019.
[11] M. Feurer, T. Springenberg, and F. Hutter. Initializing Bayesian hyperparameter optimization via meta-learning. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, 2015.
[12] J. R. Gardner, G. Pleiss, D. Bindel, K. Q. Weinberger, and A. G. Wilson. GPyTorch: Blackbox matrix-matrix Gaussian process inference with GPU acceleration. In NeurIPS, 2018.
[13] M. A. Gelbart, J. Snoek, and R. P. Adams. Bayesian optimization with unknown constraints. In Proceedings of the Thirtieth Conference on Uncertainty in Artificial Intelligence, pages 250–259, 2014.
[14] D. Golovin, B. Solnik, S. Moitra, G. Kochanski, J. Karro, and D. Sculley. Google Vizier: A service for black-box optimization. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1487–1495, 2017.
[15] G. Guinet, V. Perrone, and C. Archambeau. Pareto-efficient acquisition functions for cost-aware Bayesian optimization. NeurIPS Meta Learning Workshop, 2020.
[16] N. Hansen and A. Ostermeier. Completely derandomized self-adaptation in evolution strategies. Evolutionary Computation, 9(2), 2001.
[17] P. Hennig and C. J. Schuler. Entropy search for information-efficient global optimization. Journal of Machine Learning Research (JMLR), 2012.
[18] J. M. Hernández-Lobato, M. W. Hoffman, and Z. Ghahramani. Predictive entropy search for efficient global optimization of black-box functions. Technical report, preprint arXiv:1406.2541, 2014.
[19] M. Hoffman, B. Shahriari, and N. de Freitas. On correlation and budget constraints in model-based bandit optimization with application to automatic machine learning. In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS), pages 365–374, 2014.
[20] F. Hutter, H. H. Hoos, and K. Leyton-Brown. Sequential model-based optimization for general algorithm configuration. In Proceedings of LION-5, pages 507–523, 2011.
[21] F. Hutter, H. H. Hoos, and K. Leyton-Brown. Sequential model-based optimization for general algorithm configuration. In International Conference on Learning and Intelligent Optimization. Springer, 2011.
[22] M. Jaderberg, V. Dalibard, S. Osindero, W. M. Czarnecki, J. Donahue, A. Razavi, O. Vinyals, T. Green, I. Dunning, K. Simonyan, et al. Population based training of neural networks. Technical report, preprint arXiv:1711.09846, 2017.
[23] K. Jamieson and A. Talwalkar. Non-stochastic best arm identification and hyperparameter optimization. Technical report, preprint arXiv:1502.07943, 2015.
[24] D. R. Jones, M. Schonlau, and W. J. Welch. Efficient global optimization of expensive black-box functions. Journal of Global Optimization, 1998.
[25] Z. Karnin, T. Koren, and O. Somekh. Almost optimal exploration in multi-armed bandits. In ICML, 2013.
[26] A. Klein, S. Falkner, J. T. Springenberg, and F. Hutter. Learning curve prediction with Bayesian neural networks. In International Conference on Learning Representations (ICLR), volume 17, 2017.
[27] A. Klein, L. Tiao, T. Lienart, C. Archambeau, and M. Seeger. Model-based asynchronous hyperparameter and neural architecture search. arXiv preprint arXiv:2003.10865, 2020.
[28] N. Knudde, J. van der Herten, T. Dhaene, and I. Couckuyt. GPflowOpt: A Bayesian optimization library using TensorFlow. arXiv preprint arXiv:1711.03845, 2017.
[29] E. H. Lee, V. Perrone, C. Archambeau, and M. Seeger. Cost-aware Bayesian optimization. In ICML AutoML Workshop, 2020.
[30] L. Li, K. Jamieson, G. DeSalvo, A. Rostamizadeh, and A. Talwalkar. Hyperband: A novel bandit-based approach to hyperparameter optimization. Technical report, preprint arXiv:1603.06560, 2016.
[31] L. Li, K. Jamieson, A. Rostamizadeh, E. Gonina, M. Hardt, B. Recht, and A. Talwalkar. Massively parallel hyperparameter tuning. Technical Report 1810.05934v4 [cs.LG], ArXiv, 2019.
[32] E. Liberty, Z. Karnin, B. Xiang, L. Rouesnel, B. Coskun, R. Nallapati, J. Delgado, A. Sadoughi, Y. Astashonok, P. Das, C. Balioglu, S. Chakravarty, M. Jha, P. Gautier, D. Arpin, T. Januschowski, V. Flunkert, Y. Wang, J. Gasthaus, L. Stella, S. Rangapuram, D. Salinas, S. Schelter, and A. Smola. Elastic machine learning algorithms in Amazon SageMaker. In Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data, 2020.
[33] D. J. Lizotte. Practical Bayesian optimization. PhD thesis, University of Alberta, 2008.
[34] D. J. C. Mackay. Information Theory, Inference and Learning Algorithms. Cambridge University Press, 2003.
[35] A. G. d. G. Matthews, M. van der Wilk, T. Nickson, K. Fujii, A. Boukouvalas, P. León-Villagrá, Z. Ghahramani, and J. Hensman. GPflow: A Gaussian process library using TensorFlow. Journal of Machine Learning Research, 18(40), Apr. 2017.
[36] J. Mockus, V. Tiesis, and A. Zilinskas. The application of Bayesian methods for seeking the extremum. Towards Global Optimization, 1978.
[37] I. Murray and R. P. Adams. Slice sampling covariance hyperparameters of latent Gaussian models. In NeurIPS, pages 1732–1740, 2010.
[38] F. Nogueira. Bayesian optimization: Open source constrained global optimization tool for Python, 2014.
[39] M. A. Osborne, R. Garnett, and S. J. Roberts. Gaussian processes for global optimization. In Proceedings of the 3rd Learning and Intelligent OptimizatioN Conference (LION 3), 2009.
[40] A. Paleyes, M. Pullin, M. Mahsereci, N. Lawrence, and J. González. Emulation of physical processes with Emukit. In Second Workshop on Machine Learning and the Physical Sciences, NeurIPS, 2019.
[41] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer. Automatic differentiation in PyTorch. In NIPS-W, 2017.
[42] V. Perrone, M. Donini, K. Kenthapadi, and C. Archambeau. Fair Bayesian optimization. In ICML AutoML Workshop, 2020.
[43] V. Perrone, R. Jenatton, M. Seeger, and C. Archambeau. Multiple adaptive Bayesian linear regression for scalable Bayesian optimization with warm start. NeurIPS Meta Learning Workshop, 2017.
[44] V. Perrone, R. Jenatton, M. W. Seeger, and C. Archambeau. Scalable hyperparameter transfer learning. NeurIPS, 2018.
[45] V. Perrone, I. Shcherbatyi, R. Jenatton, C. Archambeau, and M. Seeger. Constrained Bayesian optimization with max-value entropy search. In NeurIPS Meta Learning Workshop, 2019.
[46] V. Perrone, H. Shen, M. W. Seeger, C. Archambeau, and R. Jenatton. Learning search spaces for Bayesian optimization: Another view of hyperparameter transfer learning. NeurIPS, 32, 2019.
[47] C. Rasmussen and C. Williams. Gaussian Processes for Machine Learning. MIT Press, 2006.
[48] E. Real, C. Liang, D. So, and Q. Le. Evolving machine learning algorithms from scratch. In ICML, 2020.
[49] D. Salinas, H. Shen, and V. Perrone. A quantile-based approach for hyperparameter transfer learning. ICML, 2020.
[50] B. Shahriari, K. Swersky, Z. Wang, R. P. Adams, and N. de Freitas. Taking the human out of the loop: A review of Bayesian optimization. IEEE, 2016.
[51] J. Snoek, H. Larochelle, and R. P. Adams. Practical Bayesian optimization of machine learning algorithms. In NeurIPS, pages 2960–2968, 2012.
[52] J. Snoek, K. Swersky, R. S. Zemel, and R. P. Adams. Input warping for Bayesian optimization of non-stationary functions. Technical report, preprint arXiv:1402.0929, 2014.
[53] I. M. Sobol. On the distribution of points in a cube and the approximate evaluation of integrals. USSR Computational Mathematics and Mathematical Physics, 7(4), 1967.
[54] J. T. Springenberg, A. Klein, S. Falkner, and F. Hutter. Bayesian optimization with robust Bayesian neural networks. In NeurIPS, pages 4134–4142, 2016.
[55] N. Srinivas, A. Krause, S. Kakade, and M. Seeger. Gaussian process optimization in the bandit setting: No regret and experimental design. ICML, pages 1015–1022, 2010.
[56] N. Srinivas, A. Krause, S. Kakade, and M. Seeger. Information-theoretic regret bounds for Gaussian process optimization in the bandit setting. IEEE Transactions on Information Theory, 58, 2012.
[57] K. Swersky, J. Snoek, and R. P. Adams. Multi-task Bayesian optimization. In NeurIPS, pages 2004–2012, 2013.
[58] K. Swersky, J. Snoek, and R. P. Adams. Freeze-thaw Bayesian optimization. Technical report, preprint arXiv:1406.3896, 2014.
[59] W. R. Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, pages 285–294, 1933.
[60] V. Vapnik. The Nature of Statistical Learning Theory. Springer Science & Business Media, 2013.
[61] J. Villemonteix, E. Vazquez, and E. Walter. An informational approach to the global optimization of expensive-to-evaluate functions. Journal of Global Optimization, 44(4):509–534, 2009.
[62] Z. Wang and S. Jegelka. Max-value entropy search for efficient Bayesian optimization. ICML, 70, 2017.