Learning to Rank Learning Curves

Martin Wistuba 1 Tejaswini Pedapati 1

1 IBM Research. Correspondence to: Martin Wistuba.

Abstract

Many automated methods, such as those for hyperparameter and neural architecture optimization, are computationally expensive because they involve training many different model configurations. In this work, we present a new method that saves computational budget by terminating poor configurations early on in the training. In contrast to existing methods, we consider this task as a ranking and transfer learning problem. We qualitatively show that by optimizing a pairwise ranking loss and leveraging learning curves from other datasets, our model is able to effectively rank learning curves without having to observe many or very long learning curves. We further demonstrate that our method can be used to accelerate a neural architecture search by a factor of up to 100 without a significant performance degradation of the discovered architecture. In further experiments we analyze the quality of the ranking, the influence of different model components, as well as the predictive behavior of the model.

1. Introduction

A method commonly used by human experts to speed up the optimization of neural architectures or hyperparameters is the early termination of iterative training processes that are unlikely to improve the current solution. A common technique to determine the likelihood of no improvement is to compare the learning curve of a new configuration to that of the currently best configuration. This idea can also be used to speed up automated machine learning processes. For this purpose, it is common practice to extrapolate the partial learning curve in order to predict the final performance of the currently investigated model. Current extrapolation techniques have several weaknesses that keep them from realizing their full potential in practice. Many of the methods require sufficient sample learning curves to make reliable predictions (Chandrashekaran & Lane, 2017; Klein et al., 2017; Baker et al., 2018). Thus, the extrapolation method cannot be used yet for the first candidates, which means more computational effort. Other methods do not have this disadvantage but require sufficiently long learning curves to make reliable predictions, which again means unnecessary overhead (Domhan et al., 2015). Many of these methods also do not take into account other information such as the hyperparameters of the model being examined or its network architecture.

We address the need for sample learning curves by devising a transfer learning technique that uses learning curves from other problems. Since the range of accuracy varies from dataset to dataset, we have to account for this in our modeling. But since we are not interested in predicting the performance of a model anyway, we use a ranking model that models the probability that the model currently being investigated surpasses the best solution so far.

In order to make reliable predictions for short learning curves, we consider further characteristics of the model such as its network architecture. We compare our ranking method, with respect to a ranking measure, against different methods on five image classification and four tabular regression datasets. We also show that our method is capable of significantly accelerating neural architecture search (NAS) and hyperparameter optimization. Furthermore, we conduct several ablation studies to provide a better motivation of our model and its behavior.

2. Related work

Most of the prior work on learning curve prediction is based on the idea of extrapolating the partial learning curve using a combination of continuously increasing basic functions. Domhan et al. (2015) define a set of 11 parametric basic functions, estimate their parameters, and combine them in an ensemble. Klein et al. (2017) propose a heteroscedastic Bayesian model which learns a weighted average of the basic functions. Chandrashekaran & Lane (2017) do not use basic functions but previously observed learning curves of the current dataset. An affine transformation for each previously seen learning curve is estimated by minimizing the mean squared error with respect to the partial learning curve. The best fitting extrapolations are averaged as the final prediction. Baker et al. (2018) use a different procedure. They use support vector machines as sequential regressive models to predict the final accuracy based on features extracted from the learning curve, its gradients, and the neural architecture itself.

The predictor by Domhan et al. (2015) is able to forecast without having seen any learning curve before, but it requires observing more epochs for accurate predictions. The model by Chandrashekaran & Lane (2017) requires seeing a few learning curves to extrapolate future learning curves. However, accurate forecasts are already possible after a few epochs. Algorithms proposed by Klein et al. (2017) and Baker et al. (2018) need to observe many full-length learning curves before providing any useful forecasts. However, this is prohibitive in scenarios where learning is time-consuming, such as for large convolutional neural networks (CNNs).

All previous methods for automatically terminating iterative learning processes are based on methods that predict the learning curve. Ultimately, however, we are less interested in the exact learning curve than in whether the current learning curve leads to a better result. This way of obtaining a ranking is referred to as a pointwise ranking method (Liu, 2011). Pointwise methods have proven to be less efficient than pairwise ranking methods, which directly optimize for the objective function (Burges et al., 2005). Yet, we are the first to consider a pairwise ranking loss for this application.

There are also some bandit-based methods which leverage early termination. Successive Halving (Jamieson & Talwalkar, 2016) trains multiple models with settings chosen at random simultaneously and terminates the worst performing half at predefined intervals. Hyperband (Li et al., 2017) identifies that the choice of the intervals is vital and therefore proposes to run Successive Halving with different intervals. BOHB (Falkner et al., 2018) is an extension of Hyperband which proposes to replace the random selection of settings with Bayesian optimization.

Methods that terminate less promising iterative training processes early are of particular interest in the field of automated machine learning, as they can, for example, significantly accelerate hyperparameter optimization. The most computationally intensive subproblem of automated machine learning is neural architecture search (Wistuba et al., 2019), the optimization of the neural network topology. Our method is not limited to this problem, but as it is currently one of the major challenges in automated machine learning, we use it as a sample application in our evaluation. For this particular application, optimization methods that leverage parameter sharing (Pham et al., 2018) have become established as a standard approach. Here, an overparameterized neural network spanning the search space is learned and finally used to determine the best architecture. DARTS (Liu et al., 2019) is one of the latest extensions of this idea; it uses a continuous relaxation of the search space, which allows learning the shared parameters as well as the architecture by means of gradient descent. These methods can be a strong alternative to early termination, but they do not transfer to general machine learning, and recent research implies that this approach does not work for arbitrary architecture search spaces (Yu et al., 2020).

3. Learning curve ranking

With learning curve we refer to the function of qualitative performance over the growing number of iterations of an iterative learning algorithm. We use the term final learning curve to explicitly denote the entire learning curve, $y_1, \ldots, y_L$, reflecting the training process from beginning to end. Here, $y_i$ is a measure of the performance of the model (e.g., classification accuracy), which is determined at regular intervals. By contrast, a partial learning curve, $y_1, \ldots, y_l$, refers to a learning curve that is observed only up to a time $l$. We visualize these concepts in Figure 1.

[Figure 1: Learning curve prediction tries to predict the final performance from the partial learning curve (solid line). Axes: validation accuracy over iterations; markers indicate the intermediate and the final performance.]

There is a set of automated early termination methods which follow broadly the same principle. The first model is trained to completion and is considered the current best model $m^{\max}$ with its performance being $y^{\max}$. For each further model $m_i$, the probability that it is better than the current best model, $p(m_i > m^{\max})$, is monitored at periodic intervals during training. If it is below a given threshold, the model's training is terminated (see Algorithm 1). All existing methods rely on a two-step approach to determine this probability, which involves extrapolating the partial learning curve and a heuristic measure. Instead of this two-step process, we propose LCRankNet to predict the probability that a model $m_i$ is better than $m_j$ directly. LCRankNet is based on a neural network $f$ which considers model characteristics and the partial learning curve $x_i$ as input. We define the probability that $m_i$ is better than $m_j$ as

$$p(m_i > m_j) = \hat{p}_{i,j} = \frac{e^{f(x_i) - f(x_j)}}{1 + e^{f(x_i) - f(x_j)}}. \qquad (1)$$

Using the logistic function is an established modelling approach in the learning-to-rank community (Burges et al., 2005). Given a set of final learning curves for some models, the estimation of the values $p_{i,j}$ between these models is trivial since we know whether $m_i$ is better than $m_j$ or not. Therefore, we set

$$p_{i,j} = \begin{cases} 1 & \text{if } m_i > m_j \\ 0.5 & \text{if } m_i = m_j \\ 0 & \text{if } m_i < m_j. \end{cases} \qquad (2)$$

We minimize the cross-entropy loss

$$\mathcal{L}_{\mathrm{ce}} = \sum_{i,j} -p_{i,j} \log \hat{p}_{i,j} - (1 - p_{i,j}) \log(1 - \hat{p}_{i,j}) \qquad (3)$$

to determine the parameters of $f$. Now $p(m_i > m_j)$ can be predicted for arbitrary pairs $m_i$ and $m_j$ using Equation (1). A model $f_l$ is trained on all partial learning curves of length $l$ to predict the probabilities of curves of length $l$.

Algorithm 1 Early Termination Method
Input: Dataset $d$, model $m$, performance $y^{\max}$ of the best model $m^{\max}$ so far.
Output: Learning curve.
1: for $l \leftarrow 1 \ldots L$ do
2:   Train $m$ on $d$ for a step and observe a further part $y_l$ of the learning curve.
3:   if $\max_{1 \leq i \leq l} y_i > y^{\max}$ then
4:     continue
5:   else if $p(m > m^{\max}) \leq \delta$ then
6:     return $y$
7:   end if
8: end for
9: return $y$
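To make the pairwise objective concrete, the following sketch (our illustration, not the authors' released code) evaluates Equations (1)-(3) with a stand-in linear scoring function in place of the neural network $f$; the names `score`, `p_better`, and `pairwise_loss` are hypothetical.

```python
# Minimal sketch of the pairwise ranking objective (Equations (1)-(3)).
# The linear `score` below is only a stand-in for the neural network f
# described in Section 3.1; all names here are illustrative.
import numpy as np

def score(x, w):
    """Stand-in scoring function f(x)."""
    return float(np.dot(w, x))

def p_better(x_i, x_j, w):
    """Equation (1): probability that model i is better than model j."""
    d = score(x_i, w) - score(x_j, w)
    return 1.0 / (1.0 + np.exp(-d))  # equals e^d / (1 + e^d)

def pairwise_loss(pairs, w):
    """Equation (3): cross-entropy over labeled pairs (x_i, x_j, p_ij),
    where p_ij is set according to Equation (2)."""
    loss = 0.0
    for x_i, x_j, p_ij in pairs:
        p_hat = p_better(x_i, x_j, w)
        loss -= p_ij * np.log(p_hat) + (1.0 - p_ij) * np.log(1.0 - p_hat)
    return loss

# Toy usage: model i reached the higher final accuracy, so p_ij = 1.
x_i, x_j = np.array([0.8, 0.7]), np.array([0.6, 0.5])
w = np.zeros(2)
print(p_better(x_i, x_j, w))                # 0.5 before training
print(pairwise_loss([(x_i, x_j, 1.0)], w))  # ~0.693
```

In the paper, the parameters of $f$ are fit by minimizing this loss with Adam, jointly with the reconstruction term of Equation (4) introduced under "Technical details" in Section 3.1.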

3.1. Ranking model to learn across datasets

We have defined the prediction of our model in Equation (1). It depends on the outcome of the function $f$ which takes a model representation $x$ as input. The information contained in this representation depends on the task at hand. The representation consists of up to four different parts. First, the partial learning curve $y_1, \ldots, y_l$. Second, the description of the model's architecture, which is a sequence of strings representing the layers it is comprised of. Third, a dataset ID to indicate on which dataset the corresponding architecture and learning curve was trained and observed. Fourth, all remaining hyperparameters, e.g., the learning rate or batch size.

We model $f$ using a neural network and use special layers to process the different parts of the representation. The learned representations for the different parts are finally concatenated and fed to a fully connected layer. The architecture is visualized in Figure 2. We will now describe the different components in detail.

[Figure 2: LCRankNet has four different components, each dealing with one type of input: architecture encoding (embeddings summarized by an LSTM), partial learning curve $y_{1,\ldots,l}$ (convolutions with kernel sizes $k = 2..5$ followed by max pooling), dataset ID (embedding), and further hyperparameters; the outputs feed into a fully connected layer.]

Learning curve component. A CNN is used to process the partial learning curve. We consider up to four different convolutional layers with different kernel sizes. The exact number depends on the length of the learning curve, since we consider no kernel sizes larger than the learning curve length. Each convolution is followed by a global max pooling layer and their outputs are concatenated. This results in the learned representation for the learning curve. In the appendix we analyze alternative modeling options.

Architecture component. In this paper we consider two experiments for learning curve prediction. In one we deal with NAS in the NASNet search space (Zoph et al., 2018); in the other we work on the joint optimization of architecture and hyperparameters on tabular data (Klein & Hutter, 2019). In the former experiment, the search space consists of CNNs, which consist of two cells with five blocks each. This means that 60 decisions about the architecture have to be made. In the second experiment, the fully connected neural networks (FCNNs) consist of only two layers, so the number of decisions is only 6. For more details refer to the appendix. We learn an embedding for every option. An LSTM takes these embeddings and generates the architecture embedding.

Dataset component. As we intend to learn from learning curves of other datasets, we include dataset embeddings as one of the components. Every dataset has its own embedding, and it is used whenever a learning curve observed on this dataset is selected. If the dataset is new, then the corresponding embedding is initialized at random. As the model observes more learning curves from this dataset, it improves the dataset embedding. The embedding helps us to model dataset-specific peculiarities in order to adjust the ranking if necessary.

Hyperparameter component. Other hyperparameters, if available, are passed directly to the last layer.
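To make the composition concrete, here is a rough sketch of such a scoring network in PyTorch. It is our reading of Figure 2, not the released implementation: the class name, all layer sizes, and the default kernel sizes $k = 2..5$ are assumptions, and the sequence-to-sequence reconstruction regularizer described below under "Technical details" is omitted.

```python
# Sketch of an LCRankNet-style scoring network f (our illustration of Figure 2,
# not the authors' code). Layer sizes are assumptions; the reconstruction
# regularizer of Equation (4) is omitted.
import torch
import torch.nn as nn

class LCRankNetScore(nn.Module):
    def __init__(self, n_arch_tokens, n_datasets, n_hyperparams,
                 emb_dim=32, n_filters=16):
        super().__init__()
        # Learning curve component: convolutions with kernel sizes 2..5,
        # each followed by global max pooling (assumes curve length >= 5;
        # the paper caps kernel sizes at the observed curve length).
        self.convs = nn.ModuleList(
            [nn.Conv1d(1, n_filters, kernel_size=k) for k in range(2, 6)])
        # Architecture component: embedding per decision, summarized by an LSTM.
        self.arch_emb = nn.Embedding(n_arch_tokens, emb_dim)
        self.arch_lstm = nn.LSTM(emb_dim, emb_dim, batch_first=True)
        # Dataset component: one learned embedding per dataset ID.
        self.dataset_emb = nn.Embedding(n_datasets, emb_dim)
        # Fully connected layer over the concatenated representation.
        self.out = nn.Linear(4 * n_filters + 2 * emb_dim + n_hyperparams, 1)

    def forward(self, curve, arch_tokens, dataset_id, hyperparams):
        # curve: (batch, 1, l), arch_tokens: (batch, seq), dataset_id: (batch,)
        curve_repr = torch.cat(
            [conv(curve).max(dim=2).values for conv in self.convs], dim=1)
        _, (hidden, _) = self.arch_lstm(self.arch_emb(arch_tokens))
        features = torch.cat(
            [curve_repr, hidden[-1], self.dataset_emb(dataset_id), hyperparams],
            dim=1)
        return self.out(features).squeeze(1)  # ranking score f(x)

# Toy forward pass: batch of 2, curve length 10, 6 architecture decisions.
net = LCRankNetScore(n_arch_tokens=20, n_datasets=5, n_hyperparams=3)
scores = net(torch.rand(2, 1, 10), torch.randint(0, 20, (2, 6)),
             torch.tensor([0, 1]), torch.rand(2, 3))
print(scores.shape)  # torch.Size([2])
```

The pairwise probability of Equation (1) is then obtained from the difference of two such scores.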

Technical details. During the development process, we found that the architecture component leads to instabilities during training. To avoid this, we regularize the output of the LSTM layer by using it as the input to an autoregressive model, similar to a sequence-to-sequence model (Sutskever et al., 2014), that recreates the original description of the architecture. In addition, we use the attention mechanism (Bahdanau et al., 2015) to facilitate this process. All parameters of the layers in $f$ are trained jointly by means of Adam (Kingma & Ba, 2015) by minimizing

$$\mathcal{L} = \alpha \mathcal{L}_{\mathrm{ce}} + (1 - \alpha) \mathcal{L}_{\mathrm{rec}}, \qquad (4)$$

a weighted linear combination of the ranking loss (Equation (3)) and the reconstruction loss, with $\alpha = 0.8$. The hyperparameter $\delta$ allows for trading precision vs. recall, or search time vs. regret, respectively. There are two extreme cases: if we set $\delta \geq 1$, every run is terminated immediately; if we set $\delta \leq 0$, we never terminate any run. For our experiments we set $\delta = 0.45$, which means that if the predicted probability that the new model is better than the best one is below 45%, the run is terminated early.
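As a small illustration (not the authors' code), the weighting of Equation (4) and the termination rule controlled by $\delta$ amount to the following; the function names are hypothetical.

```python
# Illustration of Equation (4) and of the delta-controlled termination rule.
# `ranking_loss` and `reconstruction_loss` stand for L_ce and L_rec.
def combined_loss(ranking_loss, reconstruction_loss, alpha=0.8):
    """Equation (4): weighted combination of ranking and reconstruction loss."""
    return alpha * ranking_loss + (1.0 - alpha) * reconstruction_loss

def should_terminate(p_better_than_best, delta=0.45):
    """Terminate a run whose predicted probability of beating the incumbent
    is at most delta (delta >= 1 stops every run, delta <= 0 stops none)."""
    return p_better_than_best <= delta

print(combined_loss(0.7, 0.4))   # 0.64
print(should_terminate(0.30))    # True
print(should_terminate(0.60))    # False
```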

4. Experiments

In this section, we first discuss how to create the meta-knowledge and then analyze our model in terms of learning curve ranking and its ability to accelerate NAS. Finally, we examine its individual components and behavior in certain scenarios. Furthermore, we show that LCRankNet can be combined with various optimization methods and used for hyperparameter optimization tasks.

4.1. Meta-knowledge

We compare our method to similar methods on five different datasets: CIFAR-10, CIFAR-100, Fashion-MNIST, Quickdraw, and SVHN. We use the original train/test splits if available. Quickdraw has a total of 50 million data points and 345 classes. To reduce the training time, we select a subset of this dataset: we use 100 different randomly selected classes and choose 300 examples per class for the training split and 100 per class for the test split. 5,000 random data points of the training dataset serve as validation split for all datasets.

To create the meta-knowledge, we choose 200 architectures per dataset at random from the NASNet search space (Zoph et al., 2018), such that we train a total of 1,000 architectures. We would like to point out that these are 1,000 unique architectures; there is no architecture that has been trained on several different datasets. Each architecture is trained for 100 epochs with stochastic gradient descent and a cosine learning rate schedule without restart (Loshchilov & Hutter, 2017).

We use standard image preprocessing and augmentation: for every image, we first subtract the channel mean and then divide by the channel standard deviation. Images are padded by a margin of four pixels and randomly cropped back to the original dimension. For all datasets but SVHN we apply random horizontal flipping. Additionally, we use Cutout (DeVries & Taylor, 2017).

For the experiments in Section 4.6 we rely on the tabular benchmark (Klein & Hutter, 2019). This benchmark contains meta-data for four regression tasks. In total, 62,208 different FCNNs with different architectures and hyperparameters are evaluated. 100 settings per dataset are chosen at random as meta-knowledge.

The following experiments are conducted in a leave-one-dataset-out cross-validation. That means when considering one dataset, all meta-knowledge but the one for this particular dataset is used.

Table 1. Results obtained by the different methods on five different datasets. For both metrics, smaller is better. Regret (Regr.) is reported in percent, time in GPU hours.

Method                        | CIFAR-10     | CIFAR-100    | Fashion      | Quickdraw    | SVHN
                              | Regr.  Time  | Regr.  Time  | Regr.  Time  | Regr.  Time  | Regr.  Time
No early termination          | 0.00   1023  | 0.00   1021  | 0.00   1218  | 0.00   1045  | 0.00   1485
Domhan et al. (2015)          | 0.56   346   | 0.82   326   | 0.00   460   | 0.44   331   | 0.28   471
Li et al. (2017)              | 0.22   106   | 0.78   102   | 0.32   132   | 0.54   109   | 0.00   156
Baker et al. (2018)           | 0.00   89    | 0.00   77    | 0.00   129   | 0.00   107   | 0.00   241
Jamieson & Talwalkar (2016)   | 0.62   62    | 0.00   54    | 0.18   70    | 0.40   60    | 0.28   88
Chandrashekaran & Lane (2017) | 0.62   30    | 0.00   35    | 0.28   41    | 0.30   82    | 0.06   164
LCRankNet                     | 0.22   20    | 0.00   11    | 0.10   19    | 0.00   28    | 0.10   74

4.2. Ranking performance

First, we analyze the quality of the learning curve rankings produced by different learning curve prediction methods. In this experiment we choose 50 different learning curves at random as a test set. Five random learning curves are used as a training set for every repetition. Each learning curve prediction method ranks the 50 architectures by observing the partial learning curve, whose length varies from 0 to 30 percent of the final learning curve. We repeat this experiment ten times and report the mean and standard deviation of the correlation between the true and the predicted ranking in Figure 3. As a correlation metric, we use Spearman's rank correlation coefficient. Thus, the correlation is 1 for a perfect ranking and 0 for an uncorrelated, random ranking.

[Figure 3: Spearman correlation between the true and the predicted ranking over the observed fraction of the learning curve (0-30%) on CIFAR-10, CIFAR-100, Fashion-MNIST, Quick, Draw!, and SVHN, comparing Baker et al., Chandrashekaran & Lane, Domhan et al., the last seen value, and LCRankNet. We report the mean of ten repetitions; the shaded area is the standard deviation. Our method LCRankNet outperforms its competitors on all datasets.]
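As a pointer for reproducing the metric, the sketch below (assuming SciPy is available; the numbers are made up) computes Spearman's rank correlation between true final accuracies and the scores a learning curve predictor assigns after seeing a partial curve.

```python
# Spearman's rank correlation between the true final accuracies and the
# predictor's scores; 1.0 means a perfect ranking, ~0 a random one.
from scipy.stats import spearmanr

true_final_accuracy = [0.91, 0.87, 0.93, 0.80, 0.89]   # hypothetical values
predicted_scores    = [0.55, 0.40, 0.70, 0.10, 0.45]   # e.g., f(x) per model

rho, _ = spearmanr(true_final_accuracy, predicted_scores)
print(rho)  # 1.0 here, since the predicted order matches the true order
```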
Our method LCRankNet shows a higher correlation for all datasets. If there are no or only very short partial learning curves available, our method shows the biggest difference to the existing methods. The reason for this is the combination of considering the network architecture together with additional meta-knowledge. We analyze the impact of each component in detail in Section 4.5.

The method by Chandrashekaran & Lane (2017) consistently shows the second best results and in some cases can catch up to the results of our method. The method by Baker et al. (2018) stands out due to its high standard deviation. It is by far the method with the smallest squared error on test. However, the smallest changes in the prediction lead to a significantly different ranking, which explains the high variance in their results. The method by Domhan et al. (2015) requires a minimum length of the learning curve to make predictions. Accordingly, we observe rank correlation values starting from a learning curve length of 4%. Using the last seen value to determine the ranking of learning curves is a simple yet efficient method (Klein et al., 2017). In fact, it is able to outperform some of the more elaborate methods.

4.3. Accelerating random neural architecture search

In this experiment, we demonstrate the utility of learning curve predictors in the search for network architectures. For the sake of simplicity, we accelerate a random search in the NASNet search space (Zoph et al., 2018).

The random search samples 200 models and trains each of them for 100 epochs to obtain the final accuracy. In the end, the best of all these models is returned. Now each learning curve predictor iterates over these sampled architectures in the same order and determines at every third epoch whether the training should be aborted. For Successive Halving (Jamieson & Talwalkar, 2016) and Hyperband (Li et al., 2017) we follow the algorithm defined by the authors and use the recommended settings. The current best model discovered after iterating over all 200 architectures is returned as the best model for this learning curve predictor. The goal of the methods is to minimize the regret compared to a random search without early stopping and, at the same time, to reduce the computational effort.
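The following sketch (our illustration, not the authors' code) shows the shape of this accelerated random search: train each sampled architecture epoch by epoch, let the ranker abort at every third epoch, and keep the best result. The callables `train_epoch` and `prob_better_than_best` are hypothetical stand-ins for the actual training step and for LCRankNet's probability estimate.

```python
# Random search with early termination (sketch under assumed helpers).
def accelerated_random_search(architectures, train_epoch, prob_better_than_best,
                              epochs=100, check_every=3, delta=0.45):
    best_arch, best_accuracy = None, float("-inf")
    for arch in architectures:
        curve = []
        for epoch in range(1, epochs + 1):
            curve.append(train_epoch(arch, epoch))   # validation accuracy
            if max(curve) > best_accuracy:
                continue                              # already beats the incumbent
            if epoch % check_every == 0 and prob_better_than_best(arch, curve) <= delta:
                break                                 # terminate this run early
        if max(curve) > best_accuracy:
            best_arch, best_accuracy = arch, max(curve)
    return best_arch, best_accuracy

# Toy usage with dummy stand-ins.
archs = ["net-a", "net-b"]
dummy_train = lambda arch, epoch: 0.5 + 0.004 * epoch if arch == "net-a" else 0.5 + 0.002 * epoch
dummy_prob = lambda arch, curve: 0.5   # never triggers termination here
print(accelerated_random_search(archs, dummy_train, dummy_prob))
```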

One of our observations is that the method of Domhan et al. (2015) underestimates performance when the learning curve is short. As a result, the method ends each training process early after only a few epochs. Therefore, we follow the recommendation by Domhan et al. (2015) and do not end processes before we have seen the first 30 epochs.

We summarize the results in Table 1. Our first observation is that all methods accelerate the search with little regret. Here we define regret as the difference between the accuracy of the model found by the random search and the accuracy of the model found by one of the methods. Not surprisingly, the method of Domhan et al. (2015) takes up most of the time, as it requires significantly longer learning curves to make its decision. In addition, we can confirm the results by Baker et al. (2018) and Chandrashekaran & Lane (2017), both of which report better results than Domhan et al. (2015). Our method requires the least amount of time for each dataset. For CIFAR-100 we do not observe any regret, but a reduction of the time by a factor of 100. In some cases we observe an insignificantly higher regret than some of the other methods. In our opinion, the time saved makes up for it.

In Figure 4 we visualize the random search for SVHN. As you can see, many curves not only have similar behavior but also similar error. For this reason, it is difficult to decide whether to discard a model safely, which explains the increased runtime. The only methods that do not show this behavior are Successive Halving and Hyperband. The simple reason is that the number of runs terminated and the time at which they are discarded are fixed by the algorithm and do not rely on the properties of the learning curve. The disadvantages are obvious: promising runs may be terminated and unpromising runs may run longer than needed.

[Figure 4: LCRankNet is speeding up architecture search for SVHN (validation error over learning curve length in %). The difference between learning curves is small, making this the hardest task.]

Finally, we compare our method with DARTS (Liu et al., 2019), a method that speeds up the search by parameter sharing. For a fair comparison, we train the architecture discovered in the previous experiment with a training setup equivalent to that of DARTS. This architecture achieved a classification error of 2.99% on CIFAR-10 after a search of only 20 GPU hours. In comparison, DARTS (1st order) needs 36 GPU hours for a similarly good architecture (3.00% error). DARTS (2nd order) can improve these results to 2.76%, but it takes 96 GPU hours. One of the major disadvantages of DARTS and related methods is that they cannot generalize to arbitrary architecture search spaces (Yu et al., 2020) and do not allow hyperparameter optimization. Our proposed method does not suffer from these problems.

4.4. LCRankNet prediction analysis

We saw in the previous section that LCRankNet does not perform perfectly. There are cases where its regret is greater than zero or where the search time is higher, which indicates that early stopping was applied later than might have been possible. In this section we try to shed some light on the decisions made by the model and to give the reader some insight into the model's behavior.

We provide four example decisions of LCRankNet in Figure 5. The plots in the top row show cases where LCRankNet made a correct decision; the plots in the bottom row are examples of incorrect decisions.

In both of the correct cases (top row), LCRankNet assigns a higher probability to begin with, using meta-knowledge. However, in the top left case it becomes evident after only very few epochs that $m^{\max}$ is consistently better than $m$, such that the probability $p(m > m^{\max})$ reduces sharply to values close to zero, which would correctly stop training early on. The case in the top right plot is more tricky. The probability increases until a learning curve length of 12%, as the learning curve seems to indicate that $m$ is better than $m^{\max}$. However, the learning curve of $m^{\max}$ approaches the values of $m$, and then the probability decreases and training is stopped early.

We continue now with the discussion of the two examples in the bottom row, which LCRankNet erroneously terminated early, which in turn causes a regret higher than zero in Table 1.


In the bottom left we show an example where sometimes $m$ is better for a longer period and sometimes $m^{\max}$. This is arguably a very difficult problem, and it is hard to predict which curve will eventually be better. However, LCRankNet shows a reasonable behavior. As the learning curve of $m$ is consistently worse than that of $m^{\max}$ in the segment from 15% to 42%, the probability is decreasing. Starting from a length of 57%, where $m$ shows superior performance to $m^{\max}$ for several epochs, the probability starts to rise. From a learning curve length of 70% onward, both learning curves are very close and the difference in the final accuracy is only 0.0022. The example visualized in the bottom right plot is a very interesting one. The learning curve of $m^{\max}$ is consistently better than or almost equal to that of $m$ up to the very end. Towards the end, the learning curves are getting very close. In fact, from a learning curve length of 63% onward, the maximum difference between $m$ and $m^{\max}$ per epoch is only 0.008. Hence, in the beginning it seems like a trivial decision to reject $m$. However, eventually $m$ turns out to be better than $m^{\max}$.

In conclusion, deciding whether one model will be better than another based on a partial learning curve is a challenging task. A model that turns out to be (slightly) better than another can be dominated consistently for most of the learning curve. This makes it a very challenging problem for automated methods and human experts alike.

[Figure 5: Analysis of the predicted probability of LCRankNet with growing learning curve length (validation accuracy of $m$ and $m^{\max}$ and the probability $p(m > m^{\max})$ over learning curve length in %). The plots in the top row show examples of correct decisions by LCRankNet; the bottom row shows two examples where its decision was wrong.]

4.5. Analysis of LCRankNet's components

We briefly mentioned before which components of our learning curve ranker have an essential influence on its quality. We would like to deepen the analysis at this point and compare the configuration we have proposed with different variants in Figure 6. We consider variants with and without metadata, architecture description, or learning curve consideration. In addition, we compare our configuration trained with the pairwise ranking loss to one trained with a pointwise ranking loss.

One of the most striking observations is that the metadata is essential for the model. This is not surprising, since in particular the learning of the architecture embedding needs sufficient data. Sufficient data is not available in this setup, so we observe a much higher variance for these variants. Even the variant that only considers the learning curve benefits from additional meta-knowledge. But this is also not surprising, since stagnating learning processes show similar behavior regardless of the dataset. Using the meta-knowledge, both components achieve good results on their own. It can be clearly seen that these components are orthogonal to one another. The variant which only considers the architecture shows very good results for short learning curves. If only the learning curve is considered, the results are not very good at first, but they improve significantly with the length of the learning curve. A combination of both ultimately leads to our method and further improves the results. Finally, we compare the use of a pointwise ranking loss (L2 loss) versus a pairwise ranking loss. Although our assumption was that the latter would have to be fundamentally better, since it optimizes the model parameters directly for the task at hand, in practice this does not necessarily seem to be the case. Especially for short learning curves, the simpler method achieves better results. However, once the learning curve is long enough, the pairwise ranking loss pays off.

[Figure 6: Analysis of the different components of LCRankNet (Spearman correlation over the seen learning curve (%) on CIFAR-10, CIFAR-100, Fashion-MNIST, Quick, Draw!, and SVHN). Compared variants: the proposed configuration, architecture only, the proposed configuration with L2 loss, the proposed configuration without metadata, learning curve only, and learning curve only without metadata. Every single component (metadata, consideration of the learning curve, and architecture description) is vital.]

4.6. Accelerating hyperparameter optimization

LCRankNet is not limited to NAS but can be used with hyperparameter optimization methods as well. We demonstrate this by combining LCRankNet with TPE (Bergstra et al., 2011), Regularized Evolution (RE) (Real et al., 2019), and Reinforcement Learning (RL) (Zoph & Le, 2017) to optimize the hyperparameters of an FCNN using a tabular data benchmark (Klein & Hutter, 2019). The objective of this benchmark is to minimize the mean squared error (MSE) by choosing the right architecture and hyperparameter settings. All experiments are repeated ten times. Each optimization method can evaluate up to 100 different configurations and chooses the best one based on its validation MSE. We report results with respect to test regret following Klein & Hutter (2019). This number indicates the difference between the test MSE actually achieved and the best possible test MSE.

In our experiments, we observed that early termination generally improves the method (Figure 7, Appendix). All other results are listed in the Appendix.

[Figure 7: TPE and RE can leverage LCRankNet to accelerate optimization (test regret over training time in seconds on the protein structure dataset). Results for reinforcement learning, more datasets, and further insights can be found in the Appendix.]

5. Conclusion

In this paper we present LCRankNet, a method to automatically terminate unpromising model configurations early. The two main novelties of the underlying model are that it is able to consider learning curves from other datasets and that it uses a pairwise ranking loss. The former allows predictions from relatively short, and in extreme cases even without, learning curves. The latter directly allows modelling the probability that one configuration is better than another. We analyze our method on five different datasets against three alternatives. In an experiment to optimize network architectures, we obtain the fastest results. In the best case, LCRankNet is 100 times faster without sacrificing accuracy. We also examine the components and predictions of our method to give the reader a better understanding of the design choices and functionalities. Finally, we demonstrate its use in combination with various hyperparameter optimizers.

References

Bahdanau, D., Cho, K., and Bengio, Y. Neural machine translation by jointly learning to align and translate. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, 2015. URL http://arxiv.org/abs/1409.0473.

Baker, B., Gupta, O., Raskar, R., and Naik, N. Accelerating neural architecture search using performance prediction. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Workshop Track Proceedings, 2018. URL https://openreview.net/forum?id=HJqk3N1vG.

Bergstra, J., Bardenet, R., Bengio, Y., and Kégl, B. Algorithms for hyper-parameter optimization. In Advances in Neural Information Processing Systems 24: 25th Annual Conference on Neural Information Processing Systems 2011, Granada, Spain, pp. 2546-2554, 2011. URL http://papers.nips.cc/paper/4443-algorithms-for-hyper-parameter-optimization.

Burges, C. J. C., Shaked, T., Renshaw, E., Lazier, A., Deeds, M., Hamilton, N., and Hullender, G. N. Learning to rank using gradient descent. In Proceedings of the Twenty-Second International Conference on Machine Learning (ICML 2005), Bonn, Germany, August 7-11, 2005, pp. 89-96, 2005. doi: 10.1145/1102351.1102363. URL https://doi.org/10.1145/1102351.1102363.

Chandrashekaran, A. and Lane, I. R. Speeding up hyper-parameter optimization by extrapolation of learning curves using previous builds. In Machine Learning and Knowledge Discovery in Databases - European Conference, ECML PKDD 2017, Skopje, Macedonia, September 18-22, 2017, Proceedings, Part I, volume 10534 of Lecture Notes in Computer Science, pp. 477-492. Springer, 2017.

DeVries, T. and Taylor, G. W. Improved regularization of convolutional neural networks with cutout. CoRR, abs/1708.04552, 2017. URL http://arxiv.org/abs/1708.04552.

Domhan, T., Springenberg, J. T., and Hutter, F. Speeding up automatic hyperparameter optimization of deep neural networks by extrapolation of learning curves. In Proceedings of the Twenty-Fourth International Joint Conference on Artificial Intelligence, IJCAI 2015, Buenos Aires, Argentina, July 25-31, 2015, pp. 3460-3468. AAAI Press, 2015.

Falkner, S., Klein, A., and Hutter, F. BOHB: Robust and efficient hyperparameter optimization at scale. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, pp. 1436-1445, 2018. URL http://proceedings.mlr.press/v80/falkner18a.html.

Jamieson, K. G. and Talwalkar, A. Non-stochastic best arm identification and hyperparameter optimization. In Proceedings of the 19th International Conference on Artificial Intelligence and Statistics, AISTATS 2016, Cadiz, Spain, May 9-11, 2016, pp. 240-248, 2016. URL http://proceedings.mlr.press/v51/jamieson16.html.

Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, 2015. URL http://arxiv.org/abs/1412.6980.

Klein, A. and Hutter, F. Tabular benchmarks for joint architecture and hyperparameter optimization. CoRR, abs/1905.04970, 2019. URL http://arxiv.org/abs/1905.04970.

Klein, A., Falkner, S., Springenberg, J. T., and Hutter, F. Learning curve prediction with Bayesian neural networks. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings, 2017. URL https://openreview.net/forum?id=S11KBYclx.

Li, L., Jamieson, K. G., DeSalvo, G., Rostamizadeh, A., and Talwalkar, A. Hyperband: A novel bandit-based approach to hyperparameter optimization. Journal of Machine Learning Research, 18:185:1-185:52, 2017. URL http://jmlr.org/papers/v18/16-558.html.

Liu, H., Simonyan, K., and Yang, Y. DARTS: Differentiable architecture search. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019, 2019. URL https://openreview.net/forum?id=S1eYHoC5FX.

Liu, T. Learning to Rank for Information Retrieval. Springer, 2011. ISBN 978-3-642-14266-6. doi: 10.1007/978-3-642-14267-3. URL https://doi.org/10.1007/978-3-642-14267-3.

Loshchilov, I. and Hutter, F. SGDR: Stochastic gradient descent with warm restarts. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings, 2017. URL https://openreview.net/forum?id=Skq89Scxx.

Luo, R., Tian, F., Qin, T., Chen, E., and Liu, T. Neural architecture optimization. In Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, Montréal, Canada, pp. 7827-7838, 2018. URL http://papers.nips.cc/paper/8007-neural-architecture-optimization.

Pham, H., Guan, M. Y., Zoph, B., Le, Q. V., and Dean, J. Efficient neural architecture search via parameter sharing. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, pp. 4092-4101, 2018. URL http://proceedings.mlr.press/v80/pham18a.html.

Real, E., Aggarwal, A., Huang, Y., and Le, Q. V. Regularized evolution for image classifier architecture search. In The Thirty-Third AAAI Conference on Artificial Intelligence, AAAI 2019, Honolulu, Hawaii, USA, January 27 - February 1, 2019, pp. 4780-4789, 2019. doi: 10.1609/aaai.v33i01.33014780. URL https://doi.org/10.1609/aaai.v33i01.33014780.

Sutskever, I., Vinyals, O., and Le, Q. V. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, Montreal, Quebec, Canada, pp. 3104-3112, 2014. URL http://papers.nips.cc/paper/5346-sequence-to-sequence-learning-with-neural-networks.

Wistuba, M., Rawat, A., and Pedapati, T. A survey on neural architecture search. CoRR, abs/1905.01392, 2019. URL http://arxiv.org/abs/1905.01392.

Yu, K., Sciuto, C., Jaggi, M., Musat, C., and Salzmann, M. Evaluating the search phase of neural architecture search. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020, Conference Track Proceedings, 2020. URL https://openreview.net/forum?id=H1loF2NFwr.

Zoph, B. and Le, Q. V. Neural architecture search with reinforcement learning. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings, 2017. URL https://openreview.net/forum?id=r1Ue8Hcxg.

Zoph, B., Vasudevan, V., Shlens, J., and Le, Q. V. Learning transferable architectures for scalable image recognition. In 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18-22, 2018, pp. 8697-8710, 2018. doi: 10.1109/CVPR.2018.00907. URL http://openaccess.thecvf.com/content_cvpr_2018/html/Zoph_Learning_Transferable_Architectures_CVPR_2018_paper.html.

A. Ablation study: learning curve component

One of the most important design decisions of LCRankNet was the learning curve component. We experimented with several models for generating the best learning curve embedding. In the following discussion of the different models, we assume that the learning curve represents the validation accuracy over time and thus that higher values are better.

To begin with, passing the entire learning curve directly to be concatenated with the architecture embedding would result in a large number of predictors and thereby overfitting. It was also clear that the best or last value of the learning curve alone (assuming that the learning curve is constantly improving, i.e. it increases monotonically, the two are the same) is a very good predictor. After all, it is the only one used by methods like Hyperband and Successive Halving. In our setting, this would be achieved through a simple global max pooling layer. Any further information regarding the development of the learning curve (improvement since epoch 1, first- and second-order gradients, etc.) would, however, be disregarded. Feature engineering would be one way to create such features, but convolutions allow us to automatically learn which of these features are helpful. In Figure 8 we take two different models with a max pooling layer into account. The version with a global max pooling layer reduces the number of predictors to one per filter, while the version with strides reduces the number to 4 per filter. In our experiments, we did not notice a big difference between these two versions, but a significant improvement over the version that only uses the best value.

Furthermore, we tried in vain to apply LSTMs to this problem. Both LSTMs directly using the learning curve and LSTMs using the output of the convolution did not achieve better results than if the learning curve had not been taken into account at all ("architecture only").

[Figure 8: Analysis of various different learning curve components in comparison to not using the learning curve at all (Spearman correlation over the seen learning curve (%) on CIFAR-10, CIFAR-100, Fashion-MNIST, Quick, Draw!, and SVHN). Compared variants: global max pooling, CNN + global max pooling, CNN + max pooling with strides, LSTM, CNN + LSTM, and architecture only.]

B. Joint architecture and hyperparameter optimization

As briefly discussed in the main paper, we want to show that LCRankNet

• can be combined with optimization methods such as the Tree-structured Parzen Estimator (TPE) (Bergstra et al., 2011), Regularized Evolution (RE) (Real et al., 2019) and Reinforcement Learning (RL) (Zoph & Le, 2017),

• can be applied to different search spaces (convolutional vs. fully connected neural networks) with and without hyperparameters, and

• works with different machine learning tasks (classification vs. regression).

For this reason we carry out an additional experiment on a publicly available tabular regression benchmark (Klein & Hutter, 2019). This benchmark was created by performing a grid search with a total of 62,208 different architecture and hyperparameter settings for four different tabular regression datasets: slice localization, protein structure, parkinsons telemonitoring, and naval propulsion. Each dataset is split into 60% train, 20% validation, and 20% test. Each architecture has two fully connected layers. Settings vary with respect to the initial learning rate (0.0005, 0.001, 0.005, 0.01, 0.05, 0.1), the batch size (8, 16, 32, 64), the learning rate schedule (cosine or fixed), the activation per layer (ReLU or tanh), the number of units per layer (16, 32, 64, 128, 256, 512), and the dropout per layer (0.0, 0.3, 0.6).

We again follow the leave-one-dataset-out cross-validation protocol: we optimize the architecture and hyperparameter setting for one dataset and transfer the knowledge from the others. 100 settings per dataset are randomly selected as additional data for LCRankNet.

B.1. Accelerating hyperparameter optimization

The goal is to minimize the mean squared error (MSE) by choosing the right architecture and hyperparameter settings. In this experiment we compare the optimization methods TPE, RE and RL with versions that use early termination via LCRankNet. All experiments are repeated ten times. Each optimization method can take up to 100 different settings into account. We report results with respect to test regret following Klein & Hutter (2019). This number indicates the difference between the test MSE actually achieved and the best possible test MSE. Finding the best setting results in a regret of 0. The setting chosen by an optimization method is based on the validation MSE. It is therefore possible that the test regret increases again; this can be seen as overfitting on the validation set. In Figures 10 to 12 we report the results up to the search time required for the shortest of all repetitions. We find that early termination generally improves the method. Only in one case is it not better, but it is not worse either.

B.2. Technical details

In contrast to random search, Successive Halving, or Hyperband, the intermediate performance is not sufficient for other optimization methods; the final performance of a model is required. To take this into account, a simple change to LCRankNet is required. An additional output layer is added that predicts the final performance in addition to the ranking score. We continue to use the ranking score only to decide whether to terminate a run early or not. The predicted final performance is only used in the event of early termination and only to provide feedback to the optimization method. If a run is not terminated early, the actual final performance is used. To ensure that the predicted final performance is in a reasonable range, we define a lower and an upper bound. We explain them here in the context of MSE, so lower values are better and vice versa. The lower bound is defined by the mean MSE observed for previous runs. The motivation for this decision is that each terminated run should be below average. We define the best observed MSE of the partial learning curve as an upper bound. If the upper and lower bounds conflict, we prefer the upper bound because it is based directly on observed data.
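A minimal sketch of this clipping rule as we read it (not the authors' code); the function name and the example numbers are made up.

```python
# Keep a predicted final MSE in a reasonable range: no better (lower) than the
# mean MSE of previous runs, and no worse than the best MSE already observed on
# the partial curve; the observed upper bound wins if the two conflict.
def bounded_final_mse(predicted_mse, previous_final_mses, partial_curve_mses):
    lower_bound = sum(previous_final_mses) / len(previous_final_mses)
    upper_bound = min(partial_curve_mses)      # best MSE seen so far
    value = max(predicted_mse, lower_bound)    # terminated runs stay below average
    return min(value, upper_bound)             # observed bound preferred on conflict

# Hypothetical example: prediction 0.02, previous runs average 0.05,
# best observed MSE on the partial curve is 0.04.
print(bounded_final_mse(0.02, [0.06, 0.04, 0.05], [0.09, 0.04, 0.05]))  # 0.04
```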

C. Architecture representation

In this work we consider two different search spaces: the NASNet search space, which takes CNNs into account (Zoph et al., 2018), and a search space for FCNNs (Klein & Hutter, 2019). In the NASNet search space, an architecture is completely described by two cells, each consisting of several blocks. The entire architecture is defined by selecting the design of all blocks. A block is designed by selecting two operations and their corresponding inputs and how these two operations are combined. A sample block is shown in Figure 9. We follow Luo et al. (2018) and model every combination of input and operation through three embeddings. The first embedding specifies the input choice, the second the type of operation (convolution, maximum pooling, etc.), and the third the kernel size. In this way, an architecture with two cells, each with five blocks, is completely described by a sequence of 60 decisions.

[Figure 9: One example block used in an architecture from the NASNet search space: a 3x3 max pooling and a 3x3 convolution whose outputs are combined by addition.]

We describe the FCNNs in a very similar way. For each layer we learn an embedding for the activation function, the dropout rate, and the number of units. The batch size, the learning rate (after a log transformation), and the learning rate schedule (after one-hot encoding) are taken into account as further hyperparameters.
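The following sketch (our illustration, following the description above) shows how such a block could be turned into embedding indices: each input/operation pair contributes three integers (input choice, operation type, kernel size), so two cells with five blocks and two operation slots each give 2 x 5 x 2 x 3 = 60 decisions. The vocabularies below are illustrative.

```python
# Encoding a NASNet-style block as a sequence of embedding indices (sketch).
OPERATIONS = {"conv": 0, "max_pool": 1, "avg_pool": 2, "identity": 3}
KERNEL_SIZES = {1: 0, 3: 1, 5: 2}

def encode_block(block):
    """block: list of two (input_index, operation, kernel_size) tuples."""
    tokens = []
    for input_index, operation, kernel_size in block:
        tokens += [input_index, OPERATIONS[operation], KERNEL_SIZES[kernel_size]]
    return tokens

# The example block of Figure 9: 3x3 max pooling and 3x3 convolution, added.
block = [(0, "max_pool", 3), (1, "conv", 3)]
print(encode_block(block))   # six integers; ten blocks yield 60 decisions
```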

[Figure 10: TPE benefits from early termination on all datasets. Panels: protein structure, slice localization, parkinsons telemonitoring, and naval propulsion; axes: test regret over training time (in seconds); curves: TPE and TPE (early term.).]

[Figure 11: Regularized Evolution benefits from early termination on all datasets. Same panels and axes; curves: RE and RE (early term.).]

[Figure 12: Reinforcement Learning benefits from early termination on all datasets. Same panels and axes; curves: RL and RL (early term.).]