Learning to Rank Learning Curves

Martin Wistuba 1 Tejaswini Pedapati 1

1 IBM Research. Correspondence to: Martin Wistuba.

Abstract

Many automated methods, such as those for hyperparameter and neural architecture optimization, are computationally expensive because they involve training many different model configurations. In this work, we present a new method that saves computational budget by terminating poor configurations early on in the training. In contrast to existing methods, we consider this task as a ranking and transfer learning problem. We qualitatively show that by optimizing a pairwise ranking loss and leveraging learning curves from other datasets, our model is able to effectively rank learning curves without having to observe many or very long learning curves. We further demonstrate that our method can be used to accelerate a neural architecture search by a factor of up to 100 without a significant performance degradation of the discovered architecture. In further experiments we analyze the quality of the ranking, the influence of different model components, as well as the predictive behavior of the model.

1. Introduction

A method commonly used by human experts to speed up the optimization of neural architectures or hyperparameters is the early termination of iterative training processes that are unlikely to improve the current solution. A common technique to determine the likelihood of no improvement is to compare the learning curve of a new configuration to that of the currently best configuration. This idea can also be used to speed up automated machine learning processes. For this purpose, it is common practice to extrapolate the partial learning curve in order to predict the final performance of the currently investigated model. Current extrapolation techniques have several weaknesses that keep them from realizing their full potential in practice. Many of the methods require sufficient sample learning curves to make reliable predictions (Chandrashekaran & Lane, 2017; Klein et al., 2017; Baker et al., 2018). Thus, the extrapolation method cannot be used yet for the first candidates, which means more computational effort. Other methods do not have this disadvantage but require sufficiently long learning curves to make reliable predictions, which again means unnecessary overhead (Domhan et al., 2015). Many of these methods also do not take into account other information such as the hyperparameters of the model being examined or its network architecture.

We address the need for sample learning curves by devising a transfer learning technique that uses learning curves from other problems. Since the range of accuracy varies from dataset to dataset, we have to account for this in our modeling. But since we are not interested in predicting the performance of a model anyway, we use a ranking model that models the probability that the model currently being investigated surpasses the best solution so far.

In order to make reliable predictions for short learning curves, we consider further characteristics of the model such as its network architecture. We compare our ranking method, with respect to a ranking measure, against different methods on five image classification and four tabular regression datasets. We also show that our method is capable of significantly accelerating neural architecture search (NAS) and hyperparameter optimization. Furthermore, we conduct several ablation studies to provide a better motivation of our model and its behavior.

2. Related work

Most of the prior work on learning curve prediction is based on the idea of extrapolating the partial learning curve using a combination of continuously increasing basic functions. Domhan et al. (2015) define a set of 11 parametric basic functions, estimate their parameters, and combine them in an ensemble. Klein et al. (2017) propose a heteroscedastic Bayesian model which learns a weighted average of the basic functions. Chandrashekaran & Lane (2017) do not use basic functions but previously observed learning curves of the current dataset. An affine transformation for each previously seen learning curve is estimated by minimizing the mean squared error with respect to the partial learning curve. The best fitting extrapolations are averaged as the final prediction. Baker et al. (2018) use a different procedure. They use support vector machines as sequential regressive models to predict the final accuracy based on features extracted from the learning curve, its gradients, and the neural architecture itself.

The predictor by Domhan et al. (2015) is able to forecast without having seen any learning curve before, but it requires observing more epochs for accurate predictions. The model by Chandrashekaran & Lane (2017) requires seeing a few learning curves to extrapolate future learning curves. However, accurate forecasts are already possible after a few epochs. Algorithms proposed by Klein et al. (2017) and Baker et al. (2018) need to observe many full-length learning curves before providing any useful forecasts. However, this is prohibitive in scenarios where learning is time-consuming, such as for large convolutional neural networks (CNNs).

All previous methods for automatically terminating iterative learning processes are based on methods that predict the learning curve. Ultimately, however, we are less interested in the exact learning curve than in whether the current learning curve leads to a better result. This way of obtaining a ranking is referred to as a pointwise ranking method (Liu, 2011). Pointwise methods have proven to be less efficient than pairwise ranking methods, which directly optimize for the objective function (Burges et al., 2005). Yet, we are the first to consider a pairwise ranking loss for this application.

There are also some bandit-based methods which leverage early termination. Successive Halving (Jamieson & Talwalkar, 2016) trains multiple models with settings chosen at random simultaneously and terminates the worst performing half at predefined intervals. Hyperband (Li et al., 2017) identifies that the choice of the intervals is vital and therefore proposes to run Successive Halving with different intervals. BOHB (Falkner et al., 2018) is an extension of Hyperband which proposes to replace the random selection of settings with Bayesian optimization.

Methods that terminate less promising iterative training processes early are of particular interest in the field of automated machine learning, as they can, for example, significantly accelerate hyperparameter optimization. The most computationally intensive subproblem of automated machine learning is neural architecture search (Wistuba et al., 2019), the optimization of the neural network topology. Our method is not limited to this problem, but as it is currently one of the major challenges in automated machine learning, we use it as a sample application in our evaluation. For this particular application, optimization methods that leverage parameter sharing (Pham et al., 2018) have become established as a standard approach. Here, an overparameterized neural network spanning the search space is learned and finally used to determine the best architecture. DARTS (Liu et al., 2019) is one of the latest extensions of this idea; it uses a continuous relaxation of the search space, which allows learning the shared parameters as well as the architecture by means of gradient descent. These methods can be a strong alternative to early termination, but they do not transfer to general machine learning, and recent research implies that this approach does not work for arbitrary architecture search spaces (Yu et al., 2020).

3. Learning curve ranking

With learning curve we refer to the function of qualitative performance over the growing number of iterations of an iterative learning algorithm. We use the term final learning curve to explicitly denote the entire learning curve, $y_1, \ldots, y_L$, reflecting the training process from beginning to end. Here, $y_i$ is a measure of the performance of the model (e.g., classification accuracy), which is determined at regular intervals. By contrast, a partial learning curve, $y_1, \ldots, y_l$, refers to a learning curve that is observed only up to a time $l$. We visualize these concepts in Figure 1.

[Figure 1: Learning curve prediction tries to predict the final performance from the partial learning curve (solid line). Axes: validation accuracy over iterations; markers indicate the intermediate and the final performance.]

There is a set of automated early termination methods which follow broadly the same principle. The first model is trained to completion and is considered the current best model $m^{\max}$ with its performance being $y^{\max}$. For each further model $m_i$, the probability that it is better than the current best model, $p(m_i > m^{\max})$, is monitored at periodic intervals during training. If it is below a given threshold, the model's training is terminated (see Algorithm 1). All existing methods rely on a two-step approach to determine this probability, which involves extrapolating the partial learning curve and a heuristic measure. Instead of this two-step process, we propose LCRankNet to predict the probability that a model $m_i$ is better than $m_j$ directly. LCRankNet is based on a neural network $f$ which considers model characteristics and the partial learning curve $x_i$ as input. We define the probability that $m_i$ is better than $m_j$ as

$$p(m_i > m_j) = \hat{p}_{i,j} = \frac{e^{f(x_i) - f(x_j)}}{1 + e^{f(x_i) - f(x_j)}}. \qquad (1)$$

Using the logistic function is an established modelling approach in the learning-to-rank community (Burges et al., 2005). Given a set of final learning curves for some models, the estimation of the values $p_{i,j}$ between these models is trivial since we know whether $m_i$ is better than $m_j$ or not. Therefore, we set

$$p_{i,j} = \begin{cases} 1 & \text{if } m_i > m_j \\ 0.5 & \text{if } m_i = m_j \\ 0 & \text{if } m_i < m_j. \end{cases} \qquad (2)$$

We minimize the cross-entropy loss

$$\mathcal{L}_{\mathrm{ce}} = \sum_{i,j} -p_{i,j} \log \hat{p}_{i,j} - (1 - p_{i,j}) \log(1 - \hat{p}_{i,j}) \qquad (3)$$

to determine the parameters of $f$. Now $p(m_i > m_j)$ can be predicted for arbitrary pairs $m_i$ and $m_j$ using Equation (1). A model $f_l$ is trained on all partial learning curves of length $l$ to predict the probabilities of curves of length $l$.

Algorithm 1 Early Termination Method
Input: Dataset $d$, model $m$, performance $y^{\max}$ of the best model $m^{\max}$ so far.
Output: Learning curve.
1: for $l \leftarrow 1 \ldots L$ do
2:   Train $m$ on $d$ for a step and observe a further part $y_l$ of the learning curve.
3:   if $\max_{1 \leq i \leq l} y_i > y^{\max}$ then
4:     continue
5:   else if $p(m > m^{\max}) \leq \delta$ then
6:     return $y$
7:   end if
8: end for
9: return $y$
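To make the pairwise objective concrete, the following sketch (our illustration, not the authors' released code) evaluates Equations (1)-(3) with a stand-in linear scoring function in place of the neural network $f$; the names `score`, `p_better`, and `pairwise_loss` are hypothetical.

```python
# Minimal sketch of the pairwise ranking objective (Equations (1)-(3)).
# The linear `score` below is only a stand-in for the neural network f
# described in Section 3.1; all names here are illustrative.
import numpy as np

def score(x, w):
    """Stand-in scoring function f(x)."""
    return float(np.dot(w, x))

def p_better(x_i, x_j, w):
    """Equation (1): probability that model i is better than model j."""
    d = score(x_i, w) - score(x_j, w)
    return 1.0 / (1.0 + np.exp(-d))  # equals e^d / (1 + e^d)

def pairwise_loss(pairs, w):
    """Equation (3): cross-entropy over labeled pairs (x_i, x_j, p_ij),
    where p_ij is set according to Equation (2)."""
    loss = 0.0
    for x_i, x_j, p_ij in pairs:
        p_hat = p_better(x_i, x_j, w)
        loss -= p_ij * np.log(p_hat) + (1.0 - p_ij) * np.log(1.0 - p_hat)
    return loss

# Toy usage: model i reached the higher final accuracy, so p_ij = 1.
x_i, x_j = np.array([0.8, 0.7]), np.array([0.6, 0.5])
w = np.zeros(2)
print(p_better(x_i, x_j, w))                # 0.5 before training
print(pairwise_loss([(x_i, x_j, 1.0)], w))  # ~0.693
```

In the paper, the parameters of $f$ are fit by minimizing this loss with Adam, jointly with the reconstruction term of Equation (4) introduced under "Technical details" in Section 3.1.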

3.1. Ranking model to learn across datasets

We have defined the prediction of our model in Equation (1). It depends on the outcome of the function $f$ which takes a model representation $x$ as input. The information contained in this representation depends on the task at hand. The representation consists of up to four different parts. First, the partial learning curve $y_1, \ldots, y_l$. Second, the description of the model's architecture, which is a sequence of strings representing the layers it is comprised of. Third, a dataset ID to indicate on which dataset the corresponding architecture and learning curve was trained and observed. Fourth, all remaining hyperparameters, e.g., the learning rate or batch size.

We model $f$ using a neural network and use special layers to process the different parts of the representation. The learned representations for the different parts are finally concatenated and fed to a fully connected layer. The architecture is visualized in Figure 2. We will now describe the different components in detail.

[Figure 2: LCRankNet has four different components, each dealing with one type of input: architecture encoding (embeddings summarized by an LSTM), partial learning curve $y_{1,\ldots,l}$ (convolutions with kernel sizes $k = 2..5$ followed by max pooling), dataset ID (embedding), and further hyperparameters; the outputs feed into a fully connected layer.]

Learning curve component. A CNN is used to process the partial learning curve. We consider up to four different convolutional layers with different kernel sizes. The exact number depends on the length of the learning curve, since we consider no kernel sizes larger than the learning curve length. Each convolution is followed by a global max pooling layer and their outputs are concatenated. This results in the learned representation for the learning curve. In the appendix we analyze alternative modeling options.

Architecture component. In this paper we consider two experiments for learning curve prediction. In one we deal with NAS in the NASNet search space (Zoph et al., 2018); in the other we work on the joint optimization of architecture and hyperparameters on tabular data (Klein & Hutter, 2019). In the former experiment, the search space consists of CNNs, which consist of two cells with five blocks each. This means that 60 decisions about the architecture have to be made. In the second experiment, the fully connected neural networks (FCNNs) consist of only two layers, so the number of decisions is only 6. For more details refer to the appendix. We learn an embedding for every option. An LSTM takes these embeddings and generates the architecture embedding.

Dataset component. As we intend to learn from learning curves of other datasets, we include dataset embeddings as one of the components. Every dataset has its own embedding, and it is used whenever a learning curve observed on this dataset is selected. If the dataset is new, then the corresponding embedding is initialized at random. As the model observes more learning curves from this dataset, it improves the dataset embedding. The embedding helps us to model dataset-specific peculiarities in order to adjust the ranking if necessary.

Hyperparameter component. Other hyperparameters, if available, are passed directly to the last layer.
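To make the composition concrete, here is a rough sketch of such a scoring network in PyTorch. It is our reading of Figure 2, not the released implementation: the class name, all layer sizes, and the default kernel sizes $k = 2..5$ are assumptions, and the sequence-to-sequence reconstruction regularizer described below under "Technical details" is omitted.

```python
# Sketch of an LCRankNet-style scoring network f (our illustration of Figure 2,
# not the authors' code). Layer sizes are assumptions; the reconstruction
# regularizer of Equation (4) is omitted.
import torch
import torch.nn as nn

class LCRankNetScore(nn.Module):
    def __init__(self, n_arch_tokens, n_datasets, n_hyperparams,
                 emb_dim=32, n_filters=16):
        super().__init__()
        # Learning curve component: convolutions with kernel sizes 2..5,
        # each followed by global max pooling (assumes curve length >= 5;
        # the paper caps kernel sizes at the observed curve length).
        self.convs = nn.ModuleList(
            [nn.Conv1d(1, n_filters, kernel_size=k) for k in range(2, 6)])
        # Architecture component: embedding per decision, summarized by an LSTM.
        self.arch_emb = nn.Embedding(n_arch_tokens, emb_dim)
        self.arch_lstm = nn.LSTM(emb_dim, emb_dim, batch_first=True)
        # Dataset component: one learned embedding per dataset ID.
        self.dataset_emb = nn.Embedding(n_datasets, emb_dim)
        # Fully connected layer over the concatenated representation.
        self.out = nn.Linear(4 * n_filters + 2 * emb_dim + n_hyperparams, 1)

    def forward(self, curve, arch_tokens, dataset_id, hyperparams):
        # curve: (batch, 1, l), arch_tokens: (batch, seq), dataset_id: (batch,)
        curve_repr = torch.cat(
            [conv(curve).max(dim=2).values for conv in self.convs], dim=1)
        _, (hidden, _) = self.arch_lstm(self.arch_emb(arch_tokens))
        features = torch.cat(
            [curve_repr, hidden[-1], self.dataset_emb(dataset_id), hyperparams],
            dim=1)
        return self.out(features).squeeze(1)  # ranking score f(x)

# Toy forward pass: batch of 2, curve length 10, 6 architecture decisions.
net = LCRankNetScore(n_arch_tokens=20, n_datasets=5, n_hyperparams=3)
scores = net(torch.rand(2, 1, 10), torch.randint(0, 20, (2, 6)),
             torch.tensor([0, 1]), torch.rand(2, 3))
print(scores.shape)  # torch.Size([2])
```

The pairwise probability of Equation (1) is then obtained from the difference of two such scores.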

Technical details. During the development process, we found that the architecture component leads to instabilities during training. To avoid this, we regularize the output of the LSTM layer by using it as the input to an autoregressive model, similar to a sequence-to-sequence model (Sutskever et al., 2014), that recreates the original description of the architecture. In addition, we use the attention mechanism (Bahdanau et al., 2015) to facilitate this process. All parameters of the layers in $f$ are trained jointly by means of Adam (Kingma & Ba, 2015) by minimizing

$$\mathcal{L} = \alpha \mathcal{L}_{\mathrm{ce}} + (1 - \alpha) \mathcal{L}_{\mathrm{rec}}, \qquad (4)$$

a weighted linear combination of the ranking loss (Equation (3)) and the reconstruction loss, with $\alpha = 0.8$. The hyperparameter $\delta$ allows for trading precision vs. recall, or search time vs. regret, respectively. There are two extreme cases: if we set $\delta \geq 1$, every run is terminated immediately; if we set $\delta \leq 0$, we never terminate any run. For our experiments we set $\delta = 0.45$, which means that if the predicted probability that the new model is better than the best one is below 45%, the run is terminated early.
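As a small illustration (not the authors' code), the weighting of Equation (4) and the termination rule controlled by $\delta$ amount to the following; the function names are hypothetical.

```python
# Illustration of Equation (4) and of the delta-controlled termination rule.
# `ranking_loss` and `reconstruction_loss` stand for L_ce and L_rec.
def combined_loss(ranking_loss, reconstruction_loss, alpha=0.8):
    """Equation (4): weighted combination of ranking and reconstruction loss."""
    return alpha * ranking_loss + (1.0 - alpha) * reconstruction_loss

def should_terminate(p_better_than_best, delta=0.45):
    """Terminate a run whose predicted probability of beating the incumbent
    is at most delta (delta >= 1 stops every run, delta <= 0 stops none)."""
    return p_better_than_best <= delta

print(combined_loss(0.7, 0.4))   # 0.64
print(should_terminate(0.30))    # True
print(should_terminate(0.60))    # False
```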

4. Experiments

In this section, we first discuss how to create the meta-knowledge and then analyze our model in terms of learning curve ranking and its ability to accelerate NAS. Finally, we examine its individual components and behavior in certain scenarios. Furthermore, we show that LCRankNet can be combined with various optimization methods and used for hyperparameter optimization tasks.

4.1. Meta-knowledge

We compare our method to similar methods on five different datasets: CIFAR-10, CIFAR-100, Fashion-MNIST, Quickdraw, and SVHN. We use the original train/test splits if available. Quickdraw has a total of 50 million data points and 345 classes. To reduce the training time, we select a subset of this dataset: we use 100 different randomly selected classes and choose 300 examples per class for the training split and 100 per class for the test split. 5,000 random data points of the training dataset serve as validation split for all datasets.

To create the meta-knowledge, we choose 200 architectures per dataset at random from the NASNet search space (Zoph et al., 2018), such that we train a total of 1,000 architectures. We would like to point out that these are 1,000 unique architectures; there is no architecture that has been trained on several different datasets. Each architecture is trained for 100 epochs with stochastic gradient descent and a cosine learning rate schedule without restart (Loshchilov & Hutter, 2017).

We use standard image preprocessing and augmentation: for every image, we first subtract the channel mean and then divide by the channel standard deviation. Images are padded by a margin of four pixels and randomly cropped back to the original dimension. For all datasets but SVHN we apply random horizontal flipping. Additionally, we use Cutout (DeVries & Taylor, 2017).

For the experiments in Section 4.6 we rely on the tabular benchmark (Klein & Hutter, 2019). This benchmark contains meta-data for four regression tasks. In total, 62,208 different FCNNs with different architectures and hyperparameters are evaluated. 100 settings per dataset are chosen at random as meta-knowledge.

The following experiments are conducted in a leave-one-dataset-out cross-validation. That means when considering one dataset, all meta-knowledge but the one for this particular dataset is used.

Table 1. Results obtained by the different methods on five different datasets. For both metrics, smaller is better. Regret (Regr.) is reported in percent, time in GPU hours.

Method                        | CIFAR-10     | CIFAR-100    | Fashion      | Quickdraw    | SVHN
                              | Regr.  Time  | Regr.  Time  | Regr.  Time  | Regr.  Time  | Regr.  Time
No early termination          | 0.00   1023  | 0.00   1021  | 0.00   1218  | 0.00   1045  | 0.00   1485
Domhan et al. (2015)          | 0.56   346   | 0.82   326   | 0.00   460   | 0.44   331   | 0.28   471
Li et al. (2017)              | 0.22   106   | 0.78   102   | 0.32   132   | 0.54   109   | 0.00   156
Baker et al. (2018)           | 0.00   89    | 0.00   77    | 0.00   129   | 0.00   107   | 0.00   241
Jamieson & Talwalkar (2016)   | 0.62   62    | 0.00   54    | 0.18   70    | 0.40   60    | 0.28   88
Chandrashekaran & Lane (2017) | 0.62   30    | 0.00   35    | 0.28   41    | 0.30   82    | 0.06   164
LCRankNet                     | 0.22   20    | 0.00   11    | 0.10   19    | 0.00   28    | 0.10   74

4.2. Ranking performance

First, we analyze the quality of the learning curve rankings produced by different learning curve prediction methods. In this experiment we choose 50 different learning curves at random as a test set. Five random learning curves are used as a training set for every repetition. Each learning curve prediction method ranks the 50 architectures by observing the partial learning curve, whose length varies from 0 to 30 percent of the final learning curve. We repeat this experiment ten times and report the mean and standard deviation of the correlation between the true and the predicted ranking in Figure 3. As a correlation metric, we use Spearman's rank correlation coefficient. Thus, the correlation is 1 for a perfect ranking and 0 for an uncorrelated, random ranking.

[Figure 3: Spearman correlation between the true and the predicted ranking over the observed fraction of the learning curve (0-30%) on CIFAR-10, CIFAR-100, Fashion-MNIST, Quick, Draw!, and SVHN, comparing Baker et al., Chandrashekaran & Lane, Domhan et al., the last seen value, and LCRankNet. We report the mean of ten repetitions; the shaded area is the standard deviation. Our method LCRankNet outperforms its competitors on all datasets.]
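As a pointer for reproducing the metric, the sketch below (assuming SciPy is available; the numbers are made up) computes Spearman's rank correlation between true final accuracies and the scores a learning curve predictor assigns after seeing a partial curve.

```python
# Spearman's rank correlation between the true final accuracies and the
# predictor's scores; 1.0 means a perfect ranking, ~0 a random one.
from scipy.stats import spearmanr

true_final_accuracy = [0.91, 0.87, 0.93, 0.80, 0.89]   # hypothetical values
predicted_scores    = [0.55, 0.40, 0.70, 0.10, 0.45]   # e.g., f(x) per model

rho, _ = spearmanr(true_final_accuracy, predicted_scores)
print(rho)  # 1.0 here, since the predicted order matches the true order
```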
Our method LCRankNet shows a higher correlation for all datasets. If there are no or only very short partial learning curves available, our method shows the biggest difference to the existing methods. The reason for this is the combination of considering the network architecture together with additional meta-knowledge. We analyze the impact of each component in detail in Section 4.5.

The method by Chandrashekaran & Lane (2017) consistently shows the second best results and in some cases can catch up to the results of our method. The method by Baker et al. (2018) stands out due to its high standard deviation. It is by far the method with the smallest squared error on test. However, the smallest changes in the prediction lead to a significantly different ranking, which explains the high variance in their results. The method by Domhan et al. (2015) requires a minimum length of the learning curve to make predictions. Accordingly, we observe rank correlation values starting from a learning curve length of 4%. Using the last seen value to determine the ranking of learning curves is a simple yet efficient method (Klein et al., 2017). In fact, it is able to outperform some of the more elaborate methods.

4.3. Accelerating random neural architecture search

In this experiment, we demonstrate the utility of learning curve predictors in the search for network architectures. For the sake of simplicity, we accelerate a random search in the NASNet search space (Zoph et al., 2018).

The random search samples 200 models and trains each of them for 100 epochs to obtain the final accuracy. In the end, the best of all these models is returned. Now each learning curve predictor iterates over these sampled architectures in the same order and determines at every third epoch whether the training should be aborted. For Successive Halving (Jamieson & Talwalkar, 2016) and Hyperband (Li et al., 2017) we follow the algorithm defined by the authors and use the recommended settings. The current best model discovered after iterating over all 200 architectures is returned as the best model for this learning curve predictor. The goal of the methods is to minimize the regret compared to a random search without early stopping and, at the same time, to reduce the computational effort.
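The following sketch (our illustration, not the authors' code) shows the shape of this accelerated random search: train each sampled architecture epoch by epoch, let the ranker abort at every third epoch, and keep the best result. The callables `train_epoch` and `prob_better_than_best` are hypothetical stand-ins for the actual training step and for LCRankNet's probability estimate.

```python
# Random search with early termination (sketch under assumed helpers).
def accelerated_random_search(architectures, train_epoch, prob_better_than_best,
                              epochs=100, check_every=3, delta=0.45):
    best_arch, best_accuracy = None, float("-inf")
    for arch in architectures:
        curve = []
        for epoch in range(1, epochs + 1):
            curve.append(train_epoch(arch, epoch))   # validation accuracy
            if max(curve) > best_accuracy:
                continue                              # already beats the incumbent
            if epoch % check_every == 0 and prob_better_than_best(arch, curve) <= delta:
                break                                 # terminate this run early
        if max(curve) > best_accuracy:
            best_arch, best_accuracy = arch, max(curve)
    return best_arch, best_accuracy

# Toy usage with dummy stand-ins.
archs = ["net-a", "net-b"]
dummy_train = lambda arch, epoch: 0.5 + 0.004 * epoch if arch == "net-a" else 0.5 + 0.002 * epoch
dummy_prob = lambda arch, curve: 0.5   # never triggers termination here
print(accelerated_random_search(archs, dummy_train, dummy_prob))
```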

One of our observations is that the method of Domhan et al. (2015) underestimates performance when the learning curve is short. As a result, the method ends each training process early after only a few epochs. Therefore, we follow the recommendation by Domhan et al. (2015) and do not end processes before we have seen the first 30 epochs.

We summarize the results in Table 1. Our first observation is that all methods accelerate the search with little regret. Here we define regret as the difference between the accuracy of the model found by the random search and the accuracy of the model found by one of the methods. Not surprisingly, the method of Domhan et al. (2015) takes up most of the time, as it requires significantly longer learning curves to make its decision. In addition, we can confirm the results by Baker et al. (2018) and Chandrashekaran & Lane (2017), both of which report better results than Domhan et al. (2015). Our method requires the least amount of time for each dataset. For CIFAR-100 we do not observe any regret, but a reduction of the time by a factor of 100. In some cases we observe an insignificantly higher regret than some of the other methods. In our opinion, the time saved makes up for it.

In Figure 4 we visualize the random search for SVHN. As you can see, many curves not only have similar behavior but also similar error. For this reason, it is difficult to decide whether to discard a model safely, which explains the increased runtime. The only methods that do not show this behavior are Successive Halving and Hyperband. The simple reason is that the number of runs terminated and the time at which they are discarded are fixed by the algorithm and do not rely on the properties of the learning curve. The disadvantages are obvious: promising runs may be terminated and unpromising runs may run longer than needed.

[Figure 4: LCRankNet is speeding up architecture search for SVHN (validation error over learning curve length in %). The difference between learning curves is small, making this the hardest task.]

Finally, we compare our method with DARTS (Liu et al., 2019), a method that speeds up the search by parameter sharing. For a fair comparison, we train the architecture discovered in the previous experiment with a training setup equivalent to that of DARTS. This architecture achieved a classification error of 2.99% on CIFAR-10 after a search of only 20 GPU hours. In comparison, DARTS (1st order) needs 36 GPU hours for a similarly good architecture (3.00% error). DARTS (2nd order) can improve these results to 2.76%, but it takes 96 GPU hours. One of the major disadvantages of DARTS and related methods is that they cannot generalize to arbitrary architecture search spaces (Yu et al., 2020) and do not allow hyperparameter optimization. Our proposed method does not suffer from these problems.

4.4. LCRankNet prediction analysis

We saw in the previous section that LCRankNet does not perform perfectly. There are cases where its regret is greater than zero or where the search time is higher, which indicates that early stopping was applied later than might have been possible. In this section we try to shed some light on the decisions made by the model and to give the reader some insight into the model's behavior.

We provide four example decisions of LCRankNet in Figure 5. The plots in the top row show cases where LCRankNet made a correct decision; the plots in the bottom row are examples of incorrect decisions.

In both of the correct cases (top row), LCRankNet assigns a higher probability to begin with, using meta-knowledge. However, in the top left case it becomes evident after only very few epochs that $m^{\max}$ is consistently better than $m$, such that the probability $p(m > m^{\max})$ reduces sharply to values close to zero, which would correctly stop training early on. The case in the top right plot is more tricky. The probability increases until a learning curve length of 12%, as the learning curve seems to indicate that $m$ is better than $m^{\max}$. However, the learning curve of $m^{\max}$ approaches the values of $m$, and then the probability decreases and training is stopped early.

We continue now with the discussion of the two examples in the bottom row, which LCRankNet erroneously terminated early, which in turn causes a regret higher than zero in Table 1.


In the bottom left we show an example where sometimes $m$ is better for a longer period and sometimes $m^{\max}$. This is arguably a very difficult problem, and it is hard to predict which curve will eventually be better. However, LCRankNet shows a reasonable behavior. As the learning curve of $m$ is consistently worse than that of $m^{\max}$ in the segment from 15% to 42%, the probability is decreasing. Starting from a length of 57%, where $m$ shows superior performance to $m^{\max}$ for several epochs, the probability starts to rise. From a learning curve length of 70% onward, both learning curves are very close and the difference in the final accuracy is only 0.0022. The example visualized in the bottom right plot is a very interesting one. The learning curve of $m^{\max}$ is consistently better than or almost equal to that of $m$ up to the very end. Towards the end, the learning curves are getting very close. In fact, from a learning curve length of 63% onward, the maximum difference between $m$ and $m^{\max}$ per epoch is only 0.008. Hence, in the beginning it seems like a trivial decision to reject $m$. However, eventually $m$ turns out to be better than $m^{\max}$.

In conclusion, deciding whether one model will be better than another based on a partial learning curve is a challenging task. A model that turns out to be (slightly) better than another can be dominated consistently for most of the learning curve. This makes it a very challenging problem for automated methods and human experts alike.

[Figure 5: Analysis of the predicted probability of LCRankNet with growing learning curve length (validation accuracy of $m$ and $m^{\max}$ and the probability $p(m > m^{\max})$ over learning curve length in %). The plots in the top row show examples of correct decisions by LCRankNet; the bottom row shows two examples where its decision was wrong.]

4.5. Analysis of LCRankNet's components

We briefly mentioned before which components of our learning curve ranker have an essential influence on its quality. We would like to deepen the analysis at this point and compare the configuration we have proposed with different variants in Figure 6. We consider variants with and without metadata, architecture description, or learning curve consideration. In addition, we compare our configuration trained with the pairwise ranking loss to one trained with a pointwise ranking loss.

One of the most striking observations is that the metadata is essential for the model. This is not surprising, since in particular the learning of the architecture embedding needs sufficient data. Sufficient data is not available in this setup, so we observe a much higher variance for these variants. Even the variant that only considers the learning curve benefits from additional meta-knowledge. But this is also not surprising, since stagnating learning processes show similar behavior regardless of the dataset. Using the meta-knowledge, both components achieve good results on their own. It can be clearly seen that these components are orthogonal to one another. The variant which only considers the architecture shows very good results for short learning curves. If only the learning curve is considered, the results are not very good at first, but they improve significantly with the length of the learning curve. A combination of both ultimately leads to our method and further improves the results. Finally, we compare the use of a pointwise ranking loss (L2 loss) versus a pairwise ranking loss. Although our assumption was that the latter would have to be fundamentally better, since it optimizes the model parameters directly for the task at hand, in practice this does not necessarily seem to be the case. Especially for short learning curves, the simpler method achieves better results. However, once the learning curve is long enough, the pairwise ranking loss pays off.

[Figure 6: Analysis of the different components of LCRankNet (Spearman correlation over the seen learning curve (%) on CIFAR-10, CIFAR-100, Fashion-MNIST, Quick, Draw!, and SVHN). Compared variants: the proposed configuration, architecture only, the proposed configuration with L2 loss, the proposed configuration without metadata, learning curve only, and learning curve only without metadata. Every single component (metadata, consideration of the learning curve, and architecture description) is vital.]

4.6. Accelerating hyperparameter optimization

LCRankNet is not limited to NAS but can be used with hyperparameter optimization methods as well. We demonstrate this by combining LCRankNet with TPE (Bergstra et al., 2011), Regularized Evolution (RE) (Real et al., 2019), and Reinforcement Learning (RL) (Zoph & Le, 2017) to optimize the hyperparameters of an FCNN using a tabular data benchmark (Klein & Hutter, 2019). The objective of this benchmark is to minimize the mean squared error (MSE) by choosing the right architecture and hyperparameter settings. All experiments are repeated ten times. Each optimization method can evaluate up to 100 different configurations and chooses the best one based on its validation MSE. We report results with respect to test regret following Klein & Hutter (2019). This number indicates the difference between the test MSE actually achieved and the best possible test MSE.

In our experiments, we observed that early termination generally improves the method (Figure 7, Appendix). All other results are listed in the Appendix.

[Figure 7: TPE and RE can leverage LCRankNet to accelerate optimization (test regret over training time in seconds on the protein structure dataset). Results for reinforcement learning, more datasets, and further insights can be found in the Appendix.]

5. Conclusion

In this paper we present LCRankNet, a method to automatically terminate unpromising model configurations early. The two main novelties of the underlying model are that it is able to consider learning curves from other datasets and that it uses a pairwise ranking loss. The former allows predictions from relatively short, and in extreme cases even without, learning curves. The latter directly allows modelling the probability that one configuration is better than another. We analyze our method on five different datasets against three alternatives. In an experiment to optimize network architectures, we obtain the fastest results. In the best case, LCRankNet is 100 times faster without sacrificing accuracy. We also examine the components and predictions of our method to give the reader a better understanding of the design choices and functionalities. Finally, we demonstrate its use in combination with various hyperparameter optimizers.

References

Bahdanau, D., Cho, K., and Bengio, Y. Neural machine translation by jointly learning to align and translate. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, 2015. URL http://arxiv.org/abs/1409.0473.

Baker, B., Gupta, O., Raskar, R., and Naik, N. Accelerating neural architecture search using performance prediction. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Workshop Track Proceedings, 2018. URL https://openreview.net/forum?id=HJqk3N1vG.

Bergstra, J., Bardenet, R., Bengio, Y., and Kégl, B. Algorithms for hyper-parameter optimization. In Advances in Neural Information Processing Systems 24: 25th Annual Conference on Neural Information Processing Systems 2011, Granada, Spain, pp. 2546-2554, 2011. URL http://papers.nips.cc/paper/4443-algorithms-for-hyper-parameter-optimization.

Burges, C. J. C., Shaked, T., Renshaw, E., Lazier, A., Deeds, M., Hamilton, N., and Hullender, G. N. Learning to rank using gradient descent. In Proceedings of the Twenty-Second International Conference on Machine Learning (ICML 2005), Bonn, Germany, August 7-11, 2005, pp. 89-96, 2005. doi: 10.1145/1102351.1102363. URL https://doi.org/10.1145/1102351.1102363.

Chandrashekaran, A. and Lane, I. R. Speeding up hyper-parameter optimization by extrapolation of learning curves using previous builds. In Machine Learning and Knowledge Discovery in Databases - European Conference, ECML PKDD 2017, Skopje, Macedonia, September 18-22, 2017, Proceedings, Part I, volume 10534 of Lecture Notes in Computer Science, pp. 477-492. Springer, 2017.

DeVries, T. and Taylor, G. W. Improved regularization of convolutional neural networks with cutout. CoRR, abs/1708.04552, 2017. URL http://arxiv.org/abs/1708.04552.

Domhan, T., Springenberg, J. T., and Hutter, F. Speeding up automatic hyperparameter optimization of deep neural networks by extrapolation of learning curves. In Proceedings of the Twenty-Fourth International Joint Conference on Artificial Intelligence, IJCAI 2015, Buenos Aires, Argentina, July 25-31, 2015, pp. 3460-3468. AAAI Press, 2015.

Falkner, S., Klein, A., and Hutter, F. BOHB: Robust and efficient hyperparameter optimization at scale. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, pp. 1436-1445, 2018. URL http://proceedings.mlr.press/v80/falkner18a.html.

Jamieson, K. G. and Talwalkar, A. Non-stochastic best arm identification and hyperparameter optimization. In Proceedings of the 19th International Conference on Artificial Intelligence and Statistics, AISTATS 2016, Cadiz, Spain, May 9-11, 2016, pp. 240-248, 2016. URL http://proceedings.mlr.press/v51/jamieson16.html.

Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, 2015. URL http://arxiv.org/abs/1412.6980.

Klein, A. and Hutter, F. Tabular benchmarks for joint architecture and hyperparameter optimization. CoRR, abs/1905.04970, 2019. URL http://arxiv.org/abs/1905.04970.

Klein, A., Falkner, S., Springenberg, J. T., and Hutter, F. Learning curve prediction with Bayesian neural networks. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings, 2017. URL https://openreview.net/forum?id=S11KBYclx.

Li, L., Jamieson, K. G., DeSalvo, G., Rostamizadeh, A., and Talwalkar, A. Hyperband: A novel bandit-based approach to hyperparameter optimization. Journal of Machine Learning Research, 18:185:1-185:52, 2017. URL http://jmlr.org/papers/v18/16-558.html.

Liu, H., Simonyan, K., and Yang, Y. DARTS: Differentiable architecture search. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019, 2019. URL https://openreview.net/forum?id=S1eYHoC5FX.

Liu, T. Learning to Rank for Information Retrieval. Springer, 2011. ISBN 978-3-642-14266-6. doi: 10.1007/978-3-642-14267-3. URL https://doi.org/10.1007/978-3-642-14267-3.

Loshchilov, I. and Hutter, F. SGDR: Stochastic gradient descent with warm restarts. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings, 2017. URL https://openreview.net/forum?id=Skq89Scxx.

Luo, R., Tian, F., Qin, T., Chen, E., and Liu, T. Neural architecture optimization. In Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, Montréal, Canada, pp. 7827-7838, 2018. URL http://papers.nips.cc/paper/8007-neural-architecture-optimization.

Pham, H., Guan, M. Y., Zoph, B., Le, Q. V., and Dean, J. Efficient neural architecture search via parameter sharing. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, pp. 4092-4101, 2018. URL http://proceedings.mlr.press/v80/pham18a.html.

Real, E., Aggarwal, A., Huang, Y., and Le, Q. V. Regularized evolution for image classifier architecture search. In The Thirty-Third AAAI Conference on Artificial Intelligence, AAAI 2019, Honolulu, Hawaii, USA, January 27 - February 1, 2019, pp. 4780-4789, 2019. doi: 10.1609/aaai.v33i01.33014780. URL https://doi.org/10.1609/aaai.v33i01.33014780.

Sutskever, I., Vinyals, O., and Le, Q. V. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, Montreal, Quebec, Canada, pp. 3104-3112, 2014. URL http://papers.nips.cc/paper/5346-sequence-to-sequence-learning-with-neural-networks.

Wistuba, M., Rawat, A., and Pedapati, T. A survey on neural architecture search. CoRR, abs/1905.01392, 2019. URL http://arxiv.org/abs/1905.01392.

Yu, K., Sciuto, C., Jaggi, M., Musat, C., and Salzmann, M. Evaluating the search phase of neural architecture search. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020, Conference Track Proceedings, 2020. URL https://openreview.net/forum?id=H1loF2NFwr.

Zoph, B. and Le, Q. V. Neural architecture search with reinforcement learning. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings, 2017. URL https://openreview.net/forum?id=r1Ue8Hcxg.

Zoph, B., Vasudevan, V., Shlens, J., and Le, Q. V. Learning transferable architectures for scalable image recognition. In 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18-22, 2018, pp. 8697-8710, 2018. doi: 10.1109/CVPR.2018.00907. URL http://openaccess.thecvf.com/content_cvpr_2018/html/Zoph_Learning_Transferable_Architectures_CVPR_2018_paper.html.

A. Ablation study: learning curve component

One of the most important design decisions of LCRankNet was the learning curve component. We experimented with several models for generating the best learning curve embedding. In the following discussion of the different models, we assume that the learning curve represents the validation accuracy over time and thus that higher values are better.

To begin with, passing the entire learning curve directly to be concatenated with the architecture embedding would result in a large number of predictors and thereby overfitting. It was also clear that the best or last value of the learning curve alone (assuming that the learning curve is constantly improving, i.e. it increases monotonically, the two are the same) is a very good predictor. After all, it is the only one used by methods like Hyperband and Successive Halving. In our setting, this would be achieved through a simple global max pooling layer. Any further information regarding the development of the learning curve (improvement since epoch 1, first- and second-order gradients, etc.) would, however, be disregarded. Feature engineering would be one way to create such features, but convolutions allow us to automatically learn which of these features are helpful. In Figure 8 we take two different models with a max pooling layer into account. The version with a global max pooling layer reduces the number of predictors to one per filter, while the version with strides reduces the number to 4 per filter. In our experiments, we did not notice a big difference between these two versions, but a significant improvement over the version that only uses the best value.

Furthermore, we tried in vain to apply LSTMs to this problem. Both LSTMs directly using the learning curve and LSTMs using the output of the convolution did not achieve better results than if the learning curve had not been taken into account at all ("architecture only").

[Figure 8: Analysis of various different learning curve components in comparison to not using the learning curve at all (Spearman correlation over the seen learning curve (%) on CIFAR-10, CIFAR-100, Fashion-MNIST, Quick, Draw!, and SVHN). Compared variants: global max pooling, CNN + global max pooling, CNN + max pooling with strides, LSTM, CNN + LSTM, and architecture only.]

B. Joint architecture and hyperparameter optimization

As briefly discussed in the main paper, we want to show that LCRankNet

• can be combined with optimization methods such as the Tree-structured Parzen Estimator (TPE) (Bergstra et al., 2011), Regularized Evolution (RE) (Real et al., 2019) and Reinforcement Learning (RL) (Zoph & Le, 2017),

• can be applied to different search spaces (convolutional vs. fully connected neural networks) with and without hyperparameters, and

• works with different machine learning tasks (classification vs. regression).

For this reason we carry out an additional experiment on a publicly available tabular regression benchmark (Klein & Hutter, 2019). This benchmark was created by performing a grid search with a total of 62,208 different architecture and hyperparameter settings for four different tabular regression datasets: slice localization, protein structure, parkinsons telemonitoring, and naval propulsion. Each dataset is split into 60% train, 20% validation, and 20% test. Each architecture has two fully connected layers. Settings vary with respect to the initial learning rate (0.0005, 0.001, 0.005, 0.01, 0.05, 0.1), the batch size (8, 16, 32, 64), the learning rate schedule (cosine or fixed), the activation per layer (ReLU or tanh), the number of units per layer (16, 32, 64, 128, 256, 512), and the dropout per layer (0.0, 0.3, 0.6).

We again follow the leave-one-dataset-out cross-validation protocol: we optimize the architecture and hyperparameter setting for one dataset and transfer the knowledge from the others. 100 settings per dataset are randomly selected as additional data for LCRankNet.

B.1. Accelerating hyperparameter optimization

The goal is to minimize the mean squared error (MSE) by choosing the right architecture and hyperparameter settings. In this experiment we compare the optimization methods TPE, RE and RL with versions that use early termination via LCRankNet. All experiments are repeated ten times. Each optimization method can take up to 100 different settings into account. We report results with respect to test regret following Klein & Hutter (2019). This number indicates the difference between the test MSE actually achieved and the best possible test MSE. Finding the best setting results in a regret of 0. The setting chosen by an optimization method is based on the validation MSE. It is therefore possible that the test regret increases again; this can be seen as overfitting on the validation set. In Figures 10 to 12 we report the results up to the search time required for the shortest of all repetitions. We find that early termination generally improves the method. Only in one case is it not better, but it is not worse either.

B.2. Technical details

In contrast to random search, Successive Halving, or Hyperband, the intermediate performance is not sufficient for other optimization methods; the final performance of a model is required. To take this into account, a simple change to LCRankNet is required. An additional output layer is added that predicts the final performance in addition to the ranking score. We continue to use the ranking score only to decide whether to terminate a run early or not. The predicted final performance is only used in the event of early termination and only to provide feedback to the optimization method. If a run is not terminated early, the actual final performance is used. To ensure that the predicted final performance is in a reasonable range, we define a lower and an upper bound. We explain them here in the context of MSE, so lower values are better and vice versa. The lower bound is defined by the mean MSE observed for previous runs. The motivation for this decision is that each terminated run should be below average. We define the best observed MSE of the partial learning curve as an upper bound. If the upper and lower bounds conflict, we prefer the upper bound because it is based directly on observed data.
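A minimal sketch of this clipping rule as we read it (not the authors' code); the function name and the example numbers are made up.

```python
# Keep a predicted final MSE in a reasonable range: no better (lower) than the
# mean MSE of previous runs, and no worse than the best MSE already observed on
# the partial curve; the observed upper bound wins if the two conflict.
def bounded_final_mse(predicted_mse, previous_final_mses, partial_curve_mses):
    lower_bound = sum(previous_final_mses) / len(previous_final_mses)
    upper_bound = min(partial_curve_mses)      # best MSE seen so far
    value = max(predicted_mse, lower_bound)    # terminated runs stay below average
    return min(value, upper_bound)             # observed bound preferred on conflict

# Hypothetical example: prediction 0.02, previous runs average 0.05,
# best observed MSE on the partial curve is 0.04.
print(bounded_final_mse(0.02, [0.06, 0.04, 0.05], [0.09, 0.04, 0.05]))  # 0.04
```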

C. Architecture representation

In this work we consider two different search spaces: the NASNet search space, which takes CNNs into account (Zoph et al., 2018), and a search space for FCNNs (Klein & Hutter, 2019). In the NASNet search space, an architecture is completely described by two cells, each consisting of several blocks. The entire architecture is defined by selecting the design of all blocks. A block is designed by selecting two operations and their corresponding inputs and how these two operations are combined. A sample block is shown in Figure 9. We follow Luo et al. (2018) and model every combination of input and operation through three embeddings. The first embedding specifies the input choice, the second the type of operation (convolution, maximum pooling, etc.), and the third the kernel size. In this way, an architecture with two cells, each with five blocks, is completely described by a sequence of 60 decisions.

[Figure 9: One example block used in an architecture from the NASNet search space: a 3x3 max pooling and a 3x3 convolution whose outputs are combined by addition.]

We describe the FCNNs in a very similar way. For each layer we learn an embedding for the activation function, the dropout rate, and the number of units. The batch size, the learning rate (after a log transformation), and the learning rate schedule (after one-hot encoding) are taken into account as further hyperparameters.
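The following sketch (our illustration, following the description above) shows how such a block could be turned into embedding indices: each input/operation pair contributes three integers (input choice, operation type, kernel size), so two cells with five blocks and two operation slots each give 2 x 5 x 2 x 3 = 60 decisions. The vocabularies below are illustrative.

```python
# Encoding a NASNet-style block as a sequence of embedding indices (sketch).
OPERATIONS = {"conv": 0, "max_pool": 1, "avg_pool": 2, "identity": 3}
KERNEL_SIZES = {1: 0, 3: 1, 5: 2}

def encode_block(block):
    """block: list of two (input_index, operation, kernel_size) tuples."""
    tokens = []
    for input_index, operation, kernel_size in block:
        tokens += [input_index, OPERATIONS[operation], KERNEL_SIZES[kernel_size]]
    return tokens

# The example block of Figure 9: 3x3 max pooling and 3x3 convolution, added.
block = [(0, "max_pool", 3), (1, "conv", 3)]
print(encode_block(block))   # six integers; ten blocks yield 60 decisions
```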

[Figure 10: TPE benefits from early termination on all datasets. Panels: protein structure, slice localization, parkinsons telemonitoring, and naval propulsion; axes: test regret over training time (in seconds); curves: TPE and TPE (early term.).]

[Figure 11: Regularized Evolution benefits from early termination on all datasets. Same panels and axes; curves: RE and RE (early term.).]

[Figure 12: Reinforcement Learning benefits from early termination on all datasets. Same panels and axes; curves: RL and RL (early term.).]