
DarwinML: A Graph-based Evolutionary Algorithm for Automated Machine Learning

Fei Qi, Zhaohui Xia, Gaoyang Tang, Hang Yang, Yu Song, Guangrui Qian, Xiong An, Chunhuan Lin, Guangming Shi

This work was supported in part by the National Natural Science Foundation of China under Grant 61572387. F. Qi, Z. Xia, X. An, C. Lin and G. Shi are with the School of Artificial Intelligence, Xidian University, Xi'an 710071, China (e-mail: [email protected]; [email protected]; [email protected]; lch [email protected]; [email protected]). G. Tang, H. Yang, Y. Song and G. Qian are with Intelligence Qubic (Beijing) Technology Co., Ltd, Beijing, China (e-mail: {tgy, yanghang, songyu, grqian}@iqubic.net).

Abstract—As an emerging field, Automated Machine Learning (AutoML) aims to reduce or eliminate manual operations that require expertise in machine learning. In this paper, a graph-based architecture is employed to represent flexible combinations of ML models, which provides a large searching space compared to tree-based and stacking-based architectures. Based on this, an evolutionary algorithm is proposed to search for the best architecture, where the mutation and heredity operators are the key for architecture evolution. With Bayesian hyper-parameter optimization, the proposed approach can automate the workflow of machine learning. On the PMLB dataset, the proposed approach shows state-of-the-art performance compared with TPOT, Autostacker, and auto-sklearn. Some of the optimized models have complex structures which are difficult to obtain by manual design.

I. INTRODUCTION

Various models have been thoroughly investigated by the machine learning (ML) community. In theory, these models are general and applicable to both academia and industry. However, it could be time-consuming to build a solution for a specific ML task, even for an ML expert. To remedy this, Automated ML (AutoML) [1] emerges to minimize the design complexity of a complete ML solution, which usually includes data preprocessing, feature engineering, model selection and ensemble, fine-tuning of hyperparameters, etc.

Research on AutoML started from hyperparameter optimization, which includes random search [2], evolutionary algorithm (EA) [3], and Bayesian [4] approaches. Automatic feature engineering [5], [6] is another important sub-field of AutoML. In deep learning, the end-to-end scheme [7] provides a solution for automated learning to a certain degree. However, training a deep end-to-end neural network generally requires a large-scale labeled dataset, which is unavailable in many practical problems. In addition, Glasmachers has shown other limitations of the end-to-end scheme [8]. Meanwhile, neural architecture search approaches have emerged [9], [10] to automate neural network design.

Though deep learning is very powerful, there is still room to apply traditional ML. To achieve the goal of AutoML, Thornton et al. [11], Thakur and Krohn-Grimberghe [12], and Feurer et al. [13] proposed solutions for automating the entire process for traditional ML problems. These solutions are not powerful enough, because they use simple combinations or single model selection. Employing EA for architecture search, the Tree-based Pipeline Optimization Tool [14], or TPOT, uses a tree-based representation for model combination, while Autostacker [15] uses the stacking scheme.

According to a comparison between tree-based [16] and graph-based [17] representations for symbolic regression [18], a graph-based representation is more flexible and general than the tree-based one. In addition, graph-based representations of computation are widely used in modern ML frameworks including TensorFlow [19].

In this paper, we propose a novel AutoML solution called DarwinML based on the EA with tournament selection [20]. DarwinML employs the directed acyclic graph (DAG) for model combination. Compared to existing AutoML methods such as pipelines, the proposed method has a flexible representation and is highly extensible. In summary, the key contributions of this paper are as follows.

• The adjacency matrix of a graph is analyzed and used to represent architectures composed of a series of traditional ML models.
• Several evolutionary operators have been defined and implemented to generate diverse architectures, which has not been thoroughly investigated in existing works.
• Based on EA, an end-to-end automatic solution called "DarwinML" is proposed to search for optimal compositions of traditional ML models.

The rest of this paper is organized as follows. Section II reviews literature related to AutoML. Section III introduces the architectural representation for ML model composition. Section IV presents the approach for evolutionary architecture search and optimization. Section V illustrates experimental results on the PMLB [21] dataset. Section VI concludes this work.

II. RELATED WORKS

A. Automatic Machine Learning

The earlier AutoML models were born in the competition called "ChaLearn AutoML Challenge" [22] in 2015. According to a test on 30 datasets, the two top ranking models of the challenge were designed by Intel's team and Hutter's team, respectively. Intel's proprietary solution [1] is a tree-based method. Auto-sklearn [13], the open-source solution from Hutter's team, won the challenge.

Hutter also co-developed SMAC [23] and Auto-WEKA [11], [24]. Auto-WEKA treated AutoML as a Combined Algorithm Selection and Hyperparameter optimization (CASH) problem. Based on the open source package WEKA [25], Auto-WEKA put traditional ML steps, including the full range of classifiers and feature selectors, into one pipeline and uses tree-based Bayesian optimization [23] to search in the combined space of models and hyperparameters. Following Auto-WEKA, auto-sklearn also solves the CASH problem with Bayesian hyperparameter optimization (BHO). The improvements are the meta-learning initialization and ensemble steps added before and after the optimization step, respectively.

B. Evolutionary Algorithms

EA has many variants in implementation, such as evolutionary programming, genetic algorithms, and others. Encoding the problem into a representation that is easier to evolve is an efficient way to reach a solution. Genetic Programming (GP), as a special type of EA, encodes computer programs as a set of genes that evolve toward the best. In recent years, inspired by its success in generating structured representations [26], more implementations [27], [28] of GP have been studied. One topic is using mathematical functions as coding units to search for appropriate composite function structures. For instance, with graph-based representations encoded by GP [20], [29], both approaches use GP to evolve a better function to approximate the target value.

In AutoML, EAs are used frequently for architecture search. TPOT [30], a later AutoML approach which outperforms previous systems, searches tree-based pipelines of models with GP. TPOT allows the dataset to be processed in parallel with multiple preprocessing methods. Such a step can bring diverse features to the following steps. Furthermore, Kordik et al. [31] and de Sa et al. [32] not only use tree-based GP to ensemble ML methods, but also use a grammar to enforce the generation of valid solutions. The larger search space and detailed constraints make the solutions more likely to be successful. Recently, Chen et al. proposed Autostacker [15] to combine ML classifiers by stacking [33] and to search for the best model using EA. This method largely extends the possibilities of model combination.

III. ARCHITECTURE REPRESENTATION

A. Graph-based model combination

A set of ML models, M = {f_1, ..., f_M}, is chosen to construct the composite model, where f_m denotes the hypothesis function of the m-th model. A directed acyclic graph (DAG) is employed to denote the architecture of a full combination of models. In the DAG, edges and vertices denote data flow and computational functions, respectively. This is the same as the representation used in TensorFlow [19]. One difference is that the structure of the DAG will be changed by the evolutionary operations.

Let graph G = {V, E} be the DAG denoting the combined model, which is composed of a set of vertices V = {v_1, ..., v_K} and a set of edges E = {e_ij}. A vertex v_k is associated with a computational function from the model set M. For simplicity, the model f_{m_k} denotes the function associated with the vertex v_k in the rest of this paper. An edge e_ij implies that the output from vertex v_i flows into vertex v_j.

Let z_j denote the output of vertex v_j, and z_0 denote the input of the composite model. Suppose vertex v_j has n input vertices, v_{i_1}, v_{i_2}, ..., v_{i_n}; the function associated with vertex v_j is then computed as:

    z_j = f_{m_j}( U(z_{i_1}, z_{i_2}, ..., z_{i_n}) ),    (1)

where U(·) is a feature union function. In the proposed approach, vector concatenation is used as the feature union function. The output z_K of the composite model can be computed by recursively applying (1).
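To make (1) concrete, the following is a minimal Python sketch (not the authors' implementation) of evaluating a composite model over a topologically sorted DAG, using column-wise concatenation as the feature union U(·). The callable `models` list and 2-D vertex outputs are illustrative assumptions.

```python
import numpy as np

def evaluate_dag(models, A, X):
    """models[k]: a callable returning a 2-D array for vertex v_k (models[0] may be
    the identity for the input vertex); A: upper-triangular adjacency matrix;
    X: (n_samples, n_features) input, i.e. z_0 in the paper's notation."""
    K = len(models)
    z = [None] * K
    z[0] = models[0](X)                                        # the input vertex consumes z_0
    for j in range(1, K):
        preds = [i for i in range(j) if A[i, j]]
        union = np.concatenate([z[i] for i in preds], axis=1)  # U(.): concatenation
        z[j] = models[j](union)                                # Eq. (1): z_j = f_{m_j}(U(...))
    return z[-1]                                               # z_K, the composite output

# Illustrative usage with identity "models":
# A = np.array([[0, 1, 1, 0], [0, 0, 0, 1], [0, 0, 0, 1], [0, 0, 0, 0]])
# y = evaluate_dag([lambda v: v] * 4, A, np.random.rand(5, 3))
```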

B. Layer

For the convenience of understanding and applying some evolutionary operations, vertices are topologically sorted [34]. For simplicity, the topological order of a vertex is defined as its depth. Vertices with the same depth d compose a layer L_d:

    L_d = { v_k | d_k = d, ∀ v_k ∈ V }.    (2)

It should be noted that there are no edges between two vertices belonging to the same layer. Without loss of generality, vertices v_1 and v_K are supposed to be the input and the output of the graph, respectively. Accordingly, d_1 and d_K are the minimum and maximum depths.

The connection probability of a particular edge e_ij is useful for some evolutionary operations. It is related to the depths of the two layers containing vertices v_i and v_j:

    p(e_ij) = p_0 exp( γ (d_i − d_j + 1) ),    (3)

where γ is the decay factor, and p_0 is an initial connection probability.

C. Layer blocks

The adjacency matrix, A, of a graph, G, is a K × K matrix with elements given by:

    A_ij = 1 if e_ij ∈ E, and A_ij = 0 otherwise,    (4)

where a one indicates an edge from vertex v_i to vertex v_j. As the graph is directed acyclic and topologically sorted, the adjacency matrix is always upper triangular. As shown in Fig. 1, the diagonal of the adjacency matrix consists of square blocks of zeros:

    A^(d)_ij = 0, ∀ v_i, v_j ∈ L_d,    (5)

where A^(d) is a |L_d| × |L_d| block corresponding to the layer L_d. The rest of the upper triangle of the adjacency matrix can be split into several blocks A^(d,d') according to the fragmenting of layers, where v_i ∈ L_d and v_j ∈ L_d'. The size of the block A^(d,d') is |L_d| × |L_d'|.
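As an illustration of (2) and (5), the sketch below (a NumPy-based assumption, not from the paper) derives a depth for each vertex of an upper-triangular adjacency matrix and checks that every diagonal block A^(d) is zero; defining depth as the longest path from the input is an assumed reading consistent with the topological ordering.

```python
import numpy as np

def layer_depths(A):
    """Depth of each vertex in a topologically sorted DAG: 0 for the input vertex,
    otherwise one more than the deepest predecessor (a longest-path definition)."""
    K = A.shape[0]
    depth = np.zeros(K, dtype=int)
    for j in range(K):
        preds = np.nonzero(A[:j, j])[0]
        if preds.size:
            depth[j] = depth[preds].max() + 1
    return depth

def diagonal_blocks_are_zero(A):
    """Check Eq. (5): no edges inside a layer, i.e. every diagonal block A^(d) is zero."""
    depth = layer_depths(A)
    return all(not A[np.ix_(idx, idx)].any()
               for d in np.unique(depth)
               for idx in [np.nonzero(depth == d)[0]])
```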

The connection probability can be visualized by mapping it onto the adjacency matrix, as shown in Fig. 1. According to (3), each block A^(d,d') has the same connection probability, which can be calculated as:

    p( e_ij | v_i ∈ L_d, v_j ∈ L_d' ) = p_0 exp( γ (d − d' + 1) ).    (6)

With the above intuitive observation, changing layer L_d should consider the blocks A^(d), A^(d,·), and A^(·,d), which form a cross-shaped area as shown in Fig. 1.

Fig. 1: Layer blocks: The color corresponds to the connection probability between two layers L_d and L_d', calculated according to (6).

IV. EVOLUTION OF ARCHITECTURES

Our goal is to find the architecture that performs best on a given ML task. In this paper, only classification tasks are considered and evaluated. Let X = {x_1, x_2, ..., x_N} and Y = {y_1, y_2, ..., y_N} be the features and labels of a dataset, respectively. The optimization goal can then be formally expressed as:

    (G*, Θ*) = argmin_{G,Θ} ℓ(G, Θ, X, Y) + α c(G),    (7)

where G denotes the architecture represented by a DAG, Θ is the parameter set endowed by the graph, ℓ(·) is the loss function defined by the classification task, c(·) measures the complexity of the graph, and α is a coefficient to trade off between loss and complexity. With the complexity term, the objective function prefers a simple architecture. In implementation, the loss function includes a regularization term and some cross validation strategies to avoid over-fitting, and the complexity counts the number of vertices and edges in the graph.
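A minimal sketch of the objective in (7) is shown below; the trade-off value α and the way the loss is obtained (a cross-validated evaluation of the trained graph) are illustrative assumptions.

```python
import numpy as np

def objective(loss_value, A, alpha=0.01):
    """Sketch of Eq. (7): cross-validated loss plus alpha * c(G), where c(G) counts
    the vertices and edges of the DAG; alpha = 0.01 is an illustrative value."""
    num_vertices = A.shape[0]
    num_edges = int(np.asarray(A).sum())
    return loss_value + alpha * (num_vertices + num_edges)
```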
A. Search Algorithm

The evolutionary search algorithm based on tournament selection [35] is employed to explore architectures in the complex configuration space. The work flow is illustrated in Fig. 2. Four operations, which are the random, mutation, heredity, and keep best operations, constitute the main part of the algorithm. These operations are designed to generate diverse architectures. Details of these operations will be explained in the following sub-sections. Following the convention in EA, an architecture will also be called an individual in the following.

In initialization, the first generation is built up by individuals generated by the random operation, following the convention in EA.

In the second generation, individuals evolve by applying all four operations. Firstly, the keep best operator is applied to obtain the set of the top 15% best individuals of the first generation. Then, tournament selection [35] picks the top model from a randomly chosen sub-group of the previous generation and the keep best set. A number of top models are selected by repeating random grouping and tournament selection. Secondly, the mutation and heredity operations are applied to these promising individuals, and new individuals are created for the current generation. Then, the top 15% best individuals are inherited back into the population to ensure that models with good performance are preserved. After that, individuals are generated by the random operation to fill up the current generation. Finally, the fitness of all individuals, except the ones inherited through keep best, is evaluated after a training procedure. In each subsequent generation, the individuals of the i-th generation are produced by applying the mutation and heredity operations to the (i − 1)-th generation and the top 15% best individuals of the (i − 2)-th generation.

In every generation, the percentages of individuals generated by the random, heredity, and mutation operations are about 30%, 40%, and 30%, respectively. A higher heredity probability shows better performance on tasks that fit complex architectures. Individuals with invalid graph structures are simply dropped, no matter how the individual was generated. The rules will be explained in the Implementation Details. In addition, the search algorithm prefers simple architectures according to the fitness in (7).

The evolution stops when a predefined duration or population size has been reached. The final output is the individual with the highest fitness.

B. Evolutionary Operations

1) Random Operation: The random operation is designed to randomly sample individuals in the configuration or searching space. The pseudo code of the random operation is shown in Algorithm 1. In the algorithm, K and D are the numbers of vertices and layers of the DAG to be generated as an individual. The constant C serves as an upper bound to control the size of the DAG. The sizes of the layers, l_d, should follow the constraint that Σ_d l_d = K, where l_1 = l_D = 1 applies since the DAG is always implemented as single-in-single-output. The predefined threshold ρ controls the density of edges.

2) Mutation Operations: Three mutation operations are designed to provide flexible ways to vary individual architectures for the purpose of traversing to better individuals.

a) Vertex mutation: This operation is designed to replace one vertex with another ML model. Its implementation is straightforward, as shown by Algorithm 2. A real example of this operation is shown in Fig. 3, where the vertex model "SVC" is replaced by a "ridge classifier".

b) Edge mutation: This operation, as given by Algorithm 3, is defined to flip an edge connection in the DAG.

c) Layer mutation: Local structures could affect the performance of the architecture, and layers are naturally such local structures. Thus, the layer mutation is introduced to change the DAG at a scale larger than vertex and edge. The pseudo code is illustrated in Algorithm 4. An example is shown in Fig. 4, where a layer composed of three vertices, the "KNeighbors Regressor", "SVC", and "Ridge Classifier", was inserted into the DAG. The dashed edge was automatically removed after insertion. Please note that edges crossing layers are permitted but not available in this example.
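As a sketch of the vertex and edge mutations (Algorithms 2 and 3 below), assuming an individual is stored as a list of model names plus an upper-triangular NumPy adjacency matrix; the details are illustrative, not the authors' code.

```python
import numpy as np

def vertex_mutation(models, model_set, rng=None):
    """Algorithm 2 sketch: replace the model assigned to one randomly chosen vertex."""
    rng = rng or np.random.default_rng()
    mutated = list(models)
    k = int(rng.integers(0, len(mutated)))              # select vertex v_k
    mutated[k] = model_set[int(rng.integers(0, len(model_set)))]
    return mutated

def edge_mutation(A, rng=None):
    """Algorithm 3 sketch: flip one entry above the diagonal, adding or removing the
    edge e_ij; restricting to i < j keeps the graph acyclic, and a validity check
    (every inner vertex keeps input and output edges) would follow."""
    rng = rng or np.random.default_rng()
    mutated = A.copy()
    i = int(rng.integers(0, A.shape[0] - 1))
    j = int(rng.integers(i + 1, A.shape[0]))
    mutated[i, j] = 1 - mutated[i, j]
    return mutated
```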

[Fig. 2 diagram: an initial graph enters the evolutionary algorithm, where the random, mutation, heredity, and keep best operations together with tournament selection produce each generation, from the first generation to the last, yielding the evolved graph.]

Fig. 2: The DarwinML framework. The evolutionary algorithm (EA) performs the search with gene operations in the form of decoded matrices, and trains the encoded graphs to obtain their fitness. After training, the EA saves the best 15% of models to the next generation. The EA keeps searching and training until the population is large enough. The best model is then chosen from the population by fitness, i.e., by the machine learning loss function.
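To make the workflow of Fig. 2 concrete, here is a minimal sketch of one generation of the search loop. The population size, tournament size, and exact operator ratios are illustrative, and `make_random`, `mutate`, and `heredity` stand for the operations defined in this section; this is a sketch, not the authors' implementation.

```python
import random

def next_generation(population, fitness, make_random, mutate, heredity,
                    pop_size=40, keep_frac=0.15, tournament_size=5):
    """One generation of a DarwinML-style search loop. `fitness` returns a score
    where higher is better, e.g. the negative of the objective in Eq. (7)."""
    ranked = sorted(population, key=fitness, reverse=True)
    kept = ranked[: max(1, int(keep_frac * len(ranked)))]          # keep-best operator

    def tournament(pool):
        group = random.sample(pool, min(tournament_size, len(pool)))
        return max(group, key=fitness)                             # tournament selection

    pool = population + kept
    offspring = list(kept)                                         # carried over unevaluated
    while len(offspring) < pop_size:
        r = random.random()
        if r < 0.30:
            child = mutate(tournament(pool))                       # mutation operations
        elif r < 0.70:
            child = heredity(tournament(pool), tournament(pool))   # heredity needs two parents
        else:
            child = make_random()                                  # random operation
        if child is not None:                                      # invalid graphs are dropped
            offspring.append(child)
    return offspring
```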

Algorithm 1 Random Operation
1: K ← randint() . Number of vertices (1 < K < C).
2: Randomly choose K functions from the model set M and assign them to each vertex.
3: D ← randint() . Number of layers (D < K).
4: l_2, ..., l_{D−1} ← randint() . Size of each layer.
5: A ← 0 . Initialize the adjacency matrix.
6: for 1 < d < d' < D do
7:     for v_i ∈ L_d, v_j ∈ L_d' do
8:         p_ij ← rand() according to (6). . 0 ≤ p_ij ≤ 1.
9:         A_ij ← 1 if p_ij > ρ . Determine edge e_ij.
10:    end for
11: end for
12: Perform topological sort.

[Fig. 3 diagram omitted: previous-generation and current-generation graphs with Input, ExtraTree Classifier, Ridge Classifier, KMeans, and SVC vertices feeding a GradientBoosting Classifier; the SVC vertex is replaced by a Ridge Classifier in the current generation.]
Fig. 3: An example of vertex mutation.
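A runnable counterpart to Algorithm 1 might look like the sketch below; the parameter defaults, the Bernoulli edge-sampling step, and the layer-assignment details are assumptions rather than the authors' exact procedure.

```python
import numpy as np

def random_individual(model_names, C=10, p0=0.3, gamma=1.0, rng=None):
    """Sketch of Algorithm 1 under assumed details: sample K vertices and D layers,
    assign a random model name to each vertex, and draw edges between layers with the
    probability of Eq. (6). Treating that probability as a Bernoulli acceptance rate
    (instead of thresholding against rho) is a simplification."""
    rng = rng or np.random.default_rng()
    K = int(rng.integers(4, C))                    # number of vertices, 1 < K < C
    D = int(rng.integers(3, K))                    # number of layers, D < K
    models = [model_names[int(rng.integers(0, len(model_names)))] for _ in range(K)]
    # single input (depth 0) and single output (depth D-1); middle vertices in between
    middle = np.sort(rng.integers(1, D - 1, size=K - 2))
    depth = np.concatenate(([0], middle, [D - 1]))
    A = np.zeros((K, K), dtype=int)
    for i in range(K):
        for j in range(i + 1, K):
            if depth[i] < depth[j]:                # edges only point to deeper layers
                p = p0 * np.exp(gamma * (depth[i] - depth[j] + 1))   # Eq. (6)
                A[i, j] = int(rng.random() < p)
    return models, A, depth                        # a validity check would follow
```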

Algorithm 2 Vertex Mutation
1: k ← randint() . Select vertex v_k to change.
2: m' ← randint() . Choose an ML model from M.
3: f_{m_k} ← f_{m'} . Replace the model for vertex v_k.

Algorithm 3 Edge Mutation
1: i, j ← randint() . Select two vertices v_i and v_j.
2: A_ij ← 1 − A_ij . Flip the connection of edge e_ij.
3: Perform topological sort.

3) Heredity Operation: The heredity operation is an enhancement for dealing with local structures. This operation provides an opportunity for an individual to replace a bad layer with a good one. In this sense, heredity plays a very important role in inheriting and broadcasting good local structures to the whole population. Different from the previous operations, heredity requires two good individuals to perform the operation, as given by Algorithm 5.

One example is illustrated in Fig. 5, where a layer with one vertex, "Ridge Classifier", has been inserted after the removal of the layer containing two vertices, "Normalizer" and "Robust Scaler". With the heredity operation, the balanced accuracy of the architecture increased from 0.723 to 0.778. The example shows the effectiveness of this operation.

4) Keep Best: This operation is used to keep the best individuals of the previous generation to avoid losing the best one in following generations. In implementation, the 15% of individuals with top performances are treated as the best ones and kept from one generation to the next.
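Returning to the heredity operation, the sketch below shows the layer swap on a simplified representation (an individual as a list of layers, each layer a list of model names); the representation, the helper names, and the hypothetical parents are illustrative assumptions, not the authors' code.

```python
import random

def heredity(parent_a, parent_b, rng=random):
    """Sketch of Algorithm 5: one inner layer of parent_a is replaced by a randomly
    chosen inner layer of parent_b. Re-sampling the surrounding edges and
    re-validating the DAG (lines 4-6 and 10-15 of Algorithm 4) are omitted."""
    child = [list(layer) for layer in parent_a]
    d = rng.randrange(1, len(parent_a) - 1)        # never replace the input/output layers
    d_prime = rng.randrange(1, len(parent_b) - 1)
    child[d] = list(parent_b[d_prime])             # inherit the donor layer
    return child

# Hypothetical parents (names only for illustration):
# a = [["input"], ["Normalizer", "RobustScaler"], ["GradientBoostingClassifier"]]
# b = [["input"], ["RidgeClassifier"], ["VotingClassifier"]]
# child = heredity(a, b)    # e.g. a's middle layer becomes ["RidgeClassifier"]
```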

[Fig. 5 diagram omitted: previous-generation and current-generation graphs built from Input, Normalizer, Robust Scaler, Ridge Classifier, KNeighbors Regressor, SVC, and Voting Classifier vertices.]

Fig. 5: An example of heredity operation.

[Fig. 4 diagram omitted: previous-generation and current-generation graphs ending in a GradientBoosting Classifier, with the inserted layer of the current generation highlighted.]
Fig. 4: An example of layer mutation.

C. Implementation Details

1) Graph Validation: As the edges are generated by the evolutionary operations according to a random distribution, there is a possibility that a generated individual is not a valid DAG. So a validation check is performed after each operation, and invalid individuals are dropped. There is a maximum retrial number for these operations, so the percentages of individuals coming from each operation may differ from the configuration. One rule of the validation check is that vertices in a DAG should have both input and output edges, except the first and the last vertex. Another rule can be applied to promote the ratio of individuals with good performance, which is that a classifier/regression vertex should not directly connect to another classifier/regression vertex alone. Also, there are other constraints, such as the maximum number of vertices, which will be shown in the experiment section.

Algorithm 4 Layer Mutation
1: d ← randint() . Choose layer L_d (1 < d < D).
2: p ← rand() . A real random number 0 < p < 1.
3: if p > 0.5 then . To remove the layer L_d.
4:     A^(d,d'), A^(d',d) ← 0
5:     Remove A^(d), A^(d,d'), A^(d',d) from A.
6:     Remove vertices v_i ∈ L_d.
7: else . To insert a layer L_{d+1}.
8:     l_{d+1} ← randint()
9:     Select l_{d+1} models to create layer L_{d+1}.
10:    A^(d+1), A^(d+1,·), A^(·,d+1) ← 0.
11:    Insert A^(d+1), A^(d+1,·), A^(·,d+1) into A.
12:    for i, j in blocks A^(d+1,·) and A^(·,d+1) do
13:        p_ij ← rand() according to (6).
14:        A_ij ← 1 if p_ij > ρ
15:    end for
16: end if
17: Perform topological sort.

2) Model Mix: After building a DAG with machine learning models, problems arise when the graph ensembles machine learning methods that differ from the task. If the task is a classification one and there is a regression model in the DAG, we calculate the root-mean-square error between the regression prediction and the classification label as the regression loss function to train the model. If there is an unsupervised model in the graph, we do unsupervised learning on the dataset and output a vector as a feature to the next vertex. For example, the output of k-means is a one-of-k coded vector indicating the cluster index, so the output of k-means can be used by its following vertex for model combination (stacking).

Algorithm 5 Heredity Operation
1: Choose two good graphs G and G' via tournament selection.
2: Randomly choose two layers, L_d and L_d', one from each graph.
3: Remove the layer L_d from graph G.
4: . Pseudo code like lines 4–6 in Algorithm 4.
5: Insert the layer L_d' into graph G.
6: . Pseudo code like lines 10–15 in Algorithm 4.
7: Perform topological sort.

3) Hyperparameter Optimization: After searching models by the EA with tournament selection, BHO [4] is applied to the top five individuals in the final population. Due to the limitation of computing resources, hyperparameter optimization is too expensive to apply to the whole population. The individuals found by the algorithm are competitive, and hyperparameter optimization can further improve their performances.
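Referring back to the model-mix rule above, the following sketch (assumed scikit-learn usage, not the authors' code) shows how an unsupervised k-means vertex can emit a one-of-k coded feature for its following vertex; the function name and cluster count are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

def kmeans_vertex(X_fit, X_out, n_clusters=3):
    """Fit k-means on the available features and emit a one-of-k coded cluster
    indicator for the following vertex; n_clusters=3 is an illustrative choice."""
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(X_fit)
    labels = km.predict(X_out)
    return np.eye(n_clusters)[labels]              # one-of-k coding of the cluster index

# Example: stack the cluster code next to the raw features for the next model.
# X = np.random.rand(100, 5)
# Z = np.hstack([X, kmeans_vertex(X, X)])
```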

V. EXPERIMENTS

To show the performance of DarwinML, its robustness and accuracy were evaluated. Results were compared with random forest, auto-sklearn [13], TPOT [30], and Autostacker [15]. The capability of the model in different search spaces and the influence of hyperparameter optimization are demonstrated by ablation experiments. Interestingly, DarwinML found a variety of interpretable solutions, which are also illustrated.

TABLE I: Datasets used in experiments
Datasets               | #Instances | #Features | #Classes
monk1                  | 556        | 6         | 2
parity5                | 32         | 5         | 2
parity5+5              | 1124       | 10        | 2
pima                   | 768        | 8         | 2
prnn crabs             | 200        | 7         | 2
allhypo                | 3770       | 29        | 3
spect                  | 267        | 22        | 2
vehicle                | 846        | 18        | 5
wine-recognition       | 178        | 13        | 4
breast-cancer          | 286        | 9         | 2
cars                   | 392        | 8         | 3
dis                    | 3772       | 29        | 2
Hill Valley with noise | 1212       | 100       | 2
ecoli                  | 327        | 7         | 8
heart-h                | 294        | 13        | 2

TABLE II: Constraint configuration used in experiments
Parameters        | Configuration
#populations      | 120/400
#generations      | 10/20
#vertices         | 2–12
#layers           | 2–6
#retrials         | 100
max training time | 3600 secs

A. Datasets of PMLB

DarwinML was tested on the same datasets used in Autostacker [15], where 15 datasets were selected from PMLB [21]. PMLB is a benchmark suite that includes hundreds of public datasets, mainly sourced from OpenML [36] and UCI [37]. These 15 datasets cover various domains in PMLB and can test DarwinML's classification performance on both binary and multi-class tasks. Table I shows the details of these datasets. For each dataset, samples were shuffled and divided into training (80%) and testing (20%) sets. The training set uses cross validation.

As listed in Table II, two tests with different population sizes, which have 120 and 400 individuals to be evaluated in total, were run to compare the performance with different search spaces. The corresponding numbers of generations were set to 10 and 20, respectively. Results from the two population sizes are named "DML120" and "DML400", respectively. In addition, ranges for the numbers of vertices and layers were set to constrain the search space, as done in Kordik et al. [31] and de Sa et al. [32]. The number of retrials is the upper limit on failures when trying to apply evolutionary operators to a graph. If this limit is reached, the individual will be dropped according to sub-section IV-C1. The max training time is the upper limit of the training time that can be used for one graph. If the time runs out, the graph will be dropped as a failure. The parameters p_0 and γ in (3) are set to 0.3 and 1, respectively. The parameter C in the random operation is set to 10.

Results named "DML400+BHO" are obtained by applying BHO [4] to the top five best graphs in the population with 400 individuals. With Bayesian optimization, we tuned up to 40 parameter sets. Following Autostacker [15] and TPOT [30], the balanced accuracy [38] is employed to measure the performance for a fair comparison. On each dataset, DarwinML runs were repeated 10 times with random initialization, and the mean and variance of the balanced accuracy are calculated. Results of the other AutoML methods are collected from the Autostacker paper [15], which was produced with the same setting.

Fig. 6: The scatter plot shows the performance evolution when architectures are searched on the "allhypo" dataset in PMLB [21] with DarwinML. Three architectures are selected to show the flexibility of the proposed evolutionary operations. Each point represents the balanced accuracy evaluated by an individual graph. The points on the curve mark the best individual in each generation.

B. Comparison

Results are listed in Table III. Random forest is chosen as the baseline. Being EA based AutoML models, TPOT [30] and Autostacker [15] evolved for 100 and 3 generations, respectively. Both models include hyperparameter optimization, which improves the performance dramatically. In testing DarwinML, hyperparameter optimization was separated out for a detailed comparison. DML400 achieves better performance than random forest on 14 datasets. Its accuracy is comparable or superior to TPOT on 11 datasets, and comparable or superior to Autostacker and auto-sklearn on 9 datasets. Although the other models except random forest include hyperparameter optimization, DML400 is better than them because DarwinML's inherent mechanism provides flexible model combination which enlarges the architectural search space. Taking BHO as a post-processing step on DML400, DarwinML far exceeds the other models in mean balanced accuracy on most datasets.
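For reference, balanced accuracy can be computed as the unweighted mean of per-class recall; the snippet below is a generic sketch, not tied to the authors' evaluation code.

```python
import numpy as np

def balanced_accuracy(y_true, y_pred):
    """Balanced accuracy as the unweighted mean of per-class recall; for two classes
    this reduces to (sensitivity + specificity) / 2, matching the measure in [38]."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    recalls = [np.mean(y_pred[y_true == c] == c) for c in np.unique(y_true)]
    return float(np.mean(recalls))

# balanced_accuracy([0, 0, 0, 1], [0, 0, 1, 1])  -> (2/3 + 1) / 2 ≈ 0.83
```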

TABLE III: Test Accuracy Comparison. Results on the same 15 PMLB datasets as in Autostacker.
Datasets               | RandomForest | auto-sklearn | TPOT       | Autostacker | DML120     | DML400     | DML400+BHO
monk1                  | 0.98±0.009   | 1±0          | 1±0        | 1±0         | 1±0        | 1±0        | 1±0
parity5                | 0.02±0.053   | 0.87±0.209   | 0.81±0.21  | 0.94±0.138  | 1±0        | 1±0        | 1±0
parity5+5              | 0.60±0.050   | 1±0          | 1±0        | 1±0         | 0.88±0.044 | 0.93±0.030 | 1±0
pima                   | 0.73±0.033   | 0.72±0.040   | 0.73±0.05  | 0.74±0.023  | 0.74±0.009 | 0.77±0.006 | 0.79±0.009
prnn crabs             | 0.95±0.027   | 0.99±0.019   | 1±0.008    | 1±0         | 1±0        | 1±0        | 1±0
allhypo                | 0.79±0.021   | 0.89±0.029   | 0.95±0.046 | 0.94±0.026  | 0.86±0.025 | 0.87±0.015 | 0.97±0.003
spect                  | 0.68±0.068   | 0.71±0.046   | 0.81±0.031 | 0.82±0.04   | 0.83±0.024 | 0.85±0.013 | 0.86±0.010
vehicle                | 0.83±0.021   | 0.90±0.017   | 0.82±0.039 | 0.89±0.044  | 0.84±0.012 | 0.86±0.007 | 0.85±0.005
wine-recognition       | 0.99±0.015   | 0.97±0.021   | 0.98±0.018 | 0.99±0.012  | 1±0        | 1±0        | 1±0
breast-cancer          | 0.59±0.058   | 0.59±0.059   | 0.67±0.090 | 0.66±0.080  | 0.69±0.015 | 0.71±0.012 | 0.77±0.017
cars                   | 0.91±0.034   | 0.97±0.013   | 0.96±0.036 | 0.98±0.014  | 0.93±0.022 | 0.96±0.018 | 0.99±0.007
dis                    | 0.55±0.042   | 0.68±0.069   | 0.76±0.061 | 0.79±0.046  | 0.79±0.033 | 0.82±0.022 | 0.90±0.003
Hill Valley with noise | 0.56±0.027   | 1±0.003      | 0.96±0.043 | 0.98±0.015  | 0.98±0.013 | 0.97±0.010 | 0.97±0.013
ecoli                  | 0.91±0.030   | 0.89±0.062   | 0.86±0.043 | 0.92±0.029  | 0.82±0.030 | 0.85±0.016 | 0.95±0.013
heart-h                | 0.79±0.036   | 0.79±0.042   | 0.81±0.047 | 0.83±0.022  | 0.86±0.013 | 0.85±0.018 | 0.86±0.008

Fig. 6 shows the scatter plot of the accuracy of each individual on the dataset allhypo, from one run of DML400. In the plot, the color of each point represents how the individual was generated. Some individuals ran out of training time and were dropped; their loss was set to infinity and they are excluded from the plot. In the last few generations of the evolution, individuals generated by the random operation appear less and less, while individuals from the heredity and mutation operations become more important for improving the performance.

According to Fig. 6, the accuracy of the best individuals increased continuously from 0.581 to 0.864. In addition, it can be observed in each generation that individuals from heredity and mutation generally perform better than the ones sampled randomly. Some specific examples of applying these evolutionary operations have been observed. Fig. 3 is a concrete example on the dataset "breast-cancer" where the accuracy increased from 0.537 to 0.728 after applying the vertex mutation. Fig. 4 shows a layer mutation applied on "Hill Valley with noise" which increased the accuracy from 0.971 to 1.000. An accuracy increment from 0.723 to 0.778 was observed in experiments on pima after applying the heredity operation, as shown in Fig. 5. These observations demonstrate that DarwinML provides rational and efficient operators for evolving better individuals.

In Fig. 7, we present the best solutions on the heart-h dataset searched by DML120. Surprisingly, we observed that more than one individual achieved the best balanced accuracy, 0.875. The individual shown in Fig. 7b has one edge less than that in Fig. 7a. According to (7), DarwinML prefers a simpler structure if models have nearly the same loss ℓ(·). So DarwinML can find the best architectures with respect to (7). In addition, we also observed optimized architectures with complex structures in the experiments, which are difficult to obtain by manual design.

[Fig. 7 diagrams omitted: (a) Second Generation, ID:28, accuracy=0.875; (b) Third Generation, ID:63, accuracy=0.875; both graphs connect an Input through Normalizer, RidgeClassifier, and KNeighborsRegressor vertices to a final RidgeClassifier.]
Fig. 7: Two individual graphs on the heart-h dataset generated by DML120.

VI. CONCLUSIONS

In this paper, an evolutionary algorithm is proposed to search for the best architecture composed of traditional machine learning models with a graph-based representation. Based on the representation, the random, mutation, and heredity operators are defined and implemented. The evolutionary algorithm is then employed to optimize the architecture. Diversity makes the success of EAs, and the evolutionary operations proposed in this paper enable the search of diverse architectures. With Bayesian hyperparameter optimization applied to the best results of the proposed evolutionary method, the approach demonstrates state-of-the-art performance on the PMLB dataset compared to TPOT, Autostacker, and auto-sklearn.

Though the proposed approach is implemented and tested on a set of traditional machine learning models, there is no inherent limitation that restricts it to these models. In the future, we plan to generalize the approach to neural architecture search by extending the ML models with building blocks of neural networks, to get better performance on large scale datasets.

REFERENCES

[1] I. Guyon, I. Chaabane, H. J. Escalante, S. Escalera, D. Jajetic, J. R. Lloyd, N. Macià, B. Ray, L. Romaszko, M. Sebag, A. Statnikov, S. Treguer, and E. Viegas, "A brief Review of the ChaLearn AutoML Challenge: Any-time Any-dataset Learning without Human Intervention," in Proceedings of the Workshop on Automatic Machine Learning, ser. PMLR, F. Hutter, L. Kotthoff, and J. Vanschoren, Eds., vol. 64, New York, New York, USA, 2016, pp. 21–30.
[2] J. Bergstra and Y. Bengio, "Random Search for Hyper-Parameter Optimization," J. Mach. Learn. Res., vol. 13, pp. 281–305, 2012.
[3] F. Friedrichs and C. Igel, "Evolutionary tuning of multiple SVM parameters," Neurocomputing, vol. 64, pp. 107–117, Mar. 2005.
[4] E. Brochu, V. M. Cora, and N. de Freitas, "A Tutorial on Bayesian Optimization of Expensive Cost Functions, with Application to Active User Modeling and Hierarchical Reinforcement Learning," arXiv:1012.2599, Dec. 2010.

[5] U. Khurana, F. Nargesian, H. Samulowitz, E. Khalil, and D. Turaga, "Automating Feature Engineering," in AI4DS, 2016.
[6] U. Khurana, H. Samulowitz, and D. Turaga, "Feature Engineering for Predictive Modeling Using Reinforcement Learning," in Thirty-Second AAAI Conference on Artificial Intelligence, Apr. 2018.
[7] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet Classification with Deep Convolutional Neural Networks," in Proc. Adv. Neural Inf. Process. Syst. (NIPS), F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, Eds. Curran Associates, Inc., 2012, pp. 1097–1105.
[8] T. Glasmachers, "Limits of End-to-End Learning," in Asian Conference on Machine Learning, Nov. 2017, pp. 17–32.
[9] E. Real, S. Moore, A. Selle, S. Saxena, Y. L. Suematsu, J. Tan, Q. V. Le, and A. Kurakin, "Large-Scale Evolution of Image Classifiers," in Proc. Int. Conf. Mach. Learn. (ICML), Jul. 2017, pp. 2902–2911.
[10] C. Liu, B. Zoph, M. Neumann, J. Shlens, W. Hua, L.-J. Li, L. Fei-Fei, A. Yuille, J. Huang, and K. Murphy, "Progressive Neural Architecture Search," arXiv:1712.00559, Dec. 2017.
[11] C. Thornton, F. Hutter, H. H. Hoos, and K. Leyton-Brown, "Auto-WEKA: Combined Selection and Hyperparameter Optimization of Classification Algorithms," in Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ser. KDD '13, New York, NY, USA, 2013, pp. 847–855.
[12] A. Thakur and A. Krohn-Grimberghe, "AutoCompete: A Framework for Machine Learning Competition," arXiv:1507.02188, Jul. 2015.
[13] M. Feurer, A. Klein, K. Eggensperger, J. Springenberg, M. Blum, and F. Hutter, "Efficient and robust automated machine learning," in Proc. Adv. Neural Inf. Process. Syst. (NIPS), C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, Eds. Curran Associates, Inc., 2015, pp. 2962–2970.
[14] R. S. Olson, N. Bartley, R. J. Urbanowicz, and J. H. Moore, "Evaluation of a Tree-based Pipeline Optimization Tool for Automating Data Science," in Proceedings of the Genetic and Evolutionary Computation Conference 2016, ser. GECCO'16. New York, NY, USA: ACM, 2016, pp. 485–492.
[15] B. Chen, H. Wu, W. Mo, I. Chattopadhyay, and H. Lipson, "Autostacker: A Compositional Evolutionary Learning System," in Proceedings of the Genetic and Evolutionary Computation Conference, ser. GECCO'18. New York, NY, USA: GECCO, 2018, pp. 402–409.
[16] N. L. Cramer, "A Representation for the Adaptive Generation of Simple Sequential Programs," in Proceedings of the 1st International Conference on Genetic Algorithms. Hillsdale, NJ, USA: L. Erlbaum Associates Inc., 1985, pp. 183–187.
[17] J. F. Miller, "Cartesian Genetic Programming," in Cartesian Genetic Programming, ser. Natural Computing Series. Springer-Verlag Berlin Heidelberg, 2011, pp. 17–34.
[18] J. R. Koza, Genetic Programming II: Automatic Discovery of Reusable Programs. Cambridge, MA, USA: MIT Press, 1994.
[19] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, M. Kudlur, J. Levenberg, R. Monga, S. Moore, D. G. Murray, B. Steiner, P. Tucker, V. Vasudevan, P. Warden, M. Wicke, Y. Yu, and X. Zheng, "TensorFlow: A System for Large-Scale Machine Learning," in 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16). Savannah, GA: USENIX Association, 2016, pp. 265–283.
[20] M. Graff, E. S. Tellez, S. Miranda-Jiménez, and H. J. Escalante, "EvoDAG: A semantic Genetic Programming Python library," in 2016 IEEE International Autumn Meeting on Power, Electronics and Computing (ROPEC), Nov. 2016, pp. 1–6.
[21] R. S. Olson, W. La Cava, P. Orzechowski, R. J. Urbanowicz, and J. H. Moore, "PMLB: A large benchmark suite for machine learning evaluation and comparison," BioData Mining, vol. 10, p. 36, Dec. 2017.
[22] I. Guyon, K. Bennett, G. Cawley, H. J. Escalante, S. Escalera, T. K. Ho, N. Macià, B. Ray, M. Saeed, A. Statnikov, and E. Viegas, "Design of the 2015 ChaLearn AutoML challenge," in Int. Joint Conf. Neural Networks (IJCNN), Jul. 2015, pp. 1–8.
[23] F. Hutter, H. H. Hoos, and K. Leyton-Brown, "Sequential Model-Based Optimization for General Algorithm Configuration," in Learning and Intelligent Optimization, ser. Lecture Notes in Computer Science. Springer, Berlin, Heidelberg, Jan. 2011, pp. 507–523.
[24] L. Kotthoff, C. Thornton, H. H. Hoos, F. Hutter, and K. Leyton-Brown, "Auto-WEKA 2.0: Automatic model selection and hyperparameter optimization in WEKA," J. Mach. Learn. Res., vol. 18, no. 25, pp. 1–5, 2017.
[25] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I. H. Witten, "The WEKA Data Mining Software: An Update," ACM SIGKDD Explorations Newsletter, vol. 11, no. 1, pp. 10–18, Nov. 2009.
[26] A. H. Gandomi, A. H. Alavi, and C. Ryan, Eds., Handbook of Genetic Programming Applications. Springer International Publishing, 2015.
[27] M. Graff, E. S. Tellez, H. J. Escalante, and S. Miranda-Jiménez, "Semantic Genetic Programming for Sentiment Analysis," in NEO 2015: Results of the Numerical and Evolutionary Optimization Workshop NEO 2015 Held at September 23-25 2015 in Tijuana, Mexico, ser. Studies in Computational Intelligence, O. Schütze, L. Trujillo, P. Legrand, and Y. Maldonado, Eds. Cham: Springer International Publishing, 2017, pp. 43–65.
[28] T. P. Pawlak, B. Wieloch, and K. Krawiec, "Semantic Backpropagation for Designing Search Operators in Genetic Programming," IEEE Transactions on Evolutionary Computation, vol. 19, no. 3, pp. 326–340, Jun. 2015.
[29] D. Ashlock and J. Tsang, "Evolving fractal art with a directed acyclic graph genetic programming representation," in 2015 IEEE Congress on Evolutionary Computation (CEC), May 2015, pp. 2137–2144.
[30] R. S. Olson, R. J. Urbanowicz, P. C. Andrews, N. A. Lavender, L. C. Kidd, and J. H. Moore, "Automating Biomedical Data Science Through Tree-Based Pipeline Optimization," in Applications of Evolutionary Computation, ser. Lecture Notes in Computer Science, vol. 9597. Springer, Cham, Mar. 2016, pp. 123–137.
[31] P. Kordík, J. Černý, and T. Frýda, "Discovering predictive ensembles for transfer learning and meta-learning," Machine Learning, vol. 107, no. 1, pp. 177–207, 2018.
[32] A. G. de Sá, W. J. G. Pinto, L. O. V. Oliveira, and G. L. Pappa, "RECIPE: a grammar-based framework for automatically evolving classification pipelines," in European Conference on Genetic Programming. Springer, 2017, pp. 246–261.
[33] A. A. Ghorbani and K. Owrangh, "Stacked generalization in neural networks: generalization on statistically neutral problems," in Neural Networks, 2001. Proceedings. IJCNN'01. International Joint Conference on, vol. 3. IEEE, 2001, pp. 1715–1720.
[34] A. B. Kahn, "Topological Sorting of Large Networks," Communications of the ACM, vol. 5, no. 11, pp. 558–562, Nov. 1962.
[35] D. E. Goldberg and K. Deb, "A Comparative Analysis of Selection Schemes Used in Genetic Algorithms," in Foundations of Genetic Algorithms, G. J. Rawlins, Ed. Elsevier, 1992, vol. 1, pp. 69–93.
[36] J. Vanschoren, J. N. van Rijn, B. Bischl, and L. Torgo, "OpenML: Networked Science in Machine Learning," ACM SIGKDD Explorations Newsletter, vol. 15, no. 2, pp. 49–60, Jun. 2014.
[37] D. Dheeru and E. Karra Taniskidou, "UCI Machine Learning Repository," University of California, Irvine, School of Information and Computer Sciences, Tech. Rep., 2017.
[38] D. R. Velez, B. C. White, A. A. Motsinger, W. S. Bush, M. D. Ritchie, S. M. Williams, and J. H. Moore, "A Balanced Accuracy Function for Epistasis Modeling in Imbalanced Datasets Using Multifactor Dimensionality Reduction," Genetic Epidemiology, vol. 31, no. 4, pp. 306–315, May 2007.