Towards the Evolution of Multi-Layered Neural Networks: A Dynamic Structured Grammatical Evolution Approach
Filipe Assunção, Nuno Lourenço, Penousal Machado, Bernardete Ribeiro
CISUC, Department of Informatics Engineering, University of Coimbra, Portugal
{fga,naml,machado,bribeiro}@dei.uc.pt
ABSTRACT
Current grammar-based NeuroEvolution approaches have several shortcomings. On the one hand, they do not allow the generation of Artificial Neural Networks (ANNs) composed of more than one hidden-layer. On the other, there is no way to evolve networks with more than one output neuron. To properly evolve ANNs with more than one hidden-layer and multiple output nodes there is the need to know the number of neurons available in previous layers. In this paper we introduce Dynamic Structured Grammatical Evolution (DSGE): a new genotypic representation that overcomes the aforementioned limitations. By enabling the creation of dynamic rules that specify the connection possibilities of each neuron, the methodology enables the evolution of multi-layered ANNs with more than one output neuron. Results in different classification problems show that DSGE evolves effective single and multi-layered ANNs, with a varying number of output neurons.

CCS CONCEPTS
• Computing methodologies → Neural networks; Genetic programming; Supervised learning by classification; • Theory of computation → Genetic programming;

KEYWORDS
NeuroEvolution, Artificial Neural Networks, Classification, Grammar-based Genetic Programming

ACM Reference format:
Filipe Assunção, Nuno Lourenço, Penousal Machado, Bernardete Ribeiro. 2017. Towards the Evolution of Multi-Layered Neural Networks: A Dynamic Structured Grammatical Evolution Approach. In Proceedings of GECCO '17, Berlin, Germany, July 15-19, 2017, 8 pages. DOI: http://dx.doi.org/10.1145/3071178.3071286

1 INTRODUCTION
Machine Learning (ML) approaches, such as Artificial Neural Networks (ANNs), are often used to learn how to distinguish between multiple classes of a given problem. However, to reach near-optimal classifiers a laborious process of trial-and-error is needed to hand-craft and tune the parameters of ML methodologies. In the specific case of ANNs there are at least two manual steps that need to be considered: (i) the definition of the topology of the network, i.e., the number of hidden-layers, the number of neurons of each hidden-layer, and how the layers should be connected to each other; and (ii) the choice and parameterisation of the learning algorithm that is used to tune the weights and bias of the network connections (e.g., initial weights distribution and learning rate).

Evolutionary Artificial Neural Networks (EANNs), or NeuroEvolution, refers to methodologies that aim at the automatic search and optimisation of the ANNs' parameters using Evolutionary Computation (EC). With the popularisation of Deep Learning (DL) and the need for ANNs with a larger number of hidden-layers, NeuroEvolution has been vastly used in recent works [19].

The goal of the current work is the proposal of a novel Grammar-based Genetic Programming (GGP) methodology for evolving the topologies, weights and bias of ANNs. We rely on a GGP methodology because in this way we allow the direct encoding of different topologies for solving distinct problems in a plug-and-play fashion, requiring the user only to set a grammar capable of describing the parameters of the ANN to be optimised. One of the shortcomings of previous GGP methodologies applied to NeuroEvolution is the fact that they are limited to the evolution of network topologies with one hidden-layer. The proposed approach, called Dynamic Structured Grammatical Evolution (DSGE), solves this constraint.

The remainder of the document is organised as follows: In Section 2 we detail Structured Grammatical Evolution (SGE) and survey the state of the art on NeuroEvolution; Then, in Sections 3 and 4, DSGE and its adaptation to the evolution of multi-layered ANNs are presented, respectively; The results and the comparison of DSGE with other GGP approaches are presented in Section 5; Finally, in Section 6, conclusions are drawn and future work is addressed.

2 BACKGROUND AND STATE OF THE ART
In the following sub-sections we present SGE, which serves as base for the development of DSGE. Then, we survey EC approaches for the evolution of ANNs.

2.1 Structured Grammatical Evolution
Structured Grammatical Evolution (SGE) was proposed by Lourenço et al. [11] as a new genotypic representation for Grammatical Evolution (GE) [13]. The new representation aims at solving the redundancy and locality issues in GE, and consists of a list of genes, one for each non-terminal symbol. Furthermore, a gene is a list of integers of the size of the maximum number of possible expansions for the non-terminal it encodes; each integer is a value in the interval [0, non_terminal_possibilities − 1], where non_terminal_possibilities is the number of possibilities for the expansion of the considered non-terminal symbol. Consequently, there is the need to pre-process the input grammar to find the maximum number of expansions for each non-terminal symbol.
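To make the representation concrete, the sketch below builds a random SGE-style genotype for a toy grammar: one gene per non-terminal, each gene holding one integer per allowed expansion, every integer drawn from [0, non_terminal_possibilities − 1]. This is a minimal illustration written for this text; the grammar, symbol names and expansion bounds are assumptions, not the ones used in the paper.

```python
# Minimal sketch of an SGE-style genotype; the grammar and the expansion
# bounds below are illustrative assumptions, not taken from the paper.
import random

# Each non-terminal maps to its list of expansion options.
GRAMMAR = {
    "<expr>": [["<expr>", "+", "<term>"], ["<term>"]],
    "<term>": [["x"], ["1.0"]],
}

# Maximum number of times each non-terminal may be expanded in a derivation
# (in SGE this bound comes from pre-processing the grammar).
MAX_EXPANSIONS = {"<expr>": 3, "<term>": 3}

def random_genotype():
    """One gene per non-terminal; each gene holds one integer per possible
    expansion, drawn from [0, number_of_options - 1]."""
    return {
        nt: [random.randrange(len(options)) for _ in range(MAX_EXPANSIONS[nt])]
        for nt, options in GRAMMAR.items()
    }

if __name__ == "__main__":
    print(random_genotype())
    # e.g. {'<expr>': [1, 0, 1], '<term>': [0, 1, 0]}
```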
Table 1: Example of the mapping procedure.
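As a complementary illustration of the mapping procedure referred to in Table 1, the sketch below decodes such a genotype into a phenotype: for each non-terminal, the next unused integer of its gene selects which expansion is applied in a leftmost derivation. The toy grammar and genotype values are hypothetical, not those of Table 1.

```python
# Illustrative genotype-to-phenotype mapping for an SGE-style representation.
# Grammar and genotype are hypothetical; only the mechanism is of interest.
GRAMMAR = {
    "<expr>": [["<expr>", "+", "<term>"], ["<term>"]],
    "<term>": [["x"], ["1.0"]],
}

def decode(genotype, start="<expr>"):
    used = {nt: 0 for nt in GRAMMAR}           # next unread integer of each gene
    symbols = [start]                          # leftmost derivation
    output = []
    while symbols:
        sym = symbols.pop(0)
        if sym in GRAMMAR:
            choice = genotype[sym][used[sym]]  # pick the expansion for this symbol
            used[sym] += 1
            symbols = GRAMMAR[sym][choice] + symbols
        else:
            output.append(sym)
    return " ".join(output)

print(decode({"<expr>": [0, 1, 1], "<term>": [0, 1]}))  # -> "x + 1.0"
```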
Figure 2: Grammar used in [2]. n represents the number of features of the problem.

The choice of the cutting point is done at the gene level and not at the integer level, i.e., what is exchanged between parents are whole genes and not the integers within them. Given two parents [[0], [0], [2], | [0,0,1], [2,5,9]] and [[0], [0], [1], | [0,0,0,1], [1,0,2,4]], where | denotes the cutting point, the generated offspring would be [[0], [0], [2], [0,0,0,1], [1,0,2,4]] and [[0], [0], [1], [0,0,1], [2,5,9]].

3.5 Fitness Evaluation
To enable the comparison of the evolved ANNs with the results from previous work [2], the performance is measured as the Root Mean Squared Error (RMSE) obtained while solving a classification task.

However, there is no way of knowing how many neurons are in the previous layer, so that the established connections are valid. In the next sections we detail the adaptations we introduce to allow the evolution of multi-layered networks.

4.1 Dynamic Production Rules
To know the neurons available in each hidden-layer we create new production rules at run-time. More precisely, considering the grammar of Figure 3, for each i-th
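Returning to the gene-level crossover described above, the following sketch reproduces the example given in the text. The function name and the fixed cutting point are illustrative (in practice the cutting point would be chosen at random), but the parents and offspring match the ones listed above.

```python
# Gene-level one-point crossover: whole genes are exchanged after the cutting
# point, never the integers inside a gene.
def crossover(parent1, parent2, cut):
    child1 = parent1[:cut] + parent2[cut:]
    child2 = parent2[:cut] + parent1[cut:]
    return child1, child2

p1 = [[0], [0], [2], [0, 0, 1], [2, 5, 9]]
p2 = [[0], [0], [1], [0, 0, 0, 1], [1, 0, 2, 4]]
c1, c2 = crossover(p1, p2, cut=3)
print(c1)  # [[0], [0], [2], [0, 0, 0, 1], [1, 0, 2, 4]]
print(c2)  # [[0], [0], [1], [0, 0, 1], [2, 5, 9]]
```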
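Regarding the dynamic production rules of Section 4.1, the sketch below shows one way such a rule could be assembled at run-time: for a neuron placed in a given hidden-layer, the rule enumerates the units it may connect to, i.e., the problem features plus the neurons of the previous layers. The rule name and its format are assumptions made for illustration; they follow the general idea stated above rather than the exact grammar of Figure 3.

```python
# Illustrative only: builds, at run-time, a production rule listing the
# connection possibilities of a neuron in hidden-layer `layer_index`.
# Rule naming and format are assumptions, not taken from Figure 3.
def connection_rule(layer_index, num_features, neurons_per_layer):
    options = [f"x{j}" for j in range(num_features)]        # input features
    for prev in range(layer_index):                          # neurons of previous layers
        options += [f"h{prev},{j}" for j in range(neurons_per_layer[prev])]
    return {f"<features-{layer_index}>": [[opt] for opt in options]}

# Example: 4 input features, two previous hidden-layers with 3 and 2 neurons.
print(connection_rule(2, 4, [3, 2]))
# {'<features-2>': [['x0'], ['x1'], ['x2'], ['x3'],
#                   ['h0,0'], ['h0,1'], ['h0,2'], ['h1,0'], ['h1,1']]}
```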
Table 3: Experimental parameters.

  Parameter                                Value
  Num. runs                                30
  Population size                          100
  Num. generations (1 hidden-layer)        500
  Num. generations (≥ 1 hidden-layers)     1500 / 3500
  Crossover rate                           95%
  Mutation rate                            1 mutation in 95% of the individuals
  Tournament size                          3
  Elite size                               1%
  GE: Individual size                      200
  GE: Wrapping                             0
  SGE: Recursion level                     6
  DSGE: Max. depth (1 hidden-layer)        {sigexpr: 6, sum: 3}
  DSGE: Max. depth (≥ 1 hidden-layers)     {hidden-layers: 3, nodes: 5, sum: 4}
  Dataset: Training percentage             70%
  Dataset: Testing percentage              30%

… symbol, due to the higher difficulty of tuning parameters in deeper and more complex topologies.

5.3 Experimental Setup
Table 3 details the experimental parameters used for the tests conducted with GE, SGE and DSGE. The parameters where 1 hidden-layer and ≥ 1 hidden-layers are mentioned refer to the experiments conducted with the grammars of Figures 2 and 3, respectively.

To make the exploration of the search space similar across methods we only allow one mutation in 95% of the population individuals. In the search for multi-layered networks the domain is bigger and, consequently, we perform longer runs. For the experiments focusing on the generation of one-hidden-layered networks we restrict the domains so that the approaches generate similar networks in terms of the neurons and connections they can have: GE, SGE and one-hidden-layered DSGE are constrained to generating networks with up to 7 neurons and 8 connections per neuron. In the experiments targeting networks that can have more than one hidden-layer these limits are increased: the maximum number of neurons in each layer is 32, the maximum number of connections of each neuron is 16, and it is possible to generate networks with up to 8 hidden-layers.

The datasets are all randomly partitioned in the same way: 70% of the instances of each class are used for training and the remaining 30% for testing. We only measure the individuals' performance using the train data; the test data is kept aside from the evolutionary process and is exclusively used for validation purposes, i.e., to evaluate the behaviour of the networks in the classification of instances that have not been seen during the creation of the ANN. The datasets are used as they are obtained, i.e., no pre-processing or data augmentation methodologies are applied.

5.4 Evolution of One-Hidden-Layered ANNs
To investigate whether or not DSGE performs better than previous grammar-based approaches we focus on experiments for the evolution of one-hidden-layered ANNs. For that, we use the grammar of Figure 2.

Figure 4 shows the evolution of the fitness across generations for GE, SGE and DSGE on the sonar dataset. Because of space constraints we only present the fitness evolution results on the most challenging dataset. Results are averages of 30 independent runs. The figure clearly shows that DSGE outperforms the other methods, indicating that the new representation, where no mutation is silent, promotes locality and the efficient exploration of the search space.

Figure 4: Fitness evolution of the best individuals across generations for the sonar dataset (x-axis: generation, 0–500; y-axis: fitness; series: GE, SGE, DSGE).

Table 4 reports the results obtained with DSGE, which are compared with other grammar-based approaches: GE and SGE. Results are averages of the best network (in terms of fitness) of each of the evolutionary runs, and are formatted as mean ± standard deviation. For all experiments we record fitness, RMSE, Area Under the ROC Curve (AUROC) and f-measure. Except for the fitness metric, which is only computed over the train set, all metrics are calculated for the train and test sets. In addition, we also report the average number of neurons and used features.

An analysis of the results shows that DSGE performs better than the other approaches for all the datasets and for all the metrics considered. Regarding the structure of the generated networks, DSGE is capable of finding solutions which are more complex in terms of the number of used neurons and input features. As the complexity of the networks grows, more real values have to be tuned; nonetheless, DSGE is capable of performing such tuning, reaching solutions that, despite being more complex, perform better in the tested classification problems.

To verify if the differences between the tested approaches are significant we perform a statistical analysis. In [2] we have already demonstrated that SGE is consistently statistically superior to GE. As such, we now focus on analysing whether DSGE is statistically superior to SGE. To check if the samples follow a normal distribution we use the Kolmogorov-Smirnov and Shapiro-Wilk tests, with a significance level of α = 0.05. The tests revealed that the data do not follow a normal distribution and, as such, a non-parametric test (Mann-Whitney U, α = 0.05) is used to perform the pairwise comparison for each recorded metric. Table 4 uses a graphical overview to present the results of the statistical analysis: ∼ indicates no statistical difference between DSGE and SGE, and + signals that DSGE is statistically superior to SGE. The effect size is denoted by the number of + signals, where +, ++ and +++ correspond respectively to low (0.1 ≤ r < 0.3), medium (0.3 ≤ r < 0.5) and large (r ≥ 0.5) effect sizes. The statistical analysis reveals that DSGE is consistently statistically superior to SGE: DSGE is never worse than SGE and is only equivalent in 5 situations. In all the remaining 31 comparisons, DSGE is statistically superior, with a medium effect size in 11 occasions and a large effect size in 20 occasions.
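The statistical protocol just described can be sketched as follows. The samples below are synthetic stand-ins for the 30 best-of-run values per approach, and the effect-size formula r = |Z|/√N is a common convention assumed here for illustration; the paper only states the 0.1/0.3/0.5 thresholds.

```python
# Sketch of the per-metric statistical comparison: normality checks at
# alpha = 0.05, followed by a two-sided Mann-Whitney U test and an
# (assumed) effect size r = |Z| / sqrt(N).
import numpy as np
from scipy import stats

ALPHA = 0.05

def is_normal(sample):
    _, p_sw = stats.shapiro(sample)
    _, p_ks = stats.kstest(sample, "norm", args=(np.mean(sample), np.std(sample)))
    return p_sw > ALPHA and p_ks > ALPHA

def compare(sample_a, sample_b):
    _, p = stats.mannwhitneyu(sample_a, sample_b, alternative="two-sided")
    z = abs(stats.norm.ppf(p / 2.0))          # |Z| recovered from the two-sided p-value
    r = z / np.sqrt(len(sample_a) + len(sample_b))
    label = "large" if r >= 0.5 else "medium" if r >= 0.3 else "low" if r >= 0.1 else "negligible"
    return p, r, label

# Synthetic stand-in data: 30 best-of-run fitness values per approach.
rng = np.random.default_rng(0)
dsge = rng.normal(1.16, 0.13, 30)
sge = rng.normal(1.32, 0.25, 30)
print("normal?", is_normal(dsge), is_normal(sge))
print(compare(dsge, sge))
```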
Table 4: Evolution of one-hidden-layered ANNs. Comparison between DSGE and other grammar-based approaches. Results are averages of 30 independent runs. + and ∼ represent the result of statistical tests (see text).

  Metric            Approach   Flame              WDBC               Ionosphere         Sonar
  Fitness           GE         1.58 ± 0.36        1.55 ± 0.18        1.82 ± 0.28        2.01 ± 0.23
                    SGE        1.32 ± 0.25        1.46 ± 0.08        1.48 ± 0.18        1.85 ± 0.18
                    DSGE       1.16 ± 0.13 (+++)  1.36 ± 0.06 (+++)  1.38 ± 0.13 (+++)  1.73 ± 0.12 (+++)
  RMSE (train)      GE         0.28 ± 0.15        0.24 ± 0.09        0.33 ± 0.10        0.38 ± 0.08
                    SGE        0.16 ± 0.13        0.19 ± 0.03        0.21 ± 0.07        0.34 ± 0.06
                    DSGE       0.08 ± 0.06 (+++)  0.15 ± 0.03 (+++)  0.17 ± 0.05 (++)   0.30 ± 0.05 (++)
  Accuracy (train)  GE         0.90 ± 0.10        0.92 ± 0.12        0.79 ± 0.20        0.76 ± 0.16
                    SGE        0.96 ± 0.08        0.95 ± 0.02        0.93 ± 0.11        0.84 ± 0.12
                    DSGE       0.99 ± 0.03 (+++)  0.97 ± 0.01 (+++)  0.97 ± 0.03 (+++)  0.90 ± 0.04 (++)
  AUROC (train)     GE         0.96 ± 0.05        0.98 ± 0.02        0.86 ± 0.18        0.88 ± 0.06
                    SGE        0.98 ± 0.04        0.99 ± 0.01        0.94 ± 0.04        0.91 ± 0.04
                    DSGE       0.99 ± 0.01 (+++)  0.99 ± 0.00 (+++)  0.96 ± 0.03 (+++)  0.93 ± 0.04 (∼)
  F-measure (train) GE         0.91 ± 0.09        0.88 ± 0.18        0.78 ± 0.32        0.70 ± 0.29
                    SGE        0.96 ± 0.08        0.93 ± 0.03        0.93 ± 0.18        0.78 ± 0.23
                    DSGE       0.99 ± 0.02 (+++)  0.96 ± 0.02 (+++)  0.98 ± 0.02 (+++)  0.88 ± 0.05 (++)
  RMSE (test)       GE         0.31 ± 0.15        0.27 ± 0.09        0.38 ± 0.07        0.45 ± 0.06
                    SGE        0.22 ± 0.13        0.23 ± 0.04        0.32 ± 0.05        0.44 ± 0.04
                    DSGE       0.14 ± 0.08 (++)   0.20 ± 0.03 (++)   0.28 ± 0.04 (+++)  0.43 ± 0.04 (∼)
  Accuracy (test)   GE         0.88 ± 0.11        0.90 ± 0.12        0.76 ± 0.18        0.68 ± 0.12
                    SGE        0.93 ± 0.09        0.93 ± 0.02        0.87 ± 0.10        0.73 ± 0.09
                    DSGE       0.97 ± 0.05 (+++)  0.95 ± 0.02 (+++)  0.90 ± 0.03 (++)   0.76 ± 0.05 (∼)
  AUROC (test)      GE         0.93 ± 0.08        0.97 ± 0.03        0.83 ± 0.09        0.81 ± 0.05
                    SGE        0.96 ± 0.08        0.98 ± 0.02        0.90 ± 0.05        0.82 ± 0.05
                    DSGE       0.99 ± 0.03 (++)   0.98 ± 0.01 (++)   0.93 ± 0.04 (++)   0.83 ± 0.04 (∼)
  F-measure (test)  GE         0.89 ± 0.09        0.86 ± 0.18        0.76 ± 0.31        0.61 ± 0.25
                    SGE        0.94 ± 0.09        0.91 ± 0.03        0.89 ± 0.17        0.64 ± 0.20
                    DSGE       0.98 ± 0.04 (+++)  0.93 ± 0.03 (+++)  0.93 ± 0.02 (++)   0.72 ± 0.06 (∼)
  Num. Neurons      GE         3.33 ± 1.40        3.13 ± 1.53        2.50 ± 1.41        2.53 ± 1.20
                    SGE        4.87 ± 1.83        3.73 ± 1.53        3.53 ± 1.36        3.07 ± 1.39
                    DSGE       6.47 ± 1.20        6.23 ± 1.58        5.97 ± 1.78        6.13 ± 1.69
  Num. Features     GE         1.97 ± 0.18        8.40 ± 3.81        7.33 ± 5.33        9.40 ± 5.73
                    SGE        2.00 ± 0.00        12.0 ± 6.51        12.1 ± 5.79        13.3 ± 6.42
                    DSGE       2.00 ± 0.00        14.5 ± 3.52        13.3 ± 4.74        17.6 ± 4.66

Table 5: DSGE evolution of multi-layered ANNs. + and ∼ symbols represent the result of statistical tests, and have the same meaning as in Table 4.

  Metric            Row        Flame              WDBC               Ionosphere         Sonar
  Fitness           1 H-L      1.16 ± 0.13        1.36 ± 0.06        1.38 ± 0.13        1.73 ± 0.12
                    ≥ 1 H-Ls   1.15 ± 0.18 (∼)    1.35 ± 0.13 (∼)    1.36 ± 0.19 (∼)    1.57 ± 0.18 (∼)
                    > 1 H-Ls   1.08 ± 0.13 (+++)  1.32 ± 0.11 (∼)    1.30 ± 0.11 (+++)  1.53 ± 0.15 (+++)
  RMSE (train)      1 H-L      0.08 ± 0.06        0.15 ± 0.03        0.17 ± 0.05        0.30 ± 0.05
                    ≥ 1 H-Ls   0.07 ± 0.08 (∼)    0.15 ± 0.05 (∼)    0.17 ± 0.08 (∼)    0.26 ± 0.07 (∼)
                    > 1 H-Ls   0.07 ± 0.05 (+++)  0.07 ± 0.04 (∼)    0.14 ± 0.05 (+++)  0.25 ± 0.06 (+++)
  Accuracy (train)  1 H-L      0.99 ± 0.03        0.97 ± 0.01        0.97 ± 0.03        0.90 ± 0.04
                    ≥ 1 H-Ls   0.99 ± 0.02 (∼)    0.97 ± 0.02 (∼)    0.97 ± 0.03 (∼)    0.92 ± 0.05 (∼)
                    > 1 H-Ls   0.99 ± 0.02 (∼)    0.97 ± 0.02 (∼)    0.98 ± 0.02 (∼)    0.93 ± 0.04 (+++)
  AUROC (train)     1 H-L      0.99 ± 0.01        0.99 ± 0.00        0.96 ± 0.03        0.93 ± 0.04
                    ≥ 1 H-Ls   0.99 ± 0.01 (∼)    0.99 ± 0.01 (∼)    0.96 ± 0.04 (∼)    0.93 ± 0.05 (∼)
                    > 1 H-Ls   1.00 ± 0.01 (∼)    0.99 ± 0.00 (∼)    0.97 ± 0.02 (∼)    0.93 ± 0.04 (∼)
  F-measure (train) 1 H-L      0.99 ± 0.02        0.96 ± 0.02        0.98 ± 0.02        0.88 ± 0.05
                    ≥ 1 H-Ls   0.99 ± 0.02 (∼)    0.96 ± 0.03 (∼)    0.97 ± 0.02 (∼)    0.91 ± 0.06 (∼)
                    > 1 H-Ls   0.99 ± 0.01 (∼)    0.97 ± 0.03 (∼)    0.98 ± 0.01 (+++)  0.92 ± 0.05 (+++)
  RMSE (test)       1 H-L      0.14 ± 0.08        0.20 ± 0.03        0.28 ± 0.04        0.43 ± 0.04
                    ≥ 1 H-Ls   0.17 ± 0.08 (∼)    0.20 ± 0.04 (∼)    0.31 ± 0.05 (∼)    0.44 ± 0.05 (∼)
                    > 1 H-Ls   0.16 ± 0.08 (∼)    0.20 ± 0.04 (∼)    0.29 ± 0.05 (∼)    0.43 ± 0.06 (∼)
  Accuracy (test)   1 H-L      0.97 ± 0.05        0.95 ± 0.02        0.90 ± 0.03        0.76 ± 0.05
                    ≥ 1 H-Ls   0.96 ± 0.04 (∼)    0.95 ± 0.02 (∼)    0.89 ± 0.03 (∼)    0.77 ± 0.05 (∼)
                    > 1 H-Ls   0.97 ± 0.04 (∼)    0.95 ± 0.02 (∼)    0.90 ± 0.03 (∼)    0.78 ± 0.06 (∼)
  AUROC (test)      1 H-L      0.99 ± 0.03        0.98 ± 0.01        0.93 ± 0.04        0.83 ± 0.04
                    ≥ 1 H-Ls   0.98 ± 0.03 (∼)    0.99 ± 0.01 (∼)    0.91 ± 0.06 (∼)    0.82 ± 0.07 (∼)
                    > 1 H-Ls   0.99 ± 0.02 (∼)    0.99 ± 0.01 (∼)    0.93 ± 0.03 (∼)    0.82 ± 0.07 (∼)
  F-measure (test)  1 H-L      0.98 ± 0.04        0.93 ± 0.03        0.93 ± 0.02        0.72 ± 0.06
                    ≥ 1 H-Ls   0.97 ± 0.03 (∼)    0.93 ± 0.03 (∼)    0.92 ± 0.02 (∼)    0.73 ± 0.07 (∼)
                    > 1 H-Ls   0.97 ± 0.03 (∼)    0.93 ± 0.03 (∼)    0.92 ± 0.02 (∼)    0.74 ± 0.08 (∼)
  Num. H-Ls         1 H-L      1.00 ± 0.00        1.00 ± 0.00        1.00 ± 0.00        1.00 ± 0.00
                    ≥ 1 H-Ls   2.27 ± 0.98        2.23 ± 1.04        1.90 ± 0.84        2.37 ± 0.96
                    > 1 H-Ls   2.81 ± 0.60        2.95 ± 0.52        2.50 ± 0.51        2.95 ± 1.12
  Num. Neurons      1 H-L      3.33 ± 1.40        3.13 ± 1.53        2.50 ± 1.41        2.53 ± 0.38
                    ≥ 1 H-Ls   10.6 ± 6.49        9.73 ± 7.03        7.17 ± 4.98        9.27 ± 4.55
                    > 1 H-Ls   12.5 ± 6.64        12.8 ± 6.70        9.33 ± 5.24        10.9 ± 4.12
  Num. Features     1 H-L      2.00 ± 0.00        14.5 ± 3.52        13.1 ± 6.87        19.8 ± 8.39
                    ≥ 1 H-Ls   2.00 ± 0.00        16.3 ± 7.03        13.8 ± 6.66        21.8 ± 8.48
                    > 1 H-Ls   2.00 ± 0.00        18.2 ± 6.91        15.8 ± 5.48        24.0 ± 7.54

Focusing on the comparison between train and test results, in DSGE the differences between train and test performance are, on average, 0.09, 0.06, 0.04 and 0.06 for RMSE, accuracy, AUROC and f-measure, respectively. For SGE the differences are 0.08, 0.06, 0.04 and 0.06, and for GE 0.05, 0.04, 0.04 and 0.04, for the RMSE, accuracy, AUROC and f-measure, respectively. Despite the fact that the differences in DSGE are larger than those of SGE and GE, it is our perception that this is a result of the better DSGE results and not an indicator of overfitting.

In [2] we showed that the results obtained with SGE are superior to those of other grammar-based approaches (namely, the ones described in [1, 16, 20]). Additionally, we showed that it is beneficial to evolve both the topology and the weights of a network, since it yields better results than ANNs obtained by hand-crafting the topology and training the networks using backpropagation. As DSGE is statistically superior to SGE, it is then clear that it is also better than previous methods.

5.5 Evolution of Multi-Layered ANNs
To search for ANNs that may have more than one hidden-layer we use the grammar of Figure 3.

The experimental results are presented in Table 5. For each metric, the first two rows present the averages of the tests conducted with the grammars of Figures 2 and 3, respectively, i.e., the grammars for encoding networks with just one hidden-layer (1 H-L, from Table 4) or that allow the generation of networks with multiple hidden-layers (≥ 1 H-Ls). Additionally, we analyse the ANNs that have more than one hidden-layer (last row, > 1 H-Ls), discarding the evolutionary runs that resulted in networks with just one hidden-layer.

Results of 1 H-L and ≥ 1 H-Ls are averages of 30 independent runs; in the > 1 H-Ls row, results are averages of 21, 19, 18 and 21 independent runs for the flame, WDBC, ionosphere and sonar datasets, respectively. That is, for the flame, WDBC, ionosphere and sonar datasets, in 9, 11, 12 and 9 runs the best network is composed of only one hidden-layer. For flame and WDBC we perform runs with 1500 generations, and for ionosphere and sonar 3500 generations are used. The rationale behind the different number of generations is related to the number of features of each problem and the possible progression margin: in flame and WDBC the metrics in train and test already have values close to the optimum, thus suggesting that one hidden-layer is likely enough to solve the classification task.

The experimental results show that by using DSGE it is possible to evolve effective multi-layered ANNs. By comparing the first two rows (1 H-L and ≥ 1 H-Ls) it is also clear that the results are equivalent, which is confirmed by a statistical analysis (Mann-Whitney, α = 0.05) that shows no statistical differences. Focusing on the comparison between one-hidden-layered ANNs and those that have more than one hidden-layer (> 1 H-Ls), a small but perceptible difference exists. That difference is statistically significant in some of the training metrics, and in none of the testing metrics. When there is a statistical difference the effect size is always large. Moreover, the difference is larger in the datasets that have more input features and a greater margin for improvement. On the contrary, almost no improvement is observable in the flame dataset, which is expected as the problem only has two features, and thus an ANN with one hidden-layer already performs close to optimal. In a nutshell, although there are no statistical differences between the performance of 1 H-L and ≥ 1 H-Ls runs, when the evolutionary process results in multi-layered ANNs the differences begin to emerge. This result shows that DSGE is able to cope with the higher dimensionality of the search space associated with the evolution of multi-layered topologies and indicates that in complex datasets, where there is a clear advantage in using deeper topologies, DSGE is likely to outperform other grammar-based approaches in train and test.

The complexity of the generated ANNs (in terms of the number of used neurons and hidden-layers) is greater when allowing the generation of multi-layered ANNs. The difference is even larger if we consider only those networks that have more than one hidden-layer. More neurons mean more weights and bias values that have to be evolved, making the evolutionary task more difficult. ANNs with fewer layers tend to have fewer neurons, and consequently benefit from an evolutionary point of view, as fewer real values need to be tuned. Thus, in future experiments it is our intention to use the evolutionary approach to evolve the initial set of weights and then apply a fine-tuning stage (e.g., backpropagation or resilient backpropagation) during a maximum number of epochs proportional to the number of neurons and/or hidden-layers. As we allow hidden-nodes to connect to nodes in previous layers (including input features), the number of used features is also higher.

Preliminary experiments concerning the evolution of ANNs with more than one output neuron have also been conducted, i.e., instead of using just one output neuron with the sigmoid activation function we used two output nodes (one per class) with the softmax activation function. The obtained performances are similar. However, these results are not presented due to the lack of space.

6 CONCLUSION AND FUTURE WORK
In this paper we propose DSGE: a new genotypic representation for SGE. The gain is twofold: (i) while in previous grammar-based representations the genotype encodes the largest allowed sequence, in DSGE the genotype grows as needed; and (ii) there is no need to pre-process the grammar in order to expand recursive production rules, as the maximum depth is defined for each sub-tree. Most importantly, DSGE solves a limitation of other GGP methodologies by allowing the evolution of solutions to dynamic domains, such as ANNs, where there is the need to know the number of neurons available in previous layers so that valid individuals are generated.

Results show that DSGE is able to evolve the topology and weights of one-hidden-layered ANNs that perform statistically better than those evolved using GE or SGE. Moreover, the results are also better than those provided by hand-crafted ANNs fine-tuned using backpropagation, and than the ones generated using other GE-based approaches [1, 16, 20]. Results concerning the evolution of multi-layered ANNs, despite not showing statistically significant differences, show that the methodology is suitable for the evolution of accurate ANNs. Experiments on more, and more difficult, datasets are needed to better analyse the evolution of ANNs with more than one hidden-layer. However, the ones reported in the current paper had to be used in order to compare DSGE with previous approaches.

Future work will focus on the performance of experiments using more complex datasets (e.g., MNIST and CIFAR) and on the generalisation of the allowed layer types, so that it is possible to evolve networks that use, for example, convolution and/or pooling layers.

ACKNOWLEDGMENTS
This research is partially funded by: Fundação para a Ciência e Tecnologia (FCT), Portugal, under the grant SFRH/BD/114865/2016.

REFERENCES
[1] Fardin Ahmadizar, Khabat Soltanian, Fardin AkhlaghianTab, and Ioannis Tsoulos. 2015. Artificial neural network development by means of a novel combination of grammatical evolution and genetic algorithm. Engineering Applications of Artificial Intelligence 39 (2015), 1–13.
[2] Filipe Assunção, Nuno Lourenço, Penousal Machado, and Bernardete Ribeiro. 2017. Automatic Generation of Neural Networks with Structured Grammatical Evolution. In 2017 IEEE Congress on Evolutionary Computation (CEC). IEEE.
[3] Omid E. David and Iddo Greental. 2014. Genetic algorithms for evolving deep neural networks. In Proceedings of the 2014 Conference Companion on Genetic and Evolutionary Computation. ACM, 1451–1452.
[4] Lucian-Ovidiu Fedorovici, Radu-Emil Precup, Florin Dragan, and Constantin Purcaru. 2013. Evolutionary optimization-based training of convolutional neural networks for OCR applications. In System Theory, Control and Computing (ICSTCC), 2013 17th International Conference. IEEE, 207–212.
[5] Limin Fu and Enzo Medico. 2007. FLAME, a novel fuzzy clustering method for the analysis of DNA microarray data. BMC Bioinformatics 8, 1 (2007), 1.
[6] Faustino Gomez, Jürgen Schmidhuber, and Risto Miikkulainen. 2008. Accelerated neural evolution through cooperatively coevolved synapses. Journal of Machine Learning Research 9, May (2008), 937–965.
[7] R. Paul Gorman and Terrence J. Sejnowski. 1988. Analysis of hidden units in a layered network trained to classify sonar targets. Neural Networks 1, 1 (1988), 75–89.
[8] Frederic Gruau. 1992. Genetic synthesis of boolean neural networks with a cell rewriting developmental process. In International Workshop on Combinations of Genetic Algorithms and Neural Networks, COGANN-92. IEEE, 55–74.
[9] Jae-Yoon Jung and James A. Reggia. 2006. Evolutionary design of neural network architectures using a descriptive encoding language. IEEE Transactions on Evolutionary Computation 10, 6 (2006), 676–688.
[10] M. Lichman. 2013. UCI Machine Learning Repository. archive.ics.uci.edu/ml
[11] Nuno Lourenço, Francisco B. Pereira, and Ernesto Costa. 2016. Unveiling the properties of structured grammatical evolution. Genetic Programming and Evolvable Machines (2016), 1–39.
[12] Miguel Rocha, Paulo Cortez, and José Neves. 2007. Evolution of neural networks for classification and regression. Neurocomputing 70, 16 (2007), 2809–2816.
[13] Conor Ryan, JJ Collins, and Michael O'Neill. 1998. Grammatical evolution: Evolving programs for an arbitrary language. Springer Berlin Heidelberg, Berlin, Heidelberg, 83–96.
[14] Tapas Si, Arunava De, and Anup Kumar Bhattacharjee. 2014. Grammatical Swarm for Artificial Neural Network Training. In International Conference on Circuit, Power and Computing Technologies (ICCPCT). IEEE, 1657–1661.
[15] Vincent G. Sigillito, Simon P. Wing, Larrie V. Hutton, and Kile B. Baker. 1989. Classification of radar returns from the ionosphere using neural networks. Johns Hopkins APL Technical Digest 10, 3 (1989), 262–266.
[16] Khabat Soltanian, Fardin Akhlaghian Tab, Fardin Ahmadi Zar, and Ioannis Tsoulos. 2013. Artificial neural networks generation using grammatical evolution. In 21st Iranian Conference on Electrical Engineering (ICEE). IEEE, 1–5.
[17] Kenneth O. Stanley and Risto Miikkulainen. 2002. Evolving neural networks through augmenting topologies. Evolutionary Computation 10, 2 (2002), 99–127.
[18] W. Nick Street, William H. Wolberg, and Olvi L. Mangasarian. 1993. Nuclear feature extraction for breast tumor diagnosis. In IS&T/SPIE's Symposium on Electronic Imaging: Science and Technology. International Society for Optics and Photonics, 861–870.
[19] Sreenivas Sremath Tirumala, S. Ali, and C. Phani Ramesh. 2016. Evolving deep neural networks: A new prospect. In Natural Computation, Fuzzy Systems and Knowledge Discovery (ICNC-FSKD), 2016 12th International Conference. IEEE, 69–74.
[20] Ioannis Tsoulos, Dimitris Gavrilis, and Euripidis Glavas. 2008. Neural network construction and training using grammatical evolution. Neurocomputing 72, 1 (2008), 269–277.
[21] Darrell Whitley, Timothy Starkweather, and Christopher Bogart. 1990. Genetic algorithms and neural networks: Optimizing connections and connectivity. Parallel Computing 14, 3 (1990), 347–361.