
Neural Network Architecture Search with Differentiable Cartesian Genetic Programming for Regression

Marcus Märtens, European Space Agency, Noordwijk, The Netherlands, [email protected]
Dario Izzo, European Space Agency, Noordwijk, The Netherlands, [email protected]

ABSTRACT

The ability to design complex neural network architectures which enable effective training by stochastic gradient descent has been key for many achievements in the field of deep learning. However, developing such architectures remains a challenging and resource-intensive process full of trial-and-error iterations. All in all, the relation between the network topology and its ability to model the data remains poorly understood. We propose to encode neural networks with a differentiable variant of Cartesian Genetic Programming (dCGPANN) and present a memetic algorithm for architecture design: local searches with gradient descent learn the network parameters while evolutionary operators act on the dCGPANN genes shaping the network architecture towards faster learning. Studying a particular instance of such a learning scheme, we are able to improve the starting feed forward topology by learning how to rewire and prune links, adapt activation functions and introduce skip connections for chosen regression tasks. The evolved network architectures require less space for network parameters and reach, given the same amount of time, a significantly lower error on average.

KEYWORDS

designing neural network architectures, evolution, genetic programming, artificial neural networks

1 INTRODUCTION

The ambition of artificial intelligence (AI) is to develop artificial systems that exhibit a level of intelligent behaviour competitive with humans. It is thus natural that much research in AI has taken inspiration from the human brain [11]. The human brain was shaped by natural evolution to give its owner the ability to learn: new skills and knowledge are acquired during lifetime due to exposure to different environments and situations. This lifelong learning is in stark contrast to the machine learning approach, where typically only weight parameters of a static network architecture are tuned during a training phase and then left frozen to perform a particular task.

While exact mechanisms in the human brain are poorly understood, there is evidence that a process called neuroplasticity [5] plays an important role, which is described as the ability of neurons to change their behaviour (function, connectivity patterns, etc.) due to exposure to the environment [25]. These changes manifest themselves as alterations of the physical and chemical structures of the nervous system.

Inspired by the idea of the neuroplastic brain, we propose a differentiable version of Cartesian Genetic Programming (CGP) [19] as a direct encoding of artificial neural networks (ANN), which we call dCGPANN. Due to an efficient automated backward differentiation, the loss gradient of a dCGPANN can be obtained during the fitness evaluation with only a negligible computational overhead. Instead of ignoring the gradient information, we propose a memetic algorithm that adapts the weights and biases of the dCGPANN by backpropagation. The performance in learning is then used as a selective force for evolution to incrementally improve the network architecture. We trigger these improvements by mutations on the neural connections (rewirings) and the activation functions of individual neurons, which allows us to navigate the vast design space of neural network architectures up to a predetermined maximum size.

To evaluate the performance of our approach, we evolve network architectures for a series of small-scale regression problems. Given the same canonical feed forward neural network as a starting point for each individual challenge, we show how complex architectures for improved learning can be evolved without human intervention.

The remainder of this work is organized as follows: Section 2 relates our contribution to other work in the field of architecture search and CGP applied to artificial neural networks. Section 3 gives some background on CGP, introduces the dCGPANN encoding and explains how its weights can be trained efficiently. In Section 4 we outline our experiments and describe the evolutionary algorithm together with our test problems. Results are presented in Section 5 and we conclude with a discussion on the benefits of the evolved architectures in Section 6.

2 RELATED WORK

This section briefly explains how this work is related to ongoing research like genetic programming, neural network architecture search, neuro-evolution, meta-learning and similar.

2.1 Cartesian Genetic Programming

In its original form, CGP [19] has been deployed to various applications, including the evolution of robotic controllers [10], digital filters [18], computational art [1] and large scale digital circuits [36]. The idea to use the CGP-encoding to represent neural networks goes back to works of Turner and Miller [35] and Khan et al. [16], who coined the term CGPANN. In these works, the network parameters (mainly weights, as no biases were introduced) are evolved by genetic operators, and the developed techniques are thus applied to tasks where gradient information is not available (e.g. reinforcement learning). In contrast, our work will make explicit use of the gradient information for adapting weights and biases, effectively creating a memetic algorithm [20]. There exists some work on the exploitation of low-order differentials to learn parameters for genetic programs in general [6, 34], but the application of gradient descent to CGPANNs is widely unexplored.

Figure 1: Most widely used form of Cartesian genetic programming, as described by [19].

A notable exception is the recent work of Suganuma et al. [33], who deployed CGP to encode the connections among functional blocks of a convolutional neural network. In [33], the nodes of the CGP represent highly functional modules such as convolutional blocks, tensor concatenation and similar operations. The resulting convolutional neural networks are then trained by stochastic gradient descent. In contrast, our approach works directly on the fundamental units of computation, i.e. neurons, activation functions and their connectome.

2.2 Neural Network Architecture Search

There is great interest to automate the design process of neural networks, since finding the best performing topologies by human experts is often viewed as a laborious and tedious process. Some recent approaches deploy Bayesian optimization [27] or reinforcement learning [2, 38] to discover architectures that can rival human designs for large image classification problems like CIFAR-10 or CIFAR-100. However, automated architecture design often comes with heavy resource requirements and many works are dedicated to mitigate this issue [4, 7, 26].

One way to perform architecture search are metaheuristics and neuro-evolution, which have been studied for decades and remain a prolific area of research [22]. Most notably, NEAT [32] and its variations [17, 31] have been investigated as methods to grow network structures while simultaneously evolving their corresponding weights. The approach by Han et al. [9] is almost orthogonal, as it deploys an effective pruning strategy to learn weights and topology purely from gradients. Our approach takes aspects of both: weights are learned from gradients while network topologies are gradually improved by evolution.

We focus on small-scale regression problems and optimize our topologies for efficient training as a means to combat the exploding resource requirements. In this sense, our approach is related to the concept of meta-learning [37]: the ability to "learn how to learn" by exploiting meta-knowledge and adapting to the learning task at hand. The idea to evolve effective learning systems in form of neural networks during their training has recently been surveyed by Soltoggio et al. [28], who coin the term "EPANN" (Evolved Plastic Artificial Neural Network). However, to the best of our knowledge, our work is the first to analyze plasticity in neural networks represented as CGPs.

3 DIFFERENTIABLE CARTESIAN GENETIC PROGRAMMING

This section outlines our design of a neural network as a CGP and explains how it can be trained efficiently.

3.1 Definition of a dCGPANN

A Cartesian genetic program [19], in the widely used form depicted in Figure 1, is defined by the number of inputs n, the number of outputs m, the number of rows r, the number of columns c, the levels-back l, the arity a of its kernels (non-linearities) and the set of possible kernel functions. With reference to the picture, each of the n + rc nodes in a CGP is thus assigned a unique id, and the vector of integers

x_I = [F_0, C_{0,0}, C_{0,1}, \dots, C_{0,a}, F_1, C_{1,0}, \dots, O_1, O_2, \dots, O_m]

defines entirely the value of the terminal nodes. Indicating the numerical value of the output of the generic CGP node having id i with the symbol N_i, we formally have that:

N_i = F_i(N_{C_{i,0}}, N_{C_{i,1}}, \dots, N_{C_{i,a}})

In other words, each node outputs the value of its kernel – or non-linearity, to adopt a terminology more used in ANN research – computed using as inputs the connected nodes.

We modify the standard CGP node by adding a weight w for each connection C, a bias b for each function F and a different arity a for each node. We also change the definition of N_i to:
N_i = F_i\Big( \sum_{j=0}^{a_i} w_{i,j} N_{C_{i,j}} + b_i \Big)    (1)

forcing the non-linearities to act on the biased sum of their weighted inputs. Note that, by doing so, changing the function arity corresponds to adding or removing terms from the sum. The difference between a standard CGP node and a dCGPANN node is depicted in Figure 2.

Figure 2: Differences between the i-th node in a CGP expression and in a dCGPANN expression.
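
To make the encoding concrete, the following minimal sketch evaluates a dCGPANN following Eq. (1). The flat gene layout (each node contributing [F_i, C_{i,0}, ..., C_{i,a_i-1}] to x_I and [b_i, w_{i,0}, ..., w_{i,a_i-1}] to x_R), the kernel names and the toy chromosome at the bottom are illustrative assumptions of this sketch and do not reproduce the API of the released dCGP library.

import numpy as np

KERNELS = {
    "tanh": np.tanh,
    "sig": lambda x: 1.0 / (1.0 + np.exp(-x)),
    "relu": lambda x: np.maximum(x, 0.0),
    "elu": lambda x: np.where(x > 0, x, np.exp(x) - 1.0),
}

def evaluate_dcgpann(inputs, x_I, x_R, arities, kernel_names, output_ids):
    """Evaluate a dCGPANN following Eq. (1).

    inputs       : list of n input values (node ids 0..n-1)
    x_I          : integer genes, per node: [F_i, C_{i,0}, ..., C_{i,a_i-1}]
    x_R          : continuous genes, per node: [b_i, w_{i,0}, ..., w_{i,a_i-1}]
    arities      : arity a_i of each internal node
    kernel_names : available kernel names, indexed by the F_i genes
    output_ids   : node ids feeding the m outputs
    """
    n = len(inputs)
    N = {i: float(v) for i, v in enumerate(inputs)}  # node id -> value
    pi, pr = 0, 0                                    # read positions in x_I and x_R
    for node, a in enumerate(arities):
        node_id = n + node
        F = KERNELS[kernel_names[x_I[pi]]]
        conns = x_I[pi + 1 : pi + 1 + a]
        bias = x_R[pr]
        weights = x_R[pr + 1 : pr + 1 + a]
        # Eq. (1): kernel applied to the biased, weighted sum of connected nodes
        N[node_id] = F(sum(w * N[c] for w, c in zip(weights, conns)) + bias)
        pi += 1 + a
        pr += 1 + a
    return [N[o] for o in output_ids]

# Toy example: 2 inputs, two nodes of arity 2, one output (hypothetical genes).
x_I = [0, 0, 1,    1, 2, 0]              # per node: [F, C0, C1]
x_R = [0.0, 0.5, -0.3,    0.1, 1.0, 0.2]  # per node: [b, w0, w1]
print(evaluate_dcgpann([0.7, -1.2], x_I, x_R, arities=[2, 2],
                       kernel_names=["tanh", "sig"], output_ids=[3]))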

We define x_R as the vector of real numbers:

x_R = \theta = [b_0, w_{0,0}, w_{0,1}, \dots, w_{0,a_0}, b_1, w_{1,0}, w_{1,1}, \dots, w_{1,a_1}, \dots]

The two vectors x_I and x_R form the chromosome of our dCGPANN and suffice for the evaluation of the terminal values O_i, i = 1..m. Note that we also have introduced the symbol θ to indicate x_R, as the network parameters are often indicated with this symbol in machine learning related literature.

3.2 Training of dCGPANNs

Training dCGPANNs is more complex than training ANNs as both the continuous and the integer part of the chromosome affect the loss ℓ and have thus to be learned. The obvious advantage is that, when successful, in a dCGPANN the learning will not only result in network parameters θ adapted to the task, but also in a network topology adapted to the task, including its size, activation functions and connections.

Consider a regression task where some loss ℓ is defined. Assume the integer part of the chromosome x_I is known and fixed. It is possible to compute efficiently the loss gradient ∇_{x_R} ℓ = dℓ/dx_R implementing a backward automated differentiation scheme over the computational graph defined by the dCGPANN. This is not a straightforward task, as the computational graph of a dCGPANN is much more intricate than that of a simple ANN, but it is attainable and eventually leads to the possibility of computing the gradient with a little computational overhead independent of θ (constant complexity). Note also that if x_I is changed, a new backward automated differentiation scheme needs to be derived for the changed computational graph.

Once the loss gradient is computed, the classical gradient descent rule can be used to update θ:

\theta_{i+1} = \theta_i - lr \, \nabla_\theta \ell    (2)

where lr is used to indicate the learning rate (i.e. a trust region for the first order Taylor expansion of the loss). Epoch after epoch and batch after batch, the above update rule can be used to assemble a stochastic gradient descent algorithm during which – between epochs – the integer part of the chromosome x_I may also change, subject to different operators, typically evolutionary ones such as mutation and crossover.

Regardless of the actual details of such a learning scheme, it is of interest to study how the SGD update rule gets disturbed by applying an evolutionary operator to x_I. In Figure 3 we show, for a generic regression task, the trend of the training error when the update rule in Eq. (2) is disturbed, every fixed number of epochs, by mutating some active gene in x_I. The details are not important here as the observed behaviour is generally reproducible with different tasks, networks, optimizers and evolutionary operators, but it is, instead, important to note a few facts: firstly, mutations immediately degrade the loss. This is expected since the parameters are learned on a different topology before the mutation occurs. However, the loss recovers relatively fast after a few more batches of data pass through SGD. Secondly, mutations can be ultimately beneficial as after already a few epochs the error level can be lower than what one would obtain if no mutations had occurred. In other words, we can claim that genetic operators applied on x_I to a stochastic gradient descent learning process working on x_R are mostly detrimental in the short term, but may offer advantages in the long term.

Figure 3: Loss during a generic regression task: SGD is compared with SGD perturbed, every 5 epochs, by a random (cumulative) mutation.

With this in mind, it is clear that the possibilities to assemble one single learning scheme mixing two powerful tools – evolutionary operators and the SGD update rule – are quite vast, and their success will be determined by a clever application of the genetic operators and by protecting them for some batches/epochs before either selection or removal is applied. Note how this scheme is, essentially, a memetic approach where the individual learning corresponds to the SGD learning of θ. In this work (Section 4.2) we will propose one easy set-up for a memetic algorithm that was found to deliver most interesting results over the test cases chosen. It makes use of Lamarckian inheritance when selecting good mutations, but it also introduces a purely Darwinian mechanism (the "Forget" step) that allows for a complete reset of its accumulated experiences.

The implementation details of the dCGPANN and, most importantly, of the backward automated differentiation scheme, have been released as part of the latest release of the open source project on differentiable Cartesian Genetic Programming (dCGP) [15]. No SIMD vectorization nor GPU support is available in this release, as it will, instead, be the subject of future developments. The parallelization available so far relies on multithreading and affects the parallel evaluation of the dCGPANN expression and its gradient for data within each batch during one SGD epoch. This led us, for this paper, to experiment mainly with tasks appropriate to small networks, where the performance of our code is comparable, timewise, with that of popular deep learning frameworks such as tensorflow and pytorch. Our results have thus to be taken as valid for small scale networks, and their scalability to larger sizes remains to be shown and rests upon the development of an efficient, vectorized version of our current code base.
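
As a rough sketch of the interplay just described (SGD on x_R, periodically perturbed by mutations of x_I, cf. Figure 3), the loop below alternates gradient-descent epochs with occasional topology mutations. The callables sgd_epoch, loss and mutate are placeholders standing in for the dCGPANN machinery; they are assumptions of this sketch, not functions of the released library.

def train_with_mutations(net, data, epochs, sgd_epoch, loss, mutate, mutate_every=5):
    """Run SGD on the continuous genes x_R, perturbing the integer genes x_I
    every `mutate_every` epochs (cf. Figure 3). Placeholder callables:
      sgd_epoch(net, data) -> net with updated weights/biases (x_R)
      loss(net, data)      -> training error of `net` on `data`
      mutate(net)          -> net with a few active genes of x_I mutated
    """
    history = []
    for epoch in range(epochs):
        if epoch > 0 and epoch % mutate_every == 0:
            net = mutate(net)        # topology change: loss degrades at first ...
        net = sgd_epoch(net, data)   # ... but typically recovers within a few batches
        history.append(loss(net, data))
    return net, history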

w1, b1 w2, b2 wn, bn w1, b1 wi, bi wn, bn 344_mv 40768 10 Stochastic Gradient 503_wind 6574 14 Descent 564_fried 40768 10 N selecting weights and N1 N2 Nn i topology with lowest 1201_BNG_breastTumor 116640 9 … error ⎯⎯⎯⎯⎯⎯ ⎯⎯⎯⎯⎯⎯ ⎯⎯⎯⎯⎯⎯ ⎯⎯⎯⎯⎯⎯⎯ ⎯⎯⎯⎯⎯⎯⎯ ⎯⎯⎯⎯⎯⎯⎯ wi, bi Table 1: Regression problems selected for experiments. w1, b1 w2, b2 wn, bn

4.2 The LSMF Algorithm

The LSMF algorithm (short for "Learn, Select, Mutate and Forget") works in J iterations, each consisting of K cycles. A cycle represents a short period of learning for a population of N dCGPANNs, followed by a mutation of the best performing network. The steps that are executed during one cycle are the following:

1. Learn
For each of the N dCGPANNs, run C epochs of stochastic gradient descent with learning rate lr and mini-batches of size mb. This step only changes the x_R part of the chromosomes, keeping the topology of the networks x_I constant.

2. Select
For each dCGPANN, compute the loss ℓ (i.e. training error) and select the best individual from the population (minimum ℓ), eliminating all others.

3. Mutate
From the selected dCGPANN, create N − 1 new networks by mutating a fraction µa of the active function genes and a fraction µc of the active connectivity genes. Elitism ensures that the new population contains the unaltered original next to its N − 1 mutants. Complementary to Learn, this step operates only on x_I, but leaves x_R unchanged.

Each cycle represents the lifelong learning of the dCGPANNs: after alterations to their network topologies, mutants are protected for C epochs during the Learn step to enable their development during training (see Figure 3). We call this the cooldown period. Eventually, the weights and biases of the dCGPANNs will converge towards a (near-)optimal loss, regardless of their current network topologies. In this situation, it becomes increasingly difficult for mutants to achieve significant improvements (as the loss is close to optimal) and the development of the topologies stalls. To resolve this situation, we execute a fourth step after K cycles that reinitializes all weights and biases:

4. Forget
For each dCGPANN, reinitialize all weights and biases while maintaining the network topology. Thus, x_R is randomized while x_I remains unchanged.

After the Forget step, a new evolutionary iteration begins, consisting of another K cycles of Learn, Select and Mutate. The result of each evolutionary iteration is a new dCGPANN network topology, which can be extracted from the last performed Select step. Figure 4 illustrates the four different steps inside LSMF and shows how each manipulates the population of dCGPANNs (see also the sketch below).

Figure 4: Each block shows one of the four operations of the LSMF algorithm. In each block, the top line corresponds to the input population of the step and the bottom line highlights the applied changes (output population).
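
A compact sketch of the loop formed by the four steps above (one evolutionary iteration per pass of the outer loop). The helper callables sgd_epochs, loss, mutate and reinit_weights are placeholders for the corresponding dCGPANN operations; they and the population handling are assumptions of this sketch rather than the authors' implementation.

import copy

def lsmf(template, data, J, K, N, C, mu_a, mu_c,
         sgd_epochs, loss, mutate, reinit_weights):
    """Learn-Select-Mutate-Forget on a population of N dCGPANNs (assumes K >= 1).
    Placeholder callables:
      sgd_epochs(net, data, C)  -> net after C epochs of SGD on x_R
      loss(net, data)           -> training error
      mutate(net, mu_a, mu_c)   -> copy of net with mutated x_I
      reinit_weights(net)       -> net with freshly initialized x_R
    """
    population = [reinit_weights(copy.deepcopy(template)) for _ in range(N)]
    best_topologies = []
    for _ in range(J):                                                       # iterations
        for _ in range(K):                                                   # cycles
            population = [sgd_epochs(net, data, C) for net in population]    # Learn
            best = min(population, key=lambda net: loss(net, data))          # Select
            population = [best] + [mutate(copy.deepcopy(best), mu_a, mu_c)   # Mutate
                                   for _ in range(N - 1)]                    # (elitism)
        best_topologies.append(copy.deepcopy(best))   # topology of this iteration
        population = [reinit_weights(net) for net in population]             # Forget
    return best_topologies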

4.3 Experimental Setup and Hyperparameters

The LSMF algorithm needs a set of hyperparameters, which can be tuned to adapt to specific problems. Thus, to show the true potential of LSMF, we would need to sample the hyperparameter space for each problem separately and pick the best combination of values, which is beyond the scope of this work. Instead, we want to highlight the broad applicability and general robustness that comes with genetic algorithms like LSMF over multiple regression problems by a reasonable choice of problem-independent fixed values. Table 2 lists our selection of hyperparameters. In the following, we elaborate on some of those choices and provide more details on the experimental setup.

dCGP parameters
rows r                              10
columns c                           4
arity first column a1               number of inputs
arity other columns a2, a3, a4      10
levels-back l                       3
set of kernels                      tanh, sig, ReLU, ELU

Evolutionary parameters
population N                        100
mutation rate functions µa          0.02
mutation rate connections µc        0.01
cooldown period C                   1 epoch
number of cycles per iteration K    30
evolutionary iterations J           10

Parameters for learning
learning rate lr                    0.01
batch size mb                       10
initialization biases               0
initialization weights              N(0, 1)

Table 2: Set of hyper-parameters for the experiments.

In our experiments, the LSMF algorithm operates on a population of N = 100 dCGPANNs, which we initialize as feed-forward neural networks with 4 hidden layers, each consisting of 10 neurons with activation function tanh. The single output neuron uses a sig activation function. The set of possible kernels consists of tanh, sig, ReLU and ELU. Figure 5 shows this architecture, which we call the template network in the following. Preliminary experiments showed that the template network is able to solve all tasks at hand and provides a good starting point for mutations. Note that, while in the template network only neighboring layers are connected, mutations can cause rewirings to skip some of its layers. In particular, we set the levels-back parameter of its dCGPANN representation to 3, allowing a connection to bypass at most two layers.

Figure 5: Template network for a problem with 10 input features and one output. The number of input nodes will vary with the problem.

Because our population size is N = 100, each Mutate step will generate 99 mutants from the selected best dCGPANN. Each mutation changes 1% of the connection genes and 2% of the activation function genes uniformly at random within their respective bounds. Given the fixed structure of our template network, this results in 5 to 7 mutated connections (depending on the number of inputs) and 1 mutated activation function during each Mutate step for the upcoming experiments.

Although all template networks have the same topology, their initial weights and biases are different. In particular, weights are drawn from the standard normal distribution while biases are initialized with zero. Whenever stochastic gradient descent is applied, this happens with a fixed learning rate of 0.1 and a mini-batch size of 10. The cooldown period C (i.e. the time mutants are trained before selection) is set to 1 epoch, i.e. a full pass over the training data. For large datasets with millions of instances, a whole epoch might be unnecessarily long, but we found this value to work out for the small-scale data that we analyze.

We execute K = 30 cycles of Learn - Select - Mutate for each evolutionary iteration. In total, we perform J = 10 evolutionary iterations and track the development of the dCGPANN topology at the end of each.
5 RESULTS

Our experiments show that LSMF is able to lower the average test error after each evolutionary iteration for all selected problems. In Figure 7 we show the test error for each evolutionary run, averaged over a population of 100 different random weight initializations. Each solid line corresponds to an unperturbed training (no mutations) of the best evolved topology at the end of the previous iteration (and the template network for the first iteration).

To analyze whether LSMF has any selective pressure to drive optimization towards better learning or simply amounts to some form of random search, we also visualize the average test error of 100 random dCGPANNs, removing 5% outliers. A random dCGPANN is generated by drawing random numbers uniformly for all genes within the corresponding bounds and constraints of x_I and by initializing x_R in the same way as non-random dCGPANNs (zero mean, normally distributed weights).

It turns out that after at most six evolutionary iterations, LSMF has evolved topologies which perform better than random dCGPANNs on average. The difference between the test error of the random networks and the test error of the best evolved topology is significant with p < 0.05 according to separate Wilcoxon rank-sum tests for each regression problem. Remarkably, the random dCGPANNs seem to perform (on average) still better than the initial feed-forward neural networks. The (average) performance of the random dCGPANNs is shown as a dashed black line in Figure 7. Numerical values for the training and test error of the evolved dCGPANNs and the test error of the random dCGPANNs are shown in Table 4.
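
The significance statement above can be checked with a standard rank-sum test, for example via SciPy; the function below and its inputs (one test error per weight initialization for the evolved and the random networks) are a sketch, not the exact evaluation script used for the paper.

from scipy.stats import ranksums

def significantly_better(evolved_errors, random_errors, alpha=0.05):
    """Two-sided Wilcoxon rank-sum test between the test errors of the best
    evolved topology and of random dCGPANNs (one value per weight init)."""
    statistic, p_value = ranksums(evolved_errors, random_errors)
    return p_value < alpha, p_value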

name                    #weights  #active weights  #skip connections  #duplicate connections  compression ratio
197_cpu_act             520       480              78                 98                      0.734615
225_puma8NH             390       330              90                 97                      0.597436
344_mv                  410       370              85                 95                      0.670732
503_wind                450       370              132                108                     0.582222
564_fried               410       370              81                 91                      0.680488
1201_BNG_breastTumor    400       350              127                98                      0.630000

Table 3: Summary of the different connection types in the best evolved dCGPANNs. The column "#weights" gives the maximum number of weights as used by the initial feed forward neural network. The compression ratio describes by what fraction this number can be reduced in the evolved topology because of either inactive or duplicate connections.
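
As a reading aid (our consistency check, not a formula given in the paper), the reported compression ratios appear to equal the fraction of the original weights that still has to be stored once inactive connections are dropped and each k-fold link is merged into a single weight; e.g. for 503_wind:

(#active weights − #duplicate connections) / #weights = (370 − 108) / 450 ≈ 0.582

which matches the 0.582222 reported in Table 3; the other rows check out analogously.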

name                    evolved dCGPANN training error  evolved dCGPANN test error  random dCGPANN test error
197_cpu_act             1.851e-03 (±3.925e-04)          2.071e-03 (±4.599e-04)      2.921e-03 (±1.748e-03)
225_puma8NH             2.104e-02 (±1.428e-03)          2.031e-02 (±1.392e-03)      2.364e-02 (±3.420e-03)
344_mv                  3.372e-04 (±1.571e-04)          3.657e-04 (±1.555e-04)      5.655e-04 (±2.457e-04)
503_wind                6.799e-03 (±4.366e-04)          7.945e-03 (±4.631e-04)      8.690e-03 (±8.353e-04)
564_fried               2.176e-03 (±5.324e-04)          2.878e-03 (±6.355e-04)      4.181e-03 (±1.622e-03)
1201_BNG_breastTumor    2.016e-02 (±2.348e-04)          2.345e-02 (±5.472e-04)      2.436e-02 (±7.596e-03)

Table 4: Numerical values for mean and standard deviation of training and test error for random and evolved dCGPANNs.

Figure 8 shows an example of an evolving topology by depicting the network structure of the dCGPANN with the lowest training error at the end of each corresponding iteration. For this particular example, the impact of the activation function mutations is clearly visible: the prevalence of tanh nodes in the hidden layers drops from 100% down to approx. 13%, which is much lower than one would expect from a purely random assignment of activation functions. We further observe that sigmoidal activations appear dominantly, next to ReLU and occasional ELU nodes. Moreover, the tanh activation functions of the first layer are completely substituted after the last evolutionary iteration.

Analyzing the structure of the evolved network population, we observe across all problems that certain connections are dropped while others are enforced by rewiring links on top of each other. While the dCGPANN encoding enables such k-fold links, they are redundant for computation and may be substituted by a single link containing the sum of the k connection weights.

We further observe that, on average, half of the nodes in the last layer become inactive, which further reduces the size of the network model. Only for the 503_wind problem have nodes in lower layers been observed to become inactive as well (see Figure 6). Inactive connections and duplicate connections allow us to lower the space requirements for the models by about 40%. Specific values for possible model compression are listed in Table 3.

The reason for the appearance of inactive nodes are rewired connections that go either into k-fold links or to an entirely different hidden layer. As the levels-back parameter in our experiment is 3, any of the 10 incoming connections of the output node can be rewired to skip the last (or even the second-last) layer, necessarily resulting in inactive nodes within the last layer. These skip connections can also appear at other places in the network, bypassing other hidden layers or feeding directly into the inputs (compare the right side of Figure 6). We report the total number of evolved skip connections in Table 3.

Figure 6: Final evolved topology after 10 iterations of LSMF on the problem 503_wind. Left: all active connections. Right: skip connections only (active connections which do not connect two nodes in neighboring columns).

6 DISCUSSIONS AND CONCLUSION

Our experiments show that it is possible to find a dCGPANN, starting from a feed forward neural network, that increases the speed of learning while at the same time reducing the complexity of the model for several regression problems. These two effects might not be unrelated, as smaller models are (generally) faster to train. However, while random dCGPANNs are on average even smaller than the evolved dCGPANNs, their performance falls behind after a couple of evolutionary iterations. This implies that LSMF-like algorithms are able to effectively explore the search space of dCGPANN topologies. Thus, there are reasons to assume that the performance of neural networks might be generally enhanced by the deployment of dCGPANNs.

Figure 7: Learning curves for the selected regression problems (197_cpu_act, 225_puma8NH, 344_mv, 503_wind, 564_fried, 1201_BNG_breastTumor; MSE over training epochs), starting with a 4-layer feed-forward neural network (dark blue) and applying the LSMF-algorithm for 10 iterations (last iteration: dark red). Reported is the average MSE of a population of 100 dCGPANNs with different sets of starting weights, in log-scale. The dashed black line corresponds to the average MSE of a population of 100 randomly generated dCGPANNs, each running with one of the 100 starting weights.

Figure 8: Evolution of the dCGPANN topology by running the LSMF-algorithm on the problem 344_mv (panels: FFNN, Iteration 3, Iteration 5, Iteration 7, Iteration 10; node legend: input/output, tanh, sigmoid, ReLU, ELU). Leftmost: initial topology (feed-forward neural network). From left to right: best performing topologies (lowest MSE at the end of an iteration) at several points during the 10 iterations of LSMF. The thickness of a connection is proportional to its weight.

For example, when comparing the evolved structures, it becomes clear that certain activation functions are selected more than others. This is reflected in recent neural network research, which acknowledges their importance [3, 21, 24]. While so far only a few vague rules and human intuition could assist in setting these functions in favor of particular problems, LSMF automatically evolves beneficial activation functions on the level of individual neurons.

Another reason for the superior performance of dCGPANNs is the emergence of skip connections, which are widely acknowledged as one of the major mechanisms to overcome the vanishing gradient problem [13], a major roadblock in the development of Deep Learning. Many of the modern deep architectures like ResNets [12], HighwayNets [30] and DenseNets [14] need such skip connections to enable effective training.

A further well-known method that has been used to enhance neural network training is dropout [29]. This procedure is simulated by the dCGPANN through the use of inactive nodes, which are frequently appearing and disappearing due to the random rewirings of Mutate steps. Dropout is mainly used to prevent overfitting, a problem that we have not encountered in our experiments so far. However, if overfitting was an issue, a three-way split of the data would provide an easy solution, as it would enable the Select step to be independent from the training and test data.

While 10 iterations of LSMF have proven effective for most problems, there is no guarantee that the algorithm will always reduce the error after every iteration (in fact, the problem 344_mv in Figure 7 provides a counterexample). This convergence behavior could arise from the rigid use of hyperparameters that we deployed to show the broad applicability of LSMF. A problem-dependent adaptation of these hyperparameters between different iterations of LSMF would most likely mitigate this issue.

Due to an efficient implementation of automated differentiation, the resources which are needed to enable backpropagation in dCGPANNs are negligible. In fact, the software we deploy evaluates (on the CPU) networks with their gradient nearly 2 times faster than comparable neural network frameworks like tensorflow or pytorch on the problems that we presented here. The emergence of skip connections and dropout-like behaviour gives us reason to believe that the evolution of dCGPANNs might be especially beneficial for larger and deeper architectures. In particular, any pre-defined deep layered network can be transformed into a dCGPANN to serve as a starting point for structural optimization while training takes place. Also, smaller dCGPANN subnetworks (i.e. last layers) can be deployed for evolution or used as building blocks inside automated machine learning frameworks (e.g. [8]). However, because in its current release no SIMD vectorization nor GPU support is available, analyzing larger network architectures (as they are frequently deployed, for example, in image classification tasks) becomes very time consuming. Thus, the scalability of our approach has to remain open until GPU-support or equivalent forms of parallelization become available for future studies.

Next to matters of scalability, further research can be directed to more sophisticated genetic algorithms working on dCGPANNs. While our mutations targeted connections and activation functions, it would be of interest to study their effects separately and further develop mutation operators (potentially by incorporating gradients) that further improve the algorithm. While our choices of hyperparameters achieved already promising results, a better adaptation of the cooldown period, learning rate, mutation rates and weight initialization could potentially lead to even larger improvements. We deliberately avoided delving too deep into the impact of these hyperparameters to demonstrate the broad applicability of evolutionary optimization in combination with stochastic gradient descent for weight adaptation. Since both of these aspects can be handled efficiently if neural networks are represented as dCGPANNs, we believe that this encoding can open the door to transfer plenty of helpful and established evolutionary mechanisms into network training, bringing us a step closer to achieving neural plasticity for our artificial neural networks.

REFERENCES

[1] Laurence Ashmore and Julian Francis Miller. 2003. Evolutionary Art with Cartesian Genetic Programming. Technical Report (2003).
[2] Bowen Baker, Otkrist Gupta, Nikhil Naik, and Ramesh Raskar. 2016. Designing Neural Network Architectures using Reinforcement Learning. arXiv:cs.LG/1611.02167
[3] Mina Basirat and Peter M. Roth. 2018. The Quest for the Golden Activation Function. arXiv:cs.NE/1808.00783
[4] Han Cai, Tianyao Chen, Weinan Zhang, Yong Yu, and Jun Wang. 2018. Efficient Architecture Search by Network Transformation. AAAI.
[5] Bogdan Draganski, Christian Gaser, Volker Busch, Gerhard Schuierer, Ulrich Bogdahn, and Arne May. 2004. Neuroplasticity: changes in grey matter induced by training. Nature 427, 6972 (2004), 311.
[6] Z Emigdio, Leonardo Trujillo, Oliver Schütze, Pierrick Legrand, et al. 2015. A local search approach to genetic programming for binary classification. In Proceedings of the 2015 Genetic and Evolutionary Computation Conference (GECCO '15).
[7] Jiemin Fang, Yukang Chen, Xinbang Zhang, Qian Zhang, Chang Huang, Gaofeng Meng, Wenyu Liu, and Xinggang Wang. 2019. EAT-NAS: Elastic Architecture Transfer for Accelerating Large-scale Neural Architecture Search. arXiv:cs.CV/1901.05884
[8] Isabelle Guyon, Imad Chaabane, Hugo Jair Escalante, Sergio Escalera, Damir Jajetic, James Robert Lloyd, Núria Macià, Bisakha Ray, Lukasz Romaszko, Michèle Sebag, et al. 2016. A brief review of the ChaLearn AutoML challenge: any-time any-dataset learning without human intervention. In Workshop on Automatic Machine Learning. 21–30.
[9] Song Han, Jeff Pool, John Tran, and William Dally. 2015. Learning both weights and connections for efficient neural network. In Advances in Neural Information Processing Systems. 1135–1143.
[10] Simon Harding and Julian Francis Miller. 2005. Evolution of Robot Controller Using Cartesian Genetic Programming. In European Conference on Genetic Programming. Springer, 62–73.
[11] Demis Hassabis, Dharshan Kumaran, Christopher Summerfield, and Matthew Botvinick. 2017. Neuroscience-inspired artificial intelligence. Neuron 95, 2 (2017), 245–258.
[12] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 770–778.
[13] Sepp Hochreiter. 1998. The vanishing gradient problem during learning recurrent neural nets and problem solutions. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems 6, 02 (1998), 107–116.
[14] Gao Huang, Zhuang Liu, Laurens van der Maaten, and Kilian Q. Weinberger. 2017. Densely Connected Convolutional Networks. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2017), 2261–2269.
[15] Dario Izzo, Francesco Biscani, and Alessio Mereta. 2017. Differentiable genetic programming. In European Conference on Genetic Programming. Springer, 35–51.
[16] Maryam Mahsal Khan, Arbab Masood Ahmad, Gul Muhammad Khan, and Julian F Miller. 2013. Fast learning neural networks using Cartesian genetic programming. Neurocomputing 121 (2013), 274–289.
[17] Risto Miikkulainen, Jason Liang, Elliot Meyerson, Aditya Rawal, Daniel Fink, Olivier Francon, Bala Raju, Hormoz Shahrzad, Arshak Navruzyan, Nigel Duffy, et al. 2019. Evolving deep neural networks. In Artificial Intelligence in the Age of Neural Networks and Brain Computing. Elsevier, 293–312.
[18] Julian F. Miller. 1999. Evolution of Digital Filters Using a Gate Array Model. In Evolutionary Image Analysis, Processing and , Riccardo Poli, Hans-Michael Voigt, Stefano Cagnoni, David Corne, George D. Smith, and Terence C. Fogarty (Eds.). Springer Berlin Heidelberg, Berlin, Heidelberg, 17–30.
[19] Julian F Miller. 2011. Cartesian genetic programming. In Cartesian Genetic Programming. Springer, 17–34.
[20] Pablo Moscato, Carlos Cotta, and Alexandre Mendes. 2004. Memetic Algorithms. Springer Berlin Heidelberg, Berlin, Heidelberg, 53–85.
[21] Vinod Nair and Geoffrey E Hinton. 2010. Rectified linear units improve restricted boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning (ICML-10). 807–814.
[22] Varun Kumar Ojha, Ajith Abraham, and Václav Snásel. 2017. Metaheuristic design of feedforward neural networks: A review of two decades of research. Engineering Applications of Artificial Intelligence 60 (2017), 97–116.
[23] Randal S. Olson, William La Cava, Patryk Orzechowski, Ryan J. Urbanowicz, and Jason H. Moore. 2017. PMLB: a large benchmark suite for machine learning evaluation and comparison. BioData Mining 10, 1 (11 Dec 2017), 36.
[24] Prajit Ramachandran, Barret Zoph, and Quoc V. Le. 2018. Searching for activation functions. In 6th International Conference on Learning Representations (ICLR 2018).
[25] Joseph P Rauschecker and Wolf Singer. 1981. The effects of early visual experience on the cat's visual cortex and their possible explanation by Hebb synapses. The Journal of Physiology 310, 1 (1981), 215–239.
[26] Esteban Real, Sherry Moore, Andrew Selle, Saurabh Saxena, Yutaka Leon Suematsu, Jie Tan, Quoc Le, and Alex Kurakin. 2017. Large-Scale Evolution of Image Classifiers. arXiv:cs.NE/1703.01041
[27] Jasper Snoek, Hugo Larochelle, and Ryan P Adams. 2012. Practical Bayesian Optimization of Machine Learning Algorithms. In Advances in Neural Information Processing Systems 25, F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger (Eds.). Curran Associates, Inc., 2951–2959.
[28] Andrea Soltoggio, Kenneth O Stanley, and Sebastian Risi. 2018. Born to learn: the inspiration, progress, and future of evolved plastic artificial neural networks. Neural Networks (2018).
[29] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: A Simple Way to Prevent Neural Networks from Overfitting. Journal of Machine Learning Research 15 (2014), 1929–1958.
[30] Rupesh Kumar Srivastava, Klaus Greff, and Jürgen Schmidhuber. 2015. Highway Networks. arXiv:cs.LG/1505.00387
[31] Kenneth O Stanley, David B D'Ambrosio, and Jason Gauci. 2009. A hypercube-based encoding for evolving large-scale neural networks. Artificial Life 15, 2 (2009), 185–212.
[32] Kenneth O Stanley and Risto Miikkulainen. 2002. Evolving neural networks through augmenting topologies. Evolutionary Computation 10, 2 (2002), 99–127.
[33] Masanori Suganuma, Shinichi Shirakawa, and Tomoharu Nagao. 2017. A genetic programming approach to designing convolutional neural network architectures. In Proceedings of the Genetic and Evolutionary Computation Conference. ACM, 497–504.
[34] Alexander Topchy and William F Punch. 2001. Faster genetic programming based on local gradient search of numeric leaf values. In Proceedings of the 3rd Annual Conference on Genetic and Evolutionary Computation. Morgan Kaufmann Publishers Inc., 155–162.
[35] Andrew James Turner and Julian Francis Miller. 2013. Cartesian genetic programming encoded artificial neural networks: a comparison using three benchmarks. In Proceedings of the 15th Annual Conference on Genetic and Evolutionary Computation. ACM, 1005–1012.
[36] Zdenek Vasicek. 2015. Cartesian GP in optimization of combinational circuits with hundreds of inputs and thousands of gates. In European Conference on Genetic Programming. Springer, 139–150.
[37] Ricardo Vilalta and Youssef Drissi. 2002. A Perspective View and Survey of Meta-Learning. Artificial Intelligence Review 18, 2 (01 Jun 2002), 77–95.
[38] Barret Zoph and Quoc V. Le. 2016. Neural Architecture Search with Reinforcement Learning. arXiv:cs.LG/1611.01578
