
arXiv:cs/0603004v1 [cs.NE] 1 Mar 2006

Lamarckian Evolution and the Baldwin Effect in Evolutionary Neural Networks

P.A. Castillo (1), M.G. Arenas (1), J.G. Castellano (1), J.J. Merelo (1), A. Prieto (1), V. Rivas (2) and G. Romero (1)
Grupo Geneura. URL: http://www.geneura.org
(1) Departamento de Arquitectura y Tecnología de Computadores, Universidad de Granada, Campus de Fuentenueva, 18071 Granada (Spain). e-mail: [email protected]
(2) Departamento de Informática, Universidad de Jaén, E.P.S., Avda. Madrid 35, 23071 Jaén (Spain).

Resumen: Hybrid neuro-evolutionary algorithms may be inspired on Darwinian or Lamarckian evolution. In the case of Darwinian evolution, the Baldwin effect, that is, the progressive incorporation of learned characteristics to the genetic code, can be observed and leveraged to improve the search. The purpose of this paper is to carry out an experimental study into how learning can improve G-Prop genetic search. Two ways of combining learning and genetic search are explored: one exploits the Baldwin effect, while the other uses a Lamarckian strategy. Our experiments show that using a Lamarckian operator makes the algorithm find networks with a low error rate and the smallest size, while using the Baldwin effect obtains MLPs with the smallest error rate and a larger size, taking longer to reach a solution. Both approaches obtain a lower average error than other BP-based algorithms like RPROP, other evolutionary methods and fuzzy logic based methods.

Palabras clave: Evolutionary Algorithms, Generalization, Learning, Neural Networks, Baldwin Effect, Lamarckian Search, Optimization.

I. Introduction and State of the Art

Hybrid algorithms often implement non-Darwinian ideas, e.g. the Baldwin effect or Lamarckian evolution. Lamarck's theory states that the characteristics an individual acquires during its life [1] are passed on to its offspring. Thus, the offspring will inherit any acquired or learned characteristic, and this mechanism would be responsible for evolution. According to this approach, learning has a great influence on evolution, since all the learned characteristics are passed on to the following generation.

Nevertheless, Baldwin [2] and Waddington [3] argued that this influence is limited to the fact that individuals with greater learning capacity adapt better to the environment, and the longevity they thus acquire allows them to live longer, have more offspring through time, and propagate their abilities. As the number of offspring who have acquired this ability grows, the characteristic becomes part of the genetic code.

These ideas have previously been used by numerous researchers in different approaches:

• Lamarckian mechanisms in hybrid evolutionary algorithms.
• Comparative studies of Lamarckian mechanisms and the Baldwin effect.
• Studying the Baldwin effect in hybrid algorithms.

Lamarckian theory is today totally discredited from the biological point of view, but it is possible to implement Lamarckian evolution in EAs, so that an individual can modify its genetic code during or after its fitness evaluation (its "lifetime"). These ideas have been used by several researchers with particular success in problems where the application of a local search operator obtains a substantial improvement (travelling salesman problem; Gorges-Schleuter [4], Merz and Freisleben [5], Ross [6]). In general, hybrid algorithms are nowadays acknowledged as the best solution to a wide array of optimization problems.

Some authors have studied the Baldwin effect [7], [8], [9], [10], [11], carrying out a local search on certain individuals to improve their fitness without modifying their genetic code. This is the strategy proposed by Hinton and Nowlan in [7], who found that learning alters the shape of the search space in which evolution operates, and that the Baldwin effect allows learning organisms to evolve much faster than their nonlearning equivalents, even though the acquired characteristics are not communicated back to the genetic code. Ackley and Littman [10] studied the Baldwin effect in an artificial life system, obtaining in their experiments the result that the individuals with learning capabilities achieved the best results. Boers et al. [11] describe a hybrid algorithm to evolve ANN architectures whose effectivity is explained with the Baldwin effect, implemented not as a process of learning in the network, but as changing the network architecture as part of the learning process.
Some studies have investigated whether, in a hybrid algorithm, a strategy based on the Baldwin effect is better or worse than implementing Lamarckian mechanisms to accelerate the search [12]. The results obtained are very different and dependent on the problem. Gruau and Whitley [13] compared the Baldwinian, Lamarckian and Darwinian mechanisms implemented in a genetic algorithm that evolves ANNs, finding that the first and second strategies are equally effective for solving their problem. Nevertheless, the results obtained by Whitley et al. [14] for another problem show that taking advantage of the Lamarckian strategy, although usually faster, converges to a local optimum, while the Baldwin effect can find the global optimum.

On the other hand, results obtained by Ku and Mak [15] with a GA designed to evolve recurrent neural networks show that the use of a Lamarckian strategy implies an improvement of the algorithm, while the Baldwin effect does not. In Houck et al. [16] several algorithms are studied and similar conclusions are drawn, as in [17], where a comparison between the Darwinian, Baldwinian and Lamarckian mechanisms, applied to the 4-cycle problem, is made.

G-Prop (a genetic evolution of BP-trained MLPs), used in this paper to tune the learning parameters and to set the initial weights and hidden layer size of a MLP, searches for the optimal set of weights, the optimal topology and the learning parameters, using an EA and Quick-Propagation [18] (QP). In this method no ANN parameters have to be set by hand; the EA constants obviously need to be set, but the method is robust enough to obtain good results under the default parameter settings (all operators applied with the same probability, 300 generations and 200 individuals in the population).

This paper carries out a study of the Baldwin effect in the G-Prop [19], [20], [21], [22] method to solve pattern classification and function approximation problems. We compare results with those of other authors, and intend to check the result obtained by Gruau and Whitley [13], i.e., that the use of learning that modifies fitness without modifying the genetic code improves the task of finding an ANN to solve the problem at hand.

We compare the results obtained taking advantage of the Baldwin effect with those obtained using a Lamarckian local search mechanism. We also compare them with non-hybrid methods (RPROP [23]), hybrid algorithms, and methods based on fuzzy logic, to prove that both versions of G-Prop obtain better (or at least comparable) results than other methods, although one of these versions is more likely to be trapped at a local optimum due to the fact that it uses a local search genetic operator.

The remainder of this paper is structured as follows: Section II presents the new fitness functions designed to determine if the Baldwin effect takes place in G-Prop. Section III describes the experiments, Section IV presents the results obtained, followed by a brief conclusion in Section V.

II. The G-Prop Algorithm

In this section we only describe the new fitness functions designed to determine if the Baldwin effect takes place in G-Prop. The complete description of the method and results on classification problems have been presented elsewhere [19], [20], [21], [22].

In G-Prop, the Darwinian fitness function is given by the classification / approximation ability obtained when carrying out the validation after training; in the case of two individuals with identical ability, the best is the one that has a hidden layer with fewer neurons, which implies greater speed when training and classifying and facilitates its hardware implementation.

The classification accuracy (number of hits) is obtained by dividing the number of hits by the total number of examples in the validation set. The approximation ability is obtained using the normalized mean squared error (NMSE), given by:

    NMSE = \sqrt{ \sum_{i=1}^{N} (s_i - o_i)^2 / \sum_{i=1}^{N} (s_i - \bar{s})^2 }        (1)

where s_i is the real output for example i, o_i is the obtained output, and \bar{s} is the mean of all the real outputs.
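As an illustration of how these two measures can be computed (a minimal sketch, not the actual G-Prop code; the function and variable names are ours), consider:

    import numpy as np

    def classification_accuracy(predicted_classes, target_classes):
        """Fraction of validation examples whose class is predicted correctly."""
        predicted_classes = np.asarray(predicted_classes)
        target_classes = np.asarray(target_classes)
        return np.mean(predicted_classes == target_classes)

    def nmse(obtained, real):
        """Normalized mean squared error of equation (1): square root of the sum
        of squared errors divided by the total variance of the real outputs."""
        obtained = np.asarray(obtained, dtype=float)
        real = np.asarray(real, dtype=float)
        residual = np.sum((real - obtained) ** 2)
        total = np.sum((real - real.mean()) ** 2)
        return np.sqrt(residual / total)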
The Lamarckian approach uses no special fitness function; instead, a local search genetic operator (a QP application) has been designed to improve the individuals, saving the trained weights (the acquired characteristics) back into the individual's genotype.

On the other hand, the Baldwin effect requires some type of learning to be applied to the individuals, while the changes (trained weights) are not codified back to the population. In order to take advantage of the Baldwin effect, the following fitness function is proposed: firstly, the classification/approximation ability of the individual on the validation set before being trained is calculated; then the individual is trained and its ability after training is calculated. Three criteria are used to decide which is the best individual: the best MLP is the one with the higher classification/approximation ability after training; if both MLPs show the same accuracy, the best is the one whose ability before training is higher (that MLP is more likely to reach a high accuracy when trained); and if both MLPs show the same accuracy before and after training, the best is the smallest one.
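The following sketch illustrates, under our own naming and data structures (not the actual G-Prop implementation), how the two strategies differ in what they do with the trained weights, and how the Baldwinian three-criterion comparison can be expressed:

    import copy
    from dataclasses import dataclass
    from typing import List

    @dataclass
    class Individual:
        # Illustrative genome: in G-Prop the genome also carries the hidden
        # layer size and the learning parameters; only weights are sketched.
        weights: List[float]
        ability_before: float = 0.0   # validation ability before QP training
        ability_after: float = 0.0    # validation ability after QP training
        size: int = 0                 # number of weights (network size)

    def evaluate(ind, train_with_qp, validate, lamarckian):
        """Evaluate one individual under either strategy.

        train_with_qp(weights) returns the weights after a short QP run;
        validate(weights) returns the classification/approximation ability.
        """
        ind.ability_before = validate(ind.weights)
        trained = train_with_qp(copy.deepcopy(ind.weights))
        ind.ability_after = validate(trained)
        ind.size = len(ind.weights)
        if lamarckian:
            # Lamarckian operator: the acquired characteristics (trained
            # weights) are written back into the genotype.
            ind.weights = trained
        # Baldwinian variant: the trained weights are discarded; training
        # only affects fitness through ability_before / ability_after.
        return ind

    def better(a, b):
        """Baldwinian comparison: ability after training, then ability before
        training, then the smallest network, as described above."""
        if a.ability_after != b.ability_after:
            return a if a.ability_after > b.ability_after else b
        if a.ability_before != b.ability_before:
            return a if a.ability_before > b.ability_before else b
        return a if a.size <= b.size else b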
III. Experiments

The algorithm was run for a fixed number of generations, and a limit on the number of training epochs was established when training each individual of the population to obtain its fitness. We used 300 generations and 200 individuals in the population in every run, and 200 training epochs, in order to avoid long simulation times and overfitted networks, making the EA carry out the search and the training operator refine the solutions. In addition, the number of epochs chosen was much smaller than that necessary to train a single MLP, so that the time taken to find a suitable network to solve the problem is similar to the time that would be needed to train an MLP (that obtains similar results) using a method based on gradient descent.

After an exhaustive test of genetic operators, we decided to apply all of them with the same priority (see [19], [20], [21], [22]). The learning operator (see [21], [22]) was only used when obtaining the results of the Lamarckian approach.

The tests used to assess the accuracy of a method must be chosen carefully, because some of them (such as the exclusive-or problem) are not suitable for testing certain capacities of the BP algorithm, such as generalization [24]. Our opinion, along with Prechelt [25], is that at least two real-world problems should be used to test an algorithm.

We have used a pattern classification problem and a function approximation problem, in order to demonstrate the capacity of the proposed method to solve different kinds of problems, and also to show that the Baldwin effect takes place whatever the problem at hand is.

In these experiments, the Glass1a pattern classification problem (extracted from the Proben1 data sets), proposed by Prechelt [25] and used by Grönroos [26], is used, as well as the function approximation problem given by equation (2).

• Glass1a is a glass type classification problem, taken from [25]. The results of chemical analysis of glass splinters (percent content of 8 different elements) plus the refractive index are used to classify a sample as float processed or non float processed building windows, vehicle windows, containers, tableware, or head lamps. This task is motivated by forensic needs in criminal investigation. The dataset was created from the glass problem dataset of the UCI repository of machine learning databases (http://www.ics.uci.edu/~mlearn/MLRepository.html). The data set contains 214 instances; each sample has 9 attributes plus the class attribute: refractive index, sodium, magnesium, aluminium, silicon, potassium, calcium, barium, iron, and the class attribute (type of glass).
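As a rough illustration only, the sketch below loads such a data file and splits it in the Proben1 style; the file name, the whitespace-separated format and the 50% / 25% / 25% training / validation / test split are assumptions on our part, not details given in the paper:

    import numpy as np

    def load_and_split(path="glass1a.dat"):
        """Load a whitespace-separated data file (assumed format: 9 input
        attributes followed by the class) and split it Proben1-style."""
        data = np.loadtxt(path)
        n = len(data)
        n_train, n_val = n // 2, n // 4      # assumed 50% / 25% / 25% split
        train = data[:n_train]
        validation = data[n_train:n_train + n_val]
        test = data[n_train + n_val:]
        return train, validation, test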

• The function given by equation (2) is an analytical function gathered by Cherkassky [27] and Sugeno [28], and used by Pomares [29] in his research on fuzzy logic function approximation:

    f_2(x) = e^{-5x} \sin(2\pi x)        (2)

where x ∈ [0, 1].
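For reference, a short snippet that evaluates equation (2); the uniform sampling of [0, 1] is our assumption, since the paper does not state how the training samples were drawn:

    import numpy as np

    def f2(x):
        """Target function of equation (2): exp(-5x) * sin(2*pi*x)."""
        return np.exp(-5.0 * x) * np.sin(2.0 * np.pi * x)

    # Evaluate on a uniform grid over [0, 1] (sampling scheme assumed).
    x = np.linspace(0.0, 1.0, 200)
    y = f2(x)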

[Fig. 1. Average results over all the runs for both Lamarckian and Baldwinian approaches for the Glass problem: error (x10^-3, on a vertical logscale, above) and size of the networks (below) versus generations, for the Lamarckian approach and for the Baldwinian approach before and after training.]

IV. Results

Figure 1 shows average results over all the runs for both the Lamarckian and Baldwinian approaches to the classification and approximation problems. Plotted data correspond to the best individual in the population at each generation. The dotted line corresponds to the classification / approximation ability before training and the dashed line to the ability after training, in the Baldwinian approach; the solid line corresponds to the Lamarckian approach.

Standard deviation values have not been plotted because they remain constant along the generations; in any case, they show that the error achieved is similar.

Using Lamarckian evolution, a suitable MLP is found in the early generations of the simulation. However, that MLP remains the best during the whole simulation (evolution stops) because of the use of an "elitist" algorithm, and it tends to dominate the population due to its high fitness.

On the other hand, using the Baldwin effect, results can be as good as using Lamarckian evolution, although the method needs many more generations and the evolution of the population is much more progressive during the simulation. Results in size show that, using the Lamarckian approach, MLPs are smaller than with the Baldwinian approach.

The method exhibits roughly the same behaviour on the function approximation problem f2.

Although it is not the aim of this paper to compare the G-Prop method with those of other authors, we do so in order to prove the capacity of both versions of G-Prop to solve pattern classification and function approximation problems, and to show how they outperform other methods.

Tables I and II show the average error rate, the average size of the nets as the number of parameters (that is, the number of weights of the net), and the average number of generations until the best individual of each run is found.

The results for the Glass1a pattern classification problem (% of error in test), obtained using the Lamarckian mechanism, are compared with those obtained taking advantage of the Baldwin effect and those obtained by Prechelt [25] (using RPROP [23], [30]) and Grönroos [26] (using a hybrid algorithm) in Table I.
Approach        Error           Size         Generations
Lamarckian      32 ± 2          59 ± 28      52 ± 54
Baldwinian      31 ± 2          112 ± 62     119 ± 55
Prechelt        33 ± 5          350          -
Grönroos        32 ± 5          350          -

TABLA I: Results for the Glass1a problem obtained with G-Prop taking advantage of the Baldwin effect and with the Lamarckian approach, as well as those obtained by Prechelt and Grönroos, which are included for the sake of comparison.

It is evident that G-Prop outperforms the other methods, both in classification accuracy and in the size of the networks obtained: Prechelt [25], using RPROP [23], [30], obtained a classification error of 33 ± 5, and Grönroos [26], using Kitano's network, obtained 32 ± 5, while G-Prop achieves an error of 32 ± 2 using the Lamarckian approach and 31 ± 2 taking advantage of the Baldwin effect.

In the case of the configuration that verifies whether the Baldwin effect takes place in G-Prop, the classification ability obtained is greater, although the size is greater and more generations are needed to reach similar results.

The results for the f2 function approximation problem, obtained using the Lamarckian mechanism, are compared with those obtained taking advantage of the Baldwin effect and those obtained by Pomares [29] in Table II.

Approach        Error           Size          Generations
Lamarckian      0.09 ± 0.01     18 ± 8        34 ± 28
Baldwinian      0.086 ± 0.004   85 ± 27       97 ± 81
Pomares [29]    0.125           6 (4 rules)   -

TABLA II: Results for the f2 problem obtained with G-Prop taking advantage of the Baldwin effect and with the Lamarckian approach, as well as those obtained by Pomares, which are included for the sake of comparison.

The proposed method obtains better results on approximation ability (0.09 ± 0.01 versus 0.125), although the networks obtained are greater in size and number of parameters than those obtained using fuzzy controllers (18 ± 8 versus 6).

The approximation ability obtained is greater using the configuration that verifies whether the Baldwin effect takes place in G-Prop, while the network sizes are slightly larger and more generations are needed to reach similar results.

Each run of the proposed method takes about 4 hours on an AMD-K7(tm) 600 MHz, using the parameters described above.

The Lamarckian strategy achieves good enough results using, on average, fewer generations, although the Baldwinian strategy, given a suitable number of generations, can achieve the same or even better results.

The results obtained show that if the problem does not have many local minima and results must be obtained quickly, the best strategy is the Lamarckian one. Otherwise, the Baldwinian strategy, or a mixture of both, is the best.

V. Conclusions

A study of the Baldwin effect in the G-Prop method [19], [20], [21], [22] (a hybrid algorithm that tunes the learning parameters, initial weights and hidden layer size of a MLP using an EA and QP) has been carried out, and a comparison has been made between the results obtained taking advantage of the Baldwin effect and those obtained using a Lamarckian local search mechanism.

The results obtained agree with those presented by Whitley et al. [14], and show that the use of a Lamarckian strategy makes the method obtain good solutions faster than if the Baldwin effect is used, although it is more likely to be trapped in a local optimum than the approach that takes advantage of the Baldwin effect. However, the errors are not significantly worse.

The figures show how the Lamarckian strategy finds a suitable MLP in the early generations, which remains the best during the simulation (evolution stops); with the Baldwin effect, results can be as good as those of Lamarckian evolution, although the method needs many more generations and the evolution is much more progressive.

It should also be observed that the neural nets obtained using the Lamarckian strategy are smaller, which contributes to learning speed: a small network is fast while training and classifying, and obtaining it in fewer generations means less time is needed to design it.

Another interesting result is that when the Lamarckian training operator is used, learning contributes more to fitness improvement at the beginning of the simulations [31]. This is due to the use of an elitist algorithm: when the operator is applied to a MLP, that individual obtains an advantage over the remaining members of the population, and it will continue to be among the best individuals in the population until the end of the simulation. This can also be shown using visualization techniques [32].
Acknowledgements

This work has been supported in part by CICYT TIC99-0550, INTAS 97-30950, HPRN-CT-2000-00068 and IST-1999-12679.

Referencias

[1] J.B. Lamarck, "Philosophie zoologique," 1809.
[2] J.M. Baldwin, "A new factor in evolution," American Naturalist, 30, pp. 441-451, 1896.
[3] C.H. Waddington, "Canalization of development and the inheritance of acquired characteristics," Nature, 3811, pp. 563-565, 1942.
[4] M. Gorges-Schleuter, "Asparagos96 and the traveling salesman problem," in Proceedings of the 1997 IEEE International Conference on Evolutionary Computation, pp. 171-174, IEEE, 1997.
[5] P. Merz and B. Freisleben, "Genetic local search for the TSP: New results," in Proceedings of the 1997 IEEE International Conference on Evolutionary Computation, pp. 159-163, IEEE, 1997.
[6] B.J. Ross, "A Lamarckian evolution strategy for genetic algorithms," in Lance D. Chambers (ed.), Practical Handbook of Genetic Algorithms: Complex Coding Systems, volume III, pp. 1-16, Boca Raton, FL: CRC Press, 1999.
[7] G.E. Hinton and S.J. Nowlan, "How learning can guide evolution," Complex Systems, 1, pp. 495-502, 1987.
[8] R.K. Belew, "When both individuals and populations search: Adding simple learning to the genetic algorithm," in Proc. 3rd Intern. Conf. on Genetic Algorithms, D. Schaffer (ed.), Morgan Kaufmann, 1989.
[9] I. Harvey, "The puzzle of the persistent question marks: a case study of genetic drift," in Proc. 5th International Conference on Genetic Algorithms, pp. 15-22, S. Forrest (ed.), Morgan Kaufmann, 1993.
[10] D.H. Ackley and M. Littman, "Interactions between learning and evolution," in C.G. Langton, C. Taylor, J.D. Farmer, and S. Rasmussen (eds.), Artificial Life II, pp. 487-507, Addison-Wesley, Reading, MA, 1992.
[11] E.J.W. Boers, M.V. Borst, and I.G. Sprinkhuizen-Kuyper, "Evolving Artificial Neural Networks using the 'Baldwin Effect'," in D.W. Pearson, N.C. Steele and R.F. Albrecht (eds.), Artificial Neural Nets and Genetic Algorithms, Proceedings of the International Conference in Alès, France, pp. 333-336, Springer Verlag Wien New York, 1995.
[12] M. Huesken, J.E. Gayko, and B. Sendhoff, "Optimization for Problem Classes - Neural Networks that Learn to Learn," in Proceedings of the First IEEE Symposium on Combinations of Evolutionary Computation and Neural Networks (ECNN 2000), IEEE Press, 2000.
[13] F. Gruau and D. Whitley, "Adding learning to the cellular development of neural networks: Evolution and the Baldwin effect," Evolutionary Computation, vol. 1, no. 3, pp. 213-233, 1993.
[14] D. Whitley, V.S. Gordon, and K. Mathias, "Lamarckian Evolution, the Baldwin Effect and Function Optimization," in Parallel Problem Solving from Nature - PPSN III, Y. Davidor, H.P. Schwefel and R. Männer (eds.), pp. 6-15, Springer-Verlag, 1994.
[15] K.W.C. Ku and M.W. Mak, "Exploring the effects of Lamarckian and Baldwinian learning in evolving recurrent neural networks," in Proceedings of the 1997 IEEE International Conference on Evolutionary Computation, pp. 159-163, IEEE, 1997.
[16] C. Houck, J.A. Joines, M.G. Kay, and J.R. Wilson, "Empirical investigation of the benefits of partial Lamarckianism," Evolutionary Computation, vol. 5, no. 1, pp. 31-60, 1997.
[17] B.A. Julstrom, "Comparing Darwinian, Baldwinian and Lamarckian Search in a Genetic Algorithm for the 4-Cycle Problem," in Genetic and Evolutionary Computation Conference, Late Breaking Papers, pp. 134-138, Orlando, USA, 1999.
[18] S.E. Fahlman, "Faster-Learning Variations on Back-Propagation: An Empirical Study," in Proceedings of the 1988 Connectionist Models Summer School, Morgan Kaufmann, 1988.
[19] P.A. Castillo, J. González, J.J. Merelo, V. Rivas, G. Romero, and A. Prieto, "SA-Prop: Optimization of Multilayer Perceptron Parameters using Simulated Annealing," Lecture Notes in Computer Science, vol. 1606, pp. 661-670, Springer-Verlag, ISBN 3-540-66069-0, 1999.
[20] P.A. Castillo, J.J. Merelo, V. Rivas, G. Romero, and A. Prieto, "G-Prop: Global Optimization of Multilayer Perceptrons using GAs," Neurocomputing, vol. 35, no. 1-4, pp. 149-163, 2000.
[21] P.A. Castillo, J. Carpio, J.J. Merelo, V. Rivas, G. Romero, and A. Prieto, "Evolving Multilayer Perceptrons," Neural Processing Letters, vol. 12, no. 2, pp. 115-127, October 2000.
[22] P.A. Castillo, M.G. Arenas, J.G. Castellano, M. Cillero, J.J. Merelo, A. Prieto, V. Rivas, and G. Romero, "Function Approximation with Evolved Multilayer Perceptrons," in Advances in Neural Networks and Applications, Artificial Intelligence Series, Nikos E. Mastorakis (ed.), pp. 195-200, World Scientific and Engineering Society Press, ISBN 960-8052-26-2, 2001.
[23] M. Riedmiller and H. Braun, "A Direct Adaptive Method for Faster Backpropagation Learning: The RPROP Algorithm," in H. Ruspini (ed.), Proc. of the ICNN93, San Francisco, pp. 586-591, 1993.
[24] S. Fahlman, "An empirical study of learning speed in back-propagation networks," Tech. Rep., Carnegie Mellon University, 1988.
[25] L. Prechelt, "PROBEN1 - A set of benchmarks and benchmarking rules for neural network training algorithms," Technical Report 21/94, Fakultät für Informatik, Universität Karlsruhe, D-76128 Karlsruhe, Germany (also at http://wwwipd.ira.uka.de/~prechelt/), 1994.
[26] M.A. Grönroos, "Evolutionary Design of Neural Networks," Master of Science Thesis in Computer Science, Department of Mathematical Sciences, University of Turku, 1998.
[27] V. Cherkassky, D. Gehring, and F. Mulier, "Comparison of adaptive methods for function estimation from samples," IEEE Trans. Neural Networks, vol. 7, no. 4, pp. 969-984, 1996.
[28] M. Sugeno and T. Yasukawa, "A fuzzy-logic based approach to qualitative modeling," IEEE Transactions on Fuzzy Systems, vol. 1, no. 1, 1993.
[29] H. Pomares Cintas, "Nueva metodología para el diseño automático de sistemas difusos," Tesis Doctoral, Departamento de Arquitectura y Tecnología de Computadores, Universidad de Granada, 1999.
[30] M. Riedmiller, "Description and Implementation Details," Tech. Rep., University of Karlsruhe, 1994.
[31] M. Oliveira, J. Barreiros, E. Costa, and F. Pereira, "LamBaDa: An Artificial Environment to Study the Interaction between Evolution and Learning," in Congress on Evolutionary Computation, vol. I, pp. 145-152, Washington D.C., USA, 1999.
[32] G. Romero, M.G. Arenas, J. Carpio, J.G. Castellano, P.A. Castillo, J.J. Merelo, A. Prieto, and V. Rivas, "Evolutionary Computation Visualization: Application to G-Prop," Lecture Notes in Computer Science, vol. 1917, pp. 902-912, ISBN 3-540-41056-2, September 2000.