Genetic Programming for Julia: Fast Performance and Parallel Island Model Implementation
Total Page:16
File Type:pdf, Size:1020Kb
Genetic Programming for Julia: fast performance and parallel island model implementation Morgan R. Frank November 30, 2015 Abstract I introduce a Julia implementation for genetic programming (GP), which is an evolutionary algorithm that evolves models as syntax trees. While some abstract high-level genetic algorithm packages, such as GeneticAlgorithms.jl, al- ready exist for Julia, this package is not optimized for genetic programming, and I provide a relatively fast implemen- tation here by utilizing the low-level Expr Julia type. The resulting GP implementation has a simple programmatic interface that provides ample access to the parameters controlling the evolution. Finally, I provide the option for the GP to run in parallel using the highly scalable ”island model” for genetic algorithms, which has been shown to improve search results in a variety of genetic algorithms by maintaining solution diversity and explorative dynamics across the global population of solutions. 1 Introduction be inflexible when attempting to handle diverse problems. Julia’s easy vectorization, access to low-level data types, Evolving human beings and other complex organisms and user-friendly parallel implementation make it an ideal from single-cell bacteria is indeed a daunting task, yet bi- language for developing a GP for symbolic regression. ological evolution has been able to provide a variety of Finally, I will discuss a powerful and widely used solutions to the problem. Evolutionary algorithms [1–3] method for utilizing multiple computing cores to improve refers to a field of algorithms that solve problems by mim- solution discovery using evolutionary algorithms. Since icking the structure and dynamics of genetic evolution. In many languages implement multi-processing, rather than particular, evolutionary algorithms typically model solu- multi-threading, naive methods seeking to improve the tions to a given problem using some data structure, which runtime of evolutionary algorithms (e.g. evaluate the fit- can be treated as a genotype or gene. These algorithms be- ness of population members in parallel), are often rel- gin with a collection, or “population”, of these data struc- atively fruitless due to the communication overhead re- tures and iteratively generate new populations, or “gener- quired to bring the population back to a single process for ations”, by implementing genetic mutation and crossover essential steps in the algorithm iteration, such as genetic on good solutions in the existing population. Good solu- crossover. Furthermore, evolutionary algorithms that it- tions are often referred to as being “fit” if they reproduce erate with a single population are susceptible to conver- a target signal; therefore, we typically measure fitness by gence on a “local optima” solution, rather than seeking measuring how well a solution minimizes error. The fit- a more desirable “global optima” through increased ex- ness of a given solution can be thought of as a phenotype. ploration. The island model [5,6] is a useful paradigm Here, I focus on a particular type of evolutionary al- for parallel computation in evolutionary algorithms that gorithm called “genetic programming” (GP). In general, largely overcomes both concerns by iterating an indepen- genetic programming evolves syntax trees that represent dent population of solutions on each available core. Since models, which aim to reproduce a target signal given in- each population is initialized randomly, it is possible that put signals. In theory, these syntax trees can represent individual populations will independently explore solu- any program, but we will restrict ourselves to the sym- tion space and possibly converge on different local op- bolic regression problem here. Therefore, our syntax trees tima. The last piece is to implement occasional “migra- will represent equations, which will act on numeric input tion” between the populations, where good solutions from variables and reproduce a numeric output variable. Some one population are removed and sent to a different popu- powerful third-party software packages exist, such as Eu- lation running on a different parallel process. Migration reqa [4], but these packages are not open-source and can 1 seeks to encourage the overall GP to continue exploring haps you should include trigonometric functions in your around the best solutions, rather than having populations library; otherwise, maybe you should omit trig functions converging too quickly to locally optimal solutions. Fur- if your signal is not periodic but is instead quite noisy as thermore, migration is the only step in the island model the trig functions might be used to overfit the noise in the that requires communication between processes during target signal. My implementation of GP is restricted to runtime, but migration is infrequently performed and only using only the terminals stored in the library, but future it- a few solutions are communicated at each migration step. erations of this project would instead use only parameters In summary, we see that the island model provides a and variables as terminals for the syntax trees, along with highly scalable means to improve overall performance of occasional parametric optimization for each syntax tree evolutionary algorithms, rather than simply improving al- through least-squares regression or another evolutionary gorithmic runtime. algorithm. The Population type represents an independent GP population, which iterates over time to form new gener- 2 GP implementation in Julia ations of solutions. Populations store a population as a list of Trees. The number of Trees to maintain in the Pop- I employ an object-oriented implementation of GP in Julia ulation is given as the popSize property when the Popula- by defining four key types that are essential to running the tion is initialized. The subset of these Trees representing GP. The Tree type will be used to represent the solution “good” solutions are stored in a separate list as well. We models as syntax trees. At the root of each Tree is a Julia will see that “good” Trees define a Pareto front. In this Expr, which is the literal representation of the equation context, “good” trees are trees that are young, simple, and which we hope reproduces the target variable from input reproduce the output signal. For measuring the latter, a variables. The Expr type is a powerful low-level Julia type fitness function must be provided to the Population which that is perfect for this task because of fast evaluation and defines how to measure the error between the syntax tree easy manipulation. In fact, the Julia metaprogramming and the target signal; for example, you might use abso- page highlights that Julia expressions are first parsed into lute error or root-mean-squared error as a fitness function. Expr types before being possibly compiled and then eval- The GP will attempt to find Trees that, in part, minimize uated. In addition to the Expr root, Trees have property the fitness function. fields for the age (ie. the number of generations the tree There are several ways one might define the simplicity has survived unchanged), the depth (ie. the maximum tree of a syntax tree, but for our purpose we can assume that depth of the syntax tree), the number of nodes in the syn- Trees with fewer nodes will be easier to interpret in the tax tree, and the fitness value of the syntax tree. Further- context of the problem being solved by the GP. The type more, Tree types have methods for copying themselves, of crossover typically employed in genetic programming listing their nodes in evaluation order, a toString method, (discussed below) tends to produce new syntax trees that an equals method, and a method for determining if the are deeper than their parent trees; this tendency towards tree is better than another tree (if so, then we would say larger and more complicated trees is known as “bloat”. the tree “dominates” the other tree, but more on that be- Allowing bloat to go unchecked would result in only com- low). plicated solutions and eventually even effect the runtime I define a gpLibrary type as a place to store the func- of the GP algorithm. Note that bloat is not always a bad tions and terminals with which to construct Trees. Termi- thing, as the genetic material contained therein may lead nal values are numbers, called “constants”, or variables, to useful combinations of library variables for a future and they represent entities that can be used as leaves in Tree. In any case, implementing selection pressure for the syntax trees. Functions are added to the library along simplicity should help combat bloat. Future iterations of with the number of input variables they require. These this project might allow users to assign complexity val- functions are available to be used as non-leaf nodes in the ues to each function in the gpLibrary which represents the syntax trees, and so the number of inputs for each func- cost of executing that function or interpreting that func- tion determines the number of children a node will have tion in the context of the larger problem. based on the function the node represents. Beyond stor- Finally, we want “young” Trees, or specifically Trees ing these entities, gpLibrary types also contain some nice that have not been around for many generations, to allow functionality for randomly accessing stored Terminals and new solutions a chance to develop into potentially good Functions. Deciding which functions and which constants solutions before being dominated by existing Trees. This to include in the library is problem dependent. For exam- consideration helps to maintain some exploration dynam- ple, if you know your target signal is periodic, then per- 2 A Tree using crossover. To mutate a Tree, we first randomly select a node. If that node is a leaf, then the variable stored in that node is changed to a randomly selected ter- minal from the library.