Evolutionary Algorithms: Concepts, Designs, and Applications in Bioinformatics Evolutionary Algorithms for Bioinformatics
Total Page:16
File Type:pdf, Size:1020Kb
Evolutionary Algorithms: Concepts, Designs, and Applications in Bioinformatics Evolutionary Algorithms for Bioinformatics Ka-Chun Wong Department of Computer Science, University of Toronto, Ontario, Canada INTRODUCTION Since genetic algorithm was proposed by John Holland (Holland J. H., 1975) in the early 1970s, the study of evolutionary algorithm has emerged as a popular research field (Civicioglu & Besdok, 2013). Researchers from various scientific and engineering disciplines have been digging into this field, exploring the unique power of evolutionary algorithms (Hadka & Reed, 2013). Many applications have been successfully proposed in the past twenty years. For example, mechanical design (Lampinen & Zelinka, 1999), electromagnetic optimization (Rahmat-Samii & Michielssen, 1999), environmental protection (Bertini, Felice, Moretti, & Pizzuti, 2010), finance (Larkin & Ryan, 2010), musical orchestration (Esling, Carpentier, & Agon, 2010), pipe routing (Furuholmen, Glette, Hovin, & Torresen, 2010), and nuclear reactor core design (Sacco, Henderson, Rios-Coelho, Ali, & Pereira, 2009). In particular, its function optimization capability was highlighted (Goldberg & Richardson, 1987) because of its high adaptability to different function landscapes, to which we cannot apply traditional optimization techniques (Wong, Leung, & Wong, 2009). BACKGROUND Evolutionary algorithms draw inspiration from nature. An evolutionary algorithm starts with a randomly initialized population. The population then evolves across several generations. In each generation, fit individuals are selected to become parent individuals. They cross-over with each other to generate new individuals, which are subsequently called offspring individuals. Randomly selected offspring individuals then undergo certain mutations. After that, the algorithm selects the optimal individuals for survival to the next generation according to the survival selection scheme designed in advance. For instance, if the algorithm is overlapping (De Jong, 2006), then both parent and offspring populations will participate in the survival selection. Otherwise, only the offspring population will participate in the survival selection. The selected individuals then survive to the next generation. Such a procedure is repeated again and again until a certain termination condition is met (Wong, Leung, & Wong, 2010). Figure 1 outlines a typical evolutionary algorithm Figure 1. Major components of a typical evolutionary algorithm In this book chapter, we follow the unified approach proposed by De Jong (De Jong, 2006). The design of evolutionary algorithm can be divided into several components: representation, parent selection, crossover operators, mutation operators, survival selection, and termination condition. Details can be found in the following sections. • Representation: It involves genotype representation and genotype-phenotype mapping. (De Jong, 2006). For instance, we may represent an integer (phenotype) as a binary array (genotype): '19' as '10011' and '106' as '1101010'. If we mutate the first bit, then we will get '3' (00011) and '42' (0101010). For those examples, even we have mutated one bit in the genotype, the phenotype may vary very much. Thus we can see that there are a lot of considerations in the mapping. • Parent Selection: It aims at selecting good parent individuals for crossovers, where the goodness of a parent individual is quantified by its fitness. Thus most parent selection schemes focus on giving more opportunities to the fitter parent individuals than the other individuals and vice versa such that “good” offspring individuals are likely to be generated. • Crossover Operators: It resembles the reproduction mechanism in nature. Thus they, with mutation operators, are collectively called reproductive operators. In general, a crossover operator combines two individuals to form a new individual. It tries to split an individual into parts and then assemble those parts into a new individual. • Mutation Operators: It simulates the mutation mechanism in which some parts of a genome undergoes random changes in nature. Thus, as a typical modeling practice, a mutation operator changes parts of the genome of an individual. On the other hand, mutations can be thought as an exploration mechanism to balance the exploitation power of crossover operators. • Survival Selection: It aims at selecting a subset of good individuals from a set of individuals, where the goodness of individual is proportional to its fitness in most cases. Thus survival selection mechanism is somehow similar to parent selection mechanism. In a typical framework like 'EC4' (De Jong, 2006), most parent selection mechanisms can be re-applied in survival selection. • Termination Condition: It refers to the condition at which an evolutionary algorithm should end. EVOLUTIONARY ALGORITHMS: CONCEPTS AND DESIGNS Representation Representation involves genotype representation and genotype-phenotype mapping. In general, designers try to keep genotype representation as compact as possible while keeping it as close to the corresponding phenotype representations as possible such that measurement metrics, say distance, in the genotype space can be mapped to those in phenotype space without the loss of semantic information. Figure 2. Some representations of evolutionary algorithms: (a) Integer representation (b) Protein structure representation on a lattice model (c) Tree representation for a mathematical expression In general, there are many types of representations that an evolutionary algorithm can adopt. For example, fixed-length linear structures, variable-length linear structures, and tree structures…… Figure 2 depicts three examples. Figure 2a is a vector of integers. We can observe that its genotype is a binary array with length equal to 10. To map it into the phenotype space, the first 5 binary digits (10011) are mapped to the first element (19) of the vector whereas the remaining 5 binary digits (11110) are mapped to the second element (30) of the vector. Figure 2b is the relative encoding representation of a protein on the HP lattice model (Krasnogor, Hart, Smith, & Pelta, 1999). Its genotype is an array of moves and its length is set to the amino acid sequence length of the protein. The array of moves encodes the relative positions of amino acids from their predecessor amino acids. Thus we need to follow the move sequence to compute the 3D structure of the protein (phenotype) for further evaluations. Figure 2c is the tree representation of a mathematical expression. Obviously, such tree structure is a variable length structure, which has the flexibility in design. If the expression is short, it can be shrunk during the evolution. If the expression is long, it can also be expanded during the evolution. Thus we can observe that the structure has an advantage over the previous representations. Nevertheless, there is no free lunch. It imposes several implementation difficulties to translate it into phenotypes. Parent Selection Parent selection aims at selecting “good” parent individuals for crossover, where the goodness of a parent individual is positively proportional to its fitness for most cases. Thus most parent selection schemes focus on giving more opportunities to the fitter parent individuals than the other individuals and vice versa. The typical methods are listed as follows: • Fitness Proportional Selection: The scheme is sometimes called roulette wheel selection. In the scheme, the fitness values of all individuals are summed. Once summed, the fitness of each individual is divided by the sum. The ratio then becomes the probability for each individual to be selected. • Rank Proportional Selection: Individuals with high ranks are given more chances to be selected. Unlike fitness proportional scheme, the rank proportional scheme does not depend on the actual fitness values of the individuals. It is a double-edged sword. On the positive side, it can help us prevent the domination of very high fitness values. On the negative side, it imposes additional computational costs for ranking. • Uniform Deterministic Selection: The scheme is the simplest among the other schemes. All individuals are selected, resulting in uniform selection. • Uniform Stochastic Selection: The scheme is the probabilistic version of uniform deterministic selection. All individuals are given equal chances (equal probabilities) to be selected. • Binary Tournament: Actually, there are other tournament selection schemes proposed in the past literature. In this book chapter, the most basic one, binary tournament, is selected and described. In each binary tournament, two individuals are randomly selected and competed with each other by fitness. The winner is then selected. Such a procedure is repeated until all vacancies are filled. • Truncation: The top individuals are selected deterministically when there is a vacancy for selection. In other words, the bottom individuals are never selected. For example, if there are 100 individuals and 50 slots are available, then the top 50 fittest individuals will be selected. Crossover Operators Crossover operators resemble the reproduction mechanism in nature. Thus they, with mutation operators, are collectively called reproductive operators. In general, a crossover operator combines two individuals to form a new individual. It tries to partition an individual into parts and then assemble the parts of two individuals into a new individual. The partitioning is not a trivial task. It depends on the representation