<<

Prior to our understanding of DNA and yeast. This publication provided a offered to solve the phylogenetic recon- molecular sequence information, paleon- tremendous opportunity for the use of struction problem, some using - tologists inferred the on biological sequence information as “mol- ary . Earth by comparing morphological charac- ecular fossils” of information that could teristics (preserved phenotypic traits) be compared between extant organisms Challenges in found in fossils. Accurate estimation of this to determine their evolution. Seemingly The phylogenetics problem can be history of life or “phylogeny” was general- all that was required was additional loosely defined as the search for a tree- ly only possible for those organisms that sequence information and better comput- like structure that defines ancestral rela- had “hard parts” capable of preservation, ers and algorithms for their interpreta- tionships between related objects over died in a “fossilization-friendly” context, tion. Assessing the reliability of phyloge- time. The related objects can be any- and were unearthed over time only to be nies involves difficult statistical and com- thing from biological sequence informa- discovered by a very fortunate paleontolo- putational problems, including the NP- tion to morphological characters. The gist. According to Newton and Laporte, complete problems of sequence align- divergence over evolutionary time rep- paleontologists are the first to recognize ment and discovering the best phyloge- resented is captured in a treelike struc- the incompleteness of the fossil record and netic tree that fits the data. ture termed a “phylogeny” (Fig. 1). The the resulting difficulty of formulating evo- Given modern databases filled with number of unrooted, birfurcating tree lutionary relationships between extant sequence information, interest has turned topologies T for n taxa is given by organisms. It is a shame that many of from one of generating sequence to Earth’s organisms have very likely left no rapid interpretation and discovery of (2n − 5)! trace of their existence. “true” phylogenies for their application in T = .(1) ( − ) n−3 Our ability to infer evolutionary histo- not only the resolution of the history of n 3 !2 ries changed significantly following the life but also for epidemiology as it relates discovery of the genetic code. Fitch and to human disease. Three developments In general, three possible methods Margoliash were the first to offer a com- have been essential in this progression: are used to search for the best topolo- puter process for the construction of 1) the development of criteria and algo- gy from this set of possible trees: 1) phylogenetic trees from protein sequence rithms for discriminating among potential exhaustive methods, 2) branch and information, specifically for 20 phylogenies, 2) increased computational bound methods, and 3) heuristic meth- cytochrome C protein sequences (the power over time, and 3) the rapid ods. In analogy to the traveling sales- most for any protein to that date) that increase in sequence data availability. An man problem (TSP), exhaustive meth- had been elucidated from humans to assortment of algorithms has been ods are useful when the number of

The History of Life Gary B. Fogel Through Evolutionary Computation ©CORBIS, RUBBERBALL

AUGUST/SEPTEMBER 2005 0278-6648/05/$20.00 © 2005 IEEE 29 possible tree topologies (TSP routes) is efficiently search large numbers of char- is the one that groups taxa together with low. This is possible when the number acter states and infer “reasonable” his- similar characteristics and minimizes the of n taxa being compared is low (on torical relationships between organisms. number of observed overall changes in the order of ≤12) but rapidly becomes Additional confidence in a proposed the tree. There are several statistical infeasible with larger numbers of taxa. phylogeny results from overlap of “best” inconsistencies with this approach, most Branch and bound methods exclude trees generated from different sequence notably long-branch attraction and trees that do not meet specific criteria, or character sets. unequal rates of evolution. reducing the search space to a more It should be noted that one key dis- Distance methods are a second com- reasonable size and increasing the covery of 20th century evolutionary biol- mon method to infer phylogeny. These probability of an exact solution of ogy was the determination of a “tree of methods use pairwise distance calcula- merit. However, there are still limita- life” based purely on molecular tions for characters (e.g., a pairwise cal- tions on the upper number of taxa that sequences for a set of genes (ribosomal culation of mismatches measured across can be used with this approach. RNA genes) that are common to essen- all positions of a nucleotide sequence Heuristic searches commonly are tially all forms of extant life. According to alignment). A model of sequence evolu- used to build tree topologies either by Pace, the has reshaped our tion is then applied to the distance matrix changing the order in which the trees thinking of the evolution of life on Earth to correct for unobserved changes and are built or via branch swapping with and helped us identify the three major chance similarities. If the model of some metric of scoring (such as a parsi- kingdoms of life (eukarya, bacteria, and sequence evolution is accurate, then the mony) being used to determine which archaea). Phylogenetic methods are now correct tree can be recovered. These changes are more useful than others. It integral parts of almost every major area models may be specific to the characters is easy to envision how evolutionary of and several parts that are being investigated. It is clear that computation can be applied in this of ecology. Those interested in learning models of sequence evolution do not regard as a method of global optimiza- more about recent discoveries in this work equally well over all character sets. tion where the best resulting tree topol- area are referred to Assembling the Tree Thus, this approach works maximally as ogy is estimated after searching only a of Life and the Tree of Life web project well as the model of sequence evolution. small fraction of the search space. An . Maximum likelihood methods repre- unfortunate consequence of this approx- sent a third common method of phyloge- imation is that the resulting tree topolo- Phylogenetic methods netic inference. This approach is general- gy cannot be guaranteed to be optimal; Correct alignment of the sequence ly similar to maximum parsimony. How- however, it does allow the researcher to information is generally a critical first ever, maximum likelihood methods use a component of phyloge- specified model of sequence evolution to netic analysis and can assign a value of confidence to ancestral be difficult when there states. Maximum parsimony methods are large numbers of assume that a shared character between Human AATAATATTCTTTATGACAACTCATTTGATTT sequences to be com- two extant taxa must have also existed in Monkey AATAATATTCTTTATGACAATTCATTTGATTT pared, when the the ancestor of those two taxa. Maximum Mouse AATAATATTCTTTATCGCAACTCATTTGATTT sequences are long, and likelihood methods assign a confidence when the sequences to the existence of that ancestral charac- Kangaroo AATAAAATTCTTTATCGCAATTCATTTGAAAT have limited evolution- ter based on the distance (or length of ary conservation. time) that exists between the two extant Parsimony is a com- taxa. As a result, maximum likelihood mon method used to methods tend to be more computational- infer phylogenetic trees. ly intensive than parsimony methods. This method is based They also share the same requirements on the hypothesis that as distance methods for a model of Human the “best” tree in the sequence evolution specific to the char- space of all possible acter set under investigation. However, trees is the one that maximum likelihood approaches can Monkey explains the relation- improve tree inference when the ship between the extant sequences are not closely related. All taxa with the fewest three of the above approaches are the Mouse number of evolutionary subject of considerable investigation in changes throughout the the literature and are the subject of opti- tree. The principle of mization with simulated evolution. Kangaroo inheritance implies that Fig. 1 Phylogenetic reconstruction. The nucleotide sequences when a characteristic Applications of evolutionary for four organisms are aligned and in this simple example are state is modified in a computation to phylogenetics highly similar except for six positions noted as (“|”). At each of species, the descen- Evolutionary computation has been these informative positions (characters), change has occurred dents of that species applied to phylogenetic reconstruction over evolutionary time. All other nucleotide positions are invariant and not required for proper phylogenetic will have a high proba- for 10 years. Matsuda was the first to reconstruction. Combining key characters and resolving the bility of sharing that use evolutionary algorithms for phylo- true phylogeny is challenging when the number and type of characteristic. Thus the genetic reconstruction, doing so with characters/taxa increases. most parsimonious tree protein sequences. Since that time,

30 IEEE POTENTIALS these strategies have been extended for PAUP was the same as use with different character sets, one of the trees discov- including Lewis who used evolutionary ered with the evolution- algorithms for phylogenetic reconstruc- ary but required tion with nucleotide sequence informa- 783 hours to do this cal- tion. For this purpose, individuals in culation. In comparison, A the population were representations of the evolutionary approach topologies along with only required 42.4 hours. BC D EA all branch lengths and values of other Similar results and com- B C parameters regarding the model of parisons to standard sequence evolution used. and methods were made by D E recombination operators were defined Reijmers et al. who cou- (a) (b) to generate new offspring solutions and pled an evolutionary algo- Fig. 2 Example of two parent phylogenies with five species was scored with respect to the rithm to a neighbor-join- each. (a) A subtree with species B is identified in the first natural log likelihood (lnL) score. High- ing approach to reduce parent solution. (b) A subtree that contains species B. The er lnL scores were favored over time the number of trees in the two circled branches would then be the subject of crossover (using ranked selection) until conver- search space with suc- (adapted from Congdon, 2002). gence on a solution was observed. The cessful evaluation relative best resulting individual from this evo- to the FITCH algorithm. represents only the direct lineage to the lutionary search was believed to repre- Congdon and Greenfest developed species in 1 above). sent the best phylogenetic topology. a program called “Gaphyl” for phylo- 3. In a second parent phylogeny, the This approach was applied to a phylo- genetic reconstruction with evolution- smallest subtree containing all the genetic reconstruction problem with 55 ary algorithms. This was first used species from the subtree obtained in 2 taxa involving sequences of the chloro- with binary character states using par- above is found. plast rbcL gene in plants. In each of simony as a guide for optimal tree 4. Two offspring trees are formed by three runs, the topology discovery. Gaphyl makes use swapping these two subtrees. converged on a different tree topology of several intuitive variation operators 5. Any duplicate species found in (most likely as a result of premature for successful search. First, a crossover the resulting offspring solutions are convergence to local optima and/or strategy is used as follows: pruned. insufficient parameter tuning). A com- 1. A species is selected at random Second, a mutation operator is used: parison of the approach to a more stan- from a parent phylogeny. 1. Select two species at random from dard heuristic method (PAUP v.4.0) on 2. A subtree is selected at random within one parent and swap their posi- the same data set using the same com- that includes the species from 1 above tions in the tree. puter and model of sequence evolution (excluding the subtree that represents 2. Select a subtree at random within was made. The best resulting tree from the entire tree itself and the subtree that a parent phylogeny. Rearrange the evolutionary history of the species within that subtree. The approach was extended in Cong- Computational Intelligence in Bioinformatics don and Septor for use with DNA Advances in computational intelligence provide us with the opportunity to sequence information, and the early indi- explore ways in which these methods can be applied to problems in bioinformat- cations are that the Gaphyl program can ics, such as the phylogenetic reconstruction problem. However, this interdiscipli- outperform a common program called nary research requires understanding the biological data, methods for optimal Phylip on datasets with l63 species with search, and where and how these can be applied. To bring these researchers 1,588 nucleotides each in terms of CPU from and biology closer together, a number of workshops and time, but both tools are able to find four special sessions have been organized at IEEE Computational Intelligence Society equally parsimonies phylogenies for this sponsored events and activities including: data. A common theme in this work was • the IEEE Symposium on Computational Intelligence in Bioinformatics and that as the problem sets became more complex, Gaphyl was able to find more (IEEE CIBCB), sponsored by the IEEE Computational complete sets of phylogenies than Phylip, Intelligence Society even when making use of the fitness func- • a variety of special sessions on bioinformatics applications at IEEE sponsored tion from Phylip. Gaphyl commonly conferences such as the Congress on Evolutionary Computation (CEC) demonstrated a wider variety of resulting , International Joint Conference on Neural Net- “best” phylogenies than Phylip, suggesting works (IJCNN) www.ijcnn.org, and FUZZ-IEEE that Gaphyl could search the space of pos- • IEEE/ACM Transactions on Computational Biology and Bioinformatics is a rel- sible solutions more effectively. Congdon atively new IEEE publication that serves this community. presents a nice survey of the importance In particular the IEEE CIS Bioinformatics and Bioengineering Technical Committee of variation operators in this problem area (IEEE BBTC) promotes the research, and suggested that as the number of development, education, and understanding of computational intelligence species and attributes increases, the effec- tive of Gaphyl over Phylip appeared to methods in computational biology. increase. Shen and Heckendorn have con- tinued experimentation in this area.

AUGUST/SEPTEMBER 2005 31 Katoh et al., Lemmon and Pond and Frost utilized a genetic algo- • H. Matsuda, “Construction of phylo- Milinkovitch, and Brauer et al. have all rithm approach to assign lineages in a genetic trees from amino acid sequences recently published in mainstream biology phylogeny to different rates of nonsynony- using a ,” in Proc. journals, using evolutionary algorithms for mous and synonymous substitution (muta- Genome Workshop, 1995, no. phylogenetic reconstruction. In particular, tion rate) at the protein level. Very com- 6, pp.19–28. Brauer et al. focused on a parallelization monly, researchers assume a particular • H. Matsuda, “Protein phylogenetic of the evolutionary search, where a mas- substitution/mutation rate exists over an inference using maximum likelihood with ter process created a population of indi- entire phylogeny, when it is rather clear a genetic algorithm,” in Proc. Pac. Symp. viduals and sent each of them to another that the rate of mutation may vary signifi- Biocomput., 1996, pp. 512–523. single processor to be scored. Thus the cantly by lineage. The Pond and Frost • C.R. Newton and L.F. Laporte, population size of the evolutionary algo- study was the first to search for models of Ancient Environments, 3rd ed. Englewood rithm was equal to the number of slave substitution in a phylogenetic context. Cliffs, NJ: Prentice-Hall, Inc., 1989. nodes plus the master node. This method • S.L. Pond and S.D. Frost, “A combined mutation of the branch lengths Read more about it genetic algorithm approach to detect- and tree topology, recombination, migra- • M.J Brauer, M.T. Holder, L.A. Dries, ing lineage-specific variation in selec- tion, and selection using a maximum-like- D.J. Zwickl, P.O. Lewis, and D.M. Hillis, tion pressure,” Mol. Biol. Evol., vol. 22, lihood approach. Search-time improve- “Genetic algorithms and parallel process- pp. 478–485, 2005. ment was roughly linear with respect to ing in maximum-likelihood phylogeny • N.R. Pace, “A molecular view of the number of processors that were used. inference,” Mol Biol. Evol. vol. 19, no. 10, microbial diversity and the biosphere,” Also of note was the use of both real bio- pp. 1717–1726, 2002. Science, vol. 276, pp. 734–740, 1997. logical data (288 plant taxa each with • C.B. Congdon, “Gaphyl: A genetic • T.H. Reijmers, R.L. Wehrens, and 4,822 nucleotides of sequence informa- algorithms approach to ” in M.C. Buydens, “Quality criteria of genetic tion) in addition to simulated data (a Principles of Data Mining and Knowl- algorithms for construction of phylogenet- Monte Carlo of 5,000 edge Discovery (Lecture Notes in Com- ic trees,” J. Comput. Chem., vol. 20, pp. nucleotide characters across a model tree puter Science, vol. 2168), L. DeRaedt 867–876, 1999. of the same 228 taxa, derived from the and A. Siebes, Eds. Berlin: Springer-Ver- • T.H. Reijmers, R. Wehrens, F.D. most parsimonious previously known lag, 2002, pp 67–78. Daeyaert, P.J. Lewi, and L.M.C. Buydens, answer for this data set. Use of simulated • C.B. Congdon and E.F. Greenfest, “Using genetic algorithms for the con- data provided a means of interrogating all “Gaphyl: A genetic algorithm approach struction of phylogenetic trees: Applica- of the “correct” ancestral states that are to cladistics,” in Data Mining with Evo- tion to G-protein coupled receptor commonly missing in real data and lutionary Algorithms, A.A. Freitas, W. sequences,” BioSystems, vol. 49, pp. proved to be very valuable in assaying Hart, N. Krasnogor, and J. Smith, Eds. 31–43, 1999. the overall worth of the approach. 2000, pp. 85–88. • J. Shen and R.B. Heckendorn, • C.B. Congdon, “Gaphyl: An evolu- “Discrete branch length representation Future applications tionary algorithms approach for the study for genetic algorithms in phylogenetic Sequence data availability is no longer of natural evolution,” in Proc. Genetic search,” in Lecture Notes in Computer an issue for phylogenetics studies. Rather and Evolutionary Computation Confer- Science “Applications of Evolutionary than simply compare genes from differ- ence (GECCO 2002), San Francisco, 2002, Computing: EvoWorkshops 2004: Evo- ent organisms, the problem has trans- pp. 1057–1064. BIO, EvoCOMNET, EvoHOT, EvoISAP, formed into complete genome compari- • C.B. Congdon and K.J. Septor, “Phy- EvoMUSART, and EvoSTOC, Coimbra, son. Computational power continues to logenetic trees using evolutionary search: Portugal, April 5-7, 2004,” G.R. Raidl, increase over time but the problem sizes Initial progress in extending gaphyl to S. Cagnoni, and J. Branke, Eds. Berlin: of interest continue to expand as well. work with genetic data,” in Proc. 2003 Springer, 2004, pp. 94–103. Development of better algorithms and Congress on Evolutionary Computation methods for discriminating among poten- (CEC 2003), 2003, pp. 320–326. About the author tial phylogenies will continue to play a • W.M. Fitch and E. Margoliash, “Con- Gary B. Fogel received his B.A. in central role in this area. struction of phylogenetic trees,” Science, biology from the University of Califor- Testing algorithm performance vol. 155, pp. 279–284, 1967. nia, Santa Cruz and a Ph.D. in biology requires knowledge of a “ground truth” • K. Katoh, K. Kuma, and T. Miyata, from the University of California, Los set of data from a known evolutionary “Genetic algorithm-based maximum-like- Angeles (UCLA). While at UCLA, he history. This is difficult to obtain from lihood analysis for molecular phylogeny,” was a fellow of the Center for the the fossil record but can be generated J. Mol. Evol., vol. 53, no. 4–5, pp. Study of Evolution and the Origin of artificially in the computer, in the labo- 477–484, 2001. Life. As vice president at Natural ratory with bacterial lineages or nucleic • A.R. Lemmon and M.C. Milinkovitch, Selection, Inc., his current research acids, or from particularly fast-evolving “The metapopulation genetic algorithm: interests focus on the application of forms such as viruses. From each of An efficient solution for the problem of computational intelligence methods to these processes, ancestral states can be large phylogeny estimation,” Proc. Nat. problems in the biomedical sciences. identified and stored as the evolution- Acad. Sci. USA, vol. 99, no. 16, pp. He is chair of the IEEE CIS Bioinfor- ary divergence unfolds. These data sets 10516–10521, 2002. matics and Bioengineering Technical make particularly interesting examples • P.O. Lewis, “A genetic algorithm for Committee and general chair for the for the evaluation of phylogenetic maximum-likelihood phylogeny inference 2005 IEEE Symposium on Computa- methods and are likely to be used more using nucleotide sequence data,” Mol. Biol. tional Intelligence in Bioinformatics often in that manner in the future. Evol., vol. 15, no. 3, pp. 277–283, 1998. and Computational Biology.

32 IEEE POTENTIALS