DNA Sequence Assembly and Genetic Algorithms New Results and Puzzling Insights
From: ISMB-95 Proceedings. Copyright © 1995, AAAI (www.aaai.org). All rights reserved.

Rebecca Parsons, Department of Computer Science, University of Central Florida, Orlando, FL USA, [email protected]
Mark E. Johnson, Statistics Department, University of Central Florida, Orlando, FL USA, [email protected]

Abstract

Applying genetic algorithms to DNA sequence assembly is not a straightforward process. Significantly improved results in terms of performance, quality of results, and the scaling of applicability have been realized through non-standard and even counter-intuitive parameter settings. Specifically, the solution time for a 10kb data set was reduced by an order of magnitude, and a 20kb data set that was previously unsolved by the genetic algorithm was solved in a time that represents only a linear increase from the 10kb data set. Additionally, significant progress has been made on a 35kb data set representing real biological data. A single-contig solution was found for a 752-fragment subset of the data set, and a 15-contig solution was found for the full data set. This paper discusses the new results, the modifications to the previous genetic algorithm used in this study, the experimental design process by which the new results were obtained, the questions raised by these results, and some preliminary attempts to explain these results.

Genetic Algorithms and DNA Sequence Assembly

The laboratory processes for DNA sequencing are still limited to relatively short stretches of DNA, necessitating the assembly of longer sequences based on the base configuration of the shorter sequences. This assembly process, described in more detail below and in (Parsons, Forrest, & Burks 1994), is a combinatorial problem that is closely related to the Traveling Salesman Problem (TSP). However, the DNA sequence assembly problem is sufficiently different from TSP that the heuristic methods developed for TSP are not applicable in this context. In prior work, we have described a genetic algorithm for the DNA Sequence Assembly problem which gave promising results on small sequences but did not scale well to the problem sizes being tackled in the large-scale sequencing laboratories (Parsons & Burks 1993; Parsons, Forrest, & Burks 1994).

We have been studying the genetic algorithm to understand its behavior and the limiting factors on its performance. Several critical issues affect the performance of a genetic algorithm: problem representation, operator selection, population size, and parameter selection. Our previous work indicated that the performance of the genetic algorithm was sensitive to the setting of the rate parameters. This issue seemed important to address first. We applied the techniques of experimental design and response surface analysis (Box & Draper 1987; Box, Hunter, & Hunter 1978) to the rate parameters controlling the application of the various operators in the genetic algorithm. This analysis prompted us to set the operator rates at significantly different levels than those we had previously tried and those typically used by the genetic algorithm community. The result, however, was significantly improved convergence times for small to medium-sized problems and the resolution of a 20kb sequence which had previously not been solved by the genetic algorithm.

We next explored the population sizes for the genetic algorithm and found that significantly smaller population sizes resulted in additional performance improvements. Combining these results, we returned to the problem of the 35kb data set consisting of real data from a sequencing laboratory, a common benchmark for sequencing algorithms. We saw significantly improved performance on this data set as well, although we also found that minor modifications are required to the operators to properly exploit the building blocks within the evolved solutions.

In the remainder of this section, we introduce both the DNA Sequence Assembly problem and genetic algorithms. Next, we briefly summarize our previous work and describe the application of experimental design techniques to the GA parameter selection and the resulting parameter settings. We then describe the results obtained on the larger data sets using the modified GA. Finally, we discuss the questions of parameter setting and population sizes posed by these performance improvements and how these results relate to the application of genetic algorithms to other permutation problems.

The Basics of DNA Sequence Assembly

The laboratory sequencing processes for DNA are being aggressively tested and improved as a result of the Human Genome Project. The scale of this project is daunting, as one individual's genome contains approximately 3 billion bases (bonded pairs of nucleic acids). DNA consists of two anti-parallel, complementary strands of nucleic acids. The shotgun sequencing process replicates and separates these strands, and then breaks the strands into smaller fragments. These fragments are processed on the automated sequencing machines to determine the base sequence for the fragment. Unfortunately, in the process of making and sequencing the fragments, information about the location, orientation and strandedness of the fragments with respect to the original sequence is lost. The only information remaining about a fragment is its base sequence. The assembly process compares the individual base pair sequences for the fragments and, using this information, assembles a consensus sequence for the parent DNA. Consecutive overlapping fragments make up a contig, and the goal of the assembly process is to order the fragments such that the result is one contiguous sequence of overlapping fragments. The hypothesis followed in the assembly process is that fragments with a high degree of similarity of their sequences (a high overlap strength) likely come from the same area of the DNA. The implications and validity of this hypothesis are addressed briefly in the conclusions and in Myers (Myers 1994).

In this paper, we discuss results on five realistically sized data sets, referred to here as POBF, AMCG, Seto, MSeto and MSeto2. The first data set, POBF, is a human apolipoprotein (Carlsson et al. 1986), with accession number M15421, which is 10089 bases long. The data set AMCG is the initial 40% (20,100) of the bases from LAMCG, the complete genome of bacteriophage lambda, accession numbers J02459 and M17233 (Sanger et al. 1982). The Seto data set is an experimental data set made available for testing sequencing algorithms (Seto, Koop, & Hood 1993). This data set represents 34,475 bases of consensus sequence. The data sets referred to here as MSeto and MSeto2 are, respectively, 752 and 743 fragment subsets of the Seto data set. When we analyzed the Seto set, we found 86 fragments that did not have significant similarity to the published consensus sequence based on our similarity metric. We removed these fragments from the data set, since they cannot be properly placed, relative to the published consensus sequence, using only this sim-

and genetic inheritance. Originally proposed by Holland (Holland 1975) in the seventies, genetic algorithms have experienced a resurgence in popularity and use (Forrest 1993; Goldberg 1989). In its simplest form, a genetic algorithm operates over a population of binary strings, termed individuals. A fitness function, usually whatever function is to be optimized, assigns a fitness to each individual in the population. An individual competes with the others in the population based on its fitness.

Three primary operators are used in each generation of a GA: reproduction, crossover and mutation. The reproduction operator implements the survival competition; the expected number of copies of an individual in the new population is based on the individual's fitness relative to the fitness of the rest of the population. Poorly performing individuals die out, with their places taken in the next generation by new copies of more highly fit individuals. The crossover operator combines the information contained in two individuals in the population to create offspring. The mutation operator randomly alters an individual. Finally, each new individual has its fitness computed using the fitness function. This cycle of reproduction, crossover, mutation, and evaluation continues until some convergence criterion is satisfied.

The design of a genetic algorithm for a problem requires the identification of a fitness function, a representation for an individual solution, crossover and mutation operators that generate new solution individuals from existing ones, and a selection mechanism. The representation and operators should be selected to exploit the building blocks of a problem, where a building block is a component of a solution that both has a positive effect on fitness by itself and is part of a much better solution when combined with other building blocks. The dynamics of the genetic algorithm attempt to construct increasingly larger good building blocks until a good solution is found.

Summary of Previous Work

Genetic algorithms have been applied to many different optimization problems, including parameter/function optimization and combinatorial optimization problems. Permutation problems such as job shop scheduling and the traveling salesman problem raise difficult representation and operator design issues. The obvious representation for a permutation, an ordered list of elements in the permutation, is not readily manipulated by
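
As an illustration of the generational cycle described in the overview of genetic algorithms above (reproduction, crossover, mutation, evaluation, repeated until a convergence criterion is met), the following is a minimal sketch of the simplest form of a GA over binary strings. It is not the assembly GA studied in this paper, which works on fragment orderings. The fitness function `one_max`, the function names, and all parameter values are assumptions chosen only for illustration.

```python
import random


def one_max(individual):
    """Hypothetical fitness function (an assumption): the number of 1 bits."""
    return sum(individual)


def reproduce(population, fitnesses, rng):
    """Fitness-proportionate reproduction: the expected number of copies of an
    individual scales with its fitness relative to the rest of the population."""
    # A small constant keeps the weights valid if every fitness is zero.
    weights = [f + 1e-9 for f in fitnesses]
    chosen = rng.choices(population, weights=weights, k=len(population))
    return [list(ind) for ind in chosen]


def crossover(a, b, rng):
    """One-point crossover: the two parents exchange their tails."""
    point = rng.randrange(1, len(a))
    return a[:point] + b[point:], b[:point] + a[point:]


def mutate(individual, rate, rng):
    """Flip each bit independently with probability `rate`."""
    return [bit ^ 1 if rng.random() < rate else bit for bit in individual]


def run_ga(length=20, pop_size=30, crossover_rate=0.7, mutation_rate=0.01,
           max_generations=200, seed=0):
    """Run the cycle and return the best individual seen. All rates and sizes
    are illustrative defaults, not the settings reported in the paper."""
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(length)]
                  for _ in range(pop_size)]
    best = max(population, key=one_max)
    for _ in range(max_generations):
        fitnesses = [one_max(ind) for ind in population]
        if max(fitnesses) == length:  # convergence criterion: optimum reached
            break
        parents = reproduce(population, fitnesses, rng)
        offspring = []
        for i in range(0, pop_size - 1, 2):
            a, b = parents[i], parents[i + 1]
            if rng.random() < crossover_rate:
                a, b = crossover(a, b, rng)
            offspring.extend([a, b])
        if len(offspring) < pop_size:  # odd population size: carry one parent over
            offspring.append(parents[-1])
        population = [mutate(ind, mutation_rate, rng) for ind in offspring]
        best = max(population + [best], key=one_max)
    return best
```

The operator rates and the population size appear here as plain arguments (`crossover_rate`, `mutation_rate`, `pop_size`), which are the kinds of settings the experimental design work in this paper varies. A permutation problem would need a different representation and different operators from the binary ones sketched here.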