
From: ISMB-95 Proceedings. Copyright © 1995, AAAI (www.aaai.org). All rights reserved.

DNA Sequence Assembly and Genetic Algorithms: New Results and Puzzling Insights

Rebecca Parsons
Department of Computer Science
University of Central Florida
Orlando, FL USA

Mark E. Johnson
Statistics Department
University of Central Florida
Orlando, FL USA

Abstract

Applying genetic algorithms to DNA sequence assembly is not a straightforward process. Significantly improved results in terms of performance, quality of results, and the scaling of applicability have been realized through non-standard and even counter-intuitive parameter settings. Specifically, the solution time for a 10kb data set was reduced by an order of magnitude, and a 20kb data set that was previously unsolved by the genetic algorithm was solved in a time that represents only a linear increase from the 10kb data set. Additionally, significant progress has been made on a 35kb data set representing real biological data. A single contig solution was found for a 752 fragment subset of the data set, and a 15 contig solution was found for the full data set. This paper discusses the new results, the modifications to the previous genetic algorithm used in this study, the experimental design process by which the new results were obtained, the questions raised by these results, and some preliminary attempts to explain these results.

Genetic Algorithms and DNA Sequence Assembly

The laboratory processes for DNA sequencing are still limited to relatively short stretches of DNA, necessitating the assembly of longer sequences based on the base configuration of the shorter sequences. This assembly process, described in more detail below and in (Parsons, Forrest, & Burks 1994), is a combinatorial problem that is closely related to the Traveling Salesman Problem (TSP). However, the DNA sequence assembly problem is sufficiently different from TSP that the heuristic methods developed for TSP are not applicable in this context. In prior work, we have described a genetic algorithm for the DNA Sequence Assembly problem which gave promising results on small sequences but did not scale well to the problem sizes being tackled in the large-scale sequencing laboratories (Parsons & Burks 1993; Parsons, Forrest, & Burks 1994).

We have been studying the genetic algorithm to understand its behavior and the limiting factors on its performance. Several critical issues affect the performance of a genetic algorithm: problem representation, operator selection, population size, and parameter selection. Our previous work indicated that the performance of the genetic algorithm was sensitive to the setting of the rate parameters. This issue seemed important to address first. We applied the techniques of experimental design and response surface analysis (Box & Draper 1987; Box, Hunter, & Hunter 1978) to the rate parameters controlling the application of the various operators in the genetic algorithm. This analysis prompted us to set the operator rates at significantly different levels than those we had previously tried and those typically used by the genetic algorithm community. The result, however, was significantly improved convergence times for small to medium-sized problems and the resolution of a 20kb sequence which had previously not been solved by the genetic algorithm.

We next explored the population sizes for the genetic algorithm and found that significantly smaller population sizes resulted in additional performance improvements. Combining these results, we returned to the problem of the 35kb data set consisting of real data from a sequencing laboratory, a common benchmark for sequencing algorithms. We saw significantly improved performance on this data set as well, although we also found that minor modifications to the operators are required to properly exploit the building blocks within the evolved solutions.

In the remainder of this section, we introduce both the DNA Sequence Assembly problem and genetic algorithms. Next, we briefly summarize our previous work and describe the application of experimental design techniques to the GA parameter selection and the resulting parameter settings. We then describe the results obtained on the larger data sets using the modified GA. Finally, we discuss the questions of parameter setting and population sizes posed by these performance improvements and how these results relate to the application of genetic algorithms to other permutation problems.

The Basics of DNA Sequence Assembly

The laboratory sequencing processes for DNA are being aggressively tested and improved as a result of the Human Genome Project. The scale of this project is daunting, as one individual's genome contains approximately 3 billion bases (bonded pairs of nucleic acids). DNA consists of two anti-parallel, complementary strands of nucleic acids. The shotgun sequencing process replicates and separates these strands, and then breaks the strands into smaller fragments. These fragments are processed on the automated sequencing machines to determine the base sequence for each fragment. Unfortunately, in the process of making and sequencing the fragments, information about the location, orientation and strandedness of the fragments with respect to the original sequence is lost. The only information remaining about a fragment is its base sequence. The assembly process compares the individual base pair sequences for the fragments and, using this information, assembles a consensus sequence for the parent DNA. Consecutive overlapping fragments make up a contig, and the goal of the assembly process is to order the fragments such that the result is one contiguous sequence of overlapping fragments. The hypothesis followed in the assembly process is that fragments with a high degree of similarity of their sequences (a high overlap strength) likely come from the same area of the DNA. The implications and validity of this hypothesis are addressed briefly in the conclusions and in Myers (Myers 1994).

In this paper, we discuss results on five realistically sized data sets, referred to here as POBF, AMCG, Seto, MSeto and MSeto2. The first data set, POBF, is a human apolipoprotein (Carlsson et al. 1986), with accession number M15421, which is 10089 bases long. The data set AMCG is the initial 40% (20,100) of the bases from LAMCG, the complete genome of bacteriophage lambda, accession numbers J02459 and M17233 (Sanger et al. 1982). The Seto data set is an experimental data set made available for testing sequencing algorithms (Seto, Koop, & Hood 1993). This data set represents 34,475 bases of consensus sequence. The data sets referred to here as MSeto and MSeto2 are, respectively, 752 and 743 fragment subsets of the Seto data set. When we analyzed the Seto set, we found 86 fragments that did not have significant similarity to the published consensus sequence based on our similarity metric. We removed these fragments from the data set, since they cannot be properly placed, relative to the published consensus sequence, using only this similarity information (and not expert information from sequencers). Initially, we only removed the 77 fragments with similarity less than 10. The data sets have 177, 352, 829, 752 and 743 fragments respectively.

The Basics of Genetic Algorithms

Genetic algorithms are a method of solving problems inspired by the principles of natural selection and genetic inheritance. Originally proposed by Holland (Holland 1975) in the seventies, genetic algorithms have experienced a resurgence in popularity and use (Forrest 1993; Goldberg 1989). In its simplest form, a genetic algorithm operates over a population of binary strings, termed individuals. A fitness function, usually whatever function is to be optimized, assigns a fitness to each individual in the population. An individual competes with the others in the population based on its fitness.

Three primary operators are used in each generation of a GA: reproduction, crossover and mutation. The reproduction operator implements the survival competition; the expected number of copies of an individual in the new population is based on the individual's fitness relative to the fitness of the rest of the population. Poorly performing individuals die out, with their places taken in the next generation by new copies of more highly fit individuals. The crossover operator combines the information contained in two individuals in the population to create offspring. The mutation operator randomly alters an individual. Finally, each new individual has its fitness computed using the fitness function. This cycle of reproduction, crossover, mutation, and evaluation continues until some convergence criterion is satisfied.

The design of a genetic algorithm for a problem requires the identification of a fitness function, a representation for an individual solution, crossover and mutation operators that generate new solution individuals from existing ones, and a selection mechanism. The representation and operators should be selected to exploit the building blocks of a problem, where a building block is a component of a solution that both has a positive effect on fitness by itself and is part of a much better solution when combined with other building blocks. The dynamics of the genetic algorithm attempt to construct increasingly larger good building blocks until a good solution is found.

Summary of Previous Work

Genetic algorithms have been applied to many different optimization problems, including parameter/function optimization and combinatorial optimization problems. Permutation problems such as job shop scheduling and the traveling salesman problem raise difficult representation and operator design issues. The obvious representation for a permutation, an ordered list of elements in the permutation, is not readily manipulated by the traditional crossover and mutation operators. Problems occur because there are a significant number of infeasible solutions (combinations that are not permutations) and the operators are not closed over feasible solutions. In our prior work, we used both the sorted-order representation with the standard operators and the straightforward representation with specialized operators (Parsons, Forrest, & Burks 1994). This work indicated that the specialized operators were essential
to achieve good performance for the DNA Sequence Assembly Problem. However, we had limited success scaling that algorithm to problems larger than 10kb.

Representation, Fitness Function, and Operators

The DNA Sequence Assembly problem is a permutation problem, so solutions consist of a permutation of the fragments. The simplest way to represent a permutation is using the list of fragment identifiers; this is the representation used here. This representation, though, necessitates the use of either specialized operators or some mechanism to penalize illegal solutions. We chose to use specialized operators. For the reproduction and selection phase, we chose to use generational reproduction with sigma scaling and an elitist strategy.

The edge-recombination crossover operator was specifically designed for problems such as the DNA Sequence Assembly Problem that exploit adjacency information in the formation of high-quality solutions. Edge recombination is a complicated operator; a detailed explanation of the operator implemented here appears in (Starkweather et al. 1991). In general, this crossover attempts to preserve adjacencies in the parents, and in particular, those adjacencies that are common to both parents. When neither of those options is possible, a random selection is made. This operator is appropriate for the sequence assembly problem since the building blocks of a good fragment ordering consist of a set of fragments that are related to each other by the similarity metric and should therefore be adjacent to one another.

For a mutation operator, we chose the swap. This is the simplest operator that preserves the permutation property. Two positions in the ordering are selected at random, and the fragments in those positions are swapped to create the new individual (Churchill et al. 1993). This operator is a restricted form of a 4-opt transformation (Lin & Kernighan 1973).

We used two other operators, each of which relies on some domain-specific information. These two operators, inversion and transposition, move blocks of fragments, specifically contigs, in the ordering. Contigs are selected by choosing a fragment at random and moving out in both directions until the edge of either the contig or the ordering is encountered. An adjacent pair of fragments with an overlap strength of zero denotes the edge of a contig. Inversion reverses the order of the fragments in the selected contig (Goldberg 1989). Transposition moves a contig to a position between two adjacent contigs. Transposition and inversion are also a restricted form of 4-opt, with the restriction focusing on the selection of the edges to break and the edges with which to replace them.

The final factor in designing a genetic algorithm for a particular problem is the selection of an appropriate fitness function. The basic information available for evaluating an ordering is the pairwise similarity of the fragments in the fragment set (Churchill et al. 1993). The pairwise similarity is computed using the base sequence and considering all possible strand and orientation combinations. The final score, called the overlap strength, is the maximum similarity measure for the pair. There are several ways to evaluate the fitness of a particular ordering using this pairwise information. In previous work, we had determined that using a simple fitness function, analogous to that used for the traveling salesman problem, provided the best results for the genetic algorithm. Specifically, the fitness of an ordering is the sum of the pairwise overlap strengths for the fragments that are adjacent in the ordering:

$$F1(I) = \sum_{i=0}^{n-2} w_{f[i],\, f[i+1]}$$

where n is the number of fragments, f[i] is the fragment at position i of the ordering I, and w is the pairwise overlap strength. There are obvious problems with this fitness function, but it has proved adequate and computationally efficient in the past.

Table 3 summarizes the previous results of the genetic algorithm on the larger data sets considered here. The genetic algorithm was able to find the correct solution to the 10kb data set, although the number of trials required did not indicate that the process scaled well for larger data sets. The previous best results for the 20kb and Seto data sets supported that concern. While the 13 contig solution is not a terrible one, the solution of the Seto set is not much improved over a random solution. Clearly, if the genetic algorithm was to contribute to problems of realistic size, the performance on these data sets had to be understood.

Experimental Design and Genetic Algorithm Parameters

Use of a genetic algorithm requires the proper setting of several parameters: operator application rates, selection pressure and population size. Here we considered the setting of the operator rates and the population size. Previously, we had determined that the genetic algorithm for the fragment assembly problem required non-standard settings for the operator application rates (Parsons, Forrest, & Burks 1994). Specifically, the crossover rate required to achieve good performance, even on the smaller problems, was low. Since the performance was sensitive to the operator rates, we undertook a statistical analysis of the behavior of the genetic algorithm relative to the operator application rates (crossover, swap, transposition and inversion). There was, in addition, some indication of interaction among the operators that might have an effect on the performance.

Specifically, we did an initial study to determine the effects of high or low rates for each of these four parameters. These tests were performed on the 10kb POBF data set, which was the largest data set solved under the previous method. A 2^4 full factorial design (Box,
Hunter, & Hunter 1978) was used in the initial investigation of the four parameters, at the rates shown in Table 1. The results obtained for these 16 combinations of parameter settings showed that the performance improved with lower rates both of crossover and of swap (coefficients -.016 and -.004, respectively). The performance was independent of the choice of rate for transposition and inversion, although the variance among the runs was lower with the higher rate of transposition. In addition, no second or third order effects among the parameters were indicated in this study. Thus, we could concentrate on the rates for the two parameters, crossover and mutation.

          Cross.   Trans.   Swap   Inversion
          Rate     Rate     Rate   Rate
  High    .5       .38      .14    .38
  Low     .3       .28      .04    .28

Table 1: Rates for Full Factorial Experimental Design on POBF data set. Parameters for all runs: Population size 500, 1 million trials, 2.1k generations, Sigma Scaling 2.0, and Elitist Strategy.

The analysis of the full factorial design led to the method of steepest descent, using the coefficients determined above, to search for an optimal region. This descent located a mesa-like region in the parameter-performance space, which we delineated using a central composite design. Both these approaches are techniques of response surface methodology (see, for example, (Box & Draper 1987)). These subsequent runs were all performed using the same parameters as the initial runs, varying only the crossover and swap rates. The lower rate of inversion and higher rate of transposition were used for the remainder of the runs.

The central composite design indicated that there was a large region surrounding the resulting values within which the genetic algorithm achieved good performance. The values indicated by this design, however, are counter-intuitive. Specifically, genetic algorithms are frequently assumed to derive most of their power from the crossover operator. While previous work has shown that crossover should be used in the operator mix, the final value indicated by the central composite design is a rate of 10%, much lower than previously considered.

Once these rates were determined, we turned to the population size. The time requirements for the genetic algorithm are typically dominated by the evaluation of the fitness function. The population size and the number of generations together determine the number of fitness function evaluations for a given run, also referred to as the number of trials. Larger population sizes can decrease the number of generations required but also increase the memory requirements. Too small a population size limits the diversity available to the genetic algorithm and limits the potential for crossover to contribute to the optimization. Premature convergence is a typical problem when the population size is too small, since the rate at which the superior individuals in a population reproduce quickly reduces the representation in the population of competing but potentially beneficial individuals.

We studied the performance of the genetic algorithm on the data set used above, varying the population sizes. The results are shown in part in Table 2. The table focuses on fitness function values only, as this is an appropriate measure of the quality of the optimization process itself. Each entry in the table represents the progress of the optimization on a typical run [1]. For each population size, several intermediate fitness function values and the final value after one million trials are recorded. A gross measure of homogeneity of the population is also shown. This convergence estimate is the approximate percentage of the bits that are the same across all individuals in the population (lost bits, in genetic algorithm terminology). This is a gross measure since, as discussed briefly below, the traditional measure of convergence assumes characteristics of the operators that are not held by our suite of operators.

[1] Other runs with different random seeds gave comparable results.

The tests on population size indicated, for this data set, that a population size comparable to the number of fragments in the data set was the most effective value when considering total number of function evaluations, reproducibility of results, and population diversity. This result is somewhat surprising when one considers the growth rate of the space of permutations and the fact that the individuals grow at a rate of n log n. Additionally, the data indicate that there is a minimum population size below which the genetic algorithm cannot effectively function, due presumably to a loss of diversity in the population. The conventional wisdom of genetic algorithms indicates that this minimum level should be much larger for a problem of this size (1416 bits per individual), based solely on the number of low level building blocks. The next challenge was to see if these results generalized to larger and different data sets.

Generalization of the Results and the Seto Data Set

Previously, the genetic algorithm was unable to solve the data sets that were larger than 10kb. Using the same parameter settings found for the 10kb data set and a population size only slightly larger than the number of fragments, we ran the genetic algorithm on the other large data sets described above. Table 3 summarizes the previous results for the three data sets described here and the results obtained using the new parameter settings.

  Pop      Num        Num          Fitness     Conv
  Size     Gens       Trials       Function    Est
  1000     27         20,280       11,320      0
  1000     326        240,711      36,079      0
  1000     678        500,459      46,188      0
  1000     1,003      740,481      50,307      20.3
  1000     1,355      1,000,469    51,946      20.2
  400      68         20,212       19,836      0.2
  400      815        240,191      45,864      0.1
  400      1,697      500,065      50,621      1.0
  400      2,512      740,127      53,069      5.8
  400      3,394      1,000,054    54,622      1.7
  200      136        20,041       18,540      0.1
  200      1,632      240,127      48,802      21.6
  200      3,397      500,091      52,964      30.7
  200      5,024      740,138      55,046      28.6
  200      6,788      1,000,040    55,966      24.8
  50       551        20,020       38,439      0.2
  50       6,598      240,012      52,194      2.2
  50       13,692     500,015      55,202      36.6
  50       20,244     740,031      56,134      34.6
  50       27,342     1,000,003    56,647      46.0
  10       2,986      20,005       41,343      3.9
  10       36,043     240,006      49,972      2.1
  10       74,968     500,002      51,762      5.5
  10       110,547    740,000      52,997      4.7
  10       149,172    1,000,003    53,483      3.4

Table 2: Population Size Tests for POBF data set. Parameters for all runs: Crossover Rate .1, Swap Rate .04, Transposition Rate .38, Inversion Rate .28, Sigma Scaling 2.0, and Elitist Strategy.

  Name      Pop      Num       Num           Num
            Size     Gens      Trials        Contigs
  POBF      1500     13,000    5,900,000     1
            200      3,393     500,117       1
  AMCG      2500     5,600     2,300,000     13
            400      6,786     2,000,021     1
  Seto      2500     547       1,200,176     125
            900      9,045     6,000,265     27
            900      17,548    11,000,354    15
  MSeto     N/A      N/A       N/A           N/A
            775      14,003    8,000,54      5
            775      37,623    21,500,393    1
  MSeto2    N/A      N/A       N/A           N/A
            775      9,624     5,500,544     7

Table 3: Comparison of Genetic Algorithm Results. First line for a data set is result from prior work (Parsons, Forrest, & Burks 1994). Lines below use Crossover Rate .1, Swap Rate .04, Transposition Rate .38, Inversion Rate .28, Sigma Scaling 2.0, and Elitist Strategy.

The results presented in Table 3 represent improvements in several aspects of the performance of the genetic algorithm, although they also raise some issues. The number of function evaluations (trials) required to find the correct solution for the POBF data was reduced by an order of magnitude and the number of generations by a factor of 4 through the use of appropriate parameter settings and the smaller population size. For the AMCG data set, a smaller number of trials resulted in the correct solution using the adjusted parameter settings.

The results for the Seto set are most striking. While 27 contigs is not a reasonable answer, it is a dramatic improvement over the previous best solution we had obtained. As we were analyzing this solution, we realized that there were a large number of fragments that, according to our similarity metric (Churchill et al. 1993), did not match the parent sequence to any significant degree. Since our fitness function is driven only by the overlap information from this similarity metric and these fragments did not have significant overlaps, we removed them from the data set. At this stage, we identified 77 fragments that had a similarity metric of less than 10 with the final parent consensus, giving us a data set with 752 fragments.

The MSeto data set is the Seto data set with the 77 fragments mentioned above removed. The 5 contig solution presented above is very close to a correct solution. In fact, a simple greedy algorithm could convert this solution to the correct one. Genetic algorithms are best at getting close to the right solution, but do not tend to make the fine adjustments necessary to move from a near-optimal to an optimal solution. This behavior is consistent with the behavior of simulated annealing, another stochastic optimization technique (Bohachevsky, Johnson, & Stein 1993).

However, the solutions represent another challenge for the genetic algorithm; they contain defects that are not correctable, except by the relatively unlikely event of a swap operation. The discrepancies occur because lower quality, yet statistically significant, overlaps are chosen instead of higher quality overlaps (an overlap of 75 is significant, but an overlap of 250 may be the correct one). The specialized operators are currently designed to retain significant overlaps by only moving contigs. As a consequence of the above choices, the connections between contigs cannot be properly made, and contig isolation and stranding of fragments occur.

There are several approaches to this problem. We combined two of them in an effort to get to the optimal solution: relaxing the constraint on contig boundaries for the transposition and inversion operators and the addition of a form of greedy swap. Each of these techniques is described below.

While an overlap of 75-100 is clearly different than random, it likely represents an inappropriate placement for the fragment. Specifically, one of these fragments should likely be closer to the end of the contig, where it may serve as a bridge fragment joining contigs. The completely random application of transposition and inversion proved ineffective (and counterproductive) in our earlier studies, prompting us to restrict application of the operator to contigs. However, this restriction means that the only mechanism available for the movement of this kind of isolated fragment to its proper location is the swap. What we chose to do instead was to introduce an additional degree of randomness into the transposition and inversion operators. We added a threshold value for the contig boundary in conjunction with a random value. Overlaps above that threshold were still considered within the contig. Overlaps below that threshold had an increasing probability of being designated as the contig boundary. For these preliminary experiments, we used a threshold of 100, with a 10% probability of breaking the contig at an overlap of 90 and a 90% probability at an overlap of 10.

The existing swap operation randomly selects two fragments in the ordering and swaps their positions. We introduced a modification to this operator that only takes effect late in the run. Swaps late in the run are either completely random as before or greedy. For the greedy swap, one fragment, i, is chosen at random, and fragment j, the fragment with the highest overlap strength with i, is identified. A cursory analysis of the neighborhood in the ordering of both i and j is made to determine whether to move i next to j in the ordering or to move j next to i. There are a couple of issues with an operator like this. First, it is important to delay the application of this operator until later in the run, since otherwise the genetic algorithm may be led too early into local minima. Second, it proved critical to examine the neighborhood surrounding the fragments to determine which move to make. The selection of when to begin the greedy swap and how often to apply it are currently ad hoc. As with the previous modification, more parameters have been introduced into the optimization process.

These changes allowed us to find a single contig solution of the MSeto data set, building from the population which produced the 5 contig solution. With this new operator configuration, we returned to the Seto data set and produced a 15 contig solution, again building on our previously evolved population. In examining this solution, we realized that there were more fragments that had low similarity to the parent with our metric. Indeed, there are 86 fragments with an overlap score of 13 or less with the parent sequence; the other 743 have an overlap score of 86 or more. Additionally, there are 9 fragments that have no overlap of weight greater than 100 with more than 1 other fragment. Therefore, we generated another, slightly smaller, data set with these 86 fragments removed. This data set, as shown in Table 3, is proceeding at a faster rate of improvement using the whole suite of modified operators. The performance of the genetic algorithm with the modified operators is encouraging for the large, and therefore realistically sized, data sets.

Related Work

Different techniques have been applied to the problem of DNA Sequence Assembly. Greedy algorithms are the most popular, as they are easily the most efficient. In many data sets, greedy solutions are also the correct ones. The most popular greedy algorithm is that of Staden (Staden 1980), although Huang has proposed a greedy system that relies on a different mechanism for computing the similarity of two fragments and exploits this information later in the process (Huang 1992). Kececioglu and Myers (Kececioglu 1991; Kececioglu & Myers 1989) have proposed both an exact graph algorithm and an approximate algorithm that is provably close to the correct alignment when no errors exist in the data. Churchill et al. have developed a simulated annealer that uses the swap operator discussed above and a modified fitness function (Churchill et al. 1993).

Towards an Understanding of Genetic Algorithms for Permutation Problems

The study of the population sizes has led to several questions about the effects of the various operators on population diversity. In the standard genetic algorithm, once a given position is the same in all members of the population, only the mutation operator applied at that position will allow that value to change. This effect is the result of the preservation of positions by the crossover operator. In the operator suite designed for this problem, all operators have the potential of altering all bit positions. Therefore, the traditional measure of convergence is not applicable. Indeed, the figures in Table 2 demonstrate that the homogeneity metric can vary significantly over the course of the run.

Most of the conventional wisdom regarding population size selection derives from the problems with diversity and convergence. As the population becomes less diverse, the search narrows drastically, since the individuals generated by a homogeneous population do not differ radically from those individuals in that population. We hypothesize that the different effects of our operator suite and their impact on convergence may be what allows us to use smaller populations than would otherwise be anticipated. We are exploring this question to determine the kind of models and behavior that can be expected for genetic algorithms applied to permutation problems. Whitley and Yoo have recently developed exact models of genetic algorithm behavior for certain of the permutation operators (Whitley & Yoo 1995). These models are inapplicable to the question posed here, because they assume an infinite population. However, their models, assuming these infinite populations, still provide insight into the asymptotic
behavior in the finite population case.

Conclusions

This paper reports on significant performance improvements for a genetic algorithm applied to the problem of DNA Sequence Assembly. Specifically, an order of magnitude improvement was obtained on a medium-sized data set, and larger data sets have either been solved completely or have produced workable near-optimal solutions. These performance improvements are the result of applying techniques from experimental design and response surface analysis to the parameter settings. Additional changes in the operator suite resulted in the solution of the realistic data sets. A better understanding of the nature of the performance enhancements and the operation of the genetic algorithm in this setting is needed. We intend to explore such questions as we extend this work further.

There are still the problems associated with the fitness function and problem formulation. Specifically, it is easily shown that, in the presence of significantly conserved repeat sequences, this formulation of the problem leads to a consensus sequence that is shorter than the correct sequence. This compression occurs since the overlap among fragments from different repeated regions is extremely high, violating the hypothesis of the assembly. Myers (Myers 1994) has proposed an alternative formulation for the sequencing problem to address the problem of repeated DNA in the sequence. These issues, as well as the value judgements made by the sequencers evidenced in the 86 fragments with low similarity to the parent sequence, indicate that the information being used by the genetic algorithm is likely insufficient to adequately solve the problem. However, the genetic algorithm should be able to exploit additional information through combined fitness functions such as those proposed by Burks et al. (Burks, Parsons, & Engle 1994). The problem of repeated DNA is specifically addressed in that work through the use of map information. Optimization of the modified fitness function separates the members of the different repeat regions, resulting in the proper consensus sequence.

The DNA Sequence Assembly problem, one of the problems highlighted by DIMACS during its Computational Biology Year, is a critical part of the Human Genome Project. It also provides a realistic test bed for the study of genetic algorithms applied to permutation problems in general.

Acknowledgments

The authors wish to acknowledge C. Burks, M. Engle and S. Forrest for their past help on this problem. Discussions with S. Forrest and C. Kennedy have been quite enlightening and have contributed to the progress reported on here. The referees offered useful suggestions to improve the presentation.

References

Bohachevsky, I. O.; Johnson, M. E.; and Stein, M. L. 1993. Stochastic optimization and the gambler’s ruin problem. Journal of Computational and Graphical Statistics 1(4).

Box, G. E. P., and Draper, N. E. 1987. Empirical Model-Building and Response Surfaces. John Wiley and Sons.

Box, G. E. P.; Hunter, W. G.; and Hunter, J. S. 1978. Statistics for Experimenters, An Introduction to Design, Data Analysis and Model Building. John Wiley and Sons.

Burks, C.; Parsons, R.; and Engle, M. 1994. Integration of competing ancillary assertions in genome assembly. In Proceedings of the Second International Conference on Intelligent Systems in Molecular Biology. AAAI Press.

Carlsson, P.; Darnfors, C.; Olofsson, S.-O.; and Bjursell, G. 1986. Analysis of the human apolipoprotein B gene; complete structure of the B-74 region. Gene 49:29-51.

Churchill, G.; Burks, C.; Eggert, M.; Engle, M.; and Waterman, M. 1993. Assembling DNA sequence fragments by shuffling and simulated annealing. Los Alamos Technical Report.

Forrest, S. 1993. Genetic algorithms: Principles of natural selection applied to computation. Science 261:872-878.

Goldberg, D. E. 1989. Genetic Algorithms in Search, Optimization, and Machine Learning. Addison Wesley Publishing Company.

Holland, J. H. 1975. Adaptation in Natural and Artificial Systems. Ann Arbor, MI: The University of Michigan Press.

Huang, X. 1992. A contig assembly program based on sensitive detection of fragment overlaps. 14:18-25.

Kececioglu, J., and Myers, E. 1989. A procedural interface for a fragment assembly tool. Technical Report TR-89-5, Department of Computer Science, University of Arizona, Tucson, AZ.

Kececioglu, J. 1991. Exact and approximation algorithms for DNA sequence reconstruction. Ph.D. Dissertation, University of Arizona, Tucson, AZ. TR 91-26, Department of Computer Science.

Lin, S., and Kernighan, H. W. 1973. An effective heuristic algorithm for the traveling-salesman problem. Operations Research 21:498-516.

Myers, G. 1994. An alternative formulation of sequence assembly. DIMACS Workshop on Combinatorial Methods for DNA Mapping and Sequencing.

Parsons, R., and Burks, C. 1993. An analysis of the random keys representation for DNA sequence assembly. Manuscript in Preparation.

Parsons, R.; Forrest, S.; and Burks, C. 1994. Genetic algorithms, operators and DNA fragment assembly. Machine Learning. To appear.

Sanger, F.; Coulson, A.; Hill, D.; and Petersen, G. 1982. Nucleotide sequence of bacteriophage lambda DNA. J. Mol. Biol. 162:729-773.

Seto, D.; Koop, B.; and Hood, L. 1993. An experimentally-derived data set constructed for testing large-scale DNA sequence assembly algorithms. Genomics. In press.

Staden, R. 1980. A new computer method for the storage and manipulation of DNA gel reading data. Nucl. Acids Res. 8:3673-3694.

Starkweather, T.; McDaniel, S.; Mathias, K.; Whitley, D.; and Whitley, C. 1991. A comparison of genetic sequencing operators. In 4th International Conference on Genetic Algorithms, 69-76.

Whitley, D., and Yoo, N.-W. 1995. Modeling simple genetic algorithms for permutation problems. To Appear: Foundations of Genetic Algorithms 3.
