PsodaScript: Applying Advanced Language Constructs to Open-source Phylogenetic Search

Jonathan L. Krein1, Adam R. Teichert1, Hyrum D. Carroll, Mark J. Clement and Quinn O. Snell Computer Science Department Brigham Young University Provo, UT 84602

Abstract—Due to the immensity of phylogenetic tree space for Researchers have developed a number of heuristic algo- large data sets, researches must rely on heuristic searches to rithms that perform reasonably well on large data sets; nev- infer reasonable phylogenies. By designing meta-searches which ertheless, local optima in a search space often prevent these appropriately combine a variety of heuristics and parameter settings, researchers can significantly improve the performance algorithms from finding the globally optimal solution. This of heuristic searches. Advanced language constructs in the problem is exacerbated by the inability of researchers—due open-source PSODA project—including variables, mathematical to the vastness of the search space—to confidently distinguish and logical expressions, conditional statements, and user-defined between local and global optima. commands—give researchers a better framework for the explo- ration and exploitation of phylogenetic meta-search algorithms. PSODA’s approach to scripting meta-search algorithms is unique One answer to the local optimum problem is meta- among open-source packages and addresses several limitations of searching, which combines various heuristic searches together other phylogenetic applications. into a single search algorithm [8]. In phylogenetic inference Index Terms—meta-searching, phylogenetic analysis, scripting software packages, meta-searches may be constructed from languages existing algorithms either via an internal scripting language or an external wrapper. Unfortunately, wrapper applications I. INTRODUCTION tend to be cumbersome, often requiring considerable effort to orchestrate the interaction among heuristic searches. Further- Phylogenetic trees model the evolutionary relationships more, having to code meta-search algorithms in a wrapper- among species, enabling researchers to both analyze species script increases complexity and reduces clarity. In an effort differences [1] and to better understand the processes by which to build upon predecessor applications and provide a solid, those differences arise [2]. Phylogenetic trees represent an usable framework for developing meta-search algorithms, Phy- important research tool in many scientific areas. In AIDS logenetic Search Open-source Data Analysis (PSODA) [9] research, for example, scientists are using phylogenies to better implements a scripting language—PsodaScript—with syntax understand how the Human Immunodeficiency Virus (HIV) and constructs for clearly and concisely defining phylogenetic mutates in response to the human immune system. Because searches. HIV evolves faster than most known organisms, understanding its evolution will hopefully lead to improved vaccines [3]. PSODA is an open-source phylogeny reconstruction appli- Phylogenetic research is also contributing to many other areas cation. It provides basic and advanced phylogeny searching of study, including nucleic acids research [4] and endangered capabilities and a progressive multiple sequence alignment species conservation [5]. method. It can perform phylogenetic analysis with Maximum A major task related to phylogenetic research is that of effi- Parsimony and Maximum Likelihood. Additionally, it comes ciently inferring, from molecular data, the correct phylogenetic with a graphical user interface (GUI) that allows for visualiz- tree for a set of organisms. One approach to the problem is to ing individual phylogenies as well as the progress of a search. arrange a data set into all possible phylogenies, scoring each PSODA’s performance is comparable to existing phylogenetic tree to find the one that most likely represents the actual evo- programs (see Figure 1). PSODA can be obtained for free from lutionary relationships among the organisms. Unfortunately, http://csl.cs.byu.edu/psoda. because phylogenetic inference is an NP-complete problem [6], such an exhaustive approach is typically unreasonable. This paper seeks to familiarize researchers with PsodaScript Therefore, to infer trees from data sets with more than 15- as a powerful tool in the utilization and exploration of phy- 25 taxa, phylogenetic inference software must rely heavily logenetic meta search algorithms. We first provide a context on heuristic algorithms [7]. This is significant because most for PsodaScript by reviewing related work in phylogenetic realistic biological studies require data sets with at least 20 inference software and meta-searching. Next, we present a taxa [6], if not hundreds or thousands. brief overview of the PsodaScript language. Finally, we discuss two specific meta search applications that can be implemented 1Equally contributing authors in PsodaScript. Quinn Snell, Mark Clement, Mark Ebbert, Adam Teichert, Jonathan Krein Brigham Young University

Abstract

PSODA is an open-source phylogenetic search application that implements many traditional parsimony and likelihood search techniques as well as advanced search algorithms. PSODA is compatible with PAUP* and the search algorithms are competitive with those in PAUP*. PSODA also adds a basic scripting language to the PAUP block, making it possible to easily create advanced meta-searches. Because PSODA is open-source, we have also been able to easily add in advanced search techniques and character- ize the benefits of various optimizations. PSODA is available for Macintosh OS X, Windows, and .

Advantages of PSODA • High Performance • Open Source (-: FREE :-) • Modular Design (easy algorithm development) • Advanced Scripting Language • makes advanced meta-searches simple • Reads and Executes PAUP nexus files • PSODA is competive with PAUP*

16450 PAUP* PSODA TABLE I: Meta-search Capabilities Advanced ScriptingWrappable Internal Commands Internal Scripting 16400 • Added functionality to PAUP blocks. PHYLIP• Decision Statements￿ & Loops • Advanced Functions & User-defined Functions POY ￿￿ • Easily Extensible 16350 PAUP* • Easy scripting of￿￿ advanced meta-sarches such as: BioPerl• Ratchet (Parsimony￿￿ and Likelihood) ￿ TNT 16300 • DCM and more.￿￿ ￿ Parsimony Score PSODA ￿￿ ￿ 1 minute PAUP* Ratchet PSODA Likelihood Ratchet BEGIN PAUP; set criterion=parsimony ; BEGIN PAUP; set maxtrees=1 16250 increase = no; hsearch start=stepwise swap=TBR ; weights 2 : 7 14 17 26 27 31 34 45 50 52 54 57 60 63 64 65 73 77 86 91 92 102 103 107 115 117 121 begin randomReweight 122 124 127 131 133 134 140 142 155 156 163 173 176 183TABLE 185 187 195 197 198 202 204 II: 209 219 Accessibility 221 222 225 226 230 237 238 240 241 244 248 252 254 255 258 269 276 279 283 284 291 294 295 297 309 310 315 319 321 323 325 326 342 346 349 350 351 353 354 355 359 360 363 366 370 377 378 390 393 394 398 408 numChars = getWeightsLength(); 412 413 418 419 423 425 426 428 429 432 450 456 467 475 479 482 484 489 492 493 497 498 500; hsearch start=current swap=TBR numWeights = numChars / percent; ; set maxtrees=1; License Fee Open-source weights 1:all; 16200 hsearch start=current swap=TBR ; j = 0; 1 10 100 1000 1000 weights 2 : 1 17 20 21 29 33 35 39 42 50 57 59 69 70 71 73 80 86 93 96 97 99 100 102 111 121 125 126 141 142 148 149 155 157 163 168 174 179 180 184 187 188 194 203 208 212 216 219 220 225 226 231 242 while (j < numweights) Time (in seconds) 243 251 253 254 256 257 260 265 269 272 274PHYLIP 277 282 286 287 288 291 293 296 297 305 306 309 315 316 318 324 326 327 329 334 340 346 351 352 356 360 366 370 372 373 379 380 384 390 396 399 401 408 412 ￿ 414 421 423 426 428 429 441 442 444 448 455 459 460 462 472 473 475 477 478 487 489 493; weight = random(max = range); hsearch start=current swap=TBR ; POY col = random(max = numChars) + 1; PSODA Features weights 1:all; ￿ hsearch start=current swap=TBR ; weights weight:col; • Parsimony and Likelihood (F84) search weights 2 : 2 9 11 14 16 18 20 24 28 36 40PAUP* 43 45 47 59 60 62 64 65 70 72 84 90 92 105 107 111 112 118 122 129 130 135 138 146 148 152 153 156 161 162 167 172 173 177 190 197 200 207￿ 209 213 215 217 j++; 218 226 234 245 249 250 251 253 254 255 257 266 275 276 281 284 288 293 303 305 315 319 322 324 329 Fig. 1: Performance results, in terms of best parsimony score 336 347 349 351 352 355 356 358 360 365 367 368 371 376 377 378 381 382 407 409 410 411 412 413 418 endwhile; • Advanced Likelihood models and Bayesian methods coming soon. 421 424 425 426 437 442 444 447 448 452 459 461 462 466 474 475 477 478 480 484 488 491; hsearch start=current swap=TBR BioPerl ; end; ￿ found• Consensus over time, (strict for and a majority parsimony rules) search using PSODA and weights 1:all; hsearch start=current swap=TBR ; TNT weights 2 : 1 5 17 22 26 31 34 39 44 46 58 65 71 73 79 80 85 92 93 95 98 100 101 107￿ 114 119 127 131 • Search Visualization 137 138 139 142 145 146 150 151 152 155 161 162 167 170 175 176 177 188 191 197 198 211 212 217 220 set maxtrees = 1 nreps = 5; PAUP* on the 500 taxon seed plant rbcL data set [10]. Note: 228 232 234 236 238 239 241 242 243 245 252 257 268 269 270 271 281 283 287 300 305 308 309 311 318 323 324 326 333 334 335 342 345 349 350 351PSODA 361 363 367 368 371 372 374 384 388 395 406 408 410 425 set criterion=parsimony; 426 427 428 430 432 434 435 439 443 445 447 452 455 457 461 465 467 470 474 483 485 490; ￿ • Graphical User Interface hsearch start=current swap=TBR hsearch start = stepwise swap = tbr; lower parsimony score is better. ; weights 1:all; • Binaries for Mac OS X, Windows and Linux hsearch start=current swap=TBR range = 3; ; weights 2 : 2 3 9 20 22 24 25 26 28 30 32 34 38 43 46 49 58 60 66 68 70 72 75 87 88 91 97 99 103 105 109 112 115 118 123 139 140 144 146 152 160 162 165 168 180 184 194 197 203 204 207 214 217 220 222 • Object-oriented ++ 223 225 233 240 244 249 250 251 257 260 262 270 275 280 286 290 291 292 297 298 301 307 310 313 314 318 321 322 326 334 336 338 339 343 347 357 369 372 373 378 380 381 389 395 396 397 401 403 404 407 while (true) 410 413 425 430 447 454 455 460 461 464 472 478 482 483 487 488 489 492 494 498; • easy to contribute to new search algorithm development hsearch start=current swap=TBR randomReweight(percent = 10); ; weights 1:all; set criterion=parsimony; intuitive.hsearch start=current swap=TBR • contribute a scoring metric without needing to change the search ; II. RELATED WORK weights 2 : 2 3 6 12 25 29 30 37 50 56 60 61 63 64 67 68 69 70 75 79 85 87 95 99 102 103 109 110 120 hsearch start = current swap = tbr; 123 125 127 129 131 142 143 145 157 164 168 170 172 175 176 180 190 191 194 195 196 197 201 202 204 205 206Another 211 214 216 227 230 235 236 reason 240 241 246 250 252 why255 257 263 266using 272 281 289 304 307 wrapper 308 309 programs to run meta- • Available via subversion 310 327 328 329 336 338 339 342 350 353 354 356 357 358 368 380 382 387 389 397 400 405 407 408 410 413 418 421 422 423 424 440 444 446 448 451 452 455 457 463 468 470 476 480 485 495; hsearch start=current swap=TBR weights reset; Over• svn the checkout past few http://csl.cs.byu.edu/opensvn/psoda decades, developers have produced a ; searchesweights 1:all; introduces additional complexity is the difficulty hsearch start=current swap=TBR ; set criterion=likelihood; number of phylogenetic inference software packages. Even a weights 2 : 5 6 19 24 27 31 32 33 34 35 39 40 43 49 58 59 61 66 73 89 90 92 94 97 98 113 116 118 121 of126 127 131 coordinating 140 141 144 148 164 165 169 170 173 175 and 176 177 179 converting181 183 185 191 194 202 203 204 205inputs and outputs between 207 211 213 215 218 220 233 234 238 245 252 254 259 260 265 269 272 278 280 286 287 288 292 295 297 hsearch start = current swap = tbr; 302 304 306 310 313 315 317 319 328 331 334 335 336 342 346 347 351 357 362 365 377 384 388 389 390 403 411 419 426 438 440 443 444 446 447 455 460 463 464 467 486 491 493 494 499 500; endwhile; superficial summary of the many packages is beyond the scope heuristichsearch start=current swap=TBR searches. BioPerl addresses this problem, and many ; weights 1:all; of this paper. We discuss only a few of the more influential hsearch start=current swap=TBR end; researchersRepeated text must continue. use However it to, it is makeunclear when needed to stop. conversions. However, as packages—PHYLIP [11], POY [12], PAUP* [13], BioPerl a framework for writing phylogenetic search algorithms it [14], and TNT [15]—and theirhttp://csl.cs.byu.edu/psoda relevance to PSODA’s internal still suffers from over-complexity. BioPerl brings the power scripting language, PsodaScript (see Tables I and II). of Perl to , but installing Perl and the necessary Existing packages already offer varying degrees of meta- modules can be frustrating, particularly for users with limited searching capabilities. To perform advanced meta-searches, experience in computer systems administration. Moreover, Perl however, most of these packages (including PHYLIP, POY, and syntax can be an obstacle for researchers who wish to write PAUP*) must be wrapped by an external program or language complex algorithms, but who do not have the time to learn the such as DCM [16], Perl, or shell scripting. finer points of Perl programming. PHYLIP, an open-source package, is one of the longest TNT implements its own internal scripting language with maintained phylogenetic analysis applications, offering a col- functionality similar to what PSODA provides. Although TNT lection of separate programs that can be used as a toolkit has become a pacesetter in phylogenetic software, offering in phylogenetic research. PHYLIP’s component programs are excellent search speeds [17], it ultimately limits users because written in C and are relatively easy to install and use. However, it is closed source. For instance, TNT scores all trees with PHYLIP also runs much more slowly on large data sets than parsimony; users cannot score trees with likelihood, and they other phylogenetic applications [7]. Furthermore, because it cannot extend TNT if they need it. In addition to being closed- has no internal scripting language, PHYLIP cannot perform source, TNT’s syntax is similar to Perl’s in that it provides meta-search algorithms without the help of external programs, great power, but at the expense of simplicity. PsodaScript takes which introduces inordinate complexity—a common difficulty the middle ground between the command languages of POY with the wrapper approach to meta-searching. and PAUP* and the heavy scripting languages of Perl and One reason why using wrapper applications to run meta- TNT. search algorithms causes unnecessary complexity is that these programs must be written in languages whose syntax and III. PSODASCRIPT:METHODS &PROCEDURES terms are typically unfamiliar to phylogenetic researchers. To date, researchers have developed several successful meta- POY and PAUP* provide command sets that draw from searching techniques [18], [19]. However, without internal familiar terminology and facilitate some basic meta-searches, scripting capabilities and more advanced language constructs, but both applications still require external programming to using and experimenting with these techniques and other meta- run more advanced meta-search algorithms. One of the goals search algorithms is awkward and limited. Our challenge of PsodaScript is to provide a language with a syntax and was to design and implement an internal scripting language keywords that make describing phylogenetic search algorithms for PSODA in a way that would enable (even encourage) meta-search exploration, without compromising simplicity or the general idea of what a PSODA program does, even if they usability. have had no prior exposure to it.

A. The “Available, but Optional” Principle C. A Quick Introduction to PsodaScript Syntax To develop an internal scripting language for PSODA In keeping with PAUP*’s style, PSODA instructions are powerful enough to perform advanced meta-searches without each listed as distinct statements terminated by a semicolon. compromising simplicity, we decided to follow the paradigm The instruction in Sample 1 tells PSODA to run a heuristic that the language’s features—particularly the more advanced search beginning with the current trees in the repository and ones—must be available, but optional. In other words, not using the TBR method to explore the search space. Even with knowing about an advanced feature should not inhibit the very little exposure, the syntax and terminology make sense. user’s ability to use features that are already familiar. In fact, an empty PSODA program block executes without errors or Sample 1 PsodaScript is based on PAUP*’s command syntax warnings; thus, the user begins with a working program. and the NEXUS file format Another application of the “available, but optional” principle hsearch start = current is found in the “execute” command, which allows users to swap = TBR; execute external programs from within a PsodaScript. PSODA and PsodaScript are also open-source, giving interested devel- opers access to modify the underlying C++ as In addition to running PAUP* commands, PsodaScript also needed. Thus, users have multiple ways to step into a more allows users to store values in user variables. Assignment complex programming environment. PsodaScript makes these to a variable is performed via the = operator. For instance, alternatives available without increasing the complexity of the the statement height = 5; assigns the value of 5 to the core PSODA language or affecting users who want nothing to variable height. If the variable does not yet exist, it is do with them; they are available, but optional. created and initialize to the value 5. To avoid unintended side effect of assigning variables, PsodaScript uses the notion of B. Compatibility with the NEXUS File Format and PAUP*’s variable scoping, which means that a variable exists within a Command Set well defined region in the program. If a variable is created Despite its meta-searching limitations, PAUP* has become within a loop, for instance, then it will not be visible, or a popular proprietary package, offering a rich variety of accessible, outside of that loop; it is considered, therefore, commands and performing heuristic searches with competitive local to that loop. If a variable is created outside of all speed. The success of PAUP* has helped inspire the creation constructs, then it will be visible throughout the program of the open source project PSODA. (except for in user defined command bodies). As is the case with PAUP*, input files for PSODA generally Another point to note about variables in PsodaScript is that follow the NEXUS file format. This format was proposed they are loosely typed, as in Perl or JavaScript. This means in 1997 to be an extensible format for describing systematic that users do not have to explicitly declare the type of data information [20]. PSODA is also designed to execute PAUP* that a variable will hold when they create it. One variable blocks, although it’s language does not yet implement every in PsodaScript can be assigned data of any type, and when PAUP* command. Before execution, the PSODA interpreter necessary, the PSODA interpreter will attempt to perform warns the user about any unimplemented commands, refer- conversions. PsodaScript variables can also be combined with ences the commands by line number, and then skips over operators to form arithmetic and logical expressions. them during execution. As PSODA continues to be extended, more features and algorithms will be incorporated into the Sample 2 Syntax for a Conditional Construct application. There are two important reasons why PsodaScript was if () developed on the foundation of the NEXUS file formant and ... PAUP*’s command set (see Sample 1). First, many people are elsif () ... already familiar with this format and command style, making elsif () it easier for users to begin using PsodaScript. The other moti- ... vation lies in the simplicity of PAUP*’s syntax and commands, else which seem to be well suited to phylogenetic researchers. ... Building on PAUP*’s syntax, PsodaScript is intended to be endif; understandable by someone with little programming expe- rience. Where appropriate, PSODA uses words rather than obtuse symbols, and the more advanced language constructs Conditional constructs allow users to determine how a are intended to read like English statements. A person who is program (or meta-search) should proceed based on the current familiar with PAUP* commands should be able to understand state of the search. Sample 2 illustrates the syntax for a Sample 3 Syntax for a Loop Construct Sample 4 Example of a User-defined Command (also see Sample 5) while () ... [ endwhile; Randomly selected weights are changed to a value in the interval from 0 to range-1 inclusive.

range - upper limit of changed weight value conditional construct. In a conditional construct, the instruc- percent - the % of weights being changed. tions following the first true condition are executed once, ] after which the interpreter skips to the end of the conditional begin randomReweight (range=3, percent=25); (endif;) to continue on with the program. If none of the numColumns = getWeightsLength(); numToChange = numColumns * percent / 100; conditions evaluate to true, then the else component is j = 0; executed. A conditional statement may or may not have an while (j < numToChange) else component, and it may have zero or more elsif column = random (max=numColumns) + 1; components. newWeight = random (max=range); The loop construct is written in a similar format (see Sample weights newWeight:column; j++; 3). In a while loop construct, the body is repeatedly executed endwhile; as long as the condition expression evaluates to true. end; D. Error Checking and Handling Given its critical impact on usability, there is an ongoing effort to develop PsodaScript’s error checking and handling. Because phylogenetic searches frequently run for several days of Sample 4 are a comment, and are not executed. Notice or weeks at a time, error handling should include some degree that the first statement in the randomReweight command of data recovery following a fatal scripting error. Good error creates two variables, range and percent, with the de- checking and handling is even more important when running fault values of 3 and 25 respectively. Just like any other meta-search algorithms, which are more likely (due to code PsodaScript commands, this command could be called with a branching) to run longer before encountering faulty logic. parameter: randomReweight percent = 20;. Listing Before execution, the PSODA interpreter informs the user of percent = 20 as a parameter, overrides the default value fatally incorrect syntax and gives warnings for several non- of 25. fatal errors. For fatal logic errors, which may surface during The real benefit of user-defined commands, therefore, is execution, the interpreter informs the user of the error and the their ability to abstract away substantial segments of code, need to quit. It then gives the user an opportunity to save the which can then be executed as often as needed via a single, trees in the repository before exiting. parameterized command call.

E. User-defined Commands: Facilitating Algorithm Readabil- IV. APPLICATIONS ity and Design To demonstrate the use of PsodaScript for meta-search con- PsodaScript also supports user-defined commands for struction, we present here the implementation of two specific grouping program statements together under a single command meta-search algorithms: Parsimony Ratchet and Alignment name. The ability to group statements is a simple but powerful with Feedback. concept. Consider a large meta-search algorithm that has the ability to perform various heuristic searches, each of which A. Parsimony Ratchet was derived from a base search and modified via a specific configuration of parameters, weights, and so forth. Duplicating As discussed above, heuristic searches for optimal phyloge- the code for each heuristic search every time it is used leads to netic trees are often caught in local optima and thus fail to find a messy program block, which translates into more mistakes the global optimum. The Parsimony Ratchet attempts to avoid and decreased algorithm readability. In short, grouping and this problem by executing a series of heuristic searches alter- labeling program segments is a simple technique that makes nating between skewed and normal weightings for tree scoring an algorithm much easier to understand as a whole; it helps [18]. For the sake of comparison, we present a ratchet that current and future researchers answer the question, “What is could be run by PAUP* together with a possible implemen- this algorithm really doing?” tation of a ratchet in PsodaScript (see Sample 5). Developers Sample 4 illustrates a user-defined command that ran- have created tools that facilitate using the ratchet in PAUP* domly skews a given percentage of column weights—part (or PSODA) by generating long sequences of commands that of a technique used by the ratchet [18] to escape a local simulate looping and random number generation [21]. The optima. It is called as if it were a built-in PsodaScript PAUP* ratchet dramatically improves search performance and command (i.e. randomReweight;). The first several lines can be run in PSODA; however, using the programming Sample 5 Ratchet Implementation: PAUP* vs. PsodaScript

BEGIN PAUP; set criterion=parsimony ; BEGIN PSODA; set maxtrees=1 increase = no; hsearch start=stepwise swap=TBR ; [ weights 2 : 7 14 17 26 27 31 34 45 50 52 54 57 60 63 64 65 73 77 86 91 92 102 103 107 115 117 121 122 124 127 131 133 134 140 142 155 156 163 173 176 183 185 187 195 Randomly selected weights are changed to 197 198 202 204 209 219 221 222 225 226 230 237 238 240 241 244 248 252 254 255 258 a value in the interval from 0 to range-1 269 276 279 283 284 291 294 295 297 309 310 315 319 321 323 325 326 342 346 349 350 351 353 354 355 359 360 363 366 370 377 378 390 393 394 398 408 412 413 418 419 423 inclusive. 425 426 428 429 432 450 456 467 475 479 482 484 489 492 493 497 498 500; hsearch start=current swap=TBR ; range - upper limit of changed weight value set maxtrees=1; weights 1:all; percent - the % of weights being changed. hsearch start=current swap=TBR ; ] weights 2 : 1 17 20 21 29 33 35 39 42 50 57 59 69 70 71 73 80 86 93 96 97 99 100 begin randomReweight (range=3, percent=25); 102 111 121 125 126 141 142 148 149 155 157 163 168 174 179 180 184 187 188 194 203 208 212 216 219 220 225 226 231 242 243 251 253 254 256 257 260 265 269 272 274 277 numColumns = getWeightsLength(); 282 286 287 288 291 293 296 297 305 306 309 315 316 318 324 326 327 329 334 340 346 351 352 356 360 366 370 372 373 379 380 384 390 396 399 401 408 412 414 421 423 426 numToChange = numColumns * percent / 100; 428 429 441 442 444 448 455 459 460 462 472 473 475 477 478 487 489 493; j = 0; hsearch start=current swap=TBR ; while (j < numToChange) weights 1:all; hsearch start=current swap=TBR column = random (max=numColumns) + 1; ; newWeight = random (max=range); weights 2 : 2 9 11 14 16 18 20 24 28 36 40 43 45 47 59 60 62 64 65 70 72 84 90 92 105 107 111 112 118 122 129 130 135 138 146 148 152 153 156 161 162 167 172 173 177 weights newWeight:column; 190 197 200 207 209 213 215 217 218 226 234 245 249 250 251 253 254 255 257 266 275 276 281 284 288 293 303 305 315 319 322 324 329 336 347 349 351 352 355 356 358 360 j++; 365 367 368 371 376 377 378 381 382 407 409 410 411 412 413 418 421 424 425 426 437 endWhile; 442 444 447 448 452 459 461 462 466 474 475 477 478 480 484 488 491; hsearch start=current swap=TBR end; ; weights 1:all; hsearch start=current swap=TBR [ ; weights 2 : 1 5 17 22 26 31 34 39 44 46 58 65 71 73 79 80 85 92 93 95 98 100 101 Proceed with Heuristic search 107 114 119 127 131 137 138 139 142 145 146 150 151 152 155 161 162 167 170 175 176 177 188 191 197 198 211 212 217 220 228 232 234 236 238 239 241 242 243 245 252 257 ] 268 269 270 271 281 283 287 300 305 308 309 311 318 323 324 326 333 334 335 342 345 begin continueSearch; 349 350 351 361 363 367 368 371 372 374 384 388 395 406 408 410 425 426 427 428 430 432 434 435 439 443 445 447 452 455 457 461 465 467 470 474 483 485 490; set (criterion=parsimony); hsearch start=current swap=TBR ; hsearch (start=current, swap=tbr); weights 1:all; end; hsearch start=current swap=TBR ; weights 2 : 2 3 9 20 22 24 25 26 28 30 32 34 38 43 46 49 58 60 66 68 70 72 75 87 88 91 97 99 103 105 109 112 115 118 123 139 140 144 146 152 160 162 165 168 180 184 [ 194 197 203 204 207 214 217 220 222 223 225 233 240 244 249 250 251 257 260 262 270 Main Body 275 280 286 290 291 292 297 298 301 307 310 313 314 318 321 322 326 334 336 338 339 343 347 357 369 372 373 378 380 381 389 395 396 397 401 403 404 407 410 413 425 430 ] 447 454 455 460 461 464 472 478 482 483 487 488 489 492 494 498; hsearch start=current swap=TBR set (maxtrees=1, criterion=parsimony); ; hsearch (start=stepwise, swap=tbr); weights 1:all; hsearch start=current swap=TBR while (true) ; weights 2 : 2 3 6 12 25 29 30 37 50 56 60 61 63 64 67 68 69 70 75 79 85 87 95 99 randomReweight (percent=20); 102 103 109 110 120 123 125 127 129 131 142 143 145 157 164 168 170 172 175 176 180 continueSearch; 190 191 194 195 196 197 201 202 204 205 206 211 214 216 227 230 235 236 240 241 246 250 252 255 257 263 266 272 281 289 304 307 308 309 310 327 328 329 336 338 339 342 weights (reset); 350 353 354 356 357 358 368 380 382 387 389 397 400 405 407 408 410 413 418 421 422 423 424 440 444 446 448 451 452 455 457 463 468 470 476 480 485 495; continueSearch; hsearch start=current swap=TBR print (text = "Score: "); ; weights 1:all; print (text = getCurrentScore() . endline); hsearch start=current swap=TBR ; endwhile;

... Repeated text must continue. However, it is unclear when to stop. END;

constructs of PsodaScript to implement the ratchet offers a algorithm. On the other hand, consider the while loop in the number of advantages in addition to the performance boost. main body of the PSODA program (see Sample 5). It makes For instance, during the ratchet, better scoring trees are the algorithm of the ratchet search clear. First, twenty percent replaced by other (perhaps worse) trees on subsequent iter- of the weights are randomly re-weighted with skewed values, ations. With variables and conditional constructs, however, and the search continues with the skewed weights. Afterward, PsodaScript can also track the best score found so far, and save the weights are reset to their initial values, and the search only those trees that achieve a new best score. Additionally, continues under normal weights. This process of alternating PAUP* merely simulates a loop by running a long sequence of the search between skewed and normal weights is repeated heuristic searches, the length of which is ultimately limited by until the user tells PSODA to stop. memory and/or disk space. On the other hand, with a simple A further benefit of the PsodaScript framework is the loop, PsodaScript can run indefinitely, generating new random way it facilitates experimentation with parameters, such as numbers on each iteration. the range of random numbers used in the weighting or the Another disadvantage to the PAUP* ratchet is the added percentage of skewed weights. Additionally, researchers can dependence on external software, which further obfuscates the use print statements to periodically output information about the state of the search. Performing these simple tasks without [5] N. D. Wolfe, A. A. Escalante, W. B. Karesh, A. Kilbourn, A. Spielman, a programming language is impractical, and as noted above, and A. A. Lal, “Wild Primate Populations in Emerging Infectious Disease Research: The Missing Link?” Emerging Enfectious Diseases, external languages introduce unnecessary complexity. vol. 4, no. 2, 1998. [6] R. L. Graham and L. R. Foulds, “Unlikelihood that minimal phylogenies B. Feedback Alignment for a realistic biological study can be constructed in reasonable compu- tational time,” Mathematical Biosciences, vol. 60, no. 2, pp. 133–142, Phylogenetic inference frequently begins by producing a 1982. multiple sequence alignment (MSA) using software such as [7] M. J. Sanderson, “Flexible Phylogeny Reconstruction: A Review of Phylogenetic Inference Packages,” Systematic Zoology, vol. 39, no. 4, ClustalW [22] and an initial phylogenetic guide tree. In pp. 414–420, 1990. preparation for a phylogenetic search, MSA inserts gaps into [8] A. Miolanen, “Simulated evolutionary optimization and local search: the sequence data to make all sequences the same length. Introduction and application to tree search,” Cladistics, vol. 17, pp. S12– S25, 2000. Research has shown that the quality of the alignment signif- [9] H. Carroll, M. Ebbert, M. Clement, and Q. Snell. icantly impacts the success of the phylogenetic search [23]. [10] M. Chase, D. Soltis, R. Olmstead, D. Morgan, D. Les, B. Mishler, The software package POY performs a MSA for every tree M. Duvall, R. Price, H. Hills, Y. Qiu et al., “Phylogenetics of Seed Plants: An Analysis of Nucleotide Sequences from the Plastid Gene searched—thereby making expensive MSA calculations on rbcL,” Annals of the Missouri Botanical Garden, vol. 80, no. 3, pp. suboptimal trees [24]. A possibly more efficient alternative to 528–580, 1993. POY’s approach is feedback alignment [25], which performs [11] J. Felsenstein, “PHYLIP (Phylogeny Inference Package) version 3.67,” Distributed by the author. Department of Genome Sciences, University occasional realignment of the taxa based on the best tree of Washington, Seattle, 2007. found so far. The sample code below shows the simplicity [12] W. C. Wheeler and D. S. Gladstein, “Poy: The optimization of implementing feedback alignment in PsodaScript. of alignment characters,” American Museum of Natural His- tory, New York, 2000, program and documentation available at ftp://ftp.amnh.org/pub/molecular/poy/. Sample 6 A Possible Implementation of Feedback Alignment [13] D. L. Swofford, “PAUP* Phylogenetic analysis using parsimony (*and other methods), 4.0b4a.” Sinauer Associates, Sunderland, Massachusetts, hsearch (start=stepwise, nreps=5); 1998. [14] J. E. Stajich et al., “The Bioperl Toolkit: Perl Modules for the Life Sciences,” Genome Research, vol. 12, no. 10, pp. 1611–1618, October while (true) 2002. hsearch (start=current); [15] P. Goloboff, S. Farris, and K. Nixon, “TNT: Tree Analysis Using New align (guidetree=best); Technology,” 1999, http://www.cladistics.com. endwhile; [16] U. W. Roshan, B. M. E. Moret, T. Warnow, and T. L. Williams, “Rec- I-DCM3: A Fast Algorithmic Technique for Reconstructing Large Phy- logenetic Trees,” in Computational Systems Bioinformatics Conference, 2004. CSB 2004. Proceedings., IEEE. IEEE, August 2004, pp. 98–109. [17] R. Meier and F. B. Ali, “The newest kid on the parsimony block: TNT (Tree analysis using new technology),” Systematic Entomology, vol. 30, ONCLUSION V. C no. 1, pp. 179–182, 2005. The immensity of phylogenetic tree space for interesting [18] K. C. Nixon, “The Parsimony Ratchet, a New Method for Rapid Parsimony Analysis,” Cladistics, vol. 15, no. 4, pp. 407–414, 1999. data sets demands that researchers rely on heuristic searches [19] P. A. Goloboff, “Analyzing Large Data Sets in Reasonable Times: to infer phylogenies. By designing meta-searches which ap- Solutions for Composite Optima,” Cladistics, vol. 15, no. 415–428, propriately combine various heuristics and parameter settings, 1999. [20] D. R. Maddison, D. L. Swofford, and W. P. Maddison, “Nexus: An phylogenists greatly improve the practicability of using in- extensible file format for systematic information,” Systematic Biology, ferred phylogenetic trees to solve problems. The internal vol. 46, no. 4, pp. 590–621, December 1997. scripting abilities of the open-source PSODA project give [21] D. S. Sikes and P. O. Lewis, PAUPRat: a tool to implement parsimony ratchet searches using PAUP*, 2001, download site: researchers the flexibility to better explore and exploit the http://www.ucalgary.ca/ dsikes/software2.htm. realm of phylogenetic meta-searching while addressing several [22] J. D. Thompson, D. G. Higgins, and T. J. Gibson, “ w: improving limitations of previous phylogenetic applications. the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice,” Nucleic Acids Research, vol. 22, no. 22, pp. 4673–4680, 1994. ACKNOWLEDGMENTS [23] D. A. Morrison and J. T. Ellis, “Effects of nucleotide sequence align- ment on phylogeny estimation: A case study,” Molecular Biology and This material is based upon work supported by the National Evolution, vol. 14, pp. 428–441, 1997. Science Foundation under Grant No. 0120718. [24] W. Wheeler, L. Aagesen, C. P. Arango, J. Faivovich, T. Grant, C. D’Haese, D. Janies, W. L. Smith, A. Varon,´ and G. Giribet, Dynamic REFERENCES Homology and Phylogenetic Systematics: A Unified Approach Using POY. American Museum of Natural History, 2006, ch. Overcoming [1] J. Felesenstein, Inferring Phylogenies. Sinauer Associates, Inc., 2004, Limitations in Phylogenetics, p. 34, published in cooperation with NASA ch. Preface. Fundamental Space Biology. [2] A. O. Mooers and S. B. Heard, “Inferring Evolutionary Process from [25] P. G. Ridge, H. D. Carroll, D. Sneddon, M. J. Clement, and Q. O. Snell, Phylogenetic Tree Shape,” The Quarterly Review of Biology, vol. 72, “Large Grain Size Stochastic Optimization Alignment,” in Proceedings no. 1, pp. 31–54, 1997. of the Sixth IEEE Symposium on BionInformatics and BioEngineering [3] A. Rambaut, D. Posada, K. A. Crandall, and E. C. Holmes, “The causes (BIBE’06). IEEE Computer Society Washington, DC, USA, 2006, pp. and consequences of HIV evolution,” Nature Reviews Genetics, vol. 5, 127–134. pp. 52–61, 2004. [4] J. A. Eisen, “A Phylogenomic study of the MutS family of proteins,” Nucleic Acids Research, vol. 26, no. 18, pp. 4291–4300, 1998.