Analysis and Predictions of DNA Sequence Transformations on Grids

A Thesis Submitted for the Degree of Master of Science (Engineering) in the Faculty of Engineering

By Yadnyesh R. Joshi

Supercomputer Education and Research Centre INDIAN INSTITUTE OF SCIENCE BANGALORE – 560 012, INDIA

August 2007

Acknowledgments

First of all, I would like to extend my sincere thanks to my research supervisor Dr. Sathish Vadhiyar for his constant guidance and support during the entire period of my post-graduation at IISc. He was always approachable, supportive and ready to help with any sort of problem. I am very thankful to him for being extremely patient and understanding about the silly mistakes that I made. Under his guidance I learned to approach problems in an organized manner and set realistic goals for my research. I thank him for his excellent technical guidance in writing and presenting research. Finally, he was and continues to be my role model for his hard work and passion for research. I am also thankful to Dr. Nagasuma Chandra and Dr. Debnath Pal from S.E.R.C. and Dr. Narendra Dixit from the Chemical Engineering department for their very useful and interesting insights into the biological domain of our research. I am also thankful to all the faculty of S.E.R.C. for always inspiring us with their motivational talks. I would like to mention the names of my colleagues Sandip, Sanjay, Rakhi, Sundari, Antoine and Roshan for their technical and emotional support. Special thanks to the vatyaa kya group members for the adventures and the routines inside and outside the institute. I would also like to thank the Marathi Mandal for making the institute a homely place. Back home, I would like to thank my parents for being my pillars of strength. I would also like to thank Yamini tai and Dhanashree, my sisters, for supporting and guiding me in making important decisions. I would like to thank my friends, Vijay, Vishwanath, Pushkaraj, Prashant, Akshay, Sunder, Anusha and Neha, for always being there for me. Last but not least, I am very thankful to my grandfather Laxman Rao for being the strongest motivator all the time.

Abstract

Phylogenetics is the study of evolution of organisms. Evolution occurs due to mutations of DNA sequences. The reasons behind these seemingly random mutations are largely unknown. There are many algorithms that build phylogenetic trees from DNA sequences. However, there are certain uncertainties associated with these phylogenetic trees. Fine-level analysis of these phylogenetic trees is both important and interesting for evolutionary biologists. In this thesis, we try to model the evolution of DNA sequences using Cellular Automata and resolve the uncertainties associated with the phylogenetic trees. In particular, we determine the effect of neighboring DNA base-pairs on the mutation of a base-pair. A cellular automaton can be viewed as an array of cells which modifies itself in discrete time-steps according to a governing rule. The state of a cell at the next time-step depends on its current state and the states of its neighbors. We have used cellular automata rules for analysis and predictions of DNA sequence transformations on computational grids.

In the first part of the thesis, DNA sequence evolution is modeled as a cellular automaton with each cell having one of four possible states, corresponding to the four bases. Phylogenetic trees are explored in order to find the cellular automata rules that may have guided the evolutions. The master-client paradigm is used to exploit the parallelism in the sequence transformation analysis. Load balancing and fault-tolerance techniques are developed to enable the execution of the explorations on grid resources. The analysis of the sequence transformations is used to resolve uncertainties associated with the phylogenetic trees, namely, the intermediate sequences in the phylogenetic tree and the exact number of time-steps required for the evolution of a branch. The model is further used to collect various statistics, such as the most popular rules at a particular time-step in the evolution history of a branch in a phylogenetic tree. We have observed some interesting statistics regarding the unknown base pairs in the intermediate sequences of the phylogenetic tree and the most popular rules used for sequence transformations.

The next part of the thesis deals with predictions of future sequences using the previous sequences. First, we try to find the preserved sequences so that cellular automata rules can be applied selectively. Then, random strategies are developed as base benchmarks. A roulette wheel strategy is used for predicting future DNA sequences. Though the prediction strategies are able to better the random benchmarks in most of the cases, the average performance improvement over the random strategies is not significant. The possible reasons are discussed.

Contents

List of Figures

List of Tables

1 Introduction
  1.1 Cellular Automata
  1.2 DNA and Cellular Automata
  1.3 Phylogenetics
  1.4 Grid Computing
  1.5 Motivation and Problem Formulation

2 Related Work
  2.1 DNA Sequence Evolution
  2.2 Grid Computing Applications
    2.2.1 Applications in Mathematics and Earth Sciences
    2.2.2 Applications in Astronomy, Physics and Chemistry
    2.2.3 Applications in Biology and Bioinformatics

3 Sequence Transformation on Grids
  3.1 Sequence Transformation on a Branch
    3.1.1 Naive Approach
    3.1.2 Selective Application of Cellular Automata Rules
    3.1.3 Dynamic Formation of Cellular Automata Rules
  3.2 Sequence Transformer
  3.3 Pseudo Molecular Clock Assumption
  3.4 Design
    3.4.1 Master-Worker Paradigm
    3.4.2 Phases of Execution
  3.5 Grid Computing Techniques
    3.5.1 Load Balancing
    3.5.2 Fault Tolerance
  3.6 Database Design
  3.7 Statistics Collection
    3.7.1 Timesteps
    3.7.2 Unknown base-pairs
    3.7.3 Rules
    3.7.4 Differential rule analysis
    3.7.5 Popularity of transitions

4 Experiments and Results
  4.1 Grid Infrastructure
  4.2 Timesteps
  4.3 Popular Rules
  4.4 Base Pairs Corresponding to Unknown Positions
  4.5 Potential of Grid Computing

5 Predictions in phylogenetic trees
  5.1 Determining the Preserved Segments
    5.1.1 Calculation of PSSM
    5.1.2 Strategies for Determining Preserved Sequences
    5.1.3 Evaluation of Strategies
    5.1.4 Determination of Threshold Values for Flexible Strategies
  5.2 Analysis of Random Strategies
  5.3 Methods Used for Prediction
    5.3.1 Roulette Wheel Method
    5.3.2 Roulette Wheel Method with Random Component
    5.3.3 History Sizes
    5.3.4 Experiments and Results
  5.4 Analysis

6 Conclusions and Future work
  6.1 Conclusions
  6.2 Future Work

References

List of Figures

1.1 Evolution of Cellular Automata through time steps
1.2 Rule that governs the evolution of cellular automata shown in Figure 1.1
1.3 Double helix structure of DNA (Courtesy: U.S. National Library of Medicine)
1.4 Example Phylogenetic Tree with Gag Sequences

3.1 Application of Random Cellular Automata Rules
3.2 Selective Application of Cellular Automata Rules
3.3 Example: Dynamic Formation of Cellular Automata Rules
3.4 Dynamic Formation and Selective Application of Cellular Automata Rules
3.5 Illustration of the Greedy Algorithm
3.6 The Master-Worker Design
3.7 Phase I in Master
3.8 Phase II in Master

5.1 Analysis of threshold values: Flexible-1
5.2 Analysis of threshold values: Flexible-2
5.3 Analysis of random strategies

List of Algorithms

1 Algorithm for Sequence Transformer
2 Greedy Algorithm for Chain Formation
3 Calculation of Position Specific Scoring Matrix

List of Tables

1.1 Left-Hand Sides of 64 Transitions of Cellular Automata with Neighborhood Size of 1

3.1 strands
3.2 working strand
3.3 branches
3.4 ruletable
3.5 chains

4.1 The Distributed Infrastructure
4.2 Summary of time step information for Gag sequences
4.3 Summary of time step information for GagPol sequences
4.4 Summary of time step information for env sequences
4.5 Differential Rule Analysis for Gag Sequences
4.6 Differential Rule Analysis for GagPol Sequences
4.7 Differential Rule Analysis for env Sequences
4.8 Popular Rules for a Branch for Gag Sequences
4.9 Popular Rules for a Branch for GagPol Sequences
4.10 Popular Rules for a Branch for env Sequences
4.11 Resolution of Unknown Positions for Gag Sequences
4.12 Resolution of Unknown Positions for GagPol Sequences
4.13 Resolution of Unknown Positions for env Sequences
4.14 Usefulness of Large Number of Runs

Chapter 1

Introduction

In this chapter, we give a brief background on cellular automata, the relationship between cellular automata and DNA evolution, and the concept of phylogenetic trees.

1.1 Cellular Automata

A cellular automaton is a regular array of identical finite state automata where the next states of the array elements are determined solely by their current states and the states of their neighbors. A one-dimensional cellular automaton consists of a line of cells, each having a particular state. The state of each of these cells changes over discrete time-steps. At every time-step, there is a definite rule that determines the next state of a given cell based on the current state of the cell and its neighboring cells[36].

As an example, the evolution of a one-dimensional cellular automaton is shown in Figure 1.1. Each cell can have one of two possible states, 0 or 1. In this example, the neighborhood size is 1, i.e. the state of each cell at the next time step depends on the state of that cell, the state of its single left neighbor and the state of its single right neighbor. The rule that governs this evolution is depicted in Figure 1.2.

Time step   Cells
0           0 1 0 1 0 0 0
1           1 1 0 1 1 0 0
2           0 0 0 0 0 1 1

Figure 1.1: Evolution of Cellular Automata through time steps

Current states (left, cell, right)   000  001  010  011  100  101  110  111
Next state of the cell                0    1    1    0    1    0    0    1

Figure 1.2: Rule that governs the evolution of cellular automata shown in Figure 1.1

In Figure 1.1, the evolution of the cellular automaton is shown for two time-steps. The 0th time-step corresponds to the original configuration of the cellular automaton. Let us consider the second cell in this original configuration. The current state of this cell is 1. The state of both of its neighbors (left and right) is 0, forming the triplet 010. As can be seen in Figure 1.2, the transition corresponding to this state (010) tells us that the next state of this particular cell should be 1, which is reflected in time-step 1. In the same way, all the remaining cells in the array get transformed to their respective next states to form the complete array at time-step 1. The same procedure is repeated to get the state of the array at time-step 2 and so on.

As can be seen, the cellular automata rule consists of eight transitions corresponding to the eight possible left-hand side states. The right-hand side of each of these eight transitions can assume one of two values, 0 or 1. Thus, the number of rules possible for this particular cellular automaton is $2^8 = 256$. These 256 CAs are generally referred to using Wolfram notation, a standard naming convention invented by Wolfram[36]. The name of a CA is the decimal number which, in binary, gives the rule table, with the eight possible states listed in reverse counting order, i.e. 111, 110, ..., 000. Thus, the rule in Figure 1.2 is the "rule 150 CA" (the binary representation of 150 is 10010110).

A one-dimensional cellular automaton with two states has a rule consisting of $2^{2n+1}$ transitions, where $n$ is the neighborhood size, and the total number of possible rules is $2^{2^{2n+1}}$. In general, a one-dimensional cellular automaton with $P$ states has a rule consisting of $P^{2n+1}$ transitions and the total number of possible rules is $P^{P^{2n+1}}$.

Cellular automata have been used to model physical, economical, and sociological systems[12]. They have replaced partial differential equations in the area of system modeling fairly successfully. Evolution has been modeled as partial differential equations in the work by Smith et al.[26]. This suggests that cellular automata can be potentially powerful tools for modeling molecular evolution and, hence, for analyzing DNA mutations.

Rules found by modeling DNA mutations using cellular automata can give us useful insights about the effect of neighboring base pairs on the evolution of DNA segments.
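The worked example of Figures 1.1 and 1.2 can be sketched in a few lines of Python. This is only an illustration: the names `RULE`, `step`, `evolve` and `rule_counts` are hypothetical, and the wrap-around (periodic) treatment of the two end cells is an assumption, chosen because it reproduces the rows shown in Figure 1.1.

```python
# Rule table from Figure 1.2: (left, cell, right) -> next state of the cell.
RULE = {
    (0, 0, 0): 0, (0, 0, 1): 1, (0, 1, 0): 1, (0, 1, 1): 0,
    (1, 0, 0): 1, (1, 0, 1): 0, (1, 1, 0): 0, (1, 1, 1): 1,
}


def step(cells, rule=RULE):
    """Apply one time-step to a 1-D automaton with neighborhood size 1.

    Assumption: the array wraps around, so the first and last cells
    are treated as neighbors of each other.
    """
    n = len(cells)
    return [
        rule[(cells[(i - 1) % n], cells[i], cells[(i + 1) % n])]
        for i in range(n)
    ]


def evolve(cells, steps, rule=RULE):
    """Return the configurations at time-steps 0..steps."""
    history = [list(cells)]
    for _ in range(steps):
        history.append(step(history[-1], rule))
    return history


def rule_counts(num_states, n):
    """Number of transitions per rule and number of possible rules."""
    transitions = num_states ** (2 * n + 1)
    return transitions, num_states ** transitions


for t, row in enumerate(evolve([0, 1, 0, 1, 0, 0, 0], 2)):
    print(t, row)
print(rule_counts(2, 1))  # transitions and rules for the two-state example
```

Calling `rule_counts(4, 1)` gives the corresponding counts for an automaton with four states per cell.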

1.2 DNA and Cellular Automata

DNA is a nucleic acid that contains the genetic instructions for the development and function of living things. DNA is often compared to the blueprint of a building, as it contains the information for constructing the other components of cells, such as proteins and RNA molecules.

The building blocks of the DNA polymer are nucleotides, which in turn consist of a phosphate group, a sugar ring group and either a purine or a pyrimidine base group. The two possible purines are guanine (G) and adenine (A) and the two possible pyrimidines are thymine (T) and cytosine (C). DNA is a double stranded molecule (see Figure 1.3), where the two strands are connected to each other through hydrogen bonding between a purine on one strand and a pyrimidine on the other or vice versa. Furthermore, adenine (A) is always paired with thymine (T) and guanine (G) is always paired with cytosine (C). Thus, if the sequence of one of the strands of DNA is known, the sequence on the other strand can be easily determined. A sequence of these bases forms a strand or a sequence. Three base-pairs in a DNA sequence form one codon. Each codon corresponds to either one of the 20 amino acids or to a control codon (start codon or end codon). There are 64 ($4^3$) possible combinations for a codon but only 20 amino acids, so more than one codon can map to the same amino acid; that is, the mapping is redundant. This mapping from codons to amino acids is called the "genetic code" and is more or less the same for all organisms. A chain of amino acids forms a protein; proteins are the basic functional blocks of organisms.

We can view a DNA strand as a line of cells with each cell having one of four values (A, G, C or T). This array of cells essentially contains the information stored in DNA. This strand is copied exactly to produce another identical strand in the process of DNA replication.
Sometimes, mutations occur during replication, giving rise to a DNA strand which is different from the original strand. These mutations are the basic cause of evolution.

Figure 1.3: Double helix structure of DNA (Courtesy: U.S. National Library of Medicine)

There are strong indications that mutations of DNA base-pairs are affected by neighboring base-pairs[13, 4, 2, 17, 29]. The exact effect of the neighboring base-pairs on the mutation of an individual base-pair is still unknown. We make an attempt to find this relationship by modeling DNA as a cellular automaton where the DNA mutations are governed by the cellular automata rules. The DNA molecule can be viewed as a one-dimensional cellular automaton with four states per cell, corresponding to the four base-pairs. Thus, the base of this cellular automaton is 4. The number of transitions in a given rule is $4^{2n+1}$, where $n$ is the number of left/right neighbors. The total number of rules that may govern the DNA mutations is thus $4^{4^{2n+1}}$. Even for $n = 1$, the number of possible cellular automata rules is $4^{64}$, which is an astronomical number. Thus, the task of finding rules which could have been followed during evolution requires exploration of a huge search space.

In this work, we consider only those rules with a neighborhood size of 1, i.e. the transition of a base-pair in a DNA sequence during evolution depends on the base-pair and its left and right neighboring base-pairs. The 64 left-hand sides of the transitions corresponding to the neighborhood size of 1 are depicted in Table 1.1. The right-hand side of each transition can be one of the 4 base-pairs, giving rise to $4^{64}$ rules. We also assume that during one evolution step, a single rule is applied to the entire sequence, i.e. transitions in cells i and j of the sequence are governed by the same rule. Although these assumptions do not encapsulate the myriad mechanisms that could have been followed during evolution, the assumptions are reasonable since there is evidence that a base-pair is impacted more by its immediate neighbors than by its more distant neighbors[13, 2].

Table 1.1: Left-Hand Sides of 64 Transitions of Cellular Automata with Neighborhood Size of 1

S.No.  LHS    S.No.  LHS    S.No.  LHS    S.No.  LHS
1.     AAA    17.    AAC    33.    AAG    49.    AAT
2.     CAA    18.    CAC    34.    CAG    50.    CAT
3.     GAA    19.    GAC    35.    GAG    51.    GAT
4.     TAA    20.    TAC    36.    TAG    52.    TAT
5.     ACA    21.    ACC    37.    ACG    53.    ACT
6.     CCA    22.    CCC    38.    CCG    54.    CCT
7.     GCA    23.    GCC    39.    GCG    55.    GCT
8.     TCA    24.    TCC    40.    TCG    56.    TCT
9.     AGA    25.    AGC    41.    AGG    57.    AGT
10.    CGA    26.    CGC    42.    CGG    58.    CGT
11.    GGA    27.    GGC    43.    GGG    59.    GGT
12.    TGA    28.    TGC    44.    TGG    60.    TGT
13.    ATA    29.    ATC    45.    ATG    61.    ATT
14.    CTA    30.    CTC    46.    CTG    62.    CTT
15.    GTA    31.    GTC    47.    GTG    63.    GTT
16.    TTA    32.    TTC    48.    TTG    64.    TTT
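To make the rule structure concrete, the following Python sketch enumerates the 64 left-hand sides in the order of Table 1.1 and applies a single rule to a whole sequence in one evolution step. It is a minimal illustration under stated assumptions, not the implementation used in this work: the names `LHS`, `random_rule` and `apply_rule` are hypothetical, each triplet is read as (left neighbor, base, right neighbor), and the two end bases, which lack a full neighborhood, are simply copied unchanged.

```python
import random

BASES = "ACGT"

# The 64 left-hand sides in the order of Table 1.1: the first letter varies
# fastest, then the middle letter, then the last letter.
LHS = [l + c + r for r in BASES for c in BASES for l in BASES]


def random_rule(rng):
    """Pick a right-hand side for each of the 64 transitions at random.

    Every rule built this way is one of the 4**64 possible rules.
    """
    return {lhs: rng.choice(BASES) for lhs in LHS}


def apply_rule(sequence, rule):
    """Apply one rule to the entire sequence for a single evolution step.

    Assumption: the first and last bases are copied unchanged because
    they have only one neighbor.
    """
    if len(sequence) < 3:
        return sequence
    middle = "".join(
        rule[sequence[i - 1:i + 2]] for i in range(1, len(sequence) - 1)
    )
    return sequence[0] + middle + sequence[-1]


rule = random_rule(random.Random(0))
print(len(LHS), 4 ** len(LHS))   # number of transitions, number of rules
print(apply_rule("ACGTACGT", rule))
```

Because the same dictionary is used for every position, the sketch respects the assumption that a single rule governs all cells during one evolution step.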

1.3 Phylogenetics

In biology, the study of evolutionary relatedness among various groups of organisms (e.g., species, populations) is called phylogenetics. A phylogenetic tree, shown in Figure 1.4, also called an evolutionary tree or a tree of life, is a tree showing the evolutionary interrelationships among various species or other entities that are believed to have a common ancestor. The leaves of the tree represent various organisms, species, or genomic sequences. An internal node of the tree stands for an abstract organism (species, sequence) whose existence is presumed and whose evolution led to the organisms in the leaves.

Figure 1.4: Example Phylogenetic Tree with Gag Sequences

Evolution is visualized with the help of phylogenetic trees corresponding to a set of organisms. Phylogenetic trees give a picture of relatedness between various organisms. A rooted phylogenetic tree has a root that corresponds to the most recent common ancestor of all the sequences under consideration. A branch in a rooted phylogenetic tree connecting an ancestor and a progeny indicates that one sequence (the progeny) evolved from the other (the ancestor). Unrooted trees illustrate the relatedness of the leaf sequences without making assumptions about ancestry.

Various efforts have been made to construct phylogenetic trees for a given set of DNA sequences[19, 33]. However, there are certain uncertainties associated with these phylogenetic trees. The reconstruction of the sequences corresponding to intermediate nodes is not complete in the phylogenetic trees. There are several positions in the intermediate sequences where the exact base-pairs are not known. The number of time steps required for the mutations to occur is also not known explicitly. Finally, these trees do not provide any indication of the rules that may have been followed during the evolution of the different sequences in the given tree.
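The structure described above can be pictured with a small Python sketch of a rooted tree whose intermediate sequences contain unresolved positions. Everything here is illustrative: the `Node` class, the helper names and the use of `?` to mark an unknown base-pair are assumptions made for the example, not the representation used by any phylogenetic package.

```python
from dataclasses import dataclass, field

UNKNOWN = "?"  # hypothetical marker for a position whose base-pair is unknown


@dataclass
class Node:
    name: str
    sequence: str
    children: list = field(default_factory=list)


def branches(node):
    """Yield every (ancestor, progeny) pair of a rooted tree."""
    for child in node.children:
        yield node, child
        yield from branches(child)


def unknown_positions(node):
    """Indices of the positions that are not resolved in this sequence."""
    return [i for i, base in enumerate(node.sequence) if base == UNKNOWN]


# A toy tree: two hypothetical intermediate nodes and three leaf sequences.
inner = Node("inner", "A??T", [Node("leaf2", "ACCT"), Node("leaf3", "AGGT")])
root = Node("root", "AC?T", [Node("leaf1", "ACGT"), inner])

for ancestor, progeny in branches(root):
    print(ancestor.name, "->", progeny.name)
print(unknown_positions(inner))
```

Each pair produced by `branches` corresponds to one branch along which a sequence transformation would be analyzed.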

1.4 Grid Computing

Grid Computing involves the collection of computational resources in order to solve problems of large magnitude. These computational resources may be diverse in terms of their computing power or architecture and may be under different administrative domains. These resources may be shared, resulting in resource dynamics in the system. They may be spread over a large geographical area. Grid computing seamlessly organizes these resources in order to solve the problem. More importantly, Grid Computing can utilize unused computing power in order to give a low-cost solution to problems which are otherwise solved on an expensive single high-performance system.

Though grids can have many possible architectures, one of the main features of grids is that many computing resources¹ are connected to each other through a low-bandwidth and highly shared network. Hence, communication is almost always a performance bottleneck for a grid (when compared with a single high-performance system). Applications which involve little communication, or which can be decomposed into tasks that involve little or no communication between them, are therefore suitable for grids. Often, popular applications in Grid Computing have a client-server architecture where the client contacts the server for some information and the actual processing of the information is done at the client side.

The problem considered in this thesis has client-server characteristics and hence is suitable for Grid Computing. The entire problem can be divided into different computational tasks. These tasks can be assigned to any number of available computing resources over the grid. Modeling DNA sequence mutations or transformations using cellular automata where each cell can assume one of 4 possible states requires exploration of a huge search space and needs large amounts of computing cycles. Grid computing has previously been used successfully in order to solve problems of very large magnitude.

Grid computing has been found to be useful in fields that vary from Mathematics to Biology and Medicine. In the field of mathematics, Grids have been used to solve satisfiability problems[7] and to find primes of the form $k \cdot 2^n - 1$. In the field of Earth Sciences, Grid Computing is being used to produce a forecast of the climate in the 21st century in the ClimatePrediction.Net[8] project. In physics, projects such as SETI@Home[28], µFluids[18] and Einstein@Home[10] are making use of unused processor cycles at various desktops in order to achieve high computing power. In the area of Bioinformatics, SIMAP[30], Rosetta@home[25] and Predictor@Home[22] use the power of Grid computing to analyze the structures and functions of proteins. Evolution of arthropods has been extensively studied using Grid Computing by Stewart et al.[33].

In these applications, vast and ever-expanding grid resources have been used successfully for parallel exploration and analysis of different parameter values. The particular parameters that are of interest in our work on DNA sequence evolution are the cellular automata rules depicting the effects of neighboring base-pairs on evolution, the unknown bases in the intermediate sequences and the exact number of time steps involved in the evolution of one sequence to another.

¹ The computing resources here can be personal computers or clusters of servers or high performance
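The decomposition into independent tasks can be illustrated with a small Python sketch in which a master hands out batches of candidate rules and only collects one short result per task. This is a toy under loud assumptions, not the system described in this thesis: a local thread pool stands in for grid resources, candidate rules are drawn at random, end bases are left unchanged, and the Hamming distance to the progeny sequence is used as an illustrative score. All function names are hypothetical.

```python
import random
from concurrent.futures import ThreadPoolExecutor

BASES = "ACGT"
LHS = [l + c + r for l in BASES for c in BASES for r in BASES]


def apply_rule(seq, rule):
    # End bases are kept unchanged (an assumption of this sketch).
    middle = "".join(rule[seq[i - 1:i + 2]] for i in range(1, len(seq) - 1))
    return seq[0] + middle + seq[-1]


def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def worker(task):
    """Explore `trials` random rules and report only the best one found."""
    seed, ancestor, progeny, trials = task
    rng = random.Random(seed)
    best = None
    for _ in range(trials):
        rule = {lhs: rng.choice(BASES) for lhs in LHS}
        score = hamming(apply_rule(ancestor, rule), progeny)
        if best is None or score < best[0]:
            best = (score, rule)
    return best


def master(ancestor, progeny, num_tasks=8, trials=200, max_workers=4):
    """Create independent tasks, farm them out and keep the best result."""
    tasks = [(seed, ancestor, progeny, trials) for seed in range(num_tasks)]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(worker, tasks))
    return min(results, key=lambda result: result[0])


score, rule = master("ACGTACGTAC", "ACGAACGTAC")
print("best distance to progeny:", score)
```

The tasks never exchange data with each other, and each one returns a single small result, which is the low-communication property that makes this kind of search a good fit for a grid.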

1.5 Motivation and Problem Formulation

The study of the evolution of different species or organisms is important to biologists since it has many practical applications, including drug discovery, population monitoring and management[11]. The availability of DNA sequence databases[27] in the last few decades has enabled the study of evolution at the molecular level. There have been many studies on the evolution of species using DNA sequences[13, 26, 4, 2, 14]. In terms of molecular biology, evolution can be viewed as a mutation event in which a particular DNA segment of the organism undergoes some change. During evolution, a DNA segment consisting of a sequence of purines and pyrimidines (also called base-pairs) changes to a different sequence of base-pairs.

While the effects of some neighborhood base-pairs on the evolution of a DNA segment are known, there has been very little work[5, 31], to our knowledge, that comprehensively analyzes the effects of different neighborhoods on evolution. The exact effect of base-pairs on the mutation is still unknown and remains a challenge for evolutionary biologists. While phylogenetic trees constructed using existing packages[20] give an overall picture of the relationships, these packages do not give fine-level details of the way evolution might have progressed. The trees do not give the exact number of time steps required for mutation and also do not give any indication of the effect of neighboring base pairs on mutations. These packages, while constructing the phylogenetic tree, produce incomplete hypothetical sequences for intermediate nodes. The exact base-pairs at some positions in the intermediate sequences are not known.

Our work tries to resolve the uncertainties associated with the phylogenetic trees. In particular, our work tries to determine the rules for neighborhood-based mutations that may have been followed during the evolution of sequences. In addition, our work also tries to resolve the uncertainties related to the number of time steps and the unknown base-pairs in the intermediate sequences of the phylogenetic trees. We use the vast number of resources available in computational grids to perform parameter searches associated with the phylogenetic trees with the intention of narrowing the ranges of the parameters. In this work, we model DNA sequence mutations using cellular automata to find rules for neighborhood-based mutations for a particular phylogenetic tree on computational grids. By parallel guided exploration of a large number of cellular automata rules on grid resources, we also attempt to resolve uncertainties associated with the phylogenetic tree, namely, finding unknown base-pairs in intermediate sequences and the number of time steps for evolution. This analysis of mutations using cellular automata rules can be utilized by evolutionary biologists to better predict mutations of DNA sequences.

Thus, formally, the problem can be stated as follows: to develop solutions based on cellular automata to restrict and/or resolve the uncertainties associated with phylogenetic trees by parallel guided exploration of a vast number of cellular automata rules on different resources of computational Grids. This cellular automata model of DNA gives us various important statistics about the phylogenetic tree. We then attempt to use these statistics to predict future DNA sequences of the phylogenetic tree. The methods used for prediction depend on calculating preserved sequences using a position specific scoring matrix and on the statistics collected about the phylogenetic tree during the analysis.

The rest of the thesis is organized as follows. In chapter 2, we look at the existing works that are relevant to the problem. In particular, we look at the works that suggest an effect of neighborhood bases on the mutation of a DNA strand. In chapter 3, we look at the design of the Cellular Automata model of DNA mutations. We also see how this design can be used to resolve the uncertainties associated with phylogenetic trees. Chapter 4 describes the various experiments we have performed and their corresponding results. We also look at the interesting statistics that we have collected. In chapter 5, we try to extend the cellular automata model in order to predict future sequences. Chapter 6 concludes the thesis and gives directions for future work.

Chapter 2

Related Work

In this chapter, we look at some of the efforts that have investigated DNA sequence evolution by taking context dependency into account. We then look at the efforts that illustrate how Grid Computing can help large applications.

2.1 DNA Sequence Evolution

There have been a number of studies on the evolution of DNA sequences[14, 4, 13, 2, 17, 29, 31, 33]. The work by Korber et al.[14] studies the evolution of HIV sequences using the molecular clock assumption; this hypothesis postulates that molecular change is a linear function of time and that substitutions accumulate according to a Poisson distribution. HIV-1 sequences were analyzed to estimate the timing of the ancestral sequence of the main group of HIV-1, the strains responsible for the AIDS pandemic. Using parallel supercomputers and assuming a constant rate of evolution, maximum-likelihood phylogenetic methods were applied to unprecedented amounts of data for this calculation. Results were validated by correctly estimating the timing of two historically documented points.

There are a number of studies that analyze the impact of neighboring bases on the mutation of a particular base. The work by Bulmer[4] finds that there is a marked increase in the frequency of transitions from the doublet CG. There are also some smaller effects of neighboring bases on the frequencies of transitions from adenine and thymine. The work also determines that the transition frequency from either of these bases is reduced by having G on the right (or C on the left) and increased by having T on the right (or A on the left).

Hess[13] also concludes that substitution rates, representing averages over those for different regions of the genome, are distributed over a 60-fold range with strong biases in particular neighbor-pair environments. The study indicates that substitution rates vary for the same base-pair in different neighbor-pair environments. It found that, in general, the rates are fastest in alternating purine-pyrimidine sequences and slowest in purine-pyrimidine tracts. This clearly indicates that the mutation rates are affected by neighboring base pairs.

Arndt et al.[2] introduce a model of DNA sequence evolution which can account for biases in mutation rates that depend on the identity of the neighboring bases. They have developed an analytical model of evolution by adopting the methods of non-linear dynamics. They conclude that phylogenetic analysis should be extended to include neighbor-dependent effects.

All the above efforts clearly indicate that neighboring bases have some effect on the mutation of a particular base. But none of these studies analyze the fine-grain effects of neighboring bases during each step of evolution. A cellular automata model can exploit this neighborhood dependency during DNA sequence evolution.

Morton et al.[17] have analyzed 1776 aligned SNP sequences generated from nuclear genes of maize to study the effect of neighborhood compositions on mutation dynamics. Their studies have found that the A+T content of flanking nucleotides has an influence on various aspects of mutation dynamics. Overall, the polarized SNP data yielded a G and C nucleotide mutation rate (the GC rate) that is 1.6 times the rate of mutation for A and T nucleotides (the AT rate). The sequences used in their study for the analysis were pre-generated, while the sequences in our study are dynamically generated by changes in the states of the cellular automata.

Siepel and Haussler[29] incorporate context-dependence in phylogenetic models to improve the quality of phylogenetic trees. Thus the motivation of their work is similar to ours. Their work indicates that the patterns of context-dependent substitutions are complex in both coding and noncoding regions. They build models of different orders, and higher-order models produce better results than lower-order models. Third-order models suggest that important context effects occur at the level of nucleotide triplets. Their work focuses on using their improved context-dependent phylogenetic models to estimate the pattern and rates of substitutions on the branches of a given phylogenetic tree. Their work also reports results on better estimates of branch lengths. In addition, their work has the potential to refine phylogenetic tree construction. They have estimated substitution rates and context effects for 160,000 noncoding sites and 3 million sites in coding regions in mammalian genomes. Our work based on cellular automata tries to determine finer-grained context-dependent effects on individual steps of mutations.

DNA evolution has been modeled as Cellular Automata in the work by Sirakoulis et al.[31]. In this work, the application of a cellular automata rule to a DNA strand is treated as a matrix multiplication modulo 4. This strategy, however, cannot consider all possible cellular automata rules. The neighborhood effects for mutations are less clear after the modulo 4 operation is performed. This tool was created to visualize the evolution according to a particular cellular automata rule.

The work by Stewart et al.[33] prepared a global grid for studying arthropod evolution. The effort implemented, on a global grid, a parallel version of the fastDNAml[19] algorithm, which uses a maximum likelihood approach to construct better phylogenetic trees. While these efforts deal with constructing a better phylogenetic tree, we try to refine the phylogenetic tree using grid computing technologies. We also try to find additional information about these trees by modeling the DNA sequence evolution as cellular automata rules.

2.2 Grid Computing Applications

Grid Computing has been found useful in fields that vary from Mathematics to Biology and Medicine. In the following subsections, we take a brief look at some selected applications where Grid Computing has helped produce results that require large amounts of resources.

2.2.1 Applications in Mathematics and Earth Sciences

ClimatePrediction.Net[8] is an example where Grid Computing has been found useful to solve a highly computation-intensive problem. The project is the largest experiment to try to produce a forecast of the climate in the 21st century. GridSAT[7] is another example where Grid Computing made it possible to find answers to previously unsolved satisfiability problems. This application utilizes a master-client model along with effective rescheduling techniques to make use of idle workstations to solve difficult problems. The solution involves breaking the big problem into smaller subproblems at every step. These subproblems are then distributed among the available clients based on their capacity. ABC@Home[1] aims at finding sets of three integers a, b, c such that a + b = c, a < b < c, a, b, c have no common divisors and c > rad(abc), where rad(abc) is the product of the distinct primes in abc. The aim of the SZTAKI Desktop Grid project[34] is to find all the generalized binary number systems up to dimension 11. The Riesel Sieve project[24] tries to find primes of the form k · 2^n − 1. PrimeGrid[23] is generating a public database of sequential prime numbers and searching for twin primes of the form k · 2^n − 1, k · 2^n + 1.

2.2.2 Applications in Astronomy, Physics and Chemistry

SETI@Home[28] is a typical example in which unused processor cycles all over the world are utilized to search for signs of intelligent signals from space. The µFluids project[18] is a massively distributed computer simulation of two-phase fluid behavior in microgravity and microfluidics problems. Einstein@Home[10] is a program that uses the idle time of personal computers to search for spinning neutron stars (also called pulsars). The goal of Spinhenge@home[32] is to study molecular magnets and controlled nanoscale magnetism with the help of Grid Computing. Quantum Monte Carlo@Home[6] studies the structure and reactivity of molecules using quantum chemistry. LHC@home[15] supports the Large Hadron Collider (LHC), a particle accelerator being built at CERN, the European Organization for Nuclear Research. These examples use the master-worker paradigm and illustrate the power of Grid Computing to solve problems that require large amounts of computing power.

2.2.3 Applications in Biology and Bioinformatics

Rosetta@home[25] aims to determine the 3-dimensional shapes of proteins. Project SIMAP aims to calculate similarities between proteins. SIMAP[30] provides a public database of the resulting data, which plays a key role in many bioinformatics research projects. Predictor@Home[22] attempts to predict the folded, functioning form of a protein. Predicting the structure of an unknown protein is a critical problem in enabling structure-based drug design to treat new and existing diseases. Malariacontrol.net[16] runs simulation models of the transmission dynamics and health effects of malaria, which are an important tool for malaria control. They can be used to determine optimal strategies for delivering mosquito nets, chemotherapy, or new vaccines which are currently under development and testing. Such modeling is extremely computation intensive, requiring simulations of large human populations with a diverse set of parameters related to biological and social factors that influence the distribution of the disease. The goal of [35] is to further critical non-profit research on some of humanity's most pressing problems by creating the world's largest grid. Research includes HIV/AIDS, cancer, muscular dystrophy, dengue fever, and many more.

The work by Stewart et al.[33] also made use of the potential of Grid Computing for the analysis of the evolution of arthropods, research which otherwise would not have been undertaken. In this case, the master-worker paradigm helped to achieve coarse-grain and fine-grain parallelism with fault tolerance techniques. The Biological General Repository for Interaction Datasets (BioGRID)[3] is a curated biological database of protein-protein interactions. It strives to provide a comprehensive resource of protein-protein interactions for all major species while attempting to remove redundancy to create a single mapping of protein interactions.

These applications have been helped by the large computing resources made available by Grid infrastructures. Most of these applications exploit parallelism using a master-client architecture in which many clients perform, in parallel, the computation assigned by the master. We also use a master-client approach along with Grid Computing to model DNA mutations using cellular automata rules.

Chapter 3

Sequence Transformation on Grids

DNA mutations can be modeled using cellular automata since DNA mutations are related to the neighboring base pairs. To study these mutations, we make use of phylogenetic trees. For each branch of the phylogenetic tree, we try to transform the ancestor sequence to the progeny sequence using cellular automata rules. In this chapter, we describe the techniques used for the transformation of sequences from an ancestor to a progeny, a program for performing the transformations, and our assumptions regarding evolutionary rates, which keep the number of successful transformations manageable. Further, we also see how these techniques are realized on a grid using some grid computing techniques.

3.1 Sequence Transformation on a Branch

We use Phylip[20] to construct phylogenetic trees. DNA sequences were downloaded from the HIV Sequence database at Los Alamos[27]. The sequences were aligned using the ClustalW web interface[9]. The aligned sequences were then input to Phylip to obtain a phylogenetic tree for a given set of sequences. The 'dnamlk' program of the Phylip package was used because it generates the phylogenetic tree assuming a molecular clock. The transition/transversion ratio was kept at 2.0. Empirical base frequencies were used. The option for intermediate sequence generation was kept ON. At the end of the program, a Phylip output file was generated containing the entire tree, the branches with their branch lengths, the intermediate sequences and other debugging information.


Intermediate sequences were output in interleaved format. These sequences were separated from the Phylip output file and stored sequentially. The branches and their corresponding branch lengths were also obtained from the output file. Each branch in a phylogenetic tree corresponds to an ancestor-progeny pair. For each branch, we apply a set of cellular automata rules for transforming the ancestor sequence to the progeny sequence. We compare a sequence produced during the transformation with the progeny sequence using a similarity value metric, defined as the fraction of base pairs in the sequence that match the corresponding base pairs in the progeny sequence. Thus, a similarity value of 1 indicates that the transformation has resulted in the progeny sequence. The following subsections describe the methods used for sequence transformation of one branch in the phylogenetic tree.
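The similarity value can be sketched in a few lines of Python. This is only an illustration: the function name `similarity` and the requirement that both sequences are already aligned to the same length are assumptions made for the example, not details taken from our program.

```python
def similarity(current: str, progeny: str) -> float:
    """Fraction of positions at which `current` matches `progeny`."""
    if len(current) != len(progeny):
        raise ValueError("sequences must be aligned to the same length")
    if not progeny:
        raise ValueError("sequences must be non-empty")
    matches = sum(1 for a, b in zip(current, progeny) if a == b)
    return matches / len(progeny)


if __name__ == "__main__":
    # Two of the nine positions match in this toy pair.
    print(similarity("CGTACGTCT", "ATAAGACCG"))
```

A value of 1.0 is returned only when every position matches, which corresponds to a completed transformation.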

3.1.1 Naive Approach

One naive approach for transformation is to randomly choose a cellular automata rule at each time step and apply the rule to the current sequence. We then monitor the progress of the similarity value over the time steps. This approach can lead to sequences that deviate from the progeny sequence, as illustrated in Figure 3.1. The figure shows the similarity values of the sequences when using the naive approach for an ancestor-progeny branch of a phylogenetic tree constructed for gag sequences of the HIV virus using the Phylip package. The progeny sequence in this branch has accession number X52154 and the ancestor sequence is one of the intermediate sequences generated by the Phylip package.
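A minimal sketch of the naive approach is given below, assuming a neighborhood size of one so that a rule maps each of the 64 base triplets to a new middle base. The periodic (wrap-around) treatment of the two end positions, the seeding and all function names are assumptions made for illustration only.

```python
import itertools
import random

BASES = "ACGT"
TRIPLETS = ["".join(t) for t in itertools.product(BASES, repeat=3)]


def random_rule(rng):
    """A rule maps every triplet to the new value of its middle base."""
    return {t: rng.choice(BASES) for t in TRIPLETS}


def apply_rule(seq, rule):
    """Apply the rule to all positions at once (periodic boundary assumed)."""
    n = len(seq)
    return "".join(
        rule[seq[(i - 1) % n] + seq[i] + seq[(i + 1) % n]] for i in range(n)
    )


def naive_transform(ancestor, progeny, steps, seed=0):
    """Apply a fresh random rule at every time step and record similarity."""
    rng = random.Random(seed)
    current, history = ancestor, []
    for _ in range(steps):
        current = apply_rule(current, random_rule(rng))
        matches = sum(a == b for a, b in zip(current, progeny))
        history.append(matches / len(progeny))
    return current, history
```

Because every position is rewritten at every step, positions that already match the progeny are not protected, which is why the similarity value can wander instead of converging.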

Figure 3.1: Application of Random Cellular Automata Rules (plot of similarity value against timesteps)

3.1.2 Selective Application of Cellular Automata Rules

Figure 3.2: Selective Application of Cellular Automata Rules (plot of similarity value against timesteps)

For successful transformation of an ancestor to a progeny, we can make use of the fact that not all the base-pairs mutate at each time step. Indeed, many subsequences in the ancestor sequence match exactly with the progeny sequence at their corresponding positions. Thus, at a given time step, we apply a random cellular automata rule only to those base-pairs in the current sequence which differ from the corresponding base-pairs in the progeny sequence. This selective application of cellular automata rules helps in the convergence of sequences to a progeny sequence, as shown in Figure 3.2. This figure shows the similarity value for each time step when using selective application of cellular automata rules for the same branch. As seen in the figure, this approach completes the transformation in 141 time steps. This approach can also be biologically justified since most of the successful sequences (which form complete proteins) are less prone to mutations than the others.
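Selective application can be sketched as follows. As before, the neighborhood size of one, the periodic boundary, the step limit `max_steps` and the function names are assumptions chosen for the example rather than details of our implementation.

```python
import itertools
import random

BASES = "ACGT"
TRIPLETS = ["".join(t) for t in itertools.product(BASES, repeat=3)]


def selective_step(current, progeny, rule):
    """Apply the rule only at positions that still differ from the progeny."""
    n = len(current)
    out = []
    for i in range(n):
        if current[i] == progeny[i]:
            out.append(current[i])  # matching base-pairs are left untouched
        else:
            win = current[(i - 1) % n] + current[i] + current[(i + 1) % n]
            out.append(rule[win])
    return "".join(out)


def selective_transform(ancestor, progeny, max_steps=10000, seed=0):
    """Repeat selective steps with a new random rule until the progeny is reached."""
    rng = random.Random(seed)
    current, history = ancestor, []
    while current != progeny and len(history) < max_steps:
        rule = {t: rng.choice(BASES) for t in TRIPLETS}
        current = selective_step(current, progeny, rule)
        matches = sum(a == b for a, b in zip(current, progeny))
        history.append(matches / len(progeny))
    return current, history
```

Since matched positions are never rewritten, the similarity value in `history` cannot decrease from one time step to the next, which is the convergence behaviour described above.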

Current Sequence: C G T A C G T C T

Progeny Sequence: A T A A G A C C G

Rules:
1. CGT → T
2. GTA → A
3. TAC → A
4. ACG → G
5. CGT → A

Figure 3.3: Example: Dynamic Formation of Cellular Automata Rules

3.1.3 Dynamic Formation of Cellular Automata Rules

In dynamic formation of cellular automata rules, we try to dynamically create a cellular automata rule using a sequence obtained during the transformation and the progeny sequence. Figure 3.3 illustrates the dynamic formation of a rule. We try to form a complete rule by forming the individual transitions from the current and progeny sequences. The current sequence, on which the rule is to be applied, forms the left hand side of the transitions. The progeny sequence forms the right hand side of the transitions. For example, in Figure 3.3, the left hand side of the first transition is formed by the first three base-pairs of the current sequence, namely CGT, and the right hand side of the transition is formed by the corresponding base-pair in the progeny sequence, T.

We begin the formation of the rule from the first few base-pairs of the current and the progeny sequences. These base-pairs form a window in the current sequence and map to a single base-pair in the progeny sequence. The size of this window is 2 · n + 1, where n is the neighborhood size used in the cellular automata rule. In our example, the neighborhood size is one, hence the window consists of three base-pairs, including a neighbor on either side of the base-pair under consideration for mutation. We slide this window over the current sequence until the entire sequence is covered or until we find a contradicting transition during the formation of the rule.

A contradicting transition is found when two windows in the current sequence containing exactly the same sub-sequence map to different base-pairs in the corresponding progeny sequence. In our example, transitions 1 and 5 contradict each other as they try to map the same sequence, CGT, to different base-pairs. If a contradicting transition is found, we selectively apply a random cellular automata rule to the current sequence, leading to a new sequence, and repeat the procedure of dynamic rule formation on the new sequence. With dynamic rule formation and selective application of cellular automata rules, the number of time steps required for the transformation between the same ancestor-progeny pair is reduced considerably, as shown in Figure 3.4.

Figure 3.4: Dynamic Formation and Selective Application of Cellular Automata Rules (plot of similarity value against timesteps)

Based on the above principles, we have written a program, the sequence transformer, that uses selective application and dynamic formation of cellular automata rules for transformation on an ancestor-progeny branch. We describe the program in the next section.
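The window-sliding step of dynamic rule formation can be illustrated with the short Python sketch below. The function name `form_rule` and its return convention (the transitions collected so far plus a description of the first contradiction, if any) are assumptions made for this example.

```python
def form_rule(current, progeny, n=1):
    """Slide a window of size 2*n + 1 over `current` and collect transitions.

    Returns (rule, conflict). `conflict` is None when every window maps
    consistently; otherwise it is (window, earlier_target, new_target) for
    the first contradicting transition, and `rule` holds the transitions
    formed before the contradiction.
    """
    rule = {}
    for i in range(n, len(current) - n):
        win = current[i - n:i + n + 1]
        target = progeny[i]
        if win in rule and rule[win] != target:
            return rule, (win, rule[win], target)
        rule[win] = target
    return rule, None


if __name__ == "__main__":
    # Sequences of Figure 3.3: the fifth window repeats CGT with a new target.
    rule, conflict = form_rule("CGTACGTCT", "ATAAGACCG")
    print(rule, conflict)
```

On the sequences of Figure 3.3 the sketch stops at the fifth window, where CGT was previously mapped to T and would now have to map to A.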

3.2 Sequence Transformer

The sequence transformer is a fundamental component of our infrastructure. The pseudo-code is given in Algorithm 1. It takes as input the sequences for an ancestor and a progeny, and produces as output the number of time steps required for the transformation from the ancestor to the progeny and the cellular automata rules applied during the transformation. The array rule_arr maintains the cellular automata rules applied during the transformation and the array rule_change indicates the time step at which each rule was first applied. Thus rule_change[i + 1] − rule_change[i] is the number of time steps for which the same rule, rule_arr[i], is applied. The input variable sametol is a tolerance factor: the number of time steps for which the same rule can be applied without an increase in the similarity value.

Initially, after alignment of the sequences, the base pairs corresponding to some segments of the ancestor sequences are not known. The sequence transformer also fills these unknown segments in the ancestor sequence with random base pairs (line 2). This assignment of unknown segments is also recorded as output of the sequence transformer.

Multiple runs of the sequence transformer on a branch can lead to the evolution of the ancestor to the progeny of the branch. In order to work with a manageable number of solutions, assumptions were made relating the branch lengths of the branches to the time steps. These assumptions are explained in the next section.
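The bookkeeping described above can be sketched in Python as shown below. This is not Algorithm 1: it leaves out dynamic rule formation for brevity, and it assumes that unknown bases are marked with a character outside ACGT (here '?'), that the neighborhood size is one with a periodic boundary, and that a step limit `max_steps` guards the loop. Only the names rule_arr, rule_change and sametol come from the text.

```python
import itertools
import random

BASES = "ACGT"
TRIPLETS = ["".join(t) for t in itertools.product(BASES, repeat=3)]


def similarity(a, b):
    return sum(x == y for x, y in zip(a, b)) / len(b)


def selective_apply(current, progeny, rule):
    """Rewrite only the positions that still differ from the progeny."""
    n = len(current)
    return "".join(
        current[i] if current[i] == progeny[i]
        else rule[current[(i - 1) % n] + current[i] + current[(i + 1) % n]]
        for i in range(n)
    )


def transform(ancestor, progeny, sametol=3, max_steps=100000, seed=0):
    rng = random.Random(seed)
    # Fill unknown positions (any character outside ACGT) with random bases.
    filled = {i: rng.choice(BASES) for i, b in enumerate(ancestor) if b not in BASES}
    current = "".join(filled.get(i, b) for i, b in enumerate(ancestor))
    rule_arr, rule_change = [], []
    steps, stalled, rule = 0, 0, None
    best = similarity(current, progeny)
    while best < 1.0 and steps < max_steps:
        # Switch rules after `sametol` steps without any gain in similarity.
        if rule is None or stalled >= sametol:
            rule = {t: rng.choice(BASES) for t in TRIPLETS}
            rule_arr.append(rule)
            rule_change.append(steps)
            stalled = 0
        current = selective_apply(current, progeny, rule)
        steps += 1
        s = similarity(current, progeny)
        stalled = 0 if s > best else stalled + 1
        best = s
    return steps, rule_arr, rule_change, filled
```

In this sketch rule_change[i + 1] − rule_change[i] gives the number of time steps for which rule_arr[i] was applied, matching the description above, and `filled` records the random assignment of the unknown positions.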

3.3 Pseudo Molecular Clock Assumption

A molecular clock is an assumption of a constant rate of evolution[14], i.e. the rate of evolutionary change of any specified protein is approximately constant over time and over different lineages in the phylogenetic tree. Thus, according to a strict molecular clock, there exists a single rate of mutation $\alpha$ such that

$$b_1 = \alpha \cdot t_1; \quad b_2 = \alpha \cdot t_2; \quad \cdots; \quad b_n = \alpha \cdot t_n \qquad (3.1)$$

where $b_1, b_2, b_3, \ldots, b_n$ are the branch lengths of the $n$ branches in the phylogenetic tree and $t_1, t_2, t_3, \ldots, t_n$ are the time steps taken for mutations on the corresponding branches. Branch length is a measure of the difference between the ancestor and progeny of a branch and is obtained along with the phylogenetic tree from the Phylip package. The time steps are outputs from our sequence transformer program. The strict molecular clock assumption has been used in some phylogenetic inferences[14]. In a non-molecular clock assumption, the mutation rates for the different sequences in the phylogenetic tree are different, i.e.

$$b_1 = \alpha_1 \cdot t_1; \quad b_2 = \alpha_2 \cdot t_2; \quad \cdots; \quad b_n = \alpha_n \cdot t_n \qquad (3.2)$$

In our work, we use a pseudo-molecular clock assumption. According to this, $\alpha_1, \alpha_2, \alpha_3, \ldots, \alpha_n$ are related such that

$$b_1 = \alpha_1 \cdot t_1 < b_2 = \alpha_2 \cdot t_2 < \cdots < b_n = \alpha_n \cdot t_n \qquad (3.3)$$

and

$$t_1 < t_2 < t_3 < \cdots < t_n \qquad (3.4)$$

This assumption is reasonable since the greater the branch length, i.e. the greater the difference between the ancestor and progeny sequences, the more time steps are required to transform the ancestor to the progeny. After many invocations of the sequence transformer for different branches, we accept only those outputs of the sequence transformer that adhere to the pseudo-molecular clock assumption. To find such valid transformations, we use a greedy algorithm.

The greedy algorithm starts with the branches of the phylogenetic tree sorted by their branch lengths. For each branch, a linked list is maintained. Every node in a linked list corresponds to one invocation of the sequence transformer and contains the inputs and outputs of the invocation, including the number of time steps taken for mutations on the branch. The greedy algorithm, shown in Algorithm 2, finds a node, prev_node, in the first linked list having the smallest number of time steps. The prev_node is inserted in a chain. The algorithm then considers the next linked list and finds the node with the smallest time step value greater than the time step value of prev_node. This node now becomes the prev_node and is added to the chain. The entire procedure is then repeated for all linked lists corresponding to all the branches. During this algorithm, a branch whose linked list does not have a node with a time step value greater than that of prev_node may be found. Such branches are not included in the chain. We then try to form another chain which may contain the remaining branches. We repeat this procedure so that a node of every branch is included in some chain.

Branch 1: 11 13 10 14 11 15
Branch 2: 20 25 16 21 17 18 23 20
Branch 3: 25 20 28 21 29
Branch 4: 18 19 17 18 18
Branch 5: 42 40 41 45
Branch 6: 36 35 35 36 38 37
Branch 7: 41 48 51 49 55

Chains[0] = 10, 16, 20, 40, 41 (longest chain); Chains[1] = 17, 35

Figure 3.5: Illustration of the Greedy Algorithm

At the end of this algorithm, we obtain different chains of different lengths. A chain containing all the branches in the phylogenetic tree is called a complete chain. Note that the greedy algorithm is bound to find a complete chain in the current snapshot of time steps for the branches if one exists.

The working of the greedy algorithm is illustrated in Figure 3.5. The figure shows linked lists for 7 branches. The numbers in the nodes of the linked lists represent the time steps. The cells which contain numbers in either bold or italic font are part of a chain. The figure also shows the 2 chains that the greedy algorithm produces on the example. The numbers in bold font indicate one chain and the numbers in italics indicate the other. We can see that there is no complete chain in the current snapshot. The chain marked in bold font is the current longest chain.

Having described the various mechanisms, assumptions and algorithms, in the next section we see how they fit into the overall design of the entire solution.
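The chain formation described in this section can be sketched in Python as follows. This is an illustration rather than Algorithm 2 itself: plain lists stand in for the linked lists, and the function name `greedy_chains` and the (branch index, time steps) output format are assumptions made for the example.

```python
def greedy_chains(timesteps_per_branch):
    """Form chains of strictly increasing time steps.

    `timesteps_per_branch[k]` lists the time steps recorded for branch k,
    with branches already sorted by increasing branch length. Each chain is
    a list of (branch_index, time_steps) pairs.
    """
    remaining = list(range(len(timesteps_per_branch)))
    chains = []
    while remaining:
        chain, skipped, prev = [], [], None
        for b in remaining:
            candidates = [
                t for t in timesteps_per_branch[b] if prev is None or t > prev
            ]
            if candidates:
                prev = min(candidates)  # smallest value above the previous node
                chain.append((b, prev))
            else:
                skipped.append(b)  # left for a later chain
        if not chain:  # only branches without any recorded run remain
            break
        chains.append(chain)
        remaining = skipped
    return chains


if __name__ == "__main__":
    figure_3_5 = [
        [11, 13, 10, 14, 11, 15],
        [20, 25, 16, 21, 17, 18, 23, 20],
        [25, 20, 28, 21, 29],
        [18, 19, 17, 18, 18],
        [42, 40, 41, 45],
        [36, 35, 35, 36, 38, 37],
        [41, 48, 51, 49, 55],
    ]
    for chain in greedy_chains(figure_3_5):
        print([t for _, t in chain])
```

On the data of Figure 3.5 the sketch reproduces the two chains listed there. A chain is complete when its length equals the number of branches.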

3.4 Design

The number of input and output parameters of the sequence transformer leading to the formation of complete chains can be very large, making the problem suitable for grid computing. In order to explore the vast space of parameters and to converge on the most likely values for these parameters, we use the distributed resources of a computational grid. This method is similar to the ensemble method used for climate prediction in ClimatePrediction.Net[8].

Figure 3.6: The Master-Worker Design

The following subsections describe the architecture of the entire design.

3.4.1 Master-Worker Paradigm

The master-worker paradigm is used for invocations of the sequence transformer, formation of complete chains, and insertion of the parameters corresponding to the complete chains into a database. The overall design of our master-worker infrastructure is illustrated in Figure 3.6. The master is responsible for assigning branches to the workers and collecting results from them when they complete their calculations. The master assigns branches to the workers in round-robin fashion. The master, after assigning a branch to a worker, does not wait for the worker to complete its calculations, but proceeds to the next worker. Thus there is parallelism in the calculations by the workers. When a worker completes its calculations, the master is notified of the completion and the worker sends the results back to the master. The master stores the results from the workers in a database. We use a PostgreSQL[21] database for data insertion and querying.

Additionally, the master periodically invokes the greedy algorithm for chain formation with the available snapshot of linked list values for the various branches. If a complete chain can be formed from the current set of linked list values, the master forks another process to insert the parameter values corresponding to the complete chain into the database and to possibly form additional chains from the same set of values. Note that the greedy algorithm gives us only one chain from the available set of parameter values, while there may be many complete chains in a given snapshot of linked list values. To obtain additional chains, the forked process deletes an element from the original complete chain and invokes the greedy algorithm to find another complete chain. This procedure is repeated by deleting different sets of elements from the original complete chain. During this process, whenever a complete chain is formed, it is inserted into the database.

A worker takes a branch from the master, fills the unknown base-pairs randomly and invokes the sequence transformer. When the transformation is complete, it sends the results back to the master. The results consist of the number of time steps required for the transformation, the rules used during the transformation and the assignment of base-pairs to the unknown portions of the ancestor sequence. The worker then waits for a new branch from the master. The number of worker processes is not related to the number of branches. Hence our framework can make use of any number of available grid resources for the execution of worker processes.

Depending upon the state of the master process, two phases can be identified. We look at these phases in the next subsection.

3.4.2 Phases of Execution

In phase I, shown in Figure 3.7, the master continuously gives new branches to the worker processes and collects the results from them. The master initially considers all branches for allocation to workers for invocation of the sequence transformer on the branch. Once the number of branches in the longest chain exceeds 60% of the total number of branches, the master considers only those branches that are not in the longest chain for allocation to workers. Thus, more resources are utilized for the branches that make complete chain formation difficult. To avoid a potentially large difference between the numbers of invocations of the sequence transformer for any two branches, we fix a threshold, allocation_threshold, for the maximum difference between the maximum and minimum numbers of invocations for any two branches in the phylogenetic tree. If this difference exceeds the threshold, the branch with the minimum number of invocations is assigned to the workers until the difference becomes less than the threshold. During phase I, the master also invokes the chain formation algorithm periodically. Once a complete chain is formed, the master initiates phase II.

Figure 3.7: Phase I in Master (flowchart: give branches to workers; accept results from workers; calculate the length of the longest chain; if a complete chain is found, proceed to Phase II, otherwise repeat)

Phase II, shown in Figure 3.8, involves insertion of the complete chains into the database and formation of additional complete chains from the same data. Note that, for phase II to start, phase I must complete. But once phase II has been started, the next round of phase I can be started immediately, ensuring pipelined parallelism between phase I and phase II in the master.
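The branch selection policy of phase I can be summarized by the sketch below. The function name, the data structures and, in particular, the order in which the two conditions are checked are assumptions made for illustration; only the 60% figure and the threshold on the difference in invocations come from the description above.

```python
def branches_to_allocate(invocations, longest_chain, allocation_threshold,
                         fraction=0.6):
    """Choose which branches the master hands out next.

    `invocations` maps each branch to the number of sequence transformer
    runs made so far; `longest_chain` is the set of branches that appear in
    the current longest chain.
    """
    least_run = min(invocations, key=invocations.get)
    gap = max(invocations.values()) - invocations[least_run]
    if gap > allocation_threshold:
        # Let the branch that has fallen behind catch up first.
        return [least_run]
    if len(longest_chain) > fraction * len(invocations):
        # Concentrate on the branches still missing from the longest chain.
        return [b for b in invocations if b not in longest_chain]
    return list(invocations)
```

For example, with five branches and a longest chain containing four of them, only the fifth branch is returned, so all workers are directed at the branch that still prevents a complete chain.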

3.5 Grid Computing Techniques

As discussed in the introduction, the state of the grid resources may be highly dynamic. We use load balancing and fault tolerance techniques to adapt to the resource dynamics in grids.

Figure 3.8: Phase II in Master (flowchart: insert values into the database; delete element(s) from the chain; try to find a new chain; if a complete chain is found, repeat from the insertion step, otherwise quit)

3.5.1 Load Balancing

During phase I, when the number of branches in the longest chain is less than 60% of all the branches, all branches are allocated to the workers for calculations. There is a possibility that a particular branch is always assigned to a slow worker, hampering the progress for that branch. A threshold, loadbalance_threshold, is fixed for the maximum allowed difference between the numbers of invocations of the sequence transformer for any two branches. If the difference exceeds the threshold, the branch with the minimum number of invocations is allocated to two workers during each iteration until the difference becomes less than the threshold. This technique allows uniform progress for all branches irrespective of the different loads on the grid resources where the workers are executing.

Theoretically, any number of phase I and phase II processes can be executing at any given point. However, both phase I and phase II processes consume memory on the machine where the master is executing. Hence, phase I and phase II processes must be started only after verifying that the required amount of memory is available on the master machine. We follow a simplistic approach where only one phase I process and one phase II process may be active at any given point of time. This controls the load on the master machine, allowing efficient execution of the master.

3.5.2 Fault Tolerance

For each worker, the master forks a process that sends the parameters to and receives the results from the worker. The forked process maintains a connection with the worker throughout the calculation of the branch. After the results are sent back by the worker, the forked child process notifies the parent master of the arrival of the results. Hence, even if the worker fails during its execution, only the forked process gets killed and the master is able to continue its execution.

Also, the master forks a new process for phase II operations. The forked phase II process does not need to communicate with the main master process. This achieves not only parallelism but also fault tolerance, since even if the phase II process fails, it does not affect the execution of the main master process. The master process will eventually fork off another phase II process.

Finally, even if the master fails after some time, the results obtained so far are still accessible since they are inserted into the database as and when the complete chains are formed. The user can still collect the statistics for the results from the database irrespective of whether the master is alive. Due to the durability property of databases, we can continue to build upon the previous successful transformations even in the case of failure of the master process.

The master collects the results from the workers and stores them in the database. In the next section, we look at the design of this database.

3.6 Database Design

There are five different tables that are used to maintain the details of the different complete chains. These tables are branches, chains, ruletable, strands and working_strand. These database tables are illustrated in Tables 3.1 to 3.5.

Table 3.1: strands
id       name    length   seq
integer  char[]  integer  char[]

Table 3.2: working_strand
id       no_uncert  pos_uncert  uncert   seq_id
integer  integer    integer[]   char[]   integer

Table strands stores the information about the original strands. All the strands are given a specific id which is later used for referencing that particular strand. The field name stores the name of that particular strand. This is the same name as generated by the Phylip package. The field seq stores the actual sequence, whose length is stored in the field length.

The original strand has many positions where the exact base pair is not known. The worker process fills up these positions and then performs sequence transformation. Thus, each invocation of sequence transformation forms a different sequence corresponding to the different bases filled in at the unknown positions. These sequences are stored in the table working_strand. id is the primary key of the table. no_uncert stores the number of positions where the exact base pair is not known. pos_uncert[] is an integer array that stores the positions of these base pairs. uncert[] is the character array that stores the bases filled in by the workers. seq_id is the foreign key to the table strands, referring to the strand in which these bases are filled in.

The table branches stores the information about the branches in the phylogenetic tree. Each instance or record corresponds to one invocation of the sequence transformer that was included in some complete chain. The id field is the primary key of the table. from_id and to_id are foreign keys referring to the id field of the table working_strand. from_id corresponds to the id of the ancestor sequence in the table working_strand. to_id corresponds to the id of the progeny sequence in the table working_strand. length is the branch length generated by the Phylip package. no_ts is the number of time steps required for the complete transformation. swpt is the switching parameter used. Recollect that a particular rule is applied to the strand until the similarity value does not change for some fixed number of time steps. If the similarity value does not change for swpt time steps, the rule to be applied is changed. seq_no is the foreign key to the field id of ruletable, which stores the information about the rules used in the transformation.

Table 3.3: branches
id       from_id  to_id    length  no_ts    swpt     seq_no
integer  integer  integer  real    integer  integer  integer

Table 3.4: ruletable
id       no_rc    no_ts_rc   rules
integer  integer  integer[]  char[][]

In the table ruletable, id is the primary key. no_rc is an integer storing the number of rules used in the transformation. no_ts_rc[] is the integer array storing the time steps at which the rules were changed. rules[][] is the two-dimensional character array which stores the actual rules. Since there are 64 transitions in each rule for a neighborhood size of 1, each rule is stored as 64 characters, one for the right hand side of each transition. The position of each character defines the three left hand side bases of the transition.

The table chains is the relation from the table branches to itself. It stores the ids of all the instances in branches that belong to a single complete chain. The field id acts as the primary key and length specifies the number of instances of branches in that chain. The field ids[] is an integer array which refers to all the ids of the table branches involved in that particular chain.

Thus, the above database makes it easy and efficient to collect various statistics generated through sequence transformation of all the branches. We discuss the statistics collection programs in the next section.
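The positional encoding of a rule as 64 characters can be illustrated as follows. The base ordering A, C, G, T and the index formula 16·a + 4·b + c are assumptions made for this sketch; the text above only states that the position of a character determines the three left hand side bases.

```python
BASES = "ACGT"  # assumed ordering; the actual ordering is not specified here


def transition_index(triplet):
    """Position of a triplet's right hand side inside the 64-character rule."""
    a, b, c = (BASES.index(x) for x in triplet)
    return 16 * a + 4 * b + c


def decode_rule(rule_string):
    """Expand a 64-character rule into a triplet -> base mapping."""
    if len(rule_string) != 64:
        raise ValueError("a neighborhood-1 rule needs exactly 64 characters")
    return {
        x + y + z: rule_string[transition_index(x + y + z)]
        for x in BASES for y in BASES for z in BASES
    }


def encode_rule(rule):
    """Pack a complete triplet -> base mapping back into 64 characters."""
    chars = [""] * 64
    for triplet, base in rule.items():
        chars[transition_index(triplet)] = base
    return "".join(chars)
```

Storing only the right hand sides keeps each rule compact, since the left hand side of every transition is implied by its position in the string.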

3.7 Statistics Collection

We have developed various statistics collection programs that retrieve values from the database and give various kinds of statistics. There can be different complete chains with different parameters satisfying the pseudo molecular clock assumption shown in Equation 3.3. The statistics collection programs extract collective statistics from all the complete chains. These programs can be executed offline by interested users at any point of time. The various statistics that are of interest are discussed in the following subsections.

Table 3.5: chains
id       length   ids
integer  integer  integer[]

3.7.1 Timesteps

This statistic is the number of time steps required for a transformation. This information can be used to obtain more accurate measures of the rates of evolution. Different invocations of the sequence transformer on the same branch by different worker processes may give different numbers of time steps. Further, some of these instances may not be stored as they violate the pseudo molecular clock assumption. The time steps of the remaining instances are stored in the database. We obtain the average and standard deviation of the time steps for the transformation corresponding to a particular branch across the different complete chains.
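The per-branch average and standard deviation can be computed as in the sketch below. The input format (one dictionary per complete chain, mapping a branch to its time steps) and the use of the population standard deviation are assumptions made for the example; in our setup the values would be read from the database.

```python
import statistics
from collections import defaultdict


def timestep_stats(chains):
    """Mean and population standard deviation of time steps per branch.

    `chains` is a list of complete chains, each a dict mapping a branch
    identifier to the time steps recorded for it in that chain.
    """
    per_branch = defaultdict(list)
    for chain in chains:
        for branch, steps in chain.items():
            per_branch[branch].append(steps)
    return {
        branch: (statistics.mean(values), statistics.pstdev(values))
        for branch, values in per_branch.items()
    }
```

The population form is used here so that a branch appearing in a single complete chain still yields a defined standard deviation (zero).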

3.7.2 Unknown base-pairs

Probabilities of base-pair assignments to the unknown segments of the ancestor DNA sequences are calculated. Base-pair assignments to unknown segments are stored for each run of the sequence transformer. To calculate the probability of a particular base at a certain position, we count all the instances where that base was assigned to that position. Then we divide this count by the total number of instances. These probabilities may help in re-building the complete intermediate sequences of the phylogenetic tree. For each position in the sequence at which a base is not known, we obtain the probability for each of the four possible bases.
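A minimal sketch of this counting step is shown below. The function name and the sample assignments are hypothetical; the input is assumed to be the list of bases assigned to one unknown position, with one entry per stored run.

```python
from collections import Counter

BASES = "ACGT"


def base_probabilities(assignments):
    """Probability of each base at one unknown position.

    assignments holds the base assigned to the position in each stored
    run of the sequence transformer (hypothetical input layout).
    """
    if not assignments:
        raise ValueError("at least one stored assignment is required")
    counts = Counter(assignments)
    total = len(assignments)
    # Bases that were never assigned get probability 0.0.
    return {base: counts[base] / total for base in BASES}


# Hypothetical assignments of one unknown position over eight runs.
runs = ["A", "G", "G", "A", "C", "G", "A", "G"]
probs = base_probabilities(runs)
purines = probs["A"] + probs["G"]
pyrimidines = probs["C"] + probs["T"]
print(probs, purines, pyrimidines)
```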

3.7.3 Rules

This statistic covers the various rules used during transformation. These rules may give insights into the impact of neighboring base-pairs on the evolution of DNA sequences. The statistic collection programs try to collect popular rules within a particular branch across all the complete chains and for a single complete chain across all branches. To calculate the popularity of a rule in a branch across all the complete chains, we count the number of times that rule is used for the transformation of the branch. Then this count is divided by the total number of rules used. Similarly, popular branches are also determined across the complete tree. The rules that were used more often than others are of more interest, as these rules may guide in finding the reasons behind certain types of mutations.
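The popularity calculation for one branch can be sketched as below. The names and the short rule labels are hypothetical (real rules are 64-character strings), and the uniform baseline of one over the number of distinct rules is this sketch's reading of the expected popularity reported with the results.

```python
from collections import Counter


def rule_popularity(rules_used):
    """Popularity of each rule for one branch.

    rules_used lists every rule application recorded for the branch,
    pooled over all complete chains (hypothetical input layout).
    Returns (popularity per rule, uniform baseline popularity).
    """
    counts = Counter(rules_used)
    total = sum(counts.values())
    popularity = {rule: count / total for rule, count in counts.items()}
    # Baseline if every distinct rule had been applied equally often.
    expected = 1 / len(counts)
    return popularity, expected


# Short labels stand in for 64-character rules in this example.
popularity, expected = rule_popularity(["r1", "r2", "r1", "r3", "r1", "r2"])
print(popularity, expected)
```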

3.7.4 Differential rule analysis

This statistic is the probability of a particular rule being used at a given time step of the transformation in a given branch. For a given branch, at a given time-step, all the rules used across all complete chains are collected. The number of times those rules were used is also noted. Then, the popularity of a given rule at that time-step is calculated by dividing the count of the rule by the sum of all counts. This is similar to the analysis performed for the calculation of popular rules. The difference is that the calculation is now performed at each time-step. This analysis may give finer directions at each time step of mutation.
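The per-time-step variant can be sketched in the same way. The record layout of (time-step, rule) pairs is an assumption made for illustration, as are the function name and the sample data.

```python
from collections import Counter, defaultdict


def differential_rule_analysis(records):
    """Probability of each rule at each time-step for one branch.

    records is an iterable of (timestep, rule) pairs pooled over all
    complete chains (hypothetical input layout).
    """
    counts = defaultdict(Counter)
    for timestep, rule in records:
        counts[timestep][rule] += 1
    result = {}
    for timestep, counter in counts.items():
        total = sum(counter.values())
        result[timestep] = {rule: c / total for rule, c in counter.items()}
    return result


# Hypothetical records: rule "a" dominates time-step 0.
records = [(0, "a"), (0, "a"), (0, "b"), (1, "b")]
analysis = differential_rule_analysis(records)
print(analysis)
```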

3.7.5 Popularity of transitions

This statistic is the number of times a particular transition is used for a given branch. This may be useful in determining the exact effects of neighboring base-pairs on mutations. There are 64 possible left-hand sides for a transition. For each of these 64 left-hand sides, there are 4 possible right-hand sides. Hence, the average popularity of a transition is 1/256. To calculate popular transitions, we count the number of times a particular transition is used in a particular branch. We divide this number by the total number of transitions used in all the time-steps of the transformations of that branch. A similar statistic can be calculated for the entire branch, aggregating over the time-steps. The statistic collection programs help in finding the popularity of a transition within a particular complete chain across all the branches and for a particular branch across all the complete chains.
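The following sketch illustrates one way to count transitions. Two points are assumptions rather than facts from the text: the 64 left-hand sides are taken to be ordered AAA, AAC, AAG, AAT, ACA, ..., TTT, and a transition is counted as used whenever a rule containing it is used. All names are hypothetical.

```python
from collections import Counter
from itertools import product

BASES = "ACGT"
# Assumed ordering of the 64 left-hand sides: AAA, AAC, AAG, AAT, ACA, ..., TTT.
LEFT_HAND_SIDES = ["".join(t) for t in product(BASES, repeat=3)]


def transitions_of(rule):
    """Expand a 64-character rule into (left-hand side, right-hand side) pairs."""
    if len(rule) != 64:
        raise ValueError("a rule for neighborhood size 1 has 64 characters")
    return list(zip(LEFT_HAND_SIDES, rule))


def transition_popularity(rules_used):
    """Popularity of every transition contained in the rules used for a branch."""
    counts = Counter()
    for rule in rules_used:
        counts.update(transitions_of(rule))
    total = sum(counts.values())
    return {transition: c / total for transition, c in counts.items()}


# Two hypothetical rules that differ only in the right-hand side for TTT.
pop = transition_popularity(["A" * 64, "A" * 63 + "C"])
print(pop[("AAA", "A")], pop[("TTT", "C")])
```

Under these assumptions the popularities of all 256 possible transitions sum to 1, which is consistent with the average popularity of 1/256 stated above.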

Algorithm 1 Algorithm for Sequence Transformer
 1: Read the ancestor and progeny sequences of the branch
 2: Fill the unknown base-pairs in ancestor sequence randomly and record in unknwn[]
 3: current sequence ⇐ ancestor sequence; sametolv ⇐ 0
 4: prev value ⇐ similarity value between the ancestor and progeny sequences
 5: timestep ⇐ 0; rule ⇐ random Cellular Automata Rule
 6: for i ⇐ 0 to maximum number of time steps do
 7:   Try to form a drule which transforms the current sequence into progeny sequence
 8:   if the drule could be formed then
 9:     rules arr[timestep] ⇐ drule; rule change[timestep] ⇐ i; timestep ⇐ timestep + 1
10:     break
11:   end if
12:   Apply the rule to current sequence
13:   if i == 0 then
14:     rules arr[timestep] ⇐ rule; rule change[timestep] ⇐ i
15:   end if
16:   Calculate similarity value
17:   if similarity value == 1.0 then
18:     break
19:   end if
20:   if similarity value > prev value then
21:     continue
22:   else
23:     sametolv++
24:     if sametolv > sametol then
25:       rule ⇐ random Cellular Automata Rule; rules arr[timestep] ⇐ rule
26:       rule change[timestep] ⇐ i; timestep ⇐ timestep + 1; sametolv ⇐ 0
27:     end if
28:   end if
29: end for
30: time steps ⇐ i

Algorithm 2 Greedy Algorithm for Chain Formation
Require: Array of linked lists L[] corresponding to each branch; N, the number of branches
Ensure: Longestchain
{Chains[]: array of linked lists with each element holding a chain}
{inChain[]: an array of N elements where inChain[i] indicates if branch i belongs to a chain}
 1: chainCount ⇐ 0
 2: for i ⇐ 0 to N − 1 do
 3:   inChain[i] ⇐ 0
 4: end for
 5: for i ⇐ 0 to N − 1 do
 6:   if inChain[i] == 1 then
 7:     continue
 8:   end if
 9:   prev node ⇐ The node in L[i] such that node.length is smallest
10:   Insert prev node in Chains[chainCount]
11:   inChain[i] ⇐ 1
12:   tmp node ⇐ NULL
13:   for j ⇐ i + 1 to N − 1 do
14:     tmp node ⇐ The node in L[j] such that node.length > prev node.length and node.length is smallest
15:     if tmp node ≠ NULL then
16:       prev node ⇐ tmp node
17:       Insert prev node in Chains[chainCount]
18:       inChain[j] ⇐ 1
19:     end if
20:   end for
21:   chainCount++
22: end for
23: Longestchain ⇐ Chains[i] such that Chains[i].length is highest in the array of linked lists Chains[]

Chapter 4

Experiments and Results

In this chapter, we present some promising results obtained for 3 HIV sequence types, namely, gag, gagpol and env sequences. The results were obtained by executing the statistics collection programs on the PostgreSQL databases formed for the sequences. Since the grid infrastructure that we had developed can execute the transformations for long periods of time and produce better statistics, the results presented in this chapter should be treated as representing some good samples and demonstrating the potential of the infrastructure.

4.1 Grid Infrastructure

The sequences were downloaded from the HIV Sequence database at Los Alamos[27] and were aligned by the ClustalW web interface[9]. For each of the 3 types of aligned sequences, a phylogenetic tree was obtained using the Phylip[20] package. We utilized a grid infrastructure consisting of 23 machines distributed in 3 countries for our experiments. These machines operated in non-dedicated modes and were used by local users for different purposes. Table 4.1 describes the experiment infrastructure. The worker processes were executed on all the machines. The master process and the PostgreSQL database, where information pertaining to the complete chains of a sequence type are stored by the master, were started on one of the AMD machines in Indian Institute of Science, India. The results in this chapter correspond to obtaining statistics with 7332, 4759 and 3251 complete chains collected in the PostgreSQL databases for gag, gagpol and env sequences, respectively, after 13, 4 and 3 days, respectively, from the start of the corresponding experiments on the distributed infrastructure.

Table 4.1: The Distributed Infrastructure

Location                                          Number of machines  Specifications
Torc cluster, University of Tennessee (UT), USA   8                   GNU/Linux 2.6.8, Dual PIII 933 MHz, 512 MB RAM, 40 GB Hard Drive, 100 Mbps Ethernet
DAS-2, Vrije Universiteit, Netherlands            9                   GNU/Linux 2.4.21, Dual PIII 996 MHz, 1 GB RAM, 20 GB Hard Drive, 100 Mbps Fast Ethernet
AMD cluster, Indian Institute of Science, India   6                   AMD Opteron 246 based 2.21 GHz servers, Fedora Core 4.0, 1 GB RAM, 160 GB Hard Drive, Gigabit Ethernet

4.2 Timesteps

Table 4.2 shows the averages and standard deviations of the number of time steps taken for mutations from the ancestor to progeny sequences of some branches of the phylogenetic tree for the gag sequences. The averages and standard deviations were obtained over the 7332 complete chains formed for the gag sequences. The low standard deviation values for a large number of branches indicate high convergence of time step values for the different branches. Similar results for time steps were obtained for gagpol and env sequences and are shown in Tables 4.3 and 4.4, respectively.

4.3 Popular Rules

Table 4.5 shows the most likely cellular automata rules for neighborhood-dependent mutations that could have been followed at certain discrete time steps of mutations on some branches of the phylogenetic tree formed from gag sequences. The 64 characters in column 2 of the table represent the 64 right-hand sides of the transitions shown in Table 1.1. The last column of Table 4.5 shows the probability of application of the rule at the specified time step(s) and is obtained by dividing the number of times the rule was applied by the total number of applications of rules for the time step(s). Determining the most likely rules for neighborhood-dependent mutations is the primary objective of our work, and the high probability numbers for about 7000 total samples shown in Table 4.5 indicate the potential benefits of our methodologies. Similar results for rules at discrete time steps were obtained for gagpol and env sequences. Tables 4.6 and 4.7 show results for differential rule analysis for gagpol and env sequences, respectively.

Table 4.8 shows some of the most and least popular rules that were applied for some branches of the phylogenetic tree obtained from the gag sequences. Column 3 of the table shows the popularity of a rule for a branch and is calculated by dividing the number of times the rule was applied at different time steps for the branch, over all the complete chains, by the total number of applications of rules for the branch. Column 4 shows the expected popularity assuming that all rules were applied an equal number of times on the branch. The first 5 entries of the table show that certain rules are more popular, with about 2-3 times the expected popularity. The last entry shows that the corresponding rule is less likely to have been used for the branch, having 40 times less popularity than the expected popularity. Similar popular rules for the gagpol and env sequences are shown in Tables 4.9 and 4.10, respectively.

Table 4.2: Summary of time step information for Gag sequences

Branch Number (Ancestor-Progeny)  Average Number of Time steps  Standard Deviation
0(8-9)                            15.657665                     1.961112
4(19-14)                          31.242908                     2.085160
5(43-44)                          32.274139                     2.096791
6(20-21)                          34.051281                     1.852813
7(13-24)                          35.154255                     1.833268
8(24-36)                          36.161758                     1.827739
9(5-4)                            37.184532                     1.814149
10(29-30)                         38.198307                     1.816491
11(15-16)                         39.253136                     1.898609
12(1-45)                          40.259411                     1.904292
13(12-13)                         41.311649                     1.974289
14(14-18)                         42.380253                     1.991383
15(25-26)                         43.405754                     1.986514
16(22-8)                          44.409302                     1.990247
17(22-15)                         45.484451                     2.040152

Table 4.3: Summary of time step information for GagPol sequences

Branch Number (Ancestor-Progeny)  Average Number of Time steps  Standard Deviation
0(3-4)                            16.003782                     2.499934
1(6-10)                           21.138685                     1.995396
2(14-6)                           23.404707                     1.913676
3(10-15)                          25.672621                     1.507439
4(4-20)                           30.176718                     3.335719
5(14-17)                          32.025635                     3.510833

Table 4.4: Summary of time step information for env sequences

Branch Number (Ancestor-Progeny)  Average Number of Time steps  Standard Deviation
0(10-15)                          21.201477                     2.505688
1(8-10)                           25.257153                     2.474922
2(19-22)                          31.203014                     2.654692
3(4-18)                           33.275299                     2.591264
4(6-7)                            35.227314                     2.508755
5(16-12)                          36.850201                     2.525077
6(14-13)                          38.647186                     2.503064
7(16-17)                          40.231621                     2.366600
8(12-15)                          42.673332                     2.888016
9(10-9)                           44.145802                     2.879150
10(8-6)                           46.235619                     3.420729

Table 4.5: Differential Rule Analysis for Gag Sequences

Branch Number (Ancestor-Progeny)  Rule Used                                                         Time-Steps  Probability
11(15-16)                         CAGGCAAACGCCTGTTACATTAATTTCGTGCCGTTAAAAGTGTGCGGCTGATGCGCAAATGCCC  42-43       0.951841
12(1-45)                          GGGTCAATTTTGGCATGACCATATTGTCCTCAGATACTAAGTTCATAGAGAAGAACGATTACTT  43-44       0.951841
29(26-27)                         CCGAAGTGATTGAAGCGCTTGTTTCTGGCGATTTTTGTGGTCCACTCACCTTATTCGCAAAATA  59-61       0.981132
32(16-02 AG.NG.x)                 TCGCAGCGACCGCATAAACATACCGGCTGGCGATAAGCTGGACTCAACACGAGTGCCAAATCTT  65-66       0.929577
83(6-21)                          ATATGTGCGGTTACTATCGGTCTCCGGAGGTGCACTTACCGCCGGATGGCACTAGAGAACTATA  198-199     0.943114

Table 4.6: Differential Rule Analysis for GagPol Sequences

Branch Number (Ancestor-Progeny)  Rule Used                                                         Time-Steps  Probability
0(3-4)                            GTGTCCGATAAAGATTTAAATGTGCAAACATTTCTCTCCGGCTAGTACAAGTTCGACGCTTAGG  21-22       0.901235
11(9-8)                           CAGATTATTCCAAGTACAGAAACCCTCGACAGGGGCGTCCAGCCCCGGGCGGCATGTCCAGGGC  51-53       0.946429
19(18-19)                         AGGGGCGTGAATGGACGAGTGATGCCGCTCCATAGGGCGAGGAACAGTCATTATGCAAGAGAAG  69-71       0.962963
45(4-AF382828)                    AGGTGAGAGCGGTAGCTTCATTCCGCTGAGTAACATCGAATTGTTACTATTTGCCAGAGTTCTA  88-88       0.821429
25(21-L20587)                     TTTCAGTAATACTCCTCTGTAATTTGGGAGAACACGTCGTATATACTCCGCGGACGTAATGCTA  85-85       0.763780

Table 4.7: Differential Rule Analysis for env Sequences

Branch Number (Ancestor-Progeny)  Rule Used                                                         Time-Steps  Probability
43(23-U42720)                     GCCAGAGGGATATGTCTGGACAGTGTGACCGAGACCATTGTTTTGGACAGCGTAGGACGGGCGC  158-161     0.919271
45(22-20)                         AAAACCGACGGCCAGCTACCTACTGTACTCCTGGATAGTCCGTTGCAGCGTCGCCAACGAGATA  170-173     0.921960
45(22-20)                         AAAACCGACGGCCAGCTACCTACTGTACTCCTGGATAGTCCGTTGCAGCGTCGCCAACGAGATA  174-174     0.921960
6(14-13)                          CGCACGCTAGGATACATCGCATTGCTGGACACAGCGATGACAACAGCTTAATAAGCACAAGAGT  43-43       0.864583
33(20-L20587)                     TAATGATATCTGGGTGAACTCTCTGTAAACCACGTATGCGAAAGTAAAACTCCACTACACTCCC  108-109     0.819672
41(21-X52154)                     TACAGCCCCTGCCAAATAACTCGCTTTGCCACCGGATAGCAAGCCTCAACGTTAAGTAACCCTT  138-147     0.844498

Table 4.8: Popular Rules for a Branch for Gag Sequences

Branch Number (Ancestor-Progeny)  Rule                                                              Popularity  Expected Popularity
0(8-9)                            CCTTAGGGCGGTGAGCTGAGGACAGGTAACTGTGCAAAGGGACACACCTGTCGCGATCCATAGG  0.012445    0.005555
0(8-9)                            CCGAGGTCGGATAAGACATAGGAAGTTGTACCCTCTGAACTAATCGTTGGTCATCGGCAGCGTT  0.010139    0.005555
0(8-9)                            CGAGACAGAACTGCTCCGAAAGGCAGAGTGGAAGGCTTATTCTCGGTAAACCTTTAGTGGCATG  0.012784    0.005555
1(18-15)                          GTATAACAAGTCCTTCTGTGTCTCCTCCAGAGGACGATTCCGTGCGTACGGAACGCCTTCGTAA  0.010041    0.004291
2(10-37)                          GTCAGGTATACATCTGGTCTCGAATGGTAGCTCGATCAACCCCATCGCAACCTGGGACGCATCC  0.013078    0.004608
2(10-37)                          CGCCGAAAACCTCTTCTTTAACACGCGTTGCCCGTTTTATCCTGCTTATTCTAGCTTTGTGACT  0.000105    0.004608

4.4 Base Pairs Corresponding to Unknown Positions

Table 4.11 shows the resolutions of unknown positions of the intermediate gag sequences. Columns 4, 5, 6, and 7 show the probabilities that a particular unknown position in an intermediate sequence is occupied by a particular base. Each probability is calculated by dividing the number of resolutions of the position with that base by the total number of resolutions of the position. Entries 1, 3 and 6 of the table show that the corresponding positions of the sequences are more likely to be occupied by purines (A and G) than pyrimidines (C and T). The other entries show that the corresponding positions of the sequences are more likely to be occupied by pyrimidines than purines. Tables 4.12 and 4.13 show similar results for gagpol and env sequences, respectively.

Table 4.9: Popular Rules for a Branch for GagPol Sequences

Branch Number (Ancestor-Progeny)  Rule                                                              Popularity  Expected Popularity
0(3-4)                            AGGGCTGGGTGGGTACAACTATGTAACTCTGCGATAAGGGCCCACGCCTTATGTTTTAGAAAGT  0.016373    0.006289
0(3-4)                            CTGAGAACACGTATGTGTCGGTCTCATTTCAACTCTTCCATATTTCTCCATTACTCCTCCAGGC  0.017153    0.006289
0(3-4)                            CTATGTCTTGCAGTTCTACACCCTCTTCGTTTGTGATAATGGTCCTGCAAGCCACTTCCCCAAT  0.018941    0.006289
1(6-10)                           ACCGCGAAAGGTGCGCTGTGGTACACGTCGAGAGAGCCTCACATTTATCAGATGCATTACGAAT  0.016120    0.006289
1(6-10)                           CGACCCATAGACTGAGGCCCACACACAGGCCATGCCAGAAACCTTCGCTTGACGCCAGTTTCTT  0.016158    0.006289
1(6-10)                           GGCCTTATCCTCGTTATCGCGGAGACGCTGCCAGTAGTTACTCTGCACGGTCATTCAGTAACGC  0.016538    0.006289

Table 4.10: Popular Rules for a Branch for env Sequences

Branch Number (Ancestor-Progeny)  Rule                                                              Popularity  Expected Popularity
0(10-16)                          CCCCTGTGCATAGGGCGAGCGGATACCAATTGACAAATTGAGGGACTTGCCAAGAAAGAAGAGG  0.026367    0.009345
1(8-10)                           AGAACGACAATAATTCCCTCTTGGATCCTGTACAATGCATCATGATTCAGTAGCTGAATATTAC  0.022690    0.008474
1(8-10)                           CAGAGCGCTAAGTCATCTGGTACTGGTTACTGGGTAACCTCGGCTGAAGGTGTACCTACTGACA  0.020014    0.008474
5(16-12)                          ATATGCGACCCAGCATCCGACAGACCCTTGTACTATAGACTCGCGGCAATAGTTGCCTCACCAT  0.020333    0.005780
7(16-17)                          GCTGTGATTCAAACGAGGACGGAGCGTGTAAGCTAAGCAGTAGTGCAATACCTCTAATTTAAGC  0.022057    0.006134

Table 4.11: Resolution of Unknown Positions for Gag Sequences

Seq Number (Name)  Unknown Position  prob(A) (1)  prob(C) (2)  prob(G) (3)  prob(T) (4)  prob(purines) (1+3)  prob(pyrimidines) (2+4)
26 (3)             1390              0.343699     0.149891     0.405074     0.101337     0.748773             0.251228
26 (3)             375               0.173895     0.242089     0.082788     0.501227     0.256683             0.743316
4 (1)              410               0.430306     0.113339     0.305101     0.151255     0.735407             0.264594
9 (14)             411               0.126978     0.340971     0.139935     0.392117     0.266913             0.733088
10 (15)            386               0.125887     0.415985     0.141980     0.316148     0.267867             0.732133
17 (21)            1251              0.262411     0.076650     0.469040     0.191899     0.731451             0.268549
4 (1)              419               0.127250     0.289416     0.147709     0.435625     0.274959             0.725041
18 (22)            413               0.221086     0.281915     0.061648     0.435352     0.282734             0.717267
18 (22)            428               0.133933     0.399209     0.151391     0.315466     0.285324             0.714675

Table 4.12: Resolution of Unknown Positions for GagPol Sequences

Seq Number (Name)  Unknown Position  prob(A) (1)  prob(C) (2)  prob(G) (3)  prob(T) (4)  prob(purines) (1+3)  prob(pyrimidines) (2+4)
4 (13)             1182              0.145921     0.349749     0.125733     0.378597     0.271654             0.728346
10 (19)            1239              0.330819     0.177354     0.375594     0.116234     0.706413             0.293588

Table 4.13: Resolution of Unknown Positions for env Sequences

Seq Number (Name)  Unknown Position  prob(A) (1)  prob(C) (2)  prob(G) (3)  prob(T) (4)  prob(purines) (1+3)  prob(pyrimidines) (2+4)
10 (19)            1973              0.149503     0.421882     0.112165     0.316450     0.261668             0.738332
8 (17)             1077              0.343036     0.222814     0.389156     0.044993     0.732192             0.267807
12 (20)            1709              0.121063     0.299489     0.147342     0.432106     0.268405             0.731595
22 (9)             914               0.381259     0.181177     0.349785     0.087778     0.731044             0.268955
0 (1)              463               0.090723     0.421704     0.191265     0.296308     0.281988             0.718012
9 (18)             914               0.168916     0.335992     0.113804     0.381288     0.282720             0.717280
17 (4)             2302              0.371154     0.148012     0.341511     0.139323     0.712665             0.287335
6 (15)             1685              0.135072     0.365951     0.152454     0.346524     0.287526             0.712475
0 (1)              638               0.167945     0.333436     0.122021     0.376598     0.289966             0.710034

4.5 Potential of Grid Computing

Table 4.14: Usefulness of Large Number of Runs

December 23, 2006, Number of complete chains = 1347

Branch  Average Number of Time Steps  Standard Deviation
0       14.625093                     3.565527
1       20.830734                     3.153974

January 3, 2007, Number of complete chains = 7607

Branch  Average Number of Time Steps  Standard Deviation
0       15.832522                     1.890947
1       22.199816                     1.708907

In order to show that our ever-running computations can have potential long term benefits in resolving uncertainties associated with mutations, we conducted experiments with gag sequences on the 6 AMD machines in India. We then observed the average time steps in mutations of 2 branches of the phylogenetic tree at 2 different periods of time separated by 10 days. Table 4.14 shows results corresponding to the 2 branches with 1347 complete chains collected in the PostgreSQL database on December 23, 2006 and with 7607 complete chains collected on January 3, 2007. The lower standard deviation values for the results collected on January 3 show that the average number of time steps converges with increasing number of executions. Thus, our work can give more definite findings regarding mutations with time progression.

Chapter 5

Predictions in phylogenetic trees

In Chapter 3, we described the modeling of the DNA sequence mutations in order to gain insight into the mutation process. In this chapter, we describe our attempts to extend this model in order to predict the future sequences. To verify the accuracy of the predictions, we try to predict the already existing sequences of the phylogenetic tree.

5.1 Determining the Preserved Segments

Not all the base-pairs in a DNA sequence undergo mutation. In the transformation methods described in the previous chapter, the cellular automata rules are applied only on those segments of the current sequence which differ from the corresponding segments of the progeny sequence. Thus, while predicting, we need to find the segments of the current sequence to which cellular automata rules should not be applied (preserved segments). However, during prediction, we do not have a priori knowledge of the progeny sequence. Hence, we need to find the preserved segments using some other method. We use a Position Specific Scoring Matrix (PSSM) for this purpose. A Position Specific Scoring Matrix is calculated over a set of sequences. The matrix gives us an idea about the preserved segments. It contains four columns corresponding to the four bases and as many rows as there are bases in the strands. The entry at row i and column j corresponds to the probability of occurrence of the base corresponding to column j at position i of a strand.


We use this probability to determine whether a base at a particular position in a strand should be preserved.

5.1.1 Calculation of PSSM

For a specific position, or row, in the PSSM, we count the occurrences of each of the four bases across all the sequences. Then, we divide each of these counts by the total number of sequences under consideration to get the probabilities of occurrence of the four bases for that particular position. Thus, the four columns corresponding to a position or row of the PSSM are filled. This procedure is repeated for all the positions in order to obtain the complete PSSM. The calculation is shown in Algorithm 3.

Algorithm 3 Calculation of Position Specific Scoring Matrix
Require: n sequences in sequences[] each having numpositions bases.
Ensure: pssm[][]
for i ⇐ 0 to numpositions − 1 do
  acount ⇐ 0; ccount ⇐ 0; gcount ⇐ 0; tcount ⇐ 0
  for j ⇐ 0 to n − 1 do
    if sequences[j].bases[i] == A then
      acount ⇐ acount + 1
    end if
    if sequences[j].bases[i] == C then
      ccount ⇐ ccount + 1
    end if
    if sequences[j].bases[i] == G then
      gcount ⇐ gcount + 1
    end if
    if sequences[j].bases[i] == T then
      tcount ⇐ tcount + 1
    end if
  end for
  pssm[i][0] ⇐ acount/n
  pssm[i][1] ⇐ ccount/n
  pssm[i][2] ⇐ gcount/n
  pssm[i][3] ⇐ tcount/n
end for
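The PSSM calculation can also be written as a short runnable sketch. The function name and the example sequences are hypothetical, sequences are assumed to be equal-length strings over A, C, G and T, and the counts are taken separately for every position.

```python
BASES = "ACGT"  # column order of the PSSM: A, C, G, T


def compute_pssm(sequences):
    """Return the PSSM as a list of rows, one row of four probabilities per position."""
    if not sequences:
        raise ValueError("at least one sequence is required")
    n = len(sequences)
    num_positions = len(sequences[0])
    if any(len(seq) != num_positions for seq in sequences):
        raise ValueError("all sequences must have the same length")
    pssm = []
    for i in range(num_positions):
        # Bases observed at position i across all sequences.
        column = [seq[i] for seq in sequences]
        pssm.append([column.count(base) / n for base in BASES])
    return pssm


# Hypothetical aligned sequences.
pssm = compute_pssm(["ACGT", "ACGA", "ACTA", "AGGA"])
for row in pssm:
    print(row)
```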

PSSM, as calculated in Algorithm 3, will be used for predicting a sequence in a phylogenetic tree. As shown in Algorithm 3, PSSM can be calculated over a set of sequences. Thus, we have to define this set of sequences. For our methodology, we have defined three such sets.

1. whole-pssm : Here, we consider all the sequences in the phylogenetic tree for calculating the PSSM. This PSSM gives the overall probability of a particular base being at a particular position across all the sequences. This PSSM is the same for all the sequences. The same whole-pssm will be used for predicting all nodes in the phylogenetic tree.

2. subtree-pssm : In this case, we use a subset of the entire phylogenetic tree. If the sequence for which we are going to predict bases is at level n, all the sequences at levels from 0 to n − 1 form a subtree which existed before this sequence. We use this subset to calculate subtree-pssm. Subtree-pssm summarizes the history of the sequence under consideration. Subtree-pssm is the same for all the nodes at a particular level. However, it has to be re-calculated for sequences at different levels of the phylogenetic tree.

3. path-pssm : In this case, we isolate the set of branches that lead from the root to the sequence. We form a set of the sequences on these branches to calculate the PSSM. path-pssm uses the exact history of a particular sequence, is different for different sequences, and can help in more specialized predictions.
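The three sets can be illustrated with a small sketch. It assumes the tree is stored as a hypothetical child-to-parent dictionary with the root at level 0, and it assumes that the path set excludes the sequence being predicted. Neither detail is stated in the text.

```python
def level_of(node, parent):
    """Level of a node, with the root at level 0."""
    level = 0
    while parent[node] is not None:
        node = parent[node]
        level += 1
    return level


def whole_set(parent):
    """All sequences of the tree (used for whole-pssm)."""
    return set(parent)


def subtree_set(node, parent):
    """All sequences at levels 0 to n - 1, where n is the level of node."""
    n = level_of(node, parent)
    return {other for other in parent if level_of(other, parent) < n}


def path_set(node, parent):
    """Sequences on the path from the root to node, excluding node itself."""
    path = set()
    node = parent[node]
    while node is not None:
        path.add(node)
        node = parent[node]
    return path


# Hypothetical tree: root -> a, b; a -> c; b -> d.
parent = {"root": None, "a": "root", "b": "root", "c": "a", "d": "b"}
print(whole_set(parent), subtree_set("c", parent), path_set("c", parent))
```

The sequences in the chosen set would then be passed to the PSSM calculation of Algorithm 3.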

5.1.2 Strategies for Determining Preserved Sequences

We derived four different strategies for determining preserved sequences using PSSM. Each of these strategies decides whether to apply a rule for transformations at a particular position in the strand.

1. Conservative : This strategy decides not to apply a rule only at positions where the PSSM value is 1.0. This strategy assumes that if a particular base pair appears in all the sequences, it may also appear at the same position in the next sequence.

2. Extreme-Conservative : This strategy decides not to apply rules at positions where the PSSM value is 1.0 and the base corresponding to that position also matches the base in the strand. The PSSM used can be one of whole-pssm, subtree-pssm or path-pssm.

3. Flexible-1: This strategy decides not to apply rules at positions where the PSSM value is greater than or equal to a certain threshold value. The higher the threshold value, the fewer the positions where rules are not applied. Various experiments were conducted to determine this threshold value. This strategy allows a lot of flexibility and its performance varies depending upon the value of the chosen threshold.

4. Flexible-2: This strategy decides not to apply rules at positions where

(a) PSSM value is greater than or equal to a certain threshold value and

(b) the base pair corresponding to the highest PSSM value at that position equals the base pair in the sequence under consideration.

This strategy offers more flexibility than the Flexible-1 strategy, as the positions where the strategy decides not to apply a rule may vary as we progress through the time-steps. This is because of the second condition added. This increased flexibility improves the performance of the strategy and hence we use this strategy for our prediction experiments.
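The decision made by the two flexible strategies at a single position can be sketched as follows. The sketch assumes that the "PSSM value" of a position is the largest of the four probabilities in its PSSM row, with columns ordered A, C, G, T. The function names and the example row are hypothetical.

```python
BASES = "ACGT"  # assumed column order of a PSSM row


def preserved_flexible_1(pssm_row, threshold):
    """True if Flexible-1 decides not to apply a rule at this position."""
    return max(pssm_row) >= threshold


def preserved_flexible_2(pssm_row, current_base, threshold):
    """True if Flexible-2 decides not to apply a rule at this position.

    In addition to the threshold test, the most probable base must be
    the base currently present in the sequence at this position.
    """
    best = max(range(len(BASES)), key=lambda k: pssm_row[k])
    return pssm_row[best] >= threshold and BASES[best] == current_base


# Hypothetical PSSM row in which C is the most probable base.
row = [0.1, 0.6, 0.2, 0.1]
print(preserved_flexible_1(row, 0.5))        # threshold met
print(preserved_flexible_2(row, "C", 0.5))   # threshold met and base matches
print(preserved_flexible_2(row, "A", 0.5))   # base differs from the most probable one
```

Under the same assumption, calling these functions with a threshold of 1.0 reproduces the Conservative and Extreme-Conservative decisions, respectively, since a probability can reach but not exceed 1.0.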

5.1.3 Evaluation of Strategies

To compare the strategies described in Section 5.1.2, we define 5 parameters for a strategy: true positives, false positives, true negatives, false negatives and decision parameter. True positives are the positions where a strategy decides to exclude the position from applying the rule and the corresponding bases in both ancestor and progeny are equal; i.e. the decision taken by the strategy was correct. False positives are the positions where a strategy decides not to apply a rule for transformations but the corresponding two bases were not the same in ancestor and progeny; i.e. the decision taken by the strategy was not correct. True negatives and false negatives are defined similarly. The number of false positives defines the upper bound on the similarity value (see Chapter 3) that can be attained due to the transformations. This is because the positions corresponding to false positives are not going to be transformed even though the base pairs in these positions do not match the corresponding base pairs in the progeny sequence. This upper bound can be calculated as

\[ \mathit{upperbound} = \frac{n - \mathit{fp}}{n} \qquad (5.1) \]

where,

n = number of bases in the sequence.

fp = number of false positives.

Decision parameter is calculated by subtracting the number of wrong decisions from the number of correct decisions and then normalizing the number by the total number of bases in the sequences. Equation 5.2 gives the formula for the decision parameter.

\[ \mathit{decpar} = \frac{(\mathit{tp} + \mathit{tn}) - (\mathit{fp} + \mathit{fn})}{n} \qquad (5.2) \]

where,

tp = number of true positives.

tn = number of true negatives.

fp = number of false positives.

fn = number of false negatives.

We evaluate fitness of a strategy using Equation 5.3

fit(strategy) = 0.5 · decpar + 0.5 · upperbound (5.3)
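Equations 5.1 to 5.3 can be put together in a short sketch. The function name, the input layout and the example sequences are hypothetical. True negatives are taken to be positions where a rule is applied and the bases differ, and false negatives to be positions where a rule is applied although the bases are equal, which is this sketch's reading of the definitions above.

```python
def evaluate_strategy(preserved, ancestor, progeny):
    """Compute tp, fp, tn, fn, upperbound, decpar and fit for one branch.

    preserved[i] is True where the strategy decides not to apply a rule
    at position i (hypothetical input layout).
    """
    n = len(ancestor)
    tp = fp = tn = fn = 0
    for keep, a, p in zip(preserved, ancestor, progeny):
        same = a == p
        if keep and same:
            tp += 1          # preserved, and the base indeed stays the same
        elif keep:
            fp += 1          # preserved, but the base should have changed
        elif not same:
            tn += 1          # rule applied, and the base indeed changes
        else:
            fn += 1          # rule applied, although the base stays the same
    upperbound = (n - fp) / n                      # Equation 5.1
    decpar = ((tp + tn) - (fp + fn)) / n           # Equation 5.2
    fit = 0.5 * decpar + 0.5 * upperbound          # Equation 5.3
    return {"tp": tp, "fp": fp, "tn": tn, "fn": fn,
            "upperbound": upperbound, "decpar": decpar, "fit": fit}


# Hypothetical six-base example.
result = evaluate_strategy([True, True, False, True, False, False],
                           "ACGTAC", "ACGAAT")
print(result)
```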

The conservative strategy gives a low percentage of false positives, resulting in high upperbound values, but there are also a lot of false negatives, which reduce the fitness of the strategy. The extreme-conservative strategy gives results similar to the conservative strategy when whole-pssm is used, but the results are different when subtree-pssm or path-pssm is used. When whole-pssm is used, the second condition used in this strategy becomes redundant and the extreme-conservative strategy exhibits the same behavior as the conservative strategy. However, when subtree-pssm or path-pssm is used, the base pair in the ancestor and progeny sequences may not match even if the PSSM value at that position is 1.0, giving different results. This strategy also suffers from the drawback that it gives a high number of false negatives.

Figure 5.1: Analysis of threshold values : Flexible-1. [Plot of fit(Flexible-1) on the Y-axis against threshold values from 0.1 to 1 on the X-axis, with one curve each for path-pssm, sub-tree pssm and entire tree pssm.]

5.1.4 Determination of Threshold Values for Flexible Strategies

We performed experiments for strategies Flexible-1 and Flexible-2 to determine the threshold value discussed in subsection 5.1.2 . Figure 5.1 and Figure 5.2 shows the fitness of strategies Flexible-1 and Flexible-2 respectively, for various threshold values for a particular branch. The Y-axis shows the fitness values calculated using Equation 5.3. As we can see from both the graphs, the threshold values have similar effects on the perfor- mance of Flexible-1 and Flexible-2. For higher threshold values, the number of false positives are low. However, the number of false negatives are very high. This decreases the metric value. As we decrease the threshold value, many of the false negatives become true positives, increas- ing the metric value. The performance of the strategy reaches its peak at threshold value of 0.5. For threshold values less than 0.5, the number of false positives increase so much that the met- CHAPTER 5. PREDICTIONS IN PHYLOGENETIC TREES 54

Analysis of threshold values 0.7 path-pssm sub-tree pssm entire tree pssm 0.65

0.6

0.55 fit(Flexible-2) 0.5

0.45

0.4 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 Threshold values

Figure 5.2: Analysis of threshold values: Flexible-2

ric value starts decreasing. After a certain stage, the fitness value stabilizes irrespective of the threshold value. This happens because most of the preserved positions remain constant when the threshold values are less than most of the PSSM entries.

The threshold values can be varied across time-steps and their effects on the fitness values analyzed. This results in a more complicated strategy that can provide better results. For the initial time-steps, we want the number of false positives to be as low as possible so that the limit of similarity values that can be achieved is high. This requires a high threshold value. However, a high threshold also means that the actual similarity value remains low. Hence, to improve upon this similarity value, the threshold value can be lowered in the later time-steps. Calculating a different threshold value for each of the time-steps is a complicated and cumbersome task. Moreover, it depends upon the number of time-steps, which differs from branch to branch. Hence it is impossible to come up with a fixed generic strategy. We follow a simplistic approach with two threshold values, thr1 and thr. In this approach, thr1 is used only for the first time-step and thr is used for the later time-steps. From the above discussion, it follows that the value of thr1 should be as high as possible and the value of thr should be lower. We performed an analysis over different ranges of both thresholds. In this analysis, we used Flexible-1 and Flexible-2 and applied random rules to obtain the similarity value for each of the branches. The optimal values for thr1 and thr determined from this analysis were 0.9 and 0.5 respectively. As expected, the value of thr1 is higher, and the value of thr is the same as the best threshold value obtained from the analysis of Figures 5.1 and 5.2. Also, as seen from Figures 5.1 and 5.2, Flexible-2 performs better than Flexible-1 most of the time. Moreover, Flexible-2 improves with increasing time-steps. Hence we have used Flexible-2 for determining the preserved sequences in further work.

To summarize, during transformations, a rule is not applied to the preserved segments. A PSSM is used to calculate these preserved segments of a DNA sequence. Any of the whole, subtree or path PSSMs can be used for this purpose. Of the four strategies that we have formulated, Flexible-2 performs best in terms of the fitness defined by Equation 5.3. The Flexible-2 strategy decides not to apply a rule at a particular position if the PSSM value is greater than or equal to a certain threshold value and the base pair corresponding to the highest PSSM value at that position equals the base pair in the current sequence. The fitness of Flexible-2 is a function of the threshold value. In order to nullify the effect of a high number of false positives, two threshold values are used: thr1 for the first time-step and thr for the rest of the time-steps during the transformations. From several experiments, the best values for thr1 and thr are determined as 0.9 and 0.5 respectively.
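The Flexible-2 decision summarized above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions: the PSSM is represented as a list of per-position dictionaries mapping each base to its score, "the PSSM value" at a position is taken to be the highest entry of that column, and time-steps are counted from 0. The function name and the data layout are hypothetical and are not taken from the thesis implementation.

```python
BASES = "ACGT"

def flexible2_preserved(sequence, pssm, time_step, thr1=0.9, thr=0.5):
    """Return one boolean per position; True means the rule is NOT applied there.

    Assumptions (not from the thesis code): pssm[i] is a dict base -> score,
    and time_step == 0 denotes the first time-step.
    """
    # thr1 is used only for the first time-step, thr for all later ones.
    threshold = thr1 if time_step == 0 else thr
    preserved = []
    for base, column in zip(sequence, pssm):
        best_base = max(BASES, key=column.get)  # base with the highest PSSM value
        # Preserve only if the highest value reaches the threshold AND the
        # current base is the base that holds that highest value.
        preserved.append(column[best_base] >= threshold and base == best_base)
    return preserved
```

With the thresholds 0.9 and 0.5 reported above, a position whose highest PSSM entry is 0.6 is mutated in the first time-step but preserved afterwards, provided the current base matches the most frequent base.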

5.2 Analysis of Random Strategies

Following the Flexible-2 strategy for preserving segments during transformations described in the previous section, we mutate the rest of the segments in a given ancestor sequence for a certain number of time-steps. The number of time-steps is equal to the average number of time-steps for the branch whose ancestor corresponds to the ancestor sequence under consideration. After mutating for these time-steps, we compare the sequence obtained with the progeny sequence. Various techniques can be used to mutate the non-preserved segments. In order to evaluate these techniques, we use three base strategies.

1. Upper bound: This is not really a strategy but the maximum similarity value that we can achieve. It is calculated for each of the branches by using Equation 5.1 with Flexible-2.

2. Time-step invariant: In this strategy, we assume that the preserved segments of the sequence are calculated only at the first time-step. These same segments are then preserved in the further time-steps. With this, we can analytically calculate the expected similarity value as follows. By definition, false positives corresponding to the ancestor sequence always differ in the ancestor and progeny sequences. True negatives and false negatives, however, can converge to the correct base with a probability of 0.25. Thus,

\[ \mathit{exp\_sim} = \frac{n - \{fp + 0.75 \cdot (tn + fn)\}}{n} \qquad (5.4) \]

This analytical value was verified by conducting experiments where the non-preserved segments are randomly mutated.

3. Time-step variant: In this case, an analytical value cannot be obtained because we allow the preserved segments of the sequences to change at each time-step. Hence, we conducted 50 experiments and obtained the average of the similarity values after the given number of time-steps for each of the branches. For each branch, we calculated the preserved segments using Flexible-2 (see Section 5.1). As discussed in the previous section, the threshold used for the first time-step was 0.9 and the threshold used for the rest of the time-steps was 0.5. We observed that the similarity values obtained for almost all the branches in the time-step variant case were better than the analytical values obtained using Equation 5.4 for the time-step invariant case.
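Equation 5.4 for the time-step invariant case can be checked numerically. The sketch below is illustrative only: it evaluates the closed form and compares it with a simple simulation in which every non-preserved position independently receives a uniformly random base, which is the model implied by the probability 0.25. The function names, the number of trials, and the fixed seed are assumptions introduced here.

```python
import random

BASES = "ACGT"

def expected_similarity(n, fp, tn, fn):
    """Closed form of Equation 5.4: (n - {fp + 0.75 * (tn + fn)}) / n."""
    return (n - (fp + 0.75 * (tn + fn))) / n

def simulated_similarity(n, fp, tn, fn, trials=2000, seed=0):
    """Average similarity when non-preserved positions get a uniform random base.

    Assumption: true positives always match, false positives never match, and
    each true-negative or false-negative position matches with probability 1/4.
    """
    rng = random.Random(seed)
    tp = n - fp - tn - fn            # preserved positions that are correct
    target = "A"                     # progeny base; arbitrary by symmetry
    total = 0.0
    for _ in range(trials):
        hits = sum(rng.choice(BASES) == target for _ in range(tn + fn))
        total += (tp + hits) / n
    return total / trials
```

For example, with n = 100, fp = 10, tn = 30 and fn = 20, the closed form gives 0.525, and the simulated average should lie close to that value.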

The three curves shown in Figure 5.3 are obtained using the above base strategies. These curves act as benchmarks for our prediction strategies. The upper bound curve sets the maximum limit that we can reach. The challenge is to derive prediction strategies for mutations of non-preserved segments that yield similarity values better than the time-step variant and time-step invariant cases for all the branches.

Figure 5.3: Analysis of random strategies. (Plot of similarity value against branches for the upper bound, time-step invariant, and time-step variant strategies.)

5.3 Methods Used for Prediction

We have collected various statistics for HIV sequences as discussed in Chapter 4. We use these statistics to predict future nucleotide sequences. To measure whether our prediction methods produce correct sequences, we prune the complete tree to a smaller tree and then try to predict sequences of the complete tree using the pruned tree. The phylogenetic tree under consideration has 16 levels. Thus, we split the tree at each level and try to predict the sequences for the next level. For example, suppose we consider only the first five levels from the root of the phylogenetic tree. The leaves of this five-level tree are intermediate nodes in the original tree. From this five-level tree, we try to predict the sequences at level six. These predicted sequences are then compared with the level-six nodes of the complete phylogenetic tree to derive the similarity value. The sequences at the sixth level are also obtained using the base strategies mentioned in the previous section.

5.3.1 Roulette Wheel Method

We tried various methods, of which we describe here the most effective. In order to predict the progeny sequence of branch A, we extract from the database the rules used for each of the time-steps in the history of branch A. The history consists of all the branches that lead from the root of the tree to the ancestor sequence of branch A. We also obtain the average number of time-steps for branch A. The problem is to apply a rule for each time-step of branch A in order to reach the progeny sequence. To decide which rule should be used for a time-step, we form a roulette wheel using a fraction of the most popular rules for that particular time-step in the previous branches. The entire roulette wheel is divided into sectors such that the probabilities (sizes) of all the sectors sum to one. Each sector of the wheel corresponds to a rule extracted from the database, and the probability (size) of that sector is proportional to the popularity of that rule. We then replace the probability of each sector with a cumulative value, namely the sum of its own probability and the probabilities of all the previous sectors. Thus, the value for the last sector equals 1.0. A number between 0 and 1 is then randomly generated. That number falls in one of the sectors of the roulette wheel, and the rule corresponding to that sector is used for the time-step. Note that the probability of picking a particular rule is directly proportional to the popularity of that rule. The fraction of the most popular rules that should be used to form a roulette wheel is an interesting parameter to search for. We have tried values ranging from 0.1 to 0.6 in steps of 0.1, i.e. from the top 10% to the top 60% of the popular rules.
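The roulette wheel selection described above can be sketched as follows. This is a minimal sketch, not the thesis implementation: rules are represented by arbitrary hashable identifiers, popularity is assumed to be a usage count per rule for the time-step in question, and the names build_wheel and spin are hypothetical.

```python
import bisect
import itertools
import random

def build_wheel(rule_counts, top_fraction):
    """Build a wheel from the most popular rules.

    rule_counts: dict mapping a rule identifier to its popularity count
    (an assumed representation). Returns the selected rules and the
    cumulative sector boundaries, the last of which is 1.0.
    """
    ranked = sorted(rule_counts.items(), key=lambda kv: kv[1], reverse=True)
    keep = max(1, int(round(top_fraction * len(ranked))))  # keep at least one rule
    top = ranked[:keep]
    total = sum(count for _, count in top)
    rules = [rule for rule, _ in top]
    cumulative = list(itertools.accumulate(count / total for _, count in top))
    cumulative[-1] = 1.0  # guard against floating-point rounding
    return rules, cumulative

def spin(rules, cumulative, rng=random):
    """Draw a number in [0, 1) and return the rule whose sector contains it."""
    r = rng.random()
    return rules[bisect.bisect_right(cumulative, r)]
```

Storing cumulative boundaries lets each draw be resolved with a binary search, and a rule's chance of being chosen equals the width of its sector, which is proportional to its popularity among the selected rules.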

5.3.2 Roulette Wheel Method with Random Component

For the roulette wheel method discussed in the previous section, all the rules used to form a roulette wheel were selected from the database. In this method, however, we create a part of the roulette wheel using randomly generated rules. Thus, a roulette wheel now has two parts:

1. Part consisting of the rules extracted from the database.

2. Part consisting of the randomly generated rules.

The same procedure (described in the previous subsection) is followed to pick a rule for a particular time-step. The rule may belong to either of the above parts, with probability equal to the size of that part. For example, if we keep the size of the random part at 0.1 of the total roulette wheel, the probability that the rule picked for a time-step is randomly generated is 0.1. For the part consisting of the rules extracted from the database, the probabilities of the sectors are then adjusted so that their sum is 0.9 instead of 1.0. The fraction of the wheel that consists of randomly generated rules is an interesting parameter to search. We tried different values ranging from 0.0 to 0.5 in steps of 0.05.
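A two-part wheel of this kind might be sketched as below. The sketch is illustrative and rests on assumptions: the random part is placed at the start of the wheel, database rules are given with non-negative popularity weights, and a random rule is modelled as a lookup table with one output base for each of 64 neighborhood configurations. That encoding is a guess made for illustration and is not confirmed by the text; all names are hypothetical.

```python
import random

BASES = "ACGT"

def random_rule(rng=random):
    """Hypothetical encoding: one output base per neighborhood configuration."""
    return "".join(rng.choice(BASES) for _ in range(64))

def pick_rule(db_rules, db_weights, random_fraction,
              make_random_rule=random_rule, rng=random):
    """Pick a rule from a wheel whose first sector holds random rules.

    The sector [0, random_fraction) yields a freshly generated rule; the
    database sectors are rescaled so that together they cover the remaining
    1 - random_fraction of the wheel.
    """
    r = rng.random()
    if r < random_fraction:
        return make_random_rule(rng)
    scale = (1.0 - random_fraction) / sum(db_weights)
    edge = random_fraction
    for rule, weight in zip(db_rules, db_weights):
        edge += weight * scale       # upper boundary of this rule's sector
        if r < edge:
            return rule
    return db_rules[-1]              # guard against floating-point rounding
```

With random_fraction set to 0.1, the database sectors sum to 0.9, matching the example in the text; setting it to 0.0 recovers the plain roulette wheel method.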

5.3.3 History Sizes

Both prediction methods use the history of the branch under consideration. The naive strategy is to consider all the branches from the root to that branch. We can also limit this history to a certain number of immediate predecessors of the branch. We call this number the history size. Thus, if we perform experiments with a history size of 3, the three immediate predecessor branches on the path from the root to the ancestor of the branch under consideration are selected. The roulette wheel is then formed only from the popular rules of these 3 branches. The history size does have an impact on the similarity value that we can attain. History sizes of 3 and 4 seemed to produce better results than other history sizes.
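Restricting the history to a given size amounts to keeping only the tail of the root-to-ancestor path before counting rule popularity. The following sketch assumes the path is available as an ordered list of branch identifiers and that the rules used on each branch can be looked up in a dictionary; both structures, and the function names, are hypothetical. For brevity it pools rules across all time-steps of a branch, whereas the method described above forms a wheel per time-step.

```python
from collections import Counter

def recent_history(path_from_root, history_size=None):
    """Return the last `history_size` branches of the path (all if None)."""
    if history_size is None:
        return list(path_from_root)
    if history_size <= 0:
        return []
    return list(path_from_root[-history_size:])

def pooled_rule_counts(rules_by_branch, path_from_root, history_size=None):
    """Count how often each rule appears in the selected history branches."""
    counts = Counter()
    for branch in recent_history(path_from_root, history_size):
        counts.update(rules_by_branch.get(branch, []))
    return counts
```

The resulting counts can serve as the popularity values from which a roulette wheel is formed.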

5.3.4 Experiments and Results

Using the above techniques, we have tried to predict the sequences for the branches from level 3 onwards. There are a total of 77 such branches. For each branch, we have four similarity values, corresponding to the upper bound, the time-step variant random strategy, the time-step invariant random strategy, and the roulette wheel method with random component. We have performed experiments for history sizes ranging from 3 to 10. For each of these history sizes, we have varied the random component of the roulette wheel from 0.0 to 0.5 in steps of 0.05. For each of these random components, we have varied the fraction of popular rules extracted from the database from 0.1 to 0.6 in steps of 0.1. The goal is to improve the value obtained from the roulette wheel method over both values obtained from the random methods.
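The parameter ranges listed above define a grid of 8 x 11 x 6 = 528 settings per branch. A small sketch of how such a sweep could be enumerated is given below; the evaluate callback, which would return a similarity value for one setting, is a hypothetical placeholder and does not correspond to any function named in the thesis.

```python
import itertools

# Ranges as stated in the text; values are rounded to avoid float drift.
HISTORY_SIZES = list(range(3, 11))                              # 3 .. 10
RANDOM_FRACTIONS = [round(0.05 * i, 2) for i in range(11)]      # 0.0 .. 0.5
POPULAR_FRACTIONS = [round(0.1 * i, 1) for i in range(1, 7)]    # 0.1 .. 0.6

def parameter_grid():
    """All (history size, random fraction, popular fraction) combinations."""
    return list(itertools.product(HISTORY_SIZES, RANDOM_FRACTIONS,
                                  POPULAR_FRACTIONS))

def best_setting(evaluate):
    """Return the grid point with the highest score from `evaluate`.

    `evaluate` is a hypothetical callable taking (history_size,
    random_fraction, popular_fraction) and returning a similarity value.
    """
    return max(parameter_grid(), key=lambda setting: evaluate(*setting))
```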

The roulette wheel method outperforms the time-step invariant random strategy in all 77 branches, with an average improvement of 11.41%. However, the average improvement over the time-step variant random strategy is quite low. Though we have succeeded in bettering the values obtained from the random methods in 51 of the 77 branches for a certain set of parameter values, the average improvement over the time-step variant random strategy is only 0.003. There can be various reasons for this low average value.

5.4 Analysis

Clearly, the predictions made using the methods described so far do not yield significantly better results than the time-step variant random strategy. In this section we analyze the reasons behind this.

1. New Rules: In all the methods described so far, we use the popular rules from the previous history. A small analysis was performed to find out whether a particular rule was used more than once in the tree. It was found that rules were never repeated in any part of the tree. This was expected, as the search space consists of 4^64 rules and it is highly unlikely that two randomly generated rules will be the same. This analysis suggests that a method to derive a new rule could be more effective. This new rule could be formed from the popular rules using extrapolation. In function extrapolation, previous samples are used to derive a new value, rather than reusing the most popular sample as the next prediction. We are not certain how to derive this new rule from the previous history, but it may give better results than using the most popular rule as it is.

2. Preserved Sequences for Selective Application of Rules: While performing sequence transformation, we saw that selective application of cellular automata rules quickly gave us convergence to the progeny sequence. This happens because we had used the knowledge of the progeny sequence itself to form the preserved sequence. This is affordable when we are only doing the analysis. However, while making predictions, we do not have this powerful tool at our disposal. Though we use PSSMs for determining the preserved sequences, the rules that are used do not correspond to the same PSSMs. If, during analysis, PSSMs had been used for selective application of cellular automata rules, we would have had a direct connection between the rules and the preserved sequences.

We are not certain whether using PSSMs for selective application of cellular automata rules would yield the same convergence that we obtained. It may take more time-steps to converge, requiring a higher number of grid resources. However, it may give better results as far as predictions are concerned.

3. Biological Knowledge?: The rules obtained using the sequence transformation analysis are used directly for prediction. Some biological knowledge may give insights for filtering these rules. The analogy here is with the construction of the phylogenetic tree itself. Phylogenetic tree building packages give a large number of trees. These trees are then filtered using bootstrapping. Sometimes even manual observation by a biological expert is needed in order to obtain the correct phylogenetic tree. No such post-analysis is performed on the rules. Bootstrapping of the rules may reduce the noise, giving us a better idea of the popular rules. Again, the methodology and the amount of resources required for bootstrapping remain to be seen.

Chapter 6

Conclusions and Future work

6.1 Conclusions

In this thesis, we have developed techniques based on cellular automata to analyze the rules for neighborhood-based mutations on branches of a phylogenetic tree. Fine-level analysis of rules at different time steps of mutations is one of the primary contributions of our work. We have also used the cellular automata rules to resolve the uncertainties regarding phylogenetic trees, including the number of time steps of mutations and the base-pairs in certain positions of the intermediate sequences. Due to the vast rule space involved in our analyses, we adopt coarse-level parallelism by conducting parallel exploration of rules on different grid resources. We have also built mechanisms for load-balancing and fault-tolerance that are necessary for sustaining our ever-running computations on grid resources. We have built a database to store the results generated during the parallel sequence transformations. Experiments were performed on gag, gagPol and env sequences of HIV using a distributed setup of 23 machines across three countries. Various statistics collection programs were written to extract useful results from the database. These programs can be invoked offline by the user at any time during or after the execution of the application. Based on the results collected, we have shown some interesting statistics related to time steps, popular cellular automata rules for a branch, and the resolution of unknown positions in the intermediate sequences.

In the second part of the work, we have laid foundations for predictions of sequences in phylogenetic trees. Various strategies for calculating the preserved sequences have been formulated. Base strategies have been developed as benchmarks against which the results of any prediction strategy can be compared. The Roulette Wheel strategy has been developed for actual predictions. Though the Roulette Wheel strategy performed better than the base strategies in a large number of cases, the average performance improvement was not significant. Not using PSSMs during the sequence transformations and the lack of bootstrap filtering on rules may have hampered the performance of the strategies.

6.2 Future Work

Our current work can adequately deal with short sequences such as those of HIV. This work can be extended to deal with long sequences such as those in human genomes. To perform computations and collect results for long sequences in a reasonable amount of time, fine-grain parallelism, where the individual transformations from an ancestor to a progeny are parallelized, can be employed. Robust scheduling mechanisms can be developed to map the individual sequence transformations to the processors of a grid. Existing scheduling techniques that need expected times to completion will not be adequate, since the number of time steps needed for sequence transformations cannot be determined a priori.

This work can be generalized to deal with different rules for different patterns of neighborhood-based mutations. In particular, different neighborhood sizes greater than 1 can be explored. Higher order neighborhood-dependencies have to be managed by using simplifying biological assumptions. Specifically, the assumption that the fitness of the strand always increases during the transformation can be relaxed. Certain fluctuations in the fitness value may be allowed during the transformations in order to get more realistic analyses.

Different prediction strategies can be built in order to get more accurate results for the predictions. These prediction strategies may either depend on the realistic analysis discussed above or be built by addressing the reasons discussed in Section 5.4. Specifically, the analysis itself can be improved by incorporating PSSMs for selective application of cellular automata rules. The techniques used for finding the preserved sequences can be tuned so that they can also be used for selective application of cellular automata rules. Rules obtained from the analyses can be filtered using bootstrapping so that these rules can be used more confidently for predictions. New prediction strategies may use techniques similar to function extrapolation to derive new rules instead of using exactly the same rules used in the history.

References

[1] Abc@home. http://abcathome.com/.

[2] P. Arndt, C. Burge, and T. Hwa. DNA Sequence Evolution with Neighbor-Dependent Mutation. Journal of Computational Biology, 10:313–322, 2003.

[3] Biogrid. http://biogrid.net.

[4] M. Bulmer. Neighboring Base Effects on Substitution Rates in Pseudogenes. Molecular Biology and Evolution, 3(4):322–329, 1986.

[5] C. Burks and D. Farmer. Towards Modeling DNA Sequences as Automata. Physica D: Nonlinear Phenomena, 10(Issue 1-2):157–167, 1984.

[6] Quantum monte carlo@home. http://qah.uni-muenster.de/.

[7] Wahid Chrabakh and Richard Wolski. Gridsat: Design and implementation of a computational grid application. J. Grid Comput., 4(2):177–193, 2006.

[8] Climateprediction.net. http://www.climateprediction.net.

[9] ClustalW. http://www.ebi.ac.uk/clustalw.

[10] Einstein@home. http://einstein.phys.uwm.edu/.

[11] Understanding Evolution. http://evolution.berkeley.edu.

[12] N. Ganguly, B. Sikdar, A. Deutsch, G. Canright, and P. Chaudhuri. A Survey on Cellular Automata. Technical report, Centre for High Performance Computing, Dresden University of Technology, December 2003.

[13] S. Hess, D. Jonathan, and R. Blake. Wide Variations in Neighbor-Dependent Substitution Rates. Journal of Molecular Biology, 236:1022–1033, 1994.

[14] B. Korber, M. Muldoon, J. Theiler, F. Gao, R. Gupta, A. Lapedes, B. H. Hahn, S. Wolinsky, and T. Bhattacharya. Timing the Ancestor of the HIV-1 Pandemic Strains. Science, 288(5472):1789–1796, June 2000.

[15] Lhc@home. http://lhcathome.cern.ch/lhcathome/.

[16] Malariacontrol.net. http://www.malariacontrol.net.

[17] B. Morton, I. Bi, M. McMullen, and B. Gaut. Variation in Mutation Dynamics Across the Maize Genome as a Function of Regional and Flanking Base Composition. Genetics, 172(1):569–577, January 2006.

[18] Mufulid. http://www.ufluids.net/.

[19] G. Olsen, H. Matsuda, R. Hagstrom, and R. Overbeek. fastDNAmL: a Tool for Construction of Phylogenetic Trees of DNA Sequences using Maximum Likelihood. Computer Applications in the Biosciences, 10(1):41–48, 1994.

[20] Phylip Package. http://evolution.genetics.washington.edu/phylip.html.

[21] PostgreSQL. http://www.postgresql.org.

[22] Predictor@home. http://predictor.scripps.edu/.

[23] Primegrid. http://www.primegrid.com.

[24] Riesel sieve. http://boinc.rieselsieve.com/.

[25] Rosetta@home. http://boinc.bakerlab.org/rosetta/.

[26] H.-P. Schwefel. Deep Insight from Simple Models of Evolution. BioSystems, 64(1):189–198, January 2002.

[27] HIV Sequence Database. http://hiv.lanl.gov.

[28] SETI@Home. http://setiathome.berkeley.edu.

[29] A. Siepel and D. Haussler. Phylogenetic estimation of context-dependent substitution rates by maximum likelihood. Molecular Biology and Evolution, 21(3):468–488, March 2004.

[30] Simap. http://boinc.bio.wzw.tum.de/boincsimap/.

[31] G. Sirakoulis, I. Karafyllidis, Ch. Mizas, V. Mardiris, A. Thanailakis, and Ph. Tsalides. A Cellular Automaton Model for the Study of DNA Sequence Evolution. Computers in Biology and Medicine, 33(5):439–453, September 2003.

[32] Spinhenge@home. http://spin.fh-bielefeld.de/.

[33] C. Stewart, D. Hart, M. Aumuller, R. Keller, M. Muller, H. Li, R. Repasky, R. Sheppard, D. Berry, M. Hess, U. Wossner, and J. Colbourne. A Global Grid for Analysis of Arthropod Evolution. In Proceedings of the Fifth IEEE/ACM International Workshop on Grid Computing, pages 328–337, 2004.

[34] Sztaki desktop grid. http://szdg.lpds.sztaki.hu/szdg/.

[35] World community grid. http://www.worldcommunitygrid.org/.

[36] S. Wolfram. A New Kind of Science. Wolfram Media, Inc., 2002.