Analysis and Predictions of DNA Sequence Transformations on Grids

A Thesis Submitted for the Degree of Master of Science (Engineering) in the Faculty of Engineering

By Yadnyesh R. Joshi

Supercomputer Education and Research Centre
INDIAN INSTITUTE OF SCIENCE
BANGALORE – 560 012, INDIA

August 2007

Acknowledgments

First of all, I would like to extend my sincere thanks to my research supervisor, Dr. Sathish Vadhiyar, for his constant guidance and support during the entire period of my post-graduation at IISc. He was always approachable, supportive and ready to help with any sort of problem. I am very thankful to him for being extremely patient and understanding about the silly mistakes that I made. Under his guidance I learned to approach problems in an organized manner and to set realistic goals for my research. I thank him for his extreme patience and excellent technical guidance in writing and presenting research. Finally, he was and continues to be my role model for his hard work and passion for research.

I am also thankful to Dr. Nagasuma Chandra and Dr. Debnath Pal from S.E.R.C. and Dr. Narendra Dixit from the Chemical Engineering department for their very useful and interesting insights into the biological domain of our research. I am also thankful to all the faculty of S.E.R.C. for always inspiring us with their motivational talks.

I would like to mention the names of my colleagues Sandip, Sanjay, Rakhi, Sundari, Antoine and Roshan for their technical and emotional support. Special thanks to the vatyaa kya group members for the adventures and the routines inside and outside the institute. I would also like to thank the Marathi Mandal for making the institute a homely place.

Back home, I would like to thank my parents for being my pillars of strength. I would also like to thank Yamini tai and Dhanashree, my sisters, for supporting and guiding me in making important decisions.
I would like to thank my friends Vijay, Vishwanath, Pushkaraj, Prashant, Akshay, Sunder, Anusha and Neha for always being there for me. Last but not least, I am very thankful to my grandfather Laxman Rao for being my strongest motivator all the time.

Abstract

Phylogenetics is the study of the evolution of organisms. Evolution occurs due to mutations of DNA sequences. The reasons behind these seemingly random mutations are largely unknown. There are many algorithms that build phylogenetic trees from DNA sequences. However, there are certain uncertainties associated with these phylogenetic trees. Fine-level analysis of these phylogenetic trees is both important and interesting for evolutionary biologists. In this thesis, we try to model the evolution of DNA sequences using cellular automata and to resolve the uncertainties associated with the phylogenetic trees. In particular, we determine the effect of neighboring DNA base-pairs on the mutation of a base-pair.

A cellular automaton can be viewed as an array of cells which modifies itself in discrete time-steps according to a governing rule. The state of a cell at the next time-step depends on its current state and the states of its neighbors. We have used cellular automata rules for analysis and predictions of DNA sequence transformations on computational grids.

In the first part of the thesis, DNA sequence evolution is modeled as a cellular automaton with each cell having one of four possible states, corresponding to the four bases. Phylogenetic trees are explored in order to find the cellular automata rules that may have guided the evolutions. A master-client paradigm is used to exploit the parallelism in the sequence transformation analysis. Load balancing and fault-tolerance techniques are developed to enable the execution of the explorations on grid resources.
The analysis of the sequence transformations is used to resolve uncertainties associated with the phylogenetic trees, namely the intermediate sequences in the phylogenetic tree and the exact number of time-steps required for the evolution of a branch. The model is further used to find various statistics, such as the most popular rules at a particular time-step in the evolution history of a branch in a phylogenetic tree. We have observed some interesting statistics regarding the unknown base pairs in the intermediate sequences of the phylogenetic tree and the most popular rules used for sequence transformations.

The next part of the thesis deals with predictions of future sequences using the previous sequences. First, we try to find the preserved sequences so that cellular automata rules can be applied selectively. Then, random strategies are developed as base benchmarks. A roulette wheel strategy is used for predicting future DNA sequences. Although the prediction strategies outperform the random benchmarks in most cases, the average performance improvement over the random strategies is not significant. The possible reasons are discussed.

Contents

List of Figures
List of Tables
1 Introduction
  1.1 Cellular Automata
  1.2 DNA and Cellular Automata
  1.3 Phylogenetics
  1.4 Grid Computing
  1.5 Motivation and Problem Formulation
2 Related Work
  2.1 DNA Sequence Evolution
  2.2 Grid Computing Applications
    2.2.1 Applications in Mathematics and Earth Sciences
    2.2.2 Applications in Astronomy, Physics and Chemistry
    2.2.3 Applications in Biology and Bioinformatics
3 Sequence Transformation on Grids
  3.1 Sequence Transformation on a Branch
    3.1.1 Naive Approach
    3.1.2 Selective Application of Cellular Automata Rules
    3.1.3 Dynamic Formation of Cellular Automata Rules
  3.2 Sequence Transformer
  3.3 Pseudo Molecular Clock Assumption
  3.4 Design
    3.4.1 Master-Worker Paradigm
    3.4.2 Phases of Execution
  3.5 Grid Computing Techniques
    3.5.1 Load Balancing
    3.5.2 Fault Tolerance
  3.6 Database Design
  3.7 Statistics Collection
    3.7.1 Timesteps
    3.7.2 Unknown base-pairs
    3.7.3 Rules
    3.7.4 Differential rule analysis
    3.7.5 Popularity of transitions
4 Experiments and Results
  4.1 Grid Infrastructure
  4.2 Timesteps
  4.3 Popular Rules
  4.4 Base Pairs Corresponding to Unknown Positions
  4.5 Potential of Grid Computing
5 Predictions in phylogenetic trees
  5.1 Determining the Preserved Segments
    5.1.1 Calculation of PSSM
    5.1.2 Strategies for Determining Preserved Sequences
    5.1.3 Evaluation of Strategies
    5.1.4 Determination of Threshold Values for Flexible Strategies
  5.2 Analysis of Random Strategies
  5.3 Methods Used for Prediction
    5.3.1 Roulette Wheel Method
    5.3.2 Roulette Wheel Method with Random Component
    5.3.3 History Sizes
    5.3.4 Experiments and Results
  5.4 Analysis
6 Conclusions and Future work
  6.1 Conclusions
  6.2 Future Work
References

List of Figures

1.1 Evolution of Cellular Automata through time steps
1.2 Rule that governs the evolution of cellular automata shown in Figure 1.1
1.3 Double helix structure of DNA (Courtesy: U.S. National Library of Medicine)
1.4 Example Phylogenetic Tree with Gag Sequences
3.1 Application of Random Cellular Automata Rules
3.2 Selective Application of Cellular Automata Rules
3.3 Example: Dynamic Formation of Cellular Automata Rules
3.4 Dynamic Formation and Selective Application of Cellular Automata Rules
3.5 Illustration of the Greedy Algorithm
3.6 The Master-Worker Design
3.7 Phase I in Master
3.8 Phase II in Master
5.1 Analysis of threshold values: Flexible-1
5.2 Analysis of threshold values: Flexible-2
5.3 Analysis of random strategies

List of Algorithms

1 Algorithm for Sequence Transformer
2 Greedy Algorithm for Chain Formation
3 Calculation of Position Specific Scoring Matrix

List of Tables

1.1 Left-Hand Sides of 64 Transitions of Cellular Automata with Neighborhood Size of 1
3.1 strands
3.2 working strand
3.3 branches
3.4 ruletable
3.5 chains
4.1 The Distributed Infrastructure
4.2 Summary of time step information for Gag sequences
4.3 Summary of time step information for GagPol sequences
4.4 Summary of time step information for env sequences
4.5 Differential Rule Analysis for Gag Sequences
4.6 Differential Rule Analysis for GagPol Sequences
4.7 Differential Rule Analysis for env Sequences
4.8 Popular Rules for a Branch for Gag Sequences
4.9 Popular Rules for a Branch for GagPol Sequences
4.10 Popular Rules for a Branch for env Sequences
4.11 Resolution of Unknown Positions for Gag Sequences
4.12 Resolution of Unknown Positions for GagPol Sequences
4.13 Resolution of Unknown Positions for env Sequences
4.14 Usefulness of Large Number of Runs

Chapter 1

Introduction

In this chapter, we give a brief background on cellular automata, the relationship between cellular automata and DNA evolution, and the concept of phylogenetic trees.

1.1 Cellular Automata

A cellular automaton is a regular array of identical finite state automata in which the next states of the array elements are determined solely by their current states and the states of their neighbors. A one-dimensional cellular automaton consists of a line of cells, each having a particular state. The state of each of these cells changes over discrete time-steps.
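
To make the definition concrete, the following is a minimal sketch in Python of one time-step of a one-dimensional cellular automaton whose cells take one of the four DNA bases as state, with a neighborhood size of 1 (one neighbor on each side). It is an illustration only and is not the implementation used in this thesis. The names `LEFT_HAND_SIDES`, `identity_rule` and `step` are hypothetical, and leaving the two boundary cells unchanged is an assumption made here for simplicity.

```python
from itertools import product

BASES = "ACGT"

# With a neighborhood size of 1, a transition's left-hand side is a triple
# (left neighbor, cell, right neighbor), which gives 4**3 = 64 possibilities.
LEFT_HAND_SIDES = ["".join(t) for t in product(BASES, repeat=3)]


def identity_rule():
    """Hypothetical rule that maps every triple to its middle base (no mutation)."""
    return {lhs: lhs[1] for lhs in LEFT_HAND_SIDES}


def step(sequence, rule):
    """Apply one synchronous time-step of the rule to the sequence.

    Assumption: the first and last cells have only one neighbor,
    so they are copied unchanged.
    """
    if len(sequence) < 3:
        return sequence
    # Every interior cell is updated from the *current* sequence.
    middle = [rule[sequence[i - 1:i + 2]] for i in range(1, len(sequence) - 1)]
    return sequence[0] + "".join(middle) + sequence[-1]


# Usage: a rule that mutates G to T only when G sits between C and A's
# complementary context "A" on the left and "C"... here simply the triple "ACG".
rule = identity_rule()
rule["ACG"] = "T"
print(step("TACGA", rule))
```

In this sketch a rule is stored as a dictionary with one entry per left-hand side, so changing a single entry changes how one base mutates in one specific neighbor context while all other contexts leave the base untouched.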
