Grid and High-Performance Computing for Applied Bioinformatics

Jorge Andrade
Royal Institute of Technology, School of Biotechnology
Stockholm, 2007

© Jorge Andrade
E-mail: [email protected]

School of Biotechnology
Royal Institute of Technology
AlbaNova University Center
SE-106 91 Stockholm
Sweden

Printed at Universitetsservice US AB
Box 700 14 Stockholm

ISBN 978-91-7178-782-8
TRITA-BIO-Report 2007-9
ISSN 1654-2312

Jorge Andrade (2007). Grid and High-Performance Computing for Applied Bioinformatics. Department of Gene Technology, School of Biotechnology, Royal Institute of Technology (KTH), Stockholm, Sweden.

ABSTRACT

The beginning of the twenty-first century has been characterized by an explosion of biological information. The avalanche of data grows daily and arises as a consequence of advances in the fields of molecular biology, genomics and proteomics. The challenge for today's biologists lies in decoding this huge and complex body of data in order to achieve a better understanding of how our genes shape who we are, how our genome evolved, and how we function. Without annotation and data mining, the information provided by, for example, high-throughput genomic sequencing projects is not very useful. Bioinformatics is the application of computer science and technology to the management and analysis of biological data, in an effort to address biological questions. The work presented in this thesis has focused on the use of Grid and High Performance Computing for solving computationally expensive bioinformatics tasks where, due to the very large amount of available data and the complexity of the tasks, new solutions are required for efficient data analysis and interpretation.
Three major research topics are addressed. First, the use of grids for distributing the execution of sequence-based proteomic analysis, with applications in optimal epitope selection and in a proteome-wide effort to map the linear epitopes in the human proteome. Second, the application of grid technology in genetic association studies, which enabled the analysis of thousands of simulated genotypes. Finally, the development and application of an economics-based model for grid job scheduling and resource administration. The application of the grid-based technology developed in the present investigation resulted in the successful tagging and linking of chromosome regions in Alzheimer's disease, the proteome-wide mapping of linear epitopes, and the development of market-based resource allocation in a grid for scientific applications.

Keywords: Grid computing, bioinformatics, genomics, proteomics.

LIST OF PUBLICATIONS

This thesis is based on the papers listed below, which will be referred to by their Roman numerals.

I. Jorge Andrade*, Lisa Berglund*, Mathias Uhlén and Jacob Odeberg. Using Grid Technology for Computationally Intensive Applied Bioinformatics Analyses. In Silico Biology 6 (2006) 1–10. IOS Press, 1386-6338/06

II. Lisa Berglund*, Jorge Andrade*, Jacob Odeberg and Mathias Uhlén. The linear epitope space of the human proteome (2007). Submitted

III. Jorge Andrade, Malin Andersen, Anna Sillén, Caroline Graff, Jacob Odeberg. The use of grid computing to drive data-intensive genetic research. Eur. J. Hum. Genet. (2007) 15, 694–702

IV. Anna Sillén, Jorge Andrade, Lena Lilius, Charlotte Forsell, Karin Axelman, Jacob Odeberg, Bengt Winblad and Caroline Graff. Expanded high-resolution genetic study of 109 Swedish families with Alzheimer's disease. Eur. J. Hum. Genet. (2007)

V. Thomas Sandholm, Jorge Andrade, Jacob Odeberg, Kevin Lai. Market-Based Resource Allocation using Price Prediction in a High Performance Computing Grid for Scientific Applications. High Performance Distributed Computing (2006), 15th IEEE International Symposium, 132–143. ISSN: 1082-8907

* These authors contributed equally to the work

Related publications

1. Mercke Odeberg J, Andrade J, Holmberg K, Hoglund P, Malmqvist U, Odeberg J. UGT1A polymorphisms in a Swedish cohort and a human diversity panel, and the relation to bilirubin plasma levels in males and females. European Journal of Clinical Pharmacology (2006)

2. Andrade J, Andersen M, Berglund L, Odeberg J. Applications of Grid computing in genetics and proteomics. Proceedings of PARA06 workshop on state-of-the-art in scientific and parallel computing. Springer series Lecture Notes in Computer Science (LNCS) 2007.

Articles printed with permission from the respective publisher.

Table of contents

I INTRODUCTION
1. INTRODUCTION
1.1 An explosion of biological information
1.2 Computer science in biology - Bioinformatics
2. EXAMPLES OF BIOINFORMATICS APPLICATION RESEARCH IN BIOLOGY.
2.1 Genomic and proteomics databases
2.3 Analysis of gene expression
2.4 Analysis of protein levels
2.5 Prediction of protein structure
2.6 Protein-protein docking
2.7 High-throughput image analysis
2.8 Simulation based linkage and association studies
2.9 Systems biology
3. COMPUTATIONAL CHALLENGES IN BIOINFORMATICS
3.1 The problem of growing size
3.2 The problem of storage and data distribution
3.3 The problem of data complexity
4. EMERGING DISTRIBUTED COMPUTING TECHNOLOGIES
4.1 An introduction to grid computing
4.2 Virtual Organizations
4.3 Examples of Computational Grids
4.3.1 The European DataGrid
4.3.2 The Enabling Grids for E-sciencE project (EGEE)
4.3.3 Nordugrid / Swedgrid
4.3.4 The TeraGrid project
4.3.5 The Open Science Grid
4.4 Software Technologies for the Grid
4.4.1 Globus
4.4.2 Condor
4.5 Models for Grid Resource Management and Job Scheduling
4.5.1 GRAM (Grid Resource Allocation Manager)
4.5.2 Economic-based Grid Resource Management and Scheduling
4.5 Grid-based initiatives approaching applied bioinformatics
II PRESENT INVESTIGATION
5. APPLICATIONS OF GRID TECHNOLOGY IN PROTEOMICS (PAPER I AND II)
5.1 Grid technology applied to sequence similarity searches (Grid-Blast)
5.2 Grid based proteomic similarity searches using non-heuristic algorithms
6. APPLICATIONS OF GRID TECHNOLOGY IN GENETICS. (PAPER III AND IV)
6.1 Grid technology applied to genetic association studies (Grid-Allegro)
6.2 Genetic study of 109 Swedish families with Alzheimer’s disease
7. RESOURCE ALLOCATION IN GRID COMPUTING (PAPER V)
7.1 Market-Based Resource Allocation in Grid
8. FUTURE PERSPECTIVES
ABBREVIATIONS
ACKNOWLEDGEMENTS
REFERENCES

I INTRODUCTION

Chapter 1

1. Introduction

1.1 An explosion of biological information

The beginning of the twenty-first century can be characterized by an explosion of biological information. The avalanche of data grows daily and arises as a consequence of advances in the fields of molecular biology and genomics. Genetic information is encoded and stored in the nucleus of the cells that are the fundamental working units of every living system.
All the instructions needed to direct their activities are contained within the chemical DNA. As stated in the central dogma of molecular biology (Figure 1), genetic information flows from genes, via RNA, to proteins.

Figure 1. Diagram of the central dogma, from DNA to RNA to protein, illustrating the genetic code.

Proteins perform most of the cellular functions and constitute the majority of the cellular structures. Proteins are often large, complex molecules made up of smaller polymerised subunits called amino acids. Chemical properties that distinguish the twenty different amino acids cause the protein chains to fold up into specific three-dimensional structures that define their particular functions in the cell. Studies to explore protein structure and activities, known as proteomics, will be the focus of much research for decades in order to elucidate and understand the molecular basis of health and disease.

1.2 Computer science in biology - Bioinformatics

The challenge for today's biologists lies in decoding this huge and complex body of data from the biological language, in order to better understand how our genes shape who we are, how our genome evolved, and how we function. Without annotation and detailed data mining, the information provided by high-throughput genomic sequencing projects is not very useful. Bioinformatics is an interdisciplinary research area that applies techniques from applied mathematics, informatics, statistics, computer science, artificial intelligence, chemistry, and biochemistry to solve biological problems, usually at the molecular level. The ultimate goal of bioinformatics is to uncover and decipher the richness of biological information hidden in the mass of data and to obtain a clearer insight into the fundamental biology of organisms.

2. Examples of bioinformatics application research in biology.
2.1 Genomic and proteomics databases

Since the sequencing of the first organism (Phi-X174 phage) by Fred Sanger and his team in 1977 (Sanger, Air et al. 1977), the DNA sequences of hundreds of organisms have been decoded and stored in genomic databases (Galperin 2007; Hutchison 2007). Sequence analysis in molecular biology and bioinformatics
