Bioinformatics Technologies

Yi-Ping Phoebe Chen (Ed.)

With 129 Figures and 50 Tables

Yi-Ping Phoebe Chen (Ed.)
School of Information Technology, Faculty of Science and Technology, Deakin University, Australia
Email: [email protected]

Library of Congress Control Number: 2004115713 ACM Computing Classification (1998): J.3, I.5, H.3

ISBN 3-540-20873-9 Springer Berlin Heidelberg New York

This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilm or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer. Violations are liable for prosecution under the German Copyright Law.

Springer is a part of Springer Science+Business Media springeronline.com

The use of general descriptive names, registered names, trademarks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

Typesetting: By the author
Cover design: KünkelLopka, Heidelberg
Production: LE-TEX Jelonek, Schmidt & Vöckler GbR, Leipzig
Printed on acid-free paper 45/3142/YL - 5 4 3 2 1 0

Preface

This book arose primarily out of a compelling need for a comprehensive reference in bioinformatics that will cater to students, research, and industry. We strongly believe that this new field evolved from the active interaction of two fast-developing disciplines: biology and information technology. Solving modern biological problems requires advanced computational methods. Key techniques include database management, data modeling, pattern recognition, data mining, query processing, and visualization of biological data. Until very recently, virtually all public databases were based on large flat files stored in simple formats. Navigation among databases required expert knowledge and considerable patience.

The huge quantities of biological data and escalating demands of modern biological research increasingly require the sophistication and computing power of information technology (IT) tools. More specifically, optimal use of these tools requires proximal information: knowing which data points are in the surrounding area of others. In this book, we will present methodologies and data structures for arriving at high-quality biological information, which can then be used as a foundation to develop practical tools for clustering and visualization in biological data mining and database management.

Throughout the book, we will demonstrate the application of well established concepts and techniques of information technology to the management and analysis of biological data. Biological analysis requires the integration of software tools used in data mining, such as clustering, classification, decision trees and decision tables, and sequence and structural modeling such as data modeling. A distinctive feature of our book is the integration of advanced database technologies with visualization techniques such as query-interactive user interfaces, visual descriptions, and advanced 3-D visual modeling.

Biological data continue to grow exponentially in size and complexity.
As a result, they introduce new data types not previously seen even in molecular biology. It is vital and urgent that advanced information technologies, in particular, database technologies and visual analysis, be applied to support biological research and innovation based on biological data. Specific IT-motivated activities are taking root in some parts of the biological research community, and we foresee that they will benefit information technology.

“Bioinformatics Technologies” is a comprehensive book that covers these two important areas, viz., IT and biology, which have become interwoven in recent years. Many international experts have made contributions to this book. Each article is written in a way that a practitioner of bioinformatics can easily understand and then apply the knowledge gained to extract useful information from biological data. Each article covers one topic and can be read independently of the others. The book provides both a general survey of the topic and an in-depth exposition of the state of the art. Practitioners will certainly find this book very resourceful and handy when looking for solutions to practical problems in bioinformatics. Researchers can use this book as a source for obtaining background information, current trends and developments; it also provides them with the most important references on these topics.

The book covers the basic principles and applications of bioinformatics technologies. It also contains many articles that specifically address bioinformatics databases and emerging topics in bioinformatics technologies such as pattern discovery, data mining, simulation and visualization. The central issue in bioinformatics is how to transform biological data into meaningful and valuable information. This implies that the biological knowledge related to the problem domain is incorporated into the requirements analysis phase of bioinformatics.
However, it has been recently recognized that in the twenty-first century bioinformatics will play an increasingly important role. For this reason, the international conference series on Asia-Pacific Bioinformatics (the first bioinformatics conference in the IT domain) was founded in 2002. The underlying goal behind this conference series is to recognize the interdisciplinary nature of bioinformatics in the interplay between biology and IT and how information technology can be applied to biology.

Even though a great deal of attention is paid to this area in terms of research and investment, the theoretical understanding needs further refinement to bring the outcome of the biological analysis effectively to the service of mankind. In editing this book, this viewpoint has been carefully taken into consideration to conceptually organize the recent progress in bioinformatics.

The book is organized into twelve chapters that cover twelve important technologies in bioinformatics.

Chapter 1, Introduction to Bioinformatics, provides an overview of bioinformatics technology and the different techniques within bioinformatics. Further, it introduces the relationships between the other chapters.

Chapter 2, Overview of Structural Bioinformatics, presents an overview of structural bioinformatics. The chapter describes the organization of structural bioinformatics, the Protein Data Bank, secondary resources and applications, and the use of structural bioinformatics approaches in drug design. It also includes structural classification, structure prediction, functional assignments in structural genomics, protein-protein interactions and protein-ligand interactions. The role of structural bioinformatics in systems biology is also briefly discussed.

Chapter 3, Database Warehousing in Bioinformatics, deals with the basics of database warehousing, transforming biological data into knowledge, data warehouse architectures and data quality in bioinformatics.

Chapter 4, Data Mining for Bioinformatics, discusses the basics of data mining applicable to bioinformatics. The main types of data analysis, namely, biomedical data analysis, DNA data analysis, protein data analysis and microarray data analysis, are elaborated upon. Biomedical data analysis includes a major nucleotide sequence database, a protein sequence database, a gene expression database, and software tools for bioinformatics research. DNA data analysis covers DNA sequence and DNA data analysis. Protein data analysis encompasses protein and amino acid sequence and protein data analysis.

Chapter 5, Machine Learning in Bioinformatics, dwells on the theory behind machine learning applied to bioinformatics. It includes neural network architectures and applications. We also describe other machine learning techniques, such as genetic algorithms and fuzzy systems.

Chapter 6, Systems Biotechnology: a New Paradigm in Biotechnology Development, describes a new paradigm in biotechnology development called systems biotechnology. It covers integrative approaches and in silico modeling and simulation of cellular processes.
Chapter 7, Computational Modeling of Biological Processes with Petri Net-Based Architecture, describes computational modeling of biological processes with a Petri net-based architecture, a hybrid Petri net and a hybrid dynamic net, and a hybrid functional Petri net. The chapter also covers the implementation of an HFPNe in a genomic object net and the modeling of biological processes with an HFPNe and a genomic object net and its visualizer.

Chapter 8, Biological Sequence Assembly and Alignment, illustrates biological sequence assembly and alignment. It covers large-scale sequence assembly, Euler sequence assembly, PESA sequence assembly, large-scale pairwise sequence alignment, large-scale multiple sequence alignment, and load balancing and communication overheads.

Chapter 9, Modeling for Bioinformatics, covers the basics of modeling techniques related to bioinformatics. It includes the major modeling techniques, namely, hidden Markov modeling for biological data analysis, comparative modeling and molecular modeling. The application of hidden Markov modeling to biological data is discussed at length for sequence identification, sequence classification, and multiple alignment generation. Comparative modeling comprises protein comparative modeling, comparative genomic modeling, and probabilistic modeling. The probabilistic modeling encompasses Bayesian networks, stochastic context-free grammars, and probabilistic Boolean networks. Finally, we describe molecular modeling, which deals with molecular and related visualization applications, molecular mechanics, and modern computer programs used in molecular modeling.

Chapter 10, Pattern Matching for Motifs, addresses the issues in pattern matching for discovering motifs. Topics include gene regulation and promoter organization. We include motif recognition and motif detection strategies.
The chapter also includes two different approaches, namely, the single gene multi-species approach and the multi-gene multi-species approach.

Chapter 11, Visualization and Fractal Analysis of Biological Sequences, deals with visualization and fractal analysis of biological sequences. It elaborates on fractal analysis, the recurrent iterated function system model, the moment method to estimate the parameters of the IFS (RIFS) model, multifractal analysis, the DNA walk model, and chaos game representation of biological sequences. Two-dimensional portrait representation of DNA sequences and one-dimensional measure representations of biological sequences are also introduced.

Chapter 12, Microarray Data Analysis, discusses the techniques used to analyze microarray data and microarray technology used for genome expression study, image analysis for data extraction, and data analysis for pattern discovery.

In a rapidly expanding area such as bioinformatics, no book can claim to cover the topics that suit the interests of everyone. However, it is hoped that this book is comprehensive enough to serve as a useful and handy guide for both practitioners and researchers. This book will help both IT professionals and biologists to understand the bioinformatics world.

We would like to thank all authors who contributed the chapters in this book, without whom the mission would have been impossible. Special thanks to the reviewers for their professional input. We thank Ricky Chen and Chinnu Subramaniam for helping us check parts of the manuscript at short notice. We have taken care to cite referenced work. If we have missed any citation, we apologize for the lapse. We thank all researchers for their permission to use their figures in this book. We also wish to thank the Springer publisher Ralf Gerstner for his final step of checking and timely help before publication. Finally, we wish to thank our families and friends for their support.
We are aware that some errors may remain in the book. Your input for improvement will be helpful for future reprints and editions. Comments, corrections, and constructive suggestions should be sent to Springer or by electronic mail to [email protected]

January 2005
Yi-Ping Phoebe Chen

Contents

Preface

1 Introduction to Bioinformatics
1.1 Introduction
1.2 Needs of Bioinformatics Technologies
1.3 An Overview of Bioinformatics Technologies
1.4 A Brief Discussion on the Chapters
References

2 Overview of Structural Bioinformatics
2.1 Introduction
2.2 Organization of Structural Bioinformatics
2.3 Primary Resource: Protein Data Bank
2.3.1 Data Format
2.3.2 Growth of Data
2.3.3 Data Processing and Quality Control
2.3.4 The Future of the PDB
2.3.5 Visualization
2.4 Secondary Resources and Applications
2.4.1 Structural Classification
2.4.2 Structure Prediction
2.4.3 Functional Assignments in Structural Genomics
2.4.4 Protein-Protein Interactions
2.4.5 Protein-Ligand Interactions
2.5 Using Structural Bioinformatics Approaches in Drug Design
2.6 The Future
2.6.1 Integration over Multiple Resources
2.6.2 The Impact of Structural Genomics
2.6.3 The Role of Structural Bioinformatics in Systems Biology
References

3 Database Warehousing in Bioinformatics
3.1 Introduction
3.2 Bioinformatics Data
3.3 Transforming Data to Knowledge
3.4 Data Warehousing
3.5 Data Warehouse Architecture
3.6 Data Quality
3.7 Concluding Remarks
References

4 Data Mining for Bioinformatics
4.1 Introduction
4.2 Biomedical Data Analysis
4.2.1 Major Nucleotide Sequence Database, Protein Sequence Database, and Gene Expression Database
4.2.2 Software Tools for Bioinformatics Research
4.3 DNA Data Analysis
4.3.1 DNA Sequence
4.3.2 DNA Data Analysis
4.4 Protein Data Analysis
4.4.1 Protein and Amino Acid Sequence
4.4.2 Protein Data Analysis
References

5 Machine Learning in Bioinformatics
5.1 Introduction
5.2 Artificial Neural Network
5.3 Neural Network Architectures and Applications
5.3.1 Neural Network Architecture
5.3.2 Neural Network Learning Algorithms
5.3.3 Neural Network Applications in Bioinformatics
5.4 Genetic Algorithm
5.5 Fuzzy System
References

6 Systems Biotechnology: a New Paradigm in Biotechnology Development
6.1 Introduction
6.2 Why Systems Biotechnology?
6.3 Tools for Systems Biotechnology
6.3.1 Genome Analyses
6.3.2 Transcriptome Analyses
6.3.3 Proteome Analyses
6.3.4 Metabolome/Fluxome Analyses
6.4 Integrative Approaches
6.5 In Silico Modeling and Simulation of Cellular Processes
6.5.1 Statistical Modeling
6.5.2 Dynamic Modeling
6.6 Conclusion
References

7 Computational Modeling of Biological Processes with Petri Net-Based Architecture
7.1 Introduction
7.2 Hybrid Petri Net and Hybrid Dynamic Net
7.3 Hybrid Functional Petri Net
7.4 Hybrid Functional Petri Net with Extension
7.4.1 Definitions
7.4.2 Relationships with Other Petri Nets
7.4.3 Implementation of HFPNe in Genomic Object Net
7.5 Modeling of Biological Processes with HFPNe
7.5.1 From DNA to mRNA in Eucaryotes – Alternative Splicing
7.5.2 Translation of mRNA – Frameshift
7.5.3 Huntington’s Disease
7.5.4 Protein Modification – p53
7.6 Related Works with HFPNe
7.7 Genomic Object Net: GON
7.7.1 GON Features That Derived from HFPNe Features
7.7.2 GON GUI and Other Features
7.7.3 GONML and Related Works with GONML
7.7.4 Related Works with GON
7.8 Visualizer
7.8.1 Bio-processes on Visualizer
7.8.2 Related Works with Visualizer
7.9 BPE
7.10 Conclusion
References

8 Biological Sequence Assembly and Alignment
8.1 Introduction
8.2 Large-Scale Sequence Assembly
8.2.1 Related Research
8.2.2 Euler Sequence Assembly
8.2.3 PESA Sequence Assembly Algorithm
8.3 Large-Scale Pairwise Sequence Alignment
8.3.1 Pairwise Sequence Alignment
8.3.2 Large Smith-Waterman Pairwise Sequence Alignment
8.4 Large-Scale Multiple Sequence Alignment
8.4.1 Multiple Sequence Alignment
8.4.2 Large-Scale Clustal W Multiple Sequence Alignment
8.5 Load Balancing and Communication Overhead
8.6 Conclusion
References

9 Modeling for Bioinformatics
9.1 Introduction
9.2 Hidden Markov Modeling for Biological Data Analysis
9.2.1 Hidden Markov Modeling for Sequence Identification
9.2.2 Hidden Markov Modeling for Sequence Classification
9.2.3 Hidden Markov Modeling for Multiple Alignment Generation
9.2.4 Conclusion
9.3 Comparative Modeling
9.3.1 Protein Comparative Modeling
9.3.2 Comparative Genomic Modeling
9.4 Probabilistic Modeling
9.4.1 Bayesian Networks
9.4.2 Stochastic Context-Free Grammars
9.4.3 Probabilistic Boolean Networks
9.5 Molecular Modeling
9.5.1 Molecular and Related Visualization Applications
9.5.2 Molecular Mechanics
9.5.3 Modern Computer Programs for Molecular Modeling
References

10 Pattern Matching for Motifs
10.1 Introduction
10.2 Gene Regulation
10.2.1 Promoter Organization
10.3 Motif Recognition
10.4 Motif Detection Strategies
10.4.1 Multi-genes, Single Species Approach
10.5 Single Gene, Multi-species Approach
10.6 Multi-genes, Multi-species Approach
10.7 Summary
References

11 Visualization and Fractal Analysis of Biological Sequences
11.1 Introduction
11.2 Fractal Analysis
11.2.1 What Is a Fractal?
11.2.2 Recurrent Iterated Function System Model
11.2.3 Moment Method to Estimate the Parameters of the IFS (RIFS) Model
11.2.4 Multifractal Analysis
11.3 DNA Walk Models
11.3.1 One-Dimensional DNA Walk
11.3.2 Two-Dimensional DNA Walk
11.3.3 Higher-Dimensional DNA Walk
11.4 Chaos Game Representation of Biological Sequences
11.4.1 Chaos Game Representation of DNA Sequences
11.4.2 Chaos Game Representation of Protein Sequences
11.4.3 Chaos Game Representation of Protein Structures
11.4.4 Chaos Game Representation of Amino Acid Sequences Based on the Detailed HP Model
11.5 Two-Dimensional Portrait Representation of DNA Sequences
11.5.1 Graphical Representation of Counters
11.5.2 Fractal Dimension of the Fractal Set for a Given Tag
11.6 One-Dimensional Measure Representation of Biological Sequences
11.6.1 Measure Representation of Complete Genomes
11.6.2 Measure Representation of Linked Protein Sequences
11.6.3 Measure Representation of Protein Sequences Based on Detailed HP Model
References

12 Microarray Data Analysis
12.1 Introduction
12.2 Microarray Technology for Genome Expression Study
12.3 Image Analysis for Data Extraction
12.3.1 Image Preprocessing
12.3.2 Block Segmentation
12.3.3 Automatic Gridding
12.3.4 Spot Extraction
12.3.5 Background Correction, Data Normalization and Filtering, and Missing Value Estimation
12.4 Data Analysis for Pattern Discovery
12.4.1 Cluster Analysis
12.4.2 Temporal Expression Profile Analysis and Gene Regulation
12.4.3 Gene Regulatory Network Analysis
References

Index