Bioinformatics and Its Applications in Plant Biology
Total Page:16
File Type:pdf, Size:1020Kb
ANRV274-PP57-13 ARI 5 April 2006 19:12 Bioinformatics and Its Applications in Plant Biology Seung Yon Rhee,1 Julie Dickerson,2 and Dong Xu3 1Department of Plant Biology, Carnegie Institution, Stanford, California 94305; email: [email protected] 2Baker Center for Computational Biology, Electrical and Computer Engineering, Iowa State University, Ames, Iowa 50011-3060; email: [email protected] 3Digital Biology Laboratory, Computer Science Department and Life Sciences Center, University of Missouri-Columbia, Columbia, Missouri 65211-2060; email: [email protected] Annu. Rev. Plant Biol. Key Words 2006. 57:335–60 sequence analysis, computational proteomics, microarray data The Annual Review of Plant Biology is online at analysis, bio-ontology, biological database plant.annualreviews.org Abstract doi: 10.1146/ annurev.arplant.56.032604.144103 Bioinformatics plays an essential role in today’s plant science. As the Copyright c 2006 by amount of data grows exponentially, there is a parallel growth in the by Stanford University Robert Crown Law Lib. on 05/10/07. For personal use only. Annual Reviews. All rights demand for tools and methods in data management, visualization, in- Annu. Rev. Plant Biol. 2006.57:335-360. Downloaded from arjournals.annualreviews.org reserved tegration, analysis, modeling, and prediction. At the same time, many First published online as a researchers in biology are unfamiliar with available bioinformatics Review in Advance on February 28, 2006 methods, tools, and databases, which could lead to missed oppor- tunities or misinterpretation of the information. In this review, we 1543-5008/06/0602- 0335$20.00 describe some of the key concepts, methods, software packages, and databases used in bioinformatics, with an emphasis on those relevant to plant science. We also cover some fundamental issues related to biological sequence analyses, transcriptome analyses, computational proteomics, computational metabolomics, bio-ontologies, and bio- logical databases. Finally, we explore a few emerging research topics in bioinformatics. 335 ANRV274-PP57-13 ARI 5 April 2006 19:12 the human brain to process and thus there Contents is an increasing need to use computational methods to process and contextualize these INTRODUCTION................. 336 data. SEQUENCE ANALYSIS ........... 337 Bioinformatics refers to the study of bio- Genome Sequencing ............. 337 logical information using concepts and meth- Gene Finding and Genome ods in computer science, statistics, and engi- Annotation.................... 337 neering. It can be divided into two categories: Sequence Comparison ............ 338 biological information management and com- TRANSCRIPTOME ANALYSIS. 340 putational biology. The National Institutes Microarray Analysis .............. 340 of Health (NIH) (http://www.bisti.nih.gov/) Tiling Arrays ..................... 341 defines the former category as “research, de- Regulatory Sequence Analysis ..... 341 velopment, or application of computational COMPUTATIONAL tools and approaches for expanding the use PROTEOMICS ................. 342 of biological, medical, behavioral or health Electrophoresis Analysis .......... 342 data, including those to acquire, represent, de- Protein Identification Through scribe, store, analyze, or visualize such data.” Mass Spectrometry ............ 342 The latter category is defined as “the devel- METABOLOMICS AND opment and application of data-analytical and METABOLIC FLUX ............ 344 theoretical methods, mathematical modeling, ONTOLOGIES .................... 345 and computational simulation techniques to Types of Bio-Ontologies .......... 345 the study of biological, behavioral, and social Applications of Ontologies ........ 345 systems.” The boundaries of these categories Software for Accessing and are becoming more diffuse and other cate- Analyzing Ontologies and gories will no doubt surface in the future as Annotations ................... 346 this field matures. DATABASES ....................... 347 The intention of this article is not to pro- Types of Biological Databases . 347 vide an exhaustive summary of all the advances Data Representation and Storage . 348 made in bioinformatics. Rather, we describe Data Access and Exchange . ..... 348 some of the key concepts, methods, and tools Data Curation.................... 349 used in this field, particularly those relevant EMERGING AREAS IN to plant science, and their current limitations BIOINFORMATICS............. 350 and opportunities for new development and Text Mining...................... 350 improvement. The first section introduces Computational Systems Biology. 350 sequence-based analyses, including gene find- Semantic Web.................... 351 by Stanford University Robert Crown Law Lib. on 05/10/07. For personal use only. ing, gene family and phylogenetic analy- Annu. Rev. Plant Biol. 2006.57:335-360. Downloaded from arjournals.annualreviews.org Cellular Localization and Spatially ses, and comparative genomics approaches. Resolved Data................. 351 The second section presents computational CONCLUSION .................... 351 transcriptome analysis, ranging from analy- ses of various array technologies to regula- tory sequence prediction. In section three, we focus on computational proteomics, in- INTRODUCTION cluding gel analysis and protein identifica- Recent developments in technologies and in- tion from mass-spectrometry data. Section strumentation, which allow large-scale as well four describes computational metabolomics. as nano-scale probing of biological samples, Section five introduces biological ontologies are generating an unprecedented amount of and their applications. Section six addresses digital data. This sea of data is too much for various issues related to biological databases 336 Rhee · Dickerson · Xu ANRV274-PP57-13 ARI 5 April 2006 19:12 ranging from database development to cura- highly repetitive sequences, although the cost tion. In section seven, we discuss a few emerg- of sequencing is another limitation. Recently ing research topics in bioinformatics. developed methods continue to reduce the cost of sequencing, including sequencing by using differential hybridization of oligonu- SEQUENCE ANALYSIS cleotide probes (48, 62, 101), polymorphism Biological sequence such as DNA, RNA, and ratio sequencing (16), four-color DNA protein sequence is the most fundamental sequencing by synthesis on a chip (114), and object for a biological system at the molecular the “454 method” based on microfabricated level. Several genomes have been sequenced high-density picoliter reactors (87). Each of to a high quality in plants, including Arabidop- these sequencing technologies has significant sis thaliana (130) and rice (52, 147, 148). Draft analytical challenges for bioinformatics in genome sequences are available for poplar terms of experimental design, data interpre- (http://genome.jgi-psf.org/Poptr1/) and tation, and analysis of the data in conjunction lotus (http://www.kazusa.or.jp/lotus/), and with other data (33). sequencing efforts are in progress for several others including tomato, maize, Medicago Gene Finding and Genome truncatula, sorghum (11) and close relatives Annotation of Arabidopsis thaliana. Researchers also gen- erated expressed sequence tags (ESTs) from Gene finding refers to prediction of introns many plants including lotus, beet, soybean, and exons in a segment of DNA sequence. cotton, wheat, and sorghum (see http:// Dozens of computer programs for identifying www.ncbi.nlm.nih.gov/dbEST/). protein-coding genes are available (150). Some of the well-known ones include Gen- scan (http://genes.mit.edu/GENSCAN.ht Genome Sequencing ml), GeneMarkHMM (http://opal.biology. Advances in sequencing technologies provide gatech.edu/GeneMark/ ), GRAIL (http:// opportunities in bioinformatics for manag- compbio.ornl.gov/Grail-1.3/ ), Genie ing, processing, and analyzing the sequences. (http://www.fruitfly.org/seq tools/genie. Shotgun sequencing is currently the most html), and Glimmer (http://www.tigr.org/ common method in genome sequencing: softlab/glimmer). Several new gene-finding pieces of DNA are sheared randomly, cloned, tools are tailored for applications to plant and sequenced in parallel. Software has been genomic sequences (112). developed to piece together the random, Ab initio gene prediction remains a chal- overlapping segments that are sequenced lenging problem, especially for large-sized eu- by Stanford University Robert Crown Law Lib. on 05/10/07. For personal use only. separately into a coherent and accurate con- karyotic genomes. For a typical Arabidopsis Annu. Rev. Plant Biol. 2006.57:335-360. Downloaded from arjournals.annualreviews.org tiguous sequence (93). Numerous software thaliana gene with five exons, at least one packages exist for sequence assembly (51), in- exon is expected to have at least one of its cluding Phred/Phrap/Consed (http://www. borders predicted incorrectly by the ab ini- phrap.org), Arachne (http://www.broad. tio approach (19). Transcript evidence from mit.edu/wga/), and GAP4 (http://staden. full-length cDNA or EST sequences or sim- sourceforge.net/overview.html). TIGR ilarity to potential protein homologs can sig- developed a modular, open-source package nificantly reduce uncertainty of gene identi- called AMOS (http://www.tigr.org/soft fication (154). Such methods are widely used ware/AMOS/), which can be used for com- in “structural annotation” of genomes, which parative genome assembly (102). Current refers to the identification of features such limitations in