Comparing Genomes: Databases and Computational Tools for Comparative Analysis of Prokaryotic Genomes DOI: 10.3395/Reciis.V1i2.Sup.105En

[www.reciis.cict.fiocruz.br] ISSN 1981-6286 SUPPLEMENT – BIOINFORMATICS AND HEALTH Review Articles Comparing genomes: databases and computational tools for comparative analysis of prokaryotic genomes DOI: 10.3395/reciis.v1i2.Sup.105en Marcos Antonio Basílio Catanho de Miranda Laboratório de Genômica Laboratório de Genômica Funcional e Bioinformática Funcional e Bioinformática do Instituto Oswaldo Cruz do Instituto Oswaldo Cruz da Fundação Oswaldo da Fundação Oswaldo Cruz, Cruz, Rio de Janeiro, Brazil Rio de Janeiro, Brazil [email protected] [email protected] Wim Degrave Laboratório de Genômica Funcional e Bioinformática do Instituto Oswaldo Cruz da Fundação Oswaldo Cruz, Rio de Janeiro, Brazil [email protected] Abstract Since the 1990’s, the complete genetic code of more than 600 living organisms has been deciphered, such as bacteria, yeasts, protozoan parasites, invertebrates and vertebrates, including Homo sapiens, and plants. More than 2,000 other genome projects representing medical, commercial, environmental and industrial interests, or comprising mo- del organisms, important for the development of the scientific research, are currently in progress. The achievement of complete genome sequences of numerous species combined with the tremendous progress in computation that occurred in the last few decades allowed the use of new holistic approaches in the study of genome structure, organization and evolution, as well as in the field of gene prediction and functional classification. Numerous public or proprietary databases and computational tools have been created attempting to optimize the access to this information through the web. In this review, we present the main resources available through the web for comparative analysis of prokaryotic genomes. We concentrated on the group of mycobacteria that contains important human and animal pathogens. The birth of Bioinformatics and Computational Biology and the contributions of these disciplines to the scientific development of this field are also discussed. Keywords Bioinformatics, computational biology, databases, genomes, prokaryotes Sup334 RECIIS – Elet. J. Commun. Inf. Innov. Health. Rio de Janeiro, v.1, n.2, Sup. 1, p.Sup334-Sup355, Jul.-Dec., 2007 The beginning of a new era: the birth the (iii) development of high-throughput computing of Bioinformatics and Computational technologies and (iv) more efficient algorithms, allowed Biology holistic approaches (which consider the whole body of available information, such as all genes encoded by a Bioinformatics and Computational Biology (BIS- group of genomes) to be used in the study of genome TIC Definition Committee 2000) emerged in the structure, organization and evolution, in differential 1960’s when computers became essential tools for the expression analyses of genes and proteins, in protein development of Molecular Biology. According to HA- three-dimensional structure predictions, in the process GEN (2000), this emergence was motivated by three of metabolic reconstruction, and in the functional main factors: (i) the increasing availability of protein prediction of genes. As a result, at least two general sequences, providing both a source of data and a set rules about biological systems (summarizing a number of relevant challenges impossible to cope without com- of experimental evidences) can be derived from the puter assistance; (ii) the idea that macromolecules carry exercise of Bioinformatics and Computational Biology information had became fundamental in the Molecular over these almost five decades, given the existence of Biology conceptual framework; (iii) the availability of numerous corollaries resulting from them with direct powerful computers in research centres. application in biological researches: (i) the three-dimen- Several algorithms and computational programs for sional structures of proteins are much more conserved the analysis of structure, function, and evolution at the than their biochemical functions; (ii) in contrast to molecular level, as well as rudimentary protein sequences genomic sequences, the comparison of the total num- databases, were already available towards the end of the ber of genes encoded by each individual in a group of 1960’s (HAGEN, 2000; reviewed by OUZOUNIS & organisms do not reflects the phylogeny of the species VALENCIA, 2003). New algorithms and computational involved (OUZOUNIS, 2002). approaches were introduced in the following decades, such as algorithms for sequence alignments, public da- New challenges, new approaches: the tabases, efficient data retrieval systems, sophisticated protein structure prediction methods, gene annotation comparative analysis of prokaryotic and genome comparison tools, and systems for functional genomes genome analysis (OUZOUNIS, 2002). The pioneering initiative of the U.S. Department However, Bioinformatics and Computational of Energy (DOE) to obtain a reference human genome Biology might only be recognized as independent disci- sequence culminated in the launching of the Human plines, with their own problems and achievements, by Genome Project, in 1990. The initial plan was achieve the decade of 1980 when, for the first time, efficient a deeper understanding of potential health and environ- algorithms were developed to cope with the increasing mental risks caused by the production and use of new amount of information, and computer implementations energy resources and technologies. Later, the technologi- of these algorithms (programs) were made available for cal resources generated by this project stimulated the the entire scientific community (OUZOUNIS & VALEN- development of many other public and private genome CIA, 2003). The consolidation of both new disciplines project initiatives (HGP 2001). occurred in the 1990’s, with the emergence of powerful So far, 70 eukaryotic genomes have already been personal computers, supercomputers, the World Wide completely sequenced. They include the human genome Web, huge biological databases and the so-called ome (VENTER et al., 2001; LANDER et al., 2001), some other projects: genome, transcriptome, and proteome, sup- vertebrates and plants. In addition, the complete genome ported by the continuous progress in DNA sequencing, sequence of 47 archaeobacteria and 543 eubacteria are the development of microarrays and biochip technolo- also available, and 2,258 other projects are currently in gies, and mass spectrometry. progress (GOLD, 2007). Concerning the mycobacteria Actually, the achievement of (i) numerous com- group (GOODFELLOW & MINNIKIN, 1984) in particu- plete genome sequences, (ii) gene and protein expres- lar, the genomes of 16 species have already been entirely sion data of cells, tissues and organs, combined with sequenced and 23 others are on going (Table 1). Table 1 - Mycobacterial genome projects Species or strain Importance Research Centre URL Status Medical; human and http://www.ncbi.nlm.nih.gov/sites/entr M. tuberculosis Beijing Genomics animal pathogen; causes ez?db=genome&cmd=Retrieve&dopt Complete H37Ra Institute tuberculosis =Overview&list_uids=21081 Medical; animal, cattle, http://www.broad.mit.edu/annotation/ M. tuberculosis and human pathogen; Broad Institute genome/mycobacterium_tuberculosis_ Complete F11 (ExPEC) causes tuberculosis. spp/MultiHome.html cont. RECIIS – Elet. J. Commun. Inf. Innov. Health. Rio de Janeiro, v.1, n.2, Sup. 1, p.Sup334-Sup355, Jul.-Dec., 2007 Sup335 Table 1 - Mycobacterial genome projects (cont.) Medical; animal, cattle, M. bovis BCG http://www.pasteur.fr/recherche/unites/ and human pathogen; Institut Pasteur Complete Pasteur 1173P2 Lgmb/mycogenomics.html causes tuberculosis. M. ulcerans Medical; human patho- http://www.pasteur.fr/recherche/unites/ Institut Pasteur Complete Agy99 gen; causes Buruli ulcer. Lgmb/mycogenomics.html M. flavenscens Biotechnological; isolated Joint Genome http://genome.jgi-psf.org/finished_ Complete PYR-GCK from soil. Institute microbes/mycfl/mycfl.home.html M. vanbaalenii Biotechnological; isolated Joint Genome http://genome.jgi-psf.org/finished_ Complete PYR-1 from soil. Institute microbes/mycva/mycva.home.html Biotechnological; isolated Mycobacterium Joint Genome http://genome.jgi-psf.org/finished_ from creosote-contami- Complete sp JLS Institute microbes/myc_j/myc_j.home.html nated soil. Biotechnological; isolated Mycobacterium Joint Genome http://genome.jgi-psf.org/finished_ from creosote-contami- Complete sp KMS Institute microbes/myc_k/myc_k.home.html nated soil. Biotechnological; isolated Mycobacterium Joint Genome http://genome.jgi-psf.org/finished_ from creosote-contami- Complete sp MCS Institute microbes/myc_k/myc_k.home.html nated soil. Medical; human and M. tuberculosis http://www.sanger.ac.uk/Projects/M_ animal pathogen; causes Sanger Institute Complete H37Rv tuberculosis/ tuberculosis. Medical; animal, cattle, M. bovis Sanger Institute/ http://www.sanger.ac.uk/Projects/ and human pathogen; Complete AF2122/97 Institut Pasteur M_bovis/ causes tuberculosis Medical; human patho- Sanger Institute/ http://www.sanger.ac.uk/Projects/ M. leprae TN Complete gen; causes leprosy. Institut Pasteur M_leprae/ Medical; animal pathogen; http://www.ncbi.nlm.nih.gov/sites/entr The Institute for M. avium 104 causes respiratory infec- ez?db=genome&cmd=Retrieve&dopt Complete Genomic Research tion. =Overview&list_uids=20086 Medical; human patho- M. smegmatis The Institute for http://www.tigr.org/tigr-scripts/CMR2/ gen; opportunistic infec- Complete MC2 155 Genomic Research GenomePage3.spl?database=gms tion. Medical; animal

Comparing Genomes: Databases and Computational Tools for Comparative Analysis of Prokaryotic Genomes DOI: 10.3395/Reciis.V1i2.Sup.105En

Computer Applications Making Rapid Advances in High Throughput Microbial Proteomics (HTMP) Balakrishna Anandkumar1,§, Steve W

For Immediate Release. Oct. 8Th, 2010 Microbesonline.Org Enhanced for Metagenomics and Metabolism

OMA Browser—Exploring Orthologous Relations Across 352 Complete Genomes Adrian Schneider*, Christophe Dessimoz and Gaston H

GTL PI Meeting 2006 Abstracts Molecular Interactions

Comparative Genomics and Functional Analysis of Rhamnose Catabolic Pathways and Regulons in Bacteria Irina A

Environmental Conditions Shape the Nature of a Minimal Bacterial Genome

VIMSS Computational Microbiology Core Research on Comparative and Functional Genomics

Pangenome Analysis Toolkit

Myoglobin Primary Structure Reveals Multiple Convergent Transitions To

GTL PI Meeting 2007 Abstracts

The Impact of Long-Distance Horizontal Gene Transfer on Prokaryotic Genome Size

Joint Genome Institute PROGRESS REPORT 2006 JGI’S Mission