The Integrated Microbial Genomes Database and Comparative Analysis System

University of Nebraska - Lincoln DigitalCommons@University of Nebraska - Lincoln US Department of Energy Publications U.S. Department of Energy 2012 IMG: the integrated microbial genomes database and comparative analysis system Victor M. Markowitz Lawrence Berkeley National Laboratory, [email protected] I-Min A. Chen Lawrence Berkeley National Laboratory, [email protected] Krishna Palaniappan Lawrence Berkeley National Laboratory Ken Chu Lawrence Berkeley National Laboratory Ernest Szeto Lawrence Berkeley National Laboratory See next page for additional authors Follow this and additional works at: https://digitalcommons.unl.edu/usdoepub Part of the Bioresource and Agricultural Engineering Commons Markowitz, Victor M.; Chen, I-Min A.; Palaniappan, Krishna; Chu, Ken; Szeto, Ernest; Grechkin, Yuri; Ratner, Anna; Jacob, Biju; Huang, Jinghua; Williams, Peter; Huntemann, Marcel; Anderson, Iain; Mavromatis, Konstantinos; Ivanova, Natalia N.; and Kyrpides, Nikos C., "IMG: the integrated microbial genomes database and comparative analysis system" (2012). US Department of Energy Publications. 291. https://digitalcommons.unl.edu/usdoepub/291 This Article is brought to you for free and open access by the U.S. Department of Energy at DigitalCommons@University of Nebraska - Lincoln. It has been accepted for inclusion in US Department of Energy Publications by an authorized administrator of DigitalCommons@University of Nebraska - Lincoln. Authors Victor M. Markowitz, I-Min A. Chen, Krishna Palaniappan, Ken Chu, Ernest Szeto, Yuri Grechkin, Anna Ratner, Biju Jacob, Jinghua Huang, Peter Williams, Marcel Huntemann, Iain Anderson, Konstantinos Mavromatis, Natalia N. Ivanova, and Nikos C. Kyrpides This article is available at DigitalCommons@University of Nebraska - Lincoln: https://digitalcommons.unl.edu/ usdoepub/291 Nucleic Acids Research, 2012, Vol. 40, Database issue D115–D122 doi:10.1093/nar/gkr1044 IMG: the integrated microbial genomes database and comparative analysis system Victor M. Markowitz1,*, I-Min A. Chen1, Krishna Palaniappan1, Ken Chu1, Ernest Szeto1, Yuri Grechkin1, Anna Ratner1, Biju Jacob1, Jinghua Huang1, Peter Williams2, Marcel Huntemann2, Iain Anderson2, Konstantinos Mavromatis2, Natalia N. Ivanova2 and Nikos C. Kyrpides2,* 1Biological Data Management and Technology Center, Lawrence Berkeley National Laboratory, 1 Cyclotron Road, Berkeley and 2Microbial Genomics and Metagenomics Program, Department of Energy Joint Genome Institute, 2800 Mitchell Drive, Walnut Creek, USA Received September 15, 2011; Accepted October 24, 2011 ABSTRACT from RefSeq including its organization into chromosomal replicons (for finished genomes) and scaffolds and/or The Integrated Microbial Genomes (IMG) system contigs (for draft genomes), together with predicted serves as a community resource for comparative protein-coding sequences (CDSs), some RNA-coding analysis of publicly available genomes in a compre- genes and protein product names that are provided by hensive integrated context. IMG integrates publicly the genome sequence centres. available draft and complete genomes from all three IMG’s data integration pipeline associates every domains of life with a large number of plasmids genome with metadata from GOLD (2), and fills in add- and viruses. IMG provides tools and viewers for itional information potentially missing from the RefSeq analyzing and reviewing the annotations of genes files such as CRISPR repeats (3), signal peptides and genomes in a comparative context. IMG’s data computed using SignalP (4) and transmembrane helices content and analytical capabilities have been con- computed using TMHMM (5). Missing RNAs are identified using tRNAS-can-SE-1.23 (6) for tRNAs, in tinuously extended through regular updates since house developed HMMs for rRNAs (7), and Rfam (8) its first release in March 2005. IMG is available at and INFERNAL v1.0 (9) for other small RNAs. Genes http://img.jgi.doe.gov. Companion IMG systems are associated with ‘secondary’ functional annotations provide support for expert review of genome anno- and lists of related (e.g. homologue, paralogue) genes. tations (IMG/ER: http://img.jgi.doe.gov/er), teaching IMG generated annotations consist of protein family courses and training in microbial genome analysis and domain characterizations based on COG clusters (IMG/EDU: http://img.jgi.doe.gov/edu) and analysis and functional categories (10), Pfam (11), TIGRfam and of genomes related to the Human Microbiome TIGR role categories (12), InterPro domains (13), Gene Project (IMG/HMP: http://www.hmpdacc-resources Ontology (GO) terms (14) and KEGG Ortholog (KO) .org/img_hmp). terms and pathways (15). The association of KEGG pathways with IMG genomes is based on the assignment of KEGG Orthology (KO) terms to IMG genes via a mapping of INTRODUCTION IMG genes to KEGG genes. The MetaCyc collection of The Integrated Microbial Genomes (IMG) system inte- pathways (16) is also available in IMG, whereby the asso- grates publicly available draft and complete microbial ciation of MetaCyc pathways with IMG genomes is based genomes from all three domains of life with a large on correlating enzyme EC numbers in MetaCyc reactions number of plasmids and viruses. IMG employs NCBI’s with EC numbers associated with IMG genes via KO RefSeq resource (1) as its main source of public genome terms. Genes are further characterized using an IMG sequence data, and ‘primary’ annotations consisting of native collection of generic (protein cluster-independent) predicted genes and protein products. For every genome, functional roles called IMG terms that are defined by IMG records its primary genome sequence information their association with generic (organism-independent) *To whom correspondence should be addressed. Tel: +510 486 7073; Fax: +1 510 486 5812; Email: [email protected] Correspondence may also be addressed to Nikos C. Kyrpides. Tel: +925 296 5718; Fax: +925 296 5666; Email: [email protected] ß The Author(s) 2011. Published by Oxford University Press. This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/ by-nc/3.0), which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited. D116 Nucleic Acids Research, 2012, Vol. 40, Database issue functional hierarchies, called IMG pathways (17). IMG The ‘Expert Review’ version of IMG, IMG/ER (24), terms and pathways are specified by domain experts at allows individual scientists or groups of scientists to DOE-JGI as part of the process of annotating specific review and curate the functional annotation of microbial genomes of interest, and are subsequently propagated to genomes in the context of IMG’s public genomes. all the genomes in IMG using a rule based methodology Scientists can submit their private genome data sets into (18). Transporter genes are linked to the Transport Classi- IMG ER (using password protected access) prior to their fication Database (19) based on their assignment to COG, public release either with their original annotations Pfam or TIGRfam domains or IMG Terms that or with annotations generated by IMG’s annotation correspond to transporter families. pipeline (18). Since August 2009, close to 750 private For each gene, IMG provides lists of related (e.g. can- genomes have been reviewed and curated using IMG/ER. didate homologue, paralogue, orthologue) genes that are Genomes generated as part of the Human Microbiome based on sequence similarities computed using NCBI Project (HMP) (25) and the Genome Encyclopedia of BLASTp for protein coding genes and BLASTn for Bacterial and Archaea Genomes (GEBA) project (26) RNA genes. Such lists of genes can be filtered using are of special interest. With the goal of characterizing percent identity, bit score and more stringent E-values. microbial communities found at multiple human body IMG’s data integration pipeline identifies gene fusions sites, HMP has initially focused on the sequencing of and conserved gene cassettes (putative operons). A fused reference genomes from both cultured and uncultured gene (fusion) is defined as a gene that is formed from the bacteria (25). Over 550 reference genomes sequenced as composition (fusion) of two or more previously separate part of the HMP initiative, as well as over 1500 genomes genes (20). Transposases and integrases, pseudogenes, and associated with a human host and thus relevant to HMP, genes from draft genomes are not considered as putative can be examined and analyzed using IMG/HMP fusion components in order to avoid false positives caused (http://www.hmpdacc-resources.org/img_hmp/), which is by gene fragmentation. A ‘chromosomal cassette’ is provided as part of the HMP Data Analysis and defined as a stretch of genes with intergenic distance Coordination Center (DACC). smaller or equal to 300 bp (21), whereby the genes can The aim of the GEBA is to fill systematically the be on the same or different strands of the chromosome. sequencing gaps along the bacterial and archaeal Chromosomal cassettes with a minimum size of two genes branches of the tree of life. After a pilot project in 2009 common in at least two separate genomes are defined as that generated complete genomes for about 100 organisms ‘conserved chromosomal cassettes’. The identification of (26), the number of sequenced GEBA genomes has common genes across organisms is based on three gene steadily increased and stands at 205 as of August 2011. clustering methods, namely

The Integrated Microbial Genomes Database and Comparative Analysis System

Comparative Genomic Analysis of Three Pseudomonas

Senegalemassilia Anaerobia Gen. Nov., Sp. Nov

Three New Genome Assemblies Support a Rapid Radiation in Musa Acuminata (Wild Banana)

UNIVERSITY of CALIFORNIA, SAN DIEGO the Comparative Genomics

Compact Graphical Representation of Phylogenetic Data and Metadata with Graphlan

Across Bacterial Phyla, Distantly-Related Genomes with Similar Genomic GC Content Have Similar Patterns of Amino Acid Usage

Various Complexes of the Oral Microbial Flora in Periodontal Disease

Comparative and Genetic Analysis of the Four Sequenced Paenibacillus

Downloaded from a Variety of Sources (For Details, See Table S1) and Between Metazoa and Dictyostelium

Analysis of Class C G-Protein Coupled Receptors Using Supervised

Comparative Analysis of Plant Genomes Through Data Integration

Combining Morphological and Metabarcoding Approaches Reveals the Freshwater Eukaryotic Phytoplankton Community