Insights from 20 Years of Bacterial Genome Sequencing
Total Page:16
File Type:pdf, Size:1020Kb
Funct Integr Genomics DOI 10.1007/s10142-015-0433-4 REVIEW Insights from 20 years of bacterial genome sequencing Miriam Land & Loren Hauser & Se-Ran Jun & Intawat Nookaew & Michael R. Leuze & Tae-Hyuk Ahn & Tatiana Karpinets & Ole Lund & Guruprased Kora & Trudy Wassenaar & Suresh Poudel & David W. Ussery Received: 19 January 2015 /Revised: 11 February 2015 /Accepted: 12 February 2015 # The Author(s) 2015. This article is published with open access at Springerlink.com Abstract Since the first two complete bacterial genome se- in a few hours and identify some types of methylation sites quences were published in 1995, the science of bacteria has along the genome as well. Sequencing of bacterial genome dramatically changed. Using third-generation DNA sequenc- sequences is now a standard procedure, and the information ing, it is possible to completely sequence a bacterial genome from tens of thousands of bacterial genomes has had a major impact on our views of the bacterial world. In this review, we This manuscript has been authored by a contractor of the US Government explore a series of questions to highlight some insights that under contract No. DE-AC05-00OR22725. Accordingly, the US comparative genomics has produced. To date, there are ge- Government retains a paid-up, nonexclusive, irrevocable, worldwide license to publish or reproduce the published form of this contribution, nome sequences available from 50 different bacterial phyla prepare derivative works, distribute copies to the public, and perform and 11 different archaeal phyla. However, the distribution is publicly and display publicly or allow others to do so, for US quite skewed towards a few phyla that contain model organ- Government purposes. isms. But the breadth is continuing to improve, with projects : : < : : < : M. Land L.: Hauser S.:R. Jun I. Nookaew T. H. Ahn dedicated to filling in less characterized taxonomic groups. T. Karpinets S. Poudel D. W. Ussery (*) The clustered regularly interspaced short palindromic repeats Comparative Genomics Group, Biosciences Division, Oak Ridge (CRISPR)-Cas system provides bacteria with immunity National Laboratory, Oak Ridge, TN 37831, USA e-mail: [email protected] against viruses, which outnumber bacteria by tenfold. How : fast can we go? Second-generation sequencing has produced L. Hauser D. W. Ussery a large number of draft genomes (close to 90 % of bacterial Joint Institute for Biological Sciences, University of Tennessee, genomes in GenBank are currently not complete); third- Knoxville, TN 37996, USA generation sequencing can potentially produce a finished ge- L. Hauser nome in a few hours, and at the same time provide Department of Microbiology, University of Tennessee, methlylation sites along the entire chromosome. The diversity Knoxville, TN 37996, USA of bacterial communities is extensive as is evident from the M. R. Leuze : T.<H. Ahn : G. Kora genome sequences available from 50 different bacterial phyla Computer Science and Mathematics Division, Computer Science and 11 different archaeal phyla. Genome sequencing can help Research Group, Oak Ridge National Laboratory, Oak in classifying an organism, and in the case where multiple Ridge, TN 37831, USA genomes of the same species are available, it is possible to O. Lund : D. W. Ussery calculate the pan- and core genomes; comparison of more than Center for Biological Sequence Analysis, Department of Systems 2000 Escherichia coli genomes finds an E. coli core genome Biology, The Technical University of Denmark, Kgs. Lyngby 2800, of about 3100 gene families and a total of about 89,000 dif- Denmark ferent gene families. Why do we care about bacterial genome T. Wassenaar sequencing? There are many practical applications, such as Molecular Microbiology and Genomics Consultants, Tannenstr 7, genome-scale metabolic modeling, biosurveillance, 55576 Zotzenheim, Germany bioforensics, and infectious disease epidemiology. In the near future, high-throughput sequencing of patient metagenomic S. Poudel : D. W. Ussery Genome Science and Technology, University of Tennessee, samples could revolutionize medicine in terms of speed and Knoxville, TN 37996, USA accuracy of finding pathogens and knowing how to treat them. Funct Integr Genomics Keywords Bacteria . Comparative genomics . Bacterial growing another hundredfold—that is, there are more than 30, genomes . Metagenomics . Core-genome . Pan-genome . 000 sequenced bacterial genomes currently publically avail- Next-generation sequencing able in 2014 (NCBI 2014) and thousands of metagenome projects (GOLD 2014). Projects such as the Genomic Encyclopedia of Bacteria and Archaea (GEBA) (Kyrpides et al. 2014) promise to not only add more genomes but expand the genetic diversity and add to the list of available types of Introduction strains. For many years, ribosomal RNA (rRNA) operons, specif- Two decades have passed since the first bacterial genome was ically the 16S rRNA genes, were used as the primary tool for completely sequenced (Fleischmann et al. 1995;Fraseretal. taxonomic assignment and phylogenetic trees (Mizrahi-Man 1995), and the technical improvements and subsequent in- et al. 2013). The 16S rRNA gene is still widely used because it creases in biological knowledge have been just as dramatic is present in at least one copy in every bacterial genome, its in the second 10 years as they were in the first decade. The conserved regions enable simple sample identification using most significant factor influencing scientific progress was, as PCR, and its sequence provides reliable information on bac- predicted, the vast reduction in the price of sequencing, as a terial family, genus, or species in most cases. This single gene result of technical developments. Along with the cost reduc- comparison is now being replaced by more comprehensive tion, second-generation (or Bnext-gen^) sequencing tech- approaches. Full genome sequencing along with additional niques dramatically reduced the average read length; in con- tools can comprehensively analyze and classify hundreds or trast, third-generation (single molecule) sequencing allows for thousands of genomes. These new tools have led to new un- longer read lengths, although at the time of writing, these derstandings of genetic relationships that the 16S rRNA gene methods are still in their infancy. The dramatic reduction in only approximates. the cost of sequencing has made bacterial genome sequencing A notable development in the second decade of bacterial affordable to a great number of labs, leading to a democrati- genome sequencing was the generation of metagenomic zation of sequencing (Shendure and Ji 2008). The explosive data, which covers all DNA present in a given sample growth of data has resulted in a cost shift from sequencing to (Mende et al. 2012). The study of metagenomes was so assembly, analysis, and managing data. new in the last review that the term needed to be defined, Ten years ago, we reviewed the first decade of bacterial as at that time there were only two metagenomic projects genome sequencing (Binnewies et al. 2006). At that time, published. Today, there are more than 20,000 metagenomic there were about 300 sequenced bacterial genomes and only projects publically available, and many terabytes of se- two published metagenomic projects; this represented a quencing data have been produced. The myriad of ecosys- growth of more than 100-fold from the mere two genomes tems includes numerous animal and human microbiomes, sequenced in 1995. The number of sequenced genomes has soils of all types, fresh and salt water samples, and even continued to increase dramatically in the last 10 years (Fig. 1), plant–microbe interaction systems. Fig. 1 Number of bacterial and 16,000 archaeal genomes sequenced each year and submitted to NCBI. 14,000 Source: GenBank prokaryotes.txt file downloaded 4 February 2015 12,000 10,000 8,000 6,000 4,000 2,000 Number of genomes sequencedNumber of genomes 0 Year Funct Integr Genomics As observed 10 years ago, the diversity of bacteria con- Overview of available data tinues to expand and surprise (Lagesen et al. 2010). Instead of 20 Escherichia coli genomes, we now have thousands that In 1995, when the first bacterial genomes were sequenced, can be compared (Cook and Ussery 2013), and they still give GenBank had already grown more than 500-fold from when us new insights into the diversity and plasticity of bacterial it was first started, in 1982. Ten years later, as automated genomes. sequencing became more common, GenBank had grown to The nature of data to be analyzed is changing. For example, more than 75,000 times its original size. Almost 20 years later, microarray analysis of transcriptomes is being replaced by at the time of writing this article, complete genomes in RNA sequencing (Wang et al. 2014; Westermann et al. GenBank appear to be slowing down a bit in favor of other 2012;Zhaoetal.2014), which has some substantial advan- types of submissions. Starting with their introduction in 2002, tages, although the statistical analysis packages for this data WGS bases have kept pace with or exceeded GenBank bases are continually evolving and are by no means standardized. and the addition of Sequence Read Archive (SRA) bases in The stories revealed from analysis of these sample 2008 have dwarfed them both (Fig. 2). metagenomes, especially the human microbiomes, have dra- As of January 2015, the SRA contained more than 1500 matically changed our view of the microbial world to the point trillion (1015) nucleotides or 8000 times the size of GenBank that the general public is now aware of the possible beneficial and Ensemble (Ensemble 2015) had over 20,000 single isolate effects of bacteria on their health and not just as the source of genomes. Indicators of the genomes in process include the illness (Claesson et al. 2012; Huttenhower et al. 2012). Genomes Online Database (GOLD) (Pagani et al. 2012), The ever-increasing amount and complexity of generat- which had 47,083 prokaryotic genomes and the MG-RAST ed sequences has large implications for analysis of this system listed 152,927 metagenomes, of which 23,242 are data.