Genome Based Taxonomic Classification
Total Page:16
File Type:pdf, Size:1020Kb
Genome Genome Based Taxonomic Classification Journal: Genome Manuscript ID gen-2018-0072.R4 Manuscript Type: Article Date Submitted by the 15-Dec-2018 Author: Complete List of Authors: Paul, Bobby; School of Life Sciences, Manipal Academy of Higher Education, Bioinformatics Dixit, Gunjan; School of Life Sciences, Manipal Academy of Higher Education Murali, Thokur Sreepathy; School of Life Sciences, Manipal Academy of Higher Education,Draft Biotechnology Satyamoorthy, K.; School of Life Sciences, Manipal Academy of Higher Education , Biotechnology Keyword: Bacterial taxonomy, Bioinformatics, Nucleotide identity, Nomenclature Is the invited manuscript for consideration in a Special Not applicable (regular submission) Issue? : https://mc06.manuscriptcentral.com/genome-pubs Page 1 of 17 Genome 1 Genome Based Taxonomic Classification 2 Bobby Paul, Gunjan Dixit, Thokur Sreepathy Murali, Kapaettu Satyamoorthy* 3 School of Life Sciences, Manipal Academy of Higher Education, Manipal 576104, INDIA 4 5 6 7 8 *Corresponding Author: Dr. K. Satyamoorthy 9 Address: School of Life Sciences, 10 Manipal Academy of Higher Education, 11 Manipal, Karnataka 12 INDIA 576104 13 Email: [email protected] Draft 1 https://mc06.manuscriptcentral.com/genome-pubs Genome Page 2 of 17 1 Abstract 2 Bacterial populations are routinely characterized based on the microscopic examination, colony 3 formation and biochemical tests. However, in recent past, bacterial identification, classification and 4 nomenclature have been strongly influenced by genome sequence information. Advances in 5 bioinformatics and growth in genome databases has placed genome-based metadata analysis in the 6 hands of researchers who will require taxonomic experience to resolve intricacies. To achieve this, 7 different tools are now available to quantitatively measure genome relatedness within members of 8 the same species and genome-wide average nucleotide identity (gANI) is one such reliable tool to 9 measure genome similarity. A genome assembly showed gANI score of <95% at intra-species level 10 is generally considered indicative of a separate species. In this study, we have analysed 300 whole- 11 genome sequences belonging to 26 different bacterial species available in the Genome database and 12 calculated their similarity at intra-species level based on gANI score. At the intra-species level, nine 13 bacterial species showed less than 90% gANI and more than 10% of unaligned regions. We suggest 14 appropriate use of available bioinformatics resources after genome assembly to arrive at proper 15 bacterial identification, classification andDraft nomenclature in order to avoid erroneous species 16 assignments and disparity due to diversity at the intra-species level. 17 18 19 Key words 20 Bacterial taxonomy; Bioinformatics; Nucleotide identity; Nomenclature 2 https://mc06.manuscriptcentral.com/genome-pubs Page 3 of 17 Genome 1 Introduction 2 Accurate bacterial classification is essential to identify the potential genes to understand genetic 3 mechanisms, disease diagnosis and provide better treatment especially during infectious disease 4 spread, outbreaks and progression (Fricke and Rasko 2014; Croucher and Didelot 2015; Kubicova 5 and Provaznik 2016). Bacterial classification is based on microbial techniques such as isolation, 6 culture and staining to determine morphological and biochemical features. Over the years, in 7 addition to morphological characters, phenotypic and genotypic methods have gained much 8 importance in bacterial identification and classification schema (Donelli et al. 2013). With the 9 rapid advancements in DNA sequencing techniques, genome-based methods are expected to play 10 a major role in delineating the bacterial species at genome level and provide valuable clues towards 11 their identification (Hugenholtz et al. 2016). Recently, metabarcoding and metagenomics have 12 become robust tools to study bacterial and archaeal diversity by comparative genomic analysis 13 (Srivathsan et al. 2015; Bowers et al. 2017). Increased reliability, decreased costs and time for 14 whole-genome sequencing and assemblyDraft has led to an unprecedented amount of information that 15 can provide significant insights towards drug-repositioning and development of robust molecular 16 diagnostic platforms (Fournier et al. 2014; Baltrus 2016; Hahnke et al. 2016; Hugenholtz et al. 17 2016; Jansen van Rensburg et al. 2016; Gosiewski et al. 2017). 18 It has been reported that there are millions of bacterial species of which only a fraction have been 19 identified and characterized (Eichorst et al. 2007; Kim et al. 2014). The global microbial species 20 richness is estimated to be anywhere between 107 to 109 (Schloss and Handelsman 2004). Till 21 December 2017, there were 17947 bacterial species with validly published names in Taxonomy 22 browser (https://www.ncbi.nlm.nih.gov/taxonomy). The development of whole genome 23 sequencing methods has resulted in the availability of large quantities of genomic data in public 24 repositories (Cook et al. 2016; Mukherjee et al. 2017). Over the past 20 years, more than 124,311 25 bacterial genome sequences have been deposited in the Genome database. However, the sequence 26 information needs critical evaluation before deciphering species descriptions, delineating related 27 species and assigning them into a reference genome database. Currently, bacterial species 28 delineation is based on phylogenetic analysis of marker genes such as 16S rRNA, 23S 29 rRNA, rpoB, gyrB, dnaK, etc. (Konstantinidis and Tiedje 2005, Liu et al. 2012). However, this 30 approach takes into consideration evolution of single genes for grouping the species or delineating 31 a species in a genus. Genome-level dissimilarity can happen due to the cross-species genome 3 https://mc06.manuscriptcentral.com/genome-pubs Genome Page 4 of 17 1 contaminations (Merchant et al. 2014; Mukherjee et al. 2015; Kryukov and Imanishi 2016), lateral 2 gene transfer (Gillings 2017), and insertion sequences (Vincent et al. 2016). In addition, nucleotide 3 substitutions, gene insertions and deletions, gene duplications and rearrangements, which have 4 significant impact on evolution of prokaryotic genome (Coenye et al. 2005), are also considered 5 as critical determinants while defining a species. Horizontal gene transfer (HGT) plays an 6 important role in genome adaptation and evolution. Acquisition of new genes by HGT allows for 7 habitat-specific adaptation and ultimately results in a subsequent divergence of species or 8 speciation (Ochman et al. 2000). Additionally, genomes of several symbiotic bacterial species 9 show the existence of smaller genomes, codon reassignments and extreme biases in genome 10 sequences (McCutcheon and Moran 2012). Hence, it is imperative to look at whole genome 11 information rather than at selected genes for arriving at relatedness between species in a given 12 genus. 13 To further add to the complexity, advances in bioinformatics and growth in genome databases 14 requires researchers with taxonomic experienceDraft to perform genome-based metadata analysis in 15 order to critically evaluate the sequence information of genomes before assigning them into a 16 reference genome database. Though, a conclusive quantitative measure of genome relatedness is 17 indeed difficult, different methods have been routinely used to measure the relatedness within 18 members of the same species. The average nucleotide identity (ANI) is one such reliable measure 19 aimed at long-standing, systematic and scalable bacterial species classification (Kim et al. 2014). 20 Calculation of genome-wide ANI (gANI) score usually involves the fragmentation of genome 21 sequences, followed by calculating the nucleotide similarity between sequences by pairwise 22 alignment based only on the aligned regions. This is widely acknowledged as a robust measure of 23 genome relatedness based on the studies carried out on 13,151 prokaryotic genomes assigned to 24 3032 species (Varghese et al. 2015). Therefore the present study was aimed at identifying the 25 genome level similarity between the strains of selected bacterial species based on the data garnered 26 from genome assemblies available at the public databases by using parameter such as gANI. 27 28 Materials and methods 29 To highlight the genome-wide diversity at the species level, we analyzed a total of 300 genome 30 assemblies retrieved from the Genome database (https://www.ncbi.nlm.nih.gov/genome/), which 31 comprised of 26 unique bacterial species (Table 1). The selection of bacterial species for the study 4 https://mc06.manuscriptcentral.com/genome-pubs Page 5 of 17 Genome 1 was based on two criteria: a) the organism must have whole genome sequence in the completed 2 stage and b) the organism should have only one chromosome with about 10-15 available genome 3 assemblies. The genome sequence(s) were retrieved in Fasta file format and features including 4 genome size, %GC, number of genes and proteins were derived in tabular format. Using Microsoft 5 Excel, we calculated the differences in genome size, G+C content, number of coding genes and 6 non-coding genes at species level for all the short-listed strains. To determine the genome-wide 7 average nucleotide identity (gANI) at species level, multiple genome assemblies of the same 8 species were aligned with each other. Jspecies (http://jspecies.ribohost.com/jspeciesws/), a 9 BLAST-based tool was used for the calculation of gANI score and genome coverage (Richter et 10 al. 2015). Using