The ISME Journal (2017) 11, 821–824 © 2017 International Society for Microbial Ecology All rights reserved 1751-7362/17 www.nature.com/ismej SHORT COMMUNICATION Assessment of the bimodality in the distribution of bacterial genome sizes Hyun S Gweon, Mark J Bailey and Daniel S Read Centre for Ecology & Hydrology, Wallingford, UK Bacterial genome sizes have previously been shown to exhibit a bimodal distribution. This phenomenon has prompted discussion regarding the evolutionary forces driving genome size in bacteria and its ecological significance. We investigated the level of inherent redundancy in the public database and the effect it has on the shape of the apparent bimodal distribution. Our study reveals that there is a significant bias in the genome sequencing efforts towards a certain group of species, and that correcting the bias using species nomenclature and clustering of the 16S rRNA gene, results in a unimodal rather than the previously published bimodal distribution. The true genome size distribution and its wider ecological implications will soon emerge as we are currently witnessing rapid growth in the number of sequenced genomes from diverse environmental niches across a range of habitats at an unprecedented rate. The ISME Journal (2017) 11, 821–824; doi:10.1038/ismej.2016.142; published online 11 November 2016 Significant progress has been made in understanding Wolf in 2008, where it was reported that bacterial interactions between ecology and genome evolution genome sizes show a bimodal distribution (Koonin in prokaryotes. A number of recent studies have and Wolf, 2008). The authors speculated that the focussed on the evolution of bacterial genome sizes observation of two distinct groups of bacteria, those (Kempes et al., 2016), indicating that the interaction with 'small' and those with 'large' genomes, directly between an organism and its ecological niche, for reflects the balance between the opposing trends of example, resource availability and environmental genome expansion through gene duplication, hor- stability, selects the genome size of the species izontal gene transfer and replication, and genome (Konstantinidis and Tiedje, 2004; Bentkowski et al., contraction caused by genome streamlining and 2015). The exact mechanisms driving the genome degradation (Koonin and Wolf, 2008). The observed sizes are still not fully resolved (Sabath et al., 2013, bimodality in the database was the first empirical Kempes et al., 2016). It has, however, been specu- evidence to show the two forces at work in bacterial lated that species living in invariant niches tend to genomes, and the bimodalilty in the distribution has have small genomes, as stability acts to reduce since attracted numerous citations in both peer- genome size due the metabolic burden of replicating reviewed articles (Lane, 2011; Mock and Kirkham, DNA with no adaptive value (Giovannoni et al., 2012; Giovannoni et al., 2014; Morán et al., 2015) 2005, 2014) such as in obligatory and intracellular and textbooks (Bergman, 2011; Koonin, 2011; pathogens or mutualists (Moran, 2003; Klasson and Kirchman, 2012; Saitou, 2014; Seshasayee, 2015). Andersson, 2004; Moya et al., 2009). Due to their A substantial proportion of complete bacterial metabolic diversity, species with large genomes are genomes in the public domain belong to human potentially able to tackle a wider range of environ- pathogens and very closely related genomes repre- mental conditions (Schneiker et al., 2007) and tend senting variations within the species (Tatusova et al., to be more ecologically successful where resources 2014). As first reported by Graur, 2014, it has been are scarce but diverse, and where there is little suggested that this fact might introduce a bias to the penalty for slow growth (Konstantinidis and Tiedje, bimodal distribution seen in the previous analyses. 2004). The effect by which these two opposing No formal treatment, however, has been carried out evolutionary forces exert on the overall distribution in the peer-reviewed literature to examine the extent of genome sizes was first observed by Koonin and of database bias and how it may affect bacterial genome size bimodality. The distribution of the bacterial genome size has broad and far-reaching Correspondence: HS Gweon, Centre for Ecology & Hydrology, implications in our understanding of prokaryotes Maclean Building, Benson Lane, Crowmarsh Gifford, Wallingford, and this in turn necessitates reassessment of the OX10 8BB, UK. E-mail: [email protected] distribution and the extent to which the bias distorts Received 23 May 2016; revised 12 August 2016; accepted the apparent bimodality. Here we present our finding 7 September 2016; published online 11 November 2016 that the bias in the database has profound influence Assessment of the bimodality in the distribution of bacterial genome sizes HS Gweon et al 822 500 0.7 Bacteria αβ 0.6 Archaea 400 Multimodal * 0.5 (D = 0.02510; p-value = < 2.2 e-16) 300 0.4 0.3 200 0.2 Frequency Density Frequency Number of genomes 100 0.1 0.0 0 0 2 4 6 8 10 12 14 0246 81012 Genome Size (Mbp) Genome Size (Mbp) * Hartigans' dip test for unimodality / multimodality with simulated p-value (based on 10000 replicates) Salmonella enterica 139 4.8 β Escherichia coli 118 5.1 β Helicobacter pylori 71 1.6 α Staphylococcus aureus 70 2.9 α Chlamydia trachomatis 56 1.0 Mean Genome Size - Bacillus cereus 52 5.6 β Burkholderia pseudomallei 51 7.2 - Listeria monocytogenes 49 3.0 α Klebsiella pneumoniae 39 6.1 - Peak Mycobacterium tuberculosis 39 5.6 β Bacillus thuringiensis 39 4.4 β Bacillus anthracis 33 5.4 β Campylobacter jejuni 31 1.7 α Acinetobacter baumannii 30 4.0 β Bacillus subtilis 28 4.2 β Francisella tularensis 28 1.9 α Yersinia pestis 26 4.7 β Streptococcus pyogenes 25 1.8 α Pseudomonas aeruginosa 24 6.6 - Neisseria meningitidis 23 2.2 α 0 20 40 60 80 100 120 140 Number of genomes Figure 1 (a) Distribution of genome sizes in bacteria and archaea: the curves were generated by Gaussian–kernel smoothing of the individual data points. The figure has a very similar pattern to the figure generated by Koonin and Wolf, 2008. The distribution of archaea was included for comparison only. (b) Distribution of genome sizes in bacteria on a different scale: the distribution shows clear-cut bimodality. Hartigans’ dip test for unimodality/multimodality with simulated P-value with 10 000 Monte Carlo replicates: D = 0.02510, Po2.2e − 16, where values o0.05 indicate significant bi- or multimodality and values > 0.10 indicate unimodality (Freeman and Dale, 2013). (c) Number of genomes from the top 20 most redundant species in the database with mean genome size and peak in which they belong. (Peak α: 1.5–3 Mbp, Peak β:4–5.5 Mbp). The top 20 most redundant species belonged to 971 genomes representing almost 25% of the entire dataset. Most of them (18 species in total) formed part of the peaks (α and β), including the top 4 species, namely Salmonella enterica, Escherichia coli, Helicobacter pylori and Staphylococcus aureus. in shaping the overall distribution of bacterial The level of redundancy in the dataset was next genome size. assessed by counting the number of genomes, which Having obtained a total of 3923 complete bacterial shared the same species classification. The entire genomes from Ensembl Bacteria database, which is dataset of 3923 genomes represented 1706 groups of the most comprehensive source of complete bacterial species with a unique species classification based on genomes (see Supplementary Information for detailed names. As shown in Figure 1c, there was a methods), the distribution of genome sizes was first significant amount of bias in the genome sequencing evaluated and compared against the distribution efforts towards a certain group of species, most of from Koonin and Wolf, 2008. Despite that almost six which belonged to well-characterised human patho- times more genomes have been archived since 2007, gens. In fact, almost 25% of the entire genome the current dataset exhibited a remarkably similar dataset was composed of just 20 species (971 bimodal distribution with its distinctive bimodal genomes). We also found that most of these highly peaks around 2 and 5 Mbp (Figure 1a). Hartigans’ redundant species belonged to the peaks in the dip test (Hartigan and Hartigan, 1985) was used to bimodal distribution. Notably, the two most redun- confirm that it features significant bimodality with a dant species, namely Salmonella enterica, Escher- P-value of 2.2e − 16 (Figure 1b), where P-values o0.05 ichia coli belonged to peak β and Helicobacter pylori, indicate significant bimodality (or multimodality) and Staphylococcus aureus belonged to peak α. P-values > 0.10 indicate unimodality (Freeman and Having observed the bias in the dataset, we Dale, 2013). assessed how much impact this has on the modality The ISME Journal Assessment of the bimodality in the distribution of bacterial genome sizes HS Gweon et al 823 (P = 0.91). The influence these redundant species 500 has on the distribution became more apparent (Figure 2b) as we evaluated the modality of the 400 distribution by progressively removing species from the dataset (from the most redundant to the least). 300 There is a sharp incline towards unimodality as redundant species were gradually excluded 200 (Figure 2b). In fact, the distribution became more or less unimodal after the top 60 redundant species Number of genomes Unimodal * 100 (D = 0.00693; p = 0.918) were removed from the dataset of 1706 species. One of the issues we faced with our approach was 0 024681012 that a large number of genomes in the dataset had Genome Size (Mbp) disorganized and inconsistent taxonomic classifica- tion. For instance, there were genomes using different naming convention such as ones with 1.0 square brackets or strain identifier attached to their species name (for example, ‘[Clostridium]-celluloly- 0.8 ticum’, ‘Francisella sp.
Details
-
File Typepdf
-
Upload Time-
-
Content LanguagesEnglish
-
Upload UserAnonymous/Not logged-in
-
File Pages4 Page
-
File Size-