Genomic Insights That Advance the Species Definition for Prokaryotes
Konstantinos T. Konstantinidis*† and James M. Tiedje*†‡§

*Center for Microbial Ecology, and Departments of †Crop and Soil Sciences and ‡Microbiology and Molecular Genetics, Michigan State University, East Lansing, MI 48824

Contributed by James M. Tiedje, December 24, 2004

To help advance the species definition for prokaryotes, we have compared the gene content of 70 closely related and fully sequenced bacterial genomes to identify whether species boundaries exist, and to determine the role of the organism's ecology on its shared gene content. We found the average nucleotide identity (ANI) of the shared genes between two strains to be a robust means to compare genetic relatedness among strains, and that ANI values of ≈94% corresponded to the traditional 70% DNA–DNA reassociation standard of the current species definition. At the 94% ANI cutoff, current species include only moderately homogeneous strains, e.g., most of the >4-Mb genomes share only 65–90% of their genes, apparently as a result of the strains having evolved in different ecological settings. Furthermore, diagnostic genetic signatures (boundaries) are evident between groups of strains of the same species, and the intergroup genetic similarity can be as high as 98–99% ANI, indicating that justifiable species might be found even among organisms that are nearly identical at the nucleotide level. Notably, a large fraction, e.g., up to 65%, of the differences in gene content within species is associated with bacteriophage and transposase elements, revealing an important role of these elements during bacterial speciation. Our findings are consistent with a definition for species that would include a more homogeneous set of strains than provided by the current definition and one that considers the ecology of the strains in addition to their evolutionary distance.

MICROBIOLOGY

prokaryotic diversity | species concept | nucleotide identity | comparative genomics | evolution

Abbreviations: ANI, average nucleotide identity; CDS, protein-coding sequence; COG, Cluster of Orthologous Genes.
§To whom correspondence should be addressed. E-mail: [email protected].
© 2005 by The National Academy of Sciences of the USA
www.pnas.org/cgi/doi/10.1073/pnas.0409727102
PNAS | February 15, 2005 | vol. 102 | no. 7 | 2567–2572

A bacterial species is essentially considered to be a collection of strains that are characterized by at least one diagnostic phenotypic trait and whose purified DNA molecules show 70% or higher reassociation values, following the recommendations in the classical paper by Wayne et al. (1). This species definition, while pragmatic and universally applicable within the prokaryotic world (2–4), has been criticized for being difficult to implement because of technological limitations in identifying diagnostic traits and in performing the pairwise DNA–DNA reassociation experiments, and for often not being adequately predictive of phenotype (5–7). Furthermore, this definition is much broader and is not encompassed by any of the eukaryotic species definitions (8). Indeed, applying this standard to eukaryotic species would lead to the inclusion of members of many taxonomic tribes in the same species, e.g., all of the primates would then belong to the same species (8–10). Last, several strains that show >70% DNA–DNA reassociation values are classified into different species, even different genera, usually on the basis of pathogenicity or host range, such as strains of Escherichia coli and Shigella spp. (11), making the current prokaryotic classification somewhat inconsistent.

To gain insight into these issues, we performed pairwise, whole-genome comparisons between all related (i.e., showing >94% 16S rRNA gene identity) sequenced bacterial strains to determine both the conserved predicted protein-coding genes between the pair of strains and the strain-specific genes. We then studied how these parameters correlate with the evolutionary distance between the strains and the strain assignment to species. This analysis is most informative, with respect to the species definition, because it concerns genes that largely determine the organism's phenotype. Further, our strain set represents several major bacterial lineages, including α- and -proteobacteria, low GC Gram-positive bacilli, streptococci, and staphylococci, and high GC Gram-positive mycobacteria, allowing for robust interpretations. We found that strains of the same species can vary up to 30% in gene content, raising questions as to whether they should belong to the same species, and that these intraspecies differences are presumably driven by differences in the ecology of the strains, lending support for a more ecological and stringent definition for prokaryotic species.

Materials and Methods

Seventy fully sequenced and closely related genomes were used in this study (Table 1, which is published as supporting information on the PNAS web site). The genomic sequences and sequence annotation for 63 of the 70 closed genomes that were published at the time of this study (August 2004) were obtained from the National Center for Biotechnology Information's ftp site, which can be accessed at ftp://ftp.ncbi.nih.gov. The remaining seven genomes were closed at the time of this study; however, their annotation was not completed (denoted by NA in Table 1). These seven strains were: Salmonella bongori 12419, Yersinia enterocolitica, Neisseria meningitidis FAM 18, produced by the Sanger Center and obtained through the Sanger ftp site at ftp://ftp.sanger.ac.uk/pub; and Mycobacterium avium, Staphylococcus epidermidis RP62A, and Clostridium perfringens (ATCC 13124), produced by The Institute for Genomic Research and obtained through their web site at www.tigr.org. Neisseria gonorrhoeae FA1090 was produced at the Advanced Center for Genome Technology at the University of Oklahoma (Norman; which can be accessed at www.genome.ou.edu/gono.html).

Determination of Conserved Genes and Evolutionary Relatedness. The conserved genes between a pair of genomes were determined by whole-genome sequence comparisons using the BLAST algorithm, release 2.2.5 (12). For these pairwise comparisons, all predicted protein-coding sequences (CDSs) from one genome (hereafter known as the query genome) were searched against the genomic sequence of a closely related genome (hereafter known as the reference genome). CDSs from the query genome were considered conserved when they had a BLAST match of at least 60% overall sequence identity (recalculated to an identity along the entire sequence) and an alignable region >70% of their length (nucleotide level) in the reference genome, whereas CDSs that had no match or a match below this cutoff were considered genome-specific in the query genome. The BLAST search was run with the following settings: x = 150 (drop-off value for gapped alignment), q = −1 (penalty for nucleotide mismatch), and F = F (filter for repeated sequences), and the rest of the parameters were at default settings. These settings give better sensitivity with more distantly related genomes, compared with default settings, because the default settings target more highly identical sequences. The genomes that were used as query genomes, the genome sizes and total number of CDSs for all genomes used in this study, and the raw data from the pairwise comparisons are summarized in Table 1.

Searching at the amino acid level predicted more conserved genes than the nucleotide level search only when the evaluated strains show less than ≈97% 16S rRNA sequence identity. For this case, there was only a slight upward shift of the left end of the regression line in Fig. 4A. Furthermore, the use of less stringent cutoffs for the determination of conserved sequences did not significantly alter our final conclusions (data not shown). Last, the use of a cutoff for match length and identity without manual inspection of the alignments proved highly accurate for the prediction of conserved sequences. For instance, Parkhill et al. (13) have identified 4,297 and 3,394 CDSs of Bordetella bronchiseptica RB50 to have orthologs in Bordetella parapertussis and Bordetella pertussis, respectively, whereas our approach predicted 4,261 and 3,382 CDSs for the same comparison, respectively.

The evolutionary [...]

[...] not assignable to the COG database and consisted of 10–20% of the total number of annotated genes in a genome. Genes that were annotated as hypothetical in the primary annotation and were assignable to the COG database conserved hypothetical category or other category were considered conserved hypothetical (and well-characterized) and denoted as such in the article.

PERL scripts were used to edit CDS assignments where necessary, extract sequences from GenBank files, format databases for BLAST searches, and automatically parse BLAST outputs.

Results and Discussion

ANI to Measure Genetic Distance. We needed a more precise measure of the genetic relatedness between any two strains. The main limitation to a universal measure for all prokaryotic taxa is the lack of genes that are widely distributed in all taxa, e.g., recent estimates suggest that there are <100 such genes (23, 24). Even these widely distributed genes, however, frequently show conflicting values of genetic relatedness because of their varied evolutionary histories (mutation rate and selection pressures) and the as-yet-unclear and not quantifiable effect of horizontal gene trans-