www.nature.com/ismecomms

ARTICLE OPEN

Automated analysis of genomic sequences facilitates high-throughput and comprehensive description of bacteria

Thomas C. A. Hitch¹, Thomas Riedel²,³, Aharon Oren⁴, Jörg Overmann²,³,⁵, Trevor D. Lawley⁶ and Thomas Clavel¹

© The Author(s) 2021

The study of microbial communities is hampered by the large fraction of still unknown bacteria. However, many of these species have been isolated, yet lack a validly published name or description. The validation of names for novel bacteria requires that the uniqueness of those taxa is demonstrated and their properties are described. The accepted format for this is the protologue, which can be time-consuming to create. Hence, many research fields in microbiology and biotechnology will greatly benefit from new approaches that reduce the workload and harmonise the generation of protologues.

We have developed Protologger, a bioinformatic tool that automatically generates all the necessary readouts for writing a detailed protologue. By producing multiple taxonomic outputs, functional features and ecological analysis using the 16S rRNA gene and genome sequences from a single species, the time needed to gather the information for describing novel taxa is substantially reduced. The usefulness of Protologger was demonstrated by using three published isolate collections to describe 34 novel taxa, encompassing 17 novel species and 17 novel genera, including the automatic generation of ecologically and functionally relevant names. We also highlight the need to utilise multiple taxonomic delineation methods: although inconsistencies between methods occur, a combined approach provides robust placement. Protologger is open source; all scripts and datasets are available, along with a webserver at www.protologger.de.

ISME Communications (2021) 1:16 ; https://doi.org/10.1038/s43705-021-00017-z

INTRODUCTION

The recent renaissance of cultivation has led to >500 novel species being added to the ‘List of Prokaryotic names with Standing in Nomenclature’ (LPSN) database every year since 2005 [1]. This has included large-scale cultivation projects of host-associated microbial communities [2–9] as well as environmental sources, such as soil [10] and the ocean [11,12]. Whilst many novel isolates are being cultured in such studies, few are taxonomically described with names that are validly published. The lack of names prevents the quick and unique referencing of these taxa, hampering researchers’ ability to study these species further. This topic affects a wide range of specialities within microbiology, yet has not been addressed.

In addition to culture-dependent studies, metagenomic approaches can provide complementary results, increasing the ability to study a microbial community [2,13,14]. In particular, metagenome-assembled genomes (MAGs) are of increasing importance for studying the functions and […] of prokaryotes, although without cultured representatives these functions cannot be validated experimentally. MAGs allow the study of as-yet-uncultured taxa, such as the entire phylum of “Candidatus Lokiarchaeota” [15]. Currently, MAGs cannot be utilised as type material for providing a validly published name, although ‘Candidatus’ taxa can be proposed. Although the rank of ‘Candidatus’ taxa is not formally included in the rules of the International Code of Nomenclature of Prokaryotes (ICNP) [16], a ‘protologue’ describing the taxon is still required (Appendix 11 of the ICNP). In addition, MAGs provide an invaluable background of potentially novel taxa, on to which cultured isolates can be compared, strengthening the justification of creating higher taxonomic groups, e.g., families [17]. One application of this method is GTDB-Tk [18], a state-of-the-art resource which utilises the genomes of both isolates and MAGs to place queried genomes within the currently sequenced space of taxa. By expanding the taxonomic and genomic landscape, MAGs have facilitated detailed analysis of both described and undescribed taxonomic groups [19–21].

When describing a novel taxon, few guidelines exist; rule 27.2.c of the ICNP stipulates that “The properties of the taxon being described must be given directly after [the name] and [its etymology]” [16]. This format is termed a protologue and acts as a standardised format for describing a novel taxon in a clear and concise manner [22]. Further recommended minimal descriptions have been published for specific lineages; however, these are only recommendations according to the ICNP [23–25].

Single marker genes are a common element of protologues, including the 16S rRNA gene sequence. However, the advent of genome sequencing has seen a rise in genome-based measures of

¹Functional Microbiome Research Group, RWTH University Hospital, Aachen, Germany. ²Leibniz Institute DSMZ-German Collection of Microorganisms and Cell Cultures, Braunschweig, Germany. ³German Center for Infection Research (DZIF), Partner site Hannover-Braunschweig, Braunschweig, Germany. ⁴The Institute of Life Sciences, The Hebrew University of Jerusalem, The Edmond J. Safra Campus, Jerusalem, Israel. ⁵Braunschweig University of Technology, Braunschweig, Germany. ⁶Host-Microbiota Interactions Laboratory, Wellcome Sanger Institute, Hinxton, UK.

Received: 11 February 2021 Revised: 21 April 2021 Accepted: 4 May 2021

[Figure 1: workflow diagram. User input data (www.protologger.de): 16S rRNA gene sequence and genome (fasta). Quality check: chimeras (UCHIME); completeness and contamination (checkM). Taxonomic placement: close relatives, phylogeny, sequence identity (SILVA-LTP); validity check (nomenclature list); ANI values (FastANI); POCP values; GTDB-Tk. Ecology: large-scale amplicon-based environmental survey (www.imngs.org); MAG environmental survey (MASH). Functionality: CDS and CRISPR (Prokka); antibiotic resistance (CARD); carbohydrate degradation (CAZy); pathway analysis (KEGG). Output.]

Fig. 1 Simplified overview of Protologger. The key steps within Protologger are highlighted with the tools utilised for each step indicated (in brackets), along with the quality assurance steps. Sections are coloured according to the information they provide with taxonomic placement (in yellow), ecology (in blue), and functionality (in red). The ‘validity check’ stage in taxonomic assignment involves the removal of taxa without validly published names from genomic comparison.
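The ‘validity check’ described in the caption above can be illustrated with a minimal Python sketch. It is not taken from the Protologger code base: the `Relative` class, the `validity_check` function and the example species names are hypothetical, and the set of validly published names is assumed to have been loaded from a nomenclature list beforehand.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Relative:
    """A close relative found via 16S rRNA gene sequence identity (hypothetical type)."""
    species: str
    identity: float  # 16S rRNA gene sequence identity to the query (%)


def validity_check(relatives, valid_names):
    """Split relatives by whether their name is in the set of validly published names.

    Only the first returned list would go forward to genome-based comparison.
    """
    kept = [r for r in relatives if r.species in valid_names]
    dropped = [r for r in relatives if r.species not in valid_names]
    return kept, dropped


# Made-up example data for illustration only.
valid_names = {"Species alpha", "Species beta"}
relatives = [
    Relative("Species alpha", 98.7),
    Relative("Species gamma", 97.9),  # not in the list of validly published names
    Relative("Species beta", 94.2),
]

kept, dropped = validity_check(relatives, valid_names)
print("Compared at genome level:", [r.species for r in kept])
print("Removed by validity check:", [r.species for r in dropped])
```

Returning the removed relatives alongside the retained ones is a design choice of this sketch; it makes it easy to report which close relatives were excluded from the genomic comparison.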

taxonomic diversity. These include gene content dissimilarity [26], average nucleotide identity (ANI) [27], average amino acid identity (AAI) [28], digital DNA–DNA hybridisation (dDDH) [29], percentage of conserved proteins (POCP) [30], differences in the G + C content of genomic DNA [31] and integration into large-scale phylogenetic trees [32,33]. In addition, protologues generally describe the functional and ecological niche of a given taxon.

Currently, no single tool provides users with the taxonomic, functional and ecological insights required for writing a protologue. For example, GTDB-Tk [18], MiGA [34] and TYGS [35] taxonomically place a given genome, yet none of these methods provide functional or detailed ecological readouts. GTDB-Tk output is based on placement within a pre-calculated phylogenomic tree, generating a relative evolutionary distance (RED) value for placement of novel lineages, and directing targeted ANI [27] comparisons to existing closely related species. MiGA also utilises ANI, as well as AAI, for placement and uniquely does not apply any tree-based methods. Similarly, TYGS generates a tree based on BLAST similarity [35–37] and provides dDDH [29] values to the close relatives. The lack of secondary or tertiary taxonomic assignment methods in these existing tools limits the ability of the user to confirm the robustness of the taxonomic placement. In addition, the functional and ecological features of a taxon can also be used to support its differentiation from close relatives. By providing seven lines of taxonomic evidence combined with ecological and functional readouts, users of Protologger obtain all the building blocks required for the description of novel, or known, taxa. This reduces the burden on the user and introduces consistency in the description of taxa.

In this paper, we introduce Protologger, an all-in-one tool that automatically describes the taxonomic, functional and ecological features of a species, providing output that can be used directly for writing a protologue.

RESULTS

Protologger workflow

Protologger requires the 16S rRNA gene and genome sequence of a single species and delivers multiple taxonomic, ecological, and functional readouts using specific databases and tools (Fig. 1). In addition, Protologger conducts quality checks on both the genome and 16S rRNA gene sequence, in line with the proposed guidelines for the use of genome sequences for taxonomic purposes [38]. A detailed description of all analysis steps can be found in the “Methods”.

In brief, taxonomic assignment is conducted via identification of the 50 closest relatives within the SILVA Living Tree Project based on 16S rRNA gene sequence identity. Species with validly published names according to the DSMZ nomenclature list, supplemented with updates from LPSN, have their type genomes obtained from the GTDB database and used to calculate genome-based delineation values: ANI, POCP, and differences in the G + C content of genomic DNA. Species lacking valid names are discarded from genomic analysis due to the lack of standing within the ICNP. In addition, genomes are assigned a taxonomic lineage using GTDB-Tk [18,33]. Functional analysis is conducted using the proteome predicted from the genome sequence file, annotated against KEGG for pathway analysis, CAZy for carbohydrate metabolism, and CARD for profiling the species’ antibiotic resistance. Ecological analysis is conducted using both the 16S rRNA gene sequence and the genome. First, the genome is compared to an internal collection of >49,000 metagenome-assembled genomes (MAGs) (Fig. 2a) collected from across ten different environments [13,14,39–47]. Second, ecological occurrence is calculated by comparing the 16S rRNA gene sequence to operational taxonomic units (OTUs) generated from 19,000 amplicon datasets (1000 from each of 19 environments, defined in the methods). The taxonomic distribution of these OTUs highlights the inclusion of many unknown taxa, although as a whole, the database is dominated by Proteobacteria (Fig. 2b).

Protologger is entirely open source, hence the code and databases can be accessed via the GitHub repository (github.com/thh32/Protologger) and a dedicated Galaxy-based website is available (www.protologger.de), which includes an instructional video. For local installation, ~100 GB RAM and ~200 GB storage space are required due to the integration of GTDB-Tk and its associated databases.

Comparison of taxonomic delineation methods within large-scale collections of gut bacterial isolates

Three recently published, large-scale collections of isolates were combined to provide a diverse dataset on which to compare the taxonomic delineation readouts provided by Protologger, including 16S rRNA gene sequence similarity, ANI, POCP, and GTDB-Tk assignment. From the initially reported number of 737 isolates within the human bacterial collection (HBC) [6], 3632 within the Broad Institute-OpenBiome Microbiome Library (BIO-ML) [7], and 410 within the Hungate1000 isolate collection [3], dereplication (>95% ANI) led to the identification of 435, 206 and 308 species-level genomes, respectively, which were analysed further. Of these (n = 949), complete analysis and output were provided for 851 (HBC: 422, BIO-ML: 197, Hungate1000: 232); the failure of some genomes to be analysed was due to the inability to identify and


[Figure 2a: pie charts of the most prevalent phyla among the MAGs from each environment (permafrost, human, primates, soil, groundwater, pig, rumen, ocean, mouse and hydrothermal sediment), with per-environment MAG counts ranging from n = 26 to n = 44,801; colour legend: Acidobacteria, Proteobacteria, Actinobacteria, Spirochaetes, Bacteroidetes, Verrucomicrobia, Elusimicrobia, Planctomycetes, Other, Unknown. Figure 2b: bar chart of the number of OTUs per phylum (log scale, 10^0 to 10^7).]

Fig. 2 Overview of the ecological databases used within Protologger. a Representation of the diverse environments from which MAGs originate in the ecological analysis. For each environment [13,14,39–46], the number of MAGs is stated, along with a pie chart indicating the three most prevalent bacterial phyla (see colour code in the figure), as determined by GTDB-Tk cross-referenced with LPSN. MAGs termed as ‘generic’ due to a lack of metadata are not included (n = 3397) [47].
b Phylum-level taxonomic diversity within the IMNGS amplicon studies utilised within the 16S rRNA gene amplicon-based habitat preference and distribution analysis. These datasets span 63 phyla represented by over 37,314,233 OTUs. The names of phyla lacking a child taxon with a validly published name are in red, as determined via the LPSN database.

extract a 16S rRNA gene sequence from the isolates’ genome, which was conducted before input.

For many years, DDH has been the gold standard for delineation of bacterial species. However, due to the difficulty in applying this method experimentally, bioinformatic proxies have been developed, including dDDH [29] and ANI [48]. Initial experiments confirmed that ANI values strongly correlate with those from DDH experiments [48] and large-scale analysis of genomic data has set the current threshold for species delineation using ANI at >95% [27]. Nonetheless, to validate the use of FastANI [27] within Protologger, we compared the consistency of both FastANI and dDDH to delineate genomes belonging to the same, or different, species. For this, isolates for which Protologger predicted species matches were randomly selected from the three isolate collections (n = 70 isolates). The pairwise genome comparisons for each isolate (n = 1599) were run through FastANI and the same genomes uploaded to the GGDC server to obtain dDDH values [37], as no open-source version is available, which prevents large-scale comparison (Fig. 3a). We observed 100% consistency in both methods’ delineation of species, along with a strong positive correlation between the scores (Pearson R² = 0.92, P < 0.01). These data support the use of FastANI values for species-level delineation, ensuring the open-source nature of this project.

For genus-level delineation, Protologger provides three suitable readouts: POCP, GTDB assignment, and 16S rRNA gene sequence similarity (Fig. 1). Using each method’s results from the pairwise comparison of isolates from the three collections (n = 834) to their closest relatives (n < 50 per isolate), those for which both pairwise genome and 16S rRNA gene data were present were extracted, resulting in 30,247 pairwise comparisons. The consistency of the three methods to assign these comparisons as either intra- or inter-genus was compared (Fig. 3b). The methods were >80% congruent when compared pairwise, and even when all three were compared simultaneously, 75.6% congruence was observed. 16S rRNA gene sequence similarity showed a relatively high degree of congruence with GTDB (81.4%) and POCP (82.9%) (Fig. 3b), although it uniquely assigned >10% of the pairings as originating from different genera. The majority of these comparisons were intra-family, as determined congruently between GTDB assignments and 16S rRNA gene sequence similarity scores (Supplementary Fig. 1), with only 11.3% of the comparisons occurring between members of different families. This is due to these comparisons being based on Protologger output, which limits the comparisons to the 50 closest relatives. This confirms the need to integrate multiple lines of evidence during taxonomic assignment to guide the placement of a novel taxon.

Description of taxonomic novelty

Each of the isolate collections originating from the gastrointestinal tract was identified to contain multiple novel taxa: 311 novel


[Figure 3a: scatter plot of ANI (%) against dDDH (%). Figure 3b: three pie charts (GTDB vs. POCP, POCP vs. 16S, GTDB vs. 16S) showing the share of comparisons in each group: Cong. Same, Cong. Dif., Uniq. Dif. GTDB, Uniq. Dif. POCP and Uniq. Dif. 16S.]

Fig. 3 Comparison of taxonomic delineation methods. a Pairwise comparisons between the dDDH and ANI values obtained for 70 isolates and their closest relatives, as identified by Protologger (n = 1599). Red lines highlight published boundaries for species delineation: dDDH, <70%; ANI, <95%. b Pairwise comparisons to test the consistency of genus-level delineation parameters (n = 30,247). For each comparison, four groups were formed based on how each method assigned the paired genomes: congruent results, same genus (Cong. Same), congruent results, different genera (Cong. Dif.) and those uniquely identified as belonging to different genera according to the method specified (Uniq. Dif.). See colour code in the figure.
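The four-group scheme described in the caption for panel b can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' analysis code: the function names are hypothetical, each method's result is assumed to be already reduced to a boolean "same genus" call per genome pair, and the example calls are made up.

```python
from collections import Counter


def classify_pair(same_genus_a, same_genus_b, name_a, name_b):
    """Assign one genome pair to a group, given two methods' same-genus calls."""
    if same_genus_a and same_genus_b:
        return "Cong. Same"
    if not same_genus_a and not same_genus_b:
        return "Cong. Dif."
    # Exactly one method placed the pair in different genera; name that method.
    return f"Uniq. Dif. {name_b if same_genus_a else name_a}"


def congruence_summary(calls_a, calls_b, name_a, name_b):
    """Return the percentage of pairs per group and the overall congruence (%)."""
    groups = Counter(
        classify_pair(a, b, name_a, name_b) for a, b in zip(calls_a, calls_b)
    )
    total = sum(groups.values())
    percentages = {group: 100.0 * n / total for group, n in groups.items()}
    congruent = percentages.get("Cong. Same", 0.0) + percentages.get("Cong. Dif.", 0.0)
    return percentages, congruent


# Made-up calls for ten genome pairs (True = the method places both in one genus).
gtdb_calls = [True, True, True, True, True, True, False, False, False, True]
s16_calls = [True, True, True, True, True, True, False, False, True, False]

percentages, congruent = congruence_summary(gtdb_calls, s16_calls, "GTDB", "16S")
for group, pct in sorted(percentages.items()):
    print(f"{group}: {pct:.1f}%")
print(f"Congruence: {congruent:.1f}%")
```

In this scheme, the congruence between two methods is the sum of the two congruent groups, and the remaining pairs are attributed to whichever method alone called them different genera.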

species and 58 novel genera (Fig. 4a). The majority belonged to the phylum Firmicutes, although novel members of the Bacteroidetes, Actinobacteria and Proteobacteria were also present (Fig. 4b). As many of the isolates within the HBC collection [6] were made publicly available via deposition at national reference collections, we utilised Protologger to taxonomically describe and provide validly published names for them. Out of the 72 isolates, we deposited 40 at a second national culture collection, which is mandatory for the valid publication of names. These 40 isolates represented 34 novel taxa across 9 families, including 17 novel species and 17 novel genera (Fig. 4c).

To identify whether these isolates represent species previously identified as being of importance within the human gut microbiome, we compared the isolates’ 16S rRNA gene sequences to the OTU sequences of the Human Microbiome Project’s (HMP) ‘most wanted’ taxa [49]. Within the ‘most wanted’ list, taxa were stratified into three priority levels (low, medium, and high) based on their perceived novelty when compared to human-specific strain collections. Out of the 34 novel taxa described here, HMP ‘most wanted’ species matches were identified for 22, including 3 high-, 13 medium- and 6 low-priority OTUs (Fig. 4c).

Using the habitat preference and distribution analysis conducted by Protologger (Figs. 1 and 2), we were able to better understand the importance of these newly named bacteria across multiple ecosystems. For example, strain Sanger_90 was most commonly identified within pig gut (26.3%) and human gut (7.4%) samples, although sub-dominant in both environments at 0.1% and 0.2% mean relative abundance, respectively. Due to the integration of the ‘Great Automatic Nomenclator’ (GAN) [50] within Protologger, the novel genus represented by strain Sanger_90 was named ‘Porcipelethomonas’, due to being most commonly present within pig gut samples. Similarly, the medium-priority HMP ‘most wanted’ species ‘Laedolimicola ammoniilytica’ and ‘Huintestinicola butyrica’ were named due to being present in 65.7% of chicken gut samples and 29.2% of pig samples, respectively. Observations such as these are not possible with targeted analysis of individual environments. All 34 novel taxa described within this work were named according to this method and protologues are provided at the end of the methods section.

High-quality MAGs support the study of as-yet-uncultured taxonomic lineages

MAGs represent an invaluable resource for both the phylogenetic placement and the functional study of isolates within understudied lineages. Currently, MAGs cannot be utilised to describe novel taxa with validly published names, although ‘Candidatus’ taxa can be proposed according to the rules of the ICNP. An alternative nomenclatural code for prokaryotes has been recently proposed, called the International Code of Nomenclature of Uncultivated Prokaryotes (ICNUP), aiming at valid publication of names of as-yet-uncultured taxa using their genomic information as the type material [51]. Hence, the creation of detailed protologues with the help of Protologger is also relevant in the context of cultivation-independent taxonomy.

The development of bioinformatic methods to link MAGs to 16S rRNA gene sequences facilitates their use as input into Protologger [43]. This inclusion of 16S rRNA gene sequences is essential due to the lack of genomes for many described prokaryotes with validly published names. With the potential of such data being used to describe and name novel taxa as ‘Candidatus’, we aimed to assess the ability of Protologger to provide reliable information. Currently, the system analyses the quality of input data for detection of chimeric 16S rRNA gene sequences, incomplete 16S rRNA gene sequences (<80%), contaminated genomes (>3%), and incomplete genomes (<95%). Analysis was conducted on the iMGMC dataset [43], which consists of 484 MAGs from the mouse intestine that were matched to 16S rRNA gene sequences using a combination of annotation and co-occurrence analysis [43]. The iMGMC dataset was dominated (96%, 465 MAGs) by representatives of novel taxa based on the ANI, POCP and GTDB-Tk assignment, including 44 representatives of novel families (Fig. 5a). Overall, warnings about the quality of the MAGs were produced in 67% of cases, in comparison to 28.9% across all three isolate collections (Fig. 5b). The quality of the genomes included within the MAG dataset was highly variable, with 39.5% being deemed ‘high-quality’ (>95% complete, <3% contamination), producing no warning. In addition to the genome quality warning, 21 instances of chimeric 16S rRNA gene sequences and 71 incomplete 16S rRNA genes were detected (Fig. 5c). The ubiquitous nature of these quality issues across the dataset suggests that they are not linked to specific lineages but inherent to this method. Hence,

[Figure 4a: breakdown of the HBC, BIO-ML and Hungate1000 isolates into known, novel species and novel genus. Figure 4b: phylum-level composition of the undescribed isolates (Firmicutes, Proteobacteria, Bacteroidetes, Verrucomicrobia, Actinobacteria, Spirochaetes). Figure 4c: phylogenomic tree (tree scale: 0.1) with tips labelled by species name and DSM number; legends: HMP priority (high, medium, low), families (including Barnesiellaceae, Bacteroidaceae, Erysipelotrichaceae, unassigned Eubacteriales, Peptoniphilaceae, Veillonellaceae, Ruminococcaceae and Oscillospiraceae), novelty (species, genera) and prevalence in the human gut (%).]

Fig. 4 Uncovering and describing taxonomic novelty using Protologger. All non-redundant species-level isolates from three large collections were processed: the human bacterial collection (HBC) [6], the Broad Institute-OpenBiome Microbiome Library (BIO-ML) [7] and the Hungate1000 collection [3]. a Each collection contained novel taxa, representing either undescribed species or genera. b Phylum-level diversity of the undescribed isolates. c Phylogenomic tree of the novel HBC isolates described and named. For some species, multiple strains were identified; therefore, the type strain DSM number is in bold (see protologues). Isolates matching HMP ‘most wanted’ species are highlighted with green balls at the branch tips with the size representing priority. The external rings represent isolate-specific information as follows: (i) the inner ring highlights the novelty, either species or genus; (ii) the centre ring indicates to which family the isolates are assigned; (iii) the outer ring shows the prevalence of each isolate across 1000 human gut amplicon samples (the ecosystem of origin of the isolates), with values ranging from 1.0% to 69.6%.
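The prevalence values shown in the outer ring, and the mean relative abundances quoted in the text, can be illustrated with a short Python sketch. It rests on assumptions that the text does not confirm: the function name is hypothetical, a sample counts as positive when the matching OTU's relative abundance exceeds a detection threshold (zero by default), and the mean relative abundance is taken over positive samples only. The abundance values are made up.

```python
def prevalence_and_abundance(relative_abundances, detection_threshold=0.0):
    """Summarise the occurrence of one OTU across the samples of an environment.

    relative_abundances: per-sample relative abundance (%) of the OTU matching
    the isolate; samples without the OTU are given as 0.0.
    Returns (prevalence in % of samples, mean relative abundance in % among
    positive samples). Averaging over positive samples only is an assumption
    of this sketch.
    """
    if not relative_abundances:
        raise ValueError("at least one sample is required")
    positive = [a for a in relative_abundances if a > detection_threshold]
    prevalence = 100.0 * len(positive) / len(relative_abundances)
    mean_abundance = sum(positive) / len(positive) if positive else 0.0
    return prevalence, mean_abundance


# Made-up abundances for eight samples from one environment.
samples = [0.0, 0.2, 0.0, 0.1, 0.0, 0.0, 0.3, 0.0]
prevalence, mean_abundance = prevalence_and_abundance(samples)
print(f"Prevalence: {prevalence:.1f}% of samples")
print(f"Mean relative abundance in positive samples: {mean_abundance:.2f}%")
```

Reporting both values together mirrors the text's distinction between a taxon that is widespread (high prevalence) and one that is dominant (high relative abundance).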


[Figure 5a: numbers of iMGMC MAGs classed as known, novel species, novel genus or novel family. Figure 5b: number of quality warnings per input (0 to 4) for MAGs (n = 483) and isolates (n = 851). Figure 5c: phylogenomic tree of the iMGMC MAGs (tree scale: 0.5), coloured by phylum (Verrucomicrobia, Bacteroidetes, Deferribacteres, Actinobacteria, Proteobacteria, Firmicutes, Cyanobacteria), with outer rings for contaminated genome, incomplete genome, incomplete 16S rRNA gene, chimera, quality warning and high-quality input.]

Fig. 5 Quality and novelty within a MAG dataset from the mouse intestine [43]. a Novelty of the 484 iMGMC MAGs according to their 16S rRNA gene sequence similarity to their closest relative. b Comparison of the number of warnings per input generated by Protologger for the MAG and isolate collections. White dots show the median number of errors while the dark grey bar highlights the interquartile range and the black line indicates the lower/upper adjacent values. c Phylogenomic tree of all 484 MAGs with rings on the outside highlighting in black the occurrence of Protologger warnings: chimeric 16S rRNA gene sequences, incomplete 16S rRNA gene sequences, incomplete genomes, contaminated genomes. The MAGs with no warning, hence deemed of high quality, are indicated by green bars.

users with such data should be aware during analysis. In the event of the ICNUP being established, the Protologger output for all 484 MAGs has been made available (see “Methods”), facilitating the description and naming of these novel taxa.

DISCUSSION
The renewed interest in cultivation, as well as the use of metagenomic datasets to infer the existence of novel bacterial lineages, highlights the dire need for automated description and naming of novel taxa50,51. Protologger aims to facilitate this process by providing all the important building blocks for users to study the taxonomic placement, ecological occurrence, and main functional characteristics of their taxon of interest.

Protologger provides seven lines of taxonomic information based on both 16S rRNA gene sequence and genome data, which users can easily integrate to decide the final placement of their taxon. Published thresholds of the delineation methods utilised in Protologger were used to compare the ability of each method to differentiate novel from known species and genera. While consistent in most cases, differences occurred, further highlighting the importance of providing multiple lines of evidence as done in Protologger. For example, although POCP values are cited for all novel genera proposed in this paper, many fell within the range of 45–55%, highlighting the need for manual evaluation of borderline values in light of all the taxonomic output provided. Altogether, we strongly recommend an educated examination of all parameters prior to making a final decision on the placement of novel taxa. Whilst this may represent a burden on the user, it is imperative to ensure robust assignment within the existing taxonomy. As phylogenetic and phylogenomic trees can further support the placement of novel taxa beyond threshold values, both types of trees are also provided by Protologger. While physiological and phenotypic features are predicted, the expression and utilisation of these pathways cannot be guaranteed, hence direct observation of the species and phenotypic testing are recommended to verify the predictions.

In our hands, Protologger significantly reduces the manual workload required in protologue generation from ~10 h to ~1 h. This was shown via application to four isolate- and MAG-based datasets, facilitating the description of 34 novel taxa from the human gut, including 22 previously reported to be of particular interest to the community. By simplifying the process of describing taxa, we believe that the names of a greater number of new taxa will become validly published in the future. The naming of bacteria has also been highlighted as an area of difficulty due to the need for knowledge of Latin and Greek52. The integration of Protologger’s overview file with the ‘GAN’50 allows the creation of names derived from the characteristics of the isolate being studied, preventing names based on taxonomic relatedness to existing taxa with validly published names or the use of placeholder identifiers. By removing these barriers, we make the description and naming of novel taxa a reality for researchers who may have dismissed the idea previously due to the additional workload it represents. Continual development of the system is underway to make these descriptions available to the wider research community via integration with BacDive.
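The recommendation in this Discussion, to weigh every delineation method rather than rely on a single value, can be illustrated with a minimal Python sketch. It is not part of Protologger: the `Comparison` record and the `summarise` function are hypothetical names introduced here, and treating a value at or above a threshold as "same taxon" is an assumption. The thresholds are those quoted in this article (98.7% and 94.5% 16S rRNA gene sequence similarity for species and genera, 95% ANI for species, 50% POCP for genera, and the 45–55% POCP range described as borderline).

```python
from dataclasses import dataclass
from typing import Dict, Optional

# Thresholds as quoted in the article; the ">=" convention is an assumption.
SPECIES_16S = 98.7               # % 16S rRNA gene similarity, species level
GENUS_16S = 94.5                 # % 16S rRNA gene similarity, genus level
SPECIES_ANI = 95.0               # % ANI, species level
GENUS_POCP = 50.0                # % POCP, genus level
POCP_BORDERLINE = (45.0, 55.0)   # POCP range flagged for manual evaluation


@dataclass
class Comparison:
    """Hypothetical record of one query-versus-relative comparison."""
    relative: str
    similarity_16s: float
    ani: Optional[float] = None
    pocp: Optional[float] = None


def summarise(c: Comparison) -> Dict[str, bool]:
    """Report each method's call separately and flag disagreements."""
    calls: Dict[str, bool] = {
        "16S_same_species": c.similarity_16s >= SPECIES_16S,
        "16S_same_genus": c.similarity_16s >= GENUS_16S,
    }
    species_calls = [calls["16S_same_species"]]
    genus_calls = [calls["16S_same_genus"]]
    borderline = False

    if c.ani is not None:
        calls["ANI_same_species"] = c.ani >= SPECIES_ANI
        species_calls.append(calls["ANI_same_species"])
    if c.pocp is not None:
        calls["POCP_same_genus"] = c.pocp >= GENUS_POCP
        genus_calls.append(calls["POCP_same_genus"])
        low, high = POCP_BORDERLINE
        borderline = low <= c.pocp <= high

    calls["POCP_borderline"] = borderline
    calls["species_methods_disagree"] = len(set(species_calls)) > 1
    calls["genus_methods_disagree"] = len(set(genus_calls)) > 1
    # Any disagreement or borderline value calls for manual evaluation.
    calls["needs_manual_review"] = (
        borderline
        or calls["species_methods_disagree"]
        or calls["genus_methods_disagree"]
    )
    return calls


if __name__ == "__main__":
    # Values reported in this article for Anaerostipes amylophilus vs A. hadrus.
    example = Comparison("Anaerostipes hadrus", 99.3, ani=88.8, pocp=77.5)
    for name, value in summarise(example).items():
        print(f"{name}: {value}")
```

With the values reported in this article for Anaerostipes amylophilus against A. hadrus (99.3% 16S rRNA gene sequence similarity, 88.8% ANI, 77.5% POCP), the sketch reports that the 16S and ANI calls disagree at the species level and flags the comparison for manual review, which is the situation the combined approach is meant to resolve.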

Users are able to email the Protologger overview file to BacDive ([email protected]), along with the BacDive identifier of the studied isolate, and the two will be linked. This will ensure the longevity of the output and facilitate citable species descriptions as BacDive entries have digital object identifier numbers. Furthermore, this will facilitate transparent discussion on the taxonomic placement of taxa as all delineation values will be available to the community.

Issues regarding data quality were observed within both isolate- and MAG-based datasets, suggesting users must be aware of such issues regardless of the input used. In this age, when taxa can be described purely based on their genome53, minimum quality thresholds are essential to prevent erroneous findings. Such minimum thresholds have previously been proposed, but have yet to be made mandatory38,51. We therefore re-state the need for defined and enforced minimum genome standards for the description of novel taxa within the existing ICNP as well as in the context of the ICNUP51.

Whilst reliant on external databases, Protologger will be updated according to each release of both the Living Tree Project and the GTDB database, ensuring its continual relevance to the community. Furthermore, the complete open-source nature of Protologger, including its dependencies, ensures that all researchers can access this resource. The finding that FastANI provides species-level delineation as reliable as dDDH was essential for this, as it prevents reliance on closed-source tools. The additional development and maintenance of a webserver (www.protologger.de) aims to give researchers without either the bioinformatic skills or the resources necessary to run Protologger access to this tool. As Protologger is designed for the community, we welcome contributions from researchers to its continual development, such as the proposal of additional pathways or habitats of interest.

MATERIALS AND METHODS
Input
Protologger requires both the full-length 16S rRNA gene sequence and genome assembly of an isolate to be submitted to the system as FASTA files (Fig. 1). As described below, the 16S rRNA gene sequence is then used for the taxonomic placement of the isolate whilst the genome reconfirms the placement, along with providing insight into the functional repertoire of the isolate (described below in ‘Functional analysis’). Both are also used for the generation of ecological readouts.

Taxonomic assignment
Valid publication of names of novel taxa relies on taxonomic comparison of an isolate to closely related species with a validly published name. The use of 16S rRNA gene sequences to delineate taxa goes back to the 1970s when such sequences were used to provide the first evidence that the Archaea and Bacteria represented distinct lineages of Prokaryotes54. For this, Protologger compares input 16S rRNA gene sequences to the Living Tree Project (LTP)55 (LTPs132), a subsidiary of the SILVA database56, which is regularly updated and consists of representative 16S rRNA gene sequences from all isolated taxa.

The quality of the input 16S rRNA gene sequence is checked via detection of chimeras using UCHIME (Usearch v5.2.32)57 with the LTP database as a reference. BLASTN (v2.10.0+) (>60% identity, >80% query coverage and e-value <10^-25) is used to identify the 50 highest scoring species in the LTP database and pairwise sequence similarity is calculated58. Delineation of taxa based on 16S rRNA gene sequence similarity currently stands at 98.7% for species, 94.5% for genera, and 86.5% for families59. However, some taxonomic lineages require the use of altered delineation thresholds due to their unique genomic makeup or evolutionary history60,61; hence all values are provided to the user.

The quality of the provided 16S rRNA gene sequence is checked by comparison against its closest match, providing a completeness value, which is reported to the user. These sequences are then aligned using MUSCLE62 (v3.8.31), default settings, and further processed using FastTree63 (v2.1.7), GTR model, to generate a 16S rRNA gene sequence-based phylogenetic tree.

Genomes are first checked for completeness and contamination using CheckM64 (v1.0.12). Concatenated protein sequence trees using multiple marker genes have been extensively applied for the phylogenetic placement of genomes32,33. The Genome Taxonomy DataBase (GTDB) is currently the most exhaustive approach to tree-based genome taxonomy, utilising 120 bacterial marker genes to determine the REDs between ~100,000 genomes, facilitating standardised distances for each taxonomic level33. Protologger uses GTDB-Tk (r89) (v1.2.0) for the placement within the GTDB taxonomy system via the detection of marker genes to assign a domain. These marker genes are then used to place the genome within the domain-specific reference tree and ANI values confirm the species-level identification18.

The genomes of closely related species, as determined by 16S rRNA gene sequence similarity scores, are obtained from the GTDB type genome list. Only those with validly published names are accepted for inclusion in downstream analysis. This is checked by comparison against those taxa with validly published names within the ‘List of Prokaryotic names with Standing in Nomenclature’ maintained by the Leibniz Institute DSMZ-German Collection of Microorganisms and Cell Cultures1.

Once the list of close relatives has been populated, the percentage (mol%) of guanine + cytosine (G + C) of the genomic DNA, ANI and POCP values are calculated pairwise against the input genome. Protologger utilises FastANI27 (v1.2) to calculate ANI values and custom Python scripts for POCP calculation, available in the GitHub repository.

Previously, it was assumed that the high variability in mol% G + C between isolates belonging to the same species, up to 5%65, prevented its use for species delineation66. However, reanalysis suggested these variations were due to methodological issues that do not exist when genome sequences are used to calculate G + C content and that genomes from the same species rarely vary by >1%31. Protologger contains custom scripts, provided in the GitHub repository, for calculating G + C (%).

The POCP value calculated between the genomes of two strains has also been proposed for delineation of genera based on values <50%30. This threshold has since been used within the Chlamydiales67, Pseudomonas68 and Muribaculaceae17. Protologger calculates POCP values according to the original publication30.

Functional analysis
Protologger provides detailed analysis of the functional repertoire of input genomes including pathway analysis (KEGG)69, carbohydrate-active enzymes (CAZy)70 and detection of antibiotic resistance genes (CARD)71. Proteins are first extracted from both the query genome and the reference genomes, facilitating calculation of POCP values, using PROKKA72 (v1.14.6) (default parameters), during which the number of CRISPR arrays within the genome is recorded. KEGG orthologs (KOs) are determined using the PROKKA output, providing a non-redundant list of KOs via PROKKA2KEGG (https://github.com/SilentGene/Bio-py/tree/master/prokka2kegg). These KOs are then compared to manually selected pathways including utilisation of 10 carbon sources (glucose, arbutin, salicin, cellobiose, sucrose, trehalose, maltose, starch, dextran, cellulose), short-chain fatty acid biosynthesis (acetate, propanoate/propionate, butanoate/butyrate), siroheme biosynthesis, urease, vitamin biosynthesis (biotin (vitamin B7), cobalamin (vitamin B12), folate (vitamin B9), riboflavin (vitamin B2), menaquinone (vitamin K2) and phylloquinone (vitamin K1)), EPS biosynthesis, cytochrome c oxidase, ammonia production, nitrification, ammonia utilisation, sulfide utilisation and sulfate assimilatory reduction. Genes encoding additional features, such as the presence of flagella-encoding genes (n = 25) and spore production (n = 52), are also studied. These pathways were selected due to either representing an inherent feature of the species, e.g. spore formation or motility, or being key to their role in their environment, e.g. nitrogen fixation, vitamin biosynthesis. Currently, 30 different biochemical and physiological features are assessed; however, this will continue to be updated.

Habitat preference and distribution
Habitat preference and distribution are determined from 16S rRNA gene amplicon data using a comprehensive database of 19,000 16S rRNA amplicon samples from 19 environments obtained from the IMNGS database73 and representing a total of 38,163,501 OTUs (Fig. 2b). These environments are activated sludge, bovine gut, chicken gut, coral, freshwater, human gut, human lung, human oral, human skin, human vagina, insect gut, marine, marine sediment, mouse gut, pig gut, plant, rhizosphere, soil and
wastewater. Comparison of the query 16S rRNA sequence to this database is done via BLASTN (97% identity, 80% OTU coverage).

In addition, input genomes are compared to a database of 49,094 high-quality MAGs (>90% complete, <5% contamination, based on published thresholds74) obtained from 12 studies and at least 10 environments (Fig. 2a). The comparison is conducted using MASH75 (v2.2), with results filtered using a distance threshold of <0.05. Comparison against the MAG database provides supporting information for the identification of novel taxonomic groups, reducing the reliance on descriptions from single isolates.

Webserver
Protologger is available via a custom Galaxy installation, hosted at the University Hospital of RWTH Aachen, at www.protologger.de. The website contains detailed instructions, as well as an instructional video to guide users.

Datasets
Protologger was applied to three large isolate collections which focused on the gastrointestinal tract. The first collection, Forster et al.6, consists of 737 isolates, of which strains representing new taxa were deposited at various national culture collections and their genomes were sequenced. The second collection, Poyet et al.7, consists of 7758 isolates, of which 3632 had been genome sequenced. Isolates from this collection are maintained locally, forming the BIO-ML culture collection, but not submitted to public culture collections7. The third collection, Hungate1000, is formed from 410 cultured bacteria and archaea for which genomes are available, yet few strains were deposited to public culture collections. All archaeal genomes were ignored during downstream analysis because Protologger is specific to Bacteria. While they cannot be used for validation of names of novel taxa, MAGs can provide greater insight into as-yet-uncultured lineages. Therefore, a representative dataset of MAGs was selected as input to both test the quality of the MAGs and highlight the use of Protologger to describe the novel taxa. For this, the integrated mouse gut metagenomic catalogue (iMGMC)43 was used, as it represents an advancement in MAG generation, providing 484 MAGs that were linked to full-length 16S rRNA gene sequences using a mixture of sequence mapping and correlation analysis.

The three isolate collections lacked 16S rRNA gene sequences for the corresponding strains, hence Barrnap was used to identify the presence of 16S rRNA genes within the genome sequences72. The longest 16S rRNA gene sequence identified for each genome was used as input for Protologger along with the original genome. The output for all analysed genomes is provided at github.com/thh32/Protologger.

Protologues
All taxonomic, functional and ecological features used to describe novel taxa below are solely based on the output of Protologger. The names proposed were produced using the GAN50, modified to accept Protologger output. For novel genera, the occurrence of the given taxon in each environment is calculated as the prevalence × mean relative abundance in positive samples. The environment identified then directs GAN to utilise one of the curated lists of environmental prefixes, producing ecologically informed names. Species names are selected based on the isolate’s functional repertoire. This modified version of GAN is available at www.protologger.de. The Protologger output for all isolates described below is provided at github.com/thh32/Protologger.

Description of Aedoeadaptatus gen. nov. Aedoeadaptatus (Gr. neut. n. aidoion, the female pudendum; L. past. part. adaptatus, adapted; N.L. masc. n. Aedoeadaptatus, a microbe frequently occurring in the female reproductive tract). This taxon is named referring to the microbial ecosystem with combined prevalence and mean relative abundance for this taxon, although originally isolated from human faeces. Based on 16S rRNA gene sequence similarity, the closest relatives are Peptoniphilus (Peptoniphilus methioninivorax, 88.5%; Peptoniphilus stercorisuis, 88.1%; Peptoniphilus asaccharolyticusT, 87.3%) and Tissierella (Tissierella creatinini, 87.7%; Tissierella praeacutaT, 86.3%). POCP analysis confirmed that strain E51 represents a distinct genus to Peptoniphilus (P. asaccharolyticusT, 44.0%) and Tissierella (T. praeacutaT, 30.0%) as all POCP values to close relatives were below 50%. GTDB-Tk supported the creation of a novel genus, placing strain E51 within the genus ‘Peptoniphilus_C’, a sub-division of the existing Peptoniphilus genus. The type species of this genus is Aedoeadaptatus acetigenes.

Description of Aedoeadaptatus acetigenes sp. nov. Aedoeadaptatus acetigenes (L. neut. n. acetum, vinegar; Gr. v. gennaô, to produce; N.L. part. adj. acetigenes, producing acetate). KEGG analysis identified pathways for acetate production from acetyl-CoA (EC:2.3.1.8, 2.7.2.1), propionate production from propanoyl-CoA (EC:2.3.1.8, 2.7.2.1), sulfide and L-serine utilised to produce L-cysteine and acetate (EC:2.3.1.30, 2.5.1.47), L-glutamate production from ammonia via L-glutamine (EC:6.3.1.2, 1.4.1.-) and folate (vitamin B9) biosynthesis from 7,8-dihydrofolate (EC:1.5.1.3). This species was commonly identified within human vaginal (24.2%) and human gut (8.2%) samples, although sub-dominant in both environments at ca. 0.2% mean relative abundance. The type strain, E51T (=DSM 107480T), was isolated from human faeces. The G + C content of genomic DNA is 48.9%.

Description of Agathobaculum ammoniilyticum sp. nov. Agathobaculum ammoniilyticum (N.L. neut. n. ammonium, ammonia; N.L. neut. adj. lyticum, able to loose, able to dissolve; from Gr. neut. adj. lytikon, able to loose, dissolving; N.L. neut. adj. ammoniilyticum, ammonia-degrading, to reflect the activity of the bacterium). The species was identified as a member of the genus Agathobaculum, based on 16S rRNA gene sequence similarity of 96.8% and 96.7% to the existing members of the genus, Agathobaculum butyriciproducens and Agathobaculum desmolans. This was supported by POCP values of 56.7% and 63.4% to each member, respectively. ANI values to each of these related species were below 95%, which was confirmed by GTDB-Tk identification as ‘Agathobaculum-sp900291975’. Within the genome, 137 CAZymes were identified, facilitating the predicted utilisation of both starch and glucose as carbon sources. KEGG analysis identified a total of 93 transporters, 7 secretion genes and 513 enzymes. This included pathways for acetate production from acetyl-CoA (EC:2.3.1.8, 2.7.2.1), propionate production from propanoyl-CoA (EC:2.3.1.8, 2.7.2.1), riboflavin (vitamin B2) biosynthesis from GTP (EC:3.5.4.25, 3.5.4.26, 1.1.1.193, 3.1.3.104, 4.1.99.12, 2.5.1.78, 2.5.1.9, 2.7.1.26, 2.7.7.2) and L-glutamate production from ammonia via L-glutamine (EC:6.3.1.2, 1.4.1.-). The type strain, Sanger_34T (=DSM 102882T), was isolated from human faeces. The G + C content of genomic DNA is 57.0%.

Description of Alitiscatomonas gen. nov. Alitiscatomonas (L. masc./fem. n. ales, a bird; Gr. neut. n. skor, dung; L. fem. n. monas, a monad; Alitiscatomonas, a microbe frequently occurring in the faeces of birds). This taxon is named referring to the microbial ecosystem with combined prevalence and mean relative abundance for this taxon, although originally isolated from human faeces. Based on 16S rRNA gene sequence similarity, the closest relatives are members of the genus Lacrimispora: Lacrimispora xylanolytica (95.7%), Lacrimispora aerotolerans (95.5%) and Lacrimispora sphenoidesT (95.0%). POCP analysis confirmed that strain f_CCE represents a distinct genus to Lacrimispora as POCP values to all close relatives were below 50%, including to L. sphenoidesT (38.2%). GTDB-Tk supported the placement of strain f_CCE within a novel genus predicted metagenomically as ‘CAG-81’. The type species of this genus is Alitiscatomonas aceti.

Description of Alitiscatomonas aceti sp. nov. Alitiscatomonas aceti (L. neut. n. acetum, vinegar; L. gen. neut. n. aceti, of vinegar). The number of CAZymes identified within the type strain’s genome was 148. Strain f_CCE was predicted to utilise starch. KEGG analysis identified pathways for acetate production from acetyl-CoA (EC:2.3.1.8, 2.7.2.1), propionate production from propanoyl-CoA (EC:2.3.1.8, 2.7.2.1), L-glutamate production from ammonia via L-glutamine (EC:6.3.1.2, 1.4.1.-) and folate (vitamin B9) biosynthesis from 7,8-dihydrofolate (EC:1.5.1.3). This species was most commonly identified within chicken gut samples (56.8%), although sub-dominant at only 0.26% mean relative abundance. The type strain, f_CCET (=DSM 107473T), was isolated from human faeces. The G + C content of genomic DNA is 51.8%.

Description of Anaerostipes amylophilus sp. nov. Anaerostipes amylophilus (Gr. neut. n. amylon, starch; N.L. masc. adj. philus (from Gr. masc. adj. philos) loving; N.L. masc. adj. amylophilus, starch-loving). The isolate was identified as a member of the genus Anaerostipes. The comparison of 16S rRNA gene sequence to existing members of this genus identified a 99.3% match to Anaerostipes hadrus and 93.3% to the type species of the genus, Anaerostipes caccaeT. The assignment of the isolate to Anaerostipes was supported by POCP values of 77.5% and 55.9% to A. hadrus and A. caccae, respectively. GTDB-Tk supported the placement within Anaerostipes and the generation of a novel species as the isolate was assigned to ‘Anaerostipes hadrus_A’. ANI comparison provided a value of 88.8% to A.
hadrus, confirming they are not the same species, but share a conserved 16S rRNA gene sequence. Within the genome, 142 CAZymes were identified along with the utilisation of glucose, cellobiose and starch. KEGG-based analysis identified the presence of the following pathways: acetate production from acetyl-CoA (EC:2.3.1.8, 2.7.2.1), butyrate production from butanoyl-CoA (EC:2.8.3.8), propionate production from propanoyl-CoA (EC:2.3.1.8, 2.7.2.1), sulfide and L-serine utilised to produce L-cysteine and acetate (EC:2.3.1.30, 2.5.1.47), L-glutamate production from ammonia via L-glutamine (EC:6.3.1.2, 1.4.1.-) and folate (vitamin B9) biosynthesis from 7,8-dihydrofolate (EC:1.5.1.3). The type strain, H1_26T (=DSM 108065T), was isolated from human faeces. The G + C content of genomic DNA is 36.6%.

Description of Anthropogastromicrobium gen. nov. Anthropogastromicrobium (Gr. masc. n. anthropos, a human being; Gr. fem. n. gaster, the stomach; L. neut. n. microbium, a microbe; Anthropogastromicrobium, a microbe from the stomach of humans). Based on 16S rRNA gene sequence similarity, the closest relatives are members of the genera Lacrimispora (Lacrimispora amygdalina, 90.8%), Lachnospira (Lachnospira multiparaT, 90.8%) and Cuneatibacter (Cuneatibacter caecimurisT, 90.7%). POCP analysis confirmed strain H6_35 represents a distinct genus to both Lacrimispora and Lachnospira as all POCP values to close relatives were below 50%. GTDB-Tk supported the creation of a novel genus, placing strain H6_35 within the predicted genus ‘KLE1615’. The type species of this genus is Anthropogastromicrobium aceti.

Description of Anthropogastromicrobium aceti sp. nov. Anthropogastromicrobium aceti (L. neut. n. acetum, vinegar; L. gen. neut. n. aceti, of vinegar). The number of CAZymes identified within the type strain’s genome was 292, facilitating the predicted utilisation of cellulose and starch. KEGG analysis identified pathways for acetate production from acetyl-CoA (EC:2.3.1.8, 2.7.2.1), butyrate production from butanoyl-CoA (EC:2.8.3.8), propionate production from propanoyl-CoA (EC:2.3.1.8, 2.7.2.1), sulfide and L-serine utilised to produce L-cysteine and acetate (EC:2.3.1.30, 2.5.1.47), L-glutamate production from ammonia via L-glutamine (EC:6.3.1.2, 1.4.1.-) and folate (vitamin B9) biosynthesis from 7,8-dihydrofolate (EC:1.5.1.3). The type strain, H6_35T (=DSM 108287T), was isolated from human faeces. The G + C content of genomic DNA is 41.3%.

Description of Bacteroides cellulolyticus sp. nov. Bacteroides cellulolyticus (N.L. neut. n. cellulosum, cellulose; N.L. masc. adj. lyticus, able to loose, able to dissolve; from Gr. masc. adj. lytikos, dissolving; N.L. masc. adj. cellulolyticus, cellulose-dissolving). The species was identified as a member of the genus Bacteroides. The comparison of 16S rRNA gene sequence identified the highest matches to existing members of Bacteroides, including Bacteroides caecigallinarum (97.5%), although the similarity to the type species, Bacteroides fragilis, was only 90.5%. POCP analysis also confirmed the placement of strain Sanger_22 within Bacteroides with values >50% to 26 existing Bacteroides species, although the value to B. fragilis was 47.4%. GTDB-Tk supported the placement of strain Sanger_22 within Bacteroides, placed as ‘Bacteroides_A sp900066445’. ANI values to all close relatives were below 95%, confirming this isolate represents a novel species. Within the genome, 357 CAZymes were identified along with the utilisation of starch and cellulose. KEGG-based analysis identified the presence of the following pathways: acetate production from acetyl-CoA (EC:2.3.1.8, 2.7.2.1), propionate production from propanoyl-CoA (EC:2.3.1.8, 2.7.2.1), L-glutamate production from ammonia via L-glutamine (EC:6.3.1.2, 1.4.1.-), folate (vitamin B9) biosynthesis from 7,8-dihydrofolate (EC:1.5.1.3) and riboflavin (vitamin B2) biosynthesis from GTP (EC:3.5.4.25, 3.5.4.26, 1.1.1.193, 3.1.3.104, 4.1.99.12, 2.5.1.78, 2.5.1.9, 2.7.1.26, 2.7.7.2). The type strain, Sanger_22T (=DSM 102145T), was isolated from human faeces. The G + C content of genomic DNA is 40.2%.

Description of Barnesiella propionica sp. nov. Barnesiella propionica (N.L. neut. n. acidum propionicum, propionic acid; L. suff. -ica, suffix used with the sense of pertaining to; N.L. fem. adj. propionica, pertaining to propionic acid). The species was identified as a member of the genus Barnesiella. The comparison of 16S rRNA gene sequence identified the highest matches to both existing members of this genus: Barnesiella viscericolaT (91.9%) and Barnesiella intestinihominis (91.8%). POCP analysis confirmed the placement of strain E21 within Barnesiella with values of 59.3% and 59.8% to B. viscericolaT and B. intestinihominis, respectively. ANI values to all close relatives were below 95.0%, suggesting this isolate represents a novel species. The assignment as a novel species within Barnesiella was supported by GTDB-Tk assignment to ‘Barnesiella sp003150885’. Within the genome, 221 CAZymes were identified along with the utilisation of cellulose and starch. KEGG-based analysis identified the presence of the following pathways: acetate production from acetyl-CoA (EC:2.3.1.8, 2.7.2.1), propionate production from propanoyl-CoA (EC:2.3.1.8, 2.7.2.1), L-glutamate production from ammonia via L-glutamine (EC:6.3.1.2, 1.4.1.-), folate (vitamin B9) biosynthesis from 7,8-dihydrofolate (EC:1.5.1.3) and riboflavin (vitamin B2) biosynthesis from GTP (EC:3.5.4.25, 3.5.4.26, 1.1.1.193, 3.1.3.104, 4.1.99.12, 2.5.1.78, 2.5.1.9, 2.7.1.26, 2.7.7.2). The type strain, E21T (=DSM 107469T), was isolated from human faeces. The G + C content of genomic DNA is 40.3%.

Description of Blautia ammoniilytica sp. nov. B. ammoniilytica (N.L. neut. n. ammonium, ammonia; N.L. fem. adj. lytica, able to loose, able to dissolve; from Gr. fem. adj. lytike, able to loose, dissolving; N.L. fem. adj. ammoniilytica, ammonia-degrading, to reflect the activity of the bacterium). The species was identified as a member of the genus Blautia based on a 16S rRNA gene sequence similarity of 95.5% to Blautia faecis. However, similarity to the type species, Blautia coccoides, was only 92.6%. The assignment of the isolate to Blautia was supported by POCP values of 57% and 58.1% to Blautia wexlerae and Blautia obeum, respectively, whilst a value of 39.2% to B. coccoides suggests genus-level differentiation. ANI values against all members of the genus Blautia were below 95%. Both 16S rRNA gene sequence similarity and POCP suggest that B. ammoniilytica does not belong to the novel genus ‘Hoministercoradaptatus’ also described in this study (see below), with values of 93.0% and 50.3%, respectively, to strain Sanger_23. In addition, phylogenomic placement confirmed this isolate resides between Blautia species with validly published names, adding to this monophyletic group. As such, strain Sanger_23 is proposed as a novel species within the existing genus Blautia. GTDB assignment identified strain Sanger_23 as ‘Blautia_A sp900066505’. Whilst 196 CAZymes were identified within the genome, only starch was suggested as a carbon source. KEGG analysis identified a total of 117 transporters, 4 secretion genes, and 591 enzymes within the genome. This included pathways for acetate production from acetyl-CoA (EC:2.3.1.8, 2.7.2.1), propionate production from propanoyl-CoA (EC:2.3.1.8, 2.7.2.1), sulfide and L-serine utilisation to produce L-cysteine and acetate (EC:2.3.1.30, 2.5.1.47), L-glutamate production from ammonia via L-glutamine (EC:6.3.1.2, 1.4.1.-), as well as folate (vitamin B9) biosynthesis from 7,8-dihydrofolate (EC:1.5.1.3). The type strain, Sanger_23T (=DSM 102163T), was isolated from human faeces. The G + C content of genomic DNA is 43.1%.

Description of Blautia acetigignens sp. nov. Blautia acetigignens (L. neut. n. acetum, vinegar, used to refer to acetic acid; L. v. gignere, to produce; N.L. part. adj. acetigignens, vinegar- or acetic acid-producing). The species was identified as a member of the genus Blautia, based on 16S rRNA gene sequence similarity of 98.0% to B. faecis, 96.1% to B. obeum and 94.6% to the type species, B. coccoides. While the POCP to B. coccoides was below 50%, the value to B. obeum was 60.9%, supporting their placement within the same genus. The placement within Blautia is further supported by the isolate’s GTDB-Tk identification as ‘Blautia_A sp900066145’. ANI values to all close relatives and to ‘Blautia ammoniilytica’ described in this study (see above) were below 95%. Within the genome, 262 CAZymes were identified along with the predicted use of starch. KEGG-based analysis suggested the presence of the following pathways: acetate production from acetyl-CoA (EC:2.3.1.8, 2.7.2.1), propionate production from propanoyl-CoA (EC:2.3.1.8, 2.7.2.1), L-glutamate production from ammonia via L-glutamine (EC:6.3.1.2, 1.4.1.-), cobalamin (vitamin B12) biosynthesis from cobinamide (EC:2.5.1.17, 6.3.5.10, 6.2.1.10, 2.7.1.156), folate (vitamin B9) biosynthesis from 7,8-dihydrofolate (EC:1.5.1.3) and riboflavin (vitamin B2) biosynthesis from GTP (EC:3.5.4.25, 3.5.4.26, 1.1.1.193, 3.1.3.104, 4.1.99.12, 2.5.1.78, 2.5.1.9, 2.7.1.26, 2.7.7.2). The type strain, Sanger_28T (=DSM 102165T), was isolated from human faeces. The G + C content of genomic DNA is 44.4%.

Description of Bovifimicola gen. nov. Bovifimicola (L. masc./fem. n. bos, an ox, a bull, a cow; L. neut. n. fimum, dung; N.L. suffix masc./fem. cola, an inhabitant of; N.L. fem. n. Bovifimicola, a microbe frequently occurring in the faeces of cattle). This taxon is named referring to the microbial ecosystem with combined prevalence and mean relative abundance for this taxon, although originally isolated from human faeces. Based on 16S rRNA gene sequence similarity, the closest relatives are members of the genera Hespellia (Hespellia porcina, 93.7%; Hespellia stercorisuisT, 93.3%) and Ruminococcus (Ruminococcus gnavus, 93.7%). POCP analysis confirmed that

strain Sanger_97 represents a distinct genus to both Hespellia (H. stercorisuisT, 36.3%) and Ruminococcus (R. gnavus, 36.4%) as all POCP values to close relatives were below 50%. GTDB-Tk supported the creation of a novel genus, placing strain Sanger_97 within the predicted genus ‘CAG-603’. The type species of this genus is Bovifimicola ammoniilytica.

Description of Bovifimicola ammoniilytica sp. nov. B. ammoniilytica (N.L. neut. n. ammonium, ammonia; N.L. fem. adj. lytica, able to loose, able to dissolve; from Gr. fem. adj. lytike, able to loose, dissolving; N.L. fem. adj. ammoniilytica, ammonia-degrading, to reflect the activity of the bacterium). KEGG analysis identified pathways for acetate production from acetyl-CoA (EC:2.3.1.8, 2.7.2.1), propionate production from propanoyl-CoA (EC:2.3.1.8, 2.7.2.1), L-glutamate production from ammonia via L-glutamine (EC:6.3.1.2, 1.4.1.-), folate (vitamin B9) biosynthesis from 7,8-dihydrofolate (EC:1.5.1.3) and riboflavin (vitamin B2) biosynthesis from GTP (EC:3.5.4.25, 3.5.4.26, 1.1.1.193, 3.1.3.104, 4.1.99.12, 2.5.1.78, 2.5.1.9, 2.7.1.26, 2.7.7.2). This species was commonly identified within wastewater microbiome (14.1% of samples, mean relative abundance of 0.01%), bovine gut (6.3% of samples, mean relative abundance of 0.09%) and human gut (2.6% of samples, mean relative abundance of 0.08%) samples. The type strain, Sanger_97T (=DSM 102166T = CCUG 68796T), was isolated from human faeces. The G + C content of genomic DNA is 34.0%.

Description of Brotolimicola gen. nov. Brotolimicola (Gr. masc. n. brotos, a mortal human; L. masc. n. limus, dung; N.L. suffix masc./fem. cola, an inhabitant of; Brotolimicola, a microbe from the faeces of humans). The closest relatives, based on 16S rRNA gene sequence similarity, are Kineothrix alysoidesT (93.1%) and L. multiparaT (92.2%). POCP analysis confirmed the species represents a distinct genus to K. alysoidesT (≤41.5%) and L. multiparaT (≤34.0%). The type species of this genus is Brotolimicola acetigignens.

Description of Brotolimicola acetigignens sp. nov. B. acetigignens (L. neut. n. acetum, vinegar, used to refer to acetic acid; L. v. gignere, to produce; N.L. part. adj. acetigignens, vinegar- or acetic acid-producing). This species contains at least 278 CAZymes and is predicted to utilise arbutin, salicin, cellobiose, starch and cellulose. KEGG analysis identified pathways for acetate production from acetyl-CoA (EC:2.3.1.8, 2.7.2.1), propionate production from propanoyl-CoA (EC:2.3.1.8, 2.7.2.1), sulfide and L-serine utilised to produce L-cysteine and acetate (EC:2.3.1.30, 2.5.1.47), L-glutamate production from ammonia via L-glutamine (EC:6.3.1.2, 1.4.1.-), cobalamin (vitamin B12) biosynthesis from cobinamide (EC:2.5.1.17, 6.3.5.10, 6.2.1.10, 2.7.1.156) and folate (vitamin B9) biosynthesis from 7,8-dihydrofolate (EC:1.5.1.3). The type strain, f_CXYT (=DSM 107528T), was isolated from human faeces. The G + C content of genomic DNA was 46.0% for both isolates. The placement and description of this species are based on two isolates, f_CXYT and f_CSY (=DSM 107475). Only features that were consistent between these two isolates are described above as those of the species.

Description of Brotomerdimonas gen. nov. Brotomerdimonas (Gr. masc. n. brotos, a mortal human; L. fem. n. merda, dung; L. fem. n. monas, a monad; N.L. fem. n. Brotomerdimonas, a microbe from the faeces of humans). Based on 16S rRNA gene sequence similarity, the closest relatives are members of Anaerovorax (Anaerovorax odorimutansT, 90.9%) and Eubacterium (Eubacterium infirmum, 90.8%). POCP analysis confirmed that strain Sanger_12 represents a distinct genus to both Anaerovorax (A. odorimutansT, 41.2%) and Eubacterium (E. infirmum, 37.0%) as all POCP values to close relatives were below 50%. GTDB-Tk supported the creation of a novel genus, placing strain Sanger_12 within the predicted genus ‘UBA1191’. The type species of this genus is Brotomerdimonas butyrica.

Description of Brotomerdimonas butyrica sp. nov. B. butyrica (Gr. neut. n. boutyron (Latin transliteration butyrum); Gr. fem. adj. suff. -ica, suffix used with the sense of belonging to; N.L. fem. adj. butyrica, related to butter, butyric). KEGG analysis identified pathways for acetate production from acetyl-CoA (EC:2.3.1.8, 2.7.2.1), butyrate production from butanoyl-CoA (EC:2.8.3.8), propionate production from propanoyl-CoA (EC:2.3.1.8, 2.7.2.1), folate (vitamin B9) biosynthesis from 7,8-dihydrofolate (EC:1.5.1.3) and riboflavin (vitamin B2) biosynthesis from GTP (EC:3.5.4.25, 3.5.4.26, 1.1.1.193, 3.1.3.104, 4.1.99.12, 2.5.1.78, 2.5.1.9, 2.7.1.26, 2.7.7.2). This species was identified to be a common member of both the pig gut (54.5%) and human gut (40.4%), although sub-dominant at <0.1% mean relative abundance respectively. The type strain, Sanger_12T (=DSM 102176T = LMG 29485T), was isolated from human faeces. The G + C content of genomic DNA is 47.8%.

Description of Brotonthovivens gen. nov. Brotonthovivens (Gr. masc. n. brotos, a mortal human; Gr. masc. n. onthos, dung; L. pres. part. vivens, living; N.L. fem. n. Brotonthovivens, a microbe from the faeces of humans). Based on 16S rRNA gene sequence similarity, the closest relatives are members of Roseburia: Roseburia inulinivorans (94.6%), Roseburia intestinalis (93.6%) and Roseburia hominis (93.5%). As no 16S rRNA gene sequence or genome exists for the type species of the genus, Roseburia cecicolaT, no comparisons could be conducted. POCP analysis confirmed that strain Sanger_109 represents a distinct genus to Roseburia as all POCP values to close relatives were below 50%. GTDB-Tk supported the placement of strain Sanger_109 within the genus ‘Eubacterium_I’ as a representative of ‘Eubacterium_I sp900066595’. The type species of this genus is Brotonthovivens ammoniilytica.

Description of Brotonthovivens ammoniilytica sp. nov. B. ammoniilytica (N.L. neut. n. ammonium, ammonia; N.L. fem. adj. lytica, able to loose, able to dissolve; from Gr. fem. adj. lytike, able to loose, dissolving; N.L. fem. adj. ammoniilytica, ammonia-degrading, to reflect the activity of the bacterium). Within the genome, 138 CAZymes were identified along with predicted utilisation of starch. KEGG analysis identified pathways for acetate production from acetyl-CoA (EC:2.3.1.8, 2.7.2.1), butyrate production from butanoyl-CoA (EC:2.8.3.8), propionate production from propanoyl-CoA (EC:2.3.1.8, 2.7.2.1), L-glutamate production from ammonia via L-glutamine (EC:6.3.1.2, 1.4.1.-), cobalamin (vitamin B12) biosynthesis from cobinamide (EC:2.5.1.17, 6.3.5.10, 6.2.1.10, 2.7.1.156), folate (vitamin B9) biosynthesis from 7,8-dihydrofolate (EC:1.5.1.3) and riboflavin (vitamin B2) biosynthesis from GTP (EC:3.5.4.25, 3.5.4.26, 1.1.1.193, 3.1.3.104, 4.1.99.12, 2.5.1.78, 2.5.1.9, 2.7.1.26, 2.7.7.2). The type strain, Sanger_109T (=DSM 102148T), was isolated from human faeces. The G + C content of genomic DNA is 42.4%.

Description of Clostridium ammoniilyticum sp. nov. Clostridium ammoniilyticum (N.L. neut. n. ammonium, ammonia; N.L. neut. adj. lyticum, able to loose, able to dissolve; from Gr. neut. adj. lytikon, able to loose, dissolving; N.L. neut. adj. ammoniilyticum, ammonia-degrading, to reflect the activity of the bacterium). The species was identified as a member of the genus Clostridium, based on POCP values of >50% to multiple existing species within this genus: Clostridium cocleatum (50.2%), Clostridium saccharogumia (52.9%), and Clostridium spiroforme (53.5%). ANI values to all close relatives were below 95% and 16S rRNA gene sequence similarity values below 94%. Comparison against Clostridium butyricum, the type species of Clostridium, was not conducted as it was not within the 50 most similar species, hence ignored by Protologger. This further highlights the need for the reclassification of Clostridium into multiple genera. Isolates of this species contained an average of 170 CAZymes and were predicted to utilise starch, arbutin, salicin, cellobiose and glucose as carbon sources. KEGG analysis predicted the presence of pathways for propionate production from propanoyl-CoA (EC:2.3.1.8, 2.7.2.1), L-glutamate production from ammonia via L-glutamine (EC:6.3.1.2, 1.4.1.-) and folate (vitamin B9) biosynthesis from 7,8-dihydrofolate (EC:1.5.1.3). The type strain, H4_15T (=DSM 108253T), was isolated from human faeces. Taxonomic placement and description of this species are based on three isolates as both the additional isolates (H6_14 (=DSM 108252) and H3_29 (=DSM 108213)) had ANI values >95% to the type strain. Only features that were consistent between these three isolates are described above as those of the species. The G + C content of genomic DNA is between 29.2 and 29.5%.

Description of Coprococcus ammoniilyticus sp. nov. Coprococcus ammoniilyticus (N.L. neut. n. ammonium, ammonia; N.L. masc. adj. lyticus, able to loose, able to dissolve; from Gr. fem. adj. lytikos, able to loose, dissolving; N.L. masc. adj. ammoniilyticus, ammonia-degrading, to reflect the activity of the bacterium). The species was identified as a member of the genus Coprococcus based on 16S rRNA gene sequence similarity and POCP values of 96.5% and 62.3% to the type species C. eutactus, respectively. ANI values to all close relatives were below 95%, suggesting this isolate as a novel species, which was supported by GTDB-Tk assignment to ‘Coprococcus sp900066115’. An ANI value of 80.4% to ‘Coprococcus aceti’ confirms that these isolates represent two novel species within the genus Coprococcus. Within the genome, 120 CAZymes were identified along with the utilisation of glucose, cellulose and starch. KEGG-based analysis identified the presence of the following pathways: acetate production from acetyl-
CoA (EC:2.3.1.8, 2.7.2.1), propionate production from propanoyl-CoA (EC:2.3.1.8, 2.7.2.1), L-glutamate production from ammonia via L-glutamine (EC:6.3.1.2, 1.4.1.-) and riboflavin (vitamin B2) biosynthesis from GTP (EC:3.5.4.25, 3.5.4.26, 1.1.1.193, 3.1.3.104, 4.1.99.12, 2.5.1.78, 2.5.1.9, 2.7.1.26, 2.7.7.2). The type strain, H1_22T (=DSM 108312T), was isolated from human faeces. The G + C content of genomic DNA is 41.0%.

Description of Coprococcus aceti sp. nov. C. aceti (L. neut. n. acetum, vinegar; L. gen. neut. n. aceti, of vinegar). The species was identified as a member of the genus Coprococcus. The comparison of 16S rRNA gene sequences identified the highest match to the type species of Coprococcus, Coprococcus eutactus (99.4%). GTDB-Tk supported the placement of strain H2_11 within the Coprococcus, placed as ‘Coprococcus eutactus_A’. However, dDDH comparison to the type strain’s genome via TYGS35 confirmed that strain H2_11 belongs to a species distinct from C. eutactus. ANI values to all close relatives were below 95%, confirming this isolate represents a novel species, and an ANI value of 80.4% to ‘C. ammoniilyticus’ confirms that these isolates represent two novel species within the genus Coprococcus. Within the genome, 162 CAZymes were identified along with the utilisation of starch and cellulose. KEGG-based analysis identified the presence of the following pathways: acetate production from acetyl-CoA (EC:2.3.1.8, 2.7.2.1), butyrate production from butanoyl-CoA (EC:2.8.3.8), propionate production from propanoyl-CoA (EC:2.3.1.8, 2.7.2.1), L-glutamate production from ammonia via L-glutamine (EC:6.3.1.2, 1.4.1.-), folate (vitamin B9) biosynthesis from 7,8-dihydrofolate (EC:1.5.1.3) and riboflavin (vitamin B2) biosynthesis from GTP (EC:3.5.4.25, 3.5.4.26, 1.1.1.193, 3.1.3.104, 4.1.99.12, 2.5.1.78, 2.5.1.9, 2.7.1.26, 2.7.7.2). The type strain, H2_11T (=DSM 107541T = JCM 31265T), was isolated from human faeces. The G + C content of genomic DNA is 43.7%.

Description of Dorea acetigenes sp. nov. Dorea acetigenes (L. neut. n. acetum, vinegar; Gr. v. gennaô, to produce; N.L. part. adj. acetigenes, producing acetate). The species was identified as a member of the genus Dorea, based on a POCP value to the type species of the genus, D. formicigeneransT, of 52.3%. GTDB-Tk supported placement of Sanger_03 within the genus Dorea. Strikingly, 16S rRNA gene sequence similarity to Dorea formicigeneransT was only 92.7%. Higher similarity values were obtained to Faecalimonas (Faecalimonas umbilicataT, 95.5%) and Faecalicatena (Faecalicatena contortaT, 94.8%). POCP comparison was highest against D. formicigeneransT (52.3%), while only 42.1% to Faecalicatena contortaT. Separation from the species ‘Dorea amylophila’ and ‘Dorea ammoniilytica’ was confirmed via ANI values of 85.7% and 82.2%, respectively, between the genomes of the type strains. Whilst 224 CAZymes were identified within the genome, only starch was suggested as a carbon source. KEGG analysis identified a total of 91 transporters, 8 secretion genes and 512 enzymes. This included pathways for acetate production from acetyl-CoA (EC:2.3.1.8, 2.7.2.1), propionate production from propanoyl-CoA (EC:2.3.1.8, 2.7.2.1), sulfide and L-serine utilised to produce L-cysteine and acetate (EC:2.3.1.30, 2.5.1.47), L-glutamate production from ammonia via L-glutamine (EC:6.3.1.2, 1.4.1.-), as well as folate (vitamin B9) biosynthesis from 7,8-dihydrofolate (EC:1.5.1.3). The type strain, Sanger_03T (=DSM 102260T), was isolated from human faeces. The G + C content of genomic DNA is 43.6%.

Description of Dorea ammoniilytica sp. nov. D. ammoniilytica (N.L. neut. n. ammonium, ammonia; N.L. fem. adj. lytica, able to loose, able to dissolve; from Gr. fem. adj. lytike, able to loose, dissolving; N.L. fem. adj. ammoniilytica, ammonia-degrading, to reflect the activity of the bacterium). The species was identified as a member of the genus Dorea, based on a POCP value of 56.2% to the type species of the genus, D. formicigeneransT, and of 58.9% to D. longicatena. GTDB-Tk supported placement of the type strain, Sanger_02, as a member of the Dorea. Interestingly, 16S rRNA gene sequence similarity to D. formicigeneransT was only 92.7%. Higher similarity values were obtained to members of the Faecalicatena genus: F. contortaT (96.6%) and F. fissicatena (96.1%). However, POCP values to these species’ genomes were below 50% and above 50% to both D. formicigeneransT (56.2%) and D. longicatena (58.9%). These values confirm that strain Sanger_02 is a novel species of Dorea. Separation from the species ‘D. acetigenes’ and ‘D. amylophila’ was confirmed via ANI values of 82.2% and 80.6%, respectively, between the genomes of the type strains. The genome contained 125 CAZymes, facilitating the predicted utilisation of both starch and glucose as carbon sources. KEGG analysis identified a total of 93 transporters, 7 secretion genes and 513 enzymes within the genome. This included pathways for acetate production from acetyl-CoA (EC:2.3.1.8, 2.7.2.1), butyrate production from butanoyl-CoA (EC:2.8.3.8), propionate production from propanoyl-CoA (EC:2.3.1.8, 2.7.2.1), sulfide and L-serine utilised to produce L-cysteine and acetate (EC:2.3.1.30, 2.5.1.47), L-glutamate production from ammonia via L-glutamine (EC:6.3.1.2, 1.4.1.-), cobalamin (vitamin B12) biosynthesis from cobinamide (EC:2.5.1.17, 6.3.5.10, 6.2.1.10, 2.7.1.156), as well as folate (vitamin B9) biosynthesis from 7,8-dihydrofolate (EC:1.5.1.3). The type strain, Sanger_02T (=DSM 102136T), was isolated from human faeces. The G + C content of genomic DNA is 43.3%.

Description of Dorea amylophila sp. nov. D. amylophila (Gr. neut. n. amylon, starch; N.L. fem. adj. phila (from Gr. fem. adj. phile), loving; N.L. fem. adj. amylophila, starch-loving). The species was identified as a member of the genus Dorea, based on a 16S rRNA gene sequence similarity of 100.0% to Dorea longicatena and 95.4% to the type species of this genus, D. formicigeneransT. This is supported by POCP values of 74.8% and 62.3%, respectively. While the 16S rRNA gene sequence similarity suggests this isolate is a strain of D. longicatena, the ANI value between these genomes was 91.2%, suggesting it as a novel species. GTDB-Tk identified the isolate as ‘Dorea longicatena_B’. This is due to the need to split D. longicatena into two distinct species based on genomic comparison. Separation from the species ‘D. acetigenes’ and ‘D. ammoniilytica’ was confirmed via ANI values of 85.7% and 80.6%, respectively, between the genomes of the type strains. Within the genome, 160 CAZymes were identified along with the predicted utilisation of glucose, arbutin, salicin, trehalose and starch as carbon sources. KEGG analysis identified a total of 97 transporters, 10 secretion genes and 507 enzymes. This included pathways for acetate production from acetyl-CoA (EC:2.3.1.8, 2.7.2.1), butyrate production from butanoyl-CoA (EC:2.8.3.8), propionate production from propanoyl-CoA (EC:2.3.1.8, 2.7.2.1), sulfide and L-serine utilised to produce L-cysteine and acetate (EC:2.3.1.30, 2.5.1.47), L-glutamate production from ammonia via L-glutamine (EC:6.3.1.2, 1.4.1.-), folate (vitamin B9) biosynthesis from 7,8-dihydrofolate (EC:1.5.1.3) and riboflavin (vitamin B2) biosynthesis from GTP (EC:3.5.4.25, 3.5.4.26, 1.1.1.193, 3.1.3.104, 4.1.99.12, 2.5.1.78, 2.5.1.9, 2.7.1.26, 2.7.7.2). The type strain, H5_25T (=DSM 108233T), was isolated from human faeces. The G + C content of genomic DNA is 41.2%.

Description of Faecalicatena acetigenes sp. nov. Faecalicatena acetigenes (L. neut. n. acetum, vinegar; Gr. v. gennaô, to produce; N.L. part. adj. acetigenes, producing acetate). The species was identified as a member of the genus Faecalicatena. The comparison of 16S rRNA gene sequences identified matches to members of three genera: Mediterraneibacter (Mediterraneibacter glycyrrhizinilyticus, 95.3%), Ruminococcus (Ruminococcus torques, 95.3%) and Faecalicatena (F. orotica, 94.9%, F. contortaT, 94.9%, Faecalicatena fissicatena, 94.8%). The similarity to these three genera was supported by POCP values above 50% to members of each genus, including Mediterraneibacter (M. glycyrrhizinilyticus, 55.8%), Ruminococcus (R. torques, 57%) and Faecalicatena (F. fissicatena, 52.2%). GTDB-Tk assigned the isolate to ‘Faecalicatena sp001487105’, an as-yet-uncultured member of this genus. Due to the lack of sequence similarity, either at the 16S rRNA gene sequence or genome level, to the type species of either Ruminococcus or Mediterraneibacter, the isolate was placed within the Faecalicatena. Within the genome, 140 CAZymes were identified along with the utilisation of starch and glucose. KEGG-based analysis identified the presence of the following pathways: acetate production from acetyl-CoA (EC:2.3.1.8, 2.7.2.1), propionate production from propanoyl-CoA (EC:2.3.1.8, 2.7.2.1), sulfide and L-serine utilised to produce L-cysteine and acetate (EC:2.3.1.30, 2.5.1.47), L-glutamate production from ammonia via L-glutamine (EC:6.3.1.2, 1.4.1.-) and folate (vitamin B9) biosynthesis from 7,8-dihydrofolate (EC:1.5.1.3). The type strain, H2_18T (=DSM 108153T), was isolated from human faeces. The G + C content of genomic DNA is 42.4%.

Description of Gallintestinimicrobium gen. nov. Gallintestinimicrobium (L. masc. n. gallus, a chicken; L. neut. n. intestinum, the gut; N.L. neut. n. microbium, a microbe; N.L. neut. n. Gallintestinimicrobium, a microbe frequently occurring in the intestines of chickens). This taxon is named referring to the microbial ecosystem with combined prevalence and mean relative abundance for this taxon, although originally isolated from human faeces. Based on 16S rRNA gene sequence similarity, the closest relatives are members of Eisenbergiella (Eisenbergiella tayi, 93.7%) and Enterocloster (Enterocloster bolteae, 93.3%; Enterocloster clostridioformis, 93.1%; Enterocloster aldenensis, 93%). POCP analysis confirmed that strain Sanger_16 represents a distinct genus to both Eisenbergiella and Enterocloster as all
POCP values to close relatives were below 50%, including 42.1% to E. tayi. The type species of this genus is Gallintestinimicrobium propionicum.

Description of Gallintestinimicrobium propionicum sp. nov. G. propionicum (N.L. neut. n. acidum propionicum, propionic acid; L. neut. suff. -icum, suffix used with the sense of pertaining to; N.L. neut. adj. propionicum, pertaining to propionic acid). Within the genome, 289 CAZymes were identified along with the predicted utilisation of arbutin, salicin, trehalose and starch. KEGG analysis identified pathways for acetate production from acetyl-CoA (EC:2.3.1.8, 2.7.2.1), propionate production from propanoyl-CoA (EC:2.3.1.8, 2.7.2.1), sulfide and L-serine utilised to produce L-cysteine and acetate (EC:2.3.1.30, 2.5.1.47), L-glutamate production from ammonia via L-glutamine (EC:6.3.1.2, 1.4.1.-), cobalamin (vitamin B12) biosynthesis from cobinamide (EC:2.5.1.17, 6.3.5.10, 6.2.1.10, 2.7.1.156) and folate (vitamin B9) biosynthesis from 7,8-dihydrofolate (EC:1.5.1.3). This species was most commonly identified within chicken gut samples (50.7%) at 1.48% mean relative abundance. The type strain, Sanger_16T (=DSM 102825T), was isolated from human faeces. The G + C content of genomic DNA is 47.4%.

Description of Hominimerdicola gen. nov. Hominimerdicola (L. masc. n. homo, a human being; L. fem. n. merda, dung; N.L. masc./fem. suff. -cola, an inhabitant of; N.L. fem. n. Hominimerdicola, a microbe from the faeces of humans). The type isolate was identified to match the previously described, but not validated, species ‘Ruminococcus bicirculans’ based on 16S rRNA gene sequence similarity (99.7%) and GTDB-Tk placement. While originally placed within the Ruminococcus, strain Sanger_31 represents a novel genus based on POCP and GTDB-Tk. POCP analysis confirmed that strain Sanger_31 represents a distinct genus to both Ruminococcus (Ruminococcus flavefaciensT, 34.5%) and all other close relatives as all POCP values were below 50%. GTDB-Tk supported the creation of a novel genus, placing strain Sanger_31 within ‘Ruminococcus_D’, a sub-division of the existing Ruminococcus genus. The type species of this genus is Hominimerdicola aceti.

Description of Hominimerdicola aceti sp. nov. H. aceti (L. neut. n. acetum, vinegar; L. gen. neut. n. aceti, of vinegar). The number of CAZymes identified within the type strain’s genome was 168, facilitating the predicted utilisation of glucose, cellulose and starch. KEGG analysis identified pathways for propionate production from propanoyl-CoA (EC:2.3.1.222, 2.7.2.1), sulfide and L-serine utilised to produce L-cysteine and acetate (EC:2.3.1.30, 2.5.1.47) and L-glutamate production from ammonia via L-glutamine (EC:6.3.1.2, 1.4.1.-). The production of acetate by the 80/3 strain of this species has previously been demonstrated76. The type strain, Sanger_31T (=DSM 102216T), was isolated from human faeces. The G + C content of genomic DNA is 42.3%.

Description of Hoministercoradaptatus gen. nov. Hoministercoradaptatus (L. masc. n. homo, a human being; L. neut. n. stercus, dung; L. past part. adaptatus, adapted to; N.L. masc. n. Hoministercoradaptatus, a microbe from the faeces of humans). This isolate was identified as a distinct genus to its closest relatives, Blautia spp., based on 16S rRNA gene sequence similarity and POCP values to the type species, B. coccoides, of 93.8% and 46.0%, respectively. This was confirmed via GTDB-Tk assigning strain Sanger_32 as a member of ‘Ruminococcus_A’, a genus distinct to that of Blautia. The type species of this proposed genus is Hoministercoradaptatus ammoniilyticus.

Description of Hoministercoradaptatus ammoniilyticus sp. nov. H. ammoniilyticus (N.L. neut. n. ammonium, ammonia; N.L. masc. adj. lyticus, able to loose, able to dissolve; from Gr. fem. adj. lytikos, able to loose, dissolving; N.L. masc. adj. ammoniilyticus, ammonia-degrading, to reflect the activity of the bacterium). The species contains at least 225 CAZymes; however, only starch was suggested as a carbon source. KEGG-based analysis identified the presence of pathways for acetate production from acetyl-CoA (EC:2.3.1.8, 2.7.2.1), propionate production from propanoyl-CoA (EC:2.3.1.8, 2.7.2.1), L-glutamate production from ammonia via L-glutamine (EC:6.3.1.2, 1.4.1.-), as well as folate (vitamin B9) biosynthesis from 7,8-dihydrofolate (EC:1.5.1.3). The G + C content of genomic DNA for this species was between 43.7 and 43.9%. The type strain, Sanger_32T (=DSM 102174T), was isolated from human faeces. A second isolate (f_RGN (=DSM 107527)) was also isolated from human faeces with an ANI value of 98.0% to the type strain. Only features that were consistent between these two isolates are described above as those of the species.

Description of Huintestinicola gen. nov. Huintestinicola (Gr. masc./fem. n. hus, a pig; L. neut. n. intestinum, the gut; N.L. masc./fem. suff. -cola, an inhabitant of; N.L. fem. n. Huintestinicola, a microbe frequently occurring in the intestines of pigs). This taxon is named referring to the microbial ecosystem with combined prevalence and mean relative abundance for this taxon, although originally isolated from human faeces. Based on 16S rRNA gene sequence similarity, the closest relatives are members of the genera Ruminococcus (Ruminococcus callidus, 91.9%; R. flavefaciensT, 91%) and Clostridium (Clostridium methylpentosum, 91.1%). POCP analysis confirmed that strain Sanger_06 represents a distinct genus to both Ruminococcus (R. flavefaciensT, 35.5%) and Clostridium (C. methylpentosum, 33.1%) as all POCP values to close relatives were below 50%. GTDB-Tk supported the creation of a novel genus, placing strain Sanger_06 within the predicted genus ‘CAG-353’. The type species of this genus is H. butyrica.

Description of Huintestinicola butyrica sp. nov. H. butyrica (Gr. neut. n. boutyron (Latin transliteration butyrum); Gr. fem. adj. suff. -ica, suffix used with the sense of belonging to; N.L. fem. adj. butyrica, related to butter, butyric). The number of CAZymes identified within the type strain’s genome was 180, facilitating the predicted utilisation of cellulose and starch. KEGG analysis identified pathways for butyrate production from butanoyl-CoA (EC:2.8.3.8), sulfide and L-serine utilised to produce L-cysteine and acetate (EC:2.3.1.30, 2.5.1.47), L-glutamate production from ammonia via L-glutamine (EC:6.3.1.2, 1.4.1.-) and cobalamin (vitamin B12) biosynthesis from cobinamide (EC:2.5.1.17, 6.3.5.10, 6.2.1.10, 2.7.1.156). This species was most commonly identified within pig gut samples (29.2%), although sub-dominant at only 0.1% mean relative abundance. The type strain, Sanger_06T (=DSM 102115T), was isolated from human faeces. The G + C content of genomic DNA is 46.5%.

Description of Laedolimicola gen. nov. Laedolimicola (Gr. masc. n. laedos, an unknown bird; L. masc. n. limus, dung; N.L. masc./fem. suff. -cola, an inhabitant of; N.L. fem. n. Laedolimicola, a microbe frequently occurring in the faeces of birds). This taxon is named referring to the microbial ecosystem with combined prevalence and mean relative abundance for this taxon, although originally isolated from human faeces. Based on 16S rRNA gene sequence similarity, the closest relatives are members of the genera Faecalicatena (Faecalicatena orotica, 94.3%; F. fissicatena, 94.1%; F. contortaT, 93.9%) and Eubacterium (Eubacterium oxidoreducens, 94.1%). POCP analysis confirmed that strain Sanger_04 represents a distinct genus to both Faecalicatena and Eubacterium as all POCP values to close relatives were below 50%. GTDB-Tk supported the creation of a novel genus, placing strain Sanger_04 within the predicted genus ‘GCA-900066575’. The type species of this genus is L. ammoniilytica.

Description of Laedolimicola ammoniilytica sp. nov. L. ammoniilytica (N.L. neut. n. ammonium, ammonia; N.L. fem. adj. lytica, able to loose, able to dissolve; from Gr. fem. adj. lytike, able to loose, dissolving; N.L. fem. adj. ammoniilytica, ammonia-degrading, to reflect the activity of the bacterium). Within the genome, 186 CAZymes were identified along with the predicted utilisation of glucose, arbutin, salicin, cellobiose, maltose and starch. KEGG analysis identified pathways for propionate production from propanoyl-CoA (EC:2.3.1.222, 2.7.2.1), sulfide and L-serine utilised to produce L-cysteine and acetate (EC:2.3.1.30, 2.5.1.47), L-glutamate production from ammonia via L-glutamine (EC:6.3.1.2, 1.4.1.-), cobalamin (vitamin B12) biosynthesis from cobinamide (EC:2.5.1.17, 6.3.5.10, 6.2.1.10, 2.7.1.156) and folate (vitamin B9) biosynthesis from 7,8-dihydrofolate (EC:1.5.1.3). This species was most commonly identified within chicken gut samples (65.7%), although sub-dominant at 0.35% mean relative abundance. The type strain, Sanger_04T (=DSM 102317T), was isolated from human faeces. The G + C content of genomic DNA is 49.8%.

Description of Megasphaera butyrica sp. nov. Megasphaera butyrica (Gr. neut. n. boutyron (Latin transliteration butyrum); Gr. fem. adj. suff. -ica, suffix used with the sense of belonging to; N.L. fem. adj. butyrica, related to butter, butyric). The species was identified as a member of the genus Megasphaera based on POCP values >50% to five existing species belonging to this genus, including the type species Megasphaera elsdenii (69.5%). ANI values to all close relatives were below 95%, suggesting this isolate as a novel species. This was confirmed by GTDB-Tk identification as ‘Megasphaera sp900066485’. Within the genome, 126 CAZymes were identified along with the utilisation of starch, glucose and sucrose. KEGG-based analysis identified the presence of the following pathways: butyrate production from butanoyl-CoA (EC:2.8.3.8), sulfide and L-serine utilised to produce L-cysteine and acetate (EC:2.3.1.30, 2.5.1.47) and cobalamin

(vitamin B12) biosynthesis from cobinamide (EC:2.5.1.17, 6.3.5.10, 6.2.1.10, 2.7.1.156). The type strain, Sanger_24T (=DSM 102144T), was isolated from human faeces. The G + C content of genomic DNA is 52.6%.

Description of Muricoprocola gen. nov. Muricoprocola (L. masc. n. mus, a mouse; Gr. fem. n. kopros, dung; N.L. masc./fem. suff. -cola, an inhabitant of; N.L. fem. n. Muricoprocola, a microbe frequently occurring in the faeces of mice). This taxon is named referring to the microbial ecosystem with combined prevalence and mean relative abundance for this taxon, although originally isolated from human faeces. Based on 16S rRNA gene sequence similarity, the closest relatives are members of the genus Lacrimispora (Lacrimispora saccharolytica, 93.9%; L. amygdalina, 93.7%; Lacrimispora indolis, 93.3%; L. sphenoides, 93.3%). POCP analysis confirmed that strain Sanger_29 represents a distinct genus as all POCP values to close relatives were below 50%. The type species of this genus is Muricoprocola aceti.

Description of Muricoprocola aceti sp. nov. M. aceti (L. neut. n. acetum, vinegar; L. gen. neut. n. aceti, of vinegar). Within the genome, 158 CAZymes were identified along with the predicted utilisation of cellulose and starch. KEGG analysis identified pathways for acetate production from acetyl-CoA (EC:2.3.1.8, 2.7.2.1), propionate production from propanoyl-CoA (EC:2.3.1.8, 2.7.2.1), L-glutamate production from ammonia via L-glutamine (EC:6.3.1.2, 1.4.1.-), cobalamin (vitamin B12) biosynthesis from cobinamide (EC:2.5.1.17, 6.3.5.10, 6.2.1.10, 2.7.1.156) and folate (vitamin B9) biosynthesis from 7,8-dihydrofolate (EC:1.5.1.3). This species was most commonly identified within mouse gut samples (7.9%) at 1.95% mean relative abundance. The type strain, Sanger_29T (=DSM 102151T), was isolated from human faeces. The G + C content of genomic DNA is 43.0%.

Description of Muriventricola gen. nov. Muriventricola (L. masc. n. mus, a mouse; L. masc. n. venter, the belly; N.L. masc./fem. suff. -cola, an inhabitant of; N.L. fem. n. Muriventricola, a microbe frequently occurring in the intestines of mice). This taxon is named referring to the microbial ecosystem with combined prevalence and mean relative abundance for this taxon, although originally isolated from human faeces. The closest relatives, based on 16S rRNA gene sequence similarity, are Pseudoflavonifractor capillosus (95.2%) and Flavonifractor plautii (94.9%). POCP analysis confirmed the species as representing a distinct genus to P. capillosus (<46.5%) and F. plautii (<46.3%). This is supported by the genome tree, which shows clear separation from all close relatives. The type species of this genus is Muriventricola aceti.

Description of Muriventricola aceti sp. nov. M. aceti (L. neut. n. acetum, vinegar; L. gen. neut. n. aceti, of vinegar). The number of CAZymes identified within the genome of strains of this species ranged from 105–116. Isolates were predicted to utilise starch. KEGG analysis identified pathways for acetate production from acetyl-CoA (EC:2.3.1.8, 2.7.2.1), propionate production from propanoyl-CoA (EC:2.3.1.8, 2.7.2.1), sulfide and L-serine utilised to produce L-cysteine and acetate (EC:2.3.1.30, 2.5.1.47) as well as folate (vitamin B9) biosynthesis from 7,8-dihydrofolate (EC:1.5.1.3). This species was most commonly identified within mouse gut samples (67.7%) at 1.0% mean relative abundance. The type strain, H5_61T (=DSM 108267T), was isolated from human faeces. The G + C content of genomic DNA ranges from 55.8 to 56.3%. The placement and description of this species is based on two isolates, H5_61T and H4_62 (=DSM 108264). Only features that were consistent between these two isolates are described above as those of the species.

Description of Oscillibacter acetigenes sp. nov. O. acetigenes (L. neut. n. acetum, vinegar; Gr. v. gennaô, to produce; N.L. part. adj. acetigenes, producing acetate). The species was identified as a member of the genus Oscillibacter, based on 16S rRNA gene sequence similarity of 95.5% to both existing members of the genus, Oscillibacter ruminantium and Oscillibacter valericigenesT. This was supported by POCP values of 52.3 and 47.0% to each member, respectively. ANI values to both members were below 95%, confirmed by GTDB-Tk identification as ‘Oscillibacter-sp900066435’. This species contains an average of 111 CAZymes. Starch was uniquely predicted as a carbon source. Genomic prediction identified the following pathways: acetate production from acetyl-CoA (EC:2.3.1.8, 2.7.2.1), propionate production from propanoyl-CoA (EC:2.3.1.8, 2.7.2.1), sulfide and L-serine utilised to produce L-cysteine and acetate (EC:2.3.1.30, 2.5.1.47). The type strain, H4_59T (=DSM 108348T), was isolated from human faeces. A second isolate (H2_39 (=DSM 108347)) was also isolated from human faeces with an ANI value of 99.0% to the type strain. The G + C content of genomic DNA is between 59.2 and 59.9%. Only features that were consistent between these two isolates are described above as those of the species.

Description of Phocaeicola fibrisolvens sp. nov. P. fibrisolvens (L. fem. n. fibra, a fibre, filament; L. pres. part. solvens, dissolving; N.L. part. adj. fibrisolvens, fibre-dissolving). The species was identified as a member of the newly proposed genus Phocaeicola. The comparison of 16S rRNA gene sequences identified the highest matches to existing members of this genus, including Phocaeicola coprocola (94.6%), and low similarity to the type species of the next closest genus Bacteroides (B. fragilisT, 89.2%). POCP analysis also confirmed the placement of strain Sanger_21 within the genus Phocaeicola with values >50% to multiple members, the highest being 73.0% to Phocaeicola plebeius. ANI values to all close relatives were below 95%. Within the genome, 372 CAZymes were identified along with the utilisation of starch. KEGG-based analysis identified the presence of the following pathways: acetate production from acetyl-CoA (EC:2.3.1.8, 2.7.2.1), propionate production from propanoyl-CoA (EC:2.3.1.8, 2.7.2.1), L-glutamate production from ammonia via L-glutamine (EC:6.3.1.2, 1.4.1.-), folate (vitamin B9) biosynthesis from 7,8-dihydrofolate (EC:1.5.1.3) and riboflavin (vitamin B2) biosynthesis from GTP (EC:3.5.4.25, 3.5.4.26, 1.1.1.193, 3.1.3.104, 4.1.99.12, 2.5.1.78, 2.5.1.9, 2.7.1.26, 2.7.7.2). The type strain, Sanger_21T (=DSM 102146T), was isolated from human faeces. The G + C content of genomic DNA is 46.4%.

Description of Porcipelethomonas gen. nov. Porcipelethomonas (L. masc. n. porcus, a piglet; Gr. masc. n. pelethos, dung; L. fem. n. monas, a monad; N.L. fem. n. Porcipelethomonas, a microbe frequently occurring in the faeces of pigs). This taxon is named referring to the microbial ecosystem with combined prevalence and mean relative abundance for this taxon, although originally isolated from human faeces. Based on 16S rRNA gene sequence similarity, the closest relatives are Ruminococcus (R. flavefaciensT, 93.1%; Ruminococcus champanellensis, 92.9%; R. callidus, 92.2%). POCP analysis confirmed that strain Sanger_90 represents a distinct genus to Ruminococcus (R. flavefaciensT, 40.3%; R. champanellensis, 46.7%; R. callidus, 45.4%) as all POCP values to close relatives were below 50%. GTDB-Tk supported the creation of a novel genus, placing strain Sanger_90 within the predicted genus ‘UBA1394’. The type species of this genus is Porcipelethomonas ammoniilytica.

Description of Porcipelethomonas ammoniilytica sp. nov. P. ammoniilytica (N.L. neut. n. ammonium, ammonia; N.L. fem. adj. lytica, able to loose, able to dissolve; from Gr. fem. adj. lytike, able to loose, dissolving; N.L. fem. adj. ammoniilytica, ammonia-degrading, to reflect the activity of the bacterium). KEGG analysis identified pathways for propionate production from propanoyl-CoA (EC:2.3.1.222, 2.7.2.1), sulfide and L-serine utilised to produce L-cysteine and acetate (EC:2.3.1.30, 2.5.1.47) and L-glutamate production from ammonia via L-glutamine (EC:6.3.1.2, 1.4.1.-). This species was most commonly identified within pig gut (26.3%) and human gut (7.4%) samples, although sub-dominant at only 0.1% and 0.2% mean relative abundance, respectively. The type strain, Sanger_90T (=DSM 102167T), was isolated from human faeces. The G + C content of genomic DNA is 36.9%.

Description of Roseburia amylophila sp. nov. R. amylophila (Gr. neut. n. amylon, starch; N.L. fem. adj. phila (from Gr. fem. adj. phile) loving; N.L. fem. adj. amylophila, starch-loving). The species was identified as a member of the genus Roseburia based on POCP values >50% to four existing species belonging to this genus; however, no genome exists for the type species R. cecicola. ANI values to all close relatives were below 95%, suggesting that this isolate represents a novel species. Inconsistency in taxonomic assignment was observed between methods as the highest match based on 16S rRNA gene sequence similarity was Eubacterium oxidoreducens (95.5%) and GTDB assignment was to the novel genus ‘CAG-45’. However, after E. oxidoreducens, the next best matches based on 16S rRNA gene similarity were to Roseburia faecis (95.3%), R. hominis (95.1%) and R. intestinalis (95.1%). Placement within the genome tree also showed strain Sanger_19 to be monophyletic with the existing Roseburia species. Based on the genome tree, we observed that Eubacterium is a taxonomically incongruent genus and requires reclassification. Within the genome, 178 CAZymes were identified along with the utilisation of arbutin, salicin, sucrose and starch. KEGG-based analysis identified the presence of the following pathways: propionate production from propanoyl-CoA (EC:2.3.1.8, 2.7.2.1), sulfide and L-serine utilised to produce L-cysteine and acetate (EC:2.3.1.30, 2.5.1.47), L-glutamate

production from ammonia via L-glutamine (EC:6.3.1.2, 1.4.1.-) and folate (vitamin B9) biosynthesis from 7,8-dihydrofolate (EC:1.5.1.3). The type strain, Sanger_19T (=DSM 102150T), was isolated from human faeces. The G + C content of genomic DNA is 40.7%.

Description of Suilimivivens gen. nov. Suilimivivens (L. masc. n. sus, a pig; L. masc. n. limus, dung; L. pres. part. vivens, living; N.L. fem. n. Suilimivivens, a microbe frequently occurring in the faeces of pigs). This taxon is named referring to the microbial ecosystem with combined prevalence and mean relative abundance for this taxon, although originally isolated from human faeces. Based on 16S rRNA gene sequence similarity, the closest relatives are members of the genera Kineothrix (K. alysoidesT, 93.5%) and Eisenbergiella (E. tayiT, 92.1%). POCP analysis confirmed that strain Sanger_18 represents a distinct genus to both Kineothrix (K. alysoidesT, 44.9%) and Eisenbergiella (E. tayiT, 35.8%) as all POCP values to close relatives were below 50%. GTDB-Tk supported the creation of a novel genus, placing strain Sanger_18 within the predicted genus ‘CAG-95’. The type species of this genus is Suilimivivens aceti.

Description of Suilimivivens aceti sp. nov. S. aceti (L. neut. n. acetum, vinegar; L. gen. neut. n. aceti, of vinegar). The number of CAZymes identified within the type strain's genome was 242, facilitating the predicted utilisation of cellulose and starch. KEGG analysis identified pathways for acetate production from acetyl-CoA (EC:2.3.1.8, 2.7.2.1), propionate production from propanoyl-CoA (EC:2.3.1.8, 2.7.2.1), sulfide and L-serine utilised to produce L-cysteine and acetate (EC:2.3.1.30, 2.5.1.47), L-glutamate production from ammonia via L-glutamine (EC:6.3.1.2, 1.4.1.-) and folate (vitamin B9) biosynthesis from 7,8-dihydrofolate (EC:1.5.1.3). This species was most commonly identified within pig gut samples (17.3%), although sub-dominant at only 0.5% mean relative abundance. The type strain, Sanger_18T (=DSM 102261T), was isolated from human faeces. The G + C content of genomic DNA is 44.4%.

Description of Suonthocola gen. nov. Suonthocola (L. masc. n. sus, a pig; Gr. masc. n. onthos, dung; N.L. masc./fem. suff. -cola, an inhabitant of; N.L. fem. n. Suonthocola, a microbe frequently occurring in the faeces of pigs). This taxon is named referring to the microbial ecosystem with combined prevalence and mean relative abundance for this taxon, although originally isolated from human faeces. This isolate was identified as a distinct genus to its closest relatives, Clostridium nexile (93.7%) and Lactonifactor longoviformis (93.7%), based on 16S rRNA gene sequence similarity. This was supported by POCP values <50% against all close relatives. Further confirmation was obtained by GTDB-Tk assignment as an unnamed genus within the Lachnospiraceae. The type species of this genus is Suonthocola fibrivorans.

Description of Suonthocola fibrivorans sp. nov. Suonthocola fibrivorans (L. fem. n. fibra, fibre; L. v. vorare, to devour; N.L. part. adj. fibrivorans, fibre-devouring). Within the isolate's genome, 388 CAZymes were identified along with the predicted use of starch and cellobiose as carbon sources. KEGG analysis identified a total of 132 transporters, 16 secretion genes and 656 enzymes. This included pathways for acetate production from acetyl-CoA (EC:2.3.1.8, 2.7.2.1), propionate production from propanoyl-CoA (EC:2.3.1.8, 2.7.2.1), sulfide and L-serine utilised to produce L-cysteine and acetate (EC:2.3.1.30, 2.5.1.47), L-glutamate production from ammonia via L-glutamine (EC:6.3.1.2, 1.4.1.-), as well as folate (vitamin B9) biosynthesis from 7,8-dihydrofolate (EC:1.5.1.3). This species was most commonly identified within pig gut samples (9.6%), although sub-dominant at only 0.11% mean relative abundance. The type strain, Sanger_33T (=DSM 102154T), was isolated from human faeces. The G + C content of genomic DNA is 48.7%.

REFERENCES

1. Parte, A. C. LPSN - List of prokaryotic names with standing in nomenclature (Bacterio.net), 20 years on. Int. J. Syst. Evol. Microbiol. 68, 1825–1829 (2018).
2. Lagkouvardos, I. et al. The Mouse Intestinal Bacterial Collection (miBC) provides host-specific insight into cultured diversity and functional potential of the gut microbiota. Nat. Microbiol. 1, 16131 (2016).
3. Seshadri, R. et al. Cultivation and sequencing of rumen microbiome members from the Hungate1000 Collection. Nat. Biotechnol. 36, 359–367 (2018).
4. Bai, Y. et al. Functional overlap of the Arabidopsis leaf and root microbiota. Nature 528, 364–369 (2015).
5. Diakite, A. et al. Extensive culturomics of 8 healthy samples enhances metagenomics efficiency. PLoS One 14, 1–12 (2019).
6. Forster, S. C. et al. A human gut bacterial genome and culture collection for improved metagenomic analyses. Nat. Biotechnol. 37, 186–192 (2019).
7. Poyet, M. et al. A library of human gut bacterial isolates paired with longitudinal multiomics data enables mechanistic microbiome research. Nat. Med. 25, 1442–1452 (2019).
8. Liu, C. et al. The Mouse Gut Microbial Biobank expands the coverage of cultured bacteria. Nat. Commun. 11, 79 (2020). https://doi.org/10.1038/s41467-019-13836-5.
9. Wylensek, D. et al. A collection of bacterial isolates from the pig intestine reveals functional and taxonomic diversity. Nat. Commun. 11, 6389 (2020). https://doi.org/10.1038/s41467-020-19929-w.
10. Armanhi, J. S. L. et al. A community-based culture collection for targeting novel plant growth-promoting bacteria from the sugarcane microbiome. Front. Plant Sci. 8, 1–17 (2018).
11. Martins, A. et al. Photoprotective bioactivity present in a unique marine bacteria collection from Portuguese deep sea hydrothermal vents. Mar. Drugs 11, 1506–1523 (2013).
12. Liu, J. et al. Proliferation of hydrocarbon-degrading microbes at the bottom of the Mariana Trench. Microbiome 7, 1–13 (2019).
13. Pasolli, E. et al. Extensive unexplored human microbiome diversity revealed by over 150,000 genomes from metagenomes spanning age, geography, and lifestyle. Cell 176, 649–662 (2019).
14. Almeida, A. et al. A new genomic blueprint of the human gut microbiota. Nature 1, 1 (2019).
15. Spang, A. et al. Complex archaea that bridge the gap between prokaryotes and eukaryotes. Nature 521, 173–179 (2015).
16. Parker, C. T., Tindall, B. J. & Garrity, G. M. International code of nomenclature of Prokaryotes. Int. J. Syst. Evol. Microbiol. 69, S1 (2019).
17. Lagkouvardos, I. et al. Sequence and cultivation study of Muribaculaceae reveals novel species, host preference, and functional potential of this yet undescribed family. Microbiome 7, 1–15 (2019).
18. Chaumeil, P.-A., Mussig, A. J., Hugenholtz, P. & Parks, D. H. GTDB-Tk: a toolkit to classify genomes with the Genome Taxonomy Database. Bioinformatics 36, 1925–1927 (2019).
19. Aylward, F. O. & Santoro, A. E. Heterotrophic Thaumarchaea with small genomes are widespread in the Dark Ocean. mSystems 5, 1–20 (2020).
20. Almeida, A. et al. A unified catalog of 204,938 reference genomes from the human gut microbiome. Nat. Biotechnol. https://doi.org/10.1038/s41587-020-0603-3 (2020).
21. Nayfach, S. et al. A genomic catalog of Earth’s microbiomes. Nat. Biotechnol. https://doi.org/10.1038/s41587-020-0718-6 (2020).
22. Tindall, B. J. Proposals to update and make changes to the Bacteriological Code. Int. J. Syst. Bacteriol. 49, 1309–1312 (1999).
23. Whitcomb, R. F. Proposal of minimal standards for descriptions of new species of the class Mollicutes. Int. J. Syst. Bacteriol. 29, 172–180 (1979).
24. Logan, N. A. et al. Proposed minimal standards for describing new taxa of aerobic, endospore-forming bacteria. Int. J. Syst. Evol. Microbiol. 59, 2114–2121 (2009).
25. Mattarelli, P. et al. Recommended minimal standards for description of new taxa of the genera Bifidobacterium, Lactobacillus and related genera. Int. J. Syst. Evol. Microbiol. 64, 1434–1451 (2014).
26. Tu, Q. & Lin, L. Gene content dissimilarity for subclassification of highly similar microbial strains. BMC Genomics 17, 1–11 (2016).
27. Jain, C., Rodriguez-R, L. M., Phillippy, A. M., Konstantinidis, K. T. & Aluru, S. High throughput ANI analysis of 90K prokaryotic genomes reveals clear species boundaries. Nat. Commun. 9, 5114 (2018).
28. Konstantinidis, K. T. & Tiedje, J. M. Towards a genome-based taxonomy for prokaryotes. J. Bacteriol. 187, 6258–6264 (2005).
29. Auch, A. F., von Jan, M., Klenk, H. P. & Göker, M. Digital DNA-DNA hybridization for microbial species delineation by means of genome-to-genome sequence comparison. Stand. Genomic Sci. 2, 117–134 (2010).
30. Qin, Q. L. et al. A proposed genus boundary for the prokaryotes based on genomic insights. J. Bacteriol. 196, 2210–2215 (2014).
31. Meier-Kolthoff, J. P., Klenk, H. P. & Göker, M. Taxonomic use of DNA G+C content and DNA-DNA hybridization in the genomic age. Int. J. Syst. Evol. Microbiol. 64, 352–356 (2014).
32. Segata, N., Börnigen, D., Morgan, X. C. & Huttenhower, C. PhyloPhlAn is a new method for improved phylogenetic and taxonomic placement of microbes. Nat. Commun. 4, 1–10 (2013).
33. Parks, D. H. et al. A standardized bacterial taxonomy based on genome phylogeny substantially revises the tree of life. Nat. Biotechnol. 36, 996 (2018).
34. Rodriguez-R, L. M. et al. The Microbial Genomes Atlas (MiGA) webserver: Taxonomic and gene diversity analysis of Archaea and Bacteria at the whole genome level. Nucleic Acids Res. 46, W282–W288 (2018).

35. Meier-Kolthoff, J. P. & Göker, M. TYGS is an automated high-throughput platform for state-of-the-art genome-based taxonomy. Nat. Commun. 10, 2182 (2019). https://doi.org/10.1038/s41467-019-10210-3.
36. Henz, S. R., Huson, D. H., Auch, A. F., Nieselt-Struwe, K. & Schuster, S. C. Whole-genome prokaryotic phylogeny. Bioinformatics 21, 2329–2335 (2005).
37. Meier-Kolthoff, J. P., Auch, A. F., Klenk, H.-P. & Göker, M. Genome sequence-based species delimitation with confidence intervals and improved distance functions. BMC Bioinform 14, 60 (2013).
38. Chun, J. et al. Proposed minimal standards for the use of genome data for the taxonomy of prokaryotes. Int. J. Syst. Evol. Microbiol. 68, 461–466 (2018).
39. Woodcroft, B. J. et al. Genome-centric view of carbon processing in thawing permafrost. Nature 560, 49–54 (2018).
40. Crits-Christoph, A., Diamond, S., Butterfield, C. N., Thomas, B. C. & Banfield, J. F. Novel soil bacteria possess diverse genes for secondary metabolite biosynthesis. Nature 558, 440–444 (2018).
41. Dombrowski, N., Teske, A. P. & Baker, B. J. Expansive microbial metabolic versatility and biodiversity in dynamic Guaymas Basin hydrothermal sediments. Nat. Commun. 9, 4999 (2018). https://doi.org/10.1038/s41467-018-07418-0.
42. Tully, B. J., Graham, E. D. & Heidelberg, J. F. The reconstruction of 2,631 draft metagenome-assembled genomes from the global oceans. Sci. Data 5, 1–8 (2018).
43. Lesker, T. R. et al. An integrated metagenome catalog reveals new insights into the murine gut microbiome. Cell Rep. 30, 2909–2922 (2020).
44. Stewart, R. D. et al. Assembly of 913 microbial genomes from metagenomic sequencing of the cow rumen. Nat. Commun. 9, 1–11 (2018).
45. Manara, S. et al. Microbial genomes from non-human primate gut metagenomes expand the primate-associated bacterial tree of life with over 1000 novel species. Genome Biol. 20, 299 (2019). https://doi.org/10.1186/s13059-019-1923-9.
46. Anantharaman, K. et al. Thousands of microbial genomes shed light on interconnected biogeochemical processes in an aquifer system. Nat. Commun. 7, 1–11 (2016).
47. Parks, D. H. et al. Recovery of nearly 8,000 metagenome-assembled genomes substantially expands the tree of life. Nat. Microbiol. 903, 1–10 (2017).
48. Konstantinidis, K. T. & Tiedje, J. M. Genomic insights that advance the species definition for prokaryotes. PNAS 102, 2567–2572 (2005).
49. Fodor, A. A. et al. The ‘most wanted’ taxa from the human microbiome for whole genome sequencing. PLoS ONE 7, e41294 (2012). https://doi.org/10.1371/journal.pone.0041294.
50. Pallen, M. J., Telatin, A. & Oren, A. The next million names for Archaea and Bacteria. Trends Microbiol. xx, 1–21 (2020).
51. Murray, A. E. et al. Roadmap for naming uncultivated Archaea and Bacteria. Nat. Microbiol. 5, 987–994 (2020).
52. Oren, A. Prokaryotic names: the bold and the beautiful. FEMS Microbiol. Lett. 367, 1–6 (2020).
53. Hahnke, R. L. et al. Genome-based taxonomic classification of Bacteroidetes. Front. Microbiol. 7, 2003 (2016). https://doi.org/10.3389/fmicb.2016.02003.
54. Woese, C. R. & Fox, G. E. Phylogenetic structure of the prokaryotic domain: The primary kingdoms (archaebacteria/eubacteria/urkaryote/16S ribosomal RNA/molecular phylogeny). Proc. Natl Acad. Sci. USA 74, 5088–5090 (1977).
55. Yilmaz, P. et al. The SILVA and ‘all-species Living Tree Project (LTP)’ taxonomic frameworks. Nucleic Acids Res. 42, 643–648 (2014).
56. Pruesse, E. et al. SILVA: a comprehensive online resource for quality checked and aligned ribosomal RNA sequence data compatible with ARB. Nucleic Acids Res. 35, 7188–7196 (2007).
57. Edgar, R. C., Haas, B. J., Clemente, J. C., Quince, C. & Knight, R. UCHIME improves sensitivity and speed of chimera detection. Bioinformatics 27, 2194–2200 (2011).
58. Chun, J. et al. EzTaxon: A web-based tool for the identification of prokaryotes based on 16S ribosomal RNA gene sequences. Int. J. Syst. Evol. Microbiol. 57, 2259–2261 (2007).
59. Yarza, P. et al. Uniting the classification of cultured and uncultured bacteria and archaea using 16S rRNA gene sequences. Nat. Rev. Microbiol. 12, 635–645 (2014).
60. Thompson, C. C., Vieira, N. M., Vicente, A. C. P. & Thompson, F. L. Towards a genome based taxonomy of Mycoplasmas. Infect. Genet. Evol. 11, 1798–1804 (2011).
61. Barco, R. A. et al. A genus definition for Bacteria and Archaea based on genome relatedness and taxonomic affiliation. MBio 11, e02475–19 (2020). https://doi.org/10.1128/mBio.02475-19.
62. Edgar, R. C. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. 32, 1792–1797 (2004).
63. Price, M. N., Dehal, P. S. & Arkin, A. P. Fasttree: computing large minimum evolution trees with profiles instead of a distance matrix. Mol. Biol. Evol. 26, 1641–1650 (2009).
64. Parks, D. H., Imelfort, M., Skennerton, C. T., Hugenholtz, P. & Tyson, G. W. CheckM: assessing the quality of microbial genomes recovered from. Cold Spring Harb. Lab. Press Method 1, 1–31 (2015).
65. Rosselló-Mora, R. & Amann, R. The species concept for prokaryotes. FEMS Microbiol. Rev. 25, 39–67 (2001).
66. Coenye, T., Gevers, D., Van De Peer, Y., Vandamme, P. & Swings, J. Towards a prokaryotic genomic taxonomy. FEMS Microbiol. Rev. 29, 147–167 (2005).
67. Pannekoek, Y., Qi-Long, Q., Zhang, Y. Z. & van der Ende, A. Genus delineation of Chlamydiales by analysis of the percentage of conserved proteins justifies the reunifying of the genera Chlamydia and Chlamydophila into one single genus Chlamydia. Pathog. Dis. 74, 2016–2019 (2016).
68. Lalucat, J., Mulet, M., Gomila, M. & García-Valdés, E. Genomics in bacterial taxonomy: Impact on the genus pseudomonas. Genes. 11, 139 (2020). https://doi.org/10.3390/genes11020139.
69. Ogata, H. et al. KEGG: Kyoto encyclopedia of genes and genomes. Nucleic Acids Res. 27, 29–34 (1999).
70. Lombard, V., Golaconda Ramulu, H., Drula, E., Coutinho, P. M. & Henrissat, B. The carbohydrate-active enzymes database (CAZy) in 2013. Nucleic Acids Res. 42, 490–495 (2014).
71. McArthur, A. G. et al. The comprehensive antibiotic resistance database. Antimicrob. Agents Chemother. 57, 3348–3357 (2013).
72. Seemann, T. Prokka: Rapid prokaryotic genome annotation. Bioinformatics 30, 2068–2069 (2014).
73. Lagkouvardos, I. et al. IMNGS: A comprehensive open resource of processed 16S rRNA microbial profiles for ecology and diversity studies. Sci. Rep. 6, 1–9 (2016).
74. Bowers, R. M. et al. Minimum information about a single amplified genome (MISAG) and a metagenome-assembled genome (MIMAG) of bacteria and archaea. Nat. Biotechnol. 35, 725–731 (2017).
75. Ondov, B. D. et al. Mash: fast genome and metagenome distance estimation using MinHash. Genome Biol. 17, 1–14 (2016).
76. Wegmann, U. et al. Complete genome of a new Firmicutes species belonging to the dominant human colonic microbiota (‘Ruminococcus bicirculans’) reveals two chromosomes and a selective capacity to utilize plant glucans. Environ. Microbiol. 16, 2879–2890 (2014).

ACKNOWLEDGEMENTS
T.C. received funding from the German Research Foundation (DFG, Deutsche Forschungsgemeinschaft) under Project-ID 403224013 – SFB 1382 ‘Gut-liver axis’. T.C.A.H. was funded by both the ERS Start-up grant from RWTH Aachen (ProtoBiome) and the START grant from the University hospital RWTH Aachen (KetoGut). The authors thank (i) Antonios Koukis and Ilias Lagkouvardos (Institute of Marine Biology, Biotechnology and Aquaculture, Hellenic Center of Marine Research, Heraklion, Greece) for providing the IMNGS amplicon datasets, (ii) everyone involved in the Forster et al. paper6, from which all the isolates described and named here originate, and (iii) Till Robin Lesker and Till Strowig for reviewing the manuscript. The authors also thank Lorenz Riemer for both discussion and quick integration of Protologger output into BacDive.

AUTHOR CONTRIBUTIONS
T.C.A.H. and T.C. conceptualised the project. T.C.A.H. designed and coded Protologger. A.O. ensured the accuracy of all protologues and ensured all proposed names meet the criteria of the ICNP code. T.R., J.O., and T.D.L. provided access to important infrastructures and resources and contributed their respective knowledge. T.C.A.H. and T.C. wrote the manuscript. T.C. secured funding.

COMPETING INTERESTS
T.C. has ongoing scientific collaborations with Cytena GmbH and HiPP GmbH and is a member of the scientific advisory board of Savanna Ingredients GmbH.

ADDITIONAL INFORMATION
Supplementary information The online version contains supplementary material available at https://doi.org/10.1038/s43705-021-00017-z.

Correspondence and requests for materials should be addressed to T.C.A.H. or T.C.

Reprints and permission information is available at http://www.nature.com/reprints

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.
