AIX-MARSEILLE UNIVERSITE
FACULTE DE MEDECINE DE MARSEILLE
ECOLE DOCTORALE DES SCIENCES DE LA VIE ET DE LA SANTE

THESIS
Presented and publicly defended at the IHU – Méditerranée Infection

on 23 November 2017 by Aurélia CAPUTO

GENOME AND PAN-GENOME ANALYSIS TO CLASSIFY EMERGING BACTERIA

To obtain the degree of Doctor of Aix-Marseille Université
Field: Biology - Specialty: Genomics and Bioinformatics

Jury members:

Professor Antoine ANDREMONT, Reviewer
Professor Raymond RUIMY, Reviewer
Doctor Pierre PONTAROTTI, Examiner
Professor Didier RAOULT, Thesis supervisor

Unité de recherche sur les maladies infectieuses et tropicales émergentes (Research Unit on Emerging Infectious and Tropical Diseases), UM63, CNRS 7278, IRD 198, Inserm U1095

Foreword

The format of this thesis follows a recommendation of the Infectious Diseases and Microbiology specialty within the Master of Life and Health Sciences, which is attached to the Doctoral School of Life Sciences of Marseille. The candidate must comply with the rules imposed, which include a thesis format used in Northern Europe that allows better organization than traditional theses. In addition, the introduction and bibliography sections are replaced by a review submitted to a journal, so that the quality of the review can be assessed externally and so that the student can begin an exhaustive bibliography on the thesis topic as early as possible. The thesis is also presented as a series of published, accepted or submitted articles, each accompanied by a brief commentary giving the general sense of the work. This form of presentation appeared better suited to the demands of international competition and makes it possible to concentrate on work that will benefit from international dissemination.

Professor Didier RAOULT

Acknowledgements

First of all, I would like to thank my thesis supervisor, Professor Didier Raoult, for giving me the opportunity to carry out a thesis, and for his trust and supervision during these 4 years.

I wish to thank Professors Antoine Andremont and Raymond Ruimy for agreeing to be the reviewers of this work.

I also thank Doctor Pierre Pontarotti for agreeing to sit on my jury as examiner.

I also thank everyone who worked with me, closely or from afar. Thank you to my colleagues on the Bioinformatics platform and to those on the Sequencing platform.

Finally, I would particularly like to thank my family for their love and support during these 4 years of thesis work.

TABLE OF CONTENTS

SUMMARY
ABSTRACT
INTRODUCTION
    Foreword
    REVIEW: Genome and pan-genome analysis to classify emerging bacteria
PART I: Assembly of the Akkermansia muciniphila genome directly from metagenomic data
    Foreword
    ARTICLE 1: Whole-genome assembly of Akkermansia muciniphila sequenced directly from stool
PART II: Study of the genome of Microvirga massiliensis
    Foreword
    ARTICLE 2: Microvirga massiliensis sp. nov., the human commensal with the largest genome
PART III: Pan-genome analysis of Klebsiella pneumoniae
    Foreword
    ARTICLE 3: Pan-genomic analysis to redefine species and subspecies based on quantum discontinuous variation: the Klebsiella paradigm
CONCLUSIONS AND PERSPECTIVES
ANNEX I: Study of the human gut microbiota by culturomics
    Foreword
    ARTICLE 3: Culture of previously uncultured members of the human gut microbiota by culturomics
ANNEX II: Study of the genome of Haloferax massiliensis
    Foreword
    ARTICLE 4: Genome sequence and description of Haloferax massiliensis sp. nov., a new halophilic archaeon isolated from the human gut
REFERENCES FOR THE FOREWORDS

SUMMARY

Since the introduction of DNA sequencing by Sanger and Coulson in 1977, enormous progress has been made. A growing volume of data is being generated in many fields, which demands ever more progress in computing. Bioinformatics is now essential in many areas, for example data management and analysis, genomics (genome assembly and annotation), comparative genomics, phylogeny, metagenomics, the search for new bacterial species and taxonomic classification.

My first project concerned the assembly and analysis of a bacterial genome from metagenomic data. The genome of Akkermansia muciniphila was assembled by mapping, directly from data obtained from human stool samples. The data came from SOLiD and Roche 454 sequencers, which generated 1.4 Gb of reads.

Culturomics allows the study of human microbiota through the use of different culture conditions coupled with rapid identification by MALDI-TOF or by 16S rRNA. In 2012, this method made it possible to describe the largest genome of a bacterium isolated from humans: Microvirga massiliensis (9.3 Mb). My second project assembled this genome using 8 runs on the 454 and 1 run on the Illumina MiSeq. We then tried to understand why this bacterium has such a large genome. It carries a plasmid, a large number of ORFans and 16S rRNA genes, and large genes, one of which is more than 14 kb long. It also contains a large number of transposases, which create repeated elements in the genome.

Finally, the third and last part of my work is based on pan-genome analyses for bacterial taxonomy. Taxonomy is subject to many changes depending on the data available and the methods used, and follows the evolution of bacterial identification techniques. We redefined the notion of species using the pan-genome within the genus Klebsiella. Indeed, too large a difference, causing a break in the core/pan-genome ratio, undoubtedly reveals the appearance of a new species. This finding leads us to use the pan-genome as an innovative tool for bacterial taxonomy.

Keywords: Bioinformatics, genomics, culturomics, taxonomy, pan-genome, species definition

ABSTRACT

Since the introduction of DNA sequencing by Sanger and Coulson in 1977, considerable progress has been made. A growing amount of data is being generated in several areas and requires more and more advances in computing. Bioinformatics is essential today in many fields, such as data management and analysis, genomics (genome assembly and annotation), comparative genomics, phylogeny, metagenomics, the search for new bacterial species and taxonomic classification.

My first work was based on assembling and analyzing a bacterial genome from metagenomic data. The genome of Akkermansia muciniphila was assembled by mapping directly from data obtained from a human stool sample. The data came from SOLiD and Roche 454 sequencers, which generated 1.4 Gb of reads.

Culturomics allows the study of human microbiota by the use of several culture conditions combined with a rapid identification method by MALDI-TOF or by 16S rRNA. In 2012, this method made it possible to describe the largest genome of a bacterium isolated from humans: Microvirga massiliensis (9.3 Mb). My second work assembled this genome using 8 runs from 454 and 1 run from Illumina MiSeq. Subsequently, we tried to understand why this bacterium has such a large genome. It possesses a plasmid, a large number of ORFans and 16S rRNAs, as well as large genes, one of which is more than 14 kb. It also includes a large number of transposases, which create repeated elements in the genome.

Finally, the third and last part of the work concerns pan-genome analyses for bacterial taxonomy. Taxonomy undergoes many changes depending on the available data, the methods used and the evolution of bacterial identification techniques. We have re-examined the notion of species using the pan-genome at the level of the genus Klebsiella. Indeed, too large a difference, leading to a break in the core/pan-genome ratio, undoubtedly reveals the appearance of a new species. This discovery leads us to use the pan-genome as an innovative tool for bacterial taxonomy.

Keywords: Bioinformatics, genomics, culturomics, taxonomy, pan-genome, species definition

INTRODUCTION

Foreword

The objective of this thesis is the analysis of the genomes of emerging bacteria, together with their pan-genome, in order to define and classify them according to their genomic content.

The first part of my work is a literature synthesis in the form of a review. It aims to show the role of genomics and of the pan-genome in the classification of bacteria. The bacterial ecosystem of the digestive tract was first explored by microbial culture in the 1970s [1]. The birth of genomics, followed by the development of next-generation sequencing (NGS) methods in 2004, made it possible to discover the uncultivable, such as the genome of a strain from our laboratory, Akkermansia muciniphila, as well as the largest bacterial genome isolated from humans, Microvirga massiliensis. The genome of Akkermansia muciniphila was assembled directly from metagenomic data from a human stool sample, using a mapping approach. The reads were generated by different sequencing technologies: SOLiD and Roche 454.

Since the emergence of metagenomics, culture has been progressively replaced by molecular tools for the study of complex ecosystems, in particular human microbiota [2]. However, in 2015, a strategy called "microbial culturomics" was developed, and a detailed analysis made it possible to select the 18 culture conditions that allow a large number of isolates to be explored. Culturomics simultaneously combines different culture conditions, a rapid identification method by MALDI-TOF mass spectrometry, and amplification of the 16S rRNA gene for unidentified bacteria. Moreover, culturomics and metagenomics are highly complementary, since only 15% of the species analyzed concomitantly were detected by both techniques [3]. As the number of bacterial species discovered increased, a new polyphasic method for describing bacterial species was described: "taxonogenomics". It describes new species by combining phenotypic and genomic criteria. We were thus able to isolate, identify and analyze the largest genome of a bacterium isolated from humans, Microvirga massiliensis sp. nov. strain JC119T, with a size of 9.3 Mb.

The current classification of bacterial species relies essentially on a combination of phenotypic properties (morphology, environmental and culture conditions, pathogenesis) and genotypic properties. Bacterial taxonomy first used genotypic criteria such as G+C content, DNA-DNA hybridization and, later, 16S rRNA gene sequence similarity, but these criteria were limited by the use of restrictive genetic tools. With the rise of NGS, a considerable amount of data is generated, which makes studies based on pan-genomic analyses possible. The first definition of the pan-genome was proposed by Tettelin et al. [4] in 2005. A pan-genome is defined as the entire gene content belonging to a study group. Analysis of these genes therefore gives an overview of the evolution of a bacterial group and allows the genomic diversity of that group to be estimated. The pan-genome can thus serve as a tool for taxonomy and species classification. By working on the pan-genome of several Klebsiella pneumoniae subspecies, we were able to differentiate K. ozaenae and K. rhinoscleromatis from the other species of this genus. By calculating the core-genome/pan-genome ratio, we observed a marked break in this ratio that allowed us to distinguish the species and led to a different definition of these species. We believe that studies incorporating the pan-genome make it possible to redefine species and to classify them according to their genomic content.
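The core-genome/pan-genome ratio mentioned above can be illustrated with a minimal sketch. It assumes each genome has already been reduced to a set of gene-family identifiers; the function name `core_pan_ratio`, the family labels and the toy genomes are hypothetical and are not taken from the thesis. The core genome is taken as the families shared by all genomes and the pan-genome as the families found in at least one genome.

```python
from typing import Dict, Set


def core_pan_ratio(genomes: Dict[str, Set[str]]) -> float:
    """Return |core genome| / |pan-genome| for gene-family presence sets."""
    if not genomes:
        raise ValueError("at least one genome is required")
    family_sets = list(genomes.values())
    pan = set().union(*family_sets)          # families seen in any genome
    core = set.intersection(*family_sets)    # families shared by every genome
    if not pan:
        raise ValueError("genomes contain no gene families")
    return len(core) / len(pan)


# Hypothetical toy data: three closely related genomes...
close = {
    "strain_a": {"g1", "g2", "g3", "g4"},
    "strain_b": {"g1", "g2", "g3", "g5"},
    "strain_c": {"g1", "g2", "g3", "g4", "g5"},
}
# ...and the same group plus one divergent genome.
with_outlier = dict(close, strain_d={"g1", "g6", "g7", "g8", "g9"})

print(round(core_pan_ratio(close), 3))
print(round(core_pan_ratio(with_outlier), 3))
```

In this toy case, adding the divergent genome shrinks the core and enlarges the pan-genome at the same time, so the ratio falls sharply. That is the kind of break the text describes, although real analyses first have to cluster orthologous genes, which this sketch leaves out.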

This work has been submitted to the journal Future Microbiology.

REVIEW

Genome and pan-genome analysis to classify emerging bacteria

Aurélia Caputo, Pierre-Edouard Fournier and Didier Raoult


Journal: Future Microbiology

Manuscript ID Draft

Manuscript Type: Review

Keywords: Genomic, Pan-genome, Taxonomy


Abstract

Genomic and pan-genomic studies are becoming increasingly important. Indeed, genomics coupled with culturomics has permitted the discovery of many new bacterial species or genera, such as Akkermansia muciniphila and Microvirga massiliensis.

Bacterial taxonomy has undergone many changes. Thanks to pan-genomic analyses, species can be redefined, and thus a new definition of these species is generated. The notion of using the genome to define species has been applied to the Klebsiella genus. Indeed, too large a difference, leading to a break in the core/pan-genome ratio, undoubtedly reveals the appearance of a new species. This discovery led us to use the pan-genome as an innovative tool for bacterial taxonomy.

Keywords

Genomic, pan-genome, taxonomy

Introduction

The study of digestive bacterial ecosystems was initially explored by microbial culture [1–3]. Since the emergence of metagenomics, microbial culture has been gradually replaced by molecular tools [4] for the study of complex microbiota [5]. In 2015, a strategy called “culturomics” was developed, and intensive culture assays allowed selection of the 18 best culture conditions for cultivation of the largest number of isolates [6]. Culturomics has allowed identification of a large number of prokaryotic species, because it makes it possible to simultaneously combine different culture conditions, 16S rRNA amplification and matrix-assisted laser desorption/ionization time-of-flight (MALDI-TOF) mass spectrometry [7].

The current classification of bacterial species relies on a combination of phenotypic and genotypic properties [8–10]. Bacterial taxonomy was initially unraveled using genotypic criteria, such as the genomic G+C content, DNA-DNA hybridization, and later the 16S rRNA gene [11,12]. However, these criteria were limited by the use of restrictive genetic tools. For instance, DNA-DNA hybridization, whose cost is relatively high [13], uses a 70% threshold for species discrimination, but cannot be used for all prokaryote genera, as described for Rickettsia species [14]. Furthermore, comparison of the single 16S rRNA gene [15,16], with the conventional divergence threshold of 1.3% between two 16S rRNA genes [17], leads to a bacterial description that is shallow and limited [18,19].

With the appearance of first-generation sequencing in 1975–77 [20,21], followed by high-throughput sequencing in 2004 [22], access to complete genetic information was revolutionized. Thanks to these modern high-throughput sequencing technologies, a considerable amount of data is generated, making studies based on pan-genomic analyses possible (Fig. 1). The first definition of the pan-genome was proposed by Tettelin et al. [23] in 2005, just after the start of the era of high-throughput sequencing. A pan-genome is defined as the entire gene content belonging to a study group [23–25]. The applications are multiple, including the study of pathogenicity [26,27], the […] [28], the resistome [29], prediction of the lifestyle of bacteria [30], and taxonomy. Indeed, the study of the pan-genome allows a reclassification of species [31], in order to clarify and improve the traditional criteria.

1 Mapping strategy

1.1 Sequencing history

In 1977, Sanger and Coulson introduced Sanger sequencing [20,21]. At the end of the 1980s, large-scale automation of this method appeared, with the development of fluorescent labeling and capillary electrophoresis. Automation increased the sequencing rate and reduced the cost of sequencing, while saving considerable time. This method was called second-generation high-throughput sequencing.

Since the end of the 20th century, we have been witnessing a real revolution, with the emergence of next-generation sequencing (NGS) [32]. By 2007, three commercial NGS platforms were available: the Genome Sequencer FLX (Roche/454) (2005), the Illumina/Solexa Genome Analyzer (2006) and the Applied Biosystems SOLiD System (2007). The spectacular development of these techniques has allowed massive production of data (shorter sequences) in less time and at lower cost [33,34]. These major characteristics allow ultra-deep sequencing technologies to be used in the biological and medical fields.

1.2 Applications in genomics and metagenomics

This technical revolution offers new fields of application, such as genomics (whole-genome sequencing, WGS), single-cell sequencing (SCS) and metagenomics. Complete genome sequencing in particular allows us to understand the genetic basis of phenotypic variability, to analyze biodiversity and to estimate single nucleotide polymorphism (SNP) diversity [35]. Single-cell genomics allows the sequencing of a single isolated cell by capturing and amplifying its DNA, and is capable of producing partial [36] and complete [37] genomes. SCS allows the identification and assembly of genomes of uncultivated microorganisms [38–41].
It is estimated that only 1% of bacterial species have been cultivated in the laboratory [42]. Nevertheless, SCS has limitations and inherent biases [38,41], especially during the amplification step, including false-positive and false-negative errors, allelic dropout events and coverage non-uniformity [43]. However, single-cell genomics and metagenomics are complementary approaches for analyzing bacterial communities [44]. The sequencing of metagenomes makes it possible to inventory the diversity of microbial ecosystems [45] and to understand the interactions of microorganisms in ecosystems [46]. This technique, which allows DNA sequencing of bacteria present in specific environments, has many advantages. It opens a window on a totally unknown world of great richness.

1.3 Assembly, finishing and annotations

1.3.1 Principle

Genome sequencing has become accessible to many laboratories; specific consortium sequencing projects are being proposed, and the number of projects is currently growing exponentially. However, NGS technologies have higher error rates (~0.1–15%) and shorter read lengths (35–700 bp) than those obtained from Sanger sequencing platforms [47]. After sequencing, the data produced (reads) are computationally reconstructed into longer continuous sequences (contigs), a step called assembly [48]. This process consists of overlapping reads in order to reconstitute the initial sequence of the genome. It is called de novo when no reference is available. Assembly of Sanger data is based on pairwise comparison of reads, looking for overlapping sequences of minimal length with associated identity percentages (the CAP assembler, for example [49]). The change of scale due to the huge volume of data, the short read lengths and the non-uniform confidence in base calling ruled out this assembly strategy. The most commonly used approach for assembly of short-read data is therefore based on graph theory: overlap graphs or De Bruijn graphs [50]. A graph is a set of nodes connected by links that can be oriented, within which several links form a path.
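The graph structure just described can be sketched for the De Bruijn case. This is a minimal illustration under simplifying assumptions, not the method of any particular assembler: reads are error-free, only one strand is considered, and the names `de_bruijn` and `walk_unbranched`, the toy reads and the choice of k = 4 are hypothetical. Nodes are (k-1)-mers, and each k-mer found in a read adds an oriented link from its prefix to its suffix.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set


def de_bruijn(reads: Iterable[str], k: int) -> Dict[str, Set[str]]:
    """Build a De Bruijn graph: each k-mer links its (k-1)-mer prefix to its suffix."""
    graph: Dict[str, Set[str]] = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def walk_unbranched(graph: Dict[str, Set[str]], start: str) -> str:
    """Follow links from `start` while the current node has exactly one successor.

    Each step appends the last base of the next node, so the path is
    spelled out as a single sequence. The walk stops at a dead end,
    at a branch, or when it would revisit a node.
    """
    contig = start
    node = start
    seen = {start}
    while len(graph.get(node, ())) == 1:
        (nxt,) = graph[node]
        if nxt in seen:
            break
        contig += nxt[-1]
        seen.add(nxt)
        node = nxt
    return contig


# Hypothetical overlapping reads sampled from the toy sequence ATGGCGTGCA.
reads = ["ATGGCG", "GGCGTG", "CGTGCA"]
graph = de_bruijn(reads, k=4)
print(walk_unbranched(graph, "ATG"))
```

With these toy reads the walk from `ATG` spells out the full toy sequence. Production assemblers additionally deal with sequencing errors, reverse complements, repeats and branching paths, which this sketch ignores.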
The assembly consists in progressively reducing, by concatenation, the number of nodes forming a path. The quality of an assembly can then be assessed using a set of metrics. The usual ones are the total count of contigs and scaffolds, their total length, N50 and their average length [51,52]. The proportion of reads that map back, or not, to the contigs is also a good metric [51]. In our laboratory, we use a threshold of 20 scaffolds. Scaffolds shorter than 800 bp are removed, and scaffolds with a depth value lower than 25% of the median depth are removed (identified as possible contaminants). The best assembly is selected using different criteria, such as the number of scaffolds, N50 and the number of N.

The next important step is genome annotation. To make genomic data valuable, a reliable and correct annotation is essential [53]. It is used to identify, localize and distinguish the function of genes using similarity searches in protein databases with tools such as BLAST [54], and provides a basis for many genome analyses [55]. Organisms whose genome is now fully sequenced have revealed that nearly 40% of the genes identified have no assigned function, either because they do not resemble any known gene or (for nearly half of them) because they resemble other genes that themselves have an unknown function [56]. The first step of annotation is predicting the function of genes, which is generally done for each gene individually using computational tools. However, identifying gene function requires combining several complementary approaches, whether computational (in silico analysis), biochemical or genetic (in vivo and in vitro analysis). The second step of annotation consists of identifying the relationships between genes, proteins and regulatory elements. These relationships can be of a very varied nature: physical interactions between proteins and DNA, proteins and RNA, or proteins and proteins, networks regulating gene expression, metabolic pathways or others.

1.3.2 Tools

Several assemblers using different methods are available, such as Velvet [57], SOAPdenovo [58], Newbler, CABOG [59], SPAdes [60], Edena [61] and ABySS [62].

However, powerful approaches based on mapping short-read sequences to a reference genome are used for analyzing WGS data from closely related isolates [61,63–66].
Several bioinformatic tools can be used for mapping, such as the CLC Genomics Workbench (CLC bio, Aarhus, Denmark), Bowtie [67], BWA [68], SHRiMP [69] or SOAPdenovo [58].

There are many options for the annotation process: Clusters of Orthologous Groups (COG) [70], Kyoto Encyclopedia of Genes and Genomes (KEGG) [71], Rapid Annotation using Subsystem Technology (RAST) [72], Antibiotic Resistance Gene-ANNOTation (ARG-ANNOT) [73] for antibiotic resistance gene prediction, Protein Homology/analogY Recognition Engine V 2.0 (Phyre2) [74], the Blast2GO suite [75] or Artemis [76].

1.4 Example of a mapping study: Akkermansia muciniphila

Using metagenomic data from a human stool sample, the genome of Akkermansia muciniphila was successfully assembled after mapping of short-read sequences [66]. The stool sample was from a patient admitted to the intensive care unit and treated with a 10-day course of imipenem [77]. High-level colonization of up to 84% by the Verrucomicrobia phylum was reported in this patient. Reads were generated on a SOLiD sequencer and from short-read shotgun and paired-end runs on a 454 sequencer. Together, the two technologies generated 1.4 gigabases of metagenomic sequence data from the sample. Several rounds of mapping of the SOLiD and 454 data against the A. muciniphila type strain ATCC BAA-835 yielded the genome of A. muciniphila strain Urmite as 1 scaffold and 58 contigs.

A recent study demonstrated the presence of a range of putative antibiotic resistance genes (ARGs) from different antibiotic classes in A. muciniphila strain Urmite. The putative ARGs concerned β-lactamase, macrolides, vancomycin, chloramphenicol, sulfonamide, tetracycline and trimethoprim.

The approach of mapping short-read sequences to a reference genome has been successful in assembling a genome directly from a human stool sample. However, this approach is limited, because highly divergent sequences not present in the reference cannot be detected [65]. These limitations will progressively lessen as data grow exponentially and more genomes become available in generalist databases.
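Before moving on, the assembly quality criteria given in section 1.3.1 (N50, removal of scaffolds under 800 bp, removal of scaffolds whose depth is below 25% of the median depth) can be summarized in a short sketch. It is an illustration under stated assumptions rather than the laboratory's actual pipeline: the `(name, length, depth)` tuple layout, the function names and the toy values are hypothetical, and the median depth is assumed to be computed over all scaffolds before any filtering.

```python
from statistics import median
from typing import List, Sequence, Tuple

Scaffold = Tuple[str, int, float]  # (name, length in bp, mean read depth)


def n50(lengths: Sequence[int]) -> int:
    """Return the length L such that sequences of length >= L cover at least half the assembly."""
    if not lengths:
        raise ValueError("no sequences")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    raise AssertionError("unreachable for non-empty input")


def filter_scaffolds(
    scaffolds: List[Scaffold],
    min_length: int = 800,
    min_depth_fraction: float = 0.25,
) -> List[Scaffold]:
    """Drop scaffolds that are too short or whose depth is far below the median depth."""
    median_depth = median(depth for _, _, depth in scaffolds)
    return [
        s for s in scaffolds
        if s[1] >= min_length and s[2] >= min_depth_fraction * median_depth
    ]


# Hypothetical toy assembly.
scaffolds = [
    ("s1", 5000, 40.0),
    ("s2", 3000, 38.0),
    ("s3", 600, 41.0),   # shorter than 800 bp
    ("s4", 1200, 5.0),   # depth well below 25% of the median
]
kept = filter_scaffolds(scaffolds)
print([name for name, _, _ in kept])
print(n50([length for _, length, _ in kept]))
```

The two thresholds are exposed as parameters because the values quoted in the text are one laboratory's working choices and other projects may tune them differently.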
2 Finding new species

2.1 Bacterial culture

2.1.1 History

Before the invention of microscopy, few scientists suspected the existence of invisible living organisms. In antiquity, Aristotle formulated the idea of an invisible source for disease contagion but could not prove it. In 1680, Antonie van Leeuwenhoek, a microscopy pioneer, for the first time described and drew bacteria present in plaque from his teeth, as well as beer yeasts. The abbot Lazzaro Spallanzani (1729–1799), another microscopy pioneer, was the first to cultivate microbes using culture medium. Robert Koch (1843–1910), a true pioneer of microbiology, developed the main methods still used today: culture media adapted to bacteria, bacterial culture on solid medium and specific staining. Among other things, he is responsible for the discovery of the tuberculosis bacillus as well as the cholera bacillus, Vibrio cholerae. From the end of the 19th century, especially due to the work of Pasteur (1860), bacterial culture became fundamental in clinical microbiology and played an immense role in industry, medicine and hygiene. However, bacterial culture has been gradually replaced by molecular methods for the study of complex microbiota [78]. Nevertheless, bacterial culture remains widely used in most microbiology laboratories for the routine diagnosis of bacterial infections [5].

2.2 New species identification

2.2.1 Culturomics (MALDI-TOF-MS and 16S rRNA)

In recent years, with the introduction of a rapid and inexpensive identification method using MALDI-TOF mass spectrometry, microbial culture in human microbiology has increased considerably, making it possible in particular to detect pathogenic bacterial species. This reference technique for bacterial identification in clinical microbiology laboratories has also given rise to a new concept for the study of the human microbiota, called "microbial culturomics" [6,78]. It was developed for the study of complex microbiota and allows the largest number of isolates to be grown through the selection of the 18 best culture conditions [5,6].
This technique is based on the diversification of culture conditions by varying the time and temperature of incubation, the composition of the culture medium and the atmosphere [78]. In a preliminary work, this approach allowed the cultivation of 340 bacterial species, including 31 new species, as well as species belonging to rare phyla (Synergistetes and Deinococcus-Thermus), using 212 different cultivation conditions [78]. A detailed analysis made it possible to successively select 70 and then the 18 most suitable culture conditions in order to explore the greatest possible diversity of each sample. Other work carried out in our laboratory allowed cultivation of more than 50% of the known species of the human digestive tract, including 247 new species [6,7]. New species are identified by MALDI-TOF, and by 16S rRNA gene sequencing when MALDI-TOF spectra are not identified.

2.2.2 Other methods

New species have also been identified by multilocus sequence analysis (MLSA) of several concatenated housekeeping gene sequences (rrs, recA, gyrB, dnaK, glnII and rpoD) [79,80]. Housekeeping genes are usually involved in the expression and maintenance of genetic information at the level of […] or […]. The recA gene is essential for the maintenance and repair of DNA and is a good, resolutive tool for predicting the lineage and genus among rhizobial strains, for example [81]. Phylogenetic analysis for finding new species also uses DNA sequences of the internal transcribed spacer 2 (ITS2) region [82].

Either the nucleotide sequence or the peptide sequence can be used. In general, the peptide sequence is preferred, since it avoids a number of biases inherent in the G+C content of the organism studied, as well as the degeneracy of the third nucleotide of the codon [83].

2.3 The Taxono-genomics strategy

Since the advent of culturomics, the number of isolates of previously uncultivated bacteria has been increasing. Consequently, a new approach is needed for the classification of these new putative species. This new polyphasic strategy systematically combines phenotypic and genomic criteria, and is called “taxonogenomics” [5,84]. Using this strategy, 15 new bacteria have officially been recognized as new species and/or new genera in validation lists No. 153 and No. 155 by the International Taxonomy Committee of the International Journal of Systematic and Evolutionary Microbiology [78,85–94].

2.4 Example of a new species study: Microvirga massiliensis

A previous study identified a new bacterial species, Microvirga massiliensis, from a human stool sample using culturomics and metagenomics [78]. Recently, the genome of Microvirga massiliensis sp. nov. strain JC119T was described using taxonogenomics [95]. The draft genome of this bacterium is 9,207,211 bp long, which is the largest bacterial genome of an isolate from humans. According to the Genomes OnLine Database (GOLD) [96], it ranks 139th among the largest bacterial genomes. Of the 8,762 predicted genes, 8,685 were protein-coding genes and 77 were RNA genes, including 21 rRNA genes, and the genome exhibits a G+C content of 63.28%.

3 The Bacterial Pan-genome

3.1 History of taxonomy and the concept of bacterial species

The species represents the basic unit of the classification of living organisms. As such, the definition of species remains a subject of epistemological debate [97]. We assign a name to every living or inanimate being that we can recognize. Each time we assign a name, we make a classification. This classification allows us to condense the information resulting from the observation of many organisms by describing a limited number of models [98]. The notion of species has evolved over time, and numerous definitions have been successively published by great philosophical or scientific authors: Plato, Aristotle, Buffon, Lamarck and Darwin. The word taxonomy, or taxinomy, from the Greek taxis (order, arrangement) and nomos (law), refers to the science of the laws of classification of living beings. Although this term only appeared in the 19th century, the need to inventory and classify living beings has been expressed since antiquity. Classification is the method of gathering objects into related groups on the basis of various criteria. The classification criteria were determined by human needs, guided by beliefs, in relation to the progress of scientific techniques of observation and identification. Aristotle classified "living beings" as mineral, vegetable, animal or human.
The classification was made by observing, with the naked eye, different criteria belonging to the mineral, vegetable or animal world. Some specific biological properties of bacteria make the definition of species difficult, in particular their extreme diversity, the usual requirement to manipulate and examine a bacterial strain rather than an isolated cell, and the small number of morphological characteristics. The great diversity of microorganisms compared to that of larger organisms results from two principal causes: their period of evolution and their generation time.

Due to the difficulties encountered in establishing a satisfactory overall definition of bacterial species, Ravin in 1963 proposed considering them, according to the practical aim sought, in one of three aspects: nomenspecies, taxospecies and genospecies [99]. The nomenspecies includes taxa defined for a specific purpose (medical, industrial, etc.) by particular properties required of all individuals in this taxon. The taxospecies includes taxa composed of a set of biotypes suitable for a given ecological niche. The genospecies gathers all individuals that are derived from a common ancestor and that have retained a complete or partial genetic structure. With the invention of the modern nomenclature, Carl Woese proposed in 1977 a universal phylogenetic tree classifying all living beings into three domains (the highest taxonomic rank): Bacteria (or Eubacteria), Archaea (or Archaebacteria) and Eucarya [100–102]. This classification was based on the comparison of the 16S rRNA gene sequence obtained by the initial sequencing technique of Frederick Sanger.

Since the early 1990s, because of advances in molecular biology, bacterial classification has been in perpetual upheaval. Taxonomic information is essential for the identification, nomenclature and classification of microbial strains [103] and for understanding the biodiversity and relationships among living microorganisms [104]. In 1966, Buchanan et al. [105] published a census of 28,900 bacterial species, demonstrating an urgent need for better classification. The situation was improved in 1980 by Skerman, McGowan and Sneath [106], who, through long and laborious effort, reduced the number of valid bacterial species to 1,792.
As an example, the taxonomy of Salmonella species was thoroughly modified, with the creation of subspecies and serovars instead of subgenera and species, respectively, and the distribution of strains into two distinct species, S. enterica and S. bongori, on the basis of DNA–DNA hybridization [107,108]. To date (May 2017), the number of valid bacterial species has risen to 15,626 (listed in LPSN, www.bacterio.net) (Fig. 2).

Currently, the taxonomy of prokaryotes relies on polyphasic combinations of phenotypic properties (pathogenesis, morphology, environmental and culture conditions), chemotaxonomic properties (chemical composition of cellular components) and genotypic properties [8,109], including DNA–DNA hybridization (DDH), DNA G+C content [110] and 16S rRNA sequence similarity [9,17]. The application of molecular hybridization methods provides a genomic definition of the bacterial species, taking into account the homology rate and the thermal stability of the hybrids obtained between the DNAs of two bacterial isolates. Isolates belonging to the same species are characterized by homologies at the level of their DNA, which result in hybridization percentages greater than or equal to 70% and a difference in the thermal stability of the hybrids formed (ΔTm) of 5°C or less [111–113]. This definition is still recognized by international bacterial taxonomy committees.

The development of DNA sequencing then led to the determination of a threshold for defining the species based on the similarity of gene sequences, initially based on the 16S rRNA gene. These data were later compared to those obtained by DDH [114,115]. In our laboratory, we retain the upper threshold of 98.65% similarity. For the classification of prokaryotes, the 16S rRNA gene is an effective molecular marker due to its functional stability, conservation and universal presence [116].
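The threshold-based definitions above can be made concrete with a minimal Python sketch. It uses only the cutoffs quoted in the text (DDH of at least 70%, hybrid thermal stability read here as a ΔTm of at most 5°C, and 98.65% 16S rRNA similarity). The `PairwiseComparison` container, the function names and the example measurements are hypothetical and serve only as illustration; this is not a tool used by the authors.

```python
from dataclasses import dataclass
from typing import Optional

# Cutoffs quoted in the text.
DDH_MIN_PERCENT = 70.0             # DNA-DNA hybridization >= 70%
DELTA_TM_MAX_C = 5.0               # assumed reading: hybrid delta-Tm <= 5 degrees C
IDENTITY_16S_MIN_PERCENT = 98.65   # 16S rRNA similarity threshold retained by the authors


@dataclass
class PairwiseComparison:
    """Measurements between two isolates (hypothetical container)."""
    ddh_percent: Optional[float] = None
    delta_tm_c: Optional[float] = None
    identity_16s_percent: Optional[float] = None


def same_species_by_ddh(c: PairwiseComparison) -> Optional[bool]:
    """Apply the DDH criterion; return None if a measurement is missing."""
    if c.ddh_percent is None or c.delta_tm_c is None:
        return None
    return c.ddh_percent >= DDH_MIN_PERCENT and c.delta_tm_c <= DELTA_TM_MAX_C


def same_species_by_16s(c: PairwiseComparison) -> Optional[bool]:
    """Apply the 16S rRNA criterion; return None if the identity is missing."""
    if c.identity_16s_percent is None:
        return None
    return c.identity_16s_percent >= IDENTITY_16S_MIN_PERCENT


# Invented measurements for one pair of isolates.
pair = PairwiseComparison(ddh_percent=72.0, delta_tm_c=3.5, identity_16s_percent=98.6)
print(same_species_by_ddh(pair))   # True
print(same_species_by_16s(pair))   # False
```

With these invented values the two criteria disagree: the pair passes the DDH rule but falls just under the 16S cutoff, which shows how sensitive a classification is to the exact threshold chosen.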
However, this gene has several limitations for bacterial taxonomy, notably: the presence of SNPs in the rRNA operon in a single genome [117,118]; the use of only one gene, which may not reflect the nature of the genome (~0.07% of a genome) [15,119]; the high degree of conservation within a single genus, such as Brucella or Rickettsia [120]; the 1.3% divergence accepted between two sequences, corresponding to 50 million years of divergence [18,19]; its presence in multiple and sometimes variable copies [121,122]; lateral transfer, as observed by van Berkum et al., who showed that a small portion of the 16S rRNA gene sequence of Bradyrhizobium elkanii originated from a Mesorhizobium sp. genome by lateral transfer [119,123]; and the reliance on a threshold for the percentage of similarity. Establishing a cutoff is not biological. What happens when two species have a similarity percentage of 98.6%? Are these really two distinct species? Species delineation cannot rest essentially on such a threshold, especially as different thresholds are used by different biologists.

With the advent of whole-genome sequencing, phylogeny entered a new era: the era of phylogenomics. Many studies demonstrated the importance of genomes in bacterial taxonomy by suggesting a focus on the presence or absence of genes within genomes [124–126]; the gene content [127]; the presence of SNPs or indels in conserved genes [128]; the comparison of orthologous genes [129]; the study of metabolic pathways and gene order [130,131]; or sequence similarity at the genome level, estimated by parameters such as "digital DDH", Average Nucleotide Identity (ANI) or AGIOS, using the Genome-to-Genome Distance Calculator (GGDC), ANI calculator and Marseille Average Genomic Identity (MAGI) software [132–135].

3.2 Pan-genome cases

3.2.1 Principle

Bacterial species can be described using pan-genome analysis [23,31]. The "pan-genome" is defined as the entire genomic repertoire of a group of genomes [136]. The term pan-genome, or supragenome, was first used by Tettelin et al. in 2005 and designates the combination of genes or open reading frames (ORFs) present across the genomes of interest [24]. Analysis of these genes therefore provides insight into the evolution of a bacterial group and allows estimation of the genomic diversity of the dataset. The pan-genome of a bacterial species can be divided into three parts. The core genome contains genes present in all strains and represents the genetic information that ensures the survival of this species. These are, for example, genes involved in replication, translation control, homeostasis and energy production. The second category includes unique genes that are present in only one strain (strain-specific), which provide the organism with its ability to adapt to specific ecological niches. The third category includes accessory genes, which are present in two or more strains and contribute to species diversity [136–138].
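The three-way division described above can be sketched in a few lines of Python. The sketch assumes that gene families have already been clustered and are given as one set of family identifiers per strain. The function name, the strain labels and the family identifiers are hypothetical placeholders, and "accessory" is taken here to mean present in at least two strains but not in all of them.

```python
from collections import Counter


def partition_pan_genome(presence):
    """Split gene families into core, accessory and strain-specific sets.

    presence: dict mapping a strain name to the set of gene-family
    identifiers found in that strain (hypothetical input format).
    """
    n_strains = len(presence)
    # Count in how many strains each family occurs.
    counts = Counter(fam for families in presence.values() for fam in families)
    core = {fam for fam, c in counts.items() if c == n_strains}
    unique = {fam for fam, c in counts.items() if c == 1 and n_strains > 1}
    accessory = set(counts) - core - unique
    return core, accessory, unique


# Invented example with three strains and five gene families.
presence = {
    "strain_A": {"fam1", "fam2", "fam3", "fam4"},
    "strain_B": {"fam1", "fam2", "fam3"},
    "strain_C": {"fam1", "fam2", "fam5"},
}
core, accessory, unique = partition_pan_genome(presence)
print(sorted(core))       # ['fam1', 'fam2']
print(sorted(accessory))  # ['fam3']
print(sorted(unique))     # ['fam4', 'fam5']
```

The pan-genome is the union of the three sets, so its size in this toy example is five families.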
Accessory genes may encode biochemical functions that are not essential to growth but that confer selective advantages, such as antibiotic resistance, adaptation to different niches or colonization of a new host [24]. A pan-genome can be characterized as open or closed, according to the lifestyle of the bacterial species studied [139]. Sympatric species live in community, have large genomes, an open pan-genome, a high level of lateral gene transfer and a large number of ribosomal operons [140], whereas species with a closed pan-genome are specialized, evolve in a specific niche leading to allopatry, and have a small genome and a reduced number of ribosomal operons [141]. These characteristics allow the appearance of a bona fide species.

Pan-genome analyses make it possible to determine the genomic diversity of the studied species, and the number of genomes used is important in order to accurately represent the entire gene repertoire [23]. With the emergence and development of NGS, the increasing amount of genome sequence data has enabled more accurate analysis of pan-genome composition. It is important to realize that estimating the number of genes that characterize the pan-genome requires some caution. Differences in the number of genes may depend on the methodology used and are partly related to the number of genomes available at the time of the determination. Vernikos et al. (2015) recommended using at least five genomes for pan-genomic analyses. There is indeed a correlation between the sizes of the core genome and the pan-genome and the number of genomes analyzed [142]. This correlation, called the effect of the number of genomes, shows that the number of core gene families decreases as the number of genomes examined increases. Conversely, the size of a pan-genome increases with the number of genomes examined.

3.2.2 Different strategies

Several software and online tools have been developed in response to the increasing number of pan-genomic studies, such as GET_HOMOLOGUES [143], PANNOTATOR [144], PanSeq [145], OrthoMCL [146], PanOCT [147] and PGAP (pan-genomes analysis pipeline) [148]. The functions of these tools include calculating pan-genomic profiles, integrating gene annotations, categorizing orthologous genes and constructing phylogenies.
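The effect of the number of genomes described in Section 3.2.1 (core shrinking and pan-genome growing as genomes are added) is the kind of pan-genomic profile such tools compute. A minimal sketch, assuming each genome is already reduced to a set of gene-family identifiers and that genomes are added in one fixed order, could look as follows. The function name and the toy data are hypothetical; dedicated tools typically average over many addition orders, which this sketch does not do.

```python
def core_and_pan_sizes(genomes):
    """Return (n_genomes, core_size, pan_size) after each genome is added.

    genomes: ordered list of sets of gene-family identifiers
    (hypothetical input format).
    """
    core = None
    pan = set()
    curve = []
    for n, families in enumerate(genomes, start=1):
        # The core is the running intersection, the pan-genome the running union.
        core = set(families) if core is None else core & families
        pan |= families
        curve.append((n, len(core), len(pan)))
    return curve


# Invented example: three genomes described by integer family identifiers.
genomes = [{1, 2, 3, 4}, {1, 2, 3, 5}, {1, 2, 6}]
for n, core_size, pan_size in core_and_pan_sizes(genomes):
    print(n, core_size, pan_size)
# 1 4 4
# 2 3 5
# 3 2 6
```

Because the core is an intersection and the pan-genome a union, the core size can only stay constant or decrease and the pan-genome size can only stay constant or increase along any given order of addition.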
The pan-genome results are influenced by various aspects, including the alignment algorithm and the parameters used to determine similarity (% of pairwise aligned sequence length and % identity) and the type and quality of sequence annotation [136].

In our studies, we have used a pipeline (Fig. 3) to perform pan-genomic analyses. After protein prediction with the Prodigal (Prokaryotic Dynamic Programming Gene-finding Algorithm) [149] tool, we investigate orthologous genes to estimate pan-genome composition. We implemented a tBLASTn comparison, with a 10e-3 E-value, between the studied genomes and the protein prediction for the genome in question. Then, as previously described by Rasko et al. [150], we use a BLAST score ratio (BSR) approach to accurately evaluate the level of conservation of proteins among a pool of genomes. This approach addresses the bias produced in BLAST analysis by artificially low E-values arising from small regions of high similarity. The use of BSR removes this bias because it is directly derived from the similarity of the match [150]. The BSR is calculated by dividing the query bit score by the maximum bit score across all genomes studied [151]; it ranges from 0 to 1.0 (exact protein match). Therefore, genes with a BSR value > 0.4 (corresponding to a protein identity > 40% over 100% of the protein length) in all genomes are classified as core genes. There are many possibilities for the annotation step, in particular COG and KEGG. These tools allow a more detailed study of the functional distribution within the core and accessory genomes. We can observe differences in COG category distribution and in metabolic pathways [152,153].

3.3 Example of a pan-genome study for taxonomic purposes: the Klebsiella genus

The taxonomic classification of Klebsiella species has been the subject of a long controversy. Klebsiella species are part of the large Enterobacteriaceae family of Gram-negative bacteria. Originally, the Klebsiella genus was divided into pathovars linked to the diseases that they caused: Klebsiella pneumoniae, Klebsiella ozaenae and Klebsiella rhinoscleromatis [154]. With the development of new tools such as G+C content composition, DNA–DNA hybridization and 16S rRNA sequencing [8,113], the classification of Klebsiella species has been continuously revised [155,156]. K. ozaenae and K. rhinoscleromatis were notably reclassified as K. pneumoniae subspecies [157–159].

Pan-genome analyses were done for different strains and subspecies of K. pneumoniae, K. oxytoca, K. variicola and K. mobilis [31]. We determined the core genome/pan-genome ratio for six K. pneumoniae strains, which is 94%. We then recalculated the ratio after adding to these same six strains, in turn, the genome of K. mobilis, K. variicola, K. oxytoca, K. pneumoniae subsp. ozaenae or K. pneumoniae subsp. rhinoscleromatis (Fig. 4); the ratios were 67%, 81%, 69%, 72% and 79%, respectively. A discontinuous variation in the ratio was thus observed, with decreases ranging from 13 to 27 percentage points. This break in the ratio exceeds 10%, with no transition zone, and reflects individual biological species. Accordingly, K. pneumoniae subsp. ozaenae and K. pneumoniae subsp. rhinoscleromatis can be considered as species. A recent study has demonstrated the same break in the core/pan-genome ratio of E. coli strains after the addition of Shigella strains [160].

A quantum leap between two objects gives rise to a different definition of these two objects. Here, there is no transition from one species to another because there is a break in the ratio of the core genome to the pan-genome, leading to a definition of different species. A break in the ratio represents a major difference between genomes. These irreconcilable differences cannot exist within a single species.

Two bacterial species that differ by a few hundred genes can gain or lose them, but in the end only these genes differ. When two species differ by many more genes, it becomes impossible to gain or lose all of them, and we can then clearly say that these two species are different. We believe that pan-genome studies make it possible to redefine species, to classify them according to their discontinuous genomic content, and to better visualize the differences in the number of genes between species.

Conclusions

Since the introduction of DNA sequencing by Sanger and Coulson in 1977, great progress has been made. A growing amount of data is being generated in several domains, requiring more and more advanced computer processing.
Bioinformatics is essential today, both through computerized data modeling and through improved methods of processing data.

Many studies using different methods have now been published in different fields, such as genome assembly and annotation, the search for new bacterial species and bacterial taxonomic classification. Pan-genome study can be used for defining bacterial species. The great discontinuity in the core/pan-genome ratio observed in Klebsiella species can help us to redefine these species. Genomics challenges taxonomy. We are at the beginning of the interpretation of the genome for taxonomic purposes.

Future perspective

As a perspective on the future, we can apply pan-genome analysis to the reclassification of other bacterial species or genera.

Executive summary

Mapping strategy
The sequencing revolution offers new fields of application such as genomics and metagenomics.
The mapping strategy made it possible to obtain the genome of Akkermansia muciniphila from metagenomic data from a human stool sample.

Finding new species
Culture was used in most laboratories for the routine diagnosis of bacterial infections.
Culturomics was developed for studying complex microbiota.
New species can also be identified by MLSA.
The largest bacterial genome of an isolate from a human, that of Microvirga massiliensis, was identified by culturomics and metagenomics.

The bacterial pan-genome
The notion of species has evolved over time and numerous definitions have successively been published.
The 16S rRNA gene is an effective molecular marker but exhibits several limitations for bacterial taxonomy.
Pan-genome analysis provides insight into the evolution of a bacterial group and allows estimation of the genomic diversity of the dataset.

The pan-genome analysis of different strains and species of Klebsiella revealed a discontinuous variation in the core genome/pan-genome ratio.
Pan-genome studies make it possible to redefine species.
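As a worked illustration of the two calculations summarized above, the following Python sketch covers the BLAST score ratio step of Section 3.2.2 and the core/pan-genome ratio comparison of Section 3.3. The BSR follows the wording of the text (bit score divided by the maximum bit score for that protein across the genomes studied) with the > 0.4 cutoff for core genes. The bit scores, genome labels and function names are invented for illustration. The percentages are the ones reported in Section 3.3, assigned to each added genome in the order given there.

```python
def blast_score_ratios(bit_scores):
    """Compute a BSR per protein and genome.

    bit_scores: dict mapping protein -> {genome: best bit score, 0 if no hit}
    (hypothetical input format).
    """
    ratios = {}
    for protein, per_genome in bit_scores.items():
        best = max(per_genome.values())
        ratios[protein] = {
            genome: (score / best if best > 0 else 0.0)
            for genome, score in per_genome.items()
        }
    return ratios


def core_proteins(ratios, threshold=0.4):
    """Proteins whose BSR exceeds the threshold in every genome."""
    return {
        protein
        for protein, per_genome in ratios.items()
        if all(r > threshold for r in per_genome.values())
    }


# Invented bit scores for two proteins across three genomes.
bit_scores = {
    "p1": {"G1": 500.0, "G2": 450.0, "G3": 260.0},
    "p2": {"G1": 300.0, "G2": 90.0, "G3": 0.0},
}
ratios = blast_score_ratios(bit_scores)
core = core_proteins(ratios)
print(core)  # {'p1'}

# Core/pan-genome ratios (%) reported in Section 3.3.
BASELINE_RATIO = 94  # six K. pneumoniae strains
ratio_after_adding = {
    "K. mobilis": 67,
    "K. variicola": 81,
    "K. oxytoca": 69,
    "K. pneumoniae subsp. ozaenae": 72,
    "K. pneumoniae subsp. rhinoscleromatis": 79,
}
drops = {name: BASELINE_RATIO - r for name, r in ratio_after_adding.items()}
breaks = {name: d for name, d in drops.items() if d > 10}
print(min(drops.values()), max(drops.values()))  # 13 27
```

Every added genome lowers the ratio by more than 10 percentage points, which is the break interpreted in the text as a species boundary.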

References

1. Savage DC. Microbial Ecology of the Gastrointestinal Tract. Annu. Rev. Microbiol. 31(1), 107–133 (1977).
2. Finegold SM, Attebery HR, Sutter VL. Effect of diet on human fecal flora: comparison of Japanese and American diets. Am. J. Clin. Nutr. 27(12), 1456–1469 (1974).
3. Moore WEC, Holdeman LV. Human Fecal Flora: The Normal Flora of 20 Japanese-Hawaiians. Appl. Microbiol. 27(5), 961–979 (1974).
4. Lagier JC, Million M, Hugon P, Armougom F, Raoult D. Human gut microbiota: repertoire and variations. Front. Cell. Infect. Microbiol. 2, 136 (2012).
5. Fournier PE, Lagier JC, Dubourg G, Raoult D. From culturomics to taxonomogenomics: A need to change the taxonomy of prokaryotes in clinical microbiology. Anaerobe [Internet]. Available from: http://www.sciencedirect.com/science/article/pii/S1075996415300718.
6. Lagier JC, Hugon P, Khelaifia S, Fournier PE, La Scola B, Raoult D. The Rebirth of Culture in Microbiology through the Example of Culturomics To Study Human Gut Microbiota. Clin. Microbiol. Rev. 28(1) (2015).
7. Lagier JC, Khelaifia S, Alou MT, et al. Culture of previously uncultured members of the human gut microbiota by culturomics. Nat. Microbiol. 1, 16203 (2016).
8. Staley JT. The bacterial species dilemma and the genomic-phylogenetic species concept. Philos. Trans. R. Soc. B Biol. Sci. 361(1475), 1899–1909 (2006).
9. Tindall BJ, Rosselló-Móra R, Busse HJ, Ludwig W, Kämpfer P. Notes on the characterization of prokaryote strains for taxonomic purposes. Int. J. Syst. Evol. Microbiol. 60(Pt 1), 249–266 (2010).
10. Kämpfer P, Glaeser SP. Prokaryotic taxonomy in the sequencing era – the polyphasic approach revisited. Environ. Microbiol. 14(2), 291–317 (2012).
11. Rosselló-Mora R, Amann R. The species concept for prokaryotes. FEMS Microbiol. Rev. 25(1), 39–67 (2001).
12. Drancourt M, Berger P, Raoult D. Systematic 16S rRNA gene sequencing of atypical clinical isolates identified 27 new bacterial species associated with humans. J. Clin. Microbiol. 42(5), 2197–2202 (2004).
13. Sentausa E, Fournier PE. Advantages and limitations of genomics in prokaryotic taxonomy. Clin. Microbiol. Infect. 19(9), 790–795 (2013).
14. Drancourt M, Raoult D. Taxonomic position of the Rickettsiae: Current knowledge. FEMS Microbiol. Rev. 13(1), 13–24 (1994).
15. O'Malley MA, Koonin EV. How stands the Tree of Life a century and a half after The Origin? Biol. Direct. 6, 32 (2011).
16. Dagan T, Roettger M, Stucken K, et al. Genomes of Stigonematalean cyanobacteria (subsection V) and the evolution of oxygenic photosynthesis from prokaryotes to plastids. Genome Biol. Evol. 5(1), 31–44 (2013).

17. Stackebrandt E, Ebers J. Taxonomic parameters revisited: tarnished gold standards. Microbiol. Today 33(4), 152–155 (2006).
18. Ochman H, Elwyn S, Moran NA. Calibrating bacterial evolution. Proc. Natl. Acad. Sci. U. S. A. 96(22), 12638–12643 (1999).
19. Ogata H, Audic S, Renesto-Audiffren P, et al. Mechanisms of evolution in Rickettsia conorii and R. prowazekii. Science. 293(5537), 2093–2098 (2001).
20. Sanger F, Coulson AR. A rapid method for determining sequences in DNA by primed synthesis with DNA polymerase. J. Mol. Biol. 94(3), 441–448 (1975).
21. Sanger F, Nicklen S, Coulson AR. DNA sequencing with chain-terminating inhibitors. Proc. Natl. Acad. Sci. 74(12), 5463–5467 (1977).
22. Ambardar S, Gupta R, Trakroo D, Lal R, Vakhlu J. High Throughput Sequencing: An Overview of Sequencing Chemistry. Indian J. Microbiol. 56(4), 394–404 (2016).
23. Tettelin H, Riley D, Cattuto C, Medini D. Comparative genomics: the bacterial pan-genome. Curr. Opin. Microbiol. 11(5), 472–477 (2008).
24. Medini D, Donati C, Tettelin H, Masignani V, Rappuoli R. The microbial pan-genome. Curr. Opin. Genet. Dev. 15(6), 589–594 (2005).
25. Mira A, Martín-Cuadrado AB, D'Auria G, Rodríguez-Valera F. The bacterial pan-genome: a new paradigm in microbiology. Int. Microbiol. Off. J. Span. Soc. Microbiol. 13(2), 45–57 (2010).
26. Rasko DA, Webster DR, Sahl JW, et al. Origins of the E. coli strain causing an outbreak of hemolytic-uremic syndrome in Germany. N. Engl. J. Med. 365(8), 709–717 (2011).
27. Chun J, Grim CJ, Hasan NA, et al. Comparative genomics reveals mechanism for short-term and long-term clonal transitions in pandemic Vibrio cholerae. Proc. Natl. Acad. Sci. U. S. A. 106(36), 15442–15447 (2009).
28. den Bakker HC, Cummings CA, Ferreira V, et al. Comparative genomics of the bacterial genus Listeria: Genome evolution is characterized by limited gene acquisition and limited gene loss. BMC Genomics. 11, 688 (2010).
29. Olivares J, Bernardini A, Garcia-Leon G, Corona F, Sanchez MB, Martinez JL. The intrinsic resistome of bacterial pathogens. Front. Microbiol. 4, 103 (2013).
30. Diene SM, Merhej V, Henry M, et al. The Rhizome of the Multidrug-Resistant Enterobacter aerogenes Genome Reveals How New "Killer Bugs" Are Created because of a Sympatric Lifestyle. Mol. Biol. Evol. mss236 (2012).
31. Caputo A, Merhej V, Georgiades K, et al. Pan-genomic analysis to redefine species and subspecies based on quantum discontinuous variation: the Klebsiella paradigm. Biol. Direct. 10, 55 (2015).
32. Glenn TC. Field guide to next-generation DNA sequencers. Mol. Ecol. Resour. 11(5), 759–769 (2011).
33. Margulies M, Egholm M, Altman WE, et al. Genome sequencing in microfabricated high-density picolitre reactors. Nature. 437(7057), 376–380 (2005).
34. Bentley SD, Parkhill J. Comparative genomic structure of prokaryotes. Annu. Rev. Genet. 38, 771–792 (2004).
35. Camiolo S, Sablok G, Porceddu A. Altools: a user friendly NGS data analyser. Biol. Direct. 11(1), 8 (2016).
36. Zhang K, Martiny AC, Reppas NB, et al. Sequencing genomes from single cells by polymerase cloning. Nat. Biotechnol. 24(6), 680–686 (2006).
37. Woyke T, Tighe D, Mavromatis K, et al. One bacterial cell, one complete genome. PloS One. 5(4), e10314 (2010).
38. Gawad C, Koh W, Quake SR. Single-cell genome sequencing: current state of the science. Nat. Rev. Genet. 17(3), 175–188 (2016).
39. Marcy Y, Ouverney C, Bik EM, et al. Dissecting biological "dark matter" with single-cell genetic analysis of rare and uncultivated TM7 microbes from the human mouth. Proc. Natl. Acad. Sci. 104(29), 11889–11894 (2007).
40. McLean JS, Lombardo MJ, Badger JH, et al. Candidate phylum TM6 genome recovered from a hospital sink biofilm provides genomic insights into this uncultivated phylum. Proc. Natl. Acad. Sci. 110(26), E2390–E2399 (2013).
41. Lasken RS. Genomic sequencing of uncultured microorganisms from single cells. Nat. Rev. Microbiol. 10(9), 631–640 (2012).
42. Rappé MS, Giovannoni SJ. The uncultured microbial majority. Annu. Rev. Microbiol. 57, 369–394 (2003).
43. Wang Y, Navin NE. Advances and Applications of Single Cell Sequencing Technologies. Mol. Cell. 58(4), 598–609 (2015).
44. Doud DFR, Woyke T. Novel approaches in function-driven single-cell genomics. FEMS Microbiol. Rev. 41(4), 538–548 (2017).
45. Nasheri N, Petronella N, Ronholm J, Bidawid S, Corneau N. Characterization of the Genomic Diversity of Norovirus in Linked Patients Using a Metagenomic Deep Sequencing Approach. Front. Microbiol. 8, 73 (2017).
46. Weng FCH, Shaw GTW, Weng CY, Yang YJ, Wang D. Inferring Microbial Interactions in the Gut of the Hong Kong Whipping Frog (Polypedates megacephalus) and a Validation Using Probiotics. Front. Microbiol. 8, 525 (2017).
47. Goodwin S, McPherson JD, McCombie WR. Coming of age: ten years of next-generation sequencing technologies. Nat. Rev. Genet. 17(6), 333–351 (2016).
48. Ekblom R, Wolf JBW. A field guide to whole-genome sequencing, assembly and annotation. Evol. Appl. 7(9), 1026–1042 (2014).
49. Huang X, Madan A. CAP3: A DNA Sequence Assembly Program. Genome Res. 9(9), 868–877 (1999).
50. Lin Y, Yuan J, Kolmogorov M, Shen MW, Chaisson M, Pevzner PA. Assembly of long error-prone reads using de Bruijn graphs. Proc. Natl. Acad. Sci. U. S. A. 113(52), E8396–E8405 (2016).
51. Cabau C, Escudié F, Djari A, Guiguen Y, Bobe J, Klopp C. Compacting and correcting Trinity and Oases RNA-Seq de novo assemblies. PeerJ. 5, e2988 (2017).
52. Cao MD, Nguyen SH, Ganesamoorthy D, Elliott AG, Cooper MA, Coin LJM. Scaffolding and completing genome assemblies in real-time with nanopore sequencing. Nat. Commun. 8, 14515 (2017).
53. Lugli GA, Milani C, Mancabelli L, van Sinderen D, Ventura M. MEGAnnotator: a user-friendly pipeline for microbial genomes assembly and annotation. FEMS Microbiol. Lett. 363(7) (2016).
54. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. Basic local alignment search tool. J. Mol. Biol. 215(3), 403–410 (1990).
55. Stein L. Genome annotation: from sequence to biology. Nat. Rev. Genet. 2(7), 493–503 (2001).
56. Médigue C, Bocs S, Labarre L, Mathé C, Vallenet D. L'annotation in silico des séquences génomiques Bioinformatique (1). médecine/sciences. 18(2), 237–250 (2002).
57. Zerbino DR, Birney E. Velvet: algorithms for de novo short read assembly using de Bruijn graphs. Genome Res. 18(5), 821–829 (2008).
58. Li R, Li Y, Kristiansen K, Wang J. SOAP: short oligonucleotide alignment program. Bioinforma. Oxf. Engl. 24(5), 713–714 (2008).
59. Miller JR, Delcher AL, Koren S, et al. Aggressive assembly of pyrosequencing reads with mates. Bioinforma. Oxf. Engl. 24(24), 2818–2824 (2008).
60. SPAdes: A New Genome Assembly Algorithm and Its Applications to Single-Cell Sequencing [Internet]. Available from: http://online.liebertpub.com/doi/abs/10.1089/cmb.2012.0021.
61. Hernandez D, François P, Farinelli L, Osterås M, Schrenzel J. De novo bacterial genome sequencing: millions of very short reads assembled on a desktop computer. Genome Res. 18(5), 802–809 (2008).
62. Simpson JT, Wong K, Jackman SD, Schein JE, Jones SJM, Birol I. ABySS: a parallel assembler for short read sequence data. Genome Res. 19(6), 1117–1123 (2009).
63. Nishito Y, Osana Y, Hachiya T, et al. Whole genome assembly of a natto production strain Bacillus subtilis natto from very short read data. BMC Genomics. 11, 243 (2010).
64. Li H, Ruan J, Durbin R. Mapping short DNA sequencing reads and calling variants using mapping quality scores. Genome Res. 18(11), 1851–1858 (2008).
65. Bratcher HB, Corton C, Jolley KA, Parkhill J, Maiden MC. A gene-by-gene population genomics platform: de novo assembly, annotation and genealogical analysis of 108 representative Neisseria meningitidis genomes. BMC Genomics. 15, 1138 (2014).
66. Caputo A, Dubourg G, Croce O, et al. Whole-genome assembly of Akkermansia muciniphila sequenced directly from human stool. Biol. Direct. 10, 5 (2015).
67. Langmead B, Trapnell C, Pop M, Salzberg SL. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol. 10(3), R25 (2009).
68. Li H, Durbin R. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinforma. Oxf. Engl. 25(14), 1754–1760 (2009).
69. Rumble SM, Lacroute P, Dalca AV, Fiume M, Sidow A, Brudno M. SHRiMP: accurate mapping of short color-space reads. PLoS Comput. Biol. 5(5), e1000386 (2009).
70. Tatusov RL, Natale DA, Garkavtsev IV, et al. The COG database: new developments in phylogenetic classification of proteins from complete genomes. Nucleic Acids Res. 29(1), 22–28 (2001).
71. Kanehisa M, Goto S. KEGG: kyoto encyclopedia of genes and genomes. Nucleic Acids Res. 28(1), 27–30 (2000).
72. Aziz RK, Bartels D, Best AA, et al. The RAST Server: rapid annotations using subsystems technology. BMC Genomics. 9, 75 (2008).
73. Gupta SK, Padmanabhan BR, Diene SM, et al. ARG-ANNOT, a new bioinformatic tool to discover antibiotic resistance genes in bacterial genomes. Antimicrob. Agents Chemother. 58(1), 212–220 (2014).
74. Kelley LA, Mezulis S, Yates CM, Wass MN, Sternberg MJE. The Phyre2 web portal for protein modeling, prediction and analysis. Nat. Protoc. 10(6), 845–858 (2015).
75. Conesa A, Götz S. Blast2GO: A comprehensive suite for functional analysis in plant genomics. Int. J. Plant Genomics. 2008, 619832 (2008).
76. Rutherford K, Parkhill J, Crook J, et al. Artemis: sequence visualization and annotation. Bioinforma. Oxf. Engl. 16(10), 944–945 (2000).
77. Dubourg G, Lagier JC, Armougom F, et al. High-level colonisation of the human gut by Verrucomicrobia following broad-spectrum antibiotic treatment. Int. J. Antimicrob. Agents. 41(2), 149–155 (2013).
78. Lagier JC, Armougom F, Million M, et al. Microbial culturomics: paradigm shift in the human gut microbiome study. Clin. Microbiol. Infect. Off. Publ. Eur. Soc. Clin. Microbiol. Infect. Dis. 18(12), 1185–1193 (2012).
79. Msaddak A, Rejili M, Durán D, et al. Members of Microvirga and Bradyrhizobium genera are native endosymbiotic bacteria nodulating Lupinus luteus in Northern Tunisian soils. FEMS Microbiol. Ecol. 93(6) (2017).
80. Busquets A, Gomila M, Beiki F, et al. Pseudomonas caspiana sp. nov., a citrus pathogen in the Pseudomonas syringae phylogenetic group. Syst. Appl. Microbiol. [Internet]. Available from: http://www.sciencedirect.com/science/article/pii/S0723202017300450.
81. Vinuesa P, León-Barrios M, Silva C, et al. Bradyrhizobium canariense sp. nov., an acid-tolerant endosymbiont that nodulates endemic genistoid legumes (Papilionoideae: Genisteae) from the Canary Islands, along with Bradyrhizobium japonicum bv. genistearum, Bradyrhizobium genospecies alpha and Bradyrhizobium genospecies beta. Int. J. Syst. Evol. Microbiol. 55(Pt 2), 569–575 (2005).
82. Ahmad S, Mokaddas E, Al-Sweih N, Khan ZU. Phenotypic and molecular characterization of Candida dubliniensis isolates from clinical specimens in Kuwait. Med. Princ. Pract. Int. J. Kuwait Univ. Health Sci. Cent. 14 Suppl 1, 77–83 (2005).
83. Gupta RS. Protein phylogenies and signature sequences: A reappraisal of evolutionary relationships among archaebacteria, eubacteria, and eukaryotes. Microbiol. Mol. Biol. Rev. MMBR. 62(4), 1435–1491 (1998).
84. Ramasamy D, Mishra AK, Lagier JC, et al. A polyphasic strategy incorporating genomic data for the taxonomic description of novel bacterial species. Int. J. Syst. Evol. Microbiol. 64(Pt 2), 384–391 (2014).
85. Ramasamy D, Kokcha S, Lagier JC, Nguyen TT, Raoult D, Fournier PE. Genome sequence and description of Aeromicrobium massiliense sp. nov. Stand. Genomic Sci. 7(2), 246–257 (2012).
86. Roux V, El Karkouri K, Lagier JC, Robert C, Raoult D. Non-contiguous finished genome sequence and description of Kurthia massiliensis sp. nov. Stand. Genomic Sci. 7(2), 221–232 (2012).
87. Hugon P, Mishra AK, Lagier JC, et al. Non-contiguous finished genome sequence and description of Brevibacillus massiliensis sp. nov. Stand. Genomic Sci. 8(1), 1–14 (2013).
88. Kokcha S, Ramasamy D, Lagier JC, Robert C, Raoult D, Fournier PE. Non-contiguous finished genome sequence and description of Brevibacterium senegalense sp. nov. Stand. Genomic Sci. 7(2), 233–245 (2012).
89. Oren A, Garrity GM. List of new names and new combinations previously effectively, but not validly, published. , 1–5 (2014).
90. Lagier JC, Armougom F, Mishra AK, Nguyen TT, Raoult D, Fournier PE. Non-contiguous finished genome sequence and description of Alistipes timonensis sp. nov. Stand. Genomic Sci. 6(3), 315–324 (2012).
91. Lagier JC, El Karkouri K, Nguyen TT, Armougom F, Raoult D, Fournier PE. Non-contiguous finished genome sequence and description of Anaerococcus senegalensis sp. nov. Stand. Genomic Sci. 6(1), 116–125 (2012).
92. Lagier JC, Gimenez G, Robert C, Raoult D, Fournier PE. Non-contiguous finished genome sequence and description of Herbaspirillum massiliense sp. nov. Stand. Genomic Sci. 7(2), 200–209 (2012).
93. Lagier JC, El Karkouri K, Mishra AK, Robert C, Raoult D, Fournier PE. Non-contiguous finished genome sequence and description of Enterobacter massiliensis sp. nov. Stand. Genomic Sci. 7(3), 399–412 (2013).
94. Lagier JC, Elkarkouri K, Rivet R, Couderc C, Raoult D, Fournier PE. Non-contiguous finished genome sequence and description of Senegalemassilia anaerobia gen. nov., sp. nov. Stand. Genomic Sci. 7(3), 343–356 (2013).
95. Caputo A, Lagier JC, Azza S, et al. Microvirga massiliensis sp. nov., the human commensal with the largest genome. MicrobiologyOpen. 5(2), 307–322 (2016).

1 2 96. Kyrpides NC. Genomes OnLine Database (GOLD 1.0): a monitor of complete and ongoing 3 genome projects worldwide. Bioinforma. Oxf. Engl. 15(9), 773–774 (1999). 4 5 97. Lherminier P. Le mythe de l’espèce [Internet]. Paris : Ellipses Available from: 6 http://www.bmvr.marseille.fr/in/sites/marseille/faces/details.xhtml;jsessionid=A9FCC9D55B 7 2529EFB74828DCF8E013E6?id=p%3A%3Ausmarcdef_0001286066. 8 9 98. Le minor L, Veron M. Bactériologie médicale [Internet]. Editions Flammarion Available 10 from: https://www.abebooks.fr/Bact%C3%A9riologiem%C3%A9dicaleMINOR 11 L%C3%A9onVERONMichel/1201941569/bd. 12 13 14 99. Ravin AW. Experimental Approaches to the Study of Bacterial Phylogeny. Am. Nat. 97(896), 15 307–318 (1963). 16 17 100. Wheelis ML, Kandler O, Woese CR. On the nature of global classification. Proc. Natl. Acad. 18 Sci. U. S. A. 89,For 2930–2934 Review(1992). Only 19 20 101. Woese CR, Kandler O, Wheelis ML. Towards a natural system of organisms: proposal for the 21 domains Archaea, Bacteria, and Eucarya. Proc. Natl. Acad. Sci. U. S. A. 87(12), 4576–4579 22 (1990). 23 24 102. Woese CR. Bacterial evolution. Microbiol. Rev. 51(2), 221–271 (1987). 25 26 103. Moore ERB, Mihaylova SA, Vandamme P, Krichevsky MI, Dijkshoorn L. Microbial 27 systematics and taxonomy: relevance for a microbial commons. Res. Microbiol. 161(6), 430– 28 29 438 (2010). 30 31 104. Gevers D, Cohan FM, Lawrence JG, et al. Opinion: Reevaluating prokaryotic species. Nat. 32 Rev. Microbiol. 3(9), 733–739 (2005). 33 34 105. R.E. Buchanan, Holt JG, Lessel ER. Index Bergeyana An annotated alphabetic listing of 35 names of the taxa of the bacteria. (1966). 36 37 106. Skerman VBD, McGowan V, Sneath PHA. Approved Lists of Bacterial Names. Int. J. Syst. 38 Evol. Microbiol. 30(1), 225–420 (1980). 39 40 107. Reeves MW, Evins GM, Heiba AA, Plikaytis BD, Farmer JJ. 
Clonal nature of Salmonella 41 typhi and its genetic relatedness to other salmonellae as shown by multilocus enzyme 42 electrophoresis, and proposal of Salmonella bongori comb. nov. J. Clin. Microbiol. 27(2), 43 44 313–320 (1989). 45 46 108. Le Minor L, Popoff MY. Designation of Salmonella enterica sp. nov., nom. rev., as the Type 47 and Only Species of the Genus Salmonella: Request for an Opinion. Int. J. Syst. Evol. 48 Microbiol. 37(4), 465–468 (1987). 49 50 109. Hugon P, Dufour JC, Colson P, Fournier PE, Sallah K, Raoult D. A comprehensive 51 repertoire of prokaryotic species identified in human beings. Lancet Infect. Dis. 15(10), 52 1211–1219 (2015). 53 54 110. Zakhia F, de Lajudie P. [Modern bacterial taxonomy: techniques reviewapplication to 55 bacteria that nodulate leguminous (BNL)]. Can. J. Microbiol. 52(3), 169–181 (2006). 56 57 111. Krawiec S. Concept of a Bacterial Species. Int. J. Syst. Bacteriol. 35(2), 217–220 (1985). 58 59 60 https://mc04.manuscriptcentral.com/fm-fmb 37 Page 25 of 36 Future Microbiology

1 2 112. Vandamme P, Pot B, Gillis M, de Vos P, Kersters K, Swings J. Polyphasic taxonomy, a 3 consensus approach to bacterial systematics. Microbiol. Rev. 60(2), 407–438 (1996). 4 5 113. Wayne LG, Brenner DJ, Colwell RR, et al. Report of the Ad Hoc Committee on 6 Reconciliation of Approaches to Bacterial Systematics. Int. J. Syst. Bacteriol. 37(4), 463–464 7 (1987). 8 9 114. Keswani J, Whitman WB. Relationship of 16S rRNA sequence similarity to DNA 10 hybridization in prokaryotes. Int. J. Syst. Evol. Microbiol. 51(Pt 2), 667–678 (2001). 11 12 115. Stackebrandt E, Goebel BM. Taxonomic Note: A Place for DNADNA Reassociation and 16S 13 14 rRNA Sequence Analysis in the Present Species Definition in Bacteriology. Int. J. Syst. 15 Bacteriol. 44(4), 846–849 (1994). 16 17 116. Ritari J, Salojärvi J, Lahti L, de Vos WM. Improved taxonomic assignment of human 18 intestinal 16S rRNAFor sequences Review by a dedicated reference Only database. BMC Genomics. 16, 1056 19 (2015). 20 21 117. Acinas SG, Marcelino LA, KlepacCeraj V, Polz MF. Divergence and redundancy of 16S 22 rRNA sequences in genomes with multiple rrn operons. J. Bacteriol. 186(9), 2629–2635 23 (2004). 24 25 118. Rainey FA, WardRainey NL, Janssen PH, Hippe H, Stackebrandt E. paradoxum 26 DSM 7308T contains multiple 16S rRNA genes with heterogeneous intervening sequences. 27 28 Microbiol. Read. Engl. 142 ( Pt 8), 2087–2095 (1996). 29 30 119. Dagan T, Martin W. The tree of one percent. Genome Biol. 7(10), 118 (2006). 31 32 120. Gándara B, Merino AL, Rogel MA, Martı́nezRomero E. Limited Genetic Diversity 33 ofBrucella spp. J. Clin. Microbiol. 39(1), 235–240 (2001). 34 35 121. Větrovský T, Baldrian P. The variability of the 16S rRNA gene in bacterial genomes and its 36 consequences for bacterial community analyses. PloS One. 8(2), e57923 (2013). 37 38 122. Marchandin H, Teyssier C, Siméon De Buochberg M, JeanPierre H, Carriere C, JumasBilak 39 E. 
Intrachromosomal heterogeneity between the four 16S rRNA gene copies in the genus 40 Veillonella: implications for phylogeny and taxonomy. Microbiol. Read. Engl. 149(Pt 6), 41 1493–1501 (2003). 42 43 44 123. van Berkum P, Terefework Z, Paulin L, Suomalainen S, Lindström K, Eardly BD. Discordant 45 phylogenies within the rrn loci of . J. Bacteriol. 185(10), 2988–2998 (2003). 46 47 124. FitzGibbon ST, House CH. Whole genomebased phylogenetic analysis of freeliving 48 microorganisms. Nucleic Acids Res. 27(21), 4218–4222 (1999). 49 50 125. Tekaia F, Lazcano A, Dujon B. The genomic tree as revealed from whole proteome 51 comparisons. Genome Res. 9(6), 550–557 (1999). 52 53 126. Rivera MC, Lake JA. The ring of life provides evidence for a genome fusion origin of 54 eukaryotes. Nature. 431(7005), 152–155 (2004). 55 56 127. Montague MG, Hutchison CA. Gene content phylogeny of herpesviruses. Proc. Natl. Acad. 57 Sci. U. S. A. 97(10), 5334–5339 (2000). 58 59 60 https://mc04.manuscriptcentral.com/fm-fmb 38 Future Microbiology Page 26 of 36

1 2 128. Gupta RS. The branching order and phylogenetic placement of species from completed 3 bacterial genomes, based on conserved indels found in various proteins. Int. Microbiol. Off. J. 4 Span. Soc. Microbiol. 4(4), 187–202 (2001). 5 6 129. Coenye T, Vandamme P. Extracting phylogenetic information from wholegenome 7 sequencing projects: the lactic acid bacteria as a test case. Microbiol. Read. Engl. 149(Pt 12), 8 3507–3517 (2003). 9 10 130. Huson DH, Steel M. Phylogenetic trees based on gene content. Bioinforma. Oxf. Engl. 11 20(13), 2044–2049 (2004). 12 13 14 131. Snel B, Bork P, Huynen MA. Genome phylogeny based on gene content. Nat. Genet. 21(1), 15 108–110 (1999). 16 17 132. Goris J, Konstantinidis KT, Klappenbach JA, Coenye T, Vandamme P, Tiedje JM. DNADNA 18 hybridization valuesFor and their Review relationship to wholegenome Only sequence similarities. Int. J. 19 Syst. Evol. Microbiol. 57(Pt 1), 81–91 (2007). 20 21 133. Auch AF, von Jan M, Klenk HP, Göker M. Digital DNADNA hybridization for microbial 22 species delineation by means of genometogenome sequence comparison. Stand. Genomic 23 Sci. 2(1), 117–134 (2010). 24 25 134. Auch AF, Klenk HP, Göker M. Standard operating procedure for calculating genometo 26 genome distances based on highscoring segment pairs. Stand. Genomic Sci. 2(1), 142–148 27 28 (2010). 29 30 135. Thompson CC, Chimetto L, Edwards RA, Swings J, Stackebrandt E, Thompson FL. 31 Microbial genomic taxonomy. BMC Genomics. 14, 913 (2013). 32 33 136. Vernikos G, Medini D, Riley DR, Tettelin H. Ten years of pangenome analyses. Curr. Opin. 34 Microbiol. 23C, 148–154 (2015). 35 36 137. Tettelin H, Masignani V, Cieslewicz MJ, et al. Genome analysis of multiple pathogenic 37 isolates of Streptococcus agalactiae: Implications for the microbial “pangenome.” Proc. 38 Natl. Acad. Sci. U. S. A. 102(39), 13950–13955 (2005). 39 40 138. Lapierre P, Gogarten JP. Estimating the size of the bacterial pangenome. Trends Genet. TIG. 41 25(3), 107–110 (2009). 
42 43 44 139. Georgiades K, Raoult D. Defining pathogenic bacterial species in the genomic era. Front. 45 Microbiol. 1, 151 (2010). 46 47 140. Audic S, Robert C, Campagna B, et al. Genome analysis of Minibacterium massiliensis 48 highlights the convergent evolution of waterliving bacteria. PLoS Genet. 3(8), e138 (2007). 49 50 141. Merhej V, RoyerCarenzi M, Pontarotti P, Raoult D. Massive comparative genomic analysis 51 reveals convergent evolution of specialized bacteria. Biol. Direct. 4, 13 (2009). 52 53 142. OnaNguema G, RooseAmsaleg C, Paolozzi L, et al. Microbiologie. Dunod. 54 55 143. ContrerasMoreira B, Vinuesa P. GET_HOMOLOGUES, a versatile software package for 56 scalable and robust microbial pangenome analysis. Appl. Environ. Microbiol. 79(24), 7696– 57 7701 (2013). 58 59 60 https://mc04.manuscriptcentral.com/fm-fmb 39 Page 27 of 36 Future Microbiology

1 2 144. Santos AR, Barbosa E, Fiaux K, et al. PANNOTATOR: an automated tool for annotation of 3 pangenomes. Genet. Mol. Res. GMR. 12(3), 2982–2989 (2013). 4 5 145. Laing C, Buchanan C, Taboada EN, et al. Pangenome sequence analysis using Panseq: an 6 online tool for the rapid analysis of core and accessory genomic regions. BMC 7 Bioinformatics. 11(1), 461 (2010). 8 9 146. Li Li, J. Stoeckert C, S. Roos D. OrthoMCL: Identification of Ortholog Groups for 10 Eukaryotic Genomes [Internet]. Available from: http://genome.cshlp.org/content/13/9/2178. 11 12 147. Fouts DE, Brinkac L, Beck E, Inman J, Sutton G. PanOCT: automated clustering of orthologs 13 14 using conserved gene neighborhood for pangenomic analysis of bacterial strains and closely 15 related species. Nucleic Acids Res. 40(22), e172 (2012). 16 17 148. Brittnacher MJ, Fong C, Hayden HS, Jacobs MA, Radey M, Rohmer L. PGAT: a multistrain 18 analysis resourceFor for microbial Review genomes. Bioinformatics Only. 27(17), 2429–2430 (2011). 19 20 149. Hyatt D, Chen GL, Locascio PF, Land ML, Larimer FW, Hauser LJ. Prodigal: prokaryotic 21 gene recognition and translation initiation site identification. BMC Bioinformatics. 11, 119 22 (2010). 23 24 150. Rasko DA, Myers GSA, Ravel J. Visualization of comparative genomic analyses by BLAST 25 score ratio. BMC Bioinformatics. 6, 2 (2005). 26 27 151. Rasko DA, Rosovitz MJ, Myers GSA, et al. The Pangenome Structure of Escherichia coli: 28 29 Comparative Genomic Analysis of E. coli Commensal and Pathogenic Isolates. J. Bacteriol. 30 190(20), 6881–6893 (2008). 31 32 152. Collingro A, Tischler P, Weinmaier T, et al. Unity in varietythe pangenome of the 33 Chlamydiae. Mol. Biol. Evol. 28(12), 3253–3270 (2011). 34 35 153. D’Auria G, JiménezHernández N, PerisBondia F, Moya A, Latorre A. Legionella 36 pneumophila pangenome reveals strainspecific virulence factors. BMC Genomics. 11, 181 37 (2010). 38 39 154. Bascomb S, Lapage SP, Willcox WR, Curtis MA. 
Numerical Classification of the Tribe 40 Klebsielleae. J. Gen. Microbiol. 66(3), 279–295 (1971). 41 42 155. Cowan ST, Steel M, Shaw C, Duguid JP. A classification of the Klebsiella group. J. Gen. 43 44 Microbiol. 23, 601–612 (1960). 45 46 156. Brenner DJ, Farmer JJ, Hickman FW, Asbury MA, Steigerwalt AG. Taxonomic and 47 Nomenclature Changes in Enterobacteriaceae [Internet]. Available from: 48 https://www.abebooks.fr/TaxonomicNomenclatureEnterobacteriaceaeDonBrenner 49 Farmer/4158251114/bd. 50 51 157. Wang M, Cao B, Yu Q, et al. Analysis of the 16S–23S rRNA Gene Internal Transcribed 52 Spacer Region in Klebsiella Species. J. Clin. Microbiol. 46(11), 3555–3563 (2008). 53 54 158. Orskov I. Genus V. Klebsiella. , 461–465 (1984). 55 56 159. Orskov I. Klebsiella. Bergey’s Man. Syst. Bacteriol. (1974). 57 58 160. Rouli L, Merhej V, Fournier PE, Raoult D. The bacterial pangenome as a new tool for 59 60 https://mc04.manuscriptcentral.com/fm-fmb 40 Future Microbiology Page 28 of 36

1 2 analyzing pathogenic bacteria. New Microbes New Infect. [Internet]. Available from: 3 http://www.sciencedirect.com/science/article/pii/S2052297515000529. 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 For Review Only 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 https://mc04.manuscriptcentral.com/fm-fmb 41 Page 29 of 36 Future Microbiology

Figures

Figure 1. Number of publications per year for all pangenome studies in the genomic database on the PubMed website.

Figure 2. Validated number of bacterial names over the years.

Figure 3. Pipeline used for pangenome analyses.

Figure 4. The core/pangenome ratio for all strains of Klebsiella studied.


[Figure 1 image: y-axis, number of publications (scale 0 to 100); x-axis, years 2005 to 2016.]

[Figure 2 image: y-axis, number of validated species (scale 0 to 30,000); x-axis, years 1966 to 2017. Annotations: 1960s: DNA-DNA hybridization + chemotaxonomy; 1966 → 28,900 species (*); 1970s: polyphasic taxonomy; 1980s: 16S rRNA; 1980s: first approved list of bacterial names; 1980 → 1,792 species (**); 2010s: genome-derived taxonomic tools; 2014s: taxonogenomics; 2017 → 15,626 species (***). Sources: * Buchanan et al. 1966; ** Skerman et al. 1980; *** LPSN.]

[Figure 4 image: The core/pan-genome ratio for all strains of Klebsiella studied.]

PART I

Assembly of the Akkermansia muciniphila genome directly from metagenomic data

Foreword

The gut microbiota consists mainly of the phyla Firmicutes, Bacteroidetes, Actinobacteria and Proteobacteria; the phylum Verrucomicrobia is occasionally observed. Changes in the composition of the gut microbiota under antibiotic pressure have been widely studied [5], revealing a restricted diversity of the gut flora, including colonization by organisms such as Enterococci, whereas the impact of antibiotics on bacterial load is variable. A previous study from our laboratory reported, in two patients treated with broad-spectrum antibiotics, high-level colonization by Akkermansia muciniphila (a member of the phylum Verrucomicrobia), ranging from 39% to 84% of the total bacterial population, although attempts to culture this microorganism were unsuccessful.

The aim of this work was therefore to assemble and analyze the genome of a strain of A. muciniphila from the stool sample of the patient in whom 84% of the bacteria belonged to the phylum Verrucomicrobia. We propose an original approach to genome assembly based on mapping of metagenomic data. This method allowed us to align the reads produced by SOLiD and Roche 454 sequencers against the genome of the reference strain, A. muciniphila strain ATCC BAA-835. Using standard PCR together with data left unused during the mapping, we were able to reconstruct the genome of this strain, named "A. muciniphila strain Urmite", which is 2.72 Mb in size and consists of a single scaffold with 59 gaps. ORF prediction and annotation of this genome predicted 2,237 genes, which were then compared with the 2,192 genes of the reference genome.

A loss of 49 genes and a gain of 52 genes were observed relative to the reference. A functional analysis was then carried out with COG and KEGG, together with an in silico search for antibiotic resistance genes using the ARG-ANNOT database [6] and RAST [7]. Antibiotic resistance genes were found after comparison with the reference genome; however, no gene encoding imipenem resistance was detected, even though imipenem was part of the patient's antibiotic regimen.

This work was published in the journal Biology Direct.


ARTICLE 1

Whole-genome assembly of Akkermansia muciniphila sequenced directly from human stool

Aurélia Caputo, Grégory Dubourg, Olivier Croce, Sushim Gupta, Catherine Robert, Laurent Papazian, Jean-Marc Rolain and Didier Raoult

Caputo et al. Biology Direct (2015) 10:5 DOI 10.1186/s13062-015-0041-1

RESEARCH Open Access

Whole-genome assembly of Akkermansia muciniphila sequenced directly from human stool

Aurélia Caputo (1), Grégory Dubourg (1,2), Olivier Croce (1), Sushim Gupta (1), Catherine Robert (1), Laurent Papazian (3), Jean-Marc Rolain (1,2) and Didier Raoult (1,2)*

Abstract

Background: Alterations in gut microbiota composition under antibiotic pressure have been widely studied, revealing a restricted diversity of gut flora, including colonization by organisms such as Enterococci, while their impact on bacterial load is variable. High-level colonization by Akkermansia muciniphila, ranging from 39% to 84% of the total bacterial population, has recently been reported in two patients being treated with broad-spectrum antibiotics, although attempts to cultivate this microorganism have been unsuccessful.

Results: Here, we propose an original approach to genome sequencing of Akkermansia muciniphila directly from the stool sample collected from one of these patients. We performed an assembly using metagenomic data obtained from the stool sample. We used a mapping method, which consists of aligning metagenomic sequencing reads against the reference genome of the Akkermansia muciniphila MucT strain, and a de novo assembly to support this mapping method. We obtained a draft genome of the Akkermansia muciniphila strain Urmite with only 56 gaps. The absence of any particular metabolic requirement that could explain our inability to culture this microorganism suggests that the bacterium was dead before the stool sample was inoculated. Additional antibiotic resistance genes were found following comparison with the reference genome, providing some clues about its survival and colonization in the gut of a patient treated with broad-spectrum antimicrobial agents. However, no gene coding for imipenem resistance was detected, although this antibiotic was part of the patient's antibiotic regimen.

Conclusions: This work highlights the potential of metagenomics to facilitate the assembly of genomes directly from human stool.

Reviewers: This article was reviewed by Eric Bapteste, William Martin and Vivek Anantharaman.

Keywords: Akkermansia muciniphila, Genome, Gut microbiota, Metagenomics, Antibiotics

Background with several diseases, including obesity [3], eczema [4], The elucidation of the composition of the human gut and necrotizing enterocolitis [5]. While culture-dependent microbiota, which consists of approximately 100,000 bil- methods have been mainly used to elucidate the gut bac- lion bacteria, remains a major challenge for microbiolo- terial repertoire, molecular techniques have gradually gists. The influences of age, geographic location and risen in popularity and are now commonly used for the dietary habits on physiological variations in the micro- characterization of the digestive flora because 80% of bac- biota have been well established [1,2]. Moreover, alter- teria in the human gut remain uncultured [6]. Currently, ations in the composition of gut flora have been linked metagenomics is considered the gold standard for human gut studies despite evidence of several biases in this * Correspondence: [email protected] technology [7]. 1URMITE, UMR CNRS 7278-IRD, Aix-Marseille Université, Marseille Cedex 5, Disturbances induced by antimicrobial agents on the France composition of the gut microbiota have been widely ex- 2AP-HM, CHU Timone, Pôle Infectieux, 13005 Marseille, France Full list of author information is available at the end of the article plored. Most studies, whether they have been culture- dependent or based on molecular techniques, have

© 2015 Caputo et al.; licensee BioMed Central. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

55 Caputo et al. Biology Direct (2015) 10:5 Page 2 of 11

agreed that antibiotics restrict the heterogeneity of the established 70 culture conditions that produce a large di- gut microbiota [8-11]. Thus, some bacterial populations versity of bacteria. Considering the large proportion of that are frequently susceptible to antibiotics to which Verrucomicrobia found in the sample by metagenomic they are exposed [8-12] may be suppressed suggesting analysis, we focused our attention on culturing the population replacement or colonization by resistant Gram-negative Akkermansia muciniphila using select- microorganisms, such as Enterococci, under antibiotic ive medium containing the antibiotic vancomycin (to pressure [13]. inhibit predominant bacterial populations) or imipe- Recently, high-level colonization by the Verrucomicrobia nem, which was the antibiotic administered to the pa- phylum of up to 39% and 84% has been reported in two tient. Previous reports in the literature [15] prompted patients receiving a broad-spectrum antibiotic regimen us to use media containing mucin or to strengthen the [14]. All reads were assigned to one species, Akkermansia anaerobic conditions. The culture conditions used are muciniphila, which is an anaerobic Gram-negative bacter- summarized in Table 1. A MALDI-TOF database was also ium commonly found in the digestive tract that is able to amended with the Akkermansia muciniphila MucT strain degrade mucin [15]. These data were confirmed by spectra. Attempts to isolate Akkermansia muciniphila from fluorescence in situ hybridization. Despite significant the stool sample were unsuccessful (Additional file 1). efforts to culture the bacteria from both samples were unsuccessful. An additional recent study has reported Metagenomic sequencing a potential connection between Akkermansia muciniphila To extract DNA from the fecal samples, a modified ver- and obesity [16]. sion of the protocol described by Zoetendal et al.was Whole genomes have been previously sequenced dir- used [21]. 
A shotgun and a 5-kb paired-end library were ectly from samples, such as Chlamydia trachomatis pyrosequenced on a Roche 454 Titanium sequencer. from the vagina [17], uncultured Termite Group 1 bac- This project was loaded on a 1/4 region for each applica- teria from cells [18], and Deltaproteobacteria tion on a PTP PicoTiterPlate (PicoTiterPlate PTP Kit; from ocean samples [19]. Here, we performed a whole- Roche), and DNA was extracted twice. The first set of genome assembly for the Akkermansia muciniphila DNA was resuspended in 50 μl TE buffer and used to strain Urmite [EMBL: CCDQ000000000], isolated from construct a shotgun library. The DNA concentration an atypical stool sample, in which over 80% of the se- was measured using a Quant-it Picogreen Kit (Invitro- quences were assigned to the Akkermansia muciniphila gen) and a Genios Tecan fluorometer and was calculated type strain ATCC BAA-835 [Genbank:NC_010655.1]. To to be 37 ng/μl. A second set of DNA was later extracted the best of our knowledge, this is the first report of the in an attempt to construct a paired-end library. DNA whole-genome sequencing of a stool sample in the was resuspended in 120 μl TE buffer, and the concentration absence of a cultured isolate. was measured as above and calculated to be 11.5 ng/μl. The shotgun library was constructed with 500 ng of DNA as de- Methods scribed by the manufacturer with a GS Rapid Library Prep Stool sample Kit (Roche). The concentrationoftheshotgunlibrarywas The patient was a 62-year-old man admitted to the inten- measured using a TBS fluorometer and determined to be sivecareunitandtreatedwitha10-daycourseofimipe- 1.05x109 molecules/μl. The paired-end library was con- nem (3 g/day) at the time of stool collection [14]. He did structed from a mix of the 2 sets of DNA, but only 2.7 μgof not show with any gastrointestinal manifestations. 
We did DNA from each set was used instead of the 5 μgrecom- not obtain written informed consent for the stool collection mended by the manufacturer. The DNA was mechanically due to the death of the patient. Approval from the local fragmented to 5 kb with a Covaris device (KBioScience- ethics committee of the Institut Fédératif de Recherche LGC Genomics, Teddington, UK) and a miniTUBE-Red. IFR48 (Marseille, France) was obtained under agreement The DNA fragments were visualized using an Agilent 2100 09–022. This agreement allows, according to French legis- BioAnalyzer on a DNA LabChip 7500 with an optimal lation, the use of stool samples because they are considered size of 5 kb. The library was constructed according to to be waste of human origin and do not involve additional the 454 Titanium manufacturer’s paired-end protocol. sample collection from the patient. Circularization and nebulization were performed, gen- erating a pattern with an optimal length of 549 bp. After Culture 17 cycles of PCR amplification followed by double-size se- Each gram of stool was diluted in 9 ml of Dulbecco’s lection, the single-stranded paired-end library was then Phosphate-Buffered Saline (DPBS) (Life technolgies, Saint quantified by an Agilent 2100 BioAnalyzer on an RNA Aubin, France) and inoculated in serial dilutions ranging 6000 Pico LabChip and was measured to be 549 pg/μl. from 1/10 to 1/1010 using different culture media and The library concentration equivalence was calculated to variable conditions. Previous culturomics studies [20] have be 1.87×109 molecules/μl. The library was stored at −20°C

56 Caputo et al. Biology Direct (2015) 10:5 Page 3 of 11

Table 1 In silico prediction of antibiotic resistance genes in our consensus genome

Class | Best match | Length (aa) | GC content (%) | Best hits with organism | Similarity (%) | Coverage (%) | Accession number
--- | --- | --- | --- | --- | --- | --- | ---
Beta-lactamases | cfxA | 332 | 51 | Akkermansia muciniphila ATCC BAA-835 | 100 | 100 | WP031930069
Beta-lactamases | tlaA | 258 | 60 | Akkermansia muciniphila ATCC BAA-835 | 97 | 84 | YP001876808
Beta-lactamases | Beta-lactamase domain protein | 298 | 60.7 | Akkermansia muciniphila ATCC BAA-835 | 100 | 87 | YP001877266
Beta-lactamases | CphA2 | 403 | 60.8 | Akkermansia muciniphila ATCC BAA-835 | 99 | 95 | YP297581
Beta-lactamases | Act | 323 | 60.8 | Akkermansia muciniphila ATCC BAA-835 | 99 | 85 | YP001876732
Beta-lactamases | Metal-dependent hydrolases of the beta-lactamase superfamily | 348 | 63.6 | Akkermansia muciniphila ATCC BAA-835 | 99 | 75 | YP001877492
Beta-lactamases | Metallo-beta-lactamase family protein | 468 | 56.5 | Akkermansia muciniphila ATCC BAA-835 | 99 | 100 | YP001876862
Beta-lactamases | Zn-dependent hydrolase | 274 | 59.8 | Akkermansia muciniphila ATCC BAA-835 | 96 | 98 | YP001877763
Glycopeptides | vanX; D-ala D-ala dipeptidase | 234 | 56.6 | Akkermansia muciniphila ATCC BAA-835 | 99 | 95 | YP001878228
MLS | mefA; macrolid efflux pump | 401 | 49.2 | Akkermansia muciniphila ATCC BAA-835 | 100 | 100 | WP031931063
MLS | ermB; erythromycin ribosome methylase | 245 | 45 | Akkermansia muciniphila ATCC BAA-835 | 100 | 100 | WP012420167
MLS | olec; macrolid ABC transporter protein | 702 | 61.9 | Akkermansia muciniphila ATCC BAA-835 | 65 | 100 | WP012419164
Phenicol | catA3; chloramphenicol acetyltransferase | 211 | 56 | Akkermansia muciniphila ATCC BAA-835 | 99 | 100 | YP001876953
Sulphonamide | sulII; dihydropteroate synthase | 279 | 65.7 | Akkermansia muciniphila ATCC BAA-835 | 49 | 100 | YP001877991
Tetracyclin | tetO | 639 | 51.2 | Akkermansia muciniphila ATCC BAA-835 | 100 | 100 | WP012419363
Trimethoprim | dfrA3; dihydrofolate reductase | 122 | 57.8 | Akkermansia muciniphila ATCC BAA-835 | 99 | 75 | YP001878622

until use. The shotgun library was clonally amplified with 2 cpb in 4 emPCR reactions, and the 5-kb paired-end library was amplified with lower cpb values (0.25 and 0.5 cpb) in 2 emPCR reactions per condition with a GS Titanium SV emPCR Kit (Lib-L) v2. The emPCR yield was 10.24% for the shotgun library and between 6.4% and 7.8% for the clonal amplification of the 5-kb paired-end library. These percentages were within the quality range of 5% to 20% expected for the Roche procedure. A total of (70,000 beads from the shotgun library were loaded on a ¼ region of a GS Titanium PicoTiterPlate, whereas only 686,598 beads from the paired-end library were loaded on another ¼ region of the PicoTiterPlate with a GS Titanium Sequencing Kit XLR70. Runs were performed overnight and then analyzed with gsRunBrowser and Roche gsAssembler.

Metagenomic alignment
Our metagenomic alignment and the following method are shown in Figure 1.

In this study, we generated 1.4 gigabases of metagenomic sequence data from the stool sample. Reads were generated from short-read shotgun and paired-end runs on a 454 sequencer and a SOLiD sequencer. These reads were aligned to a database containing most known human genomes using Deconseq [22]. Only 0.2% of the reads were identified as human, and these were removed from the dataset. The 454 shotgun (122,354 reads) and paired-end sequencing (268,104 reads) data were mapped against the Akkermansia muciniphila (ATCC BAA-835) genome using CLC Workbench software (CLC bio, Aarhus, Denmark). An identity of 90% was used as the threshold for the alignment of a read to the reference genome. Sequence data obtained with the SOLiD sequencer (3,844,884 reads) were mapped against the previously created consensus using CLC Workbench software. The low setting was used for the largest proportion of the data. The parameters used were 70% identity and 40 bp length fraction.
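The identity thresholds quoted above (90% for the first 454 mapping, 70% identity with a 40 bp setting for the SOLiD data) can be illustrated with a small filter. This is a minimal sketch and not the CLC Workbench implementation: the `Alignment` record, the `keep` helper and the toy reads are hypothetical, and reading the quoted 40 bp as a minimum aligned length is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alignment:
    """Hypothetical summary of one read placed on the reference."""
    read_id: str
    aligned_length: int  # number of read bases placed on the reference
    matches: int         # aligned bases identical to the reference


def identity(aln: Alignment) -> float:
    """Fraction of aligned bases that match the reference."""
    if aln.aligned_length == 0:
        return 0.0
    return aln.matches / aln.aligned_length


def keep(aln: Alignment, min_identity: float, min_aligned_bp: int = 0) -> bool:
    """Accept a read only if it meets both thresholds."""
    return aln.aligned_length >= min_aligned_bp and identity(aln) >= min_identity


# Toy alignments for illustration only (not data from the study).
alignments = [
    Alignment("r1", 200, 190),  # 95% identity
    Alignment("r2", 180, 153),  # 85% identity
    Alignment("r3", 30, 30),    # 100% identity, but only 30 bp aligned
]

# First mapping: 90% identity threshold (assumed: no length constraint).
first_pass = [a.read_id for a in alignments if keep(a, 0.90)]
# SOLiD mapping: 70% identity, assumed minimum of 40 aligned bp.
solid_pass = [a.read_id for a in alignments if keep(a, 0.70, 40)]
```

Lowering the identity threshold admits more divergent reads (r2 here), while the length constraint guards against very short spurious matches (r3).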


Figure 1 Schematic of the two-assembly method using metagenomic data performed in this study.

Final mapping was conducted with the 454 shotgun and paired-end data against the previously created consensus genome using CLC Workbench software with the default parameters (80% identity and 50 bp for the length fraction).

Alternative methods for assembling Akkermansia muciniphila genome
The generation of an assembly by the mapping of reads to a reference genome requires a reference with a high level of quality and sufficiently similar sequences. Indeed, if the reference genome contains additional or highly divergent genes, the assembly may include many gaps or poorly assembled regions. De novo assembly remains the best solution, but in our study, it was impossible to achieve because we used a metagenomic sample. An alternative method involves first creating a de novo assembly to obtain a set of contigs and then selecting only those contigs that are highly similar to the reference. In addition, the contigs are ordered with respect to each other using the reference. Thereafter, the sequences can be joined using conventional finishing procedures, and the remaining reads can be used to fill the gaps.
These methods were performed to assemble the Akkermansia muciniphila genome using several tools. The assembly step was conducted using Newbler 2.8 [23] and Mira 3.2 [24]. The contigs obtained were combined by Cisa to reduce the set [25]. The contig mapping step was carried out with ReviSeq algorithm (under development, unpublished). Finally, finishing was performed using Gapfiller and CLC Genomics. The genome obtained using this method contained a single 2.72 Mb chromosome without gaps.

In silico antibiotic resistance gene prediction
The ARG-ANNOT database for acquired antibiotic resistance genes (ARGs) was employed for a BLAST search using the Bio-Edit interface [26]. The assembled sequences were searched against the ARG database under moderately stringent conditions (e-value of 10^−5) for the in silico ARG prediction. These sequences were also submitted to Rapid Annotation using Subsystem Technology (RAST) [27] for annotation; additional putative ARG annotations are listed in Table 1. These putative ARGs were further verified through a web-enabled NCBI GenBank BLAST search.

Results
Mapping
The 454 shotgun and paired-end sequencing data were mapped against the Akkermansia muciniphila genome. The total number of mapped reads represented 44% (171,593) of the total reads, and the mean length of these reads was 199 bp. The paired-end reads represented 28% (107,582) of the total reads. The average coverage of the consensus was 13-fold. The consensus


length generated from this mapping, including the gaps, was 2,664,714 bp, which was used for the remainder of the experiment. We performed a Mauve genome alignment [28], which showed a high level of similarity between the reference and consensus genomes (Figure 2). The differences were due to 519 gaps in the consensus.
Sequence data from the SOLiD sequencer were mapped against the previously created consensus. A total of 791,434 reads were mapped, and the mean read length was 43 bp. The maximal coverage achieved by this mapping was 636-fold due to the large amount of data produced from SOLiD technology, and 95% of the previous consensus was covered. The consensus length generated from this mapping, including the gaps, was 2,664,713 bp, which was used for the remainder of the experiment. The second mapping reduced the number of gaps to 446. The consensus sequence produced by this mapping was used for the next mapping.
For the final mapping, the mapped reads represented 45% (174,209) of the total, and the mean read length was 197.50 bp. The paired-end reads represented 28% (109,000) of the total reads. The average coverage of the consensus was 13-fold, and the consensus length generated from this mapping, including the gaps, was 2,664,704 bp. As a result of this mapping, only 392 gaps remained.
A large number of short sequences were inserted into gaps in the consensus sequence during these mapping steps. When we finished using all of the metagenomics data, we removed these sequences because they were not useful for the remaining analyses. In the end, we were left with 73 gaps. Analysis of metagenomic data from the trash reads allowed for the collection of 1189 reads corresponding to the genome of the Akkermansia muciniphila type strain (ATCC BAA-835), which has a G + C content of 55.8%. We carried out a BLASTX [29] search of the trash reads against the non-redundant protein sequence database (Nr), collecting only those reads with G + C contents of between 54% and 57%. Only 28 reads remained with a maximum size of 416 bp. However, we mapped these 28 reads against our genome assembly and closed two gaps of 31 bp and 296 bp in size.

Finishing
The PCR results allowed us to close 6 gaps of different lengths (ranging from 80 bp to 1168 bp). We then compared the consensus genome created by the mapping with that obtained by the alternative method. This comparison revealed sequence insertions that did not exist in the reference but were confirmed by PCR. We were also able to close the remaining gaps using this method. To ensure that the gaps would be closed using the alternative genome, we mapped the reads against the corresponding regions of the genome, leaving a total of 56 gaps.

Comparison
There were 2192 genes identified in the Akkermansia muciniphila type strain (ATCC BAA-835) genome. In the genome obtained by mapping, we identified 2237 genes. After a BLASTP [29] search and the verification of false positives, 49 Akkermansia muciniphila genes remained that were not present in our genome, and 52 genes remained that were not present in the reference. These sets of genes were analyzed and visualized using a circular map constructed with ACT Artemis software (Wellcome Trust Sanger Institute, Cambridge) [30] (Figure 3). We were thus able to estimate that our genome had lost 49 genes and gained 52 others. For these genes, we have detailed the mutation types (stop, frameshift, multiple or replacement) in Additional file 2. These gene losses and gains could be explained by the adaptation of this bacterium to living in a sympatric environment (metagenomic sample) [31], where it has the opportunity to exchange genes.

Functional analysis
We used the Clusters of Orthologs Groups (COG) database [32] through the WebMGA server [33] to analyze the distribution of the 49 lost and 52 gained genes

Figure 2 Mauve alignment of the Akkermansia muciniphila reference genome and the consensus created by mapping.


Figure 3 Circular map of the Akkermansia muciniphila genome and the consensus genome. The red bars correspond to reference genes that are absent in the consensus. The blue bars correspond to consensus genes that are absent in the reference.

among the different functional categories. A few of the genes could be annotated with this database. The majority of the 49 lost genes were related to and the “C” categories (Energy production and conversion). A few were involved in cellular processes as well as signaling and information storage and processing. Some were part of the “J” (Translation, ribosomal structure and biogenesis) and “O” categories (Posttranslational modification, protein turnover, chaperones). The majority of the 52 genes that had been gained were involved in metabolic reactions. Only two of these genes belonged to the “L” category (Replication, recombination and repair).

Metagenomic data other than that from Akkermansia muciniphila
The mapping against the reference genome revealed a number of reads that could not be aligned with Akkermansia muciniphila (trash reads). The mothur package [34] was used to remove redundant reads (i.e., reads that were 100% identical) from the 218,865 trash reads. This left 175,756 reads, which were compared to GenBank’s Nr database using BLASTX [29]. We kept only the best hits that met the thresholds we established: an E-value cut-off of 10^−4, an identity cut-off of 50% and a score cut-off of 50. With these results, we classified the metagenomic data by genus (Figure 4) and species (Figure 5).

Antibiotics resistance gene research
An in silico ARG prediction was performed using ARG-ANNOT [26] and RAST [27]. Resistance studies of Akkermansia muciniphila have shown the presence of a range of putative ARGs from different antibiotic classes. The details of the ARG analysis are presented in Table 1. Eight beta-lactamase genes were detected in this isolate that shared over 96% similarity and belonged to the class 1 and 2 beta-lactamases and metallo-beta-lactamases. Out of the three detected macrolide resistance genes, one was only 65% similar to a known macrolide resistance gene. In addition, one gene each was found to be associated with resistance to vancomycin, chloramphenicol, sulfonamide, tetracyclin and trimethoprim.


Figure 4 Classification by genus of the remaining 175,756 reads after mapping. Only abundant reads are represented.
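A genus-level tally such as the one summarized in Figure 4 could be produced from the organism names of the best hits. The sketch below is an assumption about how such a count might be done, not the authors' procedure: the helper names, the `min_reads` abundance cut-off and the toy organism list are all hypothetical.

```python
from collections import Counter


def genus_of(organism: str) -> str:
    """Take the first word of a binomial name as the genus."""
    parts = organism.split()
    return parts[0] if parts else "unclassified"


def abundant_genera(organisms, min_reads):
    """Count reads per genus and keep genera with at least min_reads reads."""
    counts = Counter(genus_of(name) for name in organisms)
    return [(genus, n) for genus, n in counts.most_common() if n >= min_reads]


# Toy best-hit organisms, one per read (illustrative only, not study results).
toy_hits = [
    "Bacteroides fragilis",
    "Bacteroides vulgatus",
    "Escherichia coli",
    "Bacteroides fragilis",
    "Enterococcus faecalis",
    "Escherichia coli",
]
summary = abundant_genera(toy_hits, min_reads=2)
```

The same tally keyed on the full binomial name instead of the first word would give a species-level view.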

Discussion
This study demonstrated an original approach for obtaining an assembled microbial genome. This approach permitted the assembly of a nearly complete genome from metagenomic data derived from human stool. We demonstrated the feasibility of assembling this genome by mapping reads to a reference genome. We used the genome of Akkermansia muciniphila, which is a representative of the phylum Verrucomicrobia, as the reference. We are confident in our findings, having routinely sequenced whole bacterial genomes [35-37] and mapped the Akkermansia muciniphila reference genome to fill remaining gaps. Whole-genome sequencing of microorganisms has previously been performed directly from human samples, such as Chlamydia trachomatis derived from vaginal swabs [17] (Table 2). However, to the best of our knowledge, this study is the first report of a bacterium that has been entirely sequenced from a human stool sample.
The detected beta-lactamase resistance genes could confer resistance to many beta-lactam antibiotics, such as benzylpenicillin, amoxicillin, cephalothin, ceftriaxone, ceftazidime, penicillins, cephalosporins, the monobactam aztreonam and imipenem. The detected macrolide genes may confer resistance to erythromycin, azithromycin or clarithromycin due to the high expression of these genes observed during antibiotic treatment; however, the other detected ARGs may confer resistance to their respective antibiotics. Although our sample was collected from a patient being treated with a broad-spectrum antibiotic regimen, the in silico prediction warrants further experimental validation. Moreover, a previous attempt to detect carbapenemase by MALDI-TOF [38] directly from a stool sample has shown negative results [14], suggesting that caution should be used in the interpretation of these findings.
KEGG analysis revealed no apparent variations in metabolic pathways between the two strains [39] (data not shown). These data exclude particular needs for the strain present in the sample, which may explain our failure to culture an isolate despite the use of enriched or selective media with antibiotics. Akkermansia muciniphila is a fastidious and strictly anaerobic bacterium. It is possible that precautions for maintaining the anaerobiosis of the sample from the time of sample collection to aliquoting were unsuccessful, rendering the strain non-viable because of its extreme sensitivity to oxygen.

Conclusions
We have proposed an original approach for sequencing a complete genome directly from human stool samples, which was assembled by mapping reads to a reference genome. If data obtained here did not explain our failure to culture the strain from the sample, resistome analysis

Figure 5 Classification by species of the remaining 175,756 reads after mapping. Only abundant reads are represented.


Table 2 Whole-cell sequenced genomes already published

Species/strain sequenced | Specimen origin | Sequencing technology | N reads | N scaffolds/contigs | N gaps
--- | --- | --- | --- | --- | ---
Akkermansia muciniphila strain Amuc | Human stool | SOLiD; 454 (shotgun, PE) | 4,235,342 | 1 scaffold | 56 gaps
Chlamydia trachomatis [17] | Human vaginal swab | Illumina HiSeq 2000 | 70,201,544 | 85% aligned/reference (2.6x cov); 6% aligned (6x cov) |
Termite Group 1 strain Rs-D17 [18] | Single host protist cell | Sanger; 454 | NA | Complete genome (120 contigs) |
SAR324 clade of Deltaproteobacteria | Global ocean single cell marine sample | Illumina GA PE 100 bp | 67,995,232 | 646 contigs |

provided some clues concerning the colonization and survival of Akkermansia muciniphila in the gastrointestinal tract in a patient treated with a broad-spectrum antibiotic regimen.

Reviewers’ comments
We thank the reviewers for their valuable comments and helpful suggestions. We would like to respond and revise our manuscript in light of the reviews.

Reviewer’s report 1: Dr. Eric Bapteste, UPMC, Institut de Biologie Paris Seine, France
Reviewer 1
This work reports the sequencing of an almost complete bacterial draft genome (Akkermansia muciniphila) from a stool sample collected from a patient. This bacteria was likely resistant to antibiotic treatments, and one of the goals of this analysis was to identify genes potentially involved in the emergence of this medically challenging phenotype by comparing this genome to the genome of a closely related reference.
To this end, the authors used, according to their own words, an original approach to assemble metagenomic reads, producing a genome with 56 gaps left. I am not qualified to evaluate the sampling process, so I will focus my report on the other parts of this work. Although the methodology seems very sound, I would like to encourage the authors to elaborate a bit more on several aspects of their analyses.

1) Would it be possible to explain in what sense the proposed approach is original (what is new/different from usual approaches for reads assembly)?

Authors’ response: We thank Dr. Bapteste for his comments on our manuscript. Our study presented an original approach because we obtained a genome directly from metagenomic data, without pre-processing. Our work allowed us to sequence the Akkermansia muciniphila strain Urmite genome directly from a stool sample, which has never been published to the best of our knowledge.
Also, the section, entitled Mapping (p.8), could be slightly improved in order to introduce more clearly what are the critical steps of this approach and what is their logical order of application. For example, does the order in which sequences were assembled matter? Here, the mapping started with reads from 454 shotgun data, then further information were obtained by mapping data obtained from a SOLiD sequencer. The former mapping produced a 13-fold average coverage and left 519 gaps, the latter one resulted in a 636-fold average cover and left 446 gaps. It is not very clear how the authors reconciled these two results to obtain their next mapping (the one with 392 gaps left and a 13-fold coverage). Likewise, the logic and order of steps of gaps reduction could be explained a bit more. That way, future studies may directly use the protocol proposed in this work.
Authors’ response: There is a logical order of applications for the mapping approach. This approach involves aligning reads against a reference genome. We started with the longer reads from the 454 shotgun data to obtain as long a consensus sequence as possible. Then, we used the shorter reads from the SOLiD sequencer, which allowed us to close gaps and to obtain a higher-quality sequence. Indeed, the SOLiD technology produced short reads in large quantities, from which greater coverage was obtained. For each step, we used the previously generated consensus sequence for the next mapping method. We added an additional explanation on page 8, l. 180–184.

2) Would it be possible to give a little bit more of an evolutionary perspective to the results that were found, i.e. the claim that Akkermansia muciniphila lost 49 genes and gained 52 others. How quick were these gains and losses? Maybe, providing the readers with a distance in terms of % identity between the 16S of the reference genome and the 16S of the newly assembled genome might give a


better sense of the extent of divergence that occurred in these genomes outside this gold standard marker?

Likewise, did these gains and losses concern limited regions of the genomes, such as genomic islands, or were they widespread? Is there any clue of the mechanisms involved in the genes gains?
Authors’ response: In this study, we did not focus on the evolutionary process. The 16S sequences are identical (100% identity) in the reference and the draft genome. According to Figure 3, these genes are widespread in the genomes and are not situated in limited regions. Thanks to your questions, we have clarified our findings. We have created Additional file 2 (line 219) to clarify whether these genes have stop codons, frameshifts, or mutations or are replaced with other genes.
This work is based on metagenomic data; the bacteria existed in a sympatric environment, allowing the opportunity to exchange foreign sequences [31]. In this specific environment, the bacteria had the capacity to acquire genes to integrate them into chromosome and the ability to keep them included in the chromosome. The lost or gained genes were linked in the microorganism’s adaptation in a sympatric environment.
Minor points: The abstract indicates that 56 gaps are left in the draft genome, but Table 2 indicates 58 gaps, please reconcile these numbers.
Authors’ response: Thank you for this comment; we corrected this table.
p.7. The authors correctly explain that if the reference genome contains additional or highly divergent genes with respect to the environmental genome that they aimed at reconstructing, their protocol would result in gaps in this latter draft genome. Conversely, it might also be useful to discuss what would happen, in terms of assembly, if the draft genome contains additional or highly divergent genes with respect to the reference genome. In particular, is not there a risk to lose some of the original gene content of Akkermansia muciniphila? For example, could lost genes be fast evolving genes?
Authors’ response: If the draft genome contains additional highly divergent genes with respect to the reference genome, we could lose this information and would not be able to reconstruct the original genetic content entirely. We were able to use this mapping method because the genomes were very similar.
p.8. typo “They sequences”
Authors’ response: We corrected this typographical error.
Figure 1: the authors used CLC for the mapping, did they try other assemblers (and a different range of parameters) to estimate whether the number of gaps could be further reduced? In particular, in a second step of the analysis, could not it help to relax the criterion of % similarity (>90%) to possibly aggregate more divergent genes into the contigs/genome?
Authors’ response: We did not try other mapping software, but we used different ranges of parameters (lines 138, 142, 144–145). We chose to use a high-stringency condition for the first mapping to be sure that we aligned the reads that belonged to Akkermansia muciniphila because we aligned reads from the metagenomic data of a stool sample.
Table 1 typo: “best his” instead of “best hits”.
Authors’ response: We corrected this typographical error.
Quality of written English: Acceptable.

Reviewer’s report 2: Prof. William Martin, Institut of Botanic III, Heinrich-Heine University, Düsseldorf, Germany
Reviewer 2
This is a fine paper reporting a whole genome sequence assembly from human stool, a substantial technical advance. The focus of the paper is methodological, the applications of the method are broad. This is one of the world’s leading genomics groups, which shows in the quality of preparation for this paper. In my view it can be published as is, maybe following one more check regarding the permissions policies of BD and IFR48 with regard to the consent issue.
Authors’ response: We thank reviewer 2 for the comments on our article. As already written in the paper (lines 82–85), French legislation (agreement 09–022) allows the utilization of stool samples without the patient’s consent because these samples are considered to be wastes of human origin.
typo p. 4: “did not present with any” -> “did not show any”.
Authors’ response: We corrected this typographical error.
Quality of written English: Acceptable.

Reviewer’s report 3: Dr. Vivek Anantharaman, NCBI, NLM, NIH, USA
Reviewer 3
The paper presents a novel method of sequencing a bacterial genome from a stool sample. As a methods paper this paper presents the data well. But I have a few concerns in the analysis section. The authors say that 49 Akkermansia muciniphila genes were not present in their genome and 52 genes were not present in the reference set.

1) Do the genes that are missing fall in regions where the synteny of the genomes is disrupted? If so, have they been replaced by some other gene? Or, are they rapidly diverging genes and hence have escaped the cut-off?

Authors’ response: We thank reviewer 3 for the comments on our article. In this study, we did not focus on the evolutionary process. We have performed other


verifications, and we can say that in the 49 genes absent in Akkermansia muciniphila strain Urmite, 22 had mutations involving the appearance of stop codons, 11 were caused by different mutations, 4 had mutations involving frameshifts, and the remainder were replaced by some other gene in the same location on the genome. We have performed the same verifications for the 52 genes present only in the Akkermansia muciniphila strain Urmite genome, and we can say that 4 had mutations involving the appearance of stop codons, 21 were caused by different mutations, 12 had mutations involving frameshifts, and the remainder were replaced. To clarify this point, we have added Additional file 2.

2) A list of the novel genes involved in antibiotics resistance are shown in the Table 1. Some of these genes are shown to have 100% identity to Bacillus subtilis, Enterococcus faecalis and the warneri, all firmicutes, 99% or higher similarity to Bacteroides, and 95% or higher to Clostridium genes. Given that Akkermansia is a Verrucomicrobia, the high identities (especially a 100% identity) of the “novel genes” to those from organisms belonging to a totally different clade would suggest that they are contamination from those genomes and not necessarily novel genes. The authors have to either explain the very high similarity or consider these genes as dubious.

Authors’ response: Table 1 was based on the sequences present in the ARG-ANNOT database, which allows targeting putative genes. We found putative resistance genes based on sequence homology. We performed verification using BLAST for 9 genes that have high similarity in other bacteria. Among these genes, we found 5 genes that have almost 100% similarity and 100% coverage with Akkermansia but are not annotated as resistances genes, and most are hypothetical. Thanks to this database, we annotated resistance genes in Akkermansia muciniphila strain Urmite. To clarify this point, we have revised Table 1. According to this revision, we have modified the text (lines 234–236).

Minor issues

1) The Additional files are not referred to in the text of the paper. This should be added.

Authors’ response: We have taken this comment into account, and we made this change.

2) In the “In silico antibiotic resistance gene prediction” paragraph “They sequences were also” should read “These sequences”.

Authors’ response: We corrected this typographical error.
Quality of written English: Acceptable.

Additional files
Additional file 1: Culture conditions applied to the stool sample during the culturomics study.
Additional file 2: Different mutations involved in the 52 gain genes and 49 loss genes.

Abbreviations
DPBS: Dulbecco’s Phosphate-Buffered Saline; ARG: Antibiotic Resistance Gene; RAST: Rapid Annotation using Subsystem Technology; Nr: Non-redundant; COG: Clusters of Orthologs Groups; KEGG: Kyoto Encyclopedia of Genes and Genomes.

Competing interests
The authors declare that they have no competing interests.

Authors’ contributions
DR designed the research project. AC performed mapping and genomic analysis, and wrote the paper. GD performed biological techniques and wrote the paper. SG and JMR provided resistome analysis. OC performed de novo assembly. CR was involved in metagenomic sequencing. LP provided the sample. DR revised the paper. All authors read and approved the final manuscript.

Acknowledgments
This work was funded by IHU Méditerranée Infection.

Author details
1 URMITE, UMR CNRS 7278-IRD, Aix-Marseille Université, Marseille Cedex 5, France. 2 AP-HM, CHU Timone, Pôle Infectieux, 13005 Marseille, France. 3 Service de Réanimation Médicale-Détresse Respiratoires et Infections Sévères, Marseille, France.

Received: 23 October 2014 Accepted: 6 February 2015

References
1. Finegold SM, Attebery HR, Sutter VL. Effect of diet on human fecal flora: comparison of Japanese and American diets. Am J Clin Nutr. 1974;27:1456–69.
2. Raoult D. Human microbiota. [corrected]. Clin Microbiol Infect Off Publ Eur Soc Clin Microbiol Infect Dis. 2012;18 Suppl 4:1.
3. Ley RE, Turnbaugh PJ, Klein S, Gordon JI. Microbial ecology: human gut microbes associated with obesity. Nature. 2006;444:1022–3.
4. Nylund L, Satokari R, Nikkilä J, Rajilić-Stojanović M, Kalliomäki M, Isolauri E, et al. Microarray analysis reveals marked intestinal microbiota aberrancy in infants having eczema compared to healthy children in at-risk for atopic disease. BMC Microbiol. 2013;13:12.
5. Mai V, Young CM, Ukhanova M, Wang X, Sun Y, Casella G, et al. Fecal microbiota in premature infants prior to necrotizing enterocolitis. PloS One. 2011;6:e20647.
6. Eckburg PB, Bik EM, Bernstein CN, Purdom E, Dethlefsen L, Sargent M, et al. Diversity of the human intestinal microbial flora. Science. 2005;308:1635–8.
7. Stackebrandt E, Goebel BM. Taxonomic Note: A Place for DNA-DNA Reassociation and 16S rRNA Sequence Analysis in the Present Species Definition in Bacteriology. Int J Syst Bacteriol. 1994;44:846–9.
8. Dethlefsen L, Huse S, Sogin ML, Relman DA. The pervasive effects of an antibiotic on the human gut microbiota, as revealed by deep 16S rRNA sequencing. PLoS Biol. 2008;6:e280.
9. Jernberg C, Löfmark S, Edlund C, Jansson JK. Long-term impacts of antibiotic exposure on the human intestinal microbiota. Microbiol Read Engl. 2010;156(Pt 11):3216–23.
10. Robinson CJ, Young VB. Antibiotic administration alters the community structure of the gastrointestinal microbiota. Gut Microbes. 2010;1:279–84.
11. Sullivan A, Edlund C, Nord CE. Effect of antimicrobial agents on the ecological balance of human microflora. Lancet Infect Dis. 2001;1:101–14.
12. Lagier J-C, Million M, Hugon P, Armougom F, Raoult D. Human gut microbiota: repertoire and variations. Front Cell Infect Microbiol. 2012;2:136.
13. Iapichino G, Callegari ML, Marzorati S, Cigada M, Corbella D, Ferrari S, et al. Impact of antibiotics on the gut microbiota of critically ill patients. J Med Microbiol. 2008;57(Pt 8):1007–14.


14. Dubourg G, Lagier J-C, Armougom F, Robert C, Audoly G, Papazian L, et al. High-level colonisation of the human gut by Verrucomicrobia following broad-spectrum antibiotic treatment. Int J Antimicrob Agents. 2013;41:149–55.
15. Derrien M, Vaughan EE, Plugge CM, de Vos WM. Akkermansia muciniphila gen. nov., sp. nov., a human intestinal mucin-degrading bacterium. Int J Syst Evol Microbiol. 2004;54(Pt 5):1469–76.
16. Everard A, Belzer C, Geurts L, Ouwerkerk JP, Druart C, Bindels LB, et al. Cross-talk between Akkermansia muciniphila and intestinal epithelium controls diet-induced obesity. Proc Natl Acad Sci U S A. 2013;110:9066–71.
17. Andersson P, Klein M, Lilliebridge RA, Giffard PM. Sequences of multiple bacterial genomes and a Chlamydia trachomatis genotype from direct sequencing of DNA derived from a vaginal swab diagnostic specimen. Clin Microbiol Infect Off Publ Eur Soc Clin Microbiol Infect Dis. 2013;19:E405–408.
18. Hongoh Y, Sharma VK, Prakash T, Noda S, Taylor TD, Kudo T, et al. Complete genome of the uncultured Termite Group 1 bacteria in a single host protist cell. Proc Natl Acad Sci U S A. 2008;105:5555–60.
19. Chitsaz H, Yee-Greenbaum JL, Tesler G, Lombardo M-J, Dupont CL, Badger JH, et al. Efficient de novo assembly of single-cell bacterial genomes from short-read data sets. Nat Biotechnol. 2011;29:915–21.
20. Lagier J-C, Armougom F, Million M, Hugon P, Pagnier I, Robert C, et al. Microbial culturomics: paradigm shift in the human gut microbiome study. Clin Microbiol Infect Off Publ Eur Soc Clin Microbiol Infect Dis. 2012;18:1185–93.
21. Zoetendal EG, Booijink CCGM, Klaassens ES, Heilig HGHJ, Kleerebezem M, Smidt H, et al. Isolation of RNA from bacterial samples of the human gastrointestinal tract. Nat Protoc. 2006;1:954–9.
22. Schmieder R, Edwards R. Fast identification and removal of sequence contamination from genomic and metagenomic datasets. PloS One. 2011;6:e17288.
23. Chaisson MJ, Pevzner PA. Short read fragment assembly of bacterial genomes. Genome Res. 2008;18:324–30.
24. Chevreux B, Wetter T, Suhai S. Genome Sequence Assembly Using Trace Signals and Additional Sequence Information. 1999.
25. Lin S-H, Liao Y-C. CISA: contig integrator for sequence assembly of bacterial genomes. PloS One. 2013;8:e60843.
26. Gupta SK, Padmanabhan BR, Diene SM, Lopez-Rojas R, Kempf M, Landraud L, et al. ARG-ANNOT, a new bioinformatic tool to discover antibiotic resistance genes in bacterial genomes. Antimicrob Agents Chemother. 2014;58:212–20.
27. Aziz RK, Bartels D, Best AA, DeJongh M, Disz T, Edwards RA, et al. The RAST Server: rapid annotations using subsystems technology. BMC Genomics. 2008;9:75.
28. Darling AE, Mau B, Perna NT. Progressive Mauve: multiple genome alignment with gene gain, loss and rearrangement. PloS One. 2010;5:e11147.
29. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. Basic local alignment search tool. J Mol Biol. 1990;215:403–10.
30. Carver TJ, Rutherford KM, Berriman M, Rajandream M-A, Barrell BG, Parkhill J. ACT: the Artemis Comparison Tool. Bioinforma Oxf Engl. 2005;21:3422–3.
31. Diene SM, Merhej V, Henry M, Filali AE, Roux V, Robert C, et al. The Rhizome of the Multidrug-Resistant Enterobacter aerogenes Genome Reveals How New “Killer Bugs” Are Created because of a Sympatric Lifestyle. Mol Biol Evol. 2012;30:369–83.
32. Tatusov RL, Natale DA, Garkavtsev IV, Tatusova TA, Shankavaram UT, Rao BS, et al. The COG database: new developments in phylogenetic classification of proteins from complete genomes. Nucleic Acids Res. 2001;29:22–8.
33. Wu S, Zhu Z, Fu L, Niu B, Li W. WebMGA: a customizable web server for fast metagenomic sequence analysis. BMC Genomics. 2011;12:444.
34. Schloss PD, Westcott SL, Ryabin T, Hall JR, Hartmann M, Hollister EB, et al. Introducing mothur: open-source, platform-independent, community-supported software for describing and comparing microbial communities. Appl Environ Microbiol. 2009;75:7537–41.
35. Lagier J-C, Gimenez G, Robert C, Raoult D, Fournier P-E. Non-contiguous finished genome sequence and description of Herbaspirillum massiliense sp. nov. Stand Genomic Sci. 2012;7:200–9.
36. Mishra AK, Lagier J-C, Rivet R, Raoult D, Fournier P-E. Non-contiguous finished genome sequence and description of Paenibacillus senegalensis sp. nov. Stand Genomic Sci. 2012;7:70–81.
37. Mishra AK, Lagier J-C, Robert C, Raoult D, Fournier P-E. Non contiguous-finished genome sequence and description of Peptoniphilus timonensis sp. nov. Stand Genomic Sci. 2012;7:1–11.
38. Kempf M, Bakour S, Flaudrops C, Berrazeg M, Brunel J-M, Drissi M, et al. Rapid detection of carbapenem resistance in Acinetobacter baumannii using matrix-assisted laser desorption ionization-time of flight mass spectrometry. PloS One. 2012;7:e31676.
39. Kanehisa M, Goto S. KEGG: kyoto encyclopedia of genes and genomes. Nucleic Acids Res. 2000;28:27–30.

65

PART II

Study of the genome of Microvirga massiliensis

Foreword

Culturomics is an approach to studying human microbiota that combines a large number of culture conditions with rapid identification by MALDI-TOF and by amplification and sequencing of the 16S rRNA gene [3]. In 2012, this method led to the detection of a new bacterial species with the largest genome of any bacterium isolated from humans, Microvirga massiliensis (9.3 Mb). The aim of this work is to assemble and analyze this genome in order to understand why this bacterium has such a large genome. M. massiliensis is a gram-negative, aerobic, catalase-positive, and oxidase-negative bacterium. The genome was assembled from data produced on Roche 454 (six paired-end runs and two shotgun runs) and Illumina MiSeq sequencers, yielding a genome reconstructed in 50 scaffolds. The annotation step predicted the number of genes (ORFs), their COG clusters, and the numbers of 16S rRNA genes, NRPS/PKS genes, and ORFans (ORFs with no match to any known gene).

A genomic comparison with four genomes from other Microvirga species was also carried out and showed that this bacterium is indeed a new species. A plasmid discovered in silico was confirmed by pulsed-field gel electrophoresis (PFGE). Several specific features account for the large size of this genome. The proportion of ORFans is high, at 17% of all genes (1500 ORFans). The genome contains 77 RNA genes, including 21 rRNA genes, seven of which are 16S rRNA genes. The number of 16S genes appears to be linked to genome size [8].

The genome also contains a large number of transposases, which create repeated elements. Comparing the mean coverage of the genome with the mean coverage of the scaffolds that contain transposases suggests that their true number is five to seven times higher than predicted. In addition, we found twelve genes longer than 5 kb, one of which was longer than 14 kb and had the predicted function of a rearrangement hotspot (RHS) repeat protein. RHS proteins contain extended repeats that are involved in recombination [9].

This work was published in the journal MicrobiologyOpen.
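The coverage comparison mentioned above can be made concrete with a small calculation. The sketch below is only an illustration, not the authors' pipeline: the function name `copy_multiplier` is hypothetical, and the depths used (roughly 350× and 250× for scaffolds 43 and 48, against about 50× genome-wide) are the approximate values reported in the article that follows.

```python
# Minimal sketch: estimate how many times a repeated region is collapsed in an
# assembly by comparing its read depth with the genome-wide average depth.
# `copy_multiplier` is a hypothetical helper name used for illustration only.

def copy_multiplier(scaffold_coverage: float, genome_coverage: float) -> float:
    """Return the ratio of a scaffold's mean depth to the genome-wide mean depth."""
    if genome_coverage <= 0:
        raise ValueError("genome_coverage must be positive")
    return scaffold_coverage / genome_coverage

# Approximate depths reported in the article (scaffolds 43 and 48 vs. whole genome).
genome_depth = 50.0
scaffold_depths = {"scaffold_43": 350.0, "scaffold_48": 250.0}

multipliers = {
    name: copy_multiplier(depth, genome_depth)
    for name, depth in scaffold_depths.items()
}

for name, ratio in multipliers.items():
    print(f"{name}: about {ratio:.0f} times the average depth")
```

Under these assumptions the two ratios bracket the "five to seven times" estimate quoted in the text.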


ARTICLE 2

Microvirga massiliensis sp. nov., the human commensal with the largest genome

Aurélia Caputo, Jean-Christophe Lagier, Saïd Azza, Catherine Robert, Donia Mouelhi, Pierre-Edouard Fournier and Didier Raoult

ORIGINAL RESEARCH

Microvirga massiliensis sp. nov., the human commensal with the largest genome

Aurélia Caputo1, Jean-Christophe Lagier1, Saïd Azza1, Catherine Robert1, Donia Mouelhi1, Pierre-Edouard Fournier1 & Didier Raoult1,2

1Unité de Recherche sur les Maladies Infectieuses et Tropicales Émergentes, CNRS, UMR 7278 – IRD 198, Faculté de médecine, Aix-Marseille Université, 27 Boulevard Jean Moulin, 13385 Marseille Cedex 05, France
2Special Infectious Agents Unit, King Fahd Medical Research Center, King Abdulaziz University, Jeddah, Saudi Arabia

Keywords
Culturomics, large genome, Microvirga massiliensis, taxonogenomics

Correspondence
Didier Raoult, Unité de Recherche sur les Maladies Infectieuses et Tropicales Émergentes, CNRS, UMR 7278 – IRD 198, Faculté de médecine, Aix-Marseille Université, 27 Boulevard Jean Moulin, 13385 Marseille Cedex 05, France.
Tel: +33 491 385 517; Fax: +33 491 387 772; E-mail: [email protected]

Abstract
Microvirga massiliensis sp. nov. strain JC119T is a bacterium isolated in Marseille from a stool sample collected in Senegal. The 16S rRNA (JF824802) of M. massiliensis JC119T revealed 95% sequence identity with Microvirga lotononidis WSM3557T (HM362432). This bacterium is aerobic, gram negative, catalase positive, and oxidase negative. The draft genome of M. massiliensis JC119T comprises a 9,207,211-bp-long genome that is the largest bacterial genome of an isolate in humans. The genome exhibits a G+C content of 63.28% and contains 8685 protein-coding genes and 77 RNA genes, including 21 rRNA genes. Here, we describe the features of M. massiliensis JC119T, together with the genome sequence information and its annotation.

Received: 23 July 2015; Revised: 12 November 2015; Accepted: 19 November 2015

doi: 10.1002/mbo3.329

Introduction

Culturomics was developed in 2012 in order to extend the human gut repertoire. In the first study, by multiplying the number of culture conditions (212 different culture conditions) with rapid identification of the colonies using MALDI-TOF, we were able to identify 340 different bacterial species including 31 new bacterial species (Lagier et al. 2012a,b,c,d). In addition, we demonstrated a complementarity with metagenomics (the usual gold standard for studying complex ecosystems) because only 15% of the species were concomitantly detected by culture and metagenomics applied to the same samples. We applied this strategy to diverse stool samples from healthy individuals from diverse geographic origins and from patients with diverse diseases, dramatically extending our knowledge of the cultured human gut repertoire (Lagier et al. 2015).

The current bacterial taxonomy to define species traditionally combines phenotypic and genotypic characteristics (Stackebrandt and Ebers 2006; Tindall et al. 2010) such as the phylogenetic marker 16S rRNA sequence, DNA–DNA hybridization, and DNA G+C content (Rosselló-Mora 2006). These latter two tools have been considered a “gold standard” but are expensive and have poor reproducibility (Wayne et al. 1987). In recent years, the number of sequenced bacterial genomes has grown rapidly thanks to high-throughput sequencing and MALDI-TOF-mass spectrometry (MALDI-TOF-MS) analysis. These have allowed the description of new bacterial genera and species. Recently, by adapting MALDI-TOF to the routine identification of bacterial species, the identification of clinical isolates was made possible for the first time (La Scola and Raoult 2009; Seng et al. 2009, 2013). These methods provide a wealth of proteomic and genetic information (Tindall et al. 2010; Welker and Moore 2011).

© 2016 The Authors. MicrobiologyOpen published by John Wiley & Sons Ltd. This is an open access article under the terms of the Creative Commons Attribution License, which permits use, distribution and reproduction in any medium, provided the original work is properly cited.

The Largest Human Bacterial Genome A. Caputo et al.

Given that the current taxonomic rules do not integrate two of the recent revolutions in clinical microbiology, we proposed a new concept named taxonogenomics that includes them. As genome sequencing provides access to the full genomic information of a strain at a reduced cost, we proposed, in addition to the classic phenotypic description, to systematically add the whole-genome sequence and a genome comparison with the closest species to describe a new isolate. Moreover, as MALDI-TOF became the reference identification method in most clinical microbiology laboratories (Clark et al. 2013), we proposed systematically adding both MALDI-TOF spectra and a spectra comparison with the closest species to describe a new taxon. Using taxonogenomics, nine new bacteria have officially been considered as new genera and/or species in validation lists no. 153 and no. 155 (Kokcha et al. 2012; Lagier et al. 2012a,b,c,d, 2013a,b; Ramasamy et al. 2012; Roux et al. 2012; Hugon et al. 2013; Oren and Garrity 2014).

With both the description of the complete genomic sequencing and annotation, we present a summary classification and a set of features for M. massiliensis sp. nov. strain JC119T (=CSUR P153 = DSM 26813) that was isolated from an African stool sample from Senegal (Table 1).

Table 1. Classification, general features, and project information of Microvirga massiliensis JC119T according to the MIGS recommendations (Woese et al. 1990; Field et al. 2008).

MIGS ID | Property | Term | Evidence code1
 | Current classification | Domain: Bacteria | TAS (Woese et al. 1990)
 | | Phylum: Proteobacteria | TAS (Garrity et al. 2005a)
 | | Class: Alphaproteobacteria | TAS (Oren and Garrity 2014; Garrity et al. 2005b)
 | | Order: Rhizobiales | TAS (Kuykendall 2005; Oren and Garrity 2014)
 | | Family: Methylobacteriaceae | TAS (Kanso and Patel 2003; Zhang et al. 2009; Weon et al. 2010; Ardley et al. 2012; Bailey et al. 2014; Radl et al. 2014; Reeve et al. 2014)
 | | Genus: Microvirga | TAS
 | | Species: M. massiliensis | IDA
 | | Type strain JC119T | IDA
 | Gram stain | Negative | IDA
 | Cell shape | Rod | IDA
 | Motility | Nonmotile | IDA
 | Sporulation | Nonsporulating | IDA
 | Temperature range | Mesophile | IDA
 | Optimum temperature | 37°C | IDA
MIGS-6.3 | Salinity | Unknown |
MIGS-22 | Oxygen requirement | Aerobic | IDA
 | Carbon source | Unknown | IDA
 | Energy source | Unknown | IDA
MIGS-6 | Habitat | Human gut | IDA
MIGS-15 | Biotic relationship | Free living | IDA
MIGS-14 | Pathogenicity | Unknown | NAS
 | Biosafety level | 2 |
 | Isolation | Human feces |
MIGS-4 | Geographic location | Dielmo, Senegal | IDA
MIGS-5 | Sample collection time | September 2010 | IDA
MIGS-4.1 | Latitude | 13.71667 | IDA
MIGS-4.1 | Longitude | −16.41667 | IDA
MIGS-4.3 | Depth | Surface | IDA
MIGS-4.4 | Altitude | 34 m above sea level | IDA
MIGS-31 | Finishing quality | Noncontiguous finished |
MIGS-28 | Libraries used | Six paired-end and two shotgun 454; mate pair MiSeq |
MIGS-29 | Sequencing platform | 454 GS FLX Titanium and Illumina |
MIGS-31.2 | Fold coverage | 77× |
MIGS-30 | Assemblers | Newbler 2.8 |
MIGS-32 | Gene calling method | Prodigal |
 | GenBank Date of Release | November, 2014 |
 | NCBI project ID | PRJEB8433 |
 | EMBL accession | CDSD00000000 |

1Evidence codes – IDA, inferred from direct assay; TAS, traceable author statement (i.e., a direct report exists in the literature); NAS, nontraceable author statement (i.e., not directly observed for the living, isolated sample, but based on a generally accepted property for the species, or anecdotal evidence). These evidence codes are from the Gene Ontology project (Ashburner et al. 2000).

Materials and Methods

Bacterial culture

Sample

In 2012, a stool sample was collected from a 16-year-old Senegalese man living in Dielmo, a rural area in Sine Saloum, Senegal. Signed informed consent was obtained from the patient. The study and the assent procedure were approved by the Ethics Committees of the Institut Fédératif de Recherche 48, Faculty of Medicine, Marseille, France, under agreement number 09-022. After receipt, the stool sample was frozen at −80°C.

Phenotypic characterization

Strain isolation

The stool sample was cultivated on MOD2 medium in aerobic conditions at 37°C for 48 h. The culture conditions included a modified 7h10 medium supplemented with sheep blood as previously described (Ghodbane et al. 2014).

Strain identification

All the colonies were identified using MALDI-TOF MS, as described below. Colonies that exhibited a MALDI-TOF-MS score > 1.9 were considered correctly identified, and colonies that exhibited a MALDI-TOF-MS score < 1.9 with spectra in the database were further characterized using 16S rRNA sequencing as previously described (Seng et al. 2010). If the 16S rRNA sequence similarity was lower than 98.7%, we considered the isolate a new species without performing DNA–DNA hybridization, as suggested by Stackebrandt and Ebers (2006).

Phenotypic properties

The main phenotypic characteristics such as Gram staining, motility, sporulation, catalase, and oxidase tests were performed as previously described (Schaeffer et al. 1965; Adler and Margaret 1967; Humble et al. 1977; Gregersen 1978). The chemical characteristics of the strain JC119T were tested using API® 20A, API® ZYM, and API® 50 CH strips (bioMérieux, Marcy-l’Etoile, France). Growth was tested at 25°C, 30°C, 37°C, 45°C, and 55°C. Strain growth was tested under aerobic conditions, with or without 5% CO2, and under anaerobic and microaerophilic conditions using GENbag anaer and GENbag microaer systems, respectively (bioMérieux). For electron microscopy, we observed colonies using a Morgagni 268D (FEI, Limeil-Brevannes, France) at an operating voltage of 60 kV. Antibiotic disks of vancomycin (30 μg), rifampicin (30 μg), doxycycline (30 μg), erythromycin (15 μg), amoxicillin (25 μg), nitrofurantoin (300 μg), gentamicin (15 and 500 μg), ciprofloxacin (5 μg), amoxicillin (20 μg) with clavulanic acid (10 μg), penicillin G (10 μg), trimethoprim (1.25 μg) with sulfamethoxazole (23.75 μg), oxacillin (5 μg), imipenem (10 μg), tobramycin (10 μg), metronidazole (4 μg), and amikacin (30 μg) were purchased from bioMérieux. In vitro susceptibility testing of these antibiotics was performed by the disk diffusion method following EUCAST recommendations (Matuschek et al. 2014). Results were interpreted using EUCAST 2015 clinical breakpoints (http://www.eucast.org/clinical_breakpoints/).

MALDI-TOF-MS protein analysis was carried out using a Microflex spectrometer (Bruker Daltonics, Leipzig, Germany) as previously described (Seng et al. 2009). Briefly, one isolated bacterial colony was transferred with a pipette tip from a culture agar plate and spread as a thin film on an MSP 96 MALDI-TOF target plate (Bruker). After isolation, 12 distinct deposits from 12 isolated colonies were tested for strain JC119T. We overlaid each smear with 2 μL of matrix solution (saturated solution of α-cyano-4-hydroxycinnamic acid in 2.5% trifluoroacetic acid and 50% acetonitrile) and allowed it to dry for 5 min. We recorded the spectra in the positive linear mode for the mass range of 2000–20,000 Da (parameter settings: ion source 1 (IS1), 20 kV; IS2, 18.5 kV; lens, 7 kV). Through variable laser power, a spectrum was obtained after 240 shots. Per spot, the acquisition time was between 30 and 60 s. The strain JC119T spectra were imported into the MALDI BioTyper software (version 3.0, Bruker) and analyzed by standard pattern matching (with default parameter settings) against the main spectra of 2843 bacteria. The identification method included m/z from 3000 to 15,000 Da. For every spectrum, we compared spectra in the database with a maximum of 100 peaks. From the resulting score, we could identify tested species, where a score ≥ 1.9 enabled identification at the species level with a validly published species. For strain JC119T, no significant MALDI-TOF score was obtained against the database, suggesting that this isolate was not a member of a known species. We added the spectrum from strain JC119T to our database (Fig. 1A). Finally, the gel view was used to demonstrate the spectral differences with other members of the Methylobacteriaceae family (Fig. 1B).
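The identification rule described in the methods (MALDI-TOF score first, then 16S rRNA similarity) can be sketched as a small decision function. This is an illustrative reading of the text, not the laboratory's software: the function name `classify_isolate` and the returned labels are hypothetical, the score of 1.2 in the usage line is an invented placeholder (the article reports only that no significant score was obtained), and a score of exactly 1.9 is treated as an identification, following the "≥ 1.9" wording.

```python
# Minimal sketch of the two-step identification rule described in the text.
# Names and labels are hypothetical; thresholds are the ones quoted in the article.
from typing import Optional

MALDI_SCORE_THRESHOLD = 1.9       # MALDI-TOF-MS score cutoff for species-level identification
SIMILARITY_16S_THRESHOLD = 98.7   # percent 16S rRNA similarity (Stackebrandt and Ebers 2006)


def classify_isolate(maldi_score: float, similarity_16s: Optional[float] = None) -> str:
    """Return a label for an isolate given its MALDI-TOF score and, if available,
    its best 16S rRNA sequence similarity (in percent)."""
    if maldi_score >= MALDI_SCORE_THRESHOLD:
        return "identified by MALDI-TOF"
    if similarity_16s is None:
        return "needs 16S rRNA sequencing"
    if similarity_16s < SIMILARITY_16S_THRESHOLD:
        return "putative new species"
    return "identified by 16S rRNA"


# Strain JC119T: no significant MALDI-TOF score (1.2 is a placeholder value)
# and 95% 16S rRNA identity with its closest relative.
print(classify_isolate(1.2, 95.0))
```

Keeping the two thresholds as named constants makes it easy to see that the 16S criterion is only consulted when the mass-spectrometry step fails.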


Figure 1. (A) Reference mass spectrum from Microvirga massiliensis JC119T. (B) Gel view comparing the M. massiliensis strain to other members of the Methylobacteriaceae family. The gel view displays the raw spectra of loaded spectrum files arranged in a pseudo-gel-like look. The x-axis records the m/z value. The left y-axis displays the running spectrum number originating from subsequent spectra loading. The peak intensity is expressed by a gray scale scheme code. The color bar and the right y-axis indicate the relation between the color a peak is displayed with and the peak intensity in arbitrary units. Displayed species are indicated on the left.

Genome sequencing and assembly

Genomic DNA of M. massiliensis was sequenced on a 454 sequencer with two methods, paired-end and shotgun, and on a MiSeq sequencer (Illumina, Inc., San Diego, CA) with a mate pair strategy. A bacterial suspension (450 μL) was diluted in 2 mL TE buffer for lysis treatment: a lysozyme incubation of 30 min at 37°C followed by an overnight proteinase K incubation at 37°C. The DNA was purified by three phenol–chloroform extractions and ethanolic precipitation at −20°C overnight. After centrifugation, the DNA was resuspended in 180 μL TE buffer. The concentration was measured by the Quant-it Picogreen kit (Invitrogen) on the Genios_Tecan fluorometer at 55.70 ng/μL. This project was sequenced through two NGS technologies, 454-Roche GSFLX Titanium GS and Illumina MiSeq. A shotgun library and a 3-kb paired-end library were pyrosequenced on the 454_Roche_Titanium. This project was loaded twice on a 1/4 region for the shotgun library and, for the paired-end library, once on a full PicoTiterPlate (PTP) and four times on a 1/4 region of PTPs. The libraries were constructed according to the 454_Titanium and manufacturer protocols. The shotgun library was constructed with 500 ng of DNA as described by the manufacturer Roche with the GS Rapid library Prep kit, with a final concentration at 1.35e09 measured by the Quant-it Ribogreen kit (Invitrogen) on the Genios_Tecan fluorometer. The paired-end library was constructed from 5 μg of DNA. The DNA was mechanically fragmented on the Hydroshear device (Digilab, Holliston, MA) with an enrichment size at 3–4 kb. The DNA fragments were visualized through the Agilent 2100 BioAnalyzer (Agilent Technologies Inc., Santa Clara, CA) on a DNA labchip 7500 with an optimal size of 2.950 kb. Circularization and nebulization were performed and generated a pattern with an optimum at 371 bp. After PCR amplification through 14 cycles followed by double size selection, the single-stranded paired-end libraries were then quantified with the Quant-it Ribogreen kit (Invitrogen) on the Genios_Tecan fluorometer at 140 pg/μL. The library concentration equivalence was calculated at 6.92e08. The two libraries were stored at −20°C until use. The shotgun library was clonally amplified with 3 cpb in 3 SV emPCR reactions, and the 3-kb paired-end library was amplified with lower cpb (0.5 cpb) in 4 SV emPCR reactions and by a bigger preparation of enriched beads at 0.25, 0.5, and 0.75 cpb with the GS Titanium MV emPCR Kit (Lib-L) v2. The emPCR yield was 16.83% for the shotgun library; for the two clonal amplifications of the 3-kb paired-end library, it was 9.65% in the SV reactions and 6.79%, 14.31%, and 14.56%, respectively, in the MV reactions. These yields fell within the range of 5% to 20% expected from the Roche procedure. The two libraries were loaded on the GS Titanium PicoTiterPlates PTP Kit 70 × 75 and sequenced with the GS Titanium Sequencing Kit XLR70. The runs were performed overnight and then analyzed on the cluster through the gsRunBrowser and gsAssembler_Roche; reads passing the filters yielded 904 Mb, with an average read length of 299 bp.

A second DNA extraction was performed. Three petri dishes were spread and resuspended in 6 × 100 μL of G2 buffer. First, mechanical lysis was performed by glass powder on the Fastprep-24 device (sample preparation system) from MP Biomedicals for 2 × 20 s. DNA was then incubated for a lysozyme treatment (30 min at 37°C) and extracted through the BioRobot EZ 1 Advanced XL (Qiagen, Hilden, Germany). The gDNA was quantified by a Qubit assay with the high sensitivity kit (Life Technologies, Carlsbad, CA) at 118 ng/μL.

Genomic DNA of M. massiliensis was sequenced on the MiSeq Technology (Illumina) with the mate pair strategy. The gDNA was barcoded in order to be mixed with 11 other projects with the Nextera mate pair sample prep kit (Illumina). The mate pair library was prepared with 1 μg of genomic DNA using the Nextera mate pair Illumina guide. The genomic DNA sample was simultaneously fragmented and tagged with a mate pair junction adapter. The pattern of the fragmentation was validated on an Agilent 2100 BioAnalyzer (Agilent Technologies Inc.) with a DNA 7500 labchip. The DNA fragments ranged in size from 1 kb up to 10 kb with an optimal size at 4.81 kb. No size selection was performed and 670 ng of tagmented fragments was circularized. The circularized DNA was mechanically sheared to small fragments with an optimum at 667 bp on the Covaris device S2 in microtubes (Covaris, Woburn, MA). The library profile was visualized on a High Sensitivity Bioanalyzer LabChip (Agilent Technologies Inc.). The libraries were normalized at 2 nM and pooled. After a denaturation step and dilution at 12 pM, the pool of libraries was loaded onto the reagent cartridge and then onto the instrument along with the flow cell. Automated cluster generation and the sequencing run were performed in a single 39 h run in a 2 × 251 bp format. Total information of 8.3 Gb was obtained from a 947 K/mm2 cluster density, with 99% of clusters passing quality control filters (18,112,000 clusters). Within this run, the index representation for Microvirga was determined to be 8.57%. The 1,398,916 paired reads were filtered according to the read qualities.

The total of 2,915,781 reads produced by six 454 paired-end runs and two 454 shotgun runs of M. massiliensis JC119T were assembled with Newbler version 2.8, which generated a genome size of 9.3 Mb with an average of 77× coverage of the genome. To increase the quality of the genome, we mapped reads from the MiSeq sequencer against this genome using CLC workbench software (CLC bio, Aarhus, Denmark).

Genome annotation and comparison

Open reading frames (ORFs) were predicted using Prodigal (Hyatt et al. 2010) with default parameters. However, the predicted ORFs were excluded if they spanned a sequencing gap region (containing N). The predicted bacterial protein sequences were searched against the Clusters of Orthologous Groups (COG) databases (Tatusov et al. 2001) using BLASTP (Altschul et al. 1990) (E-value 1e−03, coverage 70%, and identity percent 30%). If no hit was found, the search was performed against the NR database using BLASTP with the same parameters as before. If the sequence lengths were smaller than 80 amino acids, we used an E-value of 1e−05. The tRNA genes were found by the tRNAScanSE tool (Lowe and Eddy 1997), whereas ribosomal RNAs (rRNAs) were found by using RNAmmer (Lagesen et al. 2007). Lipoprotein signal peptides and the number of transmembrane helices were predicted using Phobius (Käll et al. 2004). ORFans were identified if none of the BLASTP searches gave a positive result (E-value smaller than 1e−03 for ORFs with a sequence size higher than 80 aa, or E-value smaller than 1e−05 for ORFs with sequence lengths smaller than 80 aa). Similar parameter thresholds have already been used in previous work to define ORFans (Yin and Fischer 2006, 2008).

Based on the 16S rDNA of M. massiliensis (JF824802) and closely related species, sequences were aligned using CLUSTALW, and a phylogenetic tree was obtained using the maximum-likelihood method within the MEGA 6.06 software (Tamura et al. 2013). Numbers at the nodes are bootstrap values obtained by repeating the analysis 500 times to generate a majority consensus tree. We performed a BLASTP search (Altschul et al. 1990) against the database of microbial polyketide and nonribosomal peptide gene clusters (Conway and Boddy 2013) to predict the number of polyketide synthase (PKS)- and nonribosomal peptide synthetase (NRPS)-encoding genes, and we kept only the best hit for each protein.

For the genomic comparison, we used the M. massiliensis strain JC119T (CDSD00000000), Microvirga aerilata strain BSC39 (JPUG00000000), Microvirga flocculans strain ATCC BAA-817 (JAEA00000000), M. lotononidis strain WSM3557 (AJUA00000000), and Microvirga lupini strain Lut6 (AZYE00000000) genomes. Orthologous proteins were identified using the Proteinortho software, version 1.4 (Lechner et al. 2011), using a 30% protein identity and 1e−05 E-value. The average genomic identity of orthologous gene sequences (AGIOS) between compared genomes was determined using the Needleman–Wunsch global alignment algorithm. Artemis (Carver et al. 2005) was used for data management, and DNA Plotter (Carver et al. 2009) was used for the visualization of genomic features. The Mauve alignment tool was used for multiple genomic sequence alignment and visualization (Darling et al. 2010). In silico DNA–DNA hybridization (DDH) (Richter and Rosselló-Móra 2009) was performed with the genomes previously cited. The M. massiliensis genome was locally aligned 2-by-2 using the BLAT algorithm (Kent 2002; Auch et al. 2010) against each of the selected genomes, and DDH values were estimated from a generalized linear model (Meier-Kolthoff et al. 2013).

Pulsed field gel electrophoresis (PFGE)

Plug preparation and treatment and pulsed field gel electrophoresis (PFGE) of bacterial DNA were performed as previously described by Raoult et al. (2004). SpeI restriction enzymes (Life Technologies) were used to migrate genomic DNA.

Each agarose block and molecular weight markers (Low Range PFG Marker; New England Biolabs, Ipswich, Massachusetts, USA) were placed in the well of a 1% PFGE agarose gel (Sigma, St. Louis, Missouri, USA) in 0.5× TBE.

The pulsed field gel separation (Fig. S1) was made on a CHEF–DR II apparatus (Bio-Rad Laboratories Inc., Hercules, California, USA) with pulses ranging from 5 to 50 s at a voltage of 5 V/cm and a switch angle of 120° for 20 h at 14°C. Gels were either stained with ethidium bromide and analyzed using a Gel-Doc 2000 system (Bio-Rad Laboratories Inc.) or used to prepare Southern blots.

DNA labeling and hybridization

Resolved uncut and digested genomic DNA processed by PFGE were treated and transferred onto Hybond N+ (GE Healthcare, Little Chalfont, UK) with the vacuum blotter (model 785, Bio-Rad Laboratories Inc.) and UV-cross-linked for 2 min. The blots were then hybridized against the DIG-labeled probes (scaffolds 24 and 35) as recommended by the manufacturer DIG-System (Roche Diagnostics, Meylan, France), except that the hybridized probe was detected using a horseradish peroxidase-conjugated monoclonal mouse anti-digoxin (Jackson Immunoresearch, West Grove, Pennsylvania, USA, 1:20,000). After washings, blots were revealed by chemiluminescence assays (ECL; GE Healthcare). The resulting signal was detected on Hyperfilm™ ECL (GE Healthcare) by using an automated film processor Hyperprocessor™ (GE Healthcare).

Results

Phenotypic characteristics

Gram staining showed that the cells were gram negative (Fig. 2A). The motility test was negative. Cells grown on agar were nonsporulated and had a mean diameter of 2.28 μm (Fig. 2B). Catalase was positive, and oxidase was negative.

Using API® 20NE (bioMérieux), positive reactions were observed for the reduction of nitrate to nitrite and for glucose fermentation. Using API® ZYM strips (bioMérieux), positive reactions were observed for esterase (C4), esterase lipase (C8), leucine arylamidase, cysteine arylamidase, trypsin, acid phosphatase, and naphthol-AS-BI-phosphohydrolase. Using an API® 50 CH (bioMérieux), positive reactions were recorded for d-trehalose hydrolysis after 48 h. These results are summarized in Tables 2–4.

Microvirga massiliensis strain JC119T was susceptible to rifampicin, doxycycline, erythromycin, amoxicillin, gentamicin, ciprofloxacin, ceftriaxone, amoxicillin and clavulanic acid, penicillin G, imipenem, tobramycin, metronidazole, and amikacin, and resistant to vancomycin and nitrofurantoin.

The phenotypic characteristics of M. massiliensis compared with those of M. aerophila, M. aerilata, Methylobacterium nodulans, and Methylobacterium populi are detailed in Table 5.
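The ORFan criterion given in the genome annotation methods above reduces to a length-dependent E-value cutoff, which can be sketched as follows. This is a minimal illustration under stated assumptions rather than the authors' annotation pipeline: the function names are hypothetical, the input is assumed to be the list of E-values of all BLASTP hits for one ORF, an ORF of exactly 80 aa is treated as a long ORF because the text does not specify that case, and the coverage (70%) and identity (30%) filters are left out for brevity.

```python
# Minimal sketch of the ORFan rule: an ORF is an ORFan when no BLASTP hit
# passes the E-value cutoff that applies to its length.
# Function names and the handling of exactly 80 aa are assumptions.
from typing import Iterable

SHORT_ORF_LENGTH_AA = 80  # ORFs shorter than this use the stricter cutoff


def evalue_cutoff(length_aa: int) -> float:
    """Return the E-value cutoff for an ORF of the given length in amino acids."""
    return 1e-5 if length_aa < SHORT_ORF_LENGTH_AA else 1e-3


def is_orfan(length_aa: int, hit_evalues: Iterable[float]) -> bool:
    """Return True when none of the hit E-values is below the applicable cutoff."""
    cutoff = evalue_cutoff(length_aa)
    return not any(evalue < cutoff for evalue in hit_evalues)


# Illustrative ORFs with invented hit lists (not data from the article).
print(is_orfan(120, []))        # no hit at all
print(is_orfan(120, [1e-4]))    # one hit passes the 1e-3 cutoff
print(is_orfan(60, [1e-4]))     # same hit fails the stricter 1e-5 cutoff
```

The stricter cutoff for short ORFs reflects the fact that short sequences reach a given E-value by chance more easily, so the same hit can count for a long ORF and not for a short one.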


Figure 2. (A) Gram staining of Microvirga massiliensis JC119T. (B) Transmission electron microscopy of M. massiliensis JC119T, taken using a Morgagni 268D (Philips Amsterdam, Netherlands) at an operating voltage of 60 kV. The scale represents 500 nm.

Table 2. Microvirga massiliensis sp. nov. reactions results with API® 20NE.

Active components | Result
Potassium nitrate | +
l-Tryptophan | −
d-Glucose | +
l-Arginine | −
Urea | −
Esculine | −
Gelatin | −
4-Nitrophenyl α-d-galactopyranoside | −
d-Glucose | −
l-Arabinose | −
d-Mannose | −
d-Mannitol | −
N-Acetyl-glucosamine | −
d-Maltose | −
Potassium gluconate | −
Capric acid | −
Adipic acid | −
Malic acid | −
Trisodium citrate | −
Phenylacetic acid | −
Oxidase | −

Table 3. Microvirga massiliensis sp. nov. reactions results with API® ZYM.

Enzyme assayed for | Result
Alkaline phosphatase | −
Esterase (C4) | +
Esterase lipase (C8) | +
Lipase (C14) | −
Leucine arylamidase | +
Valine arylamidase | −
Cysteine arylamidase | +
Trypsin | +
α-Chymotrypsin | −
Acid phosphatase | +
Naphthol-AS-BI-phosphohydrolase | +
α-Galactosidase | −
β-Galactosidase | −
β-Glucuronidase | −
α-Glucosidase | −
β-Glucosidase | −
N-Acetyl-β-glucosaminidase | −
α-Mannosidase | −
α-Fucosidase | −


Table 4. Microvirga massiliensis sp. nov. reactions results with API® 50CH.

Active components | Result
Glycerol | −
Erythritol | −
d-Arabinose | −
l-Arabinose | −
d-Ribose | −
d-Xylose | −
l-Xylose | −
d-Adonitol | −
Methyl β-d-xylopyranoside | −
d-Galactose | −
d-Glucose | −
d-Fructose | −
d-Mannose | −
l-Sorbose | −
l-Rhamnose | −
Dulcitol | −
Inositol | −
d-Mannitol | −
d-Sorbitol | −
Methyl α-d-mannopyranoside | −
Methyl α-d-glucopyranoside | −
N-Acetylglucosamine | −
Amygdaline | −
Arbutine | −
Esculine | −
Salicine | −
d-Cellobiose | −
d-Maltose | −
d-Lactose | −
d-Melibiose | −
d-Saccharose | −
d-Trehalose | +
Inuline | −
d-Melezitose | −
d-Raffinose | −
Amidon | −
Glycogene | −
Xylitol | −
Gentiobiose | −
d-Turanose | −
d-Lyxose | −
d-Tagatose | −
d-Fucose | −
d-Arabitol | −
l-Arabitol | −
Potassium gluconate | −
Potassium 2-ketogluconate | −
Potassium 5-ketogluconate | −

Genomic characteristics

Based on 16S rDNA, the M. massiliensis strain JC119T (accession number JF824802) exhibited 95% sequence identity with M. lotononidis WSM3557T (accession number HM362432) (Ardley et al. 2012). The closest strains to M. massiliensis, ranked by sequence identity percentage, are summarized in Table S1. The phylogenetic tree highlights the position of M. massiliensis JC119T (Fig. 3) in relation to other type strains within the Microvirga genus and closely related species. The genome of M. massiliensis JC119T is 9,207,211 bp long with 63.28% GC content (Fig. 4). It is composed of 50 scaffolds (accession numbers LN811350–LN811399), themselves composed of 365 contigs, with one plasmid. Table 1 shows the project information and its association with MIGS version 2.0 compliance (Field et al. 2008). Of the 8762 predicted genes, 8685 were protein-coding genes, 56 were tRNA genes, and 21 were rRNA genes. A total of 5323 genes (61.29%) were assigned a putative function (by COGs or by NR blast); 1500 genes were identified as ORFans (17.27%). The remaining genes were annotated as hypothetical proteins. We predicted the number of rRNA genes by mapping: we mapped all reads against the rRNA operon and all reads against the 50 scaffolds. The average coverage of the rRNA operon is seven times higher than the average coverage of the genome, so this genome has seven 16S rRNA genes, seven 23S rRNA genes, and seven 5S rRNA genes. The circular plasmid sequence is complete, is 73,638 bp long, and was detected by PFGE.

We predicted 44 PKS/NRPS genes, summarized in Table 6. The distribution of genes into COG functional categories is presented in Table 7. We found that the 263 genes belonging to the L category mainly encode transposases; this number is larger than in M. flocculans (140 genes) but smaller than in M. aerilata, M. lotononidis, and M. lupini (288, 419, and 583 genes, respectively). The increase in transposable elements is related to the size of the bacterial genome (Touchon and Rocha 2007; Chénais et al. 2012; Iranzo et al. 2014). Additionally, two scaffolds (scaffolds 43 and 48) were expected to contain a large number of transposable elements (Pray 2008) because their coverage reached almost 350× and 250×, respectively, compared with an average coverage of 50× for the entire genome, which indicates that the number of transposases is underestimated. We estimate that the true number of transposable elements is five to seven times higher. Moreover, we found twelve genes larger than 5000 nucleotides, one of which was larger than 14 kb and corresponded to a predicted function of a rearrangement hotspot (RHS) repeat protein. RHS proteins contain extended repeats, which are involved in recombination (Koskiniemi et al. 2013).

We compared the genome of M. massiliensis strain JC119T to M. aerilata strain BSC39, M. flocculans strain ATCC BAA-817, M. lotononidis strain WSM3557T, and M. lupini strain Lut6T (Table 8). The draft genome

8 © 2016 The Authors. MicrobiologyOpen published by John Wiley & Sons Ltd.

78 A. Caputo et al. The Largest Human Bacterial Genome

Table 5. Differential characteristics of Microvirga massiliensis sp. nov., strain JC119T, Microvirga aerophila strain 5420S-­12T (Weon et al. 2010), Microvirga aerilata strain 5420S-­16T (Weon et al. 2010), Methylobacterium nodulans strain ORS 2060T (Jourand et al. 2004), and Methylobacterium populi strain BJ001T (Aken et al. 2004).

Properties M. massiliensis M. aerophila M. aerilata M. nodulans

Cell diameter (μm) 2.28 0.8–1.1 1.2–1.5 0.5–1 Oxygen requirement Aerobic Aerobic Aerobic Aerobic Gram stain Negative Negative Negative Negative Motility − − − ± Endospore formation − − − − Production of Alkaline phosphatase − − + NA Acid phosphatase + + + NA Catalase + NA NA NA Oxidase − NA NA NA Nitrate reductase + NA NA + Urease − NA NA + α-­Galactosidase − − − NA β-­Galactosidase − − − − β-­Glucuronidase − − − NA α-­Glucosidase − − − NA β-­Glucosidase − − − − Esterase + + + NA Esterase lipase + − + NA Naphthol-­AS-­BI-­phosphohydrolase + + + NA N-­Acetyl-β­ -­glucosaminidase − − − NA α-­Mannosidase − − − NA α-­Fucosidase − − − NA Leucine arylamidase + − + NA Valine arylamidase − − − NA Cystine arylamidase + − − NA α-­Chymotrypsin − − − NA Trypsin + − + NA Acid from l-­Arabinose − NA NA + d-­Mannose − NA NA + d-­Mannitol − NA NA − d-­Trehalose + NA NA + d-­Mannose − NA NA + Habitat Human gut Air Air C. podocarpa

NA, data not available. sequence of M. lupini is larger than those of M. massil- AGIOS value ranged from 67.91% with M. lupini to 69.45% iensis, M. lotononidis, M. aerilata, and M. flocculans (9.7, with M. flocculans. The DDH was of 21.7% ±2.33 with 9.3, 7.1, 5.7, and 4.1 MB, respectively). The G+C content M. lupini, 21% ±2.33 with M. aerilata, 20.90% ±2.33 of Microvirga species ranged from 61.6 to 64.6 (Table 8). with M. flocculans, and 20.30% ±2.32 with M. lotononidis. The gene content of M. massiliensis (8685) is smaller than These data confirm M. massiliensis as a unique species. that of M. lupini (9865), but larger than those of M. ­lotononidis, M. aerilata, and M. flocculans (6991, 5571, Discussion and 3835, respectively). The distribution of genes into COG categories was similar, but not identical in all com- The Methylobacteriaceae family is comprised of four genera: pared genomes (Fig 5). In addition, M. massiliensis shared Meganema, Methylobacterium, Psychroglaciecola, and 2575, 2275, 2713, and 2729 orthologous genes with M. aer- Microvirga. Currently, the genus Microvirga contains nine ilata, M. flocculans, M. lotononidis, and M. lupini, respec- species (Kanso and Patel 2003; Zhang et al. 2009; Weon tively (Table 9). The AGIOS values ranged from 80.19 et al. 2010; Ardley et al. 2012; Radl et al. 2014). The first to 82.97 among compared Microvirga species except species of the Microvirga genus was M. subterranea strain M. massiliensis. When compared to other species, the Fail4T (Kanso and Patel 2003) and was first deposited as


Figure 3. Phylogenetic tree highlighting the position of Microvirga massiliensis JC119T relative to other type strains within the Microvirga genus and closely related species. The strains and their corresponding accession number for 16S rRNA genes are indicated in parentheses. Cupriavidus taiwanensis strain LMG 19424T was used as outgroup. The scale bar represents a 2% nucleotide sequence divergence.

Figure 4. Circular representation of the Microvirga massiliensis strain JC119T genome. From outside to the center: contigs (red/gray), COG category of genes on the forward strand (three circles), genes on forward strand (blue circle), genes on the reverse strand (red circle), COG category on the reverse strand (three circles), GC content.
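The coverage-based reasoning used under "Genomic characteristics" (rRNA operon coverage about seven times the genome average; scaffolds 43 and 48 at roughly 350× and 250× against a 50× genome average) can be illustrated with a short Python sketch. It is only a sketch under stated assumptions: the function name `estimate_copy_number` is hypothetical, the depths are the rounded values quoted in the text, and the read-mapping step that produces those depths is not shown.

```python
def estimate_copy_number(region_coverage: float, genome_coverage: float) -> int:
    """Estimate how many copies of a collapsed repeat exist in the genome.

    Assumption: reads from every copy pile up on the single assembled copy,
    so its depth is roughly (copy number) x (average genome depth).
    """
    if genome_coverage <= 0:
        raise ValueError("genome_coverage must be positive")
    # Round to the nearest whole copy; a region is present at least once.
    return max(1, round(region_coverage / genome_coverage))


# Rounded depths quoted in the text (hypothetical variable names).
GENOME_AVERAGE_DEPTH = 50.0
scaffold_depths = {"scaffold 43": 350.0, "scaffold 48": 250.0}

copy_estimates = {
    name: estimate_copy_number(depth, GENOME_AVERAGE_DEPTH)
    for name, depth in scaffold_depths.items()
}

for name, copies in copy_estimates.items():
    print(f"{name}: about {copies} collapsed copies")
```

The same ratio, applied to the rRNA operon, corresponds to the seven copies of each rRNA gene reported in the text. Real depth profiles are noisy, so an estimate of this kind is better read as an order of magnitude than as an exact count.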


Table 6. NRPS/PKS predicted for Microvirga massiliensis.

PKS/NRPS on proteome (best hit)
Streptomyces avermitilis MA-4680 (NC_003155)¹
Streptomyces sp. FR-008 (AY310323)
Mycobacterium avium 104 (CP000479)
Streptomyces griseoviridis (AB469822)¹
Sorangium cellulosum (AF210843)
Bacillus amyloliquefaciens FZB42 (AJ576102)
Micromonospora chalcea (EU443633)¹
Streptomyces coelicolor A3(2) (NC_003888)
Streptomyces venezuelae ATCC 10712 (FR845719)¹
Saccharopolyspora erythraea NRRL 2338 (NC_009142)
Acidobacteria bacterium A11 (JF342591)¹
Sorangium cellulosum (DQ897667)
Myxococcus xanthus DK 1622 (CP000113)¹
Mycobacterium avium subsp. paratuberculosis K-10 (AE016958)
Nocardia farcinica IFM 10152 (NC_006361)¹
Mycobacterium gilvum PYR-GCK (NC_009338)
Chitinophaga sancti (HQ680975)
Sorangium cellulosum (HE616533)¹
Mycobacterium tuberculosis CDC 1551 (NC_002755)
Mycobacterium sp. MCS (CP000384)
Sorangium cellulosum (AM407731)
Streptomyces sp. MP39-85 (FJ872525)
Streptomyces antibioticus (FJ545274)
Tistrella mobilis KA081020-065 (NC_017956)
Streptomyces ambofaciens ATCC 23877 (AM238664)
Mycobacterium smegmatis str. MC2 155 (CP000480)
Thermomonospora curvata DSM 43183 (NC_013510)
Microcystis aeruginosa PCC 7806 (AF183408)

¹ This best hit corresponded to several genes (from two to seven) in M. massiliensis.

Table 7. Number of genes associated with the 25 general COG functional categories.

Code  Value  % value  Description
J     195    2.25     Translation
A     0      0        RNA processing and modification
K     359    4.13     Transcription
L     263    3.03     Replication, recombination, and repair
B     9      0.1      Chromatin structure and dynamics
D     29     0.33     Cell cycle control, mitosis, and meiosis
Y     0      0        Nuclear structure
V     66     0.76     Defense mechanisms
T     283    3.26     Signal transduction mechanisms
M     263    3.03     Cell wall/membrane biogenesis
N     67     0.77     Cell motility
Z     0      0        Cytoskeleton
W     0      0        Extracellular structures
U     80     0.92     Intracellular trafficking and secretion
O     184    2.12     Post-translational modification, protein turnover, chaperones
C     347    4        Energy production and conversion
G     423    4.87     Carbohydrate transport and metabolism
E     788    9.07     Amino acid transport and metabolism
F     88     1.01     Nucleotide transport and metabolism
H     175    2.01     Coenzyme transport and metabolism
I     256    2.95     Lipid transport and metabolism
P     368    4.24     Inorganic ion transport and metabolism
Q     229    2.64     Secondary metabolites biosynthesis, transport, and catabolism
R     802    9.23     General function prediction only
S     525    6.04     Function unknown
–     3724   42.88    Not in COGs

Corbulabacter subterraneus. Species of the genus Microvirga are mostly found in soil. To the best of our knowledge, no Microvirga strain has previously been isolated from humans. M. massiliensis strain JC119T is the type strain of M. massiliensis sp. nov., a new species of the genus Microvirga. The genome of M. massiliensis JC119T is 9,207,211 bp long, representing the largest genome of a bacterium isolated from a human sample (Lagier et al. 2012a,b,c,d). It ranks 139th among the largest bacterial genomes according to the Genomes OnLine Database (GOLD) (Kyrpides 1999). All bacteria whose genomes are larger than that of M. massiliensis are found in the environment (soil, ocean, sand, and landfill). Different characteristics of this bacterium can explain such a large genome. The proportion of predicted genes with no detectable homologs (ORFans) is about 17%, which corresponds to 1500 ORFans. The number of noncoding genes is notable. This genome has 77 RNA genes: 56 tRNA and 21 rRNA. The number of RNA genes appears to be related to genome size (Klappenbach et al. 2000). For bacterial genomes, the copy number of rRNA operons varies from 1 to 15, for example, for Clostridium paradoxum (Rainey et al. 1996). There are also a large number of transposases that create repeat elements in this genome. The abundance of transposable elements correlates positively with genome size (Touchon and Rocha 2007; Chénais et al. 2012; Iranzo et al. 2014).

Culture remains a critical step of microbiology (Lagier et al. 2015). However, thanks to the multiplication of culture conditions, culturomics increased the knowledge of the human gut microbiota repertoire by 77% and thus helps to decipher the "dark matter" (Lagier et al. 2012a,b,c,d, 2015). Culturomics allows the generation of stable and robust data such as pure culture and genome sequencing, by contrast with metagenomics, which shows a significant lack of reproducibility among laboratories (Angelakis et al. 2012).

Taxonogenomics allows the consideration of proteome and genome sequences, which must be a part of the taxonomic description of bacteria. To describe a new


Table 8. Comparative genomic features of Microvirga massiliensis, Microvirga aerilata, Microvirga flocculans, Microvirga lotononidis, and Microvirga lupini.

                       M. massiliensis  M. aerilata  M. flocculans  M. lotononidis  M. lupini
Genome size (Mb)       9.3              5.7          4.1            7.1             9.7
DNA G+C content (%)    63.3             63.1         64.6           62.9            61.6
Protein-coding genes   8685             5571         3835           6991            9865
rRNA                   21               13           5              6               3
tRNA                   56               58           48             55              57

Figure 5. Distribution of functional classes of predicted genes in the genomes from Microvirga massiliensis, Microvirga aerilata, Microvirga flocculans, Microvirga lotononidis, and Microvirga lupini, according to the COG category.
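A distribution of this kind can be recomputed from raw category counts. The sketch below is illustrative only: it uses a few M. massiliensis counts from Table 7 and assumes that the percentages are expressed relative to the 8685 protein-coding genes, which is what the published values suggest (for example, 263/8685 ≈ 3.03%). The function name `cog_percentages` is hypothetical.

```python
PROTEIN_CODING_GENES = 8685  # M. massiliensis, from the text

# Subset of the Table 7 counts (COG code -> number of genes).
cog_counts = {
    "J": 195,
    "K": 359,
    "L": 263,
    "E": 788,
    "R": 802,
    "Not in COGs": 3724,
}


def cog_percentages(counts: dict[str, int], total_genes: int) -> dict[str, float]:
    """Express each category count as a percentage of all protein-coding genes.

    The denominator is the gene total, not the sum of the counts, because a
    gene can be assigned to more than one category.
    """
    if total_genes <= 0:
        raise ValueError("total_genes must be positive")
    return {code: round(100 * n / total_genes, 2) for code, n in counts.items()}


percentages = cog_percentages(cog_counts, PROTEIN_CODING_GENES)
for code, pct in percentages.items():
    print(f"{code}: {pct}%")
```

Running the same function on the category counts of each compared genome, with that genome's own gene total as denominator, would give comparable per-category profiles across species.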

Table 9. Genomic comparison of Microvirga massiliensis and four other Microvirga species.

                  M. aerilata  M. flocculans  M. lotononidis  M. lupini  M. massiliensis
M. aerilata       5571         2665           3131            3163       2575
M. flocculans     82.21        3835           2761            2691       2275
M. lotononidis    80.83        81.67          6992            3305       2713
M. lupini         80.19        80.8           82.97           9865       2729
M. massiliensis   68.69        69.45          68.74           67.91      8685

Numbers of orthologous proteins shared between genomes (upper right) and AGIOS values (lower left); the numbers on the diagonal (bold in the original) indicate the numbers of proteins per genome.

bacterial species, taxonogenomics uses both phenotypic and genotypic data (Ramasamy et al. 2012; Lagier et al. 2015). On the basis of taxonogenomics, we formally propose the creation of M. massiliensis sp. nov. containing the strain JC119T.

Description of Microvirga massiliensis sp. nov.

Microvirga massiliensis (mas.si.li.en'sis. L. masc. adj. massiliensis of Massilia, the Latin name of Marseille, France, where M. massiliensis was isolated) is a gram-negative, oxidase-negative, and catalase-positive bacterium. This bacterium is an aerobic, nonmotile, non-spore-forming rod, with individual cells exhibiting a diameter of 2.28 μm. This bacterium is positive for nitrate reductase, d-glucose, esterase, esterase lipase, leucine arylamidase, cystine arylamidase, trypsin, acid phosphatase, and naphthol-AS-BI-phosphohydrolase. This bacterium is susceptible to rifampicin, doxycycline, erythromycin, amoxicillin, gentamicin, ciprofloxacin, ceftriaxone, amoxicillin with clavulanic acid, penicillin, imipenem, tobramycin, metronidazole,


and amikacin. The temperature range for growth is 28–55°C (with an optimum at 37°C). The potential pathogenicity of the type strain JC119T (=CSUR P153 = DSM 26813) is unknown; the strain was isolated from the stool specimen of a Senegalese male, in Marseille (France), using culturomics (Lagier et al. 2012a,b,c,d; Dubourg et al. 2014). The G+C content of the genome is 63.28%. The partial 16S rRNA sequence of M. massiliensis was deposited in GenBank under the accession number JF824802. The whole-genome sequence of Microvirga massiliensis strain JC119T (=CSUR P153 = DSM 26813) has been deposited in EMBL under accession numbers CDSD01000001–CDSD01000365 for contigs and LN811350–LN811399 for scaffolds.

Acknowledgments

The authors thank the Xegen Company (www.xegen.fr) for automating the genomic annotation process.

Funding Information

No funding information provided.

Conflict of Interest

None declared.

References

Adler, J., and M. D. Margaret. 1967. A method for measuring the motility of bacteria and for comparing random and non-random motility. Microbiology 46:161–173.

Aken, B. V., C. M. Peres, S. Lafferty Doty, J. Moon Yoon, and J. L. Schnoor. 2004. Methylobacterium populi sp. nov., a novel aerobic, pink-pigmented, facultatively methylotrophic, methane-utilizing bacterium isolated from poplar trees (Populus deltoides×nigra DN34). Int. J. Syst. Evol. Microbiol. 54:1191–1196.

Altschul, S. F., W. Gish, W. Miller, E. W. Myers, and D. J. Lipman. 1990. Basic local alignment search tool. J. Mol. Biol. 215:403–410.

Angelakis, E., F. Armougom, M. Million, and D. Raoult. 2012. The relationship between gut microbiota and weight gain in humans. Future Microbiol. 7:91–109.

Ardley, J. K., M. A. Parker, S. E. De Meyer, R. D. Trengove, G. W. O'Hara, W. G. Reeve, et al. 2012. Microvirga lupini sp. nov., Microvirga lotononidis sp. nov. and Microvirga zambiensis sp. nov. are alphaproteobacterial root-nodule bacteria that specifically nodulate and fix nitrogen with geographically and taxonomically separate legume hosts. Int. J. Syst. Evol. Microbiol. 62:2579–2588.

Ashburner, M., C. A. Ball, J. A. Blake, D. Botstein, H. Butler, J. M. Cherry, et al. 2000. Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat. Genet. 25:25–29.

Auch, A. F., M. von Jan, H.-P. Klenk, and M. Göker. 2010. Digital DNA–DNA hybridization for microbial species delineation by means of genome-to-genome sequence comparison. Stand. Genomic Sci. 2:117–134.

Bailey, A. C., M. Kellom, A. T. Poret-Peterson, K. Noonan, H. E. Hartnett, and J. Raymond. 2014. Draft genome sequence of Microvirga sp. strain BSC39, isolated from biological soil crust of Moab, Utah. Genome Announc. 2:e01197-14.

Carver, T. J., K. M. Rutherford, M. Berriman, M.-A. Rajandream, B. G. Barrell, and J. Parkhill. 2005. ACT: the Artemis comparison tool. Bioinformatics (Oxford, England) 21:3422–3423.

Carver, T., N. Thomson, A. Bleasby, M. Berriman, and J. Parkhill. 2009. DNAPlotter: circular and linear interactive genome visualization. Bioinformatics (Oxford, England) 25:119–120.

Chénais, B., A. Caruso, S. Hiard, and N. Casse. 2012. The impact of transposable elements on eukaryotic genomes: from genome size increase to genetic adaptation to stressful environments. Gene 509:7–15.

Clark, A. E., E. J. Kaleta, A. Arora, and D. M. Wolk. 2013. Matrix-assisted laser desorption ionization-time of flight mass spectrometry: a fundamental shift in the routine practice of clinical microbiology. Clin. Microbiol. Rev. 26:547–603.

Conway, K. R., and C. N. Boddy. 2013. ClusterMine360: a database of microbial PKS/NRPS biosynthesis. Nucleic Acids Res. 41:D402–D407.

Darling, A. E., B. Mau, and N. T. Perna. 2010. progressiveMauve: multiple genome alignment with gene gain, loss and rearrangement. PLoS One 5:e11147.

Dubourg, G., J. C. Lagier, C. Robert, F. Armougom, P. Hugon, S. Metidji, et al. 2014. Culturomics and pyrosequencing evidence of the reduction in gut microbiota diversity in patients with broad-spectrum antibiotics. Int. J. Antimicrob. Agents 44:117–124.

Field, D., G. Garrity, T. Gray, N. Morrison, J. Selengut, P. Sterk, et al. 2008. The minimum information about a genome sequence (MIGS) specification. Nat. Biotechnol. 26:541–547.

Garrity, G. M., J. A. Bell, and T. Lilburn. 2005a. Bergey's manual of systematic bacteriology. 2nd ed. Vol. 2. Part B. Phylum XIV. Proteobacteria phyl. nov., Springer, New York, NY, USA, 2001–2012.

Garrity, G. M., J. A. Bell, and T. Lilburn. 2005b. Bergey's manual of systematic bacteriology. 2nd ed. Vol. 2. Part C. Class I. Alphaproteobacteria class. nov., Springer, New York, NY, USA, 2001–2012.

Ghodbane, R., D. Raoult, and M. Drancourt. 2014. Dramatic reduction of culture time of Mycobacterium tuberculosis. Sci. Rep. 4:4236.


Gregersen, T. 1978. Rapid method for distinction of gram-negative from gram-positive bacteria. Eur. J. Appl. Microbiol. Biotechnol. 5:123–127.

Hugon, P., A. K. Mishra, J. C. Lagier, T. T. Nguyen, C. Couderc, D. Raoult, et al. 2013. Non-contiguous finished genome sequence and description of Brevibacillus massiliensis sp. nov. Stand. Genomic Sci. 8:1–14.

Humble, M. W., A. King, and I. Phillips. 1977. API ZYM: a simple rapid system for the detection of bacterial enzymes. J. Clin. Pathol. 30:275–277.

Hyatt, D., G.-L. Chen, P. F. Locascio, M. L. Land, F. W. Larimer, and L. J. Hauser. 2010. Prodigal: prokaryotic gene recognition and translation initiation site identification. BMC Bioinformatics 11:119.

Iranzo, J., M. J. Gómez, F. J. López de Saro, and S. Manrubia. 2014. Large-scale genomic analysis suggests a neutral punctuated dynamics of transposable elements in bacterial genomes. PLoS Comput. Biol. 10:e1003680.

Jourand, P., E. Giraud, G. Béna, A. Sy, A. Willems, M. Gillis, et al. 2004. Methylobacterium nodulans sp. nov., for a group of aerobic, facultatively methylotrophic, legume root-nodule-forming and nitrogen-fixing bacteria. Int. J. Syst. Evol. Microbiol. 54:2269–2273.

Käll, L., A. Krogh, and E. L. L. Sonnhammer. 2004. A combined transmembrane topology and signal peptide prediction method. J. Mol. Biol. 338:1027–1036.

Kanso, S., and B. K. C. Patel. 2003. Microvirga subterranea gen. nov., sp. nov., a moderate thermophile from a deep subsurface Australian thermal aquifer. Int. J. Syst. Evol. Microbiol. 53:401–406.

Kent, W. J. 2002. BLAT – the BLAST-like alignment tool. Genome Res. 12:656–664. Article published online before March 2002.

Klappenbach, J. A., J. M. Dunbar, and T. M. Schmidt. 2000. rRNA operon copy number reflects ecological strategies of bacteria. Appl. Environ. Microbiol. 66:1328–1333.

Kokcha, S., D. Ramasamy, J. C. Lagier, C. Robert, D. Raoult, and P.-E. Fournier. 2012. Non-contiguous finished genome sequence and description of Brevibacterium senegalense sp. nov. Stand. Genomic Sci. 7:233–245.

Koskiniemi, S., J. G. Lamoureux, K. C. Nikolakakis, C. t'Kint de Roodenbeke, M. D. Kaplan, D. A. Low, et al. 2013. Rhs proteins from diverse bacteria mediate intercellular competition. Proc. Natl Acad. Sci. USA 110:7032–7037.

Kuykendall, L. D. 2005. Bergey's manual of systematic bacteriology. 2nd ed. Order VI. Rhizobiales ord. nov., Springer, New York, NY, USA, 2001–2012.

Kyrpides, N. C. 1999. Genomes OnLine Database (GOLD 1.0): a monitor of complete and ongoing genome projects world-wide. Bioinformatics (Oxford, England) 15:773–774.

La Scola, B., and D. Raoult. 2009. Direct identification of bacteria in positive blood culture bottles by matrix-assisted laser desorption ionisation time-of-flight mass spectrometry. PLoS One 4:e8041.

Lagesen, K., P. Hallin, E. A. Rødland, H.-H. Staerfeldt, T. Rognes, and D. W. Ussery. 2007. RNAmmer: consistent and rapid annotation of ribosomal RNA genes. Nucleic Acids Res. 35:3100–3108.

Lagier, J. C., F. Armougom, M. Million, P. Hugon, I. Pagnier, C. Robert, et al. 2012a. Microbial culturomics: paradigm shift in the human gut microbiome study. Clin. Microbiol. Infect. 18:1185–1193.

Lagier, J. C., F. Armougom, A. K. Mishra, T. T. Nguyen, D. Raoult, and P.-E. Fournier. 2012b. Non-contiguous finished genome sequence and description of Alistipes timonensis sp. nov. Stand. Genomic Sci. 6:315–324.

Lagier, J. C., K. El Karkouri, T. T. Nguyen, F. Armougom, D. Raoult, and P.-E. Fournier. 2012c. Non-contiguous finished genome sequence and description of Anaerococcus senegalensis sp. nov. Stand. Genomic Sci. 6:116–125.

Lagier, J. C., G. Gimenez, C. Robert, D. Raoult, and P.-E. Fournier. 2012d. Non-contiguous finished genome sequence and description of Herbaspirillum massiliense sp. nov. Stand. Genomic Sci. 7:200–209.

Lagier, J. C., K. El Karkouri, A. K. Mishra, C. Robert, D. Raoult, and P.-E. Fournier. 2013a. Non contiguous-finished genome sequence and description of Enterobacter massiliensis sp. nov. Stand. Genomic Sci. 7:399–412.

Lagier, J. C., K. Elkarkouri, R. Rivet, C. Couderc, D. Raoult, and P.-E. Fournier. 2013b. Non contiguous-finished genome sequence and description of Senegalemassilia anaerobia gen. nov., sp. nov. Stand. Genomic Sci. 7:343–356.

Lagier, J. C., P. Hugon, S. Khelaifia, P.-E. Fournier, B. La Scola, and D. Raoult. 2015. The rebirth of culture in microbiology through the example of culturomics to study human gut microbiota. Clin. Microbiol. Rev. 28:237–264. doi:10.1128/CMR.00014-14.

Lechner, M., S. Findeiss, L. Steiner, M. Marz, P. F. Stadler, and S. J. Prohaska. 2011. Proteinortho: detection of (co-)orthologs in large-scale analysis. BMC Bioinformatics 12:124.

Lowe, T. M., and S. R. Eddy. 1997. tRNAscan-SE: a program for improved detection of transfer RNA genes in genomic sequence. Nucleic Acids Res. 25:955–964.

Matuschek, E., D. F. J. Brown, and G. Kahlmeter. 2014. Development of the EUCAST disk diffusion antimicrobial susceptibility testing method and its implementation in routine microbiology laboratories. Clin. Microbiol. Infect. 20:O255–O266.

Meier-Kolthoff, J. P., A. F. Auch, H.-P. Klenk, and M. Göker. 2013. Genome sequence-based species delimitation with confidence intervals and improved distance functions. BMC Bioinformatics 14:60.


Oren, A., and G. M. Garrity. 2014. List of new names and new combinations previously effectively, but not validly, published. Int. J. Syst. Evol. Microbiol. 64:1–6.

Pray, L. 2008. Transposons: the jumping genes. Nat. Educ. 1:1–204.

Radl, V., J. L. Simões-Araújo, J. Leite, S. R. Passos, L. M. V. Martins, G. R. Xavier, et al. 2014. Microvirga vignae sp. nov., a root nodule symbiotic bacterium isolated from cowpea grown in semi-arid Brazil. Int. J. Syst. Evol. Microbiol. 64:725–730.

Rainey, F. A., N. L. Ward-Rainey, P. H. Janssen, H. Hippe, and E. Stackebrandt. 1996. Clostridium paradoxum DSM 7308T contains multiple 16S rRNA genes with heterogeneous intervening sequences. Microbiology (Reading, England) 142:2087–2095.

Ramasamy, D., S. Kokcha, J. C. Lagier, T. T. Nguyen, D. Raoult, and P.-E. Fournier. 2012. Genome sequence and description of Aeromicrobium massiliense sp. nov. Stand. Genomic Sci. 7:246–257.

Raoult, D., S. Audic, C. Robert, C. Abergel, P. Renesto, H. Ogata, et al. 2004. The 1.2-megabase genome sequence of Mimivirus. Science (New York, N.Y.) 306:1344–1350.

Reeve, W., M. Parker, R. Tian, L. Goodwin, H. Teshima, R. Tapia, et al. 2014. Genome sequence of Microvirga lupini strain LUT6T, a novel Lupinus alphaproteobacterial microsymbiont from Texas. Stand. Genomic Sci. 9:1159–1167.

Richter, M., and R. Rosselló-Móra. 2009. Shifting the genomic gold standard for the prokaryotic species definition. Proc. Natl Acad. Sci. USA 106:19126–19131.

Rosselló-Mora, R. 2006. DNA–DNA reassociation methods applied to microbial taxonomy and their critical evaluation. Pp. 23–50 in Molecular identification, systematics, and population structure of prokaryotes. Springer, Berlin Heidelberg. http://link.springer.com/chapter/10.1007/978-3-540-31292-5_2.

Roux, V., K. El Karkouri, J. C. Lagier, C. Robert, and D. Raoult. 2012. Non-contiguous finished genome sequence and description of Kurthia massiliensis sp. nov. Stand. Genomic Sci. 7:221–232.

Schaeffer, P., J. Millet, and J.-P. Aubert. 1965. Catabolic repression of bacterial sporulation. Proc. Natl Acad. Sci. USA 54:704.

Seng, P., M. Drancourt, F. Gouriet, B. La Scola, P.-E. Fournier, J.-M. Rolain, et al. 2009. Ongoing revolution in bacteriology: routine identification of bacteria by matrix-assisted laser desorption ionization time-of-flight mass spectrometry. Clin. Infect. Dis. 49:543–551.

Seng, P., J.-M. Rolain, P.-E. Fournier, M. Drancourt, and D. Raoult. 2010. MALDI-TOF-mass spectrometry applications in clinical microbiology. Future Microbiol. 5:1733–1754.

Seng, P., C. Abat, J.-M. Rolain, P. Colson, J. C. Lagier, F. Gouriet, et al. 2013. Identification of rare pathogenic bacteria in a clinical microbiology laboratory: impact of matrix-assisted laser desorption ionization-time of flight mass spectrometry. J. Clin. Microbiol. 51:2182–2194.

Stackebrandt, E., and J. Ebers. 2006. Taxonomic parameters revisited: tarnished gold standards. Microbiol. Today 33:152–155.

Tamura, K., G. Stecher, D. Peterson, A. Filipski, and S. Kumar. 2013. MEGA6: Molecular Evolutionary Genetics Analysis Version 6.0. Mol. Biol. Evol. 30:2725–2729.

Tatusov, R. L., D. A. Natale, I. V. Garkavtsev, T. A. Tatusova, U. T. Shankavaram, B. S. Rao, et al. 2001. The COG database: new developments in phylogenetic classification of proteins from complete genomes. Nucleic Acids Res. 29:22–28.

Tindall, B. J., R. Rosselló-Móra, H.-J. Busse, W. Ludwig, and P. Kämpfer. 2010. Notes on the characterization of prokaryote strains for taxonomic purposes. Int. J. Syst. Evol. Microbiol. 60:249–266.

Touchon, M., and E. P. C. Rocha. 2007. Causes of insertion sequences abundance in prokaryotic genomes. Mol. Biol. Evol. 24:969–981.

Wayne, L. G., D. J. Brenner, R. R. Colwell, P. A. D. Grimont, O. Kandler, M. I. Krichevsky, et al. 1987. Report of the ad hoc committee on reconciliation of approaches to bacterial systematics. Int. J. Syst. Bacteriol. 37:463–464.

Welker, M., and E. R. B. Moore. 2011. Applications of whole-cell matrix-assisted laser-desorption/ionization time-of-flight mass spectrometry in systematic microbiology. Syst. Appl. Microbiol. 34:2–11.

Weon, H.-Y., S.-W. Kwon, J.-A. Son, E.-H. Jo, S.-J. Kim, Y.-S. Kim, et al. 2010. Description of Microvirga aerophila sp. nov. and Microvirga aerilata sp. nov., isolated from air, reclassification of Balneimonas flocculans Takeda et al. 2004 as Microvirga flocculans comb. nov. and emended description of the genus Microvirga. Int. J. Syst. Evol. Microbiol. 60:2596–2600.

Woese, C. R., O. Kandler, and M. L. Wheelis. 1990. Towards a natural system of organisms: proposal for the domains Archaea, Bacteria, and Eucarya. Proc. Natl Acad. Sci. USA 87:4576–4579.

Yin, Y., and D. Fischer. 2006. On the origin of microbial ORFans: quantifying the strength of the evidence for viral lateral transfer. BMC Evol. Biol. 6:63.

Yin, Y., and D. Fischer. 2008. Identification and investigation of ORFans in the viral world. BMC Genom. 9:24.

Zhang, J., F. Song, Y. H. Xin, J. Zhang, and C. Fang. 2009. Microvirga guangxiensis sp. nov., a novel Alphaproteobacterium from soil, and emended description of the genus Microvirga. Int. J. Syst. Evol. Microbiol. 59:1997–2001.


Supporting Information

Additional supporting information may be found in the online version of this article:

Figure S1. Pulsed field gel electrophoresis (PFGE) and Southern blot. (A) PFGE of intact genomic DNA (ND) and SpeI-digested DNA from Microvirga massiliensis. Electrophoresis was performed in 1% agarose in 0.5× TBE buffer, and the pulse time was ramped from 5 to 50 s for 20 h at a voltage of 5 V/cm at 14°C. The gel was stained with ethidium bromide. Low Range PFG Marker (Biolabs, New England) was used as size marker. Sizes are indicated on the left in kilobase pairs. Southern blot (B and C) using DIG-labeled probes "scaffold 24" and "scaffold 35," respectively. The probe "scaffold 24" recognized the potential plasmid DNA band observed with the uncut genomic DNA. The intact genomic DNA is recognized by the "scaffold 35" probe, suggesting that this scaffold is a part of the genomic DNA.

Table S1. The percentage sequence identity and sequence coverage of the 16S rRNA of Microvirga massiliensis with other strains of Microvirga.

© 2016 The Authors. MicrobiologyOpen published by John Wiley & Sons Ltd.


PART III

Pan-genome analysis of Klebsiella pneumoniae

Foreword

The species is the basic unit in the classification of living organisms. The definition of a bacterial species remains a matter of debate and is in constant upheaval. It is subject to many changes depending on the available data, and it follows the evolution of bacterial identification techniques. Moreover, we still know only a small percentage of the bacterial species that exist. There can therefore never be a definitive classification [10].

With the advent of first-generation sequencing in 1975-77, followed by high-throughput sequencing in 2004, access to complete genetic information was revolutionized. These modern high-throughput sequencing technologies generate a considerable amount of data, which makes studies based on pan-genomic analyses possible.

The first definition of the pan-genome was proposed by Tettelin et al. in 2005 [4]. A pan-genome is defined as the entire genetic content belonging to a study group. The pan-genome of a bacterial species can be divided into three parts: the "core genome", comprising all the genes present in every strain; the "unique genes", which are present in a single strain (strain-specific); and the "accessory genes", which are genes present in two or more strains.
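The three-part division described here can be sketched in a few lines of Python. This is a minimal illustration, not the pipeline used in this work: the strain names and gene-family identifiers are invented, the function name `partition_pan_genome` is hypothetical, and "accessory" is taken to mean present in at least two strains but not in all of them, so that the three parts do not overlap.

```python
from collections import Counter


def partition_pan_genome(genomes: dict[str, set[str]]):
    """Split a pan-genome into core, accessory and unique gene families.

    genomes maps a strain name to the set of gene families it carries
    (clustering genes into families is assumed to be done upstream).
    """
    n_strains = len(genomes)
    # Count in how many strains each gene family occurs.
    occurrences = Counter(gene for genes in genomes.values() for gene in genes)

    core = {g for g, n in occurrences.items() if n == n_strains}
    unique = {g for g, n in occurrences.items() if n == 1 and n_strains > 1}
    accessory = {g for g, n in occurrences.items() if 1 < n < n_strains}
    return core, accessory, unique


# Toy data: three hypothetical strains and five hypothetical gene families.
toy_genomes = {
    "strain_A": {"g1", "g2", "g3"},
    "strain_B": {"g1", "g2", "g4"},
    "strain_C": {"g1", "g5"},
}

core, accessory, unique = partition_pan_genome(toy_genomes)
print("core:", sorted(core))
print("accessory:", sorted(accessory))
print("unique:", sorted(unique))
```

The pan-genome itself is the union of the three sets, so its size is simply `len(core | accessory | unique)`.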

In this work, we developed another method for proposing a taxonomy that takes into account genome analysis and the definition of the pan-genome. Comparing the core/pan-genome ratios of the different species of the genus Klebsiella revealed that K. pneumoniae subsp. ozaenae and K. pneumoniae subsp. rhinoscleromatis show many differences from each other as well as from the other species of this bacterial genus. A break in these ratios is observed, with no transition zone, which reflects distinct species. A quantum leap between two objects gives rise to a distinct definition of these two objects. Here, there is no transition from one species to another because there is a break in this ratio, which leads to a different definition of the species. This break represents a major difference between the genomes and cannot exist within a single species. This finding allows us to say that K. pneumoniae subsp. ozaenae and K. pneumoniae subsp. rhinoscleromatis are distinct species of the genus Klebsiella and not subspecies of K. pneumoniae. This work presents pan-genome analysis as an innovative tool for defining species and represents a great leap forward in bacterial taxonomy.

This work was published in the journal Biology Direct.


ARTICLE 3

Pan-genomic analysis to redefine species and subspecies based on quantum discontinuous variation: the Klebsiella paradigm

Aurélia Caputo, Vicky Merhej, Kalliopi Georgiades, Pierre-Edouard Fournier, Olivier Croce, Catherine Robert and Didier Raoult

Caputo et al. Biology Direct (2015) 10:55 DOI 10.1186/s13062-015-0085-2

RESEARCH Open Access

Pan-genomic analysis to redefine species and subspecies based on quantum discontinuous variation: the Klebsiella paradigm

Aurélia Caputo1, Vicky Merhej1, Kalliopi Georgiades2, Pierre-Edouard Fournier1, Olivier Croce1, Catherine Robert1 and Didier Raoult1*

Abstract

Background: Various methods are currently used to define species; they are based on the phylogenetic marker 16S ribosomal RNA gene sequence, DNA-DNA hybridization and DNA GC content. However, these are restricted genetic tools and show significant limitations.

Results: In this work, we describe an alternative method to build taxonomy by analyzing the pan-genome composition of different species of the Klebsiella genus. Klebsiella species are Gram-negative bacilli belonging to the large Enterobacteriaceae family. Interestingly, when comparing the core/pan-genome ratio, we found a clear discontinuous variation that can define a new species.

Conclusions: Using this pan-genomic approach, we showed that Klebsiella pneumoniae subsp. ozaenae and Klebsiella pneumoniae subsp. rhinoscleromatis are species of the Klebsiella genus, rather than subspecies of Klebsiella pneumoniae. This pan-genomic analysis helped to develop a new tool for defining species, introducing a quantic perspective for taxonomy.

Reviewers: This article was reviewed by William Martin, Pierre Pontarotti and Pere Puigbo (nominated by Dr Yuri Wolf).

Keywords: Pan-genome, Klebsiella pneumoniae, Taxonomy

* Correspondence: [email protected]
1URMITE, UMR CNRS 7278-IRD 198, Faculté de Médecine, Aix-Marseille Université, 27 Boulevard Jean Moulin, 13385 Marseille, Cedex 5, France
Full list of author information is available at the end of the article

© 2015 Caputo et al. Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

Definitions

| Term | Definition |
|---|---|
| Accessory genome | Set of genes present in more than one strain but not in all strains studied |
| Core genome | Genes present in all strains studied |
| Pan-genome | Gene pool present in the genomes of a group of organisms |
| Species | Homogeneous group of isolates characterized by many common features |

Background

Taxonomy is essential for the identification, nomenclature and classification of bacterial species. Bacterial taxonomy has undergone many changes since the first attempts to establish a bacterial classification [1]. Pathogenic bacteria were initially classified as distinct species according to their pathotype. In this study, we took the Klebsiella species as a model. The genus Klebsiella consists of Gram-negative rods that are usually non-motile, with the exception of Klebsiella mobilis (considered as 'Enterobacter aerogenes' because of this motility) [2]. Species of the genus Klebsiella are important common pathogens causing variable clinical syndromes, including nosocomial infections for Klebsiella mobilis, and bloodstream infections and bacteremia for Klebsiella variicola and Klebsiella oxytoca. Three closely-related species, Klebsiella pneumoniae, Klebsiella rhinoscleromatis and Klebsiella ozaenae, have been identified as pathovars because they cause distinguishable diseases of the respiratory tract: K. pneumoniae is responsible for the majority of human Klebsiella infections [3], causing pneumonia. K. ozaenae is rarer and is found in chronic diseases of the respiratory tract, especially atrophic rhinitis (ozena); it can also be isolated from the sputum, urine and, exceptionally, from blood cultures. K. rhinoscleromatis causes rhinoscleroma (a tumor of the nose) (Fig. 1). The metabolic activities of these three species in vitro also differ. Thus, the fermentation of dulcitol and sorbose and the catabolism of d-tartrate and, secondly, the fermentation of rhamnose and adonitol, were additional criteria used to define the three biovars [4].

Fig. 1 A 16S rRNA-based phylogenetic tree of all strains studied with their associated pathotype and GC %

Over time, the taxonomy of bacteria has been reorganized based on a combination of phenotypic and genotypic properties [5]. The genotypic criteria by which bacterial species were first characterized included the genomic GC content. Later, DNA-DNA hybridization experiments were used for comparisons with the closest phylogenetic neighbors [6]. In the 1990s, the sequencing of the 16S rRNA gene led to a revolution in the classification of bacterial species [7], enabling the re-classification of living organisms [8]. Currently, a threshold identity of 98.7 % in the 16S rRNA sequence is used to define a new bacterial species [9–11]. Thus, the taxonomic study of the Klebsiella genus, based on 16S rDNA and DNA-DNA hybridization, reclassified K. ozaenae and K. rhinoscleromatis as subspecies of K. pneumoniae (Fig. 1).

Recently, improvements in genome sequencing have facilitated the study of bacterial species, particularly by analyzing their taxonomy [2, 12]. Previous studies demonstrated the importance of genomics for bacterial taxonomy by assessing the presence of indels or single nucleotide polymorphisms (SNPs) in conserved genes [13], comparing orthologous genes [7] and studying metabolic pathways [14]. In this work, we developed another method to build a taxonomy that takes advantage of genome analysis and pan-genome definition [15]. Indeed, the comparison of the core/pan-genome ratios of the different Klebsiella species revealed that K. pneumoniae subsp. ozaenae and K. pneumoniae subsp. rhinoscleromatis exhibit many differences between themselves as well as with bona fide Klebsiella species. This finding supports the claim that K. pneumoniae subsp. ozaenae and K. pneumoniae subsp. rhinoscleromatis are distinct species of the Klebsiella genus. This work introduces pan-genome analysis as a novel tool to define species and represents a great leap forward in bacterial taxonomy.

Methods

Genome sequencing and annotation

Genomes from K. pneumoniae subsp. ozaenae and K. pneumoniae subsp. rhinoscleromatis were sequenced using the shotgun sequencing method with IonTorrent_Lifetechnologies and the Roche_454 method. For IonTorrent sequencing, genomic DNA was mechanically fragmented in Covaris microTubes to generate a fragment size distribution from 180 to 220 bp and purified through Ampure Beads (Agencourt, Beckman). The fragmented library was constructed using adaptor ligation according to the manufacturer's instructions (Life Technologies). Template preparation, emulsion PCR and Ion Sphere Particle (ISP) enrichment were performed using the Ion One Touch kit (Life Technologies). The quality of the resulting ISPs was assessed using a Qubit 2.0 Fluorometer (Life Technologies), and samples were loaded twice and sequenced on a 316 chip (Life Technologies). Finally, 3,567,359 reads for K. pneumoniae subsp. ozaenae and 3,325,174 reads for K. pneumoniae subsp. rhinoscleromatis were generated.

A 5 kb paired end library was constructed with 5 μg of DNA according to the 454_Titanium paired end protocol and to the manufacturer's instructions. This was mechanically fragmented using the Covaris device (KBioScience-LGC Genomics, Queens Road, Teddington, Middlesex, TW11 0LY, UK) with miniTUBE-Red 5Kb. DNA fragmentation was viewed using the Agilent 2100 BioAnalyzer on a DNA labchip 7500 with an optimal size of 4.9 kb. Circularization and nebulization were performed on 100 ng of the sample. After PCR amplification through 17 cycles followed by a double size selection, the single-stranded paired end library was then loaded onto a DNA labchip RNA pico 6000 on the BioAnalyzer: the pattern showed an optimum at 573 bp and the concentration was determined at 529 pg/μL. The library concentration equivalence was calculated as 1.69e10 molecules/μL and clonally amplified with 0.13, 0.25, 0.5 and 1 copies per bead (cpb) in 2 emPCR reactions per condition using the GS Titanium SV emPCR Kit (Lib-L) v2. The yields of the emPCR were 12.43, 15.48, 11.46 and 12.23 %, respectively, within the expected range of 5–20 % from the Roche procedure. The enriched clonal amplifications were loaded with 790,000 beads on the GS Titanium PicoTiterPlates PTP Kit 70x75 and sequenced with the GS Titanium Sequencing Kit XLR70. The runs were performed overnight and were then analyzed on the cluster through the gsRunBrowser and gsAssembler_Roche. We obtained 349,885 total reads for K. pneumoniae subsp. ozaenae and 499,562 reads for K. pneumoniae subsp. rhinoscleromatis.

The sets of reads obtained from the two different sequencing methods were assembled with the Mira assembler v3.2 [16]. The resulting contigs were combined using Opera software v1.2 [17] in tandem with GapFiller V1.10 [18] to reduce the dataset. Finally, manual refinements were made using CLC Genomics software (CLC bio, Aarhus, Denmark) and homemade tools. These two newly-sequenced genomes were deposited at EMBL-EBI under accession number CDJH00000000 for K. pneumoniae subsp. ozaenae and CDOT00000000 for K. pneumoniae subsp. rhinoscleromatis. For the annotation process, assembled DNA sequences of the new draft genomes were run through various annotation applications including RNAmmer [19], Prodigal [20], ARAGORN [21], Rfam [22], Pfam [23], and Infernal [24].

Genome sequence comparison and pan-genome analysis

We retrieved from NCBI the genome sequences of five strains of K. pneumoniae subsp. pneumoniae, namely HS11286 [Genbank: CP003200] [25], MGH 78578 [Genbank: CP000647] [26], 1084 [Genbank: CP003785] [27], NTUH-K2044 [Genbank: AP006725] [28] and Ecl8 [Genbank: NZ_CANH00000000] [29], together with K. pneumoniae KCTC 2242 [Genbank: CP002910] [30], two strains (E718 and KCTC 1686) of Klebsiella oxytoca [Genbank: CP003683 and CP003218, respectively] [31, 32], Klebsiella variicola At-22 [Genbank: CP001891] [33] and Klebsiella mobilis EA1509E [Genbank: FO203355] (Table 1).

To functionally annotate protein sequences, we used the WebMGA function prediction workflow [35] and the NCBI COG database for prokaryotic proteins [36]. All hits below the default RPSBLAST e-value of 1e-03 were reported [37]. We performed a Principal Component Analysis (PCA) of the COG content for all K. pneumoniae strains using the R package (http://CRAN.R-project.org).

Table 1 General genome features

| Species and subspecies | Type strain | Status | Genome size (Mb) | GC content (%) | ORF | rRNA | tRNA | Genome accession no. | References |
|---|---|---|---|---|---|---|---|---|---|
| Klebsiella pneumoniae subsp. pneumoniae | HS11286 | Complete | 5.68 | 57.1 | 5,779 | 25 | 86 | CP003200 | Liu et al. (2012) [25] |
| Klebsiella pneumoniae subsp. pneumoniae | MGH 78578 | Complete | 5.69 | 57.2 | 5,184 | 25 | 85 | CP000647 | McClelland et al. (2006) [26] |
| Klebsiella pneumoniae | KCTC 2242 | Complete | 5.46 | 57.3 | 5,152 | 25 | 87 | CP002910 | Shin et al. (2012) [30] |
| Klebsiella pneumoniae subsp. pneumoniae | 1084 | Complete | 5.39 | 57.4 | 4,962 | 25 | 79 | CP003785 | Lin et al. (2012) [27] |
| Klebsiella pneumoniae subsp. pneumoniae | NTUH-K2044 | Complete | 5.47 | 57.4 | 5,262 | 25 | 85 | AP006725 | Wu et al. (2009) [28] |
| Klebsiella pneumoniae subsp. pneumoniae | Ecl8 | Complete (with gaps) | 5.53 | 57.2 | 5,177 | 31 | 82 | HF536482 | Fookes et al. (2013) [29] |
| Klebsiella pneumoniae subsp. ozaenae | ATCC11296 | Draft | 4.95 | 57.5 | 4,818 | 3 | 62 | CDJH00000000 | Drancourt et al. (2001) [34] |
| Klebsiella pneumoniae subsp. rhinoscleromatis | Urmite | Draft | 5.35 | 57.3 | 5,363 | 4 | 64 | CDOT00000000 | - |
| Klebsiella variicola | At-22 | Complete | 5.46 | 57.6 | 4,996 | 25 | 85 | CP001891 | Pinto-Tomas et al. (2009) [33] |
| Klebsiella oxytoca | E718 | Complete | 6.57 | 55.52 | 5,923 | 25 | 85 | CP003683 | Liao et al. (2012) [31] |
| Klebsiella oxytoca | KCTC 1686 | Complete | 5.98 | 56 | 5,340 | 25 | 85 | CP003218 | Shin et al. (2012) [32] |
| Klebsiella mobilis | EA1509E | Complete | 5.59 | 54.93 | 5,117 | 26 | 88 | FO203355 | Diene et al. (2013) [2] |
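The K. pneumoniae rows of Table 1 can be summarised with a few lines of arithmetic. The sketch below copies the genome sizes and GC contents from the table; the dictionary layout and variable names are illustrative choices, not part of the original analysis.

```python
from statistics import mean

# (genome size in Mb, GC content in %) for the eight K. pneumoniae genomes of Table 1.
k_pneumoniae = {
    "HS11286": (5.68, 57.1),
    "MGH 78578": (5.69, 57.2),
    "KCTC 2242": (5.46, 57.3),
    "1084": (5.39, 57.4),
    "NTUH-K2044": (5.47, 57.4),
    "Ecl8": (5.53, 57.2),
    "ATCC11296": (4.95, 57.5),   # subsp. ozaenae, draft
    "Urmite": (5.35, 57.3),      # subsp. rhinoscleromatis, draft
}

mean_size = mean(size for size, _ in k_pneumoniae.values())
mean_gc = mean(gc for _, gc in k_pneumoniae.values())
smallest = min(k_pneumoniae, key=lambda s: k_pneumoniae[s][0])
largest = max(k_pneumoniae, key=lambda s: k_pneumoniae[s][0])

print(f"mean genome size: {mean_size:.2f} Mb, mean GC: {mean_gc:.1f} %")
print(f"smallest: {smallest}, largest: {largest}")
```

Under these inputs the means round to 5.44 Mb and 57.3 %, with the subsp. ozaenae genome the smallest and MGH 78578 the largest.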


We assigned KEGG orthology (KO) to the studied protein sequences using the KEGG automatic-annotation server (KAAS) [38] and mapped the KO-assigned genes to the Kyoto Encyclopedia of Genes and Genomes (KEGG) functional modules [39].

We determined the pan-genome composition of the six K. pneumoniae strains with and without including one of the other studied genomes: K. pneumoniae subsp. ozaenae, K. pneumoniae subsp. rhinoscleromatis, K. variicola or K. oxytoca. To this end, TBLASTN was used to search the translated nucleotide database constituted of the different studied genomes, using the proteomes as queries [37]. For each query, the query bit score was divided by the maximum bit score for all genomes in order to calculate the Blast Score Ratio (BSR) [40–43], allowing the conservation of peptides in each genome to be defined. Genes with a BSR value ≥ 0.4 (equivalent to ≥ 40 % protein identity over 100 % of the protein length) were considered to belong to the core genome. This algorithm allows comparative analysis of multiple proteomes and nucleotide sequences to be performed simultaneously.

Single Nucleotide Polymorphism (SNP) analysis

We identified SNPs among the core genomic regions using the Panseq package [1, 44, 45]. Multiple sequence alignments were built using MEGA 6.06 software [46] and phylogenies were reconstructed using the maximum likelihood method (PhyML) with 100 bootstrap iterations [47].

Results

Comparative genomic analysis of Klebsiella pneumoniae genomes

The final draft genome of K. pneumoniae subsp. ozaenae strain ATCC 11296 consists of 23 scaffolds [EMBL: LN681173-LN681195] and 128 contigs, containing 4,955,887 bp with a GC content of 57.5 %. For K. pneumoniae subsp. rhinoscleromatis strain Urmite, the draft genome consists of 26 scaffolds [EMBL: LN776221-LN776246] and 135 contigs, containing 5,342,094 bp with a GC content of 57.3 %. The major features of the Klebsiella pneumoniae sequenced genomes are summarized in Table 1.

The studied K. pneumoniae genomes had an average length of 5.44 Mb. The K. pneumoniae subsp. ozaenae genome was the smallest with only 4.95 Mb and K. pneumoniae subsp. pneumoniae MGH 78578 was the largest with 5.69 Mb. The GC content varied from 57.1 % for K. pneumoniae subsp. pneumoniae HS11286 to 57.5 % for K. pneumoniae subsp. ozaenae, with an average of 57.3 %. The number of predicted proteins in Klebsiella pneumoniae ranged from 4,818 for K. pneumoniae subsp. ozaenae to 5,779 for K. pneumoniae subsp. pneumoniae HS11286 (Table 1). A single ribosomal RNA operon (16S-23S-5S) was predicted for K. pneumoniae subsp. ozaenae, whereas 8 to 9 operons were predicted for the other strains. The number of tRNAs also differed depending on the species, ranging from 62 tRNAs in K. pneumoniae subsp. ozaenae to 87 in K. pneumoniae KCTC 2242. The hierarchical clustering of the strains based on the number of tRNAs showed that K. pneumoniae subsp. ozaenae did not cluster with any other strain (Additional file 1). Altogether, K. pneumoniae subsp. ozaenae had the smallest genome size, number of genes, and number of rRNAs and tRNAs among the K. pneumoniae strains. The reduced genome content suggests that K. pneumoniae subsp. ozaenae is more specialized than the other strains [48, 49]. Indeed, the evolution of specialized bacteria consists principally of gene loss [50], as investigated in particular for Rickettsiales [50, 51].

Pan-genome and taxonomy

The pan-genome for the six strains of Klebsiella pneumoniae contained 4,829 core genes (Fig. 2) and the core/pan-genome ratio was 94 %. This high percentage (more than 90 %) was indicative of a high rate of conservation among these strains [44]. When the different Klebsiella species were included, the core/pan-genome ratio decreased to 67 % with K. mobilis, 69 % with K. oxytoca and 81 % with K. variicola (Fig. 3). Altogether, a discontinuous variation of 13 to 27 % was observed between the bona fide Klebsiella species.

Fig. 2 Pan-genome representation for 6 analysed strains of Klebsiella pneumoniae. The number of core genes is shown in the yellow circle. For each strain, the number of accessory genes is shown in black and the number of unique genes is shown in red

When K. pneumoniae subsp. rhinoscleromatis was included, the pan-genome expanded to 5,268 genes with 4,164 core genes. The core/pan-genome ratio was 79 %, a decrease of 15 % (Fig. 3). When K. pneumoniae subsp. ozaenae was included, the pan-genome expanded to 5,190 genes with 3,720 core genes (Fig. 4). The main differences between the core genes corresponded to genes with metabolic functions in starch and sucrose metabolism, galactose metabolism and the citrate cycle. The core/pan-genome ratio was 72 %, a decrease of 22 % (Fig. 3). The abrupt decrease of the core/pan-genome ratio following the introduction of two strains of K. pneumoniae highlighted the very distinct genomic content of K. pneumoniae subsp. rhinoscleromatis and K. pneumoniae subsp. ozaenae. This discontinuous variation was comparable to that previously observed among different species, supporting the claim that K. pneumoniae subsp. rhinoscleromatis and K. pneumoniae subsp. ozaenae are distinct species of Klebsiella rather than strains of K. pneumoniae.

Fig. 3 Distribution of the percentage of the core/pan-genome ratio (values on left y-axis) for all strains of Klebsiella and the pan-genome and the core genome (the values on right y-axis) for six strains of Klebsiella pneumoniae

Fig. 4 Pan-genome representation for 6 analysed strains of Klebsiella pneumoniae including Klebsiella pneumoniae subsp. ozaenae. The number of core genes is shown in the yellow circle. For each strain, the number of accessory genes is shown in black and the number of unique genes is shown in red

The specific genomic features of K. pneumoniae subsp. ozaenae

The phylogenetic tree resulting from the SNPs of the core genome of the studied strains of K. pneumoniae showed a monophyletic group containing the K. pneumoniae subsp. pneumoniae strains (Fig. 5a), while K. pneumoniae subsp. ozaenae formed a distinct group (Fig. 5b). The analysis of the single nucleotide polymorphisms along the core genome sequence presented K. pneumoniae subsp. ozaenae as a phylogenetically distinct entity within Klebsiella that is distant from the other K. pneumoniae strains. Thus, the phylogenetic tree created based on SNPs of the core genome showed that the genomic sequence of K. pneumoniae subsp. ozaenae is very different from that of the other K. pneumoniae strains. Indeed, genome alignment of K. pneumoniae subsp. ozaenae with the six other strains of K. pneumoniae using MAUVE software [52] showed a large rearrangement in K. pneumoniae subsp. ozaenae, with different inversion and deletion events (data not shown). These findings strongly suggested the separation of K. pneumoniae subsp. ozaenae from the other K. pneumoniae strains and its recognition as a distinctive species.

When compared to the other strains of K. pneumoniae, K. pneumoniae subsp. ozaenae had fewer annotated proteins in all COG categories (4,572 proteins vs. 5,006 proteins on average) (Additional file 2). K. pneumoniae subsp. ozaenae lacked 202 genes (Additional file 3) that were present in all other Klebsiella strains and possessed 62 genes (Additional file 4) that were absent from all other strains. The genes missing from K. pneumoniae subsp. ozaenae encode proteins involved in metabolism (13 %), information storage and processing (13 %) and cellular processes (8 %).

Likewise, the KO annotation using the KEGG server showed that K. pneumoniae subsp. ozaenae had fewer proteins (1,454) involved in metabolic pathways than the other K. pneumoniae strains (an average of 1,605 proteins), especially in amino acid metabolism, carbohydrate metabolism, and xenobiotics biodegradation and metabolism. The analysis of the KEGG pathways for these genomes showed significant differences between K. pneumoniae subsp. ozaenae and the other K. pneumoniae strains in terms of their carbohydrate metabolism. The starch and sucrose metabolic pathways of K. pneumoniae subsp. ozaenae were deficient in the beta-xylosidase enzyme (EC:3.2.1.37) compared to the other K. pneumoniae strains.

Principal Component Analysis of the COG content, and hierarchical clustering calculated with the COG and KEGG data, respectively (Additional file 5, Fig. 6), showed that K. pneumoniae subsp. ozaenae did not cluster with any other K. pneumoniae strain. These findings suggest that K. pneumoniae subsp. ozaenae had a differential functional content, with specific pathways for carbohydrate metabolism, in accordance with the phenotypic specificities observed in vitro for K. pneumoniae subsp. ozaenae.

We represented some genomic and phenotypic differences between K. pneumoniae subsp. ozaenae and other Klebsiella pneumoniae in Fig. 7.
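The Blast Score Ratio criterion and the core/pan-genome ratios reported above can be made concrete with a short sketch. It is not the authors' pipeline: the function names and the toy bit scores are hypothetical, while the gene counts and percentages are those stated in the text.

```python
def blast_score_ratios(bit_scores):
    """bit_scores: dict genome -> TBLASTN bit score of one query protein.

    Each score is divided by the maximum score observed across all genomes.
    """
    best = max(bit_scores.values())
    if best <= 0:
        raise ValueError("at least one positive bit score is required")
    return {genome: score / best for genome, score in bit_scores.items()}


def is_core(bit_scores, threshold=0.4):
    """A gene is treated as core when its BSR reaches the threshold in every genome."""
    return all(r >= threshold for r in blast_score_ratios(bit_scores).values())


def core_pan_ratio(n_core, n_pan):
    """Core/pan-genome ratio expressed as a percentage."""
    return 100.0 * n_core / n_pan


# Hypothetical bit scores for two query proteins against three genomes.
conserved = {"genome_A": 500.0, "genome_B": 450.0, "genome_C": 260.0}
divergent = {"genome_A": 500.0, "genome_B": 450.0, "genome_C": 150.0}

# Gene counts reported in the text when a seventh genome is added.
ratio_rhino = core_pan_ratio(4164, 5268)    # K. pneumoniae subsp. rhinoscleromatis
ratio_ozaenae = core_pan_ratio(3720, 5190)  # K. pneumoniae subsp. ozaenae

# Reported ratios (%), compared with the 94 % obtained for the six strains alone.
BASELINE = 94
reported = {
    "K. variicola": 81,
    "K. pneumoniae subsp. rhinoscleromatis": 79,
    "K. pneumoniae subsp. ozaenae": 72,
    "K. oxytoca": 69,
    "K. mobilis": 67,
}
drops = {name: BASELINE - ratio for name, ratio in reported.items()}

print(is_core(conserved), is_core(divergent))
print(round(ratio_rhino), round(ratio_ozaenae))
print(drops)
```

With these inputs every added genome lowers the ratio by at least 13 points, which is the discontinuity the Results describe.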

Fig. 5 Tree based on single nucleotide polymorphisms of the core gene content a. for the 6 strains of Klebsiella pneumoniae b. for the 6 strains of Klebsiella pneumoniae including Klebsiella pneumoniae subsp. ozaenae. These are PhyML trees with 100 bootstrap iterations
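The trees in Fig. 5 were built with Panseq, MEGA and PhyML. The sketch below is not that pipeline; it only illustrates, on a hypothetical toy alignment, what counting SNPs over aligned core sequences means. All names and sequences are invented for the example.

```python
from itertools import combinations


def snp_columns(alignment):
    """Return the alignment columns that vary between strains.

    alignment: dict strain -> aligned sequence (equal lengths).
    Columns containing a gap ('-') or an ambiguous base ('N') are skipped.
    """
    seqs = list(alignment.values())
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("all aligned sequences must have the same length")
    columns = []
    for i in range(length):
        bases = {s[i] for s in seqs}
        if "-" in bases or "N" in bases:
            continue
        if len(bases) > 1:
            columns.append(i)
    return columns


def pairwise_snp_distances(alignment):
    """Number of SNP columns at which each pair of strains differs."""
    cols = snp_columns(alignment)
    return {
        (a, b): sum(alignment[a][i] != alignment[b][i] for i in cols)
        for a, b in combinations(sorted(alignment), 2)
    }


# Hypothetical aligned core fragment for four strains.
toy_alignment = {
    "strain_1": "ATGCCGTA",
    "strain_2": "ATGCCGTA",
    "strain_3": "ATGACGTT",
    "strain_4": "A-GACCTT",
}
cols = snp_columns(toy_alignment)
distances = pairwise_snp_distances(toy_alignment)
print(cols, distances)
```

A distance matrix of this kind is only a crude summary; a maximum likelihood method such as the one used for Fig. 5 models substitutions explicitly.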


Fig. 6 Hierarchical clustering of the Klebsiella pneumoniae strains based only on the KEGG distribution of the subclasses in the Metabolism category. The colors depend on the number of proteins involved in each metabolism category for each strain. The scale is represented in the figure

Discussion biological entities that could not be confused and could Bacterial taxonomy remains a complex and challenging not transform into one another. A new nomenclature field [53]. Initially, taxonomy was based on phenotypic therefore needed to be introduced and pan-genomic criteria [5] related to a specific biological or medical studies are likely to be the most suitable method for ex- interest. However, taxonomy has experienced a recent ploring species under this system [44, 63]. Pan-genome upheaval following the introduction of new genetic tech- study can identify different situations where speciation niques. After the advent of DNA-DNA hybridization in has occurred. First of all, an extremely broad continuum 1979 [6, 53] many bacterial species were reclassified or is defined as an infinite pan-genome, with a low core/ removed from the taxonomic classification. More re- pan-genome ratio. This indicates a lack of specialization cently, the 16S rRNA gene has been used for the classifi- in a bacterial group and the presence of a species com- cation and nomenclature of bacterial species. This plex or mixture that allows for the genesis of a species method often fails to reflect real distinctions between rather than a real species. In this context, Shigella can species [54]. The use of one single universal 16S rRNA certainly be placed among Escherichia coli species [64]. gene can hardly be a realistic Tree of Life [54]. Further- Nevertheless, Shigella species are irreversibly different more, the accepted threshold of 1.3 % between two 16S from E. coli species in terms of their metabolic, patho- rRNA sequences [9] required to differentiate between physiological and genetic properties. Shigella spp. are two different bacterial species seems to include almost human pathogens, E. coli complex clones, while E. coli 50 million years of the molecular clock [55, 56]. 
If we strains are mostly commensals of the human intestine consider this threshold as the true species definition cri- presenting a much larger genome repertoire [65]. terion, no bacterial lineages could have specialized in In the context of Klebsiella, we began to define species mammals [1, 57], which is an unacceptable conclusion. using the pan-genome. The quantum discontinuous vari- Because of the use of these criteria for the definition of ation existing between the Klebsiella pneumoniae pan- bacterial species and the use of restrictive tools, the de- genome and the other Klebsiella species shows that a scription of bacteria is very shallow and limited [58]. discontinuous variation > = 10 % of the core/pan-gen- Bacteria with sympatric lifestyles, a high level of hori- ome ratio is observed by adding a single bacterial isolate. zontal gene transfer [53, 59], large genomes, a significant This major difference between genomes leads to a break number of ribosomal operons [60] and large pan- in the ratio. This discontinuous variation corresponds to genomes [61, 62] compose bacterial species complexes. the start of a new mathematical function as previously Only the isolation of a bacterium in a new niche or a described [44]. In a recent study, the best R2 (coefficient significant population reduction will allow the appearance of determination) was determined in order to find the of a ‘specialist,a’ bona fide species which will then present most accurate regression type. It has been shown that an allopatric lifestyle, a smaller genome, a reduced num- the addition of 9 Shigella strains to the 42 E. coli strains ber of ribosomal operons and a smaller pan-genome [48]. created a break in the core/pan-genome ratio and We based our work on the hypothesis that the differ- showed variation in their trend curve [44]. In quantum ence between two species exists as an irreconcilable dif- physics, such an abrupt change is similar to that of a dis- ference. 
These species, thus, correspond to two distinct continuous variation. Electrons revolve within discrete

98 Caputo et al. Biology Direct (2015) 10:55 Page 8 of 12

Fig. 7 General representation of some distintive genomic and phenotypes of the Klebsiella pneumoniae subsp. ozaenae ATCC 11296 strain. a. Representation of the circular genome of K.p. subsp. ozaenae using Circos b. Genomes alignement of the 6 strains of Klebsiella pneumoniae including Klebsiella pneumoniae subsp. ozaenae. c. API20E identification of Klebsiella pneumoniae subsp. ozaenae ATCC 11296 strain. e. KEGG map of Galactose metabolism. e. KEGG map of Starch and Sucrose metabolism. The proteins surrounded in red are missing in Klebsiella pneumoniae subsp. ozaenae orbits. There is no gradual transition from one orbit to believed to be individual species [4, 66] and were later another; there are instead quantum discontinuous varia- considered to be sub-species of Klebsiella pneumoniae tions. This quantum phenomenon allows us to distin- [67], are actually distinct biological entities that should guish which transitions are progressive and which are indeed be considered as species. We believe that the quantic. The latter transition type results in the redefin- emergence of a pan-genome will allow for the develop- ition of species. The pan-genome study and calculation ment of a more rational approach to species definition, of the core/pan-genome ratio on the genomes of species in which species are defined as circumscribed and dis- that are theoretically the same should result in a linear tinct biological entities with large differences that pre- graph. In practice however, we noticed a break event vent them from transforming into closely-related that prompted us to question the definition of a species. species. We acknowledge the fact that pan-genome- Differences between two species would necessarily be a based species classification may evolve with the discov- striking phenomenon (ratio differences > 10 %) without a ery of new isolates. The definition of bacterial speciation, transition zone (Fig. 3) with irreconcilable differences. 
however, should reflect the restricted capacity of the spe- These physical phenomena fit well the definition of the cies to obtain new characteristics and to adapt to any species. This is not a shift that reflects the natural vari- ecological changes. ability of species, but is instead a distinct biological phenomenon. According to this perspective, the criteria Conclusions definition based on the species differentiation of Klebsi- We have proposed a new tool for defining bacterial spe- ella pneumoniae enables us to show that Klebsiella ozae- cies using pan-genome analysis. This new method was nae and Klebsiella rhinoscleromatis, which were initially applied to different species of the Klebsiella genus. We

99 Caputo et al. Biology Direct (2015) 10:55 Page 9 of 12

compared the core/pan-genome ratio of different species, which allowed us to take a great discontinuous variation forward in bacterial taxonomy. We found that K. pneumoniae subsp. ozaenae and K. pneumoniae subsp. rhinoscleromatis exhibit as many differences between them as those of Klebsiella genus, and demonstrated that these are distinct species of Klebsiella genus.

Reviewers’ comments
We thank the reviewers for their valuable comments and helpful suggestions. We would like to respond and revise our manuscript in light of the reviews.

Reviewer's report 1
Prof. William Martin, Institut of Botanic III, Heinrich-Heine University, Düsseldorf, Germany

Reviewer 1
This is a very well written and interesting paper. I like it a lot. Few papers deal with species concepts among bacteria in such a relaxed and readable manner. Clearly, for clinical reasons we have to have species so that doctors can tell us what infection we have and how to treat it. Pragmatic approaches to the problem are useful, and this paper makes progress in that direction.
line 83. "clear leap". In the vernacular of traditional systematics, this leap is called "discontinuous variation", so the principle has precedent. One might have a read of some classical systematics papers for other kinds of organisms, following the keyword lead "discontinuous variation" in the literature, and maybe rethink the title accordingly. Basically this paper suggests using a very traditional criterion with very modern data (pangenomes).

Authors' response
We thank Prof. Martin for his comments on our manuscript. We are pleased that you have enjoyed it. We replaced in this paper the word “leap” by “discontinuous variation” according to your advice.

l. 111, define cpb

Authors' response
Cpb means copies per bead, we corrected this on line 108.

l. 167, Standard MCL clustering techniques could also be used here instead of blast score ratios.

Authors' response
The Blast Score Ratio is an algorithm that provides information concerning conserved genes between genomes (orthologs), it also shows their level of conservation (line 146). The threshold used gives us an estimate of genetic variability. This is why we chose to use the BSR instead of standard MCL clustering.

l. 261, worse than the clock issue is that rDNA does not clearly predict what the rest of the genome harbours, as pangenomes and this paper show.

Authors' response
Thank you for your comments.

l. 271 "could not transform into one another" is not a very useful criterion because it makes untestable assumptions about what might happen in the future …

Authors' response
We mean that genomic content reflects the ecosystem. If the bacterium were to change its ecosystem and become specialized, no return would then be possible because no exchange is possible (lines 251, 292).

l. 279, is "irreversibly" the right word here?

Authors' response
Yes, the word is “irreversibly”.

l. 283, here we are getting to the main course of the paper. Maybe explain in more detail what Fig. 7 shows and perhaps find a mathematical description for the dip ("discontinuous variation") in the c/p ratio that is independent of the value "10 %", which some might think is the suggestion for a pan-genome defined species boundary, more studies on other species would be needed to get a better feel.

Authors' response
To clarify, we have reviewed many parts of this paper and discussed more about a mathematical description with another example of a pan-genomic study, lines 273 to 277.

l. 286, break -> discontinuity

Authors' response
Yes, “break” means “discontinuity”.

l. 287, but orbitals are different, because sampling of further atoms will not uncover transitional orbitals, but sampling of other strains will uncover transitional genomes, probably. But one gets the idea.

Authors' response
We have added another example of pan-genomic study performed in another study, line 279.

l. 296, which species definition? It's a vast literature.

Authors' response
We gave a prokaryotic species definition on page 12. For more precision, we have added some references (43, 44, 49) on lines 236, 239, 243, 251.

l. 302 f, what we see here is not a clear recommendation of the type that Stackebrandt would issue, but a plea for the use of pangenomic data for the species question,


which is unquestionably reasonable and likely a fruitful avenue of pursuit.

Authors' response
Thank you for this comment.

l. 314 … demonstrated that these are distinct species of Klebsiella genus at a level of pangenomic discontinuity that would go undetected in a system of rDNA-based species definitions. Microbial systematics has always adapted to new technologies as it regards species boundaries, perhaps the next generation of adaptation is upon us now with the availability and utility of pangenomes, at least in the clinical context.

Authors' response
Thank you for this comment.

In summary, this is a very fine paper, I enjoyed it a lot.
Quality of written English: Acceptable

Reviewer's report 2
Dr. Pierre Pontarotti, Evolution Biologique et Modélisation, Aix-Marseille University, Marseille, France

Reviewer 2
The idea proposed in this article, i.e. use of the complete genome comparison methodology to define biological species, is really interesting. Therefore, the concept deserves to be published.
However, in the present form the article is really difficult to understand. I recommend that it should be rewritten, especially the abstract, material and method, result and legend sections, to make them more precise and understandable.

Authors' response
We thank Dr. Pontarotti for his comments. We have rewritten parts of our manuscript as recommended.

Concerning the discussion about the quantum leap, the author should discuss the possibility of intermediate species that are not yet described. In other words, the quantum leap could be due to missing data.
The authors' proposal reminds me of the punctuated equilibrium theory from Eldredge and Gould, which is based in part on fossil records. One of the arguments against their theory was the possibility of missing fossils.

Authors' response
We acknowledge the fact that all Klebsiella species might not be yet known and therefore the discovery of future isolates may modify a little the proposed classification. We added this comment on the discussion (line 300).
Quality of written English: Needs some language corrections before being published.

Reviewer's report 3
Dr. Pere Puigbo (nominated by an Editorial Board member, Dr Yuri Wolf), NCBI, NIH, Bethesda, USA

Reviewer 3
This article presents an interesting framework to handle the problem of defining (quantitatively) prokaryotic species. The authors use the simple, yet apparently efficient, core/pan-genome (C/P) ratio to define species of the genus Klebsiella. Overall, this ratio has the potential to be a useful tool to classify prokaryotic species. However, I think the article opens several technical and conceptual questions that may be addressed here.
- The authors tested the C/P ratio on Klebsiella species, but how it will perform in other species is still uncertain (e.g., intracellular parasites). Moreover, it would be very useful to see an example without a predefined group of closely related species to evaluate the real potential of this ratio in prokaryotic classification.

Authors' response
We thank Dr. Puigbo for his comments concerning our manuscript. An identical study has been already performed on other species in our lab (ref. 34). I added these results to the discussion section, line 277.

- I think the “quantum leap” and the threshold identified in Klebsiella (>10 %) needs some randomization test and further exploration in other species. This introduces questions on how to use the C/P ratio: 1) is there any “golden threshold” that can be used across different taxonomical groups? 2) How is this threshold affected by genome reduction and horizontal gene transfer?

Authors' response
In a previous study, Rouli et al. (ref. 34), using a similar approach, observed that the C/P ratio varied from genus to genus and increased when genome size decreased. The influence of horizontal gene transfer on the C/P ratio, however, remains to be determined.

- The definition of species in page 3 is quite vague. It is improved on page 13, when the authors define the working hypothesis. However, I feel the article is missing a longer discussion about the meaning of ‘prokaryotic species’. It might be useful to expand this section and include additional references (e.g., PMID19411599, PMID21943000, PMID21714936).

Authors' response
We have clarified the definition on page 3, on discussion page 12 and we have added the 3 references mentioned.
Quality of written English: Acceptable
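The Blast Score Ratio mentioned in the response to the comment on l. 167 can be illustrated with a short sketch. It assumes the usual definition associated with ref. 40 (the bit score of a protein against its best match in another genome, divided by the bit score of that protein against itself). The function names, the example bit scores and the 0.4 cutoff are hypothetical and are not taken from the paper.

```python
def blast_score_ratio(query_score: float, self_score: float) -> float:
    """Return the BSR: bit score in another genome over the self-hit bit score."""
    if self_score <= 0:
        raise ValueError("self_score must be positive")
    return query_score / self_score


def conserved_genes(scores: dict[str, tuple[float, float]], threshold: float) -> list[str]:
    """Return, sorted by name, the genes whose BSR reaches the threshold.

    scores maps a gene name to (best bit score in the other genome, self bit score).
    """
    return sorted(
        gene
        for gene, (query_score, self_score) in scores.items()
        if blast_score_ratio(query_score, self_score) >= threshold
    )


# Hypothetical bit scores, for illustration only.
example_scores = {
    "geneA": (950.0, 1000.0),  # BSR 0.95: highly conserved
    "geneB": (420.0, 1000.0),  # BSR 0.42: weakly conserved
    "geneC": (0.0, 800.0),     # BSR 0.0: no hit in the other genome
}
print(conserved_genes(example_scores, threshold=0.4))
```

A BSR close to 1 indicates a nearly identical ortholog and a BSR close to 0 indicates absence, which is why a single threshold on this ratio gives an estimate of conservation between genomes.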


Additional files

Additional file 1: Hierarchical clustering of the Klebsiella pneumoniae strains based on the number of aminoacyl transfer RNAs. Colors represented the number of proteins implied for each tRNA for each strain. The scale is included in the figure. (PNG 53 kb)
Additional file 2: Number of genes for all species studied associated with the 25 general COG functional categories. The Information storage and processing category is shown in red, the Cellular processes and signaling category is shown in green and the Metabolism category is shown in blue. The remaining items shown in white belong to the Poorly characterized category. A: RNA processing and modification; J: Translation, ribosomal structure and biogenesis; K: Transcription; L: Replication, recombination and repair; B: Chromatin structure and dynamics; D: Cell cycle control, cell division, chromosome partitioning; M: Cell wall/membrane/envelope biogenesis; N: Cell motility; O: Posttranslational modification, protein turnover, chaperones; P: Inorganic ion transport and metabolism; T: Signal transduction mechanisms; U: Intracellular trafficking, secretion, and vesicular transport; C: Energy production and conversion; Q: Secondary metabolites biosynthesis, transport and catabolism; E: Amino acid transport and metabolism; F: Nucleotide transport and metabolism; G: Carbohydrate transport and metabolism; H: Coenzyme transport and metabolism; I: Lipid transport and metabolism; R: General function prediction only; S: Function unknown. (PDF 27 kb)
Additional file 3: Table showing the 202 genes, annotated by COG, present in the 6 strains of Klebsiella pneumoniae except K. pneumoniae subsp. ozaenae. (TIFF 1443 kb)
Additional file 4: Table showing 62 genes, annotated by COG, that are only present in Klebsiella pneumoniae subsp. ozaenae. (TIFF 398 kb)
Additional file 5: Plot of the Principal Component Analysis (PCA) axis of the COG content of the 6 strains of Klebsiella pneumoniae including Klebsiella pneumoniae subsp. ozaenae using the R package. (TIFF 2723 kb)

Abbreviations
SNP: Single nucleotide polymorphisms; ISP: Ion sphere particle; PGM: Personal genome machine; CPB: Copies per bead; CDSs: Coding DNA sequences; COG: Clusters of orthologous groups; MeV: Multiexperiment viewer; PCA: Principal component analysis; KEGG: Kyoto encyclopedia of genes and genomes; KO: KEGG orthology; KAAS: KEGG automatic annotation server; BSR: Blast score ratio.

Competing interests
The authors declare that they have no competing interests.

Authors’ contributions
DR designed the research project. AC performed genomic analysis, analyzed the data and wrote the paper. VM performed functional analysis and wrote the paper. KG wrote the paper. PEF provided support. OC performed de novo assembly and wrote the paper. CR was involved in sequencing. DR revised the paper. All authors read and approved the final manuscript.

Authors’ information
Not applicable.

Funding
This work was funded by IHU Méditerranée Infection.

Author details
1 URMITE, UMR CNRS 7278-IRD 198, Faculté de Médecine, Aix-Marseille Université, 27 Boulevard Jean Moulin, 13385 Marseille, Cedex 5, France.
2 Department of Biological Sciences, University of Cyprus, P.O. Box 20537–1678, Nicosia Cyprus, Greece.

Received: 29 April 2015 Accepted: 22 September 2015

References
1. Georgiades K, Raoult D. Defining pathogenic bacterial species in the genomic era. Front Microbiol. 2010;1:151.
2. Diene SM, Merhej V, Henry M, Filali AE, Roux V, Robert C, et al. The Rhizome of the Multidrug-Resistant Enterobacter aerogenes Genome Reveals How New “Killer Bugs” Are Created because of a Sympatric Lifestyle. Mol Biol Evol. 2013;30(2):369–83.
3. Podschun R, Ullmann U. Klebsiella spp. as nosocomial pathogens: epidemiology, taxonomy, typing methods, and pathogenicity factors. Clin Microbiol Rev. 1998;11:589–603.
4. Bascomb S, Lapage SP, Willcox WR, Curtis MA. Numerical classification of the tribe Klebsielleae. J Gen Microbiol. 1971;66:279–95.
5. Staley JT. The bacterial species dilemma and the genomic-phylogenetic species concept. Philos Trans R Soc B Biol Sci. 2006;361:1899–909.
6. Wayne LG, Brenner DJ, Colwell RR, Grimont PD, Kandler O, Krichevsky MI, et al. Report of the Ad Hoc Committee on Reconciliation of Approaches to Bacterial Systematics. Int J Syst Bacteriol. 1987;37:463–4.
7. Coenye T, Vandamme P. Extracting phylogenetic information from whole-genome sequencing projects: the lactic acid bacteria as a test case. Microbiol Read Engl. 2003;149(Pt 12):3507–17.
8. Woese CR, Kandler O, Wheelis ML. Towards a natural system of organisms: proposal for the domains Archaea, Bacteria, and Eucarya. Proc Natl Acad Sci U S A. 1990;87:4576–9.
9. Stackebrandt E, Ebers J. Taxonomic parameters revisited: tarnished gold standards. Microbiology today. 2006;33(4):152–5.
10. Lagier JC, Hugon P, Khelaifia S, Fournier PE, La Scola B, Raoult D. The Rebirth of Culture in Microbiology through the Example of Culturomics To Study Human Gut Microbiota. Clin Microbiol Rev. 2015;28.
11. Lagier JC, Edouard S, Pagnier I, Mediannikov O, Drancourt M, Raoult D. Current and Past Strategies for Bacterial Culture in Clinical Microbiology. Clin Microbiol Rev. 2015;28:208–36.
12. Fitz-Gibbon ST, House CH. Whole genome-based phylogenetic analysis of free-living microorganisms. Nucleic Acids Res. 1999;27:4218–22.
13. Gupta RS. The branching order and phylogenetic placement of species from completed bacterial genomes, based on conserved indels found in various proteins. Int Microbiol Off J Span Soc Microbiol. 2001;4:187–202.
14. Huson DH, Steel M. Phylogenetic trees based on gene content. Bioinforma Oxf Engl. 2004;20:2044–9.
15. Rouli L, Mbengue M, Robert C, Ndiaye M, La Scola B, Raoult D. Genomic analysis of three African strains of Bacillus anthracis demonstrates that they are part of the clonal expansion of an exclusively pathogenic bacterium. New Microbes New Infect. 2014;2:161–9.
16. Chevreux B, Pfisterer T, Drescher B, Driesel AJ, Müller WEG, Wetter T, et al. Using the miraEST assembler for reliable and automated mRNA transcript assembly and SNP detection in sequenced ESTs. Genome Res. 2004;14:1147–59.
17. Gao S, Sung WK, Nagarajan N. Opera: reconstructing optimal genomic scaffolds with high-throughput paired-end sequences. J Comput Biol J Comput Mol Cell Biol. 2011;18:1681–91.
18. Boetzer M, Pirovano W. Toward almost closed genomes with GapFiller. Genome Biol. 2012;13:R56.
19. Lagesen K, Hallin P, Rødland EA, Staerfeldt HH, Rognes T, Ussery DW. RNAmmer: consistent and rapid annotation of ribosomal RNA genes. Nucleic Acids Res. 2007;35:3100–8.
20. Hyatt D, Chen G-L, Locascio PF, Land ML, Larimer FW, Hauser LJ. Prodigal: prokaryotic gene recognition and translation initiation site identification. BMC Bioinformatics. 2010;11:119.
21. Laslett D, Canback B. ARAGORN, a program to detect tRNA genes and tmRNA genes in nucleotide sequences. Nucleic Acids Res. 2004;32:11–6.
22. Griffiths-Jones S, Bateman A, Marshall M, Khanna A, Eddy SR. Rfam: an RNA family database. Nucleic Acids Res. 2003;31:439–41.
23. Punta M, Coggill PC, Eberhardt RY, Mistry J, Tate J, Boursnell C, et al. The Pfam protein families database. Nucleic Acids Res. 2012;40(Database issue):D290–301.
24. Nawrocki EP, Kolbe DL, Eddy SR. Infernal 1.0: inference of RNA alignments. Bioinforma Oxf Engl. 2009;25:1335–7.
25. Liu P, Li P, Jiang X, Bi D, Xie Y, Tai C, Deng Z, Rajakumar K, Ou HY. Complete genome sequence of Klebsiella pneumoniae subsp. pneumoniae HS11286, a multidrug-resistant strain isolated from human sputum. J Bacteriol. 2012;194:1841–1842.
26. McClelland M, Sanderson EK, Spieth J, Clifton WS, Latreille P, et al. The Klebsiella pneumonia Genome Sequencing. 2006.
27. Lin AC, Liao TL, Lin YC, Lai YC, Lu MC, Chen YT. Complete genome sequence of Klebsiella pneumoniae 1084, a hypermucoviscosity-negative K1 clinical strain. J Bacteriol. 2012;194:6316.


28. Wu KM, Li LH, Yan JJ, Tsao N, Liao TL, Tsai HC, et al. Genome sequencing and comparative analysis of Klebsiella pneumoniae NTUH-K2044, a strain causing liver abscess and meningitis. J Bacteriol. 2009;191:4492–4501.
29. Fookes M, Yu J, De Majumdar S, Thomson N, Schneiders T. Genome sequence of Klebsiella pneumoniae Ecl8, a reference strain for targeted genetic manipulation. Genome Announc. 2013;1. doi:10.1128/genomeA.00027-12.
30. Shin SH, Kim S, Kim JY, Lee S, Um Y, Oh MK, et al. Complete genome sequence of the 2,3-butanediol-producing Klebsiella pneumoniae strain KCTC 2242. J Bacteriol. 2012;194:2736–2737.
31. Liao TL, Lin AC, Chen E, Huang TW, Liu YM, Chang YH, et al. Complete genome sequence of Klebsiella oxytoca E718, a New Delhi metallo-β-lactamase-1-producing nosocomial strain. J Bacteriol. 2012;194:5454.
32. Shin SH, Kim S, Kim JY, Lee S, Um Y, Oh MK, et al. Complete genome sequence of Klebsiella oxytoca KCTC 1686, used in production of 2,3-butanediol. J Bacteriol. 2012;194:2371–2372.
33. Pinto-Tomás AA, Anderson MA, Suen G, Stevenson DM, Chu FST, Cleland WW, et al. Symbiotic nitrogen fixation in the fungus gardens of leaf-cutter ants. Science. 2009;326:1120–1123.
34. Drancourt M, Bollet C, Carta A, Rousselier P. Phylogenetic analyses of Klebsiella species delineate Klebsiella and Raoultella gen. nov., with description of Raoultella ornithinolytica comb. nov., Raoultella terrigena comb. nov. and Raoultella planticola comb. nov. Int J Syst Evol Microbiol. 2001;51:925–932.
35. Wu S, Zhu Z, Fu L, Niu B, Li W. WebMGA: a customizable web server for fast metagenomic sequence analysis. BMC Genomics. 2011;12:444.
36. Tatusov RL, Natale DA, Garkavtsev IV, Tatusova TA, Shankavaram UT, Rao BS, et al. The COG database: new developments in phylogenetic classification of proteins from complete genomes. Nucleic Acids Res. 2001;29:22–8.
37. Altschul SF, Madden TL, Schäffer AA, Zhang J, Zhang Z, Miller W, et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997;25:3389–402.
38. Moriya Y, Itoh M, Okuda S, Yoshizawa AC, Kanehisa M. KAAS: an automatic genome annotation and pathway reconstruction server. Nucleic Acids Res. 2007;35 suppl 2:W182–5.
39. Kanehisa M, Goto S. KEGG: kyoto encyclopedia of genes and genomes. Nucleic Acids Res. 2000;28:27–30.
40. Rasko DA, Myers GSA, Ravel J. Visualization of comparative genomic analyses by BLAST score ratio. BMC Bioinformatics. 2005;6:2.
41. Pearson T, Hornstra HM, Sahl JW, Schaack S, Schupp JM, Beckstrom-Sternberg SM, et al. When outgroups fail; phylogenomics of rooting the emerging pathogen, Coxiella burnetii. Syst Biol. 2013;62:752–62.
42. Sahl JW, Gillece JD, Schupp JM, Waddell VG, Driebe EM, Engelthaler DM, et al. Evolution of a pathogen: a comparative genomics analysis identifies a genetic pathway to pathogenesis in Acinetobacter. PloS One. 2013;8:e54287.
43. D’Amato F, Eldin C, Georgiades K, Edouard S, Delerce J, Labas N, et al. Loss of TSS1 in hypervirulent Coxiella burnetii 175, the causative agent of Q fever in French Guiana. Comp Immunol Microbiol Infect Dis. 2015;41:35–41.
44. Rouli L, Merhej V, Fournier PE, Raoult D. The bacterial pangenome as a new tool for analyzing pathogenic bacteria. New Microbes New Infect. 2015;7:72–85.
45. Laing C, Buchanan C, Taboada EN, Zhang Y, Kropinski A, Villegas A, et al. Pan-genome sequence analysis using Panseq: an online tool for the rapid analysis of core and accessory genomic regions. BMC Bioinformatics. 2010;11:461.
46. Tamura K, Stecher G, Peterson D, Filipski A, Kumar S. MEGA6: Molecular Evolutionary Genetics Analysis version 6.0. Mol Biol Evol. 2013;30:2725–9.
47. Guindon S, Dufayard JF, Lefort V, Anisimova M, Hordijk W, Gascuel O. New Algorithms and Methods to Estimate Maximum-Likelihood Phylogenies: Assessing the Performance of PhyML 3.0. Syst Biol. 2010;59:307–21.
48. Merhej V, Royer-Carenzi M, Pontarotti P, Raoult D. Massive comparative genomic analysis reveals convergent evolution of specialized bacteria. Biol Direct. 2009;4:13.
49. Rolain JM, Vayssier-Taussat M, Saisongkorh W, Merhej V, Gimenez G, Robert C, et al. Partial Disruption of Translational and Posttranslational Machinery Reshapes Growth Rates of Bartonella birtlesii. mBio. 2013;4:e00115–13.
50. Darby AC, Cho NH, Fuxelius HH, Westberg J, Andersson SGE. Intracellular pathogens go extreme: genome evolution in the Rickettsiales. Trends Genet. 2007;23:511–20.
51. Merhej V, Georgiades K, Raoult D. Postgenomic analysis of bacterial pathogens repertoire reveals genome reduction rather than virulence factors. Brief Funct Genomics. 2013;12:291–304.
52. Darling AE, Mau B, Perna NT. progressiveMauve: Multiple Genome Alignment with Gene Gain, Loss and Rearrangement. PLoS ONE. 2010;5:e11147.
53. Doolittle WF, Zhaxybayeva O. On the origin of prokaryotic species. Genome Res. 2009;19:744–56.
54. O’Malley MA, Koonin EV. How stands the Tree of Life a century and a half after The Origin? Biol Direct. 2011;6:32.
55. Ochman H, Elwyn S, Moran NA. Calibrating bacterial evolution. Proc Natl Acad Sci U S A. 1999;96:12638–43.
56. Ogata H, Audic S, Renesto-Audiffren P, Fournier PE, Barbe V, Samson D, et al. Mechanisms of evolution in Rickettsia conorii and R. prowazekii. Science. 2001;293:2093–8.
57. Georgiades K, Merhej V, Raoult D. The influence of rickettsiologists on post-modern microbiology. Front Cell Infect Microbiol. 2011;1:8.
58. Rosselló-Mora R. DNA-DNA Reassociation Methods Applied to Microbial Taxonomy and Their Critical Evaluation. In: Stackebrandt PDE, editor. Molecular Identification, Systematics, and Population Structure of Prokaryotes. Berlin Heidelberg: Springer; 2006. p. 23–50.
59. Andam CP, Gogarten JP. Biased gene transfer and its implications for the concept of lineage. Biol Direct. 2011;6:47.
60. Audic S, Robert C, Campagna B, Parinello H, Claverie J-M, Raoult D, et al. Genome analysis of Minibacterium massiliensis highlights the convergent evolution of water-living bacteria. PLoS Genet. 2007;3:e138.
61. Tettelin H, Masignani V, Cieslewicz MJ, Donati C, Medini D, Ward NL, et al. Genome analysis of multiple pathogenic isolates of Streptococcus agalactiae: Implications for the microbial “pan-genome.” Proc Natl Acad Sci U S A. 2005;102:13950–5.
62. Via S. Natural selection in action during speciation. Proc Natl Acad Sci U S A. 2009;106 Suppl 1:9939–46.
63. Tettelin H, Riley D, Cattuto C, Medini D. Comparative genomics: the bacterial pan-genome. Curr Opin Microbiol. 2008;11:472–7 [Antimicrobials/Genomics].
64. Pupo GM, Lan R, Reeves PR. Multiple independent origins of Shigella clones of Escherichia coli and convergent evolution of many of their characteristics. Proc Natl Acad Sci U S A. 2000;97:10567–72.
65. Maurelli AT, Routh PR, Dillman RC, Ficken MD, Weinstock DM, Almond GW, et al. Shigella infection as observed in the experimentally inoculated domestic pig, Sus scrofa domestica. Microb Pathog. 1998;25:189–96.
66. Cowan ST, Steel M, Shaw C, Duguid JP. A classification of the Klebsiella group. J Gen Microbiol. 1960;23:601–12.
67. Ørskov I. Genus v. Klebsiella. In: Krieg NR, Holt JG, editors. Bergey's manual of systematic bacteriology, vol. 1. Baltimore, Md: Williams & Wilkins; 1984. p. 461–465.

CONCLUSIONS AND PERSPECTIVES

Genomics is an important tool for the analysis and classification of emerging bacteria. It enables the discovery of new species such as Akkermansia muciniphila, Microvirga massiliensis and Haloferax massiliensis through the assembly and annotation of their genomes, metagenomics, comparative genomics, and the culturomics and taxonogenomics approaches.

In addition, by using the pan-genome as a taxonomic tool, we were able to redefine the notion of species and apply it directly to the genus Klebsiella.

The large discontinuous variation observed in the core-genome/pan-genome ratios allowed us to redefine the species of this genus. Such variation cannot exist within a single species.

We believe that larger-scale pan-genome studies would make it possible to redefine bacterial species. Genomics thus calls taxonomy into question.

Looking ahead, it would be worthwhile to apply our pan-genome-based species definition systematically to species classification whenever possible. It would also be worthwhile to use this tool for the classification of bacterial genera and/or families. However, many questions can be raised. Can this ratio be used in the same way as it is for species classification? Will the cutoff used to redefine species be the same across different bacterial species, genera or families? In the future, we could also apply this pan-genome-based species definition to already defined groups of genomes, in order to progressively correct the classification errors that have crept in over time or that result from the use of incomplete methods.
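The core-genome/pan-genome ratio discussed in these conclusions can be illustrated with a short sketch. It assumes that each genome has already been reduced to a set of orthologous gene-cluster identifiers, so that the core genome is the intersection of these sets and the pan-genome is their union. The function names and the toy gene sets are hypothetical, and the largest-gap helper is only one possible way to look for a discontinuity in a list of ratios; it is not presented as the method used in this thesis.

```python
def core_pan_ratio(genomes: dict[str, set[str]]) -> float:
    """Return |core genome| / |pan-genome| for a group of genomes."""
    if not genomes:
        raise ValueError("at least one genome is required")
    gene_sets = list(genomes.values())
    core = set.intersection(*gene_sets)  # genes shared by every genome
    pan = set.union(*gene_sets)          # genes present in at least one genome
    if not pan:
        raise ValueError("genomes contain no genes")
    return len(core) / len(pan)


def largest_gap(ratios: list[float]) -> tuple[float, float]:
    """Return the two consecutive sorted ratios separated by the widest gap."""
    if len(ratios) < 2:
        raise ValueError("at least two ratios are required")
    ordered = sorted(ratios)
    pairs = zip(ordered, ordered[1:])
    return max(pairs, key=lambda pair: pair[1] - pair[0])


# Toy gene-cluster sets, for illustration only.
toy_genomes = {
    "strain1": {"g1", "g2", "g3", "g4"},
    "strain2": {"g1", "g2", "g3", "g5"},
    "strain3": {"g1", "g2", "g6"},
}
print(core_pan_ratio(toy_genomes))  # core = {g1, g2}, pan = {g1, ..., g6}
```

Computing the ratio for a group with and without a candidate genome, then looking at where the resulting values separate, is one way to make the idea of a discontinuous variation concrete without fixing a cutoff in advance.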

ANNEX I

Study of the human gut microbiota by culturomics

Foreword

Over these three years, I had the opportunity to take part in a major project on the study of the human gut microbiota by culturomics. The aim of this work was to evaluate the role of culturomics in filling the gaps left by metagenomics. First, I compared the 247 16S rRNA genes of the new species obtained in the laboratory up to that point with the metagenomic data from 325 runs listed by the HMP (http://www.hmpdacc.org/catalog). Next, I detected the ORFans in the 168 genomes of new species isolated and sequenced in our laboratory and compared them with the contig/scaffold sequences from 148 HMP runs.

Finally, our 247 new species were searched for in the 396 and 239 human gut samples described by Nielsen et al. [11] and Browne et al. [12], respectively. By comparing the results of the metagenomic and culturomics analyses, this work shows that culturomics enables the culture of organisms corresponding to previously unassigned sequences.

In total, culturomics doubles the number of species isolated at least once from the human gut.

This work was published in the journal Nature Microbiology.

ARTICLE 3

Culture of previously uncultured members of the human gut microbiota by culturomics

Jean-Christophe Lagier, Saber Khelaifia, Maryam Tidjani Alou, Sokhna Ndongo, Niokhor Dione, Perrine Hugon, Aurelia Caputo, Frédéric Cadoret, Sory Ibrahima Traore, El Hadji Seck, Gregory Dubourg, Guillaume Durand, Gaël Mourembou, Elodie Guilhot, Amadou Togo, Sara Bellali, Dipankar Bachar, Nadim Cassir, Fadi Bittar, Jérémy Delerce, Morgane Mailhe, Davide Ricaboni, Melhem Bilen, Nicole Prisca Makaya Dangui Nieko, Ndeye Mery Dia Badiane, Camille Valles, Donia Mouelhi, Khoudia Diop, Matthieu Million, Didier Musso, Jônatas Abrahão, Esam Ibraheem Azhar, Fehmida Bibi, Muhammad Yasir, Aldiouma Diallo, Cheikh Sokhna, Felix Djossou, Véronique Vitton, Catherine Robert, Jean Marc Rolain, Bernard La Scola, Pierre-Edouard Fournier, Anthony Levasseur and Didier Raoult

LETTERS
PUBLISHED: 7 NOVEMBER 2016 | ARTICLE NUMBER: 16203 | DOI: 10.1038/NMICROBIOL.2016.203
OPEN

Culture of previously uncultured members of the human gut microbiota by culturomics

Jean-Christophe Lagier1, Saber Khelaifia1, Maryam Tidjani Alou1, Sokhna Ndongo1, Niokhor Dione1, Perrine Hugon1, Aurelia Caputo1, Frédéric Cadoret1, Sory Ibrahima Traore1, El Hadji Seck1, Gregory Dubourg1, Guillaume Durand1, Gaël Mourembou1, Elodie Guilhot1, Amadou Togo1, Sara Bellali1, Dipankar Bachar1, Nadim Cassir1, Fadi Bittar1, Jérémy Delerce1, Morgane Mailhe1, Davide Ricaboni1, Melhem Bilen1, Nicole Prisca Makaya Dangui Nieko1, Ndeye Mery Dia Badiane1, Camille Valles1, Donia Mouelhi1, Khoudia Diop1, Matthieu Million1, Didier Musso2, Jônatas Abrahão3, Esam Ibraheem Azhar4, Fehmida Bibi4, Muhammad Yasir4, Aldiouma Diallo5, Cheikh Sokhna5, Felix Djossou6, Véronique Vitton7, Catherine Robert1, Jean Marc Rolain1, Bernard La Scola1, Pierre-Edouard Fournier1, Anthony Levasseur1 and Didier Raoult1*

Metagenomics revolutionized the understanding of the relations among the human microbiome, health and diseases, but generated a countless number of sequences that have not been assigned to a known microorganism [1]. The pure culture of prokaryotes, neglected in recent decades, remains essential to elucidating the role of these organisms [2]. We recently introduced microbial culturomics, a culturing approach that uses multiple culture conditions and matrix-assisted laser desorption/ionization–time of flight and 16S rRNA for identification [2]. Here, we have selected the best culture conditions to increase the number of studied samples and have applied new protocols (fresh-sample inoculation; detection of microcolonies and specific cultures of Proteobacteria and microaerophilic and halophilic prokaryotes) to address the weaknesses of the previous studies [3–5]. We identified 1,057 prokaryotic species, thereby adding 531 species to the human gut repertoire: 146 bacteria known in humans but not in the gut, 187 bacteria and 1 archaea not previously isolated in humans, and 197 potentially new species. Genome sequencing was performed on the new species. By comparing the results of the metagenomic and culturomic analyses, we show that the use of culturomics allows the culture of organisms corresponding to sequences previously not assigned. Altogether, culturomics doubles the number of species isolated at least once from the human gut.

The study of the human gut microbiota has been revived by metagenomic studies [6–8]. However, a growing problem is the gaps that remain in metagenomics, which correspond to unidentified sequences that may be correlated with an identified organism [9]. Moreover, the exploration of relations between the microbiota and human health requires, both for an experimental model and therapeutic strategies, the growing of microorganisms in pure culture [10], as recently demonstrated in elucidations of the role of Clostridium butyricum in necrotizing enterocolitis and the influence of gut microbiota on cancer immunotherapy effects [11,12]. In recent years, microbial culture techniques have been neglected, which explains why the known microbial community of the human gut is extremely low [13]. Before we initiated microbial culturomics [13], of the approximately 13,410 known bacterial and archaea species, 2,152 had been identified in humans and 688 bacteria and 2 archaea had been identified in the human gut. Culturomics consists of the application of high-throughput culture conditions to the study of the human microbiota and uses matrix-assisted laser desorption/ionization–time of flight (MALDI–TOF) or 16S rRNA amplification and sequencing for the identification of growing colonies, some of which have been previously unidentified [2]. With the prospect of identifying new genes of the human gut microbiota, we extend here the number of recognized bacterial species and evaluate the role of this strategy in resolving the gaps in metagenomics, detailing our strategy step by step (see Methods). To increase the diversity, we also obtained frozen samples from healthy individuals or patients with various diseases from different geographical origins. These frozen samples were collected as fresh samples (stool, small-bowel and colonic samples; Supplementary Table 1). Furthermore, to determine appropriate culture conditions, we first reduced the number of culture conditions used (Supplementary Table 2a–c) and then focused on specific strategies for some taxa that we had previously failed to isolate (Supplementary Table 3).

First, we standardized the microbial culturomics for application to the sample testing (Supplementary Table 1). A refined analysis of our first study, which had tested 212 culture conditions [4], showed that all identified bacteria were cultured at least once using one of the 70 best culture conditions (Supplementary Table 2a). We applied these 70 culture conditions (Supplementary Table 2a) to the study of 12 stool samples (Supplementary Table 1). Thanks to the implementation of the recently published repertoire of human bacteria [13] (see Methods), we determined that the isolated bacteria included 46 bacteria known from the gut but not recovered by culturomics before this work (new for culturomics), 38 that had

1 Aix Marseille Université URMITE, UM63, CNRS 7278, IRD 198, INSERM 1095, 27 Boulevard Jean Moulin, 13385 Marseille Cedex 5, France.
2 Institut Louis Malardé, Papeete, Tahiti, Polynésie Française.
3 Departamento de Microbiologia Laboratorio de , Universidade Federal de Minas Gerais, Belo Horizonte, Brasil.
4 Special Infectious Agents Unit, King Fahd Medical Research Center, King Abdulaziz University, Jeddah 21589, Saudi Arabia.
5 Institut de Recherche pour le Développement, UMR 198 (URMITE), Campus International de Hann, IRD, BP 1386, CP, 18524 Dakar, Sénégal.
6 Department of Infectious and Tropical Diseases, Centre Hospitalier de Cayenne, Cayenne, French Guiana.
7 Service de Gastroentérologie, Hôpital Nord, Assistance Publique-Hôpitaux de Marseille, 13915 Marseille, France.

NATURE MICROBIOLOGY | VOL 1 | DECEMBER 2016 | www.nature.com/naturemicrobiology 1

© 2016 Macmillan Publishers Limited, part of Springer Nature. All rights reserved.

[Figure 1 graphic: stacked bar chart of the total number of microorganisms known in the human gut, reaching 1,525 in the present work, across columns A to K. Stacked categories: new species (NS); first isolation in human (NH); already known in human at another site (H); microorganisms identified by culturomics (H(GUT)); microorganisms identified by other laboratories only. The chart also carries the annotation "69 (ref. 15)".]

Columns: A, first project of culturomics; B, published culturomics studies; C, 70 culture conditions; D, 18 culture conditions; E, cohorts; F, fresh stools; G, Proteobacteria; H, microaerophilic; I, halophilic Archaea; J, microcolonies; K, duodenum.

Figure 1 | Number of different bacteria and archaea isolated during the culturomics studies. Columns A and B represent the results from previously published studies, and columns C to K the different projects described herein. The bacterial species are represented in five categories: NS, new species; NH, prokaryotes first isolated in humans; H, prokaryotes already known in humans but never isolated from the human gut; H (GUT), prokaryotes known in the human gut but newly isolated by culturomics; and prokaryotes isolated by other laboratories but not by culturomics.

already been isolated in humans but not from the gut (non-gut bacteria), 29 that had been isolated in humans for the first time (non-human bacteria) and 10 that were completely new species (unknown bacteria) (Fig. 1 and Supplementary Tables 4a and 5).

Beginning in 2014, to reduce the culturomics workload and extend our stool-testing capabilities, we analysed previous studies and selected the 18 best culture conditions2. We performed cultures in liquid media in blood culture bottles, followed by subcultures on agar (Supplementary Table 2b). We designed these culture conditions by analysing our first studies. The results of those studies indicated that emphasizing three components was essential: pre-incubation in a blood culture bottle (56% of the new species isolated), the addition of rumen fluid (40% of the new species isolated) and the addition of sheep blood (25% of the new species isolated)2–5. We applied this strategy to 37 stool samples from healthy individuals with different geographic provenances and from patients with different diseases (Supplementary Table 1). This new strategy enabled the culture of 63 organisms new to culturomics, 58 non-gut bacteria, 65 non-human bacteria and 89 unknown bacteria (Fig. 1 and Supplementary Tables 4a and 5).

We also applied culturomic conditions (Supplementary Table 2c) to large cohorts of patients sampled for other purposes (premature infants with necrotizing enterocolitis, pilgrims returning from the Hajj and patients before or after bariatric surgery) (Supplementary Table 1). A total of 330 stool samples were analysed. This enabled the detection of 13 bacteria new to culturomics, 18 non-gut bacteria, 13 non-human bacteria and 10 unknown species (Fig. 1 and Supplementary Tables 4a and 5).

Among the gut species mentioned in the literature13 and not previously recovered by culturomics, several were extremely oxygen-sensitive anaerobes, several were microaerophilic and several were Proteobacteria, and we focused on these bacteria (Supplementary Table 3). Because delay and storage may be critical with anaerobes, we inoculated 28 stools immediately upon collection. This enabled the culture of 27 new gut species for culturomics, 13 non-gut bacteria, 17 non-human bacteria and 40 unknown bacteria (Fig. 1 and Supplementary Tables 3a and 4). When we specifically tested 110 samples for Proteobacteria, we isolated 9 bacteria new to culturomics, 3 non-gut bacteria and 3 non-human bacteria (Fig. 1 and Supplementary Tables 4a and 5). By culturing 242 stool specimens exclusively under a microaerophilic atmosphere, we isolated 9 bacteria new to culturomics, 6 non-gut bacteria, 17 non-human bacteria and 7 unknown bacteria (Fig. 1 and Supplementary Tables 4a and 5). We also introduced the culture of halophilic prokaryotes from the gut and microcolony detection. The culture of halophilic bacteria was performed using culture media supplemented with salt for 215 stool samples, allowing the culture of 48 halophilic prokaryotic species, including one archaea (Haloferax alexandrinus), 2 new bacteria for culturomics, 2 non-gut bacteria, 34 non-human bacteria, 10 unknown bacteria and one new halophilic archaea (Haloferax massiliensis sp. nov.) (Fig. 1 and Supplementary Tables 4a and 5). Among these 48 halophilic prokaryotic species, 7 were slight halophiles (growing with 10–50 g l⁻¹ of NaCl), 39 moderate halophiles (growing with 50–200 g l⁻¹ of NaCl) and 2 extreme halophiles (growing with 200–300 g l⁻¹ of NaCl).

We also introduced the detection of microcolonies that were barely visible to the naked eye (diameters ranging from 100 to


[Figure 2 graphic. Left panel, "Extension of human gut repertoire": 973 samples; 901,364 colonies tested with 2.7 million MALDI–TOF spectra; 1,258 16S rRNAs of unidentified colonies; 945 different prokaryotes including 2 archaea; bar chart of the number of species by category (new species; new for human; new for human gut; previously known from gut but new for culturomics; previously known from gut). Right panel, "Decipher metagenomic gaps by culturomics": (1) comparison of the 16S rRNA of the 247 new species (197 + 50 previously published) with HMP: 125 species previously detected as OTUs by metagenomic studies; (2) 19,980 new ORFans genes, including 1,326 from 54 of the new species; (3) from 7.7 to 60.7% of the new species detected in the Nielsen and Browne metagenomic studies, respectively; (4) comparison of 84 samples analysed by metagenomics and culturomics: among the 200 16S rRNAs of the new species, 102 recovered 827 times (average 9.8 per stool); (5) analysis of the species with a cutoff of 20 reads: 4,158 OTUs and 556 species, of which 420 were found by culturomics (210 known from gut, 61 new in human gut, 47 new in humans, 102 new species) and 136 were never found in culturomics (50 previously known from gut but not by culturomics, 86 not from the human gut).]

Figure 2 | Summary of the culturomics work that has extended the gut repertoire and filled some of the gaps in metagenomics.

300 µm) and could only be viewed with magnifying glasses. These colonies were transferred into a liquid culture enrichment medium for identification by MALDI–TOF mass spectrometry (MS) or 16S rRNA amplification and sequencing. By testing ten stool samples, we detected two non-gut bacteria, one non-human bacterium and one unknown bacterium that only formed microcolonies (Fig. 1 and Supplementary Tables 4a and 5). Finally, by culturing 30 duodenal, small bowel intestine and colonic samples, we isolated 22 bacteria new to culturomics, 6 non-gut bacteria, 9 non-human bacteria and 30 unknown bacteria (Fig. 1 and Supplementary Tables 4a and 5). To continue the exploration of gut microbiota, future culturomics studies could also be applied to intestinal biopsies.

In addition, we performed five studies to evaluate the role of culturomics for deciphering the gaps in metagenomics9. First, we compared the 16S rRNA sequences of the 247 new species (the 197 new prokaryotic species isolated here in addition to the 50 new bacterial species isolated in previous culturomic studies3–5) to the 5,577,630 reads from the 16S rRNA metagenomic studies listed by the Human Microbiome Project (HMP) (http://www.hmpdacc.org/catalog). We found sequences, previously termed operational taxonomic units (OTUs), for 125 of our bacterial species (50.6%). These identified bacterial species included Bacteroides bouchedurhonense, which was recovered in 44,428 reads, showing that it is a common bacterium (Supplementary Table 6). Second, because the genome sequencing of 168 of these new species allowed the generation of 19,980 new genes that were previously unknown (ORFans genes) (Supplementary Table 7), we blasted these with 13,984,809 contigs/scaffolds from the assembly of whole metagenomic studies by HMP, enabling the detection of 1,326 ORFans (6.6%) from 54 of our new bacterial species (including 45 detected also from 16S) (Supplementary Table 8). Therefore, at least 102 new bacterial species were found but not identified in previous metagenomic studies from the HMP. Third, we searched for our 247 new species in the 239 human gut microbiome samples from healthy individuals described by Browne et al., in which 137 bacterial species were isolated15. We captured 150 of our new species in these metagenomics data, representing 60.7% (Supplementary Table 9). Moreover, we also identified 19 of our species (7.7%) from 396 human stool individuals described by Nielsen et al., from which 741 metagenomic species and 238 unique metagenomic genomes were identified16 (Supplementary Table 9). Fourth, we analysed the 16S rRNA metagenomic sequences of 84 stools also tested by culturomics (Supplementary Table 10). We compared the OTUs identified by blast with a database including the 16S rRNA of all species isolated by culturomics. Among the 247 16S rRNA of the new species, 102 were recovered 827 times, with an average of 9.8 species per stool. Finally, analysis of these species using a cutoff threshold of 20 reads identified 4,158 OTUs and 556 (13.4%) species (Supplementary Table 11), among which 420 species (75.5%) were recovered by culturomics. Of these, 210 (50%) were previously found to be associated with the human gut, 47 were not previously found in humans (11.2%), 61 were found in humans but not in the gut (14.5%) and 102 (24.3%) were new species.
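The percentages quoted in the five comparisons above follow directly from the counts given in the text. As a minimal sketch (the labels and helper name are illustrative and are not taken from the paper), they can be recomputed as follows:

```python
# Counts are the ones quoted in the text; labels and names are illustrative only.
NEW_SPECIES = 247  # 197 isolated in this study + 50 from earlier culturomics studies


def pct(part, whole):
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)


# (numerator, denominator) pairs for each percentage stated in the text
checks = {
    "new species matched to HMP 16S reads": (125, NEW_SPECIES),
    "ORFans detected in HMP assemblies": (1326, 19980),
    "new species found in Browne et al. samples": (150, NEW_SPECIES),
    "new species found in Nielsen et al. samples": (19, NEW_SPECIES),
    "species among OTUs at the 20-read cutoff": (556, 4158),
    "cutoff species recovered by culturomics": (420, 556),
}
percentages = {label: pct(a, b) for label, (a, b) in checks.items()}

# Breakdown of the 420 species recovered by culturomics
breakdown = {
    "known from gut": 210,
    "new in humans": 47,
    "known in humans, new in gut": 61,
    "new species": 102,
}
breakdown_pct = {label: pct(n, 420) for label, n in breakdown.items()}

for label, value in {**percentages, **breakdown_pct}.items():
    print(f"{label}: {value}%")
```

The four categories sum to 420, and the remaining 556 − 420 = 136 species are the ones not recovered by culturomics.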


[Figure 3 graphic: phylogenetic tree; scale bar, 0.05. Reference species marked with an asterisk include Escherichia coli, Klebsiella pneumoniae and Enterobacter cloacae.]

Figure 3 | Phylogenetic tree of the 247 new prokaryote species isolated by culturomics. Bacterial species from Firmicutes are highlighted in red, Actinobacteria (light green), Proteobacteria (blue), Bacteroidetes (purple), Synergistetes (green), Fusobacteria (dark green) and Archaea (grey), respectively. The sequences of 16 prokaryotic species belonging to six phyla previously known from the human gut and more frequently isolated by culture in human gut are highlighted in bold and by an asterisk.

Interestingly, among the 136 species not previously found by culturomics, 50 have been found in the gut and 86 have never previously been found in the human gut (Fig. 2 and Supplementary Table 11).

Overall, in this study, by testing 901,364 colonies using MALDI–TOF MS (Supplementary Table 1), we isolated 1,057 bacterial species, including 531 newly found in the human gut. Among them, 146 were non-gut bacteria, 187 were non-human bacteria, one was a non-human halophilic archaeon and 197 were unknown bacteria, including two new families (represented by Neofamilia massiliensis gen. nov., sp. nov. and Beduinella massiliensis gen. nov., sp. nov.) and one unknown halophilic archaeon (Fig. 1 and Supplementary Table 4a). Among these, 600 bacterial species belonged to Firmicutes, 181 to Actinobacteria, 173 to Proteobacteria (a phylum that we have under-cultured to date; Supplementary Table 5), 88 to Bacteroidetes, 9 to Fusobacteria, 3 to Synergistetes, 2 to , 1 to Lentisphaerae and 1 to Verrucomicrobia (Supplementary Table 4a). Among these 197 new prokaryote species, 106 (54%) were detected in at least two stool samples, including a species that was cultured in 13 different stools (Anaerosalibacter massiliensis) (Supplementary Table 4a). In comparison with our contribution, a recent work using a single culture medium was able to culture 120 bacterial species, including 51 species known from the gut, 1 non-gut bacterium, 1 non-human bacterium and 67 unknown bacteria, including two new families (Supplementary Table 12).


To obtain these significant results we tested more than 900,000 colonies, generating 2.7 million spectra, and performed 1,258 molecular identifications of bacteria not identified through MALDI–TOF, using 16S rRNA amplification and sequencing. The new prokaryote species are available in the Collection de Souches de l’Unité des Rickettsies (CSUR) and Deutsche Sammlung von Mikroorganismen und Zellkulturen (DSMZ) (Supplementary Tables 4a and 5). All 16S sequences of the new species and the species unidentified by MALDI–TOF, as well as the genome sequences of the new species, have been deposited in GenBank (Supplementary Tables 5 and 13). In addition, thanks in part to an innovative system using a simple culture for the archaea without an external source of hydrogen17, among these prokaryotes we isolated eight archaeal species from the human gut, including two new ones for culturomics, one non-gut archaea, four non-human archaea and one new halophilic species.

We believe that this work is a key step in the rebirth of the use of culturing in human microbiology2–5,16 and only the efforts of several teams around the world in identifying the gut microbiota repertoire will allow an understanding and analysis of the relations between the microbiota and human health, which could then participate in adapting Koch’s postulates to include the microbiota21. The rebirth of culture, termed culturomics here, has enabled the culturing of 77% of the 1,525 prokaryotes now identified in the human gut (Fig. 1 and Supplementary Table 4b). In addition, 247 new species (197 cultured here plus 50 from previous studies) and their genomes are now available (Fig. 3). The relevance of the new species found by culturomics is emphasized because 12 of them were isolated in our routine microbiology laboratory from 57 diverse clinical samples (Supplementary Table 14). In 2016, 6 of the 374 (1.6%) different identifications performed in the routine laboratory were new species isolated from culturomics. As 519 of the species found by culturomics in the gut for the first time (Fig. 1) were not included in the HMP (Supplementary Table 15) and because hundreds of their genomes are not yet available, the results of this study should prompt further genome sequencing to obtain a better identification in gut metagenomic studies.

Methods

Samples. To obtain a larger diversity of gut microbiota, we analysed 943 different stool samples and 30 small intestine and colonic samples from healthy individuals living or travelling in different geographical regions (Europe, rural and urban Africa, Polynesia, India and so on) and from patients with diverse diseases (for example, anorexia nervosa, obesity, malnutrition and HIV). The main characteristics are summarized in Supplementary Table 1. Consent was obtained from each patient, and the study was approved by the local Ethics Committee of the IFR48 (Marseille, France; agreement no. 09–022). Except for the small intestine and stool samples that we directly inoculated without storage (see sections ‘Fresh stool samples’ and ‘Duodenum and other gut samples’), the faecal samples collected in France were immediately aliquoted and frozen at −80 °C. Those collected in other countries were sent to Marseille on dry ice, then aliquoted and frozen at −80 °C for between 7 days and 12 months before analysis.

Culturomics. Culturomics is a high-throughput method that multiplies culture conditions in order to detect higher bacterial diversity. The first culturomics study concerned three stool samples, 212 culture conditions (including direct inoculation in various culture media), and pre-incubation in blood culture bottles incubated aerobically and anaerobically4. Overall, 352 other stool samples, including stool samples from patients with anorexia nervosa3, patients treated with antibiotics5, or Senegalese children, both healthy and those with diarrhoea22, were previously studied by culturomics, and these results have been comprehensively detailed in previous publications3–5. In this work, we only included the genome sequences of the 50 new bacterial species isolated in these previous works to contribute to our analysis of culturomics and to fill some of the gaps left by metagenomics. In addition, these previously published data are clearly highlighted in Fig. 1, illustrating the overall contribution of culturomics in exploring the gut microbiota.

Bacterial species isolated from our new projects and described here were obtained using the strategy outlined in the following sections.

Standardization of culturomics for the extension of sample testing. A refined analysis allowed the selection of 70 culture conditions (Supplementary Table 2a) for the growth of all the bacteria4. We applied these culture conditions to 12 more stool samples and tested 160,265 colonies by MALDI–TOF (Supplementary Table 1). The 18 best culture conditions were selected using liquid media enrichment in a medium containing blood and rumen fluid and subculturing aerobically and anaerobically in a solid medium (Supplementary Table 2b)2. Subcultures were inoculated every three days on solid medium, and each medium was kept for 40 days. We applied these culture conditions to 40 stool samples, ultimately testing 565,242 colonies by MALDI–TOF (Supplementary Table 1).

Cohorts. In parallel to these main culturomics studies, we used fewer culture conditions to analyse a larger number of stool samples. We refer to these projects as cohorts. Four cohorts were analysed (pilgrims returning from the Hajj, premature infants with necrotizing enterocolitis, patients before and after bariatric surgery, and patients for acidophilic bacterial species detection). A total of 330 stool samples generated the 52,618 colonies tested by MALDI–TOF for this project (Supplementary Table 1).

Pilgrims from the Hajj. A cohort of 127 pilgrims was included and 254 rectal swabs were collected from the pilgrims: 127 samples were collected before the Hajj and 127 samples were collected after the Hajj. We inoculated 100 µl of liquid sample in an 8 ml bottle containing Trypticase Soy Broth (BD Diagnostics) and incubated the sample at 37 °C for 1 day. We inoculated 100 µl of the enriched sample into four culture media: Hektoen agar (BD Diagnostics), MacConkey agar+Cefotaxime (bioMérieux), Cepacia agar (AES Chemunex) and Columbia ANC agar (bioMérieux). The sample was diluted 10⁻³ before being plated on the MacConkey and Hektoen agars and 10⁻⁴ before being plated on the ANC agar. The sample was not diluted before being inoculated on the Cepacia agar. Subcultures were performed on Trypticase Soy Agar (BD Diagnostics) and 3,000 colonies were tested using MALDI–TOF.

Preterm neonates. Preterm neonates were recruited from four neonatal intensive care units (NICUs) in southern France from February 2009 to December 2012 (ref. 12). Only patients with definite or advanced necrotizing enterocolitis corresponding to Bell stages II and III were included. Fifteen controls were matched to 15 patients with necrotizing enterocolitis by sex, gestational age, birth weight, days of life, type of feeding, mode of delivery and duration of previous antibiotic therapy. The stool samples were inoculated into 54 preselected culture conditions (Supplementary Table 2c). The anaerobic cultures were performed in an anaerobic chamber (AES Chemunex). A total of 3,000 colonies were tested by MALDI–TOF for this project.

Stool analyses before and after bariatric surgery. We included 15 patients who had bariatric surgery (sleeve gastrectomy or Roux-en-Y gastric bypass) from 2009 to 2014. All stool samples were frozen before and after surgery. We used two different culture conditions for this project. Each stool sample was diluted in 2 ml of Dulbecco’s phosphate-buffered saline, then pre-incubated in both anaerobic (BD Bactec Plus Lytic/10 Anaerobic) and aerobic (BD Bactec Plus Lytic/10 Aerobic) blood culture bottles, with 4 ml of sheep blood and 4 ml of sterile rumen fluid being added as previously described4. These cultures were subcultured on days 1, 3, 7, 10, 15, 21 and 30 in 5% sheep blood Columbia agar (bioMérieux), and 33,650 colonies were tested by MALDI–TOF.

Acidophilic bacteria. The pH of each stool sample was measured using a pH meter: 1 g of each stool specimen was diluted in 10 ml of neutral distilled water (pH 7) and centrifuged for 10 min at 13,000g; the pH values of the supernatants were then measured. Acidophilic bacteria were cultured after stool enrichment in a liquid medium consisting of Columbia Broth (Sigma-Aldrich) modified by the addition of (per litre) 5 g MgSO4, 5 g MgCl2, 2 g KCl, 2 g glucose and 1 g CaCl2. The pH was adjusted to five different values: 4, 4.5, 5, 5.5 and 6, using HCl. The bacteria were then subcultured on solid medium containing the same nutritional components and pH as the culture enrichment. They were inoculated after 3, 7, 10 or 15 incubation days in liquid medium for each tested pH condition. Serial dilutions from 10⁻¹ to 10⁻¹⁰ were then performed, and each dilution was plated on agar medium. Negative controls (no inoculation of the culture medium) were included for each condition. Overall, 16 stool samples were inoculated, generating 12,968 colonies, which were tested by MALDI–TOF.

Optimization of the culturomics strategy. In parallel with this standardization period, we performed an interim analysis in order to detect gaps in our strategy. Analysing our previously published studies, we observed that 477 bacterial species previously known from the human gut were not detected. Most of these species grew in strict anaerobic (209 species, 44%) or microaerophilic (25 species, 5%) conditions, and 161 of them (33%) belonged to the phylum Proteobacteria, whereas only 46 of them (9%) belonged to the phylum Bacteroidetes (Supplementary Table 3). The classification was performed using our own database: (http://www.mediterranee-infection.com/article.php?laref=374&titre=list-of-prokaryotes-according-to-their-aerotolerant-or-obligate-anaerobic-metabolism). Focusing on these bacterial species, we designed specific strategies with the aim of cultivating these missing bacteria.
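The cohort totals quoted in the Methods (330 samples and 52,618 colonies) can be cross-checked against the per-cohort figures. The sketch below is only a worked tally: the dictionary keys are illustrative labels, and the per-cohort sample counts for the preterm and bariatric cohorts rest on the assumption of one sample per neonate and one sample before plus one after surgery per patient, which the text does not state explicitly.

```python
# Per-cohort colony counts as quoted in the Methods; keys are illustrative labels.
cohort_colonies = {
    "pilgrims": 3000,
    "preterm neonates": 3000,
    "bariatric surgery": 33650,
    "acidophilic bacteria": 12968,
}

# Per-cohort sample counts.
cohort_samples = {
    "pilgrims": 127 + 127,         # rectal swabs before and after the Hajj
    "preterm neonates": 15 + 15,   # ASSUMPTION: one stool per case and per matched control
    "bariatric surgery": 15 * 2,   # ASSUMPTION: one stool before and one after surgery per patient
    "acidophilic bacteria": 16,
}

total_colonies = sum(cohort_colonies.values())
total_samples = sum(cohort_samples.values())

print(f"cohort colonies: {total_colonies:,}")
print(f"cohort samples:  {total_samples}")
```

Under these assumptions both sums agree with the totals reported for the cohort projects.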


Fresh stool samples. As the human gut includes extremely oxygen-sensitive bacterial species, and because frozen storage kills some bacteria10, we tested 28 stool samples from healthy individuals and directly cultivated these samples on collection and without storage. Each sample was directly cultivated on agar plates, enriched in blood culture bottles (BD Bactec Plus Lytic/10 Anaerobic) and followed on days 2, 5, 10 and 15. Conditions tested were anaerobic Columbia with 5% sheep blood (bioMérieux) at 37 °C with or without thermic shock (20 min/80 °C), 28 °C, anaerobic Columbia with 5% sheep blood agar (bioMérieux) and 5% rumen fluid and R-medium (ascorbic acid 1 g l−1, uric acid 0.4 g l−1, and glutathione 1 g l−1, pH adjusted to 7.2), as previously described23. For this project, 59,688 colonies were tested by MALDI–TOF.

Proteobacteria. We inoculated 110 stool samples using pre-incubation in blood culture bottles (BD Bactec Plus Lytic/10 Anaerobic) supplemented with vancomycin (100 µg l−1; Sigma-Aldrich). The subcultures were performed on eight different selective solid media for the growth of Proteobacteria. We inoculated onto MacConkey agar (Biokar-Diagnostics), buffered charcoal yeast extract (BD Diagnostic), eosine-methylene blue agar (Biokar-Diagnostics), Salmonella–Shigella agar (Biokar-Diagnostics), Drigalski agar (Biokar-Diagnostics), Hektoen agar (Biokar-Diagnostics), thiosulfate-citrate-bile-sucrose (BioRad) and Yersinia agar (BD Diagnostic) and incubated at 37 °C, aerobically and anaerobically. For this project, 18,036 colonies were tested by MALDI–TOF.

Microaerophilic conditions. We inoculated 198 different stool samples directly onto agar or after pre-incubation in blood culture bottles (BD Bactec Plus Lytic/10 Anaerobic bottles, BD). Fifteen different culture conditions were tested using Pylori agar (bioMérieux), Campylobacter agar (BD), Gardnerella agar (bioMérieux), 5% sheep blood agar (bioMérieux) and our own R-medium as previously described23. We incubated Petri dishes only in microaerophilic conditions using GENbag microaer systems (bioMérieux) or CampyGen agar (bioMérieux), except the R-medium, which was incubated aerobically at 37 °C. These culture conditions generated 41,392 colonies, which were tested by MALDI–TOF.

Halophilic bacteria. In addition, we used new culture conditions to culture halophilic prokaryotes. The culture enrichment and isolation procedures for the culture of halophilic prokaryotes were performed in a Columbia broth medium (Sigma-Aldrich), modified by adding (per litre): MgCl2·6H2O, 5 g; MgSO4·7H2O, 5 g; KCl, 2 g; CaCl2·2H2O, 1 g; NaBr, 0.5 g; NaHCO3, 0.5 g and 2 g of glucose. The pH was adjusted to 7.5 with 10 M NaOH before autoclaving. All additives were purchased from Sigma-Aldrich. Four concentrations of NaCl were used (100 g l−1, 150 g l−1, 200 g l−1 and 250 g l−1).
A total of 215 different stool samples were tested. One gram of each stool specimen was inoculated aerobically into 100 ml of liquid medium in flasks at 37 °C while stirring at 150 r.p.m. Subcultures were inoculated after 3, 10, 15 and 30 incubation days for each culture condition. Serial dilutions from 10−1 to 10−10 were then performed in the culture medium and then plated on agar medium. Negative controls (no inoculation of the culture medium) were included for each culture condition. After three days of incubation at 37 °C, different types of colonies appeared: yellow, cream, white and clear. Red and pink colonies began to appear after the 15th day. All colonies were picked and re-streaked several times to obtain pure cultures, which were subcultured on a solid medium consisting of Columbia agar medium (Sigma-Aldrich) and NaCl. The negative controls remained sterile in all culture conditions, supporting the authenticity of our data.

Detection of microcolonies. Finally, we began to focus on microcolonies detected using a magnifying glass (Leica). These microcolonies, which were not visualized with the naked eye and ranged from 100 to 300 µm, did not allow direct identification by MALDI–TOF. We subcultured these bacteria in a liquid medium (Columbia broth, Sigma-Aldrich) to allow identification by MALDI–TOF after centrifugation. Ten stool samples were inoculated and then observed using this magnifying glass for this project, generating the 9,620 colonies tested.

Duodenum and other gut samples. Most of the study was designed to explore the gut microbiota using stool samples. Nevertheless, as the small intestine microbiota are located where the nutrients are digested24, which means there are greater difficulties in accessing samples than when using stool specimens, we analysed different levels of sampling, including duodenum samples (Supplementary Table 1). First, we tested five duodenum samples previously frozen at −80 °C. A total of 25,000 colonies were tested by MALDI–TOF. In addition, we tested samples from the different gut levels (gastric, duodenum, ileum and left and right colon) of other patients. We tested 25,048 colonies by MALDI–TOF for this project. We tested 15 culture conditions, including pre-incubation in blood culture bottles with sterile rumen fluid and sheep blood (BD Bactec Plus Lytic/10 Anaerobic), 5% sheep blood agar (bioMérieux), and incubation in both microaerophilic and anaerobic conditions, R-medium23 and Pylori agar (bioMérieux). Overall, we tested 50,048 colonies by MALDI–TOF for this project.

Archaea. The culture of methanogenic archaea is a fastidious process, and the necessary equipment for this purpose is expensive and reserved for specialized laboratories. With this technique, we isolated seven methanogenic archaea through culturomic studies as previously described25–27. In addition, we propose here an affordable alternative that does not require specific equipment17. Indeed, a simple double culture aerobic chamber separated by a microfilter (0.2 μm) was used to grow two types of microorganism that develop in perfect symbiosis. A pure culture of Bacteroides thetaiotaomicron was placed in the bottom chamber to produce the hydrogen necessary for the growth of the methanogenic archaea, which was trapped in the upper chamber. A culture of Methanobrevibacter smithii or other hydrogenotrophic methanogenic archaea had previously been placed in the chamber. In the case presented here, the methanogenic archaea were grown aerobically on an agar medium supplemented with three antioxidants (ascorbic acid, glutathione and uric acid) and without the addition of any external gas. We subsequently cultured four other methanogenic archaeal species for the first time aerobically, and successfully isolated 13 strains of M. smithii and 9 strains of Methanobrevibacter oralis from 100 stools and 45 oral samples. This medium allows aerobic isolation and antibiotic susceptibility testing. This change allows the routine study of methanogens, which have been neglected in clinical microbiology laboratories and may be useful for biogas production. Finally, to culture halophilic archaea, we designed specific culture conditions (described in the ‘Halophilic bacteria’ section).

Identification methods. The colonies were identified using MALDI–TOF MS. Each deposit was covered with 2 ml of a matrix solution (saturated α-cyano-4-hydroxycinnamic acid in 50% acetonitrile and 2.5% trifluoroacetic acid). This analysis was performed using a Microflex LT system (Bruker Daltonics). For each spectrum, a maximum of 100 peaks was used and these peaks were compared with those of previous samples in the computer database of the Bruker Base and our homemade database, including the spectra of the bacterial species identified in previous works28,29. An isolate was labelled as correctly identified at the species level when at least one of the colonies’ spectra had a score ≥1.9 and another of the colonies’ spectra had a score ≥1.7 (refs 28,29).
Protein profiles are regularly updated based on the results of clinical diagnoses and on new species providing new spectra. If, after three attempts, the species could not be accurately identified by MALDI–TOF, the isolate was identified by 16S rRNA sequencing as previously described. A threshold similarity value of >98.7% was chosen for identification at the species level. Below this value, a new species was suspected, and the isolate was described using taxonogenomics30.

Classification of the prokaryotes species cultured. We used our own online prokaryotic repertoire13 (http://hpr.mediterranee-infection.com/arkotheque/client/ihu_bacteries/recherche/index.php) to classify all isolated prokaryotes into four categories: new prokaryote species, previously known prokaryote species in the human gut, known species from the environment but first isolated in humans, and known species from humans but first isolated in the human gut. Briefly, to complete the recent work identifying all the prokaryotes isolated in humans13, we examined methods by conducting a literature search, which included PubMed and books on infectious diseases. We examined the Medical Subject Headings (MeSH) indexing provided by Medline for bacteria isolated from the human gut and we then established two different queries to automatically obtain all articles indexed by Medline dealing with human gut isolation sites. These queries were applied to all bacterial species previously isolated from humans as previously described, and we obtained one or more articles for each species, confirming that the bacterium had been isolated from the human gut13.

International deposition of the strains, 16S rRNA accession numbers and genome sequencing accession number. Most of the strains isolated in this study were deposited in CSUR (WDCM 875) and are easily available at http://www.mediterranee-infection.com/article.php?laref=14&titre=collection-de-souches&PHPSESSID=cncregk417fl97gheb8k7u7t07 (Supplementary Tables 4a and b). All the new prokaryote species were deposited into two international collections: CSUR and DSMZ (Supplementary Table 5). Importantly, among the 247 new prokaryotes species (197 in the present study and 50 in previous studies), we failed to subculture 9 species that were not deposited, of which 5 were nevertheless genome sequenced. Apart from these species, all CSUR accession numbers are available in Supplementary Table 5. Among these viable new species, 189 already have a DSMZ number. For the other 49 species, the accession number is not yet assigned but the strain is deposited. The 16S rRNA accession numbers of the 247 new prokaryotes species are available in Supplementary Table 5, along with the accession number of the known species needing 16S rRNA amplification and sequencing for identification (Supplementary Table 14). Finally, the 168 draft genomes used for our analysis have already been deposited with an available GenBank accession number (Supplementary Table 5) and all other genome sequencing is still in progress, as the culturomics are still running in our laboratory.

New prokaryotes. All new prokaryote species have been or will be comprehensively described by taxonogenomics, including their metabolic properties, MALDI–TOF spectra and genome sequencing30. Among these 247 new prokaryote species, 95 have already been published (PMID available in Supplementary Table 5), including 70 full descriptions and 25 ‘new species announcements’. In addition, 20 are under


review and the 132 others are ongoing (Supplementary Table 5). This includes 37 bacterial species already officially recognized (as detailed in Supplementary Table 5). All were sequenced successively with a paired-end strategy for high-throughput pyrosequencing on the 454-Titanium instrument from 2011 to 2013 and using MiSeq Technology (Illumina) with the mate pair strategy since 2013.

Metagenome sequencing. Total DNA was extracted from the samples using a method modified from the Qiagen stool procedure (QIAamp DNA Stool Mini Kit). For the first 24 metagenomes, we used GS FLX Titanium (Roche Applied Science). Primers were designed to produce an amplicon length (576 bp) that was approximately equivalent to the average length of reads produced by GS FLX Titanium (Roche Applied Science), as previously described. The primer pairs commonly used for gut microbiota were assessed in silico for sensitivity to sequences from all phyla of bacteria in the complete Ribosomal Database Project (RDP) database. Based on this assessment, the bacterial primers 917F and 1391R were selected. The V6 region of 16S rRNA was pyrosequenced with unidirectional sequencing from the forward primer with one-half of a GS FLX Titanium PicoTiterPlate Kit 70×75 per patient with the GS Titanium Sequencing Kit XLR70 after clonal amplification with the GS FLX Titanium LV emPCR Kit (Lib-L).
Sixty other metagenomes were sequenced for 16S rRNA sequencing using MiSeq technology. PCR-amplified templates of genomic DNA were produced using the surrounding conserved regions’ V3–V4 primers with overhang adapters (FwOvAd_341F TCGTCGGCAGCGTCAGATGTGTATAAGAGACAGCCTACGGGNGGCWGCAG; ReOvAd_785R GTCTCGTGGGCTCGGAGATGTGTATAAGAGACAGGACTACHVGGGTATCTAATCC). Samples were amplified individually for the 16S V3–V4 regions by Phusion High Fidelity DNA Polymerase (Thermo Fisher Scientific) and visualized on the Caliper Labchip II device (Illumina) by a DNA 1K LabChip at 561 bp. Phusion High Fidelity DNA Polymerase was chosen for PCR amplifications in this biodiversity approach and deep sequencing: a thermostable DNA polymerase characterized by the greatest accuracy, robust reactions and high tolerance for inhibitors, and finally by an error rate that is approximately 50-fold lower than that of Taq DNA polymerase and sixfold lower than that of Pfu DNA polymerase. After purification on Ampure beads (Thermo Fisher Scientific), the concentrations were measured using high-sensitivity Qubit technology (Thermo Fisher Scientific). Using a subsequent limited-cycle PCR on 1 ng of each PCR product, Illumina sequencing adapters and dual-index barcodes were added to each amplicon. After purification on Ampure beads, the libraries were then normalized according to the Nextera XT (Illumina) protocol. The 96 multiplexed samples were pooled into a single library for sequencing on the MiSeq. The pooled library containing indexed amplicons was loaded onto the reagent cartridge and then onto the instrument along with the flow cell. Automated cluster generation and paired-end sequencing with dual index reads of 2 × 250 bp were performed in a single 39-hour run. On the instrument, the global cluster density and the global passed filter per flow cell were generated. The MiSeq Reporter software (Illumina) determined the percentage indexed and the clusters passing the filter for each amplicon or library. The raw data were configured in fasta files for R1 and R2 reads.

Genome sequencing. The genomes were sequenced using, successively, two high-throughput NGS technologies: Roche 454 and MiSeq Technology (Illumina) with paired-end application. Each project on the 454 sequencing technology was loaded on a quarter region of the GS Titanium PicoTiterPlate and sequenced with the GS FLX Titanium Sequencer (Roche). For the construction of the 454 library, 5 μg DNA was mechanically fragmented on the Covaris device (KBioScience-LGC Genomics) through miniTUBE-Red 5Kb. The DNA fragmentation was visualized through the Agilent 2100 BioAnalyser on a DNA LabChip7500. Circularization and fragmentation were performed on 100 ng. The library was then quantified on Quant-it Ribogreen kit (Invitrogen) using a Genios Tecan fluorometer. The library was clonally amplified at 0.5 and 1 cpb in 2 emPCR reactions according to the conditions for the GS Titanium SV emPCR Kit (Lib-L) v2 (Roche). These two enriched clonal amplifications were loaded onto the GS Titanium PicoTiterPlates and sequenced with the GS Titanium Sequencing Kit XLR70. The run was performed overnight and then analysed on the cluster through gsRunBrowser and gsAssembler_Roche. Sequences obtained with Roche were assembled on gsAssembler with 90% identity and 40 bp of overlap. The library for Illumina was prepared using the Mate Pair technology. To improve the assembly, the second application was sometimes performed with paired ends. The paired-end and the mate-pair strategies were barcoded in order to be mixed, respectively, with 11 other genomic projects prepared with the Nextera XT DNA sample prep kit (Illumina) and 11 other projects with the Nextera Mate Pair sample prep kit (Illumina). The DNA was quantified by a Qubit assay with high-sensitivity kit (Life Technologies). In the first approach, the mate pair library was prepared with 1.5 µg genomic DNA using the Nextera mate pair Illumina guide. The genomic DNA sample was simultaneously fragmented and tagged with a mate-pair junction adapter. The profile of the fragmentation was validated on an Agilent 2100 Bioanalyzer (Agilent Technologies) with a DNA 7500 LabChip. The DNA fragments, which ranged in size, had an optimal size of 5 kb. No size selection was performed, and 600 ng of ‘tagmented’ fragments measured on the Qubit assay with the high-sensitivity kit were circularized. The circularized DNA was mechanically sheared to small fragments, with optimal fragments being 700 bp, on a Covaris S2 device in microtubes. The library profile was visualized on a High Sensitivity Bioanalyzer LabChip (Agilent Technologies). The libraries were normalized at 2 nM and pooled. After a denaturation step and dilution at 15 pM, the pool of libraries was loaded onto the reagent cartridge and then onto the instrument along with the flow cell. To prepare the paired-end library, 1 ng of genome as input was required. DNA was fragmented and tagged during the tagmentation step, with an optimal size distribution at 1 kb. Limited-cycle PCR amplification (12 cycles) completed the tag adapters and introduced dual-index barcodes. After purification on Ampure XP beads (Beckman Coulter), the library was normalized and loaded onto the reagent cartridge and then onto the instrument along with the flow cell. For the 2 Illumina applications, automated cluster generation and paired-end sequencing with index reads of 2 × 250 bp were performed in single 39-hour runs.

ORFans identification. Open reading frames (ORFs) were predicted using Prodigal with default parameters for each of the bacterial genomes. However, the predicted ORFs were excluded if they spanned a sequencing gap region. The predicted bacterial sequences were searched against the non-redundant protein sequence (NR) database (59,642,736 sequences, available from NCBI in 2015) using BLASTP. ORFans were identified if their BLASTP E-value was lower than 1e-03 for an alignment length greater than 80 amino acids. We used an E-value of 1e-05 if the alignment length was <80 amino acids. These threshold parameters have been used in previous studies to define ORFans (refs 12–14). The 168 genomes considered in this study are listed in Supplementary Table 7. These genomes represent 615.99 Mb and contain a total of 19,980 ORFans. Some of the ORFans from 30 genomes were calculated in a previous study4 with the non-redundant protein sequence database containing 14,124,377 sequences available from NCBI in June 2011.

Metagenomic 16S sequences. We collected 325 runs of metagenomic 16S rRNA sequences available in the HMP data sets that correspond to stool samples from healthy human subjects. All samples were submitted to Illumina deep sequencing, resulting in 761,123 Mo per sample on average, and a total of 5,970,465 high-quality sequencing reads after trimming. These trimmed data sets were filtered using CLC Genomics Workbench 7.5, and reads shorter than 100 bp were discarded. We performed an alignment of 247 16S rRNA sequences against the 5,577,630 reads remaining using BLASTN. We used a 1e-03 e-value, 100% coverage and 98.7% cutoff, corresponding to the threshold for defining a species, as previously described. Finally, we reported the total number of aligned reads for each 16S rRNA sequence (Supplementary Table 8).
We collected the sequences of the 3,871,657 gene non-redundant gene catalogue from the 396 human gut microbiome samples (https://www.cbs.dtu.dk/projects/CAG/)15. We performed an alignment of 247 16S rRNA sequences against the 3,871,657 gene non-redundant gene catalogue using BLASTN with a threshold of 1e-03 e-value, 100% coverage and 98.7% cutoff. The new species identified in these data are reported in Supplementary Table 9. We collected the raw data sets of 239 runs deposited at EBI (ERP012217)16. We used the PEAR software (PMID 24142950) for merging raw Illumina paired-end reads using default parameters. We performed an alignment of 247 16S rRNA sequences against the 265,864,518 merged reads using BLASTN. We used a 1e-03 e-value, 100% coverage and 98.7% cutoff. The list of the new species identified in these data is included in Supplementary Table 9.

Whole metagenomic shotgun sequences. We collected the contigs/scaffolds from the assembly of 148 runs available in the HMP data sets. The initial reads of these samples were assembled using SOAPdenovo v.1.04 (PMID 23587118). These assemblies correspond to stool samples from healthy human subjects and generated 13,984,809 contigs/scaffolds with a minimum length of 200 bp and a maximum length of 371,412 bp. We aligned the 19,980 ORFans found previously against these data sets using BLASTN. We used a 1e-05 e-value, 80% coverage and 80% identity cutoff. Finally, we reported the total number of unique aligned ORFans for each species (Supplementary Table 8).

Study of the gaps in metagenomics. The raw fastq files of paired-end reads from an Illumina Miseq of 84 metagenomes analysed concomitantly by culturomics were filtered and analysed in the following steps (accession no. PRJEB13171).

Data processing: filtering the reads, dereplication and clustering. The paired-end reads of the corresponding raw fastq files were assembled into contigs using Pandaseq31. The high-quality sequences were then selected for the next steps of analysis by considering only those sequences that contained both primers (forward and reverse). In the following filtering steps, the sequences containing N were removed. Sequences with length shorter than 200 nt were removed, and sequences longer than 500 nt were trimmed. Both forward and reverse primers were also removed from each of the sequences. An additional filtering step was applied to remove the chimaeric sequences using UCHIME (ref. 32) of USEARCH (ref. 33). The filtering steps were performed using the QIIME pipeline34. Strict dereplication (clustering of duplicate sequences) was performed on the filtered sequences, and they were then sorted by decreasing number of abundance35–37. For each metagenome, the clustering of OTUs was performed with 97% identity. Total OTUs from the 84 metagenomes (Supplementary Table 10) clustered with 93% identity.
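
The read-filtering and dereplication steps described under "Data processing" can be illustrated with a minimal Python sketch. It is not the QIIME/USEARCH pipeline used by the authors: the function names `clean_read` and `dereplicate` are hypothetical, primers are matched exactly (no degenerate bases), chimaera removal is omitted, and the order in which primers are removed and long reads are trimmed is an assumption, since the text does not specify it.

```python
from collections import Counter


def clean_read(seq, fwd_primer, rev_primer_rc, min_len=200, max_len=500):
    """Apply the read filters in order; return the cleaned insert or None.

    rev_primer_rc is the reverse primer as it appears on the read
    (already reverse-complemented); exact matching is an assumption.
    """
    seq = seq.upper()
    start = seq.find(fwd_primer)
    end = seq.rfind(rev_primer_rc)
    # Keep only reads that contain both primers, forward before reverse.
    if start == -1 or end == -1 or end < start + len(fwd_primer):
        return None
    # Discard reads that contain ambiguous bases.
    if "N" in seq:
        return None
    # Discard reads shorter than the minimum length.
    if len(seq) < min_len:
        return None
    # Remove both primers, then trim overly long sequences (order assumed).
    insert = seq[start + len(fwd_primer):end]
    return insert[:max_len]


def dereplicate(reads, fwd_primer, rev_primer_rc, **limits):
    """Collapse identical cleaned reads and sort by decreasing abundance."""
    cleaned = (clean_read(r, fwd_primer, rev_primer_rc, **limits) for r in reads)
    counts = Counter(s for s in cleaned if s)
    return counts.most_common()
```

With the thresholds stated in the text, the defaults `min_len=200` and `max_len=500` apply; smaller values are only useful for toy inputs. The sorted list of (sequence, count) pairs is the kind of input that a 97% identity OTU clustering step would then consume.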



Building reference databases. We downloaded the Silva SSU and LSU database1 and release 123 from the Silva website and, from this, a local database of predicted amplicon sequences was built by extracting the sequences containing both primers. Finally, we had our local reference database containing a total of 536,714 well-annotated sequences separated into two subdatabases according to their gut or non-gut origin. We created four other databases containing 16S rRNA of new species sequences and species isolated by culturomics separated into three groups (human gut, non-human gut, and human not reported in gut). The new species database contains 247 sequences, the human gut species database 374 sequences, the non-human gut species database 256 sequences and the human species not reported in gut database 237 sequences.

Taxonomic assignments. For taxonomic assignments, we applied at least 20 reads per OTU. The OTUs were then searched against each database using BLASTN (ref. 38). The best match of ≥97% identity and 100% coverage for each of the OTUs was extracted from the reference database, and taxonomy was assigned up to the species level. Finally, we counted the number of OTUs assigned to unique species.

Data availability. The GenBank accession numbers for the sequences of the 16S rRNA genes of the new bacterial species as well as their accession numbers in both Collection de Souches de l’Unité des Rickettsies (CSUR, WDCM 875) and the Deutsche Sammlung von Mikroorganismen und Zellkulturen (DSMZ) are listed in Supplementary Table 5. Sequencing metagenomics data have been deposited in NCBI under Bioproject PRJEB13171.

Received 20 April 2016; accepted 14 September 2016; published 7 November 2016

References
1. Lagier, J. C., Million, M., Hugon, P., Armougom, F. & Raoult, D. Human gut microbiota: repertoire and variations. Front. Cell Infect. Microbiol. 2, 136 (2012).
2. Lagier, J. C. et al. The rebirth of culture in microbiology through the example of culturomics to study human gut microbiota. Clin Microbiol Rev. 28, 237–264 (2015).
3. Pfleiderer, A. et al. Culturomics identified 11 new bacterial species from a single anorexia nervosa stool sample. Eur. J. Clin. Microbiol. Infect. Dis. 32, 1471–1481 (2013).
4. Lagier, J. C. et al. Microbial culturomics: paradigm shift in the human gut microbiome study. Clin. Microbiol. Infect. 18, 1185–1193 (2012).
5. Dubourg, G. et al. Culturomics and pyrosequencing evidence of the reduction in gut microbiota diversity in patients with broad-spectrum antibiotics. Int. J. Antimicrob. Agents 44, 117–124 (2014).
6. Ley, R. E., Turnbaugh, P. J., Klein, S. & Gordon, J. I. Microbial ecology: human gut microbes associated with obesity. Nature 444, 1022–1023 (2006).
7. Ley, R. E. et al. Obesity alters gut microbial ecology. Proc. Natl Acad. Sci. USA 102, 11070–11075 (2005).
8. Gill, S. R. et al. Metagenomic analysis of the human distal gut microbiome. Science 312, 1355–1359 (2006).
9. Rinke, C. et al. Insights into the phylogeny and coding potential of microbial dark matter. Nature 499, 431–437 (2013).
10. Lagier, J. C. et al. Current and past strategies for bacterial culture in clinical microbiology. Clin. Microbiol. Rev. 28, 208–236 (2015).
11. Vetizou, M. et al. Anticancer immunotherapy by CTLA-4 blockade relies on the gut microbiota. Science 350, 1079–1084 (2015).
12. Cassir, N. et al. Clostridium butyricum strains and dysbiosis linked to necrotizing enterocolitis in preterm neonates. Clin. Infect. Dis. 61, 1107–1115.
13. Hugon, P. et al. A comprehensive repertoire of prokaryotic species identified in human beings. Lancet Infect. Dis. 15, 1211–1219 (2015).
14. The Human Microbiome Project Consortium. A framework for human microbiome research. Nature 486, 215–221 (2012).
15. Browne, H. P. et al. Culturing of ‘unculturable’ human microbiota reveals novel taxa and extensive sporulation. Nature 533, 543–546 (2016).
16. Nielsen, H. B. et al. Identification and assembly of genomes and genetic elements in complex metagenomic samples without using reference genomes. Nat. Biotechnol. 32, 822–828 (2014).
17. Khelaifia, S. et al. Aerobic culture of methanogenic archaea without an external source of hydrogen. Eur. J. Clin. Microbiol. Infect. Dis. 35, 985–991 (2016).
18. Rettedal, E. A., Gumpert, H. & Sommer, M. O. Cultivation-based multiplex phenotyping of human gut microbiota allows targeted recovery of previously uncultured bacteria. Nat. Commun. 5, 4714 (2014).
19. Hiergeist, A., Gläsner, J., Reischl, U. & Gessner, A. Analyses of intestinal microbiota: culture versus sequencing. ILAR J. 56, 228–240 (2015).
20. Rajilic-Stojanovic, M. & de Vos, W. M. The first 1000 cultured species of the human gastrointestinal microbiota. FEMS Microbiol. Rev. 38, 996–1047 (2014).
21. Byrd, A. L. & Segre, J. A. Infectious disease. Adapting Koch’s postulates. Science 351, 224–226 (2016).
22. Samb-Ba, B. et al. MALDI–TOF identification of the human gut microbiome in people with and without diarrhea in Senegal. PLoS ONE 9, e87419 (2014).
23. Dione, N., Khelaifia, S., La Scola, B., Lagier, J.C. & Raoult D. A quasi-universal medium to break the aerobic/anaerobic bacterial culture dichotomy in clinical microbiology. Clin. Microbiol. Infect. 22, 53–58 (2016).
24. Raoult, D. & Henrissat, B. Are stool samples suitable for studying the link between gut microbiota and obesity? Eur. J. Epidemiol. 29, 307–309 (2014).
25. Khelaifia, S., Raoult, D. & Drancourt, M. A versatile medium for cultivating methanogenic archaea. PLoS ONE 8, e61563 (2013).
26. Khelaifia, S. et al. Draft genome sequence of a human-associated isolate of Methanobrevibacter arboriphilicus, the lowest-G+C-content archaeon. Genome Announc. 2, e01181 (2014).
27. Dridi, B., Fardeau, M.-L., Ollivier, B., Raoult, D. & Drancourt, M. Methanomassiliicoccus luminyensis gen. nov., sp. nov., a methanogenic archaeon isolated from human faeces. Int. J. Syst. Evol. Microbiol. 62, 1902–1907 (2012).
28. Seng, P. et al. Identification of rare pathogenic bacteria in a clinical microbiology laboratory: impact of matrix-assisted laser desorption ionization-time of flight mass spectrometry. J. Clin. Microbiol. 51, 2182–2194 (2013).
29. Seng, P. et al. Ongoing revolution in bacteriology: routine identification of bacteria by matrix-assisted laser desorption ionization time-of-flight mass spectrometry. Clin. Infect. Dis. 49, 543–551 (2009).
30. Ramasamy, D. et al. A polyphasic strategy incorporating genomic data for the taxonomic description of novel bacterial species. Int. J. Syst. Evol. Microbiol. 64, 384–391 (2014).
31. Masella, A. P., Bartram, A. K., Truszkowski, J. M., Brown, D. G. & Neufeld, J. D. PANDAseq: paired-end assembler for Illumina sequences. BMC Bioinformatics 13, 31 (2012).
32. Edgar, R. C., Haas, B. J., Clemente, J. C., Quince, C. & Knight, R. UCHIME improves sensitivity and speed of chimera detection. Bioinformatics 27, 2194–2200 (2011).
33. Edgar, R. C. Search and clustering orders of magnitude faster than BLAST. Bioinformatics 26, 2460–2461 (2010).
34. Caporaso, J. G. et al. QIIME allows analysis of high-throughput community sequencing data. Nat. Methods 7, 335–336 (2010).
35. Stoeck, T. et al. Massively parallel tag sequencing reveals the complexity of anaerobic marine protistan communities. BMC Biol. 7, 72–77 (2009).
36. Mondani, L. et al. Microbacterium lemovicicum sp. nov., a bacterium isolated from a natural uranium-rich soil. Int. J. Syst. Evol. Microbiol. 63, 2600–2606 (2013).
37. Boissiere, A. et al. Midgut microbiota of the malaria mosquito vector Anopheles gambiae and interactions with Plasmodium falciparum infection. PLoS Pathog. 8, e1002742 (2012).
38. Altschul, S. F., Gish, W., Miller, W., Myers, E. W., Lipman, D. J. Basic local alignment search tool. J. Mol. Biol. 215, 403–410 (1990).

Acknowledgements
The authors thank R. Valero, A.A. Jiman-Fatani, B. Ali Diallo, J.-B. Lekana-Douki, B. Senghor, A. Derand, L. Gandois, F. Tanguy, S. Strouk, C. Tamet, F. Lunet, M. Kaddouri, L. Ayoub, L. Frégère, N. Garrigou, A. Pfleiderer, A. Farina and V. Ligonnet for technical support. This work was funded by IHU Méditerranée Infection as a part of a Foundation Louis D grant and by the Deanship of Scientific Research (DSR), King Abdulaziz University, under grant no. 1–141/1433 HiCi.

Author contributions
D.R. conceived and designed the experiments. J.-C.L., S.K., M.T.A., S.N., N.D., P.H., A.C., F.C., S.I.T., E.H.S., G.Dub., G.Dur., G.M., E.G., A.T., S.B., D.B., N.C., F.B., J.D., M.Ma., D.R., M.B., N.P.M.D.N., N.M.D.B., C.V., D.M., K.D., M.Mi., C.R., J.M.R., B.L.S., P.-E.F. and A.L. performed the experiments. D.M., J.A., E.I.A., F.B., M.Y., A.D., C.S., F.D. and V.V. contributed materials/analysis tools. J.-C.L., A.C., A.L. and D.R. analysed the data. J.-C.L., A.L. and D.R. wrote the manuscript. All authors read and approved the final manuscript.

Additional information
Supplementary information is available for this paper. Reprints and permissions information is available at www.nature.com/reprints. Correspondence and requests for materials should be addressed to D.R.

Competing interests
The authors declare no competing financial interests.

This work is licensed under a Creative Commons Attribution 4.0 International License. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in the credit line; if the material is not included under the Creative Commons license, users will need to obtain permission from the license holder to reproduce the material. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/



ANNEX II

Genome study of Haloferax massiliensis

Foreword

My work focused on the genomic part of the description of this new species. By applying the culturomics concept and using culture conditions with a high salt concentration, we isolated the first halophilic archaea found to colonize the human gut. The aim of my work was to analyze the genome, annotate it and carry out a comparative genomic analysis. To estimate the mean level of nucleotide sequence similarity at the genome level between the strain studied and the four closest species with an available genome, we used the Average Genomic Identity of Orthologous gene Sequences (AGIOS). This pipeline detects orthologous proteins between genomes compared pairwise, retrieves the corresponding genes and then determines the mean percentage of nucleotide sequence identity between the orthologous ORFs. The genome of H. massiliensis was locally aligned pairwise, and DNA-DNA hybridization values were estimated by genome-to-genome sequence comparison (GGDC). H. massiliensis was isolated from the human gut as part of a culturomics study aiming to expand the repertoire of microorganisms colonizing the human gut.
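The AGIOS calculation described above can be sketched as follows. This is a minimal illustration, not the laboratory pipeline: the detection of orthologs (done with Proteinortho in the manuscript) is assumed to have been carried out already, the function names `global_identity` and `agios`, the scoring values and the toy sequences are all hypothetical, and identity is counted here as identical columns over the full alignment length, which is only one possible convention.

```python
from statistics import mean


def global_identity(a: str, b: str, match: int = 1, mismatch: int = -1, gap: int = -1) -> float:
    """Percent identity of a Needleman-Wunsch global alignment of a and b.

    Scoring values are illustrative assumptions, not those of the pipeline.
    """
    n, m = len(a), len(b)
    # score[i][j] holds the best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back one optimal path, counting identical columns and all columns.
    i, j, identical, columns = n, m, 0, 0
    while i > 0 or j > 0:
        sub = match if (i > 0 and j > 0 and a[i - 1] == b[j - 1]) else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            identical += a[i - 1] == b[j - 1]
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
        columns += 1
    return 100.0 * identical / columns if columns else 0.0


def agios(ortholog_pairs) -> float:
    """Mean percent nucleotide identity over pairs of orthologous gene sequences."""
    return mean(global_identity(x, y) for x, y in ortholog_pairs)


# Toy ortholog pairs (hypothetical sequences, for illustration only).
toy_pairs = [("ATGGCC", "ATGGCC"), ("ATGAAA", "ATGAAT")]
print(f"AGIOS = {agios(toy_pairs):.2f}%")
```

For real genes, a dedicated alignment library would be preferable to this quadratic pure-Python implementation; the sketch only makes the averaging step explicit.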

This work has been submitted to the journal Archaea.


ARTICLE 4

Genome sequence and description of Haloferax massiliensis sp. nov., a new halophilic archaea isolated from the human gut

Saber Khelaifia, Aurelia Caputo, Claudia Andrieu, Frederique Cadoret, Nicholas Armstrong, Caroline Michelle, Jean-Christophe Lagier, Felix Djossou, Pierre-Edouard Fournier and Didier Raoult

Extremophiles

Draft Manuscript for Review

Genome sequence and description of Haloferax massiliensis sp. nov., a new halophilic archaea isolated from the human gut

Journal: Extremophiles

Manuscript ID EXT-17-Nov-0224

Manuscript Type: Original Paper

Date Submitted by the Author: 14-Nov-2017

Complete List of Authors:
Khelaifia, Saber; Unite de recherche sur les maladies infectieuses et tropicales emergentes, IHU-Méditerranée Infection
Caputo, Aurelia; IHU Mediterranee Infection
Andrieu, Claudia; IHU Mediterranee Infection
Cadoret, Frédéric; Unite de recherche sur les maladies infectieuses et tropicales emergentes
Armstrong, Nicholas; IHU Mediterranee Infection, Mass Spectrometry
Michelle, Caroline; Unite de recherche sur les maladies infectieuses et tropicales emergentes
Lagier, Jean-Christophe; Unite de recherche sur les maladies infectieuses et tropicales emergentes
Djossou, Félix; Infectious and Tropical Diseases Department, Centre Hospitalier Andrée-Rosemon, Cayenne, French Guiana
Fournier, Pierre-Edouard; Unite de recherche sur les maladies infectieuses et tropicales emergentes
Raoult, Didier; Unite de recherche sur les maladies infectieuses et tropicales emergentes

Keyword: Culturomics, taxono-genomics, Halophilic archaea, Haloferax massiliensis.


Genome sequence and description of Haloferax massiliensis sp. nov., a new halophilic archaea isolated from the human gut

Running title: Haloferax massiliensis sp. nov.

Saber Khelaifia1*, Aurelia Caputo1, Claudia Andrieu1, Frederique Cadoret1, Nicholas Armstrong1, Caroline Michelle1, Jean-Christophe Lagier1, Felix Djossou2, Pierre-Edouard Fournier1 and Didier Raoult1,3

1 Unité de Recherche sur les Maladies Infectieuses et Tropicales Emergentes, CNRS (UMR 7278), IRD (198), INSERM (U1095), AMU (UM63), Institut Hospitalo-Universitaire Méditerranée-Infection, Aix-Marseille Université, 13385 Marseille Cedex 5.

2 Infectious and Tropical Diseases Department, Centre Hospitalier Andrée-Rosemon, Cayenne, French Guiana.

3 Special Infectious Agents Unit, King Fahd Medical Research Center, King Abdulaziz University, Jeddah, Saudi Arabia

*Corresponding author: Dr. Saber Khelaifia
Institut Hospitalo-Universitaire Méditerranée-Infection, Aix-Marseille Université, 19-21 Boulevard Jean Moulin, 13385 Marseille Cedex 5, France. Email: [email protected]

Keywords: Culturomics, taxono-genomics, Halophilic archaea, Haloferax massiliensis.


ABSTRACT

By applying the culturomics concept and using culture conditions containing a high salt concentration, we isolated the first known halophilic archaeon colonizing the human gut. Here we describe its phenotypic and biochemical characterization as well as its genome annotation. Strain Arc-HrT (= CSUR P0974 = CECT 9307) was mesophilic and grew optimally at 37°C and pH 7. Strain Arc-HrT was also extremely halophilic, with optimal growth observed at 15% NaCl. Cells were Gram-negative cocci; the strain was strictly aerobic, non-motile and non-spore-forming, and exhibited catalase and oxidase activities. The 4,015,175 bp long genome exhibits a G+C content of 65.36% and contains 3,911 protein-coding and 64 predicted RNA genes. PCR-based identification of the 16S rRNA gene of strain Arc-HrT yielded a 99.1% sequence similarity with Haloferax prahovense, the phylogenetically closest validated species in the Haloferax genus. The DDH was 50.70% ± 5.2 with H. prahovense, 53.70% ± 2.69 with H. volcanii, 50.90% ± 2.64 with H. alexandrinus, 52.90% ± 2.67 with H. gibbonsii and 54.30% ± 2.70 with H. lucentense. The data presented herein confirm strain Arc-HrT as a unique species, and consequently we propose its classification as representative of a novel species belonging to the genus Haloferax, as Haloferax massiliensis sp. nov.


INTRODUCTION

The human intestinal microbiota is a complex ecosystem consisting of a wide diversity of organisms including bacteria (Lagier et al. 2012), archaea (Khelaifia et al. 2013), and unicellular eukaryotes (Nam 2008). The culturomics concept, recently introduced in our laboratory to study prokaryotic diversity in the human gut (Lagier et al. 2012), allowed the isolation of a large diversity of halophilic bacteria including several new species (Lagier et al. 2016). Among the diverse culture conditions and culture media used by culturomics to isolate new prokaryotes, some conditions specifically targeting extremophilic organisms were also used (Lagier et al. 2016). Indeed, culture media containing a high salt concentration are essentially used to select halophilic organisms, including halophilic bacteria and archaea.

Currently, the determination of the affiliation of a new prokaryote is based on the 16S rDNA sequence, G+C content (%) and DNA-DNA hybridization (DDH). This approach is limited because of the very low cutoff between species and genera (Welker & Moore 2011). In some cases, 16S rRNA gene sequence comparison has been shown to discriminate poorly between species belonging to the same genus and remains ineffective (Stackebrandt & Ebers 2006). Recently, we proposed a polyphasic approach based on phenotypic and biochemical characterization, MALDI-TOF MS spectrum, and total genome sequencing and annotation to better define and classify new taxa (Ramasamy et al. 2014).

Using culturomics techniques to isolate halophilic prokaryotes colonizing the human gut (Lagier et al. 2016), we isolated strain Arc-HrT from a stool specimen of a 22-year-old obese Amazonian female patient, and present the characteristics enabling the classification of Haloferax massiliensis strain Arc-HrT as a new species of the Haloferax genus. The Haloferax genus was first described by Torreblanca et al. (1986) and currently includes 12 species with validly published names. Members of the Haloferax genus are essentially extremely halophilic archaea that require high salt concentrations for growth and inhabit hypersaline environments such as the Dead Sea and the Great Salt Lake. They are classified in the family Halobacteriaceae within the Euryarchaeota phylum and the various species constitute 22 recognized genera (Grant et al. 2001).

In this study, we present a classification and a set of characteristics for Haloferax massiliensis sp. nov., strain Arc-HrT (= CSUR P0974 = CECT 9307), together with its complete genome sequencing and annotation.


MATERIALS AND METHODS

Ethics and samples collection

The stool specimens were collected from a 22-year-old obese Amazonian female patient after defecation in sterile plastic containers, sampled and stored at -80°C until use. Informed and signed consent was obtained from the patient. The study and the assent procedure were approved by the Ethics Committee of the IHU Méditerranée Infection (Faculty of Medicine, Marseille, France), under agreement number 09-022. The salt concentration of the stool specimen was measured with a digital refractometer (Fisher Scientific, Illkirch, France) and the pH was measured using a pH meter.

Isolation of the strain

Strain Arc-HrT was isolated in December 2013 by aerobic culture of the stool specimen in a homemade culture medium consisting of Columbia broth (Sigma-Aldrich, Saint-Quentin-Fallavier, France) modified by the addition of (per liter): MgCl2·6H2O, 15 g; MgSO4·7H2O, 20 g; KCl, 4 g; CaCl2·2H2O, 2 g; NaBr, 0.5 g; NaHCO3, 0.5 g; glucose, 2 g; and NaCl, 150 g. The pH was adjusted to 7.5 with 10 M NaOH before autoclaving. Approximately 1 g of stool specimen was inoculated into 100 mL of this liquid medium in a flask incubated aerobically at 37°C with stirring at 150 rpm. Subcultures were performed after ten, fifteen, twenty and thirty days of incubation. Then, serial dilutions of 10^-1 to 10^-10 were performed in the homemade liquid culture medium and plated onto agar plates consisting of the previously detailed liquid medium with 1.5% agar.

Strain identification by MALDI-TOF MS and 16S rRNA gene sequencing

MALDI-TOF MS protein analysis was carried out as previously described (Seng et al. 2009). The resulting twelve spectra of strain Arc-HrT were imported into the MALDI BioTyper software (version 2.0, Bruker) and analyzed by standard pattern matching (with default parameter settings) against the main spectra of halophilic and methanogenic archaea, including the spectra from Haloferax alexandrinus, Methanobrevibacter smithii, Methanobrevibacter oralis, Methanobrevibacter arboriphilicus, and Methanomassilicoccus massiliensis. The 16S rRNA gene amplification by PCR and sequencing were performed as previously described (Lepp et al. 2004). The phylogenetic tree was constructed according to the method described by Elsawi et al. (2017).

Growth conditions

The optimum growth temperature of strain Arc-HrT was tested on the solid medium by inoculating 10^5 CFU/mL of an exponentially growing culture incubated aerobically at 28, 37, 45, and 55°C. Growth atmosphere was tested under aerobic atmosphere, in the presence of 5% CO2, and also in microaerophilic and anaerobic atmospheres created using GENbag microaer and GENbag anaer (bioMérieux, Marcy l’Etoile, France), respectively. The optimum NaCl concentration required for growth was tested at 0, 1, 5, 7.5, 10, 15, 20, 25, and 30% NaCl. The optimum pH was determined by growth testing at pH 5, 6, 7, 7.5, 8 and 9.

Biochemical, sporulation and motility assays

To characterize the biochemical properties of strain Arc-HrT, we used the commercially available API ZYM, API 20 NE and API 50 CH strips (bioMérieux), supplemented with 15% NaCl (w/v) and 30 g/L of MgSO4, according to the manufacturer’s instructions. The sporulation test was done by thermal shock at 80°C for 20 min followed by subculturing on the solid medium.
The motility of strain Arc-HrT was assessed by observing a fresh culture under a DM1000 photonic microscope (Leica Microsystems, Nanterre, France) with a 100X oil-immersion objective lens. The colonies’ surface was observed on the agar culture medium after 3 days of incubation under aerobic conditions at 37°C.

Antibiotic susceptibility testing

Susceptibility of strain Arc-HrT to antibiotics was tested using antibiotic disks (B. Braun Medical SAS, Boulogne, France) containing the following antibiotics: Fosfomycin 50 µg, Doxycycline 30 IU, Rifampicin 30 µg, Vancomycin 30 µg, Amoxicillin 20 µg, Erythromycin 15 IU, Ampicillin 25, Cefoxitin 30 µg, Colistin 50 µg, Tobramycin 10 µg, Gentamicin 500 µg, Penicillin G 10 IU, Trimethoprim 1.25 µg / Sulfamethoxazole 23.75 µg, Oxacillin 5 µg, Imipenem 10 µg and Metronidazole 4 µg.

Microscopy and Gram test

Cells were fixed with 2.5% glutaraldehyde in 0.1 M cacodylate buffer for at least 1 h at 4°C. A drop of cell suspension was deposited for approximately 5 minutes on glow-discharged formvar carbon film on 400 mesh nickel grids (FCF400-Ni, EMS). The grids were dried on blotting paper and cells were negatively stained for 10 s with 1% ammonium molybdate solution in filtered water at RT. Electron micrographs were acquired with a Morgagni 268D (Philips) transmission electron microscope operated at 80 keV. The Gram stain was performed using the Color Gram 2 kit (bioMérieux) and observed using a DM1000 photonic microscope (Leica Microsystems).

Fatty acid methyl ester (FAME) analysis by GC/MS

Cellular fatty acid methyl ester (FAME) analysis was performed by GC/MS. Three samples were prepared with approximately 80 mg of bacterial biomass per tube harvested from several culture plates. Fatty acid methyl esters were prepared as described by Sasser (2006). GC/MS analyses were carried out as described before (Dione et al. 2016).
Briefly, fatty acid methyl esters were separated using an Elite 5-MS column and monitored by mass spectrometry (Clarus 500 SQ 8 S, Perkin Elmer, Courtaboeuf, France). Spectral database search was performed using MS Search 2.0 operated with the Standard Reference Database 1A (NIST, Gaithersburg, USA) and the FAMEs mass spectral database (Wiley, Chichester, UK).

DNA extraction and genome sequencing

After scraping 5 Petri dishes in 1 mL of TE buffer, the genomic DNA (gDNA) of strain Arc-HrT was extracted from 200 µL of the bacterial suspension after a classical lysis treatment with lysozyme at a final concentration of 40 mg/mL for 2 h at 37°C, followed by incubation for 1 h at 37°C in 1% SDS (final concentration) and 30 µL of RNase. Proteinase K treatment was carried out at 37°C. After three phenol extractions and alcohol precipitation, the sample was eluted in a minimal volume of 50 µL of EB buffer. DNA was quantified by a Qubit assay with the high sensitivity kit (Life Technologies, Carlsbad, CA, USA) at 14 ng/µL.

gDNA was sequenced on the MiSeq platform (Illumina Inc, San Diego, CA, USA) with the mate pair strategy. The gDNA was barcoded in order to be mixed with 11 other projects with the Nextera Mate Pair sample prep kit (Illumina). The mate pair library was prepared with 1 µg of gDNA using the Nextera mate pair Illumina guide. The gDNA sample was simultaneously fragmented and tagged with a mate pair junction adapter. The fragmentation pattern was validated on an Agilent 2100 BioAnalyzer (Agilent Technologies Inc, Santa Clara, CA, USA) with a DNA 7500 labchip. The DNA fragments ranged in size from 1 kb up to 11 kb with an optimal size at 4 kb. No size selection was performed and 373.3 ng of tagmented fragments were circularized.
The circularized DNA was mechanically sheared to small fragments with an optimal size at 480 bp on the Covaris device S2 in microtubes (Covaris, Woburn, MA, USA). The library profile was visualized on a High Sensitivity Bioanalyzer LabChip (Agilent Technologies Inc, Santa Clara, CA, USA) and the final library concentration was measured at 21.79 nmol/L. The libraries were normalized at 2 nM and pooled. After a denaturation step and dilution at 15 pM, the pool of libraries was loaded onto the reagent cartridge and then onto the instrument along with the flow cell. Automated cluster generation and sequencing were performed in a single 39-hour run in a 2 × 251 bp format. Total information of 10.6 Gb was obtained from a 1,326 K/mm² cluster density with 99.1% of clusters passing quality control filters (20,978,044 pass-filter clusters). Within this run, the index representation for strain Arc-HrT was determined to be 6.22%. The 1,303,974 paired reads were filtered according to read quality, trimmed and then assembled.

Genome assembly

Illumina reads were trimmed using Trimmomatic (Lohse et al. 2012), then assembled using SPAdes software (Nurk et al. 2013; Bankevich et al. 2012). The contigs obtained were combined using SSPACE (Boetzer et al. 2011) and Opera software (Gao et al. 2011), helped by GapFiller (Boetzer et al. 2012), to reduce the set. Some manual refinements using CLC Genomics v7 software (CLC bio, Aarhus, Denmark) and homemade tools in Python improved the genome. Finally, the draft genome of strain Arc-HrT consisted of 8 contigs.

Genome annotation and comparison

Non-coding genes and miscellaneous features were predicted using RNAmmer (Lagesen et al. 2007), ARAGORN (Laslett et al. 2004), Rfam (Griffiths-Jones et al. 2003), PFAM (Punta et al. 2012), and Infernal (Nawrocki et al. 2009). Coding DNA sequences (CDSs) were predicted using Prodigal (Hyatt 2010) and functional annotation was achieved using BLAST+ (Camacho et al. 2009) and HMMER3 (Eddy 2011) against the UniProtKB database (The UniProt Consortium 2011).
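As a rough cross-check of the run figures reported in the sequencing paragraph above, the number of read pairs expected from the index representation and the raw sequencing depth can be recomputed with simple arithmetic. This is a sketch under stated assumptions: the genome size is taken as the 4,015,175 bp reported for the draft genome, the depth is a raw upper bound computed before any quality filtering or trimming, and the helper names `expected_pairs` and `raw_coverage` are hypothetical.

```python
# Figures quoted in the sequencing paragraph.
PASS_FILTER_CLUSTERS = 20_978_044  # pass-filter clusters for the whole run
INDEX_SHARE = 0.0622               # share of the run assigned to strain Arc-HrT
PAIRED_READS = 1_303_974           # paired reads reported for the strain
READ_LENGTH_BP = 251               # 2 x 251 bp run

# Assumption: draft genome size as reported elsewhere in the manuscript.
GENOME_SIZE_BP = 4_015_175


def expected_pairs(clusters: int, share: float) -> int:
    """Read pairs expected for one index, given its share of the clusters."""
    return round(clusters * share)


def raw_coverage(pairs: int, read_len: int, genome_size: int) -> float:
    """Raw depth: two reads per pair, before any trimming or filtering."""
    return 2 * pairs * read_len / genome_size


estimate = expected_pairs(PASS_FILTER_CLUSTERS, INDEX_SHARE)
depth = raw_coverage(PAIRED_READS, READ_LENGTH_BP, GENOME_SIZE_BP)
print(f"pairs expected from index share: {estimate}")
print(f"pairs reported: {PAIRED_READS}")
print(f"raw depth: about {depth:.0f}x")
```

The small gap between the estimated and reported pair counts is compatible with the index share being rounded to two decimals; the effective depth after trimming would be lower than the raw value.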

A brief genomic comparison was also made between strain Arc-HrT (CSTE00000000), Haloferax alexandrinus strain Arc-Hv (CCDK00000000), Haloferax gibbonsii strain ARA6 (CP011947), Haloferax lucentense strain DSM 14919 (AOLH00000000), Haloferax volcanii strain DS2 (CP001956) and Haloferax prahovense strain DSM 18310 (AOLG00000000). To estimate the mean level of nucleotide sequence similarity at the genome level between strain Arc-HrT and the four closest species with an available genome, we used the Average Genomic Identity of Orthologous gene Sequences (AGIOS), computed with an in-house pipeline. Briefly, this pipeline uses the Proteinortho software (Lechner et al. 2010), with the following parameters: e-value 1e-5, 30% identity, 50% coverage and algebraic connectivity of 50%, to detect orthologous proteins between genomes compared pairwise, then retrieves the corresponding genes and determines the mean percentage of nucleotide sequence identity between orthologous ORFs using the Needleman-Wunsch global alignment algorithm (Ramasamy et al. 2014). The genome of strain Arc-HrT was locally aligned pairwise using the BLAT algorithm (Kent et al. 2002; Auch et al. 2010) against each of the selected genomes cited above, and DNA-DNA hybridization (DDH) values were estimated by genome-to-genome sequence comparison (Auch et al. 2010).

RESULTS

Strain identification and phylogenetic analysis

Using MALDI-TOF MS identification, no significant score allowing a correct identification was obtained for strain Arc-HrT against our database (the Bruker database is constantly incremented with URMITE data), suggesting that our isolate did not belong to any known species; consequently, the spectra from strain Arc-HrT were added to our database (http://www.mediterranee-infection.com/article.php?laref=256&titre=urms) (Figure 1).
PCR-based identification of the 16S rRNA gene of strain Arc-HrT (HG964472) exhibited a 99.2% sequence similarity with Haloferax prahovense JCM 13924 (NR_113446), the phylogenetically closest validated species with standing in nomenclature (Figure 2). As 16S rRNA gene sequence comparison has been proven to discriminate poorly between Haloferax species, we sequenced the complete genome of strain Arc-HrT, and a digital DNA-DNA hybridization (dDDH) was made with four of the closest Haloferax species (see the part on genome comparison). These data confirmed strain Arc-HrT as a unique species. Finally, the gel view showed the protein spectral differences with other members of the genus Haloferax (Figure 3).

Phenotypic and biochemical characteristics

Colonies of strain Arc-HrT were circular, red, shiny and smooth with a diameter of 0.5-1 mm. Cells were Gram-negative cocci, non-motile and non-spore-forming, generally occurred singly or in pairs and had a mean diameter of 0.9 µm (Figure 4). Strain Arc-HrT was mesophilic and grew at temperatures ranging from 25 to 45°C, with an optimum at 37°C. NaCl was required for growth and the strain grew at salinities ranging from 10 to 25% NaCl with an optimum at 15%. The optimum pH for growth was 7 (range pH 6.5 to 8). The strain was strictly aerobic and grew in the presence of 5% CO2; no growth was observed in microaerophilic or anaerobic conditions. Principal features are presented in Table 1.

Strain Arc-HrT exhibited positive catalase and oxidase activities. Using an API ZYM strip, positive reactions were observed for alkaline phosphatase, acid phosphatase, esterase (C4), esterase lipase (C8), leucine arylamidase, naphthol-AS-BI-phosphohydrolase and β-glucuronidase, and negative reactions were observed for lipase (C14), valine arylamidase, trypsin, α-chymotrypsin, β-galactosidase, N-acetyl-β-glucosaminidase, α-galactosidase, α-glucosidase, β-glucosidase, α-fucosidase and α-mannosidase.
An API 50 CH strip showed positive reactions for glycerol, D-fructose, L-rhamnose, potassium 2-ketogluconate and potassium 5-ketogluconate, and negative reactions for arbutin, salicin, D-maltose, D-sucrose, D-raffinose, erythritol, D-ribose, D-xylose, L-xylose, D-adonitol, methyl-β-D-xylopyranoside, D-glucose, D-galactose, D-lactose, L-sorbose, dulcitol, inositol, D-mannitol, D-sorbitol, methyl-α-D-mannopyranoside, methyl-α-D-glucopyranoside, D-cellobiose, D-melibiose, D-trehalose, D-melezitose, starch, glycogen, xylitol, gentiobiose, D-turanose, D-lyxose, D-tagatose, D-fucose, L-fucose, D-arabitol, L-arabitol, and potassium gluconate. The phenotypic characteristics of strain Arc-HrT were compared with those of the most closely related species (Table 2).

Antimicrobial susceptibility testing demonstrated that strain Arc-HrT was susceptible to Rifampicin and Trimethoprim/Sulfamethoxazole, and resistant to Fosfomycin, Doxycycline, Vancomycin, Amoxicillin, Erythromycin, Ampicillin, Cefoxitin, Colistin, Tobramycin, Gentamicin, Penicillin G, Oxacillin, Imipenem and Metronidazole.

The only fatty acid detected was 3-methylbutanoic acid (5:0 iso), a branched short-chain fatty acid (SCFA). Phenylacetic acid, also known as an antifungal agent (Ryan et al. 2009), was also detected using this technique.

Genome sequencing information and annotation

The genome of strain Arc-HrT was sequenced as part of a culturomics study aiming at isolating all prokaryotic species colonizing the human gut (Lagier et al. 2016) and because of its phylogenetic affiliation to the Haloferax genus. Strain Arc-HrT represents the 13th genome sequenced in the Haloferax genus. The draft genome of strain Arc-HrT contains 4,015,175 bp with a G+C content of 65.36% and consists of 8 contigs without gaps (Figure 5). The genome was shown to encode at least 64 predicted RNAs, including 3 rRNAs, 57 tRNAs and 4 miscellaneous RNAs, and 3,911 protein-coding genes. Among these genes, 490 (13%) were found to be putative proteins and 291 (8%) were assigned as hypothetical proteins. Moreover, 2,335 genes matched at least one sequence in the Clusters of Orthologous Groups (COGs) database (Tatusov et al. 2000; Tatusov et al. 1997) with BLASTP default parameters. Table 3 shows the detailed project information and its association with MIGS version 2.0 compliance. The properties and the statistics of the genome are summarized in Table 4. The distribution of genes into COGs functional categories is presented in Table 5.
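The genome size, contig count and G+C content quoted above follow directly from the contig sequences of a draft assembly. A minimal sketch of that computation is given below, using toy sequences rather than the contigs of strain Arc-HrT; `genome_stats` is a hypothetical helper, and ambiguous bases (such as N) are assumed to count toward the length but not toward G+C.

```python
def genome_stats(contigs):
    """Return contig count, total length (bp) and G+C content (%) of an assembly."""
    total = sum(len(seq) for seq in contigs)
    gc = sum(seq.upper().count("G") + seq.upper().count("C") for seq in contigs)
    return {
        "contigs": len(contigs),
        "size_bp": total,
        # Guard against an empty assembly to avoid a division by zero.
        "gc_percent": 100.0 * gc / total if total else 0.0,
    }


# Toy contigs (hypothetical), mixing upper and lower case on purpose.
toy_contigs = ["GGCCAT", "gcgc"]
print(genome_stats(toy_contigs))
```

In practice the contigs would be read from the assembly FASTA file before being passed to such a function.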


Genome comparison

The draft genome of strain Arc-HrT is larger than that of H. prahovense, H. alexandrinus, H. gibbonsii, H. lucentense and H. volcanii (4.35, 4, 3.9, 3.62, 2.95 and 2.85 Mb, respectively). The G+C content of strain Arc-HrT is lower than that of H. alexandrinus, H. lucentense, H. volcanii and H. gibbonsii (65.36, 66, 66.4, 66.6 and 67.1%, respectively) and also lower than that of H. prahovense (65.7%). The gene content of strain Arc-HrT is larger than that of H. alexandrinus, H. prahovense, H. lucentense, H. gibbonsii and H. volcanii (3,911, 3,770, 3,766, 3,593, 2,997 and 2,917, respectively).

The distribution of genes into COG categories was identical (Figure 6) in all compared genomes.

The Average Genomic Identity of Orthologous gene Sequences (AGIOS) analysis shows that strain Arc-HrT shared 2,690, 2,353, 2,958, 2,975 and 2,459 orthologous genes with H. lucentense, H. volcanii, H. prahovense, H. alexandrinus and H. gibbonsii, respectively (Table 6). Among the compared species, excluding strain Arc-HrT, AGIOS values ranged from 92.08% to 98.83%. AGIOS values between strain Arc-HrT and the compared species were in the same range (from 92.24% with H. alexandrinus to 93.29% with H. volcanii). Among the compared species, excluding strain Arc-HrT, the DDH values ranged from 50.70% to 82.20%. Between strain Arc-HrT and the compared species, the DDH values ranged from 50.70% with H. prahovense to 54.30% with H. lucentense; these values were lower than the 70% cutoff (Meier-Kolthoff et al. 2013) (Table 7).

DISCUSSION

Here, we describe the genome sequence and most of the biochemical characteristics of the first isolate of Haloferax massiliensis sp. nov., an extremely halophilic archaeon isolated from the human gut.
Halophilic organisms are generally known to colonize hypersaline environments where the salt concentration is close to saturation, such as salt lakes and salt marshes (Oren et al. 1994). Here, using a culture medium containing a high salt concentration, we successfully isolated strain Arc-HrT, belonging to the Haloferax genus within the Halobacteriaceae family. This strain represents the first halophilic archaeon isolated from the human gut. Recently, DNA sequences belonging to some halophilic archaea frequently present or abundant in extreme environments, including some members of the Halobacteriaceae family, were detected by PCR in the human gastrointestinal tract (Oxley et al. 2010). Bacterial halophilism has become a subject of considerable interest for microbiologists and molecular biologists during the past twenty years, because halophiles develop on salty foods (Fukushima et al. 2007). Indeed, these organisms have also been detected in refined salt (Diop et al. 2016) as well as in food products where salt is used in large quantities for preservation, such as salted fish, pork ham, sausages and fish sauces (Tanasupawat et al. 2009; Kim et al. 2010). Additionally, the restriction of these organisms to extreme environments has recently been contested after their detection in habitats with relatively low salinity, suggesting an ability to adapt and survive in more moderate environments (Purdy et al. 2004).

This work does not intend to demonstrate a medical or biotechnological interest of strain Arc-HrT; its only aim is to expand knowledge about the human microbiota by isolating all the prokaryotes that colonize the human digestive tract (Lagier et al. 2016).

CONCLUSION

Based on the characteristics reported here and the phylogenetic affiliation of strain Arc-HrT, we propose the creation of Haloferax massiliensis sp. nov. as a new species belonging to the Haloferax genus, with strain Arc-HrT as its type strain. Haloferax massiliensis sp. nov. (= CSUR P0974 = CECT 9307), described here, was isolated from the human gut as part of a culturomics study aiming at expanding the repertoire of microorganisms colonizing the human gut.

Description of Haloferax massiliensis sp. nov.

Haloferax massiliensis (mas.si.li.en’sis, L. fem. adj., massiliensis of Massilia, the Roman name of Marseille, France, where the type strain was isolated).

Haloferax massiliensis strain Arc-HrT is a strictly aerobic, Gram-negative coccus, non-motile and non-spore-forming. The mean cell diameter is 0.9 µm. Optimal growth was observed at 37°C, pH 7 and 15% NaCl. Colonies are red, smooth and shiny, and measure 0.5-1 mm. Strain Arc-HrT exhibits positive catalase and oxidase activities.

Using API strips, positive reactions were observed for alkaline phosphatase, acid phosphatase, esterase (C4), esterase lipase (C8), leucine arylamidase, naphthol-AS-BI-phosphohydrolase, β-glucuronidase, glycerol, D-fructose, L-rhamnose, potassium 2-ketogluconate and potassium 5-ketogluconate. Strain Arc-HrT was susceptible to Rifampicin and Trimethoprim/Sulfamethoxazole. The only fatty acid detected is 3-methylbutanoic acid (5:0 iso), a branched short-chain fatty acid (SCFA). The genome of Haloferax massiliensis is 4,349,774 bp long and exhibits a G+C content of 65.36%. The 16S rRNA and genome sequences are deposited in EMBL-EBI under accession numbers HG964472 and CSTE00000000, respectively. The type strain Arc-HrT (= CSUR P0974 = CECT 9307) was isolated from a stool specimen of a 22-year-old obese Amazonian female patient as part of a culturomics study.

ABBREVIATIONS

FAME: Fatty Acid Methyl Ester

GC/MS: Gas Chromatography/Mass Spectrometry

CSUR: Collection de Souches de l'Unité des Rickettsies

CECT: Colección Española de Cultivos Tipo

MALDI-TOF MS: Matrix-assisted laser desorption/ionization time-of-flight mass spectrometry

URMITE: Unité de Recherche sur les Maladies Infectieuses et Tropicales Emergentes

IU: International Unit

CONFLICT OF INTEREST STATEMENT

The authors declare that there is no conflict of interest.

ACKNOWLEDGEMENTS

The authors thank Magdalen Lardière for English reviewing. This study was funded by the "Fondation Méditerranée Infection".

REFERENCES

Asker D, Ohta Y (2002) Haloferax alexandrinus sp. nov., an extremely halophilic canthaxanthin-producing archaeon from a solar saltern in Alexandria (Egypt). Int J Syst Evol Microbiol 52(Pt 3):729-38.

Auch AF, von Jan M, Klenk HP, Göker M (2010) Digital DNA-DNA hybridization for microbial species delineation by means of genome-to-genome sequence comparison. Stand Genomic Sci 2(1):117-34.

Bankevich A, Nurk S, Antipov D, Gurevich AA, Dvorkin M, Kulikov AS, Lesin VM, Nikolenko SI, Pham S, Prjibelski AD, Pyshkin AV, Sirotkin AV, Vyahhi N, Tesler G, Alekseyev MA, Pevzner PA (2012) SPAdes: a new genome assembly algorithm and its applications to single-cell sequencing. J Comput Biol 19(5):455-77.

Boetzer M, Henkel CV, Jansen HJ, Butler D, Pirovano W (2011) Scaffolding pre-assembled contigs using SSPACE. Bioinformatics 27(4):578-9.

Boetzer M, Pirovano W (2012) Toward almost closed genomes with GapFiller. Genome Biol 13:R56.

Camacho C, Coulouris G, Avagyan V, Ma N, Papadopoulos J, Bealer K, Madden TL (2009) BLAST+: architecture and applications. BMC Bioinformatics 10:421.

Cavalier-Smith T (2002) The neomuran origin of archaebacteria, the negibacterial root of the universal tree and bacterial megaclassification. Int J Syst Evol Microbiol 52:7-76.

Dione N, Sankar SA, Lagier JC, Khelaifia S, Michele C, Armstrong N, Richez M, Abrahão J, Raoult D, Fournier PE (2016) Genome sequence and description of Anaerosalibacter massiliensis sp. nov. New Microbes and New Infections 10:66-76.

Diop A, Khelaifia S, Armstrong N, Labas N, Fournier PE, Raoult D, Million M (2016) Microbial culturomics unravels the halophilic microbiota repertoire of table salt: description of Gracilibacillus massiliensis sp. nov. Microb Ecol Health Dis 18(27):32049.

Eddy SR (2011) Accelerated profile HMM searches. PLoS Comp Biol 7(10):e1002195.

Fukushima T, Usami R, Kamekura M (2007) A traditional Japanese-style salt field is a niche for haloarchaeal strains that can survive in 0.5% salt solution. Saline Systems 3(1):1.

Elsawi Z, Togo AH, Beye M, Dubourg G, Andrieu C, Armstrong N, Richez M, di Pinto F, Bittar F, Labas N, Fournier PE, Raoult D, Khelaifia S (2017) Hugonella massiliensis gen. nov., sp. nov., genome sequence, and description of a new strictly anaerobic bacterium isolated from the human gut. Microbiologyopen doi: 10.1002/mbo3.458.

Enache M, Itoh T, Kamekura M, Teodosiu G, Dumitru L (2007) Haloferax prahovense sp. nov., an extremely halophilic archaeon isolated from a Romanian salt lake. Int J Syst Evol Microbiol 57(2):393-7.
Field D, Garrity G, Gray T, Morrison N, Selengut J, Sterk P, Tatusova T, Thomson N, Allen MJ, Angiuoli SV, Ashburner M, Axelrod N, Baldauf S, Ballard S, Boore J, Cochrane G, Cole J, Dawyndt P, De Vos P, DePamphilis C, Edwards R, Faruque N, Feldman R, Gilbert J, Gilna P, Glöckner FO, Goldstein P, Guralnick R, Haft D, Hancock D, Hermjakob H, Hertz-Fowler C, Hugenholtz P, Joint I, Kagan L, Kane M, Kennedy J, Kowalchuk G, Kottmann R, Kolker E, Kravitz S, Kyrpides N, Leebens-Mack J, Lewis SE, Li K, Lister AL, Lord P, Maltsev N, Markowitz V, Martiny J, Methe B, Mizrachi I, Moxon R, Nelson K, Parkhill J, Proctor L, White O, Sansone SA, Spiers A, Stevens R, Swift P, Taylor C, Tateno Y, Tett A, Turner S, Ussery D, Vaughan B, Ward N, Whetzel T, San Gil I, Wilson G, Wipat A (2008) The minimum information about a genome sequence (MIGS) specification. Nat Biotechnol 26(5):541-7.

Juez G, Rodriguez-Valera F, Ventosa A, Kushner DJ (1986) Haloarcula hispanica spec. nov. and Haloferax gibbonsii spec. nov., two new species of extremely halophilic archaebacteria. Syst Appl Microbiol 8:75-79.

Garrity GM, Holt JG (2001) Phylum AII. Euryarchaeota phy. nov. In: Bergey's Manual of Systematic Bacteriology, 2nd ed., vol. 1 (The Archaea and the deeply branching and phototrophic Bacteria) (D.R. Boone and R.W. Castenholz, eds.), Springer-Verlag, New York p. 211.

Gao S, Sung WK, Nagarajan N (2011) Opera: reconstructing optimal genomic scaffolds with high-throughput paired-end sequences. Journal of Computational Biology 18(11):1681-91.

Grant WD, Kamekura M, McGenity TJ, Ventosa A (2001) Class III. Halobacteria class. nov. In: Bergey's Manual of Systematic Bacteriology, 2nd ed., vol. 1 (The Archaea and the deeply branching and phototrophic Bacteria) (D.R. Boone and R.W. Castenholz, eds.), Springer-Verlag, New York p. 294.

Grant WD, Kamekura M, McGenity TJ, Ventosa A (2001) The order Halobacteriales. In Bergey's Manual of Systematic Bacteriology, 2nd edn, vol. 1, pp. 294–334. Edited by D. R. Boone & R. W. Castenholz. New York: Springer.

Grant WD, Larsen H (1989) Extremely halophilic archaeobacteria. Order Halobacteriales ord. nov. In Bergey's Manual of Systematic Bacteriology, vol. 3, pp. 2216-2233. Edited by N. Pfennig. Baltimore: Williams & Wilkins.

Griffiths-Jones S, Bateman A, Marshall M, Khanna A, Eddy SR (2003) Rfam: an RNA family database. Nucleic Acids Res 31:439-441.

Gupta RS, Naushad S, Baker S (2015) Phylogenomic analyses and molecular signatures for the class Halobacteria and its two major clades: a proposal for division of the class Halobacteria into an emended order Halobacteriales and two new orders, Haloferacales ord. nov. and Natrialbales ord. nov., containing the novel families Haloferacaceae fam. nov. and Natrialbaceae fam. nov. Int J Syst Evol Microbiol 65:1050-1069.

Gutierrez MC, Kamekura M, Holmes ML, Dyall-Smith ML, Ventosa A (2002) Taxonomic characterization of Haloferax sp. (H. alicantei) strain Aa 2.2: description of Haloferax lucentensis sp. nov. Extremophiles 6:479-483.

Hyatt D, Chen GL, Locascio PF, Land ML, Larimer FW, Hauser LJ (2010) Prodigal: prokaryotic gene recognition and translation initiation site identification. BMC Bioinformatics 11:119.

Kamekura M, Mizuki T, Usami R, Yoshida Y, Horikoshi K, Vreeland RH (2004) The potential use of signature bases from 16S rRNA gene sequences to aid the assignment of microbial strains to genera of halobacteria. In Halophilic Microorganisms, pp. 77–100. Edited by A. Ventosa. Berlin: Springer.

Kent WJ (2002) BLAT: the BLAST-like alignment tool. Genome Res 12(4):656-64.

Khelaifia S, Raoult D, Drancourt M (2013) A versatile medium for cultivating methanogenic archaea. PLoS One 8(4):e61563.

Kim MS, Roh SW, Bae JW (2010) Halomonas jeotgali sp. nov., a new moderate halophilic bacterium isolated from a traditional fermented seafood. J Microbiol 48(3):404-10.

Lagesen K, Hallin P, Rodland EA, Staerfeldt HH, Rognes T, Ussery DW (2007) RNAmmer: consistent and rapid annotation of ribosomal RNA genes. Nucleic Acids Res 35:3100-3108.

Lagier JC, Armougom F, Million M, Hugon P, Pagnier I, Robert C, Bittar F, Fournous G, Gimenez G, Maraninchi M, Trape JF, Koonin EV, La Scola B, Raoult D (2012) Microbial culturomics: paradigm shift in the human gut microbiome study. Clin Microbiol Infect 18(12):1185-93.

Lagier JC, Khelaifia S, Alou MT, Ndongo S, Dione N, Hugon P, Caputo A, Cadoret F, Traore SI, Seck EH, Dubourg G, Durand G, Mourembou G, Guilhot E, Togo A, Bellali S, Bachar D, Cassir N, Bittar F, Delerce J, Mailhe M, Ricaboni D, Bilen M, Dangui Nieko NP, Dia Badiane NM, Valles C, Mouelhi D, Diop K, Million M, Musso D, Abrahão J, Azhar EI, Bibi F, Yasir M, Diallo A, Sokhna C, Djossou F, Vitton V, Robert C, Rolain JM, La Scola B, Fournier PE, Levasseur A, Raoult D (2016) Culture of previously uncultured members of the human gut microbiota by culturomics. Nat Microbiol 7(1):16203.

Laslett D, Canback B (2004) ARAGORN, a program to detect tRNA genes and tmRNA genes in nucleotide sequences. Nucleic Acids Res 32:11-16.

Lechner M, Findeiss S, Steiner L, Marz M, Stadler PF, Prohaska SJ (2011) Proteinortho: detection of (co-)orthologs in large-scale analysis. BMC Bioinformatics 12:124.

Lepp PW, Brinig MM, Ouverney CC, Palm K, Armitage GC, Relman DA (2004) Methanogenic Archaea and human periodontal disease. Proc Natl Acad Sci U S A 101(16):6176-81.

Lohse M, Bolger AM, Nagel A, Fernie AR, Lunn JE, Stitt M, Usadel B (2012) RobiNA: a user-friendly, integrated software solution for RNA-Seq-based transcriptomics. Nucleic Acids Res 40(Web Server issue):W622-7.

Meier-Kolthoff JP, Auch AF, Klenk HP, Göker M (2013) Genome sequence-based species delimitation with confidence intervals and improved distance functions. BMC Bioinformatics 14:60.

Nam YD, Chang HW, Kim KH, Roh SW, Kim MS, Jung MJ, Lee SW, Kim JY, Yoon JH, Bae JW (2008) Bacterial, archaeal, and eukaryal diversity in the intestines of Korean people. J Microbiol 46(5):491-501.

Nawrocki EP, Kolbe DL, Eddy SR (2009) Infernal 1.0: inference of RNA alignments. Bioinformatics 25(10):1335-7.

Nurk S, Bankevich A, Antipov D, Gurevich AA, Korobeynikov A, Lapidus A, Prjibelski AD, Pyshkin A, Sirotkin A, Sirotkin Y, Stepanauskas R, Clingenpeel SR, Woyke T, McLean JS, Lasken R, Tesler G, Alekseyev MA, Pevzner PA (2013) Assembling single-cell genomes and mini-metagenomes from chimeric MDA products. J Comput Biol 20(10):714-37.

Oren A (1994) The ecology of the extremely halophilic archaea. FEMS Microbiology Reviews 13(4):415.

Oxley AP, Lanfranconi MP, Würdemann D, Ott S, Schreiber S, McGenity TJ, Timmis KN, Nogales B (2010) Halophilic archaea in the human intestinal mucosa. Environ Microbiol 12(9):2398-2410.

Punta M, Coggill PC, Eberhardt RY, Mistry J, Tate J, Boursnell C, Pang N, Forslund K, Ceric G, Clements J, Heger A, Holm L, Sonnhammer EL, Eddy SR, Bateman A, Finn RD (2012) The Pfam protein families database. Nucleic Acids Res 40:D290-D301.

Purdy KJ, Cresswell-Maynard TD, Nedwell DB, McGenity TJ, Grant WD, Timmis KN, Embley TM (2004) Isolation of haloarchaea that grow at low salinities. Environ Microbiol 6(6):591-5.

Ryan LA, Dal Bello F, Czerny M, Koehler P, Arendt EK (2009) Quantification of phenyllactic acid in wheat sourdough using high resolution gas chromatography-mass spectrometry. J Agric Food Chem 57(3):1060-4.

Sasser M (2006) Bacterial identification by gas chromatographic analysis of Fatty Acids Methyl Esters (GC-FAME). MIDI Technical Note #101.

Seng P, Abat C, Rolain JM, Colson P, Lagier JC, Gouriet F, Fournier PE, Drancourt M, La Scola B, Raoult D (2013) Identification of rare pathogenic bacteria in a clinical microbiology laboratory: Impact of Matrix-Assisted Laser Desorption Ionization–Time of Flight Mass Spectrometry. J Clin Microbiol 51(7):2182-94.

Stackebrandt E, Ebers J (2006) Taxonomic parameters revisited: tarnished gold standards. Microbiol Today 33:152.

Tanasupawat S, Namwong S, Kudo T, Itoh T (2009) Identification of halophilic bacteria from fish sauce (nam-pla) in Thailand. Journal of Culture Collections (6):69-75.

Tatusov RL, Galperin MY, Natale DA, Koonin EV (2000) The COG database: a tool for genome-scale analysis of protein functions and evolution. Nucleic Acids Res 28:33-36.

Tatusov RL, Koonin EV, Lipman DJ (1997) A genomic perspective on protein families. Science 278:631-637.

The UniProt Consortium (2011) Ongoing and future developments at the Universal Protein Resource. Nucleic Acids Res 39:D214-D219.

Tindall BJ, Tomlinson GA, Hochstein LI (1989) Transfer of Halobacterium denitrificans (Tomlinson, Jahnke, and Hochstein) to the genus Haloferax as Haloferax denitrificans comb. nov. Int J Syst Bacteriol 39(3):359-60.

Torreblanca M, Rodriguez-Valera F, Juez G, Ventosa A, Kamekura M, Kates M (1986) Classification of non-alkaliphilic halobacteria based on numerical taxonomy and polar lipid composition, and description of Haloarcula gen. nov. and Haloferax gen. nov. Syst Appl Microbiol 8:89-99.

Welker M, Moore ER (2011) Applications of whole-cell matrix-assisted laser-desorption/ionization time-of-flight mass spectrometry in systematic microbiology. Syst Appl Microbiol 34(1):2-11.

Woese CR, Kandler O, Wheelis ML (1990) Towards a natural system of organisms: proposal for the domains Archaea, Bacteria, and Eucarya. Proc Natl Acad Sci USA 87(12):4576–9.

Table 1. Classification and general features of Haloferax massiliensis strain Arc-HrT according to the MIGS recommendations (Field 2008).

MIGS ID    Property                 Term                               Evidence code (a)
           Current classification   Domain: Archaea                    TAS (Woese 1990)
                                    Phylum: Euryarchaeota              TAS (Cavalier-Smith 2002; Garrity and Holt 2001)
                                    Class: Halobacteria                TAS (Grant 2001; Gupta 2015)
                                    Order: Haloferacales               TAS (Grant & Larsen 1989; Gupta 2015)
                                    Family: Haloferacaceae             TAS (Grant & Larsen 1990; Gupta 2015)
                                    Genus: Haloferax                   TAS (Torreblanca 1986)
                                    Species: Haloferax massiliensis    IDA
                                    Type strain: Arc-HrT               IDA
           Gram stain               Negative                           IDA
           Cell shape               Cocci                              IDA
           Motility                 Non-motile                         IDA
           Sporulation              Non-spore-forming                  IDA
           Temperature range        Mesophile                          IDA
           Optimum temperature      37°C                               IDA
           pH                       6.5 to 8
           Optimum pH               7
MIGS-6.3   Salinity                 10 to 25%                          IDA
           Optimum salinity         15% NaCl                           IDA
MIGS-22    Oxygen requirement       Strictly aerobic                   IDA
           Carbon source            Unknown                            IDA
           Energy source            Unknown                            IDA
MIGS-6     Habitat                  Human gut                          IDA
MIGS-15    Biotic relationship      Free living                        IDA
           Pathogenicity            Unknown                            NAS
           Biosafety level          2                                  IDA
MIGS-14    Isolation                Human feces                        IDA
MIGS-4     Geographic location      France                             IDA
MIGS-5     Sample collection time   December 2013                      IDA
MIGS-4.3   Depth                    Surface                            IDA
MIGS-4.4   Altitude                 0 m above sea level                IDA

(a) Evidence codes. IDA: Inferred from Direct Assay; TAS: Traceable Author Statement (i.e., a direct report exists in the literature); NAS: Non-traceable Author Statement (i.e., not directly observed for the living, isolated sample, but based on a generally accepted property for the species, or anecdotal evidence). These evidence codes are from http://www.geneontology.org/GO.evidence.shtml of the Gene Ontology project (Ashburner et al. 2000). If the evidence is IDA, then the property was directly observed for a live isolate by one of the authors or an expert mentioned in the acknowledgements.

Table 2. Differential characteristics of Haloferax massiliensis strain Arc-HrT, Haloferax prahovense (Enache et al. 2007), Haloferax volcanii (Torreblanca et al. 1986), Haloferax denitrificans (Tindall et al. 1989), Haloferax gibbonsii (Juez et al. 1986), Haloferax mediterranei (Torreblanca et al. 1986), Haloferax alexandrinus (Asker et al. 2002) and Haloferax lucentense (Gutierrez et al. 2002). na: No available data.

Columns: H. massiliensis, H. prahovense, H. volcanii, H. denitrificans, H. gibbonsii, H. mediterranei, H. alexandrinus, H. lucentense.

Properties (rows): oxygen requirement; Gram stain; salt requirement; motility; endospore formation; indole; Tween 80 hydrolysis. Production of: alkaline phosphatase; catalase; oxidase; nitrate reductase; urease; β-galactosidase; N-acetyl-glucosamine. Acid from: L-arabinose; ribose; mannose; mannitol; sucrose; D-glucose; D-fructose; D-maltose; D-lactose. Gelatin hydrolysis; starch hydrolysis; casein hydrolysis; habitat.

Habitat: H. massiliensis, human gut; H. prahovense, salt lake; H. volcanii, bottom sediment; H. denitrificans, solar saltern; H. gibbonsii, solar salterns; H. mediterranei, solar salt pond; H. alexandrinus, solar saltern; H. lucentense, water of a saltern.

[The +/−/na entries of the table body are not recoverable from the source.]

Table 3. Project information

MIGS ID     Property                     Term
MIGS-31     Finishing quality            High quality draft
MIGS-28     Libraries used               1 mate-paired
MIGS-29     Sequencing platforms         MiSeq Illumina
MIGS-31.2   Sequencing coverage          620
MIGS-30     Assemblers                   SPAdes
MIGS-32     Gene calling method          Prodigal
            Genbank ID                   CSTE000000001-CSTE000000008
            Genbank Date of Release      Apr, 2014
MIGS-13     Source material identifier   Arc-HrT
            Project relevance            Mar, 2014

Table 4. Nucleotide content and gene count levels of the genome

Attribute                         Value        % of Total*
Genome size (bp)                  4,015,175    100
DNA coding region (bp)            3,414,159    78.50
DNA G+C content (bp)              2,624,318    65.36
Total protein-coding genes        3,911        100
rRNA                              3            0.08
tRNA                              57           1.46
tmRNA                             0            0
miscRNA                           4            0.11
Genes with function prediction    2,825        72.23
Genes assigned to COGs            3,116        79.68

* The total is based on either the size of the genome in base pairs or the total number of protein-coding genes in the annotated genome.
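The rule stated in the footnote of Table 4 can be illustrated with a few lines of Python. This is a minimal sketch and not part of the original analysis: the helper name `percent_of_total` is hypothetical, and only rows whose denominator is unambiguous are used (genome size for the G+C content, number of protein-coding genes for the gene categories).

```python
# Values copied from Table 4; the helper name is illustrative only.
GENOME_SIZE_BP = 4_015_175
PROTEIN_CODING_GENES = 3_911


def percent_of_total(value: int, total: int) -> float:
    """Return value as a percentage of total, rounded to two decimals."""
    return round(100 * value / total, 2)


# Base-pair attributes are expressed relative to the genome size.
gc_percent = percent_of_total(2_624_318, GENOME_SIZE_BP)

# Gene-count attributes are expressed relative to the protein-coding genes.
function_percent = percent_of_total(2_825, PROTEIN_CODING_GENES)
trna_percent = percent_of_total(57, PROTEIN_CODING_GENES)

print(gc_percent, function_percent, trna_percent)
```

The two denominators are kept as separate constants so that it stays explicit which total each row of the table refers to.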

Table 5. Number of genes associated with the 25 general COG functional categories.

Code   Description                                                        Value   % of total
J      Translation, ribosomal structure and biogenesis                    165     4.22
A      RNA processing and modification                                    1       0.03
K      Transcription                                                      172     4.40
L      Replication, recombination and repair                              133     3.41
B      Chromatin structure and dynamics                                   6       0.16
D      Cell cycle control, cell division, chromosome partitioning         27      0.69
Y      Nuclear structure                                                  0       0.0
V      Defense mechanisms                                                 37      0.95
T      Signal transduction mechanisms                                     157     4.02
M      Cell wall/membrane biogenesis                                      119     3.05
N      Cell motility                                                      38      0.98
Z      Cytoskeleton                                                       0       0.0
W      Extracellular structures                                           0       0.0
U      Intracellular trafficking and secretion, and vesicular transport   35      0.90
O      Posttranslational modification, protein turnover, chaperones       113     2.89
C      Energy production and conversion                                   208     5.32
G      Carbohydrate transport and metabolism                              219     5.60
E      Amino acid transport and metabolism                                341     8.72
F      Nucleotide transport and metabolism                                75      1.92
H      Coenzyme transport and metabolism                                  154     3.94
I      Lipid transport and metabolism                                     79      2.02
P      Inorganic ion transport and metabolism                             202     5.17
Q      Secondary metabolites biosynthesis, transport and catabolism       54      1.39
R      General function prediction only                                   490     12.53
S      Function unknown                                                   291     7.45
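As a quick consistency check, the per-category counts of Table 5 can be summed and compared with the gene totals reported in Table 4. The sketch below is illustrative only: the variable names are assumptions, and reading "genes with function prediction" as "all COG-assigned genes except category S" is an observation drawn from the numbers, not a definition given by the authors.

```python
# Gene counts per COG category, copied from Table 5.
cog_counts = {
    "J": 165, "A": 1, "K": 172, "L": 133, "B": 6, "D": 27, "Y": 0,
    "V": 37, "T": 157, "M": 119, "N": 38, "Z": 0, "W": 0, "U": 35,
    "O": 113, "C": 208, "G": 219, "E": 341, "F": 75, "H": 154,
    "I": 79, "P": 202, "Q": 54, "R": 490, "S": 291,
}
TOTAL_PROTEIN_CODING_GENES = 3_911  # from Table 4

# Sum over all 25 categories: matches "Genes assigned to COGs" in Table 4.
assigned_to_cogs = sum(cog_counts.values())

# Excluding "Function unknown" (S): matches "Genes with function prediction".
with_function = assigned_to_cogs - cog_counts["S"]

# Category with the most genes, and the share of category J as in Table 5.
largest_category = max(cog_counts, key=cog_counts.get)
share_j = round(100 * cog_counts["J"] / TOTAL_PROTEIN_CODING_GENES, 2)

print(assigned_to_cogs, with_function, largest_category, share_j)
```

With these inputs the two sums reproduce the 3,116 and 2,825 genes listed in Table 4, which suggests the tables were derived from the same annotation.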

Table 6. Number of orthologous proteins shared between genomes (upper right), average percentage similarity of nucleotides corresponding to orthologous proteins shared between genomes (lower left) and number of proteins per genome (diagonal).

                   H. lucentense   H. volcanii   H. prahovense   H. alexandrinus   H. massiliensis   H. gibbonsii
H. lucentense      4086            2348          2754            2761              2690              2355
H. volcanii                        2995          2409            2400              2353              2412
H. prahovense                                    4180            3238              2958              2696
H. alexandrinus                                                  4109              2973              2685
H. massiliensis                                                                    4259              2459
H. gibbonsii                                                                                         3053

Lower-left similarity values (%), listed by column (the row assignment is not recoverable from the source):
H. lucentense: 92.1, 97.07, 92.08, 93.07, 92.44
H. volcanii: 93.21, 92.84, 93.29, 93.67
H. prahovense: 98.83, 92.33, 96.68
H. alexandrinus: 96.3, 92.24
H. massiliensis: 92.82 (with H. gibbonsii)

Table 7. Pairwise comparison of Haloferax massiliensis with other species using GGDC, formula 2 (DDH estimates based on identities / HSP length)*

Row and column order: H. massiliensis, H. alexandrinus, H. gibbonsii, H. lucentense, H. volcanii, H. prahovense. Each species compared with itself gives 100% ± 00. Off-diagonal values, listed by column (the row assignment is not recoverable from the source, except where a column holds a single value):
H. alexandrinus: 50.90% ± 2.64 (with H. massiliensis)
H. gibbonsii: 52.90% ± 2.67, 73.50% ± 2.90
H. lucentense: 54.30% ± 2.70, 51.90% ± 2.65, 50.80% ± 2.63
H. volcanii: 53.70% ± 2.69, 53.50% ± 2.68, 53.80% ± 2.69, 82.20% ± 2.69
H. prahovense: 72% ± 5.8, 96% ± 2.4, 50.70% ± 5.2, 50.70% ± 5.3, 52.80% ± 5.3

* The confidence intervals indicate the inherent uncertainty in estimating DDH values from intergenomic distances based on models derived from empirical test data sets (which are always limited in size). These results are in accordance with the 16S rRNA (Figure 4) and phylogenomic analyses as well as the GGDC results.

Figure legends

Figure 1. Reference mass spectrum from Haloferax massiliensis strain Arc-HrT. Spectra from 12 individual colonies were compared and a reference spectrum was generated.

Figure 2. Phylogenetic tree highlighting the position of Haloferax massiliensis strain Arc-HrT relative to other type strains within the genera Haloferax, Halosarcina, Halobellus and Halobaculum. The respective GenBank accession numbers for 16S rRNA genes are indicated in parentheses. Sequences were aligned using CLUSTALW, and phylogenetic inferences were obtained using the maximum-likelihood method within the MEGA software. The scale bar represents 0.005% nucleotide sequence divergence.

Figure 3. Transmission electron microscopy of Haloferax massiliensis strain Arc-HrT, using a Morgani 268D (Philips) at an operating voltage of 80 keV. The scale bar represents 500 nm.

Figure 4. Gel view comparing Haloferax massiliensis strain Arc-HrT to other species within the genus Haloferax. The gel view displays the raw spectra of loaded spectrum files arranged in a pseudo-gel-like look. The x-axis records the m/z value. The left y-axis displays the running spectrum number originating from subsequent spectra loading. The peak intensity is expressed by a gray-scale scheme code. The color bar and the right y-axis indicate the relation between the color of a peak and the peak intensity, in arbitrary units. Displayed species are indicated on the left.

Figure 5. Circular representation of the Haloferax massiliensis Arc-HrT genome. Circles from the center to the outside: GC skew (green/purple), GC content (green/purple) and contigs (orange/brown).

Figure 6. Distribution of functional classes of predicted genes according to the clusters of orthologous groups of proteins from Haloferax massiliensis strain Arc-HrT.


REFERENCES FOR THE FOREWORDS

1. Lagier J-C, Edouard S, Pagnier I, Mediannikov O, Drancourt M, Raoult D. Current and Past Strategies for Bacterial Culture in Clinical Microbiology. Clin. Microbiol. Rev. 28(1) (2015).

2. Lagier J-C, Hugon P, Khelaifia S, Fournier P-E, La Scola B, Raoult D. The Rebirth of Culture in Microbiology through the Example of Culturomics To Study Human Gut Microbiota. Clin. Microbiol. Rev. 28(1) (2015).

3. Lagier J-C, Armougom F, Million M, et al. Microbial culturomics: paradigm shift in the human gut microbiome study. Clin. Microbiol. Infect. Off. Publ. Eur. Soc. Clin. Microbiol. Infect. Dis. 18(12), 1185–1193 (2012).

4. Tettelin H, Masignani V, Cieslewicz MJ, et al. Genome analysis of multiple pathogenic isolates of Streptococcus agalactiae: Implications for the microbial “pan-genome.” Proc. Natl. Acad. Sci. U. S. A. 102(39), 13950–13955 (2005).

5. Sullivan A, Edlund C, Nord CE. Effect of antimicrobial agents on the ecological balance of human microflora. Lancet Infect. Dis. 1(2), 101–114 (2001).

6. Gupta SK, Padmanabhan BR, Diene SM, et al. ARG-ANNOT, a new bioinformatic tool to discover antibiotic resistance genes in bacterial genomes. Antimicrob. Agents Chemother. 58(1), 212–220 (2014).

7. Aziz RK, Bartels D, Best AA, et al. The RAST Server: rapid annotations using subsystems technology. BMC Genomics. 9, 75 (2008).

8. Klappenbach JA, Dunbar JM, Schmidt TM. rRNA Operon Copy Number Reflects Ecological Strategies of Bacteria. Appl. Environ. Microbiol. 66(4), 1328–1333 (2000).

9. Koskiniemi S, Lamoureux JG, Nikolakakis KC, et al. Rhs proteins from diverse bacteria mediate intercellular competition. Proc. Natl. Acad. Sci. U. S. A. 110(17), 7032–7037 (2013).

10. Vandamme P, Pot B, Gillis M, de Vos P, Kersters K, Swings J. Polyphasic taxonomy, a consensus approach to bacterial systematics. Microbiol. Rev. 60(2), 407–438 (1996).

11. Nielsen HB, Almeida M, Juncker AS, et al. Identification and assembly of genomes and genetic elements in complex metagenomic samples without using reference genomes. Nat. Biotechnol. 32(8), 822–828 (2014).

12. Browne HP, Forster SC, Anonye BO, et al. Culturing of “unculturable” human microbiota reveals novel taxa and extensive sporulation. Nature. 533(7604), 543–546 (2016).
