AIX-MARSEILLE UNIVERSITE FACULTE DE MEDECINE-LA TIMONE ECOLE DOCTORALE DES SCIENCES DE LA VIE ET DE LA SANTE

Presented and defended on 24 November 2017

By

For the degree of Doctor of Aix-Marseille Université, Speciality: Genomics and Bioinformatics

REAL-TIME GENOMICS TO DECIPHER ATYPICAL BACTERIA IN CLINICAL MICROBIOLOGY

JURY COMPOSITION

President of the Jury: Professor Anthony Levasseur
Examiner: Professor Raymond Ruimy
Reviewer 1: Professor Marie Kempf
Reviewer 2: Professor Estelle Jumas-Bilak
Thesis Supervisor: Professor Jean-Marc Rolain

Unité de Recherche sur les Maladies Infectieuses et Tropicales Emergentes, URMITE CNRS-IRD UMR7278, IHU MEDITERRANEE INFECTION

CONTENT

Foreword

Summary / Abstract

Introduction

Chapter I: Review

Article I: Real-time genomics and the impact of bacterial genome recombination in clinical microbiology. Kodjovi D. Mlaga, Seydina M. Diene, Raymond Ruimy, Jean-Marc Rolain. (Submitted to Genome Biology and Evolution)

Chapter II: Comparative genomics applied in clinical microbiology

Article II: Using MALDI-TOF MS typing method to decipher outbreak: the case of Staphylococcus saprophyticus causing urinary tract infections (UTIs) in Marseille, France. Kodjovi D. Mlaga, Grégory Dubourg, Cedric Abat, Hervé Chaudet, Laurène Lotte, Seydina M. Diene, Didier Raoult, Raymond Ruimy and Jean-Marc Rolain. European Journal of Clinical Microbiology & Infectious Diseases. pp. 1–7, Aug. 2017.

Article III: Comparative genomic analysis of Staphylococcus saprophyticus reveals a drift from “saprophytic” to “pathogenic” bacteria due to extensive genomic recombination. Kodjovi D. Mlaga, Seydina M. Diene, Raymond Ruimy, Jean-Marc Rolain. (Submitted to BMC Genomics)

Article IV: Extensive comparative genomic analysis of Enterococcus faecalis and Enterococcus faecium reveals a direct association between absence of CRISPR systems, the presence of anti-endonuclease (ardA) and acquisition of vancomycin resistance genes in E. faecium. Kodjovi D. Mlaga, Seydina M. Diene, Vincent Garcia, Philippe Colson, Raymond Ruimy, Didier Raoult, Jean-Marc Rolain. (Draft manuscript)

Chapter III: Taxonogenomics applied in new species/genus description

Article V: ‘Nissabacter archeti’, gen. nov., sp. nov., a new member of Enterobacteriaceae family, isolated from human pustule scalp at Archet 2 Hospital, Nice. Kodjovi D. Mlaga, Jean-Marc Rolain, Raymond Ruimy. New Microbes New Infect. 2017 May; 17: 81–83.

Article VI: Phenotypic and genomic description of ‘Nissabacter archeti’, gen. nov., sp. nov., a new member of Enterobacteriaceae family, isolated from human pustule scalp at Archet 2 Hospital, Nice, France. Kodjovi D. Mlaga, Romain Lotte, Henri Montaudié, Jean-Marc Rolain, Raymond Ruimy. (Submitted to International Journal of Systematic and Evolutionary Microbiology)

Chapter IV: Conclusion and Perspectives

Posters and Presentation

Acknowledgments


Foreword

The format of this thesis follows the recommendations of the Genomics and Bioinformatics speciality within the Master of Life and Health Sciences, which depends on the Doctoral School of Life and Health Sciences of Marseille.

The candidate is required to follow the rules imposed on him, which include a thesis format used in Northern Europe that allows better organisation than traditional theses. In addition, the introduction and bibliography section is replaced by a review submitted to a journal, which allows an external evaluation of the quality of the review and lets the student begin, as early as possible, an exhaustive bibliography on the field of this thesis.

Furthermore, the thesis is presented as published, accepted, or submitted articles, each accompanied by a brief commentary giving the general sense of the work. This form of presentation appeared better suited to the demands of international competition and makes it possible to concentrate on work that will benefit from international dissemination.

Professor Didier RAOULT


Summary

The development of Next-Generation Sequencing (NGS) and the improvement of Whole Genome Sequencing (WGS) have improved the analysis of high-throughput microbial genome sequencing data. The concept of “Real-Time Genomics” (RTG), initially coined by bioinformatics companies for in-depth genomic analysis, has progressively become part of routine diagnostic processes in clinical microbiology to decipher bacterial genomic evolution, the detection and analysis of antimicrobial resistance determinants, taxono-genomics and the routine surveillance of outbreaks. The growing amount of sequencing data made available in GenBank will require well-established and integrated tools for the analysis of genome sequencing data, in order to provide accurate results for proper patient management. The emergence of multidrug-resistant strains and of increasingly virulent pathogens has become a serious threat to human and animal health. Genome recombination through point mutations, horizontal gene transfer and gene loss has contributed enormously to the adaptation and evolution of bacteria in diverse environments and hosts, resulting in the acquisition of antimicrobial resistance determinants, virulence determinants and new metabolic profiles. Rapid diagnosis is the most effective approach to preventing and controlling microbial infections or diseases and, consequently, to achieving effective therapy. Today, routine diagnostic tools in microbiology have shown their limits in the management of infections or outbreaks caused by atypical pathogens, in the detection of virulence factors and in taxonomic classification. Recent advances in sequencing technologies have given microbiology laboratories access to WGS. The objective of our thesis is to apply real-time genomics (RTG) approaches to decipher the genomic characteristics and genome recombination of atypical bacteria as well as their impact on infectious diseases.

The first project of our thesis was to decipher a community outbreak due to Staphylococcus saprophyticus involved in urinary tract infections (UTIs), using MALDI-TOF MS technology and a comparative analysis of whole genomes of Staphylococcus saprophyticus of clinical and non-clinical origin to understand their genomic evolution. The second project is a comparative analysis of the genome evolution of Enterococcus faecalis and Enterococcus faecium isolated from humans, animals and the environment to decipher the difference in spread and the acquisition of antimicrobial determinants, especially vancomycin resistance genes. The last project focuses on the description of a new genus, Nissabacter, and its first species, Nissabacter archeti, a new phylogenetic branch of the family Enterobacteriaceae, using taxono-genomics.


Abstract

The development of Next-Generation Sequencing (NGS) and the improvement in whole genome sequencing (WGS) have improved the analysis of high-throughput microbial genome sequence data. The concept of “Real-Time Genomics” (RTG), initially coined by computational biology companies for in-depth genomic analysis, has recently moved toward becoming an integral part of routine diagnostic processes in clinical microbiology to decipher antimicrobial resistance determinants, taxono-genomics, and routine outbreak surveillance. The increase in the amount of sequencing data made available in GenBank will require well-established and integrated computational tools for genome sequencing data analysis, to provide accurate results for correct patient management. The emergence of multidrug resistance and highly virulent pathogens has become a severe threat to human health. Genome recombination through point mutations, horizontal gene transfer and gene loss has contributed significantly to the adaptation and evolution of bacteria in various environments and host niches, resulting in the acquisition of antimicrobial resistance determinants, virulence and metabolism patterns. A timely diagnosis is the most effective approach to preventing and controlling microbial infections or diseases and, consequently, to achieving effective therapy. Nowadays, routine microbiology diagnostic tools have shown their limitations in the management of infections or outbreaks caused by atypical bacteria, as far as multidrug-resistant bacteria, novel virulence or toxin genes and taxonomic classification are concerned. The recent advances in sequencing technologies have given microbiology laboratories access to whole genome sequencing.

The objective of our thesis is to apply real-time genomics approaches to decipher the genomic features and genome recombination events of atypical bacteria and their impact on infectious diseases, and to reconstruct the genomic evolution of atypical bacteria involved in community outbreaks. The first project of our PhD was to decipher a community outbreak of Staphylococcus saprophyticus involved in urinary tract infections (UTIs) using MALDI-TOF MS technology and a comparative whole genome analysis of clinical and non-clinical Staphylococcus saprophyticus to understand their genomic evolution. The second project is a comparative genome evolutionary analysis of Enterococcus faecalis and Enterococcus faecium isolated from humans, animals and the environment to decipher the difference in spread and the acquisition of antimicrobial resistance determinants. Finally, our last project focused on the description of a new genus, Nissabacter, and its first species, Nissabacter archeti, a new phylogenetic branch of the Enterobacteriaceae family, using taxono-genomics.


Introduction

The development of next-generation sequencing (NGS) and the improvement in whole genome sequencing (WGS) have contributed significantly to the analysis of high-throughput microbial genome sequence data [1,2]. Twenty years have passed since the powerful combination of WGS and computational analysis of data transformed our understanding of how microorganisms live, evolve and interact with their communities and their hosts, and how they cause infectious diseases [3]. The concept of “Real-Time Genomics” (RTG) was initially coined by computational biology companies to offer in-depth genomic analysis solutions to researchers in NGS technology and to give meaning to DNA and RNA sequencing data. It was applied to big data from high-throughput sequencing in clinical pathologies (tumours), considering the sensitivity and the efficiency of the process. Recently, RTG has become an integral part of routine diagnostic processes in clinical microbiology to decipher antimicrobial resistance determinants [4], taxono-genomics [5], and routine outbreak surveillance [6]. In 2016, more than 85,156 bacterial genomes were sequenced and registered in the GenBank database, and it looks as though 2017 will produce even more. This progress requires well-established and integrated computational tools for genome sequencing data analysis, to provide accurate results for correct patient management.

Bacterial infections are a common cause of infectious diseases in humans and animals and demand ever greater medical, social and technological resources [7,8]. The emergence of multidrug resistance [9,10] and highly virulent pathogens [11,12] has become a severe threat to human health. Genome recombination through point mutations, horizontal gene transfer and gene loss has contributed significantly to, and become the driving force of, the adaptation and evolution of bacteria in various environments and host niches [13–15]. This genome modification has resulted in the acquisition of antimicrobial resistance determinants [16,17] and of virulence and metabolism patterns [18], enabling bacteria to cause infectious diseases with unusual clinical profiles.

Timely diagnosis is the most effective approach to preventing and controlling microbial infectious diseases, the emergence of multidrug resistance and outbreaks. The gold standard for the diagnosis of infectious diseases has long been culture in growth-supporting media, including isolation, identification and antibiotic-susceptibility testing of the causative microorganism from clinical samples. Currently, such diagnoses take from a minimum of 24 hours to several weeks. The introduction of the polymerase chain reaction (PCR) method in the 1980s resulted in the development of a multitude of diagnostic tools and improved the efficiency of diagnostics and the characterisation of pathogens [19]. Recent developments in Matrix-Assisted Laser Desorption Ionization-Time of Flight Mass Spectrometry (MALDI-TOF MS) have enabled microbial typing at a low cost, with varying success rates depending on the microorganism in question [20,21]. This revolutionary technology allows easier and faster diagnosis of human pathogens than conventional phenotypic and molecular identification methods, with unquestionable reliability and cost-effectiveness [22]. MALDI-TOF MS data have been integrated as alternatives to chemotaxonomy and DNA-DNA hybridization (DDH) for the taxonomic description of bacteria, provided that the new isolates are compared to the phylogenetically closest species with standing in nomenclature [23]. In recent decades, molecular biology has moved from gene-by-gene analysis to more complex studies using a genome-wide scale [24]. Recent advances in sequencing technology and its decreasing costs have given microbiological laboratories access to WGS. The availability of well-established tools for the automated analysis of sequence data and databases will ensure WGS becomes an essential tool for clinical microbiology laboratories. In the past decade, WGS has been used to investigate the acquisition and spread of antimicrobial resistance mediated by mobile genetic elements, as well as virulence and pathogenicity acquired through horizontal gene transfer (HGT).

We divided the objective of our PhD project into two parts:

1. The first aim of our work was to understand the spread, adaptation and evolution of clinical strains using comparative genome analysis.

2. The second aim was to identify and describe new genera/species from strains isolated in clinical settings using a taxono-genomic approach.

Hence, we organised this manuscript into three main chapters presented below:

Chapter I: This section focuses on Article I, a review entitled “Real-time genomics and the impact of bacterial genome recombination in clinical microbiology”. Genome recombination through point mutations, horizontal gene transfer and gene loss has contributed significantly to, and become the driving force of, the adaptation and evolution of bacteria in various environments and host niches. This genomic modification has resulted in the acquisition of antimicrobial resistance determinants and of virulence and metabolism patterns, enabling bacteria to cause infectious diseases with unusual clinical profiles. In the past decade, whole genome sequencing has been used to investigate the acquisition and spread of antimicrobial resistance mediated by mobile genetic elements, and of virulence and pathogenicity factors acquired through horizontal gene transfer (HGT). In this review, we survey publicly available computational biology tools widely used for real-time genomics in clinical microbiology over the past decade and subsequently highlight how WGS analysis has made it possible to decipher the impact of bacterial genome recombinations in clinical microbiology.

Chapter II: This section focuses on the comparative genome analysis of Staphylococcus saprophyticus on the one hand and Enterococcus faecalis and Enterococcus faecium on the other. We divide this chapter into two parts:

I. In December 2014, our surveillance system identified an abnormal increase in S. saprophyticus causing UTIs in four University Hospitals in Marseille, indicating a suspected community S. saprophyticus UTI outbreak. Matrix-Assisted Laser Desorption Ionization-Time of Flight Mass Spectrometry (MALDI-TOF MS) spectral analysis was used to analyse the expansion of strain clusters, comparing strains from Marseille to those from Nice during the same period. This analysis revealed a geographically restricted clonal expansion of S. saprophyticus strain clusters in Marseille as compared to Nice. We published this work as Article II, entitled “Using MALDI-TOF MS typing method to decipher outbreak: the case of Staphylococcus saprophyticus causing urinary tract infections (UTIs) in Marseille, France”, in the European Journal of Clinical Microbiology & Infectious Diseases. We then sequenced a strain isolated from a female patient at the “Hôpital La Timone” in Marseille, France, who had experienced a UTI in December 2014, and compared it with all genomes of S. saprophyticus available in the NCBI/GenBank database as of December 2016, isolated from various environments (clinical and non-clinical strains), to investigate the genomic evolution of this bacterial species and its genomic characteristics. Our findings suggest that S. saprophyticus, initially a saprophytic bacterium, has drifted towards becoming a pathogenic bacterium through accumulated evolutionary events including massive genome recombinations and single nucleotide polymorphisms (SNPs). We submitted this work as Article III, entitled “Comparative genomic analysis of Staphylococcus saprophyticus reveals a drift from “saprophytic” to “pathogenic” bacteria due to extensive genomic recombination”, to BMC Genomics.

II. A MALDI-TOF MS spectral analysis of E. faecalis strains isolated in Marseille showed clustering between human and chicken strains, suggesting zoonotic dissemination. We therefore sequenced the genomes of four E. faecalis strains isolated from humans and two from chickens and performed a comparative analysis with the publicly available genomes of E. faecalis and E. faecium to decipher their genomic evolution. In this study, we showed that massive recombinations have occurred in E. faecalis, with the presence of a large number of CRISPR systems and associated proteins (cas) compared to E. faecium. Moreover, we found an association between the absence of CRISPR systems, the presence of the anti-endonuclease protein (ardA) (both HGT regulators) and the acquisition of vancomycin resistance genes (vanA, vanB, vanC) carried by plasmids in enterococci. A considerable number of E. faecium strains were isolated from animals (14.7%), mainly in Europe (86.6%), with zoonotic dissemination demonstrated by phylogenetic network analysis. We submitted this work as Article IV, entitled “Extensive comparative genomic analysis of Enterococcus faecalis and Enterococcus faecium reveals a direct association between absence of CRISPR systems, the presence of anti-endonuclease (ardA) and acquisition of vancomycin resistance genes in E. faecium”, to Genome Research.

Chapter III: This section focuses on the use of taxono-genomic tools to describe ‘Nissabacter archeti’, a new member of the Enterobacteriaceae family, isolated from a scalp pustule of a 29-year-old male at Archet II Hospital in Nice. Based on the genome sequence analysis, the MALDI-TOF MS scoring and the phenotypic characterisation, we propose to describe Nissabacter archeti as the first species of a new genus belonging to the Enterobacteriaceae family, phylogenetically close to Ewingella, Rahnella and Gibbsiella. We published this work as Article V, entitled “‘Nissabacter archeti’, gen. nov., sp. nov., a new member of Enterobacteriaceae family, isolated from human pustule scalp at Archet 2 Hospital, Nice”, in New Microbes and New Infections. Moreover, Article VI, “Phenotypic and genomic description of ‘Nissabacter archeti’, gen. nov., sp. nov., a new member of Enterobacteriaceae family, isolated from human pustule scalp at Archet 2 Hospital, Nice, France”, was submitted to the International Journal of Systematic and Evolutionary Microbiology.


References

1. Illumina. An Introduction to Next-Generation Sequencing Technology. 2016;1–16.
2. Long SW, Williams D, Valson C, Cantu CC, Cernoch P, Musser JM, et al. A genomic day in the life of a clinical microbiology laboratory. J. Clin. Microbiol. 2013.
3. Loman NJ, Pallen MJ. Twenty years of bacterial genome sequencing. Nat. Rev. Microbiol. 2015;13:787–94.
4. Mellmann A, Bletz S, Böking T, Kipp F, Becker K, Schultes A, et al. Real-Time Genome Sequencing of Resistant Bacteria Provides Precision Infection Control in an Institutional Setting. Bourbeau P, editor. J. Clin. Microbiol. 2016;54:2874–81.
5. Cao MD, Ganesamoorthy D, Elliott A, Zhang H, Cooper M, Coin L. Real-time strain typing and analysis of antibiotic resistance potential using Nanopore MinION sequencing. bioRxiv. 2015;19356.
6. McGann P, Bunin JL, Snesrud E, Singh S, Maybank R, Ong AC, et al. Real-Time Application of Whole Genome Sequencing for Outbreak Investigation – What is an achievable Turnaround Time? Diagn. Microbiol. Infect. Dis. 2016.
7. Zyga S, Zografakis-sfakianakis M. Emerging and Re-Emerging Infectious Diseases: A potential pandemic threat. 2011;3:159–68.
8. Law I, Fidler DP, Clarke KC, Levin BR, Mcsweegan E, Kronenberger CB, et al. Emerging Infectious Diseases. Emerg. Infect. Dis. 1996.
9. Ferjani S, Saidani M, Ennigrou S, Hsairi M, Slim AF, Boutiba Ben Boubaker I. Multidrug resistance and high virulence genotype in uropathogenic Escherichia coli due to diffusion of ST131 clonal group producing CTX-M-15: an emerging problem in a Tunisian hospital. Folia Microbiol. (Praha). 2014;59:257–62.
10. Olaitan AO, Berrazeg M, Fagade OE, Adelowo OO, Alli JA, Rolain JM. The emergence of multidrug-resistant Acinetobacter baumannii producing OXA-23 carbapenemase, Nigeria. Int. J. Infect. Dis. 2013;17:e469–70.
11. Njage PMK, Buys EM. Pathogenic and commensal Escherichia coli from irrigation water show potential in the transmission of the extended spectrum and AmpC β-lactamases determinants to isolates from lettuce. Microb. Biotechnol. 2015;8:462–73.
12. Nhantumbo AA, Cantarelli VV, Caireão J, Munguambe AM, Comé CE, Pinto G do C, et al. Frequency of Pathogenic Paediatric Bacterial Meningitis in Mozambique: The Critical Role of Multiplex Real-Time Polymerase Chain Reaction to Estimate the Burden of Disease. PLoS One. 2015;10:e0138249.
13. Palmer KL, Kos VN, Gilmore MS. Horizontal gene transfer and the genomics of enterococcal antibiotic resistance. Curr. Opin. Microbiol. 2010. p. 632–9.
14. Syvanen M. Evolutionary implications of horizontal gene transfer. Annu. Rev. Genet. 2012;46:341–58.
15. Koonin EV. Horizontal gene transfer: essentiality and evolvability in prokaryotes, and roles in evolutionary transitions. F1000Research. 2016;5:1805.
16. Olaitan AO, Diene SM, Assous MV, Rolain J-M. Genomic Plasticity of Multidrug-Resistant NDM-1 Positive Clinical Isolate of Providencia rettgeri. Genome Biol. Evol. 2016;8:723–8.
17. Imperi F, Antunes LCS, Blom J, Villa L, Iacono M, Visca P, et al. The genomics of Acinetobacter baumannii: Insights into genome plasticity, antimicrobial resistance and pathogenicity. IUBMB Life. 2011;63:1068–74.
18. Navarre WW. Chapter Three – The Impact of Gene Silencing on Horizontal Gene Transfer and Bacterial Evolution. Adv. Microb. Physiol. 2016. p. 157–86.
19. Hedman P, Ringertz O, Lindström M, Olsson K. The origin of Staphylococcus saprophyticus from cattle and pigs. Scand. J. Infect. Dis. 1993;25:57–60.
20. Spinali S, van Belkum A, Goering RV, Girard V, Welker M, Van Nuenen M, et al. Microbial Typing by Matrix-Assisted Laser Desorption Ionization–Time of Flight Mass Spectrometry: Do We Need Guidance for Data Interpretation? Doern GV, editor. J. Clin. Microbiol. 2015;53:760–5.
21. Firacative C, Trilles L, Meyer W. MALDI-TOF MS enables the rapid identification of the major molecular types within the Cryptococcus neoformans/C. gattii species complex. Heimesaat MM, editor. PLoS One. 2012;7:e37566.
22. Seng P, Rolain J-M, Fournier PE, La Scola B, Drancourt M, Raoult D. MALDI-TOF-mass spectrometry applications in clinical microbiology. Future Microbiol. 2010;5:1733–54.
23. Fournier P-E, Drancourt M. New Microbes New Infections promotes modern prokaryotic taxonomy: a new section “TaxonoGenomics: new genomes of microorganisms in humans”. New Microbes New Infect. 2015;7:48–9.
24. Guarnaccia M, Gentile G, Alessi E, Schneider C, Petralia S, Cavallaro S. Is this the real time for genomics? Genomics. 2014;103:177–82.

Chapter I: Review - Real-time genomics and the impact of bacterial genome recombination in clinical microbiology

Real-time genomics and the impact of bacterial genome recombination in clinical microbiology

Kodjovi D. Mlaga1, Seydina M. Diene1, Raymond Ruimy2,3, Jean-Marc Rolain1*

1. URMITE, Aix Marseille Université, UM63, CNRS 7278, IRD 198, INSERM 1095, IHU-Méditerranée Infection, 19-21 Boulevard Jean Moulin 13385 Marseille Cedex 05, France
2. Department of Bacteriology at Nice Academic Hospital, Nice Medical University, Nice, France
3. INSERM U1065 (C3M), Bacterial Toxins in Host-Pathogen Interactions, C3M, Bâtiment Universitaire Archimed, Nice, France

*Corresponding author: Prof. Jean-Marc Rolain
Email: [email protected]
URMITE, Aix Marseille Université, UM63, CNRS 7278, IRD 198, INSERM 1095, IHU-Méditerranée Infection, 19-21 Boulevard Jean Moulin 13385 Marseille Cedex 05, France
Tel: +33 (0) 4 91 32 43 75; Fax: +33 (0) 4 86 13 68 28

Keywords: real-time genomics, genome recombination, horizontal gene transfer, point mutation, gene loss


Abstract

The development of cost-effective next-generation sequencing (NGS) technology, recent advances in whole genome sequencing (WGS), and decreasing costs have had a significant impact on our knowledge of the behaviour of the bacteria which cause infectious diseases. Bacterial genome recombinations through point mutations and gene exchanges, including horizontal gene transfer and gene loss, have contributed significantly to the adaptation and evolution of bacteria and have become the driving force of bacterial survival in host niches. In this paper, we review publicly available computational biology tools frequently used for real-time genomics in clinical microbiology over the past decade and highlight the impact of bacterial genome recombinations in clinical microbiology.


Introduction

Microbial infections are a common cause of infectious diseases, posing severe threats to human and animal health and demanding ever greater medical, social and technological resources (Zyga & Zografakis-sfakianakis 2011; Law et al. 1996). The emergence of multidrug resistance (Ferjani et al. 2014; Olaitan et al. 2013) and highly virulent pathogens (Njage & Buys 2015; Nhantumbo et al. 2015) has become a severe threat to human health. Genome recombination through point mutations, horizontal gene transfer and gene loss has contributed significantly to, and become the driving force of, the (short-term) adaptation and the (long-term) evolution of bacteria in various environments and host niches (Palmer et al. 2010; Syvanen 2012; Koonin et al. 2016). This genomic modification has resulted in the acquisition of antimicrobial resistance determinants (Olaitan et al. 2016; Imperi et al. 2011) and of virulence and metabolism patterns (Navarre 2016), enabling bacteria to cause infectious diseases with unusual clinical profiles. Timely diagnosis is the most effective approach to preventing and controlling microbial infectious diseases and the emergence of multidrug resistance. The gold standard for the diagnosis of infectious diseases has long been sample culture in growth-supporting media, including isolation, identification and antibiotic-susceptibility testing of the causative microorganism from clinical samples. Currently, such diagnoses take a minimum of 24–72 hours. The introduction of the polymerase chain reaction (PCR) method in the 1980s resulted in the development of a multitude of diagnostic tools and improved the efficiency of diagnostics and the characterisation of pathogens. In recent decades, molecular biology has moved from gene-by-gene analysis to more complex studies using a genome-wide scale (Guarnaccia et al. 2014). Recent advances in sequencing technology and its decreasing costs have given microbiological laboratories access to WGS. The availability of well-established tools for the automated analysis of sequence data and databases will ensure WGS becomes an essential tool for clinical microbiology laboratories (Bertelli & Greub 2013). In the past decade, WGS has been used to investigate the acquisition and spread of antimicrobial resistance mediated by mobile genetic elements (Palmer et al. 2010; Imperi et al. 2011), virulence, and pathogenicity (Messerer et al. 2017; Schneider et al. 2004). In this paper, we review publicly available computational biology tools frequently used for real-time genomics in clinical microbiology over the past decade and subsequently highlight how WGS analysis has made it possible to decipher the impact of bacterial genome recombinations in clinical microbiology.

1. New advances in whole genome sequencing and the concept of real-time genomics

The development of cost-effective next-generation sequencing (NGS) and advances in WGS have facilitated the analysis of bacterial genetic material at a whole-genome level and on a larger scale (Illumina 2016; Long et al. 2013). It has been twenty years since the powerful combination of WGS and computational analysis of data transformed our understanding of how bacteria live, evolve and interact with their communities and their hosts, and how they cause infectious diseases (Loman & Pallen 2015). Three technologies were initially developed, namely the 454 platform (Roche), the SOLiD platform (Life Technologies) and the Illumina platform (Illumina). They all present limitations regarding “read” length or accuracy, although these have been improved over time. The recent development of smaller bench-top sequencers, such as the MiSeq platform (Illumina) and the Ion Torrent platform (Life Technologies), followed by the decrease in turnaround time and cost, brought with it the potential to introduce real-time bacterial genome sequencing into clinical diagnostic laboratories and the management of outbreaks (Illumina 2016). Recent developments in nanopore technology enabled direct, electronic analysis of DNA, RNA or small proteins (Török & Peacock 2012). The concept of “Real-Time Genomics” (RTG) was initially coined by computational biology companies to offer in-depth genomic analysis solutions to researchers in NGS technology and to give meaning to DNA and RNA sequencing data. It was applied to big data from high-throughput sequencing in clinical pathologies (tumours), considering the sensitivity and the efficiency of the process. The first bacterial genome was completely sequenced about two decades ago, and the technical improvements and subsequent increases in biological knowledge since then have been dramatic (Land et al. 2015). Recently, RTG has become an integral part of routine diagnostic processes in clinical microbiology to decipher antimicrobial resistance determinants (Mellmann et al. 2016), strain typing (Cao et al. 2015), and routine outbreak surveillance (McGann et al. 2016). In 2016, more than 85,156 bacterial genomes were sequenced and registered in the NCBI database, and it looks as though 2017 will produce even more. This progress requires well-established and integrated computational tools for genome sequencing data analysis, to provide accurate results for correct patient management.

2. Computational whole genome sequencing data: method and analysis

Recent advances in computational biology have contributed significantly to the availability of a wide variety of bioinformatic tools for quantitative and qualitative genome sequencing data analysis (Table I). The analysis approach will depend on whether it is a single cell whole genome analysis, a comparative analysis, or a metagenomic analysis (Figure 1). This figure describes how scientists can implement computational tools and methods in clinical microbiology. The scope of this review will be limited to single cell whole genome analysis and comparative genome analysis.

2.1. Single cell whole genome sequencing data analysis

Whole genome sequencing data analysis is used in clinical diagnosis to decipher the strain type and to describe the antimicrobial resistance profile. Whole genome sequencing can be performed using nucleic acid material obtained directly from clinical specimens (urine, biological fluids etc.) or from a single colony from bacterial culture. Before assembling the reads, it is recommended to generate a graphic report of read quality using FastQC (Andrews 2010). The output of the data analysis will depend on the quality of the raw sequencing reads.
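As a rough illustration of what a read-quality report summarises, the sketch below computes the mean Phred score of each read in a FASTQ record set. It is not FastQC itself; the function names are hypothetical, and it assumes the common Phred+33 encoding and strict four-line FASTQ records.

```python
def phred_scores(quality_line, offset=33):
    """Decode a FASTQ quality string into Phred scores (Phred+33 is an assumption)."""
    return [ord(ch) - offset for ch in quality_line]


def mean_read_qualities(fastq_text):
    """Return the mean Phred score of each read in a 4-line-per-record FASTQ string."""
    lines = fastq_text.strip().splitlines()
    means = []
    for i in range(0, len(lines), 4):
        scores = phred_scores(lines[i + 3])  # 4th line of a record holds the qualities
        means.append(sum(scores) / len(scores))
    return means


# Toy input: 'I' encodes Phred 40 and '!' encodes Phred 0 under Phred+33.
example = "@read1\nACGT\n+\nIIII\n@read2\nACGT\n+\n!!II\n"
print(mean_read_qualities(example))
```

Reads with a low mean score would typically be trimmed or discarded before assembly; the cut-off is a laboratory choice and is not specified here.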

2.1.1. Bacterial de novo genome assembly

Computer programs typically use data consisting of single and paired reads, which can be put together through overlapping regions into a continuous sequence called a ‘contig’. These contigs can subsequently be linked up into ‘scaffolds’, with gaps between them (Baker 2012). A large number of bioinformatics tools have been made available for prokaryote genome assembly (Table I-A). The most frequently used include A5-miseq (Coil et al. 2015), ABySS (Simpson et al. 2009), Velvet (Zerbino & Birney 2008), SPAdes (Bankevich et al. 2012) and CLC-assembly (Sequencing 2011). The choice of software depends upon the sequencing technology (mate-pair, paired-end or single reads), the read length (short or long) and the computer platform used for the analysis (Windows, web, Linux or Mac). Genome assembly can be performed with or without a reference sequence taxonomically close to the sequenced strains. Depending on the software and the mode used during the assembly process, joining reads can introduce bias. Therefore, a blind assembly without a reference sequence is recommended. Later, the genome sequence in contigs should be aligned against reference sequences (if available) using Mauve (Darling et al. 2004, 2010), to re-order the contigs or scaffolds before annotation and further comparative analysis. From there, when using CLC-assembly, we can easily identify gaps relative to the reference sequence, and these regions can be mapped against the initial reads of the sequenced genome to identify consensus sequences. This consensus sequence can be introduced between contigs to close the gaps. By doing this, we reduce the number of contigs while increasing the size of the assembled genome. However, the scaffolding process can be automated, and a few assembly programs, such as Orione (Cuccuru et al. 2014), SOAPdenovo (Luo et al. 2012), SSPACE (Boetzer et al. 2011), ScaffoldScaffolder (Bodily et al. 2015) and A5-miseq (Coil et al. 2015), can perform scaffolding by joining contigs into longer sequences.
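The effect of gap closing and scaffolding (fewer, longer sequences) is often summarised with the N50 statistic. N50 is not discussed in the text above, so the following is only a minimal sketch of one common way to quantify assembly contiguity; the function name is hypothetical.

```python
def n50(contig_lengths):
    """Return the length L such that contigs of length >= L hold at least half of all bases."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running * 2 >= total:  # reached half of the assembly size
            return length
    return 0  # empty assembly


before_scaffolding = [100, 200, 300, 400]
after_scaffolding = [300, 700]  # same total size, fewer and longer sequences
print(n50(before_scaffolding), n50(after_scaffolding))
```

A higher N50 for the same total length indicates a more contiguous assembly, which is the outcome expected when contigs are joined.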

2.1.2. Bacterial genome sequence annotation

Once the reads are assembled into contigs and/or scaffolds organised in a FASTA file, the next step is to identify all the genes in the sequenced genome. This process is known as ‘annotation’. Several tools are available to automate this process (Table I-B). Prokka (Seemann 2014) is one of the most user-friendly prokaryote genome annotation tools that has been published and is freely available. It combines a variety of core bioinformatic tools and generates useful annotation formats for further comparative genome analysis. It is a fast, standalone command-line tool, which can also annotate metagenome data. Another program, RAST (Aziz et al. 2008), can additionally provide detailed annotation, functional information and pathway analysis, although its main limitation is the length of time it can take to annotate a genome, depending on the server workload.

2.1.3. Taxonomical identification and detection of genome features (Taxonogenomics)

One of the most important aspects of microbiology diagnostics is the taxonomical identification of the pathogen causing the infection. The Basic Local Alignment Search Tool (BLAST) (Camacho et al. 2009; Gertz 2005) is widely used to find regions of local similarity between sequences, and thereby to assess the similarity between two genome sequences and their feature profiles. A BLAST search enables a query sequence to be compared against a library or database and makes it possible to identify library sequences that resemble the query sequence above certain defined thresholds. Many other tools have also been described, but the one most widely used for prokaryote taxonomy is the Average Nucleotide Identity (ANI). The ANI of the shared genes (core genes) between two strains is a robust means of comparing genetic relatedness among strains. ANI values of approximately 94% have been shown to correspond to the traditional 70% DNA-DNA hybridization (DDH) threshold, which is the standard in the current definition of species (Konstantinidis & Tiedje 2005). Another method is the Genome-To-Genome Distance Calculator (GGDC), which is implemented as a web tool and is used to identify and delineate bacteria at species and genus level by calculating the DDH value percentage (Auch et al. 2010). Some genome features, such as antimicrobial resistance determinants, virulence genes, and prophages, are organised in databases that can be used for similarity searches and to find genome features (Table I-H). The most frequently used genomic databases are the NCBI database (for general purposes), ARG-ANNOT (Gupta et al. 2014) and ResFinder (Zankari et al. 2012) (for finding and identifying antimicrobial resistance genes; ResFinder focuses on acquired genes and does not find chromosomal mutations), and VFDB (Chen et al. 2005) and PATRIC (Snyder et al. 2007) for virulence factor identification in bacteria. Several tools are available to identify prophages in bacterial genomes, but PHASTER (Arndt et al. 2016), the newer version of PHAST, is the easiest to use and can generate more comprehensive data, with the possibility of visualising prophages located on the genome and their gene composition.
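The ANI comparison described above can be sketched as follows, assuming the shared genes have already been aligned pairwise (real ANI tools perform the alignment themselves, typically with BLAST-like searches on genome fragments). The function names are hypothetical, and the 94% cut-off is the approximate value quoted in the text.

```python
def percent_identity(seq_a, seq_b):
    """Percent identity of two pre-aligned sequences, ignoring gap columns ('-')."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    compared = [(a, b) for a, b in zip(seq_a, seq_b) if a != "-" and b != "-"]
    if not compared:
        return 0.0
    matches = sum(1 for a, b in compared if a == b)
    return 100.0 * matches / len(compared)


def average_nucleotide_identity(aligned_gene_pairs):
    """Mean percent identity over all shared (core) gene alignments."""
    identities = [percent_identity(a, b) for a, b in aligned_gene_pairs]
    return sum(identities) / len(identities)


def same_species(ani, threshold=94.0):
    """Apply the approximate ANI species boundary (about 94%, equivalent to 70% DDH)."""
    return ani >= threshold


pairs = [("ACGTACGTAC", "ACGTACGTAC"), ("ACGTACGTAC", "ACGTACGTAA")]
ani = average_nucleotide_identity(pairs)
print(ani, same_species(ani))
```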

2.2. Comparative genomics: bacterial pan-genome analysis

A comparative genome analysis is a systematic computational approach to deciphering the similarity, dissimilarity and evolution between multiple genome sequences from the same species or a higher taxonomic level. The purpose is to understand the biological differences underlying the pathogenic behaviour of multiple bacteria. Another aspect of comparative studies is an attempt to understand the evolution profile of a bacterial community (Prentice 2004). One of the most robust computational analyses used in comparative genomics is the pan-genome, which will be the primary focus of this section. The principle behind pan-genome analysis is that bacterial strains belonging to the same species vary considerably in gene content and, consequently, the genetic repertoire of a given species is much more extensive than the gene content of individual strains (Mira et al. 2010). Pan-genome analysis makes it possible to study the biodiversity within a species or genus. Low pan-genome diversity (closed pan-genome) could indicate a stable environment, while bacterial species with abilities to adapt to various niches would be expected to have high pan-genome diversity (open pan-genome) (Snipen & Ussery 2010). In the pan-genome analysis process, gene sequences from multiple strain genomes are concatenated, an all-against-all BLAST is performed on the concatenated file, and the results are clustered to identify gene families/orthologs/clusters. The gene ortholog data is finally parsed to generate the different components of a pan-genome. The pan-genome is composed of the core genome, which is organised into the hard core (genes present in 99%–100% of taxa) and the soft core (genes present in 95%–99% of taxa), and the adaptive genome, which is organised into shell genes (genes present in 15%–95% of genomes) and cloud genes (genes present in 0%–15% of genomes), as described by Kaas et al. (Kaas et al. 2012).

Three different algorithms are used to infer the gene clustering. Contreras-Moreira et al. have developed the bidirectional best-hit (BDBH) algorithm. It is a BLAST-based method and uses one reference from the gene sequences to grow clusters (Contreras-Moreira & Vinuesa 2013). It is used in the GET_HOMOLOGUES pan-genome package along with COG-triangles (Kristensen et al. 2010) and OrthoMCL (Li et al. 2003), which makes GET_HOMOLOGUES one of the most robust pan-genome packages ever built (Vinuesa & Contreras-Moreira; Contreras-Moreira & Vinuesa 2013). Another package for building prokaryote pan-genomes is Roary (Page et al. 2015). In this tool, coding nucleotide sequences are extracted from the input data and converted to protein sequences. These are then filtered to remove partial sequences and iteratively pre-clustered with CD-HIT (Fu et al. 2012; Li & Godzik 2006). An all-against-all alignment is performed with BLASTP. Sequences are then clustered with MCL (van Dongen 2000) and, finally, the pre-clustering results from CD-HIT are merged with the results of MCL. This shortcut makes Roary the most rapid ortholog gene detection method capable of producing a pan-genome to date (Page et al. 2015). Currently, tens of pan-genome computational packages have been made available for efficient comparative analysis of bacterial species (Xiao et al. 2015). The most frequently used pan-genome tools are described in Table I-G.

As an example, we performed a pan-genome analysis of 37 genome sequences of S. aureus retrieved from GenBank (Suppl. file F1). Figure 2 shows a clustered distribution of the different components of the pan-genome plotted against a parsimony pan-genome tree. We estimated the ortholog gene distribution using the Roary software (Page et al. 2015) and generated the plot using the available software package plotTree (https://github.com/katholt/plotTree). We also found other relevant examples in the literature. A comparative genomic analysis of Neisseria meningitidis using the pan-genome approach facilitated the identification of 11 genes specific to N. meningitidis genomes and common to at least 177 of the 183 genomes available (97%) (soft core) as targets for the diagnosis of N. meningitidis in clinical microbiology (Diene et al. 2016). This approach has dramatically improved the clinical diagnosis of these pathogens in laboratories and has had a significant impact on patient management and outcomes. Another aspect is to identify specific genes within the community of species involved in community- and hospital-acquired infections. A comparative analysis of pathogenic and non-pathogenic Enterococcus faecalis has shown that the pathogenic potential of a particular E. faecalis strain may be determined by the presence of virulence factors, rather than the level of expression of such traits (Vebø et al. 2010). Overall, the outcomes of pan-genome analysis have contributed significantly to several clinical decisions. It is now essential to redefine the community of pathogens capable of causing infection in clinical microbiology, as the presence of known traditional human pathogens may not necessarily be the cause of the pathology, and many factors may be involved.
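The partition of a pan-genome into hard core, soft core, shell and cloud genes can be sketched from a presence table of gene families. The code below is a minimal illustration, not the implementation used by Roary or GET_HOMOLOGUES; the function names are hypothetical, and the handling of the boundary values (for example, whether exactly 95% counts as soft core) is an assumption, since the ranges quoted above overlap at their limits.

```python
from collections import Counter


def classify_gene_family(n_present, n_genomes):
    """Assign one gene family to a pan-genome compartment by its presence fraction."""
    fraction = n_present / n_genomes
    if fraction >= 0.99:
        return "hard core"
    if fraction >= 0.95:
        return "soft core"
    if fraction >= 0.15:
        return "shell"
    return "cloud"


def summarise_pangenome(presence, n_genomes):
    """Count gene families per compartment.

    presence maps a gene-family name to the set of genomes that carry it.
    """
    return Counter(
        classify_gene_family(len(genomes), n_genomes) for genomes in presence.values()
    )


genomes = [f"genome_{i}" for i in range(37)]  # 37 genomes, as in the S. aureus example
presence = {
    "family_A": set(genomes),       # present everywhere
    "family_B": set(genomes[:36]),  # missing from one genome
    "family_C": set(genomes[:10]),
    "family_D": set(genomes[:2]),
}
print(summarise_pangenome(presence, len(genomes)))
```

Under these thresholds, a gene found in 177 of 183 genomes (about 97%), as in the N. meningitidis study cited above, falls in the soft core.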

2.3. Pan-genome parsimony tree

One of the emerging concepts of pan-genome analysis is inferring the parsimony pan-genome tree. Visualizing the relationships between genomes within pan-genomes could be helpful in establishing a picture of phylogenetic relatedness and the degree of horizontal gene transfer (HGT), as well as assisting in the understanding of phenotypic differences and the genomic evolution of the studied bacteria. It is a gene-content parsimony tree that illustrates the similarities and differences between genomes inside a pan-genome. When gene families are defined, they are represented in a pan-matrix, where each row corresponds to a genome and each column to a gene family. The presence of a gene is indicated as “1” and its absence as “0”. The pan-matrix is used to compute the relative distance according to the Manhattan method (Sneath 1986). Using this distance measure, parsimony trees can be formed by hierarchical clustering. Two genomes are similar and located on the same phylogenetic branches not only by sharing the same genes but also by lacking the same genes. The standard procedure is described by Snipen et al. (Snipen & Ussery 2010). The core-gene tree can also be estimated. Kaas et al. also showed that the pan-genome tree differs from the core-gene tree because it is based on the genes that are absent and present among the genomes. Since all the core genes will be present in all genomes, these will not in any way influence the phylogenetic relationships in this tree (Kaas et al. 2012). An example of a pan-genome parsimony tree is shown in Figure 3-A. We estimated this phylogenetic dendrogram from orthologs of the 37 genome sequences of S. aureus from our previous analysis. It revealed not only the diversity observed among the analysed genomes of S. aureus but also the gene content profile and subsequently the evolutionary history.
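The distance computation on the pan-matrix can be sketched as below. The toy matrix and the strain names are invented for illustration, and only the first step of a hierarchical clustering (finding the closest pair of genomes) is shown rather than a full tree-building procedure.

```python
from itertools import combinations


def manhattan(row_a, row_b):
    """Manhattan distance between two presence/absence rows (counts differing gene families)."""
    return sum(abs(a - b) for a, b in zip(row_a, row_b))


def pairwise_distances(pan_matrix):
    """Distances for every unordered pair of genomes in a {genome: row} pan-matrix."""
    return {
        (a, b): manhattan(pan_matrix[a], pan_matrix[b])
        for a, b in combinations(sorted(pan_matrix), 2)
    }


def closest_pair(pan_matrix):
    """The pair a hierarchical clustering would merge first."""
    distances = pairwise_distances(pan_matrix)
    return min(distances, key=distances.get)


# Rows are genomes, columns are gene families; 1 = present, 0 = absent.
pan_matrix = {
    "strain_A": [1, 1, 1, 0, 0],
    "strain_B": [1, 1, 1, 0, 1],
    "strain_C": [1, 0, 0, 1, 1],
}
print(pairwise_distances(pan_matrix))
print(closest_pair(pan_matrix))
```

The first column is a core gene family (present in all three strains) and contributes zero to every distance, which illustrates why core genes do not influence this tree.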


3. Bacterial genome recombinations and their impact on clinical microbiology

Several mechanisms have been described as being involved in the adaptation and evolution of bacteria. These mechanisms have had a significant impact on the physiopathology and phenotypic traits of these pathogens in clinical microbiology. The mechanisms commonly described in bacteria for their adaptation and evolution are point mutations, HGT (gene gain) and gene loss.

3.1. Point mutations

Often thought to be the basis of evolution, genetic point mutations consist of the substitution of one nucleotide with another (often referred to as a single-nucleotide polymorphism (SNP)) or the insertion or deletion of a single nucleotide. A single nucleotide substitution in a nucleic acid sequence that encodes a protein may produce either a synonymous (silent) codon mutation, with no change to the encoded amino acid, or a non-synonymous mutation, resulting in an amino acid change which may modify the function and structure of the encoded protein. In contrast, a single nucleotide insertion or deletion in a protein-coding sequence will result in a reading frame shift, such that the downstream codons, including stop codons, are translated from a different reading frame, leading to a significant modification to the structure and function of the encoded protein. Point mutations in non-protein-coding DNA sequences may also have phenotypic and functional consequences, particularly if they affect a regulatory element, contributing to genetic variation (Bryant et al. 2014).

In clinical microbiology, many studies have reported a significant impact of point mutations on antimicrobial drug resistance, genotyping, taxonomy and epidemiology. Cannatelli et al. reported that Klebsiella pneumoniae might use point mutation to inactivate the mgrB gene, encoding a negative-feedback regulator of the PhoQ-PhoP signalling system; this may be responsible for colistin resistance, due to the resulting upregulation of the pmrHFIJKLM lipid A modification system (Cannatelli et al. 2014). Similar findings have been described for Acinetobacter baumannii (Rolain et al. 2013). Moreover, Vila et al. demonstrated an association between double mutations in the gyrA gene of ciprofloxacin-resistant clinical isolates of Escherichia coli and the minimal inhibitory concentrations (MICs) of ciprofloxacin (Vila et al. 1994). They also showed that a change in Ser-83 in GyrA is enough to generate elevated resistance levels to nalidixic acid, while a second mutation at Asp-87 in GyrA may play a complementary role in developing the strain’s elevated level of ciprofloxacin resistance. Similar findings have also been reported by Reyna et al. (Reyna et al. 1995), where Salmonella typhimurium gyrA mutations were associated with fluoroquinolone resistance. Several other studies have deciphered the association between mutations in the rpoB gene, encoding the β subunit of RNA polymerase, and rifampicin-resistant phenotypes in populations of Mycobacterium tuberculosis (Huang et al. 2002; A. Nisha 2012; Regmi et al. 2015). Point mutations have also been associated with the selection of specific clones responsible for major hospital-acquired outbreaks around the world. In the case of methicillin-resistant Staphylococcus aureus (MRSA) (Kong et al. 2016), Salmonella enterica serovar Typhi (Yap et al. 2014) and Mycobacterium tuberculosis (Fenner et al. 2011), the SNP approach has been used for epidemiological and surveillance studies.

With the advent of sequencing technologies, a variety of statistical tests have been developed to quantify selection pressures acting on protein-coding regions. Of these, the dN/dS ratio is one of the most widely used, due in part to its simplicity and robustness (Kryazhimskiy & Plotkin 2008). The ratio ω = β/α (also referred to as dN/dS or KA/KS) has become a standard measure of selective pressure, where ω ≈ 1 signifies neutral evolution, ω < 1 negative selection and ω > 1 positive selection (Pond et al. 2009). Castillo-Ramírez et al. used the dN/dS ratio to investigate the emergence of bacterial population clones. In their study, they examined the distribution of synonymous and non-synonymous SNPs within recently emerged clones of two critical nosocomial pathogens, MRSA and Clostridium difficile. They found that in both species, the proportion of synonymous changes is much higher among SNPs likely to have emerged through recombination than among those arising from de novo mutations. They concluded that this might be explained by the very recent emergence of the mutational SNPs combined with a reduction in the efficiency of selection due to niche specialisation (Castillo-Ramírez et al. 2011).
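The distinction between synonymous and non-synonymous substitutions, and the reading of ω, can be illustrated with a short sketch. The coding sequence below is a toy example rather than a real gene, the function names are hypothetical, and the amino acid assignments follow the standard genetic code. Estimating dN and dS themselves requires a codon substitution model and is not attempted here.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order; '*' marks stop codons.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_substitution(cds, position, new_base):
    """Classify a single-base substitution at a 0-based position of an in-frame CDS."""
    start = position - position % 3          # first base of the affected codon
    codon = cds[start:start + 3]
    offset = position % 3
    mutated = codon[:offset] + new_base + codon[offset + 1:]
    if CODON_TABLE[codon] == CODON_TABLE[mutated]:
        return "synonymous"
    return "non-synonymous"


def interpret_omega(dn, ds):
    """Read omega = dN/dS as in the text; exact equality stands in for 'approximately 1'."""
    omega = dn / ds
    if omega > 1:
        return "positive selection"
    if omega < 1:
        return "negative selection"
    return "neutral evolution"


toy_cds = "ATGTCGGAC"  # Met-Ser-Asp
print(classify_substitution(toy_cds, 5, "A"))  # TCG -> TCA, serine is kept
print(classify_substitution(toy_cds, 4, "T"))  # TCG -> TTG, serine becomes leucine
print(interpret_omega(0.2, 1.0))
```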

3.2. Horizontal gene transfer (HGT)

Also referred to as lateral gene transfer (LGT), HGT is the movement of genetic material between unicellular organisms or between unicellular and multicellular organisms (Andam et al. 2010). It produces incredibly dynamic genomes in which substantial amounts of DNA are introduced into and deleted from the chromosome (Ochman et al. 2000). In prokaryotes, three mechanisms are known to be involved in the transfer of DNA material, namely transformation (uptake of free DNA in solution) (Lin et al. 2009), plasmid-mediated transfer (conjugation) (Sivertsen et al. 2016; Yao et al. 2017; Maurelli 2007), and bacterial virus-mediated transfer (phage transduction) (Brüssow et al. 2004; Menouni et al. 2015). This genome plasticity has had a significant impact upon emerging pathogens in terms of their behaviour and interaction with their hosts. It has not only contributed to the evolution of the lifestyle of the bacteria, with the emergence of new phenotypes, but has also disrupted their physiology.

3.2.1. Implication in prokaryote taxonomy

HGT has had an impact upon bacterial taxonomy and identification in clinical laboratories in recent decades. Phylogenetic inference assumes that prokaryote taxa are monophyletic and that the exchange of genetic material between closely related or unrelated taxa by recombination is not sufficient to significantly limit phylogenetic interpretations. However, HGT has caused considerable confusion in the conventional taxonomy of bacteria (Young 2001). Schouls et al. observed, by closer inspection of 16S rRNA sequences of the Streptococcus anginosus group, a mosaic-like structure, strongly suggestive of the horizontal transfer of 16S rRNA gene segments between distinct species. Southern blot hybridization further showed that, within a single strain, all copies of the 16S rRNA gene had the same composition. This indicates that the apparent mosaic structures were not PCR-induced artefacts and that such recombination may lead to the construction of incorrect phylogenetic trees based on the 16S rRNA genes (Schouls et al. 2003). HGT can also influence the Average Nucleotide Identity (ANI) score, since the value is determined based on gene content and sequence similarity. From our previous example, we generated the maximum-likelihood phylogenetic tree of the 37 genome sequences of S. aureus using whole genome sequences. The tree was estimated with all recombination loci (Figure 3-B) and after removing recombination sites (Figure 3-C). We observed a difference in the tree topologies and the lengths of the branches, and a significant reconstruction of the genealogical nodes. This genealogy reconstruction can have a significant impact on interpretation in the context of epidemiology and transmission tracking.

3.2.2. Acquisition of antimicrobial resistance genes mediated by resistance and pathogenicity genomic islands

One of the most widely reported consequences of horizontal gene transfer in prokaryotes is the acquisition of antimicrobial resistance genes mediated by resistance or pathogenicity islands (PAI) (Schmidt & Hensel 2004; Hentschel & Hacker 2001). These are a group of mobile genetic elements that play a pivotal role in the virulence and resistance of bacteria to antibiotics (Schmidt & Hensel 2004). Pathogenicity islands have significantly contributed to the acquisition of novel resistance determinants by bacterial pathogens described in clinical microbiology. They are mostly involved in the selection pressure and clone expansion during outbreaks. In Staphylococcus stepanovicii, the mecC-harbouring region has been identified as a recombination hotspot (Semmler et al. 2016). In this study, the authors showed by genome sequencing that S. stepanovicii strain IMT28705 harbours a mecC gene that shares 99.2% nucleotide (and 98.5% amino acid) sequence identity with mecC of S. aureus MRSA_LGA251, and that the insertion of SCCmec alters the attR1 site on the chromosome. In Providencia rettgeri H1736, genome analysis revealed five predicted plasmids as well as other mobile genetic elements (MGEs) including phages, genomic islands, and integrative and conjugative elements. The authors showed that the resistome consisted of a total of 27 different antibiotic resistance genes including blaNDM-1, mostly located on MGEs, making P. rettgeri H1736 significantly different from other P. rettgeri isolates (Olumuyiwa Olaitan et al. 2016). In Neisseria gonorrhoeae, researchers showed that evolution occurs in response to antimicrobial selective pressure and that genomic islands offer selective advantages to host bacteria: their acquisition may not only facilitate the spread of antimicrobial resistance in gonococcal populations but may also confer fitness advantages (Harrison et al. 2016). It is now known that plasmid mobilisation is a significant mechanism for HGT in the evolution of antibiotic resistance in E. faecalis (Manson et al. 2010) and E. faecium (Mikalsen et al. 2015) populations. The role of bacteriophages in the transfer of antimicrobial resistance genes has also been demonstrated. They can act as vehicles for the horizontal exchange of genetic material, and modify their host genomic structure by inserting their DNA into the host genome. They can carry genes that encode new functions, especially extended-spectrum beta-lactamase and fluoroquinolone resistance (Subirats et al. 2016; Lekunberri et al. 2017; Marti et al. 2014; Yosef et al. 2015).

3.3. Gene loss in bacteria

It was long thought that the acquisition of virulence gene clusters on plasmids and pathogenicity islands through HGT controls the evolution of bacterial pathogens. However, it has recently been shown that more virulent pathogens tend to have reduced genome sizes compared to their non-pathogenic cousins (Merhej et al. 2009). Genes that are no longer necessary for the adaptive lifestyle of the pathogen are selectively inactivated through point mutation, insertion, or deletion. These genes are called “anti-virulence genes”. Intrinsic or external selective pressure can sometimes lead to the deletion of large loci of the genome that contain anti-virulence genes, generating “black holes” in the pathogen genome. Inactivation of anti-virulence genes leads to a pathogen that is highly adapted to its host niche and consequently more virulent (Maurelli 2007). Georgiades et al., in a comparative genome analysis of “bad bug” highly epidemic species and their closest non-epidemic species, concluded that pathogenic capacity is not the result of “virulence factors” but is the outcome of a virulent gene repertoire resulting from reduced genome repertoires (Georgiades & Raoult 2011). Paul et al. also found that gene loss and antibiotic resistance are the main driving forces behind the adaptation of Salmonella Typhimurium (Paul et al. 2016). Pan-genome analysis is widely used to detect gene loss by comparative genome analysis. The progressiveMauve tool (Darling et al. 2010) can be used to visualise gene loss; this method uses a different alignment score, called a sum-of-pairs breakpoint score, which enables the accurate detection of rearrangement breakpoints when genomes have different gene content.
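At the level of gene content, detecting candidate gene loss by comparative analysis can be reduced to a set comparison. The sketch below is a simplification under stated assumptions: gene families have already been clustered into shared names, the gene names are invented, and absence from the pathogen is treated as a candidate loss without testing whether the relatives instead gained the gene.

```python
def candidate_lost_genes(pathogen_genes, relatives_genes):
    """Gene families shared by all non-pathogenic relatives but absent from the pathogen."""
    if not relatives_genes:
        return set()
    shared_by_relatives = set.intersection(*(set(genes) for genes in relatives_genes))
    return shared_by_relatives - set(pathogen_genes)


pathogen = {"geneA", "geneB"}
relatives = [
    {"geneA", "geneB", "geneC", "geneD"},
    {"geneA", "geneC", "geneD", "geneE"},
]
print(sorted(candidate_lost_genes(pathogen, relatives)))
```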

393 3.4. Hotspot of recombination detection and implication in whole genome phyloge-

394 ny analysis.

395 The phylogenetic reconstruction of the whole genome to infer the genetic relationship

396 between strains continues to be problematic because bacteria occasionally undergo homolo-

397 gous recombination, whereby a fragment of a donor genome replaced that of the recipients.

398 The most robust computational tools that offer simultaneous recombination hotspot detection

399 and phylogenetic inferences are ClonalFrameML (Didelot & Wilson 2015) and

400 Gubbins(Croucher et al. 2015) (Table I-E). If we assume that recombination does not exist in

401 bacteria, all genomic positions would have been in the clonal frame and a phylogenetic

402 reconstruction would, therefore, reflect the clonal genealogy of the bacterial population. In the

33

403 presence of recombination events, this tool will detect imported loci containing elevated den-

404 sities of base substitutions suggestive of horizontal sequence transfer and accurately recon-

405 structing a maximum likelihood phylogeny. Recently, many bioinformatic tools have been

406 developed to detected recombination loci within the genome of the bacteria (Table I-F).

Within species, HGT between closely related organisms is difficult to detect because donors and recipients share orthologs and phylogenetic features. HGTector (Zhu et al. 2014) is one method that has been developed to detect HGT events. The core of this approach is an all-against-all BLASTP search of the protein product of each protein-coding gene of the genome(s) of interest against the genome database. This approach allows for flexibility because it can be scaled to the level of taxonomic/phylogenetic interest and can be adjusted to frequent taxonomic updates. A similar approach has been developed in HGT-Finder (Nguyen et al. 2015). IslandViewer (http://pathogenomics.sfu.ca/islandviewer) is also a robust web-based application that provides a user-friendly interface for predicting genomic islands (GIs) using the most accurate methods for genomic island prediction: IslandPick, IslandPath-DIMOB and SIGI-HMM. The graphical interface enables convenient viewing and downloading of island data in multiple formats, at both the chromosome and gene level, for method-specific or overlapping GI predictions (Langille & Brinkman 2009). Studies have shown that recombination plays a crucial role in phylogenic differentiation and dissemination within and outside lineages of L. pneumophila (David et al. 2017), S. pneumoniae (Chaguza et al. 2015), and E. faecalis (Raven et al. 2016). In our example, we identified and plotted recombination hotspots in 37 genomes of S. aureus (Figure 4). The red (non-specific) and blue (specific) blocks represent recombination loci along the genomes. The analysis revealed the probable cause of the genomic variability found in the phylogenic analysis.
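The BLASTP-based reasoning described above for HGT detection can be sketched in a few lines. This is a minimal illustration, not the HGTector or HGT-Finder implementation: the function name `flag_hgt_candidates`, the input format (gene, taxonomic group of the hit, bit score) and the `distal_min` threshold are all assumptions made for the example. A gene is flagged as a putative HGT candidate when most of its total bit score comes from hits outside the taxonomic group of interest.

```python
from collections import defaultdict


def flag_hgt_candidates(hits, close_groups, distal_min=0.75):
    """Flag genes whose BLASTP hits are dominated by distant taxa.

    hits: iterable of (gene_id, hit_taxonomic_group, bitscore) tuples, with
          self-hits already removed (hypothetical, pre-parsed BLASTP output).
    close_groups: set of taxonomic groups considered closely related.
    Returns a sorted list of (gene_id, distal_fraction) for genes whose share
    of bit score from groups outside `close_groups` is at least `distal_min`.
    """
    close = defaultdict(float)
    distal = defaultdict(float)
    for gene, group, bitscore in hits:
        if group in close_groups:
            close[gene] += bitscore
        else:
            distal[gene] += bitscore

    candidates = []
    for gene in sorted(set(close) | set(distal)):
        total = close[gene] + distal[gene]
        if total == 0:
            continue  # no informative hits for this gene
        distal_fraction = distal[gene] / total
        if distal_fraction >= distal_min:
            candidates.append((gene, distal_fraction))
    return candidates


if __name__ == "__main__":
    toy_hits = [
        ("geneA", "Staphylococcus", 500.0),
        ("geneA", "Staphylococcus", 480.0),
        ("geneA", "Bacillus", 100.0),
        ("geneB", "Enterococcus", 400.0),
        ("geneB", "Streptococcus", 350.0),
        ("geneB", "Staphylococcus", 50.0),
    ]
    print(flag_hgt_candidates(toy_hits, close_groups={"Staphylococcus"}))
```

Changing `close_groups` from a genus to a family or an order is how such a scheme would be scaled to the taxonomic level of interest, which mirrors the flexibility mentioned above.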


4. Perspectives and conclusion

Advances in whole genome sequencing technology have contributed significantly to our understanding of the role of genome recombination in pathogen behaviour in clinical microbiology. Recombination events have had a significant impact on taxonomic identification, phylogenic classification, the acquisition and dissemination of antimicrobial resistance genes, and the clonal spread of virulent pathogens. The challenge now is to implement the most relevant available computational tools on a real-time basis in clinical microbiology. The rapid emergence and genomic evolution of dangerous pathogens require a synergistic effort between microbiologists and computational scientists to develop and implement “dry lab” tools for the appropriate management of microbial infectious diseases and to overcome outbreaks.

Consent for publication

All co-authors have seen and approved the manuscript.

Availability of data and materials

All accession numbers of genome sequences used in this work are included in Supplementary File F1.

Funding

This work was supported and funded by IHU Fondation Méditerranée Infection.

Acknowledgements

We thank IHU Fondation Méditerranée Infection for funding this study.


Figure and table legends

Figure 1: Computational workflow of bacterial sequence data analysis that can be implemented in clinical microbiology. Orange boxes include the core and data entry point for bacterial sequence data analysis; green boxes show output data (results) necessary for health care. Blue outline boxes represent the computational process.

Figure 2: Pan-genome visualisation of 37 genome sequences of S. aureus. The pan-genome estimation was performed using Roary (Page et al. 2015). From left to right, the plot shows the core genome (hard core and soft core), shell genes and cloud genes.

Figure 3: Phylogenic tree of 37 genomes of S. aureus retrieved from NCBI/GenBank. A: topology considering recombination sites. B: topology excluding recombination sites. C: pan-genome parsimony tree. Whole genome alignment was performed using the all-against-all alignment method of Mugsy (Angiuoli & Salzberg 2011). Phylogeny was inferred using RAxML (Stamatakis 2014) and Gubbins (Croucher et al. 2015). The tree topologies show how genome recombination can significantly affect the genealogical relationships between strains of the same species. The accession numbers of the S. aureus genomes analysed in this section are listed in Supplementary File F1.

Figure 4: Hotspots of recombination detected in the genomes of S. aureus. The heatmap was generated using Gubbins (Croucher et al. 2015). The red stripes represent non-specific recombination hotspots and the blue stripes specific recombination hotspots.


References

A. Nisha. 2012. Molecular characterization of rpoB gene encoding the RNA polymerase β subunit in rifampin-resistant Mycobacterium tuberculosis strains from south India. African J. Biotechnol. 11:3160–3168. doi: 10.5897/AJB10.449.

Andam CP, Williams D, Gogarten JP. 2010. Natural taxonomy in light of horizontal gene transfer. Biol. Philos. 25:589–602. doi: 10.1007/s10539-010-9212-8.

Andrews S. 2010. FastQC: A quality control tool for high throughput sequence data. http://www.bioinformatics.babraham.ac.uk/projects/fastqc/.

Angiuoli SV, Salzberg SL. 2011. Mugsy: fast multiple alignment of closely related whole genomes. Bioinformatics. 27:334–42. doi: 10.1093/bioinformatics/btq665.

Arndt D et al. 2016. PHASTER: a better, faster version of the PHAST phage search tool. Nucleic Acids Res. 44:W16–21. doi: 10.1093/nar/gkw387.

Auch AF, Klenk H-P, Göker M. 2010. Standard operating procedure for calculating genome-to-genome distances based on high-scoring segment pairs. Stand. Genomic Sci. 2:142–148. doi: 10.4056/sigs.541628.

Aziz RK et al. 2008. The RAST Server: rapid annotations using subsystems technology. BMC Genomics. 9:75. doi: 10.1186/1471-2164-9-75.

Baker M. 2012. De novo genome assembly: what every biologist should know. Nat. Methods. 9:333–337. doi: 10.1038/nmeth.1935.

Bankevich A et al. 2012. SPAdes: a new genome assembly algorithm and its applications to single-cell sequencing. J. Comput. Biol. 19:455–77. doi: 10.1089/cmb.2012.0021.

Bertelli C, Greub G. 2013. Rapid bacterial genome sequencing: methods and applications in clinical microbiology. Clin Microbiol Infect. 19:803–813. doi: 10.1111/1469-0691.12217.

Bodily PM, Fujimoto MS, Snell Q, Ventura D, Clement MJ. 2015. ScaffoldScaffolder: Solving contig orientation via bidirected to directed graph reduction. Bioinformatics. 32:17–24. doi: 10.1093/bioinformatics/btv548.

Boetzer M, Henkel CV, Jansen HJ, Butler D, Pirovano W. 2011. Scaffolding pre-assembled contigs using SSPACE. Bioinformatics. 27:578–579. doi: 10.1093/bioinformatics/btq683.

Brüssow H, Canchaya C, Hardt W-D. 2004. Phages and the evolution of bacterial pathogens: from genomic rearrangements to lysogenic conversion. Microbiol. Mol. Biol. Rev. 68:560–602. doi: 10.1128/MMBR.68.3.560-602.2004.

Bryant J, Chewapreecha C, Bentley SD. 2014. Developing insights into the mechanisms of evolution of bacterial pathogens from whole genome sequences. Futur. Microbiol. 7:1283–1296. doi: 10.2217/fmb.12.108.

Camacho C et al. 2009. BLAST+: architecture and applications. BMC Bioinformatics. 10:421. doi: 10.1186/1471-2105-10-421.

Cannatelli A et al. 2014. MgrB inactivation is a common mechanism of colistin resistance in KPC-producing Klebsiella pneumoniae of clinical origin. Antimicrob. Agents Chemother. 58:5696–703. doi: 10.1128/AAC.03110-14.

Cao MD et al. 2015. Real-time strain typing and analysis of antibiotic resistance potential using Nanopore MinION sequencing. bioRxiv. 19356. doi: 10.1101/019356.

Castillo-Ramírez S et al. 2011. The Impact of Recombination on dN/dS within Recently Emerged Bacterial Clones. Balloux F, editor. PLoS Pathog. 7:e1002129. doi: 10.1371/journal.ppat.1002129.

Chaguza C, Cornick JE, Everett DB. 2015. Mechanisms and impact of genetic recombination in the evolution of Streptococcus pneumoniae. Comput. Struct. Biotechnol. J. 13:241–7. doi: 10.1016/j.csbj.2015.03.007.

Chen L et al. 2005. VFDB: A reference database for bacterial virulence factors. Nucleic Acids Res. 33:325–328. doi: 10.1093/nar/gki008.

Coil D, Jospin G, Darling AE. 2015. A5-miseq: An updated pipeline to assemble microbial genomes from Illumina MiSeq data. Bioinformatics. 31:587–589. doi: 10.1093/bioinformatics/btu661.

Contreras-Moreira B, Vinuesa P. 2013. GET_HOMOLOGUES, a versatile software package for scalable and robust microbial pangenome analysis. American Society for Microbiology. doi: 10.1128/AEM.02411-13.

Croucher NJ et al. 2015. Rapid phylogenetic analysis of large samples of recombinant bacterial whole genome sequences using Gubbins. Nucleic Acids Res. 43:e15. doi: 10.1093/nar/gku1196.

Cuccuru G et al. 2014. Orione, a web-based framework for NGS analysis in microbiology. Bioinformatics. 30:1928–1929. doi: 10.1093/bioinformatics/btu135.

Darling ACE, Mau B, Blattner FR, Perna NT. 2004. Mauve: multiple alignment of conserved genomic sequence with rearrangements. Genome Res. 14:1394–1403. doi: 10.1101/gr.2289704.

Darling AE, Mau B, Perna NT. 2010. progressiveMauve: multiple genome alignment with gene gain, loss and rearrangement. PLoS One. 5:e11147. doi: 10.1371/journal.pone.0011147.

David S et al. 2017. Dynamics and impact of homologous recombination on the evolution of Legionella pneumophila. Didelot X, editor. PLOS Genet. 13:e1006855. doi: 10.1371/journal.pgen.1006855.

Didelot X, Wilson DJ. 2015. ClonalFrameML: efficient inference of recombination in whole bacterial genomes. PLoS Comput. Biol. 11:e1004041. doi: 10.1371/journal.pcbi.1004041.

Diene SM et al. 2016. Comparative genomics of Neisseria meningitidis strains: new targets for molecular diagnostics. Clin. Microbiol. Infect. doi: 10.1016/j.cmi.2016.03.022.

van Dongen S. 2000. Graph clustering by flow simulation. PhD thesis, University of Utrecht. doi: 10.1016/j.cosrev.2007.05.001.

Fenner L et al. 2011. ‘Pseudo-Beijing’: evidence for convergent evolution in the direct repeat region of Mycobacterium tuberculosis. PLoS One. 6:e24737. doi: 10.1371/journal.pone.0024737.

Ferjani S et al. 2014. Multidrug resistance and high virulence genotype in uropathogenic Escherichia coli due to diffusion of ST131 clonal group producing CTX-M-15: an emerging problem in a Tunisian hospital. Folia Microbiol. (Praha). 59:257–262. doi: 10.1007/s12223-013-0292-0.

Fu L, Niu B, Zhu Z, Wu S, Li W. 2012. CD-HIT: Accelerated for clustering the next-generation sequencing data. Bioinformatics. 28:3150–3152. doi: 10.1093/bioinformatics/bts565.

Georgiades K, Raoult D. 2011. Genomes of the most dangerous epidemic bacteria have a virulence repertoire characterized by fewer genes but more toxin-antitoxin modules. PLoS One. 6:e17962. doi: 10.1371/journal.pone.0017962.

Gertz E. 2005. BLAST scoring parameters. Nlm. Nih. Gov/Blast/Documents/Developer/Scoring. 1–54.

Guarnaccia M et al. 2014. Is this the real time for genomics? Genomics. 103:177–82. doi: 10.1016/j.ygeno.2014.02.003.

Gupta SK et al. 2014. ARG-ANNOT, a new bioinformatic tool to discover antibiotic resistance genes in bacterial genomes. Antimicrob. Agents Chemother. 58:212–220. doi: 10.1128/AAC.01310-13.

Harrison OB et al. 2016. Genomic analyses of Neisseria gonorrhoeae reveal an association of the gonococcal genetic island with antimicrobial resistance. doi: 10.1016/j.jinf.2016.08.010.

Hentschel U, Hacker J. 2001. Pathogenicity islands: the tip of the iceberg. Microbes Infect. 3:545–548. doi: 10.1016/S1286-4579(01)01410-1.

Huang H, Jin Q, Ma Y, Chen X, Zhuang Y. 2002. Characterization of rpoB mutations in rifampicin-resistant Mycobacterium tuberculosis isolated in China. Tuberculosis (Edinb). 82:79–83.

Illumina. 2016. An Introduction to Next-Generation Sequencing Technology. 1–16. Pub. No. 770-2012-008.

Imperi F et al. 2011. The genomics of Acinetobacter baumannii: Insights into genome plasticity, antimicrobial resistance and pathogenicity. IUBMB Life. 63:1068–1074. doi: 10.1002/iub.531.

Kaas RS et al. 2012. Estimating variation within the genes and inferring the phylogeny of 186 sequenced diverse Escherichia coli genomes. BMC Genomics. 13:577. doi: 10.1186/1471-2164-13-577.

Kong Z et al. 2016. Whole-Genome Sequencing for the Investigation of a Hospital Outbreak of MRSA in China. PLoS Med. 3:1–12. doi: 10.1371/journal.pone.0149844.

Konstantinidis KT, Tiedje JM. 2005. Genomic insights that advance the species definition for prokaryotes. Proc. Natl. Acad. Sci. 102:2567–2572. doi: 10.1073/pnas.0409727102.

Koonin EV. 2016. Horizontal gene transfer: essentiality and evolvability in prokaryotes, and roles in evolutionary transitions. F1000Research. 5:1805. doi: 10.12688/f1000research.8737.1.

Kristensen DM et al. 2010. A low-polynomial algorithm for assembling clusters of orthologous groups from intergenomic symmetric best matches. Bioinformatics. 26:1481–1487. doi: 10.1093/bioinformatics/btq229.

Kryazhimskiy S, Plotkin JB. 2008. The population genetics of dN/dS. Gojobori T, editor. PLoS Genet. 4:e1000304. doi: 10.1371/journal.pgen.1000304.

Land M et al. 2015. Insights from 20 years of bacterial genome sequencing. Funct. Integr. Genomics. 15:141–161. doi: 10.1007/s10142-015-0433-4.

Langille MGI, Brinkman FSL. 2009. IslandViewer: an integrated interface for computational identification and visualization of genomic islands. Bioinformatics. 25:664–665. doi: 10.1093/bioinformatics/btp030.

Law I et al. 1996. Emerging Infectious Diseases. Emerg. Infect. Dis.

Lekunberri I, Subirats J, Borrego CM, Balcázar JL. 2017. Exploring the contribution of bacteriophages to antibiotic resistance. doi: 10.1016/j.envpol.2016.11.059.

Li L, Stoeckert CJ, Roos DS. 2003. OrthoMCL: Identification of ortholog groups for eukaryotic genomes. Genome Res. 13:2178–2189. doi: 10.1101/gr.1224503.

Li W, Godzik A. 2006. Cd-hit: A fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics. 22:1658–1659. doi: 10.1093/bioinformatics/btl158.

Lin EA et al. 2009. Natural Transformation of Helicobacter pylori Involves the Integration of Short DNA Fragments Interrupted by Gaps of Variable Size. Blanke SR, editor. PLoS Pathog. 5:e1000337. doi: 10.1371/journal.ppat.1000337.

Loman NJ, Pallen MJ. 2015. Twenty years of bacterial genome sequencing. Nat. Rev. Microbiol. 13:787–794. doi: 10.1038/nrmicro3565.

Long SW et al. 2013. A genomic day in the life of a clinical microbiology laboratory. J. Clin. Microbiol. doi: 10.1128/JCM.03237-12.

Luo R, Liu B, Xie Y, Li Z, Huang W, Yuan J, Wang J. 2012. SOAPdenovo2: an empirically improved memory-efficient short-read de novo assembler. Gigascience. 1:18. doi: 10.1186/2047-217X-1-18.

Manson JM, Hancock LE, Gilmore MS. 2010. Mechanism of chromosomal transfer of Enterococcus faecalis pathogenicity island, capsule, antimicrobial resistance, and other traits. Proc. Natl. Acad. Sci. U. S. A. 107:12269–74. doi: 10.1073/pnas.1000139107.

Marti E, Variatza E, Balcázar JL. 2014. Bacteriophages as a reservoir of extended-spectrum β-lactamase and fluoroquinolone resistance genes in the environment. Clin. Microbiol. Infect. 20:O456–O459. doi: 10.1111/1469-0691.12446.

Maurelli AT. 2007. Black holes, antivirulence genes, and gene inactivation in the evolution of bacterial pathogens. FEMS Microbiol. Lett. 267:1–8. doi: 10.1111/j.1574-6968.2006.00526.x.

McGann P et al. 2016. Real Time Application of Whole Genome Sequencing for Outbreak Investigation – What is an achievable Turnaround Time? Diagn. Microbiol. Infect. Dis. doi: 10.1016/j.diagmicrobio.2016.04.020.

Mellmann A et al. 2016. Real-Time Genome Sequencing of Resistant Bacteria Provides Precision Infection Control in an Institutional Setting. Bourbeau P, editor. J. Clin. Microbiol. 54:2874–2881. doi: 10.1128/JCM.00790-16.

Menouni R, Hutinet G, Petit MA, Ansaldi M. 2015. Bacterial genome remodeling through bacteriophage recombination. FEMS Microbiol. Lett. 362. doi: 10.1093/femsle/fnu022.

Merhej V, Royer-Carenzi M, Pontarotti P, Raoult D. 2009. Massive comparative genomic analysis reveals convergent evolution of specialized bacteria. Biol. Direct. 4:13. doi: 10.1186/1745-6150-4-13.

Messerer M, Fischer W, Schubert S. 2017. Investigation of horizontal gene transfer of pathogenicity islands in Escherichia coli using next-generation sequencing. Anjum M, editor. PLoS One. 12:e0179880. doi: 10.1371/journal.pone.0179880.

Mikalsen T et al. 2015. Investigating the mobilome in clinically important lineages of Enterococcus faecium and Enterococcus faecalis. BMC Genomics. 16:282. doi: 10.1186/s12864-015-1407-6.

Mira A et al. 2010. The bacterial pan-genome: a new paradigm in microbiology. Int. Microbiol. 13:45–57. doi: 10.2436/20.1501.01.110.

Navarre WW. 2016. Chapter Three – The Impact of Gene Silencing on Horizontal Gene Transfer and Bacterial Evolution. In: Advances in Microbial Physiology. Vol. 69, pp. 157–186. doi: 10.1016/bs.ampbs.2016.07.004.

Nguyen M, Ekstrom A, Li X, Yin Y. 2015. HGT-Finder: A New Tool for Horizontal Gene Transfer Finding and Application to Aspergillus genomes. Toxins (Basel). 7:4035–4053. doi: 10.3390/toxins7104035.

Nhantumbo AA et al. 2015. Frequency of Pathogenic Paediatric Bacterial Meningitis in Mozambique: The Critical Role of Multiplex Real-Time Polymerase Chain Reaction to Estimate the Burden of Disease. PLoS One. 10:e0138249. doi: 10.1371/journal.pone.0138249.

Njage PMK, Buys EM. 2015. Pathogenic and commensal Escherichia coli from irrigation water show potential in transmission of extended spectrum and AmpC β-lactamases determinants to isolates from lettuce. Microb. Biotechnol. 8:462–473. doi: 10.1111/1751-7915.12234.

Ochman H, Lawrence JG, Groisman EA. 2000. Lateral gene transfer and the nature of bacterial innovation. Nature. 405:299–304. doi: 10.1038/35012500.

Olaitan AO et al. 2013. Emergence of multidrug-resistant Acinetobacter baumannii producing OXA-23 carbapenemase, Nigeria. Int. J. Infect. Dis. 17:e469–e470. doi: 10.1016/j.ijid.2012.12.008.

Olaitan AO, Diene SM, Assous MV, Rolain J-M. 2016. Genomic Plasticity of Multidrug-Resistant NDM-1 Positive Clinical Isolate of Providencia rettgeri. Genome Biol. Evol. 8:723–8. doi: 10.1093/gbe/evv195.

Olumuyiwa Olaitan A, Diene SM, Victor Assous M, Rolain JM. 2016. Genomic plasticity of multidrug-resistant NDM-1 positive clinical isolate of Providencia rettgeri. Genome Biol. Evol. 8:723–728. doi: 10.1093/gbe/evv195.

Page AJ et al. 2015. Roary: Rapid large-scale prokaryote pan genome analysis. Bioinformatics. 31:btv421. doi: 10.1093/bioinformatics/btv421.

Palmer KL, Kos VN, Gilmore MS. 2010. Horizontal gene transfer and the genomics of enterococcal antibiotic resistance. Curr. Opin. Microbiol. 13:632–9. doi: 10.1016/j.mib.2010.08.004.

Paul S, Sokurenko EV, Chattopadhyay S. 2016. Corrected Genome Annotations Reveal Gene Loss and Antibiotic Resistance as Drivers in the Fitness Evolution of Salmonella Typhimurium. J. Bacteriol. JB.00545-16. doi: 10.1128/JB.00545-16.

Pond SLK, Poon A, Frost SDW. 2009. Estimating selection pressures on alignments of coding sequences. Lemey P, Salemi M, Vandamme A, editors. 1–81.

Prentice MB. 2004. Bacterial comparative genomics. Genome Biol. 5:338. doi: 10.1186/gb-2004-5-8-338.

Raven KE et al. 2016. Genome-based characterization of hospital-adapted Enterococcus faecalis lineages. Nat. Microbiol. 1. doi: 10.1038/nmicrobiol.2015.33.

Regmi SM et al. 2015. Whole genome sequence analysis of multidrug-resistant Mycobacterium tuberculosis Beijing isolates from an outbreak in Thailand. Mol. Genet. Genomics. 290:1933–41. doi: 10.1007/s00438-015-1048-0.

Reyna F, Huesca M, Ctor V, Lez G, Fuchs ALY. 1995. Salmonella typhimurium gyrA Mutations Associated with Fluoroquinolone Resistance. Antimicrob. Agents Chemother. 39:1621–1623.

Rolain J-M et al. 2013. Real-time sequencing to decipher the molecular mechanism of resistance of a clinical pan-drug-resistant Acinetobacter baumannii isolate from Marseille, France. Antimicrob. Agents Chemother. 57:592–6. doi: 10.1128/AAC.01314-12.

Schmidt H, Hensel M. 2004. Pathogenicity islands in bacterial pathogenesis. Clin. Microbiol. Rev. 17:14–56. doi: 10.1128/cmr.17.1.14-56.2004.

Schneider G et al. 2004. The Pathogenicity Island-Associated K15 Capsule Determinant Exhibits a Novel Genetic Structure and Correlates with Virulence in Uropathogenic Escherichia coli Strain 536. Infect. Immun. 72:5993–6001. doi: 10.1128/IAI.72.10.5993-6001.2004.

Schouls LM, Schot CS, Jacobs JA. 2003. Horizontal transfer of segments of the 16S rRNA genes between species of the Streptococcus anginosus group. J. Bacteriol. 185:7241–6. doi: 10.1128/JB.185.24.7241-7246.2003.

Seemann T. 2014. Prokka: rapid prokaryotic genome annotation. Bioinformatics. 30:2068–9. doi: 10.1093/bioinformatics/btu153.

Semmler T et al. 2016. A Look into the Melting Pot: The mecC-Harboring Region Is a Recombination Hot Spot in Staphylococcus stepanovicii. PLoS One. 11:e0147150. doi: 10.1371/journal.pone.0147150.

Sequencing H. 2011. CLC Genomics Workbench. Workbench. 1–4.

Simpson JT et al. 2009. ABySS: A parallel assembler for short read sequence data. Genome Res. 19:1117–1123. doi: 10.1101/gr.089532.108.

Sivertsen A et al. 2016. A Silenced vanA Gene Cluster on a Transferable Plasmid Caused an Outbreak of Vancomycin-Variable Enterococci. Antimicrob. Agents Chemother. 60:4119–4127. doi: 10.1128/AAC.00286-16.

Sneath PHA. 1986. Estimating uncertainty in evolutionary trees from Manhattan-distance triads. Syst. Zool. 35:470–488. doi: 10.2307/2413110.

Snipen L, Ussery DW. 2010. Standard operating procedure for computing pangenome trees. Stand. Genomic Sci. 2:135–141. doi: 10.4056/sigs.38923.

Snyder EE et al. 2007. PATRIC: The VBI PathoSystems Resource Integration Center. Nucleic Acids Res. 35. doi: 10.1093/nar/gkl858.

Stamatakis A. 2014. RAxML version 8: A tool for phylogenetic analysis and post-analysis of large phylogenies. Bioinformatics. doi: 10.1093/bioinformatics/btu033.

Subirats J, Sànchez-Melsió A, Borrego CM, Balcázar JL, Simonet P. 2016. Metagenomic analysis reveals that bacteriophages are reservoirs of antibiotic resistance genes. Int. J. Antimicrob. Agents. 48:163–167. doi: 10.1016/j.ijantimicag.2016.04.028.

Syvanen M. 2012. Evolutionary implications of horizontal gene transfer. Annu. Rev. Genet. 46:341–58. doi: 10.1146/annurev-genet-110711-155529.

Török ME, Peacock SJ. 2012. Rapid whole-genome sequencing of bacterial pathogens in the clinical microbiology laboratory-pipe dream or reality? J. Antimicrob. Chemother. 67:2307–2308. doi: 10.1093/jac/dks247.

Vebø HC, Solheim M, Snipen L, Nes IF, Brede DA. 2010. Comparative genomic analysis of pathogenic and probiotic Enterococcus faecalis isolates, and their transcriptional responses to growth in human urine. PLoS One. 5. doi: 10.1371/journal.pone.0012489.

Vila J et al. 1994. Association between double mutation in gyrA gene of ciprofloxacin-resistant clinical isolates of Escherichia coli and MICs. Antimicrob. Agents Chemother. 38:2477–9.

Vinuesa P, Contreras-Moreira B. Robust identification of orthologues and paralogues for microbial pan-genomics using GET_HOMOLOGUES: a case study of pIncA/C plasmids.

Xiao J, Zhang Z, Wu J, Yu J. 2015. A Brief Review of Software Tools for Pangenomics. Genomics. Proteomics Bioinforma. 13:73–76. doi: 10.1016/j.gpb.2015.01.007.

Yao Y et al. 2017. Insights into a Novel blaKPC-2-Encoding IncP-6 Plasmid Reveal Carbapenem-Resistance Circulation in Several Enterobacteriaceae Species from Wastewater and a Hospital Source in Spain. Front. Microbiol. 8:1143. doi: 10.3389/fmicb.2017.01143.

Yap K-P, Gan HM, Teh CSJ, Chai LC, Thong KL. 2014. Comparative genomics of closely related Salmonella enterica serovar Typhi strains reveals genome dynamics and the acquisition of novel pathogenic elements. BMC Genomics. 15:1007. doi: 10.1186/1471-2164-15-1007.

Yosef I, Manor M, Kiro R, Qimron U. 2015. Temperate and lytic bacteriophages programmed to sensitize and kill antibiotic-resistant bacteria. Proc. Natl. Acad. Sci. 2015:201500107. doi: 10.1073/pnas.1500107112.

Young JM. 2001. Implications of alternative classifications and horizontal gene transfer for bacterial taxonomy. Int. J. Syst. Evol. Microbiol. 945–953.

Zankari E et al. 2012. Identification of acquired antimicrobial resistance genes. J. Antimicrob. Chemother. 67:2640–2644. doi: 10.1093/jac/dks261.

Zerbino DR, Birney E. 2008. Velvet: Algorithms for de novo short read assembly using de Bruijn graphs. Genome Res. 18:821–829. doi: 10.1101/gr.074492.107.

Zhu Q, Kosoy M, Dittmar K. 2014. HGTector: an automated method facilitating genome-wide discovery of putative horizontal gene transfers. BMC Genomics. 15:717. doi: 10.1186/1471-2164-15-717.

Zyga S, Zografakis-Sfakianakis M. 2011. Emerging and re-Emerging Infectious Diseases: A potential pandemic threat. 3:159–168.

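[Figure 1: image not reproduced]

[Figure 2: image not reproduced; plot labels: hard core, soft core, shell genes, cloud genes]

[Figure 3: image not reproduced; panels A, B and C]

[Figure 4: image not reproduced; genome position axis from 0 Mb through 1.42 Mb to 2.82 Mb]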

Summary

The development of cost-effective next-generation sequencing (NGS) technology and recent advances in whole genome sequencing (WGS) have had a significant impact on our knowledge of the behaviour of the bacteria that cause infectious diseases. Over the past twenty years, the powerful combination of WGS and computational data analysis has transformed our understanding of how bacteria live, evolve and interact with their communities and hosts, and how they cause infectious diseases.

Three technologies were initially developed, namely the 454 platform (Roche), the SOLiD platform (Life Technologies) and the Illumina platform (Illumina). The more recent development of smaller bench-top sequencers, such as the MiSeq platform (Illumina) and the Ion Torrent platform (Life Technologies), together with decreasing turnaround times and costs, brought the potential to introduce rapid whole bacterial genome sequencing into clinical diagnostic laboratories and outbreak management. The concept of “Real-Time Genomics” (RTG) was initially coined by computational biology companies to offer in-depth genomic analysis solutions to researchers using NGS technology and to give meaning to DNA and RNA sequencing data. It was applied to big data from high-throughput sequencing in clinical pathology (tumours), considering the sensitivity and efficiency of the process. The emergence of pathogenic multidrug-resistant bacteria and their spread around the world have become a severe threat to human health. Bacterial genome recombination through point mutations and gene exchange, including horizontal gene transfer and gene loss, has contributed significantly to the adaptation and evolution of bacteria and has become the driving force of bacterial survival in host niches. Here, we review publicly available computational biology tools that have frequently been used for real-time genomics analysis in clinical microbiology over the past decade, and we then highlight how WGS analysis has deciphered the impact of bacterial genome recombination in clinical microbiology.


Chapter II: Comparative genomic analysis applied in clinical microbiology


Summary

Recent developments in Matrix-Assisted Laser Desorption Ionization-Time of Flight Mass Spectrometry (MALDI-TOF MS) have enabled microbial typing at low cost, with success rates that vary depending on the microorganism in question. This technology allows easier and faster diagnosis of human pathogens than conventional phenotypic and molecular identification methods, with high reliability and cost-effectiveness. However, MALDI-TOF MS analysis targets only a limited number of bacterial proteins, particularly growth-dependent proteins.

The recent advances in NGS, the progress in WGS and the availability of well-established tools for the automated analysis of sequence data and databases have contributed significantly to understanding bacterial adaptation and the genomic evolution of human pathogens. WGS is used to investigate the acquisition and spread of antimicrobial resistance, virulence, and pathogenicity mediated by mobile genetic elements.

During this PhD work, we investigated the spread of S. saprophyticus causing UTIs in the Marseille community using MALDI-TOF MS spectral data, comparing data from Marseille and Nice. One isolate was sequenced and analysed by comparative genome analysis against the publicly available genomes of S. saprophyticus. We also investigated the spread and acquisition of antimicrobial resistance genes between E. faecalis and E. faecium using comparative genome analysis.


I. MALDI-TOF MS spectral data analysis and comparative genome analysis of S. saprophyticus


Introduction

Staphylococcus saprophyticus, a coagulase-negative staphylococcus (CoNS), is mainly associated with urinary tract infections (UTIs) in humans [1,2]. Sexually active women are the most widely affected: among women aged 16-25 years with UTI, S. saprophyticus accounted for up to 42.3% of cases [3,4]. Foods are a possible source of contamination, as the microorganism has been found in various food samples, mainly those containing pork or beef [5]. Strikingly, the association between meat processing and S. saprophyticus UTIs supports the hypothesis of an animal reservoir [6]. To date, it is not known how S. saprophyticus, initially described as a saprophytic bacterium and considered a contaminant in clinical microbiology, has evolved and adapted to urinary tract niches. In 1985, urease production and renal and ureteral stones were found to be associated with S. saprophyticus infection [7]. It was only in 2005 that the first genome of S. saprophyticus was sequenced; a complete analysis revealed the presence of the uro-adherence protein UafA, associated with the uro-pathogenicity and uro-adherence of the bacterium [8], which was later correctly characterised by Sakinc et al. [9]. Recently, the UafA protein, also known as serine-aspartate repeat protein or serine-rich protein [10], has become a central interest in the study of S. saprophyticus virulence and pathogenicity [11,12]. Hence, it is worth understanding how this pathogen’s genome has evolved and how the species subsequently became the second leading cause of UTI worldwide.

In December 2014, our surveillance system identified an abnormal increase in S. saprophyticus UTIs in four University Hospitals in Marseille, indicating a suspected community outbreak. Matrix-Assisted Laser Desorption Ionization-Time of Flight Mass Spectrometry (MALDI-TOF MS) spectral analysis was used to analyse the expansion of strain clusters, comparing strains from Marseille with those from Nice during the same period. This analysis revealed a geographically restricted clonal expansion of S. saprophyticus strain clusters in Marseille as compared to Nice. We published this work in Article II, entitled “Using MALDI-TOF MS typing method to decipher outbreak: the case of Staphylococcus saprophyticus causing urinary tract infections (UTIs) in Marseille, France”. We sequenced a strain isolated from a female patient at the “Hôpital La Timone” in Marseille, France, who had experienced a UTI in December 2014, and compared it with all genomes of S. saprophyticus available in the NCBI/GenBank database as of December 2016 to investigate the genomic evolution of this bacterial species and its genomic characteristics. Our findings suggest that S. saprophyticus, initially a saprophytic bacterium, has drifted towards becoming a pathogenic bacterium through accumulated evolutionary events, including massive genome recombination and single nucleotide polymorphisms (SNPs). We submitted this work as Article III, entitled “Comparative genomic analysis of Staphylococcus saprophyticus reveals a drift from “saprophytic” to “pathogenic” bacteria due to extensive genomic recombination.”


Reference

1. Le Bouter A. Infections à Staphylococcus saprophyticus. J. des Anti-Infectieux [Internet]. Elsevier Masson SAS; 2011;13:12–9.

2. Loulergue J, Laudat P, Audurier A. Infections urinaires à Staphylococcus saprophyticus. Médecine Mal. Infect. 1982;12:72–6.

3. Wallmark G, Arremark I, Telander B. Staphylococcus saprophyticus: a frequent cause of acute urinary tract infection among female outpatients. J. Infect. Dis. [Internet]. 1978 [cited 2017 Jan 6];138:791–7.

4. Le Bouter A. Infections à Staphylococcus saprophyticus. J. des Anti-infectieux [Internet]. 2011 [cited 2015 Nov 2];13:12–9.

5. Kim BS, Kim CT, Park BH, Kwon S, Cho YJ, Kim N, et al. Draft genome sequence of Staphylococcus saprophyticus subsp. saprophyticus M1-1, isolated from the gills of a Korean rockfish, sebastes schlegeli hilgendorf, after high hydrostatic pressure processing. J. Bacteriol. 2012. p. 4441–2.

6. Hedman P, Ringertz O, Lindström M, Olsson K. The origin of Staphylococcus saprophyticus from cattle and pigs. Scand. J. Infect. Dis. [Internet]. 1993 [cited 2017 Jan 6];25:57–60.

7. Raz R, Colodner R, Kunin CM. Who are you--Staphylococcus saprophyticus? Clin. Infect. Dis. 2005;40:896–8.

8. Kuroda M, Yamashita A, Hirakawa H, Kumano M, Morikawa K, Higashide M, et al. Whole genome sequence of Staphylococcus saprophyticus reveals the pathogenesis of uncomplicated urinary tract infection. Proc. Natl. Acad. Sci. U. S. A. [Internet]. 2005 [cited 2016 Mar 11];102:13272–7.

9. Sakinc T, Kleine B, Gatermann SG. SdrI, a serine-aspartate repeat protein identified in Staphylococcus saprophyticus strain 7108, is a collagen-binding protein. Infect. Immun. [Internet]. 2006 [cited 2016 Mar 11];74:4615–23.

10. King NP, Beatson SA, Totsika M, Ulett GC, Alm RA, Manning PA, et al. UafB is a serine-rich repeat adhesin of Staphylococcus saprophyticus that mediates binding to fibronectin, fibrinogen and human uroepithelial cells. Microbiology. 2011;157:1161–75.

11. Marlinghaus L, Huß M, Korte-Berwanger M, Sakinc-Güler T, Gatermann SG. D-serine transporter in Staphylococcus saprophyticus identified. FEMS Microbiol. Lett. [Internet]. 2016 [cited 2016 Jun 11];

12. Szabados F, Mohner A, Kleine B, Gatermann SG. Staphylococcus saprophyticus surface-associated protein (Ssp) is associated with lifespan reduction in Caenorhabditis elegans. Virulence [Internet]. 2013 [cited 2016 Mar 11];4:604–11.


Article II

Using MALDI-TOF MS typing method to decipher outbreak: the case of Staphylococcus saprophyticus causing urinary tract infections (UTIs) in Marseille, France.

Kodjovi D. Mlaga, Grégory Dubourg, Cedric Abat, Hervé Chaudet, Laurène Lotte, Seydina M. Diene, Didier Raoult, Raymond Ruimy and Jean-Marc Rolain

European Journal of Clinical Microbiology & Infectious Diseases. pp. 1–7, Aug. 2017.

Impact factor: 2.72

Eur J Clin Microbiol Infect Dis DOI 10.1007/s10096-017-3069-6

ORIGINAL ARTICLE

Using MALDI-TOF MS typing method to decipher outbreak: the case of Staphylococcus saprophyticus causing urinary tract infections (UTIs) in Marseille, France

K. D. Mlaga1 & G. Dubourg1,2 & C. Abat1,2 & H. Chaudet1,2 & L. Lotte3 & S. M. Diene1,2 & D. Raoult1,2 & R. Ruimy3,4 & J.-M. Rolain1,2

Received: 11 May 2017 / Accepted: 12 July 2017 © Springer-Verlag GmbH Germany 2017

* J.-M. Rolain
[email protected]

1 URMITE, UM63, CNRS 7278, IRD 198, INSERM 1095, IHU-Méditerranée Infection, Aix-Marseille Université, 19–21 Boulevard Jean Moulin, 13385 Marseille Cedex 05, France
2 Pôle des Maladies Infectieuses et Tropicales Clinique et Biologique, Fédération de Bactériologie-Hygiène-Virologie, University Hospital Centre Timone, Institut Hospitalo-Universitaire (IHU) Méditerranée Infection, Assistance Publique - Hôpitaux de Marseille, Marseille, France
3 Department of Bacteriology at Nice Academic Hospital, Nice Medical University, Nice, France
4 INSERM U1065 (C3M), Bacterial Toxins in Host Pathogen Interactions, C3M, Bâtiment Universitaire Archimed, Nice, France

Abstract

Staphylococcus saprophyticus is one of the leading causes of urinary tract infections (UTI). In December 2014, our surveillance system identified an abnormal increase in S. saprophyticus causing UTIs in four university hospitals in Marseille, indicating a suspected community S. saprophyticus UTI outbreak. This was detected by our surveillance system BALYSES (Bacterial real-time Laboratory-based Surveillance System). The S. saprophyticus/Escherichia coli UTI ratio increased three-fold from 0.0084 in 2002 to 0.025 in December 2015 in Marseille, with an abnormal peak in December 2014 and an annual estimated ratio trend of 5 × 10⁻⁶ (p-value < 10⁻³). Matrix-Assisted Laser Desorption Ionisation-Time of Flight Mass Spectrometry (MALDI-TOF MS) spectral analysis of strains was used to analyse strain cluster expansion, comparing strains from Marseille to those from Nice during the same period. MALDI-TOF MS spectral analysis revealed a geographically restricted clonal expansion of the strain clusters in Marseille as compared to Nice. Our findings suggest (i) a geographically restricted expansion of specific S. saprophyticus strain clusters circulating in Marseille, and (ii) that MALDI-TOF MS can be used as a cost-effective tool to investigate an outbreak.

Introduction

Staphylococcus saprophyticus, a coagulase-negative staphylococcus (CoNS), is mainly associated with urinary tract infections (UTIs) in humans through its uro-tropism. It was established that sexually active women were the most widely affected, as the prevalence among women aged 16–25 years with UTI accounted for up to 42.3% of cases [1, 2]. The gastrointestinal tract is currently considered the main reservoir of S. saprophyticus. Moreover, rectal and vaginal colonisation by S. saprophyticus is associated with UTIs due to this microorganism [3]. Foods are a possible source of contamination, as the microorganism has been found in various food samples, mainly in those containing pork or beef [4]. Strikingly, the association between meat processing and S. saprophyticus UTIs supports the hypothesis of an animal reservoir [5]. Recent developments in Matrix-Assisted Laser Desorption Ionisation-Time of Flight Mass Spectrometry (MALDI-TOF MS) have enabled microbial typing at a low cost, with varying success rates depending on the microorganism in question [6, 7]. This revolutionary technology allows for easier and faster diagnosis of human pathogens than conventional phenotypic and molecular identification methods, with unquestionable reliability and cost-effectiveness [8]. A study has shown that MALDI-TOF MS can be used for correct and accurate species identification of most Staphylococcus species with 99.3% sensitivity [9]. However, a low identification rate for S. saprophyticus has been reported [10]. In December 2014, we observed an abnormal increase in the number of patients infected with S. saprophyticus involved in UTIs in our hospital in Marseille (France), detected by our automated bacterial surveillance system BALYSES (Bacterial real-time Laboratory-based Surveillance System) [11]. This suggested a possible spread of clonal strains and a probable outbreak over this period. In this study, we investigate the epidemiological increase in the number of UTI cases caused by S. saprophyticus from January 2002 to December 2015 and propose a MALDI-TOF MS based typing method to investigate the geographical spread of S. saprophyticus clusters causing UTIs in the Marseille community.

Material and method

Study setting and samples

The data analysed were collected from the four university hospitals in Marseille (Hôpital du Nord, Hôpital du Sud, Hôpital la Conception and Hôpital La Timone) falling within the Assistance Publique–Hôpitaux de Marseille (AP-HM). The diagnosis of UTIs in Marseille follows a local guideline (http://www.infectiologie.com/UserFiles/File/medias/Recos/2014-infections_urinaires-court.pdf). A cyto-bacteriological urine exam is indicated for any clinical suspicion of UTI. The leucocyturia threshold is set at >10⁴ CFU/ml. The threshold of significant bacteriuria depends on the bacterial species involved and on the sex of the patient. For S. saprophyticus and E. coli, the bacteriuria threshold is set at >10³ CFU/ml in both men and women. The data were collected from January 2002 to December 2015 for epidemiological analysis. For MALDI-TOF MS spectral analysis, 240 strains of S. saprophyticus were collected from Marseille university hospitals (MUH) as the study area and 83 strains from Nice University Hospital (NUH) as a control area, between January 2014 and December 2015. All strains were isolated from patients experiencing UTIs.

Retrospective analysis of S. saprophyticus causing UTIs in Marseille

The analysis was performed using data from a database of 14 years of historical clinical data. Data on S. saprophyticus infections were retrieved from this retrospective database to build another database that only included data on patients who experienced S. saprophyticus UTIs over the study period (January 2002 to December 2015). All duplicates were removed based on patient ID. The annual ratios of S. saprophyticus UTIs were then calculated by dividing the annual number of patients who experienced UTIs due to S. saprophyticus by the annual number of patients who experienced UTIs due to E. coli from January 2002 to December 2015. A total of 888 patients having experienced S. saprophyticus UTIs were reported in the various units of the MUH. Over the same period, 4512 patients experienced E. coli UTI.

Identification based on MALDI-TOF typing

The Microflex spectrometer (Bruker Daltonics, Leipzig, Germany) method was used following a previously described protocol [8]. A colony from a culture agar plate was spread on an MSP 96 MALDI-TOF target plate (Bruker). Two distinct colonies were tested for S. saprophyticus. Each smear was covered with 2 μL of matrix solution (saturated solution of alpha-cyano-4-hydroxycinnamic acid in 50% acetonitrile and 2.5% trifluoroacetic acid) and allowed to dry for 5 min. Spectra were recorded in the positive linear mode for the mass range of 2000 to 20,000 Da (parameter settings: ion source 1 (IS1), 20 kV; IS2, 18.5 kV; lens, 7 kV). A spectrum was obtained after 240 shots with variable laser power. The time of acquisition was between 30 s and 1 min per spot. The 20 SIT1T spectra were imported into MALDI BioTyper 3.0 software (Bruker) and analysed by standard pattern matching (with default parameter settings) against the main spectra of 7,379 bacteria. A score of ≥2 with a validly published species enabled identification at the species level, a score of ≥1.7 and <2 enabled identification at the genus level, and a score of <1.7 was considered to be an invalid result [12].

S. saprophyticus MALDI-TOF MS spectral data analysis

MALDI-TOF MS was performed for 240 strains from MUH and 83 from NUH, using the Bruker Microflex to identify colonies as previously described [13], and Biotyper (version 3.3) was used to generate spectra. All misidentified strains were removed from this study. The clustering process was performed using R and a homemade program based on the MALDIquant package [14]. Spectral preparation included a selection of the 3000–15,000 m/z range, recalibration, baseline subtraction (Statistics-sensitive Nonlinear Iterative Peak-clipping algorithm, 100 iterations), peak selection (signal-to-noise ratio of 3), and averaging of technical replicates. A total of 14 significant peaks were retained for analysis. The distance between the samples was calculated using the cosine distance, and the resulting dendrogram was bootstrapped (1000 iterations) [15, 16]. To identify the differentially expressed proteins, the binDA package was used, allowing for a binary discriminant analysis on the m/z peaks [17] and the selection of the most differentially expressed peaks between two spectral groups. Finally, we tried to find a correspondence between the m/z peaks and the UniProtKB database (http://www.uniprot.org/uniprot/), filtering the database for the Staphylococcus genus. A mass fluctuation of 0.5 per 1000 was tolerated.

Statistical analysis

Statistical analyses were performed using the R software [18]. A linear model was used on S. saprophyticus/E. coli UTI ratios to analyse and define the historical trends of the ratio, i.e. the annual trend of the ratio over the study period. Pearson's chi-square test was also performed to determine whether the increase in the S. saprophyticus/E. coli UTI ratio was significant within the study period. All the statistical analyses performed were two-sided and p-values < 0.05 were considered statistically significant.

Results

Epidemiology of UTIs due to S. saprophyticus in Marseille

In December 2014, our epidemiological surveillance system detected an abnormal increase in the number of patients experiencing UTI due to S. saprophyticus in the Marseille community, showing a peak in December 2014 (Fig. 1a). From January 2002 to December 2015, 888 patients were reported as having experienced S. saprophyticus UTIs in the various units of the MUH. Most of them were females (836 patients, 94.1%) and were aged between 11 and 34 years (746 patients, 89.2%). Over the same period, 4512 patients experienced E. coli UTI. Throughout the study period, S. saprophyticus/E. coli UTI ratios increased 3-fold from 0.0084 in 2002 to 0.025 in 2015, with an annually estimated trend of the ratio of 5 × 10⁻⁶ (p < 0.001) (Fig. 1b). When we compared the number of E. coli causing UTIs to the number of S. saprophyticus causing UTIs within the study period, we observed a significant increase in the number of S. saprophyticus UTIs (from 2138 vs. 18 in 2002, to 4512 vs. 112 in 2015, p < 0.001). The study shows that the increase of S. saprophyticus UTI is attributed to an increase in female patients aged between 11 and 34 years (746), representing 89.2% of the total positive cases. The Pearson's correlation coefficient is 0.95, indicating a strong positive linear relationship. The linear regression coefficient is 0.785 (p < 0.001). Also, we observed a statistically significant increase (p < 0.001) when comparing the number of patients who experienced community-acquired E. coli UTI and the number of patients who experienced community-acquired S. saprophyticus UTI within the study period.

Fig. 1 a Fourteen years' monthly distribution of the number of patients who experienced urinary tract infections (UTIs) due to S. saprophyticus from January 2002 to December 2015. The red arrow indicates the abnormal increase in the number of positive cases of UTIs due to S. saprophyticus. b The yearly ratio between the number of patients who experienced urinary tract infections (UTIs) due to S. saprophyticus versus those due to E. coli over the 2002–2015 period. S. saprophyticus/E. coli UTI ratios increased three-fold from 0.0084 in 2002 to 0.025 in 2015, with an annually estimated trend of the ratio of 5 × 10⁻⁶ (p < 0.001). The black curve represents the ratio between the number of patients having experienced UTI due to S. saprophyticus and the total number of patients reported positive for E. coli UTI. The purple envelope represents the 95% confidence interval of the blue slope. The blue line represents the slope of the ratio evolution (p < 0.05) and the red curve represents the LOESS regression
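The ratio and trend described in the methods and results can be sketched in a few lines. This is a minimal illustration, not the authors' R code: the function names `annual_ratio` and `linear_trend` are ours, only the 2002 and 2015 counts (18 vs. 2138 and 112 vs. 4512) come from the text, and the four-year series is a made-up example used solely to show the least-squares call.

```python
from statistics import mean


def annual_ratio(n_saprophyticus: int, n_ecoli: int) -> float:
    """Ratio of S. saprophyticus UTI patients to E. coli UTI patients for one year."""
    if n_ecoli <= 0:
        raise ValueError("E. coli patient count must be positive")
    return n_saprophyticus / n_ecoli


def linear_trend(years, ratios):
    """Ordinary least-squares slope and intercept of the ratio against the year."""
    x_bar, y_bar = mean(years), mean(ratios)
    sxx = sum((x - x_bar) ** 2 for x in years)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(years, ratios))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


# Counts reported in the text for the first and last study years.
ratio_2002 = annual_ratio(18, 2138)
ratio_2015 = annual_ratio(112, 4512)
fold_change = ratio_2015 / ratio_2002

# Hypothetical yearly series (NOT from the study), only to demonstrate the fit.
toy_years = [2002, 2003, 2004, 2005]
toy_ratios = [0.010, 0.012, 0.014, 0.016]
toy_slope, toy_intercept = linear_trend(toy_years, toy_ratios)

print(round(ratio_2002, 4), round(ratio_2015, 3), round(fold_change, 2))
```

Dividing by the E. coli count, as the authors do, normalises for overall UTI testing volume, so a rising ratio is less likely to reflect a simple increase in the number of samples processed.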

Fig. 2 a Dendrogram showing the relative distance between S. saprophyticus strains. Samples isolated from Marseille University Hospital (MUH) are indicated using the label “o” symbol and strains isolated from Nice Hospital (NUH) without a label. Cluster 1 is the potential cluster of S. saprophyticus strains, indicating a potential clonal expansion, while the clusters 2 & 3 are composed of a mixture of MUH and NUH strains. b Peak representation (m/z range 3-15 kDa, the signal-to-noise ratio of the average MALDI-TOF MS from intact S. saprophyticus belonging to the potential cluster of strains; Group 1). Corresponding putative proteins retrieved from the UniProtKB database within a ± 3 Da window are numbered
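The ±3 Da lookup mentioned in the caption, and the 0.5 per 1000 tolerance given in the methods, amount to a simple window comparison. The sketch below is an assumption-laden illustration: the helper names are ours, peak m/z values are compared directly with protein masses in Da as the text does, and the candidate list is hypothetical apart from SSP1885 at 4939 Da, which the article reports.

```python
def match_peaks(peaks_mz, candidates, abs_tol_da=3.0):
    """Map each peak m/z to the candidate proteins whose mass lies within the window."""
    return {
        peak: [name for name, mass in candidates if abs(mass - peak) <= abs_tol_da]
        for peak in peaks_mz
    }


def within_relative_tolerance(peak_mz, mass, per_thousand=0.5):
    """True when the mass deviates from the peak by at most `per_thousand` parts in 1000."""
    return abs(mass - peak_mz) <= peak_mz * per_thousand / 1000.0


# SSP1885 (4939 Da) is taken from the text; the second entry is a made-up placeholder.
candidates = [("SSP1885", 4939.0), ("hypothetical_protein_X", 5100.0)]

# The two discriminating peaks reported for cluster 1.
matches = match_peaks([4937.0, 4986.0], candidates)
print(matches)
```

With these inputs the 4937 peak picks up SSP1885 while the 4986 peak has no candidate, which mirrors the situation the authors describe for the uncharacterised peak.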


MALDI-TOF MS spectral data analysis

A total of 323 S. saprophyticus spectra from Marseille and Nice were analysed and divided into three distinct clusters. Cluster 1 (N = 190, 58.8%; p = 0.004), with low aggregation distances, exclusively includes strains from MUH, whereas the two other clusters involve both MUH and NUH strains with higher aggregation distances (Fig. 2a). The dendrogram appearance suggests the spreading of a specific strain cluster corresponding to cluster 1, from January 2014 to December 2015. Using BinDA, peaks with an m/z value of 4986 and 4937, respectively, differentiate cluster 1 from the others and consequently characterise the apparent clone (Fig. 2b). After building the average spectra of the supposed MUH cluster, the different discriminating peaks were tested for biomolecular identification. Cross-validation of the cluster separation using these two peaks shows a sensitivity of 0.989, a specificity of 0.992, a positive predictive value of 0.995, and a negative predictive value of 0.985. We queried the UniProtKB database for S. saprophyticus using a mass window of ±3 Da (Fig. 2b). The 4937 Da peak corresponds to the 4939 Da uncharacterized protein SSP 1885, identified in S. saprophyticus subsp. saprophyticus (strain ATCC 15305 / DSM 20229), but no correspondence was found for the 4986 Da peak.

Discussion

S. saprophyticus has often been reported as the second leading cause of UTIs, especially in young females [3, 19], as was found in this study, in which 94.1% of the patients reported are female and 89.2% of them are aged between 11 and 34 years. Here, we demonstrated an increase in S. saprophyticus causing UTIs in Marseille with an abnormal increase in December 2014. We were unable to observe any seasonal variation, as has previously been reported [20]. This work was also conducted to highlight specific protein signatures from the 323 clinical S. saprophyticus strains, in order to decipher strain clusters circulating in Marseille between January 2014 and December 2015 compared to Nice using MALDI-TOF MS. Isolates were collected from the two distinct locations in the south-east of France. The results suggest the spread of a specific strain cluster of S. saprophyticus within the Marseille population since at least January 2014. Most strains of MUH included in this study belong to cluster 1 (N = 190, 58.8%). No strains from the NUH were found in cluster 1, although the same clustering method had been applied, which suggests a restricted geographical spread. This cluster is characterised by the presence of two proteins: one with an m/z of 4937 Da, which may correspond to SSP1885, and the second one with an m/z of 4986 Da. MALDI-TOF identified only a subset of the protein biomarkers that are predicted by a bacterial genome, corresponding to the 4–20 kDa range [21]. Between 30 and 40 proteins were identified, belonging to three categories: ribosomal proteins, DNA-binding proteins HU, and cold-shock proteins. Several protein peaks, identified and presented in Fig. 2b, belong to the first two categories. However, the 4986 Da protein is currently uncharacterized, and there is no protein registered for S. saprophyticus within a 4986 ± 3 Da range. This suggested a probable horizontally transferred gene. Further investigation is necessary to decipher the source of these transferred genes and the putative role they might play in the differentiation of S. saprophyticus strains found in Marseille. Although pulsed-field gel electrophoresis (PFGE) methods are widely used for CoNS typing [22, 23], genome sequencing is currently considered the gold standard for understanding the spread and transmission of disease [24]. Despite a significant reduction in the cost of genome sequencing in recent years, it is still not routinely used in clinical laboratories. Given the widespread use of MALDI-TOF MS for routine bacterial identification, it becomes a cost-effective method of screening for clonal emergence and specific strain cluster spread through spectral analysis and clustering, as shown in this study. Despite significant studies addressing the emergence of S. saprophyticus, its source and transmission are currently poorly understood. Clonal investigation of UTIs caused by S. saprophyticus has, to date, rarely been performed. However, PFGE applied to 50 strains from five different locations in northern Europe revealed the persistence of pathogenic clones over large areas [25], while in this study, the major cluster identified in MUH was not detected in NUH, despite the two sites being only 200 km apart. Moreover, the prevalence in cattle and pigs raises the question of a possible zoonotic infection, as its incidence had been positively associated with meat processing [26]. However, comparisons between human and animal strains have not yet been performed, to the best of our knowledge.

Conclusion

Our study confirmed an increasing number of patients infected with S. saprophyticus with UTIs in a community from January 2002 to December 2015, with an abnormal peak indicating an outbreak in December 2014 in Marseille. Application of the MALDI-TOF MS method revealed a specific strain cluster, geographically restricted to Marseille compared to Nice. This study provides a simple and available method of comparing clonal strains, which should be further implemented on a large scale to investigate outbreaks.
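For readers who want to see how the four cross-validation figures quoted in the results relate to one another, the following sketch computes them from a 2×2 confusion table. The counts are our own assumption: they are not given in the article, and were chosen only because, with 190 cluster 1 spectra and 133 others, they happen to reproduce the reported values after rounding. The actual cross-validation procedure may have produced these figures differently.

```python
def classification_metrics(tp: int, fn: int, tn: int, fp: int) -> dict:
    """Standard diagnostic metrics from a 2x2 confusion table."""
    return {
        "sensitivity": tp / (tp + fn),   # cluster 1 spectra correctly assigned
        "specificity": tn / (tn + fp),   # other spectra correctly excluded
        "ppv": tp / (tp + fp),           # assigned to cluster 1 and truly in it
        "npv": tn / (tn + fn),           # excluded and truly outside cluster 1
    }


# Hypothetical counts (NOT stated in the source): 190 cluster 1 spectra, 133 others.
metrics = classification_metrics(tp=188, fn=2, tn=132, fp=1)
rounded = {name: round(value, 3) for name, value in metrics.items()}
print(rounded)
```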


Compliance with ethical standards

Funding sources This work was funded by IHU Méditerranée Infection.

Conflict of interest We have no conflicts of interest to declare.

Ethical approval Not applicable.

References

1. Wallmark G, Arremark I, Telander B (1978) Staphylococcus saprophyticus: a frequent cause of acute urinary tract infection among female outpatients. J Infect Dis 138:791–797. http://www.ncbi.nlm.nih.gov/pubmed/739158
2. Le Bouter A (2011) Infections à Staphylococcus saprophyticus. J des Anti-Infect 13:12–19. http://www.sciencedirect.com/science/article/pii/S2210654511000032. doi:10.1016/j.antinf.2011.01.002
3. Latham RH, Running K, Stamm WE (1983) Urinary tract infections in young adult women caused by Staphylococcus saprophyticus. JAMA 250:3063–3066. http://www.ncbi.nlm.nih.gov/pubmed/6644988
4. Kim BS, Kim CT, Park BH, Kwon S, Cho YJ, Kim N et al (2012) Draft genome sequence of Staphylococcus saprophyticus subsp. saprophyticus M1-1, isolated from the gills of a Korean rockfish, sebastes schlegeli hilgendorf, after high hydrostatic pressure processing. J Bacteriol 194:4441–4442. doi:10.1128/JB.00848-12
5. Hedman P, Ringertz O, Lindström M, Olsson K (1993) The origin of Staphylococcus saprophyticus from cattle and pigs. Scand J Infect Dis 25:57–60. http://www.ncbi.nlm.nih.gov/pubmed/8460350
6. Spinali S, van Belkum A, Goering RV, Girard V, Welker M, Van Nuenen M et al (2015) Microbial typing by matrix-assisted laser desorption ionization–time of flight mass spectrometry: do we need guidance for data interpretation? J Clin Microbiol 53:760–765. http://www.ncbi.nlm.nih.gov/pubmed/25056329. doi:10.1128/JCM.01635-14
7. Firacative C, Trilles L, Meyer W (2012) MALDI-TOF MS enables the rapid identification of the major molecular types within the Cryptococcus neoformans/C. gattii species complex. PLoS One 7:e37566. http://dx.plos.org/10.1371/journal.pone.0037566. doi:10.1371/journal.pone.0037566
8. Seng P, Rolain J-M, Fournier PE, La Scola B, Drancourt M, Raoult D (2010) MALDI-TOF-mass spectrometry applications in clinical microbiology. Future Microbiol 5:1733–1754. http://www.ncbi.nlm.nih.gov/pubmed/21133692. doi:10.2217/fmb.10.127
9. Spanu T, De Carolis E, Fiori B, Sanguinetti M, D'Inzeo T, Fadda G et al (2011) Evaluation of matrix-assisted laser desorption ionization-time-of-flight mass spectrometry in comparison to rpoB gene sequencing for species identification of bloodstream infection staphylococcal isolates. Clin Microbiol Infect 17:44–49. http://linkinghub.elsevier.com/retrieve/pii/S1198743X14609113. doi:10.1111/j.1469-0691.2010.03181.x
10. Ayeni FA, Andersen C, Nørskov-Lauritsen N (2017) Comparison of growth on mannitol salt agar, matrix-assisted laser desorption/ionization time-of-flight mass spectrometry, VITEK(®) 2 with partial sequencing of 16S rRNA gene for identification of coagulase-negative staphylococci. Microb Pathog 105:255–259. http://linkinghub.elsevier.com/retrieve/pii/S0882401017300529. doi:10.1016/j.micpath.2017.02.034
11. Abat C, Chaudet H, Colson P, Rolain J-M, Raoult D (2015) Real-time microbiology laboratory surveillance system to detect abnormal events and emerging infections, Marseille, France. Emerg Infect Dis 21:1302–1310. http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=4517727&tool=pmcentrez&rendertype=abstract. doi:10.3201/eid2108.141419
12. Togo AHH, Khelaifia S, Lagier J-C, Caputo A, Robert C, Fournier P-E et al (2016) Noncontiguous finished genome sequence and description of Paenibacillus ihumii sp. nov. strain AT5. New Microbes New Infect 10:142–150. http://linkinghub.elsevier.com/retrieve/pii/S2052297516000159. doi:10.1016/j.nmni.2016.01.013
13. Seng P, Abat C, Rolain JM, Colson P, Lagier J-C, Gouriet F et al (2013) Identification of rare pathogenic bacteria in a clinical microbiology laboratory: impact of matrix-assisted laser desorption ionization-time of flight mass spectrometry. J Clin Microbiol 51:2182–2194. http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=3697718&tool=pmcentrez&rendertype=abstract. doi:10.1128/JCM.00492-13
14. Gibb S, Strimmer K (2012) MALDIquant: a versatile R package for the analysis of mass spectrometry data. Bioinformatics 28:2270–2271. http://www.ncbi.nlm.nih.gov/pubmed/22796955. doi:10.1093/bioinformatics/bts447
15. Minh BQ, Nguyen MAT, von Haeseler A (2013) Ultrafast approximation for phylogenetic bootstrap. Mol Biol Evol 30:1188–1195. http://www.ncbi.nlm.nih.gov/pubmed/23418397. doi:10.1093/molbev/mst024
16. Suzuki Y, Glazko GV, Nei M (2002) Overcredibility of molecular phylogenies obtained by Bayesian phylogenetics. Proc Natl Acad Sci 99:16138–16143. http://www.ncbi.nlm.nih.gov/pubmed/12451182. doi:10.1073/pnas.212646199
17. Gibb S, Strimmer K (2015) Differential protein expression and peak selection in mass spectrometry data by binary discriminant analysis. Bioinformatics 31:3156–3162. http://www.ncbi.nlm.nih.gov/pubmed/26026136. doi:10.1093/bioinformatics/btv334
18. R Core Team (2016) R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing 0:409. http://www.r-project.org. doi:10.1007/978-3-540-74686-7
19. Widerström M, Wiström J, Ferry S, Karlsson C, Monsen T, Widerstrom M et al (2007) Molecular epidemiology of Staphylococcus saprophyticus isolated from women with uncomplicated community-acquired urinary tract infection. J Clin Microbiol 45:1561–1564. http://www.ncbi.nlm.nih.gov/pubmed/17344356. doi:10.1128/JCM.02071-06
20. Fabre R, Mérens A, Tabone-Ledan C, Epifanoff G, Cavallo J-D, Ternois I (2013) Staphylococcus saprophyticus isolés d'examens cytobactériologiques urinaires en ville : épidémiologie et sensibilité aux antibiotiques (étude Label Bio Elbeuf – novembre 2007–juillet 2009). Pathol Biol 61:44–48. http://linkinghub.elsevier.com/retrieve/pii/S0369811412000442. doi:10.1016/j.patbio.2012.03.008
21. Ryzhov V, Fenselau C (2001) Characterization of the protein subset desorbed by MALDI from whole bacterial cells. Anal Chem 73:746–750. http://www.ncbi.nlm.nih.gov/pubmed/11248887
22. Widerström M, Wiström J, Sjöstedt A, Monsen T (2012) Coagulase-negative staphylococci: update on the molecular epidemiology and clinical presentation, with a focus on Staphylococcus epidermidis and Staphylococcus saprophyticus. Eur J Clin Microbiol Infect Dis 31:7–20. http://www.ncbi.nlm.nih.gov/pubmed/21533877. doi:10.1007/s10096-011-1270-6
23. de Sousa VS, Rabello RF, Dias RC, Martins IS, Santos LB, Alves EM et al (2013) Time-based distribution of Staphylococcus saprophyticus pulsed field gel-electrophoresis clusters in community-acquired urinary tract infections. Mem Inst Oswaldo Cruz 108:73–76. http://www.ncbi.nlm.nih.gov/pubmed/23440118. doi:10.1590/S0074-02762013000100012
24. Sintchenko V, Holmes EC (2015) The role of pathogen genomics in assessing disease transmission. BMJ 350:h1314. http://www.ncbi.nlm.nih.gov/pubmed/25964672
25. Widerström M, Wiström J, Ferry S, Karlsson C, Monsen T (2007) Molecular epidemiology of Staphylococcus saprophyticus isolated from women with uncomplicated community-acquired urinary tract infection. J Clin Microbiol 45:1561–1564. doi:10.1128/JCM.02071-06
26. Hedman P, Ringertz O (1991) Urinary tract infections caused by Staphylococcus saprophyticus. A matched case control study. J Infect 23:145–153. http://www.ncbi.nlm.nih.gov/pubmed/1753113. doi:10.1016/0163-4453(91)92045-7


Article III

Comparative genomic analysis of Staphylococcus saprophyticus reveals a drift from “saprophytic” to “pathogenic” bacteria due to extensive genomic recombination.

Kodjovi D. Mlaga, Seydina M. Diene, Ruimy Raymond, Jean-Marc Rolain

BMC Genomics

Impact factor: 3.72

64 BMC Ge nomics Comparative genomic analysis of Staphylococcus saprophyticus reveals a drift from "saprophytic" to "pathogenic" bacteria due to extensive genomic recombination --Manuscript Draft--

Manuscript Number: Full Title: Comparative genomic analysis of Staphylococcus saprophyticus reveals a drift from "saprophytic" to "pathogenic" bacteria due to extensive genomic recombination Article Type: Research article Section/Category: Comparative and evolutionary genomics

Funding Information: IHU-Mediterannee Infection Mr Kodjovi Dodji Mlaga

Abstract: Staphylococcus saprophyticus is one of the leading causes of Urinary Tract Infections (UTI) in young people. Bacterial genome recombination and mobile genetic elements play a fundamental role in the proliferation and diversification of the natural population. In this study, we performed a pan-genome and genomic recombination analysis to decipher the evolutionary profile and genomic characteristics of clinical and non-clinical strains of S. saprophyticus. Pan-genome analysis of 32 of S. saprophyticus strains reveals an open-pangenome with a total of 4,434 orthologous genes. Nearly 50% of the specific orthologous genes of non-clinical strains belong to transcriptional regulatory [K] and carbohydrate transport and metabolism [G] functional Cluster of Ortholog Group (COG). We identify massive recombination hotspots in all genomes with a similar recombination profile for each phylogenetic clade. The ratio of base substitutions predicted to have been imported through recombination to those occurring through point mutation (r/m) was determined. It revealed that the relative impact of recombination and mutation on the variation accumulated on the branches is higher in clinical strains than non-clinical strains. The evolutionary analysis shows an active selection of the uro-adherence protein, UafA, which enhances the emergence of S. saprophyticus capable of adhering to the human bladder thus causing UTIs. Our findings suggest that S. saprophyticus, initially a saprophytic bacterium, has drifted to becoming a pathogenic bacterium through accumulated evolutionary events including massive genome recombinations and single nucleotide polymorphisms (SNPs). Corresponding Author: Jean-Marc Rolain, PharmD, PhD URMITE CNRS INSERM IRD Marseille cedex 05, -- Select your State -- FRANCE Corresponding Author Secondary Information: Corresponding Author's Institution: URMITE CNRS INSERM IRD Corresponding Author's Secondary Institution: First Author: Kodjovi Dodji Mlaga, PhD. 
Order of Authors: Kodjovi Dodji Mlaga, PhD; Seydina M. Diene, Associate Professor; Raymond Ruimy, Professor; Jean-Marc Rolain, PharmD, PhD


Comparative genomic analysis of Staphylococcus saprophyticus reveals a drift from “saprophytic” to “pathogenic” bacteria due to extensive genomic recombination.

Kodjovi D. Mlaga1, Seydina M. Diene1, Raymond Ruimy2,3, Jean-Marc Rolain1*

1. URMITE, Aix-Marseille Université, UM63, CNRS 7278, IRD 198, INSERM 1095, IHU-Méditerranée Infection, 19-21 Boulevard Jean Moulin, 13385 Marseille Cedex 05, France.
2. Department of Bacteriology at Nice Academic Hospital, Nice Medical University, Nice, France.
3. INSERM U1065 (C3M), Bacterial Toxins in Host Pathogen Interactions, C3M, Bâtiment Universitaire Archimed, Nice, France.

*Corresponding author: Prof. Jean-Marc Rolain
IHU Méditerranée Infection, Marseille, France
Email: [email protected]
Tel: +33 (0) 4 91 32 43 75
Fax: +33 (0) 4 86 13 68 28

Keywords: Staphylococcus saprophyticus, epidemiology, whole genome sequencing (WGS), pan-genome, recombination, single nucleotide polymorphisms (SNPs).


Introduction

Staphylococcus saprophyticus is one of the leading causes of urinary tract infections (UTIs) in young people (11–34 years) [1], particularly women, accounting for up to 40% of cases in this population [2–7]. UTI is defined as the presence of microbial pathogens in the urinary tract and is usually classified by the site of infection: bladder (cystitis), kidney (pyelonephritis), or urine (bacteriuria) [8]. Several risk factors for UTI caused by S. saprophyticus have been identified, such as sexual intercourse, the use of condoms and diaphragms coated with spermicidal gel, pregnancy, raw meat consumption, and diabetes [9–11]. S. saprophyticus, a coagulase-negative Staphylococcus, has a special uro-tropism and ecological features that are distinctly different from those of other staphylococci and of Escherichia coli [12]. This bacterium is usually susceptible to antibiotics, except for fosfomycin [13]. Virulence factors that enable S. saprophyticus to cause UTIs include a surface-associated protein (adherence to urothelial cells), lipase, lipoteichoic acid, a hemagglutinin that binds to fibronectin, urease, and the production of extracellular slime [12,14]. The first whole genome sequencing (WGS) of S. saprophyticus subsp. saprophyticus strain ATCC 15305, isolated from the urine of a young woman, revealed a single open reading frame (ORF) predicted to encode a cell-wall-anchored protein and associated with a positive hemagglutination test and adherence to human bladder cells. Mobile genetic elements were also identified in this S. saprophyticus genome [14].

Other studies have suggested several other reservoirs, including animals (cattle and pigs) [15], meat [16], fish [17] and the environment (running water) [18], as potential intermediary sources of human contamination. Genome recombination and mobile genetic elements play a significant role in the proliferation and diversification of natural bacterial populations [19]. Previous studies have shown a significant association between the uro-pathogenicity of S. saprophyticus and the presence of the hemagglutinin protein (UafA) [20] and a D-serine transporter protein [21]. In Marseille, we recently reported an abnormal increase in the number of S. saprophyticus UTIs detected by BALYSES (Bacterial real-time Laboratory-based Surveillance System) [22]. Using MALDI-TOF MS analysis, we identified a spread of S. saprophyticus strain clusters in the Marseille area in comparison with the neighbouring city of Nice.

To date, it is not known how S. saprophyticus, initially described as a saprophytic bacterium and considered a contaminant in clinical microbiology, has evolved and adapted to the urinary tract niche. In 1985, urease production and renal and ureteral stones were found to be associated with S. saprophyticus infection [12]. Only in 2005 was the first genome of S. saprophyticus sequenced; its complete analysis revealed the uro-adherence protein (UafA), associated with the uro-pathogenicity and uro-adherence of the bacterium [14], which was later characterised in more detail by Sakinc et al. [23]. Recently, the UafA protein, also known as serine-aspartate repeat protein or serine-rich protein [24], has become a central interest in the study of S. saprophyticus virulence and pathogenicity [21,25]. Hence, it is worth understanding how this pathogen’s genome has evolved to make it the second leading cause of UTI worldwide. In this study, we perform a comparative genomic analysis of all S. saprophyticus genomes available in the NCBI database as of December 2016, from various environments (clinical and non-clinical strains), to investigate the genomic evolution of this bacterial species and its genomic characteristics.

Results

Description of S. saprophyticus genomic features including the strain sequenced from Marseille

The genome of the S. saprophyticus G764 (FKIN01) strain from Marseille is estimated at 2,523,588 bp, assembled into 21 scaffolds. The overall GC content is 33.27%. A total of 2,584 genes were predicted, including 2,498 CDSs, 63 tRNAs, and 23 rRNAs (7, 7 and 9 for 16S rRNA, 23S rRNA, and 5S rRNA, respectively) (Suppl. Table S1). As shown in Suppl. Table S2, 767 CDSs were annotated as hypothetical proteins (unknown function), 494 proteins were classified as enzymes, and 101 proteins as transporters. Two prophages were identified, one complete and one incomplete. We detected no CRISPR-Cas systems in the G764 genome. Of the 36 available genomes from NCBI, we excluded four genome sequences from this study because of unclear taxonomy (DDH < 70% and whole-genome sequence alignments showing more than 25% gaps). We present the general features of all genomes included in this study (n=32) in Suppl. Table S3. A total of 59.37% (19/32) of the strains were isolated from human urine samples, 15.62% (5/32) from cheese, and the rest from animals and the environment. The average size of the S. saprophyticus genomes was 2.61 Mb, with 2,541 genes and 33.02% GC on average. We did not observe any significant differences in genome size, gene content, or GC content (p-value ≥ 0.99).

Core, cloud and shell gene components of the S. saprophyticus pan-genome

The pan-genome analysis of all S. saprophyticus strains, including the G764 strain (n=32 genomes), revealed a total of 4,699 orthologous genes identified with the COG algorithm and 4,656 with OrthoMCL. Both algorithms identified 4,434 genes in common, so this consensus set was used for the comparative analysis. In this analysis, the cloud genes (strain-specific genes) comprised 1,385 orthologs, the shell genes (accessory genes) 965 orthologs, the soft-core 2,084 orthologs, and the hard-core 1,924 orthologs (Fig. 1). The core genome (hard-core genes) represents, on average, 75.7% of each S. saprophyticus proteome. The parsimony pan-genome tree, inferred from gene presence/absence, revealed four clades (Fig. 2): clade 1 consisted exclusively of clinical strains, whereas clades 2, 3, and 4 contained a mixture of clinical, food, animal and environmental strains. As shown in Figure 2, strains from food, particularly from cheeses produced from cattle, are the oldest and closest to the ancestral strain. However, some strains from food and the environment in clades 2 and 3 share a common parent with clinical strains. The growth of the pan-genome was fitted with the Tettelin model and follows the function

$$P(g) = 2506 + 18.4\,(g-1) + 142\, e^{-2/4.66}\, \frac{1 - e^{-(g-1)/4.66}}{1 - e^{-1/4.66}},$$

where g is the number of genomes, suggesting an open pan-genome for the S. saprophyticus species (Fig. 3).

Proteomic comparison of clinical and non-clinical strains, based on functional COGs, revealed differences in functional groups including Amino Acid Transport and Metabolism [E], Carbohydrate Transport and Metabolism [G], and Transcription [K] (Fig. 4). We did not observe any differences in the defence mechanism COGs [V]. Interestingly, clinical strains lack a substantial number of genes encoding transcriptional regulatory proteins, post-translational modification proteins, and secondary metabolism proteins.
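The partition of orthologous clusters into hard-core, soft-core, shell and cloud classes can be illustrated with a short sketch. It assumes the occupancy thresholds given in Materials and Methods (99%, 95% and 15% of genomes) and treats each lower bound as inclusive, which is an assumption of this example rather than a documented behaviour of the software used. The function names and the toy matrix are hypothetical. The reported counts (1,385 + 965 + 2,084 = 4,434) suggest that the soft-core figure includes the hard-core; the sketch keeps the four classes disjoint for simplicity.

```python
from collections import Counter


def classify_cluster(n_present, n_genomes):
    """Return the pan-genome class of one orthologous cluster.

    Thresholds follow the Materials and Methods; inclusive lower
    bounds are an assumption of this sketch.
    """
    occupancy = n_present / n_genomes
    if occupancy >= 0.99:
        return "hard-core"
    if occupancy >= 0.95:
        return "soft-core"
    if occupancy >= 0.15:
        return "shell"
    return "cloud"


def summarise(presence_absence):
    """Count clusters per class.

    presence_absence maps a cluster ID to a list of 0/1 flags,
    one flag per genome.
    """
    counts = Counter()
    for flags in presence_absence.values():
        counts[classify_cluster(sum(flags), len(flags))] += 1
    return counts


# Hypothetical toy matrix for 32 genomes (not data from the study).
toy_matrix = {
    "cluster_A": [1] * 32,             # present everywhere
    "cluster_B": [1] * 31 + [0],       # missing from one genome
    "cluster_C": [1] * 12 + [0] * 20,  # accessory gene
    "cluster_D": [1] * 2 + [0] * 30,   # rare gene
}

print(summarise(toy_matrix))
```

With 32 genomes, these thresholds mean that a cluster must be present in every genome to count as hard-core, and a cluster missing from a single genome falls into the soft-core.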
A total of 47 orthologous genes were identified exclusively in clinical strains and grouped into 11 functional COGs, and 95 orthologous genes were identified exclusively in non-clinical strains and grouped into 15 functional COGs (Suppl. Table S4). This finding suggests that clinical strains have lost more genes than they have acquired, and most of the acquired genes are functionally uncharacterised (i.e. annotated as hypothetical proteins). Other accessory genes shared between clinical strains encode phage proteins, recombinases, and transposases, evidence of the acquisition of mobile genetic elements. This difference between clinical and non-clinical strains is also reflected in the carbohydrate transport and metabolism [G], transcriptional regulation [K], and general function prediction [R] categories, as indicated in Figure 4. We found essential genes conferring pathogenicity to S. saprophyticus in the hard-core and soft-core genes, such as those encoding biofilm formation (the icaABCDR operon and the associated regulator operon sarARVXZ), the urease operon (ureABCDEFGH), and the capsular operon (capABCD). We identified genes encoding membrane transporter proteins in all strains, including inorganic ion transporters (Na+/H+ antiporters, Na+/Pi cotransporters) and osmotolerance transporters: the proline/betaine transporter ProP, the high-affinity proline permease PutP, the glycine betaine/choline transporter, and the proline/glycine betaine ABC transporter ATPase component Opu(CA-CB-CD) and periplasmic component OpuCC. The uro-adherence gene involved in the hemagglutination of S. saprophyticus was found in 26 of the 32 genomes analysed (81%) and was absent from the remaining six, which places it among the shell genes. We also identified a hemolysin-like gene, H1U, in all strains. Surprisingly, the fosB2 gene, involved in fosfomycin resistance, was absent from some genomes, mostly those of clinical strains. However, all analysed genomes harboured genes for resistance to β-lactams (mecA, blaZ) and tetracycline (tetA, tetD) (Fig. 5).
Non-clinical isolates were mainly characterised by cysteine/O-acetylserine efflux genes, the putative glycosyltransferase gene epsH, the cupin domain protein, the histidine protein kinase gene saeS, the ATP-dependent Clp protease ATP-binding subunit gene clpL, the glucosamine-6-phosphate deaminase gene, and the putative glycosyltransferase gene epsJ. The average of all pairwise comparisons of coding sequences in the hard-core and soft-core gave a dN/dS ratio of 0.671 (< 1); comparing S. saprophyticus ATCC 15305 (used as the reference strain) with the other strains gave an average dN/dS ratio of 0.689 (< 1). A ratio below 1 indicates that most single nucleotide polymorphisms and mutations that occurred within the core genome are synonymous and do not directly affect the molecular evolution of S. saprophyticus.

Maximum-likelihood phylogeny tree reconstruction and genomic recombination analysis of S. saprophyticus

Phylogenetic whole-genome analysis, based on recombination hotspots and single nucleotide polymorphisms (SNPs), was inferred using the maximum-likelihood approach. As shown in Figure 6, we observed a topology similar to that of the parsimony pan-genome tree (Fig. 2). The S. saprophyticus strains were grouped into four clades: only clinical strains were found in clade 1, while the remaining clades (2, 3, and 4) were composed of a mixture of clinical and non-clinical strains (Fig. 6). Compared with clinical strains, non-clinical strains from food and the environment were more closely related to parental strains. We observed a high density of recombination in all isolates (Fig. 7). Across the 32 genomes analysed (22 clinical and 10 non-clinical), we identified 29,574 SNPs, of which 15,429 (52.1%) were imported by recombination. In the 22 clinical genomes, 61.3% of SNPs (12,894/21,032) were due to recombination, whereas in the ten non-clinical genomes only 29.6% (2,535/8,542) were imported by recombination (details in Suppl. Table S5). The positions and numbers of recombination events are similar within each clade. Most of the recombination blocks correspond to the locations of transposons (Tn10 and Tn916) and prophages in most genomes. The identical recombination blocks (Fig. 7, in red) correspond to clades sharing a common ancestor with a similar evolutionary history.
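The proportions of recombination-derived SNPs quoted above can be re-derived from the reported counts. The following is a minimal arithmetic sketch that uses only the figures given in the text; the function and variable names are illustrative.

```python
def recombination_share(recombined_snps, total_snps):
    """Percentage of SNPs attributed to recombination."""
    return 100.0 * recombined_snps / total_snps


# (SNPs imported by recombination, all SNPs), as reported in the text.
snp_counts = {
    "clinical": (12894, 21032),
    "non-clinical": (2535, 8542),
}

# The overall figures are the sums over both groups.
total_recombined = sum(rec for rec, _ in snp_counts.values())
total_snps = sum(total for _, total in snp_counts.values())

for group, (rec, total) in snp_counts.items():
    print(f"{group}: {recombination_share(rec, total):.2f}%")
print(f"all genomes: {recombination_share(total_recombined, total_snps):.2f}%")
```

The group counts sum to the overall totals (15,429 of 29,574 SNPs), and each computed share agrees with the reported percentage to within 0.1 percentage point.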
This finding shows that clinical and non-clinical strains in clades 2, 3 and 4 shared a common ancestor with a similar recombination profile. The number of recombination blocks is significantly higher in clinical strains than in non-clinical strains (p-value < 10⁻⁵) (Suppl. Table S5). The ratio of base substitutions predicted to have been imported through recombination to those occurring through point mutation (r/m) is also higher in clinical strains than in non-clinical strains (p-value < 10⁻⁵) (Fig. 8). This ratio indicates that SNPs introduced by recombination have affected the genomic evolution of clinical strains to a greater extent than that of non-clinical strains, and these events may have occurred within the cloud and shell genes.

Uro-adherence factor protein UafA polymorphisms and evolution

The uafA gene was present in 26 of the 32 clinical and non-clinical strains (Fig. 5). The protein size in non-clinical strains varies between 1,320 and 3,760 amino acids. The protein is composed of two parts: a conserved part of 1,063 amino acids, present in all strains and exhibiting fewer polymorphisms, and a variable part consisting of serine-rich tandem repeats (Fig. 9). We identified a “SESESL” serine-rich tandem repeat in the FKIN01, AHKB01 and ATCC15305 strains, a “SESESLSQ” tandem repeat in the JUUE01 and JXBG01 strains, and a “SESESLSA” tandem repeat in the FDAARGOS 137 and FDAARGOS 168 strains. The protein sequence alignment revealed stop-codon mutations and recurrent mutations in some strains, which split the uafA gene into two fragments. Hemagglutination tests, performed to investigate whether single nucleotide polymorphisms in the uafA gene affect the activity of the protein, were positive for the ATCC15305 (positive control) and FKIN01 (Marseille) strains. In contrast, the test was negative for the JXBG01 strain, which was isolated from river water in Malaysia.
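The serine-rich tandem repeats described above can be located in a protein sequence with a simple pattern search. The sketch below is illustrative only and is not the method used in the study: the function name is hypothetical and the sequences are invented toy strings, not UafA sequences. It reports the longest run of back-to-back copies of a given motif.

```python
import re


def longest_tandem_run(protein, motif):
    """Return (start, copies) for the longest run of adjacent motif copies.

    Returns (-1, 0) when the motif does not occur. start is 0-based.
    """
    pattern = re.compile(f"(?:{re.escape(motif)})+")
    best = (-1, 0)
    for match in pattern.finditer(protein):
        copies = len(match.group(0)) // len(motif)
        if copies > best[1]:
            best = (match.start(), copies)
    return best


# Invented toy sequences for illustration (not real UafA data).
toy_short_motif = "MKT" + "SESESL" * 3 + "AAG" + "SESESL"
toy_long_motif = "SESESLSQ" * 2

print(longest_tandem_run(toy_short_motif, "SESESL"))
print(longest_tandem_run(toy_long_motif, "SESESLSQ"))
print(longest_tandem_run(toy_long_motif, "SESESL"))
```

Because “SESESL” is a prefix of both “SESESLSQ” and “SESESLSA”, searching for the short motif in a sequence built from the longer ones finds only isolated single copies. Comparing the run lengths obtained with each motif is one way to tell the repeat types apart.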
Evolutionary statistical analysis performed on all 32 uafA genes indicated predominantly non-synonymous substitution, with a dN/dS ratio of 1.0449 (dN/dS > 1). When the ATCC15305 strain was compared with the others, the averages were dN = 1.6503 and dS = 1.3303, giving dN/dS = 1.2405 (> 1). This ratio reveals that the uro-adherence protein is under positive selection, which outweighs purifying selection.

Discussion

In this paper, we report the first comparative genomic analysis of S. saprophyticus isolates from various sources. Before the 1960s, S. saprophyticus was considered to be a contaminant in urine samples [12]. It quickly emerged as the second leading cause of UTIs, associated with young women [26], pregnancy [27], and sexual habits [28]. This study follows a previous survey conducted in the Marseille area (France), which revealed a geographically restricted clonal expansion of S. saprophyticus within the city of Marseille compared with strains from Nice, which is only 200 km away. Here, we investigate the evolution of S. saprophyticus isolates from various specimens to understand how these bacteria evolved from a saprophytic bacterium to a human pathogen, particularly one causing urinary tract infections (UTIs). The comparative analysis of 32 publicly available genomes of S. saprophyticus, including the FKIN01 genome sequenced from Marseille, revealed an open pan-genome, suggesting high genome plasticity in this bacterial species. This finding is supported by the mobile genetic elements, including transposons and phages, identified in all genomes analysed in this study. Bacteria can regularly exchange DNA through legitimate or illegitimate recombination mediated by mechanisms including transduction, transformation or conjugation, bringing further changes [29–31].

The parsimony pan-genome tree shows that S. saprophyticus emerged from ancestral strains close to those mainly isolated from food and the environment, which are heterogeneous and have clustered over time through gene loss and gain. Gene loss, rather than the acquisition of virulence genes, has been considered the driving force in the adaptation of pathogens to eukaryotic cells [32]. These host-dependent bacteria (pathogens) exhibit fewer rRNA genes, more split rRNA operons and fewer transcriptional regulators [32,33], as observed in this study of S. saprophyticus genomes. Based on our findings, S. saprophyticus genomes seem to have evolved independently of virulence genes. Indeed, nearly 50% of the genes lost by the clinical strains analysed belong to the transcriptional regulation [K] and carbohydrate transport and metabolism [G] functional groups. However, through genomic DNA exchange (i.e. genes gained and lost), S. saprophyticus strains end up adapting to humans, and particularly to the human urinary tract and bladder environments. S. saprophyticus has a high core genome proportion (75.7%), much like S. epidermidis (80%) [34], in comparison with S. aureus (56%) [35] and E. coli (40%) [36]. Most SNPs occurring in the core genome are synonymous, indicating purifying (negative) selection, which implies that they do not affect the structure and function of the affected proteins. However, differentiation has taken place in the cloud and shell genes (accessory genes). The previously described virulence genes associated with S. saprophyticus are included in the core genome, such as those for hemagglutinin [37], the D-serine transporter [21], urease [38] and biofilm formation [39], and are present in both the clinical and non-clinical strains analysed in this study. Interestingly, our analysis reveals that not all the analysed genomes of S. saprophyticus possess the fosB gene, which is responsible for fosfomycin resistance in Gram-positive bacteria [40]. S. saprophyticus has been described as naturally resistant to fosfomycin [41].
Hence, the absence of the fosB gene from some genomes is evidence of gene loss. Transporters essential for the survival of S. saprophyticus in a urine environment, including Na+/H+ antiporters, Na+/Pi cotransporters, proline/betaine transporters, proline permease, and glycine betaine/choline transporters, were present in all analysed genomes. Recombination introduced 52.1% of all SNPs: 61.3% of those in clinical genomes and 29.6% of those in non-clinical genomes. The ratio of base substitutions predicted to have been imported through recombination to those occurring through point mutation (r/m) was also determined. It revealed that the relative impact of recombination versus mutation on the variation accumulated on the branches is higher for clinical strains than for non-clinical strains. These findings suggest that the pathogenicity of S. saprophyticus results from genomic recombination through single nucleotide polymorphisms, mobile genetic element exchanges, and gene loss, as adaptations to variations in the immediate environment. Recombination analysis shows that clinical strains within each clade share almost identical genomic recombination hotspot profiles and SNPs, evidence of descent from a common ancestor. A similar study of S. aureus, which used the Bayesian Recombination Tracker (BRATNextGen) program to detect recombination events within 165 isolates, revealed that of the 16,172 SNPs identified (corresponding to 0.53% of the genome), 8,569 (53%) had been introduced by recombination. Moreover, the r/m ratio, which estimates the number of SNPs introduced by recombination relative to mutation between each pair of sub-populations, varied considerably from one sub-group to another [42]. The maximum-likelihood phylogenetic tree indicated that clinical and non-clinical strains are closely related and share common parents and ancestors. This supports the hypothesis that food [16,17,43], animals [15] and the environment [18] are potential sources of contamination for humans. Hemagglutination tests and UafA protein sequence comparisons show that not all S. saprophyticus strains possess a functional uro-adherence UafA protein.
The evolutionary analysis shows active selection on the UafA protein, leading to the emergence of S. saprophyticus capable of adhering to the human bladder and epithelial cell membranes [14,44,45] and thereby causing UTIs.

Conclusion

In summary, this analysis suggests that S. saprophyticus was initially saprophytic and has drifted towards being a pathogenic bacterium through massive genome recombination and single nucleotide polymorphism (SNP) events, together with a significant loss of genes in the transcriptional regulation and carbohydrate transport and metabolism functional groups. This species exhibits high genome plasticity, which makes S. saprophyticus prone to acquiring exogenous DNA, including phages and transposons, as noted here. In addition, selection in which non-synonymous substitutions outnumber synonymous substitutions has occurred in the uro-adherence protein gene. Together, these events have led to the emergence of a specific population of S. saprophyticus capable of causing disease, particularly UTIs, in humans.

Materials and Methods

Whole genome sequencing of S. saprophyticus, Marseille isolate G764 and genome data extraction from NCBI database

The S. saprophyticus G764 strain was isolated at the Hôpital La Timone in Marseille, France, from a female patient who had experienced a urinary tract infection (UTI) in December 2014. We resuspended the bacterial growth in 400 μL of TE buffer, and 200 μL of this suspension was diluted in 1 mL of TE buffer for lysis treatment. The extracted DNA was then purified using QIAGEN spin-column kits. We sequenced the genomic DNA (gDNA) of strain G764 on the MiSeq platform (Illumina Inc, San Diego, CA, USA) with the mate-pair strategy. gDNA was quantified by a Qubit assay using the high sensitivity kit (Life Technologies, Carlsbad, CA, USA). We prepared the mate-pair library with 1.5 µg of genomic DNA following the Nextera mate-pair Illumina guide. The genomic DNA sample was simultaneously fragmented and tagged with a mate-pair junction adapter. We validated the fragmentation pattern on an Agilent 2100 BioAnalyzer (Agilent Technologies Inc, Santa Clara, CA, USA) with a DNA 7500 LabChip. The DNA fragments ranged in size from 1.5 kb to 11 kb, with an optimal size of 5.316 kb. We did not perform any size selection, and 640.3 ng of tagged fragments were circularised. The circularised DNA was mechanically sheared into small fragments with an optimal size of 1,550 bp on the Covaris S2 device in T6 tubes (Covaris, Woburn, MA, USA). We visualised the library profile on a High Sensitivity Bioanalyzer LabChip (Agilent Technologies Inc, Santa Clara, CA, USA), and the final library concentration was measured at 15.44 nmol/L. The libraries were normalised to 2 nM and pooled.
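Normalising a library from the measured 15.44 nmol/L to 2 nM is a standard C1·V1 = C2·V2 dilution. The sketch below works through that arithmetic. The final volume of 10 µL is a hypothetical value chosen for illustration, since the text does not state the volume used, and the function name is likewise illustrative.

```python
def dilution_volumes(stock_nM, target_nM, final_volume_uL):
    """Return (stock volume, diluent volume) in µL from C1*V1 = C2*V2."""
    if not 0 < target_nM <= stock_nM:
        raise ValueError("target must be positive and not exceed the stock")
    stock_volume = target_nM * final_volume_uL / stock_nM
    return stock_volume, final_volume_uL - stock_volume


# 15.44 nM and 2 nM come from the text; 10 µL is an assumed final volume.
stock_uL, diluent_uL = dilution_volumes(15.44, 2.0, 10.0)
print(f"dilution factor: {15.44 / 2.0:.2f}x")
print(f"library: {stock_uL:.2f} µL, diluent: {diluent_uL:.2f} µL")
```

The dilution factor (15.44 / 2 = 7.72) does not depend on the assumed final volume. Only the two pipetting volumes scale with it.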
After a denaturation step and dilution to 15 pM, the pool of libraries was loaded onto the reagent cartridge and then onto the instrument along with the flow cell. Automated cluster generation and sequencing were performed in a single 2x301-bp run. The reads produced were assembled, with automated scaffolding, using A5-miseq (version 3) [46]. The scaffolds generated by A5-miseq were re-ordered and aligned against the reference genome of S. saprophyticus ATCC 15305 using Mauve Aligner (version snapshot_2015-02-13 build 0) [47]. Genome features and coding sequence classification were obtained using the MicroScope platform [48]. For the comparative genomic analysis, we retrieved 35 other S. saprophyticus genomes from the NCBI database. For quality control purposes, we performed in silico DNA-DNA hybridization (DDH) on all retrieved genomes using the Genome-to-Genome Distance Calculator (GGDC) [49]. Four genomes (accession numbers: LUGM01, JUTO01, FMPI01, and FMPG01) were excluded from this study because their DDH was < 70%. In this analysis, we renamed the retrieved genomes by their WGS ID, and complete genomes by their strain ID (Table 3). All genomes were re-annotated using Prokka (version 1.11) [50]. In total, this analysis included 22 clinical and ten non-clinical S. saprophyticus genomes.

Orthologous gene cluster detection and pan-genomic analysis

The pan-genomic analysis was performed using GET_HOMOLOGUES (version 2.0) [51,52]. We clustered homologous gene families using two clustering algorithms: OrthoMCL (version 1.4) and Clusters of Orthologous Groups (COG) triangles (version 2.1). Default parameters were used for alignment (75% coverage, 75% identity, e-value of 1e-05). We generated a consensus clustering from the OrthoMCL and COG results using default parameters.
We queried the consensus pan-genome matrix to determine the hard-core (genes present in 99%-100% of taxa), soft-core (genes present in 95%-99% of taxa), shell (genes present in 15%-95% of taxa) and cloud genes (genes present in 0%-15% of taxa), as described by Kaas [53]. We built the parsimony pan-genome tree by hierarchical clustering using Ward’s method with Manhattan distances [54], with midpoint rooting in the APE package [55]. The Tettelin [56] model was used to estimate the core and pan-genome sizes. We retrieved the accessory genes (shell genes) of the clinical and non-clinical strain genomes and ran a BLAST analysis against the functional COG database [57] to determine and compare functional orthologous genes specific to clinical and non-clinical clades. A heatmap of antimicrobial resistance genes, virulence genes and transporter genes related to the urine environment was plotted against the phylogenetic tree to analyse their distribution.

Genome recombination, phylogeny tree and evolutionary analysis

To reconstruct the phylogenetic tree, we used an all-against-all pairwise approach to align 32 whole S. saprophyticus genomes (Suppl. Table S3) using Mugsy (version 1.2.3) software [58]. We putatively identified recombination hotspots, i.e. loci containing high densities of SNPs suggestive of horizontally transferred sequences, using Gubbins (version 1.4.5) software [59] with the default settings. We reconstructed the maximum-likelihood phylogeny by determining the ancestral genome, considering points of variation, genome plasticity, and single nucleotide polymorphism (SNP) sites, and generated a genome recombination heat map. We identified non-synonymous (dN) and synonymous (dS) base substitutions [60,61] within the core genome to provide information on the type of selection that has occurred in the genome protein-coding sequences as well as in the uro-adherence protein, the major gene product associated with the uro-pathogenicity of S. saprophyticus.
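Looking back at the pan-genome compartments defined in the pan-genomic analysis subsection (hard-core, soft-core, shell, cloud), the thresholds can be sketched as a simple occupancy rule. This is only an illustrative sketch: the function names are hypothetical, and because the stated ranges share their boundaries, the code assumes that each lower bound is inclusive.

```python
def pan_genome_compartment(n_genomes_with_gene, n_genomes_total):
    """Assign a gene cluster to a pan-genome compartment by occupancy."""
    if not 0 < n_genomes_with_gene <= n_genomes_total:
        raise ValueError("occupancy must be between 1 and the number of genomes")
    fraction = n_genomes_with_gene / n_genomes_total
    # Assumption: lower bounds are inclusive, so every cluster gets one label.
    if fraction >= 0.99:
        return "hard-core"
    if fraction >= 0.95:
        return "soft-core"
    if fraction >= 0.15:
        return "shell"
    return "cloud"


def count_compartments(occupancies, n_genomes_total):
    """Count how many clusters fall into each compartment."""
    counts = {"hard-core": 0, "soft-core": 0, "shell": 0, "cloud": 0}
    for n in occupancies:
        counts[pan_genome_compartment(n, n_genomes_total)] += 1
    return counts


# Hypothetical occupancies for six clusters in a 32-genome data set.
example = count_compartments([32, 32, 31, 30, 5, 1], 32)
```

Under these assumptions, with 32 genomes a cluster is hard-core only when present in all 32, soft-core when present in 31, shell when present in 5 to 30, and cloud when present in 1 to 4 genomes.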
Uro-adherence protein structure analysis

Given that the uro-adherence factors A (UafA) [14,62] and B (UafB) [24] are major proteins involved in the uro-pathogenicity of S. saprophyticus, we decided to investigate how well the uro-adherence protein was conserved across all S. saprophyticus isolates. Easyfig [63] software was used to visualise the genome sub-region, based on the BLAST approach. We used a hemagglutination test, as previously described [14], to investigate the biological activity of the UafA protein. We used the ATCC 15305 strain as a positive control and normal saline (0.09 g/l), the solution used to prepare the bacterial suspension, as a negative control. Marseille strains FKIN01 and JXBG01 were tested and shown to have, respectively, a full-length intact and a mutated UafA protein.

Ethics approval and consent to participate

NA

Consent for publication

All co-authors have seen and approved the manuscript.

Availability of data and materials

All material and data presented in this manuscript are original, publicly available, unpublished, and have not been simultaneously submitted to another journal. All accession numbers of genome sequences used in this work are included in Supplementary Table 3.

Competing interests

NA

Funding

This work was supported and funded by IHU Fondation Méditerranée Infection.

Authors' contributions

This study was designed and directed by Jean-Marc Rolain, Raymond Ruimy and Seydina Diene. Kodjovi Mlaga performed all analyses and wrote the manuscript.

Acknowledgements

We thank IHU Fondation Méditerranée Infection for funding this study.

REFERENCES

1. Mlaga KD, Dubourg G, Abat C, Chaudet H, Lotte L, Diene SM, et al. Using MALDI-TOF MS typing method to decipher outbreak: the case of Staphylococcus saprophyticus causing urinary tract infections (UTIs) in Marseille, France. Eur. J. Clin. Microbiol. Infect. Dis. Springer Berlin Heidelberg; 2017;1–7.
2. Flores-Mireles AL, Walker JN, Caparon M, Hultgren SJ. Urinary tract infections: epidemiology, mechanisms of infection and treatment options. Nat. Rev. Microbiol. Nature Publishing Group; 2015;13:269–84.
3. Foxman B. Urinary tract infection syndromes. Occurrence, recurrence, bacteriology, risk factors, and disease burden. Infect. Dis. Clin. North Am. 2014;28:1–13.
4. Fabre R, Mérens A, Tabone-Ledan C, Epifanoff G, Cavallo J-D, Ternois I. Staphylococcus saprophyticus isolés d’examens cytobactériologiques urinaires en ville : épidémiologie et sensibilité aux antibiotiques (étude Label Bio Elbeuf – novembre 2007–juillet 2009). Pathol. Biol. 2013;61:44–8.
5. Eriksson A, Giske CG, Ternhag A. The relative importance of Staphylococcus saprophyticus as a urinary tract pathogen: distribution of bacteria among urinary samples analysed during 1 year at a major Swedish laboratory. Apmis. 2013;121:72–8.
6. Ferreira AM, Bonesso MF, Mondelli AL, Da Cunha M de LR de S. Identification of Staphylococcus saprophyticus isolated from patients with urinary tract infection using a simple set of biochemical tests correlating with 16S-23S interspace region molecular weight patterns. J. Microbiol. Methods. Elsevier B.V.; 2012;91:406–11.
7. Jhora ST, Paul S. Urinary Tract Infections Caused by Staphylococcus saprophyticus and their antimicrobial sensitivity pattern in Young Adult Women. Bangladesh J Med Microbiol. 2011;5:21–5.
8. Foxman B. Epidemiology of urinary tract infections: Incidence, morbidity, and economic costs. Disease-a-Month. 2003. p. 53–70.
9. Kodner CM, Thomas Gupton EK. Recurrent urinary tract infections in women: Diagnosis and management. Am. Fam. Physician. 2010;82:638–43.
10. Foxman B, Gillespie B, Koopman J, Zhang L, Palin K, Tallman P, et al. Risk factors for second urinary tract infection among college women. Am. J. Epidemiol. 2000;151:1194–205.
11. Harrington RD, Hooton TM. Urinary tract infection risk factors and gender. J Gend Specif Med. 2000;3:27–34.
12. Raz R, Colodner R, Kunin CM. Who are you--Staphylococcus saprophyticus? Clin. Infect. Dis. 2005;40:896–8.
13. Le Bouter A. Infections à Staphylococcus saprophyticus. J. des Anti-infectieux. 2011;13:12–9.
14. Kuroda M, Yamashita A, Hirakawa H, Kumano M, Morikawa K, Higashide M, et al. Whole genome sequence of Staphylococcus saprophyticus reveals the pathogenesis of uncomplicated urinary tract infection. Proc. Natl. Acad. Sci. U. S. A. 2005;102:13272–7.
15. Hedman P, Ringertz O, Lindström M, Olsson K. The origin of Staphylococcus saprophyticus from cattle and pigs. Scand. J. Infect. Dis. 1993;25:57–60.
16. Hedman P, Ringertz O. Urinary tract infections caused by Staphylococcus saprophyticus. A matched case control study. J. Infect. 1991;23:145–53.
17. Kim BS, Kim CT, Park BH, Kwon S, Cho YJ, Kim N, et al. Draft genome sequence of Staphylococcus saprophyticus subsp. saprophyticus M1-1, isolated from the gills of a Korean rockfish, Sebastes schlegeli Hilgendorf, after high hydrostatic pressure processing. J. Bacteriol. 2012. p. 4441–2.
18. Chan K-G, Sulaiman J, Yong DA, Tee KK, Yin W-F, Priya K. Draft Genome Perspective of Staphylococcus saprophyticus Strain SU8, an N-Acyl Homoserine Lactone-Degrading Bacterium. Genome Announc. 2015;3.
19. Juhas M, van der Meer JR, Gaillard M, Harding RM, Hood DW, Crook DW. Genomic islands: tools of bacterial horizontal gene transfer and evolution. FEMS Microbiol. Rev. 2009;33:376–93.
20. Kleine B, Gatermann S, Sakinc T. Genotypic and phenotypic variation among Staphylococcus saprophyticus from human and animal isolates. BMC Res. Notes. 2010;3:163.
21. Marlinghaus L, Huß M, Korte-Berwanger M, Sakinc-Güler T, Gatermann SG. D-serine transporter in Staphylococcus saprophyticus identified. FEMS Microbiol. Lett. 2016;
22. Abat C, Chaudet H, Colson P, Rolain J-M, Raoult D. Real-Time Microbiology Laboratory Surveillance System to Detect Abnormal Events and Emerging Infections, Marseille, France. Emerg. Infect. Dis. 2015;21:1302–10.
23. Sakinc T, Kleine B, Gatermann SG. SdrI, a serine-aspartate repeat protein identified in Staphylococcus saprophyticus strain 7108, is a collagen-binding protein. Infect. Immun. 2006;74:4615–23.
24. King NP, Beatson SA, Totsika M, Ulett GC, Alm RA, Manning PA, et al. UafB is a serine-rich repeat adhesin of Staphylococcus saprophyticus that mediates binding to fibronectin, fibrinogen and human uroepithelial cells. Microbiology. 2011;157:1161–75.
25. Szabados F, Mohner A, Kleine B, Gatermann SG. Staphylococcus saprophyticus surface-associated protein (Ssp) is associated with lifespan reduction in Caenorhabditis elegans. Virulence. 2013;4:604–11.
26. Gillespie WA, Sellin MA, Gill P, Stephens M, Tuckwell LA, Hilton AL. Urinary tract infection in young women, with special reference to Staphylococcus saprophyticus. J. Clin. Pathol. 1978;31:348–50.
27. Kline KA, Lewis AL. Gram-Positive Uropathogens, Polymicrobial Urinary Tract Infection, and the Emerging Microbiota of the Urinary Tract. Microbiol. Spectr. NIH Public Access; 2016;4.
28. Fihn SD, Boyko EJ, Chen CL, Normand EH, Yarbro P, Scholes D. Use of spermicide-coated condoms and other risk factors for urinary tract infection caused by Staphylococcus saprophyticus. Arch. Intern. Med. 1998;158:281–7.
29. García-Solache M, Lebreton F, McLaughlin RE, Whiteaker JD, Gilmore MS, Rice LB. Homologous Recombination within Large Chromosomal Regions Facilitates Acquisition of β-Lactam and Vancomycin Resistance in Enterococcus faecium. Antimicrob. Agents Chemother. American Society for Microbiology; 2016;60:5777–86.
30. Dettman JR, Rodrigue N, Kassen R. Genome-wide patterns of recombination in the opportunistic human pathogen Pseudomonas aeruginosa. Genome Biol. Evol. 2015;7:18–34.
31. Didelot X, Lawson D, Darling A, Falush D. Inference of homologous recombination in bacteria using whole-genome sequences. Genetics. Genetics Society of America; 2010;186:1435–49.
32. Merhej V, Royer-Carenzi M, Pontarotti P, Raoult D. Massive comparative genomic analysis reveals convergent evolution of specialized bacteria. Biol. Direct. BioMed Central; 2009;4:13.
33. Georgiades K, Raoult D. Genomes of the most dangerous epidemic bacteria have a virulence repertoire characterized by fewer genes but more toxin-antitoxin modules. PLoS One. Public Library of Science; 2011;6:e17962.
34. Conlan S, Mijares LA, NISC Comparative Sequencing Program, Becker J, Blakesley RW, Bouffard GG, et al. Staphylococcus epidermidis pan-genome sequence analysis reveals diversity of skin commensal and hospital infection-associated isolates. Genome Biol. BioMed Central; 2012;13:R64.
35. Bosi E, Monk JM, Aziz RK, Fondi M, Nizet V, Palsson BØ. Comparative genome-scale modelling of Staphylococcus aureus strains identifies strain-specific metabolic capabilities linked to pathogenicity. Proc. Natl. Acad. Sci. 2016;E3801–E3809.
36. Mira A, Martín-Cuadrado AB, D’Auria G, Rodríguez-Valera F. The bacterial pan-genome: a new paradigm in microbiology. Int. Microbiol. 2010;13:45–57.
37. Meyer HWG, Wengler-Becker U, Gatermann SG. The hemagglutinin of Staphylococcus saprophyticus is a major adhesin for uroepithelial cells. Infect. Immun. 1996;64:3893–6.
38. Schäfer UK, Kaltwasser H. Urease from Staphylococcus saprophyticus: purification, characterization and comparison to Staphylococcus xylosus urease. Arch. Microbiol. 1994;161:393–9.
39. Cernohorská L, Votava M. [Antibiotic resistance and biofilm formation in Staphylococcus saprophyticus strains isolated from urine]. Epidemiol. Mikrobiol. Imunol. Cas. Spol. pro Epidemiol. a Mikrobiol. Ces. lékarské Spol. J.E. Purkyne. 2010;59:88–91.
40. Castañeda-García A, Blázquez J, Rodríguez-Rojas A. Molecular Mechanisms and Clinical Impact of Acquired and Intrinsic Fosfomycin Resistance. Antibiot. (Basel, Switzerland). Multidisciplinary Digital Publishing Institute (MDPI); 2013;2:217–36.

41. Le Bouter A. Infections à Staphylococcus saprophyticus. J. des Anti-Infectieux. Elsevier Masson SAS; 2011;13:12–9.
42. Castillo-Ramírez S, Corander J, Marttinen P, Aldeljawi M, Hanage WP, Westh H, et al. Phylogeographic variation in recombination rates within a global clone of methicillin-resistant Staphylococcus aureus. Genome Biol. 2012;13:R126.
43. Colodner R, Ken-Dror S, Kavenshtock B, Chazan B, Raz R. Epidemiology and clinical characteristics of patients with Staphylococcus saprophyticus bacteriuria in Israel. Infection. 2006;34:278–81.
44. Gatermann S, Marre R, Heesemann J, Henkel W. Hemagglutinating and adherence properties of Staphylococcus saprophyticus: epidemiology and virulence in experimental urinary tract infection of rats. FEMS Microbiol. Immunol. 1988;1:179–85.
45. Meyer HGW, Müthing J, Gatermann SG. The hemagglutinin of Staphylococcus saprophyticus binds to a protein receptor on sheep erythrocytes. Med. Microbiol. Immunol. 1997;186:37–43.
46. Coil D, Jospin G, Darling AE. A5-miseq: An updated pipeline to assemble microbial genomes from Illumina MiSeq data. Bioinformatics. 2015;31:587–9.
47. Darling ACE, Mau B, Blattner FR, Perna NT. Mauve: multiple alignment of conserved genomic sequence with rearrangements. Genome Res. 2004;14:1394–403.
48. Vallenet D, Belda E, Calteau A, Cruveiller S, Engelen S, Lajus A, et al. MicroScope - An integrated microbial resource for the curation and comparative analysis of genomic and metabolic data. Nucleic Acids Res. 2013;41:636–47.
49. Meier-Kolthoff JP, Auch AF, Klenk H-P, Göker M, Wayne L, Brenner D, et al. Genome sequence-based species delimitation with confidence intervals and improved distance functions. BMC Bioinformatics. BioMed Central; 2013;14:60.
50. Seemann T. Prokka: rapid prokaryotic genome annotation. Bioinformatics. 2014;30:2068–9.
51. Contreras-Moreira B, Vinuesa P. GET_HOMOLOGUES, a versatile software package for scalable and robust microbial pangenome analysis. Appl. Environ. Microbiol. American Society for Microbiology; 2013.
52. Vinuesa P, Contreras-Moreira B. Robust identification of orthologues and paralogues for microbial pan-genomics using GET_HOMOLOGUES: A case study of pIncA/C plasmids. Methods Mol. Biol. 2015;1231:203–32.
53. Kaas RS, Friis C, Ussery DW, Aarestrup FM, Otto T, Oryan M, et al. Estimating variation within the genes and inferring the phylogeny of 186 sequenced diverse Escherichia coli genomes. BMC Genomics. BioMed Central; 2012;13:577.
54. Strauss T, von Maltitz MJ. Generalising Ward’s Method for Use with Manhattan Distances. PLoS One. Public Library of Science; 2017;12:e0168288.
55. Paradis E, Claude J, Strimmer K. APE: Analyses of phylogenetics and evolution in R language. Bioinformatics. 2004;20:289–90.
56. Tettelin H, Riley D, Cattuto C, Medini D. Comparative genomics: the bacterial pan-genome. Curr. Opin. Microbiol. 2008;11:472–7.
57. Tatusov RL, Fedorova ND, Jackson JD, Jacobs AR, Kiryutin B, Koonin EV, et al. The COG database: an updated version includes eukaryotes. BMC Bioinformatics. BioMed Central; 2003;4:41.
58. Angiuoli SV, Salzberg SL. Mugsy: fast multiple alignment of closely related whole genomes. Bioinformatics. 2011;27:334–42.

59. Croucher NJ, Page AJ, Connor TR, Delaney AJ, Keane JA, Bentley SD, et al. Rapid phylogenetic analysis of large samples of recombinant bacterial whole genome sequences using Gubbins. Nucleic Acids Res. 2015;43:e15.
60. Kryazhimskiy S, Plotkin JB. The population genetics of dN/dS. Gojobori T, editor. PLoS Genet. Public Library of Science; 2008;4:e1000304.
61. Pond SLK, Poon A, Frost SDW. Estimating selection pressures on alignments of coding sequences. Lemey P, Salemi M, Vandamme A, editors. 2009;1–81.
62. King NP, Beatson SA, Totsika M, Ulett GC, Alm RA, Manning PA, et al. In vitro adherence of Staphylococcus saprophyticus, Staphylococcus epidermidis, Staphylococcus haemolyticus, and Staphylococcus aureus to human ureter. UafB is a serine-rich repeat adhesin of Staphylococcus saprophyticus that mediates binding to fibronecti. Microbiology. 2011;157:1161–75.
63. Sullivan MJ, Petty NK, Beatson SA. Easyfig: a genome comparison visualizer. Bioinformatics. 2011;27:1009–10.

Figure legends

Figure 1: The pan-genome profile, including 32 genomes of S. saprophyticus, shows the number of gene clusters in the different components of the pan-genome. Histogram plotted from the pan-genome consensus built from the COG and OrthoMCL algorithms: core genes (no colour), soft-core (yellow), shell (orange), cloud (red).

Figure 2: Parsimony pan-genomic tree of 32 S. saprophyticus genomes. The tree was generated from the gene presence/absence matrix of the pan-genome built from the intersection of the COG and OrthoMCL ortholog clusterings. In total this analysis included twenty-two clinical and ten non-clinical S. saprophyticus genomes. Red indicates clinical strains (urine), blue food strains, green animal strains and black environmental strains (river water). Four clades were identified: clade 1 contains almost exclusively clinical strains, and clades 2, 3 and 4 contain a mixture of clinical, food, animal, and environmental strains.

Figure 3: Core and pan-genomic evolution plot of 32 S. saprophyticus. Plot (B) shows the evolution of the open pan-genome following the functions first proposed by Tettelin in 2005. Plot (A) shows the core genome following the functions proposed by Tettelin (red curve) and Willenbrock et al. (blue curve); residual standard errors are reported on the right margin as a measure of the goodness of fit.

Figure 4: Functional COG distribution of specific proteins that differentiate clinical from non-clinical strain genomes: specific genes were retrieved from orthologous genes, and a BLAST analysis was run against the COG database to determine the functional COG. In total this analysis included twenty-two clinical and ten non-clinical S. saprophyticus genomes. Numbers inside the plots represent the number of orthologous clusters identified within each COG.

Figure 5: Heatmap showing the distribution of essential genes relevant to the pathogenicity of S. saprophyticus plotted against the pan-genomic phylogeny tree. In total this analysis included twenty-two clinical and ten non-clinical S. saprophyticus genomes. Coloured nodes indicate the samples from which strains were isolated. The genes scanned are: mecA (penicillin-binding protein 2 prime), blaZ (beta-lactamase resistance), fosB2 (fosfomycin resistance), van(Y-YB) (vancomycin resistance, D-alanyl-D-alanine carboxypeptidase), tet(A-D) (tetracycline resistance), norB (nitric-oxide reductase), isa(A-B-C-D-R) (intracellular adherence protein, biofilm formation), sar(A-R-V-X-Z) (accessory regulatory), uafA (uro-adherence protein), H1U (hemolysin-like protein), soj2 (sporulation protein), virB (type IV secretion protein), vap (virulence-associated protein), immR (HTH-type transcriptional regulator ImmR), katE & catE (catalase), sdrC (adherence-related protein), isd(G-I) (heme transporter-related protein), sepA (extracellular serine protease), aur (aureolysin/zinc metalloproteinase), cap(A8-A-B-C-D) (capsular polysaccharide biosynthesis protein), pgl(F-J) (glycosylation pathway protein), tag(B-E-F-H) (teichoic acid biosynthesis protein), ure(A-B-C-D-E-F-G-H) (urease metabolism protein), Na+/H+ antiporters, Na+/P+ antiporter, proline/betaine transporter ProP, high-affinity proline permease PutP, glycine betaine/choline transporter, proline/glycine betaine ABC transporter ATPase component Opu(CA-CB-CD) and periplasmic component OpuCC.

Figure 6: Phylogenetic tree inferred from the whole genome (B) against the pan-genome parsimony tree (A): maximum-likelihood tree obtained from the all-against-all alignment of 32 S. saprophyticus genomes, considering recombination hotspots and potentially horizontally transferred genes. In total this analysis included twenty-two clinical and ten non-clinical S. saprophyticus genomes. Sub-group clusters identified in A are identical to those identified in B.

Figure 7: Predicted regions of recombination against phylogeny reconstruction based on the final iteration of 32 S. saprophyticus. In total this analysis included twenty-two clinical and ten non-clinical S. saprophyticus genomes. Blue stripes represent unique/specific recombination hotspots; red stripes represent shared recombination hotspots. Recombination hotspot regions correspond to regions of phage and transposon (Tn10, Tn916) insertion in various genomes.

Figure 8: Plots showing the distribution of genomic evolutionary features: the left Y axis represents SNPs, and the right Y axis represents the ratio r/m. Clinical strain genome features are organised on the left and non-clinical ones on the right side of the plot. The blue bar plot indicates the distribution of the total number of SNPs detected. The orange curve shows the number of SNPs that occurred inside recombination blocks, and the grey curve shows the number of SNPs that occurred outside recombination blocks. The yellow curve shows the ratio (r/m) of base substitutions predicted to have been imported through recombination to those occurring through point mutation. This value gives a measure of the relative impact of recombination and mutation on the variation accumulated on the branch [59].

Figure 9: S. saprophyticus genome sub-region visualisation containing the uafA genes. It shows the mutation that split the uro-adherence protein (UafA) into two fragments, as well as structure truncation and protein size variation (AHKB01, FDAARGOS_137, JUUE01). The UafA coding genes completely disappear in LSLC01 even though the sequences exist. The average of all pairwise comparisons indicated non-synonymous substitutions dN = 1.5681 and synonymous substitutions dS = 1.5006, with dN/dS = 1.0449 > 1. When comparing the ATCC 15305 strain to the others, the average was dN = 1.6503 and dS = 1.3303, with dN/dS = 1.2405 > 1.
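The dN/dS values quoted in the Figure 9 legend can be re-derived with a few lines of code. The sketch below applies the conventional reading of the ratio (above 1 suggests positive selection, below 1 purifying selection); the function name `selection_regime` and the optional `tolerance` band are hypothetical additions, and the code says nothing about statistical significance, which would need a dedicated test.

```python
def selection_regime(dn, ds, tolerance=0.0):
    """Return (dN/dS, label) using the conventional interpretation.

    `tolerance` is a hypothetical band around 1 inside which the ratio is
    treated as compatible with neutral evolution.
    """
    if ds <= 0:
        raise ValueError("dS must be positive to compute dN/dS")
    omega = dn / ds
    if omega > 1 + tolerance:
        label = "positive (diversifying) selection"
    elif omega < 1 - tolerance:
        label = "purifying (negative) selection"
    else:
        label = "neutral evolution"
    return omega, label


# Averages reported in the Figure 9 legend.
omega_all, regime_all = selection_regime(1.5681, 1.5006)  # all pairwise comparisons
omega_ref, regime_ref = selection_regime(1.6503, 1.3303)  # ATCC 15305 versus others
```

Both ratios come out above 1, in line with the legend, although the all-pairs value sits close to neutrality.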

Supplementary data

Suppl. Table S1: Genomic features of the S. saprophyticus G764 strain isolated from the Hôpital La Timone in Marseille

Suppl. Table S2: Coding sequences classification of S. saprophyticus G764

Suppl. Table S3: Strains list and genomic features of 32 S. saprophyticus included in this study

Suppl. Table S4: Specific orthologous gene clusters of clinical strains

Suppl. Table S5: Genome recombination and evolutionary features of 32 S. saprophyticus included in this study


Suppl. Table S1: Genomic features of the S. saprophyticus G764 strain isolated from the Hôpital La Timone in Marseille

Feature                      Value
Sequence length              2,523,588 bp
Chromosome contigs           21
%GC                          33.27%
Average CDS length           837.63 bp
Protein coding density       0.8348
Number of genomic objects    2,660 (CDS, fCDS, rRNA, tRNA, miscRNA)
Number of CDS                2,523
Number of tRNA               63
Number of rRNA               23

Suppl. Table S2: Coding sequences classification of S. saprophyticus G764

Category                       Size
Unclassified Orf               847
Orf of unknown function        767
Enzyme                         494
Transporter                    101
Putative enzyme                89
Factor                         82
Structure                      43
Regulator                      39
Putative transporter           30
Putative membrane component    11
Cell process                   9
Putative factor                7
Putative regulator             6
Carrier                        4
Receptor                       4
Membrane component             2
Phenotype                      1
Lipoprotein                    1


Suppl. Table S3: Strains list and genomic features of 32 S. saprophyticus included in this study

Organism/Name     Strain        Size (Mb)  GC%   Acc. number    Scaffolds  Genes  Specimen/organism  Origin
S. saprophyticus  G764          2.61065    33.2  FKIN01         21         2513   Urines/human       Clinical
S. saprophyticus  FDAARGOS_168  2.65014    33.2  NZ_CP014113.1  -          2596   Vagina             Clinical
S. saprophyticus  FDAARGOS_137  2.57915    33.2  NZ_CP014057.1  -          2512   Urine/human        Clinical
S. saprophyticus  ATCC 15305    2.5779     33.1  NC_007350.1    -          2513   Urines/human       Clinical
S. saprophyticus  KACC 16562    2.61994    33.1  AHKB01         58         2573   Food/fish          Non-clinical
S. saprophyticus  758_SSAP      2.6294     33    JUUE01         98         2539   Clinical           Clinical
S. saprophyticus  SU8           2.70842    33    JXBG01         44         2641   Water/river        Clinical
S. saprophyticus  JB027         2.58685    33    LMYP01         51         2518   Hand/human         Clinical
S. saprophyticus  7108          2.58986    33    LMYQ01         32         2499   Urines/human       Clinical
S. saprophyticus  937           2.58974    33    LMYR01         36         2518   Urines/human       Clinical
S. saprophyticus  889           2.62719    33    LMYS01         31         2572   Urines/human       Clinical
S. saprophyticus  BWH3          2.66804    33    LMYT01         31         2616   Urines/human       Clinical
S. saprophyticus  9325          2.62812    33    LMYU01         33         2557   Urines/human       Clinical
S. saprophyticus  1815          2.60258    33    LMYV01         23         2538   Urines/human       Clinical
S. saprophyticus  2262          2.60279    33    LMYW01         26         2542   Urines/human       Clinical
S. saprophyticus  B2            2.61394    33    LMYX01         34         2563   Urines/human       Clinical
S. saprophyticus  9556          2.57004    33    LMYY01         33         2500   Urines/human       Clinical
S. saprophyticus  JBCB14        2.65292    33    LMZI01         93         2598   Nose/cow           Non-clinical
S. saprophyticus  AC34          2.62884    33    LMZJ01         37         2569   Urines/human       Clinical
S. saprophyticus  9777          2.66994    33    LMZK01         45         2621   Urines/human       Clinical
S. saprophyticus  1146          2.60879    32.9  LMZL01         29         2545   Urines/human       Clinical
S. saprophyticus  3201          2.59416    33    LMZM01         27         2522   Urines/human       Clinical
S. saprophyticus  3751          2.56226    33    LMZN01         34         2497   Urines/human       Clinical
S. saprophyticus  396A          2.59918    32.9  LNNE01         39         2504   Food/cheese        Non-clinical
S. saprophyticus  725A_RS6      2.5437     33    LNPG01         27         2452   Food/cheese        Non-clinical
S. saprophyticus  BC4           2.64424    33    LNPH01         42         2556   Food/cheese        Non-clinical
S. saprophyticus  735A          2.62363    33    LNPI01         41         2546   Food/cheese        Non-clinical
S. saprophyticus  CE6           2.6641     33    LNPJ01         62         2600   Food/salami        Non-clinical
S. saprophyticus  429A          2.57465    33    LNPK01         33         2489   Food/cheese        Non-clinical
S. saprophyticus  BWH2          2.62516    33    LNPV01         61         2540   Urines/human       Clinical
S. saprophyticus  MF4371        2.57949    33    LSLB01         38         2512   Plant/salmon       Non-clinical
S. saprophyticus  MF6029        2.51324    33.1  LSLC01         35         2441   Equip/meat         Non-clinical

The average genome size, %GC and coding sequences were calculated from data extracted from the NCBI database. The number of scaffolds/contigs is not indicated for complete genomes.

Suppl. Table S4: Specific orthologous gene clusters of clinical strains

Cluster number  Gene cluster (clinical strains)
2608  Efflux protein QacA
2682  DNA-binding transcriptional regulator cynR1
2700  Transcriptional regulator YybR
2704  Hypothetical protein
2711  Hypothetical protein
2832  PTS system glucose-specific transporter ptsG1
2848  N-acetylmannosamine kinase NanK
2941  EamA-like transporter
2942  Transport protein yicL1
3007  Zinc-responsive transporter
3008  Hypothetical protein
3012  2-deoxy-D-gluconate
3021  Ferrichrome-binding protein FhuD2
3067  Hypothetical protein
3068  Hypothetical protein
3072  Membrane protein insertase yidC1
3073  OxaA-like protein
3091  Hypothetical protein
3092  Hypothetical protein
3101  Lipase 2 lip2-1
3102  Lipase 2 lip2-2
3107  Glutamate/aspartate proton symporter gltP1
3115  Hypothetical protein
3129  Iron (3+)-hydroxamate-binding protein FhuD3
3217  Hypothetical protein
3588  Transposase
3729  CsbD-like protein
3926  Transcriptional activator rhaS
3928  2-keto-myo-inositol dehydratase iolE
3930  Phosphate transporter 4;1 pht4-1
3931  Phosphate transporter 4;1 pht4-2
3980  Hypothetical protein
4067  Hypothetical protein
4068  Hypothetical protein
4092  Hypothetical protein
4193  Staphylococcus haemolysin
4195  Topology modulation protein
4199  Integrase core domain
4260  Excinuclease ABC subunit A uvrA2
4488  Alpha/beta superfamily hydrolase mhpC
4646  DNA-binding transcriptase
4719  MarR family protein
4889  Hypothetical protein
4899  Hypothetical protein
5005  Hypothetical protein
5006  Hypothetical protein
5007  5-amino-6-(5-phospho-D-ribitylamino) uracil ybjI2

Cluster number  Gene cluster (non-clinical strains)
194   Hypothetical protein
197   Hypothetical protein
323   Hypothetical protein
346   Hypothetical protein
449   Topology modulation protein
450   Leucine export protein
451   Threonine efflux system
454   Integrase core domain
455   Hypothetical protein
525   Transcriptional regulatory
526   Hypothetical protein
575   Hypothetical protein
577   Hypothetical protein
578   Hypothetical protein
579   Hypothetical protein
680   Hypothetical protein
729   LysR-type transcriptional activator catM
730   Nitrogen assimilation protein
731   Hypothetical protein
760   PTS system glucose-specific transporter ptsG2
771   Lichenan operon transcriptional anti-terminator licR
772   L-ascorbate-specific enzyme IIA component ulaC
773   PTS system ascorbate
774   PTS system ascorbate-specific transporter ulaA
775   Cupin domain protein
776   Hypothetical protein
777   Cysteine and O-acetylserine exporter eamB 1
778   Transcriptional activator
1006  Hypothetical protein
1007  Hypothetical protein
1008  DNA-binding transcriptase
1074  Xylose isomerase-like
1075  Putative NADH-binding oxidoreductase ycjS 1
1076  Trehalose utilization protein
1077  Putative NADH-binding oxidoreductase ycjS 2
1078  HTH-type transcriptional regulator GlvR 2
1138  Hypothetical protein
1139  Hypothetical protein
1194  Hypothetical protein
1195  Hypothetical protein
1196  DNA-binding transcriptional repressor
1210  Hypothetical protein
1213  Hypothetical protein
1232  Hypothetical protein
1246  Hypothetical protein
1247  Hypothetical protein
1381  Major facilitator superfamily transporter
1413  Hypothetical protein
1420  Hypothetical protein
1422  Hypothetical protein
1423  Hypothetical protein
1424  Hypothetical protein
1425  Hypothetical protein
1427  Hypothetical protein
1428  Hypothetical protein
1460  Hypothetical protein
1497  Hypothetical protein
1685  Hypothetical protein
1686  DNA-binding transcriptional regulator
1699  Alanine dehydrogenase aldA 1
1719  Transcriptional regulator nolA
1726  Hypothetical protein
1732  Hypothetical protein
1778  Hypothetical protein
1803  Hypothetical protein
1804  Hypothetical protein
1840  Hypothetical protein
1875  Helix-turn-helix protein
1915  Hypothetical protein
1918  ATP-dependent Clp protease proteolytic clpP1
1920  Hypothetical protein
1922  Hypothetical protein
1926  Hypothetical protein
1927  Hypothetical protein
2102  CsbD-like protein
2155  Hypothetical protein
2156  HTH-type transcriptional regulator cysL
2257  Hypothetical protein
2267  Bacterial regulatory
2272  Dehydropantoate 2-reductase
2285  LysR substrate binding domain protein
2300  Putative ATPase of the ABC class
2301  Hypothetical protein
2302  Hypothetical protein
2303  Calcineurin-like phosphoesterase
2304  Hypothetical protein
2306  Hypothetical protein
2307  Hypothetical protein
2530  Hypothetical protein
2533  Hypothetical protein
2534  Aryl esterase
2537  Peptidoglycan hydrolase lytN
2540  3-oxoadipate enol-lactonase catD
2542  Regulator of mucoid phenotype rmpB
2543  3-hexulose-6-phosphate

The first column gives the cluster number and the second column the name of the gene cluster, identified with default parameters (75% coverage, 75% identity, e-value of 1e-05). A consensus clustering was generated from both the OrthoMCL and COG algorithm profiles using default parameters, and the specific gene clusters were queried from this consensus.


Suppl. Table S5: Genome recombination and evolutionary features of 32 S. saprophyticus included in this study

| Strains | Total SNPs | Num of SNPs inside recomb | Num of SNPs outside recomb | Num of Recomb Blocks | Bases in Recomb | r/m | rho/theta | Genome Length | Bases in Clonal Frame |
|---|---|---|---|---|---|---|---|---|---|
| LMYP01 | 2023 | 1058 | 965 | 20 | 456941 | 1.096373 | 0.020725 | 2445808 | 2092054 |
| LMYX01 | 2012 | 1635 | 377 | 22 | 666713 | 4.33687 | 0.058355 | 2445146 | 1919200 |
| LMYW01 | 0 | 0 | 0 | 0 | 521611 | 0 | 0 | 2438726 | 1952760 |
| LMYY01 | 0 | 0 | 0 | 0 | 601127 | 0 | 0 | 2458430 | 1927865 |
| LNPJ01 | 0 | 0 | 0 | 0 | 531496 | 0 | 0 | 2441631 | 1968670 |
| LMYS01 | 232 | 104 | 128 | 4 | 595149 | 0.8125 | 0.03125 | 2466934 | 1921953 |
| LNPK01 | 1295 | 375 | 920 | 9 | 333176 | 0.407609 | 0.009783 | 2445751 | 2160567 |
| AHKB01 | 3760 | 1383 | 2377 | 29 | 439317 | 0.581826 | 0.0122 | 2224939 | 1988447 |
| FDAARGOS_168 | 1795 | 1507 | 288 | 26 | 731469 | 5.232639 | 0.090278 | 2416892 | 1972470 |
| LMZJ01 | 293 | 124 | 169 | 3 | 643632 | 0.733728 | 0.017751 | 2437051 | 1868560 |
| LMYR01 | 266 | 140 | 126 | 4 | 596088 | 1.111111 | 0.031746 | 2468518 | 1920007 |
| LMZI01 | 0 | 0 | 0 | 0 | 643511 | 0 | 0 | 2414141 | 1864544 |
| ATCC_15305 | 2816 | 1590 | 1226 | 25 | 645353 | 1.296901 | 0.020392 | 2577899 | 1995103 |
| LMZL01 | 3057 | 2000 | 1057 | 27 | 710067 | 1.892148 | 0.025544 | 2393867 | 1959017 |
| LSLC01 | 187 | 98 | 89 | 3 | 348373 | 1.101124 | 0.033708 | 2336088 | 2029409 |
| LMYQ01 | 3365 | 1870 | 1495 | 28 | 590404 | 1.250836 | 0.018729 | 2453742 | 1968655 |
| JXBG01 | 2295 | 1138 | 1157 | 21 | 532966 | 0.983578 | 0.01815 | 2376844 | 2003840 |
| LMYV01 | 110 | 17 | 93 | 1 | 521611 | 0.182796 | 0.010753 | 2438996 | 1952792 |
| LMZK01 | 484 | 275 | 209 | 4 | 534028 | 1.315789 | 0.019139 | 2450266 | 1978368 |
| FKIN01 | 1410 | 895 | 515 | 25 | 563762 | 1.737864 | 0.048544 | 2374207 | 2074400 |
| LNPH01 | 95 | 66 | 29 | 2 | 580643 | 2.275862 | 0.068966 | 2435831 | 1922352 |
| LMYU01 | 192 | 128 | 64 | 2 | 522305 | 2 | 0.03125 | 2442017 | 1955796 |
| FDAARGOS_137 | 73 | 8 | 65 | 1 | 575592 | 0.123077 | 0.015385 | 2434049 | 1922906 |
| LNNE01 | 4670 | 858 | 3812 | 16 | 141152 | 0.225079 | 0.004197 | 2444890 | 2356409 |
| LNPG01 | 0 | 0 | 0 | 0 | 223631 | 0 | 0 | 2417237 | 2228233 |
| LMZM01 | 0 | 0 | 0 | 0 | 581305 | 0 | 0 | 2411812 | 1873084 |
| LMZN01 | 4950 | 3074 | 1876 | 49 | 765353 | 1.638593 | 0.026119 | 2394255 | 1866201 |
| JUUE01 | 408 | 312 | 96 | 4 | 587217 | 3.25 | 0.041667 | 2434852 | 1925405 |
| LMYT01 | 0 | 0 | 0 | 0 | 595119 | 0 | 0 | 2468866 | 1921236 |
| LNPV01 | 2230 | 1262 | 968 | 31 | 498565 | 1.303719 | 0.032025 | 2441671 | 2152621 |
| LNPI01 | 0 | 0 | 0 | 0 | 580083 | 0 | 0 | 2430443 | 1919375 |
| LSLB01 | 0 | 0 | 0 | 0 | 347897 | 0 | 0 | 2452262 | 2141776 |

Total SNPs: the total number of base substitutions reconstructed onto the strain branches. Num of SNPs inside recombinations: the number of base substitutions reconstructed onto the strain branches that fall within a predicted recombination (indicated as “r”). Num of SNPs outside recombinations: the number of base substitutions reconstructed onto the strain branches that fall outside a predicted recombination, that is, predicted to have occurred by point mutation (indicated as “m”). Num of Recombination Blocks: the number of recombination blocks reconstructed onto the strain branches. r/m: the ratio of base substitutions predicted to have been imported through recombination to those occurring through point mutation; this value gives a measure of the relative impact of recombination and mutation on the variation accumulated on the branches. rho/theta: the ratio of the number of recombination events to point mutations on a branch, a measure of the relative rate of recombination and point mutation.
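The two ratios in the legend can be recomputed from the count columns of the table. The sketch below is a minimal illustration, assuming that r/m is the number of SNPs inside recombinations divided by the number outside, and that rho/theta is the number of recombination blocks divided by the number of SNPs outside recombinations; this relationship is inferred from the tabulated values (for example the LMYP01 row), not from the software documentation. Returning 0 when no SNP lies outside a recombination is an assumption that mirrors the rows reported as 0. The function name is hypothetical.

```python
def recombination_ratios(snps_inside, snps_outside, blocks):
    """Return (r/m, rho/theta) for one strain branch.

    Assumed definitions, inferred from the table values:
      r/m       = SNPs inside recombinations / SNPs outside recombinations
      rho/theta = recombination blocks       / SNPs outside recombinations
    Branches without any SNP outside a recombination are reported as (0.0, 0.0).
    """
    if snps_outside == 0:
        return 0.0, 0.0
    return snps_inside / snps_outside, blocks / snps_outside


# Values copied from the LMYP01 row of the table.
r_m, rho_theta = recombination_ratios(1058, 965, 20)
print(round(r_m, 6), round(rho_theta, 6))
```

Applied to the LMYP01 counts, the sketch reproduces the r/m and rho/theta values printed in the table to six decimal places.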

Conclusion

The first part of this chapter was conducted to decipher the spread of S. saprophyticus using MALDI-TOF MS spectral data and to highlight specific protein signatures of the analysed strains in order to identify the strain clusters circulating in Marseille. Our study confirmed an increasing number of patients infected with S. saprophyticus with UTIs in the community from January 2002 to December 2015, with an abnormal peak indicating an outbreak in December 2014 in Marseille. Application of the MALDI-TOF MS method revealed a specific strain cluster that was geographically restricted to the Marseille community compared to Nice. This study provides a simple and available method of comparing clonal strains, which should be further implemented on a large scale to understand outbreaks.

In the second part of this chapter, we report the first comparative genomic analysis of S. saprophyticus isolates from various sources. The comparative analysis of 32 publicly available genomes of S. saprophyticus, including the FKIN01 genome sequenced in Marseille, revealed an open pan-genome profile, suggesting a high genome plasticity of this bacterial species. The parsimony pan-genome tree shows that S. saprophyticus emerged from ancestral strains close to those mainly isolated from food and the environment, which are heterogeneous and clustered over time through gene loss and gain events. We have shown that the adaptation of S. saprophyticus to the clinical condition is a result of genomic recombination through single nucleotide polymorphisms, mobile genetic element exchanges and gene loss, which allow it to adapt to variations in its direct environment. We also revealed that not all S. saprophyticus strains possess the functional uro-adherence protein UafA. The evolutionary analysis shows a positive selection of this protein, leading to the emergence of S. saprophyticus strains capable of adhering to the human bladder and epithelial cell membranes and causing UTIs.

In summary, this analysis shows that S. saprophyticus, initially considered saprophytic, has drifted towards being a pathogenic bacterium through massive genome recombination and single nucleotide polymorphism (SNP) events, resulting from the significant loss of genes categorised in the transcriptional regulatory and carbohydrate metabolism and transport functional groups, without minimising HGT events. Also, evolutionary selection, with non-synonymous substitutions outnumbering synonymous substitutions, has occurred in the uro-adherence protein gene. These events have led to the emergence of a specific population of S. saprophyticus capable of causing disease, particularly UTIs, in humans.


II. Comparative genomic analysis of Enterococcus faecalis and Enterococcus faecium


Introduction

The genus Enterococcus comprises 54 species that are ubiquitous and present in the gastrointestinal tract of animals, including mammals, reptiles, birds and insects, which is thought to be the largest reservoir of enterococci [1]. Two species, Enterococcus faecalis and Enterococcus faecium, cause the vast majority of hospital-acquired enterococcal infections in humans [2]. The success of E. faecium and E. faecalis in evolving as multi-resistant nosocomial pathogens is associated with their ability to acquire and share adaptive traits, including antimicrobial resistance genes encoded by mobile genetic elements (MGEs) [3]. However, E. faecium is more frequently reported as resistant to antibiotics, especially to vancomycin, than E. faecalis (8.8% vs 1.0% in Europe, 79.4% vs 8.5% in the US, 22.4% vs 0.1% in Canada) [4], and vanA and vanB are the most common mobile gene clusters involved [5,6]. Enterococcal genomic evolution has always been associated with the acquisition of vancomycin resistance genes carried by plasmids [6,7], and this phenomenon may be regulated by the presence or absence of clustered, regularly interspaced short palindromic repeats (CRISPR), which provide bacteria and archaea with a sequence-specific, acquired defence against plasmids and phages [8]. It is also known that restriction-modification systems and the anti-endonuclease ardA play an essential role in regulating the transfer of mobile genetic elements in the Enterococcus genus [9] and in the acquisition and spread of antimicrobial resistance genes [11]. A MALDI-TOF MS spectral analysis of E. faecalis strains isolated in Marseille showed clustering of human and chicken strains, suggesting zoonotic dissemination.

We therefore decided to sequence the genomes of four strains of E. faecalis isolated from humans and two from chickens, and to perform a comparative analysis with the publicly available genomes of E. faecalis and E. faecium to decipher their genomic evolution. In this study, we showed that massive recombination has occurred in E. faecalis. We also noted a larger number of CRISPR systems and associated (cas) proteins in E. faecalis than in E. faecium. Moreover, we found an association between the absence of a CRISPR system, the presence of the anti-endonuclease protein (ardA) (both HGT regulators) and the acquisition of vancomycin resistance genes (vanA, vanB) carried by plasmids, which differentiates E. faecalis from E. faecium. A considerable number of E. faecium strains were isolated from animals (14.7%), mainly in Europe (86.6%), with zoonotic dissemination demonstrated by the phylogenetic network analysis.

This work is in preparation as a draft article entitled “Extensive comparative genomic analysis of Enterococcus faecalis and Enterococcus faecium reveals a direct association between absence of CRISPR systems, the presence of an anti-endonuclease gene (ardA) and acquisition of vancomycin resistance genes in E. faecium.”


References

1. Guzman Prieto AM, van Schaik W, Rogers MRC, Coque TM, Baquero F, Corander J, et al. Global Emergence and Dissemination of Enterococci as Nosocomial Pathogens: Attack of the Clones? Front. Microbiol. 2016;7:1–15.
2. Higuita NIA, Huycke MM. Enterococcal Disease, Epidemiology, and Implications for Treatment. In: Gilmore MS, Clewell DB, Ike Y, Shankar N, editors. Enterococci From Commensals to Lead. Causes Drug Resist. Infect. 2014. p. 1–27.
3. García-Solache M, Lebreton F, McLaughlin RE, Whiteaker JD, Gilmore MS, Rice LB. Homologous Recombination within Large Chromosomal Regions Facilitates Acquisition of β-Lactam and Vancomycin Resistance in Enterococcus faecium. Antimicrob. Agents Chemother. American Society for Microbiology; 2016;60:5777–86.
4. Kristich CJ, Rice LB, Arias CA. Enterococcal Infection - Treatment and Antibiotic Resistance. Enterococci From Commensals to Lead. Causes Drug Resist. Infect. Massachusetts Eye and Ear Infirmary; 2014.
5. Manson JM, Keis S, Smith JMB, Cook GM. A Clonal Lineage of VanA-Type Enterococcus faecalis Predominates in Vancomycin-Resistant Enterococci Isolated in New Zealand. Antimicrob. Agents Chemother. 2003;47:204–10.
6. Quintiliani R, Courvalin P. Conjugal transfer of the vancomycin resistance determinant vanB between enterococci involves the movement of large genetic elements from chromosome to chromosome. FEMS Microbiol. Lett. 1994;119:359–63.
7. Lam MMC, Seemann T, Tobias NJ, Chen H, Haring V, Moore RJ, et al. Comparative analysis of the complete genome of an epidemic hospital sequence type 203 clone of vancomycin-resistant Enterococcus faecium. BMC Genomics. 2013;14:595.
8. Palmer KL, Gilmore MS. Multidrug-Resistant Enterococci Lack CRISPR-cas. MBio. 2010;1:e00227-10.
9. Price VJ, Huo W, Sharifi A, Palmer KL. CRISPR-Cas and Restriction-Modification Act Additively against Conjugative Antibiotic Resistance Plasmid Transfer in Enterococcus faecalis. Mol. Biol. Physiol. 2016;1:1–13.
10. Palmer KL, van Schaik W, Willems RJL, Gilmore MS. Enterococcal Genomics. Enterococci From Commensals to Lead. Causes Drug Resist. Infect. Massachusetts Eye and Ear Infirmary; 2014.
11. Mcmahon SA, Roberts GA, Johnson KA, Cooper LP, Liu H, White JH, et al. Extensive DNA mimicry by the ArdA anti-restriction protein and its role in the spread of antibiotic resistance. Nucleic Acids Res. 2009;37:4887–97.


Article IV:

Extensive comparative genomic analysis of Enterococcus faecalis and Enterococcus faecium reveals a direct association between absence of CRISPR systems, the presence of an anti-endonuclease gene (ardA) and acquisition of vancomycin resistance genes in E. faecium.

Kodjovi D. Mlaga, Seydina M. Diene, Vincent Garcia, Philippe Colson, Ruimy Raymond, Didier Raoult, Jean-Marc Rolain

Extensive comparative genomic analysis of Enterococcus faecalis and Enterococcus faecium reveals a direct association between absence of CRISPR systems, the presence of anti-endonuclease (ardA) and acquisition of vancomycin resistance genes in E. faecium.

Kodjovi D. Mlaga1, Seydina M. Diene1, Vincent Garcia1, Philippe Colson1, Ruimy Raymond2, Didier Raoult1, Jean-Marc Rolain1*

Affiliations:

1. URMITE, Aix-Marseille Université, UM63, CNRS 7278, IRD 198, INSERM 1095, IHU-Méditerranée Infection, 19-21 Boulevard Jean Moulin 13385 Marseille Cedex 05, France

2. Department of Bacteriology at Nice Academic Hospital, Nice Medical University, Nice, France

*Corresponding author: Prof Jean-Marc Rolain [email protected]

URMITE CNRS IRD UMR 6236, IHU Méditerranée Infection, Valorisation and Transfer, Aix-Marseille Université, Faculté de Médecine et de Pharmacie, 19-20 Bd Jean Moulin, 13385 Marseille Cedex 05, France, Tel: +33(0) 4 91 32 43 75 / +33 (0) 4 86 13 68 28

Keywords: E. faecalis, E. faecium, CRISPR-spacers, vancomycin-resistance, recombinations

Abstract

Introduction: Mobile genetic elements (MGEs) are known to carry vancomycin resistance genes (vanA, vanB) in E. faecalis and E. faecium, and researchers have suggested that the presence or absence of a CRISPR system and of an endonuclease/anti-endonuclease system can regulate their acquisition.

Material and Method: We performed a comparative genomic analysis of all available genomes (447 E. faecalis genomes, including six sequenced in Marseille, and 407 E. faecium genomes) and investigated the association between the presence or absence of CRISPRs, the endonuclease/anti-endonuclease system and the acquisition of antimicrobial resistance genes.

Results: Overall, 84.3% of the E. faecalis strains were isolated in North America and 51.1% of the E. faecium strains in Europe; 85.4% of E. faecalis and 69.4% of E. faecium were isolated primarily from humans. A considerable number of E. faecium strains were isolated from animals (14.7%), mainly in Europe (86.6%), with zoonotic dissemination demonstrated by the phylogenetic network analysis. We detected CRISPR systems in 53.8% of E. faecalis genomes and only 9.3% of E. faecium genomes, a statistically significant difference (p-value < 10⁻⁵). We found a negative correlation between the number of CRISPR systems detected and the size of the genome (r = -0.397, p-value < 10⁻⁵) and a positive correlation between the GC percentage of the genome and the number of CRISPR systems detected (r = 0.215, p-value < 10⁻⁵). The pan-genome analysis identified a total of 22,424 orthologous genes in E. faecalis and 24,004 in E. faecium. We detected recombination sites in 93% of the E. faecalis genomes analysed but in only 40.5% of the E. faecium genomes. The presence of a CRISPR system in the genome of E. faecium decreased the acquisition of vancomycin resistance genes 0.77-fold (p-value < 10⁻⁵). Also, the presence of the anti-endonuclease gene ardA, which is mainly found in E. faecium and is known to inactivate the protective activity of endonucleases, may explain the lower number of CRISPR systems found in E. faecium and enable its genome to be versatile in acquiring MGEs.

Conclusion: The data showed a direct association between the absence of CRISPR, the presence of the anti-endonuclease gene ardA and the acquisition of vancomycin resistance genes. Also, the zoonotic dissemination of E. faecium, linked to the misuse of avoparcin in farming, may explain why E. faecium, and not E. faecalis, is emerging as a vancomycin-resistant threat to humans.

Introduction

Enterococci are an ancient genus of Enterococcaceae that have adapted to living in complex environments and surviving in harsh conditions (Byappanahalli et al., 2012; Elsner et al., 2000). The genus Enterococcus comprises 54 species that are ubiquitously present in the gastrointestinal tract of animals, including mammals, reptiles, birds and insects, which is thought to be the largest reservoir of enterococci (Guzman Prieto et al., 2016). Two species, Enterococcus faecalis and Enterococcus faecium, cause the vast majority of hospital-acquired enterococcal infections in humans (Higuita and Huycke, 2014). The plasticity of Enterococcus genomes allows them to respond and adapt rapidly to a particular environment by acquiring genetic determinants, which increases their ability to colonise and infect the host (Guzman Prieto et al., 2016). The success of E. faecium and E. faecalis in evolving as multi-resistant nosocomial pathogens is associated with their capacity to harbour and share adaptive features, including antimicrobial resistance genes encoded by mobile genetic elements (MGEs) (García-Solache et al., 2016). However, E. faecium is more frequently reported as resistant to antibiotics, especially to vancomycin, than E. faecalis (8.8% vs 1.0% in Europe, 79.4% vs 8.5% in the US, 22.4% vs 0.1% in Canada) (Kristich et al., 2014), and vanA and vanB are the most common mobile genes involved (Bourgogne et al., 2008; Solheim et al., 2011; do Prado et al., 2016; Sivertsen et al., 2016). A sequence analysis of E. faecalis V583 revealed that 26% of the 3.36-Mb genome consisted of mobile elements, including seven putative phages, 38 insertion elements, remnants of three integrated plasmids and three independently replicating plasmids (Paulsen et al., 2003; Gilmore et al., 2013). Enterococcal genomic evolution has always been associated with the acquisition of vancomycin resistance genes carried by plasmids (Evans et al., 2001; Paulsen et al., 2003; Hegstad et al., 2010; Lebreton et al., 2011; Arias and Murray, 2012; Mikalsen et al., 2015) and of virulence genes (Kayaoglu and Ørstavik, 2004; Nallapareddy et al., 2006; Domann et al., 2007; Soheili et al., 2014). This phenomenon may be regulated by the presence or absence of clustered, regularly interspaced short palindromic repeats (CRISPR), which provide bacteria and archaea with a sequence-specific, acquired defence against plasmids and phages (Palmer and Gilmore, 2010). It is also known that restriction-modification systems and the anti-endonuclease ardA play a significant role in regulating the transfer of mobile genetic elements in the Enterococcus genus (Palmer et al., 2014) and in the acquisition and spread of antimicrobial resistance genes (Mcmahon et al., 2009). The aim of this work was to investigate (i) the presence of recombination sites in both E. faecalis and E. faecium and (ii) the association between the absence or presence of a CRISPR system, an endonuclease and anti-endonuclease system, and the acquisition of antimicrobial resistance genes, especially the vancomycin resistance genes vanA, vanB and vanC, using sequenced genomes of E. faecalis from Marseille and publicly available genomes of both species.

Materials and methods

Whole genome sequencing and sequence extraction from NCBI

Four clinical strains of E. faecalis and two strains from chicken faeces, all from Marseille, were included in this study. We extracted whole-genome nucleic acid from the six strains using the QIAGEN automated method. We sequenced the E. faecalis genomes using MiSeq technology (Illumina Inc, San Diego, CA, USA) with the mate-pair strategy. The reads were assembled using A5-miseq (Coil et al., 2015). The scaffolds were re-ordered and aligned against a reference genome, E. faecalis ATCC 22809, using the Mauve aligner (Darling et al., 2004) with default parameters. We retrieved a total of 447 whole genome sequences of E. faecalis and 407 of E. faecium from the NCBI database. We annotated all 854 genomes of the two species, including the genomes sequenced in Marseille (strains G823, G824, G881, G882, G883 and G884, with accession numbers FPDY01, FPDW01, FPEB01, FPDZ01, FPEC01 and FPEA01, respectively), with Prokka (Seemann, 2014) using the Pfam and SwissProt databases with default parameters.

Genome phylogenetic tree reconstruction of both E. faecalis and E. faecium

We performed a whole genome SNP alignment of the 447 genomes of E. faecalis and the 407 genomes of E. faecium using Scapper (https://github.com/tseemann/scapper). The Scapper package combines MUMmer (Delcher et al., 2003) (version 3.23) and trimAl (Capella-Gutiérrez et al., 2009) (version 1.4) to generate a whole genome alignment; we ran it with the default settings to reconstruct the genome phylogenetic tree. The genomes of E. faecalis ATCC 29212 (NZ_CP008816.1) and E. faecium DO (NC_017960.1) were used as references for E. faecalis and E. faecium, respectively. The alignments were reduced using snp-sites (Page et al., 2016) to remove the monomorphic sites. We inferred approximately-maximum-likelihood phylogenetic trees from the reduced nucleotide alignments using FastTree (Price et al., 2009). The resulting phylogenetic tree was used as the starting tree for the ClonalFrameML analysis.

Detection of recombination hotspots inside the genomes of E. faecalis and E. faecium

We identified putative base substitutions and recombination loci, that is, regions containing elevated densities of base substitutions suggestive of horizontally transferred sequences. We constructed a maximum likelihood phylogeny based on the putative point mutations outside these regions of high sequence diversity using ClonalFrameML (version v1.0-19-g9488a80) (Didelot and Wilson, 2015) with the default settings. We reconstructed the maximum-likelihood evolutionary phylogeny by determining the genetic genealogy, taking into account points of variation and genome plasticity, and generated a genome recombination heatmap plotted against the phylogenetic trees.

Detection of clustered regularly interspaced short palindromic repeats (CRISPRs) spacers inside the genomes of E. faecalis and E. faecium

CRISPR systems (spacers and repeats) were detected using the MinCED software (Bland et al., 2007). We set the minimum number of repeats a CRISPR system must contain to 3, the minimum and maximum lengths of the CRISPR repeats to 23 and 47 nucleotides, and the minimum and maximum lengths of the CRISPR spacers to 26 and 50 nucleotides. We generated a standard general feature format (GFF) file for every E. faecalis and E. faecium genome tested. The numbers of CRISPR systems and spacers detected were computed and imported into the R statistical environment for statistical analysis and plotting.
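The detection thresholds above can be restated as a simple acceptance rule. The following sketch only encodes the five limits quoted in the text and checks a candidate array against them; it does not reproduce the MinCED detection algorithm, and the class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CrisprLimits:
    """Thresholds quoted in the text (names are illustrative)."""
    min_repeats: int = 3
    min_repeat_len: int = 23
    max_repeat_len: int = 47
    min_spacer_len: int = 26
    max_spacer_len: int = 50


def passes_limits(repeats, spacers, limits=CrisprLimits()):
    """Return True if a candidate array satisfies every threshold."""
    if len(repeats) < limits.min_repeats:
        return False
    if any(not limits.min_repeat_len <= len(r) <= limits.max_repeat_len for r in repeats):
        return False
    return all(limits.min_spacer_len <= len(s) <= limits.max_spacer_len for s in spacers)


# Hypothetical candidate: three 30-nt repeats separated by two 30-nt spacers.
candidate_repeats = ["A" * 30] * 3
candidate_spacers = ["C" * 30] * 2
print(passes_limits(candidate_repeats, candidate_spacers))
```

Keeping the limits in one immutable object makes it easy to rerun the same check with different thresholds when comparing parameter choices.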

Orthologous gene detection and pan-genome analysis of both species

The pan-genome analysis was performed using Roary (version 3.6.8) (Page et al., 2015). We clustered homologous gene families using OrthoMCL (MCL) (version 1.4) and aligned genes with MAFFT (Katoh and Standley, 2013). We used the default BLAST parameters (95% minimum identity, E-value = 1e-05), and genes had to be present in > 99% of all isolates to be included in the hard-core genome. We determined core (hard-core) genes (present in 99%-100% of taxa), soft-core genes (present in 95%-99% of taxa), shell genes (present in 15%-95% of taxa) and cloud genes (present in 0%-15% of taxa), as described by Kaas et al. (2012). We generated a binary pan-genome profile coding the presence (1) or absence (0) of each gene in the genomes of both species. A parsimony pan-genome tree and a network interaction plot were generated for both species from a Manhattan distance matrix, using the method described by Snipen and Ussery (2010), to infer their ecological lifestyle. Orthologous genes related to antimicrobial resistance, endonucleases and anti-endonucleases were extracted using in-house scripts, and their distribution was plotted against the maximum-likelihood phylogenetic trees of both species.
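The four pan-genome categories can be expressed as a short classification rule over the binary presence/absence profile. The sketch below is an illustration rather than Roary's code: the function names and the toy profile are hypothetical, and the handling of values that fall exactly on a boundary (99%, 95%, 15%) is an assumption, since the ranges quoted in the text overlap at their end points.

```python
from collections import Counter


def classify_gene(n_present, n_genomes):
    """Assign a gene family to a pan-genome category from its presence count.

    Boundary handling is assumed: a value exactly on a threshold goes to the
    higher category (e.g. exactly 95% counts as soft-core).
    """
    if n_genomes <= 0 or not 0 <= n_present <= n_genomes:
        raise ValueError("presence count must lie between 0 and the number of genomes")
    fraction = n_present / n_genomes
    if fraction >= 0.99:
        return "core"
    if fraction >= 0.95:
        return "soft-core"
    if fraction >= 0.15:
        return "shell"
    return "cloud"


def summarise(profile):
    """Count categories in a binary profile {gene: [0/1 per genome]}."""
    return Counter(classify_gene(sum(row), len(row)) for row in profile.values())


# Hypothetical profile over eight genomes.
toy_profile = {
    "geneA": [1, 1, 1, 1, 1, 1, 1, 1],  # present everywhere
    "geneB": [1, 1, 1, 1, 0, 0, 0, 0],  # present in half of the genomes
    "geneC": [1, 0, 0, 0, 0, 0, 0, 0],  # present in a single genome
}
print(summarise(toy_profile))
```

With only eight genomes the soft-core band is empty by construction; it becomes populated only in collections large enough for a gene to be missing from 1% to 5% of the genomes.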

Statistical analysis

All statistical analyses in this study were performed using the R statistical software (R Core Team and R Development Core Team, 2016). We used Student’s t-test to compare means and the Pearson chi-square test to compare proportions within and between the two species. The Pearson correlation test was used to assess the statistical association between two genomic features. A logistic regression analysis was used to compute the association between genomic variables (the presence or absence of vancomycin resistance genes and the number of recombinations and CRISPR spacers detected within and across the genomes of both species). The odds ratio was calculated to interpret the association. We set the confidence interval (CI) level to 95%. Tests were considered significant at a p-value < 0.05. All p-values below 0.00001 are reported as p-value < 10⁻⁵ in this study.
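As a worked illustration of the proportion comparison, the sketch below applies a Pearson chi-square test and a simple odds ratio to the CRISPR counts as printed in the Results (219 of 447 E. faecalis genomes and 38 of 407 E. faecium genomes). It is written in plain Python rather than R, omits the continuity correction that some R functions apply by default to 2x2 tables, and computes a crude odds ratio with a Wald 95% interval rather than the logistic regression described in the text. The function names are hypothetical.

```python
import math


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (1 degree of freedom, no continuity correction).

    Table layout: [[a, b], [c, d]]. Returns (statistic, p_value).
    For 1 degree of freedom the upper tail equals erfc(sqrt(x / 2)).
    """
    n = a + b + c + d
    stat = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    return stat, math.erfc(math.sqrt(stat / 2))


def odds_ratio_ci(a, b, c, d, z=1.96):
    """Crude odds ratio with a Wald confidence interval (z = 1.96 for 95%)."""
    estimate = (a * d) / (b * c)
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    return (estimate,
            math.exp(math.log(estimate) - z * se),
            math.exp(math.log(estimate) + z * se))


# Rows: E. faecalis, E. faecium; columns: CRISPR detected, not detected.
stat, p_value = chi_square_2x2(219, 447 - 219, 38, 407 - 38)
estimate, lower, upper = odds_ratio_ci(219, 447 - 219, 38, 407 - 38)
print(f"chi-square = {stat:.1f}, p < 1e-5: {p_value < 1e-5}")
print(f"odds ratio = {estimate:.2f} (95% CI {lower:.2f} to {upper:.2f})")
```

The p-value is far below the 10⁻⁵ reporting threshold used in this study, which is consistent with the significance reported for this comparison.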

Results

Comparison of E. faecalis and E. faecium genome features shows differences in genome size and GC percentage

We analysed a total of 447 genomes of E. faecalis, including six genomes of E. faecalis sequenced in Marseille (Supplementary Table 1), and 407 E. faecium genomes (Supplementary Table 2) from the NCBI database. The clinical Marseille strains G823, G824, G883 and G884, with accession numbers FPDY01, FPDW01, FPEC01 and FPEA01, were assembled into 2.89 Mb, 3.096 Mb, 2.76 Mb and 2.95 Mb, respectively, and the animal (chicken) strains G881 and G882, with accession numbers FPEB01 and FPDZ01, into 2.97 Mb and 3.25 Mb, respectively. Details of the genomic features are reported in Suppl. Table I. There were 84.3% (337/447) of E. faecalis strains isolated from North America and 51.1% (208/407) of E. faecium strains isolated from Europe. There were 85.4% (382/447) of the E. faecalis strains and 69.4% (281/407) of the E. faecium strains isolated primarily from humans. However, a considerable number of E. faecium strains were isolated from animals, 14.7% (60/407), and those were mainly from Europe, 86.6% (52/60) (Supplementary Figure 1). The average genome size of E. faecalis is estimated at 3.055 Mbp and that of E. faecium at 2.908 Mbp, a statistically significant difference between the two species (p-value < 10⁻⁵). The average proteome size of E. faecalis is estimated at 2936 proteins and that of E. faecium at 2711. The GC percentage of E. faecalis is 37.36%, while that of E. faecium is 38.01%, a significant difference (p-value < 10⁻⁵). The analysis of genome features therefore indicates a statistically significant difference between E. faecalis and E. faecium.

The genomes of E. faecalis contain a high density of recombination hotspots compared to E. faecium

We generated a whole genome SNP alignment for both species. The phylogenetic tree of E. faecalis shows seven different clades (Figure 1-A), organised as two homogeneous clades close to the parental strains, containing quasi-exclusively human strains, and five heterogeneous clades distant from the ancestral strains. These heterogeneous clades, whose branch lengths indicate the level of genome evolution from the ancestral strain, contain a mixture of animal, environmental, food and human strains. In E. faecium, twelve smaller and homogeneous clades were identified (Figure 1-B). Clades 7, 8, 9, 10 and 11 are a mixture of animal, environmental, food and human strains, whereas clades 1, 2, 3, 4, 6 and 12 contain some strains for which no source was reported. We detected recombination sites in 93% (415/447) of the E. faecalis genomes analysed (Figure 2-A). The number of recombinations detected per genome varied from 0 to 245, with 27 recombination sites detected per genome on average. We identified a substantial number of recombination sites in clades containing mixtures of animal, environmental, food and human strains. We detected recombination sites in only 40.5% (165/407) of the E. faecium genomes, with the number detected varying from 0 to 35 and four recombinations on average per genome analysed (Figure 2-B). We observed a similar profile to that of E. faecalis, where most of the recombination occurred within the clades containing a mixture of isolates (animal, environmental, food, human). These results indicate that more genomic recombination hotspots were detected in E. faecalis than in E. faecium, a statistically significant difference (p-value < 10⁻⁵), and that the majority occurred in branches containing mixtures of strains from different environments rather than in branches unique to human strains.

207 Human and non-human strains shared common genetic lineages in both E. faecalis and E.

208 faecalis, and E. faecium dissemination may be zoonotic.

The pan-genome analysis identified a total of 22,424 orthologous genes in E. faecalis and 24,004 in E. faecium. We generated parsimony pan-genome phylogenetic trees for both species based on the presence or absence of genes. Tree estimation is based on the distance between genomes, computed as the relative Manhattan distance between pan-genome profiles. The principle is that two genomes are similar not only when they share the same genes, but also when they lack the same genes (Snipen and Ussery 2010). We identified eight putative phylogenetic groups in E. faecalis (Supplementary Figure 3-A) and 12 in E. faecium (Supplementary Figure 3-B). In both species, most of the phylogenetic branches contain a mixture of human and non-human strains, suggesting that they share an evolutionary history and might derive from the same genetic lineage. In E. faecalis, group G8 contains almost exclusively human strains, all isolated in the US, suggesting a clonal expansion. The phylogenetic network analysis revealed that in E. faecalis (Figure 3-A), non-human (animal) strains are mostly located at the periphery (edges) of the network; two groups were identified, with two emerging clusters consisting exclusively of human strains. This observation suggests that the environment and animals might be contaminated by human waste, with non-human strains showing low aggregation with human strains inside the two network groups identified. On the other hand, in E. faecium (Figure 3-B), non-human strains are mostly located inside the network, with high aggregation with human strains, evidence of direct interaction between human and non-human strains, and one large emerging cluster containing a mixture of human and non-human strains. This observation suggests greater zoonotic dissemination and transmission of E. faecium than of E. faecalis.
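The distance underlying the pan-genome trees can be illustrated with a minimal sketch. It assumes that each genome is encoded as a binary presence/absence profile over the same gene clusters and that "relative" means dividing the Manhattan distance by the number of gene clusters; the exact scaling used in the analysis is not stated here, and the function name and the three profiles are hypothetical.

```python
def relative_manhattan(profile_a, profile_b):
    """Manhattan distance between two binary pan-genome profiles,
    divided by the number of gene clusters (assumed normalisation)."""
    if len(profile_a) != len(profile_b):
        raise ValueError("profiles must cover the same gene clusters")
    if not profile_a:
        raise ValueError("profiles must not be empty")
    # A cluster contributes 1 when it is present in one genome and absent
    # in the other; shared presence and shared absence both contribute 0.
    differing = sum(abs(a - b) for a, b in zip(profile_a, profile_b))
    return differing / len(profile_a)


# Hypothetical profiles over six gene clusters (1 = present, 0 = absent).
genome_x = [1, 1, 0, 0, 1, 0]
genome_y = [1, 1, 0, 0, 0, 1]
genome_z = [0, 0, 1, 1, 1, 0]

d_xy = relative_manhattan(genome_x, genome_y)
d_xz = relative_manhattan(genome_x, genome_z)
print(f"d(x, y) = {d_xy:.3f}, d(x, z) = {d_xz:.3f}")
```

In this toy case, genomes x and y share two genes and lack the same two genes, so they are closer to each other than x is to z.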

The pan-genome components and the distribution of orthologous genes are shown in Supplementary Table III. Both species present a similar pan-genome profile, with an exponential increase in the total number of orthologous genes and of new genes as more genomes were added to the pan-genome, which suggests that the pan-genome of these two species is open. The hard-core genome of E. faecalis represents 42% of the average proteome, while that of E. faecium is estimated at 31.6%. The high number of orthologous genes classified as cloud genes (accessory genes), 18,650 for E. faecalis and 20,345 for E. faecium, suggests a prominent level of genome plasticity.
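The split of the pan-genome into hard-core, soft-core, shell and cloud genes can be sketched as a classification by the number of genomes carrying each gene cluster. The cut-offs below (present in all genomes, at least 95% of genomes, at least 15% of genomes) are assumptions chosen to be consistent with the genome ranges listed in Supplementary Table III; the text does not state them, and the function name is hypothetical.

```python
import math


def classify_cluster(n_with_gene, n_genomes,
                     soft_core_fraction=0.95, shell_fraction=0.15):
    """Assign a gene cluster to a pan-genome component.

    The fractions are assumed thresholds, not values given in the text.
    """
    if not 0 <= n_with_gene <= n_genomes:
        raise ValueError("n_with_gene must lie between 0 and n_genomes")
    soft_core_min = math.floor(soft_core_fraction * n_genomes)
    shell_min = math.floor(shell_fraction * n_genomes)
    if n_with_gene == n_genomes:
        return "hard-core"
    if n_with_gene >= soft_core_min:
        return "soft-core"
    if n_with_gene >= shell_min:
        return "shell"
    return "cloud"


# Boundaries checked against the 447 E. faecalis and 407 E. faecium genomes.
for count, total in [(447, 447), (424, 447), (423, 447), (65, 447), (386, 407)]:
    print(count, total, classify_cluster(count, total))
```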

E. faecalis genomes contained more CRISPR systems than E. faecium genomes and lacked anti-endonuclease genes

We detected CRISPR systems in 53.8% (219/447) of E. faecalis genomes and in only 9.3% (38/407) of E. faecium genomes, a statistically significant difference (p-value < 10⁻⁵). In E. faecalis, the number of CRISPR systems detected per genome varied from 1 to 5, with an average of two. In E. faecium, it varied from 1 to 2, with an average of less than one per genome. In E. faecalis, we observed a negative correlation between the number of CRISPR systems detected and the size of the genome (r = -0.397, CI [-0.4725613, -0.3161209], p-value < 10⁻⁵) and a positive correlation between the GC content (%) of the genome and the number of CRISPR systems detected (r = 0.215, CI [0.1250817, 0.3020480], p-value < 10⁻⁵). We found a positive association between the number of CRISPR spacers detected per genome and the number of recombination hotspots detected in E. faecalis (F-statistic: 19.97, p-value < 10⁻⁵). Three CRISPR-associated proteins, CRISPR-cas1, CRISPR-cas2 and CRISPR-cas9, were identified in both species. We detected two variants of cas9 (cas9 and cas9.1) in E. faecalis alone (sequence identity cutoff set at 80%). Also, the anti-endonuclease gene ardA was found mainly in E. faecium, with a low density where genome recombination is high.
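The confidence intervals reported for the correlation coefficients can be approximated with the Fisher z-transformation. The sketch below assumes a 95% confidence level and n = 447 E. faecalis genomes; neither is stated explicitly alongside the coefficients, and the function name is hypothetical. Because the coefficients are given to three decimals, the output only matches the reported bounds approximately.

```python
import math
from statistics import NormalDist


def pearson_ci(r, n, level=0.95):
    """Approximate confidence interval for a Pearson correlation
    using the Fisher z-transformation."""
    if not -1 < r < 1:
        raise ValueError("r must lie strictly between -1 and 1")
    if n <= 3:
        raise ValueError("n must be greater than 3")
    z = math.atanh(r)                     # Fisher z-transform of r
    se = 1 / math.sqrt(n - 3)             # standard error on the z scale
    crit = NormalDist().inv_cdf(0.5 + level / 2)
    return math.tanh(z - crit * se), math.tanh(z + crit * se)


# Genome size versus number of CRISPR systems (assumed n = 447).
low, high = pearson_ci(-0.397, 447)
print(f"r = -0.397, CI [{low:.4f}, {high:.4f}]")

# GC content versus number of CRISPR systems (assumed n = 447).
low_gc, high_gc = pearson_ci(0.215, 447)
print(f"r = 0.215, CI [{low_gc:.4f}, {high_gc:.4f}]")
```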

Association between the absence of CRISPR systems, the presence of anti-endonuclease genes (ardA) and the acquisition of the vancomycin resistance genes vanA and vanB in E. faecium

We retrieved the presence/absence matrices of antimicrobial resistance genes, endonuclease genes and anti-endonuclease genes from the overall pan-genome matrices, and plotted heat maps against maximum-likelihood phylogenetic trees to analyse their distribution in both E. faecalis (Figure 4-A) and E. faecium (Figure 4-B). The vancomycin resistance genes vanA and vanB were present in both species, but E. faecium harbours more of them than E. faecalis (E. faecium/E. faecalis: vanA 156/26, p-value < 10⁻⁵; vanB 188/29, p-value < 10⁻⁵). We detected a CRISPR system in 219 E. faecalis genomes and 38 E. faecium genomes (Figure 4-A and B). Most importantly, in both species the vancomycin resistance genes were detected in genomes lacking a CRISPR system. Endonuclease genes, including types I, II and III, were found in both species, with slightly more in E. faecium. However, anti-endonuclease genes (ardA) were absent in E. faecalis, while present in more than 90% of the analysed E. faecium genomes. In E. faecium, the presence of a CRISPR system was associated with lower odds of carrying vancomycin resistance genes (estimate = -0.77, OR = 0.46, CI = [0.348, 0.601], p-value < 10⁻⁵). In both species, the number of recombination hotspots detected in the genome was associated with slightly lower odds of carrying vancomycin resistance genes (estimate = -0.02, OR = 0.973, CI = [0.956, 0.985], p-value < 0.00021). We found a direct association between the absence of CRISPR spacers, the presence of anti-endonuclease genes (ardA) and the acquisition of vancomycin resistance in E. faecium. Also, ble (bleomycin resistance gene) and tetR, associated with resistance to tetracycline, were found only in E. faecalis, in human strains closer to the ancestral strain, and were absent from the E. faecium genomes that we analysed. Conversely, fosB and fosX, both associated with resistance to fosfomycin and mostly described in Gram-positive bacteria, were detected only in E. faecium.
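The relationship between a regression estimate and its odds ratio can be shown with a short worked example. It assumes that the reported estimates are log-odds coefficients from a logistic regression, which the pairing of the estimate -0.77 with OR = 0.46 suggests; the function names are hypothetical.

```python
import math


def odds_ratio(log_odds_estimate):
    """Convert a logistic regression coefficient (log-odds) to an odds ratio."""
    return math.exp(log_odds_estimate)


def percent_change_in_odds(log_odds_estimate):
    """Percentage change in the odds implied by a log-odds coefficient."""
    return (math.exp(log_odds_estimate) - 1) * 100


# Estimate reported for the presence of a CRISPR system in E. faecium.
crispr_or = odds_ratio(-0.77)
crispr_change = percent_change_in_odds(-0.77)
print(f"OR = {crispr_or:.2f}, change in odds = {crispr_change:.1f}%")
```

Read this way, an estimate of -0.77 corresponds to odds of carrying vancomycin resistance genes that are roughly halved when a CRISPR system is present, rather than to a decrease "by 0.77 times".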

Discussion

E. faecalis and E. faecium are the main causes of enterococcal nosocomial infections (Guzman Prieto et al., 2016) and have been widely reported in bloodstream (Furtado et al., 2014) and urinary tract infections (Ronald, 2003). In the past decade, the emergence of these two species has been attributed to resistance to antibiotics used to treat human infections, especially vancomycin (Lebreton et al., 2013). Palmer et al. and Van Hal et al. have shown that vancomycin resistance genes, especially vanA and vanB and their associated regulatory genes vanAHXZ and vanBYXZ, are transferable by mobile genetic elements and plasmids (Palmer et al., 2010; Van Hal et al., 2016). Palmer and Gilmore have demonstrated that multidrug-resistant enterococci lack CRISPR-cas in their genomes (Palmer and Gilmore, 2010). Here we performed an extensive comparative analysis of 854 publicly available genomes of E. faecalis (447) and E. faecium (407), including six genomes of E. faecalis sequenced in Marseille. We aimed to explain the higher rate of vancomycin resistance reported worldwide in E. faecium (Cetinkaya et al., 2000; O’Driscoll and Crank, 2015) compared to E. faecalis, and its association with the presence or absence of a CRISPR system and its associated cas proteins in their respective genomes.

The CRISPR system (spacers and related proteins) provides bacteria and archaea with a sequence-specific, acquired defence against plasmids and phages (Marraffini, 2015), that is, adaptive immunity against foreign elements. When a virus injects its genetic material into a bacterium, a short sequence of the viral genome, known as a spacer, is integrated into the CRISPR locus to immunise the bacterium. Spacers are transcribed into small RNAs that guide the cleavage of viral DNA by Cas nuclease proteins. The immunised population not only acquires resistance to its predators but also passes this resistance mechanism vertically to its progeny (Marraffini, 2015; Ratner et al., 2016).

In this study, we observed that genomes of E. faecalis contain a significantly higher number of recombination hotspots than those of E. faecium. This recombination has armed E. faecalis with a substantial number of CRISPR systems that protect the bacteria from acquiring subsequent external DNA, such as mobile genetic elements and plasmids. Accordingly, we found a positive correlation between the number of recombination hotspots and the presence of CRISPR spacers in E. faecalis. The large number of ardA orthologs detected in E. faecium is evidence of the acquisition of plasmid elements, which are carriers of the vancomycin resistance genes vanA and vanB (Lebreton et al., 2011; Brodrick et al., 2016; Sivertsen et al., 2016). Our study also showed that significantly more vancomycin resistance genes vanA and vanB were detected in E. faecium than in E. faecalis. We therefore conclude that the presence of a CRISPR system protects E. faecalis against the acquisition of specific mobile genetic elements carrying the vancomycin resistance genes vanA and vanB. In contrast, the presence of ardA genes inactivates the protective activity of endonucleases and enables the genome of E. faecium to acquire external DNA horizontally. The anti-endonuclease gene ardA is known to regulate horizontal gene transfer, causing multidrug resistance in Enterococcus (Gilmore et al., 2014; Palmer et al., 2014), and to contribute actively to the acquisition and dissemination of antimicrobial resistance genes (Mcmahon et al., 2009). These observations explain why E. faecium is reported to exhibit more resistance to vancomycin than E. faecalis.

In this analysis, CRISPR-cas9 was detected in E. faecalis, as were CRISPR-cas1 and CRISPR-cas2, and was almost absent in E. faecium. CRISPR-cas1 and CRISPR-cas2, two metal-dependent nucleases, are both necessary and sufficient for spacer acquisition, but dispensable for target interference (Datsenko et al., 2012; Ratner et al., 2016). However, CRISPR-cas9, the signature protein of Type II systems, is involved in target surveillance and interference (Jinek et al., 2012). The endonuclease activity of Cas9 is dispensable for acquisition, as its role is to select spacers, whereas Cas1, whose non-specific nuclease activity is required for adaptation, cleaves the contiguous sequence, yielding a precisely selected spacer sequence (Ratner et al., 2016). This explains the near absence of the CRISPR-cas9 protein in E. faecium.

The parsimony pan-genome tree analysis reveals that the strains of E. faecalis and E. faecium associated with human infection may have originated in animals, food and the environment. An ecological network visualisation of both species shows that in E. faecalis there is little interaction between non-human and human strains (little zoonotic transmission). Conversely, animal and environmental E. faecium strains are linked to human strains, and strains may pass from animals and the environment to humans and vice versa (high zoonotic transmission). This is supported by Lebreton et al. (2013), who found that the epidemic hospital-adapted lineage of E. faecium emerged approximately 75 years ago, concomitant with the introduction of antibiotics, from a population that included mostly animal strains, and not from human commensal lines. This explains why E. faecium strains are more associated with zoonotic dissemination; the emergence of vancomycin-resistant E. faecium in humans may be related to the use of avoparcin as an animal growth promoter, which is known to produce cross-resistance to vancomycin (Nilsson, 2012).
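The two steps of CRISPR immunity described in the Discussion, spacer acquisition followed by spacer-guided recognition of foreign DNA, can be illustrated with a deliberately simplified toy model. Everything in it is an assumption made for illustration: the class name, the fixed spacer length and the example sequences are hypothetical, and the model ignores PAM recognition, strand orientation, mismatches and the Cas proteins themselves.

```python
class ToyCrisprLocus:
    """Toy CRISPR locus: stores spacers and recognises DNA containing them."""

    def __init__(self, spacer_length=8):
        self.spacer_length = spacer_length
        self.spacers = []

    def acquire_spacer(self, invader_dna, position=0):
        """Copy a short stretch of invading DNA into the locus (adaptation)."""
        spacer = invader_dna[position:position + self.spacer_length]
        if len(spacer) < self.spacer_length:
            raise ValueError("invader DNA too short at this position")
        if spacer not in self.spacers:
            self.spacers.append(spacer)
        return spacer

    def is_targeted(self, dna):
        """Return True if any stored spacer matches the DNA (interference)."""
        return any(spacer in dna for spacer in self.spacers)


locus = ToyCrisprLocus()
phage = "ATGCGTACCTGAAGCTTGCA"      # hypothetical phage sequence
plasmid = "GGGTTTCCCAAAGGGTTTCC"    # hypothetical unrelated plasmid

targeted_before = locus.is_targeted(phage)   # naive cell: no immunity yet
locus.acquire_spacer(phage, position=4)      # first infection leaves a spacer
targeted_after = locus.is_targeted(phage)    # later infection is recognised
print(targeted_before, targeted_after, locus.is_targeted(plasmid))
```

The immunity is sequence-specific: the stored spacer recognises the phage it came from but not the unrelated plasmid.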

Conclusion

This study shows that extensive genomic recombination has occurred in E. faecalis, driven by mobile genetic elements and phages capable of inducing adaptive immunity through the acquisition of CRISPR systems. These systems protect E. faecalis from acquiring external DNA sequences carrying the vancomycin resistance genes vanA and vanB. Conversely, E. faecium carries fewer CRISPR systems and a substantial number of anti-endonuclease ardA genes and vancomycin resistance genes. The emergence and dissemination of E. faecium infection may be due to zoonotic transmission, and the misuse of antibiotics (avoparcin) may have selected for vancomycin resistance in enterococci. These findings explain why E. faecium, rather than E. faecalis, is the species most often reported worldwide as vancomycin-resistant Enterococcus.

Acknowledgments

We are very grateful to

Conflict of interest and financial disclosure

No potential conflict of interest or financial disclosure for all authors.

References

Arias CA, Murray BE. 2012. The rise of the Enterococcus: beyond vancomycin resistance. Nat Rev Microbiol 10: 266–78.

Bland C, Ramsey TL, Sabree F, Lowe M, Brown K, Kyrpides NC, Hugenholtz P. 2007. CRISPR recognition tool (CRT): a tool for automatic detection of clustered regularly interspaced palindromic repeats. BMC Bioinformatics 8: 209.

Bourgogne A, Garsin DA, Qin X, Singh KV, Sillanpaa J, Yerrapragada S, Ding Y, Dugan-Rocha S, Buhay C, Shen H, et al. 2008. Large scale variation in Enterococcus faecalis illustrated by the genome analysis of strain OG1RF. Genome Biol 9: R110.

Brodrick HJ, Raven KE, Harrison EM, Blane B, Reuter S, Török ME, Parkhill J, Peacock SJ. 2016. Whole-genome sequencing reveals transmission of vancomycin-resistant Enterococcus faecium in a healthcare network. Genome Med 8: 4.

Byappanahalli MN, Nevers MB, Korajkic A, Staley ZR, Harwood VJ. 2012. Enterococci in the environment. Microbiol Mol Biol Rev 76: 685–706.

Capella-Gutiérrez S, Silla-Martínez JM, Gabaldón T. 2009. trimAl: A tool for automated alignment trimming in large-scale phylogenetic analyses. Bioinformatics 25: 1972–1973.

Cetinkaya Y, Falk P, Mayhall CG. 2000. Vancomycin-resistant enterococci. American Society for Microbiology (ASM).

Coil D, Jospin G, Darling AE. 2015. A5-miseq: An updated pipeline to assemble microbial genomes from Illumina MiSeq data. Bioinformatics 31: 587–589.

Csárdi G, Nepusz T. 2006. The igraph software package for complex network research. InterJournal Complex Syst 1695: 1–9.

Darling ACE, Mau B, Blattner FR, Perna NT. 2004. Mauve: multiple alignment of conserved genomic sequence with rearrangements. Genome Res 14: 1394–1403.

Datsenko KA, Pougach K, Tikhonov A, Wanner BL, Severinov K, Semenova E. 2012. Molecular memory of prior infections activates the CRISPR/Cas adaptive bacterial immunity system. Nat Commun 3: 945.

Delcher AL, Salzberg SL, Phillippy AM. 2003. Using MUMmer to identify similar regions in large sequence sets. Curr Protoc Bioinformatics Chapter 10: Unit 10.3.

Didelot X, Wilson DJ. 2015. ClonalFrameML: efficient inference of recombination in whole bacterial genomes. PLoS Comput Biol 11: e1004041.

do Prado GVB, Marchi AP, Moreno LZ, Rizek C, Amigo U, Moreno AM, Rossi F, Guimaraes T, Levin AS, Costa SF. 2016. Virulence and resistance pattern of a novel sequence type of linezolid-resistant Enterococcus faecium identified by whole-genome sequencing.

Domann E, Hain T, Ghai R, Billion A, Kuenne C, Zimmermann K, Chakraborty T. 2007. Comparative genomic analysis for the presence of potential enterococcal virulence factors in the probiotic Enterococcus faecalis strain Symbioflor 1. Int J Med Microbiol 297: 533–539.

Elsner HA, Sobottka I, Mack D, Claussen M, Laufs R, Wirth R. 2000. Virulence factors of Enterococcus faecalis and Enterococcus faecium blood culture isolates. Eur J Clin Microbiol Infect Dis 19: 39–42.

Evans M, Davies JK, Sundqvist G, Figdor D. 2001. Mechanisms involved in the resistance of Enterococcus faecalis to calcium hydroxide. Aust Endod J 27: 115.

Fisher K, Phillips C. 2009. The ecology, epidemiology and virulence of Enterococcus. Microbiology 155: 1749–1757.

Furtado I, Xavier PCN, Tavares LVM, Alves F, Martins SF, Martins A de S, Palhares DB. 2014. Enterococcus faecium and Enterococcus faecalis in blood of newborns with suspected nosocomial infection. Rev do Inst Med Trop São Paulo 56: 77–80.

García-Solache M, Lebreton F, McLaughlin RE, Whiteaker JD, Gilmore MS, Rice LB. 2016. Homologous Recombination within Large Chromosomal Regions Facilitates Acquisition of β-Lactam and Vancomycin Resistance in Enterococcus faecium. Antimicrob Agents Chemother 60: 5777–86.

Gilmore MS, Clewell DB, Ike Y, Shankar N. 2014. Enterococci, From Commensals to Leading Causes of Drug Resistant Infection eds. M.S. Gilmore, D.B. Clewell, Y. Ike, and N. Shankar. Boston: Massachusetts http://www.ncbi.nlm.nih.gov/.

Gilmore MS, Lebreton F, van Schaik W. 2013. Genomic transition of enterococci from gut commensals to leading causes of multidrug-resistant hospital infection in the antibiotic era. Curr Opin Microbiol 16: 10–16.

Guzman Prieto AM, van Schaik W, Rogers MRC, Coque TM, Baquero F, Corander J, Willems RJL. 2016. Global Emergence and Dissemination of Enterococci as Nosocomial Pathogens: Attack of the Clones? Front Microbiol 7: 1–15.

Hegstad K, Mikalsen T, Coque TM, Werner G, Sundsfjord A. 2010. Mobile genetic elements and their contribution to the emergence of antimicrobial resistant Enterococcus faecalis and Enterococcus faecium. Clin Microbiol Infect 16: 541–554.

Higuita NIA, Huycke MM. 2014. Enterococcal Disease, Epidemiology, and Implications for Treatment. In Enterococci: From Commensals to Leading Causes of Drug Resistant Infection (eds. Gilmore MS, Clewell DB, Ike Y, Shankar N), pp. 1–27.

Jinek M, Chylinski K, Fonfara I, Hauer M, Doudna JA, Charpentier E. 2012. A Programmable Dual-RNA–Guided DNA Endonuclease in Adaptive Bacterial Immunity. Science 337.

Kaas RS, Friis C, Ussery DW, Aarestrup FM. 2012. Estimating variation within the genes and inferring the phylogeny of 186 sequenced diverse Escherichia coli genomes. BMC Genomics 13: 577.

Katoh K, Standley DM. 2013. MAFFT multiple sequence alignment software version 7: improvements in performance and usability. Mol Biol Evol 30: 772–80.

Kayaoglu G, Ørstavik D. 2004. Virulence factors of Enterococcus faecalis: relationship to endodontic disease. Crit Rev Oral Biol Med 15: 308–320.

Kristich CJ, Rice LB, Arias CA. 2014. Enterococcal Infection—Treatment and Antibiotic Resistance. Massachusetts Eye and Ear Infirmary.

Lebreton F, Depardieu F, Bourdon N, Fines-Guyon M, Berger P, Camiade S, Leclercq R, Courvalin P, Cattoir V. 2011. D-Ala-D-Ser VanN-Type Transferable Vancomycin Resistance in Enterococcus faecium. Antimicrob Agents Chemother 55: 4606–4612.

Lebreton F, van Schaik W, McGuire AM, Godfrey P, Griggs A, Mazumdar V, Corander J, Cheng L, Saif S, Young S, et al. 2013. Emergence of epidemic multidrug-resistant Enterococcus faecium from animal and commensal strains. MBio 4: e00534-13.

Marraffini LA. 2015. CRISPR-Cas immunity in prokaryotes. Nature 526: 55–61.

Mcmahon SA, Roberts GA, Johnson KA, Cooper LP, Liu H, White JH, Carter LG, Sanghvi B, Oke M, Walkinshaw MD, et al. 2009. Extensive DNA mimicry by the ArdA anti-restriction protein and its role in the spread of antibiotic resistance. Nucleic Acids Res 37: 4887–4897.

Mikalsen T, Pedersen T, Willems R, Coque TM, Werner G, Sadowy E, van Schaik W, Jensen LB, Sundsfjord A, Hegstad K. 2015. Investigating the mobilome in clinically important lineages of Enterococcus faecium and Enterococcus faecalis. BMC Genomics 16: 282.

Nallapareddy SR, Singh KV, Sillanpää J, Garsin DA, Höök M, Erlandsen SL, Murray BE. 2006. Endocarditis and biofilm-associated pili of Enterococcus faecalis. J Clin Invest 116: 2799–2807.

Nilsson O. 2012. Vancomycin resistant enterococci in farm animals - occurrence and importance. Infect Ecol Epidemiol 2.

O’Driscoll T, Crank CW. 2015. Vancomycin-resistant enterococcal infections: epidemiology, clinical manifestations, and optimal management. Infect Drug Resist 8: 217–30.

Page AJ, Cummins CA, Hunt M, Wong VK, Reuter S, Holden MTG, Fookes M, Falush D, Keane JA, Parkhill J. 2015. Roary: Rapid large-scale prokaryote pan genome analysis. Bioinformatics 31: btv421.

Page AJ, Taylor B, Delaney AJ, Soares J, Seemann T, Keane JA, Harris SR. 2016. SNP-sites: rapid efficient extraction of SNPs from multi-FASTA alignments. Microb Genomics 2.

Palmer KL, Gilmore MS. 2010. Multidrug-Resistant Enterococci Lack CRISPR-cas. MBio 1: e00227-10.

Palmer KL, Kos VN, Gilmore MS. 2010. Horizontal gene transfer and the genomics of enterococcal antibiotic resistance. Curr Opin Microbiol 13: 632–9.

Palmer KL, van Schaik W, Willems RJL, Gilmore MS. 2014. Enterococcal Genomics. Massachusetts Eye and Ear Infirmary.

Paulsen IT, Banerjei L, Myers GSA, Nelson KE, Seshadri R, Read TD, et al. 2003. Role of mobile DNA in the evolution of vancomycin-resistant Enterococcus faecalis. Science 299: 2071–2074.

Price MN, Dehal PS, Arkin AP. 2009. FastTree: Computing Large Minimum Evolution Trees with Profiles instead of a Distance Matrix. Mol Biol Evol 26: 1641–1650.

R Core Team. 2016. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing.

Ratner HK, Sampson TR, Weiss DS. 2016. Overview of CRISPR–Cas9 Biology. Cold Spring Harb Protoc 2016: pdb.top088849.

Ronald A. 2003. The etiology of urinary tract infection: Traditional and emerging pathogens. Disease-a-Month 49: 71–82.

Seemann T. 2014. Prokka: rapid prokaryotic genome annotation. Bioinformatics 30: 2068–9.

Sivertsen A, Pedersen T, Larssen KW, Bergh K, Rønning TG, Radtke A, Hegstad K. 2016. Silenced vanA gene cluster on a transferable plasmid cause outbreak of vancomycin variable enterococci. Antimicrob Agents Chemother 60: AAC.00286-16.

Snipen L, Ussery DW. 2010. Standard operating procedure for computing pangenome trees. Stand Genomic Sci 2: 135–141.

Soheili S, Ghafourian S, Sekawi Z, Neela V, Sadeghifard N, Ramli R, Hamat RA. 2014. Wide distribution of virulence genes among Enterococcus faecium and Enterococcus faecalis clinical isolates. ScientificWorldJournal 2014: 623174.

Solheim M, Brekke MC, Snipen LG, Willems RJL, Nes IF, Brede DA. 2011. Comparative genomic analysis reveals significant enrichment of mobile genetic elements and genes encoding surface structure-proteins in hospital-associated clonal complex 2 Enterococcus faecalis. BMC Microbiol 11: 3.

Van Hal SJ, Ip CLC, Ansari MA, Wilson DJ, Espedido BA, Jensen SO, Bowden R. 2016. Evolutionary dynamics of Enterococcus faecium reveals complex genomic relationships between isolates with independent emergence of vancomycin resistance. Microb Genomics 2.

Figure 1: Maximum-likelihood phylogenetic trees of 447 E. faecalis genomes and 407 E. faecium genomes. We detected seven phylogenetic clades in E. faecalis (left); clades 1 and 7 are homogeneous, consist almost exclusively of human strains and are closer to ancestral strains, while clades 2, 3, 4, 5 and 6 are mixtures of human and non-human strains and are distant from ancestral strains. In E. faecium (right), nine phylogenetic clades were detected, all containing a combination of human and non-human strains. Both phylogenetic trees were generated from an SNP alignment of the whole genome. Monomorphic sites were trimmed from the alignment, and a first phylogenetic tree was produced using FastTree with the generalised time-reversible (GTR) model. These trees were used as starting trees for the ClonalFrameML analysis, from which the final midpoint-rooted clonal phylogenetic trees were generated.
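The trimming of monomorphic sites mentioned in the caption of Figure 1 can be sketched as follows. This is a minimal illustration, not the tool used in the analysis: it assumes an alignment held as a dictionary of equal-length sequences, treats every character (including gaps or ambiguous bases) as an ordinary state, and uses hypothetical strain names and sequences.

```python
def variable_sites(aligned_sequences):
    """Keep only alignment columns in which at least two sequences differ."""
    names = list(aligned_sequences)
    seqs = [aligned_sequences[name] for name in names]
    if not seqs:
        return {}
    length = len(seqs[0])
    if any(len(seq) != length for seq in seqs):
        raise ValueError("all sequences must have the same aligned length")
    # A column is monomorphic when every sequence carries the same character.
    keep = [i for i in range(length) if len({seq[i] for seq in seqs}) > 1]
    return {name: "".join(seq[i] for i in keep)
            for name, seq in zip(names, seqs)}


# Hypothetical three-strain alignment with two variable columns.
alignment = {
    "strain_a": "ACGTAC",
    "strain_b": "ACGTTC",
    "strain_c": "ACCTAC",
}
snps = variable_sites(alignment)
print(snps)
```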

Figure 2: Clonal phylogeny and inferred recombination events. Midpoint-rooted clonal phylogeny (left) of 447 E. faecalis isolates (A) and 407 E. faecium isolates (B). Recombination events (right) were estimated as described in Methods. The sizes and genomic locations of recombination fragments (dark-blue line segments) are aligned with the branches of the phylogeny along which they occurred.


Figure 3: Network visualisation of E. faecalis (A) and E. faecium (B), indicating the interaction between human and non-human strains. The plots were generated with the R package igraph (Csárdi and Nepusz, 2006) using the gene cluster data produced by the pan-genome analysis. The genomic distance was computed using the Manhattan method. In E. faecalis, non-human (animal) strains are mostly located at the periphery (edges) of the network, which suggests that animals might be contaminated by human waste, with minimal zoonotic transmission. In E. faecium, non-human strains are mostly located inside the network, with high aggregation with human strains and evidence of direct interaction between human and non-human strains, which suggests zoonotic transmission of E. faecium. Circles in both species indicate emerging clusters.


Figure 4: Heat map of antimicrobial resistance orthologous genes detected in E. faecalis (A) and E. faecium (B), plotted against a maximum-likelihood phylogenetic tree. The figure shows the phylogenetic tree (left); the distribution of restriction enzyme/endonuclease genes and antimicrobial resistance genes, including CRISPR-cas proteins (centre), where blue to pink ticks show the presence of an orthologous gene in the corresponding genome on the phylogenetic tree and empty space indicates its absence; and red bars (right) showing CRISPR spacers, with a scale indicating the presence and the number of spacers identified. Genes shown in the top horizontal bars are as follows. Restriction enzymes/endonucleases/anti-endonucleases: 5-methylcytosine-specific restriction enzyme B (mcrB), anti-restriction protein (ardA), restriction endonuclease (CfrBI), restriction endonuclease (Eco29kI), restriction-modification methylase (Eco57I), EcoKI restriction-modification system protein (HsdS), 5-methylcytosine restriction system component (McrBC), restriction system protein (Mrr), restriction endonuclease (NgoBV), restriction endonuclease (NgoFVII), putative type-1 restriction enzyme specificity protein (MG438), putative type-1 restriction enzyme specificity protein (MPN_089), restriction endonuclease (SinI), Type I restriction enzyme EcoKI M protein (hsdM), Type I restriction enzyme EcoKI R protein (hsdR), Type I restriction enzyme (EcoR124II), Type II restriction endonuclease (RE_Alw26IDE), Type III restriction enzyme (Type III), Type-2 restriction enzyme (Sau3AI). Antimicrobial resistance genes: bmrA (multidrug resistance ABC transporter), qacA (multidrug efflux protein), msbA (lipid ABC transporter permease/ATPase, multidrug resistance ABC transporter), ble (bleomycin resistance gene), emrY (multidrug resistance protein Y), bcr (bicyclomycin/multidrug efflux system), yheH, yheI (multidrug resistance ABC transporter), fosB (fosfomycin resistance gene B), fosX (fosfomycin resistance gene X), tetA, tetC, tetD, tetM, tetR (tetracycline resistance genes, classes A, C, D, M and R), vanA (vancomycin resistance gene A), vanH, vanT, vanX, vanW, vanY, vanZ (vancomycin resistance A associated regulatory genes), vanB (vancomycin resistance gene B), vanXB, vanYB (vancomycin resistance B associated regulatory genes), vanC (vancomycin resistance gene C), linA (lincosamide B resistance gene), cmlA (chloramphenicol efflux protein). CRISPR-associated protein genes: cas1, cas2, cas9, cas9-1.

Supplementary data:

Table I: List of E. faecalis genomes analysed

Table II: List of E. faecium genomes analysed

Table III: Distribution of pan-genome components in both E. faecalis and E. faecium

Pan-genome component      E. faecalis                          E. faecium
                          No. of genomes  Size (no. of genes)  No. of genomes  Size (no. of genes)
Total orthologous genes   -               22,424               -               24,004
Hard-core genes           447             1,231                407             859
Soft-core genes           424 - 442       401                  386 - 402       295
Shell genes               67 - 423        2,142                61 - 385        2,505
Cloud genes               0 - 65          18,650               0 - 60          20,345

Supplementary Figure 1: Distribution of E. faecalis and E. faecium based on the biological and the geographical source of isolation.

Supplementary Figure 2: Pan-genome plots showing the evolution of total orthologous genes and conserved genes in E. faecalis (A) and E. faecium (B). The horizontal axis (x-axis) indicates the number of genomes added to the pan-genome and the vertical axis shows the number of genes detected in the pan-genome. Dotted curves indicate the total number of genes (pan-genes); solid curves indicate conserved genes (core genes).

Supplementary Figure 3: Parsimony pan-genome phylogenetic trees. (A) Tree generated from 447 genomes of E. faecalis, showing eight phylogenetic groups. G8 shows clonal clusters of strains, all isolated in the US within the same period and the same geographical area. The remaining clades contain a mixture of human and non-human strains. (B) Tree generated from 407 genomes of E. faecium, showing 12 phylogenetic groups. All clades contain mixtures of human and non-human strains.

Supplementary data:

Table I: List of E.faecalis genomes analysed

Size Accessions Organism/Name CladeID (Mb) GC% number Scaffolds Genes Proteins Level Enterococcus faecalis V583 20137 3.35997 37.3546 NC_004668.1 4 3412 3264 Complete Genome Enterococcus faecalis OG1RF 20137 2.73963 37.8 NC_017316.1 1 2636 2548 Complete Genome Enterococcus faecalis 62 - 3.13082 37.3617 CP002491.1 5 3157 3075 Complete Genome Enterococcus faecalis D32 20137 3.0625 37.4365 NC_018221.1 3 3082 2934 Complete Genome Enterococcus faecalis str. Symbioflor 1 20137 2.81067 37.7 NC_019770.1 1 2805 2686 Complete Genome Enterococcus faecalis DENG1 20137 2.96104 37.5 NZ_CP004081.1 1 2960 2838 Complete Genome Enterococcus faecalis ATCC 29212 20137 3.04813 37.3559 NZ_CP008816.1 3 3038 2876 Complete Genome Enterococcus faecalis 20137 2.80343 37.6 NZ_CP014949.1 1 2781 2643 Complete Genome

Enterococcus faecalis TX0102 | 20137 | 2.87153 | 37.4 | AEBD01 | 40 | 2788 | 2710 | Scaffold
Enterococcus faecalis TX0630 | 20137 | 3.22833 | 37 | AEBE01 | 141 | 3295 | 3170 | Scaffold
Enterococcus faecalis TX0031 | 20137 | 2.8204 | 37.5 | AEBF01 | 32 | 2733 | 2653 | Scaffold
Enterococcus faecalis TX4244 | 20137 | 2.92536 | 37.3 | AEBH01 | 53 | 2869 | 2785 | Scaffold
Enterococcus faecalis TX1346 | 20137 | 2.7829 | 37.6 | AEBI01 | 91 | 2748 | 2633 | Scaffold
Enterococcus faecalis TX1342 | 20137 | 2.83654 | 37.5 | AEBJ01 | 27 | 2747 | 2664 | Scaffold
Enterococcus faecalis TX1302 | 20137 | 2.88376 | 37.5 | AEBK01 | 32 | 2811 | 2734 | Scaffold
Enterococcus faecalis TX0043 | 20137 | 2.81205 | 37.5 | AEBL01 | 35 | 2792 | 2716 | Scaffold
Enterococcus faecalis TX0027 | 20137 | 3.06166 | 37.2 | AEBM01 | 59 | 3067 | 2977 | Scaffold
Enterococcus faecalis TX0309A | 20137 | 3.11221 | 37.2 | AEBN01 | 56 | 3153 | 3054 | Scaffold
Enterococcus faecalis TX0309B | 20137 | 3.10797 | 37.2 | AEBO01 | 69 | 3156 | 3047 | Scaffold
Enterococcus faecalis TX2137 | 20137 | 2.96688 | 37.2 | AEBQ01 | 106 | 2913 | 2812 | Scaffold
Enterococcus faecalis TX0017 | 20137 | 2.99707 | 37.4 | AEBP01 | 69 | 3000 | 2889 | Scaffold
Enterococcus faecalis TX4248 | 20137 | 3.18702 | 37.1 | AEBR01 | 85 | 3208 | 3101 | Scaffold
Enterococcus faecalis DAPTO 516 | 20137 | 3.0558 | 37.3 | AEBS01 | 79 | 3049 | 2958 | Scaffold
Enterococcus faecalis DAPTO 512 | 20137 | 3.05442 | 37.3 | AEBT01 | 74 | 3045 | 2952 | Scaffold
Enterococcus faecalis TX0855 | 20137 | 2.98732 | 37.2 | AEBV01 | 65 | 2943 | 2836 | Scaffold
Enterococcus faecalis TX2134 | 20137 | 3.12143 | 37.1 | AEBW01 | 78 | 3136 | 3018 | Scaffold
Enterococcus faecalis TX0860 | 20137 | 3.06177 | 37.2 | AEBX01 | 72 | 3034 | 2939 | Scaffold
Enterococcus faecalis TX0109 | 20137 | 2.9673 | 37.4 | AEBY01 | 78 | 2927 | 2826 | Scaffold
Enterococcus faecalis EnGen0311 | 20137 | 3.16869 | 37 | AEBZ01 | 86 | 3170 | 3060 | Scaffold
Enterococcus faecalis TX2141 | 20137 | 2.90311 | 37.4 | AECG01 | 77 | 2934 | 2817 | Scaffold
Enterococcus faecalis TX0411 | 20137 | 3.12459 | 37.2 | AECA01 | 75 | 3174 | 3057 | Scaffold
Enterococcus faecalis TX0645 | 20137 | 3.17461 | 37.1 | AECE01 | 90 | 3194 | 3067 | Scaffold
Enterococcus faecalis TX1341 | 20137 | 2.99999 | 37.2 | AECF01 | 38 | 2972 | 2892 | Scaffold
Enterococcus faecalis TX0012 | 20137 | 2.81853 | 37.4 | AECD01 | 36 | 2720 | 2636 | Scaffold
Enterococcus faecalis TX0470 | 20137 | 2.87879 | 37.3 | AECC01 | 42 | 2837 | 2748 | Scaffold
Enterococcus faecalis TX0312 | 20137 | 2.78566 | 37.6 | AECB01 | 42 | 2735 | 2650 | Scaffold
Enterococcus faecalis TX4000 | 20137 | 2.8403 | 37.5 | AEBB01 | 55 | 2805 | 2710 | Scaffold
Enterococcus faecalis T1 | 20137 | 2.95069 | 37.7 | ACAD01 | 16 | 2871 | 2765 | Scaffold
Enterococcus faecalis T2 | 20137 | 3.26383 | 37.2 | ACAE01 | 22 | 3224 | 3106 | Scaffold
Enterococcus faecalis T3 | 20137 | 2.79121 | 37.6 | ACAF01 | 10 | 2705 | 2614 | Scaffold
Enterococcus faecalis ATCC 4200 | 20137 | 3.03617 | 37.5 | ACAG01 | 14 | 2986 | 2856 | Scaffold
Enterococcus faecalis DS5 | 20137 | 3.18642 | 37.3 | ACAI01 | 43 | 3151 | 3021 | Scaffold
Enterococcus faecalis ARO1/DG | 20137 | 2.84994 | 37.7 | ACAK01 | 13 | 2777 | 2680 | Scaffold
Enterococcus faecalis Merz96 | 20137 | 3.08525 | 37.7 | ACAM01 | 21 | 2996 | 2892 | Scaffold
Enterococcus faecalis HIP11704 | 20137 | 3.20299 | 37.3 | ACAN01 | 38 | 3186 | 3050 | Scaffold
Enterococcus faecalis JH1 | 20137 | 3.0428 | 37.4 | ACAP01 | 24 | 2966 | 2868 | Scaffold
Enterococcus faecalis E1Sol | 20137 | 2.8819 | 37.6 | ACAQ01 | 14 | 2842 | 2720 | Scaffold
Enterococcus faecalis Fly1 | 20137 | 2.83429 | 37.5 | ACAR01 | 12 | 2703 | 2593 | Scaffold
Enterococcus faecalis D6 | 20137 | 2.90625 | 37.6 | ACAT01 | 11 | 2835 | 2748 | Scaffold
Enterococcus faecalis T11 | 20137 | 2.74872 | 37.7 | ACAU01 | 13 | 2642 | 2575 | Scaffold
Enterococcus faecalis CH188 | 20137 | 3.22035 | 37.3 | ACAV01 | 27 | 3177 | 3066 | Scaffold
Enterococcus faecalis X98 | 20137 | 2.9427 | 37.5 | ACAW01 | 13 | 2951 | 2831 | Scaffold
Enterococcus faecalis TX0104 | 20137 | 3.15648 | 37.4 | ACGL01 | 95 | 3133 | 2974 | Scaffold
Enterococcus faecalis TX1322 | 20137 | 2.97429 | 37.5 | ACGM01 | 40 | 2924 | 2797 | Scaffold
Enterococcus faecalis ATCC 29200 | 20137 | 2.97337 | 37.6 | ACHK01 | 64 | 2900 | 2802 | Scaffold
Enterococcus faecalis EnGen0297 | 20137 | 3.12993 | 37.4 | ACIX01 | 103 | 3098 | 2875 | Scaffold
Enterococcus faecalis T8 | 20137 | 3.03739 | 37.5 | ACOC01 | 24 | 2998 | 2856 | Scaffold
Enterococcus faecalis S613 | 20137 | 3.04735 | 37.3 | ADDP01 | 105 | 3037 | 2936 | Scaffold
Enterococcus faecalis R712 | 20137 | 3.03954 | 37.3 | ADDQ01 | 105 | 3033 | 2936 | Scaffold
Enterococcus faecalis 599 | 20137 | 3.04841 | 37.2 | ALZI01 | 106 | 3026 | 2900 | Scaffold
Enterococcus faecalis ERV103 | 20137 | 3.06482 | 37.4 | ALZJ01 | 56 | 3044 | 2959 | Scaffold
Enterococcus faecalis ERV116 | 20137 | 3.06975 | 37.4 | ALZK01 | 54 | 3037 | 2963 | Scaffold
Enterococcus faecalis ERV129 | 20137 | 3.15795 | 37.3 | ALZL01 | 81 | 3149 | 3039 | Scaffold
Enterococcus faecalis ERV25 | 20137 | 3.08736 | 37.3 | ALZM01 | 93 | 3082 | 2974 | Scaffold
Enterococcus faecalis ERV31 | 20137 | 3.05317 | 37.3 | ALZN01 | 83 | 3025 | 2940 | Scaffold
Enterococcus faecalis ERV37 | 20137 | 3.12022 | 37.3 | ALZO01 | 126 | 3134 | 3020 | Scaffold
Enterococcus faecalis ERV41 | 20137 | 3.15213 | 37.3 | ALZP01 | 129 | 3185 | 3067 | Scaffold
Enterococcus faecalis ERV62 | 20137 | 3.11751 | 37.3 | ALZQ01 | 76 | 3128 | 3025 | Scaffold
Enterococcus faecalis ERV63 | 20137 | 3.16512 | 37.3 | ALZR01 | 94 | 3197 | 3081 | Scaffold
Enterococcus faecalis ERV65 | 20137 | 3.09327 | 37.3 | ALZS01 | 102 | 3100 | 2998 | Scaffold
Enterococcus faecalis ERV68 | 20137 | 3.13475 | 37.3 | ALZT01 | 100 | 3157 | 3046 | Scaffold
Enterococcus faecalis ERV72 | 20137 | 3.13714 | 37.3 | ALZU01 | 94 | 3152 | 3053 | Scaffold
Enterococcus faecalis ERV73 | 20137 | 3.08542 | 37.4 | ALZV01 | 125 | 3081 | 2933 | Scaffold
Enterococcus faecalis ERV81 | 20137 | 3.14055 | 37.3 | ALZW01 | 93 | 3159 | 3032 | Scaffold
Enterococcus faecalis ERV85 | 20137 | 3.12947 | 37.3 | ALZX01 | 121 | 3136 | 3008 | Scaffold
Enterococcus faecalis ERV93 | 20137 | 3.09818 | 37.3 | ALZY01 | 103 | 3101 | 2999 | Scaffold
Enterococcus faecalis R508 | 20137 | 2.79303 | 37.4 | ALZZ01 | 66 | 2702 | 2617 | Scaffold
Enterococcus faecalis EnGen0065 | 20137 | 3.33063 | 37.3 | AIIK01 | 33 | 3400 | 3258 | Scaffold
Enterococcus faecalis EnGen0062 | 20137 | 3.00378 | 37.4 | AIIL01 | 3 | 2987 | 2832 | Scaffold
Enterococcus faecalis EnGen0061 | 20137 | 3.34833 | 37.1 | AIIM01 | 33 | 3440 | 3270 | Scaffold
Enterococcus faecalis EnGen0064 | 20137 | 3.4091 | 37.5 | AIIN01 | 14 | 3447 | 3277 | Scaffold
Enterococcus faecalis EnGen0066 | 20137 | 2.89236 | 37.6 | AIIO01 | 4 | 2884 | 2759 | Scaffold
Enterococcus faecalis EnGen0063 | 20137 | 3.20972 | 37.2 | AIIP01 | 10 | 3145 | 2973 | Scaffold
Enterococcus faecalis EnGen0059 | 20137 | 3.20032 | 37.2 | AIIQ01 | 11 | 3128 | 2976 | Scaffold
Enterococcus faecalis EnGen0076 | 20137 | 2.90365 | 37.4 | AIIR01 | 6 | 2851 | 2725 | Scaffold
Enterococcus faecalis EnGen0074 | 20137 | 3.02481 | 37.5 | AIIS01 | 14 | 2954 | 2834 | Scaffold
Enterococcus faecalis EnGen0075 | 20137 | 2.95449 | 37.4 | AIIT01 | 5 | 2932 | 2828 | Scaffold
Enterococcus faecalis EnGen0058 | 20137 | 3.06318 | 37.5 | AIIU01 | 17 | 3003 | 2877 | Scaffold
Enterococcus faecalis EnGen0073 | 20137 | 2.97081 | 37.5 | AIIV01 | 11 | 2890 | 2793 | Scaffold
Enterococcus faecalis EnGen0078 | 20137 | 2.97336 | 37.4 | AIIW01 | 10 | 2953 | 2832 | Scaffold
Enterococcus faecalis EnGen0080 | 20137 | 3.07348 | 37.2 | AIIX01 | 19 | 3037 | 2924 | Scaffold
Enterococcus faecalis EnGen0079 | 20137 | 3.08757 | 37.4 | AIIY01 | 14 | 3100 | 2965 | Scaffold
Enterococcus faecalis EnGen0060 | 20137 | 2.92276 | 37.4 | AIIZ01 | 4 | 2911 | 2798 | Scaffold
Enterococcus faecalis EnGen0081 | 20137 | 2.99289 | 37.3 | AIJA01 | 8 | 2974 | 2855 | Scaffold
Enterococcus faecalis EnGen0082 | 20137 | 2.91098 | 37.5 | AIJC01 | 7 | 2849 | 2715 | Scaffold
Enterococcus faecalis EnGen0083 | 20137 | 2.90455 | 37.4 | AIJD01 | 6 | 2866 | 2760 | Scaffold
Enterococcus faecalis EnGen0084 | 20137 | 2.79156 | 37.7 | AIJE01 | 4 | 2692 | 2577 | Scaffold
Enterococcus faecalis EnGen0071 | 20137 | 2.92486 | 37.6 | AIJF01 | 6 | 2892 | 2747 | Scaffold
Enterococcus faecalis EnGen0072 | 20137 | 2.91215 | 37.4 | AIJG01 | 7 | 2873 | 2763 | Scaffold
Enterococcus faecalis EnGen0067 | 20137 | 3.33615 | 37.1 | AIJH01 | 39 | 3446 | 3290 | Scaffold
Enterococcus faecalis EnGen0068 | 20137 | 3.26917 | 37.2 | AIJI01 | 12 | 3348 | 3208 | Scaffold
Enterococcus faecalis EnGen0069 | 20137 | 3.32938 | 37.1 | AIJJ01 | 32 | 3409 | 3270 | Scaffold
Enterococcus faecalis EnGen0070 | 20137 | 3.2731 | 37.2 | AIJK01 | 23 | 3361 | 3215 | Scaffold
Enterococcus faecalis EnGen0106 | 20137 | 3.27642 | 37.2 | AIPV01 | 22 | 3358 | 3215 | Scaffold
Enterococcus faecalis EnGen0088 | 20137 | 3.33964 | 37.2 | AIPW01 | 7 | 3402 | 3268 | Scaffold
Enterococcus faecalis EnGen0120 | 20137 | 3.22636 | 37.1 | AIPX01 | 37 | 3318 | 3185 | Scaffold
Enterococcus faecalis EnGen0089 | 20137 | 3.30156 | 37.2 | AIPY01 | 4 | 3358 | 3226 | Scaffold
Enterococcus faecalis EnGen0090 | 20137 | 3.27377 | 37.1 | AIPZ01 | 15 | 3347 | 3215 | Scaffold
Enterococcus faecalis EnGen0109 | 20137 | 3.3224 | 37.1 | AIQA01 | 53 | 3392 | 3262 | Scaffold
Enterococcus faecalis EnGen0110 | 20137 | 3.2934 | 37.1 | AIQB01 | 41 | 3371 | 3225 | Scaffold
Enterococcus faecalis EnGen0091 | 20137 | 3.34189 | 37.2 | AIQC01 | 9 | 3403 | 3254 | Scaffold
Enterococcus faecalis EnGen0092 | 20137 | 3.32362 | 37.2 | AIQD01 | 24 | 3393 | 3250 | Scaffold
Enterococcus faecalis EnGen0085 | 20137 | 3.32328 | 37.1 | AIQE01 | 32 | 3392 | 3279 | Scaffold
Enterococcus faecalis EnGen0111 | 20137 | 3.29857 | 37.2 | AIQF01 | 36 | 3366 | 3227 | Scaffold
Enterococcus faecalis EnGen0119 | 20137 | 3.32198 | 37.1 | AIQH01 | 19 | 3381 | 3265 | Scaffold
Enterococcus faecalis EnGen0093 | 20137 | 2.92373 | 37.2 | AIQL01 | 14 | 2861 | 2777 | Scaffold
Enterococcus faecalis EnGen0094 | 20137 | 3.32412 | 37.1 | AIQN01 | 12 | 3398 | 3268 | Scaffold
Enterococcus faecalis EnGen0095 | 20137 | 3.3235 | 37.1 | AIQP01 | 9 | 3410 | 3266 | Scaffold
Enterococcus faecalis EnGen0096 | 20137 | 3.34585 | 37.1 | AIQQ01 | 11 | 3425 | 3288 | Scaffold
Enterococcus faecalis EnGen0097 | 20137 | 3.40671 | 37.1 | AIQR01 | 18 | 3482 | 3337 | Scaffold
Enterococcus faecalis EnGen0112 | 20137 | 3.37676 | 37.1 | AIQS01 | 27 | 3445 | 3294 | Scaffold
Enterococcus faecalis EnGen0098 | 20137 | 3.35529 | 37.2 | AIQT01 | 9 | 3409 | 3268 | Scaffold
Enterococcus faecalis EnGen0099 | 20137 | 3.26317 | 37.1 | AIRC01 | 39 | 3345 | 3219 | Scaffold
Enterococcus faecalis EnGen0113 | 20137 | 3.23366 | 37.2 | AIRD01 | 44 | 3306 | 3148 | Scaffold
Enterococcus faecalis EnGen0114 | 20137 | 3.31386 | 37.1 | AIRE01 | 22 | 3397 | 3271 | Scaffold
Enterococcus faecalis EnGen0100 | 20137 | 3.31711 | 37.1 | AIRG01 | 22 | 3394 | 3264 | Scaffold
Enterococcus faecalis EnGen0107 | 20137 | 3.24374 | 37.1 | AIRY01 | 26 | 3321 | 3204 | Scaffold
Enterococcus faecalis EnGen0087 | 20137 | 3.33375 | 37.1 | AIRZ01 | 30 | 3423 | 3303 | Scaffold
Enterococcus faecalis EnGen0108 | 20137 | 3.31778 | 37.1 | AISA01 | 37 | 3404 | 3268 | Scaffold
Enterococcus faecalis EnGen0086 | 20137 | 2.96976 | 37.4 | AISB01 | 10 | 2934 | 2790 | Scaffold
Enterococcus faecalis EnGen0115 | 20137 | 3.07705 | 37.5 | AISC01 | 10 | 3106 | 2979 | Scaffold
Enterococcus faecalis EnGen0101 | 20137 | 3.36582 | 37.3 | AISD01 | 26 | 3384 | 3224 | Scaffold
Enterococcus faecalis EnGen0102 | 20137 | 2.90492 | 37.6 | AISE01 | 9 | 2841 | 2726 | Scaffold
Enterococcus faecalis EnGen0103 | 20137 | 2.91788 | 37.6 | AISF01 | 9 | 2850 | 2729 | Scaffold
Enterococcus faecalis EnGen0104 | 20137 | 2.90183 | 37.6 | AISG01 | 8 | 2839 | 2732 | Scaffold
Enterococcus faecalis EnGen0105 | 20137 | 2.90435 | 37.6 | AISH01 | 9 | 2840 | 2724 | Scaffold
Enterococcus faecalis EnGen0116 | 20137 | 2.87269 | 37.6 | AISI01 | 8 | 2798 | 2701 | Scaffold
Enterococcus faecalis EnGen0117 | 20137 | 2.97398 | 37.4 | AISJ01 | 21 | 2923 | 2817 | Scaffold
Enterococcus faecalis EnGen0118 | 20137 | 2.95737 | 37.5 | AISK01 | 10 | 2911 | 2794 | Scaffold
Enterococcus faecalis EnGen0332 | 20137 | 3.3 | 37.2 | ASEN01 | 28 | 3317 | 3189 | Scaffold
Enterococcus faecalis EnGen0341 | 20137 | 3.24386 | 37.5 | ASCU01 | 28 | 3258 | 3112 | Scaffold
Enterococcus faecalis EnGen0366 | 20137 | 3.27428 | 37.4 | ASCV01 | 8 | 3367 | 3226 | Scaffold
Enterococcus faecalis EnGen0344 | 20137 | 3.00786 | 37.3 | ASCW01 | 4 | 2953 | 2852 | Scaffold
Enterococcus faecalis EnGen0361 | 20137 | 3.16517 | 37.4 | ASCX01 | 17 | 3204 | 3078 | Scaffold
Enterococcus faecalis EnGen0345 | 20137 | 3.10989 | 37.5 | ASCY01 | 14 | 3055 | 2938 | Scaffold
Enterococcus faecalis EnGen0346 | 20137 | 2.72323 | 37.7 | ASCZ01 | 4 | 2617 | 2525 | Scaffold
Enterococcus faecalis ATCC 19433 | 20137 | 2.8814 | 37.7 | ASDA01 | 3 | 2896 | 2754 | Scaffold
Enterococcus faecalis EnGen0362 | 20137 | 3.00451 | 37.2 | ASDB01 | 15 | 2959 | 2840 | Scaffold
Enterococcus faecalis EnGen0348 | 20137 | 3.09011 | 37.4 | ASDC01 | 15 | 3111 | 2959 | Scaffold
Enterococcus faecalis EnGen0363 | 20137 | 2.94664 | 37.4 | ASDD01 | 17 | 2847 | 2740 | Scaffold
Enterococcus faecalis ATCC 35038 | 20137 | 2.97624 | 37.5 | ASDE01 | 10 | 2943 | 2820 | Scaffold
Enterococcus faecalis EnGen0364 | 20137 | 2.9899 | 37.4 | ASDF01 | 13 | 2883 | 2758 | Scaffold
Enterococcus faecalis EnGen0350 | 20137 | 2.99809 | 37.6 | ASDG01 | 5 | 2966 | 2828 | Scaffold
Enterococcus faecalis EnGen0336 | 20137 | 2.87038 | 37.5 | ASDH01 | 3 | 2802 | 2707 | Scaffold
Enterococcus faecalis EnGen0351 | 20137 | 2.99194 | 37.5 | ASDI01 | 3 | 2967 | 2829 | Scaffold
Enterococcus faecalis EnGen0352 | 20137 | 2.78339 | 37.7 | ASDJ01 | 4 | 2694 | 2543 | Scaffold
Enterococcus faecalis EnGen0337 | 20137 | 2.92663 | 37.5 | ASDK01 | 5 | 2877 | 2769 | Scaffold
Enterococcus faecalis EnGen0365 | 20137 | 3.08631 | 37.6 | ASDL01 | 11 | 3036 | 2875 | Scaffold
Enterococcus faecalis EnGen0342 | 20137 | 3.00797 | 37.5 | ASDM01 | 9 | 2961 | 2823 | Scaffold
Enterococcus faecalis EnGen0354 | 20137 | 3.16594 | 37.3 | ASDN01 | 22 | 3223 | 3089 | Scaffold
Enterococcus faecalis EnGen0355 | 20137 | 3.03417 | 37.4 | ASDO01 | 9 | 3051 | 2911 | Scaffold
Enterococcus faecalis EnGen0369 | 20137 | 3.12208 | 37.5 | ASDP01 | 25 | 3175 | 2993 | Scaffold
Enterococcus faecalis EnGen0356 | 20137 | 2.96671 | 37.4 | ASDQ01 | 7 | 2877 | 2764 | Scaffold
Enterococcus faecalis EnGen0357 | 20137 | 2.79183 | 37.6 | ASDR01 | 8 | 2688 | 2593 | Scaffold
Enterococcus faecalis EnGen0358 | 20137 | 3.06395 | 37.3 | ASDS01 | 20 | 3033 | 2913 | Scaffold
Enterococcus faecalis EnGen0370 | 20137 | 3.12235 | 37.4 | ASDT01 | 13 | 3120 | 2964 | Scaffold
Enterococcus faecalis EnGen0368 | 20137 | 2.99652 | 37.7 | ASDU01 | 7 | 2932 | 2791 | Scaffold
Enterococcus faecalis EnGen0359 | 20137 | 3.34154 | 37.1 | ASDV01 | 12 | 3366 | 3220 | Scaffold
Enterococcus faecalis EnGen0360 | 20137 | 2.9269 | 37.4 | ASDW01 | 7 | 2912 | 2803 | Scaffold
Enterococcus faecalis EnGen0340 | 20137 | 2.83322 | 37.4 | ASDX01 | 5 | 2775 | 2677 | Scaffold
Enterococcus faecalis EnGen0367 | 20137 | 2.89225 | 37.6 | ASDY01 | 4 | 2883 | 2728 | Scaffold
Enterococcus faecalis ATCC 6055 | 20137 | 3.22746 | 37.2 | ASDZ01 | 31 | 3266 | 3125 | Scaffold
Enterococcus faecalis ATCC 10100 | 20137 | 3.00495 | 37.4 | ASEA01 | 6 | 2949 | 2845 | Scaffold
Enterococcus faecalis EnGen0338 | 20137 | 3.21273 | 37.3 | ASED01 | 32 | 3227 | 3083 | Scaffold
Enterococcus faecalis EnGen0339 | 20137 | 3.26989 | 37.3 | ASEE01 | 31 | 3338 | 3198 | Scaffold
Enterococcus faecalis EnGen0327 | 20137 | 3.08627 | 37.5 | ASEF01 | 8 | 3066 | 2954 | Scaffold
Enterococcus faecalis EnGen0331 | 20137 | 3.04414 | 37.3 | ASEG01 | 27 | 3032 | 2920 | Scaffold
Enterococcus faecalis EnGen0329 | 20137 | 3.04724 | 37.4 | ASEH01 | 17 | 3030 | 2929 | Scaffold
Enterococcus faecalis EnGen0326 | 20137 | 2.97125 | 37.6 | ASEI01 | 8 | 2921 | 2817 | Scaffold
Enterococcus faecalis EnGen0334 | 20137 | 2.96343 | 37.4 | ASEJ01 | 5 | 2936 | 2837 | Scaffold
Enterococcus faecalis EnGen0333 | 20137 | 2.91268 | 37.6 | ASEK01 | 5 | 2893 | 2775 | Scaffold
Enterococcus faecalis EnGen0328 | 20137 | 2.95516 | 37.6 | ASEL01 | 5 | 2977 | 2838 | Scaffold
Enterococcus faecalis EnGen0330 | 20137 | 2.97256 | 37.3 | ASEM01 | 13 | 2977 | 2840 | Scaffold
Enterococcus faecalis EnGen0234 | 20137 | 3.02121 | 37.5 | AJAC01 | 19 | 3006 | 2875 | Scaffold
Enterococcus faecalis EnGen0235 | 20137 | 3.37502 | 37.1 | AJAG01 | 33 | 3419 | 3266 | Scaffold
Enterococcus faecalis EnGen0237 | 20137 | 3.14517 | 37.3 | AJAW01 | 29 | 3182 | 3054 | Scaffold
Enterococcus faecalis EnGen0238 | 20137 | 3.19631 | 37.5 | AJAX01 | 27 | 3217 | 3112 | Scaffold
Enterococcus faecalis EnGen0239 | 20137 | 3.2006 | 37.6 | AJAY01 | 13 | 3260 | 3123 | Scaffold
Enterococcus faecalis EnGen0240 | 20137 | 2.95877 | 37.6 | AJAZ01 | 6 | 2923 | 2797 | Scaffold
Enterococcus faecalis EnGen0241 | 20137 | 2.89056 | 37.6 | AJBA01 | 4 | 2863 | 2743 | Scaffold
Enterococcus faecalis EnGen0242 | 20137 | 3.32966 | 37.5 | AJBB01 | 25 | 3409 | 3251 | Scaffold
Enterococcus faecalis EnGen0243 | 20137 | 3.19702 | 37.2 | AJBC01 | 45 | 3241 | 3128 | Scaffold
Enterococcus faecalis EnGen0244 | 20137 | 3.22598 | 37.3 | AJBD01 | 11 | 3230 | 3090 | Scaffold
Enterococcus faecalis EnGen0245 | 20137 | 3.08671 | 37.3 | AJBE01 | 8 | 3051 | 2916 | Scaffold
Enterococcus faecalis EnGen0246 | 20137 | 3.18699 | 37.4 | AJBF01 | 10 | 3184 | 3044 | Scaffold
Enterococcus faecalis EnGen0247 | 20137 | 3.28481 | 37.2 | AJBG01 | 34 | 3318 | 3146 | Scaffold
Enterococcus faecalis EnGen0248 | 20137 | 3.1999 | 37.3 | AJBH01 | 21 | 3215 | 3082 | Scaffold
Enterococcus faecalis EnGen0252 | 20137 | 3.22673 | 37.3 | AJBI01 | 20 | 3238 | 3119 | Scaffold
Enterococcus faecalis EnGen0251 | 20137 | 3.28353 | 37.1 | AJBJ01 | 30 | 3313 | 3148 | Scaffold
Enterococcus faecalis EnGen0231 | 20137 | 3.28118 | 37.1 | AJBK01 | 44 | 3361 | 3220 | Scaffold
Enterococcus faecalis EnGen0249 | 20137 | 3.03057 | 37.3 | AJBL01 | 6 | 2969 | 2846 | Scaffold
Enterococcus faecalis EnGen0250 | 20137 | 2.8202 | 37.7 | AJBM01 | 3 | 2741 | 2640 | Scaffold
Enterococcus faecalis EnGen0299 | 20137 | 2.93373 | 37.6 | AJDH01 | 11 | 2910 | 2771 | Scaffold
Enterococcus faecalis EnGen0301 | 20137 | 3.12761 | 37.1 | AJDK01 | 10 | 3070 | 2917 | Scaffold
Enterococcus faecalis EnGen0297 | 20137 | 3.11169 | 37.3 | AJDY01 | 36 | 3113 | 2972 | Scaffold
Enterococcus faecalis EnGen0310 = MMH594 | 20137 | 3.25559 | 37.1 | AJDZ01 | 25 | 3335 | 3208 | Scaffold
Enterococcus faecalis EnGen0294 | 20137 | 3.20317 | 37.2 | AJEA01 | 16 | 3214 | 3095 | Scaffold
Enterococcus faecalis EnGen0307 | 20137 | 3.18991 | 37.3 | AJEB01 | 19 | 3224 | 3087 | Scaffold
Enterococcus faecalis EnGen0280 | 20137 | 3.24669 | 37.3 | AJEC01 | 16 | 3271 | 3152 | Scaffold
Enterococcus faecalis EnGen0303 | 20137 | 3.16163 | 37.3 | AJED01 | 18 | 3140 | 3000 | Scaffold
Enterococcus faecalis EnGen0298 | 20137 | 3.2621 | 37.1 | AJEE01 | 35 | 3293 | 3157 | Scaffold
Enterococcus faecalis EnGen0311 | 20137 | 3.25099 | 37.2 | AJEF01 | 26 | 3287 | 3112 | Scaffold
Enterococcus faecalis EnGen0302 | 20137 | 3.19619 | 37.4 | AJEG01 | 6 | 3194 | 3049 | Scaffold
Enterococcus faecalis EnGen0306 | 20137 | 3.22235 | 37.3 | AJEH01 | 20 | 3230 | 3089 | Scaffold
Enterococcus faecalis EnGen0291 | 20137 | 2.87252 | 37.5 | AJEI01 | 11 | 2862 | 2778 | Scaffold
Enterococcus faecalis EnGen0282 | 20137 | 2.77535 | 37.7 | AJEJ01 | 3 | 2710 | 2610 | Scaffold
Enterococcus faecalis ATCC 29200 | 20137 | 3.01458 | 37.5 | AJEK01 | 15 | 2991 | 2862 | Scaffold
Enterococcus faecalis EnGen0279 | 20137 | 2.8858 | 37.7 | AJEL01 | 9 | 2848 | 2721 | Scaffold
Enterococcus faecalis EnGen0304 | 20137 | 2.81699 | 37.7 | AJEM01 | 3 | 2780 | 2672 | Scaffold
Enterococcus faecalis EnGen0281 | 20137 | 2.95549 | 37.6 | AJEN01 | 4 | 2953 | 2830 | Scaffold
Enterococcus faecalis EnGen0287 | 20137 | 3.03033 | 37.3 | AJEO01 | 5 | 2987 | 2863 | Scaffold
Enterococcus faecalis EnGen0300 | 20137 | 3.26233 | 37.1 | AJEP01 | 53 | 3317 | 3165 | Scaffold
Enterococcus faecalis EnGen0295 | 20137 | 2.95886 | 37.5 | AJEQ01 | 4 | 2920 | 2809 | Scaffold
Enterococcus faecalis EnGen0284 | 20137 | 3.01755 | 37.2 | AJES01 | 8 | 2965 | 2844 | Scaffold
Enterococcus faecalis EnGen0293 | 20137 | 3.10547 | 37.4 | AJEU01 | 16 | 3097 | 2966 | Scaffold
Enterococcus faecalis ATCC 27275 | 20137 | 3.00114 | 37.4 | AJEW01 | 8 | 2959 | 2855 | Scaffold
Enterococcus faecalis ATCC 27959 | 20137 | 2.94393 | 37.6 | AJEX01 | 2 | 2887 | 2805 | Scaffold
Enterococcus faecalis EnGen0289 | 20137 | 2.91996 | 37.5 | AJEY01 | 10 | 2860 | 2749 | Scaffold
Enterococcus faecalis EnGen0285 | 20137 | 3.153 | 37.4 | AJEZ01 | 21 | 3143 | 3015 | Scaffold
Enterococcus faecalis EnGen0335 | 20137 | 3.22406 | 37.6 | ASEO01 | 18 | 3249 | 3090 | Scaffold
Enterococcus faecalis EnGen0290 | 20137 | 3.06524 | 37.5 | AJEV01 | 12 | 3035 | 2909 | Scaffold
Enterococcus faecalis EnGen0283 | 20137 | 3.09784 | 37.4 | AJER01 | 13 | 3065 | 2950 | Scaffold
Enterococcus faecalis EnGen0194 | 20137 | 3.33091 | 37.1 | AIPS01 | 24 | 3414 | 3283 | Scaffold
Enterococcus faecalis EnGen0195 | 20137 | 3.25511 | 37.1 | AIPT01 | 28 | 3325 | 3204 | Scaffold
Enterococcus faecalis EnGen0196 | 20137 | 3.28873 | 37.1 | AIPU01 | 39 | 3372 | 3240 | Scaffold
Enterococcus faecalis EnGen0197 | 20137 | 3.29134 | 37.1 | AIQG01 | 19 | 3364 | 3243 | Scaffold
Enterococcus faecalis EnGen0198 | 20137 | 2.93852 | 37.2 | AIQI01 | 15 | 2907 | 2782 | Scaffold
Enterococcus faecalis EnGen0199 | 20137 | 3.25567 | 37.2 | AIQJ01 | 21 | 3314 | 3199 | Scaffold
Enterococcus faecalis EnGen0200 | 20137 | 3.3291 | 37.1 | AIQK01 | 21 | 3402 | 3281 | Scaffold
Enterococcus faecalis EnGen0201 | 20137 | 3.27664 | 37.1 | AIQM01 | 37 | 3365 | 3214 | Scaffold
Enterococcus faecalis EnGen0202 | 20137 | 3.29313 | 37.1 | AIQO01 | 40 | 3378 | 3237 | Scaffold
Enterococcus faecalis EnGen0203 | 20137 | 3.28563 | 37.1 | AIQU01 | 43 | 3330 | 3215 | Scaffold
Enterococcus faecalis EnGen0204 | 20137 | 3.28749 | 37.1 | AIQV01 | 37 | 3369 | 3221 | Scaffold
Enterococcus faecalis EnGen0207 | 20137 | 3.33457 | 37.1 | AIQW01 | 22 | 3412 | 3267 | Scaffold
Enterococcus faecalis EnGen0205 | 20137 | 3.29416 | 37.1 | AIQX01 | 35 | 3395 | 3249 | Scaffold
Enterococcus faecalis EnGen0228 | 20137 | 3.27309 | 37.1 | AIQY01 | 34 | 3344 | 3224 | Scaffold
Enterococcus faecalis EnGen0206 | 20137 | 3.27998 | 37.1 | AIQZ01 | 40 | 3359 | 3227 | Scaffold
Enterococcus faecalis EnGen0374 | 20137 | 3.28712 | 37.1 | AIRA01 | 42 | 3363 | 3235 | Scaffold
Enterococcus faecalis EnGen0208 | 20137 | 3.20687 | 37.2 | AIRB01 | 22 | 3281 | 3166 | Scaffold
Enterococcus faecalis EnGen0209 | 20137 | 3.28553 | 37.1 | AIRF01 | 41 | 3364 | 3217 | Scaffold
Enterococcus faecalis EnGen0210 | 20137 | 3.29711 | 37.1 | AIRH01 | 42 | 3394 | 3238 | Scaffold
Enterococcus faecalis EnGen0211 | 20137 | 3.29851 | 37.1 | AIRI01 | 26 | 3389 | 3257 | Scaffold
Enterococcus faecalis EnGen0212 | 20137 | 3.129 | 37.3 | AIRJ01 | 12 | 3101 | 2971 | Scaffold
Enterococcus faecalis EnGen0213 | 20137 | 3.29385 | 37.2 | AIRK01 | 39 | 3379 | 3225 | Scaffold
Enterococcus faecalis EnGen0214 | 20137 | 3.23806 | 37.2 | AIRL01 | 31 | 3304 | 3177 | Scaffold
Enterococcus faecalis EnGen0215 | 20137 | 3.28286 | 37.2 | AIRM01 | 25 | 3376 | 3225 | Scaffold
Enterococcus faecalis EnGen0216 | 20137 | 3.338 | 37.1 | AIRN01 | 41 | 3427 | 3303 | Scaffold
Enterococcus faecalis EnGen0217 | 20137 | 3.30573 | 37.1 | AIRO01 | 35 | 3392 | 3246 | Scaffold
Enterococcus faecalis EnGen0218 | 20137 | 3.28772 | 37.1 | AIRP01 | 39 | 3377 | 3238 | Scaffold
Enterococcus faecalis EnGen0219 | 20137 | 3.29298 | 37.1 | AIRQ01 | 41 | 3364 | 3247 | Scaffold
Enterococcus faecalis EnGen0220 | 20137 | 3.31955 | 37 | AIRR01 | 45 | 3428 | 3292 | Scaffold
Enterococcus faecalis EnGen0221 | 20137 | 3.24615 | 37.1 | AIRS01 | 27 | 3325 | 3204 | Scaffold
Enterococcus faecalis EnGen0222 | 20137 | 3.31696 | 37.1 | AIRT01 | 36 | 3408 | 3264 | Scaffold
Enterococcus faecalis EnGen0223 | 20137 | 3.24253 | 37.1 | AIRU01 | 26 | 3301 | 3189 | Scaffold
Enterococcus faecalis EnGen0224 | 20137 | 3.30123 | 37.2 | AIRV01 | 33 | 3366 | 3229 | Scaffold
Enterococcus faecalis EnGen0225 | 20137 | 3.33174 | 37.1 | AIRW01 | 38 | 3421 | 3293 | Scaffold
Enterococcus faecalis EnGen0226 | 20137 | 3.08939 | 37.3 | AIRX01 | 20 | 3136 | 3023 | Scaffold
Enterococcus faecalis EnGen0232 | 20137 | 3.20746 | 37.3 | AIZS01 | 40 | 3235 | 3079 | Scaffold
Enterococcus faecalis EnGen0233 | 20137 | 2.95195 | 37.5 | AIZW01 | 24 | 2928 | 2799 | Scaffold
Enterococcus faecalis V583 | 20137 | 3.3298 | 37.4 | AHYN01 | 9 | 3411 | 3288 | Scaffold
Enterococcus faecalis V583 | 20137 | 3.36431 | 37.4 | ASWP01 | 9 | 3453 | 3329 | Scaffold
Enterococcus faecalis KI-6-1-110608-1 | 20137 | 2.63813 | 37.5 | ATIE01 | 35 | 2532 | 2466 | Scaffold
Enterococcus faecalis 02-MB-P-10 | 20137 | 2.90479 | 37.2 | ATIF01 | 78 | 2819 | 2707 | Scaffold
Enterococcus faecalis 20-SD-BW-06 | 20137 | 2.79151 | 37.5 | ATIG01 | 40 | 2710 | 2639 | Scaffold
Enterococcus faecalis 02-MB-BW-10 | 20137 | 3.05362 | 37.1 | ATIH01 | 157 | 3065 | 2922 | Scaffold
Enterococcus faecalis D811610-10 | 20137 | 2.72052 | 37.6 | ATII01 | 32 | 2616 | 2549 | Scaffold
Enterococcus faecalis B83616-1 | 20137 | 2.6564 | 37.7 | ATIJ01 | 51 | 2582 | 2499 | Scaffold
Enterococcus faecalis 06-MB-S-10 | 20137 | 3.01378 | 37.2 | ATIK01 | 99 | 2986 | 2883 | Scaffold
Enterococcus faecalis 06-MB-S-04 | 20137 | 3.04071 | 37.2 | ATIL01 | 120 | 3005 | 2901 | Scaffold
Enterococcus faecalis F01966 | 20137 | 2.93243 | 37.2 | ATIN01 | 120 | 2907 | 2790 | Scaffold
Enterococcus faecalis 20-SD-BW-08 | 20137 | 2.78967 | 37.5 | ATIP01 | 40 | 2710 | 2639 | Scaffold
Enterococcus faecalis 20.SD.W.06 | 20137 | 2.84403 | 37.3 | ATIQ01 | 66 | 2766 | 2674 | Scaffold
Enterococcus faecalis RP2S-4 | 20137 | 3.00368 | 37.2 | ATIR01 | 84 | 2929 | 2828 | Scaffold
Enterococcus faecalis WKS-26-18-2 | 20137 | 2.89154 | 37.4 | ATIY01 | 124 | 2885 | 2765 | Scaffold
Enterococcus faecalis VC1B-1 | 20137 | 2.877 | 37.3 | ATIZ01 | 54 | 2826 | 2739 | Scaffold
Enterococcus faecalis UP2S-6 | 20137 | 2.85914 | 37.4 | ATJA01 | 72 | 2816 | 2718 | Scaffold
Enterococcus faecalis SLO2C-1 | 20137 | 2.78087 | 37.5 | ATJB01 | 38 | 2694 | 2626 | Scaffold
Enterococcus faecalis LA3B-2 | 20137 | 2.88771 | 37.3 | ATJC01 | 105 | 2889 | 2732 | Scaffold
Enterococcus faecalis BM4654 | 20137 | 3.50898 | 37.8 | AXOG01 | 23 | 3640 | 3500 | Scaffold
Enterococcus faecalis BM4539 | 20137 | 3.06681 | 37.9 | AXOH01 | 5 | 3020 | 2881 | Scaffold
Enterococcus faecalis JH2-2 | 20137 | 2.89927 | 37.6 | AXOI01 | 2 | 2873 | 2742 | Scaffold
Enterococcus faecalis EnGen0400 | 20137 | 2.92181 | 37.7 | JAHF01 | 13 | 2857 | 2733 | Scaffold
Enterococcus faecalis EnGen0401 | 20137 | 3.07048 | 37.4 | JAHG01 | 17 | 3074 | 2933 | Scaffold
Enterococcus faecalis EnGen0402 | 20137 | 2.94995 | 37.6 | JAHH01 | 4 | 2912 | 2795 | Scaffold
Enterococcus faecalis EnGen0403 | 20137 | 3.13741 | 37.3 | JAHI01 | 6 | 3187 | 3061 | Scaffold
Enterococcus faecalis EnGen0404 | 20137 | 3.1508 | 37.3 | JAHJ01 | 8 | 3204 | 3069 | Scaffold
Enterococcus faecalis EnGen0405 | 20137 | 3.13423 | 37.2 | JAHK01 | 5 | 3172 | 3065 | Scaffold
Enterococcus faecalis EnGen0406 | 20137 | 3.07305 | 37.3 | JAHL01 | 9 | 3119 | 2997 | Scaffold
Enterococcus faecalis EnGen0407 | 20137 | 2.73747 | 37.8 | JAHM01 | 2 | 2653 | 2571 | Scaffold
Enterococcus faecalis EnGen0408 | 20137 | 3.14707 | 37.3 | JAHN01 | 7 | 3179 | 3073 | Scaffold
Enterococcus faecalis EnGen0409 | 20137 | 2.8749 | 37.7 | JAHO01 | 4 | 2833 | 2703 | Scaffold
Enterococcus faecalis EnGen0410 | 20137 | 3.17517 | 37.2 | JAHP01 | 4 | 3231 | 3087 | Scaffold
Enterococcus faecalis EnGen0411 | 20137 | 2.91171 | 37.6 | JAHQ01 | 6 | 2915 | 2785 | Scaffold
Enterococcus faecalis EnGen0412 | 20137 | 3.06507 | 37.3 | JAHR01 | 4 | 3080 | 2974 | Scaffold
Enterococcus faecalis EnGen0413 | 20137 | 2.77793 | 37.7 | JAHS01 | 5 | 2686 | 2580 | Scaffold
Enterococcus faecalis EnGen0414 | 20137 | 2.73352 | 37.6 | JAHT01 | 4 | 2660 | 2567 | Scaffold
Enterococcus faecalis EnGen0415 | 20137 | 3.10143 | 37.4 | JAHU01 | 7 | 3134 | 3012 | Scaffold
Enterococcus faecalis EnGen0416 | 20137 | 2.98094 | 37.5 | JAHV01 | 9 | 2971 | 2846 | Scaffold
Enterococcus faecalis EnGen0417 | 20137 | 3.02228 | 37.4 | JAHW01 | 11 | 3007 | 2870 | Scaffold
Enterococcus faecalis EnGen0418 | 20137 | 3.02007 | 37.3 | JAHX01 | 4 | 2997 | 2898 | Scaffold
Enterococcus faecalis EnGen0419 | 20137 | 2.73807 | 37.8 | JAHY01 | 5 | 2667 | 2571 | Scaffold
Enterococcus faecalis EnGen0420 | 20137 | 2.73455 | 37.8 | JAHZ01 | 1 | 2649 | 2570 | Scaffold
Enterococcus faecalis EnGen0421 | 20137 | 3.0112 | 37.3 | JAIA01 | 8 | 2980 | 2878 | Scaffold
Enterococcus faecalis EnGen0422 | 20137 | 3.26073 | 37.4 | JAIB01 | 14 | 3266 | 3129 | Scaffold
Enterococcus faecalis EnGen0423 | 20137 | 2.83241 | 37.7 | JAIC01 | 2 | 2765 | 2668 | Scaffold
Enterococcus faecalis EnGen0424 | 20137 | 2.72258 | 37.7 | JAID01 | 1 | 2650 | 2555 | Scaffold
Enterococcus faecalis EnGen0425 | 20137 | 2.97597 | 37.3 | JAIE01 | 7 | 2973 | 2856 | Scaffold
Enterococcus faecalis EnGen0426 | 20137 | 3.00838 | 37.4 | JAIF01 | 3 | 3024 | 2911 | Scaffold
Enterococcus faecalis EnGen0427 | 20137 | 3.32994 | 37.1 | JAIG01 | 16 | 3415 | 3287 | Scaffold
Enterococcus faecalis 918 | 20137 | 3.31812 | 37.1 | AVNY01 | 111 | 3452 | 3306 | Scaffold
Enterococcus faecalis Efa HS0914 | 20137 | 2.81732 | 37.3 | JPDQ01 | 15 | 2770 | 2682 | Scaffold
Enterococcus faecalis | 20137 | 3.05163 | 37.1 | LKGS01 | 966 | 3193 | 2559 | Scaffold
Enterococcus faecalis TUSoD Ef11 | 20137 | 2.83665 | 37.7 | ACOX02 | 11 | 2808 | 2683 | Contig
Enterococcus faecalis PC1.1 | 20137 | 2.75392 | 37.6 | ADKN01 | 79 | 2688 | 2614 | Contig
Enterococcus faecalis OG1X | 20137 | 2.73907 | 37.7 | AFHH01 | 78 | 2640 | 2516 | Contig
Enterococcus faecalis M7 | 20137 | 2.7332 | 37.7 | AGVN01 | 77 | 2634 | 2481 | Contig
Enterococcus faecalis 10244 | 20137 | 3.11229 | 37.339 | ASWX01 | 79 | 3122 | 3006 | Contig
Enterococcus faecalis E12 | 20137 | 2.98348 | 37.2 | AWPI01 | 117 | 3022 | 2898 | Contig
Enterococcus faecalis EnGen0286 | 20137 | 2.92362 | 37.5 | AJET01 | 12 | 2882 | 2794 | Contig
Enterococcus faecalis MA1 | 20137 | 2.96214 | 37.4 | ANMP01 | 74 | 2977 | 2878 | Contig
Enterococcus faecalis AZ19 | 20137 | 2.93529 | 37.3 | AYLU01 | 98 | 2896 | 2813 | Contig
Enterococcus faecalis FL2 | 20137 | 2.66569 | 37.6 | AYKK01 | 119 | 2617 | 2532 | Contig
Enterococcus faecalis GA2 | 20137 | 2.67756 | 37.6 | AYKL01 | 50 | 2588 | 2549 | Contig
Enterococcus faecalis GAN13 | 20137 | 2.84692 | 37.4 | AYLV01 | 92 | 2786 | 2693 | Contig
Enterococcus faecalis KS19 | 20137 | 2.74034 | 37.5 | AYND01 | 94 | 2650 | 2574 | Contig
Enterococcus faecalis MD6 | 20137 | 2.73319 | 37.5 | AYLN01 | 102 | 2669 | 2585 | Contig
Enterococcus faecalis MN16 | 20137 | 2.83359 | 37.4 | AYKM01 | 63 | 2745 | 2684 | Contig
Enterococcus faecalis MTmid8 | 20137 | 2.69046 | 37.6 | AYKU01 | 61 | 2595 | 2538 | Contig
Enterococcus faecalis MTUP9 | 20137 | 2.9734 | 37.2 | AYOJ01 | 66 | 2905 | 2804 | Contig
Enterococcus faecalis NJ44 | 20137 | 2.91373 | 37.2 | AYOK01 | 129 | 2870 | 2775 | Contig
Enterococcus faecalis NY9 | 20137 | 2.98386 | 37 | AYOL01 | 113 | 2924 | 2817 | Contig
Enterococcus faecalis | 20137 | 2.96543 | 37.2 | JPWN01 | 13 | 2956 | 2854 | Contig
Enterococcus faecalis | 20137 | 3.01051 | 37.3 | JPTY01 | 37 | 2991 | 2879 | Contig
Enterococcus faecalis | 20137 | 2.97906 | 37.4 | JPTZ01 | 40 | 2999 | 2887 | Contig
Enterococcus faecalis | 20137 | 2.94666 | 37.3 | JQHD01 | 81 | 2927 | 2843 | Contig
Enterococcus faecalis | 20137 | 3.23297 | 37.3 | JSES01 | 70 | 3330 | 3209 | Contig
Enterococcus faecalis EnGen0310 | 20137 | 3.26068 | 37 | AOPW01 | 104 | 3357 | 3246 | Contig
Enterococcus faecalis JH2-2 | 20137 | 2.86463 | 37.6 | CAWH01 | 172 | 2826 | 2618 | Contig
Enterococcus faecalis | 20137 | 3.05351 | 36.9 | JWBU01 | 152 | 2984 | 2886 | Contig
Enterococcus faecalis | 20137 | 3.15503 | 37.1 | JWAR01 | 97 | 3186 | 3067 | Contig
Enterococcus faecalis | 20137 | 3.14873 | 37.2 | JVXY01 | 172 | 3189 | 3070 | Contig
Enterococcus faecalis | 20137 | 3.18577 | 37.2 | JVXC01 | 143 | 3218 | 3109 | Contig
Enterococcus faecalis | 20137 | 3.2219 | 37.3 | JVUK01 | 117 | 3252 | 3149 | Contig
Enterococcus faecalis | 20137 | 2.99657 | 37.5 | JVSW01 | 88 | 2965 | 2843 | Contig
Enterococcus faecalis | 20137 | 2.92289 | 37.3 | JVZS01 | 114 | 2842 | 2774 | Contig
Enterococcus faecalis | 20137 | 3.08145 | 37.1 | JVZM01 | 137 | 3045 | 2929 | Contig
Enterococcus faecalis | 20137 | 2.8754 | 37.5 | JVVS01 | 51 | 2821 | 2732 | Contig
Enterococcus faecalis | 20137 | 3.13653 | 37.2 | JVTP01 | 186 | 3148 | 3022 | Contig
Enterococcus faecalis | 20137 | 2.96039 | 37.4 | JVQP01 | 58 | 2912 | 2835 | Contig
Enterococcus faecalis | 20137 | 3.25221 | 37.2 | JVKC01 | 96 | 3251 | 3110 | Contig
Enterococcus faecalis | 20137 | 2.97395 | 37.3 | JVCH01 | 127 | 2872 | 2789 | Contig
Enterococcus faecalis | 20137 | 2.96736 | 37.4 | JVQQ01 | 46 | 2927 | 2838 | Contig
Enterococcus faecalis | 20137 | 2.88711 | 37.5 | JVOQ01 | 163 | 2842 | 2736 | Contig
Enterococcus faecalis | 20137 | 2.97325 | 37.3 | JVOC01 | 161 | 2968 | 2895 | Contig
Enterococcus faecalis | 20137 | 3.09886 | 37.3 | JVNY01 | 80 | 3069 | 2951 | Contig
Enterococcus faecalis | 20137 | 2.87829 | 37.5 | JVBW01 | 51 | 2861 | 2732 | Contig
Enterococcus faecalis | 20137 | 2.88764 | 37.5 | JVBV01 | 60 | 2856 | 2738 | Contig
Enterococcus faecalis | 20137 | 2.8841 | 37.4 | JVAM01 | 155 | 2818 | 2735 | Contig
Enterococcus faecalis | 20137 | 3.01896 | 37.2 | JUXV01 | 126 | 3014 | 2911 | Contig
Enterococcus faecalis | 20137 | 2.98925 | 37.2 | JUXL01 | 188 | 2980 | 2886 | Contig
Enterococcus faecalis | 20137 | 3.01191 | 37.2 | JUWK01 | 199 | 2974 | 2892 | Contig
Enterococcus faecalis | 20137 | 2.81577 | 37.4 | JUVH01 | 150 | 2739 | 2662 | Contig
Enterococcus faecalis | 20137 | 3.18909 | 37.2 | JUUM01 | 292 | 3226 | 3094 | Contig
Enterococcus faecalis | 20137 | 2.98227 | 37.2 | JUUJ01 | 326 | 2984 | 2842 | Contig
Enterococcus faecalis | 20137 | 2.95563 | 37.2 | JUQC01 | 182 | 2938 | 2843 | Contig
Enterococcus faecalis | 20137 | 2.79543 | 37.5 | JUPP01 | 171 | 2707 | 2605 | Contig
Enterococcus faecalis | 20137 | 2.98506 | 37.2 | JUXT01 | 225 | 2987 | 2871 | Contig
Enterococcus faecalis | 20137 | 3.00691 | 37.4 | JUXC01 | 223 | 3010 | 2894 | Contig
Enterococcus faecalis | 20137 | 2.99763 | 37.2 | JUVP01 | 147 | 2982 | 2875 | Contig
Enterococcus faecalis | 20137 | 2.98645 | 37.2 | JUVA01 | 261 | 2991 | 2850 | Contig
Enterococcus faecalis | 20137 | 3.19095 | 37.1 | JUPR01 | 307 | 3246 | 3095 | Contig
Enterococcus faecalis | 20137 | 2.86108 | 37.4 | JUOO01 | 187 | 2863 | 2763 | Contig
Enterococcus faecalis | 20137 | 3.04471 | 37.3 | JUMK01 | 141 | 3000 | 2929 | Contig
Enterococcus faecalis | 20137 | 2.95477 | 37.3 | JUOP01 | 215 | 2927 | 2821 | Contig
Enterococcus faecalis | 20137 | 3.20197 | 37.1 | JUNN01 | 258 | 3264 | 3121 | Contig
Enterococcus faecalis | 20137 | 3.03497 | 37.3 | JUMJ01 | 140 | 2982 | 2894 | Contig
Enterococcus faecalis | 20137 | 3.05681 | 37.2 | JULA01 | 129 | 3047 | 2943 | Contig
Enterococcus faecalis | 20137 | 2.76372 | 37.5 | JUNQ01 | 113 | 2675 | 2609 | Contig
Enterococcus faecalis | 20137 | 3.17273 | 37.1 | JUNL01 | 472 | 3292 | 3061 | Contig
Enterococcus faecalis | 20137 | 2.91574 | 37.3 | LAEB01 | 18 | 2892 | 2804 | Contig
Enterococcus faecalis | 20137 | 3.14674 | 37.1 | LKGR01 | 42 | 3109 | 2975 | Contig
Enterococcus faecalis NBRC 100480 | 20137 | 2.83321 | 37.5 | BCQC01 | 37 | 2824 | 2729 | Contig
Enterococcus faecalis ATCC 29212 | 20137 | 3.01126 | 37.3 | MTFY01 | 49 | 3098 | 2959 | Contig
Enterococcus faecalis ATCC 29212 | 20137 | 3.09625 | 37.2 | FPDW01 | 55 | 3219 | 3051 | Contig
Enterococcus faecalis ATCC 29212 | 20137 | 3.25366 | 37 | FPDZ01 | 48 | 3327 | 3095 | Contig
Enterococcus faecalis ATCC 29212 | 20137 | 2.97208 | 37.3 | FPEB01 | 40 | 3050 | 2863 | Contig
Enterococcus faecalis ATCC 29212 | 20137 | 2.89971 | 37.4 | FPDY01 | 56 | 2981 | 2811 | Contig
Enterococcus faecalis ATCC 29212 | 20137 | 2.95394 | 37.3 | FPEA01 | 44 | 3061 | 2886 | Contig
Enterococcus faecalis ATCC 29212 | 20137 | 2.76049 | 37.5 | FPEC01 | 15 | 2766 | 2663 | Contig
Enterococcus faecalis | 20137 | 2.89672 | 37.4 | JTKT01 | 57 | 2838 | 2746 | Scaffold
Enterococcus faecalis | 20137 | 2.96923 | 37.3 | JTKW01 | 34 | 2941 | 2832 | Scaffold
Enterococcus faecalis | 20137 | 2.76281 | 37.6 | JTKS01 | 24 | 2687 | 2592 | Scaffold
Enterococcus faecalis | 20137 | 2.78667 | 37.5 | JTKU01 | 20 | 2692 | 2611 | Scaffold
Enterococcus faecalis | 20137 | 2.88866 | 37.4 | JTKV01 | 103 | 2834 | 2749 | Scaffold
Enterococcus faecalis | 20137 | 3.03771 | 37.2 | JTKX01 | 89 | 2985 | 2853 | Scaffold
Enterococcus faecalis | 20137 | 2.89236 | 37.5 | JWAW01 | 90 | 2809 | 2720 | Scaffold
Enterococcus faecalis | 20137 | 2.84673 | 37.4 | JVYW01 | 131 | 2793 | 2713 | Scaffold
Enterococcus faecalis | 20137 | 3.3048 | 37.2 | JVTX01 | 110 | 3330 | 3206 | Scaffold
Enterococcus faecalis | 20137 | 3.30387 | 37.2 | JVTK01 | 111 | 3325 | 3203 | Scaffold
Enterococcus faecalis | 20137 | 3.20343 | 37.3 | JVTG01 | 95 | 3221 | 3082 | Scaffold
Enterococcus faecalis | 20137 | 2.98751 | 37.4 | JVSV01 | 73 | 2990 | 2881 | Scaffold
Enterococcus faecalis | 20137 | 2.96521 | 37.4 | JVQS01 | 54 | 2928 | 2839 | Scaffold
Enterococcus faecalis | 20137 | 3.11136 | 37.1 | JVPV01 | 388 | 3162 | 2959 | Scaffold
Enterococcus faecalis | 20137 | 2.79238 | 37.5 | JVPG01 | 140 | 2727 | 2651 | Scaffold
Enterococcus faecalis | 20137 | 3.12153 | 37.3 | JVOF01 | 95 | 3117 | 3019 | Scaffold
Enterococcus faecalis | 20137 | 2.98897 | 37.3 | JVOA01 | 145 | 2990 | 2911 | Scaffold
Enterococcus faecalis | 20137 | 2.98429 | 37.3 | JVIY01 | 133 | 2983 | 2901 | Scaffold
Enterococcus faecalis | 20137 | 2.95914 | 37.3 | JVIK01 | 103 | 2957 | 2878 | Scaffold
Enterococcus faecalis | 20137 | 3.26723 | 37.1 | JVGB01 | 101 | 3306 | 3188 | Scaffold
Enterococcus faecalis | 20137 | 3.01627 | 37.3 | JVBG01 | 176 | 2989 | 2877 | Scaffold
Enterococcus faecalis | 20137 | 2.80974 | 37.3 | JVAI01 | 107 | 2724 | 2652 | Scaffold
Enterococcus faecalis | 20137 | 3.01936 | 37.2 | JVAD01 | 130 | 3017 | 2914 | Scaffold
Enterococcus faecalis | 20137 | 2.99707 | 37.3 | JVQY01 | 105 | 2995 | 2876 | Scaffold
Enterococcus faecalis | 20137 | 2.96057 | 37.4 | JVQT01 | 64 | 2902 | 2835 | Scaffold
Enterococcus faecalis | 20137 | 2.96191 | 37.4 | JVQF01 | 57 | 2913 | 2831 | Scaffold
Enterococcus faecalis | 20137 | 3.01209 | 37.3 | JVPS01 | 114 | 3013 | 2934 | Scaffold
Enterococcus faecalis | 20137 | 2.89199 | 37.3 | JVPJ01 | 74 | 2823 | 2748 | Scaffold
Enterococcus faecalis | 20137 | 3.02318 | 37.3 | JVOG01 | 125 | 3043 | 2934 | Scaffold
Enterococcus faecalis | 20137 | 2.95885 | 37.2 | JVOB01 | 236 | 2985 | 2878 | Scaffold
Enterococcus faecalis | 20137 | 3.01222 | 37.3 | JVJB01 | 89 | 3034 | 2934 | Scaffold
Enterococcus faecalis | 20137 | 3.07533 | 37.1 | JVID01 | 70 | 3070 | 2947 | Scaffold
Enterococcus faecalis | 20137 | 3.19108 | 37.1 | JVHL01 | 101 | 3216 | 3059 | Scaffold
Enterococcus faecalis | 20137 | 3.23003 | 37.2 | JVDH01 | 161 | 3311 | 3162 | Scaffold
Enterococcus faecalis | 20137 | 2.85333 | 37.5 | JVBD01 | 104 | 2778 | 2705 | Scaffold
Enterococcus faecalis | 20137 | 3.00262 | 37.2 | JVAN01 | 230 | 3041 | 2900 | Scaffold
Enterococcus faecalis | 20137 | 2.88817 | 37.4 | JUZT01 | 140 | 2814 | 2731 | Scaffold
Enterococcus faecalis | 20137 | 2.97827 | 37.2 | JUYS01 | 248 | 2968 | 2814 | Scaffold
Enterococcus faecalis | 20137 | 3.05917 | 37.1 | JUXZ01 | 106 | 3064 | 2932 | Scaffold
Enterococcus faecalis | 20137 | 3.07883 | 37.2 | JUWE01 | 137 | 3098 | 2993 | Scaffold
Enterococcus faecalis | 20137 | 2.83672 | 37.6 | LSFS01 | 31 | 2718 | 2630 | Scaffold
Enterococcus faecalis | 20137 | 2.92998 | 37.6 | LQAM01 | 14 | 2912 | 2795 | Scaffold
Enterococcus faecalis TX1467 | 20137 | 3.02628 | 37.1 | AFBS01 | 126 | 3557 | 3510 | Scaffold
Enterococcus faecalis ATCC 29212 | 20137 | 3.02706 | 37.2 | ALOD01 | 126 | 2443 | 2347 | Contig
Enterococcus faecalis CBRD01 | 20137 | 2.81317 | 37.5 | AWYG01 | 140 | 1887 | 1874 | Contig
Enterococcus faecalis PF3 | 20137 | 3.21386 | 37.5 | AZIA01 | 397 | 3364 | 3173 | Contig
Enterococcus faecalis DORA_14 | 20137 | 2.96586 | 37.2991 | AZLY01 | 50 | 2911 | 2911 | Contig
Enterococcus faecalis | 20137 | 3.10643 | 37.5 | JMEC01 | 38 | 3051 | 2734 | Contig

Supplementary data:

Table II: List of E. faecium genomes analysed

Organism/Name | CladeID | Size (Mb) | GC% | Accession number | Scaffolds | Genes | Proteins | Level
Enterococcus faecalis V583 | 20137 | 3.35997 | 37.3546 | NC_004668.1 | 4 | 3412 | 3264 | Complete Genome
Enterococcus faecalis OG1RF | 20137 | 2.73963 | 37.8 | NC_017316.1 | 1 | 2636 | 2548 | Complete Genome
Enterococcus faecalis 62 | - | 3.13082 | 37.3617 | CP002491.1 | 5 | 3157 | 3075 | Complete Genome
Enterococcus faecalis D32 | 20137 | 3.0625 | 37.4365 | NC_018221.1 | 3 | 3082 | 2934 | Complete Genome
Enterococcus faecalis str. Symbioflor 1 | 20137 | 2.81067 | 37.7 | NC_019770.1 | 1 | 2805 | 2686 | Complete Genome
Enterococcus faecalis DENG1 | 20137 | 2.96104 | 37.5 | NZ_CP004081.1 | 1 | 2960 | 2838 | Complete Genome
Enterococcus faecalis ATCC 29212 | 20137 | 3.04813 | 37.3559 | NZ_CP008816.1 | 3 | 3038 | 2876 | Complete Genome
Enterococcus faecalis | 20137 | 2.80343 | 37.6 | NZ_CP014949.1 | 1 | 2781 | 2643 | Complete Genome
Enterococcus faecalis TX0102  20137  2.87153  37.4  AEBD01  40  2788  2710  Scaffold
Enterococcus faecalis TX0630  20137  3.22833  37  AEBE01  141  3295  3170  Scaffold
Enterococcus faecalis TX0031  20137  2.8204  37.5  AEBF01  32  2733  2653  Scaffold
Enterococcus faecalis TX4244  20137  2.92536  37.3  AEBH01  53  2869  2785  Scaffold
Enterococcus faecalis TX1346  20137  2.7829  37.6  AEBI01  91  2748  2633  Scaffold
Enterococcus faecalis TX1342  20137  2.83654  37.5  AEBJ01  27  2747  2664  Scaffold
Enterococcus faecalis TX1302  20137  2.88376  37.5  AEBK01  32  2811  2734  Scaffold
Enterococcus faecalis TX0043  20137  2.81205  37.5  AEBL01  35  2792  2716  Scaffold
Enterococcus faecalis TX0027  20137  3.06166  37.2  AEBM01  59  3067  2977  Scaffold
Enterococcus faecalis TX0309A  20137  3.11221  37.2  AEBN01  56  3153  3054  Scaffold
Enterococcus faecalis TX0309B  20137  3.10797  37.2  AEBO01  69  3156  3047  Scaffold
Enterococcus faecalis TX2137  20137  2.96688  37.2  AEBQ01  106  2913  2812  Scaffold
Enterococcus faecalis TX0017  20137  2.99707  37.4  AEBP01  69  3000  2889  Scaffold
Enterococcus faecalis TX4248  20137  3.18702  37.1  AEBR01  85  3208  3101  Scaffold
Enterococcus faecalis DAPTO 516  20137  3.0558  37.3  AEBS01  79  3049  2958  Scaffold
Enterococcus faecalis DAPTO 512  20137  3.05442  37.3  AEBT01  74  3045  2952  Scaffold
Enterococcus faecalis TX0855  20137  2.98732  37.2  AEBV01  65  2943  2836  Scaffold
Enterococcus faecalis TX2134  20137  3.12143  37.1  AEBW01  78  3136  3018  Scaffold
Enterococcus faecalis TX0860  20137  3.06177  37.2  AEBX01  72  3034  2939  Scaffold
Enterococcus faecalis TX0109  20137  2.9673  37.4  AEBY01  78  2927  2826  Scaffold
Enterococcus faecalis EnGen0311  20137  3.16869  37  AEBZ01  86  3170  3060  Scaffold
Enterococcus faecalis TX2141  20137  2.90311  37.4  AECG01  77  2934  2817  Scaffold
Enterococcus faecalis TX0411  20137  3.12459  37.2  AECA01  75  3174  3057  Scaffold
Enterococcus faecalis TX0645  20137  3.17461  37.1  AECE01  90  3194  3067  Scaffold
Enterococcus faecalis TX1341  20137  2.99999  37.2  AECF01  38  2972  2892  Scaffold
Enterococcus faecalis TX0012  20137  2.81853  37.4  AECD01  36  2720  2636  Scaffold
Enterococcus faecalis TX0470  20137  2.87879  37.3  AECC01  42  2837  2748  Scaffold
Enterococcus faecalis TX0312  20137  2.78566  37.6  AECB01  42  2735  2650  Scaffold
Enterococcus faecalis TX4000  20137  2.8403  37.5  AEBB01  55  2805  2710  Scaffold
Enterococcus faecalis T1  20137  2.95069  37.7  ACAD01  16  2871  2765  Scaffold

Enterococcus faecalis T2  20137  3.26383  37.2  ACAE01  22  3224  3106  Scaffold
Enterococcus faecalis T3  20137  2.79121  37.6  ACAF01  10  2705  2614  Scaffold
Enterococcus faecalis ATCC 4200  20137  3.03617  37.5  ACAG01  14  2986  2856  Scaffold
Enterococcus faecalis DS5  20137  3.18642  37.3  ACAI01  43  3151  3021  Scaffold
Enterococcus faecalis ARO1/DG  20137  2.84994  37.7  ACAK01  13  2777  2680  Scaffold
Enterococcus faecalis Merz96  20137  3.08525  37.7  ACAM01  21  2996  2892  Scaffold
Enterococcus faecalis HIP11704  20137  3.20299  37.3  ACAN01  38  3186  3050  Scaffold
Enterococcus faecalis JH1  20137  3.0428  37.4  ACAP01  24  2966  2868  Scaffold
Enterococcus faecalis E1Sol  20137  2.8819  37.6  ACAQ01  14  2842  2720  Scaffold
Enterococcus faecalis Fly1  20137  2.83429  37.5  ACAR01  12  2703  2593  Scaffold
Enterococcus faecalis D6  20137  2.90625  37.6  ACAT01  11  2835  2748  Scaffold
Enterococcus faecalis T11  20137  2.74872  37.7  ACAU01  13  2642  2575  Scaffold
Enterococcus faecalis CH188  20137  3.22035  37.3  ACAV01  27  3177  3066  Scaffold
Enterococcus faecalis X98  20137  2.9427  37.5  ACAW01  13  2951  2831  Scaffold
Enterococcus faecalis TX0104  20137  3.15648  37.4  ACGL01  95  3133  2974  Scaffold
Enterococcus faecalis TX1322  20137  2.97429  37.5  ACGM01  40  2924  2797  Scaffold
Enterococcus faecalis ATCC 29200  20137  2.97337  37.6  ACHK01  64  2900  2802  Scaffold
Enterococcus faecalis EnGen0297  20137  3.12993  37.4  ACIX01  103  3098  2875  Scaffold
Enterococcus faecalis T8  20137  3.03739  37.5  ACOC01  24  2998  2856  Scaffold
Enterococcus faecalis S613  20137  3.04735  37.3  ADDP01  105  3037  2936  Scaffold
Enterococcus faecalis R712  20137  3.03954  37.3  ADDQ01  105  3033  2936  Scaffold
Enterococcus faecalis 599  20137  3.04841  37.2  ALZI01  106  3026  2900  Scaffold
Enterococcus faecalis ERV103  20137  3.06482  37.4  ALZJ01  56  3044  2959  Scaffold
Enterococcus faecalis ERV116  20137  3.06975  37.4  ALZK01  54  3037  2963  Scaffold
Enterococcus faecalis ERV129  20137  3.15795  37.3  ALZL01  81  3149  3039  Scaffold
Enterococcus faecalis ERV25  20137  3.08736  37.3  ALZM01  93  3082  2974  Scaffold
Enterococcus faecalis ERV31  20137  3.05317  37.3  ALZN01  83  3025  2940  Scaffold
Enterococcus faecalis ERV37  20137  3.12022  37.3  ALZO01  126  3134  3020  Scaffold
Enterococcus faecalis ERV41  20137  3.15213  37.3  ALZP01  129  3185  3067  Scaffold
Enterococcus faecalis ERV62  20137  3.11751  37.3  ALZQ01  76  3128  3025  Scaffold

Enterococcus faecalis ERV63  20137  3.16512  37.3  ALZR01  94  3197  3081  Scaffold
Enterococcus faecalis ERV65  20137  3.09327  37.3  ALZS01  102  3100  2998  Scaffold
Enterococcus faecalis ERV68  20137  3.13475  37.3  ALZT01  100  3157  3046  Scaffold
Enterococcus faecalis ERV72  20137  3.13714  37.3  ALZU01  94  3152  3053  Scaffold
Enterococcus faecalis ERV73  20137  3.08542  37.4  ALZV01  125  3081  2933  Scaffold
Enterococcus faecalis ERV81  20137  3.14055  37.3  ALZW01  93  3159  3032  Scaffold
Enterococcus faecalis ERV85  20137  3.12947  37.3  ALZX01  121  3136  3008  Scaffold
Enterococcus faecalis ERV93  20137  3.09818  37.3  ALZY01  103  3101  2999  Scaffold
Enterococcus faecalis R508  20137  2.79303  37.4  ALZZ01  66  2702  2617  Scaffold
Enterococcus faecalis EnGen0065  20137  3.33063  37.3  AIIK01  33  3400  3258  Scaffold
Enterococcus faecalis EnGen0062  20137  3.00378  37.4  AIIL01  3  2987  2832  Scaffold
Enterococcus faecalis EnGen0061  20137  3.34833  37.1  AIIM01  33  3440  3270  Scaffold
Enterococcus faecalis EnGen0064  20137  3.4091  37.5  AIIN01  14  3447  3277  Scaffold
Enterococcus faecalis EnGen0066  20137  2.89236  37.6  AIIO01  4  2884  2759  Scaffold
Enterococcus faecalis EnGen0063  20137  3.20972  37.2  AIIP01  10  3145  2973  Scaffold
Enterococcus faecalis EnGen0059  20137  3.20032  37.2  AIIQ01  11  3128  2976  Scaffold
Enterococcus faecalis EnGen0076  20137  2.90365  37.4  AIIR01  6  2851  2725  Scaffold
Enterococcus faecalis EnGen0074  20137  3.02481  37.5  AIIS01  14  2954  2834  Scaffold
Enterococcus faecalis EnGen0075  20137  2.95449  37.4  AIIT01  5  2932  2828  Scaffold
Enterococcus faecalis EnGen0058  20137  3.06318  37.5  AIIU01  17  3003  2877  Scaffold
Enterococcus faecalis EnGen0073  20137  2.97081  37.5  AIIV01  11  2890  2793  Scaffold
Enterococcus faecalis EnGen0078  20137  2.97336  37.4  AIIW01  10  2953  2832  Scaffold
Enterococcus faecalis EnGen0080  20137  3.07348  37.2  AIIX01  19  3037  2924  Scaffold
Enterococcus faecalis EnGen0079  20137  3.08757  37.4  AIIY01  14  3100  2965  Scaffold
Enterococcus faecalis EnGen0060  20137  2.92276  37.4  AIIZ01  4  2911  2798  Scaffold
Enterococcus faecalis EnGen0081  20137  2.99289  37.3  AIJA01  8  2974  2855  Scaffold
Enterococcus faecalis EnGen0082  20137  2.91098  37.5  AIJC01  7  2849  2715  Scaffold
Enterococcus faecalis EnGen0083  20137  2.90455  37.4  AIJD01  6  2866  2760  Scaffold
Enterococcus faecalis EnGen0084  20137  2.79156  37.7  AIJE01  4  2692  2577  Scaffold
Enterococcus faecalis EnGen0071  20137  2.92486  37.6  AIJF01  6  2892  2747  Scaffold

Enterococcus faecalis EnGen0072  20137  2.91215  37.4  AIJG01  7  2873  2763  Scaffold
Enterococcus faecalis EnGen0067  20137  3.33615  37.1  AIJH01  39  3446  3290  Scaffold
Enterococcus faecalis EnGen0068  20137  3.26917  37.2  AIJI01  12  3348  3208  Scaffold
Enterococcus faecalis EnGen0069  20137  3.32938  37.1  AIJJ01  32  3409  3270  Scaffold
Enterococcus faecalis EnGen0070  20137  3.2731  37.2  AIJK01  23  3361  3215  Scaffold
Enterococcus faecalis EnGen0106  20137  3.27642  37.2  AIPV01  22  3358  3215  Scaffold
Enterococcus faecalis EnGen0088  20137  3.33964  37.2  AIPW01  7  3402  3268  Scaffold
Enterococcus faecalis EnGen0120  20137  3.22636  37.1  AIPX01  37  3318  3185  Scaffold
Enterococcus faecalis EnGen0089  20137  3.30156  37.2  AIPY01  4  3358  3226  Scaffold
Enterococcus faecalis EnGen0090  20137  3.27377  37.1  AIPZ01  15  3347  3215  Scaffold
Enterococcus faecalis EnGen0109  20137  3.3224  37.1  AIQA01  53  3392  3262  Scaffold
Enterococcus faecalis EnGen0110  20137  3.2934  37.1  AIQB01  41  3371  3225  Scaffold
Enterococcus faecalis EnGen0091  20137  3.34189  37.2  AIQC01  9  3403  3254  Scaffold
Enterococcus faecalis EnGen0092  20137  3.32362  37.2  AIQD01  24  3393  3250  Scaffold
Enterococcus faecalis EnGen0085  20137  3.32328  37.1  AIQE01  32  3392  3279  Scaffold
Enterococcus faecalis EnGen0111  20137  3.29857  37.2  AIQF01  36  3366  3227  Scaffold
Enterococcus faecalis EnGen0119  20137  3.32198  37.1  AIQH01  19  3381  3265  Scaffold
Enterococcus faecalis EnGen0093  20137  2.92373  37.2  AIQL01  14  2861  2777  Scaffold
Enterococcus faecalis EnGen0094  20137  3.32412  37.1  AIQN01  12  3398  3268  Scaffold
Enterococcus faecalis EnGen0095  20137  3.3235  37.1  AIQP01  9  3410  3266  Scaffold
Enterococcus faecalis EnGen0096  20137  3.34585  37.1  AIQQ01  11  3425  3288  Scaffold
Enterococcus faecalis EnGen0097  20137  3.40671  37.1  AIQR01  18  3482  3337  Scaffold
Enterococcus faecalis EnGen0112  20137  3.37676  37.1  AIQS01  27  3445  3294  Scaffold
Enterococcus faecalis EnGen0098  20137  3.35529  37.2  AIQT01  9  3409  3268  Scaffold
Enterococcus faecalis EnGen0099  20137  3.26317  37.1  AIRC01  39  3345  3219  Scaffold
Enterococcus faecalis EnGen0113  20137  3.23366  37.2  AIRD01  44  3306  3148  Scaffold
Enterococcus faecalis EnGen0114  20137  3.31386  37.1  AIRE01  22  3397  3271  Scaffold
Enterococcus faecalis EnGen0100  20137  3.31711  37.1  AIRG01  22  3394  3264  Scaffold
Enterococcus faecalis EnGen0107  20137  3.24374  37.1  AIRY01  26  3321  3204  Scaffold
Enterococcus faecalis EnGen0087  20137  3.33375  37.1  AIRZ01  30  3423  3303  Scaffold

Enterococcus faecalis EnGen0108  20137  3.31778  37.1  AISA01  37  3404  3268  Scaffold
Enterococcus faecalis EnGen0086  20137  2.96976  37.4  AISB01  10  2934  2790  Scaffold
Enterococcus faecalis EnGen0115  20137  3.07705  37.5  AISC01  10  3106  2979  Scaffold
Enterococcus faecalis EnGen0101  20137  3.36582  37.3  AISD01  26  3384  3224  Scaffold
Enterococcus faecalis EnGen0102  20137  2.90492  37.6  AISE01  9  2841  2726  Scaffold
Enterococcus faecalis EnGen0103  20137  2.91788  37.6  AISF01  9  2850  2729  Scaffold
Enterococcus faecalis EnGen0104  20137  2.90183  37.6  AISG01  8  2839  2732  Scaffold
Enterococcus faecalis EnGen0105  20137  2.90435  37.6  AISH01  9  2840  2724  Scaffold
Enterococcus faecalis EnGen0116  20137  2.87269  37.6  AISI01  8  2798  2701  Scaffold
Enterococcus faecalis EnGen0117  20137  2.97398  37.4  AISJ01  21  2923  2817  Scaffold
Enterococcus faecalis EnGen0118  20137  2.95737  37.5  AISK01  10  2911  2794  Scaffold
Enterococcus faecalis EnGen0332  20137  3.3  37.2  ASEN01  28  3317  3189  Scaffold
Enterococcus faecalis EnGen0341  20137  3.24386  37.5  ASCU01  28  3258  3112  Scaffold
Enterococcus faecalis EnGen0366  20137  3.27428  37.4  ASCV01  8  3367  3226  Scaffold
Enterococcus faecalis EnGen0344  20137  3.00786  37.3  ASCW01  4  2953  2852  Scaffold
Enterococcus faecalis EnGen0361  20137  3.16517  37.4  ASCX01  17  3204  3078  Scaffold
Enterococcus faecalis EnGen0345  20137  3.10989  37.5  ASCY01  14  3055  2938  Scaffold
Enterococcus faecalis EnGen0346  20137  2.72323  37.7  ASCZ01  4  2617  2525  Scaffold
Enterococcus faecalis ATCC 19433  20137  2.8814  37.7  ASDA01  3  2896  2754  Scaffold
Enterococcus faecalis EnGen0362  20137  3.00451  37.2  ASDB01  15  2959  2840  Scaffold
Enterococcus faecalis EnGen0348  20137  3.09011  37.4  ASDC01  15  3111  2959  Scaffold
Enterococcus faecalis EnGen0363  20137  2.94664  37.4  ASDD01  17  2847  2740  Scaffold
Enterococcus faecalis ATCC 35038  20137  2.97624  37.5  ASDE01  10  2943  2820  Scaffold
Enterococcus faecalis EnGen0364  20137  2.9899  37.4  ASDF01  13  2883  2758  Scaffold
Enterococcus faecalis EnGen0350  20137  2.99809  37.6  ASDG01  5  2966  2828  Scaffold
Enterococcus faecalis EnGen0336  20137  2.87038  37.5  ASDH01  3  2802  2707  Scaffold
Enterococcus faecalis EnGen0351  20137  2.99194  37.5  ASDI01  3  2967  2829  Scaffold
Enterococcus faecalis EnGen0352  20137  2.78339  37.7  ASDJ01  4  2694  2543  Scaffold
Enterococcus faecalis EnGen0337  20137  2.92663  37.5  ASDK01  5  2877  2769  Scaffold
Enterococcus faecalis EnGen0365  20137  3.08631  37.6  ASDL01  11  3036  2875  Scaffold

Enterococcus faecalis EnGen0342  20137  3.00797  37.5  ASDM01  9  2961  2823  Scaffold
Enterococcus faecalis EnGen0354  20137  3.16594  37.3  ASDN01  22  3223  3089  Scaffold
Enterococcus faecalis EnGen0355  20137  3.03417  37.4  ASDO01  9  3051  2911  Scaffold
Enterococcus faecalis EnGen0369  20137  3.12208  37.5  ASDP01  25  3175  2993  Scaffold
Enterococcus faecalis EnGen0356  20137  2.96671  37.4  ASDQ01  7  2877  2764  Scaffold
Enterococcus faecalis EnGen0357  20137  2.79183  37.6  ASDR01  8  2688  2593  Scaffold
Enterococcus faecalis EnGen0358  20137  3.06395  37.3  ASDS01  20  3033  2913  Scaffold
Enterococcus faecalis EnGen0370  20137  3.12235  37.4  ASDT01  13  3120  2964  Scaffold
Enterococcus faecalis EnGen0368  20137  2.99652  37.7  ASDU01  7  2932  2791  Scaffold
Enterococcus faecalis EnGen0359  20137  3.34154  37.1  ASDV01  12  3366  3220  Scaffold
Enterococcus faecalis EnGen0360  20137  2.9269  37.4  ASDW01  7  2912  2803  Scaffold
Enterococcus faecalis EnGen0340  20137  2.83322  37.4  ASDX01  5  2775  2677  Scaffold
Enterococcus faecalis EnGen0367  20137  2.89225  37.6  ASDY01  4  2883  2728  Scaffold
Enterococcus faecalis ATCC 6055  20137  3.22746  37.2  ASDZ01  31  3266  3125  Scaffold
Enterococcus faecalis ATCC 10100  20137  3.00495  37.4  ASEA01  6  2949  2845  Scaffold
Enterococcus faecalis EnGen0338  20137  3.21273  37.3  ASED01  32  3227  3083  Scaffold
Enterococcus faecalis EnGen0339  20137  3.26989  37.3  ASEE01  31  3338  3198  Scaffold
Enterococcus faecalis EnGen0327  20137  3.08627  37.5  ASEF01  8  3066  2954  Scaffold
Enterococcus faecalis EnGen0331  20137  3.04414  37.3  ASEG01  27  3032  2920  Scaffold
Enterococcus faecalis EnGen0329  20137  3.04724  37.4  ASEH01  17  3030  2929  Scaffold
Enterococcus faecalis EnGen0326  20137  2.97125  37.6  ASEI01  8  2921  2817  Scaffold
Enterococcus faecalis EnGen0334  20137  2.96343  37.4  ASEJ01  5  2936  2837  Scaffold
Enterococcus faecalis EnGen0333  20137  2.91268  37.6  ASEK01  5  2893  2775  Scaffold
Enterococcus faecalis EnGen0328  20137  2.95516  37.6  ASEL01  5  2977  2838  Scaffold
Enterococcus faecalis EnGen0330  20137  2.97256  37.3  ASEM01  13  2977  2840  Scaffold
Enterococcus faecalis EnGen0234  20137  3.02121  37.5  AJAC01  19  3006  2875  Scaffold
Enterococcus faecalis EnGen0235  20137  3.37502  37.1  AJAG01  33  3419  3266  Scaffold
Enterococcus faecalis EnGen0237  20137  3.14517  37.3  AJAW01  29  3182  3054  Scaffold
Enterococcus faecalis EnGen0238  20137  3.19631  37.5  AJAX01  27  3217  3112  Scaffold
Enterococcus faecalis EnGen0239  20137  3.2006  37.6  AJAY01  13  3260  3123  Scaffold

Enterococcus faecalis EnGen0240  20137  2.95877  37.6  AJAZ01  6  2923  2797  Scaffold
Enterococcus faecalis EnGen0241  20137  2.89056  37.6  AJBA01  4  2863  2743  Scaffold
Enterococcus faecalis EnGen0242  20137  3.32966  37.5  AJBB01  25  3409  3251  Scaffold
Enterococcus faecalis EnGen0243  20137  3.19702  37.2  AJBC01  45  3241  3128  Scaffold
Enterococcus faecalis EnGen0244  20137  3.22598  37.3  AJBD01  11  3230  3090  Scaffold
Enterococcus faecalis EnGen0245  20137  3.08671  37.3  AJBE01  8  3051  2916  Scaffold
Enterococcus faecalis EnGen0246  20137  3.18699  37.4  AJBF01  10  3184  3044  Scaffold
Enterococcus faecalis EnGen0247  20137  3.28481  37.2  AJBG01  34  3318  3146  Scaffold
Enterococcus faecalis EnGen0248  20137  3.1999  37.3  AJBH01  21  3215  3082  Scaffold
Enterococcus faecalis EnGen0252  20137  3.22673  37.3  AJBI01  20  3238  3119  Scaffold
Enterococcus faecalis EnGen0251  20137  3.28353  37.1  AJBJ01  30  3313  3148  Scaffold
Enterococcus faecalis EnGen0231  20137  3.28118  37.1  AJBK01  44  3361  3220  Scaffold
Enterococcus faecalis EnGen0249  20137  3.03057  37.3  AJBL01  6  2969  2846  Scaffold
Enterococcus faecalis EnGen0250  20137  2.8202  37.7  AJBM01  3  2741  2640  Scaffold
Enterococcus faecalis EnGen0299  20137  2.93373  37.6  AJDH01  11  2910  2771  Scaffold
Enterococcus faecalis EnGen0301  20137  3.12761  37.1  AJDK01  10  3070  2917  Scaffold
Enterococcus faecalis EnGen0297  20137  3.11169  37.3  AJDY01  36  3113  2972  Scaffold
Enterococcus faecalis EnGen0310 = MMH594  20137  3.25559  37.1  AJDZ01  25  3335  3208  Scaffold
Enterococcus faecalis EnGen0294  20137  3.20317  37.2  AJEA01  16  3214  3095  Scaffold
Enterococcus faecalis EnGen0307  20137  3.18991  37.3  AJEB01  19  3224  3087  Scaffold
Enterococcus faecalis EnGen0280  20137  3.24669  37.3  AJEC01  16  3271  3152  Scaffold
Enterococcus faecalis EnGen0303  20137  3.16163  37.3  AJED01  18  3140  3000  Scaffold
Enterococcus faecalis EnGen0298  20137  3.2621  37.1  AJEE01  35  3293  3157  Scaffold
Enterococcus faecalis EnGen0311  20137  3.25099  37.2  AJEF01  26  3287  3112  Scaffold
Enterococcus faecalis EnGen0302  20137  3.19619  37.4  AJEG01  6  3194  3049  Scaffold
Enterococcus faecalis EnGen0306  20137  3.22235  37.3  AJEH01  20  3230  3089  Scaffold
Enterococcus faecalis EnGen0291  20137  2.87252  37.5  AJEI01  11  2862  2778  Scaffold
Enterococcus faecalis EnGen0282  20137  2.77535  37.7  AJEJ01  3  2710  2610  Scaffold
Enterococcus faecalis ATCC 29200  20137  3.01458  37.5  AJEK01  15  2991  2862  Scaffold

Enterococcus faecalis EnGen0279  20137  2.8858  37.7  AJEL01  9  2848  2721  Scaffold
Enterococcus faecalis EnGen0304  20137  2.81699  37.7  AJEM01  3  2780  2672  Scaffold
Enterococcus faecalis EnGen0281  20137  2.95549  37.6  AJEN01  4  2953  2830  Scaffold
Enterococcus faecalis EnGen0287  20137  3.03033  37.3  AJEO01  5  2987  2863  Scaffold
Enterococcus faecalis EnGen0300  20137  3.26233  37.1  AJEP01  53  3317  3165  Scaffold
Enterococcus faecalis EnGen0295  20137  2.95886  37.5  AJEQ01  4  2920  2809  Scaffold
Enterococcus faecalis EnGen0284  20137  3.01755  37.2  AJES01  8  2965  2844  Scaffold
Enterococcus faecalis EnGen0293  20137  3.10547  37.4  AJEU01  16  3097  2966  Scaffold
Enterococcus faecalis ATCC 27275  20137  3.00114  37.4  AJEW01  8  2959  2855  Scaffold
Enterococcus faecalis ATCC 27959  20137  2.94393  37.6  AJEX01  2  2887  2805  Scaffold
Enterococcus faecalis EnGen0289  20137  2.91996  37.5  AJEY01  10  2860  2749  Scaffold
Enterococcus faecalis EnGen0285  20137  3.153  37.4  AJEZ01  21  3143  3015  Scaffold
Enterococcus faecalis EnGen0335  20137  3.22406  37.6  ASEO01  18  3249  3090  Scaffold
Enterococcus faecalis EnGen0290  20137  3.06524  37.5  AJEV01  12  3035  2909  Scaffold
Enterococcus faecalis EnGen0283  20137  3.09784  37.4  AJER01  13  3065  2950  Scaffold
Enterococcus faecalis EnGen0194  20137  3.33091  37.1  AIPS01  24  3414  3283  Scaffold
Enterococcus faecalis EnGen0195  20137  3.25511  37.1  AIPT01  28  3325  3204  Scaffold
Enterococcus faecalis EnGen0196  20137  3.28873  37.1  AIPU01  39  3372  3240  Scaffold
Enterococcus faecalis EnGen0197  20137  3.29134  37.1  AIQG01  19  3364  3243  Scaffold
Enterococcus faecalis EnGen0198  20137  2.93852  37.2  AIQI01  15  2907  2782  Scaffold
Enterococcus faecalis EnGen0199  20137  3.25567  37.2  AIQJ01  21  3314  3199  Scaffold
Enterococcus faecalis EnGen0200  20137  3.3291  37.1  AIQK01  21  3402  3281  Scaffold
Enterococcus faecalis EnGen0201  20137  3.27664  37.1  AIQM01  37  3365  3214  Scaffold
Enterococcus faecalis EnGen0202  20137  3.29313  37.1  AIQO01  40  3378  3237  Scaffold
Enterococcus faecalis EnGen0203  20137  3.28563  37.1  AIQU01  43  3330  3215  Scaffold
Enterococcus faecalis EnGen0204  20137  3.28749  37.1  AIQV01  37  3369  3221  Scaffold
Enterococcus faecalis EnGen0207  20137  3.33457  37.1  AIQW01  22  3412  3267  Scaffold
Enterococcus faecalis EnGen0205  20137  3.29416  37.1  AIQX01  35  3395  3249  Scaffold
Enterococcus faecalis EnGen0228  20137  3.27309  37.1  AIQY01  34  3344  3224  Scaffold
Enterococcus faecalis EnGen0206  20137  3.27998  37.1  AIQZ01  40  3359  3227  Scaffold

Enterococcus faecalis EnGen0374  20137  3.28712  37.1  AIRA01  42  3363  3235  Scaffold
Enterococcus faecalis EnGen0208  20137  3.20687  37.2  AIRB01  22  3281  3166  Scaffold
Enterococcus faecalis EnGen0209  20137  3.28553  37.1  AIRF01  41  3364  3217  Scaffold
Enterococcus faecalis EnGen0210  20137  3.29711  37.1  AIRH01  42  3394  3238  Scaffold
Enterococcus faecalis EnGen0211  20137  3.29851  37.1  AIRI01  26  3389  3257  Scaffold
Enterococcus faecalis EnGen0212  20137  3.129  37.3  AIRJ01  12  3101  2971  Scaffold
Enterococcus faecalis EnGen0213  20137  3.29385  37.2  AIRK01  39  3379  3225  Scaffold
Enterococcus faecalis EnGen0214  20137  3.23806  37.2  AIRL01  31  3304  3177  Scaffold
Enterococcus faecalis EnGen0215  20137  3.28286  37.2  AIRM01  25  3376  3225  Scaffold
Enterococcus faecalis EnGen0216  20137  3.338  37.1  AIRN01  41  3427  3303  Scaffold
Enterococcus faecalis EnGen0217  20137  3.30573  37.1  AIRO01  35  3392  3246  Scaffold
Enterococcus faecalis EnGen0218  20137  3.28772  37.1  AIRP01  39  3377  3238  Scaffold
Enterococcus faecalis EnGen0219  20137  3.29298  37.1  AIRQ01  41  3364  3247  Scaffold
Enterococcus faecalis EnGen0220  20137  3.31955  37  AIRR01  45  3428  3292  Scaffold
Enterococcus faecalis EnGen0221  20137  3.24615  37.1  AIRS01  27  3325  3204  Scaffold
Enterococcus faecalis EnGen0222  20137  3.31696  37.1  AIRT01  36  3408  3264  Scaffold
Enterococcus faecalis EnGen0223  20137  3.24253  37.1  AIRU01  26  3301  3189  Scaffold
Enterococcus faecalis EnGen0224  20137  3.30123  37.2  AIRV01  33  3366  3229  Scaffold
Enterococcus faecalis EnGen0225  20137  3.33174  37.1  AIRW01  38  3421  3293  Scaffold
Enterococcus faecalis EnGen0226  20137  3.08939  37.3  AIRX01  20  3136  3023  Scaffold
Enterococcus faecalis EnGen0232  20137  3.20746  37.3  AIZS01  40  3235  3079  Scaffold
Enterococcus faecalis EnGen0233  20137  2.95195  37.5  AIZW01  24  2928  2799  Scaffold
Enterococcus faecalis V583  20137  3.3298  37.4  AHYN01  9  3411  3288  Scaffold
Enterococcus faecalis V583  20137  3.36431  37.4  ASWP01  9  3453  3329  Scaffold
Enterococcus faecalis KI-6-1-110608-1  20137  2.63813  37.5  ATIE01  35  2532  2466  Scaffold
Enterococcus faecalis 02-MB-P-10  20137  2.90479  37.2  ATIF01  78  2819  2707  Scaffold
Enterococcus faecalis 20-SD-BW-06  20137  2.79151  37.5  ATIG01  40  2710  2639  Scaffold
Enterococcus faecalis 02-MB-BW-10  20137  3.05362  37.1  ATIH01  157  3065  2922  Scaffold
Enterococcus faecalis D811610-10  20137  2.72052  37.6  ATII01  32  2616  2549  Scaffold
Enterococcus faecalis B83616-1  20137  2.6564  37.7  ATIJ01  51  2582  2499  Scaffold

Enterococcus faecalis 06-MB-S-10  20137  3.01378  37.2  ATIK01  99  2986  2883  Scaffold
Enterococcus faecalis 06-MB-S-04  20137  3.04071  37.2  ATIL01  120  3005  2901  Scaffold
Enterococcus faecalis F01966  20137  2.93243  37.2  ATIN01  120  2907  2790  Scaffold
Enterococcus faecalis 20-SD-BW-08  20137  2.78967  37.5  ATIP01  40  2710  2639  Scaffold
Enterococcus faecalis 20.SD.W.06  20137  2.84403  37.3  ATIQ01  66  2766  2674  Scaffold
Enterococcus faecalis RP2S-4  20137  3.00368  37.2  ATIR01  84  2929  2828  Scaffold
Enterococcus faecalis WKS-26-18-2  20137  2.89154  37.4  ATIY01  124  2885  2765  Scaffold
Enterococcus faecalis VC1B-1  20137  2.877  37.3  ATIZ01  54  2826  2739  Scaffold
Enterococcus faecalis UP2S-6  20137  2.85914  37.4  ATJA01  72  2816  2718  Scaffold
Enterococcus faecalis SLO2C-1  20137  2.78087  37.5  ATJB01  38  2694  2626  Scaffold
Enterococcus faecalis LA3B-2  20137  2.88771  37.3  ATJC01  105  2889  2732  Scaffold
Enterococcus faecalis BM4654  20137  3.50898  37.8  AXOG01  23  3640  3500  Scaffold
Enterococcus faecalis BM4539  20137  3.06681  37.9  AXOH01  5  3020  2881  Scaffold
Enterococcus faecalis JH2-2  20137  2.89927  37.6  AXOI01  2  2873  2742  Scaffold
Enterococcus faecalis EnGen0400  20137  2.92181  37.7  JAHF01  13  2857  2733  Scaffold
Enterococcus faecalis EnGen0401  20137  3.07048  37.4  JAHG01  17  3074  2933  Scaffold
Enterococcus faecalis EnGen0402  20137  2.94995  37.6  JAHH01  4  2912  2795  Scaffold
Enterococcus faecalis EnGen0403  20137  3.13741  37.3  JAHI01  6  3187  3061  Scaffold
Enterococcus faecalis EnGen0404  20137  3.1508  37.3  JAHJ01  8  3204  3069  Scaffold
Enterococcus faecalis EnGen0405  20137  3.13423  37.2  JAHK01  5  3172  3065  Scaffold
Enterococcus faecalis EnGen0406  20137  3.07305  37.3  JAHL01  9  3119  2997  Scaffold
Enterococcus faecalis EnGen0407  20137  2.73747  37.8  JAHM01  2  2653  2571  Scaffold
Enterococcus faecalis EnGen0408  20137  3.14707  37.3  JAHN01  7  3179  3073  Scaffold
Enterococcus faecalis EnGen0409  20137  2.8749  37.7  JAHO01  4  2833  2703  Scaffold
Enterococcus faecalis EnGen0410  20137  3.17517  37.2  JAHP01  4  3231  3087  Scaffold
Enterococcus faecalis EnGen0411  20137  2.91171  37.6  JAHQ01  6  2915  2785  Scaffold
Enterococcus faecalis EnGen0412  20137  3.06507  37.3  JAHR01  4  3080  2974  Scaffold
Enterococcus faecalis EnGen0413  20137  2.77793  37.7  JAHS01  5  2686  2580  Scaffold
Enterococcus faecalis EnGen0414  20137  2.73352  37.6  JAHT01  4  2660  2567  Scaffold
Enterococcus faecalis EnGen0415  20137  3.10143  37.4  JAHU01  7  3134  3012  Scaffold

Enterococcus faecalis EnGen0416  20137  2.98094  37.5  JAHV01  9  2971  2846  Scaffold
Enterococcus faecalis EnGen0417  20137  3.02228  37.4  JAHW01  11  3007  2870  Scaffold
Enterococcus faecalis EnGen0418  20137  3.02007  37.3  JAHX01  4  2997  2898  Scaffold
Enterococcus faecalis EnGen0419  20137  2.73807  37.8  JAHY01  5  2667  2571  Scaffold
Enterococcus faecalis EnGen0420  20137  2.73455  37.8  JAHZ01  1  2649  2570  Scaffold
Enterococcus faecalis EnGen0421  20137  3.0112  37.3  JAIA01  8  2980  2878  Scaffold
Enterococcus faecalis EnGen0422  20137  3.26073  37.4  JAIB01  14  3266  3129  Scaffold
Enterococcus faecalis EnGen0423  20137  2.83241  37.7  JAIC01  2  2765  2668  Scaffold
Enterococcus faecalis EnGen0424  20137  2.72258  37.7  JAID01  1  2650  2555  Scaffold
Enterococcus faecalis EnGen0425  20137  2.97597  37.3  JAIE01  7  2973  2856  Scaffold
Enterococcus faecalis EnGen0426  20137  3.00838  37.4  JAIF01  3  3024  2911  Scaffold
Enterococcus faecalis EnGen0427  20137  3.32994  37.1  JAIG01  16  3415  3287  Scaffold
Enterococcus faecalis 918  20137  3.31812  37.1  AVNY01  111  3452  3306  Scaffold
Enterococcus faecalis Efa HS0914  20137  2.81732  37.3  JPDQ01  15  2770  2682  Scaffold
Enterococcus faecalis  20137  3.05163  37.1  LKGS01  966  3193  2559  Scaffold
Enterococcus faecalis TUSoD Ef11  20137  2.83665  37.7  ACOX02  11  2808  2683  Contig
Enterococcus faecalis PC1.1  20137  2.75392  37.6  ADKN01  79  2688  2614  Contig
Enterococcus faecalis OG1X  20137  2.73907  37.7  AFHH01  78  2640  2516  Contig
Enterococcus faecalis M7  20137  2.7332  37.7  AGVN01  77  2634  2481  Contig
Enterococcus faecalis 10244  20137  3.11229  37.339  ASWX01  79  3122  3006  Contig
Enterococcus faecalis E12  20137  2.98348  37.2  AWPI01  117  3022  2898  Contig
Enterococcus faecalis EnGen0286  20137  2.92362  37.5  AJET01  12  2882  2794  Contig
Enterococcus faecalis MA1  20137  2.96214  37.4  ANMP01  74  2977  2878  Contig
Enterococcus faecalis AZ19  20137  2.93529  37.3  AYLU01  98  2896  2813  Contig
Enterococcus faecalis FL2  20137  2.66569  37.6  AYKK01  119  2617  2532  Contig
Enterococcus faecalis GA2  20137  2.67756  37.6  AYKL01  50  2588  2549  Contig
Enterococcus faecalis GAN13  20137  2.84692  37.4  AYLV01  92  2786  2693  Contig
Enterococcus faecalis KS19  20137  2.74034  37.5  AYND01  94  2650  2574  Contig
Enterococcus faecalis MD6  20137  2.73319  37.5  AYLN01  102  2669  2585  Contig
Enterococcus faecalis MN16  20137  2.83359  37.4  AYKM01  63  2745  2684  Contig

Enterococcus faecalis MTmid8  20137  2.69046  37.6  AYKU01  61  2595  2538  Contig
Enterococcus faecalis MTUP9  20137  2.9734  37.2  AYOJ01  66  2905  2804  Contig
Enterococcus faecalis NJ44  20137  2.91373  37.2  AYOK01  129  2870  2775  Contig
Enterococcus faecalis NY9  20137  2.98386  37  AYOL01  113  2924  2817  Contig
Enterococcus faecalis  20137  2.96543  37.2  JPWN01  13  2956  2854  Contig
Enterococcus faecalis  20137  3.01051  37.3  JPTY01  37  2991  2879  Contig
Enterococcus faecalis  20137  2.97906  37.4  JPTZ01  40  2999  2887  Contig
Enterococcus faecalis  20137  2.94666  37.3  JQHD01  81  2927  2843  Contig
Enterococcus faecalis  20137  3.23297  37.3  JSES01  70  3330  3209  Contig
Enterococcus faecalis EnGen0310  20137  3.26068  37  AOPW01  104  3357  3246  Contig
Enterococcus faecalis JH2-2  20137  2.86463  37.6  CAWH01  172  2826  2618  Contig
Enterococcus faecalis  20137  3.05351  36.9  JWBU01  152  2984  2886  Contig
Enterococcus faecalis  20137  3.15503  37.1  JWAR01  97  3186  3067  Contig
Enterococcus faecalis  20137  3.14873  37.2  JVXY01  172  3189  3070  Contig
Enterococcus faecalis  20137  3.18577  37.2  JVXC01  143  3218  3109  Contig
Enterococcus faecalis  20137  3.2219  37.3  JVUK01  117  3252  3149  Contig
Enterococcus faecalis  20137  2.99657  37.5  JVSW01  88  2965  2843  Contig
Enterococcus faecalis  20137  2.92289  37.3  JVZS01  114  2842  2774  Contig
Enterococcus faecalis  20137  3.08145  37.1  JVZM01  137  3045  2929  Contig
Enterococcus faecalis  20137  2.8754  37.5  JVVS01  51  2821  2732  Contig
Enterococcus faecalis  20137  3.13653  37.2  JVTP01  186  3148  3022  Contig
Enterococcus faecalis  20137  2.96039  37.4  JVQP01  58  2912  2835  Contig
Enterococcus faecalis  20137  3.25221  37.2  JVKC01  96  3251  3110  Contig
Enterococcus faecalis  20137  2.97395  37.3  JVCH01  127  2872  2789  Contig
Enterococcus faecalis  20137  2.96736  37.4  JVQQ01  46  2927  2838  Contig
Enterococcus faecalis  20137  2.88711  37.5  JVOQ01  163  2842  2736  Contig
Enterococcus faecalis  20137  2.97325  37.3  JVOC01  161  2968  2895  Contig
Enterococcus faecalis  20137  3.09886  37.3  JVNY01  80  3069  2951  Contig
Enterococcus faecalis  20137  2.87829  37.5  JVBW01  51  2861  2732  Contig
Enterococcus faecalis  20137  2.88764  37.5  JVBV01  60  2856  2738  Contig

Enterococcus faecalis  20137  2.8841  37.4  JVAM01  155  2818  2735  Contig
Enterococcus faecalis  20137  3.01896  37.2  JUXV01  126  3014  2911  Contig
Enterococcus faecalis  20137  2.98925  37.2  JUXL01  188  2980  2886  Contig
Enterococcus faecalis  20137  3.01191  37.2  JUWK01  199  2974  2892  Contig
Enterococcus faecalis  20137  2.81577  37.4  JUVH01  150  2739  2662  Contig
Enterococcus faecalis  20137  3.18909  37.2  JUUM01  292  3226  3094  Contig
Enterococcus faecalis  20137  2.98227  37.2  JUUJ01  326  2984  2842  Contig
Enterococcus faecalis  20137  2.95563  37.2  JUQC01  182  2938  2843  Contig
Enterococcus faecalis  20137  2.79543  37.5  JUPP01  171  2707  2605  Contig
Enterococcus faecalis  20137  2.98506  37.2  JUXT01  225  2987  2871  Contig
Enterococcus faecalis  20137  3.00691  37.4  JUXC01  223  3010  2894  Contig
Enterococcus faecalis  20137  2.99763  37.2  JUVP01  147  2982  2875  Contig
Enterococcus faecalis  20137  2.98645  37.2  JUVA01  261  2991  2850  Contig
Enterococcus faecalis  20137  3.19095  37.1  JUPR01  307  3246  3095  Contig
Enterococcus faecalis  20137  2.86108  37.4  JUOO01  187  2863  2763  Contig
Enterococcus faecalis  20137  3.04471  37.3  JUMK01  141  3000  2929  Contig
Enterococcus faecalis  20137  2.95477  37.3  JUOP01  215  2927  2821  Contig
Enterococcus faecalis  20137  3.20197  37.1  JUNN01  258  3264  3121  Contig
Enterococcus faecalis  20137  3.03497  37.3  JUMJ01  140  2982  2894  Contig
Enterococcus faecalis  20137  3.05681  37.2  JULA01  129  3047  2943  Contig
Enterococcus faecalis  20137  2.76372  37.5  JUNQ01  113  2675  2609  Contig
Enterococcus faecalis  20137  3.17273  37.1  JUNL01  472  3292  3061  Contig
Enterococcus faecalis  20137  2.91574  37.3  LAEB01  18  2892  2804  Contig
Enterococcus faecalis  20137  3.14674  37.1  LKGR01  42  3109  2975  Contig
Enterococcus faecalis NBRC 100480  20137  2.83321  37.5  BCQC01  37  2824  2729  Contig
Enterococcus faecalis ATCC 29212  20137  3.01126  37.3  MTFY01  49  3098  2959  Contig
Enterococcus faecalis ATCC 29212  20137  3.09625  37.2  FPDW01  55  3219  3051  Contig
Enterococcus faecalis ATCC 29212  20137  3.25366  37  FPDZ01  48  3327  3095  Contig
Enterococcus faecalis ATCC 29212  20137  2.97208  37.3  FPEB01  40  3050  2863  Contig
Enterococcus faecalis ATCC 29212  20137  2.89971  37.4  FPDY01  56  2981  2811  Contig

Enterococcus faecalis ATCC 29212  20137  2.95394  37.3  FPEA01  44  3061  2886  Contig
Enterococcus faecalis ATCC 29212  20137  2.76049  37.5  FPEC01  15  2766  2663  Contig
Enterococcus faecalis  20137  2.89672  37.4  JTKT01  57  2838  2746  Scaffold
Enterococcus faecalis  20137  2.96923  37.3  JTKW01  34  2941  2832  Scaffold
Enterococcus faecalis  20137  2.76281  37.6  JTKS01  24  2687  2592  Scaffold
Enterococcus faecalis  20137  2.78667  37.5  JTKU01  20  2692  2611  Scaffold
Enterococcus faecalis  20137  2.88866  37.4  JTKV01  103  2834  2749  Scaffold
Enterococcus faecalis  20137  3.03771  37.2  JTKX01  89  2985  2853  Scaffold
Enterococcus faecalis  20137  2.89236  37.5  JWAW01  90  2809  2720  Scaffold
Enterococcus faecalis  20137  2.84673  37.4  JVYW01  131  2793  2713  Scaffold
Enterococcus faecalis  20137  3.3048  37.2  JVTX01  110  3330  3206  Scaffold
Enterococcus faecalis  20137  3.30387  37.2  JVTK01  111  3325  3203  Scaffold
Enterococcus faecalis  20137  3.20343  37.3  JVTG01  95  3221  3082  Scaffold
Enterococcus faecalis  20137  2.98751  37.4  JVSV01  73  2990  2881  Scaffold
Enterococcus faecalis  20137  2.96521  37.4  JVQS01  54  2928  2839  Scaffold
Enterococcus faecalis  20137  3.11136  37.1  JVPV01  388  3162  2959  Scaffold
Enterococcus faecalis  20137  2.79238  37.5  JVPG01  140  2727  2651  Scaffold
Enterococcus faecalis  20137  3.12153  37.3  JVOF01  95  3117  3019  Scaffold
Enterococcus faecalis  20137  2.98897  37.3  JVOA01  145  2990  2911  Scaffold
Enterococcus faecalis  20137  2.98429  37.3  JVIY01  133  2983  2901  Scaffold
Enterococcus faecalis  20137  2.95914  37.3  JVIK01  103  2957  2878  Scaffold
Enterococcus faecalis  20137  3.26723  37.1  JVGB01  101  3306  3188  Scaffold
Enterococcus faecalis  20137  3.01627  37.3  JVBG01  176  2989  2877  Scaffold
Enterococcus faecalis  20137  2.80974  37.3  JVAI01  107  2724  2652  Scaffold
Enterococcus faecalis  20137  3.01936  37.2  JVAD01  130  3017  2914  Scaffold
Enterococcus faecalis  20137  2.99707  37.3  JVQY01  105  2995  2876  Scaffold
Enterococcus faecalis  20137  2.96057  37.4  JVQT01  64  2902  2835  Scaffold
Enterococcus faecalis  20137  2.96191  37.4  JVQF01  57  2913  2831  Scaffold
Enterococcus faecalis  20137  3.01209  37.3  JVPS01  114  3013  2934  Scaffold
Enterococcus faecalis  20137  2.89199  37.3  JVPJ01  74  2823  2748  Scaffold

186 Enterococcus faecalis 20137 3.02318 37.3 JVOG01 125 3043 2934 Scaffold Enterococcus faecalis 20137 2.95885 37.2 JVOB01 236 2985 2878 Scaffold Enterococcus faecalis 20137 3.01222 37.3 JVJB01 89 3034 2934 Scaffold Enterococcus faecalis 20137 3.07533 37.1 JVID01 70 3070 2947 Scaffold Enterococcus faecalis 20137 3.19108 37.1 JVHL01 101 3216 3059 Scaffold Enterococcus faecalis 20137 3.23003 37.2 JVDH01 161 3311 3162 Scaffold Enterococcus faecalis 20137 2.85333 37.5 JVBD01 104 2778 2705 Scaffold Enterococcus faecalis 20137 3.00262 37.2 JVAN01 230 3041 2900 Scaffold Enterococcus faecalis 20137 2.88817 37.4 JUZT01 140 2814 2731 Scaffold Enterococcus faecalis 20137 2.97827 37.2 JUYS01 248 2968 2814 Scaffold Enterococcus faecalis 20137 3.05917 37.1 JUXZ01 106 3064 2932 Scaffold Enterococcus faecalis 20137 3.07883 37.2 JUWE01 137 3098 2993 Scaffold Enterococcus faecalis 20137 2.83672 37.6 LSFS01 31 2718 2630 Scaffold Enterococcus faecalis 20137 2.92998 37.6 LQAM01 14 2912 2795 Scaffold Enterococcus faecalis TX1467 20137 3.02628 37.1 AFBS01 126 3557 3510 Scaffold Enterococcus faecalis ATCC 29212 20137 3.02706 37.2 ALOD01 126 2443 2347 Contig Enterococcus faecalis CBRD01 20137 2.81317 37.5 AWYG01 140 1887 1874 Contig Enterococcus faecalis PF3 20137 3.21386 37.5 AZIA01 397 3364 3173 Contig Enterococcus faecalis DORA_14 20137 2.96586 37.2991 AZLY01 50 2911 2911 Contig Enterococcus faecalis 20137 3.10643 37.5 JMEC01 38 3051 2734 Contig


Conclusion

We extracted a total of 447 genomes of E. faecalis, including six genomes of E. faecalis sequenced in Marseille, and 407 genomes of E. faecium from the NCBI/GenBank database. The analysis of genome features indicates a statistically significant difference between E. faecalis and E. faecium. Also, more genomic recombination hotspots were detected in E. faecalis than in E. faecium, with a statistically significant difference (p-value < 10⁻⁵), and the majority occurred in branches with mixtures of strains from environments other than those specific to humans. The phylogenetic network analysis revealed that, in E. faecalis, the environment and animals might be contaminated by human waste, with low aggregation with human strains inside the two network groups identified. On the other hand, in E. faecium, non-human strains are mostly located inside networks with high aggregation with human strains, and there is evidence of direct interaction between human and non-human strains, with one large emerging cluster containing a mixture of human and non-human strains. This finding suggests a higher zoonotic dissemination and transmission of E. faecium than of E. faecalis. Also, we found a positive association between the number of CRISPR spacers detected per genome and the number of recombination hotspots detected in E. faecalis. The anti-endonuclease genes ardA were mainly found in E. faecium, with low density where the genome recombination is high. Most importantly, in both species we detected vancomycin resistance genes mostly in the genomes where a CRISPR system is absent. Also, endonuclease genes, including types I, II and III, were found in both species, with a slight increase in E. faecium. However, anti-endonuclease genes (ardA) were absent in E. faecalis, while massively present in more than 90% of the analysed genomes of E. faecium. The presence of a CRISPR system in the genome of E. faecium decreased by 0.77 times the acquisition of vancomycin resistance genes. The number of recombination hotspots detected in the genome of both species decreased by 0.02 times the acquisition of vancomycin resistance genes. We found that there is a direct association between the absence of CRISPR spacers, the presence of anti-endonuclease genes (ardA) and the acquisition of vancomycin resistance in E. faecium.

In summary, this study shows that extensive genomic recombination has occurred in E. faecalis due to mobile genetic elements capable of inducing adaptive immunity, with the acquisition of a CRISPR system protecting E. faecalis from acquiring external DNA sequences carrying the vancomycin resistance genes vanA and vanB. This observation correlates with the reduced number of CRISPR systems found in E. faecium and the substantial number of anti-endonuclease ardA genes and vancomycin resistance genes found in that species.

The emergence and dissemination of E. faecium infection may be due to zoonotic transmission, and the misuse of antibiotics (avoparcin) may have selected for emerging vancomycin resistance in enterococci. This finding explains why E. faecium, rather than E. faecalis, is more frequently reported worldwide as a vancomycin-resistant Enterococcus.


Chapter III: Taxono-genomic


Introduction

Taxono-genomics is a taxonomic approach that takes into account recent developments in prokaryote genome sequencing and incorporates the recent proposal that the genome sequence and proteome must be part of the description of microbes of medical interest [1]. 16S rRNA gene sequencing, commonly used for the phylogenetic taxonomy of bacteria, has recently shown its limitations for the Enterobacteriaceae family [2]. Multiple gene-based phylogenies were later introduced and have frequently been used to overcome this limitation. The method currently applied in prokaryotic taxonomy is Multilocus Sequence Analysis (MLSA). Fragments of several protein-coding genes are sequenced (four housekeeping genes: ATP synthase F1 β-subunit, atpD; DNA gyrase β-subunit, gyrB; RNA polymerase β-subunit, rpoB; and translation initiation factor IF-2, infB) and subsequently used to calculate a taxonomic phylogenetic tree. A phylogenetic tree based on the concatenated aligned sequences is known to reflect the “true” relationship of bacterial taxa [3]. However, high genome plasticity makes taxonomy difficult in the Enterobacteriaceae family. Recent advances in whole genome sequencing (WGS) have enabled accurate classification of this family. WGS enables efficient estimation of bacterial G+C% content, the in silico DNA–DNA hybridization (DDH) value, the average genomic identity of orthologous gene sequences [4], the Average Nucleotide Identity (ANI) and the Average Amino acid Identity (AAI) [5–7]. The new member of the Enterobacteriaceae family that we describe was isolated from a scalp pustule, in association with Staphylococcus aureus, at Archet II Hospital, Nice, in the south-east of France. Based on the 16S rDNA sequence analysis, which places this strain in a distinct phylogenetic position within the Enterobacteriaceae family, we proposed it as the first and type species of a new genus named Nissabacter, close to the genera Serratia and Ewingella. However, since 16S rDNA analysis alone is known to be of limited value in supporting such a proposal, we conducted a full phenotypic and genomic description for an accurate taxonomy of this strain.
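Of the genome-derived metrics listed above, the G+C content is the simplest to compute directly from an assembly. The following is a minimal sketch, assuming the contigs are already available as plain strings; the function names are illustrative and ambiguous bases are simply ignored, which is a choice of this sketch and not a rule taken from the cited references.

```python
def gc_content(sequence: str) -> float:
    """Return the G+C percentage of a nucleotide sequence.

    Only unambiguous bases (A, C, G, T) are counted; anything else
    (N, IUPAC codes, gaps) is ignored. This is an assumption of the sketch.
    """
    seq = sequence.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = gc + at
    if total == 0:
        raise ValueError("sequence contains no unambiguous bases")
    return 100.0 * gc / total


def genome_gc_content(contigs: list[str]) -> float:
    """G+C percentage over a whole assembly, weighted by contig length."""
    return gc_content("".join(contigs))
```

Pooling the contigs before counting weights each contig by its length, so a few short contigs with unusual composition do not distort the genome-wide value.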


We published this work as Article V, entitled “‘Nissabacter archeti’, gen. nov, sp. nov., a new member of Enterobacteriaceae family, isolated from human pustule scalp at Archet 2 Hospital, Nice”, in New Microbes and New Infections, and as Article VI, “Description of ‘Nissabacter archeti’, gen. nov., sp. nov., isolated from human pustule scalp, which forms a distinct branch of the Enterobacteriaceae family and proposed as Nissabacter gen. nov.”, submitted to the International Journal of Systematic and Evolutionary Microbiology.


References

1. Fournier P-E, Drancourt M. New Microbes New Infections promotes modern prokaryotic taxonomy: a new section “TaxonoGenomics: new genomes of microorganisms in humans”. New Microbes New Infect. 2015;7:48–9.
2. Renvoisé A. Applicabilité de la PCR « universelle » 16S comme outil d’identification et de détection bactérienne en laboratoire hospitalier de bactériologie. 2012;5.
3. Glaeser SP, Kämpfer P. Multilocus sequence analysis (MLSA) in prokaryotic taxonomy. Syst. Appl. Microbiol. 2015;38:237–45.
4. Lan Y, Morrison JC, Hershberg R, Rosen GL. POGO-DB - A database of pairwise-comparisons of genomes and conserved orthologous genes. Nucleic Acids Res. 2014;42.
5. Kim M, Oh H-S, Park S-C, Chun J. Towards a taxonomic coherence between average nucleotide identity and 16S rRNA gene sequence similarity for species demarcation of prokaryotes. Int. J. Syst. Evol. Microbiol. 2014;64:346–51.
6. Van Belkum A, Struelens M, De Visser A, Verbrugh H, Tibayrenc M. Role of genomic typing in taxonomy, evolutionary genetics, and microbial epidemiology. Clin. Microbiol. Rev. 2001. p. 547–60.
7. Chun J, Rainey FA. Integrating genomics into the taxonomy and systematics of the Bacteria and Archaea. Int. J. Syst. Evol. Microbiol. 2014;64:316–24.


Article V

‘Nissabacter archeti’, gen. nov, sp. nov., a new member of Enterobacteriaceae family,

isolated from human pustule scalp at Archet 2 Hospital, Nice.

Kodjovi D. Mlaga, Romain Lotte, Henri Montaudié, Jean-Marc Rolain, Ruimy Raymond

New Microbes New Infections

NEW SPECIES

‘Nissabacter archeti’ gen. nov., sp. nov., a new member of Enterobacteriaceae family, isolated from human sample at Archet 2 Hospital, Nice, France

K. D. Mlaga1, R. Lotte2,3, H. Montaudié4, J.-M. Rolain1 and R. Ruimy2,3 1) URMITE, Université Aix-Marseille, UM63, CNRS 7278, IRD 198, INSERM 1095, IHU-Méditerranée Infection, Marseille, 2) Department of Microbiology, Nice Academic Hospital, and University Nice Côte d’Azur, 3) INSERM U1065 (C3M), Bacterial Toxins in Host–Pathogen Interactions, C3M, Bâtiment Universitaire Archimed and 4) Department of Dermatology, Nice Academic Hospital, and University Nice Côte d’Azur, Nice, France

Abstract

We propose the main characteristics of a new bacterium species named Nissabacter archeti strain 2134 (CSURP3445 = LT631518), isolated from pustule scalp of a 29-year-old man at hospital Archet 2, Nice, south of France. © 2017 The Authors. Published by Elsevier Ltd on behalf of European Society of Clinical Microbiology and Infectious Diseases.

Keywords: Enterobacteria, human infection, Nissabacter archeti, pustule, scalp, taxonomy Original Submission: 2 February 2017; Revised Submission: 16 February 2017; Accepted: 21 February 2017 Article published online: 1 March 2017

Corresponding author: R. Ruimy, Department of Microbiology, Nice Academic Hospital, and University Nice Côte d’Azur, F-06200 Nice, France
E-mail: [email protected]

New Microbe and New Infect 2017; 17: 81–83. © 2017 The Authors. Published by Elsevier Ltd on behalf of European Society of Clinical Microbiology and Infectious Diseases. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/). http://dx.doi.org/10.1016/j.nmni.2017.02.001

A Gram-negative bacterial strain 2134, CSURP3445, Enterobacteria, aeroanaerobic, isolated from pustule scalp of a 29-year-old man from Archet 2 Hospital, Nice, south of France, was studied for its taxonomy allocation. Initially this bacterium grew on 5% blood agar plate in an aerobic atmosphere at 37°C after 24 hours’ incubation. The strain also grew on Luria-Bertani agar at 37°C after 24 hours’ incubation. The colonies were whitish, circular and smooth, with a diameter of 0.5 to 1.0 mm. Identification by matrix-assisted laser desorption–ionization time-of-flight mass spectrometry (MALDI-TOF MS) screening on a MicroFlex spectrometer protein analysis (Bruker Daltonics, Leipzig, Germany) did not allow the identification of the strain 2134 to the genus level [1].

Consequently, the sequence of 16S rRNA gene was determined to specify its phylogenetic position. Briefly, PCR of 16S rRNA gene was performed using the fD1-rP2 primers as previously described [2] using Quantitec PCR System 2720 thermal cyclers (Applied Biosystems, Bedford, MA, USA). The PCR product was purified and sequenced using ABI Prism 3130xl Genetic Analyzer capillary sequencer (Applied Biosystems), which generated 1507 bp.

Under electron microscopy, individual cells have a slightly oval form with flagella and were 3.5 μm in length and 1.0 μm in diameter. The 16S rRNA gene of strain 2134 exhibited a 97.67% 16S rRNA gene sequence similarity with Ewingella americana strain CIP 81.94. Strain 2134 is closely related to two standing nomenclature species, Serratia rubidaea DSM 4480 and Ewingella americana strain CIP 81.94 (Fig. 1). Because the branches grouping strain 2134 with these two species are not supported by significant bootstrap values (>90%), we propose that the strain 2134 is representative of a new genus within the Enterobacteriaceae family.

Taxonomically, the bacterial family Enterobacteriaceae currently has 53 genera (and over 170 named species). Of these, 26 genera are known to be associated with infections in humans [3,4]. Members of the Enterobacteriaceae are small, Gram-negative, nonsporing straight rods. They are facultatively anaerobic, and most species grow well at 37°C, although some species grow better at 25 to 30°C [6]. Strain 2134 exhibited a 16S rRNA gene sequence similarity of 98.65% with a validly published name with standing in nomenclature [7]. We propose the new genus Nissabacter as a new member of Enterobacteriaceae family for the fact that it was first isolated from Nice, a south-east city of France, and the description of the first species of this genus, Nissabacter archeti, for the fact that it was isolated from human pustule scalp in the bacteriology laboratory of the Archet 2 Hospital.

FIG. 1. Phylogenetic position of strain 2134 within Enterobacteriaceae family by 16S rDNA sequence analysis. Unrooted neighbour-joining tree of concatenated 16S rDNA of 22 reference strains of representative members of Enterobacteriaceae family closely related to strain 2134 [3]. Plesiomonas shigelloides ATCC 14029 was used as outgroup. Values below lines are bootstrap values (1000 replicates) expressed in percent (only values greater than 90% are shown) [4]. Scale bar = accumulated changes per nucleotide. Analyses were conducted in MEGA7 [5].

Deposit in a culture collection

Strain Nissabacter archeti strain 2134 was deposited in the Collection de Souches de l’Unité des Rickettsies (CSUR, WDCM 875) under number P3445.

Nucleotide sequence accession number

The 16S rRNA gene sequence of the strain 2134 was deposited in GenBank under accession number LT631518 under the name Nissabacter archeti strain 2134.

Acknowledgements

This study was funded by the Fondation Méditerranée Infection. The authors thank F. Cadoret for administrative tasks.

Conflict of Interest

None declared.

References

[1] Seng P, Rolain JM, Fournier PE, La Scola B, Drancourt M, Raoult D. MALDI-TOF-mass spectrometry applications in clinical microbiology. Future Microbiol 2010;5:1733–54.
[2] Drancourt M, Bollet C, Carlioz A, Martelin R, Gayral JP, Raoult D. 16S ribosomal DNA sequence analysis of a large collection of environmental and clinical unidentifiable bacterial isolates. J Clin Microbiol 2000;38:3623–30.
[3] Khan F, Rizvi M, Shukla I, Malik A. A novel approach for identification of members of Enterobacteriaceae isolated from clinical samples. Biol Med 2011;3(2 special issue):313–9.
[4] Borman EK, Stuart CA, Wheeler KM. Taxonomy of the family Enterobacteriaceae. J Bacteriol 1944;48:351–67.
[5] Kumar S, Stecher G, Tamura K. MEGA7: molecular evolutionary genetics analysis version 7.0 for bigger datasets. Mol Biol Evol 2016;33:1870–4.
[6] Health Protection Agency. UK standards for microbiology investigations identification of Campylobacter species. 2013. p. 1–22. ID23.
[7] Kim M, Oh HS, Park SC, Chun J. Towards a taxonomic coherence between average nucleotide identity and 16S rRNA gene sequence similarity for species demarcation of prokaryotes. Int J Syst Evol Microbiol 2014;64(pt 2):346–51.


Article VI

Description of ‘Nissabacter archeti’, gen. nov., sp. nov., isolated from human pustule scalp, which forms a distinct branch of the Enterobacteriaceae family and proposed as

Nissabacter gen. nov.

Kodjovi D. Mlaga, Romain Lotte, Henri Montaudié, Jean-Marc Rolain, Ruimy Raymond

(Submitted in International Journal of Systematic and Evolutionary Microbiology) Impact factor: 2.79

International Journal of Systematic and Evolutionary Microbiology
Description of 'Nissabacter archeti', gen. nov., sp. nov., isolated from human pustule scalp, which forms a distinct branch within the Enterobacteriaceae family.
--Manuscript Draft--

Manuscript Number: IJSEM-D-17-00850
Full Title: Description of 'Nissabacter archeti', gen. nov., sp. nov., isolated from human pustule scalp, which forms a distinct branch within the Enterobacteriaceae family.
Article Type: Taxonomic Description
Section/Category: New taxa -
Keywords: Nissabacter archeti, Nice, pustule, scalp, taxono-genomic, Enterobacteriaceae
Corresponding Author: Raymond Ruimy, Laboratoire Bactériologie hôpital archet 2, FRANCE
First Author: Kodjovi Dodji Mlaga
Order of Authors: Kodjovi Dodji Mlaga, Romain Lotte, Henri Montaudié, Jean-Marc Rolain, Raymond Ruimy
Manuscript Region of Origin: FRANCE
Abstract: A novel, facultatively anaerobic, motile, Gram-negative, straight-rod strain, designated 2134 under the collection CSURP3445 = DSM 105398, was isolated from a human scalp pustule at the Archet 2 Hospital, Nice. The cells consist of long rods (approx. 3 - 3.5 µm in length and 1 µm wide) and produce whitish, circular (approx. 0.5 - 1 mm), mobile colonies. Strain 2134 was oxidase negative, catalase positive and urease. Acid is produced from D-glucose, citrate, D-mannitol, Inositol, D-sorbitol, L-Rhamnose, D-saccharose, D-Melibiose, D-amygdaline, L-arabinose and lactose, and ONPG reactions are positive. Arginine dihydrolase, lysine decarboxylase, ornithine decarboxylase, Indole and H2S are negative. The fatty acid profiles are characterised by large amounts of Hexadecanoic acid, 3-hydroxy-Tetradecanoic acid, 9-Hexadecenoic acid, 10-Heptadecenoic acid, Dodecanoic acid, 11-Octadecenoic acid, Tetradecanoic acid and Octadecanoic acid. Phylogenetic analyses by MLSA revealed the strain to be close to the genera Serratia, Ewingella, Rahnella and Gibbsiella. The genome size is 5,162,116 bp with a 58.34% G+C content assembled into 50 scaffolds containing 4695 genes, 3 repeat regions, 88 tRNAs, 12 rRNAs and 4594 coding sequences (CDSs). The genome contains three prophages, one intact and 2 questionable, six CRISPR systems (spacers and protein-associated) and 13 genomic pathogenic islands. The distinct phylogenetic position supports the proposal of Nissabacter gen. nov., with the type species Nissabacter archeti strain 2134.


Taxonomic Descriptions

TITLE: Description of ‘Nissabacter archeti’, gen. nov., sp. nov., isolated from human pustule scalp, which forms a distinct branch within the Enterobacteriaceae family.

Author list: Kodjovi D. Mlaga1, Romain Lotte2,3, Henri Montaudié4, Jean-Marc Rolain1, Raymond Ruimy*2,3

Affiliations: 1) URMITE, Aix-Marseille Université, UM63, CNRS 7278, IRD 198, INSERM 1095, IHU-Méditerranée Infection, 19-21 Boulevard Jean Moulin, 13385 Marseille Cedex 05, France. 2) Department of Microbiology, Nice Academic Hospital and University Nice Côte d’Azur, Nice, France. 3) INSERM U1065 (C3M), Bacterial Toxins in Host-Pathogen Interactions, Bâtiment Universitaire Archimed, 151 route Saint Antoine de Ginestière, BP 2 3194, 06204 Nice Cedex 3, France. 4) Department of Dermatology, Nice Academic Hospital and University Nice Côte d’Azur, Nice, France.

*Corresponding author: Prof. Raymond RUIMY, Head of Department, Laboratoire de Bactériologie, Centre Hospitalier Universitaire de Nice, Hôpital de l'Archet II, 151 Route de St Antoine de Ginestière, BP 3079, 06202 NICE Cedex3, France. Direct tel: +33 (0) 4 92 03 62 20, Secretariat tel: +33 (0) 4 92 03 62 14, Fax: +33 (0) 4 92 03 59 52, Email: [email protected]

Text word count: 3507, Abstract word count: 217, Number of references: 34, Number of figures: 7, Number of tables: 2, Supplementary data, tables: 4

Keywords: Nissabacter archeti, Nice, pustule, scalp, taxono-genomic, Enterobacteriaceae

The GenBank accession numbers for the 16S rRNA gene and whole genome sequence of strain 2134 are LT631518 and NZ_FQXW00000000.1, respectively.

ABSTRACT

A novel, facultatively anaerobic, motile, Gram-negative, straight-rod strain, designated 2134 under the collection CSURP3445 = DSM 105398, was isolated from a human scalp pustule at the Archet 2 Hospital, Nice. The cells consist of long rods (approx. 3 - 3.5 µm in length and 1 µm wide) and produce whitish, circular (approx. 0.5 – 1 mm), mobile colonies. Strain 2134 was oxidase negative, catalase positive and urease. Acid is produced from D-glucose, citrate, D-mannitol, Inositol, D-sorbitol, L-Rhamnose, D-saccharose, D-Melibiose, D-amygdaline, L-arabinose and lactose, and ONPG reactions are positive. Arginine dihydrolase, lysine decarboxylase, ornithine decarboxylase, Indole and H2S are negative. The fatty acid profiles are characterised by large amounts of Hexadecanoic acid, 3-hydroxy-Tetradecanoic acid, 9-Hexadecenoic acid, 10-Heptadecenoic acid, Dodecanoic acid, 11-Octadecenoic acid, Tetradecanoic acid and Octadecanoic acid. Phylogenetic analyses by MLSA revealed the strain to be close to the genera Serratia, Ewingella, Rahnella and Gibbsiella. The genome size is 5,162,116 bp with a 58.34% G+C content assembled into 50 scaffolds containing 4695 genes, 3 repeat regions, 88 tRNAs, 12 rRNAs and 4594 coding sequences (CDSs). The genome contains three prophages, one intact and 2 questionable, six CRISPR systems (spacers and protein-associated) and 13 genomic pathogenic islands. The distinct phylogenetic position supports the proposal of Nissabacter gen. nov., with the type species Nissabacter archeti strain 2134.

Introduction

Taxonomically, the bacterial family of Enterobacteriaceae currently includes 53 genera (with over 170 named species), of which 26 genera are known to be associated with infections in humans [1, 2]. Members of the Enterobacteriaceae are small, Gram-negative, non-sporing straight rods. They are facultatively anaerobic and most species grow well at 37°C, although some species grow better at 25-30°C [3]. Classification of organisms has traditionally been based on similarities in their morphological, developmental and nutritional characteristics, but it is now clear that classification of microorganisms according to these criteria does not necessarily correlate well with natural and evolutionary relationships, as defined by macromolecular sequence comparisons [2, 4]. 16S rRNA gene sequencing, commonly used in the phylogenetic taxonomy of bacteria, has recently been found to have its limitations with respect to the Enterobacteriaceae family [5]. Multiple gene-based phylogenies were therefore introduced and have been frequently used to overcome the limitations. The method currently used in prokaryotic taxonomy is Multilocus Sequence Analysis (MLSA), in which multi-locus fragments of protein-coding genes are sequenced (four housekeeping genes: ATP synthase F1, β-subunit: atpD, DNA gyrase, β-subunit: gyrB, RNA polymerase, β-subunit: rpoB and translation initiation factor IF-2: infB) and subsequently used to calculate a taxonomic phylogenetic tree. A phylogenetic tree based on concatenated aligned sequences is known to reflect the “true” relationship of bacterial taxa [6]. Although high genome plasticity makes a taxonomy of the Enterobacteriaceae family difficult, a new advance in whole genome sequencing (WGS) has made accurate classification of this family possible. WGS enables efficient estimation of bacterial G+C% content, the DNA–DNA hybridization value, Average Nucleotide Identity (ANI) and Average Amino acid identity (AAI) [7–9]. We describe here a new member of the Enterobacteriaceae family, that, on the basis of the distinct phylogenetic position of this strain within the Enterobacteriaceae family ascertained with 16S rDNA sequence analysis, was recently proposed as the first type species of a new genus named Nissabacter, which is close to the genera Serratia and Ewingella [10]. However, since 16S rDNA analysis in support of this proposal is known to have its limitations, we conducted a full phenotypic and genomic analysis and made a full description of this strain to provide accurate taxonomic classification.

Materials and Methods

Sample collection, strain isolation, Gram staining and antibiotic susceptibility test

A pus sample was obtained from a 29-year-old man suffering from a scalp pustule and a swab sample was sent to the bacteriology laboratory of the Nice Teaching Hospital, Archet II. The strain 2134 was isolated by culture on a 5% blood agar plate in an aerobic atmosphere at 37°C after 24h of incubation. Growth was obtained along with S. aureus on the same plate. A biochemical identification test was performed using an API 20E biochemical test kit (bioMérieux, Marcy-l'Étoile, France) and Gram staining was performed using the standard Gram staining protocol. Susceptibility to antibiotics was determined by disk diffusion on Mueller-Hinton agar (Sanofi Diagnostics Pasteur, Marnes-la-Coquette, France) as recommended by the “Comité de l’Antibiogramme de la Société Française de Microbiologie” (http://www.sfm-microbiologie.org/UserFiles/files/casfm/CASFM2013vjuin.pdf). The antibiotic disks we used were amoxicillin, amoxicillin/clavulanic acid, cefoxitin, cefalotin, imipenem, trimethoprim/sulfamethoxazole, doxycycline, ciprofloxacin, nalidixic acid, fosfomycin, rifampicin and colistin.

Identification based on MALDI-TOF typing

MALDI-TOF protein analysis was performed as previously described [11] using a Microflex spectrometer (Bruker Daltonics, Leipzig, Germany). A colony was selected from a culture agar plate and spread on an MSP 96 MALDI-TOF target plate (Bruker). Two distinct deposits from two isolated colonies were tested for strain 2134. Each smear was covered with 2 μL of matrix solution (saturated solution of alpha-cyano-4-hydroxycinnamic acid in 50% acetonitrile and 2.5% trifluoroacetic acid) and allowed to dry for 5 minutes. Spectra were recorded in the positive linear mode in the mass range 2000 to 20,000 Da (parameter settings: ion source 1 (IS1), 20 kV; IS2, 18.5 kV; lens, 7 kV). A spectrum was obtained after 240 shots with variable laser power. Acquisition time was between 30 seconds and 1 minute per spot. The 20 SIT1T spectra were imported into the MALDI BioTyper 3.0 software (Bruker) and analysed by standard pattern matching (with default parameter settings) against the main spectra of 7,379 bacteria. A score of ≥ 2 with a validly published species enabled identification at the species level, a score of ≥ 1.7 and < 2 enabled identification at the genus level and a score of < 1.7 was considered an invalid result [12].
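The score thresholds stated above can be written as a small decision rule. This is an illustrative sketch only: the function name is hypothetical, it is not part of the BioTyper software, and it assumes the score is already available as a number.

```python
def interpret_biotyper_score(score: float) -> str:
    """Map a pattern-matching score to the identification level described
    in the text: >= 2 species, >= 1.7 and < 2 genus, < 1.7 invalid.
    """
    if score >= 2.0:
        return "species"
    if score >= 1.7:
        return "genus"
    return "invalid"
```

With this rule, a strain for which no deposit reaches 1.7 is reported as not identified, which corresponds to the situation described for strain 2134.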

Morphology and electron microscopy

Negative staining was performed using bacteria fixed with 2.5% glutaraldehyde and deposited on carbon formvar film, then incubated for 1 second on 1% ammonium molybdate, dried on blotting paper and finally observed with a TECNAI G20 transmission electron microscope (FEI, Limeil-Brevannes, France) at an operating voltage of 200 keV. Gram staining was carried out according to the Gram procedure and observation was made by oil immersion under light microscopy (×100).

Fatty acid methyl ester (FAME) analysis by GC/MS

Cellular fatty acid methyl ester (FAME) analysis was performed by Gas Chromatography/Mass Spectrometry (GC/MS). Two samples were prepared with approximately 85 mg of bacterial biomass per tube harvested from several culture plates. Fatty acid methyl esters were prepared as described by [13] and GC/MS analyses were carried out as previously described by [14]. Briefly, fatty acid methyl esters were separated using an Elite 5-MS column and monitored by mass spectrometry (Clarus 500 - SQ 8 S, Perkin Elmer, Courtaboeuf, France). A spectral database search was performed using MS Search 2.0 operated with the Standard Reference Database 1A (NIST, Gaithersburg, USA) and the FAME mass spectral database (Wiley, Chichester, UK).

Genomic DNA preparation

Strain 2134 was cultured on Tryptone soya agar (BioMerieux) at 37 °C in an aerobic atmosphere. Bacteria grown on three Petri dishes were resuspended in 400 μL of TE buffer, then 200 μL of this suspension was diluted in 1 mL of TE buffer for lysis treatment, which included incubation for 30 minutes with 2.5 μg/μL lysozyme at 37 °C, followed by incubation for 2 hours with 20 μg/μL proteinase K at 37 °C. The extracted DNA was then purified using a QIAGEN spin-column kit.

Genome sequencing and assembly

Genomic DNA of strain 2134 was sequenced using MiSeq Technology (Illumina Inc, San Diego, CA, USA) with the mate-pair strategy. gDNA was quantified by a Qubit assay using a high sensitivity kit (Life Technologies, Carlsbad, CA, USA) at 120.3 ng/µl. The mate-pair library was prepared with 1.5 µg of genomic DNA using the Nextera mate-pair Illumina guide. The genomic DNA sample was simultaneously fragmented and tagged with a mate-pair junction adapter. The pattern of fragmentation was validated on an Agilent 2100 BioAnalyzer (Agilent Technologies Inc., Santa Clara, CA, USA) with a DNA 7500 labchip. The DNA fragments ranged in size from 1.5 kb to 11 kb with an optimal size of 5.316 kb. No size selection was performed and 640.3 ng of tagged fragments were circularised. The circularised DNA was mechanically sheared to small fragments with an optimal size of 1,550 bp using a Covaris S2 device in T6 tubes (Covaris, Woburn, MA, USA). The library profile was visualised on a High Sensitivity Bioanalyzer LabChip (Agilent Technologies Inc., Santa Clara, CA, USA) and the final library concentration was measured at 15.44 nmol/l. The libraries were normalised at 2 nM and pooled. After a denaturation step and dilution at 15 pM, the pool of libraries was loaded onto the reagent cartridge and then onto the instrument along with the flow cell. Automated cluster generation and the sequencing run were performed in a single 2 x 301-bp run. The reads obtained were assembled using A5-miseq [15], and genome annotation was performed using Prokka [16]; the annotated genome was submitted under accession number NZ_FQXW00000000.1.
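The normalisation and dilution steps above follow the usual relation C1·V1 = C2·V2. The sketch below works through that arithmetic with the concentrations given in the text; the final volumes (10 µl and 1000 µl) are hypothetical values chosen for illustration, and the volume changes introduced by the denaturation step are not modelled.

```python
def dilution_volumes(stock_conc: float, target_conc: float,
                     final_volume: float) -> tuple[float, float]:
    """Return (stock_volume, diluent_volume) needed to reach target_conc.

    Uses C1 * V1 = C2 * V2. Both concentrations must share one unit;
    the volumes come back in the unit of final_volume.
    """
    if stock_conc <= 0 or target_conc <= 0 or final_volume <= 0:
        raise ValueError("concentrations and volume must be positive")
    if target_conc > stock_conc:
        raise ValueError("cannot dilute to a higher concentration")
    stock_volume = target_conc * final_volume / stock_conc
    return stock_volume, final_volume - stock_volume


# Volume of gDNA at 120.3 ng/µl that contains the 1.5 µg (1500 ng) input.
gdna_volume_ul = 1500.0 / 120.3

# Library at 15.44 nM normalised to 2 nM in a hypothetical 10 µl.
library_ul, buffer_ul = dilution_volumes(15.44, 2.0, 10.0)

# 2 nM (2000 pM) pool diluted to 15 pM in a hypothetical 1000 µl.
pool_ul, diluent_ul = dilution_volumes(2000.0, 15.0, 1000.0)
```

Under these assumed volumes, about 12.5 µl of gDNA provides the 1.5 µg input, about 1.3 µl of library is needed for normalisation, and 7.5 µl of the 2 nM pool gives the 15 pM loading concentration.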

146 Multi-locus sequence analysis (MLSA) and phylogenetic analysis

Multi-locus sequence analysis (15) was used to infer the phylogeny, taxonomy and evolutionary position of the strain. Nucleotide sequences of the genes rpoB, gyrB, infB and atpD were retrieved from the NCBI database for the Enterobacteriaceae type strains previously described in [6]. Pseudomonas aeruginosa PAO1 (NC_002516) was used to root the phylogenetic tree. The sequences were concatenated in the order rpoB, gyrB, infB, atpD and aligned using MAFFT [18] with default parameters. The evolutionary tree was inferred using the Maximum Likelihood method with 1,000 bootstrap replicates and the general time-reversible (GTR) model. The phylogenetic trees were visualised with MEGA [19].
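
The concatenation step described above can be sketched as follows. This is a minimal illustration, not the script used in the study: the function names and the toy sequences are hypothetical, and the resulting FASTA text would still have to be passed to an aligner such as MAFFT.

```python
GENE_ORDER = ["rpoB", "gyrB", "infB", "atpD"]  # order stated in the text


def concatenate_loci(sequences):
    """Join the four gene sequences of each strain in GENE_ORDER.

    `sequences` maps strain name -> {gene name -> nucleotide sequence}.
    Strains lacking any of the four genes are skipped.
    """
    concatenated = {}
    for strain, genes in sequences.items():
        if all(gene in genes for gene in GENE_ORDER):
            concatenated[strain] = "".join(genes[gene] for gene in GENE_ORDER)
    return concatenated


def to_fasta(records):
    """Serialise {name: sequence} as FASTA text."""
    return "".join(f">{name}\n{seq}\n" for name, seq in records.items())


# Toy sequences (hypothetical, far shorter than real genes).
toy = {
    "strain_2134": {"rpoB": "ATGC", "gyrB": "GGA", "infB": "TTC", "atpD": "CA"},
    "incomplete_strain": {"rpoB": "ATGC"},
}
supermatrix = concatenate_loci(toy)
print(to_fasta(supermatrix))
```

Skipping strains with a missing locus keeps every concatenated sequence comparable across the same four genes; whether incomplete strains were excluded in the original analysis is not stated.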

Genome description, orthologous gene detection and comparison analysis

A GenBank file of the annotated genome of strain 2134 was loaded into Artemis [20] for visualisation of genome features. GIPSy software [21] was used to predict genomic pathogenicity islands (GPI), genomic resistance islands (GRI) and genomic symbiotic islands (GSI), using Serratia plymuthica S13 (NZ_CP015613) as the reference genome with default settings. For comparative analysis, the genomes of Serratia marcescens Db11 (NZ_HG326223.1), Serratia plymuthica S13 (NZ_CP015613), Ewingella americana ATCC 33852 (NZ_JMPJ00000000), Serratia rubidaea (NZ_CP014474.1) and Rahnella aquatilis (NC_017047.1) were retrieved from the NCBI database. A functional COG profile was determined for each genome by running a BLASTp search against the NCBI COG database [22], and a comparison plot was generated. Orthologous genes were detected and analysed using the GET_HOMOLOGUES software [23, 24] with 80% coverage, 80% identity and a clustering inflation value of 1.5. The Gower distance was calculated from the pan-genome matrix to estimate the genomic distance between the genomes compared.
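
As an illustration of the distance computed from the pan-genome matrix, the sketch below treats each gene cluster as a symmetric 0/1 variable. Under that assumption the Gower distance between two genomes reduces to the fraction of clusters whose presence/absence differs. The matrix and genome labels are toy values, and the exact variable treatment used by the software in the study is not stated in the text.

```python
def gower_binary(profile_a, profile_b):
    """Gower distance between two 0/1 presence/absence profiles.

    Each cluster contributes |a - b| / range, and the range of a 0/1
    variable is 1, so the distance is the mean number of mismatches.
    """
    if len(profile_a) != len(profile_b):
        raise ValueError("profiles must cover the same gene clusters")
    mismatches = sum(a != b for a, b in zip(profile_a, profile_b))
    return mismatches / len(profile_a)


# Toy pan-genome matrix: rows are genomes, columns are gene clusters.
pan_matrix = {
    "genome_A": [1, 1, 1, 0, 0],
    "genome_B": [1, 1, 0, 1, 0],
    "genome_C": [1, 0, 0, 0, 1],
}
names = sorted(pan_matrix)
distances = {
    (x, y): gower_binary(pan_matrix[x], pan_matrix[y])
    for x in names for y in names
}
for x in names:
    print(x, [round(distances[(x, y)], 2) for y in names])
```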

RESULTS AND DISCUSSION

Strain 2134 grows easily on standard agar plates, such as tryptone soya agar, 5% blood agar and Hektoen agar, within 18–24 h at an optimum growth temperature of 37 °C. It is Gram-negative and rod-shaped and, like most members of the Enterobacteriaceae family, is motile by means of flagella (Figure 1). MALDI-TOF MS could not identify the strain accurately, as this is the first time it has been isolated (average score: 1.506 < 1.7) (Figure 2).

All the strains used for comparison had distinct biochemical phenotypes. Strain 2134 shares motility with S. marcescens, S. plymuthica, Cedecea neteri and E. americana. All are positive for ortho-nitrophenyl-β-D-galactopyranoside (ONPG) hydrolysis and glucose fermentation, which is common to most Enterobacteria (Table 1). However, strain 2134 is distinguished by the absence of acetyl methyl carbinol (acetoin) production from glucose fermentation (negative VP test) and by its fermentation of L-rhamnose (Table 1). It also differs from E. americana by the absence of H2S production, and from the Serratia genus by the absence of gelatine hydrolysis.

Strain 2134 was resistant to amoxicillin, doxycycline, rifampicin and fosfomycin, and susceptible to trimethoprim/sulfamethoxazole, imipenem, amoxicillin/clavulanic acid, cefotaxime, cefoxitin, cefalotin, ciprofloxacin, nalidixic acid and colistin. No extended-spectrum beta-lactamase activity was detected.

The most abundant fatty acid by far was hexadecanoic acid (43%), followed by 3-hydroxy-tetradecanoic acid (15%) and the unsaturated 9-hexadecenoic acid (13%) and 10-heptadecenoic acid (9%). Minor amounts of other unsaturated, branched and saturated fatty acids were also detected (Table 2).

Multi-locus sequence analysis (MLSA), inferred using the Maximum Likelihood method with 119 taxa, showed with higher resolution that strain 2134 represents a new genus, which may be the closest ancestor of the genera Serratia, Ewingella, Rahnella and Gibbsiella (Figure 3). Average Nucleotide Identity (ANI), computed with the SpecI software [25] and confirmed on the ANI web platform (http://enve-omics.ce.gatech.edu/ani/), showed an average identity of 87.72% (below the cut-off) with four strains: Serratia sp. AS12, S. plymuthica AS9, S. odorifera 4Rx13 and Serratia sp. AS13. The cut-off for strains of the same species is > 96% (Supplementary data Table 1). In-silico DDH analysis against E. americana (20.1%), S. marcescens (22.2%) and S. plymuthica (22.1%) revealed very low DNA-DNA hybridisation values, far below the > 70% expected for strains of the same species [26–28] (Supplementary data Table 2).
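
The species-delineation reasoning above amounts to comparing each ANI and in-silico DDH value with its cut-off. The sketch below applies the two thresholds quoted in the text to the values listed in Supplementary data Tables 1 and 2; the variable and function names are illustrative assumptions, and averaging the four ANI values gives roughly 87.7%.

```python
from statistics import mean

ANI_SPECIES_CUTOFF = 96.0  # % identity, cut-off quoted in the text
DDH_SPECIES_CUTOFF = 70.0  # % in-silico DDH, cut-off quoted in the text

# Values copied from Supplementary data Tables 1 and 2.
ani_percent = {
    "Serratia sp. AS12": 87.6919262,
    "Serratia plymuthica AS9": 87.6919262,
    "Serratia odorifera 4Rx13": 87.82877267,
    "Serratia sp. AS13": 87.6919262,
}
ddh_percent = {
    "E. americana": 20.1,
    "S. marcescens": 22.2,
    "S. plymuthica": 22.1,
}


def above_cutoff(values, cutoff):
    """Return the names whose value exceeds the same-species cut-off."""
    return [name for name, value in values.items() if value > cutoff]


mean_ani = mean(list(ani_percent.values()))
same_species_by_ani = above_cutoff(ani_percent, ANI_SPECIES_CUTOFF)
same_species_by_ddh = above_cutoff(ddh_percent, DDH_SPECIES_CUTOFF)

print(f"mean ANI = {mean_ani:.3f}%")
print("same species by ANI:", same_species_by_ani)
print("same species by DDH:", same_species_by_ddh)
```

Both lists come out empty, which is the basis for treating strain 2134 as distinct from every genome compared.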

The sequenced genome of strain 2134 is 5,162,116 bp in size, close to that of most members of the Enterobacteriaceae family, and is assembled into 50 scaffolds. A total of 4695 coding sequences were predicted, together with 110 RNAs, 3 repeat regions, 89 tRNAs and 12 rRNAs. The GC content of the whole genome is 58.34% (Figure 4). The coding density is 0.892 genes/kb (1120 bases/gene) and the coding percentage is 90.3%. CRISPRFinder [29] predicted six CRISPRs in the genome, three of which are true CRISPRs. The first, which is the largest, is associated with CRISPR-associated operon proteins (Cas1, Cas3, Csy1, Csy2, Csy3, Cas6f), followed by a direct repeat sequence of 28 bp and 21 spacers. The associated mobile region, also predicted by GIPSy as a genomic pathogenicity island, had 100% coverage and 99% identity with a region of Salmonella enterica serovar Cubana, and 28% coverage and 100% identity with the Klebsiella oxytoca plasmid pKOXM1A, evidence of a probable insertion of mobile genetic elements (MGEs) into the genome (Figure 5).

This MGE region spans positions 839256 to 947591 bp and essentially contains a plasmid protein of unknown function (Plasmid_RAQPRD), an F pilus assembly type IV secretion system for plasmid transfer, the TraM recognition site of TraD and TraG, TraU proteins, a putative cadmium-transporting ATPase, arsenical pump membrane proteins, arsenate reductase, the IS2 transposase TnpB, the silver-binding protein SilE precursor, a silver-exporting P-type ATPase, copper resistance operon proteins (CopA, CopB, CopD, CopR), cation efflux system operon proteins (CusS, CusR, CusF, CusC, CusB, CusA), the tyrosine recombinase XerC and a large number of hypothetical proteins. This carriage of heavy-metal resistance genes suggests that the bacterium may have an environmental origin [30]. A total of 13 genomic pathogenicity islands, 16 putative genomic islands (Supplementary data Table 4) and 3 prophages were predicted to have been acquired by strain CSURP3445, giving its genome very high plasticity (Figure 5). Fourteen toxin-antitoxin modules were found in the genome, as well as important virulence factors, such as the cold shock-like protein CspC, the modulator of FtsH protease YccA, the heat shock protein HspQ, ion protease, an integration host factor subunit beta, a capsule polysaccharide modification protein, the polysialic acid transport protein KpsD, the putative fimbrial-like protein YcbV, a putative fimbrial operon, a peptidoglycan-associated lipoprotein, a ferric uptake regulation protein, the magnesium and cobalt efflux protein CorC, catalase, a virulence gene transcriptional activator, the colonisation factor antigen 1, the filamentous hemagglutinin transporter protein FhaC, filamentous hemagglutinin, a type IV pilus biogenesis and competence operon protein, a ferrous iron transport operon protein, lactose permease, the actin cross-linking toxin VgrG1, the toxin and drug export protein A, flagellin 2, invasin, type I and II secretion system proteins F and haemolysin C, making strain 2134 a potentially harmful pathogen for humans. The known antimicrobial resistance genes detected were the beta-lactamase gene bla, the fluoroquinolone resistance gene flq and the macrolide resistance gene mls.

Strain 2134 has a GC content close to that of Serratia (average difference in GC% of 1.40%) but not to that of E. americana (difference in GC% of 4.44%). A total of 2464 orthologous genes were conserved across all the genomes analysed (core genes), accounting for 52.48% of the proteome of strain 2134, which also contains 986 singleton genes and shares 75 accessory orthologous genes with E. americana. A pangenome parsimony phylogenetic tree generated from the gene presence/absence profile and plotted against an average nucleotide identity heatmap showed that strain 2134 is genetically close to E. americana (Gower distance = 0.39, ANI = 0.61) (Figure 6). Functional COG analysis of strain 2134, S. marcescens, S. plymuthica, S. rubidaea and E. americana showed that strain 2134 and E. americana are closely related, with differences observed only in the COG categories inorganic ion transport and metabolism and secondary metabolite biosynthesis, transport and catabolism (Figure 7 [P] and [Q], respectively; Supplementary data Table 3). The genomic island search revealed 13 putative pathogenicity islands, 13 putative resistance islands and 6 putative symbiotic islands. The genome of strain 2134 contains 3 prophages, one of which is intact, and 3 CRISPR regions, giving it very high plasticity. We found a putative pathogenicity island with a 22% G+C deviation and a 44% codon usage factor, composed of 57% hypothetical proteins and 42% genes associated with CRISPR-associated proteins. We also found the identical region in the genome of S. enterica subsp. enterica serovar Cubana str. CFSAN002050, with 99% identity from position 4213500 to 4309000, and 28% identity with 22% of the K. oxytoca strain M1 plasmid pKOXM1A (composed of 20 coding sequences). This provides evidence of MGE insertion in the genome of strain 2134, making it evolutionarily versatile.
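
The core, accessory and singleton categories used above can be derived from a gene presence/absence profile as sketched below. The cluster identifiers, genome labels and function name are hypothetical. The last lines only check that 2464 core genes out of the 4695 predicted coding sequences correspond to the 52.48% quoted in the text.

```python
def classify_clusters(presence, n_genomes):
    """Label each orthologous cluster as core, accessory or singleton.

    `presence` maps cluster id -> set of genomes containing the cluster.
    Core: present in every genome; singleton: present in exactly one.
    """
    labels = {}
    for cluster, genomes in presence.items():
        if len(genomes) == n_genomes:
            labels[cluster] = "core"
        elif len(genomes) == 1:
            labels[cluster] = "singleton"
        else:
            labels[cluster] = "accessory"
    return labels


# Toy profile over three genomes (hypothetical cluster ids).
toy_presence = {
    "cluster_1": {"strain_2134", "E_americana", "S_marcescens"},
    "cluster_2": {"strain_2134", "E_americana"},
    "cluster_3": {"strain_2134"},
}
labels = classify_clusters(toy_presence, n_genomes=3)
print(labels)

# Share of the strain 2134 proteome made up of core genes (figures from the text).
core_percent = 100 * 2464 / 4695
print(f"{core_percent:.2f}%")
```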

Based on these phylogenetic and physiological results, we propose strain 2134 as the type strain of the novel taxon Nissabacter archeti gen. nov., sp. nov., with Nissabacter gen. nov. as a novel genus of the Enterobacteriaceae family.

Description of the genus Nissabacter gen. nov.

Nissabacter (Ni.sa.bac.ter; a bacterium isolated for the first time in Nice (Nissa or Niça in the Niçois Occitan dialect), a city in the Alpes-Maritimes department, south-eastern France). Cells are long rods (approx. 3–3.5 µm long and 1 µm wide) that are Gram-negative and motile, and produce whitish, circular colonies (approx. 0.5–1 mm in diameter). They are facultatively anaerobic and oxidase-negative. Acid is produced from D-glucose, citrate, D-mannitol, inositol, D-sorbitol, L-rhamnose, D-saccharose, D-melibiose, D-amygdalin, L-arabinose and lactose, and the ONPG reaction is positive. Urease is negative. Acetoin is not produced. Arginine dihydrolase, lysine decarboxylase, ornithine decarboxylase, indole and H2S are negative. The fatty acid profile is characterised by large amounts of hexadecanoic acid, 3-hydroxy-tetradecanoic acid, 9-hexadecenoic acid, 10-heptadecenoic acid, dodecanoic acid, 11-octadecenoic acid, tetradecanoic acid and octadecanoic acid. The G+C content of the DNA is 58.34%. The genus is phylogenetically close to the genera Ewingella, Gibbsiella and Serratia in the Enterobacteriaceae family.

Description of Nissabacter archeti sp. nov.

Nissabacter archeti (ar.che.ti; from Hôpital Archet II, where the first species of this genus was isolated). In addition to the phenotypic and genomic features described for the genus, Nissabacter archeti strain 2134, the first species of the genus Nissabacter, is Gram-negative, motile by means of flagella, non-haemolytic, facultatively anaerobic and oxidase-negative. It was isolated from a scalp pustule swab taken from a 29-year-old patient at the Hôpital Archet II in Nice, south-eastern France. N. archeti is resistant to amoxicillin, doxycycline, rifampicin and fosfomycin, and is sensitive to trimethoprim/sulfamethoxazole, imipenem, amoxicillin/clavulanic acid, cefotaxime, cefoxitin, cefalotin, ciprofloxacin, nalidixic acid and colistin. The genome of Nissabacter archeti is 5,162,116 bp in size, with a 58.34% G+C content, and is assembled into 50 scaffolds containing 4695 genes, 3 repeat regions, 88 tRNAs, 12 rRNAs and 4594 coding sequences (CDSs). The genome contains three prophages (one intact and two questionable), six CRISPR systems (three true and three questionable) and 13 genomic pathogenicity islands. Nissabacter archeti strain 2134 has been deposited in the “Collection de Souches de l’Unité des Rickettsies” (CSUR, WDCM 875) under number P3445, and in the Leibniz-Institute DSMZ - German Collection of Microorganisms and Cell Cultures GmbH under DSM number 105398. The 16S rRNA gene sequence of strain 2134 has been deposited in GenBank under accession number LT631518 with the name bacterium 2134. The draft genome was submitted under the name Nissabacter archeti strain 2134, with BioProject number PRJEB18266 and accession number FQXW01000000.

Acknowledgements

The authors thank Nicholas ARMSTRONG and Magali RICHEZ, who performed the fatty acid analyses, and Frédéric Cadoret for carrying out administrative tasks and depositing the strain in the collection.

Funding information

This study was funded by the Fondation Méditerranée Infection.

Conflict of Interest

None declared.

Figure legends

Figure 1. Morphology of N. archeti. Left: Gram staining (Gram-negative); right: electron microscopy image (oval shape with flagella present).

Figure 2. Reference MALDI-TOF MS mass spectrum of Nissabacter archeti.

Figure 3. Multi-locus sequence analysis: molecular phylogenetic analysis by the Maximum Likelihood method of four concatenated housekeeping genes (rpoB, gyrB, infB and atpD). The analysis involved 24 nucleotide sequences. All positions containing gaps and missing data were eliminated.

Figure 4. Schematic circular diagram of the Nissabacter archeti chromosome. From the outside, the five circles display: (i) the scale in Mbp; (ii) the predicted coding regions transcribed clockwise or anticlockwise; (iii) tRNA and rRNA (green); (iv) GC deviation ((G-C)/(G+C)); (v) G+C content, predicted genomic pathogenicity islands (pink), prophages (blue) and predicted CRISPRs (red).

Figure 5. Genome comparison of the putative pathogenicity island with the Salmonella enterica and Klebsiella oxytoca plasmid sub-regions. The BLAST comparison was performed within the Easyfig program [31]. 1. Nissabacter archeti; 2. S. enterica subsp. enterica serovar Cubana str. CFSAN002050; 3. K. oxytoca strain M1 plasmid pKOXM1A.

Figure 6. Heatmap derived from an average nucleotide identity matrix calculated with GET_HOMOLOGUES [23]. The phylogenetic tree (left) is derived from a pangenome parsimony tree based on orthologous genes. The tree was generated using Manhattan distances; the heatmap (right) was generated using Gower distance to compute dissimilarity from the average nucleotide identity matrix.

Figure 7. Distribution and comparison of functional Clusters of Orthologous Groups (COG). Information storage and processing: [A] RNA processing and modification, [B] Chromatin structure and dynamics, [J] Translation, ribosomal structure and biogenesis, [K] Transcription, [L] Replication, recombination and repair. Cellular processes and signalling: [D] Cell cycle control, cell division, chromosome partitioning, [M] Cell wall/membrane/envelope biogenesis, [N] Cell motility, [O] Post-translational modification, protein turnover and chaperones, [T] Signal transduction mechanisms, [U] Intracellular trafficking, secretion and vesicular transport, [V] Defence mechanisms, [W] Extracellular structures, [Y] Nuclear structure, [Z] Cytoskeleton. Metabolism: [C] Energy production and conversion, [E] Amino acid transport and metabolism, [F] Nucleotide transport and metabolism, [G] Carbohydrate transport and metabolism, [H] Coenzyme transport and metabolism, [I] Lipid transport and metabolism, [P] Inorganic ion transport and metabolism, [Q] Secondary metabolite biosynthesis, transport and catabolism. Poorly characterized: [R] General function prediction only, [S] Function unknown.
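
The G+C content and GC deviation tracks named in the legend of Figure 4 can be computed from a sequence as sketched below, using the (G-C)/(G+C) definition given there. The function names, the toy sequence and the window size are illustrative assumptions; a real genome plot would use windows of several kilobases.

```python
def base_counts(seq):
    """Count A, C, G and T in a sequence (case-insensitive)."""
    seq = seq.upper()
    return {base: seq.count(base) for base in "ACGT"}


def gc_content(seq):
    """Percentage of G and C among the A, C, G and T bases."""
    counts = base_counts(seq)
    total = sum(counts.values())
    return 100.0 * (counts["G"] + counts["C"]) / total if total else 0.0


def gc_skew(seq):
    """GC deviation, defined as (G - C) / (G + C)."""
    counts = base_counts(seq)
    gc = counts["G"] + counts["C"]
    return (counts["G"] - counts["C"]) / gc if gc else 0.0


def windowed_skew(seq, window, step):
    """GC deviation in consecutive windows, as (start, skew) pairs."""
    return [
        (start, gc_skew(seq[start:start + window]))
        for start in range(0, len(seq) - window + 1, step)
    ]


toy_sequence = "GGGCATGC"  # hypothetical 8-base example
print(gc_content(toy_sequence))
print(windowed_skew(toy_sequence, window=4, step=4))
```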


Table 1. Comparative morphology and biochemical reactions of Nissabacter archeti, Serratia marcescens, Serratia plymuthica, Cedecea neteri and Ewingella americana (v: variable, n/a: not available)

Biochemical       N. archeti*   S. marcescens [32]   S. plymuthica [32]   C. neteri [33]   E. americana [34]
ONPG              +             +                    +                    +                +
ADH               -             -                    -                    +                -
LDC               -             +                    -                    -                -
ODC               -             +                    -                    -                -
CIT               +             -                    +                    +                +
H2S               -             -                    -                    -                +
URE               -             -                    -                    -                -
TDA               -             -                    -                    -
IND               -             +                    -                    -                -
VP                -             +                    +                    +                +
GEL               -             +                    +                    -                -
D-Glucose         +             +                    +                    +                +
D-Mannitol        +             +                    +                    -                +
Inositol          +             +                    +                    n/a              -
D-Sorbitol        +             +                    +                    +                -
L-Rhamnose        +             -                    -                    -                -
D-Saccharose      +             +                    +                    n/a              -
D-Melibiose       +             -                    +                    -                -
D-Amygdaline      +             +                    +                    n/a              v
L-Arabinose       +             -                    +                    +                -
Lactose           +             +                    +                    +                -
Oxidase           -             -                    -                    -                -
Morphology
Motility          +             +                    +                    +                +
Gram stain        -             -                    -                    -                -
Cell morphology   R-s           R-s                  R-s                  R-s              R-s

ONPG: ortho-nitrophenyl-β-D-galactopyranoside; ADH: arginine dihydrolase; LDC: lysine decarboxylase; ODC: ornithine decarboxylase; CIT: citrate; H2S: production of hydrogen sulphide; URE: urease; TDA: tryptophan deaminase; IND: indole; VP: Voges-Proskauer (acetoin detection); GEL: gelatine; R-s: rod-shaped; * this study.

Table 2. Cellular fatty acid composition (%)

Fatty acid     Name                            Mean relative % (a)
16:0           Hexadecanoic acid               43.0 ± 0.8
14:0 3-OH      3-hydroxy-Tetradecanoic acid    14.7 ± 0.7
16:1n7         9-Hexadecenoic acid             13.4 ± 0.2
17:1n7         10-Heptadecenoic acid           9.2 ± 0.2
12:0           Dodecanoic acid                 7.3 ± 0.8
18:1n7         11-Octadecenoic acid            4.9 ± 0.1
14:0           Tetradecanoic acid              3.0 ± 0.2
18:0           Octadecanoic acid               1.8 ± 0.0
17:0           Heptadecanoic acid              TR
18:1n9         9-Octadecenoic acid             TR
18:2n6         9,12-Octadecadienoic acid       TR
15:0           Pentadecanoic acid              TR
5:0 iso        3-methyl-Butanoic acid          TR
10:0           Decanoic acid                   TR
13:0           Tridecanoic acid                TR
17:0 anteiso   14-methyl-Hexadecanoic acid     TR
17:0 iso       15-methyl-Hexadecanoic acid     TR

(a) Mean peak area percentage; TR: trace amounts (< 1%).

References

1. Khan F, Rizvi M, Shukla I, Malik A. A novel approach for identification of members of Enterobacteriaceae isolated from clinical samples. Biol Med 2011;3:313–319.

2. Borman EK, Stuart CA, Wheeler KM. Taxonomy of the Family Enterobacteriaceae. J Bacteriol 1944;48:351–367.

3. Health Protection Agency. UK Standards for Microbiology Investigations: Identification of Campylobacter species. 2013;ID23:1–22.

4. Ewing WH. Enterobacteriaceae taxonomy and nomenclature. December 1966.

5. Renvoisé A. Applicabilité de la PCR « universelle » 16S comme outil d’identification et de détection bactérienne en laboratoire hospitalier de bactériologie. 5.

6. Glaeser SP, Kämpfer P. Multilocus sequence analysis (MLSA) in prokaryotic taxonomy. Syst Appl Microbiol 2015;38:237–45.

7. Kim M, Oh H-S, Park S-C, Chun J. Towards a taxonomic coherence between average nucleotide identity and 16S rRNA gene sequence similarity for species demarcation of prokaryotes. Int J Syst Evol Microbiol 2014;64:346–51.

8. Van Belkum A, Struelens M, De Visser A, Verbrugh H, Tibayrenc M. Role of genomic typing in taxonomy, evolutionary genetics, and microbial epidemiology. Clinical Microbiology Reviews 2001;14:547–560.

9. Chun J, Rainey FA. Integrating genomics into the taxonomy and systematics of the Bacteria and Archaea. Int J Syst Evol Microbiol 2014;64:316–324.

10. Mlaga KD, Lotte R, Montaudié H, Rolain J-M, Raymond R. ‘Nissabacter archeti’, gen. nov, sp. nov., a new member of Enterobacteriaceae family, isolated from human pustule scalp at Archet 2 hospital, Nice. New Microbes New Infect. Epub ahead of print March 2017. DOI: 10.1016/j.nmni.2017.02.001.

11. Seng P, Rolain J-M, Fournier PE, La Scola B, Drancourt M, et al. MALDI-TOF-mass spectrometry applications in clinical microbiology. Future Microbiol 2010;5:1733–54.

12. Togo AHH, Khelaifia S, Lagier J-C, Caputo A, Robert C, et al. Noncontiguous finished genome sequence and description of Paenibacillus ihumii sp. nov. strain AT5. New Microbes New Infect 2016;10:142–150.

13. Sasser M. Bacterial Identification by Gas Chromatographic Analysis of Fatty Acids Methyl Esters (GC-FAME).

14. Dione N, Sankar SA, Lagier J-C, Khelaifia S, Michele C, et al. Genome sequence and description of Anaerosalibacter massiliensis sp. nov. New Microbes New Infect 2016;10:66–76.

15. Coil D, Jospin G, Darling AE. A5-miseq: An updated pipeline to assemble microbial genomes from Illumina MiSeq data. Bioinformatics 2015;31:587–589.

16. Seemann T. Prokka: rapid prokaryotic genome annotation. Bioinformatics 2014;30:2068–9.

17. Papke RT, White E, Reddy P, Weigel G, Kamekura M, et al. A multilocus sequence analysis approach to the phylogeny and taxonomy of the Halobacteriales. Int J Syst Evol Microbiol 2011;61:2984–2995.

18. Katoh K, Standley DM. MAFFT multiple sequence alignment software version 7: improvements in performance and usability. Mol Biol Evol 2013;30:772–80.

19. Kumar S, Stecher G, Tamura K. MEGA7: Molecular Evolutionary Genetics Analysis Version 7.0 for Bigger Datasets. Epub ahead of print 2016. DOI: 10.1093/molbev/msw054.

20. Rutherford K, Parkhill J, Crook J, Horsnell T, Rice P, et al. Artemis: sequence visualization and annotation. Bioinformatics 2000;16:944–945.

21. Soares SC, Geyik H, Ramos RTJ, de Sá PHCG, Barbosa EGV, et al. GIPSy: Genomic island prediction software. J Biotechnol. Epub ahead of print September 2015. DOI: 10.1016/j.jbiotec.2015.09.008.

22. Tatusov RL, Natale DA, Garkavtsev IV, Tatusova TA, Shankavaram UT, et al. The COG database: new developments in phylogenetic classification of proteins from complete genomes. Nucleic Acids Res 2001;29:22–28.

23. Contreras-Moreira B, Vinuesa P. GET_HOMOLOGUES, a versatile software package for scalable and robust microbial pangenome analysis. American Society for Microbiology. Epub ahead of print 15 December 2013. DOI: 10.1128/AEM.02411-13.

24. Vinuesa P, Contreras-Moreira B. Robust identification of orthologues and paralogues for microbial pan genomics using GET_HOMOLOGUES: a case study of pIncA/C plasmids. http://www.eead.csic.es/compbio/soft/gethoms.php (accessed 5 September 2016).

25. Bland C, Ramsey TL, Sabree F, Lowe M, Brown K, et al. CRISPR recognition tool (CRT): a tool for automatic detection of clustered regularly interspaced palindromic repeats. BMC Bioinformatics 2007;8:209.

26. Hassen A, Saidi N, Cherif M, Boudabous A. Resistance of environmental bacteria to heavy metals. Bioresour Technol 1998;64:7–15.

27. Mende DR, Sunagawa S, Zeller G, Bork P. Accurate and universal delineation of prokaryotic species. Nat Methods 2013;10:881–4.

28. Auch AF, von Jan M, Klenk H-P, Göker M. Digital DNA-DNA hybridization for microbial species delineation by means of genome-to-genome sequence comparison. Stand Genomic Sci 2010;2:117–134.

29. Meier-Kolthoff JP, Auch AF, Klenk H-P, Göker M. Genome sequence-based species delimitation with confidence intervals and improved distance functions. BMC Bioinformatics;14. Epub ahead of print 2013. DOI: 10.1186/1471-2105-14-60.

30. Arahal DR. New Approaches to Prokaryotic Systematics. Elsevier. Epub ahead of print 2014. DOI: 10.1016/bs.mim.2014.07.002.

31. Grimont F, Grimont PAD. The Genus Serratia. DOI: 10.1007/0-387-30746-x_11.

32. Farmer JJ, Sheth NK, Hudzinski JA, Rose HD, Asbury MF. Bacteremia Due to Cedecea neteri sp. nov. J Clin Microbiol 1982;16:775–778.

33. Grimont PAD, Farmer JJ, Grimont F, Asbury MA, Brenner DJ, et al. Ewingella americana gen. nov., sp. nov., a new Enterobacteriaceae isolated from clinical specimens. Ann l’Institut Pasteur / Microbiologie 1983;134:39–52.

34. Sullivan MJ, Petty NK, Beatson SA. Easyfig: a genome comparison visualizer. Bioinformatics 2011;27:1009–1010.


[Figure 3: Maximum Likelihood MLSA tree of Enterobacteriaceae type strains, including Nissabacter archeti 2134, rooted with Pseudomonas aeruginosa PAO1. Node labels are bootstrap values; scale bar: 0.05.]


Supplementary data

Table 1. Average Nucleotide Identity output

NCBI Taxonomy ID   NCBI Genome Project ID   NCBI Taxonomy Name         Type strain   Average % identity   Species cut-off
768490             67315                    Serratia sp. AS12          No            87.6919262           Below cut-off
768492             67313                    Serratia plymuthica AS9    No            87.6919262           Below cut-off
682634             42253                    Serratia odorifera 4Rx13   No            87.82877267          Below cut-off
768493             60455                    Serratia sp. AS13          No            87.6919262           Below cut-off

The average identity across the 4 genomes is 87.72%, below the cut-off (> 96% for the same species).

Table 2. In-silico DDH results from the GGDC taxonomy platform.

Genome          Distance   DDH    CI          P(DDH>70)   Difference GC%
E. americana    0.8008     20.1   18.2-20.8   0           4.44
S. marcescens   0.1975     22.2   22.2-24.6   0           1.59
S. plymuthica   0.1988     22.1   19.8-24.5   0           1.22

Table 3. Distribution of functional Clusters of Orthologous Groups (COG)

COG   E. americana ATCC 33852   N. archeti 2134   R. aquatilis ATCC 33071   S. marcescens Db11   S. plymuthica S13   S. rubidaea 1122
A     1                         1                 1                         1                    1                   1
B     0                         0                 0                         1                    1                   1
C     223                       238               244                       264                  285                 258
D     37                        39                37                        36                   35                  32
E     492                       472               542                       556                  584                 509
F     94                        87                95                        92                   100                 100
G     388                       429               453                       404                  457                 360
H     151                       158               159                       166                  172                 173
I     116                       111               127                       140                  147                 138
J     193                       189               185                       193                  193                 194
K     461                       399               434                       483                  519                 406
L     144                       183               163                       144                  165                 147
M     247                       235               270                       256                  273                 242
N     103                       100               103                       114                  105                 98
O     146                       133               147                       153                  156                 144
P     335                       287               348                       364                  377                 315
Q     101                       82                95                        138                  151                 134
R     603                       597               618                       653                  667                 631
S     364                       347               371                       383                  364                 355
T     187                       181               177                       186                  194                 169
U     115                       117               126                       126                  139                 124
W     1                         2                 1                         0                    0                   0

Information storage and processing: [A] RNA processing and modification, [B] Chromatin structure and dynamics, [J] Translation, ribosomal structure and biogenesis, [K] Transcription, [L] Replication, recombination and repair. Cellular processes and signalling: [D] Cell cycle control, cell division, chromosome partitioning, [M] Cell wall/membrane/envelope biogenesis, [N] Cell motility, [O] Post-translational modification, protein turnover and chaperones, [T] Signal transduction mechanisms, [U] Intracellular trafficking, secretion and vesicular transport, [V] Defence mechanisms, [W] Extracellular structures, [Y] Nuclear structure, [Z] Cytoskeleton. Metabolism: [C] Energy production and conversion, [E] Amino acid transport and metabolism, [F] Nucleotide transport and metabolism, [G] Carbohydrate transport and metabolism, [H] Coenzyme transport and metabolism, [I] Lipid transport and metabolism, [P] Inorganic ion transport and metabolism, [Q] Secondary metabolite biosynthesis, transport and catabolism. Poorly characterized: [R] General function prediction only, [S] Function unknown.

Table 4. Prediction of mobile genetic elements using GIPSy [21]

Putative genomic island           G+C% deviation  Codon usage factors  Hypothetical protein  Gene composition     Position          Prediction score
Genome                            11%             9%                   40%                   NA                   NA                NA
Putative Genomic Island 1         40%             20%                  10%                   orf_00534-orf_00543  563295..575223    NA
Putative Genomic Island 2         22%             18%                  37%                   orf_00668-orf_00715  711526..761805    NA
Putative Pathogenicity Island 1   22%             44%                  57%                   orf_00786-orf_00893  839256..947591    Normal
Putative Pathogenicity Island 2   66%             50%                  83%                   orf_00992-orf_00997  1045761..1052585  Strong
Putative Genomic Island 3         73%             86%                  6%                    orf_01022-orf_01036  1096364..1119246  NA
Putative Genomic Island 4         26%             34%                  39%                   orf_01083-orf_01105  1169139..1196310  NA
Putative Genomic Island 5         60%             50%                  20%                   orf_01377-orf_01386  1508227..1516008  NA
Putative Genomic Island 6         10%             40%                  20%                   orf_01480-orf_01489  1618638..1634088  NA
Putative Pathogenicity Island 3   16%             16%                  64%                   orf_01506-orf_01530  1647846..1677752  Normal
Putative Genomic Island 7         21%             15%                  31%                   orf_01535-orf_01553  1681144..1702287  NA
Putative Pathogenicity Island 4   21%             7%                   50%                   orf_01564-orf_01577  1712906..1726665  Normal
Putative Genomic Island 8         20%             80%                  26%                   orf_01739-orf_01753  1909030..1923608  NA
Putative Pathogenicity Island 5   29%             6%                   63%                   orf_02047-orf_02091  2239352..2294771  Normal
Putative Genomic Island 9         33%             44%                  22%                   orf_02132-orf_02136  2334556..2339981  NA
Putative Pathogenicity Island 6   18%             22%                  45%                   orf_02336-orf_02625  2567605..2854971  Normal
Putative Genomic Island 10        13%             26%                  33%                   orf_02634-orf_02649  2867922..2887874  NA
Putative Genomic Island 11        30%             33%                  23%                   orf_02733-orf_02792  2995455..3044803  NA
Putative Pathogenicity Island 7   9%              0%                   45%                   orf_02970-orf_02991  3233621..3262389  Weak
Putative Pathogenicity Island 8   42%             0%                   100%                  orf_02994-orf_03000  3265677..3313217  Strong
Putative Genomic Island 12        27%             36%                  24%                   orf_03698-orf_03948  4094681..4336876  NA
Putative Pathogenicity Island 9   11%             17%                  52%                   orf_03952-orf_04014  4340684..4410242  Weak
Putative Genomic Island 13        20%             16%                  29%                   orf_04160-orf_04184  4573850..4601136  NA
Putative Genomic Island 14        26%             13%                  40%                   orf_04250-orf_04260  4679380..4693453  NA
Putative Genomic Island 15        33%             0%                   33%                   orf_04279-orf_04290  4714871..4729675  NA
Putative Pathogenicity Island 10  12%             0%                   62%                   orf_04299-orf_04306  4736695..4745385  Normal
Putative Pathogenicity Island 11  20%             33%                  46%                   orf_04504-orf_04545  4948712..4993149  Normal
Putative Pathogenicity Island 12  11%             11%                  66%                   orf_04569-orf_04577  5019523..5032943  Normal
Putative Pathogenicity Island 13  25%             5%                   60%                   orf_04608-orf_04624  5069459..5081207  Normal
Putative Genomic Island 16        35%             30%                  40%                   orf_04633-orf_04652  5090843..5118113  NA
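The Position column of Table 4 gives each predicted island as a "start..end" coordinate pair, so the size of an island can be derived directly from it. The short Python sketch below illustrates this for three rows of the table. The helper name island_length is hypothetical, and the sketch assumes that coordinates are 1-based and inclusive, which the table itself does not state.

```python
def island_length(position: str) -> int:
    """Return the length in bp of a region written as 'start..end'.

    Assumption: coordinates are 1-based and inclusive, so the
    length is end - start + 1.
    """
    start_text, end_text = position.split("..")
    start, end = int(start_text), int(end_text)
    if end < start:
        raise ValueError(f"end before start in {position!r}")
    return end - start + 1


# Three positions copied from Table 4.
positions = {
    "Putative Genomic Island 1": "563295..575223",
    "Putative Pathogenicity Island 2": "1045761..1052585",
    "Putative Pathogenicity Island 6": "2567605..2854971",
}

lengths = {name: island_length(pos) for name, pos in positions.items()}

for name, length in lengths.items():
    print(f"{name}: {length} bp")
```

Under this assumption, Putative Pathogenicity Island 6 is by far the largest of the three regions, which is consistent with its wide gene range (orf_02336-orf_02625).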


Conclusion

Based on these phylogenetic and physiological results, we propose strain 2134 as a strain of the novel taxon Nissabacter archeti gen. nov., sp. nov., and we propose Nissabacter gen. nov. as a new genus within the family Enterobacteriaceae. We propose the new genus Nissabacter as a new member of the family Enterobacteriaceae because it was first isolated in Nice, a city in south-eastern France, and we describe the first species of this genus, Nissabacter archeti, which was first isolated from a pustule scalp swab.

Nissabacter (Ni.sa.bac.ter, a bacterium isolated for the first time from Nice, called Nissa or Niça in Niçois Occitan, a city in the south-east of France and prefecture of the Alpes-Maritimes department).


Chapter IV: Conclusion and perspectives


In light of our PhD projects, we can conclude that bacterial WGS has contributed substantially to understanding how bacteria live, spread, adapt and evolve within their communities, and how they interact with their immediate environment, especially humans.

Computational analysis of genome sequence data is becoming an essential process in clinical microbiology, and available, user-friendly tools are urgently needed. We devoted the first part of this thesis to a review of the available bioinformatics tools and of the impact of bacterial genome recombination on bacterial behaviour in clinical microbiology.

In the second part of this thesis, we analysed epidemiological data on S. saprophyticus causing UTIs in Marseille in association with MALDI-TOF spectral data, using the Nice community as a control area; both cities are located in south-eastern France. We demonstrated a restricted geographical spread of S. saprophyticus causing UTIs in the Marseille community and identified a molecular marker that characterised the Marseille strains. In addition, genome comparison of clinical and non-clinical strains of S. saprophyticus isolated from various environments showed that significantly more recombination events had occurred in the genomes of clinical strains than in those of non-clinical strains. The number of SNPs imported through recombination relative to those arising by point mutation, estimated by the ratio r/m, was also higher in clinical strains than in non-clinical strains.
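To make the r/m comparison concrete, the ratio can be written as the number of SNPs imported by recombination divided by the number of SNPs arising by point mutation. The Python sketch below is only an illustration of that arithmetic: the function name r_over_m is hypothetical, and the SNP counts are invented placeholders, not results from this thesis.

```python
def r_over_m(recombination_snps: int, mutation_snps: int) -> float:
    """Ratio of SNPs imported by recombination to SNPs from point mutation."""
    if recombination_snps < 0 or mutation_snps <= 0:
        raise ValueError("counts must be non-negative, with mutation_snps > 0")
    return recombination_snps / mutation_snps


# Hypothetical counts, chosen only to illustrate the comparison.
clinical = r_over_m(recombination_snps=1200, mutation_snps=400)
non_clinical = r_over_m(recombination_snps=300, mutation_snps=400)

print(f"clinical r/m = {clinical:.2f}")
print(f"non-clinical r/m = {non_clinical:.2f}")
```

An r/m value above 1 means that recombination introduced more substitutions than point mutation did in the lineage considered; a higher value in one group than in another is the kind of difference described above.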

Moreover, phylogenetic analysis with midpoint rooting showed that non-clinical strains are closer to the hypothetical ancestral strain than clinical strains, which seem to have emerged recently. Overall, S. saprophyticus, initially considered saprophytic, has drifted towards pathogenicity through massive genome recombination and single nucleotide polymorphism (SNP) events, accompanied by a significant loss of genes in the transcriptional regulation and carbohydrate metabolism and transport functional groups and by positive evolutionary selection of the uro-adherence gene uafA. These events have led to the emergence of a specific population of S. saprophyticus capable of causing disease, particularly UTIs, in humans.


Also in the second part of this thesis, we demonstrated that extensive genomic recombination has occurred in E. faecalis, driven by mobile genetic elements (MGEs), which are known to induce adaptive immunity in prokaryotes. This recombination correlates with the acquisition of a CRISPR system in E. faecalis, which protects the latter from acquiring external DNA sequences carrying the vancomycin resistance genes vanA and vanB. In contrast, in E. faecium we observed fewer CRISPR systems and a substantial number of anti-endonuclease ardA genes and vancomycin resistance genes. We suggested that the emergence and dissemination of E. faecium infection may be due to zoonotic transmission, and that the misuse of antibiotics (avoparcin) may have selected for emerging vancomycin resistance in enterococci. These findings may explain why E. faecium, rather than E. faecalis, is more frequently reported worldwide as a vancomycin-resistant Enterococcus.

Finally, we used whole genome sequencing as an integral part of the delineation of microorganisms at the species and genus levels to discover and describe new species and genera. We proposed Nissabacter as a new member of the family Enterobacteriaceae and described Nissabacter archeti as the first species of the genus, isolated from a scalp swab sample of a 29-year-old patient at Hôpital Archet II, Nice.

As a perspective, deciphering the community outbreak of S. saprophyticus causing UTIs in Marseille using MALDI-TOF MS spectral data analysis, together with comparative genome analysis, will improve our knowledge of how S. saprophyticus spreads and causes UTIs in various communities. We propose that a systematic genomic analysis system be put in place to strengthen the strategy already established for controlling microbial infections in the unit.

Concerning the comparative analysis of Enterococcus species, we intend to continue with a biological investigation to demonstrate the direct role that the ardA gene plays in the acquisition of plasmids carrying the vanA and vanB gene clusters, and its association with the CRISPR system.

In the near future, our lab will systematically implement real-time genomics in the routine workflow to decipher bacteria exhibiting multidrug resistance and atypical features in clinical microbiology. A web server and a benchmark integrating all the in-house scripts developed during this thesis are under way for routine searches of antimicrobial resistance genes. They should also significantly improve the weekly bacterial surveillance system in our unit.

Posters and presentations

1. Kodjovi D. Mlaga; Cédric Abat; Jean-Marc Rolain: Whole genome sequence analysis of Staphylococcus saprophyticus involved in Urinary Tract Infections (UTIs), Oral poster, 26th ECCMID Conference, Amsterdam, Netherlands, 9 – 12 April 2016.

2. Kodjovi D. Mlaga; Cédric Abat; Jean-Marc Rolain: Comparative genomic analysis of Staphylococcus saprophyticus involved in Urinary Tract Infections (UTIs), Oral poster, infectiopole day, 3rd July 2016.

3. Kodjovi D. Mlaga, Vincent Garcia, Seydina Diene, Ruimy Raymond, Jean-Marc Rolain: Comparative genome evolutionary analysis of Enterococcus faecalis and Enterococcus faecium isolated from various sources revealed a direct association between the absence of CRISPR system and acquisition of vancomycin resistance genes, Oral presentation, 27th ECCMID Conference, Vienna, Austria, 22 – 25 April 2017 (presented by supervisor).

4. Kodjovi D. Mlaga, Vincent Garcia, Seydina Diene, Ruimy Raymond, Jean-Marc Rolain: Comparative genome evolutionary analysis of Enterococcus faecalis and Enterococcus faecium isolated from various sources revealed an association between the absence of CRISPR system and acquisition of resistance genes, Oral poster, infectiopole day, 7th July 2017.

Acknowledgements

I would not have been able to reach this achievement if the Almighty and my loved ones had not been there for me. I will do my best to mention some of them, though all of them are close to my heart.

I am grateful to God for being everything to me throughout my PhD study.

My first and foremost thanks go to Professor Didier Raoult, for giving me the opportunity to pursue my thesis work in this prestigious institute. I am thankful to him for his advice and all his suggestions during my PhD work.

I would like to express my profound gratitude to Professor Jean-Marc Rolain for being my PhD supervisor. I am grateful for your excellent mentorship, your instructions, your guidance, and all your support during my PhD work.

I am thankful to Professor Raymond Ruimy for co-supervising my thesis. Thanks for your valuable instructions and helpful guidance during my work.

Thanks to Dr Seydina Diene and Dr Fadi Bittar for all your suggestions and advice.

I would like to express my gratitude to Professor Anthony Levasseur, Professor Marie Kempf and Professor Estelle Jumas-Bilak for kindly agreeing to be members of the jury.

I would like to express my most profound love to my family. To my sweet wife Sara Mlaga: this hard but successful PhD journey would not have been possible without you. You have been an incredible support, and I humbly lack the words to express all my gratitude to you. To my dad, Amenouvor Mlaga, and my mum, Ameyo Mlaga, thank you for all your encouragement and your advice. To my brothers and sisters, Mawuli Mlaga, Akouvi Mlaga, Enyonam Mlaga, Selom Mlaga, Viviane Mlaga and Ségbedji Mlaga, I just want to say thank you. To my spiritual mentor, Patrice Djadoo, I am grateful for your spiritual guidance.

Thanks to Dr Abiola Olaitan for all your support and advice; you guided my first steps in genome assembly. You are a good friend. I am grateful to Dr Toidi Adekambi for all his help during my PhD work. To Linda Hadjadj, I am grateful for the excellent work environment you provided during the short time I spent in the wet lab. To Dr Cedric Abat, thanks for your collaboration and friendship. To all my relatives, friends and laboratory colleagues who have collaborated with me in one way or another during my PhD work, I say thank you.

I would like to dedicate this thesis to my dad, mum, my sweet wife, my son Joshua and my daughter Jockebed; I owe this success to you. Big love!!
