Study of the emergence of the diversity of Escherichia coli in vivo by whole genome sequencing Adrien Launay

To cite this version:

Adrien Launay. Study of the emergence of the diversity of Escherichia coli in vivo by whole genome sequencing. Bacteriology. Université Pierre et Marie Curie - Paris VI, 2016. English. ￿NNT : 2016PA066692￿. ￿tel-01599361￿

HAL Id: tel-01599361 https://tel.archives-ouvertes.fr/tel-01599361 Submitted on 2 Oct 2017

HAL is a multi-disciplinary open access L’archive ouverte pluridisciplinaire HAL, est archive for the deposit and dissemination of sci- destinée au dépôt et à la diffusion de documents entific research documents, whether they are pub- scientifiques de niveau recherche, publiés ou non, lished or not. The documents may come from émanant des établissements d’enseignement et de teaching and research institutions in France or recherche français ou étrangers, des laboratoires abroad, or from public or private research centers. publics ou privés.

Thèse de Doctorat de L’université Pierre et Marie Curie - Paris IV Ecole doctorale Complexité du vivant CDV (ED 515) UMR1137 - IAME : équipe QEM

Présenté Par : Mr Adrien LAUNAY

Pour obtenir le grade de DOCTEUR de l’UNIVERSITE PARIS 6

ETUDE DE L'EMERGENCE DE LA DIVERSITE D'ESCHERICHIA COLI IN VIVO PAR SEQUENÇAGE DE GENOMES COMPLETS

Directeur de thèse : Olivier Tenaillon

Présenté et soutenue publiquement le 27 octobre 2016 Devant un jury composé de :

Dr P. GLASER - Président du jury Dr MA. JACQUES - Rapportrice Dr S. DA-RE - Rapportrice Dr G. ACHAZ - Examinateur Dr L. LANDRAU - Examinatrice Dr C. TINSLEY - Examinateur Dr O. Tenaillon - directeur de thèse Remerciements

Je dédie ce travail à Brice et à mes deux Grands-mères.

2 Remerciements

REMERCIEMENTS

Ce manuscrit représente le cheminement de 3 années de recherche, et de rebondissements. J’ai eu durant ces années la chance de pouvoir travailler sur le sujet de l’évolution qui me passionne depuis de nombreuses année. Les études d’évolution expérimentale, et Escherichia coli représentent pour moi une opportunité inégalable de véritablement observer l’évolution dans le vif du sujet. Ce projet m’a par conséquent particulièrement tenu à cœur, et je me dois de remercier tous les gens qui y ont participé directement comme indirectement.

Je tien tout d’abord à remercier particulièrement mon directeur de thèse Olivier, de m’avoir accueilli dans son équipe de recherche. Je le remercie pour sa supervision scientifique, ses conseils en informatiques et toutes les idées qu’il a pu me proposer (Bien qu’il m’ais parfois fallu un certain temps pour bien comprendre ce qui était demandé). Cette expérience a été très enrichissante tant sur le plan humain que scientifique.

Je tiens aussi à remercier tout particulièrement Erick Denamur pour sa précieuse aide, mais surtout pour son accueil chaleureux, au sein de ce qui allait devenir l’unité de recherche IAME.

Je tiens à remercier chaleureusement les membres de mon comité de thèse :

Philippe Glaser pour sa participation en tant que président du Jury, Sandra Da-Re et Marie Agnès Jacques qui ont accepté d’être mes deux rapportrices, Colin Tinsley et Luce Landreau pour leur participation a ce jury, et Guillaume Achaz pour son aide tout au long de ma thèse.

Merci aussi à tous les membres des équipes 1 et 2 de IAME pour ce qu'ils m'ont apporté scientifiquement mais aussi personnellement: Alejandro pour les cours d'espagnol, Alexandra pour son énergie, Harry pour les corrections d'anglais, Mélanie pour les corrections de français, Mohamed pour les discussions interminables, Meril, pour les bons moments, André, Audrey, Jérémie et Marine pour leur bonne humeur, Herve Lenagar pour son aide en informatique, Olivier Clermont, Herve Jaquier, Jean Paul Feugeas et Mathilde pour leur soutien. De même, je tiens à remercier Edouardo Rocha pour ses conseils avisés durant mes deux comités qui m’ont guidé et permis de peaufiner ma thèse.

3 Remerciements

Je tiens aussi, bien entendus, à remercier ma famille française et mexicaine et plus particulièrement mes deux parents et ma sœur sans qui tout cela n'aurait pas été possible. Merci de m'avoir donné le goût de la science et de m'avoir soutenu dans cette voie toutes ces années.

Merci à Alheli, ma merveilleuse femme, qui m’a supporté pendant ces trois années, et m’a apporté calme et réconfort au travers des meilleurs comme des plus mauvais moments.

Merci à mes amis pour les moments de détentes partagés, autour de quelques verres, ou de quelques jeux de société. Merci aux magisteriens et merci aux Palaisiens. Merci Benoit et Raphoo pour les jams sessions endiablés.

Merci à tous pour ces trois années pleines de recherche, de rires et de mariages.

4 Table des matières

TABLE DES MATIERES

Remerciements ...... 3

Table des matières ...... 5

List of abbreviations: ...... 8

I - INTRODUCTION ...... 9

1 E. coli: from ecology to genomics...... 10

1.1 E. coli a commensal of the gut of vertebrates ...... 10

1.2 E. coli a versatile pathogen ...... 11

1.2.1 E. coli are responsible for intra intestinal infections ...... 11

1.2.2 E. coli are responsible for extra intestinal infections ...... 14

1.3 E. coli a structured species ...... 15

1.3.1 Variability in strain carriage as commensals ...... 17

1.3.2 Variability among pathogenic isolation ...... 17

1.4 E. coli genome ...... 18

1.4.1 Core genome and pan genome ...... 18

1.4.2 Difference in gene content and adaptation ...... 19

1.4.3 Difference in conserved genes and adaptation ...... 20

2 In vitro experimental evolution: ...... 22

2.1 The different types of in vitro experimental evolution ...... 22

2.2 Long term evolution experiment (LTEE) ...... 23

2.3 Fitness improvement ...... 24

2.4 Mutator selection ...... 25

2.5 Genetics of mutation rate ...... 26

2.5.1 Control of mutation rate ...... 26

2.5.2 Selection for and against increased mutation rate ...... 29

2.6 Molecular bases of adaptation ...... 30

2.6.1 Genetic convergence ...... 30

5 Table des matières

2.6.2 Clonal interference ...... 31

2.6.3 Recruitment of mutations in global regulators ...... 31

3 Short term evolution in vivo and objectives ...... 36

3.1 Evolution in the mice ...... 36

3.2 Evolution in human ...... 37

3.3.1 Chronic infections ...... 37

3.3.2 Acute infections ...... 40

3.3.3 Commensal evolution ...... 41

3.4 Objectives ...... 42

4 Next generation sequencing (NGS) ...... 43

4.1 Sequencing technologies ...... 43

4.1.1 Sanger Sequencing ...... 43

4.1.2 Next generation sequencing principle: the case of 454 ...... 44

4.1.2 Next generation sequencing’s various technologies ...... 48

4.2 Bioinformatics of genomics ...... 50

4.2.1 With reference genome: alignment and SNP calling algorithms ...... 51

4.2.2 – Assembly of genome when there are no clear reference genome ...... 54

4.3 Computational challenges for bacterial sequencing and repeated regions ...... 57

4.3.1 Mapping and repetitions ...... 57

4.3.2 Assembly and repetitions ...... 58

4.4 NGS for the study of the ecology and evolution of E. coli ...... 59

II - EXPERIMENTAL EVOLUTION OF NATURAL STRAINS OF E. COLI IN A COMMENSAL ENVIRONMENT: THE MICE GUT MODEL ...... 64

1 E. coli experimental evolution in the mice gut ...... 65

2. Article n°1 ...... 67

2.1 Abstract...... 68

2.2 Background...... 69

2.4 Materials and Methods ...... 71

6 Table des matières

2.5 Results...... 75

2.6 Discussion ...... 81

2.7 Figures and tables...... 87

2.8 Suplementary data ...... 94

III - E. COLI EVOLUTION IN HUMANS: THE CASE OF ACUTE INFECTIONS ...... 100

1. Evolution of E. coli during acute infection: ...... 101

2. Article n°2 ...... 103

2.1 Abstract: ...... 104

2.2 Background ...... 105

2.3 Materials and Methods ...... 108

2.4 Results and discussion ...... 113

2.5 Conclusion ...... 123

2.6 Figures ...... 125

2.7 Suplementary data ...... 130

IV - DISCUSSION...... 145

1. E. coli experimental evolution: getting outside the lab ...... 146

2. Evolution in commensal environment ...... 146

3. Adaptation during acute infections...... 147

4. Perspectives: ...... 149

Supplementary studies ...... 151

nhaAR operon and the nhaA phylogeny ...... 151

The CC87 and its specific genes ...... 153

RESUME EN FRANCAIS: ...... 155

1 Evolution expérimentale chez la souris : ...... 157

2 Evolution lors d’infection aigüe ...... 158

Bibliography ...... 161

RESUME : ...... 176

ABSTRACT: ...... 176

7 List of abbreviations:

LIST OF ABBREVIATIONS:

SNP Single nucleotide polymorphsimes RA Read alignment evidence MMR Methyl directed miss-match repair JC New Junction evidence MC Missing coverage EXPEC Extra intestinal pathogenic E. coli EHEC Enterohaemoragic E. coli EIEC Enteroinvasives E. coli ETEC Enterotoxigenic E. coli EAEC Enteroaggregative E. coli DAEC diffusely adherent E. coli EPEC Enteropathogenic E. coli UPEC Uropathogenic E. coli UTI Urinary tract infection CRF Cerebrospinal fluid LTEE Long term evolution experiment ABU Asymptomatic bacteriuria strains NGS Next generation sequencing dNTPs deoxynucleosidetriphosphates ddNTPs di- deoxynucleosidetriphosphates MLST Multilocu sequence typing LB lysogeny broth K Killer NK Non killer IK intermediate killer

8 I - INTRODUCTION

1 E. coli: from ecology to genomics

1 E. COLI: FROM ECOLOGY TO GENOMICS

Amongst all the model organisms in biological studies, E. coli, and more specifically the K12 strain, remains one of the most used and studied. Since its discovery, it has been analyzed in depth and is now the most characterized bacteria. This success is linked to the ease with which E. coli can be cultivated in the laboratory and to the fact that it is amenable to genetic modifications, making it a first choice tool for biotechnologies. E. coli was used to discover bacterial conjugation in 1946, by Joshua Lederberg and Edward Tatum (Lederberg and Tatum, 1946), and has been at the heart of the development of modern genetics, allowing, for instance, Seymour Benzer to show that the gene is not an indivisible entity, as previously believed, and that they were furthermore linear(Benzer, 1959). More genetics and biochemical studies along the years allowed for the high quality annotation and understanding of the genome of the E. coli K12 laboratory strain. Besides its role in the laboratory and in biotechnologies, the ecological interaction of the E. coli species with human is also of great medical and scientific interest.

1.1 E. coli a commensal of the gut of vertebrates

E. coli is a gram negative, nonsporulating, and facultative anaerobe bacterium inhabiting the lower intestine of warm blooded animals and reptiles. It seems to have a special relationship with human: its prevalence is more than 90% in that host, but only 56% in wild animals, 23% in birds and 10% in reptiles (Gordon and Cowling, 2003). As a member of the gut microbiota, E. coli is the predominant aerobic organism in the large intestine, especially in the caecum and the colon, where it is encountered in the mucus layer. E. coli metabolism is adapted to this ecological niche from which it extracts nutrients, most notably gluconate (Tenaillon et al., 2010). E. coli’s doubling time is ranging from around 30 minutes in vitro to 120 minutes in its natural environment. The difference between the two conditions is partly explained by the fact that inside the gut, all the species forming the microbiota have to compete for the resources provided by the intestines. The natural environment of E. coli is therefore defined by the gut microbiota, the properties of the host digestive tract and its diet. The host is a resource provider as well as a mean of transport and dissemination. As such, the ecological interaction between E. coli and its host can be categorized as commensalism: one of the two organisms (the bacteria) benefits from the interaction, while the other (the host) is neither notably 10

1 E. coli: from ecology to genomics harmed nor helped by it. However, with a more detailed analysis, E. coli can be considered as a somewhat helping factor for the host as it protects it against colonization by gut-pathogens, thanks to its competitive characteristics, and through the production of bacetriocins (Vollaard and Clasener, 1994). E. coli was also proved to help in the production of vitamins (Bentley and Meganathan, 1982), and for the stimulation of the immune system (Nowrouzian et al., 2003). The relationship between E. coli and its host can therefore be connected to “mutualism”.

1.2 E. coli a versatile pathogen

E. coli is a multiple faced species. In fact, it can also be qualified as the “enemy within”, as it turns out in any case to be a dangerous pathogen. The afflictions caused by E. coli are classified in mainly two syndromes: intraintestinal diseases (enteric/diarrheal diseases), and extraintestinal diseases (urinary tract infections (UTI) and sepsis/meningitis) (table 1). Each year, E. coli causes about a million deaths due to diarrhea in infants (Kosek et al., 2003; Kotloff et al., 1999) and extraintestinal infections (Russo and Johnson, 2003). Moreover, E. coli is also responsible for approximately 150 million cases of uncomplicated cystitis (Russo and Johnson, 2003).

location type manifestations Enteropathogenic E. coli (EPEC) watery diarrhea hemolytic syndrome, Enterohaemoragic E. coli (EHEC) hemorrhagic colitis intra- Enterotoxigenic E. coli (ETEC) watery diarrhea intestinal Enteroaggregative E. coli (EAEC) acute diarrhea Enteroinvasives E. coli (EIEC), dysenteric syndrome Diffusely adherent E. coli (DAEC) acute diarrhea extra Uropathogenic E. coli (UPEC) cystitis, pyelonephritis intestinal Sepsis/meningitis Bacteremia, meningitis

Table 1: the different types of E. coli infections and their manifestations

1.2.1 E. coli are responsible for intra intestinal infections

The strains responsible for intraintestinal infections are divided in six well characterized categories (Kaper et al., 2004). The enteropathogenic E. coli (EPEC) responsible of watery diarrhea 11

1 E. coli: from ecology to genomics was the first E. coli pathotype described in 1945 in the United Kingdom (fig 1a). This type of E. coli tightly binds to the epithelial cells of the intestines and modifies its cytoskeleton creating a pedestal like shape. Similarly, the enterohaemoragic E. coli (EHEC) can modify the shape of the cells of the intestines, but also produce shiga toxins. It is generally found in the bovine intestinal tract, and can contaminate humans mainly through the consumption of undercooked meat, and raw vegetables (fig 1b). EHEC are responsible of bloody diarrhea as well as the life threatening complication: hemolytic uremic syndrome, particularly among children. The most frequent serogroup found among these strains, is E. coli O157:H7, however other serogroups have been described. EHEC O104:H4 for instance, was responsible for the mostly German outbreak in 2011 that provoked 53 deaths and over 4000 cases of bloody diarrhea. The putative source of contamination originated from some Egyptian organic fenugreek sprouts (Buchholz et al., 2011). The enterotoxigenic E.coli (ETEC) is responsible for traveler’s diarrhea that results from enterotoxin production (fig 1c). The enteroaggregative E.coli (EAEC) also produces enterotoxins and cytotoxines, but most notably creates a biofilm by stacking bacteria on one another on the epithelia. These strains are responsible of watery diarrhea in young children from developing countries. (fig 1d). The enteroinvasives E. coli (EIEC), are highly invasive strains that produce no toxins but use adhesion proteins to bind and enter the epithelial cells, multiply inside it and moves from adjacent cells to adjacent cells causing cell destructions (fig 1e). They are closely related to Shigellas, a group of E. coli that have acquired a virulence plasmid and that produce shiga toxins. They exhibit the same clinical syndrome as the Shigella ranging from a mild diarrhea to dysentery. Finally, the diffusely adherent E. coli (DAEC), induce the development of long cellular extensions that wraps around the adherent bacteria (fig 1f). These strains are associated with persistent diarrheas among children in developing countries.

12

1 E. coli: from ecology to genomics

Figure 1 Pathogenic schema of diarrhoeagenic E. coli. The six recognized categories of diarrhoeagenic E. coli each have unique features in their interaction with eukaryotic cells. Here, the interaction of each category with a typical target cell is schematically represented. These descriptions are largely the result of in vitro studies and might not completely reflect the phenomena that occurs in infected humans. a | EPEC adhere to small bowel enterocytes, but destroy the normal microvillar architecture, inducing the characteristic attaching and effacing lesion. Cytoskeletal derangements are accompanied by an inflammatory response and diarrhoea. 1. Initial adhesion, 2. Protein translocation by type III secretion, 3. Pedestal formation. b | EHEC also induce the attaching and effacing lesion, but in the colon. The distinguishing feature of EHEC is the elaboration of Shiga toxin (Stx), systemic absorption of which leads to potentially life- threatening complications. c | Similarly, ETEC adhere to small bowel enterocytes and induce watery diarrhoea by the secretion of heat-labile (LT) and/or heat-stable (ST) enterotoxins. d | EAEC adheres to small and large bowel epithelia in a thick biofilm and elaborates secretory enterotoxins and cytotoxins. e | EIEC invades the colonic epithelial cell, lyses the phagosome and moves through the cell by nucleating actin microfilaments. The bacteria might move laterally through the epithelium by direct cell-to-cell spread or might exit and re-enter the baso-lateral plasma membrane. f | DAEC elicits a characteristic signal transduction effect in small bowel enterocytes that manifests as the growth of long finger-like cellular projections, which wrap around the bacteria. AAF, aggregative adherence fimbriae; BFP, bundle-forming pilus; CFA, colonization factor antigen; DAF, decay-accelerating factor; EAST1, enteroaggregative E. coli ST1; LT, heat-labile enterotoxin; ShET1, Shigella enterotoxin 1; ST, heat-stable enterotoxin. (from Kaper et al, 2004)

13

1 E. coli: from ecology to genomics

1.2.2 E. coli are responsible for extra intestinal infections

In addition to the intraintestinal , some E. coli (named ExPEC) also behave as opportunistic pathogens involved in extraintestinal diseases (fig2), making E. coli the most common infecting bacteria of the urinary tract. Indeed, E. coli infects every other woman at least once in her life (Russo and Johnson, 2003). In this type of infection, the Uropathogenic E. coli (UPEC) strains bind to urinary tract urothelial cells using type 1 fimbriae to colonize the bladder. The attachment then triggers apoptosis and exfoliation of the bladder epithelial cells.

In cystitis, the type I fimbriae is continually expressed and the infection is confined to the bladder, while in pyelonephritis the type I fimbriae expression is turned off which allows the organism to ascend through the ureter to the kidneys where it can attach by P fimbriae to receptors on the kidney epithelium, that end up being damaged by the haemolysin and other bacterial products.

E. coli can also cross the one cell thick blood barrier to cause a bacteremia. From the blood, E. coli can invade further the central nervous system especially in premature newborn for which the full development of the brain barrier is not yet achieved. This results in neo-natal meningitis with a fatality rate of 15-40%, and severe neurological defects in the survivors. Finally, E. coli is responsible for bacterial pneumonia, spontaneous peritonitis and even ascending cholangitis (Russo and Johnson, 2003). These extraintestinal pathologies have long been neglected as a major , partly because of their versatility. However in 2003, a global study revealed that they could produce close to a million death a year and more than 100,000,000 hospitalizations worldwide (Russo and Johnson, 2003). All these factors make E. coli a public health enemy, and as not all the members of the species have the same propensity to produce different pathologies, it is of great interest to understand the underlying evolutionary mechanisms that rules its adaptation, and to understand what factors determines the pathogenicity and the life style of the strains.

14

1 E. coli: from ecology to genomics

Figure 2: Pathogenesis of urinary tract infection caused by uropathogenic E. coli. The figure shows the different stages of a urinary tract infection. Panels 2, 4, 5 and 11 are courtesy of N. Gunther, A. Jansen, X. Li and D. Auyer (University of Maryland), respectively. CFU, colony-forming units; PMNs, polymorphonuclear leukocytes. (from Kaper et al, 2004)

1.3 E. coli a structured species

To study adaptation, an epidemiological approach can be used to track down the diversity of the species. Along those lines, the first step is to produce a phylogeny of the different E. coli known strains. Studies based on both multilocus enzyme electrophoresis and DNA markers allowed the identification of 4 main phylogenetic groups A, B1, D and B2 and a potential fifth group E (fig 3). With more genomics input and more strains, up to 7 groups have now been identified, adding groups C and F (Herzer et al., 1990). Interestingly these phylogenetic groups, or phylogroups, are also characterized by different profiles of isolation both as commensal and as pathogens suggesting an association between those groups and the different lifestyles of E. coli strains (Escobar-Paramo, 2004). 15

1 E. coli: from ecology to genomics

Fig 3: tree based on the simultaneous analysis of the trpA, trpB pabB, putP, icd and polB using parsimony and rooted on E. fergusonii. Bootstrap values higher than 50% are indicated above the nodes. The vertical bars delineate the major phylogenetic groups: A, B1, C, E, D, and B2.(from Escobar-Paramo, 2004)

16

1 E. coli: from ecology to genomics

1.3.1 Variability in strain carriage as commensals

The different E. coli phylogroups appear to be differentially distributed in the different commensal habitats. For instance the prevalence of the four main phylogroups varies significantly between humans and animals. In human, A group is predominant (40, 5%), followed by B2 strains (25, 5%), and lastly B1 and D that are the least represented (17% each). In animals it is the B1 strains that are dominant (41%), followed by A (22%), B2 (21%) and D with the lower prevalence (16%) (Tenaillon et al., 2010). Domesticated animals have also an increased proportion of A strains than their wild counterparts (from 14% in the wild to 27% in zoo animals), and less B2 (from 30% to 11% respectively). Further studies, showed that those observations could not be simply explained by the existence of a host specific strain, as only a low number of strains seemed to exhibit such a relation with their host. Indeed large variations in the prevalence of the different subgroups could also be found among human populations, suggesting that environmental factors were playing a major role in shaping the distribution of the subgroups. Comparative studies focusing on human (Duriez et al., 2001; Tenaillon et al., 2010) showed that commensal E. coli strains isolated from Europe in the 1980’s (France and Croatia), Asia (Pakistan), Africa (Mali and Benin), and South America (French Guiana, Colombia and Bolivia) belonged predominantly to the A (55%) and B1 (21%) groups, whereas strains from the D (14%) and B2 (10%) phylogroups were less common. On the other hand, strains isolated from Europe in the 2000’s (France and Sweden), Japan, North America (USA), and Australia belonged mainly to the B2 group (43%), followed by the A (24%), D (21%), and to a lesser extend B1 (12%) groups (table 2) (Escobar-Paramo, 2004). Here, we can clearly see a contrast between B2 prevalences, which is high in the industrialized countries, whereas it is lower in tropical countries. This observation was confirmed by the study of metropolitan French people that moved to French Guyana (Skurnik et al., 2008). Their E. coli phylogroup prevalence became somewhat intermediary between the two patterns, having a prevalence of A similar to metropolitan French people, and a prevalence of B2 similar to the one of the native resident. We could consider that those differences could be related to geographical localization or climate, but the shift in B2 proportion in France between 1980 and 2000, seems to favor socioeconomics factors, such as dietary or hygiene habits.

1.3.2 Variability among pathogenic isolation

17

1 E. coli: from ecology to genomics

Associations could also be found between phylogroups and pathogenicity (Escobar-Paramo, 2004). Indeed it is obvious that ExPEC strains are clustered mostly in two phylogenetic groups, B2 and D (Duriez et al., 2001). Concerning intraintestinal pathogenic strains, we can see that the EPEC strains are distributed in two groups: B1 and B2. EHEC strains on the other hand are distributed among groups A and B1, but the major concentration of those strains are found in strains of serotype O157:H7 of group E. ETEC strains are found in groups A, B1, and C. Shigellas and EIEC strains are highly localized and represent highly specialized monophyletic groups. Lastly, DAEC and EAEC strains are more or less distributed all over the phylogroups, except in the E group (fig 3). With those observations, we can clearly see a relation between the life style of the strains and their phylogroups, but we can still wonder what the determinants of such behaviors are. Part of the answer came around the year 2000 and with the beginning of the sequencing era that gave a better understanding of the genome architecture of E. coli.

1.4 E. coli genome

1.4.1 Core genome and pan genome

Even though E. coli is considered the preferred model for biochemical genetics, and was one of the earliest organisms suggested for whole genome sequencing, the K12 strain of E. coli was only the seventh organism to be fully sequenced. The first complete genome sequence of E. coli was published in 1997 (Blattner et al., n.d.). It was followed in 2001 by the O157:H7 strain that was responsible for a hemorrhagic colitis outbreak in the school of Sakai city in Japan (Hayashi et al., 2001). This study revealed a new facet of E. coli genomics as it identified important differences between the genome of the two strains, with a 859kb difference, and 1632 proteins not encoded in K12. Further studies showed similar results, and it became useful to distinguish between two categories of genes (fig 4). In the first one, called the genes of the core genome, we would consider the genes that are ubiquitous among the species or the group of genomes studied, meaning genes that we are assured to find in every new strain sampled. This group is accounting for around 2,000 genes in the E. coli genome. On the other hand, the second group represents the totality of the gene collected from the specie, which combines both the core genome and the accessory genes. It is composed of more than 15,000 genes. The extended set of genes is usually called the pan genome, but pan genome can sometimes also be used in opposition to the core genome, to designate the variable part of the

18

1 E. coli: from ecology to genomics genome, meaning the genes for which the presence in a newly sequenced strain is uncertain. In this thesis, we will refer to pan genome for the whole set of genes and to accessory genes for the genes that are found only in subset of strains. Taking into account that the mean gene content of an E. coli strain is around 4 to 5 thousand genes, we can see that only half of those belongs to the core genome, while the rest of the genes are only partially found in the species. Those differences in genome could be explained by horizontal gene transfer, in which one strain from the same species, or not, transfers genes to a second one using plasmids or bacteriophages that mediate the dissemination of genetic material. Overall, those observations showed how plastic the genomes of E. coli can be, changing the vision that the scientific community had of a rather stable genome, to a highly dynamic structure with a constant flux of insertion and deletions of genes. As the conserved part is highly conserved and genes are not inserted randomly, the genomic organization of E. coli can be qualified as organized disorder (Touchon et al., 2009).

Fig 4: Representation of the all gene repertoire of E. coli. The pan genome designates the ensemble of gene that may be found in some E. coli genomes, while the core genome indicates the part of the genome that is common to all strain. Numbers given represent an approximate number of genes.

1.4.2 Difference in gene content and adaptation

Among the genes of the accessory genome that differed between pathogenic and non- pathogenic strain, one could identify genes related to pathogenicity that can be considered as

19

1 E. coli: from ecology to genomics virulence factors. Indeed, when searching through the difference in gene content of the non- pathogenic K12 and pathogenic 0157:H7 strain, from Japan, the researcher were able to identify 131 genes that seemed to be related to virulence (Hayashi et al., 2001). Because those factors are generating diarrhea, we can easily understand that they have an adaptive value, as diarrhea offers a good way for a strain to move from one host to another. However things do not look so obvious when looking at extraintestinal pathogenic strains, for which the pathogenic phenotype does not confer any transmission advantage. It then became interesting to look into the reasons for the existence of such virulence factors. Indeed, they could seem to drive their carrier into an evolutionary dead end that would, by killing or weakening the host, reduce its own transmission to other host. This is even more troubling when we know that the loss of virulence seems relatively easy to achieve. Such a paradox could be explained by looking at the extraintestinal strains of the B2 phylogroup, for which virulence factors were identified as genes coding for adhesins or proteins involved in iron metabolism for instance. Indeed, such genes where also identified as being associated with persistence of the strains in the commensal microbiota (Le Gall et al., 2007). Furthermore, it was shown that deletion of all pathogenicity island of strain 536 came at a cost in a mice model of commensalism (Diard et al., 2010). In other words, it is because E. coli strains are selected to be highly competitive in one particular environment, the human gut for instance, that they can survive, grow and become consequently pathogenic in another environment such as the urinary tract. In the end, extraintestinal virulence in E. coli is suggested to be a coincidental by product of commensalism.

1.4.3 Difference in conserved genes and adaptation

Besides presence and absence of genes, adaptation may also rely on variation in conserved genes. How much of adaptation does rely on gene acquisition and how much is achieved by modifying the core genome remains an important question. Looking at gene presence and absence is easy, therefore, as soon as genomes were sequenced it was possible to do so. Researchers looked for instance at how to set apart the pathogenic strains from the commensal ones using the differences in genes content. While the virulence factors were identified in that way, presence and absence could not explain all the specificities of the E. coli phylogroups. As mentioned before these groups have specific patterns of prevalence and associated pathologies, yet, no clear group of genes were identified as being determinant for such a differentiation. For example, there was no clear signature in gene content of the B2 group, and horizontal transfer alone wasn’t sufficient to account for the differences between the different groups (Touchon et al., 2009). This shows that the differentiation of E. coli

20

1 E. coli: from ecology to genomics phylogroups and phenotypes does not only come from the variation in gene content, but can also arise from modifications inside the core genome. An example of such gene modifications could be the ones observed in the fimH gene. This gene code for the type I fimbriae, a protein that is implicated in the binding of E. coli to mucosa. It was showed that naturally occurring mutations in fimH result in a better binding to the urinary tract. FimH mutations are therefore an important factor when shifting from the gut commensal habitat to the urinary tract more pathologic habitat (Sokurenko et al., 1998). Therefore it is of great interest to look at the microevolution of E. coli, if we want to gain a full insight its adaptation process in the different habitats humans offer.

21

2 In vitro experimental evolution:

2 IN VITRO EXPERIMENTAL EVOLUTION:

While ecological and evolutionary questions can be tackled using comparative and theoretical observations, bacteria offer the unique opportunity to study their adaptation in the laboratory, in the course of experimental evolution experiments. In such a approach, an organism is placed in a controlled environment in which growth is assured, and its evolution is assessed along the way by phenotypic and genotypic studies.

2.1 The different types of in vitro experimental evolution

Three main techniques can be used to perform experimental evolution (Barrick and Lenski, 2013) (fig 5). They can be adapted to both unicellular and multicellular organisms:

- In the first method called mutation accumulation experiments, a random single individual of a population is sampled at each cycle (fig 5a). The resulting strong genetic bottleneck impedes the opportunity for selection and leads to the fixation of arbitrary mutations with any (but lethal) fitness value. This protocol is therefore used to quantify the mutation rates, biases and effect in the absence of the filtering action of natural selection. The simplest protocol is to plate a colony on a fresh agar plate daily and to pick a random colony on the incubated plate to initiate the next cycle.

-In the two other methods, on the other hand, selection for the fittest individual is at play, leading to an adaptive evolution. In these two methods, repeated growth is assured by a renewal of the media. The first way to do so, is to use a chemostat (fig 5b). A chemostat is a form of test tube in which a constant flow of new media is incoming while old media is continuously discarded from the culture. Bacteria present in the culture are therefore kept in a constant environment at the limit of starvation as they reach an equilibrium in which they exhaust all resources but the new coming ones. This system has several main drawbacks. First, expensive peristaltic pumps are needed so the replication is always limited as well as the power to analyze evolved lineages. Second contamination of the incoming nutrients is always an issue. Third, the bacteria can evolve to stick to the tube walls and are then no longer affected by the outgoing flux. Fourth, as bacteria are close to starvation the first mutations to invade the population are inactivation of rpoS, the general stress response that allow growth to persist in very low nutrient concentration. However such a pleiotropic mutation may not necessarily be desired. 22

2 In vitro experimental evolution:

- The last method developed to evolve bacteria in the laboratory is the use of serial transfers (fig 5c). In that setting, every cycle is characterized by the inoculation of a subsample of the previous cycle’s population in a fresh tube or flask filled with fresh media. The size of the bottleneck is set to maintain a sufficient number of individuals to keep genetic diversity high enough from transfer to transfer to allow adaptation to proceed (Barrick and Lenski, 2013). It is this form of experiment that was mainly used for experimental evolution studies implying E. coli. It has several benefits: it requires low technology equipment’s allowing (i) high replication, (ii) high analytical power for evolved lineages characterization (iii) low contamination rate. Moreover, there are no issues with biofilm formation. However, this system has one drawback: the environment is quite complex with a daily cycle of lag, exponential growth and stationary phase.

Fig 5: the different methods of experimental evolution and their effect on population sizes. (from Barrick and Lenski, 2013)

2.2 Long term evolution experiment (LTEE)

Since the 90ies serial transfer has become the dominant system with notably the most famous experiment using it. In 1988, Richard Lenski, at the University of California Irvine, (and now at 23

2 In vitro experimental evolution:

Michigan State university) started his now famous long term evolution experiment (LTEE) that is still running today (Lenski et al., 1991; Wiser et al., 2013). In that experiment, 12 replicate populations derived from E. coli B (a strain isolated by Felix d’Herelle) has been left to evolve through serial transfer on a daily basis with a 100 fold dilution in 10ml of fresh media containing very limited amount of glucose buffered with citrate. Every 500 generations (75 days) a frozen record of every population was stored for further studies. The revival of these samples allowed direct comparison with the ancestral clones for phenotypic assays or fitness measurements. In February 2016, the strains from Lenski’s experiment reached 64,000 generations of evolution. Many landmark papers have been published from that experiment, dealing with the emergence of innovation or, studying more basic facets of adaptation: fitness improvement (Wiser et al., 2013), and changes in mutation rates (Sniegowski et al., 1997). For instance, one of the most famous observations was the apparition of strains able to use the citrate of the media around the generation 31,000 (Blount et al., 2012, 2008). , Citrate is normally not accessible as a nutrient for E. coli. This interesting observation was furthermore reproducible with 19 different successful attempts at evolving citrate usage from the ancestor of the same lineage but only after the generation 20,000, suggesting that another older mutation was required for such a phenotype to appear.

2.3 Fitness improvement

One of the most basic observation of the LTEE is that fitness is increasing with time. When comparing the fitness of the newly evolved strains to the ancestral strain of E. coli REL606, it was obvious that the strain had adapted to the new environment, having a fitness increased by almost a 1.65 fold at 50,000 generations (Wiser et al., 2013). In itself this observation is a validation of the experimental protocol, but it was also really interesting to see that the adaptation seemed to slow down through time. Indeed, if fitness seemed to increase by 65% at 50,000 generations, it was already almost at 50% at 10,000, and therefore had an increase of only 15% for 40,000 generations afterwards (fig 6). However even if adaptation slowed down, it was still present, and never stopped even after 50,000 generations of adaptation. Models using a power law seemed furthermore to fit best with the data, comforting a never-ending but always slower fitness increase. One reason for the decaying rate of adaptation may come from the fact that beneficial mutations have a lesser impact on fitness in the presence of other beneficial mutations, a phenomenon called diminishing return epistasis (Chou et al., 2011; Khan et al., 2011).

24

2 In vitro experimental evolution:

Fig 6: Relative fitness in comparison to the ancestral strain across 50,000 generations of the LTEE. The lines represents a hyperbolic model and a power law model (from Wiser et al 2013)

2.4 Mutator selection

Another interesting observation that came out of the study of these lineages was the evolution of their mutation rate (Sniegowski et al., 1997). Mutation rates can be computed in two ways. Experimentally by plating the lineages on selective media: counting the selected mutants is then used to estimate a mutation rate. Now with genomics, analyses of the accumulation of mutations through time in the lineages can also be used to infer changes in mutation rates. An early study of the LTEE lineages at 10,000 generations revealed that 3 lineages, Ara-2 Ara-3 and Ara+4, out the 12 original lineages, where found to have an increased mutation rate (fig 7). Furthermore it was shown that the increased mutation rate remained in the population. This experiment then offered an interesting view on how mutator alleles spontaneously emerged, and raised to high frequency in populations adapting to new environments. Later studies using genomics have showed that 6 out of the 12 populations have become mutator populations (Tenaillon et al., 2016). Overall, 4 populations have inactivated genes in the mismatch repair gene and 2 populations have selected an inactivation of mutT. Interestingly, a close study of one of the population has revealed that mutation rate may decay after fixation of mutator alleles. Population Ara-1, in which mutator increased to fixation at around generation 25,000, showed a twofold decrease in mutation rate that emerged through two independent mutations in mutY. (See below) Furthermore, another population Ara-2, presented an even stronger mutation rate fluctuation as the mutation rate reverts towards its ancestral state after about 30,000 generations at high mutation rate. In that case, it is an exact reversion of the mutation that lead to the mutator 25

2 In vitro experimental evolution: phenotype that was selected for (O. Tenaillon unpublished data). Hence the LTEE provides a complete picture of the rise and fall of mutator alleles or how natural selection may modulate mutation rate. As this aspect will be further investigated in the results of this thesis, I will now present some more details on the genetic control of mutation rate and how selection is affecting it.

Fig 7: Rates of mutations to nalidixic acid resistance in the 12 Ara- and Ara+ populations at 10,000 generations (from Sniegowski et al, 1997)

2.5 Genetics of mutation rate

2.5.1 Control of mutation rate

During the life of a bacteria, its DNA can be altered in many different ways. Therefore several mechanisms exist in the cell to mitigate those mutations. They act at different levels (Friedberg et al., 2005)

-It is first during the replication phase that the higher risk of mutation can be observed. Indeed, the DNA polymerase can produce around one mismatch every 10,000 to 100,000 base pair. To counteract those replication errors, the DNA polymerase III, the one responsible of the main replication of the chromosome, possess a subunit encoded by the gene dnaQ, which allows for a proofreading function. Thanks to this subunit, the fidelity of the replication is multiplied by 10,000.

-After replication, some mistakes may have escaped the proofreading activity of the polymerase. a system known as the methyl-directed mismatch repair (MMR) is recruited is 26

2 In vitro experimental evolution: specialized in taking care of the remaining mismatches. It uses 3 proteins: MutH, MutS and MutL. The MutS protein detects the mismatch, and reacts with MutL to recruit and activates MutH into a complex. MutL filaments drive MutH to specific sites that are targets of methylation. Just after replication, there is a time window during which only the matrix strand is methylated while the neosynthetized strand has not yet been methylated. The MutH protein uses that asymmetric methylation to identify the neosynthetized strand. It then recruits a helicase and some exonuclease in order to cut the neosynthetized strand and degrade a fraction of it including the detected mismatch. DNA polymerase, PolIII (Li, 2008) is then recruited to resynthesize the degraded fraction (Harfe and Jinks-Robertson, 2000) (fig 9).

Fig 9: the bacterial paradigm for mismatch repair of DNA replication errors (Harfe and Jinks-Robertson, 2000)

-In addition to replication errors, chemical of physical alteration of the DNA are also at the origin of mutations: they generate non-conventional structures that may be repaired or replicated

27

2 In vitro experimental evolution: erroneously. One highly mutagenic damage is the oxidation of the dNTPs G in 8-oxo-GTP in the nucleotidic pool of the cell. Indeed once integrated in the DNA, the 8-oxo-GTP, can be matched equally with a dATP or a dCTP by the polymerases creating AT->CG mutations. To counteract these mutations, several enzymes have evolved and are conserved from bacteria to human. Among them, the MutT protein is able to hydrolyses 8-oxo-GTP in 8-oxoGDP before it is used for DNA replication (Fowler and Schaaper, 1997). The MutY and MutM glycolase also enter into action if the modified nucleotide is integrated inside the DNA. MutM removes the oxidized base in the Go:C pairs. MutY protein is acting on the Go:A mismatches that it resolves as if the A was the mis-incorporated base (fig 10).

Fig 10: Consequences of incorporation of 8-oxodGTP opposite template C and A, and the role of the mutT or mutY and mutM genes in avoiding A·T→C·G and G·C→T·A transversion mutations. 8-oxodG is represented by a circled G. Not included in this diagram is oxidation of G to 8-oxoG while in DNA. These 8-oxoGs are thought to be processed as 8-oxoGs incorporated opposite C. If unrepaired by the MutM and MutY pathways, they will result in G·C→T·A transversion (from fowler and shaper, 1997)

If mutation is often discussed as needed for adaptation, the mutation rate resulting from the action of all these repairs and fidelity systems is extremely low. Using the synonymous mutations in

28

2 In vitro experimental evolution: the LTEE, an average per base mutation rate of 0.8 10-10 was found (Wielgoss et al., 2011). However, inactivation of the repair mechanisms lead to an increase in mutation rate. As different repair mechanisms play on different facets of the error correction and fidelity, their inactivation results in variable rates. For instance, inactivation of the Mismatch repair genes mutL or mutS, lead to an increase of about 100 fold, with mostly transitions. Inactivation of mutT, results in a 500-fold increase in AT to CG mutations. mutY inactivation results in a 10 fold increase in G to T mutations, however, it lead to a reduction of AT to CG mutation rate if mutT is inactivated. If mutT is not functional, then incorporation of Go in the DNA is the source of Go and not the oxidation of G in the DNA. As Go may be incorporated in front of an A with 50% chance, then mutY will treat the A as the source of problem and mis-repair it to a C enforcing the mutation. Hence, MutY inactivation can lead to an increase or a decrease in the overall mutation rate depending in which background it occurs.

2.5.2 Selection for and against increased mutation rate

How can selection act on mutation rate? Models of population genetics have been used to tackle this question for a century (Sturtevant, 1937). Selection of an allele changing the mutation rate may not directly result from its direct effect on fitness, but rather on the effect of the mutations it generates, a process called second order selection (Tenaillon et al., 2001). This process is particularly efficient in asexual populations in which an allele increasing the mutation rate or mutator allele remains physically linked to the mutations it produces and therefore may benefit or suffer from their fitness effect. As most mutations are deleterious, overproducing mutations is costly and alleles decreasing mutation rate owe to be selected for, as the low bacterial mutation rate suggest. However, simulations and theory show that a few beneficial mutations produced by the mutator alleles in excess may lead to an increase in mutator allele frequency (Tanaka et al., 2003). However, the association between beneficial mutations and mutator allele is only possible if the mutator population is large enough to produce some beneficial mutations. This has been experimentally tested in vitro or in the gut of mice by competing different fraction of mutator alleles with the wild type (Giraud, 2001). When the fraction of mutator is low, their subpopulation is not large enough to produce the rare beneficial mutations and they will eventually go extinct as they overproduce frequent deleterious mutations. On the other hand, the wild type population will be large enough to produce the rare beneficial mutations. When the frequency of mutator is higher they produce beneficial mutations and hitch-hike with those to high frequency and subsequently increase their chance to produce the next beneficial mutation (Chao and Cox, 1983). Theory has shown that a few large effect beneficial

29

2 In vitro experimental evolution: mutations or some recurrent beneficial mutations due to changing environment were enough to select for mutator alleles in large populations despite a constant overproduction of deleterious mutations at much higher rates (Tanaka et al., 2003; Tenaillon et al., 1999). One easy way to select for mutator is to apply multiple antibiotic treatments (Mao et al., 1997). If the mutator is increasing X fold the mutation rate and its initial subpopulation is large enough to produce at least one resistant bacteria per round of selection, then each round of antibiotic treatment is increasing the mutator frequency by an X fold factor. With a 100 fold effect, the initial frequency is increased by a millions fold with three consecutive treatments. This type of selection is certainly responsible for the high prevalence of mutators in bacterial populations colonizing the lung of cystic fibrosis patient (Oliver, 2000).

Overall, theories and experimental data suggest that observing mutator alleles at high frequency in populations is a signature of an active process of adaptation, in which many beneficial mutations are being selected for, while the mutator allele hitchhikes with them.

2.6 Molecular bases of adaptation

2.6.1 Genetic convergence

In addition to observations on the dynamics of adaptation, experimental evolution provides also some insights on the molecular bases of adaptation. Sequencing costs allow now with the coupling of experimental evolution with deep sequencing to uncover the primary targets of natural selection. Using whole genome sequencing, we can compare the evolved lineages to the ancestral strain in order to identify mutations that have accumulated in the genome in the course of the experiment. Using this method 42 mutations were identified in the ARA-1 lineage from the LTEE experiment (Barrick et al., 2009). As none of the coding mutations were found to be synonymous mutations, it appears that most mutations should result from the action of natural selection rather than chance. Furthermore, convergence at the genes level could be observed between all the lineages of the LTEE experiment. When looking at 14 mutated genes from ARA-1 in the other 12 lineages, 3 mutations were present in all lineages, 9 mutations were found in another lineage, and only 2 genes were specific to ARA-1. This evolutionary convergence can be considered as a hallmark of adaptation: Mutations in genes that grant an adaptive advantage will be selected in several populations independently. In another study convergence was analyzed through the comparison of 115 lineages adapting to high temperature for 2,000 generations. Here again convergence was 30

2 In vitro experimental evolution: observed and this time hundreds of different adaptive mutations were identified suggesting that many mutations can contribute to adaptation (Tenaillon et al., 2012).

2.6.2 Clonal interference

As both selection of mutators and multiple sign of convergence suggest, bacterial adaptation in vitro is supported by many independent beneficial mutations. Yet, in asexual populations these mutations cannot go to fixation simultaneously and they compete with one another in a phenomenon called clonal interference (Gerrish and Lenski, 1998). This phenomenon was demonstrated experimentally multiple times. For instance, Lang et al followed some asexual population of yeast (Lang et al., 2013), and by sequencing the whole population at the genome level they could show that clusters of mutations were competing with one another until one finally reaches fixation. In extreme cases, when one loci is under strong selection and has a high rate of beneficial mutation, clonal interference can even be seen within a gene: several mutations affecting the same gene are found segregating in the population. This was shown in Methylobacterium extorquens AM1 in which up to 17 beneficial alleles inactivating trfA coexisted in a single population (Lee and Marx, 2013). So clonal interference seems to be the regime under which beneficial mutations evolve in experimental evolution.

2.6.3 Recruitment of mutations in global regulators

What types of mutations are found? At the molecular level all types of mutations can be found: single nucleotide polymorphisms (SNPs), indels, large deletions and duplications, IS element insertions. Many gene inactivations seem to be adaptive, but adaptation also recruit mutations in important or essential genes. In fact, among the first targets of adaptation, mutations in global regulators play an important role. In the LTEE, among the genes mutated early on across several lineages, two genes, spoT and topA were found to be implicated in the supercoiling of DNA (Cooper et al., 2003). As such, these genes are global regulators as coiling of DNA plays a role on the expression of many genes. In the thermal adaptation previously mentioned, the clearest target was the very pleiotropic RNA polymerase gene. In combination with other studies (Hindré et al., n.d.), these observations suggest that adaptation recruit first modifications in global regulatory pleiotropic hubs, controlling many genes, and later modification in more specialized regulators with a narrow 31

2 In vitro experimental evolution: spectrum that adjust the effects of the first mutations (fig 11). This can be seen as a quick and dirty way to adapt to a new environment by selecting first for mutations that will have a great effect on the bacteria, and, in a second step of fine tuning, either by adjusting the possibly deleterious side effects of the first steps of adaptation, or adapting one step further.

32

2 In vitro experimental evolution:

Fig 11: In vivo evolution of regulatory networks. The structure of the network is hierarchical in the ancestral strain, with hub-encoding genes (dark blue circles) controlling many lower-level regulatory and target genes (light blue circles). The expression profiles of each gene as a function of growth conditions are shown as green curves, with the sum of all curves defining the phenotypic landscape of the individual. The early steps of experimental evolution are characterized by the occurrence of a mutation (yellow arrow) within a hub-encoding gene (red circle), which subsequently shows altered expression. These changes diffuse (red arrows) inside the entire connected network and affect the expression (red curves) of many genes. The overall effect is beneficial, but there are many pleiotropic effects. Later during evolution, compensatory mutations that correct for the deleterious effects of the early mutation affect lower-level genes (orange circles) in the regulatory network and almost restore their expression as well as the expression of some target genes (orange curves). This figure highlights the evolvable structure and organization of regulatory networks. (from Hindré et al, 2012)

33

2 In vitro experimental evolution:

The recruitment of global regulators in the early steps of adaptation seems to be a hallmark of bacterial experimental evolution in the laboratory. Yet, when one looks at the natural diversity present in these genes, it is surprising to see that they seem incredibly stable in nature. For instance the RNA polymerase genes, like rpoB are often found to be mutated in laboratory experiment, but seems to be really conserved in the wild (fig 12). Why are changes in this operon so often favored in experimental evolution whereas those genes are highly conserved in nature? There is no clear answer to these question, but this somehow suggests that these environment differ drastically from nature.

34

2 In vitro experimental evolution:

Fig 12: RNA polymerase B (rpoB) sequence alignments for replicate populations (rows) evolved at high temperature79 (top panel) and those in naturally occurring isolates148 (bottom panel). rpoB is targeted repeatedly in laboratory-based adaptation to high temperature, but is largely invariant between naturally occurring E. coli strains. Colors correspond to different base changes (A, green; C, blue; G, orange; T, red; and deletions, black). (from long et al 2015)

All this studies gave great insight the mechanisms that drives the adaptations of E. coli, and what factors are selected inside a population, But, if the LTEE and similar experiment allowed to identify several remarkable traits of E. coli adaption, it is still mainly based on bacteria cultivated in laboratory conditions, that are far from mimicking the natural habitat of the specie. Therefore to gain any medical, or even ecological insights, it is crucial to try to go further and to analyze E. coli’s differentiation in natura.

35

3 Short term evolution in vivo and objectives

3 SHORT TERM EVOLUTION IN VIVO AND OBJECTIVES

The aim of this thesis is to use Next Generation Sequencing (NGS) to study the evolution of E. coli in environments closer to nature than what has mostly been used in the laboratory. I will first review knowledge on the evolution of E. coli in its natural environment.

3.1 Evolution in the mice

A good approach, to get a realistic vision of E. coli diversification in the wild, was to keep the experimental evolution aspect of the study but to change the environment of the experiment to something closer to the natural environment of the species. To do so several protocols were designed using mice’s gut, as E. coli is a commensal of the vertebrates’ gut. For instance, a study from 2001 used isogenic germ free mice (Giraud, 2001) bred on a sterile environment such that the gut is completely free of microbiota, offering a perfect habitat for E. coli without any competitors. The mice where then inoculated with a K12 strain. The researcher showed that, after 42 days, almost all the lineages had evolved and had sometimes gained high mutation rates through a mutation in a DNA repair gene, converging towards results from Lenski’s work. They later found that some global regulator were also involved in the early stages of adaptation (Giraud et al., 2008). However, these results have a limited power for two reasons. First the strain used was a laboratory strain, that has been passed in laboratory and mutagenized for decades (see below), second and most importantly for that experiment, germ free mice are far from any natural environment: first the immune system is quite abnormal as it has not fully matured due to the lack of bacterial challenging during development. Second, the physiology is very different with a massive septum full of mucus. Third this environment is really different from that of a natural situation, as the absence of other microbes leaves the whole niche to E. coli while it is naturally a minority species. To get closer to natural settings, other teams suggested the use of conventional mice treated with streptomycin. The use of streptomycin is required to allow the colonization of the gut by a streptomycin resistant E. coli. Although it alters some part of the microbiota, its effects are supposed to be minor on the dominant anaerobic microbiota. With that system, Barroso-baptista, observed in the course of 24 days a convergence towards the inactivation of the gat operon which is used in the metabolism of galactitol and in the oxygen pathway (Barroso-Batista et al., 2014). Galacticol being a part of the host’s metabolism of galactose, it was hypothesized that the bacteria were frequently exposed to it and therefore that the strain would be 36

3 Short term evolution in vivo and objectives selected for the gat-negative phenotype. The authors argued for a fast rate of adaptation in vivo. Yet, a closer look reveals some limitations of this approach. The strain used is a K12 E.coli for which the expression of gat operon is constitutive due to inactivation of its repressor. However, the repressor is conserved in all natural isolates suggesting it is under selective pressure in natural environments. Moreover other mutations in respiration could also results from the specificity of the strain, which has been grown on plates for a long time with high oxygen. All in all, we can clearly see that if those studies have results that seem to go along the observations made by the LTEE experiment, they still have some major defaults. Indeed they rely on using environments that are quite far from the natural ones of the bacteria, and on using E. coli strains that are also far from the natural ones, like the K12 strain, that has been used inside the laboratory for almost a century now. Hence, if these experiments suggest a move towards a more natural environment, it is still difficult to know if what was observed is comparable to what happens in the wild. To move further in that direction the use of E. coli K12 should not be taken as a realistic model for the evolution in natura of E. coli. It is necessary to move towards using both natural strains of E. coli and trying to get closer to natural environments of the bacteria for such experiment. In this line of thought, an experimental evolution mice gut study using strains of E. coli closer to the ones found in the wild, will be discussed in the first article, as this could give us a more realistic view of the adaptation of E. coli.

3.2 Evolution in human

3.3.1 Chronic infections

As mentioned early on, E. coli has a special relationship with human, and in that respect, the use of mice as a natural environment may be of limited interest. Mice have a different gut physiology, alimentation and metabolism and indeed E. coli is not the dominant enterrobacteria found in their gut (Nguyen et al., 2015). Yet, doing experimental evolution in humans is obviously a really difficult task, as it require for the strains to be maintained in the gut, the host to be checked regularly, and the intake of antibiotics to be regulated. Moreover, the environment cannot be maintained as constant as in the laboratory without rising some severe ethical issues. However studies found ways to go around such limitations, by following patients in a medicalized controlled environment. For instance, in other bacterial species, epidemiological studies followed the modification of one clone while it propagated in a population. Lieberman conducted a retrospective study on 14 subject with cystic fibrosis infected

37

3 Short term evolution in vivo and objectives in a Burkholderia dolosa outbreak (Lieberman et al., 2011). They analyzed 112 isolates collected over a period of 16 years. They identified 492 polymorphic loci. The mutations were accumulating at a steady rate of 2 single nucleotide polymorphisms per year, and using those it was possible to compute a maximum likelihood tree, which retraced the network of transmission between patients. Furthermore, they observed a great amount of evolutionary convergence, with 17 genes mutated independently amongst the different patients (fig 13). There was even a convergence at the functional level, as the mutated genes could be arranged in functional groups, O2 related gene regulation, antibiotic resistance, outer membrane synthesis, secretion, and a fifth group for which the functions were unknown. It is easy to make sense of some of those mutations, given the environment of the strains sequenced. We can correlate the O2 regulatory genes mutation with the pulmonary localization of the bacteria, the antibiotic resistance with the treatment received by the subject, and the secretion genes with pathogenicity. Given that some of the genes mutated here were not previously identified as being implicated in pathogenesis, such an observation can be of great medical interest. This pioneer study has however some specificities: the lungs of cystic fibrosis patients is not the natural environment of B. dolosa which is in that case an opportunistic pathogen.

Fig 13: the 17 genes under positive selection grouped by biological function. The color intensity indicates the number of mutations observed within this subject. The number of mutations is indicated left. (from Lieberman et al, 2011)

38

3 Short term evolution in vivo and objectives

To study E. coli, a controlled within human evolution experiment was made in a case of UTI. In that study, the colonization of the bladder was carried out with controlled inoculation of a specific type of E. coli: asymptomatic bacteriuria strains (ABU). ABU infections are peculiar, as the E. coli strain colonizes the bladder of the patient, but do not result in a . Because it expresses such an exceptional phenotype, the E. coli strain 83972, can be used for treatment of chronic urinary tracts infections. The strain colonizes the urinary tract, preventing the invasion of other pathogenic strains, while not harming the patient. As the strain remains in the urinary tract of the patient, its evolution can be studied in detail, as both the ancestral genome and its date of inoculation are known. A study was published in 2010, in which they inoculated 6 patients with strain 83972 and then re sampled it at several time points over a period of 423 days(Zdziarski et al., 2010). They identified 37 mutations, out of which 34 where genic. Amongst the genes mutated they observed signs of convergence, towards regulatory genes, those genes being mutated independently in different patients. For instance the gene barA was found mutated in two patients, and encodes for a protein that controls a global regulatory network affecting flagella or adhesins expression, and biofilm formation. It was previously proposed as a virulence trait in urinary tract infection. Other genes found mutated were implicated in adaptations to different stress conditions, like the gene fnr implicated in oxidative stress response, and mutated in two patients. For medical reasons three patients had to be re inoculated with the ancestral strain. This allowed the researchers to evaluate the reproducibility of the experiment. This second round of evolution revealed that several loci were repeatedly mutated. This observation is of great interest as it shows that such experiments are reproducible and that the mutations observed could not be due to drift, but represent an adaptation of the strain to the environment of the host. Furthermore they identified that the mutations lead mostly to loss of functions (fig 14). Similarly to in vitro evolution, some regulators were found in the repeated targets of adaptation. This study suggests that during pathogenesis some of the conclusions learned from experimental evolution in vitro could be valid, but the specificity of the environment used requires further experiments before any conclusion can be drawn.

39

3 Short term evolution in vivo and objectives

Fig 14: different adaptational strategies of E. coli 83972 during prolonged growth in the bladder of 3 different human hosts: PIII-4, PI-2, PII-4. (from Zdziarski et al, 2010)

3.3.2 Acute infections

An alternative approach to inoculation of strains, which can only be done in very specific conditions in humans, is to look at the diversity present in samples. For instance, following an infection, it may be interesting to sample several clones and study their diversity. This has been performed in E. coli with several extraintestinal infections. 226 isolates were gathered from 19 patients suffering from deep and closed E. coli infection. Using enterrobacterial repetitive intergenic consensus PCR (ERIC-PCR) and pulse field electrophoresis (PFGE), Levert el al (Levert et al., 2010) were able to explore the diversity of the samples. They first identified that 4 patients had a polyclonal infection, in which the strains sampled were coming from different phylogroups. Furthermore, micro heterogeneity, in which isolates appeared close but still different using ERIC-PCR and PFGE, was found in the 4 polyclonal infections, as well as in 7 monoclonal infections. The best hypothesis to explain this micro heterogeneity is then that the sampled bacteria have evolved from a common 40

3 Short term evolution in vivo and objectives ancestor. Theses microevolved isolates from an infection were then phenotypically compared to one another and differences in antibiotic resistance and growth rates were found. Furthermore when comparing with commensal clonal population, the variability in phenotype was significantly higher in the infectious samples, which would suggest that the isolates raised from an infectious common ancestor, instead of diversifying in the gut and then creating an infection. Amongst the phenotypic observation, 3 isolates exhibited a mutator phenotype, with the mutations rate having increased by more than 50 fold for one. These 3 samples will be studied in depth using whole genome sequencing in the article n°2 of this thesis.

Similarly, in another bacterial species, signs of selection acting on regulators were found in a study on Streptococcus agalactiae (Almeida et al., 2015), a commensal bacteria that can cause infections in newborns. Newborns are very likely to acquire the S. agalactiae from their mother at birth. To investigate if this transition from mother to infected child is associated with some selection, A. Almeida and collaborators compared the genome of 19 mother-child pairs of S. agalactiae strains, identifying 14 mutations. Amongst those mutations, they observed a convergence with two independent mutations affecting the covRS locus that encode for a regulatory system of virulence. The expression of virulence genes related to covRS, such as bibA, cylE and gbs1895 showed a higher expression in the mutant. These results show that the observed mutations are adaptive, and positively selected during the transition between commensal and pathogenic state and contribute to the virulence of the pathogenic strains. All these findings tend to show that parallels can be made between laboratory experimental evolution and what is observed in natura.

3.3.3 Commensal evolution

However, those studies where oriented towards humans but they focused on infections and did not provide real insights on the commensal environment of E. coli. A study by James R Johnson aimed at analyzing the evolution of E. coli in the human gut (Johnson et al., 2008). To do so, they followed the members of a household for 3 years and made 6 samplings in the father the two daughters, the son and the dog. They identified 2 independent clones that maintained during this period, which allowed them to see how the clones diversified during this period. They followed more precisely one of the clone and were able to retrace how it was transferred from one host to another. They found around 20 mutations in all their samples, amongst which 10 were nonsynonymous. From the data, they deduced a rather low substitution rate with around only 1.1 substitutions per genome per year. This observation suggests, that this strain seems to be already adapted to its environment, 41

3 Short term evolution in vivo and objectives whereas the strain from experimental evolution that were facing laboratory extreme conditions, exhibited a larger substitution rate.

3.4 Objectives

Then can we really trust the results of experimental evolution to make assumption on natural samples of E. coli? Indeed, there is still work that can be done to improve the protocols to conduct experimental evolution in natura. To answer such a question, we need to use the most recent and effective tools that are at our disposal. Indeed, the recent advances in sequencing technologies bring great promises to this type of study as they allow a deeper visualization of adaptation. Following those lines of thought, we have been conducting several experiments in order to respond to those burning questions:

1 – We have conducted an experimental evolution study inside the mice’s gut using a naturally isolate of E. coli: E. coli 536. This experiment offers the opportunity to study the evolution of a natural strain inside an environment closer to E. coli natural niche than an axenic mouse can offer.

2 – We used the samples from 3 of the patients used in the Levert et al. (Levert et al., 2010). By comparing the different isolates sampled from each patients using whole genome sequencing, we are able to identify the forces driving diversification during an infection.

42

4 Next generation sequencing (NGS)

4 NEXT GENERATION SEQUENCING (NGS)

Recent advances in sequencing technologies have drastically reduced the cost and time needed to sequence a full genome. These methods are therefore particularly suitable for the study of bacterial population diversification in which we want to sequence a large number of clones in different samples or evolved lineages. In this section, I am going to describe the technical aspects of this sequencing revolution as well as the tools I used during my thesis in order to study the diversification of E. coli in different environments.

4.1 Sequencing technologies

4.1.1 Sanger Sequencing

I will start with a brief history of sequencing. It is James Watson an Francis Crick that first hypothesized and showed the double helix structure of DNA (Watson and Crick, 1953). In this model, DNA is composed of two strands coiled in a helix, and connected to one another by hydrogens bond, each strand being composed by a sequence of nucleotides, adenine (A), thymine (T), guanine (G), and cytosine (C). The two strands being complementary from one another, a A always being bonded to a T on the other strand, and a G with a C, Watson and Crick proposed that one strand could be used to reconstruct the other (Watson and Crick, 1953). In 1977 Frederick Sanger was the first to develop a DNA sequencing method using a primer extension strategy, which would remain the main DNA sequencing technology used in the world for decades (Sanger et al., 1977), and would earn him a Nobel Prize in 1980. His method uses a primer that binds to the reference sequence, and that is then used by a DNA polymerase. The polymerase uses the deoxynucleosidetriphosphates (dNTPS) present in the mixture. However alongside the regular dNTPs, the media is also rich in radioactively marked di- deoxynucleosidetriphosphates (ddNTPs) that lack 3’-OH which blocks the elongation of the new sequence. Four different mixtures are prepared with only one type ddNTPs. In each of the mixture, we would find a random number of sequences stopped at different length. The fragments are then separated by size using gel electrophoresis with the mixture of each ddNTPs in a different row, and visualized using the radioactivity of the ddNTPs. By relating the distance migrated to the length of the sequence, we are then able to extract a sequence. Over the years, the method was improved most

43

4 Next generation sequencing (NGS) notably by changing the termination method and getting rid of the radioactive phosphorus and replacing it by a fluorescent dyes. In that case, each type of ddNTPs is being marked by a fluorescent dye that emits at a different wavelength, allowing identification of each nucleotide. The first automated machines dedicated to sequencing were produced, and were able to produce sequences of slightly less than 1 kilo bases.

This method was used, and is still used, at large scale to sequence DNA previously amplified by PCR. However, as the will to sequence genome of unknown sequence improved, a new protocol called shotgun sequencing emerged (Anderson, 1981; Staden, 1979). In this method, the genome is sheared into small random fragments that are then cloned in a plasmid of known sequence in E. coli. Each of the fragments is then sequenced, and computers reconstruct the larger picture of the genome sequence. If the number of fragment sequenced is large enough and the shearing of the DNA random enough, as the number of sequences increases, more and more of the sequences overlap with one another. From the overlaps a whole genome can be recovered. This method was used extensively for the first whole genomes that were produced (Feldmann et al., 1994), and in the 1990’s the Human genome project used this technology to sequence the human genome (Lander et al., 2001). It took several hundreds of scientist for 15 years and an approximate cost of 3 billion dollars to release the first draft of the genome in 2001. With an optimization of the protocol, Craig Venter and his company managed to release simultaneously another human genome (his) that they produced in 4 years for a cost of about 100 million dollars (Venter et al., 2001). Few years later, the genome of James Watson was released. This time, it took four months, a handful of scientist and 1.5 million dollars (Wheeler et al., 2008). Nowadays, we are able to produce a full genome by a single person in about a week for a cost of about 1000 dollars. This revolution was made possible by the new generation of sequencer.

4.1.2 Next generation sequencing principle: the case of 454

The method used to sequence James Watson genome was pyrosequencing (Heather and Chain, 2016; Ronaghi et al., 1998), licensed to 454 life science. It has been one of the base of the next generation sequencing revolution (fig 15a). In these technologies, the time consuming step of cloning is avoided. This drastically reduces the costs, as no more large-scale automation are required to handle libraries of thousands of clones. In the protocol, the genome is first randomly fragmented, and then adapters are ligated to the denatured DNA fragments. Then, an emulsion PCR is done. In this procedure, the sequences are placed in water drops in an emulsion of oil with only one fragment per drop and a bead covered with a sequence complementary to the one of the adapters. The DNA 44

4 Next generation sequencing (NGS) fragments binds to the bead, and because all reagent for a DNA amplification are present, starts replicating like in a PCR, allowing a massive in parallel amplification of each fragment. The newly produced fragments then binds to the bead and so on. Then the beads are inserted into a multi-well plate in which a single bead per well is inserted. The sequencing protocol can then start. In this method, the pyrophosphate produced when the dNTPs are incorporated in a growing DNA strand by the DNA polymerase, is converted into ATP by an ATP sulfurylase. The ATP is then used by the luciferase that produces a luminous signal proportional to the quantity of pyrophosphate. As the 4 different dNTPs are given sequentially to the DNA polymerase, recording the production of fluorescence allows the unraveling of the multiple sequences. This method allowed observing the sequencing process in real time in comparison to the electrophoresis used in Sanger method, and could also be done using regular dNTPs. The main inconvenient of this method is to distinguish the number nucleotides added to the sequence in a row, when long nucleotide monomers are sequenced. Indeed, the intensity of the light produced is nonlinear after 5 similar nucleotides.

Overall, the procedure outputs sequences, called reads, for each of the fragments of the original DNA. Even if the fragments sequenced were initially drastically smaller than the one provided by the Sanger sequencing method, the method allows a throughput much larger than the one used for Sanger sequencing that were in the thousands bp. With this technic, we can get an output closer to the millions of reads, and even, for other technologies, to about 100 of millions of reads. The methods developed to assemble genome based on shotgun sequencing can then be applied to those large numbers of reads. Sheer numbers once again compensates the lower sequence quality of these methods: each base of the genome is found in many reads such that a consensus at the base can be identified. The mean number of time a base is represented in the sequencing results is called coverage.

In the end, if Sanger sequencing allows for the production of long sequences, next-generation sequencing gives short sequences but in much greater numbers, (3 to 6 orders of magnitude more). Furthermore, the main advantage of the next-generation sequencing protocol such as the one described, lies in the amplification process, as it allows for the massively parallel amplification of independent fragments whereas the cloning process of the Sanger method was really the limitation factor of the protocol. These methods allow therefore for highly parallelized, cheaper and faster sequencing. Indeed, methods such as pyrosequencing were outcompeted by Sanger method in the 1970’s as they had lower sequence quality and length. But with the renewed interest in them, each of the sequencing technologies used in NGS have made important progress over the last decade. For instance, the length of the reads produced by 454 started with less than 150bp and reached close to

45

4 Next generation sequencing (NGS)

800bp recently (tough 454 is already history now). Indeed, the sequencing technology continues to evolve in such a way that the cost decrease is going faster than predicted by Moore’s law which predict a doubling of computing power every two years.

46

4 Next generation sequencing (NGS)

Fig. 15: Second-generation DNA sequencing parallelized amplification. (a): DNA molecules being clonally amplified in an emulsion PCR (emPCR). Adapter ligation and PCR produces DNA librarieswith appropriate 5′ and 3′ ends, which can then bemade single stranded and immobilized onto individual suitably oligonucleotide-taggedmicrobeads. Bead-DNA conjugates can then be emulsified using aqueous amplification reagents in oil, ideally producing emulsion droplets containing only one bead (illustrated in the two leftmost droplets, with differentmolecules indicated in different colors). Clonal emplification then occurs during the emPCR as each template DNA is physically separate from all others, with daughter molecules remaining bound to the microbeads. This is the conceptual basis underlying sequencing in 454, Ion Torrent and polony sequencing protocols. (b): Bridge amplification to produce clusters of clonal DNA populations in a planar solid-phase PCR reaction, as occurs in Solexa/Illumina sequencing. Single-stranded DNA with terminating sequences complementary to the two lawn- oligos will anneal when washed over the flow-cell, and during isothermal PCRwill replicate in a confined area, bending over to prime at neighboring sites, producing a local cluster of identical molecules. (c) and (d) demonstrate how these two different forms of clonally-amplified sequences can then be read in a highly parallelized manner: emPCR-produced microbeads can be washed over a picotiter plate, containing wells large enough to fit only one bead (c). DNA polymerase can then be added to the wells, and each nucleotide can be washed over in turn, and dNTP incorporation monitored (e.g. via pyrophosphate or hydrogen ion release). Flow-cell bound clusters produced via bridge amplification (d) Can be visualized by detecting fluorescent reversible-terminator nucleotides at the ends of a proceeding extension reaction, requiring cycle- by-cycle measurements and removal of terminators. (from Heather and chain, 2016)

47

4 Next generation sequencing (NGS)

4.1.2 Next generation sequencing’s various technologies

Several other parallelized technologies appeared over the year the most notable being the Ion Torrent, Illumina, Pacbio and Nanopore technologies.

Ion Torrent: The ion torrent technology is pretty similar to the 454 one, but instead of measuring light emitted as a consequence of the addition of a new base, it uses the hydrogen ion released when a new base is incorporated and a PH probe that is more precise than the regular light based methods is then used. (Rothberg et al., 2011)

Illumina: -In Illumina sequencing, the fragmented DNA is ligated to adapters (fig 15bc). The reads are then distributed on a flow cell where both the adapters can bind. There is then an amplification process with the newly sequenced fragments that binds again, nearby, to the flow. This amplification process forms clusters of DNA fragments that are all close together. The chemistry is also different from pyrosequencing. There, all the dNTPs are added simultaneously. They are however distinguishable due to a bound fluorescent dye. The fluorescent group is used as well as a blocking agent that stops the replication process. During the sequencing steps, a dNTPs is incorporated and the replication stops. Then, a laser excites the fluorescent dye to identify the base incorporated. The fluorescent dye that blocks replication is then removed and the process continues. (Bentley et al., 2008)

PacBio: The PacBio sequencing technology uses dNTPs differentially coupled with a fluorescent dye, this time placed in the terminal phosphate (fig 16a). In this procedure one strand of DNA is placed in a well with a polymerase that is fixed at the bottom of the well. When the replication occurs, the DNA polymerase recruits the dNTP necessary and immobilizes it for a few milliseconds, which creates a light signal. The fluorescence at the level of the DNA polymerase being recorded, one can deduce the sequence dNTP according to the fluorescence detected. The technique offers the particularity that it doesn’t need an amplification procedure and generates longer sequences, up to several kb.(Eid et al., 2009)

48

4 Next generation sequencing (NGS)

Nanopore: Lastly, the Nanopore technology represents the last step in sequencing efforts (fig 16b). It can be separated from all the other methods, as it does not use biological reaction per se to read the DNA. It requires neither amplification nor fluorescence to work. It uses a membrane with a biologic pore with an electrolyte. A current is applied so that the ions go through the pore. When a single strand of DNA goes through the pore, it is easy to read out the changes in ion flow through the obstructed pore and to deduce the nucleotide associated with such a reduction in current. Very long read can be produced up to several tens of kb. Once again, it is possible to parallelize several sequencing at the same time, while the operation does not need too much reagent. Because of this last reason it can also be miniaturized to the point of being smaller than a hard drive. A recent study using Nanopore was able to reconstruct a genome of a K12 E. coli strain, obtaining one single complete sequence of the full chromosome (Loman et al., 2015). However, they found a nucleotide identity of 99.5% to the K12 reference, and therefore, concluded on the necessity to improve the method’s accuracy.

Fig. 16. Third-generation DNA sequencing nucleotide detection. (a): Nucleotide detection in a zero-mode waveguide (ZMW), as featured in PacBio sequencers. DNA polymerase molecules are attached to the bottom of each ZMW (*), and target DNA and fluorescent nucleotides are added. As the diameter is narrower than the excitation light’s wavelength, illumination rapidly decays travelling up the ZMW: nucleotides being incorporated during polymerization at the base of the ZMW provide real-time bursts of fluorescent signal, without undue interference from other labelled dNTPs in solution. (b): Nanopore DNA sequencing as employed in ONT's MinION sequencer. Double stranded DNA gets denatured by a processive enzyme (†) which ratchets one of the strands through a biological nanopore (‡) embedded in a synthetic membrane, across which a voltage is applied. As the ssDNA passes through the nanopore the different bases prevent ionic flow in a distinctive manner, allowing the sequence of the molecule to be inferred by monitoring the current at each channel. (from Heather and chain 2016)

49

4 Next generation sequencing (NGS)

One limitation of the method giving short reads it that shot gun sequencing assembly methodologies have trouble with repeated sequence (see below). In the days of Sanger sequencing libraries of 3 or 10kbs were produced and sequenced from both ends. This provided further information that was used to assemble whole genome. Here, the same approach can be used. The two reads produced from each ends are then marked as coming from the same fragment, and are called pair end. However, the fragments amplified are still usually quite small, less than a Kb in Illumina. Another similar procedure is creating mate pairs that can be further away. In this procedure a large fragment of 2 to 5 kb of genome are produced as done previously with Sanger sequencing. Both ends are marked with biotin, and the fragment is circularized before it is fragmented once more into 400 to 600 pieces. Then the biotin fragments are recovered and sequenced at both ends which corresponds to the extremities of the large fragments. Those two methods can prove to be really useful as they allow us to keep some information on the structure of the larger scale structure of the genome sequence. They can be used to handle recombination events, or to deal with repeated regions.

4.2 Bioinformatics of genomics

Overall, the gain in time, ease and price over the last decade have drastically democratized the use of sequencing. If it was before unthinkable to sequence the whole genome of a bacteria to pinpoint mutations responsible of a particular phenotype; it is now an easy and cheap procedure. This technologies also offers great opportunities to study the ecology and evolution of organisms such as bacteria. Indeed the price of the sequencing of a single genome can today be reduced to around 50€. This allows us to sequence a great number of strains in order to identify candidate genes related to a phenotype, pattern of adaptation, or sequence and annotate a newly sampled strain for instance. However, because of the high parallelization, the sequencing procedure does not directly output a single sequence, but provides high numbers of short reads of between 50 to 300bp for Illumina, the cheapest technology. How to use such data has been a challenge for the field of bio-informatics and genomics for the last 10 years.

50

4 Next generation sequencing (NGS)

4.2.1 With reference genome: alignment and SNP calling algorithms

The new sequencing technologies represent a revolution of the biological studies similar to what the Sanger sequencing was 40 years ago. It provides a quantity of information at least 100 times larger than Sanger sequencing. A major advantage of NGS is that it allows for the detection of genetic variants that deviates from a reference genome. In our case a good example of such a usage of NGS would be sequencing a clone from one lineage from Lenski’s LTEE experiment to compare it to the ancestral strain E. coli REL606. In this case, the REL606 strain has already been sequenced and annotated, meaning that we have its complete genome and a good understanding of the different gene positions inside it. We can then compare this reference to the reads of the sequenced evolved genome by using two steps: an alignment to the reference and a SNP calling.

Mapping: Before the rise of the NGS era, when trying to align two DNA sequences together, people would use algorithm based on the Smith-Waterman method. However such algorithm are really sensitive to the length of the sequences that are aligned, which in the case of NGS can be problematic as we are trying to align a great number of small reads to a presumably really long reference. To overcome this difficulty, mapping approaches have been developed. These approaches take advantage of the particular format of the alignment of a NGS output to a reference, by first indexing the reference genome, and then use the created index to search matches with the read. This indexing is usually done using the Burrows wheeler transformed (Li and Durbin, 2009) which produces an index that is around 1.5 times larger than the reference sequence but allows for a quick run through. Then each read is compared to the index, by searching the best match but allowing gaps and errors, similarly to a regular alignment algorithm. This technique reduces the cost in time and memory of such algorithms, and produces an output file that is then easy to parse. Amongst all the mapping algorithms that are in use, BWA (Li and Durbin, 2010, 2009) and Bowtie2 (Langmead and Salzberg, 2012) proved to be the most popular ones.

Mutation calling and breseq: Mutation calling programs usually function in a similar manner. In a first step, the reads are aligned to a reference genome using a mapping algorithm. The programs usually proceed in 3 steps: (i) mapping of the reads to the reference, (ii) identification of mismatches and coverage, (iii) assumption of the mutational events. To characterize the procedure, I will now describe the workflow of one of the available SNP calling algorithm: breseq (Barrick et al., 2014; Deatherage and Barrick, 2014).

51

4 Next generation sequencing (NGS)

The program takes two inputs: (i) sequencing reads in fastq format and (ii) a reference genome. The reference genome can either be in a simple fasta format with only the sequence or in a more complete genebank format, which combine the sequence with the annotation information. It is those information, that will then be used to deduce how the mutations observed modified the genes of the reference. When identifying mutations, the program stores mutation evidence that can take the form of either read alignment evidence (RA), corresponding to SNPs and short indels, missing coverage (MC), and new junction evidence (JC), corresponding to sign of rearrangement.

In the first step of the workflow, breseq will map the reads using bowtie2 (Langmead and Salzberg, 2012), with “stringent” options in order to spot reads with near perfect matches to the reference. Then the unmapped reads are realigned but using more relaxed parameters, that allows split matches, where a read as one side mapping at one point on the reference and another part mapping somewhere else. The split reads are then stored as new junction candidates as they could potentially be the expression of a genomic rearrangement. To assess the reliability of those junction candidates, breseq recreates them by concatenating the two parts of the genomes that seems to have been rearranged, and then remaps all the reads with bowtie with stringent parameters. Then the reads are allocated to their best matches be it on the reference genome or on one of the newly constructed junctions. Each junction is then evaluated, by considering only reads going through the breakpoint of the junction. Then breseq looks at the beginning points of each reads. If the beginning point are too few compared to the number of reads, then the junction is considered as false, as the reads are not mapped evenly across the junction. The confirmed new junctions are then stored as JC evidence by breseq.

Looking at the identified mismatches, whether they are indels or a SNPs, breseq uses a standard Bayesian SNP caller. A score is calculated as the negative log10 of the posterior probability that the base is not the reported base, corrected for multiple testing by multiplying by the genome size. Mismatches are then stored as RA evidence by the program. Lastly, breseq looks at coverage along the reference genome. It first identifies regions with no reads mapping. The program then expends those missing coverage region on each side as long as the coverage is below a calculated cutoff value. Those missing coverage regions are then stored as MC evidence.

breseq uses the different RA, JC and MC evidences stored to make predictions on the mutational events. The program uses the JC and MC evidence jointly in order to identify large deletions, deletions of repeated regions, amplifications of unique sequences, as well as mobile element insertions (see fig 17). Finally breseq returns all the mutations that are observed, with the mutations that were assigned to predicted mutations, along with those for which the program was not

52

4 Next generation sequencing (NGS) able to make a prediction. To present the results, breseq matches the predicted mutations with the reference annotations to clearly identify the impact of the mutations on the reference genome. In the output, the program gives an html table that allows an easy visualization of the mutations as well as a close-up of the mapping of the reads. Indeed, given the high number of reads produced by high throughput sequencing, all sorts of situation can arise that can lead to false positives mutations.

Figure 17: Predicting structural variation from new junction and missing coverage evidence. Types of structural variation for which breseq can predict precise mutational events from new junction sequences (JC) and missing read coverage (MC) evidence are shown in the context of the reference and mutant genomes. For JC evidence, the matched sequence on each side is shown as a solid arrow with a dashed line connecting the two sides. Orange JC arrows indicate that this side of a new sequence junction maps equally well to multiple locations in the reference genome. (from Barrick et al, 2014)

53

4 Next generation sequencing (NGS)

Furthermore, the quality of this type of analysis is directly linked to the quality of the reference genome. Indeed, we are using the reference as a model, and make the assumption that no major structural changes have happened. These types of algorithm are usually oriented towards organism with a stable genome and a stable gene repertoire. However, some bacterial species present particularly plastic genomes. Indeed, E. coli genome harbors such variability from one strain to another with potentially more than half of the genome being different. Therefore if we used K12 for instance as a reference genome to study an E. coli strain on which we do not have any information, we are certain to miss a lot as several genes from the newly sequenced strain will not be in K12. Furthermore, the reads from the strain corresponding to genes that are not present in K12 will have a tendency to be falsely aligned to an analogous sequence of K12, which result in having lots of noise in our output. These types of analysis put in the context of E. coli genomics are therefore mainly used in order to allow for the comparison of an ancestral strain to an evolved one. Indeed, for the outputs to be meaningful a certain level of relatedness between the reference and the query strain in necessary. When trying to study an E. coli strain for which no previous knowledge has been recovered which reference genome should we use? It is a question that is really hard to answer and often the best approach is to turn towards procedures that will try to reconstruct the genome de novo by assembling the reads more than trying to reconstruct it using a possibly wrong reference.

4.2.2 – Assembly of genome when there are no clear reference genome

NGS can be used on model organisms with a good reference genome. The democratization of NGS offers however the opportunity to sequence lots of new organisms, previously un-described. It is also useful in the case of model organisms with plastic genome like E. coli. Therefore it was necessary to build tools able to reconstruct the genomes de novo using the short reads provided by the sequencing technologies. These methods are called de novo assembly as they create a new sequence from scratch, using assembler programs. The main idea behind de novo assembly is to use overlapping reads and to concatenate those in order to reconstruct the whole genome sequence. To do so several methods have been developed and most of them are based on graphs. A graph is an abstraction used in computer science that is composed of nodes connected to each other by edges. When the edges are associated with a direction, then the graph is oriented, and can be roughly

54

4 Next generation sequencing (NGS) visualized by circles connected by arrows. It is this type of graphical representation that is used by programs to assemble genomes.

Computationally, although it can vary from one assembler to another, the method usually revolves around the construction of de Bruijn graph (Miller et al., 2010; Zerbino and Birney, 2008) and to search for the simplest path inside those graph (fig 18). To construct a de Bruijn graph, the assembler takes each reads of the sample and cut it into k-mers, fragments of an arbitrary length k. For each k-mer, the program creates one prefix corresponding of the left side of the k-mer, of length k-1, and a suffix corresponding to the right side of length k-1. Consequently, the prefix and suffix overlap on a length of k-2. The assembler starts then building the graph: for each prefix and suffix, of a k-mer, a new node is created, unless a similar node has already been created, and an edge connecting the suffix to the prefix is created. To interpret this with a concept that is easier to grasp, when a pre-existing node is connected to another, it means that we have observed an overlap between two k-mers. It is this type of overlaps that will allow the program to reconstruct the genome by identifying reads that have overlapping beginning and ends. Once the graph is complete, it is reduced

55

4 Next generation sequencing (NGS) by collapsing the adjacent nodes together when the first has one edge getting out, and the next has one edge getting in.

Figure 18: A pair-wise overlap represented by a K-mer graph. (a) Two reads have an error-free overlap of 4 bases. (b) One K-mer graph, with K=4, represents both reads. The pair- wise alignment is a by-product of the graph construction. (c) The simple path through the graph implies a contigs whose consensus sequence is easily reconstructed from the path.( fom miller et al, 2010)

Once it is done, we can sometimes observe the formation of bubbles along the graph or dead ends. Dead ends are generally associated with sequencing errors and can be easily ruled out by their coverage or by filtering the reads using their quality scores. The bubbles on the other hand are due to problematic regions. Such problematic regions can be repeated regions of which we will talk further later. The program then outputs the different sequences it was able to reconstruct until it reached a bubble. Those sequences are called the contigs. If the sequencing was made using paired ends or mate pair, the program can still go one step further and use the information given by the association of the two reads: If we have two paired reads matching two different contigs with a short distance between them, we can organize the two contigs one after the other in to what is called a scaffold.

In conclusion, the assembly programs like Velvet (Zerbino and Birney, 2008), SOAPdenovo (Luo et al., 2012) or Spades (Bankevich et al., 2012) represent the first step of analysis when sequencing a new organism. They offer the opportunity of reconstructing a genome from scratch, and

56

4 Next generation sequencing (NGS) give therefore the full power to NGS by allowing non –model organisms to be sequenced at low price. The produced scaffolds or contigs can then be used with annotation programs like prokka (Seemann, 2014) to identify genes and find their most probable functions.

4.3 Computational challenges for bacterial sequencing and repeated regions

As we have seen, the NGS era brought a numbers of computational problems. Several tools were produced, but one technical challenge remain across most of the tools: repetitive DNA (Treangen and Salzberg, 2011). Mini-satellites, macro-satellites and satellites sequences represent examples of repeated region, as well as transposon, or the ribosomal DNAs. Those repeated regions are actually fairly common in the genome of all living organisms. For example, in human around 50% of the genome is composed of repeats (Batzer and Deininger, 2002; Schmid and Deininger, 1975). In the extremes of the spectrum, we can find repeated transposable element covering up to 80% of the maize genome (Schnable et al., 2009). In E. coli K12, there are only 4% of repeated regions, which might seem low but is still substantial for de novo assembly. For NGS data, only highly similar repeated regions are problematic, not the diverged ones. For instance, a sequence that is at least 100bp long, and exhibits more than 97% of identity with at least one other sequence in the genome is problematic. Those regions can cause great trouble when dealing with NGS as they produce ambiguities for the different programs be it alignment algorithms or assembly programs.

4.3.1 Mapping and repetitions

Repeated sequences can be really tricky to deal with when using mapping programs. Indeed if two region are really similar sequence wise, the reads from both of the regions will map to each region. This can, of course, happen when two regions are identical but also when they are quite close because mapping algorithms allow for some differences. Those reads can be called multiple mapped reads. They generate a real problem, as they can create a noise for SNP calling algorithms. Furthermore, when dealing with mapping, another problem can appear when aligning reads against a partial reference that might have lost some repeated region. In that case, the reads corresponding to the missing part will erroneously map on the conserved repeat. A few methods can be used to deal

57

4 Next generation sequencing (NGS) with such problems. The first extreme solution is to completely ignore multiple mapped read, but it can prove to be really deleterious as lots of reads are then ignored. Another extreme solution is to accept all the alignments, but it could also as previously stated result in noises. A third option that can appear better would be to only consider the read to map in the regions where it is aligned with the best score. However this can still cause troubles if the wrong reads don’t have anywhere else to go. For instance, when trying to match a whole human genome to only one chromosome, the reads coming from homologous regions of the missing chromosome will be mapped on the only remaining one. If such a problem can easily be solved by aligning against the full reference, cases more cryptic can appear like for instance if there was a deletion of a repeated region in the reference itself. Some tools, like samtools (Li, 2011; Li et al., 2009), allow the user to choose where to map the read in case of an ambiguity, but it is not really applicable when dealing with huge dataset. This problematic is still troublesome and it seems that only longer reads could help us go further.

4.3.2 Assembly and repetitions

When looking at assembly programs, once again repeated regions are problematic (Treangen and Salzberg, 2011). Indeed if there is no read that spans all across a repeated region, meaning that it is not anchored on both sides, then the program will not be able to decide which side of the repeated region to assemble to each other (fig 19a). This ambiguity may result in short contigs in the better case, and chimeric contigs in the worse. In the same manner when dealing with tandem repeats where repeated region are one after the other, the program is not able to distinguish the number of repeated regions. Increasing the coverage will not solve the problem as long as the reads are too short to span across repeated regions, so other technics needed to be developed. A first obvious way to resolve the problem could be to use technologies producing longer reads in order for them to spam across the repeat. However as it is not always possible, another way of dealing with the problem could be to use pair ends and mate pair (fig 19b). Indeed having reads that are associated with one another at a known distance is a great tool to help assemblers to put together contigs in order to build scaffolds. A study furthermore showed that by increasing the length of the insert size of mate pair/pair end helped a lot when trying to assemble the genome of the potato, which has 62% of repetitive sequences (Xu et al., 2011). They used the N50 metric, corresponding to a weighted median statistic such that 50% of the entire assembly is contained in contigs or scaffolds equal to or larger than this value, to found a 100 fold improvement by switching from 2kb inserts in the pair ends to 20kb (Lieberman et al., 2011). Another good strategy that was implemented was to use a statistics based on coverage. Indeed, if we

58

4 Next generation sequencing (NGS) assume a uniform coverage of the genome, repeated regions should have at least twice the coverage of the rest of the genome.

Fig 19: the effect of repeated regions on de novo assembly. (a)The repeated region induce an ambiguity in the assembling process that leads to the reconstruction of only 4 contigs from the reads. (a) Here we can see how longer reads spanning across the repeated region (1) or paired reads (2) can help to remove abiguity to reconstruct a complet genome.

Overall if repeated regions are a problem in all the aspect of NGS sequencing analysis, some methods have been developed with mitigated success. However, if those solutions are possibly helping they are just temporary, as the only ultimate solution to this problem will be the production of much longer reads.

4.4 NGS for the study of the ecology and evolution of E. coli

During my PhD I was interested in using the recent advances in NGS tools to identify signs of selection in the genome of E. coli in order to infer information on the ecology and evolution of the species. I focused on NGS data as it allows the sequencing of complete genomes at low cost, and the subsequent identification mutations, whereas previous approaches, phenotypic as well as genetic, were more targeted towards one character or one particular fraction of the genome. Thanks to NGS,

59

4 Next generation sequencing (NGS) we have the potential to scan for mutations without a prior knowledge or hints towards the genotype of phenotype of the strain, allowing for a broader and less biased vision of the genome.

As discussed above, the first and easiest way to scan for mutations, genome wide is to use a mapping approach. However such a methodology demands a well-annotated genome. In the case of laboratory experimental evolution, it is easy to use as an ancestor a strain that has been well characterized, as the strain REL606 used by Richard Lenski (Deatherage et al., 2015). The genome of the ancestral strain can then be used as a reference to identify mutations. However, when studying natural samples of E. coli, things can be a little bit more difficult. When studying such types of samples, several studies try to use closely related reference genomes in order to identify SNPs and signs of selection. However, as discussed in the first part of this thesis, the genome, and more particularly the gene content of E. coli can be really plastic. As a result, when sequencing a new strain, only about half of its gene pool can be extrapolated. Therefore we cannot effectively pick a reference genome even as related as possible, as there would always be a risk of missing a part of the genome. Amongst other problems that could arise, reads from one part of the newly sequenced genome that would not be present in the reference could cause noise or false positive results when scanning for SNP variants. To go around this type of problems it is possible to use only a previously established core genome of E. coli. The idea here is to use only the genes that we know for sure to be present in the newly sequenced genome. Those genes are extracted from a closely relative strain and used as the reference genome, as we know that it is going to be present in all E. coli samples. This approach was, for instance, used in epidemiology when conducting genomics investigation of outbreaks of the O157:H7 E. coli strain, allowing for a better discrimination of the strains that was not possible using regular methods (Rusconi et al., 2016)

However if such a strategy can be sufficient for this type of study, it is still limited. Indeed, as we have previously shown the variable part of the genome is important when looking at the specialization of the strain. Lots of medically and ecologically meaningful genes are indeed found differentially across the specie like virulence factors or antibiotic resistance genes. For this reason it is important to be as conservative as possible when conducting a comparative studies on natural samples for which no reference genome can be found. For these studies adding a de novo approach to the methodology is mandatory.

A good example of such an analysis can be found in the work from Alexandre Almeida where isolates of Streptococcus agalactiae are sampled and sequenced from a mother and her infant 23. The genome of the strain from each mother is assembled and annotated. The reads from the infant isolates are then compared to the reference genome of the mother to identify SNPs. This approach is

60

4 Next generation sequencing (NGS) furthermore really instinctive, as it seems logical for the mothers S. agalactiae to be the source of the isolates from their infants, and therefore be used as an ancestral genome. Coming back to E. coli, it is possible to do a similar procedure with closely related natural isolates coming from the same source, be it from the digestive tract or from an extra intestinal infection. To do so one of the samples is reconstructed using de novo assembly, and then annotated. It is then this reference genome that is used as a reference for SNP calling against the other samples. This way we can identify how each strain differed from the sample that was arbitrarily chosen as the reference. It is then easy to deduce, using simple parsimony, how the different samples differed from the ancestral clone of the population. This method allows us to uncover the molecular bases of the genetic diversification that occurred in the recent past of the isolates considered.

However this last approach can have several downsides. It is first really susceptible to the quality of the assembly of the reference. Indeed a badly assembled genome with a high number of small contigs can be really hard to process when looking for rearrangement of the genome. For instance, if it is easy for breseq to identify a large deletion in a contigs, which correspond to both a drop in coverage of the sequence and a new junction of both sides of the deletion, it becomes impossible to solve if the deleted region in the reference is separated in several contigs during the aseembly. The program will show several deletions of the different contigs, but will not be able to connect the different events in order to conclude on a singular large event. This observation means that it is necessary for the user of the program to go along the breseq results in order to infer the different events that took place during the diversification of the strain.

Similarly a curation step must also be done for the SNPs and indels events identified, as noise can be present in the sample. A great source of such noise can be prophage genes integrated in the genome. Indeed as mobile elements, they are a vector for lateral gene transfer, and account for a great amount of the inter-strain variability in several bacteria. E. coli is no stranger to this observation with an extreme case represented by the food pathogen Escherichia coli O157:H7 strain Sakai that contains 18 prophage elements, which amount to 16% of its total genome content. Those genes can prove to be really difficult to handle when using SNP calling algorithms due to their high similarity. Indeed, we can clearly see when realizing a simple dot plot of the different lambda like prophage sequences of 0157:H7 against one another that they share a high degree of similarity in a fraction of their sequence (fig 20) while still exhibiting some large differences in other parts (Boyd and Brüssow, 2002). Indeed when comparing the sequences of the Shiga toxin 2- producing prophage from two O157 strains, EDL 933 and Sakai, the sequence were nearly identical over 85% of the prophage genome. Such a level of similarity can cause trouble in both the assembly and the mapping stages. At

61

4 Next generation sequencing (NGS) the assembly stage, both genes can be combined in a single chimeric contig, or fragmented in several very short contigs. Consequently at the mapping stage mis-mapping may be quite common. These errors will result in a high rate of false positive in the mutation detection. In order to get rid of those errors, another step of manual curation then becomes necessary. This stage is both time consuming and prone to errors and biases.

Fig 20: A dot-plot matrix for all lambda-like Sakai prophages including bacteriophage lambda of Escherichia coli O157:H7 strain Sakai (from Boyd and Brüssow, 2002)

In conclusion, both the assembly and mapping programs can be used to analyze E. coli sequencing both from laboratory experimental evolution, by using a pre-existing reference genome, or from natural samples, by producing a reference de novo. We can then easily scan for convergence or other signs of selection amongst the samples. However, this type of technique really relies on the quality of the reference used or constructed, and imply a certain amount of manual filtration of the data. The difficulty of the last stage has prevented most analyses to focus on very short-term diversifications of strains. Indeed, most epidemiological studies focus on quite diverged clones and limit their studies to SNPs in the core genome and a few deletions. During my thesis, I have tried to tackle the question of short-term diversification and to be as stringent as possible in the detection of mutations, using as much of the genome as possible. I have developed many tools and approaches to address the technical problems inherent to the use of a constructed reference. The final approach retained will be presented in paper chapter 2. It is however noteworthy that I have devoted a lot of 62

4 Next generation sequencing (NGS) my PhD to develop somehow more ambitious strategies. I used for instance a reference panel of E. coli genomes to do annotation transfer and to assist the ordering of contigs. However in all these approaches I could not achieve a clean output without the need of an important manual curation step. Once again repeated regions and worse, partially repeated regions such as proghagic regions were part of the problem. The final option we took has the benefit of being simpler compared to the ones I developed previously, we therefore choose it for the final analysis.

63

II - EXPERIMENTAL EVOLUTION OF NATURAL STRAINS OF E. COLI IN A COMMENSAL ENVIRONMENT: THE MICE GUT MODEL

1 E. coli experimental evolution in the mice gut

1 E. COLI EXPERIMENTAL EVOLUTION IN THE MICE GUT

E. coli experimental evolution can be a powerful tool to identify the basis of adaptation and evolution in a broad sense. Using this type of experiment researchers were able to identify the different steps of adaptation with for instance a first step corresponding to a “quick and dirty” adaptation, acting on strong regulatory genes, followed by a “fine tuning” step touching genes that are less pleiotropic. But can experimental evolution be used to make assumptions on the adaptation of the bacteria in the wild as we are mainly using laboratory strains evolving in laboratory conditions?

To answer this question a protocol of experimental evolution in the mice’s gut was designed, with the clear goal of being as close as possible to natural conditions. These characteristics sets this study apart from other similar experiments that rely on either heavily modified strains or axenic mice or on both. To do so, a natural strain of E. coli, E. coli 536, was inoculated to mice. To maintain the strain in the gut of the animal, a pressure on the gut microbiome was imposed by the use of streptomycin in the drinking water of the mice, the 536 strain being naturally resistant. Adaptation was then allowed to proceed for one year in 30 mice, grouped in pairs in 15 independent cages. The whole genome of one colony coming from one mouse from each cage was then sequenced using Illumina 51bp single reads.

In this project, I analyzed the produced genomes with breseq in order to identify mutations of interest. As a reference, we used the genome of E.coli 536 that was recovered from the Genoscope website. However, as we wanted to be as close as possible to our ancestral strain, which could slightly differ from the one of the Genoscope, we also sequenced our ancestral strain. I then used breseq in order to identify differences between our ancestral strains and this completely annotated genome, and to correct it accordingly. This modified reference was then used in the comparative study against the samples from each of the cages.

Overall we observed 95 mutations in the 15 lineages, which would indicate a substitution rate of 1,85 10-10 mutations per base per generation for point mutations, considering 18 generation per day. Such an observation is rather low, being 2 to 4 time lower than the ones made in other experimental evolution studies.

It was particularly striking to observe signs of convergence at the mutational, gene, operon and functional level. Indeed, 6 mutations were found in at least 2 independent lineages, which could indicate a contamination across the cages. However, those mutational events seemed to be incoherent 65

1 E. coli experimental evolution in the mice gut with one another as there wasn’t a simple evolutionary explanation for such a pattern. Furthermore, as most of those mutations appeared in genes for which a strong patterns of genic convergence was observed (i.e. other mutations were found in the genes), selection may have driven the multiple occurrence of these mutations. At a higher level, convergence was observed in 3 of the genes from the dgo operon, the genes dgoR, dgoD and dgoT, that is implicated in galactonate pathway. Further inquiries showed that those mutations had indeed, an effect, with the mutated lineages having a better growth in minimal media supplemented by D-galactonate. Similarly convergence was observed in the rluD and gidB genes that are both responsible for the maturation of ribosomal RNAs, with 13 lineages, out of the 15, that harbored mutations in either one of the gene. Streptomycin being an antiobiotic targeting the translation of RNA, those mutations can be seen as an even stronger adaptation of the lineages to the selective pressure produced artificially during the experiment. To test this hypothesis, 536 strains mutated in either one or both of the genes, were grown on both LB and LB + streptomycin media, revealing a better growth when any one of the mutation was present. However, the effect of the mutation did not seem to be additive as the association of the two mutations did not appear to have an effect higher than that of rluD on its own.

66

2. Article n 1

2. ARTICLE N°1

Using long-term experimental evolution to uncover the patterns and determinants of molecular evolution of an Escherichia coli natural isolate in the streptomycin treated mouse gut

Mathilde Lescat 1,2,3,*, Adrien Launay 1,4*, Mohamed Ghalayini1,2, Mélanie Magnan 1,4, Jérémy Glodt 1,4, Coralie Pintard 1,4, Sara Dion 1,4, , Erick Denamur1,4,5, Olivier Tenaillon 1,4,#

1 INSERM, IAME, UMR 1137, Paris, France

2 Univ Paris Nord, Sorbonne Paris Cité, Bobigny, France

3 APHP, Hôpitaux Universitaires Paris Seine Saint-Denis, Bondy, France

4 Univ Paris Diderot, Sorbonne Paris Cité, Paris, France

5 APHP, Hôpitaux Universitaires Paris Nord Val de Seine, Paris, France

*: The authors participated equally to the work.

Mathilde Lescat: [email protected]; Adrien Launay: [email protected]; Mélanie Magnan: [email protected]; Jeremy Glodt: [email protected]; Coralie Pintard: [email protected]; Sara Dion: [email protected]; Mohamed Ghalayini [email protected]; Erick Denamur: [email protected]; Olivier Tenaillon: [email protected]

# : Corresponding author: Olivier Tenaillon, IAME, INSERM UMR 1137, Faculté de Médecine Paris Diderot, Site Xavier Bichat, 16 rue Henri Huchard, 75018 Paris, France. email: [email protected]

67

2. Article n 1

2.1 Abstract.

Though microbial ecology of the gut is now a major focus of interest, little is known about the molecular determinants of microbial adaptation in the gut. Experimental evolution coupled with whole genome sequencing can provide insights of the adaptive process. In vitro experiments have revealed some conserved patterns: intermediate convergence, epistatic interactions between beneficial mutations and mutations in global regulators. To test the relevance of these patterns and to identify the selective pressures acting in vivo, we have performed a long-term adaptation of an E. coli natural isolate, the streptomycin resistant strain 536, in the digestive tract of streptomycin treated mice. After a year of evolution, a clone from 15 replicates was sequenced. Consistently with in vitro observations, the identified mutations revealed a strong pattern of convergence at the mutation, gene, operon and functional levels. Yet, the rate of molecular evolution was lower than in in vitro and no mutations in global regulators were recovered. More specific targets were observed: the dgo operon, involved in the galactonate pathway that improved growth on D-galactonate, and rluD and gidB, implicated in the maturation of the ribosomes, which mutations improved growth only in the presence of streptomycin. As in vitro, the non-random associations of mutations within the same pathways suggested a role of epistasis in shaping the adaptive landscape. Overall, we show that “evolve and sequence” approach coupled to an analysis of convergence, when applied to a natural isolate, can be used to study adaptation in vivo and uncover the specific selective pressures of that environment.

Key words: Escherichia coli / experimental evolution/ gut colonization mouse model / 536 / streptomycin resistance / D-galactonate

68

2. Article n 1

2.2 Background.

Bacteria have evolved for billions of years and colonized all available ecological niches. Many studies have been performed to understand how evolution has shaped the genome of these complex organisms. Comparative genomics as unravelled the dynamics of gene acquisition and losses in bacterial genomes (Touchon et al., 2009), and the general constraints acting on genome composition (Hershberg and Petrov, 2010). Yet, much less is known about bacterial genomes evolution in the short term in their natural habitat. While we know through the emergence of resistance to antibiotics that bacterial evolution can be caught in the act, most of our evidence of short-term evolution in the wild is linked to drastic selective pressures such as antibiotics (Wong et al., 2012) or escape of the immune system (Flint et al., 2016). Can other selective pressures be strong enough to drive significant changes in bacterial genomes in the sort term? To investigate that possibility, we have decided to study the evolution of a natural isolate of Escherichia coli in the gut of mice.

E. coli is one of the most characterised bacterial species; it has been used as a model organism in the laboratory to unravel the multiple facets of bacterial biology from biochemistry to genetics. Yet, the natural environment of E. coli is not the laboratory. In the wild, E. coli is a versatile species, known as both a widespread gut commensal of the vertebrates and a dangerous pathogen that can also be retrieved in the environment (Tenaillon et al., 2010). As a pathogen E. coli is responsible for about 1 million human deaths yearly due to intra and now mostly extra intestinal diseases (Russo and Johnson, 2003). The species is a structured species subdivided in seven phylogenetic groups A, B1, B2, C, D, E and F (Jaureguy et al., 2008). These groups are not randomly distributed. Previous studies have shown an association between the phylogenetic groups and virulence, with most extra intestinal E. coli pathogens (including urinary tract infection and meningitis strains) belonging to phylogenetic group B2 (Bingen et al., 1998; Picard et al., 1999). The prevalence of the different phylogenetic groups as commensals varies largely across host species and even within host populations (Tenaillon et al., 2010). For instance, the B2 phylogenetic group can be frequently retrieved in wild animal’s guts (Gordon and Cowling, 2003; Lescat et al., 2013), rarely found (<10%) as a commensal among human tropical populations while it is dominant (>30%) among the populations of industrialized countries (Tenaillon et al., 2010). Indeed, over the last decades, the prevalence of the B2 phylogroup has significantly increased in frequency in some western countries such as France. B2 phylogroup strains being both highly efficient extra intestinal pathogens, and more and more prevalent as sthe molecular determinants explaining the differences in prevalence of the different subgroups of E. coli among different hosts or populations remain largely unknown.

69

2. Article n 1

To identify the molecular bases that may lead E. coli to specialize to certain environmental conditions, two main approaches can be used: comparative genomics of natural isolates and experimental evolution. Thanks to the sequencing revolution, more than two hundred complete genomes of E. coli are now accessible and thousands more are underway. We can subsequently compare these genomes in order to identify some genetic characteristics likely to contribute to the phenotypic differentiation of the subgroups. For instance, we compared 130 sequenced genomes to identify the core region that was the most different between B2 to non-B2 strains: the nhaAR operon involved in pH control. Though we could find a contribution of the operon in extra intestinal virulence, we were not able to observe its implication in colonization of the mouse gut (Lescat et al., 2014) nor in any pH linked phenotypes we could measure in the laboratory. This example illustrates the limitation of comparative genomics: though some specific genomic signatures may be found, we have not access to the environmental conditions that shaped them and it is therefore difficult to assess their evolutionary relevance.

An alternative approach relies in evolving bacteria in controlled conditions. In vitro experimental evolution has been used for decades to study bacterial adaptation. Over the last years, the coupling of experimental evolution with whole genome sequencing allowed progresses in the understanding of the genetic bases of adaptation. The most charismatic experiment in the field is the one initiated more than 25 years ago by Richard Lenski. His 12 lineages long-term evolution experiment (LTEE) has now reached more than 60,000 generations. The genomic data as well as targeted sequencing approaches have revealed convergence among the first targets of adaptation involving several global regulators (Maddamsetti et al., 2015). Another large scale study confirmed these results using 115 lineages adapting for 2,000 generations to high temperature: the first target of adaptation was the RNA polymerase one of the most pleiotropic gene in the genome, and a large convergence was observed between replicates (Tenaillon et al., 2012). These studies and others (Herring et al., 2006; Hindré et al., n.d.; Le Gac et al., 2013) suggest that the first steps of adaptation may involve global modifications of gene expression rather than a specific small-scale optimization of some specific pathways. Interestingly, global regulators such as rpoS, have also been found to be involved in the diversification that seems to occur during infections (Levert et al., 2010). However, the relevance of these in vitro or in infectio observations in the primary habitat of E. coli, the gut of vertebrates, remains to be tested.

In vivo assays have also been performed to study adaptation to a more relevant environment for E. coli. In the early 2000’s, Giraud et al. tested the implication of mutator alleles of E. coli in the adaptation to the gut of germ free mice and showed that mutator bacteria possessed a short-term

70

2. Article n 1 advantage and long-term disadvantage in colonizing new environments (Giraud, 2001). Later, Giraud et al. also used germ-free mice inoculated with E. coli MG1655 (Giraud et al., 2008) and observed a rapid adaptation by a single point mutation in EnvZ/OmpR, a two component signal transduction system, which controls more than 100 genes. These observations suggest that global regulators could also be the targets of adaptation in the mouse gut. However, these works are entailed by an important limitation: the gut of adult germ free mice is an artificial environment. Mice that have developed without flora have abnormal immune system and gut physiology (Khoury et al., 1969; Umekasi, 2014; Welling et al., 1980). Furthermore, in these experiments, E. coli was alone in the gut. In the wild, E. coli is a minority species of the gut microbiota, which is largely dominated by numerous anaerobes (Arumugam et al., 2011). As a consequence, E. coli is facing intense competition for nutrients, and may also have access to other nutrients processed by the rest of the microbiota. More recently, Barroso-Batista et al. performed an evolution experiment using the laboratory strain E. coli K-12 in streptomycin treated mice. Streptomycin treatment allows the colonization of the digestive tract by a streptomycin resistant E. coli, without altering theoretically the anaerobic microbiota (Barroso-Batista et al., 2014). They observed the rapid apparition of mutations in the different lineages leading to convergence in inactivation of the gat operon conferring advantage to the strains bearing the mutations and unable to utilize the galactitol. Here the first target of adaptation was not a global regulator. However, one limitation of these studies is the use of a laboratory strain that has been subject to many genetic alterations and that has not seen a gut for close to a century. E. coli K- 12 is a strain evolved in in vitro conditions since 1922, date of its isolation in Palo Alto, from a diphtheria convalescent patient. Genetic modifications, adaptation to oxygen rich media from the laboratory as well as accumulation of mutations favoured by UV irradiations or strong bottlenecks may modify its adaptation to the mouse gut.

In this context, the aims of this study are (i) to identify the basis of molecular adaptation in in vivo commensal conditions using long-term evolution during one year of a natural isolate of E. coli belonging to the phylogenetic group B2 in the digestive tract of streptomycin treated mice and (ii) to test the phenotypic significances of the molecular modifications observed and their possible advantages in this natural environment.

2.4 Materials and Methods

Bacterial strains

71

2. Article n 1

We used E. coli K-12 BW25113 (phylogenetic group A) from the KEIO collection (Baba et al., 2006) and 536 (phylogenetic group B2, serotype O6:K5:H31) reference strains. The 536 strain has been isolated from a patient suffering from urinary tract infection in the early 1980’s (Berger et al., 1982) and subsequently shown to be a good gut colonizer in streptomycin treated mouse (Diard et al., 2010). This strain was acquired by our laboratory in 1998 from Jörg Hacker who isolated it and then stored at -80°C with glycerol.

Mutant strains of K-12, obtained from the KEIO collection (K-12ΔdgoR:kan, K-12ΔrluD:kan, K-12ΔgidB:kan) and of 536, obtained during the course of this study using the lambda-red disruption method (Datsenko and Wanner, 2000) (536ΔdgoR:cm, 536ΔrluD:cm, 536ΔgidB:cm, 536ΔgidBΔrluD:cm), were also used. For the flux cytometry assays, we used K-12 pdgoR issued from the Zaslaver collection (Zaslaver et al., 2006) which beared the plasmid pdgoR which contained the gfp gene placed under the control of the promoter of the dgoRKADT operon. After extraction of this plasmid, we electroporated it in the K-12ΔdgoR:kan mutant strain. We thus obtained K-12ΔdgoR pdgoR strain.

Mouse intestinal colonization assay and sampling strategy

All animal experiments were performed in compliance with the recommendations of the French Ministry of Agriculture and approved by the French Veterinary Services (accreditation A 75- 18-05). The mouse gut colonization model was conducted after approval by the Debré-Bichat Ethics Committee for Animal Experimentation (Protocol Number 2012-17/722-0076) in accordance to the European Decret and French law on the protection of animals. Intestinal colonization by 536 E. coli strain was performed using a mouse model as described elsewhere (Diard et al., 2010). Thirty 6- week-old female outbred mice (Charles River CD-1) were housed in 15 cages (numbered from C1 to C15) during one year (two mice per cage named Ma and Mb). Five days before inoculation of the 536 strain, and throughout the experiment, streptomycin was added to the sterile drinking water at a final concentration of 5 g/L. Likewise the mice were fed with sterile food all along the experiment. Mice free of coliform flora (controlled by plating the faeces on Drigalski plates) were inoculated per os with 106 bacteria in 200 μl of physiological water. The day after the inoculation and during the experiment (Days 1, 5, 33, 47, 155, 188, 217, 294, 336 and 363, the intestinal population of E. coli was estimated by plating dilutions of weighted fresh faeces for each mouse on lysogeny broth (LB) agar with streptomycin (5 g/L) and supplementary samples were performed for sequencing assays (Days 61, 124, 155 and 250). Faeces were harvested for each mouse by putting the mouse in a sterile plastic cage until it defecated. The faeces were stored at -80°C with glycerol for further analysis. For 72

2. Article n 1 sequencing, the frozen faeces were diluted and plated on LB agar with streptomycin. One colony per LB plate was then randomly picked, grown in LB, stored at -80°C with glycerol and thereafter called “clone”. In addition, to study the reproducibility between the sample of the two mice from each cage, 50 colonies were picked from the faeces on the LB plate for each mouse (100 colonies in total) in cage 1 at Day 363. For the polymorphism study, we picked a pool of about 50 colonies from 4 mice (C2Ma, C3Ma, C8Ma and C11Ma) at both the last time point (Days 250, 336, 363 and 363, respectively) and an intermediate time point (Days 124, 155, 155 and 155, respectively).

High throughput sequencing

From the glycerol stocks, the clones were plated on LB plates, and one colony was randomly picked, and DNA was extracted using Wizard® genomic DNA purification kit (Promega). The genomes of the evolved clones, as well as that of the ancestor 536 strain used for inoculation were sequenced on the Illumina HiSeq 2000 (IntegraGen), with one line of single-end 51-bp reads per genome. We also sequenced after DNA extraction the pool of colonies for the reproducibility and polymorphism assays.

Sequence analyses

Point mutations were identified by comparing sequence reads to the genome of the ancestral strain 536, considered as the reference, using breseq 0.23, a pipeline for analyzing resequenced microbial genomes (Deatherage and Barrick, 2014).

Reference reconstruction. In a first step, we sequenced the 536 reference genome and compared the reads to the original strain genome (Brzuszkiewicz et al., 2006) from the MicroScope website (https://www.genoscope.cns.fr/agc/microscope/mage/viewer.php) to identify differences using breseq 0.23 (Deatherage and Barrick, 2014). This pipeline maps the reads on an annotated reference genome in gbk format to precisely identify the mutations that occurred: the type of the mutation, the genes it fell into, or the closest ones, and in the case of point mutation the changes in amino acid associated. We then used gdtools utility program from breseq 0.27.1 (Deatherage and Barrick, 2014) to integrate the identified mutations inside the reference genome. We repeated this operation of comparison and modification of the reference five times until there was no more mutation identified. The iterations are required to correct mutations in close neighbourhood of one another. Using this method, we identified a total of 351 mutations (230 being point mutations, and 121 indels) between our ancestral 536 strain and the one from the database we used, a number compatible with 73

2. Article n 1 errors due to Sanger sequencing method at the genome level.

Detection of mutations. Once the reference genome was obtained, we compared this reference to the sequences of the different evolved clones using the same approach. The identified mutations were then manually checked. We did the same for the pooled colonies and also obtained the frequencies for each mutation from breseq 0.27.1(Deatherage and Barrick, 2014).

Quantification of convergence. To quantify how significant convergence at the gene level is, for each gene of the genome, we computed the expected number of mutations Ei, that would alter this gene if mutations were randomly spread in the chromosome. We compared that expectation to the observed results Oi through the computation of a G statistics for nonsynonymous mutations: in which the sum is made on all genes with some observed mutations. The number of degrees of freedom of G statistics being difficult to compute when many classes are null (genes with no mutations) we compared the observed G value to that of simulated dataset with mutations randomly spread (Tenaillon et al., 2016).

Phenotypic assays

Flux cytometry. The strain K-12 pdgoR and the mutant strain K-12ΔdgoR pdgoR were compared using flux cytometry as described elsewhere (Bleibtreu et al., 2013).

Growth curves. For the comparative growth assays, strains were grown at 37°C in four media, all supplemented with streptomycin at 5 g/L: LB and minimum Davies medium (DM) with glucose, gluconate or galactonate (20 mmol/L for each sugar). Each medium was adjusted at pH 7 and pH 5.7 with MOPS® and TAPS® products, respectively (Sigma). LB is a complex media, whereas DM is a minimal media with only one source of carbon. All the strains studied were grown overnight (O/N) in LB media in deep-weel plates at 37°C with constant stirring at 280 rpm. O/N cultures were prediluted at 1/100 in physiological serum and strains were inoculated each at 1/100 in a Costar® 96 flat-bottomed well plate and recovered by 50µl of mineral oil. Growth was recorded by an Infinite 200 Tecan®, which measured the OD600 in each well every 5 minutes at 37°C, while stirring for 24 hours. Growth assays were repeated 3 times. The maximum growth rate (MGR) was calculated from growth curves obtained by Tecan®. OD600 in nm were collected and Log- transformed. Curves were calculated from a smoothed spline function. The MGR was defined as the time point at which the maximum value of the derivative of the smoothed function was observed. The doubling times (DTs) in mn have then been computed as followed: DT=Log2/(MGR*60).

74

2. Article n 1

Statistical analysis.

The DTs were compared by strain and by medium using a Welch test. All statistics were computed using R (R Development Core Team, 2009, Vienna, Austria) and statistical significance was determined at a p-value of less than 0.05.

2.5 Results

Mouse colonization assay and clone sampling

Thirty mice in 15 independent cages (2 mice per cage) were colonized by the E. coli 536 strain. As it is known that mice are coprophages, we postulated that sampling one clone of one randomly selected mouse per cage should be representative of the population of E. coli colonizing both mice within a single cage. To validate this approach, we sampled in cage 1 at Days 61, 155, 250 and 363,, one clone from each mouse as well as a mix of 50 clones at Day 363 for each mouse and performed sequencing of the individual clones and the mix. We observed a single mutation difference between the two individual clones per cage and this site was the only polymorphic site recovered in the pooled sequencing. These results validate our strategy, sampling a single mouse per cage being relevant.

The 536 strain was maintained in all alive mice throughout the experiment that lasted 363 days. However, ten mice died in the course of the experiment (figure 1), most of these were issued from different cages except for 2 cages (cages 7 and 14) in which the two mice died before day 363 (supplementary table 1). The average population density was 1.4.109 unity forming colony (UFC)/g of faeces along the experiment. More precisely, density first decreased from 3.109 UFC/g of faeces to 2.104 UFC/g of faeces at day 188 and stabilized at 6.107 until the end of experiment at day 363 (figure 1). One clone per cage has been used for sequencing. The sampling points corresponds (i) to the end of the experiment when at least one of the two mice of the cage was still alive (Day 363 for 10 cages: C1, C4, C5, C6, C8, C9, C10, C11, C13, C15), (ii) the time at which the last mouse of the cage died (Day 250 for cage C2, C7 and C14), and for three cages we could not recover E. coli from the frozen faeces and we sampled the last point that produced an E. coli clone (Day 124 for cage C12 and Day 336 for C3) (supplementary table 1).

75

2. Article n 1

Description of mutations in the 536 evolved clones

Using breseq pipeline (Deatherage and Barrick, 2014), we identified 95 independent mutations that occurred during the experiment of mice gut evolution in the 15 genomes as shown in the supplementary table 2. Amongst those mutations, 70 were point mutations, 17 were small indels (less than 50bp), 6 were large indel events (>25bp) and two were IS insertion. For the point mutations, 44 were non-synonymous including 10 stop mutations; 13 were synonymous and 13 intergenic. The synonymous mutations recovered were only transitions and 11 out of 13 were of C-G to T-A type, a bias reported before (Wielgoss et al., 2011). Assuming 18 generations a day in the mice gut (Rang et al., 1999), we could compute a mutation rate of 1.85 10-10 per base-pair per generation using synonymous mutation. This is consistent with previous estimates of mutation rate of 8.9 10-11 found in the LTEE (Wielgoss et al., 2011). The ratio of rate of non-synonymous to the rate of synonymous, Ka/Ks, was 1.90 (Confidence Interval (CI) estimated by bootstrapping mutations: 1.09-4.01). Such a value suggests that about 50% of non-synonymous mutations (including nonsense mutations) are the product of natural selection. Stop codons were even more overrepresented in the dataset compared to the expectation based on synonymous (Kstop/Ks=6.72, CI: 2.78-15.86) or on all non-stop mutations (Kstop/Ka_nonstop=4.51, CI: 1.96-8.39). This suggests a strong selection for gene inactivation.

Among the 17 small indels (<50bp), 13 lead to frameshifts, and none were intergenic. For the 10 lineages that survived over the 363 days, an average of 7 mutations were found per sequenced genome (4 to 11).

Convergence among lineages

Convergence was observed at several levels. First, while 95 mutations were recovered, they corresponded indeed to 76 different mutations. Six mutations were recovered in 2 to 8 lineages. For instance 8 lineages shared the same C to T mutation in an intergenic region between yjiH and kptA, and 5 shared a G to A non-sense mutation in dgoR. While these shared mutations could be a signal of cross contamination, the star like phylogeny of the samples and the mapping of mutations on that phylogeny suggest that these common mutations must have occurred multiple times (figure 2). In other words, using the previous two examples, the lineages having the yjiH/kptA mutations were not fully included in, fully encompassing or mutually exclusive with the lineages having the afore mentioned dgoR mutations: some lineages have both mutations, one, the other or none. This suggests that if the population evolved asexually, in at least one locus the same mutations occurred several times independently. These convergent mutations account for 26% (25/95) of the mutations found.

76

2. Article n 1

Convergence could be integrated further at the gene level. Nine genes were found to be mutated repeatedly in different cages. Adding a large deletion recovered 3 times, the mutations showing some sign of convergence among lineages at the gene level accounted for 56% (53/95) of the mutations found. With the limited number of mutations found in each genome, this is a clear sign of convergence (see below). This signal is further reinforced by the fact that all genes affected multiple times presented at least 2 different mutations at the origin of the convergence (table 1). The fact that different alleles are found in the different lineages, excludes the possibility that convergence is only due to a single mutation that was by chance at high frequency in the initial stock used to inoculate the different mice. For instance, ECP_4610, a putative transporter is affected in 5 lineages with 5 different mutations.

To quantify how significant this convergence at the gene level is, for each gene of the genome, we computed a statistical test for convergence (see method) in the form of a G statistics (Tenaillon et al., 2016). We found G=482.3 that has to be compared to a mean value of 399.2 for simulations, with a Z-score of 8.3. This suggests that the observed value is more than 8 standard deviations away from the simulations mean and results in a p-value lower than 2.10-16. This overwhelming signal of convergence suggests the filtering action of natural selection.

Among this list of genes showing signs of convergence, two clusters of linked genes can be further identified. dgoR, dgoD and dgoT are implicated in the same metabolic pathways. This observation suggests further that convergence may occur at a higher level of integration than the gene. In the same manner, 6 mutations in the rluD gene were found, and 7 in gidB. Both genes are implicated in the maturation of the ribosomes, and both are linked to streptomycin resistance.

Polymorhism within lineages

To have a better insight on the dynamics of adaptation, we sequenced a pool of about 50 colonies from 4 mice (C2Ma, C3Ma, C8Ma and C11Ma) at both the last time point (Days 250, 336, 363 and 363, respectively) and an intermediate time point (Days 124, 155, 155 and 155, respectively). Using an analysis of polymorphism in these data, we can detect mutations with a frequency higher than 5% in most samples. The analysis of the mutations found at the population level confirmed the presence of the mutations of interest we detected in the clone sequenced. Polymorphism data revealed however if mutations were fully fixed (frequency of 100%) in the population or if they had reached intermediate frequencies. Moreover, several mutations affecting the same gene were found simultaneously in some populations. This suggests some competition between these various alleles. Interestingly, all cases of multiple allele affecting the same gene were restricted to the genes we

77

2. Article n 1 identified as target of selection with the clone sequencing approach (supplementary table 2). This confirms that we have identified the biggest targets of selection.

The loci under selection showed traces of polymorphism in two ways (supplementary table 3). Either a single allele was present in the population but not fixed or several alleles were found simultaneously in the population. gidB showed the strongest sign of polymorphism. In mouse C11Ma, at day 155, the whole population seems to have mutated gidB, but 4 alleles are involved: a 25bp deletion, a frameshift, P79L and Q23* at respective frequencies of 8%, 60.2%, 8.7% and 31.3% (note that the frequencies are based on local coverage statistics so the sum is not supposed to be precisely 1). At day 363, the 25bp deletion overtook the other alleles and reached a frequency of 100%. In mice C2Ma, two alleles of gidB, Q167* and the ∆25bp deletion, are found polymorphic at both sampling time. dgoR also had traces of polymorphism, with C8Ma having two competing alleles (∆360bp at 75.3% and Q212* at 20.9%) at one time point (Day 155). Out of the 3 mice studied for polymorphism that had mutations in dgoR at the last sampling time point, only in one case fixation was reached, while in the others the frequency of the alleles was of 55.8% and 74.1%. The study of polymorphism revealed also new alleles in some of the locus of interest. A mutation in the promoter of dgoR reached a frequency of 7.4% in C2Ma at day 124 and another allele between yjiH and kptA reached a frequency of 24% in C8Ma at Day 155 before being out competed by another allele. Finally, in mouse C11Ma, while we did not find any mutation in dgoR in the clone sequenced at the last time point, a mutation was recovered at high frequency in that gene at an earlier time point, but disappeared later on.

The comparison of the alleles found polymorphic in the different populations studied revealed another gene with evidence of convergence. Clone sequencing revealed a frameshift in uxuR in mouse C11Ma. The polymorphism analysis revealed that this mutation was fixed in that population. In 2 other mice we found mutations affecting uxuR. These mutations were at low frequencies (Q116* at 16.7% in mouse C2Ma, A213V at 6.6% in mouse C3Ma) which explains why we have not sampled them with the clone approach. UxuR is a transcriptional factor that regulates a cluster of operons involved in transport and degradation of the sugar acids β-D-glucuronides, glucuronate, gluconate and L-galactonate.

We further focused on both dgoRKADT operon and rluD and gidB complex as the other important convergence targets, the intergenic region between yijH and kptA (hit 11 times but with only 3 alleles) or ECP_4610 (hit 5 times) affect proteins of unknown functions. To limit the use of animals, functional tests were made only in vitro.

78

2. Article n 1

Galactonate operon

Role of dgoR and dgoRKADT operon

Among the most notable mutated genes, the genes related to the galactonate operon, dgoRKADT, seemed particularly interesting. Seventeen mutations comprising 6 stop codons, 2 frameshifts, and 3 missenses were recovered in dgoR, 4 missenses in dgoD and 2 missenses in dgoT such that in 11 out of the 15 cages the operon was mutated (figure 3). The operon is composed of dgoR (coding for the repressor), dgoK (coding for a kinase), dgoA (coding for a aldolase), dgoD (coding for a dehydratase) and dgoT (coding for the D-galactonate transporter). This operon is responsible for the utilization of the D-galactonate sugar in a pathway that is somehow parallel to the Entner Doudouroff pathway (EDp): contrary to this pathway the input is D-galactonate rather than gluconate but likewise to the EDp the outputs are pyruvate and D-glyceraldehyde 3P (Lamble et al., 2004). We confirmed that dgoR is the repressor of the pathway, which is expressed at low level in the absence of D-galactonate and allows the inhibition of the expression of the operon. We observed this regulator activity using reporter strains carrying the gfp gene under control of the dgoRKADT operon promoter in K-12 strain. In LB, which does not contain galactonate, no fluorescence was observed. The presence of D-galactonate, induced fluorescence that was revealed using flux cytometry. Moreover the strain E. coli K-12∆dgoR (Baba et al., 2006) using the same construct presented a constitutive induction of fluorescence in both media (figure 4). These results confirmed the repressor activity of DgoR, which had not yet been observed. To test the effects of the dgoR mutations selected for in our experiments, one evolved lineages of 536 obtained at the end of the experiment from the mouse C8Ma and the ancestor 536 were transformed with the reporter construct plasmid pdgoR. It showed a constitutive expression of fluorescence similar to the one found in E. coli K-12∆dgoR, confirming that inactivation of the repressor has been selected for (data not shown).

Interestingly, mutations in dgoD always occurred in lineages having mutation in dgoR and dgoT mutations always occurred in lineages having also dgoR and dgoD mutations (p=0.009). For instance lineages 5 and 13 both had the 3 genes modified, using 6 different mutations.

To test for a potential benefit linked to mutations in the dgoRKADT operon, we cultured the 15 evolved lineages in different conditions. Growth curves were assayed in different media (LB and minimum medium with glucose or gluconate as the sole carbon source) at different pH (5.7 and 7) and in minimum medium with D-galactonate at pH 5.7 because the sugar is not soluble at neutral pH (figure 5). We then compared the DTs of the seven evolved strains with mutations in dgoR, the two evolved strains with mutations in dgoR and dgoD and the two evolved strains with mutations in dgoR, dgoD and dgoT and the four evolved lineages unaffected in dgoR as well that of the ancestor strains. 79

2. Article n 1

A significant difference in growth was found in minimum medium with D-galactonate between dgoR mutant strains and the others using a Welch test (p=0.000114). In more details, the data suggest that dgoR and dgoR-dgoD (p=0.001) mutants have a higher growth rate than strains non mutated in dgo operon, and that strains with the three mutations dgoR, dgoD and dgoT grow even better than the ones with only two (p=0.0014). Of note, truncating (non sense and frame shift) mutations were observed only in the regulator coding gene dgoR. This suggest that despite an induction of the operon of dgoRKADT operon by D-galactonate, inactivation of the repressor of the operon promotes growth in D-galactonate, and that growth is further improved when genes implicated in the pathway of degradation of the D-galactonate are mutated. No difference was observed in LB, minimum medium with glucose and minimum medium with gluconate between the wild type and the mutant strains at both pH 5.7 and 7 (data not shown).

Focus on rluD and gidB

rluD and gidB, are other genes that attracted our attention. Specifically, 6 mutations, including frame shifts were found in rluD, and 7 mutations, including 25bp deletions were recovered in gidB. rluD encodes for a pseudo uridin synthetase that catalyses the formation of pseudo uridine on the 23S rRNA, in other words rluD is involved in 23S rRNA maturation. Similarly gidB is implicated in the maturation of the 16S subunit of the ribosomes as it encodes for a S-adenoside methionine dependant methyltransferase. In other bacterial species, both genes have been shown to be involved in streptomycin resistance, as streptomycin antibiotic targets the ribosome (Okamoto et al., 2007; Sato and Iino, 2010). It is therefore very likely that these mutations have been selected for the adaptation to the antibiotic selective pressure that was maintained during our experiment. Moreover the selected functions of both genes seem to be associated as no evolved lineage presented mutations in both genes. Given the frequency of these mutations (7 and 6 mutations), we could find a significant repulsion between the mutations (test of non random association using a correlation: p<0.001), suggesting that once one is found in one of the genes the acquisition of a mutation in the other is no more selected for.

We first wanted to test if these mutations increased resistance to streptomycin in our strain. First, we tested if E. coli K-12 mutants of those two genes, K-12ΔrluD and K-12ΔgidB had a higher resistance level by determining the minimum inhibition concentrations (MIC) of streptomycin. K- 12ΔrluD and K-12ΔgidB had an MIC of 20 and 80 mg/L, compared to 16 mg/L for K-12, confirming that the genes could play a role in resistance. We then measured the MIC of streptomycin of strain 536 and the 15 lineages. The mutation K87R responsible for the resistance to streptomycin has been 80

2. Article n 1 identified previously by others and is responsible for high resistance to streptomycin (Pelchovich et al., 2013). Indeed, MIC was over 150 g/L (more than 30 times higher than the concentration used in the experiment in the water of mice) and fairly similar across evolved lineages and the ancestor. We decided not to go further with MIC as such an incredibly high level of resistance prevents any accurate and reproducible measurement.

To go further, we then inactivated both genes independently or in combination in strain 536 and performed growth curves in LB supplemented or not with streptomycin at 5 g/L. The underlying idea was that the mutations could compensate the cost of resistance to streptomycin that has been shown to be costly. Wild type strain 536 and 536ΔgidB grew faster than 536ΔrluD and 536ΔrluDΔgidB (p = 0.001) in LB rejecting the idea that the mutations were compensating a general cost of resistance. Yet, the mutant strains grew faster than the wild type strain 536 in LB with streptomycin (figure 6), revealing that these mutations could readily be selected for under the antibiotic selective pressure used. rluD mutants were dominant over gidB mutations in the sense that rluD was sufficient to promote the reducing of the DT; adding gidB to it did not bring any further improvement. Hence, once rluD is inactivated, we do not expect to find any further mutations in gidB, however, under the conditions we tested, if gidB is mutated first, rluD could still be selected for. Hence our experimental test in LB could only partially explain the observed full repulsion between the two mutated genes.

2.6 Discussion

Experimental evolution coupled to whole genome sequencing is a powerful tool to study bacterial adaptation (Long et al., 2015). Over the last decades, many E. coli populations have been adapted to different media using that approach. Genomics was then used to uncover the molecular determinants of these adaptations and to follow how reproducible adaptation is. Most studies have been done in vitro and have shown some similarities. First, the response to selection is proportional to the initial maladaptation: clones with initial low fitness tend to improve faster (Couce and Tenaillon, 2015). Second, intermediate convergence is observed among replicate lineages. Moreover this convergence is involving different mutations that affect the same genes or functional units. Though integration at the gene level seems the most relevant to study convergence, mutations in different genes involved in the same function can also be identified as a source of convergence (Tenaillon et al., 2012). Third, some epistatic interactions may be found between mutations within

81

2. Article n 1 genes or among genes (Tenaillon et al., 2012). Fourth, global regulators tend to be frequently recruited during the first steps of adaptations (Hindré et al., n.d.). As most studies, especially long-term ones have been done in vitro or in vivo with laboratory strains, the relevance of these observations remained to be questioned. We therefore wanted to know what fraction of these results would still hold if we were to adapt a natural isolate of E. coli to a more natural environment, the mouse gut. We therefore performed a 363-day assay of colonization of the digestive tract of 30 streptomycin treated mice in 15 independent cages with the B2 strain E. coli 536, and sequenced one evolved clone from each cage. Assuming 18 generations per day for E. coli in streptomycin treated mice (Poulsen et al., 1995), this represents about 6,534 generations.

Similarities and differences with in vitro lessons of experimental evolution

We found some similarities and differences with the previous experiments done:

- We recovered an important signal of convergence between lineages at the mutation, gene/operon and function levels. Convergence at the genome level was highly significant; it was variable among lineages. For instance, 3 out of 10 affected genes of in lineage 9 were targeted in other lineages, while the 6 genes affected in lineage 6 were at least found in another lineage. On average, 56% of the mutations affected genes that were also found mutated in at least another lineage. For instance, lineages 5 and 13 shared mutations in dgoR, dgoD, dgoT, rluD and in between yjiH- kptA. They have therefore 5 common targets (hit by different mutations for 3 out of 5) out of 9 or 7 affected genes respectively. They have hence a more than 50% convergence at the gene level.

-As found in vitro, this convergence was based on different mutations in the affected genes, and could be integrated into higher functional levels: rluD-gidB modifications of the ribosomal RNA for instance.

-Also some interactions among mutations within genes or among genes could be found. For instance, though 11 mutations were found in dgoR, no lineage was found with a double mutation in dgoR (p<0.004), which suggest that once the gene is inactivated, no further mutations in the gene are selected for. Among genes, mutations in gidB and rluD were mutually exclusive, while mutations in dgoT only occurred in lineages having a mutation in both dgoR and dgoT. These nonrandom associations of mutations suggest an important role of epistasis in shaping the adaptive landscape.

82

2. Article n 1

One important difference with previous work was that no global regulator was recovered among the mutations found. In vitro, for instance rpoB is a very frequent target of adaptation (Conrad et al., 2010; Tenaillon et al., 2012). Despite a extremely high level of conservation within the species, with just 4 non-synonymous mutations in rpoB among 128 E. coli isolated genomes for a 4 Kb gene (Lescat et al., 2014), the mutations is one of the most frequent target of adaptation in many systems (Long et al., 2015)). The lack of mutations in global regulators in our system, suggests that the conditions encountered may be less artificial than in vitro conditions. Yet, if no global regulators are hit, we may still question the targeted functions to the mouse gut to uncover how specific they are from our experimental conditions or from our strain?

Functional targets of adaptation

One of the dominant targets of adaptation is the inactivation of dgoR that leads to an enhanced ability to grow on D-galactonate. Other studies performed in chemostats with specific sugar as nutrients allowed the observation of mutation in genes enhancing the increase of utilization of these sugars (Ferenci, 2007; Lee and Palsson, 2010). However, the specific convergence in the dgoR gene has never been reported so far. Particularly, it was not recovered by Barroso-Batista et al (Barroso- Batista et al., 2014) who used also streptomycin treated mice to study the evolution of the K-12 strain. Many hypotheses could be proposed. These observations could be linked to the strain studied. We used the strain E. coli 536 which was isolated after a urinary tract infection in a human patient (Brzuszkiewicz et al., 2006), whereas in other studies the strain used for the experiment of colonization was E. coli K-12 which is the typical laboratory strain (Blattner et al., n.d.). One possibility is that strain 536 could harbor an abnormal genetic background relative to other strains. For instance, Barroso-Batista et al found that the first response of E. coli K-12 to the gut was mutations inactivating gat operon, but this operon is abnormal in E. coli K-12: it is constitutively expressed due to an insertion sequence integration in its repressor (Barroso-Batista et al., 2014). Hence the observed response could be more a specificity of the strain than a signature of the environment. When we looked at dgoR, in E. coli 536, we could not find any clear specificity. Our functional assays showed that dgoR does repress efficiently the operon in the presence of glucose and release the repression in the presence of D-galactonate. We performed further analysis of dgoR gene history using fully sequenced genome of E. coli isolates. We observed that the entire operon dgoRKADT was missing in the strains E. coli from the serotype O157:H7 belonging to the E group. This group is usually responsible for enterohemorragic diarrhea in patients but it is also retrieved as commensal in digestive tracts of cattle (Cray and Moon, 1995). This particular observation could be

83

2. Article n 1 linked to the specific digestive tract of cattle. We further used more than 110 E. coli genome sequences (Lescat et al 2014) to retrieve the mean pairwise rate of synonymous (Ks) and nonsynonymous substitutions (Ka) of dgoR, dgoD and dgoT. For each of these genes, the ratio Ka/Ks was much less than 1 (0.0055, 0.013 and 0.019 respectively) which comforts the idea that the operon is under strong purifying selection.

We then tried to identify the source of D-galactonate in the digestive tract of animals or human individuals, but nothing has been described about this sugar except the big amount produced in specific tissues of patients or mice with a specific genetic disease (Ficicioglu et al., 2005; Yager et al., 2004). However, this sugar can be produced from galactose using the De Ley Doudoroff pathway present in specific aerobic bacteria such as Stenotrophomonas maltophilia (Brechtel et al., 2002). Saffarian et al. recently identified that these bacteria especially S. maltophilia can be retrieved in the crypts of mice (Saffarian et al., 2015). Besides, we know that S. maltophilia can be retrieved in the mucous membranes of patients under antibiotic selective pressure especially with imipenem treatment, this bacteria being naturally resistant to this antibiotic as well as to all aminoglycosides, including streptomycin. We therefore retrospectively searched for the presence of this species in mouse feces stored at -80°C in glycerol, but failed to find it. Surprisingly, in other experiments with CD1 mice used for colonization assay, we identified the presence of this species in the feces but were not able to find if the species could produce D-galactonate (data not shown). We can also hypothesize that dgoRKADT operon may be used to catabolize some alternative sugars found in the digestive tract of mice, and that this sugar is unable to relieve the repression by dgoR and favors therefore its inactivation. Overall, from the data we have, we can say that selection for dgoR inactivation seem not to be a specificity of the strain we used, and must be due to the environmental conditions we evolved the strains in. However, it is hard to know if the selection for dgoR loss is generally selected in the mouse gut or if it is selected here as an indirect consequence of the use of streptomycin that alters the composition of the microbiota as Leatham-Jensen et al observed missenses in other genes of an E. coli K-12 strain evolved model of colonization (Leatham-Jensen et al., 2012). Nevertheless, these convergences illustrate the ability of E. coli to adapt to the specificity of its gut environment, whether they are linked to the host, its diet or E. coli interactions with the gut microbiota (Maltby et al., 2013).

Our second target of adaptation was directly linked to the use of antibiotic in our experimental setting. Thirteen out of the 15 lineages had mutations either in gidB or in rluD, two genes involved in maturation of ribosomal RNAs and linked to the rpsL gene in which mutations may be responsible of resistance to streptomycin. Experiments in which bacteria are evolved in presence of antibiotics represents one of the typical cases of selection that favors the fast emergence of mutations. It is also

84

2. Article n 1 a case of medical importance due to the drastic increase of resistance among pathogenic bacteria (Andersson and Hughes, 2010; Toprak et al., 2011). Two types of mutations are generally selected for in such a regime: mutations that increase resistance to the antibiotic on the one hand and compensatory mutations on the other. The latter are mutations that are selected as they compensate the cost of some previously acquired costly resistance mutations (Andersson and Hughes, 2010). In our case, as the concentration of antibiotic used was much lower than the MIC of the strain (5 g/L compared to an MIC of more than 150 g/L), we were therefore not expecting resistance mutations to be selected for. Indeed, in the case of streptomycin, subinhibitory concentrations higher than 1/4 of the MIC can select for resistance mutations (Gullberg et al., 2011), while in the present setting, strains were exposed to concentrations lower than 1/30 of the MIC. We found however that gidB and rluD inactivations were not compensatory mutations, but mutations that improved growth only in the presence of the antibiotic. Interestingly gidB was found to be inactivated in several streptomycin resistant laboratory strains (Jeong et al., 2009) suggesting that the selection of its inactivation may occur quite broadly under streptomycin selective pressure. Finally, the pattern of convergence suggests that two other sets of genes of unknown functions are involved in adaptation to the streptomycin treated mice gut. ECP4610, a putative transcriptional regulator is affected 8 times and the intergenic region between yjiH and kptA is modified 11 times. It would be interesting to look in more details at the phenotypic effects of these mutations to uncover the specificity of their contribution to adaptation.

Rates of adaptation Besides the targets of adaptation, our data gave also the rate at which mutations are selected for in the mice gut. In experimental evolution, fast rate of substitutions can be found, with about 11 mutations in 2,000 generations in a thermal adaptation experiment using a very maladapted strain (Tenaillon et al., 2012), 15 mutations after 5,000 generations in a less challenging environment (Barrick et al., 2009) In vivo, Barroso-Batista et al. found a mean of 2.3 mutations per lineage over 400 generations (Barroso-Batista et al., 2014). Here we found 95 mutations over 4,838 days of evolution, a rate 4.4, 2.4 and 4.7 times lower than the one found in the afore mentioned experiments if we assume 18 generations per day (Rang et al., 1999) and no mutation in the global regulators. Hence, the selective pressures may not be as strong as in other setting, reflecting that despite the antibiotic selective pressure, the strain was more adapted to its environment than the ones of the previous experiments. The polymorphism data were suggesting low selective pressure as in mice C3Ma and C2Ma some polymorphic alleles shifted their frequency by less than 40% over more than 125 days of evolution (table S3) and over 4 populations sequenced after more than 125 days of 85

2. Article n 1 evolution, a single mutation was found fixed (data not shown). The value of Ka/Ks was substantially higher than one (Ka/Ks=1.90, 95% CI 1.09-4.02) but less than the one found in other studies. For instance, a Ka/Ks close to 5 was found in thermal adaptation suggesting that 80% of non-synonymous mutations were the product of selection in that setting. Such high values of Ka/Ks are not rare, for instance in the long-term experimental evolution, the population sequenced (Barrick et al., 2009) had no synonymous mutations fixed after 20,000 generations. The lower value found here is coherent with a much weaker selective pressure in the mouse gut compared to in vitro evolution.

Overall, our data show that an “evolve and sequence” approach (Long et al., 2015) coupled to an analysis of convergence can be used to uncover the molecular bases of short-term adaptation in an in vivo study. With that approach, we identified several genes as targets of selection in the gut of streptomycin treated mice and showed that adaptation could be caught in the act in a somehow natural setting with a natural isolate. This suggests that short term adaptation may contribute to some of the patterns of prevalence we observed in the wild (Tenaillon et al 2010). While some of the dominant targets, gidB/rluD and dgo operon do not seem to be specific of the strain we used, others such as ECP4610 and mutations between yjiH and kptA are more enigmatic and could be specific of the strain, especially ECP4610 that is only found in few strains. Further studies with various strains will be required to investigate the fraction of that adaptation that is dependent on the genetic background of the host. Finally, the presence of genetic responses linked to the antibiotic used to maintain the strain, call for the development of some colonization assays without such a selective pressure.

Data accessibility

The date used to produce the figure 1, 4, 5 and 6 are accessible on the Dryad website : (http://datadryad.org) with the number doi:10.5061/dryad.4g503 and in the table S1

The date used to produce the figures 2 and 3 are presented in the table S2 and in the MicroScope website (http://www.genoscope.cns.fr/agc/microscope/home).

Funding information: This work was supported by the European Research Council under the European Union’s Seventh Framework Program (FP7/2007-2013)/ERC Grant 310944. The funder had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript. MG was supported by the grant FDM20150633803 obtained from the « Fondation pour la Recherche Médicale »

86

2. Article n 1

2.7 Figures and tables.

Figure 1: Population density of Escherichia coli 536 per gram of feces of the streptomycin treated mice in 15 independent cages after oral colonization of the strain during 363 days. The population density is represented in logarithmic scale. All density populations from the same mice at the different points in the time are relied by a grey curve. The percentage of alive mice is also represented with the black curve.

87

2. Article n 1

Figure 2: Mutation pattern among the evolved lineages. The dendrogram of the evolved lineages (on the left) was computed using parsimony method and the presence or absence of mutations among the lineages. The convergence at the mutation level is showed in red (light red absence of mutation, bright red: presence of mutation) and illustrates similar mutations that appeared in different lineages with the name of the gene or regions it falls into, and the position on the Escherichia coli 536 genome. The specificity at the gene level is shown in blue (light blue: absence of mutations, bright blue: presence of mutation) for the dgoR, dgoT, and dgoD genes, and in green (light green: absence of mutations, bright green: presence of mutation) for the gidB and rluD genes illustrating mutations falling in the same genes, but with possible differences.

88

2. Article n 1

Figure 3: Diagrammatic representation of the dgoRKADT operon of the 15 evolved lineages compared to the Escherichia coli 536 ancestor strain. The mutations are shown at scale all along the operon. Each evolved lineage mutation is represented with a symbol defined in the legend.

89

2. Article n 1

Figure 4: Comparison of total fluorescence of a reporter plasmid of the promoter activity of the dgoRKADT operon in Escherichia coli K-12 derived strains. Boxplots of the total fluorescence of the strain K-12 pdgoR and the mutated strain K-12ΔdgoR pdgoR have been compared in LB and in minimum medium D-galactonate.

90

2. Article n 1

Figure 5: Mutated strains of Escherichia coli 536 in dgoR grew faster than wild type strains in minimum medium D-galactonate. Boxplots of the doubling times in minutes of 11 mutated strains of E. coli 536 in dgoR and 5 non mutated strains in dgoR with the ancestor, in minimum medium D-galactonate at pH 5.7. We found a significant difference between the dgoR mutated strains and the wild type strains using a Welch test (p =0.001)

91

2. Article n 1

Figure 6: Inactivation of rluD and gidB genes confers an advantage to the Escherichia coli strains in LB with streptomycin. Boxplots of the doubling times in minutes of 536, 536ΔgidB, 536ΔrluD and 536ΔrluDΔgidB in LB, and in LB with streptomycin at 5 g/L. We observed in one side that wild type strain 536 and 536ΔgidB grew faster than 536ΔrluD and 536ΔrluDΔgidB in LB and that the mutant strains grew faster than the wild type strains 536 in LB with streptomycin (p < 0.001).

92

2. Article n 1

Number of Gene Position Mutation Mutation type occurrence of the Mouse Gene info mutation 2742937 IS100 IS insertion 1 C8Ma 2743335 Δ151 bp deletion 1 C10Ma rluD ← 23S rRNA pseudouridine synthase 2743427 Δ7 bp deletion 1 C1Ma 2743564 Δ6 bp deletion 3 C5Ma ; C9Mb ; C13Mb 4055032 C→T A380T (GCA→ACA) 1 C13Mb dgoT ← D-galactonate transporter 4055282 C→T M296I (ATG→ATA) 1 C5Ma 4056705 G→A P230S (CCA→TCA) 1 C5Ma 4056714 T→C T227A (ACG→GCG) 1 C6Ma dgoD ← Galactonate dehydratase 4056917 G→A A159V (GCG→GTG) 1 C13Mb 4057008 T→C I129V (ATC→GTC) 1 C1Ma 4058921 G→A Q212* (CAG→TAG) 5 C2Ma; C5Ma; C12Mb; C14Ma; C15Ma 4059047 Δ360 bp deletion 1 C8Ma 4059269 8 bp x 2 duplication 1 C9Mb DNA binding transcriptional dgoR ← 4059403 G→A S51L (TCG→TTG) 1 C1Ma regulator 4059431 G→A R42C (CGC→TGC) 1 C3Ma 4059436 G→A T40I (ACC→ATC) 1 C13Mb 4059478 G→T S26* (TCG→TAG) 1 C6Ma 4106867 G→A Q167* (CAA→TAA) 1 C2Ma 4106902 Δ25 bp deletion 4 C3Ma; C6Ma; C7Mb; C11Ma S-adenosine methionine dependent gidB ← 4107005 G→A Q121* (CAG→TAG) 1 C15Ma methyltransferase- 4107347 Δ80 bp deletion 1 C14Ma 4816907 A→T L185Q (CTG→CAG) 1 C3Ma 4816998 10 bp→10 bp substitution 1 C8Ma ECP_4610 ← 4817034 C→T G143S (GGC→AGC) 1 C5Ma Putative transcriptional regulator 4817174 G→A T96I (ACT→ATT) 1 C14Ma 4817222 G→A A80V (GCC→GTC) 1 C1Ma C1Ma; C2Ma; C3Ma; C5Ma; C7Mb; C9Mb; 4863869 C→T intergenic (-68/-182) 8 C10Ma; C13Mb Putative inner membrane yjiH ← / → kptA 4863869 C→G intergenic (-68/-182) 1 C6Ma protein/2'-phosphotransferase 4863894 A→G intergenic (-93/-157) 2 C8Ma; C15Ma

Group of 77 genes 294397 Δ68069 bp deletion 3 C6Ma; C11Ma; C14Ma Prophagic region

Table 1: Summary of the mutations that were found in the evolved lineages. The name of the genes as well as the orientation (defined by the arrow), the position along the genome, and the flanking genes in case of an intergenic mutation are furthermore displayed, with the mutations, its type, the number of occurrence, and a description of the gene product. Stop codons are indicated with the symbol *.

93

2. Article n 1

2.8 Suplementary data

Time of Time of Log (UFC of E. sampling for Number Cage/Mouse death or sacrifice coli/g of feces) at last day of sequencing in of mutations in day sampling for quantification day C1Ma* 363 363 8.06070771079491 10 C1Mb 363 363 7.66806682749672 - C2Ma* 294† 250 6.74876747269843 6 C2Mb 363 363 6.63078414258986 - C3Ma* 363† 336 7.96472423965146 7 C3Mb 363 363 7.30570096567686 - C4Ma 363 363 7.23707148260702 - C4Mb* 363 363 7.49987912895009 3 C5Ma* 363 363 7.76644674458993 9 C5Mb 363 363 7.40731865967076 - C6Ma* 363 363 7.0565054840939 6 C6Mb 155† 47 8.43622236663783 - C7Ma 217† 188 under 3 - C7Mb* 294† 250 7.66360091757003 5 C8Ma* 363 363 7.05792629895671 6 C8Mb 155† 47 under 3 - C9Ma 363 363 7.26542541681875 - C9Mb* 363 363 7.66775214552169 10 C10Ma* 363 363 7.81582497171875 4 C10Mb 363 363 7.58502665202918 - C11Ma* 363 363 8.41161623162127 9 C11Mb 363 363 7.36317790241283 - C12Ma 363 363 7.77723790728802 - C12Mb* 217† 124 5.18150550793754 1 C13Ma 363† 336 7.45305582274126 - C13Mb* 363 363 7.01652406582206 7 C14Ma* 294† 250 6.60866213944029 6 C14Mb 294† 217 6.75269642456797 - C15Ma* 363 363 6.30045132205151 6 C15Mb 363 363 7.11922766889276 -

Supplementary table S1: Characteristics of mice sampling. The times of sampling as well as the time of death indicated with the symbol † or sacrifice and the quantification of UFC of E. coli per gram of feces are indicated for each mouse. The mice from which is sampled the sequenced genome is indicated with the symbol *.

94

2. Article n 1

number of gene position mutation mutation type occurrence of the mices gene information mutation proA → 294002 G→A A352T (GCC→ACC) 1 C9Mb gamma-glutamylphosphate reductase lacA ← 440941 C→T G15D (GGC→GAC) 1 C14Ma thiogalactoside acetyltransferase lacI ← 446129 C→T G166D (GGT→GAT) 1 C1Ma DNA-binding transcriptional repressor N5-carboxyaminoimidazole ribonucleotide purK ← 614522 C→T A204T (GCG→ACG) 1 C10Ma synthase lnt ← 717564 G→A P35S (CCT→TCT) 1 C7Mb apolipoprotein N-acyltransferase putative peptide transporter permease synonymous yliD → 886038 A→G 1 C15Ma subunit: membrane component of ABC (CGA→CGG) superfamily synonymous ybjO → 912263 C→T 1 C7Mb putative inner membrane protein (GGC→GGT) synonymous hcr ← 925876 G→A 1 C4Mb HCP oxidoreductase NADH-dependent (TAC→TAT)

synonymous DNA-binding transcriptional regulator of the rutR → 1069082 C→T 1 C11Ma (GCC→GCT) alternative pyrimidine degradation pathway

pepT → 1177210 C→T H162Y (CAT→TAT) 1 C9Mb peptidase T putative cyclic-di-GMP phosphodiesterase; ycgF ← 1237724 C→A M238I (ATG→ATT) 1 C1Ma blue-light sensing protein using FAD (BLUF); prfA → 1288777 C→T T247I (ACT→ATT) 1 C1Ma peptide chain release factor RF-1

synonymous oligopeptide transporter subunit ; membrane oppB → 1324900 C→T 1 C2Ma (ATC→ATT) component of ABC superfamily

putative spermidine/putrescine transporter ydcT → 1487629 C→T P312S (CCT→TCT) 1 C15Ma subunit ; ATP-binding component of ABC superfamily tranporter

ydiV ← 1706492 C→T G102D (GGC→GAC) 1 C9Mb no information ynjB → 1749449 G→A G242R (GGA→AGA) 1 C10Ma no information synonymous carboxy-terminal protease for prc ← 1826617 G→A 1 C3Ma (GGC→GGT) penicillin-binding protein 3 edd ← 1845731 Δ8 bp deletion 1 C11Ma 6-phosphogluconate dehydratase 95

2. Article n 1

synonymous TMAO reductase III (TorYZ) cytochrome torY ← 1869108 G→A 1 C4Mb (CTG→TTG) c-type subunit ECP_1979 ← 2062160 G→A T1394M (ACG→ATG) 1 C5Ma putative peptide/polyketide synthase ECP_2075 ← 2142758 +T insertion 1 C13Mb putative glycosyltransferase fused 5-methylaminomethyl-2-thiouridine-forming trmC → 2474432 G→A C294Y (TGC→TAC) 1 C14Ma enzyme methyltransferase ; FAD-dependent demodification enzyme

purM → 2619283 G→A D17N (GAC→AAC) 1 C9Mb phosphoribosylaminoimidazole synthetase 2742937 IS100 IS insertion 1 C8Ma 2743335 Δ151 bp deletion 1 C10Ma rluD ← 2743427 Δ7 bp deletion 1 C1Ma 23S rRNA pseudouridine synthase C5Ma ; C9Mb ; 2743564 Δ6 bp deletion 3 C13Mb synonymous hycB ← 2827044 G→A 1 C2Ma hydrogenase 3 Fe-S subunit (GCC→GCT) synonymous ECP_2819 → 2967951 C→T 1 C11Ma no information (CGC→CGT) exonuclease V (RecBCD complex) alpha recD ← 2982208 G→A P287S (CCC→TCC) 1 C3Ma chain fused PTS enzyme: PEP-protein ptsP ← 2997249 C→T V598M (GTG→ATG) 1 C5Ma phosphotransferase (enzyme I) ; GAF domain containing protein bglA → 3062548 G→A G56S (GGC→AGC) 1 C4Mb 6-phospho-beta-glucosidase A ECP_3029 ← 3189191 G→A Q287* (CAA→TAA) 1 C2Ma conserved hypothetical protein garK ← 3385169 C→T A49T (GCG→ACG) 1 C7Mb glycerate kinase I yhfZ ← 3618362 8 bp x 2 duplication 1 C8Ma no information yrhA → 3702573 G→A E130K (GAA→AAA) 1 C9Mb no information synonymous putative phosphotransferase system ECP_3758 ← 3940609 G→A 1 C3Ma (CGC→CGT) fructose-specific IIC component ECOLI4668 3978547 IS911 IS insertion 1 C6Ma putative Transcriptional regulator fragment of putative transposase ORF 1 ECP_3843 ← 4013586 G→A pseudogene (343/1065 nt) 1 C11Ma IS66 family (partial) 4055032 C→T A380T (GCA→ACA) 1 C13Mb dgoT ← D-galactonate transporter 4055282 C→T M296I (ATG→ATA) 1 C5Ma 96

2. Article n 1

4056705 G→A P230S (CCA→TCA) 1 C5Ma 4056714 T→C T227A (ACG→GCG) 1 C6Ma dgoD ← galactonate dehydratase 4056917 G→A A159V (GCG→GTG) 1 C13Mb 4057008 T→C I129V (ATC→GTC) 1 C1Ma C2Ma ; C5Ma ; 4058921 G→A Q212* (CAG→TAG) 5 C12Mb ; C14Ma ; C15Ma 4059047 Δ360 bp deletion 1 C8Ma putative DNA-binding transcriptional dgoR ← 4059269 8 bp x 2 duplication 1 C9Mb regulator 4059403 G→A S51L (TCG→TTG) 1 C1Ma 4059431 G→A R42C (CGC→TGC) 1 C3Ma 4059436 G→A T40I (ACC→ATC) 1 C13Mb 4059478 G→T S26* (TCG→TAG) 1 C6Ma 4106867 G→A Q167* (CAA→TAA) 1 C2Ma C3Ma ; C6Ma ; 4106902 Δ25 bp deletion 4 methyltransferase SAM-dependent gidB ← C7Mb ; C11Ma methyltransferase glucose-inhibited 4107005 G→A Q121* (CAG→TAG) 1 C15Ma cell-division protein 4107347 Δ80 bp deletion 1 C14Ma D-ribose transporter subunit ; synonymous rbsB → 4120422 C→T 1 C15Ma periplasmic-binding compoent of ABC (CTG→TTG) superfamily ilvM → 4135934 C→T T16I (ACC→ATC) 1 C8Ma acetolactate synthase II small subunit synonymous rspA → 4366232 A→G 1 C11Ma putative enolase/dehydratase (CGA→CGG) maltose transporter subunit ; membrane malF ← 4457127 Δ1 bp deletion 1 C1Ma component of ABC superfamily maltose transporter subunit ; membrane malF ← 4457483 C→T W361* (TGG→TAG) 1 C13Mb component of ABC superfamily dcuA ← 4597046 +C insertion 1 C1Ma C4-dicarboxylate antiporter yjeE → 4625327 T→C F79L (TTT→CTT) 1 C9Mb ATPase with strong ADP affinity ECP_4609 ← 4816138 43 bp x 2 duplication 1 C5Ma putative transcriptionnal regulator 4816907 A→T L185Q (CTG→CAG) 1 C3Ma 4816998 10 bp→10 bp substitution 1 C8Ma ECP_4610 ← putative Transcriptional regulator 4817034 C→T G143S (GGC→AGC) 1 C5Ma 4817174 G→A T96I (ACT→ATT) 1 C14Ma

97

2. Article n 1

4817222 G→A A80V (GCC→GTC) 1 C1Ma putative Transcriptional ECP_4610 ← / → ECOLI4668 4817473 G→A intergenic (-13/-2) 1 C9Mb regulator/hypothetical protein uxuR → 4859241 2 bp→CCA substitution 1 C11Ma DNA-binding transcriptional repressor C1Ma ; C2Ma ; C3Ma ; C5Ma ; 4863869 C→T intergenic (-68/-182) 8 C7Mb ; C9Mb ; putative inner membrane yjiH ← / → kptA C10Ma ; C13Mb protein/2'-phosphotransferase 4863869 C→G intergenic (-68/-182) 1 C6Ma 4863894 A→G intergenic (-93/-157) 2 C8Ma ; C15Ma synonymous deoB → 4917873 G→A 1 C11Ma phosphopentomutase (CTG→CTA) C6Ma ; C11Ma ; Group of 77 genes 294397 Δ68069 bp deletion 3 Prophaggic region C14Ma

Supplementary table S2: Summary of the mutations that were found in the evolved lineages. The name of the genes as well as the orientation and flanking genes in case of an intergenic mutation are furthermore displayed, with the mutations, its type, the number of occurrence, and a description of the gene product.

98

2. Article n 1

C2Ma C3Ma C8Ma C11Ma Gene Position Mutation Mutation type Day 124 Day 250 Day 155 Day 336 Day 155 Day 363 Day 155 Day 363 (MC 32.8 ± 4) (MC 94.9 ± 8.3) (MC 72.3 ± 5.2) (MC 58.3 ± 4.8) (MC 19.2 ± 3.2) (MC 145.8 ± 8.4) (MC 80 ± 7.4) (MC 81.8 ± 8.7) dgoR ← / → yidX 4059689 C→A intergenic (-135/-144) 7.4% 0% 0% 0% 0% 0% 0% 0%

4058921 G→A Q212* (CAG→TAG) 0% 74.1% 0% 0% 20.9% 0% 0% 0%

dgoR ← 4059047 Δ360 bp deletion 0% 0% 0% 0% 75.3% 100% 94.7% 0%

4059431 G→A R42C (CGC→TGC) 0% 0% 16.3% 54.8% 0% 0% 0% 0%

4106867 G→A Q167* (CAA→TAA) 52.5% 80.9% 0% 0% 0% 0% 0% 0% 4106902 Δ25 bp deletion 21.4% 16.9% 100% 100% 0% 0% 8.0% 100%

gidB ← 4107130 G→A P79L (CCT→CTT) 0% 0% 0% 0% 0% 0% 60.2% 0%

4107165 +A insertion 0% 0% 0% 0% 0% 0% 8.7% 0%

4107299 G→A Q23* (CAG→TAG) 0% 0% 0% 0% 0% 0% 31.3% 0%

4858848 C→T Q116* (CAG→TAG) 0% 16.7% 0% 0% 0% 0% 0% 0%

uxuR → 4859140 C→T A213V (GCC→GTC) 0% 0% 0% 6.6% 0% 0% 0% 0% 4859241 2 bp→CCA insertion 0% 0% 0% 0% 0% 0% 0% 100%

4863869 C→T intergenic (-68/-182) 0% 61.2% 73.9% 100% 0% 0% 0% 0%

4863879 C→T intergenic (-78/-172) 0% 0% 0% 0% 24% 0% 0% 0% yjiH ← / → kptA 4863894 A→G intergenic (-93/-157) 0% 0% 0% 0% 0% 100% 0% 0%

4864006 A→G intergenic (-205/-45) 0% 0% 0% 5.7% 0% 0% 0% 0%

Supplementary table S3: Mutation polymorphism in genes of interest at intermediate and final points.

99

III - E. COLI EVOLUTION IN HUMANS: THE CASE OF ACUTE INFECTIONS

1. Evolution of E. coli during acute infection:

1. EVOLUTION OF E. COLI DURING ACUTE INFECTION:

Over the last 3 decades, experimental evolution studies revealed many facets of the adaptive process of E. coli. It showed how convergence or the apparition of mutator strains could be seen as clear signs of adaptation. However, when trying to bring such experiments closer to the natural conditions of E. coli, the results appear to be less striking. Indeed we showed in the previous study that evolution of E. coli in the mouse’s gut is showing clear signs of convergence but no recruitment of global regulators and a lower substitution rate.

In order to continue investigating the evolution of E. coli in the wild, we looked at one particular case that isn’t well documented, the evolution of E. coli during acute infection. Indeed if studies tried to tackle the question of long term evolution of E. coli in the case of chronical infections (Zdziarski et al., 2010), less is known about the diversification process taking place during short infectious events. Levert et al (Levert et al., 2010) showed by looking at 226 isolates gathered from 19 patients from deep E. coli infections that even in the case of mono clonal infections, were all isolates from one patient came from a common ancestor, signs of micro heterogeneity and phenotypic differences could be seen. The variability of the phenotypes was significantly higher than when comparing with commensal strains, leaning towards the ideas of that the observed diversification occurred in the extraintestinal environment rather than in the gut before the infection. Furthermore, mutator isolates were identified in 3 patients, indicating that adaptations was at stake. In order to go further we used the isolates of these three patient to try to characterize the genomic diversity within these infections.

The three patient had three different types of infections: the Patient 11 was suffering from a rare adult meningitides, where the strains were found in the cerebrospinal fluid (CSF); the patient 13 suffered from a pyelonephritis and lastly, the patient 17 suffered from a peritonitis. For the two first patient 7 isolates were sequenced using Illumina technology while 8 were sequenced for the patient 17. Because of the plasticity of E. coli’s genome previously described, we were not able to use a previously sequenced reference genome. We assembled de novo the genome of one strain, annotate it and then use it as a reference for the other strains of the same infection.

Using this method we were able to identify a high number of mutations across the isolates from the samples. Our first observation was that there was indeed presence of mutator strains in each patient, with the isolate 54 of the patient 11 being mutated in the mutS gene, like the isolate p5_36

101

1. Evolution of E. coli during acute infection: from the patient 17, while the patient 13’s isolate 4433 was mutated in the mutL gene. Furthermore, convergences at the gene level were observed in intra-patient and inter-patient comparisons.

.

102

2. Article n 2

2. ARTICLE N°2

Genomic diversity within Escherichia coli acute infections: multiple convergences within and between infections and early steps of mutator selections.

Adrien Launay1*, Alexandre Bleibtreu1*, Jérémie Chatel1, Veronique Robbe-Saule2, Françoise Norel2, Sara Dion1, Olivier Clermont1, Erick Denamur1@, Olivier Tenaillon1@

1 IAME, U1137, INSERM, Université Denis Diderot, Sorbonne Paris Cité, Paris F75018 France.

2 Institut Pasteur, CNRS ERL3526, Laboratoire Systèmes Macromoléculaires et Signalisation, Département de Microbiologie, 75015 Paris, France

* , @ equal contribution of these authors

103

2. Article n 2

2.1 Abstract:

While there are clear evidences of the adaptive process in chronic bacterial diseases such as lung infections in cystic fibrosis patients, much less is known about the adaptive process in acute infections. The presence of mutator strains in urinary tract infections of Escherichia coli, suggests that adaptation may play a role in acute extraintestinal infections as mutators are strains with increased mutation rate that can be driven to high frequency when adaptation is at play. To investigate if acute infections caused by E. coli can lead to adaptation, we have sequenced 7 to 8 clones of three distinct extra-intestinal infections representing a gradient of distance from the gut: a peritonitis, a pyelonephritis and a rare case of adult meningitides. All infections were chosen as having one of the sampled clones with an increased mutation rate. Genomic analysis revealed a higher diversity in the peritonitis than in the meningitides, and a large contribution of the mutator clones to the population diversity. Within each infection, some genes were altered several time independently, a sign of convergence and a hallmark of natural selection. Some loci such as putA, fimH, glpD, rbsR, rpoS, fnr or hupA, were mutated multiple time both within patients and between, revealing that natural selection may recruit the same targets in these different infections. Finally, in two out of the three mutator strains, the mutation at the origin of the increased mutation rate was a deletion that extended to a gene that was in itself mutated multiple time in the infection. This indicates that the early stages of mutator selection may have been facilitated by a coincidental selection. Overall, these observations support an active role of natural selection in shaping the genomes of E. coli during acute extraintestinal infections.

104

2. Article n 2

2.2 Background

When they change of environment, microorganisms are facing new opportunities for adaptation. Experimental evolution, in which bacterial populations are evolved in an artificial media in the laboratory, has given ample proofs of that concept. Using either long-term evolution (Wiser et al., 2013) or replicate evolution with clones of different initial fitnesses, it was shown that adaptation to new conditions was faster when the initial fit to these conditions was poorer (reviewed in (Couce and Tenaillon, 2015)). Yet, much less is known about the extent of bacterial adaptation in the wild due to changes of environment, presumably because it is hard to track bacterial population facing changes of environment. Nevertheless, one such switch that is well documented and has an important medical relevance is the switch to pathogenic environment made by opportunist pathogens.

Opportunist pathogens are not meant to be colonizing pathogenic environment but may take the opportunity if they have the chance. As a consequence, the pathogenic environment represent a new environment to which they may adapt. The most acknowledged case of such a transition, is found in the colonization of the lung of cystic fibrosis patient by soil bacteria such as Bulkolderia dolosa (Lieberman et al., 2011) and Pseudomonas aeruginosa (Jorth et al., 2015; Smith et al., 2006; Yang et al., 2011).In these chronic infections that may last for years, multiple studies have revealed proofs of molecular adaptation. First, as those patients are heavily treated with antibiotics, multiple antibiotic resistances have been found to emerge in these bacterial populations over the years. Second, among these bacterial populations, a large fraction of high mutation rate clones named mutators have been found. Mutators are bacteria that have inactivated some repair system and that subsequently have high mutation rate (Oliver, 2000). The most common defect found among natural mutator isolates is affecting methyl-directed mismatch repair (MMR) a post replication error correction system relying in the action of genes mutS, mutL and mutH (Denamur and Matic, 2006). Experimental evolution (Sniegowski et al., 1997), simulations (Tenaillon et al., 1999) and theoretical analysis (Good and Desai, 2016) have shown that an increase in mutation rate can be selected for when the supply of beneficial mutations is large: the excess of beneficial mutation produced by the mutator clones drive them to high frequency. Finding mutators is therefore a signature of sustained adaptation. Third, whole genome sequences of epidemic clones that infected several patients revealed traces of convergence. Convergence at the molecular level can also be seen as a hallmark of adaptation (Tenaillon et al., 2012). Among all the possible mutations that may occur, finding mutations with similar effects reflects the filtering action of natural selection. For instance, in B. dolosa infections (Lieberman et al., 2011), apart from convergence in antibiotic resistances genes, recurrent mutations

105

2. Article n 2 in the respiration pathway were found implying a specific adaptation to the lung environment. All these observations prove that adaptation is at play in these chronic infections. Indeed, these chronic infections offer the best opportunity for adaptation to proceed: population sizes are large and time is long, both factors giving opportunities for mutation to emerge and be selected for. One question that remains nevertheless more open is whether acute infections, that have a much shorter lifespan and tend to have smaller population sizes, are also coupled to adaptation.

Escherichia coli extraintestinal pathogens can be used as a model of opportunistic pathogen responsible of acute infections. E. coli is a versatile species. It is a commensal of the gut of vertebrates, the dominant facultative anaerobe of the human microbiota (Tenaillon et al., 2010). E. coli is also a health threatening pathogen responsible of close to 1 million death per year due to intraintestinal and extraintestinal pathologies. Now that diarrheas are better treated, most of the burden imposed by E. coli relies on extraintestinal diseases. E. coli is responsible of several acute infections such as urinary tract infections (UTI), septicemia and new born meningitides. As these pathologies are not epidemic, they can be seen as opportunistic. Yet, not all strains are equally likely to produce these infections. B2 and D phylogroups of E. coli are overrepresented among the strains sampled from these infections (Bingen et al., 1998; Picard et al., 1999). They harbor indeed a larger number of genes named virulence factors that are required for survival in the extra-intestinal niches. These factors suggest at first look a specialization of these strains to extraintestinal virulence. However, D and particularly B2 strains can also be successful commensals of humans. The B2 group is representing close to 50% of commensal strains in the human American population (Zhang et al., 2002). It is increasing in France where it was less than 10% in the 80ies and is now of 35 % (Daubié et al., 2016). Closer investigation has now revealed that virulence factors contribute to the commensal success of the strains carrying them. Deletion of all pathogenicity island of E. coli strain 536 came at a cost in a mouse model of commensalism (Diard et al., 2010). These observations suggest that virulence is a by-product of commensalism (Le Gall et al., 2007) and therefore comfort our hypothesis that E. coli extraintestinal pathologies are opportunistic.

As for the case of chronic infections, several studies provides some indirect hints that adaptation may occur during E. coli acute extra intestinal infections. First, populations sampled from infectious sites show signs of diversity. Early studies, revealed diverse colony morphology of monoclonal E. coli UTI (Bettelheim and Taylor, 1969). More recently, analysis of several clones from the same extraintestinal sample in 19 E. coli extraintestinal infections revealed that in the majority of infections, traces of genetic heterogeneity were recovered. This diversity could be neutral and just the product of chance, but phenotypic analysis revealed that it was associated to some 106

2. Article n 2 phenotypic diversity. Indeed, clones from the same infection differed in their antibiotic resistance, their growth rate and their stress resistance, changes that resulted in differences in their intrinsic virulence tested in a mouse model of septicemia (Levert et al., 2010). Second, among these clones (Levert et al., 2010), and more broadly among urinary tract infections, mutators strains are recovered more frequently than in commensal strains (Denamur et al., 2002). Infectious populations can have a small fraction of mutators among them or be dominated by mutator clones (Couce et al., 2016). As mentioned before, the presence of mutators can be seen as proofs that adaptation is at play. Third, experiments with mutators have shown in animal models that adaptation could proceed. Using a mouse model of urinary tract infection, Labat et al. (Labat et al., 2005) showed that mutator strains could avoid clearance and establish as chronic infections. Similarly, Cooper et al (Cooper et al., 2012) showed that deletion of MMR genes was selected for during a 48h ascending UTI, presumably due to a faster switching rate of motility expression in MMR deficient strains.

These evidences are based on diversity within an infection or on short-term evolution in animal models. There are also some epidemiological evidences that selection may recruit pathoadaptive mutations during acute infections, especially during UTI (Sokurenko, 2016). The pioneer work of Sokurenko (Sokurenko et al., 1998) used comparative genomics to identify the signature of selection in type I Fimbriae component fimH. They found that UTI strains were enriched in nonsynonymous variants of that gene. Experimental studies (Sokurenko, 2016) confirmed the role of these mutations in modulating adhesion affinity/specificity and revealed their contribution to pathology in animal models. Other adhesion genes such as Dr Family adhesins or fimbriae genes such as papG were also found to accumulate nonsynonymous mutations in uropathogenic E. coli. These epidemiological evidences support the contribution of mutation to pathogenicity. However, as the evidence are based on comparative studies, whether the mutations occurred before the infection or during it remains an open question. For instance in patients with recurrent acute urinary tract infection, a pathoadaptive mutation in fimH was found to be polymorphic in the E. coli sampled in the bladder, but it was also found to be polymorphic in the fecal population at the time of the infection, suggesting that the mutation preexisted in the commensal habitat(Sokurenko, 2016).

Hence, all the keys for adaptation seem to be present in E. coli acute extraintestinal infections: genetic diversity associated with some phenotypic diversity having consequences on virulence, proof that adaptation is possible in some conditions, existence of candidate genes that seem to contribute to the adaptive process. Yet, further descriptions of the dynamic of the adaptive process at the infectious site would improve our understanding of that process (Sokurenko, 2016). Along these lines, we decided to characterize the genomic diversity present within the infective population of 3 patients 107

2. Article n 2 sampled from the 19 infections previously mentioned (Levert et al., 2010). We choose to examine patients in which a mutator was found as the presence of mutator is an indirect evidence that selection is going on in these infections, furthermore we took three different types of infections. The first one, a peritonitis is linked to the gut bacterial population, the second a pyelonephritis required E. coli to move from the gut to the bladder and subsequently to move upward to the kidneys, and the third was sampled from a rare case of adult meningitides, which requires for E. coli to reach the cerebrospinal fluid (CSF).

2.3 Materials and Methods

Isolation of clone isolates

Three adult patients with severe extraintestinal E. coli infections were studied (Levert et al., 2010) : Patient 11 (P11) had postoperative infection of the cerebrospinal fluid, patient 13 (P13) suffered from pyelonephritis and patient 17 (P17) had peritoneal fluid infection. In all cases, E. coli alone was recovered. Each patient was infected by a single clone with micro diversity (Levert et al., 2010) . 7, 7 and 8 colonies per patient, respectively, were randomly selected after seeding the pathological sample on blood agar (P11 and P17) or cysteine lactose electrolyte deficient (P13) plates (fig 1A). The isolates were sub cultured twice and stored in glycerol at -80°C.

Human ethics statement

All the sampling procedures of the infected patients were performed in the course of the clinical diagnosis. No additional procedure was performed in the patients for the present study. The study was approved by the institutional ethics committee (Comité de Protection des Personnes, Hôpital Saint-Louis, Paris; #2004-06). The participants were informed of their role in the study and written informed consent was provided by study participants.

Sequencing

The different samples were sequenced using the Illumina sequencing technology. All the isolates of the patient 13 as well as 7 of the patient 17 were sequenced in 51bp single reads. The 108

2. Article n 2 isolate p5_36 of the patient 17 was sequenced in 101bp single reads later on. Lastly, the isolates from the patient 11 were all sequenced in 101bp paired ends. This resulted in an mean coverage of 177X for the patient 17 (min 37X, max 447X), 156X for the patient 13 (min 112X, max 405X) and 80X for the patient 11 (min 10X, max 90X)

Genomic analysis.

In order to analyze the diversity among the different samples, we used an approach combining de novo assembly and mapping to try to reconstruct the history of the strains infecting one patient. For each patient one clone was assembled and annotated. Then the other clones from the same patient were mapped to this genome to identify mutations (fig1)

Reads correction and de novo assembly: In a first step all the sequences were corrected using the correction functions implemented in SPAdes 3.6.2 (Bankevich et al., 2012) that rely on bayeshammer. Then one clone, for which no mutator phenotype was observed, was randomly picked for each patient to serve as the reference genome. This clone was reconstructed using SPAdes with the careful option activated. The contigs were filtered in order to only keep sequences with a length larger than 500bp, in order to discard small erroneous contigs. Then, as we wanted to use the assembled clones, we annotated them using Prokka 1.11 (Seemann, 2014). At the end of this step we obtained a genebank files of one clone for each patient with the fasta sequence and the annotations transferred from the closest genes by Prokka. Those files were then used as reference genome for the mapping analysis (fig 1C).

Read mapping: We used breseq 0.27.2 (Deatherage and Barrick, 2014) in order to identify mutations differentiating each clones from one patient. To do so, we used the previously assembled and annotated clone from one patient as a reference and mapped the corrected reads of the other clones to it. We also mapped the reads from the clone that we chose as a reference back to itself in order to identify possible assembly errors. We then filtered the mutations identified by breseq (fig 1B). When identifying mutations, the program stores mutation evidence that can take the form of either read alignment evidence (RA), single nucleotide polymorphisms (SNPs) and short indels, missing coverage (MC), and new junction evidence (JC), corresponding to reads mapping to one part of the reference on one of its side and to another part on the other side, indicating a possible re- arrangement. The program then uses these evidences to make mutation predictions. The RA are easily transformed into prediction as SNPs or short indels when they are supported by at least 85% of the

109

2. Article n 2 reads. However to identify a large deletion breseq needs to have both a missing coverage of the region as well as the new junction evidence that both sides of the region form a new junction. Therefore in the outputs of breseq we can still find some unassigned missing coverage for which no junction evidence was found.

When filtering the outputs of breseq, we first look at the predicted SNPs and short indels. We first removed the mutations that appeared when the reference genome’s reads are mapped back onto itself, as they can be the result of wrongly mapped reads coming from a closely related part of the genome that isn’t in the assembly. We then removed mutations when they were close to one another. We disregarded mutations that were less than 51bp apart from one another. These clustered mutations are usually caused by reads erroneously mapped. They are typically found in proghagic regions which are repeated in the genome but don’t have 100% identity. These regions tend to create trouble to the assembly and produce either chimeric contigs or fragmented contigs that are discarded by our contig length selection. We also removed all mutations that were found in all strains, as well as mutations for which the frequency of the mutated reads was less than 0.95. Then we looked at the unassigned missing coverage evidences that were only maintained when they represented a possible deletion of more than 1000 bp. A single large deletion may encompass several unlinked contigs, and appear therefore as multiple missing coverage evidences. To avoid counting a single deletion multiple times, all contigs with a loss of coverage extending to one extremity (or both) were blasted. Deletions involving contigs that blasted within 5kb of an E. coli genome or plasmid were clustered together and counted as a single large deletion.

Reconstruction of the diversification: The mutations identified for each clones represents how they differed from the clone that we arbitrarily used as the reference. To try to deduce how these strains would differ from their common ancestor, we used a parsimony method. For each mutations that we observed in the clones of one patient, we looked at how common it was across all the clones. If the exact same mutation was identified in more than half of the total number of clones then it is more likely to be the ancestral state. So, we then shifted this mutation by removing it from the clones bearing it, while adding the opposite mutation to the mutation list of the clones that did not have it. This way we can have a vision of how the clones are most likely to have diversified from a common ancestor.

Multi locus sequence typing (MLST) and serotyping: In order to identify the MLST for the isolates of each patient, we used the contigs of the reconstructed genome that we used as a reference in the CGE website (Larsen et al., 2012) by using both the two methods: #1 (Wirth et al., 2006) and

110

2. Article n 2

#2 (Jaureguy et al., 2008). Similarly we also used the same website and files to do the serotyping of the isolates (Joensen et al., 2015).

Mutation rate assay

Mutation frequencies of the isolates were estimated by monitoring their capacity to generate mutation conferring resistance to rifampicin in triplicates in at least three independent cultures (9 values minimum), as described (Denamur et al., 2002).

Red and dry and rough (rdar) morphotype and H2O2 resistance

Cells of the mutants of the isolate P5-35 from patient 17 were tested with these two phenotypic assays, which are depending of RpoS. For the rdar phenotype, cells were grown on lysogeny broth (LB) agar plates without NaCl (LB0) at 28°C. We determined colony morphology and color using LB0 agar supplemented with Congo red (40µg.ml-1) and Coomassie brilliant blue (20 µg.ml-1), as described previously (Levert et al., 2010; Robbe-Saule et al., 2006). For the H2O2 resistance assay, cells were grown overnight in LB, washed, and re-suspended in 0.9% NaCl to a concentration of about 1.108 cells ml-1. H2O2 was added to a final concentration of 50mM. Aliquots of bacteria were removed at 0, 60 and 90 minutes, and numbers of viable cells were determined on LB plates (Levert et al., 2010).

Construction of mutants

rpoS mutants: Deletion and allelic exchanges of rpoS were performed on the P5-35 isolate exhibiting the D118Y RpoS mutation according to the slightly modified protocol of Datsenko and Wanner (Datsenko and Wanner, 2000; Gerlach et al., 2009). Briefly, resistance genes tetA and tetR were PCR amplified with 60 mer primers matching in their extremities to the 5’ and 3’ parts of the rpoS gene (Table SX) and electroporated in P5-35 isolate previously transformed by the pKD46 plasmid. Recombinant clones were selected on Bochner Maloy (Bochner et al., 1980) . Recombinants were checked by PCR using custom primers. Then, the tetAR cassette was either deleted (P5-35 ∆rpoS) or replaced by the rpoS wild type allele amplified from the P5-40 isolate (P5-35 WT) or the original D118Y allele (P5-35 D118Y).

111

2. Article n 2

hupA mutants: Deletion of hupA was performed on the 4429 and 4431 isolates of patient 13 and on the archetypal B2 phylogroup strain 536 using the Datsenko and Wanner protocol (Datsenko and Wanner, 2000). Chloramphenicol acetyl transferase gene (cat) was amplified with the H1P1_hupA and H2P2-hupA primers (Table SX) and electroporated in the isolates previously transformed by the pKD46 plasmid. Recombinants were selected by plating on antibiotic containing plates and verified by PCR. The cat cassette was then removed in the 4429 ∆hupA::cat strain using the pCP20 plasmid. The scar sequence after FLP-mediated excision of antibiotic resistance gene was verified by PCR and Sanger sequencing. Complementation of the 4429 ∆hupA mutant was performed as follow. The hupA gene was PCR amplified and cloned in the pSMARTGCLK plasmid using GC Cloning & Amplification kits (Lucigen Corporation). The recombinant plasmid was electroporated in the E. coli K-12 strain, purified and then electroporated in the 4429 ∆hupA mutant. The insertion was verified by PCR using the SL1 and SR2 primers and classical Sanger sequencing (Table SX).

Sanzer technology sequencing of hupA and put A

hupA (273bp) and putA (3963bp) genes were amplified by PCR in one and five fragments, respectively. The PCR products were then sequenced by classical Sanger technology.

Mouse sepsis assay

A mouse sepsis model was used to assess the intrinsic extraintestinal virulence of the isolates as described elsewhere (Picard et al., 1999). Briefly, for each isolate, 10 outbred female Swiss OF1 mice (3-4 weeks old) from Janvier (France) received a subcutaneous injection into the nape of the neck of approximately 2x108 CFU of stationary-phase bacteria. After inoculation, the mice (which had free access to food and water) were observed for up to 1 week. Time of death was recorded for each mouse. Surviving mice were euthanatized on day 7 by cervical dislocation. In each lethality assay, two control isolates were included: K-12 MG1655 isolate, which does not kill mice by 7 day post challenge, and CFT073 isolate which shows a lethality of ≥80% by 7 day post challenge (Johnson et al., 2006). Control isolates were prepared in the same conditions as test isolates and injected in 5 mice each. Isolates were categorized as killer (K) or non-killer (NK) if they killed ≥80% or ≤20% of the inoculated mice, respectively. Isolates killing mice between these two values were categorized as intermediate killer (IK) (Johnson et al., 2006). In some experiments, the size of the inoculum was decreased as in (Tourret et al., 2010).

112

2. Article n 2

Competition experiments were performed by inoculating 5 to 6 mice in the same conditions as above by a 1/1 ratio of the various mutant and wild type strains. Bacterial cells were numbered in the spleens of animals after grinding and sprawl on LB plates. Distinction between wild type and mutant strains was done either phenotypically looking for the rdar morphotype, which is depending of RpoS (Robbe-Saule et al., 2006) (rpoS mutants) or genetically by PCR using the Verif_hupA_F and Verif_hupA_R primers which gave a 537 or 640bp product according to the absence or presence of hupA, respectively (Table SX) (hupA mutants).

Animal ethics statement

Animal experiments were performed in compliance with the recommendations of the French Ministry of Agriculture and approved by the French Veterinary Services (accreditation A 75-18-05). All animal experimentation was conducted following European (Directive 2010/63/EU on the protection of animals used for scientific purposes) and National (RD 53/2013) regulations for transport, housing and care of laboratory animals. The protocol used was approved by the Animal Welfare Committee of the Veterinary Faculty in Lugo, University of Santiago de Compostela (AE- LU-002/12/INV MED.02/OUTROS 04). All efforts were made to minimize suffering.

2.4 Results and discussion

We sampled clones from 3 infections of our collections of infections that we previously characterized. We choose three types of infections that also presented traces of genetic heterogeneity and the presence of a mutator. Mutation rate estimates of the clones sampled in each of these 3 patients, are reported in figure 2. Clearly each patient had a mutator allele, as clone p5_36 of patient 17, clone 4433 of patient 13 and clone 54 of patient 11 had an elevated mutation rate, comparable to the rate estimated from our control MMR deficient strain M13.

Mutation analysis

For each of the three patients, we sequenced the whole genome of 7 clones, for the ptient 11, and 8 clones for the patient 17. As. E. coli genome is variable in terms of gene content, for each patient, one non-mutator clone was assembled de novo, and the other sequenced clones where then

113

2. Article n 2 mapped on the subset of contigs of the assembled-one. The missing parts are regions that are in multiple slightly diverged copies in the genome and that can therefore not be assembled de novo. Moreover, as the reference genome we have is composed of independent contigs, it is not possible to be as stringent in mutation detection as we can be using evolved lineages with a known ancestral genome. While we can detect point mutations, and small indels in non-repeated regions with high efficiency, the detection of large deletions generate more trouble as they can encompass several contigs. Using genomes of other E. coli genomes, we can nevertheless cluster deletions affecting independent contigs. Insertion sequences integrations are also difficult to identify as breseq is providing many possible junctions between contigs and we could not identify a consistent way to sort out robustly the various outputs provided by the software. We therefore decided not to report that type of mutations. Nevertheless, using our methodology, we were able to identify a substantial level of variability across isolates from the same patient (table S1).

Within sample diversity

Interestingly the level of diversity in the three patients and correlated with the expectation based on the portal of entry of the infection. In the patient 17 suffering from peritonitis, the isolates appeared to be from the phylogroup A, belonging to the ST4358 using EC#1 and ST132 using EC#2, while the predicted serotype was O25:H16. It had the highest level of diversity with 223 independent mutations in total, leading to a mean of 27.9 mutations by isolates (table S1) ( 120 and 17,14 when discarding the mutator isolates that accounts for a big part of the observed mutations). Among those mutations, 142 SNPs were found, among which 22 were intergenic, 95 were nonsynonymous, 6 were nonsense mutations, and 19 were synonymous mutations. We identified 81 indels events with 37 being small indels while 44 where large deletions. Considering small indels, 18 were found to be genic causing frameshifts in 15 cases while 3 cases did not cause any frameshift. Finally, 19 events were intergenic. Many large deletions were found in these patients. A large fraction of these were linked to plasmids. The high level of diversity, with an overall mutation rate similar to the one of the other populations (figure 3) suggests a quite long divergence between the strains, a divergence that presumably predate the origin of the infection. Some non-mutator clones differed from one another by more than 20 mutations. Assuming a mutation rate of about 2 10-10 per site per generation (Lee et al., 2012), this would suggest about a thousand generations of divergence between some of the samples. As the entry point of the infection is directly connected to the gut, such a large diversity is not unexpected and may result from preexisting diversity in the gut. This observation can furthermore

114

2. Article n 2 be reinforced by the fact that 24 mutations were found to be informative (specific to two isolates at least).

For the isolates sampled from the patient 13, with a pyelonephritis, the strain used as a reference was identified as a B2 strain belonging to the ST372 using EC#1 and an unknown ST using EC#2 while the predicted serotype was O83:H31. In this patient, the isolates exhibited an intermediate level of diversity, with a mean of 9.57 mutations per isolates, for a total number of 67 mutations ( 6.83 and 41 when discarding the mutator isolates). 45 SNPs were identified with 8 intergenic mutations, 7 synonymous, 23 non synonymous and 7 stop mutations. 22 indel were found, with 15 small indels, 6 being intergenic, 9 in coding regions with 7 causing frameshifts and 2 not causing frameshift, and 7 large indels. Only 8 mutations were informative, suggesting a recent emergence of diversity and a star like phylogeny for this group of isolates (fig 3).

Lastly, the patient 11 isolates, sampled from the CSF, had a lower number of mutation, and the strain used as a reference was identified as a A strain belonging to the ST1434 using EC#1 and an unknown using EC#2, while the predicted serotype was O18ab:H55s. Only 38 mutations were identified (table S1) in total with a mean of 5.57 mutations by isolates mutations (21 and 3.5 when discarding the mutator isolates). 24 SNPs were found, with 5 intergenic mutations, 5 synonymous mutations, 13 non synonymous mutations and 1 stop mutation. Furthermore 14 indels were identified in total, with 8 large deletion events, and 6 small indels events, 2 being intergenic, 1 not producing any frameshift, and 3 producing a frameshift in coding genes (fig 3). Furthermore, only 2 informative mutations were found identified in this patient.

Structure of the Diversity using point mutations

Focusing on point mutations and excluding in each population the mutator clone, we could quantify diversity and its structure using a population genetics approach. The measure of diversity as estimated by Watterson (Watterson, 1975) was 26.5, 11.8 and 4.8 at the genome level for patients 17, 13 and 11 respectively. The structure of the tree was more star like than expected under classical coalescent as revealed by negative Tajima’s D statistics in all cases (-1.34, -1.29 ,and -1.44 for patient 17, 13 and 11 respectively, but not statistically different from 0 in all cases).

Looking at mutations in coding regions, we could find that excluding mutator clones, there was 48 nonsynonymous mutations for 3 synonymous among the clones of patient 17, 20 nonsynonymous for 2 synonymous in patient 13 and 8 nonsynonymous mutations and a lack of 115

2. Article n 2 synonymous mutations in patient 11. As we roughly should have 3 fold more nonsynonymous mutations than synonymous, there is a clear over representation of nonsynonymous mutations in the dataset which suggests that natural selection is at play in all patients. In mutator isolates, as observed previously (Wielgoss et al., 2011), the ratio is much closer to expectation with 16 synonymous for 53 nonsynonymous in the mutator clone of patient 17, 5 synonymous and 10 nonsynonymous in patient 13 and 5 and 6 respectively for patient 11. Both the structure of the tree and the fraction of synonymous mutations suggest that selection could have shaped the diversity we observed.

Mutator genotype and contribution to diversity

A high level of total mutations was found in each patient, and within each one of them, we were able to identify at least one isolate with a really high mutations rate. Using genomic sequences, we were able to identify the molecular defect responsible of the mutator phenotype affecting in all cases the MMR machinery. Indeed, we were able to identify a deletion of a part of the mutS gene in one isolates of the patient 17, while this gene was completely absent from the one of the patient 11. Lastly one isolate from the patient 13 showed 2 kind of mutations, with a complete deletion of the mutL gene along with an intergenic point mutation close to the mutH gene end that could interact with the termination process during transcription. As all the mutator phenotypes were related to the mismatch repair system our results are consistent with previous work on mutators in the wild (Denamur and Matic, 2006).

Mutators specific mutations contributed to 41% in patient 17, 38% in patient 13 and 44% in patient 11 of the total number of mutations and 53%, 38%, 54% of the point mutations, respectively. These numbers illustrate that despite their quite low frequency in the samples (12.5 to 14.2% in our samples) MMR deficient strains harbor a much larger fraction of the genetic diversity present in the population.

It is known that MMR defect tend to increase mostly transitions, as well as small indels. When comparing the number of transversion and transition between the mutator strains from each patient to the rest of the isolates we were able to identify a significant difference in the patient 17 (p<0.0001) and patient 11 (p=0.0009), whereas we could not conclude on the association of the mutator genotype with the high number of transversion in patient 13 (p=0.3339) most probably because of the lower number of mutations identified.

Within patient convergence 116

2. Article n 2

One of the goal of our approach was to look for signature of adaptation in the sequences. One particular case of such signature is the existence of convergence at the gene level within the population. Experimental evolution has revealed that when genes are targets of adaptation, many mutations in the genes may be adaptive, even in essential genes that are highly constrained (Long et al., 2015). As a consequence, convergence at the gene level, i.e, finding different mutations affecting the same gene, is very often observed between populations (Tenaillon et al., 2016, 2012; Woods et al., 2006) or within (Lee and Marx, 2013). Within population genic convergence is sign of vigorous adaptation: it shows that the population is large enough to drive several beneficial mutations simultaneously to detectable frequency, a phenomenon named clonal interference (Gerrish and Lenski, 1998). In the three populations under study we could find signs of such convergence (table 1). Indeed, in the patient 17, 13 genes were found to be mutated independently more than once, through point mutations, small indels or through some large overlapping deletions. In patient 13 and 11, 9 and 4 genes were independently mutated respectively. The mutations found in these convergent loci represented 13.9% of all mutations in patient 17 and 25.8% of mutations if we exclude mutations from the mutator clone of that infection. In patient 13, convergent mutations represented 40.3 % of all mutations and 48.9% if we exclude the mutator. Finally, in patient 11, they represented 26.3% of all mutations and 42% if we exclude mutations found in the mutator clone. Hence convergence was substantial in all patients and higher in the two infections that are further away from the gut, suggesting that convergence is somehow more marked in extraintestinal environment.

Between patient convergence

While we expected to see convergence among strains isolated from of a patient, it was interesting to observe that convergence also appeared between patients (table 1). Indeed even if the type of infection are quite different, we could observe some convergence at the gene level between clones of the different patients. These convergence could encompass genes in which convergence was observed within patients, or genes mutated only once per patient but recovered in several patients. For instance putA was mutated 4 times in both patient 17 and 13. rpoS and rbsR were mutated twice in patient 13 and 11 and once in patient 17, they were also previously recovered in another patients (Levert et al., 2010). fnr was also mutated twice in both patient 13 and 17. fimH was mutated 4 times in patient 11 and once in both patient 13 and 17. Here again, convergence was substantial, for instance in patient 13, 18 mutations out of 67 (26.9%) were affecting genes also affected in the other two patients. The number is even higher if we focus on the mutations that show a sign of convergence in 117

2. Article n 2 patient 13: out of the 9 clusters of genes that present sign of convergence within patient 13 strains, 6 are also found mutated in the other patients. Similarly 4 out of 13 loci with signs of convergence of patient 17 strains were found mutated in the other infections and 3 out of 4 in patient 11. Some of the loci involved in convergence will be later discussed.

Mutator and convergence

In each of the three infections, a mutator clone was found among a subset of related clones. This dataset represents therefore a unique opportunity to understand the early emergence of mutator clones in the wild. Interestingly, 2 of the mutators occurred through deletions that encompassed some other genes involved in convergence. In patient 11, the deletion of mutS extends all the way to rpoS which is also found mutated in another clone of that infection. Similarly, the deletion of mutL in patient 13 extends to the neighboring miaA which is also found mutated in two other clones. This suggests that the molecular defect generating the mutator may in itself be under selection in 2 out of three cases. This type of coincidental selection may favor the emergence of mutator to a frequency high enough to generate some beneficial mutations. In all three patients some of the mutations involved in convergence were recovered in mutator clones. A fimH mutation was found in patient 11 mutator clone, a frameshift in rpoS as well as a mutation in fnr and some mutations in an unknown gene in patient 13, and mutations in catM, putP, ybjJ, rpsH and PROKKA_04195 an hypothetical protein in the mutator clone of patient 17. All these mutations are therefore candidate mutations that might have contributed to the increase in frequency of the mutator allele.

Mouse model of virulence

One of the limitations of our approach compared to experimental evolution, is that once we have identified potential mutations, it is difficult to validate them experimentally. Each of the infections we have observed are different and cannot be fully reproduced even in animal models. Yet, over the years it has become clear that a mouse model of septicemia could be a good candidate model to explore the extrinsic extraintestinal virulence of the isolates (Picard et al., 1999). We thus assessed extraintestinal intrinsic virulence of the clone isolates using this model. In that model, 10 mice are injected with a culture of E. coli and the lethality and death timing are recorded. Strains fall mostly in two categories, killer strains that kill more than 8 out of 10 mice, and non-killer that kill less than 2. 118

2. Article n 2

While isolates from patient 11 were all non killers, surprisingly, the isolates of the patient 13 and 17 exhibited a wide range of virulence ranging from killer (K) to non-killer (NK) phenotype in similar proportions in the two patients: K (28.5 and 33 % of the isolates, respectively), intermediate killer (IK) (14 and 11 %, respectively) and NK (57 and 55.5 %, respectively). As expected, among the isolates exhibiting a K phenotype, the isolates of the patient 13 which belong to the B2 phylogroup were more virulent as they killed at least 6 mice in less than 18 hours as compared to the A phylogroup isolates of the patient 17 that killed the mice in two to three days (Picard et al., 1999).

As all phylogenies are star like, we performed an association test between the impact on mortality of the strains in the mice model of septicemia and the presence of the various mutations showing signs of convergence. As clone 32 and 35 from patient 17 had the same genotype and were clearly not independent, they were merged into a single strain. Using this approach, we were note able to identify a clear signal, which tends to suggest that in this situation, virulence may be a complex trait that can be modulated by several mutations acting together. However, we could also argue about the size of the sample.

Convergence case studies:

In the next sections we will present some detail analysis of the mutations we found. These are either based on a literature or they involved further experiments. Before we go into those details it is first important to mention that some mutations are known to occur at high rate in bacterial strains passaged many times in the laboratory. Such mutations may lead to spurious convergences. It is for instance well established that rpoS gene is frequently inactivated during laboratory storage in stabs (Bleibtreu et al., 2014). Our strains were nevertheless submitted to few subcultures and were stored in -80°C after two growth cycles. Moreover, in patient 11 the mutS mutator is linked to an inactivation of rpoS and is loaded with 17 mutations, suggesting a quite ancient emergence. To test if hupA or putA inactivations could also emerge in laboratory conditions in the same way as rpoS inactivations, we sequenced hupA and putA and in the 82 strains of the IAI collection (Bleibtreu et al., 2013). In this collection, rpoS is inactivated in 31% of the strains, and we observed a single of HupA inactivation by an IS2 as well as no inactivating mutation in putA.

fimH and fliC

119

2. Article n 2

In the patient 11, a strong convergence was observed in the fimH gene that was mutated 3 times independently, with one isolate exhibiting two mutations. Indeed mutations V133L and L150V were found together in the isolate 50, while the mutations Q62K and T179A were found in the isolate 49 and 54 respectively. FimH was affected at least once in all patients. This gene has already been well characterized, and code for the subunit FimH of the type I fimbriae, which is also composed of the FimA, FimG, FimF and FimC subunits. This protein is implicated in the binding of E. coli to mucosa, and more particularly to the oropharyngeal epithelium, during fecal oral transmission of E. coli, in order to pass through the stomach to reach the large intestine. It was shown the mutations in fimH could be pathoadaptative as mutations could lead to variation in the binding characteristics of the cell. Naturally occurring mutations in fimH could allow for a better binding to the urinary tract, and therefore be a strong factor when shifting from a commensal to a more pathologic habitat (Sokurenko et al., 1998). In the patient 17 the gene fimH was truncated by a large deletion of 3557bp deleting a part of the gene. A deletion of 36bp was found in patient 2. In patient 11, fimH was the dominant signature of convergence with 4 independent mutations, isolate 50 isolate being even mutated twice. All those mutations were nonsynonymous SNPs. If we take into account the infectious nature of the isolates considered, we can hypothesized on the outcomes of such mutations. Indeed as the two mutations in the patient 17 and the patient 13 seems to be deleterious for the protein, we could argue that the absence of binding of the bacteria to the mucosae could be advantageous in these infections. Infection of patient 11 is a rare case of adult meningitides caused by E. coli. In that case, moving all the way from the gut to the brain may have required several binding strategies that lead to the selection of multiple variants in fimH.

Other exposed structures showed sign of convergence. The flagellin component fliC and a major regulator of flagel expression, ydiV, were also found mutated in patient 13 and 11.

rpoS

RpoS is a major stress response sigma factor that controls the expression of many genes under stress (Hengge-Aronis, 2002). rpoS has an extremely complex regulation for a bacterial gene with control at all possible level: transcription, translation, degradation. rpoS was found mutated twice in patient 11, including a large inactivating deletion, twice in patient 13, including a 11bp frameshift, and once in patient 17. Moreover, miaA which directly controls the production of RpoS was also found mutated repeatedly in patient 13. These results confirm that rpoS inactivation seem to be under selection in extraintestinal infections (Levert et al., 2010) with a source-sink epidemiology (Bleibtreu et al., 2014). 120

2. Article n 2

We explored further the functional consequences of the D118Y rpoS mutation found in the P5-32 and P5-35 isolates of patient 17. This mutation is localized in an important region of the protein, the sub region 2.2, which is essential for the binding of the RNA polymerase core (Lonetto, J Bacteriol, 1992, 174, 3843). This amino acid has been reported mutated in a natural isolate of Salmonella enterica (Robbe-Saule, AEM, 2003, 69, 4352). We generated by allelic exchange three mutants of the P5-35 isolate: an rpoS deletion mutant (P5-35 ∆rpoS), a wild type rpoS allele (P5-35 WT) and the reconstruction of the original D118Y allele (P5-35 D118Y). We first assessed the effect of the mutations on the rdar morphotype and the H2O2 resistance of the mutants. Both assays indicated that the P5-35 D118Y isolate was intermediate between the P5-35 WT and the P5-35 ∆rpoS isolates. We then tested the mutants in the mouse model of sepsis and observed that the three mutants have a killer phenotype similar to the P5-35 isolate. Lastly, we performed competition experiments in the mouse model of sepsis by inoculating a 1/1 ratio of different combinations of mutants: P5-35 WT/P5- 35 D118Y, P5-35 WT/P5-35 ∆rpoS and P5-35 D118Y /P5-35 ∆rpoS with 5 mice per combination. The combination of mutants yielded in all the cases a similar killer phenotype as the individual mutants. Furthermore, the numeration of the different mutants using the rdar morphotype in the spleens of the mice did not evidenced statistical differences in the ratio of the mutants in the spleens as compared to the inoculum.

Collectively, these data indicated that the D118Y has no impact in the mouse model of sepsis, in line with an epidemiological study on 82 E. coli natural isolates showing no correlation between the RpoS status and the virulence in the mouse sepsis model (Bleibtreu et al., 2013). Yet, the mutation still seems to have a phenotypic effect and may correspond to a reduced functionality of rpoS.

putA-hupA

putA gene, coding for both a transcriptional repressor and a membrane based enzyme involved in proline degradation, was found to be independently mutated 4 times in both the patient 13 and 17 (fig 4). Out of the 8 mutations, one was a large deletion, 6 were nonsynonymous and 1 synonymous. putP a transporter of proline that is regulated by putA was also found mutated 3 times (plus the large deletion that encompasses putA and putP) in patient 17 suggesting that modification of proline pathway is under selection in that patient.

In both patients with mutation in putA, 4 mutations in hupA and one in hupB were found and always in association with mutations in putA (p=0.02). hupA and hupB gene, which encodes the subunits of the HU protein, a histone like protein that acts as a global regulator. As mutations in putA

121

2. Article n 2 could be found without mutations in hupA or hupB in patient 13, this suggests, provided selective pressure are identical in both infections, that putA mutations preexisted the selection of hupAB mutations. As hupA inactivations (one clone in each patient has a truncated protein) are known for reducing the growth rate of the bacteria (Huisman et al., 1989), this further suggests that putA modifications may alleviate the cost of hupA inactivation.

We therefore explored further the functional consequences of the truncating hupA mutation in two isolates of the patient 13, i.e. 4429 and 4431 as well as in an archetypal B2 phylogroup strain 536. These three strains are highly virulent in the mouse model of sepsis as they killed at least 6 mice over the 10 inoculated in less than 18 hours (Picard et al., 1999). We choose both 4429 and 4431 isolates as the latest but not the first exhibited a putA mutation, which we hypothesized to be involved in the compensation of the hupA mutation (see above). We first used the classical mouse sepsis assay by inoculating 10 mice with 2 X 108 CFU of each of the following mutants: 4429 ∆hupA, 4429 ∆hupA::cat, 4431 ∆hupA::cat and 536 ∆hupA::cat and the corresponding wild type isolates. No difference was observed in the killing of mice between the wild type and mutant isolates. To evidence a milder effect of the mutation, we then decreased of an order of magnitude the inoculum and increased the number of tested mice to 20, as in (Tourret et al., 2010). For ethical reason, we focused these analyses only on the 4429 / 4429 ∆hupA strains. In these conditions, the 4429 isolate killed 6 mice over 20 inoculated whereas the 4429 ∆hupA mutant killed 1 mouse over 20, with no difference in the time of death of the mice (Fisher test p=0.09). The complemented mutant 4429 ∆hupA pSMARThupA tested in 10 mice did not killed any mouse, indicating a non-significant effect of the complementation on the phenotype. Lastly, we performed a competition experiment by inoculating a 1/1 ratio of the 4429 and 4429 ∆hupA isolates at 2 X 108 CFU in 6 mice. We analyzed the presence of both isolates in the spleens of dead mice by PCR on 100 colonies per mice, the length of the PCR product differentiating the WT and the deleted mutant. No statistical difference in the ratio between the wild type and the mutant in the spleen as compared to the inoculum was observed.

Collectively, these data indicated a weak if any role of hupA in the virulence of the isolates observed. Anyway, it cannot account for the drastic difference of the virulence between the isolates of patient 13.

rbsR.

Interestingly, rbsR was found to be mutated in all patients. rbsR is the repressor of the operon controlling ribose degradation and import. In each patient, at least one mutation inactivated that gene,

122

2. Article n 2 suggesting that selection is favoring a nonfunctional gene, and therefore a constitutive expression of the operon. The gene rbsR was furthermore found to be mutated in another patient, the patient 3, by Levert et al (Levert et al., 2010). We have previously observed a similar constitutive activation of an inducible pathway in an experimental adaptation of E. coli to the mice gut of streptomycin treated mice. In that case inactivation of dgoR lead to a better growth in culture with galactonate despite a functional induction of the dgo operon in the presence of galactonate (Lescat et al., 2016). Interestingly, deletion of the ribose operon is one of the first target of adaptation in the long term experimental evolution of Richard Lenski (Tenaillon et al., 2016). Indeed other targets of adaption in experimental evolution were found mutated in one or several patient. The glp operon with glpD and glpK, for instance, was previously described (Applebee et al., 2011). Similarly, mreB (Tenaillon et al., 2016, 2012) a gene involved in rod shape has been found mutated with diverse mutational strategies along the different experiments. The conservation of the targets of adaptation in different systems indicates that some genes are more prone to be recruited by adaptation than others.

2.5 Conclusion

In order to study adaptation in opportunistic extraintestinal infections of E. coli we have studied the genomic diversity present in 3 infections that varied in their pathophysiology. We found that diversity in the sample was connected to the proximity of the gut. The highest diversity was recovered in the infection closely connected to the gut and the lowest in the most distant from the gut: the meningitis. The identity of the mutations revealed a high level of convergence both within and between infections: in both the meningitis and the pyelonephritis, mutations found in genes hit multiple times represented more than 40% of mutations when mutator specific mutations were excluded from the analysis. As convergence is a hall mark of adaptation, these results suggest that adaptation is at play in these infections and that adaptive diversity may be generated efficiently even in deep and complex infections that require a priori several bottlenecks.

Our data also illustrate for the first time the early emergence of mutator alleles in wild conditions. Here, the mutators harbored only a few mutations and we could therefore hope to identify mutations that drove them to high frequency. Indeed several mutations involved in convergence were found in mutator background. The mutator allele may have used some of these to increase in frequency. Yet, in two cases out of three, we observed that the mutation generating the mutator allele may in itself be adaptive. In both cases the mutator phenotype is due to a large deletion that encompass

123

2. Article n 2 other genes. In both cases the genes co-deleted were found to be targets of selection as they were mutated in some other strains. Hence, in both of these infections, the mutator may have emerged as a byproduct of a selection acting on the deletion of a neighboring gene. This could explain the presence of mutators in short lived infections that may have endured severe bottlenecks. Yet, even if the early selection of mutators may be coincidental, in each of the infections, despite a frequency of less than 14%, mutator clones contributed to close 40% of all mutation and close to 50% of all point mutations. Hence they may greatly contribute to the evolvability of the population.

Acknowledgments

We are grateful to Nicolas Plault for technical help. Alexandre Bleibtreu was partially supported by the “Fondation pour la Recherche Médicale”. This work was partially funded by the ColAdapt grant from the IDEX of Sorbonne Paris Cité.

124

2. Article n 2

2.6 Figures

Fig 1: Workflow of the mutation identification for each patient. All the isolates are sequenced using illumina technology (A). For each patient, one isolate represented in red is used to construct a reference genome using Spades for the assembly process and prokka for the identification and annotation of genes (B). Then the reads of each sample is compared one by one to the reference genome using breseq (C) to identify variants.

125

2. Article n 2

Fig 2: Mutation Frequency of rifampicin resistance for the isolates of each patients. The mutation frequencies of the isolates was estimated by their capacity to generate mutation conferring resistance to rifampicin in three independent cultures as described (Denamur et al, 2002). The Two controls are a MMR deficient strain, M13, and the k12 MG1655 strain

126

2. Article n 2

Fig 3: Phylogenetic representation of the isolates of each patient. Representation reconstructed by neighbor joining methods and using the mutations identified for each patients. The lineages with mutator phenotypes were represented with a light blue edge.

Fig 4: Patterns of convergence in the hupA, hupB and putA genes for the patient 17 and 13. The phylogenetic representation were reconstructed by neighbor joining methods and using the mutations identified for the patients 17 and 13. The mutations in putA, hupA, and hupB are shown in red, green and yellow respectively.

127

2. Article n 2

number of clone affected in patients gene name function total mutations type of mutations 11 13 17 putA Bifunctional protein PutA 8 0 4 4 SNPs, large deletion fimH Minor fimbrial subunit, D-mannose specific adhesin 6 4 1 1 SNPs, large deletion

glpD Aerobic glycerol3phosphate dehydrogenase 6 0 4 2 SNPs, intergenic SNPs, Frameshift rbsR Ribose operon repressor 5 2 2 1 frameshift, SNPs rpoS RNA polymerase sigma factor RpoS 5 2 2 1 large deletion, frameshift, SNPs PROKKA_02004 Hypothetical protein 5 0 5 0 integenic SNPs fnr Fumarate and nitrate reduction regulatory protein 4 0 2 2 large deletion, SNPs hupA DNAbinding protein HUalpha 4 0 1 3 stop mutation, frameshift, SNPs putP Sodium/proline symporter 4 0 0 4 large deletions, SNPs fliC Flagellin 3 1 2 0 SNPs miaA tRNA dimethylallyltransferase 3 0 3 0 SNP, integenic SNP, frameshift catM HTHtype transcriptional regulator CatM 3 0 0 3 frameshift, SNPs acrA Multidrug efflux pump subunit AcrA 2 2 0 0 large deletions mreB Rod shapedetermining protein MreB 2 1 0 1 SNPs crcB/cspE Putative fluoride ion transporter CrcB/Cold shocklike protein CspE 2 1 0 1 Intergenic SNPs ydiV Putative antiFlhC(2)FlhD(4) factor YdiV 2 0 2 0 stop mutations mutS Component of Methyl directed Mismatch Repair 2 0 1 1 large deletion acrB Multidrug efflux pump subunit AcrB 2 0 1 1 frameshift, SNP rpsH 30S ribosomal protein S8 2 0 0 2 SNPs ybjJ Inner membrane protein YbjJ 2 0 0 2 frameshift, integenic icsA Invasion protein IcsA 2 0 0 2 large deletions rop Regulatory protein ro 2 0 0 2 large deletion finO Fertility inhibition protein 2 0 0 2 large deletion dacA DalanylDalanine carboxypeptidase DacA precursor 2 0 0 2 frameshift PROKKA_04195 Hypothetical protein 2 0 0 2 SNPs

Table 1: summary of the mutations of interest. Mutation for which convergence, taking the form of independent mutations, was identified inside a patient (in white) and betwin patient between patients (in grey)

128

2. Article n 2

Percentage of killed Patients isolates mice after 6 days

48 0% 49 0% 50 0% Patient 11 51 0% 52 0% 53 0% 54 0%

4428 0% 4429 100% 4430 0% Table 2: Mouse sepsis assay. Each Patient 13 4431 100% isolates was incolulated by injection to 4432 50% mices into the nape of the neck with 4433 0% approximately 2x108 CFU of stationary- 4434 0% phase bacteria as described in (picard p5_32 100% 1999). The table show te percentage of p5_34 0% killed mice after 7 days

p5_35 90% p5_36 0% Patient 17 p5_38 0% p5_39 40% p5_40 0% p5_41 0%

129

2. Article n 2

2.7 Suplementary data

present Patient Gene type mut Effect Descritption in isolates patient -/- DEL 527bp large_indel -/- 54 11 patient -/rhaA INS 5bp intergenic(-/+41) -/Lrhamnose isomerase 53 11 patient [acrR]-acrA DEL 2087bp large_indel [acrR] acrA 54 11 patient [galM_2]-[galE] DEL 4050bp large_indel [galM_2] galK galT [galE] 54 11 patient [mscK]-acrA DEL 2329bp large_indel [mscK] acrR acrA 54 11 patient [PROKKA_01761]-hsdR_1 DEL 7251bp large_indel [PROKKA_01761] PROKKA_01762 hsdM PROKKA_01764 hsdR_1 54 11 patient Aerobic respiration control sensor protein ArcB/Enhancing lycopene biosynthesis arcB/elbB SNP T->C intergenic(+123/107) 54 11 protein 2 patient asnC SNP G->A synonymous Regulatory protein AsnC 54 11 patient copA_1 INS 1bp frameshift Copper resistance protein A precursor 53 11 patient crcB/cspE SNP C->A intergenic(+14/+40) Putative fluoride ion transporter CrcB/Cold shocklike protein CspE 50 11 patient ddpF SNP T->C synonymous putative DDdipeptide transport ATPbinding protein DdpF 52 11 SNP C->A Q62K Minor fimbrial subunit, D-mannose specific adhesin 52

patient SNP G->T V133L Minor fimbrial subunit, D-mannose specific adhesin 52 fimH 11 SNP C->G L150V Minor fimbrial subunit, D-mannose specific adhesin 54

SNP A->G T179A Minor fimbrial subunit, D-mannose specific adhesin 53 patient fliC_2 SNP T->A D70V Flagellin 53 11 patient glpB SNP G->A synonymous Anaerobic glycerol3phosphate dehydrogenase subunit B 49 11 patient lgt SNP T->C synonymous Prolipoprotein diacylglyceryl transferase 54 11 patient Bifunctional aspartokinase/homoserine dehydrogenase 2/510methylenetetrahydrofolate metL/metF SNP T->C intergenic(+25/324) 54 11 reductase patient mreB SNP G->A D195N Rod shapedetermining protein MreB 54 11 130

2. Article n 2 patient nhaR SNP C->T H131Y Transcriptional activator protein NhaR 53 11 patient NacetylgalactosamineN N'diacetylbacillosaminyldiphosphoundecaprenol pglJ/epsF_3 SNP G->T intergenic(+521/228) 51, 49 11 4alphaNacetylgalactosaminyltransferase/Putative glycosyltransferase EpsF patient PROKKA_00378 DEL 822bp large_indel PROKKA_00378 48 11 patient 52, 51, PROKKA_00654 SNP A->G Q180R transport protein TonB 11 50 patient PROKKA_01178 SNP C->T Q99* Prophage tail length tape measure protein 54 11 patient PROKKA_02795 SNP A->G F117S chemotaxis regulator CheZ 53 11 patient PROKKA_03112 DEL 576bp large_indel PROKKA_03112 54 11 patient PROKKA_03254 DEL 12bp no_frameshift Putative type1 restriction enzyme specificity protein MPN_089 53 11 21 genesPROKKA_03301 PROKKA_03302 PROKKA_03303 PROKKA_03304 patient PROKKA_03305 PROKKA_03306 PROKKA_03307 PROKKA_03308 PROKKA_03309 PROKKA_03301-[rpoS] DEL 20272bp large_indel 54 11 PROKKA_03310 PROKKA_03311 PROKKA_03312 mutS pphB glcR garR_2 PROKKA_03317 fucA_2 ygbM ygbN [rpoS] patient PROKKA_03486/PROKKA_03487 DEL 41bp39bp intergenic(+1/6) tRNAVal/tRNAVal 54 11 patient PROKKA_03620 SNP A->G T46A von Willebrand factor type A domain protein 49 11 patient prpD SNP T->C synonymous 2methylcitrate dehydratase 54 11 patient purM/upp SNP A->G intergenic(218/107) Phosphoribosylformylglycinamidine cycloligase/Uracil phosphoribosyltransferase 54 11 patient SNP T->G L111R Ribose operon repressor 52 rbsR 11 DEL 20bp frameshift Ribose operon repressor 48 patient rpoS SNP T->C N124S RNA polymerase sigma factor RpoS 50 11 patient rstA SNP C->T D35N Transcriptional regulatory protein RstA 50 11 patient wecA INS 1bp frameshift Undecaprenylphosphate alphaNacetylglucosaminyl 1phosphate transferase 50 11 patient 4430, -/- DEL 1078bp large_indel -/- 13 4433 patient -/- SNP T->C intergenic(-/-) -/- 4434 13 patient 4429, -/ftsK INS 5bp intergenic(-/201) -/DNA translocase FtsK 13 4433 patient [amiB]-[miaA] DEL 3265bp large_indel [amiB] mutL [miaA] 4433 13 patient [ompW]-PROKKA_00021 DEL 1657bp large_indel [ompW] PROKKA_00021 4433 13

131

2. Article n 2 patient acrB_2 DEL 1bp frameshift Multidrug efflux pump subunit AcrB 4428 13 patient barA_1 SNP C->T Q287* Signal transduction histidineprotein kinase BarA 4428 13 patient barA_2 SNP G->T E263* Signal transduction histidineprotein kinase BarA 4434 13 patient crp SNP G->A C19Y cAMPactivated global transcriptional regulator CRP 4433 13 patient cytR SNP C->T Q38* HTHtype transcriptional repressor CytR 4429 13 patient fdoH SNP A->G synonymous Formate dehydrogenaseO ironsulfur subunit 4433 13 patient SNP G->T Q90H Flagellin 4434 fliC 13 SNP T->C S477P Flagellin 4428 patient SNP C->G I18M Fumarate and nitrate reduction regulatory protein 4434 fnr 13 SNP G->A G185D Fumarate and nitrate reduction regulatory protein 4433 patient ftsA_2 SNP C->T synonymous Cell division protein FtsA 4433 13 patient glmS SNP G->A synonymous Glutaminefructose6phosphate aminotransferase [isomerizing] 4433 13 SNP A->T D5V Aerobic glycerol3phosphate dehydrogenase 4431

INS 10bp frameshift Aerobic glycerol3phosphate dehydrogenase 4429 patient glpD 13 SNP T->C L151P Aerobic glycerol3phosphate dehydrogenase 4430 4434, SNP A->G E450G Aerobic glycerol3phosphate dehydrogenase 4428 patient glpK SNP C->A S16R Glycerol kinase 4433 13 4429, patient gltC_1 INS 9bp no_frameshift HTHtype transcriptional regulator GltC 4430, 13 4431 patient Oligobetamannosidespecific phosphotransferase enzyme IIB component/putative gmuB/yidP SNP C->T intergenic(67/132) 4434 13 HTHtype transcriptional regulator YidP patient hupA DEL 5bp frameshift DNAbinding protein HUalpha 4428 13 patient ilvB SNP T->C T93A Acetolactate synthase isozyme 1 large subunit 4433 13 patient iscA_1 SNP G->T synonymous Ironbinding protein IscA 4428 13 patient matA/ykgO DEL 1bp intergenic(191/+893) HTHtype transcriptional regulator MatA/50S ribosomal protein L36 2 4433 13 patient miaA SNP C->T Q166* tRNA dimethylallyltransferase 4428 13 patient miaA/hfq DEL 1bp intergenic(+122/417) tRNA dimethylallyltransferase/RNAbinding protein Hfq 4429 13 132

2. Article n 2 patient minC SNP C->T A155T Septum sitedetermining protein MinC 4430 13 patient nusB SNP T->A L85Q N utilization substance protein B 4433 13 patient prfB SNP C->A A177E Peptide chain release factor 2 4433 13 patient PROKKA_00268 DEL 5bp frameshift PKHDtype hydroxylase 4434 13 patient PROKKA_00425/mutH INS 1bp intergenic(22/+47) Integral membrane protein TerC family protein/DNA mismatch repair protein MutH 4433 13 patient PROKKA_01292 SNP C->T G37D Tetratricopeptide repeat protein 4433 13 patient PROKKA_01727 SNP C->T synonymous hypothetical protein 4433 13 PROKKA_01733 PROKKA_01734 PROKKA_01735, PROKKA_03376 PROKKA_01733- PROKKA_03377 PROKKA_03378 PROKKA_03379 PROKKA_03380 PROKKA_03381 PROKKA_01735, PROKKA_03382 PROKKA_03383 PROKKA_03384 PROKKA_03385 PROKKA_03386 PROKKA_03376- PROKKA_03387 PROKKA_03388 PROKKA_03389 PROKKA_03390 PROKKA_03391 patient PROKKA_03393, 4433, DEL - multi_contig PROKKA_03392 PROKKA_03393, PROKKA_01512 PROKKA_01513 13 PROKKA_01512- 4430 PROKKA_01514 PROKKA_01515, PROKKA_03500 PROKKA_03501 PROKKA_01515, PROKKA_03502 nanS_2 PROKKA_03504 PROKKA_03505 PROKKA_03506 PROKKA_03500- PROKKA_03507 PROKKA_03508 dpnA PROKKA_03510 PROKKA_03511 rusA_2 PROKKA_03513, -/- PROKKA_03513, -/- SNP G->A intergenic(85/-) hypothetical protein/- 4433

SNP T->C intergenic(139/-) hypothetical protein/- 4434 patient PROKKA_02004/- SNP A->G intergenic(167/-) hypothetical protein/- 4433 13 SNP T->A intergenic(303/-) hypothetical protein/- 4433

SNP A->G intergenic(390/-) hypothetical protein/- 4434 patient PROKKA_02389/holE INS 1bp intergenic(39/+64) Carbonnitrogen hydrolase/DNA polymerase III subunit theta 4433 13 patient PROKKA_02526 SNP A->G V97A Electron transfer flavoproteinubiquinone oxidoreductase 4433 13 patient PROKKA_02554 SNP G->A A111T hypothetical protein 4433 13 PROKKA_03126- PROKKA_03126 PROKKA_03127 smc PROKKA_03129 PROKKA_03130 patient PROKKA_03136, 4433, DEL - multi_contig PROKKA_03131 PROKKA_03132 PROKKA_03133 PROKKA_03134 PROKKA_03135 13 PROKKA_03394, 4430 PROKKA_03136, PROKKA_03394, PROKKA_00818 PROKKA_00818 patient PROKKA_03158 SNP C->T G84D hypothetical protein 4433 13 patient PROKKA_03411/uvrY SNP G->A intergenic(448/11) hypothetical protein/Response regulator UvrY 4430 13 patient PROKKA_03910/PROKKA_03911 DEL 1bp intergenic(84/+36) hypothetical protein/hypothetical protein 4433 13

133

2. Article n 2 patient PROKKA_04206- 4433, DEL 1431bp large_indel PROKKA_04206 PROKKA_04207 PROKKA_04208 13 PROKKA_04208 4430 patient PROKKA_04314- DEL 2531bp large_indel PROKKA_04314 PROKKA_04315 PROKKA_04316 PROKKA_04317 PROKKA_04318 4433 13 PROKKA_04318 SNP T->G E451A Bifunctional protein PutA 4431 patient SNP C->T R409H Bifunctional protein PutA 4428 putA 13 SNP C->T D182N Bifunctional protein PutA 4434

SNP G->A A93V Bifunctional protein PutA 4430 patient rbsC_1 DEL 1bp frameshift Ribose transport system permease protein RbsC 4434 13 patient INS 2bp frameshift Ribose operon repressor 4434 rbsR 13 SNP T->G L122R Ribose operon repressor 4431 patient rimM SNP G->A synonymous Ribosome maturation factor RimM 4433 13 patient 4428, rpoA SNP C->T W321* DNAdirected RNA polymerase subunit alpha 13 4429 patient DEL 11bp frameshift RNA polymerase sigma factor RpoS 4433 rpoS 13 SNP C->T R100H RNA polymerase sigma factor RpoS 4431 patient fimH DEL 36bp no_frameshift Minor fimbrial subunit, D-mannose specific adhesin 4431 13 patient SNP G->T E55* Putative antiFlhC(2)FlhD(4) factor YdiV 4429 ydiV_2 13 SNP G->T G124* Putative antiFlhC(2)FlhD(4) factor YdiV 4428 patient yibH SNP T->A synonymous Inner membrane protein YibH 4434 13 patient p5_36, -/-, -/- DEL - multi_contig -/-, -/- 17 p5_34 patient -/-, PROKKA_01451 DEL - multi_contig -/-, PROKKA_01451 p5_36 17 patient -/arcC1_1 INS 5bp intergenic(-/94) -/Carbamate kinase 1 p5_36 17 patient -/fecI DEL 3bpGCA intergenic(-/289) -/putative RNA polymerase sigma factor FecI p5_36 17 patient -/gfcA DEL 3bpGCA intergenic(-/255) -/Threoninerich inner membrane protein GfcA precursor p5_36 17 patient -/intS_4 DEL 3bpGCA intergenic(-/+26) -/Putative prophage CPS53 integrase p5_36 17 patient -/pptA SNP A->G intergenic(-/184) -/Tautomerase PptA p5_36 17 patient -/PROKKA_00001 DEL 1bp intergenic(-/+115) -/hypothetical protein p5_36 17 patient -/PROKKA_02539 SNP G->A intergenic(-/+176) -/hypothetical protein p5_36 17 134

2. Article n 2 patient -/PROKKA_03061 DEL 1bp intergenic(-/+1522) -/hypothetical protein p5_34 17 patient -/PROKKA_04414 SNP G->T intergenic(-/+46) -/Prophage CP457 regulatory protein (AlpA) p5_38 17 patient -/yafN_1 DEL 3bpGCA intergenic(-/+31) -/Antitoxin YafN p5_36 17 patient [flu]-[yeeR] DEL 2018bp large_indel [flu] [yeeR] p5_40 17 patient DEL 4112bp large_indel [icsA_1] p5_41 [icsA_1] 17 DEL 3599bp large_indel [icsA_1] p5_38 patient [mutS]-[glcR] DEL 1478bp large_indel [mutS] pphB [glcR] p5_36 17 patient [PROKKA_03112]- [PROKKA_03112] PROKKA_03113 PROKKA_03114 PROKKA_03115 traA traY DEL 6318bp large_indel p5_38 17 PROKKA_03122 PROKKA_03118 traM PROKKA_03120 PROKKA_03121 PROKKA_03122 patient [PROKKA_03750] DEL 1302bp large_indel [PROKKA_03750] p5_40 17 patient [putP]-ybeZ_1 DEL 6336bp large_indel [putP] efeU efeO efeB ybeZ_1 p5_34 17 patient [rapA]-araD DEL 4388bp large_indel [rapA] polB araD p5_40 17 patient [tamB]-[cpdB] DEL 7501bp large_indel [tamB] tamA msrA corC_3 PROKKA_03922 PROKKA_03923 cysQ [cpdB] p5_38 17 patient DEL 9907bp large_indel [uspE] fnr ogt_2 abgT abgB abgA gbpR_2 smrA ydaM p5_41 [uspE]-ydaM 17 DEL 9883bp large_indel [uspE] fnr ogt_2 abgT abgB abgA gbpR_2 smrA ydaM p5_36 patient aat SNP C->T A65V Leucyl/phenylalanyltRNAprotein transferase p5_36 17 patient acrB_2 SNP G->A A967V Multidrug efflux pump subunit AcrB p5_36 17 patient acs-gltP DEL 10543bp large_indel acs nrfA nrfB ydhX_1 PROKKA_00425 ccmF_1 ccmH_1 nrfG gltP p5_34 17 patient actP SNP C->A T205N Cation/acetate symporter ActP p5_36 17 patient alaA SNP G->T P321T Glutamatepyruvate aminotransferase AlaA p5_38 17 patient alaE DEL 1bp frameshift Lalanine exporter AlaE p5_36 17 patient aldA SNP C->T G213E Lactaldehyde dehydrogenase p5_36 17 patient alsK SNP C->T C188Y Dallose kinase p5_36 17 patient aroF SNP G->T C267F Phospho2dehydro3deoxyheptonate aldolase Tyrsensitive p5_36 17 patient atpA SNP C->T D261N ATP synthase subunit alpha p5_36 17 patient atpG SNP A->T L61Q ATP synthase gamma chain p5_38 17 135

2. Article n 2 patient bcsA_1-PROKKA_01501 DEL 7528bp multi_contig bcsA_1 bcsB_1 ltrA PROKKA_01501 p5_41 17 patient betI_1/aspT_1 DEL 1bpTG intergenic(+109/+66) HTHtype transcriptional regulator BetI/Aspartate/alanine antiporter p5_36 17 patient PTS system betaglucosidespecific EIIBCA component/Cryptic betaglucoside bgl operon bglF_1/bglG DEL 1bp intergenic(42/+91) p5_36 17 antiterminator patient bglH_1 SNP A->G synonymous Cryptic outer membrane porin BglH precursor p5_36 17 patient bsmA SNP G->A synonymous Lipoprotein BsmA precursor p5_36 17 SNP C->T L42F HTHtype transcriptional regulator CatM p5_36 patient catM INS 1bp frameshift HTHtype transcriptional regulator CatM p5_38 17 SNP C->G C294W HTHtype transcriptional regulator CatM p5_34 patient codB SNP C->T synonymous Cytosine permease p5_38 17 patient p5_36, crcB_1/cspE SNP C->A intergenic(+14/+40) x 17 p5_38 cspG_2 prokka_30032 yccM_2 torS torT torR torC torA torD cbpm cbpA prokka_03042 patient cspG_2-ybeZ_1, pgaA DEL - multi_contig agp prokka_03044 wrbA prokka_03046 rutG rutF rutE rutD rutC rutB rutA rutR putA p5_40 17 putP efeU efeO efeB ybez_1, pgaA patient cueO SNP G->A P20S Blue copper oxidase CueO precursor p5_36 17 patient cyoB SNP A->T W318R Cytochrome bo(3) ubiquinol oxidase subunit 1 p5_34 17 patient cysW_1 DEL 183bp no_frameshift Sulfate transport system permease protein CysW p5_34 17 patient INS 1bp frameshift DalanylDalanine carboxypeptidase DacA precursor p5_38 dacA 17 INS 5bp frameshift DalanylDalanine carboxypeptidase DacA precursor p5_34 patient dcd SNP A->T D128E Deoxycytidine triphosphate deaminase p5_40 17 patient deoR/ybjG SNP T->C intergenic(17/+41) Deoxyribose operon repressor/Putative undecaprenyldiphosphatase YbjG p5_38 17 patient dmlR_5 SNP G->A R178Q HTHtype transcriptional regulator DmlR p5_36 17 patient dppA_1/PROKKA_00214 SNP A->C intergenic(242/+501) Periplasmic dipeptide transport protein precursor/tRNAPro p5_34 17 patient epmB SNP T->C synonymous Llysine 23aminomutase p5_36 17 patient exoX SNP G->T H134Q Exodeoxyribonuclease 10 p5_41 17 patient fadA SNP T->C synonymous 3ketoacylCoA thiolase p5_36 17 patient fepA, entF, cusF- fepA, fes, prokka_04464 entF, cusF cusB cusA pheP ybdG nfnB PROKKA_00641 DEL - multi_contig p5_38 17 PROKKA_00644, npt PROKKA_00642 ybdK PROKKA_00644, 4'?phosphopantetheinyl transferase Npt

136

2. Article n 2 patient finO PROKKA_03096 PROKKA_03097 traI traD ylpA PROKKA_03101 finO-[PROKKA_03102] DEL 8215bp large_indel p5_40 17 [PROKKA_03102] patient finO-[traD] DEL 4957bp large_indel finO PROKKA_03096 PROKKA_03097 traI [traD] p5_38 17 patient frmR/PROKKA_03716 INS 1bp intergenic(140/+47) Transcriptional repressor FrmR/hypothetical protein p5_36 17 patient ftsZ SNP G->A synonymous Cell division protein FtsZ p5_36 17 patient fusA SNP T->A I654N Elongation factor G p5_38 17 patient gadC DEL 1bp frameshift Glutamate/gammaaminobutyrate antiporter p5_36 17 patient gbpR_1 SNP T->C synonymous HTHtype transcriptional regulator GbpR p5_36 17 patient gcvA_2 SNP G->T V270F Glycine cleavage system transcriptional activator p5_36 17 patient Glutaminebinding periplasmic protein precursor/DNA protection during starvation glnH/dps SNP G->T intergenic(94/+310) p5_38 17 protein patient p5_38, glnP SNP A->G L148P Glutamine transport system permease protein GlnP 17 p5_34 patient glnQ_1 SNP G->A A166V Glutamine transport ATPbinding protein GlnQ p5_36 17 patient glpD SNP T->C S266P Aerobic glycerol3phosphate dehydrogenase p5_41 17 patient glpE/glpD SNP T->C intergenic(135/55) Thiosulfate sulfurtransferase GlpE/Aerobic glycerol3phosphate dehydrogenase p5_38 17 patient glpQ SNP A->T I34F Glycerophosphoryl diester phosphodiesterase precursor p5_40 17 patient gltB SNP G->A synonymous Glutamate synthase [NADPH] large chain precursor p5_36 17 patient Glutamate/aspartate periplasmicbinding protein precursor/putative glutamine ABC gltI/glnM_2 SNP A->G intergenic(+161/9) p5_41 17 transporter permease protein GlnM patient gltK SNP C->T Q127* Glutamate/aspartate transport system permease protein GltK p5_36 17 patient gor SNP G->A G141S Glutathione reductase p5_36 17 patient gsiA_1 INS 1bp frameshift Glutathione import ATPbinding protein GsiA p5_36 17 patient gudX SNP T->C V71A Glucarate dehydrataserelated protein p5_36 17 patient hemL SNP C->T G222D Glutamate1semialdehyde 21aminomutase p5_36 17 patient hemX INS 24bp no_frameshift Putative uroporphyrinogenIII Cmethyltransferase p5_36 17 patient hsrA SNP A->G I198T putative transport protein HsrA p5_36 17 hupA SNP C->T Q20* DNAbinding protein HUalpha p5_41

137

2. Article n 2

patient SNP A->T Q43H DNAbinding protein HUalpha p5_38 17 SNP C->A A72D DNAbinding protein HUalpha p5_34 patient hupB DEL 9bp no_frameshift DNAbinding protein Hubeta p5_40 17 patient idnD SNP C->T T134M Lidonate 5dehydrogenase (NAD(P)(+)) p5_36 17 patient ilvC DEL 1bp frameshift Ketolacid reductoisomerase p5_40 17 patient ilvG SNP G->T E471* Acetolactate synthase isozyme 2 large subunit p5_36 17 patient p5_32, iscR SNP G->A S47F HTHtype transcriptional regulator IscR 17 p5_35 patient ispH SNP T->C Y134H 4hydroxy3methylbut2enyl diphosphate reductase p5_36 17 patient kojP SNP G->T R187L Kojibiose phosphorylase p5_39 17 patient kpsF SNP C->T T277M Arabinose 5phosphate isomerase KpsF p5_36 17 patient leuO SNP G->T F267L HTHtype transcriptional regulator LeuO p5_34 17 patient livJ/rpoH SNP G->A intergenic(35/+236) Leu/Ile/Valbinding protein precursor/RNA polymerase sigma factor RpoH p5_34 17 patient lolC/ycfT SNP A->G intergenic(223/39) Lipoproteinreleasing system transmembrane protein LolC/Inner membrane protein YcfT p5_40 17 patient lsrF SNP C->A G49C putative aldolase LsrF p5_36 17 patient malP_2 SNP A->G Y630H Maltodextrin phosphorylase p5_36 17 patient malS SNP T->G Y440D Alphaamylase precursor p5_40 17 p5_40, patient p5_38, mbeC-rop DEL 1297bp large_indel mbeC rop 17 p5_36, p5_34 patient mepS_1 SNP G->A P49S Murein DDendopeptidase MepS/Murein LDcarboxypeptidase precursor p5_36 17 patient metG SNP A->T D565V MethioninetRNA ligase p5_40 17 patient mltC/nupG SNP T->C intergenic(+99/102) Membranebound lytic murein transglycosylase C precursor/Nucleoside permease NupG p5_36 17 patient moeA SNP T->C N222S Molybdopterin molybdenumtransferase p5_36 17 patient mreB SNP C->T A125V Rod shapedetermining protein MreB p5_39 17 patient mscK SNP A->G D556G Mechanosensitive channel MscK precursor p5_40 17

138

2. Article n 2 patient nagB_2/nagA INS 1bp intergenic(+43/17) Glucosamine6phosphate deaminase/Nacetylglucosamine6phosphate deacetylase p5_36 17 patient narI SNP C->T P13S Respiratory nitrate reductase 1 gamma chain p5_36 17 patient nlpI SNP G->A synonymous Lipoprotein NlpI precursor p5_36 17 patient nrdA SNP T->C D418G Ribonucleosidediphosphate reductase 1 subunit alpha p5_38 17 patient oppD_2 SNP G->A H134Y Oligopeptide transport ATPbinding protein OppD p5_41 17 patient panB DEL 1bp frameshift 3methyl2oxobutanoate hydroxymethyltransferase p5_36 17 patient patA SNP T->C I464V Putrescine aminotransferase p5_40 17 patient pilT SNP T->C T181A Twitching mobility protein p5_36 17 patient podJ SNP C->T P44S Localization factor PodJL p5_36 17 patient potD SNP T->A I261N Spermidine/putrescinebinding periplasmic protein precursor p5_40 17 patient PROKKA_00067 SNP G->A synonymous 26dihydropseudooxynicotine hydrolase p5_36 17 patient PROKKA_00376 SNP C->T A40T hypothetical protein p5_41 17 patient PROKKA_00749 DEL 1bp frameshift putative kinase inhibitor p5_36 17 patient PROKKA_00934- PROKKA_00934 PROKKA_00935 PROKKA_00936 PROKKA_00937 PROKKA_00938 DEL 5419bp large_indel p5_36 17 PROKKA_00939 PROKKA_00939 PROKKA_00940, PROKKA_02910 gstB_2 aroP_2 PROKKA_02913 fecA_2 mcbR curA PROKKA_00940, yncA ydcZ PROKKA_02919 PROKKA_02920 prr ydcV_3 ydcU potA_3 potF_2 ydcR PROKKA_02910-ydcO, patient hicB_1 hicA PROKKA_02929 yhbU_3 sinR ydcOPROKKA_02910 gstB_2 aroP_2 p5_39, PROKKA_02304, DEL - multi_contig 17 PROKKA_02913 fecA_2 mcbR curA yncA ydcZ PROKKA_02919 PROKKA_02920 prr p5_41 PROKKA_00724- ydcV_3 ydcU potA_3 potF_2 ydcR hicB_1 hicA PROKKA_02929 yhbU_3 sinR ydcO, PROKKA_00725 PROKKA_02304, PROKKA_00724 PROKKA_00725 patient PROKKA_00945/PROKKA_00946 SNP C->T intergenic(423/41) hypothetical protein/hypothetical protein p5_36 17 patient PROKKA_01024/cspC SNP C->T intergenic(+725/100) hypothetical protein/Cold shocklike protein CspC p5_41 17 PROKKA_01172-tsh, -/-, PROKKA_02387- patient PROKKA_01172 tsh, -/-, PROKKA_02387 PROKKA_02388 PROKKA_02389, p5_32, PROKKA_02389, INS - multi_contig 17 PROKKA_00087 PROKKA_00088 PROKKA_00089 PROKKA_00090 p5_35 PROKKA_00087- PROKKA_00090 patient PROKKA_01230 SNP T->C D42G Trypsin p5_36 17 patient PROKKA_01361 SNP A->G synonymous anaerobic glycerol3phosphate dehydrogenase subunit B p5_36 17

139

2. Article n 2

PROKKA_01417 amyE PROKKA_01419 PROKKA_01420 lhgO gabD gabT gabP csiR patient PROKKA_01417-gabT gabP- DEL 18534bp large_indel PROKKA_01426 PROKKA_01427 ygaV ygaP stpA alaE PROKKA_01432 p5_41 17 [nrdF] PROKKA_01433 nrdH PROKKA_01435 nrdE [nrdF] patient PROKKA_01529 SNP C->T G59R hypothetical protein p5_34 17 patient PROKKA_01621 SNP T->G L527R PrkA AAA domain protein p5_36 17 patient PROKKA_01629 SNP A->G T69A hypothetical protein p5_36 17 patient PROKKA_01942 DEL 1bp frameshift PerC transcriptional activator p5_39 17 patient PROKKA_01942-ydeO_2 DEL 1395bp large_indel PROKKA_01942 ydeO_2 p5_41 17 patient PROKKA_01944 SNP T->C S47G hypothetical protein p5_41 17 patient PROKKA_01945- p5_36, DEL 3111bp large_indel PROKKA_01945 PROKKA_01946 17 PROKKA_01946 p5_34 p5_32, patient PROKKA_02107 SNP C->T A131T putative ion channel protein p5_35, 17 p5_41 patient PROKKA_02308 SNP A->G E391G hypothetical protein p5_36 17 patient PROKKA_02375 SNP T->C S300P Transposase IS66 family protein p5_36 17 patient p5_39, PROKKA_02432 SNP A->G S45P hypothetical protein 17 p5_41 patient PROKKA_02482/arsC_2 DEL 1bp intergenic(617/+12) hypothetical protein/Arsenate reductase p5_36 17 patient PROKKA_02837 SNP G->A P1128S hypothetical protein p5_36 17 p5_41, patient PROKKA_02868 PROKKA_02869 PROKKA_02870 PROKKA_02871 PROKKA_02872 PROKKA_02868-mobA_1 DEL 5199bp large_indel p5_38, 17 mobA_1 p5_36 patient PROKKA_02999/dadA DEL 33bp intergenic(15/283) SpoVR family protein/Damino acid dehydrogenase small subunit p5_41 17 patient p5_36, PROKKA_03094 DEL 605bp large_indel PROKKA_03094 17 p5_40 PROKKA_03094, p5_32, patient PROKKA_01945- INS - multi_contig PROKKA_03094, PROKKA_01945 PROKKA_01946 p5_35, 17 PROKKA_01946 p5_39 patient PROKKA_03329/virB_2 SNP A->G intergenic(176/+49) hypothetical protein/Virulence regulon transcriptional activator VirB p5_36 17 patient PROKKA_03457 SNP C->T synonymous molybdopterin biosynthesis protein MoeB p5_36 17 patient PROKKA_03557 SNP T->C *407Q hypothetical protein p5_36 17 patient PROKKA_03576 SNP T->C I112T 5formyltetrahydrofolate cycloligase family protein p5_40 17

140

2. Article n 2 patient PROKKA_03591 DEL 1bp frameshift DNAbinding transcriptional regulator DicC p5_36 17 patient p5_41, PROKKA_03686 DEL 989bp large_indel PROKKA_03686 17 p5_34 patient PROKKA_03726, p5_36, DEL - multi_contig PROKKA_03726, PROKKA_04466 cea PROKKA_04468 cnl 17 PROKKA_04466-cnl p5_34 patient PROKKA_03740- PROKKA_03740 PROKKA_03741 PROKKA_03742 PROKKA_03743 PROKKA_03744 DEL 9577bp large_indel p5_41 17 [PROKKA_03747] repE PROKKA_03746 [PROKKA_03747] patient PROKKA_03743 SNP C->T P57S hypothetical protein p5_36 17 patient PROKKA_03825 SNP G->A A11T hypothetical protein p5_36 17 patient PROKKA_03879- DEL 1441bp large_indel PROKKA_03879 [PROKKA_03880] p5_41 17 [PROKKA_03880] patient p5_32, PROKKA_04001 SNP A->T Q723L tetratricopeptide repeat protein 17 p5_35 patient PROKKA_04116/PROKKA_04117 SNP C->T intergenic(171/+143) hypothetical protein/hypothetical protein p5_38 17 patient PROKKA_04123 SNP G->A Q133* hypothetical protein p5_40 17 patient PROKKA_04134 SNP A->G D582G hypothetical protein p5_36 17 patient SNP A->G Y486H hypothetical protein p5_36 17 PROKKA_04195 patient SNP T->C Q83R hypothetical protein p5_34 17 patient PROKKA_04269 DEL 1407bp large_indel PROKKA_04269 p5_38 17 patient p5_32, PROKKA_04272 SNP G->A A265T coproporphyrinogen III oxidase 17 p5_35 patient PROKKA_04301- p5_39, DEL 2919bp large_indel PROKKA_04301 PROKKA_04302 17 PROKKA_04302 p5_40 patient PROKKA_04376-yafL, yafN_1, PROKKA_04376 yafL, Antitoxin YafN, DNA polymerase IV, Chemotaxis protein LafU, DEL - multi_contig p5_40 17 dinB_1, lafU, flhA_1 Flagellar biosynthesis protein FlhA patient PROKKA_04418 DEL 1bp frameshift hypothetical protein p5_34 17 PROKKA_04453 PROKKA_04454 PROKKA_04455 repB_2 PROKKA_04457 xerD_3 PROKKA_04459 lgrD ompT, iutA iucD iucC iucB iucA PROKKA_04261 crcB_2 PROKKA_04263 eno_2 mntB_1 mntB_2 fhuC_2 PROKKA_04268, virB_1 soj_1, -/-, ssbF PROKKA_00927 PROKKA_00928 PROKKA_00929 PROKKA_00930, 28 genes PROKKA_04453-ompT, iutA- Javascript Disabled: All genes shown.finO PROKKA_03096 PROKKA_03097 traI traD patient PROKKA_04268, virB_1-soj_1, -/- ylpA PROKKA_03101 PROKKA_03102 PROKKA_03103 PROKKA_03104 p5_36, DEL - multi_contig 17 , ssbF-PROKKA_00930, finO- PROKKA_03105 PROKKA_03106 PROKKA_03107 PROKKA_03108 PROKKA_03109 p5_34 PROKKA_03122 PROKKA_03110 PROKKA_03111 PROKKA_03112 PROKKA_03113 PROKKA_03114 PROKKA_03115 traA traY PROKKA_03118 traM PROKKA_03120 PROKKA_03121 PROKKA_03122finO PROKKA_03096 PROKKA_03097 traI traD ylpA PROKKA_03101 PROKKA_03102 PROKKA_03103 PROKKA_03104 PROKKA_03105 PROKKA_03106 PROKKA_03107 PROKKA_03108 PROKKA_03109 PROKKA_03110

141

2. Article n 2

PROKKA_03111 PROKKA_03112 PROKKA_03113 PROKKA_03114 PROKKA_03115 traA traY PROKKA_03118 traM PROKKA_03120 PROKKA_03121 PROKKA_03122 patient PROKKA_04582 SNP G->A R269H Phage major capsid protein P2 family p5_36 17 patient prpB/- INS 5bp intergenic(+137/-) Methylisocitrate lyase/- p5_36 17 patient ptsI_3 SNP A->T M498L Phosphoenolpyruvateprotein phosphotransferase p5_39 17 SNP C->A synonymous Bifunctional protein PutA p5_38 patient putA SNP G->A L633F Bifunctional protein PutA p5_34 17 SNP T->C K471E Bifunctional protein PutA p5_41 patient SNP T->C I320T Sodium/proline symporter p5_38 putP 17 SNP G->A A327T Sodium/proline symporter p5_36 patient qseF_2 SNP A->G S262G Transcriptional regulatory protein QseF p5_36 17 patient rbsR DEL 1bp frameshift Ribose operon repressor p5_41 17 patient rop/- DEL 14bp intergenic(+80/-) Regulatory protein rop/- p5_39 17 patient rpoC SNP A->G D264G DNAdirected RNA polymerase subunit beta' p5_38 17 patient p5_32, rpoS SNP C->A D118Y RNA polymerase sigma factor RpoS 17 p5_35 patient rpsA SNP G->A synonymous 30S ribosomal protein S1 p5_36 17 patient SNP C->T A97T 30S ribosomal protein S8 p5_36 17 rpsH patient SNP T->A I7F 30S ribosomal protein S8 p5_38 17 patient rsmB SNP T->C W310R Ribosomal RNA small subunit methyltransferase B p5_38 17 patient rsmI SNP A->G T201A Ribosomal RNA small subunit methyltransferase I p5_36 17 patient fimH-safA DEL 3557bp large_indel fimH fdhF_1 safA p5_40 17 patient sgrR_2-araA DEL 10596bp large_indel sgrR_2 thiB cysW_1 thiQ yabI araC araB araA p5_40 17 patient slt SNP T->C D549G Soluble lytic murein transglycosylase precursor p5_40 17 patient p5_32, spoT SNP C->T R287H Bifunctional (p)ppGpp synthase/hydrolase SpoT 17 p5_35 patient sspA SNP A->G synonymous Stringent starvation protein A p5_36 17 patient sstT SNP T->A L318Q Serine/threonine transporter SstT p5_38 17 142

2. Article n 2 patient tatB SNP G->A P142S Secindependent protein translocase protein TatB p5_36 17 patient tehA SNP C->T R28H Tellurite resistance protein TehA p5_36 17 23 genestehB_2 tehA PROKKA_00494 rimL PROKKA_00496 mdoD PROKKA_00498 patient tehB_2-[ydbC] DEL 27811bp large_indel catM trg PROKKA_00501 PROKKA_00502 cybB gap aldA PROKKA_00506 hrpB_1 p5_39 17 azoR PROKKA_00509 ytpA cdsA_2 ynbA PROKKA_00513 [ydbC] patient thiB SNP C->G Y27* Thiaminebinding periplasmic protein precursor p5_38 17 patient thiC/rsd SNP A->G intergenic(145/+89) Phosphomethylpyrimidine synthase/Regulator of sigma D p5_36 17 patient traC/PROKKA_01112 DEL 8bp intergenic(+566/+203) DNA primase TraC/hypothetical protein p5_40 17 tynA_1 paaZ paaH paaI paaJ paaK paaX dapH_1, 24 genes Javascript Disabled: All genes shown.tehB_2 tehA PROKKA_00494 rimL PROKKA_00496 mdoD PROKKA_00498 catM trg PROKKA_00501 PROKKA_00502 cybB gap aldA patient tynA_1-dapH_1, tehB_2- PROKKA_00506 hrpB_1 azoR PROKKA_00509 ytpA cdsA_2 ynbA PROKKA_00513 DEL - multi_contig p5_41 17 PROKKA_00515 ydbC PROKKA_00515tehB_2 tehA PROKKA_00494 rimL PROKKA_00496 mdoD PROKKA_00498 catM trg PROKKA_00501 PROKKA_00502 cybB gap aldA PROKKA_00506 hrpB_1 azoR PROKKA_00509 ytpA cdsA_2 ynbA PROKKA_00513 ydbC PROKKA_00515 patient uspC/flhD SNP A->T intergenic(130/649) Universal stress protein C/Flagellar transcriptional regulator FlhD p5_34 17 patient uspG SNP C->T synonymous Universal stress protein G p5_36 17 patient xdhA_1 SNP C->T G635D Xanthine dehydrogenase molybdenumbinding subunit p5_36 17 patient xdhA_2 SNP G->A E606K Xanthine dehydrogenase molybdenumbinding subunit p5_36 17 patient xerC_3 SNP G->T L283I Tyrosine recombinase XerC p5_36 17 patient yafL SNP C->T D105N putative endopeptidase YafL precursor p5_36 17 yafO yafP PROKKA_00063 prfB_1 pepD gpt PROKKA_00067 crl phoE proB proA PROKKA_00072 PROKKA_00073 PROKKA_00074 xdhA_1 hcrB kdhB yagU_1 yagU_2 PROKKA_00080 PROKKA_00081 PROKKA_00082yafO yafP patient yafO-PROKKA_00082, p5_39, DEL - multi_contig PROKKA_00063 prfB_1 pepD gpt PROKKA_00067 crl phoE proB proA 17 PROKKA_04103-matA_2 p5_40 PROKKA_00072 PROKKA_00073 PROKKA_00074 xdhA_1 hcrB kdhB yagU_1 yagU_2 PROKKA_00080 PROKKA_00081 PROKKA_00082, PROKKA_04103 PROKKA_04104 PROKKA_04105 matA_2 patient yahA_1/- DEL 3bpTGC intergenic(505/-) Cyclic diGMP phosphodiesterase YahA/- p5_36 17 patient ybiS SNP C->T synonymous putative LDtranspeptidase YbiS precursor p5_36 17 patient ybjI/ybjJ SNP G->T intergenic(55/+48) Flavin mononucleotide phosphatase YbjI/Inner membrane protein YbjJ p5_36 17 patient p5_32, ybjJ DEL 2bp frameshift Inner membrane protein YbjJ 17 p5_35

143

2. Article n 2 patient p5_32, ydeO_1/betC SNP T->C intergenic(+146/256) HTHtype transcriptional regulator YdeO/Cholinesulfatase 17 p5_35 patient yeaN_2/- DEL 3bpTGC intergenic(+26/-) Inner membrane transport protein YeaN/- p5_36 17 patient yebC SNP G->A V74M putative transcriptional regulatory protein YebC p5_36 17 patient yedQ_1 SNP A->C T41P putative diguanylate cyclase YedQ p5_41 17 patient yejM SNP C->A synonymous Inner membrane protein YejM p5_38 17 patient yfgF_1/- SNP C->A intergenic(+26/-) Cyclic diGMP phosphodiesterase YfgF/- p5_36 17 24 genesyfgF_3 nupC mntH PROKKA_02106 PROKKA_02107 glk fryB_2 fryC patient yfgF_3-[evgS] DEL 27234bp large_indel PROKKA_02111 ypdE ptsI_3 rhaR_2 ypdB ypdA alaC htrB_2 PROKKA_02119 p5_41 17 PROKKA_02120 PROKKA_02121 frc_1 oxc PROKKA_02124 frc_2 [evgS] patient ygcB_1 SNP T->G E172A CRISPRassociated endonuclease/helicase Cas3 p5_38 17 patient ygdH SNP G->A Q94* LOG family protein YgdH p5_36 17 patient ygdR_2/gcvA_2 DEL 8bp intergenic(+152/192) putative lipoprotein YgdR precursor/Glycine cleavage system transcriptional activator p5_38 17 patient yheS_2 SNP T->C Y413C putative ABC transporter ATPbinding protein YheS p5_36 17 patient yhjH/kdgK_1 SNP G->A intergenic(169/63) Cyclic diGMP phosphodiesterase YhjH/2dehydro3deoxygluconokinase p5_40 17 patient yjcH-[nhaK] DEL 4876bp large_indel yjcH actP pipB2 [nhaK] p5_34 17 patient yjjV DEL 1bp frameshift putative deoxyribonuclease YjjV p5_36 17 patient ynfF SNP A->G synonymous putative dimethyl sulfoxide reductase chain YnfF precursor p5_36 17

Table S1: Mutations identified in the isolates from the three patients. For intergenic mutations the positions in regards to the two closest genes is indicated in brackets. When multiple genes are touched, a list of the genes is indicated in the the description column, and genes partially deleted are indicated in square brackets.

144

IV - DISCUSSION

1. E. coli experimental evolution: getting outside the lab

1. E. COLI EXPERIMENTAL EVOLUTION: GETTING OUTSIDE THE

LAB

It has now been several decades since Richard Lenski initiated his long-term experimental evolution study. Yet, these still evolving lineages continue to be a source of new discoveries. They have contributed to a better understanding of adaptation, and revealed several evolutionary processes such as the selection of mutator strains, or the primary selection of mutations in highly pleiotropic genes. The fundamental value of this experiment is therefore doubtless. However, the artificial nature of the strains and environment used forces us to question the relevance of the observed conclusion in the wild: How many of the observations would hold true in the nature, how many would be specific to laboratory conditions? To answer to this question, we conducted two studies aiming at identifying the genomic patterns of adaptation in E. coli during its evolution in both, either a commensal environment or during an infection. Applying this approach to E. coli commensal and pathogenic environment is of medical importance as the genetic determinants involved in each of these environments are not fully known.

2. EVOLUTION IN COMMENSAL ENVIRONMENT

In our first study, we used mice that were artificially colonized by the streptomycin resistant E. coli strain 536. Mice were bred for one year under constant administration of streptomycin in the water to maintain the strain 536. A comparison of the evolved genome to that of the ancestor strain, revealed clear signs of adaptation. Indeed convergence could be found at both the gene and functional level, in the form of independent mutations being recovered in different lineages. Two main functional roles clearly emerged from the mutated genes: (1) genes implicated in the 146

3. Adaptation during acute infections metabolism of galactonate, revealing optimization of carbon source utilization and (2) genes implicated in the maturation of ribosomal RNA, suggesting a stronger adaptation to streptomycin. Interestingly, the observed substitution rate was lower than the ones observed in other experimental studies, be it in artificial media or in axenic mice. Furthermore, if we rule out the mutations in the ribosomal RNA maturation genes that are directly linked to streptomycin selective pressure, we found no highly pleiotropic genes among the targets of adaptation. These two observations suggest that adaptation in the mice gut is less active than in laboratory experiments, indicating that the strain we used to initiate the experiment was more adapted to the mouse gut. This is not surprising as natural isolates have experienced recently some gut environment. It suggests nevertheless that some of the patterns of adaptation found in experimental system may be specific to strong maladaptation and may have limited relevance in environment commonly encountered. Yet, experimental evolution in which one strain is moved to a completely new environment could still be representative of some important natural situations for E. coli: the colonization of extraintestinal habitat in humans?

3. ADAPTATION DURING ACUTE INFECTIONS

Colonization of new environments is not rare for E. coli. Indeed, as described previously, certain strains can produce extra-intestinal infections that result most often from a strain living in the commensal habitat that after passing one or more barriers is exposed to a completely new niche. Therefore, we focused our second series of analysis on human infectious produced by E. coli. We analyzed samples of E. coli coming from three different types of infections, a peritonitis, a pyelonephritis and a meningitis, that were shown to come from a single E. coli clone that eventually diversified. All of the infection we chosen had one mutator clone among the 7 to 8 strain sampled. This project prove to be particularly challenging because we did not have access to the ancestral strain and given that E. coli genomes are very variable, we could not arbitrarily pick a sequenced

147

3. Adaptation during acute infections genome as a reference. To resolve this problem we tried different approaches, and finally decided to use one of the isolates from one patient as a reference. All the isolates from the infection were then compared to that reference in order to identify their genetic specificities and to deduce the most probable evolutionary history of the isolates. When trying to see how the different isolates were differentiated from one another, we observed once more strong signs of convergence, with several genes being mutated independently in each patient. We also confirmed the presence of one mutator isolate for each patient that had accumulated a high number of mutations due to a mutation in MMR system. This observation is, to our knowledge the first ever observation of the emerging process of a mutator strain in a wild environment. Mutator strains are usually considered as being a footprint of adaption, as in order to spread to high frequency, the mutator allele relies on the production of beneficial mutations. However, it was striking to observe that 2 out of our 3 mutator phenotypes were due to a large deletion that would delete the MMR genes but also other genes for which convergence was found. This raised a question: where the mutator alleles initially selected for their mutator phenotype, or as a byproduct of the selection acting on the co-deleted gene? Whatever the selective pressure involved in their early selection, mutator alleles were later associated with a large fraction of the diversity present in the populations and could therefore contribute to later adaptations. An alternative sign of adaptation was the presence of convergence in genes like rpoS, which encodes for the sigma factor involved in general stress response. Finding mutations in highly pleiotropic regulatory genes is reminiscent of the first step of adaptation of in vitro experimental evolution, in which mutations affecting such high effect genes, allow a quick adaptation. In the end this study re-enforced our convictions that the observations made by Richard Lenski and his colleges have meaningful effects on the evolution of E. coli in the wild, but more in infections than in commensal conditions. During an infection, E. coli is often found alone and has therefore a large number of option for adaptation, while in the gut it may be much more constrained by the rest of the microbiota.

148

4. Perspectives:

4. PERSPECTIVES:

Though we were able to identify convergence during adaptation to the gut of streptomycin treated mice, an undesired bias appeared, the selective pressure exerted by the streptomycin leaved a footprint in the genome. Despite an extremely high initial level of resistance, strains evolve to cope with the sub-inhibitory antibiotic concentration they were facing. Furthermore, even metabolic genes involved in convergence, affecting for instance the metabolism of galactonate, could be selected because streptomycin. As it affects the microbiota, streptomycin may favor the proliferation of certain species over others and therefore change the metabolic composition of the gut. Stenotrophomonas are for instance galactonate producer that are naturally resistant to streptomycin, and have been found in mice’s gut (though we could not detect them directly in the experiment we did) (Pedron et al., 2012). Streptomycin treatment could favor these species, and as a result, galactonate level would increase and lead to the selection of mutations in the dgo operon of E. coli. To avoid artificial modification of the environment due to streptomycin, a mostly streptomycin-free protocol was developed. Pregnant mice were treated with streptomycin and inoculated by strain 536 E. coli. Streptomycin treatment was ceased at birth, and infants inherited from their maternal microbiota. The infants were then separated from the mother and their E. coli followed through time. This study is now over and the genomes, currently being sequenced, will be analyzed as I have done.

To avoid the trouble of streptomycin, an alternative is to study E. coli in its natural environment the human gut. In that respect, we are currently studying 24 isolates gathered from the gut of one person at 3 different time points. The analysis are still running but were able to identify mutations in the putA, rbsR as well as rpoS gene, three genes that showed sign of convergence in the infection we studied. This result can be really startling as we associated convergence in those mutations as traces of adaptation during the infection. From this observation, we can build two hypotheses: (1) the mutations in these genes were selected in the gut or (2) this

149

4. Perspectives: type of mutation is a typical case of a mutation selected for commensalism that also gives an advantage during an infectious event. Given the high number of independent mutations that we observed in the gene putA, the first hypothesis would then mean that the infections we studied were not caused by a single bacteria but rather a subpopulation of closely related isolates with some preexisting diversity in putA. A good way to test for this hypothesis would be to have access to both the infectious samples but also the feces of the concerned patient, and to do a population studies in order to quantify the diversity of putA present in the gut community of E. coli. Yet, it is difficult to have both samples. An alternative way to test for the adaptive impact of the observed mutations would be to test their effect in conditions of commensalism as we could not find a clear effect of putA on virulence. It would indeed be interesting to see if such mutations could provide a fitness gain to the strains.

As a conclusion, I was able to use the NGS tools to show signs of adaptation of E. coli to its environment in natural conditions, and to show that the observations made by laboratory experimental evolution, even though slightly biased by nature, could hold true in the natural environment of the bacteria. During this analysis the genomic tools proved to be a really helpful tools, allowing for the comparative analysis of several isolates. However, as mentioned, the task was daunting because of the need of manual curation of the mutation detection outputs. More modern sequencing techniques like PacBio and Nanopore with their much longer reads would help produce a better reference for each clone and presumably accelerate the analysis of the data. Along with those new technologies, there is great hopes that such analysis will get easier in the future.

150

Supplementary studies

SUPPLEMENTARY STUDIES

In addition to the two studies previously described, I also contributed to other studies for which I am credited as a secondary author:

nhaAR operon and the nhaA phylogeny

Indeed, I was involved in Lescat et al., 2014, a study aiming to identify the differences in core genes between B2 and non B2 E. coli in a set of 128 genomes. As the study identified strong differences in the nhaAR operon, amongst the strain studied, I did reconstruct a phylogeny bases on the nhaA gene of 121 strains (fig 1). The study then showed a strong implication of the nhaA gene in virulence that wasn't known before hand, whereas no effect on the commensalism of the strain was observed.

151

Supplementary studies

Figure 1 MLST and nhaA phylogenetic trees for 121 strains of E. coli. The trees were reconstructed from (A) multi-locus sequence typing of 8 partial housekeeping genes from the Pasteur scheme representing the species phylogeny and (B) from the nhaA sequences using PHYML (Guindon et al., 2010). Bootstraps values are indicated. Strains studied and belonging to phylogenetic group B2 (red boxed) are indicated. Branches separating the B2 phylogenetic group strains from the other group strains are indicated in blue.

152

Supplementary studies

The CC87 and its specific genes

I was also involved in Skurnik et al., 2016, a study in which we analyzed 403 Escherichia coli / Escherichia clades isolated from human and animal populations that had differential contacts with one another. The study identified of a specific B1 clonal complex, CC87, which was responsible for the increased antimicrobial resistance observed in strains coming from environment with a high human mediated antimicrobial pressure. Indeed the strains appeared to have a high capacity of acquiring and disseminating antimicrobials resistance genes. In this study I helped with the genomic analysis of the CC87 strains. I analyzed the presence/absence patterns of the genes of the strains using a mapping approach based on a database of 128 E. coli genome, and compared it to the pattern of the non CC87 B1 strains (fig 2). This allowed for the identification of 33 genes differentially present in the two groups, 20 of which were regrouped in 3 synteny blocks, two of these being specific to the CC87, while the third one was absent of the CC87. This allowed for the identification of candidates genes that could be responsible for the particular phenotypes of the CC87.

153

Supplementary studies

Fig 2. Specific genic presence/absence pattern of CC87 among B1 strains. The phylogeny of the B1 strains rooted on the phylogroup C was computed with 500 core genes. Gene presence (green) /absence (red) is illustrated for the set of genes that presented a contrasted signature between CC87 strains and non CC87-B1. Among this set, four groups of linked genes are represented with the following color code: blue for metabolic genes; orange for transporters; pink for genes with alternative functions and white for genes of unknown functions.

154

RESUME EN FRANCAIS:

RESUME EN FRANCAIS:

Si Escherichia coli est généralement considérée comme espèce commensale du tube digestif, particulièrement bien caractérisée, elle s’avère en réalité plus versatile. En effet, elle peut aussi jouer un rôle de pathogène tant intra-intestinal, causant parfois des diarrhées sanglantes mortelles, que extra-intestinal, en étant la principale cause d’infection urinaire (Kaper et al., 2004). Elle est responsable d’environ un million de morts par an, faisant de la compréhension de sa diversification un problème de santé publique (Kosek et al., 2003; Kotloff et al., 1999; Russo and Johnson, 2003). Si l’espèce est divisée en sous-groupes phylogénétiques dont la prévalence varie en fonction des pathologies (Escobar-Paramo, 2004), les spécificités génétiques et physiologiques responsables du succès des différents groupes dans ces différents environnements restent à définir.

Pour cette raison de nombreuses études se sont penchées sur le génome de l’espèce mettant notamment à jour l’incroyable plasticité de ce dernier. En, effet si l’on peut considérer que le génome moyen d’E. coli est composé de 5000 gènes, c’est près de 15 000 gènes différents qui ont été identifiés dans les différentes souches séquencée de l’espèce. Ces gènes formant de pan génome de l’espèce, c’est-à-dire l’ensemble du répertoire de gêne pouvant être présent dans une souche. Parmi ces gènes, environ 2000 gènes sont retrouvés parmi toutes les souches, composant le core génome de l’espèce (Touchon et al., 2009). Cette observation a une implication très importante pour l’analyse des génomes, en effet, lorsque une nouvelle souche de E. coli est séquencée, on ne peut être assuré de la présence que du core génome le reste étant variable d’une souche à une autre. En étudiant le contenu en gènes des souches pathogènes de E. coli, plusieurs gènes liées aux infections soit extra-intestinales soit intra-intestinales ont été identifiés et nommés facteurs de virulence. Il semble cependant que les facteurs de virulence impliqués dans les pathologies extra- intestinales soient en fait sélectionnés lors de la colonisation du tube digestif. La pathogénicité extra-intestinale de E. coli ne serait en conséquence qu’un effet secondaire du commensalisme (Le Gall et al., 2007). Il y a donc un grand intérêt à comprendre et analyser les mécanismes responsables de la diversification d’E. coli à la fois tant dans son environnement commensal qu’au cours d’infections extraintestinales. 155

RESUME EN FRANCAIS:

Afin de répondre à ces questions, mais aussi à des questions plus globales en termes d’évolution, des expériences d’évolution expérimentales utilisant E. coli ont été mises en place depuis déjà plusieurs décennies. La plus célèbre, menée par Richard Lenski, est aujourd’hui encore en cours depuis son lancement en 1988 (Lenski et al., 1991). Les douze lignées d’E. coli propagées quotidiennement en parallèle ont maintenant évolué pendant plus de 60 000 générations au laboratoire. Ce type d’études a permis de mettre en évidence les bases de l’adaptation, et d’étudier plusieurs phénomènes évolutifs. Il y a par exemple été observé l’apparition de souches mutatrices, ayant un taux de mutation fortement augmenté (Sniegowski et al., 1997). La sélection d’allèle augmentant le taux de mutation est un signe fort de l’adaptation : pour qu’une tel allèle envahisse la population il faut qu’il produise des mutations bénéfiques et se propage dans la population grâce à l’avantage sélectif produit par celles-ci. Une autre observation fréquente en évolution expérimentale est le recrutement de mutations dans des gènes hautement pléiotropiques au cours de premières phases de l’adaptation. Ces mutations permettent une adaptation rapide et grossière (« quick and dirty ») ayant un impact globale très fort sur la cellule, alors que dans un second temps se sont des gènes plus secondaire qui vont être sélectionné, représentant une étape d’ajustement (Hindré et al., n.d.).

Ces observations sont elles pertinentes pour l’adaptation de E. coli à son environnement naturel ? Jusqu’à présent les rares études s’intéressante à l’adaptation dans des milieux plus naturels, suggèrent une forte similarité avec les observations faites in vitro. Cependant l’utilisation de souche de laboratoire ou de souris sans flore, ou les deux, limitent fortement la pertinence de ces observations. Durant ma thèse, j’ai visé à utiliser les techniques de séquençage haut débit moderne couplés à des méthodes bio-informatique afin d’analyser les résultats d’expériences visant à identifier la diversification d’E. coli tout en restant le plus proche possible des conditions naturelles de l’espèce. Se faisant nous avons cherché à voir si les résultats des expériences d’évolution expérimentales avaient une signification pour des souches en condition naturelles. Pour ce faire nous avons réalisé deux études :

1- Une expérience d’évolution d’un isolat naturel d’E. coli naturelle dans le tube digestif de la souris afin d’étudier l’évolution de l’espèce en condition de commensalisme.

156

RESUME EN FRANCAIS:

2- Une analyse d’échantillons de patient atteint d’infection aigue de E. coli, afin de mesurer et caractériser la diversité observée dans les infections.

1 Evolution expérimentale chez la souris :

Nous avons dans un premier temps réalisé un protocole d’évolution expérimentale dans le tube digestif de souris en utilisant un isolat naturel, E. coli 536. La souche 536 étant résistante à la streptomycine, un traitement antibiotique a été administré aux souris tout au long de l’expérience afin de vider leurs tubes digestifs des enterobactéries, laissant place libre à notre souche qui a ainsi pu se maintenir. Au bout d’un an, les souches ont finalement été prélevées, séquencées, puis j’ai comparé leurs génomes à celui de la souche ancestrale afin d’identifier de possibles mutations s’étant fixées au cours de l’expérience. J’ai ainsi pu observer 95 mutations dans les 15 lignées, avec un taux de mutations de 1.85 10-10 substitutions par base par génération pour les mutations ponctuelles en considérant 18 générations par jours, ce qui semble plutôt bas en comparant aux résultats obtenus en évolution expérimentale. Des signes de convergence génétique ont aussi été identifiés, au niveau de la mutation, du gène, de l’opéron et au niveau fonctionnel. Ces convergences témoignent des effets de la sélection naturelle et de l’adaptation de nos souches à leur nouvel environnement. Notamment l’opéron dgo, impliqué dans la dégradation du galactonate, semble particulièrement touché, les gènes dgoR, dgoT et dgoD étant mutés indépendamment à plusieurs reprises dans les différentes lignées. Ces mutations révèlent une adaptation métabolique de la souche à son nouvel environnement et la recherche de nouvelles sources de carbone. Pareillement les gènes rluD et gidB, tous deux impliqués dans la maturation des ARN ribosomaux, apparaissent mutés indépendamment de nombreuses fois entre lignés avec 13 des 15 lignée touchées dans l’un ou l’autre des gènes. La streptomycine étant un antibiotique touchant l’étape de la traduction, on peut facilement en déduire que les mutations dans ces gènes ont été sélectionnées à cause de la pression de sélection qu’elle exerce sur les souches. Afin de tester cette hypothèse,

157

RESUME EN FRANCAIS: des souches mutées dans l’un, l’autre ou les deux gènes ont été cultivées en milieu LB et LB plus streptomycine. Il a été possible de montrer que les deux mutations apportent un avantage en milieux LB plus streptomycine, mais que l’association des deux mutations ne semble pas permettre une meilleur résistance.

En définitive, des signes d’adaptation ont été trouvés, sous la forme de convergences évolutives. Cependant aucune souche mutatrice n’a été retrouvée, et si des gènes fortement pléiotropiques comme gidB et rluD ont été retrouvés mutés, on sait qu’ils ont été sélectionnés plus pour la résistance à la streptomycine que pour leur caractère pléiotropique. Cette observation, mis en commun avec le taux de substitutions plutôt bas, suggère que les souches sont dans une phase d’adaptation secondaire, la souche étant déjà bien adaptée à son environnement. Si tel est le cas, il est normal que les mutations observées n’aient pas de forts impacts pléiotropiques, mais soient impliqués dans des taches plus secondaires comme la recherche de nouvelles sources de carbone. Ainsi si une partie des observations faites en évolution expérimentale semblent s’appliquer, il est clair que les souches utilisées dans cette expérience étaient nettement mieux adaptées initialement à leurs environnements que les souches utilisée dans l’évolution de Richard Lenski. Cependant la streptomycine ayant produit un biais dans nos mutations, il est nécessaire de réaliser un protocole avec un environnement plus naturel et moins perturbé pour les souches d’E. coli. Un protocole a été mise en place dans lequel une mère enceinte est traitée à la streptomycine et inoculée avec E. coli 536. Apres la naissance, les souriceaux récupérant le microbiote de leur mère, la souche implantée peut se maintenir sans streptomycine. L’expérience est maintenant finie et les résultats sont en phase d’analyse.

2 Evolution lors d’infection aigüe

Afin de me rapprocher toujours plus d’un environnement naturel d’E. coli et plus particulièrement de l’homme, je me suis ensuite intéressé à des échantillons provenant de patients souffrant d’infections extra-intestinales causées par E. coli, afin de savoir si la sélection pouvait 158

RESUME EN FRANCAIS: jouer lors des évènements court d’infection aigue. Pour répondre à cette question, nous avons choisi plusieurs échantillons d’infections précédemment analysée phénotypiquement (Levert et al., 2010): une péritonite (patient 17), une pyélonéphrite (patient 13), et un cas rare de méningite chez l’adulte (patient 11). Pour chaque patient, nous avons séquencé de 7 à 8 isolats et utilisé ces échantillons afin d’identifier quelles mutations ont été sélectionnées au sein de ce nouvel environnement. Pour ce type d’analyse, la plasticité du génome de E. coli peut poser problème. En effet nous n’avons pas accès à l’ancêtre infectieux de nos souches, et le génome d’E. coli étant tellement plastique nous ne pouvons pas être sûr de la composition du génome. Il nous est par conséquent impossible d’utiliser un génome de référence arbitraire sans perdre de l’information. Pour palier à ce problème, nous avons utilisé une méthode mixant assemblage de novo, annotation puis « SNP calling » pour identifier la diversification. Dans ce « workflow », nous prenons un des isolats pour chacun de nos patients. Cet isolats est alors assemblé à l’aide de SPades 3.6.2 avant d’être annoté à l’aide de Prokka 1.11. Ce génome est alors utilisé comme référence dans une approche de « SNP calling » et de mapping à l’aide de breseq 0.27.2 en le comparant a tous les isolats du même patient. Il devient alors facile en compilant les résultats de tous les isolats d’un même patient de reconstruire par maximum de vraisemblance comment chaque souche se différencie de l’ancêtre commun. En utilisant cette méthodologie, nous avons été capable d’identifier 223 mutation indépendantes chez le patient 17, 67 chez le patient 13, et 38 chez le patient 11. La première observation a été la mise en évidence de mutations dans mutS chez les patients 17 et 11, et mutL chez le patient 13, ainsi qu’un grand nombre de mutations dans ces clones mutateurs. Cette observation est très intéressante car il s’agit de la première fois à notre connaissance que l’émergence d’une souche mutatrice en milieu naturelle ait pu être documentée. De plus des convergences ont aussi pu être identifié avec notamment certains gènes retrouvés mutés indépendamment à plusieurs reprises entre les isolats. Ainsi des signes de convergence ont pu être retrouvés pour les gènes putA et hupA chez les patients 13 et 17. Ces deux gènes, montrent de plus de forts signes d’association, une mutation dans hupA n’étant jamais retrouvée sans qu’il n’y ait une mutation dans putA. Des signes de convergence ont aussi été retrouvés pour le gène fimH, codant pour l’adhésine du fimbriae de type 1 qui est connus comme un gène jouant un rôle important dans la pathoadaptation des souches responsables d’infections extraintestinales. En effet 159

RESUME EN FRANCAIS: fimH a été retrouvé muté chez les trois patients et présentait même des signes de convergence au sein du patient 11. Le gène rpoS, codant pour un facteur sigma impliqué dans la réponse au stress, et représentant un gène fortement pléiotropique, a aussi été retrouvé muté deux fois dans le patient 11, deux fois dans le patient 13 et une fois dans le patient 17. En parallèle, un modèle d’analyse de virulence chez la souris a été mis en place, dans lequel les souris sont injectées avec chacun des isolats et la mortalité des souris est quantifiée. Cette analyse a permis de révéler une grande diversité dans la capacité à tuer la souris au seins des isolats du patient 13 et 17. Cependant, en cherchant une association entre les gènes mutés et le caractère tuer/non-tueur des isolats aucune corrélation n’a pu être trouvé indiquant la complexité de ce caractère. En définitive, cette analyse nous a permis de montrer de fort signes d’adaptation pendant l’infection : de la convergence a été observée, tout comme la présence de souches mutatrice et de mutations dans des gènes hautement pleiotropique similairement a ce qui a été décrit dans les expériences d’évolution expérimentale. Ainsi dans le cas d’infection aigue l’adaptation des E. coli semble suivre celle des souches de laboratoire. Cela peut s’expliquer par le brusque changement de milieu que représente pour la bactérie la colonisation des habitats extra-intestinaux.

Pour conclure, les résultats des analyses que j’ai conduis tendent à montrer que les grandes règles de l’adaptation décrites à la suite d’évolution expérimentales réalisées au laboratoire sont en grande partie pertinente pour l’adaptation d’E. coli dans les environnement naturels. Ces règles doivent cependant être légèrement modulées dans le cadre de l’adaptation à la niche commensale car le niveau d’adaptation à cette niche semble être initialement élevé et donc la première phase de l’adaptation moins similaire à celles observées au laboratoire.

160

Bibliography

BIBLIOGRAPHY

Almeida, A., Villain, A., Joubrel, C., Touak, G., Sauvage, E., Rosinski-Chupin, I., Poyart, C., Glaser, P., 2015. Whole-Genome Comparison Uncovers Genomic Mutations between Group B Streptococci Sampled from Infected Newborns and Their Mothers. J. Bacteriol. 197, 3354–3366. doi:10.1128/JB.00429-15 Anderson, S., 1981. Shotgun DNA sequencing using cloned DNase I-generated fragments. Nucleic Acids Res. 9, 3015–3027. Andersson, D.I., Hughes, D., 2010. Antibiotic resistance and its cost: is it possible to reverse resistance? Nat. Rev. Microbiol. 8, 260–271. doi:10.1038/nrmicro2319 Applebee, M.K., Joyce, A.R., Conrad, T.M., Pettigrew, D.W., Palsson, B.O., 2011. Functional and Metabolic Effects of Adaptive Glycerol Kinase (GLPK) Mutants in Escherichia coli. J. Biol. Chem. 286, 23150–23159. doi:10.1074/jbc.M110.195305 Arumugam, M., Raes, J., Pelletier, E., Le Paslier, D., Yamada, T., Mende, D.R., Fernandes, G.R., Tap, J., Bruls, T., Batto, J.-M., Bertalan, M., Borruel, N., Casellas, F., Fernandez, L., Gautier, L., Hansen, T., Hattori, M., Hayashi, T., Kleerebezem, M., Kurokawa, K., Leclerc, M., Levenez, F., Manichanh, C., Nielsen, H.B., Nielsen, T., Pons, N., Poulain, J., Qin, J., Sicheritz-Ponten, T., Tims, S., Torrents, D., Ugarte, E., Zoetendal, E.G., Wang, J., Guarner, F., Pedersen, O., de Vos, W.M., Brunak, S., Doré, J., Weissenbach, J., Ehrlich, S.D., Bork, P., 2011. Enterotypes of the human gut microbiome. Nature 473, 174–180. doi:10.1038/nature09944 Baba, T., Ara, T., Hasegawa, M., Takai, Y., Okumura, Y., Baba, M., Datsenko, K.A., Tomita, M., Wanner, B.L., Mori, H., 2006. Construction of Escherichia coli K-12 in-frame, single- gene knockout mutants: the Keio collection. Mol. Syst. Biol. 2, 2006.0008. doi:10.1038/msb4100050 Bankevich, A., Nurk, S., Antipov, D., Gurevich, A.A., Dvorkin, M., Kulikov, A.S., Lesin, V.M., Nikolenko, S.I., Pham, S., Prjibelski, A.D., Pyshkin, A.V., Sirotkin, A.V., Vyahhi, N., Tesler, G., Alekseyev, M.A., Pevzner, P.A., 2012. SPAdes: A New Genome Assembly Algorithm and Its Applications to Single-Cell Sequencing. J. Comput. Biol. 19, 455–477. doi:10.1089/cmb.2012.0021 Barrick, J.E., Colburn, G., Deatherage, D.E., Traverse, C.C., Strand, M.D., Borges, J.J., Knoester, D.B., Reba, A., Meyer, A.G., 2014. Identifying structural variation in haploid microbial genomes from short-read resequencing data using breseq. Bmc Genomics 15, 1039. Barrick, J.E., Lenski, R.E., 2013. Genome dynamics during experimental evolution. Nat. Rev. Genet. 14, 827–839. doi:10.1038/nrg3564 Barrick, J.E., Yu, D.S., Yoon, S.H., Jeong, H., Oh, T.K., Schneider, D., Lenski, R.E., Kim, J.F., 2009. Genome evolution and adaptation in a long-term experiment with Escherichia coli. Nature 461, 1243–1247. doi:10.1038/nature08480 Barroso-Batista, J., Sousa, A., Lourenço, M., Bergman, M.-L., Sobral, D., Demengeot, J., Xavier, K.B., Gordo, I., 2014. The First Steps of Adaptation of Escherichia coli to the Gut Are

161

Bibliography

Dominated by Soft Sweeps. PLoS Genet. 10, e1004182. doi:10.1371/journal.pgen.1004182 Batzer, M.A., Deininger, P.L., 2002. ALU REPEATS AND HUMAN GENOMIC DIVERSITY. Nat. Rev. Genet. 3, 370–379. doi:10.1038/nrg798 Bentley, D.R., Balasubramanian, S., Swerdlow, H.P., Smith, G.P., Milton, J., Brown, C.G., Hall, K.P., Evers, D.J., Barnes, C.L., Bignell, H.R., Boutell, J.M., Bryant, J., Carter, R.J., Keira Cheetham, R., Cox, A.J., Ellis, D.J., Flatbush, M.R., Gormley, N.A., Humphray, S.J., Irving, L.J., Karbelashvili, M.S., Kirk, S.M., Li, H., Liu, X., Maisinger, K.S., Murray, L.J., Obradovic, B., Ost, T., Parkinson, M.L., Pratt, M.R., Rasolonjatovo, I.M.J., Reed, M.T., Rigatti, R., Rodighiero, C., Ross, M.T., Sabot, A., Sankar, S.V., Scally, A., Schroth, G.P., Smith, M.E., Smith, V.P., Spiridou, A., Torrance, P.E., Tzonev, S.S., Vermaas, E.H., Walter, K., Wu, X., Zhang, L., Alam, M.D., Anastasi, C., Aniebo, I.C., Bailey, D.M.D., Bancarz, I.R., Banerjee, S., Barbour, S.G., Baybayan, P.A., Benoit, V.A., Benson, K.F., Bevis, C., Black, P.J., Boodhun, A., Brennan, J.S., Bridgham, J.A., Brown, R.C., Brown, A.A., Buermann, D.H., Bundu, A.A., Burrows, J.C., Carter, N.P., Castillo, N., Chiara E. Catenazzi, M., Chang, S., Neil Cooley, R., Crake, N.R., Dada, O.O., Diakoumakos, K.D., Dominguez-Fernandez, B., Earnshaw, D.J., Egbujor, U.C., Elmore, D.W., Etchin, S.S., Ewan, M.R., Fedurco, M., Fraser, L.J., Fuentes Fajardo, K.V., Scott Furey, W., George, D., Gietzen, K.J., Goddard, C.P., Golda, G.S., Granieri, P.A., Green, D.E., Gustafson, D.L., Hansen, N.F., Harnish, K., Haudenschild, C.D., Heyer, N.I., Hims, M.M., Ho, J.T., Horgan, A.M., Hoschler, K., Hurwitz, S., Ivanov, D.V., Johnson, M.Q., James, T., Huw Jones, T.A., Kang, G.-D., Kerelska, T.H., Kersey, A.D., Khrebtukova, I., Kindwall, A.P., Kingsbury, Z., Kokko-Gonzales, P.I., Kumar, A., Laurent, M.A., Lawley, C.T., Lee, S.E., Lee, X., Liao, A.K., Loch, J.A., Lok, M., Luo, S., Mammen, R.M., Martin, J.W., McCauley, P.G., McNitt, P., Mehta, P., Moon, K.W., Mullens, J.W., Newington, T., Ning, Z., Ling Ng, B., Novo, S.M., O’Neill, M.J., Osborne, M.A., Osnowski, A., Ostadan, O., Paraschos, L.L., Pickering, L., Pike, A.C., Pike, A.C., Chris Pinkard, D., Pliskin, D.P., Podhasky, J., Quijano, V.J., Raczy, C., Rae, V.H., Rawlings, S.R., Chiva Rodriguez, A., Roe, P.M., Rogers, J., Rogert Bacigalupo, M.C., Romanov, N., Romieu, A., Roth, R.K., Rourke, N.J., Ruediger, S.T., Rusman, E., Sanches-Kuiper, R.M., Schenker, M.R., Seoane, J.M., Shaw, R.J., Shiver, M.K., Short, S.W., Sizto, N.L., Sluis, J.P., Smith, M.A., Ernest Sohna Sohna, J., Spence, E.J., Stevens, K., Sutton, N., Szajkowski, L., Tregidgo, C.L., Turcatti, G., vandeVondele, S., Verhovsky, Y., Virk, S.M., Wakelin, S., Walcott, G.C., Wang, J., Worsley, G.J., Yan, J., Yau, L., Zuerlein, M., Rogers, J., Mullikin, J.C., Hurles, M.E., McCooke, N.J., West, J.S., Oaks, F.L., Lundberg, P.L., Klenerman, D., Durbin, R., Smith, A.J., 2008. Accurate whole human genome sequencing using reversible terminator chemistry. Nature 456, 53–59. doi:10.1038/nature07517 Bentley, R., Meganathan, R., 1982. Biosynthesis of vitamin K (menaquinone) in bacteria. Microbiol. Rev. 46, 241. Benzer, S., 1959. On the topology of the genetic fine structure. Proc. Natl. Acad. Sci. 45, 1607– 1620.

162

Bibliography

Berger, H., Hacker, J., Juarez, A., Hughes, C., Goebel, W., 1982. Cloning of the chromosomal determinants encoding hemolysin production and mannose-resistant hemagglutination in Escherichia coli. J. Bacteriol. 152, 1241–1247. Bettelheim, K.A., Taylor, J., 1969. A study of Escherichia coli isolated from chronic urinary infection. J. Med. Microbiol. 2, 225–236. Bingen, E., Picard, B., Brahimi, N., Mathy, S., Desjardins, P., Elion, J., Denamur, E., 1998. Phylogenetic Analysis of Escherichia coli Strains Causing Neonatal Meningitis Suggests Horizontal Gene Transfer from a Predominant Pool of Highly Virulent B2 Group Strains. J. Infect. Dis. 177, 642–650. doi:10.1086/514217 Blattner, F.R., Plunkett, G., Bloch, C.A., Perna, N.T., Burland, V., Riley, M., Collado-Vides, J., Glasner, J.D., Rode, C.K., Mayhew, G.F., Gregor, J., Davis, N.W., Kirkpatrick, H.A., Goeden, M.A., Rose, D.J., Mau, B., Shao, Y., n.d. The complete genome sequence of Escherichia coli K-12. Science 277, 1453–1462. Bleibtreu, A., Clermont, O., Darlu, P., Glodt, J., Branger, C., Picard, B., Denamur, E., 2014. The rpoS Gene Is Predominantly Inactivated during Laboratory Storage and Undergoes Source-Sink Evolution in Escherichia coli Species. J. Bacteriol. 196, 4276–4284. doi:10.1128/JB.01972-14 Bleibtreu, A., Gros, P.-A., Laouénan, C., Clermont, O., Le Nagard, H., Picard, B., Tenaillon, O., Denamur, E., 2013. Fitness, Stress Resistance, and Extraintestinal Virulence in Escherichia coli. Infect. Immun. 81, 2733–2742. doi:10.1128/IAI.01329-12 Blount, Z.D., Barrick, J.E., Davidson, C.J., Lenski, R.E., 2012. Genomic analysis of a key innovation in an experimental Escherichia coli population. Nature 489, 513–518. doi:10.1038/nature11514 Blount, Z.D., Borland, C.Z., Lenski, R.E., 2008. Historical contingency and the evolution of a key innovation in an experimental population of Escherichia coli. Proc. Natl. Acad. Sci. 105, 7899–7906. Bochner, B.R., Huang, H.-C., Schieven, G.L., Ames, B.N., 1980. Positive selection for loss of tetracycline resistance. J. Bacteriol. 143, 926–933. Boyd, E.F., Brüssow, H., 2002. Common themes among bacteriophage-encoded virulence factors and diversity among the bacteriophages involved. TRENDS Microbiol. 10, 521–529. Brechtel, E., Huwig, A., Giffhorn, F., 2002. l-Glucitol Catabolism in Stenotrophomonas maltophilia Ac. Appl. Environ. Microbiol. 68, 582–587. doi:10.1128/AEM.68.2.582- 587.2002 Brzuszkiewicz, E., Brüggemann, H., Liesegang, H., Emmerth, M., Olschläger, T., Nagy, G., Albermann, K., Wagner, C., Buchrieser, C., Emody, L., Gottschalk, G., Hacker, J., Dobrindt, U., 2006. How to become a uropathogen: comparative genomic analysis of extraintestinal pathogenic Escherichia coli strains. Proc. Natl. Acad. Sci. U. S. A. 103, 12879–12884. doi:10.1073/pnas.0603038103 Buchholz, U., Bernard, H., Werber, D., Böhmer, M.M., Remschmidt, C., Wilking, H., Deleré, Y., an der Heiden, M., Adlhoch, C., Dreesman, J., others, 2011. German outbreak of Escherichia coli O104: H4 associated with sprouts. N. Engl. J. Med. 365, 1763–1770. Chao, L., Cox, E.C., 1983. Competition Between High and Low Mutating Strains of Escherichia coli. Evolution 37, 125. doi:10.2307/2408181 163

Bibliography

Chou, H.-H., Chiu, H.-C., Delaney, N.F., Segre, D., Marx, C.J., 2011. Diminishing Returns Epistasis Among Beneficial Mutations Decelerates Adaptation. Science 332, 1190–1192. doi:10.1126/science.1203799 Conrad, T.M., Frazier, M., Joyce, A.R., Cho, B.-K., Knight, E.M., Lewis, N.E., Landick, R., Palsson, B.Ø., 2010. RNA polymerase mutants found through adaptive evolution reprogram Escherichia coli for optimal growth in minimal media. Proc. Natl. Acad. Sci. U. S. A. 107, 20500–20505. doi:10.1073/pnas.0911253107 Cooper, L.A., Simmons, L.A., Mobley, H.L.T., 2012. Involvement of Mismatch Repair in the Reciprocal Control of Motility and Adherence of Uropathogenic Escherichia coli. Infect. Immun. 80, 1969–1979. doi:10.1128/IAI.00043-12 Cooper, T.F., Rozen, D.E., Lenski, R.E., 2003. Parallel changes in gene expression after 20,000 generations of evolution in Escherichia coli. Proc. Natl. Acad. Sci. 100, 1072–1077. Couce, A., Alonso-Rodriguez, N., Costas, C., Oliver, A., Blázquez, J., 2016. Intrapopulation variability in mutator prevalence among urinary tract infection isolates of Escherichia coli. Clin. Microbiol. Infect. 22, 566.e1-566.e7. doi:10.1016/j.cmi.2016.03.008 Couce, A., Tenaillon, O.A., 2015. The rule of declining adaptability in microbial evolution experiments. Front. Genet. 6. doi:10.3389/fgene.2015.00099 Cray, W.C., Moon, H.W., 1995. Experimental infection of calves and adult cattle with Escherichia coli O157:H7. Appl. Environ. Microbiol. 61, 1586–1590. Datsenko, K.A., Wanner, B.L., 2000. One-step inactivation of chromosomal genes in Escherichia coli K-12 using PCR products. Proc. Natl. Acad. Sci. U. S. A. 97, 6640–6645. Daubié, A.-S., Dahbi, G., Eddi, A., Branger, C., the COLIVILLE Group, Massot, M., Couffignal, C., Mentré, F., Blanco, J., Denamur, E., Clermont, O., Jauréguy, F., Picard, B., Mora, A., 2016. Phylogenetic, virulence and antibiotic resistance characteristics of commensal strain populations of Escherichia coli from community subjects in the Paris area in 2010 and evolution over 30 years. Microbiology 162, 642–650. doi:10.1099/mic.0.000242 Deatherage, D.E., Barrick, J.E., 2014. Identification of mutations in laboratory-evolved microbes from next-generation sequencing data using breseq. Methods Mol. Biol. Clifton NJ 1151, 165–188. doi:10.1007/978-1-4939-0554-6_12 Deatherage, D.E., Traverse, C.C., Wolf, L.N., Barrick, J.E., 2015. Detecting rare structural variation in evolving microbial populations from new sequence junctions using breseq. Front. Genet. 5. doi:10.3389/fgene.2014.00468 Denamur, E., Bonacorsi, S., Giraud, A., Duriez, P., Hilali, F., Amorin, C., Bingen, E., Andremont, A., Picard, B., Taddei, F., Matic, I., 2002. High Frequency of Mutator Strains among Human Uropathogenic Escherichia coli Isolates. J. Bacteriol. 184, 605–609. doi:10.1128/JB.184.2.605-609.2002 Denamur, E., Matic, I., 2006. Evolution of mutation rates in bacteria. Mol. Microbiol. 60, 820– 827. doi:10.1111/j.1365-2958.2006.05150.x Diard, M., Garry, L., Selva, M., Mosser, T., Denamur, E., Matic, I., 2010. Pathogenicity- Associated Islands in Extraintestinal Pathogenic Escherichia coli Are Fitness Elements Involved in Intestinal Colonization. J. Bacteriol. 192, 4885–4893. doi:10.1128/JB.00804- 10

164

Bibliography

Duriez, P., Clermont, O., Bonacorsi, S., Bingen, E., Chaventré, A., Elion, J., Picard, B., Denamur, E., 2001. Commensal Escherichia coli isolates are phylogenetically distributed among geographically distinct human populations. Microbiology 147, 1671–1676. Eid, J., Fehr, A., Gray, J., Luong, K., Lyle, J., Otto, G., Peluso, P., Rank, D., Baybayan, P., Bettman, B., Bibillo, A., Bjornson, K., Chaudhuri, B., Christians, F., Cicero, R., Clark, S., Dalal, R., deWinter, A., Dixon, J., Foquet, M., Gaertner, A., Hardenbol, P., Heiner, C., Hester, K., Holden, D., Kearns, G., Kong, X., Kuse, R., Lacroix, Y., Lin, S., Lundquist, P., Ma, C., Marks, P., Maxham, M., Murphy, D., Park, I., Pham, T., Phillips, M., Roy, J., Sebra, R., Shen, G., Sorenson, J., Tomaney, A., Travers, K., Trulson, M., Vieceli, J., Wegener, J., Wu, D., Yang, A., Zaccarin, D., Zhao, P., Zhong, F., Korlach, J., Turner, S., 2009. Real-Time DNA Sequencing from Single Polymerase Molecules. Science 323, 133–138. doi:10.1126/science.1162986 Escobar-Paramo, P., 2004. A Specific Genetic Background Is Required for Acquisition and Expression of Virulence Factors in Escherichia coli. Mol. Biol. Evol. 21, 1085–1094. doi:10.1093/molbev/msh118 Feldmann, H., Aigle, M., Aljinovic, G., Andre, B., Baclet, M.C., Barthe, C., Baur, A., Becam, A.M., Biteau, N., Boles, E., others, 1994. Complete DNA sequence of yeast chromosome II. EMBO J. 13, 5795. Ferenci, T., 2007. Bacterial Physiology, Regulation and Mutational Adaptation in a Chemostat Environment, in: Poole, R.K. (Ed.), Advances in Microbial Physiology. Academic Press, pp. 169–315. Ficicioglu, C., Yager, C., Segal, S., 2005. Galactitol and galactonate in red blood cells of children with the Duarte/ genotype. Mol. Genet. Metab. 84, 152–159. doi:10.1016/j.ymgme.2004.11.001 Flint, A., Butcher, J., Stintzi, A., 2016. Stress Responses, Adaptation, and Virulence of Bacterial Pathogens During Host Gastrointestinal Colonization. Microbiol. Spectr. 4. doi:10.1128/microbiolspec.VMBF-0007-2015 Fowler, R.G., Schaaper, R.M., 1997. The role of the mutT gene of Escherichia coli in maintaining replication fidelity. FEMS Microbiol. Rev. 21, 43–54. doi:10.1111/j.1574- 6976.1997.tb00344.x Friedberg, E.C., Walker, G.C., Siede, W., Wood, R.D., 2005. DNA Repair and Mutagenesis, 2 edition. ed. ASM Press, Washington, D.C. Gerlach, R.G., Jackel, D., Holzer, S.U., Hensel, M., 2009. Rapid Oligonucleotide-Based Recombineering of the Chromosome of Salmonella enterica. Appl. Environ. Microbiol. 75, 1575–1580. doi:10.1128/AEM.02509-08 Gerrish, P.J., Lenski, R.E., 1998. The fate of competing beneficial mutations in an asexual population. Genetica 102, 127–144. Giraud, A., 2001. Costs and Benefits of High Mutation Rates: Adaptive Evolution of Bacteria in the Mouse Gut. Science 291, 2606–2608. doi:10.1126/science.1056421 Giraud, A., Arous, S., Paepe, M.D., Gaboriau-Routhiau, V., Bambou, J.-C., Rakotobe, S., Lindner, A.B., Taddei, F., Cerf-Bensussan, N., 2008. Dissecting the Genetic Components of Adaptation of Escherichia coli to the Mouse Gut. PLoS Genet 4, e2. doi:10.1371/journal.pgen.0040002 165

Bibliography

Good, B.H., Desai, M.M., 2016. Evolution of mutation rates in rapidly adapting asexual populations (No. biorxiv;062760v1). Gordon, D.M., Cowling, A., 2003. The distribution and genetic structure of Escherichia coli in Australian vertebrates: host and geographic effects. Microbiol. Read. Engl. 149, 3575– 3586. Gullberg, E., Cao, S., Berg, O.G., Ilbäck, C., Sandegren, L., Hughes, D., Andersson, D.I., 2011. Selection of Resistant Bacteria at Very Low Antibiotic Concentrations. PLoS Pathog. 7. doi:10.1371/journal.ppat.1002158 Harfe, B.D., Jinks-Robertson, S., 2000. DNA mismatch repair and genetic instability. Annu. Rev. Genet. 34, 359–399. Hayashi, T., Makino, K., Ohnishi, M., Kurokawa, K., Ishii, K., Yokoyama, K., Han, C.-G., Ohtsubo, E., Nakayama, K., Murata, T., others, 2001. Complete genome sequence of enterohemorrhagic Eschelichia coli O157: H7 and genomic comparison with a laboratory strain K-12. DNA Res. 8, 11–22. Heather, J.M., Chain, B., 2016. The sequence of sequencers: The history of sequencing DNA. Genomics 107, 1–8. doi:10.1016/j.ygeno.2015.11.003 Hengge-Aronis, R., 2002. Signal Transduction and Regulatory Mechanisms Involved in Control of the σS (RpoS) Subunit of RNA Polymerase. Microbiol. Mol. Biol. Rev. 66, 373–395. doi:10.1128/MMBR.66.3.373-395.2002 Herring, C.D., Raghunathan, A., Honisch, C., Patel, T., Applebee, M.K., Joyce, A.R., Albert, T.J., Blattner, F.R., van den Boom, D., Cantor, C.R., Palsson, B.Ø., 2006. Comparative genome sequencing of Escherichia coli allows observation of bacterial evolution on a laboratory timescale. Nat. Genet. 38, 1406–1412. doi:10.1038/ng1906 Hershberg, R., Petrov, D.A., 2010. Evidence that mutation is universally biased towards AT in bacteria. PLoS Genet. 6, e1001115. doi:10.1371/journal.pgen.1001115 Herzer, P.J., Inouye, S., Inouye, M., Whittam, T.S., 1990. Phylogenetic distribution of branched RNA-linked multicopy single-stranded DNA among natural isolates of Escherichia coli. J. Bacteriol. 172, 6175–6181. Hindré, T., Knibbe, C., Beslon, G., Schneider, D., n.d. New insights into bacterial adaptation through in vivo and in silico experimental evolution. Nat. Rev. Microbiol. 10, 352–365. doi:10.1038/nrmicro2750 Huisman, O., Faelen, M., Girard, D., Jaffe, A., Toussaint, A., Rouviere-Yaniv, J., 1989. Multiple defects in Escherichia coli mutants lacking HU protein. J. Bacteriol. 171, 3704–3712. Jaureguy, F., Landraud, L., Passet, V., Diancourt, L., Frapy, E., Guigon, G., Carbonnelle, E., Lortholary, O., Clermont, O., Denamur, E., Picard, B., Nassif, X., Brisse, S., 2008. Phylogenetic and genomic diversity of human bacteremic Escherichia coli strains. BMC Genomics 9, 560. doi:10.1186/1471-2164-9-560 Jeong, H., Barbe, V., Lee, C.H., Vallenet, D., Yu, D.S., Choi, S.-H., Couloux, A., Lee, S.-W., Yoon, S.H., Cattolico, L., Hur, C.-G., Park, H.-S., Ségurens, B., Kim, S.C., Oh, T.K., Lenski, R.E., Studier, F.W., Daegelen, P., Kim, J.F., 2009. Genome sequences of Escherichia coli B strains REL606 and BL21(DE3). J. Mol. Biol. 394, 644–652. doi:10.1016/j.jmb.2009.09.052

166

Bibliography

Joensen, K.G., Tetzschner, A.M.M., Iguchi, A., Aarestrup, F.M., Scheutz, F., 2015. Rapid and Easy In Silico Serotyping of Escherichia coli Isolates by Use of Whole-Genome Sequencing Data. J. Clin. Microbiol. 53, 2410–2426. doi:10.1128/JCM.00008-15 Johnson, J.R., Clermont, O., Menard, M., Kuskowski, M.A., Picard, B., Denamur, E., 2006. Experimental mouse lethality of Escherichia coli isolates, in relation to accessory traits, phylogenetic group, and ecological source. J. Infect. Dis. 194, 1141–1150. Johnson, J.R., Owens, K., Gajewski, A., Clabots, C., 2008. Escherichia coli Colonization Patterns among Human Household Members and Pets, with Attention to Acute Urinary Tract Infection. J. Infect. Dis. 197, 218–224. doi:10.1086/524844 Jorth, P., Staudinger, B.J., Wu, X., Hisert, K.B., Hayden, H., Garudathri, J., Harding, C.L., Radey, M.C., Rezayat, A., Bautista, G., Berrington, W.R., Goddard, A.F., Zheng, C., Angermeyer, A., Brittnacher, M.J., Kitzman, J., Shendure, J., Fligner, C.L., Mittler, J., Aitken, M.L., Manoil, C., Bruce, J.E., Yahr, T.L., Singh, P.K., 2015. Regional Isolation Drives Bacterial Diversification within Cystic Fibrosis Lungs. Cell Host Microbe 18, 307–319. doi:10.1016/j.chom.2015.07.006 Kaper, J.B., Nataro, J.P., Mobley, H.L.T., 2004. Pathogenic Escherichia coli. Nat. Rev. Microbiol. 2, 123–140. doi:10.1038/nrmicro818 Khan, A.I., Dinh, D.M., Schneider, D., Lenski, R.E., Cooper, T.F., 2011. Negative Epistasis Between Beneficial Mutations in an Evolving Bacterial Population. Science 332, 1193– 1196. doi:10.1126/science.1203801 Khoury, K.A., Floch, M.H., Hersh, T., 1969. SMALL INTESTINAL MUCOSAL CELL PROLIFERATION AND BACTERIAL FLORA IN THE CONVENTIONALIZATION OF THE GERMFREE MOUSE. J. Exp. Med. 130, 659–670. Kosek, M., Bern, C., Guerrant, R.L., 2003. The global burden of diarrhoeal disease, as estimated from studies published between 1992 and 2000. Bull. World Health Organ. 81, 197–204. Kotloff, K.L., Winickoff, J.P., Ivanoff, B., Clemens, J.D., Swerdlow, D.L., Sansonetti, P.J., Adak, G.K., Levine, M.M., 1999. Global burden of Shigella infections: implications for vaccine development and implementation of control strategies. Bull. World Health Organ. 77, 651–666. Labat, F., Pradillon, O., Garry, L., Peuchmaur, M., Fantin, B., Denamur, E., 2005. Mutator phenotype confers advantage in Escherichia coli chronic urinary tract infection pathogenesis. FEMS Immunol. Med. Microbiol. 44, 317–321. doi:10.1016/j.femsim.2005.01.003 Lamble, H.J., Milburn, C.C., Taylor, G.L., Hough, D.W., Danson, M.J., 2004. Gluconate dehydratase from the promiscuous Entner–Doudoroff pathway in Sulfolobus solfataricus. FEBS Lett. 576, 133–136. doi:10.1016/j.febslet.2004.08.074 Lander, E.S., Linton, L.M., Birren, B., Nusbaum, C., Zody, M.C., Baldwin, J., Devon, K., Dewar, K., Doyle, M., FitzHugh, W., others, 2001. Initial sequencing and analysis of the human genome. Nature 409, 860–921. Lang, G.I., Rice, D.P., Hickman, M.J., Sodergren, E., Weinstock, G.M., Botstein, D., Desai, M.M., 2013. Pervasive genetic hitchhiking and clonal interference in forty evolving yeast populations. Nature 500, 571–574. doi:10.1038/nature12344

167

Bibliography

Langmead, B., Salzberg, S.L., 2012. Fast gapped-read alignment with Bowtie 2. Nat. Methods 9, 357–359. doi:10.1038/nmeth.1923 Larsen, M.V., Cosentino, S., Rasmussen, S., Friis, C., Hasman, H., Marvig, R.L., Jelsbak, L., Sicheritz-Ponten, T., Ussery, D.W., Aarestrup, F.M., Lund, O., 2012. Multilocus Sequence Typing of Total-Genome-Sequenced Bacteria. J. Clin. Microbiol. 50, 1355– 1361. doi:10.1128/JCM.06094-11 Le Gac, M., Cooper, T.F., Cruveiller, S., Médigue, C., Schneider, D., 2013. Evolutionary history and genetic parallelism affect correlated responses to evolution. Mol. Ecol. 22, 3292– 3303. Le Gall, T., Clermont, O., Gouriou, S., Picard, B., Nassif, X., Denamur, E., Tenaillon, O., 2007. Extraintestinal virulence is a coincidental by-product of commensalism in B2 phylogenetic group Escherichia coli strains. Mol. Biol. Evol. 24, 2373–2384. Leatham-Jensen, M.P., Frimodt-Møller, J., Adediran, J., Mokszycki, M.E., Banner, M.E., Caughron, J.E., Krogfelt, K.A., Conway, T., Cohen, P.S., 2012. The Streptomycin- Treated Mouse Intestine Selects Escherichia coli envZ Missense Mutants That Interact with Dense and Diverse Intestinal Microbiota. Infect. Immun. 80, 1716–1727. doi:10.1128/IAI.06193-11 Lederberg, J., Tatum, E.L., 1946. Gene recombination in Escherichia coli. Lee, D.-H., Palsson, B.Ø., 2010. Adaptive Evolution of Escherichia coli K-12 MG1655 during Growth on a Nonnative Carbon Source, l-1,2-Propanediol. Appl. Environ. Microbiol. 76, 4158–4168. doi:10.1128/AEM.00373-10 Lee, H., Popodi, E., Tang, H., Foster, P.L., 2012. Rate and molecular spectrum of spontaneous mutations in the bacterium Escherichia coli as determined by whole-genome sequencing. Proc. Natl. Acad. Sci. 109, E2774–E2783. doi:10.1073/pnas.1210309109 Lee, M.-C., Marx, C.J., 2013. Synchronous Waves of Failed Soft Sweeps in the Laboratory: Remarkably Rampant Clonal Interference of Alleles at a Single Locus. Genetics 193, 943–952. doi:10.1534/genetics.112.148502 Lenski, R.E., Rose, M.R., Simpson, S.C., Tadler, S.C., 1991. Long-term experimental evolution in Escherichia coli. I. Adaptation and divergence during 2,000 generations. Am. Nat. 1315–1341. Lescat, M., Adrien, L., Magnan, M., Glodt, J., Pintard, C., Dion, S., Ghalayini, M., Denamur, E., Tenaillon, O., 2016. Data from: Using long-term experimental evolution to uncover the patterns and determinants of molecular evolution of an Escherichia coli natural isolate in the streptomycin treated mouse gut. doi:10.5061/dryad.4g503 Lescat, M., Clermont, O., Woerther, P.L., Glodt, J., Dion, S., Skurnik, D., Djossou, F., Dupont, C., Perroz, G., Picard, B., Catzeflis, F., Andremont, A., Denamur, E., 2013. Commensal Escherichia coli strains in Guiana reveal a high genetic diversity with host-dependant population structure. Environ. Microbiol. Rep. 5, 49–57. doi:10.1111/j.1758- 2229.2012.00374.x Lescat, M., Reibel, F., Pintard, C., Dion, S., Glodt, J., Gateau, C., Launay, A., Ledda, A., Cruvellier, S., Tourret, J., Tenaillon, O., 2014. The Conserved nhaAR Operon Is Drastically Divergent between B2 and Non-B2 Escherichia coli and Is Involved in Extra- Intestinal Virulence. PLoS ONE 9. doi:10.1371/journal.pone.0108738 168

Bibliography

Levert, M., Zamfir, O., Clermont, O., Bouvet, O., Lespinats, S., Hipeaux, M.C., Branger, C., Picard, B., Saint-Ruf, C., Norel, F., Balliau, T., Zivy, M., Le Nagard, H., Cruvellier, S., Chane-Woon-Ming, B., Nilsson, S., Gudelj, I., Phan, K., Ferenci, T., Tenaillon, O., Denamur, E., 2010. Molecular and Evolutionary Bases of Within-Patient Genotypic and Phenotypic Diversity in Escherichia coli Extraintestinal Infections. PLoS Pathog. 6. doi:10.1371/journal.ppat.1001125 Li, G.-M., 2008. Mechanisms and functions of DNA mismatch repair. Cell Res. 18, 85–98. Li, H., 2011. A statistical framework for SNP calling, mutation discovery, association mapping and population genetical parameter estimation from sequencing data. Bioinformatics 27, 2987–2993. doi:10.1093/bioinformatics/btr509 Li, H., Durbin, R., 2010. Fast and accurate long-read alignment with Burrows-Wheeler transform. Bioinformatics 26, 589–595. doi:10.1093/bioinformatics/btp698 Li, H., Durbin, R., 2009. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics 25, 1754–1760. doi:10.1093/bioinformatics/btp324 Li, H., Handsaker, B., Wysoker, A., Fennell, T., Ruan, J., Homer, N., Marth, G., Abecasis, G., Durbin, R., 1000 Genome Project Data Processing Subgroup, 2009. The Sequence Alignment/Map format and SAMtools. Bioinformatics 25, 2078–2079. doi:10.1093/bioinformatics/btp352 Lieberman, T.D., Michel, J.-B., Aingaran, M., Potter-Bynoe, G., Roux, D., Davis, M.R., Skurnik, D., Leiby, N., LiPuma, J.J., Goldberg, J.B., McAdam, A.J., Priebe, G.P., Kishony, R., 2011. Parallel bacterial evolution within multiple patients identifies candidate pathogenicity genes. Nat. Genet. 43, 1275–1280. doi:10.1038/ng.997 Loman, N.J., Quick, J., Simpson, J.T., 2015. A complete bacterial genome assembled de novo using only nanopore sequencing data. Nat. Methods 12, 733–735. doi:10.1038/nmeth.3444 Long, A., Liti, G., Luptak, A., Tenaillon, O., 2015. Elucidating the molecular architecture of adaptation via evolve and resequence experiments. Nat. Rev. Genet. 16, 567–582. doi:10.1038/nrg3937 Luo, R., Liu, B., Xie, Y., Li, Z., Huang, W., Yuan, J., He, G., Chen, Y., Pan, Q., Liu, Y., others, 2012. SOAPdenovo2: an empirically improved memory-efficient short-read de novo assembler. GigaScience 1, 1. Maddamsetti, R., Lenski, R.E., Barrick, J.E., 2015. Adaptation, Clonal Interference, and Frequency-Dependent Interactions in a Long-Term Evolution Experiment with Escherichia coli. Genetics 200, 619–631. doi:10.1534/genetics.115.176677 Maltby, R., Leatham-Jensen, M.P., Gibson, T., Cohen, P.S., Conway, T., 2013. Nutritional Basis for Colonization Resistance by Human Commensal Escherichia coli Strains HS and Nissle 1917 against E. coli O157:H7 in the Mouse Intestine. PLoS ONE 8, e53957. doi:10.1371/journal.pone.0053957 Mao, E.F., Lane, L., Lee, J., Miller, J.H., 1997. Proliferation of mutators in A cell population. J. Bacteriol. 179, 417–422. Miller, J.R., Koren, S., Sutton, G., 2010. Assembly algorithms for next-generation sequencing data. Genomics 95, 315–327. doi:10.1016/j.ygeno.2010.03.001

169

Bibliography

Nguyen, T.L.A., Vieira-Silva, S., Liston, A., Raes, J., 2015. How informative is the mouse for human gut microbiota research? Dis. Model. Mech. 8, 1–16. doi:10.1242/dmm.017400 Nowrouzian, F., Hesselmar, B., Saalman, R., Strannegård, I.-L., Åberg, N., Wold, A.E., Adlerberth, I., 2003. Escherichia coli in Infants’ Intestinal Microflora: Colonization Rate, Strain Turnover, and Virulence Gene Carriage. Pediatr. Res. 54, 8–14. doi:10.1203/01.PDR.0000069843.20655.EE Okamoto, S., Tamaru, A., Nakajima, C., Nishimura, K., Tanaka, Y., Tokuyama, S., Suzuki, Y., Ochi, K., 2007. Loss of a conserved 7-methylguanosine modification in 16S rRNA confers low-level streptomycin resistance in bacteria. Mol. Microbiol. 63, 1096–1106. doi:10.1111/j.1365-2958.2006.05585.x Oliver, A., 2000. High Frequency of Hypermutable Pseudomonas aeruginosa in Cystic Fibrosis Lung Infection. Science 288, 1251–1253. doi:10.1126/science.288.5469.1251 Pedron, T., Mulet, C., Dauga, C., Frangeul, L., Chervaux, C., Grompone, G., Sansonetti, P.J., 2012. A Crypt-Specific Core Microbiota Resides in the Mouse Colon. mBio 3, e00116- 12-e00116-12. doi:10.1128/mBio.00116-12 Pelchovich, G., Schreiber, R., Zhuravlev, A., Gophna, U., 2013. The contribution of common rpsL mutations in Escherichia coli to sensitivity to ribosome targeting antibiotics. Int. J. Med. Microbiol. IJMM 303, 558–562. doi:10.1016/j.ijmm.2013.07.006 Picard, B., Garcia, J.S., Gouriou, S., Duriez, P., Brahimi, N., Bingen, E., Elion, J., Denamur, E., 1999. The link between phylogeny and virulence in Escherichia coli extraintestinal infection. Infect. Immun. 67, 546–553. Poulsen, L.K., Licht, T.R., Rang, C., Krogfelt, K.A., Molin, S., 1995. Physiological state of Escherichia coli BJ4 growing in the large intestines of streptomycin-treated mice. J. Bacteriol. 177, 5840–5845. Rang, C.U., Licht, T.R., Midtvedt, T., Conway, P.L., Chao, L., Krogfelt, K.A., Cohen, P.S., Molin, S., 1999. Estimation of Growth Rates of Escherichia coli BJ4 in Streptomycin- Treated and Previously Germfree Mice by In Situ rRNA Hybridization. Clin. Diagn. Lab. Immunol. 6, 434–436. Robbe-Saule, V., Jaumouille, V., Prevost, M.-C., Guadagnini, S., Talhouarne, C., Mathout, H., Kolb, A., Norel, F., 2006. Crl Activates Transcription Initiation of RpoS-Regulated Genes Involved in the Multicellular Behavior of Salmonella enterica Serovar Typhimurium. J. Bacteriol. 188, 3983–3994. doi:10.1128/JB.00033-06 Ronaghi, M., Uhlén, M., Nyrén, P., 1998. A Sequencing Method Based on Real-Time Pyrophosphate. Science 281, 363–365. doi:10.1126/science.281.5375.363 Rothberg, J.M., Hinz, W., Rearick, T.M., Schultz, J., Mileski, W., Davey, M., Leamon, J.H., Johnson, K., Milgrew, M.J., Edwards, M., Hoon, J., Simons, J.F., Marran, D., Myers, J.W., Davidson, J.F., Branting, A., Nobile, J.R., Puc, B.P., Light, D., Clark, T.A., Huber, M., Branciforte, J.T., Stoner, I.B., Cawley, S.E., Lyons, M., Fu, Y., Homer, N., Sedova, M., Miao, X., Reed, B., Sabina, J., Feierstein, E., Schorn, M., Alanjary, M., Dimalanta, E., Dressman, D., Kasinskas, R., Sokolsky, T., Fidanza, J.A., Namsaraev, E., McKernan, K.J., Williams, A., Roth, G.T., Bustillo, J., 2011. An integrated semiconductor device enabling non-optical genome sequencing. Nature 475, 348–352. doi:10.1038/nature10242

170

Bibliography

Rusconi, B., Sanjar, F., Koenig, S.S.K., Mammel, M.K., Tarr, P.I., Eppinger, M., 2016. Whole Genome Sequencing for Genomics-Guided Investigations of Escherichia coli O157:H7 Outbreaks. Front. Microbiol. 7. doi:10.3389/fmicb.2016.00985 Russo, T.A., Johnson, J.R., 2003. Medical and economic impact of extraintestinal infections due to Escherichia coli: focus on an increasingly important endemic problem. Microbes Infect. Inst. Pasteur 5, 449–456. Saffarian, A., Mulet, C., Naito, T., Bouchier, C., Tichit, M., Ma, L., Grompone, G., Sansonetti, P.J., Pédron, T., 2015. Draft Genome Sequences of Acinetobacter parvus CM11, Acinetobacter radioresistens CM38, and Stenotrophomonas maltophilia BR12, Isolated from Murine Proximal Colonic Tissue. Genome Announc. 3. doi:10.1128/genomeA.01089-15 Sanger, F., Nicklen, S., Coulson, A.R., 1977. DNA sequencing with chain-terminating inhibitors. Proc. Natl. Acad. Sci. 74, 5463–5467. Sato, T., Iino, T., 2010. Genetic analyses of the antibiotic resistance of Bifidobacterium bifidum strain Yakult YIT 4007. Int. J. Food Microbiol. 137, 254–258. doi:10.1016/j.ijfoodmicro.2009.12.014 Schmid, C.W., Deininger, P.L., 1975. Sequence organization of the human genome. Cell 6, 345– 358. Schnable, P.S., Ware, D., Fulton, R.S., Stein, J.C., Wei, F., Pasternak, S., Liang, C., Zhang, J., Fulton, L., Graves, T.A., Minx, P., Reily, A.D., Courtney, L., Kruchowski, S.S., Tomlinson, C., Strong, C., Delehaunty, K., Fronick, C., Courtney, B., Rock, S.M., Belter, E., Du, F., Kim, K., Abbott, R.M., Cotton, M., Levy, A., Marchetto, P., Ochoa, K., Jackson, S.M., Gillam, B., Chen, W., Yan, L., Higginbotham, J., Cardenas, M., Waligorski, J., Applebaum, E., Phelps, L., Falcone, J., Kanchi, K., Thane, T., Scimone, A., Thane, N., Henke, J., Wang, T., Ruppert, J., Shah, N., Rotter, K., Hodges, J., Ingenthron, E., Cordes, M., Kohlberg, S., Sgro, J., Delgado, B., Mead, K., Chinwalla, A., Leonard, S., Crouse, K., Collura, K., Kudrna, D., Currie, J., He, R., Angelova, A., Rajasekar, S., Mueller, T., Lomeli, R., Scara, G., Ko, A., Delaney, K., Wissotski, M., Lopez, G., Campos, D., Braidotti, M., Ashley, E., Golser, W., Kim, H., Lee, S., Lin, J., Dujmic, Z., Kim, W., Talag, J., Zuccolo, A., Fan, C., Sebastian, A., Kramer, M., Spiegel, L., Nascimento, L., Zutavern, T., Miller, B., Ambroise, C., Muller, S., Spooner, W., Narechania, A., Ren, L., Wei, S., Kumari, S., Faga, B., Levy, M.J., McMahan, L., Van Buren, P., Vaughn, M.W., Ying, K., Yeh, C.-T., Emrich, S.J., Jia, Y., Kalyanaraman, A., Hsia, A.-P., Barbazuk, W.B., Baucom, R.S., Brutnell, T.P., Carpita, N.C., Chaparro, C., Chia, J.-M., Deragon, J.-M., Estill, J.C., Fu, Y., Jeddeloh, J.A., Han, Y., Lee, H., Li, P., Lisch, D.R., Liu, S., Liu, Z., Nagel, D.H., McCann, M.C., SanMiguel, P., Myers, A.M., Nettleton, D., Nguyen, J., Penning, B.W., Ponnala, L., Schneider, K.L., Schwartz, D.C., Sharma, A., Soderlund, C., Springer, N.M., Sun, Q., Wang, H., Waterman, M., Westerman, R., Wolfgruber, T.K., Yang, L., Yu, Y., Zhang, L., Zhou, S., Zhu, Q., Bennetzen, J.L., Dawe, R.K., Jiang, J., Jiang, N., Presting, G.G., Wessler, S.R., Aluru, S., Martienssen, R.A., Clifton, S.W., McCombie, W.R., Wing, R.A., Wilson, R.K., 2009. The B73 Maize Genome: Complexity, Diversity, and Dynamics. Science 326, 1112–1115. doi:10.1126/science.1178534 171

Bibliography

Seemann, T., 2014. Prokka: rapid prokaryotic genome annotation. Bioinformatics 30, 2068–2069. doi:10.1093/bioinformatics/btu153 Skurnik, D., Bonnet, D., Bernde-Bauduin, C., Michel, R., Guette, C., Becker, J.-M., Balaire, C., Chau, F., Mohler, J., Jarlier, V., Boutin, J.-P., Moreau, B., Guillemot, D., Denamur, E., Andremont, A., Ruimy, R., 2008. Characteristics of human intestinal Escherichia coli with changing environments. Environ. Microbiol. 10, 2132–2137. doi:10.1111/j.1462- 2920.2008.01636.x Skurnik, D., Clermont, O., Guillard, T., Launay, A., Danilchanka, O., Pons, S., Diancourt, L., Lebreton, F., Kadlec, K., Roux, D., Jiang, D., Dion, S., Aschard, H., Denamur, M., Cywes-Bentley, C., Schwarz, S., Tenaillon, O., Andremont, A., Picard, B., Mekalanos, J., Brisse, S., Denamur, E., 2016. Emergence of Antimicrobial-Resistant Escherichia coli of Animal Origin Spreading in Humans. Mol. Biol. Evol. 33, 898–914. doi:10.1093/molbev/msv280 Smith, E.E., Buckley, D.G., Wu, Z., Saenphimmachak, C., Hoffman, L.R., D’Argenio, D.A., Miller, S.I., Ramsey, B.W., Speert, D.P., Moskowitz, S.M., others, 2006. Genetic adaptation by Pseudomonas aeruginosa to the airways of cystic fibrosis patients. Proc. Natl. Acad. Sci. 103, 8487–8492. Sniegowski, P.D., Gerrish, P.J., Lenski, R.E., others, 1997. Evolution of high mutation rates in experimental populations of E. coli. Nature 387, 703–705. Sokurenko, E., 2016. Pathoadaptive Mutations in Uropathogenic Escherichia coli. Microbiol. Spectr. 4. doi:10.1128/microbiolspec.UTI-0020-2015 Sokurenko, E.V., Chesnokova, V., Dykhuizen, D.E., Ofek, I., Wu, X.-R., Krogfelt, K.A., Struve, C., Schembri, M.A., Hasty, D.L., 1998. Pathogenic adaptation of Escherichia coli by natural variation of the FimH adhesin. Proc. Natl. Acad. Sci. 95, 8922–8926. Staden, R., 1979. A strategy of DNA sequencing employing computer programs. Nucleic Acids Res. 6, 2601–2610. Sturtevant, A.H., 1937. Essays on Evolution. I. On the Effects of Selection on Mutation Rate. Q. Rev. Biol. 12, 464. doi:10.1086/394543 Tanaka, M.M., Bergstrom, C.T., Levin, B.R., 2003. The evolution of mutator genes in bacterial populations: the roles of environmental change and timing. Genetics 164, 843–854. Tenaillon, O., Barrick, J.E., Ribeck, N., Deatherage, D.E., Blanchard, J.L., Dasgupta, A., Wu, G.C., Wielgoss, S., Cruveiller, S., Médigue, C., Schneider, D., Lenski, R.E., 2016. Tempo and mode of genome evolution in a 50,000-generation experiment. Nature. doi:10.1038/nature18959 Tenaillon, O., Rodriguez-Verdugo, A., Gaut, R.L., McDonald, P., Bennett, A.F., Long, A.D., Gaut, B.S., 2012. The Molecular Diversity of Adaptive Convergence. Science 335, 457– 461. doi:10.1126/science.1212986 Tenaillon, O., Skurnik, D., Picard, B., Denamur, E., 2010. The population genetics of commensal Escherichia coli. Nat. Rev. Microbiol. 8, 207–217. doi:10.1038/nrmicro2298 Tenaillon, O., Taddei, F., Radman, M., Matic, I., 2001. Second-order selection in bacterial evolution: Selection acting on mutation and recombination rates in the course of adaptation. Res. Microbiol. 152, 11–16. doi:10.1016/S0923-2508(00)01163-3

172

Bibliography

Tenaillon, O., Toupance, B., Le Nagard, H., Taddei, F., Godelle, B., 1999. Mutators, population size, adaptive landscape and the adaptation of asexual populations of bacteria. Genetics 152, 485–493. Toprak, E., Veres, A., Michel, J.-B., Chait, R., Hartl, D.L., Kishony, R., 2011. Evolutionary paths to antibiotic resistance under dynamically sustained drug stress. Nat. Genet. 44, 101–105. doi:10.1038/ng.1034 Touchon, M., Hoede, C., Tenaillon, O., Barbe, V., Baeriswyl, S., Bidet, P., Bingen, E., Bonacorsi, S., Bouchier, C., Bouvet, O., Calteau, A., Chiapello, H., Clermont, O., Cruveiller, S., Danchin, A., Diard, M., Dossat, C., Karoui, M.E., Frapy, E., Garry, L., Ghigo, J.M., Gilles, A.M., Johnson, J., Le Bouguénec, C., Lescat, M., Mangenot, S., Martinez-Jéhanne, V., Matic, I., Nassif, X., Oztas, S., Petit, M.A., Pichon, C., Rouy, Z., Ruf, C.S., Schneider, D., Tourret, J., Vacherie, B., Vallenet, D., Médigue, C., Rocha, E.P.C., Denamur, E., 2009. Organised Genome Dynamics in the Escherichia coli Species Results in Highly Diverse Adaptive Paths. PLoS Genet. 5, e1000344. doi:10.1371/journal.pgen.1000344 Tourret, J., Diard, M., Garry, L., Matic, I., Denamur, E., 2010. Effects of single and multiple pathogenicity island deletions on uropathogenic Escherichia coli strain 536 intrinsic extra- intestinal virulence. Int. J. Med. Microbiol. 300, 435–439. doi:10.1016/j.ijmm.2010.04.013 Treangen, T.J., Salzberg, S.L., 2011. Repetitive DNA and next-generation sequencing: computational challenges and solutions. Nat. Rev. Genet. doi:10.1038/nrg3117 Umekasi, Y., 2014. Use of gnotobiotic mice to identify and characterize key microbes responsible for the development of the intestinal immune system. Proc. Jpn. Acad. Ser. B Phys. Biol. Sci. 90, 313–332. doi:10.2183/pjab.90.313 Venter, J.C., Adams, M.D., Myers, E.W., Li, P.W., Mural, R.J., Sutton, G.G., Smith, H.O., Yandell, M., Evans, C.A., Holt, R.A., others, 2001. Celera Genomics, 45 West Gude Drive, Rockville, MD 20850, USA. 2 GenetixXpress, 78 Pacific Road, Palm Beach, Sydney 2108, Australia. 3 Berkeley Drosophila Genome Project, University of California, Berkeley, CA 94720, USA. 4 Department of Biology, Penn State University, 208 Mueller Lab, University Park, PA 16802, USA. 5 Department of Genetics, Case Western Reserve University School of , BRB-630, 10900 Euclid Avenue, Cleveland, OH 44106, USA. Science 291, 1304–1351. Vollaard, E.J., Clasener, H.A., 1994. Colonization resistance. Antimicrob. Agents Chemother. 38, 409. Watson, J.D., Crick, F.H.C., 1953. THE STRUCTURE OF DNA. Cold Spring Harb. Symp. Quant. Biol. 18, 123–131. doi:10.1101/SQB.1953.018.01.020 Watterson, G.A., 1975. On the number of segregating sites in genetical models without recombination. Theor. Popul. Biol. 7, 256–276. doi:10.1016/0040-5809(75)90020-9 Welling, G.W., Groen, G., Tuinte, J.H., Koopman, J.P., Kennis, H.M., 1980. Biochemical effects on germ-free mice of association with several strains of anaerobic bacteria. J. Gen. Microbiol. 117, 57–63. doi:10.1099/00221287-117-1-57 Wheeler, D.A., Srinivasan, M., Egholm, M., Shen, Y., Chen, L., McGuire, A., He, W., Chen, Y.- J., Makhijani, V., Roth, G.T., Gomes, X., Tartaro, K., Niazi, F., Turcotte, C.L., Irzyk, 173

Bibliography

G.P., Lupski, J.R., Chinault, C., Song, X., Liu, Y., Yuan, Y., Nazareth, L., Qin, X., Muzny, D.M., Margulies, M., Weinstock, G.M., Gibbs, R.A., Rothberg, J.M., 2008. The complete genome of an individual by massively parallel DNA sequencing. Nature 452, 872–876. doi:10.1038/nature06884 Wielgoss, S., Barrick, J.E., Tenaillon, O., Cruveiller, S., Chane-Woon-Ming, B., Medigue, C., Lenski, R.E., Schneider, D., Andrews, B.J., 2011. Mutation Rate Inferred From Synonymous Substitutions in a Long-Term Evolution Experiment With Escherichia coli. G3amp58 GenesGenomesGenetics 1, 183–186. doi:10.1534/g3.111.000406 Wirth, T., Falush, D., Lan, R., Colles, F., Mensa, P., Wieler, L.H., Karch, H., Reeves, P.R., Maiden, M.C.J., Ochman, H., Achtman, M., 2006. Sex and virulence in Escherichia coli: an evolutionary perspective. Mol. Microbiol. 60, 1136–1151. doi:10.1111/j.1365- 2958.2006.05172.x Wiser, M.J., Ribeck, N., Lenski, R.E., 2013. Long-Term Dynamics of Adaptation in Asexual Populations. Science 342, 1364–1367. doi:10.1126/science.1243357 Wong, A., Rodrigue, N., Kassen, R., 2012. Genomics of adaptation during experimental evolution of the opportunistic pathogen Pseudomonas aeruginosa. PLoS Genet. 8, e1002928. doi:10.1371/journal.pgen.1002928 Woods, R., Schneider, D., Winkworth, C.L., Riley, M.A., Lenski, R.E., 2006. Tests of parallel molecular evolution in a long-term experiment with Escherichia coli. Proc. Natl. Acad. Sci. 103, 9107–9112. Xu, X., Pan, S., Cheng, S., Zhang, B., Mu, D., Ni, P., Zhang, G., Yang, S., Li, R., Wang, J., Orjeda, G., Guzman, F., Torres, M., Lozano, R., Ponce, O., Martinez, D., De la Cruz, G., Chakrabarti, S.K., Patil, V.U., Skryabin, K.G., Kuznetsov, B.B., Ravin, N.V., Kolganova, T.V., Beletsky, A.V., Mardanov, A.V., Di Genova, A., Bolser, D.M., Martin, D.M.A., Li, G., Yang, Y., Kuang, H., Hu, Q., Xiong, X., Bishop, G.J., Sagredo, B., Mejía, N., Zagorski, W., Gromadka, R., Gawor, J., Szczesny, P., Huang, S., Zhang, Z., Liang, C., He, J., Li, Y., He, Y., Xu, J., Zhang, Y., Xie, B., Du, Y., Qu, D., Bonierbale, M., Ghislain, M., del Rosario Herrera, M., Giuliano, G., Pietrella, M., Perrotta, G., Facella, P., O’Brien, K., Feingold, S.E., Barreiro, L.E., Massa, G.A., Diambra, L., Whitty, B.R., Vaillancourt, B., Lin, H., Massa, A.N., Geoffroy, M., Lundback, S., DellaPenna, D., Robin Buell, C., Sharma, S.K., Marshall, D.F., Waugh, R., Bryan, G.J., Destefanis, M., Nagy, I., Milbourne, D., Thomson, S.J., Fiers, M., Jacobs, J.M.E., Nielsen, K.L., Sønderkær, M., Iovene, M., Torres, G.A., Jiang, J., Veilleux, R.E., Bachem, C.W.B., de Boer, J., Borm, T., Kloosterman, B., van Eck, H., Datema, E., te Lintel Hekkert, B., Goverse, A., van Ham, R.C.H.J., Visser, R.G.F., 2011. Genome sequence and analysis of the tuber crop potato. Nature 475, 189–195. doi:10.1038/nature10158 Yager, C., Ning, C., Reynolds, R., Leslie, N., Segal, S., 2004. Galactitol and galactonate accumulation in heart and skeletal muscle of mice with deficiency of galactose-1- phosphate uridyltransferase. Mol. Genet. Metab. 81, 105–111. doi:10.1016/j.ymgme.2003.10.001 Yang, L., Jelsbak, L., Marvig, R.L., Damkiaer, S., Workman, C.T., Rau, M.H., Hansen, S.K., Folkesson, A., Johansen, H.K., Ciofu, O., Hoiby, N., Sommer, M.O.A., Molin, S., 2011.

174

Bibliography

Evolutionary dynamics of bacteria in a human host environment. Proc. Natl. Acad. Sci. 108, 7481–7486. doi:10.1073/pnas.1018249108 Zaslaver, A., Bren, A., Ronen, M., Itzkovitz, S., Kikoin, I., Shavit, S., Liebermeister, W., Surette, M.G., Alon, U., 2006. A comprehensive library of fluorescent transcriptional reporters for Escherichia coli. Nat. Methods 3, 623–628. doi:10.1038/nmeth895 Zdziarski, J., Brzuszkiewicz, E., Wullt, B., Liesegang, H., Biran, D., Voigt, B., Grönberg- Hernandez, J., Ragnarsdottir, B., Hecker, M., Ron, E.Z., Daniel, R., Gottschalk, G., Hacker, J., Svanborg, C., Dobrindt, U., 2010. Host Imprints on Bacterial Genomes— Rapid, Divergent Evolution in Individual Patients. PLoS Pathog. 6, e1001078. doi:10.1371/journal.ppat.1001078 Zerbino, D.R., Birney, E., 2008. Velvet: Algorithms for de novo short read assembly using de Bruijn graphs. Genome Res. 18, 821–829. doi:10.1101/gr.074492.107 Zhang, L., Foxman, B., Marrs, C., 2002. Both Urinary and Rectal Escherichia coli Isolates Are Dominated by Strains of Phylogenetic Group B2. J. Clin. Microbiol. 40, 3951–3955. doi:10.1128/JCM.40.11.3951-3955.2002

175

RESUME :

Escherichia coli est une espèce commensale du tube digestif, mais elle peut aussi se révéler être un dangereux pathogène intra ou extra intestinal. Un même clone pouvant passer d'un état commensal à pathogène, la compréhension des mécanismes impliqués dans la diversification d'E. coli dans ces deux habitats représente un enjeu majeur de santé publique. Des expériences d'évolution expérimentale utilisant E. coli ont permis de révéler différentes facettes de l'adaptation bactérienne. Cependant, ces expériences de laboratoire utilisant des conditions artificielles, on peut s'interroger sur la pertinence des observations qui en découlent en milieu naturel et plus globalement s’interroger sur la part de la sélection naturelle dans la diversification de E. coli dans la nature. Pour répondre à ces questions, j'ai analysé les profils génomiques de diversification de E. coli au cours (1) d’une adaptation au tube digestif de souris ou (2) dans des infections extra-intestinales. Dans les deux cas, j’ai pu montrer une importante convergence au niveau du gène : un même gène étant muté plusieurs fois indépendamment, un signe que l’adaptation est active. Dans les infections aigues, des mutations touchant des régulateurs globaux ont été retrouvées, alors que dans le tube digestif les cibles de l’adaptation semblaient plus spécifiques. Enfin, les échantillons issus des infections incluant des souches a fort taux de mutation dites mutatrices, j'ai pu documenter pour la première fois la génomique de l'émergence de bactéries mutatrices en milieu naturel. En conclusion, mes travaux montrent que l’adaptation joue un rôle important dans la diversification de E. coli en milieu naturel et que ce processus s’apparente à celui observé dans des milieux artificiels de laboratoire. L’adaptation semble néanmoins plus active en conditions d’infections aigues que dans le tube digestif de souris.

Mots clés : Escherichia coli / évolution expérimentale / model souris de colonisation de l’intestin / infection aigue / souches mutatrices

ABSTRACT:

Escherichia coli is a commensal species living in the digestive tract of vertebrates, but can also be a harmful pathogen involved in both intra and extraintestinal diseases. As clones can behave both as commensals and pathogens, the comprehension of the mechanisms involved in the diversification of E. coli in those two habitats represents a major public health concern. In vitro experimental evolution studies using E. coli have unraveled the different faces of bacterial adaptation. However, as those experiments used artificial conditions, the relevance of these observations and more generally the contribution of adaptation to the diversification of E. coli in the wild remain questionable. To answer these questions, I analyzed the genomic profiles of diversification of E. coli during (1) adaptation to the mice digestive tract or (2) during acute extraintestinal infections. In both cases, I found a strong convergence at the gene level, i.e. observation of several impendent mutations in the same gene, suggesting a dynamic adaptation. In acute infections, mutations in global regulators were recovered, while more specific genes were recruited in the mice gut. Finally, the existence of clones with high mutation rate in the infections, allowed me to document for the first time the genomics of mutator emergence in the wild. In conclusion, my work shows that adaptation is playing an important role in the diversification of E. coli, and that this process is fairly similar to the one observed in the laboratory. Nevertheless, adaptation seems more active during infections than in the mice gut.

Keywords: Escherichia coli / experimental evolution/ gut colonization mouse model / acute infection/ mutator strains