Aix-Marseille Université, Faculté de Médecine de Marseille, Ecole Doctorale des Sciences de la Vie et de la Santé

DOCTORAL THESIS

presented and defended on 18 December 2013 by Mano Joseph MATHEW

For the degree of Doctor of Aix-Marseille Université. Specialty: Human Pathology and Infectious Diseases

Insight into intracellular bacterial genome repertoire using comparative genomics

Members of the jury:
Professor Jérôme ETIENNE, Reviewer
Professor Max MAURIN, Reviewer
Professor Jean-Louis MEGE, President of the Jury
Professor Didier RAOULT, Thesis Supervisor

Unité de Recherche sur les Maladies Infectieuses Tropicales et Emergentes (URMITE), UM 63 CNRS 7278 IRD 198 INSERM 1095


To my Lord, precious family and friends…


Preamble

The format of this thesis follows a recommendation of the Infectious Diseases and Microbiology specialty within the Master of Life and Health Sciences, which comes under the Ecole Doctorale des Sciences de la Vie de Marseille. The candidate is required to follow the rules imposed on him, which include a thesis format used in Northern Europe that allows better organization than traditional theses. In addition, the introduction and bibliography section is replaced by a review submitted to a journal, so that the quality of the review can be assessed externally and the student can begin an exhaustive bibliography on the field of the thesis as early as possible. The thesis is presented as published, accepted or submitted articles, each accompanied by a brief commentary giving the general sense of the work. This form of presentation appeared better suited to the demands of international competition and makes it possible to concentrate on work that will be disseminated internationally.

Professeur Didier RAOULT

Abstract

Prokaryotic microorganisms are prevalent in all environments on Earth. Given their ecological ubiquity, it is not surprising to find many prokaryotic species in close relationships with members of many eukaryotic taxa, often establishing a persistent association known as symbiosis. Depending on the fitness effects on the members of the symbiotic relationship, associations can be referred to as parasitism, mutualism or commensalism and, depending on the location of the symbiont with respect to host cells, as ectosymbiosis or endosymbiosis. Genome sequencing, especially Next Generation Sequencing (NGS), has radically changed the face of microbiology and has helped to discern how the diverse group of intracellular bacteria evolved to survive and replicate in host cells. The initial purpose of my thesis is therefore to understand, with the help of comparative genomics, the genomic variations linked to coexistence, by examining data on the ancient existence of intracellular bacteria, their host adaptation and the differences between sympatry and allopatry. The first part of my thesis is a review giving insight into the genome repertoire of intracellular bacteria and symbionts. The goal of this review is to explore how intracellular microbes acquire their specific lifestyle. Due to their different evolutionary trajectories, these bacteria have different genomic compositions. We reviewed data on the ancient existence of intracellular bacteria, their host adaptation and the differences between sympatry and allopatry. Furthermore, we elaborate on the genomic repertoire to understand the phenomenon of gene loss in intracellular bacteria. To understand the genomic repertoire and its composition in intracellular bacteria, it is essential to understand specialization in bacteria with respect to their niches. A comparison of the genomic contents of bacteria with certain lifestyles revealed the bacterial capacity to exchange genes to different extents, depending on the ecosystem. Moreover, genomics has provided important clues to the mechanisms driving the genome-reduction process, the functions that are retained when a species becomes intracellular, and the role of the host in molding the genomic composition of intracellular bacteria. The second part of my thesis presents the genome sequence of Diplorickettsia massiliensis strain 20B, an obligate intracellular, Gram-negative bacterium isolated from Ixodes ricinus ticks collected in Slovakia. In the third part, we investigated the genome repertoire of Diplorickettsia massiliensis compared to closely related bacteria according to its niche, revealing its allopatric lifestyle. In this study, we compared the genomic features of Diplorickettsia massiliensis with twenty-nine sequenced Gammaproteobacteria species (Legionella strains, Coxiella burnetii strains, Francisella tularensis strains and Rickettsiella grylli) using a multi-genus pangenomic approach. This thesis work provides original data and sheds light on intracellular bacterial diversity.

Keywords: Intracellular bacteria, Diplorickettsia massiliensis, genome repertoire, allopatry, sympatry, pangenome, Gammaproteobacteria

Résumé

Microorganisms are present in almost every habitat on the planet. Given their ecological ubiquity, it is not surprising to find many prokaryotic species in close relationships with members of many eukaryotic taxa, often establishing a persistent association called symbiosis. Depending on the interactions between the partners in this symbiotic relationship, it can be regarded as parasitism, mutualism or commensalism and, depending on the location of the symbiont relative to the host cells, as ectosymbiosis or endosymbiosis. Genome sequencing, in particular high-throughput sequencing (NGS), has improved our understanding of the evolution of the different groups of intracellular bacteria and of their survival within host cells. The objective of this thesis is therefore to understand, with the help of comparative genomics, the genomic variations linked to coexistence, by examining data on the ancient existence of intracellular bacteria, their adaptation to their host and the differences between sympatry and allopatry. The first part of my thesis is a review giving an overview of the genomic repertoire of intracellular bacteria and symbionts. The objective of this review is to explore the process by which intracellular bacteria acquire their specific lifestyle. Because of their different evolutionary paths, these bacteria have different genomic compositions. We began by examining data on the ancient existence of intracellular bacteria, their adaptation to their host and the differences between sympatry and allopatry. In addition, we explored the genomic repertoire of these bacteria to understand the phenomenon of gene loss in intracellular bacteria. To understand the genomic repertoire and its composition in intracellular bacteria, it is necessary to understand the specialization of these bacteria with respect to their niches. A comparison of the genomic content of several bacteria with different lifestyles revealed the capacity of bacteria to exchange genes to different degrees, depending on the ecosystem. Moreover, genomics has provided important clues to the mechanisms causing the genome-reduction process, the functions that are retained when a species becomes intracellular and the influence that the host can have on the genomic composition of intracellular bacteria. The second part of my thesis deals with the genome sequence of Diplorickettsia massiliensis strain 20B, an obligate intracellular Gram-negative bacterium isolated from Ixodes ricinus ticks from Slovakia. In the third and final part, we explored the genome repertoire of Diplorickettsia massiliensis by comparing it with the genomes of phylogenetically very close bacteria from different niches. This revealed its allopatric lifestyle. In this study, we compared the genomic features of Diplorickettsia massiliensis with twenty-nine sequenced Gammaproteobacteria species (Legionella, Coxiella burnetii, Francisella tularensis and Rickettsiella grylli) using a multi-genus pangenomic approach. This thesis work provides original data and sheds more light on the diversity of intracellular bacteria.

Keywords: Intracellular bacteria, Diplorickettsia massiliensis, genome repertoire, sympatry, allopatry, pangenome, Gammaproteobacteria

Contents

Preamble
Abstract
Résumé
Contents
1 Chapter One: Introduction
2 Chapter Two: Review
2.1 Review: Genome repertoire of intracellular bacteria and symbionts
3 Chapter Three: Genome sequencing of intracellular bacteria
3.1 Article 1: Genome Sequence of Diplorickettsia massiliensis, an Emerging Ixodes ricinus-Associated Human Pathogen
4 Chapter Four: Comparative genomics
4.1 Article 2: The genomic repertoire of Diplorickettsia massiliensis reveals its allopatric lifestyle
5 Chapter Five: Conclusions
5.1 Conclusions and perspectives
5.2 Future perspective
Bibliography
Acknowledgements


Chapter 1

Introduction

This section introduces the reader to studies of intracellular bacteria and of the interactions between these bacteria and their different niches. In the past, microbiologists were mainly restricted to the study of microorganisms that could be isolated and grown on relatively simple media. This made it almost impossible to study species that cannot survive outside their hosts and severely limited our knowledge of the genetics of these organisms. Advances in culture techniques and genome sequencing now allow these organisms to be studied, and the results have revealed their complete genetic code and provided powerful insights into their intimate relationships with their hosts. Three microbial categories have been defined based on their niches: free-living, facultative intracellular and obligate intracellular bacteria. The genomes of intracellular bacteria are extremely varied. Examples of facultative intracellular bacteria, which can multiply inside vacuoles, include Legionella pneumophila spp., Francisella tularensis spp. and Mycobacterium tuberculosis spp., and the obligate intracellular bacteria include Chlamydia spp., whereas Listeria monocytogenes, Shigella flexneri, enteroinvasive Escherichia coli and some Rickettsia spp. are able to enter and replicate in the cytosol of mammalian cells (Zientz, et al., 2004). Intracellular bacteria need factors to recognize, invade and replicate within host cells when their intracellular phase is transient. The intracellular location may facilitate the use of host metabolites, which support bacterial multiplication in a relatively safe host compartment devoid of potent host defense mechanisms. Moreover, the intracellular compartment may allow the diffusion of bacteria within the host and, after escaping from the host cells, the bacteria may be released into the environment or directly transmitted to another host organism (Finlay & Falkow, 1997, Gross, et al., 2003, Zientz, et al., 2004). Genome sequencing, especially Next Generation Sequencing (NGS), has radically changed the face of microbiology and has helped to discern how the diverse group of intracellular bacteria evolved to survive and replicate in host cells. In the first part of my thesis, we reviewed the literature to summarize the knowledge on the ancient existence of intracellular bacteria, their host adaptation and the differences between sympatry and allopatry. Moreover, genomics has provided important clues to the mechanisms driving the genome-reduction process, the functions that are retained when a species becomes intracellular, and the role of the host in molding the genomic composition of intracellular bacteria (Chapter 2).

My thesis work then proceeds from the observation that, despite the recent advent of sequencing techniques, little is still known about the interactions between intracellular bacteria and various niches. In the second part of my thesis, we report our work on the sequencing and completion of the genome of Diplorickettsia massiliensis strain 20B, an obligate intracellular, Gram-negative bacterium isolated from Ixodes ricinus ticks collected in Slovakia. D. massiliensis belongs to the class Gammaproteobacteria, is non-endospore-forming, and is shaped as small rods that are usually grouped in pairs. An initial phylogenetic analysis based on 16S rRNA showed that Diplorickettsia massiliensis clustered with Rickettsiella grylli. Because of its low 16S rDNA similarity (94%) with R. grylli, it was classified as a new genus, Diplorickettsia, in the family Coxiellaceae and the order Legionellales. D. massiliensis strain 20B was identified in three patients with suspected tick-borne infections who exhibited a specific seroconversion. The evidence of infection was further confirmed by PCR assay, thus establishing its role as a human pathogen. We were therefore interested in understanding the genome repertoire of Diplorickettsia massiliensis. We investigated this repertoire in comparison with closely related bacteria according to niche, which revealed its allopatric lifestyle. In this study, we compared the genomic features of Diplorickettsia massiliensis with twenty-nine sequenced Gammaproteobacteria species (Legionella strains, Coxiella burnetii strains, Francisella tularensis strains and Rickettsiella grylli) using a multi-genus pangenomic approach, which sheds light on intracellular bacterial diversity.
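As a rough illustration of what a pangenomic comparison partitions, the sketch below splits gene families into core (present in every genome), accessory (present in several but not all) and unique (present in a single genome) sets. It is a minimal sketch and not the pipeline used in this thesis: it assumes gene families have already been clustered, and the function name `partition_pangenome`, the genome labels and the family identifiers are all hypothetical.

```python
from collections import Counter
from typing import Dict, Mapping, Set


def partition_pangenome(genomes: Mapping[str, Set[str]]) -> Dict[str, Set[str]]:
    """Split gene families into pan, core, accessory and unique sets."""
    if len(genomes) < 2:
        raise ValueError("at least two genomes are needed for a comparison")
    family_sets = list(genomes.values())
    pan = set().union(*family_sets)          # every family seen anywhere
    core = set.intersection(*family_sets)    # families shared by all genomes
    occurrences = Counter(fam for fams in family_sets for fam in fams)
    unique = {fam for fam, n in occurrences.items() if n == 1}
    accessory = pan - core - unique          # shared by some, but not all
    return {"pan": pan, "core": core, "accessory": accessory, "unique": unique}


# Hypothetical toy input: genome labels and family identifiers are invented.
toy = {
    "genome_A": {"fam1", "fam2", "fam3"},
    "genome_B": {"fam1", "fam2", "fam4"},
    "genome_C": {"fam1", "fam3", "fam5"},
}
result = partition_pangenome(toy)
for category in ("core", "accessory", "unique"):
    print(category, sorted(result[category]))
```

In a real analysis the family sets would come from an orthology clustering step, and the choice of clustering thresholds would strongly affect the size of each category.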


Chapter 2

Review: Genome repertoire of intracellular bacteria and symbionts


2.1 Review: Genome repertoire of intracellular bacteria and symbionts

Mano J. Mathew1 and Didier Raoult1*

1 Unité de Recherche sur les Maladies Infectieuses et Tropicales Emergentes: URMITE, Aix Marseille Université, UMR CNRS 7278, IRD 198, INSERM 1095, Faculté de Médecine, 27 Bd Jean Moulin, 13005, Marseille, France.

Submitted to FEMS Microbiology Reviews

*Corresponding author. E-mail: [email protected]

Keywords: Genome repertoire, intracellular, host-microbe, facultative, obligate, genome reduction, virulence, secretion system


Abstract

The recent explosion in knowledge of the diverse group of intracellular bacteria has helped to discern how these microbes evolved to survive and replicate in host cells. This review highlights the genomic repertoire of intracellular bacteria and symbionts by examining data on the ancient existence of intracellular bacteria, their host adaptation and the differences between sympatry and allopatry. It also highlights the important clues that genomics has provided to the mechanisms driving the genome-reduction process, the functions that are retained when a species becomes intracellular, and the role of the host in molding the genomic composition of intracellular bacteria. This wealth of information will contribute to a better understanding of the interactions between intracellular bacteria and various niches.


Contents
Introduction
Intracellular bacteria: an ancient outlook
Sympatric and allopatric lifestyles
Genomic repertoire
– Bias in base compositions
– Metabolic variations
– Ribosomal split operons
– Other observations
Loss of non-virulent genes in intracellular bacteria
Gene duplication facilitating adaptation in intracellular bacteria
Mobilome of intracellular bacteria
– General distribution of the mobilome in intracellular bacteria
– Types of mobile genetic elements
– Transposable elements
– Repeated palindromic elements (RPEs)
– Ankyrin and tetratricopeptide repeat proteins
Secretion system machinery in intracellular bacteria
Concluding remarks
Acknowledgements
References

Introduction

The genome repertoire of intracellular bacteria and symbionts cannot be considered without first grappling with the uncertainties and ambiguities in the meanings of the terms repertoire, intracellular bacteria and symbionts. None of these terms lends itself to a straightforward explanation. This confusion must be addressed before delving into this complex subject. The term genome repertoire connotes the entire genomic composition of an organism. The goal of this review is to explore the microbes that reside inside cells and how they came to acquire this specific lifestyle. Three microbial categories have been defined based on their niches: free-living, facultative intracellular and obligate intracellular bacteria. The genomes of intracellular bacteria are extremely varied. Examples of facultative intracellular bacteria, which can multiply inside vacuoles, include Legionella pneumophila spp., Francisella tularensis spp. and Mycobacterium tuberculosis spp., and the obligate intracellular bacteria include Chlamydia spp., whereas Listeria monocytogenes, Shigella flexneri, enteroinvasive Escherichia coli and some Rickettsia spp. are able to enter and replicate in the cytosol of mammalian cells [1]. Intracellular bacteria need factors to recognize, invade and replicate within host cells when their intracellular phase is transient. The intracellular location may facilitate the use of host metabolites, which support bacterial multiplication in a relatively safe host compartment devoid of potent host defense mechanisms. Moreover, the intracellular compartment may allow the diffusion of bacteria within the host and, after escaping from the host cells, the bacteria may be released into the environment or directly transmitted to another host organism [1-3].

These intracellular bacteria possess mechanisms to invade host cells or to protect themselves inside them. Legionella pneumophila induces its own uptake and blocks lysosomal fusion; otherwise, lysosomes would degrade the bacteria [4]. It also uses a Type IV secretion system known as Dot/Icm to inject into the host cell the effector proteins required for bacterial survival [5]. Salmonella and Mycobacterium spp., meanwhile, are highly resistant to intracellular killing by phagocytic cells [6, 7]. A comprehensive list of intracellular bacteria is shown in Table 1.

Obligate intracellular bacteria cannot multiply outside host cells, as they lack many biosynthetic pathways; hence, they are dependent on host cells. These bacteria, also known as obligate endosymbionts, multiply exclusively inside the cells of many eukaryotic organisms and usually have no extracellular state. Compared with their free-living relatives, obligate intracellular bacteria exhibit a set of features shared by intracellular parasites and endosymbionts. They tend to have small population sizes, and their genomes are usually small and show marked AT nucleotide biases, increased rates of nucleotide substitution, random accumulation of deleterious mutations, accelerated sequence evolution and the loss of genes that are involved in recombination and repair pathways [8].

The word symbiosis, originating from the Greek simbios, or living together, was first introduced by Anton de Bary in 1879 and was defined as the permanent association between two or more organisms of different species, at least during a part of the life cycle (Gil, et al., 2004). Considering the ecological ubiquity of bacteria, it is not surprising to find many species in close relationships with members of several eukaryotic taxa. Depending on the fitness effects on the members of the symbiotic relationship, the relationship can be referred to as parasitism, mutualism or commensalism. Based on the location of the symbiont in relation to the host cells, these relationships may be ectosymbiotic or endosymbiotic. Rickettsia are frequently identified in close relationships with vectors that may assist in the transmission of the organism to mammalian hosts [9]. Between 15 and 20% of the known insects have symbiotic relationships with bacteria, making them the most species-rich group. The nutritional enrichment that bacteria offer to insects could be an important factor in the evolutionary success of this group [10, 11].

The recent explosion in knowledge of bacterial pathogenesis has assisted efforts to discern why certain intracellular bacteria have evolved to survive and replicate in host cells as part of their pathogenic mechanisms. Recent developments in genomics have introduced concepts such as bacterial genome expansion and reduction, which have provided insight into bacterial genome evolution [12]. The comparison of free-living and intracellular bacteria has revealed dramatic differences in genome size and content. In this paper, we review the genomic repertoire of intracellular bacteria and symbionts. We begin by reviewing data on the ancient existence of intracellular bacteria, their host adaptation and the differences between sympatry and allopatry. Furthermore, we elaborate on the genomic repertoire to understand the phenomenon of gene loss in intracellular bacteria.

Intracellular bacteria: an ancient outlook

Ecologists and biologists are fascinated by the enormous diversity of bacteria that complete their life cycles within, or closely associated with, eukaryotic cells. Symbiosis between unicellular and multicellular organisms has contributed considerably to the evolution of life on Earth [13]. These interactions have a broad range of effects on hosts, from invasive pathogenesis to obligate relationships in which the hosts depend on infection for survival or reproduction. Bacterial associations can be difficult to categorize, and many bacteria cannot be unambiguously labeled as mutualists (bacteria that increase the fitness of the host) or as parasites (bacteria that decrease the fitness of the host).

Intracellular bacteria are found in a wide range of niches. Due to their different evolutionary trajectories, these bacteria have different genomic compositions. Unfortunately, intracellular bacteria have no fossil record to help scientists determine when they acquired the ability to survive inside other organisms. Based on an endosymbiotic origin for mitochondria and other eukaryotic organelles [14, 15], we predict that the intracellular lifestyle is ancient and predated the emergence of eukaryotic organisms. Intracellular bacteria exhibit three important properties: a) size differences compared to non-intracellular bacteria, b) a mechanism for insertion into the host and c) survival within the host. These initial interactions could have resulted in the survival of symbiotic microbes. The situation in which the survival of the host occurs at the expense of the microbe is termed predation, the situation in which the host is harmed is called intracellular pathogenesis and the situation in which the microbes are damaged is called incompatibility or antagonism. Each situation is subject to selection that allows the emergence of varied types of bacteria-host relationships. In the case of insects, the arthropod lineage arose 385 million years (MY) ago and swiftly diversified [8]. The early establishment of symbiotic relationships between insects and bacteria approximately 300 MY ago and the nutritional advantage that these bacteria offered to insects could have been key factors in the evolutionary success of this group [8]. The mealybug betaproteobacterial endosymbiosis was the first stable intracellular symbiotic association identified that involves two species of bacteria [16]. Facultative or obligate intracellular bacteria can be identified throughout the tree of life, from eukaryotic microorganisms (protists) to multicellular plants and animals [17]. The Rickettsiales order, which belongs to the Alphaproteobacteria, comprises obligate intracellular bacteria that are closely related to the mitochondrial ancestor, having diverged approximately 850–1500 MY ago [9]. Rickettsiales species have well-known close relationships with varied eukaryotic hosts, as shown by their manipulation of cellular processes such as host reproduction [18, 19]. These relationships have led to a massive integration of bacterial genome fragments into host cells [20, 21]. Studies on Rickettsiales have thus improved the knowledge of intracellular bacteria contemporaneous with the mitochondrial origin, as parts of a Rickettsiales genome were found integrated into the nucleus of one eukaryotic host, and another genome fragment was found to be integrated into the mitochondrial genome of another host [22]. In another striking example of lateral genetic transfer, nearly the entire Wolbachia genome was found to be integrated into the genome of its host [20, 23]. A recent study on mitochondrial protein-based phylogeny suggested that Rickettsiales and Rhizobiales may have diverged 1.5 billion years ago (BYA) [24, 25]. Their fusion likely created the first mitochondrion approximately 1 BYA [24]. Additionally, the origin of mitochondrial genes is not limited to the Rickettsiales, and the transfer of these genes did not happen in a single event but rather through numerous successive events [24, 25]. These studies clearly establish that the intracellular bacterial lifestyle is ancient and constantly co-evolving with the host [26]. Before examining the genomic repertoire and its composition in intracellular bacteria, it is essential to understand specialization in bacteria with respect to their niches.

Sympatric and allopatric lifestyles

A comparison of the genomic contents of bacteria with certain lifestyles revealed the bacterial capacity to exchange genes to different extents, depending on the ecosystem [27]. Allopatric speciation in bacteria is associated with restricted opportunities to exchange genes with other organisms, although gene duplication, mutation and deletion are more frequently observed. A prominent example is a bacterium associated with the human louse Pediculus humanus [9]. Allopatry is generally associated with genome reduction, especially in pathogens that have a small genomic repertoire compared to less specialized bacteria. In sympatry, multiple bacteria infect the same host and thus undergo massive genetic exchange [28]. Some authors [29, 30] have identified which bacteria participate in each intracellular lifestyle. For example, the strictly intracellular bacteria that live in narrow niches are allopatric, and intracellular bacteria that live in amoebas are sympatric, as in the case of Legionella sp., where an amoeba clearly constitutes a place for DNA exchange [29, 30]. Intracellular bacteria that have sympatric relationships within amoebas exhibit larger genomes than their relatives [30]. The bacteria that live in a sympatric manner interact with many other bacteria belonging to divergent phyla, allowing them to share genes at an increased rate. The sympatric lifestyle is associated with larger genomes, larger pan-genomes, a larger mobilome and genetic exchanges with other bacteria. These bacteria often have more genes, more ribosomal operons, better metabolic capacities and significant resistance to antibiotics [31]. Gene recombination is found in sympatric organisms, resulting in genetic diversity [32]. In Rickettsia felis, using a single-gene phylogenetic approach, researchers found that some genes could be linked to those of other bacteria, including Rickettsia bellii, Legionella sp. and Francisella sp. [32]. The different sizes and functions of the genes suggested random horizontal gene transfer in R. felis [32]. Bacteria in sympatric environments have conserved genomes with phenotypic plasticity and exhibit species complexity. Species complexity may have promoted varied genomic repertoires that produced environmentally adaptable alternative phenotypes [33]. However, like several obligate pathogens, many of these obligate intracellular endosymbionts have an extraordinary genome repertoire, an extremely reduced genome size and correspondingly less coding capacity [34]. Hence, it is likely that the mutual relationships of these bacteria with their host cells may have promoted genome reduction. Thus, it is important to understand the dynamics of the processes whereby new genes are acquired and old genes are removed.

Genomic repertoire: an insight

Complete genome sequences are available for many bacteriome-associated symbionts with shared features. The genome size, number of genes and G + C content of intracellular bacteria, which have become reduced during specialization to an intracellular niche, reflect a continual selective pressure for a minimal genome [35]. The reason for this reduction could be that an intracellular niche reduces the possibility for gene acquisition by lateral gene transfer (LGT) [31, 36-38]. Genes may also be lost upon adaptation to the niche [39, 40]. In free-living bacteria, the G + C base composition is close to 50%; in obligate intracellular bacteria, it ranges from 16.5% to 33%. The genome sizes of these bacteria vary depending on the host adaptation stage [41].
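The G + C percentages quoted above are simple base-composition ratios. A minimal sketch of that calculation follows; the function name `gc_content` and the two toy sequences are invented for illustration and are not real genomes or data from this review.

```python
def gc_content(sequence: str) -> float:
    """Return the G + C percentage of a DNA sequence.

    Symbols other than A, C, G and T (for example N) are ignored.
    """
    seq = sequence.upper()
    counts = {base: seq.count(base) for base in "ACGT"}
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no A, C, G or T")
    return 100.0 * (counts["G"] + counts["C"]) / total


# Toy sequences, invented for illustration only.
balanced = "ATGCGCATGCCGTA"
at_rich = "ATTATAAATGTTAAATATTA"
print(f"balanced toy sequence: {gc_content(balanced):.1f}% G + C")
print(f"AT-rich toy sequence:  {gc_content(at_rich):.1f}% G + C")
```

For a whole genome the same ratio would be computed over all replicons, typically read from a FASTA file.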

Bias in base composition

The most extreme bias in base composition is the uninterrupted shift towards an increased A + T content. This content is highest at sites that are neutral or nearly neutral with respect to selection, such as silent positions in codons and intergenic spacers. A high A + T content is favored by mutational bias and is also commonly found in obligate intracellular bacteria such as Rickettsiales and Chlamydiales. The bias has an important effect on the amino acid composition of proteins, but in the Buchnera genome, the silent sites and spacers have less than 10% G + C content, while the overall genome has 25–30% G + C content [42-45]. In general, the mutational bias reflects the loss of DNA repair pathways. Supporting this trend, many repair genes are retained in Baumannia cicadellinicola, which has 33% G + C content, whereas no repair genes are retained in Carsonella ruddii and Sulcia muelleri, which have 16.5% and 22% G + C content, respectively [46].
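The contrast between overall composition and composition at silent sites can be sketched by computing G + C content at third codon positions (often called GC3) separately from the whole coding sequence. This is a simplified illustration under stated assumptions: third positions are only mostly silent, the helper names `gc_percent` and `gc3_percent` are hypothetical, and the toy coding sequence is invented.

```python
def gc_percent(bases: str) -> float:
    """G + C percentage over the A, C, G and T symbols in `bases`."""
    acgt = [b for b in bases.upper() if b in "ACGT"]
    if not acgt:
        raise ValueError("no A, C, G or T symbols found")
    return 100.0 * sum(b in "GC" for b in acgt) / len(acgt)


def gc3_percent(cds: str) -> float:
    """G + C percentage at third codon positions of an in-frame coding sequence."""
    if len(cds) % 3 != 0:
        raise ValueError("coding sequence length must be a multiple of 3")
    return gc_percent(cds[2::3])  # every third base, starting at codon position 3


# Invented toy coding sequence: ATG GCT CGT GGA TAA
toy_cds = "ATGGCTCGTGGATAA"
print(f"overall G + C: {gc_percent(toy_cds):.1f}%")
print(f"GC3:           {gc3_percent(toy_cds):.1f}%")
```

A stricter analysis would restrict the count to fourfold-degenerate sites, where any substitution is synonymous.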

Metabolic variations

Compared to free-living bacteria, host-dependent bacteria exhibit fewer transcriptional regulators, as determined from a statistical comparative analysis of 317 bacterial genomes from bacteria with different lifestyles [47]. Genes involved in translation modification and transcription are often among the lost genes [47]. In bacteriocyte-associated symbionts such as Carsonella ruddii and Sulcia muelleri [44, 48], genes involved in important processes such as translation, replication and transcription are depleted, along with genes required for the production of cell envelope components [36, 49, 50]. The suggestion that host functions can replace those of the original bacterial cell envelope is supported by the enclosure of symbionts in a host-derived membrane within the bacteriocytes (Buchnera aphidicola, Sulcia muelleri and Carsonella ruddii); these symbionts lose a greater proportion of genes involved in the production of the cell envelope than symbionts that are free in the cytosol (Wigglesworthia glossinidia, Candidatus Blochmannia). Bacterial symbionts that live within host mitochondria or host nuclei [51, 52], and mutualistic bacterial symbionts dwelling within other bacterial symbionts in the host cytoplasm, are examples of rare close associations [16, 53]. Put differently, the transition from a free-living to an intracellular lifestyle is facilitated by the loss of large segments of DNA [8, 54]. Rickettsia spp. have lost many genes needed for metabolic pathways, including those for sugar, purine and amino acid metabolism [55]. Similarly, the loss of DNA for host adaptation was observed in Candidatus Blochmannia, which is an obligate endosymbiont of ants [56]. Conversely, gene acquisition can be observed in L. pneumophila, which is closely associated with amoebae [57]. Genome reduction is also associated with increased pathogenicity, as seen, for example, for Rickettsia prowazekii [18, 47, 58, 59].

Ribosomal split operons

In a recent study on intracellular bacteria, several abnormal or split ribosomal operons were identified. This abnormal feature occurred independently in several groups of specialized bacteria [60]. Split ribosomal operons are found in groups including Rickettsiales and Leptospira species, in the group containing Mycoplasma and Buchnera and, recently, in Bartonella birtlesii. In the study on B. birtlesii, the authors found that disrupted genes belonged to the translation COG and the ribosomal operon. The number of activated genes in a restricted environment is much lower than that in a changing environment, as the translation genes are not used extensively. If the bacteria do not use many ribosomal operons in their current environment, they often lose them, and restricting translation is critical for specialization, as speciation is often correlated with ribosomal operon inactivation [47, 60]. In another comparative genomic analysis of free-living and host-dependent bacteria, the host-dependent bacteria exhibited fewer rRNA genes, more split rRNA operons and fewer transcriptional regulators, characteristics that are linked to slow growth rates [47]. The identification of a function-dependent and non-random loss of 100 orthologous genes in the analyzed intracellular bacteria revealed that these bacteria from different phyla underwent convergent evolution by specialization according to their niche [47]. The ribosomal RNA (rRNA) genes are classically organized in operons with the general structure 16S-23S-5S; transfer RNA (tRNA) genes are typically found in the spacer between the 16S and the 23S rRNA genes [47]. Intracellular bacteria have fewer copies of each rRNA gene than free-living bacteria and significantly lower copy numbers of typical rRNA operons. In obligate intracellular bacteria such as Rickettsia sp., split rRNA operons are important evolutionary factors [61].
The co-adaptation of host genes and the modification of ancestral bacterial genes create the base for symbiosis [62]. A minimal genome size is typically observed in sequenced symbiont genomes.
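
The distinction drawn above between a typical 16S-23S-5S operon and a split operon can be illustrated with a minimal sketch. It assumes that rRNA gene coordinates are already available from an annotation. The `RrnaGene` record, the `max_gap` threshold of 2,000 bp and the example coordinates are hypothetical choices made for illustration; they are not taken from the studies cited here, and strand and gene order are ignored for simplicity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RrnaGene:
    kind: str   # "16S", "23S" or "5S"
    start: int  # start coordinate on the chromosome (bp)
    end: int    # end coordinate on the chromosome (bp), start < end


def group_operons(genes, max_gap=2000):
    """Cluster rRNA genes whose neighbours lie within max_gap bp of each other.

    max_gap is an illustrative threshold, not a published value.
    """
    clusters = []
    for gene in sorted(genes, key=lambda g: g.start):
        if clusters and gene.start - clusters[-1][-1].end <= max_gap:
            clusters[-1].append(gene)  # close enough: same cluster
        else:
            clusters.append([gene])    # too far away: start a new cluster
    return clusters


def classify(cluster):
    """Label a cluster 'canonical' if it holds exactly one 16S, 23S and 5S gene."""
    kinds = [g.kind for g in cluster]
    is_complete = len(kinds) == 3 and set(kinds) == {"16S", "23S", "5S"}
    return "canonical" if is_complete else "split"


# Hypothetical genome: one intact operon, plus a 16S gene that sits far
# away from its 23S-5S partners.
genes = [
    RrnaGene("16S", 1_000, 2_500),
    RrnaGene("23S", 2_900, 5_800),
    RrnaGene("5S", 5_900, 6_020),
    RrnaGene("16S", 500_000, 501_500),
    RrnaGene("23S", 900_000, 902_900),
    RrnaGene("5S", 903_000, 903_120),
]

labels = [classify(c) for c in group_operons(genes)]
print(labels)
```

With these made-up coordinates the first cluster is reported as canonical and the two remaining clusters as split, which is the pattern described for host-dependent bacteria.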

Other observations
Adenine-specific DNA methylase is an enzyme that methylates specific DNA targets, namely GANTC for alphaproteobacteria, resulting in a reduction of the thermodynamic stability of the DNA. This alteration changes transcriptional regulation, which is important in host-pathogen interactions; the enzyme is missing in specialized bacteria [60]. Another distinctive attribute of obligate symbionts is the elevated expression of heat shock proteins, which is linked to lower thermal stability [60]. In Buchnera and other obligate intracellular symbionts, the expression of the chaperonin GroEL is elevated [63, 64]. Microarray and quantitative RT-PCR studies based on available genome sequences show that other heat shock proteins are also expressed at unusually high levels in these bacteria, even in the absence of stress [65, 66]. This is likely a compensatory adaptation that balances the genome-wide effects of mutations that lower protein stability [43, 45, 67].

Loss of non-virulent genes in intracellular bacteria
After understanding the various elements involved in gene loss, it is important to understand how gene loss may have occurred in intracellular bacteria. Two crucial mechanisms of evolution, Lamarckian and Darwinian, have been commonly studied [68]. The central Lamarckian concept is that phenotypic changes result from adaptation to a niche and can be transmitted vertically [69]. In contrast, in the present view of evolutionary biology and in agreement with post-Darwinian experiments, genetic modifications produce phenotypic changes and precede the selection of the fittest individuals in a given niche. In this situation, genotypic changes precede phenotypic changes. Lamarckian evolution may have been involved in bacterial speciation events associated with a reduction in the genome size [47], a finding that contradicts the dominant model in which speciation and fitness gains are linked to an increase in the gene repertoire. Thus, the main course of speciation (through adaptation to a given environment) is usually through allopatry [70] and is related to genome size reduction through the loss of useless genes, according to the Lamarckian model described by Moran, "use it or lose it" [40]. In several intracellular pathogens, namely Shigella, Salmonella and Francisella tularensis, the bacteria became pathogenic when certain genes were inactivated or deleted. These genes are called antivirulence genes [71]. Gene loss is seminal to specialization. For example, 100 orthologous genes were lost in all specialized bacteria, as determined by a comparative analysis of 317 bacterial genomes from different niches [47]. The most notable genes were associated with ribosomal operons, translation regulation and metabolism [47]. In the study on B. birtlesii, the identification of a deletion in one of the two rRNA operons and of disruptions in genes that are associated with translation showed the importance of translation for specialization in a specific niche [60]. Other interesting features of intracellular bacteria include gene duplication, which facilitates adaptation to different environments; the mobilome, which transports virulence genes; repeat elements, which cause instability and drive evolution; and a secretion system, which assists in bacterial colonization, invasion and survival within the niche.

Gene duplication facilitating adaptation in intracellular bacteria
Gene duplication facilitates the adaptation of bacteria to changing environments and new niches [72]. The high number of duplicated genes in small intracellular bacterial genomes, including those of Rickettsia species, constitutes an intriguing phenomenon. After gene duplication, the copies undergo one of three possible processes: they may retain the same function and produce an increased amount of the gene product; they may accumulate deleterious mutations and become non-functional; or, under positive selection, they may acquire divergent mutations, eventually evolve new functions and confer a selective advantage in a new niche [73-75]. For example, the Rickettsia prowazekii and Rickettsia conorii genomes both contain two copies of the virB4 gene that are distantly related to each other and have evolved under different functional constraints [18]. These copies show differences in non-synonymous substitution frequencies, indicating different functions and counter-selective constraints within the same genome [76]. In sequenced Rickettsia spp., 4–14 copies of spoT paralogs were found. In Escherichia coli, SpoT, like the product of the relA gene, controls the concentration of the alarmone [(p)ppGpp, guanosine tetra- and pentaphosphates] in response to starvation. The alarmone acts as an effector of transcription and changes cellular metabolism, and (p)ppGpp-mediated regulation may be involved in pathogenesis and bacterial symbiosis [77]. All 14 spoT genes were transcribed in Rickettsia felis [78] whereas, interestingly, the five spoT genes present in R. conorii were differentially regulated depending on the niche. Gene families such as TLc, ProP, AmpG and Sca have been identified in Rickettsia spp. Multiple copies of TLc, which exchanges ADP for host cytoplasmic ATP, may be important for efficient host cell adaptation [78]. The multiple copies of the proline/betaine transporter ProP seem to play an important role in the adaptation of Rickettsia spp. to osmotic stress and to host temperature conditions. AmpG may confer natural resistance to β-lactam antibiotics, and Sca proteins function in host-parasite interactions and adaptive responses to host defense systems [59]. A genome analysis of Rickettsia spp. disclosed 17 members of the Sca family that showed diverse patterns of expression across various species and whose N-terminal domains were highly variable, which may have facilitated immune evasion and persistent growth [78, 79].

Mobilome of intracellular bacteria
In recent years, much data on the distribution of mobile genetic elements in bacterial genomes has become available [38, 80]. Genomic data so far indicate that most bacterial genomes contain elements of viral origin, and in some cases these elements make up as much as 20% of the host genome [81]. Mobile DNA elements such as prophages contribute more than 50% of the strain-specific DNA in many important pathogens [82-84] and are common transporters of virulence genes in bacteria [85-87]. Together they constitute the mobilome, which includes transposable elements, plasmids, bacteriophages and associated genes for which horizontal movement is critical [88, 89]. For this reason, understanding the mobilome of intracellular bacterial genomes is necessary.

General distribution of mobilome in intracellular bacteria
Few mobile genetic elements are observed in free-living organisms with larger genome sizes of 4–10 Mb. Facultative intracellular bacteria are not restricted by host replication and are capable of living and reproducing either inside or outside of host cells, as is the case for some pathogenic bacteria. Their genome sizes of 2–7 Mb are similar to those of some free-living organisms, and they have intermediate population sizes [90]. The numbers of mobile genetic elements found in obligate intracellular and facultative intracellular bacteria fall within similar ranges, but facultative intracellular species contain four-fold more mobile DNA elements than obligate intracellular bacteria. This observation is consistent with predictions that these elements are similar to those of free-living obligate species [90]. Wolbachia pipientis is an exception; its mobile genetic elements comprise less than 2% of its genome. This estimate is similar to the lower end of the range of facultative intracellular bacterial species [91]. Reductive evolution is supported by the small genome size and deletion biases [92]. The Rickettsiales order shows reductive evolution and also contains various families of mobile elements, such as plasmids, transposases, and phage-related genes [32, 61].

Types of mobile genetic elements
There are three main classes of mobile genetic elements that occur in prokaryotes:
a) Plasmids are small pieces of extrachromosomal DNA that are either linear or circular and mostly replicate independently in the host. These elements are subject to evolution. Lateral transfer from a donor to a recipient bacterial cell by direct contact between the cells occurs via conjugative plasmids.
b) Phage elements, as the name suggests, are derived from phages, which are viruses of bacteria that use the host machinery to replicate. The DNA of the phage enters the host cell and integrates into the bacterial genome as a prophage. These integrated prophage DNA molecules are passively inherited until DNA excision and phage-induced lysis of the bacterial cell take place [93].
c) Transposable elements are DNA segments, typically flanked by short inverted repeats, that encode proteins that help move genes and, in a few cases, are embedded in the prophage regions [94].
Genome analysis of Rickettsiae revealed a large fraction of mobile DNA that helps the movement of DNA within and between genomes [18]. Plasmids are considered conjugative when they can spread autonomously from cell to cell by conjugation. Recent genomic data and phylogenetic analyses have established the presence of conjugative plasmids and suggested the existence of LGT events in the Rickettsia genus [95].

Transposable elements In , transposable elements constitute the largest portion of mobile DNA. A similar amplification of transposable elements was noted in other intracellular bacteria such as Wolbachia pipientis wMel [seven types of IS elements (51 copies in total) and four types of GII introns (17 copies)] [91], Parachlamydia sp. UWE25 [82 IS transposases (TPases)] [96], R. felis (82 TPase) [79], and R. bellii (39 TPases) [97]. In O. tsutsugamushi, the transposable element copy number is 10 times higher than that of obligate intracellular bacteria. contains the highest number of insertion sequence (IS) elements among the prokaryotes (701 copies in a 4469 kb chromosome and a 183 kb plasmid)

[98]. The number of prophage genes per genome is intermediate to those of plasmids and transposable elements, while the proportion of plasmid genes is notably small [90]. These intracellular bacterial genomes are dominated by transposable elements, which can integrate into a genome that already has a copy of the same transposable element and generally do not require a specific site for insertion [90]. In contrast, phages are site-specific and confer immunity to multiple infections. They also serve as vectors that carry other mobile elements, such as transposable elements, into a host genome [90]. There is a striking difference between the quantity of transposable elements and prophage-related genes found in intracellular prokaryotes, as prophage genomes comprise tens of genes, whereas a transposable element carries a single gene (encoding a transposase or reverse transcriptase/maturase) [99].
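
To make the insertion sequence count quoted above easier to compare across genomes of different sizes, the figure can be converted into a density. The following is a small sketch of that arithmetic, using only the numbers given in the text (701 IS copies, a 4469 kb chromosome and a 183 kb plasmid). The function name `copies_per_megabase` is a hypothetical helper introduced for illustration.

```python
from typing import Sequence


def copies_per_megabase(copies: int, replicon_sizes_kb: Sequence[float]) -> float:
    """Return the number of element copies per Mb, summed over all replicons."""
    total_kb = sum(replicon_sizes_kb)
    if total_kb <= 0:
        raise ValueError("total replicon size must be positive")
    return copies / (total_kb / 1000.0)  # kb -> Mb


# Values quoted in the text: 701 IS copies in a 4469 kb chromosome
# plus a 183 kb plasmid.
density = copies_per_megabase(701, [4469, 183])
kb_per_copy = (4469 + 183) / 701  # average spacing between IS copies

print(f"{density:.1f} IS copies per Mb")
print(f"one IS copy every {kb_per_copy:.2f} kb on average")
```

Under these figures the genome carries roughly 151 IS copies per Mb, or about one copy every 6.6 kb. The same helper could be applied to the other copy numbers listed above once the corresponding genome sizes are known.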

Repeated palindromic elements (RPEs)
Repeated elements are usually confined to the intergenic regions of bacterial genomes [100]. For some of these elements, the variable number of tandem repeats gives rise to inter-individual length variability, which has been used for genotyping [100, 101]. RPEs are well studied in Rickettsia spp. They are approximately 100–150 bp long and invade both coding and noncoding regions of the genome [102-104]. Because they can insert within an existing protein-coding frame, these RPEs often generate new reading frames within a preexisting gene, creating an additional peptide segment of 30–50 amino acids in the final gene product. Repetitive DNA might be inserted with the help of plasmids. Repeats are important, as they have roles in genomic instability and evolution. Bacterial chromosomes with an elevated repeat density also show significant rates of rearrangement, leading to an accelerated loss of gene order [105]. Transposons and other extragenic interspersed repeats may function in gene rearrangement and duplication [106, 107].

Ankyrin and tetratricopeptide repeat proteins
Ankyrin (Ank) and tetratricopeptide (TPR) repeat proteins have been found in several intracellular bacteria and have roles in host-pathogen interactions. Nearly 4% of the Rickettsia bellii and R. felis genomes consist of genes encoding Ank and TPR proteins [108]. These proteins participate in various functions, including chaperone activity, cell cycle regulation, transcription, gene regulation, signal transduction and protein transport [109-113]. TPR proteins establish infection and manipulate host cell trafficking events in L. pneumophila [57, 114, 115], whereas Ank proteins found in Anaplasma spp., Wolbachia spp. and Ehrlichia spp. are translocated into the host cell cytoplasm and nucleus, where they play dual roles: interfering with host cell signaling by interacting with the host cytoskeleton, and altering gene transcription by binding to host chromatin [115]. The deletion or mutation of genes encoding Ank proteins reduced the virulence of Rickettsia peacockii and R. rickettsii strain Iowa compared to the R. rickettsii strain Sheila Smith [114].

Secretion systems machinery in intracellular bacteria
The interactions between intracellular bacteria and the host cells are enabled by Type IV secretion systems (T4SSs). These systems are required for bacterial colonization, invasion and persistence within the niche and consist of supra-molecular transporters ancestrally related to bacterial conjugation systems. They are protein complexes embedded in the bacterial cell envelope, and one type has been well studied in Rickettsia [79, 97], Bartonella [116], Wolbachia [117], L. pneumophila [5], N. sennetsu [118], N. risticii [118] and O. tsutsugamushi [119]. The T4SSs are not only able to transport diverse macromolecular substrates, proteins and virulence factors but are also able to transfer DNA through bacterial conjugation [30, 120-123]. Genes that encode T4SS (VirB/VirD4 and Trw) components have been found in several species of Bartonella [116]. In Bartonella rattaustraliani, pNH4 encodes a T4SS containing a complete set of proteins responsible for conjugal transfer, i.e., TraA, TraC, TraD and TraG/VirD4 [116]. These systems are described as essential pathogenicity factors in several mammalian pathogens, including Bartonella tribocorum [116]. The main role of T4SSs is to translocate virulence factors to hosts and to promote DNA transfer [121]. The protein encoded by traA initiates DNA transfer in bacterial systems by relaxing DNA at a site- and strand-specific nick [124], while TraC is necessary for the assembly of F pilin into the mature F pilus structure [125]. The coupling protein TraD is essential for transferring DNA by connecting the DNA processing machinery to the Mpf transfer apparatus [126], and TraG is critical for the translocation of substrates through the inner cell membrane [127]. T4SSs in the Bartonella genus are typically located on chromosomes, and only Bartonella grahamii has a T4SS on its plasmid pBGR3 [128].

In L. pneumophila, the Dot/Icm T4SS facilitates the inhibition of phagosome-lysosome fusion and the recruitment to the rough endoplasmic reticulum to support replication in the host cell. The components of the dot/icm loci are classified as T4SSs on the basis of gene homology. In Legionella, the T4SS is encoded by 26 dot/icm genes arranged in two distinct regions of the chromosome, each approximately 20 kb in length. Region I contains dotDCB and dotA-icmVWX [129]. Region II contains 18 genes, most of which are dot and icm genes [130]. The dot/icm loci of the five L. pneumophila strains discussed above exhibit very high nucleotide conservation, ranging from 98 to 100% among most orthologs. The exceptions are dotA and icmX; additionally, the icmC gene of the Corby strain is shorter than that of the Paris strain and more divergent from it (84% nucleotide identity). Sequence comparisons of the dot/icm genes to other known open reading frames revealed that at least 18 of the dot/icm genes show similarity to components of the bacterial conjugative DNA transfer systems, particularly the IncI plasmids ColIB-P9 from Shigella flexneri and R64 [130]. The bacterial genomic information suggests that T4SSs are not limited to Legionella and related bacteria and IncI plasmids [131]. Interestingly, nearly all the T4SSs found in sequence analyses are encoded on plasmids [132]. Notable exceptions include the Legionella, Coxiella and Rickettsiella Dot/Icm systems. It is likely that a common ancestor of these closely related bacteria acquired a chromosomally encoded T4SS that played a critical role in its survival. The chromosomal acquisition of the T4SS might be related to the adaptation of the ancestor bacterium to an intracellular lifestyle. The genes encoding T4SSs tend to accumulate in several conserved gene clusters; it appears that there is little pressure to keep them at a single locus. The conserved gene clusters include (a) dotD-dotC-dotB (traH-traI-traJ in I-type conjugation systems), (b) dotM/icmP-dotL/icmO (trbA-trbC), and (c) dotI/icmL-dotH/icmK-dotG/icmE (traM-traN-traO). Together with the other genes found in all T4SSs, including dotA (traY) and dotO/icmB (traU), these conserved genes are expected to encode core components that play fundamental roles in transport [131]. The other genes of the dot/icm system include dotH, dotI, and dotO, which are essential for intracellular growth and evasion of the endocytic pathway, and icmGCDJBF and icmTSRQPO, which are involved in macrophage cell death [133]. The type IV secretion system in intracellular bacteria is critical for survival in this intracellular niche, possibly because it allows future specialization as a mammalian pathogen [116].

Concluding remarks
The genomic era has paved the way to major findings regarding intracellular bacteria. Symbiosis between unicellular and multicellular organisms has contributed considerably to the evolution of life. Intracellular bacteria are found in a wide range of niches and have followed various evolutionary trajectories, resulting in different genomic compositions. Based on an endosymbiotic origin for mitochondria and other eukaryotic organelles, we believe that the intracellular lifestyle is ancient and constantly co-evolving with the host. The comparison of bacterial genomic content and lifestyles has revealed that the capacity to exchange genes depends on the bacterial niche. Allopatric speciation in bacteria is linked to the restricted opportunity to exchange genes with other organisms, whereas gene duplications, mutations and deletions are more often observed. The sympatric lifestyle is linked with larger genomes, larger pan-genomes, a larger mobilome and genetic exchanges with other bacteria. It is likely that the mutual relationships between these bacteria and their host cells have promoted a noticeable genome reduction. One of the reasons for genome reduction could be that the intracellular niche reduces the opportunity for gene acquisition by lateral gene transfer, and the other is that genes are lost upon adaptation to the niche. A number of abnormal or split ribosomal operons have been identified, and it appears that this abnormal event occurred independently in several groups of specialized bacteria. If the bacteria do not use many ribosomal operons, they are likely to lose them. Restricting translation is critical for specialization, and speciation is often correlated with ribosomal operon inactivation. Comparative genomic analyses of free-living and host-dependent bacteria found that host-dependent bacteria exhibited fewer rRNA genes, more split rRNA operons and fewer transcriptional regulators, characteristics that are linked to slow growth rates. Lamarckian evolution may have played a role in bacterial speciation events associated with a reduction in the genome size, an observation that contradicts the dominant model, which assumes that speciation and fitness gain are linked with an increase in the gene repertoire. Gene duplication facilitates the adaptation of bacteria to changing environments and the use of new niches. Gene copies often show differences in non-synonymous substitution frequencies, indicating different functions and counter-selective constraints within the same genome. The numbers of mobile genetic elements found in obligate intracellular bacteria and facultative intracellular species are within a similar range, but facultative intracellular species contain four-fold more mobile DNA elements than obligate intracellular bacteria. This observation is consistent with predictions that these element compositions are similar to those of free-living obligate species. Repeated palindromic elements have important roles in genomic instability and evolution. Intracellular bacteria possess mechanisms to protect or to invade host cells. The interactions between intracellular bacteria and host cells are enabled by Type IV secretion systems (T4SSs). These systems are required for bacterial colonization, invasion and persistence within the niche and are supra-molecular transporters ancestrally related to bacterial conjugation systems. The main role of T4SSs is to translocate virulence factors to hosts and to promote DNA transfer. The T4SS facilitates the inhibition of phagosome-lysosome fusion and the transport to the rough endoplasmic reticulum to support replication in the host cell. Type IV secretion systems in intracellular bacteria are critical for bacterial survival in the intracellular niche, possibly allowing for future specialization as a mammalian pathogen. This system is common in intracellular bacteria and appears to have been acquired from different origins, demonstrating that genomes have converged to adapt to a common lifestyle. The sequencing of additional intracellular bacterial genomes will provide a more precise picture of the genetic properties associated with the intracellular lifestyle. This effort will also contribute to a better understanding of the interactions between intracellular bacteria and different niches and the complex mechanisms implicated in pathogenicity.

Acknowledgements
We would like to thank Roshan Padmanabhan for his support, suggestions and corrections, and Ripsy Merrin Chacko for helpful remarks.

References:

1. Zientz, E., T. Dandekar, and R. Gross, Metabolic interdependence of obligate intracellular bacteria and their hosts. Microbiology and molecular biology reviews : MMBR, 2004. 68(4): p. 745-70.
2. Gross, R., J. Hacker, and W. Goebel, The Leopoldina international symposium on parasitism, commensalism and symbiosis--common themes, different outcome. Molecular microbiology, 2003. 47(6): p. 1749-58.
3. Finlay, B.B. and S. Falkow, Common themes in microbial pathogenicity revisited. Microbiology and molecular biology reviews : MMBR, 1997. 61(2): p. 136-69.
4. Fernandez-Moreira, E., J.H. Helbig, and M.S. Swanson, Membrane vesicles shed by Legionella pneumophila inhibit fusion of phagosomes with lysosomes. Infection and immunity, 2006. 74(6): p. 3285-95.
5. D'Auria, G., et al., Legionella pneumophila pangenome reveals strain-specific virulence factors. BMC genomics, 2010. 11: p. 181.
6. Pilsczek, F.H., A. Nicholson-Weller, and I. Ghiran, Phagocytosis of Salmonella montevideo by human neutrophils: immune adherence increases phagocytosis, whereas the bacterial surface determines the route of intracellular processing. The Journal of infectious diseases, 2005. 192(2): p. 200-9.
7. Friedland, J.S., R.J. Shattock, and G.E. Griffin, Phagocytosis of Mycobacterium tuberculosis or particulate stimuli by human monocytic cells induces equivalent monocyte chemotactic protein-1 gene expression. Cytokine, 1993. 5(2): p. 150-6.
8. Gil, R., A. Latorre, and A. Moya, Bacterial endosymbionts of insects: insights from comparative genomics. Environmental microbiology, 2004. 6(11): p. 1109-22.
9. Renvoise, A., et al., Intracellular Rickettsiales: Insights into manipulators of eukaryotic cells. Trends in molecular medicine, 2011. 17(10): p. 573-83.
10. Douglas, A.E., Mycetocyte symbiosis in insects. Biological reviews of the Cambridge Philosophical Society, 1989. 64(4): p. 409-34.
11. Moran, N.A. and P. Baumann, Bacterial endosymbionts in animals. Current opinion in microbiology, 2000. 3(3): p. 270-5.
12. Stepkowski, T. and A.B. Legocki, Reduction of bacterial genome size and expansion resulting from obligate intracellular lifestyle and adaptation to soil habitat. Acta biochimica Polonica, 2001. 48(2): p. 367-81.
13. Margulis, L. and R. Fester, Symbiosis as a Source of Evolutionary Innovation: Speciation and Morphogenesis. 1991: The MIT Press.
14. Margulis, L., Symbiosis and evolution. Scientific American, 1971. 225(2): p. 48-57.

15. Margulis, L., The origin of plant and animal cells. American scientist, 1971. 59(2): p. 230-5.
16. von Dohlen, C.D., et al., Mealybug beta-proteobacterial endosymbionts contain gamma-proteobacterial symbionts. Nature, 2001. 412(6845): p. 433-6.
17. Corsaro, D., et al., Intracellular life. Critical reviews in microbiology, 1999. 25(1): p. 39-79.
18. Merhej, V. and D. Raoult, Rickettsial evolution in the light of comparative genomics. Biological reviews of the Cambridge Philosophical Society, 2011. 86(2): p. 379-405.
19. Werren, J.H., L. Baldo, and M.E. Clark, Wolbachia: master manipulators of invertebrate biology. Nature reviews. Microbiology, 2008. 6(10): p. 741-51.
20. McNulty, S.N., et al., Endosymbiont DNA in endobacteria-free filarial nematodes indicates ancient horizontal genetic transfer. PloS one, 2010. 5(6): p. e11029.
21. Klasson, L., et al., Horizontal gene transfer between Wolbachia and the mosquito Aedes aegypti. BMC genomics, 2009. 10: p. 33.
22. Koonin, E.V., The origin and early evolution of eukaryotes in the light of phylogenomics. Genome biology, 2010. 11(5): p. 209.
23. Dunning Hotopp, J.C., et al., Widespread lateral gene transfer from intracellular bacteria to multicellular eukaryotes. Science, 2007. 317(5845): p. 1753-6.
24. Georgiades, K. and D. Raoult, The rhizome of Reclinomonas americana, Homo sapiens, Pediculus humanus and Saccharomyces cerevisiae mitochondria. Biology direct, 2011. 6: p. 55.
25. Georgiades, K., et al., Phylogenomic analysis of Odyssella thessalonicensis fortifies the common origin of Rickettsiales, Pelagibacter ubique and Reclimonas americana mitochondrion. PloS one, 2011. 6(9): p. e24857.
26. Casadevall, A., Evolution of intracellular pathogens. Annual review of microbiology, 2008. 62: p. 19-33.
27. Whitman, W.B., The modern concept of the procaryote. J Bacteriol, 2009. 191(7): p. 2000-5; discussion 2006-7.
28. Georgiades, K., et al., Gene gain and loss events in Rickettsia and Orientia species. Biology direct, 2011. 6: p. 6.
29. Gimenez, G., et al., Insight into cross-talk between intra-amoebal pathogens. BMC genomics, 2011. 12: p. 542.
30. Moliner, C., P.E. Fournier, and D. Raoult, Genome analysis of microorganisms living in amoebae reveals a melting pot of evolution. FEMS microbiology reviews, 2010. 34(3): p. 281-94.
31. Audic, S., et al., Genome analysis of Minibacterium massiliensis highlights the convergent evolution of water-living bacteria. PLoS Genet, 2007. 3(8): p. e138.

32. Merhej, V., et al., The rhizome of life: the sympatric Rickettsia felis paradigm demonstrates the random transfer of DNA sequences. Molecular biology and evolution, 2011. 28(11): p. 3213-23.
33. Marco, D., Metagenomics and the niche concept. Theory in biosciences = Theorie in den Biowissenschaften, 2008. 127(3): p. 241-7.
34. Wernegreen, J.J., Genome evolution in bacterial endosymbionts of insects. Nature reviews. Genetics, 2002. 3(11): p. 850-61.
35. Mira, A., H. Ochman, and N.A. Moran, Deletional bias and the evolution of bacterial genomes. Trends in genetics : TIG, 2001. 17(10): p. 589-96.
36. Tamas, I., et al., 50 million years of genomic stasis in endosymbiotic bacteria. Science, 2002. 296(5577): p. 2376-9.
37. Wernegreen, J.J., For better or worse: genomic consequences of intracellular mutualism and parasitism. Current opinion in genetics & development, 2005. 15(6): p. 572-83.
38. Moran, N.A. and G.R. Plague, Genomic changes following host restriction in bacteria. Current opinion in genetics & development, 2004. 14(6): p. 627-33.
39. Darby, A.C., et al., Intracellular pathogens go extreme: genome evolution in the Rickettsiales. Trends in genetics : TIG, 2007. 23(10): p. 511-20.
40. Moran, N.A., Microbial minimalism: genome reduction in bacterial pathogens. Cell, 2002. 108(5): p. 583-6.
41. Toft, C. and S.G. Andersson, Evolutionary microbial genomics: insights into bacterial host adaptation. Nature reviews. Genetics, 2010. 11(7): p. 465-75.
42. Degnan, P.H., A.B. Lazarus, and J.J. Wernegreen, Genome sequence of Blochmannia pennsylvanicus indicates parallel evolutionary trends among bacterial mutualists of insects. Genome research, 2005. 15(8): p. 1023-33.
43. Moran, N.A., Accelerated evolution and Muller's ratchet in endosymbiotic bacteria. Proceedings of the National Academy of Sciences of the United States of America, 1996. 93(7): p. 2873-8.
44. Nakabachi, A., et al., The 160-kilobase genome of the bacterial endosymbiont Carsonella. Science, 2006. 314(5797): p. 267.
45. van Ham, R.C., et al., Reductive genome evolution in Buchnera aphidicola. Proceedings of the National Academy of Sciences of the United States of America, 2003. 100(2): p. 581-6.
46. Moran, N.A., J.P. McCutcheon, and A. Nakabachi, Genomics and Evolution of Heritable Bacterial Symbionts. Annual Review of Genetics, 2008. 42(1): p. 165-190.
47. Merhej, V., et al., Massive comparative genomic analysis reveals convergent evolution of specialized bacteria. Biology direct, 2009. 4: p. 13.
48. McCutcheon, J.P. and N.A. Moran, Parallel genomic evolution and metabolic interdependence in an ancient symbiosis. Proceedings of the National Academy of Sciences of the United States of America, 2007. 104(49): p. 19392-7.

49. Perez-Brocal, V., et al., A small microbial genome: the end of a long symbiotic relationship? Science, 2006. 314(5797): p. 312-3.
50. Shigenobu, S., et al., Genome sequence of the endocellular bacterial symbiont of aphids Buchnera sp. APS. Nature, 2000. 407(6800): p. 81-6.
51. Arneodo, J.D., et al., Ultrastructural detection of an unusual intranuclear bacterium in Pentastiridius leporinus (Hemiptera: Cixiidae). Journal of invertebrate pathology, 2008. 97(3): p. 310-3.
52. Sassera, D., et al., 'Candidatus Midichloria mitochondrii', an endosymbiont of the tick Ixodes ricinus with a unique intramitochondrial lifestyle. International journal of systematic and evolutionary microbiology, 2006. 56(Pt 11): p. 2535-40.
53. Moran, N.A., et al., The players in a mutualistic symbiosis: insects, bacteria, viruses, and virulence genes. Proceedings of the National Academy of Sciences of the United States of America, 2005. 102(47): p. 16919-26.
54. Fraser-Liggett, C.M., Insights on biology and evolution from microbial genome sequencing. Genome research, 2005. 15(12): p. 1603-10.
55. Renesto, P., et al., Some lessons from Rickettsia genomics. FEMS microbiology reviews, 2005. 29(1): p. 99-117.
56. Wernegreen, J.J., A.B. Lazarus, and P.H. Degnan, Small genome of Candidatus Blochmannia, the bacterial endosymbiont of Camponotus, implies irreversible specialization to an intracellular lifestyle. Microbiology, 2002. 148(Pt 8): p. 2551-6.
57. Cazalet, C., et al., Evidence in the Legionella pneumophila genome for exploitation of host cell functions and high genome plasticity. Nature genetics, 2004. 36(11): p. 1165-73.
58. Fournier, P.E., et al., Analysis of the Rickettsia africae genome reveals that virulence acquisition in Rickettsia species may be explained by genome reduction. BMC genomics, 2009. 10: p. 166.
59. Ogata, H., et al., Mechanisms of evolution in Rickettsia conorii and R. prowazekii. Science, 2001. 293(5537): p. 2093-8.
60. Rolain, J.M., et al., Partial disruption of translational and posttranslational machinery reshapes growth rates of Bartonella birtlesii. mBio, 2013. 4(2): p. e00115-13.
61. Blanc, G., et al., Reductive genome evolution from the mother of Rickettsia. PLoS genetics, 2007. 3(1): p. e14.
62. Moran, N.A., J.P. McCutcheon, and A. Nakabachi, Genomics and evolution of heritable bacterial symbionts. Annual Review of Genetics, 2008. 42: p. 165-90.
63. Fares, M.A., A. Moya, and E. Barrio, GroEL and the maintenance of bacterial endosymbiosis. Trends in genetics : TIG, 2004. 20(9): p. 413-6.
64. McCutcheon, J.P. and N.A. Moran, Extreme genome reduction in symbiotic bacteria. Nature reviews. Microbiology, 2012. 10(1): p. 13-26.

51 65. Moran, N.A., H.E. Dunbar, and J.L. Wilcox, Regulation of transcription in a reduced bacterial genome: nutrient-provisioning genes of the obligate symbiont Buchnera aphidicola. J Bacteriol, 2005. 187(12): p. 4229-37. 66. Wilcox, J.L., et al., Consequences of reductive evolution for gene expression in an obligate endosymbiont. Molecular microbiology, 2003. 48(6): p. 1491-500. 67. Fares, M.A., et al., Endosymbiotic bacteria: groEL buffers against deleterious mutations. Nature, 2002. 417(6887): p. 398. 68. Koonin, E.V., Darwinian evolution in the light of genomics. Nucleic acids research, 2009. 37(4): p. 1011-34. 69. Colson, P. and D. Raoult, Lamarckian evolution of the giant Mimivirus in allopatric laboratory culture on amoebae. Frontiers in cellular and infection microbiology, 2012. 2: p. 91. 70. Georgiades, K. and D. Raoult, Defining pathogenic bacterial species in the genomic era. Frontiers in microbiology, 2010. 1: p. 151. 71. Bliven, K.A. and A.T. Maurelli, Antivirulence genes: insights into pathogen evolution through gene loss. Infect Immun, 2012. 80(12): p. 4061-70. 72. Hooper, S.D. and O.G. Berg, On the nature of gene innovation: duplication patterns in microbial genomes. Molecular biology and evolution, 2003. 20(6): p. 945-54. 73. Schmitz-Esser, S., et al., ATP/ADP translocases: a common feature of obligate intracellular amoebal symbionts related to Chlamydiae and Rickettsiae. J Bacteriol, 2004. 186(3): p. 683-91. 74. Aravind, L., et al., Evidence for massive gene exchange between archaeal and bacterial hyperthermophiles. Trends in genetics : TIG, 1998. 14(11): p. 442-4. 75. Walsh, J.B., How often do duplicated genes evolve new functions? Genetics, 1995. 139(1): p. 421-8. 76. Frank, A.C., H. Amiri, and S.G. Andersson, Genome deterioration: loss of repeated sequences and accumulation of junk DNA. Genetica, 2002. 115(1): p. 1-12. 77. Braeken, L., B. Van der Bruggen, and C. 
Vandecasteele, Flux decline in nanofiltration due to adsorption of dissolved organic compounds: model prediction of time dependency. The journal of physical chemistry. B, 2006. 110(6): p. 2957-62. 78. Blanc, G., et al., Molecular evolution of rickettsia surface antigens: evidence of positive selection. Molecular biology and evolution, 2005. 22(10): p. 2073-83. 79. Ogata, H., et al., The genome sequence of Rickettsia felis identifies the first putative conjugative plasmid in an obligate intracellular parasite. PLoS biology, 2005. 3(8): p. e248. 80. Dai, L., et al., Database for mobile group II introns. Nucleic acids research, 2003. 31(1): p. 424-6. 81. Casjens, S., Prophages and bacterial genomics: what have we learned so far? Molecular microbiology, 2003. 49(2): p. 277-300.

52 82. Van Sluys, M.A., et al., Comparative analyses of the complete genome sequences of Pierce's disease and citrus variegated chlorosis strains of Xylella fastidiosa. J Bacteriol, 2003. 185(3): p. 1018-26. 83. Banks, D.J., S.B. Beres, and J.M. Musser, The fundamental contribution of phages to GAS evolution, genome diversification and strain emergence. Trends in microbiology, 2002. 10(11): p. 515-21. 84. Ohnishi, M., K. Kurokawa, and T. Hayashi, Diversification of Escherichia coli genomes: are bacteriophages the major contributors? Trends in microbiology, 2001. 9(10): p. 481-5. 85. Boyd, E.F. and H. Brussow, Common themes among bacteriophage-encoded virulence factors and diversity among the bacteriophages involved. Trends in microbiology, 2002. 10(11): p. 521-9. 86. Boyd, E.F., B.M. Davis, and B. Hochhut, Bacteriophage-bacteriophage interactions in the evolution of pathogenic bacteria. Trends in microbiology, 2001. 9(3): p. 137-44. 87. Miao, E.A. and S.I. Miller, Bacteriophages in the evolution of pathogen-host interactions. Proceedings of the National Academy of Sciences of the United States of America, 1999. 96(17): p. 9452-4. 88. Koonin, E.V. and Y.I. Wolf, Genomics of bacteria and archaea: the emerging dynamic view of the prokaryotic world. Nucleic acids research, 2008. 36(21): p. 6688-719. 89. Frost, L.S., et al., Mobile genetic elements: the agents of open source evolution. Nature reviews. Microbiology, 2005. 3(9): p. 722-32. 90. Bordenstein, S.R. and W.S. Reznikoff, Mobile DNA in obligate intracellular bacteria. Nature reviews. Microbiology, 2005. 3(9): p. 688-99. 91. Wu, M., et al., Phylogenomics of the reproductive parasite Wolbachia pipientis wMel: a streamlined genome overrun by mobile genetic elements. PLoS biology, 2004. 2(3): p. E69. 92. Andersson, S.G., et al., Comparative genomics of microbial pathogens and symbionts. Bioinformatics, 2002. 18 Suppl 2: p. S17. 93. 
Simek, K., et al., Changes in bacterial community composition and dynamics and viral mortality rates associated with enhanced flagellate grazing in a mesoeutrophic reservoir. Appl Environ Microbiol, 2001. 67(6): p. 2723-33. 94. Simser, J.A., et al., A novel and naturally occurring transposon, ISRpe1 in the Rickettsia peacockii genome disrupting the rickA gene involved in actin-based motility. Molecular microbiology, 2005. 58(1): p. 71-9. 95. Blanc, G., et al., Lateral gene transfer between obligate intracellular bacteria: evidence from the Rickettsia massiliae genome. Genome research, 2007. 17(11): p. 1657-64. 96. Horn, M., et al., Illuminating the evolutionary history of chlamydiae. Science, 2004. 304(5671): p. 728-30.

53 97. Ogata, H., et al., Genome sequence of Rickettsia bellii illuminates the role of amoebae in gene exchanges between intracellular pathogens. PLoS Genet, 2006. 2(5): p. e76. 98. Yang, F., et al., Genome dynamics and diversity of Shigella species, the etiologic agents of . Nucleic acids research, 2005. 33(19): p. 6445-58. 99. Labrador, M. and V.G. Corces, Transposable element-host interactions: regulation of insertion and excision. Annu Rev Genet, 1997. 31: p. 381-404. 100. van Belkum, A., et al., Short-sequence DNA repeats in prokaryotic genomes. Microbiology and molecular biology reviews : MMBR, 1998. 62(2): p. 275-93. 101. Fournier, P.E., et al., Use of highly variable intergenic spacer sequences for multispacer typing of Rickettsia conorii strains. Journal of clinical microbiology, 2004. 42(12): p. 5757-66. 102. Amiri, H., C.M. Alsmark, and S.G. Andersson, Proliferation and deterioration of Rickettsia palindromic elements. Molecular biology and evolution, 2002. 19(8): p. 1234-43. 103. Claverie, J.M. and H. Ogata, The insertion of palindromic repeats in the evolution of proteins. Trends in biochemical sciences, 2003. 28(2): p. 75-80. 104. Ogata, H., et al., Selfish DNA in protein-coding genes of Rickettsia. Science, 2000. 290(5490): p. 347-50. 105. Rocha, E.P., DNA repeats lead to the accelerated loss of gene order in bacteria. Trends in genetics : TIG, 2003. 19(11): p. 600-3. 106. Baldridge, G.D., et al., Transposon insertion reveals pRM, a plasmid of Rickettsia monacensis. Appl Environ Microbiol, 2007. 73(15): p. 4984-95. 107. Moran, J.V., R.J. DeBerardinis, and H.H. Kazazian, Jr., Exon shuffling by L1 retrotransposition. Science, 1999. 283(5407): p. 1530-4. 108. Ogata, H., et al., Rickettsia felis, from culture to genome sequencing. Annals of the New York Academy of Sciences, 2005. 1063: p. 26-34. 109. Li, J., A. Mahajan, and M.D. Tsai, Ankyrin repeat: a unique motif mediating protein-protein interactions. Biochemistry, 2006. 45(51): p. 15168-78. 110. 
Mosavi, L.K., et al., The ankyrin repeat as molecular architecture for protein recognition. Protein science : a publication of the Protein Society, 2004. 13(6): p. 1435-48. 111. Rubtsov, A.M. and O.D. Lopina, Ankyrins. FEBS letters, 2000. 482(1-2): p. 1-5. 112. Blatch, G.L. and M. Lassle, The tetratricopeptide repeat: a structural motif mediating protein-protein interactions. BioEssays : news and reviews in molecular, cellular and developmental biology, 1999. 21(11): p. 932-9. 113. Bork, P., Hundreds of ankyrin-like repeats in functionally diverse proteins: mobile modules that cross phyla horizontally? Proteins, 1993. 17(4): p. 363-74. 114. Felsheim, R.F., T.J. Kurtti, and U.G. Munderloh, Genome sequence of the endosymbiont Rickettsia peacockii and comparison with virulent Rickettsia rickettsii: identification of virulence factors. PloS one, 2009. 4(12): p. e8361.

54 115. Caturegli, P., et al., ankA: an Ehrlichia phagocytophila group gene encoding a cytoplasmic protein antigen with ankyrin repeats. Infection and immunity, 2000. 68(9): p. 5277-83. 116. Saisongkorh, W., et al., Evidence of transfer by conjugation of type IV secretion system genes between Bartonella species and Rhizobium radiobacter in amoeba. PloS one, 2010. 5(9): p. e12666. 117. Saridaki, A. and K. Bourtzis, Wolbachia: more than just a bug in insects genitals. Current opinion in microbiology, 2010. 13(1): p. 67-72. 118. Lin, M., et al., Analysis of complete genome sequence of Neorickettsia risticii: causative agent of Potomac horse fever. Nucleic acids research, 2009. 37(18): p. 6076-91. 119. Cho, N.H., et al., The Orientia tsutsugamushi genome reveals massive proliferation of conjugative type IV secretion system and host-cell interaction genes. Proceedings of the National Academy of Sciences of the United States of America, 2007. 104(19): p. 7981-6. 120. Burns, D.L., Type IV transporters of pathogenic bacteria. Current opinion in microbiology, 2003. 6(1): p. 29-34. 121. Christie, P.J., Type IV secretion: intercellular transfer of macromolecules by systems ancestrally related to conjugation machines. Molecular microbiology, 2001. 40(2): p. 294-305. 122. Christie, P.J. and J.P. Vogel, Bacterial type IV secretion: conjugation systems adapted to deliver effector molecules to host cells. Trends in microbiology, 2000. 8(8): p. 354-60. 123. Deng, W., et al., VirE1 is a specific molecular chaperone for the exported single-stranded-DNA-binding protein VirE2 in Agrobacterium. Molecular microbiology, 1999. 31(6): p. 1795-807. 124. Chen, I., P.J. Christie, and D. Dubnau, The ins and outs of DNA transfer in bacteria. Science, 2005. 310(5753): p. 1456-60. 125. Schandel, K.A., M.M. Muller, and R.E. Webster, Localization of TraC, a protein involved in assembly of the F conjugative pilus. J Bacteriol, 1992. 174(11): p. 3800-6. 126. 
Beranek, A., et al., Thirty-eight C-terminal amino acids of the coupling protein TraD of the F-like conjugative resistance plasmid R1 are required and sufficient to confer binding to the substrate selector protein TraM. J Bacteriol, 2004. 186(20): p. 6999-7006. 127. Schroder, G. and E. Lanka, TraG-like proteins of type IV secretion systems: functional dissection of the multiple activities of TraG (RP4) and TrwB (R388). J Bacteriol, 2003. 185(15): p. 4371-81. 128. Berglund, E.C., et al., Run-off replication of host-adaptability genes is associated with gene transfer agents in the genome of mouse-infecting Bartonella grahamii. PLoS genetics, 2009. 5(7): p. e1000546.

55 129. Matthews, M. and C.R. Roy, Identification and subcellular localization of the Legionella pneumophila IcmX protein: a factor essential for establishment of a replicative organelle in eukaryotic host cells. Infection and immunity, 2000. 68(7): p. 3971-82. 130. Vogel, J.P., et al., Conjugative transfer by the virulence system of Legionella pneumophila. Science, 1998. 279(5352): p. 873-6. 131. Nagai, H. and T. Kubori, Type IVB Secretion Systems of Legionella and Other Gram-Negative Bacteria. Frontiers in microbiology, 2011. 2: p. 136. 132. Nora, T., et al., Molecular mimicry: an important virulence strategy employed by Legionella pneumophila to subvert host functions. Future microbiology, 2009. 4(6): p. 691-701. 133. Andrews, H.L., J.P. Vogel, and R.R. Isberg, Identification of linked Legionella pneumophila genes essential for intracellular growth and evasion of the endocytic pathway. Infection and immunity, 1998. 66(3): p. 950-8.

Table S1: List of some of the main sequenced intracellular genomes (as of October 2013) indicating the genome size, GC contents, number of protein coding genes, number of plasmids and the year of publishing along with its lifestyle. Lifestyle: FI, facultative intracellular; OI, obligate intracellular.

Niche | Bacteria | Size (Mb) | GC % | Protein | Plasmids | Year | Citation
gammaproteobacteria
OI | Buchnera aphidicola Acyrthosiphon pisum 5p | 0.64 | 26 | 555 | 0 | 2009 | 1
OI | Buchnera aphidicola Acyrthosiphon pisum | 0.64 | 26 | 564 | 2 | 2001 | 2
OI | Buchnera aphidicola Baizongia pistaciae | 0.62 | 25 | 504 | 1 | 2003 | 3
OI | Buchnera aphidicola Cinara cedri | 0.42 | 20 | 357 | 1 | 2006 | 4
OI | Buchnera aphidicola Schizaphis graminum | 0.64 | 25 | 546 | 0 | 2002 | 5
OI | Buchnera aphidicola Acyrthosiphon pisum Tuc7 | 0.64 | 26 | 553 | 0 | 2009 | 1
OI | Wigglesworthia glossinidia | 0.7 | 22 | 611 | 1 | 2002 | 6
OI | Candidatus Blochmannia floridanus | 0.71 | 27 | 583 | 0 | 2003 | 7
OI | Candidatus Blochmannia pennsylvanicus str. BPEN | 0.79 | 29 | 610 | 0 | 2005 | 8
OI | Baumannia cicadellinicola str Hc | 0.69 | 33 | 595 | 0 | 2006 | 9
FI | Sodalis glossinidius | 4.17 | 54 | 2432 | 3 | 2006 | 10
OI | Candidatus Hamiltonella defensa 5AT | 2.1 | 40 | 2094 | 1 | 2009 | 11
FI | Photorhabdus asymbiotica | 5.06 | 42 | 4390 | 1 | 2009 | 12,13
OI | Candidatus Carsonella ruddii | 0.16 | 16 | 182 | 0 | 2006 | 14
FI | Shigella flexneri 2a 2457T | 4.6 | 50 | 4060 | 0 | 2003 | 15
FI | Shigella flexneri 2a 301 | 4.6 | 50 | 4176 | 1 | 2002 | 16
FI | Shigella flexneri 2a 5 str. 8401 | 4.57 | 50 | 4114 | 0 | 2006 | 17
FI | Legionella pneumophila Corby | 3.58 | 38 | 3204 | 0 | 2007 | 18
FI | Legionella pneumophila Lens | 3.35 | 38 | 2878 | 1 | 2004 | 19
FI | Legionella pneumophila Paris | 3.5 | 38 | 3027 | 1 | 2004 | 20
FI | Legionella pneumophila subsp. pneumophila str. Philadelphia 1 | 3.4 | 38 | 2942 | 0 | 2001 | 19
FI | Legionella pneumophila 2300/99 Alcoy | 3.5 | 38 | 3190 | 0 | 2010 | 19
FI | Legionella pneumophila subsp. pneumophila LPE509 | 3.5 | 38 | 3331 | 1 | 2013 | 21

FI | Legionella pneumophila subsp. pneumophila str. Thunder Bay | 3.5 | 38 | 2998 | 0 | 2013 | 21
FI | NSW150 | 4.1 | 37 | 3470 | 1 | 2010 | 22
OI | Coxiella burnetii CbuG_Q212 | 2 | 42 | 1866 | 0 | 2008 | 23
OI | Coxiella burnetii CbuK_Q154 | 2.1 | 42 | 1900 | 1 | 2008 | 23
OI | Coxiella burnetii Dugway 5J108-111 | 2.2 | 42 | 1993 | 1 | 2007 | 23
OI | Coxiella burnetii RSA 331 | 2 | 42 | 1930 | 1 | 2009 | 23
OI | Coxiella burnetii RSA 493 | 2 | 42 | 1817 | 1 | 2001 | 23
FI | Francisella tularensis subsp. holarctica FSC200 | 1.9 | 32 | 1438 | 0 | 2012 | 24
FI | Francisella tularensis subsp. tularensis TI0902 | 1.9 | 32 | 1544 | 0 | 2012 | 25
FI | Francisella tularensis subsp. tularensis TIGB03 | 2.0 | 32 | 1624 | 0 | 2012 | 26
FI | Francisella tularensis subsp. tularensis NE061598 | 1.9 | 32 | 1836 | 0 | 2009 | 27
FI | Francisella tularensis subsp. holarctica F92 | 1.9 | 32 | 1842 | 0 | 2012 | 28
FI | Francisella tularensis holarctica FTNF002-00 FTA | 1.9 | 32 | 1580 | 0 | 2007 | 29
FI | Francisella tularensis holarctica LVS | 1.9 | 32 | 1754 | 0 | 2006 | 30
FI | Francisella tularensis holarctica OSU18 | 1.9 | 32 | 1555 | 0 | 2006 | 30
FI | Francisella tularensis mediasiatica FSC147 | 1.9 | 32 | 1406 | 0 | 2008 | 31
FI | Francisella tularensis tularensis FSC198 | 1.9 | 32 | 1605 | 0 | 2006 | 30
FI | Francisella tularensis tularensis SCHU S4 Schu S4 | 1.9 | 32 | 1604 | 0 | 2004 | 32
FI | Francisella tularensis tularensis WY96-3418 | 1.9 | 32 | 1634 | 0 | 2007 | 29
OI | Candidatus Vesicomyosocius okutanii HA | 1.02 | 31 | 937 | 0 | 2007 | 33
OI | Candidatus Ruthia magnifica str. Cm | 1.16 | 34 | 976 | 0 | 2006 | 34
FI | ATCC 23344 | 5.83 | 68 | 5024 | 0 | 2004 | 35
FI | Burkholderia mallei NCTC 10229 | 5.76 | 68 | 5509 | 0 | 2007 | 35
FI | Burkholderia mallei NCTC 10247 | 5.85 | 68 | 5415 | 0 | 2007 | 35
FI | Burkholderia mallei SAVP1 | 5.23 | 68 | 5188 | 0 | 2007 | 35
OI | Polynucleobacter necessarius subsp. asymbioticus QLW-P1DMWA-1 | 2.16 | 44 | 2077 | 0 | 2007 | 36
OI | Polynucleobacter necessarius subsp. necessarius STIR1 | 1.56 | 45 | 1508 | 0 | 2008 | 36

alphaproteobacteria
FI | KC583 | 1.45 | 38 | 1283 | 0 | 2007 | 37
FI | Bartonella grahamii as4aup | 2.34 | 38 | 1737 | 1 | 2009 | 38
FI | Bartonella henselae str. Houston-1 | 1.93 | 38 | 1488 | 0 | 2004 | 39
FI | str. Toulouse Toulose | 1.58 | 38 | 1142 | 0 | 2004 | 39
FI | Bartonella tribocorum CIP 105476 | 2.6 | 38 | 2069 | 1 | 2007 | 40
OI | Candidatus Hodgkinia cicadicola str. Dsem | 0.14 | 58 | 169 | 1 | 2009 | 41
FI | Phenylobacterium zucineum (strain HLK1) | 4 | 71 | 3529 | 1 | 2007 | 42
OI | Anaplasma centrale str. Israel | 1.2 | 49 | 923 | 0 | 2009 | 43
OI | Anaplasma marginale str. Florida | 1.2 | 49 | 940 | 0 | 2009 | 44
OI | Anaplasma marginale str. St. Maries | 1.2 | 49 | 948 | 0 | 2003 | 45
OI | Anaplasma phagocytophilum | 1.47 | 41 | 1264 | 0 | 2006 | 46
OI | Ehrlichia canis str. Jake | 1.3 | 28 | 925 | 0 | 2005 | 47
OI | str. Arkansas | 1.18 | 30 | 1105 | 0 | 2006 | 48
OI | Ehrlichia ruminantium str. Gardel | 1.5 | 27 | 950 | 0 | 2005 | 49
OI | Ehrlichia ruminantium str. Welgevonden | 1.51 | 27 | 958 | 0 | 2005 | 50
OI | Ehrlichia ruminantium str. Welgevonden | 1.5 | 27 | 888 | 0 | 2003 | 50
OI | Wolbachia pipientis wPip | 1.5 | 34 | 1275 | 0 | 2008 | 51
OI | Wolbachia pipientis wMel | 1.27 | 35 | 1195 | 0 | 2002 | 52
OI | Wolbachia pipientis wMel TRS | 1.08 | 34 | 805 | 0 | 2005 | 53
OI | Wolbachia pipientis wRi | 1.45 | 35 | 1150 | 0 | 2009 | 54
OI | Neorickettsia risticii str. Illinois | 0.88 | 41 | 892 | 0 | 2009 | 55
OI | Neorickettsia sennetsu str. Miyayama | 0.86 | 41 | 932 | 0 | 2006 | 56
OI | Rickettsia africae ESF-5 | 1.28 | 32 | 1030 | 1 | 2009 | 57
OI | str. Hartford | 1.23 | 32 | 1258 | 0 | 2007 | 58
OI | Rickettsia bellii OSU 85-389 | 1.53 | 31 | 1475 | 0 | 2007 | 58
OI | Rickettsia bellii RML369-C | 1.52 | 31 | 1429 | 0 | 2006 | 59
OI | Rickettsia canadensis str. McKiel | 1.16 | 31 | 1090 | 0 | 2007 | 60
OI | Rickettsia conorii str. Malish 7 | 1.27 | 32 | 1374 | 0 | 2001 | 61

OI | Rickettsia felis URRWXCal2 | 1.49 | 32 | 1400 | 2 | 2005 | 62
OI | Rickettsia massiliae MTU5 | 1.36 | 32 | 968 | 1 | 2007 | 63
OI | Rickettsia peacockii str. Rustic | 1.3 | 32 | 927 | 1 | 2009 | 64
OI | Rickettsia prowazekii str. Madrid E | 1.1 | 29 | 835 | 0 | 2001 | 65
OI | Rickettsia rickettsii str. 'Sheila Smith' | 1.26 | 32 | 1343 | 0 | 2007 | 66
OI | Rickettsia rickettsii str. Iowa | 1.27 | 32 | 1383 | 0 | 2008 | 67
OI | Rickettsia typhi str. Wilmington | 1.11 | 28 | 838 | 0 | 2004 | 68
OI | Candidatus Rickettsia amblyommii str. GAT-30V | 1.48 | 32 | 1390 | 3 | 2012 | 69
OI | str. Cutlack | 1.32 | 32 | 1261 | 1 | 2012 | 70
OI | Rickettsia canadensis str. CA410 | 1.15 | 31 | 1016 | 0 | 2012 | 71
OI | Rickettsia heilongjiangensis 054 | 1.28 | 32 | 1297 | 0 | 2011 | 72
OI | YH | 1.28 | 32 | 971 | 0 | 2011 | 73
OI | Rickettsia massiliae str. AZT80 | 1.28 | 33 | 1207 | 1 | 2012 | 74
OI | Rickettsia montanensis str. OSU 85-930 | 1.28 | 33 | 1217 | 0 | 2012 | 75
OI | str. Portsmouth | 1.30 | 32 | 1318 | 0 | 2012 | 76
OI | Rickettsia philipii str. 364D | 1.29 | 33 | 1344 | 0 | 2012 | 77
OI | Rickettsia prowazekii str. Breinl | 1.11 | 29 | 920 | 0 | 2013 | 78
OI | Rickettsia prowazekii str. BuV67-CWPP | 1.11 | 29 | 843 | 0 | 2012 | 79
OI | Rickettsia prowazekii str. Chernikova | 1.11 | 29 | 845 | 0 | 2012 | 80
OI | Rickettsia prowazekii str. Dachau | 1.11 | 29 | 839 | 0 | 2012 | 81
OI | Rickettsia prowazekii str. GvV257 | 1.11 | 29 | 829 | 0 | 2012 | 82
OI | Rickettsia prowazekii str. Katsinyian | 1.11 | 29 | 844 | 0 | 2012 | 83
OI | Rickettsia prowazekii str. NMRC Madrid E | 1.11 | 29 | 938 | 0 | 2013 | 84
OI | Rickettsia prowazekii str. Rp22 | 1.11 | 29 | 950 | 0 | 2010 | 85
OI | Rickettsia prowazekii str. RpGvF24 | 1.11 | 29 | 834 | 0 | 2012 | 86
OI | Rickettsia rhipicephali str. 3-7-female6-CWPP | 1.31 | 32 | 1266 | 1 | 2012 | 87
OI | Rickettsia rickettsii str. Arizona | 1.27 | 32 | 1343 | 0 | 2012 | 88
OI | Rickettsia rickettsii str. Brazil | 1.26 | 33 | 1332 | 0 | 2012 | 89
OI | Rickettsia rickettsii str. Colombia | 1.27 | 33 | 1350 | 0 | 2012 | 90

OI | Rickettsia rickettsii str. Hauke | 1.27 | 33 | 1340 | 0 | 2012 | 91
OI | Rickettsia rickettsii str. Hino | 1.27 | 33 | 1335 | 0 | 2012 | 92
OI | Rickettsia rickettsii str. Hlp#2 | 1.27 | 33 | 1308 | 0 | 2012 | 93
OI | Rickettsia slovaca 13-B | 1.28 | 33 | 1112 | 0 | 2011 | 94
OI | Rickettsia slovaca str. D-CWPP | 1.28 | 33 | 1347 | 0 | 2012 | 95
OI | Rickettsia typhi str. B9991CWPP | 1.11 | 29 | 839 | 0 | 2012 | 96
OI | Rickettsia typhi str. TH1527 | 1.11 | 29 | 838 | 0 | 2012 | 96
OI | Orientia tsutsugamushi str. Boryong | 2.13 | 30 | 1182 | 0 | 2007 | 97
OI | Orientia tsutsugamushi str. Ikeda | 2 | 30 | 1967 | 0 | 2008 | 98
deltaproteobacteria
OI | Lawsonia intracellularis PHE/MN1-00 | 1.46 | 33 | 1180 | 3 | 2006 | 99
Bacteroidetes
OI | Candidatus Sulcia muelleri GWSS | 0.25 | 22 | 227 | 0 | 2007 | 100
OI | Candidatus Sulcia muelleri SMDSEM | 0.28 | 22 | 242 | 0 | 2009 | 101
OI | Blattabacterium Bge | 0.64 | 27 | 586 | 0 | 2009 | 102
OI | Blattabacterium BPLAN | 0.64 | 28 | 576 | 1 | 2009 | 102
OI | Candidatus Amoebophilus asiaticus 5a2 | 1.9 | 35 | 1283 | 0 | 2008 | 103
Actinobacteria
OI | Mycobacterium leprae TN | 3.27 | 57 | 1605 | 0 | 2001 | 104
FI | Renibacterium salmoninarum ATCC 33209 | 3.16 | 56 | 3507 | 0 | 2007 | 105
FI | Tropheryma whipplei TW08/27 | 0.93 | 46 | 783 | 0 | 2003 | 106
FI | Tropheryma whipplei Twist | 0.93 | 46 | 808 | 0 | 2003 | 107
Chlamydiae
OI | Chlamydophila abortus S26/3 | 1.14 | 39 | 932 | 0 | 2003 | 108
OI | Chlamydophila caviae GPIC | 1.17 | 39 | 998 | 1 | 2002 | 109
OI | Chlamydophila felis Fe/C-56 | 1.17 | 39 | 1005 | 1 | 2006 | 110
OI | Chlamydophila pneumoniae AR39 | 1.23 | 40 | 1112 | 0 | 2001 | 111
OI | Chlamydophila pneumoniae CWL029 | 1.23 | 40 | 1052 | 0 | 2001 | 112
OI | Chlamydophila pneumoniae J138 | 1.23 | 40 | 1069 | 0 | 2001 | 113

OI | Chlamydophila pneumoniae TW-183 | 1.23 | 40 | 1113 | 0 | 2003 | 114
OI | Chlamydia muridarum Nigg | 1.07 | 40 | 904 | 1 | 2001 | 111
OI | Chlamydia trachomatis A/HAR-13 | 1.04 | 41 | 911 | 1 | 2005 | 115
OI | Chlamydia trachomatis D/UW-3/CX | 1.04 | 41 | 895 | 0 | 2001 | 116
OI | Chlamydia trachomatis 434/Bu | 1.04 | 41 | 874 | 0 | 2008 | 117
OI | Chlamydia trachomatis B/Jali20/OT Jali20 | 1.04 | 41 | 875 | 0 | 2009 | 118
OI | Chlamydia trachomatis B/TZ1A828/OT | 1.04 | 41 | 880 | 0 | 2009 | 119
OI | Chlamydia trachomatis L2b/UCH-1/proctitis | 1.04 | 41 | 874 | 0 | 2008 | 120
OI | Candidatus Protochlamydia amoebophila UWE25 | 2.41 | 34 | 2031 | 0 | 2004 | 121
Firmicutes
FI | Listeria monocytogenes Clip80459 CLIP80459 | 2.9 | 38 | 2766 | 0 | 2009 | 122
FI | Listeria monocytogenes EGD-e | 2.94 | 37 | 2846 | 0 | 2001 | 123
FI | Listeria monocytogenes HCC23 | 2.98 | 38 | 2974 | 0 | 2008 | 124
FI | Listeria monocytogenes str. 4b F2365 | 2.91 | 38 | 2821 | 0 | 2001 | 125
Tenericutes
OI | Phytoplasma Onion yellows OY-M | 0.85 | 27 | 750 | 0 | 2003 | 126
OI | Phytoplasma Australiense | 0.88 | 27 | 684 | 0 | 2008 | 127
OI | Phytoplasma Aster yellows witches-broom AYWB | 0.71 | 26 | 671 | 4 | 2006 | 128
OI | Phytoplasma mali str. AT | 0.6 | 21 | 479 | 0 | 2008 | 129
OI | Mycoplasma penetrans HF-2 | 1.36 | 25 | 1037 | 0 | 2002 | 130


Chapter 3

Genome sequencing of intracellular bacteria


3.1 Article 1: Genome Sequence of Diplorickettsia massiliensis, an Emerging Ixodes ricinus-Associated Human Pathogen

Mano J. Mathew1, Geetha Subramanian1, Thi-Tien Nguyen1, Catherine Robert1, Oleg Mediannikov1, Pierre-Edouard Fournier1, Didier Raoult1*

1 Unité de Recherche sur les Maladies Infectieuses et Tropicales Emergentes: URMITE, Aix Marseille Université, UMR CNRS 7278, IRD 198, INSERM 1095, Faculté de Médecine, 27 Bd Jean Moulin, 13005, Marseille, France.

Published in J. Bacteriol. June 2012 vol. 194 no. 12 3287

*Corresponding author. E-mail: [email protected]

Preamble to article 1

The order Legionellales, in the class Gammaproteobacteria, comprises several pathogenic, aerobic, motile, nutritionally fastidious and pleomorphic Gram-negative bacteria. It is composed of two families: Legionellaceae and Coxiellaceae. Many species of Legionella cause legionellosis. The family Coxiellaceae consists of Aquicella, Coxiella (an intracellular bacterium that is the causative agent of Q fever) (Beare, et al., 2009), Diplorickettsia and Rickettsiella (an intracellular parasite of Gryllus bimaculatus) (Roux, et al., 1997, Mediannikov, et al., 2010). Almost all bacteria isolated from ticks (Ixodes ricinus) are pathogenic for humans, notably Borrelia burgdorferi, Borrelia afzelii, Borrelia garinii, Rickettsia helvetica, Rickettsia monacensis and Francisella tularensis (Parola & Raoult, 2001). F. tularensis, which causes tularemia or plague-like disease, belongs to the order Thiotrichales (Beckstrom-Sternberg, et al., 2007).

D. massiliensis strain 20B is an obligate intracellular, Gram-negative bacterium isolated from Ixodes ricinus ticks collected in 2006 from the southeastern part of the Rovinka forest in Slovakia (Mediannikov, et al., 2010). D. massiliensis belongs to the class Gammaproteobacteria, is non-endospore-forming, and forms small rods that are usually grouped in pairs. An initial phylogenetic analysis based on 16S rRNA showed that D. massiliensis clustered with Rickettsiella grylli (Roux, et al., 1997, Mediannikov, et al., 2010). Because of its low 16S rDNA similarity (94%) with R. grylli, it was classified as a new genus, Diplorickettsia, in the family Coxiellaceae and the order Legionellales (Mediannikov, et al., 2010). D. massiliensis strain 20B was identified in three patients with suspected tick-borne infections who exhibited a specific seroconversion. The infection was further confirmed by PCR assay, thus establishing its role as a human pathogen.

This article reports the genome of D. massiliensis 20B, which contains 1,727,973 bp with a G+C content of 38.9%. Among closely related gammaproteobacteria, D. massiliensis, with 1.7 Mb, had a larger genome than Rickettsiella grylli (1.4 Mb) but a smaller one than Coxiella burnetii strain CbuK_Q154 (2.0 Mb). However, D. massiliensis had more metabolism-related genes (501 genes) than Rickettsiella grylli (360) and Coxiella burnetii (459); it also had more genes involved in energy production and conversion (109 versus 75 and 84, respectively) and more genes involved in translation, ribosomal structure, and biogenesis (170 versus 134 and 135, respectively).
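For readers unfamiliar with the statistic, the G+C content quoted above is simply the percentage of G and C bases among the unambiguous bases of the genome. The following is a minimal sketch, assuming the sequence is available as a plain nucleotide string; the function name `gc_content` and the toy contig are hypothetical illustrations and are not part of the published analysis.

```python
def gc_content(sequence):
    """Return the G+C percentage of a DNA sequence.

    Ambiguous symbols such as N are ignored, so the percentage is
    computed over A, C, G and T only (an assumption of this sketch).
    """
    seq = sequence.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = gc + at
    if total == 0:
        raise ValueError("sequence contains no A, C, G or T")
    return 100.0 * gc / total


# Hypothetical toy contig: 6 G/C bases out of 16 unambiguous bases.
toy_contig = "ATGCGCATTAGCNNATAT"
print(f"G+C content: {gc_content(toy_contig):.1f}%")  # G+C content: 37.5%
```

Applied to a complete genome sequence, the same calculation yields the kind of figure reported here for D. massiliensis 20B.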


Chapter 4

Comparative genomics


4.1 Article 2: The genomic repertoire of Diplorickettsia massiliensis reveals its allopatric lifestyle

Mano J. Mathew1, Laetitia Rouli1 and Didier Raoult1*

1 Unité de Recherche sur les Maladies Infectieuses et Tropicales Emergentes: URMITE, Aix Marseille Université, UMR CNRS 7278, IRD 198, INSERM 1095, Faculté de Médecine, 27 Bd Jean Moulin, 13005, Marseille, France.

Submitted to Biology Direct

*Corresponding author. E-mail: [email protected]

Preamble to article 2

In this study, we used a pangenomic approach to elucidate strain-specific genes as well as genomic differences and similarities between Diplorickettsia massiliensis strain 20B and twenty-nine sequenced genomes, including Legionella strains, Coxiella burnetii strains, F. tularensis strains and R. grylli. We conducted a global pangenome analysis with these thirty genomes as well as individual pangenome sets belonging to Coxiella, Legionella and Francisella. Individual pangenomes were constructed for each genus using five sequenced Coxiella burnetii reference strains, ten sequenced L. pneumophila strains and twelve sequenced F. tularensis strains, respectively. Another pangenome set was constructed from ten sequenced L. pneumophila strains and a single L. longbeachae NSW150 strain. A single R. grylli genome and the D. massiliensis strain 20B genome were also included in the above-mentioned pangenome set. We estimated the sizes of both the pangenome and the core genomes. Based on these pangenomes, we described the distribution of functional genes and gene families across the different genomes analyzed, and specifically characterized the D. massiliensis strain 20B genome.

Title: The genomic repertoire of Diplorickettsia massiliensis reveals its allopatric lifestyle

Running title: The genomic repertoire of Diplorickettsia massiliensis reveals its allopatric lifestyle

Mano J. Mathew1, Laetitia Rouli1 and Didier Raoult1*

1 Unité de Recherche sur les Maladies Infectieuses et Tropicales Emergentes: URMITE, Aix Marseille Université, UMR CNRS 7278, IRD 198, INSERM 1095, Faculté de Médecine, 27 Bd Jean Moulin, 13005, Marseille, France.

Submitted to Biology Direct

*Corresponding author. E-mail: [email protected]

Abstract

Background
Diplorickettsia massiliensis strain 20B is an obligate intracellular, Gram-negative bacterium isolated from Ixodes ricinus ticks collected in Slovakia. In this study, we compared the genomic features of D. massiliensis strain 20B with twenty-nine sequenced Gammaproteobacteria genomes (Legionella strains, Coxiella burnetii strains, Francisella tularensis strains and Rickettsiella grylli) using a multi-genus pangenomic approach.

Results
Using phylogenomic analysis, we found that D. massiliensis shares 635 genes with Rickettsiella grylli and clusters with Coxiella burnetii. We identified 908 genes (61.56%) in common with Gammaproteobacteria that constitute the core genome of D. massiliensis and 518 genes (35.12%) that represent the dispensable genome. We also identified a link between total gene content and different bacterial lifestyles. We observed that fewer genes and a lower G+C content correlated with a smaller genome size and helped the bacteria to adapt to the host. Because of the reduced genomic repertoire, we speculate that fewer lateral gene transfers have occurred in D. massiliensis. A pangenomic approach allowed us to explore the different strategies by which facultative or obligate intracellular organisms specialize to a particular host.

Conclusion
These results significantly contribute to our understanding of genome repertoires. This approach can be used to uncover interesting genomic features that cannot be predicted using conventional methods. Moreover, the variability that we identified between the L. pneumophila strains and L. longbeachae NSW150 may warrant re-classifying them as separate subspecies.

Keywords: Genome repertoire; pangenome; Diplorickettsia; allopatric; comparative genomics

Background

The order Legionellales, in the class Gammaproteobacteria, comprises several pathogenic, aerobic, motile, nutritionally fastidious and pleomorphic Gram-negative bacteria. It is composed of two families: Legionellaceae and Coxiellaceae. Many species of Legionella cause legionellosis. The family Coxiellaceae consists of Aquicella, Coxiella (an intracellular bacterium that is the causative agent of Q fever) [1], Diplorickettsia and Rickettsiella (an intracellular parasite of Gryllus bimaculatus) [2, 3]. Almost all bacteria isolated from ticks (Ixodes ricinus) are pathogenic for humans, notably Borrelia burgdorferi, Borrelia afzelii, Borrelia garinii, Rickettsia helvetica, Rickettsia monacensis and Francisella tularensis [4]. F. tularensis, which causes tularemia or plague-like disease, belongs to the order Thiotrichales [5].

D. massiliensis strain 20B is an obligate intracellular, Gram-negative bacterium isolated from Ixodes ricinus ticks collected in 2006 from the southeastern part of the Rovinka forest in Slovakia [3]. D. massiliensis belongs to the class Gammaproteobacteria, is non-endospore-forming, and forms small rods that are usually grouped in pairs. An initial phylogenetic analysis based on 16S rRNA showed that D. massiliensis clustered with Rickettsiella grylli [2, 3]. Because of its low 16S rDNA similarity (94%) with R. grylli, it was classified as a new genus, Diplorickettsia, in the family Coxiellaceae and the order Legionellales [3]. D. massiliensis strain 20B was identified in three patients with suspected tick-borne infections who exhibited a specific seroconversion. The infection was further confirmed by PCR assay, thus establishing its role as a human pathogen. Whole genome sequencing was performed at a later date [6, 7].

Recent advances in next generation sequencing techniques have led to the initiation of large-scale microbial genome projects [8].
Comparative genomics studies use conventional non-sequence-based technologies, such as microarrays targeting genes or non-coding regions and studies of specific pathways, as well as whole genome sequence alignment [9]. Bacterial strains from the same species may exhibit variations in their genetic repertoire, with differences in both genomic structure and sequence between strains, reflecting the extraordinary adaptability of prokaryotic species. Thus, sequencing a single genome per species is often insufficient for describing the genetic variability of the species. This led to the concept of a pangenomic approach, which takes into account the genetic makeup of a bacterial species and its genomic diversity from genus to genus. The pangenome of a bacterial species is larger than the total gene content of any individual strain within the species. The pangenome is composed of three parts: the core genome (genes shared by all of the strains), the accessory or dispensable genome (shared by only some of the strains) and unique genes (strain-specific) [10]. The accessory genome can reveal evidence of lateral gene transfer events that occurred during the evolutionary history of a strain and likely contributed to the evolutionary potential of the organism. Furthermore, a distinction can be made between closed pangenomes and open pangenomes. A pangenome is closed when, despite the addition of new genomes, the gene content remains unchanged, such as in Bacillus anthracis [9, 10]. In contrast, a pangenome is open, as in the case of Escherichia coli [11], if the gene pool increases with the addition of new genomes. Pangenome studies can reveal changes that are not easily detectable using standard annotation analysis [12]. For example, pangenome studies have facilitated the identification of strain-specific genes in L. pneumophila. The L. pneumophila dispensable genome, acquired by horizontal gene transfer, may act as a reservoir that could confer evolutionary advantages over strains that lack this gene pool [13]. These microbial pathogens exhibit a striking ability to adapt to new hosts, antibiotics, and host immune systems [14].

In this study, we used a pangenomic approach to elucidate strain-specific genes as well as genomic differences and similarities between D. massiliensis strain 20B and twenty-nine sequenced genomes, including Legionella strains, Coxiella burnetii strains, F. tularensis strains and R. grylli. We conducted a global pangenome analysis with these thirty genomes as well as individual pangenome sets belonging to Coxiella, Legionella and Francisella. Individual pangenomes were constructed for each genus using five sequenced Coxiella burnetii reference strains, ten sequenced L. pneumophila strains and twelve sequenced F. tularensis strains, respectively. Another pangenome set was constructed from ten sequenced L. pneumophila strains and a single L. longbeachae NSW150 strain. A single R. grylli genome and the D. massiliensis strain 20B genome were also included in the above-mentioned pangenome set. We estimated the sizes of both the pangenome and the core genomes. Based on these pangenomes, we described the distribution of functional genes and gene families across the different genomes analyzed, and specifically characterized the D. massiliensis strain 20B genome.

Results

Comparison of genomic features

The main features of the genomes analyzed here are summarized in Table 1. The chromosomes from the thirty genomes compared in this study range in size from 1.6 to 4.15 Mb and have a G+C content ranging from 37.1 to 42.6%. The R. grylli genome is smaller than the D. massiliensis strain 20B and F. tularensis genomes (1.6, 1.7 and 1.9 Mb, respectively), and the L. longbeachae strain NSW150 genome (4.1 Mb) is larger than those of the L. pneumophila strains. The number of protein coding genes per genome within the various strains and species is relatively similar, but the gene composition is much more variable. The distribution of proteins by length among the organisms is shown in Figure 1. We compared the genomic and proteomic repertoires based on protein length, genome size, and G+C content and found that D. massiliensis strain 20B fell between two groups (the first comprising C. burnetii and F. tularensis, the second L. pneumophila) but closer to the former, which has more pathogenic proteins. The coding density of these genomes ranges from 71.29% to 90.86%. In the Legionella species, coding regions account for more than 85% of the genome. The number of proteins associated with D. massiliensis strain 20B is much larger than in R. grylli, F. tularensis and C. burnetii. Figure 2 summarizes the distribution of G+C content (%) and genome size (Mb). The Kyoto Encyclopedia of Genes and Genomes (KEGG) characteristics of the organisms are summarized in Figure 3.
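The two summary statistics used in this comparison, G+C content and coding density, can be illustrated with a minimal sketch. It is not the procedure used to build Table 1 (those values were retrieved from public databases); the function names and the toy inputs are hypothetical, and coding density is assumed here to mean the percentage of genome positions covered by at least one CDS.

```python
def gc_content(sequence):
    """Return the G+C percentage of a nucleotide sequence.

    Only unambiguous bases (A, C, G, T) are counted in the denominator.
    """
    seq = sequence.upper()
    counted = sum(seq.count(base) for base in "ACGT")
    if counted == 0:
        raise ValueError("sequence contains no A, C, G or T")
    return 100.0 * (seq.count("G") + seq.count("C")) / counted


def coding_density(genome_length, cds_intervals):
    """Return the percentage of the genome covered by at least one CDS.

    Intervals are (start, end) pairs, 1-based and inclusive. Overlapping
    CDSs are merged so that shared positions are counted only once.
    """
    covered = 0
    last_end = 0  # right-most position already counted
    for start, end in sorted(cds_intervals):
        if end <= last_end:
            continue  # entirely inside a region that was already counted
        covered += end - max(start, last_end + 1) + 1
        last_end = end
    return 100.0 * covered / genome_length


# Toy example (hypothetical values, not taken from Table 1).
print(gc_content("ATGC"))
print(coding_density(100, [(1, 10), (5, 20), (31, 40)]))
```

Merging overlapping intervals matters in compact bacterial genomes, where adjacent genes frequently overlap by a few bases.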

Pangenome analysis

Figure 4 summarizes the results from the individual pangenomes of C. burnetii, Legionella, L. pneumophila and F. tularensis as well as the set of all thirty genomes analyzed. These genomes are characterized by a fairly high number of hypothetical proteins, for which annotation is still incomplete. Genes belonging to the core and dispensable genomes have been classified according to their predicted function based on COG and KEGG categories for the respective pangenomes (Additional file 1). The C. burnetii pangenome is closed, as we found a finite number of gene clusters. The L. pneumophila pangenome is open (unlimited) because the number of pangenome clusters and core genome clusters changed depending on how many different genomes were included in the analysis. The F. tularensis pangenome was on the borderline between being considered an open or a closed pangenome (Additional file 2).
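The open versus closed distinction rests on how the pangenome and core sizes change as genomes are added. The sketch below illustrates that idea on toy data; it is an assumption-laden illustration rather than the analysis performed in this study. The function name, the genome labels and the cluster identifiers are hypothetical, and a real analysis would usually average the curve over many genome orderings.

```python
def accumulation_curve(genome_clusters, order):
    """Track pangenome and core sizes as genomes are added one at a time.

    genome_clusters maps a genome name to the set of ortholog cluster IDs
    found in that genome. Returns a list of (pangenome_size, core_size)
    tuples, one per genome added, following the given order.
    """
    pan = set()
    core = None
    curve = []
    for name in order:
        clusters = genome_clusters[name]
        pan |= clusters  # union: every cluster seen so far
        core = set(clusters) if core is None else core & clusters  # intersection
        curve.append((len(pan), len(core)))
    return curve


# Toy data (hypothetical): three genomes described by cluster IDs.
toy = {
    "A": {1, 2, 3},
    "B": {1, 2, 4},
    "C": {1, 2, 3, 4},
}
for step, (pan_size, core_size) in enumerate(accumulation_curve(toy, ["A", "B", "C"]), 1):
    print(step, pan_size, core_size)
```

A pangenome size that plateaus as genomes are added is consistent with a closed pangenome, whereas one that keeps growing is consistent with an open pangenome.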

The Coxiella burnetii pangenome

The C. burnetii pangenome consists of 6,871 CDSs with 1,080 core genes (92.04%) and 491 dispensable genes (7.15%) (Additional file 3). A total of 56 genes were specific to the C. burnetii CbuG_Q212 (6), C. burnetii CbuK_Q154 (6), C. burnetii Dugway 5J108-111 (34), C. burnetii RSA 331 (9) and C. burnetii RSA 493 (1) genomes. Notably, 70 out of these 491 accessory genes (14.25%) were hypothetical proteins. Of the 1,080 genes belonging to the core genome, 956 (88.6%) were attributed to a COG category, and 510 (47.3%) were attributed to a KEGG category. In the case of the 491 dispensable genes, 421 (85.7%) were assigned to a COG category, and 185 (37.6%) were assigned to a KEGG category. Using the COG database, we identified minor differences between the compartments in the defense mechanisms (V) and intracellular trafficking, secretion and vesicular transport (U) categories. Using the KEGG database, we found that C. burnetii Dugway 5J108-111 contains a greater number of CDSs involved in environmental information processing and metabolism than other strains. The core genome represented 92% of the pangenome (Additional file 2), showing again the high rate of conservation.

The Legionella pangenome

The Legionella pangenome consists of 23,736 CDSs with 1,410 core genes (82.44%) and a dispensable genome of 3,791 CDSs (15.97%) (Additional file 3). A total of 378 genes were specific to the L. pneumophila str. Lens (14), L. pneumophila str. Paris (20), L. pneumophila 2300/99 Alcoy (7), L. pneumophila subsp. pneumophila HL06041035 (21), L. pneumophila subsp. pneumophila str. Lorraine (8), L. pneumophila str. Corby (6), L. pneumophila subsp. pneumophila str. Philadelphia 1 (3), L. pneumophila subsp. pneumophila LPE509 (1) and L. pneumophila subsp. pneumophila str. Thunder Bay (3) genomes. Of these 378 unique genes, 295 (78.04%) were present in L. longbeachae strain NSW150. Of the 1,410 genes belonging to the core genome, 1,316 (93.4%) were attributed to a COG category and 688 (48.8%) were attributed to a KEGG category. In the case of the 3,791 dispensable genes, 3,273 (86.3%) were attributed to a COG category and 1,464 (38.6%) were attributed to a KEGG category. We observed several differences in the CDSs from the cell wall/membrane/envelope biogenesis (M) COG category. The Legionella genomes have a greater number of CDSs involved in membrane transport and signal transduction (based on KEGG categories), which are associated with environmental information processing. In particular, L. longbeachae strain NSW150 has a greater number of genes associated with energy production and conversion (C), signal transduction (T), and defense mechanisms (V) and fewer genes related to cell motility (N), based on COG categories. Significant differences were observed in the number of CDSs associated with cellular processes, particularly flagellar assembly, which is important for cell motility, and with carbohydrate and energy metabolism.

The Legionella pneumophila pangenome

The L. pneumophila pangenome consists of 21,459 CDSs with a core genome of 1,572 genes (90.71%) and a dispensable genome of 1,881 CDSs (8.77%) (Additional file 3). A total of 112 genes were specific to the L. pneumophila str. Lens (20), L. pneumophila str. Paris (27), L. pneumophila 2300/99 Alcoy (7), L. pneumophila subsp. pneumophila HL06041035 (26), L. pneumophila subsp. pneumophila str. Lorraine (15), L. pneumophila str. Corby (9), L. pneumophila subsp. pneumophila str. Philadelphia 1 (4), L. pneumophila subsp. pneumophila LPE509 (1) and L. pneumophila subsp. pneumophila str. Thunder Bay (3) genomes. Of the 1,572 genes belonging to the core genome, 1,465 (93.2%) were attributed to a COG category, and 760 (48.4%) were attributed to a KEGG category. In the case of the 1,881 dispensable genes, 1,524 (81%) were attributed to a COG category, and 661 (35.14%) were attributed to a KEGG category. We identified differences in the cell wall/membrane/envelope biogenesis (M) category based on the CDSs for which a function could be identified using the COG database. We found that a greater number of CDSs are involved in signal transduction (the bacterial secretion system and the two-component system, which are associated with environmental information processing) and translation (ribosomal elements that are associated with genetic information processing). We did not observe any differences in the cellular processes category.

The Francisella tularensis pangenome

The F. tularensis pangenome consists of 16,596 CDSs with a core of 1,010 genes (86.05%) and a dispensable genome of 2,297 CDSs (13.84%) (Additional file 3). A total of 18 genes were specific to the F. tularensis subsp. holarctica F92 (1), F. tularensis subsp. holarctica LVS (1), F. tularensis subsp. holarctica OSU18 (4), F. tularensis subsp. mediasiatica FSC147 (3), F. tularensis subsp. tularensis NE061598 (4) and F. tularensis subsp. tularensis WY96-3418 (5) genomes. Of the 1,010 genes belonging to the core genome, 775 (76.8%) were attributed to a COG category, and 415 (41.1%) were attributed to a KEGG category. In the case of the 2,297 dispensable genes, 1,881 (81.8%) were attributed to a COG category, and 886 (38.5%) were attributed to a KEGG category. We observed a greater number of CDSs involved in information storage and processing (translation, ribosomal structure and biogenesis (J); and replication, recombination and repair (L)) and metabolism (amino acid transport and metabolism (E), carbohydrate transport and metabolism (G) and inorganic ion transport and metabolism (P)). We found that F. tularensis subsp. holarctica F92, F. tularensis subsp. holarctica LVS and F. tularensis subsp. holarctica FTNF002-00 have a greater number of CDSs involved in replication, recombination and repair (L) compared to other F. tularensis genomes.

The Gammaproteobacteria pangenome

The Gammaproteobacteria pangenome consists of 49,833 CDSs with a core of 627 genes (47.16%) and a dispensable genome of 25,933 genes (52.04%) (Additional file 4, Figure 5). The organisms that share the greatest number of core genes are as follows: 618 out of 627 in Legionella strains, 617 genes in L. pneumophila strains, 578 genes in F. tularensis strains and 580 genes in C. burnetii strains. The organisms that share the greatest number of dispensable genes are as follows: 13,640 out of 25,933 in Legionella strains (52.6%), 12,458 in L. pneumophila strains (48.04%), 8,048 in F. tularensis strains (31.03%) and 3,268 in C. burnetii strains (12.6%). A total of 400 genes were specific to the C. burnetii (28), F. tularensis (9), Legionella (272), L. pneumophila (62), R. grylli (42) and D. massiliensis strain 20B (49) genomes. Of the 627 genes belonging to the core genome, 594 (94.8%) were attributed to a COG category, and 342 (54.6%) were attributed to a KEGG category. Among the 25,933 dispensable genes, 22,402 (86.4%) were attributed to a COG category, and 10,484 (40.4%) were attributed to a KEGG category. In the core genome, we observed differences in the number of CDSs involved in metabolism (based on COG categories), namely in energy production and conversion (C), amino acid transport and metabolism (E) and coenzyme transport and metabolism (H). In the dispensable genome, the greatest number of CDSs was associated with amino acid transport and metabolism (E). A similar functional distribution was found in the set of dispensable genes based on KEGG categories, in that a greater number of CDSs were associated with metabolism categories but fewer were associated with folding, sorting and degradation, glycan biosynthesis and metabolism, replication and repair, and translation.

By analyzing 1,475 genes from D. massiliensis strain 20B using OrthoMCL, we identified a core genome of 908 genes (61.56%) and a dispensable genome of 518 genes (35.12%). The majority of the genes in the core genome were associated with COG categories contributing to metabolism (energy production and conversion (C) and coenzyme transport and metabolism (H)) and information storage and processing (translation, ribosomal structure and biogenesis (J); and replication, recombination and repair (L)). Based on KEGG category assignments, a greater number of CDSs in the core genome were associated with translation, cofactor and vitamin metabolism, nucleotide metabolism and carbohydrate metabolism. Genes associated with amino acid metabolism and carbohydrate metabolism were highly represented among the dispensable genes. Of the 49 unique genes identified, 15 encoded hypothetical proteins. Some specific genes were identified, including genes encoding a PhoPQ-activated pathogenicity-related protein, dehydrogenases, SAM-dependent methyltransferases, galactose mutarotase and others (Additional file 5). Based on KEGG categories, these unique genes were associated with metabolism, environmental information processing, genetic information processing, two-component systems and sulfur relay systems.
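As a quick consistency check, the three D. massiliensis strain 20B compartments reported above (908 core, 518 dispensable and 49 unique genes) add up to the 1,475 genes analyzed. The percentage for the unique genes is not stated in the text and is derived here from the same counts.

```latex
\begin{align*}
908 + 518 + 49 &= 1475 \\
\frac{908}{1475} &\approx 61.56\,\% \quad \text{(core)} \\
\frac{518}{1475} &\approx 35.12\,\% \quad \text{(dispensable)} \\
\frac{49}{1475}  &\approx 3.32\,\% \quad \text{(unique, derived)}
\end{align*}
```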

Phylogenomic analysis

A phylogenomic tree constructed based on gene content (i.e., the presence or absence of protein-coding genes, as predicted by COG and KEGG) showed different genome clustering than a whole genome tree (Figure 6). In the phylogenomic tree constructed based on COG classification, D. massiliensis strain 20B clustered with R. grylli and clustered closely with C. burnetii strains. In contrast, in the tree constructed based on KEGG classification, R. grylli formed a cluster with the C. burnetii strains, and D. massiliensis strain 20B was not included in this cluster. Based on all genes associated with cellular processes as determined by KEGG classification, D. massiliensis strain 20B clustered with four C. burnetii strains (C. burnetii CbuG_Q212, C. burnetii CbuK_Q154, C. burnetii Dugway 5J108-111 and C. burnetii RSA 493). Based on an analysis of COG categories, D. massiliensis strain 20B and R. grylli clustered closely with C. burnetii strains, with the exception of five COG categories. For cell cycle control, cell division, and chromosome partitioning (D), nucleotide transport and metabolism (F), coenzyme transport and metabolism (H), lipid transport and metabolism (I) and secondary metabolite biosynthesis, transport and catabolism (Q), D. massiliensis strain 20B and R. grylli clustered with the F. tularensis strains.

Discussion

Pangenomic studies were described by Tettelin et al. in 2005 [15]. These types of studies analyze bacterial species in detail using different criteria and can determine whether the nature of the pangenome is open or closed. C. burnetii, an obligate intracellular bacterium [1], has a closed pangenome with a core/pangenome ratio of 92% (Additional file 2) and a relatively constant set of core genes [30]. Another example of a gammaproteobacterium with a closed pangenome is Buchnera aphidicola [16], which has a core/pangenome ratio of 98%. In this study we analyzed the facultative intracellular bacteria L. pneumophila and F. tularensis, which have core/pangenome ratios of 82% and 87%, respectively. Although their ratios were very close to the threshold of 89%, both of these bacteria can be considered to have open pangenomes, although not infinite ones like that of E. coli [11]. Our results show that the combined pangenome including D. massiliensis is composed of 23,500 (47.1%) core genes, 13,399 (57%) genes shared by Legionella and C. burnetii, 12,114 (51.5%) genes shared by C. burnetii and F. tularensis, and 18,363 (78.1%) genes shared by Legionella and F. tularensis. Moreover, based on the phylogenomic trees we constructed, we conclude that D. massiliensis is more closely related to R. grylli than to C. burnetii. D. massiliensis and R. grylli shared 635 genes and clustered more often with C. burnetii. These results are in agreement with Pearson et al. [17], who showed that R. grylli is one of the closest known neighbors of C. burnetii.

We also observed differences in lifestyle among the species analyzed in this study. Pangenomic studies elucidate the link between gene content and bacterial lifestyles. An allopatric lifestyle is defined by a narrow ecological niche with restricted opportunities for acquiring DNA from other organisms.
An allopatric lifestyle can be associated with genome reduction, especially in pathogens that have smaller genomic repertoires compared to less specialized bacteria [18], smaller pangenomes and smaller mobilomes. In contrast, a sympatric lifestyle is associated with larger genomes, larger pangenomes, a larger mobilome and more frequent genomic exchanges with other bacteria. Moliner et al. [19, 20] described two different types of intracellular lifestyles: allopatric bacteria that are strictly intracellular bacteria and therefore live in narrow niches, and sympatric bacteria such as Legionella spp. that live in amoebas where DNA exchange can take place [19, 20]. The authors noted that intracellular bacteria living in amoebas generally have a larger genome, whereas other intracellular pathogens suffer from massive gene loss due to specialization. D. massiliensis, R. grylli, C. burnetii and F. tularensis have smaller genomes and exhibit losses of function compared to Legionella species.

Based on the comparison of G+C content and genome size and previous work by Merhej et al. [21], we identified three distinct lifestyles: D. massiliensis and R. grylli are extremely allopatric, C. burnetii and F. tularensis are allopatric and have very little interaction with other organisms, and Legionella are sympatric, as they live in amoebas. In addition, we compared the gene losses and gains (based on COG functional analysis) in the genomes analyzed in this study to those analyzed by Merhej et al. [21]. We found that the more specialized a bacterium is, the more genes it has related to transcriptional regulation (K), defense mechanisms (V), inorganic ion transport and metabolism (P) and amino acid metabolism (E), and the fewer genes it has related to translation (J). For all of these categories, we observed a considerable difference between Legionella spp. (more pronounced in L. longbeachae than in L. pneumophila strains) and the other bacteria. Moreover, for each of these categories, D. massiliensis and R. grylli have fewer genes with an assigned COG function. Based on KEGG classification, we also found that D. massiliensis and R. grylli show immense losses of genes related to amino acid metabolism. These results are in agreement with those obtained by Merhej et al. [21] and allowed us to divide the species analyzed in this study into three categories based on lifestyle. D. massiliensis and R. grylli are extremely allopatric species with fewer functional genes (as classified by COG), including a high loss of amino acid metabolism genes and a less severe loss of genes related to translation and transcription. The intermediate allopatric bacteria, C. burnetii and F. tularensis, have more functional genes compared to the extremely allopatric species. Sympatric bacteria such as Legionella, especially L. longbeachae, possess the greatest number of functional genes (as classified by COG and KEGG) compared to the other species analyzed in this study.

A comparative genomics-based analysis of free-living and host-dependent bacteria showed that intracellular bacteria contain fewer rRNA genes [21]. These bacterial genomes contained more split rRNA operons and fewer transcriptional regulators than other bacteria, which was linked to slower growth rates that are adaptive for their ecological niche [21]. The deletion or inactivation of certain genes renders several intracellular pathogens such as Shigella, Salmonella, and F. tularensis pathogenic. These genes are referred to as antivirulence genes [22]. A recent study in Bartonella birtlesii identified a deletion in one of the two rRNA operons and disrupted genes associated with translation that are important for specialization to a specific niche [23]. The number of activated genes in a restricted environment is much lower than in a changing environment, as genes involved in translation are not expressed extensively [23]. If bacteria do not typically express ribosomal operons in their respective environments, then these operons are subject to loss [23]. Bacterial specialization involves a striking degree of gene loss, including decreased gene numbers, changes in G+C content and decreased numbers of both incomplete and intact ribosomal operons [21, 24, 25]. Restricting translation is critical for specialization, as speciation is often correlated with ribosomal operon inactivation [21, 23] and gene inactivation.

Conclusion

This study of intracellular Gammaproteobacteria has contributed to our understanding of bacterial specialization based on the ecological niche. The genome size and gene content of the bacteria are associated with lifestyle. A smaller number of genes and a relatively low G+C content were observed in the genomes analyzed here, similar to other studies of intracellular bacteria [18]. Gene loss resulting in a smaller genome size has been a driving force in the adaptation of these bacteria to their hosts. Due to the reduction in the genomic repertoire, we speculate that fewer lateral gene transfers occur in D. massiliensis compared to other intracellular bacteria [26]. We used a multi-genus pangenomic approach to characterize the genomic repertoire of representative strains and compare the distribution of genes in D. massiliensis strain 20B with other genomes. We found that the majority of the genes in D. massiliensis strain 20B were shared with other Gammaproteobacteria. A pangenomic approach facilitates the exploration of different strategies by which facultative or obligate intracellular bacteria adapt to particular hosts and contributes significantly to our understanding of genome repertoires. This approach can be used to uncover unique genomic features that cannot be predicted by conventional methods. Moreover, our results suggest that the Legionella strains could be re-classified based on their genomic variability.

Methods

Determination of genomic data

For the genomic comparison, we used thirty sequenced species including five C. burnetii strains, ten L. pneumophila strains, L. longbeachae strain NSW150 [27], twelve F. tularensis strains, Rickettsiella grylli and D. massiliensis strain 20B from the Gammaproteobacteria class. The information related to genome properties (genome size, coding regions, G+C content, total number of genes, RNA-coding genes, protein-coding genes, genes with a predicted function, genes assigned to Clusters of Orthologous Groups of proteins (COGs), genes with peptide signals and genes with transmembrane helices) was retrieved from NCBI (ftp://ftp.ncbi.nih.gov/genomes/Bacteria/) and IMG/ER [28] (https://img.jgi.doe.gov) (Table 1). Open reading frames (ORFs) were predicted for the draft genome using Prodigal [29] with default parameters, and predicted ORFs that spanned a sequencing gap were excluded. The predicted bacterial protein sequences were searched against the GenBank database [30] and the Clusters of Orthologous Groups (COG) database using BLASTP (E-value of 10^-5 and coverage ≥ 70%).

Pangenome analysis

All of the CDSs from each genome were pooled together and clustered using OrthoMCL [32] with the following parameters: an overlap of at least 70% and a minimum of 80% similarity. Only protein sequences longer than 50 amino acids were considered for further analysis. Homologous sequences were selected using the all-against-all BLASTp algorithm with an E-value of less than 0.00001. Then, the clustering of the orthologous sequences was analyzed using the Markov Cluster algorithm, which is based on probability and graph flow theory and allows the simultaneous classification of global relationships in a similarity space [32]. An inflation index of 1.5 was used to regulate cluster tightness (granularity), and the resulting clustered ortholog groups were analyzed further. Several Perl/Python scripts were compiled in our laboratory for massive data handling, namely for the calculation of core set (shared among all strains), dispensable set (shared between at least two strains) and unique set (organism-specific) genes from the OrthoMCL results. Functional annotation was derived using WebMGA [33] against the Clusters of Orthologous Groups [34] and the Kyoto Encyclopedia of Genes and Genomes [35].
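The laboratory scripts mentioned above are not reproduced in this manuscript. The following is only a minimal sketch of how ortholog groups could be sorted into core, dispensable and unique sets. It assumes group lines of the form `group_id: genome|gene genome|gene ...`; the function name and the toy identifiers are hypothetical. Genes left unclustered are not listed in such a file and would need to be added to the unique set separately.

```python
from collections import Counter


def classify_groups(group_lines, n_genomes):
    """Label each ortholog group as 'core', 'dispensable' or 'unique'.

    core        : members found in all n_genomes genomes
    dispensable : members found in at least two, but not all, genomes
    unique      : all members come from a single genome
    """
    labels = {}
    for line in group_lines:
        line = line.strip()
        if not line:
            continue
        group_id, members = line.split(":", 1)
        # The genome label is the part of each member before the first '|'.
        genomes = {member.split("|", 1)[0] for member in members.split()}
        if len(genomes) == n_genomes:
            label = "core"
        elif len(genomes) >= 2:
            label = "dispensable"
        else:
            label = "unique"
        labels[group_id.strip()] = label
    return labels


# Toy input (hypothetical identifiers) for three genomes A, B and C.
toy_groups = [
    "g1: A|a1 B|b1 C|c1",
    "g2: A|a2 B|b2",
    "g3: C|c3 C|c4",
]
labels = classify_groups(toy_groups, n_genomes=3)
print(Counter(labels.values()))
```

Counting distinct genomes rather than members keeps paralogs within one genome (as in `g3`) from being mistaken for shared genes.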

Genome alignment and gene content-based phylogenomics

The backbone output file generated by MAUVE [36] after global genome alignment was used to calculate the proportion of the core relative to the pangenome size [37]. This core/pangenome ratio is used to determine whether a pangenome is open or closed. The gene content of the genomes was classified based on twenty-five functional COG categories and was used to construct phylogenomic trees. The gene content was converted to a matrix of discrete binary characters ("0" and "1" for absence and presence, respectively) [38] and used to construct the matrix of Euclidean distances between pairs of points. The MEV (MultiExperiment Viewer) [39] was used to represent the results visually. The G+C content and COG data were compared with previous work performed by Merhej et al. [21].
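The conversion of gene content to binary characters and the calculation of pairwise Euclidean distances can be sketched as follows. This is an illustrative example on toy data, not the pipeline used in this study. The function names, genome labels and feature names are hypothetical, and the clustering and visualization step performed in MEV is not reproduced.

```python
import math
from itertools import combinations


def presence_absence(genome_features):
    """Build binary profiles (1 = present, 0 = absent) for each genome.

    genome_features maps a genome name to the set of features (genes or
    functional categories) it contains. Returns the sorted feature list
    and a dict mapping each genome to its 0/1 profile in that order.
    """
    features = sorted(set().union(*genome_features.values()))
    profiles = {
        genome: [1 if feature in present else 0 for feature in features]
        for genome, present in genome_features.items()
    }
    return features, profiles


def pairwise_euclidean(profiles):
    """Return Euclidean distances for every unordered pair of genomes."""
    return {
        (a, b): math.dist(profiles[a], profiles[b])
        for a, b in combinations(sorted(profiles), 2)
    }


# Toy input (hypothetical feature names).
toy_content = {
    "A": {"x", "y"},
    "B": {"x", "z"},
    "C": {"x", "y"},
}
features, profiles = presence_absence(toy_content)
distances = pairwise_euclidean(profiles)
for pair, value in distances.items():
    print(pair, round(value, 3))
```

On binary profiles, the squared Euclidean distance equals the number of features present in one genome but absent from the other, so genomes with identical gene content are at distance zero.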

List of abbreviations

PCR: polymerase chain reaction; KEGG: Kyoto Encyclopedia of Genes and Genomes; COG: clusters of orthologous groups; CDS: coding sequence

Competing interests and funding

The authors declare that they have no competing interests.

Authors' contributions

DR designed the research project. MJM performed the genomic analysis. MJM and DR analyzed the data. MJM and LR wrote the paper. DR revised the paper. All authors read and approved the final version.

Acknowledgements

We would like to thank Roshan Padmanabhan for technical support, suggestions and corrections, and Ripsy Merrin Chacko for helpful remarks.

References

1. Beare PA, Unsworth N, Andoh M, Voth DE, Omsland A, Gilk SD, Williams KP, Sobral BW, Kupko JJ, 3rd, Porcella SF, et al: Comparative genomics reveal extensive transposon-mediated genomic plasticity and diversity among potential effector proteins within the genus Coxiella. Infection and immunity 2009, 77:642-656.
2. Roux V, Bergoin M, Lamaze N, Raoult D: Reassessment of the taxonomic position of Rickettsiella grylli. International journal of systematic bacteriology 1997, 47:1255-1257.
3. Mediannikov O, Sekeyova Z, Birg ML, Raoult D: A novel obligate intracellular gamma-proteobacterium associated with ixodid ticks, Diplorickettsia massiliensis, Gen. Nov., Sp. Nov. PloS one 2010, 5:e11478.
4. Parola P, Raoult D: Ticks and tickborne bacterial diseases in humans: an emerging infectious threat. Clinical infectious diseases : an official publication of the Infectious Diseases Society of America 2001, 32:897-928.
5. Beckstrom-Sternberg SM, Auerbach RK, Godbole S, Pearson JV, Beckstrom-Sternberg JS, Deng Z, Munk C, Kubota K, Zhou Y, Bruce D, et al: Complete genomic characterization of a pathogenic A.II strain of Francisella tularensis subspecies tularensis. PloS one 2007, 2:e947.
6. Mathew MJ, Subramanian G, Nguyen TT, Robert C, Mediannikov O, Fournier PE, Raoult D: Genome sequence of Diplorickettsia massiliensis, an emerging Ixodes ricinus-associated human pathogen. Journal of bacteriology 2012, 194:3287.
7. Subramanian G, Mediannikov O, Angelakis E, Socolovschi C, Kaplanski G, Martzolff L, Raoult D: Diplorickettsia massiliensis as a human pathogen. European journal of clinical microbiology & infectious diseases : official publication of the European Society of Clinical Microbiology 2012, 31:365-369.
8. Peterson J, Garges S, Giovanni M, McInnes P, Wang L, Schloss JA, Bonazzi V, McEwen JE, Wetterstrand KA, Deal C, et al: The NIH Human Microbiome Project. Genome research 2009, 19:2317-2323.
9. Hu B, Xie G, Lo CC, Starkenburg SR, Chain PS: Pathogen comparative genomics in the next-generation sequencing era: genome alignments, pangenomics and metagenomics. Briefings in functional genomics 2011, 10:322-333.
10. Medini D, Donati C, Tettelin H, Masignani V, Rappuoli R: The microbial pan-genome. Current opinion in genetics & development 2005, 15:589-594.
11. Rasko DA, Rosovitz MJ, Myers GS, Mongodin EF, Fricke WF, Gajer P, Crabtree J, Sebaihia M, Thomson NR, Chaudhuri R, et al: The pangenome structure of Escherichia coli: comparative genomic analysis of E. coli commensal and pathogenic isolates. Journal of bacteriology 2008, 190:6881-6893.
12. Rocha EP: Evolutionary patterns in prokaryotic genomes. Current opinion in microbiology 2008, 11:454-460.
13. D'Auria G, Jimenez-Hernandez N, Peris-Bondia F, Moya A, Latorre A: Legionella pneumophila pangenome reveals strain-specific virulence factors. BMC Genomics 2010, 11:181.
14. Wren BW: Microbial genome analysis: insights into virulence, host adaptation and evolution. Nature reviews Genetics 2000, 1:30-39.
15. Medini D, Donati C, Tettelin H, Masignani V, Rappuoli R: The microbial pan-genome. Curr Opin Genet Dev 2005, 15:589-594.
16. Snipen L, Almoy T, Ussery DW: Microbial comparative pan-genomics using binomial mixture models. BMC Genomics 2009, 10:385.
17. Pearson T, Hornstra HM, Sahl JW, Schaack S, Schupp JM, Beckstrom-Sternberg SM, O'Neill MW, Priestley RA, Champion MD, Beckstrom-Sternberg JS, et al: When Outgroups Fail; Phylogenomics of Rooting the Emerging Pathogen, Coxiella burnetii. Syst Biol 2013, 62:752-762.
18. Georgiades K, Merhej V, El Karkouri K, Raoult D, Pontarotti P: Gene gain and loss events in Rickettsia and Orientia species. Biol Direct 2011, 6:6.
19. Gimenez G, Bertelli C, Moliner C, Robert C, Raoult D, Fournier PE, Greub G: Insight into cross-talk between intra-amoebal pathogens. BMC Genomics 2011, 12:542.
20. Moliner C, Fournier PE, Raoult D: Genome analysis of microorganisms living in amoebae reveals a melting pot of evolution. FEMS Microbiol Rev 2010, 34:281-294.
21. Merhej V, Royer-Carenzi M, Pontarotti P, Raoult D: Massive comparative genomic analysis reveals convergent evolution of specialized bacteria. Biol Direct 2009, 4:13.
22. Bliven KA, Maurelli AT: Antivirulence genes: insights into pathogen evolution through gene loss. Infection and immunity 2012, 80:4061-4070.
23. Rolain JM, Vayssier-Taussat M, Saisongkorh W, Merhej V, Gimenez G, Robert C, Le Rhun D, Dehio C, Raoult D: Partial disruption of translational and posttranslational machinery reshapes growth rates of Bartonella birtlesii. MBio 2013, 4:e00115-00113.
24. Moran NA, Wernegreen JJ: Lifestyle evolution in symbiotic bacteria: insights from genomics. Trends Ecol Evol 2000, 15:321-326.
25. Andersson JO, Andersson SG: Insights into the evolutionary process of genome degradation. Current opinion in genetics & development 1999, 9:664-671.
26. Audic S, Robert C, Campagna B, Parinello H, Claverie JM, Raoult D, Drancourt M: Genome analysis of Minibacterium massiliensis highlights the convergent evolution of water-living bacteria. PLoS Genet 2007, 3:e138.
27. Cazalet C, Gomez-Valero L, Rusniok C, Lomma M, Dervins-Ravault D, Newton HJ, Sansom FM, Jarraud S, Zidane N, Ma L, et al: Analysis of the Legionella longbeachae genome and transcriptome uncovers unique strategies to cause Legionnaires' disease. PLoS Genet 2010, 6:e1000851.
28. Markowitz VM, Chen IM, Palaniappan K, Chu K, Szeto E, Grechkin Y, Ratner A, Jacob B, Huang J, Williams P, et al: IMG: the Integrated Microbial Genomes database and comparative analysis system. Nucleic acids research 2012, 40:D115-122.
29. Hyatt D, Chen GL, Locascio PF, Land ML, Larimer FW, Hauser LJ: Prodigal: prokaryotic gene recognition and translation initiation site identification. BMC bioinformatics 2010, 11:119.
30. Benson DA, Karsch-Mizrachi I, Clark K, Lipman DJ, Ostell J, Sayers EW: GenBank. Nucleic acids research 2012, 40:D48-53.
31. Benson G: Tandem repeats finder: a program to analyze DNA sequences. Nucleic acids research 1999, 27:573-580.
32. Li L, Stoeckert CJ, Jr., Roos DS: OrthoMCL: identification of ortholog groups for eukaryotic genomes. Genome research 2003, 13:2178-2189.
33. Wu S, Zhu Z, Fu L, Niu B, Li W: WebMGA: a customizable web server for fast metagenomic sequence analysis. BMC Genomics 2011, 12:444.
34. Tatusov RL, Fedorova ND, Jackson JD, Jacobs AR, Kiryutin B, Koonin EV, Krylov DM, Mazumder R, Mekhedov SL, Nikolskaya AN, et al: The COG database: an updated version includes eukaryotes. BMC bioinformatics 2003, 4:41.
35. Moriya Y, Itoh M, Okuda S, Yoshizawa AC, Kanehisa M: KAAS: an automatic genome annotation and pathway reconstruction server. Nucleic acids research 2007, 35:W182-185.
36. Darling AE, Mau B, Perna NT: progressiveMauve: multiple genome alignment with gene gain, loss and rearrangement. PloS one 2010, 5:e11147.
37. Sheppard SK, Didelot X, Jolley KA, Darling AE, Pascoe B, Meric G, Kelly DJ, Cody A, Colles FM, Strachan NJ, et al: Progressive genome-wide introgression in agricultural Campylobacter coli. Mol Ecol 2013, 22:1051-1064.
38. Pellegrini M, Marcotte EM, Thompson MJ, Eisenberg D, Yeates TO: Assigning protein functions by comparative genome analysis: protein phylogenetic profiles. Proceedings of the National Academy of Sciences of the United States of America 1999, 96:4285-4288.
39. Saeed AI, Sharov V, White J, Li J, Liang W, Bhagabati N, Braisted J, Klapa M, Currier T, Thiagarajan M, et al: TM4: a free, open-source system for microarray data management and analysis. BioTechniques 2003, 34:374-378.

Figure legends

Figure 1: Protein sequence length distributions. Each organism is represented by a different color and symbol.

Figure 2: Distribution of GC content (%) and genome size (Mb), represented in red and blue, respectively.

Figure 3: KEGG characteristics by category for the organisms considered in the analysis.

Figure 4: Summary of the pangenome comparison. The results were obtained by comparing the five complete genomes of pathogenic C. burnetii strains, eleven complete genomes of pathogenic Legionella strains, ten complete genomes of pathogenic L. pneumophila strains, eleven complete genomes of pathogenic Francisella tularensis strains and all thirty complete genomes, in relation to their ortholog/accessory gene distribution. The middle circle represents the number of core functions and each petal corresponds to the number of accessory functions.

Figure 5: Distribution of the accessory functions in the whole set (30 organisms). The middle circle represents the number of core functions and each petal corresponds to the number of accessory functions.

Figure 6: Phylogenomic analysis based on COG and KEGG information, with clustering based on the Euclidean distance method.
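As a rough illustration of the distance measure named in the Figure 6 legend, the sketch below computes pairwise Euclidean distances between functional-category profiles and reports the closest pair of genomes. The genome names, the counts and the helper functions are hypothetical placeholders and are not taken from this study, which may have relied on a dedicated clustering tool.

```python
from itertools import combinations
from math import dist  # Euclidean distance between two points (Python 3.8+)

# Hypothetical counts per functional category (e.g. COG classes) for three genomes.
# Illustrative placeholders only, not values from the thesis.
profiles = {
    "genome_A": [120, 85, 40, 10],
    "genome_B": [118, 90, 42, 12],
    "genome_C": [60, 30, 15, 55],
}

def pairwise_euclidean(profiles):
    """Return {(name1, name2): distance} for every unordered pair of profiles."""
    return {
        (a, b): dist(profiles[a], profiles[b])
        for a, b in combinations(sorted(profiles), 2)
    }

def closest_pair(profiles):
    """Return the pair of genomes separated by the smallest Euclidean distance."""
    distances = pairwise_euclidean(profiles)
    return min(distances, key=distances.get)

if __name__ == "__main__":
    for pair, d in pairwise_euclidean(profiles).items():
        print(pair, round(d, 2))
    print("closest:", closest_pair(profiles))
```

A hierarchical clustering would repeatedly merge the closest pair found this way; the sketch stops at the distance step to stay short.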

[Figures 1 to 6]

Table 1 - General characteristics of the organisms considered for the analysis

| Organism | Niche | Chrs | Plasmids | Size (Mb) | GC% | Gene | Protein | Coding density | Accession number | PMID |
|---|---|---|---|---|---|---|---|---|---|---|
| Diplorickettsia massiliensis 20B | Ticks | 1 | - | 1.73 | 39.3 | 2333 | 2287 | 79.56 | AJGC00000000 | 22628513 |
| Rickettsiella grylli | Insect | 1 | - | 1.58 | 37.8 | 1557 | 1410 | 90.86 | AAQJ02000001.1 | 5753287 |
| Coxiella burnetii Dugway 5J108-111 | Animals | 1 | 1 | 2.21 | 42.3 | 2362 | 2045 | 82.04 | NC_009727.1 | 19047403 |
| Coxiella burnetii CbuG_Q212 | Animals | 1 | - | 2.01 | 42.6 | 2091 | 1866 | 77.55 | NC_011527.1 | 19047403 |
| Coxiella burnetii CbuK_Q154 | Animals | 1 | 1 | 2.1 | 42.6 | 2183 | 1942 | 77.69 | NC_011528.1 | 19047403 |
| Coxiella burnetii RSA 331 | Animals | 1 | 1 | 2.05 | 42.7 | 2278 | 1975 | 78.43 | NC_010117.1 | 19047403 |
| Coxiella burnetii RSA 493 | Animals | 1 | 1 | 2.03 | 42.6 | 2095 | 1847 | 78.00 | NC_002971.3 | 19047403 |
| Francisella tularensis subsp. tularensis SCHU S4 | Ticks+Amoeba | 1 | - | 1.89 | 32.3 | 1852 | 1604 | 79.17 | NC_006570.2 | 15640799 |
| Francisella tularensis subsp. tularensis TIGB03 | Ticks+Amoeba | 1 | - | 1.97 | 32.3 | 1850 | 1624 | 76.76 | NC_016933.1 | 22535949 |
| Francisella tularensis subsp. holarctica F92 | Ticks+Amoeba | 1 | - | 1.89 | 32.2 | 1890 | 1842 | 80.55 | NC_019537.1 | 23405342 |
| Francisella tularensis subsp. holarctica FSC200 | Ticks+Amoeba | 1 | - | 1.89 | 32.2 | 1810 | 1438 | 71.29 | NC_019551.1 | 23209222 |
| Francisella tularensis subsp. holarctica FTNF002-00 | Ticks+Amoeba | 1 | - | 1.89 | 32.2 | 1887 | 1581 | 76.33 | NC_009749.1 | 19756146 |
| Francisella tularensis subsp. holarctica LVS | Ticks+Amoeba | 1 | - | 1.9 | 32.2 | 2020 | 1754 | 82.40 | NC_007880.1 | 15780452 |
| Francisella tularensis subsp. holarctica OSU18 | Ticks+Amoeba | 1 | - | 1.9 | 32.2 | 1932 | 1555 | 74.62 | NC_008369.1 | 16980500 |
| Francisella tularensis subsp. mediasiatica FSC147 | Ticks+Amoeba | 1 | - | 1.89 | 32.3 | 1750 | 1406 | 71.94 | NC_010677.1 | 19521508 |
| Francisella tularensis subsp. tularensis FSC198 | Ticks+Amoeba | 1 | - | 1.89 | 32.3 | 1852 | 1605 | 79.21 | NC_008245.1 | 17406676 |
| Francisella tularensis subsp. tularensis NE061598 | Ticks+Amoeba | 1 | - | 1.89 | 32.3 | 1888 | 1836 | 82.23 | NC_017453.1 | 20140244 |
| Francisella tularensis subsp. tularensis TI0902 | Ticks+Amoeba | 1 | - | 1.89 | 32.3 | 1764 | 1544 | 76.58 | NC_016937.1 | 22535949 |
| Francisella tularensis subsp. tularensis WY96-3418 | Ticks+Amoeba | 1 | - | 1.9 | 32.3 | 1872 | 1634 | 80.04 | NC_009257.1 | 17895988 |
| Legionella pneumophila subsp. pneumophila str. Philadelphia 1 | Amoeba | 1 | - | 3.4 | 38.3 | 3003 | 2943 | 88.49 | NC_002942.5 | 15448271 |
| Legionella pneumophila str. Paris | Amoeba | 1 | 1 | 3.64 | 38.4 | 3278 | 3166 | 87.19 | NC_006368.1 | 15467720 |
| Legionella pneumophila 2300/99 Alcoy | Amoeba | 1 | - | 3.52 | 38.4 | 3243 | 3190 | 87.75 | NC_014125.1 | 20236513 |
| Legionella pneumophila str. Corby | Amoeba | 1 | - | 3.58 | 38.5 | 3257 | 3204 | 87.06 | NC_009494.2 | 17888731 |
| Legionella pneumophila str. Lens | Amoeba | 1 | 1 | 3.41 | 38.4 | 3058 | 2934 | 87.14 | NC_006369.1 | 15467720 |
| Legionella pneumophila subsp. pneumophila ATCC 43290 | Amoeba | 1 | - | 3.36 | 38.2 | 2981 | 2926 | 89.11 | NC_016811.1 | 22374950 |
| Legionella pneumophila subsp. pneumophila HL06041035 | Amoeba | 1 | - | 3.49 | 38.4 | 3184 | 3059 | 87.17 | NC_018140.1 | 22044686 |
| Legionella pneumophila subsp. pneumophila LPE509 | Amoeba | 1 | 1 | 3.51 | 38.3 | 3383 | 3331 | 88.66 | NC_020521.1 | 23792742 |
| Legionella pneumophila subsp. pneumophila str. Lorraine | Amoeba | 1 | 1 | 3.62 | 38.4 | 3327 | 3221 | 87.48 | NC_018139.1 | 22044686 |
| Legionella pneumophila subsp. pneumophila str. Thunder Bay | Amoeba | 1 | - | 3.46 | 38.2 | 3043 | 2998 | 88.04 | NC_021350.1 | 23826259 |
| Legionella longbeachae NSW150 | Amoeba | 1 | 1 | 4.15 | 37.1 | 3739 | 3470 | 84.73 | NC_013861.1 | 20174605 |

Additional file 1 - Functional analysis. COG and KEGG distribution within the core and dispensable compartments.

Additional file 2 - Pangenome summary of some species. The % column corresponds to the core/pan-genome ratio.

| Species | Genomes used | Niche | Average genome size | Pangenome size (bp) | Core genome size (bp) | % |
|---|---|---|---|---|---|---|
| Salmonella enterica | 20 | Animals | 4.8 Mb | 96520000 | 59960168 | 62 |
| Campylobacter jejuni | 14 | Human, chicken | 1.7 Mb | 23720000 | 18122022 | 76 |
| Helicobacter pylori | 10 | Human | 1.6 Mb | 16370000 | 12849693 | 78 |
| Haemophilus influenzae | 9 | Human | 1.8 Mb | 17170000 | 13728166 | 80 |
| Legionella pneumophila | 10 | Amoeba | 3.4 Mb | 34548036 | 28477841 | 82 |
| Francisella tularensis | 13 | Ticks, Amoeba | 1.8 Mb | 24690000 | 21468663 | 87 |
| Yersinia pestis | 12 | Rodents | 4.7 Mb | 55015109 | 48947637 | 89 |
| Coxiella burnetii | 5 | Animals | 2 Mb | 6690114 | 6150819 | 92 |
| Buchnera aphidicola | 8 | Aphid | 0.6 Mb | 5133548 | 5033068 | 98 |
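The % column of Additional file 2 can be checked with a few lines of code. The sketch below is a minimal illustration, assuming the ratio is simply the core genome size divided by the pangenome size, expressed as a percentage and rounded to the nearest integer; the function name `core_pan_ratio` is hypothetical.

```python
def core_pan_ratio(core_bp: int, pan_bp: int) -> int:
    """Core genome size as a percentage of the pangenome size, rounded to an integer."""
    if pan_bp <= 0:
        raise ValueError("pangenome size must be positive")
    return round(100 * core_bp / pan_bp)

# (pangenome size, core genome size) in bp, copied from three rows of Additional file 2.
rows = {
    "Salmonella enterica": (96520000, 59960168),
    "Legionella pneumophila": (34548036, 28477841),
    "Buchnera aphidicola": (5133548, 5033068),
}

if __name__ == "__main__":
    for species, (pan_bp, core_bp) in rows.items():
        print(species, core_pan_ratio(core_bp, pan_bp))
```

Under this assumption the three rows give 62, 82 and 98, matching the tabulated values.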

Additional file 3 - Individual pangenome summary based on OrthoMCL clustering. Corresponding information regarding core, accessory and unique genes in the organisms studied.

| Organisms | Proteins used by OrthoMCL | Core cluster | Accessory cluster | Core genes | Accessory genes | Unique genes | Total cluster | No group | Core (%) | Accessory (%) | Unique (%) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Pangenome Coxiella burnetii (5 genomes) | 6871 | 1080 | 6431 | 6324 | 491 | 56 | 1290 | 1993 | 92 | 7.15 | 0.82 |
| Coxiella burnetii CbuG_Q212 | 1359 | 1079 | 87 | 1264 | 89 | 6 | 1172 | 401 | 93 | 6.55 | 0.44 |
| Coxiella burnetii CbuK_Q154 | 1394 | 1079 | 106 | 1279 | 109 | 6 | 1191 | 394 | 91.8 | 7.82 | 0.43 |
| Coxiella burnetii Dugway 5J108-111 | 1414 | 1080 | 122 | 1256 | 124 | 34 | 1236 | 421 | 88.8 | 8.77 | 2.4 |
| Coxiella burnetii RSA 331 | 1351 | 1076 | 66 | 1276 | 66 | 9 | 1151 | 313 | 94.5 | 4.89 | 0.67 |
| Coxiella burnetii RSA 493 | 1353 | 1079 | 102 | 1249 | 103 | 1 | 1182 | 464 | 92.3 | 7.61 | 0.07 |
| Pangenome Francisella (12 genomes) | 16596 | 1010 | 280 | 14281 | 2297 | 18 | 1308 | 2329 | 86.1 | 13.84 | 0.11 |
| Francisella tularensis subsp. holarctica F92 | 1557 | 1008 | 209 | 1315 | 241 | 1 | 1218 | 192 | 84.5 | 15.48 | 0.06 |
| Francisella tularensis subsp. holarctica FSC200 | 1248 | 999 | 143 | 1104 | 144 | 0 | 1142 | 142 | 88.5 | 11.54 | 0 |
| Francisella tularensis subsp. holarctica FTNF002-00 | 1362 | 998 | 164 | 1193 | 169 | 0 | 1115 | 165 | 87.6 | 12.41 | 0 |
| Francisella tularensis subsp. holarctica LVS | 1534 | 1009 | 224 | 1271 | 262 | 1 | 1234 | 197 | 82.9 | 17.08 | 0.07 |
| Francisella tularensis subsp. holarctica OSU18 | 1299 | 1001 | 147 | 1147 | 148 | 4 | 1152 | 180 | 88.3 | 11.39 | 0.31 |
| Francisella tularensis subsp. mediasiatica FSC147 | 1243 | 1000 | 130 | 1109 | 131 | 3 | 1133 | 150 | 89.2 | 10.54 | 0.24 |
| Francisella tularensis subsp. tularensis FSC198 | 1369 | 1009 | 188 | 1181 | 188 | 0 | 1197 | 236 | 86.3 | 13.73 | 0 |
| Francisella tularensis subsp. tularensis NE061598 | 1512 | 1010 | 227 | 1269 | 239 | 4 | 1241 | 215 | 83.9 | 15.81 | 0.26 |
| Francisella tularensis subsp. tularensis SCHU S4 | 1368 | 1009 | 188 | 1180 | 188 | 0 | 1197 | 236 | 86.3 | 13.74 | 0 |
| Francisella tularensis subsp. tularensis TI0902 | 1322 | 1007 | 196 | 1126 | 196 | 0 | 1203 | 209 | 85.2 | 14.83 | 0 |
| Francisella tularensis subsp. tularensis TIGB03 | 1393 | 1007 | 197 | 1195 | 198 | 0 | 1204 | 216 | 85.8 | 14.21 | 0 |
| Francisella tularensis subsp. tularensis WY96-3418 | 1291 | 1007 | 192 | 1191 | 193 | 5 | 1204 | 191 | 92.3 | 14.95 | 0.39 |
| Pangenome Legionella (11 genomes) | 23736 | 1410 | 570 | 19567 | 3791 | 378 | 2358 | 1078 | 82.4 | 15.97 | 1.59 |
| L_ longbeachae NSW150 | 2277 | 1356 | 194 | 1724 | 258 | 295 | 1845 | 107 | 75.7 | 11.33 | 12.96 |
| L_pneumo_ 2300_99Alcoy | 2177 | 1404 | 376 | 1773 | 397 | 7 | 1787 | 101 | 81.4 | 18.24 | 0.32 |
| L_pneumo_ pneumophila_ATCC43290 | 2108 | 1398 | 261 | 1763 | 333 | 0 | 1731 | 95 | 83.6 | 15.8 | 0 |
| L_pneumo_ Pneumophila_HL | 2178 | 1400 | 342 | 1799 | 358 | 21 | 1763 | 96 | 82.6 | 16.44 | 0.96 |
| L_pneumo_ Pneumophila_Lorraine | 2162 | 1399 | 339 | 1812 | 342 | 8 | 1746 | 100 | 83.8 | 15.82 | 0.37 |
| L_pneumo_ str. Corby | 2179 | 1401 | 382 | 1780 | 393 | 6 | 1789 | 106 | 81.7 | 18.04 | 0.28 |
| L_pneumo_ subsp. pneumophila str193Philadelphia | 2145 | 1408 | 337 | 1800 | 342 | 3 | 1748 | 97 | 83.9 | 15.94 | 0.14 |
| L_pneumo_ subsp. pneumophila str576Philadelphia | 2109 | 1404 | 320 | 1783 | 325 | 1 | 1725 | 99 | 84.5 | 15.41 | 0.05 |
| L_pneumo_ Thunder Bay | 2167 | 1407 | 343 | 1814 | 350 | 3 | 1753 | 102 | 83.7 | 16.15 | 0.14 |
| L_ pneumophila str. Lens | 2071 | 1393 | 319 | 1727 | 330 | 14 | 1726 | 89 | 83.4 | 15.93 | 0.68 |
| L_ pneumophila str. Paris | 2163 | 1398 | 353 | 1780 | 363 | 20 | 1772 | 86 | 82.3 | 16.78 | 0.92 |
| Pangenome Legionella pneumophila (10 genomes) | 21459 | 1572 | 346 | 19466 | 1881 | 112 | 2030 | 971 | 90.7 | 8.77 | 0.52 |
| L_ pneumophila str. Lens | 2071 | 1553 | 153 | 1888 | 163 | 20 | 1726 | 89 | 91.2 | 7.87 | 0.97 |
| L_ pneumophila str. Paris | 2163 | 1561 | 184 | 1942 | 194 | 27 | 1772 | 86 | 89.8 | 8.97 | 1.25 |
| L_pneumo_ 2300_99Alcoy | 2177 | 1565 | 215 | 1936 | 234 | 7 | 1787 | 101 | 88.9 | 10.75 | 0.32 |
| L_pneumo_ pneumophila_ATCC43290 | 2108 | 1564 | 167 | 1937 | 171 | 0 | 1731 | 95 | 91.9 | 8.11 | 0 |
| L_pneumo_ Pneumophila_HL | 2178 | 1563 | 174 | 1962 | 190 | 26 | 1763 | 96 | 90.1 | 8.72 | 1.19 |
| L_pneumo_ Pneumophila_Lorraine | 2162 | 1562 | 169 | 1975 | 172 | 15 | 1746 | 100 | 91.4 | 7.96 | 0.69 |
| L_pneumo_ str. Corby | 2179 | 1564 | 216 | 1943 | 227 | 9 | 1789 | 106 | 89.2 | 10.42 | 0.41 |
| L_pneumo_ subsp. pneumophila str193Philadelphia | 2145 | 1570 | 174 | 1962 | 179 | 4 | 1748 | 97 | 91.5 | 8.34 | 0.19 |
| L_pneumo_ subsp. pneumophila str576Philadelphia | 2109 | 1566 | 158 | 1945 | 163 | 1 | 1725 | 99 | 92.2 | 7.73 | 0.05 |
| LPE L_pneumo_ Thunder Bay | 2167 | 1569 | 181 | 1976 | 188 | 3 | 1753 | 102 | 91.2 | 8.68 | 0.14 |
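The last three columns of Additional file 3 appear to be the shares of core, accessory and unique genes among the proteins submitted to OrthoMCL. The sketch below illustrates that reading for two per-genome rows; the function name `gene_percentages` and the rounding (one decimal for core, two for the others) are assumptions made for the example, not a description of how the table was produced.

```python
def gene_percentages(total, core, accessory, unique):
    """Return (core %, accessory %, unique %) of the proteins used for clustering."""
    if total <= 0:
        raise ValueError("total number of proteins must be positive")
    return (
        round(100 * core / total, 1),
        round(100 * accessory / total, 2),
        round(100 * unique / total, 2),
    )

# (proteins used, core genes, accessory genes, unique genes) from Additional file 3.
rows = {
    "Coxiella burnetii CbuG_Q212": (1359, 1264, 89, 6),
    "Francisella tularensis subsp. holarctica F92": (1557, 1315, 241, 1),
}

if __name__ == "__main__":
    for organism, counts in rows.items():
        print(organism, gene_percentages(*counts))
```

For these two rows the three gene counts also add up to the number of proteins used, which supports reading the columns as a partition of each proteome.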

Additional file 4 - Whole-set pangenome summary based on OrthoMCL clustering and corresponding information regarding core, accessory and unique genes in the organisms studied.

| Organisms | Proteins used by OrthoMCL | Core cluster | Accessory cluster | Core genes | Accessory genes | Unique genes | Total cluster | No group | Core (%) | Accessory (%) | Unique (%) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Whole (30 genomes) | 49833 | 627 | 2102 | 23500 | 25933 | 400 | 3130 | 5886 | 47.2 | 52.04 | 0.8 |
| Coxiella burnetii (5 genomes) | 6871 | 580 | 682 | 3575 | 3268 | 28 | 1290 | 1993 | 52 | 47.56 | 0.41 |
| Coxiella burnetii CbuG_Q212 | 1359 | 572 | 598 | 721 | 636 | 2 | 1172 | 401 | 53.1 | 46.8 | 0.15 |
| Coxiella burnetii CbuK_Q154 | 1394 | 574 | 616 | 725 | 668 | 1 | 1191 | 394 | 52 | 47.92 | 0.07 |
| Coxiella burnetii Dugway 5J108-111 | 1414 | 577 | 640 | 693 | 702 | 19 | 1236 | 421 | 49 | 49.65 | 1.34 |
| Coxiella burnetii RSA 331 | 1351 | 572 | 574 | 733 | 613 | 5 | 1151 | 313 | 54.3 | 45.37 | 0.37 |
| Coxiella burnetii RSA 493 | 1353 | 575 | 606 | 703 | 649 | 1 | 1182 | 464 | 52 | 47.97 | 0.07 |
| Francisella tularensis (12 genomes) | 16596 | 578 | 8048 | 8539 | 8048 | 9 | 1308 | 2329 | 51.5 | 48.49 | 0.05 |
| Francisella tularensis subsp. holarctica F92 | 1557 | 573 | 644 | 823 | 733 | 1 | 1218 | 192 | 52.9 | 47.08 | 0.06 |
| Francisella tularensis subsp. holarctica FSC200 | 1248 | 570 | 572 | 638 | 610 | 0 | 1142 | 142 | 51.1 | 48.88 | 0 |
| Francisella tularensis subsp. holarctica FTNF002-00 | 1362 | 565 | 550 | 640 | 584 | 0 | 1115 | 165 | 47 | 42.88 | 0 |
| Francisella tularensis subsp. holarctica LVS | 1534 | 576 | 657 | 775 | 758 | 1 | 1234 | 197 | 50.5 | 49.41 | 0.07 |
| Francisella tularensis subsp. holarctica OSU18 | 1299 | 568 | 582 | 675 | 622 | 2 | 1152 | 180 | 52 | 47.88 | 0.15 |
| Francisella tularensis subsp. mediasiatica FSC147 | 1243 | 565 | 568 | 629 | 614 | 0 | 1133 | 150 | 50.6 | 49.4 | 0 |
| Francisella tularensis subsp. tularensis FSC198 | 1369 | 573 | 624 | 708 | 661 | 0 | 1197 | 236 | 51.7 | 48.28 | 0 |
| Francisella tularensis subsp. tularensis NE061598 | 1512 | 576 | 662 | 787 | 722 | 3 | 1241 | 215 | 52.1 | 47.75 | 0.2 |
| Francisella tularensis subsp. tularensis SCHU S4 | 1368 | 573 | 624 | 707 | 661 | 0 | 1197 | 236 | 51.7 | 48.32 | 0 |
| Francisella tularensis subsp. tularensis TI0902 | 1322 | 572 | 631 | 651 | 671 | 0 | 1203 | 209 | 49.2 | 50.76 | 0 |
| Francisella tularensis subsp. tularensis TIGB03 | 1393 | 572 | 632 | 704 | 689 | 0 | 1204 | 216 | 50.5 | 49.46 | 0 |
| Francisella tularensis subsp. tularensis WY96-3418 | 1291 | 573 | 629 | 711 | 676 | 2 | 1204 | 191 | 55.1 | 52.36 | 0.15 |
| Legionella (11 genomes) | 23736 | 618 | 1468 | 9824 | 13640 | 272 | 2358 | 1078 | 41.4 | 57.47 | 1.15 |
| L_ longbeachae NSW150 | 2277 | 608 | 1027 | 886 | 1181 | 210 | 1845 | 107 | 38.9 | 51.87 | 9.22 |
| L_ pneumophila str. Lens | 2071 | 610 | 1108 | 863 | 1200 | 8 | 1726 | 89 | 41.7 | 57.94 | 0.39 |
| L_ pneumophila str. Paris | 2163 | 610 | 1148 | 899 | 1251 | 13 | 1772 | 86 | 41.6 | 57.84 | 0.6 |
| L_pneumo_ 2300_99Alcoy | 2177 | 613 | 1168 | 890 | 1281 | 6 | 1787 | 101 | 40.9 | 58.84 | 0.28 |
| L_pneumo_ pneumophila_ATCC43290 | 2108 | 612 | 1119 | 877 | 1231 | 0 | 1731 | 95 | 41.6 | 58.4 | 0 |
| L_pneumo_ Pneumophila_HL | 2178 | 611 | 1135 | 888 | 1273 | 17 | 1763 | 96 | 40.8 | 58.45 | 0.78 |
| L_pneumo_ Pneumophila_Lorraine | 2162 | 613 | 1127 | 944 | 1212 | 6 | 1746 | 100 | 43.7 | 56.06 | 0.28 |
| L_pneumo_ str. Corby | 2179 | 612 | 1171 | 888 | 1285 | 6 | 1789 | 106 | 40.8 | 58.97 | 0.28 |
| L_pneumo_ subsp. pneumophila str193Philadelphia | 2145 | 612 | 1133 | 898 | 1244 | 3 | 1748 | 97 | 41.9 | 58 | 0.14 |
| L_pneumo_ subsp. pneumophila str576Philadelphia | 2109 | 614 | 1110 | 896 | 1212 | 1 | 1725 | 99 | 42.5 | 57.47 | 0.05 |
| L_pneumo_ Thunder Bay | 2167 | 611 | 1140 | 895 | 1270 | 2 | 1753 | 102 | 41.3 | 58.61 | 0.09 |
| Legionella pneumophila (10 genomes) | 21458 | 617 | 1351 | 8938 | 12458 | 62 | 2030 | 971 | 41.7 | 58.06 | 0.29 |
| L_ pneumophila str. Lens | 2071 | 610 | 1108 | 863 | 1200 | 8 | 1726 | 89 | 41.7 | 57.94 | 0.39 |
| L_ pneumophila str. Paris | 2163 | 610 | 1148 | 899 | 1251 | 13 | 1772 | 86 | 41.6 | 57.84 | 0.6 |
| L_pneumo_ 2300_99Alcoy | 2177 | 613 | 1168 | 890 | 1281 | 6 | 1787 | 101 | 40.9 | 58.84 | 0.28 |
| L_pneumo_ pneumophila_ATCC43290 | 2108 | 612 | 1119 | 877 | 1231 | 0 | 1731 | 95 | 41.6 | 58.4 | 0 |
| L_pneumo_ Pneumophila_HL | 2178 | 611 | 1135 | 888 | 1273 | 17 | 1763 | 96 | 40.8 | 58.45 | 0.78 |
| L_pneumo_ Pneumophila_Lorraine | 2162 | 613 | 1127 | 944 | 1212 | 6 | 1746 | 100 | 43.7 | 56.06 | 0.28 |
| L_pneumo_ str. Corby | 2179 | 612 | 1171 | 888 | 1285 | 6 | 1789 | 106 | 40.8 | 58.97 | 0.28 |
| L_pneumo_ subsp. pneumophila str193Philadelphia | 2145 | 612 | 1133 | 898 | 1244 | 3 | 1748 | 97 | 41.9 | 58 | 0.14 |
| L_pneumo_ subsp. pneumophila str576Philadelphia | 2109 | 614 | 1110 | 896 | 1212 | 1 | 1725 | 99 | 42.5 | 57.47 | 0.05 |
| L_pneumo_ Thunder Bay | 2167 | 611 | 1140 | 895 | 1270 | 2 | 1753 | 102 | 41.3 | 58.61 | 0.09 |
| Rickettsiella grylli | 1155 | 554 | 459 | 654 | 459 | 42 | 987 | 38 | 56.6 | 39.74 | 3.64 |
| Diplorickettsia massiliensis | 1475 | 546 | 518 | 908 | 518 | 49 | 970 | 48 | 61.6 | 35.12 | 3.32 |

Additional file 5 - Diplorickettsia massiliensis strain 20B: description of unique genes

| Gene ID | Cluster ID | COG | Functional description |
|---|---|---|---|
| 12043713 | OG5_126962 | R | hypothetical protein |
| 12042690 | OG5_127396 | RTKL | Serine/threonine protein kinase |
| 12043131 | OG5_127837 | G | Galactose mutarotase and related enzymes |
| 12042957 | OG5_129515 | Q | Probable taurine catabolism dioxygenase |
| 12043183 | OG5_131640 | O | Predicted redox protein, regulator of disulfide bond formation |
| 12043061 | OG5_131654 | C | Ferredoxin |
| 12042314 | OG5_132174 | R | FOG:Ankyrin repeat |
| 12043224 | OG5_133030 | P | 3'-Phosphoadenosine 5'-phosphosulfate (PAPS) 3'-phosphatase |
| 12042814 | OG5_134761 | R | FOG:Ankyrin repeat |
| 12043977 | OG5_136591 | H | SAM-dependent methyltransferases |
| 12042324 | OG5_136663 | S | Uncharacterized conserved protein |
| 12043610 | OG5_137437 | T | hypothetical protein |
| 12042285 | OG5_137732 | R | hypothetical protein |
| 12042283 | OG5_137790 | IQR | Dehydrogenases with different specificities (related to short-chain alcohol dehydrogenases) |
| 12043203 | OG5_138525 | T | Response regulators consisting of a CheY-like receiver domain and a winged-helix DNA-binding |
| 12041993 | OG5_141810 | R | PhoPQ-activated pathogenicity-related protein |
| 12043260 | OG5_142772 | T | FOG:CheY-like receiver |
| 12043907 | OG5_146647 | R | Soluble lytic murein transglycosylase and related regulatory proteins |
| 12044061 | OG5_146777 | RTKL | hypothetical protein |
| 12043846 | OG5_150445 | S | Uncharacterized protein conserved in bacteria |
| 12044192 | OG5_152528 | T | FOG:CheY-like receiver |
| 12043032 | OG5_153661 | R | hypothetical protein |
| 12043044 | OG5_156771 | M | Predicted choline kinase involved in LPS biosynthesis |
| 12043182 | OG5_158947 | J | contains the PP-loop ATPase domain |
| 12043983 | OG5_158957 | T | FOG:CheY-like receiver |
| 12042792 | OG5_164552 | M | UDP-glucose pyrophosphorylase |
| 12043504 | OG5_164798 | S | Uncharacterized protein conserved in bacteria |
| 12042562 | OG5_165570 | D | hypothetical protein |
| 12042797 | OG5_166276 | M | hypothetical protein |
| 12043119 | OG5_166572 | R | hypothetical protein |
| 12043357 | OG5_167967 | R | DNA primase (bacterial type) |
| 12043647 | OG5_170999 | R | Predicted periplasmic protein |
| 12042005 | OG5_171413 | R | hypothetical protein |
| 12043834 | OG5_172467 | R | hypothetical protein |
| 12043019 | OG5_175478 | R | Amino acid transporters |
| 12043613 | OG5_176228 | H | SAM-dependent methyltransferases |
| 12043373 | OG5_178450 | TK | DNA-binding HTH domain-containing proteins |
| 12042871 | OG5_178715 | K | Predicted nucleotide-binding protein containing TIR-like domain |
| 12042514 | OG5_181916 | R | hypothetical protein |
| 12042318 | OG5_185753 | R | hypothetical protein |
| 12042907 | OG5_191435 | R | hypothetical protein |
| 12042962 | OG5_204787 | R | hypothetical protein |
| 12042193 | OG5_211174 | R | hypothetical protein |
| 12042682 | OG5_211971 | R | hypothetical protein |
| 12044129 | OG5_215038 | R | hypothetical protein |
| 12043970 | OG5_228892 | Q | hypothetical protein |
| 12043121 | OG5_229846 | R | hypothetical protein |
| 12043292 | OG5_244660 | R | Uncharacterized protein conserved in bacteria |
| 12044183 | OG5_245288 | R | hypothetical protein |

Chapter 5

Conclusions

5.1 Conclusions and perspectives

Based on the endosymbiotic origin of mitochondria and other eukaryotic organelles, we believe that the intracellular lifestyle is ancient and has constantly co-evolved with the host. Comparative analyses of bacterial genomes from different lifestyles, including free-living and host-dependent bacteria, show that host-dependent bacteria have fewer transcriptional regulators. Lamarckian evolution may have played a role in bacterial speciation events associated with a reduction in genome size, an observation that contradicts the dominant model, which assumes that speciation and fitness gain are linked to an increase in the gene repertoire. Intracellular bacteria possess mechanisms to protect themselves or to invade host cells. The interactions between intracellular bacteria and host cells are enabled by Type IV secretion systems (T4SSs). These systems, which are supramolecular transporters ancestrally related to bacterial conjugation systems, are required for bacterial colonization, invasion and persistence within the niche.

The study of intracellular Gammaproteobacteria has contributed to our understanding of bacterial specialization based on the ecological niche. The genome size and gene content of the bacteria are associated with lifestyle. A smaller number of genes and a relatively low G+C content were observed in the genomes analyzed here, similar to other studies of intracellular bacteria (Georgiades, et al., 2011). Gene loss resulting in a smaller genome size has been a driving force in the adaptation of these bacteria to their hosts. Because of this reduction in the genomic repertoire, we speculate that fewer lateral gene transfers occur in D. massiliensis than in other intracellular bacteria (Audic, et al., 2007). We used a multi-genus pangenomic approach to characterize the genomic repertoire of representative strains and to compare the distribution of genes in D. massiliensis strain 20B with that in other genomes. We found that the majority of the genes in D. massiliensis strain 20B were shared with other Gammaproteobacteria. A pangenomic approach facilitates the exploration of the different strategies by which facultative or obligate intracellular bacteria adapt to particular hosts and contributes significantly to our understanding of genome repertoires. This approach can be used to uncover unique genomic features that cannot be predicted by conventional methods. Moreover, our results suggest that the Legionella strains could be re-classified based on their genomic variability. The sequencing of additional intracellular bacterial genomes will provide a more precise picture of the genetic properties associated with the intracellular lifestyle. This effort will also contribute to a better understanding of the interactions between intracellular bacteria and different niches and of the complex mechanisms implicated in pathogenicity.
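The partition of a pangenome into core, accessory and unique compartments can be sketched in a few lines. The example below is a minimal illustration that assumes the usual convention (a cluster is core if it has members in every genome, unique if it occurs in a single genome, accessory otherwise); the cluster identifiers, genome names and the function `classify_clusters` are hypothetical and do not reproduce the OrthoMCL-based pipeline used in this work.

```python
def classify_clusters(clusters, n_genomes):
    """Split ortholog clusters into core, accessory and unique compartments.

    clusters: dict mapping a cluster ID to the set of genomes with a member in it.
    n_genomes: total number of genomes compared.
    """
    compartments = {"core": [], "accessory": [], "unique": []}
    for cluster_id, genomes in sorted(clusters.items()):
        if len(genomes) == n_genomes:
            compartments["core"].append(cluster_id)       # present in every genome
        elif len(genomes) == 1:
            compartments["unique"].append(cluster_id)     # restricted to one genome
        else:
            compartments["accessory"].append(cluster_id)  # shared by some genomes only
    return compartments

# Hypothetical toy input: four clusters across three genomes.
toy_clusters = {
    "cluster_1": {"genome_A", "genome_B", "genome_C"},
    "cluster_2": {"genome_A", "genome_B"},
    "cluster_3": {"genome_C"},
    "cluster_4": {"genome_A", "genome_B", "genome_C"},
}

if __name__ == "__main__":
    print(classify_clusters(toy_clusters, n_genomes=3))
```

Counting the entries in each compartment gives the kind of core, accessory and unique totals reported in the pangenome summaries.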

5.2 Future perspectives

Current knowledge barely scratches the surface of the diversity of these intracellular bacteria and their complex host associations. Genomic studies have shifted from looking only at genes and protein-coding sequences to exploring the entire genome. It will be interesting to learn more about the genomic repertoire of emerging intracellular bacterial pathogens because of their adverse roles. Genomic analyses will provide a springboard for phylogenomic profiling, pangenomics, transcriptomics and proteomics, which will ultimately enable a better understanding of how intracellular bacteria exploit their environment and help to elucidate the mechanisms of pathogenicity among intracellular bacteria.

Bibliography

Amiri, H., C. M. Alsmark, et al. (2002). "Proliferation and deterioration of Rickettsia palindromic elements." Molecular biology and evolution 19(8): 1234-1243.
Andersson, J. O. and S. G. Andersson (1999). "Insights into the evolutionary process of genome degradation." Curr Opin Genet Dev 9(6): 664-671.
Andersson, S. G., C. Alsmark, et al. (2002). "Comparative genomics of microbial pathogens and symbionts." Bioinformatics 18 Suppl 2: S17.
Andrews, H. L., J. P. Vogel, et al. (1998). "Identification of linked Legionella pneumophila genes essential for intracellular growth and evasion of the endocytic pathway." Infection and immunity 66(3): 950-958.
Aravind, L., R. L. Tatusov, et al. (1998). "Evidence for massive gene exchange between archaeal and bacterial hyperthermophiles." Trends in genetics : TIG 14(11): 442-444.
Arneodo, J. D., A. Bressan, et al. (2008). "Ultrastructural detection of an unusual intranuclear bacterium in Pentastiridius leporinus (Hemiptera: Cixiidae)." Journal of invertebrate pathology 97(3): 310-313.
Audic, S., C. Robert, et al. (2007). "Genome analysis of Minibacterium massiliensis highlights the convergent evolution of water-living bacteria." PLoS Genet 3(8): e138.
Baldridge, G. D., N. Y. Burkhardt, et al. (2007). "Transposon insertion reveals pRM, a plasmid of Rickettsia monacensis." Appl Environ Microbiol 73(15): 4984-4995.
Banks, D. J., S. B. Beres, et al. (2002). "The fundamental contribution of phages to GAS evolution, genome diversification and strain emergence." Trends in microbiology 10(11): 515-521.

Beare, P. A., N. Unsworth, et al. (2009). "Comparative genomics reveal extensive transposon-mediated genomic plasticity and diversity among potential effector proteins within the genus Coxiella." Infect Immun 77(2): 642-656.
Beckstrom-Sternberg, S. M., R. K. Auerbach, et al. (2007). "Complete genomic characterization of a pathogenic A.II strain of Francisella tularensis subspecies tularensis." PLoS One 2(9): e947.
Benson, D. A., I. Karsch-Mizrachi, et al. (2012). "GenBank." Nucleic Acids Res 40(Database issue): D48-53.
Benson, G. (1999). "Tandem repeats finder: a program to analyze DNA sequences." Nucleic Acids Res 27(2): 573-580.
Beranek, A., M. Zettl, et al. (2004). "Thirty-eight C-terminal amino acids of the coupling protein TraD of the F-like conjugative resistance plasmid R1 are required and sufficient to confer binding to the substrate selector protein TraM." J Bacteriol 186(20): 6999-7006.
Berglund, E. C., A. C. Frank, et al. (2009). "Run-off replication of host-adaptability genes is associated with gene transfer agents in the genome of mouse-infecting Bartonella grahamii." PLoS genetics 5(7): e1000546.
Blanc, G., M. Ngwamidiba, et al. (2005). "Molecular evolution of rickettsia surface antigens: evidence of positive selection." Molecular biology and evolution 22(10): 2073-2083.
Blanc, G., H. Ogata, et al. (2007). "Lateral gene transfer between obligate intracellular bacteria: evidence from the Rickettsia massiliae genome." Genome research 17(11): 1657-1664.
Blanc, G., H. Ogata, et al. (2007). "Reductive genome evolution from the mother of Rickettsia." PLoS genetics 3(1): e14.
Blatch, G. L. and M. Lassle (1999). "The tetratricopeptide repeat: a structural motif mediating protein-protein interactions." BioEssays : news and reviews in molecular, cellular and developmental biology 21(11): 932-939.

Bliven, K. A. and A. T. Maurelli (2012). "Antivirulence genes: insights into pathogen evolution through gene loss." Infect Immun 80(12): 4061-4070.
Bordenstein, S. R. and W. S. Reznikoff (2005). "Mobile DNA in obligate intracellular bacteria." Nature reviews. Microbiology 3(9): 688-699.
Bork, P. (1993). "Hundreds of ankyrin-like repeats in functionally diverse proteins: mobile modules that cross phyla horizontally?" Proteins 17(4): 363-374.
Boyd, E. F. and H. Brussow (2002). "Common themes among bacteriophage-encoded virulence factors and diversity among the bacteriophages involved." Trends in microbiology 10(11): 521-529.
Boyd, E. F., B. M. Davis, et al. (2001). "Bacteriophage-bacteriophage interactions in the evolution of pathogenic bacteria." Trends in microbiology 9(3): 137-144.
Braeken, L., B. Van der Bruggen, et al. (2006). "Flux decline in nanofiltration due to adsorption of dissolved organic compounds: model prediction of time dependency." The journal of physical chemistry. B 110(6): 2957-2962.
Burns, D. L. (2003). "Type IV transporters of pathogenic bacteria." Current opinion in microbiology 6(1): 29-34.
Casadevall, A. (2008). "Evolution of intracellular pathogens." Annual review of microbiology 62: 19-33.
Casjens, S. (2003). "Prophages and bacterial genomics: what have we learned so far?" Molecular microbiology 49(2): 277-300.
Caturegli, P., K. M. Asanovich, et al. (2000). "ankA: an Ehrlichia phagocytophila group gene encoding a cytoplasmic protein antigen with ankyrin repeats." Infection and immunity 68(9): 5277-5283.
Cazalet, C., L. Gomez-Valero, et al. (2010). "Analysis of the Legionella longbeachae genome and transcriptome uncovers unique strategies to cause Legionnaires' disease." PLoS Genet 6(2): e1000851.

Cazalet, C., C. Rusniok, et al. (2004). "Evidence in the Legionella pneumophila genome for exploitation of host cell functions and high genome plasticity." Nature genetics 36(11): 1165-1173.
Chen, I., P. J. Christie, et al. (2005). "The ins and outs of DNA transfer in bacteria." Science 310(5753): 1456-1460.
Cho, N. H., H. R. Kim, et al. (2007). "The Orientia tsutsugamushi genome reveals massive proliferation of conjugative type IV secretion system and host-cell interaction genes." Proceedings of the National Academy of Sciences of the United States of America 104(19): 7981-7986.
Christie, P. J. (2001). "Type IV secretion: intercellular transfer of macromolecules by systems ancestrally related to conjugation machines." Molecular microbiology 40(2): 294-305.
Christie, P. J. and J. P. Vogel (2000). "Bacterial type IV secretion: conjugation systems adapted to deliver effector molecules to host cells." Trends in microbiology 8(8): 354-360.
Claverie, J. M. and H. Ogata (2003). "The insertion of palindromic repeats in the evolution of proteins." Trends in biochemical sciences 28(2): 75-80.
Colson, P. and D. Raoult (2012). "Lamarckian evolution of the giant Mimivirus in allopatric laboratory culture on amoebae." Frontiers in cellular and infection microbiology 2: 91.
Corsaro, D., D. Venditti, et al. (1999). "Intracellular life." Critical reviews in microbiology 25(1): 39-79.
D'Auria, G., N. Jimenez-Hernandez, et al. (2010). "Legionella pneumophila pangenome reveals strain-specific virulence factors." BMC genomics 11: 181.
Dai, L., N. Toor, et al. (2003). "Database for mobile group II introns." Nucleic acids research 31(1): 424-426.
Darby, A. C., N. H. Cho, et al. (2007). "Intracellular pathogens go extreme: genome evolution in the Rickettsiales." Trends in genetics : TIG 23(10): 511-520.

Darling, A. E., B. Mau, et al. (2010). "progressiveMauve: multiple genome alignment with gene gain, loss and rearrangement." PloS one 5(6): e11147.
Degnan, P. H., A. B. Lazarus, et al. (2005). "Genome sequence of Blochmannia pennsylvanicus indicates parallel evolutionary trends among bacterial mutualists of insects." Genome research 15(8): 1023-1033.
Deng, W., L. Chen, et al. (1999). "VirE1 is a specific molecular chaperone for the exported single-stranded-DNA-binding protein VirE2 in Agrobacterium." Molecular microbiology 31(6): 1795-1807.
Douglas, A. E. (1989). "Mycetocyte symbiosis in insects." Biological reviews of the Cambridge Philosophical Society 64(4): 409-434.
Dunning Hotopp, J. C., M. E. Clark, et al. (2007). "Widespread lateral gene transfer from intracellular bacteria to multicellular eukaryotes." Science 317(5845): 1753-1756.
Fares, M. A., A. Moya, et al. (2004). "GroEL and the maintenance of bacterial endosymbiosis." Trends in genetics : TIG 20(9): 413-416.
Fares, M. A., M. X. Ruiz-Gonzalez, et al. (2002). "Endosymbiotic bacteria: groEL buffers against deleterious mutations." Nature 417(6887): 398.
Felsheim, R. F., T. J. Kurtti, et al. (2009). "Genome sequence of the endosymbiont Rickettsia peacockii and comparison with virulent Rickettsia rickettsii: identification of virulence factors." PloS one 4(12): e8361.
Fernandez-Moreira, E., J. H. Helbig, et al. (2006). "Membrane vesicles shed by Legionella pneumophila inhibit fusion of phagosomes with lysosomes." Infection and immunity 74(6): 3285-3295.
Finlay, B. B. and S. Falkow (1997). "Common themes in microbial pathogenicity revisited." Microbiology and molecular biology reviews : MMBR 61(2): 136-169.
Fournier, P. E., K. El Karkouri, et al. (2009). "Analysis of the Rickettsia africae genome reveals that virulence acquisition in Rickettsia species may be explained by genome reduction." BMC genomics 10: 166.
Fournier, P. E., Y. Zhu, et al. (2004). "Use of highly variable intergenic spacer sequences for multispacer typing of Rickettsia conorii strains." Journal of clinical microbiology 42(12): 5757-5766.
Frank, A. C., H. Amiri, et al. (2002). "Genome deterioration: loss of repeated sequences and accumulation of junk DNA." Genetica 115(1): 1-12.
Fraser-Liggett, C. M. (2005). "Insights on biology and evolution from microbial genome sequencing." Genome research 15(12): 1603-1610.
Friedland, J. S., R. J. Shattock, et al. (1993). "Phagocytosis of Mycobacterium tuberculosis or particulate stimuli by human monocytic cells induces equivalent monocyte chemotactic protein-1 gene expression." Cytokine 5(2): 150-156.
Frost, L. S., R. Leplae, et al. (2005). "Mobile genetic elements: the agents of open source evolution." Nature reviews. Microbiology 3(9): 722-732.
Georgiades, K., M. A. Madoui, et al. (2011). "Phylogenomic analysis of Odyssella thessalonicensis fortifies the common origin of Rickettsiales, Pelagibacter ubique and Reclimonas americana mitochondrion." PloS one 6(9): e24857.
Georgiades, K., V. Merhej, et al. (2011). "Gene gain and loss events in Rickettsia and Orientia species." Biology direct 6: 6.
Georgiades, K. and D. Raoult (2010). "Defining pathogenic bacterial species in the genomic era." Frontiers in microbiology 1: 151.
Georgiades, K. and D. Raoult (2011). "The rhizome of Reclinomonas americana, Homo sapiens, Pediculus humanus and Saccharomyces cerevisiae mitochondria." Biology direct 6: 55.
Gil, R., A. Latorre, et al. (2004). "Bacterial endosymbionts of insects: insights from comparative genomics." Environmental microbiology 6(11): 1109-1122.

Gimenez, G., C. Bertelli, et al. (2011). "Insight into cross-talk between intra-amoebal pathogens." BMC genomics 12: 542.
Gross, R., J. Hacker, et al. (2003). "The Leopoldina international symposium on parasitism, commensalism and symbiosis--common themes, different outcome." Molecular microbiology 47(6): 1749-1758.
Hooper, S. D. and O. G. Berg (2003). "On the nature of gene innovation: duplication patterns in microbial genomes." Molecular biology and evolution 20(6): 945-954.
Horn, M., A. Collingro, et al. (2004). "Illuminating the evolutionary history of chlamydiae." Science 304(5671): 728-730.
Hu, B., G. Xie, et al. (2011). "Pathogen comparative genomics in the next-generation sequencing era: genome alignments, pangenomics and metagenomics." Brief Funct Genomics 10(6): 322-333.
Hyatt, D., G. L. Chen, et al. (2010). "Prodigal: prokaryotic gene recognition and translation initiation site identification." BMC Bioinformatics 11: 119.
Klasson, L., Z. Kambris, et al. (2009). "Horizontal gene transfer between Wolbachia and the mosquito Aedes aegypti." BMC genomics 10: 33.
Koonin, E. V. (2009). "Darwinian evolution in the light of genomics." Nucleic acids research 37(4): 1011-1034.
Koonin, E. V. (2010). "The origin and early evolution of eukaryotes in the light of phylogenomics." Genome biology 11(5): 209.
Koonin, E. V. and Y. I. Wolf (2008). "Genomics of bacteria and archaea: the emerging dynamic view of the prokaryotic world." Nucleic acids research 36(21): 6688-6719.
Labrador, M. and V. G. Corces (1997). "Transposable element-host interactions: regulation of insertion and excision." Annu Rev Genet 31: 381-404.
Li, J., A. Mahajan, et al. (2006). "Ankyrin repeat: a unique motif mediating protein-protein interactions." Biochemistry 45(51): 15168-15178.
Li, L., C. J. Stoeckert, Jr., et al. (2003). "OrthoMCL: identification of ortholog groups for eukaryotic genomes." Genome Res 13(9): 2178-2189.
Lin, M., C. Zhang, et al. (2009). "Analysis of complete genome sequence of Neorickettsia risticii: causative agent of Potomac horse fever." Nucleic acids research 37(18): 6076-6091.
Margulis, L. and R. Fester (1991). Symbiosis as a Source of Evolutionary Innovation: Speciation and Morphogenesis, The MIT Press.
Marco, D. (2008). "Metagenomics and the niche concept." Theory in biosciences = Theorie in den Biowissenschaften 127(3): 241-247.
Margulis, L. (1971). "The origin of plant and animal cells." American scientist 59(2): 230-235.
Margulis, L. (1971). "Symbiosis and evolution." Scientific American 225(2): 48-57.
Markowitz, V. M., I. M. Chen, et al. (2012). "IMG: the Integrated Microbial Genomes database and comparative analysis system." Nucleic Acids Res 40(Database issue): D115-122.
Mathew, M. J., G. Subramanian, et al. (2012). "Genome sequence of Diplorickettsia massiliensis, an emerging Ixodes ricinus-associated human pathogen." J Bacteriol 194(12): 3287.
Matthews, M. and C. R. Roy (2000). "Identification and subcellular localization of the Legionella pneumophila IcmX protein: a factor essential for establishment of a replicative organelle in eukaryotic host cells." Infection and immunity 68(7): 3971-3982.
McCutcheon, J. P. and N. A. Moran (2007). "Parallel genomic evolution and metabolic interdependence in an ancient symbiosis." Proceedings of the National Academy of Sciences of the United States of America 104(49): 19392-19397.
McCutcheon, J. P. and N. A. Moran (2012). "Extreme genome reduction in symbiotic bacteria." Nature reviews. Microbiology 10(1): 13-26.
McNulty, S. N., J. M. Foster, et al. (2010). "Endosymbiont DNA in endobacteria-free filarial nematodes indicates ancient horizontal genetic transfer." PloS one 5(6): e11029.
Mediannikov, O., Z. Sekeyova, et al. (2010). "A novel obligate intracellular gamma-proteobacterium associated with ixodid ticks, Diplorickettsia massiliensis, Gen. Nov., Sp. Nov." PLoS One 5(7): e11478.
Medini, D., C. Donati, et al. (2005). "The microbial pan-genome." Curr Opin Genet Dev 15(6): 589-594.
Merhej, V., C. Notredame, et al. (2011). "The rhizome of life: the sympatric Rickettsia felis paradigm demonstrates the random transfer of DNA sequences." Molecular biology and evolution 28(11): 3213-3223.
Merhej, V. and D. Raoult (2011). "Rickettsial evolution in the light of comparative genomics." Biological reviews of the Cambridge Philosophical Society 86(2): 379-405.
Merhej, V., M. Royer-Carenzi, et al. (2009). "Massive comparative genomic analysis reveals convergent evolution of specialized bacteria." Biology direct 4: 13.
Miao, E. A. and S. I. Miller (1999). "Bacteriophages in the evolution of pathogen-host interactions." Proceedings of the National Academy of Sciences of the United States of America 96(17): 9452-9454.
Mira, A., H. Ochman, et al. (2001). "Deletional bias and the evolution of bacterial genomes." Trends in genetics : TIG 17(10): 589-596.
Moliner, C., P. E. Fournier, et al. (2010). "Genome analysis of microorganisms living in amoebae reveals a melting pot of evolution." FEMS microbiology reviews 34(3): 281-294.
Moran, J. V., R. J. DeBerardinis, et al. (1999). "Exon shuffling by L1 retrotransposition." Science 283(5407): 1530-1534.
Moran, N. A. (1996). "Accelerated evolution and Muller's rachet in endosymbiotic bacteria." Proceedings of the National Academy of Sciences of the United States of America 93(7): 2873-2878.
Moran, N. A. (2002). "Microbial minimalism: genome reduction in bacterial pathogens." Cell 108(5): 583-586.
Moran, N. A. and P. Baumann (2000). "Bacterial endosymbionts in animals." Current opinion in microbiology 3(3): 270-275.
Moran, N. A., P. H. Degnan, et al. (2005). "The players in a mutualistic symbiosis: insects, bacteria, viruses, and virulence genes." Proceedings of the National Academy of Sciences of the United States of America 102(47): 16919-16926.
Moran, N. A., H. E. Dunbar, et al. (2005). "Regulation of transcription in a reduced bacterial genome: nutrient-provisioning genes of the obligate symbiont Buchnera aphidicola." J Bacteriol 187(12): 4229-4237.
Moran, N. A., J. P. McCutcheon, et al. (2008). "Genomics and Evolution of Heritable Bacterial Symbionts." Annual Review of Genetics 42(1): 165-190.
Moran, N. A. and G. R. Plague (2004). "Genomic changes following host restriction in bacteria." Current opinion in genetics & development 14(6): 627-633.
Moran, N. A. and J. J. Wernegreen (2000). "Lifestyle evolution in symbiotic bacteria: insights from genomics." Trends in ecology & evolution 15(8): 321-326.
Moriya, Y., M. Itoh, et al. (2007). "KAAS: an automatic genome annotation and pathway reconstruction server." Nucleic Acids Res 35(Web Server issue): W182-185.
Mosavi, L. K., T. J. Cammett, et al. (2004). "The ankyrin repeat as molecular architecture for protein recognition." Protein science : a publication of the Protein Society 13(6): 1435-1448.
Nagai, H. and T. Kubori (2011). "Type IVB Secretion Systems of Legionella and Other Gram-Negative Bacteria." Frontiers in microbiology 2: 136.
Nakabachi, A., A. Yamashita, et al. (2006). "The 160-kilobase genome of the bacterial endosymbiont Carsonella." Science 314(5797): 267.
Nora, T., M. Lomma, et al. (2009). "Molecular mimicry: an important virulence strategy employed by Legionella pneumophila to subvert host functions." Future microbiology 4(6): 691-701.
Ogata, H., S. Audic, et al. (2000). "Selfish DNA in protein-coding genes of Rickettsia." Science 290(5490): 347-350.

Ogata, H., S. Audic, et al. (2001). "Mechanisms of evolution in Rickettsia conorii and R. prowazekii." Science 293(5537): 2093-2098.
Ogata, H., B. La Scola, et al. (2006). "Genome sequence of Rickettsia bellii illuminates the role of amoebae in gene exchanges between intracellular pathogens." PLoS Genet 2(5): e76.
Ogata, H., P. Renesto, et al. (2005). "The genome sequence of Rickettsia felis identifies the first putative conjugative plasmid in an obligate intracellular parasite." PLoS biology 3(8): e248.
Ogata, H., C. Robert, et al. (2005). "Rickettsia felis, from culture to genome sequencing." Annals of the New York Academy of Sciences 1063: 26-34.
Ohnishi, M., K. Kurokawa, et al. (2001). "Diversification of Escherichia coli genomes: are bacteriophages the major contributors?" Trends in microbiology 9(10): 481-485.
Parola, P. and D. Raoult (2001). "Ticks and tickborne bacterial diseases in humans: an emerging infectious threat." Clin Infect Dis 32(6): 897-928.
Pearson, T., H. M. Hornstra, et al. (2013). "When Outgroups Fail; Phylogenomics of Rooting the Emerging Pathogen, Coxiella burnetii." Systematic biology 62(5): 752-762.
Pellegrini, M., E. M. Marcotte, et al. (1999). "Assigning protein functions by comparative genome analysis: protein phylogenetic profiles." Proc Natl Acad Sci U S A 96(8): 4285-4288.
Perez-Brocal, V., R. Gil, et al. (2006). "A small microbial genome: the end of a long symbiotic relationship?" Science 314(5797): 312-313.
Peterson, J., S. Garges, et al. (2009). "The NIH Human Microbiome Project." Genome Res 19(12): 2317-2323.
Pilsczek, F. H., A. Nicholson-Weller, et al. (2005). "Phagocytosis of Salmonella montevideo by human neutrophils: immune adherence increases phagocytosis, whereas the bacterial surface determines the route of intracellular processing." The Journal of infectious diseases 192(2): 200-209.

Rasko, D. A., M. J. Rosovitz, et al. (2008). "The pangenome structure of Escherichia coli: comparative genomic analysis of E. coli commensal and pathogenic isolates." J Bacteriol 190(20): 6881-6893.
Renesto, P., H. Ogata, et al. (2005). "Some lessons from Rickettsia genomics." FEMS microbiology reviews 29(1): 99-117.
Renvoise, A., V. Merhej, et al. (2011). "Intracellular Rickettsiales: Insights into manipulators of eukaryotic cells." Trends in molecular medicine 17(10): 573-583.
Rocha, E. P. (2003). "DNA repeats lead to the accelerated loss of gene order in bacteria." Trends in genetics : TIG 19(11): 600-603.
Rocha, E. P. (2008). "Evolutionary patterns in prokaryotic genomes." Curr Opin Microbiol 11(5): 454-460.
Rolain, J. M., M. Vayssier-Taussat, et al. (2013). "Partial disruption of translational and posttranslational machinery reshapes growth rates of Bartonella birtlesii." mBio 4(2): e00115-00113.
Roux, V., M. Bergoin, et al. (1997). "Reassessment of the taxonomic position of Rickettsiella grylli." Int J Syst Bacteriol 47(4): 1255-1257.
Rubtsov, A. M. and O. D. Lopina (2000). "Ankyrins." FEBS letters 482(1-2): 1-5.
Saeed, A. I., V. Sharov, et al. (2003). "TM4: a free, open-source system for microarray data management and analysis." Biotechniques 34(2): 374-378.
Saisongkorh, W., C. Robert, et al. (2010). "Evidence of transfer by conjugation of type IV secretion system genes between Bartonella species and Rhizobium radiobacter in amoeba." PloS one 5(9): e12666.
Saridaki, A. and K. Bourtzis (2010). "Wolbachia: more than just a bug in insects genitals." Current opinion in microbiology 13(1): 67-72.
Sassera, D., T. Beninati, et al. (2006). "'Candidatus Midichloria mitochondrii', an endosymbiont of the tick Ixodes ricinus with a unique intramitochondrial lifestyle." International journal of systematic and evolutionary microbiology 56(Pt 11): 2535-2540.
Schandel, K. A., M. M. Muller, et al. (1992). "Localization of TraC, a protein involved in assembly of the F conjugative pilus." J Bacteriol 174(11): 3800-3806.
Schmitz-Esser, S., N. Linka, et al. (2004). "ATP/ADP translocases: a common feature of obligate intracellular amoebal symbionts related to Chlamydiae and Rickettsiae." J Bacteriol 186(3): 683-691.
Schroder, G. and E. Lanka (2003). "TraG-like proteins of type IV secretion systems: functional dissection of the multiple activities of TraG (RP4) and TrwB (R388)." J Bacteriol 185(15): 4371-4381.
Sheppard, S. K., X. Didelot, et al. (2013). "Progressive genome-wide introgression in agricultural Campylobacter coli." Molecular ecology 22(4): 1051-1064.
Shigenobu, S., H. Watanabe, et al. (2000). "Genome sequence of the endocellular bacterial symbiont of aphids Buchnera sp. APS." Nature 407(6800): 81-86.
Simek, K., J. Pernthaler, et al. (2001). "Changes in bacterial community composition and dynamics and viral mortality rates associated with enhanced flagellate grazing in a mesoeutrophic reservoir." Appl Environ Microbiol 67(6): 2723-2733.
Simser, J. A., M. S. Rahman, et al. (2005). "A novel and naturally occurring transposon, ISRpe1 in the Rickettsia peacockii genome disrupting the rickA gene involved in actin-based motility." Molecular microbiology 58(1): 71-79.
Snipen, L., T. Almoy, et al. (2009). "Microbial comparative pan-genomics using binomial mixture models." BMC genomics 10: 385.
Stepkowski, T. and A. B. Legocki (2001). "Reduction of bacterial genome size and expansion resulting from obligate intracellular lifestyle and adaptation to soil habitat." Acta biochimica Polonica 48(2): 367-381.

Subramanian, G., O. Mediannikov, et al. (2012). "Diplorickettsia massiliensis as a human pathogen." Eur J Clin Microbiol Infect Dis 31(3): 365-369.
Tamas, I., L. Klasson, et al. (2002). "50 million years of genomic stasis in endosymbiotic bacteria." Science 296(5577): 2376-2379.
Tatusov, R. L., N. D. Fedorova, et al. (2003). "The COG database: an updated version includes eukaryotes." BMC Bioinformatics 4: 41.
Toft, C. and S. G. Andersson (2010). "Evolutionary microbial genomics: insights into bacterial host adaptation." Nature reviews. Genetics 11(7): 465-475.
van Belkum, A., S. Scherer, et al. (1998). "Short-sequence DNA repeats in prokaryotic genomes." Microbiology and molecular biology reviews : MMBR 62(2): 275-293.
van Ham, R. C., J. Kamerbeek, et al. (2003). "Reductive genome evolution in Buchnera aphidicola." Proceedings of the National Academy of Sciences of the United States of America 100(2): 581-586.
Van Sluys, M. A., M. C. de Oliveira, et al. (2003). "Comparative analyses of the complete genome sequences of Pierce's disease and citrus variegated chlorosis strains of Xylella fastidiosa." J Bacteriol 185(3): 1018-1026.
Vogel, J. P., H. L. Andrews, et al. (1998). "Conjugative transfer by the virulence system of Legionella pneumophila." Science 279(5352): 873-876.
von Dohlen, C. D., S. Kohler, et al. (2001). "Mealybug beta-proteobacterial endosymbionts contain gamma-proteobacterial symbionts." Nature 412(6845): 433-436.
Walsh, J. B. (1995). "How often do duplicated genes evolve new functions?" Genetics 139(1): 421-428.
Wernegreen, J. J. (2002). "Genome evolution in bacterial endosymbionts of insects." Nature reviews. Genetics 3(11): 850-861.

Wernegreen, J. J. (2005). "For better or worse: genomic consequences of intracellular mutualism and parasitism." Current opinion in genetics & development 15(6): 572-583.
Wernegreen, J. J., A. B. Lazarus, et al. (2002). "Small genome of Candidatus Blochmannia, the bacterial endosymbiont of Camponotus, implies irreversible specialization to an intracellular lifestyle." Microbiology 148(Pt 8): 2551-2556.
Werren, J. H., L. Baldo, et al. (2008). "Wolbachia: master manipulators of invertebrate biology." Nature reviews. Microbiology 6(10): 741-751.
Whitman, W. B. (2009). "The modern concept of the procaryote." J Bacteriol 191(7): 2000-2005; discussion 2006-2007.
Wilcox, J. L., H. E. Dunbar, et al. (2003). "Consequences of reductive evolution for gene expression in an obligate endosymbiont." Molecular microbiology 48(6): 1491-1500.
Wren, B. W. (2000). "Microbial genome analysis: insights into virulence, host adaptation and evolution." Nat Rev Genet 1(1): 30-39.
Wu, M., L. V. Sun, et al. (2004). "Phylogenomics of the reproductive parasite Wolbachia pipientis wMel: a streamlined genome overrun by mobile genetic elements." PLoS biology 2(3): E69.
Wu, S., Z. Zhu, et al. (2011). "WebMGA: a customizable web server for fast metagenomic sequence analysis." BMC genomics 12: 444.
Yang, F., J. Yang, et al. (2005). "Genome dynamics and diversity of Shigella species, the etiologic agents of bacillary dysentery." Nucleic acids research 33(19): 6445-6458.
Zientz, E., T. Dandekar, et al. (2004). "Metabolic interdependence of obligate intracellular bacteria and their insect hosts." Microbiology and molecular biology reviews : MMBR 68(4): 745-770.


Acknowledgements

I thank God for giving me patience, persistence and perspiration. I thank everyone who stood by me while I completed my thesis. I would not have been able to finish it without the help and support of countless people over the past three years.

I must express my sincere gratitude to my guide and thesis director, Professor Didier Raoult, for his constant suggestions and guidance. I would also like to thank him for creating a scientific environment at URMITE in which I could learn and improve my skills, and for providing the financial support (AP-HM) that made my life in France easier.

I am truly thankful to the core bioinformatics team for helping me solve various technical issues. I express my heartfelt thanks to Ghislain, Gregory, Fabrice and Olivier for their constant support.

I would like to thank the reviewers of my thesis, Prof. Jérôme ETIENNE and Prof. Max MAURIN, for their scientific advice and detailed review during the preparation of my thesis. Their sincere suggestions helped me to improve it. I thank Prof. Jean-Louis MEGE for his support and for honoring me by acting as the president of my thesis jury.

Completing my thesis would have been harder without the following people. I owe special thanks to Catherine Robert and her team, especially Thi-Tien Nguyen, for teaching me molecular biology techniques; I remember their time and patience. I am thankful to Francine Simula, Valerie Filosa and Sylvain Buffet for their administrative support and their constant help.

In the field of genomics, when I was naïve and lost, Roshan Padmanabhan guided me through the various skills and methods I needed. He not only gave me feedback but also, many times, helped me to understand the problems and to write the manuscripts. Without his guidance and constant feedback, this PhD would not have been achievable.


My friends in France and India (especially Sagar, Vishal, Sijo and Mayur) and others in different parts of the world were my sources of laughter, joy, happiness and support. I am happy that my friendships with you have extended well beyond our shared times. I owe special thanks to all of them for keeping me determined.

I owe special and sincere thanks to my wife Ripsy, who stood by me in both my happy and my difficult times. Last but not least, I would like to express my sincere gratitude to my mother Susamma Mathew, father M J Mathew, brother Mithun, Mummy Daisy Chacko, Papa K V Chacko, brother-in-law Rohind, my grandfather and my grandmother for their unconditional love, blessing and support. I dedicate my thesis to three important people: my mummy (who is my first teacher and my inspiration), my wife (who is my better half) and my papa (who encouraged me).
