Aix-Marseille Université
Faculté de Médecine de Marseille
Ecole Doctorale des Sciences de la Vie et de la Santé

DOCTORAL THESIS

Presented by Sourabh JAIN

Date and place of birth: 14 May 1984, India

Comparative genomic study for identifying gene acquisitions in Megavirales

Thesis defended on 6 July 2017 for the degree of Doctor of the Université d’Aix-Marseille

Members of the thesis jury
Dr Pierre PONTAROTTI, Thesis Director
Professor Didier RAOULT, Thesis Co-Director
Professor Patrick FORTERRE, Reviewer
Dr Franck PANABIERES, Reviewer
Professor Jean Louis MEGE, Examiner

Host laboratories

URMITE Unité de Recherche sur les Maladies Infectieuses et Tropicales Emergentes, UMR 6236, Faculté de Médecine 27, Boulevard Jean Moulin, 13385 Marseille, France

I2M UMR-CNRS 7373, Evolution Biologique et Modélisation, Aix-Marseille Université 3, place V. Hugo case 19, 13331 Marseille Cedex 3 France


CONTENTS

Abstract

Résumé

Avant-Propos

Chapter 1 Introduction

Chapter 2 Megavirales diversity and evolution

Chapter 3 MimiLook: A Phylogenetic Workflow for Detection of Gene Acquisition in Major Orthologous Groups of Megavirales

Chapter 4 Contribution of horizontal gene transfers in evolution of family specific genome mosaicism of Megavirales

Conclusions & Future Perspective

Acknowledgements


Abstract

The discovery of giant viruses with very large genomes and surprising genomic features raises questions about their origin and evolution. The diversity of Megavirales (MVs) makes it difficult to evaluate their phylogenetic relationships collectively. A small subset of conserved core genes, and the phylogenomic analyses based on them, provide a useful classification of MVs but give little insight into the remaining unconserved and variable gene content of the accessory genomes. Many phylogenetic studies have pointed to a decisive role of horizontal gene transfers (HGTs) and genetic exchanges in the evolution of MVs, but the majority of them are based on closely related MV families. Moreover, the estimated proportion of horizontally acquired genes varies greatly with the methodologies used for their detection and with the interpretation of the resulting phylogenies. A systematic search for reticulate evolutionary events such as HGT is therefore needed to decipher the genomic composition and genome mosaicism of distantly related MV families.

To investigate the contribution of HGTs in distantly related MV families, we determined gene distributions and gene phylogenies for the 86 complete MV ORFomes, classified in 6 defined and 4 putative families, in the context of their homologs from other domains of life. We first built an automated phylogenetic workflow, MimiLook, which deduces orthologous groups (OGs) from the ORFomes of MVs and constructs phylogenies by performing alignment generation, alignment editing and BLASTP searches against the NCBI nr sequence database. Finally, this tool detects statistically validated gene acquisition events with the help of the T-REX algorithm.

We found 4,577 OGs, of which 91% are family specific (i.e. represented by species classified in one MV family only), whereas only 9% are represented by proteins from 2 or more MV families. In step 2 of our analysis, we found 414 OGs with a detected HGT event. Among these, 174 were inferred to have been transferred from eukaryotes, 106 from bacteria, and 9 gene families from donors other than eukaryotes or bacteria (archaea and viruses, including phages). 52 OGs were detected as cases of sympatric transfer (gene transfer through the association of MVs with more than one cellular domain). Interestingly, 129 gene families were identified as involved in gene transfers from MVs to other cellular domains. We applied a similar procedure to the 7,898 non-orthologous proteins to detect transfer events and putative donors, and identified 259 instances of HGT, of which 135 were from eukaryotes, 82 from bacteria and 11 from phages and other viruses, while in 31 cases MVs transferred a protein to other cellular domains.

Instances of HGT showed donor specificity: viruses of vertebrates and invertebrates (Poxviridae, Ascoviridae and Iridoviridae) acquired genes from donors such as Euteleostomii, Eutheria, Baculoviridae and Proteobacteria, whereas algal viruses (Phycodnaviridae) and protozoan viruses (pandoravirus, Mimiviridae, pithovirus and Marseilleviridae) acquired genes mainly from cellular donors such as Dictyostelium, Mamiellales, Firmicutes, Clostridiales, Klebsormidium, Rozella allomycis, Oomycetes and Phytophthora. In conclusion, a clear distinction can be seen in the genome mosaicism of distantly related Megavirales families, which evolved through genome specificity and family-specific gene acquisitions from their respective ecological niches. The evolution of Megavirales families can be inferred from phylogenetic analysis of a few core genes as well as from similarities in their gene contents; however, given that horizontal gene transfer plays a major role in shaping the gene contents of Megavirales, this approach may prove unreliable for deciphering the evolution of all Megavirales families.

Keywords: Megavirales; Horizontal gene transfer; MimiLook; Comparative genomics; Phylogeny


Résumé

The diversity of Megavirales (MVs) makes it difficult to evaluate their phylogenetic relationships collectively. Although a small subset of conserved core genes, and the phylogenomic analyses based on them, provide a useful classification of MVs, they give little insight into the remaining unconserved and variable gene content of the accessory genomes. Many phylogenetic studies have underlined the decisive role of HGTs and genetic exchanges in the evolution of MVs, but most of them are based on closely related MV families. However, the exact proportion of horizontally acquired genes varies considerably with the methodologies used for their detection and with the interpretation of the resulting phylogenies. It is therefore necessary to adopt a systematic search for reticulate evolutionary events such as HGT in MVs in order to decipher the genomic composition and genome mosaicism of distantly related MV families.

To study the contribution of HGTs in distantly related MV families, we determined gene distributions and gene phylogenies for the 86 complete MV ORFomes classified in 6 defined and 4 putative families, in the context of their homologs from other domains of life. We first built an automated phylogenetic workflow, MimiLook, which determines the orthologous groups (OGs) of the MV ORFomes and constructs phylogenies by generating alignments of the BLASTP homologs. Finally, this tool detects statistically validated gene acquisition events with the help of T-REX.

We found 4,577 orthologous groups, of which 91% are family specific, whereas only 9% are represented by proteins from 2 or more MV families. In step 2 of our analysis, we found 414 OGs with a detected HGT event. Among these, 174 were inferred to have been transferred from eukaryotes, 106 from bacteria, and 9 gene families from cellular domains other than eukaryotes or bacteria. 52 OGs were detected as cases of sympatric transfer. Notably, 129 gene families were identified as involved in gene transfer from MVs to other cellular domains. Similarly, we analysed 7,898 non-orthologous proteins to detect transfer events and putative donors, and identified 259 cases of HGT, of which 135 came from eukaryotes, 82 from bacteria and 11 from phages and other viruses, while in 31 cases MVs transferred proteins to other cellular domains.

The HGT cases revealed donor specificity: viruses of vertebrates and invertebrates (Poxviridae, Ascoviridae and Iridoviridae) acquired genes from donors such as Euteleostomii, Eutheria, Baculoviridae and Proteobacteria, whereas algal viruses (Phycodnaviridae) and viruses of protozoa (pandoravirus, Mimiviridae, Pithovirus and Marseilleviridae) acquired genes mainly from cellular donors such as Dictyostelium, Mamiellales, Firmicutes, Clostridiales, Klebsormidium, Rozella allomycis, Oomycetes and Phytophthora. In conclusion, a clear distinction can be observed in the genome mosaicism of distantly related Megavirales families, which evolved through genome specificity and family-specific gene acquisitions from their respective ecological niches. The evolution of Megavirales families can be inferred from the phylogenetic analysis of a few core genes as well as from similarities in their gene content; however, given that horizontal gene transfer plays a major role in the gene content of Megavirales, this approach may prove unreliable for deciphering the evolution of all Megavirales families.

Keywords: Megavirales; Horizontal gene transfer; MimiLook; Comparative genomics; Phylogeny


Avant-Propos

The presentation format of this thesis follows a recommendation of the Genomics and Bioinformatics specialty within the Master de Sciences de la Vie et de la Santé, which is attached to the Ecole Doctorale des Sciences de la Vie de Marseille. The candidate is required to follow the rules imposed on him, which include a thesis format used in Northern Europe that is easier to store than traditional theses. In addition, the introduction and bibliography section is replaced by a review submitted to a journal, which allows an external assessment of the quality of the review and lets the student begin, as early as possible, an exhaustive bibliography on the field of the thesis. Furthermore, the thesis is presented as published, accepted or submitted articles, each accompanied by a brief commentary giving the general sense of the work. This form of presentation appeared better suited to the demands of international competition and makes it possible to concentrate on work that will benefit from international dissemination.

Dr. Pierre Pontarotti


Chapter-1

Introduction


Viruses have been defined as capsid-encoding organisms that use ribosome-encoding organisms (Bacteria, Archaea and Eukaryotes) as hosts for their replication (Raoult and Forterre, 2008). Nucleo-cytoplasmic large DNA viruses (NCLDVs) are a group of viruses that infect diverse eukaryotes and possess a large double-stranded DNA genome ranging in size from 100 kb to 2.5 Mb (Iyer et al., 2001). They are considered to form a monophyletic group based on the conservation of several genes (Iyer et al., 2006), which led to the recent proposal of the order ‘Megavirales’ for this viral group (Colson et al., 2013). Currently, seven defined families (Mimiviridae, Marseilleviridae, Phycodnaviridae, Poxviridae, Asfarviridae, Iridoviridae and Ascoviridae) compose the putative order Megavirales (Colson et al., 2013). The recent discoveries of pandoraviruses (Philippe et al., 2013), Mollivirus sibericum (Legendre et al., 2015), Pithovirus sibericum (Legendre et al., 2014) and faustovirus (Reteno et al., 2015) substantially expanded the known diversity of Megavirales lineages. These giant viruses infect a broad range of eukaryotic hosts such as protozoa, algae, vertebrates and invertebrates. They usually replicate in the cytoplasm or nucleus of their hosts and have genomes of ~100 to 2,500 kilobase pairs (Colson et al., 2013). They have also been named giruses because of their exceptionally large genomes and other remarkable features that distinguish them from other viruses (Claverie et al., 2006).

Acanthamoeba polyphaga mimivirus was the first giant virus, discovered in 2003 as a parasite of amoebae collected from a water sample from a cooling tower in England. The Mimivirus genome is larger than that of some small bacteria, being 1.2 megabase pairs in length and encoding more than 1,200 proteins (Raoult et al., 2004). Mimivirus was initially misclassified as a Gram-positive bacterium and named “Bradford coccus” because its staining properties and large particle size resemble those of some bacteria. The absence of ribosomal genes prevented its definitive classification as a bacterium, and electron microscopy finally revealed its icosahedral capsid, thus classifying this “microbe” as a virus (La Scola et al., 2003; Raoult et al., 2004). Evolutionary analyses of Mimivirus using phylogenies of conserved core genes established its link with the NCLDVs (Raoult et al., 2004; Iyer et al., 2006). Subsequently, researchers identified and classified many new related giant viruses from different environments (Fischer et al., 2010; Philippe et al., 2013; Arslan et al., 2011; Yoosuf et al., 2013; Boyer et al., 2009; Pagnier et al., 2013; Aherfi et al., 2014). Strikingly, some of the recently discovered giant viruses isolated from amoebae (pandoraviruses, Pithovirus sibericum and Mollivirus sibericum) differ largely in morphology from those previously described (Philippe et al., 2013; Legendre et al., 2014, 2015), whereas faustoviruses were isolated on amoebae other than those of the genus Acanthamoeba (Reteno et al., 2015). Notably, one of the pandoravirus isolates (Pandoravirus inopinatum) was initially misclassified as a eukaryotic endosymbiont (Scheid et al., 2014; Scheid, 2014; Antwerpen et al., 2015).

Moreover, giant viruses or their sequences have been detected in human samples (Popgeorgiev and Boyer, 2013; Saadi et al., 2013) and in various environmental samples (Ghedin and Claverie, 2005; Monier et al., 2008; Kristensen et al., 2010; Williamson et al., 2012). Giant viruses also have several novel biological features: mimiviruses can be infected by parasitic viruses (virophages) (La Scola et al., 2008), they can contain mobile DNA elements (transpovirons) (Desnues et al., 2012), and they have a defence mechanism against virophages termed the mimivirus virophage resistance element (MIMIVIRE) (Levasseur et al., 2015).

Megavirales are thought to be monophyletic based on a common set of approximately 30 homologous genes known as ‘core genes’ (Iyer et al., 2006). As they either replicate exclusively in the cytoplasm or begin their cycle in the host nucleus before passing into the cytoplasm, they carry most of the genes necessary for their own DNA metabolism, replication and , in addition to those involved in virion assembly and packaging. Nevertheless, core genes represent only a tiny fraction of the genomic repertoire. The small number of genes conserved across families, together with the extraordinary overall genomic complexity and variability, has raised many questions about the origins and evolution of the Megavirales. We therefore reviewed the literature to summarize current knowledge on the composition and evolution of these gene repertoires for each of the families that compose the order Megavirales, describing in particular the core genome, genes acquired by horizontal gene transfer, duplicated genes and ORFans (Chapter 2).

Microbiologists and evolutionary biologists have investigated the evolution of Megavirales genes by sequence and molecular phylogenetic analyses and have proposed theories on the origin of Megavirales and its association with the emergence of eukaryotes (Filee et al., 2007; Moreira and Lopez-Garcia 2009, 2015; Raoult 2009; Forterre and Gaia 2016). Depending on the hypothesis, Megavirales genes can have different origins. A recent hypothesis postulates that Megavirales evolved from ancient DNA transposons of the Polinton family, which are themselves the remnants of Tectoviridae-like bacteriophages that entered the protoeukaryotic cell along with the alphaproteobacterial endosymbiont (Krupovic and Koonin 2015). Under this scenario, Megavirales are the product of a ‘melting pot’ of early viral evolution, and part of their genes were directly derived from an ancient virus world as ‘hallmark viral genes’ (Koonin et al., 2006). Alternatively, Megavirales genes may have been acquired from known cellular organisms including viral hosts (‘the accretion hypothesis’) (Filee 2007; Yutin et al., 2014; Moreira and Lopez-Garcia 2015), or they could have been vertically inherited from unknown ancestral cellular organisms (‘the reductive evolution from the fourth (or more) domain(s) hypothesis’) (Raoult et al., 2010; Abergel et al., 2015). These scenarios are not mutually exclusive. For instance, an ancient cellular organism might have evolved into an ancestor of Megavirales by reductive genome evolution, followed by later re-accumulation of cellular genes in the course of viral evolution, a “genome accordion” (Filee, 2013).

Subsequently, there has been an intense debate about the relative importance of gene transfers from the host during the course of Megavirales evolution. Many phylogenetic studies (Iyer et al., 2006; Filee et al., 2007, 2008; Moreira and Brochier-Armanet, 2008; Yutin et al., 2013; Moreira and Lopez-Garcia, 2005; Yutin et al., 2014) have pointed to a decisive role of HGTs and genetic exchanges in the evolution of MVs, but the majority of them are based on closely related Megavirales families. The diversity of Megavirales (classified in 10 distantly related families with varying gene content) makes it difficult to evaluate their phylogenetic relationships collectively. A small subset of conserved core genes, and the phylogenomic analyses based on them, provide a useful classification of MVs but give little insight into the remaining unconserved and variable gene content of the accessory genomes. Nevertheless, a consensus tends to emerge in which all Megavirales infecting protozoa (Mimiviridae, Marseilleviridae, Phycodnaviridae, etc.) display few cases of gene transfer from their hosts (ranging from 7 to 22, which represents a small fraction, less than 1%, of the total proteome). By contrast, despite having smaller genomes, NCLDVs infecting Metazoa have the highest ratio of host-derived genes (number/genome length) (Filee et al., 2008). Among them, poxviruses have the strongest tendency to acquire host genes (up to 13% of the total proteome). Host gene acquisition thus appears to be a quantitatively major source of gene novelty in Megavirales. However, the estimated proportion of horizontally acquired genes varies greatly with the methodologies used for their detection and with the interpretation of the resulting phylogenies. A systematic search for reticulate evolutionary events such as HGT in Megavirales is therefore needed to decipher the genomic composition and genome mosaicism of distantly related MV families.

To investigate the contribution of HGTs in distantly related Megavirales families, we determined gene distributions and gene phylogenies for the 86 complete Megavirales ORFomes, classified in 6 defined and 4 putative families, in the context of their homologs from other domains of life. We first built an automated phylogenetic workflow, MimiLook, implemented as a Perl command-line program, that deduces orthologous groups (OGs) from the ORFomes of Megavirales and constructs phylogenetic trees by performing alignment generation, alignment editing and BLASTP searches against the NCBI nr protein sequence database. Finally, this tool detects statistically validated gene acquisition events with the help of the T-REX algorithm by comparing each individual gene tree with the NCBI species tree. By applying MimiLook, we found that nine percent of Megavirales gene families (i.e., OGs) have been acquired by HGT, 80% of OGs are Megavirales specific, eight percent share common ancestry with members of cellular domains or other viruses (Eukaryotes, Bacteria, Archaea, phages or other viruses) and three percent are ambivalent (Chapter 3). In other words, a vast number of genes are not conserved across distantly related MV families. A family-specific pattern of gene acquisition was also seen: only 3% of OGs are acquired by horizontal gene transfer and shared between families, compared with 6% of OGs that are acquired and specific to a particular MV family. Instances of HGT showed donor specificity: viruses of Metazoa (Poxviridae, Ascoviridae and Iridoviridae) acquired genes from donors such as Euteleostomii, Eutheria, Baculoviridae and Proteobacteria, whereas algal viruses (Phycodnaviridae) and viruses of protozoa (pandoravirus, Mimiviridae, pithovirus and Marseilleviridae) acquired genes mainly from cellular donors such as Dictyostelium, Mamiellales, Firmicutes, Clostridiales, Klebsormidium, Rozella allomycis, Oomycetes and Phytophthora. Taking all the data into consideration, a clear distinction can be seen in the genome mosaicism of distantly related Megavirales families, which evolved through genome specificity and family-specific gene acquisitions from their respective ecological niches. Our systematic search for HGT events of non-Megavirales origin provides the first estimate of the total contribution of HGT to the family-specific genome mosaicism of distantly related Megavirales (Chapter 4).
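The distinction between family-specific and shared OGs used above can be illustrated with a minimal Python sketch. It is not the MimiLook implementation (which is a Perl program); the function name, the input dictionaries and the toy genome labels are hypothetical and serve only to show how an OG is labelled once the family of each contributing genome is known.

```python
def classify_ogs(og_members, family_of):
    """Split orthologous groups (OGs) into family-specific and shared ones.

    og_members: dict mapping an OG id to the genomes contributing proteins to it.
    family_of:  dict mapping a genome id to its Megavirales family.
    Both inputs are hypothetical structures chosen for this illustration.
    """
    family_specific, shared = {}, {}
    for og, genomes in og_members.items():
        families = {family_of[g] for g in genomes}
        # An OG is family specific when all of its members come from one family.
        if len(families) == 1:
            family_specific[og] = families
        else:
            shared[og] = families
    return family_specific, shared


# Toy data (illustrative labels only, not results from the thesis).
family_of = {
    "APMV": "Mimiviridae",
    "Megavirus": "Mimiviridae",
    "VACV": "Poxviridae",
    "VARV": "Poxviridae",
    "PBCV1": "Phycodnaviridae",
}
og_members = {
    "OG1": ["APMV", "Megavirus"],
    "OG2": ["APMV", "VACV", "PBCV1"],
    "OG3": ["VACV", "VARV"],
}

specific, shared = classify_ogs(og_members, family_of)
fraction_specific = len(specific) / len(og_members)
print(sorted(specific), sorted(shared), round(fraction_specific, 2))
```

Applied to a full OG table, the same ratio corresponds to the proportion of family-specific OGs reported in this work.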


Chapter-2

Megavirales diversity and evolution: A review


TITLE PAGE

Full-length title: Megavirales diversity and evolution

Author list: Sourabh Jain1,2, Philippe Colson2,3, Didier Raoult2,3, Pierre Pontarotti1*

Affiliations:

1Aix-Marseille Université, Ecole Centrale de Marseille, I2M UMR 7373, CNRS équipe Evolution Biologique et Modélisation, Marseille, France; [email protected]

2Aix-Marseille Université, Unité de Recherche sur les Maladies Infectieuses et Tropicales Emergentes (URMITE), UM63 CNRS 7278 INSERM U1095 IRD 198, Faculté de Médecine, Marseille, France ; [email protected]

3 IHU Méditerranée Infection, Assistance Publique-Hôpitaux de Marseille, Centre Hospitalo-universitaire Timone, Pôle des Maladies Infectieuses et Tropicales Clinique et Biologique, Fédération de Bactériologie-Hygiène-Virologie, Marseille, France ; [email protected]

*Correspondence: [email protected]

Keywords: Mimiviridae; Marseilleviridae; Phycodnaviridae; Poxviridae; Asfarviridae; Ascoviridae; Iridoviridae; Megavirales; Nucleocytoplasmic large DNA viruses; Giant viruses


INTRODUCTION

Nucleo-cytoplasmic large DNA viruses (NCLDVs) constitute an apparently monophyletic group that was first defined in 2001 (Iyer et al. 2001). The group consists of seven defined viral families, namely Poxviridae, Ascoviridae, Iridoviridae, Asfarviridae, Phycodnaviridae, Mimiviridae and Marseilleviridae, together with 3 putative families, Faustovirus, Pithovirus and Pandoravirus, and infects a broad variety of eukaryotes (Figure 1). Hosts include green and brown algae (phycodnaviruses), various protists (mimiviruses and marseilleviruses) and Metazoa (poxviruses, iridoviruses, asfarviruses) (Koonin and Yutin 2010). These viruses either replicate exclusively in the cytoplasm of the host cell or have both cytoplasmic and nuclear stages in their life cycle (Moss, 2001). The hosts of the NCLDVs thus cover a major part of eukaryotic diversity. In addition, these viruses share a common ancestral origin, as indicated by a set of ancestral genes, a common virion architecture and virus reproduction within cytoplasmic factories. These features support the classification of all NCLDV families into a new viral order, named the “Megavirales” in reference to the large or giant size of the virions and their genomes. The gene repertoire of Megavirales members encompasses several groups of genes: core genes that are shared by all or a majority of viruses, laterally transferred genes, duplicated genes and ORFan genes. In the present review, we summarize these gene contents.

GENE CONTENT OF THE NCLDVS

Core genes

In 2001, Iyer et al. described the monophyletic origin of members of four viral families, Poxviridae, Asfarviridae, Iridoviridae and Phycodnaviridae, and gathered them in a superfamily, the nucleocytoplasmic large DNA viruses, defined by their large size, their DNA genome and the nuclear or cytoplasmic stages observed during the viral replication cycle (Iyer et al. 2001). In 2006, this work was updated with the analysis of Mimivirus, discovered in 2003, and of additional genomes of iridoviruses, phycodnaviruses and poxviruses (Iyer et al. 2006). The core genes identified for these viruses were classified as class I when found in all families, class II when missing in some species despite being present in all families, class III when absent from one family, and class IV when absent from more than one (Iyer et al. 2006; Iyer et al. 2001). Nine genes were found to be shared by all members of all NCLDV families: a VV D5-type ATPase, a DNA polymerase (B-family), a VV A32 virion packaging ATPase, a VV A18 helicase, a capsid protein (D13), a thiol oxidoreductase, a VV D6/D11-like helicase, a S/T protein kinase, and a transcription factor (VLTF2). In addition, members of at least three of the four families shared 22 other core genes. Recent analyses delineated about 50 core genes in the NCLDVs (Yutin and Koonin 2012).

In 2009, Yutin et al. described clusters of orthologous groups of proteins (COGs) for the NCLDVs, which they named Nucleo-Cytoplasmic Virus Orthologous Groups (NCVOGs) (Yutin et al. 2009). A total of 1,445 NCVOGs were identified, of which 177 are represented in more than one NCLDV family. A maximum-likelihood reconstruction identified a set of 47 conserved genes that were likely present in the genome of the common ancestor of the megaviruses. Five NCVOGs are shared by all NCLDV genomes, namely the major capsid protein (orthologs of vaccinia virus D13 protein), primase-helicase (VV D5), family B DNA polymerase (VV E9), packaging ATPase (VV A32), and transcription factor (VV A2).

The majority of the core genes of Megavirales members encode enzymes involved in DNA metabolism and replication, or structural proteins. Megavirales members therefore encode a nearly complete DNA replication apparatus in addition to key enzymes involved in the final steps of DNA metabolism (Iyer et al. 2006; Koonin & Yutin, 2010; Yutin & Koonin, 2012; Yutin et al. 2009). The Megavirales core genes seem to have originated from different sources, including homologous genes of bacteriophages, bacteria and eukaryotes. This suggests that these viruses originated at an early stage of eukaryotic evolution through extensive mixing of genes from widely different genomes (Koonin & Yutin, 2010; Yutin et al. 2009), and more recent analyses have highlighted the substantial complexity and diversity of these evolutionary scenarios (Yutin & Koonin, 2012).
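The four-class scheme of Iyer et al. described above can be written as a short decision rule. The Python sketch below is only an illustration under stated assumptions: class I is read here as "present in every analysed species of every family", the function and variable names are hypothetical, and the toy species sets are not the data of the original studies.

```python
def core_gene_class(presence, family_species):
    """Assign a core gene to class I-IV from its presence pattern.

    presence:       dict family -> set of species in which the gene was found.
    family_species: dict family -> set of all species analysed in that family.
    Assumption: class I means present in every species of every family.
    """
    absent = [fam for fam in family_species if not presence.get(fam)]
    if len(absent) > 1:
        return "IV"   # absent from more than one family
    if len(absent) == 1:
        return "III"  # absent from exactly one family
    in_all_species = all(
        family_species[fam] <= presence[fam] for fam in family_species
    )
    # Present in all families: class I if no species lacks it, else class II.
    return "I" if in_all_species else "II"


# Toy example with the four families analysed in 2001 (species sets are illustrative).
FAMILIES = {
    "Poxviridae": {"VACV", "VARV"},
    "Asfarviridae": {"ASFV"},
    "Iridoviridae": {"FV3", "LCDV"},
    "Phycodnaviridae": {"PBCV1"},
}
everywhere = {fam: set(sp) for fam, sp in FAMILIES.items()}
patchy = dict(everywhere, Poxviridae={"VACV"})
minus_one = {f: s for f, s in everywhere.items() if f != "Asfarviridae"}
minus_two = {f: s for f, s in everywhere.items()
             if f in ("Poxviridae", "Iridoviridae")}

for name, gene in [("everywhere", everywhere), ("patchy", patchy),
                   ("minus_one", minus_one), ("minus_two", minus_two)]:
    print(name, core_gene_class(gene, FAMILIES))
```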

Poxviruses

The poxviruses (family Poxviridae) are double-stranded DNA (dsDNA) viruses with very large genomes (130–360 kilobase pairs (kbp) in length), usually encoding more than 150 genes per genome (Table 1) (Lefkowitz et al. 2006, Moss 2001). The best-known poxviruses are Variola virus (VARV) and Vaccinia virus (VACV). VARV is the causative agent of smallpox, a disease that ravaged the human population until its eradication by a worldwide vaccination campaign (the last natural case occurred in 1977). Poxvirus replication occurs in the cytoplasm, which prevents the virus from using nuclear enzymes of the host and requires it to encode its own enzymes for DNA replication (Lefkowitz et al. 2006).

The discovery of homologs of vertebrate immune system signaling molecules in the genomes of poxviruses and herpesviruses sparked interest in the horizontal transfer of host genes to poxviruses (Hughes & Friedman, 2005; McFadden, 1995). Many of the host-derived genes apparently function in immunomodulation or in nucleic acid metabolism. Viral proteins that are highly similar to host proteins are predicted to be functional proteins that interfere with a variety of host immune defense mechanisms, including antigen display, cytokines and their receptors, and cytoplasmic signaling resulting from immune activation, or that are involved in the resistance of cells to oxidative stress and apoptosis.

Two important studies identified horizontal gene transfer events in poxviruses. Hughes and Friedman carried out a systematic search for horizontally transferred genes by phylogenetic methods (Hughes & Friedman, 2005), whereas Bratke and McLysaght studied gene order around putative horizontally transferred genes to distinguish single from multiple transfer events (Bratke and McLysaght 2008). The latter used two basic principles: (a) a horizontally transferred gene found at a conserved position relative to neighboring genes supports a single transfer event; (b) a horizontally transferred gene found at different genomic locations supports several transfer events (Bratke & McLysaght, 2008). Austin L. Hughes also carried out a detailed study of the origin and evolution of viral interleukin-10 and other DNA virus genes with vertebrate homologues (Hughes 2002). In some cases the phylogenies provided strong evidence that poxvirus genes originated well before the origin of vertebrates, including casein kinase-related PK2, which arose in poxviruses prior to the deuterostome–protostome divergence, and rpoA and rpoB, which arose prior to animal– divergence. Interestingly, poxvirus proteins that originated early in the history of life include proteins playing fundamental roles in DNA replication, such as rpoA and rpoB. The presence of such proteins may have been necessary for the origin of DNA viruses themselves.

Gene duplications in poxviruses were often lineage specific, and the most extensively duplicated viral gene families were found in only a few of the genomes. Twenty-two gene families were present in at least one species of the subfamily Entomopoxvirinae and at least one species of the subfamily Chordopoxvirinae. A total of 1,005 gene families were found in at least one of the 17 poxvirus genomes, while 95 families included two or more members in at least one of the genomes (Hughes, 2002).
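The two gene-order principles of Bratke and McLysaght can be illustrated with a deliberately simplified Python sketch. It is not their published method: it assumes that each genome carrying the candidate gene has already been reduced to the pair of gene families flanking it, and the function names and family labels are hypothetical.

```python
def count_flanking_contexts(flanks):
    """Count distinct genomic contexts of a putatively transferred gene.

    flanks: dict genome -> (left_neighbour_family, right_neighbour_family).
    A frozenset is used so that a simple inversion of the region
    (left and right neighbours swapped) still counts as the same context.
    """
    return len({frozenset(pair) for pair in flanks.values()})


def interpret(flanks):
    """Apply principles (a) and (b): one context suggests a single transfer
    event, several contexts suggest several independent events."""
    if count_flanking_contexts(flanks) == 1:
        return "single transfer event"
    return "several transfer events"


# Toy inputs (hypothetical genome and gene-family labels).
conserved = {
    "genomeA": ("famX", "famY"),
    "genomeB": ("famY", "famX"),  # same neighbours, inverted orientation
    "genomeC": ("famX", "famY"),
}
scattered = {
    "genomeA": ("famX", "famY"),
    "genomeB": ("famP", "famQ"),
}

print(interpret(conserved))
print(interpret(scattered))
```

A real analysis would also have to tolerate local rearrangements and gene losses around the insertion site, which this sketch ignores.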

Ascoviruses

Ascoviridae is a family of double stranded large DNA viruses which infect insects, where they produce large enveloped virions that are 150 by 400 nm in size and cause chronic fatal disease, with cytopathology resembling that of apoptosis (Bigot et al. 1997; Federici et al. 2000). Ascoviruses have circular genomes, size ranging from 116 to 190 kbp (Table1). In ascoviruses, lateral gene transfers were identified by BLASTp that detected homologs in eukaryotic, bacterial genomes, and genomes from other megaviruses than ascoviruses and from viruses that do not belong to the proposed order Megavirales. Six open reading frames (ORFs) have been identified as of eukaryotic (Zinc-dependent metalloprotease, Unknown protein, Endonuclease, Serine/Threonine protein kinase, Hydroxysteroid (17- beta) dehydrogenase, Metallo-hydrolase) and bacterial origin (two Metallo-hydrolase, Acyl-Coenzyme A Binding Protein, BRO-like protein 12, CK1 family protein kinase, RedQ-like DEAD helicase and four ORFs from other Megavirales or non-Megavirales (IAP-like

protein, Unknown protein, Ubiquitin, NTPase/helicase) (Bigot et al. 2008). The bro gene and bro-like genes were identified in the viral families Ascoviridae and Iridoviridae but not in other invertebrate or vertebrate genomes, vertebrate viruses or transposons, nor in prokaryotic genomes except in prophages or bacterial transposons. Phylogenetic analysis of bro genes suggested that they resulted from recombination of viral genomes, which allowed the duplication and loss of genes and, at the same time, the acquisition of genes by horizontal transfer over evolutionary time (Bideshi et al. 2003). A common feature of many eukaryotic dsDNA viruses is the presence of multigene families. The major capsid protein gene is one of the genes studied in detail. In the genome of Trichoplusia ni ascovirus (TnAV2), two ORFs code for the major capsid protein and their sequences share 100% identity, which is not common in multigene families. Likewise, thymidine kinase has two homologues and the baculovirus repeated open reading frame (bro) has three homologues (Wang et al. 2006). The DpAV genome contains 6-8 interspersed repeated sequences of 494 bp with two imperfect palindromes and enhancer motifs similar to those of ubiquitous and viral early transcription factors. Such homologous regions were noted earlier in baculoviruses, where they are implicated in viral DNA replication (Bigot et al. 1997). Five repeat regions were found in the genome of Heliothis virescens ascovirus (HvAV3), with 94-100% identity among the repeats. These repeat regions code for a putative protein. The C terminus of this protein consists of a conserved

transposase domain. This putative domain is conserved in most transposase proteins. The presence of putative transposable elements within the repeat regions indicates that DNA might have been transferred to the ascovirus genome from the host. This element may explain the duplication of the gene in the genome (Asgari et al. 2007).

Iridoviruses

The Iridoviridae is a family of large linear double-stranded DNA viruses (~120-200 nm) (He et al. 2002; Jakob et al. 2001; Shi et al. 2010). The genome size of iridoviruses ranges from 105 to 212 kbp (Table 1). This family of viruses infects vertebrate (Ranavirus, Megalocytivirus, Lymphocystivirus) and invertebrate (Iridovirus, Chloriridovirus) hosts. An important characteristic of this family is its ability to infect a diverse array of hosts, which likely explains, at least in part, the diversity of gene content between genera. Iridovirus genomes are circularly permuted and terminally redundant. During the co-evolution of iridoviruses and their hosts, gene gains and losses are likely to have had host-specific effects. Gained genes could help the virus evade host defenses, while lost genes could be associated with the loss of an antigenic signal to the host immune system or with an increase in virulence (Bubić et al. 2004; McLysaght et al. 2003). Horizontal gene transfers between iridoviruses and their hosts may occur at a high rate because of the nuclear stage of iridovirus

DNA replication (Chinchar et al. 2009; Williams et al. 2005). Huang et al. studied gene gain and gene loss events based on the presence of clusters of orthologous genes in the 13 genomes that had been sequenced (Huang et al. 2009). The phylogenetic tree based on eleven concatenated proteins indicated that gene loss could occur throughout the tree; reptile ranavirus and amphibian ranavirus (+2/-) have fewer gene gain-and-loss events than fish ranavirus (+50/-24), fish lymphocystivirus (+65/-26), fish megalocytivirus (+86/-19) and insect iridovirus (+105/-). In iridoviruses, the major replicative and transcription enzymes possibly originated from their eukaryotic hosts, and the presence of these genes in all genera of the family indicates that the ancestral iridovirus must have acquired genes from its eukaryotic hosts before differentiating into the five current genera. The enzyme ribonucleotide reductase, which comprises the small and large subunits RR-1 and RR-2, has homologs in all iridoviruses except members of the genus Megalocytivirus, and these are thought to be derived from Rickettsia-like eubacteria (Gammon et al. 2010). This enzyme plays an important role in eukaryotic DNA synthesis. In addition, the RR-1 gene from members of the genus Iridovirus possesses an intein, whereas viruses from the genera Ranavirus and Lymphocystivirus do not. The presence of an intein in the RR-1 gene is very rare and has been observed only in certain bacteria and phages. In contrast, megalocytiviruses encode only the RR-2 gene, which has low homology with those of other iridoviruses.

Phylogenetic analysis suggests that the megalocytivirus RR-2 gene originated from a previous eukaryotic host.

Asfarviruses

African swine fever virus (ASFV) is a unique and complex pathogen that infects wild and domestic swine as well as soft-bodied ticks of the family Argasidae (Dixon et al. 1990; Lubisi et al. 2007). The ASFV double-stranded DNA genome varies in length from about 170 to 193 kbp depending on the isolate (Table 1). Owing to the gain or loss of ORFs from the multigene families, ASFV encodes between 151 and 167 ORFs. Short tandem repeats are present in asfarviruses and vary between isolates; they are located either within genes or within intergenic regions, leading to small length variations in the genome. The genes are distributed equally on the positive and negative strands. Multigene families are very common in asfarviruses: approximately 30% of the genes in the genome are paralogous, and their number differs between isolates (Agüero et al. 1990; Jones et al. 1987; Pires et al. 1997; Yozawa et al. 1994). The genes composing these families are arranged adjacent to each other and have the same orientation, which indicates that they evolved by gene duplication (De La Vega et al. 1994; Rodriguez et al. 1990). It has also been noted that these genes tend to be positioned in the terminal regions of the genome, with copies distributed at either end of the asfarvirus genome, and these genes were proposed

to be transformed during genome replication and resolution. Some of these genes may have different functions because of the presence of extra functional domains and large sequence divergence.

Faustovirus

Up to 2015, all giant viruses of amoebae were isolated by co-culturing on Acanthamoeba polyphaga or A. castellanii, phagocytic protists that are among the most predominant organisms in water and soil (Aherfi et al., 2016; Pagnier et al., 2013). Recently, another free-living amoeba, Vermamoeba vermiformis, described as the most common amoeba in hospital water (Pagnier et al., 2015), was used in a high-throughput strategy that aimed at isolating new giant viruses from environmental samples with new co-culture supports (Reteno et al., 2015). This approach was fruitful, as it led to the discovery of new giant viruses that represent a putative new viral group. Faustovirus E12, the prototype strain, was isolated from a sewage sample collected in Marseille, France. Its genome is a 466,265-bp-long circular double-stranded DNA that was predicted to encode 451 proteins. About two thirds of these putative proteins have no homolog in sequence databases, and 13% have homologs in other Megavirales members, mostly asfarviruses, followed by phycodnaviruses, mimiviruses, marseilleviruses and ascoviruses. Other best matches are mostly sequences from bacteria (9%) and eukaryotes (7%). Eight Faustovirus close relatives were

thereafter isolated from sewage samples collected in France, Senegal and Lebanon, and together they compose four lineages (Benamar et al., 2016). Phylogenomics showed that faustoviruses are most closely, although distantly, related to asfarviruses. The discovery of Faustovirus is further strong evidence that the diversity of giant viruses of amoebae is probably largely undiscovered and that new giant virus lineages will continue to be described in the near future.

Phycodnaviruses

The phycodnaviruses are DNA viruses that infect algae (Dunigan et al. 2006). These viruses have a wide range of hosts (including both marine and freshwater algae), which is associated with considerable genetic diversity, although their morphology is similar. The family name Phycodnaviridae was coined from two of their characteristics: “Phyco” comes from their algal hosts and “dna” comes from their double-stranded DNA genomes. Phycodnaviruses are grouped into six genera named on the basis of the viral host, namely Chlorovirus, Coccolithovirus, Prasinovirus, Prymnesiovirus, Phaeovirus and Raphidovirus (Dunigan et al. 2006; Wilson et al. 2009). Their genomes range in size from 100 kbp to over 550 kbp (Table 1) (Dunigan et al. 2006; Van Etten & Meints, 1999). Because phycodnaviruses infect a wide range of hosts, the possibilities for gene exchange are numerous. The chloroviruses encode enzymes required for the synthesis and glycosylation of

structural proteins, namely two UDP-D-glucose 4,6-dehydratases (UGDs) and a bifunctional UDP-4-keto-6-deoxy-D-glucose epimerase/reductase (UGER). Phylogeny suggested a possible recent horizontal transfer of the UGD gene from a green algal host. UGER, in contrast, is absent from Acanthocystis turfacea chlorella virus 1, but its host, chlorella, may encode this enzyme. Both of these genes are late genes that play an important role in the posttranslational modification of capsid proteins (Parakkottil Chothi et al. 2010). Ostreococcus tauri virus OtV-2 likely acquired genes encoding cytochrome b5, an RNA polymerase sigma factor and a high-affinity phosphate transporter from its host, as the three proteins show high homology with the Ostreococcus proteins. Moreover, the genes encoding cytochrome b5, the RNA polymerase sigma factor and four proteins of unknown function are arranged adjacent to each other in the viral genome. This may be a so-called “hot spot” region in Ostreococcus tauri virus 2 that is more prone to harbor host genes (Weynberg et al. 2011). In addition, Emiliania huxleyi virus 86 (EhV-86) acquired seven genes involved in the sphingolipid biosynthesis pathway from its host, the microalga Emiliania huxleyi (Monier et al. 2009; Wilson et al. 2005). Insertion elements present in phycodnaviruses belong to the bacterial and archaeal IS607 family (Frost et al. 2005). These insertion sequences do not occur between genes of bacterial origin and other genes; instead they co-localize with stretches of bacterial-like genes, which supports the view that they have been

inherited from bacterial genomes along with bacterial-like genes (Filée et al. 2007). The number of bacterial-like genes seems to depend on the host: hosts that engulf bacteria can provide an ecological niche in which viruses have access to bacterial gene pools. The two phycodnaviruses ESV-1 and EhV-86, which infect Ectocarpus siliculosus and Emiliania huxleyi, respectively, two free-living algae that are not known to ingest bacteria, have very few mobile genetic elements (only two copies of an IS4 family element), whereas other phycodnaviruses that infect Chlorella spp. with a symbiotic lifestyle show a considerable gain of bacterial-like genes (Filée et al. 2008).

Mimiviruses

The discovery of Acanthamoeba polyphaga mimivirus (APMV) by co-culturing with Acanthamoeba hosts dramatically changed the outlook on viruses because of its particle size and its gene content (La Scola et al. 2003; Raoult et al. 2004). Mimivirus genome size ranges from 617 kbp to 1,259 kbp (Table 1). In Mimivirus, homologs were identified for 9/9 class I core genes (100%), 6/8 class II core genes (75%), 11/14 class III core genes (79%) and 16/30 class IV core genes (53%) (Raoult et al. 2004). Among class II core genes, Mimivirus lacks two genes that are important for the biosynthesis of deoxythymidine-5'-triphosphate: thymidylate kinase and deoxyuridine-5'-triphosphate pyrophosphatase (dUTPase), but the class IV core genes thymidylate synthase and thymidine kinase have a homolog in Mimivirus.

Likewise, Mimivirus lacks the class III core gene adenosine 5'-triphosphate (ATP)-dependent DNA ligase, which is replaced by the class IV core gene nicotinamide adenine dinucleotide (NAD)-dependent DNA ligase. The Mimivirus genome is rich in nucleotide synthesis enzymes, including a deoxynucleoside kinase, a cytidine deaminase and a nucleoside diphosphate kinase, reported to be the first found in a double-stranded DNA virus. Raoult et al. identified in Mimivirus several unique genes that had not previously been reported in viruses, including genes coding for translation-associated proteins, DNA repair enzymes, chaperones and new enzymatic pathways, as well as genes believed to be trademarks of cellular organisms (La Scola et al. 2003; Legendre et al. 2011; Raoult et al. 2004). Since 2008, several new mimiviruses, including close relatives of Mimivirus that form the three lineages A, B and C (Mamavirus, Terra2 virus, Moumouvirus, Courdo11 virus, Megavirus chilensis) and others that are more distantly related (Cafeteria roenbergensis virus (CroV)), have been isolated from different phagocytic protists in different niches, including fresh water, ocean and soil (Arslan et al. 2011; Fischer et al. 2010; La Scola et al. 2010; Yoosuf et al. 2012). Only 4.6% of the Mimivirus gene repertoire is composed of NCLDV core genes, which indicates that this gene content is largely lineage specific. Lateral gene transfer and gene duplications have also strongly influenced the composition of the Mimivirus genome (Filée et al. 2008; Iyer et al. 2006; Raoult et al. 2004). Moreira and Brochier-Armanet specifically studied a set of 198

Mimivirus proteins attributed to COG families (Moreira & Brochier-Armanet, 2008; Tatusov et al. 2003). A total of 126 ORFs with reliable homologs in cellular species were retrieved, and the phylogenetic analyses inferred a eukaryotic origin for 60 of these 126 ORFs, approximately 10% of which appeared to have been acquired from amoebae. Filée et al. also identified 96 genes of bacterial origin. In the COG functional categories, the bacterial-like genes of Mimivirus (and of at least one phycodnavirus, NY2A) show a strong bias toward DNA replication and repair (20% of proteins) and cell envelope (12.5% of proteins). Three consecutive open reading frames encoding a sugar transaminase, a glycosyltransferase and a protein of unknown function were identified in the Mimivirus genome that are syntenic with three ORFs in the genome of Clostridium acetobutylicum, indicating that these bacterial-like genes were inherited as a short contiguous block; in addition, the bacterial-like genes tend to be clustered toward the extremities of the Mimivirus genome. Furthermore, a 38-kb genomic region of putative bacterial origin was identified in the CroV genome; it encodes 34 ORFs, 14 of which are most similar to bacterial proteins, and 7 of these are predicted to function in carbohydrate metabolism (Fischer et al. 2010). These findings support the speculation that these genes may have been acquired from a bacterium through frequent encounters between CroV and phagocytosed bacteria inside the host cytoplasm. They also suggest that eukaryotic hosts that use bacteria as food may act as a hotspot for the

exchange of DNA between replicating viruses and bacteria, thus providing a biological niche with access to bacterial genes (Filée et al. 2007; Fischer et al. 2010; Raoult & Boyer, 2010). The Moumouvirus genome analysis revealed substantial gene loss compared to Megavirus chilensis, indicating that mimiviruses from this lineage experienced genome reduction. A total of 85 genes located in the terminal regions of the Megavirus chilensis genome have apparently been lost in the moumouvirus lineage; an alternative, less parsimonious evolutionary scenario would involve independent acquisition of these genes in the Mimivirus and the Megavirus lineages. Two genes encoding metabolic enzymes, cysteine dioxygenase and NAD-dependent epimerase/dehydratase, are shared by Moumouvirus and CroV to the exclusion of other Megavirales members (Yoosuf et al. 2012). Mobile genetic elements that were previously thought to be specific to prokaryotes have been detected in the Mimivirus genome (Filée et al. 2007). They include insertion sequences, two homing endonucleases and an intein, which are considered major agents of lateral gene transfer in prokaryotes. The insertion sequences contain two ORFs, a transposase and a protein of unknown function (Frost et al. 2005; Ton-Hoang et al. 2005). In addition, the concurrent presence of a gene typically detected in bacterial prophages and of a nearby HNH endonuclease supports the hypothesis of acquisition by lateral gene transfer from a bacteriophage (Filée et al. 2007).

Duplicated genes were found to compose about one-third of the Mimivirus gene content (Suhre, 2005). Using PSI-BLAST with various e-values (1e-5 to 1e-25), 244 to 398 paralogous genes were identified that compose 58 and 86 families, respectively. Moreover, duplicated genes are inserted about twice as frequently in the parallel orientation as in the antiparallel orientation with respect to the coding direction of the matching gene (20 vs. 12%). Large paralogous families in Mimivirus are related to virus-host interactions. Ankyrin double-helix repeat-containing proteins are the most frequently duplicated proteins (66 homologs). These proteins are ubiquitously found in large paralogous families in both viral and bacterial genomes. WD repeats, L cluster, Pfam FNIP repeats and protein kinases are other paralogous proteins prevalent in Mimivirus. Glycosyltransferases, poxvirus transcription factors, transposase site-specific integrase-resolvases and collagen triple helix repeat-containing proteins are also widely present in the Mimivirus genome. These proteins have a wide range of functions, including virus-host interactions, host signaling and other regulatory processes.
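The dependence of paralogous family counts on the e-value cutoff can be illustrated with a short Python sketch. It assumes that pairwise similarity hits between genes of one genome are already available as (query, subject, e-value) tuples, and it groups genes by single-linkage clustering with a union-find structure. Both the input format and the clustering rule are assumptions made for illustration; the source only states that PSI-BLAST was used with several e-values, and the gene names below are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Hit = Tuple[str, str, float]  # (query gene, subject gene, e-value)


def paralog_families(hits: Iterable[Hit], max_evalue: float) -> List[List[str]]:
    """Group genes into families linked by hits with e-value <= max_evalue."""
    parent: Dict[str, str] = {}

    def find(gene: str) -> str:
        # Union-find lookup with path halving.
        parent.setdefault(gene, gene)
        while parent[gene] != gene:
            parent[gene] = parent[parent[gene]]
            gene = parent[gene]
        return gene

    for query, subject, evalue in hits:
        # Self-hits carry no information about duplication and are skipped.
        if query != subject and evalue <= max_evalue:
            parent[find(query)] = find(subject)

    groups = defaultdict(list)
    for gene in list(parent):
        groups[find(gene)].append(gene)
    families = [sorted(members) for members in groups.values() if len(members) > 1]
    # Largest families first, ties broken alphabetically.
    return sorted(families, key=lambda members: (-len(members), members))
```

Running the function on the same hit list with two cutoffs shows how a stricter threshold changes both the number of families and the number of genes assigned to them, which is why such counts are best reported together with the cutoff used.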

In 2008, La Scola et al. described a new strain of Mimivirus, named Mamavirus (La Scola et al. 2008). Further observation of Mamavirus revealed a novel virus-like agent called Sputnik, which is small (50 nm in size) and icosahedral in shape and which coexisted with Mamavirus in the cytoplasm of infected amoebal cells and inside the mamavirus factories. Sputnik was named a virophage because of its

functional analogy to bacteriophages, as it only multiplies within A. castellanii if these cells are co-infected with Mimivirus or Mamavirus. The Sputnik genome encodes a protein with homologs in a marine metagenome that belongs to the family of bacterial insertion sequence transposase DNA-binding subunits, and the Sputnik ORF 10 is closely related to integrases of the tyrosine recombinase family from archaeal viruses and proviruses. The virophage could be a vehicle mediating lateral gene transfer between giant viruses (La Scola et al. 2008). In 2011, Fischer et al. identified Mavirus, another virophage, which parasitizes Cafeteria roenbergensis virus (Fischer & Suttle, 2011). Yau et al. thereafter reported a new virophage that preys on phycodnaviruses of prasinophytes (Yau et al. 2011). Sputnik 2, the fourth virophage described, was isolated from a human-associated sample (Cohen et al. 2011). Recently, Santini et al. discovered another virophage, named PgVV, which infects Phaeocystis globosa virus PgV-16T and has a genome of 19,527 bp (Santini et al. 2013). It has recently been shown that the virophages of the mimiviruses have a broad host range and can thus serve as vectors for gene exchange among the three lineages of amoeba-associated mimiviruses (Clarke et al. 2013; Desnues et al. 2012; Gaia et al. 2013; Yutin & Koonin, 2009). In a recent study, the construction of Clusters of Mimivirus Orthologous Genes (mimiCOGs) led to the reclassification of Organic lake phycodnaviruses and Phaeocystis globosa viruses as members of the family Mimiviridae, although these viruses were initially classified

within the family Phycodnaviridae; this further indicates that, so far, only viruses within the family Mimiviridae support the reproduction of virophages (Yutin et al. 2013).

Marseilleviruses

The family Marseilleviridae encompasses viruses with a double-stranded DNA genome (Colson et al. 2013). Marseillevirus, the founding member of this family, discovered in 2008, has a circular genome (Boyer et al. 2009), while the genome of Lausannevirus, another marseillevirus described in 2011, was found to be either a linear molecule with terminal repeats or a circularized molecule (Thomas et al. 2011). So far, three genomes of Marseilleviridae have been sequenced and annotated, including that of the recently reported Cannes8 virus (Aherfi et al. 2013; Boyer et al. 2009; Thomas et al. 2011). The size of these genomes ranges from 346 kbp to 374 kbp (Table 1). In the Marseillevirus genome, 28 of the 457 predicted ORFs are bona fide NCLDV core genes, out of the 41 previously defined class I–III genes (Boyer et al. 2009). Six ORFs are universal NCLDV proteins, and 17 are shared with Mimivirus/Mamavirus but are absent from other Megavirales members. Based on phylogenetic analysis, 51 Marseillevirus ORFs might be of NCLDV origin. As in other megaviruses, including Mimivirus, the proportion of Marseillevirus ORFs that belong to the NCLDV core gene set is very small (6.1%). All core genes reported in Marseillevirus have orthologs in

Lausannevirus, including a thymidine kinase (Thomas et al. 2011). Comparative genomics and phylogenetic analysis of Marseillevirus genes have strongly highlighted the mosaicism of the Marseillevirus genome and identified gene exchange with bacteria, archaea, other viruses and eukaryotes, including amoebae (Boyer et al. 2009). Interestingly, a non-random connection between the inferred origins and the functions of marseillevirus genes was observed. Notably, genes encoding defense and repair functions, in particular nucleases, tended to be of bacterial and bacteriophage origin; genes encoding metabolic enzymes and proteins implicated in protein and lipid modification or degradation tended to be of bacterial and eukaryotic origin; and genes related to signal transduction tended to be of eukaryotic origin. Marseillevirus and Lausannevirus were found to encode three histone-like proteins (Boyer et al. 2009; Thomas et al. 2011). Histone-like proteins have been described in several viruses, including an H3-H4 protein in Heliothis zea virus, an H4 protein in bracoviruses and an H2B protein in an Ostreid herpesvirus integrated into the amphioxus genome (Cheng et al. 2002; De Souza et al. 2010; Gad & Kim, 2008). Viral histones may interact with the host cell DNA or regulate the viral DNA. Eukaryotic organisms acquired four histones, H2A, H2B, H3 and H4, which help to form the nucleosome and wrap the DNA (Talbert & Henikoff, 2010). Histones are present in all archaeal phyla, including the deepest branching phylum Thaumarchaeota (Cubonová et al. 2005; Sandman & Reeve, 2006). Ancestral marseilleviruses may have acquired histone doublets from an unknown eukaryote (Thomas et al. 2011). MORN repeat-containing proteins, various endonucleases and serine/threonine kinases, F-box-containing proteins and ubiquitins are abundantly present in members of the family Marseilleviridae (Aherfi et al. 2013; Boyer et al. 2009; Thomas et al. 2011). The membrane occupation and recognition nexus (MORN) repeat domains enhance membrane-membrane or membrane-cytoskeleton interactions (Gubbels et al. 2006).

Pandoravirus

Two new giant viruses were subsequently discovered whose genomic analysis showed that they had no phylogenetic kinship with the Mimiviridae. They were at first referred to internally as a "new life form" because of their size (more than one micron long), their morphology (a kind of amphora) and their properties (multiplication without apparent division). These two new viruses (Philippe et al. 2013) were named Pandoravirus salinus (isolated from the Chilean coast) and Pandoravirus dulcis (isolated from a pond in the middle of La Trobe University near Melbourne). They represented the first specimens of the Pandoraviridae family, which now numbers many new members (undergoing characterization). Under electron microscopy, their virions reveal the same complex ultrastructure: an internal compartment bordered by a membrane, itself

surrounded by a 70 nm thick tegument made up of three layers: 20 nm of an internal layer of low electron density, 25 nm of a mesh of fibrils parallel to the electron-dense surface, and 25 nm of a layer of materials of intermediate density. An apical pore is visible at one end of each particle; its opening allows the contents of the particle to be delivered into the cytoplasm of the host through a channel formed by fusion of the inner membrane of the virion with that of the vacuole of the host amoeba. In contrast to the Mimiviridae, the Pandoravirus virions do not show the electron-dense central region that usually corresponds to the genetic material. Paradoxically, the localization and physical structure of the enormous genome characterizing Pandoraviruses remain mysterious for the moment.

Pandoraviruses have proved to possess properties that overturn many of the prevailing ideas about viruses. Pandoravirus salinus, for example, has a DNA genome of 2.8 million base pairs with the capacity to encode more than 2,500 proteins, more than 90% of which have no sequence similarity to proteins (viral or cellular) already listed in the databases. Paradoxically, in spite of a genome whose size is comparable to that of the smallest eukaryotic microorganisms and three times larger than that of mycoplasmas, Pandoraviruses depend on the host nucleus for their replication, unlike Mimiviridae, whose infectious cycle takes place entirely in the cytoplasm. The synthesis of the particles of

Pandoravirus is carried out by a totally original mechanism where the complex tegument that delimits the virions seems to be synthesized together with their contents. The synthesis of these amphora-shaped virions is initiated from the tip (apex), the opening of which serves to pour the contents of the particle into the cytoplasm at the initiation of the infection.

The discovery of Pandoravirus, which has no relationship to Mimiviridae, immediately suggested that the diversity of giant viruses infecting the same amoebal host may be greater than had been suggested by the repeated isolation of the first Mimiviridae. The amphora shape of the Pandoravirus particles also suggested that giant viruses are not limited to icosahedral morphologies, opening the possibility that other viral families with virions lacking any particular symmetry could have been mistaken for parasitic bacteria, as was the case for Pandoravirus (Claverie et al., 2015).

Pithovirus

Co-culture of a few grams of permafrost samples (Yashina et al., 2012) with Acanthamoeba castellanii rapidly revealed cell mortality accompanied by the multiplication of particles with a shape comparable to that of Pandoravirus virions, although slightly more elongated (1.5 μm vs 1 μm). The particles have a 60 nm thick envelope composed of parallel strips that are perpendicular to the surface. This envelope surrounds a lipid membrane

that delimits an internal compartment, the only discernible structure of which is a small, very electron-dense sphere of about 50 nm in diameter (sometimes also visible in Pandoraviruses).

Combined with the absence of replication by division, the determination of the complete genome sequence and of its gene content allowed us to conclude that this prehistoric microorganism, which we named Pithovirus sibericum with reference to its amphora shape and its origin, is truly viral in nature (Legendre et al., 2014). Its double-stranded DNA genome is circular (or circularly permuted) and AT-rich (64%) like that of Mimiviridae, but has only 610,033 base pairs, a very modest size compared to that of the particle (the most voluminous to date). Paradoxically, although only 467 proteins are encoded, Pithovirus seems to have all the functions necessary for its replication without resorting to the nucleus of its host, unlike Pandoraviruses. However, as expected from a virus, the Pithovirus genome does not reveal any trace of a translation machinery (ribosome, tRNA [transfer RNA] ligases, ribosomal RNA). Pithovirus sibericum thus inaugurates a new family of giant viruses (the "Pithoviridae") with fully cytoplasmic replication, but without the slightest phylogenetic kinship with the Mimiviridae. As is usual for giant viruses inaugurating a new family, more than two thirds of its proteins have no significant similarity in the databases. Recently, our group also described a new relative of Pithovirus sibericum named Pithovirus massiliensis (their genomes are 84% identical on average) (Levasseur et al., 2016). The new family Pithoviridae is therefore expected to gain new members in the years to come.

ORFans in NCLDV

ORFans are genes without detectable homologs in sequence databases (Fischer & Eisenberg, 1999). ORFan genes have a limited phylogenetic distribution: homologous genes are either restricted to closely related organisms or not detectable at all in other organisms. Another observation is that the proportion of ORFans remains the same even though the number of sequenced genomes is increasing (Yin & Fischer, 2006). Various hypotheses have been proposed about the origin of ORFans: they may have originated from gene duplication or lateral gene transfer, or might correspond to genes created de novo (Daubin & Ochman, 2004; Davids et al. 2003). Several studies emphasize that ORFans represent genes of viral

54 origin. Viral genomes possess higher proportion of ORFans compared to other microrganisms. Boyer et al. carried out a study to decipher the importance of ORFans in Megavirales families (Ascoviridae, Iridoviridae, Poxviridae, Phycodnaviridae, Asfarviridae, Mimiviridae and Marseilleviridae) (Boyer et al. 2010). At least one representative member was selected in each family, and its genome was submitted to new ORF prediction to bring normalization in viral genome prediction. A total of 38% of predicted ORFs in all viral genomes showed no match against RefSeq and were classified as ORFans. However ORFan percentage showed a large range of variation [between 2.8% (PBCV-NY2A) and 75.2% (EhV-86)] according to the type of virus (Table 2). The metaORFans are ORFans having homologs in environmental databases. MetaORFans proportions in megavirus genomes were 3.5%. The detailed number of ORFans and metaORFans in each genome of Megavirales members are represented in Table 2. Some members of families Iridoviridae, Poxviridae, Phycodnaviridae and Ascoviridae, found no significant match against the environmental databases. In contrast, more than 10% of asfarvirus ORFans were converted to metaORFans. In all megavirus genomes analyzed, mean ORFan length (587 bp) was significantly shorter than non-ORFan length (1,149 bp), indicating that ORFans are over-represented among the shorter ORFs in these genomes. Besides, ORFans and non-ORFans exhibit a similar nucleotide composition pattern.
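The classification described above can be illustrated with a minimal Perl sketch. It is not the procedure of Boyer et al. (2010): the ORF records, field names (refseq_hit, env_hit) and values are invented for illustration, and each ORF is assumed to already carry the result of its database searches.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical ORF records (values invented for illustration): length in bp,
# plus flags for a significant match in RefSeq and in environmental databases.
my @orfs = (
    { id => 'orf1', length => 1200, refseq_hit => 1, env_hit => 1 },
    { id => 'orf2', length =>  450, refseq_hit => 0, env_hit => 0 },
    { id => 'orf3', length =>  600, refseq_hit => 0, env_hit => 1 },
    { id => 'orf4', length => 1500, refseq_hit => 1, env_hit => 0 },
    { id => 'orf5', length =>  300, refseq_hit => 0, env_hit => 0 },
);

# An ORF with a RefSeq match is a non-ORFan. An ORF without one is an ORFan;
# if it still matches environmental sequences it is counted as a metaORFan.
sub classify_orf {
    my ($orf) = @_;
    return 'non-ORFan' if $orf->{refseq_hit};
    return $orf->{env_hit} ? 'metaORFan' : 'ORFan';
}

# Returns ORFan and metaORFan percentages (relative to all ORFs) and the
# mean lengths of ORFans (metaORFans included) and of non-ORFans.
sub summarize_orfs {
    my (@records) = @_;
    my %count = ( 'non-ORFan' => 0, 'metaORFan' => 0, 'ORFan' => 0 );
    my %bp    = ( orfan => 0, non_orfan => 0 );
    for my $orf (@records) {
        my $class = classify_orf($orf);
        $count{$class}++;
        my $group = $class eq 'non-ORFan' ? 'non_orfan' : 'orfan';
        $bp{$group} += $orf->{length};
    }
    my $n_non   = $count{'non-ORFan'};
    my $n_orfan = $count{'ORFan'} + $count{'metaORFan'};
    my $n_total = $n_orfan + $n_non;
    my %summary = (
        orfan_pct         => $n_total ? 100 * $n_orfan / $n_total : 0,
        metaorfan_pct     => $n_total ? 100 * $count{'metaORFan'} / $n_total : 0,
        mean_orfan_bp     => $n_orfan ? $bp{orfan} / $n_orfan : 0,
        mean_non_orfan_bp => $n_non   ? $bp{non_orfan} / $n_non : 0,
    );
    return \%summary;
}

my $summary = summarize_orfs(@orfs);
printf "ORFans: %.1f%% of ORFs (metaORFans: %.1f%%)\n",
    $summary->{orfan_pct}, $summary->{metaorfan_pct};
printf "Mean length: %.0f bp (ORFans) vs %.0f bp (non-ORFans)\n",
    $summary->{mean_orfan_bp}, $summary->{mean_non_orfan_bp};
```

For the five invented records, three ORFs lack a RefSeq match, so the sketch reports 60% ORFans (20% metaORFans) and mean lengths of 450 bp and 1350 bp. Counting metaORFans as a subset of ORFans follows the definition given in the text; whether the published 3.5% is expressed relative to all ORFs or to ORFans only is not stated here, so the denominator used in the sketch is an assumption.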

Evolutionary scenarios of Megavirales

Beyond the roughly 20 genes that they may have in common, giant viruses possess hundreds of genes encoding proteins without any similarity to those of the cellular world (bacteria, archaea, eukaryotes) or even of other viruses. For each first virus inaugurating a new family, more than two thirds of the proteins (and up to 90% for Pandoraviruses) are "orphans". Such a situation argues neither for a common origin (except for the few genes they share) nor for a cellular origin (prokaryotic or eukaryotic). Viral genome gigantism challenges the traditional definition of viruses, conceived as small and simple organisms. Understanding the evolutionary forces acting on Megavirales genomes has profound implications for our knowledge of the viral world and of the interplay between cellular organisms and viruses. Additionally, there is accumulating evidence that several Megavirales cause respiratory tract infections. Although the exact mechanism of their pathogenicity is currently unknown, mimiviruses are suspected to be the etiologic agents of numerous cases of pneumonia acquired by patients in intensive care units, but also by apparently healthy patients (Kutikhin et al., 2014).

The origin of genome gigantism in these families is still a matter of intense controversy between the advocates of the “genome degradation hypothesis” (Claverie, 2006) and those defending the “genome expansion hypothesis” (Moreira and López-García, 2005; Filée et al., 2008; Yutin et al., 2014). The “genome degradation hypothesis” postulates that giant viruses (Megavirales) derive from a cellular ancestor by progressive genome simplification linked to adaptation to a parasitic lifestyle. Notably, the presence of typical cellular hallmark genes, such as translation-related genes, has been taken to support the hypothesis that Megavirales derive from a cellular ancestor (Arslan et al., 2011). However, phylogenetic studies indicate that most, if not all, of these translation-related genes result from lateral gene transfers (LGTs) from cellular organisms (Moreira and López-García, 2005; Yutin et al., 2014). Indeed, many studies have pointed out the central role of lateral gene transfers during the evolution of Megavirales (Iyer et al., 2006; Filée et al., 2007, 2008; Moreira and Brochier-Armanet, 2008; Yutin et al., 2013). Finally, gene and genome duplications (Suhre, 2005; Filée and Chandler, 2008), together with the dissemination of various mobile genetic elements such as introns or transposons (Filée et al., 2007; Desnues et al., 2012), have also been identified as important players in giant virus genome evolution. The combination of these forces supports the “genome expansion hypothesis”, in which Megavirales evolved from a relatively simple viral ancestor by progressive gene accretion and duplication. The nature of this ancestor remains speculative, but recent discoveries indicate that Megavirales may derive from DNA transposons belonging to the Polinton/virophage superfamily (Krupovic and Koonin, 2015).

However, experimental data have shown that, under laboratory conditions, members of the Megavirales can experience rapid genome expansion and contraction. Indeed, under particular selective constraints, poxvirus genomes undergo successive steps of gene duplication and gene loss (Elde et al., 2012). Symmetrically, when mimiviruses are cultivated with amoebae in bacteria-free media (bacteria being the prey of their amoebal hosts), numerous genome reductions occur, mainly caused by large deletions (Boyer et al., 2011). On the basis of these data, Filée recently proposed a model in which Megavirales evolved through a complex “genomic accordion” process rather than a general tendency toward either genome expansion or reduction (Filée, 2013). According to this hypothesis, Megavirales undergo successive cycles of genome expansion and reduction in order to adapt to modified environmental conditions or new hosts.

The expansion and degradation hypotheses both face major difficulties. The first problem of the expansion scenario is that it postulates an evolutionary pressure contrary to that documented for all parasites, in particular intracellular parasites: parasitism is inevitably accompanied by the irreversible loss of genes and functions, which leads the parasite into an increasing dependence on its host. The second problem is the mysterious origin (cellular or viral) of the genes "acquired" by these viruses. In all likelihood, the viral genomes should keep a phylogenetic trace of their origins, but this is not the case, since most of their genes do not resemble any known gene. The reductive scenario, by contrast, is in perfect harmony with the universal tendency of parasites to lose genes. But the lack of similarity between viral genes and those of the current cellular world, like that between the different viral families, remains to be explained. It has been proposed that the different families of giant viruses descend from distinct proto-cellular lineages that coexisted, in competition, with the one that ultimately led to the common ancestor of current cells (LUCA: last universal cellular ancestor). These lineages, "losers" of the evolutionary competition, would then have found no means of survival other than becoming parasites of the "winning" lineage, which is at the origin of the entire cellular world of today. Under this scenario, giant viruses would be real "living fossils" of aborted proto-cellular lineages. The detailed study of their physiology could therefore inform us about the very origin of life. The recent discovery of fossils suggesting the existence of multicellular organisms as early as the first peak of oxygenation of the Earth's atmosphere, about 2.1 billion years ago (El Albani et al., 2010, 2014), shows the extent to which the extinction of organisms may have been underestimated in current evolutionary scenarios.

Conclusion

The discoveries of mimiviruses and marseilleviruses have fostered studies on members of the proposed order Megavirales. As these viruses share several genes with cellular organisms, they catalyzed the debate about the definition of viruses and their classification in the living world (Raoult & Forterre, 2008; Raoult, 2009). Indeed, the tree of life was initially based on ribosomal RNA analyses that delineated three branches of life, Eukarya, Bacteria and Archaea, while viruses were not included in this classification because they lack ribosomes (Moreira & López-García, 2009). From the outset, Mimivirus was proposed to represent a fourth branch of life (Raoult et al., 2004). Then, phylogenetic and phyletic analyses of informational genes, involved in nucleotide biosynthesis, transcription and translation (for Mimivirus), revealed a four-branch topology in which Megavirales members stand as a monophyletic group alongside Eukarya, Bacteria and Archaea (Boyer et al., 2010). This proposal remains controversial but is supported by an increasing body of evidence (Williams et al., 2011; Nasir et al., 2012). Studies of the pangenome of the order Megavirales and of its families have shown a substantial amount of lateral gene transfer in the viral genomes, although the core gene set indicates a common ancestral origin. The discovery of very different giant viruses, associated with oblong or spherical particles, also makes it possible to imagine that a phylogenetic and genomic diversity probably as large as that of icosahedral capsid viruses could be hidden in these new types of particles. Most of them have so far been confused with small non-cultivable bacteria. While the discovery of giant viruses has abruptly and dramatically expanded the scope of virology, it has also revived much more fundamental biological questions about the status of viruses: whether they are alive or not, their origin and mode of evolution, and their relationship with the cellular world.

LEGENDS

Figure 1: Phylogenetic tree constructed from the family B DNA polymerase of selected members of the order Megavirales and of Pandoravirus dulcis and P. salinus, using the maximum likelihood method

The numbers at tree nodes indicate bootstrap support from 100 replicates. The line indicates the group of viruses infecting diverse hosts.

Table 1: General viral characteristics of the members of the order Megavirales

Table 2: ORFan classification in the selected members of the order Megavirales

Percentages were calculated relative to the total number of ORFs for each species

Figure 1

Table 1

Table 2

REFERENCES

Agüero, M., Blasco, R., Wilkinson, P., Viñuela, E. (1990). Analysis of naturally occurring deletion variants of African swine fever virus: multigene family 110 is not essential for infectivity or virulence in pigs. Virology, 176, 195–204.

Aherfi, S., Pagnier, I., Fournous, G., Raoult, D., La Scola, B., Colson, P. (2013). Complete genome sequence of Cannes 8 virus, a new member of the proposed family “Marseilleviridae.” Virus Genes, 47(3), 550–555.

Aherfi, S., Colson, P., La Scola, B., Raoult, D. (2016). Giant viruses of amoebas: an update. Frontiers in Microbiology, 7, 349.

Arslan, D., Legendre, M., Seltzer, V., Abergel, C., Claverie, J. M. (2011). Distant Mimivirus relative with a larger genome highlights the fundamental features of Megaviridae. Proceedings of the National Academy of Sciences, 108, 1–6.

Asgari, S., Davis, J., Wood, D., Wilson, P., McGrath, A. (2007). Sequence and organization of the Heliothis virescens ascovirus genome. The Journal of general virology, 88, 1120–1132.

Barker, J., Brown, M. (1994). Trojan horses of the microbial world: protozoa and the survival of bacterial pathogens in the environment. Microbiology, 140(6), 1253–1259.

Bideshi, D. K., Renault, S., Stasiak, K., Federici, B. A., Bigot, Y. (2003). Phylogenetic analysis and possible function of bro-like genes, a multigene family widespread among large double-stranded DNA viruses of invertebrates and bacteria. The Journal of general virology, 84, 2531–2544.

Bigot, Y., Rabouille, A., Sizaret, P. Y., Hamelin, M. H., Periquet, G. (1997). Particle and genomic characteristics of a new member of the Ascoviridae: Diadromus pulchellus ascovirus. The Journal of general virology, 78 (Pt 5), 1139–1147.

Bigot, Yves, Samain, S., Augé-Gouillou, C., Federici, B. A. (2008). Molecular evidence for the evolution of ichnoviruses from ascoviruses by symbiogenesis. BMC Evolutionary Biology, 8, 253.

Boughalmi, M., Saadi, H., Pagnier, I., Colson, P., Fournous, G., Raoult, D., La Scola, B. (2013). High-throughput isolation of giant viruses of the Mimiviridae and Marseilleviridae families in the Tunisian environment. Environmental microbiology, 15, 2000–7.

Boyer, M., Gimenez, G., Suzan-Monti, M., Raoult, D. (2010). Classification and determination of possible origins of ORFans through analysis of nucleocytoplasmic large DNA viruses. Intervirology, 53, 310–320.

Boyer, M., Madoui, M.-A., Gimenez, G., La Scola, B., Raoult, D. (2010). Phylogenetic and Phyletic Studies of Informational Genes in Genomes Highlight Existence of a 4th Domain of Life Including Giant Viruses. PLoS ONE, 5, 8.

Boyer, M., Yutin, N., Pagnier, I., Barrassi, L., Fournous, G., Espinosa, L., Robert, C., Azza, S., Sun, S., Rossmann, M. G., Suzan-Monti, M., La Scola, B., Koonin, E. V., Raoult, D. (2009). Giant Marseillevirus highlights the role of amoebae as a melting pot in emergence of chimeric microorganisms. Proceedings of the National Academy of Sciences of the United States of America, 106, 21848–21853.

Bratke, K.A., McLysaght, A. (2008). Identification of multiple independent horizontal gene transfers into poxviruses using a comparative genomics approach. BMC Evolutionary Biology, 8, 67.

Bubić, I., Wagner, M., Krmpotić, A., Saulig, T., Kim, S., Yokoyama, W. M., Jonjic, S., Koszinowski, U. H. (2004). Gain of Virulence Caused by Loss of a Gene in Murine Cytomegalovirus. Journal of Virology, 78, 7536–7544.

Chen, N., Li, G., Liszewski, M. K., Atkinson, J. P., Jahrling, P. B., Feng, Z., Schriewer, J., Buck, C., Wang, C., Lefkowitz, E. J., Esposito, J. J., Harms, T., Damon, I. K., Roper, R. L., Upton, C., Buller, R. M (2005). Virulence differences between monkeypox virus isolates from West Africa and the Congo basin. Virology, 340, 46–63.

Cheng, C. H., Liu, S. M., Chow, T. Y., Hsiao, Y. Y., Wang, D. P., Huang, J. J., Chen, H. H. (2002). Analysis of the complete genome sequence of the Hz-1 virus suggests that it is related to members of the Baculoviridae. Journal of Virology, 76(18), 9024–9034.

Cheng, X. W., Carner, G. R., Brown, T. M. (1999). Circular configuration of the genome of ascoviruses. The Journal of general virology, 80 ( Pt 6), 1537–1540.

Chinchar, V. G., Hyatt, A., Miyazaki, T., Williams, T. (2009). Family Iridoviridae: poor viral relations no longer. Current Topics in Microbiology and Immunology, 328, 123–70.

Clarke, M., Lohan, A. J., Liu, B., Lagkouvardos, I., Roy, S., Zafar, N., Bertelli, C., Schilde, C., Kianianmomeni, A., Bürglin, T. R., Frech, C., Turcotte, B., Kopec, K. O., Synnott, J. M., Choo, et al. (2013). Genome of Acanthamoeba castellanii highlights extensive lateral gene transfer and early evolution of tyrosine kinase signaling. Genome biology, 14, R11.

Cohen, G., Hoffart, L., La Scola, B., Raoult, D., Drancourt, M. (2011). Ameba-associated Keratitis, France. Emerging infectious diseases, 17(7), 1306–1308.

Colson, P., De Lamballerie, X., Fournous, G., Raoult, D. (2012). Reclassification of Giant Viruses Composing a Fourth Domain of Life in the New Order Megavirales. Intervirology, 55, 321–332.

Colson, P., Fancello, L., Gimenez, G., Armougom, F., Desnues, C., Fournous, G., Yoosuf, N., Million, M., La Scola, B., Raoult, D. (2013). Evidence of the megavirome in humans. Journal of clinical virology, 57(3), 191–200.

Colson, P., Gimenez, G., Boyer, M., Fournous, G., Raoult, D. (2011). The Giant Cafeteria roenbergensis Virus That Infects a Widespread Marine Phagocytic Protist Is a New Member of the Fourth Domain of Life. PLoS ONE, 6, 11.

Colson, P., Pagnier, I., Yoosuf, N., Fournous, G., La Scola, B., Raoult, D. (2013). “Marseilleviridae”, a new family of giant viruses infecting amoebae. Archives of virology, 158, 915–20.

Colson, P., Raoult, D. (2010). Gene repertoire of amoeba-associated giant viruses. Intervirology, 53, 330–343.

Colson, P., Yutin, N., Shabalina, S. A., Robert, C., Fournous, G., La Scola, B., Raoult, D., Koonin, E. V. (2011). Viruses with More Than 1,000 Genes: Mamavirus, a New Acanthamoeba polyphaga mimivirus Strain, and Reannotation of Mimivirus Genes. Genome biology and evolution, 3, 737–742.

Cubonová, L., Sandman, K., Hallam, S. J., Delong, E. F., Reeve, J. N. (2005). Histones in crenarchaea. Journal Of Bacteriology, 187, 5482–5485.

Daubin, V., Ochman, H. (2004). Bacterial genomes as new gene homes: the genealogy of ORFans in E. coli. Genome Research, 14, 1036–1042.

Davids, W., Fuxelius, H. H, Andersson, S. G (2003). The journey to smORFland. Comparative and Functional Genomics, 4(5), 537–541.

De La Vega, I., González, A., Blasco, R., Calvo, V., Viñuela, E. (1994). Nucleotide sequence and variability of the inverted terminal repetitions of African swine fever virus DNA. Virology, 201, 152–156.

De Souza, R. F., Iyer, L. M., Aravind, L. (2010). Diversity and evolution of chromatin proteins encoded by DNA viruses. Biochimica et Biophysica Acta, 1799, 302–318.

Desnues, C., Boyer, M., Raoult, D. (2012). Sputnik, a virophage infecting the viral domain of life. Advances in virus research, 82, 63–89.

Desnues, C., La Scola, B., Yutin, N., Fournous, G., Robert, C., Azza, S., Jardot, P., Monteil, S., Campocasso, A., Koonin, E. V., Raoult, D. (2012). Provirophages and transpovirons as the diverse mobilome of giant viruses. Proceedings of the National Academy of Sciences of the United States of America, 109, 18078–18083.

Dixon, L. K., Bristow, C., Wilkinson, P. J., Sumption, K. J. (1990). Identification of a variable region of the African swine fever virus genome that has undergone separate DNA rearrangements leading to expansion of minisatellite-like sequences. Journal of Molecular Biology, 216, 677–688.

Dunigan, D. D., Fitzgerald, L. A., Van Etten, J. L. (2006). Phycodnaviruses: a peek at genetic diversity. Virus Research, 117, 119–132.

Federici, B. A., Bigot, Y., Hamm, J. J., Granados, R. R., Vlak, J. M., Miller, L. K. (2000). Family Ascoviridae. In Virus Taxonomy: Seventh Report of the International Committee on Taxonomy of Viruses, pp. 261–265. Edited by M. H. V. van Regenmortel, C. M. Fauquet, D. H. L. Bishop, E. B. Carstens, M. K. Estes, S. M. Lemon, J. Maniloff, M. A. Mayo, D. J. McGeoch, C. R. Pringle, R. B. Wickner. San Diego: Academic Press.

Filée, J., Siguier, P., Chandler, M. (2007). I am what I eat and I eat what I am: acquisition of bacterial genes by giant viruses. Trends in Genetics, 23, 10–15.

Filée, J., Pouget, N., Chandler, M. (2008). Phylogenetic evidence for extensive lateral acquisition of cellular genes by Nucleocytoplasmic large DNA viruses. BMC Evolutionary Biology, 8, 320.

Fischer, D., Eisenberg, D. (1999). Finding families for genomic ORFans. Bioinformatics, 15, 759–762.

Fischer, M. G., Allen, M. J., Wilson, W. H., Suttle, C. A. (2010). Giant virus with a remarkable complement of genes infects marine zooplankton. Proceedings of the National Academy of Sciences, 107, 19508–13.

Fischer, M. G., Suttle, C. A. (2011). A virophage at the origin of large DNA transposons. Science, 332, 231–234.

Frost, L. S., Leplae, R., Summers, A. O., Toussaint, A. (2005). Mobile genetic elements: the agents of open source evolution. Nature Reviews Microbiology, 3, 722–732.

Gad, W., Kim, Y. (2008). A viral histone H4 encoded by Cotesia plutellae bracovirus inhibits haemocyte-spreading behaviour of the diamondback moth, Plutella xylostella. The Journal of general virology, 89, 931–938.

Gaia, M., Pagnier, I., Campocasso, A., Fournous, G., Raoult, D., La Scola, B. (2013). Broad spectrum of Mimiviridae allows its isolation using a Mimivirus reporter. PLoS ONE, 8(4).

Gammon, D. B., Gowrishankar, B., Duraffour, S., Andrei, G., Upton, C., Evans, D. H. (2010). Vaccinia Virus–Encoded Ribonucleotide Reductase Subunits Are Differentially Required for Replication and Pathogenesis. PLoS Pathogens, 6, 20.

Ghigo, E., Kartenbeck, J., Lien, P., Pelkmans, L., Capo, C., Mege, J. L., Raoult, D. (2008). Ameobal pathogen mimivirus infects macrophages through phagocytosis. PLoS pathogens, 4, e1000087.

Gubbels, M. J., Vaishnava, S., Boot, N., Dubremetz, J. F., Striepen, B. (2006). A MORN-repeat protein is a dynamic component of the Toxoplasma gondii cell division apparatus. Journal of Cell Science, 119, 2236–2245.

Hammarlund, E., Lewis, M. W., Carter, S. V, Amanna, I., Hansen, S. G., Strelow, L. I., Wong, S. W., Yoshihara, P., Hanifin, J. M., Slifka, M. K. (2005). Multiple diagnostic techniques identify previously vaccinated individuals with protective immunity against monkeypox. Nature Medicine, 11, 1005–1011.

He, J. G., Lü, L., Deng, M., He, H. H., Weng, S. P., Wang, X. H., Zhou, S.Y., Long, Q, X., Wang, X. Z., Chan, S. M. (2002). Sequence analysis of the complete genome of an iridovirus isolated from the tiger frog. Virology, 292, 185–197.

Horn, M., Wagner, M. (2004). Bacterial endosymbionts of free-living amoebae. Journal of Eukaryotic Microbiology, 51, 509–514.

Huang, Y., Huang, X., Liu, H., Gong, J., Ouyang, Z., Cui, H., Cao, J., Zhao, Y., Wang, X., Jiang, Y., Qin, Q. (2009). Complete sequence determination of a novel reptile iridovirus isolated from soft-shelled turtle and evolutionary analysis of Iridoviridae. BMC Genomics, 10, 224.

Hughes, A. L. (2002). Origin and evolution of viral interleukin-10 and other DNA virus genes with vertebrate homologues. Journal of Molecular Evolution, 54, 90–101.

Hughes, A. L., Friedman, R. (2005). Poxvirus genome evolution by gene gain and loss. Molecular Phylogenetics and Evolution, 35, 186–195.

Iyer, L. M., Balaji, S., Koonin, E. V., Aravind, L. (2006). Evolutionary genomics of nucleo-cytoplasmic large DNA viruses. Virus Research, 117, 156–184.

Iyer, L. M., Aravind, L., Koonin, E. V. (2001). Common origin of four diverse families of large eukaryotic DNA viruses. Journal of virology, 75, 11720–11734.

Jakob, N. J., Müller, K., Bahr, U., Darai, G. (2001). Analysis of the first complete DNA sequence of an invertebrate iridovirus: coding strategy of the genome of Chilo iridescent virus. Virology, 286, 182–196.

Jones, E. V, Puckett, C., Moss, B. (1987). DNA-dependent RNA polymerase subunits encoded within the vaccinia virus genome. Journal of Virology, 61, 1765–1771.

Khan, M., La Scola, B., Lepidi, H., Raoult, D. (2007). Pneumonia in mice inoculated experimentally with Acanthamoeba polyphaga mimivirus. Microbial Pathogenesis, 42(2-3), 56–61.

Koonin, E. V., Yutin, N. (2010). Origin and evolution of eukaryotic large nucleo-cytoplasmic DNA viruses. Intervirology, 53, 284–292.

La Scola, B., Desnues, C., Pagnier, I., Robert, C., Barrassi, L., Fournous, G., Merchat, M., Suzan-Monti, M., Forterre, P., Koonin, E. V., Raoult, D. (2008). The Virophage as a Unique Parasite of Giant Mimivirus. Nature, 455(7209), 100–104.

La Scola, B., Audic, S., Robert, C., Jungang, L., de Lamballerie, X., Drancourt, M., Birtles, R., Claverie, J. M., Raoult, D. (2003). A giant virus in amoebae. Science, 299(5615), 2033.

La Scola, B., Campocasso, A., N'Dong, R., Fournous, G., Flaudrops, C., Raoult, D. (2010). Tentative characterization of new environmental giant viruses by MALDI-TOF mass spectrometry. Intervirology, 53, 344–353.

La Scola, B., Marrie, T. J., Auffray, J. P., Raoult, D. (2005). Mimivirus in pneumonia patients. Emerging infectious diseases, 11, 449–452.

Lefkowitz, E. J., Wang, C., Upton, C. (2006). Poxviruses: past, present, and future. Virus Research, 117(1), 105–118.

Legendre, M., Santini, S., Rico, A., Abergel, C., Claverie, J. M. (2011). Breaking the 1000-gene barrier for Mimivirus using ultra-deep genome and transcriptome sequencing. Virology Journal, 8, 99.

Legendre, M., Bartoli, J., Shmakova, L., et al. (2014). Thirty-thousand-year-old relative of giant icosahedral viruses with a pandoravirus morphology. Proceedings of the National Academy of Sciences of the United States of America, 111, 4274–4279.

Levasseur, A., Andreani, J., Delerce, J., Bou Khalil, J., Robert, C., La Scola, B., Raoult, D. (2016). Comparison of a modern and fossil Pithovirus reveals its genetic conservation and evolution. Genome biology and evolution, 8, 2333–2339.

Lubisi, B. A., Bastos, A. D. S., Dwarka, R. M., Vosloo, W. (2007). Intra-genotypic resolution of African swine fever viruses from an East African domestic pig cycle: a combined p72-CVR approach. Virus Genes, 35, 729–735.

McFadden, G. (1995). Viroceptors, Virokines, and Related Immune Modulators Encoded by DNA Viruses. Austin: Landes.

McLysaght, A., Baldi, P. F., Gaut, B. S. (2003). Extensive gene gain associated with adaptive evolution of poxviruses. Proceedings of the National Academy of Sciences of the United States of America, 100, 15655–60.

Moliner, C., Fournier, P. E., Raoult, D. (2010). Genome analysis of microorganisms living in amoebae reveals a melting pot of evolution. FEMS Microbiology Reviews, 34, 281–294.

Monier, A., Pagarete, A., De Vargas, C., Allen, M. J., Read, B., Claverie, J. M., Ogata, H. (2009). Horizontal gene transfer of an entire metabolic pathway between a eukaryotic alga and its DNA virus. Genome Research, 19, 1441–1449.

Moreira, D., Brochier-Armanet, C. (2008). Giant viruses, giant chimeras: The multiple evolutionary histories of Mimivirus genes. BMC Evolutionary Biology, 8, 12.

Moreira, D., López-García, P. (2009). Ten reasons to exclude viruses from the tree of life. Nature Reviews Microbiology, 7, 306–311.

Moss, B. (2001). Poxviridae: the viruses and their replication. In G. DE Fields BN, Knipe DM, Howley PM (Ed.), Fields Virology (pp. 2849–2884). Philadelphia: Williams & Wilkins.

Nasir, A., Kim, K. M., Caetano-Anollés, G. (2012). Giant viruses coexisted with the cellular ancestors and represent a distinct supergroup along with superkingdoms Archaea, Bacteria and Eukarya. BMC Evolutionary Biology, 12, 156.

Pagnier, I., Reteno, D. G., Saadi, H., Boughalmi, M., Gaia, M., Slimani, M., Ngounga, T., Bekliz, M., Colson, P., Raoult, D., La Scola, B. (2013). A decade of improvements in Mimiviridae and Marseilleviridae isolation from amoeba. Intervirology, 56, 354–363.

Pagnier, I., Valles, C., Raoult, D., La Scola, B. (2015). Isolation of Vermamoeba vermiformis and associated bacteria in hospital water. Microbial Pathogenesis, 80, 14–20.

Parakkottil Chothi, M., Duncan, G. A., Armirotti, A., Abergel, C., Gurnon, J. R., Van Etten, J. L., Bernardi, C., Damonte, G., Tonetti, M. (2010). Identification of an l-Rhamnose Synthetic Pathway in Two Nucleocytoplasmic Large DNA Viruses. Journal of Virology, 84, 8829–8838.

Philippe, N., Legendre, M., Doutre, G., Couté, Y., Poirot, O., Lescot, M., Arslan, D., Seltzer, V., Bertaux, L., Bruley, C., Garin, J., Claverie, J. M., Abergel, C. (2013). Pandoraviruses: amoeba viruses with genomes up to 2.5 Mb reaching that of parasitic eukaryotes. Science, 341, 281–286.

Pires, S., Ribeiro, G., Costa, J. V. (1997). Sequence and organization of the left multigene family 110 region of the Vero-adapted L60V strain of African swine fever virus. Virus Genes, 15, 271–274.

Raoult, D., Forterre, P. (2008). Redefining viruses: lessons from Mimivirus. Nature Reviews Microbiology, 6, 315–319.

Raoult, D. (2009). There is no such thing as a tree of life (and of course viruses are out!). Nature Reviews Microbiology.

Raoult, D., Audic, S., Robert, C., Abergel, C., Renesto, P., Ogata, H., La Scola, B., Suzan, M., Claverie, J. M. (2004). The 1.2-megabase genome sequence of Mimivirus. Science, 306, 1344–1350.

Raoult, D., Boyer, M. (2010). Amoebae as genitors and reservoirs of giant viruses. Intervirology, 53, 321–329.

Reteno, D. G., Benamar, S., Bou Khalil, J., Andreani, J., Armstrong, N., Klose, T., Rossmann, M., Colson, P., Raoult, D., La Scola, B. (2015). Faustovirus, an asfarvirus-related new lineage of giant viruses infecting amoebae. Journal of Virology, 89, 6585–6594.

Benamar, S., Reteno, D. G., Bandaly, V., Labas, N., Raoult, D., La Scola, B. (2016). Faustoviruses: Comparative Genomics of New Megavirales Family Members. Frontiers in Microbiology, 7, 3.

Yashina, S., Gubin, S., Maksimovich, S., et al. (2012). Regeneration of whole fertile plants from 30,000-y-old fruit tissue buried in Siberian permafrost. Proceedings of the National Academy of Sciences of the United States of America, 109, 4008–4013.

Chapter-3

MimiLook: A Phylogenetic Workflow for Detection of Gene Acquisition in Major Orthologous Groups of Megavirales

Supplementary Materials (included in thesis)

Figure S1: Mapping OGs and evolutionary scenarios on reference tree.

Figure S2: An example output of statistically validated HGT as predicted by T-REX and then verified by human expertise.

Table S1: 86 Megavirale ORFomes used in this study

Externally hosted supplementary file 1
DOI: 10.6084/m9.figshare.4653622
https://figshare.com/s/b3aea00d41a9dce4c5a9
Description: Workflow script available on this link

Supplementary Materials (not included due to size constraints)

The following are available online at www.mdpi.com/1999-4915/9/4/72/s1:

Table S2: Megavirale genes grouped into clusters of orthologous groups

Table S3: Presence/absence matrix of orthologous groups in the respective Megavirale species

Table S4: Predicted evolutionary scenarios in OGs of Megavirales

Externally hosted supplementary file 2
DOI: 10.6084/m9.figshare.4645273
https://figshare.com/s/c167f72d6e4613dfd3e6
Description: Phylogenetic trees of clustered proteins (OGs) available on this link

Figure S1. Mapping OGs and evolutionary scenarios on reference tree. Family-specific nodes are labeled A–J and internal nodes are labeled K–S. First, OGs present at family-specific nodes (A–J) are plotted on the reference tree, followed by OGs present at internal nodes (K–S) (shown in the figure label). After the OGs have been distributed, evolutionary scenarios are tagged using the OGs present at each node.
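The placement of OGs on nodes described in this legend can be sketched in Perl as follows. This is only an illustration under stated assumptions: the three-family tree, the OG names and the rule of assigning each OG to the deepest node shared by all families that contain it are hypothetical choices made for the example, and are not taken from the MimiLook script.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Toy reference tree stored as child => parent. Leaves A, B, C stand for
# family-specific nodes and K, L for internal nodes; the topology is invented.
my %parent = ( A => 'K', B => 'K', K => 'L', C => 'L' );

# Hypothetical OG distribution: the families in which each OG is present.
my %og_families = (
    OG1 => ['A'],
    OG2 => ['A', 'B'],
    OG3 => ['B', 'C'],
);

# List of nodes from a given node up to the root, deepest first.
sub path_to_root {
    my ($node) = @_;
    my @path = ($node);
    while (exists $parent{$node}) {
        $node = $parent{$node};
        push @path, $node;
    }
    return @path;
}

# Deepest node shared by all listed families: the family node itself when
# the OG is family-specific, otherwise an internal node.
sub node_for_families {
    my (@families) = @_;
    return undef unless @families;
    my @shared = path_to_root(shift @families);
    for my $family (@families) {
        my %on_path = map { $_ => 1 } path_to_root($family);
        @shared = grep { $on_path{$_} } @shared;
    }
    return $shared[0];
}

my %node_of = map { $_ => node_for_families(@{ $og_families{$_} }) } keys %og_families;

# OGs present in fewer families are listed first, so family-specific OGs
# precede those placed on internal nodes.
for my $og (sort { @{ $og_families{$a} } <=> @{ $og_families{$b} } or $a cmp $b } keys %node_of) {
    print "$og\t$node_of{$og}\n";
}
```

With the toy data, OG1 stays on family node A, OG2 is placed on internal node K and OG3 on the root node L. A real run would read the family assignments from the presence/absence matrix (Table S3) rather than from a hard-coded hash.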

Figure S2. An example output of statistically validated HGT as predicted by T-REX and then verified by human expertise. In both trees, each tip label is the TaxID of the corresponding protein sequence. The scale bar denotes substitutions per site. (A) HGTs detected by T-REX, marked on the species tree. Pink arrows denote the direction of transfer. A total of four HGTs were detected in this particular tree but, as we are concerned only with transfers involving MVs, only the transfer denoted as 23 was selected in our filtered output 5. (B) Gene tree prepared from the BLAST output of a single OG (midpoint-rooted and ladderized in descending order). Here we can confirm the HGT event from the common ancestor of Eutheria (belonging to Amniota) to Megavirale (Poxviridae) (denoted by an arrow); this workflow thus provides a way to detect putative donors and putative recipients in an HGT event.
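The filtering step mentioned in panel (A), in which only transfers involving MVs are retained, can be sketched as below. The record layout is hypothetical and does not reproduce the T-REX output format; only transfer 23 (Eutheria to Poxviridae) comes from the legend, the other three transfers and the partial list of Megavirales lineages are invented for the example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical list of detected transfers. Only transfer 23 (Eutheria to
# Poxviridae) comes from the legend; the other three records are invented.
my @transfers = (
    { id => 21, donor => 'Bacteria',      recipient => 'Fungi'      },
    { id => 22, donor => 'Archaea',       recipient => 'Bacteria'   },
    { id => 23, donor => 'Eutheria',      recipient => 'Poxviridae' },
    { id => 24, donor => 'Viridiplantae', recipient => 'Metazoa'    },
);

# Lineages treated as Megavirales in this sketch (partial, illustrative list).
my %is_megavirales = map { $_ => 1 }
    qw(Poxviridae Mimiviridae Marseilleviridae Phycodnaviridae
       Iridoviridae Ascoviridae Asfarviridae);

# Role played by Megavirales in a transfer ('recipient' or 'donor'),
# or undef when neither partner belongs to the Megavirales.
sub megavirales_role {
    my ($transfer) = @_;
    return 'recipient' if $is_megavirales{ $transfer->{recipient} };
    return 'donor'     if $is_megavirales{ $transfer->{donor} };
    return undef;
}

# Keep only the transfers that involve a Megavirales lineage.
my @kept = grep { defined megavirales_role($_) } @transfers;
for my $t (@kept) {
    printf "HGT %d: %s -> %s (Megavirales as %s)\n",
        $t->{id}, $t->{donor}, $t->{recipient}, megavirales_role($t);
}
```

Of the four example transfers, only transfer 23 is kept, with Megavirales as the recipient. The retained events would still need to be checked against the gene tree, as done in panel (B).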

Table S1. 86 Megavirale ORFomes used in this study (the ORFomes can be downloaded from ftp://ftp.ncbi.nih.gov/genomes/Viruses/).

Externally hosted supplementary file 1

Perl code for MimiLook

#!/usr/bin/perl -w
# Please execute the next line before running this program:
###export PATH=~/anaconda_ete/bin:$PATH
use strict;
use warnings;
use File::chdir;

$CWD = '/home/olivier/ARUP/FASTAFOLDER'; # Main working directory is set here and can be changed accordingly
open (LOGFILE,">logfile.txt"); ## This file will save all log records
print LOGFILE "GROUP\tNumber_of_genomes\tSpecies_specific(Y/N)\tLongest_protein_id\tGenome_of_longest_protein\tLength_of_longest_proteins\tNumber_of_significant_blast_hits\tbacterial_hit\tarchea_hits\tothervirus_hits\teukaryotic_hits\tfungi_hits\tMegavirus_hits\tUnknown_count\tTree_generated(Y/N)\n";

#print "Please enter your preferred identity such as 40\n";
#chomp (my $target_identity=<STDIN>);
my $target_identity=50.00;
#print "Please enter your preferred evalue such as 0.0004\n";
#chomp (my $target_evalue=<STDIN>);
my $target_evalue=.0001;
#print "Please enter your preferred query coverage such as 60\n";
#chomp (my $target_q_coverage=<STDIN>);
my $target_q_coverage=70.00;
print "Your inputs are Identity=$target_identity\tE-value=$target_evalue\tQuery Coverage=$target_q_coverage\n\n";

#create_folder();   ## Creates the folders needed to save outputs ### Comment this line if the folders have been created previously
runningOrthoMcl();  ## To run OrthoMCL ### Comment this line if clustering was done previously
distance_tree();    ## To create distance matrix based tree ### Comment this line if you don't want to create distance matrix based tree
#running_BlastP();  ## To run BlastP ### Comment this line if Blast was done separately
blast_filtering();  ## To filter Blast result ### Comment this line if you don't want to create phylogenetic trees
detect_HGT();       ## To run T-Rex ### Comment this line if you don't want to run T-Rex algorithm

sub create_folder
{
    qx(mkdir fasta_files/);       ## Folder for fasta sequences of all homologous sequences of the test protein
    qx(mkdir header_files/);      ## Folder for header information of all homologous sequences of the test protein
    qx(mkdir alignment_files/);   ## Folder for alignments of all homologous sequences of the test protein
    qx(mkdir annotation_files/);  ## Folder for taxonomic lineages of all homologous sequences of the test protein
    qx(mkdir tree_files/);        ## Folder for phylogenetic trees generated from the alignments
    #qx(mkdir hits_in_bacteria/); ## Folder for trees of test proteins with at least one hit in the bacterial domain
    #qx(mkdir hits_in_archeae/);  ## Folder for trees of test proteins with at least one hit in the archaeal domain
    #qx(mkdir hits_in_euk/);      ## Folder for trees of test proteins with at least one hit in the eukaryotic domain
    qx(mkdir hits_in_megavirus/); ## Folder for trees of test proteins that are megavirus specific
    qx(mkdir DIST_MATRIX/);
    qx(mkdir TREX/);              ## Folder for T-Rex results
    qx(mkdir TREX/HGT_TREES/);
    qx(mkdir TREX/NONHGT_TREES/);
    #qx(mkdir ORTHOOUTFOLDER);    ## Folder for OrthoMCL results
}

sub runningOrthoMcl
{
    #**************** Module for running OrthoMCL ****************#
    #+++ This module will run the OrthoMCL gene clustering algorithm and choose a representative sequence for BlastP analysis +++#
    print "Running OrthoMCl\n";
    open (BLASTPRO,">proteins_for_blast.txt");
    open (PROLIST,">protein_list.txt");
    open (ORTHOSTAT,">orthomcl_stats.txt");
    print ORTHOSTAT "GROUP\tNumber_of_genomes\tSpecies_specific(Y/N)\tLongest_protein_id\tGenome_of_longest_protein\tLength_of_longest_proteins\n";
    # To run OrthoMCL
    #print "Running OrthoMCL\n";
    #qx(rm -r ORTHOOUTFOLDER/);
    #qx(mkdir ORTHOOUTFOLDER/);
    #qx(orthomcl.pl proteome_test ORTHOOUTFOLDER localhost olivier Qotot11);

    #$CWD = '/home/olivier/ARUP/FASTAFOLDER/ORTHOOUTFOLDER/';
    #qx(bin/startOrthocml.sh);
    #$CWD = '/home/olivier/ARUP/FASTAFOLDER';
    open (GROUPFILE,"ORTHOOUTFOLDER/groups.txt") || print "cannot open group file from orthoMCL\n";
    my @Orthoquery_protein=<GROUPFILE>;
    close GROUPFILE;
    chomp @Orthoquery_protein;
    # To select the longest sequence from each group and store it in "proteins_for_blast.txt"
    foreach my $i(0..10) # test limit (first 11 groups only); use 0..$#Orthoquery_protein for a full run
    {
        my @line=split(/\s+/,$Orthoquery_protein[$i]);
        chop $line[0]; # remove the last character of the group label
        my @seq=(); my @genomes=(); my @id=(); my @sorted_len=(); my @protein_id=(); my @len=();
        foreach my $i2(1..$#line)
        {
            my $genome=''; my $gene='';
            ($genome,$gene)=split(/\|/,$line[$i2]);
            #print "\n\n\n$genome\t$gene\n";
            open (INPUTFILE,"proteome_test/$genome.fasta") or die "cannot open $genome.fasta\n";
            my @genome_query_protein=<INPUTFILE>;
            close INPUTFILE;
            chomp @genome_query_protein;
            foreach my $j(0..$#genome_query_protein)
            {
                if ($genome_query_protein[$j]=~/\|$gene\|/)
                {
                    #print "ok\n";
                    my $k=$j+1;
                    my $protein_seq='';
                    until(($genome_query_protein[$k]=~/^>/) || ($k >= $#genome_query_protein))
                    {
                        $protein_seq.=$genome_query_protein[$k];
                        $k++;
                    }
                    my $protein_length= length($protein_seq);
                    push(@genomes,$genome);
                    push (@protein_id,$gene);
                    push (@len, $protein_length);
                    push(@seq, $protein_seq);
                    last;
                }
            }
        }
        # Counting the number of genomes in each COG
        my @uniq_genomes=uniq(@genomes);
        my $genome_count=scalar(@uniq_genomes);
        if($genome_count==1){print ORTHOSTAT "$line[0]\t$genome_count\tY\t";}
        elsif($genome_count>1){print ORTHOSTAT "$line[0]\t$genome_count\tN\t";}
        @sorted_len=sort{$b<=>$a}@len;
        my ($index) = grep { $len[$_] == $sorted_len[0]} (0 .. (@len-1));
        print BLASTPRO ">$line[0]|$protein_id[$index]\n$seq[$index]\n";
        print PROLIST "$line[0]\n";
        print ORTHOSTAT "$protein_id[$index]\t$genomes[$index]\t$sorted_len[0]\n";
    }
    close BLASTPRO;
    close PROLIST;
    print ORTHOSTAT "\n\n";
}

sub running_BlastP
{
    #**************** Module for running BlastP algorithm ****************#
    #+++ This module will run BlastP against the NCBI NR database +++#
    print "Running Blast\n";
    qx(/home/olivier/ncbi-blast-2.2.28+/bin/blastp -db /home/olivier/nr/nr -query proteins_for_blast.txt -evalue 1e-04 -out blastout.txt -max_hsps_per_subject 1 -num_threads 10 -outfmt "7 qseqid sseqid pident stax qlen evalue bitscore score length pident positive qcovhsp staxids");
}

sub blast_filtering
{

    #**************** Module for filtering Blast result and generating phylogenetic trees ****************#
    #+++ This module will filter the Blast result and generate phylogenetic trees for subsequent analysis +++#
    print "Blast Filtering\n";
    open (BLASTOUT,"blastout.txt");
    my @blast_out=<BLASTOUT>;
    close BLASTOUT;
    chomp @blast_out;
    open (PROLIST,"protein_list.txt");
    my @query_protein=<PROLIST>;
    close PROLIST;
    chomp @query_protein;
    open (TREELIST,">TREX/nonmegaviral_tree_list.txt");
    open (FASTALIST,">fasta_files/fasta_list.txt");
    open (BLASTSTAT,">blasthit_distribution.txt");
    print BLASTSTAT "GROUP\tNumber_of_significant_blast_hits\tbacterial_hit\tarchea_hits\tothervirus_hits\teukaryotic_hits\tfungi_hits\tMegavirus_hits\tUnknown_count\tTree_generated(Y/N)\n";
    foreach my $i1(0..$#query_protein) ## Controls the number of query proteins for which trees will be generated
    {
        print "Now working on $query_protein[$i1]\n";
        my @all_gi=();
        foreach my $i2 (0..$#blast_out) ## To get the GI ids of significant hits
        {
            if ($blast_out[$i2]=~/^$query_protein[$i1]\|\d+/)
            {
                my @line=split(/\t/,$blast_out[$i2]);
                my $identity=$line[2];
                my $q_coverage=$line[9];
                if(($identity > $target_identity) && ($q_coverage > $target_q_coverage)) ## To get GI ids that match the criteria
                {
                    $line[1]=~/gi\|(\d\d*)\|/;
                    my $GI=$1;
                    push (@all_gi, $GI);
                }
            }
        }
        if($#all_gi>=0)
        {
            my @uniq_gi=uniq(@all_gi); # To remove redundant GI ids in the list, if any
            my $l=scalar(@uniq_gi);
            if($l>=2)
            {
                print LOGFILE "Your query id $query_protein[$i1] matched to $l unique number of hits with GI numbers @uniq_gi\n";
                open (OUT1,">fasta_files/$query_protein[$i1].faa");      # Stores fasta sequences of each GI id
                open (OUT2,">header_files/$query_protein[$i1].header");  # Stores header information of the hits
                foreach my $i3(0..$#uniq_gi)
                {
                    ## Look up identity and taxon id for this GI in the Blast output
                    my $gi_identity=''; my $gi_taxon='';
                    foreach my $i4(0..$#blast_out)
                    {
                        if (($blast_out[$i4]=~/^$query_protein[$i1]\|\d+/) && ($blast_out[$i4]=~/gi\|$uniq_gi[$i3]\|/))
                        {
                            my @line=split(/\t/,$blast_out[$i4]);
                            $gi_identity=$line[2];         # Identity value corresponding to the GI id
                            my @aa=split(/\;/,$line[10]);  # Taxonomic id corresponding to the GI id
                            $gi_taxon=$aa[0];
                            last;
                        }
                    }
                    ## For retrieving sequences with blastdbcmd
                    my @header=qx(/home/olivier/ncbi-blast-2.2.28+/bin/blastdbcmd -db /home/olivier/nr/nr -dbtype prot -entry $uniq_gi[$i3] -outfmt '%f');
                    chomp @header;
                    if($header[0]=~/\w+/) # To check whether the header was retrieved or not
                    {
                        print OUT2 "GI::$uniq_gi[$i3]\|::$gi_taxon\|Header::$header[0]\n"; ## header
                        print OUT1 ">$uniq_gi[$i3]|$gi_identity|$gi_taxon\n";              ## sequence
                        foreach my $xx(1..$#header)
                        {
                            print OUT1 "$header[$xx]\n";
                        }
                    }
                    else
                    {
                        print LOGFILE "Header and sequence information cannot be retrieved for hit GI ID $uniq_gi[$i3] in query id $query_protein[$i1]\n";
                    }
                }
                close OUT1;
                close OUT2;
                # For filtering redundant sequences

                open (OUT3,">fasta_files/$query_protein[$i1].new");
                open (FILE,"fasta_files/$query_protein[$i1].faa");
                print FASTALIST "$query_protein[$i1]\n";
                my @file=<FILE>;
                close FILE;
                chomp @file;
                # Note: the basic idea of this filtering step is to find whether multiple hits
                # are present from a single genome, i.e. multiple proteins with the same taxon id.
                # If so, this step removes all the redundant proteins except the one with the highest identity value.
                my @taxon_id=();
                my $sp_names='';
                foreach my $j (0..$#file)
                {
                    if($file[$j]=~/^>/)
                    {
                        $file[$j]=~/\|(\d+)$/;
                        push (@taxon_id,$1);
                    }
                }
                my @uniq_taxon_id=uniq(@taxon_id);
                my $hit_number=scalar(@uniq_taxon_id);
                foreach my $ii(0..$#uniq_taxon_id)
                {
                    my @identity=(); my @prt=();
                    $sp_names.=$uniq_taxon_id[$ii]." ";
                    foreach my $b (0..$#file)
                    {
                        if(($file[$b]=~/^>/) && ($file[$b]=~/\|$uniq_taxon_id[$ii]$/))
                        {
                            $file[$b]=~/\|(.*)\|(\d+)$/;
                            my $aa='';
                            push(@identity,$1); ## Stores all the identity values associated with a single taxon id
                            my $c=$b+1;
                            until (($file[$c]=~/^>/) || ($c == ($#file+1)))
                            {
                                $aa.=$file[$c];
                                $c++;
                            }
                            push (@prt,$aa); ## Stores all the proteins associated with a single taxon id
                        }
                    }
                    my @sorted_identity=sort {$b<=>$a} @identity;
                    ## Get the index of the largest identity value and print the corresponding protein
                    my( $index )= grep { $identity[$_] eq $sorted_identity[0] } 0..$#identity;
                    print OUT3 ">$uniq_taxon_id[$ii]\n$prt[$index]\n";
                }
                ## For taxonomic annotation from ETE3
                open (OUT4,">annotation_files/$query_protein[$i1].annotation");
                my $annotation=qx(ete3 ncbiquery --search $sp_names --info);
                print OUT4 "$annotation\n";
                # $hit_number is used here because "$#uniq_taxon_id+1" is not evaluated inside a string
                print LOGFILE "Your query id $query_protein[$i1] matched to $hit_number unique number e.g. @uniq_taxon_id\n";
                close OUT3;
                close OUT4;

                ### Classification of COG groups according to their Blast hit distribution
                open (ANNOTATIONS,"annotation_files/$query_protein[$i1].annotation") || print "sorry cannot open $query_protein[$i1].annotation\n";
                my @annotation_file=<ANNOTATIONS>;
                close ANNOTATIONS;
                chomp @annotation_file;
                open (MEGAV,"megaviral_ids.txt") || print "sorry cannot open megaviral_ids.txt\n"; ## This file contains taxonomic ids of all megavirus genomes
                my @megafile=<MEGAV>;
                close MEGAV;
                chomp @megafile;
                my $total_genome_count=0; my $bacteria_count=0; my $euk_count=0; my $archea_count=0;
                my $virus_count=0; my $megavirus_count=0;
                my $unclassified=0; # was "my $unclassified++", which started the counter at 1
                foreach my $i2(0..$#annotation_file)
                {
                    if($annotation_file[$i2]=~/^\d+/)
                    {
                        $total_genome_count++;
                        my @line=split(/\t/,$annotation_file[$i2]);
                        foreach my $i5(0..$#megafile) ## To count the number of megavirus genomes in the file
                        {
                            my @mega_id=split(/\t/,$megafile[$i5]);
                            if("$mega_id[0]" == "$line[0]")
                            {
                                $megavirus_count++;
                            }
                        }
                        ## To check other types of genomes in the file.
                        ## Patterns follow the NCBI taxonomy spellings "Eukaryota" and "Archaea";
                        ## the earlier patterns /Eukaryote/ and /Archeae/ cannot match those names.
                        if($line[3]=~/Bacteria/i){$bacteria_count++;}
                        elsif($line[3]=~/Eukaryot/i){$euk_count++;}
                        elsif($line[3]=~/Archaea/i){$archea_count++;}
                        elsif($line[3]=~/Virus/i){$virus_count++;}
                        else{$unclassified++;}
                    }
                }
                print BLASTSTAT "$query_protein[$i1]\t$bacteria_count\t$archea_count\t$virus_count\t$euk_count\t$megavirus_count\t$unclassified\n";
                ## To create alignment and phylogenetic trees

                if($hit_number>=3) ## To check if there is a sufficient number of hits
                {
                    ## Alignment with Muscle
                    system ("muscle -quiet -in fasta_files/$query_protein[$i1].new -out alignment_files/$query_protein[$i1].aln");
                    if(-z "alignment_files/$query_protein[$i1].aln")
                    {
                        print LOGFILE "CHECK\nCHECK\nCHECK\nProblem in alignment query_protein for home/olivier/ARUP/FASTAFOLDER/alignment_files/$query_protein[$i1].aln\n";
                    }
                    # To create the species tree from the taxon ids through the ete3 toolkit
                    system ("ete3 ncbiquery --search $sp_names --tree |ete3 annotate >tree_files/$query_protein[$i1].nwk");
                    if(-z "tree_files/$query_protein[$i1].nwk")
                    {
                        print LOGFILE "CHECK\nCHECK\nCHECK\nProblem in SPECIES Tree query_protein for $query_protein[$i1].nwk\n";
                    }
                    else
                    {
                        system ("perl -p -i -e 's/[A-Z].*? \- //g' tree_files/$query_protein[$i1].nwk");
                    }
                    # To create the gene tree of the retrieved fasta sequences with FastTree (appended to the same file)
                    system ("/home/olivier/FastTree -quiet -nopr alignment_files/$query_protein[$i1].aln >> tree_files/$query_protein[$i1].nwk");
                    if(-z "tree_files/$query_protein[$i1].nwk")
                    {
                        print LOGFILE "CHECK\nCHECK\nCHECK\nProblem in Gene Tree query_protein for $query_protein[$i1].nwk\n";
                    }
                    ## To copy megavirus specific trees
                    if($total_genome_count == $megavirus_count)
                    {
                        qx(cp tree_files/$query_protein[$i1].nwk hits_in_megavirus/$query_protein[$i1].nwk);
                    }
                    # To copy the other trees for T-Rex analysis
                    else
                    {
                        print TREELIST "$query_protein[$i1]\n";
                        qx(cp tree_files/$query_protein[$i1].nwk TREX/$query_protein[$i1].nwk);
                    }
                }
                else
                {
                    print LOGFILE "Tree is not generated for $query_protein[$i1] due to insufficient number of hits = $hit_number\n";
                }
            }
            else
            {
                print LOGFILE "Your query id $query_protein[$i1] discarded from analysis due to insufficient number of hits= $l, GI ids @uniq_gi\n";
            }
        }
        else
        {
            print LOGFILE "Your query id $query_protein[$i1] discarded from analysis because no significant hit was found in NCBI nr database \n";
        }

    }
    close TREELIST;
    close FASTALIST;
}

sub distance_tree
{
    #**************** Module for construction of distance matrix based tree ****************#
    #+++ For each group this module will first remove paralogs by selecting the longest sequence for each genome;
    # generate alignments in phylip format using muscle; create distance matrices through protdist (phylip);
    # generate a super matrix from these distance matrices using SDM (ATGC); and lastly, from that super matrix,
    # generate a distance matrix based tree using fastme (ATGC) +++#
    # Answer file piped to protdist; opened for writing so that the "Y" response is actually stored
    open (FT,">/home/olivier/ARUP/phylip-3.696/exe/tt");
    print FT "Y\n";
    close FT;
    print "Now generating distance matrix based tree\n\n";
    open (F1,"ORTHOOUTFOLDER/groups.txt");
    my @group_file=<F1>;
    close F1;
    chomp @group_file;
    my $count=0;
    foreach my $i(0..10) # test limit (first 11 groups only); use 0..$#group_file for a full run
    {
        $count++;
        my @line=split(/\s+/,$group_file[$i]);
        chop $line[0];
        open (FH,">DIST_MATRIX/$line[0].fasta"); ## To store the sequences of the members of each group
        my @genomes=();
        foreach my $i1(1..$#line)
        {
            my $genome=''; my $gene='';
            ($genome,$gene)=split(/\|/,$line[$i1]);
            push(@genomes,$genome);
        }
        my @uniq_genomes=uniq(@genomes);
        ## This portion removes paralogs from each group.
        ## If multiple members are found from a single genome, the longest sequence is selected and stored in the corresponding sequence file.
        foreach my $i2(0..$#uniq_genomes)
        {
            my @genes=$group_file[$i]=~/$uniq_genomes[$i2]\|(\S+)/g;
            #print "$line[0]|$uniq_genomes[$i2]|@genes\n";
            open (INPUTFILE,"proteome_test/$uniq_genomes[$i2].fasta") or die "cannot open $uniq_genomes[$i2].fasta\n";
            my @genome_file=<INPUTFILE>;
            close INPUTFILE;
            chomp @genome_file;
            my @protein_id=(); my @len=(); my @seq=();
            foreach my $i3(0..$#genes)
            {
                my $protein_seq='';
                foreach my $i4(0..$#genome_file)
                {
                    if($genome_file[$i4]=~/gi\|$genes[$i3]\|/)
                    {
                        my $k=$i4+1;
                        until(($genome_file[$k]=~/^>/) || ($k >= $#genome_file))
                        {
                            $protein_seq.=$genome_file[$k];
                            $k++;
                        }
                    }
                }
                my $protein_length= length($protein_seq);
                push (@protein_id,$genes[$i3]);
                push (@len, $protein_length);
                push(@seq, $protein_seq);
            }
            my @sorted_len=sort{$b<=>$a}@len;
            my ($index) = grep { $len[$_] == $sorted_len[0]} (0 .. (@len-1));
            print FH ">$uniq_genomes[$i2]\n$seq[$index]\n";
        }
        close FH; # flush the sequences to disk before muscle reads the file
        ## To align the members of each group using the muscle alignment tool, in phylip format
        system ("muscle -quiet -in DIST_MATRIX/$line[0].fasta -phyiout DIST_MATRIX/$line[0].phylip");
        if(-z "DIST_MATRIX/$line[0].phylip")
        {
            print LOGFILE "NO phylip alignment is generated for $line[0].phylip\n";
        }
        ## To generate the distance matrix for each group using protdist in phylip
        system ("cp DIST_MATRIX/$line[0].phylip infile");
        system ("/home/olivier/ARUP/phylip-3.696/exe/protdist < /home/olivier/ARUP/phylip-3.696/exe/tt");
        system ("mv outfile DIST_MATRIX/$line[0].dist");
    }
    ## To generate a supermatrix using SDM software from all the distance matrices generated by protdist
    system ("cat DIST_MATRIX/*.dist > DIST_MATRIX/dist_matrix.txt");
    #system ("sed -i -e '1s/^/$count\n\n/' DIST_MATRIX/dist_matrix.txt");
    #system ("java -jar /home/olivier/ARUP/SDM/SDM.jar -i DIST_MATRIX/dist_matrix.txt -m $count");
    ## To generate a concatenated tree from the SDM super matrix using FastME
    #system ("/home/olivier/ARUP/fastme-2.1.5/src/fastme -i DIST_MATRIX/dist_matrix.out");
}

sub detect_HGT
{
    print "Searching for HGT\n";
    #**************** Module for detecting HGT by T-Rex algorithm ****************#
    #+++ This module will run the T-Rex algorithm and list all HGTs with selected species +++#
    open (TREELIST,"TREX/nonmegaviral_tree_list.txt");
    my @tree_file=<TREELIST>;
    close TREELIST;
    chomp @tree_file;
    open (MEGAV,"megaviral_ids.txt") || print "sorry cannot open megaviral_ids.txt\n"; ## This file contains taxonomic ids of all megavirus genomes
    my @megafile=<MEGAV>;
    close MEGAV;
    chomp @megafile;

    open (RESULT,">TREX/myhgts.txt");
    print RESULT "Identifier\tNumber\tspecies_in_source_subtree\tspecies_in_receipient_subtree\ttwo species tree branches affected by this HGT\n";
    foreach my $i1(0..$#tree_file)
    {
        my $hgt=0;
        open (ANNOTATIONS,"annotation_files/$tree_file[$i1].annotation") || print "sorry cannot open $tree_file[$i1].annotation\n";
        my @annotation_file=<ANNOTATIONS>;
        close ANNOTATIONS;
        chomp @annotation_file;
        my %lineage=(); my %names=();
        foreach my $i2(0..$#annotation_file)
        {
            if($annotation_file[$i2]=~/^\d+/)
            {
                my @line=split(/\t/,$annotation_file[$i2]);
                $names{$line[0]} = "$line[1]";
                $lineage{$line[0]}="$line[2]";
            }
        }
        qx(/home/olivier/ARUP/hgt/hgt -inputfile=TREX/$tree_file[$i1].nwk);
        qx(mv results.txt TREX/$tree_file[$i1].out);
        open (F2,"TREX/$tree_file[$i1].out") || print "Can not open $tree_file[$i1].out\n";
        my @file2=<F2>;
        close F2;
        chomp @file2;
        foreach my $j(0..$#file2)
        {
            if(($file2[$j]=~/^HGT\s\d/) && ($file2[$j]!~/Trivial/))
            {
                my @ids=();
                my $source_lineage=''; my $source_name='';
                my $recepient_lineage=''; my $recepient_name='';
                my $source_megavirus_count=0; my $recepient_megavirus_count=0;
                my $raw=$file2[$j+3];
                my @source_subtree=split(/, /, $file2[$j+1]); ## Taxon ids within the source subtree
                foreach (@source_subtree)
                {
                    # To check if any taxon id in the source lineage is a megavirus
                    foreach my $i5(0..$#megafile)
                    {
                        my @mega_id=split(/\t/,$megafile[$i5]);
                        if("$mega_id[0]" == "$_")
                        {
                            $source_megavirus_count++;
                        }
                    }
                    $source_lineage.=$lineage{$_};
                    $source_lineage.=',';
                    $source_name.=$names{$_};
                    $source_name.=',';
                }
                my @recepient_subtree=split(/, /, $file2[$j+2]); ## Taxon ids within the recipient subtree
                foreach (@recepient_subtree)
                {
                    $recepient_lineage.=$lineage{$_};
                    # To check if any taxon id in the recipient lineage is a megavirus
                    foreach my $i5(0..$#megafile)
                    {
                        my @mega_id=split(/\t/,$megafile[$i5]);
                        if("$mega_id[0]" == "$_")
                        {
                            $recepient_megavirus_count++;
                        }
                    }
                    $recepient_lineage.=',';
                    $recepient_name.=$names{$_};
                    $recepient_name.=',';
                }
                if(($source_megavirus_count>0) || ($recepient_megavirus_count>0))
                {
                    $hgt++;
                    print RESULT "$tree_file[$i1]\tHGT\t@source_subtree\t$source_name\t$source_lineage\t@recepient_subtree\t$recepient_name\t$recepient_lineage\t$raw\n";
                }
            }
        }
        if($hgt>0)
        {
            system ("mv TREX/$tree_file[$i1].nwk TREX/HGT_TREES/$tree_file[$i1].nwk")
        }
        else
        {
            system ("mv TREX/$tree_file[$i1].nwk TREX/NONHGT_TREES/$tree_file[$i1].nwk")
        }
    }
}

sub uniq{
    for (my $i=$#_;$i>=0;$i--)
    {
        my @rest = @_;
        my $test = splice(@rest,$i,1);
        if (grep($_ eq $_[$i],@rest)){@_ = @rest};
    }
    return @_;
}


Chapter-4

Contribution of horizontal gene transfers in evolution of family specific genome mosaicism of Megavirales

(Manuscript to be submitted)

Target journal: Viruses special issue “Viruses of Microbes”


Type of the Paper (Article)

Contribution of horizontal gene transfers in evolution of family specific genome mosaicism of Megavirales

Sourabh Jain1,2, Philippe Colson2, Didier Raoult2, Pierre Pontarotti1*

1 Aix-Marseille Université, Ecole Centrale de Marseille, I2M UMR 7373, CNRS équipe Evolution Biologique et Modélisation, Marseille, France; [email protected] 2 URMITE, Aix Marseille Université, UM63, CNRS 7278, IRD 198, INSERM 1095, IHU - Méditerranée Infection, AP-HM, 19-21 boulevard Jean Moulin, 13005 Marseille, France; [email protected], [email protected] * Correspondence: [email protected]

Abstract: Apart from small subsets of defined ancestral genes and vertically inherited core genes, the origin and evolution of a large fraction of the genes of the members of the proposed order Megavirales (MV) remain unresolved. Horizontal gene transfers (HGTs) have been described as a major evolutionary force acting on the genomes of these viruses, but these assumptions were based on a limited number of viruses classified within closely related established or putative MV families. The availability of new and distantly related genomes of viruses from 3 defined and 7 putative MV families offers an interesting opportunity to better understand the frequency and role of HGTs in the genome mosaicism of these viruses. Here, we deduced orthologous groups (OGs) from 86 predicted protein sets and inferred evolutionary scenarios through phylogenies based on both clustered and un-clustered proteins. Remarkably, no general trend of genomic composition was found between distantly related MV members, as only 10% of OGs were shared among them. A major fraction of OGs (approximately 90%) was found to exhibit family specificity. Similarly, family-specific gene acquisition patterns were also observed: only 3% of OGs acquired by horizontal gene transfer were shared between families, compared to 6% of those specific to a single MV family. Furthermore, phylogenetic analyses based on un-clustered proteins also showed family-specific gene acquisitions in distantly related MV members. Overall, HGT showed donor specificity: viruses of vertebrates/invertebrates (poxviruses, ascoviruses and iridoviruses) acquired genes mainly from donors such as Euteleostomi, Eutheria, Baculoviridae and Proteobacteria, whereas algal viruses (Phycodnaviridae) and protozoan viruses (pandoravirus, Mimiviridae, Pithovirus, and Marseilleviridae) acquired genes mainly from cellular donors such as Dictyostelium, Mamiellales, Firmicutes, Clostridiales, Klebsormidium, Rozella allomycis, Oomycetes and Phytophthora. Taking all the data into consideration, a clear distinction can be seen in the genome mosaicism of distantly related MV families, which evolved via genome specificity and family-specific gene acquisitions from their respective ecological niches. Our systematic search for HGT events of non-megavirale origin provides the first estimate of the total contribution of HGT to the family-specific genome mosaicism of distantly related Megavirales.

Keywords: Megavirales; Horizontal gene transfers; MimiLook; genome mosaicism


1. Introduction

The origin of “Megaviromes” has been debated, as there is disagreement on whether they result from smaller viruses acquiring genes (genome expansion hypothesis) or from cellular genomes adapting to a parasitic lifestyle by genome simplification (genome degradation hypothesis) [1–6]. Recently, it has also been hypothesized that megaviromes may evolve through a complex accordion-like process, with successive steps of genome expansion through duplications and gene transfers followed by genome reduction [7]. Under the genome expansion and genome accordion hypotheses, the evolution of megavirale (MV) genomes is supported by a combination of evolutionary forces such as horizontal gene transfers (HGT) [7–11] of bacterial and eukaryotic genes, gene and genome duplications [12,13], and dissemination of various mobile genetic elements such as introns or transposons [14].

Since the discovery of the first giant virus, Mimivirus, which infects Acanthamoeba, other “giants” have been isolated from environmental samples, animals and humans [1,15,16] over the past decade. This has led to the identification of new or putative new viral families, namely pandoraviruses, pithoviruses, faustoviruses and Mollivirus [17–20], along with the defined viral families Mimiviridae, Ascoviridae, Asfarviridae, Iridoviridae, Phycodnaviridae, Marseilleviridae and Poxviridae. Members of these viral families (formerly NCLDVs) were shown to exhibit unique phenotypic and genotypic characteristics that differentiate them from the traditional concept of a virus as an obligate pathogen with a small, filterable particle size and a simple genome, and that expand the biological definition of a virus [16,21]. Also, these viruses were thought to share a putative ancient common ancestor harboring about 50 conserved core genes [22–24] and were supposed to be monophyletic. Therefore, it was proposed to reclassify all these recognized or putative viral families in a new viral order named Megavirales [25].

Members of the proposed order Megavirales (MV) are characterized by genome size variation (100 kb to 2770 kb), encoding from 110 to 2556 proteins, coupled with a diverse genomic repertoire [21]. MVs possess genes that have not been previously detected in any virus, including genes encoding components of the translation system, translation factors and a variety of metabolic enzymes. Furthermore, MVs have several novel biological features, such as their susceptibility to infection by other bioactive particles, termed “virophages” [26], a defense mechanism against these virophages (MIMIVIRE) [27] and the presence of transpovirons (mobile DNA elements) [14]. In addition to infecting a wide range of eukaryotic hosts, including vertebrates, protists, and unicellular algae [25], MVs have also been isolated from human blood [28] and have been found in the human virome [29], indicating a potential role in pathogenicity (or at least a route of exposure). With all these peculiarities, MVs have become a hot topic for evolutionary and biological inquiry, and unraveling their evolutionary strategies will broaden our knowledge of the viral world.

The diversity of MVs (classified in 10 distantly related families with varying gene content) imposes difficulties in collectively evaluating their phylogenetic relationships. While the small subset of conserved core genes, and the phylogenomic analyses based on them, provide a useful classification of MVs, they give little insight into the remaining un-conserved and variable gene content of accessory genomes. Many of the aforesaid phylogenetic studies have pointed out the decisive role of HGTs and genetic exchanges in the evolution of MVs, but the majority of them are based on closely related MV families. Moreover, the exact proportion of genes reported as horizontally acquired varies greatly with the methodology used for their detection and for the interpretation of the resulting phylogenies. Therefore, a systematic search for reticulate evolutionary events such as HGT in MVs is needed to decipher the genomic composition and genome mosaicism of distantly related MV families.

Attempts to detect gene acquisitions in MVs and to infer their evolutionary history have involved comparisons of their sequences and genomic compositions. Here, taking advantage of the availability of distantly related MV ORFomes, we have systematically searched for potential HGT events in MVs by using a comparative genomics analysis of 86 MV members coupled with automated phylogenetic reconstruction and a tree topology scan. Our automated workflow MimiLook [30], described earlier, detected instances of HGT by interpreting phylogenies of a) clustered MV proteins (orthologous groups), and b) un-clustered MV proteins. Further, it allowed retrieval of all cases of HGT in Megavirales reported so far in the literature, as well as new candidate cases of HGT not identified before. Overall, we show that up to 9% of protein-coding genes in the “Megavirome” originate from HGT of non-megavirale origin. Further, no general trend of genome composition and gene acquisition was seen, as the orientation of transfer (putative donor and receiver) reveals family specificity in Megaviromes. In other words, specific cellular donors were found to exchange genes with specific MV families. We place these findings in the context of the ongoing debate on the evolutionary origin of MVs and the contribution of horizontally acquired genes to the evolution of MV genomes.

2. Materials and Methods

Our in-house automated workflow MimiLook [30] takes whole ORFomes in FASTA format as input and executes the following basic steps in sequential order: 1) identification of clustered proteins (orthologous groups) and un-clustered proteins, 2) reference species tree construction, 3) detection of HGT events, 4) mapping of HGTs and orthologous group (OG) information on the reference MV family tree (Figure 1). This four-step methodology has been discussed in detail earlier, so only the basic information for each step is given in the following paragraphs.

Figure 1. Brief description of the HGT search strategy implemented in MimiLook [30]. Step 1 (black arrows) identifies clustered and un-clustered proteins; Step 2 (green) constructs the reference species tree from the deduced orthologous groups (OGs); Step 3 (blue) detects HGT events through the construction of phylogenies; and finally, Step 4 distributes OG information and HGT events on the Megavirales reference family tree.

2.1 Orthologous group determination

Complete ORFomes of 86 MVs distributed in 10 families (Table S1) were submitted to OrthoMCL [31] to retrieve orthologous protein groups (OGs) with a >=30% identity threshold. Only protein sequences longer than 50 amino acids were considered for further analysis. Homologous sequences were selected using the all-against-all BlastP algorithm [32] with an E-value of less than 0.00001. The orthologous sequences were then clustered using the Markov Cluster algorithm [31], which is based on probability and graph flow theory and allows the simultaneous classification of global relationships in a similarity space.
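As an illustration of how the clustering output can be post-processed, the sketch below parses one OrthoMCL-style group line and picks the longest member as the representative used for the later BlastP search, which mirrors the selection made in the supplementary Perl code. It is a minimal sketch under assumptions: the line format `GROUP: genome|gene ...`, the function names and the example lengths are hypothetical and are not taken from MimiLook itself.

```python
# Minimal sketch (hypothetical names and data), not the MimiLook implementation.

def parse_group_line(line):
    """Split 'OG1: genomeA|g1 genomeB|g2' into (group_id, [(genome, gene), ...])."""
    name, members = line.split(":", 1)
    pairs = [tuple(member.split("|", 1)) for member in members.split()]
    return name.strip(), pairs


def keep_long_proteins(pairs, lengths, min_len=50):
    """Keep only members longer than min_len amino acids (assumed cutoff from the text)."""
    return [p for p in pairs if lengths[p] > min_len]


def longest_representative(pairs, lengths):
    """Return the (genome, gene) pair with the longest protein; ties keep the first one."""
    return max(pairs, key=lambda p: lengths[p])


# Hypothetical example: three members from two genomes.
group_id, members = parse_group_line("OG1: virA|p1 virA|p2 virB|p9")
lengths = {("virA", "p1"): 40, ("virA", "p2"): 300, ("virB", "p9"): 300}
usable = keep_long_proteins(members, lengths)
representative = longest_representative(usable, lengths)
```

Selecting a single representative per group keeps the number of BlastP queries equal to the number of groups rather than the number of proteins.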

2.2 Species tree construction

A presence/absence (p/a) OG matrix was used to construct the reference MV family tree of Megavirales using the distance supermatrix approach. The amino acid sequences of each OG in the p/a matrix were retrieved from the ORFome set and aligned using MUSCLE [33], and gene trees were constructed using FastTree [34]. A species tree was constructed from all gene trees using the distance supermatrix approach (SDM method weighted by alignment lengths [35]). A balanced minimum evolution tree was inferred from the resulting distance supermatrix using FastME [36], using NNI, SPR, and TBR (Tree Bisection and Reconnection) tree topology refinement. Further, the MV family tree topology was inferred from the species tree, to be used as a reference tree for tagging various evolutionary scenarios (see section 2.4).
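The presence/absence matrix mentioned above can be pictured with a short sketch. It assumes that each OG is available as a list of (genome, gene) members; the function name and the toy data are hypothetical, and the sketch does not cover the SDM or FastME steps.

```python
# Minimal sketch (hypothetical names and data) of a presence/absence OG matrix.

def presence_absence(groups, genomes):
    """Map each OG id to a 0/1 row, one column per genome in the given order."""
    matrix = {}
    for og_id, members in groups.items():
        present = {genome for genome, _gene in members}
        matrix[og_id] = [1 if genome in present else 0 for genome in genomes]
    return matrix


# Hypothetical example with three genomes and two OGs.
genomes = ["virA", "virB", "virC"]
groups = {
    "OG1": [("virA", "p1"), ("virB", "p9")],
    "OG2": [("virC", "p4"), ("virC", "p5")],
}
pa_matrix = presence_absence(groups, genomes)
```

Each row records only whether a genome contributes at least one member to the OG, so paralogs within a genome do not change the row.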

2.3 Detection of HGT event with information on orientation of transfer

BlastP [32] was used against a local nr database to find potential homologs of each OG and un-clustered protein. An E-value cutoff of 10⁻⁵ or less was set and, to filter out redundant hits and possible contaminants from the BLAST output, hits below the thresholds of 50% query coverage and 30% identity were discarded. OGs with cellular homologs outside Megavirales were considered further for possible HGT detection. Protein sets (containing the amino acid sequences of the optimal BLAST hits) were then aligned with MUSCLE [33]. After alignment, Gblocks [37] was used to remove poorly aligned positions and highly divergent regions. Maximum likelihood (ML) trees were constructed from each alignment using FastTree [34] with the Whelan and Goldman (WAG) [38] model of amino acid evolution. The trees generated by FastTree were used as gene trees, and a species tree was constructed for each gene tree separately. Each pair of species and gene trees was then subjected to the T-REX [39] algorithm to infer probable HGTs, with information on the donor branch, the recipient branch and various statistical values such as the Robinson-Foulds distance, least-squares coefficient and bipartition dissimilarity. Tree topologies corresponding to a potential HGT event were then verified manually among the ensemble of produced phylogenetic trees. The pattern searched for was the presence of at least one node ‘X’ partitioning the tree into two sub-clades: one monophyletic clade containing only Megavirales (and possibly other members of Megavirales not included in the dataset) and another distinct clade containing only non-Megavirales species.
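The topology pattern described above (a node ‘X’ with a Megavirales-only clade next to a clade without Megavirales) was verified by hand in this study. Purely as an illustration, such a check could be sketched as follows, assuming a rooted tree encoded as nested tuples with leaf names as strings; the encoding, the function names and the example leaves are hypothetical.

```python
# Minimal sketch (hypothetical tree encoding and names), not the authors' procedure.

def leaves(node):
    """Return all leaf names below a node; a leaf is a plain string."""
    if isinstance(node, str):
        return [node]
    found = []
    for child in node:
        found.extend(leaves(child))
    return found


def has_partition_node(node, is_mv):
    """True if some internal node has one child clade made only of Megavirales
    leaves and a sibling clade made only of non-Megavirales leaves."""
    if isinstance(node, str):
        return False
    kinds = [{is_mv(leaf) for leaf in leaves(child)} for child in node]
    if {True} in kinds and {False} in kinds:
        return True
    return any(has_partition_node(child, is_mv) for child in node)


# Hypothetical example: two megaviral leaves grouped next to two bacterial leaves.
mv_taxa = {"MV1", "MV2"}
tree = ((("MV1", "MV2"), ("Bact1", "Bact2")), "Euk1")
candidate = has_partition_node(tree, lambda name: name in mv_taxa)
```

A single Megavirales leaf next to a single cellular leaf also satisfies this test, so any hit would still need the manual inspection described in the text.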

2.4 Distribution of OGs and HGT events on reference MV family tree

The MV family tree topology was inferred from the reference species tree constructed (section 2.2). MV family-specific OGs were plotted on this family tree using a presence/absence criterion, i.e. if a particular OG is present in at least one species of an MV family, then it is considered to be present in that family. Once the family-specific OGs were attributed to their respective nodes, OGs shared between MV families were then tagged on the tree. If a particular shared OG is present in multiple external nodes (family-specific nodes), then it is assigned to their common internal node. Further, instances of HGT detected by our workflow were plotted on this tree.
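The mapping rule described above can be sketched as a search for the smallest clade of the family tree that contains every family in which an OG is present. The sketch assumes a rooted family tree given as nested tuples; the topology used in the example is invented for illustration and is not the reference tree of this study, and all function names are hypothetical.

```python
# Minimal sketch (hypothetical topology and names) of tagging OGs on a family tree.

def og_families(members, family_of):
    """An OG is present in a family if at least one of its species belongs to it."""
    return {family_of[genome] for genome, _gene in members}


def clade_families(node):
    """Return the set of family names below a node; a leaf is a plain string."""
    if isinstance(node, str):
        return frozenset([node])
    result = frozenset()
    for child in node:
        result |= clade_families(child)
    return result


def assign_node(tree, families):
    """Return the leaf set of the smallest clade containing all given families.
    Assumes `families` is a non-empty subset of the tree's leaves."""
    families = frozenset(families)
    node = tree
    while not isinstance(node, str):
        for child in node:
            if families <= clade_families(child):
                node = child  # descend into the child that still holds every family
                break
        else:
            break  # no single child holds them all: this node is the common ancestor
    return clade_families(node)


# Invented topology, for illustration only.
family_tree = (("Mimiviridae", "Marseilleviridae"), ("Poxviridae", "Iridoviridae"))
family_of = {"virA": "Mimiviridae", "virB": "Poxviridae"}
node = assign_node(family_tree, og_families([("virA", "p1"), ("virB", "p9")], family_of))
```

A family-specific OG resolves to its own leaf, while an OG shared by families on both sides of the root resolves to the root.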

2.5 Functional Annotation of Candidate HGT-acquired Genes

The clustered and un-clustered proteins were scanned against the InterPro [ref..] database. Using the InterPro2GO association file, Gene Ontology molecular function (GO_MF) terms were assigned to proteins on the basis of their InterPro domain composition. This allowed us to slim the GO terms down to broad categories: binding activity, catalytic activity and others. Proteins that were not assigned a GO functional category were annotated with their InterPro ID instead. The remaining proteins, which received no functional annotation, were labeled #N/A.
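The annotation fallback described above (GO_MF first, then InterPro ID, then #N/A) can be sketched as follows. The `annotate` function and the dictionary form of the InterPro2GO mapping are assumptions for illustration; parsing of the real association file is not shown.

```python
def annotate(interpro_ids, interpro2go):
    """Return (annotation_source, labels) for one protein.

    interpro_ids: InterPro entries found in the protein (possibly empty).
    interpro2go:  dict mapping an InterPro ID to a list of GO_MF categories
                  (hypothetical in-memory form of the association file).
    """
    go_terms = sorted({term
                       for ipr in interpro_ids
                       for term in interpro2go.get(ipr, [])})
    if go_terms:
        return "GO_MF", go_terms               # preferred: GO molecular function
    if interpro_ids:
        return "InterProID", sorted(interpro_ids)  # fallback: domain IDs only
    return "#N/A", []                          # no functional annotation at all
```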

3. Results and Discussion

3.1 Identification of orthologous groups and detection of HGT event

The dataset used in this study consists of 86 MV ORFomes classified into 10 MV families (defined or putative). In total, 4577 orthologous groups (OGs) were deduced, comprising 21256 of the 29153 viral proteins used. The remaining 7898 proteins were un-clustered and are analyzed separately (see section 3.3). Of the OGs, 91% (4168) are family specific (i.e. represented by species classified in a single MV family), whereas only 9% (409) are represented by proteins from 2 or more MV families (Figure 2a). In other words, the vast majority of genes are not conserved across distantly related MV families. Groups with paralogs from the same species (396 OGs) were separated from the tally of family-specific OGs, reducing their number to 3772. HGT detection was nevertheless carried out on these OGs as per the workflow. The distribution of shared OGs shows that only a few OGs (23) are represented by 6-10 MV families, whereas most (386 OGs) are represented by 2-5 MV families (Figure 2b). Genomic variation can also be seen in the distribution of family-specific OGs: Mimiviridae, pandoraviruses and Phycodnaviridae vary greatly in gene content (924, 772 and 1108 OGs, respectively) compared with the other MV families (Figure 2c).

Figure 2. Distribution of orthologous groups (OGs) in Megavirale families. (A) Number of shared and unshared OGs. (B) Distribution of shared OGs: 386 OGs are represented by 2-5 Megavirale families, while only 23 OGs are represented by 6-10 families. (C) Distribution of unshared OGs: Mimiviridae, pandoraviruses and Phycodnaviridae vary greatly in gene content (924, 772 and 1108 OGs, respectively), while fewer family-specific OGs are present in the other families.

The reference Megavirale family tree was inferred from the species tree constructed with the super distance matrix approach. The tree delineates three major clades: clade 1 groups viruses of vertebrates/invertebrates (Poxviridae, Iridoviridae and Ascoviridae); clade 2 groups viruses of protozoa and algae (Marseilleviridae, Mimiviridae, pithovirus and Phycodnaviridae); and clade 3 groups faustovirus and Asfarviridae. Nodes were labeled on the tree as A-J for family-specific nodes and K-S for shared internal nodes (Figure 3).

These OGs of clustered proteins were then submitted to automatic phylogenetic analysis using the ML method in FastTree [34], and topologies supporting potential horizontal gene transfer events were searched for using T-REX [39] within the MimiLook framework [30]. Phylogenetic trees were successfully constructed for all OGs, and a total of 414 OGs yielded trees with topologies supporting HGT events (Table S2; 10.6084/m9.figshare.4645273). The corresponding proteins in each orthologous group were considered as possibly acquired via horizontal gene transfer. The orthologous group information and the HGTs detected from the 4577 phylogenetic trees were then plotted on the reference tree (Figure 3). Of the 414 OGs, 105 gene acquisitions were of bacterial origin, 171 were of eukaryotic origin and 9 were acquired horizontally from phages and viruses (other than Megavirales). Interestingly, 129 OGs were found to transfer genes in the opposite direction, i.e. from Megavirales to other cellular domains. In all, 305 transfers were detected at family-specific nodes (A-J) and 109 at shared nodes (K-S).
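The counts reported in this section can be cross-checked with a few lines of arithmetic. The sketch below only re-adds the numbers given in the text; the dictionary keys and the helper `is_partition` are illustrative names, not part of the workflow.

```python
def is_partition(parts, total):
    """True if the category counts add up exactly to the reported total."""
    return sum(parts.values()) == total


# Counts as reported in the text of this section.
og_by_sharing = {"family_specific": 4168, "shared": 409}
shared_og_by_breadth = {"2-5 families": 386, "6-10 families": 23}
hgt_by_direction = {
    "from bacteria": 105,
    "from eukaryotes": 171,
    "from phages/other viruses": 9,
    "from Megavirales to others": 129,
}
hgt_by_node_class = {"family-specific (A-J)": 305, "shared (K-S)": 109}

checks = {
    "OGs": is_partition(og_by_sharing, 4577),
    "shared OGs": is_partition(shared_og_by_breadth, 409),
    "HGT direction": is_partition(hgt_by_direction, 414),
    "HGT node class": is_partition(hgt_by_node_class, 414),
}

# OGs acquired by Megavirales = all HGT-supporting OGs minus outgoing transfers.
acquired_by_megavirales = 414 - hgt_by_direction["from Megavirales to others"]
# Share of family-specific OGs, rounded to a whole percent.
family_specific_share = round(100 * og_by_sharing["family_specific"] / 4577)
```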

Figure 3. Distribution of orthologous groups and HGTs on the distance-based reference tree. Numbers written in black at each node give the total number of OGs present at that node. Numbers written in blue indicate the predicted transfers (see figure legend).

A divergent pattern of gene acquisition via HGT was observed: the donors and receivers involved in HGT events at each node were an ensemble of eukaryotes, bacteria, viruses and phages. A large number of gene acquisitions via HGT (302 OGs) was found in Mimiviridae (node D), Phycodnaviridae (node G), pandoravirus (node C), Poxviridae (node J) and Iridoviridae (node I), compared with the other family-specific nodes. Surprisingly, only 2 gene acquisitions were found in pithovirus (node E) and Marseilleviridae (node F), and no HGT event was detected at two nodes: Asfarviridae (node A) and faustovirus (node B). A similarly divergent pattern was observed at the shared nodes, where eukaryotic and bacterial genes were acquired by Megavirales at each delineating node (Figure 3). In addition, 53 cases of HGT were inferred to result from sympatric associations (transfers in which no particular domain was tagged as 'donor') between Megavirales and other cellular entities (Table S3).

3.2 Putative donors for HGT reveal family specific gene acquisition

For the 414 phylogenetic trees indicative of an HGT event, we examined the ensemble of species present in the putative donor sub-trees (Table S2). Donor sub-trees formed monophyletic groups composed exclusively of non-Megavirale species and held the closest outgroup position relative to the Megavirales receiver group. Eukaryotes were present in donor sub-trees in 171 cases and bacterial species in 101 cases out of the 414 phylogenetic trees, making them the most frequent partners in genetic exchanges with Megavirales. In addition, in 129 cases Megavirales formed the donor group and other cellular entities the receiver group (Figure 4).

Figure 4. Horizontal gene transfers (HGTs) with putative donors and receivers at each node of the reference tree. Node labels are the same as in Figure 3. A clear distinction can be seen at each node: each Megavirale family acquires genes from a specific set of donors.

3.2.1 Donors identified in shared nodes (K-S)

In the 22 cases detected at node S (the originating node), the donor sub-tree contained multiple donors from Bilateria, Oomycetes, , Neognathae and Trypanosomatidae, indicating donor specificity: the Megavirale families present in the receiver sub-tree acquired genes independently from specific donors. A clear taxonomic division of the cellular entities present in donor sub-trees is visible in the 20 cases found at nodes Q and R (the nodes dividing the reference Megavirale tree into three sub-clades). Receiver sub-trees containing viruses of vertebrates (Poxviridae, Ascoviridae and Iridoviridae) acquired genes from Eukaryotes (Neoptera and Theria), Bacteria (Selenomonas, Legionellaceae) and Phages (Baculoviridae), whereas receiver sub-trees consisting of algal and protozoan viruses (Mimiviridae, Marseilleviridae, pandoravirus, pithovirus, Phycodnaviridae) acquired genes through their interactions with , Mamiellales, Oomycetes, Klebsormidium and Trypanosomatidae (Eukaryotes) and Gammaproteobacteria. Interestingly, in 17 cases Megavirales were present in the donor sub-tree and transferred genetic material to receivers such as Klebsormidium, Trichomonas vaginalis G3, Ectocarpus siliculosus and Baculoviridae.

At internal nodes K and P, only 2 cases were detected, involving Pyrenomonadales and Enterobacterales, respectively. At node O, which separates algal viruses (Phycodnaviridae) from protozoan viruses, 36 cases were detected (15 from eukaryotes, 7 from bacteria, 14 from Megavirales). The species found in donor sub-trees included Proteobacteria and Bacteroidetes (bacteria), and Dictyosteliida, Phytophthora, Acanthamoeba, Trypanosomatidae and Mamiellales (eukaryotes). The major receiver of proteins from Megavirales was Klebsormidium, followed by Myoviridae and Fonticula alba.

At the sub-internal nodes L-M (delineating protozoan viruses), proteins were acquired from different origins, including bacterial species from Proteobacteria and eukaryotic species from Mamiellales, Oomycetes, Trichomonas and Acanthamoeba. Transfer from Megavirales to Acanthamoeba was also observed in many cases.

3.2.2 Donors identified in family specific nodes (A-J)

At the nodes representing viruses of vertebrates/invertebrates, a total of 77 cases were found. The most frequent categories of eukaryotes in donor sub-trees were bony vertebrates (Euteleostomi), placental mammals (Eutheria and Theria) and Protostomia (invertebrates). Bacterial donors at node I (Iridoviridae) include Proteobacteria and Bacteroidetes, while node H (Ascoviridae) and node J (Poxviridae) have viral donors from Baculoviridae. Surprisingly, many transfers were seen from Iridoviridae to Flavobacterium and Lasius niger. Eukaryotes were found to exchange proteins with Poxviridae and Iridoviridae but not with Ascoviridae, while bacterial donors were found only for Iridoviridae. Viral donors were present only in the donor sub-trees of Poxviridae and Ascoviridae.

At node G, representing algal viruses (Phycodnaviridae), the eukaryotes found in donor sub-trees were from Mamiellales, Dictyosteliida and Klebsormidium, and the bacterial species were from Bacillales, Firmicutes and Clostridiales. In many cases, Phycodnaviridae were found to transfer proteins to Klebsormidium and Ectocarpus siliculosus.

The largest numbers of transfers were found at nodes C and D, representing protozoan viruses (pandoravirus and Mimiviridae). The major eukaryotic donors for Mimiviridae were Dictyosteliida, Rozella allomycis, Mucor and Dikarya, whereas species from Mamiellales, Saprolegnia, Phytophthora and Acanthamoeba donated proteins to pandoraviruses. In contrast, the bacterial donors did not differ between the two families: both acquired proteins of Bacillales, Streptomyces, Streptococcus and Proteobacteria origin.

3.3 HGTs from non-orthologous groups (un-clustered proteins)

We applied a similar procedure to the 7,898 non-orthologous proteins to detect transfer events and putative donors. Of these, 5,233 proteins returned no homologous hit in the nr database and were therefore considered orphans. The remaining 2,665 proteins were analyzed further, and evolutionary scenarios were inferred from their phylogenetic trees (Table S4). Of these, 1815 proteins had hits only in Megavirale species (considered Megavirale-specific); 467 had hits in non-Megavirale species, with Megavirales forming a monophyletic clade (considered as having a shared origin with other cellular domains); 283 proteins were detected as gene transfers (259 as HGTs and 24 as sympatric associations); and for 152 proteins nothing could be inferred owing to weak phylogenetic signal (considered ambiguous) (Figure 5). We thus identified 259 instances of HGT among non-orthologous proteins: 135 from eukaryotes, 82 from bacteria, 11 from phages and other viruses, and 31 in which Megavirales transferred a protein to other cellular domains (Table S4).
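The scenario labels used above can be read as a simple decision cascade. The sketch below is one possible ordering of the checks, which is an assumption; the function and argument names are hypothetical, and the `transfer` verdict is treated as an input because deriving it from a tree is outside this sketch.

```python
from collections import Counter


def classify(has_hit, has_non_mv_hit=False, signal_ok=True, transfer=None):
    """Label one un-clustered protein with an evolutionary scenario.

    has_hit:        any homolog found in nr.
    has_non_mv_hit: at least one homolog outside Megavirales.
    signal_ok:      False when the phylogenetic signal is too weak to interpret.
    transfer:       None, "hgt" or "sympatric" (verdict from tree inspection).
    """
    if not has_hit:
        return "orphan"
    if not has_non_mv_hit:
        return "Megavirale-specific"
    if not signal_ok:
        return "ambiguous"
    if transfer == "hgt":
        return "HGT"
    if transfer == "sympatric":
        return "sympatric association"
    return "shared origin"          # Megavirales monophyletic, no transfer inferred


def tally(records):
    """Count scenarios over an iterable of keyword-argument dicts."""
    return Counter(classify(**r) for r in records)
```

Tallying the labels over all proteins gives per-scenario counts of the kind summarized in Figure 5.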

The eukaryotic, bacterial and viral taxa found in the donor sub-trees of Poxviridae, Ascoviridae and Iridoviridae (viruses of vertebrates/invertebrates) were Euteleostomi, Eutheria, Ecdysozoa, Bacillales and Baculoviridae, whereas in algal and protozoan viruses the ensemble of species included Mamiellales, Dictyostelium, Rozella, Mucor, Bacillales, Clostridiales, Chlamydiales and Pseudomonadaceae.

Figure 5. Evolutionary scenarios for un-clustered proteins (non-orthologous groups). Of a total of 2665 un-clustered proteins, 1815 were Megavirale-specific, 259 were involved in horizontal transfer, 467 shared clades with homologs from other domains of life, 152 were considered ambiguous and 24 showed sympatric associations. Of the 259 transfers, 135 and 82 were from eukaryotes and bacteria, respectively, to Megavirales, while 31 were detected as probable transfers from Megavirales to other cellular domains.

3.4 Donor specificity in cellular gene acquisitions

Taking all instances of HGT into consideration (414 in orthologous groups and 259 in non-orthologous groups), a clear distinction can be seen in the genome mosaicism of distantly related MV families, which evolved through genome specificity and family-specific gene acquisitions from their respective ecological niches. Our results suggest that the major donors in the HGT events shaping the genomes of Poxviridae, Ascoviridae and Iridoviridae are from Deuterostomia (Euteleostomi, Eutheria, Theria and Boreoeutheria), Protostomia (Neoptera and Endopterygota) and Baculoviridae. Previous studies based on phylogenetic reconstructions of gene families with members in other viruses and their hosts have suggested that multiple horizontal gene transfer (HGT) events into poxvirus genomes have taken place from other viruses [40] and from their eukaryotic hosts [41–44]. Our results indicate that these viruses have a common origin, and that independent gene acquisitions from divergent eukaryotic phyla and from the viral family Baculoviridae contribute significantly to family-specific gene innovations in Megavirales infecting Metazoa. These viral families also have very low levels of bacterial-like genes, which suggests that gene transfers from bacteria may not play a decisive role in the genome mosaicism of these Megavirale families. Interestingly, family-specific gene transfer from Iridoviridae to Flavobacterium sp. JRM and Lasius niger was observed. We analysed these transfers further and found a possible case of contamination or sequencing error in the Flavobacterium sp. JRM contigs (contig 114, contig 144, contig 148, contig 155, contig 163 and contig 169) (Supplementary table), and a possible case of viral integration in Lasius niger.

In contrast, eukaryotic donors from Stramenopiles (Oomycetes: Saprolegnia and Phytophthora), Fungi (Leotiomyceta) and Viridiplantae (Mamiellales and Klebsormidium) appear to have had early interactions that shaped protozoan and algal virus genomes. The genome mosaicism of Phycodnaviridae reflects major interactions with Viridiplantae and Amoebozoa, whereas protozoan viruses have genetic exchanges with donors from Fungi, Stramenopiles and Parabasalia. Bacterial genes from Proteobacteria, Terrabacteria (Clostridiales, Bacillales, Actinobacteria) and Bacteroidetes have also contributed to the genome mosaic at each node of diversification. The central role of gene acquisitions from eukaryotes and bacteria in the evolution of protozoan and algal Megavirale genomes has been shown in many previous studies [7–11]. Thus, it can be concluded that viral families infecting protozoans and algae have acquired genes from a diverse but specific set of eukaryotic and bacterial donors, which are either their hosts or organisms not known to interact with these viruses.

3.5 Functions of Genes Acquired via HGT

Among clustered proteins, 285 OGs were detected as transferred into Megavirales, while 129 were detected as transferred from Megavirales to other cellular domains. Of the 285 HGTs detected as acquired by Megavirales, 96 were assigned a GO_MF term and 105 were assigned an InterPro ID; 56 OGs were not assigned any functional category. In the 'molecular function' ontology, the largest differences were an over-representation of 'catalytic activity' and 'binding activity' proteins (Figure 6a, Table S2).

In total, 58 OGs with catalytic activity (methyltransferases, kinases, lyases, hydrolases, oxidoreductases, peptidases) and 33 OGs with binding activity (DNA/RNA, ATP/GTP, ion) were found to be acquired by Megavirales. Most of the OGs shared between MV families and characterized as binding proteins were acquired from eukaryotes (Dictyostelium, Phytophthora and Trypanosomatidae), whereas family-specific OGs acquired binding proteins of bacterial as well as eukaryotic origin. However, 2 OGs present only in Poxviridae (OG_02557, OG_01174) were acquired from the viruses Baculoviridae and Circoviridae. Similarly, OGs shared between MV families and assigned to the catalytic activity category (involved in metabolic processes) were mostly acquired from eukaryotes, with taxa such as Dictyostelium, Trypanosomatidae and Alveolata in the donor clade; among family-specific OGs, the major donors were eukaryotes for viruses of vertebrates, bacteria for Phycodnaviridae and pandoraviruses, and eukaryotes for Mimiviridae and Marseilleviridae. Interestingly, a shared OG functionally characterized as an electron carrier (OG_00343) was found to be of eukaryotic origin, whereas a family-specific OG present in Mimiviridae (OG_01399) was found to be acquired from Proteobacteria. Of the OGs assigned an InterPro ID, 49 were characterized as ankyrin repeats, mostly from Mamiellales (Eukaryote). The other 56 OGs were assigned various InterPro IDs and were of diverse bacterial and eukaryotic origin (Table S2).

Of the 129 OGs detected as transfers from MVs to other cellular domains, 31 were functionally characterized by the 'molecular function' ontology and 16 by InterPro ID; the rest remain uncharacterized. The major receivers of transferred genes were Ectocarpus siliculosus (ion binding and catalytic activity), Klebsormidium flaccidum (magnesium ion binding and structural molecule activity), Acanthamoeba castellanii and some Proteobacteria.

Among un-clustered proteins, 82 were functionally annotated in the 'molecular function' category (27 binding, 51 catalytic activity, 4 other). In viruses of metazoans, the major donors of binding proteins are from Eutheria, and the donors of catalytic activity proteins are from Euteleostomi and Amniota. For catalytic activity proteins a distinction is readily apparent: the major donors in Phycodnaviridae are from Mamiellales, whereas in other protozoan viruses the transferred proteins are of bacterial origin (Figure 6b, Table S4). A further 73 proteins were assigned various InterPro IDs, most of which were of eukaryotic origin. The major species in donor clades were from Mamiellales and Clostridiales (Phycodnaviridae and pandoravirus), Dictyosteliida and Clostridiales (Mimiviridae) and Baculoviridae (Poxviridae). Of the 31 proteins detected as transferred from Megavirales to other cellular domains, 7 were assigned a GO_MF category, among which the major receiver of structural molecule activity proteins was Klebsormidium flaccidum; 12 were assigned an InterPro ID, and their major receiver was Lasius niger (transfers detected from Poxviridae and Ascoviridae).

Figure 6. Functional annotation of clustered proteins (OGs) and un-clustered proteins according to GO_MF and the InterPro database scan. (A) Functional category distribution of detected HGTs in orthologous groups (OGs), i.e. clustered proteins (to Megavirales from other domains). (B) Functional category distribution of detected HGTs in un-clustered proteins (to Megavirales from other domains).

4. Conclusion and future perspectives

Our analysis represents the first comprehensive pan-genomic search for HGT events in Megavirales with phylogenetic validation. Previous reports have shown that genes acquired by HGT in closely related Megavirale families may play important roles in the evolution of Megaviromes and have hence been functionally significant []. Our systematic search for HGT events of non-Megavirale origin provides the first estimate of the total contribution of HGT to the family-specific genome mosaicism of distantly related Megavirales. Analyses of our phylogenetic tree topologies indicate that multiple species and kingdoms are positioned in donor clades, suggesting that there is not a single donor or a small number of donors but a multitude of possible cellular species from which Megavirales have acquired genes. Despite this ensemble of donor species, Megavirales show family-specific gene acquisitions, which supports the view that cellular gene acquisition is a major evolutionary force in the genome mosaicism of each Megavirale family. Donor specificity also suggests that these viruses acquired genes through a balanced process of genetic exchange with their environment, and that they are not just gene robbers. Another important observation is that some of the species showing continuous gene exchange with Megavirales (Rozella allomycis, Klebsormidium flaccidum, Lasius niger) are not known to be hosts of these viruses, nor to interact with them, suggesting the possibility of finding new viral inserts in eukaryotes. Further, our analysis also revealed contamination present in the database (Flavobacterium sp. JRM), which might be a case of mis-annotation or sequencing error.

Overall, it seems likely that HGT has been frequent in the course of Megavirale evolution. In most cases proposed in previous studies, there was little information about the likely donor lineages or about when in the course of evolution the HGT event occurred. Here, we used a distance-based tree reconstruction method to provide a snapshot of the underlying evolutionary relationships of diversely related Megavirale families and proposed putative donors and recipients. Despite much effort, we are still unable to place viruses on the universal tree of life, and their origin remains speculative; thus, even when donors and recipients are known, there is rarely supporting evidence for the absence of the genes from relatives of the Megavirale family that diverged prior to the transfer. Nevertheless, almost all Megavirale families have acquired genes of eukaryotic and bacterial origin, along with a few of viral origin, hinting at an ancient origin alongside the divergence of the cellular domains. Indeed, each Megavirale family was found to have a specific pattern of gene acquisition, in which its genome mosaicism results from gene exchanges in specific host-virus interactions.

References

1. Raoult, D.; Audic, S.; Robert, C.; Abergel, C.; Renesto, P.; Ogata, H.; Scola, B. L.; Suzan, M.; Claverie, J.-M. The 1.2-Megabase Genome Sequence of Mimivirus. Science 2004, 306, 1344–1350.

2. Filée, J.; Siguier, P.; Chandler, M. I am what I eat and I eat what I am: acquisition of bacterial genes by giant viruses. Trends Genet. 2007, 23, 10–15.

3. Moreira, D.; López-García, P. Ten reasons to exclude viruses from the tree of life. Nat. Rev. Microbiol. 2009, 7, 306–311.

4. Raoult, D. There is no such thing as a tree of life (and of course viruses are out!). Nat. Rev. Microbiol. 2009, 7, 615.

5. Moreira, D.; López-García, P. Evolution of viruses and cells: do we need a fourth domain of life to explain the origin of eukaryotes? Phil Trans R Soc B 2015, 370, 20140327.

6. Forterre, P.; Gaïa, M. Giant viruses and the origin of modern eukaryotes. Curr. Opin. Microbiol. 2016, 31, 44–49.

7. Iyer, L. M.; Balaji, S.; Koonin, E. V.; Aravind, L. Evolutionary genomics of nucleo-cytoplasmic large DNA viruses. Virus Res. 2006, 117, 156–184.

8. Filée, J.; Pouget, N.; Chandler, M. Phylogenetic evidence for extensive lateral acquisition of cellular genes by Nucleocytoplasmic large DNA viruses. BMC Evol. Biol. 2008, 8, 320.

9. Moreira, D.; Brochier-Armanet, C. Giant viruses, giant chimeras: The multiple evolutionary histories of Mimivirus genes. BMC Evol. Biol. 2008, 8, 12.

10. Yutin, N.; Colson, P.; Raoult, D.; Koonin, E. V. Mimiviridae: clusters of orthologous genes, reconstruction of gene repertoire evolution and proposed expansion of the giant virus family. Virol. J. 2013, 10, 106.

11. Yutin, N.; Wolf, Y. I.; Koonin, E. V. Origin of giant viruses from smaller DNA viruses not from a fourth domain of cellular life. Virology 2014, 466–467, 38–52.

12. Suhre, K. Gene and Genome Duplication in Acanthamoeba polyphaga Mimivirus. J. Virol. 2005, 79, 14095–14101.

13. Filée, J.; Chandler, M. Convergent mechanisms of genome evolution of large and giant DNA viruses. Res. Microbiol. 2008, 159, 325–331.

14. Desnues, C.; La Scola, B.; Yutin, N.; Fournous, G.; Robert, C.; Azza, S.; Jardot, P.; Monteil, S.; Campocasso, A.; Koonin, E. V.; Raoult, D. Provirophages and transpovirons as the diverse mobilome of giant viruses. Proc. Natl. Acad. Sci. U. S. A. 2012, 109, 18078–18083.

15. Scola, B. L.; Audic, S.; Robert, C.; Jungang, L.; Lamballerie, X. de; Drancourt, M.; Birtles, R.; Claverie, J.-M.; Raoult, D. A Giant Virus in Amoebae. Science 2003, 299, 2033.

16. Aherfi, S.; Colson, P.; La Scola, B.; Raoult, D. Giant Viruses of Amoebas: An Update. Front. Microbiol. 2016, 7.

17. Philippe, N.; Legendre, M.; Doutre, G.; Couté, Y.; Poirot, O.; Lescot, M.; Arslan, D.; Seltzer, V.; Bertaux, L.; Bruley, C.; Garin, J.; Claverie, J.-M.; Abergel, C. Pandoraviruses: Amoeba Viruses with Genomes Up to 2.5 Mb Reaching That of Parasitic Eukaryotes. Science 2013, 341, 281–286.

18. Legendre, M.; Bartoli, J.; Shmakova, L.; Jeudy, S.; Labadie, K.; Adrait, A.; Lescot, M.; Poirot, O.; Bertaux, L.; Bruley, C.; Couté, Y.; Rivkina, E.; Abergel, C.; Claverie, J.-M. Thirty-thousand-year-old distant relative of giant icosahedral DNA viruses with a pandoravirus morphology. Proc. Natl. Acad. Sci. 2014, 111, 4274–4279.

19. Reteno, D. G.; Benamar, S.; Khalil, J. B.; Andreani, J.; Armstrong, N.; Klose, T.; Rossmann, M.; Colson, P.; Raoult, D.; Scola, B. L. Faustovirus, an Asfarvirus-Related New Lineage of Giant Viruses Infecting Amoebae. J. Virol. 2015, 89, 6585–6594.

20. Legendre, M.; Lartigue, A.; Bertaux, L.; Jeudy, S.; Bartoli, J.; Lescot, M.; Alempic, J.-M.; Ramus, C.; Bruley, C.; Labadie, K.; Shmakova, L.; Rivkina, E.; Couté, Y.; Abergel, C.; Claverie, J.-M. In-depth study of Mollivirus sibericum, a new 30,000-y-old giant virus infecting Acanthamoeba. Proc. Natl. Acad. Sci. 2015, 112, E5327–E5335.

21. Sharma, V.; Colson, P.; Pontarotti, P.; Raoult, D. Mimivirus inaugurated in the 21st century the beginning of a reclassification of viruses. Curr. Opin. Microbiol. 2016, 31, 16–24.

22. Koonin, E. V.; Yutin, N. Origin and Evolution of Eukaryotic Large Nucleo-Cytoplasmic DNA Viruses. Intervirology 2010, 53, 284–292.

23. Yutin, N.; Koonin, E. V. Hidden evolutionary complexity of Nucleo-Cytoplasmic Large DNA viruses of eukaryotes. Virol. J. 2012, 9, 161.

24. Yutin, N.; Wolf, Y. I.; Raoult, D.; Koonin, E. V. Eukaryotic large nucleo-cytoplasmic DNA viruses: Clusters of orthologous genes and reconstruction of viral genome evolution. Virol. J. 2009, 6, 223.

25. Colson, P.; Lamballerie, X. D.; Yutin, N.; Asgari, S.; Bigot, Y.; Bideshi, D. K.; Cheng, X.-W.; Federici, B. A.; Etten, J. L. V.; Koonin, E. V.; Scola, B. L.; Raoult, D. “Megavirales”, a proposed new order for eukaryotic nucleocytoplasmic large DNA viruses. Arch. Virol. 2013, 158, 2517–2521.

26. La Scola, B.; Desnues, C.; Pagnier, I.; Robert, C.; Barrassi, L.; Fournous, G.; Merchat, M.; Suzan-Monti, M.; Forterre, P.; Koonin, E.; Raoult, D. The virophage as a unique parasite of the giant mimivirus. Nature 2008, 455, 100–104.

27. Levasseur, A.; Bekliz, M.; Chabrière, E.; Pontarotti, P.; La Scola, B.; Raoult, D. MIMIVIRE is a defence system in mimivirus that confers resistance to virophage. Nature 2016, 531, 249–252.

28. Popgeorgiev, N.; Boyer, M.; Fancello, L.; Monteil, S.; Robert, C.; Rivet, R.; Nappez, C.; Azza, S.; Chiaroni, J.; Raoult, D.; Desnues, C. Marseillevirus-Like Virus Recovered From Blood Donated by Asymptomatic Humans. J. Infect. Dis. 2013, 208, 1042–1050.

29. Yolken, R. H.; Jones-Brando, L.; Dunigan, D. D.; Kannan, G.; Dickerson, F.; Severance, E.; Sabunciyan, S.; Talbot, C. C.; Prandovszky, E.; Gurnon, J. R.; Agarkova, I. V.; Leister, F.; Gressitt, K. L.; Chen, O.; Deuber, B.; Ma, F.; Pletnikov, M. V.; Van Etten, J. L. Chlorovirus ATCV-1 is part of the human oropharyngeal virome and is associated with changes in cognitive functions in humans and mice. Proc. Natl. Acad. Sci. U. S. A. 2014, 111, 16106–16111.

30. Jain, S.; Panda, A.; Colson, P.; Raoult, D.; Pontarotti, P. MimiLook: A Phylogenetic Workflow for Detection of Gene Acquisition in Major Orthologous Groups of Megavirales. Viruses 2017, 9, 72.

31. Li, L.; Stoeckert, C. J.; Roos, D. S. OrthoMCL: Identification of Ortholog Groups for Eukaryotic Genomes. Genome Res. 2003, 13, 2178–2189.

32. Altschul, S. F.; Madden, T. L.; Schäffer, A. A.; Zhang, J.; Zhang, Z.; Miller, W.; Lipman, D. J. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997, 25, 3389–3402.

33. Edgar, R. C. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. 2004, 32, 1792–1797.

34. Price, M. N.; Dehal, P. S.; Arkin, A. P. FastTree 2 – Approximately Maximum-Likelihood Trees for Large Alignments. PLOS ONE 2010, 5, e9490.

35. Criscuolo, A.; Berry, V.; Douzery, E. J. P.; Gascuel, O. SDM: A Fast Distance-Based Approach for (Super)Tree Building in Phylogenomics. Syst. Biol. 2006, 55, 740–755.

36. Lefort, V.; Desper, R.; Gascuel, O. FastME 2.0: A Comprehensive, Accurate, and Fast Distance-Based Phylogeny Inference Program. Mol. Biol. Evol. 2015, 32, 2798–2800.

37. Castresana, J. Selection of Conserved Blocks from Multiple Alignments for Their Use in Phylogenetic Analysis. Mol. Biol. Evol. 2000, 17, 540–552.

38. Whelan, S.; Goldman, N. A General Empirical Model of Protein Evolution Derived from Multiple Protein Families Using a Maximum-Likelihood Approach. Mol. Biol. Evol. 2001, 18, 691–699.

39. Boc, A.; Diallo, A. B.; Makarenkov, V. T-REX: a web server for inferring, validating and visualizing phylogenetic trees and networks. Nucleic Acids Res. 2012, 40, W573–W579.

40. Dall, D.; Luque, T.; O’Reilly, D. Insect–virus relationships: Sifting by informatics. BioEssays 2001, 23, 184–193.

41. Bratke, K. A.; McLysaght, A. Identification of multiple independent horizontal gene transfers into poxviruses using a comparative genomics approach. BMC Evol. Biol. 2008, 8, 67.

42. Hughes, A. L. Origin and Evolution of Viral Interleukin-10 and Other DNA Virus Genes with Vertebrate Homologues. J. Mol. Evol. 2002, 54, 90–101.

43. Hughes, A. L.; Friedman, R. Poxvirus genome evolution by gene gain and loss. Mol. Phylogenet. Evol. 2005, 35, 186–195.

44. Monier, A.; Claverie, J.-M.; Ogata, H. Horizontal gene transfer and nucleotide compositional anomaly in large DNA viruses. BMC Genomics 2007, 8, 456.

Supplementary Materials (Included in thesis)

Table S1. 86 Megavirale ORFomes used in this study (they can be downloaded from ftp://ftp.ncbi.nih.gov/genomes/Viruses/).

Table S2: 414 OGs detected as HGT event (in clustered proteins) with information on their putative donor and recipient lineages

Table S3: 53 OGs detected as sympatric transfer event (in clustered proteins)

Table S4: 259 proteins detected as HGT event (in un-clustered proteins) with information on their putative donor and recipient lineages

Supplementary Materials (Not included due to size constraint)

Externally hosted supplementary file 1 Doi: 10.6084/m9.figshare.4645273 https://figshare.com/s/c167f72d6e4613dfd3e6 Description: Phylogenetic trees of clustered proteins (OGs) available on this link

Externally hosted supplementary file 2 Doi: 10.6084/m9.figshare.4970579 https://figshare.com/s/748d458da42614bc1546 Description: Phylogenetic trees of un-clustered proteins available on this link

Table S1. 86 Megavirale ORFomes used in this study (they can be downloaded from ftp://ftp.ncbi.nih.gov/genomes/Viruses/).


Table S2: 414 OGs detected as HGT event (in clustered proteins) with information on their putative donor and recipient lineages

OG_ID Family_present DONOR RECIPIENT Donor group OG_01721 Bac Oscillatoriales OG_02512 pandora Bac Proteobacteria OG_03200 pandora Bac Proteobacteria OG_03213 pandora Bac Paenibacillus OG_03214 pandora Bac Bacillales OG_03231 pandora Bac Bacillales OG_03234 pandora Bac Legionellaceae OG_03405 pandora Bac Bacillales OG_03525 pandora Bac Protoebacteria OG_03546 pandora Bac Bacillales OG_03584 pandora Bac Micrococcaceae OG_03614 pandora Bac Xanthomonas OG_03634 pandora Bac Archangiaceae OG_03659 pandora Bac Thermus OG_03705 pandora Bac Actinobacteria OG_03761 pandora Bac Streptomyces OG_03772 pandora Bac Bacillales OG_03774 pandora Bac Bacillales OG_03778 pandora Bac Protoebacteria OG_01117 pandora Euk Phytopthora OG_01723 pandora Euk Mameillales OG_01757 pandora Euk Mameillales OG_01760 pandora Euk Phytopthora OG_02513 pandora Euk Phytopthora OG_02524 pandora Euk Chlorophyta OG_02567 pandora Euk Acanthaomoeba castellani OG_03189 pandora Euk Mammiellales OG_03194 pandora Euk Streptophyta OG_03248 pandora Euk Bodo saltans OG_03473 pandora Euk Mamiellales OG_03521 pandora Euk Mamiellales OG_03550 pandora Euk Mamiellales OG_03625 pandora Euk Acanthaomoeba castellani OG_03641 pandora Euk Mamiellales OG_03646 pandora Euk Mamiellales OG_03662 pandora Euk Saprolegnia OG_03668 pandora Euk Mamiellales


OG_03687 pandora Euk Mamiellales OG_03711 pandora Euk Saprolegnia OG_03718 pandora Euk Mamiellales OG_03719 pandora Euk Mamiellales OG_03781 pandora Euk Mamiellales OG_01729 pandora MV Euk Mamiellales OG_00369 mimi Bac Sorangium cellulosum OG_00598 mimi Bac Pelagibacteraceae OG_00619 mimi Bac Leptotrichiaceae OG_00790 mimi Bac Streptococcus OG_01087 mimi Bac Streptococcus OG_01399 mimi Bac Proteo OG_01464 mimi Bac Proteo OG_01505 mimi Bac Clostridiales OG_01508 mimi Bac Listeria OG_01520 mimi Bac Micrococcales OG_01524 mimi Bac Chitinophagaceae OG_01627 mimi Bac Terra OG_01642 mimi Bac Desulfobacteracae OG_01677 mimi Bac Methylomonas OG_01685 mimi Bac Parcubacteria OG_01690 mimi Bac Pedobacter OG_02283 mimi Bac Firmicutes OG_02301 mimi Bac Gammaproteobacteria OG_02336 mimi Bac Burkholderiaceae OG_02388 mimi Bac Burkholderiaceae OG_02409 mimi Bac Bacteroidales OG_02460 mimi Bac Bacillales OG_02472 mimi Bac Hymenobacter OG_02552 mimi Bac Pelagibacteraceae OG_02802 mimi Bac Brachyspira OG_04200 mimi Bac Gammaproteobacteria OG_04221 mimi Bac Deuterostomia OG_04236 mimi Bac Bacillales OG_04471 mimi Bac Streptococcus OG_04487 mimi Bac Bacillales OG_04488 mimi Bac Clostridiales OG_04568 mimi Bac Brachyspira OG_04607 mimi Bac Proteobacteria OG_00161 mimi Euk Dictyosteliida OG_00942 mimi Euk Apicomplexa OG_01134 mimi Euk Rozella allomycis OG_01589 mimi Euk Rhodophyta OG_01613 mimi Euk Schistostoma (flat worms) OG_01615 mimi Euk Endopterygota (insects)


OG_01620 mimi Euk Dictyostellida OG_01635 mimi Euk Dictyostellida OG_01647 mimi Euk Rozella allomycis OG_01669 mimi Euk Leotimyceta OG_01696 mimi Euk coccomyxa (green algae) OG_02307 mimi Euk Micromonas OG_02308 mimi Euk Leotimyceta OG_02334 mimi Euk Euteleostomi OG_02341 mimi Euk Leotimyceta OG_02364 mimi Euk Lopotrochozoa (Bilateria) OG_02431 mimi Euk Leotimyceta OG_02453 mimi Euk Rozella OG_03661 mimi Euk Rozella OG_04055 mimi Euk Ecdyosozoa OG_04216 mimi Euk Oligohymenophorea OG_04235 mimi Euk Endopterygota (insects) OG_04256 mimi Euk Dictyostellida OG_04275 mimi Euk Rozella allomycis OG_04400 mimi Euk Rozella allomycis OG_04414 mimi Euk Rozella allomycis OG_04422 mimi Euk Dictyostellida OG_04426 mimi Euk Dikarya OG_04440 mimi Euk Mucor OG_04448 mimi Euk Bilateria OG_04500 mimi Euk Rozella allomycis OG_04501 mimi Euk Rozella allomycis OG_04512 mimi Euk Dictyostellida OG_04514 mimi Euk Dictyostellida OG_04526 mimi Euk Rozella allomycis OG_04533 mimi Euk Rozella allomycis OG_04561 mimi Euk Dictyostellida OG_04581 mimi Euk Bilateria OG_04608 mimi Euk Dictyosteliida OG_04618 mimi Euk Rozella allomycis OG_04619 mimi Euk Boroeutheria OG_01142 mimi Euk Dictyostellida OG_01577 mimi MV Virophage Sputnik OG_02351 mimi MV Euk Acanthaamoeba OG_04193 mimi MV Bac OG_04239 mimi MV Euk Dictyostellida OG_04393 mimi MV Euk,Bac Brachyspira and Trichomonas vaginalis OG_04406 mimi MV Euk,Bac Brachyspira and Trichomonas vaginalis OG_04458 mimi MV Euk Dictyostellida OG_04506 mimi MV Bac Micavibrio OG_04558 mimi MV Bac Brachyspira


OG_04580 mimi MV Euk Trichomonas vaginalis OG_04591 mimi Mv Euk OG_04599 mimi MV Euk Dictyosteliida OG_04627 mimi MV Bac Wolbachia OG_00813 Pitho Euk Alveolata OG_00476 marseille MV Bac OG_02035 marseille Bac Thermovibrio ammonificans OG_02041 marseille Bac Cystobacter OG_02068 marseille Bac Proteobacteria OG_03900 marseille Bac Bacteriodales OG_01299 marseille Euk Saprolegniaceae OG_02070 marseille Euk Dikarya OG_00225 phycodna Bac Proteobacteria OG_00262 phycodna Bac Bacillales OG_00391 phycodna Bac Spirochaetaceae OG_00427 phycodna Bac Bacteriodetes OG_00453 phycodna Bac Verrucomicrobia OG_00775 phycodna Bac Proteobacteria OG_00783 phycodna Bac Proteobacteria OG_00848 phycodna Bac Proteobacteria OG_00853 phycodna Bac Bacillales OG_00883 phycodna Bac Bacteria OG_00887 phycodna Bac Gammaproteobacteria OG_00981 phycodna Bac Bacteria OG_01049 phycodna Bac Proteobacteria OG_01243 phycodna Bac Proteobacteria OG_01266 phycodna Bac Proteobacteria OG_01280 phycodna Bac Bacillales OG_01287 phycodna Bac Firmicutes OG_01870 phycodna Bac Parcubacteria OG_01920 phycodna Bac Actinobacteria OG_02151 phycodna Bac Firmicutes OG_02477 phycodna Bac Proteobacteria OG_02478 phycodna Bac Proteobacteria OG_02479 phycodna Bac Proteobacteria OG_02915 phycodna Bac Cyanobacteria OG_02923 phycodna Bac Helicobacter OG_03100 phycodna Bac Proteobacteria OG_04066 phycodna Bac Clostridiales OG_04068 phycodna Bac Proteobacteria OG_04189 phycodna Bac Cyanobacteria OG_00441 phycodna Euk klebsormidium OG_00815 phycodna Euk Endopterygota OG_00994 phycodna Euk Klebsormidium OG_01016 phycodna Euk Mameillales


OG_01058 phycodna Euk Mameillales OG_01063 phycodna Euk Mameillales OG_01863 phycodna Euk Mameillales OG_01954 phycodna Euk klebsormidium OG_02258 phycodna Euk Mammiellaes OG_03041 phycodna Euk Dictyosteliida OG_03048 phycodna Euk Chlorella variabilis OG_03049 phycodna Euk Chlorella variabilis OG_03840 phycodna Euk Mamiellales OG_04188 phycodna Euk Dictyosteliida OG_04266 phycodna Euk Guillardia theta CCMP2712 OG_04281 phycodna Euk Mamiellales OG_04284 phycodna Euk Mameillales OG_02256 phycodna Euk Saccharomycetales OG_00212 phycodna MV Euk Emilinia huxley OG_00243 phycodna MV Euk Klebsormidium OG_00437 phycodna MV Euk OG_00530 phycodna MV Euk OG_00576 phycodna MV Euk OG_00613 phycodna MV Euk OG_00902 phycodna MV Euk OG_00909 phycodna MV Euk OG_00966 phycodna MV Euk OG_01125 phycodna MV Euk OG_01353 phycodna MV Euk OG_04090 phycodna MV Clostridiales (bacteria) OG_04146 phycodna MV Euk OG_04147 phycodna MV Euk OG_04149 phycodna MV Euk OG_04151 phycodna MV Euk OG_04152 phycodna MV Euk OG_04153 phycodna MV Euk OG_04154 phycodna MV Euk OG_04155 phycodna MV Euk OG_04156 phycodna MV Euk OG_04157 phycodna MV Euk OG_04159 phycodna MV Euk OG_04160 phycodna MV Euk OG_04161 phycodna MV Euk OG_04162 phycodna MV Euk OG_04163 phycodna MV Euk OG_04164 phycodna MV Euk OG_04166 phycodna MV Euk OG_04167 phycodna MV Euk OG_04168 phycodna MV Euk


OG_04170 phycodna MV Euk OG_04172 phycodna MV Euk OG_04173 phycodna MV Euk OG_04174 phycodna MV Euk OG_04175 phycodna MV Euk OG_04176 phycodna MV Euk OG_04177 phycodna MV Euk OG_04178 phycodna MV Euk OG_04179 phycodna MV Euk OG_04180 phycodna MV Euk OG_04181 phycodna MV Euk OG_04182 phycodna MV Euk OG_04183 phycodna MV Euk OG_04185 phycodna MV Euk OG_04190 phycodna MV Euk OG_00655 phycodna Phage Caudovirales OG_02248 phycodna Phage Myoviridae OG_00482 Asco Virus Alphabaculovirus OG_04633 Asco Virus Alphabaculovirus OG_01043 irido Euk salmoninae (Euteleostomi) OG_01172 irido Euk Endopterygota OG_01397 irido Euk Cyprinidae (euteleostomi) OG_02274 irido Euk Pancrustacae OG_02539 irido Euk Dictyostellida OG_02781 irido Euk Endopterygota OG_04014 irido Euk Protostomia OG_04376 irido Euk Protostomia OG_00189 irido MV Bac Flavobacterium OG_00581 irido MV Bac Flavobacterium OG_00582 irido MV Bac Flavobacterium OG_00630 irido MV Bac Flavobacterium OG_00766 irido MV Bac Flavobacterium OG_00768 irido MV Bac Flavobacterium OG_00769 irido MV Bac Flavobacterium OG_01382 irido MV Bac Flavobacterium OG_01383 irido MV Bac Flavobacterium OG_01385 irido MV Bac Flavobacterium OG_01386 irido MV Bac Flavobacterium OG_01389 irido MV Bac Flavobacterium OG_01390 irido MV Bac Flavobacterium OG_01391 irido MV Bac Flavobacterium OG_01392 irido MV Bac Flavobacterium OG_01393 irido MV Bac Flavobacterium OG_01395 irido MV Bac Flavobacterium OG_01396 irido MV Bac Flavobacterium


OG_02127 irido MV Bac Flavobacterium OG_02227 irido MV Bac Flavobacterium OG_02228 irido MV Bac Flavobacterium OG_02229 irido MV Bac Flavobacterium OG_02230 irido MV Bac Flavobacterium OG_02232 irido MV Bac Flavobacterium OG_04121 irido MV Bac Flavobacterium OG_04007 irido MV Bac Flavobacterium OG_04023 irido MV Euk Lasius niger OG_04048 irido MV Euk Lasius niger OG_04057 irido MV Bac Flavobacterium OG_02276 irido MV Bac Flavobacterium OG_02578 Pox Bac Firmicutes OG_00289 Pox Euk Anura (frogs and toads) OG_00364 Pox Euk Euteleostomi OG_00400 Pox Euk Bilateria OG_01015 Pox Euk Alveolata OG_01188 Pox Euk Schizopora (flies) OG_01201 Pox Euk Euteleostomi OG_01202 Pox Euk Euteleostomi OG_01209 Pox Euk Protostomia OG_01226 Pox Euk Euteleostomi OG_01229 Pox Euk Eutheria (placentals) OG_01709 Pox Euk Eutheria (placentals) OG_01783 Pox Euk Myrmicinae (ants) OG_01823 Pox Euk Eutheria (placentals) OG_01824 Pox Euk Euteleostomi OG_01836 Pox Euk Trichomonas OG_02215 Pox Euk Protostomia OG_02783 Pox Euk Chalcidoidea (wasps) OG_02787 Pox Euk Bilateria OG_03154 Pox Euk Eutheria (placentals) OG_03841 Pox Euk Eutheria (placentals) OG_03845 Pox Euk Theria OG_04000 Pox Euk Eutheria (placentals) OG_04006 Pox Euk Ecdysozoa OG_04634 Pox Euk Euteleostomi OG_04655 Pox Euk Neoptera OG_00798 Pox MV Bac OG_01007 Pox Virus Baculoviridae OG_01774 Pox Virus Baculoviridae OG_02557 Pox Virus Circoviridae OG_04653 Pox Virus Baculoviridae OG_04133 asfar fausto Euk Pyrenomonadales (Guillardia theta) OG_00751 mimi pandora Bac Proteobacteria


OG_01032 mimi pandora Bac Pseudomonadacae OG_01067 mimi pandora Bac Proteobacteria OG_04277 mimi pandora Euk Oomycetes OG_02514 pandora pitho Euk Mameillales OG_02520 pandora pitho Euk Mameillales OG_02529 pandora pitho Euk Mameillales OG_02533 pandora pitho Euk Mameillales OG_03159 pandora pitho Euk Mammiellales OG_03271 pandora pitho Euk Mameillales OG_03590 pandora pitho Euk Mamiellales OG_03591 pandora pitho Euk Mamiellales OG_00750 mimi pandora Euk (mimi); Archea Trichomonas (mimi) and Methanosarcinales (Pandora) (pandora) OG_01363 mimi pandora Euk (multiple donors) Phytophtora (Mimi); Mamiellales (rest) OG_02170 mimi pandora Euk (multiple donors) Crytosporidium (Mimi) and Amoeba (Pandoravirus) OG_02172 mimi pandora Euk (multiple donors) Mameillales OG_01147 pandora pithovirus Euk (multiple donors) Acanthamoeaba (Pandora); leotimyceta (Pitho) OG_03600 pandora pitho Euk (multiple donors) Mamiellales (pandora); klebsormidium (pitho) OG_00996 marseille pandora Euk (Pandora); Pyremonadales Marseillevirus outside OG_01033 mimi pandora Euk (pandora); mimi Oomycetes outside OG_01030 mimi pandora MV Euk Dictyostelium OG_02174 mimi pandora MV Euk, Bac Acanthaamoeba OG_00565 mimi pandora pithovirus MV Euk Acanthaamoeba OG_00746 mimi pandora pithovirus MV Euk Acanthaamoeba OG_04273 pandora pithovirus MV Euk OG_00615 mimi pithovirus MV Acanthamoe ba OG_00785 mimi pithovirus Bac Actinobacteria OG_01294 marseille mimi Bac (Mimivirus); Parcubacteria Marseille outside OG_04369 mimi pithovirus Bac Clostridiales OG_00163 mimi phycodna Bac Bacteriodetes/Chlorobi group OG_00556 phycodna pithovirus Bac Multiple donors (proteobacteria) OG_00663 marseille mimi phycodna Bac Borelia OG_00789 mimi phycodna pithovirus Bac Cyanobacteria OG_01262 mimi phycodna Bac Proteobacteria OG_01341 mimi phycodna Bac Cytophagales OG_00181 marseille mimi phycodna Bac (Phycodna and Gammaproteobacteria and leotimyceta pithovirus Marseilleviridae) and Euk (Mimiviridae) OG_00304 pandora phycodna Euk 
Piroplasmida (Alveolata) OG_00734 mimi phycodna Euk Dictyostelium OG_01024 mimi phycodna Euk Oligohymenophorea (Alveolata)


OG_01741 mimi phycodna Euk Schistostoma (Mimi), Ectocarpus (phycodnaviridae) OG_01899 pandora phycodna Euk Acanthamaoeba OG_02132 pandora phycodna Euk Phytopthora OG_02138 mimi phycodna Euk Trypanosomatidae (Mimiviridae); Phycodna outside OG_02481 mimi phycodna Euk Dictyostelium (EUK) OG_02924 mimi phycodna Euk Piroplasmida (Alveolata) OG_02925 mimi phycodna Euk Embryophyta (streptophyta) OG_00290 marseille mimi phycodna Euk (Convergent Trypanosomatidae (Mimiviridae); Evolution) Marseilleviridae outside OG_00566 mimi phycodna Euk (Convergent Dictyostelium (Phycodnaviridae) OR Evolution) pentapetalae; mimiviridae outside OG_00287 mimi phycodna Euk (multiple donors) Trypanosomatidae (Mimiviridae and Ostreococcus) and chlorella variablis (clorella virus) OG_00736 mimi phycodna Euk (multiple donors) Dictyostelium (phycodnaviridae); Trypanosomatidae (Mimiviridae) OG_01105 mimi phycodna Euk (multiple donors) Mameillales (phycodna) and Oomycetes (Mimi) OG_00009 mimi phycodna MV Bac Bacteriodetes/Chlorobi group OG_00154 mimi phycodna MV Euk klebsorbidium OG_00156 mimi phycodna MV Euk klebsorbidium OG_00180 pandora phycodna MV Euk OG_00201 mimi phycodna MV Euk Fungi - and spermatophyta OG_00227 mimi pandora phycodna MV Euk klebsorbidium OG_00246 marseille mimi pandora MV Bac Gammaproteobacteria phycodna pithovirus OG_00270 mimi pandora phycodna MV Euk Fonticula alba OG_00411 mimi phycodna pithovirus MV Bac Pithovirus OG_00533 mimi phycodna MV Euk klebsorbidium OG_01371 mimi phycodna MV Euk OG_01697 mimi phycodna MV Phage Myoviridae OG_03598 pandora phycodna Mv Euk Klesbsorbidium OG_04566 mimi phycodna Mv Archea Euryarchaeota OG_01090 Asco mimi Bac Bacillus OG_00333 asfar pox Euk Boreoeutheria OG_00017 irido marseille mimi pandora Euk (multiple donors) Bilateria phycodna pox OG_00106 mimi phycodna pox Euk (multiple donors) OG_00332 mimi phycodna pox Euk (multiple donors) OG_00343 fausto pox Euk (multiple donors) Neognathe (pox) and Oomycetes (fausto) OG_00138 phycodna pox 
Euk (pox outside) Trypanosomatidae OG_00190 Asco irido pandora Eukaryote (muliple phycodna pox donors) OG_00221 irido marseille phycodna Eukaryote (muliple Dikarya (marseille) and Trypanosoma (irido, donors) phycodna)


OG_00191 irido phycodna MV Klebsormidi um OG_00237 fausto marseille mimi pox MV Trichomonas Vaginalis G3 OG_00301 mimi pandora pox MV Trichomonas Vaginalis G3 OG_01367 irido mimi pox MV Trichomonas Vaginalis G3 OG_00003 Asco asfar fausto irido MV Mucoraceae marseille mimi pandora phycodna pithovirus pox OG_00004 asfar fausto irido marseille MV Euk (Dictyostelium and Entamaoeba) mimi pandora phycodna pithovirus pox OG_00012 Asco irido marseille mimi MV Euk Klebsormidium and Acanthamoeba phycodna OG_00013 Asco irido marseille mimi MV Euk Klebsormidium and Ectocarpus pandora phycodna OG_00112 Asco irido mimi pox MV Bac Clostridioides difficile OG_00120 irido mimi phycodna MV Euk Ectocarpus OG_00179 irido mimi phycodna pox MV Bac Mycoplasma OG_00412 asfar fausto irido mimi MV Euk, Arc phycodna OG_00524 mimi pandora phycodna pox MV Euk Bilateria OG_00324 asco irido Bac Enterobacterales OG_00108 irido pox Mv Euk Euteleostomi OG_00174 Asco irido pox MV Virus Baculoviridae OG_00621 irido pox Euk Ovalentria OG_00808 irido pox Euk Theria OG_01000 irido pox Bac Legionallecae OG_01735 Asco irido pox Virus Baculovirida e OG_02494 irido pox Euk Neoptera OG_02495 irido pox Euk Neoptera OG_02496 irido pox Bac Selenomona s OG_04643 Asco pox Euk Neoptera OG_00170 asfar fausto marseille MV Phage Caudovirales phycodna pithovirus OG_00185 fausto phycodna MV Euk Klebsorbidium and Ectocarpus OG_00195 asfar fausto marseille Bac Multiple donors from Gammaproteobacteria pandora OG_00302 asfar mimi phycodna Euk Leotiomyceta OG_00479 fausto pandora Euk Multiple donors from Mamiellales OG_00483 fausto mimi pandora Euk Leotiomyceta phycodna OG_00770 fausto mimi phycodna Euk Oomycetes (containing pytopthora) OG_01045 fausto mimi Euk Klebsorbidium OG_01144 asfar mimi Euk Trypanosomatidae OG_01306 fausto marseille Bac Multiple donors from Gammaproteobacteria


Table S3: 53 OGs detected as sympatric transfer event (in clustered proteins)

OG_ID Family_present
OG_02517 pandora
OG_03505 pandora
OG_03595 pandora
OG_03613 pandora
OG_03716 pandora
OG_03795 pandora
OG_00744 mimi
OG_01078 mimi
OG_01080 mimi
OG_01292 mimi
OG_01526 mimi
OG_01543 mimi
OG_01609 mimi
OG_01617 mimi
OG_02175 mimi
OG_02400 mimi
OG_02421 mimi
OG_02581 mimi
OG_03660 mimi
OG_04253 mimi
OG_04401 mimi
OG_04435 mimi
OG_04465 mimi
OG_04480 mimi
OG_04498 mimi
OG_04593 mimi
OG_04604 mimi
OG_00214 phycodna
OG_00288 phycodna
OG_00569 phycodna
OG_01370 phycodna
OG_01804 phycodna
OG_02946 phycodna
OG_00850 phycodna
OG_01235 Pox
OG_01237 Pox
OG_01239 Pox
OG_01837 Pox
OG_02800 Pox
OG_01706 mimi pandora pithovirus
OG_00537 marseille mimi
OG_00671 marseille mimi pithovirus
OG_00216 phycodna pithovirus
OG_00279 mimi phycodna
OG_00846 mimi phycodna
OG_01740 mimi phycodna
OG_02914 mimi phycodna
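Each row of Table S3 pairs an OG identifier with the one or more families in which that OG occurs. As an illustration only, a per-family tally of such rows could be sketched as follows. The helper names (parse_rows, family_counts) and the SAMPLE string are hypothetical and are not part of MimiLook; the sample contains six rows copied from the table, not the full data set.

```python
import re
from collections import Counter

# Six rows copied from Table S3: an OG identifier followed by family names.
# This is an illustrative subset, not the complete table.
SAMPLE = """
OG_02517 pandora
OG_00744 mimi
OG_01078 mimi
OG_01235 Pox
OG_01706 mimi pandora pithovirus
OG_00537 marseille mimi
"""

def parse_rows(text):
    """Return a list of (og_id, [families]) pairs.

    Tokens are read in order; each OG_<digits> token starts a new row and
    every following token is taken as a family name. This works whether
    the table is laid out one row per line or flattened into one line.
    """
    rows = []
    for token in text.split():
        if re.fullmatch(r"OG_\d+", token):
            rows.append((token, []))
        elif rows:
            # Lower-case so that "Pox" and "pox" are counted together.
            rows[-1][1].append(token.lower())
    return rows

def family_counts(rows):
    """Count how many OGs list each family."""
    counts = Counter()
    for _, families in rows:
        counts.update(families)
    return counts

rows = parse_rows(SAMPLE)
counts = family_counts(rows)
for family, n in sorted(counts.items()):
    print(family, n)
```

An OG shared by several families is counted once for each family, so the per-family counts can sum to more than the number of rows.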


Table S4: 259 proteins detected as HGT event (in un-clustered proteins) with information on their putative donor and recipient lineages

ProteinID Tree_ID Family Donor Recipient Donor/Receiver group
YP_762435.1 24 Ascoviridae Virus Baculoviridae
YP_803250.1 32 Ascoviridae Virus Baculoviridae
YP_803253.1 34 Ascoviridae Euk
YP_803279.1 35 Ascoviridae Euk Bilateria
YP_803355.1 39 Ascoviridae Euk Bilateria
YP_001110896.1 124 Ascoviridae Virus Baculoviridae
YP_001110941.1 132 Ascoviridae Virus Baculoviridae
YP_001110943.1 133 Ascoviridae Virus Baculoviridae
YP_001110977.1 138 Ascoviridae Virus Baculoviridae
NP_042728.1 2538 Asfarviridae bacteria alphaproteobacteria
NP_042768.1 2555 Asfarviridae bacteria chlamydiales
YP_009046650.1 2096 Iridoviridae MV Euk
YP_006347673.1 1209 Iridoviridae MV Bacteria Flavobacterium
YP_073552.1 1671 Iridoviridae MV Euk Flavobacterium
YP_164176.1 2030 Iridoviridae MV Bacteria Flavobacterium
NP_149492.1 157 Iridoviridae MV Euk Lasius niger
NP_149674.1 162 Iridoviridae MV Euk Lasius niger
NP_149675.1 163 Iridoviridae MV Euk Lasius niger
YP_009046761.1 2105 Iridoviridae MV Euk Lasius niger
YP_009046784.1 2107 Iridoviridae MV Euk Lasius niger
NP_612247.1 635 Iridoviridae Bacteria Bacillales
YP_654647.1 2 Iridoviridae Euk Entamaoeba
NP_612291.1 657 Iridoviridae Euk Bilateria
NP_612325.1 677 Iridoviridae Euk Bilateria
YP_073568.1 1672 Iridoviridae Euk eutheria
YP_073669.1 1677 Iridoviridae euk euteleostomii
YP_073684.1 1679 Iridoviridae euk euteleostomii
YP_073693.1 1680 Iridoviridae euk euteleostomii
YP_164204.1 2034 Iridoviridae euk eutheria
YP_009046750.1 2103 Iridoviridae Euk Euteleostomi
YP_004347282.1 1108 Marseilleviridae Bacteria microgenametes
YP_009094821.1 2160 Marseilleviridae bacteria pseudomonadacae
YP_007354709.1 1344 Mimiviridae MV Euk Acanthamoeba
YP_004895093.1 1169 Mimiviridae MV Euk Dictyostelium
YP_007418986.1 1470 Mimiviridae MV Euk Dictyostelium
YP_003969633.1 740 Mimiviridae MV Euk Dictyostellium
YP_007418983.1 1468 Mimiviridae MV Euk Gullardia theta
YP_003969808.1 757 Mimiviridae MV Euk Klebsormidium


YP_003969954.1 778 Mimiviridae MV Euk Klebsormidium
YP_003969646.1 741 Mimiviridae Bacteria Bacillales
YP_003969688.1 743 Mimiviridae Bacteria Lactobacillales
YP_003969703.1 745 Mimiviridae Bacteria Parcubacteria
YP_003969751.1 750 Mimiviridae Bacteria Brachyspira
YP_003969757.1 753 Mimiviridae Bacteria Clostridiales
YP_003969776.1 754 Mimiviridae Bacteria Sphingomonadacae
YP_003969780.1 755 Mimiviridae Bacteria Alteromonadacae
YP_003969878.1 764 Mimiviridae Bacteria Campylobactareles
YP_003969879.1 765 Mimiviridae Bacteria Bacillales
YP_003969892.1 769 Mimiviridae Bacteria Alteromonadales
YP_003969897.1 771 Mimiviridae Bacteria chlamydiales
YP_003969898.1 772 Mimiviridae Bacteria gammaproteobacteria
YP_003969901.1 773 Mimiviridae Bacteria alphaproteobaceria
YP_003970170.1 796 Mimiviridae Bacteria Lactobacillales
YP_003986499.1 801 Mimiviridae Bacteria Brachyspira
YP_003986577.1 833 Mimiviridae Bacteria Bacillales
YP_003986624.1 845 Mimiviridae Bacteria Bacillales
YP_003986628.1 847 Mimiviridae Bacteria pseudomonadacae
YP_003986735.1 871 Mimiviridae Bacteria Clostridiales
YP_003987189.1 930 Mimiviridae Bacteria acinetobacteracae
YP_003987229.1 940 Mimiviridae Bacteria streptomycetales
YP_003987286.1 950 Mimiviridae Bacteria acinetobacteracae
YP_007354717.1 1348 Mimiviridae Bacteria Flavobacteria
YP_007418397.1 1425 Mimiviridae Bacteria Bacillales
YP_003969695.1 744 Mimiviridae Euk Mammiellales
YP_003969744.1 748 Mimiviridae Euk Dictyostellidda
YP_003969752.1 751 Mimiviridae Euk Oomycetes
YP_003969804.1 756 Mimiviridae Euk Ascomycota
YP_003969861.1 761 Mimiviridae Euk Embryophyta
YP_003969882.1 766 Mimiviridae Euk Phytopthora
YP_003969945.1 777 Mimiviridae Euk Entamaoeba
YP_003970019.1 781 Mimiviridae Euk Ascomycota
YP_003970082.1 790 Mimiviridae Euk Oligohymenophorea
YP_003970175.1 798 Mimiviridae Euk Dictyostellidda
YP_003986788.1 880 Mimiviridae euk Dictyostellidda
YP_003987332.1 966 Mimiviridae euk dictyostellium
YP_007353996.1 1252 Mimiviridae Euk Alveolata
YP_007354102.1 1284 Mimiviridae Euk Neoptera
YP_007354492.1 1315 Mimiviridae Euk Rozella
YP_007354570.1 1320 Mimiviridae Euk Rozella
YP_007354571.1 1321 Mimiviridae Euk Rozella
YP_007354610.1 1322 Mimiviridae Euk Dictyostellida
YP_007354622.1 1323 Mimiviridae Euk Dictyostellidda
YP_007354637.1 1328 Mimiviridae Euk Dictyostellidda
YP_007354679.1 1334 Mimiviridae Euk Dictyostellidda


YP_001498046.1 531 Phycodnaviridae MV Euk Clostridiales
NP_077638.1 99 Phycodnaviridae MV Euk Ectocarpus siliculosus
NP_077702.1 112 Phycodnaviridae MV Euk Ectocarpus siliculosus
YP_002154708.1 605 Phycodnaviridae MV Euk Ectocarpus siliculosus
YP_002154741.1 613 Phycodnaviridae MV Euk Ectocarpus siliculosus
YP_001497994.1 530 Phycodnaviridae MV Euk Klebsormidium
YP_294216.1 2518 Phycodnaviridae MV Euk Trichomonas
NP_077542.1 61 Phycodnaviridae Bacteria Enterobactereacae
YP_001425635.1 166 Phycodnaviridae Bacteria Fusobacteriacae
YP_001426781.1 446 Phycodnaviridae Bacteria Bacillales
YP_001427233.1 484 Phycodnaviridae Bacteria Parcubacteria
YP_001427314.1 488 Phycodnaviridae Bacteria Bacillales
YP_001497537.1 513 Phycodnaviridae Bacteria Clostridiales
YP_001498154.1 536 Phycodnaviridae Bacteria Enterobactereacae
YP_001648113.1 559 Phycodnaviridae Bacteria Rickettsiales
YP_002154763.1 619 Phycodnaviridae Bacteria Commomonadacae
YP_007676318.1 1505 Phycodnaviridae Bacteria rhizobiacae
YP_008052335.1 1572 Phycodnaviridae Bacteria rhizobiacae
YP_008052336.1 1573 Phycodnaviridae Bacteria proteobacteria
YP_008052338.1 1574 Phycodnaviridae Bacteria Bacillales
YP_008052401.1 1592 Phycodnaviridae Bacteria proteobacteria
YP_008052403.1 1593 Phycodnaviridae Bacteria acinetobacteracae
YP_008052470.1 1606 Phycodnaviridae Bacteria micrococcales
YP_008052620.1 1639 Phycodnaviridae Bacteria proteobacteria
YP_008052661.1 1652 Phycodnaviridae Bacteria Clostridiales
YP_008052693.1 1657 Phycodnaviridae Bacteria Clostridiales
YP_008052717.1 1660 Phycodnaviridae Bacteria Clostridiales
YP_008052733.1 1664 Phycodnaviridae Bacteria Clostridiales
YP_008052748.1 1668 Phycodnaviridae Bacteria entoplasmatales
YP_009052178.1 2117 Phycodnaviridae bacteria chlamydiales
YP_009052179.1 2119 Phycodnaviridae bacteria gammaproteobacteria
YP_009052171.1 2134 Phycodnaviridae bacteria smithella
YP_009052369.1 2135 Phycodnaviridae bacteria campylobacteriales
YP_009052162.1 2136 Phycodnaviridae bacteria mycoplasma
YP_009052121.1 2142 Phycodnaviridae bacteria lactobacillales
YP_009052398.1 2143 Phycodnaviridae bacteria campylobacterales
YP_294125.1 2446 Phycodnaviridae bacteria clostridiales
YP_294190.1 2499 Phycodnaviridae bacteria candidatus
NP_077567.1 71 Phycodnaviridae Euk Ectocarpus siliculosus
NP_077568.1 72 Phycodnaviridae Euk Ectocarpus siliculosus
NP_077597.1 82 Phycodnaviridae Euk Ectocarpus siliculosus
NP_077599.1 84 Phycodnaviridae Euk Ectocarpus siliculosus
NP_077600.1 85 Phycodnaviridae Euk Klebsormidium
NP_077711.1 116 Phycodnaviridae Euk Ectocarpus siliculosus
YP_001425806.1 210 Phycodnaviridae Euk Bilateria
YP_001426008.1 282 Phycodnaviridae Euk Bilateria


YP_004061824.1 1042 Phycodnaviridae Euk Mammiellales
YP_004063477.1 1045 Phycodnaviridae Euk Mammiellales
YP_004063585.1 1047 Phycodnaviridae Euk Mammiellales
NP_048673.2 1142 Phycodnaviridae euk eutheria
NP_048691.2 1143 Phycodnaviridae euk Basidiomycota
YP_007676259.1 1501 Phycodnaviridae Euk Micromonas
YP_007676291.1 1503 Phycodnaviridae Euk Mammiellales
YP_008052353.1 1580 Phycodnaviridae Euk Ascomycota
YP_008052385.1 1588 Phycodnaviridae Euk Bilateria
YP_008052732.1 1663 Phycodnaviridae euk Bilateria
YP_009052252.1 2123 Phycodnaviridae euk ascomycota
YP_009052155.1 2125 Phycodnaviridae euk Aureococcus
YP_009052448.1 2130 Phycodnaviridae euk schistosoma
YP_009052187.1 2138 Phycodnaviridae euk aureococcus
YP_009052153.1 2140 Phycodnaviridae euk aureococcus
YP_293768.1 2178 Phycodnaviridae euk emilinia huxleyi
YP_293785.1 2187 Phycodnaviridae euk emilinia huxleyi
YP_293804.1 2197 Phycodnaviridae euk emilinia huxleyi
YP_293833.1 2218 Phycodnaviridae euk microsporadia
YP_293857.1 2239 Phycodnaviridae euk emilinia huxleyi
YP_293908.1 2274 Phycodnaviridae euk emilinia huxleyi
YP_293958.1 2317 Phycodnaviridae euk embryophyta
YP_294102.1 2430 Phycodnaviridae euk emilinia huxleyi
YP_294107.1 2434 Phycodnaviridae euk plasmodium
YP_294113.1 2438 Phycodnaviridae euk Aureococcus
YP_294116.1 2441 Phycodnaviridae euk entamoeba
YP_294157.1 2473 Phycodnaviridae euk trypanosoma
YP_294158.1 2474 Phycodnaviridae euk emilinia huxleyi
YP_294173.1 2484 Phycodnaviridae euk emilinia huxleyi
NP_049035.1 2648 Phycodnaviridae euk plasmodium
YP_001427319.1 490 Phycodnaviridae Phages Caudovirales
YP_008052687.1 1655 Phycodnaviridae Virus circoviridae
YP_003212951.1 701 Phycodnaviridae Euk emilinia huxleyi
YP_009000978.1 2061 Pithovirus bacteria oscillatariophycidae
YP_009001015.1 2062 Pithovirus bacteria brachyspira
YP_009001035.1 2063 Pithovirus bacteria bacillales
YP_009001172.1 2068 Pithovirus bacteria bacillales
YP_009001173.1 2069 Pithovirus bacteria parcubacteria
YP_009001291.1 2074 Pithovirus bacteria chlamydiales
YP_009001303.1 2075 Pithovirus euk mucor
YP_009001307.1 2076 Pithovirus bacteria bacillales
YP_009001328.1 2079 Pithovirus bacteria bacillales
YP_009001339.1 2080 Pithovirus euk mesangiospermae
YP_008003801.1 1534 Poxviridae MV Bacteria Flavobacterium
YP_008003616.1 1525 Poxviridae MV Euk Lasius niger
YP_008003656.1 1526 Poxviridae MV Euk Lasius niger


Conclusions & Future Perspective


Comparative genome analysis of distantly related Megavirales revealed no general trend in genome composition or gene acquisition. Instead, each Megavirale family most likely has a specific pattern of gene acquisition, probably reflecting different host-virus interactions. Our results show that the genomes of Megavirales members harbor many specific genes with no cellular homologs (approximately 80% of OGs are Megavirale-specific), as well as substantial numbers of genes inferred to have been transferred horizontally from cellular organisms (approximately nine percent of OGs). This suggests that these viruses are not bags of unused genes taken from various cellular organisms; instead, they might have acquired genes through a balanced process of genetic exchange shaped by their environmental interactions. The impact of gene acquisition was also found to be unrestricted, which can have a profound effect on Megavirale evolution.

Our analysis represents the first comprehensive pan-genomic search for HGT events in Megavirales with phylogenetic validation. Our systematic search for HGT events of non-megavirale origin provides the first estimate of the total contribution of HGT to the family-specific genome mosaicism of distantly related Megavirales. Previous reports have shown that genes acquired by HGT in closely related Megavirale families may play important roles in the evolution of Megaviromes and hence have been functionally significant (Iyer et al., 2006; Filee et al., 2007, 2008; Moreira and Brochier-Armanet, 2008; Yutin et al., 2013). Our systematic workflow, MimiLook, elucidates the evolutionary scenarios in Megavirale ORFomes by detecting instances of gene acquisition and gene specificity in different families, and it also provides information about the shared ancestry of genes with cellular domains.

Analyses of our phylogenetic tree topologies indicated that multiple different species and kingdoms were positioned in donor clades, suggesting that there is not a single donor or a small number of donors but a multitude of possible cellular species from which Megavirales have acquired genes. Despite this ensemble of donor species, Megavirales show family-specific gene acquisitions, which supports the view that cellular gene acquisition is a major evolutionary force in the genome mosaicism of each Megavirale family (Fig. 1 and Fig. 2). Donor specificity also suggests that these viruses have acquired genes through a balanced process of genetic exchange with their environment, and that they are not just gene robbers. Another important observation is that some of the species showing continuous gene exchange with Megavirales (Rozella allomycis, Klebsormidium flaccidum, Lasius niger) are not known to be hosts of these viruses, nor are they known to interact with them. This suggests the possibility of finding new viral inserts in eukaryotes, which can be verified experimentally. Further, our analysis revealed contamination in a public database (Flavobacterium sp. JRM), which might be a case of mis-annotation or sequencing error; MimiLook can therefore be applied periodically in future studies to identify new cases of contamination or artifacts in public databases. Furthermore, genes with no homologs, considered ORFans, can be searched against protein structure databases to identify protein folds that are unique to Megavirales and folds that are homologous to known protein structures.

Despite much effort, we are still unable to place viruses on the universal tree of life, and their origin remains speculative. Thus, even when donors and recipients are known, there is rarely supporting evidence that the gene is absent from relatives of the Megavirale family that diverged prior to the transfer. Overall, it seems likely that HGT has been frequent in the course of Megavirale evolution. It remains to be seen just how frequent HGTs have been and what rules govern the whole process. As we learn more about the mechanisms and rules of HGT, we will be better equipped to model the process and to test more rigorously its role in shaping Megavirale evolution and biology.


Fig. 1: Genome mosaicism in viruses of metazoans

Fig. 2: Genome mosaicism in viruses of protozoa


Acknowledgments

I am very pleased to acknowledge all the people who have been part of my life and of my PhD tenure. First of all, I would like to thank my supervisor, Professor Pierre Pontarotti, for giving me a wonderful opportunity and funding support. Without his suggestions and guidance, it would have been impossible to complete my doctoral degree. Secondly, I would like to express my deep gratitude to my co-supervisor, Prof. Didier Raoult, for his guidance, supervision and support. I am heartily thankful to Prof. Philippe Colson for his suggestions and improvements to the scientific manuscripts. I would like to thank the reviewers of my thesis, Professor Patrick Forterre and Dr Franck Panabieres, for their scientific advice and detailed review during the preparation of my thesis. Their sincere suggestions indeed helped me to improve it. I would like to thank God, because I believe he is always with me, and I know that for him nothing is impossible. Special thanks go to all my family members. I would like to dedicate my degree to my mother, Mrs. Rekha Jain, and my father, Mr. Vivek Jain, for all the sacrifices they have made for me. Their prayers and love make me strong enough to stand against all difficulties. My special thanks go to my grandfather for his blessing and care. I thank my wife Ruchi for her love and support; she stood with me in good and bad moments. I express my love for my younger brother Apoorv Jain, who was always encouraging and supportive with his best wishes. I would like to thank the entire EBM team for their help and support (Vivek Keshri, Vikas Sharma, Charbel, Sandrine, Issa, Justine, Marie Helene rome, Evyleyne and Olivier). I would also like to thank all of my friends who supported me in every situation and in my endeavor towards my goal (Nishant, Ankush, Sweta Nidhi, Ganesh warthi, Neha Bose, Sandhya, Rohan, Mykel, Suresh, Usha, Bhavika and Balamurli). I would like to thank my seniors and friends Vinod sir, Anshuman sir, Rajat, Deepak and Kapil Rathi. Last but not least, special thanks to our secretaries (Michene Pitacollo, Valrie Filosa and Alexia Battistini); without their help and support, we could not even imagine our stay in France. I also offer sincere thanks to all who directly or indirectly helped in this difficult and memorable journey.
