Pandoravirus Celtis Illustrates the Microevolution Processes at Work In
bioRxiv preprint doi: https://doi.org/10.1101/500207; this version posted February 11, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

Pandoravirus celtis illustrates the microevolution processes at work in the giant Pandoraviridae genomes

Running title: Evolutionary mechanisms in Pandoraviridae

Matthieu Legendre1, Jean-Marie Alempic1, Nadège Philippe1, Audrey Lartigue1, Sandra Jeudy1, Olivier Poirot1, Ngan Thi Ta1, Sébastien Nin1, Yohann Couté2, Chantal Abergel1*, Jean-Michel Claverie1*

1 Aix Marseille Univ, CNRS, IGS, Structural and Genomic Information Laboratory (UMR7256), Mediterranean Institute of Microbiology (FR3479), 163 Avenue de Luminy, F-13288 Marseille, France
2 Univ. Grenoble Alpes, CEA, Inserm, BIG-BGE, 38000 Grenoble, France

* Correspondence:
Jean-Michel Claverie [email protected]
Chantal Abergel [email protected]

Keywords: de novo gene creation, comparative genomics, Acanthamoeba, Giant viruses, Soil viruses, hAT transposase

Abstract

With genomes of up to 2.7 Mb propagated in µm-long oblong particles and initially predicted to encode more than 2000 proteins, members of the Pandoraviridae family display the most extreme features of the known viral world. The mere existence of such giant viruses raises fundamental questions about their origin and the processes governing their evolution. A previous analysis of six newly available isolates, independently confirmed by a study including three others, established that the Pandoraviridae pan-genome is open, meaning that each new strain exhibits protein-coding genes not previously identified in other family members. With an average increment of about 60 proteins, the gene repertoire shows no sign of reaching a limit and still consists largely of genes coding for proteins without recognizable homologs in other viruses or cells (ORFans).
To explain these results, we proposed that most new protein-coding genes were created de novo, from pre-existing non-coding regions of the G+C rich pandoravirus genomes. The comparison of the gene content of a new isolate, P. celtis, closely related (96% identical genome) to the previously described P. quercus, is now used to test this hypothesis by studying genomic changes over a microevolutionary range. Our results confirm that the differences between these two similar gene contents mostly consist of protein-coding genes without known homologs (ORFans), with statistical signatures close to those of intergenic regions. These newborn proteins are under slight negative selection, perhaps to maintain stable folds and prevent protein aggregation pending the eventual emergence of fitness-increasing functions. Our study also uncovered several insertion events mediated by a transposase of the hAT family, three copies of which are found in P. celtis and are presumably active. Members of the Pandoraviridae are presently the first viruses known to encode this type of transposase.

1 Introduction

The Pandoraviridae is a proposed family of giant dsDNA viruses, not yet registered by the International Committee on Taxonomy of Viruses (ICTV), multiplying in various species of Acanthamoeba through a lytic infectious cycle. Their linear genomes, flanked by large terminal repeats, range from 1.9 to 2.7 Mb in size, and are propagated in elongated oblong particles approximately 1.2 µm long and 0.6 µm in diameter (Fig. S1).
The prototype strain (and the one with the largest known genome) is Pandoravirus salinus, isolated from shallow marine sediments off the coast of central Chile (Philippe et al., 2013). Other members were soon after isolated from worldwide locations. Complete genome sequences have been determined for P. dulcis (Melbourne, Australia) (Philippe et al., 2013), P. inopinatum (Germany) (Antwerpen et al., 2015), P. macleodensis (Australia), P. neocaledonia (New Caledonia), and P. quercus (France) (Legendre et al., 2018), as well as for three isolates from Brazil (P. braziliensis, P. pampulha, and P. massiliensis) (Aherfi et al., 2018). A standard phylogenetic analysis of the above strains suggested that the Pandoraviridae family consists of two separate clades (Claverie et al., 2018; Fig. 1). The average proportion of identical amino acids between Pandoraviridae orthologs within each clade is above 70%, while it is below 55% between members of the A and B clades.

Following a stringent reannotation of the predicted protein-coding genes using transcriptomic and proteomic data, our comparative genomic analysis reached the following main conclusions (Legendre et al., 2018):

1) the uniquely large proportion of predicted proteins without homologs outside of the Pandoraviridae (ORFans) is real and not due to bioinformatic errors induced by the above-average G+C content (>60%) of Pandoravirus genomes;
2) the Pandoraviridae pan-genome appears “open” (i.e. unbounded);
3) as most of the genes unique to each strain are ORFans, they were not horizontally acquired from other (known) organisms;
4) nor are they predominantly the result of gene duplications.

The scenario of de novo and in situ gene creation, supported by the analysis of their sequence statistical signatures, thus became our preferred explanation for the origin of strain-specific genes.
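As an illustration of what an “open” pan-genome means in practice, the following minimal Python sketch counts how many previously unseen gene clusters each additional strain contributes. The strain names and cluster identifiers are made-up toy data, and the function name pan_genome_curve is hypothetical; this is not the procedure used in the cited studies, only a sketch of the underlying bookkeeping.

```python
from typing import Dict, List, Set, Tuple


def pan_genome_curve(
    strain_clusters: Dict[str, Set[str]], order: List[str]
) -> List[Tuple[str, int, int]]:
    """Return (strain, new_clusters, pan_genome_size) for each strain in `order`.

    `strain_clusters` maps a strain name to the set of gene-cluster
    identifiers found in that strain (hypothetical input format).
    """
    seen: Set[str] = set()
    curve = []
    for strain in order:
        new = strain_clusters[strain] - seen  # clusters never observed before
        seen |= new                           # grow the pan-genome
        curve.append((strain, len(new), len(seen)))
    return curve


# Toy data: made-up cluster identifiers, not real Pandoravirus genes.
toy = {
    "strain_A": {"c1", "c2", "c3"},
    "strain_B": {"c1", "c2", "c4"},
    "strain_C": {"c1", "c4", "c5", "c6"},
}

for strain, new, total in pan_genome_curve(toy, ["strain_A", "strain_B", "strain_C"]):
    print(strain, new, total)
```

In an open pan-genome, the count of new clusters per added strain does not fall to zero as more strains are included. The result depends on the order in which strains are added, so it is usually averaged over many orderings.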
In the present study, we take advantage of the high similarity (96.7% DNA sequence identity) between P. quercus and a newly characterized isolate, P. celtis, to investigate the microevolution processes initiating the divergence between Pandoraviruses. Our results further support de novo gene creation as a main diversifying force of the Pandoraviridae family.

2 Materials and Methods

2.1 Virus Isolation, Production, and Purification

P. celtis and P. quercus were isolated in November 2014 from samples of surface soil taken at the base of two trees (Celtis australis and Quercus ilex) less than 50 meters apart in an urban green space of Marseille city (GPS: 43°15'16.00"N, 5°25'4.00"E). Their particles were morphologically identical to previously characterized pandoraviruses (Fig. S1). The viral populations were amplified by co-cultivation with A. castellanii. They were then cloned, mass-produced and purified as previously described (Philippe et al., 2013).

2.2 Genome and transcriptome sequencing, annotation

The P. celtis genome was fully assembled from the sequence data of one PacBio SMRT cell with the HGAP 4 assembler (Chin et al., 2013) from the SMRT link package version 5.0.1.9585 with default parameters and the “aggressive” option=true. Genome polishing was finally performed using the SMRT package. Stringent gene annotation was performed as previously described (Legendre et al., 2018). Briefly, data from proteomic characterization of the purified virions were combined with stranded RNA-seq transcriptomic data, as well as protein homology among previously characterized pandoraviruses.
The transcriptomic data were generated from cells collected every hour over an infectious cycle of 15 hours. They were pooled and RNAs were extracted prior to poly(A)+ enrichment. The RNAs were then sent for sequencing. Stranded RNA-seq reads were then used to accurately annotate protein-coding as well as non-coding RNA genes. All predicted genes (including novel genes) were validated using an expression threshold (median read coverage >5 over the whole transcript) larger than the lowest expression level associated with proteins detected in the proteomic analyses (Table 1). Genomic regions that exhibited similar expression levels but did not encompass predicted proteins, or that overlapped with protein-coding genes expressed from the opposite strand, were annotated as “non-coding RNA” (ncRNA) after assembly by Trinity (Grabherr et al., 2011). When only genomic data were available, namely for P. pampulha, P. massiliensis and P. braziliensis (Aherfi et al., 2018), we annotated protein-coding genes using ab initio prediction coupled with sequence conservation information as previously described (Legendre et al., 2018).

Gene clustering was performed on all available Pandoravirus protein-coding genes using Orthofinder (Emms & Kelly, 2015) with default parameters except for the “msa” option for the gene tree inference method.

Functional annotation of protein-coding genes was performed using a combination of protein domain searches with the CD-search tool (Marchler-Bauer & Bryant, 2004) and HMM-HMM searches against the Uniclust30 database with the HHblits tool (Remmert et al., 2011). In addition, we used the same procedure to update the functional gene annotation of P. salinus, P. dulcis, P. quercus, P. macleodensis and P. neocaledonia (Genbank IDs: KC977571, KC977570, MG011689, MG011691 and MG011690).

2.3 Phylogenetic and Selection pressure analysis

The phylogenetic tree (Fig.
1) was computed from the concatenated multiple alignment of the sequences of Pandoravirus core proteins corresponding to single-copy genes. The peptide sequences of orthologous genes were aligned using MAFFT (Katoh et al., 2002). The tree was computed using IQ-TREE (Hoang et al., 2018) with the following options: -m MFP -bb 10000 -st codon -bnni. The best-fit model selected was GY+F+R5. Codon sequences were subsequently mapped onto these alignments.
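Mapping codon sequences onto a peptide alignment can be sketched as follows. This is a minimal illustration of the general technique, not the authors' actual script. The function name thread_codons is hypothetical, and the sketch assumes that each coding sequence is in frame, excludes the stop codon, and corresponds residue for residue to the aligned peptide.

```python
def thread_codons(aligned_protein: str, cds: str, gap: str = "-") -> str:
    """Thread a coding sequence onto its aligned peptide sequence.

    Each residue in the aligned peptide is replaced by its codon and each
    gap character by three gap characters, yielding a codon alignment row.
    Assumes (for illustration) that `cds` is in frame and has no stop codon.
    """
    if len(cds) % 3 != 0:
        raise ValueError("CDS length is not a multiple of 3")
    codons = [cds[i:i + 3] for i in range(0, len(cds), 3)]
    n_residues = sum(1 for aa in aligned_protein if aa != gap)
    if len(codons) != n_residues:
        raise ValueError("CDS codon count does not match peptide length")

    codon_iter = iter(codons)
    row = []
    for aa in aligned_protein:
        # Gaps expand to a three-column gap; residues consume the next codon.
        row.append(gap * 3 if aa == gap else next(codon_iter))
    return "".join(row)


# Toy example: peptide "MKV" aligned with one gap, and its made-up CDS.
print(thread_codons("M-KV", "ATGAAAGTT"))
```

The resulting codon alignment rows all have three times the length of the peptide alignment, which is the form expected by codon substitution models such as the GY family mentioned above.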