Transcriptome Reconstruction and Functional Analysis of Eukaryotic Marine Plankton
Total Page:16
File Type:pdf, Size:1020Kb
Downloaded from genome.cshlp.org on October 7, 2021 - Published by Cold Spring Harbor Laboratory Press 1 Transcriptome reconstruction and functional analysis of eukaryotic marine plankton 2 communities via high-throughput metagenomics and metatranscriptomics 3 4 Alexey Vorobev1,2,#*, Marion Dupouy1†, Quentin Carradec1,2, Tom O. Delmont1,2, Anita 5 Annamalé1‡, Patrick Wincker1,2, Eric Pelletier1,2* 6 7 1 Metabolic Genomics, Genoscope, Institut de Biologie François Jacob, CEA, CNRS, Univ Evry, 8 Université Paris Saclay, 91000 Evry, France; 9 2 Research Federation for the study of Global Ocean Systems Ecology and Evolution, FR2022 / 10 Tara Oceans GOSEE, 3 rue Michel-Ange, 75016 Paris, France; 11 # Current address: INSERM U932, PSL University, Institut Curie, Paris, France; 12 † Current address: AGAP, Univ. Montpellier, CIRAD, INRA, Montpellier SupAgro, Montpellier, 13 France; 14 ‡ Current address: Discngine SAS, 79 rue Ledru-Rollin, 75012 Paris, France 15 16 17 Running title: Reconstruction of marine eukaryotic transcriptomes 18 19 20 Keywords: Plankton, Eukaryotes, Metagenomics, Metatranscriptomics, Canopy clustering 21 22 23 *Corresponding Authors: 24 25 Alexey Vorobev 26 INSERM U932, PSL University, Institut Curie, Paris, France 27 Phone: +33 (0)1-56-24-64-76 28 Email: [email protected] 29 30 Or 31 32 Eric Pelletier 33 CEA/Genoscope, 2 rue Gaston Crémieux, 91000 Evry, France 34 Phone: +33 (0)1-60-87-25-19 35 Email: [email protected] 36 1 Downloaded from genome.cshlp.org on October 7, 2021 - Published by Cold Spring Harbor Laboratory Press 37 Abstract: 38 Large scale metagenomic and metatranscriptomic data analyses are often restricted by their gene- 39 centric approach, limiting the ability to understand organismal and community biology. De novo 40 assembly of large and mosaic eukaryotic genomes from complex meta -omics data remains a 41 challenging task, especially in comparison with more straightforward bacterial and archaeal 42 systems. Here we use a transcriptome reconstruction method based on clustering co-abundant genes 43 across a series of metagenomic samples. We investigated the co-abundance patterns of ~37 million 44 eukaryotic unigenes across 365 metagenomic samples collected during the Tara Oceans expeditions 45 to assess the diversity and functional profiles of marine plankton. We identified ~12 thousand co- 46 abundant gene groups (CAGs), encompassing ~7 million unigenes, including 924 metagenomics 47 based transcriptomes (MGTs, CAGs larger than 500 unigenes). We demonstrated the biological 48 validity of the MGT collection by comparing individual MGTs with available references. We 49 identified several key eukaryotic organisms involved in dimethylsulfoniopropionate (DMSP) 50 biosynthesis and catabolism in different oceanic provinces, thus demonstrating the potential of the 51 MGT collection to provide functional insights on eukaryotic plankton. We established the ability of 52 the MGT approach to capture interspecies associations through the analysis of a nitrogen-fixing 53 haptophyte-cyanobacterial symbiotic association. This MGT collection provides a valuable resource 54 for an analysis of eukaryotic plankton in the open ocean by giving access to the genomic content 55 and functional potential of many ecologically relevant eukaryotic species. 2 Downloaded from genome.cshlp.org on October 7, 2021 - Published by Cold Spring Harbor Laboratory Press 56 Introduction 57 As an alternative to individual genome or transcriptome sequencing, environmental genomics 58 has been used for many years to access the global genomic content of organisms from a given 59 environment (Joly and Faure 2015). However, large scale metagenomic and metatranscriptomic 60 data analyses are often restricted by their gene-centric approach, limiting the ability to draw an 61 integrative functional view of sampled organisms. Nevertheless, constructing gene catalogs from 62 environmental samples provides a useful framework for a general description of the structure and 63 functional capabilities of microbe-dominated communities (Venter et al. 2004; Qin et al. 2010; 64 Brum et al. 2015; Sunagawa et al. 2015; Carradec et al. 2018). Gene-centric approaches allow deep 65 and detailed exploration of communities of organisms, but they are usually undermined by limited 66 contextual information for different genes, apart from taxonomic affiliation based on sequence 67 similarity. 68 Several methods have been developed to shift the scientific paradigm from a gene-centric to 69 an organism-centric view of environmental genomic and transcriptomic data. These methods use 70 direct assemblies of metagenomic reads to generate contigs that encompass several genes. High 71 recovery of bacterial and archaeal genomes has been achieved through traditional assembly 72 strategies from both high and low diversity environmental samples (Tully et al. 2018; Dick et al. 73 2009; Albertsen et al. 2013). However, when dealing with complex communities of organisms, 74 traditional genome assembly approaches are often impaired by the large amount of sequence data 75 and the genome heterogeneity. 76 To circumvent these limitations, approaches based on reference genomes have been proposed 77 (Caron et al. 2009; Pawlowski et al. 2012). Other approaches are based on binning of assembly 78 contigs across a series of samples to extract information (Sharon et al. 2012; Albertsen et al. 2013). 79 Significant results for bacteria and archaea have been achieved so far (Delmont et al. 2018; Parks et 80 al. 2017; Pasolli et al. 2019; Nayfach et al. 2019). However, eukaryotic organisms have larger, more 81 complex genomes that require significantly higher sequence coverage than bacteria and archaea. 3 Downloaded from genome.cshlp.org on October 7, 2021 - Published by Cold Spring Harbor Laboratory Press 82 Even in cases where significantly long contigs can be obtained, the mosaic structure of eukaryotic 83 genes and the difficulty predicting them de-novo from a genome sequence leads to poor gene 84 recovery (Boeuf et al. 2019). Only eukaryotes possessing small genomes and a high proportion of 85 mono-exonic genes are expected to provide results similar to those achieved for bacteria or archaea. 86 Recently a semi-supervised method based on a model trained with a set of diverse references 87 allowed eukaryotic genome reconstruction from complex natural environments (West et al. 2018; 88 Olm et al. 2019). 89 Novel reference-independent clustering approaches that produce genomes from metagenomic 90 data have recently been developed (Albertsen et al. 2013; Imelfort et al. 2014; Delmont et al. 2018; 91 Kang et al. 2019; Nielsen et al. 2014) and successfully applied to prokaryote-dominated 92 communities. One of these approaches efficiently delineated co-abundant gene groups (CAGs), the 93 largest of which were termed MetaGenomic Species (MGS), across a series of human gut 94 microbiome samples (Nielsen et al. 2014). This method uses the metagenomic abundance profiles 95 of a reference gene catalog, determined by stringent mapping of raw metagenomic reads onto 96 sequences in this catalog, and defines clusters of genes showing similar variations of abundance 97 profiles across a collection of samples. 98 The vast majority of the planktonic biomass in the global ocean consists of single-cell 99 eukaryotes and multicellular zooplankton (Dortch and Packard 1989; Gasol et al. 1997). Globally, 100 these organisms play an important role in shaping the biogeochemical cycles of the ocean, and 101 significantly impact food webs and climate. Despite recent advances in understanding their 102 taxonomic and gene functional compositions (Joly and Faure 2015; Carradec et al. 2018; Venter et 103 al. 2004), little is known about the biogeographical preferences and metabolic potential of many 104 eukaryotic plankton species from an organism-centric perspective. Several collections of reference 105 marine eukaryote organisms’ sequences have been created, the largest one being the MMETSP 106 collection (Keeling et al. 2014). However, the majority of fully sequenced marine eukaryotic 107 genomes or transcriptomes are derived from cultured organisms. Due to the limited availability of 4 Downloaded from genome.cshlp.org on October 7, 2021 - Published by Cold Spring Harbor Laboratory Press 108 cultured representatives of many dominant in the open ocean plankton, including zooplankton 109 representatives, reference sequences represent only a small fraction of the natural biological 110 diversity (De Vargas et al. 2015; Sibbald and Archibald 2017). 111 Here, we used the rationale of this reference independent, gene co-abundance method (Nielsen 112 et al. 2014) to delineate transcriptomes by mapping metagenomic sequencing data obtained from 113 365 metagenomic readsets generated from marine water samples collected from the global ocean 114 during the Tara Oceans expedition onto the metatranscriptome-derived Marine Atlas of Tara 115 Oceans Unigenes (MATOU-v1 catalog (Carradec et al. 2018)) obtained from the same set of Tara 116 Oceans stations (Fig. S1). The samples were collected from all the major oceanic provinces except 117 the Arctic, typically from two photic zone depths (subsurface - SRF and deep-chlorophyll 118 maximum - DCM), and across four size fractions (0.8–5μm, 5–20μm, 20–180μm, and 180– 119 2000μm). 120 Results 121 Construction of the MGT collection 122 Of the 116,849,350 metatranscriptomic-based unigenes of the MATOU-v1 catalog, 123 37,381,609 (32%) were detected by metagenomic