bioRxiv preprint doi: https://doi.org/10.1101/2020.07.22.216580; this version posted January 25, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license.

Accurate and sensitive detection of microbial eukaryotes from whole metagenome shotgun sequencing

Abigail L. Lind1 and Katherine S. Pollard1,2,3,*

1. Gladstone Institute of Data Science and Biotechnology, San Francisco, CA.
2. Institute for Human Genetics, Department of Epidemiology and Biostatistics, and Institute for Computational Health Sciences, University of California, San Francisco, CA.
3. Chan Zuckerberg Biohub, San Francisco, CA.
* Corresponding author: [email protected]

Abstract

Background
Microbial eukaryotes are found alongside bacteria and archaea in natural microbial systems, including host-associated microbiomes. While microbial eukaryotes are critical to these communities, they are challenging to study with shotgun sequencing techniques and are therefore often excluded.

Results
Here we present EukDetect, a bioinformatics method to identify eukaryotes in shotgun metagenomic sequencing data. Our approach uses a database of 521,824 universal marker genes from 241 conserved gene families, which we curated from 3,713 fungal, protist, non-vertebrate metazoan, and non-streptophyte archaeplastid genomes and transcriptomes. EukDetect has a broad taxonomic coverage of microbial eukaryotes, performs well on low-abundance and closely related species, and is resilient against bacterial contamination in eukaryotic genomes. Using EukDetect, we describe the spatial distribution of eukaryotes along the human gastrointestinal tract, showing that fungi and protists are present in the lumen and mucosa throughout the large intestine. We discover that there is a succession of eukaryotes that colonize the human gut during the first years of life, mirroring patterns of developmental succession observed in gut bacteria. By comparing DNA and RNA sequencing of paired samples from human stool, we find that many eukaryotes continue active transcription after passage through the gut, though some do not, suggesting they are dormant or nonviable. We analyze metagenomic data from the Baltic Sea and find that eukaryotes differ across locations and salinity gradients. Finally, we observe eukaryotes in Arabidopsis leaf samples, many of which are not identifiable from public protein databases.

Conclusions
EukDetect provides an automated and reliable way to characterize eukaryotes in shotgun sequencing datasets from diverse microbiomes. We demonstrate that it enables discoveries that would be missed or clouded by false positives with standard shotgun sequence analysis. EukDetect will greatly advance our understanding of how microbial eukaryotes contribute to microbiomes.

Keywords: fungi, protists, algae, nematode, arthropod, choanoflagellate, microbial eukaryotes, helminth, whole metagenome sequencing

Background
Eukaryotic microbes are ubiquitous in microbial systems, where they function as decomposers, predators, parasites, and producers [1]. Eukaryotic microbes inhabit virtually every environment on Earth, from extremophiles found in geothermal vents, to endophytic fungi growing within the leaves of plants, to parasites inside the animal gut. Microbial eukaryotes have complex interactions with their hosts in both plant- and animal-associated microbiomes. For example, in plants, microbial eukaryotes assist with nutrient uptake and can protect against herbivory [2]. In animals, microbial eukaryotes in the gastrointestinal tract metabolize plant compounds [3]. Microbial eukaryotes can also cause disease in both plants and animals, as in the case of oomycete pathogens in plants or cryptosporidiosis in humans [4,5]. In humans, microbial eukaryotes interact in complex ways with the host immune system, and their depletion and low diversity in microbiomes from industrialized societies mirror the industrialization-driven “extinction” seen for bacterial residents in the microbiome [6–8]. Outside of host-associated environments, microbial eukaryotes are integral to the ecology of aquatic and soil ecosystems, where they are primary producers, partners in symbioses, decomposers, and predators [9,10].

Despite their importance, eukaryotic species are often not captured in studies of microbiomes [11]. By far the most frequently used metabarcoding strategy for studying host-associated and environmental microbiomes targets the 16S ribosomal RNA gene, which is found exclusively in bacteria and archaea, and therefore cannot be used to detect eukaryotes. Alternatively, amplicon sequencing of eukaryotic-specific (18S) or fungal-specific (ITS) ribosomal genes is effectively used in studies specifically targeting eukaryotes.
Our understanding of the eukaryotic diversity in host-associated and environmental microbiomes is largely due to advances in eukaryotic-specific amplicon sequencing [12–15]. However, like all amplicon strategies, these approaches can suffer from limited taxonomic resolution and difficulty in distinguishing between closely related species [16,17].

Unlike amplicon strategies, whole metagenome sequencing captures fragments of DNA from the entire pool of species present in a microbiome, including eukaryotes. Whole metagenome sequencing data are increasingly common in microbiome studies, as these data can be used to assemble new genomes, disambiguate strains, and reveal presence and absence of genes and pathways [18]. Although these methods work well for identifying bacteria and archaea, multiple challenges remain for robustly and routinely identifying eukaryotes in whole metagenome sequencing datasets. First, eukaryotic species are estimated to be present in many environments at lower abundances than bacteria, and comprise a much smaller fraction of the reads of a shotgun sequencing library than do bacteria [1,19]. Despite this limitation, interest in detecting rare taxa and rare variants in microbiomes has increased alongside falling sequencing costs, resulting in a growing number of deeply sequenced microbiome datasets in which eukaryotes could be detected and analyzed in a meaningful way.

Apart from sequencing depth, a second major barrier to detecting eukaryotes from whole metagenome sequencing is the availability of methodological tools.
Research questions about eukaryotes in microbiomes using whole metagenome sequencing have primarily been addressed by using eukaryotic reference genomes, genes, or proteins for either k-mer matching or read mapping approaches [20,21]. However, databases of eukaryotic genomes and proteins have widespread contamination from bacterial sequences, and these methods therefore frequently misattribute bacterial reads to eukaryotic species [22,23]. Gene-based taxonomic profilers, such as MetaPhlAn3, have been developed to detect eukaryotic species, but these target a small subset of microbial eukaryotes (122 eukaryotic species as of the mpa_v30 release) [24,25].

To enable routine detection of eukaryotic species from any environment sequenced with whole metagenome sequencing, we have developed EukDetect, a bioinformatics approach that aligns metagenomic sequencing reads to a database of universally conserved eukaryotic marker genes. As microbial eukaryotes are not a monophyletic group, we have taken an inclusive approach and incorporated marker genes from 3,713 eukaryotes, including 596 protists, 2,010 fungi, 146 non-streptophyte archaeplastids, and 961 non-vertebrate metazoans (many of which are microscopic or near-microscopic). This gene-based strategy avoids the pitfall of spurious bacterial contamination in eukaryotic genomes that has confounded other approaches. The EukDetect pipeline will enable different