Estimating the Quality of Eukaryotic Genomes Recovered from Metagenomic Analysis

bioRxiv preprint doi: https://doi.org/10.1101/2019.12.19.882753; this version posted January 20, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license. Estimating the quality of eukaryotic genomes recovered from metagenomic analysis Paul Saary1*, Alex L. Mitchell1, and Robert D. Finn1* 1European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK *Corresponding author: [email protected], [email protected] Abstract Introduction Eukaryotes make up a large fraction of microbial biodiversity. However, the field of metage- The past two decades have seen a dramatic nomics has been heavily biased towards the advancement in our understanding of the mi- study of just the prokaryotic fraction. This croscopic organisms present in environments focus has driven the necessary methodolog- (known as microbiomes) such as oceans, soil and ical developments to enable the recovery host-associated sites, like the human gut. Most of prokaryotic genomes from metagenomes, of this knowledge has come from the applica- which has reliably yielded genomes from tion of modern DNA sequencing techniques to thousands of novel species. More recently, mi- the collective genetic material of the microorgan- crobial eukaryotes have gained more atten- isms, using methods such as metabarcoding (am- tion, but there is yet to be a parallel explo- plification of marker genes) or metagenomics sion in the number of eukaryotic genomes re- (shotgun sequencing). Based on the analysis of covered from metagenomic samples. One of such sequence data, it is thought that up to 99 % the current deficiencies is the lack of a uni- of all microorganisms are yet to be cultured versally applicable and reliable tool for the es- (Rinke et al. 2013). timation of eukaryote genome quality. To ad- To date, the overwhelming number of dress this need, we have developed EukCC, a metabarcoding and metagenomics studies have tool for estimating the quality of eukaryotic focused on the bacteria that are present within genomes based on the dynamic selection of a sample. However, viruses and eukaryotes are single copy marker gene sets, with the aim also important members of the microbial com- of applying it to metagenomics datasets. We munity, both in terms of number and function demonstrate that our method outperforms cur- (Paez-Espino et al. 2016; Carradec et al. 2018; rent genome quality estimators and have ap- Olm et al. 2019; Karin et al. 2019). Indeed, the plied EukCC to datasets from two different unicellular protists and fungi are estimated to biomes to enable the identification of novel account for about 17 % of the global microbial ∼ genomes, including a eukaryote found on the biomass. Within the microbial eukaryotic human skin and a Bathycoccus species ob- biomass, the genetically diverse unicellular tained from a marine sample. organisms known as protists account for as much as 25 % (Bar-On et al. 2018). Today, ∼ the increasing number of completed genomes has revealed that the “protists” classification encompasses a number of divergent sub-clades: 1 bioRxiv preprint doi: https://doi.org/10.1101/2019.12.19.882753; this version posted January 20, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license. Saary et al. (2019) 1 INTRODUCTION animals and some protists are encompassed are incomplete, so estimating the quality of a in the Opisthokonta clade; others are grouped MAG in terms of completeness and contamina- into the Amoebozoa or (T)SAR ((Telonemids), tion can not rely on genomic comparisons. In Stramenopiles, Alveolates, and Rhizaria) clades, the absence of a reference genome, quality es- or into further groups. However, the exact root timates for MAGs have used universal single of the overall eukaryotic tree and the number copy marker genes (SCMGs) (Parra et al. 2007; of primary clades remains a topic of discussion Mende et al. 2013; Simão et al. 2015; Parks et al. (Baldauf 2003; Burki 2014; Burki et al. 2019). 2015). As these genes are expected to only oc- Despite the increase in the number of com- cur once within a genome, comparing the num- plete and near complete genomes, metabarcod- ber of SCMGs found within a binned genome to ing and metagenomic approaches that have in- the number of expected marker genes provides cluded the analysis of microbial eukaryotes have an estimation of completeness, while additional demonstrated that the true diversity of protists copies of a marker gene can be used as an indi- is far greater than that currently reflected in cator of contamination. After such evaluations, the genomic reference databases (such as Ref- binned genomes achieving a certain quality can Seq or ENA). For example, a recent estimate be classified as either medium or high quality based on metabarcoding sequencing suggests (Bowers et al. 2017). that 150,000 eukaryotic species exist in the Due to biases in sampling and extraction oceans alone (Vargas et al. 2015), but only 4,551 methods, the majority of MAGs produced to representative species have an entry in GenBank date correspond to prokaryotic organisms. For (15. Nov 2019). Thus, if the functional role of a prokaryotic MAGs, CheckM (Parks et al. 2015) microbiome is to be completely understood, we is the most widely used tool to estimate com- need to know what these as yet uncharacterised pleteness and contamination, although other ap- organisms are and the functional roles they are proaches have also been used (Pasolli et al. 2019) performing. and individual sets of prokaryotic SCMGs are Currently, one of the best approaches for un- also provided by BUSCO (Simão et al. 2015; Wa- derstanding microbiome function is through terhouse et al. 2018) as well as by anvi’o (Eren et the assembly of shotgun reads (usually 200- al. 2015). However, even with size fractionation 500 bp long) to obtain longer contigs (typi- of samples to enrich for prokaryotes prior to li- cally in the range of 2000-500,000 bp). These brary preparation, eukaryotic cells frequently re- contigs provide access to complete proteins, main in the samples, with some eukaryotic DNA which may then be interpreted within the recovered as MAGs (Delmont et al. 2018; West context of surrounding genes. In the last et al. 2018), while others use size fractionation few years, it has become commonplace to ex- specifically to enrich for eukaryotes(Karsenti et tend this type of analysis to recover puta- al. 2011; Carradec et al. 2018). tive genomes, termed metagenome assembled As with the quality estimation of bacterial genomes (MAGs). MAGs are generated by group- MAGs, SCMGs have been used to assess eu- ing contigs into sets that are believed to have karyotic isolate genomes. CEGMA (Parra et al. come from a single organism – a process known 2007) used 240 universal single copy marker as binning. However, even after binning MAGs, genes identified from six model organisms to es- they vary in their completeness and can be timate genome completeness, which was then fragmented, due to a combination of biologi- superseded by BUSCO. The major advance of cal (e.g. abundance of microbes), experimental BUSCO compared to CEGMA, was the provi- (e.g. depth of sequencing) and technical (e.g. al- sion of curated sets of marker genes for sev- gorithmic) reasons. Furthermore, the computa- eral eukaryotic and prokaryotic clades, in additional methods used for binning the contigs can tion to the single universal eukaryotic marker sometimes fail to distinguish between contigs gene set. While BUSCO (version 3) provides sets that have come from different organisms, lead- to estimate completeness of eukaryota, protists, ing to a chimeric genome (termed contamina- plants and fungi, it remains up to the user to tion). As highlighted above, reference databases select which is the most suitable set when as- 2 bioRxiv preprint doi: https://doi.org/10.1101/2019.12.19.882753; this version posted January 20, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license. Saary et al. (2019) 2 RESULTS sessing genome quality. Although BUSCO has reference genomes. While BUSCO reports com- been used for quality metric calculation of eu- plete, fragmented and duplicated BUSCOs, for karyotic MAGs (West et al. 2018), this manual the sake of simplicity we summarized all these selection can be challenging, especially when as ‘matched’ BUSCOs (Figure 1 A). One of the dealing with large numbers of genomes from main applications of BUSCO has been the as- unknown species. Besides these more univer- sessment of fungal genomes, which also repre- sal approaches, others have focused on certain sent the most numerous eukaryotic genomes clades of Eukaryotes: FGMP (Cissé and Stajich in the reference databases. Thus, it was unsur- 2019) estimates genome quality of Fungal iso- prising that > 95% of the 303 eukaryotic BUS- late genomes for which it utilises both SCMGs COs were matched in genomes coming from and also highly conserved regions found within Ascomycota, Mucoromycota and Basidiomycota. fungal genomes. For protists, anvi’o provides a However, BUSCO performed less well on eu- reduced number of profiles from the BUSCO karyotic genomes arising from other taxonomic ‘protist’ set to estimate the genome quality of groups. Notably, the numbers of BUSCOs found eukaryotic MAGs (unpublished).

Estimating the Quality of Eukaryotic Genomes Recovered from Metagenomic Analysis

Multigene Eukaryote Phylogeny Reveals the Likely Protozoan Ancestors of Opis- Thokonts (Animals, Fungi, Choanozoans) and Amoebozoa

Genome-Resolved Metagenomics of Eukaryotic Populations During Early Colonization of Premature Infants and in Hospital Rooms Matthew R

Protist Phylogeny and the High-Level Classification of Protozoa

S41467-021-25308-W.Pdf

Reconstruction of Protein Domain Evolution Using Single-Cell Amplified

Complex Communities of Small Protists and Unexpected Occurrence Of

The Early History of the Metazoa—A Paleontologist's Viewpoint

A New Subspecies of a Ciliate Euplotes Musicola Isolated from Industrial Effluents

A Putative Origin of the Insect Chemosensory Receptor

Systema Naturae. the Classification of Living Organisms

Inferring Ancestry

Discovery of New Mycoviral Genomes Within Publicly Available Fungal Transcriptomic Datasets