De novo Assembly and Comparative Genomics on Eukaryotic Species Mixtures

Bastian Greshake†, Andreas Blaumeiser†, Simonida Zehr†, Francesco Dal Grande*, Anjuli Meiser§, Imke Schmitt*§, Ingo Ebersberger†

† Department for Applied Bioinformatics, Institute for Cell Biology and Neuroscience, Goethe University, Frankfurt am Main, Germany
* Biodiversity and Climate Research Centre, Senckenberg Gesellschaft für Naturforschung, Frankfurt am Main, Germany
§ Institute of Ecology, Evolution and Diversity, Goethe University, Frankfurt am Main, Germany

Summary

Mutualistic symbiotic relationships are found across organisms of all complexity. In extreme instances, as in some lichens, the interaction appears so close that the participating organisms grow only poorly, or even not at all, when cultivated in isolation. This renders mutualistic symbionts valuable objects for studying the genomic basis of adaptation and co-evolution. The close interdependence in such communities, however, confounds genomic studies. In many cases separate sequencing of the participating organisms is not feasible, leaving metagenomics approaches as the method of choice. Here we address how and to what extent eukaryotic genomes can be reconstructed from such data.

We use in silico-generated data sets to assess the performance of different assembly paradigms on Whole Genome Shotgun (WGS) data from eukaryotic species mixtures. On this basis we have begun reconstructing the metagenome of the lichen Lasallia pustulata. Using a hybrid sequencing approach that combines Illumina short-read and PacBio long-read data, we have assembled the genome of the mycobiont and a major fraction of the algal photobiont. We integrate these data with genome sequences of closely related non-lichenized fungi as a first step towards analyzing how lichenization affects genome evolution.

1. Assembler Evaluation with Simulated Twin Sets [1]

A. Pilot Study
Thallus (Lasallia pustulata)
> sequence (15 million read pairs, 2x250 bp) & measure read statistics
> FLASH [2]
> data set characteristics
[Plot: observed (black) and fitted (blue) insert size distribution]

B. Reference Genomes: Cladonia grayi [3] (fungus), Asterochloris sp. [4] (alga)
> Concatenate contigs of draft assembly into one pseudo-chromosome each
> WGS simulation with ART [5]: 15 million 2x250 bp paired-end reads (350 bp insert), 15 million 2x250 bp mate-pair reads (5 kbp insert)
> Simulated WGS reads
> Merge reads simulated from either reference genome to form L. pustulata twin sets with varying coverage ratios for the two genomes: 10:0, 9:1, 8:2, 7:3, 6:4, 5:5, 4:6, 3:7, 2:8, 1:9, 0:10
> Assemble each set with 6 different assemblers (Box I)

Box I: Assembler Selection

De Bruijn Graph based
  Velvet [6]        Standard de Bruijn Graph
  MetaVelvet [7]    Metagenome DBG Assembler
  SPAdes [8]        Multisized de Bruijn Graph
Overlap Layout based
  MIRA [9]          Overlap Layout + Graph Based
  Omega [10]        Metagenome OLC Assembler
  sga [11]          String Graph Assembler

2. Assembly Results of the Twin Sets

[Figure: assembly size in bp and NG50 size for each assembler (MIRA, Omega, SPAdes, sga, Velvet, MetaVelvet) across the coverage ratios 0:10 to 10:0]
Assembly results for the 11 twin sets. Bars are centered at total assembly length, red lines indicate reference lengths. Height of bars shows the NG50 size. Assemblies marked with an asterisk cover less than 50% of the reference length. A default height was used in those instances.

3. Sequencing the L. pustulata metagenome

Box II summarizes the assembly results of the metagenome skimming data (cf. 1A) with MIRA. A comparison to the twin set analysis (cf. 2) indicates issues with the reconstruction of the algal genome. A qPCR analysis of the lichen thallus reveals a substantially higher than expected fungal-to-algal genome ratio of 15:1 (Fig. 1).

Box II: Illumina Assembly

                     Whole Assembly   Fungal    Algal     Bacteria
Number of Contigs    64,180           6,977     8,872     19,371
Total Length         119 Mbp          37 Mbp    14 Mbp    34 Mbp
N50                  3.3 kbp          19 kbp    2 kbp     3 kbp
Coverage             -                80x       14x       10x

Figure 1: qPCR results (total copy number) for a fungal and an algal single copy gene. The fungal:algal ratio is around 15:1.

Final Assembly: Short-Read meets Long-Read

Illumina reads
> Assemble with MIRA
PacBio reads (2.7 million polymerase reads, N50: 15 kb)
> Assemble PacBio reads with FALCON [12] and scaffold with SSPACE-LongRead [13]
> Error-correct PacBio reads with ECTools [14] using Illumina contigs
> Assemble error-corrected PacBio reads and Illumina reads with SPAdes. Scaffold with SSPACE-LongRead
Taxonomic Assignment using MEGAN [15]. Genome Annotation with MAKER [16], using SNAP, GeneMark and AUGUSTUS.

Box III: PacBio/Hybrid Assemblies

                      Trebouxia sp.   Lasallia pustulata
Number of Scaffolds   141             43
Total Length          45 Mbp          33 Mbp
N50                   0.8 Mbp         1.7 Mbp
Number of Genes       15,393          11,112

For the high-coverage fungal genome, we follow the standard procedure of assembling from PacBio data only. For the low-coverage algal genome, we use a hybrid assembly utilizing both PacBio and Illumina data [17].

4. Does Lichenization Facilitate Gene Loss?

Ancestral Gene Set
To investigate lineage-specific gene loss, the Last Common Ancestor (LCA) gene set of the eight species shown in Figure 2 was reconstructed using OMA [18]. In total 12,595 orthologous groups were formed (Figure 3).

Absence of LCA Genes
For 1,357 groups, genes were found in only 7 species. In 1/3 of these groups the L. pustulata ortholog is missing, hinting that these genes were lost as a consequence of lichenization (Figure 4).

Figure 2: Tree of 8 species used for creating the LCA gene set.
  Dothideomycetes: Mycosphaerella graminicola, Phaeosphaeria nodorum
  Lasallia pustulata
  Eurotiomycetes: Uncinocarpus reesii, Aspergillus kawachii
  Sordariomyceta: Sclerotinia sclerotiorum, Verticillium dahliae, Chaetomium globosum

Figure 3: Results of the orthology prediction with OMA (count of groups by size of ortholog group). For 1,153 groups all 8 species were found. For 1,357 groups only 7 species were found.

Figure 4: Distribution of genes missing in the LCA set (genes missing in LCA core set, per species). Lasallia pustulata is missing in twice as many orthologous groups as any other species.
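The per-species tally behind Figure 4 can be illustrated with a minimal sketch. It assumes the orthologous groups are already available as sets of species names. That input format and the function name `missing_species_counts` are hypothetical, and parsing the OMA output is not shown. For every group that contains exactly 7 of the 8 species, the sketch records which species is absent.

```python
from collections import Counter

# Species labels as used in the poster's figures.
SPECIES = [
    "M.graminicola", "P.nodorum", "L.pustulata", "U.reesii",
    "A.kawachii", "S.sclerotiorum", "V.dahliae", "C.globosum",
]


def missing_species_counts(groups, species=SPECIES):
    """Count, per species, the groups in which it is the only absent species.

    `groups` is an iterable of sets of species names (assumed input format).
    Groups missing zero species, or more than one, are ignored.
    """
    all_species = set(species)
    counts = Counter({name: 0 for name in species})
    for members in groups:
        absent = all_species - set(members)
        if len(absent) == 1:
            counts[absent.pop()] += 1
    return counts


# Toy data for illustration only, not the poster's results.
toy_groups = [
    set(SPECIES),                        # complete group, ignored
    set(SPECIES) - {"L.pustulata"},      # only L. pustulata missing
    set(SPECIES) - {"L.pustulata"},
    set(SPECIES) - {"U.reesii"},         # only U. reesii missing
    {"M.graminicola", "P.nodorum"},      # six species missing, ignored
]
toy_counts = missing_species_counts(toy_groups)
```

On real data, a species that is absent far more often than the others would stand out in these counts. Whether an absence reflects true gene loss or an assembly or annotation gap has to be checked separately.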

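The twin-set comparison (cf. 2) reports NG50 relative to the reference length. As a minimal sketch under the usual definition, NG50 is the length of the shortest contig in the smallest set of longest contigs that together cover at least half of the reference. The function name `ng50` and the choice to return `None` when the assembly covers less than 50% of the reference are assumptions for illustration, not the poster's implementation.

```python
def ng50(contig_lengths, reference_length):
    """Return the NG50 of an assembly, or None if it is undefined.

    Contigs are visited from longest to shortest. NG50 is the length of the
    contig at which the running total first reaches half the reference length.
    If the whole assembly stays below that threshold, None is returned.
    """
    if reference_length <= 0:
        raise ValueError("reference_length must be positive")
    target = reference_length / 2
    running_total = 0
    for length in sorted(contig_lengths, reverse=True):
        running_total += length
        if running_total >= target:
            return length
    return None


# Toy numbers for illustration only.
example = ng50([50, 40, 30, 20, 10], reference_length=200)   # 50 + 40 + 30 >= 100
too_small = ng50([10, 10], reference_length=100)             # covers only 20%
```

Returning `None` for the second case mirrors the situation in which an assembly covers less than half of the reference, where no NG50 value can be stated.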
References
[1] Greshake B, Zehr S, Dal Grande F et al. Mol Ecol Res (2015) epub ahead of print
[2] Magoc T and Salzberg S. Bioinformatics (2011) 27(21):2957-63
[3] http://genome.jgi.doe.gov/Clagr2/
[4] http://genome.jgi.doe.gov/Astpho2/
[5] Huang W, Li L, Myers JR, Marth GT. Bioinformatics (2012) 28(4):593-594
[6] Zerbino DR and Birney E. Genome Research (2008) 18:821-829
[7] Namiki T, Hachiya T, Tanaka H, Sakakibara Y. Nucleic Acids Res (2012) 40(20):e155
[8] Bankevich A, Nurk S, Antipov D et al. Journal of Computational Biology (2012) 19(5):455-477
[9] http://sourceforge.net/projects/mira-assembler/
[10] Haider B, Ahn T, Bushnell B et al. Bioinformatics (2014) btu395
[11] Simpson JT and Durbin R. Bioinformatics (2010) 26(12):i367-i373
[12] https://github.com/PacificBiosciences/FALCON
[13] Boetzer M and Pirovano W. BMC Bioinformatics (2014) 15:211
[14] https://github.com/jgurtowski/ectools
[15] Huson DH, Mitra S, Ruscheweyh H et al. Genome Research (2011) 21:1552-1560
[16] Campbell MS, Holt C, Moore B, Yandell M. Curr Protoc Bioinformatics (2014) 48:4.11.1-4.11.39
[17] Mike Schatz, PAG 2014 (http://schatzlab.cshl.edu/presentations/2014-01-14.PAG.Single%20Molecule%20Assembly.pdf)
[18] http://omabrowser.org/standalone/

Contact
Bastian Greshake, [email protected]
Goethe University, Frankfurt am Main, Germany
Max-von-Laue-Straße 13, 60438 Frankfurt am Main

doi:10.6084/m9.figshare.1562330