Sequencing and Characterizing the Genome of the Lichen Lasallia pustulata

Bastian Greshake †, Francesco Dal Grande *, Imke Schmitt *, Ingo Ebersberger †

† Department for Applied Bioinformatics, Institute for Cell Biology and Neuroscience, Goethe University, Frankfurt am Main, Germany
* Biodiversity and Climate Research Centre, Senckenberg Gesellschaft für Naturforschung, Frankfurt am Main, Germany

Motivation

Lichens are composite organisms comprising a fungal mycobiont and one or several species of green algae or cyanobacteria as photobionts. Fossil evidence for lichens dates back to the Early Devonian, approximately 400 MYA. As an effect of this long-standing interaction, both mycobiont and photobiont often grow poorly without their partner. In some cases, such as for Lasallia pustulata, solitary cultivation of the mycobiont has been impossible so far. The molecular basis for this reciprocal dependence remains to be determined, and the understanding of the evolutionary implications of lichenization for the interacting partners in general is still in its infancy. This circumstance is partly due to the scarcity of both genome sequences and transcriptome data for lichens.

Here we present an initial analysis of the Lasallia pustulata genome and transcriptome. De novo assembly quality is highly dependent on the input data, the chosen algorithm and the parameter settings. Using a simulation approach, we first explored the performance of different assembly strategies on simple meta-genomes of varying coverage ratios. The best performing strategy was then used to reconstruct the genome sequences of the lichen Lasallia pustulata from a set of 30 million MiSeq reads. The resulting data for the mycobiont were then used for initial gene prediction, phylogenetic placement and functional annotation.

1. Assembly Strategy

Shotgun sequencing of the lichen L. pustulata yielded 15 million MiSeq read pairs of 250 bp in length, with a mean insert size of 336 bp (Figure 1). We simulated twin sets resembling the L. pustulata data in insert size distribution, read count and read length using ART [1] and the draft genomes of the lichenized fungus Cladonia grayi and its photobiont Asterochloris sp. Coverage ratios (fungus : alga) varied between 10:0 and 0:10 across the 11 twin sets. We then assessed the performance of four different assemblers on these mixed-species data (Figure 2).

Figure 1: Insert size distribution for the L. pustulata whole genome shotgun library (x-axis: insert size; y-axis: number of fragments).

3. Tree Reconstruction

We used HaMStR [10] to identify orthologs to 162 genes, which have previously been used to resolve the pezizomycete phylogeny [11]. Orthologous sequences were aligned with MAFFT [12] (L-INS-i) and concatenated into a supermatrix with 115,155 amino acid positions. Removing columns with >50% undetermined amino acids or gaps left a supermatrix of 45,999 amino acid positions.

Maximum likelihood tree reconstruction with RAxML [13] (LG+G+F) produced the tree in Figure 5. L. pustulata is placed within the monophyletic Lecanoromycetes, which form the sister clade of the Dothideomycetes.
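The column filter used for the supermatrix (dropping positions where more than 50% of the taxa have a gap or an undetermined amino acid) can be sketched as follows. This is a minimal illustration, not the script used for the poster: the function name filter_columns, the set of characters treated as missing ("-", "?", "X") and the toy alignment are assumptions made for the example.

```python
from typing import Dict

# Characters counted as missing data (assumed coding: gap, unknown, undetermined residue).
MISSING = set("-?X")


def filter_columns(alignment: Dict[str, str], max_missing: float = 0.5) -> Dict[str, str]:
    """Drop alignment columns in which more than max_missing of the taxa are missing."""
    taxa = list(alignment)
    lengths = {len(seq) for seq in alignment.values()}
    if len(lengths) != 1:
        raise ValueError("all aligned sequences must have the same length")
    n_cols = lengths.pop()

    keep = []
    for col in range(n_cols):
        missing = sum(alignment[t][col].upper() in MISSING for t in taxa)
        # A column is removed only if strictly more than max_missing is missing.
        if missing / len(taxa) <= max_missing:
            keep.append(col)

    return {t: "".join(alignment[t][c] for c in keep) for t in taxa}


# Hypothetical toy alignment with four taxa and five columns.
toy = {
    "taxon_a": "MK-LV",
    "taxon_b": "M--LV",
    "taxon_c": "MX-LI",
    "taxon_d": "MKA-V",
}
filtered = filter_columns(toy)
```

In the toy alignment only the third column is removed (three of four taxa are missing there); the second column sits exactly at 50% and is kept, because the criterion is ">50%".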

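Figure 5: Phylogeny of the [...]; branch labels give the bootstrap support. Clade labels in the figure: Eurotiomycetes, Lecanoromycetes, Dothideomycetes.

[Figure 2 plot: total assembly length (Mbp) of the alga, fungus and metagenome assemblies per coverage ratio (fungus : alga); N50 size legend: 0.2 Mbp, 1.4 Mbp, 4 Mbp.]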
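The assembler comparison in Figure 2 and the assembly statistics in Box 1 and Box II rely on the contig N50: the length of the contig at which, going from the longest contig downwards, the cumulative length first reaches half of the total assembly length. A minimal sketch of these statistics is given here; the function name assembly_stats and the toy contig lengths are illustrative assumptions, not data from the poster.

```python
from typing import Dict, Iterable


def assembly_stats(contig_lengths: Iterable[int]) -> Dict[str, int]:
    """Return contig count, total length, largest contig and N50 for a set of contigs."""
    lengths = sorted(contig_lengths, reverse=True)
    if not lengths:
        raise ValueError("at least one contig is required")

    total = sum(lengths)
    running = 0
    n50 = 0
    for length in lengths:
        running += length
        # N50 is the first contig length at which half of the total is covered.
        if 2 * running >= total:
            n50 = length
            break

    return {
        "contigs": len(lengths),
        "total_length": total,
        "largest": lengths[0],
        "n50": n50,
    }


# Hypothetical contig lengths in bp (not taken from the L. pustulata assembly).
toy_lengths = [100, 80, 60, 40, 20]
stats = assembly_stats(toy_lengths)
```

For the toy input the total length is 300 bp; the two longest contigs together cover 180 bp, so the N50 is 80 bp.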

Figure 2: Assembly results for the 11 simulated lichen whole genome shotgun data sets. Coverage ratios for the alga and the fungus vary from 10:0 to 0:10. The heights of the vertical bars represent the contig N50 sizes.

For data derived from a single species (10:0; 0:10), all assemblers perform comparably. Only a single assembler, MIRA, is unaffected by the varying coverage ratios and outperforms all other assemblers. Thus, the assembly of L. pustulata was done with MIRA (Box 1).

Box 1: Lasallia pustulata Assembly
- Number of Contigs: 64,180
- Total Assembly Length: 119,028,408 bp
- Largest Contig: 520,743 bp
- N50 Size: 3,373 bp

4. Functional Annotation (Gene Ontology)

We preliminarily annotated the 8,156 L. pustulata genes with Gene Ontology terms using Blast2GO [14]. About 7,500 of our genes could be annotated with GO terms (Figure 6) and 5,000 genes were assigned Enzyme Codes (Figure 7).

[Figure 6 plot, panel title "GO-level distribution": x-axis: GO level (0 to 15); y-axis: number of annotations (0 to 1,750).]
[Figure 7 plot, panel title "Data Distribution": x-axis: sequences (0 to 7,500); categories: Without Blast Results, Without Blast Hits, With Blast Results, With Mapping Results, Annotated Sequences.]

Figure 6: Distribution of GO terms for biological process (P), molecular function (F) and cellular component (C). GO Level: Total Annotations = 23900, Mean Level = 6.452, Std. Deviation = 2.877.

Figure 7: Distribution of annotated to unannotated sequences (including the category Total Sequences).

2. Taxonomic Assignment & Gene Prediction

The taxonomic assignment of the L. pustulata meta-genome contigs was done with MEGAN [6] (Box II & Figure 3).

Figure 3: Workflow used for our MEGAN analysis.
[Workflow diagram: the 64,180 contigs are searched with BlastN (nucleotides) and with BlastX (amino acids) against 1,471 bacteria + 560 viruses (NCBI), 110 fungi, 20 plants and 20 animals. MEGAN (min support: 1, min score: 50, top percent: 10) gives a taxonomic placement for each search (nt and aa). Conflict? yes: discard (2,141 contigs); no: assignment to fungus or alga.]

Box II: Taxonomic Assignment
                     Fungus           Alga
Number of Contigs    6,977            8,872
Total Length         37,469,368 bp    14,839,567 bp
Largest Contig       161,762 bp       21,823 bp
N50 Size             19,048 bp        2,158 bp

Gene prediction was done with AUGUSTUS [7] only for the fungal contigs, using a two-step procedure. First, AUGUSTUS was trained using L. pustulata genes found by CEGMA [8] (298 proteins in 231 groups, 93.15% completeness). Second, complementary RNASeq data from L. pustulata were mapped to the contigs with Tophat [9] and used to create intron evidence for a refined gene prediction. AUGUSTUS annotated 8,156 genes with an average length of 418 amino acids (Figure 4).

Figure 4: Length of the Predicted Proteins (x-axis: length in amino acids, 0 to 2000; y-axis: count).

Summary

Sequencing
RNA
- Isolated from lichen thallus
- Sequencing via MiSeq
- 15 million read pairs of 35-250 bp length
- Insert sizes from 90 bp to 660 bp (average ~350 bp)
DNA
- Isolated from lichen thallus
- Sequencing via MiSeq
- 15 million read pairs of 250 bp length
- Insert sizes from 250 bp to 640 bp (average ~350 bp)

Data Preprocessing (RNA and DNA)
- Adapter trimming (cutadapt)
- Merging overlapping paired end reads (FLASH)

Assembler Evaluation on Twin Sets
- Simulations based on genomes of lichenized fungus Cladonia grayi and photobiont Asterochloris sp.
- Twin Sets: Read length, read count and insert size distribution as observed in real data
- 11 ratios of fungus to alga
- Evaluating: MIRA [2], Velvet [3], MetaVelvet [5], String Graph Assembler [4]

de novo Assembly
- Input: 13 million overlapped reads, 2 million read pairs
- Assembly with MIRA
Results in:
* 64,180 contigs
* 119,028,408 bp total length
* N50 size: 3,373 bp

Taxonomic Assignment
- MEGAN (sequence similarity based taxonomic assignment)
Database for MEGAN consisting of proteomes & genomes of:
* viral & prokaryotic genomes in NCBI Genome
* Eukaryotes: 110 fungi, 20 plants, 20 animals

Gene Prediction
- AUGUSTUS, trained for the fungal data
Training data:
* genes found via CEGMA as first evidence
* RNASeq data as second set of evidence

Annotation
- Blast against NCBI nr-prot database
- Mapping of GO terms and Enzyme Codes with BLAST2GO

Tree Reconstruction
- Search for orthologs to 162 genes amongst 126 fungal species with HaMStR
- Align with MAFFT
- Remove columns with >50% gaps or ambiguities
- Reconstruct tree with RAxML

Conclusions
- MIRA (Overlap Consensus-based) consistently performs best on assembling simple meta-genomes
- Using MIRA & MEGAN we were able to recover ~37.5 Mbp of the mycobiont genome of Lasallia pustulata
- AUGUSTUS trained with additional RNAseq data annotated 8,156 genes
- Preliminary functional annotation assigned GO terms and/or Enzyme Codes to about 7,500 of the predicted genes
- Phylogenomic tree reconstruction firmly places L. pustulata within the Lecanoromycetes, and those as sister group to the Dothideomycetes
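The taxonomic assignment step lists three MEGAN parameters (min support: 1, min score: 50, top percent: 10). How the two hit filters interact with a lowest-common-ancestor (LCA) placement can be illustrated with a simplified sketch. It is not MEGAN's implementation: the function names, the representation of a hit as a (bit score, lineage) pair and the toy hits are assumptions for this example, and the min support setting is not modelled.

```python
from typing import List, Optional, Sequence, Tuple

# A hit is (bit score, lineage from the root down to the matched taxon).
Hit = Tuple[float, Sequence[str]]


def filter_hits(hits: List[Hit], min_score: float = 50.0, top_percent: float = 10.0) -> List[Hit]:
    """Keep hits with score >= min_score that lie within top_percent of the best score."""
    passing = [h for h in hits if h[0] >= min_score]
    if not passing:
        return []
    best = max(score for score, _ in passing)
    cutoff = best * (1.0 - top_percent / 100.0)
    return [h for h in passing if h[0] >= cutoff]


def lowest_common_ancestor(lineages: List[Sequence[str]]) -> List[str]:
    """Return the longest lineage prefix shared by all given lineages."""
    if not lineages:
        return []
    prefix = list(lineages[0])
    for lineage in lineages[1:]:
        shared = 0
        for a, b in zip(prefix, lineage):
            if a != b:
                break
            shared += 1
        prefix = prefix[:shared]
    return prefix


def assign_contig(hits: List[Hit], min_score: float = 50.0, top_percent: float = 10.0) -> Optional[str]:
    """Place a contig at the LCA of its retained hits, or None if nothing is retained."""
    kept = filter_hits(hits, min_score, top_percent)
    lca = lowest_common_ancestor([lineage for _, lineage in kept])
    return lca[-1] if lca else None


# Hypothetical hits for one contig (scores and lineages are made up).
toy_hits: List[Hit] = [
    (200.0, ["cellular organisms", "Eukaryota", "Fungi", "Ascomycota"]),
    (190.0, ["cellular organisms", "Eukaryota", "Fungi", "Basidiomycota"]),
    (60.0, ["cellular organisms", "Eukaryota", "Viridiplantae"]),
    (40.0, ["cellular organisms", "Bacteria"]),
]
placement = assign_contig(toy_hits)
```

With the default settings, the bacterial hit fails the minimum score and the plant hit falls outside the top 10% window, so the toy contig is placed at the common ancestor of the two remaining fungal hits. Widening the window pulls in more distant hits and moves the placement towards the root.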

References

[1] Huang W, Li L, Myers JR, Marth GT. Bioinformatics (2012) 28 (4): 593-594.
[2] http://sourceforge.net/projects/mira-assembler/
[3] Zerbino DR, Birney E. Genome Research (2008) 18: 821-829.
[4] Simpson JT, Durbin R. Bioinformatics (2010) 26 (12): i367-i373.
[5] Namiki T, Hachiya T, Tanaka H, Sakakibara Y. Nucleic Acids Res (2012) 40 (20): e155.
[6] Huson DH, Auch AF, Qi J, et al. Genome Research (2007) 17: 000.
[7] Stanke M, Steinkamp R, Waack S, Morgenstern B. Nucleic Acids Research (2004) 32: W309-W312.
[8] Parra G, Bradnam K, Korf I. Bioinformatics (2007) 23: 1061-1067.
[9] Trapnell C, Pachter L, Salzberg SL. Bioinformatics (2009) 25 (9): 1105-1111.
[10] http://sourceforge.net/projects/hamstr/
[11] Ebersberger I, de Matos Simoes R, Kupczok A, Gube M, Kothe E, Voigt K, von Haeseler A. Mol Biol Evol (2012) 29 (5): 1319-1334.
[12] Katoh K, Standley DM. Molecular Biology and Evolution (2013) 30: 772-780.
[13] Stamatakis A. Bioinformatics (2006) 22 (21): 2688-2690.
[14] Conesa A, Götz S, Garcia-Gomez JM, Terol J, Talon M, Robles M. Bioinformatics (2005) 21: 3674-3676.

Contact
Bastian Greshake, [email protected]
Goethe University, Frankfurt am Main, Germany
Max-von-Laue-Straße 13, 60438 Frankfurt am Main

Poster URL
doi:10.6084/m9.figshare.1046678