Umeå Science Center!

Working without a “reference”

EBI, Cambridge, UK, October 22nd, 2013
Nicolas Delhomme

Outline

• Introduction on de-novo assembly tools
• “reference” genome? As in
  – no good reference, but still a genome?
  – no genome whatsoever, transcriptome only?
• Combining the genome and the transcriptome

2 de-novo assembly tools

• genome
  – CLC, ABySS, velvet, FERMI
  – GAM (graph assembly)
  – Opera, BESST (scaffolder)
  – ...
• transcriptome
  – SOAP-denovo, OASES, Trans-ABySS, Trinity, ...

3 Example driven

4 A “reference” genome

5 The Socio-economic Interest

• Sweden’s most economically important tree
  – 30% of net exports
  – 3,000 spruce trees per citizen
  – Annual growth increment worth 3 billion USD
  – ~300 USD per citizen per year
• An available genome sequence will aid and facilitate
  – Genomic selection / breeding for biomass productivity, quality, health
  – Optimisation of cellulose and wood fibre qualities (new materials)
  – Optimised feedstock for bio-refineries

6 The Biological Interest

• Science of conifers
  – Evolution: the last major plant group without a sequenced genome
  – Ecology: dominant members of boreal forests
  – Biology: unique biological features

7 Sequencing and Assembly

Challenges
• 19.6 Gbp genome
• 12 evenly sized chromosomes (chromosome sorting not possible)
• Fairly high heterozygosity
• High repeat content

Vischi et al (2003)

A Minina, SLU BioCenter Uppsala

8 Genome Assembly approach

[Figure: genome assembly approach. Labels recoverable from the slide:]

• Input data
  – Haploid tissue: WGS (38x)
  – Diploid tissue: fosmid pools (450 pools) and WGS (95x span)
  – Transcriptome (22 samples): RNAseq (38M read pairs)
  – Insert sizes: 300 bp, 650 bp, 2.4 kbp, 4.4 kbp, 10.4 kbp
• Assembly steps (total assembly size / size in scaffolds > 10 kb)
  – Haploid WGS assembly: 9.8 Gbp / 0.2 Gbp
  – Fosmid pool assemblies
  – GAM-NGS merged assembly: 12.0 Gbp / 2.0 Gbp
  – BESST scaffolded P. abies 1.0 assembly: 12.0 Gbp / 4.3 Gbp
• Other panels
  – Genome coverage (%) by feature class: LTR gypsy, LTR copia, LTR unknown, LINE, DNA TE, unclassified, low copy, unassembled reads
  – Cumulative distribution of the % of gene aligned to a single scaffold
  – Feature Response Curve (Francesco Vezzi, collaborating with Bud Mishra and Giuseppe Narzisi)

9 Gene discovery pipeline

[Pipeline diagram. Labels recoverable from the slide:]

• GMAP: public, Trinity and 454 transcripts
• 256 manually curated high quality genes
• Cufflinks: digiNorm (bwa subset + scaffolds >10 Kbp)
• BLASTx: proteins
• Augustus + Eugene
• Splice Machine

10 Gene Number estimation

• Approach of Rigault et al., 2011 (Picea glauca), derived from the Ewing and Green method (human, 2001), itself derived from Waterston et al., 1992 (C. elegans). It was also used by Alexandrov et al., 2009 for Zea mays.

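• Based on the rule of proportionality: the fraction of g2 recovered in g1 (m / g2) is expected to equal the fraction of all genes that g1 represents (g1 / G), hence

G × m = g1 × g2

with g1 a fraction of all genes G, g2 an independent set of genes and m the size of the intersection of g1 and g2. We derive G as

G = (g1 × g2) / m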

11 Gene Number estimation

• 4 sets
  – public: PlantGDB ESTs (Pgdb), UCDavis PUTs (UCD)
  – ours: Trinity, Newbler

      Pgdb-UCD   Pgdb-Newbler   Pgdb-Trinity   UCD-Newbler   UCD-Trinity   Newbler-Trinity
m        1,275          1,592            348        20,466         5,056             9,930
g1      17,985         48,519          6,303        46,923        12,405            15,170
g2       8,685          8,679          8,685        17,364        17,488            47,179
G      122,510        264,507        157,303        39,810        42,907            72,075
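The estimate is simple enough to check by hand. The following is a minimal Python sketch of the proportionality rule applied to the six pairs of sets from the slide; the function and variable names are illustrative and not taken from the original analysis.

```python
# Sketch of the proportionality estimate G = g1 * g2 / m.
# Names are illustrative; the numbers are the ones shown on the slide.

def estimate_gene_number(g1: int, g2: int, m: int) -> float:
    """Estimate the total gene number G from two gene sets and their overlap m."""
    if m <= 0:
        raise ValueError("the two sets must share at least one gene")
    return g1 * g2 / m

# (g1, g2, m) for each pair of transcript sets
PAIRS = {
    "Pgdb-UCD": (17_985, 8_685, 1_275),
    "Pgdb-Newbler": (48_519, 8_679, 1_592),
    "Pgdb-Trinity": (6_303, 8_685, 348),
    "UCD-Newbler": (46_923, 17_364, 20_466),
    "UCD-Trinity": (12_405, 17_488, 5_056),
    "Newbler-Trinity": (15_170, 47_179, 9_930),
}

if __name__ == "__main__":
    for name, (g1, g2, m) in PAIRS.items():
        print(f"{name}: G = {estimate_gene_number(g1, g2, m):,.1f}")
```

The estimates agree with the G row of the table to within one gene. The spread across pairs is wide because a small overlap m inflates G.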

12 CEGMA and PLAZA Core Gene Sets

CEGMA* representation (master assembly)

      Complete   Partial
#          137       213
%        55.24     85.89

* Core Eukaryotic Genes Mapping Approach

[Figure: PLAZA core gene coverage. Bar chart of the percentage of core genes (axis 0 to 60%) by protein sequence coverage in a single contig (40~90%, >90%) for the Master, Diploid and Fosmid assemblies.]

Conclusions
• Assembly contains the majority of the gene space
• Gene space remains partially fragmented
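The CEGMA percentages follow from the counts once the size of the core gene set is known. The slide does not state the denominator; the sketch below assumes the standard CEGMA set of 248 core eukaryotic genes, which reproduces both percentages.

```python
# Assumption: the CEGMA core set holds 248 genes (not stated on the slide).
CEGMA_CORE_GENES = 248

def percent_of_core(found: int, total: int = CEGMA_CORE_GENES) -> float:
    """Percentage of the core gene set that was recovered, rounded to 2 decimals."""
    return round(100 * found / total, 2)

if __name__ == "__main__":
    print("complete:", percent_of_core(137))
    print("partial: ", percent_of_core(213))
```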

13 Fragmentation Estimation

• 12,500 Picea sitchensis full-length ESTs
• 30% fragmentation at a 90% identity threshold

[Figure panels: 50% identity; 90% identity]

14 Very long introns

[Slide shows an excerpt of the supplementary material. Recoverable content: fragmentation was found to result often from the presence of long annotated repeats. Numerous cases of fragmented genes potentially represent extremely long introns, so the intron size metrics reported should be regarded as lower bounds. A supplementary table compares size statistics for seven sequenced plant genomes (labelled Arabidopsis, Poplar, Grape, Rice, Maize, Selaginella, Physcomitrella) and the Norway spruce high-confidence gene set; its values are not recoverable from the extraction. A supplementary figure shows representative gene structures for two MADS-box genes whose exons are fragmented across numerous genomic scaffolds.]

• 2,384 HC genes contain 2,697 introns >5 Kbp
• 2,679 (99%) introns contain TE

15

[Slide reproduces a page of the P. abies genome article (Nature, 2013). Figure 1: the gene-space and transcribed fraction of the P. abies 1.0 assembly. a, Gene family loss and gain in eight sequenced plant genomes. b, Boxplot of the length distribution of the 10% longest introns in the same eight genomes. c, Cumulative intron length against log10 expression (FPKM) for high-confidence gene loci and lncRNA loci. d, Distribution of small (18–24 nt) RNAs and their colocation to genomic features (repeats, high-confidence genes and their promoter/UTRs).]

Values from panel a (‘Orphans’ refers to gene families containing only a single gene):

Species                      Gene families   Orphans    Genes
Arabidopsis thaliana                 8,440     1,780   27,407
Populus trichocarpa                  8,925     2,396   40,141
Vitis vinifera                       7,987     2,779   26,238
Oryza sativa                        10,049     7,353   41,363
Zea mays                             9,847     3,785   39,172
Picea abies                          6,615     1,837   28,354
Selaginella moellendorffii           6,010       947   18,384
Physcomitrella patens                7,607     6,770   28,090

16 Genome stats

• Very drafty
  – 10,000,000 scaffolds!
• But
  – Almost complete gene space
• But
  – estimated 30% fragmentation
• But this can be addressed
  – using RNA-Scaffolding
  – using long reads (PacBio), e.g. see L_RNA_Scaffolder (Xue W. et al., BMC Genomics, 2013)

17 No genome, just transcripts

18 Strategy: read mapping vs. de-novo assembly

19 Strategy: read mapping vs. de-novo assembly

20 21 Expression pattern goes by tissue first

22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 Combining both

39 The final gene discovery pipeline

[Pipeline diagram. Labels recoverable from the slide:]

• 21 tissue/condition/time-point samples, Illumina mRNA-Seq, 50M 2x100 reads per sample
  – Assemble all together with Trinity (min_kmer=10)
  – Identify contaminants
• GMAP: public, Trinity and 454 transcripts
• 256 manually curated high quality genes
• Cufflinks: digiNorm (bwa subset + scaffolds >10 Kbp)
• BLASTx: proteins
• Augustus + Eugene
• Splice Machine

40 An expression catalogue

• 22 libraries, 50 M PE reads per library
• Normalised pool, 1 plate 454 mRNA + total RNA
• Newbler (isogroups/isotigs)
  – mRNA: 26,364/36,069
  – total RNA: 23,876/33,426
• Trinity (component/cluster/sequence)
  – Total: 77,189/88,098/118,799
• Many large-scale RNA-Seq studies focusing on key biological questions now underway

[Figure-1 (Grabherr): panel A, number of transcripts per taxonomic assignment (labels include NA, Archaea, Metazoa, Fungi, Viridiplantae, Eukaryota); panels B to D, expression (log2 FPKM) against GC percent (panel titles include All transcripts, Fungi, Embryophyta).]

41 Combining the genome and transcriptome

[Diagram: a genomic scaffold with two alignment tracks, BLASTn-aligned Trinity transcripts and bwa-aligned digiNorm RNA-Seq reads. Sizes shown: 27 Mbp, 72 Mbp.]

Assembly       bwa hit, Mbp (total)   GMAP alignment, exon/gene Mbp (total)   No alignment
P. abies 1.0   524 (5 Gbp)            72/149 (2.3 Gbp)                        26,140

Numerous pseudogenes!

42 Are we done? Wait!

[Diagram: a genomic scaffold with two alignment tracks, BLASTn-aligned Trinity transcripts and bwa-aligned digiNorm RNA-Seq reads. Sizes shown: 27 Mbp, 72 Mbp, 524 Mbp.]

Assembly       bwa hit, Mbp (total)   GMAP alignment, exon/gene Mbp (total)
P. abies 1.0   524 (5 Gbp)            72/149 (2.3 Gbp)

No alignment? 26,140

43 Validation: unexpected contamination

44 45

• Appears genuine enough
• Contains: fungi, chlorophyta

Delhomme, Street and Grabherr, submitted

That’s more like it

46 A new meta-transcriptomics approach

[Figure-5 (Grabherr): phylogenetic tree placing sequences sampled from spruce (labelled sample04 to sample23) among fungal reference taxa such as Lophodermium piceae, Penicillium purpurogenum, Leptographium truncatum, Mycosphaerella punctiformis and Sphaeropsis sapinea. Branch support values range from 0.962 to 1; scale bar 0.3. Legend: sampled from spruce; Arthoniomycetes, Arthoniales; Lecanoromycetes, Acarosporales; Leotiomycetes, Rhytismatales; Eurotiomycetes, Eurotiales; Sordariomycetes, Ophiostomatales; Dothideomycetes, Capnodiales; Dothideomycetes, Botryosphaeriales; Dothideomycetes, Pleosporales; Dothideomycetes, incertae sedis.]

47 Conclusion

Manuscript
• Genome size
  – due to repeat
  – no evidence of WGD
  – suggest a less active mechanism of TE removal than in angiosperms (e.g. unequal recombination)
• Transcriptome assembly
  – 26,000 High Confidence genes
  – numerous pseudogenes
  – lncRNA, sRNA

New
• Meta-genome assembly
  – standard RNA-Seq can be used for meta-transcriptomics.
  – Delhomme, Street and Grabherr, submitted
• New sequencing data and new approaches
  – PacBio
  – Gene Fusion tools

48 Deceiving Gene Fusion Tools

• Blast A. thaliana XBCP3 against the genome
  – 5’ end on scaffold MA_10436540 (MC)
  – 3’ end on scaffold MA_280780 (HC)

• Running FusionMap (Ge H. et al., Bioinformatics, 2011)
  – “fusion” from MA_10436540 (14884, +) to MA_280780 (1471, +) with a canonical GT-AG splice site
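The underlying idea, a gene whose 5’ and 3’ parts align to different scaffolds, can be sketched independently of any fusion tool. The Python sketch below is not FusionMap's algorithm: it takes alignment hits that are already parsed and flags queries whose adjacent hits along the query lie on different scaffolds. The `Hit` record, the `max_overlap` threshold and all query coordinates are hypothetical; only the two scaffold names come from the slide.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass(frozen=True)
class Hit:
    """One alignment of a query (gene or transcript) to a scaffold (hypothetical record)."""
    query: str
    scaffold: str
    q_start: int  # 1-based start on the query
    q_end: int    # inclusive end on the query

def split_gene_candidates(hits, max_overlap=10):
    """Return {query: (5' scaffold, 3' scaffold)} for queries split across scaffolds.

    Two hits adjacent along the query qualify when they sit on different
    scaffolds and overlap by at most `max_overlap` query positions.
    """
    by_query = defaultdict(list)
    for hit in hits:
        by_query[hit.query].append(hit)

    candidates = {}
    for query, query_hits in by_query.items():
        ordered = sorted(query_hits, key=lambda h: h.q_start)
        for left, right in zip(ordered, ordered[1:]):
            overlap = left.q_end - right.q_start + 1
            if left.scaffold != right.scaffold and overlap <= max_overlap:
                candidates[query] = (left.scaffold, right.scaffold)
                break
    return candidates

if __name__ == "__main__":
    # Query coordinates are invented for illustration.
    example_hits = [
        Hit("XBCP3", "MA_10436540", 1, 600),
        Hit("XBCP3", "MA_280780", 595, 1100),
        Hit("other_gene", "MA_1", 1, 900),
    ]
    print(split_gene_candidates(example_hits))
```

A real analysis would also check strand, splice-site motifs and alignment identity before accepting a link.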

49 Using long PacBio reads

• MADS27 from scaffold MA_17919 to MA_20467 is
  – not found as a fusion
  – but supported by 16 PacBio reads

50 Scaffolding to increase contiguity

• Out of 28,354 HC genes
  – 5,561 are on RNA-Seq scaffolds.
  – 3,142 gaps are spanned by 2,568 genes.
• Gene fusions create 2,984 new scaffolds
• PacBio reads create 3,629 new scaffolds
  – 13,831 reads spanning 2 scaffolds and having a [99-100]% coverage of the gene model.
• Lin, Street and Delhomme, in preparation
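Counting the new scaffolds created by such links amounts to grouping scaffolds that are connected, directly or transitively, by a gene or a read. The sketch below shows that grouping step with a union-find structure. It illustrates the idea and is not the pipeline used for the numbers above; it ignores scaffold order and orientation, and the scaffold name `MA_hypothetical` is invented.

```python
def merge_scaffolds(links):
    """Group scaffolds connected by links; each group of two or more is one new scaffold.

    `links` is an iterable of (scaffold_a, scaffold_b) pairs, for example one
    pair per gene or long read that spans two scaffolds.
    """
    parent = {}

    def find(name):
        # Register unseen scaffolds, then walk to the root with path halving.
        parent.setdefault(name, name)
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for a, b in links:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = {}
    for name in list(parent):
        groups.setdefault(find(name), set()).add(name)
    return [members for members in groups.values() if len(members) > 1]

if __name__ == "__main__":
    example_links = [
        ("MA_17919", "MA_20467"),          # MADS27, supported by PacBio reads
        ("MA_10436540", "MA_280780"),      # XBCP3 "fusion"
        ("MA_20467", "MA_hypothetical"),   # invented third scaffold
    ]
    for members in merge_scaffolds(example_links):
        print(sorted(members))
```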

51 Acknowledgments

• Nathaniel Street

• Yao-Cheng Lin

• Manfred Grabherr

• The spruce genome project

• You for your attention

52