
1/25/2019

Chromosome-level genome assembly of the small aquatic monocot provides evidence for epigenetic regulation of genome purging dynamics

Todd P. Michael, PhD
Professor, Director of Informatics
J. Craig Venter Institute, La Jolla, CA USA
PAG XXVII, Jan 12, 2019

Duckweed nuclear genome sizes 10 fold span

[Figure: genome size (Mb, 0 to 2000) across duckweed accessions (0 to 120); labeled genera: Wolffiella, Landoltia, Spirodela. Wang et al., 2010]
Flow cytometry: 3 biological replicates; (300 Mb) and Selaginella (100 Mb) internal controls

Not your favorite algae
Spirodela polyrhiza: basal monocot, aquatic, fast growing, potential model system

Community Sequencing Project (CSP)
Department of Energy and the Joint Genome Institute (DOE-JGI) to sequence the 150 Mb Spirodela genome
Raritan Canal, New Brunswick, New Jersey

ARTICLE | NATURE COMMUNICATIONS | DOI: 10.1038/ncomms4311

Resolved Spirodela into 32 supercontigs
FISH supported the 32 BAC-based pseudomolecules
Wang et al., Nature Comm 2014

However, the fastest plant for biomass production
• Doubling time in optimal conditions is less than 2 days (twice as fast as other "fast" plants)
• A Wolffia microscopica plant doubling every 30 hours could theoretically give rise to 10^30 plants in ~4 months (a volume roughly equivalent to the planet).
• Potential biomass yield
  – 4 metric tons/hectare/day fresh
  – 80 metric tons/hectare/year dry
• Realized biomass yield in Mirzapur, Bangladesh
  – 0.5-1.5 metric tons/hectare/day fresh
  – 13-38 metric tons/hectare/year dry
• Production limited by space, sunlight, nutrients, and temperature.

ARTICLE | NATURE COMMUNICATIONS | DOI: 10.1038/ncomms4311

Lemnoideae, commonly known as duckweeds, are the smallest, fastest growing and morphologically simplest of flowering plants1. The plant body is organized as a thalloid or 'frond' lacking a stem and in more derived species even roots (Fig. 1). Based on fossil records, the peculiar plant body architecture of this subfamily evolved by neotenous reduction from an Araceae ancestor and it has been interpreted botanically as juvenile or embryonic tissue2. The reduction and simplification of the plant body progresses within the Lemnoideae from ancient species like Spirodela towards more derived species like Wolffia. Although reduced flowers are observed in duckweeds, they usually reproduce by vegetative daughter fronds initiated from the mother frond (Supplementary Fig. 1). Doubling time of the fastest growing duckweeds under optimal growth conditions is <30 h, nearly twice as fast as other 'fast-growing' flowering plants and more than double that of conventional crops (more under Supplementary Note 1). They are easy to grow and have negligible lignin and high energy content in the form of easily fermentable starch (40–70% of biomass). Duckweeds have been used for the removal of high levels of contaminants from wastewater3 and for the production of recombinant proteins for pharmaceutical applications4,5 [...]

From a taxonomic point of view, genomic efforts have largely focused on the taxa of the Commelinid monocots such as the grasses from Poales and Musa acuminata, the wild-type diploid progenitor of banana, from Zingiberales (Fig. 1). Here we describe the genome and transcriptome of Greater Duckweed, Spirodela polyrhiza, representing the smallest monocot genome to date with a size of 158 Mb, which is similar to the plant model genome of Arabidopsis thaliana. Spirodela represents a basal monocotyledonous species from the Alismatales and will be an invaluable genomic resource to study the history of the monocotyledonous lineage.

Results

Sequence assembly. Genome sizes in the five genera span an order of magnitude from 158 Mb in Spirodela polyrhiza to 1,881 Mb in Wolffia arrhiza7. Owing to its small size and basal position in the Lemnoideae we sequenced the Spirodela polyrhiza strain 7498 by whole-genome shotgun sequencing using ~20x single end, ~1x pair-end Roche/454 next-generation sequencing and ~1x pair-end Sanger sequencing as described under Methods (Supplementary Table 1). Although next-generation sequencing has been used to reduce the cost of sequencing genomes, short-read technologies have been insufficient to assemble chromosome-size molecules [...]

Of the 158-Mb genome, as measured by flow cytometry (Supplementary Fig. 2), 90% was assembled into contigs, 97% of the contigs assembled in 252 scaffolds and 94.1% of them in the top 50 largest scaffolds (Supplementary Table 2). To align the scaffolds with chromosomes, we constructed a Spirodela genomic library of bacterial artificial chromosomes (BACs) that was subjected to DNA fingerprinting, resulting in a physical map (Methods). BAC end sequences (BES) were used to align the assembled sequences with the physical map, providing us with proof of an accurate assembly of DNA sequences. We also used BACs that were aligned to the assembly with their sequenced ends to derive their entire sequence from the assembled sequences and used this information to select those that were low in repeat sequences for fluorescence in situ hybridization (FISH). Scaffolds were joined into 32 pseudomolecules, using the DNA fingerprinted physical map with anchored sequenced tagged sites (BESs), and one pseudomolecule labelled '0' with all unanchored scaffolds (Fig. 2; Supplementary Table 3). Gaps in sequences like centromeres amounted to 10.7% of the genome and remained in unnamed bases (Ns).

To examine how the 32 pseudomolecules relate to the 20 chromosomes of the haploid genome of Spirodela polyrhiza, we applied a cytogenetic analysis as described under Methods. [...] pseudomolecules in distinct chromosome pairs (Fig. 3). In two cases (pseudo #7 and #21) individual BACs were located on chromosomes other than the remaining contig. For instance, 0402B12 and 035P147 labelled another chromosome than the other three BACs of pseudo #7. Thus, these two pseudomolecules were chimeric and each of the two arms had to be separated and joined to another arm to form one of the 20 chromosomes. Chimerism of pseudomolecules could be due to the short reads of the '454' sequencing platform and repeat-dense regions of the chromosomes. Although future work will have to convert pseudomolecules into chromosomes, the current analysis of the gene content and order remained unaffected.

To confirm that no other chimerism underlies the overall assembly quality and completeness, we localized telomeric repeats in the pseudomolecules. In higher plants, telomeres are characterized by tandem repeats of the conserved heptamer sequence TTTAGGG. Clusters of telomeric repeats were exclusively identified at the ends of the pseudomolecules, supporting the assumption that there were no hybrids of chromosome arms (Supplementary Fig. 3). [...] the assembled 454 sequences and found that the sequencing error rates were 8 in 10,000 (Supplementary Table 6).

Repeat elements. The major sources of repeat elements in the genome are transposons and variable number tandem repeats (VNTRs). Other sources like high copy number genes (for example, rRNA genes) and different degrees of duplications (polyploidy, segmental duplications and tandem genes) usually contribute only to a far lesser extent to total repeat sequences. To identify common and special features of the Spirodela genome the repeat data were put into a comparative context with the similar-sized Arabidopsis thaliana (At) (tigr 8 version)8 and three to four monocot genomes of different sizes, Brachypodium distachyon (Bd)9, Oryza sativa (rice) (Os)10, Sorghum bicolour (Sb)11 and Zea mays (maize) (Zm)12 (Supplementary Fig. 5). Comparing the 16mer frequency of Spirodela with other plant genomes shows that the kmer curve of Spirodela follows a similar trend as the equally sized Arabidopsis genome. In both genomes kmers occurring ≥10 times are only found in ~3–4% of the sequence. In the larger monocot genomes, there is a continuous rise towards increase of genome size with kmers repeated ≥10 times starting from 12% in Brachypodium up to 63% in sorghum.

[...] They sink to the bottom of a pond and germinate into new fronds by using starch as energy. These functions require genes for starch biosynthesis including AGPase, SS plus GBSS, BE and DBE. Spirodela contained very similar gene family compositions as Arabidopsis (Supplementary Table 18). The conservation of starch gene families from phylogenetic analysis for Spirodela, rice, and Arabidopsis argues for their essential functions. The clades of AGPase large subunit and DBE had multiple members, whereas all others contained only one single member for the corresponding subgroup. SpBEIII did not cluster with any [...], but provided a separate branch as a Spirodela-specific BE member, suggesting that it might have evolved into a special function from their common ancestor (Supplementary Fig. 24) (further details on starch biosynthesis under Supplementary Discussion).

The high growth rates of Spirodela require the efficient usage of nutrients. Nitrogen is generally a major limiting factor of plant growth and a primary component in fertilizers to promote crop growth. However, leaching of fertilizers, increasing amounts of sewage and wastewater from a steadily growing world population results in water pollution. Spirodela has been successfully exploited for wastewater remediation because of its ability to remove nitrogen with high efficiency, particularly in the form of ammonia, from polluted water5. Glutamine synthetase (GS) and glutamate synthase (GOGAT) are the core enzymes of the GS/GOGAT cycle in plants, the major biochemical module for ammonium assimilation. Despite a genome-wide reduction in gene number, copy numbers of these enzymes were retained or even amplified in Spirodela with up to four times more copies of Fd-GOGAT in Spirodela compared with Arabidopsis and rice (Fig. 9; Supplementary Fig. 25) (further details on nitrogen efficiency under Supplementary Discussion).

Development and reproduction. Flowering plants undergo a series of distinct phase transitions during their life cycle, including the progression from a vegetative or juvenile to an adult phase with competency for sexual reproduction (flowering). Neoteny, the prolongation of juvenile traits, is a common phenomenon in the evolution of plant organs. The frond of the Lemnoideae has been characterized as embryonic or juvenile tissue, or as a cotyledon-like plant, iteratively bearing new cotyledons (Supplementary Fig. 1). Although Spirodela has an increased copy number of repressors of the transition from juvenile to adult phase in comparison to Arabidopsis and rice, components of the regulatory network enhancing the progression through the adult phase and the onset of an inflorescence meristem were reduced (Fig. 9; Supplementary Table 19), for example, SPB (Supplementary Fig. 26), MADS-box (Supplementary Fig. 27) and PEBP gene families (Supplementary Fig. 28). In Arabidopsis, the microRNA of miR156 is necessary and sufficient to promote the juvenile phase and inhibit the transition to the adult growth32. Copies of miR156 were highly abundant in Spirodela, with 24 loci, or up to 32 loci if highly similar isoforms were included, consistent with the pattern of preferentially retained repressors of the adult phase, whereas Arabidopsis had only 10 and rice had 19 loci. [...]

Discussion

In higher plants, gene number and genome size seem to be not correlated. Although Arabidopsis thaliana has a genome size similar to Spirodela, it contains 28% more genes. The low gene count of Spirodela could in part be due to the structural reduction and juvenile nature reducing the need for and consequently the retention or duplication of genes acting in the adult phase. In addition, Spirodela differs from previously reported angiosperm genomes in its lack of recent WGDs and retrotranspositions. The lower gene number in Spirodela may therefore be simply a consequence of the ongoing non-functionalization and loss of one copy of a duplicated gene pair, a major fate of gene duplication19.

[Figure 1: Systematics and biology of the Lemnoideae. Phylogeny with Brassicales (Arabidopsis), Solanales (tomato), Poales (rice), Zingiberales (banana), Alismatales (Spirodela) and Nymphaeales (water-lily); clonal, vegetative propagation of fronds F0, F1, F2]
[Figure 2: Characteristics of the Spirodela genome. Circular plot of the 32 pseudomolecules]
[Figure 9: Spirodela characteristic pathways. (a) Nitrogen assimilation (NR, NIR, GDH, GS, GOGAT); (b) juvenile-to-adult phase transition network (miR156, SPL, AP2, TOE, FT, SOC, SVP, FLC-like, TFL, AP1, SEP)]

NATURE COMMUNICATIONS | 5:3311 | DOI: 10.1038/ncomms4311 | www.nature.com/naturecommunications
© 2014 Macmillan Publishers Limited. All rights reserved.

Updated Spirodela polyrhiza genome assembled into 20 chromosomes with new genome features

[Figure: circular genome plot; tracks: Sequence gaps; GC content; Gene expression (no ABA); Gene expression (+ ABA); Predicted protein coding genes; Tandem repeats; Predicted TEs; smallRNAs; CpG DNA methylation; CHG DNA methylation; CHH DNA methylation; Syntenous connections]

Spirodela verified to have very small rDNA array (~80 units)

Species        rDNA units   Genome size
Spirodela      80           150 Mb
Arabidopsis    500          150 Mb

[Figure: rDNA copy-number blots for accessions 9501, 9503, 9509 and 9512 with 25S and 18S rDNA probes; digests with BamHI (B), EcoRV (E) and SacI; restriction map of the 18S, 5.8S and 25S rDNA unit, 9 kb]

Michael et al., Plant J 2017
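To put the ~80-unit rDNA array in proportion, the short sketch below multiplies it out. It assumes the "9 kb" label on the restriction map is the length of one rDNA repeat unit, which the slide does not state explicitly; the function names are illustrative only.

```python
RDNA_UNITS = 80     # array size from the slide
UNIT_KB = 9         # assumption: the "9 kb" map label is one repeat unit
GENOME_MB = 150     # genome size from the slide table


def rdna_array_kb(units, unit_kb):
    """Total length of the rDNA array in kb."""
    return units * unit_kb


def genome_fraction(array_kb, genome_mb):
    """Fraction of the genome occupied by the array."""
    return array_kb / (genome_mb * 1000)


array_kb = rdna_array_kb(RDNA_UNITS, UNIT_KB)
print(array_kb, f"{genome_fraction(array_kb, GENOME_MB):.2%}")
```

With these inputs the array comes to 720 kb, under half a percent of a 150 Mb genome.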

BioNano optical maps great for error correction, contig checking and connecting long contigs

Updated genome and TE annotation pipeline identified 10% more elements

Family Name                          TE (#)   TE (bp)       TE (%)
All Transposon Elements              622      35,590,241    25.25%
Class I: Retroelement                602      33,625,660    23.86%
  LTR Retrotransposon                496      28,934,312    20.53%
    Ty1-Copia (RLC)                  125      6,441,111     4.57%
    Ty3-Gypsy (RLG)                  169      9,079,948     6.44%
    Unclassified LTR (RLX)           202      13,413,253    9.52%
  Non-LTR Retrotransposon            106      4,691,348     3.33%
    SINE (RSX)                       4        8,427         0.01%
    LINE (RIX)                       6        339,208       0.24%
    DIRS (RYX)                       9        643,866       0.46%
    Unknown Retrotransposon (RXX)    87       3,699,847     2.63%
Class II: DNA                        17       864,468       0.61%
  Helitron (DHX)                     8        703,062       0.50%
  Maverick (DMX)                     1        7,802         0.01%
  TIR (DTX)                          5        131,160       0.09%
  Unknown DNA Transposon (DXX)       3        22,444        0.02%
Unclassified                         3        1,100,113     0.78%
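The TE table is hierarchical, so its subtotals can be checked against each other, and the total base pairs together with the total percentage imply the assembly length the percentages were computed against. The sketch below does both. The parent/child grouping is inferred from the row order and is an assumption of this example, not something the slide states.

```python
# Base-pair totals copied from the TE table.
TE_BP = {
    "All": 35_590_241,
    "Class I": 33_625_660,
    "LTR": 28_934_312,
    "RLC": 6_441_111, "RLG": 9_079_948, "RLX": 13_413_253,
    "Non-LTR": 4_691_348,
    "RSX": 8_427, "RIX": 339_208, "RYX": 643_866, "RXX": 3_699_847,
    "Class II": 864_468,
    "DHX": 703_062, "DMX": 7_802, "DTX": 131_160, "DXX": 22_444,
    "Unclassified": 1_100_113,
}

# Assumed hierarchy, inferred from the order of the rows.
CHILDREN = {
    "All": ["Class I", "Class II", "Unclassified"],
    "Class I": ["LTR", "Non-LTR"],
    "LTR": ["RLC", "RLG", "RLX"],
    "Non-LTR": ["RSX", "RIX", "RYX", "RXX"],
    "Class II": ["DHX", "DMX", "DTX", "DXX"],
}


def children_sum_matches(parent):
    """True if the children's base pairs add up to the parent's."""
    return sum(TE_BP[c] for c in CHILDREN[parent]) == TE_BP[parent]


def implied_assembly_bp(te_bp, te_percent):
    """Assembly length implied by a (bp, percent) pair from the table."""
    return te_bp / (te_percent / 100.0)


for parent in CHILDREN:
    print(parent, children_sum_matches(parent))
print(f"{implied_assembly_bp(TE_BP['All'], 25.25) / 1e6:.1f} Mb")
```

All subtotals add up exactly, and the 25.25% total implies the percentages were taken against roughly 141 Mb of sequence.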

BioNano maps used to identify highly reduced rDNA array

[Figure: Frequency of Repeat Units in Plant Assemblies (Repeat Stretch Tolerance = 10%, Minimum Repeat Frequency = 5); x-axis: Repeat Unit Size (kb); y-axis: Frequency of Repeat Unit (0 to 300); series: 7498 and 9509]

LTRs are actively removed leaving "solo" elements littering the genome

Devos et al., 2002


Spirodela is in extreme purging based on a very high solo:intact LTR ratio

[Figure: genomes placed between bloating and purging]

Arabidopsis has a distinct DNA methylation profile

[Figure panels: Spirodela gene methylation; Spirodela TE methylation; Arabidopsis gene methylation; Arabidopsis TE methylation]

Genome bloating is governed by Class I copy-and-paste TEs proliferating across the genome

[Figure: Class II DNA transposons cut-and-paste to a new location every replication cycle (excision, insertion); Class I LTR retrotransposons copy-and-paste each replication cycle (transcription, reverse-transcription and insertion)]

Spirodela polyrhiza has a surprisingly low DNA methylation level

[Figure: bar chart of CpG, CHG and CHH methylation (0% to 80%) for Spirodela polyrhiza, Arabidopsis thaliana, Setaria italica, Brachypodium distachyon and Wolffia australiana]

Genome purging governed by recently transposed LTRs marked by DNA methylation

[Figure: LTR copy-and-paste distant from genes versus close to genes; PolV; siRNA spreading; increased cytosine methylation; LTR proliferation and cytosine methylation; young LTR removal]

Genome size and gene DNA methylation level are roughly correlated, but Spirodela is an outlier

[Figure: CpG DNA methylation (%, 0% to 80%) versus genome size (Mb, 0 to 600); labeled points: Wolffia australiana, Setaria italica, Oryza sativa, Arabidopsis thaliana, Spirodela polyrhiza]

                   Spirodela   Arabidopsis   Oryza    Setaria   Brachypodium   Wolffia
                   polyrhiza   thaliana      sativa   italica   distachyon     australiana
CpG                9.41%       32.80%        39.30%   44.40%    54.10%         73.0%
CHG                3.37%       13.20%        19.60%   33.80%    30.90%         21.0%
CHH                0.97%       5.60%         5.00%    4.70%     2.50%          1.5%
Genome size (Mb)   150         125           300      490       350            375
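The "roughly correlated" statement can be quantified with a plain Pearson correlation over the six species in the table above. This is a minimal sketch using only those six points; the scatter plot on the slide may include other genomes, and the helper name is made up for the example.

```python
import math

# (genome size in Mb, CpG methylation in %) copied from the table.
SPECIES = {
    "Spirodela polyrhiza": (150, 9.41),
    "Arabidopsis thaliana": (125, 32.80),
    "Oryza sativa": (300, 39.30),
    "Setaria italica": (490, 44.40),
    "Brachypodium distachyon": (350, 54.10),
    "Wolffia australiana": (375, 73.0),
}


def pearson_r(pairs):
    """Pearson correlation coefficient for a list of (x, y) pairs."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    var_y = sum((y - mean_y) ** 2 for _, y in pairs)
    return cov / math.sqrt(var_x * var_y)


r_all = pearson_r(list(SPECIES.values()))
print(f"r = {r_all:.2f}")
```

For these six points r comes out near 0.67, a moderate positive relationship, which fits the slide's "roughly". With so few genomes the value is only indicative.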


DNA methylation level correlates to GC content of Spirodela chromosomes

Syntenous (duplicated and retained) genes have almost no gene body methylation

Spirodela chromosomes cluster by DNA methylation level suggesting relatedness

[Figure: clustering of chromosomes 1 to 20 by DNA methylation level]

Duplicated and retained, low methylation genes are involved in gene expression and carbon metabolism

GO ID        GO Term                                                                      Count   P-value    Prob     Expected
GO:0006355   regulation of transcription, DNA-dependent                                   96      5.36E-26   0.0261   27.1
GO:0003700   sequence-specific DNA binding transcription factor activity                  60      5.78E-20   0.0138   14.3
GO:0005515   protein binding                                                              177     5.35E-18   0.0871   90.5
GO:0043565   sequence-specific DNA binding                                                40      4.12E-13   0.0096   10
GO:0003677   DNA binding                                                                  90      5.33E-11   0.041    42.6
GO:0016772   transferase activity, transferring phosphorus-containing groups              62      1.79E-09   0.0256   26.6
GO:0004672   protein kinase activity                                                      50      1.79E-08   0.0198   20.5
GO:0006468   protein phosphorylation                                                      50      1.94E-08   0.0198   20.6
GO:0005634   nucleus                                                                      49      3.36E-08   0.0196   20.3
GO:0006099   tricarboxylic acid (TCA) cycle                                               10      3.38E-08   0.0008   0.9
GO:0005622   intracellular                                                                44      1.81E-07   0.0176   18.3
GO:0005388   calcium-transporting ATPase activity                                         6       3.33E-07   0.0002   0.3
GO:0070588   calcium ion transmembrane transport                                          6       3.33E-07   0.0002   0.3
GO:0006334   nucleosome assembly                                                          10      3.53E-07   0.0011   1.1
GO:0004713   protein tyrosine kinase activity                                             10      1.10E-06   0.0012   1.3
GO:0008565   protein transporter activity                                                 8       1.63E-06   0.0007   0.8
GO:0000786   nucleosome                                                                   12      3.08E-06   0.0021   2.2
GO:0031204   posttranslational protein targeting to membrane, translocation               4       4.41E-06   0.0001   0.1
GO:0004048   anthranilate phosphoribosyltransferase activity                              4       4.41E-06   0.0001   0.1
GO:0005516   calmodulin binding                                                           6       4.89E-06   0.0004   0.4
GO:0006397   mRNA processing                                                              9       6.30E-06   0.0012   1.2
GO:0016757   transferase activity, transferring glycosyl groups                           12      9.40E-06   0.0023   2.4
GO:0019538   protein metabolic process                                                    6       1.71E-05   0.0005   0.5
GO:0005524   ATP binding                                                                  82      1.96E-05   0.0488   50.7
GO:0007165   signal transduction                                                          18      1.96E-05   0.0053   5.5
GO:0030570   pectate lyase activity                                                       4       2.14E-05   0.0001   0.2
GO:0006879   cellular iron ion homeostasis                                                4       2.14E-05   0.0001   0.2
GO:0015977   carbon fixation                                                              4       2.14E-05   0.0001   0.2
GO:0008964   phosphoenolpyruvate carboxylase activity                                     4       2.14E-05   0.0001   0.2
GO:0006826   iron ion transport                                                           4       2.14E-05   0.0001   0.2
GO:0033177   proton-transporting two-sector ATPase complex, proton-transporting domain    4       2.14E-05   0.0001   0.2
GO:0004449   isocitrate dehydrogenase (NAD+) activity                                     4       2.14E-05   0.0001   0.2

Row highlighting key: Gene expression related; Carbon metabolism
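The GO table lists observed counts next to expected counts, so fold enrichment per term is simply Count divided by Expected. The sketch below computes it for a few rows. It also assumes, without confirmation from the slide, that "Prob" is the background frequency of the term and that Expected equals Prob times the size of the tested gene set; under that reading the table implies a gene set of about a thousand genes.

```python
# (GO ID, term, count, prob, expected) for a few rows of the table.
GO_ROWS = [
    ("GO:0006355", "regulation of transcription, DNA-dependent", 96, 0.0261, 27.1),
    ("GO:0005515", "protein binding", 177, 0.0871, 90.5),
    ("GO:0006099", "tricarboxylic acid (TCA) cycle", 10, 0.0008, 0.9),
    ("GO:0005524", "ATP binding", 82, 0.0488, 50.7),
]


def fold_enrichment(count, expected):
    """Observed count relative to the expected count."""
    return count / expected


def implied_set_size(prob, expected):
    """Gene-set size implied if Expected = Prob * set size (an assumption)."""
    return expected / prob


for go_id, term, count, prob, expected in GO_ROWS:
    print(go_id, f"{fold_enrichment(count, expected):.2f}x",
          f"implied set size ~{implied_set_size(prob, expected):.0f}")
```

Rows with very small Prob values give noisier set-size estimates because of rounding in the table, so the larger terms are the more reliable guide.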

The low levels of DNA methylation correlate with the number of syntenous connections (duplications)

Resequencing of 9 Spirodela accessions finds very low level of polymorphism

• Average of 355,161 SNPs and 50,966 small INDELs per accession, with a variant rate of 0.008, which is low even compared to the predominantly selfing Arabidopsis

• The majority of SNPs fell into sequences that are up- or downstream from transcribed regions, with the lowest level of variation in exons.

• In contrast, variation in introns and exons is about equal in Arabidopsis (~6%).

[Figure panels: Spirodela; Arabidopsis]


C>T and G>A transitions were 5% higher in Spirodela as compared to Arabidopsis

       Arab   Sp9242  Sp9316  Sp9501  Sp9502  Sp9504  Sp9506  Sp9509  Sp9511  Sp9512
A>C    5.6    3.8     3.6     4.0     3.8     3.6     3.6     4.0     3.9     3.8
A>G    12.9   15.3    15.0    15.9    16.2    15.1    15.1    16.4    16.2    15.6
A>T    9.4    3.7     3.7     3.9     3.8     3.7     3.7     4.1     3.8     3.8
C>A    5.6    4.4     4.4     4.4     4.2     4.3     4.3     4.3     4.3     4.3
C>G    3.6    3.6     3.5     3.5     3.5     3.4     3.4     3.8     3.6     3.6
C>T    12.9   19.3    20.0    18.3    18.4    20.0    20.0    17.5    18.2    18.8
G>A    12.8   19.2    19.9    18.3    18.4    19.9    20.0    17.4    18.2    18.8
G>C    3.7    3.6     3.4     3.5     3.5     3.4     3.4     3.8     3.6     3.5
G>T    5.6    4.4     4.3     4.3     4.3     4.3     4.3     4.3     4.4     4.4
T>A    9.4    3.7     3.7     3.9     3.8     3.7     3.6     4.0     3.8     3.8
T>C    12.8   15.3    14.9    16.0    16.1    15.0    15.0    16.5    16.2    15.7
T>G    5.7    3.7     3.6     3.9     3.9     3.6     3.6     4.0     3.9     3.8

High level view suggests Spirodela TEs are relatively old, but with a broad age profile compared to other plants

[Figure: relative frequency (0.00 to 0.40) versus age (Mya, 0 to 5); series: Rice, Brachypodium, Arabidopsis, Spirodela (all relative)]
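The "5% higher" headline for C>T and G>A transitions can be read straight off the substitution table by subtracting the Arabidopsis column from each Spirodela accession. The sketch below does that for the two rows in question; the values are copied from the table and the function name is illustrative.

```python
from statistics import mean

# Arabidopsis column and the nine Spirodela accessions, from the table.
ARABIDOPSIS = {"C>T": 12.9, "G>A": 12.8}
SPIRODELA = {
    "C>T": [19.3, 20.0, 18.3, 18.4, 20.0, 20.0, 17.5, 18.2, 18.8],
    "G>A": [19.2, 19.9, 18.3, 18.4, 19.9, 20.0, 17.4, 18.2, 18.8],
}


def excess_over_arabidopsis(change):
    """(min, mean, max) excess of Spirodela over Arabidopsis, in points."""
    diffs = [value - ARABIDOPSIS[change] for value in SPIRODELA[change]]
    return min(diffs), mean(diffs), max(diffs)


for change in ("C>T", "G>A"):
    low, avg, high = excess_over_arabidopsis(change)
    print(change, f"{low:.1f} to {high:.1f} points, mean {avg:.1f}")
```

On these numbers the excess is about 4.6 to 7.2 percentage points depending on the accession, with a mean near 6, so "5% higher" is best read as an approximate difference in percentage points.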

Conclusions and potential insights
• Spirodela smallest monocot genome sequenced to date
• Fewest protein coding genes of any known angiosperm
• Extremely small rDNA complement
• Highest solo:intact LTR ratio of plant genomes tested
• Lowest DNA methylation level of known plant genomes
• Duplicated and retained genes have extremely low to no gene body methylation
• Global variation is low with high level of C>T transitions, and very low in genes
• Hypothesis: Spirodela is on the other side of an extreme purge, and the low DNA methylation level now reflects protection against further gene and LTR loss.

Acknowledgments
• Michael Group: Tim Motely, Shane Poplawski, Yuhong Ning, Connor McEntee, JD Trout, Doug Bryant, Elliott Meer, Sheri Manalili, Stephanie Barnes, Ayeh Barekat, Jake Minich, Heather Smith
• Danforth Center, Todd Mockler
• Rutgers University, Eric Lam
• Washington University, Weixiong Zhang
• University of Georgia, Scott Jackson
• BioNano Genomics, Alex Hastie

LTR families were clustered based on their age (based on divergence, polymorphism)

Five distinct clusters emerge, suggesting that LTR families have distinct histories in terms of amplification and purging
• All LTR fragments were used per LTR family for the analysis
• BLAST was used to establish divergence between all elements on a per family basis

Cluster 1 has full-length LTRs that are young and/or recently transposed (low divergence)

[Figure: LTR fragments and full-length LTRs plotted by divergence (%)]
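The per-family divergence step described on the LTR clustering slide (BLAST between all elements of a family) might look roughly like the sketch below. It is not the pipeline used for the talk. It assumes BLAST tabular output whose first three columns are query ID, subject ID and percent identity, a hypothetical naming scheme in which IDs start with the family name, and a substitution rate of 1.3e-8 per site per year that the slides do not give. The age conversion T = K / (2r) is normally applied to the two LTRs of one element, so using it on between-copy divergence is only a rough proxy. The example rows are invented.

```python
from collections import defaultdict

# Assumption: substitutions per site per year; not stated in the slides.
DEFAULT_RATE = 1.3e-8


def family_of(seq_id):
    """Hypothetical ID scheme: 'fam189_copy3' belongs to family 'fam189'."""
    return seq_id.split("_", 1)[0]


def mean_divergence_by_family(blast_rows):
    """Mean % divergence (100 - % identity) of within-family hits.

    blast_rows: tab-separated lines whose first three fields are
    query ID, subject ID and percent identity.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for row in blast_rows:
        if not row.strip() or row.startswith("#"):
            continue
        query, subject, pident = row.rstrip("\n").split("\t")[:3]
        if query == subject or family_of(query) != family_of(subject):
            continue  # skip self-hits and cross-family hits
        family = family_of(query)
        sums[family] += 100.0 - float(pident)
        counts[family] += 1
    return {family: sums[family] / counts[family] for family in sums}


def approx_age_mya(divergence_pct, rate=DEFAULT_RATE):
    """Rough age in million years from T = K / (2 * rate)."""
    return (divergence_pct / 100.0) / (2.0 * rate) / 1e6


# Invented rows, for illustration only.
example_rows = [
    "fam189_copy1\tfam189_copy2\t99.0",
    "fam189_copy1\tfam189_copy3\t98.0",
    "fam189_copy1\tfam189_copy1\t100.0",
    "fam106_copy1\tfam106_copy2\t80.0",
]
divergence = mean_divergence_by_family(example_rows)
for family, pct in sorted(divergence.items()):
    print(family, f"{pct:.1f}% divergence, ~{approx_age_mya(pct):.2f} Mya")
```

Families could then be grouped on these per-family values; the clustering method itself is not described on the slide, so it is left out here.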


Example of an LTR family (189) with highly conserved full-length LTRs from cluster 1

[Figure: DNA methylation (%) versus divergence (%), ordered from less to more divergent]

DNA methylation levels vary across the five clusters, consistent with differential life histories

[Figure: DNA methylation (%) by cluster]

Cluster 1 LTR families are all highly methylated, consistent with being targeted for purging

[Figure: DNA methylation (%) versus divergence (%)]

Cluster 2 has highly diverged LTR fragments

Example of an LTR family (106) with highly divergent LTR fragments from cluster 2

[Figure: DNA methylation (%) versus divergence (%)]

More divergent cluster 2 families are more methylated


Spirodela LTRs have a complex history providing insight into proliferation, purging and methylation

• Very old LTR families are not methylated
• Longer, less diverged LTR families are highly methylated
  – Actively being purged?
• More fragmented, divergent LTR families have variable DNA methylation
• Relationship between DNA methylation and divergence beyond purging?
