
Genome and transcriptome of the regeneration-competent flatworm, Macrostomum lignano

Wasik K.A.*1, Gurtowski J.*1, Zhou X.1,2, Ramos O.M.1, Delas M.J.1,3, Battistoni G.1,3, El Demerdash O.1, Falciatori I.1,3, Vizoso D.B.4, Smith A.D.5, Ladurner P.6, Scharer L.4, McCombie W.R.1, Hannon G.J.1,3 and Schatz M.1

1 Watson School of Biological Sciences, Howard Hughes Medical Institute, Cold Spring Harbor Laboratory, New York 11724, USA; 2 Molecular and Cellular Biology Graduate Program, Stony Brook University, NY 11794; 3 Cancer Research UK Cambridge Institute, Li Ka Shing Centre, University of Cambridge, Cambridge CB2 0RE, United Kingdom; 4 Department of Evolutionary Biology, Zoological Institute, University of Basel, 4051 Basel, Switzerland; 5 Molecular and Computational Biology, University of Southern California, Los Angeles, CA 90089, USA; 6 Department of Evolutionary Biology, Institute of Zoology and Center for Molecular Biosciences Innsbruck, University of Innsbruck, A-6020 Innsbruck, Austria

The free-living flatworm Macrostomum lignano, much like its better-known relative Schmidtea mediterranea, has a nearly unlimited regenerative capacity. Following injury, this species can regenerate an almost entirely new organism. This is attributable to the presence of an abundant somatic stem cell population, the neoblasts. These cells are also essential for the ongoing maintenance of most tissues, as their loss leads to the rapid and irreversible degeneration of the animal. This set of unique properties makes Macrostomum lignano an attractive species for studying the evolution of pathways involved in self-renewal, fate specification, and regeneration. The use of Macrostomum lignano, or other flatworms, as models, however, is hampered by the lack of a well-assembled and annotated genome sequence, fundamental to modern genetic and molecular studies. Here we report the genomic sequence of Macrostomum lignano and an accompanying characterization of its transcriptome. The genome structure of Macrostomum lignano is remarkably complex, with ~75% of its sequence composed of simple repeats and transposon sequences. This made high-quality assembly from Illumina reads alone impossible (N50 = 414 bp). We therefore obtained 130X coverage by long sequencing reads from the PacBio platform and combined this with more than 250X Illumina coverage to create a mixed assembly with a significantly improved N50 of 64 kb. We complemented the reference genome with an assembled and annotated transcriptome, and used both of these datasets in combination to probe gene expression patterns during regeneration, examining pathways important to stem cell function. Additionally, we found evidence of low levels of CpG methylation in the Macrostomum lignano genome and evidence of trans-splicing in the worm's transcriptome. Interestingly, we found that flatworms lack Myc, a highly conserved pluripotency factor in bilaterians and beyond (cnidarians, poriferans).
As a whole, our data will provide a crucial resource for the community for the study not only of invertebrate evolution but also of regeneration and somatic pluripotency.
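
The N50 values quoted in the abstract (414 bp for the Illumina-only assembly, 64 kb for the mixed assembly) summarize how contiguous an assembly is. The following is a minimal Python sketch of how N50 and the related NG(x) statistic are conventionally computed. The function names and the toy contig lengths are illustrative assumptions, not the authors' pipeline or data.

```python
def n50(lengths):
    """Contig length L such that contigs of length >= L hold at least
    half of the assembled bases."""
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("need at least one contig length")
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


def ngx(lengths, genome_size, x):
    """Like N50, but the target is x% of an estimated genome size rather
    than half of the assembly. Returns None if the assembly never
    reaches that fraction of the genome."""
    target = genome_size * x / 100
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= target:
            return length
    return None


# Hypothetical toy contig lengths (not data from the poster).
toy_contigs = [8, 8, 4, 3, 3, 2, 2, 2]
print(n50(toy_contigs))                          # 8
print(ngx(toy_contigs, genome_size=40, x=50))    # 4
print(ngx(toy_contigs, genome_size=100, x=50))   # None
```

Because NG(x) is measured against the genome size rather than the assembly size, an assembly that covers only part of the genome has no value at higher x, which is one way an incomplete assembly shows up on an NG(x) plot.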

The Worm

[Figure, panels A-E. Panel A: phylogenetic tree with the groups Cnidaria, Acoela, Platyhelminthes, Ecdysozoa, Lophotrochozoa and Deuterostomia. Panel B: adult worm with labels including head, eyes, brain, pharynx, gut, testes, ovaries, egg, seminal vesicle, vesicula granulorum, stylet, adhesive organs and tail; scale bar 50 μm. Panel C: flatworm tree with the groups Catenulida, Polycladida, Lecithoepitheliata, Rhabdocoela, Bothrioplanida, Tricladida, Monogenea, Cestoda and Trematoda. Panel E: DAPI and EdU staining.]

A. Phylogenetic analysis of 23 animal species using partial sequences of 43 genes. Fig. modified from Egger et al. (2015). B. Interference contrast image and a diagrammatic representation of an adult Macrostomum lignano. C. Phylogenomic analysis of 27 flatworm species (21 free-living and 6 neodermatan) using >100,000 aligned amino acids. Fig. modified from Egger et al. (2015). D. Electron micrograph of an M. lignano neoblast. Note the small rim of cytoplasm (yellow) and the lack of cytoplasmic differentiation. er - endoplasmic reticulum; mi - mitochondria; mu - muscle; ncl - nucleolus; nu - nucleus (red). E. Immunofluorescence labeling of dividing neoblasts with EdU (red) in an adult worm. All cell nuclei are stained with DAPI (blue). T - testes, O - ovaries, DE - developing eggs, asterisks denote eyes.

Genome and Transcriptome Assembly

[Figure, panels F-G. Panel F: M. lignano k-mer model; x-axis: 23-mer coverage; y-axis: 23-mer frequency; curves: observed, errors, composite, and four peaks p1 = 110, p2 = 220, p3 = 330, p4 = 440. Panel G: x-axis: NG(x) %; y-axis: contig size; series: ML1 assembly, ML2 assembly.]

F. Representation of 23-mer frequency and coverage in the Illumina sequencing data generated from DNA extracted from a population of adult worms. M. lignano shows an unusual 4-modal 23-mer distribution. G. Comparison of Illumina-only (ML1) and PacBio (ML2) assemblies. Contig length distribution (Log2 scale) over the M. lignano genome in the ML1 (green) and ML2 (red) assemblies. Note that the ML1 assembly covers only about 55% of the genome.

Transcripts Involved in Regeneration in M. lignano

[Figure, panels A-B. Panel A: amputation, RNA extraction, RNA-seq and DESeq analysis at 0, 3, 6, 12, 24, 48 and 72 hours; heat map with columns 0h to 72h and a color key. Panel B: pluripotency pathways from human/murine stem cells: Jak-Stat, MAPK, PI3K-Akt, TGFβ and Wnt signalling pathways, and the core transcriptional network.]

Panel B legend, genes from human/murine stem cells:
- Found in M. lignano transcriptome
- Not found in M. lignano transcriptome
- STAT3 not found, other STATs identified in M. lignano transcriptome
- KLF4 not found, other KLFs identified in M. lignano transcriptome
- BMP4 not found, other BMPs identified in M. lignano transcriptome
- Factors with TGFβ-like domain found in M. lignano transcriptome

A. Schematic representation of the experimental design: 200 worms (per replicate) underwent amputation at a level between the brain and the gonads. The heads were allowed to regenerate and were collected at different timepoints post amputation (0, 3, 6, 12, 24, 48, 72 hours). RNA-Seq libraries from each timepoint were analyzed for differentially expressed genes. Below: a heat map of differentially expressed genes at different regeneration timepoints. Each replicate is plotted separately. Downregulated and upregulated transcripts are labeled in green and red, respectively. Scale covers Log2 values. The samples are grouped with complete-linkage clustering using Euclidean distance. B. Known pluripotency pathways from H. sapiens and M. musculus were adapted from the Kyoto Encyclopedia of Genes and Genomes. Factors that had potential homologues in M. lignano are labeled.

Additional Findings

[Figure, panels A-B. Panel A: distance tree of Myc and Max candidate transcripts with bootstrap support values; outgroup USF1_hsa and USF2_hsa. Panel B: three-block alignment of SL RNAs from Stephanostomum sp., Haematolechus sp., Fasciola sp., S. mansoni, Echinococcus sp., M. lignano, S. mediterranea (SL1 and SL2), Stylochus sp. and Notoplana sp.]

A. M. lignano is lacking the Myc gene. Evolution of the Myc and Max gene families across different representatives of the animal phyla. Myc and Max gene candidates were retrieved based on reciprocal best BLASTp from the available transcriptomes. The distance tree was inferred using neighbor-joining based on the JTT sequence evolution model (1000 bootstrap replicates). Human USF proteins are used as an outgroup. The Myc branch is labeled in green, the Max branch is labeled in blue. dme - D. melanogaster, hsa - H. sapiens, lgi - L. gigantea, cte - C. teleta, hvu - H. vulgaris, aqu - A. queenslandica, cel - C. elegans, mli - M. lignano, mfu - M. fusca, mosp - Monocelis sp., psi - P. siphunculus, ltr - L. tremelaris, ece - E. celerrima, meli - M. lingua, msc - M. schmidtii, mili - M. lineare, nco - N. coelogynoporoides, rsp - Rhabdopleura sp., gvu - G. vulgaris, csp - Cerebratulus sp., pca - P. caudatus, sst - S. sthenum, cle - C. lemnae, pdu - P. dumerilli, sma - S. mediterranea, scma - S. mansoni. Transcript ID is next to each phylum name. B. Trans-splicing in M. lignano. Alignment between the first 130 nt of M. lignano's putative SL RNA and SL RNAs from other flatworms. The conserved splice junction is indicated by an arrowhead. Spliced leader sequences are labeled in blue. The potential initiator AUG (last three nucleotides of the spliced leader) is labeled in green.

Genome and Transcriptome Annotation

[Figure, panels A-D. Panel A: circular plot of contigs C1 to C50; track legend: contig ID, contig size X 10 Kbp, miRNA count, large repetitive elements, transcript abundance, tandem repeat unit size in bp, tandem repeat count, GC content, Illumina coverage. Panel B: x-axis: sequence complexity; y-axis: normalized frequency; series include C. elegans, D. melanogaster, H. sapiens, S. mansoni and M. lignano. Panel C: % of bases of each genome masked by TRF. Panel D: counts of reciprocal BLAST hits for M. lignano, S. mediterranea, C. elegans and D. melanogaster.]

A. Overview of the 50 largest contigs in the M. lignano genome, making up about 2.6% of the total assembly. Different tracks denote (moving inwards): contig size X 10 Kbp; miRNA count (1-54 mapped miRNAs); large repetitive elements (RepeatScout) (1-4476 identified repeats); transcript count (1-43 mapped transcripts); tandem repeat unit size in base pairs (1-500); tandem repeat count (1-28); GC content (0-1); and Illumina coverage (4-160X). The color gradients correspond to the range of values for each track (lower values are lighter, higher values are darker). B. Sequence complexity comparison across five organisms. Drosophila melanogaster has an abundance of very low complexity sequence, not found in the other species. Macrostomum lignano has a sizable amount of moderately complex sequences that are not found in other species and that do not appear to be expressed. C. Tandem Repeat Finder was run on five species to assess their tandem and low complexity sequence composition. Macrostomum lignano had far more bases masked by Tandem Repeat Finder than the other organisms in the test set. D. The number of reciprocal BLAST hits against the Homo sapiens transcriptome for four different species: Macrostomum lignano, Schmidtea mediterranea, Caenorhabditis elegans, and Drosophila melanogaster. Only the number of hits passing the E-value cutoff of ≤1e-10 is shown.

Conclusions:

• We have assembled and annotated a highly repetitive genome using a mix of PacBio and Illumina sequencing
• We have found that:
  - M. lignano's genome shows evidence of CpG methylation
  - It has retained a large number of homeoboxes as compared to other flatworms
  - The transcriptome shows evidence of trans-splicing
  - Flatworms lost the very conserved Myc gene
• We have characterized the gene expression patterns during regeneration in M. lignano

Wasik et al. PNAS (2015); doi: 10.1073/pnas.1516718112