Identified Clusters

7th RepeatExplorer Workshop on the Application of Next Generation Sequencing to Repetitive DNA Analysis

Welcome !

Institute of Plant Molecular Biology, Biology Centre ASCR České Budějovice, Czech Rep.

Programme

Tuesday (May 22) Wednesday (May 23) Thursday (May 24)

9:30 - Principles and applications of graph- 9:00 -12:30 9:00 – 12:30 based repeat clustering (J. Macas) ● Lucia Campos-Dominguez – Using Repeat ● Andrew Leitch – The phylogenetic placement Explorer to find the genomic factors driving of Gnetales is important to understand 10 :10 - RepeatExplorer pipeline version 2.0 speciation in Begonia patterns of genome evolution across seed (P. Novák) ● Steven Dodsworth – Repeat dynamics in the plants: insights from Gnetum montanum largest diploid angiosperms, Fritillaria genome 10:30 - 11:00 Coffee break (Liliaceae) ● Sonia Garcia – Ribosomal DNA ● Floris C. Breman – The Pelargonium arrangements in Asteraceae: variability of 5S 11:00 - Using RepeatExplorer output for repeatome as a ‘DNA barcode’ and 35S examined with TAREAN and repeat annotation and quantification (J. ● Hanna Schneeweiss – Repeat dynamics in RepeatExplorer Macas) wild and cultivated chile peppers (genus ● Aleš Kovařík – Diversity of rDNA repeats Capsicum) in gymnosperms analysed by Repeat Explorer 11:20 - Transposon protein databases ● Daniel Vitales – Transposon dynamics and and other tools (P. Neumann) chromosome reorganizations drive genome size ● Vratislav Peška) – Telomeres – current evolution in Anacyclus challenges 11:40 – Assembly annotation using ● Veit Herklotz – Characterization of CANR4 ● Diogo C. Cabral-de-Mello – Satellite DNAs PROFREP and DANTE (N. Hoštáková) satellite repeat in roses and sex chromosomes evolution in Orthoptera ● Danijela Greguraš – Characterization of ● Nils Pilotte – Tropical parasites: improved satellite and centromeric repeats in genus diagnostics utilizing repetitive targets 12:00 - 13:30 Lunch Hieracium

13:30 - (18:00) Practical training I 12:30 - 13:30 Lunch 12:30 - 13:30 Lunch

● design of sequencing and repeat analysis 13:30 - (18:00) Practical training II 13:30 – (18:00) - Practical training III experiments ● introduction to Galaxy environment ● identification of satellite DNA using TAREAN ● combining repeat clustering with ChIP-seq data ● quality control and pre-processing of NGS ● understanding RepeatExplorer output ● identification and phylogenetic analysis of reads, dealing with various read formats ● cluster annotation and repeat composition of the retrotransposon protein domains ● setting up clustering analysis genome ● SeqGrapheR – visualization and annotation of ● comparative clustering of multiple samples ● comparative clustering of multiple species – data the cluster graphs interpretation ● advanced topics, troubleshooting ● repeat quantification (principles, sensitivity and 19:00 → : Dinner at “CITYgastro” reproducibility) ● design of hybridization probes based on RE AcknowledgmentsAcknowledgments

Biology Centre CAS, České Budějovice Institute of Plant Molecular Biology

Laboratory of Molecular Cytogenetics Petr Novák Renata Marecová Pavel Neumann Andrea Koblížková Iva Fuková Vlasta Tetourová Laura A. Robledillo Jana Látalová Nina Hoštáková Tihana Vondrak Sonja Klemme

REPEATS IN PLANT GENOMES

Satellite DNA

Retrotransposon

Repeats in plant genomes

Why is repetitive DNA there and why is it so abundant ? What are the evolutionary mechanisms acting on genomic repeats ?

Why is there differential accumulation of repetitive DNA ?

Repeats in plant genomes

Why is repetitive DNA there and why is it so abundant ? What are the evolutionary mechanisms acting on genomic repeats ?

Why is there differential accumulation of repetitive DNA ?

What is the contribution of different types and families of repetitive elements ?

What is the repeat composition in individual species or taxons ?

The challenge of repeat identification in complex genomes

Why is repetitive DNA there and why is it so abundant ? What are the evolutionary mechanisms acting on genomic repeats ?

Why is there differential accumulation of repetitive DNA ?

What is the contribution of different types and families of repetitive elements ?

What is the repeat composition in individual species or taxons ?

Next generation sequencing: getting sequence data is no longer a limiting factor

454/Roche Illumina Pacific Biosciences Oxford Nanopore Principles and tools for global repeat identification

Assembled genome or long contigs (BACs etc.) Y E S N O S

e s E

a Y b a t a d

e c n e r e f O e

R N

Repeat identification in sequence data

● Pure algorithmic (evaluates “repetitiveness” of substrings in DNA sequence) ● Signature or feature-based (searches for specific structures of biological importance) ● Similarity to known repeats Principles and tools for global repeat identification

Assembled genome or long contigs (BACs etc.) Y E S N O

RepeatMasker / RepBase S

e ● ● s E first “model” species model-related taxa

a ● ●

Y extensive resources (+ wet lab) often shotgun, “assembled” b a t a d

e c n e r e f O e

R N

Repeat identification in sequence data

● Pure algorithmic (evaluates “repetitiveness” of substrings in DNA sequence) ● Signature or feature-based (searches for specific structures of biological importance) ● Similarity to known repeats (uses curated databases of known repeats) Principles and tools for global repeat identification

RepeatMasker (A.F.A. Smit, R. Hubley & P. Green) INPUT (reads or assembled sequences) DATABASE SIMILARITY SEARCH RepBase, RepeatMasker custom db, WU-BLAST ... Crossmatch

OUTPUT table of hits

>seq_1 < masked NNNNNNNATGAATGCCTGTCCCCGTCAGGGTTGTTGAGTTTTTCCCTAGC sequence AAATCTCCCGGGCAGAGGTTTGTGTGGATGGTTTTCCCCAGCAGGNNNNN NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN summary > NNNNNNNNNNNNNNNNNNNNNNNNNNNCCGATCGTCCGTTGACGTCGGGT table TGATCAGATTTTCCCCCAGAGTCACCTCTCCGTTGGTGTCGATAAACGAT GAGTTTTTCCTATGCGTCCGTTGACGTATAGCTGCATGTTCCCCAAAGAT CATTTGTTAATGC Principles and tools for global repeat identification

Assembled genome or long contigs (BACs etc.) Y E S N O

RepeatMasker / RepBase S

e ● ● s E first “model” species model-related taxa

a ● ●

Y extensive resources (+ wet lab) often shotgun, “assembled” b a t a d

e c n ● e LTR_STRUC, MITE-Hunter , etc. r e f O e

R N

Repeat identification in sequence data

● Pure algorithmic (evaluates “repetitiveness” of substrings in DNA sequence) ● Signature or feature-based (searches for specific structures of biological importance) ● Similarity to known repeats (uses curated databases of known repeats) Principles and tools for global repeat identification

Signature / structure-based approaches Copia

Gypsy LTR_STRUC (McCarthy and McDonald, 2003)

LTR_FINDER (Xu and Wang, 2007)

SINE-Finder (Wenke et al. 2011)

MITE-Hunter (Han and Wessler, 2010) MITE Analysis Toolkit (Yang and Hall, 2003)

HelitronFinder (Du et al. 2008) [HelA helitrons in maize]

Saha et al. (20008) Principles and tools for global repeat identification

Assembled genome or long contigs (BACs etc.) Y E S N O

RepeatMasker / RepBase S

e ● ● s E first “model” species model-related taxa

a ● ●

Y extensive resources (+ wet lab) often shotgun, “assembled” b a t a d

c (using NGS reads) n ● e LTR_STRUC, MITE-Hunter , etc. r

e ●

f direct assembly (phrap,...) O e ● REPET R N ● clustering-based

Repeat identification in sequence data

● Pure algorithmic (evaluates “repetitiveness” of substrings in DNA sequence) ● Signature or feature-based (searches for specific structures of biological importance) ● Similarity to known repeats (uses curated databases of known repeats) Principles and tools for global repeat identification

RepeatExplorer

sequencing Genome Repeat reads assembly analysis

BAC library Excluded from The most abundant assembly and/or conserved repeats

Non- clonable

Clustering-based repeat identification

Cluster 1

Cluster 2

Cluster 3

CLUSTER = a set of frequently overlapping reads = REPEAT FAMILY

Clustering-based repeat identification

pairwise alignments graph representation

read_1 similarity read_1 exceeding threshold read_2 read_2 [90% simil. 55% of length] read_3 read_3 read_4 read_4

Clustering-based repeat identification

Single linkage clustering => connected components

TGICL (TIGR Gene Indices clustering tool) Pertea et al., 2003

Macas et al. (2007) - Repetitive DNA in the pea (Pisum sativum L.) genome: comprehensive characterization using 454 sequencing and comparison to soybean and Medicago truncatula.

Clustering-based repeat identification

similarity of conserved domains, promoters, etc.

nested insertion

Single linkage clustering => connected components

TGICL (TIGR Gene Indices clustering tool) Pertea et al., 2003 Graph-based clustering

Graph-based clustering Sequence overlaps between the reads are transformed to a graph where the reads are represented as nodes and their similarities as edges connecting the nodes Graph structure is examined to detect communities of frequently connected nodes which are split to separate clusters

Graph-based characterization of repeat clusters

Virtual graphs used to analyze real data contain up to millions of nodes (reads)

Graph-based clustering

Sequence overlaps between the reads are transformed to a graph where the reads are represented as nodes and their similarities as edges connecting the nodes

Graph structure is examined to detect communities of frequently connected nodes which are split to separate clusters Graph-based characterization of repeat clusters

Virtual graphs used to analyze real data contain up to millions of nodes (reads)

Graph-based clustering

Sequence overlaps between the reads are transformed to a graph where the reads are represented as nodes and their similarities as edges connecting the nodes

Graph structure is examined to detect communities of frequently connected nodes which are split to separate clusters Graph-based characterization of repeat clusters

Identified clusters

Virtual graphs used to analyze real data contain up to millions of nodes (reads)

Graph-based clustering

Sequence overlaps between the reads are transformed to a graph where the reads are represented as nodes and their similarities as edges connecting the nodes

Graph structure is examined to detect communities of frequently connected nodes which are split to separate clusters Graph-based characterization of repeat clusters

Extraction of of individual clusters Identified clusters (graph shapes indicate repeat types)

Satellite (simple)

Satellite family (short + long monomers) >

LTR-retrotransposons

Clustering-based repeat identification

Cluster 1

Cluster 2

Cluster 3

CLUSTER = a set of frequently overlapping reads = REPEAT FAMILY

Repeat characterization

Cluster annotation and quantification

Cluster 1

Cluster 2

Cluster 3

…

(Cluster x) Repeat characterization

Cluster annotation and quantification Proportions of various repeat types in a genome

RepeatExplorer

What RepeatExplorer can do for you: Identify all sequences with certain number of repetitions Repeat quantification Provide models of repeat popoulations (sequence variability) Help in repeat classification/annotation (in plant genomes, Metazoa in prep.)

What it cannot do: Genome assembly or reconstruction of individual repeat copies Analyze repeats using assembled genomes as input Use long NGS reads (PacBio, Oxford Nanopore) as input (work in progress) Identify some tandem repeats with very short monomers or other low-complexity repeats

Think about proper design of your analysis – RE is just a program designed for a specific type of input (e.g. it needs WGS data as input and it will not work with BAC clones)

Repetitive DNA characterization using RepeatExplorer

Plants

Over 100 species characterized so far

Comparative studies

Whole genome assembly projects

Repetitive DNA characterization using RepeatExplorer

Plants

Over 100 species characterized so far

Comparative studies

Whole genome assembly projects

Mammals

Bats, deer Fish

Austrolebias charrua, Cynopoecilus melanotaenia Insects

Locust, grasshoppers, kissing bugs Worms

Soil helmints

APPLICATIONS: Repeat composition of individual species

Repeat type: Ty3/gypsy Ty1/copia Satellite DNA rDNA other/unknown

cluster size (% of reads) pea (Pisum sativum)

1C = 4,300 Mb

cluster size (% of reads) soybean (Glycine max)

1C = 1,115 Mb

cumulative % of reads Macas et al. (2007) BMC Genomics 8: 427. APPLICATIONS: Comparative analysis

SPECIES (samples) Reads Modify names Merge CLUSTERS CL1 NGS genome seq. A

CL2

B CL3

C CL4

Comparative analysis

proportion cluster size Sequence reads from of reads from multiple species are mixed individual and subjected to clustering species >>> analysis

Efficient identification of homologous repeats from different species

Easy quantification

SPECIES:

Musa ornata

M. balbisiana (B)

M. acuminata (A)

M. textillis (T)

M. beccarii

Ensete gilletii Comparative analysis

SPECIES:

Ensete gilletii Repeats Musa beccarii M. textillis (T) M. balbisiana (B) M. ornata M. acuminata (A)

Phylogeny based on ITS sequences

Novak et al. 2014 Comparative analysis (searching for B-specific repeats)

Detection of clusters enriched on Bs

100 B-specific Next-gen sequencing

75 As + Bs B-enriched ) B

Mix coded 2 %

reads + B and 0 50 % ( /

perform 0 0 1

CLUSTERING * B

Next-gen 2 % sequencing 25

As - 0 -

0 0 50 100 150 200

CL number

Aegilops speltoides, comparative analysis of 0B / 2B plants

CL183 Detection of clusters enriched on Bs CL205 100 CL183 CL205

75 CL126 CL148 ) B 2 % + B 0 50 % ( / 0 0 1 B * B 2 % B 25

0 B B 0 50 100 150 200 CL number Sequencing by Houben lab FISH by Alevtina Ruban

Repeats significantly enriched on Bs

%2B*100/ Repeat CL reads [%] 0B % 2B % /(%2B+%0B) type

126 5216 0.191 1688 0.120 3528 0.252 68 satellite (86 bp) 148 2877 0.105 934 0.067 1943 0.139 68 tandem (?) 183 731 0.027 1 0.000 730 0.052 100 satellite (~1.1 kb)

205 358 0.013 0 0.000 358 0.026 100 satellite (185 bp) Cluster-centered downstream applications

Cluster annotation and quantification CLUSTERS Represent specific repeat families/variants or their parts They are collections of sequence reads capturing full sequence variability of repeat populations

Using clusters as reference for similarity- based classification of: RNA-seq reads (mRNA, smRNAs,...) Detection of transcribed repeats Comparative analysis (tissues,...) ChIP-seq reads Association of repeats with specific types of chromatin Identification of centromeric repeats by ChIP-seq

Outline of the experiment Illumina seq. (36 nt reads)

Chromatin Sequencing Mapping reads to DNA isolation immunoprecipitation clusters of (“ChIP”) Isolation of nuclei, with CenH3 antibody 9.5 mil. reads long (>100nt) reads, digestion with calculating ratio of micrococcal ChIP / INPUT nuclease Sequencing DNA isolation Cluster = repeat (“INPUT”) 20 mil. reads family

100-fold enrichment 100.00

T 10.00 U t P A u ] N p I g n i /

o /

l P [ P I I h h C C 1.00 no enrichment

0.10 0 100 200 300 400 500 600 cluster no. Identification of centromeric repeats by ChIP-seq

Outline of the experiment Illumina seq. (36 nt reads)

100.00

10.00 t u p n i

P I h C 1.00

Satellites ● 0.10 ●TR_11 0 100 200 300 400 500 600 PisTR-B cluster no. Neumann et al. (2012) PLoS Genetics 8:e1002777 PROFREP - improving repeat annotation in genome assembly

Annotation based on clustering analysis

M. acuminata genome assembly Novak, Hribova, et al. Programme

Tuesday (May 22) Wednesday (May 23) Thursday (May 24)

13:30 - (18:00) Practical training I 12:30 - 13:30 Lunch 12:30 - 13:30 Lunch