7th RepeatExplorer Workshop on the Application of Next Generation Sequencing to Repetitive DNA Analysis
Welcome !
Institute of Plant Molecular Biology, Biology Centre ASCR České Budějovice, Czech Rep.
Programme
Tuesday (May 22) Wednesday (May 23) Thursday (May 24)
9:30 - Principles and applications of graph- 9:00 -12:30 9:00 – 12:30 based repeat clustering (J. Macas) ● Lucia Campos-Dominguez – Using Repeat ● Andrew Leitch – The phylogenetic placement Explorer to find the genomic factors driving of Gnetales is important to understand 10 :10 - RepeatExplorer pipeline version 2.0 speciation in Begonia patterns of genome evolution across seed (P. Novák) ● Steven Dodsworth – Repeat dynamics in the plants: insights from Gnetum montanum largest diploid angiosperms, Fritillaria genome 10:30 - 11:00 Coffee break (Liliaceae) ● Sonia Garcia – Ribosomal DNA ● Floris C. Breman – The Pelargonium arrangements in Asteraceae: variability of 5S 11:00 - Using RepeatExplorer output for repeatome as a ‘DNA barcode’ and 35S examined with TAREAN and repeat annotation and quantification (J. ● Hanna Schneeweiss – Repeat dynamics in RepeatExplorer Macas) wild and cultivated chile peppers (genus ● Aleš Kovařík – Diversity of rDNA repeats Capsicum) in gymnosperms analysed by Repeat Explorer 11:20 - Transposon protein databases ● Daniel Vitales – Transposon dynamics and and other tools (P. Neumann) chromosome reorganizations drive genome size ● Vratislav Peška) – Telomeres – current evolution in Anacyclus challenges 11:40 – Assembly annotation using ● Veit Herklotz – Characterization of CANR4 ● Diogo C. Cabral-de-Mello – Satellite DNAs PROFREP and DANTE (N. Hoštáková) satellite repeat in roses and sex chromosomes evolution in Orthoptera ● Danijela Greguraš – Characterization of ● Nils Pilotte – Tropical parasites: improved satellite and centromeric repeats in genus diagnostics utilizing repetitive targets 12:00 - 13:30 Lunch Hieracium
13:30 - (18:00) Practical training I 12:30 - 13:30 Lunch 12:30 - 13:30 Lunch
● design of sequencing and repeat analysis 13:30 - (18:00) Practical training II 13:30 – (18:00) - Practical training III experiments ● introduction to Galaxy environment ● identification of satellite DNA using TAREAN ● combining repeat clustering with ChIP-seq data ● quality control and pre-processing of NGS ● understanding RepeatExplorer output ● identification and phylogenetic analysis of reads, dealing with various read formats ● cluster annotation and repeat composition of the retrotransposon protein domains ● setting up clustering analysis genome ● SeqGrapheR – visualization and annotation of ● comparative clustering of multiple samples ● comparative clustering of multiple species – data the cluster graphs interpretation ● advanced topics, troubleshooting ● repeat quantification (principles, sensitivity and 19:00 → : Dinner at “CITYgastro” reproducibility) ● design of hybridization probes based on RE AcknowledgmentsAcknowledgments
Biology Centre CAS, České Budějovice Institute of Plant Molecular Biology
Laboratory of Molecular Cytogenetics Petr Novák Renata Marecová Pavel Neumann Andrea Koblížková Iva Fuková Vlasta Tetourová Laura A. Robledillo Jana Látalová Nina Hoštáková Tihana Vondrak Sonja Klemme
REPEATS IN PLANT GENOMES
Satellite DNA
Retrotransposon
Repeats in plant genomes
Why is repetitive DNA there and why is it so abundant ? What are the evolutionary mechanisms acting on genomic repeats ?
Why is there differential accumulation of repetitive DNA ?
Repeats in plant genomes
Why is repetitive DNA there and why is it so abundant ? What are the evolutionary mechanisms acting on genomic repeats ?
Why is there differential accumulation of repetitive DNA ?
What is the contribution of different types and families of repetitive elements ?
What is the repeat composition in individual species or taxons ?
The challenge of repeat identification in complex genomes
Why is repetitive DNA there and why is it so abundant ? What are the evolutionary mechanisms acting on genomic repeats ?
Why is there differential accumulation of repetitive DNA ?
What is the contribution of different types and families of repetitive elements ?
What is the repeat composition in individual species or taxons ?
Next generation sequencing: getting sequence data is no longer a limiting factor
454/Roche Illumina Pacific Biosciences Oxford Nanopore Principles and tools for global repeat identification
Assembled genome or long contigs (BACs etc.) Y E S N O S
e s E
a Y b a t a d
e c n e r e f O e
R N
Repeat identification in sequence data
● Pure algorithmic (evaluates “repetitiveness” of substrings in DNA sequence) ● Signature or feature-based (searches for specific structures of biological importance) ● Similarity to known repeats Principles and tools for global repeat identification
Assembled genome or long contigs (BACs etc.) Y E S N O
RepeatMasker / RepBase S
e ● ● s E first “model” species model-related taxa
a ● ●
Y extensive resources (+ wet lab) often shotgun, “assembled” b a t a d
e c n e r e f O e
R N
Repeat identification in sequence data
● Pure algorithmic (evaluates “repetitiveness” of substrings in DNA sequence) ● Signature or feature-based (searches for specific structures of biological importance) ● Similarity to known repeats (uses curated databases of known repeats) Principles and tools for global repeat identification
RepeatMasker (A.F.A. Smit, R. Hubley & P. Green) INPUT (reads or assembled sequences) DATABASE SIMILARITY SEARCH RepBase, RepeatMasker custom db, WU-BLAST ... Crossmatch
OUTPUT table of hits
>seq_1 < masked NNNNNNNATGAATGCCTGTCCCCGTCAGGGTTGTTGAGTTTTTCCCTAGC sequence AAATCTCCCGGGCAGAGGTTTGTGTGGATGGTTTTCCCCAGCAGGNNNNN NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN summary > NNNNNNNNNNNNNNNNNNNNNNNNNNNCCGATCGTCCGTTGACGTCGGGT table TGATCAGATTTTCCCCCAGAGTCACCTCTCCGTTGGTGTCGATAAACGAT GAGTTTTTCCTATGCGTCCGTTGACGTATAGCTGCATGTTCCCCAAAGAT CATTTGTTAATGC Principles and tools for global repeat identification
Assembled genome or long contigs (BACs etc.) Y E S N O
RepeatMasker / RepBase S
e ● ● s E first “model” species model-related taxa
a ● ●
Y extensive resources (+ wet lab) often shotgun, “assembled” b a t a d
e c n ● e LTR_STRUC, MITE-Hunter , etc. r e f O e
R N
Repeat identification in sequence data
● Pure algorithmic (evaluates “repetitiveness” of substrings in DNA sequence) ● Signature or feature-based (searches for specific structures of biological importance) ● Similarity to known repeats (uses curated databases of known repeats) Principles and tools for global repeat identification
Signature / structure-based approaches Copia
Gypsy LTR_STRUC (McCarthy and McDonald, 2003)
LTR_FINDER (Xu and Wang, 2007)
SINE-Finder (Wenke et al. 2011)
MITE-Hunter (Han and Wessler, 2010) MITE Analysis Toolkit (Yang and Hall, 2003)
HelitronFinder (Du et al. 2008) [HelA helitrons in maize]
Saha et al. (20008) Principles and tools for global repeat identification
Assembled genome or long contigs (BACs etc.) Y E S N O
RepeatMasker / RepBase S
e ● ● s E first “model” species model-related taxa
a ● ●
Y extensive resources (+ wet lab) often shotgun, “assembled” b a t a d
e
c (using NGS reads) n ● e LTR_STRUC, MITE-Hunter , etc. r
e ●
f direct assembly (phrap,...) O e ● REPET R N ● clustering-based
Repeat identification in sequence data
● Pure algorithmic (evaluates “repetitiveness” of substrings in DNA sequence) ● Signature or feature-based (searches for specific structures of biological importance) ● Similarity to known repeats (uses curated databases of known repeats) Principles and tools for global repeat identification
RepeatExplorer
sequencing Genome Repeat reads assembly analysis
BAC library Excluded from The most abundant assembly and/or conserved repeats
Non- clonable
Clustering-based repeat identification
Cluster 1
Cluster 2
Cluster 3
CLUSTER = a set of frequently overlapping reads = REPEAT FAMILY
Clustering-based repeat identification
pairwise alignments graph representation
read_1 similarity read_1 exceeding threshold read_2 read_2 [90% simil. 55% of length] read_3 read_3 read_4 read_4
Clustering-based repeat identification
Single linkage clustering => connected components
TGICL (TIGR Gene Indices clustering tool) Pertea et al., 2003
Macas et al. (2007) - Repetitive DNA in the pea (Pisum sativum L.) genome: comprehensive characterization using 454 sequencing and comparison to soybean and Medicago truncatula.
Clustering-based repeat identification
similarity of conserved domains, promoters, etc.
nested insertion
Single linkage clustering => connected components
TGICL (TIGR Gene Indices clustering tool) Pertea et al., 2003 Graph-based clustering
Graph-based clustering Sequence overlaps between the reads are transformed to a graph where the reads are represented as nodes and their similarities as edges connecting the nodes Graph structure is examined to detect communities of frequently connected nodes which are split to separate clusters
Graph-based characterization of repeat clusters
Virtual graphs used to analyze real data contain up to millions of nodes (reads)
Graph-based clustering
Sequence overlaps between the reads are transformed to a graph where the reads are represented as nodes and their similarities as edges connecting the nodes
Graph structure is examined to detect communities of frequently connected nodes which are split to separate clusters Graph-based characterization of repeat clusters
Virtual graphs used to analyze real data contain up to millions of nodes (reads)
Graph-based clustering
Sequence overlaps between the reads are transformed to a graph where the reads are represented as nodes and their similarities as edges connecting the nodes
Graph structure is examined to detect communities of frequently connected nodes which are split to separate clusters Graph-based characterization of repeat clusters
Identified clusters
Virtual graphs used to analyze real data contain up to millions of nodes (reads)
Graph-based clustering
Sequence overlaps between the reads are transformed to a graph where the reads are represented as nodes and their similarities as edges connecting the nodes
Graph structure is examined to detect communities of frequently connected nodes which are split to separate clusters Graph-based characterization of repeat clusters
Extraction of of individual clusters Identified clusters (graph shapes indicate repeat types)
Satellite (simple)
Satellite family (short + long monomers) >
LTR-retrotransposons
Clustering-based repeat identification
Cluster 1
Cluster 2
Cluster 3
CLUSTER = a set of frequently overlapping reads = REPEAT FAMILY
Repeat characterization
Cluster annotation and quantification
Cluster 1
Cluster 2
Cluster 3
…
…
(Cluster x) Repeat characterization
Cluster annotation and quantification Proportions of various repeat types in a genome
RepeatExplorer
What RepeatExplorer can do for you: Identify all sequences with certain number of repetitions Repeat quantification Provide models of repeat popoulations (sequence variability) Help in repeat classification/annotation (in plant genomes, Metazoa in prep.)
What it cannot do: Genome assembly or reconstruction of individual repeat copies Analyze repeats using assembled genomes as input Use long NGS reads (PacBio, Oxford Nanopore) as input (work in progress) Identify some tandem repeats with very short monomers or other low-complexity repeats
Think about proper design of your analysis – RE is just a program designed for a specific type of input (e.g. it needs WGS data as input and it will not work with BAC clones)
Repetitive DNA characterization using RepeatExplorer
Plants
Over 100 species characterized so far
Comparative studies
Whole genome assembly projects
Repetitive DNA characterization using RepeatExplorer
Plants
Over 100 species characterized so far
Comparative studies
Whole genome assembly projects
Mammals
Bats, deer Fish
Austrolebias charrua, Cynopoecilus melanotaenia Insects
Locust, grasshoppers, kissing bugs Worms
Soil helmints
APPLICATIONS: Repeat composition of individual species
Repeat type: Ty3/gypsy Ty1/copia Satellite DNA rDNA other/unknown
cluster size (% of reads) pea (Pisum sativum)
1C = 4,300 Mb
cluster size (% of reads) soybean (Glycine max)
1C = 1,115 Mb
cumulative % of reads Macas et al. (2007) BMC Genomics 8: 427. APPLICATIONS: Comparative analysis
SPECIES (samples) Reads Modify names Merge CLUSTERS CL1 NGS genome seq. A
CL2
B CL3
C CL4
Comparative analysis
proportion cluster size Sequence reads from of reads from multiple species are mixed individual and subjected to clustering species >>> analysis
Efficient identification of homologous repeats from different species
Easy quantification
SPECIES:
Musa ornata
M. balbisiana (B)
M. acuminata (A)
M. textillis (T)
M. beccarii
Ensete gilletii Comparative analysis
SPECIES:
Ensete gilletii Repeats Musa beccarii M. textillis (T) M. balbisiana (B) M. ornata M. acuminata (A)
Phylogeny based on ITS sequences
Novak et al. 2014 Comparative analysis (searching for B-specific repeats)
Detection of clusters enriched on Bs
100 B-specific Next-gen sequencing
75 As + Bs B-enriched ) B
Mix coded 2 %
reads + B and 0 50 % ( /
perform 0 0 1
CLUSTERING * B
Next-gen 2 % sequencing 25
As - 0 -
0 0 50 100 150 200
CL number
Aegilops speltoides, comparative analysis of 0B / 2B plants
CL183 Detection of clusters enriched on Bs CL205 100 CL183 CL205
B
75 CL126 CL148 ) B 2 % + B 0 50 % ( / 0 0 1 B * B 2 % B 25
B
0 B B 0 50 100 150 200 CL number Sequencing by Houben lab FISH by Alevtina Ruban
Repeats significantly enriched on Bs
%2B*100/ Repeat CL reads [%] 0B % 2B % /(%2B+%0B) type
126 5216 0.191 1688 0.120 3528 0.252 68 satellite (86 bp) 148 2877 0.105 934 0.067 1943 0.139 68 tandem (?) 183 731 0.027 1 0.000 730 0.052 100 satellite (~1.1 kb)
205 358 0.013 0 0.000 358 0.026 100 satellite (185 bp) Cluster-centered downstream applications
Cluster annotation and quantification CLUSTERS Represent specific repeat families/variants or their parts They are collections of sequence reads capturing full sequence variability of repeat populations
Using clusters as reference for similarity- based classification of: RNA-seq reads (mRNA, smRNAs,...) Detection of transcribed repeats Comparative analysis (tissues,...) ChIP-seq reads Association of repeats with specific types of chromatin Identification of centromeric repeats by ChIP-seq
Outline of the experiment Illumina seq. (36 nt reads)
Chromatin Sequencing Mapping reads to DNA isolation immunoprecipitation clusters of (“ChIP”) Isolation of nuclei, with CenH3 antibody 9.5 mil. reads long (>100nt) reads, digestion with calculating ratio of micrococcal ChIP / INPUT nuclease Sequencing DNA isolation Cluster = repeat (“INPUT”) 20 mil. reads family
100-fold enrichment 100.00
T 10.00 U t P A u ] N p I g n i /
o /
l P [ P I I h h C C 1.00 no enrichment
0.10 0 100 200 300 400 500 600 cluster no. Identification of centromeric repeats by ChIP-seq
Outline of the experiment Illumina seq. (36 nt reads)
Chromatin Sequencing Mapping reads to DNA isolation immunoprecipitation clusters of (“ChIP”) Isolation of nuclei, with CenH3 antibody 9.5 mil. reads long (>100nt) reads, digestion with calculating ratio of micrococcal ChIP / INPUT nuclease Sequencing DNA isolation Cluster = repeat (“INPUT”) 20 mil. reads family
100.00
10.00 t u p n i
/
P I h C 1.00
Satellites ● 0.10 ●TR_11 0 100 200 300 400 500 600 PisTR-B cluster no. Neumann et al. (2012) PLoS Genetics 8:e1002777 PROFREP - improving repeat annotation in genome assembly
Annotation based on clustering analysis
M. acuminata genome assembly Novak, Hribova, et al. Programme
Tuesday (May 22) Wednesday (May 23) Thursday (May 24)
9:30 - Principles and applications of graph- 9:00 -12:30 9:00 – 12:30 based repeat clustering (J. Macas) ● Lucia Campos-Dominguez – Using Repeat ● Andrew Leitch – The phylogenetic placement Explorer to find the genomic factors driving of Gnetales is important to understand 10 :10 - RepeatExplorer pipeline version 2.0 speciation in Begonia patterns of genome evolution across seed (P. Novák) ● Steven Dodsworth – Repeat dynamics in the plants: insights from Gnetum montanum largest diploid angiosperms, Fritillaria genome 10:30 - 11:00 Coffee break (Liliaceae) ● Sonia Garcia – Ribosomal DNA ● Floris C. Breman – The Pelargonium arrangements in Asteraceae: variability of 5S 11:00 - Using RepeatExplorer output for repeatome as a ‘DNA barcode’ and 35S examined with TAREAN and repeat annotation and quantification (J. ● Hanna Schneeweiss – Repeat dynamics in RepeatExplorer Macas) wild and cultivated chile peppers (genus ● Aleš Kovařík – Diversity of rDNA repeats Capsicum) in gymnosperms analysed by Repeat Explorer 11:20 - Transposon protein databases ● Daniel Vitales – Transposon dynamics and and other tools (P. Neumann) chromosome reorganizations drive genome size ● Vratislav Peška) – Telomeres – current evolution in Anacyclus challenges 11:40 – Assembly annotation using ● Veit Herklotz – Characterization of CANR4 ● Diogo C. Cabral-de-Mello – Satellite DNAs PROFREP and DANTE (N. Hoštáková) satellite repeat in roses and sex chromosomes evolution in Orthoptera ● Danijela Greguraš – Characterization of ● Nils Pilotte – Tropical parasites: improved satellite and centromeric repeats in genus diagnostics utilizing repetitive targets 12:00 - 13:30 Lunch Hieracium
13:30 - (18:00) Practical training I 12:30 - 13:30 Lunch 12:30 - 13:30 Lunch
● design of sequencing and repeat analysis 13:30 - (18:00) Practical training II 13:30 – (18:00) - Practical training III experiments ● introduction to Galaxy environment ● identification of satellite DNA using TAREAN ● combining repeat clustering with ChIP-seq data ● quality control and pre-processing of NGS ● understanding RepeatExplorer output ● identification and phylogenetic analysis of reads, dealing with various read formats ● cluster annotation and repeat composition of the retrotransposon protein domains ● setting up clustering analysis genome ● SeqGrapheR – visualization and annotation of ● comparative clustering of multiple samples ● comparative clustering of multiple species – data the cluster graphs interpretation ● advanced topics, troubleshooting ● repeat quantification (principles, sensitivity and 19:00 → : Dinner at “CITYgastro” reproducibility) ● design of hybridization probes based on RE