microRNA and smallRNA profiling

Anton Enright Group Leader EMBL - European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD, UK

http://www.ebi.ac.uk/enright/

High Throughput Analysis 2013 Thursday, 24 October 13 microRNA Regulation

Adapted from He L. and Hannon G.J., Nature Reviews Genetics 5:522 (2004)

EMBL-EBI 2 High throughput RNA Seq


MicroRNA Processing

microRNA Profiling

• MicroRNA microarray
  • Statistical issues with number of probes
  • Cross-hybridisation and biases
  • Cheap and straightforward

• Small RNA Sequencing
  • Quantitation can be difficult due to biases
  • Ability to detect novel microRNAs, edits and variation
  • Reproducible

• qRT-PCR Based Approaches
  • Quite accurate
  • Lower throughput

• Single Molecule profiling
  • e.g. Nanostring
  • New technology

Nanostring

microRNA microarrays

• MicroRNA array manufacturers love to talk about how many probes they have on their chips.

• In reality there are < 1000 microRNAs for your species and likely <30 expressed in your sample.

• A small number of biologically relevant probes = poor reproducibility and background modelling

• miRPlus, viral microRNAs, snoRNAs are fairly useless.

• In practice, significant changes can be observed, but they need to be validated

microRNA arrays

microRNA Sequencing

smallRNA Seq

• Illumina/Solexa
• Roche 454
• ABI SOLiD
• Ion Torrent

• Issues
  • Amplification biases
  • Poor quantitation
  • Mapping ambiguities
  • Read lengths
  • Size selection
  • Adaptor contamination
  • Read errors

Solexa TruSeq smallRNA prep

smallRNA Seq

• For mRNA sequencing, each molecule is sampled across its length by 30-150nt reads
  • Biases tend to average out across the length of the transcript

• For small RNA sequencing, molecules such as microRNAs are shorter than a typical read
  • No way to average out biases

microRNA Sequencing

• Biases
  • GC bias
  • Barcode ligation bias
  • Adapter ligation bias
  • PCR amplification bias

RNA-ligase-dependent biases

FIGURE 3. miRNA representation by sequencing varies by three orders of magnitude and is dependent on the structure of the mature miRNA and miRNA-adapter product. (A) Unsupervised hierarchical clustering of miRNA profiles derived from cDNA libraries generated from the pool of 815 oligoribonucleotides present in equimolar concentrations (pool A, Supplemental Table 1) using Rnl1, Rnl2(1–249), and Rnl2(1–249)K227Q for the 3’-adapter ligation step and sequenced by Solexa next-generation sequencing platform. (B) Pairwise comparison of Spearman rank correlation coefficients of the miRNA profiles from A. (C) Distribution of average sequence read frequencies of the 770 miRNAs present in equimolar concentrations in pool A in cDNA libraries generated using Rnl1, Rnl2(1–249), and Rnl2(1–249)K227Q in the 3’-adapter ligation step. The number of biological replicates for each distribution is indicated. miRNA relative frequencies vary by 1000-fold.

... least efficiently sequenced miRNAs) and 64% for miR-567 (the most frequently sequenced miRNA). This efficiency range explained well the observed broad sequence read frequency distribution as well as the rank of miRNAs. Furthermore, the up to threefold over-representation of some miRNAs relative to the mean read frequency is consistent with the 21% cumulative ligation yield of pool A RNA (Table 2). We further isolated the products from the 5’-adapter ligation for pool A, miR-567, miR-155, miR-10a, miR-16, and miR-21 and performed reverse transcription reactions with equimolar amounts of adapter-ligated material, using a 25-fold excess of radioactively labeled reverse transcription primer followed by hydrolysis of the RNA template. The yields of primer extension products were comparable (data not shown), indicating that reverse transcription was not a significant source of sequence-specific biases. Lastly, we examined the influence of excessive PCR on small RNA read frequency distribution. A small RNA cDNA library generated from pool A using Rnl2(1–249)K227Q for the 3’-ligation step followed by Solexa sequencing. This ...

Absolute quantitation can be difficult, but differential expression seems more robust
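One way to see why differential expression can be more robust than absolute quantitation is sketched below. It assumes, purely for illustration, that each miRNA's ligation/PCR bias acts as a fixed multiplicative factor that is identical in every library; under that assumption the factor cancels in a between-sample ratio. The miRNA names, abundances and bias factors are all invented.

```python
# Toy illustration: a sequence-specific bias that is constant across
# libraries distorts absolute counts but cancels in fold changes.
# All names and numbers here are invented for the example.

true_a = {"miR-A": 1000.0, "miR-B": 1000.0}   # sample A: equimolar input
true_b = {"miR-A": 2000.0, "miR-B": 500.0}    # sample B: 2x up, 2x down
bias = {"miR-A": 0.01, "miR-B": 10.0}         # 1000-fold range in capture efficiency


def observe(true_abundance, bias_factors):
    """Apply the per-miRNA multiplicative bias to the true abundances."""
    return {m: true_abundance[m] * bias_factors[m] for m in true_abundance}


obs_a = observe(true_a, bias)
obs_b = observe(true_b, bias)

# Within one sample the equimolar miRNAs look ~1000-fold different...
within_sample_ratio = obs_a["miR-B"] / obs_a["miR-A"]

# ...but the B/A fold change of each miRNA is unaffected by its bias.
fold_changes = {m: obs_b[m] / obs_a[m] for m in obs_a}

print("apparent miR-B / miR-A ratio in sample A:", within_sample_ratio)
for mirna, fc in fold_changes.items():
    print(mirna, "fold change B/A:", fc)
```

If the bias differs between libraries (for example, different adapters or barcodes per sample), this cancellation no longer holds, which is one reason to keep library preparation identical across samples.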


mRNA profiling comparisons

RNA-seq: an assessment of technical reproducibility and comparison with gene expression arrays. J.C. Marioni, C.E. Mason, S.M. Mane, M. Stephens, et al. Genome Research (2008)

[Slide shows pages 1512-1513 of Marioni et al., including Figure 3 (Illumina counts vs. normalized array intensities for kidney and liver), Figure 4 (estimated log2 fold changes, liver/kidney, Illumina vs. Affymetrix) and Figure 5 (Venn diagram of genes called differentially expressed by each technology).]

• 86% of differentials found
• 0.76 correlation

Points recoverable from the excerpted pages:
• At an FDR of 0.1%, 11,493 genes were called differentially expressed between liver and kidney from the Illumina data and 8113 from the array data
• 81% of the genes called from the array were also called from the sequencing data
• Log2 fold changes across the two technologies are correlated (Spearman correlation = 0.73): 0.79 for genes with more than 32 reads on average in both tissues, 0.60 for genes with at least one but fewer than 32 reads
• qPCR on 11 discordant genes agreed more closely with the Illumina sequencing results than with the array

EMBO Course 2011

• Comparison of Solexa and Exiqon LNA Arrays

Small RNAseq read contamination

• Band cutting to select the small RNA fraction can be highly inaccurate
  • rRNA
  • tRNA
  • snoRNA
  • pseudogenes (e.g. tRNA pseudogenes)
  • RNA degradation products
  • Repeat elements

• Needs to be as reproducible as possible between samples
• Mapping to microRNAs usually eliminates issues


Small RNAseq read contamination

[Diagram: read layout from 5’ to 3’, labelled Barcode, 5’ adapter, Read, 3’ adapter]

• Barcode Indels/Mismatches
• Adapter contamination
• Primer-Primer dimers
• Poly-A: 5’ AAAAAAAAAAAAA 3’
• Poly-N: 5’ NNNNNNNTCGNN 3’
• Low-Complexity tracts (trinucleotide): 5’ CGTCGTCGTCGT 3’
• Low-Quality tracts (3’)

• These problems are largely ignored or treated naively.
• They significantly affect read mapping.
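The sequence-level problems listed above (poly-A, poly-N, trinucleotide low-complexity) can be flagged with very simple checks. The following is a minimal sketch, not the method used by any particular tool; the function names and both thresholds (`min_run`, `min_ratio`) are hypothetical choices for illustration.

```python
import re


def longest_run(seq, base):
    """Length of the longest uninterrupted run of `base` in `seq`."""
    runs = re.findall(f"{base}+", seq)
    return max((len(run) for run in runs), default=0)


def distinct_trimer_ratio(seq):
    """Distinct overlapping 3-mers divided by total 3-mers (low = repetitive)."""
    trimers = [seq[i:i + 3] for i in range(len(seq) - 2)]
    return len(set(trimers)) / len(trimers) if trimers else 1.0


def flag_read(seq, min_run=6, min_ratio=0.5):
    """Return a list of contamination flags for one read (thresholds are arbitrary)."""
    flags = []
    if longest_run(seq, "A") >= min_run:
        flags.append("poly-A")
    if longest_run(seq, "N") >= min_run:
        flags.append("poly-N")
    if distinct_trimer_ratio(seq) < min_ratio:
        flags.append("low-complexity")
    return flags


examples = [
    "AAAAAAAAAAAAA",            # poly-A example from the slide
    "NNNNNNNTCGNN",             # poly-N example from the slide
    "CGTCGTCGTCGT",             # trinucleotide repeat from the slide
    "TGAGGTAGTAGGTTGTATAGTT",   # let-7a as DNA, expected to pass
]
for seq in examples:
    print(seq, flag_read(seq))
```

A homopolymer such as the poly-A example is also reported as low-complexity, since it contains only one distinct 3-mer.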

Small RNA sequencing QC

[Figure: per-lane QC report. Panels:]
• Nucleotide frequency per base for the lane input sample (base frequency by cycle)
• Quality per cycle for the lane input sample, displayed as raw ASCII values (10%, 50% and 90% quantiles)
• Nucleotide frequency per base for the lane sample following Reaper
• Read length plot for the lane sample following Reaper (unique and repeated reads)
• Cleaned read complexity: the proportion of reads removed by a specified trinucleotide complexity threshold
• Pie chart displaying those reads with recognised barcodes and the basis upon which others were discarded (total reads: 7857769; total accepted reads: 7.86 million, 100%)
• The number of mappable reads for each sample (mappable: 3323204, 80.8%; uniquely mappable: 1418824, 34.5%)
• Percentage of processed reads mapping to a specified number of loci, given that reads reporting more than 20 loci are discarded
• Proportion of reads that map to annotation associated with repeats and with genes for the lane (sense and antisense)
• Percentage of processed reads mapping to each chromosome on the sense and antisense strands (toplevel Ensembl chromosomes)

• Eyeball your data

@HS24_10147:1:1101:1067:1989#0
GCCCGGCTAGCTCAGTCGGTAGAGCATGGGACTGGAATTCTCGGGTGCCN
+
BCCFFFFFHHHHFHHJGHIHIJIIJJIJHI?FHHHEHJIGIJJJJHIJJ!
@HS24_10147:1:1101:2218:1975#0
TCGCTTGGTGCAGATCGGGACTGGAATTCTCGGGTGCCAAGGAACTCCAN
+
BBCFFFFDHHHHHJJJJJJJJJIJFJIIJJJJJJJJJJ@HIJIJIIHII!
@HS24_10147:1:1101:2722:1968#0
CGACCTGGAATTCTCGGGTGCCAAGGAACTCCAGTCACCGATGTATCTCN
+
?@@FFFFFFAHDAEHFHHGIIGGIAE9DFHIGFFHFGEGIGFD=CCCCF!
@HS24_10147:1:1101:2977:1942#0
NCGCTTGGTGCAGATCGGGACTGGAATTCTCGGGTGCCAAGGAACTCCAN
+
!1BDFFDDHHHHHIJJIJIIJJJJIJJJJJJJJJJJJJGHIJJJJJJJJ!
@HS24_10147:1:1101:5876:1978#0
TACAGTCCGACGATCTGGAATTCTCGGGTGCCAAGGCTCCAGTCACCGAN
+
=:=DDDDDD
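Eyeballing these records shows the same sequence, TGGAATTCTCGGGTGCC..., recurring inside every read: the 3’ adapter. The sketch below trims it by exact match on an adapter prefix. This is a naive illustration, not how Reaper works; the prefix `TGGAATTCTCGG` is taken from the reads shown here and should be checked against the library prep kit, and the function names are hypothetical.

```python
# Naive 3' adapter trimming by exact prefix match (illustrative only).
ADAPTER_PREFIX = "TGGAATTCTCGG"  # assumed from the reads on the slide


def parse_fastq(lines):
    """Yield (name, sequence, quality) from well-formed 4-line FASTQ records."""
    records = [line.rstrip("\n") for line in lines]
    for i in range(0, len(records) - 3, 4):
        header, seq, _plus, qual = records[i:i + 4]
        yield header[1:], seq, qual


def trim_3p_adapter(seq, qual, prefix=ADAPTER_PREFIX):
    """Cut the read (and its qualities) at the first occurrence of the prefix."""
    pos = seq.find(prefix)
    if pos == -1:
        return seq, qual
    return seq[:pos], qual[:pos]


fastq_lines = [
    "@HS24_10147:1:1101:1067:1989#0",
    "GCCCGGCTAGCTCAGTCGGTAGAGCATGGGACTGGAATTCTCGGGTGCCN",
    "+",
    "BCCFFFFFHHHHFHHJGHIHIJIIJJIJHI?FHHHEHJIGIJJJJHIJJ!",
    "@HS24_10147:1:1101:2722:1968#0",
    "CGACCTGGAATTCTCGGGTGCCAAGGAACTCCAGTCACCGATGTATCTCN",
    "+",
    "?@@FFFFFFAHDAEHFHHGIIGGIAE9DFHIGFFHFGEGIGFD=CCCCF!",
]

trimmed = [
    (name,) + trim_3p_adapter(seq, qual)
    for name, seq, qual in parse_fastq(fastq_lines)
]
for name, insert, qual in trimmed:
    print(name, insert, len(insert))
```

The second record leaves an insert of only 5 nt, so it is almost entirely adapter. Exact matching misses adapters that contain sequencing errors, which is why dedicated trimmers use alignment with mismatches.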


Some of our smallRNA tools

• Minion
  • De Bruijn graph 3’ assembly analysis for small RNA reads
  • Can identify adapter sequences and large-scale contaminants

• Reaper
  • Extremely fast and accurate adapter finder and trimmer
  • Also cleans
    • Low-complexity tracts
    • PolyA, PolyN
    • Low-scoring tracts
  • Deals with complex read geometries and random seqs

• Tally
  • Collapses redundant reads together rapidly
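Collapsing redundant reads means replacing identical sequences by one entry plus a count, which is worthwhile because small RNA libraries are dominated by a few abundant species. A minimal sketch of the idea follows; it is not Tally's implementation, and `collapse_reads` is a hypothetical helper name.

```python
from collections import Counter


def collapse_reads(reads):
    """Return (sequence, count) pairs, most abundant first, ties sorted by sequence."""
    counts = Counter(reads)
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


# Invented example reads (already adapter-trimmed).
reads = [
    "TGAGGTAGTAGGTTGTATAGTT",
    "TAGCTTATCAGACTGATGTTGA",
    "TGAGGTAGTAGGTTGTATAGTT",
    "TGAGGTAGTAGGTTGTATAGTT",
    "TAGCTTATCAGACTGATGTTGA",
    "TAGCAGCACGTAAATATTGGCG",
]

collapsed = collapse_reads(reads)
for seq, count in collapsed:
    print(seq, count)
```

Downstream mapping then only needs to align each unique sequence once and carry the count along.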

Kraken: A set of tools for quality control and analysis of high-throughput sequence data. Davis, Matthew; van Dongen, Stijn; Abreu-Goodger, Cei; Bartonicek, Nenad; and Enright, Anton J. Methods (2013)

Considerations for smallRNA seq analysis

• Single-end sequencing 50nt
• Typically more samples fail in a smallRNA seq analysis
• Replicates are essential
  • Aim for 4-5 biological replicates per sample if possible
• Huge sequencing depth is not usually required per sample
  • 4M reads per sample can usually be sufficient
  • 20M reads per sample is ideal

• Spike-ins are sometimes used
  • Can be misleading due to sequencing biases
  • Can be hard to control quantity and accuracy

Mapping microRNA reads

Two ways to map small RNA reads

• Two approaches

Bowtie
• Map all reads to the genome with an aligner tool
  • Bowtie: http://bowtie-bio.sourceforge.net
• Select reads that overlap known miRNA loci
• Possibility to detect novel miRNAs (e.g. using miRDeep)
• Problems resolving depth across loci
[Diagram: reads aligned to the genome, piling up on known loci and on a possible novel locus]

BLAST
• BLASTn all reads against miRBase sequences
• Fast and accurate
• Look for >95% identity and no more than 1-2 mismatches
[Diagram: reads aligned against miRBase mature sequences]

http://wwwdev.ebi.ac.uk/enright-srv/krakenbot/
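The BLAST criteria above (>95% identity, at most 1-2 mismatches) can be applied to BLAST tabular output. The sketch below assumes the standard 12-column `-outfmt 6` layout; the hit lines are invented, and `filter_hits` is a hypothetical helper rather than part of Krakenbot.

```python
import csv
import io

# Default column order of BLAST tabular output (-outfmt 6).
COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]


def filter_hits(handle, min_identity=95.0, max_mismatches=2):
    """Yield (read, miRNA) pairs passing the identity and mismatch thresholds."""
    reader = csv.DictReader(handle, fieldnames=COLUMNS, delimiter="\t")
    for row in reader:
        identity_ok = float(row["pident"]) > min_identity
        mismatch_ok = int(row["mismatch"]) <= max_mismatches
        if identity_ok and mismatch_ok:
            yield row["qseqid"], row["sseqid"]


# Invented hits for illustration only.
blast_output = io.StringIO(
    "read1\thsa-let-7a-5p\t100.00\t22\t0\t0\t1\t22\t1\t22\t1e-07\t44.1\n"
    "read2\thsa-miR-21-5p\t95.45\t22\t1\t0\t1\t22\t1\t22\t1e-05\t36.2\n"
    "read3\thsa-miR-16-5p\t90.91\t22\t2\t0\t1\t22\t1\t22\t1e-03\t28.3\n"
)

kept = list(filter_hits(blast_output))
print(kept)
```

For a 22 nt alignment, one mismatch gives 95.45% identity and two give 90.91%, so the two thresholds together effectively allow a single mismatch at that length.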

Krakenbot Mapping Tool

• Takes cleaned reads from Reaper and Tally (via R)
• Each read is scanned using BLASTn against all known microRNA precursor sequences
• Multiple species available

• Up to 2 mismatches allowed (configurable)
• Reads assigned to best match
• Ambiguous reads assigned randomly (rare)
• 5p or 3p side of hairpin called automatically
• Possibility to detect 5p and 3p modifications and RNA editing

• Beta software, not yet published
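The assignment rules above (mismatch cap, best match wins, ties broken at random) can be sketched as follows. This is an illustration of the logic as described on the slide, not Krakenbot's code; the input format, the `assign_reads` name and the fixed random seed are assumptions.

```python
import random
from collections import defaultdict


def assign_reads(hits, max_mismatches=2, seed=0):
    """Assign each read to its lowest-mismatch precursor; break ties randomly.

    hits: iterable of (read_id, precursor, mismatches) tuples.
    """
    rng = random.Random(seed)  # seeded so the example is repeatable
    candidates = defaultdict(list)
    for read_id, precursor, mismatches in hits:
        if mismatches <= max_mismatches:
            candidates[read_id].append((mismatches, precursor))

    assignment = {}
    for read_id, read_hits in candidates.items():
        best = min(mm for mm, _ in read_hits)
        tied = sorted(p for mm, p in read_hits if mm == best)
        assignment[read_id] = rng.choice(tied)
    return assignment


# Invented hits: read2 matches two let-7a precursors equally well.
hits = [
    ("read1", "hsa-mir-21", 0),
    ("read1", "hsa-mir-590", 2),
    ("read2", "hsa-let-7a-1", 0),
    ("read2", "hsa-let-7a-2", 0),
    ("read3", "hsa-mir-16-1", 3),
]

assignment = assign_reads(hits)
print(assignment)
```

Reads whose only hits exceed the mismatch cap (read3 here) are left unassigned.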

MicroRNA 3p modification

microRNA Normalisation and Differential expression

Small RNA Seq Normalisation

• Lane Depth Correction
  • Correct total miRNA mapping counts to be equal across all samples

• Statistical approaches are usually better for normalisation and differential expression
  • DESeq
  • baySeq
  • edgeR
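Lane depth correction amounts to rescaling each sample so that its total miRNA-mapped count matches a common target. The sketch below scales to the mean total across samples; that choice of target, the `depth_correct` name and the counts are assumptions for illustration, and every sample is assumed to have a non-zero total.

```python
def depth_correct(counts):
    """Scale each sample so its total miRNA count equals the mean total.

    counts: {sample: {mirna: raw_count}}; every sample must have total > 0.
    """
    totals = {sample: sum(c.values()) for sample, c in counts.items()}
    target = sum(totals.values()) / len(totals)
    return {
        sample: {m: v * target / totals[sample] for m, v in c.items()}
        for sample, c in counts.items()
    }


# Invented counts: the liver library was sequenced three times deeper.
raw = {
    "brain": {"let-7a": 600, "miR-21": 400},
    "liver": {"let-7a": 1500, "miR-21": 1500},
}

normalised = depth_correct(raw)
for sample, values in normalised.items():
    print(sample, values, "total:", sum(values.values()))
```

Total-count scaling is sensitive to a few very abundant miRNAs dominating a library, which is one reason the model-based packages listed above are usually preferred.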

Other tools for small RNAseq Analysis

• Galaxy
• FASTX toolkit
• UEA Toolkit
• R & Bioconductor
• Reaper, Tally and Kraken
  • http://www.ebi.ac.uk/research/enright/software/kraken

[Background figure: the per-lane small RNA sequencing QC report shown earlier]

EMBO Course example dataset

Human Tissue Libraries Brain Heart Liver

Sequenced on Illumina MiSeq

• http://wwwdev.ebi.ac.uk/enright-srv/courses/embo_oct_2013/ EMBL-EBI 32 High throughput RNA Seq

Downstream Analysis

Prediction of microRNA Targets

microRNA Target Sites

[Figure: a miRNA paired with its target site in a 3’ UTR, marking the seed region, a bulge, and the 6mer and 7mer seed matches.]

• Canonical seed match (miRNA 2-8)
• Shifted seed match (miRNA 1-7), 3’ shift in UTR, usually 1A
• Shifted seed match (miRNA 3-9), 5’ shift in UTR
• 8-mer match, possible but less frequent, usually miRNA 1-9

Mazière, P. & Enright, A.J. Prediction of microRNA targets. Drug Discov. Today 12, 452–458 (2007).
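The seed match types listed above can be illustrated with a small Python sketch. It is not taken from any published tool: the function names `revcomp_rna` and `seed_sites` are hypothetical, only strict Watson-Crick pairing is considered (no G:U wobble, no bulges), and the 8-mer case is left out.

```python
def revcomp_rna(seq):
    """Reverse complement of an RNA sequence (strict Watson-Crick pairs only)."""
    comp = {"A": "U", "U": "A", "G": "C", "C": "G"}
    return "".join(comp[base] for base in reversed(seq))


def seed_sites(mirna, utr):
    """Return sorted (utr_start, label) pairs for each seed match found.

    mirna and utr are RNA strings written 5' to 3'; utr_start is 0-based.
    """
    # 0-based slices of the miRNA corresponding to the three 7-nt windows.
    windows = {
        "canonical (miRNA 2-8)": (1, 8),
        "shifted (miRNA 1-7)": (0, 7),
        "shifted (miRNA 3-9)": (2, 9),
    }
    hits = []
    for label, (lo, hi) in windows.items():
        site = revcomp_rna(mirna[lo:hi])  # sequence expected in the UTR
        start = utr.find(site)
        while start != -1:
            hits.append((start, label))
            start = utr.find(site, start + 1)
    return sorted(hits)


if __name__ == "__main__":
    let7 = "UGAGGUAGUAGGUUGUAUAGUU"  # let-7 mature sequence
    toy_utr = "AAACUACCUCAGGG"       # made-up UTR fragment for illustration
    for start, label in seed_sites(let7, toy_utr):
        print(start, label)
```

Real predictors add further rules on top of plain string matching, so this should be read only as a way to make the window definitions concrete.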


Problems for Target Prediction

• miRNAs are small
• 3’UTRs are large and quality is poor
• Few validated targets available for training
  • Positive controls
  • Negative controls
• Binding sites are not perfectly complementary

Computational Goals

• Given a microRNA
  • Scan genes and find potential binding sites in 3’UTRs
• Use known rules for miRNA binding
  • High complementarity (especially at the 5’ end of the miRNA)
• Method must be selective and sensitive
  • Minimize number of false positives
  • Detect most/all known targets (9 cases, 2 genomes, 3 microRNAs)
    • let-7 > lin-41, hbl-1, daf-12, lin-28 (C. elegans)
    • lin-4 > lin-28, lin-41, lin-14 (C. elegans)
    • bantam > wrinkled (Drosophila)
• Prediction of microRNA function
  • Detected targets of a microRNA imply its function

[Figure: miRNA paired with its target]

Enright AJ., John B., Gaul U., Tuschl T., Sander C., Marks D.S., Genome Biology (2003);5;R1


Types of Algorithms

• Dynamic programming alignment
  • Similar to Smith-Waterman/Waterman-Eggert
  • Search for multiple candidate sites in a UTR given a microRNA sequence
• Thermodynamic Analysis
  • Examine candidate sites for miRNA duplex stability
• Machine-Learning
  • Support vector machines
• Alignment Heuristics
  • Scaling factor rewards high complementarity at the 5’ end of the miRNA (which pairs with the 3’ end of the target site in the UTR)
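A minimal sketch of the dynamic programming idea, with a scaling factor on the seed positions, might look as follows in Python. All scores (`PAIR_SCORES`, `MISMATCH`, `GAP`, `SEED_SCALE`) are illustrative assumptions, not the parameters of any published method, and the function name `best_site` is hypothetical. A linear gap penalty is used for brevity, and only the single best site is reported rather than multiple non-overlapping ones.

```python
# Complementarity scores; G:U wobble pairs score lower than Watson-Crick pairs.
PAIR_SCORES = {
    ("G", "C"): 5, ("C", "G"): 5,
    ("A", "U"): 5, ("U", "A"): 5,
    ("G", "U"): 2, ("U", "G"): 2,
}
MISMATCH = -3      # assumed penalty for a non-pairing position
GAP = -8           # assumed linear gap penalty
SEED_SCALE = 2.0   # assumed multiplier for miRNA positions 2-8


def best_site(mirna, utr):
    """Smith-Waterman-style local alignment scored on complementarity.

    The miRNA is read 3'->5' and the UTR 5'->3', so aligned positions are
    the ones that would pair in a duplex. Returns (best_score, utr_end),
    where utr_end is the 1-based UTR position at which the best site ends.
    """
    rev = mirna[::-1]
    n, m = len(rev), len(utr)
    H = [[0.0] * (m + 1) for _ in range(n + 1)]
    best, best_end = 0.0, 0
    for i in range(1, n + 1):
        pos = n - i + 1  # 1-based position counted from the miRNA 5' end
        scale = SEED_SCALE if 2 <= pos <= 8 else 1.0
        for j in range(1, m + 1):
            s = PAIR_SCORES.get((rev[i - 1], utr[j - 1]), MISMATCH) * scale
            H[i][j] = max(
                0.0,                   # start a new local alignment
                H[i - 1][j - 1] + s,   # pair or mismatch
                H[i - 1][j] + GAP,     # unpaired miRNA base
                H[i][j - 1] + GAP,     # unpaired UTR base
            )
            if H[i][j] > best:
                best, best_end = H[i][j], j
    return best, best_end
```

Candidate sites found this way would normally be passed on to the thermodynamic step for duplex stability, which is not sketched here.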

Types of Algorithms

Target Site Conservation

[Figure: a target site in a human transcript 3’ UTR compared against the equivalent transcript 3’ UTRs from human, mouse, rat and zebrafish.]
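As a rough illustration of a conservation filter, the Python sketch below keeps a site only if it is present in the equivalent 3’ UTRs of enough species. This is a simplification under stated assumptions: it tests for presence anywhere in each UTR rather than at aligned positions, and the names `conserved_in`, `is_conserved` and `min_species` are hypothetical.

```python
def conserved_in(site, utrs):
    """Return the species whose 3' UTR contains the site.

    utrs maps a species name to the 3' UTR sequence of the equivalent
    transcript in that species.
    """
    return [species for species, seq in utrs.items() if site in seq]


def is_conserved(site, utrs, min_species=2):
    """True if the site is found in at least min_species of the UTRs."""
    return len(conserved_in(site, utrs)) >= min_species


if __name__ == "__main__":
    # Made-up sequences, for illustration only.
    toy_utrs = {
        "human": "AAACUACCUCAGG",
        "mouse": "GGCUACCUCAAA",
        "rat": "GGCUACCUCUUU",
        "zebrafish": "GGGGCCCCAAAA",
    }
    print(conserved_in("CUACCUC", toy_utrs))
```

A position-aware version would first align the orthologous UTRs and then require the site to fall in the same alignment columns.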


Prediction Algorithms

Mazière, P. & Enright, A.J. Prediction of microRNA targets. Drug Discov. Today 12, 452–458 (2007).

Many other methods and websites now available.

State of de novo miRNA target prediction?

Comparison of Target Prediction Methods: 10 mouse tissues, large-scale validation

[Figure: ROC-style plot. y-axis: TPR using 390 up regulated transcripts (0.0 to 1.0); x-axis: FPR using 6199 non up regulated transcripts (0.0 to 1.0).]

Target methods (# predictions in universe):
• PITA ALL (2132)
• 7mer frequency (1578)
• DIANA-microT (1029)
• TargetScan5 context (1022)
• EIMMo (628)
• 8mer frequency (254)
• TargetScan5 PCT (224)
• miRBase conservation (185)
• PITA TOP (135)
• rna22 (135)
• MirTargets2 (76)
• PicTar (75)

Systematic discovery of microRNA signatures from large-scale gene expression data. Abreu-Goodger C., van Dongen S., Enright A.J. (In Preparation) 2011
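The two axes of such a comparison can be computed per method as in the Python sketch below. It assumes each method is summarised as a set of predicted target identifiers and that the validation sets of up regulated and non up regulated transcripts are given; the function name `tpr_fpr` and the toy identifiers are hypothetical.

```python
def tpr_fpr(predicted, positives, negatives):
    """True and false positive rates of one prediction set.

    predicted: transcripts predicted as targets by a method
    positives: transcripts up regulated when the miRNA is lost
    negatives: transcripts not up regulated
    Predictions outside both sets are ignored.
    """
    predicted = set(predicted)
    positives = set(positives)
    negatives = set(negatives)
    tp = len(predicted & positives)
    fp = len(predicted & negatives)
    return tp / len(positives), fp / len(negatives)


if __name__ == "__main__":
    tpr, fpr = tpr_fpr(
        predicted={"t1", "t2", "t5", "t99"},
        positives={"t1", "t2", "t3", "t4"},
        negatives={"t5", "t6", "t7", "t8", "t9"},
    )
    print(tpr, fpr)
```

Each method then contributes one point, and methods with a score cutoff trace a curve as the cutoff is varied.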

miRNA effects are observable at the mRNA level

• HeLa cells were transfected with different miRs
• Down regulated transcripts were enriched in miR seed targets

Lim et al; Nature 433:769 (2005)


How can we exploit this experimentally?

• Perturb a microRNA in system of interest
• Measure transcript expression as a readout
• Transcripts whose expression changes may be direct targets of miRNA
• Computational analysis of genelist for confirmation

[Figure: wildtype versus miRNA-perturbed system (e.g. deletion, knockdown, over-expression), with a bar chart of expression (0 to 4) for Transcript A and Transcript B in wildtype and miRNA knockdown.]
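The readout step can be sketched in Python as a ranking of transcripts by log2 fold change between the perturbed and wildtype samples. The function name `rank_by_log2fc`, the pseudocount and the toy expression values are assumptions for illustration; a real analysis would use replicates and a proper differential expression test.

```python
import math


def rank_by_log2fc(wildtype, perturbed, pseudocount=1.0):
    """Rank transcripts from most up regulated to most down regulated.

    wildtype and perturbed map transcript names to expression values.
    A pseudocount avoids division by zero for unexpressed transcripts.
    """
    changes = {
        name: math.log2((perturbed[name] + pseudocount) /
                        (wildtype[name] + pseudocount))
        for name in wildtype
        if name in perturbed
    }
    return sorted(changes.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # After a miRNA knockdown, direct targets are expected to go up.
    wt = {"Transcript A": 1.0, "Transcript B": 3.0}
    kd = {"Transcript A": 3.0, "Transcript B": 3.0}
    for name, lfc in rank_by_log2fc(wt, kd):
        print(name, lfc)
```

The resulting ranked gene list is the input expected by seed enrichment analyses on the corresponding 3’ UTRs.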

How can we prove causality?

Sylamer: An algorithm for miRNA seed enrichment analysis

[Figure: 3’ UTRs ordered from Rank 1 to Rank N, with the enrichment/depletion profile for a specific word along the ranking.]

Fast assessment of miRNA binding and siRNA off-targets from Expression Data. van Dongen S., Abreu-Goodger C., Enright AJ. Nature Methods 2008
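The idea of an enrichment profile along a ranked list of 3’ UTRs can be sketched with a hypergeometric test in Python. This is a simplified illustration, not Sylamer's implementation: it counts UTRs containing the word at least once (rather than word occurrences), reports enrichment only (not depletion), and the names `hypergeom_sf` and `enrichment_profile` are hypothetical.

```python
from math import comb, log10


def hypergeom_sf(k, N, K, n):
    """P(X >= k) when n items are drawn from N, of which K carry the word."""
    upper = min(K, n)
    favourable = sum(comb(K, i) * comb(N - K, n - i)
                     for i in range(k, upper + 1))
    return favourable / comb(N, n)


def enrichment_profile(ranked_utrs, word, step=1):
    """Return (cutoff, -log10 p) pairs for growing leading bins.

    ranked_utrs: UTR sequences ordered from Rank 1 to Rank N, for example
    by fold change after a miRNA perturbation.
    """
    has_word = [word in utr for utr in ranked_utrs]
    N, K = len(has_word), sum(has_word)
    profile = []
    for n in range(step, N + 1, step):
        k = sum(has_word[:n])          # UTRs with the word in the top n
        p = hypergeom_sf(k, N, K, n)
        profile.append((n, -log10(p)))
    return profile


if __name__ == "__main__":
    # Made-up UTRs; GCACUU is used here as an example seed-match word.
    utrs = ["AAGCACUU", "GCACUUAA", "GGGG", "CCCC", "UUUU", "AAAA"]
    for cutoff, score in enrichment_profile(utrs, "GCACUU"):
        print(cutoff, round(score, 3))
```

A peak near the top of the ranking suggests that UTRs carrying the word are concentrated among the most changed transcripts. For very long lists the p-value can underflow to zero, which this sketch does not guard against.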


Does this actually work?


MZ-Dicer Zebrafish

Inject miRNA (miR-430)

Measure expression on microarray

MicroRNAs regulate brain morphogenesis in zebrafish. Giraldez, A. J., Cinalli, R. M., Glasner, M. E., Enright, A. J., Thomson, J. M., Baskerville, S., Hammond, S. M., Bartel, D. P., and Schier, A. F. (2005). Science 308, 833–838.

Zebrafish MZ-Dicer embryos, miR-430

Zebrafish MiR-430 promotes deadenylation and clearance of maternal mRNAs. Giraldez, A. J., Mishima, Y., Rihel, J., Grocock, R. J., van Dongen, S., Inoue, K., Enright, A. J., and Schier, A. F. (2006). Science 312, 75–79.

Targets Validate Experimentally

Zebrafish MiR-430 promotes deadenylation and clearance of maternal mRNAs. Giraldez, A. J., Mishima, Y., Rihel, J., Grocock, R. J., van Dongen, S., Inoue, K., Enright, A. J., and Schier, A. F. (2006). Science 312, 75–79.


Downstream Analysis

Prediction of Novel microRNAs

Identification of novel miRNAs

miRDeep2 accurately identifies known and hundreds of novel microRNA genes in seven animal clades. Friedländer MR, Mackowiak SD, Li N, Chen W, Rajewsky N, NAR 2012


ChIP-seq on RNA? CLIP Assays (HITS-CLIP, PAR-CLIP, iCLIP)

• High-throughput sequencing of precipitated mRNAs crosslinked to Argonaute and miRNA
• Ideally, sequenced mRNA fragments correspond to likely locations of miRNA target binding sites
• Technically challenging
  • Amplification biases
  • IP problems
  • Experimental noise

Argonaute HITS-CLIP decodes microRNA-mRNA interaction maps. Chi SW, Zang JB, Mele A, Darnell RB. Nature. 2009 Jul 23;460(7254):479-86.

CLIP: Cross-Linking and ImmunoPrecipitation

• General scheme
  • In vivo cross-linking of the protein-RNA complex
  • Cell lysis and partial RNA degradation
  • Immunoprecipitation of the complex
  • Protein degradation
  • RNA sequencing

Genome-wide direct identification of microRNA bound targets

HITS-CLIP (High throughput sequencing – CLIP)

CLIP PROBLEMS:

1. Low efficiency of cross-linking
2. Location of cross-linking not immediately identifiable.

Zhang, C., & Darnell, R.B. (2011) Nature Biotechnology 29, 607–614

PAR-CLIP (Photoactivatable-Ribonucleoside-Enhanced CLIP)

→ Incorporation of 4-SU: T to C transition.

PAR-CLIP PROBLEMS:

1. Uncross-linked RNA with 4-SU: background mutation rate of 20%
2. Restricted to cell-based assays

Hafner, M., et al. (2010) Cell 141(1), 129-41
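Counting T to C transitions against a reference can be sketched in Python as below. The sketch assumes reads are already aligned without gaps, are given as (start, sequence) pairs in the DNA alphabet, and lie on the same strand as the reference; the function name `t_to_c_sites` is hypothetical.

```python
def t_to_c_sites(reference, reads):
    """Fraction of reads showing C at each reference T position.

    reference: reference sequence (DNA alphabet)
    reads: list of (start, sequence) pairs, start being the 0-based
           reference position of the first read base, no gaps
    Returns {position: conversion_fraction} for positions with at least
    one T to C read.
    """
    coverage = [0] * len(reference)
    conversions = [0] * len(reference)
    for start, seq in reads:
        for offset, base in enumerate(seq):
            pos = start + offset
            if pos >= len(reference):
                break  # read runs past the end of the reference
            if reference[pos] == "T":
                coverage[pos] += 1
                if base == "C":
                    conversions[pos] += 1
    return {
        pos: conversions[pos] / coverage[pos]
        for pos in range(len(reference))
        if conversions[pos] > 0
    }


if __name__ == "__main__":
    ref = "ACGTACGT"
    toy_reads = [(0, "ACGCAC"), (2, "GTACGT"), (2, "GCACGC")]
    print(t_to_c_sites(ref, toy_reads))
```

Given the background mutation rate noted above, candidate cross-link sites would be those whose conversion fraction clearly exceeds that background at adequate coverage.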

HITS-CLIP / CIMS (Crosslinking-induced mutations)

→ Irreversible covalent bond
→ Reverse transcriptase does not stop at a certain rate
→ Deletion at 8-20% rate
→ Enables mapping of crosslink sites

Correct read or an error?

Zhang, C., & Darnell, R.B. (2011) Nature Biotechnology 29, 607–614

Current Goals for CLIP Assays

a) To assess the usefulness of data from Ago2 CLIP assays

• Analyse data from several assays for each protocol
• Compare the results of different techniques

b) To develop pipelines and computational algorithms for CLIP data analysis

• Process all samples in a cohesive manner
• Development of a robust computational tool, designed specifically for CLIP

c) To use data from successful CLIP assays to study the mechanics and rules of miRNA binding

• High-quality data will allow us to explore the features responsible for miRNA binding