Approximating Sequence Similarity Brad Solomon
March 27, 2019 Lecture 16: Scalable Methods for Genomics Approximating Sequence Similarity Brad Solomon With Special Focus on RNA-Seq
March 27, 2019 Lecture 16: Scalable Methods for Genomics A Quick Recap
Overlap between two sequences
overlap (19 bases) overhang (6 bases) …AGCCTAGACCTACAGGATGCGCGGACACGTAGCCAGGAC CAGTACTTGGATGCGCTGACACGTAGCTTATCCGGT… overhang % identity = 18/19 % = 94.7% RNA-seq Challenges overlap - region of similarity between regions overhang - un-aligned ends of the sequences Challenge 1: Eukaryotic genes are spliced The assembler screens merges based on: Solution: Use a spliced aligner, and assemble isoforms • length of overlap TopHat: discovering spliced junctions with RNA-Seq. • % identity in overlap region Trapnell et al (2009) Bioinformatics. 25:0 1105-1111 • maximum overhang size.
[How do we compute the overlap?] Challenge 2: Read Count != Transcript abundance Solution: Infer underlying abundances (e.g. TPM) [Do we really want to do all-vs-all?] Transcript assembly and quantification by RNA-seq Trapnell et al (2010) Nat. Biotech. 25(5): 511-515 See Lecture 4 Assembly & Whole Genome Alignment Challenge 3: Transcript abundances are stochastic Solution: Replicates, replicates, and more replicates
RNA-seq differential expression studies: more sequence or more replication? Liu et al (2013) Bioinformatics. doi:10.1093/bioinformatics/btt688
!2 See Lecture 10 RNA Sequencing Sequence similarity at scale is a ‘big’ problem Open Access Data (Gigabases)
3 Sequence similarity at scale is a ‘big’ problem
The exponential growth of sequence data is a “Big Data” problem Open Access Data (Gigabases)
3 Biological questions at scale
!4 Biological questions at scale
Experiment Discovery I have sample X, find me related samples
!4 Biological questions at scale
Experiment Discovery I have sample X, find me related samples
Containment Query I am interested in transcript X, find me studies expressing it
!4 Biological questions at scale
Experiment Discovery I have sample X, find me related samples
Containment Query I am interested in transcript X, find me studies expressing it
Cardinality Estimation
I have N studies, how many unique sequences are present?
!4 Biological questions at scale
Experiment Discovery I have sample X, find me related samples
Containment Query I am interested in transcript X, find me studies expressing it
Cardinality Estimation
I have N studies, how many unique sequences are present?
Expression Estimation I am interested in the global expression patterns of transcriptome X on N studies
!4 Biological questions at scale
Experiment Discovery I have sample X, find me related samples
Containment Query I am interested in transcript X, find me studies expressing it
Cardinality Estimation
I have N studies, how many unique sequences are present?
Expression Estimation I am interested in the global expression patterns of transcriptome X on N studies
How can we solve these problems?
!4 Biological questions at scale
Experiment Discovery I have sample X, find me related samples
Containment Query I am interested in transcript X, find me studies expressing it
Cardinality Estimation
I have N studies, how many unique sequences are present?
Expression Estimation I am interested in the global expression patterns of transcriptome X on N studies
How can we solve these problems?
!5 Biological questions at scale
Experiment Discovery Measure similarity as alignment overlap I have sample X, find me related samples
Containment Query I am interested in transcript X, find me studies expressing it
Cardinality Estimation
I have N studies, how many unique sequences are present?
Expression Estimation I am interested in the global expression patterns of transcriptome X on N studies
How can we solve these problems?
!5 Biological questions at scale
Experiment Discovery Measure similarity as alignment overlap I have sample X, find me related samples
Containment Query Align each study to a reference, observe X I am interested in transcript X, find me studies expressing it
Cardinality Estimation
I have N studies, how many unique sequences are present?
Expression Estimation I am interested in the global expression patterns of transcriptome X on N studies
How can we solve these problems?
!5 Biological questions at scale
Experiment Discovery Measure similarity as alignment overlap I have sample X, find me related samples
Containment Query Align each study to a reference, observe X I am interested in transcript X, find me studies expressing it
Align each study to a reference, observe Cardinality Estimation I have N studies, how many unique sequences are present?
Expression Estimation I am interested in the global expression patterns of transcriptome X on N studies
How can we solve these problems?
!5 Biological questions at scale
Experiment Discovery Measure similarity as alignment overlap I have sample X, find me related samples
Containment Query Align each study to a reference, observe X I am interested in transcript X, find me studies expressing it
Align each study to a reference, observe Cardinality Estimation I have N studies, how many unique sequences are present?
Align each study to a reference, normalize and observe Expression Estimation I am interested in the global expression patterns of transcriptome X on N studies
How can we solve these problems?
!5 Methods for alignment at scale
ExAC Browser Beta
!6 Methods for alignment at scale
ExAC Browser Beta
!6 Methods for alignment at scale
ExAC Browser Beta
!6 Rail-RNA: Bulk Alignment
Given a set of N sequencing studies,
1) Aggregate the reads into sets of overlapping reads and readlets (a subsequence of a set of reads with partial overlap)
2) Perform parallel alignment where each thread is given distinct reads based on nucleotide similarity
3) Second pass alignment on all non-perfect or tied score alignments adds quality scores to break ties. Additional readlets are generated for poor alignments
4) Readlets are parallel aligned based on nucleotide similarity
5) Further steps are taken to identify exon-exon junctions and one final alignment is performed using the bulk exon data identified from the studies
6) All individual alignments are output, with a single primary alignment compiled for each individual study
Rail-RNA: scalable analysis of RNA-seq splicing and coverage Nellore et al (2017) Bioinformatics Rail-RNA: Bulk Alignment
Given a set of N sequencing studies,
1) Aggregate the reads into sets of overlapping reads and readlets (a subsequence of a set of reads with partial overlap)
2) Perform parallel alignment where each thread is given distinct reads based on nucleotide similarity
3) Second pass alignment on all non-perfect or tied score alignments adds quality scores to break ties. Additional readlets are generated for poor alignments
4) Readlets are parallel aligned based on nucleotide similarity
5) Further steps are taken to identify exon-exon junctions and one final alignment is performed using the bulk exon data identified from the studies
6) All individual alignments are output, with a single primary alignment compiled for each individual study
Rail-RNA: scalable analysis of RNA-seq splicing and coverage Nellore et al (2017) Bioinformatics Rail-RNA: Bulk Alignment
Given a set of N sequencing studies,
1) Aggregate the reads into sets of overlapping reads and readlets (a subsequence of a set of reads with partial overlap)
2) Perform parallel alignment where each thread is given distinct reads based on nucleotide similarity
3) Second pass alignment on all non-perfect or tied score alignments adds quality scores to break ties. Additional readlets are generated for poor alignments
4) Readlets are parallel aligned based on nucleotide similarity
5) Further steps are taken to identify exon-exon junctions and one final alignment is performed using the bulk exon data identified from the studies
6) All individual alignments are output, with a single primary alignment compiled for each individual study
Rail-RNA: scalable analysis of RNA-seq splicing and coverage Nellore et al (2017) Bioinformatics Recount2: Combining methods
1) A collection of studies is selected 2) Rail-RNA was run on cloud computing services in batches
48,558 samples from SRA 11,350 samples from TCGA 9,662 samples from GTEx 4) The resulting data was made publicly available BigWig
3) Each alignment is stored as a BigWig (a dense data storage structure) Reproducible RNA-seq analysis using recount2 Collado-Torres et al (2017) Nature Biotechnology Recount2: Combining methods
1) A collection of studies is selected 2) Rail-RNA was run on cloud computing services in batches
48,558 samples from SRA 11,350 samples from TCGA 9,662 samples from GTEx 4) The resulting data was made publicly available How were these selected? BigWig
3) Each alignment is stored as a BigWig (a dense data storage structure) Reproducible RNA-seq analysis using recount2 Collado-Torres et al (2017) Nature Biotechnology Recount2: Combining methods
How were the batches selected? 1) A collection of studies is selected 2) Rail-RNA was run on cloud computing services in batches
48,558 samples from SRA 11,350 samples from TCGA 9,662 samples from GTEx 4) The resulting data was made publicly available How were these selected? BigWig
3) Each alignment is stored as a BigWig (a dense data storage structure) Reproducible RNA-seq analysis using recount2 Collado-Torres et al (2017) Nature Biotechnology Recount2: Combining methods
How were the batches selected? 1) A collection of studies is selected 2) Rail-RNA was run on cloud computing services in batches
48,558 samples from SRA How is this designed? 11,350 samples from TCGA 9,662 samples from GTEx 4) The resulting data was made publicly available How were these selected? BigWig
3) Each alignment is stored as a BigWig (a dense data storage structure) Reproducible RNA-seq analysis using recount2 Collado-Torres et al (2017) Nature Biotechnology Alignment can’t scale to Big Data
Recount2 (2017) Open Access Data (Gigabases)
9 Alignment can’t scale to Big Data
Recount2 (2017) Recount2 processed 70,603 samples Open Access Data (Gigabases)
9 Alignment can’t scale to Big Data
The SRA has 4,782,578 samples
Recount2 (2017) Recount2 processed 70,603 samples Open Access Data (Gigabases)
9 Recap: Sailfish BRIEF COMMUNICATIONS 1) Indexing Parse reference transcriptome into kmers Figure •1 Overview of the Sailfish pipeline. (a,b) Sailfish consists of an Reference transcripts indexing phase (a) that is invoked by the command ‘sailfish index’ and a quantification phase (b) invoked by the command ‘sailfish quant’. All unique kmers are hashed and counted Read data The Sailfish• index has four components: (1) a perfect hash function mapping a Index each k-mer in the transcript set to a unique integer between 0 and N−1, (per transcriptome & Perfect hash function (1) where N is the number of unique k-mers in the set of transcripts; (2) an array choice of k) Array of k-mer counts (2) recording• theTwo number indices of times each for k-mer bi-directional occurs in the reference mapping set; (3) an of index mapping each transcript to the multiset of k-mers that it contains; Transcript k-mer transcripts and kmers to contained to containing (4) an index mapping each k-mer to the set of transcripts in which it k-mers transcripts appears. The quantification phase consists of counting the indexed k-mers (3) (4) in the set of reads and then applying an EM procedure to determine the maximum-likelihood estimates of relative transcript abundance. K-mer count b Quant assignments are illustrated by vertical gray bars on lines representing known Intersect reads with Unhashable (per set of reads) transcripts; the horizontal lines intersecting the gray bars represent the indexed k-mers k-mers average of the current k-mer count assignments for each transcript. Hashable k-mers may be more affected by errors in the reads (Supplementary Fig. 1). Aggregate counts Further, many of our data structures can be represented as arrays of atomic integers (Online Methods). This allows our software to Estimate transcript be concurrent and lock-free where possible, leading to an approach abundances from allocated k-mers that scales well with the number of available CPUs (Supplementary Fig. 2). Additional benefits of the Sailfish approach are discussed in
Supplementary Note 1. … … Sailfish works in two phases: indexing and quantification (Fig. 1). A Sailfish index is built from a particular set of reference transcripts Transcripts with Reallocate Average k-mer current k-mer k-mers based coverage (a FASTA sequence file) and a specific choice of k-mer length, k. allocations on abundance The index consists of data structures that make counting k-mers in estimates a set!10 of reads and resolving their potential origin in the set of tran- scripts efficient (Online Methods). The most important data struc- needs to be rebuilt only when the set of reference transcripts or the ture in the index is the minimal perfect hash function9 that maps choice of k changes. each k-mer in the reference transcripts to an integer identifier such The quantification phase of Sailfish takes as input the above that no two k-mers share an identifier. Pairing the minimum perfect index and a set of RNA-seq reads and produces an estimate of the hash function with an atomically updateable array of k-mer counts relative abundance of each transcript in the reference, measured allows k-mers to be counted even faster than with existing advanced in reads per kilobase per million mapped reads (RPKM), k-mers lock-free hashes such as that used in Jellyfish10. The index also con- per kilobase per million mapped k-mers (KPKM) and transcripts tains a pair of look-up tables that allow fast access to the k-mers per million (TPM) (see Online Methods for the definitions of these appearing in a specific transcript as well as the transcripts in which a measures). First, Sailfish counts the number of times each indexed particular k-mer appears, both in amortized constant time. The index k-mer occurs in the set of reads. Owing to the index, this process is efficient and scalable (Supplementary Fig. 2). Sailfish then applies an EM procedure to determine maximum likelihood estimates a c 4
) 8 10 Alignment 2 Figure 2 Speed and accuracy of Sailfish. (a) The correlation between qPCR Quantification 6 estimates of gene abundance (x axis) and the estimates of Sailfish. The 4 qPCR results are taken from the microarray quality control study (MAQC)15. 2 18.55 h 0 103 The results shown here are for the human brain tissue, and the RNA-seq– 10.19 h 7.70 h –2 6.03 h 6.96 h based estimates were computed using the reads from SRA accession –4
Sailfish KPKM(log SRX016366 (81,250,481 35bp single-end reads). The set of transcripts –6 2.27 h used in this experiment were the curated RefSeq20 transcripts (accession 0 2 4 2 –8 –6 –4 –2 10 –10 prefix NM) from hg18 (31,148 transcripts). (b) The correlation between qPCR(log2) Time (min) the ground truth FPKM in a simulated data set (x axis) and the abundance b 0.26 h estimates of Sailfish. The quantification in this experiment was performed ) 2 10 1 on a set of 96,520 transcript sequences taken from Ensembl21 GRCh37.73. 10 0.09 h (c) The total time taken by each method, Sailfish, RSEM, eXpress and 5 Cufflinks, to estimate isoform abundance on each data set. The total time 0 taken by a method is the height of the corresponding bar, and the total is 0 10 further broken down into the time taken to perform read-alignment (for
Sailfish KPKM(log –5 press Sailfish, we instead measured the time taken to count the k-mers in the Sailfish RSEM Sailfish RSEM eX Cufflinks eXpressCufflinks –5 0 5 10 read set) and the time taken to quantify abundance given the aligned SRX016366 Synthetic reads (or k-mer counts). All tools were run in multithreaded mode (where True FPKM(log2) applicable) and were allowed to use up to 16 threads. (d) Accuracy of each d of the methods on human brain tissue and a synthetic data set. Accuracy Human brain tissue Synthetic is measured by the Pearson (log-transformed) and Spearman correlation Sailfish RSEM eXpress CufflinksSailfish RSEM eXpress Cufflinks coefficients between estimated abundance values and MAQC qPCR data (for Pearson 0.85 0.82 0.85 0.85 0.96 0.96 0.95 0.94 human brain tissue) or ground truth (for simulated data). Root-mean-square Spearman 0.84 0.80 0.85 0.85 0.76 0.77 0.77 0.75 RMSE – – – – 6.06 8.90 8.83 10.05 error (RMSE) and median percentage error (medPE) are calculated medPE – – – – 6.50 12.48 14.06 12.42 as described in the Online Methods.
NATURE BIOTECHNOLOGY VOLUME 32 NUMBER 5 MAY 2014 463 Recap: Sailfish BRIEF COMMUNICATIONS 1) Indexing Parse reference transcriptome into kmers Figure •1 Overview of the Sailfish pipeline. (a,b) Sailfish consists of an Reference transcripts indexing phase (a) that is invoked by the command ‘sailfish index’ and a quantification phase (b) invoked by the command ‘sailfish quant’. All unique kmers are hashed and counted Read data The Sailfish• index has four components: (1) a perfect hash function mapping a Index each k-mer in the transcript set to a unique integer between 0 and N−1, (per transcriptome & Perfect hash function (1) where N is the number of unique k-mers in the set of transcripts; (2) an array choice of k) Array of k-mer counts (2) recording• theTwo number indices of times each for k-mer bi-directional occurs in the reference mapping set; (3) an of index mapping each transcript to the multiset of k-mers that it contains; Transcript k-mer transcripts and kmers to contained to containing (4) an index mapping each k-mer to the set of transcripts in which it k-mers transcripts appears. The quantification phase consists of counting the indexed k-mers (3) (4) in the set of reads and then applying an EM procedure to determine the maximum-likelihood estimates of relative transcript abundance. K-mer count b Quant assignments are illustrated by vertical gray bars on lines representing known Intersect reads with Unhashable (per set of reads) transcripts; the horizontal lines intersecting the gray bars represent the indexed k-mers k-mers average of the current k-mer count assignments for each transcript. 2) Quantification Hashable k-mers • Count kmers in read set may be more affected by errors in the reads (Supplementary Fig. 1). Aggregate counts Further, many of our data structures can be represented as arrays of atomic• Use integers EM (Online procedure Methods). Thisto estimate allows our software transcript to Estimate transcript be concurrent and lock-free where possible, leading to an approach abundances from abundances, repeating as necessary allocated k-mers that scales well with the number of available CPUs (Supplementary Fig. 2). Additional benefits of the Sailfish approach are discussed in
Supplementary Note 1. … … Sailfish works in two phases: indexing and quantification (Fig. 1). A Sailfish index is built from a particular set of reference transcripts Transcripts with Reallocate Average k-mer current k-mer k-mers based coverage (a FASTA sequence file) and a specific choice of k-mer length, k. allocations on abundance The index consists of data structures that make counting k-mers in estimates a set!10 of reads and resolving their potential origin in the set of tran- scripts efficient (Online Methods). The most important data struc- needs to be rebuilt only when the set of reference transcripts or the ture in the index is the minimal perfect hash function9 that maps choice of k changes. each k-mer in the reference transcripts to an integer identifier such The quantification phase of Sailfish takes as input the above that no two k-mers share an identifier. Pairing the minimum perfect index and a set of RNA-seq reads and produces an estimate of the hash function with an atomically updateable array of k-mer counts relative abundance of each transcript in the reference, measured allows k-mers to be counted even faster than with existing advanced in reads per kilobase per million mapped reads (RPKM), k-mers lock-free hashes such as that used in Jellyfish10. The index also con- per kilobase per million mapped k-mers (KPKM) and transcripts tains a pair of look-up tables that allow fast access to the k-mers per million (TPM) (see Online Methods for the definitions of these appearing in a specific transcript as well as the transcripts in which a measures). First, Sailfish counts the number of times each indexed particular k-mer appears, both in amortized constant time. The index k-mer occurs in the set of reads. Owing to the index, this process is efficient and scalable (Supplementary Fig. 2). Sailfish then applies an EM procedure to determine maximum likelihood estimates a c 4
) 8 10 Alignment 2 Figure 2 Speed and accuracy of Sailfish. (a) The correlation between qPCR Quantification 6 estimates of gene abundance (x axis) and the estimates of Sailfish. The 4 qPCR results are taken from the microarray quality control study (MAQC)15. 2 18.55 h 0 103 The results shown here are for the human brain tissue, and the RNA-seq– 10.19 h 7.70 h –2 6.03 h 6.96 h based estimates were computed using the reads from SRA accession –4
Sailfish KPKM(log SRX016366 (81,250,481 35bp single-end reads). The set of transcripts –6 2.27 h used in this experiment were the curated RefSeq20 transcripts (accession 0 2 4 2 –8 –6 –4 –2 10 –10 prefix NM) from hg18 (31,148 transcripts). (b) The correlation between qPCR(log2) Time (min) the ground truth FPKM in a simulated data set (x axis) and the abundance b 0.26 h estimates of Sailfish. The quantification in this experiment was performed ) 2 10 1 on a set of 96,520 transcript sequences taken from Ensembl21 GRCh37.73. 10 0.09 h (c) The total time taken by each method, Sailfish, RSEM, eXpress and 5 Cufflinks, to estimate isoform abundance on each data set. The total time 0 taken by a method is the height of the corresponding bar, and the total is 0 10 further broken down into the time taken to perform read-alignment (for
Sailfish KPKM(log –5 press Sailfish, we instead measured the time taken to count the k-mers in the Sailfish RSEM Sailfish RSEM eX Cufflinks eXpressCufflinks –5 0 5 10 read set) and the time taken to quantify abundance given the aligned SRX016366 Synthetic reads (or k-mer counts). All tools were run in multithreaded mode (where True FPKM(log2) applicable) and were allowed to use up to 16 threads. (d) Accuracy of each d of the methods on human brain tissue and a synthetic data set. Accuracy Human brain tissue Synthetic is measured by the Pearson (log-transformed) and Spearman correlation Sailfish RSEM eXpress CufflinksSailfish RSEM eXpress Cufflinks coefficients between estimated abundance values and MAQC qPCR data (for Pearson 0.85 0.82 0.85 0.85 0.96 0.96 0.95 0.94 human brain tissue) or ground truth (for simulated data). Root-mean-square Spearman 0.84 0.80 0.85 0.85 0.76 0.77 0.77 0.75 RMSE – – – – 6.06 8.90 8.83 10.05 error (RMSE) and median percentage error (medPE) are calculated medPE – – – – 6.50 12.48 14.06 12.42 as described in the Online Methods.
NATURE BIOTECHNOLOGY VOLUME 32 NUMBER 5 MAY 2014 463 Recap: Sailfish BRIEF COMMUNICATIONS 1) Indexing Parse reference transcriptome into kmers Figure •1 Overview of the Sailfish pipeline. (a,b) Sailfish consists of an Reference transcripts indexing phase (a) that is invoked by the command ‘sailfish index’ and a quantification phase (b) invoked by the command ‘sailfish quant’. All unique kmers are hashed and counted Read data The Sailfish• index has four components: (1) a perfect hash function mapping a Index each k-mer in the transcript set to a unique integer between 0 and N−1, (per transcriptome & Perfect hash function (1) where N is the number of unique k-mers in the set of transcripts; (2) an array choice of k) Array of k-mer counts (2) recording• theTwo number indices of times each for k-mer bi-directional occurs in the reference mapping set; (3) an of index mapping each transcript to the multiset of k-mers that it contains; Transcript k-mer transcripts and kmers to contained to containing (4) an index mapping each k-mer to the set of transcripts in which it k-mers transcripts appears. The quantification phase consists of counting the indexed k-mers (3) (4) in the set of reads and then applying an EM procedure to determine the maximum-likelihood estimates of relative transcript abundance. K-mer count b Quant assignments are illustrated by vertical gray bars on lines representing known Intersect reads with Unhashable (per set of reads) transcripts; the horizontal lines intersecting the gray bars represent the indexed k-mers k-mers average of the current k-mer count assignments for each transcript. 2) Quantification Hashable k-mers • Count kmers in read set may be more affected by errors in the reads (Supplementary Fig. 1). Aggregate counts Further, many of our data structures can be represented as arrays of atomic• Use integers EM (Online procedure Methods). Thisto estimate allows our software transcript to Estimate transcript be concurrent and lock-free where possible, leading to an approach abundances from abundances, repeating as necessary allocated k-mers that scales well with the number of available CPUs (Supplementary Fig. 2). Additional benefits of the Sailfish approach are discussed in
Supplementary Note 1. … … Sailfish works in two phases: indexing and quantification (Fig. 1). A Sailfish index is built from a particular set of reference transcripts Transcripts with Reallocate Average k-mer current k-mer k-mers based coverage (a FASTA sequence file) and a specific choice of k-mer length, k. allocations on abundance The index consists of data structures that make counting k-mers in estimates a set!10 of reads and resolvingSailfish, their potential Salmon, origin in the and set ofKallisto tran- all operate on a per-sample basis scripts efficient (Online Methods). The most important data struc- needs to be rebuilt only when the set of reference transcripts or the ture in the index is the minimal perfect hash function9 that maps choice of k changes. each k-mer in the reference transcripts to an integer identifier such The quantification phase of Sailfish takes as input the above that no two k-mers share an identifier. Pairing the minimum perfect index and a set of RNA-seq reads and produces an estimate of the hash function with an atomically updateable array of k-mer counts relative abundance of each transcript in the reference, measured allows k-mers to be counted even faster than with existing advanced in reads per kilobase per million mapped reads (RPKM), k-mers lock-free hashes such as that used in Jellyfish10. The index also con- per kilobase per million mapped k-mers (KPKM) and transcripts tains a pair of look-up tables that allow fast access to the k-mers per million (TPM) (see Online Methods for the definitions of these appearing in a specific transcript as well as the transcripts in which a measures). First, Sailfish counts the number of times each indexed particular k-mer appears, both in amortized constant time. The index k-mer occurs in the set of reads. Owing to the index, this process is efficient and scalable (Supplementary Fig. 2). Sailfish then applies an EM procedure to determine maximum likelihood estimates a c 4
) 8 10 Alignment 2 Figure 2 Speed and accuracy of Sailfish. (a) The correlation between qPCR Quantification 6 estimates of gene abundance (x axis) and the estimates of Sailfish. The 4 qPCR results are taken from the microarray quality control study (MAQC)15. 2 18.55 h 0 103 The results shown here are for the human brain tissue, and the RNA-seq– 10.19 h 7.70 h –2 6.03 h 6.96 h based estimates were computed using the reads from SRA accession –4
Sailfish KPKM(log SRX016366 (81,250,481 35bp single-end reads). The set of transcripts –6 2.27 h used in this experiment were the curated RefSeq20 transcripts (accession 0 2 4 2 –8 –6 –4 –2 10 –10 prefix NM) from hg18 (31,148 transcripts). (b) The correlation between qPCR(log2) Time (min) the ground truth FPKM in a simulated data set (x axis) and the abundance b 0.26 h estimates of Sailfish. The quantification in this experiment was performed ) 2 10 1 on a set of 96,520 transcript sequences taken from Ensembl21 GRCh37.73. 10 0.09 h (c) The total time taken by each method, Sailfish, RSEM, eXpress and 5 Cufflinks, to estimate isoform abundance on each data set. The total time 0 taken by a method is the height of the corresponding bar, and the total is 0 10 further broken down into the time taken to perform read-alignment (for
Sailfish KPKM(log –5 press Sailfish, we instead measured the time taken to count the k-mers in the Sailfish RSEM Sailfish RSEM eX Cufflinks eXpressCufflinks –5 0 5 10 read set) and the time taken to quantify abundance given the aligned SRX016366 Synthetic reads (or k-mer counts). All tools were run in multithreaded mode (where True FPKM(log2) applicable) and were allowed to use up to 16 threads. (d) Accuracy of each d of the methods on human brain tissue and a synthetic data set. Accuracy Human brain tissue Synthetic is measured by the Pearson (log-transformed) and Spearman correlation Sailfish RSEM eXpress CufflinksSailfish RSEM eXpress Cufflinks coefficients between estimated abundance values and MAQC qPCR data (for Pearson 0.85 0.82 0.85 0.85 0.96 0.96 0.95 0.94 human brain tissue) or ground truth (for simulated data). Root-mean-square Spearman 0.84 0.80 0.85 0.85 0.76 0.77 0.77 0.75 RMSE – – – – 6.06 8.90 8.83 10.05 error (RMSE) and median percentage error (medPE) are calculated medPE – – – – 6.50 12.48 14.06 12.42 as described in the Online Methods.
NATURE BIOTECHNOLOGY VOLUME 32 NUMBER 5 MAY 2014 463 The Big Data Problem
• There’s an abundance of underutilized sequence data • Alignment can solve a number of important biological questions but doesn’t scale
• Alignment-free methods are more efficient but are not several orders of magnitude more efficient The Big Data Problem
• There’s an abundance of underutilized sequence data • Alignment can solve a number of important biological questions but doesn’t scale
• Alignment-free methods are more efficient but are not several orders of magnitude more efficient
How can we address big data? The Big Data Problem
• There’s an abundance of underutilized sequence data • Alignment can solve a number of important biological questions but doesn’t scale
• Alignment-free methods are more efficient but are not several orders of magnitude more efficient
How can we address big data?
Sketching! Sketching algorithms trade accuracy for speed
Given a box, it’s easy to discover what’s inside
!12 Sketching algorithms trade accuracy for speed
Given a box, it’s easy to discover what’s inside But what is there’s too many boxes?
!12 Sketching algorithms trade accuracy for speed
Given a box, it’s easy to discover what’s inside But what is there’s too many boxes?
A sketch solution would organize boxes based on labels, simpler observations such as size or weight, or sub-sample boxes to learn about its neighbors !12 TradeoReview:ffs in Similarity Similarity metrics Metrics • Hamming distance – Count the number of substitutions to transform one string into another MIKESCHATZ ||X||XXXX| MICESHATZZ 5 • Edit distance – The minimum number of substitutions, insertions, or deletions to transform one string into another
MIKESCHAT-Z ||X||X|||X| MICES-HATZZ 3
See Lecture 10: RNA Sequencing TradeoReview:ffs in Similarity Similarity metrics Metrics • Hamming distance – Count the number of substitutions to transform one string into another MIKESCHATZ ||X||XXXX| Much faster to calculate MICESHATZZ 5 • Edit distance – The minimum number of substitutions, insertions, or deletions to transform one string into another
MIKESCHAT-Z ||X||X|||X| MICES-HATZZ 3
See Lecture 10: RNA Sequencing TradeoReview:ffs in Similarity Similarity metrics Metrics • Hamming distance – Count the number of substitutions to transform one string into another MIKESCHATZ ||X||XXXX| Much faster to calculate MICESHATZZ 5 • Edit distance – The minimum number of substitutions, insertions, or deletions to transform one string into another
MIKESCHAT-Z ||X||X|||X| More biologically meaningful MICES-HATZZ 3
See Lecture 10: RNA Sequencing TradeoReview:ffs in Similarity Similarity metrics Metrics • Hamming distance – Count the number of substitutions to transform one string into another MIKESCHATZ ||X||XXXX| Much faster to calculate MICESHATZZ 5 • Edit distance – The minimum number of substitutions, insertions, or deletions to transform one string into another
MIKESCHAT-Z ||X||X|||X| More biologically meaningful MICES-HATZZ 3 Can you think of other tradeoffs in See Lecture 10: RNA Sequencing alignment methods? Solving Big Data through Sketching
Experiment Discovery I have sample X, find me related samples
Containment Query I am interested in transcript X, find me experiments expressing it
Cardinality Estimation
I have N studies, how many unique sequences are present?
Expression Estimation I am interested in the global expression patterns of transcriptome X in N studies
!14 Solving Big Data through Sketching
Experiment Discovery Mash [Minhash] I have sample X, find me related samples
Containment Query I am interested in transcript X, find me experiments expressing it
Cardinality Estimation
I have N studies, how many unique sequences are present?
Expression Estimation I am interested in the global expression patterns of transcriptome X in N studies
!14 Solving Big Data through Sketching
Experiment Discovery Mash [Minhash] I have sample X, find me related samples
Containment Query Sequence Bloom Trees [Bloom Filter] I am interested in transcript X, find me experiments expressing it
Cardinality Estimation
I have N studies, how many unique sequences are present?
Expression Estimation I am interested in the global expression patterns of transcriptome X in N studies
!14 Solving Big Data through Sketching
Experiment Discovery Mash [Minhash] I have sample X, find me related samples
Containment Query Sequence Bloom Trees [Bloom Filter] I am interested in transcript X, find me experiments expressing it
Dashing [HyperLogLog] Cardinality Estimation I have N studies, how many unique sequences are present?
Expression Estimation I am interested in the global expression patterns of transcriptome X in N studies
!14 Solving Big Data through Sketching
Experiment Discovery Mash [Minhash] I have sample X, find me related samples
Containment Query Sequence Bloom Trees [Bloom Filter] I am interested in transcript X, find me experiments expressing it
Dashing [HyperLogLog] Cardinality Estimation I have N studies, how many unique sequences are present?
Salmon / Kallisto Expression Estimation I am interested in the global expression patterns of transcriptome X in N studies
!14 Solving Big Data through Sketching
Experiment Discovery Mash [Minhash] I have sample X, find me related samples
Containment Query Sequence Bloom Trees [Bloom Filter] I am interested in transcript X, find me experiments expressing it
Dashing [HyperLogLog] Cardinality Estimation I have N studies, how many unique sequences are present?
Salmon / Kallisto Expression Estimation I am interested in the global expression patterns of transcriptome X in N studies
Mantis [Counting Quotient Filter]
!14 © 2015 Nature America, Inc. All rights reserved. computed to determine the overlap offset (0) for is positions their in difference median the and sequences original the in located are case) this in CCG and (ACC min-mers shared the sketches, accurate estimates. ( the error bound controlled by S ARTICLES 624 fraction of entries shared between the sketches of two sequences share the same minimum fingerprints (underlined) for than the set of all of the ordered set of its the min-mer for that hash. ( The generated using an XORShift pseudo-random number generator ( ( H of hash functions determines the resulting sketch size number The functions. hash multiple by fingerprints integer to converted resulting in 12 the sequence into its constituent ( Figure 1 a For sketches. using without reads two between similarity Jaccard the measures exactly that approach a direct with MHAP overlapping 2 Figure sensitivity. 80% over achieve to required is 16-mers ~150 of only a sketch rate, error a 15% with genome human the to reads kbp 10 mapping For overlapping. than easier is genome the in reference a to additive reads high-error mapping roughly reads, two is the of rate error alignment an of rate error the because ( estimate Jaccard the of error expected the reduces which size, sketch the increasing by improved 1 Notes rate of 30%, so MHAP uses 10 kbp reads simulated from the human genome with an overlap error of value largest the use to preferable is it so found, matches false of number ( input overlap maintaining accuracy reads, acceptable while detection the than smaller magnitude of order an sketches use to possible Itis filter. the of sensitivity small. the reduces sketches min-mers keep fewer to using However, preferable is it so size, sketch the to tional storeindex,requiredcomparehash,andtimeto The filtering. alignment efficient for sketches MinHash uses MHAP MinHash alignment filtering RESULTS similarity. estimating for technique efficient putationally 1 Fig. shared of number the with correlated strongly is estimate resulting The sketches. their to be by estimated simply computing the Hamming between distance two of similarity Jaccard the allows hashing ity-sensitive a sequence makes the sketch ( for min-mers of collection The retained. is min-mer, or fingerprint, valued minimum the only function, hash each For functions. hash glesor for a DNA one must sequence, convert all of minimizers as a generalization viewed larity similarity document to applied a Fig. Fig. 2a 2 = 4, four independent hash sets are generated for each sequence sequence each for generated are sets hash independent four = 4, ) To create a MinHash sketch of a DNA sequence 1… (0.5) serves as an estimate of their true Jaccard similarity (0.22), with with (0.22), similarity Jaccard true their of estimate an as serves (0.5) The efficiency of MHAP improves with increased read length. length. read increased with improves MHAP of efficiency The from overlaps kbp 2 detect effectively can 16-mers Specifically,
k H -mer generating the minimum value for each hash is referred to as to referred is hash each for value minimum the generating -mer 23– ). In MHAP, after the initial hash ( ). Because the sketches are comparatively small, this is a com a is this small, comparatively are sketches the Because ). q
, 2 b k -grams) to integer fingerprints using multiple, randomized randomized multiple, using fingerprints integer to -grams) Rapid overlapping of noisy reads using MinHash sketches. sketches. MinHash using reads noisy of overlapping Rapid 5 c and ). For human, using a small value of that maintains sensitivity. maintains that and metagenomic clustering compares the total number of 2 k -mers each for and Online Methods). Sensitivity can be further further be can Sensitivity Methods). Online and k -mers. In this example, the sketches of k- e ) If sufficient similarity is detected between two two between detected is similarity sufficient ) If mers between two sequences ( sequences two between mers H min-mer fingerprints, which is much smaller smaller much is which fingerprints, min-mer c ) The sketch of a sequence is composed composed is a sequence of sketch ) The k H = 16 by default ( default by 16 = . In practice, S Fig. Fig. 1 Supplementary Fig. 1 Fig. Supplementary k 2 and and -mers. In the example shown, 0 , image similarity , image 1 Berlin et al (2015) Assembling large genomes with single-molecule sequencing andlocality-sensitive hashing and Online Methods). This local This Methods). Online and S 2 . ( 1 ), subsequent fingerprints are are fingerprints subsequent ), 2 b 6 k 2 H ) All . The approach can also be be also can approach The . 7 -mers (also known as, shin . Briefly, to create . a to Briefly, create sketch >> 4 is required to obtain >> obtain to required 4 is k Fig. Fig. 2 S k -mers counted during during counted -mers k 1 (e.g., 10) increases the -mers are then then are -mers S and and , we first decomposed decomposed first , we ( in the be estimatedbytheoverlap 4) TheJaccard similaritycan each hashfunctionis chosen 3) Thesmallestvalues for Nature Biotechnology ( 2) Multiplehashfunctions into 1) Sequencedecomposed Minhash) b H k 2 Γ 1 2 , -mers is propor is -mers . Here, where where . Here, S Supplementary Supplementary , sequence simi and and )mapkmerstovalues. Supplementary Supplementary 2 ). Additionally, Additionally, ). . S kmers 1 Min and and k 2 -mer sets sets -mer . ( k S = 3, 2… S d imum The MinhashSketch 1 ) The ) The 2 and and
H
). ).
- - - - -
Hash trols how many alignments are reported for each read. The HGAP sensitivityis primarily affected by the balanced that ( accuracy with chosen speed were sensitive) and (fast settings parameter ( sweep parameter the on Based threshold. similarity Jaccard the and Methods). Online and magnitude faster than BLASR at all levels of sensitivity sensitivity of levels all at BLASR than ( faster magnitude of order an and it tested genomes all alignments; across pairs accurate possible consistently was all all considers overlapping MHAP for contrast, suited In reads. ideally mapping of not for is and designed reference a originally to was BLASR BWA-MEM, Like assembler sets sets assembler ( 2 Tables chemistries Supplementary sequencing and settings parameter multiple using evaluated were tools the and genomes, reference to reads ping comparing by map from inferred were overlaps,which true overlapsto evaluated detected was DALIGNER and BLASR MHAP, of overlaps noisy all between pairs of reads ( detect reliably not did algorithms these of versions BWA-MEM BLASR reads, SMRT for designed tools other overlap of MHAP two per.sensitive versus and Wespecificity sensitivity the evaluated highly a also is MHAP fast, being to addition In MHAP overlapping performance technologies. sequencing long-read future of accuracy and length read increasing the with improve to expected is 1 Note for read length; of increasing decreases reads (which plexity is governed only by the sketch size (a constant) and the number com the because length, read increasing with rapidly MHAP decays by performed comparisons min-mer of number relative the length, overlap 20% a and minimum sequenced, bases of total number fixed long-read overlaps. Although developed for the Dazzler assembler, assembler, Dazzler the for developed Although overlaps. long-read ( coverage replicon uneven and complexity sequence by affected and BLASR genome-dependent genomes. highly was sensitivity and repetitive runtime for overlaps missed in result can this Supplementary Table 2 Supplementary Supplementary Figs. 2 Figs. Supplementary MHAP sensitivity is tunable based on the size of Like MHAP, DALIGNER utilizes efficient VOLUME 33 VOLUME es and and e d c b a Supplementary Table 1 Supplementary 3 0 , SNAP 18 20 33 24 22 33 40 58 14 19 [
54 5 S bestn 1 5 1 , NUMBER 6 Sketch : 13 28 48 28 23 37 57 14
1 84 36 75 1 2 ,
equal to the depth of sequencing coverage, but but coverage, sequencing of depth the to equal 3
1 2 Table Table 56 11 60 47 11 16 36 57 ( and RazerS and , S 24 72 04 2 3 1 )
and ) and empirical assembly tests, two MHAP two tests, assembly empirical and ) 15 Supplementary Note 3 Supplementary S S and 39 54 43 26 54 15 19 36 61 ] 2 1
1 6 5 4 J : : 1 3 ( JUNE 2015 S and 3 ). 1 , min-mers and S 2 ) 3 Supplementary Note 2 Note Supplementary ≈ 2 ). Thus, the efficiency of MHAP of efficiency the Thus, ).
were also evaluated, but current were evaluated, also 2/4 =0.5 Supplementary Figs. 2 2 Figs. Supplementary bestn
NATURE NATURE BIOTECHNOLOGY 27 54 13 35 24 22 49 44 11 18 36 k 5 [ 1 parameter, which con 5 -mer matching to detect , Sketch 2 33 56 30 48 44 27 54 13 19 8
64 75 1 1 2 and DALIGNER
, ). ). The performance